143

I'm pretty sure this is a stupid idea but I'd like to know why, so bear with me for a moment.
Lots of the work backend developers do is providing CRUD access to customers via HTTP, essentially mapping data from and to the internal database. Customers authorize to the web service using some sort of credentials via an encrypted connection, the web service validates data and performs queries against the backend database, then returns the result to the client.

All in all, this is merely a worse way to interact with the database directly: Almost nobody fully implements the REST specification, and sooner or later you always end up with home-cooked generic filtering, sorting or pagination - while SQL supports all of this already.

That got me wondering: Why not give customers access to the database by exposing the SQL port, skipping the HTTP API entirely? This has lots of advantages:

  • Clients must encrypt connections using a client certificate
  • We can use the access control built into the server or just use shard databases per customer
  • (My-)SQL permissions are pretty fine-grained, so I'd wager there shouldn't be any obvious security issues
  • Performance should be way better, since we skip the entire HTTP communication and web app code
  • New features are a matter of database migrations, everything is reflected in the schema
  • Powerful query capabilities are provided to users, without any additional effort

The downsides seem to include being unable to support multiple schema versions, even though I think careful deprecations (and client SDKs, maybe) should make the impact minimal.

As nobody seems to do this, there must be a security risk I'm overlooking. Why can't we provide public SQL access to our customers? What could possibly go wrong? (Please keep in mind that this is just a thought experiment born out of curiosity)

schroeder
  • 125,553
  • 55
  • 289
  • 326
Moritz Friedrich
  • 1,465
  • 2
  • 10
  • 10
  • 1
    I'm not sure I understand your question, but I'd say it's a matter of design. Your idea might work in some simple cases, but in more complex applications it would soon become a mess (ugly or infeasible). – reed Apr 17 '20 at 13:46
  • 1
    Hi Moritz. I've edited your question title to be an actual question. If you don't like my change though you are welcome to roll it back or edit it yourself! – Conor Mancone Apr 17 '20 at 14:46
  • 12
    For reference, there are actually many cases where people do exactly this, although it is much more common in older software. There have been cases where I had to open up direct DB access because that was the only kind of integration that some critical and older business software supported. Usually in that case I wouldn't give full database access, or even access to tables my application used. Instead I kept dedicated tables for the applicaton's consumption that only populated when I needed to send data, and other tables for it to write to which I would check regularly for changes. – Conor Mancone Apr 17 '20 at 14:48
  • Typically that was done via an ODBC connection, so that may be worth a read (although ironically ODBC itself was effectively an application layer between the integrating application and the database, it's just that ODBC would interact directly with the database over the internet) – Conor Mancone Apr 17 '20 at 14:49
  • for databases that do not have row-level permissions, you can allow them to access only views that query rows they are allowed to see based on their credentials – lurscher Apr 18 '20 at 07:36
  • I think you are asking a good question. So I get this, SQL <-> https <-> (rest/api)user. I think the issue could be sql does not have the security. Thus the web server can only talk to the unprotected data base. This could also be efficient. – marshal craft Apr 18 '20 at 08:35
  • Also I don't know how much sql injections are a thing today, but by having the web server query the data base instead of the client, it reduces the possibility of unpredicted behavior, which some security vulnerabilities take advantage of. The rest api still alows a great degree of user programmability, but while removing the possibility of sql injections. That said I have seen website js which apparently did formulate sql requests. Some sites store client html and code on sql servers, which could be driven by desire to reduce costs or complexity maybe? – marshal craft Apr 18 '20 at 10:34
  • In some controlled, trusted environments a “valet key” approach” does allow for partial/scoped access to APIs. Azure Table, based in part on OData, CosmosDB, and others support this as an alternative to OAuth calls by websites and apps. – makerofthings7 Apr 18 '20 at 22:12
  • 1
    Also consider CQRS in place of CRUD. I’m pursuing CQRS because of the different psychological effects it has when talking about data. (Crud this, crud that.. carries an unnecessary weight with it) – makerofthings7 Apr 18 '20 at 22:15
  • 4
    [GraphQL](https://graphql.org/) addresses many of these problems. – Schwern Apr 18 '20 at 22:36
  • Because it's never CRUD, that's why. With time you start writing more and more stored procedures and views, which becomes unmanageable, because this monolith depends on RDBMS, which is optimized for data querying and update, and is a bad replacement for any major language's ecosystem (libs, dev tools, vms, etc). – Askar Kalykov Apr 19 '20 at 07:35
  • The better maintainability gained by using an application layer is usually reason enough. Still, it's good to discuss the security aspects as well. – Blueriver Apr 20 '20 at 19:36
  • First question: HOW does the client connect to the server that hosts the database service That client's will need a DBC driver installed on their computer to make the connection. Remote client that can't be connected via tunnelling connection, then the connection can NOT be successful A remote client can't use DBC driver over http (not compatible) Just to re-interate: Security IS the biggest concern as it guards the data. Software engineering practices create quality software, that guards the data. Data is King! Quality is Queen! The King loves the Queen The Queen loves the King – GregJF May 06 '20 at 03:40
  • `"Performance should be way better, since we skip the entire HTTP communication and web app code"` how did you come up that directly communicating in the database removes HTTP communication? If this is some sort of embedded database, the question wouldn't still make a valid point since it was talking about HTTP. – mr5 May 11 '20 at 13:05

24 Answers24

141

TL,DR: Don't.

(My-)SQL permissions are pretty fine-grained, so I'd wager there shouldn't be any obvious security issues

Even with permission on the record level, it does not scale easy. If a user has irrestricted SELECT on a table, they can select any record on that table, even those not belonging to them. A salary table would be a bad one. If any user has DELETE or UPDATE, they may forget the WHERE clause, and there goes your table. It happens even to DBAs, so why would it not happen to a user?

Performance should be way better, since we skip the entire HTTP communication and web app code

And you throw away all security, auditing, filtering and really fine grained permission control from using an application to validate, filter, grant and deny access. And usually most of the time spent on a transaction is the database processing the query. Application code is less than that, and you will not remove the HTTP communication, you just replace it with SQL communication.

New features are a matter of database migrations, everything is reflected in the schema

That's why so many people use a "spreadsheet as a database." And it's a nightmare when you need to reconcile data from multiple sources.

Powerful query capabilities are provided to users, without any additional effort

It's like putting a powerful engine on a skeleton chassis, bolting on a seat, and taking it to a race. There's no extra weight slowing the car down, so it's very fast!

It's the same here. Sure, it's fast and powerful, but without security measures provided by the application, no session, record-level access control, "users do what they are allowed to", or auditing.

One of the most common vulnerabilities on web applications is SQL Injection, and you are giving a SQL console to your users. You are giving them a wide variety of guns, lots of bullets, and your foot, your hand, your head... And some of them don't like you.

ThoriumBR
  • 51,983
  • 13
  • 131
  • 149
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/106981/discussion-on-answer-by-thoriumbr-why-cant-i-just-let-customers-connect-directl). – Rory Alsop Apr 21 '20 at 12:17
  • 7
    Though question is tagged with `mysql`, it doesn't look like MySQL specific, so let me add some notes. [PostgreSQL](https://www.postgresql.org/docs/9.5/ddl-rowsecurity.html), [SQL Server](https://www.mssqltips.com/sqlservertip/4028/sql-server-2016-row-level-security-example/) and [Oracle](https://mwidlake.wordpress.com/2012/11/15/row-level-security-part-1/) do have row-level (record-level) security, so they seem to fit much better for this kind of task. Also [views can be used](https://stackoverflow.com/a/5527161/4862360) for that with MySQL. – Aleksander Ryhlitski May 07 '20 at 08:53
68

Interesting question. In theory, this can be done securely. MS-SQL can secure the connection with encryption, authenticate the user, and provides fine-grained permissions and other security features like auditing.

In fact, it used to be common in intranet environments that thick clients would access a database directly, so the database security controls were the primary security mechanism. This tended to be done badly, for example, all users connecting as admin with a password hardcoded in the app. But it can be done well.

A major problem is privilege escalation flaws. The database API is extremely complex presenting a huge attack surface, and the protocols are designed for speed, as well as being old and not Internet hardened. Oracle, for example, has had hundreds of privilege escalation flaws. However, MS-SQL is one of the better databases in this regard. You can also reduce the attack surface by locking down user permissions.

Architecturally, exposing an interface that allows generic queries, and applies security restrictions, makes a great deal of sense. To some extent people are reinventing the wheel as REST APIs gain features like custom queries.

Whether you can do this depends largely on the relationship with your users. If these are paying customers with a contractual relationship, so to some degree more trusted than a random Internet user, it may be appropriate to use this architecture. Especially if different clients are siloed on separate databases. Tread with great care. It's the kind of thing where if you were to suffer a breach, you could be criticised for this, despite having consider the risks and benefits carefully. If you're running a web scale service with anonymous sign-up, I would avoid this. Although it's worth noting that most cloud platform providers do offer features where they expose database ports to clients.

paj28
  • 32,906
  • 8
  • 93
  • 130
  • 3
    Just realised you said MySQL not MS-SQL but most of this stands – paj28 Apr 17 '20 at 14:27
  • 3
    It should be notes that an API can be provided by wrapping the actual tables in views and using update triggers and stored procedures for updates. That way if the underlying schema changes, the definitions can be updated so that old queries still work while new views are added that include the new functionality. Whether it's better than a separate service on top of the database depends on relative complexity of the queries and high-level operations. – Jan Hudec Apr 18 '20 at 20:47
  • 2
    @JanHudec but realistically, if you give users the ability to write their own queries, you have zero chance of making sure all "old queries" will work, because you have no idea what they are. People can do very stupid stuff. – Nelson Apr 19 '20 at 05:26
  • 1
    @Nelson, if you only give the users access to *views*, then you can keep the schema they see the same by adjusting definition of the views to any changes in the underlying tables which the user does not see. It is also not true you have no idea what the queries are, because you see the slow ones in the slow query log and can enable the full query log if you need to check them. – Jan Hudec Apr 19 '20 at 11:17
  • You say it "used to be common in intranet environments". It is the case in the intranet environment at my employer (but certainly not in whatever is exposed to the internet). Is this (now) uncommon? I can send read-only queries to those databases to which I have been granted access (Oracle database). – gerrit Apr 19 '20 at 19:07
  • 1
    @gerrit - I don't have any figures on this. Informally it's become less common. Partly because thick clients may use a web service back-end - Microsoft has been pushing that for 20 years now. But moreover, internal apps are now often web apps. Certainly you still get some direct database thick clients. I changed job a few years ago so haven't done any internal tests for 3 years now. – paj28 Apr 19 '20 at 21:28
  • 1
    How much can you *really* trust your customers, though? Even if you can trust them not to screw you themselves, can you trust them **not** to make their password "Password123"? Or write it down on a sticky if the system forces it to be something more complex? Just because someone logs in as a trusted customer doesn't mean they ARE that trusted customer. – Steve-O Apr 20 '20 at 13:43
  • @Steve-O - That's a good point! A common issue, not unique to this. In part you enforce some minimums, like password strength or MFA. Not perfect as you point out. So the main thing is to enforce separation. A careless customer can let themselves get compromised, but they can't affect other customers. That's exactly what cloud providers are doing when offering DBaaS. Perhaps I should have said "only if siloed" not especially, but I'm gonna leave it. Interesting angle, thanks. – paj28 Apr 20 '20 at 14:18
58

I've built both RESTful interfaces and provided customers with direct SQL access.

The problem here is that the question is fundamentally flawed:

Lots of the work backend developers do is providing CRUD access to customers via HTTP, essentially mapping data from and to the internal database.

This is not, in my experience, a significant portion of what I do. Let's simplify it to 4 tasks RE data access:

  1. Validate incoming Data.
  2. Authentication, Authorization, and Logging.
  3. Expose a limited set of functionality.
  4. Provide an API that is sensible for the user.

DB's typically do not provide tools to do this as needed by these tasks. For example, I might want to:

  1. Validate incoming data using an external service.
  2. Use OAuth to provide Authentication, and roles to provide access to specific rows. I then have specific logs I want to write based on roles/data access.
  3. I might want to only expose certain reporting (e.g. for performance (think DoS) or business reasons).
  4. SQL is not the format that most of my customers want most of the time.

While I'm sure there is some DB with some feature for each one of these scenarios... typically most DB's won't support most of these scenarios because they are databases and as such not designed to handle business logic.

All that being said, there are scenarios where clients want database level access - in which case you will find solutions that do provide direct access. There's nothing prohibiting this from occurring - it's just not typical.

JoshMc
  • 103
  • 2
NPSF3000
  • 680
  • 4
  • 6
  • 2
    All the major databases have stored procedures where you can program basically any business logic you need. Main advantage is that they run inside transactions; doing transactions over rest api tends to be ugly. But you are of course still more limited there (only PostgreSQL has decent choice of languages that you can run from stored procedures). – Jan Hudec Apr 18 '20 at 20:57
  • 9
    @JanHudec While stored procedures and views are a good supplement to an application, isn't replacing the REST API with SQL queries + stored procedures just pushing the work down a layer? Instead of writing REST endpoints for all the business logic, you write stored procedures for all the business logic. But now your users can also run any additional queries they want; slow, blocking queries and whoops, somebody forgot a where clause on their update. – Schwern Apr 18 '20 at 22:35
  • 1
    @Schwern, no; layer is defined by what logic it contains, so it is just implementing the layer using different technology. It does not (have to) allow the user any additional queries or risky updates, because you only grant them permission to read via views, call the stored procedures and update either also via stored procedures or views with update triggers that implement the checks. The layer should still exist, only it is now implemented in the database engine rather than as a separate service. It depends on which kind of API is better to express the required operations. – Jan Hudec Apr 18 '20 at 22:52
  • 1
    @Schwern, and note that stored procedures are absolutely *horrible* thing to *supplement* an application, because now you've created an additional layer and spread the logic over the two layers and the result is almost certainly harder to maintain. Either put all logic in application and have a dumb database under its full control, or put the logic in the database and access it directly (with lightweight service for things that can't do SQL directly like web apps). – Jan Hudec Apr 18 '20 at 22:57
  • 1
    @JanHudec "All the major databases have stored procedures where you can program basically any business logic you need". While you *could* do this, the question is whether or not this makes any sense for your application. For example, many backends contain *multiple* different database technologies, work with various web services, utilizes dozens/hundreds of frameworks and libraries, sometimes do computationally intensive tasks, and need to deal with scaling. It's not necessarily an issue of 'can' but more 'should. – NPSF3000 Apr 19 '20 at 00:03
  • 2
    @NPSF3000, for some applications it makes sense, for others it does not. For applications that are largely centred around the data it often does make sense to put logic that ensures consistency into the database and then the components that take care of various part of the workflow (reports, imports, exports etc.) independently connect to the database (end e.g. let the customer implement some of them themselves). For others it makes more sense to have a front-end service that everything else goes through. So I agree it is an issue of should, but there are cases where the answer is yes. – Jan Hudec Apr 19 '20 at 17:43
  • 1
    @NPSF3000, the frameworks are available in the better database engines too. Postgresql allows basically any major language, with any library, for implementing stored procedures, and IIRC ms sql server allows all of .net and oracle allows all of java. Scaling is indeed more problematic, but being able to run inside transaction is a big benefit for ensuring consistency. So there are use-cases for both approaches. – Jan Hudec Apr 19 '20 at 17:47
  • 1
    @JanHudec "Scaling is indeed more problematic" It sure is. For example, I've worked on systems that had *hundreds* of servers doing business layer work - scaling up and down to handle the load. For example, I've had events where we've *doubled* our high demand load within minutes. How would you try to architect that with say Postgres? Why would you try? "but being able to run inside transaction is a big benefit for ensuring consistency." Not really... I could write all my code in transactions today... and I don't - it's a solution looking for a problem. – NPSF3000 Apr 19 '20 at 20:48
  • 2
    @NPSF3000, you have a specific kind of applications in mind, one for which putting logic in stored procedures is wrong. But not all applications are like that. A system where primary function is bookkeeping (accounting, ERP and similar) does not need to scale, because the number of people using it is limited, but has complex constraints and complex operations (import something, match it with various other records and only if it matches, commit) that are more easily expressed using the database tools. Otherwise you end up just wrapping heaps of queries as in the question. – Jan Hudec Apr 19 '20 at 21:07
  • @JanHudec sure, that might be an example where someone would want to use a database. – NPSF3000 Apr 19 '20 at 21:18
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/106926/discussion-between-npsf3000-and-jan-hudec). – NPSF3000 Apr 19 '20 at 23:15
37

Performance

You say that performance should be "way better", except that now you've just given malicious actors complete authority to wreck the performance of your DB. Of course, they have to authenticate, but the "malicious" actor could also be a "naive, incompetent" legitimate user. What are you going to do when users start running outer joins on all the tables they can find, with where clauses on every non-indexed field in your DB, and computed fields which are computationally expensive? Unless your DB is trivially small, you are exposed to this risk.

My guess is that your DB is trivially small, because one of the big things that a webapp fronting a DB should be doing is caching the most common results. Not every service can do this, because some are designed explicitly to give read/write access with full consistency. But many are readonly and can tolerate some latency w.r.t. update consistency. These services can be literally thousands of times faster than direct DB access if they use in-memory caches like mecached, redis, etc.

Validation

Unless you have Update triggers on every table which requires some kind of business rule validation, direct access is a good way to wreck your integrity. Oh, that's a Zip Code field that someone just wrote alpha characters into? No problem. The phone number field contains alphas? Fine. Currency field contains commas and semicolons? Maybe someone is trying to give themselves a free bonus with some logic hacking. Do you really trust EVERY SINGLE USER to perform the same level of validation as your webapp? You should quit coding and become a priest, because the level of your faith is unbelievable.

Maintenance

Sometimes you need to take your DB offline to do major maintenance. A caching webapp can at least continue to service reads while this happens, but direct access screws over the entire user community. Sometimes you want to migrate the DB to a beefier server. What's that? You're having trouble getting all your users to switch their connect strings at the same time? Sometimes you want to switch to clustering. Oh, those hard-coded connect strings are really biting you in the @$$ now, aren't they? Did security just ask you to switch ports because they updated the firewall rules? Hmm...time to notify all the customers that their connect strings need to be updated.

Conclusion

If you intend to never have more than a handful of customers or more than a few thousand rows, and you are sure your DB/app will never scale beyond this toy size, then direct access is Just Fine(TM). If, on the other hand, your DB may one day outgrow its current incarnation, or you want to do a major migration that involves restructuring, rescaling, or rehoming, then you will thank your lucky stars that you have that layer of indirection to save your bacon and bring all the goodness of a scalable, high-performance solution.

Lawnmower Man
  • 471
  • 3
  • 3
  • 4
    Whilst not necessarily the case for a traditional server database, if you use a one of the SaaS big data databases like BigQuery or Aurora, that malicious performance issue can turn into an ability to run up large bills to your company extremely quickly. I once came across an unexpected $2000 bill, because the internal client was naive about the storage of the data in question. If they had been malicious, they could have run up $20-30,000 before we noticed. – user1937198 Apr 17 '20 at 23:54
  • 1
    @user1937198 to be fair, at least as far as BigQuery is concerned, you can share data while letting customers use their own billing projects to pay for usage. BigQuery is a good way to share large scale SQL data - but to be clear that's very different from what OP is asking. – NPSF3000 Apr 18 '20 at 00:32
  • If you set it up that way. Its less common to think about that in a company which doesn't use internal billing, but where the internal client is far enough away from the operators not to think about costs – user1937198 Apr 18 '20 at 01:30
  • And it does depend on the exact pricing structure and mechanisms. I'm more familiar with AWS, and I definitely know there are a bunch of ways to shot yourself in the foot with secondary bills like S3 there. – user1937198 Apr 18 '20 at 01:37
  • 6
    Minor nitpick: I wouldn't call everyone who abuses the DB malicious or incompetent. Part of the point of separating front and back end devs is that a frontender doesn't *need* to master SQL or remember which fields are indexed (most of the time). Hand them the keys to the DB and that goes out the window. – Ruther Rendommeleigh Apr 18 '20 at 08:59
  • Caching is a big one for us. Our app was designed for LANs and connects directly to an on-premise SQL server, and even with < 100 users at a time, some of the bigger queries can bring the entire server to a halt when 10 users run them at the same time. Implementing a HTTP/SignalR layer has resulted in huge performance gains due to caching and not requiring the client to poll for changes. I can't even imagine trying to optimize for a web-scale service all with direct connections. – MrZander Apr 20 '20 at 18:13
  • client (100 user) directly connect to sql vs client (100 user) to web server to database, what will be the performance impact on SQL Server? – Kashif Faraz Nov 05 '22 at 11:24
  • @KashifFaraz that is unanswerable without seeing the queries being run and the size of the DB. If the clients are running small queries which touch hundreds of rows then SQL Server can probably run everything in-memory and the impact is negligible. If DB has a billion records and users are doing full table scans, then 100 of them can easily crush the server. – Lawnmower Man Nov 05 '22 at 19:15
  • assumed I have employee table with 1000 rows and I query select * from employee. and I have two types of architecture for same data application 1. Desktop app clients directly connect to sql and 2. Desktop App Clients https request to webserver and webserver request to database, in both cases I run same query select * from employee and also same number of clients (100 or more) . Now my question is that in which one is better for sql performance? in others words, different 100 clients 100 connections vs same 1 client (Webserver) 100 connections? Note: no using any type cache (redis) – Kashif Faraz Nov 08 '22 at 07:23
  • Assuming the employee records are not huge, this example is small enough to have minimal performance impact. Everything will reside in memory even on the SQL Server. The advantage of a web service is if multiple apps connect to the DB. When you do maintenance on the DB, the apps can still serve read-only requests, or it can transparently redirect to a clone of the DB. If apps connect directly to DB, every app needs to do this chore. – Lawnmower Man Nov 08 '22 at 21:16
  • Agreed with this one advantage of web service and it is not limited to this one. but my concern is only SQL DATABASE performance. Can you share any performance related comparison between two approaches I mentioned earlier. You may define any scenario which may impact on performance. I read some articles , one is : "When simultaneous client requests are made, application performance degrades rapidly due to the fact that clients necessitate separate connections and increased CPU memory. " https://medium.com/@paulndemo/2-and-3-tier-architecture-4a473e5ced3d – Kashif Faraz Nov 09 '22 at 14:50
19

This could be a reasonable approach in certain circumstances:

  1. The customer gets read-only access.

  2. They get read access to an entire database: it's either quasi-public data to all your customers, or it contains only their own data. In particular it must not contain user PII or data that's otherwise subject to regulatory controls.

  3. You don't mind them reading it as much as they want, or making copies. For example if it leaks and becomes entirely public, it's no more than a bit annoying.

  4. They don't access the live production system, but rather a write-behind mirror or data warehouse.

  5. You have adequately considered and addressed the risk of sensitive or customer-specific data leaking into the warehouse.

  6. The system's technically isolated from your real production systems. I'd look at perhaps creating a Google BigQuery service with a mirror of your data, and granting access to that.

  7. You have a good way to manage access grants, including revocation, abuse detection, and including letting customers manage internal delegation of access granted to them. Again, outsourcing that to an IaaS provider like BQ's IAM is probably much easier than bringing it up yourself.

  8. The customer wants to do complex operations on the data that are easily expressed in SQL, and they know how to write SQL.

  9. Your exported schema is stable enough, or your customers are tolerant enough, that changing the schema and breaking their queries isn't a big problem.

These conditions aren't totally black-and-white, but direct access to a live database containing information from many users gets increasing risky in ways other answers have described.

A hypothetical scenario where this might be reasonable is: you have a complex parts catalog of items for sale. The information about what parts you have and what their prices are isn't commercially sensitive and you're not too worried about people keeping copies of it. And it's more complex than just a simple list: perhaps there are complex relations around pricing or which parts work together.

If all those conditions apply, then a starting point is to just provide a download of the data as CSVs or JSON. If you're not comfortable to do that, then giving SQL access probably isn't right either. However there are a couple of cases where granting access to BQ would be better than providing downloads:

  • There are so many tables that managing the imports will be annoying for customers.

  • The data is very large (TBs?) and user's queries read relatively little of it.

  • The data's exported frequently so, again, batch downloads will be hard to keep fresh.

  • You want to provide canned examples of interesting queries and to control the query engine.

  • Your customers are technical enough to write SQL but don't want the hassle of running their own import system and database.

  • The schema changes often enough that their import automation would break, but not in ways that break their queries.


A nice example of this pattern is this 3+TB dataset from GitHub on BigQuery. Although all this information is available through the GitHub API, some queries will be much easier or faster in SQL. Other datasets include political advertising on Google, and film locations in San Francisco.

poolie
  • 303
  • 1
  • 8
  • 2
    Good answer. It describes a number of conditions, all of which must be satisfied, and if _all_ are satisfied it _might conceivably_ be OK to give external actors relatively direct access to the database (though better options such as GraphQL may exist). In all other situations it's not a great idea. – Thomas W Apr 19 '20 at 22:07
  • "The customer wants to do complex operations on the data that are easily expressed in SQL, and they know how to write SQL." This is point one to my mind - Unless your users are advanced SQL users this won't work - and if they are, what are they paying you for anyway? – Dragonel Apr 20 '20 at 16:20
  • @Dragonel I know how to write SQL but I pay (lots!) of companies either for access to well-curated data, or for goods and services to which the data is ancillary. – poolie Apr 21 '20 at 04:17
13

Here’s a Bayesian kind of answer...

We collectively, as an industry, have about three decades experience designing 3-tier user-facing applications and have amassed a great body of knowledge on how to do it right. Deviating from this pattern isn’t necessarily wrong, as some of the other answers demonstrate, but you would be in sparsely travelled territory and at greater risk of making a fundamental mistake.

jl6
  • 625
  • 4
  • 9
  • 9
    Hence the question! I'm not assuming I'm smarter than everyone else but I'd like to figure out why this isn't a viable solution to the problem. – Moritz Friedrich Apr 18 '20 at 10:35
  • 3
    When you say "three tier" what exactly do you mean? I know this Is a common term, but people use it differently. – paj28 Apr 18 '20 at 11:10
  • 1
    I’m assuming 1. Client 2. Server 3. DB – Steven Lu Apr 19 '20 at 08:17
  • 1
    Disagree. Industry arrived at doctrine. No meritocracy. Industry driven fundamentally by credit bubbles. Radical difference in culture between web dev in certain industries and Dev in safety critical industries, eg control systems, flight, space, military, nuclear etc etc. 3 tier architecture has no benefit, merit. Only arbitrary doctrine. – Frank Apr 19 '20 at 20:45
8

tl;dr: Don't expose your DBMS to the public network, because cybercreeps love big attack surfaces.

We who run information systems are in a war with cybercreeps. They have every advantage against us: they're smart, they're highly motivated, and they only need to find one hole in our systems to pwn us.

We are defending our systems. We have to plug all the holes. A good starting point is to limit the number of holes.

Web servers are holes. A whole ton of work has been done, and is ongoing, on limiting the attack surface of web servers exposed to public networks. When cybercreeps come up with some new web server exploit, vendors typically push out patches quickly. And, those patches are not difficult for us to apply promptly.

A publicly exposed DBMS is also a hole. Sure, some of them have excellent column-, row-, view-, and table- granularity access control. So in theory it may be possible to allow public-net access while maintaining security.

But what happens if a cybercreep cooks up some kind of exploit against the DBMS?

  1. Being more complex than web servers, DBMS servers have more potential for buffer-overrun, privilege escalation, and other exploits. Every bit of functionality is a potential attack vector.
  2. Fewer DBMS exploits (than web server exploits) are found because most DBMSs are behind firewalls.
  3. If a cybercreep breaks in to your DBMS, they've pwned your data and your applications.
  4. It can be very hard indeed to apply a patch to a DBMS server to plug a hole.

Plus, if you expose your DBMS to a public network, your security auditors won't like it. Not at all. And for good reasons.

Please don't do this.

O. Jones
  • 369
  • 1
  • 5
7

It is sometimes done. Esp. when there is no better options and in (an attempt of) pretty controlled environment. Success rate is low.

  1. RDBMS are lagging in security behind HTTP servers big time. They are developed and optimized with different goals in mind. Their general use case faces much friendlier environment than the typical HTTP server. Even unintentionally exposing the listening DB port to the Internet is considered a security fault.

  2. Row-level access control is a thing, but it rarely fits into business logic of the database and the more normalized the database is, and the more complex your permission system is, the less it fits. It also has somehow hidden (from developer viewpoint) performance implications.

  3. Interoperability: Considering the profound mess in DB access protocols and their corresponding drivers, you rather don't want to limit your users to some development stack or platform. Everyone has HTTP or SOAP client available, your-choice-of-SQL-server client - are you sure? You may as well think about changing your database software. Migration from Oracle to MySQL or long-overdue upgrade from PostgreSQL 9.2 to 12 ? With HTTP interface you can do that without even informing your clients. Some downtime and few bugs later you are done.

  4. Security and network management tools (firewalls, proxies, load-balancers, etc...) working on HTTP are available and diverse. Good luck finding an intrusion detection system that understands TDS protocol.

fraxinus
  • 3,458
  • 6
  • 20
6

You ask the question

Why can't I just let customers connect directly to my database?

And the answer really hinges on who you mean by customers in this context. Random people on the Internet? As most of the other replies have said, this is a bad idea and they have given many good reasons. But if your customers are trusted business partners e.g. B2B, and you have VPN connections between your sites and perhaps even a federated SSO solution, then this is not automatically a bad idea. It will be a support nightmare however unless it is incredibly well documented, you will spend all your days answering questions about what data is in each table.

nobody seems to do this

You might be surprised.

Gaius
  • 820
  • 6
  • 7
  • I agree with you that the author has failed to define their customer. To add to your answer how intuitive you need your system to be could be a factor to consider. – ThE uSeFuL May 08 '20 at 17:07
5

As nobody seems to do this, there must be a security risk I'm overlooking.

Human error, granting the wrong authorization to a client on the database level, could have drastic consequences.

Why can't we provide public SQL access to our customers? What could possibly go wrong?

You are creating unnecessary inconvenience for your customer's system :
- In order to write the proper sql queries against your database, your customer has to understand your database schema, or does he need just a part of it ?
- Your customer's code will be tightly coupled with your database, any change on the schema need to be reflected on the customer's code.

That's why, since the year 1, we tend to write applications and API endpoints to abstract the structures of databases.

elsadek
  • 1,822
  • 2
  • 18
  • 55
5

A similar thing is actually done with programs like Hasura—that program exposes your database, using PostgreSQL's permissions system for both row- and column-level permissions, over a GraphQL interface. Clients don't get the full power of SQL queries, but get a subset via GraphQL queries. That allows for queries that get a subset of the data (instead of every column), as well as joins and some level of filtering. The full power of SQL isn't fully exposed, meaning that you can't create a query that creates a DOS attack against the server. Each query is transformed into one SQL query, while restricting access to features that could slow down the server.

As per concerns about migrations, you are definitely able to update the API by just migrating the database. If that's unwanted, Hasura allows you to use custom SQL views that translate your actual tables into public-facing ones, so you can change the internal representation without affecting the API. Additionally, you could always swap out Hasura for any other server that simply exposes the same API, so you're not locked in.

However, you still have the overhead of HTTP, but you can also use the WebSockets interface to avoid reconnecting every time.

lights0123
  • 443
  • 2
  • 6
  • That's really interesting, I'm going to look at Hasura for some of my own projects. – paj28 Apr 18 '20 at 11:10
  • GraphQL is an excellent option to provide a layer of mediation between A) what you want to expose, and a "logical model" for it, and B) your actual backend. – Thomas W Apr 19 '20 at 22:09
  • When we speak of not only about SQL I wonder nobody mentioned FaunaDB or Firebase that do exacly what is in the question. – Mišo May 07 '20 at 20:29
  • @Mišo except we are talking about SQL. GraphQL can offer an abstraction over SQL, but there's still a SQL database somewhere. And it seems weird to me to have a database that you can't even host yourself. – lights0123 May 07 '20 at 20:33
  • @lights0123, you can add any abstraction over SQL. There is even project http://postgrest.org/ to provide REST API. But personally I don't think it really matters what technology is used for as data storage under the hood as long as it provides query language like SQL, GraphQL (https://fauna.com), REST, ... (https://firebase.google.com/products/realtime-database) – Mišo May 08 '20 at 06:02
4

Even in the most advanced RDBMS that I've seen, you can't get a good enough security out of the box. The business rules for all but the most trivial applications are just too complex. You need to write custom code that limits what they can do in specific ways if you don't want evil hackers to wreak havoc. Now, you might put it all into stored procedures and only allow your clients to call those... but then you're back where you started - you still have a custom app on top of a basic database. And the stored procedure language is usually a lot more awkward and difficult to use than your generic programming languages (PHP/C#/Java/Javascript/Ruby/Python/etc)

Vilx-
  • 998
  • 2
  • 7
  • 15
  • Exactly. Software engineering tools -- IDEs, debuggers, frameworks, libraries, error handling, logging, performance and profiling -- are at least two orders of magnitude better for application languages (Java, C#, Node JS) than for stored procedures/ embedded database code. – Thomas W Apr 19 '20 at 22:12
3

Security

Application servers (web servers, containers etc) are expected to be exposed to customers/ untrusted external actors directly and undergo much stronger security testing for this purpose. Database servers by comparison frequently have vulnerabilities found, and if exposed directly would likely be fairly subject to exploit.

Application logic & permissioning can often be non-trivial. While database systems offer some limited capabilities, it would probably normally be more cohesive to centralize these in a more capable system (application logic).

Decoupling

Allowing customers to couple directly to your physical datamodel, causes a problem in that you become contractually/ commercially obliged to maintain that exact same datamodel. This is very undesirable, since it makes it impossible to maintain/ improve/ refactor your datamodel since changing it will break customers. (Management will tell you not to.)

This is particularly bad if you can't see how your customers are using it -- ie. if you give them a raw DB connection & can't even yourself parse/ rewrite what they are doing.

Portability between backends is also an issue. While the security and stored proc/ scripting facilities the DB offers you are limited and poor tools for the job, even worse they are vendor-specific. When you want to migrate to a different database for performance/ scalability/ cost reasons, you will find yourself stuck.

The preferred solution is to contract to a "logical model", which is somewhat decoupled from your physical implementation. This also has the benefit that it generally gives a simpler, clearer and more useful model to external parties.

Better API Structure

Several potential improvements:

  1. As mentioned, defining a nice clear logical model is normally easier for your customers to consume.
  2. Offering it by REST makes it much more broadly available, from a vast range of client software & tools, than requiring client software to include a specific DB library, open a connection & run SQL.
  3. API standards such as GraphQL can give you really nice & simultaneous powerful generalized graph access and data retrieval across your logical model -- ie. many of the advantages of SQL -- while giving a better degree of permissioning and control.

Development Effectiveness

Software engineering tools -- IDEs, debuggers, frameworks, libraries, error handling, logging, performance and profiling -- are probably two orders of magnitude better for application languages (Java, C#.. also other langs like Node JS, Rust, Go) than for stored procedures & embedded database code.

Given lifetime maintenance costs are 4-10x that of initial development, even a "very small" initial project of say 7 days exposing data to customers is likely to incur large lifetime cost differences.

On the development side I would expect 3-4x productivity difference to use reasonable application-language tools & a modern framework (Spring Boot, GraphQL or somesuch). On the customer side I would expect things to be much easier to consume the data & far less disruption (since the API will be able to be stable).

Those claiming there would be no development-side costs to expose SQL are probably omitting the cost of trying to implement security rules, trying to patch datamodel issues, and trying to fix security problems after-the-fact.

How much is it going to cost to write a custom DB connection proxy to read the database wire protocol, parse SQL queries and block security concerns at the SQL query level? Because if you allow a raw SQL connection and DB rules are inadequate (which we pretty much know they are), that's what your security fallback will be.

Recommendations

Do yourself and your customers a favour. GraphQL might or might not be right for you, but it does offer much of the advantage of SQL while bypassing many of the problems.

Use a modern application language & frameworks -- I favour Java (SpringBoot) or C#, but other options like NodeJS etc are available. (Avoid PHP).

I would suggest:

  1. Plug in a GraphQL framework, build a logical model, try GraphQL out.
  2. Alternatively build out a REST API for each well-defined view, implementing permission logic in the application, answering JSON if possible -- perhaps add a CSV option if your customers really want flat data.

Hope this helps!

Thomas W
  • 131
  • 4
2

If you take a look how Elasticsearch API is done, then you probably find similarity to your idea. Having only Elasticsearch saves you from the need to develop any custom backend code and build your own REST API if your case is simple. Connect clients directly to the Elasticsearch REST API and that’s it . (It is not that I am advocating for Elasticsearch. I just find that this is a good real world example for your idea)

2

Your third bullet point, really needs to be your first, as it is the most important reason for not doing allowing direct access.

  • (My-)SQL permissions are pretty fine-grained, so I'd wager there shouldn't be any obvious security issues

The main reason this isn’t done is because row level security hasn’t been a thing for that long of a time, and without row level security there is no security in your scenario. MySQL doesn’t have row level security even now.

And row level security doesn’t end your security and design problems, it is just a necessary first step. Basically, there’s no benefit that outweighs the security problem.

Which is to be expected. I would describe a database with between 50 and 500 tables as mildly complex. Even if people were 100% honest, I would not want them using the database by direct interaction.

I believe that in most organizations, if something can be done via the application, it is considered better to do it that way than through direct database access, even if the user has the ability and knowledge to do it through the database.

The ability to keep users out of the database and require them to make their changes in a defined manner is one of the reasons to move away from Access and/or Excel. Expand that from an organization where you can at least suppose that all of the users are trustworthy if not equally skilled, to the wider internet where you should really assume that any random user is a bad actor...

jmoreno
  • 496
  • 2
  • 9
2

It sounds like you have a situation where you are not too worried about which clients see which data rows, or about clients needing to update rather than just query tables.

One alternative that may fit your use case is to simply supply, directly or indirectly, a copy of the database to the customers. Imagine just sending them a SQLite clone as a file, either inside an application or even directly, depending on their sophistication. This bypasses performance concerns in the case of malformed queries, at least for your servers.

In this modern age of people watching gigabyte size YouTube videos, you can afford to send a pretty big database across the wire in its entirety.

Brian B
  • 121
  • 3
2

This is possible provided that you select the right database service. The reality is that there are few that provide the mixture of permission-granularity and access model. Without trying to promote products, but as an example today, You can accomplish similar with 'enterprise scale' database systems like

  • DB2
  • Snowflake
  • Oracle
  • MSSQL

and probably others.

The challenge is that these systems are quite complicated to manage. If you

  1. have the team,
  2. can support & afford the operational overhead, and
  3. can sell the data product that way,

then who's to stop you?

What usually stops them is

  1. more of the talent out there builds apps. Apps may also be their preferred career growth, not uncommon DB configuration devops
  2. Testing these systems for QA is different than the methods for apps. Managing for that may take more time, or run into the same talent challenges.
  3. Sales teams often aren't that savvy to tech, but solutions engineers are. Most counterparts solutions engineers are not DB admins. That increases sales friction. People buy things that make sense to them.

Tread lightly. Innovate. Plan well.

New Alexandria
  • 270
  • 1
  • 9
1

There are pros and cons, but IF you are going to expose the database to your client, make it a small attack surface by only giving them access to a particular schema. This schema would only contain stored procedures that you will allow them to run. This will mitigate against SQL injection attacks, the authorization the user has depends on the SQL authorization.

If you want different customers to only be able to access their own records, and have different authorizations for different people in the same customer organization, you can do all of that inside larger and larger stored procedures. In essence, you are building your own API inside the stored procedure, and if you are about to do this, you are better off maintainability-wise, having your own API layer in the middle tier. If you have complex business logic, both performance-wise and maintenance wise, it is better in a middle tier than in a stored procedure.

So in summary, you can put everything in the SQL database and stored procedures if you want to. Both function-wise and security-wise you can get it to work with a small attack surface. But if you have a complex system, and you understand what is involved, most of the time you won't want to.

schroeder
  • 125,553
  • 55
  • 289
  • 326
1

Why can't I just let customers connect directly to my database? Why not give customers access to the database by exposing the SQL port, skipping the HTTP API entirely?

You can't/shouldn't because the access control provided by your database engine likely lacks the granularity you would need to adequately control your client's access.

Most databases are normalized, and that results on all objects of the same type being stored in the same table. Even those which belong to different customers.

Most (all) database engine permission systems grant or deny access to entire tables at once, not on a record-by-record basis. And you probably don't want one customer to see all other customer's data.

So this is why it's worth writing an API handler that does the database queries on the customer's behalf, and returns only the results that any given client is authorized to receive. The API can also implement billing, rate-throttling, logging, and other handy business functions that the database engine can not.

So yes, you could grant direct DB access and setup a nightly stored procedure that would dump all the data a customer would need into some table and you only give them access to that table. But it's not customary to create a table for each customer and it goes against normalization. It would introduce delay in customers seeing fresh data, a recurring IO spike to regenerate the customer-viewable tables, and use vastly more disk space.

Don't give customers direct database SQL access.

Billy
  • 11
  • 1
0

So long as you are not giving them write access to data that they should not be allowed to modify or read access to data that they should not be able to read this seems acceptable.

Many systems put data belonging to different customers into a shared table, this obviosly would be an unsuitable approach.

Jasen
  • 931
  • 5
  • 9
0

Alternatively you can establish connection over trusted vpn. However in terms of database server security, needs to configure more secure permission and tight access right for user that belong to client database schema on database server.

But this one is no recommended for public clients, usually only for internal topology.

0

I wouldn't put it on security perspective, rather on software engineering

Why not giving customers access to direct CRUD?

Because CRUD (Create, Read, Update, Delete) primitives are atomic primitives. They do not implement the business logic. And relational permissioning models do not take into account data segregation

Assuming security is well defined, here are few reasons why CRUD doesn't work

CRUD is atomic. Too atomic for business

A money transfer is made of a credit and debit. Two queries. Are you 100% sure that your customer(s) will run all the queries in the expected order and within a transaction? Your relational database does not enforce the Newtonian constraint that "money cannot be created or destroyed, but just transferred".

A REST API guarantees to run two queries in a transaction.

Permissions are limited

Relational permissions don't take into account the data semantics and the business logic. If you have access to UPDATE (say you don't have access to DELETE for sake of discussion) you can't (easily) limit the values your client wants to write.

What if you UPDATE the number of coupons on an account when you don't actually have so many coupons available in your company?

A REST API will validate data in input before issuing queries

Permissions don't allow segregation

You will normally discriminate tenants by a column value (e.g. TENANT_ID). READ access grants access to every piece of information.

SQL permissions allow to limit the columns available to a certain role, NOT the rows.

A REST API will add a filter to every query

Auditing and logging

With direct CRUD access you will rely on customers issuing an INSERT INTO AUDIT_LOG. Malice apart, are you sure that everyone will issue that query? With budget constraints, I expect some customer "forgets" implementing that query and forgets to test.

A REST API will issue audit logs on every invocation.

In short

Giving CRUD access to customers users is too primitive and allows the same level of EVIL granted by accessing an order Excel sheets on the same NAS folder to a myriad of users.

Trust me, I witnessed disaster recoveries that you humans...

usr-local-ΕΨΗΕΛΩΝ
  • 5,361
  • 2
  • 18
  • 35
-1

When access is given directly to the DB, you loose control over the kind of CRUD statements that are issued. A malicious user can issue a statement like delete from table_name, and all your data would be lost.

From a user "friendliness" perspective an API would be a lot easier and straight forward compared to issuing SQL statements where the end user would most likely now know the DB design.

Ariff
  • 11
  • 5
    Consider though that they don't have the privilege to drop tables or delete data from tables that don't belong to their account. That's something MySQL supports natively, for example. And the "friendliness" perspective is two-fold: Web developers might be more accustomed to JSON via HTTP, while data analysts strongly favor SQL if given the choice. – Moritz Friedrich Apr 17 '20 at 13:47
  • I'm going to accept Ariff's answer as a clear example of application permissioning. _Applications_ are generally a better place to implement application logic, application permission control, and application APIs. Some simplistic parts of these (not all) can theoretically be implemented in database scripting, the development efficiency & effectiveness of those tools are far lower. There is a reason that _application_ is a separate word from _database_. – Thomas W Apr 19 '20 at 22:17
-2

That's why we have business intelligence software like Business Objects and others: to allow users to build their own queries without mastering SQL, and also limit what they can see. Permissions can be set on a per-user basis of course.

To give you an idea, one company I work with provides remote access to the data through SAP Business Objects. Users have access to a number of ready-made reports and are not allowed to build their own (this is by design). They also receive some Excel files every day (or week), that are generated by BO. They are quite happy with this, because we are doing their work, or most of it. One important criterion is: do your customers need live data or not. If they don't you can provide the data in a controlled fashion like a CSV download.

Exposing your database does not not seem to be a very good idea at first glance. I note that you say customers, and not developers, that is people who actually develop around your systems and need remote access for a valid reason. I don't think that non-developers should be granted a level of access that broad.

Your customers may not be malicious or technically proficient enough to mess with your system, but there is always a chance that one of them is working from a compromised computer (still too many people run cracked software or download dubious applications...). In a way, the innocent users may be more dangerous than experienced IT guys, because their own security posture tends to be quite weaker.

There is also the issue of liability if a leak occurs, especially a leak of personal data. Your company may be subject to some regulatory regime such as GDPR, and liable for hefty fines for failure to adequately protect your data.

You are not thinking outside the box like a hacker (or penetration tester) would. Fine-grained permissions are the minimum but the scope of your security goes beyond Mysql. You can write files using MySQL, not in any arbitrary location obviously, but that depends on your settings. If Mysql is misconfigured (eg. running as root) or you have excessive permissions on some folders, then there is some room for compromise.

You also read files from Mysql, for example:

select load_file('/etc/passwd') 

The obvious answer here is to deny the FILE privilege. But are you sure you have considered every possible scenario ? And are you willing to put your job on the line if things go wrong and you are blamed for the mishap ? (I believe in Murphy's Law).

As nobody seems to do this, there must be a security risk I'm overlooking. Why can't we provide public SQL access to our customers? What could possibly go wrong? (Please keep in mind that this is just a thought experiment born out of curiosity)

Worst-case scenario: your database server is compromised, then a hacker uses the server as a gateway to the rest of your corporate network. That's what happens every day in the real world: a single machine is hacked, a workstation or a server exposed on the Internet. It often starts with a SQL injection or some common vulnerability, or simply an exploit taking advantage of unpatched software (Think Equifax). When the hackers gain a foothold on the machine, they pivot into your network and what will they find ?

I would seriously reconsider the idea, but whatever you do log everything.

Kate
  • 7,092
  • 21
  • 23
  • 1
    I think this is mostly a good answer, but I do want to challenge a few points: 1) For things like CSV downloads these need development, which is what OP was trying to avoid. 2) This only moves GDPR liability if this is a bad technical decision, which is exactly the question being asked. 3) By default, regular MySQL users don't have the FILE permission, so this would have to be erroneously granted, not just forgotten. 4) It's standard to put customer-facing systems in a DMZ to avoid network pivoting. Sorry for all the nitpicks, just wanting to focus on facts. – paj28 Apr 17 '20 at 14:59
  • I agree with you _for the most part_, but I'm actually thinking further out of the box than you :) The question is kept intentionally broad, since I didn't want to focus on the implementation. Think providing access to a sharded read-replica that only contains customer-specific data, for example, running in its own DMZ. Maybe even having a backend server that merely speaks a subset of SQL and applies its own restrictions. There are heaps of possibilities in this problem space, and I'm just curious. No database servers were harmed in the making of this question ;) – Moritz Friedrich Apr 19 '20 at 11:18