Security flaws of storing strings and some of its substrings using Microsoft SQL Server Transparent Data Encryption?

Question

I am creating a database table with an encrypted value, say users:

Say John encrypted is U2FsdGVkX193AOGlBRE1RNScJRGN9vSB4erIljJwaKw=.

UserId | Name
------ | --------------------------------------------
1      | U2FsdGVkX193AOGlBRE1RNScJRGN9vSB4erIljJwaKw=

Johnny would be encrypted as U2FsdGVkX19lYDuvuBg3BzTnZro4J6zKoo9s/TLeiPU=.

UserId | Name
------ | --------------------------------------------
1      | U2FsdGVkX193AOGlBRE1RNScJRGN9vSB4erIljJwaKw=
2      | U2FsdGVkX19lYDuvuBg3BzTnZro4J6zKoo9s/TLeiPU=

How could I search for names beginning with Jo?

I was thinking of storing:

NameStub | UserId
-------- | ------
Jo       | 1
Jo       | 2
Joh      | 1
Joh      | 2
John     | 1
John     | 2
Johnn    | 2
Johnny   | 2

I can now do searches like 'Jo%', 'Joh%' and both will include both users.

I can see the obvious security flaw with the above that anyone can map out stubs like Jo or Joh.

So I have decided to store the stubs encrypted:

NameStub    | UserId
----------- | ------
enc(Jo)     | 1
enc(Jo)     | 2
enc(Joh)    | 1
enc(Joh)    | 2
enc(John)   | 1
enc(John)   | 2
enc(Johnn)  | 2
enc(Johnny) | 2

where enc(x) is just the encrypted value for x.

When searching for Joh, I will encrypt Joh and then query for matching stubs and any associated users with names for that stub.

Note:

Both of the encrypted columns will use different keys and algorithms to protect the data.
I will be using Microsoft SQL Server Transparent Data Encryption (TDE).

I need to understand if I am missing something fundamental here.

possibly relevant: https://stackoverflow.com/questions/4961603/whats-the-best-way-to-store-and-yet-still-index-encrypted-customer-data https://security.stackexchange.com/questions/72432/indexing-encrypted-data-for-efficient-searching — Jack, Dec 03 '19 at 02:30
you will at least need to encrypt the whole of your second column, otherwise the mapping will leak information about relationships between values. — Jack, Dec 03 '19 at 02:34
You should understand that whatever this custom encryption algorithm of yours is, it is not secure. Why are you including it? If you are also encrypting with AES, then you should just stick with AEA — Conor Mancone, Dec 03 '19 at 02:44
It is not a custom encryption, it was just used as an example. I will be using the already in market encryption standards and methods. — Shreyash Sharma, Dec 03 '19 at 02:55
What I need is to find a way to hide the mapping and still be able to map. — Shreyash Sharma, Dec 03 '19 at 02:59
You can try searching for Searchable Symmetric Encryption (SSE). — AlphaD, Dec 03 '19 at 03:24
@AlphaD SSE seems pretty new given that there was no Wikipedia page for it in the first page of search results for it. — Kenny Evitt, Dec 03 '19 at 05:31
@KennyEvitt Yes, its not something I would consider for production just yet but it's probably the closest to what this question is looking for. — AlphaD, Dec 03 '19 at 06:58

Kenny Evitt · Accepted Answer · 2019-12-06T16:38:39.030

TLDRTLDR

Don't do this!

TLDR

TDE is irrelevant for this question.
Don't do this (index encrypted substrings of encrypted strings).
Don't index encrypted values generally.
If you are going to do this anyways, use an HMAC or other MAC algorithm to generate the encrypted values to be used in the index(es).
If you're feeling really lucky, consider using a format-preserving encryption algorithm or a "Searchable Symmetric Encryption (SSE)" algorithm instead of a MAC algorithm.

TDE

I'm assuming "TDE" is Microsoft SQL Server Transparent Data Encryption. TDE is irrelevant (for your question). TDE encrypts an entire database and is 'transparent' to database users or other clients connecting to the database. What TDE protects from is an adversary accessing the data in the database offline, i.e. via copies of the database files or copies of the encrypted data, but without also having access to the encryption key with which the secured data was encrypted. Think of an attacker stealing a backup of the database, e.g. stealing a backup tape cartridge containing the database data or log files. If an attacker can access a live database for which TDE is configured, and the database system doesn't otherwise deny them access to the secured data, they can access any data in it (that's not otherwise encrypted apart from TDE).

Indexing Encrypted Data

The most secure method of searching encrypted field values is going to be to decrypt all of the field values to be searched first. Ideally, you'd do this directly in the database, e.g. in a stored procedure or function. It seems unlikely that an attacker could read the memory of your database process(es) but not also access the encryption key(s) used to encrypt the data in the database.

Your idea of storing encrypted "stubs" or substrings will weaken the security, perhaps substantially. This is a great answer on another question on this site that seems to generalize your question somewhat:

encryption - Indexing encrypted data for efficient searching

That answer points out an important reason why this approach weakens the security:

At best, what you could do is to implement deterministic encryption, such that encryption of a given record value always yields the same encrypted result. This leaks a modicus of information (if two records have the same contents then this will show, despite the encryption layer); on the other hand, it allows for exact searches: you encrypt the value to search, and use the index on the encrypted values.

You can't create an index of encrypted values without exposing (to people that can see those encrypted values) that equal values are equal. Typically, values would be encrypted with a salt that's unique to each specific instance or occurrence of a value but one consequence of that is that any number of instances of the same cleartext will be encrypted using different salts and thus will result in different encrypted values. This is one reason why decrypting the values to be searched – before searching – is probably the best choice if you really need to store these values encrypted by default, even for authorized database users.

Note that this same weakness applies even for searches for exact matches, i.e. even if you're just indexing the encrypted values whole for performing searches for exact matches. In fact, for the two sentences I quoted above, the sentence immediately following is:

Substring searches, however, should be avoided at all costs.

(That's referring to indexing encrypted substrings.)

Algorithms for Indexes of Encrypted Data

But – if you decide that weakening security as described above is something you're willing to accept, then – cryptographically – the top two answers on this related Stack Overflow question suggest using a MAC algorithm to generate the indexed version of the field to be searched:

database - What's the best way to store and yet still index encrypted customer data? - Stack Overflow

This answer on the same question above suggests carefully considering using a format-preserving encryption algorithm, the care being in knowingly using a class of algorithms that are relatively new and thus possibly vulnerable to attacks not yet publicly known.

In this comment on this very question, @AlphaD suggests considering using a "Searchable Symmetric Encryption (SSE)" algorithm. This should only be considered with even more care than for considering a format-preserving encryption algorithm. SSE algorithms are so new that there wasn't even a Wikipedia page for them when I checked.

The answers on your link, suggests using HMAC but says that will leak information as well. So that would not work as well, correct? I see no substantial increase in security using HMAC. — Shreyash Sharma, Dec 06 '19 at 02:36
@ShreyashSharma Would you quote what exactly you think claims that HMAC will leak information as well? (There are seven links in my answer and two that I think you're probably referring to.) Maybe you meant these sentences from the "encryption" link?: "At best, what you could do is to implement deterministic encryption, such that encryption of a given record value always yields the same encrypted result. This leaks a modicus of information (if two records have the same contents then this will show, despite the encryption layer); ...". — Kenny Evitt, Dec 06 '19 at 15:51
@ShreyashSharma I think I understand your comment now. *If* you decided that you *do* want to create *indexes* of encrypted values for searching, for either 'whole' strings *or* substrings – despite the weakened security – *then* you should probably use an HMAC algorithm to generate the values to be used in those indexes. But yes, *overall*, you probably shouldn't create indexes of encrypted values at all. — Kenny Evitt, Dec 06 '19 at 16:22
Thank you for the replies and updating your answer. I was going through the paper on CryptDB (https://people.eecs.berkeley.edu/~raluca/CryptDB-sosp11.pdf) & they have a similar approach in the paper which is detailed on page 5 under "Word Search". They are providing an extra layer of security with the proxy and the database will only give a "yes" or "no" for the *LIKE* queries. So what if I was to keep the indexes of the complete name strings ("John" "Johnny") on a separate database with its own encryption and TDE and this one database only revealed the indexes(hashed) for that second db? — Shreyash Sharma, Dec 07 '19 at 10:28
I was focused on TDE as I am trying to accomplish all this to save the data to be seen in plain text from the privileged user, and there will only be one user, that is because I need at least one user to fire queries on the database. — Shreyash Sharma, Dec 07 '19 at 10:29
@ShreyashSharma TDE is perfectly fine, and might be entirely appropriate, but it's a *separate* or *independent* form of security from encrypting a specific subset of your data in the database. As always, with any kind of security analysis, you need to carefully consider your 'threat model', i.e. what kinds of 'attacks' do you want to defend against or mitigate? TDE protects against unauthorized database users accessing *any* data in the database, e.g. from stolen copies of your database files. — Kenny Evitt, Dec 07 '19 at 20:31
@ShreyashSharma [continued] If you think about hashing passwords in a database, that's intended to protect against *anyone* accessing or knowing the password except the original user. In that case, *you* don't *want* to know the password, ever. You only want to be able to confirm whether a password being entered, after being hashed, matches the hash of the original password. In the case of data that's *intended* to be decrypted, you want to carefully consider who should have access to that decrypted data. Obviously *something* or *someone* is going to need to access any data that's decrypted. — Kenny Evitt, Dec 07 '19 at 20:35
@ShreyashSharma The problem with CryptDB, or anything like that, is that's *impossible* to protect secrets from a system's administrators *if* those secrets ever need to or even can be decrypted. That's why passwords are protected with hash algorithms instead of encryption algorithms. But note that, even for hashed passwords, a suitably privileged system administrator could simply change the code of the system to log the plaintext passwords entered by the users when they login! The only way for a user to protect a secret, like a password, from others is to never share it in the first place. — Kenny Evitt, Dec 07 '19 at 20:43

Security flaws of storing strings and some of its substrings using Microsoft SQL Server Transparent Data Encryption?

1 Answers1

TLDRTLDR

TLDR

TDE

Indexing Encrypted Data

Algorithms for Indexes of Encrypted Data