4

I'm still a beginner to cryptography and want to know what the best practice is for solving the following problem:

I'm creating an application with the requirement that a user's email address must be encrypted (at rest) in the database. An email address is also the identifier for a user and thus must be a unique field in the Database.

When a new user signs up, the encrypted value of the email is compared to all previously encrypted values to check for a duplicate.

Will using a symmetric algorithm such as AES-256 ensure that:

  1. Collisions can't occur between two different emails
  2. Two of the same emails cannot end up as different ciphertexts (deterministic)

I don't want to use one-way hash and salts because I need to be able to access the plain text when necessary.

Any help or advice would be greatly appreciated.

2 Answers2

5

So, it looks to me like this is a homework problem, although it could be job assignment; I'll try to answer in a way that meets both. First, some general thoughts on databases and stored data encryption. FWIW, I used to work for a large database vendor and advised the developer who built their column and table encryption feature.

Challenges

The challenge with encrypting stored data in a database is the diametrically opposed goals. With plain old stored data, you often want it indexed (for fast searches) be it for Primary Keys (data in column must be unique) or Foreign Keys (joins). With encrypted data, you don't care so much about the properties of the plain text data, however, you do care that crypt text doesn't accidentally reveal any information about the plaintext. Here, too, the crypt text can be unique, although the uniqueness property is actually working against you. More importantly, matches of crypt text (whether that crypt text is reversible encryption or irreversible hashes) is a non-requirement, that is, you don't want it to match ever.

See the diametrical opposition? Searchable vs. No Information Leakage. Searchable means there's an Index out there which means it's Sorted which means you have Information about the properties of the data. Sorting crypt text is useless, so you really want the Index to Sort on plaintext, but that then leaks Information! So, basically, what we are saying is, you're really quite up a creek with this one. In your example, you want the email address to be both unique (as in Primary Key, searchable in the database) and yet provide no information in the case of theft (encrypted in a secure manner).

All that said, we still want to solve the problem of stored data encryption, so what should we do?

There are two solutions: (1) native database encryption or (2) application level encryption.

In Native Database Encryption, the database itself runs the crypto, stores the data on disk as encrypted, maintains various keys in a wallet & IVs with the data, and provides access control for who can and cannot retrieve the data. It also encrypts indices when the data column is a Primary Key or the column has an Index on it. In some ways, this is nothing more than a glorified access control scheme. However, it meets the requirement that the data be stored (at rest) in an encrypted form. Of course, this means the crypto feature is built into the proprietary database or someone coded it into an Open Source database.

In Application Level Encryption, an app developer encrypts the data and then stores that in the database. I think Mike did an excellent job outlining some of the choices with this model. I'd reiterate that the biggest challenge here is how to handle encrypting data you also wish to be a Primary Key (unique to a column). Since the proper (i.e. secure) method of encrypting data is to generate a unique IV per plaintext, you'll have to scan and decrypt every crypt text email address before determining you have a unique email. That's not a very good situation from an operational perspective (very bad runtime behavior).

Options

  1. Try to use an alternate unique identifier. Email addresses are bad for this, although they are seductive and everyone loves them. They're great for password reset, so leave them as part of the user profile, but instead assign the user a unique numeric ID. These are difficult to guess for an attacker. Often, the most public thing known about a person is their email. Using it as a username makes attacks on an account that much easier.
  2. If email address must be the unique user ID and if encrypted email address is a check the box requirement, use a database that provides native column/table encryption as you can legitimately check the I-encrypted-it box. Oracle (the company I used to work for) has a database that does this. There might be others.
  3. If email address must be the UID and you cannot specify a database that has native crypto (or this is a homework assignment), you'll have to resort to application level cryptography. I have a suggestion to solve the Primary Key problem.

Application Level Stored Data Encryption

Assumption: Some data which must be encrypted must also be unique in database column. Let's call that data "email address".

You're best bet is to properly encrypt the email address by adding a unique IV to each plaintext. This means you cannot make email address column an Index or a Primary Key. First, if you store the IV in a separate column from the crypt text, there's a very small chance that the crypt text will collide (be equal) even if the plaintext is not. It's very small, but I believe one should always code for correctness, so no Primary Key. Next, Index on crypt text is useless, anyway. If the IV is different per row, you cannot do any meaningful search or join on the column, so application level encrypted data columns should never be Indexed.

Depending upon the row count in your table (number of unique users/email addresses), doing anything with that data is going to be horrific. Every time you touch it, you'll have to decrypt on average 50% or possibly 100% of the data depending upon what you're doing. If the database has advanced features (like virtual tables, stored procedures that can build them and internal cryptographic APIs), you might be able to have the database do the heavy lifting for you. This, to be honest, is your best bet. Oracle has the DBMS_CRYPTO PL/SQL package (I wrote it), so, yes, again Oracle can help you solve this problem. No, I no longer work for the company, nor do I own any of its stock. Sorry, I don't mean this to be a sales pitch for Oracle.

Let's say you don't have access to these advanced database features, what to do next? All this comes down to the need to expose useful some information about the email address without allowing an attack on the user data. Here, your decision swings on how large your user population is.

If it's small, just encrypt everything with IVs and decrypt it all to match. The speed of operation won't be that expensive and the extra development work won't be worth it. If this is a homework assignment, that's your best bet.

If the user population is large enough that full table decrypt will be noticeable, then still encrypt everything with a unique IV, however, provide a second, indexed column in which you hash the email. Now, this hash is not your normal cryptographic hash. In fact, you don't want anything of the kind. You want a hash algorithm that has a fair amount of (but not too many) collisions. Collisions are your friend.

Here's a possible algorithm:

    E = Email Address
    Choose N s.t. 2^(N+3) <= NUM_USERS < 2^(N+4), N>=0)
    H(E) = (Sum each ASCII value of E)%(2^N)

Assuming a uniform distribution of email addresses, H(E) will produce 12 or so collisions for a large dataset on average. If you store H(E) in a second column and index it, you can quickly locate candidate email addresses that may match the plaintext email address while ignoring most of the other email addresses. You'll need to redo all of the hashes every once in a while when the population of users jumps above a level. You should also redo them when it drops below a level. You should add hysteresis to this process so you don't flail around redoing the hash column if a user adds/deletes often.

There is Information Leakage with this design, you can guess email addresses and see if they match the Hash column. Other unencrypted data (like a State or Age field) might allow an attacker to triangulate on who the user really is. It's not perfect. However, if you've got 10^6 users, you're going to have to solve the problem some how which means leaking some information.

Also, this won't allow you to sort by email address (no alphabetic sort). It acts more like a true Hash Table from a Data Structures class.

Andrew Philips
  • 1,431
  • 8
  • 11
4

Block ciphers like AES use Initialization Vectors (IVs) much the same way that hashes use salts: so that repeated encryption of the same data does not result in the same cipher text (this is a good thing). You can, of course, use the same IV for every encryption operation. This is a terrible, horrible idea because it completely weakens the mathematical strength of the enryption, but I'll include it here as an option for educational reasons.

Answering your questions:

1. Collisions can't occur between two different emails

With a fixed IV: No, collisions are not possible. Logically, if two different input strings result in the same cipher text, then how can you decrypt them back into two different outputs?

With unique IVs: Yes, collisions are possible. With larger block sizes (and therefore larger IVs) the probability of this is minuscule. To be safe, you could encrypt, check the db for collisions, and encrypt with a new random IV if there is a collision.

2. Two of the same emails cannot end up as different ciphertexts (deterministic)

With a fixed IV: Correct. The IV is the only source of randomness (aka non-determinism) in AES. With a fixed IV, the same input will always give the same output.

With unique IVs: No, the same input encrypted with two different IVs will give two different outputs.

Your point #2 is a little bit baffling because it asks for the exact opposite of one of the fundamental goals of encryption (and secure hashing), Ciphertext Indistinguishability; the idea that if an attacker sees two ciphertext messages, they have no way to tell if they came from the same plaintext message or not.


As Niel McGuigan said in comments, what is your threat model? Or put a different way, what are you trying to protect the data against? Outside attackers stealing the database and trying to crack it? Internal db admins with prying eyes? Trying to obfuscate log files? There are standard solutions to these problems, but they are all different.

Mike Ounsworth
  • 58,107
  • 21
  • 154
  • 209
  • Note that IV is not a parameter to the cipher itself but to some modes of operation. Given a security-sound random key, I don't think any good block cipher will be weakened by a fixed IV (it's just as good as any other plain-text). The attacks against fixed IV require active tempering or viewing of the chiper text, which is not possible here. However, the OP should consider whether the key can be properly secured for this solution. – billc.cn Nov 20 '15 at 14:45
  • Thanks for your answer. In terms of the threat model I would be preparing against: 1) An attacker gaining access to the DB, 2) The prying eyes of staff/admins using the application. – Daniel van Flymen Nov 20 '15 at 22:19
  • In that case, billc.cn is right, no matter how you encrypt it, how you store the encryption keys and how you control which admins have access to those keys (some admins will need access, that's their job) is a FAR trickier problem than the encryption itself. The "proper" solution us to use a Hardware Security Module (HSM) for all your crypto and key storage. Those will run you around $10,000. – Mike Ounsworth Nov 20 '15 at 22:34
  • 1
    Even an HSM may not solve the problem if a cracker finds a SQL Injection attack, since the HSM can decrypt by authority of the application (or database) server. Most times, stored data encryption is a Marketing Strategy (what?! you didn't encrypt the addresses?!) and not a real solution. People expect it, so we developers and security people have to provide it. I should have put this comment in my answer. Here's a link to my argument [*We cannot protect a computer from itself*](http://security.stackexchange.com/questions/105110/prevent-or-detect-key-copy/105147#105147). – Andrew Philips Nov 20 '15 at 22:40