25

I am looking for a way to safely store low-entropy personal information.

I have the following requirements for the data:

  • Must be able to search (i.e. to look up an existing piece of data) but not view
  • Other systems must be able to recover the real value
  • The system must perform reasonably well (operations in seconds, not hours)

I think a system of encrypting the data using a public key is my best option. I can keep the private key offline so the individual value cannot be directly recovered. However I think that an attacker could use the encryption process as an oracle and recover the data due to its low entropy.

Any ideas on how to improve the security of this system? Not collecting this data is not an option. There will be additional layers around this data (access control, logging, physical security, etc) so I am just focused on this part of the system.

Diti
chotchki
  • 4
    What's your threat model? What kinds of attackers, with what resources? – Graham Hill Jun 09 '14 at 14:28
  • The main attack channel is assumed to be exploiting the application itself. However, the application is on an isolated network in a physically secure area. As far as attacker resources go, it's not a large system, so I doubt it would get a lot of resources aimed at it. – chotchki Jun 09 '14 at 14:47
  • 3
    What do you mean by "able to search ... but not view"? If it is low-entropy data, can I search for all possibilities to view the data? – Ari Trachtenberg Jun 13 '14 at 00:52
  • Do you need to be able to search only for metadata? Or excerpts from the file itself? – KnightOfNi Jun 14 '14 at 03:13
  • @AriTrachtenberg You can search using the value as a key but the system will never display the value. – chotchki Jun 16 '14 at 12:28
  • So, if the values are low-entropy, can't I fuzz through all possible values and establish what is in the database? – Ari Trachtenberg Jun 16 '14 at 19:20

6 Answers

16

What you're looking for is deterministic encryption: the same value encrypted twice under the same key gives the same output. Given deterministic encryption with a key K, an attacker would need the key to determine which SSN maps to which encrypted value. You can still perform searches on the deterministically encrypted data, but only equality comparisons (==, !=).

Examples of deterministic crypto that would work:

  • Block ciphers in ECB mode, if the data is <1 block long
  • Block ciphers in CBC mode, with a static IV.
  • Block ciphers in CBC mode with an IV derived from the plaintext. (Note that you don't want to store the IV then, so decryption without the plaintext is thus impossible, so this is a search-only option.)

What won't work:

  • CTR Mode with a static IV (an attacker can then use multiple ciphertexts to recover the keystream & plaintexts)
  • CBC Mode with a random IV (can't search)
  • Any stream cipher (same as CTR mode)
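
To illustrate why a reused keystream is fatal, here is a minimal sketch. The random `keystream` merely stands in for what CTR mode (or any stream cipher) would produce under a fixed key and IV; the `xor` helper and the sample values are assumptions made for illustration.

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Stand-in for the keystream a cipher emits under a fixed key and IV.
keystream = secrets.token_bytes(9)

p1 = b"078051120"  # plaintext the attacker happens to know
p2 = b"219099999"  # plaintext the attacker wants

c1 = xor(p1, keystream)
c2 = xor(p2, keystream)

# The keystream cancels out: c1 XOR c2 equals p1 XOR p2.
leak = xor(c1, c2)

# Knowing p1, the attacker recovers p2 without ever touching the key.
recovered = xor(leak, p1)
print(recovered == p2)
```

A single known plaintext is enough to expose every other value encrypted under the same keystream.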

Note that, in all cases, you are giving up ciphertext indistinguishability, but that's a core requirement of being able to search on the ciphertexts.

You do need a mechanism to share the key with other systems that need access to the plaintext, but an attacker who gains access to a database backup, SQL injection, or any other attack that gives access only to the database won't be able to discern the plaintexts.

Public-key encryption is not useful here, as you point out: if the cryptosystem is deterministic (plain, unpadded RSA, for example), anyone holding the public key can enumerate the candidate values and recover them. A non-deterministic public-key scheme (padded RSA) will not allow you to search on the ciphertexts.
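
A minimal sketch of that enumeration attack, using toy textbook-RSA parameters (p=61, q=53) that are far too small for real use. The function names and the three-digit value space are assumptions chosen so the example runs instantly; the same loop works against a nine-digit space, just more slowly.

```python
from typing import Optional

# Toy textbook-RSA parameters; never use sizes like this in practice.
N = 3233
E = 17    # public exponent, known to the attacker
D = 2753  # private exponent, kept offline

def encrypt(m: int) -> int:
    """Unpadded RSA: deterministic, so equal inputs give equal outputs."""
    return pow(m, E, N)

def decrypt(c: int) -> int:
    return pow(c, D, N)

def enumerate_plaintext(ciphertext: int, candidates: range) -> Optional[int]:
    """Attack with the public key only: encrypt every candidate and compare."""
    for guess in candidates:
        if encrypt(guess) == ciphertext:
            return guess
    return None

stored = encrypt(742)  # what the database would hold
print(enumerate_plaintext(stored, range(1000)))  # the attacker never needs D
```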

I would review whether you really need to encrypt small, easily brute forced plaintexts. What is your threat model? Can you protect against these threats in other ways?

KyleMit
David
  • this is close to answer I was thinking about. However is there any difference between a deterministic block encryption and an unpadded public key encryption? – chotchki Jun 16 '14 at 15:23
  • If the public key is protected as well as a symmetric key would be, no. If the public key is known to the attacker, it is trivial to enumerate the range of something as small as SSNs. – David Jun 16 '14 at 15:26
  • +1 for novel implementation of a lossless one way function. – Aron Jun 18 '14 at 08:47
  • 2
    CBC with a static IV would leak information. If (and only if) two plaintexts start with the same block value then the first ciphertext block will also be the same. That can go on with the 2nd, 3rd, ... blocks. The first block where two cipher texts differ is in the same position as the first block where the two plaintexts differ. – Future Security Nov 06 '18 at 02:40
  • What package should I use to deterministically encrypt SSN in django python? – Aseem May 10 '19 at 06:24
7

Keep in mind there are two separate pieces to securing this data, when it's at rest and when it's in transit.

You should not store (data at rest) any kind of sensitive data directly in clear text, period. Things like passwords, social security numbers, and credit card numbers should be encrypted before they are stored on disk. I agree with lorenzog about decoupling your solution, but I suggest a slightly different setup:

  1. Database server. This server stores sensitive encrypted fields in a database (SQL/MySQL/Oracle), but never has the cleartext data. It will be encrypted before it's stored in the database table / field. It also does not have the private key to decrypt the data, just encrypted blobs.

  2. Crypto application server. This server stores the private key used for encrypting and decrypting the fields for an authenticated, authorized user. This is the only place the data stored in the database server can be encrypted and decrypted. Obviously this will be a high asset target, and should be hardened and controlled via policy. Treat similar to a domain controller for example and audit all access and queries to it.

  3. Web Server. Load balance requests from the user and secure communication between servers and services. Serve as endpoint for communication to external users.
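
As a rough sketch of how the first two tiers could divide the work for exact-match search: the crypto tier turns a value into a keyed hash (HMAC), and the database only ever stores and compares those opaque tokens. The class names, the in-memory dictionary, and the use of HMAC-SHA256 are assumptions for illustration; encryption of the recoverable value itself is left out.

```python
import hashlib
import hmac
import secrets

class CryptoService:
    """Hypothetical crypto tier: the only component that holds the key."""

    def __init__(self) -> None:
        self._index_key = secrets.token_bytes(32)

    def search_token(self, value: str) -> str:
        # Keyed hash: without the key, candidates cannot be tested offline.
        return hmac.new(self._index_key, value.encode("utf-8"),
                        hashlib.sha256).hexdigest()

class Database:
    """Hypothetical database tier: sees opaque tokens, never cleartext."""

    def __init__(self) -> None:
        self._rows = {}

    def insert(self, token: str, record_id: int) -> None:
        self._rows[token] = record_id

    def find(self, token: str):
        return self._rows.get(token)

crypto = CryptoService()
db = Database()
db.insert(crypto.search_token("078051120"), record_id=1)

# A lookup passes through the crypto tier first, then hits the database.
print(db.find(crypto.search_token("078051120")))  # existing value
print(db.find(crypto.search_token("219099999")))  # unknown value
```

Because the token depends on a key the database never sees, a stolen database dump alone does not let an attacker enumerate the value space.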

Communication (data in transit) with the client and your partner teams is also very important here; don't overlook it. Make sure you are using SSL/TLS with the strongest cipher suites available.

It won't be easy to set up (harder than having no basic security, for sure, but not impossible by any means), and breaching your customers' trust will cost you far more than the time it takes to secure personal data properly. :)

Good luck!

AckSynFool
  • Your model would not work well with searching. – Aron Jun 18 '14 at 05:29
  • Depends how you design the search function. One way to do it is search query goes to Crypto server, query input gets hashed with private key and then looked up in the database for a matching hash. Why wouldn't that work? – AckSynFool Jun 19 '14 at 20:33
  • One other thought... wildcard searching doesn't seem to apply in this case (although the OP didn't specify). Consider if users are searching against sensitive data like SS# or CC# then it would need to match precisely, not a partial search. – AckSynFool Jun 20 '14 at 01:09
4

Actually, you have THREE problems that you have implied in your question.

  • The title talks about data at rest.
  • In the question, you talk about access control as well.
  • In addition, you then also have a question of data in transit.

The question may have a different answer if you are already using a DB system and introducing encryption in an existing system. Many of the DB systems now support such security features (see below).

Access control and data in transit

Most DB systems have supported access control from day one (it's almost a minimum requirement). However, when you say that such-and-such a system needs to be able to read the data, that is really an access control question.

Likewise, data in transit is also a question of the protocols used, many of which are supported by existing DB system(s). For example, SQL Server supports SSL for connections, as does MySQL. (Search for others, they might support it too.)

Encryption at rest

The third is encryption at rest, which addresses what an unauthorized person or system would see if they got hold of the actual DB file. It also comes with a related issue of key management, i.e. why can't whoever got your DB file also get the keys?

During the design, it would be prudent to assume that one day the key(s) could be compromised or stolen or, purely from a crypto agility point of view, you will have to change the algorithm and keys (e.g. whoever used DES had to eventually move to AES). Even though it can't be 0 cost, there has to be a path esp. if your DB is going to be a distributed one, to change either the algorithm or the key.
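
One way to leave that migration path open is to tag every stored value with the version of the key and algorithm that produced it. The sketch below does this for keyed search tokens; `KEYRING`, `make_token`, and `needs_rotation` are hypothetical names, and the choice of HMAC is an assumption rather than a recommendation.

```python
import hashlib
import hmac
import secrets

# Hypothetical key ring: version -> (hash algorithm name, key).
KEYRING = {
    1: ("sha1", secrets.token_bytes(20)),    # legacy entry to migrate away from
    2: ("sha256", secrets.token_bytes(32)),  # current entry
}
CURRENT_VERSION = 2

def make_token(value: str, version: int = CURRENT_VERSION) -> str:
    """Return 'v<version>:<hex digest>' so the key used is always identifiable."""
    algorithm, key = KEYRING[version]
    digest = hmac.new(key, value.encode("utf-8"), algorithm).hexdigest()
    return f"v{version}:{digest}"

def token_version(token: str) -> int:
    return int(token.split(":", 1)[0][1:])

def needs_rotation(token: str) -> bool:
    return token_version(token) != CURRENT_VERSION

old = make_token("078051120", version=1)
new = make_token("078051120")
print(needs_rotation(old), needs_rotation(new))
```

Since a keyed hash is one-way, re-tokenizing an old row requires access to the original value (for example, by decrypting the recoverable copy), which is part of the cost referred to above.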

Many DBs now provide encryption at rest along with some key management solutions. For example, SQL Server has supported encryption since 2008. In addition, SQL Server has published a key lifecycle management story, which apparently supports symmetric as well as asymmetric keys (via certificates). I believe SQL Server also supports full DB encryption as well as encryption of selected fields via queries (such as in your case for SSN).

Likewise, MySQL supports encryption via query functions, which you could utilize for your SSN scenario. You can also look for other DB systems that already support encryption and use those.

If you utilize a system that supports built-in encryption, you are likely to avoid many pitfalls associated with doing it on your own, as well as get a supported system.

Research DB

CryptDB is a DB system developed at MIT which encrypts data at rest and also supports running queries over encrypted data. If you look at the page for the system, it lists organizations that are actually using it.

Writing own encryption logic

This is probably more time consuming and more challenging to get it right, but based on your question, it seems that you are contemplating this as an issue. If I were in a similar situation, I would definitely avoid it and go with one of the existing DB systems.

There are many issues. For example, when you encrypt data, the output is usually randomized, so encrypting the same data with the same key will usually not result in the same ciphertext. That makes searching challenging, and you may have to remove that randomness (e.g. by reusing the same IVs or salts), which might weaken the security of your system. And with something as simple as storing hashes (or even HMACs with a single key), if someone gets the database file(s), they can run brute force to recover the data within weeks, if not days. This is especially true of fields like SSN, unless you were to spend time and always require multiple fields for a query (e.g. SSN and DOB and first three letters of last name, or such combinations), and store only the hash of the combination, never of the individual fields. This will increase entropy and make it harder for someone to find actual values were they to get your DB file.
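
To put rough numbers on that, the sketch below estimates the search space of an SSN alone versus the combined fields, and shows one way to hash the combination as a single unit. The field sizes are illustrative upper bounds that assume uniformly distributed values (real names and birth dates are not), and `combined_token` is a hypothetical helper.

```python
import hashlib
import hmac
import math
import secrets

# Rough size of each field's search space (illustrative assumptions).
SSN_SPACE = 10**9          # nine decimal digits
DOB_SPACE = 365 * 100      # about a century of birth dates
NAME_PREFIX_SPACE = 26**3  # first three letters of the last name

def bits(space: int) -> float:
    """Entropy in bits of a uniform choice from `space` values."""
    return math.log2(space)

ssn_bits = bits(SSN_SPACE)
combined_bits = bits(SSN_SPACE * DOB_SPACE * NAME_PREFIX_SPACE)
print(f"SSN alone: {ssn_bits:.1f} bits, combined: {combined_bits:.1f} bits")

def combined_token(key: bytes, ssn: str, dob: str, last_name: str) -> str:
    """Keyed hash over all three fields; no field is hashed on its own."""
    # The separator keeps field boundaries unambiguous.
    message = "|".join([ssn, dob, last_name[:3].upper()]).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()

key = secrets.token_bytes(32)
token = combined_token(key, "078051120", "1970-01-01", "Example")
```

The trade-off is that every query must then supply all of the fields; a lookup by SSN alone is no longer possible.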

Other than that, one has to figure out key lifecycle management issues.

EDIT: This is actually a common issue, and I once evaluated encrypting data myself. My initial response did not cover that, so I have updated it to include it and to clarify the access control, secure connection, and data-at-rest issues.

Omer Iqbal
1

I'm not sure what you are trying to do (is it a web service? A mobile app? A desktop app?) but given your requirements, you might consider decoupling the system into two separate components:

  • One would hold a (secure) hash of the SSN acting as a "read-only" database. A search for a certain SSN would hash the query and match it against the database. If the hash exists, it returns a match. You should obviously consider rate-limiting queries so as to avoid bruteforce attacks.

  • Another system (VM or physically separated, up to you) would hold the data "in the clear" with a process similar to PCI (i.e. to store sensitive financial data). Access to this system would be stricter and you would be able to more closely audit successful (and failed) authentications.

Entering a new SSN on the latter system would trigger an update of the entries on the former. This way you could replicate the "read-only" database through load balancing or similar techniques to ensure performance.
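
A minimal sketch of the "read-only" lookup side, assuming a deliberately slow hash (PBKDF2 from the standard library) and a simple sliding-window rate limiter. The salt value, iteration count, limits, and all names are placeholders chosen for illustration, not tuned recommendations.

```python
import hashlib
import time
from collections import deque

# Fixed salt so equal inputs map to equal hashes (needed for lookup).
# Hypothetical value; keep it outside the hash database in a real system.
LOOKUP_SALT = b"example-fixed-salt"
ITERATIONS = 100_000  # raise until one hash is as slow as you can tolerate

def slow_hash(ssn: str) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", ssn.encode("utf-8"),
                               LOOKUP_SALT, ITERATIONS)

class RateLimiter:
    """Allow at most `limit` queries per `window` seconds."""

    def __init__(self, limit: int, window: float, clock=time.monotonic) -> None:
        self.limit, self.window, self.clock = limit, window, clock
        self.calls = deque()

    def allow(self) -> bool:
        now = self.clock()
        # Forget calls that have fallen out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.limit:
            return False
        self.calls.append(now)
        return True

known_hashes = {slow_hash("078051120")}  # the "read-only" database
limiter = RateLimiter(limit=3, window=60.0)

def exists(ssn: str) -> bool:
    if not limiter.allow():
        raise RuntimeError("rate limit exceeded")
    return slow_hash(ssn) in known_hashes
```

Rate limiting only slows online guessing through the application; an attacker who copies the hash database can still brute-force it offline, which is why the hash should be slow or keyed.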

lorenzog
  • You should note that the Hashing algorithm will be a HUGE compromise between security/speed. The fewer hash collisions there are the more information you will be leaving insecure, conversely the more hash collisions you have the more search/comparisons you need to make. – Aron Jun 18 '14 at 05:41
  • @Aron true, but this is an implementation detail that can be solved by throwing more CPU power at the system. My approach was focused on the design. – lorenzog Jun 18 '14 at 08:02
  • Can't disagree more with you. In a typical database setup, CPU is not the bottleneck, I/O is. There is no accepted easy/cheap solution for scaling I/O. – Aron Jun 18 '14 at 08:45
1
How to safely store sensitive data like a social security number?
...
Must be able to search (i.e. to look up an existing piece of data) but not view
...

Homomorphic encryption will allow sorting and searching of encrypted data. Both Microsoft and IBM have systems. But I have not seen them in mainstream production (yet). See, for example, Efficient Fully Homomorphic Encryption from (Standard) LWE. It also meets your other two requirements - reversibility and performance.

If you don't need the PRP notion of security, then use a block cipher. You might even be able to use a Format Preserving Encryption (FPE) scheme. See, for example, Order-Preserving Encryption Revisited - Improved Security Analysis and Alternative Solutions and even A Synopsis of Format Preserving Encryption for some ideas.

I'm not sure what to make of "Other systems must be able to recover the real value" (other than reversibility). Can you explain the data flow? Naively, I'd say perform the selection on the encrypted data, decrypt the data, encrypt the data under the remote system's public key, and then send the encrypted data to the remote system.


However I think that an attacker could use the encryption process as an oracle and recover the data due to its low entropy.

It's going to leak information if it lacks the PRP notion of security, not because of low-entropy data like SSNs. For example, RSA/OAEP can effectively mask an SSN. The bad guy has no more advantage than guessing (with some hand waving).


You will also need a strategy for storing the private key. Perhaps an HSM or KMIP. Gutmann has some interesting thoughts on HSMs and other storage devices (like the hardware backing the KMIP protocol) in his book Engineering Security.

-3

Deterministic encryption (same string encrypts to same value each time) in Python:

In terminal:

pip install pycryptodome

Python code:

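import os
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad, unpad  # provided by pycryptodome, not pycrypto

key = os.urandom(32)  # AES-256 key; store it securely and reuse it for every record
data = "Hello world".encode("utf-8")
# ECB with a fixed key is deterministic: equal plaintexts give equal ciphertexts.
# Only reasonable here because the padded value fits in a single 16-byte block.
cipher = AES.new(key, AES.MODE_ECB)
# pad() extends the data to a multiple of the 16-byte AES block size
encrypted = cipher.encrypt(pad(data, AES.block_size))
decrypted = unpad(AES.new(key, AES.MODE_ECB).decrypt(encrypted), AES.block_size)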
Aseem