9

I am currently building a web service at http://write-math.com, similar to http://detexify.kirelabs.org/, that should help users get LaTeX code from drawn formulae. It is part of my bachelor's thesis, and a main goal of this project is to make it easier to do research in the field of on-line handwriting recognition. That means I want to share all data I get from users.

The easiest way to do so would be to simply dump the database. This way I could do my back-up copy and a dump for researchers in one step.

There are only two pieces of data that I hesitate to share with the public once other people use my system: Email addresses and passwords.

Passwords

The password is stored hashed and salted: I store md5($userpass.$salt) and $salt, which is an 8-character random string with characters from A-Za-z0-9. The salt is generated for each user. Is that enough to make it ok to publish this?

The main part of the question is about the Email address: At the moment, I store it as plain text. But I am thinking about storing a hash of the Email address only. This hash could not be salted, because my login function works as follows:

The user enters $email and $password. Both get sent as plain text to the server. Then the server does (as pseudocode):

$pwdb, $salt = query(SELECT password, salt FROM users WHERE email = :email)
if (md5($password.$salt) == $pwdb) {
   Logged in
} else {
   Wrong password
}

Email addresses

It does not matter if :email is $email or md5($email) or md5($email.$applicationwide_random_str). But I can't make a new salt for each user without having to go through each user at login (which would probably not be too bad, since I expect never to have more than 10,000 users).

Questions

  • How long would it take to "unhash" one Email (e.g. info@martin-thoma.de or mexplex@gmail.com) which has a random salt of 8 characters attached (e.g. FHCJ81ru) with "standard" hardware (< $1000) when you don't know the random string? Is it a matter of seconds, minutes, hours or days?
  • Is it bad if people can do that? I mean they could also simply send Emails and look what they get back. In my service, there is not much personal data involved:
    • handwritten symbols and formulae
    • possibly handedness
    • possibly when / where the person learned writing
    • possibly the language of the user
  • Why does no service hash the Email address? (Ok, I don't know whether there are really no services that do so, but I have never read about it. Hashing passwords is common, but hashing Email addresses? Never heard of that.)
  • Is it a good idea to hash Emails if you want to use the Email only for signing in and for the case that the user has lost their password? (I thought about using OpenID, but most people don't know what it is.)
Martin Thoma
  • 17
    Tangent: `md5($userpass.$salt)` is horribly insecure. Please do not use md5 or the sha family of hashing functions for securing passwords, because they are absurdly fast, and thus quite easy to brute force. Please use a key derivation function like PBKDF2, bcrypt, or scrypt. – Kitsune May 08 '14 at 18:16
  • @Kitsune: Thanks. I will change that. See also: [PHP crypt() or phpass for storing passwords?](http://security.stackexchange.com/q/17111/3286) and [How to generate a good salt with PHP](http://stackoverflow.com/q/4099333/562769) – Martin Thoma May 08 '14 at 20:00
  • 1
    I don't understand. Why do you need to share emails at all? Can't you assign each user a random id and use that id in the database? Only you need to know the mapping between ids and email/pass. – Navin May 09 '14 at 05:18
  • @Navin I don't need to share them and I will not do so. But I hoped that hashing would be "secure enough" because it's easier to make a complete dump (as I also wrote in my question) – Martin Thoma May 09 '14 at 08:13
  • @moose Ah, somehow I did not make the connection that it is easier for you to send the whole db. Well, I'm glad the other answers have convinced you that it's not worth it :) – Navin May 10 '14 at 03:22

2 Answers

20

In the end, there are two questions: what you should store, and what you should share.

What you should store

Storing the email address has the advantage that you can contact users. A lot of sites do want to be able to contact users who are not currently logged in. For example, merchant sites want to be able to notify users that their order has been dispatched or that their payment bounced. A lot of sites have configurable email notifications. Sites may want to inform users of a privacy or security breach: people tend to prefer being notified privately to learning about it in the news. And that's not counting all the nefarious purposes (sending ~~spam~~ “promotional offers”).

If you decide that you never need to contact users, store hashes of the emails, and make them slow and salted: not MD5 or SHA-2, but PBKDF2, bcrypt, or scrypt. But be aware of the limitations.
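To illustrate the kind of slow, salted email hash meant here, below is a minimal Python sketch (Python rather than the PHP of the question, purely for illustration). It assumes a single application-wide salt, as the question suggests, so that the hash can still serve as a lookup key at login. `APP_WIDE_SALT`, `ITERATIONS`, and both function names are made up for this example, and the case-folding of addresses is an assumption too.

```python
import hashlib
import hmac

# Hypothetical application-wide salt; in practice a long random value that is
# kept out of the database and out of any public dump.
APP_WIDE_SALT = b"example-application-wide-salt"
# Assumed work factor; tune it to what the server can afford per login.
ITERATIONS = 100_000


def email_lookup_key(email: str) -> str:
    """Return a slow, salted hash of the email, usable as a lookup key."""
    # Assumption: addresses are compared case-insensitively after trimming.
    normalized = email.strip().lower().encode("utf-8")
    digest = hashlib.pbkdf2_hmac("sha256", normalized, APP_WIDE_SALT, ITERATIONS)
    return digest.hex()


def same_email(email: str, stored_key: str) -> bool:
    """Compare a candidate email against a stored key in constant time."""
    return hmac.compare_digest(email_lookup_key(email), stored_key)
```

The price of one shared salt is that, if it leaks together with the hashes, a single brute-force pass attacks every user at once.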

I assume that you'll be using email addresses as unique user identifiers. This has a downside: sometimes people change emails. For example, in the academic world (which a lot of users are likely to belong to), people often use their email from their current institution, and then the next year this email becomes unusable. This can cut them off from accounts that are too strongly tied to an email address. Be sure to allow a way to transition (which can be tricky if you require access to the old email address to add a new one).

What you should share

Brute-forcing a salted hash requires enumerating all the possibilities. The time it takes to try one possibility is a configuration parameter of a slow hash: you should make it as slow as your server supports, but no slower. So the answer to “How long would it take to "unhash" one Email” is literally “whatever you choose”.
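As a rough illustration of “whatever you choose”, the Python sketch below times one PBKDF2 evaluation at two iteration counts and scales that up to a list of candidate addresses. The candidate count of one million and the function names are assumptions made for this example, and the salt is the example value from the question. The measurement is for one CPU core, so an attacker with dedicated hardware would do better.

```python
import hashlib
import time


def seconds_per_guess(iterations: int, trials: int = 3) -> float:
    """Measure the average time of one PBKDF2-HMAC-SHA256 evaluation."""
    start = time.perf_counter()
    for _ in range(trials):
        hashlib.pbkdf2_hmac("sha256", b"guess@example.org", b"FHCJ81ru", iterations)
    return (time.perf_counter() - start) / trials


def worst_case_seconds(candidates: int, per_guess: float) -> float:
    """Time to try every candidate once, on a single core."""
    return candidates * per_guess


# Illustrative assumption: the attacker tests a list of one million addresses.
CANDIDATE_EMAILS = 1_000_000

if __name__ == "__main__":
    for iterations in (1_000, 100_000):
        per_guess = seconds_per_guess(iterations)
        total = worst_case_seconds(CANDIDATE_EMAILS, per_guess)
        print(f"{iterations:>7} iterations: {per_guess:.6f} s/guess, "
              f"{total / 3600:.1f} h for {CANDIDATE_EMAILS} candidates")
```

If the 8-character salt were truly unknown to the attacker, every candidate would also have to be combined with up to 62^8 (about 2.2 × 10^14) salt values. A per-user salt stored next to the hash would be part of the dump, though, so it adds nothing to this estimate.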

How long it takes to brute-force your email database is not really the decisive question anyway. Verifying that an email is in your database is obviously practical — your server will be doing it all the time — and this allows someone who knows the hashes to answer the question “does Bob have an account?”. This is already a privacy breach.
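The “does Bob have an account?” check can be sketched in a few lines of Python. The dump layout (a list of salt and hash pairs), the work factor, and the addresses are invented for this illustration. The point is only that whoever holds the hashes can run the same comparison the server runs at login.

```python
import hashlib
import hmac
from typing import Iterable, Tuple

ITERATIONS = 50_000  # assumed work factor, known to anyone who has the dump


def hash_email(email: str, salt: bytes) -> bytes:
    """Slow, salted hash of an email address (PBKDF2-HMAC-SHA256)."""
    return hashlib.pbkdf2_hmac("sha256", email.encode("utf-8"), salt, ITERATIONS)


def has_account(candidate: str, leaked_rows: Iterable[Tuple[bytes, bytes]]) -> bool:
    """Return True if the candidate email matches any leaked (salt, hash) row."""
    return any(
        hmac.compare_digest(hash_email(candidate, salt), digest)
        for salt, digest in leaked_rows
    )


# Made-up dump with two users; the plain addresses never appear in it.
leaked = [
    (b"salt-one", hash_email("bob@example.org", b"salt-one")),
    (b"salt-two", hash_email("alice@example.org", b"salt-two")),
]
```

Calling `has_account("bob@example.org", leaked)` confirms Bob's membership without any brute force: one hash evaluation per row in the dump.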

The same goes for the password: even allowing third parties to check their guesses of Bob's password is bad. Not as bad as revealing Bob's password, but still bad.

So the simple answer is: do not communicate email addresses or passwords, nor hashes of them, to third parties. If you accidentally leak even hashes, this is a privacy breach. When you share data, use meaningless identifiers for user accounts, for example sequential IDs or random UUIDs.
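One way to get such meaningless identifiers at export time is sketched below in Python. The record layout and the names `pseudonymize` and `user_id` are hypothetical. Each internal user ID is swapped for a random UUID that stays the same within one export, so researchers can still group samples by writer.

```python
import uuid


def pseudonymize(records, user_key="user_id"):
    """Replace each internal user id with a random UUID, stable within one export."""
    mapping = {}  # internal id -> random UUID; keep it private or discard it
    exported = []
    for record in records:
        internal_id = record[user_key]
        if internal_id not in mapping:
            mapping[internal_id] = str(uuid.uuid4())
        cleaned = dict(record)  # copy, so the original rows stay untouched
        cleaned[user_key] = mapping[internal_id]
        exported.append(cleaned)
    return exported, mapping


# Made-up rows standing in for the shared handwriting data.
rows = [
    {"user_id": 7, "symbol": "alpha"},
    {"user_id": 7, "symbol": "beta"},
    {"user_id": 9, "symbol": "gamma"},
]
public_rows, private_mapping = pseudonymize(rows)
```

Discarding the mapping after each export means two dumps cannot be linked through the identifiers. Keeping it privately allows consistent IDs across dumps. Which of the two is preferable depends on what the researchers need.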

Also beware of scope creep in your database. If you store too much information about a user, this can allow identification and making connections. This is a common problem with medical databases: if you happen to know that Alice was at Riverside Hospital from 1997-02-25 to 1997-03-03 and from 2001-07-21 to 2001-07-28, and there is a single patient record for someone who was admitted to Riverside Hospital in February 1997, left in March, and was admitted again in July 2001, then Alice has been identified even if her name was never exposed. This isn't likely to be a concern with the information you're planning to store now, but keep it in mind.

Gilles 'SO- stop being evil'
  • 1
    Thank you for the long answer. To sum it up: It is quite likely I want to contact the user (e.g. inform them of a security breach) so I shouldn't hash the Email. On the other hand, hashing is an improvement over plain-text passwords / Emails, but it still should not be shared. – Martin Thoma May 08 '14 at 18:31
  • 3
    @moose Yes, or in other words, hashing is a second line of defense, not the sole one. Also, use slow hashes for things like passwords, not homemade ones. – Gilles 'SO- stop being evil' May 08 '14 at 18:33
  • 2
    @moose You also need to be aware that hashes with a fixed output length have collisions, and you should therefore never try to uniquely identify someone with a hash. What would you do if, by chance, someone's email address hashes the same as someone else's? – Bob May 09 '14 at 02:46
  • 3
    @Bob No, cryptographic hashes do not have collisions. (Mathematically speaking, they do, of course, but in practice the chance of a collision is less than the chance of your computer accidentally flipping a bit in the hash.) – Gilles 'SO- stop being evil' May 09 '14 at 02:54
  • 1
    @Gilles A good hash should have a very, very low probability of collision, but the possibility remains. Also, the question considered MD5, and MD5 is known to have collision vulnerabilities (granted, the chance of a *random* collision is still minuscule, but it is there. Even more, you're hashing user input, so they could theoretically use a collision to perform an attack). 'course, the max length of an email address is 254 characters, and any hash with collisions in that range is terrible. – Bob May 09 '14 at 03:32
  • 1
    @Bob: How would you perform an attack with that? The "email" column in my database has the "unique" attribute, so they could not add an Email with a collision. – Martin Thoma May 09 '14 at 08:17
  • 1
    @Bob: Oh, I just realized it myself. They could take over an account by using the "Password reset" function. – Martin Thoma May 09 '14 at 08:20
  • 1
    But they would not only need to find such a collision (which is hard in the first place, considering that I will switch from MD5 to something better) but also have to own that email address – Martin Thoma May 09 '14 at 08:23
  • 1
    @Bob Actually the low length would have little impact on finding a collision, given that the internal state of typical hashes is less. But no, I insist, it is in any practical sense impossible to find collisions for cryptographic hashes (excluding straight MD5 but including iterated MD5). – Gilles 'SO- stop being evil' May 09 '14 at 13:49
  • 1
    @moose Don't worry about hash collisions. They just cannot happen. Bob doesn't understand what a cryptographic hash is. Even the use of MD5 wouldn't be a problem, since you'd be using *iterated* MD5 as part of a slow hash like PBKDF2. – Gilles 'SO- stop being evil' May 09 '14 at 13:50
  • 1
    Isn't key stretching (PBKDF2/bcrypt/scrypt) a bit of an overkill for e-mail addresses? – Ohad Schneider Aug 12 '17 at 13:54
  • 1
    @OhadSchneider It's rather underkill. If a password's hash is leaked, that merely requires the user to change their password before the hash is cracked. But if an email address's hash is leaked, then the user's email address is revealed once the hash is cracked, and you can't undo the loss of privacy. – Gilles 'SO- stop being evil' Aug 12 '17 at 20:13
2

Never export any user data, even in hashed form: chances are someone will figure out a way to break the encryption/hashing.

So only export the relevant data tables, not the user table. You will have foreign key references in your data, so you know which items belong to the same user, but it will be an anonymous number account to whoever is using the dumped data.
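A minimal Python sketch of such a selective export follows. It uses an in-memory SQLite database as a stand-in for the real one, and the schema, the table names `users` and `recordings`, and the function `export_public_tables` are all hypothetical. Only tables on an explicit allow-list are written out, and the `user_id` foreign key is kept as the anonymous number.

```python
import csv
import io
import sqlite3

# Hypothetical schema: `users` holds credentials, `recordings` the research data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE,
                        password TEXT, salt TEXT);
    CREATE TABLE recordings (id INTEGER PRIMARY KEY,
                             user_id INTEGER REFERENCES users(id), latex TEXT);
""")
conn.execute("INSERT INTO users VALUES (1, 'bob@example.org', 'hash', 'salt')")
conn.executemany(
    "INSERT INTO recordings VALUES (?, ?, ?)",
    [(10, 1, r"\alpha"), (11, 1, r"\beta")],
)

PUBLIC_TABLES = ("recordings",)  # allow-list; `users` is deliberately absent


def export_public_tables(connection: sqlite3.Connection) -> dict:
    """Return {table_name: CSV text} for the allow-listed tables only."""
    dumps = {}
    for table in PUBLIC_TABLES:
        # Table names come from the constant above, never from user input.
        cursor = connection.execute(f"SELECT * FROM {table}")
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(column[0] for column in cursor.description)
        writer.writerows(cursor)
        dumps[table] = buffer.getvalue()
    return dumps


dumps = export_public_tables(conn)
```

An allow-list is used rather than a "skip the users table" rule so that a table added later stays private until someone decides otherwise.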

marlene
  • 1
    Agreed. Even if the hashing can't be "broken" completely, it can be used by others to verify the password offline. – Navin May 09 '14 at 05:19