Did you ever write a secret, “coded” message? Speak in Pig Latin? Swap “A” with “1″, “B” with “2″? These are all different methods of encryption – and whether it’s ROT-13 (A = M, B = N, C = O, …) or super-secret wartime ciphers — the idea is that you use a key (think Little Orphan Annie’s secret decoder ring) to encode a message and someone else with the same key can translate that into the original message.
The idea of hashing is a little counter intuitive at first — because, once something is hashed, it can’t ever be converted back into the original message.
Let’s use the MD5 algorithm, for example.
MD5("Hello world!") = 179420601124810473743368533902492969504
It is impossible to ever turn that 39-digit number above back into “Hello world!” The reason is that you can hash anything: meaning you have essentially an infinite number of possible “inputs” — but no matter what you put in, MD5 spits out a 39-digit number.
That means that there are many different possible outputs — 3.48×1038, actually, which is a pretty large number. But there are more inputs than outputs: you can hash a whole book if you want. So at some point, two different inputs can produce the same output — so logically, you can’t reverse-engineer a hash to get the original message, because there are innumerable possibilities.
But the number of possibilities is irrelevant. There is no formula for converting a hash back into what it was originally — the “key” in this case, is the original message itself. So essentially, to “decrypt” the message, you need to know what the message was in the first place, which kind of defeats the purpose.
And so, as one so frequently ponders, what makes hashing useful?
It’s useful because every time you hash the same input, you get the same output.
When I sign up for a website, I create an account with a username and a password:
Here’s what happens behind the scenes when I click “Sign Up”:
- A script connects to the database at MySpace.
- It adds a row to the users table with my information:
- Full Name: Matt Miller
- Gender: Male
- Username: matt_miller_ecreations
- Password hash: 1878782304126990231866587686374681626
This way, nobody — no hacker, not even the database administrator — can ever know what my password was. And every time I go to MySpace and log in, I enter my username and password and it gets sent to a script on the server which says:
- Connect to the database.
- Look up the password hash of the user with the username “matt_miller_ecreations”.
- Meanwhile, take what the user entered into the password box and generate a hash of it: MD5(“eCreations_is_the_best!”).
- Now, compare that hash with the one pulled from the database. If they match, log Matt in.
The idea is that the password itself is never stored – as soon as that hashing operation is done, the hash is put in the database and the original password is forgotten.
Now, if somebody can hack into the database and see my password’s hash, it means they can probably hack into a lot of other things — among them, they could probably break into my profile and change my settings (or just post stupid comments all over my page). But if all they can see is the hash, at least they can’t see my actual password, which means that even if I use the same password for my bank account, they can’t get into it.
Unfortunately, hashes aren’t exactly foolproof, because once a hacker can see your password hash, he can make a guess as to what your password might be. How?
Well, they invented something called rainbow tables. Rainbow tables are essentially really big spreadsheets that look like this:
Smaller rainbow tables are just a list of maybe the 10,000 most common passwords and their resultant hashes — so they can look up your hash and find out what your password most likely is. Larger ones have every possible combination of A-Z, a-z, 0-9, and some special characters, to compute every possible password up to a certain length. So, rule #1: always come up with a long password.
If we use A-Z, a-z, and 0-9 character sets when we make our password, how many 6-character passwords are possible? (26 uppercase letters + 26 lowercase letters + 10 digits)6 is about 57 billion, which is a lot, but it’s a few minutes to a couple of fast computers to generate a rainbow table of that size. But at nine characters, that number jumps to thirteen quadrillion. If you thought the national debt was high, this number is unfathomable.
But as developers, we have to face the fact that users are going to pick really obvious passwords. If you saw your password on that list… you really oughta change it. But from the perspective of programmers, we have to accept that some users are going to pick obvious passwords — but we still don’t want hackers to be able to guess them if they get into our database or access the cookies on your computer.
This is where the salting technique comes in.
Every rainbow table has the MD5 hash for “123456″ in it. But how many have the hash for “erawguhbvmncvxbnsrheoi123456″? Probably none.
Essentially, the concept is: salt a password, and then hash it. “Salting” simply means: take something completely random and unpredictable, and stick it before or after or even in the middle of the password. Now you just need to store that salt so that every subsequent time you need to check if the user entered the correct password, you use the same salt before you check to see if the hashes match.
It doesn’t matter if the hacker knows the salt because the hacker doesn’t know what to add to the salt to produce the same hash as the one in the database.
One way to create a salt is to simply type a bunch of random characters (or have them generated) and store them in a file on the server. Then, whenever you want to generate a hash, your code just says: stick our salt in before the password and then hash it. However, there is a caveat to this approach:
Say Joe is logging into Wells Fargo, and he doesn’t know his best friend James is actually a hacker. James is in the room spying on Joe as he enters his password and memorizes the password as Joe enters it with his keyboard. (Joe isn’t the brightest — his password is 123456.)
Now James hacks into the banking website’s database. He finds the salt stored either on the file system or in the database and sticks the salt before 123456, getting “erawguhbvmncvxbnsrheoi123456″. Then he generates an MD5 hash of this and gets the resultant number: 28430040715551399537164885414111372199.
Next, he queries the database, looking for other users who have this stored as their password hash. He finds that about 10,000 other Wells Fargo users have the same hash. Now he knows that most, if not all, of those other 10,000 users are using that same password: 123456. Now, James’ damage isn’t limited to what he can do to compromise Joe’s finances — James can log in as 10,000 other people as well.
The solution? Every time a user is added to Wells Fargo, a new random salt is generated and stored in the users table along with the user’s other information (including the password hash). Then, when checking if the entered password is correct, we first look up the user by his/her username/e-mail address, then we pull down the salt for that user and use that to generate the hash. This way, since every user has a different salt, even if two (or 10,001) users have the same password — they all have different hashes, because they all have different salts. James can look up Joe’s hash in the database, but chances are, no other user has the same hash as Joe. It’s not guaranteed — it’s certainly possible that two users share the same salt and password, or even that two users have different salts and passwords but they generated the same resultant hash and are thus considered equivalent (this is known as a collision) — but this is very, very highly unlikely.
In particular, the MD5 algorithm is known to be vulnerable to collisions — it has been proven that several fairly short and similar inputs have been able to produce the same hash. Because of this, many recommend newer and less frequently used algorithms as they’re less likely to be exploited. For example, here at eCreations, we frequently use the Whirlpool algorithm. Whirlpool hashes are 64 hexadecimal digits long, which in sensible terms means that if you take that huge number that represented the number of possible MD5 hashes you can make and then square it, you’ll get the number of possible Whirlpool outputs. Honestly, there’s not much you can compare this number to (seriously, this number is approximately the number of atoms in the universe).
Hashing, when compounded with the salting technique, is a pretty good tool for web developers to keep their users’ information secure — when done right. It’s not a substitute for insisting your users use strong passwords (before you get annoyed, remember — if a website tells you your password has to be at least 8 characters long — they’re doing it to protect you!). And, as long as you keep using the same passwords for multiple websites (an easy habit to fall into), your information is only as secure as the most vulnerable site you have an account on is — once your password is found somewhere and tied to the same username or e-mail you use on multiple sites (or other easy-to-obtain information), a hacker can log into your accounts anywhere. And don’t assume big corporations are gung-ho about security — even tech conglomerate Adobe was storing passwords that were unencrypted. But knowing a thing or two about hashing algorithms and salting will go a long way in protecting your users’ information, on your site and beyond.