-1

I try to hash path to the file with it filename by md5 algorithm, of course the filename always different in system. I'm interesting in is this hash always will be different for different files or hash can repeat?

Сan be such that I get same hash from different files? Are there any restrictions? Thanks

Victor Bocharsky
  • 11,930
  • 13
  • 58
  • 91
  • No, a "hash clash" or a "collision" is the term used when two different strings generate the same hash – Mark Baker Mar 20 '14 at 17:20
  • Here's proof of clashing: http://3v4l.org/2vrMi – John Conde Mar 20 '14 at 17:21
  • @JohnConde But when I print this two hash they are different, don't understand why? – Victor Bocharsky Mar 20 '14 at 17:26
  • 2
    @Victor It's because both hash return strings that stars with "0e..." so PHP treat them as floating point numbers and they end up equal. Just use === for comparison and it will give correct results [3v4l.org/4jLl0](http://3v4l.org/4jLl0) – Dexa Mar 20 '14 at 17:59

3 Answers3

2

It's very unlikely that you will get a hash collision on your file names, however it is possible so you may want to consider it as a potential source of bugs in your application (depending on how many strings you are going to be hashing).

You don't mention if there is any cryptographic reason for you hashing the filenames, if you do need the filenames to be securely encrypted you should use the php crypt() function instead (md5 hashing has not been considered secure for a long time http://www.kb.cert.org/vuls/id/836068)

Seb
  • 61
  • 2
1

Yes, you can potentially get hash collisions. See wikipedia articles on the pigeonhole principle and birthday paradox to understand why.

Tarik
  • 10,810
  • 2
  • 26
  • 40
1

Hash collision is a major issue today in cryptography and general computer science. While md5 is a widely known and used hash, collisions can be very prevalent in it. Collisions are unlikely to happen, but they can occur. They typically will not occur unless someone is attempting to create them.

I submit here the issue that there are 340282366920938463463374607431768211456 possible md5 hashes (since they're displayed hexadecimal style, 16 possibly characters raised to the power of the 32 character length), but there are an infinite amount of strings which can be hashed (that of course taking computational limits out of the equation).

But what is a developer to do if it's possible to have collisions?

I was recently in a meeting with a new friend of mine who runs a business that, among other things, involves cryptography. He said something that I had never thought of before. As I lack the memory to recall word-for-word, it was something along these lines: "Sure, you can fool my md5, but try to fool both my md5 and my sha256." What he was saying is that as a developer we have an enormous amount of programming options and that we should take advantage of them. We have the md5, gost, the sha-family and the list could go on. Hash your string with both a sha256 and an md5 and you'll find the chance of collisions to be lowered tremendously. In fact it will likely be lowered to the point where your chance of collision is practically nothing.

An implementation of this:

<?php $salt = "my_secret_salt"; /* this should have numbers, spaces, letters, special characters, etc. */ $stringToHash = $theUsersCookieValues; $time = time(); $hash_1 = hash('md5', $time . $salt . $stringToHash); $hash_2 = hash('sha256', $time . $salt . $stringToHash); setcookie("time_created", $time); setcookie("user", $theUserCookieValues); setcookie("hash_1", $hash_1); setcookie("hash_2", $hash_2); ?> While this deals with cookies, not filenames, it is still a great way to implement this principle, in my humble opinion.

Spencer D
  • 3,376
  • 2
  • 27
  • 43
  • Well, instead of using two hashes, you can use a larger hash with 512 bits or more. – Tarik Mar 20 '14 at 18:14
  • Yes, but from purely a speed standpoint, the sha256 and the md5 take a little less time to calculate (I'm just making an assumption) than the sha512 for example. On top of this, a 512 bit hash is just 1 hash. 1 hash is still easier to fool than 2 hashes. While both ways are extremely unlikely and would take decades-upon-decades to likely find a collision for, I would still venture to guess that fooling both an md5 and a sha256 would be even less likely than fooling a single 512. That being said, one could do two 512's and 2 different salts, but that seems like overkill. – Spencer D Mar 20 '14 at 18:32
  • "1 hash is still easier to fool than 2 hashes." No, that's not the case as the likelihood of getting a collision on sha 512 bits is way less than the likelihood of a collision on a sha256 x likelihood of a md5 collision. – Tarik Mar 20 '14 at 18:42
  • However, the collision you find for the md5 and the sha256 must be identical to each other. – Spencer D Mar 20 '14 at 18:43
  • As for speed of calculating a larger hash, it can be mitigated by going for a single hash with a number of bits equal to 256 + 128 bits. – Tarik Mar 20 '14 at 18:45
  • " the collision you find for the md5 and the sha256 must be identical to each other." Not possible as the hashes have different size. However, you may have two simultaneous collisions. – Tarik Mar 20 '14 at 18:46
  • Consider this: You find a collision for md5 and it's "abcd" You find a collision for sha256 and it's "xyz" "xyz" != "abcd" which means now you have to find another collision. The user could come up with an exceptionally large amount of collisions, but the collisions, unless they're the actual values, will likely never be equal. If you want to be even more extreme, you could use more than 2 hashes and make finding the collision even closer to impossible. – Spencer D Mar 20 '14 at 18:50
  • "As for speed of calculating a larger hash, it can be mitigated by going for a single hash with a number of bits equal to 256 + 128 bits" Yes, but there are different implement the hashing functions in different ways. For example, there are faster versions of the md5 and there are slower versions even though they all produce the same results and do roughly the same thing. But yes, that's why I was guessing it was faster. – Spencer D Mar 20 '14 at 18:53
  • Given s1 we are looking for s2 such that md5(s1) = md5 (s2) and sha256(s1) = sha256(s2). On the other hand we are looking for the likelihood of sha512(s1) = sha512(s2). The first case gives 1/2^128 x 1/2^256. In case 2 it is 1/2^512. – Tarik Mar 20 '14 at 18:59
  • "I was guessing it was faster". Indeed. – Tarik Mar 20 '14 at 19:06