15

I have a 128bit ID that I want to perform a one way hash on, but I don't want to ever get the same digest for an input message. Does anyone know if sha-1, or an alternative, is guaranteed not to produce collisions for the set of messages less than its output digest size? This is at least theoretically possible...

I also considered using RSA, and discarding the private key to give me a one-way encrypt, but I need to store the result in a 32 char DB field, and the encryption schemes available to me don't produce anything small enough.

Any suggestions of another way of producing a deterministic, non-reversable and collision free transform of the original value are welcome.

Scott Arciszewski
  • 33,610
  • 16
  • 89
  • 206
Malcolm Smith
  • 3,540
  • 25
  • 29
  • 1
    I think this has been answered by various respondees: sha-1 probably will produce collisions for the message space smaller than its digest size, there is certainly no gaurantee that it won't, but the odds are very small. This could be mitigated by recording the hashed values. A 128 block cypher will produce the 1:1 mapping but requires the key be secret for ever (or changed peridically) There may be other approaches based on number theory, and exponention but they would involve me breaking a core rule of security coding - "don't try and roll you own - you'll probably get it wrong" – Malcolm Smith Jul 02 '10 at 09:03
  • I think we're going to take our chances with hashing, which allows me to go back to worrying about what I do with all that money when I win the lottery 100 times in a row ;) – Malcolm Smith Jul 02 '10 at 09:04

9 Answers9

7

Cryptographic hashes give a very good approximation of a random number for a given input. So how many random hashes do you need in a room until you get the same 160 bits? It about the square root (disclaimer: I am not a statistician). So you should expect to see clashes at around 80-bits.

I guess practicalities mean you should know when cosmic rays will be a bigger problem than collisions.

Tom Hawtin - tackline
  • 145,806
  • 30
  • 211
  • 305
  • According to this paper http://www.ictroma3.it/public/uploads/Finding_Collisions_in_the_Full_SHA-1.pdf I think you should expect clashes around 69-bits. – hoang Jul 01 '10 at 14:00
  • 4
    @hoang That's an attack complexity of 2^63, not a message length. So you should be able to find a collision by only computing 2^63 hashes. But if we were to talk about "randomly" selected messages it is still 2^80. Of course if you selected specific messages (and had a sufficient mechanism to find them), you only need two for a collision! – Tom Hawtin - tackline Jul 01 '10 at 14:41
7

If you want to compute a secret permutation from 128 bits to 128 bits, one simple solution is to use a 128-bit block cipher like AES with a fixed but secret key. You must, of course, be able to keep the key secret forever or you've got nothing.

President James K. Polk
  • 40,516
  • 21
  • 95
  • 125
4

Your ID is unique and 128 bits.

Your comments explain that you cannot use the ID as-is.

You need it to be unique, not just probably unique. Therefore, you cannot use a hash.

You cannot have both worlds - you cannot have a 1:1 mapping that is not reversible. Its an impossibility.

Encrypting - a bijective operation, so there'll be no collisions - IDs with a secret key will make reversing the ID to determine its original value very hard.

AES has a nice block length of 128 bits which will generate 128 bits output from your 128 bits of input, is faster than old algorithms (!) and is widely available for most platforms and languages. I suggest you use AES for your purpose.

Will
  • 73,905
  • 40
  • 169
  • 246
  • The ID is sensitive in its raw form, I want to transform it in such a way that the original ID is unobtainable but in a deterministic way, so the same original always transforms to the same output. 160 bits is just what sha-1 does. I'm not particularly after 160. – Malcolm Smith Jul 01 '10 at 14:00
  • Thanks, that pretty much confirms the way my thinking has been led on this, abeit with the exception that you can make a 1:1 mapping very hard to reverse using encryption and keeping the key secret. – Malcolm Smith Jul 02 '10 at 08:55
  • @Malcolm Smith, yeap, the 'very hard' is certainly true; it only fails at the higher bar of "unobtainable but in a deterministic way". I'll suggest you use encryption so you can accept my answer ;) – Will Jul 02 '10 at 09:10
  • Hmmm, nice try but your answer is a bit too messy to be acceptable. – Malcolm Smith Jul 02 '10 at 15:46
  • It's an important point that the stated requirements ("non-reversable and collision free") are in essential conflict. – caf Jul 05 '10 at 00:29
2

Does anyone know if sha-1, or an alternative, is guaranteed not to produce collisions

Hash functions were designed not to produce collisions, but nothing is "guaranteed." On the contrary, it is guaranteed that there WILL be collisions, because the message space is practically indefinite, while you have a limited number of possible hashes.

SHA-1 however has been proven to be collision-resistant, and that's the best you can hope for.

quantumSoup
  • 27,197
  • 9
  • 43
  • 57
  • 2
    The message space isn't indefinite if it's 128-bit. – Skilldrick Jul 01 '10 at 13:51
  • I understand that sha-1 must produce collisions when the message is larger than the output, my questions is regarding the specific case where the input size is less than or equal to the output size – Malcolm Smith Jul 01 '10 at 13:56
  • Not "message", but "message space." I get your question, but as I said, there are no guarantees. Just work on the assumption that they are collision-resistant, because nobody has yet proved it otherwise. – quantumSoup Jul 01 '10 at 14:10
  • 128-bit isn't infinite, sure, but (to me) it's really really big! 2^128 is over 10^38. The universe is only around 10^17 seconds old, so a collision would be like asking people to pick 2 specific seconds in time (since the beginning of the universe), and someone picking the same pair as someone else. It may just be my feeble mind, but I'd agree with calling that "practically indefinite". :-) – Ken Jul 01 '10 at 14:47
  • @Ken The message space is indefinite, meaning the input for SHA1 is indefinite, the output will always be 128 bits. Collisions will be guaranteed, because there is no limit on input, but only 2^128 possibilities for the output. – MartijnvdB Oct 07 '14 at 11:04
2

I don't know what hash functions avoid collisions but, if you can't find the answer here, a good starting point might be Perfect Hash Function on Wikipedia. From that page:

A perfect hash function for a set S is a hash function that maps distinct elements in S to distinct integers, with no collisions.

There's a number of links to more information on that page that you may find useful.

That being said, can you say why you need a perfect has function? It may be that there are other ways to accomplish what you want to without needing that property, and someone here may be able to make a suggestion.

RHSeeger
  • 16,034
  • 7
  • 51
  • 41
  • Perfect hashes are difficult to write – impossible for arbitrary string inputs – so you typically have to make do with hash-equality not implying value-equality (but the other direction of implication is vital). There *are* some very good hash algorithms though. – Donal Fellows Jul 01 '10 at 14:26
  • @Donal - Fair enough, but bear in mind that his inputs are a fairly limited set (128bit key), so it might be possible for him. (Random aside... I followed the discussion on hashes on the Tcl Core list, which certainly expanded my knowledge of hashes and the issues involved :) – RHSeeger Jul 01 '10 at 17:32
2

For sufficiently large bit sizes, I think discrete exponentiation is a 1:1 function but reversal is computationally infeasible. I'm not sure how "large" is required though. Code for an unusably slow (but conceptually understandable) implementation:

unsigned long spin_once(unsigned long dat)
{
  if (dat & 1)
    return (dat >> 1);
  return (dat >> 1) ^ SomeMagicNumber;
}

unsigned long hash(unsigned long dat)
{
  unsigned long i,ret;

  if (dat == 0xFFFFFFFF)
    return 0;
  ret = 1;
  for (i=0; i < dat; i++)
    ret = spin_once(ret);
}

This program would take billions of steps to compute the hash for many values of dat, but with trickier code the job can be done in reasonable time. A 32-bit hash is cryptographically worthless, of course, but the approach can be readily extended to any size.

supercat
  • 77,689
  • 9
  • 166
  • 211
  • Interesting, but you missed some of your example implementation: for (i=0; i ... Does this paper cover what you are proposing? http://www.farcaster.com/papers/crypto-field/node1.html – Malcolm Smith Jul 01 '10 at 16:20
  • Thanks for the link. It would seem that the size of the numbers required would probably be similar to those required for secure RSA encryption, at least for the simple spin function shown. If one were to use an exponentiation operation which operated in some base other than base 2, I don't know how that would affect things. One advantage of base 2 is that if N is prime, 2^N-1 is often prime. By contrast, k^N-1 is never prime for other k unless N is one. – supercat Jul 01 '10 at 17:30
1

Hashing is "unlikely" to produce any duplicates, but there are no guarantees. On the other hand, any symmetric encryption scheme will produce 128 bits out for 128 bits in, and guarantee no duplicates.

On the other hand, if you are depending on the hashes being unique, my intuition is you're doing something wrong. If you're using hashes to obfuscate passwords for example, you have to be careful that you don't make the hashed password the de facto password.

ddyer
  • 1,792
  • 19
  • 26
0

Isn't the point of a one-way hash that you can't (in general) recover the original data from the hashed value? So why would someone designing a hash function go out of their way to prevent collisions for small inputs?

Instead, it looks like you want to obscure the data, which is fine for some purposes. If it's not practical to use public key encryption, you might as well write your own function.

Justin K
  • 2,664
  • 1
  • 19
  • 16
  • As a simple example, if one had a table in which one had to be able to query people by Social Security Number but one wished to keep the Social Security Number confidential(*), and if one had a hashing function that yielded a different output for each SSN, one could use the hash of the SSN as the primary key for the table. If the hash function produced collisions for two SSN's that were both supposed to be stored in the same table, such an approach would fail. (*) Illustrative example only, since 1,000,000,000-item hash space would be brute-force searchable. – supercat Jul 01 '10 at 15:19
  • I'm not disputing that such a function would be useful. I'm just saying that it's a different use case than that for one-way hash functions. – Justin K Jul 01 '10 at 16:03
  • 1
    Yes, I'm aware that sha-1 was not designed to hash small values, and therefore would not be designed to ensure no collisions for the message space less than or equal to its digest size, but wondered if anyone could definitvely tell me: yes, or no, or suggest another way of producing a deterministic, non-reversable and collision free transform of the original value. – Malcolm Smith Jul 01 '10 at 16:27
  • 1
    That's the thing: the way to guarantee that the function is collision-free is to make it reversible. You can't lose any information in the transformation, or you risk collisions because the output space is smaller than the input space. The best you can do is use some number theory to make it hard to reverse, as in the RSA idea you had. I'll let you know if anything comes to mind. – Justin K Jul 01 '10 at 18:09
0

According to this article http://www.debian-administration.org/users/dkg/weblog/48,

US gov't federal agencies have been directed to cease all reliance on SHA-1 by the end of 2010

However, as far as I know no collision has been found yet , which means , even people which have tried really hard haven't found any collisions. So it seems reasonable to assume that no collision happen (if you are not dealing with high-security data)

mb14
  • 22,276
  • 7
  • 60
  • 102
  • The no collision found, stands for any size of the message , so even for short message, the probability of collision is almost null. – mb14 Jul 01 '10 at 14:21