
I'm trying to generate short string hashes (like YouTube video IDs) to use in my app, but I can't figure out the fastest and simplest way to produce the shortest possible hash from arrays or JSON-based strings.

I read Kevin van Zonneveld's excellent article on this subject; he produces alphaIDs based on integers, and the conversion works both ways. There are also good SO answers, but my case is a little different:

  • There are lots of (more than 100K) small data sets (arrays) for each unique record in the database, something like this:

    // each id holds several small sibling sets
    $id = 1;
    $set[$id][] = array(533 => array('a' => 78), 460 => array('a' => 89));
    $set[$id][] = array(534 => array('b' => 79), 620 => array('a' => 908));
    $set[$id][] = array(535 => array('a' => 80), 782 => array('c' => 901));
    
    $id = 2;
    $set[$id][] = array(672 => array('a' => 12), 852 => array('a' => 122));
    $set[$id][] = array(542 => array('a' => 67), 372 => array('a' => 831));
    $set[$id][] = array(573 => array('a' => 77), 853 => array('a' => 127));
    
    // ...
    
  • I'm trying to generate a unique (but short) hash for every set, like 1:aeF4t, 2:eaXvT, 3:t4fa.
  • Uniqueness under the same id is important. For example:

    1:aeF4t and 2:aeF4t is OK, but I don't want the same hash twice under the same unique id: 1:aeF4t and 1:aeF4t.

  • No id has more than around ~120K sibling sets.
  • I can easily convert these arrays to JSON strings.
  • One-way hashing is enough for me; I don't need to decode previously produced hashes later.
  • The hash method should generate the same hash when I provide the same data set as input later, so salting with date- or microtime-based values is not an option.
  • I think md5() and sha1() are the fastest options on the table, but they generate values that are too long. I'm looking for a way to shorten the total length of the hash.
  • The built-in uniqid() function produces a different value every time, even when the input doesn't change.

Is there an elegant option or good programming technique to achieve this in PHP while keeping performance in mind?

edigu
  • What does "I don't want the same hashes" mean in mathematical terms? Are you familiar with the pigeonhole principle? Are you sure you want *hashes* and not *ids*? – Jon Nov 28 '13 at 22:49
  • No, I'm not familiar with the pigeonhole principle. (Thanks, I'll read about it on Wikipedia, it sounds interesting.) I mean uniqueness in the context of a single hash: it should be unique for only one id. This means some DB records can have similar sets (and thus similar hashes), but when joining the hash with the id, like `1:abcd` and `2:abcd`, it becomes unique. – edigu Nov 28 '13 at 23:05

1 Answer


You could try a checksum function like crc32(). I am not sure whether you will get collisions (the same checksum for different arrays), but the probability should be very low.

$array = array(533 => array('a' => 78), 460 => array('a' => 89));
$crc32 = sprintf('%u', crc32(serialize($array)));
echo $crc32; // 547561972

With a base conversion you can make this integer shorter:

echo base_convert($crc32, 10, 36); // 9205is

If you convert to base 62, you can shorten it even more:

base62 = b3Vsi

For base-62 conversion, see:

converting a number base 10 to base 62 (a-zA-Z0-9)

http://marcus.bointon.com/php-base-62-encoding/
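
For illustration, here is a minimal sketch of such an encoder. The function name base62_encode and the digit order 0-9, A-Z, a-z are assumptions; they happen to reproduce the b3Vsi above, but the linked articles may use a different alphabet and so give different characters.

// Minimal base-62 encoder for non-negative integers (sketch).
// The alphabet order is an assumption; other implementations may differ.
function base62_encode($num)
{
    $alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    if ($num == 0) {
        return $alphabet[0];
    }
    $out = '';
    while ($num > 0) {
        $out = $alphabet[$num % 62] . $out;   // take the lowest base-62 digit
        $num = (int) ($num / 62);             // drop it from the number
    }
    return $out;
}

echo base62_encode(547561972); // b3Vsi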

By the way, with a base conversion you can make an md5 hash shorter, too:

md5 (base 16) = de07bf84ad7708b93eca60b608c7b6e2
md5 (base 62) = 6KXPVjy4V22IgMsCKo86IQ
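
One quick way to do that conversion, assuming the GMP extension is available (GMP's own base-62 alphabet is 0-9, A-Z, a-z, so the exact characters may differ from the example above):

$md5 = md5('some input');                    // 32 hex characters (base 16)
$short = gmp_strval(gmp_init($md5, 16), 62); // ~22 characters in base 62
echo $short;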
bitWorking