
I would like to create a unique hash in python for a given directory. Thanks to zmo for the code below to generate a hash for every file in a directory but how can I aggregate these to generate a single hash to represent the folder?

import os
import hashlib

def sha1OfFile(filepath):
    sha = hashlib.sha1()
    with open(filepath, 'rb') as f:
        while True:
            block = f.read(2**10) # read in 1 KiB blocks
            if not block: break
            sha.update(block)
        return sha.hexdigest()

for (path, dirs, files) in os.walk('.'):
    for file in files:
        print('{}: {}'.format(os.path.join(path, file),
                              sha1OfFile(os.path.join(path, file))))
iheartcpp
  • `str.join` the hash values and hash the resulting string? Or merge the file contents and hash the merged content. – a_guest Mar 24 '16 at 15:48
  • 1
    If you do the latter (hash the merged content of all files), you should avoid reading the data twice. You can use the block you read in to update two different hash objects (in case you need the hashes of the individual files too). – Markus Mar 24 '16 at 15:53 (see the sketch after these comments)
  • 3
    Does this answer your question? [How can I calculate a hash for a filesystem-directory using Python?](https://stackoverflow.com/questions/24937495/how-can-i-calculate-a-hash-for-a-filesystem-directory-using-python) – Josh Correia May 15 '21 at 06:30
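
Following up on Markus's comment, here is a minimal sketch of that one-pass idea; hash_file_into and dir_sha are hypothetical names, not from the question:

import hashlib

def hash_file_into(filepath, dir_sha):
    # Read the file once, feeding each block into both a per-file digest
    # and a shared directory-level digest.
    file_sha = hashlib.sha1()
    with open(filepath, 'rb') as f:
        while True:
            block = f.read(2**10) # read in 1 KiB blocks
            if not block:
                break
            file_sha.update(block)
            dir_sha.update(block)
    return file_sha.hexdigest()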

2 Answers


The right thing to do (probably) is to calculate hashes recursively for each directory, like this:

import os
import hashlib

def sha1OfFile(filepath):
    sha = hashlib.sha1()
    with open(filepath, 'rb') as f:
        while True:
            block = f.read(2**10) # read in 1 KiB blocks
            if not block: break
            sha.update(block)
        return sha.hexdigest()

def hash_dir(dir_path):
    hashes = []
    for path, dirs, files in os.walk(dir_path):
        for file in sorted(files): # we sort to guarantee that files will always go in the same order
            hashes.append(sha1OfFile(os.path.join(path, file)))
        for dir in sorted(dirs): # we sort to guarantee that dirs will always go in the same order
            hashes.append(hash_dir(os.path.join(path, dir)))
        break # we only need one iteration - to get files and dirs in current directory
    return hashlib.sha1(''.join(hashes).encode('utf-8')).hexdigest() # built-in hash() is randomized per process, so use a stable digest
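
For example, a minimal usage sketch (the path '.' is just an illustration):

print(hash_dir('.')) # one hex digest that fingerprints the whole tree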

The problem with hashing files only in the order os.walk yields them (as Markus does) is that you may get the same hash for different directory structures that contain the same files. For example, this directory's hash

main_dir_1:
    dir_1:
        file_1
        file_2
    dir_2:
        file_3

and this one's

main_dir_2:
    dir_1:
        file_1
    dir_2:
        file_2
        file_3

will be the same.
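
To see this concretely, here is a small self-contained check; flat_hash and make_tree are hypothetical helper names, and it assumes sha1OfFile and hash_dir from above are defined in the same script:

import os
import hashlib
import tempfile

def flat_hash(dir_path):
    # Naive aggregate: feed every file's bytes into one digest,
    # ignoring which subdirectory each file lives in.
    sha = hashlib.sha1()
    for path, dirs, files in os.walk(dir_path):
        dirs.sort() # keep the walk order deterministic
        for name in sorted(files):
            with open(os.path.join(path, name), 'rb') as f:
                sha.update(f.read())
    return sha.hexdigest()

def make_tree(root, layout):
    # Create files from a {relative_path: bytes} mapping.
    for rel, data in layout.items():
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, 'wb') as f:
            f.write(data)

with tempfile.TemporaryDirectory() as tmp:
    one = os.path.join(tmp, 'main_dir_1')
    two = os.path.join(tmp, 'main_dir_2')
    make_tree(one, {'dir_1/file_1': b'a', 'dir_1/file_2': b'b', 'dir_2/file_3': b'c'})
    make_tree(two, {'dir_1/file_1': b'a', 'dir_2/file_2': b'b', 'dir_2/file_3': b'c'})
    print(flat_hash(one) == flat_hash(two)) # True: the flat, structure-blind hash collides
    print(hash_dir(one) == hash_dir(two))   # False: the recursive hash tells them apart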


Another problem is that you need to guarantee that the files are always processed in the same order: if you concatenate the same two hashes in different orders and hash the resulting strings, you will get different results for the same directory structure.

Ilya Peterov

Just keep feeding the data into your sha object.

import os
import hashlib

def update_sha(filepath, sha):
    with open(filepath, 'rb') as f:
        while True:
            block = f.read(2**10) # read in 1 KiB blocks
            if not block:
                break
            sha.update(block)

sha = hashlib.sha1() # a single digest for the whole directory tree

for (path, dirs, files) in os.walk('.'):
    for file in files:
        fullpath = os.path.join(path, file)
        update_sha(fullpath, sha)

print(sha.hexdigest())

Or hash the concatenated hashes of the files.
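
A minimal sketch of that alternative, reusing update_sha from above (combined_hash is a hypothetical name):

import os
import hashlib

def combined_hash(root):
    # Hash each file separately, then feed the per-file hex digests
    # into one outer digest; assumes update_sha from the answer above.
    outer = hashlib.sha1()
    for path, dirs, files in os.walk(root):
        for name in files:
            inner = hashlib.sha1()
            update_sha(os.path.join(path, name), inner)
            outer.update(inner.hexdigest().encode('ascii'))
    return outer.hexdigest()

print(combined_hash('.'))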

Markus