The right thing to do (probably) is to calculate hashes recursively for each directory, like this:
import os
import hashlib

def sha1OfFile(filepath):
    sha = hashlib.sha1()
    with open(filepath, 'rb') as f:
        while True:
            block = f.read(2**10)  # Magic number: one-kilobyte blocks.
            if not block:
                break
            sha.update(block)
    return sha.hexdigest()

def hash_dir(dir_path):
    hashes = []
    for path, dirs, files in os.walk(dir_path):
        for file in sorted(files):  # we sort to guarantee that files will always go in the same order
            hashes.append(sha1OfFile(os.path.join(path, file)))
        for dir in sorted(dirs):  # we sort to guarantee that dirs will always go in the same order
            hashes.append(hash_dir(os.path.join(path, dir)))
        break  # we only need one iteration - to get files and dirs in the current directory
    # Hash the concatenated hashes with sha1 as well - the built-in hash() is
    # randomized per process, so it would not give a stable result across runs.
    return hashlib.sha1(''.join(hashes).encode()).hexdigest()
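A quick sanity check might look like this (the path below is just a placeholder, not something from the question):

# Hypothetical usage - replace the placeholder path with a real directory.
if __name__ == '__main__':
    print(hash_dir('/tmp/some_dir'))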
The problem with using only the files, in the order os.walk gives them to you (as Markus does), is that you may get the same hash for different directory structures that contain the same files. For example, this directory's hash
main_dir_1:
    dir_1:
        file_1
        file_2
    dir_2:
        file_3
and this one's
main_dir_2:
    dir_1:
        file_1
    dir_2:
        file_2
        file_3
will be the same.
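As a rough sketch of that collision (not part of the code above): the snippet below builds the two trees in a temporary directory, reuses sha1OfFile and hash_dir from above, and compares them against a naive flat_hash helper I made up that only concatenates file hashes in walk order. The flat hashes collide; the recursive ones do not.

import os
import hashlib
import tempfile

def flat_hash(dir_path):
    # Naive approach: hash every file and concatenate the digests in walk order,
    # ignoring which directory each file lives in.
    hashes = []
    for path, dirs, files in os.walk(dir_path):
        dirs.sort()  # make the walk order deterministic for the demo
        for file in sorted(files):
            hashes.append(sha1OfFile(os.path.join(path, file)))
    return hashlib.sha1(''.join(hashes).encode()).hexdigest()

root = tempfile.mkdtemp()
layouts = {
    'main_dir_1': {'dir_1': ['file_1', 'file_2'], 'dir_2': ['file_3']},
    'main_dir_2': {'dir_1': ['file_1'], 'dir_2': ['file_2', 'file_3']},
}
for main_dir, sub_dirs in layouts.items():
    for sub_dir, file_names in sub_dirs.items():
        os.makedirs(os.path.join(root, main_dir, sub_dir))
        for name in file_names:
            with open(os.path.join(root, main_dir, sub_dir, name), 'w') as f:
                f.write(name)  # same name -> same content in both trees

print(flat_hash(os.path.join(root, 'main_dir_1')) ==
      flat_hash(os.path.join(root, 'main_dir_2')))  # True: the flat hashes collide
print(hash_dir(os.path.join(root, 'main_dir_1')) ==
      hash_dir(os.path.join(root, 'main_dir_2')))   # False: the recursive hashes differ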
Another problem is that you need to guarantee that the order of the files is always the same: if you concatenate two hashes in different orders and hash the resulting strings, you will get different results for the same directory structure.
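A minimal illustration of that (the two strings just stand in for two file hashes):

import hashlib

a = 'aaa'  # stand-in for the first file's hash
b = 'bbb'  # stand-in for the second file's hash
# Concatenation order changes the input, so the resulting digests differ.
print(hashlib.sha1((a + b).encode()).hexdigest())
print(hashlib.sha1((b + a).encode()).hexdigest())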