21

I'm using the following code to extract a tar file:

import tarfile
tar = tarfile.open("sample.tar.gz")
tar.extractall()
tar.close()

However, I'd like to track the progress by seeing which file is being extracted at any given moment. How can I do this?

EXTRA BONUS POINTS: is it possible to compute a percentage for the extraction process as well? I'd like to use that to update a tkinter progress bar. Thanks!

FLX

7 Answers

13

Both file-progress and global progress:

import io
import os
import tarfile

def get_file_progress_file_object_class(on_progress):
    class FileProgressFileObject(tarfile.ExFileObject):
        def read(self, size, *args):
            on_progress(self.name, self.position, self.size)
            return tarfile.ExFileObject.read(self, size, *args)
    return FileProgressFileObject

class TestFileProgressFileObject(tarfile.ExFileObject):
    def read(self, size, *args):
        on_progress(self.name, self.position, self.size)
        return tarfile.ExFileObject.read(self, size, *args)

class ProgressFileObject(io.FileIO):
    def __init__(self, path, *args, **kwargs):
        self._total_size = os.path.getsize(path)
        io.FileIO.__init__(self, path, *args, **kwargs)

    def read(self, size):
        print("Overall process: %d of %d" %(self.tell(), self._total_size))
        return io.FileIO.read(self, size)

def on_progress(filename, position, total_size):
    print("%s: %d of %s" %(filename, position, total_size))

tarfile.TarFile.fileobject = get_file_progress_file_object_class(on_progress)
tar = tarfile.open(fileobj=ProgressFileObject("a.tgz"))
tar.extractall()
tar.close()
tokland
  • This is still monkeypatching. `:)` – Mike Graham Sep 09 '10 at 11:14
  • Thanks tokland, this works :) Any way of getting a float of the overall extraction process? – FLX Sep 09 '10 at 12:30
  • To be more specific, is there a way of getting the uncompressed size before starting the extraction process? – FLX Sep 09 '10 at 14:01
  • @Mike: is this considered to be monkeypatching? I assumed that, tarfile.TarFile being a "public" class (no _underscore) of the module and fileobject a "public" class attribute (again, no underscore), you can play safely with them. But I am not really familiar with Python policy in this regard. – tokland Sep 09 '10 at 15:52
  • @FLX. I am afraid that using the code above you cannot get the total percentage with byte granularity. You could have two progress bars: the overall progress (file granularity) and the current file progress (byte granularity). – tokland Sep 09 '10 at 15:55
  • @FLX. I edited the answer to add the overall progress code. I think it now covers it all. – tokland Sep 09 '10 at 16:28
  • @tokland, `TarFile.fileobject` is a typically-fixed piece of global state that you modify to change its behavior where you use it in your code (and end up modifying it for everyone else `=p`). If not monkeypatching, this is something close to it. The underscore convention is not the primary means to have internal attributes in Python, it is *documentation*. I doubt the decision to name it `fileobject` was because the implementor thought, "Oh, what a nice API for someone to replace this for their needs". If it was, I seriously doubt their object-oriented design skills. – Mike Graham Sep 10 '10 at 14:00
  • @Mike, yeah, that sounds reasonable. I'd just go with the code that creates a custom file object so as not to tweak the tarfile module. – tokland Sep 10 '10 at 18:48
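
On FLX's question about getting the uncompressed size before extraction starts: each member's header stores its uncompressed size (`TarInfo.size`), so one option is to sum those sizes first and then report a byte-weighted percentage after each file. The following is only a sketch under assumptions: `extract_with_progress`, `percent`, and the `on_progress(name, done_bytes, total_bytes)` callback are hypothetical names, not part of `tarfile`. Note that for a `.tar.gz`, `getmembers()` has to read through the whole archive once before extraction begins.

```python
import tarfile


def extract_with_progress(archive_path, dest_dir, on_progress):
    """Extract archive_path into dest_dir, reporting byte-weighted progress.

    on_progress(name, done_bytes, total_bytes) is a hypothetical callback
    that is called once after each member has been extracted.
    """
    # Newer Python versions support extraction filters; use the "data"
    # filter when it is available and fall back to no filter otherwise.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        # TarInfo.size is the uncompressed size recorded in each header.
        total = sum(member.size for member in members)
        done = 0
        for member in members:
            tar.extract(member, path=dest_dir, **extra)
            done += member.size
            on_progress(member.name, done, total)


def percent(done, total):
    """Return progress as a float from 0 to 100 (an empty archive counts as done)."""
    return 100.0 if total == 0 else 100.0 * done / total
```

A callback such as `lambda name, done, total: print(name, percent(done, total))` would print the current file and the overall percentage; a GUI could update its progress bar from the same callback.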
8

You can just use tqdm() and print the progress of the number of files being extracted:

import tarfile
from tqdm import tqdm

# open your tar.gz file
with tarfile.open(name=path) as tar:

    # Go over each member
    for member in tqdm(iterable=tar.getmembers(), total=len(tar.getmembers())):

        # Extract member
        tar.extract(member=member)
RoadRunner
7

You can pass a generator as the members parameter of extractall():

import tarfile

def track_progress(members):
    for member in members:
        # this will be the current file being extracted
        yield member

archive_path = "sample.tar.gz"  # the archive from the question
dest_dir = "."                  # hypothetical destination directory

with tarfile.open(archive_path, "r") as tarball:
    tarball.extractall(path=dest_dir, members=track_progress(tarball))

Each member is a TarInfo object; see the tarfile documentation for its available methods and attributes.

mingxiao
  • To fill this out, after `yield member` you can print out the name or update a progress bar. – Xiong Chiamiov Aug 23 '16 at 18:25
  • This seems like it shouldn't work - the members is an input to extractall, not an output? Am I missing something? – O'Rooney Feb 03 '20 at 03:53
  • @O'Rooney I'm late, but yes, that's why we yield them there. By default extractall would simply loop over the members; passing our own generator means we get access to each member in the middle of extracting. The downside is that the responsibility for making sure we don't miss any members now falls on us. – Yamirui Sep 20 '20 at 20:49
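
To make the comments above concrete, here is a minimal sketch of a generator that reports each member just before `extractall()` extracts it. The names `track_progress` (with an extra callback argument), `extract_all_tracked`, and the `on_member(index, count, member)` callback are hypothetical and only for illustration.

```python
import tarfile


def track_progress(tar, on_member):
    """Yield every member of an open TarFile, reporting each one first.

    on_member(index, count, member) is a hypothetical callback; index is 1-based.
    """
    members = tar.getmembers()
    count = len(members)
    for index, member in enumerate(members, start=1):
        # extractall() pulls members lazily, so this runs right before
        # the member is written to disk.
        on_member(index, count, member)
        yield member


def extract_all_tracked(archive_path, dest_dir, on_member):
    """Extract the whole archive while reporting per-file progress."""
    # Use the "data" extraction filter where the running Python supports it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=dest_dir,
                       members=track_progress(tar, on_member),
                       **extra)
```

Because the generator yields every member returned by `getmembers()`, nothing is skipped; a callback could print `member.name` or set a progress bar to `index / count`.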
3

You could use extract instead of extractall - you would be able to print the member names as they are being extracted. To get a list of members, you could use getmembers.

A textual progressbar library can be found here:

Tkinter snippet:

miku
  • Looking at the code "extractall" calls "extract", so there should be no speed penalization. – tokland Sep 08 '10 at 14:26
  • The documentation notes "The extract() method does not take care of several extraction issues. In most cases you should consider using the extractall() method.". Without knowing what those extraction issues are, I'm hesitant to just swap out `extractall` for `extract`. – Xiong Chiamiov Aug 23 '16 at 17:50
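
If pulling in a library for a textual progress bar is more than you need, a few lines of string formatting may be enough. This is only an illustrative sketch; `render_bar` is a hypothetical helper, not taken from any library.

```python
def render_bar(fraction, width=20):
    """Return a textual progress bar for a fraction between 0 and 1."""
    # Clamp so that slightly out-of-range inputs still give a sane bar.
    fraction = min(max(fraction, 0.0), 1.0)
    filled = int(round(fraction * width))
    bar = "#" * filled + "-" * (width - filled)
    return "[%s] %3d%%" % (bar, round(fraction * 100))
```

Inside a loop over `tar.getmembers()` you could print `render_bar(i / len(members))` after extracting the i-th member.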
2

There's a cool solution here that overrides the tarfile module as a drop-in replacement and lets you specify a callback to update.

https://github.com/thomaspurchas/tarfile-Progress-Reporter/

updated based on comment

  • That library is far from production ready, e.g. usage of unassigned variables when not passing a progress function... passing a path string to extractall fails, because it expects a tarinfo (although both options should be possible) – andsens Mar 09 '15 at 13:27
1

To see which file is currently being extracted, the following worked for me:

import tarfile

print("Extracting the contents of sample.tar.gz:")
tar = tarfile.open("sample.tar.gz")

for member_info in tar.getmembers():
    print("- extracting: " + member_info.name)
    tar.extract(member_info)

tar.close()
Locotes
0

This is what I use, without monkey patching or needing the number of entries.

import os
import tarfile

def iter_tar_files(path):
    total_bytes = os.stat(path).st_size
    with open(path, "rb") as file_obj, \
            tarfile.open(fileobj=file_obj, mode="r:gz") as tar:
        for member in tar.getmembers():
            extracted = tar.extractfile(member)
            if extracted is not None:
                content = extracted.read()
                yield member.path, content
            # file_obj.tell() is the current offset in the compressed file.
            # This prints something like: 512 / 1024 = 50.00%
            position = file_obj.tell()
            print(f"{position} / {total_bytes} = {position / total_bytes * 100:.2f}%")
felixh