2

I'm using odo from the blaze project to merge multiple pandas hdfstore tables following the suggestion in this question: Concatenate two big pandas.HDFStore HDF5 files

The stores have identical columns and non-overlapping indicies by design and a few million rows. The individual files may fit into memory but the total combined file probably will not.

Is there a way I can preserve the settings the hdfstore was created with? I loose the data columns and compression settings.

I tried odo(part, whole, datacolumns=['col1','col2']) without luck.

Alternatively, any suggestions for alternative methods would be appreciated. I could of course do this manually but then I have to manage the chunksizing in order to not run out of memory.

Community
  • 1
  • 1
Kyle
  • 2,814
  • 2
  • 17
  • 30

1 Answers1

2

odo doesn't support propogation of compression and/or data_columns ATM. Both are pretty easy to add, I created an issue here

You can do this in pandas this way:

In [1]: df1 = DataFrame({'A' : np.arange(5), 'B' : np.random.randn(5)})

In [2]: df2 = DataFrame({'A' : np.arange(5)+10, 'B' : np.random.randn(5)})

In [3]: df1.to_hdf('test1.h5','df',mode='w',format='table',data_columns=['A'])

In [4]: df2.to_hdf('test2.h5','df',mode='w',format='table',data_columns=['A'])

Iterate over the input files. Chunk read/write to the final store. Note that you have to specify the data_columns here as well.

In [7]: for f in ['test1.h5','test2.h5']:
   ...:     for df in pd.read_hdf(f,'df',chunksize=2):
   ...:         df.to_hdf('test3.h5','df',format='table',data_columns=['A'])
   ...:         

In [8]: with pd.HDFStore('test3.h5') as store:
    print store
   ...:     
<class 'pandas.io.pytables.HDFStore'>
File path: test3.h5
/df            frame_table  (typ->appendable,nrows->1,ncols->2,indexers->[index],dc->[A])
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • I'm scratching my head here. Logically this should work but the end result for me (copy pasting your code) is a store with a single row. In your output it looks like there is a single row also? `nrows->1`. Am I missing something? I tried with an explicit mode='a' although it's the default and got the same result. – KobeJohn Mar 04 '16 at 04:40
  • Ok here it is. I believe you have to add `append=True` to the final `df.to_hdf(...)` since it seems to have an independent setting from the standard append file mode. In your output, there is only one row `nrows->1` and if I add the append option, it comes out right to 10 rows. Thanks for the method! – KobeJohn Mar 04 '16 at 04:52