
I am trying to do some row and column manipulations at the same time on date/time series data in Pandas. For plain data with no series, Python dictionaries work great, but Pandas is new to me.

Input files: N of them.

File1.csv, File2.csv, File3.csv, ..., Filen.csv

File1.csv        File2.csv        File3.csv
Ids,Date-time-1  Ids,Date-time-2  Ids,Date-time-1
56,4568          645,5545         25,54165
45,464           458,546

I am trying to merge the Date-time columns of all the files into one big data file, keyed on Ids:

Ids,Date-time-ref,Date-time-1,Date-time-2
56,100,4468,NaN
45,150,314,NaN
645,50,NaN,5495
458,200,NaN,346
25,250,53915,NaN
  1. Check for the date-time column. If it is not present, create it, then fill in the values per Ids by subtracting that Ids' Date-time-ref value from the current date-time value.

  2. Fill empty places with NaN; if a later file supplies a value for that cell, replace the NaN with the new value.

If it were a straight column subtraction it would be easy enough, but doing it in sync with the date-time series and keyed on Ids is a bit confusing.

I'd appreciate some suggestions to begin with. Thanks in advance.

pnuts
Sitz Blogz

1 Answer

Here is one way to do it.

import pandas as pd
import numpy as np
import io
from io import StringIO  # Python 3; on Python 2.x use: from StringIO import StringIO

# your csv file contents
csv_file1 = 'Ids,Date-time-1\n56,4568\n45,464\n'
csv_file2 = 'Ids,Date-time-2\n645,5545\n458,546\n'
# add a duplicated Ids record for testing purpose
csv_file3 = 'Ids,Date-time-1\n25,54165\n645, 4354\n'
csv_file_all = [csv_file1, csv_file2, csv_file3]

# read each csv into a DataFrame using a list comprehension
# I use in-memory buffers here; replace StringIO(...) with your file paths
df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]
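With real files on disk (as asked about in the comments below), the same list comprehension works over a glob pattern instead of string buffers. A minimal sketch, using a temporary directory as a stand-in for your real data path:

```python
import glob
import os
import tempfile

import pandas as pd

# stand-in for the real data directory: write two small CSVs to a temp folder
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, 'File1.csv'), 'w') as f:
    f.write('Ids,Date-time-1\n56,4568\n45,464\n')
with open(os.path.join(data_dir, 'File2.csv'), 'w') as f:
    f.write('Ids,Date-time-2\n645,5545\n458,546\n')

# collect every matching CSV in a stable order and read each into a DataFrame
paths = sorted(glob.glob(os.path.join(data_dir, 'File*.csv')))
df_all = [pd.read_csv(p) for p in paths]
```

Since new files arrive over time, rerunning the glob picks them up automatically; only the pattern needs to match.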


# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')


  Out[206]: 
        Date-time-1  Date-time-2
   Ids                          
   56          4568          NaN
   45           464          NaN
   645          NaN         5545
   458          NaN          546
   25         54165          NaN
   645         4354          NaN

# custom function to merge duplicated Ids (axis=0):
# forward-fill the NaNs within the group, then keep the last row
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]

# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)


  Out[207]: 
        Date-time-1  Date-time-2
   Ids                          
   25         54165          NaN
   45           464          NaN
   56          4568          NaN
   458          NaN          546
   645         4354         5545
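Assuming each Ids carries at most one non-null value per column (as in this data), `groupby(level='Ids').first()` gives the same collapse more directly; note it keeps the *first* non-null entry per column, whereas the ffill approach keeps the last:

```python
import numpy as np
import pandas as pd

# rebuild the concatenated frame from above, duplicated Ids 645 included
merged = pd.DataFrame(
    {'Date-time-1': [4568, 464, np.nan, np.nan, 54165, 4354],
     'Date-time-2': [np.nan, np.nan, 5545, 546, np.nan, np.nan]},
    index=pd.Index([56, 45, 645, 458, 25, 645], name='Ids'))

# first() takes the first non-null value per column within each group,
# merging the two 645 rows into one
merged_unique = merged.groupby(level='Ids').first()
```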

# do the subtraction
master_csv_file = 'Ids,Date-time-ref\n56,100\n45,150\n645,50\n458,200\n25,250\n'
df_master = pd.read_csv(StringIO(master_csv_file), index_col=['Ids']).sort_index()

# select matching records and horizontal concat
df_matched = pd.concat([df_master, merged_unique.reindex(df_master.index)], axis=1)

# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)



   Out[208]: 
        Date-time-ref  Date-time-1  Date-time-2
   Ids                                         
   25             250        53915          NaN
   45             150          314          NaN
   56             100         4468          NaN
   458            200          NaN          346
   645             50         4304         5495
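The broadcasting step on its own, as a minimal check of `sub(..., axis=0)` (subtract the first column from every other column, row by row; the two-row frame here is just illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'Date-time-ref': [100, 50],
                   'Date-time-1': [4568.0, 4354.0]},
                  index=pd.Index([56, 645], name='Ids'))

# axis=0 aligns the Series of ref values with the rows of the sliced frame
df.iloc[:, 1:] = df.iloc[:, 1:].sub(df.iloc[:, 0], axis=0)
```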
Jianxun Li
  • Thank you .. But I have a small request. The files are in another path and are many which keep increasing at certain decided time period and the master file is in another path. Could you pls fix that. – Sitz Blogz Jun 28 '15 at 10:46
  • @SitzBlogz. I don't quite understand. You can just replace those buffer `io.StringIO()` with your `file path` and it should work. – Jianxun Li Jun 28 '15 at 10:51
  • Thank you .. I am rewrite this into classes and functions mode – Sitz Blogz Jun 30 '15 at 09:20
  • I am sorry but this does not work for me. Their could be many reasons for that other than any mistake in program. I a m trying to find them. Will update .. Any how I did accept your answer. – Sitz Blogz Jun 30 '15 at 17:33
  • FYI I am using Python 2.7 and Pandas 0.16 – Sitz Blogz Jun 30 '15 at 19:28
  • @SitzBlogz I just rerun the script and it works fine on my PC. I use Python 3.4 with pandas 0.16.2. The problem could be the `import io` part because that part is different in py 2.x and 3.x – Jianxun Li Jun 30 '15 at 19:36
  • @SitzBlogz I've just modified the `io.StringIO` part and now the above script should work on your python version. Let me know if this is not the case. – Jianxun Li Jun 30 '15 at 19:39
  • The same program only with change in path of files : I get this error ` File "index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3786) File "index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3664) File "hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11943) File "hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11896) KeyError: 'Ids'` – Sitz Blogz Jul 01 '15 at 02:19
  • Please could you help fix this program. Would be a huge help. Thanks !! – Sitz Blogz Jul 01 '15 at 18:26
  • @SitzBlogz I really want to help, but I cannot reproduce the error message you saw... The code works fine on my PC. Is the data proprietary? Is it possible for you to let me get access to the real data file, for example, a dropbox share link of your data? – Jianxun Li Jul 01 '15 at 18:29
  • I have posted it as new question for a much clear view of what is the problem. http://stackoverflow.com/questions/31174982/sub-value-and-add-new-column-pandas – Sitz Blogz Jul 02 '15 at 03:10