
I am trying to read a few files from a path, as an extension to my previous question. The answer given by Jianxun definitely makes sense, but I am getting a KeyError. I am very new to pandas and not able to fix the error.

Note: I use Python 2.7 and Pandas 0.16

File_1.csv 

Ids,12:00:00
2341,9865
7352,8969

File_2.csv

Ids,12:45:00
1234,9865
8435,8969

Master.csv

Ids,00:00:00,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600

Program:

import pandas as pd
import numpy as np
from StringIO import StringIO

# your csv file contents
csv_file1 = 'Path/Transition_Data/Test_1.csv '
csv_file2 = 'Path/Transition_Data/Test_2.csv '
csv_file_all = [csv_file1, csv_file2]

# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path

df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]


# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')


# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]

# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)

# do the subtraction
master_csv_file = 'Path/Data_repository/Master1_Test.csv'
df_master = pd.read_csv(io.StringIO(master_csv_file), index_col=['Ids']).sort_index()

# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)

# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)

Error:

Traceback (most recent call last):
  File "distribute_count.py", line 18, in <module>
    merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
  File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 2583, in set_index
    level = frame[col].values
  File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1787, in __getitem__
    return self._getitem_column(key)
  File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1794, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/lib/pymodules/python2.7/pandas/core/generic.py", line 1079, in _get_item_cache
    values = self._data.get(item)
  File "/usr/lib/pymodules/python2.7/pandas/core/internals.py", line 2843, in get
    loc = self.items.get_loc(item)
  File "/usr/lib/pymodules/python2.7/pandas/core/index.py", line 1437, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3786)
  File "index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3664)
  File "hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11943)
  File "hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11896)
KeyError: 'Ids'
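The `KeyError` arises because the code wraps the file *path* string in `StringIO`, so pandas parses the path text itself as CSV instead of opening the file. A minimal sketch of that failure mode (shown with Python 3's `from io import StringIO`; the question's Python 2.7 would use `from StringIO import StringIO`):

```python
import pandas as pd
from io import StringIO  # on Python 2.7: from StringIO import StringIO

path_string = 'Path/Transition_Data/Test_1.csv '

# StringIO turns the path text itself into the "file" contents,
# so read_csv sees a single header line and no data rows
df = pd.read_csv(StringIO(path_string))

print(df.columns.tolist())   # the path string becomes the only column
print('Ids' in df.columns)   # False, so set_index('Ids') raises KeyError
```

Passing the path directly to `pd.read_csv(csv_file)` (as in the accepted answer below) opens and parses the actual file.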
Sitz Blogz

1 Answer
import pandas as pd
import numpy as np

# your csv file paths
csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Transition_Data/Test_1.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Transition_Data/Test_2.csv'
master_csv_file = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file_all = [csv_file1, csv_file2]

# read each csv into a DataFrame using a list comprehension
# (pass the file path directly; no StringIO buffer needed)

df_all = [pd.read_csv(csv_file) for csv_file in csv_file_all]

# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')

# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]

# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)

# do the subtraction

df_master = pd.read_csv(master_csv_file, index_col=['Ids']).sort_index()

# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)

# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)

print(df_matched)

      00:00:00  00:30:00  00:45:00  12:00:00  12:45:00
Ids                                                   
1234      1000      -500      -900       NaN      8865
2341       563      -163      -163      9302       NaN
7352       345       155       255      8624       NaN
8435      5243     -4943     -5043       NaN      3726
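As the comments below suggest, a second `KeyError: 'Ids'` on the master file usually means the header row does not contain an `Ids` column (or carries stray whitespace). A quick sanity check worth running before `set_index`/`index_col` is sketched here; the inline CSV is a hypothetical stand-in for `Master1_Test.csv`:

```python
import pandas as pd
from io import StringIO

# hypothetical contents standing in for Master1_Test.csv
master_csv = """Ids,00:00:00,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
"""

df_master = pd.read_csv(StringIO(master_csv))

# confirm the expected key column really exists before indexing on it
assert 'Ids' in df_master.columns, df_master.columns.tolist()

# defensive: strip stray whitespace from headers that would break lookups
df_master.columns = df_master.columns.str.strip()

df_master = df_master.set_index('Ids')
print(df_master.index.tolist())  # [1234, 8435]
```

If the assertion fails, the printed column list shows what pandas actually parsed, which is usually enough to spot a missing or misspelled header.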
Jianxun Li
  • Still the same error : `File "distribute_lac.py", line 30, in df_master = pd.read_csv(master_csv_file, index_col=['Ids']).sort_index() File "/usr/lib/pymodules/python2.7/pandas/io/parsers.py", line 256, in _read return parser.read() File "/usr/lib/pymodules/python2.7/pandas/io/parsers.py", line 715, in read ret = self._engine.read(nrows) File "/usr/lib/pymodules/python2.7/pandas/io/parsers.py", line 1184, in read values = data.pop(self.index_col[i]) KeyError: 'Ids'` – Sitz Blogz Jul 02 '15 at 07:00
  • @SitzBlogz Can you post the first few rows of the original `Master1_Test.csv` file? it seems that the file has no 'Ids' column. – Jianxun Li Jul 02 '15 at 07:02
  • I am using the same sample files and I am getting the error with these only; I haven't tried with real data yet. – Sitz Blogz Jul 02 '15 at 07:19
  • 1
    @SitzBlogz Why are there three column headers `Ids, 00:00:00, 00:30:00` in the master csv, but 4 columns of values `1234,1000,500,100`? – Jianxun Li Jul 02 '15 at 07:21
  • Now I did edit the Master input file and I no longer get the error, but it does not show any results either. – Sitz Blogz Jul 02 '15 at 07:44
  • @SitzBlogz Could you please double check whether each file has been read successfully? for example, print out the first few lines of each `df` see whether the `df` looks normal. If not, please post what you saw. – Jianxun Li Jul 02 '15 at 07:46
  • Please find the Dropbox download link for the files and program: https://www.dropbox.com/s/ak1uw5in5xq2ayq/stack_flow_bundle.zip?dl=0 – Sitz Blogz Jul 02 '15 at 10:12
  • The data sets are for two programs the current one and the one from this link. http://stackoverflow.com/questions/31168789/pandas-column-mathematical-operations – Sitz Blogz Jul 02 '15 at 10:13
  • @SitzBlogz I've replaced the code with the real file path and it works on my PC. Give a check and see whether it raises any error on your side. – Jianxun Li Jul 02 '15 at 10:23
  • Finally! Awesome, thank you so much. So now, if I write a scheduled program to auto-run this, will those NaN values be refilled? And if possible, please help with the second program also. – Sitz Blogz Jul 02 '15 at 10:42
  • @SitzBlogz I am certainly happy to help out. For the 2nd program, based on the files you sent, I didn't see the `Ids` column at all in any of the csv files. Could you please look into this issue and re-upload the files? – Jianxun Li Jul 02 '15 at 10:46
  • @SitzBlogz Also, the name `New York` in `lat_lon_2.csv` seems to have some whitespace between the two words. Is this present in your real data file? – Jianxun Li Jul 02 '15 at 10:55
  • For the second program, only the two data files in Data_repository are to be considered, not those from Transition_Data. – Sitz Blogz Jul 02 '15 at 11:25