
I am trying to perform some simple mathematical operations on the files below.

The columns in master_ids.csv are dynamic in nature: the number of columns will increase from time to time, so we cannot assume a fixed last column.

master_ids.csv : Before any pre-processing

Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345

master_count.csv : Before any processing

Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300

master_ids.csv : After one pre-processing

Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500

master_count.csv: expected Output (Append/merge)

Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,750

E.g., Ids 1234 appears 2 times, so the value for Ids 1234 at the current time (00:30:00), which is 500, is divided by the count of occurrences (500 / 2 = 250) and then added to each corresponding ref1 value, creating a new column named after the current time.

master_ids.csv : After another pre-processing

Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600

master_count.csv: expected output after another execution (Merge/append)

Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,750,600

So here the current time is 00:45:00: we divide the current-time value by the count of Ids occurrences, then add it to the corresponding ref1 values, creating a new column for the new current time.

Program (by Jianxun Li):

import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 share a duplicated column; use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

# do the division by the number of occurrences of each Ids
# and add the result back to the base column for every time-series column
def my_func(group):
    num_obs = len(group)
    # process the time-series columns (from the 5th column onward, inclusive)
    group.iloc[:, 4:] = (group.iloc[:, 4:] / num_obs).add(group.iloc[:, 3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)

The program executes with no errors, but produces no output. I need some suggestions for fixing it, please.
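As the comment thread later confirms, the script computes `result` but never displays or saves it, which is why it appears to produce no output. A minimal sketch of the missing final step (using a stand-in DataFrame and a temporary path, since the real paths above are environment-specific):

```python
import os
import tempfile

import pandas as pd

# stand-in for the `result` DataFrame computed by the groupby/apply above
result = pd.DataFrame({'Ids': [1234, 8435],
                       '00:30:00': [750, 550]}).set_index('Ids')

# a script shows nothing on screen unless it prints...
print(result)

# ...and the csv on disk never changes unless the result is written back
out_path = os.path.join(tempfile.gettempdir(), 'master_count.csv')
result.to_csv(out_path)
```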

Sitz Blogz
  • I'd say you should consider restructuring your data. Instead of adding a new column for each "preprocessing" step, give your data a fixed number of columns, one of which includes the time information you are currently using as the headers of the new columns. That is, a `current_time` column, and a bunch of rows with "00:30:00" in that column, then a bunch of rows with "00:45:00" in that column, etc. – BrenBarn Jul 11 '15 at 18:04
  • @BrenBarn I cannot go for restructuring because I need the old time-series counts too, for future plotting purposes. – Sitz Blogz Jul 12 '15 at 03:09
  • Not sure what you mean. The changes I am describing will not result in any loss of information, just a different format. – BrenBarn Jul 12 '15 at 03:21
  • @BrenBarn Would it be possible for you to show this in program and output format, so the confusion is cleared up? – Sitz Blogz Jul 12 '15 at 03:45
  • @JohnE If you have time can you also pls help me with this merging of two programs or write a new program that will call the two programs after a scheduled time say every 5 mins http://stackoverflow.com/questions/31314792/how-to-merge-two-programs-with-scheduled-execution – Sitz Blogz Jul 12 '15 at 05:46
  • I added an update to the answer, check it out. Also, can you check your expected output for the China row? I think I'm getting the expected results for all rows except that one. – JohnE Jul 12 '15 at 15:05
  • @SitzBlogz: I added an answer showing what I mean. – BrenBarn Jul 13 '15 at 05:08
  • @SitzBlogz I use the raw data file you provided at https://www.dropbox.com/s/ak1uw5in5xq2ayq/stack_flow_bundle.zip?dl=0, and the code I wrote seems to work as expected. By `no answer no error`, do you mean that `print(result)` shows an empty dataframe? – Jianxun Li Jul 13 '15 at 11:10
  • @JianxunLi I think he wants an explicit csv file output (see the comments to my answer) – JohnE Jul 13 '15 at 12:59
  • When I wrote a print statement it was the same: no error and no answer. Then I suspected I had written the wrong syntax, so I asked John to also include the csv file output, and I was indeed wrong with my print/append syntax. Thank you so much, both of you, for all the help. – Sitz Blogz Jul 13 '15 at 16:12

3 Answers


This program assumes updating of both master_counts.csv and master_ids.csv over time and should be robust to the timing of the updates. That is, it should produce correct results if run multiple times on the same update or if an update is missed.

import pandas as pd

# this program updates (and replaces) the original master_counts.csv with data
# in master_ids.csv, so we only want the first 5 columns when we read it in
master_counts = pd.read_csv('master_counts.csv').iloc[:, :5]

# this file is assumed to be periodically updated with the addition of new columns
master_ids = pd.read_csv('master_ids.csv')

for i in range( 2, len(master_ids.columns) ):
    master_counts = master_counts.merge( master_ids.iloc[:,[0,i]], on='Ids' )
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    master_counts.iloc[:,-1] = master_counts['ref1'] + master_counts.iloc[:,-1]/count

master_counts.to_csv('master_counts.csv',index=False)

%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
JohnE
  • Please can you give a little clearer code, for understanding? – Sitz Blogz Jul 01 '15 at 21:56
  • Please check the description in the Edit section of the question. – Sitz Blogz Jul 02 '15 at 02:03
  • The first program cannot be used, as the `master_ids.csv` does not change; it only keeps on appending. The updated version just runs with no errors and no output at the same time. – Sitz Blogz Jul 13 '15 at 03:49
  • I am checking the `master_count.csv` and I don't see any update in it. The output needs to be appended to the `master_count.csv`. – Sitz Blogz Jul 13 '15 at 04:10
  • @SitzBlogz OK, I explicitly added the csv output. That's the easiest part, I assumed you already knew how to do that. – JohnE Jul 13 '15 at 04:16
  • I have worked with pure Python and dicts, but being new to pandas I am facing trouble with this. It would be a huge help if you have free time to look at this: http://stackoverflow.com/questions/31201986/date-time-series-wise-grouping-of-data-and-distribution/31369106#31369106 – Sitz Blogz Jul 13 '15 at 04:32
import pandas as pd
import numpy as np

csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/lat_lon_master.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

Out[53]: 
      00:00:00  00:30:00  00:45:00
Ids                               
1234      1000       500       100
8435      5243       300       200
2341       563       400       400
7352       345       500       600

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

Out[81]: 
         Name   lat   lon  00:00:00
Ids                                
1234   London  40.4  10.1       500
1234   Prague  40.4  10.1       500
2341  NewYork  60.6  30.3       700
2341  Austria  60.6  30.3       700
7352    Japan  70.7  80.8       500
7352    China  70.7  80.8       500
8435    Paris  50.5  20.2       400
8435   Berlin  50.5  20.2       400

# df1 and df2 have a duplicated column 00:00:00; use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])



Out[55]: 
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids                                                    
1234   London  40.4  10.1       500       500       100
1234   Prague  40.4  10.1       500       500       100
2341  NewYork  60.6  30.3       700       400       400
2341  Austria  60.6  30.3       700       400       400
7352    Japan  70.7  80.8       500       500       600
7352    China  70.7  80.8       500       500       600
8435    Paris  50.5  20.2       400       300       200
8435   Berlin  50.5  20.2       400       300       200

# do the division by the number of occurrences of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process with column name after 00:30:00 (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group



result = temp.groupby(level='Ids').apply(my_func)

Out[104]: 
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids                                                    
1234   London  40.4  10.1       500       750       550
1234   Prague  40.4  10.1       500       750       550
2341  NewYork  60.6  30.3       700       900       900
2341  Austria  60.6  30.3       700       900       900
7352    Japan  70.7  80.8       500       750       800
7352    China  70.7  80.8       500       750       800
8435    Paris  50.5  20.2       400       550       500
8435   Berlin  50.5  20.2       400       550       500
Jianxun Li

My suggestion is to reformat your data so that it's like this:

Ids,ref0,current_time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None

Then after your "first preprocess" it will become like this:

Ids,ref0,current_time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
1234,1000,00:30:00,500
8435,5243,00:30:00,300
2341,563,00:30:00,400
7352,345,00:30:00,500

. . . and so on. The idea is that you should make a single column to hold the time information, and then for each preprocess, insert the new data into new rows, and give those rows a value in the time column indicating what time period they come from. You may or may not want to keep the initial rows with "None" in this table; maybe you just want to start with the "00:30:00" values and keep the "master ids" in a separate file.

I haven't totally followed exactly how you're computing the new ref1 values, but the point is that doing this is likely to greatly simplify your life. In general, instead of adding an unbounded number of new columns, it can be much nicer to add a single new column whose values will then be the values you were going to use as headers for the open-ended new columns.
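A minimal sketch of the reshaping described in this answer, using `DataFrame.melt` (the wide frame is built inline from the question's sample data; column names follow the suggested schema):

```python
import pandas as pd

# wide layout: one new column appended per run, as in the question
wide = pd.DataFrame({'Ids': [1234, 8435],
                     'ref0': [1000, 5243],
                     '00:30:00': [500, 300],
                     '00:45:00': [100, 200]})

# long layout: a fixed schema with a single current_time column,
# one row per (Ids, time) pair instead of one column per time
long_fmt = wide.melt(id_vars=['Ids', 'ref0'],
                     var_name='current_time', value_name='ref1')
print(long_fmt)
```

Each new preprocessing run then appends rows with a new `current_time` value instead of widening the table, so no information is lost.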

BrenBarn