I am trying to perform some simple mathematical operations on the files.
The columns in below file_1.csv
are dynamic in nature the number of columns will increased from time to time. So we cannot have fixed last_column
master_ids.csv
: Before any pre-processing
Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345
master_count.csv
: Before any processing
Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
master_Ids.csv
: after one pre-processing
Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500
master_count.csv
: expected Output (Append/merge)
Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,750
Eg: Ids: 1234
appears 2
times so the value of ids:1234
at current time (00:30:00)
is 500
which is to be divided by count of ids
occurrence and then add to the corresponding values from ref1
and create a new column with the current time.
master_Ids.csv
: After another pre-processing
Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
master_count.csv
: expected output after another execution (Merge/append)
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,750,600
So here current time
is 00:45:00
, and we divide the current time value
by the count
of ids
occurrences, and then add
to the corresponding ref1
values by creating an new column with new current time
.
Program: By Jianxun Li
import pandas as pd
import numpy as np
csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'
df1 = pd.read_csv(csv_file1).set_index('Ids')
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()
# df1 and df2 has a duplicated column 00:00:00, use df1 without 1st column
temp = df2.join(df1.iloc[:, 1:])
# do the division by number of occurence of each Ids
# and add column any time series
def my_func(group):
num_obs = len(group)
# process with column name after next timeseries (inclusive)
group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
return group
result = temp.groupby(level='Ids').apply(my_func)
The program executes with no errors and no output. Need some fixing suggestions please.