
I have a DataFrame, which is already sorted by the columns ('year', 'month'), that looks like this:

df = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021, 2021, 2021, 2021],
    'month': [1, 2, 5, 2, 4, 7, 9],
    'values': [
        ['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c'],
        ['A', 'B', 'C', 'D'], ['A', 'B', 'C', 'D'], ['A', 'B', 'C', 'D'], ['A', 'B', 'C', 'D']
    ]
})

print(df)

    year    month   values
0   2020    1       ['a', 'b', 'c']
1   2020    2       ['a', 'b', 'c']
2   2020    5       ['a', 'b', 'c']
3   2021    2       ['A', 'B', 'C', 'D']
4   2021    4       ['A', 'B', 'C', 'D']
5   2021    7       ['A', 'B', 'C', 'D']
6   2021    9       ['A', 'B', 'C', 'D']

I want to create a new column named 'value', which contains the value of the i-th element on the 'values' array, where i is the index of the corresponding month, grouped by year. In this case, the outcome would be:

    year    month   values                  value
0   2020    1       ['a', 'b', 'c']         'a'
1   2020    2       ['a', 'b', 'c']         'b'
2   2020    5       ['a', 'b', 'c']         'c'
3   2021    2       ['A', 'B', 'C', 'D']    'A'
4   2021    4       ['A', 'B', 'C', 'D']    'B'
5   2021    7       ['A', 'B', 'C', 'D']    'C'
6   2021    9       ['A', 'B', 'C', 'D']    'D'

I'm assuming no data is missing from the arrays. I've tried using .groupby('year') followed by .get_loc('month'), but couldn't get the correct result so far.

EDIT:

There is one detail I forgot to mention: The months are not necessarily in a uniform range, therefore the index is not always month-1. I've edited the DataFrame in question to reflect this nuance.

SeaBean
L. B.

3 Answers


List comprehension (this assumes the index is month-1, as in the original version of the question):

df.assign(value=[v[m-1] for v, m in zip(df['values'], df['month'])])

   year  month        values value
0  2020      1     [a, b, c]     a
1  2020      2     [a, b, c]     b
2  2020      3     [a, b, c]     c
3  2021      1  [A, B, C, D]     A
4  2021      2  [A, B, C, D]     B
5  2021      3  [A, B, C, D]     C
6  2021      4  [A, B, C, D]     D

Alternatively, use map with list.__getitem__:

df.assign(value=[*map(list.__getitem__, df['values'], df['month'] - 1)])
piRSquared

Let us try lookup

df['out'] = pd.DataFrame(df['values'].tolist(),index=df.index).lookup(df.index,df.month-1)
df
Out[372]: 
   year  month        values out
0  2020      1     [a, b, c]   a
1  2020      2     [a, b, c]   b
2  2020      3     [a, b, c]   c
3  2021      1  [A, B, C, D]   A
4  2021      2  [A, B, C, D]   B
5  2021      3  [A, B, C, D]   C
6  2021      4  [A, B, C, D]   D

For future readers: DataFrame.lookup was deprecated in pandas 1.2.0 and removed in pandas 2.0, so on newer versions use NumPy indexing instead:

df['out'] = pd.DataFrame(df['values'].tolist(),index=df.index).values[df.index,df.month-1]
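
The NumPy-indexing replacement can be sketched as a self-contained snippet. This is only a minimal sketch, assuming the original version of the question's data, where the wanted position is month - 1 (with the edited data, month 5 on a 3-element list is no longer a valid position). Using np.arange(len(df)) for the row positions instead of df.index is a swapped-in choice, so that it still works when the index is not 0..n-1; the names wide, rows and cols are just illustrative.

```python
import numpy as np
import pandas as pd

# Original-question data, where the wanted position is month - 1
df = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021, 2021, 2021, 2021],
    'month': [1, 2, 3, 1, 2, 3, 4],
    'values': [['a', 'b', 'c']] * 3 + [['A', 'B', 'C', 'D']] * 4,
})

# Expand the lists into a 2-D object array (shorter lists are padded)
wide = pd.DataFrame(df['values'].tolist(), index=df.index).to_numpy()

# Pick one element per row: row i, column month_i - 1
rows = np.arange(len(df))          # positional row numbers (assumption: safer than df.index)
cols = df['month'].to_numpy() - 1  # zero-based column positions
df['out'] = wide[rows, cols]

print(df)
```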
BENY

Use df.apply() on axis=1:

df['value'] = df.apply(lambda x: x['values'][x['month']-1], axis=1)

Alternatively, use list(map()), which typically runs faster than df.apply() on large data:

df['value'] = list(map((lambda x, y: x[y]), df['values'], df['month'] - 1))

Output:

   year  month        values value
0  2020      1     [a, b, c]     a
1  2020      2     [a, b, c]     b
2  2020      3     [a, b, c]     c
3  2021      1  [A, B, C, D]     A
4  2021      2  [A, B, C, D]     B
5  2021      3  [A, B, C, D]     C
6  2021      4  [A, B, C, D]     D
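
If you want to check the speed difference on your own machine, something like the following can be used. It is only a rough timing sketch: the 3000-row frame and the helper names with_apply and with_map are made up for illustration, and the actual numbers will depend on your pandas version and hardware.

```python
import timeit
import pandas as pd

# Hypothetical larger frame, built only for timing purposes
df = pd.DataFrame({
    'month': [1, 2, 3] * 1000,
    'values': [['a', 'b', 'c']] * 3000,
})

def with_apply():
    # Row-wise apply: builds a Series object for every row
    return df.apply(lambda x: x['values'][x['month'] - 1], axis=1).tolist()

def with_map():
    # Plain Python map over the two columns in parallel
    return list(map(lambda x, y: x[y], df['values'], df['month'] - 1))

# Both approaches should give the same result
assert with_apply() == with_map()

print('apply:', timeit.timeit(with_apply, number=3))
print('map:  ', timeit.timeit(with_map, number=3))
```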

Edit (following OP's edit that "the index is not always month-1"):

OP edited the question after this answer was posted, stating that:

The months are not necessarily in a uniform range, therefore the index is not always month-1.

The following requirement from the original version of the question still holds:

I want to create a new column named 'value', which contains the value of the i-th element on the 'values' array, where i is the index of the corresponding month, grouped by year.

The solution is therefore amended as follows:

Step 1: Create the index of the corresponding month within each year group

Set up column value_idx using df.groupby together with Series.transform. Inside the transform, x.index - x.index[0] gives the relative position of each row within its group. This relies on the DataFrame having a default, consecutive integer index and on each year's rows being contiguous.

df['value_idx'] = df.groupby('year')['values'].transform(lambda x: x.index - x.index[0])

print(df)

   year  month        values  value_idx
0  2020      1     [a, b, c]          0
1  2020      2     [a, b, c]          1
2  2020      5     [a, b, c]          2
3  2021      2  [A, B, C, D]          0
4  2021      4  [A, B, C, D]          1
5  2021      7  [A, B, C, D]          2
6  2021      9  [A, B, C, D]          3

The values in column value_idx do not depend on the values of month; they are the relative positions of the rows within their respective year groups.
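
As a possible alternative for this step, pandas also has GroupBy.cumcount(), which numbers the rows of each group from 0 in their current order and does not rely on the index labels. A minimal sketch on the DataFrame from the question, assuming it is already sorted by ('year', 'month') as stated there:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021, 2021, 2021, 2021],
    'month': [1, 2, 5, 2, 4, 7, 9],
    'values': [['a', 'b', 'c']] * 3 + [['A', 'B', 'C', 'D']] * 4,
})

# cumcount numbers the rows of each year group 0, 1, 2, ... in row order
df['value_idx'] = df.groupby('year').cumcount()

print(df)
```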

Step 2: Set up column value from the position stored in column value_idx (not from the month values)

Set up column value with either df.apply() or list(map()):

Either:

df['value'] = df.apply(lambda x: x['values'][x['value_idx']], axis=1)

or:

df['value'] = list(map((lambda x, y: x[y]), df['values'], df['value_idx']))

print(df)

Output:

   year  month        values  value_idx value
0  2020      1     [a, b, c]          0     a
1  2020      2     [a, b, c]          1     b
2  2020      5     [a, b, c]          2     c
3  2021      2  [A, B, C, D]          0     A
4  2021      4  [A, B, C, D]          1     B
5  2021      7  [A, B, C, D]          2     C
6  2021      9  [A, B, C, D]          3     D

Optionally, you can clean up the column value_idx if desired:

df = df.drop(columns='value_idx')

print(df)

Output:

   year  month        values value
0  2020      1     [a, b, c]     a
1  2020      2     [a, b, c]     b
2  2020      5     [a, b, c]     c
3  2021      2  [A, B, C, D]     A
4  2021      4  [A, B, C, D]     B
5  2021      7  [A, B, C, D]     C
6  2021      9  [A, B, C, D]     D
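
If the rows were not already sorted by ('year', 'month'), one way to get the same position is to rank the months within each year instead of counting rows. This is only a sketch under that assumption, not part of the solution above: the shuffled DataFrame is made up for illustration, and it assumes each (year, month) pair appears once.

```python
import pandas as pd

# Hypothetical shuffled version of the question's data
years = [2021, 2020, 2021, 2020, 2021, 2020, 2021]
df = pd.DataFrame({
    'year': years,
    'month': [7, 5, 2, 1, 9, 2, 4],
    'values': [['a', 'b', 'c'] if y == 2020 else ['A', 'B', 'C', 'D']
               for y in years],
})

# Rank of each month within its year (1-based), shifted to a 0-based position
idx = df.groupby('year')['month'].rank(method='first').astype(int) - 1

# Pick the element at that position from each row's list
df['value'] = [v[i] for v, i in zip(df['values'], idx)]

print(df)
```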
SeaBean