2

How do I find second and third largest values from multiple columns? It's clear how to find max value, min and median, however I cannot extract the second and third largest values as new columns?

import pandas as pd

df = pd.read_csv(...)

df['max'] = df[["A1", "B1", "C1", "D1", "E1", "F1"]].max(axis=1)

df['min'] = df[["A1", "B1", "C1", "D1", "E1", "F1"]].min(axis=1)

# ? 
df['2nd_largest'] = df[["A1", "B1", "C1", "D1", "E1", "F1"]]
Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
g123456k
  • 301
  • 1
  • 8
  • FYI: a column name like `max` is dangerous to some extent since you might go for `.` notation, e.g., `df.max` but it'll give you the method, not the column! Same goes for `min`. – Mustafa Aydın Jan 27 '22 at 09:26

3 Answers3

4

To find the second largest values of each row, you can use nlargest; apply a function to each row:

df['2nd_largest'] = df[["A1", "B1", "C1", "D1", "E1", "F1"]].apply(lambda row: row.nlargest(2).iat[-1], axis=1)
2

If you values are in a list you can sort the values like:

df.sort()
 

and you can find the 2nd highest value like:

df[-2]

And alternative is a function I found on Get the second largest number in a list in linear time

def second_largest(numbers):
    count = 0
    m1 = m2 = float('-inf')
    for x in numbers:
        count += 1
        if x > m2:
            if x >= m1:
                m1, m2 = x, m1            
            else:
                m2 = x
    return m2 if count >= 2 else None
Kylian
  • 319
  • 2
  • 14
  • 1
    Question is about a pandas DataFrame (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), not Python lists :) – Mustafa Aydın Jan 27 '22 at 09:21
2

I think it's fine enough to sort the dataframe once (wrt rows) and then extract those information. But we need to do the sorting in NumPy domain as it doesn't care about the column labels whilst sorting and so allows "independent" sorting of the rows:

# sort the subset frame over axis 1, i.e., each row is independently sorted
cols_subset = ["A1", "B1", "C1", "D1", "E1", "F1"]
sorted_values = np.sort(df[cols_subset], axis=1)

# extract information
df["maximum"] = sorted_values[:, -1]  # since ascending sort, max is at the end
df["minimum"] = sorted_values[:, 0]
df["2nd largest"] = sorted_values[:, -2]
df["3rd largest"] = sorted_values[:, -3]

(The index is kept intact during sorting, so we don't worry about alignment issues when assigning.)

An example:

>>> np.random.seed(224)
>>> df = pd.DataFrame(np.random.randint(-5, 22, size=(6, 4)), columns=[*"ABDE"])
>>> df
    A   B   D   E
0   2   9  21   8
1  15   8  10  -2
2   0  18  -3   3
3   2  -3   6  -1
4  -4  17   9  10
5   8  -4  12  16

>>> cols_subset = ["A", "B", "E"]
>>> sorted_values = np.sort(df[cols_subset], axis=1)

# each row sorted
>>> sorted_values 
array([[ 2,  8,  9],
       [-2,  8, 15],
       [ 0,  3, 18],
       [-3, -1,  2],
       [-4, 10, 17],
       [-4,  8, 16]])

# after above assignments for maximum, minimum, ...
>>> df
    A   B   D   E  maximum  minimum  2nd largest  3rd largest
0   2   9  21   8        9        2            8            2
1  15   8  10  -2       15       -2            8           -2
2   0  18  -3   3       18        0            3            0
3   2  -3   6  -1        2       -3           -1           -3
4  -4  17   9  10       17       -4           10           -4
5   8  -4  12  16       16       -4            8           -4
# `"minimum"` and `"3rd largest"` coincide here since `cols_subset` is of length 3
Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38