Python dataframe drop rows which occur less frequently

Question

I have a data frame in which the same names occur repeatedly across rows. I want to delete the rows whose names occur less frequently. My data frame is very big, so I am giving only a small sample here.

dataframe:

df = 
         name     value
    0      A      10
    1      B      20
    2      A      30
    3      A      40
    4      C      50
    5      C      60
    6      D      70

In the above data frame, the B and D rows occur less often than the others: only once each. I want to delete/drop all rows whose name occurs fewer than 2 times.

My code:

##### Net strings
net_strs = df['name'].unique().tolist()
strng_list = df.group.unique().tolist()
tempdf = df.groupby('name').count()
##### strings that have less than 2 measurements in whole data set
lesstr = tempdf[tempdf['value']<2].index
##### Strings that have more than 2 measurements in whole data set
strng_list = np.setdiff1d(net_strs,lesstr).tolist()
##### Removing the strings with less measurements
df = df[df['name']==strng_list]

My present output:

ValueError: Lengths must match to compare

My expected output:

         name     value
    0      A      10
    1      A      30
    2      A      40
    3      C      50
    4      C      60

@Sushanth No. I want to drop rows that occur less frequently. In the above question, I said less than 2. I may wish to choose less than 3 as well. My question is representational only. — Mainland, Jul 10 '20 at 17:39
Try `df.groupby('name').apply(lambda dd: dd if len(dd) > 1 else pd.DataFrame()).reset_index(drop=True)`. But this is likely slow. — Abdou, Jul 10 '20 at 17:43

score 6 · Accepted Answer · answered Jul 10 '20 at 17:49

You could find the count of each element in name and then select only those rows having names that occur more than once.

v = df.name.value_counts()
df[df.name.isin(v.index[v.gt(1)])]

Output :

    name    value
0   A   10
2   A   30
3   A   40
4   C   50
5   C   60
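
Since the asker's comment mentions that the cutoff might be 2 or 3, the same idea can be wrapped in a small helper. This is only a sketch: the function name `drop_rare` and its `min_count` parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

def drop_rare(df, column, min_count=2):
    """Keep only rows whose value in `column` occurs at least `min_count` times.

    `drop_rare` and `min_count` are hypothetical names used for this sketch.
    """
    # Series indexed by the distinct values, holding how often each occurs
    counts = df[column].value_counts()
    # Values that meet the threshold
    frequent = counts.index[counts.ge(min_count)]
    return df[df[column].isin(frequent)]

df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'C', 'C', 'D'],
                   'value': [10, 20, 30, 40, 50, 60, 70]})

kept_2 = drop_rare(df, 'name', min_count=2)  # rows for A and C remain
kept_3 = drop_rare(df, 'name', min_count=3)  # only rows for A remain
```

The original index labels are preserved. Call `.reset_index(drop=True)` on the result if you want the rows renumbered from 0, as in the question's expected output.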

score 4 · Answer 2 · answered Jul 10 '20 at 17:46

I believe this code should give you what you want.

df['count'] = df.groupby('name')['name'].transform('count')
df2 = df.loc[df['count'] >= 2].drop(columns='count')

rhug123

Adding to the above answer: `df[df.groupby('name').transform('count').gt(1)['value']]` – sushanth Jul 10 '20 at 17:50
Personal warning, from experience, `groupby().transform()` can be extremely slow for big datasets – Celius Stingher Jul 10 '20 at 17:54
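
If `groupby().transform()` does turn out to be slow on your data, one alternative to try is looking up each row's count from `value_counts()` with `map`, which also avoids adding a temporary column. This has not been benchmarked here, so treat it as a sketch; the variable names `counts` and `per_row_count` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'C', 'C', 'D'],
                   'value': [10, 20, 30, 40, 50, 60, 70]})

# Series indexed by name, holding how often each name occurs
counts = df['name'].value_counts()

# Look up the count for every row's name (aligned with df's index)
per_row_count = df['name'].map(counts)

# Keep rows whose name occurs at least twice
df2 = df[per_row_count >= 2]
```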

score 1 · Answer 3 · answered Jul 10 '20 at 17:43

You can use value_counts() to get the number of occurrences of each name, then slice that series to get the names of the rows you want to keep.

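import pandas as pd

df = pd.DataFrame({'name':['A','B','A','A','C','C','D'],
                   'value':[10,20,30,40,50,60,70]})
# Count how often each name occurs
counts = df['name'].value_counts()
# Despite its name, removals holds the names to keep (those occurring more than once)
removals = counts[counts > 1].index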

Here the threshold is 1: every name that shows up more than once is selected. The threshold can be stored in a variable or changed as needed.

filtered_df = df[df['name'].isin(removals)]
print(filtered_df)

Output:

  name  value
0    A     10
2    A     30
3    A     40
4    C     50
5    C     60
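
For reference, the `ValueError` in the question comes from `df['name'] == strng_list`. The `==` operator compares element by element, so a 7-row column cannot be compared against a 2-item list, whereas `isin` tests membership and accepts a list of any length. A minimal sketch of the difference, using the question's data (the list `keep` stands in for the question's `strng_list`):

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'C', 'C', 'D'],
                   'value': [10, 20, 30, 40, 50, 60, 70]})
keep = ['A', 'C']  # stand-in for the question's strng_list

# Element-wise comparison needs equal lengths, so this raises ValueError
try:
    df[df['name'] == keep]
    raised = False
except ValueError:
    raised = True

# Membership test: True for every row whose name is in the list
filtered = df[df['name'].isin(keep)]
```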
