Replace whole string if it contains substring in pandas dataframe

Question

I have a sample dataset.

import pandas as pd

raw_data = {
    'categories': ['sweet beverage', 'salty snacks', 'beverage,sweet', 'fruit juice,beverage,', 'salty crackers'],
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice', 'lemon tea', 'roasted peanuts']}
df_a = pd.DataFrame(raw_data)

I need to iterate through the rows of the 'categories' column and check whether each value contains a particular string, in this case 'beverage'. If it does, I want to update the category to just 'beverage'. The link below is the closest I found on Stack Overflow, but it doesn't tell me how to go through the whole dataset.

Replace whole string if it contains substring in pandas

Here's my sample code.

for index,row in df.iterrows():
    if row.str.contains('beverage', na=False):
        df.loc[index,'categories_en'] = 'Beverages' 
    elif row.str.contains('salty',na=False):
        df.loc[index,'categories_en'] = 'Salty Snack'
     ....<and other conditions>

How can I achieve this? Thanks all!

Why do you call it "the `categories` column**s**" when it is actually just a single column in the dataframe? — Omni, Feb 14 '18 at 21:38

score 3 · Answer 1 · answered Feb 14 '18 at 21:43

Create the following dicts, then use replace with regex=True followed by map:

Yourdict2={1:'Beverages',2:'salty'}
Yourdict1={'beverage':1,'salty':2}
df_a.categories.replace(Yourdict1,regex=True).map(Yourdict2)
Out[275]: 
0    Beverages
1        salty
2    Beverages
3    Beverages
4        salty
Name: categories, dtype: object
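
As a possible alternative (not the answerer's method), the intermediate integer codes can be avoided by extracting the matching keyword and mapping it straight to a label. This is only a sketch; the merged dict `labels` and the variable `pattern` are names made up for illustration.

```python
import re
import pandas as pd

df_a = pd.DataFrame({
    'categories': ['sweet beverage', 'salty snacks', 'beverage,sweet',
                   'fruit juice,beverage,', 'salty crackers'],
})

# Hypothetical merged dict: keyword -> replacement label
# (same pairs as the two dicts in the answer).
labels = {'beverage': 'Beverages', 'salty': 'salty'}

# One capture group that matches any keyword; re.escape guards against
# regex metacharacters in the keywords.
pattern = '(' + '|'.join(re.escape(k) for k in labels) + ')'

# str.extract returns the leftmost keyword found in each string (NaN if none),
# and map turns that keyword into its label.
result = df_a['categories'].str.extract(pattern, expand=False).map(labels)
print(result)
```

Rows that contain none of the keywords come out as NaN, so they would need a `fillna` if the original text should be kept.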

BENY

This is interesting – Vaishali Feb 14 '18 at 21:48

score 1 · Answer 2 · answered Feb 14 '18 at 21:46

You can use a boolean mask with .loc:

df_a.loc[df_a.categories.str.contains('beverage'), 'categories'] = 'beverage'


    categories      product_name
0   beverage        coca-cola
1   salty snacks    salted pistachios
2   beverage        fruit juice
3   beverage        lemon tea
4   salty crackers  roasted peanuts
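
The question also asks about several keywords at once. One way that might work is to build one boolean mask per keyword and let `numpy.select` pick the label. This is a sketch under assumptions: the `'Other'` fallback label is invented here, and `na=False` is added so that missing values count as "no match".

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({
    'categories': ['sweet beverage', 'salty snacks', 'beverage,sweet',
                   'fruit juice,beverage,', 'salty crackers'],
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice',
                     'lemon tea', 'roasted peanuts'],
})

cats = df_a['categories']

# One mask per keyword; np.select uses the first condition that is True.
conditions = [
    cats.str.contains('beverage', na=False),
    cats.str.contains('salty', na=False),
]
labels = ['Beverages', 'Salty Snack']

# 'Other' is a hypothetical fallback for rows matching no keyword.
df_a['categories_en'] = np.select(conditions, labels, default='Other')
print(df_a)
```

Because the first true condition wins, the order of `conditions` decides what happens to a row that contains more than one keyword.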

Vaishali

Hi, thanks for your suggestion. I tried to create a function, def transformCat(s), so that it checks for other strings. Within the function I have, for example, s=df.loc[df.categories_en.str.lower().str.contains('milk',na=False)] ='Dairies,Milks' followed by return s. I then call the function with df['my_categories'] =df.categories_en.apply(transformCat). Is this correct? – Zoozoo Feb 16 '18 at 11:30

score 1 · Answer 3 · edited May 27 '20 at 17:25

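Use the __contains__() method of Python's string class:

for a in df_a["categories"]:
    # a.__contains__("beverage") is equivalent to the more idiomatic: "beverage" in a
    if a.__contains__("beverage"):
        # Assign the result back instead of using inplace=True on a column selection
        df_a["categories"] = df_a["categories"].replace(a, "beverage")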

edited May 27 '20 at 17:25

Francesco

answered May 27 '20 at 15:39

Xanyar

score 0 · Answer 4 · answered Feb 14 '18 at 21:40

Maybe you can try something like this:

def selector(x):
    if 'beverage' in x:
        return 'Beverages'
    if 'salty' in x:
        return 'Salty snack'

df_a['categories_en'] = df_a['categories'].apply(selector)
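
Note that a function like this returns None for rows that match no keyword, and it raises a TypeError on missing values. A hedged variant, with the hypothetical name `selector_with_fallback` and made-up sample rows, could pass such values through unchanged:

```python
import pandas as pd

def selector_with_fallback(x):
    # Missing values (None/NaN) are not strings; return them untouched.
    if not isinstance(x, str):
        return x
    if 'beverage' in x:
        return 'Beverages'
    if 'salty' in x:
        return 'Salty snack'
    # No keyword matched: keep the original category.
    return x

# Hypothetical sample data, including an unmatched row and a missing value.
df_a = pd.DataFrame(
    {'categories': ['sweet beverage', 'salty snacks', 'dried fruit', None]}
)
df_a['categories_en'] = df_a['categories'].apply(selector_with_fallback)
print(df_a)
```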

relay

score 0 · Answer 5 · answered Feb 14 '18 at 21:44

Use apply to generate a new categories column. Then assign it to the categories_en column of the dataframe.

def map_categories(cat: str) -> str:
    if cat.find("beverage") != -1:
        return "beverage"
    else:
        # Leave non-matching categories unchanged
        return cat
new_col = df['categories'].apply(map_categories)
df['categories_en'] = new_col

score 0 · Answer 6 · answered Feb 16 '18 at 19:02

Thanks for all the various solutions to my question. Based on all your inputs, I have come up with this solution, which works.

def transformCat(df):
    # Name the target column so that only categories_en is overwritten, not the whole row.
    df.loc[df.categories_en.str.lower().str.contains('beers|largers|wines|rotwein|biere',na=False), 'categories_en'] = 'Alcoholic,Beverages'
    df.loc[df.categories_en.str.lower().str.contains('cheese',na=False), 'categories_en'] = 'Dairies,Cheeses'
    df.loc[df.categories_en.str.lower().str.contains('yogurts',na=False), 'categories_en'] = 'Dairies,Yogurts'
    df.loc[df.categories_en.str.lower().str.contains(r'sauce.*ketchup|ketchup.*sauce',na=False), 'categories_en'] = 'Sauces,Ketchups'
    return df

I would appreciate any input. Thanks all!
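
If the list of rules keeps growing, one option is to keep the patterns in a table and loop over it. The following is only a sketch: `RULES`, `transform_categories` and the sample rows are hypothetical names and data, and `case=False` stands in for the `.str.lower()` call.

```python
import pandas as pd

# (regex pattern, replacement label) pairs, checked in order.
RULES = [
    (r'beers|largers|wines|rotwein|biere', 'Alcoholic,Beverages'),
    (r'cheese', 'Dairies,Cheeses'),
    (r'yogurts', 'Dairies,Yogurts'),
    (r'sauce.*ketchup|ketchup.*sauce', 'Sauces,Ketchups'),
]

def transform_categories(df, column='categories_en'):
    """Return a new Series in which the first matching rule replaces the whole string."""
    original = df[column]
    result = original.copy()
    unassigned = pd.Series(True, index=df.index)
    for pattern, label in RULES:
        # Case-insensitive regex match; missing values count as no match.
        hit = original.str.contains(pattern, case=False, regex=True, na=False)
        hit = hit & unassigned
        result[hit] = label
        unassigned = unassigned & ~hit
    return result

# Hypothetical sample rows.
df = pd.DataFrame({'categories_en': ['Red Wines', 'Goat cheese',
                                     'Tomato sauce, ketchup', None, 'Bread']})
df['my_categories'] = transform_categories(df)
print(df)
```

Rows that match no rule, including missing values, keep their original content, and the source column itself is left untouched.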
