3

I have a sample dataset.

import pandas as pd

raw_data = {
    'categories': ['sweet beverage', 'salty snacks', 'beverage,sweet', 'fruit juice,beverage,', 'salty crackers'],
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice', 'lemon tea', 'roasted peanuts']}
df_a = pd.DataFrame(raw_data)

I need to iterate through the rows of the 'categories' column and check whether each value contains a particular string, in this case 'beverage'. If it does, I want to update the category to just 'beverage'. The question linked below is the closest I found on Stack Overflow, but it doesn't tell me how to go through the whole dataset.

Replace whole string if it contains substring in pandas

Here's my sample code.

for index,row in df.iterrows():
    if row.str.contains('beverage', na=False):
        df.loc[index,'categories_en'] = 'Beverages' 
    elif row.str.contains('salty',na=False):
        df.loc[index,'categories_en'] = 'Salty Snack'
     ....<and other conditions>

How can I achieve this? Thanks all!

Zoozoo
  • Why do you call it "the `categories` column**s**" when it is actually just a single column in the dataframe? – Omni Feb 14 '18 at 21:38

6 Answers

3

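Create the following dicts, then use replace with regex=True followed by map. When the replacement value is not a string, a regex replace swaps out the whole cell rather than just the matched substring, which is why integer codes serve as the intermediate step: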

Yourdict2={1:'Beverages',2:'salty'}
Yourdict1={'beverage':1,'salty':2}
df_a.categories.replace(Yourdict1,regex=True).map(Yourdict2)
Out[275]: 
0    Beverages
1        salty
2    Beverages
3    Beverages
4        salty
Name: categories, dtype: object
BENY
1

You can use

df_a.loc[df_a.categories.str.contains('beverage'), 'categories'] = 'beverage'


    categories      product_name
0   beverage        coca-cola
1   salty snacks    salted pistachios
2   beverage        fruit juice
3   beverage        lemon tea
4   salty crackers  roasted peanuts
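
To cover several keywords with the same boolean-mask idea, one option is to loop over a small keyword-to-label table. This is only a sketch: the rules dict and the choice to write into a separate categories_en column are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

df_a = pd.DataFrame({
    "categories": ["sweet beverage", "salty snacks", "beverage,sweet",
                   "fruit juice,beverage,", "salty crackers"],
    "product_name": ["coca-cola", "salted pistachios", "fruit juice",
                     "lemon tea", "roasted peanuts"],
})

# Hypothetical keyword -> label table; extend it with further conditions.
rules = {"beverage": "Beverages", "salty": "Salty Snack"}

# Start from the original values so rows that match no rule keep their category.
df_a["categories_en"] = df_a["categories"]

for keyword, label in rules.items():
    # Match against the untouched source column. na=False treats missing
    # values as "no match"; regex=False does a plain substring search.
    mask = df_a["categories"].str.contains(keyword, case=False, na=False, regex=False)
    # Only the categories_en column of the matching rows is overwritten.
    df_a.loc[mask, "categories_en"] = label

print(df_a)
```

If a row matches more than one keyword, the rule that comes later in the dict wins, because each pass overwrites the previous label.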
Vaishali
  • Hi, thanks for your suggestion. I tried to create a function `def transformCat(s)` so that it checks for other strings. Within the function I have, for example, `s = df.loc[df.categories_en.str.lower().str.contains('milk', na=False)] = 'Dairies,Milks'` followed by `return s`. I then call my function with `df['my_categories'] = df.categories_en.apply(transformCat)`. Is this correct? – Zoozoo Feb 16 '18 at 11:30
1

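Use the __contains__() method of Python's string class:

for a in df_a["categories"]:
    # a.__contains__("beverage") is what the expression "beverage" in a calls
    if a.__contains__("beverage"):
        # assign the result back instead of using inplace=True on a column selection
        df_a["categories"] = df_a["categories"].replace(a, "beverage")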
Francesco
Xanyar
0

Maybe you can try something like this:

def selector(x):
    if 'beverage' in x:
        return 'Beverages'
    if 'salty' in x:
        return 'Salty snack'

df_a['categories_en'] = df_a['categories'].apply(selector)
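
Note that a function written like selector returns None for a row that matches none of the keywords, and the `in` test raises a TypeError on a missing value (NaN). A possible variant that keeps unmatched and missing values unchanged is sketched below; the RULES list, the function name and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical ordered rules: the first keyword found wins.
RULES = [("beverage", "Beverages"), ("salty", "Salty snack")]

def selector_with_fallback(x):
    # Non-string cells (for example NaN or None) are returned unchanged.
    if not isinstance(x, str):
        return x
    for keyword, label in RULES:
        if keyword in x:
            return label
    # No rule matched: keep the original category instead of returning None.
    return x

# Illustrative data, including an unmatched value and a missing value.
categories = pd.Series(["sweet beverage", "salty snacks", "dried fruit", None])
result = categories.apply(selector_with_fallback)
print(result)
```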
relay
0

Use apply to generate a new categories column. Then assign it to the categories_en column of the dataframe.

def map_categories(cat: str) -> str:
    if cat.find("beverage") != -1:
        return "beverage"
    else:
        return cat  # keep the original category when there is no match
new_col = df['categories'].apply(map_categories)
df['categories_en'] = new_col
Omni
0

Thanks for all the various solutions to my question. Based on all your inputs, I have come up with this solution, which works.

def transformCat(df):
    # Select the categories_en column explicitly: df.loc[mask] = value without
    # a column label would overwrite every column of the matching rows.
    df.loc[df.categories_en.str.lower().str.contains('beers|largers|wines|rotwein|biere', na=False), 'categories_en'] = 'Alcoholic,Beverages'
    df.loc[df.categories_en.str.lower().str.contains('cheese', na=False), 'categories_en'] = 'Dairies,Cheeses'
    df.loc[df.categories_en.str.lower().str.contains('yogurts', na=False), 'categories_en'] = 'Dairies,Yogurts'
    df.loc[df.categories_en.str.lower().str.contains(r'sauce.*ketchup|ketchup.*sauce', na=False), 'categories_en'] = 'Sauces,Ketchups'

I would appreciate any feedback. Thanks all!
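
For what it's worth, a chain of regex rules like this could also be written with numpy.select, which picks the label of the first condition that is true for each row. The sketch below is not the code from this answer: the sample rows, the my_categories column name (borrowed from an earlier comment) and the "Other" default are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the real dataset is not shown in the question.
df = pd.DataFrame({
    "categories_en": ["Beers,Wines", "Cheese spreads",
                      "Tomato sauce, ketchup", "Fresh fruit", None],
})

# (regex pattern, label) pairs in priority order: the first match wins.
rules = [
    (r"beers|wines", "Alcoholic,Beverages"),
    (r"cheese", "Dairies,Cheeses"),
    (r"sauce.*ketchup|ketchup.*sauce", "Sauces,Ketchups"),
]

# Lower-case once, then build one boolean array per rule.
lowered = df["categories_en"].str.lower()
conditions = [lowered.str.contains(pattern, na=False).to_numpy(dtype=bool)
              for pattern, _ in rules]
choices = [label for _, label in rules]

# Rows that match no rule (including missing values) get the default label.
df["my_categories"] = np.select(conditions, choices, default="Other")
print(df)
```

Writing to a separate column leaves the source values intact, so the rules can be adjusted and re-run without reloading the data.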

Zoozoo