1

So I have a dataframe about NBA stats from the last season which I am using to learn pandas and matplotlib but all numbers (Points per game, salaries, PER etc.) are strings. I noticed it when I tried to sum them and they just concatenated. So I used this :

df['Salary'] = df['Salary'].astype(float)

to change the values but there is many more columns that I have to do the same thing for and I know that I should do it manually. First thing that comes to mind is some kind of regex but I am not familiar with it so I am seeking for help. Thanks in advance!

rLoper
  • 63
  • 7

5 Answers5

0

I think you can use select_dtypes

The strategy is to find the columns with types object, which usually are string. You can check it out by using df.info().

so :

df.select_dtypes(include = ['object']).astype(float)

would do the trick

If you want to keep a trace of this :

str_cols = df.select_dtypes(include = ['object'].columns

mapping = {col_name:col_type for col_name, col_type in zip(str_cols, [float]*len(str_cols))}

df[str_cols] = df[str_cols].astype(mapping)

I like this approach because you can create a dictionary of the types you want your columns to be in.

Nathan Furnal
  • 2,236
  • 3
  • 12
  • 25
0

If you know the names of the columns you can use a for loop to apply the same transformation to each column. This is useful if you don't want to convert entire data frame but only the numeric columns etc. Hope that helps

    cols = ['points','salary','wins']
    for i in cols:
         df[i] = df[i].astype(float)
bamdan
  • 836
  • 7
  • 21
0

Another way to do this if you know the columns in advance is to specify the dtype when you import the dataframe.

df = pd.read_csv("file.tsv", sep='\t', dtype={'a': np.float. 'b': str, 'c': np.float}

A second method could be to use a conversion dictionary:

conversion_dict = {'a': np.float, 'c': np.float}
df = df.astype(conversion_dict)

A third method if your column would be an object would be to use the infer_object() method from pandas. Using this method you dont have to specify all the columns yourself.

df =  df.infer_objects()

good luck

Mack123456
  • 376
  • 2
  • 8
0

In Pandas, DataFrame objects make a list of all columns contained in the frame available via the columns attribute. This attribute is iterable, which means you can use this as the iterable object of a for-in loop. This allows you to easily run through and apply an operation to all columns:

for col in df.columns: 
    df[col] = df[col].astype('float', errors='ignore')

Documentation page for Pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

Spencer D
  • 3,376
  • 2
  • 27
  • 43
0

I think what OP is asking is how he can convert each column to it's appropriate type (int, float, or str) without having to manually inspect each column and then explicitly convert it.

I think something like the below should work for you. Keep in mind that this is pretty exhaustive and checks each value for the entire column. You can always the second for loop to maybe only look at the first 100 columns to make a decision on what type to use for that column.

import pandas as pd
import numpy as np

# Example dataframe full of strings
df = pd.DataFrame.from_dict({'name':['Lebron James','Kevin Durant'],'points':['38','   '],'steals':['2.5',''],'position':['Every Position','SG'],'turnovers':['0','7']})   

def convertTypes(df):
    for col in df: 
        is_an_int = True
        is_a_float = True
        if(df[col].dtype == np.float64 or df[col].dtype == np.int64):
            # If the column's type is already a float or int, skip it
            pass
        else:
            # Iterate through each value in the column
            for value in df[col].iteritems():
                if value[1].isspace() == True or value[1] == '':
                    continue
                # If the string's isnumeric method returns false, it's not an int
                if value[1].isnumeric() == False: 
                    is_an_int = False
                # if the string is made up of two numerics split by a '.', it's a float
                if isinstance(value[1],str): 
                    if len(value[1].split('.')) == 2: 
                        if value[1].split('.')[0].isnumeric() and value[1].split('.')[1].isnumeric(): 
                            is_a_float = True 
                        else: 
                            is_a_float = False 
                    else: 
                        is_a_float = False 
                else: 
                    is_a_float = False 
            if is_a_float == True:
                # If every value's a float, convert the whole column
                # Replace blanks and whitespaces with np.nan
                df[col] = df[col].replace(r'^\s*$', np.nan, regex=True).astype(float)
            elif is_an_int == True:
                # If every value's an int, convert the whole column
                # Replace blanks and whitespaces with 0
                df[col] = df[col].replace(r'^\s*$', 0, regex=True).astype(int)

convertTypes(df)
samredai
  • 662
  • 3
  • 8
  • Crap, in the 14th line the isnumeric() method spits an error because the value is a float, I am working on a fix – rLoper Aug 27 '19 at 10:14
  • @rLoper you mean the datatype of the column is already a float? The first if statement is meant to check for this and pass but the way I wrote it only checks for np.float64. Does the column throwing the error have a type of np.float32 or np.float16? – samredai Aug 27 '19 at 11:29