How to convert categorical data to numerical data?

Question

I have a feature, city, which is categorical data (i.e. strings). Instead of hardcoding the values using replace(), is there a smarter approach?

train['city'].unique()
Output: ['city_149', 'city_83', 'city_16', 'city_64', 'city_100', 'city_21',
       'city_114', 'city_103', 'city_97', 'city_160', 'city_65',
       'city_90', 'city_75', 'city_136', 'city_159', 'city_67', 'city_28',
       'city_10', 'city_73', 'city_76', 'city_104', 'city_27', 'city_30',
       'city_61', 'city_99', 'city_41', 'city_142', 'city_9', 'city_116',
       'city_128', 'city_74', 'city_69', 'city_1', 'city_176', 'city_40',
       'city_123', 'city_152', 'city_165', 'city_89', 'city_36', .......]

What I tried:

train.replace(['city_149', 'city_83', 'city_16', 'city_64', 'city_100', 'city_21',
           'city_114', 'city_103', 'city_97', 'city_160', 'city_65',
           'city_90', 'city_75', 'city_136', 'city_159', 'city_67', 'city_28',
           'city_10', 'city_73', 'city_76', 'city_104', 'city_27', 'city_30',
           'city_61', 'city_99', 'city_41', 'city_142', 'city_9', 'city_116',
           'city_128', 'city_74', 'city_69', 'city_1', 'city_176', 'city_40',
           'city_123', 'city_152', 'city_165', 'city_89', 'city_36', .......], [1,2,3,4,5,6,7,8,9....], inplace=True)

Is there a better way to convert the data to numerical values? The column has 123 unique values, so I would need to hardcode the numbers 1, 2, 3, ..., 123 to convert it. Please suggest a better way.

@SuperStew I am a newbie in Python and pandas, can you please help? – stone rock, Jul 12 '18 at 17:59
https://stackoverflow.com/questions/38088652/pandas-convert-categories-to-numbers Try searching before posting; this is a common question. – BENY, Jul 12 '18 at 18:09
Could have used: `train['city'].str.split('_').str[1].astype(int)` too – Anton vBR, Jul 12 '18 at 18:13

sacuL · Accepted Answer · 2018-07-12T18:08:17.100

11

Try pd.factorize():

train['city'] = pd.factorize(train.city)[0]

Or categorical dtypes:

train['city'] = train['city'].astype('category').cat.codes

For example:

>>> train
       city
0  city_151
1  city_149
2  city_151
3  city_149
4  city_149
5  city_149
6  city_151
7  city_151
8  city_150
9  city_151

factorize:

train['city'] = pd.factorize(train.city)[0]

>>> train
   city
0     0
1     1
2     0
3     1
4     1
5     1
6     0
7     0
8     2
9     0

Or astype('category'):

train['city'] = train['city'].astype('category').cat.codes

>>> train
   city
0     2
1     0
2     2
3     0
4     0
5     0
6     2
7     2
8     1
9     2
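
The two approaches number the cities differently: factorize assigns codes in order of first appearance, while the categorical codes follow the sorted order of the labels. The following is a minimal sketch on the same ten rows as above; it also keeps the lookup needed to turn codes back into city names. The variable names (codes, uniques, cat, and so on) are illustrative and not part of the original answer.

```python
import pandas as pd

# Same ten rows as the example above
train = pd.DataFrame({'city': ['city_151', 'city_149', 'city_151', 'city_149',
                               'city_149', 'city_149', 'city_151', 'city_151',
                               'city_150', 'city_151']})

# factorize returns (codes, uniques); codes follow order of first appearance
codes, uniques = pd.factorize(train['city'])
factorize_codes = codes.tolist()
# Indexing uniques with the codes recovers the original labels
factorize_roundtrip = uniques[codes].tolist()

# Categorical codes follow the sorted order of the category labels
cat = train['city'].astype('category')
category_codes = cat.cat.codes.tolist()
category_labels = cat.cat.categories.tolist()
```

Note that the category order is lexicographic on the strings, so a label such as 'city_100' would sort before 'city_9'. If the numbering has to be reproduced later (for example on a test set), keeping uniques or cat.cat.categories around is probably the safer choice.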

edited Jul 12 '18 at 18:08

answered Jul 12 '18 at 18:00

sacuL

Best answer, thanks :) I am new to pandas, there are a hell of a lot of functions, and I was not aware of factorize. Thanks once again. – stone rock Jul 12 '18 at 18:04
No problem! Glad to help! – sacuL Jul 12 '18 at 18:05
When you get to a resolution, please remember to up-vote useful things and accept your favourite answer (even if you have to write it yourself), so Stack Overflow can properly archive the question. – Prune Jul 12 '18 at 18:06
@Prune Yes I will accept the answer. I can accept only after 15min time period. – stone rock Jul 12 '18 at 18:10
@sacul Can you please help me here: https://stackoverflow.com/questions/51342146/i-am-getting-type-error-in-seaborn – stone rock Jul 14 '18 at 18:23

score 1 · Answer 2 · answered Jul 12 '18 at 18:02

You can accomplish this via mapping:

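import numpy as np
# Map each unique city to an integer 1..123 (the question has 123 unique values)
value_mapper = dict(zip(train['city'].unique(), np.arange(1, 124)))
train['city'].map(value_mapper)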

Or, more idiomatically, with categorical data:

pd.Categorical(train['city']).codes
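
The hardcoded upper bound of 124 in the mapping comes from the 123 unique values mentioned in the question. The following sketch derives the numbering from the data instead, so it should work for any number of cities. The four-row sample frame and the names unique_cities, mapped and categorical_codes are made up for illustration.

```python
import pandas as pd

# Small illustrative sample, not the asker's data
train = pd.DataFrame({'city': ['city_151', 'city_149', 'city_151', 'city_150']})

unique_cities = train['city'].unique()
# Number the cities 1..n in order of first appearance
value_mapper = {city: i for i, city in enumerate(unique_cities, start=1)}
mapped = train['city'].map(value_mapper)

# pd.Categorical numbers from 0, in sorted order of the labels
categorical_codes = pd.Categorical(train['city']).codes
```

The two results differ in both starting value and ordering: the mapping starts at 1 and follows first appearance, while the categorical codes start at 0 and follow the sorted labels.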

answered Jul 12 '18 at 18:02

iDrwish

score 1 · Answer 3 · answered Jul 12 '18 at 18:02

If your values always have an underscore before the integer, a list comprehension might work for you:

data = [int(x.split('_')[-1]) for x in train['city']]

The comprehension loops over each x in train['city'], splits x into underscore-delimited parts, and converts the last part to an integer. This also works if you have more than one underscore, like foo_bar_5.
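
The same extraction can also be written with pandas string methods, along the lines of the comment on the question. The following is a sketch on made-up sample values; city_numbers is an illustrative name. Unlike factorize or categorical codes, this recovers the number embedded in each label and does not assign new codes.

```python
import pandas as pd

# Illustrative sample, including a label with more than one underscore
train = pd.DataFrame({'city': ['city_149', 'city_83', 'city_16', 'foo_bar_5']})

# List comprehension: keep the part after the last underscore
data = [int(x.split('_')[-1]) for x in train['city']]

# Vectorized: split once from the right, keep the last piece, cast to int
city_numbers = train['city'].str.rsplit('_', n=1).str[-1].astype(int)
```

Both versions raise an error if the part after the last underscore is not an integer, so they only fit columns where every label follows the name_number pattern.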
