
I have a dataframe with a column that contains numbers in various unit formats.

The rawNo column is how the data comes in. I would like to convert it to look like the ConvNo column:

import numpy as np
import pandas as pd

datasample = pd.DataFrame(columns=['rawNo','ConvNo'])

datasample = datasample.append({'rawNo': '-4.35%','ConvNo': -.0435},ignore_index = True)
datasample = datasample.append({'rawNo': '246.6K','ConvNo': 246600},ignore_index = True)
datasample = datasample.append({'rawNo': np.nan,'ConvNo': np.nan},ignore_index = True)
datasample = datasample.append({'rawNo': '$12.76B','ConvNo': 12760000000},ignore_index = True)
datasample = datasample.append({'rawNo': '4.68%','ConvNo': .0468},ignore_index = True)
datasample = datasample.append({'rawNo': '¥-459.5B','ConvNo': -459500000000},ignore_index = True)
datasample = datasample.append({'rawNo': '€-6.8B','ConvNo': -6800000000},ignore_index = True)
datasample = datasample.append({'rawNo': '£-15.623B','ConvNo': -15623000000},ignore_index = True)
datasample = datasample.append({'rawNo': '$-1,400B','ConvNo': -1400000000000},ignore_index = True)

I figure I will have to use some type of conditional apply. This apply, meant to strip the percent sign, is failing:

def rPercent(value):
    value = str(value)
    count = value.count('%')
    print(count)
    
    if (count != 0):
        return value.str.rstrip('% ').astype('float') / 100.0
    else:
        return value

datasample["ConvNo"] = datasample['actual'].apply(rPercent)

Error I get:

> AttributeError: 'str' object has no attribute 'str'
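
The `.str` accessor exists only on a pandas Series; inside `apply`, each value is a plain Python string, which is what raises this AttributeError. A minimal per-value sketch of the same idea (the name rPercentFixed is only illustrative, and it only handles the percent case, not the full multi-unit conversion):

def rPercentFixed(value):
    # apply passes one cell at a time, so use plain string methods here
    if isinstance(value, str) and '%' in value:
        return float(value.rstrip('% ')) / 100.0
    return value

datasample["ConvNo"] = datasample['rawNo'].apply(rPercentFixed)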

Data file: you can download the file from this link: https://projectcodesamples.s3.amazonaws.com/ForEx.csv

The column I am trying to convert is "actual"; the expected result is in the "CNactual" column.

Leo Torres
  • And what have you done so far? This is an interesting and not too difficult task, but this is not a code-writing service. – Tim Roberts Jun 12 '23 at 01:15
  • I have done this: `data4['Cactual'] = data4["actual"].str.rstrip('% ').astype('float') / 100.0`, but I need some type of case statement because it fails when there is no '%' – Leo Torres Jun 12 '23 at 01:18
  • Right. You clearly can't do this in bulk, because each line is different. You will need to write a function and use `.apply`. And the line in Yen is wrong; it needs to be negative. – Tim Roberts Jun 12 '23 at 01:21
  • @TimRoberts: it's possible to do this in bulk, kindly see my answer. – SultanOrazbayev Jun 19 '23 at 04:16

3 Answers


The function you use performs operations on every individual value, but for efficiency it's better to perform vectorized operations. One specific trick is to map the specific characters to their equivalent in scientific notation (e.g. see this answer for handling just the %).

from numpy import nan
from pandas import DataFrame

data = [
    {"rawNo": "-4.35%", "ConvNo": -0.0435},
    {"rawNo": "246.6K", "ConvNo": 246600},
    {"rawNo": nan, "ConvNo": nan},
    {"rawNo": "$12.76B", "ConvNo": 12760000000},
    {"rawNo": "4.68%", "ConvNo": 0.0468},
    {"rawNo": "¥-459.5B", "ConvNo": -459500000000},
    {"rawNo": "€-6.8B", "ConvNo": -6800000000},
    {"rawNo": "£-15.623B", "ConvNo": -15623000000},
]

df = DataFrame(data)

# note that the currency symbols are mapped to blank string
# to facilitate the conversion to float
character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "£": "",
    "\$": "",
    "¥": "",
    "€": "",
}

df["Converted"] = df["rawNo"].replace(character_mapping, regex=True).astype(float, errors='ignore')

print(df)
#        rawNo        ConvNo     Converted
# 0     -4.35% -4.350000e-02 -4.350000e-02
# 1     246.6K  2.466000e+05  2.466000e+05
# 2        NaN           NaN           NaN
# 3    $12.76B  1.276000e+10  1.276000e+10
# 4      4.68%  4.680000e-02  4.680000e-02
# 5   ¥-459.5B -4.595000e+11 -4.595000e+11
# 6     €-6.8B -6.800000e+09 -6.800000e+09
# 7  £-15.623B -1.562300e+10 -1.562300e+10

Note that the dollar sign had to be escaped because it's a special regex character. Also note that `.astype` is called with the kwarg `errors="ignore"`; this retains the original values if the conversion to float fails (but allows the pipeline to run without raising errors).
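
One way to then pinpoint the rows that fail to convert (a sketch reusing df and character_mapping from above; pd.to_numeric with errors="coerce" turns unparseable values into NaN):

from pandas import to_numeric

# replace the units as above, then coerce: values that cannot be parsed become NaN
replaced = df["rawNo"].replace(character_mapping, regex=True)
coerced = to_numeric(replaced, errors="coerce")

# rows that were not NaN before coercion but are NaN after it are the problem rows
problem_rows = df.loc[replaced.notna() & coerced.isna(), ["rawNo"]]
print(problem_rows)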

SultanOrazbayev
  • I think this should work. I am getting an error that looks like it's about a space (ValueError: could not convert string to float: ''). I did a strip() on the column but the issue persists; I also checked and removed any "'" that may exist, and now I am at a loss. Thank you for the prompt response. – Leo Torres Jun 19 '23 at 15:23
  • Looks like you have other non-numeric characters. If it's safe for you to do so, you can include them as extra keys in the `character_mapping`. – SultanOrazbayev Jun 19 '23 at 16:19
  • Yes, SultanOrazbayev. I did add a bunch of new characters to the mapping and they all worked, except for the last error I posted here in the comments. See the revised version in the next comment, which addresses other issues. It seems to work, but I am lost here, not knowing how to resolve this character; I thought it was empty space. – Leo Torres Jun 19 '23 at 18:15
  • character_mapping = { "%": "e-2", "K": "e3", "M": "e6", "B": "e9", "T" : "e12", "£": "", "\$": "", "¥": "", "€": "", ",": "", "-" : "", "'" : "" } – Leo Torres Jun 19 '23 at 18:15
  • OK, in your dictionary you probably do not want to remove the '-' character, so that the sign is retained (if it's relevant). I updated the answer with the `errors="ignore"` kwarg; this is not ideal, but it will allow you to complete the conversion, and you can then manually investigate which rows are causing errors. – SultanOrazbayev Jun 20 '23 at 03:30
  • I get the same error (could not convert string to float: ''). I have created a link to the data, since Corralien requested it. Please see the bottom of the question. Thank you! – Leo Torres Jun 20 '23 at 06:44
  • Just one more question: is it possible to use some options from `pd.read_csv`? – Corralien Jun 20 '23 at 07:20

In your output, you lost the currency type. To go further, it would be great to convert the currencies before losing this information. Like @SultanOrazbayev, we can use dictionaries to perform the conversion:

import numpy as np

factors = {
    '%': '0.01',
    'K': '1e3',
    'M': '1e6',
    'B': '1e9',
    'T': '1e12',
    np.nan: '1'
}

# to US Dollars
currencies = {
    '£': '1.28',
    r'\$': '1',
    '¥': '0.007',
    '€': '1.09',
    np.nan: '1'
}

values = {
    ",": "",
    "'" : ""
}

# Split prefix (currency), value and suffix (factor)
dmap = {'currency': currencies, 'value': values, 'factor': factors}
pattern = r'(?P<currency>[^\d\-\.]+)?(?P<value>[^%KMBT]+)(?P<factor>.+)?'

# Replace and compute the total
df1 = datasample['rawNo'].str.extract(pattern).replace(dmap, regex=True).astype(float)
datasample['ConvNo'] = df1['currency'] * df1['value'] * df1['factor']

Output:

>>> datasample

       rawNo        ConvNo
0     -4.35% -4.350000e-02  # val * 0.01
1     246.6K  2.466000e+05  # val * 1000
2        NaN           NaN
3    $12.76B  1.276000e+10  # 1 * val * 1000000000
4      4.68%  4.680000e-02  # val * 0.01
5   ¥-459.5B -3.216500e+09  # 0.007 * val * 1000000000
6     €-6.8B -7.412000e+09  # 1.09 * val * 1000000000
7  £-15.623B -1.999744e+10  # 1.28 * val * 1000000000
8   $-1,400B -1.400000e+12  # 1 * val * 1000000000

>>> df1
   currency     value        factor
0     1.000    -4.350  1.000000e-02
1     1.000   246.600  1.000000e+03
2     1.000       NaN  1.000000e+00
3     1.000    12.760  1.000000e+09
4     1.000     4.680  1.000000e-02
5     0.007  -459.500  1.000000e+09
6     1.090    -6.800  1.000000e+09
7     1.280   -15.623  1.000000e+09
8     1.000 -1400.000  1.000000e+09
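
As a small aside, if the original currency symbol should also be kept as its own column (rather than only folded into the USD conversion), the same regex can be reused; a sketch relying on the pattern and datasample defined above, with the column name currency_symbol being purely illustrative:

# the currency group captures just the symbol; rows without one get NaN
datasample['currency_symbol'] = datasample['rawNo'].str.extract(pattern)['currency']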
Corralien
  • Thanks, @Corralien! In line for `pattern = ...`, `M` is missing (`^%KMB`). – SultanOrazbayev Jun 19 '23 at 12:43
  • I did try this solution but got an error (could not convert string to float: '3.128T'). I did modify the factors by adding 'T': '1e12' to the list, yet it still fails. There are some numbers in the trillions that are not in my example; I added one and it still did not work. Thank you for your help. – Leo Torres Jun 19 '23 at 18:12
  • You have to modify the pattern too to exclude 'T': `(?P<value>[^%KMBT]+)`. However, you can replace this by `(?P<value>[-.\d]+)` – Corralien Jun 19 '23 at 18:58
  • ah ok Corralien. I added the 'T' and some more with your suggestion. This is what it is now: `'(?P<currency>[^\d\-\.]+)?(?P<value>[^%KMBT,-¥€$]+)(?P<factor>.+)?'`. This completed and gave me a column with all NaN. I had to include the different currency symbols, else I would get this error (could not convert string to float: '¥'). I would get it for all currencies. Did I do something wrong? Thank you for your help. – Leo Torres Jun 19 '23 at 19:32
  • The full pattern should be `r'(?P<currency>[^\d\-\.]+)?(?P<value>[^%KMBT]+)(?P<factor>.+)?'` or `r'(?P<currency>[^\d\-\.]+)?(?P<value>[-.\d]+)(?P<factor>.+)?'`. You don't need to add the currency in the `value` group. – Corralien Jun 19 '23 at 20:06
  • I changed it to [^%KMBT,-]; it was the only way to get it to work, but this would cause an issue with my negative values. Yes, I know I should not have currency signs as part of the exclusion, but if I don't enter them I get an error because it can't convert to float (could not convert string to float: '¥') – Leo Torres Jun 20 '23 at 00:05
  • Can you provide a more relevant sample of your data, please? I can't reproduce your error. I updated my answer to remove special characters. Can you check it please? – Corralien Jun 20 '23 at 01:39
  • Yes I have added a link to the data back in the question. – Leo Torres Jun 20 '23 at 06:45

This is posted as a separate answer, since I want to retain the original reproducible example (in case the linked csv is no longer available).

A careful analysis of the data shows that the non-numeric characters that cause trouble are commas used as thousands separators and single dash symbols (presumably indicating NaN). After incorporating these into the character_mapping, the conversion works without raising any errors:

from numpy import nan
from pandas import read_csv

df = read_csv("https://projectcodesamples.s3.amazonaws.com/ForEx.csv")

character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "T": "e12",
    # note that the currency symbols are mapped to blank string
    # to facilitate the conversion to float
    "£": "",
    "\$": "",
    "¥": "",
    "€": "",
    # if an entry consists of a dash, replace with nan
    "^-$": nan,
    # some numbers have a comma as a thousands separator
    ",": "",
}

df["Converted"] = df["consensus"].replace(character_mapping, regex=True).astype(float)

mask = (df["Converted"] != df["consensus"]) & (df["consensus"].notna())
print(df.loc[mask, ["consensus", "Converted"]])
SultanOrazbayev
  • Thank you! Yes, it runs without error. But I am seeing a huge change in some numbers: row 18, 13K converted to 15000; row 36, -4.5% -> -0.016; row 54, 46.075K -> 55000. There are quite a few; some that look like rounding seem OK, but these are huge differences. – Leo Torres Jun 20 '23 at 10:42
  • The conversion is accurate; the 13K -> 15000 example refers to different columns ("actual" rather than "consensus"). – SultanOrazbayev Jun 20 '23 at 11:51
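
Since the question's target column is "actual" while this answer converts "consensus", the same mapping could be applied to both columns in one pass; a sketch, assuming the mapping above also covers every non-numeric character that appears in "actual" and reusing df and character_mapping from the answer above:

# convert each unit-formatted column into its own numeric column
for col in ["actual", "consensus"]:
    df[col + "_converted"] = df[col].replace(character_mapping, regex=True).astype(float)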