Given a Koalas DataFrame:

import databricks.koalas as ks

df = ks.DataFrame({"high_risk": [0, 1, 0, 1, 1],
                   "medium_risk": [1, 0, 0, 0, 0]
                   })

Running a lambda function to get a new column based on the existing column values:

df = df.assign(risk=lambda x: "High" if x.high_risk else ("Medium" if x.medium_risk else "Low"))
df
Out[72]: 
   high_risk  medium_risk  risk
0          0            1  High
4          1            0  High
1          1            0  High
2          0            0  High
3          1            0  High

Expected return:

       high_risk  medium_risk  risk
    0          0            1  Medium
    4          1            0  High
    1          1            0  High
    2          0            0  Low
    3          1            0  High

Why does this assign "High" to every row? The intent is to operate on each row; is the comparison looking at the whole column instead?

Ben.T
ratchet
  • Is it mandatory to use `assign`? It seems complicated to use it the way you want for now. I am thinking about a workaround, but I am not sure about the computational cost. – Ben.T Oct 11 '19 at 14:34
  • Not mandatory; however, my understanding is that Koalas does not support df["risk"] = df[] for column assignment. – ratchet Oct 11 '19 at 15:53

2 Answers

1

Using assign this way on a Koalas DataFrame does not seem easy to me, but for your case I would multiply the 'high_risk' column by 2, add the 'medium_risk' column, and finally map the result: 2 becomes 'high' (because that column was multiplied by 2), 1 becomes 'medium', and 0 becomes 'low':

df = df.assign(risk= df.high_risk.mul(2).add(df.medium_risk)
                       .map({0:'low', 1:'medium', 2:'high'}))
df
   high_risk  medium_risk    risk
0          0            1  medium
1          1            0    high
2          0            0     low
3          1            0    high
4          1            0    high

Note: this would fail if a row has 1 in both the high_risk and medium_risk columns, because the sum would then be 3, which has no entry in the mapping.
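The encoding behind this approach, and the overlap case from the note, can be checked with a minimal plain-Python sketch (no Koalas or Spark needed). The lists mirror the sample data from the question. The extra mapping entry for code 3 is only an assumption about how one might want to resolve the overlap (treating "high" as taking priority); it is not part of the answer above.

```python
# Sample columns from the question, as plain lists.
high_risk = [0, 1, 0, 1, 1]
medium_risk = [1, 0, 0, 0, 0]

# Same mapping as in the answer: code = 2 * high_risk + medium_risk.
labels = {0: "low", 1: "medium", 2: "high"}

codes = [2 * h + m for h, m in zip(high_risk, medium_risk)]
risk = [labels[c] for c in codes]
print(codes)  # [1, 2, 0, 2, 2]
print(risk)   # ['medium', 'high', 'low', 'high', 'high']

# If both flags were 1, the code would be 3, which the mapping lacks.
overlap_code = 2 * 1 + 1
print(overlap_code in labels)  # False

# Assumption: let "high" win when both flags are set.
labels_with_overlap = {**labels, 3: "high"}
print(labels_with_overlap[overlap_code])  # high
```

With the extra entry, the same `map` call would cover all four combinations of the two flags, provided that "high wins" is the intended rule.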

Ben.T
0
def function1(ss:ks.Series):
    if ss.high_risk==1:
        return "High"
    elif ss.medium_risk==1:
        return "Medium"
    else:
        return "Low"

col1=df.apply(function1,axis=1)
df.join(col1.rename("risk"))

out:

       high_risk  medium_risk  risk
    0          0            1  Medium
    4          1            0  High
    1          1            0  High
    2          0            0  Low
    3          1            0  High
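As a minimal sketch in plain Python (no Koalas or Spark needed), the following contrasts the row-wise function above with the one-shot conditional from the question. It assumes, as the pandas-style `assign` documentation describes, that a callable passed to `assign` is called once with the whole DataFrame rather than once per row. `Column` is a hypothetical stand-in for a column object that is always truthy. That behaviour is consistent with the all-"High" output in the question, but it has not been verified against Koalas internals.

```python
from types import SimpleNamespace


def classify(row):
    """Row-wise logic, same as function1 above: sees one row's scalars."""
    if row.high_risk == 1:
        return "High"
    elif row.medium_risk == 1:
        return "Medium"
    return "Low"


high = [0, 1, 0, 1, 1]
medium = [1, 0, 0, 0, 0]

# One call per row: each conditional sees scalar values.
rows = [SimpleNamespace(high_risk=h, medium_risk=m) for h, m in zip(high, medium)]
per_row = [classify(r) for r in rows]
print(per_row)  # ['Medium', 'High', 'Low', 'High', 'High']


class Column:
    """Hypothetical stand-in for a whole column (not a Koalas class).

    It defines neither __bool__ nor __len__, so Python treats it as truthy.
    """

    def __init__(self, values):
        self.values = list(values)


# One call for the whole table: the conditional runs once on column objects.
table = SimpleNamespace(high_risk=Column(high), medium_risk=Column(medium))
label = (lambda x: "High" if x.high_risk else ("Medium" if x.medium_risk else "Low"))(table)
print(label)  # High
```

Under these assumptions the lambda returns the single string "High", which is then assigned to every row, whereas the row-wise function evaluates the condition separately for each row.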
G.G