4

I have a large dataset that I had to clean. Now, simplifying, I have this:

   A  B  C  D
1  1  5  2  2
4  2  5  3  1
5  3  3  2  1
8  4  1  4  4

So, the values in each column go from 1 to 5. Now I want to transform these 4 columns into 5 dummy columns, counting at the same time how many times each value appears in each row, so that I get this:

   S_1 S_2 S_3 S_4 S_5
1   1   2   0   0   1
4   1   1   1   0   1
5   1   1   2   0   0
8   1   0   0   3   0

So "S_1" is the number of 1s in each row, "S_2" the number of 2s, and so on.

I guess this is possible with a pivot table, but I can't do it. Can anybody help me, please?

bonaqua

2 Answers

2

One approach is to use collections.Counter:

import pandas as pd
from collections import Counter

data = [[1, 5, 2, 2],
        [2, 5, 3, 1],
        [3, 3, 2, 1],
        [4, 1, 4, 4]]

df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'], index=[1, 4, 5, 8])

total = {k: 0 for k in range(1, 6)}

result = pd.DataFrame([{**total, **Counter(row)} for row in df.values], index=df.index)

result = result.rename(columns={k: f'S_{k}' for k in total}).fillna(0)

print(result)

Output

   S_1  S_2  S_3  S_4  S_5
1    1    2    0    0    1
4    1    1    1    0    1
5    1    1    2    0    0
8    1    0    0    3    0

Use Counter to count the occurrences per row; the expression:

{**total, **Counter(row)}

merges the two dictionaries, so every value that does not occur in the row keeps a count of 0.
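To make the merge concrete, here is a minimal sketch of what the `**` unpacking does for a single row (later keys win, so the Counter's counts override the zeros):

```python
from collections import Counter

total = {k: 0 for k in range(1, 6)}   # zero count for every possible value 1..5
row = [1, 5, 2, 2]                    # first row of the example data

# Unpacking merges the dicts left to right; entries from Counter(row)
# replace the zeros for the values that actually occur in the row
merged = {**total, **Counter(row)}
print(merged)  # {1: 1, 2: 2, 3: 0, 4: 0, 5: 1}
```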

Dani Mesejo
  • Yes, it worked! However, the '0's are 'NaN'. But that is no problem, because I can replace the NaNs with 0. However, the first column is '0', and only then does it start with "S_1", "S_2", and so on... do you know why the first column is named '0'? – bonaqua Mar 31 '20 at 18:35
  • @bonaqua I updated the answer with a full example, filling the possible nan with 0 and the right index. The first column is the index – Dani Mesejo Mar 31 '20 at 18:41
  • In my dataset (a little bit more complex), it still appears the first column as '0', and only the second column is "S_1", can't understand why. It is like this. array([0, 'S_1', 'S_2', 'S_3', 'S_4', 'S_5', 'S_6', 'S_7', 'S_8', 'S_9', 'S_10', 'S_11', 'S_12', 'S_13', 'S_14', 'S_15', 'S_16', 'S_17', 'S_18', 'S_19', 'S_20', 'S_21'], dtype=object) – bonaqua Mar 31 '20 at 18:53
  • How are you defining total? Notice that range starts from 1, both times. I updated the answer to remove the second range iterable – Dani Mesejo Mar 31 '20 at 18:55
  • I updated the range to (1, 22); however, the first column still appears as '0'. Using your code, when I do df.columns.values, it gives me array([0, 'S_1', 'S_2', 'S_3', 'S_4', 'S_5', 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], dtype=object) – bonaqua Mar 31 '20 at 19:13
  • It is going to be hard to answer without actually looking at your example. Did you update it twice? Or did you only apply the last change? It seems you only updated once; notice it is only updating those from 1 to 6 – Dani Mesejo Mar 31 '20 at 19:14
  • I understand that it is hard. I restarted the kernel and ran everything again and it still gives me that. If I change to "for k in range(1, 22)", then array([0, 'S_1', 'S_2', 'S_3', 'S_4', 'S_5', 'S_6', 'S_7', 'S_8', 'S_9', 'S_10', 'S_11', 'S_12', 'S_13', 'S_14', 'S_15', 'S_16', 'S_17', 'S_18', 'S_19', 'S_20', 'S_21']. The only thing I cannot understand is that '0' column – bonaqua Mar 31 '20 at 19:46
  • Do you have 0 as a value in the data? – Dani Mesejo Mar 31 '20 at 19:47
  • Oh yes! That was it! I have already made the changes: for k in range(0, 22). Thank you so much! Let me ask you just one more thing, please. What exactly is this "**" in "{**total, **Counter(row)}"? What does it do? – bonaqua Mar 31 '20 at 19:53
  • See this: https://stackoverflow.com/questions/21809112/what-does-tuple-and-dict-means-in-python, basically it finds the union of two dictionaries – Dani Mesejo Mar 31 '20 at 19:56
  • Alright! And why here "result = result.rename(columns={k: f'S_{k}' for k in total}).fillna(0) " you put that "f" before 'S_{k}'? – bonaqua Mar 31 '20 at 19:57
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/210689/discussion-between-dani-mesejo-and-bonaqua). – Dani Mesejo Mar 31 '20 at 20:04
  • Hey Dani Mesejo! I had to apply that code again to another part that is basically the same, but where I have 29 different codes (from 0 to 28) instead of 22. And I changed the code to range(0, 29). – bonaqua Apr 26 '20 at 00:36
  • However, .columns.values gives me this "'S_1', 'S_2', 'S_3', 'S_4', 'S_5', 'S_6', 'S_7', 'S_8', 'S_9', 'S_10', 'S_11', 'S_12', 'S_13', 'S_14', 'S_15', 'S_16', 'S_17', 'S_18', 'S_19', 'S_20', 'S_21', 'S_22', 'S_23', 'S_24', 'S_25', 'S_26', 'S_27', 'S_28', '12', '22', '23', '25', '20', '21', '1', '8', '26', '3', '2', '28', '4', '7', '27', '9', '24', '13', '10', '18', '5', '17', '6', '0', '19', '16', '11', '15', '14'" I really don't know where the "12, 22, 23, ..." comes from. I know it's difficult to answer without seeing it, but do you have any idea why this is happening? – bonaqua Apr 26 '20 at 00:37
  • 1
    Check the type of the values, they could be string, instead of int. – Dani Mesejo Apr 26 '20 at 09:06
  • Can you make this function more generic, to handle occurrence counts for any dummy data? – Golden Lion Aug 06 '22 at 14:40
0

You can try this; hope it helps:

import pandas as pd
from collections import defaultdict  # dictionary with a default value

df = pd.DataFrame(
    [[1, 5, 2, 2],
     [2, 5, 3, 1],
     [3, 3, 2, 1],
     [4, 1, 4, 4]],
    columns=['A', 'B', 'C', 'D'], index=[1, 4, 5, 8])

categories = [1, 2, 3, 4, 5]

# Count per row
rows_counts = []
for idx in df.index:
    dict_counts = defaultdict(int)

    # Get the row as a list so we can use list.count()
    row = df.loc[idx, :].tolist()

    # Count occurrences of each category
    for category in categories:
        dict_counts[category] = row.count(category)

    # Append results
    rows_counts.append(dict_counts)

# Get the desired output, keeping the original index
new_df = pd.DataFrame(rows_counts, index=df.index)
new_df.columns = ['S_' + str(cat) for cat in new_df.columns]
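As a more compact alternative, here is a sketch using only pandas (assuming the values are integers): pd.Series.value_counts applied row-wise does the counting, and reindexing fills in the categories that never occur.

```python
import pandas as pd

df = pd.DataFrame([[1, 5, 2, 2],
                   [2, 5, 3, 1],
                   [3, 3, 2, 1],
                   [4, 1, 4, 4]],
                  columns=['A', 'B', 'C', 'D'], index=[1, 4, 5, 8])

# Count each value per row, add the categories that never occur,
# and prefix the column labels to get S_1 ... S_5
result = (df.apply(pd.Series.value_counts, axis=1)
            .reindex(columns=range(1, 6))
            .fillna(0)
            .astype(int)
            .add_prefix('S_'))
print(result)
```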

Jorge Avila