In my dataframe:

df = pd.DataFrame(zip(datetimes, from_, message), columns=['timestamp', 'sender', 'message'])
df['timestamp'] = pd.to_datetime(df.timestamp, format='%d/%m/%Y, %I:%M %p')

There are some problematic values, defined by a clear pattern:

    timestamp                              sender                               message
    113381 2020-06-04 11:59:24              Jose                               bom te ver feliz\r\n
    113382 2020-06-04 11:59:29              Jose                                              ❤\r\n
    113383 2020-06-04 11:59:40              Maria                Estar bem com você me faz feliz\r\n
    113384 2020-06-04 12:00:57              Maria   Estava falando com uma amiga de infância aque...
    113385 2020-06-04 12:01:14              Maria           Ela teve uma briga feia com o marido\r\n
    113386 2020-06-04 12:01:24   Maria: ‎<attached        00113509-PHOTO-2020-06-04-12-01-25.jpg>\r\n
    113387 2020-06-04 12:02:54              Maria                       e assim leva-se a vida, um\n
    113388 2020-06-04 12:03:21              Maria                  Pelo menos ela riu isso ajuda\r\n
    113389 2020-06-04 13:06:39    Jose: ‎<attached        00113512-PHOTO-2020-06-04-13-06-40.jpg>\r\n

Names will always vary, and could well be:

John
John: <attached
Mary
Mary: <attached

But : <attached will always be there.


How do I perform a string replacement that corrects these values, regardless of the name, so that I end up with:

    timestamp                              sender                               message
    113381 2020-06-04 11:59:24              Jose                               bom te ver feliz\r\n
    113382 2020-06-04 11:59:29              Jose                                              ❤\r\n
    113383 2020-06-04 11:59:40              Maria                Estar bem com você me faz feliz\r\n
    113384 2020-06-04 12:00:57              Maria   Estava falando com uma amiga de infância aque...
    113385 2020-06-04 12:01:14              Maria           Ela teve uma briga feia com o marido\r\n
    113386 2020-06-04 12:01:24              Maria        00113509-PHOTO-2020-06-04-12-01-25.jpg>\r\n
    113387 2020-06-04 12:02:54              Maria                       e assim leva-se a vida, um\n
    113388 2020-06-04 12:03:21              Maria                  Pelo menos ela riu isso ajuda\r\n
    113389 2020-06-04 13:06:39              Jose       00113512-PHOTO-2020-06-04-13-06-40.jpg>\r\n
8-Bit Borges

3 Answers

2

This should work:

df['sender'] = df['sender'].str.replace(': \u200e<attached', '', regex=False)
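If it is unclear whether every affected row really carries the invisible \u200e (left-to-right mark), a regular expression that treats the mark as optional covers both cases. This is only a minimal sketch on made-up sample data, not the asker's real export, and the pattern assumes the marker always sits at the end of the sender value:

```python
import pandas as pd

# Hypothetical sample: one row with the invisible U+200E mark, one without it.
df = pd.DataFrame({'sender': ['Jose', 'Maria: \u200e<attached', 'Jose: <attached']})

# ':' + optional whitespace + optional U+200E + '<attached' at the end of the string.
pattern = r':\s*\u200e?<attached$'
df['sender'] = df['sender'].str.replace(pattern, '', regex=True)

print(df['sender'].tolist())
```

Passing regex=True explicitly avoids depending on the default, which differs between pandas versions.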
Sy Ker
  • Weird, I just tested it on my side and it checked out. Is the data type of the column string? – Sy Ker Jun 06 '20 at 01:31
2

data

df = pd.DataFrame({'sender': ['Jose','Jose','Maria','Maria','Maria','Maria: <attached','Maria','Maria','Jose: <attached']})

Solution

df.sender = df.sender.str.split(': <attached').str[0]

   sender
0   Jose
1   Jose
2   Maria
3   Maria
4   Maria
5   Maria
6   Maria
7   Maria
8   Jose
wwnde
  • Edited. Which version are you on? It worked on three different PCs for me – wwnde Jun 06 '20 at 01:29
  • I included my data sample. Maybe you can look into it and tell me how it differs from yours, because I couldn't reproduce your sample any other way and the code works for me. – wwnde Jun 06 '20 at 01:32
  • @wwnde I suspect that the source of the issue is that 8-Bit Borges' dataframe would look something like this if he sent it to a dict: `df = pd.DataFrame({'sender': ['Jose','Jose','Maria','Maria','Maria','Maria: \u200e – David Erickson Jun 06 '20 at 01:54
  • The \u200e is the source of your problems. – Sy Ker Jun 06 '20 at 02:41
2

8-Bit Borges, you may have a \u200e character (the invisible left-to-right mark) in your data. I have run into similar issues where split appears to do nothing because of hidden characters like this. This is how I found it:

a = df['sender'].to_dict()

Sending the column to a dict showed me the actual value, which contained : \u200e<attached. Then I simply did:

df['sender'] = df['sender'].str.split(': \u200e<attached').str[0]

More information about \u200e here: decoding \u200e to string
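
Another way to spot such invisible characters, without converting to a dict, is to print an ASCII-escaped version of the values; the standard-library unicodedata module can then name whatever turns up. A small sketch, again on assumed sample data rather than the real chat export:

```python
import unicodedata
import pandas as pd

# Hypothetical sample containing the invisible U+200E mark.
s = pd.Series(['Maria', 'Maria: \u200e<attached'])

# ascii() escapes every non-ASCII character, so hidden ones become visible.
escaped = s.map(ascii).tolist()
print(escaped)

# Collect the names of all non-ASCII characters found in the column.
hidden = {
    unicodedata.name(ch, 'UNKNOWN')
    for value in s
    for ch in value
    if ord(ch) > 127
}
print(hidden)
```

Note that a real chat log with accented letters or emoji would list those too, so the set is a starting point for inspection rather than a list of characters to delete.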

David Erickson