Using spacy to redact names from a column in a data frame

Question

I have a data frame named "df1". This data frame has 12 columns. The last column in this data frame is called notes. I need to replace common names like "john, sally and richard" from this column and replace the values with xxxx or something similar. I have a working script that is creating this data frame from MS SQL. I have spent several hours and used various resources to try and get some code that works to do this but I have not been successful. I do not have to use Spacy, but I was told this is a good package to work with. Any help would be appreciated.

Wiktor Stribiżew · Accepted Answer · 2022-02-11T10:24:24.827

1

You need to use a solution like

import spacy
import pandas as pd

# Test dataframe
df = pd.DataFrame({'notes':["Peter A. Smith came to see Bart in Washington on Tuesday."]})

print(df['notes'])
# => 0    Peter A. Smith came to see   Bart     in   Washington on  Tuesday.
##        <<PERSON>>     came to see <<PERSON>> in     <<GPE>>  on <<DATE>>.

nlp = spacy.load('en_core_web_trf')

def redact_with_spacy(text: str) -> str:
    doc = nlp(text)
    newString = text
    for e in reversed(doc.ents):
        if e.label_ == "PERSON": # Only redact PERSON entities
            start = e.start_char
            end = start + len(e.text)
            newString = newString[:start] + "xxxx" + newString[end:]
    return newString

df['notes'] = df['notes'].apply(redact_with_spacy)
print(df['notes'])

Output:

0    xxxx came to see xxxx in Washington on Tuesday.

Note you may adjust the "xxxx" in the redact_with_spacy function. E.g., you may replace the found entity with the same amount of xs if you use newString = newString[:start] + ("x" * len(e.text)) + newString[end:]. Or, to keep spaces, newString = newString[:start] + "".join(["x" if not x.isspace() else " " for x in e.text]) + newString[end:].

edited Feb 11 '22 at 10:24

answered Feb 11 '22 at 10:15

Wiktor Stribiżew

607,720
39
448
563

Hi Wiktor, thank you for the help. I tried this code and it somewhat worked with instances like "lisa denied help" or 'susan did not like the medical recommendations". But it would not redact names like michael or jerome, or kent in my testings.. When used with real PHI type scenarios it appears to miss 98% of the names. Especially when there is a first name and last name. Its almost if it ignored them. Any thoughts? – Amicheals Feb 14 '22 at 18:42
@Amicheals If the model did not "see" these names, it won't identify them as entities. Use a regex to mask these names since it seems you have a got a list. I can watch my YT video on how to build efficient regex from a word list [here](https://www.youtube.com/watch?v=YlN-ZBa5DvU). [Here](https://stackoverflow.com/questions/68274867/optimize-long-lists-of-fixed-string-alternatives-in-regex/68275882#68275882), you can generate a quick regex trie. – Wiktor Stribiżew Feb 14 '22 at 22:12
Wiktor, Thank you for that, and I will watch that video. So maybe this is me not understanding how this model works, but I found a "solution" and this model is now catching about 99% of names. Instead of if e.label_ == "PERSON": . I have switched it to if e.label_ == "PERSON" or "Names": . I do not even think "names" is a built in entity mapping. Using just "Names" alone does not redact anything but both together seem to work. Why would this solve my issue? I start my masters in data science next month so I will learn more then about models. – Amicheals Feb 15 '22 at 01:49
@Amicheals If you use `if e.label_ == "PERSON" or "Names":` it means you do not filter by entity label as `"Names"` is a str and always coerces to *True*. – Wiktor Stribiżew Feb 15 '22 at 08:13

rad15f · Answer 2 · 2022-11-25T14:57:03.180

import pandas as pd
import spacy
nlp = spacy.load("en_core_web_md") #model

Here I'm removing all labels in the data frame column, you can select which labels to remove

nlp.get_pipe('ner').labels

'CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART'

def replace_text(mytext):
    labels = list(['PERSON']) #here selecting PERSON labels from the model
    doc = nlp(mytext)
    labels_to_remove = []
    final_string = str(mytext)
    for sent in doc.ents:
        if sent.label_ in labels:
            labels_to_remove.append(str(sent.text))
    for n in range(len(labels_to_remove)):
        final_string = final_string.replace(labels_to_remove[n],'[REDACTED]')
    return final_string

df['Sentences_Redacted_Function'] = df['Sentences'].apply(lambda x:replace_text(x))

I fixed it, sorry for the typos. I'm trying to get better at StackOverflow — rad15f, Nov 25 '22 at 14:55

Using spacy to redact names from a column in a data frame

2 Answers2