I have a data frame named "df1". This data frame has 12 columns. The last column in this data frame is called notes. I need to replace common names like "john, sally and richard" from this column and replace the values with xxxx or something similar. I have a working script that is creating this data frame from MS SQL. I have spent several hours and used various resources to try and get some code that works to do this but I have not been successful. I do not have to use Spacy, but I was told this is a good package to work with. Any help would be appreciated.
Asked
Active
Viewed 642 times
2 Answers
1
You need to use a solution like
import spacy
import pandas as pd
# Test dataframe
df = pd.DataFrame({'notes':["Peter A. Smith came to see Bart in Washington on Tuesday."]})
print(df['notes'])
# => 0 Peter A. Smith came to see Bart in Washington on Tuesday.
## <<PERSON>> came to see <<PERSON>> in <<GPE>> on <<DATE>>.
nlp = spacy.load('en_core_web_trf')
def redact_with_spacy(text: str) -> str:
doc = nlp(text)
newString = text
for e in reversed(doc.ents):
if e.label_ == "PERSON": # Only redact PERSON entities
start = e.start_char
end = start + len(e.text)
newString = newString[:start] + "xxxx" + newString[end:]
return newString
df['notes'] = df['notes'].apply(redact_with_spacy)
print(df['notes'])
Output:
0 xxxx came to see xxxx in Washington on Tuesday.
Note you may adjust the "xxxx"
in the redact_with_spacy
function. E.g., you may replace the found entity with the same amount of x
s if you use newString = newString[:start] + ("x" * len(e.text)) + newString[end:]
. Or, to keep spaces, newString = newString[:start] + "".join(["x" if not x.isspace() else " " for x in e.text]) + newString[end:]
.

Wiktor Stribiżew
- 607,720
- 39
- 448
- 563
-
Hi Wiktor, thank you for the help. I tried this code and it somewhat worked with instances like "lisa denied help" or 'susan did not like the medical recommendations". But it would not redact names like michael or jerome, or kent in my testings.. When used with real PHI type scenarios it appears to miss 98% of the names. Especially when there is a first name and last name. Its almost if it ignored them. Any thoughts? – Amicheals Feb 14 '22 at 18:42
-
@Amicheals If the model did not "see" these names, it won't identify them as entities. Use a regex to mask these names since it seems you have a got a list. I can watch my YT video on how to build efficient regex from a word list [here](https://www.youtube.com/watch?v=YlN-ZBa5DvU). [Here](https://stackoverflow.com/questions/68274867/optimize-long-lists-of-fixed-string-alternatives-in-regex/68275882#68275882), you can generate a quick regex trie. – Wiktor Stribiżew Feb 14 '22 at 22:12
-
Wiktor, Thank you for that, and I will watch that video. So maybe this is me not understanding how this model works, but I found a "solution" and this model is now catching about 99% of names. Instead of if e.label_ == "PERSON": . I have switched it to if e.label_ == "PERSON" or "Names": . I do not even think "names" is a built in entity mapping. Using just "Names" alone does not redact anything but both together seem to work. Why would this solve my issue? I start my masters in data science next month so I will learn more then about models. – Amicheals Feb 15 '22 at 01:49
-
@Amicheals If you use `if e.label_ == "PERSON" or "Names":` it means you do not filter by entity label as `"Names"` is a str and always coerces to *True*. – Wiktor Stribiżew Feb 15 '22 at 08:13
-1
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_md") #model
Here I'm removing all labels in the data frame column, you can select which labels to remove
nlp.get_pipe('ner').labels
'CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART'
def replace_text(mytext):
labels = list(['PERSON']) #here selecting PERSON labels from the model
doc = nlp(mytext)
labels_to_remove = []
final_string = str(mytext)
for sent in doc.ents:
if sent.label_ in labels:
labels_to_remove.append(str(sent.text))
for n in range(len(labels_to_remove)):
final_string = final_string.replace(labels_to_remove[n],'[REDACTED]')
return final_string
df['Sentences_Redacted_Function'] = df['Sentences'].apply(lambda x:replace_text(x))

rad15f
- 29
- 3
-
I fixed it, sorry for the typos. I'm trying to get better at StackOverflow – rad15f Nov 25 '22 at 14:55