
Is there any package I can use to remove proper nouns from a sentence using Python?

I know of a few packages like NLTK, Stanford and TextBlob which do the job (they remove names), but they also remove a lot of words which start with a capital letter but are not proper nouns.

Also, I cannot keep a dictionary of names, because it would be huge and would keep growing as more data is added to the DB.

Pri
  • this might help: http://stackoverflow.com/questions/17669952/finding-proper-nouns-using-nltk-wordnet – Neeraj Kumar Sep 22 '16 at 08:48
  • Marking this as duplicate (you asked same question yesterday): http://stackoverflow.com/q/39610137/6313992 – Tomasz Plaskota Sep 22 '16 at 08:49
  • Hi Neeraj, this does what I explained. It treats even words that merely start with a capital letter as proper nouns, although they are not proper nouns – Pri Sep 22 '16 at 09:12
  • Do you just want to remove single words? or what about named entities? – Nathan McCoy Sep 22 '16 at 09:19
  • I'd be tempted to use a dictionary web service to lookup the words and if they don't fall into noun, verb, adjective, etc, they are nouns... Not sure how I'd go about implementing it though. As you said, the dictionary would be huge but it does exist already to an extent. – XtrmJosh Sep 22 '16 at 09:32
  • added answer, should work for single words, but [NER](https://en.wikipedia.org/wiki/Named-entity_recognition) is a research topic in itself – Nathan McCoy Sep 22 '16 at 09:40

1 Answer


If you just want to remove single words that are proper nouns, you can use nltk to POS-tag the sentence and then drop every word whose tag marks it as a proper noun.

>>> import nltk
>>> nltk.tag.pos_tag("I am named John Doe".split())
[('I', 'PRP'), ('am', 'VBP'), ('named', 'VBN'), ('John', 'NNP'), ('Doe', 'NNP')]

The default tagger uses the Penn Treebank POS tagset, which has only two proper noun tags: NNP (singular) and NNPS (plural).

So you can just do the following:

>>> sentence = "I am named John Doe"
>>> tagged_sentence = nltk.tag.pos_tag(sentence.split())
>>> edited_sentence = [word for word, tag in tagged_sentence if tag not in ('NNP', 'NNPS')]
>>> print(' '.join(edited_sentence))
I am named

Now, just as a warning: POS tagging is not 100% accurate and may mistag some ambiguous words. Also, this works word by word, so it will not capture named entities as such, since they are often multiword in nature.
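If you need this in more than one place, the filtering step can be wrapped in a small helper. This is only a sketch: `remove_proper_nouns`, `PROPER_NOUN_TAGS` and `fake_tagger` are names made up for illustration, and `fake_tagger` is a hard-coded stand-in so the snippet runs without NLTK or its tagger model installed.

```python
from typing import Callable, Iterable, List, Tuple

# The two Penn Treebank proper noun tags mentioned above.
PROPER_NOUN_TAGS = frozenset({"NNP", "NNPS"})

# A tagger takes a list of tokens and yields (word, tag) pairs.
Tagger = Callable[[List[str]], Iterable[Tuple[str, str]]]


def remove_proper_nouns(sentence: str, tagger: Tagger) -> str:
    """Drop every whitespace-separated token the tagger marks as NNP/NNPS."""
    tokens = sentence.split()
    kept = [word for word, tag in tagger(tokens) if tag not in PROPER_NOUN_TAGS]
    return " ".join(kept)


def fake_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical stand-in tagger: hard-coded tags, for demonstration only."""
    lookup = {"I": "PRP", "am": "VBP", "named": "VBN", "John": "NNP", "Doe": "NNP"}
    return [(token, lookup.get(token, "NN")) for token in tokens]


print(remove_proper_nouns("I am named John Doe", fake_tagger))  # prints: I am named
```

With NLTK and its tagger model available, you should be able to pass `nltk.tag.pos_tag` as the `tagger` argument instead of the stand-in. The same accuracy caveats apply: the helper is only as good as the tags it is given.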

Nathan McCoy
  • This helped to quite some extent but not completely. And also, is there a way to remove email content if any in a text? – Pri Sep 23 '16 at 06:11
  • what do you mean by email content? maybe you can update your question? also, what did it not remove? – Nathan McCoy Sep 23 '16 at 12:22
  • It removed the names but also removed the words starting with an upper case letter. Probably considered them a proper noun as well. – Pri Sep 26 '16 at 05:13
  • Upper case initial character is a feature used for [NER](https://en.wikipedia.org/wiki/Named-entity_recognition). Sounds like you need to define your use cases and submit another more specific question. – Nathan McCoy Sep 27 '16 at 23:40