
I want to edit my text like this:

arr = [] 
# arr is full of tokenized words from my text

For example:

"Abraham Lincoln Hotel is very beautiful place and i want to go there with
 Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."

Edit: Basically, I want to detect proper names and group them using istitle() and isalpha() in a for loop, like:

for word in arr:
    if word.istitle() and word.isalpha():
        ...  # add this word to the current group of capitalized words

In the example, words are appended to the current group until the next word does not start with an uppercase letter.

arr[0] = arr[0] + " " + arr[1] + " " + arr[2]
# 'Abraham Lincoln Hotel'

This is what I want my new arr to look like:

['Abraham Lincoln Hotel'] is very beautiful place and i want to go there with ['Barbara Palvin']. ['Also'] there are stores like ['Adidas'], ['Nike'], ['Reebok'].

"Also" is not a problem for me; it will be useful when I try to match against my dataset.

Arda Nalbant
  • Possible duplicate of [Finding Proper Nouns using NLTK WordNet](http://stackoverflow.com/questions/17669952/finding-proper-nouns-using-nltk-wordnet) – Selcuk Apr 18 '16 at 07:52
  • I want a basic python code and this always returns proper names without grouping them but thanks anyway. – Arda Nalbant Apr 18 '16 at 08:02
  • You can't write *basic Python code* to return proper names. It's not that easy, and you need to use `NLTK` in order to achieve it. – Avión Apr 18 '16 at 08:12
  • The problem with using `istitle()` is that it will also pick up the word `Also`, since it is capitalized after the period. – Avión Apr 18 '16 at 08:21
  • I will check the proper names against my datasets. That's why I don't want to use NLTK. I am creating my own language-processing program. – Arda Nalbant Apr 18 '16 at 08:24

2 Answers


You could do something like this:

sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas, Nike, Reebok."
all_words = sentence.split()
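last_word_index = -100  # sentinel: no capitalized word seen yet
proper_nouns = []
for idx, word in enumerate(all_words):
    # Tokens with attached punctuation (e.g. 'Palvin.') fail isalpha()
    if word.istitle() and word.isalpha():
        if last_word_index == idx - 1:
            # The previous word was capitalized too: extend the last group
            proper_nouns[-1] = proper_nouns[-1] + " " + word
        else:
            # Start a new group
            proper_nouns.append(word)
        last_word_index = idx
print(proper_nouns)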

This code will:

  • Split all the words into a list
  • Iterate over all of the words and
    • If the last capitalized word was the previous word, it will append it to the last entry in the list
    • else it will store the word as a new entry in the list
    • Record the last index that a capitalized word was found
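
Because `split()` leaves punctuation attached to the words, tokens such as 'Palvin.' are skipped by the check above. A minimal sketch of one way to handle that, assuming that any punctuation mark should end the current group; the function name `group_capitalized` is made up for this illustration:

```python
import string

def group_capitalized(sentence):
    """Group runs of adjacent capitalized words; punctuation ends a run."""
    groups = []
    current = []  # words of the run currently being built

    def flush():
        # Store the current run (if any) as one entry and start a new run
        if current:
            groups.append(" ".join(current))
            current.clear()

    for token in sentence.split():
        word = token.strip(string.punctuation)
        if token[0] in string.punctuation:
            flush()  # leading punctuation such as ",Nike" closes the run
        if word.istitle() and word.isalpha():
            current.append(word)
        else:
            flush()  # a non-capitalized word closes the run
        if token[-1] in string.punctuation:
            flush()  # trailing punctuation such as "Palvin." closes the run
    flush()
    return groups

sentence = ("Abraham Lincoln Hotel is very beautiful place and i want to go "
            "there with Barbara Palvin. Also there are stores like "
            "Adidas ,Nike , Reebok.")
print(group_capitalized(sentence))
# ['Abraham Lincoln Hotel', 'Barbara Palvin', 'Also', 'Adidas', 'Nike', 'Reebok']
```

Note that `istitle()` rejects names with inner capitals such as "McDonald", so this remains a rough heuristic.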
arbylee
  • This outputs `['Abraham Lincoln Hotel', 'Barbara', 'Also']`, not `['Abraham', 'Lincoln', 'Hotel', 'Barbara', 'Palvin.', 'Adidas', 'Nike', 'Reebok.']` – Avión Apr 18 '16 at 09:29
  • Words like "Also" or "Because" won't be a problem for me, because they won't match my datasets, which are full of organization, location and person names. So any solution like ['Abraham Lincoln Hotel'], ['Barbara Palvin'], ['Adidas'], ['Nike'], ['Reebok'] will be useful, because later I will pass the grouped words to my functions as inputs. – Arda Nalbant Apr 18 '16 at 10:53
  • The code you wrote did what I want, but only for the first name. The output is: ['Abraham Lincoln Hotel', 'Barbara', 'Also'] – Arda Nalbant Apr 18 '16 at 10:58

Is this what you are asking?

sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."

chars = ".!?,"                                   # Characters you want to remove from the words in the array

table = str.maketrans(chars, " " * len(chars))   # Create a table for replacing characters
sentence = sentence.translate(table)             # Replace characters with spaces

arr = sentence.split()                           # Split the string into an array wherever whitespace occurs

print(arr)

The output is:

['Abraham',
 'Lincoln',
 'Hotel',
 'is',
 'very',
 'beautiful',
 'place',
 'and',
 'i',
 'want',
 'to',
 'go',
 'there',
 'with',
 'Barbara',
 'Palvin',
 'Also',
 'there',
 'are',
 'stores',
 'like',
 'Adidas',
 'Nike',
 'Reebok']

Note about this code: any character in the chars variable will be removed from the strings in the array. The explanation is in the code comments.

To remove the non-names just do this:

import string
new_arr = []

for i in arr:
    if i[0] in string.ascii_uppercase:
        new_arr.append(i)

This code will include ALL words that start with a capital letter, including words such as "Also" that are capitalized only because they begin a sentence.

To fix that you will need to change chars to:

chars = ","

And change the above code to:

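import string
new_arr = []
end = ".!?"

for idx, word in enumerate(arr):
    # A word right after sentence-ending punctuation only starts a sentence
    starts_sentence = idx > 0 and arr[idx - 1][-1] in end
    if word[0] in string.ascii_uppercase and not starts_sentence:
        new_arr.append(word)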

And that will output:

['Abraham', 
'Lincoln', 
'Hotel', 
'Barbara', 
'Palvin.', 
'Adidas', 
'Nike',
'Reebok.']
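
If the capitalized words should instead be merged in place while the other tokens are kept, as in the question's desired output, one possible sketch uses `itertools.groupby`. The helper names `is_name_word` and `merge_name_runs` are hypothetical, and the sketch assumes the tokens carry no punctuation, so names that were separated only by commas would be merged into one entry:

```python
from itertools import groupby

def is_name_word(word):
    # Assumption: a "name word" is alphabetic and title-cased
    return word.istitle() and word.isalpha()

def merge_name_runs(tokens):
    """Join each run of adjacent name words into one entry; keep other tokens."""
    merged = []
    for is_name, run in groupby(tokens, key=is_name_word):
        run = list(run)
        if is_name:
            merged.append(" ".join(run))  # one entry per run of names
        else:
            merged.extend(run)            # ordinary words stay separate
    return merged

tokens = ["Abraham", "Lincoln", "Hotel", "is", "very", "beautiful", "place"]
print(merge_name_runs(tokens))
# ['Abraham Lincoln Hotel', 'is', 'very', 'beautiful', 'place']
```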
Ciprum
  • This is not the correct approach. I mean, it's impossible for the OP to make a list of all the words that *are not proper names*. – Avión Apr 18 '16 at 08:18
  • Edited. @ArdaNalbant You should find more criteria that fits or does not fit the names you need to identify so the program is more precise. – Ciprum Apr 18 '16 at 08:27
  • The output is what I needed, let me try it. Good job here. – Arda Nalbant Apr 18 '16 at 08:28
  • I declare utf-8; will it affect the output? Edit: nope, it didn't. – Arda Nalbant Apr 18 '16 at 08:33
  • No. It will not. (at least not on Python 3) – Ciprum Apr 18 '16 at 08:37
  • Is there a way to group them? The output is okay, but I want my current arr changed like this: arr[0]="[Abraham Lincoln Hotel]" arr[12]="[Barbara Palvin]". Thank you. – Arda Nalbant Apr 18 '16 at 08:46
  • You could group all consecutive words that start with a capital letter; however, names like "Republic of the..." that contain words starting with lowercase letters would then be separated. – Ciprum Apr 18 '16 at 08:52
  • The code you wrote won't append Nike and Reebok. How can I handle the punctuation before a specific word? – Arda Nalbant Apr 18 '16 at 09:10