
I am trying to extract sentences from a text using Python. Every word in the text is written on its own line, together with additional info related to that word:

Mary Noun Name
loves Verb No-Name
John Noun Name
. Punct No-Name

The sentence boundaries are marked with an empty line. I want to extract whole sentences that contain words with a particular feature (e.g. sentences with names).

Until now, I have only managed to extract the word of interest, not the whole sentence. I use .readlines() to read the text line by line. I then loop through the lines and use re and .split('\t') to split them, so that every line is represented by a list of 3 elements. I then match an element in the list against the desired value and can extract the related word, but I cannot figure out how to extract the whole sentence.
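Roughly, what I have so far looks like this (a stripped-down sketch of my current approach; the file name and the feature value are just placeholders):

with open('text.txt') as f:
    for line in f.readlines():
        columns = line.strip().split('\t')      # [word, pos, feature]
        if len(columns) == 3 and columns[2] == 'Name':
            print(columns[0])                   # only the word, not its sentence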

Anyone have some advice?

– Helge

4 Answers


You could break the text up by blank lines, collect the types in each sentence into a set, then use that - an untested example...

text="""Mary Noun Name
loves Verb No-Name
John Noun Name
. Punct No-Name

John Noun Name
loves Verb No-Name
Mary Noun Name
. Punct No-Name"""

from itertools import takewhile

sentences = []
lines = iter(text.splitlines())
while True:
    # takewhile(bool, ...) collects lines up to the next blank line,
    # which marks a sentence boundary (and consumes that blank line).
    sentence = list(takewhile(bool, lines))
    if not sentence:
        break
    types = set(el.split()[1] for el in sentence)
    words = [el.split(' ', 1)[0] for el in sentence]
    sentences.append(
        {
        'sentence': sentence,
        'types': types,
        'words': words
        }
    )


print(sum(1 for el in sentences if 'Noun' in el['types']), 'sentences contain Noun')
print(sentences[0]['words'])
– Jon Clements
  • Haven't mentioned this approach though - if you're dealing with a "standard" file format, there may well be a corpus reader that NLTK or a similar package can already manage for you... (which you may be using anyway when dealing with a corpus...) – Jon Clements Oct 15 '12 at 16:50
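For reference, such a reader might look roughly like this (a hedged sketch, assuming the file is plain CoNLL-style columns and repurposing the reader's 'chunk' column type to hold the Name/No-Name feature; the file name and column mapping are assumptions, not anything from the question):

from nltk.corpus.reader import ConllCorpusReader

# Map the three columns onto column types ConllCorpusReader understands;
# 'chunk' here is just a slot for the Name/No-Name feature.
reader = ConllCorpusReader('.', ['file.txt'], ('words', 'pos', 'chunk'))

for sent in reader.iob_sents():     # one list of (word, pos, tag) tuples per sentence
    if any(tag == 'Name' for _, _, tag in sent):
        print(' '.join(word for word, _, _ in sent))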

I'd parse the individual rows into dictionaries, which you can group into lists separated by punctuation (or periods).

sentences = []
columns = ('word', 'pos', 'type')

with open('file.txt', 'r') as handle:
    sentence = []

    for row in handle:
        chunks = row.strip().split('\t')

        # Skip the blank lines between sentences.
        if len(chunks) < 3:
            continue

        structure = dict(zip(columns, chunks))
        sentence.append(structure)

        if structure['pos'] == 'Punct':
            sentences.append(sentence)
            sentence = []

Now, sentences contains lists holding all of the parts of your sentences (if this code works).

I'll leave it to you to figure out how to do the rest. Finding your target sentence should be easy with a couple of for loops.

To print out a sentence given its list, something like this should get you started:

print(' '.join(chunk['word'] for chunk in sentence))
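And to pick out the sentences that contain a word with the 'Name' feature, a rough sketch on top of the structure built above:

for sentence in sentences:
    # Keep any sentence in which at least one word carries the 'Name' feature.
    if any(chunk['type'] == 'Name' for chunk in sentence):
        print(' '.join(chunk['word'] for chunk in sentence))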
– Blender

The existing answers assume the corpus is small enough to read into memory in one go, building a data structure of sentences which you then filter. If that isn't the case (and even if it is now, it may not be in the future), you'll need some sort of generator solution. Take a look at the similar question Python: How to loop through blocks of lines and see if you can make that work for you.
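A minimal sketch of such a generator, assuming the same whitespace-separated three-column format as in the question ('file.txt' is a placeholder):

from itertools import groupby

def sentences(path):
    # Group consecutive lines into blocks separated by blank lines and
    # yield one list of (word, pos, feature) tuples per sentence.
    with open(path) as handle:
        for is_blank, block in groupby(handle, key=lambda line: not line.strip()):
            if not is_blank:
                yield [tuple(line.split()) for line in block]

for sent in sentences('file.txt'):
    if any(feature == 'Name' for _, _, feature in sent):
        print(' '.join(word for word, _, _ in sent))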

Personally, I think people make more work for themselves by forcing use of a single tool. This particular problem is ready-made for a simple awk filter:

awk -v RS='\n\n' -v FS='\n' -v ORS='\n\n' -v OFS='\n' '/ Name/'

Of course, if you're going to do further processing in Python, neither of these points applies.

– reedstrm

You might want to combine Blender's or Jon Clements' solution with storing a pickled result of your parsed sentences, so that next time you can load that information and start searching more quickly.

If your list of sentences does not fit in memory, store the individual sentence information as pickles written sequentially to a file; if you use the binary pickle protocol, store a length indicator before each pickled sentence.
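A minimal sketch of the simple everything-fits-in-memory variant, assuming the sentences list built in one of the answers above ('sentences.pickle' is a placeholder; the sequential, length-prefixed variant is left as an exercise):

import pickle

# Save the parsed sentences once...
with open('sentences.pickle', 'wb') as handle:
    pickle.dump(sentences, handle)

# ...and on later runs load them back instead of re-parsing the text.
with open('sentences.pickle', 'rb') as handle:
    sentences = pickle.load(handle)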

This extra effort is only worth it if you have to search often and parsing takes substantial time (with huge texts).