Split a string into its sentences using python

Question

I have the following string:

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

Now, I want to split it into two sentences.

However, when I do:

string.split('.')

I get:

['This is one sentence  ${w_{1},',
 '',
 ',w_{i}}$',
 ' This is another sentence',
 ' ']

Does anyone have an idea how to improve this, so that the "." inside the $ $ is not treated as a delimiter?

Also, how would you go about this:

string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

EDIT 1:

The desired outputs would be:

For string 1:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence']

For string 2:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe !  ']

One thing you should consider is that in LaTeX, the proper ellipsis is `\ldots`, not `...`. — Zev Chonoles, Apr 11 '19 at 20:25
You're setting `.` as your delimiter, which is why it will split up your string at every `.` it finds, regardless of its context within the string. — Anshuman Dikhit, Apr 11 '19 at 20:25
@ZevChonoles Yes, definitely. I will change that. However, the question still remains, since I have other cases, where I cannot simply replace it. — henry, Apr 11 '19 at 20:28

score 3 · Accepted Answer · answered Apr 11 '19 at 20:31

For the more general case, you could use re.split like so:

import re

mystr = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', '']

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']

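The characters in the brackets are the ones you treat as sentence-ending punctuation, and \s{1,} requires at least one whitespace character after them, which skips the other .'s that have no whitespace following. This also handles your exclamation point case. Note that the result ends with an empty string whenever the text ends with punctuation followed by whitespace.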
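
As the outputs above show, re.split leaves an empty string at the end of the list. A minimal sketch that wraps the same idea and drops blank pieces; the helper name split_sentences is made up for illustration and is not part of the re module:

```python
import re

# Sentence terminator followed by at least one whitespace character.
SENTENCE_END = re.compile(r"[.!?]\s+")

def split_sentences(text):
    """Split on terminator + whitespace and drop empty or blank pieces."""
    return [part for part in SENTENCE_END.split(text) if part.strip()]

mystr = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '
print(split_sentences(mystr))
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence']
```

A final terminator with no whitespace after it is not matched, so it stays attached to the last piece.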

Here's a (somewhat hacky) way to get the punctuation back:

punct = re.findall(r"[.!?]\s{1,}", str2)
# ['! ', '. ', '? ', '!  ']

# zip stops at the shorter list, so the trailing '' from re.split is dropped
sent = [x+y for x,y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
# ['This is one sentence  ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe !  ']
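
If keeping the punctuation is the goal, one alternative sketch is to split on the whitespace only and use a lookbehind to require a terminator in front of it. This is a different technique from the findall/zip approach above, and it discards the separating whitespace instead of preserving it:

```python
import re

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

# Split on whitespace that directly follows '.', '!' or '?'.
# The lookbehind is zero-width, so the terminator stays with its sentence.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", str2) if s]
print(sentences)
# ['This is one sentence  ${w_{1},..,w_{i}}$!', 'This is another sentence.', 'Is this a sentence?', 'Maybe !']
```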

Thank you very much for your answer and especially for your last hack ! Really cool ! +1 — henry, Apr 11 '19 at 20:35

score 3 · Answer 2 · blhsing · answered Apr 11 '19 at 20:36

You can use re.findall with an alternation pattern that consumes a whole $...$ span before it looks at individual characters, so terminators inside the math are never treated as sentence ends. To ensure that each sentence starts and ends with a non-whitespace character, use a positive lookahead at the start and a positive lookbehind at the end:

re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)

This returns, for the first string:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence']

and for the second string:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
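
The pattern is dense, so here is a commented sketch of it written with re.VERBOSE. It assumes both dollar signs of the math alternative are escaped as \$ (an unescaped $ would act as an end-of-string anchor); the name SENTENCE is only illustrative:

```python
import re

SENTENCE = re.compile(r"""
    (                       # group 1: the sentence text that findall returns
        (?=[^.!?\s])        #   must start on a non-space, non-terminator
        (?: \$.*?\$         #   either a whole math span between two dollars
          | [^.!?]          #   or any single non-terminator character
        )*
        (?<=[^.!?\s])       #   must end on a non-space, non-terminator
    )
    \s*[.!?]                # optional whitespace, then the terminator
""", re.VERBOSE)

string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '
print(SENTENCE.findall(string2))
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
```

A lone $ with no closing partner falls through to the single-character branch, so terminators after it are treated as sentence ends again.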

edited Apr 11 '19 at 20:55

answered Apr 11 '19 at 20:36

Very nice ! Thanks a lot for your answer ! – henry Apr 11 '19 at 21:00

score 0 · Answer 3 · answered Apr 11 '19 at 20:28

Use '. ' (a period followed by a space) as the delimiter, because in this string that sequence only occurs where a sentence ends, not mid-sentence.

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

string.split('. ')

This returns:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', '']
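
This relies on a space following every sentence-ending period. For input where that cannot be guaranteed, a rough sketch is to scan for either a whole $...$ span or a terminator and only cut at the terminators. The function name split_outside_math and the sample string badly_spaced are hypothetical, and the sketch assumes math is always delimited by a matched pair of single dollar signs:

```python
import re

# Matches either a whole $...$ math span (group 1) or a single terminator.
TOKEN = re.compile(r"(\$[^$]*\$)|[.!?]")

def split_outside_math(text):
    """Split at '.', '!' or '?' unless it sits inside a $...$ span."""
    sentences, start = [], 0
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:
            continue  # math span: skipped whole, terminators inside are ignored
        piece = text[start:m.start()].strip()
        if piece:
            sentences.append(piece)
        start = m.end()
    tail = text[start:].strip()  # text after the last terminator, if any
    if tail:
        sentences.append(tail)
    return sentences

badly_spaced = 'This is one sentence  ${w_{1},..,w_{i}}$.This is another sentence.'
print(split_outside_math(badly_spaced))
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence']
```

Because it cuts at every terminator outside math, it will also split decimals and abbreviations such as "3.14" or "e.g.", so it is a trade-off rather than a general solution.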

answered Apr 11 '19 at 20:28

Good idea ! But what about if the sentences have bad formatting ? Maybe one could just ignore everything that is between `$ $`? – henry Apr 11 '19 at 20:29
I'm not sure. You could ensure that all sentences are formatted correctly, so that whenever a sentence ends (with a .) there is a space before the next sentence begins. That way there will never be bad formatting. Is that possible for your code? – Apr 11 '19 at 20:32
