I have the following string:

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

Now I want to split it into two sentences.

However, when I do:

string.split('.')

I get:

['This is one sentence  ${w_{1},',
 '',
 ',w_{i}}$',
 ' This is another sentence',
 ' ']

Does anyone have an idea how to improve this so that the "." inside the $ $ is not treated as a sentence boundary?

Also, how would you handle this string:

string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

EDIT 1:

The desired outputs would be:

For string 1:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence']

For string 2:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe !  ']
henry

3 Answers

For the more general case, you could use re.split like so:

import re

mystr = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', '']

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']

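The characters in the brackets are the ones you treat as sentence-ending punctuation, and \s{1,} (equivalent to \s+) requires at least one whitespace character after them. That is why the .'s inside the $ $ are ignored: no whitespace follows them. This also handles your exclamation point case. Note that the result ends with an empty string when the text ends with punctuation followed by whitespace.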

Here's a (somewhat hacky) way to get the punctuation back:

punct = re.findall(r"[.!?]\s{1,}", str2)
# ['! ', '. ', '? ', '!  ']

# zip stops at the shorter list, so the trailing empty string from re.split is dropped
sent = [x+y for x,y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
# ['This is one sentence  ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe !  ']
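If the goal is only to keep each sentence's punctuation, one alternative is to split on whitespace that is preceded by punctuation, using a lookbehind. This is a minimal sketch rather than part of the approach above; the function name split_keep_punct is made up for illustration, and it assumes, as before, that sentence-ending punctuation is always followed by whitespace.

```python
import re

def split_keep_punct(text):
    """Split on whitespace that follows '.', '!' or '?', keeping the punctuation."""
    # (?<=[.!?]) is a lookbehind: it checks the preceding character without consuming it,
    # so only the whitespace is removed by the split.
    parts = re.split(r'(?<=[.!?])\s+', text)
    # Drop the empty string left over when the text ends with punctuation + whitespace.
    return [p for p in parts if p]

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '
print(split_keep_punct(str2))
# ['This is one sentence  ${w_{1},..,w_{i}}$!', 'This is another sentence.', 'Is this a sentence?', 'Maybe !']
```

Unlike the zip version, this drops the whitespace after each sentence instead of keeping it.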
C.Nivs
  • Thank you very much for your answer and especially for your last hack ! Really cool ! +1 – henry Apr 11 '19 at 20:35

You can use re.findall with an alternation pattern that matches either a whole $...$ span or a single non-punctuation character, so punctuation inside $ $ is never treated as a sentence end. To ensure that the sentence starts and ends with a non-whitespace character, use a positive lookahead pattern at the start and a positive lookbehind pattern at the end:

re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)

This returns, for the first string:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence']

and for the second string:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
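To illustrate why matching $...$ as a unit matters, here is a small sketch that wraps such a pattern (with both dollar signs escaped as \$) in a helper and runs it on a string where a period inside the math is followed by a space. The helper name split_sentences and the test string are assumptions made up for this example.

```python
import re

# Either a whole $...$ span or a single character that is not ., ! or ?;
# the lookahead/lookbehind trim whitespace at both ends of the captured sentence.
SENTENCE = re.compile(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]')

def split_sentences(text):
    """Return the sentences in text without their terminal punctuation."""
    # With a single capture group, findall returns the group contents as strings.
    return SENTENCE.findall(text)

print(split_sentences('Let $a. b$ hold. Next one.'))
# ['Let $a. b$ hold', 'Next one']

string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '
print(split_sentences(string2))
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
```

Because the pattern requires a terminal ., ! or ?, trailing text without such punctuation is not returned.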
blhsing

Split on '. ' (a period followed by a space): in this string that sequence only appears where a sentence ends, not inside the $ $.

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

string.split('. ')

This returns:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', '']

  • Good idea! But what if the sentences are badly formatted? Maybe one could just ignore everything that is between `$ $`? – henry Apr 11 '19 at 20:29
  • I'm not sure. You could ensure that all sentences are correctly formatted, so that whenever a sentence ends (with a .) there is a space before the next sentence begins. That way there will never be bad formatting. Is that possible for your code? –  Apr 11 '19 at 20:32