
I have a very long string and want to split that string at the last dot before 100 characters.

For example, I have a string (200 characters) like:

string = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr. sed diam nonumy eirmod tempor. invidunt ut labore et dolore mgna aliquyam erat. sed diam voluptua. At vero eos et accusam et justo duo dolores."

In the end I want a list of two or more chunks, each made up of full sentences and at most 100 characters long.

fteinz

3 Answers

1
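text = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr. sed diam nonumy eirmod tempor. invidunt ut labore et dolore mgna aliquyam erat. sed diam voluptua. At vero eos et accusam et justo duo dolores."

n = 100
# Drop the trailing period so it is not doubled when "." is appended below
text = text.rstrip(".")
# Cut every n characters (ignoring sentence boundaries) and append a period to each chunk
chunks = [text[i:i+n] + "." for i in range(0, len(text), n)]
print(chunks)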

output:

['Lorem ipsum dolor sit amet, consetetur sadipscing elitr. sed diam nonumy eirmod tempor. invidunt ut .', 'labore et dolore mgna aliquyam erat. sed diam voluptua. At vero eos et accusam et justo duo dolores.']
  • Do you really want to break in the middle of a sentence (or word) and add a period that wasn't there? e.g. ` invidunt ut .` – Alain T. Mar 10 '21 at 13:16
0

"want to split that string at the last dot before 100 characters"

See below:

string = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr. sed diam nonumy eirmod tempor. invidunt ut labore et dolore mgna aliquyam erat. sed diam voluptua. At vero eos et accusam et justo duo dolores."
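# Index of the last period within the first 100 characters (-1 if there is none)
split_idx = string[:100].rfind('.')
# Keep everything before that period (the period itself is excluded)
final_string = string[:split_idx]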
print(final_string)

output

Lorem ipsum dolor sit amet, consetetur sadipscing elitr. sed diam nonumy eirmod tempor
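
The snippet above returns only the first chunk. To get the whole list, the same rfind idea can be applied in a loop. This is a minimal sketch rather than a definitive implementation: the function name split_at_last_dot and the hard cut when no period is found within the limit are assumptions, and unlike the snippet above it keeps the closing period of each chunk.

```python
def split_at_last_dot(text, limit=100):
    """Split text into chunks of at most `limit` characters,
    ending each chunk at the last period that fits."""
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        # Index of the last period within the first `limit` characters (-1 if none)
        idx = rest[:limit].rfind(".")
        if idx == -1:
            # Assumption: with no period in range, fall back to a hard cut at `limit`
            idx = limit - 1
        chunks.append(rest[:idx + 1])
        # Drop the space that followed the period
        rest = rest[idx + 1:].lstrip()
    if rest:
        chunks.append(rest)
    return chunks


string = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr. sed diam nonumy eirmod tempor. invidunt ut labore et dolore mgna aliquyam erat. sed diam voluptua. At vero eos et accusam et justo duo dolores."
chunks = split_at_last_dot(string)
print(chunks)
print([len(c) for c in chunks])  # [87, 67, 44]
```

Whether an overlong sentence should be cut hard, broken between words, or rejected depends on your use case, so adjust the fallback branch accordingly.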
balderman
0

You could use a regular expression with finditer():

import re
s = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr. sed diam nonumy eirmod tempor. invidunt ut labore et dolore mgna aliquyam erat. sed diam voluptua. At vero eos et accusam et justo duo dolores."

r = [m.group().strip() for m in re.finditer(r' *((.{0,99})(\.|.$))',s)]

print(r)

['Lorem ipsum dolor sit amet, consetetur sadipscing elitr. sed diam nonumy eirmod tempor.',
 'invidunt ut labore et dolore mgna aliquyam erat. sed diam voluptua.',
 'At vero eos et accusam et justo duo dolores.']

print(*map(len,r)) # 87 67 44

The expression r' *((.{0,99})(\.|.$))' matches up to 99 characters (.{0,99}) followed by either a period or the last character of the string (\.|.$), so each match is at most 100 characters long. The leading ' *' consumes any preceding spaces without counting them toward that limit, so the space that follows a line-breaking period can be stripped from the next chunk without reducing its length allowance.

Note that this assumes no single sentence is longer than 100 characters. Depending on what you want to happen in that case, you can adjust the regular expression. For example, to split arbitrarily after 100 characters, use r' *((.{0,99})(\.|.$)|(.{100}))'; to try to break between words, use r' *((.{0,99})(\.|.$)|(.{100})\b)'. Be aware that the second variant only cuts when the 100-character mark happens to fall on a word boundary; otherwise the search moves on and may skip characters.
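
As a rough illustration of the hard-split variant, here is a small sketch that wraps it in a function and runs it on a string whose first sentence is longer than 100 characters. The function name split_sentences, the limit parameter, and the sample string are made up for this example.

```python
import re


def split_sentences(text, limit=100):
    # Same idea as the pattern above: up to limit-1 characters followed by a
    # period (or the final character), else a hard split after `limit` characters
    pattern = r' *((.{0,%d})(\.|.$)|(.{%d}))' % (limit - 1, limit)
    return [m.group().strip() for m in re.finditer(pattern, text)]


# Hypothetical sample: a 119-character "sentence" without a period, then a short one
sample = ("abcd " * 24).strip() + ". Short one."
parts = split_sentences(sample)
print([len(p) for p in parts])  # [99, 31]
```

The first chunk is the hard split (100 characters, one of which is a trailing space removed by strip()), and the remainder of the long sentence ends up in the same chunk as the short sentence that follows it.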

Alain T.