11

I have a fairly large amount of text that includes control characters such as \n, \t and \r. I need to replace them with a simple space (" "). What is the fastest way to do this? Thanks

Hossein
  • Obviously, as the Zen of Python suggests, there is only one way to do that ;-) – gruszczy Feb 10 '11 at 09:51
  • when the string has multiple adjacent such characters e.g.`foo\r\nbar`, do you want to replace `\r\n` by two spaces or only 1? – John Machin Feb 10 '11 at 10:53
  • i want to replace it with only 1 – Hossein Feb 10 '11 at 11:30
  • Consider also stripping leading and trailing whitespace. Then please edit your question so that it specifies exactly what you want. – John Machin Feb 10 '11 at 11:57
  • If you want to strip leading and trailing whitespace as well, have a look at [this answer](http://stackoverflow.com/questions/1898656/remove-whitespace-in-python-using-string-whitespace/1898835#1898835). – Sven Marnach Feb 10 '11 at 17:20

6 Answers

27

I think the fastest way is to use str.translate():

import string
s = "a\nb\rc\td"
print s.translate(string.maketrans("\n\t\r", "   "))

prints

a b c d
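On Python 3 the same idea can be written with `str.maketrans`, since `string.maketrans` no longer exists there. A minimal sketch, assuming Python 3 and a `str` input; the names `TABLE` and `replace_controls` are only illustrative:

```python
# Build the translation table once; it maps each control character to a space.
TABLE = str.maketrans("\n\t\r", "   ")

def replace_controls(text):
    """Replace every newline, tab and carriage return with one space each."""
    return text.translate(TABLE)

print(replace_controls("a\nb\rc\td"))  # a b c d
```

Each character is replaced individually, so a `\r\n` pair becomes two spaces with this approach.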

EDIT: As this once again turned into a discussion about performance, here are some numbers. For long strings, translate() is way faster than using regular expressions:

s = "a\nb\rc\td " * 1250000

regex = re.compile(r'[\n\r\t]')
%timeit t = regex.sub(" ", s)
# 1 loops, best of 3: 1.19 s per loop

table = string.maketrans("\n\t\r", "   ")
%timeit s.translate(table)
# 10 loops, best of 3: 29.3 ms per loop

That's about a factor of 40.
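To repeat the comparison outside IPython, something like the following could be used. This is only a sketch, assuming Python 3 and the standard `timeit` module; the string is shorter than the one above to keep the run brief, and the ratio you see will depend on the Python version and the input, so it may not match the Python 2 figures quoted here.

```python
import re
import timeit

s = "a\nb\rc\td " * 100_000  # shorter test string than above (assumption)

regex = re.compile(r"[\n\r\t]")
table = str.maketrans("\n\t\r", "   ")

# Both approaches must give the same result before timing means anything.
assert regex.sub(" ", s) == s.translate(table)

# Take the best of three repeats, each doing three calls.
regex_time = min(timeit.repeat(lambda: regex.sub(" ", s), number=3, repeat=3))
translate_time = min(timeit.repeat(lambda: s.translate(table), number=3, repeat=3))

print(f"regex:     {regex_time:.4f} s for 3 calls")
print(f"translate: {translate_time:.4f} s for 3 calls")
```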

Sven Marnach
  • 5
    It is important to note that string.translate and string.maketrans are not available in Python 3. An re-based solution seems better. – Senthil Kumaran Feb 10 '11 at 10:05
  • @Ignacio: `import string; hasattr(string, 'translate'); hasattr(string, 'maketrans')` will be False; if you do `hasattr(str, 'translate')` and `hasattr(str, 'maketrans')` it is True. The string module is just a collection of string constants. Moreover, by definition the proper way to use maketrans would be bytes.maketrans. Thanks! – Senthil Kumaran Feb 10 '11 at 10:21
10

You may also try regular expressions:

import re
regex = re.compile(r'[\n\r\t]')
regex.sub(' ', my_str)
Michal Chruszcz
  • I've compared the actual performance and it looks like using regular expressions is as fast as using the string module. – Michal Chruszcz Feb 10 '11 at 10:14
  • `python2.6 timeit.py -s "import string" -s "s = 'a\nb\rc\td'" -s "s.translate(string.maketrans('\n\t\r', ' '))"` 10000000 loops, best of 3: 0.0235 usec per loop – Michal Chruszcz Feb 10 '11 at 10:15
  • `python2.6 timeit.py -s "import re" -s "regex = re.compile(r'[\n\r\t]')" -s "regex.sub(' ', 'a\nb\rc\td')"` 10000000 loops, best of 3: 0.0232 usec per loop – Michal Chruszcz Feb 10 '11 at 10:15
  • 1
    @Michal - are you comparing `regex.sub(...)` to `s.translate(string.maketrans(...))` or to `s.translate(preparedTrans)` only? – eumiro Feb 10 '11 at 10:20
  • @eumiro, the former, my bad - I focused on the above solution. The latter is comparable, though. – Michal Chruszcz Feb 10 '11 at 10:27
  • `python2.6 timeit.py -s "import string" -s "s = 'a\nb\rc\td'" -s "trans = string.maketrans('\n\t\r', ' ')" -s "s.translate(trans)"` 10000000 loops, best of 3: 0.0256 usec per loop – Michal Chruszcz Feb 10 '11 at 10:27
  • 1
    @Michal: It's completely meaningless to try this on a string with 7 characters. See the edit in my answer. – Sven Marnach Feb 10 '11 at 10:39
5
>>> re.sub(r'[\t\n\r]', ' ', '1\n2\r3\t4')
'1 2 3 4'
Ignacio Vazquez-Abrams
4

If you want to normalise whitespace (replace runs of one or more whitespace characters by a single space, and strip leading and trailing whitespace) this can be accomplished by using string methods:

>>> text = '   foo\tbar\r\nFred  Nurke\t Joe Smith\n\n'
>>> ' '.join(text.split())
'foo bar Fred Nurke Joe Smith'
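If ordinary spaces should be left alone and only runs of `\n`, `\t` and `\r` collapsed into a single space (which is what the asker requested in the comments), a character class with `+` is one option. A sketch; `collapse_controls` is a hypothetical helper name:

```python
import re

# One or more consecutive newlines, carriage returns or tabs.
_CONTROL_RUN = re.compile(r"[\n\r\t]+")

def collapse_controls(text):
    """Replace each run of newlines, tabs and carriage returns with one space."""
    return _CONTROL_RUN.sub(" ", text)

print(repr(collapse_controls("foo\r\nbar")))  # 'foo bar'
print(repr(collapse_controls("a  b\t\tc")))   # 'a  b c'
```

Existing spaces are not merged with the replacement, so `"a \nb"` still ends up with two spaces between `a` and `b`.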
John Machin
2

using regex

re.sub(r'\s+', ' ', '1\n2\r3\t4')

without regex

>>> ' '.join('1\n\n2\r3\t4'.split())
'1 2 3 4'
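The two variants are not quite interchangeable when the text starts or ends with whitespace. A small illustration, assuming a made-up sample string:

```python
import re

text = "  1\n\n2\r3\t4\n"  # sample input with leading and trailing whitespace

with_regex = re.sub(r"\s+", " ", text)  # collapses runs, keeps one space at each end
with_split = " ".join(text.split())     # collapses runs and strips both ends

print(repr(with_regex))  # ' 1 2 3 4 '
print(repr(with_split))  # '1 2 3 4'
```

Calling `.strip()` on the regex result gives the same string as the split/join version.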
>>>
kurumi
1

my_string is the string from which you want to delete specific control characters. Because strings are immutable in Python, re.sub returns a new string, so you need to assign the result to another name or back to the same one:

import re
my_string = re.sub(r'[\n\r\t]+', '', my_string)
3kstc
Srikanth