
I am trying to replace the whitespace between two words with a '_', for example turning 'word one' into 'word_one'.

Here is my code:

import re

labels_ls = ['word <= 0.01', 'word_two <= 0.23', 'word three <= 0.01']

regex_whitespace = r'\w+\s+\w+\b'
new_regex = r'\w+\_+\w+\b'
pattern = re.compile(regex_whitespace) # this I just added after reviewing other related questions

# Loop through labels_ls to find any whitespace-separated n-gram labels (e.g. 'gilt maximal')

for i in labels_ls:
    if re.match(regex_whitespace, i):
        # replace the whitespace with a '_' to form 'gilt_maximal'
        new_string = re.sub(pattern, new_regex, i)
        print('new string: ', new_string)

I have tested my regex at https://pythex.org and it works as required. However, when I run this code I get the following error:

re.error: bad escape \w at position 0

I have looked at all the related answered questions:

how to fix - error: bad escape \u at position 0

and

Regex: Replace one pattern with another

I have tried removing the r prefix before the regex, as mentioned in the above question, but it still doesn't work.

I also tried using compile(), but this didn't fix the problem either.

labels_ls = ['internal_punctuation <= 0.042', 'darf <= 0.717', 'formal_global_yes <= 0.5', 'wert <= 0.272', 'signal <= 0.5', 'Flesch_Index <= 0.813', 'zulass <= 0.379', 'polarity <= 0.713', 'Nb_of_auxiliary <= 0.071', 'gini = 0.0', 'polarity <= 0.375', 'gini = 0.0', 'Nb_of_verbs <= 0.094', 'weakwords_nb <= 0.143', 'passive_global_yes <= 0.5', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'Nb_of_verbs <= 0.094', 'passive_global_yes <= 0.5', 'WPS <= 0.062', 'measurement_values_no <= 0.5', 'gini = 0.0', 'SPW <= 0.575', 'weird_words <= 0.042', 'weakwords_nb <= 0.036', 'SPW <= 0.272', 'gini = 0.0', 'words_nb <= 0.033', 'gini = 0.5', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'Flesch_Index <= 0.774', 'SPW <= 0.331', 'gini = 0.0', 'gini = 0.0', 'Comp_conj <= 0.375', 'SPW <= 0.111', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'Sub_Conj <= 0.25', 'weird_words <= 0.208', 'zsdf <= 0.5', 'signal <= 0.297', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'words_nb <= 0.164', 'Aux_Start_no <= 0.5', 'gini = 0.0', 'Nb_of_Umsetzbarkeit_conj <= 0.167', 'werden <= 0.125', 'darf <= 0.297', 'polarity <= 0.925', 'SPW <= 0.376', 'WPS <= 0.11', 'numerical_values <= 0.091', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'WPS <= 0.11', 'gini = 0.0', 'gini = 0.0', 'polarity <= 0.25', 'gini = 0.0', 'Flesch_Index <= 0.663', 'words_nb <= 0.033', 'SPW <= 0.475', 'gini = 0.0', 'gini = 0.0', 'Comp_conj <= 0.125', 'gini = 0.56', 'gini = 0.0', 'Flesch_Index <= 0.75', 'gini = 0.444', 'gini = 0.0', 'Aux_Start_yes <= 0.5', 'darf <= 0.241', 'Nb_of_verbs <= 0.156', 'gini = 0.0', 'SPW <= 0.246', 'polarity <= 0.675', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'Sub_Conj <= 0.25', 'numerical_values <= 0.227', 'funktion <= 0.348', 'internal_punctuation <= 0.458', 'polarity <= 0.375', 'gini = 0.0', 'Nb_of_verbs <= 0.031', 
'gini = 0.0', 'Flesch_Index <= 0.409', 'gini = 0.0', 'numerical_values <= 0.136', 'WPS <= 0.065', 'darf <= 0.359', 'Nb_of_Umsetzbarkeit_conj <= 0.167', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'formal_global_no <= 0.5', 'WPS <= 0.164', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gini = 0.0', 'gilt randbeding <= 0.181', 'fahrzeug <= 0.352', 'gini = 0.0', 'zulass <= 0.082', 'gini = 0.0', 'gini = 0.0', 'fur <= 0.194', 'weakwords_nb <= 0.321', 'gini = 0.444', 'gini = 0.0', 'gini = 0.0', 'Nb_of_Umsetzbarkeit_conj <= 0.167', 'Nb_of_verbs <= 0.344', 'gini = 0.0', 'gini = 0.0', 'words_nb <= 0.178', 'gini = 0.0', 'words_nb <= 0.224', 'gini = 0.0', 'gini = 0.0']
codiearcher
  • Can you provide an example of `labels_ls`? Also, `\w` matches underscores. What exact chars do you need to match with your regex? – Wiktor Stribiżew May 06 '19 at 16:15
  • Is regex a requirement? I believe `str.replace()` will make life a lot easier for that job. – accdias May 06 '19 at 16:17
  • Comment out the re.sub() line and try it. Is that the line the error is on? You may need to double escape the escapes: `r'\\w+\\_+\\w+\\b'` since this is the replacement string. –  May 06 '19 at 16:18
  • Why do you replace with a regex pattern? – Wiktor Stribiżew May 06 '19 at 16:19
  • @sln yes that is the line which causes the error. But I need a way of replacing the whitespace with a _ – codiearcher May 06 '19 at 16:20
  • @WiktorStribiżew I guess I thought that would be the best way, is there a better way? – codiearcher May 06 '19 at 16:21
  • `new_string = re.sub(r"\s", "_", i)` –  May 06 '19 at 16:22
  • @accdias I tried using str.replace initially but I couldn't manage to make it work, I tried i.replace(" ", "_") but this doesn't work, how can I use str.replace() in this instance? – codiearcher May 06 '19 at 16:23
  • @codiearcher, what do you mean by it doesn't work? What was the error? – accdias May 06 '19 at 16:24
  • Or, if it really bothers you that the underscore should only be between words: `new_string = re.sub(r"(?<=\w)\s+(?=\w)", "_", i)` –  May 06 '19 at 16:27

1 Answer

You need to use

regex_whitespace = r'(\w+)\s+(\w+)\b'

and then later:

new_string = re.sub(pattern, r'\1_\2', i)

See the Python demo online.

The point is that you need to capture the word chars matched by the first regex in capturing groups and then use backreferences to those group values in the replacement. The new_regex = r'\w+\_+\w+\b' is redundant because you cannot use a regex pattern as a replacement: replacement patterns can only contain backreferences and escape sequences (a literal backslash must be escaped there).
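For reference, here is a minimal, self-contained sketch of the approach described above. The short list, the `join_words` helper, and the `new_labels` name are made up for illustration; only the pattern and the `r'\1_\2'` replacement come from the answer. The `try`/`except` part assumes Python 3.7 or later, where an unknown escape such as `\w` in a replacement string raises `re.error`.

```python
import re

# Illustrative sample; 'gilt randbeding <= 0.181' is taken from the real data.
labels_ls = ['word <= 0.01', 'word_two <= 0.23', 'word three <= 0.01',
             'gilt randbeding <= 0.181']

# Capture the word on each side of the whitespace so the replacement can reuse them.
pattern = re.compile(r'(\w+)\s+(\w+)\b')


def join_words(label):
    """Replace whitespace between two words with '_' (hypothetical helper)."""
    # \1 and \2 are backreferences to the two captured words.
    return pattern.sub(r'\1_\2', label)


new_labels = [join_words(label) for label in labels_ls]
print(new_labels)
# ['word <= 0.01', 'word_two <= 0.23', 'word_three <= 0.01', 'gilt_randbeding <= 0.181']

# Using a regex pattern as the replacement is what triggers the original error.
try:
    pattern.sub(r'\w+\_+\w+\b', 'word three <= 0.01')
except re.error as exc:
    print('re.error:', exc)
```

Labels without a whitespace-separated word pair, such as 'gini = 0.0', pass through unchanged, so the `re.match` check before the substitution is not needed.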

Wiktor Stribiżew
  • I replaced the code with what you wrote and I'm not sure why but I get the following error: raise s.error ("invalid group reference %d" % index, pos) re.error: invalid group reference 1 at position 1. – codiearcher May 06 '19 at 16:26
  • @codiearcher Please copy/paste from [here](https://ideone.com/upA46g). If you have other issues, just share a Python code snippet via ideone. I believe you forgot to add capturing groups. – Wiktor Stribiżew May 06 '19 at 16:29
  • I guess the problem is with my actual list (I just made a dummy list) which appears to not be accepting the regex. I will add the original data from labels_ls above. What is catering groups? I will look at ideone. Thanks – codiearcher May 06 '19 at 16:38
  • See https://ideone.com/xYaiA0, all works as expected. The groups are *capturing*, not *catering*. – Wiktor Stribiżew May 06 '19 at 16:44