I have been trying to solve this issue for a while, but I can't seem to think of the right solution.

Basically, I am parsing a few PDFs, and depending on the source of the PDF, the terminology used is different. For example, source A1 writes 'Batman' as 'The Batman', while source B2 writes it as 'bat man'.

So what I tried to do is create a dictionary:

Voc_dict = {'Batman':'Batman',
'the Batman': 'Batman',
'bat man': 'Batman'}

Assume this dictionary extends to other superhero names.

So, I am trying to standardize the following 2d list:

super_list = [['among the heros with daddy issues, the bat man shines'], ['Bat man protects the city with everything he gots']]

You get the picture.

Apologies for the format and the silly example. I couldn't find a more relatable one, and it is my first time using the mobile app.

Thanks, guys.

What I did is the following: Loop through the list and loop through dictionary.

for i in super_list:
    for key, value in voc_dict.items():
         i.replace(voc_dict[key], voc_dict[value])
H. H.
  • You can create a regex for each of your heroes that represents all the different ways that it is commonly spelled and search for that regex in your files maybe? – LeoE Mar 28 '23 at 17:32
  • Thanks for your response. I have tried that, but unfortunately some of the strings are mixed (letters and integers), and it tends to pick up some of the targets but not all. – H. H. Mar 28 '23 at 20:15
  • It is pointless to replace `'Batman'` with `'Batman'`. And your loop is incorrect because `str.replace` does not (cannot) change the string in place. You need to do something like `i = i.replace(...)` and then add the final `i` to a new list. – Tim Roberts Mar 28 '23 at 22:28

1 Answer

What I did is the following: Loop through the list and loop through dictionary.

for i in super_list:
    for key, value in voc_dict.items():
         i.replace(voc_dict[key], voc_dict[value])

I would expect there to be at least three issues with this:

  1. You mentioned that super_list is a nested list, so you also need a nested for-loop to traverse it. As written, i is a list [not a string] and does not have a .replace method, so i.replace would raise an AttributeError.
  2. As TimRoberts commented, .replace is not an in-place method, so you would need something like i = i.replace... to change i [if i was a string].
  3. Even if i was a string, there would be no point in using i = i.replace... because that only rebinds the loop variable i; the item stored in the list stays unchanged. Generally, you should use enumerate if you want to loop through and edit a list:

for si, sub_list in enumerate(super_list):
    for i, sl_item in enumerate(sub_list):
        for k, kw in Voc_dict.items():
            # keep the result so replacements from earlier keys are not lost
            sl_item = sl_item.replace(k, kw)
        # write the fully replaced string back into the nested list
        super_list[si][i] = sl_item

However, if you try the above code on your sample super_list, you will notice that only the first item gets altered, because str.replace is case-sensitive and 'Bat man' is not a key in Voc_dict. So you need to either add 'Bat man': 'Batman' to Voc_dict or use regex with re.IGNORECASE, calling re.sub(k, kw, sl_item, flags=re.I) instead of sl_item.replace(k, kw).
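As a rough sketch of the regex variant, using the sample data from the question: the re.escape call is my own addition (an assumption, not something the question requires) so that any key containing regex metacharacters is still matched literally.

```python
import re

# Sample data copied from the question
Voc_dict = {'Batman': 'Batman',
            'the Batman': 'Batman',
            'bat man': 'Batman'}
super_list = [['among the heros with daddy issues, the bat man shines'],
              ['Bat man protects the city with everything he gots']]

for si, sub_list in enumerate(super_list):
    for i, sl_item in enumerate(sub_list):
        for k, kw in Voc_dict.items():
            # re.escape (added precaution) treats the key as literal text;
            # flags=re.I makes the match case-insensitive
            sl_item = re.sub(re.escape(k), kw, sl_item, flags=re.I)
        # store the fully processed string back in the nested list
        super_list[si][i] = sl_item

print(super_list)
```

Note that the keys are applied in dictionary order, so 'the bat man' ends up as 'the Batman' here: the 'the Batman' key has already been tried by the time 'bat man' is replaced.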


If you use regex, you can reduce the number of iterations by first reducing Voc_dict to something like {'(Batman|the Batman|bat man)': 'Batman'} with

Voc_dict = {'('+'|'.join([
    k for k,v in Voc_dict.items() if v==kw
])+')':kw for kw in set(Voc_dict.values())}
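A possible variation on the single-pattern idea, sketched under my own assumptions rather than taken from the code above: compile one alternation from all the keys, wrap it in \b word boundaries, and look up the standard name in a replacement function. The names lookup, variants, pattern and standardize are hypothetical helpers introduced only for this illustration.

```python
import re

# Sample data copied from the question
Voc_dict = {'Batman': 'Batman',
            'the Batman': 'Batman',
            'bat man': 'Batman'}
super_list = [['among the heros with daddy issues, the bat man shines'],
              ['Bat man protects the city with everything he gots']]

# Lower-cased lookup so any casing of a matched variant maps to its standard form
lookup = {k.lower(): v for k, v in Voc_dict.items()}

# Longest variants first, so 'the Batman' is tried before 'Batman'
variants = sorted(Voc_dict, key=len, reverse=True)
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(v) for v in variants) + r')\b',
    flags=re.IGNORECASE,
)

def standardize(text):
    """Replace every known variant in text with its standard name."""
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

# Build a new nested list instead of editing super_list in place
cleaned = [[standardize(s) for s in sub_list] for sub_list in super_list]
print(cleaned)
```

With this approach each string is scanned once regardless of how many variants there are, and a suffix such as 's is left in place because the apostrophe still counts as a word boundary.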
Driftr95
  • Hi @Driftr95, thank you very much for this response; it is very helpful. To explore your regex take: what I did before was split each statement on spaces with split(" "), loop through each word in each list, and replace it if it matched. But that did not work for statements where it is written like "Batman's". Is there a smarter way to avoid this loop-in-a-loop issue? – H. H. Mar 29 '23 at 01:47
  • @H.H. that would also not work for `bat man` and `the Batman` since those are separate words. On the other hand, `.replace` or `re.sub` [without `split`] should work with an added `'s` or any other suffix (or prefix, for that matter), *and* it involves one less loop, so what's the issue? – Driftr95 Mar 29 '23 at 02:03