10

Possible Duplicate:
How to get string Objects instead Unicode ones from JSON in Python?

I have a lot of input as multi-level dictionaries parsed from JSON API calls. The strings are all in unicode which means there is a lot of u'stuff like this'. I am using jq to play around with the results and need to convert these results to ASCII.

I know I can write a function to convert it, like this:

def convert(input):
    if isinstance(input, dict):
        ret = {}
        for stuff in input:
            ret = convert(stuff)
    elif isinstance(input, list):
        ret = []
        for i in range(len(input))
            ret = convert(input[i])
    elif isinstance(input, str):
        ret = input.encode('ascii')
    elif :
        ret = input
    return ret

Is this even correct? Not sure. That's not what I want to ask you though.

What I'm asking is, this is a typical brute-force solution to the problem. There must be a better way. A more pythonic way. I'm no expert on algorithms, but this one doesn't look particularly fast either.

So is there a better way? Or if not, can this function be improved...?


Post-answer edit

Mark Amery's answer is correct, but I would like to post a modified version of it. His function works on Python 2.7+ and I'm on 2.6, so I had to convert it:

def convert(input):
    if isinstance(input, dict):
        return dict((convert(key), convert(value)) for key, value in input.iteritems())
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
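
For anyone wondering why the change was needed: dict comprehensions only arrived in Python 2.7, while passing a generator of (key, value) pairs to dict() also works on 2.6. A minimal sketch of the equivalence, where `pairs` and the `* 10` transformation are made-up illustration values rather than anything from the original code:

```python
# Hypothetical sample data; the multiplication stands in for convert().
pairs = [('a', 1), ('b', 2)]

# Works on Python 2.6 and later: dict() fed by a generator of 2-tuples.
from_generator = dict((key, value * 10) for key, value in pairs)

# Needs Python 2.7 or later: dict comprehension syntax.
from_comprehension = {key: value * 10 for key, value in pairs}

print(from_generator == from_comprehension)
```

Both forms build the same dictionary, so the 2.6 version should behave the same as the original.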
Dreen
  • If you're on Python 2, then unicode isn't an instance of `str`, but of `unicode`. Also, in the `list` and `dict` processing, you're doing it wrong. – agf Oct 27 '12 at 15:49
  • For the list case, you may wish to consider handling any iterable. In any case, you can replace that branch of the if statement with `ret = [convert(x) for x in input]`. Also, check your dictionary case. `ret` will only contain whatever the last key in the dictionary converts to. – Michael Mior Oct 27 '12 at 15:55
  • @MichaelMior The trouble with handling any iterable in the way you've described is that not all iterables are list-like. For example, dictionaries are iterable, but `ret = [convert(x) for x in input]` is clearly not what we want if `input` is a dictionary. – Mark Amery Oct 27 '12 at 16:28
  • @MarkAmery Of course. Dictionaries need to be handled separately. – Michael Mior Oct 27 '12 at 18:43

1 Answer

30

Recursion seems like the way to go here, but if you're on Python 2.x you want to be checking for unicode, not str. The str type represents a string of bytes and the unicode type a string of Unicode characters; neither inherits from the other, and it is unicode-type strings that are displayed in the interpreter with a u in front of them.

There's also a little syntax error in your posted code (the trailing elif: should be an else), and you're not returning the same structure in the case where input is either a dictionary or a list. (In the case of a dictionary, you're returning the converted version of the final key; in the case of a list, you're returning the converted version of the final element. Neither is right!)
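
To see why the loops in the question end up returning only the final item, here is a minimal sketch of the same loop shape. The names `last_only` and `all_items` are hypothetical, and `upper()` merely stands in for the recursive `convert` call:

```python
def last_only(items):
    # Same shape as the loops in the question: ret is reassigned on every
    # pass, so the initial [] is thrown away and only the last result survives.
    ret = []
    for item in items:
        ret = item.upper()
    return ret


def all_items(items):
    # Appending to ret instead keeps one converted value per input element.
    ret = []
    for item in items:
        ret.append(item.upper())
    return ret


print(last_only(['a', 'b', 'c']))
print(all_items(['a', 'b', 'c']))
```

The first function hands back a single string rather than a list, which is the structural problem described above.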

You can also make your code pretty and Pythonic by using comprehensions.

Here, then, is what I'd recommend:

def convert(input):
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
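
As a rough usage sketch, here is the same logic applied to the output of json.loads. Two things in it are my assumptions rather than part of the function above: the `text_type` alias and the use of `.items()` exist only so the snippet also runs on Python 3 (where it produces bytes objects), and the parameter is renamed to `data` to avoid shadowing the built-in `input`. The JSON document is made up for illustration.

```python
import json

try:
    text_type = unicode  # Python 2: the type json.loads uses for strings
except NameError:
    text_type = str      # Python 3 fallback (assumption, for runnability only)


def convert(data):
    # Rebuild dicts and lists recursively, encoding every text string found.
    if isinstance(data, dict):
        return dict((convert(key), convert(value)) for key, value in data.items())
    elif isinstance(data, list):
        return [convert(element) for element in data]
    elif isinstance(data, text_type):
        return data.encode('utf-8')
    else:
        # Numbers, booleans and None pass through untouched.
        return data


# Hypothetical API response.
parsed = json.loads('{"name": "Dreen", "tags": ["a", "b"], "count": 3}')
result = convert(parsed)
print(result)
```

On Python 2 the printed dictionary should no longer show the u prefix on its keys and values.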

One final thing. I changed encode('ascii') to encode('utf-8'). My reasoning is as follows: any unicode string that contains only characters in the ASCII character set is represented by the same byte string when encoded in ASCII as when encoded in UTF-8. Using UTF-8 instead of ASCII therefore cannot break anything, and the change will be invisible as long as the unicode strings you're dealing with use only ASCII characters. However, this change extends the scope of the function so that it can handle strings of characters from the entire Unicode character set, rather than just ASCII ones, should such a thing ever be necessary.
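
A small sketch of that reasoning, using made-up sample strings (`ascii_only` and `accented` are hypothetical names):

```python
# An ASCII-only string encodes to identical bytes under both codecs.
ascii_only = u'stuff like this'
same_bytes = ascii_only.encode('ascii') == ascii_only.encode('utf-8')

# A string with a non-ASCII character (e-acute, U+00E9) still encodes
# under UTF-8, where that character becomes two bytes.
accented = u'caf\xe9'
utf8_bytes = accented.encode('utf-8')

# The same string cannot be encoded as ASCII at all.
try:
    accented.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(same_bytes, ascii_failed)
```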

Mark Amery
  • +1. Except for your comment about recursion :) Recursion is useful for almost any kind of tree traversal, and most parsing problems. Recursion is often the "way to go", especially when it comes to functional programming. – Joel Cornett Oct 27 '12 at 16:08
  • @JoelCornett Fair enough. My comment wasn't meant to be broadly anti-recursion; I can see that recursion makes sense in tree traversal problems, of which I guess a lot of parsing problems are a subset. I'm just pretty new to this game and not from a compsci background, so I haven't come across any problems of that nature myself yet. Examples of recursion I've seen tend to be pointless and contrived, and apply it to situations where iteration would be clearer. This is the first time I've suddenly gone 'whoa, recursion *really simplifies things* here', which was exciting for me. :) – Mark Amery Oct 27 '12 at 16:17
  • Thanks, this is really nice. Much better than any answer in the question that this is supposedly a duplicate of. – Dreen Oct 28 '12 at 13:23
  • Also, I posted a modified version of your code for older Python – Dreen Oct 28 '12 at 17:49
  • Your code didn't work for me for some reason so I did this instead: def unicode_to_string(text): if type(text) is unicode: return text.encode('ascii', 'ignore') if type(text) is list: return [unicode_to_string(a) for a in text] if type(text) is dict: return dict((unicode_to_string(key), unicode_to_string( value)) for key, value in text.iteritems()) return text – Gil Zellner Feb 23 '16 at 15:04
  • worked like a charm, thanks – nishantvas Jul 12 '17 at 08:44
  • thank you. works great - py2.7/ubuntu 19 -> input = json response convert w json module – FlyingZebra1 Jun 05 '19 at 21:53