I am having a hard time converting Cyrillic symbols stored in Unicode to UTF-8 using Python's json library. The input JSON string:

data = '{"name": "\\u0431\\u0433"}'

The encoded result I get from json.dumps(data) and from json.dumps(data).encode('utf8') is identical to the input; no conversion takes place.

Even more oddly, json.dumps(data, ensure_ascii=False).encode('utf8') returns a hexadecimal result: '{"name": "\xd0\xb1\xd0\xb3"}'. Does anyone have an idea what I am doing wrong?

Michael Meyer

1 Answer

The only thing you are doing wrong is trying to serialise data that is already serialised as JSON. The Unicode escapes ('\\uxxxx') are legitimate, equivalent representations of the Cyrillic characters.

>>> import json
>>> data = '{"name": "\\u0431\\u0433"}'    # already JSON-serialised
>>> obj = json.loads(data)                 # deserialise to a Python object
>>> obj                                    # Python 2 repr shown
{u'name': u'\u0431\u0433'}

>>> print(obj['name'])  # printing the string displays it as Cyrillic
бг
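
If the goal is JSON text containing the literal Cyrillic characters as UTF-8 bytes, one possible approach is to deserialise first and then serialise the resulting object with ensure_ascii=False. The following is a minimal sketch in Python 3, assuming the input is a valid JSON document; the helper name reencode_as_utf8 is hypothetical and is not part of the json library.

```python
import json

# Input from the question: JSON text whose string value uses \uXXXX escapes.
data = '{"name": "\\u0431\\u0433"}'


def reencode_as_utf8(json_text):
    """Hypothetical helper: parse JSON text, then re-serialise it as UTF-8 bytes."""
    obj = json.loads(json_text)                  # escapes become real characters
    text = json.dumps(obj, ensure_ascii=False)   # keep non-ASCII characters unescaped
    return text.encode("utf-8")                  # encode the text as UTF-8 bytes


utf8_bytes = reencode_as_utf8(data)
print(utf8_bytes)                   # b'{"name": "\xd0\xb1\xd0\xb3"}'
print(utf8_bytes.decode("utf-8"))   # {"name": "бг"}

# Both forms parse to the same object, so the escaped input was never "wrong".
same = json.loads(utf8_bytes.decode("utf-8")) == json.loads(data)
print(same)                         # True
```

The \xd0\xb1\xd0\xb3 in the bytes output is not a failed conversion: it is how Python displays the UTF-8 bytes of the two Cyrillic characters.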

Escaping Unicode characters is permitted by the JSON standard (see this answer). Other JSON parsers will process escaped characters correctly.

For example, in the Firefox console:

data = '{"name": "\\u0431\\u0433"}'
"{"name": "\u0431\u0433"}"
obj = JSON.parse(data)
Object { name: "бг" }
snakecharmerb