I have some JSON files that contain escaped Japanese text such as \u672c\u30fb\u96d1\u8a8c\u30fb\u66f8\u7c4d\u60c5\u5831, and I want to decode it into the actual Japanese characters.

When I put the escaped text directly in a string literal:

text = '\u672c\u30fb\u96d1\u8a8c\u30fb\u66f8\u7c4d\u60c5\u5831'
print(text)

This prints the Japanese text as expected:

本・雑誌・書籍情報

But when I read the same text from a file, it does not work. For example, I prepared a file named index.json whose entire content is:

\u672c\u30fb\u96d1\u8a8c\u30fb\u66f8\u7c4d\u60c5\u5831

and I read it like this:

file = open('index.json','r')
text = file.read()
print(text)

but this just prints the escapes unchanged:

\u672c\u30fb\u96d1\u8a8c\u30fb\u66f8\u7c4d\u60c5\u5831

One thing I found odd is what happened when I then tried to print:

print(file.read())
print(text)

Here file.read() returns nothing, and neither does file.read(1).

Edit: I found out that the main problem is this: when you write text = '\u672c', Python recognizes \u672c as a single character, but when you read the same text from a file, Python treats it as a string of 6 characters. Is there any way to convert it?

Mad Physicist
pruggg

1 Answer

There are a couple of issues here.

Let's say that your file contains the following (literal) text:

\u672c\u30fb\u96d1\u8a8c\u30fb\u66f8\u7c4d\u60c5\u5831

You could represent this in Python as either

text = '\\u672c\\u30fb\\u96d1\\u8a8c\\u30fb\\u66f8\\u7c4d\\u60c5\\u5831'

OR

text = r'\u672c\u30fb\u96d1\u8a8c\u30fb\u66f8\u7c4d\u60c5\u5831'
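As a quick check, the two spellings above should produce the same 54-character string (nine escapes of six characters each), while the unprefixed literal from the question is only nine characters long. The variable names here are just for illustration:

```python
# Two ways to spell the literal backslash-u text found in the file.
escaped = '\\u672c\\u30fb\\u96d1\\u8a8c\\u30fb\\u66f8\\u7c4d\\u60c5\\u5831'
raw = r'\u672c\u30fb\u96d1\u8a8c\u30fb\u66f8\u7c4d\u60c5\u5831'

# Without doubled backslashes or the r prefix, Python interprets the escapes itself.
interpreted = '\u672c\u30fb\u96d1\u8a8c\u30fb\u66f8\u7c4d\u60c5\u5831'

print(escaped == raw)      # True
print(len(raw))            # 54: nine escapes, six characters each
print(len(interpreted))    # 9: nine Japanese characters
```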

To convert the literal escapes into the Unicode characters they represent, you need to decode them properly:

text.encode('ascii').decode('unicode-escape')

results in

本・雑誌・書籍情報
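Here is a minimal sketch of the whole round trip, assuming the file holds only ASCII escape text as in the question. The temporary file is a stand-in for index.json, and the json.loads line rests on the extra assumption that the value would be wrapped in double quotes in a real JSON file:

```python
import json
import os
import tempfile

# Write a throwaway file holding the literal escape text (stand-in for index.json).
path = os.path.join(tempfile.mkdtemp(), 'index.json')
with open(path, 'w', encoding='ascii') as f:
    f.write(r'\u672c\u30fb\u96d1\u8a8c\u30fb\u66f8\u7c4d\u60c5\u5831')

# Read the raw text back, then decode the escapes into real characters.
with open(path, 'r', encoding='ascii') as f:
    raw = f.read()
decoded = raw.encode('ascii').decode('unicode-escape')
print(decoded)
print(len(raw), len(decoded))  # 54 9

# Assumption: in a valid JSON file the value would be a quoted string,
# and the json module decodes \uXXXX escapes on its own.
from_json = json.loads('"' + raw + '"')
print(from_json == decoded)    # True
```

Note that encode('ascii') raises UnicodeEncodeError if the text already contains non-ASCII characters, so this approach only fits files that are pure escape text. If the files are valid JSON, parsing them with the json module is probably the simpler route.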

The reason that file.read() and file.read(1) did not work for you is that a file object does not rewind automatically. Once you have read the whole file, the position stays at the end until you rewind it manually (for example with file.seek(0)) or close and reopen the file.
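The rewinding behaviour can be seen in a small sketch. It uses io.StringIO as a stand-in for an opened text file, so nothing needs to exist on disk; a real file object opened with open() has the same read and seek methods:

```python
import io

# io.StringIO stands in for an opened text file here.
file = io.StringIO(r'\u672c\u30fb')

first = file.read()     # reads everything; the position is now at the end
second = file.read()    # nothing left to read, so this is an empty string
file.seek(0)            # rewind to the beginning
third = file.read()     # the full content again

print(repr(second))     # ''
print(first == third)   # True
```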

Mad Physicist