I'm currently trying to combine the text from many small files into one big file using Java. The big file is then read by a Python module that extracts phrases from it. During this process, I get an error indicating invalid UTF-8 text. Some research brought me to this error in Java, but it didn't solve my problem.

Strangely, when I type the sentence into an online UTF-8 converter like this one, it also reports an error. The string I used is "Brawlers Were Back On Ice and Canvas".

Can anyone explain to me why this happens?

Thanks in advance!

EDIT/UPDATE: It looks like this online tool might have a bug. I'm still stuck on the problem of using the file in Python, so I'll show the code that creates it:

     // try-with-resources closes (and flushes) the writer automatically
     try (Writer writer = new BufferedWriter(new OutputStreamWriter(
                   new FileOutputStream("samplefile"), "utf-8"))) {
         writer.write(someText);
     }

But this produces errors in Python like

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data 
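
To illustrate what this message means, here is a minimal Python 3 sketch (the code in this question is Python 2, so the exact wording of the message differs slightly). It assumes nothing about the real files; `decode_or_reason` is a hypothetical helper introduced only for this example. It shows that the ASCII-only sample string decodes cleanly, while a lone `0xC3` byte, which is the first half of a two-byte UTF-8 sequence, does not.

```python
def decode_or_reason(raw):
    """Return (text, None) if raw is valid UTF-8, else (None, reason)."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc.reason


# The sample string is pure ASCII, and ASCII is a subset of UTF-8.
ascii_bytes = "Brawlers Were Back On Ice and Canvas".encode("utf-8")
print(decode_or_reason(ascii_bytes))

# U+00E9 (e with acute accent) is encoded as the two bytes 0xC3 0xA9.
e_acute = "\u00e9".encode("utf-8")
print(decode_or_reason(e_acute))

# Cutting the sequence after its first byte leaves an incomplete character.
print(decode_or_reason(e_acute[:1]))  # (None, 'unexpected end of data')
```

So the message does not necessarily mean the whole file is badly encoded; it can also appear when a decoder is handed only part of a multi-byte character.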

Second edit: The Python code that processes the data:

dr = DirRunner(self.dir)
for item in dr:
    # open for reading using a buffer
    file = open(item, "r", 1)
    for line in file.readlines():
        yield line

DirRunner just returns a list of all the files and folders in one directory.
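
One way to narrow the problem down, assuming the files really are meant to be UTF-8, is to let the file object do the decoding so that any invalid byte is reported while reading rather than later in the pipeline. The following is only a sketch: `iter_lines_utf8` is a hypothetical name, and a plain list of paths stands in for `DirRunner`, whose implementation is not shown here. `io.open` with an `encoding` argument is available in both Python 2 and Python 3.

```python
import io
import os
import tempfile


def iter_lines_utf8(paths):
    """Yield text lines from each file, decoding the bytes as UTF-8."""
    for path in paths:
        # The with-block closes each file, even if decoding fails midway.
        with io.open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                yield line


# Small demonstration with a temporary file instead of a real directory.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with io.open(sample_path, "w", encoding="utf-8") as out:
    out.write(u"caf\u00e9\nBrawlers Were Back On Ice and Canvas\n")

lines = list(iter_lines_utf8([sample_path]))
print(lines)
```

With this approach the yielded lines are already text (unicode) objects, and a file containing invalid bytes raises `UnicodeDecodeError` during iteration, which makes it easier to see which file is affected.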

Each line is then processed in this function:

def any2utf8(input):
    """
    convert a string or object into utf8 encoding
    source: http://stackoverflow.com/questions/13101653/python-convert-complex-dictionary-of-strings-from-unicode-to-ascii
    usage:
        str = "abc"
        str_replace = any2utf8(str)
    """
    if isinstance(input, dict):
        return {any2utf8(key): any2utf8(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [any2utf8(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
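
For comparison, a Python 3 version of the same idea might look like the sketch below. This is an assumption-laden port, not the code from the linked source: the name `any2utf8_py3` is made up for this example, `str` replaces `unicode`, and `items()` replaces `iteritems()`.

```python
def any2utf8_py3(value):
    """Recursively encode every text string in value as UTF-8 bytes."""
    if isinstance(value, dict):
        return {any2utf8_py3(k): any2utf8_py3(v) for k, v in value.items()}
    if isinstance(value, list):
        return [any2utf8_py3(element) for element in value]
    if isinstance(value, str):
        return value.encode("utf-8")
    # Bytes, numbers and other objects are passed through unchanged.
    return value


converted = any2utf8_py3({"title": ["caf\u00e9", 42]})
print(converted)
```

Note that a function of this shape only ever encodes text to bytes. It contains no decoding step, and byte strings pass through it untouched.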
  • Can you explain a bit more about the failing process? Is the original file encoding already UTF-8, and does the processing run on a UTF-8 machine with a UTF-8 locale setting? – rpy Apr 17 '16 at 19:49
  • Don’t trust that “error in Java”, it’s not really an error. Please provide additional information. The string you gave here is ASCII-only, so it cannot produce wrongly encoded bytes in the large file. – Roland Illig Apr 17 '16 at 19:53
  • The failure in Python comes with the message `UnicodeDecodeError: 'utf8' codec can't decode byte 0xef in position 0: unexpected end of data `, which is produced by a call to `any2utf8()`. I have modified the file creation in Java in several ways (including the one in the second link), and I'm pretty sure this is okay. By the way, the module I'm using is gensim. – Patrick Liedtke Apr 17 '16 at 19:57
  • @RolandIllig Yeah, but still, if you enter it in this online converter, it produces an error. How, why? – Patrick Liedtke Apr 17 '16 at 19:59
  • Please also show us the Python code. The Java code looks great (assuming you close the stream at the end), but the Python code seems to read only one byte instead of all bytes from its input. The first bytes `\xEF` and `\xC3` look great, too. It’s just that they should be followed by more bytes. – Roland Illig Apr 18 '16 at 03:50
  • `dr = DirRunner(self.dir) for item in dr: file = open(item, "r", 1); for line in file.readlines(): yield line` This is the code that opens the files; DirRunner is just a class that returns the paths of all the files recursively. The lines are then processed in the function any2utf8(), which you can find [here](https://github.com/lidingpku/open-conference-data/blob/master/iswc-metadata/src/mu/lib_unicode.py) – Patrick Liedtke Apr 18 '16 at 11:41
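
The remark in the comments that `\xEF` and `\xC3` "should be followed by more bytes" can be checked with a small sketch. In UTF-8 the first byte of a character determines how many bytes the character occupies. The helper name `utf8_sequence_length` is hypothetical and exists only for this illustration.

```python
def utf8_sequence_length(lead):
    """Return the byte length announced by a UTF-8 lead byte, or 0 if
    the byte cannot start a character (continuation or invalid byte)."""
    if lead < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


for byte in (0x42, 0xC3, 0xEF):
    print(hex(byte), utf8_sequence_length(byte))
```

By this rule `0xC3` opens a two-byte sequence and `0xEF` a three-byte sequence, so a decoder that receives either byte on its own runs out of input before the character is complete.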

0 Answers