I'm currently trying to combine the text from many small files into one big file using Java. The big file is then read by a Python module that extracts phrases from it. During this process, I get an error indicating invalid UTF-8 text. Some research brought me to this error in Java, but it didn't solve my problem.
Strangely, when I type the sentence into an online UTF-8 converter like this one, it also reports an error. The string I used is "Brawlers Were Back On Ice and Canvas".
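As a quick sanity check (a minimal sketch, not part of my actual pipeline), the sentence is plain ASCII, so it should encode to and decode from UTF-8 without any problem:

```python
# The sentence from the question; every character is plain ASCII.
text = u"Brawlers Were Back On Ice and Canvas"

encoded = text.encode("utf-8")     # bytes as they would appear in a UTF-8 file
decoded = encoded.decode("utf-8")  # decoding the same bytes back

# True when no byte has the high bit set, i.e. the data is pure ASCII.
is_ascii = all(b < 128 for b in bytearray(encoded))

print(decoded == text)
print(is_ascii)
```

If both checks print True, the string itself is not the source of the invalid UTF-8 report.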
Can anyone explain to me why this happens?
Thanks in advance!
EDIT/UPDATE It looks like this online tool might have a bug. I'm still stuck on the problem of using the file in Python, so here is the code that creates it:
// Writer, BufferedWriter, OutputStreamWriter and FileOutputStream are from java.io
try (Writer writer = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("samplefile"), "utf-8"))) {
    writer.write(someText);
}
But this produces errors in Python like
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data
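The message itself can be reproduced in isolation. This is a hedged sketch, assuming the bytes handed to the decoder stop in the middle of a multi-byte character: 0xc3 is the lead byte of a two-byte UTF-8 sequence (for example, U+00E9 is 0xc3 0xa9). Whether that is what actually happens with my file, I don't know.

```python
complete = b"\xc3\xa9"    # full two-byte UTF-8 sequence for U+00E9
truncated = complete[:1]  # only the lead byte 0xc3, continuation byte missing

# The complete sequence decodes to a single character.
print(complete.decode("utf-8") == u"\u00e9")

try:
    truncated.decode("utf-8")
    message = None
except UnicodeDecodeError as err:
    # The decoder reports that the data ended before the character was complete.
    message = str(err)

print(message)
```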
SecondEdit: The Python code to process the data:
dr = DirRunner(self.dir)
for item in dr:
    # open for reading with line buffering (buffering=1);
    # the with block closes the file once all lines are consumed
    with open(item, "r", 1) as f:
        for line in f:
            yield line
DirRunner just returns a list of all the files and folders in one directory.
Each line is then processed in this function:
def any2utf8(input):
    """
    convert a string or object into utf8 encoding
    source: http://stackoverflow.com/questions/13101653/python-convert-complex-dictionary-of-strings-from-unicode-to-ascii
    usage:
        str = "abc"
        str_replace = any2utf8(str)
    """
    if isinstance(input, dict):
        return {any2utf8(key): any2utf8(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [any2utf8(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input