How to read files that use unsupported encodings and/or charsets in Java

Question

I need to read a CSV file into a Java application, but the file is encoded using Western (Mac OS Roman), which is unsupported in Java.

It's been suggested that I read the text through a byte stream and convert every byte with a value of 128 or above to the space character (ASCII character 32). But I have no idea how to do this. I don't know how to process one byte at a time, how to convert the bytes, or how, once I've reached the end of a line, to take that line of "truncated" text, split it into an array, and pull the data out of the indexes I need.

SortedMap<String, OBJ_NAME> mapResults = new TreeMap<String, OBJ_NAME>();
String url = "url-to-file";
InputStream inputStream = null;
InputStreamReader reader = null;
CSVReader csvReader = null;
final Pattern regexPattern = Pattern.compile("^\\d{2}\\.\\d{1,3}$");

try {
    inputStream = new URL(url).openStream();

    reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
    csvReader = new CSVReader(reader, ',', '"', 1);
    List<String[]> lines = csvReader.readAll();

    for (String[] line : lines) {
        // logic to grab data from first and second indices of the line
        OBJ_NAME objInstance = new OBJ_NAME();

        objInstance.setFieldOne(line[0]);
        objInstance.setFieldTwo(line[1]);
        mapResults.put(line[1], objInstance);
    }
} catch (Exception e) {
    throw new IOException(e);
} finally {
    // IOUtils from apache commons
    IOUtils.closeQuietly(inputStream);
    IOUtils.closeQuietly(reader);
    IOUtils.closeQuietly(csvReader);
}

Because the CSV uses an unsupported encoding, the logic above decodes the data incorrectly (the file is not UTF-8), so I'm getting far fewer results than I should. I'm not sure whether I should read it as ASCII and "interrupt" characters at 128 and above (which I don't know how to do), or use a byte stream instead (which I also don't know how to do).
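For illustration, here is a minimal sketch of the byte-by-byte approach described above, assuming every byte at or above 0x80 can safely become a space. The class and method names (`AsciiOnlyLineReader`, `readLines`) are hypothetical and not part of the original code. Since files from classic Mac OS may end lines with a bare CR, the sketch treats CR, LF, and CRLF all as line terminators.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names here are illustrative only.
class AsciiOnlyLineReader {

    /**
     * Reads the stream one byte at a time, replaces every byte at or above
     * 0x80 with a space, and returns the text split into lines.
     */
    static List<String> readLines(InputStream in) throws IOException {
        InputStream buffered = new BufferedInputStream(in);
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean lastWasCr = false;
        int b;
        // read() returns the next byte as an int in 0..255, or -1 at end of stream
        while ((b = buffered.read()) != -1) {
            if (b == '\r') {
                lines.add(current.toString());
                current.setLength(0);
                lastWasCr = true;
            } else if (b == '\n') {
                // An LF directly after a CR belongs to the same CRLF terminator
                if (!lastWasCr) {
                    lines.add(current.toString());
                    current.setLength(0);
                }
                lastWasCr = false;
            } else {
                lastWasCr = false;
                // Bytes below 0x80 are plain ASCII; anything else becomes a space
                current.append(b < 0x80 ? (char) b : ' ');
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString()); // last line without a terminator
        }
        return lines;
    }
}
```

Each returned line could then be split with `line.split(",", -1)`, though that naive split does not handle commas inside quoted fields. If quoted fields matter, the cleaned lines could instead be joined and passed to the existing CSV parser through a `StringReader`.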

Help? And also, screw anyone who releases documents with official information in outdated, unsupported encodings.

Is replacing everything from `0x80` (byte 128) upward actually acceptable? It is possible, after all, but you only mentioned that it was "suggested" to you, not that you intend to do it. — Izruo, Jan 16 '19 at 19:26
@Izruo Looking at the chart for Mac OS Roman, it looks like the first 128 are basic ASCII and the rest are special characters. I don't need any of those special characters for this task, so losing those is trivial. I just need the basic alphanumerical characters, plus the double quotes and commas and newlines for the CSV formatting, so I'm good. I just...don't know how to do it. — MystikDan, Jan 16 '19 at 21:16
You could write your own Charset and CharsetDecoder based on ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT. But if all your data is expected to fall within the [C0 Controls and Basic Latin](http://www.unicode.org/charts/nameslist/index.html) block, you could abuse another Charset such as StandardCharsets.ISO_8859_1 (which would never raise an exception) or StandardCharsets.US_ASCII (which would). — Tom Blodget, Jan 17 '19 at 02:44
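A rough sketch of the two options from the last comment, with hypothetical names (`CharsetFallbackDemo`, `decodeLenient`, `decodeStrict`). One caveat worth stating: `new InputStreamReader(in, StandardCharsets.US_ASCII)` replaces undecodable bytes with U+FFFD rather than throwing, so getting an exception from US_ASCII requires a `CharsetDecoder` whose error action is `CodingErrorAction.REPORT`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class CharsetFallbackDemo {

    /**
     * ISO-8859-1 maps every byte to a char, so decoding never fails.
     * Bytes at or above 0x80 decode to U+0080..U+00FF, which are then
     * replaced with spaces.
     */
    static String decodeLenient(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.ISO_8859_1);
        return text.replaceAll("[^\\x00-\\x7F]", " ");
    }

    /**
     * Strict US-ASCII decoding: any byte at or above 0x80 raises a
     * CharacterCodingException instead of being silently replaced.
     */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }
}
```

For a stream, the same strict decoder can be passed to the `InputStreamReader(InputStream, CharsetDecoder)` constructor. The lenient variant fits the stated goal of discarding the special characters, while the strict one is useful for detecting whether a file contains any such bytes at all.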
