0

I have a bunch of text files that were encoded in UTF-8. The text inside the files looks like this: \x6c\x69b/\x62\x2f\x6d\x69nd/m\x61x\x2e\x70h\x70.

I've copied all these text files and placed them into a directory /convert/.

I need to read each file, convert the escaped literals into characters, and then save the result as filename.converted.txt.

What would be the smartest approach to do this? What can I do to convert to the new text? Is there a function for handling Unicode text that converts these escaped literals into characters? Should I be using a different programming language for this?

This is what I have at the moment:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;

public class decode {
    public static void main(String args[]) {
        File directory = new File("C:/convert/");
        String[] files = directory.list();
        boolean success = false;
        for (String file : files) {
            System.out.println("Processing \"" + file + "\"");

            //TODO read each file and convert them into characters
            success = true;

            if (success) {
                System.out.println("Successfully converted \"" + file + "\"");
            } else {
                System.out.println("Failed to convert \"" + file + "\"");
            }

            //save file
            if (success) {
                try {
                    FileWriter open = new FileWriter("C:/convert/" + file + ".converted.txt");
                    BufferedWriter write = new BufferedWriter(open);
                    write.write("TODO: write converted text into file");
                    write.close();
                    System.out.println("Successfully saved \"" + file + "\" conversion.");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
Kyle
  • Was the link I provided in your previous post useless? [link](http://stackoverflow.com/questions/3537706/howto-unescape-a-java-string-literal-in-java) – svaor Nov 04 '11 at 07:16
  • @svaor I made a new question to cover what I was trying to accomplish. The link was not useless; it was helpful, but I needed more insight to get started. Thanks for the link. – Kyle Nov 04 '11 at 08:32

2 Answers

3

(It looks like there's some confusion about what you mean - this answer assumes the input file is entirely in ASCII, and uses "\x" to hex-encode any bytes which aren't in the ASCII range.)

It sounds to me like the UTF-8 part of it is actually irrelevant. You can treat it as opaque binary data for output. Assuming the input file is entirely ASCII:

  • Open the input file as text (e.g. using FileInputStream wrapped in InputStreamReader specifying an encoding of "US-ASCII")
  • Open the output file as binary (e.g. using FileOutputStream)
  • Read each character from the input
  • Is it '\'?
    • If not, write the character's ASCII value to the output stream (just cast from char to byte)
    • If so, what's the next character?
      • If it's 'x', read the next two characters, convert them from hex to a byte (there's lots of code around to do this part), and write that byte to the output stream
      • If it's '\', write the ASCII value for '\' to the output stream
      • Otherwise, possibly throw an exception indicating failure
  • Loop until you've exhausted the input file
  • Close both files in finally blocks

You'll then have a "normal" UTF-8 file which should be readable by any text editor which supports UTF-8.
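
The steps above can be sketched roughly as follows. This is only a minimal illustration under the same assumption as the answer (the input is pure ASCII and only the `\xHH` and `\\` escapes occur). The class name `HexEscapeDecoder` and the method names `decode` and `decodeFile` are made up for this example, and it uses try-with-resources (Java 7+) in place of explicit finally blocks.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative, not from the original answer.
final class HexEscapeDecoder {

    // Copies ASCII text from 'in' to 'out', turning \xHH into the byte 0xHH
    // and \\ into a single backslash. Any other escape is treated as an error.
    static void decode(Reader in, OutputStream out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c > 0x7F) {
                throw new IOException("Non-ASCII character in input: " + c);
            }
            if (c != '\\') {
                out.write(c); // plain ASCII character: its value is the byte to write
                continue;
            }
            int next = in.read();
            if (next == 'x') {
                // Two hex digits form one raw output byte.
                out.write((hexDigit(in.read()) << 4) | hexDigit(in.read()));
            } else if (next == '\\') {
                out.write('\\');
            } else {
                throw new IOException("Unsupported or truncated escape sequence");
            }
        }
    }

    // Converts one hex digit character to its value, failing on end of input or a non-hex character.
    private static int hexDigit(int ch) throws IOException {
        int value = (ch == -1) ? -1 : Character.digit(ch, 16);
        if (value < 0) {
            throw new IOException("Malformed \\x escape");
        }
        return value;
    }

    // Reads 'input' as US-ASCII text and writes the decoded raw bytes to 'output'.
    static void decodeFile(Path input, Path output) throws IOException {
        try (Reader in = Files.newBufferedReader(input, StandardCharsets.US_ASCII);
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(output))) {
            decode(in, out);
        }
    }
}
```

For the directory loop in the question, each file could then be passed to `decodeFile` together with a target path ending in `.converted.txt`. Because the decoded bytes are written out unchanged, any multi-byte UTF-8 sequences that were hex-escaped come back as valid UTF-8 in the output file.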

Jon Skeet
  • Some of the files contain php functions in plaintext but my ultimate goal is to restore the files into plaintext. My friend "obfuscated" these files and I was trying to show him how pointless it was. – Kyle Nov 04 '11 at 08:41
0

java.io.InputStreamReader can be used to convert an input stream from an arbitrary charset into Java chars. I'm not exactly sure how you want to write it back out, though. Do you want non-ASCII characters to be written out as ASCII Unicode escape sequences?
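
As a rough illustration of the charset-decoding part only, the sketch below reads an input stream into a String with InputStreamReader. The class name `CharsetReadDemo` and the method `readAll` are invented for this example, and it does not handle `\x` escape sequences at all; it only shows how bytes in a given charset become Java chars.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper; the names are illustrative only.
final class CharsetReadDemo {

    // Decodes every byte of 'in' using 'charset' and returns the resulting text.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n); // append only the chars actually read
            }
        }
        return text.toString();
    }
}
```

A file could be read this way by passing a `FileInputStream` and, for example, `StandardCharsets.UTF_8` as the charset.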

Matthew Cline
  • It sounds like we've understood the question in very different ways... it'll be interesting to see what the correct interpretation is. – Jon Skeet Nov 04 '11 at 07:14