I'm on Mac OS X and have a UTF-8 encoded file that contains non-ASCII (international) characters. Is there a tool or simple command (e.g. in Python 2.7 or the shell) that shows the file's contents as hex (base-16) byte values? For example, if I write some Asian characters into the file, I want to see the hex values of their encoded bytes.

My current solution is to open the file and read it byte by byte as a Python str. Is there a simpler way that does not require writing code? :)

Edit 1: the output of od does not look correct to me:

cat ~/Downloads/12
1

od ~/Downloads/12
0000000    000061
0000001

Edit 2: I also tried od with the -t x1 option:

od -t x1 ~/Downloads/12
0000000    31
0000001

Thanks in advance, Lin

Lin Ma
  • Not sure what you mean by "related hex" in this case. Using Terminal.app on Mac OS X with a bash shell, it's trivial to grep for Unicode characters outside the US-ASCII range, with something like `grep 中国 cn.txt`. Are you trying to find linguistically similar characters, characters near a codepoint range, or something else? – Lex Scarisbrick Aug 01 '16 at 00:40
  • @LexScarisbrick, nice example. By hex value I mean the actual byte values of the UTF-8 encoding. I want them because I may need to assign variable values like `\xE3\x80\x82` in Python 2.7, which is the hex form of the bytes for a Unicode character. I am not trying to grep. If you have any ideas, that would be great. – Lin Ma Aug 01 '16 at 03:52
  • `od` is the POSIX hex dump tool. Not a programming question; voting to close. – tripleee Aug 01 '16 at 03:56
  • Python 2 has a hex codec in its standard library; you don't need an external tool. For Python 3, see http://stackoverflow.com/questions/13435922/python-encode – tripleee Aug 01 '16 at 03:57
  • You can absolutely assign variables with Unicode characters (e.g. `foo = u'\u3002'`). It's still not clear why you would want to work directly with a UTF-8 encoded byte stream as opposed to decoded character strings. Something to keep in mind is that UTF-8 encoded characters are variable length and anywhere between 1 and 4 bytes long. Further reading: http://www.joelonsoftware.com/articles/Unicode.html – Lex Scarisbrick Aug 01 '16 at 15:32
  • @LexScarisbrick, I know how to use the `u` prefix to assign. My problem is that I do not know the Unicode value (e.g. `3002` in your example); I only know the international character itself. That is why I want to write the character into a text file and then get its hex values in UTF-8. If you have any better ideas for how to solve this, that would be great. Upvoted your reply. – Lin Ma Aug 01 '16 at 17:55
  • @tripleee, nice ideas, and I upvoted both of your comments. I am going to write some international characters, save them as UTF-8, and give it a try. Do you know which tools on Mac support saving plain text in different encodings, like UTF-8, UTF-16, etc.? I tried Atom and TextEdit, and neither seems to offer a choice of encoding when saving. Thanks. – Lin Ma Aug 01 '16 at 17:57
  • @tripleee, I tried `od` and the output does not look correct; please see the Edit 1 section of my original post. – Lin Ma Aug 01 '16 at 18:13
  • Just to be clear, are you hoping to print the Unicode code points or the byte values of the encoded byte stream? For example, for the single character `我`, do you want to see the Unicode code point `6211` or the UTF-8-encoded byte stream `e6 88 91`? – Robᵩ Aug 01 '16 at 18:32
  • @Robᵩ, byte values. Thanks and vote up. – Lin Ma Aug 01 '16 at 18:48

3 Answers

I'm not sure exactly what you want, but this script can help you look up the Unicode codepoint and UTF-8 byte sequence for any character. Be sure to save the source as UTF-8.

# coding: utf8
# Python 2.7: print each character with its code point and UTF-8 bytes.
s = u'我是美国人。'
for c in s:
    print c, 'U+{:04X} {}'.format(ord(c), repr(c.encode('utf8')))

Output:

我 U+6211 '\xe6\x88\x91'
是 U+662F '\xe6\x98\xaf'
美 U+7F8E '\xe7\xbe\x8e'
国 U+56FD '\xe5\x9b\xbd'
人 U+4EBA '\xe4\xba\xba'
。 U+3002 '\xe3\x80\x82'
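If the goal is a ready-to-paste escape string such as `\xE3\x80\x82`, the same idea can be wrapped in a small helper. This is only a sketch: the name `utf8_escapes` is made up for illustration, and it is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import binascii

def utf8_escapes(text):
    """Return (character, '\\xNN...' escape string) pairs for a unicode string.

    Hypothetical helper, not part of any library.
    """
    result = []
    for ch in text:
        # hexlify gives e.g. 'e38082' for the three UTF-8 bytes of U+3002.
        hex_digits = binascii.hexlify(ch.encode('utf-8')).decode('ascii')
        # Split into two-digit pairs and prefix each pair with '\x'.
        pairs = [hex_digits[i:i + 2] for i in range(0, len(hex_digits), 2)]
        result.append((ch, ''.join('\\x' + p for p in pairs)))
    return result

for ch, escaped in utf8_escapes(u'\u6211\u3002'):
    print(escaped)
# \xe6\x88\x91
# \xe3\x80\x82
```

The printed strings can be pasted into a Python 2.7 byte string literal as they are.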
Mark Tolonen

You can use the command iconv to convert between encodings. The basic command is:

iconv -f from_encoding -t to_encoding inputfile

and you can see a list of supported encodings with

iconv --list

In your case,

iconv -f UTF8 -t UCS-2 inputfile

You've also asked to see the hex values. A standard utility that will do this is xxd. You can pipe the results of iconv to xxd as follows:

iconv -f UTF8 -t UCS-2 inputfile | xxd  
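Keep in mind that converting changes the bytes, so the hex shown after a conversion is not the hex of the original UTF-8 file. A minimal Python sketch of the difference follows; `hex_bytes` is a hypothetical helper, and big-endian UTF-16 is used here as a stand-in for UCS-2 (the two agree for characters in the Basic Multilingual Plane, though the byte order iconv picks may differ).

```python
# -*- coding: utf-8 -*-
import binascii

def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated hex pairs."""
    digits = binascii.hexlify(text.encode(encoding)).decode('ascii')
    return ' '.join(digits[i:i + 2] for i in range(0, len(digits), 2))

ch = u'\u6211'  # the character 我
print(hex_bytes(ch, 'utf-8'))      # e6 88 91
print(hex_bytes(ch, 'utf-16-be'))  # 62 11
```

To see the bytes that are actually in a UTF-8 file, dump the file directly without converting it first.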
borrible
  • Thanks borrible, upvoted. I do not need to convert anything; I just need to see the file's existing hex values. Can you confirm whether this does what I need? Thanks. – Lin Ma Jul 31 '16 at 23:32
  • Your question's title looks like this is what you are asking for, but the actual question looks like it definitely isn't. We can't really know, can we? – tripleee Aug 01 '16 at 05:05
  • @tripleee, nice catch. I re-read my title and updated it. Upvoted for your recommendations. – Lin Ma Aug 01 '16 at 18:19

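od is the right command, but you need to pass the option -t x1 so that it prints one hexadecimal byte at a time. By default od prints two-byte words in octal, which is why the byte 0x31 showed up as 000061 in your first attempt: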

$ od -t x1 ~/Downloads/12
0000000 31
0000001

If you prefer not to see the file offsets, try adding -A none:

$ od -A none -t x1 ~/Downloads/12
 31

Additionally, the Linux man page (but not the OS X man page) lists this example: od -A x -t x1z -v, "Display hexdump format output."

Reference: http://www.unix.com/man-page/osx/1/od/
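For comparison, roughly the same output as `od -A none -t x1` can be produced in a few lines of Python. This is a sketch only: `file_hex` is a made-up name, and the temporary file stands in for a real file such as ~/Downloads/12.

```python
import os
import tempfile

def file_hex(path):
    """Return the file's bytes as space-separated two-digit hex values."""
    with open(path, 'rb') as f:
        # bytearray yields integers on both Python 2.7 and Python 3.
        data = bytearray(f.read())
    return ' '.join('{:02x}'.format(b) for b in data)

# Demo: a temporary file holding the single character "1", as in the question.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b'1')
    os.close(fd)
    print(file_hex(path))  # 31
finally:
    os.remove(path)
```

Opening the file in binary mode ('rb') matters: it returns the raw bytes instead of decoded text.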

Robᵩ
  • Thanks Robᵩ, upvoted. I posted the command output in the Edit 2 section. I think the output should only be `31`; what do `0000000` and `0000001` mean? – Lin Ma Aug 01 '16 at 18:25
  • Those are the offsets into the data file. So the first line represents offset 0, the next line, being empty, represents the end of the file at offset 1. Try a larger file to see how those offsets work. If you don't want to see the offsets, try adding `-A none`. – Robᵩ Aug 01 '16 at 18:26
  • Thanks Robᵩ, your solution works. Vote up and mark your reply as answer. – Lin Ma Aug 01 '16 at 18:40
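To make the offset columns discussed in these comments concrete, here is a rough Python sketch that mimics the layout of `od -t x1`: a seven-digit octal offset, up to 16 hex bytes per line, and a final line holding the total length. The function name `od_x1` is invented for illustration, the exact column spacing is approximate, and od's collapsing of repeated lines into `*` is not reproduced.

```python
def od_x1(data, width=16):
    """Return od-style lines: octal offset followed by hex bytes."""
    data = bytearray(data)  # integers on both Python 2.7 and Python 3
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = ' '.join('{:02x}'.format(b) for b in chunk)
        lines.append('{:07o} {}'.format(offset, hex_part))
    # The last line carries only the offset of the end of the data.
    lines.append('{:07o}'.format(len(data)))
    return lines

for line in od_x1(u'\u6211\u3002'.encode('utf-8')):
    print(line)
# 0000000 e6 88 91 e3 80 82
# 0000006
```

With input longer than 16 bytes the second line starts at offset 0000020, since the offsets are written in octal.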