When I use Nokogiri to parse HTML, the Chinese characters are converted to escaped sequences like

"å·…å³°å»¶æ—¶"

How can I decode escaped characters like "å·…å³°å»¶æ—¶" back to normal characters?

Phrogz
ArchenZhang
  • Possible duplicate: http://stackoverflow.com/questions/1600526/how-do-i-encode-decode-html-entities-in-ruby – DNNX Mar 16 '14 at 09:23

2 Answers

It looks like your HTML page is encoded as UTF-8 but is being parsed as ISO-8859-1. You need to specify the correct encoding when parsing. If you are parsing from a string, Nokogiri should use the same encoding as the string. If you are parsing from an IO object, you can specify the encoding as the third argument to the parse method:

Nokogiri::HTML::Document.parse(io_object, nil, 'UTF-8')
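
If the text has already been garbled, the mismatch described above can be reproduced and reversed with plain Ruby string methods. This is only a sketch: it assumes the wrong decoding was Windows-1252 (the common superset of ISO-8859-1, which is what the sample in the question appears to match), and the variable names are illustrative.

```ruby
# Sample text written with escapes; "\u5DC5\u5CF0\u5EF6\u65F6" is the string 巅峰延时.
original = "\u5DC5\u5CF0\u5EF6\u65F6"

# Simulate the bug: treat the UTF-8 bytes as Windows-1252, then transcode to UTF-8.
garbled = original.b.force_encoding("Windows-1252").encode("UTF-8")

# Undo it (assumption: Windows-1252 really was the wrong encoding used):
# transcode back to the original bytes and relabel them as UTF-8.
repaired = garbled.encode("Windows-1252").force_encoding("UTF-8")

puts garbled
puts repaired
```

Repairing after the fact is a workaround. `encode` raises `Encoding::UndefinedConversionError` if the garbled text contains characters with no Windows-1252 mapping, so passing the right encoding at parse time is the more reliable fix.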
matt

What should the normal characters be though? This looks like their string representations.

Otherwise, CGI.unescapeHTML() and CGI.escapeHTML() are available in Ruby's standard library.
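
For reference, here is a small sketch of what those methods do. The input strings are made up for illustration. Note that CGI.unescapeHTML only decodes HTML entities such as `&lt;` and `&amp;`; a string like the one in the question contains no entities, so it comes back unchanged.

```ruby
require "cgi"

# HTML entities are decoded back to the characters they stand for.
escaped   = "&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;"
unescaped = CGI.unescapeHTML(escaped)

# escapeHTML is the inverse operation for these characters.
round_trip = CGI.escapeHTML(unescaped)

# The garbled sample from the question (written with escapes) has no entities,
# so unescapeHTML has nothing to decode.
garbled   = "\u00E5\u00B7\u2026\u00E5\u00B3\u00B0\u00E5\u00BB\u00B6\u00E6\u2014\u00B6"
unchanged = CGI.unescapeHTML(garbled)

puts unescaped
puts round_trip
puts unchanged == garbled
```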

shevy