I have a vector coming from an external database that had a problem with encoding, resulting in a character vector with many occurrences of &ampXXX.

This looks like HTML encoding of accented letters.

Is there a function that can convert this to a readable vector?

Unsurprisingly, iconv() has no effect, since the encoding itself seems to be fine, even though Encoding(x) returns "unknown".

Here is a little reprex with the expected output:

x="H&ampeacutemipl&ampeacutegie"
iconv(x, from="latin1", to="utf8") #no effect
"Hémiplégie" #expected outcome
Dan Chaltiel
  • It is "escaping" (so not a proper encoding); in any case, HTML would have a ; at the end of the "entity" (the name used by HTML). Check https://search.r-project.org/CRAN/refmans/htmltools/html/htmlEscape.html -- wrong link, it does the reverse. But this is a duplicate of: https://stackoverflow.com/questions/42724885/convert-html-entity-to-proper-character-r – Giacomo Catenazzi Jul 19 '23 at 14:43
  • If the string were properly formatted, then https://stackoverflow.com/questions/5060076/convert-html-character-entity-encoding-in-r would also work. But what you have isn't any sort of valid encoded HTML entity. It should be more like `x="H&eacute;mipl&eacute;gie"` – MrFlick Jul 19 '23 at 18:19
  • Yeah, I thought I had the "encoding" word wrong, thanks. And indeed, this nasty database comes from nasty software, and that's not the worst it can do :-( – Dan Chaltiel Jul 19 '23 at 19:07
  • @GiacomoCatenazzi my vector lacks the `;` so `read_html()` doesn't work. I guess I will have to do it manually with `str_replace()` then... – Dan Chaltiel Jul 19 '23 at 19:09
  • Yeah. Or check the origin of the data (possibly the `;` was lost in some conversion). – Giacomo Catenazzi Jul 20 '23 at 06:36
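
Following up on the manual replacement idea from the comments, here is a minimal base-R sketch of what that could look like. Because the `;` terminator is missing, there is no reliable way to tell where an entity name ends, so the sketch relies on an explicit lookup table. Both the `entities` table and the `decode_broken_entities()` helper are hypothetical names made up for this illustration, and the table only covers a few common French accents; it would need to be extended with whatever names actually occur in the data.

```r
# Hypothetical lookup table (an assumption, not exhaustive):
# HTML entity name -> the character it stands for.
entities <- c(
  eacute = "\u00e9", egrave = "\u00e8", ecirc = "\u00ea",
  agrave = "\u00e0", acirc  = "\u00e2", ccedil = "\u00e7",
  icirc  = "\u00ee", ocirc  = "\u00f4", ugrave = "\u00f9"
)

# Replace every malformed "&ampNAME" sequence with its character.
decode_broken_entities <- function(x, table = entities) {
  # Try longer names first, so that a short name can never
  # shadow a longer one that starts with the same letters.
  table <- table[order(nchar(names(table)), decreasing = TRUE)]
  for (name in names(table)) {
    # fixed = TRUE: match "&ampNAME" literally, not as a regex
    x <- gsub(paste0("&amp", name), table[[name]], x, fixed = TRUE)
  }
  x
}

x <- "H&ampeacutemipl&ampeacutegie"
decode_broken_entities(x)
```

On the reprex this should give the expected "Hémiplégie". Note that the match is purely textual: a genuine `&amp` that happens to be followed by text starting with one of the listed names would be converted as well, so it is worth checking a sample of the results by eye.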
