
Does anyone know of a generic function in R that can convert an HTML entity such as `&#228;` to its Unicode character ä? I have seen some functions that take ä and convert it to a plain character, but not the reverse. Any help would be appreciated. Thanks.

Edit: Below is a sample record from my data; I have over 1 million such records. Is there an easier solution than reading the data into a massive vector and changing each record element by element?

wine/name: 1999 Domaine Robert Chevillon Nuits St. Georges 1er Cru Les Vaucrains
wine/wineId: 43163
wine/variant: Pinot Noir
wine/year: 1999
review/points: N/A
review/time: 1337385600
review/userId: 1
review/userName: Eric
review/text: Well this is awfully gorgeous, especially with a nicely grilled piece of Copper River sockeye. Pine needle and piercing perfume move to a remarkably energetic and youthful palate of pure, twangy, red fruit. Beneath that is a fair amount of umami and savory aspect with a surprising amount of tannin. Lots of goodness here. Still quite young but already rewarding at this stage.

wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!

Update: Using the stri_trans_general() function will convert any remaining special character to a correct lowercase character, and the vapply() results need to be assigned to save the changes.

# cellartracker-10records.txt is the test file to use
library("XML")

tester <- "/Users/petergensler/Desktop/Wine Analysis/cellartracker-10records.txt"

# Parse the file with htmlParse(), which decodes the HTML entities,
# then extract the text of the resulting <p> node
decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x), "//p")[[1]])
}

# Using a vector, as we want to iterate over the raw file for cleaning
poop <- vapply(tester, decode, character(1), USE.NAMES = FALSE)

# Now use stringi to convert any remaining characters to their ASCII equivalents
poop <- stringi::stri_trans_general(poop, "Latin-ASCII")
writeLines(poop, "wines.txt")

1 Answer


Here's one way via the XML package:

txt <- "wine/name: 2003 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg Riesling Kabinett"

library("XML")
xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])

> xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"

The [[1]] bit is because getNodeSet() returns a list of parsed elements, even if there is only one element, as is the case here.
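
For example, splitting that call into two steps (purely for illustration) shows the intermediate list that getNodeSet() returns:

nodes <- getNodeSet(htmlParse(txt, asText = TRUE), "//p")
length(nodes)         # 1 -- the list holds the single parsed <p> node
xmlValue(nodes[[1]])  # extract that node and take its text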

This was taken/modified from a reply to the R-Help list by Henrique Dallazuanna in 2010.

If you want to run this for a character vector of length >1, then lapply() this:

txt <- rep(txt, 2)
decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}
lapply(txt, decode)

or if you want it as a vector, vapply():

> vapply(txt, decode, character(1), USE.NAMES = FALSE)
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
[2] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"

For the multi-line example, use the original version, but you have to write the character vector back out to a file if you want it as a multiline document again:

txt <- "wine/name: 2001 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg 
Riesling Sp&#228;tlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!"

out <- xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])

This gives me

> out
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg \nRiesling Spätlese\nwine/wineId: 3058\nwine/variant: Riesling\nwine/year: 2001\nreview/points: N/A\nreview/time: 1095120000\nreview/userId: 1\nreview/userName: Eric\nreview/text: Hideously corked!"

Which, if you write out using writeLines():

writeLines(out, "wines.txt")

You'll get a text file, which can be read in again using your other parsing code:

> readLines("wines.txt")
 [1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg "
 [2] "Riesling Spätlese"                                            
 [3] "wine/wineId: 3058"                                            
 [4] "wine/variant: Riesling"                                       
 [5] "wine/year: 2001"                                              
 [6] "review/points: N/A"                                           
 [7] "review/time: 1095120000"                                      
 [8] "review/userId: 1"                                             
 [9] "review/userName: Eric"                                        
[10] "review/text: Hideously corked!"

And it is a file (from my Bash terminal):

$ cat wines.txt 
wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg 
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!
  • Thanks for the response! This seems to work fine on a single line, but I have lots of records that I need to operate over that have this issue. Should I be using apply over a large character vector to change all these elements? – petergensler Mar 10 '17 at 19:11
  • Sorry, I should have clarified my issue. I am currently reading my file into a character vector, but the file is pretty large (over 1 million rows). When I tried running the vapply function, I get `Error in which(value == defs) : argument "code" is missing, with no default` – petergensler Mar 10 '17 at 19:44
  • Do you want the output to be the same (million row) text document but with the html entities replaced? That's it? – Gavin Simpson Mar 10 '17 at 19:46
  • Yep. I'll then use readr's readlines to read the file with a different function that parses the txt file into a dataframe. – petergensler Mar 10 '17 at 19:47
  • I've updated the example with more data so the format makes more sense. – petergensler Mar 10 '17 at 19:55
  • OK, I don't think it makes much difference how much data there is. If all you want is to replace the html entities in the file, which you read in, then you can do it as I show in the updated answer. Is that what you want? – Gavin Simpson Mar 10 '17 at 20:02
  • It seems like your function works fine on a single record, but I think it crashes on my sample file. I still get the same error: `Error in which(value == defs) : argument "code" is missing, with no default  Called from: which(value == defs)`. The raw file is located here: http://snap.stanford.edu/data/cellartracker.txt.gz – petergensler Mar 10 '17 at 20:10
  • Sorry @petergensler but I can't start debugging this for you - that's a 230 MB text file zipped. You'll need to make this into a minimal example. Does the example fail with the longer version of the data you pasted into the question? If not you'll need to boil it down to a simpler example. – Gavin Simpson Mar 10 '17 at 20:13
  • I think I know what is going on here....when I read my file into a char vector it has a length of 1:109 records. However, when I simply copy/paste the record to simulate it, it treats the file as one long record. Let me see if I can get a MWE subset – petergensler Mar 10 '17 at 20:15
  • You might not need the `asText = TRUE` part and just pass the name of the file instead of `txt`. That will avoid you having to read the file into R at all. Try that? – Gavin Simpson Mar 10 '17 at 20:18
  • I think that worked. I'll update the code in my example. – petergensler Mar 10 '17 at 20:26
  • Yeah the only issue I'm having is that when I try to clean the strings, the character is saved as an uppercase unicode character, and I can't seem to save it as lowercase. Thoughts? – petergensler Mar 10 '17 at 20:52
  • You might want to isolate that bit into a *new* question, as it seems distinct from converting HTML entities, which is what we've looked at here. – Gavin Simpson Mar 10 '17 at 21:38
  • Thanks for the feedback. When I try to run your new code I get this error: `Error in paste(file, collapse = "\n") : result would exceed 2^31-1 bytes`. Should I be reading this data into some other R object to clean it with, or should I be using Bash to clean the file prior to importing it? – petergensler Mar 13 '17 at 17:11
  • So the size of the resulting vector is exceeding the largest allowed length in R. You might need to read in lines in chunks, apply the solution to each of those lines, write those to a new file, then read the next chunk of lines, apply the solution, append the lines to the new file, and so on (see the sketch after these comments). – Gavin Simpson Mar 13 '17 at 17:14
  • I have used the above function and it worked. But it seems to me that converting HTML entities is a very common task. I wonder if there are packages already including this function? – petzi May 26 '18 at 10:21
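
For reference, here is a minimal sketch of the chunk-wise approach described in the comment above, reusing the htmlParse()/xmlValue() idea from the answer. The decode_chunk() helper, the cellartracker.txt path, and the chunk size are illustrative assumptions, not part of the original thread:

library("XML")

# Decode the HTML entities in one chunk of lines; working in chunks keeps each
# paste() well under the 2^31-1 byte limit hit when pasting the whole file at once
decode_chunk <- function(lines) {
  xmlValue(getNodeSet(htmlParse(paste(lines, collapse = "\n"),
                                asText = TRUE), "//p")[[1]])
}

in_con  <- file("cellartracker.txt", open = "r")  # illustrative input path
out_con <- file("wines.txt", open = "w")
chunk_size <- 10000L                              # lines per chunk; tune to memory

repeat {
  lines <- readLines(in_con, n = chunk_size)
  if (length(lines) == 0L) break                  # end of file reached
  if (all(!nzchar(lines))) {                      # a chunk of blank lines only
    writeLines(lines, out_con)
    next
  }
  writeLines(decode_chunk(lines), out_con)
}

close(in_con)
close(out_con)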