I'm working with data from many different sources, so I'm creating a name bridge and a function to make it easier to join tables. One of the sources uses an umlaut in a value and (I think) the Excel CSV isn't UTF-8 encoded, so I'm getting strange results.

Since I can't control how the other source compiles their data, I'd like to make a universal function that fixes all the weird encoding rules. I'll use Dennis Schröder as an example name.

One particular source uses the umlaut, and when I read it in with read.csv and view the table in RStudio, it shows up as Dennis Schr<f6>der. However, if I index the table at his value (table[i,j]), the console prints Dennis Schr\xf6der.

So in my name-bridge csv, I made a row to map every Dennis Schr\xf6der to Dennis Schroder. I read this name bridge in (with the argument allowEscapes = TRUE), and he shows up exactly the same in my name-bridge table. Great! I should be able to left_join this to the other source to change the name to just Dennis Schroder.

But unfortunately the names still don't map unless I don't trim the strings (I have to trim strings in general because other sources introduce whitespace). Here's the general function I use to fix names. dataframe is the other source's table, VarUse is the name column from dataframe that I want to fix, and correctionTable is my name bridge.

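library(dplyr)    # %>%, mutate(), left_join()
library(stringr)  # str_trim()

nameUpdate <- dataframe %>%
  mutate(name = str_trim(VarUse, 'both')) %>%  # strip leading/trailing whitespace
  left_join(correctionTable, by = c('name' = 'WrongName'))  # look up the corrected name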

When I dig into the results of this mapping, I get the following:

  • correctionTable[14,1] is my name-bridge input of "Dennis Schr\xf6der".
  • nameUpdate[29,3] is the original name variable from the other source which reads "Dennis Schr\xf6der".
  • nameUpdate[29,19] is the mutated name variable from the other source after using str_trim, which also reads "Dennis Schr\xf6der".

However, for some reason the str_trim version is not equal to the name-bridge value, so it won't map.

In writing this (non-reproducible, sorry) example, I've figured out a work-around by using a combo of str_trim and by not using it, but at this point I'm just confused why the name doesn't get fixed after I use str_trim. The values look exactly the same.

CoolGuyHasChillDay
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Just comparing printed values can be misleading. What do `charToRaw()` and `Encoding()` return for these nameUpdate values? – MrFlick Dec 10 '18 at 21:06
  • How are the encodings set? You can check with `Encoding(nameUpdate[29,19])`, and if encodings are set differently, the strings are considered different, even if they look the same. – Emil Bode Dec 11 '18 at 00:14
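
The checks suggested in the comments can be sketched with base R alone. This is only an illustration, not a confirmed diagnosis of the question's data: it assumes the source file is Latin-1 (or Windows-1252), where the byte 0xf6 is "ö", and `clean_name` is a hypothetical helper name, not something from the question.

```r
# Assumption: the source file is Latin-1 (or Windows-1252), where byte 0xf6 is "ö".
# Build the string from raw bytes so the example does not depend on this file's encoding.
latin1_name <- rawToChar(c(charToRaw("Dennis Schr"), as.raw(0xf6), charToRaw("der")))

# Convert to UTF-8; iconv() marks the result as UTF-8 by default.
utf8_name <- iconv(latin1_name, from = "latin1", to = "UTF-8")

# The two values differ both in declared encoding and in their bytes.
Encoding(c(latin1_name, utf8_name))  # "unknown" "UTF-8"
charToRaw(latin1_name)               # a single byte f6 for the umlaut
charToRaw(utf8_name)                 # two bytes c3 b6 for the umlaut
identical(charToRaw(latin1_name), charToRaw(utf8_name))  # FALSE

# Hypothetical helper: convert first, then trim, so both join keys are valid UTF-8.
clean_name <- function(x, from = "latin1") {
  trimws(iconv(x, from = from, to = "UTF-8"))
}

padded <- paste0("  ", latin1_name, " ")
identical(clean_name(padded), clean_name(latin1_name))  # TRUE
```

If the assumption about the source encoding holds, applying the same conversion to both the name column and the name-bridge column before joining should make the keys comparable byte for byte. Another option may be to declare the encoding when reading the file, for example with the `fileEncoding` argument of `read.csv`.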
