I'm working with data from many different sources, so I'm creating a name bridge and a function to make it easier to join tables. One of the sources uses an umlaut in a value and (I think) the CSV exported from Excel isn't UTF-8 encoded, so I'm getting strange results.
Since I can't control how the other source compiles their data, I'd like to write a universal function that fixes these encoding problems. I'll use Dennis Schröder as an example name.
One particular source uses the umlaut, and when I read it in with `read.csv` and view the table in RStudio, it shows up as `Dennis Schr<f6>der`. However, if I index the table to his value (`table[i,j]`), the console reads `Dennis Schr\xf6der`.
So in my name-bridge CSV, I made a row to map all `Dennis Schr\xf6der` to `Dennis Schroder`. I read this name bridge in (with the option `allowEscapes = TRUE`), and he shows up exactly the same in my name-bridge table. Great! I should be able to `left_join` this to the other source to change the name to just `Dennis Schroder`.
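
For context, the byte `0xf6` is "ö" in Latin-1 (and Windows-1252), and R prints it as `\xf6` when it cannot interpret the byte in the current encoding. The following is only a minimal sketch, assuming the source file really is Latin-1: it rebuilds the raw value by hand (the variable names are made up for illustration) and re-encodes it to UTF-8 with base R's `iconv`.

```r
# Rebuild the value as it might look right after read.csv: the single byte 0xf6
# stands in for "ö" (assumption: the source file is Latin-1 / Windows-1252).
raw_name <- rawToChar(c(charToRaw("Dennis Schr"), as.raw(0xf6), charToRaw("der")))

# R has no declared encoding for these bytes yet
Encoding(raw_name)

# Re-encode from the assumed source encoding to UTF-8
utf8_name <- iconv(raw_name, from = "latin1", to = "UTF-8")

# The converted string is now marked as UTF-8 and compares equal to "Schröder"
Encoding(utf8_name)
identical(utf8_name, "Dennis Schr\u00f6der")
```

If the encoding of a source is known up front, passing `fileEncoding = "latin1"` to `read.csv` is another option, since it converts the text while the file is being read.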
But unfortunately the names still don't map unless I don't trim strings (I have to trim strings in general because other sources introduce whitespace). Here's the general function I use to fix names. `dataframe` is the other source's table, `VarUse` is the name column from `dataframe` that I want to fix, and `correctionTable` is my name bridge.
library(dplyr)
library(stringr)

nameUpdate <- dataframe %>%
  mutate(name = str_trim(VarUse, 'both')) %>%
  left_join(correctionTable, by = c('name' = 'WrongName'))
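
One way to make the join independent of how each table happened to be read is to put both key columns through the same clean-up before joining. This is a hedged sketch rather than a drop-in fix: `clean_key` is a hypothetical helper, the two small tables are made-up stand-ins, the `from = "latin1"` default is an assumption about this particular source, and base R's `trimws` and `merge` are swapped in for `str_trim` and `left_join` so the example has no package dependencies.

```r
# Hypothetical helper (not from the original post): re-encode first, then trim.
# `from` is an assumption about the source; "latin1" fits the 0xf6 byte seen here.
clean_key <- function(x, from = "latin1") {
  x <- iconv(x, from = from, to = "UTF-8")
  trimws(x, which = "both")
}

# Made-up stand-ins for the two tables; the raw 0xf6 byte mimics the source file
raw_name <- rawToChar(c(charToRaw("Dennis Schr"), as.raw(0xf6), charToRaw("der")))
padded   <- rawToChar(c(charToRaw(" Dennis Schr"), as.raw(0xf6), charToRaw("der ")))

dataframe <- data.frame(VarUse = c(padded, "Some Other Player"),
                        stringsAsFactors = FALSE)
correctionTable <- data.frame(WrongName = raw_name,
                              RightName = "Dennis Schroder",
                              stringsAsFactors = FALSE)

# Apply the same clean-up to both join keys
dataframe$name <- clean_key(dataframe$VarUse)
correctionTable$WrongName <- clean_key(correctionTable$WrongName)

# Left join in base R: all.x = TRUE keeps every row of `dataframe`
nameUpdate <- merge(dataframe, correctionTable,
                    by.x = "name", by.y = "WrongName", all.x = TRUE)
```

Converting before trimming is deliberate: string functions can behave unpredictably on bytes that are not valid in the current encoding. The `from` value would need to be set per source, because re-encoding a file that is already UTF-8 as if it were Latin-1 would corrupt it.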
When I dig into the results of this mapping, I get the following:
- `correctionTable[14,1]` is my name-bridge input of "Dennis Schr\xf6der".
- `nameUpdate[29,3]` is the original name variable from the other source, which reads "Dennis Schr\xf6der".
- `nameUpdate[29,19]` is the mutated `name` variable from the other source after using `str_trim`, which also reads "Dennis Schr\xf6der".
However, for some reason the `str_trim` version is not equal to the name-bridge value, so it won't map.
In writing this (non-reproducible, sorry) example, I've figured out a workaround by using a combination of `str_trim` and not using it, but at this point I'm just confused about why the name doesn't get fixed after I use `str_trim`. The values look exactly the same.
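
Two strings can print identically in the console and still differ in their underlying bytes or in their declared encoding, which is enough to make a join key fail to match. The post does not show which step changes the value, so the sketch below only illustrates how one might check: `byte_report` is a hypothetical helper, and the two example values are constructed by hand rather than taken from the real tables.

```r
# Hypothetical helper: list the declared encoding and raw bytes of each string
byte_report <- function(x) {
  data.frame(
    encoding = Encoding(x),
    bytes = vapply(x, function(s) paste(charToRaw(s), collapse = " "),
                   character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Hand-built examples: the same name as a single Latin-1 byte and as UTF-8
latin1_bytes <- rawToChar(c(charToRaw("Dennis Schr"), as.raw(0xf6), charToRaw("der")))
utf8_bytes   <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")

report <- byte_report(c(latin1_bytes, utf8_bytes))
print(report)

# The byte sequences differ even though both spell the same name
same_bytes <- identical(charToRaw(latin1_bytes), charToRaw(utf8_bytes))
```

Running the same kind of report on `correctionTable[14,1]` and `nameUpdate[29,19]` should show whether the two values differ in bytes, in encoding mark, or in both.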