I have a problem displaying the Swedish characters ä, ö and å properly in R.
I got my data directly from an MS SQL database.
Here are some examples:

markets <- c("Caf\xe9                          ","Restaurang kv\xe4ll              ","Barnomsorg tillagningsk\xf6k     ","Folkh\xf6gskola                  ")

Then I use gsub to remove the padding spaces:

market=gsub(" ", "", markets,fixed = TRUE)

I got this error:
Error in gsub(" ", "", market, fixed = TRUE) :
input string 3 is invalid UTF-8
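
The error suggests that the strings hold single Latin-1 bytes rather than valid UTF-8 sequences. A minimal sketch for checking this on a shortened copy of the vector above (the name `sample_markets` is only illustrative, and `validUTF8()` assumes R 3.3.0 or later):

```r
# Shortened copy of the question's data, plus one pure-ASCII string for contrast.
sample_markets <- c("Caf\xe9   ", "Folkh\xf6gskola   ", "Plain ASCII")

# validUTF8() reports, per element, whether the bytes form valid UTF-8.
is_valid <- validUTF8(sample_markets)

# charToRaw() shows the underlying bytes; the 4th byte of "Caf\xe9" is the
# single byte e9, which is Latin-1 for é and is not valid UTF-8 on its own.
last_byte <- charToRaw("Caf\xe9")[4]

print(is_valid)
print(last_byte)
```

If the accented strings come back as `FALSE`, the data needs to be declared as (or converted from) Latin-1 before string functions that expect UTF-8 are applied.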

Then I used this command:
markets_new=gsub(" ", "", markets)

Now the strings contain strange Chinese characters: "Caf攼㸹" "Restauranglunch+kv攼㸴ll" "Barnomsorgtillagningsk昼㸶k" "Folkh昼㸶gskola"

I tried changing the default RStudio settings as described here: https://yihui.name/en/2018/11/biggest-regret-knitr/?fbclid=IwAR2E5Lp0zjS51fcdjgZ1tej0sg5EBxfG8sNitt-cUA2XEshnT3lNCHNQ3Do

It did not help. I also tried using gsub() to substitute the characters, but that does not seem to work either.

One more thing, if I use

write.csv(markets,'submarket product view.csv',row.names = F)

then this is what I see in my CSV file:

"Caf<e9>                          "
"Restaurang kv<e4>ll              "
"Barnomsorg tillagningsk<f6>k     "
"Folkh<f6>gskola                  "
"Sm<f6>rg<e5>s/salladsrestaurang     " 

I think <e9> is é (e with an acute accent), <e4> is ä, <f6> is ö, and <e5> is å.
Any suggestions on how to fix this?
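
The guessed mapping can be checked from the other direction. A small sketch, assuming the intended characters are é, ä, ö and å; in Latin-1 the byte value equals the Unicode code point, so `utf8ToInt()` should give the same hex codes that appear in the CSV:

```r
# The four characters written as Unicode escapes: é, ä, ö, å.
chars <- c("\u00e9", "\u00e4", "\u00f6", "\u00e5")

# utf8ToInt() returns the code point of a single character;
# format it the way the CSV output displays the raw bytes.
codes <- sprintf("<%02x>", vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE))

print(codes)  # "<e9>" "<e4>" "<f6>" "<e5>"
```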

Mr Lister
CloverCeline
  • Try `Encoding(markets)<-"latin1"`. – nicola Mar 07 '19 at 09:08
  • It works fine as is in my Windows RGui 3.4.3 build. The problem is most likely with the locale. – Wiktor Stribiżew Mar 07 '19 at 09:15
  • ``gsub(" ", "", `Encoding<-`(markets, "latin1"),fixed = TRUE)`` should work. – Wiktor Stribiżew Mar 07 '19 at 10:10
  • @WiktorStribiżew: it does not really work. When I use that command on a column in a data frame or tibble, I get results like Caf攼㸹 and Restaurangostork昼ã, but it works if I apply it only to this character vector. Any more suggestions? Thank you! – CloverCeline Mar 08 '19 at 09:05
  • Provide [reproducible data](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Wiktor Stribiżew Mar 08 '19 at 09:07
  • @WiktorStribiżew: Hi Wiktor, here is the code: ``m <- c("Caf\xe9 ","Barnomsorg tillagningsk\xf6k ","Folkh\xf6gskola ","\xd6vriga stork\xf6k "); date <- c(as.Date('2016-12-26'), as.Date('2016-12-23'), as.Date('2017-01-19'), as.Date('2017-01-02')); number <- rnorm(4); df <- data.frame(m, date, number)``. You can also try ``dt = as_tibble(df)`` and ``t <- df %>% mutate(market = gsub(" ", "", `Encoding<-`(m, "latin1"), fixed = TRUE))``. It would also be nice to know how to understand the `` `Encoding<-`(markets, "latin1") `` command. – CloverCeline Mar 10 '19 at 13:06
  • ``df$m <- gsub(" ", "", `Encoding<-`(as.character(df$m), "latin1"),fixed = TRUE)`` works for me. – Wiktor Stribiżew Mar 10 '19 at 15:51

2 Answers

Thanks to @Wiktor Stribiżew this solution works best:

df$m <- gsub(" ", "", `Encoding<-`(as.character(df$m), "latin1"),fixed = TRUE) 
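
For anyone wondering how the `Encoding<-` call reads: it is the function form of the assignment `Encoding(x) <- "latin1"`. It only marks the existing bytes as Latin-1 and does not change them. The sketch below spells the one-liner out step by step on the sample data from the comments. Two assumptions are mine: the column is created with `stringsAsFactors = FALSE` (in R before 4.0 `data.frame()` turned strings into factors by default, which is presumably why `as.character()` is in the one-liner), and `trimws()` with `enc2utf8()` is swapped in for `gsub()` so that spaces inside a name are kept.

```r
# Sample data from the comments: Latin-1 bytes written as \x escapes.
m <- c("Caf\xe9 ", "Barnomsorg tillagningsk\xf6k ",
       "Folkh\xf6gskola ", "\xd6vriga stork\xf6k ")
df <- data.frame(m = m, stringsAsFactors = FALSE)

# as.character() is a no-op for a character column, but is needed
# when the column is a factor (the default in R before 4.0).
x <- as.character(df$m)

# Declare the bytes as Latin-1. This is the same operation as
# `Encoding<-`(x, "latin1"); the bytes themselves are untouched.
Encoding(x) <- "latin1"

# Convert to UTF-8, then strip leading/trailing whitespace only.
df$m <- trimws(enc2utf8(x))

print(df$m)
print(Encoding(df$m))
```

Unlike `gsub(" ", "", ...)`, which removes every space and would turn "Barnomsorg tillagningskök" into "Barnomsorgtillagningskök", `trimws()` removes only the padding at the ends. With a tibble, the same three steps can be wrapped in a small helper and called inside `mutate()`.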
CloverCeline

Try this:

Encoding(markets) <- "UTF-16"
markets <- trimws(markets)

#[1] "Café" "Restaurang kväll" "Barnomsorg tillagningskök" "Folkhögskola"  
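
A note on why this may work on some machines and not on others: as far as I can tell, `Encoding<-` only recognises "latin1", "UTF-8" and "bytes", and treats any other value, including "UTF-16", as "unknown" (meaning the native encoding). A minimal sketch to check this, with `x` as an illustrative variable:

```r
# One string from the question, holding the single Latin-1 byte e9.
x <- "Caf\xe9"

# Ask for an encoding that Encoding<- does not know about.
Encoding(x) <- "UTF-16"

# Inspect what the string is actually marked as afterwards.
enc_after <- Encoding(x)
print(enc_after)
```

If this prints "unknown", the strings are simply interpreted in the session's native encoding. In a Latin-1 or Windows-1252 locale that would display correctly, while in a UTF-8 locale it would not, which might explain the differing results.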
Wimpel
  • This doesn't work for me, while `latin1` works, as I said in the comments. – nicola Mar 07 '19 at 09:13
  • @nicola weird.. works just fine for me. Any ideas why? selected locale? – Wimpel Mar 07 '19 at 09:14
  • Not really. It seems very strange that something like `UTF-16` might work here (as far as I know, `UTF-16` wants two bytes per character). – nicola Mar 07 '19 at 09:18
  • Thank you for the input. I tried both. When I treat 'markets' as a character vector, both `latin1` and `UTF-16` work. But I currently have a column called 'market' in a tibble that has this issue. I tried ``test <- dfp %>% head(100) %>% mutate(mar=gsub(" ", "", `Encoding<-`(market, "UTF-16"),fixed = TRUE))`` and ``test <- dfp %>% head(100) %>% mutate(mar=gsub(" ", "", `Encoding<-`(market, "latin1"),fixed = TRUE))``, and they give the same results: when I run head(test), I see Caf攼㸹 and Restaurangostork昼ã in the console. Any more suggestions? – CloverCeline Mar 07 '19 at 12:58