Trying to parse Spanish text with axios/cheerio (JavaScript)

Question

I am trying to scrape a Spanish website: 'https://www.marca.com/futbol/real-madrid.html?intcmp=MENUESCU&s_kw=realmadrid'

But when parsing the headlines I receive the following text: "Rafa Mar�n y Peter, los �nicos canteranos para la Copa adem�s de los porteros Fuidias y Diego". So I am trying to get rid of the � and get the correct ó, ñ, ¿, á, é, í, ú... characters.

I am scraping the data as follows:

axios.get(newspaper.address)
.then((response) =>{
    const html = response.data;
    const $ = cheerio.load(html);
    $('.mod-title > a', html).each(function(){
        const headline = $(this).text().trim();
        const link = $(this).attr('href');
        if(!articles.some(article => article.headline == headline)){
            articles.push({source:newspaper.name, headline, link});
        }
    });
}).catch((err) => console.log(err));

I do not really know how to change the encoding or which encoding to use.

The website has its charset set as "iso-8859-15". – gre_gor, Jan 04 '22 at 22:40

score -1 · Answer 1 · answered Jan 05 '22 at 01:18

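It seems you are fetching a web page that uses an encoding other than UTF-8.

What you can do is make the request with responseType and responseEncoding set as shown below, so that axios returns the raw bytes instead of a string it has already decoded: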

const response = await axios.request({
  method: 'GET',
  url: 'https://www.WantedWebsite.com',
  responseType: 'arraybuffer',
  responseEncoding: 'binary'
});

You then have to decode the raw data yourself, using the page's encoding, before you can parse it:

let html = iso88592.decode(response.data.toString('binary'));

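You might have to adjust a few things, but this may solve your problem. Note that iso88592 here is assumed to be an ISO-8859-2 decoder (for example, from the iso-8859-2 npm package); use a decoder that matches the charset the page actually declares.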
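Since the comment on the question says the page declares "iso-8859-15", here is a minimal sketch of the decode step using the built-in TextDecoder instead of a third-party decoder. This is a swapped-in technique, not what the answer above uses. It assumes a Node.js build with full ICU (otherwise the 'iso-8859-15' label may be rejected), and decodeLatin9 and the sample bytes are hypothetical names made up for illustration.

```javascript
// Sketch: decode raw ISO-8859-15 bytes with the built-in TextDecoder.
// Assumption: Node.js was built with full ICU, so the 'iso-8859-15'
// label is available; on a small-ICU build the constructor may throw.
function decodeLatin9(bytes) {
  // `bytes` is a Buffer, Uint8Array, or ArrayBuffer, for example
  // response.data when axios is called with responseType: 'arraybuffer'.
  return new TextDecoder('iso-8859-15').decode(bytes);
}

// Hypothetical sample: "Marín" as ISO-8859-15 bytes (0xED is "í").
const sample = Buffer.from([0x4d, 0x61, 0x72, 0xed, 0x6e]);
console.log(decodeLatin9(sample));
```

In the scraper from the question, the decoded string would then be passed to cheerio.load in place of response.data.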

Hope this helps, good luck deving!

This seems to be just reworded from the accepted answer from the question I linked. — gre_gor, Jan 05 '22 at 01:26
