Scraping a wiki page for the "Periodic table" and all the links

Question

I wish to scrape the following wiki article: http://en.wikipedia.org/wiki/Periodic_table

So that the output of my R code will be a table with the following columns:

Chemical element's short name (symbol)
Chemical element's full name
The URL of the chemical element's wiki page

(and with a row for each chemical element, obviously)

I am trying to get at the values inside the page using the XML package, but I seem to be stuck at the very beginning, so I'd appreciate an example of how to do it (and/or links to relevant examples).

library(XML)
library(RCurl)  # getURLContent() comes from RCurl, not XML
base_url<-"http://en.wikipedia.org/wiki/Periodic_table"
base_html<-getURLContent(base_url)[[1]]
parsed_html <- htmlTreeParse(base_html, useInternalNodes = TRUE)
xmlChildren(parsed_html)
getNodeSet(parsed_html, "//html", c(x = base_url))
[[1]]
attr(,"class")
[1] "XMLNodeSet"

Why are you trying to scrap the page? Get your data from elsewhere, periodic system ain't gonna change frequently. And those link gotta follow some pattern... — aL3xa, Dec 09 '10 at 00:38
If it weren't you, Tal, asking this question I would be very suspicious, but I know your motives must be pure. That page is "protected", and so even if there were handy tools for R to access the special Wiki interface they might not work for this. Try going to this page: http://en.wikipedia.org/w/index.php?title=Template:Periodic_table&action=edit — IRTFM, Dec 09 '10 at 00:48
Dear DWin and aL3Xa, thank you for your comments and your vote of confidence in my motives :) ---- Dear aL3Xa, The reason I asked for this question is because of the skills needed to complete it, NOT because of a need for the periodic table. Dear DWin - I was looking for a "neutral" example to ask my question on, of a webpage that was in "public domain". I didn't notice that this wiki page in particular was protected (and thus, not a prime candidate for scraping). I can assure you both that I intend to use these skills on webpages I am authorized to scrap :) Best, Tal. — Tal Galili, Dec 09 '10 at 08:21
linguistic point: 'scrape' is the verb from 'scraping'. To 'scrap' something means to turn it into garbage! 'Scrap' gets two 'p's - so 'scrapping a page' means to turn it into rubbish, 'scraping a page' is getting the data from it! — Spacedman, Dec 09 '10 at 12:04
@ Tal: I didn't realize what a great question this would turn out to be. Two great replies! — IRTFM, Dec 09 '10 at 15:07
Spacedman - thanks for the correction (English is far from being my native language, so any correction is much appreciated). DWin - I'm glad to read you gained from this question as well :) Cheers, Tal — Tal Galili, Dec 09 '10 at 17:10
Every wikimedia page can be returned in XML format rather than HTML - e.g. http://en.wikipedia.org/wiki/Special:Export/Periodic_Table Read more here: http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export — David d C e Freitas, Nov 18 '11 at 22:29

G. Grothendieck · Answer 1 · 2010-12-09T12:20:21.570

Try this:

library(XML)

URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)

# extract attributes and value of all 'a' tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
m1 <- xpathApply(root, "//table[3]//a", f)
m2 <- suppressWarnings(do.call(rbind, m1))

# extract rows that correspond to chemical symbols
ix <- grep("^[[:upper:]][[:lower:]]{0,2}", m2[, "class"])

m3 <- m2[ix, 1:3]
colnames(m3) <- c("URL", "Name", "Symbol")
m3[,1] <- sub("^", "http://en.wikipedia.org", m3[,1])
m3[,2] <- sub(" .*", "", m3[,2])

A bit of the output:

> dim(m3)
[1] 118   3
> head(m3)
     URL                                      Name        Symbol
[1,] "http://en.wikipedia.org/wiki/Hydrogen"  "Hydrogen"  "H"   
[2,] "http://en.wikipedia.org/wiki/Helium"    "Helium"    "He"  
[3,] "http://en.wikipedia.org/wiki/Lithium"   "Lithium"   "Li"  
[4,] "http://en.wikipedia.org/wiki/Beryllium" "Beryllium" "Be"  
[5,] "http://en.wikipedia.org/wiki/Boron"     "Boron"     "B"   
[6,] "http://en.wikipedia.org/wiki/Carbon"    "Carbon"    "C"

We can make this more compact by refining the XPath expression. Jeffrey's expression nearly gets the elements at the top of the result; adding a predicate that drops links with empty text puts them exactly at the top, so the first 118 rows are the elements. With that, xpathSApply can be used, which eliminates the need for do.call or the plyr package. The last bit, where we fix up odds and ends, is the same as before. This produces a matrix rather than a data frame, which seems preferable since the content is entirely character.

library(XML)

URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)

# extract attributes and value of all a tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
M <- t(xpathSApply(root, "//table[3]/tr/td/a[.!='']", f))[1:118,]

# nicer column names, fix up URLs, fix up Mercury.
colnames(M) <- c("URL", "Name", "Symbol")
M[,1] <- sub("^", "http://en.wikipedia.org", M[,1])
M[,2] <- sub(" .*", "", M[,2])

View(M)
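The markup of the live page may differ from what it was when this was written, so here is a minimal offline sketch of the same idea. The HTML snippet is a hypothetical miniature invented for illustration (it is not the real page), and the XPath is loosened to `//table//td/a` so that it does not depend on the exact table nesting. It shows how the `[. != '']` predicate drops links with empty text before `xpathSApply` builds the matrix.

```r
library(XML)

# Hypothetical miniature of the table markup (an assumption for illustration;
# the real Wikipedia page is far larger and structured differently).
html <- '<html><body><table>
  <tr>
    <td><a href="/wiki/Hydrogen" title="Hydrogen">H</a></td>
    <td><a href="/wiki/Helium" title="Helium">He</a></td>
    <td><a href="/wiki/File:Icon.png" title="Icon"></a></td>
  </tr>
  <tr>
    <td><a href="/wiki/Mercury_(element)" title="Mercury (element)">Hg</a></td>
  </tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Attributes (href, title) followed by the link text, one column per link.
f <- function(x) c(xmlAttrs(x), xmlValue(x))

# The predicate [. != ''] keeps only links whose text is non-empty.
M <- t(xpathSApply(doc, "//table//td/a[. != '']", f))

colnames(M) <- c("URL", "Name", "Symbol")
M[, "URL"]  <- paste0("http://en.wikipedia.org", M[, "URL"])
M[, "Name"] <- sub(" .*", "", M[, "Name"])  # "Mercury (element)" -> "Mercury"

print(M)
```

The empty icon link is filtered out by the predicate, leaving one row per element link.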

Hello Grothendieck - wonderful answer - two thumbs up (and one "chosen" answer). Thank you for your help! Best, Tal — Tal Galili, Dec 09 '10 at 08:53
I was debating who to choose as the "answer". And since Jeffrey (the other answer), has only 18 karma points (and he also supplied a viable answer), I decided to give the "V mark" to him. But your answer was very helpful. Thank you again! Best, Tal — Tal Galili, Dec 09 '10 at 09:12
Have added a second solution based on an enhancement of Jeffrey's xpath expression and my previous code. — G. Grothendieck, Dec 09 '10 at 12:21
@ Gabor: I already up-voted your first reply, so all I can do is upvote this further comment and add written extra thanks for this and all of your other excellent replies on rhelp. — IRTFM, Dec 09 '10 at 15:05

Jeffrey Breen · Accepted Answer · 2010-12-09T15:19:02.063

Tal -- I thought this was going to be easy. I was going to point you to readHTMLTable(), my favorite function in the XML package. Heck, its help page even shows an example of scraping a Wikipedia page!

But alas, this is not what you want:

library(XML)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
tables = readHTMLTable(url)

# ... look through the list to find the one you want...

table = tables[3]
table
$`NULL`
         Group #    1    2    3     4     5     6     7     8     9    10    11    12     13     14     15     16     17     18
1         Period      <NA> <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
2              1   1H       2He  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
3              2  3Li  4Be         5B    6C    7N    8O    9F  10Ne  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
4              3 11Na 12Mg       13Al  14Si   15P   16S  17Cl  18Ar  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
5              4  19K 20Ca 21Sc  22Ti   23V  24Cr  25Mn  26Fe  27Co  28Ni  29Cu  30Zn   31Ga   32Ge   33As   34Se   35Br   36Kr
6              5 37Rb 38Sr  39Y  40Zr  41Nb  42Mo  43Tc  44Ru  45Rh  46Pd  47Ag  48Cd   49In   50Sn   51Sb   52Te    53I   54Xe
7              6 55Cs 56Ba    *  72Hf  73Ta   74W  75Re  76Os  77Ir  78Pt  79Au  80Hg   81Tl   82Pb   83Bi   84Po   85At   86Rn
8              7 87Fr 88Ra   ** 104Rf 105Db 106Sg 107Bh 108Hs 109Mt 110Ds 111Rg 112Cn 113Uut 114Uuq 115Uup 116Uuh 117Uus 118Uuo
9                <NA> <NA> <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
10 * Lanthanoids 57La 58Ce 59Pr  60Nd  61Pm  62Sm  63Eu  64Gd  65Tb  66Dy  67Ho  68Er   69Tm   70Yb   71Lu          <NA>   <NA>
11  ** Actinoids 89Ac 90Th 91Pa   92U  93Np  94Pu  95Am  96Cm  97Bk  98Cf  99Es 100Fm  101Md  102No  103Lr          <NA>   <NA>

The names are gone and the atomic number runs into the symbol.
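If only the atomic number and symbol were needed, the fused cells could still be pulled apart with a regular expression. The following is a rough sketch under that assumption; the `cells` vector is a handful of values retyped from the output above, and `split_cells` is a name made up for this illustration. It does not recover the full names or the links, which is why a different approach is needed.

```r
# A few cells retyped from the readHTMLTable() output above, plus a
# placeholder ("*") and a missing value to show that they are skipped.
cells <- c("1H", "2He", "3Li", "113Uut", "*", NA)

# Keep only cells that look like <atomic number><symbol>.
is_element <- !is.na(cells) & grepl("^[0-9]+[A-Z][a-z]{0,2}$", cells)
elements <- cells[is_element]

# Split the leading digits from the trailing letters.
split_cells <- data.frame(
  number = as.integer(sub("^([0-9]+).*$", "\\1", elements)),
  symbol = sub("^[0-9]+", "", elements),
  stringsAsFactors = FALSE
)

print(split_cells)
```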

So back to the drawing board...

My DOM-walking-fu is not very strong, so this isn't pretty. It gets every link in a table cell, keeps only those with a "title" attribute (which holds the element's full name; the link text is the symbol), and sticks what you want in a data.frame. It picks up every other such link on the page, too, but we're lucky and the elements are the first 118 such links:

library(XML)
library(plyr) 

url = 'http://en.wikipedia.org/wiki/Periodic_table'

# don't forget to parse the HTML, doh!

doc = htmlParse(url)

# get every link in a table cell:

links = getNodeSet(doc, '//table/tr/td/a')

# make a data.frame for each node with non-blank text, link, and 'title' attribute:

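df = ldply(links, function(x) {
            text = xmlValue(x)               # link text: the element symbol
            if (text=='') text=NULL

            symbol = xmlGetAttr(x, 'title')  # note: 'title' holds the full name
            link = xmlGetAttr(x, 'href')
            # && is the scalar logical operator, appropriate inside if()
            if (!is.null(text) && !is.null(symbol) && !is.null(link))
                data.frame(symbol, text, link)
        } )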

# only keep the actual elements -- we're lucky they're first!

df = head(df, 118)

head(df)
     symbol text            link
1  Hydrogen    H  /wiki/Hydrogen
2    Helium   He    /wiki/Helium
3   Lithium   Li   /wiki/Lithium
4 Beryllium   Be /wiki/Beryllium
5     Boron    B     /wiki/Boron
6    Carbon    C    /wiki/Carbon
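To get from this to the three columns asked for in the question, the links still need the site prefix and the columns could use clearer names. Here is a small sketch of that last step; the input rows are retyped by hand from the output above, and the column names `short_name`, `full_name` and `url` are just illustrative choices.

```r
# First rows of the data frame shown above, retyped for illustration.
df <- data.frame(
  symbol = c("Hydrogen", "Helium", "Lithium"),
  text   = c("H", "He", "Li"),
  link   = c("/wiki/Hydrogen", "/wiki/Helium", "/wiki/Lithium"),
  stringsAsFactors = FALSE
)

# Rename the columns and turn the relative links into absolute URLs.
# as.character() guards against the columns having been built as factors.
elements <- data.frame(
  short_name = as.character(df$text),
  full_name  = as.character(df$symbol),
  url        = paste0("http://en.wikipedia.org", as.character(df$link)),
  stringsAsFactors = FALSE
)

print(elements)
```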

Hello Jeffrey, thank you for your answer! I like seeing how you implemented it. I also suggest you'd add the "doc <- htmlTreeParse(URL, useInternalNodes = TRUE) " line which is missing. Thanks very much! Tal — Tal Galili, Dec 09 '10 at 08:53
After more thought (and since I ended up using pieces of your code), I decided to move the "V mark" to your answer. I hope to see you here more - your answer was wonderful (p.s.: when I started I also thought to go with the readHTMLTable function, and then quickly found its limitations). Cheers, Tal — Tal Galili, Dec 09 '10 at 09:13
Thanks very much -- and sorry for missing the parse call! I too like G. Grothendieck's answer, but I appreciate the points! — Jeffrey Breen, Dec 09 '10 at 16:07

score 0 · Answer 3 · answered Dec 05 '18 at 20:15

Do you have to scrape Wikipedia? You can run this SPARQL query against Wikidata instead:

SELECT
  ?elementLabel
  ?symbol
  ?article
WHERE
{
  ?element wdt:P31 wd:Q11344;
           wdt:P1086 ?n;
           wdt:P246 ?symbol.
  OPTIONAL {
    ?article schema:about ?element;
             schema:inLanguage "en";
             schema:isPartOf <https://en.wikipedia.org/>.
  }
  FILTER (?n >= 1 && ?n <= 118).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?n

Sorry if this doesn't answer your question directly but this should help people looking to scrape the same information but in a clean manner.
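Since the question is about R, here is a hedged sketch of how the query above might be sent from R. It only builds the request URL with base R. The endpoint `https://query.wikidata.org/sparql` and the `format=json` parameter are assumptions about the public Wikidata Query Service, and the commented-out fetch assumes the jsonlite package is installed, so treat this as a starting point rather than a tested client.

```r
# The SPARQL query from this answer, kept as a single string.
query <- '
SELECT ?elementLabel ?symbol ?article
WHERE {
  ?element wdt:P31 wd:Q11344;
           wdt:P1086 ?n;
           wdt:P246 ?symbol.
  OPTIONAL {
    ?article schema:about ?element;
             schema:inLanguage "en";
             schema:isPartOf <https://en.wikipedia.org/>.
  }
  FILTER (?n >= 1 && ?n <= 118).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?n
'

# Assumed endpoint of the Wikidata Query Service.
endpoint <- "https://query.wikidata.org/sparql"

# Percent-encode the whole query, including reserved characters.
request_url <- paste0(
  endpoint,
  "?query=", utils::URLencode(query, reserved = TRUE),
  "&format=json"
)

# Fetching needs network access, so it is left commented out here:
# result <- jsonlite::fromJSON(request_url)
# head(result$results$bindings)
```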
