
I am attempting to use HtmlAgilityPack to extract all the text content from a webpage:

foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    sb.AppendLine(node.Text);
}

When I try to parse google.com with the code above, I get lots of JavaScript. All I want is to extract the visible content of the page, such as the text in heading or p tags. For example, taking the question, answers, and comments on this page and removing everything else.

I am really new to XPath and don't know exactly how to proceed, so any help would be appreciated.

Win Coder
  • OK, so the problem I was having was that the innerText of `script` and `style` elements was also being returned, so removing script and style was necessary. All credit goes to this guy: [link](http://stackoverflow.com/a/2785108/1762761) – Win Coder Aug 21 '13 at 11:47
  • I am not familiar with HtmlAgilityPack, but it sounds strange to me to get a **Text** from a **text()** node. You could try **SelectNodes("//*[text()]")** to get all the nodes that have a text node. – jvverde Aug 21 '13 at 13:55

2 Answers


You can filter the unwanted tags by name and remove them from your document:

    doc = page.Load("http://www.google.com");
    doc.DocumentNode.Descendants()
       .Where(n => n.Name == "script" || n.Name == "style")
       .ToList()
       .ForEach(n => n.Remove());
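Combining this removal step with the question's original extraction loop might look like the sketch below. The sample HTML string is made up for illustration; in practice you would load a real page with `HtmlWeb.Load` as shown above.

```csharp
using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><h1>Title</h1><script>var x = 1;</script><p>Hello</p></body></html>");

        // Remove script and style elements so their contents
        // are no longer part of the document's text nodes.
        foreach (var node in doc.DocumentNode.Descendants()
                     .Where(n => n.Name == "script" || n.Name == "style")
                     .ToList())
        {
            node.Remove();
        }

        // Now the original extraction returns only visible text.
        var sb = new StringBuilder();
        foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
        {
            sb.AppendLine(node.Text);
        }
        Console.WriteLine(sb.ToString());
    }
}
```

Note the `.ToList()` before removing: it materializes the matches so the document isn't modified while `Descendants()` is still being enumerated.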
Daniel B
  • That's the thing: I don't want to select only h1 tags; rather, I want to select text from the whole page. I don't think I would be able to cover every conceivable combination of tags for text extraction. – Win Coder Aug 21 '13 at 11:38

You could use this XPath expression:

//body//*[local-name() != 'script']/text()

It selects the text nodes of all elements inside the body and skips the script elements.
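A minimal sketch of using that expression with HtmlAgilityPack follows; the sample markup is hypothetical.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><p>Visible</p><script>var hidden = true;</script></body></html>");

        // Text nodes of every element under body whose name is not 'script'.
        var nodes = doc.DocumentNode.SelectNodes("//body//*[local-name() != 'script']/text()");
        foreach (HtmlTextNode node in nodes)
        {
            Console.WriteLine(node.Text);
        }
    }
}
```

One caveat: because the expression requires an element between `body` and the text node, text that is a direct child of `body` itself is not selected; it also does not exclude `style` elements, which you could add with another `local-name()` predicate.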

Bill Velasquez