
I am attempting to use HtmlAgilityPack to extract all the text content from a webpage:

foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    sb.AppendLine(node.Text);
}

When I try to parse google.com with the code above, I get lots of JavaScript. All I want is to extract the visible content of the page, such as the text in heading or p tags. For example, taking the question, answers, and comments on this page and removing everything else.

I am really new to XPath and don't know exactly how to proceed, so any help would be appreciated.

Win Coder
  • OK, so the problem I was having was that the innerText of `script` and `style` elements was also being returned, so removing script and style was necessary. All credit goes to this guy: [link](http://stackoverflow.com/a/2785108/1762761) – Win Coder Aug 21 '13 at 11:47
  • I am not familiar with HtmlAgilityPack, but it sounds strange to me to get a **Text** from a **text()** node. You could try **SelectNodes("//*[text()]")** to get all the nodes that have a text node. – jvverde Aug 21 '13 at 13:55

2 Answers


You can filter the unwanted tags by name and remove them from your document:

    doc = page.Load("http://www.google.com");
    doc.DocumentNode.Descendants()
       .Where(n => n.Name == "script" || n.Name == "style")
       .ToList()
       .ForEach(n => n.Remove());
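Combining this removal step with the question's original extraction loop might look like the sketch below. The sample HTML string is made up for illustration; in practice you would load a real page with `HtmlWeb.Load` as shown above.

```csharp
using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><h1>Title</h1><script>var x = 1;</script><p>Hello</p></body></html>");

        // Remove script and style elements so their contents
        // are no longer part of the document's text nodes.
        foreach (var node in doc.DocumentNode.Descendants()
                     .Where(n => n.Name == "script" || n.Name == "style")
                     .ToList())
        {
            node.Remove();
        }

        // Now the original extraction returns only visible text.
        var sb = new StringBuilder();
        foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
        {
            sb.AppendLine(node.Text);
        }
        Console.WriteLine(sb.ToString());
    }
}
```

Note the `.ToList()` before removing: it materializes the matches so the document isn't modified while `Descendants()` is still being enumerated.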
Daniel B
  • That's the thing: I don't want to select only h1 tags; rather, I want to select text from the whole page. I don't think I would be able to cover every conceivable combination of tags for text extraction. – Win Coder Aug 21 '13 at 11:38

You could use this XPath expression:

//body//*[local-name() != 'script']/text()

It selects the text nodes of all elements inside the body and skips the script elements.
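A minimal sketch of using that expression with HtmlAgilityPack follows; the sample markup is hypothetical.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><p>Visible</p><script>var hidden = true;</script></body></html>");

        // Text nodes of every element under body whose name is not 'script'.
        var nodes = doc.DocumentNode.SelectNodes("//body//*[local-name() != 'script']/text()");
        foreach (HtmlTextNode node in nodes)
        {
            Console.WriteLine(node.Text);
        }
    }
}
```

One caveat: because the expression requires an element between `body` and the text node, text that is a direct child of `body` itself is not selected; it also does not exclude `style` elements, which you could add with another `local-name()` predicate.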

Bill Velasquez