C# html viewing using html agility pack

Question

I wrote a C# console application that is supposed to display the HTML source of a page.

Instead, the console app prints HtmlAgilityPack.HtmlDocument.

Can anyone explain to me why that is?

class Program
{
    public HtmlDocument read()
    {
        HtmlWeb htmlWeb = new HtmlWeb();
        try
        {
            HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.yahoo.com");
            return document;
        }
        catch (Exception e)
        {
            Console.WriteLine("Error : " + e.ToString());
            return null;     
        }
    }     

    static void Main(string[] args)
    {
        Program dis = new Program();
        string text = Convert.ToString(dis.read());
        Console.WriteLine(text);
        Console.ReadLine();        
    }
}

I don't know the model of HtmlDocument, but clearly its ToString() is not implemented to return the HTML. You will need to inspect its properties and use the one that contains the source. — Nate, Jul 03 '13 at 15:30
Possible duplicate: http://stackoverflow.com/questions/5599012/html-agility-pack-htmldocument-show-all-html — Liam, Jul 03 '13 at 15:33

Amine Hajyoussef · Answer 1 · 2013-07-03T15:41:05.703

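Replace

 return document;

with:

 return document.DocumentNode.InnerHtml;

or, if you want to extract only the text (without HTML tags):

 return document.DocumentNode.InnerText;

Both properties are strings, so the return type of read() also has to change from HtmlDocument to string, and Main no longer needs Convert.ToString.

The whole code would be: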

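using System;
using HtmlAgilityPack;

class Program
{
    // Returns the page markup as a string, or null if the download fails.
    public string read()
    {
        HtmlWeb htmlWeb = new HtmlWeb();
        try
        {
            HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.yahoo.com");
            // DocumentNode is the root node; InnerHtml is the markup beneath it.
            return document.DocumentNode.InnerHtml;
        }
        catch (Exception e)
        {
            Console.WriteLine("Error : " + e.ToString());
            return null;
        }
    }

    static void Main(string[] args)
    {
        Program dis = new Program();
        string text = dis.read();   // already a string, no Convert.ToString needed
        Console.WriteLine(text);
        Console.ReadLine();
    }
}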

score 2 · Answer 2 · answered Jul 03 '13 at 15:34

The default implementation of .ToString() just returns the fully qualified name of the type, which is what you're seeing. So HtmlDocument from the HtmlAgilityPack evidently doesn't override it.
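To illustrate that default behaviour without involving HtmlAgilityPack at all, here is a minimal sketch. The namespace ToStringDemo and the classes PlainPage and PrintablePage are hypothetical names made up for this example; they are not part of any library.

```csharp
using System;

namespace ToStringDemo
{
    // Hypothetical type that does NOT override ToString().
    class PlainPage
    {
        public string Html = "<p>hello</p>";
    }

    // Hypothetical type that overrides ToString() to return its content.
    class PrintablePage
    {
        public string Html = "<p>hello</p>";

        public override string ToString()
        {
            return Html;
        }
    }

    static class Demo
    {
        static void Main()
        {
            // Convert.ToString(object) ends up calling the object's ToString().
            Console.WriteLine(Convert.ToString(new PlainPage()));     // prints "ToStringDemo.PlainPage"
            Console.WriteLine(Convert.ToString(new PrintablePage())); // prints "<p>hello</p>"
        }
    }
}
```

The first line of output has the same shape as the HtmlAgilityPack.HtmlDocument text in the question: namespace, dot, class name.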

From glancing at the code over on CodePlex, it looks like you need to use the Save function to save the output to an XmlWriter and then use that to get the string. I don't see another way to get at the whole contents of the page directly from that object (though admittedly I just scanned it).
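For what it's worth, a rough sketch of the Save route might look like the following. It swaps the XmlWriter mentioned above for a StringWriter, on the assumption that the installed HtmlAgilityPack version exposes a Save overload taking a TextWriter; check that against your package before relying on it. SaveDemo and ToHtmlString are hypothetical names, and the markup is loaded from a literal string rather than the network so the example stays small.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

static class SaveDemo
{
    // Hypothetical helper: serializes a parsed document back into a string.
    // Assumes HtmlDocument.Save(TextWriter) is available in your version.
    public static string ToHtmlString(HtmlDocument document)
    {
        using (var writer = new StringWriter())
        {
            document.Save(writer);
            return writer.ToString();
        }
    }

    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml("<html><body><p>Hello</p></body></html>");

        // Route 1: write the document into an in-memory writer.
        Console.WriteLine(ToHtmlString(document));

        // Route 2: read the markup straight off the root node.
        Console.WriteLine(document.DocumentNode.OuterHtml);
    }
}
```

In practice the DocumentNode properties are the shorter path; the writer-based version is mainly useful when the output is headed for a stream or file anyway.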

Edit: Amine Hajyoussef pointed you in the right direction with document.DocumentNode.InnerHtml, though note that you'll need to change the return type of the function from HtmlDocument to string as well.
