
I wrote a C# console application that is supposed to display the HTML source of a page.

Instead, the console app prints the text HtmlAgilityPack.HtmlDocument.

Can anyone explain to me why that is?

class Program
{
    public HtmlDocument read()
    {
        HtmlWeb htmlWeb = new HtmlWeb();
        try
        {
            HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.yahoo.com");
            return document;
        }
        catch (Exception e)
        {
            Console.WriteLine("Error : " + e.ToString());
            return null;     
        }
    }     

    static void Main(string[] args)
    {
        Program dis = new Program();
        string text = Convert.ToString(dis.read());
        Console.WriteLine(text);
        Console.ReadLine();        
    }
}
Liam
ahamed
  • The output is "HtmlAgilityPack.HtmlDocument" – ahamed Jul 03 '13 at 15:28
  • I don't know the model of HtmlDocument, but clearly its ToString() is not implemented to return the HTML. You will need to inspect its properties and use the one that contains the source. – Nate Jul 03 '13 at 15:30
  • Possible duplicate: http://stackoverflow.com/questions/5599012/html-agility-pack-htmldocument-show-all-html – Liam Jul 03 '13 at 15:33
  • How do I then convert the document to a string? – ahamed Jul 03 '13 at 15:33

2 Answers


Replace

 return document;

with:

 return document.DocumentNode.InnerHtml;

Or, if you want to extract the text only (without HTML tags):

 return document.DocumentNode.InnerText;

Either way, the return type of read() has to change from HtmlDocument to string. The whole code would be:

using System;
using HtmlAgilityPack;

class Program
{
    // Downloads the page and returns its HTML source, or null if loading fails.
    public string read()
    {
        HtmlWeb htmlWeb = new HtmlWeb();
        try
        {
            HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.yahoo.com");
            // DocumentNode is the root node; InnerHtml is its markup as a string.
            return document.DocumentNode.InnerHtml;
        }
        catch (Exception e)
        {
            Console.WriteLine("Error : " + e.ToString());
            return null;
        }
    }

    static void Main(string[] args)
    {
        Program dis = new Program();
        string text = dis.read();
        Console.WriteLine(text);
        Console.ReadLine();
    }
}
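To see the difference between InnerHtml and InnerText without hitting the network, here is a minimal sketch that parses a small in-memory string instead of a live page. The sample markup and the InnerDemo class name are made up for illustration; it assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using HtmlAgilityPack;

static class InnerDemo
{
    static void Main()
    {
        // Hypothetical sample markup, parsed locally instead of downloaded.
        var document = new HtmlDocument();
        document.LoadHtml("<p>Hello <b>world</b></p>");

        // InnerHtml keeps the tags of everything below the root node.
        Console.WriteLine(document.DocumentNode.InnerHtml);

        // InnerText returns only the text content, with the tags stripped.
        Console.WriteLine(document.DocumentNode.InnerText);
    }
}
```

Both properties are plain strings, so either one can be returned from read() once its return type is string.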
Amine Hajyoussef

The default implementation of .ToString() just returns the fully qualified name of the type, which is what you're seeing. So HtmlDocument from the HtmlAgilityPack evidently doesn't override it.
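The behavior is easy to reproduce without HtmlAgilityPack. In this small sketch, the Demo namespace and both classes are hypothetical and exist only to contrast the inherited ToString() with an overridden one:

```csharp
using System;

namespace Demo
{
    // Does not override ToString(), so Object.ToString() is used.
    class PlainDocument
    {
        public string Html = "<p>hello</p>";
    }

    // Overrides ToString() to return its content instead.
    class PrintableDocument
    {
        public string Html = "<p>hello</p>";
        public override string ToString() => Html;
    }

    static class ToStringDemo
    {
        static void Main()
        {
            // Convert.ToString(object) ends up calling the object's ToString().
            Console.WriteLine(Convert.ToString(new PlainDocument()));     // prints "Demo.PlainDocument"
            Console.WriteLine(Convert.ToString(new PrintableDocument())); // prints "<p>hello</p>"
        }
    }
}
```

HtmlDocument behaves like PlainDocument here, which is why the console shows the type name rather than the page source.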

From glancing at the code over on CodePlex, it looks like you need to use the Save function to save the output to an XmlWriter and then use that to get the string. I don't see another way to get at the whole contents of the page directly from that object (though admittedly I just scanned it).
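For reference, a rough sketch of the Save route. It uses the Save overload that accepts a TextWriter rather than an XmlWriter, since a StringWriter is simpler when a plain string is all you need. The SaveDemo class, the ToHtmlString helper, and the sample markup are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

static class SaveDemo
{
    // Hypothetical helper: serializes the whole document into a string.
    public static string ToHtmlString(HtmlDocument document)
    {
        using (var writer = new StringWriter())
        {
            document.Save(writer);
            return writer.ToString();
        }
    }

    static void Main()
    {
        // Sample markup parsed in memory; a document from HtmlWeb.Load works the same way.
        var document = new HtmlDocument();
        document.LoadHtml("<p>Hello</p>");
        Console.WriteLine(ToHtmlString(document));
    }
}
```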

Edit: Amine Hajyoussef pointed you in the right direction with document.DocumentNode.InnerHtml, though note that you'll need to change the return type of the function to string as well.

Tim