I have a crawler that collects URLs from a website containing RDF data. I tried to read that data with Jena like this:

Model model = ModelFactory.createDefaultModel();
model.read(url);
model.write(System.out);

url is a String holding a web page link. The first line executes and the debugger stops at the second line, but then control jumps back to the first line (this code sits inside the crawler's loop), so nothing is ever written. I have also tried fetching the HTML of the page and passing that string to the read function, but that didn't work either.

I'm really a rookie to RDF and Jena, and my Java experience isn't really extensive, so any help is good.

Vuk Stanković
  • You say "goes back to first line (because of loop)," but there's no loop shown here. [`Model.read`](http://jena.apache.org/documentation/javadoc/jena/com/hp/hpl/jena/rdf/model/Model.html#read(java.lang.String)) doesn't read HTML, but rather an RDF document. (The specs say XML, implying RDF/XML, but I wouldn't be surprised if it can handle other serializations, too (e.g., Turtle).) – Joshua Taylor Aug 15 '13 at 12:54
  • This part of code is in a crawler loop – Vuk Stanković Aug 15 '13 at 12:55
  • Yes, the context suggested that. My point was that, according to the closing guidelines, "Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance." There's not enough here to reproduce the problem. It could be due to the `url`, it could be due to something else in the loop, etc. There's not enough code here to reproduce the problem yet. – Joshua Taylor Aug 15 '13 at 12:58
  • On what value for `url` are you having this problem? – Joshua Taylor Aug 15 '13 at 13:06

1 Answer

The code you've got for reading a model from a url is correct. For instance, here's a complete example that reads one of the examples from section 2.13 Typed Node Elements of the RDF/XML specification:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class RetrieveRemoteRDF {
    public static void main(String[] args) {
        final String url = "http://www.w3.org/TR/REC-rdf-syntax/example14.nt";
        final Model model = ModelFactory.createDefaultModel();
        model.read(url);
        model.write(System.out);
    }
}

The output (in the default RDF/XML serialization) is:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:j.0="http://example.org/stuff/1.0/" > 
  <rdf:Description rdf:about="http://example.org/thing">
    <dc:title>A marvelous thing</dc:title>
    <rdf:type rdf:resource="http://example.org/stuff/1.0/Document"/>
  </rdf:Description>
</rdf:RDF>

If you're still encountering a problem, it is most likely caused by the particular URL being passed to model.read: the method expects that URL to resolve to an RDF document, not to an arbitrary web page.
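One way a crawler could guard against handing non-RDF pages to model.read is to look at the response's Content-Type header first. The following is only a minimal sketch under stated assumptions: RdfMediaTypes and isRdf are hypothetical names, not part of Jena, and the list contains the registered media types for common RDF serializations; which of them a given Jena version can actually parse would need to be checked separately. It uses only the JDK (Java 9+ for Set.of).

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Jena): tests whether a Content-Type value names an RDF serialization. */
final class RdfMediaTypes {
    // Registered media types for common RDF serializations.
    // Assumption: this list is illustrative, not a statement of what any Jena release supports.
    private static final Set<String> RDF_TYPES = Set.of(
            "application/rdf+xml",
            "text/turtle",
            "application/n-triples",
            "application/ld+json",
            "text/n3");

    private RdfMediaTypes() {}

    /** Returns true if the header value, ignoring parameters such as charset, is one of the types above. */
    static boolean isRdf(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters like "; charset=UTF-8" and normalize case before comparing.
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return RDF_TYPES.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(isRdf("application/rdf+xml"));      // true
        System.out.println(isRdf("text/html; charset=UTF-8")); // false
    }
}
```

In a crawler loop, the header value could come from something like HttpURLConnection.getContentType(), and model.read(url) would be called only when isRdf returns true. A server may of course label its content incorrectly, so this is a filter, not a guarantee.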

Joshua Taylor
  • Then it is my mistake. I have tried to pass a web page that contains embedded RDF data. So I have to pass a valid RDF document to the `read` method? Is there a way to get RDF from an HTML page? – Vuk Stanković Aug 15 '13 at 13:06
  • @VukBG It would probably depend on what you mean by “embedded.” Do you mean that: the text of some RDF document is included in the page; the page is an RDFa document from which RDF can be extracted; something else entirely? This is why it's important to tell us what the URL is, or what the text of the document that you can't load is. Also, I'm surprised if Jena isn't throwing some sort of exception if it can't read the document. – Joshua Taylor Aug 15 '13 at 13:07
  • It's, for example, a rottentomatoes.com page. It is just an HTML page, but it has RDF data added to it. Example: http://www.rottentomatoes.com/m/the_wolverine_2012/ – Vuk Stanković Aug 15 '13 at 13:09
  • @VukBG Where is the RDF in that? (I'm genuinely asking.) Searching on the page for, e.g., "RDF" doesn't turn anything up. – Joshua Taylor Aug 15 '13 at 13:10
  • It can be seen with Google's structured data testing tool: http://www.google.com/webmasters/tools/richsnippets?q=http%3A%2F%2Fwww.rottentomatoes.com%2Fm%2Fthe_wolverine_2012%2F – Vuk Stanković Aug 15 '13 at 13:11
  • @VukBG For what it's worth, there is a page describing the [Rotten Tomatoes API](http://developer.rottentomatoes.com/docs), which you can use to get JSON data for lots of things. Notice that the Google tools mention “rdfa-node”. Given that, you might have more luck with something like an [RDFa distiller](http://www.w3.org/2012/pyRdfa/). I think you'll need to do some more investigation on how Google is extracting data, though. – Joshua Taylor Aug 15 '13 at 13:13
  • @VukBG Jena does not have built-in support for RDFa; you need to use a third-party library that integrates with Jena if you wish to do this. – RobV Aug 15 '13 at 16:08