Parsing XHTML+RDFa
July 11th, 2008
This parser currently accepts only XHTML+RDFa, and the input must be well formed. The service URI is http://htmlwg.mn.aptest.com/rdfa/extract_rdfa.pl, and it accepts the following parameters:
- uri - the URI to parse. The parser will fetch the data from this URI
- format - the output format. Values are N3 or xml. The default value is N3.
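As a rough illustration of how the two parameters combine into a request, here is a small Python sketch. The helper name build_extract_url and the example document URI are made up for this illustration. The assumption that the parameters are passed in the query string of a GET request is also mine. The code only builds the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Service URI quoted from the post.
SERVICE_URI = "http://htmlwg.mn.aptest.com/rdfa/extract_rdfa.pl"

# Output formats named in the post; N3 is the stated default.
ALLOWED_FORMATS = ("N3", "xml")


def build_extract_url(uri, fmt="N3"):
    """Build a request URL for the extractor (hypothetical helper).

    Assumes the parameters travel in the query string of a GET request.
    """
    if fmt not in ALLOWED_FORMATS:
        raise ValueError("format must be one of %s" % (ALLOWED_FORMATS,))
    # urlencode percent-encodes the document URI so it survives as a value.
    return SERVICE_URI + "?" + urlencode({"uri": uri, "format": fmt})


if __name__ == "__main__":
    # http://example.org/doc.html is a placeholder document URI.
    print(build_extract_url("http://example.org/doc.html", "xml"))
```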
The parser attempts to retrieve the resource at the given URI. If the resource is an XHTML document, the parser checks that it is well formed. If it is, the parser processes the document and emits the triples.
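The well-formedness gate can be pictured with a short Python sketch. This is not the service's code, only an illustration using the standard library's XML parser. The function name is_well_formed is hypothetical. Note that this simple check does not load a DTD, so a document using named entities such as &nbsp; would be rejected here even if a DTD-aware parser would accept it.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as well-formed XML (hypothetical helper)."""
    try:
        minidom.parseString(text)
    except ExpatError:
        # Mismatched tags, unclosed elements, undefined entities, etc.
        return False
    return True


if __name__ == "__main__":
    good = "<html xmlns='http://www.w3.org/1999/xhtml'><body><p>hi</p></body></html>"
    bad = "<html><body><p>hi</body></html>"  # <p> is never closed
    for sample in (good, bad):
        if is_well_formed(sample):
            print("well formed: would go on to triple extraction")
        else:
            print("not well formed: rejected before extraction")
```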
Once this gets tightened up, my plan is to create a generic Perl module that extracts triples and make it available on CPAN. Architecturally, it is organized as follows:
- The triple generation method relies upon basic DOM interfaces to traverse a tree and generate a map.
- There is a generic rules collection that drives the interpretation of attributes and attribute values - to facilitate different language bindings.
- It should be possible to extract triples from anything that supports the RDFa attributes and from which a DOM tree can be created (HTML4, XHTML 1.1, etc.).
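To make the organization above a little more concrete, here is a deliberately tiny Python sketch of a DOM walk driven by a rules table. It is not the Perl implementation and it is nowhere near the full RDFa processing rules: it does not resolve prefixed names, and it ignores chaining, datatypes, and everything beyond about, property, content, rel, and href. The RULES table, its keys, and the function names are all invented for this illustration. The point is only that swapping the table is what would adapt the walk to a different host language.

```python
from xml.dom import minidom, Node

# Hypothetical rules table: which attribute plays which role.
# A different language binding could supply a different table.
RULES = {
    "subject": "about",
    "literal_predicate": "property",
    "literal_override": "content",
    "resource_predicate": "rel",
    "resource_object": "href",
}


def text_of(node):
    """Concatenate the text of all descendant text nodes."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


def extract_triples(node, subject, rules=RULES, out=None):
    """Walk the tree depth-first, collecting (subject, predicate, object)."""
    if out is None:
        out = []
    if node.nodeType == Node.ELEMENT_NODE:
        # A subject attribute changes the subject for this subtree.
        if node.hasAttribute(rules["subject"]):
            subject = node.getAttribute(rules["subject"])
        # Literal-valued triple: override attribute wins over element text.
        if node.hasAttribute(rules["literal_predicate"]):
            if node.hasAttribute(rules["literal_override"]):
                obj = node.getAttribute(rules["literal_override"])
            else:
                obj = text_of(node)
            out.append((subject, node.getAttribute(rules["literal_predicate"]), obj))
        # Resource-valued triple: needs both the predicate and the target.
        if (node.hasAttribute(rules["resource_predicate"])
                and node.hasAttribute(rules["resource_object"])):
            out.append((subject,
                        node.getAttribute(rules["resource_predicate"]),
                        node.getAttribute(rules["resource_object"])))
    for child in node.childNodes:
        extract_triples(child, subject, rules, out)
    return out


if __name__ == "__main__":
    # Placeholder markup and URIs, made up for the example.
    doc = minidom.parseString(
        '<div xmlns:dc="http://purl.org/dc/elements/1.1/" '
        'about="http://example.org/post">'
        '<span property="dc:title">Hello</span>'
        '<a rel="dc:creator" href="http://example.org/me">me</a>'
        '</div>'
    )
    for triple in extract_triples(doc, "http://example.org/"):
        print(triple)
```

Because the walk only touches generic DOM calls (childNodes, hasAttribute, getAttribute), the same function would work on any tree exposing that interface, which is the property the third bullet relies on.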