Talk:Website Parse Template

If this is XML-based and requires new tags to be added to a page, how can that be compatible with HTML? Andy Dingley (talk) 14:54, 20 May 2008 (UTC)

Website Parse Template helps web crawlers generate RDF triples. The format is unrelated to the pages' HTML code. It is a separate file located in the same directory as the sitemap. Kiranoush (talk) 08:18, 22 May 2008 (UTC)
I see - so it's most useful for database-generated sites, where a great many pages exist and all share a near-identical structure, so that fairly crude XPath statements are enough to make these unreadable pages readable (in a SemWeb sense). Andy Dingley (talk) 13:08, 22 May 2008 (UTC)
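
For anyone following this thread, the general idea (one external template of XPath expressions applied to many same-shaped pages to yield triples) might be sketched roughly as below. This is not the WPT or ICDL syntax, which is not shown on this page; the page fragment, the `ex:` predicate names, the subject URL and the `extract_triples` helper are all hypothetical. It also assumes the page is well-formed XML, which real HTML frequently is not, and uses only the limited XPath subset in Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment. It is deliberately well-formed XML;
# real-world HTML often is not and would need an HTML-aware parser.
PAGE = """<html><body>
<div class="product"><h1>Blue Widget</h1><span class="price">9.99</span></div>
</body></html>"""

# Hypothetical external "template": predicate name -> XPath expression.
# Kept separate from the page, so the page markup itself is unchanged.
TEMPLATE = {
    "ex:name": ".//div[@class='product']/h1",
    "ex:price": ".//div[@class='product']/span[@class='price']",
}


def extract_triples(page_xml, subject, template):
    """Apply each XPath in the template and emit (subject, predicate, object)."""
    root = ET.fromstring(page_xml)
    triples = []
    for predicate, xpath in template.items():
        for node in root.findall(xpath):
            triples.append((subject, predicate, (node.text or "").strip()))
    return triples


triples = extract_triples(PAGE, "http://example.org/item/1", TEMPLATE)
for triple in triples:
    print(triple)
```

The same template could be reused for every page that shares this layout, which is where the approach would pay off; any change to the page structure silently breaks the XPath expressions.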

Needs examples and definitions

This article, to be readable, needs corresponding example fragments of HTML input and certainly some output from WPT, as produced by the example templates.

It should also link to the DTD or Schema that defines ICDL.

I'm also confused as to the distinction between icdl (the screen-scraper parser definition) and icdl (the ontological description language). I'm not even seeing any namespaces here, which worries me. I hope these have a clear formal definition somewhere.

Totally subjective POV comment, with no place near a Wikipedia article page

I can't say I'm impressed by this protocol! It seems to be taking the wrong approach to a SemWeb solution. Rather than the fairly well-described techniques for embedding accessible metadata into a resource, such as RDFa, it's gaffer-taping on an external hack. This is what GRDDL already does, except that GRDDL uses a rather simpler approach built out of existing tools (albeit the old "With XSLT we can transform anything to anything" canard). It's generally accepted that embedding metadata into a resource is preferable to building extractors (extractors are complex to build and brittle in service), and this WPT approach seems to combine complexity, limited function (XPath is far from sufficient), a requirement to be the site's operator and a whole new complicated language.

I don't understand the use of ontology here either. The input to WPT is entirely non-ontological, dumb-scraping through XPath (as it has to be). The output is neither a common-denominator format such as Dublin Core, nor (unless it uses OWL) is it described ontologically in a communicable format. Proprietary ontology descriptions are unworkable, almost by definition.

Then there's the question of what "XPath" means when applied to an HTML DOM, rather than an XML DOM.
Andy Dingley (talk) 13:08, 22 May 2008 (UTC)