The Resource Description Framework (RDF) provides a structured backdrop to the definition of data that supports SQL-like query semantics. XML provides a weaker backdrop, but neither RDF nor XML is complete without a schema. The complexity of RDF in practice has acted as a brake on the advance of the Semantic Web. RDF offers benefits for the storage and query of data, but its benefit for communicating machine-processable information is debatable. XML has, in contrast, seen rapid adoption.    (6B1)

The premise of the XML Semantic Web is that RDF has not solved the sociotechnical set of problems around communicating information from one machine to another. While RDF solves one specific set of problems, it does not solve the core problems of a Semantic Web. We propose to simplify the document model of RDF down to a set of constraints that make a practical difference to the way XML is used today. We propose that best-practice standards for XML be defined, instead of reinventing XML around triples. This XML-based approach relies on a small number of constraints or conventions on document structure, each of which achieves a specific benefit, and focuses on using XML well rather than abstracting it away.    (6B2)

What is a Semantic Web?    (6G1)

The W3C talks about the vision of the Semantic Web as "extend(ing) principles of the Web from documents to data". The problem with this definition is twofold. The first problem is that documents are data. They are transportable encoded information, so it is double-speak to talk about moving from one to the other. The second problem is that it prematurely implies an implementation that removes documents from the picture, rather than defining a set of requirements. If we move on to the outcomes expected of this transition we can see something closer to a vision we can use:    (6GA)

"    (6GG)

  1. It allows data to be surfaced in the form of real data, so that a program doesn’t have to strip the formatting and pictures and ads off a Web page and guess where the data on it is.    (6GC)
  2. it allows people to write (or generate) files which explain—to a machine—the relationship between different sets of data. For example, one is able to make a “semantic link” between a database with a “zip-code” column and a form with a “zip” field that they actually mean the same – they are the same abstract concept. This allows machines to follow links and hence automatically integrate data from many different sources. "    (6GF)

I think that a Semantic Web is really like the human Web of today. Today, humans can navigate and extract information from the Web. A Semantic Web would allow programs to navigate it and extract information from it as well. Exactly which programs and exactly what information is a question for the global community. Without knowing exactly what will be communicated, we can extract a number of fundamentals based on today's Web:    (6G2)

  1. Standard interactions between components in the architecture are required for the transfer of data to or from identified resources. For example, the HTTP GET operation permits a client to transfer data from a given server URL.    (6G3)
  2. The information must be reasonably expected to be understandable when it arrives. This means that a given kind of information should have a limited number of ways that it can be encoded as data. The recipient of this data who expects this specific kind of information should reasonably be expected to have parser implementations for each way the information might be encoded. In the Web of today we call these "Content Types" or Document Types.    (6G4)
  3. These mechanisms must be able to evolve over long periods of time without gratuitously breaking backwards compatibility. The latest and greatest software on the Web should be able to support interaction with clients or servers written against early versions of interactions or content types, and with most of the forks and standards iterations that come in-between.    (6G5)
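The second requirement above, a limited set of encodings each with a known parser, can be sketched as a small registry keyed by content type. This is only an illustration: the `PARSERS` table, the `register` decorator, and `parse_response` are hypothetical names, and the two registered types are arbitrary examples rather than a recommended set.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict

# Hypothetical registry: one parser per content type the recipient understands.
PARSERS: Dict[str, Callable[[bytes], object]] = {}


def register(media_type: str):
    """Decorator that records a parser function for one media type."""
    def decorator(func):
        PARSERS[media_type] = func
        return func
    return decorator


@register("text/plain")
def parse_text(body: bytes) -> str:
    return body.decode("utf-8")


@register("application/xml")
def parse_xml(body: bytes) -> ET.Element:
    return ET.fromstring(body)


def parse_response(content_type: str, body: bytes):
    """Pick a parser from the Content-Type value of a received document."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    media_type = content_type.split(";", 1)[0].strip().lower()
    try:
        parser = PARSERS[media_type]
    except KeyError:
        raise ValueError(f"no parser registered for {media_type}")
    return parser(body)
```

A recipient that receives a type it has no parser for fails at the document boundary, before it tries to interpret any content.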

Targets    (6B3)

If we were to contrast the RDF and the XML Semantic Web implementations, the main difference would be this:    (6GI)

It is our prediction that for any equivalent pair of documents with identical real-world information schema, it will be easier to write most real-world programs to use the information from an XML document than it will be to write an equivalent program against the equivalent RDF document. RDF attempts to compete both with XML in the communications space, and relational databases in the query and data access space. In the communications space RDF gives the programmer too abstract a model to work with, weakening her hand. In the transition to the query and data access space we need to understand the data in order to scrub it properly and ensure database consistency, so there too it would appear to fail. Finally, in the actual query and data access space it may become a reasonable player. It provides a more flexible data schema than the RDBMS, so efforts such as Mulgara may eventually pay off in this area.    (6GL)

Conventions Summary    (6B7)

XML documents    (6BE)

XML documents are used because:    (6BF)

Other forms of documents are also used. XML will never replace jpeg, for example. However, these other types are generally the leaf nodes of the semantic web. XML is the main format to use for hyper-linked documents.    (6BJ)

The use of meaningful structure in XML documents is important to the adoption of the document type. Document structure should not be an automatic product of a data set, and should not allow arbitrary change in form. This is one of the weaknesses of RDF: It is too complex to transform, because its representation (particularly in XML) is a moving target. Your XML document types should be easy to understand, easy to transform, and easy to copy.    (6BK)

Use xsd built-in data types whenever possible. The structure of these types has already been worked out, so there is no need to reinvent them in your XML document type. Data in these formats can also be placed in text/plain leaf documents where no additional structure is required.    (6CB)
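As a rough illustration of why reusing a built-in type pays off, the sketch below reads `xsd:dateTime` values with nothing but the Python standard library. The helper name `parse_xsd_datetime` is made up for this example, and `datetime.fromisoformat` only covers a common subset of the `xsd:dateTime` lexical space (for instance, a trailing `Z` is rejected before Python 3.11).

```python
from datetime import datetime, timedelta


def parse_xsd_datetime(text: str) -> datetime:
    """Parse a common subset of xsd:dateTime, e.g. 2008-02-09T01:00:00+10:00."""
    return datetime.fromisoformat(text.strip())


start = parse_xsd_datetime("2008-02-09T01:00:00+10:00")
end = parse_xsd_datetime("2008-02-09T02:00:00+10:00")

# Both values carry a UTC offset, so they can be compared and subtracted.
duration = end - start
```

The same function works whether the value came from an XML element or from a text/plain leaf document.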

Hyperlink endpoints (xml:id)    (6BL)

xml:id is a standard mechanism for identifying an element or sub-tree in an XML document. Hyperlinks to an element using xml:id are unambiguous, and tooling works with xml:id.    (6BM)
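A minimal sketch of resolving a hyperlink fragment against xml:id, using Python's standard `xml.etree.ElementTree`. The sample document and the `find_by_xml_id` helper are assumptions made for illustration; the only fixed detail is that the `xml` prefix is always bound to the namespace `http://www.w3.org/XML/1998/namespace`.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_by_xml_id(root: ET.Element, fragment: str) -> Optional[ET.Element]:
    """Return the element whose xml:id equals the fragment, or None."""
    for element in root.iter():
        if element.get(XML_ID) == fragment:
            return element
    return None


# Hypothetical document; the xml:id value is invented for this example.
DOC = """<vcalendar>
  <vevent xml:id="blogging">
    <summary>Blogging time</summary>
  </vevent>
</vcalendar>"""

root = ET.fromstring(DOC)
event = find_by_xml_id(root, "blogging")
```

The lookup needs no knowledge of the document type: any element in any document type can be a link target.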

Subdocuments and Document Hyperlinks    (6BN)

A common approach to hyperlinking and sub-documents should be used across document types:    (6BO)

This general approach is taken from Atom, with prior art in HTML and other existing document types on the Web.    (6BX)

MIME types and XML namespaces    (6BY)

One of the issues with mixing schemas in RDF documents is the need to manage a large collection of namespaces and correctly associate them with elements and attributes. The XMLSemanticWeb approach is based around sub-documents whose type only needs to be identified at document boundaries. MIME types specified with the type attribute take precedence over XML namespaces declared with the xmlns attribute. No XML namespace prefixes should normally be required in the document. Consider the example:    (6CK)

<vcalendar>
  <vevent>
    <dtstart>2008-02-09T01:00:00+10:00</dtstart>
    <dtend>2008-02-09T02:00:00+10:00</dtend>
    <summary>Blogging time</summary>
    <description type="application/xhtml+xml">
      <div xmlns="http://www.w3.org/1999/xhtml">Time to blog, guys</div>
    </description>
  </vevent>
</vcalendar>    (6CU)

The content type of the overall document is already negotiated as part of my receiving it. I can process all the way down to description with the vcalendar parser, and no extra type information is required. An XML namespace declaration is provided for the root of the XHTML sub-document; however, this can be safely ignored. I already know which parser to invoke based on the MIME type on the vevent's description element.    (6CQ)

XML namespaces can convey information about which parser to invoke for a specific sub-document. However, they should be ignored in favour of the "type" attribute if one is supplied. If no type attribute is used, the XML namespace and name of the root element MAY be used to determine how to parse.    (6BZ)
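The precedence rule can be sketched as follows. This is one possible reading, not a reference implementation: `subdocument_type` and the `NAMESPACE_TO_TYPE` table are hypothetical, and treating content with neither a type attribute nor a recognised namespace as `text/plain` is an assumption of this sketch rather than something the conventions state.

```python
import xml.etree.ElementTree as ET

# Hypothetical fallback table, consulted only when no type attribute is present.
NAMESPACE_TO_TYPE = {
    "http://www.w3.org/1999/xhtml": "application/xhtml+xml",
}


def subdocument_type(container: ET.Element) -> str:
    """Decide which parser a sub-document needs, preferring the type attribute."""
    declared = container.get("type")
    if declared:
        return declared  # the type attribute wins over any xmlns declaration
    children = list(container)
    if children and children[0].tag.startswith("{"):
        # ElementTree spells namespaced tags as "{namespace}localname".
        namespace = children[0].tag[1:].split("}", 1)[0]
        return NAMESPACE_TO_TYPE.get(namespace, "text/plain")
    return "text/plain"  # assumption: untyped content is treated as plain text


typed = ET.fromstring(
    '<description type="application/xhtml+xml">'
    '<div xmlns="http://www.w3.org/1999/xhtml">Time to blog, guys</div>'
    "</description>"
)
untyped = ET.fromstring(
    '<description><div xmlns="http://www.w3.org/1999/xhtml">Time to blog, guys</div></description>'
)
```

Here `typed` is resolved from its type attribute alone, while `untyped` falls back to the namespace of its root element.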

The clear separation of document and sub-document is intended to avoid negative social effects associated with mixed-namespace documents in the general case. Mark Nottingham noted the following back in 2006:    (6CV)

"What I found interesting about HTML extensibility was that namespaces weren’t necessary; Netscape added blink, MSFT added marquee, and so forth. I’d put forth that having namespaces in HTML from the start would have had the effect of legitimising and institutionalising the differences between different browsers, instead of (eventually) converging on the same solution, as we (mostly) see today, at least at the element/attribute level."    (6C3)

The philosophy of the document/sub-document separation is also intended to allow the two sets to evolve independently. It should be possible to start replacing an RSS sub-document with an Atom sub-document, without having to modify the document type that contains these feeds.    (6CT)

Must-ignore    (6C4)

Must-ignore semantics are central to the evolution model of document types on the Internet. The main mechanism for change is the addition of new elements and attributes. Any such addition must be made in the knowledge that some consumers of the document type won't understand the extension, and this has to be designed in from the start. If the must-ignore evolution path leads to a dead end, a new media type needs to be selected for a new, incompatible document type. HTTP supports document type negotiation for these rare transitions.    (6C5)
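A must-ignore consumer can be sketched in a few lines: it reads the elements it knows and silently skips everything else. The `KNOWN_FIELDS` list, the `read_event` helper, and the `location` and `priority` extensions in the sample document are all invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Elements this (hypothetical) consumer was written to understand.
KNOWN_FIELDS = ("dtstart", "dtend", "summary")


def read_event(vevent: ET.Element) -> dict:
    """Collect known child elements; unknown elements and attributes are ignored."""
    return {
        child.tag: (child.text or "").strip()
        for child in vevent
        if child.tag in KNOWN_FIELDS
    }


# A later revision of the document type adds "location" and "priority".
NEWER_DOC = """<vevent priority="high">
  <dtstart>2008-02-09T01:00:00+10:00</dtstart>
  <summary>Blogging time</summary>
  <location>Brisbane</location>
</vevent>"""

event = read_event(ET.fromstring(NEWER_DOC))
```

The older consumer keeps working against the newer document, which is what allows extensions to be deployed without coordinating every reader.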

Version numbers should not be used to determine how to parse a document, unless that version number is opaquely contained within the document's MIME type.    (6C6)

Maximise opacity    (6C7)

Keep document types clear and focused, based on existing prior art, and avoid over-specifying. Wherever there is a change in the focus of a document type, consider making the different kind of data a sub-document. Sub-documents are free to evolve and be replaced with hyperlinks over time.    (6C8)

HTML is sometimes criticised for not containing rich semantics, but there is an important lesson to be learned: Complex locally-relevant semantics don't scale up. It is important to keep semantics simple and general in order to produce document types that can be widely understood. Section off any local conventions into parent documents, subdocuments, or hyperlinks.    (6C9)

Comments    (6CD)

Please leave your comments below, noting who you are and when you made your comment. Email can also be sent to benjamin carlyle at soundadvice.id.au    (6CE)