The Semantic Web (SW), also known as Web 3.0, promises to transform the World Wide Web from a large, indexed collection of hyperlinked documents into a vast knowledge base of documents, maps, and other media forms that can be linked, remixed, and mashed up by intelligent agents for a variety of purposes. Your mission is to create a prototypical content development system — by modding a blog or wiki, for example — that would allow for the mass creation of mashable semantic content. Such a tool, if simple enough to use, could make a viral contribution to the emergence of the SW.
The key feature of the SW (and the reason it’s called semantic in the first place) is the addition of a layer of markup to the web’s documents that describes, in a way both machines and humans can read, the “meaning” of what those documents contain. For example, instead of simply italicizing and capitalizing a string of text in a document to signify that it is a book title, one would wrap the string in a title tag, like so: “
<title>Pride and Prejudice</title>.” In addition, these tags would be part of a larger framework, or ontology, that would give machines the context they need to further disambiguate the meanings of the strings in documents.
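The contrast can be sketched in markup. In the first fragment a machine sees only styling; in the second, it sees a book title. (The “bib” namespace here is hypothetical, standing in for whatever shared ontology the tags would belong to.)

```html
<!-- Presentational markup: a machine sees only italics -->
<p><i>Pride and Prejudice</i> was published in 1813.</p>

<!-- Semantic markup (the "bib" namespace is hypothetical):
     a machine can now identify the string as a book title -->
<p><bib:title>Pride and Prejudice</bib:title> was published in 1813.</p>
```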
With a critical mass of properly marked up documents, it is easy to write programs that can combine documents to produce other documents or media forms that summarize or visualize the content of the source documents. In the language of relational databases, SW markup allows programmers to develop agents that can automatically perform join-like queries in a document collection, producing interesting and useful reports. (Object-oriented database developers may recognize the concept of traversal here.) The key is a shared set of tags and identifiers among the documents.
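As a sketch of the idea, consider two documents that share an identifier (the identifier scheme below is invented for illustration). An agent that knows the convention can join them the way a database joins tables on a key:

```html
<!-- Document A: a review page -->
<p>A fine novel:
  <span class="book" id="book-pride-and-prejudice">Pride and Prejudice</span>
</p>

<!-- Document B: a reading list that reuses the same identifier -->
<ul>
  <li><a href="/reviews#book-pride-and-prejudice">Pride and Prejudice</a></li>
</ul>
```

An agent crawling both pages can match on the shared key and emit a combined report, such as every review of each book on the reading list.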
The problem, of course, is how to generate a critical mass of marked-up documents. Traditional SW proponents envision a new regime of markup in which all documents are written according to a newer, more complex dialect of XML that would replace (x)HTML. This is known as the “bottom-up” approach to the SW. We know that’s never going to happen. If anything, Web 2.0 technologies have taught us that critical masses of content, in which network effects become possible, require very low thresholds of participation. The reason blogs and wikis are so incredibly successful is that they are incredibly easy to use.
Another approach to the SW is the “top-down” approach. In this view, machines do most of the work of interpreting what humans already know how to read. In effect, this view puts its faith in artificial intelligence, on the model of Google, being able to tell what an italicized, capitalized string of text in a document is. (Projects like CiteSeer have shown that this is possible with traditional citation styles in academic documents.) The problem with this approach is that, beyond some simple cases like text references, it is very hard to write programs that can read text the way humans do. We will get there at some point (probably sooner than we imagine), but we aren’t there now.
In between these views is a method that adopts the wisdom of Web 2.0 applications: the use of microformats. Instead of trying to reinvent either the way content producers write or the way search engines work, the microformats method piggybacks on existing markup and search practices and adds the 20% of effort that may make the 80% difference (or maybe it’s 1 to 99). For example, RDFa is a standard that lets authors add attributes carrying semantic content to existing XHTML documents. Other standards for adding semantic content to documents are already in use: hCalendar, FOAF, XFN, hCard, etc.
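Two brief sketches of what this looks like in practice. The first is an hCard, which marks up contact information using nothing but class attributes on ordinary XHTML; the second is an RDFa fragment attaching Dublin Core properties to spans of text. (The particular names and values are invented for illustration.)

```html
<!-- hCard: "vcard", "fn", and "role" are class names
     defined by the microformat -->
<div class="vcard">
  <span class="fn">Jane Austen</span>,
  <span class="role">novelist</span>
</div>

<!-- RDFa using the Dublin Core vocabulary -->
<p xmlns:dc="http://purl.org/dc/elements/1.1/">
  <span property="dc:title">Pride and Prejudice</span> by
  <span property="dc:creator">Jane Austen</span>
</p>
```

To a browser these render as plain text; to a parser that knows the conventions, they yield structured contact and bibliographic records.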
But these formats are still not easy enough to use. Ideally, one would have a user interface that makes it easy to add these attributes to elements as one writes them, something as easy or easier than adding a trackback or a set of tags to a blog post. Imagine being able to block off an arbitrary segment of text, have a dialog box appear asking what kind of content this is, and then adding a few simple attributes, perhaps with some AJAX-fed data fields to smooth the process. Imagine also being able to add any number of microformats to a document, and getting them from a public repository.
Efforts to produce such a tool would not be in vain. We know that there is already a productive mutual calibration between content providers and search engines. For example, there is a whole subindustry in SEO (search engine optimization) in which content providers have demonstrated their willingness to adapt content to the ways of search engines such as Google. And it works the other way, too: all the buzz about Web 3.0 has led companies like Yahoo! to incorporate semantic web principles into their search engines, giving rise to a new field of semantic search optimization. The industry seems ripe for a spark to catalyze the various developments in the field of the semantic web; see Calais, Twine, and DBpedia for some examples.
The content development tool suggested by this proposal could be that spark. Such a spark could ignite a dynamic in which search engines begin to seek out, privilege, and select for documents with semantic content. The pressure will then be on for content providers — and I mean anyone who produces web content, not just big companies — to shape up their content for the search engines, and the forces of distribution will pull production along in their considerable draft.
A Practical Note
As for which content development system to use, three come to mind as both standard and extensible: MediaWiki, WordPress, and Drupal.