Capturing real-world knowledge with Protégé OWL

Author: James Tizard

Take a well-resourced, 10-year-old open source project with a global user community. Add support for World Wide Web Consortium (W3C) Semantic Web standards and some hard-core computer science research, and what do you get? In the case of Protégé OWL, the answer is the best application I’ve seen for modeling, capturing, and sharing knowledge about the real world: the kind of fuzzy, variable, and unpredictable “stuff about stuff” that’s hard to squeeze into the neat rectangular tables of a conventional database.

Protégé is a Java GUI application developed and maintained by Stanford Medical Informatics at Stanford University, and supported by a variety of public research agencies. Protégé was created originally for medical researchers, who use it to build complex knowledge bases or ontologies about specific research topics.

Although originally a term used by philosophers to describe the study of existence, an ontology in computer science is a formal, machine-readable description of a body of knowledge about a given topic. In other words, it’s “everything we know about topic X.” That’s pretty broad, and any software application that can handle that sort of thing well is going to be useful to a lot more people than just medical researchers. Not surprisingly, therefore, Protégé is used today in a wide range of fields outside medicine, including ecology, IT architecture, and geographic information systems.

Fortunately, Protégé’s designers seem to have anticipated this wide range of uses, and built into the system a robust and well-documented plugin architecture, which allows developers to create new GUI widgets, application logic, and input/output extensions. Adding a plugin is as simple as copying the relevant Java archive (jar) file into the $PROTEGE_ROOT/plugins directory. The Protégé site lists more than 60 plugins available for download.

Protégé OWL is a plugin for Protégé that transforms the base Protégé application into a comprehensive graphical development environment for the W3C’s new ontology language, OWL. Actually this is a bit like calling GNOME or KDE a plugin for the X Window System, as Protégé OWL is essentially a whole new application built on the Protégé foundation.

OWL

In the tradition of whimsical computing acronyms, OWL stands for Web Ontology Language (why? Try saying “WOL” out loud a few times). As you might expect of a W3C standard, OWL has an XML syntax; and as part of the W3C Semantic Web technology domain, OWL allows ontologies to be shared, mixed, and merged on the World Wide Web.

A lot has been written about the Semantic Web, on which computers, and not just people, will be able to understand and respond to Web documents and other resources. A common criticism, however, is that the Semantic Web remains heavy on the grand vision and light on practical, real-world applications. While the Semantic Web was first described in 1998, most descriptions today still adopt the future tense.

In Protégé OWL, we have available today a rock-solid OWL authoring environment that is ready for daily work. Protégé and Protégé OWL are available for download, and are distributed under the Mozilla Public License. Installation is easy: OS-specific installers are available for Windows, Linux/Unix, and Mac OS X, with or without a bundled Java runtime. Java 1.5 or later is required.

Creating ontologies with Protégé OWL

If you are reasonably at home with a graphical software development environment, an object-oriented (OO) programming language, or a graphical database development environment, then you’ll find much of Protégé OWL familiar, as it has similarities with all three. Building an ontology with Protégé OWL is similar to starting a database project. Once you’ve decided on the information you want to put in (and get out), the next step is to design the formal data structures that will hold your information.

With a relational database, data structures are one or more tables. These are great when describing large numbers of data records with exactly the same structure, such as entries in a financial ledger or employees on a payroll. Things get more complex, however, when you want to describe records that are not all the same “shape,” and that relate to one another in variable ways.

By contrast, OWL describes information in a hierarchical manner that will be familiar to anyone who knows an OO programming language such as Java, Python, or Ruby. An OWL ontology consists of a number of classes, each of which can contain sub-classes that inherit the attributes of their parents. Keeping with the OO theme, every unique data item is an individual or instance of one or more classes, and every instance has one or more properties, which are somewhat similar to member variables in OO languages. Properties can take either scalar values (datatype properties) or references to other instances (object properties).

In Protégé OWL, the root class of every ontology is owl:Thing. Every sub-class is therefore a refinement of a ‘thing’. An ontology about a given domain can thus contain, literally, every thing about that domain.
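
To make the vocabulary concrete, here is a minimal Python sketch of the ideas above: a class hierarchy rooted at owl:Thing, plus an individual carrying one datatype property and one object property. It is only an analogy, not how Protégé OWL stores anything, and the names OwlClass, Individual, Destination, Beach, hasRating, and hasActivity are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OwlClass:
    """A named class with an optional parent; the root plays the role of owl:Thing."""
    name: str
    parent: "OwlClass | None" = None

    def ancestors(self):
        """Return this class name followed by its parents' names, ending at the root."""
        chain, current = [], self
        while current is not None:
            chain.append(current.name)
            current = current.parent
        return chain


@dataclass
class Individual:
    """An instance of a class, carrying both kinds of property values."""
    name: str
    owl_class: OwlClass
    datatype_props: dict = field(default_factory=dict)  # scalar values
    object_props: dict = field(default_factory=dict)    # references to other individuals


# Hypothetical holiday-travel classes; every class is ultimately a kind of owl:Thing.
thing = OwlClass("owl:Thing")
destination = OwlClass("Destination", parent=thing)
beach = OwlClass("Beach", parent=destination)
activity = OwlClass("Activity", parent=thing)

surfing = Individual("Surfing", activity)
bondi = Individual(
    "BondiBeach",
    beach,
    datatype_props={"hasRating": 4},          # datatype property: a plain value
    object_props={"hasActivity": [surfing]},  # object property: another individual
)

print(bondi.owl_class.ancestors())  # ['Beach', 'Destination', 'owl:Thing']
```

Walking up the parent chain shows why a Beach is also a Destination and, ultimately, a thing.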

Despite these similarities, OWL is not itself an OO programming language. In particular, OWL has no concept of object methods or executable code. OWL is concerned with representing information, not manipulating it. Nonetheless, OWL does echo the OO philosophy of making it easier to model real-world ‘non-rectangular’ data.

The Protégé site has some screenshots illustrating the Protégé OWL class hierarchy, properties, and individuals from a sample ontology about holiday travel.

OWL’s OO-like flavour gives you great flexibility in modeling your information, and allows your knowledge base to grow and change organically in response to new data and needs. You can change the class definitions, hierarchy relationships, and property assignments on the fly, with Protégé OWL taking care of the resulting changes to the instance data.

While these features alone are enough to make it a useful tool, Protégé OWL is a lot more than an object-oriented data editor.

Protégé OWL supports OWL’s powerful logic-based features for modeling real-world knowledge. Generally these work by allowing you to declare logical constraints on classes and properties, which an OWL processor such as Protégé OWL can then use to modulate the flow of information into and out of the knowledge base. These features are somewhat analogous to the business logic programmed into a relational database application in the form of scripts or stored procedures. With OWL, however, these features are built into the language and do not require an external scripting language.

Two examples are inverse properties and cardinality constraints. If you implement an ‘is-child-of’ property linking two instances of a ‘Person’ class, then you can also declare an inverse ‘is-parent-of’ property that applies to any target of ‘is-child-of.’ The result is that if you assert that ‘John is a child of Mary,’ then OWL can also deduce that ‘Mary is a parent of John,’ and vice versa. You can also assert that any one ‘Person’ instance can have no more than two ‘is-child-of’ relationships (because no one has three parents), or in other words, that the maximum cardinality of ‘is-child-of’ is two.
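
The two examples above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Protégé OWL’s implementation: the TinyKnowledgeBase class and its methods are hypothetical, and the cardinality rule is treated as a simple validation check. Real OWL semantics are subtler, because OWL does not assume that two differently named individuals are distinct, so a reasoner may conclude that two of the “parents” are the same individual rather than report an error.

```python
class TinyKnowledgeBase:
    """Stores (subject, property, object) facts with inverse and max-cardinality rules."""

    def __init__(self):
        self.facts = set()
        self.inverse_of = {}
        self.max_cardinality = {}

    def declare_inverse(self, prop, inverse):
        """Register two properties as inverses of each other."""
        self.inverse_of[prop] = inverse
        self.inverse_of[inverse] = prop

    def declare_max_cardinality(self, prop, limit):
        """Allow at most `limit` distinct values of `prop` per subject."""
        self.max_cardinality[prop] = limit

    def values(self, subject, prop):
        """Return the set of objects linked to `subject` through `prop`."""
        return {o for (s, p, o) in self.facts if s == subject and p == prop}

    def add(self, subject, prop, obj):
        """Assert a fact, check the cardinality limit, and derive the inverse fact."""
        limit = self.max_cardinality.get(prop)
        existing = self.values(subject, prop)
        if limit is not None and obj not in existing and len(existing) >= limit:
            raise ValueError(f"{subject} already has {limit} '{prop}' values")
        self.facts.add((subject, prop, obj))
        inverse = self.inverse_of.get(prop)
        if inverse is not None:
            # The inverse fact is derived, never typed in by the user.
            self.facts.add((obj, inverse, subject))


kb = TinyKnowledgeBase()
kb.declare_inverse("is-child-of", "is-parent-of")
kb.declare_max_cardinality("is-child-of", 2)

kb.add("John", "is-child-of", "Mary")
print(kb.values("Mary", "is-parent-of"))  # {'John'}

kb.add("John", "is-child-of", "Peter")
try:
    kb.add("John", "is-child-of", "Alice")  # a third parent breaks the limit
except ValueError as err:
    print("rejected:", err)
```

Only ‘John is a child of Mary’ was asserted; ‘Mary is a parent of John’ appears because of the inverse declaration.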

Advanced features — not for the fainthearted

The OWL language specification also defines how to compute more complex logical consequences of an existing OWL ontology to derive new facts and class relationships that are not explicitly stated in the original ontology. To give a simple example, in an ontology about people’s Web browsing habits I could create a logical rule (known as a restriction) to assert that if a person visits Slashdot more than five times a day, then, ipso facto, he or she belongs to the Geek class.
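
As a rough illustration of the Slashdot example, the following Python sketch treats a restriction as a membership test and infers class membership from property values. It is a deliberate simplification: a real OWL reasoner performs logical inference over the whole ontology rather than evaluating a predicate, and the names classify, defined_classes, and slashdotVisitsPerDay are made up for this example.

```python
def classify(individuals, defined_classes):
    """Return a mapping of individual name -> sorted list of inferred class names."""
    inferred = {}
    for name, props in individuals.items():
        inferred[name] = sorted(
            cls for cls, condition in defined_classes.items() if condition(props)
        )
    return inferred


# A "defined class": membership follows from property values, not from an explicit assertion.
defined_classes = {
    "Geek": lambda props: props.get("slashdotVisitsPerDay", 0) > 5,
}

# Nobody below is declared to be a Geek; only their browsing habits are recorded.
individuals = {
    "alice": {"slashdotVisitsPerDay": 12},
    "bob": {"slashdotVisitsPerDay": 1},
}

print(classify(individuals, defined_classes))  # {'alice': ['Geek'], 'bob': []}
```

The point to notice is that membership of Geek is derived rather than entered by hand.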

This is the part of OWL that will be least familiar to most new users (including me), and it’s here that Protégé OWL shows its roots and ongoing application in heavy-duty computing research. Thankfully, there is a long and detailed OWL tutorial that illustrates OWL’s impressive reasoning capabilities. Having worked through the tutorial software and tried a few examples of my own, I can see that there’s a lot of intriguing research going on, and that OWL has huge potential in this direction. But these features are still largely experimental, and for the time being, I’m happy just to watch developments with interest.

Protégé OWL in everyday use

So far I’ve described only some of the ways in which Protégé OWL will reward the persistent user. I haven’t touched on features such as WYSIWYG HTML and forms editing, XML namespace support, version control, import and merging of ontologies, data entry wizards, JDBC back end, and multi-user support. The best way to really appreciate the power of Protégé OWL is to use it for some real work.

Protégé OWL runs on my PowerBook all the time, right next to Mail, iCal, and Firefox. I use it daily in places where in the past I might have looked reluctantly at Microsoft Access or an open source alternative. Protégé OWL manages all of the corporate records and information of the small public-sector telecommunications company that I run. The ontology acts as a conventional records-management system, recording file and document numbers, dates, file notes, and cross-references. I also can, and do, add additional information of all sorts as it comes to hand, including details of people and organisations, event reminders, MD5 hashes of controlled documents, news items for the company Web site, email threads, and financial information (although I do run a separate accounting system). The beauty of Protégé OWL is that I can add, modify, and delete classes, subclasses, properties, and instances as and when I need.

Another attraction for me is the confidence that comes from knowing that my critical data is stored in a thoroughly documented, open, and well-understood standard format. Each Protégé OWL ontology is stored as a single XML document that can be read and understood by any OWL-compliant system. So, if I am abducted by aliens tomorrow, no one will need to try to figure out how to use my ‘custom’ database.

At home, I use Protégé OWL to manage the content for my son’s junior football club Web site. In fact the entire site is just an XSLT transformation of the underlying ontology.

A few problems

As with any large, complex system, Protégé OWL has a few problems to overcome.

  • I haven’t tried Protégé OWL with a very large ontology, but I suspect that it won’t scale the way a relational database would. This probably isn’t surprising, given the complex data structures and relationships that can be represented in OWL. You get the impression that there’s a great deal going on under the hood. Still, in everyday use on a moderately sized ontology, Protégé OWL feels no slower than any other Java GUI application.
  • There is no real Protégé OWL manual as such, although there is a FAQ, some tutorials, and a general Protégé Wiki. Fortunately, the software is reasonably self-explanatory, and as ever, Google is your friend. However, if anyone can tell me what that ‘Search’ panel in the OWL Preferences dialogue does…
  • By default, Protégé OWL stores ontologies in an XML file, using RDF syntax. Unfortunately, the XML is generated automatically by the Jena RDF library used by Protégé OWL, and is extremely difficult to understand. This makes it nearly impossible to write XSLT stylesheets to extract information directly from the ontology.

    This was a real problem for me, as I wanted to use XSLT to create HTML pages representing ontology data in reader-friendly form. The problem is acknowledged in the Protégé OWL FAQ, but no solution is offered. I scratched my itch and wrote a small command-line utility that reads an ontology using the Jena library and then writes out the instance data in a simple, verbose, but XSLT-friendly XML format. I’ve used this method for a year now, through several revisions of both Protégé OWL and Jena, with no problems. Happily, it looks as if the next version of Protégé OWL will allow ontologies to be stored directly in a more accessible XML format.

  • Protégé OWL can generate HTML pages listing the raw ontology data, and provides for a degree of customization. But for more complex reports, your only real alternative is probably XSLT.
  • Protégé OWL currently requires the support of an external ‘reasoner’ program to implement the advanced features of OWL. Unfortunately, the program cited in the FAQ (Racer) has recently become non-free.
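
To give a feel for the XSLT-friendly export mentioned in the list above, here is a small Python sketch of the general idea: flatten instance data into simple, verbose XML with one element per instance and one per property value. It is not the utility described in this article, which reads the ontology through the Jena library. The hand-built list of dictionaries stands in for data already loaded from an ontology, and the element and attribute names (ontology, instance, property, id, class, name) and the sample record are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def export_instances(instances):
    """Build a flat <ontology> tree: one <instance> per individual, one <property> per value."""
    root = ET.Element("ontology")
    for inst in instances:
        node = ET.SubElement(root, "instance", {"id": inst["id"], "class": inst["class"]})
        for prop_name, values in inst["properties"].items():
            for value in values:
                # Repeating the element for each value keeps XPath expressions simple.
                prop = ET.SubElement(node, "property", {"name": prop_name})
                prop.text = str(value)
    return root


# Hypothetical instance data standing in for what an RDF library would load.
instances = [
    {
        "id": "file-001",
        "class": "FileNote",
        "properties": {"title": ["Board minutes"], "refersTo": ["doc-017", "doc-018"]},
    },
]

root = export_instances(instances)
print(ET.tostring(root, encoding="unicode"))
```

With a layout like this, a stylesheet can select values using a short path such as instance/property[@name='title'] instead of untangling generated RDF/XML.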

Conclusion

To get started with Protégé OWL you’ll need some computing skills. But if you’re comfortable setting up and running open source software, and you’re prepared to try out a different way of managing electronic information, then I hope you will get as much value from this impressive piece of software as I do.
