Category Archives: Data integration

Ontology Matching Book


Euzenat and Shvaiko’s book is devoted to ontology matching as a solution to the semantic heterogeneity problem faced by computer systems. Ontology matching finds correspondences between semantically related entities of different ontologies. These correspondences may stand for equivalence as well as other relations, such as consequence, subsumption, or disjointness between ontology entities. Many different matching solutions have been proposed so far from various viewpoints, e.g., databases, information systems, and artificial intelligence.

With Ontology Matching, researchers and practitioners will find a reference book that presents currently available work in a uniform framework. In particular, the presented work and techniques can equally be applied to database schema matching, catalog integration, XML schema matching and other related problems. The book presents the state of the art and the latest research results in ontology matching by providing a detailed account of matching techniques and matching systems in a systematic way from theoretical, practical and application perspectives.

Comparing Upper Ontology Definitions of “Information Resource”

Copia has an interesting post about comparing the definitions of “information resource” in various upper ontologies.

Just looking at the semantic differences in how a single, simple concept is defined across all of these upper ontologies is a bit of a reality check.  I have read papers claiming that the data interoperability problem will be solved by simply mapping all of the upper ontologies together and having each domain describe its data in terms of a particular ontology that uses the upper ontology of its choice.  Looking at the complexity and subtlety of just “information resource” across these different conceptualizations makes that approach look pretty silly.

Annotate anything, anywhere

If you are interested in annotations on ontologies or any other sort of resource, check out the W3C’s Annotea project. They are working towards making it possible to load annotations alongside any web resource.  Imagine going to a web page and clicking a toolbar button to reveal annotations made by any number of people across the world.  I assume some other mechanisms would be necessary to make sure the annotations, bookmarks, and marginalia were relevant to you, but you get the idea.

They have built an RDF schema describing the structure of those annotations, and some basic software for creating annotations and storing them on an annotation server.  The Mozilla project hosts Annozilla, an attempt to integrate Annotea into Mozilla and Firefox.
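
To make that concrete, here is a minimal sketch, in Python with rdflib, of what an Annotea-style annotation could look like as RDF.  The class and property names follow my reading of the Annotea annotation namespace, so treat them as approximate; the URLs, author, and dates are made up for illustration.

```python
# Minimal sketch of an Annotea-style annotation as RDF, using rdflib.
# Property names follow my reading of the Annotea schema and should be
# treated as approximate; all URLs, names, and dates are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

ANNO = Namespace("http://www.w3.org/2000/10/annotation-ns#")

g = Graph()
g.bind("a", ANNO)
g.bind("dc", DC)

annotation = URIRef("http://example.org/annotations/42")  # hypothetical ID
g.add((annotation, RDF.type, ANNO.Annotation))
# The page being annotated, and which part of it the note is about.
g.add((annotation, ANNO.annotates, URIRef("http://example.org/history/waterloo.html")))
g.add((annotation, ANNO.context, Literal("second paragraph")))
# The body holds the annotation text itself (often a separate resource).
g.add((annotation, ANNO.body, URIRef("http://example.org/annotations/42/body.html")))
g.add((annotation, DC.creator, Literal("Some Reader")))
g.add((annotation, DC.date, Literal("2005-01-15")))

# An annotation server would store this graph and hand it back to any
# client (e.g. Annozilla) that asks for annotations on that page.
print(g.serialize(format="turtle"))
```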

Each annotation can then become a discussion thread of its own.  Imagine reading a web page on history and noticing that someone else has posted an annotation suggesting a link to some other historical event.  You would then be able to reply to that annotation, ask a question, or dispute its accuracy.

In these kinds of discussions, the context would be extremely localized.  An entire thread might take place about a particular phrase or passage in a document.  Such discussions have always taken place; the difference is that with stored annotations there is explicit traceability back to the source data that provides “context” for the discussion.  Of course you’ll always have issues such as topic drift.  Still, tracking the seed idea or piece of information that started the discourse has its own benefits.

Modeling Context: Semantic Data Integration

I came across another interesting research paper suggested by a colleague:

Context Interchange:  New Features and Formalisms for the Intelligent Integration of Information (Cheng Hian Goh, Stephane Bressan, Stuart Madnick, Michael Siegel)

The basic approach they outline is data integration through an intermediate “context mediator”.  Many semantic data integration approaches require that individual data sources build robust ontologies to describe their semantics, but that still leaves you mapping between the ontologies.  The context interchange approach still requires a domain model ontology, but adds a set of “context axioms” that describe the implicit context of the data, as well as the transformations needed to move the data into other contexts.

There are several strengths in this paper.  The first is that it actually describes context and tries to formalize it for information sharing.  Also, by abstracting up to an information source’s context, their approach spares you from having to map domain models onto one another.  (The function of that mapping is subsumed by the context axioms.)

The main downside of this approach, as far as I can see, is that it may not ultimately save you any mapping work.  If you want to translate data from source A into a different context, the context axioms in A still need to express how that data would be transformed or interpreted in another named context.  Note the slight shift: instead of saying how data would be interpreted under a given domain model, the context axioms say how it would be interpreted under a different named context.  To the extent that you believe there are fewer contexts than systems out there, this reduces the mapping burden.  Surprise surprise, the software in the middle still can’t produce spontaneous semantic magic.  🙂
I’m reading through this material in a quest for reasonable work in the area of modeling context.  This paper is one of the closer things that I’ve found to what I’m looking for, but it doesn’t explicitly address how context is modeled; it just uses a running example that happens to include a model for the example’s limited context.  (The running example in the paper has to do with integrating information from two systems that store information about corporate performance.  If you want to know about all the companies that are profitable, how do you do that when one system reports figures in dollars with a scale factor of 1, and the other in yen with a scale factor of 1,000?)
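
The paper formalizes all of this in its own logic-based framework; purely to illustrate the idea, here is a toy Python sketch of what context axioms could amount to for that running example.  The context names, scale factors, and exchange rate are all invented.

```python
# Toy illustration (not the paper's formalism) of context axioms for the
# running example: two sources report profit under different currencies
# and scale factors, and a mediator lifts both into the receiver's context.
# Context names, scale factors, and the exchange rate are all invented.

CONTEXTS = {
    "c_us":    {"currency": "USD", "scale": 1},
    "c_japan": {"currency": "JPY", "scale": 1000},
}

# Another context axiom, really: how to move between currencies.
FX_TO_USD = {"USD": 1.0, "JPY": 0.009}  # made-up rate

def to_context(value, src_ctx, dst_ctx):
    """Reinterpret a raw value reported in src_ctx as a value in dst_ctx."""
    src, dst = CONTEXTS[src_ctx], CONTEXTS[dst_ctx]
    amount = value * src["scale"]                # undo the source's scale factor
    usd = amount * FX_TO_USD[src["currency"]]    # convert currency via USD
    amount = usd / FX_TO_USD[dst["currency"]]
    return amount / dst["scale"]                 # apply the destination's scale factor

# "Which companies are profitable?" asked in the US context:
reports = [("AcmeCo", 1_500_000, "c_us"), ("KabushikiCo", -200, "c_japan")]
for name, profit, ctx in reports:
    print(name, to_context(profit, ctx, "c_us") > 0)
```

As I read it, the appeal is that these conversions are stated declaratively, per context, rather than being hard-coded into pairwise mappings between every pair of systems.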

Schema Matching: Similarity Flooding (Melnik & Rahm)

A colleague recently gave me a copy of an interesting article:

Similarity Flooding:  A Versatile Graph Matching Algorithm and its Application to Schema Matching (Sergey Melnik, Erhard Rahm)

In a nutshell, it outlines a method of taking two arbitrary schemas or graphs (think SQL DDL, RDF datasets, or XML schemas) and matching them together to simplify data integration.  Their results support the claim that about 50% of the schema matching task (on average) can be automated with no understanding of the semantics of the underlying models.

To sum up their algorithm, they start with an initial set of mappings between the two graphs based on something simple and cheap (e.g. string prefix and suffix matching on node names) and then propagate that similarity through the graph.  The algorithm’s assumption is that “whenever any two elements in models G1 and G2 are found to be similar, the similarity of their adjacent elements increases”.
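
Here is a heavily simplified sketch of that idea in Python.  It is not the paper’s actual fixpoint computation (which builds a pairwise connectivity graph and weights propagation by edge labels); it just shows the flavor: seed similarities with cheap name matching, then let each pair’s score bleed into neighboring pairs until the scores settle.  The toy schemas at the bottom are made up.

```python
# Heavily simplified sketch of similarity flooding: not the paper's exact
# fixpoint formula, just the "similar nodes make their neighbors more
# similar" idea. Graphs are dicts mapping each node to its adjacent nodes.
from itertools import product

def seed_similarity(n1, n2):
    """Cheap initial score, e.g. shared prefixes/suffixes of node names."""
    n1, n2 = n1.lower(), n2.lower()
    score = 0.0
    if n1[:3] == n2[:3]:
        score += 0.5
    if n1[-3:] == n2[-3:]:
        score += 0.5
    return score

def similarity_flood(g1, g2, iterations=10):
    # sigma[(a, x)] is how similar node a in g1 is to node x in g2.
    sigma = {(a, x): seed_similarity(a, x) for a, x in product(g1, g2)}
    for _ in range(iterations):
        nxt = {}
        for (a, x), s in sigma.items():
            # Each pair inherits some similarity from its neighboring pairs.
            incoming = sum(sigma[(b, y)] for b in g1[a] for y in g2[x])
            nxt[(a, x)] = s + incoming
        top = max(nxt.values()) or 1.0  # normalize so scores stay bounded
        sigma = {pair: s / top for pair, s in nxt.items()}
    return sigma

# Two toy "schemas" as adjacency lists (made-up names).
g1 = {"Person": ["PersonName", "PersonAddr"], "PersonName": [], "PersonAddr": []}
g2 = {"Employee": ["EmpName", "EmpAddress"], "EmpName": [], "EmpAddress": []}

for pair, score in sorted(similarity_flood(g1, g2).items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(score, 3))
```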

This is an interesting algorithmic approach to schema matching.  One of the things you see again and again in the data integration space is the use of semi-automated techniques, i.e. approaches that assume from the start that humans will come along behind the computer to fix mistakes, annotate with additional information, and so on.