Category Archives: Data Modeling

“Unpacking” implicit data values

One of the ways that missing data most often gets made explicit is by “unpacking” data values.

Recently, I’ve been working with a system called Trio, a database that manages uncertainty information and lineage right alongside the data itself. The idea is that you can express fuzzier tuples in a relation: instead of saying “Mary saw a Toyota”, you can assign that fact 60% confidence, or even add an alternative – either Mary saw a Toyota, or Mary saw a Honda (but not both).
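As a rough illustration of what such an uncertain tuple might look like, here’s a small Python sketch. This is not Trio’s actual syntax or storage format – just the shape of the idea, with mutually exclusive alternatives that each carry their own confidence:

    # Rough sketch only – not Trio's syntax. One "uncertain tuple" about what
    # Mary saw, with two mutually exclusive alternatives and a confidence each.
    sighting = {
        "witness": "Mary",
        "alternatives": [
            {"saw": "Toyota", "confidence": 0.6},
            {"saw": "Honda",  "confidence": 0.4},  # exclusive with the Toyota alternative
        ],
    }

    # The alternatives of a single tuple shouldn't claim more than 100% confidence.
    assert sum(a["confidence"] for a in sighting["alternatives"]) <= 1.0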

Trio separates the data items in a table into two groups: those that are certain and those that are uncertain. Unsurprisingly, uncertain data items may have a confidence stored along with them. The more interesting case is what gets put into the “certain” category, and why. As far as I can tell, the “certain” table has no formal definition; it’s just the set of data items that don’t need a confidence associated with them. So in fact we’re “packing” several different use cases into the designation of a data item as “certain”:

  • Honest-to-god certain, meaning that it has a confidence rating of 100%
  • Certainty information is unknown, but assumed to be certain
  • Certainty information doesn’t apply

Many people will reasonably point out that it’s OK for there to be nulls, and for them to have some special meaning; it’s just important that their meaning be consistent. When the meaning of nulls can’t be consistent – because the semantics of the domain require that nulls differ in meaning based on context – then you have a missing data problem. The common approach is then to go “unpack” those null values and enumerate their actual meanings, so that they can be stored alongside the data in the future.
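Here’s a minimal sketch of what that unpacking might look like in practice. The field names and enum values are mine, chosen to mirror the three cases above – they’re not anything Trio itself defines:

    # "Unpacking" a bare null confidence into an explicit reason code, so that a
    # NULL no longer has to carry several different meanings at once.
    from enum import Enum

    class ConfidenceStatus(Enum):
        CERTAIN = "certain"                    # known to be 100% confident
        UNKNOWN_ASSUMED_CERTAIN = "unknown"    # no confidence recorded, assumed certain
        NOT_APPLICABLE = "not_applicable"      # confidence doesn't apply to this item

    # Before: what does None mean here?
    packed = {"saw": "Toyota", "confidence": None}

    # After: the meaning of the missing confidence is stored alongside the data.
    unpacked = {
        "saw": "Toyota",
        "confidence": None,
        "confidence_status": ConfidenceStatus.UNKNOWN_ASSUMED_CERTAIN,
    }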

Other background – see also the caBIG “Missing Value Reasons” paper, and the flavors of null discussion.

Schema Matching: Similarity Flooding (Melnik & Rahm)

A colleague recently gave me a copy of an interesting article:

Similarity Flooding:  A Versatile Graph Matching Algorithm and its Application to Schema Matching (Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm)

In a nutshell, it outlines a method of taking two arbitrary schemas or graphs (think SQL DDL, RDF datasets, or XML schemas) and matching them together to simplify data integration.  Their results support the claim that, on average, about 50% of the schema matching task can be automated with no understanding of the semantics of the underlying models.

To sum up their algorithm, they take an initial set of mappings between the two graphs that’s based on something simple and easy (e.g. string prefix and suffix matching on node names) and then propagate that similarity through the network.  The algorithm’s assumption is that “whenever any two elements in models G1 and G2 are found to be similar, the similarity of their adjacent elements increases”.
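To make the propagation idea concrete, here’s a toy Python sketch of the fixpoint iteration at the heart of the approach.  It assumes the pairwise “propagation graph” has already been built from the two input schemas (each map pair lists its neighbour pairs and the propagation coefficients between them); it roughly corresponds to one of the paper’s simpler fixpoint variants, and the crude normalization and variable names are mine, not the paper’s:

    def similarity_flooding(propagation_graph, initial_sim, max_iters=50, eps=1e-4):
        """propagation_graph: {pair: [(neighbour_pair, coefficient), ...]}
           initial_sim:       {pair: similarity seeded by e.g. name matching}"""
        sigma = dict(initial_sim)
        for _ in range(max_iters):
            new_sigma = {}
            for pair, neighbours in propagation_graph.items():
                # Each pair keeps its seed similarity and absorbs similarity flowing
                # in from its neighbours ("similar neighbours make a pair more similar").
                inflow = sum(sigma.get(n, 0.0) * coeff for n, coeff in neighbours)
                new_sigma[pair] = initial_sim.get(pair, 0.0) + inflow
            # Normalize so the best-scoring pair is always 1.0.
            top = max(new_sigma.values()) or 1.0
            new_sigma = {p: s / top for p, s in new_sigma.items()}
            # Stop once the similarities stop changing much.
            if max(abs(new_sigma[p] - sigma.get(p, 0.0)) for p in new_sigma) < eps:
                return new_sigma
            sigma = new_sigma
        return sigma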

This is an interesting algorithmic approach to schema matching.  One of the things you see again and again in the data integration space is the use of semi-automated techniques – approaches that assume from the start that humans will come along behind the computer to fix mistakes, annotate with additional information, and so on.

Reification of RDF Statements: Concrete application of RDF data modeling

Let’s say a user wants to annotate a particular RDF data model with a statement. Let’s call that model “M”. Here’s the user’s annotation statement:

(Subject: Model M) -> (Predicate: was created by) -> (Object: user “OntologyStud”)

How then do we connect this statement to a series of other statements containing metadata about the annotation? It’s not enough to know just this single statement; we need to create linked statements saying which user asserted it, when, and so on.

Looking back at yesterday’s post on RDF data modeling, we would want to create some higher-level “grouper” node. The problem is that the grouper node can’t link to the subject, predicate, or object of this statement, because those resources may have any number of other statements associated with them. Certainly this annotation is not the only statement about “Model M”, so if we were to create a link from a grouper node to “Model M”, we wouldn’t be able to tell which of the statements about “Model M” was the annotation statement.

The solution is to reify. What is reification?

Reification, also called hypostatisation, is treating an abstract concept as if it were a real, concrete thing.

To reify a statement means to take an RDF statement (subject, predicate, object) and treat it as if it were a new resource in its own right. That way, instead of only pointing at things like usernames, particular property names, and tags, a resource can talk about a statement.

This is moving up one meta-level, and it allows RDF to make statements about statements. It’s extremely useful when you want to attach additional information to a statement. For example, you might want to record:

  • Who asserted this?
  • How trustworthy is it?
  • When was this said?
  • What was the context surrounding this statement?

In our example, that’s just what we’re trying to do – take an annotation statement and add information about who said it, and when.

To recap:

  • Take the annotation statement “Model M -> was created by -> OntologyStud”.
  • Reify that statement as a resource; let’s call that resource “Annotation 1”.
  • Create a grouper node called “Annotation Instance 1”
  • Create a new statement linking the annotation instance to the Annotation: “Annotation Instance 1” -> hasAnnotation -> “Annotation 1”
  • Add any additional statements necessary by linking from “Annotation Instance 1”.
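Here’s a minimal sketch of those steps in Python using rdflib. Only rdf:Statement, rdf:subject, rdf:predicate, and rdf:object come from the RDF vocabulary itself; the example.org namespace, the hasAnnotation/annotatedBy/annotatedOn property names, and the date are placeholders invented for illustration:

    # Reify the annotation statement and hang metadata off a grouper node.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/")
    g = Graph()

    # The original annotation statement: Model M -> was created by -> OntologyStud
    g.add((EX.ModelM, EX.wasCreatedBy, EX.OntologyStud))

    # Reify that statement as its own resource, "Annotation 1"
    annotation1 = EX.Annotation1
    g.add((annotation1, RDF.type, RDF.Statement))
    g.add((annotation1, RDF.subject, EX.ModelM))
    g.add((annotation1, RDF.predicate, EX.wasCreatedBy))
    g.add((annotation1, RDF.object, EX.OntologyStud))

    # Create the grouper node "Annotation Instance 1", link it to the reified
    # statement, and add the who/when metadata about the annotation.
    instance1 = EX.AnnotationInstance1
    g.add((instance1, EX.hasAnnotation, annotation1))
    g.add((instance1, EX.annotatedBy, EX.OntologyStud))
    g.add((instance1, EX.annotatedOn, Literal("2008-06-01", datatype=XSD.date)))

    print(g.serialize(format="turtle"))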

Data Modeling and RDF: Redux

Following up on my last post, I thought I’d drop this link – Data Modeling, RDF, and OWL.  This is a good article written for TDAN by David Hay about the field of semantics from the perspective of data modeling.

He makes the good point that a relational data model is a sort of ontology: it does outline the high-level concepts (entities) and the relationships between those entities.  Most people in the semantics world would probably say that a database schema is myopic, in that it’s too focused on whether last names are 22 characters or 30, and not focused enough on the relationships that pull concepts together.

Data Modeling and RDF

I have been reading a little bit recently about data modeling and RDF, and I wanted to post about a general pattern I’ve seen that’s helpful for those who are familiar with relational modeling.

First, take a simple use case of how a web 2.0 site would keep track of users tagging individual pages.  Social bookmarking sites like del.icio.us have to do this all the time.

The RDF data model can basically be thought of as a series of triples, each with a subject, predicate, and object.  So how could we model storing the fact that a particular user labeled a web page as “funny”?  It might go something like this:

[Figure: Individual Tuples and RDF]

This tracks a number of facts – the user in question, the tag that they used, the web page they’re tagging, and when they did it.  With just three triples, we’ve stored four facts.  The problem with this independent set of triples is that there isn’t a way of relating a tag back to a user.  We can tell that the user tagged the page, and that the page is tagged “funny”.  If we want to go back, though, and figure out which tags this user has applied to this page, we’re stuck, because there’s no linkage from the tag content to the user.

The solution is typically to abstract up one level, and to create a data object that facilitates the linkage.  Let’s call that “Tag Instance”, and have a look at round two of modeling the same information:

[Figure: Bucket view of RDF]

Now, essentially “Tag Instance” is a bogus piece of information whose only purpose is to tie together a bunch of links.  Still, modeling the information in this way allows us to create the linkages: you can (via queries) navigate from the user to the specific tag through the “Tag Instance” object.  There are a couple of things about this example that are different from the first one:

  • The names of the predicates all changed – individual data items aren’t being related directly to one another, but only indirectly.
  • There are four triples for four facts, instead of three.
  • This hierarchy is still easily decomposable into triples similar to those seen in the first figure, by using “Tag Instance” as the subject of each triple, the arc in the diagram as the predicate, and the destination node as the object – which is exactly what the sketch below writes out.
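Written out as plain (subject, predicate, object) tuples in Python – with the predicate names, user, URL, and date invented to match the spirit of the diagram rather than copied from it – the “bucket” version looks something like this, and you can see how a query hops through the grouper node:

    # Illustrative only: predicate names, user, URL, and date are made up.
    triples = [
        ("TagInstance1", "hasUser", "user:jsmith"),
        ("TagInstance1", "hasPage", "http://example.com/some-page"),
        ("TagInstance1", "hasTag", "funny"),
        ("TagInstance1", "taggedOn", "2008-06-01"),
    ]

    # "Which tags has this user applied to this page?" – navigate via the grouper node.
    user, page = "user:jsmith", "http://example.com/some-page"
    instances = {s for s, p, o in triples if p == "hasUser" and o == user} & \
                {s for s, p, o in triples if p == "hasPage" and o == page}
    tags = [o for s, p, o in triples if s in instances and p == "hasTag"]
    print(tags)  # ['funny']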

This process basically reinvents the table structure in relational modeling.  “Tag Instance” is acting as a grouping piece of metadata indicating that all of the individual data elements belong to the same logical record.  Using this pattern allows you to take any arbitrary table structure and “RDFize” it.  For example:

[Figure: Comparison of Relational and RDF]

In this example, a record about a person, their phone number, and their sex is related to an RDF graph storing the same information, using the same style of “grouping” node shown earlier.

Now let’s formalize how this transformation takes place as a series of steps.  The starting point is a table, and the desired end point is an RDF graph that can store a row of information from that table.  If we can get that far, we can replicate the process and convert the entire contents of the relational table into RDF data.

  • Step 1:  Create a “grouping” node.  A good name for the grouping node is the row’s unique identifier, or else the name of the table/entity followed by a generated unique identifier.
  • Step 2:  Create a node for each individual data “fact” or observation within the row.  That is, a node for each cell of data.
  • Step 3:  Create links from the “grouping” node to each of the individual “fact” nodes.  Label those links with the column name for the data cell in question.

If you’re wondering about links between tables, those can be accomplished in two ways.  The first is to create predicates that link grouping nodes to other grouping nodes.  The second method is to create a separate grouping node that represents a compound foreign key.  That “foreign key” node can then link to all of the source data and target data.