I have been reading a little bit recently about data modeling and RDF, and I wanted to post about a general pattern I’ve seen that’s helpful for those who are familiar with relational modeling.
First, take a simple use case: how a Web 2.0 site keeps track of users tagging individual pages. Social bookmarking sites like del.icio.us have to do this all the time.
The RDF data model can basically be thought of as a series of triples, each with a subject, a predicate, and an object. So how could we model the fact that a particular user labeled a web page as “funny”? It might go something like this:
This tracks four facts: the user in question, the tag they used, the web page they tagged, and when they did it. With just three triples, we stored four facts. The problem with this independent set of triples is that there is no way to relate a tag back to a user. We can tell that the user tagged the page, and that the page is tagged “funny”. But if we want to go back and figure out which tags this user has applied to this page, we’re stuck, because nothing links the tag content to the user. As soon as a second user tags the same page, we can no longer tell whose tag is whose.
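To make the problem concrete, here is a minimal sketch that stores triples as plain Python tuples. The identifiers and predicate names (`tagged`, `taggedAs`, `taggedOn`) are assumptions made up for illustration and are not taken from the figure. Once a second user tags the same page, the best query we can write returns both users' tags.

```python
# Each triple is (subject, predicate, object). All names are hypothetical.
triples = {
    ("user:alice", "tagged", "page:example"),
    ("page:example", "taggedAs", "funny"),
    ("user:alice", "taggedOn", "2008-03-01"),
    # A second user tags the same page with a different tag.
    ("user:bob", "tagged", "page:example"),
    ("page:example", "taggedAs", "boring"),
    ("user:bob", "taggedOn", "2008-03-02"),
}


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is stored."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def tags_possibly_applied_by(user):
    """Best effort only: every tag on every page the user has tagged."""
    found = set()
    for page in objects(user, "tagged"):
        found |= objects(page, "taggedAs")
    return found


# Alice only said "funny", but the graph cannot tell us that.
ambiguous = tags_possibly_applied_by("user:alice")
```

Here `ambiguous` contains both “funny” and “boring”, because the tag hangs off the page and not off anything that belongs to the user.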
The solution is typically to abstract up one level, and to create a data object that facilitates the linkage. Let’s call that “Tag Instance”, and have a look at round two of modeling the same information:
Now, essentially “Tag Instance” is a bogus piece of information whose only purpose is to tie together a bunch of links. Still, modeling the information this way is what lets us create the linkages: you can (via queries) navigate from the user to the specific tag through the “Tag Instance” object. A few things about this example are different from the first one:
- The names of the predicates all changed – individual data items aren’t being related directly to one another, but only indirectly.
- There are four triples for four facts, instead of three triples.
- This hierarchy is still easily decomposable into triples similar to those seen in the first figure, by using “Tag Instance” as the subject of the triple, the arc in the diagram as the predicate, and the destination node as the object.
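A minimal sketch of the “Tag Instance” model, again using plain Python tuples as triples. The node identifiers and predicate names (`taggedBy`, `taggedPage`, `tagContent`, `taggedOn`) are hypothetical stand-ins for whatever labels the arcs in the diagram carry.

```python
# Every triple has a "Tag Instance" node as its subject. Names are hypothetical.
triples = {
    ("taginstance:1", "taggedBy", "user:alice"),
    ("taginstance:1", "taggedPage", "page:example"),
    ("taginstance:1", "tagContent", "funny"),
    ("taginstance:1", "taggedOn", "2008-03-01"),
    ("taginstance:2", "taggedBy", "user:bob"),
    ("taginstance:2", "taggedPage", "page:example"),
    ("taginstance:2", "tagContent", "boring"),
    ("taginstance:2", "taggedOn", "2008-03-02"),
}


def subjects(predicate, obj):
    """Return every subject s such that (s, predicate, obj) is stored."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def tags_applied_by(user, page):
    """Navigate user -> Tag Instance -> tag content for one page."""
    instances = subjects("taggedBy", user) & subjects("taggedPage", page)
    return {o for s, p, o in triples if s in instances and p == "tagContent"}


alice_tags = tags_applied_by("user:alice", "page:example")
```

Because every fact about one tagging event shares the same subject, the query only returns the tags that this user applied to this page.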
This process basically reinvents the table structure in relational modeling. “Tag Instance” is acting as a grouping piece of metadata indicating that all of the individual data elements belong to the same logical record. Using this pattern allows you to take any arbitrary table structure and “RDFize” it. For example:
In this example, a table row recording a person, their phone number, and their sex is mapped onto an RDF graph storing the same information, using the same style of “grouping” node shown earlier.
Now let’s formalize how this transformation takes place as a series of steps. The starting point is a table, and the desired end point is an RDF graph that can store a row of information from that table. If we can get that far, we can replicate the process and convert the entire contents of the relational table into RDF data.
- Step 1: Create a “grouping” node. A good name for the grouping node will be the unique identifier for the row, or the name of the table/entity, followed by a generated unique identifier.
- Step 2: Create a node for each individual data “fact” or observation within the row. That is, a node for each cell of data.
- Step 3: Create links from the “grouping” node to each of the individual “fact” nodes. Label those links with the column name for the data cell in question.
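The three steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not a definitive implementation: rows are plain dicts, triples are plain tuples, and the function names, the `person` table, and the sample values are all made up for the example.

```python
def row_to_triples(table_name, row_id, row):
    """Convert one table row (a dict of column -> value) into triples."""
    # Step 1: create the grouping node, named after the table plus a row id.
    group = f"{table_name}:{row_id}"
    # Step 2: each cell value becomes a "fact" node (the triple's object).
    # Step 3: link the grouping node to it, labeled with the column name.
    return [(group, column, value) for column, value in row.items()]


def table_to_triples(table_name, rows):
    """Repeat the row conversion for every row, generating ids 1, 2, 3, ..."""
    triples = []
    for row_id, row in enumerate(rows, start=1):
        triples.extend(row_to_triples(table_name, row_id, row))
    return triples


# Hypothetical sample data for a person table.
people = [
    {"name": "Jane Doe", "phone": "555-0100", "sex": "F"},
    {"name": "John Roe", "phone": "555-0101", "sex": "M"},
]
person_triples = table_to_triples("person", people)
```

Each row yields one triple per column, so a table with R rows and C columns produces R × C triples that all hang off R grouping nodes.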
If you’re wondering about links between tables, those can be accomplished in two ways. The first is to create predicates that link grouping nodes to other grouping nodes. The second method is to create a separate grouping node that represents a compound foreign key. That “foreign key” node can then link to all of the source data and target data.
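Here is a rough sketch of both linking styles, using plain Python tuples as triples. The table names, predicate names (`customer`, `source`, `target`), and key columns are assumptions invented for illustration, and the shape of the “foreign key” node is only one possible reading of the second method.

```python
# Method 1: a predicate links one grouping node directly to another.
direct_link = ("order:7", "customer", "customer:3")


def foreign_key_node(fk_id, source, target, key_values):
    """Method 2: build a separate grouping node for a compound foreign key.

    The node links to the source row, the target row, and each key column.
    """
    node = f"fk:{fk_id}"
    triples = [(node, "source", source), (node, "target", target)]
    triples += [(node, column, value) for column, value in key_values.items()]
    return triples


# Hypothetical compound key made of two columns.
fk_triples = foreign_key_node(
    1,
    source="orderline:12",
    target="product:eu-42",
    key_values={"region": "eu", "product_no": 42},
)
```

The first style is the simpler of the two. The second keeps the individual key columns visible in the graph, at the cost of one extra node per link.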