History of an Idea: Missing Data

Entries from August 2007

Trends in usage – data/information that fades from common usage

August 31, 2007

A very broad category of data that we’ve talked about before is everything that simply fades from common usage because of changes in convention.  This is certainly an obvious reason why something might not be there, but it is in some ways the reverse of information that is so commonly used that it is no longer explicitly noted, i.e., evolving standards.

In a subsequent post I’ll talk about specific examples, such as evolving knowledge and frames of reference in science and the shift in standards for research.

Categories: General

Missing data and Tacit Knowledge

August 31, 2007

Searching on “Tacit” along with data terms provides a wealth of links to discussions that fit in this domain.  However, tacit knowledge has a meaning that is a bit off from what we are going after.

Categories: General

Unified Data Feed on Web2Express

August 29, 2007

AJ Chen has just released a new service that allows users to create data feeds that can be published in RSS 2.0, RDF, and ATOM depending on the user’s preference.

His announcement email to the W3C semantic web list:

“One of the big challenges facing semantic web is to encourage people to put out their data in semantic format on the web. I have been looking into practical areas where semantic web can make real difference.  Datafeed is such an area, I believe. Yesterday, I just released a free new online tool – unified data feed on web2express.org. I hope people will find it useful for creating data feeds for products, news, events, jobs and studies. Besides the feeds, all of the data are also openly available as RDF.”

Worth checking out.  Aside from repackaging existing feeds in a more flexible format, he has a number of feeds for various products and other entities.  The secret sauce of course is whether or not you can get people to agree on using a particular set of schemas for representing, say, product price.  From what I can tell, he’s coming up with his own microformats for that.

Categories: General

TANSTAAFL: The No-Free-Lunch Theorem

August 29, 2007

I came across this interesting tidbit while reading one of Numenta’s papers on their HTM approach.

The No-Free-Lunch Theorem:   “no learning algorithm has an inherent advantage over another learning algorithm for all classes of problems.  What matters is the set of assumptions an algorithm exploits in order to learn the world it is trying to model.”

 Ho, Y. C. & Pepyne, D. L. (2002), “Simple Explanation of the No-Free-Lunch Theorem and Its Implications”, Journal of Optimization Theory and Applications V115(3), 549-570.

Or, as Robert Heinlein put it a while back, TANSTAAFL.  There Ain’t No Such Thing As A Free Lunch.

Categories: Context · Links

Culturally Embedded Computing Group (i.e. cultural assumptions)

August 28, 2007

The site below offers a look into research about the culturally derived assumptions we make when designing systems.
http://cemcom.infosci.cornell.edu/home.php
—————-

Cornell University Faculty of Information Science and Department of Science & Technology Studies

We analyze, design, build, and evaluate computing devices as they relate to their cultural context. We analyze the ways in which technologies reflect and perpetuate unconscious cultural assumptions, and design, build, and test new computing devices that reflect alternative possibilities for technology. We have a focus on reflective design, or design practices that help both users and designers reflect on their experiences and the role technology plays in those experiences.

Our primary focus is the environment; we are exploring the role IT could play to reconfigure our cultural relationship to the environment. We have worked extensively on affective computing, to develop approaches in which the full complexity of human emotions and relationships as experienced by users is central to design (rather than the extent to which computers can understand and process those emotions).

We draw from, contribute to, and mutually inform the technical fields of human computer interaction and artificial intelligence and the humanist/sociological fields of cultural and science & technology studies.

Categories: General

Jena Application Architecture Issues: Querying across models with SPARQL.

August 27, 2007

Continuing to work on my Jena-based application, I’ve run into a snag that has to do with how to architect the application for the best performance. They say that when programming “premature optimization is the root of all evil”. Generally, it’s true but it’s no excuse to pick an architecture that hamstrings your application from the start.

My application stores many different RDF data models. Each model may have a separate associated “annotation model” that contains a set of additional triples referring back to the original model.

The question is how should I best implement this model soup such that users can search for specific triples across all models? Let’s say you’re searching for a tag “foo” that has been associated with a particular model or resource. In my version of RDF, that translates quite easily to the triple:

<http://a.com/some-resource> <http://b.com/#tagged> “foo”.

So all you have to do is search all available models for statements matching:
(?) b:tagged “foo”

Easy enough. But how do you implement such a search? By default, Jena stores different models in different relational tables in the underlying database. There is no single unified relational table (that I can find) where you can search for “foo”. There is no implicit unified model that contains all of the triples of all available models. From what I can tell, my options are:

(1) Create an uber-model that is basically a union of all of the models I have, and search only that model.
(2) Issue the “foo” search n times, where n is the number of models I have; once for each model.
(3) Don’t separate the models in the first place – only ever have one uber-model, and create logical separations in the model by adding additional triples. In other words, smash all models together into one giant model from the start, and create the illusion that they’re separate with additional metadata that allows the application to figure out which triples belong to which “sub-model”.

Option (1) is lousy, because you end up storing every triple twice; once for the model it’s in, and once for the uber-model. Option (2) is good, except that query performance sucks. Additionally, you’ll be wasting time searching many models that won’t have any hits at all. Option (3) is horrendous – complicated to implement, unwieldy, and requiring a lot of code to manage different things that Jena would normally do for me.

SPARQL explicitly supports option (2) with named graphs, and it currently seems like the best option. Jena also provides a way of indexing text with Apache’s Lucene, which doubtless will improve performance, but doesn’t change the architectural problem of having to search n different models for a single query.
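To make the trade-off concrete, here is a minimal sketch in plain Python, not Jena code. The `ModelStore` class, its method names, and the model names are all made up for illustration. It shows option (2) run naively against every model, and the same search helped by a small side index that records which models could contain a hit, so that models with no matches are never scanned.

```python
from collections import defaultdict

# A triple is (subject, predicate, object); every name below is illustrative.
TAGGED = "http://b.com/#tagged"


class ModelStore:
    """Toy stand-in for a set of separately stored models (not the Jena API)."""

    def __init__(self):
        self.models = defaultdict(set)       # model name -> set of triples
        self.by_pred_obj = defaultdict(set)  # (predicate, object) -> model names

    def add(self, model, s, p, o):
        self.models[model].add((s, p, o))
        self.by_pred_obj[(p, o)].add(model)

    def search_all(self, p, o):
        """Option (2): run the same pattern once per model."""
        hits = []
        for name, triples in self.models.items():
            hits += [(name, s) for (s, p2, o2) in triples if (p2, o2) == (p, o)]
        return sorted(hits)

    def search_indexed(self, p, o):
        """Option (2) again, but only visit models the index says can match."""
        hits = []
        for name in self.by_pred_obj.get((p, o), ()):
            triples = self.models[name]
            hits += [(name, s) for (s, p2, o2) in triples if (p2, o2) == (p, o)]
        return sorted(hits)


store = ModelStore()
store.add("pizza", "http://a.com/some-resource", TAGGED, "foo")
store.add("pizza-annotations", "http://a.com/other", TAGGED, "bar")
store.add("wine", "http://a.com/merlot", TAGGED, "foo")

print(store.search_all(TAGGED, "foo"))
print(store.search_indexed(TAGGED, "foo"))
```

Both calls return the same hits. The index only addresses the wasted searches; it stores one extra entry per distinct predicate/object pair per model rather than a second copy of every triple, which is the cost that made option (1) unattractive.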

I have been getting good support on the jena-dev mailing list, but I have yet to check out Andy’s SDB, which promises to have support for union queries, a potential extra option.

Categories: General

ebXML’s Context Table

August 27, 2007

ebXML’s context table outlines the different classes of context that the ebXML model recognizes.  “Context” in this case is business context – the set of circumstances around the transmission of business process related information.

ebXML Context

The figure shows what a context representation in ebXML might look like, and where it fits in.

The things that are in this context model are obviously skewed towards electronic interchange of business information (e.g. a high-level context type of “Geopolitical/Regulatory”) but they do have a number of types that could be reinterpreted in much broader ways, for example “Partner Role”, “Info Structural Context”, and everybody’s favorite – “Temporal”.

This is interesting because it’s a practical implementation of context that bites the bullet and categorizes all possible contexts as a fixed number of types.

Categories: Business · Context · Links

Notes on Questions about missing data

August 23, 2007

  • Which data is missing?
    • Usually, this should be smack-you-in-the-face obvious.  Something that is supposed to be there isn’t there.
    • Knowing that something is missing presumes that you knew it was supposed to be there.
    • “Lies of omission, and incomplete truths” – sometimes information that is missing isn’t just a blank, it’s the absence of precision or detail.
    • Context determines what should be there.  The producer’s context and the consumer’s context may not match, and the fact that something is missing is evidence for this context mismatch.
  • Where is it missing from?
    • What is the scope? Are we dealing with web pages, structured databases, PDF files, or the entire universe of data?
    • Do particular forms of data (structured, unstructured) have particular characteristics that lend themselves to analysis of missing information questions?
    • Should the issue of things that are missing be narrowed to data (to the detriment of “information”)?
  • Why is the data missing?
    • An exhaustive taxonomy of missing value reasons is likely impossible, if you accept that the number of contexts is unlimited.
    • Still, a taxonomy may be able to generalize reasons into buckets and cover vast swaths of the “reason space” explaining why something is missing.
    • What level of analysis is most important?  Is it that an individual value in an individual observation is missing?  Is it that all values for a particular field are missing?  How about global absence of data (i.e. it’s not an individual data field, it’s the whole data asset)?
  • Why do we care that the data is missing?
    • What valuable contextual information would the missing data have provided?
    • Are we interested in drawing an in-model conclusion (e.g. what the value should be, or how that missingness impacts other values)?
    • Are we interested in drawing an out-model conclusion (e.g. where the conclusion’s impact is completely outside the data set where something was missing)?
  • Given the above questions, what conclusions can we draw?

Categories: Notes · Outlines

k-anonymity and l-diversity

August 23, 2007

l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer

This is a great article on how to improve privacy for individuals in datasets that are disseminated. Quick summary:

The paper talks about “quasi-identifiers” – combinations of attributes within the data that can be used to identify individuals. For example, the statistic given is that 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code. Given that three-attribute “quasi-identifier”, a dataset that has only one record with any given combination of those fields is clearly not anonymous – most likely it identifies someone. Datasets are “k-anonymous” when for any given quasi-identifier, a record is indistinguishable from k-1 others.

The next concept is “l-diversity”. Say you have a group of k different records that all share a particular quasi-identifier. That’s good, in that an attacker cannot identify the individual based on the quasi-identifier. But what if the value they’re interested in (e.g. the individual’s medical diagnosis) is the same for every record in the group? There are 7 different records in a group, and I don’t know which one of them is Bob Smith, but since I know that all of them are flagged with a diagnosis of cancer, the data has “leaked” that Bob Smith has cancer. (Figuring this out is unsurprisingly called a “homogeneity attack”.) The diversity of target values within a group is what “l-diversity” refers to.

The paper outlines the mathematical underpinnings of what l-diversity is, and shows that it is practical and can be implemented efficiently.
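As a rough illustration of the two ideas, the sketch below measures both properties on a toy table. It assumes the simplest possible reading of l-diversity (counting distinct sensitive values per group); the paper’s own definitions are more refined. The field names and records are invented for the example.

```python
from collections import defaultdict


def group_by_quasi_identifier(records, quasi_identifier):
    """Bucket records by their combination of quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[attr] for attr in quasi_identifier)
        groups[key].append(record)
    return groups


def k_anonymity(records, quasi_identifier):
    """Size of the smallest group sharing a quasi-identifier value."""
    groups = group_by_quasi_identifier(records, quasi_identifier)
    return min(len(group) for group in groups.values())


def distinct_l_diversity(records, quasi_identifier, sensitive):
    """Fewest distinct sensitive values found in any one group."""
    groups = group_by_quasi_identifier(records, quasi_identifier)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


# Made-up records, already generalized so that groups of three share a key.
records = [
    {"zip": "130**", "age": "30-40", "diagnosis": "cancer"},
    {"zip": "130**", "age": "30-40", "diagnosis": "cancer"},
    {"zip": "130**", "age": "30-40", "diagnosis": "cancer"},
    {"zip": "148**", "age": "20-30", "diagnosis": "flu"},
    {"zip": "148**", "age": "20-30", "diagnosis": "cancer"},
    {"zip": "148**", "age": "20-30", "diagnosis": "heart disease"},
]
qi = ("zip", "age")
k = k_anonymity(records, qi)
l = distinct_l_diversity(records, qi, "diagnosis")
print(k, l)
```

The table is 3-anonymous, yet its diversity is only 1: everyone in the first group has the same diagnosis, which is exactly the homogeneity attack described above.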


Improving both k-anonymity and l-diversity requires fuzzing the data a little bit.  Broadly, there are three ways you can do this:

  • You can generalize the data to make it less specific.  (E.g. the age “34” becomes “30-40”, or a diagnosis of “Chronic Cough” becomes “Respiratory Disorder”.)
  • You can suppress the data.  Simply delete it.  (Which leads us into our host of “missing data” questions)
  • You can perturb the data.  The actual value can be replaced with a random value drawn from the distribution of values for that field.  In this way, the overall distribution of values for that field will remain the same, but the individual data values will be wrong.
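The three techniques can be sketched in a few lines. This is an illustration under assumptions, not the paper’s method: the helper names are invented, and perturbation is done here by shuffling a column across records, which is one simple way to keep the overall distribution identical while making individual rows wrong.

```python
import random


def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-40'."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def suppress(records, field):
    """Delete a field outright, leaving a missing value (None)."""
    return [{**r, field: None} for r in records]


def perturb(records, field, rng):
    """Shuffle one column across records: same distribution, wrong rows."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


# Made-up records for the example.
records = [
    {"age": 34, "diagnosis": "Chronic Cough"},
    {"age": 37, "diagnosis": "Asthma"},
    {"age": 52, "diagnosis": "Diabetes"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]
suppressed = suppress(records, "diagnosis")
perturbed = perturb(records, "diagnosis", random.Random(0))
print(generalized[0], suppressed[0], perturbed[0])
```

Note that suppression is the only one of the three that produces missing data in the literal sense; the other two produce data that is present but deliberately less precise or less accurate.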

Categories: General · Links

Heilmeier’s Catechism: How to make your presentation matter

August 23, 2007 · 1 Comment

Found this on the web – Heilmeier’s Catechism.  The key questions you need to answer in order to pitch a good idea.  It doesn’t matter whether the idea is a research proposal, a new business, or a plan to colonize Mars.

I believe this was originally written by an engineer, so I’ll attempt to translate into business-speak in the added parenthetical comments to show the parallels to business and even venture capital.

  • What are you trying to do? Articulate your objectives using absolutely no jargon.  (Mission or vision)
  • How is it done today, and what are the limits of current practice? (Market research)
  • What’s new in your approach and why do you think it will be successful?  (Your intellectual capital)
  • Who cares? If you’re successful, what difference will it make?  (Value-add to customers)
  • What are the risks and the payoffs? (What’s the ROI?  Show me the money)
  • How much will it cost? How long will it take? (Investment)
  • What are the midterm and final “exams” to check for success?  (Metrics)

Categories: Business · General · Links

Annotate anything, anywhere

August 21, 2007

If you are interested in annotations on ontologies or any other sort of resource, check out the W3C Project Annotea. They are working towards making it possible to load annotations alongside any web resource.  Imagine going to a web page, and clicking on a toolbar button to reveal annotations made by any number of people across the world.  I assume some other mechanisms would be necessary to make sure the annotations, bookmarks, and marginalia were relevant to you, but you get the idea.

They have built an annotation schema for describing the RDF structure of those annotations, and some basic software for creating annotations and storing them on an annotation server.  The Mozilla project is hosting Annozilla, an attempt to integrate Annotea into Mozilla and Firefox.

Each annotation can then become a discussion thread of its own.  Imagine reading a web page on history, and noticing that someone else has posted an annotation that suggests a linkage to some other event that happened in history.  You would then be able to reply to that annotation, ask a question, or dispute its accuracy.

In these kinds of discussion, the discussion’s context would be extremely localized.  An entire thread of discussion might take place about a particular phrase or passage in a document.  Such discussions have always taken place; the difference is that with stored annotations, there is explicit traceability back to the source data that provides “context” for the discussion.  Of course you’ll always have issues such as when the discussion topic drifts.  Still, tracking the seed idea or piece of information that started the discourse has its own benefits.

Categories: Annotations · Context · Data integration · Links

Links to Statistical Approaches to Missing Data

August 21, 2007

Very good blog post:  Data Mining and Predictive Analytics – Missing Values and Special Values:  The Plague of Data Analysis.

This post has a number of references to other writing on the topic, such as Working with missing values.

One of the interesting points about this blog post and other documents linked off of it is that it seems the NMAR, MAR, MCAR approach to categorizing missing values does have some weight in the field.

Categories: General

Quick Summarization of Data Mining Approaches to Handling Missing Data

August 21, 2007

Data Mining Research Blog – Handling Missing Values

A few observations about these approaches, which illustrate the predispositions of data mining:

  • They don’t seem to care about what the value of the missing data is; they primarily care about the missing data’s impact on the value of the particular data observation (or row).
  • For expediency, they tend to assume that missing values will be statistically distributed similarly to how the non-missing (or observed) values are distributed.
  • There is a focus on a large corpus of observations; the impact of the individual observation is small.

These are all reasonable constraints given what data mining is doing.  As I discover this kind of thing though, I’m trying to keep it documented because these types of themes would probably be interesting to contrast with an approach that was aimed at using missing values as an information channel.

Categories: Context · Links

RDF123: Generate Flexible RDF From Spreadsheets

August 21, 2007

A new application called RDF123 has recently been made available, and has gotten a lot of well-deserved positive buzz from the semantics and RDF community.  Basically, RDF123 allows you to take spreadsheets as an input, and produce RDF data as an output.  Lushan Han, a Ph.D. candidate at UMBC, deserves the credit.

This capability is a big deal because of the amount of useful data in the world that’s locked up in spreadsheets.  This data is usually accessible only to applications that know how to read its proprietary binary format.  Converting the data into RDF may be a win in many cases because it provides access to the data in an open format (RDF) and allows marking the data up with a structured schema or with linkages to an ontology to permit new uses of the data.  Maybe you don’t buy that people are going to convert spreadsheets into RDF and then do reasoning with the data, and that’s fine.  You don’t have to do anything fancy – just having the data in RDF makes it much more flexible for any kind of machine processing.  This is of course in contrast to regular spreadsheets, which are typically designed primarily for human visual consumption.

Here’s how the project summarizes itself, for a few basic details:

RDF123 is an application and web service for converting data in simple spreadsheets to an RDF graph. Users control how the spreadsheet’s data is converted to RDF by constructing a graphical RDF123 template that specifies how each row in the spreadsheet is converted as well as metadata for the spreadsheet and its RDF translation. The template can map spreadsheet cells to a new RDF node or to a literal value. Labels on the nodes in the map can be used to create blank nodes or labeled nodes, attach a XSD datatype, and invoke simple functions (e.g., string concatenation). The graph produced for the spreadsheet is the union of the sub-graphs created for each row. The template itself is stored as a valid RDF document encouraging reuse and extensibility.
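To give a feel for the row-by-row conversion described in that summary, here is a small sketch in plain Python. It is not RDF123 and does not use its template format: the `TEMPLATE` dictionary, the `http://example.org/` namespace, and the sample sheet are all assumptions made for the example. Each row becomes a small set of triples, and the result is the union of those per-row sets.

```python
import csv
import io

EX = "http://example.org/"  # hypothetical namespace for this sketch

# Hypothetical template: which column names the row's node, and which
# predicate each remaining column maps to.
TEMPLATE = {
    "subject_column": "id",
    "predicates": {"name": EX + "name", "price": EX + "price"},
}

# Made-up spreadsheet contents, as CSV text.
SHEET = """id,name,price
p1,Margherita,8.50
p2,Quattro Formaggi,10.00
"""


def row_to_triples(row, template):
    """Build the sub-graph for one spreadsheet row."""
    subject = EX + row[template["subject_column"]]
    return {(subject, pred, row[col]) for col, pred in template["predicates"].items()}


def sheet_to_graph(text, template):
    """Union the per-row sub-graphs into one set of triples."""
    graph = set()
    for row in csv.DictReader(io.StringIO(text)):
        graph |= row_to_triples(row, template)
    return graph


graph = sheet_to_graph(SHEET, TEMPLATE)
for triple in sorted(graph):
    print(triple)
```

Even this crude version shows why the conversion is useful: once each cell is a triple with a named predicate, the data no longer depends on column position or on the spreadsheet application.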

If you’re interested in more information on RDF123, eBiquity posted about it, and so did AI3 as well as Aman.  RDF123 was funded out of an NSF Grant (details on their grant) that includes a number of usual suspects (Timothy Finin, Jim Hendler).

Categories: RDF

Jena Architecture Issues: Querying with SPARQL Across Models

August 21, 2007

Continuing to work on my Jena-based application, I’ve run into a snag that has to do with how to architect the application for the best performance.  They say that when programming “premature optimization is the root of all evil”.  Generally, it’s true but it’s no excuse to pick an architecture that hamstrings your application from the start.

My application stores many different RDF data models.  Each model may have a separate associated “annotation model” that contains a set of additional triples referring back to the original model.

The question is how should I best implement this model soup such that users can search for specific triples across all models?  Let’s say you’re searching for a tag “foo” that has been associated with a particular model or resource.  In my version of RDF, that translates quite easily to the triple:

<http://a.com/some-resource> <http://b.com/#tagged> “foo”.

So all you have to do is search all available models for statements matching:

(?) b:tagged “foo”

Easy enough.  But how do you implement such a search?  By default, Jena stores different models in different relational tables in the underlying database.  There is no single unified relational table (that I can find) where you can search for “foo”.  There is no implicit unified model that contains all of the triples of all available models.  From what I can tell, my options are:

  1. Create an uber-model that is basically a union of all of the models I have, and search only that model.
  2. Issue the “foo” search n times, where n is the number of models I have; once for each model.
  3. Don’t separate the models in the first place – only ever have one uber-model, and create logical separations in the model by adding additional triples.  In other words, smash all models together into one giant model from the start, and create the illusion that they’re separate with additional metadata that allows the application to figure out which triples belong to which “sub-model”.

Option (1) is lousy, because you end up storing every triple twice; once for the model it’s in, and once for the uber-model.  Option (2) is good, except that query performance sucks.  Additionally, you’ll be wasting time searching many models that won’t have any hits at all.  Option (3) is horrendous – complicated to implement, unwieldy, and requiring a lot of code to manage different things that Jena would normally do for me.

SPARQL explicitly supports option (2) with named graphs, and it currently seems like the best option.  Jena also provides a way of indexing text with Apache’s Lucene, which doubtless will improve performance, but doesn’t change the architectural problem of having to search n different models for a single query.

I have been getting good support on the jena-dev mailing list,  but I have yet to check out Andy’s SDB.

Categories: General

Ontology Annotation Use Cases

August 15, 2007

I recently found another interesting link dealing with ontology evolution and annotation. This paper is primarily interested in annotation from the perspective of maintenance and evolution. Indeed, the collaborative Protégé software that’s being developed is focused on making suggestions for changes to a single existing ontology.

Categories: Links · Ontology · Semantics

Context is more than location

August 15, 2007

Context is more than location.  Schmidt, Beigl, Gellersen.

These chaps break down the concept of “context”, as it relates to humans using mobile computing devices, into this hierarchy:

  • Human Factors
    • User (knowledge of habits, emotional state, biophysiological conditions, …)
    • Social Environment (co-location of others, social interaction, group dynamics, …)
    • Task (spontaneous activity, engaged tasks, general goals,…)
  • Physical Environment
    • Location (absolute position, relative position, co-location,…)
    • Infrastructure (surrounding resources for computation, communication, task performance…)
    • Conditions (noise, light, pressure,…)

This paper also includes a number of good links to other work on context.

Categories: Context · General · Links · Taxonomy

3 Fundamental Classes of Missing Data

August 15, 2007 · 1 Comment

Taken from Missingness mechanism, James Carpenter’s and Mike Kenward’s site on the statistics of missing data.

They outline three different mechanisms that would cause data to be missing:

  1. Missing Completely At Random (MCAR): when the reason for the data being missing does not depend on its value or lack of value.
  2. Missing At Random (MAR): when the reason for missing data can be explained by the observed data; after accounting for this, there is no further information in the unseen data.
  3. Not Missing At Random (NMAR):  when even after considering the information in the rest of the data, the reason for missing information depends on that unseen information.

In this case, the conceptual common denominator seems to be “the relationship between the reason information is missing, and the content of the data set”.

This is an interesting differentiation to use potentially in a taxonomy of missing data reasons, because it looks at reasons from a different perspective.  (I.e. not from the perspective of the context in which the information was fetched, and also not from the perspective of whether the information is known vs. available)
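A small simulation makes the three mechanisms easier to tell apart. Everything in it is an assumption made for illustration: a made-up survey where age is always recorded and income may go missing, with arbitrary drop probabilities. The only thing that changes between the three runs is what the drop probability is allowed to depend on.

```python
import random

rng = random.Random(42)
N = 10_000

# Hypothetical survey: age is always recorded, income may go missing.
people = []
for _ in range(N):
    age = rng.randint(20, 70)
    income = 20_000 + 800 * age + rng.gauss(0, 5_000)
    people.append({"age": age, "income": income})


def apply_mechanism(people, drop_probability):
    """Return incomes, with None wherever the mechanism drops the value."""
    return [None if rng.random() < drop_probability(p) else p["income"] for p in people]


# MCAR: the chance of being missing ignores the data entirely.
mcar = apply_mechanism(people, lambda p: 0.3)
# MAR: the chance depends only on something we still observe (age).
mar = apply_mechanism(people, lambda p: 0.6 if p["age"] > 50 else 0.1)
# NMAR: the chance depends on the very value that goes missing.
nmar = apply_mechanism(people, lambda p: 0.6 if p["income"] > 60_000 else 0.1)


def observed_mean(values):
    """Mean of the values that survived."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen)


true_mean = sum(p["income"] for p in people) / N
print(round(true_mean), round(observed_mean(mcar)), round(observed_mean(nmar)))
```

Under MCAR the surviving values still look like the full set, so the observed mean stays close to the true one. Under NMAR the high incomes are preferentially removed, and nothing left in the data set reveals that, so the observed mean is biased low.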

Categories: Links

Work progressing on Ontrospect…

August 15, 2007

As I posted about before, I have a little project I’m working on.  Over the past couple of days, I managed to implement annotations in the system, which is basically the ability to add an arbitrary RDF triple to a separate model that “describes” another model.  For example, take the pizza ontology.  Of course this file is read-only because it’s on someone else’s site.  Annotations in Ontrospect would allow you to add extra triples to another model (let’s call it “Annotations: Pizza”) that make statements about resources in the pizza ontology.

The annotation could be a regular OWL or RDF assertion (such as indicating the type of a resource, or claiming that something is a subclass of something else) but it could also be a comment, a text tag, or more importantly a link to a completely different ontology.

I’m interested in this as a prototype system for building mappings between ontologies.  Hypothetically, you could for example take two upper ontologies and create links between them using these annotations.  For example, load the SUMO ontology into the program, and then link specific classes in SUMO to classes within DOLCE-lite.  (It could be something as simple as owl:sameAs, but would probably need to be something more sophisticated)

Where this becomes useful is when you union multiple models together with their annotation models.  You can then use that as a basis for reasoning across multiple ontologies, allowing you to work with mixed data.  An interesting other consequence is that it (hopefully) could make the semantic web read/write, instead of just read only.  The pizza ontology is out there, and you can’t change it.  What you can do is create your own set of annotations, and then publish a new model that is the pizza ontology union’d with your annotations – in effect a different model.

Categories: Jena · Ontology · RDF · Semantics · Software

Modeling Context: Semantic Data Integration

August 15, 2007

I came across another interesting research paper suggested by a colleague:

Context Interchange:  New Features and Formalisms for the Intelligent Integration of Information (Cheng Hian Goh, Stephane Bressan, Stuart Madnick, Michael Siegel)

The basic approach that they outline is data integration through an intermediate “context mediator”.  Many semantic data integration approaches require that individual data sources build robust ontologies to describe their semantics, but this still requires mapping between the ontologies.  The context interchange approach requires the domain model ontology, but also a set of “context axioms” which describe the implicit context of the data, as well as what transformations would be necessary to get the data to other contexts.

There are several strengths in this paper.  The first is that it actually describes context and tries to formalize context for information sharing.  Also, by abstracting up to an information source’s context, their approach prevents you from having to map domain models onto one another.  (The function of that mapping is subsumed by the context axioms).

The main downside of this approach that I can see is that it may not ultimately save you any mapping work.  If you want to translate a data source A into something else in a different context, the context axioms in A still need to express how that data would be transformed or interpreted in another named context.  Note the slight shift – instead of saying how data would be interpreted under a given domain model, the context axioms talk about how it would be interpreted under a different named context.  To the extent that you believe that there are fewer contexts than systems out there, this reduces the mapping burden.  Surprise surprise, the software in the middle still can’t produce spontaneous semantic magic.  :)

I’m reading through this material in a quest for reasonable work in the area of modeling context.  This paper is one of the closer things that I’ve found to what I’m looking for, but doesn’t explicitly address how context is modeled; it just uses a running example that happens to include a model for the example’s limited context.  (The running example in the paper has to do with integrating information from two systems that store information about corporate performance.  If you want to know about all companies that are profitable, how do you do that if one system has information in dollars with a scale factor of 1, and the other in yen with a scale factor of 1,000?)
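The dollars-versus-yen example can be sketched as a tiny “mediator” function. This is only an illustration of the idea, not the paper’s formalism: the `Context` class, the `convert` function, and the exchange rate of 100 yen to the dollar are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """The implicit assumptions a system makes about its monetary values."""
    currency: str
    scale_factor: int


# Assumed exchange rates, as units of each currency per one US dollar.
UNITS_PER_USD = {"USD": 1, "JPY": 100}


def convert(value, source, target):
    """Re-express a figure reported in the source context in the target context."""
    absolute = value * source.scale_factor             # undo the scale factor
    in_usd = absolute / UNITS_PER_USD[source.currency]
    in_target_currency = in_usd * UNITS_PER_USD[target.currency]
    return in_target_currency / target.scale_factor


us_system = Context(currency="USD", scale_factor=1)
jp_system = Context(currency="JPY", scale_factor=1000)

# A net income stored as 500 in the second system means 500 thousand yen.
print(convert(500, jp_system, us_system))
```

With both contexts made explicit, a query such as “all profitable companies” can compare figures from the two systems on a common footing. The mapping work has not disappeared, though; it has moved into the context descriptions and the conversion rule.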

Categories: Context · Data integration · Links

Schema Matching: Similarity Flooding (Melnik & Rahm)

August 14, 2007 · 1 Comment

A colleague recently gave me a copy of an interesting article:

Similarity Flooding:  A Versatile Graph Matching Algorithm and its Application to Schema Matching (Sergey Melnik, Erhard Rahm)

In a nutshell, it outlines a method of taking two arbitrary schemas or graphs (think SQL DDL, RDF datasets, or XML schemas) and matching them together to simplify data integration.  Their results support the claim that about 50% (on average) of the schema matching task can be automated with no understanding of the semantics of the underlying models.

To sum up their algorithm, they take an initial set of mappings between the two graphs that’s based on something simple and easy (e.g. string prefix and suffix matching on node names) and then propagate that similarity through the network.  The algorithm’s assumption is that “whenever any two elements in models G1 and G2 are found to be similar, the similarity of their adjacent elements increases”.
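The propagation idea can be sketched in a few dozen lines. This is a deliberately simplified version, not the algorithm from the paper: edges are unlabeled, every neighbour contributes with equal weight, and the loop runs for a fixed number of rounds instead of testing for convergence. The two tiny schemas are invented for the example.

```python
from itertools import product


def name_similarity(a, b):
    """Initial guess: shared prefix plus shared suffix, relative to the longer name."""
    a, b = a.lower(), b.lower()
    shortest = min(len(a), len(b))
    prefix = 0
    while prefix < shortest and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < shortest and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return min(1.0, (prefix + suffix) / max(len(a), len(b)))


def neighbours(edges, node):
    """Nodes joined to `node` by an edge, split into outgoing and incoming."""
    outgoing = [t for (s, t) in edges if s == node]
    incoming = [s for (s, t) in edges if t == node]
    return outgoing, incoming


def flood(nodes1, edges1, nodes2, edges2, rounds=20):
    """Return (initial, flooded) similarity scores for every node pair."""
    initial = {(a, b): name_similarity(a, b) for a, b in product(nodes1, nodes2)}
    sigma = dict(initial)
    for _ in range(rounds):
        updated = {}
        for a, b in sigma:
            out_a, in_a = neighbours(edges1, a)
            out_b, in_b = neighbours(edges2, b)
            # Similarity flows in from pairs of neighbours reached the same way.
            flow = sum(sigma[pair] for pair in product(out_a, out_b))
            flow += sum(sigma[pair] for pair in product(in_a, in_b))
            updated[(a, b)] = initial[(a, b)] + sigma[(a, b)] + flow
        top = max(updated.values()) or 1.0
        sigma = {pair: value / top for pair, value in updated.items()}
    return initial, sigma


def best_match(sigma, node):
    """The node in the second graph scoring highest against `node`."""
    candidates = {b: score for (a, b), score in sigma.items() if a == node}
    return max(candidates, key=candidates.get)


nodes1, edges1 = ["Customer", "name", "city"], [("Customer", "name"), ("Customer", "city")]
nodes2, edges2 = ["Client", "name", "city"], [("Client", "name"), ("Client", "city")]
initial, flooded = flood(nodes1, edges1, nodes2, edges2)
print(best_match(flooded, "Customer"))
```

On names alone, “Customer” scores no better against “Client” than against “city”, since each pair shares only a leading “c”. After flooding, the matching children pull “Customer” and “Client” together, which is the structural effect the quoted assumption describes.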

This is an interesting algorithmic approach to schema matching.  One of the things you see again and again in the data integration space is the use of semi-automated techniques, i.e. approaches where it is assumed from the start that humans will go behind the computer and fix mistakes, annotate with additional information, and so on.

Categories: Data Modeling · Data integration · Links

Context-Aware Computing

August 8, 2007

I found this interesting article detailing several different aspects of “context-aware computing”.

Context refers to the physical and social situation in which computational devices are embedded. One goal of context-aware computing is to acquire and utilize information about the context of a device to provide services that are appropriate to the particular people, place, time, events, etc. For example, a cell phone will always vibrate and never beep in a concert, if the system can know the location of the cell phone and the concert schedule. However, this is more than simply a question of gathering more and more contextual information about complex situations. More information is not necessarily more helpful. Further, gathering information about our activities intrudes on our privacy. Context information is useful only when it can be usefully interpreted, and it must be treated with sensitivity.

This article is an introduction to a special issue of an HCI publication on context-aware programming.  It covers some of the background.  Recently, I’ve been searching for information on how to model arbitrary contexts, which I take to be a pretty much intractable problem.  So far, little success but good characterizations of the problem abound.

Categories: Context · General · Links

Reification of RDF Statements: Concrete application of RDF data modeling

August 7, 2007 · Leave a Comment

Let’s say a user wants to annotate a particular RDF data model with a statement. Let’s call that model “M”. Here’s the user’s annotation statement:

(Subject: Model M) -> (Predicate: was created by) -> (Object: user “OntologyStud”)

How then do we connect this statement to a series of other statements containing metadata about the annotation? It’s not enough to know just this single statement; we need to create linked statements saying which user asserted this, when, and so on.

Looking back at yesterday’s post on RDF data modeling, we would want to create some higher-level “grouper” node. The problem is that the grouper node can’t link to the subject, predicate, or object of this statement, because those resources may have any number of other statements associated with them. Certainly this annotation is not the only statement about “Model M”, so if we were to create a link from a grouper node to “Model M”, we wouldn’t be able to tell which of the statements about “Model M” was the annotation statement.

The solution is to reify. What is reification?

Reification, also called hypostatisation, is treating an abstract concept as if it were a real, concrete thing.

To reify a statement means to take an RDF statement (subject, predicate, object) and treat it as if it were a new resource. That way, instead of resources only pointing to things like usernames, particular property names, tags, and so on, a resource can talk about a statement.

This is moving up one meta-level, and allows RDF statements to talk about themselves. It’s extremely useful when you want to add additional information to a statement. For example, you might want to add data about a statement:

  • Who asserted this?
  • How trustworthy is it?
  • When was this said?
  • What was the context surrounding this statement?

In our example, that’s just what we’re trying to do – take an annotation statement and add information about who said it, and when.

To recap:

  • Take the annotation statement “Model M -> was created by -> OntologyStud”.
  • Reify that statement as a resource; let’s call that resource “Annotation 1”.
  • Create a grouper node called “Annotation Instance 1”.
  • Create a new statement linking the annotation instance to the annotation: “Annotation Instance 1” -> hasAnnotation -> “Annotation 1”.
  • Add any additional statements necessary by linking from “Annotation Instance 1”.
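
The recap can be sketched in a few lines of Python using plain tuples as triples.  The rdf:Statement, rdf:subject, rdf:predicate, and rdf:object names come from the standard RDF reification vocabulary; the assertedBy and assertedOn predicates and their values are placeholders I made up for illustration.

```python
def reify(statement_id, triple):
    """Describe a triple as a resource using the RDF reification vocabulary."""
    subject, predicate, obj = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]

# The annotation statement from the example.
annotation = ("Model M", "was created by", "OntologyStud")

# Reify it as the resource "Annotation 1".
graph = reify("Annotation 1", annotation)

# Link the grouper node to the reified statement, then hang metadata off it.
graph += [
    ("Annotation Instance 1", "hasAnnotation", "Annotation 1"),
    ("Annotation Instance 1", "assertedBy", "some-user"),   # hypothetical predicate
    ("Annotation Instance 1", "assertedOn", "2007-08-07"),  # hypothetical predicate
]
```

Because “Annotation 1” stands for exactly one statement, the metadata on “Annotation Instance 1” is unambiguous even if “Model M” appears in many other statements.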

Categories: Data Modeling · RDF · Semantics

Data Modeling and RDF: Redux

August 6, 2007 · Leave a Comment

Following up on my last post, I thought I’d drop this link – Data Modeling, RDF, and OWL.  This is a good article written for TDAN by David Hay about the field of semantics from the perspective of data modeling.

He makes the good point that a relational data model is a sort of ontology.  It does outline the high-level concepts (entities) and the relationships between those entities.  Most people on the semantics side would probably say that a database schema is myopic in that it’s too focused on whether last names are 22 characters or 30, and not focused enough on the relationships that pull concepts together.

Categories: Data Modeling · Links · RDF

Data Modeling and RDF

August 6, 2007 · 1 Comment

I have been reading a little bit recently about data modeling and RDF, and I wanted to post about a general pattern I’ve seen that’s helpful for those who are familiar with relational modeling.

First, take a simple use case of how a web 2.0 site would keep track of users tagging individual pages.  Social bookmarking sites like del.icio.us have to do this all the time.

The RDF data model can basically be thought of as a series of triples, each with a subject, predicate, and object.  So how could we model storing the fact that a particular user labeled a web page as “funny”?  It might go something like this:

Individual Tuples and RDF

This tracks a number of facts – the user in question, the tag that they used, the web page they’re tagging, and when they did it.  With just three tuples, we stored four facts.  The problem with this independent set of tuples is that there isn’t a way of relating a tag back to a user.  We can tell that the user tagged the page, and that the page is tagged “funny”.  If we want to go back, though, and figure out which tags this user has applied to this page, we’re stuck because there’s no linkage from the tag content to the user.
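
Here is a small Python sketch of that problem, with triples written as plain tuples.  The identifiers and predicate names (tagged, hasTag, taggedOn) are made up for illustration, as is the second user.

```python
# Flat, independent triples (hypothetical names throughout).
triples = [
    ("user:alice", "tagged", "page:example"),
    ("page:example", "hasTag", "funny"),
    ("page:example", "taggedOn", "2007-08-06"),
    # A second user tags the same page with a different tag.
    ("user:bob", "tagged", "page:example"),
    ("page:example", "hasTag", "boring"),
]

def tags_possibly_by(user, page, triples):
    """Best available answer: every tag on the page, if the user tagged it at all."""
    if (user, "tagged", page) not in triples:
        return set()
    return {o for s, p, o in triples if s == page and p == "hasTag"}

# We get both tags back, with no way to tell which one alice applied.
candidates = tags_possibly_by("user:alice", "page:example", triples)
```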

The solution is typically to abstract up one level, and to create a data object that facilitates the linkage.  Let’s call that “Tag Instance”, and have a look at round two of modeling the same information:

Bucket view of RDF

Now, essentially “Tag Instance” is a bogus piece of information whose only purpose is to tie together a bunch of links.  Still, modeling the information in this way allows us to create the linkages.  You can (via queries) navigate from the user to the specific tag through the “Tag Instance” object.  A few things about this example differ from the first one:

  • The names of the predicates all changed – individual data items aren’t being related directly to one another, but only indirectly.
  • There are four tuples for four facts, instead of three tuples.
  • This hierarchy is still easily decomposable into tuples similar to those seen in the first figure, by using “Tag Instance” as the subject of the tuple, the arc in the diagram as the predicate, and the destination node as the object.
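
As a rough Python sketch of the “Tag Instance” pattern, again with plain tuples as triples: the predicate names (taggedBy, tagContent, taggedPage, taggedOn) and the identifiers are my own placeholders, not taken from the figure.

```python
# Each tagging event gets its own "Tag Instance" node (hypothetical names).
triples = [
    ("tagInstance:1", "taggedBy", "user:alice"),
    ("tagInstance:1", "tagContent", "funny"),
    ("tagInstance:1", "taggedPage", "page:example"),
    ("tagInstance:1", "taggedOn", "2007-08-06"),
    ("tagInstance:2", "taggedBy", "user:bob"),
    ("tagInstance:2", "tagContent", "boring"),
    ("tagInstance:2", "taggedPage", "page:example"),
    ("tagInstance:2", "taggedOn", "2007-08-06"),
]

def tags_by(user, page, triples):
    """Navigate from user to tag content through the Tag Instance nodes."""
    by_instance = {}
    for subject, predicate, obj in triples:
        by_instance.setdefault(subject, {})[predicate] = obj
    return {
        props["tagContent"]
        for props in by_instance.values()
        if props.get("taggedBy") == user and props.get("taggedPage") == page
    }

alice_tags = tags_by("user:alice", "page:example", triples)
```

Every tuple has the “Tag Instance” as its subject, so the question “which tags did this user apply to this page?” now has a definite answer.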

This process basically reinvents the table structure in relational modeling.  “Tag Instance” is acting as a grouping piece of metadata indicating that all of the individual data elements belong to the same logical record.  Using this pattern allows you to take any arbitrary table structure and “RDFize” it.  For example:

Comparison of Relational and RDF

In this example, a record about a person, their phone number, and their sex is mapped to an RDF graph storing the same information, using the same style of “grouping” node shown earlier.

Now let’s formalize how this transformation takes place as a series of steps.  The starting point is a table, and the desired end point is an RDF graph that can store a row of information from that table.  If we can get that far, we can replicate the process and convert the entire contents of the relational table into RDF data.

  • Step 1:  Create a “grouping” node.  A good name for the grouping node will be the unique identifier for the row, or the name of the table/entity, followed by a generated unique identifier.
  • Step 2:  Create a node for each individual data “fact” or observation within the row.  That is, a node for each cell of data.
  • Step 3:  Create links from the “grouping” node to each of the individual “fact” nodes.  Label those links with the column name for the data cell in question.
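
The three steps translate almost directly into code.  This is a minimal Python sketch, assuming rows arrive as dictionaries keyed by column name and that a running row number is good enough as the generated identifier; the function names and sample data are mine.

```python
def row_to_triples(table, row_id, row):
    """Convert one table row into triples hanging off a grouping node."""
    # Step 1: the grouping node is named after the table plus a unique id.
    group = f"{table}/{row_id}"
    # Steps 2 and 3: each cell becomes an object node, linked from the
    # grouping node by a predicate named after the column.
    return [(group, column, value) for column, value in row.items()]

def table_to_triples(table, rows):
    """Repeat the process for every row in the table."""
    triples = []
    for row_id, row in enumerate(rows, start=1):
        triples.extend(row_to_triples(table, row_id, row))
    return triples

# Sample data (made up) for a table with name, phone, and sex columns.
people = [
    {"name": "Jane Doe", "phone": "555-0100", "sex": "F"},
    {"name": "John Roe", "phone": "555-0101", "sex": "M"},
]
triples = table_to_triples("Person", people)
```

If the table already has a primary key, using its value as `row_id` keeps the grouping node names stable across reloads.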

If you’re wondering about links between tables, those can be accomplished in two ways.  The first is to create predicates that link grouping nodes to other grouping nodes.  The second method is to create a separate grouping node that represents a compound foreign key.  That “foreign key” node can then link to all of the source data and target data.
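
Both linking options can be sketched with plain tuples in Python.  The table names, the worksIn predicate, and the fkSource/fkTarget predicates are hypothetical names chosen for illustration.

```python
# Two grouping nodes from different tables (made-up data).
triples = [
    ("Person/1", "name", "Jane Doe"),
    ("Department/7", "title", "Research"),
]

# Option 1: a predicate that links one grouping node directly to another.
direct = triples + [("Person/1", "worksIn", "Department/7")]

# Option 2: a separate "foreign key" node that points at both ends.
keyed = triples + [
    ("ForeignKey/1", "fkSource", "Person/1"),
    ("ForeignKey/1", "fkTarget", "Department/7"),
]

def follow_direct(triples, source, predicate):
    """Follow a direct link from one grouping node to its targets."""
    return [o for s, p, o in triples if s == source and p == predicate]

def follow_via_key(triples, source):
    """Follow a link by way of the intermediate foreign key nodes."""
    keys = {s for s, p, o in triples if p == "fkSource" and o == source}
    return [o for s, p, o in triples if s in keys and p == "fkTarget"]

by_predicate = follow_direct(direct, "Person/1", "worksIn")
by_key_node = follow_via_key(keyed, "Person/1")
```

The direct predicate is simpler to query, while the separate node leaves room to attach several source and target columns to one key, which is what a compound foreign key needs.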

Categories: Data Modeling · General · RDF · Semantics