Missing Data and Tacit Knowledge

Searching on “tacit” along with data-related terms turns up a wealth of links to discussions that fit this domain. However, “tacit knowledge” has an established meaning (knowledge that is difficult to articulate or write down) that is a bit off from what we are going after here.

Unified Data Feed on Web2Express

AJ Chen has just released a new service that allows users to create data feeds that can be published in RSS 2.0, RDF, and ATOM depending on the user’s preference.

His announcement email to the W3C semantic web list:

“One of the big challenges facing semantic web is to encourage people to put out their data in semantic format on the web. I have been looking into practical areas where semantic web can make real difference. Datafeed is such an area, I believe. Yesterday, I just released a free new online tool – unified data feed on web2express.org. I hope people will find it useful for creating data feeds for products, news, events, jobs and studies. Besides the feeds, all of the data are also openly available as RDF.”

Worth checking out. Aside from repackaging existing feeds in a more flexible format, he has a number of feeds for various products and other entities. The hard part, of course, is getting people to agree on a particular set of schemas for representing, say, product price. From what I can tell, he’s coming up with his own microformats for that.

TANSTAAFL: The No-Free-Lunch Theorem

I came across this interesting tidbit while reading one of Numenta’s papers on their HTM approach.

The No-Free-Lunch Theorem:   “no learning algorithm has an inherent advantage over another learning algorithm for all classes of problems.  What matters is the set of assumptions an algorithm exploits in order to learn the world it is trying to model.”

Ho, Y. C. & Pepyne, D. L. (2002), “Simple Explanation of the No-Free-Lunch Theorem and Its Implications”, Journal of Optimization Theory and Applications, 115(3), 549-570.

Or, as Robert Heinlein put it a while back, TANSTAAFL.  There Ain’t No Such Thing As A Free Lunch.
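
For reference, a sketch of the theorem’s usual formal statement (this is Wolpert and Macready’s formulation for search and optimization; treat the notation below as a summary, not a quotation from the paper above): for any two algorithms a1 and a2, performance summed over all possible objective functions is identical.

    % No-free-lunch: for any pair of algorithms a_1 and a_2, performance
    % summed over every possible objective function f is the same.
    \[
    \sum_{f} P\left(d_m^y \mid f, m, a_1\right)
      = \sum_{f} P\left(d_m^y \mid f, m, a_2\right)
    \]
    % Here d_m^y is the sequence of m cost values the algorithm has
    % sampled so far; no algorithm can win on average over all f.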

Culturally Embedded Computing Group (i.e. cultural assumptions)

The site below offers a look into research on the culturally derived assumptions we make when designing systems.
http://cemcom.infosci.cornell.edu/home.php
----------------

Cornell University Faculty of Information Science and Department of Science & Technology Studies

We analyze, design, build, and evaluate computing devices as they relate to their cultural context. We analyze the ways in which technologies reflect and perpetuate unconscious cultural assumptions, and design, build, and test new computing devices that reflect alternative possibilities for technology. We have a focus on reflective design, or design practices that help both users and designers reflect on their experiences and the role technology plays in those experiences.

Our primary focus is the environment; we are exploring the role IT could play to reconfigure our cultural relationship to the environment. We have worked extensively on affective computing, to develop approaches in which the full complexity of human emotions and relationships as experienced by users is central to design (rather than the extent to which computers can understand and process those emotions).

We draw from, contribute to, and mutually inform the technical fields of human computer interaction and artificial intelligence and the humanist/sociological fields of cultural and science & technology studies.

Jena Application Architecture Issues: Querying Across Models with SPARQL

Continuing to work on my Jena-based application, I’ve run into a snag that has to do with how to architect the application for the best performance. They say that, in programming, “premature optimization is the root of all evil.” That’s generally true, but it’s no excuse to pick an architecture that hamstrings your application from the start.

My application stores many different RDF data models. Each model may have a separate associated “annotation model” that contains a set of additional triples referring back to the original model.

The question is how I should best implement this model soup such that users can search for specific triples across all models. Let’s say you’re searching for a tag “foo” that has been associated with a particular model or resource. In my RDF vocabulary, that translates quite easily to the triple:

<http://a.com/some-resource> <http://b.com/#tagged> "foo" .

So all you have to do is search all available models for statements matching:
?s b:tagged "foo"

Easy enough. But how do you implement such a search? By default, Jena stores different models in different relational tables in the underlying database. There is no single unified relational table (that I can find) where you can search for “foo”. There is no implicit unified model that contains all of the triples of all available models. From what I can tell, my options are:

(1) Create an uber-model that is basically a union of all of the models I have, and search only that model.
(2) Issue the “foo” search n times, where n is the number of models I have; once for each model.
(3) Don’t separate the models in the first place – only ever have one uber-model, and create logical separations in the model by adding additional triples. In other words, smash all models together into one giant model from the start, and create the illusion that they’re separate with additional metadata that allows the application to figure out which triples belong to which “sub-model”.

Option (1) is lousy, because you end up storing every triple twice: once for the model it’s in, and once for the uber-model. Option (2) is good, except that query performance sucks. Additionally, you’ll waste time searching many models that won’t have any hits at all. Option (3) is horrendous: complicated to implement, unwieldy, and requiring a lot of code to manage things that Jena would normally do for me.
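
For concreteness, here’s a minimal sketch of option (2) as a naive loop, written against the current Apache Jena API (the org.apache package names postdate this post, and the list of models is hypothetical):

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import java.util.List;

    public class PerModelSearch {
        // Option (2): issue the same SPARQL query once per model,
        // n times for n models.
        static void findTagged(List<Model> models) {
            String q = "SELECT ?s WHERE { ?s <http://b.com/#tagged> \"foo\" }";
            for (Model m : models) {
                try (QueryExecution qe = QueryExecutionFactory.create(q, m)) {
                    ResultSet rs = qe.execSelect();
                    while (rs.hasNext()) {
                        System.out.println(rs.next().get("s"));
                    }
                }
            }
        }
    }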

SPARQL explicitly supports option (2) with named graphs (a single query can range over every graph in a dataset via the GRAPH keyword), and it currently seems like the best option. Jena also provides a way of indexing text with Apache’s Lucene, which will doubtless improve performance but doesn’t change the architectural problem of having to search n different models for a single query.
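
Here’s a sketch of that named-graph approach: load each model into one dataset as a named graph, and let a single query range over all of them. Again this is the current Apache Jena API, and the graph names and toy data are assumptions of mine:

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.*;

    public class CrossModelSearch {
        public static void main(String[] args) {
            // Two toy models standing in for the separately stored models.
            Model m1 = ModelFactory.createDefaultModel();
            m1.add(m1.createResource("http://a.com/some-resource"),
                   m1.createProperty("http://b.com/#tagged"), "foo");
            Model m2 = ModelFactory.createDefaultModel();

            // Register each model in one dataset under a (hypothetical) graph name.
            Dataset ds = DatasetFactory.create();
            ds.addNamedModel("http://example.org/model1", m1);
            ds.addNamedModel("http://example.org/model2", m2);

            // One query searches every named graph; ?g reports which
            // graph (i.e. which model) each match came from.
            String q = "SELECT ?g ?s WHERE { GRAPH ?g "
                     + "{ ?s <http://b.com/#tagged> \"foo\" } }";
            try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
                ResultSet rs = qe.execSelect();
                while (rs.hasNext()) {
                    QuerySolution row = rs.next();
                    System.out.println(row.get("g") + " : " + row.get("s"));
                }
            }
        }
    }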

I have been getting good support on the jena-dev mailing list, but I have yet to check out Andy Seaborne’s SDB, which promises support for union queries, a potential extra option.

ebXML’s Context Table

ebXML’s context table outlines the different classes of context that the ebXML model recognizes. “Context” in this case is business context: the set of circumstances surrounding the transmission of business-process-related information.

ebXML Context

Click the image for a larger-sized picture of what a context representation in ebXML might look like, and where it fits in.

The types in this context model are obviously skewed towards electronic interchange of business information (e.g. a high-level context type of “Geopolitical/Regulatory”), but a number of them could be reinterpreted in much broader ways, for example “Partner Role”, “Info Structural Context”, and everybody’s favorite, “Temporal”.

This is interesting because it’s a practical implementation of context that bites the bullet and categorizes all possible contexts as a fixed number of types.
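
To make “a fixed number of types” concrete, here is a toy sketch of what such a closed taxonomy might look like in code. The names are paraphrased from the context types mentioned above, not the actual ebXML schema, and the full list is longer:

    // Toy sketch: a closed enumeration of context categories, paraphrasing
    // a few of the types from ebXML's context table (not the real schema).
    enum ContextType {
        GEOPOLITICAL_REGULATORY,
        PARTNER_ROLE,
        INFO_STRUCTURAL,
        TEMPORAL
    }

    // A context assertion pairs one fixed category with a free-form value.
    record ContextAssertion(ContextType type, String value) {}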

Notes on Questions About Missing Data

  • Which data is missing?
    • Usually, this should be smack-you-in-the-face obvious.  Something that is supposed to be there isn’t there.
    • Knowing that something is missing presumes that you knew it was supposed to be there.
    • “Lies of omission, and incomplete truths” – sometimes information that is missing isn’t just a blank, it’s the absence of precision or detail.
    • Context determines what should be there.  The producer’s context and the consumer’s context may not match, and the fact that something is missing is evidence for this context mismatch.
  • Where is it missing from?
    • What is the scope? Are we dealing with web pages, structured databases, PDF files, or the entire universe of data?
    • Do particular forms of data (structured, unstructured) have particular characteristics that lend themselves to analysis of missing-information questions?
    • Should the issue of things that are missing be narrowed to data (at the expense of “information”)?
  • Why is the data missing?
    • An exhaustive taxonomy of missing value reasons is likely impossible, if you accept that the number of contexts is unlimited.
    • Still, a taxonomy may be able to generalize reasons into buckets and cover vast swaths of the “reason space” explaining why something is missing.
    • What level of analysis is most important? Is it that an individual value in an individual observation is missing? Is it that all values for a particular field are missing? Or is it global absence, where it’s not an individual data field but the whole data asset that’s missing? (A sketch of these three levels follows this list.)
  • Why do we care that the data is missing?
    • What valuable contextual information would the missing data have provided?
    • Are we interested in drawing an in-model conclusion (e.g. what the value should be, or how that missingness impacts other values)?
    • Are we interested in drawing an out-model conclusion (e.g. where the conclusion’s impact is completely outside the data set where something was missing)?
  • Given the above four questions, what conclusions can we draw?
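
As promised in the “level of analysis” question above, the three levels of absence can be named explicitly. A toy sketch; the names are my own, not an established taxonomy:

    // Toy sketch: three levels at which data can be missing, per the
    // "level of analysis" question above. Names are my own invention.
    enum MissingnessScope {
        VALUE,   // a single value in a single observation is missing
        FIELD,   // all values for a particular field are missing
        ASSET    // the entire data asset is absent
    }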