Ontology Game: Humans Matching Concepts

A new “ontology game” has recently been announced: a “game with a purpose” that enlists humans to categorize objects according to a formal ontology.

How it Works

The game operates in a way that’s similar to Google’s image tagging application: pairs of users who do not know one another are presented with the abstract from a Wikipedia page, and they have to choose categories in an upper ontology that accurately describe the article. (E.g. does it correspond to an abstract concept? An agent? A happening?) Players get points when both choose the same answer to categorize an article. As the game goes on, the categorization gets more and more specific until it “bottoms out” in the upper ontology. At that point, you jump to a new article and start the process over again.

Gameplay

In terms of gameplay, it feels a little rough, in part because the game doesn’t choose the articles very intelligently. (In one case, I got the same article twice in a row.) Also, after you tag 5-6 different articles, you have a good working knowledge of the taxonomy of the upper ontology, and it becomes less fun as the game devolves into categorization along lines you’ve seen many times before. The key difference from Google’s image tagging game is that in Google’s game, people enter free-form words, so your input is almost limitless. Oh, and one other thing – in order to categorize properly, you have to read the 2-3 sentence descriptions of what the categories mean, which can take some time the first time around when you have 6-7 categories to choose from.

These don’t appear to me to be fatal problems for the game, though – just teething problems. It could be fun if the data set were widened substantially and the category choice perhaps narrowed a bit. And of course, in the background, they’re building an interesting data set mapping Wikipedia articles to the high-level concepts they represent.

Background

Here’s the original announcement email from Martin Hepp at DERI:

We are proud to release the first one in our series of online computer
games that turn core tasks of weaving the Semantic Web into challenging
and interesting entertainment – check it out today at http://www.ontogame.org/

A very early paper written in late Summer is in Springer LNCS Vol.
4806, 2007, pp. 1222-1232 [1].

A complete Technical Report including our quantitative evidence and video footage will be released shortly on our project Web page at http://www.ontogame.org/

The next series of games for other tasks of building the Semantic Web is already in the pipeline, so please stay tuned 🙂

Please subscribe to our OntoGame mailing list if you want to be informed once new gaming scenarios or results are available. See [2] for details on how to subscribe.

What is it good for?
====================
Despite significant advancement in technology and tools, building ontologies, annotating data, and aligning multiple ontologies remain tasks that highly depend on human intelligence, both as a source of domain expertise and for making conceptual choices. This means that people need to contribute time, and sometimes other resources, to this endeavor.

As a novel solution, we have proposed to masquerade core tasks of weaving the Semantic Web behind on-line, multi-player game scenarios, in order to create proper incentives for humans to contribute. Doing so, we adopt the findings from the already famous “games with a purpose” by von Ahn, who has shown that presenting a useful task, which requires human intelligence, in the form of an on-line game can motivate a large amount of people to work heavily on this task, and this for free.

Since our first experiments in May 2007, we have gained preliminary evidence that (1) users are willing to dedicate a lot of time to those games, (2) are able to produce high-quality conceptual choices, and, by doing so, (3) can unknowingly weave the Semantic Web.

Acknowledgments: OntoGame is possible only thanks to the hard work of the OntoGame team – special thanks to Michael Waltl, Werner Huber, Andreas Klotz, Roberta Hart-Hiller, and David Peer for their dedication and continuous contributions! The work on OntoGame has been funded in part by the Austrian BMVIT/FFG under the FIT-IT Semantic Systems project myOntology (grant no. 812515/9284), http://www.myontology.org, which we gratefully acknowledge.

And now…. play and enjoy!

Best wishes

Martin Hepp and Katharina Siorpaes

“Unpacking” implicit data values

One of the areas where missing data is most often made explicit is in “unpacking” data values.

Recently, I’ve been working with a system called Trio, which integrates uncertainty information and lineage with data into a database. The idea is that you can express fuzzier tuples in a relation. Instead of saying “Mary saw a Toyota”, you can assign it 60% confidence, or even add an alternative – either Mary saw a Toyota, or Mary saw a Honda (but not both).

Trio separates each of the data items in a table into two groups: those that are certain, and those that are uncertain. Unsurprisingly, uncertain data items may have a confidence stored along with them. The more interesting case is what gets put into the “certain” category, and why. As far as I can tell, the “certain” table has no formal definition; it is just the set of data items that don’t need a confidence associated with them. So in fact we’re “packing” several different use cases into that designation of a data item as “certain”:

  • Honest-to-god certain, meaning that it has a confidence rating of 100%
  • Certainty information is unknown, but assumed to be certain
  • Certainty information doesn’t apply

Many people will reasonably point out that it’s OK for there to be nulls, and for them to have some special meaning; it’s just important that their meaning be consistent. When the meaning of nulls can’t be consistent (because the semantics of the domain require that nulls differ in meaning based on context), you have a missing data problem. The common approach is then to go “unpack” those null values and enumerate their actual meanings so that they can be stored alongside the data in the future.
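To make the “unpacking” idea concrete, here’s a minimal sketch in Java of what an explicit reason code next to each value might look like. The enum values and field names are my own illustration, not Trio’s actual schema:

// A minimal sketch: store an explicit reason code alongside the value instead of
// overloading a bare "certain" designation. Names are illustrative, not Trio's schema.
enum CertaintyStatus {
    CERTAIN,              // honest-to-god certain: confidence is effectively 100%
    ASSUMED_CERTAIN,      // no confidence information recorded, treated as certain
    NOT_APPLICABLE,       // confidence does not apply to this item
    UNCERTAIN             // an explicit confidence value is stored
}

class Observation {
    final String value;            // e.g. "Toyota"
    final CertaintyStatus status;  // why there is (or isn't) a confidence
    final Double confidence;       // null unless status == UNCERTAIN

    Observation(String value, CertaintyStatus status, Double confidence) {
        this.value = value;
        this.status = status;
        this.confidence = confidence;
    }
}

With the reason stored explicitly, “Mary saw a Toyota” that is genuinely certain can be distinguished from “Mary saw a Toyota” where the confidence simply wasn’t captured.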

Other background – see also the caBIG “Missing Value Reasons” paper, and the flavors of null discussion.

Ontology Matching Book

Ontology Matching

Euzenat and Shvaiko’s book is devoted to ontology matching as a solution to the semantic heterogeneity problem faced by computer systems. Ontology matching finds correspondences between semantically related entities of different ontologies. These correspondences may stand for equivalence as well as other relations, such as consequence, subsumption, or disjointness between ontology entities. Many different matching solutions have been proposed so far from various viewpoints, e.g., databases, information systems, and artificial intelligence.

With Ontology Matching, researchers and practitioners will find a reference book that presents currently available work in a uniform framework. In particular, the presented work and techniques can equally be applied to database schema matching, catalog integration, XML schema matching and other related problems. The book presents the state of the art and the latest research results in ontology matching by providing a detailed account of matching techniques and matching systems in a systematic way from theoretical, practical and application perspectives.

Missing Data and Causal Chains

One of my colleagues today suggested an interesting way of looking at the problem of missing data. He referred back to a lot of work on process modeling, where people in essence try to “reverse engineer” existing business processes.

Let’s say you discover a process, and you are able to identify steps 1, 2, and 4 of the process. Obviously the analyst knows that there was a step 3 somewhere, and the name of the game from that point becomes locating and describing step 3 – the missing data – of the process.

More broadly, steps in a process, or pieces of information that are gathered, are part of some causal chain. If you can identify the causal chain that the missing data belongs to, you at least have a framework for understanding how the missing data relates to other observations, and a starting point for asking the question “what does it mean that this information is missing?” This causal chain might be thought of as the context surrounding the missing data.

Judea Pearl and Causality

Judea Pearl, one of the more influential authors on knowledge representation, causal reasoning, and AI, wrote a book in 2000 titled “Causality: Models, Reasoning, and Inference”. (Dr. Pearl is also, as it happens, Daniel Pearl’s father.)

On his page that describes why he wrote the book, he relates an interesting anecdote about the scientific community’s avoidance of discussion about causality, and how that is a problem. His book was intended to help “students of statistics who wonder why instructors are reluctant to discuss causality in class; and students of epidemiology who wonder why simple concepts such as ‘confounding’ are so terribly complex when expressed mathematically”. His summation:

“Causality is not mystical or metaphysical.
It can be understood in terms of simple processes,
and it can be expressed in a friendly mathematical
language, ready for computer analysis.”

Comparing Upper Ontology Definitions of “Information Resource”

Copia has an interesting post about comparing the definitions of “information resource” in various upper ontologies.

Just looking at the semantic differences in how a single simple concept is defined across all of these upper ontologies is a bit of a reality check.  I have read papers that claim that the data interoperability problem will be solved by simply mapping all of the upper ontologies together, and having each domain describe its data in terms of a particular ontology that uses the upper ontology of its choice.  Looking at the complexity and subtlety of just “information resource” across these different conceptualizations makes that approach look pretty silly.

Trends in usage – data/information that fades from common usage

A very broad category of data that we’ve talked about before is everything that simply fades from common usage because of changes in convention.  This is certainly an obvious reason why something might not be there, but it is in some ways the reverse of information that is so commonly used that it is no longer explicitly noted, i.e., evolving standards.

In a subsequent post I’ll talk about specific examples, such as evolving knowledge and frames of reference in science, and the shift in standards for research.

Missing data and Tacit Knowledge

Searching on “Tacit” along with data terms provides a wealth of links to discussions that fit in this domain.  However, tacit knowledge has a meaning that is a bit off from what we are going after.

Unified Data Feed on Web2Express

AJ Chen has just released a new service that allows users to create data feeds that can be published in RSS 2.0, RDF, and ATOM depending on the user’s preference.

His announcement email to the W3C semantic web list:

“One of the big challenges facing semantic web is to encourage people to put out their data in semantic format on the web. I have been looking into practical areas where semantic web can make real difference.  Datafeed is such an area, I believe. Yesterday, I just released a  free new online tool – unified data feed on web2express.org. I hope people will find it useful for creating data feeds for products, news, events, jobs and studies. Besides the feeds, all of the data are also openly available as RDF. “

Worth checking out.  Aside from repackaging existing feeds in a more flexible format, he has a number of feeds for various products and other entities.  The secret sauce of course is whether or not you can get people to agree on using a particular set of schemas for representing, say, product price.  From what I can tell, he’s coming up with his own microformats for that.

TANSTAAFL: The No-Free-Lunch Theorem

I came across this interesting tidbit while reading one of Numenta’s papers on their HTM (Hierarchical Temporal Memory) approach.

The No-Free-Lunch Theorem:   “no learning algorithm has an inherent advantage over another learning algorithm for all classes of problems.  What matters is the set of assumptions an algorithm exploits in order to learn the world it is trying to model.”

Ho, Y. C. & Pepyne, D. L. (2002), “Simple Explanation of the No-Free-Lunch Theorem and Its Implications”, Journal of Optimization Theory and Applications, 115(3), 549-570.

Or, as Robert Heinlein put it a while back, TANSTAAFL.  There Ain’t No Such Thing As A Free Lunch.

Culturally Embedded Computing Group (i.e. cultural assumptions)

The site below offers a look into research about the culturally derived assumptions we make when designing systems.
http://cemcom.infosci.cornell.edu/home.php
—————-

Cornell University Faculty of Information Science and Department of Science & Technology Studies

We analyze, design, build, and evaluate computing devices as they relate to their cultural context. We analyze the ways in which technologies reflect and perpetuate unconscious cultural assumptions, and design, build, and test new computing devices that reflect alternative possibilities for technology. We have a focus on reflective design, or design practices that help both users and designers reflect on their experiences and the role technology plays in those experiences.

Our primary focus is the environment; we are exploring the role IT could play to reconfigure our cultural relationship to the environment. We have worked extensively on affective computing, to develop approaches in which the full complexity of human emotions and relationships as experienced by users is central to design (rather than the extent to which computers can understand and process those emotions).

We draw from, contribute to, and mutually inform the technical fields of human computer interaction and artificial intelligence and the humanist/sociological fields of cultural and science & technology studies.

Jena Application Architecture Issues: Querying across models with SPARQL

Continuing to work on my Jena-based application, I’ve run into a snag that has to do with how to architect the application for the best performance. They say that in programming, “premature optimization is the root of all evil.” Generally that’s true, but it’s no excuse to pick an architecture that hamstrings your application from the start.

My application stores many different RDF data models. Each model may have a separate associated “annotation model” that contains a set of additional triples referring back to the original model.

The question is: how should I best implement this model soup so that users can search for specific triples across all models? Let’s say you’re searching for a tag “foo” that has been associated with a particular model or resource. In my version of RDF, that translates quite easily to the triple:

<http://a.com/some-resource> <http://b.com/#tagged> "foo".

So all you have to do is search all available models for statements matching:
(?) b:tagged “foo”

Easy enough. But how do you implement such a search? By default, Jena stores different models in different relational tables in the underlying database. There is no single unified relational table (that I can find) where you can search for “foo”. There is no implicit unified model that contains all of the triples of all available models. From what I can tell, my options are:

(1) Create an uber-model that is basically a union of all of the models I have, and search only that model.
(2) Issue the “foo” search n times, where n is the number of models I have; once for each model.
(3) Don’t separate the models in the first place – only ever have one uber-model, and create logical separations in the model by adding additional triples. In other words, smash all models together into one giant model from the start, and create the illusion that they’re separate with additional metadata that allows the application to figure out which triples belong to which “sub-model”.

Option (1) is lousy, because you end up storing every triple twice; once for the model it’s in, and once for the uber-model. Option (2) is good, except that query performance sucks. Additionally, you’ll be wasting time searching many models that won’t have any hits at all. Option (3) is horrendous – complicated to implement, unwieldy, and requiring a lot of code to manage different things that Jena would normally do for me.

SPARQL explicitly supports option (2) with named graphs, and it currently seems like the best option. Jena also provides a way of indexing text with Apache’s Lucene, which doubtless will improve performance, but doesn’t change the architectural problem of having to search n different models for a single query.
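As a rough sketch of what the named-graph approach looks like with Jena’s ARQ (assuming a current Apache Jena release – older versions use the com.hp.hpl.jena packages – and with made-up model file names and graph URIs):

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class NamedGraphSearch {
    public static void main(String[] args) {
        // Load each model and register it as a named graph in a single dataset.
        Dataset dataset = DatasetFactory.create();
        Model m1 = ModelFactory.createDefaultModel();
        m1.read("file:model1.rdf");                        // hypothetical file names
        Model m2 = ModelFactory.createDefaultModel();
        m2.read("file:model2.rdf");
        dataset.addNamedModel("http://example.org/graphs/model1", m1);
        dataset.addNamedModel("http://example.org/graphs/model2", m2);

        // One SPARQL query that searches every named graph for the tag "foo".
        String q =
            "PREFIX b: <http://b.com/#> " +
            "SELECT ?g ?s WHERE { GRAPH ?g { ?s b:tagged \"foo\" } }";

        try (QueryExecution qexec = QueryExecutionFactory.create(q, dataset)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("g") + " : " + row.get("s"));
            }
        }
    }
}

With the models loaded as named graphs, the single GRAPH query replaces the n separate searches of option (2), though the underlying storage question remains.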

I have been getting good support on the jena-dev mailing list, but I have yet to check out Andy’s SDB, which promises to have support for union queries, a potential extra option.

ebXML’s Context Table

ebXML’s context table outlines the different classes of context that the ebXML model recognizes.  “Context” in this case is business context – the set of circumstances around the transmission of business process related information.

ebXML Context

Click the image for a larger-sized picture of what a context representation in ebXML might look like, and where it fits in.

The categories in this context model are obviously skewed towards electronic interchange of business information (e.g. a high-level context type of “Geopolitical/Regulatory”), but it does include a number of types that could be reinterpreted in much broader ways, for example “Partner Role”, “Info Structural Context”, and everybody’s favorite – “Temporal”.

This is interesting because it’s a practical implementation of context that bites the bullet and categorizes all possible contexts as a fixed number of types.

Notes on Questions about missing data

  • Which data is missing?
    • Usually, this should be smack-you-in-the-face obvious.  Something that is supposed to be there isn’t there.
    • Knowing that something is missing presumes that you knew it was supposed to be there.
    • “Lies of omission, and incomplete truths” – sometimes information that is missing isn’t just a blank, it’s the absence of precision or detail.
    • Context determines what should be there.  The producer’s context and the consumer’s context may not match, and the fact that something is missing is evidence for this context mismatch.
  • Where is it missing from?
    • What is the scope? Are we dealing with web pages, structured databases, PDF files, or the entire universe of data?
    • Do particular forms of data (structured, unstructured) have particular characteristics that lend themselves  to analysis of missing information questions?
    • Should the issue of things that are missing be narrowed to data (to the detriment of “information”)?
  • Why is the data missing?
    • An exhaustive taxonomy of missing value reasons is likely impossible, if you accept that the number of contexts is unlimited.
    • Still, a taxonomy may be able to generalize reasons into buckets and cover vast swaths of the “reason space” explaining why something is missing.
    • What level of analysis is most important?  Is it that an individual value in an individual observation is missing?  Is it that all values for a particular field are missing?  How about global absence of data (i.e. it’s not an individual data field that’s missing, it’s the whole data asset)?
  • Why do we care that the data is missing?
    • What valuable contextual information would the missing data have provided?
    • Are we interested in drawing an in-model conclusion (e.g. what the value should be, or how that missingness impacts other values)?
    • Are we interested in drawing an out-model conclusion (e.g. where the conclusion’s impact is completely outside the data set where something was missing)?
  • Given the above three questions, what conclusions can we draw?

k-anonymity and l-diversity

l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer

This is a great article on how to improve privacy for individuals in datasets that are disseminated. Quick summary:

The paper talks about “quasi-identifiers” – combinations of attributes within the data that can be used to identify individuals. For example, the statistic given is that 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code. Given that three-attribute “quasi-identifier”, a dataset that has only one record with any given combination of those fields is clearly not anonymous – most likely it identifies someone. Datasets are “k-anonymous” when, for any given quasi-identifier, a record is indistinguishable from at least k-1 others.

The next concept is “l-diversity”. Say you have a group of k different records that all share a particular quasi-identifier. That’s good, in that an attacker cannot identify the individual based on the quasi-identifier. But what if the value they’re interested in (e.g. the individual’s medical diagnosis) is the same for every record in the group? There are 7 different records in the group, and I don’t know which one of them is Bob Smith, but since I know that all of them are flagged with a diagnosis of cancer, the data has “leaked” that Bob Smith has cancer. (Figuring this out is, unsurprisingly, called a “homogeneity attack”.) The diversity of sensitive values within a group is what “l-diversity” measures.

The paper outlines the mathematical underpinnings of what l-diversity is, and shows that it is practical and can be implemented efficiently.
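As a back-of-the-envelope illustration (my own sketch in Java, not code from the paper), here’s how you might measure k and l for a toy table where each record is reduced to a quasi-identifier string plus one sensitive value:

import java.util.*;

class AnonymityCheck {
    // Each record is {quasiIdentifier, sensitiveValue}.
    // k = size of the smallest quasi-identifier group,
    // l = smallest number of distinct sensitive values in any group.
    static int[] kAndL(List<String[]> records) {
        Map<String, Integer> groupSizes = new HashMap<>();
        Map<String, Set<String>> sensitiveByGroup = new HashMap<>();
        for (String[] r : records) {
            groupSizes.merge(r[0], 1, Integer::sum);
            sensitiveByGroup.computeIfAbsent(r[0], key -> new HashSet<>()).add(r[1]);
        }
        int k = Collections.min(groupSizes.values());
        int l = sensitiveByGroup.values().stream().mapToInt(Set::size).min().getAsInt();
        return new int[]{k, l};
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"M|1970|13053", "cancer"},
            new String[]{"M|1970|13053", "cancer"},        // homogeneous group: k = 2 but l = 1
            new String[]{"F|1968|13068", "flu"},
            new String[]{"F|1968|13068", "heart disease"});
        int[] kl = kAndL(rows);
        System.out.println("k = " + kl[0] + ", l = " + kl[1]);   // prints k = 2, l = 1
    }
}

The homogeneity attack is exactly the k = 2, l = 1 case: the group hides which record is Bob’s, but not his diagnosis.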


Improving both k-anonymity and l-diversity requires fuzzing the data a little bit.  Broadly, there are three ways you can do this:

  • You can generalize the data to make it less specific.  (E.g. the age “34” becomes “30-40”, or a diagnosis of “Chronic Cough” becomes “Respiratory Disorder” – see the small sketch after this list.)
  • You can suppress the data.  Simply delete it.  (Which leads us into our host of “missing data” questions)
  • You can perturb the data.  The actual value can be replaced with a random value drawn from the observed distribution of values for that field.  In this way, the overall distribution of values for that field will remain the same, but the individual data values will be wrong.
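As a tiny example of the first technique (my own illustration, not from the paper), generalizing an exact age into a coarser range, as in the “34” becomes “30-40” example above, might look like:

class Generalize {
    // Mirrors the "34" becomes "30-40" example above; a real scheme would use
    // non-overlapping buckets and a full generalization hierarchy per attribute.
    static String generalizeAge(int age) {
        int low = (age / 10) * 10;
        return low + "-" + (low + 10);
    }

    public static void main(String[] args) {
        System.out.println(generalizeAge(34));   // prints 30-40
    }
}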

Heilmeier’s Catechism: How to make your presentation matter

Found this on the web – Heilmeier’s Catechism.  The key questions you need to answer in order to pitch a good idea.  It doesn’t matter whether the idea is a research proposal, a new business, or a plan to colonize Mars.

I believe this was originally written by an engineer, so I’ll attempt to translate it into business-speak in the added parenthetical comments to show the parallels to business and even venture capital.

  • What are you trying to do? Articulate your objectives using absolutely no jargon.  (Mission or vision)
  • How is it done today, and what are the limits of current practice? (Market research)
  • What’s new in your approach and why do you think it will be successful?  (Your intellectual capital)
  • Who cares? If you’re successful, what difference will it make?  (Value-add to customers)
  • What are the risks and the payoffs? (What’s the ROI?  Show me the money)
  • How much will it cost? How long will it take? (Investment)
  • What are the midterm and final “exams” to check for success?  (Metrics)

Annotate anything, anywhere

If you are interested in annotations on ontologies or any other sort of resource, check out the W3C Project Annotea. They are working towards making it possible to load annotations alongside any web resource.  Imagine going to a web page, and clicking on a toolbar button to reveal annotations made by any number of people across the world.  I assume some other mechanisms would be necessary to make sure the annotations, bookmarks, and marginalia were relevant to you, but you get the idea.

They have built an annotation schema for describing the RDF structure of those annotations, and some basic software for creating annotations and storing them on an annotation server.  The Mozilla project is hosting Annozilla, an attempt to integrate Annotea into Mozilla and Firefox.

Each annotation can then become a discussion thread of its own.  Imagine reading a web page on history, and noticing that someone else has posted an annotation that suggests a linkage to some other event that happened in history.  You would then be able to reply to that annotation, ask a question, or dispute its accuracy.

In these kinds of discussions, the discussion’s context would be extremely localized.  An entire thread of discussion might take place about a particular phrase or passage in a document.  Such discussions have always taken place; the difference is that with stored annotations, there is explicit traceability back to the source data that provides “context” for the discussion.  Of course you’ll always have issues such as when the discussion topic drifts.  Still, tracking the seed idea or piece of information that started the discourse has its own benefits.

Links to Statistical Approaches to Missing Data

Very good blog post:  Data Mining and Predictive Analytics – Missing Values and Special Values:  The Plague of Data Analysis.

This post has a number of references to other writing on the topic, such as Working with missing values.

One of the interesting points about this blog post and other documents linked off of it is that it seems the NMAR, MAR, MCAR approach to categorizing missing values does have some weight in the field.

Quick Summarization of Data Mining Approaches to Handling Missing Data

Data Mining Research Blog – Handling Missing Values

A few observations about these approaches, which illustrate the predispositions of data mining:

  • They don’t seem to care about what the value of the missing data is; they primarily care about the missing data’s impact on the value of the particular data observation (or row)
  • For expediency, they tend to assume that missing values will be statistically distributed similarly to how the non-missing (or observed) values are distributed
  • There is a focus on a large corpus of observations; the impact of the individual observation is small.

These are all reasonable constraints given what data mining is doing.  As I discover this kind of thing, though, I’m trying to keep it documented, because these themes would probably be interesting to contrast with an approach aimed at using missing values as an information channel.

RDF123: Generate Flexible RDF From Spreadsheets

A new application called RDF123 has recently been made available, and has gotten a lot of well-deserved positive buzz from the semantics and RDF community.  Basically, RDF123 allows you to take spreadsheets as an input, and produce RDF data as an output.  Lushan Han, a Ph.D. candidate at UMBC, deserves the credit.

This capability is a big deal because of the amount of useful data in the world that’s locked up in spreadsheets.  This data is usually accessible only to applications that know how to read its proprietary binary format.  Converting the data into RDF may be a win in many cases because it provides access to the data in an open format (RDF) and allows marking the data up with a structured schema or with linkages to an ontology to permit new uses of the data.  Maybe you don’t buy that people are going to convert spreadsheets into RDF and then do reasoning with the data, and that’s fine.  You don’t have to do anything fancy – just having the data in RDF makes it much more flexible for any kind of machine processing.  This is of course in contrast to regular spreadsheets, which are typically designed primarily for human visual consumption.

Here’s how the project summarizes itself, for a few basic details:

RDF123 is an application and web service for converting data in simple spreadsheets to an RDF graph. Users control how the spreadsheet’s data is converted to RDF by constructing a graphical RDF123 template that specifies how each row in the spreadsheet is converted as well as metadata for the spreadsheet and its RDF translation. The template can map spreadsheet cells to a new RDF node or to a literal value. Labels on the nodes in the map can be used to create blank nodes or labeled nodes, attach a XSD datatype, and invoke simple functions (e.g., string concatenation). The graph produced for the spreadsheet is the union of the sub-graphs created for each row. The template itself is stored as a valid RDF document encouraging reuse and extensibility.
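To give a feel for the row-to-graph idea, here’s a rough Jena sketch of my own (this is not RDF123’s actual template language, and the namespace and URIs are made up) that turns one spreadsheet row into a small sub-graph:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class RowToRdf {
    public static void main(String[] args) {
        String ns = "http://example.org/vocab#";           // made-up vocabulary namespace
        Model model = ModelFactory.createDefaultModel();
        Property name = model.createProperty(ns, "name");
        Property price = model.createProperty(ns, "price");

        // One spreadsheet row: id, product name, price.
        String[] row = {"42", "Widget", "9.99"};

        // Each row becomes one resource with one triple per mapped column;
        // the graph for the whole spreadsheet is the union of the per-row sub-graphs.
        Resource product = model.createResource("http://example.org/product/" + row[0]);
        product.addProperty(name, row[1]);
        product.addProperty(price, model.createTypedLiteral(Double.parseDouble(row[2])));

        model.write(System.out, "TURTLE");
    }
}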

If you’re interested in more information on RDF123, eBiquity posted about it, and so did AI3 as well as Aman.  RDF123 was funded out of an NSF Grant (details on their grant) that includes a number of usual suspects (Timothy Finin, Jim Hendler).

Ontology Annotation Use Cases

I recently found another interesting link dealing with ontology evolution and annotation. This paper is primarily interested in annotation from the perspective of maintenance and evolution. Indeed, the Protégé collaborative software that’s being developed is focused on making suggestions for changes to a single existing ontology.

Context is more than location

Context is more than location.  Schmidt, Beigl, Gellersen.

These chaps break down the concept of “context”, as it relates to humans using mobile computing devices, into this hierarchy:

  • Human Factors
    • User (knowledge of habits, emotional state, biophysiological conditions, …)
    • Social Environment (co-location of others, social interaction, group dynamics, …)
    • Task (spontaneous activity, engaged tasks, general goals,…)
  • Physical Environment
    • Location (absolute position, relative position, co-location,…)
    • Infrastructure (surrounding resources for computation, communication, task performance…)
    • Conditions (noise, light, pressure,…)

This paper also includes a number of good links to other work on context.

3 Fundamental Classes of Missing Data

Taken from Missingness mechanism, James Carpenter’s and Mike Kenward’s site on the statistics of missing data.

They outline three different mechanisms that would cause data to be missing:

  1. Missing Completely At Random (MCAR): when the reason for the data being missing does not depend on its value or lack of value.
  2. Missing At Random (MAR): when the reason for missing data can be explained by the observed data; after accounting for this, there is no further information in the unseen data.
  3. Not Missing At Random (NMAR): when, even after considering the information in the rest of the data, the reason for the missing information depends on the unseen information itself.

In this case, the conceptual common denominator seems to be “the relationship between the reason information is missing, and the content of the data set”.

This is an interesting differentiation to use potentially in a taxonomy of missing data reasons, because it looks at reasons from a different perspective.  (I.e. not from the perspective of the context in which the information was fetched, and also not from the perspective of whether the information is known vs. available)
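One way to see the difference between the three mechanisms is to simulate them. This is a hedged sketch of my own (not from Carpenter and Kenward), deciding whether to delete an income value under each rule:

import java.util.Random;

class Missingness {
    static final Random rng = new Random(42);

    // MCAR: income is deleted with a fixed probability, regardless of any values.
    static boolean dropMCAR() {
        return rng.nextDouble() < 0.2;
    }

    // MAR: the chance of deletion depends only on the *observed* age.
    static boolean dropMAR(int age) {
        return rng.nextDouble() < (age > 65 ? 0.5 : 0.1);
    }

    // NMAR: the chance of deletion depends on the *unseen* income itself.
    static boolean dropNMAR(double income) {
        return rng.nextDouble() < (income > 100_000 ? 0.6 : 0.05);
    }

    public static void main(String[] args) {
        int age = 70;
        double income = 150_000;
        System.out.println("MCAR drop? " + dropMCAR());
        System.out.println("MAR drop?  " + dropMAR(age));
        System.out.println("NMAR drop? " + dropNMAR(income));
    }
}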

Work progressing on Ontrospect…

As I posted about before, I have a little project I’m working on.  Over the past couple of days, I managed to implement annotations in the system, which is basically the ability to add an arbitrary RDF triple to a separate model that “describes” another model.  For example, take the pizza ontology.  Of course this file is read-only because it’s on someone else’s site.  Annotations in Ontrospect would allow you to add extra triples to another model (let’s call it “Annotations: Pizza”) that make statements about resources in the pizza ontology.

The annotation could be a regular OWL or RDF assertion (such as indicating the type of a resource, or claiming that something is a subclass of something else) but it could also be a comment, a text tag, or more importantly a link to a completely different ontology.

I’m interested in this as a prototype system for building mappings between ontologies.  Hypothetically, you could for example take two upper ontologies and create links between them using these annotations.  For example, load the SUMO ontology into the program, and then link specific classes in SUMO to classes within DOLCE-lite.  (It could be something as simple as owl:sameAs, but would probably need to be something more sophisticated)

Where this becomes useful is when you union multiple models together with their annotation models.  You can then use that as a basis for reasoning across multiple ontologies, allowing you to work with mixed data.  Another interesting consequence is that it (hopefully) could make the semantic web read/write, instead of just read-only.  The pizza ontology is out there, and you can’t change it.  What you can do is create your own set of annotations, and then publish a new model that is the pizza ontology union’d with your annotations – in effect a different model.
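In Jena terms, the union step itself is straightforward; here’s a minimal sketch of my own (the ontology URL and file name are made up, and this isn’t Ontrospect’s actual code):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class UnionExample {
    public static void main(String[] args) {
        // The read-only published ontology, e.g. the pizza ontology.
        Model base = ModelFactory.createDefaultModel();
        base.read("http://example.org/pizza.owl");          // made-up URL

        // Your own annotation model: extra triples about resources in the base model.
        Model annotations = ModelFactory.createDefaultModel();
        annotations.read("file:annotations-pizza.rdf");     // made-up file name

        // A dynamic union: queries and reasoning see both sets of triples,
        // without ever modifying the original ontology.
        Model combined = ModelFactory.createUnion(base, annotations);
        System.out.println("Triples in the combined model: " + combined.size());
    }
}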