Category Archives: Links

Links to sites with information, background, or research on missing data.

Ontology Game: Humans Matching Concepts

A new “ontology game” has recently been announced: a “game with a purpose” that enlists humans to categorize objects properly according to a formal ontology.

How it Works

The game operates in a way that’s similar to Google’s image tagging application: pairs of users who do not know one another are presented with the abstract from a Wikipedia page, and they have to choose categories in an upper ontology that accurately describe the article (e.g., does it correspond to an abstract concept? An agent? A happening?). Both players score points when they choose the same answer to categorize an article. As the game goes on, the categorization gets more and more specific until it “bottoms out” at the leaves of the upper ontology. At that point, you jump to a new article and start the process over again.
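The consensus-scoring loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the game's actual code: the category names, point value, and dictionary-based ontology are all assumptions.

```python
# Toy two-level "upper ontology": a node maps to its child categories;
# leaves don't appear as keys. Names are illustrative, not from the game.
UPPER_ONTOLOGY = {
    "Entity": ["Abstract Concept", "Agent", "Happening"],
    "Agent": ["Person", "Organization"],
}

def play_round(answers_a, answers_b, points=10):
    """Walk both players down the ontology from the root; both earn
    points at each level where they agree, and the round ends on the
    first disagreement or when a leaf category is reached."""
    score, node = 0, "Entity"
    while node in UPPER_ONTOLOGY:
        pick_a, pick_b = answers_a[node], answers_b[node]
        if pick_a != pick_b:
            break              # disagreement ends the round
        score += points        # consensus earns points
        node = pick_a          # descend into the agreed category
    return score, node
```

Running a round where both players pick “Agent” and then “Person” yields two levels of agreement and bottoms out at a leaf.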


In terms of gameplay, it feels a little rough, in part because the game doesn’t choose the articles very intelligently. (In one case, I got the same article twice in a row.) Also, after you tag five or six different articles, you have a good working knowledge of the taxonomy of the upper ontology, and the game becomes less fun as it devolves into categorization along lines you’ve seen many times before. The key difference from Google’s image tagging game is that there, people enter free-form words, so the input space is almost limitless. One other thing: in order to categorize properly, you have to read the two-to-three-sentence descriptions of what the categories mean, which can take some time on the first pass when you have six or seven categories to choose from.

These don’t appear to me to be fatal problems for the game, though, just teething problems. It could be fun if the data set were widened substantially and the category choices perhaps narrowed a bit. And of course, in the background they’re building an interesting data set mapping Wikipedia articles to high-level concepts of what they represent.


Here’s the original announcement email from Martin Hepp at DERI:

We are proud to release the first one in our series of online computer
games that turn core tasks of weaving the Semantic Web into challenging
and interesting entertainment – check it out today at

A very early paper written in late Summer is in Springer LNCS Vol.
4806, 2007, pp. 1222-1232 [1].

A complete Technical Report including our quantitative evidence and video footage will be released shortly on our project Web page at

The next series of games for other tasks of building the Semantic Web is already in the pipeline, so please stay tuned 🙂

Please subscribe to our OntoGame mailing list if you want to be informed once new gaming scenarios or results are available. See [2] for details on how to subscribe.

What is it good for?
Despite significant advancement in technology and tools, building ontologies, annotating data, and aligning multiple ontologies remain tasks that highly depend on human intelligence, both as a source of domain expertise and for making conceptual choices. This means that people need to contribute time, and sometimes other resources, to this endeavor.

As a novel solution, we have proposed to masquerade core tasks of weaving the Semantic Web behind on-line, multi-player game scenarios, in order to create proper incentives for humans to contribute. In doing so, we adopt the findings from the already famous “games with a purpose” by von Ahn, who has shown that presenting a useful task that requires human intelligence in the form of an on-line game can motivate a large number of people to work heavily on this task, and to do so for free.

Since our first experiments in May 2007, we have gained preliminary evidence that (1) users are willing to dedicate a lot of time to those games, (2) are able to produce high-quality conceptual choices, and, by doing so, (3) can unknowingly weave the Semantic Web.

Acknowledgments: OntoGame is possible only thanks to the hard work of the OntoGame team – special thanks to Michael Waltl, Werner Huber, Andreas Klotz, Roberta Hart-Hiller, and David Peer for their dedication and continuous contributions! The work on OntoGame has been funded in part by the Austrian BMVIT/FFG under the FIT-IT Semantic Systems project myOntology (grant no. 812515/9284), which we gratefully acknowledge.

And now… play and enjoy!

Best wishes

Martin Hepp and Katharina Siorpaes


Ontology Matching Book

Ontology Matching

Euzenat and Shvaiko’s book is devoted to ontology matching as a solution to the semantic heterogeneity problem faced by computer systems. Ontology matching finds correspondences between semantically related entities of different ontologies. These correspondences may stand for equivalence as well as other relations, such as consequence, subsumption, or disjointness between ontology entities. Many different matching solutions have been proposed so far from various viewpoints, e.g., databases, information systems, and artificial intelligence.

With Ontology Matching, researchers and practitioners will find a reference book that presents currently available work in a uniform framework. In particular, the presented work and techniques can equally be applied to database schema matching, catalog integration, XML schema matching and other related problems. The book presents the state of the art and the latest research results in ontology matching by providing a detailed account of matching techniques and matching systems in a systematic way from theoretical, practical and application perspectives.

Judea Pearl and Causality

Judea Pearl, one of the more influential authors on knowledge representation, causal reasoning, and AI, wrote a book in 2000 titled “Causality: Models, Reasoning, and Inference”. (Dr. Pearl is also, as it happens, Daniel Pearl’s father.)

On his page that describes why he wrote the book, he relates an interesting anecdote about the scientific community’s avoidance of discussion about causality, and how that is a problem. His book was intended to help “students of statistics who wonder why instructors are reluctant to discuss causality in class; and students of epidemiology who wonder why simple concepts such as ‘confounding’ are so terribly complex when expressed mathematically”. His summation:

“Causality is not mystical or metaphysical.
It can be understood in terms of simple processes,
and it can be expressed in a friendly mathematical
language, ready for computer analysis.”

TANSTAAFL: The No-Free-Lunch Theorem

I came across this interesting tidbit while reading one of Numenta’s papers on their HTM approach.

The No-Free-Lunch Theorem: “no learning algorithm has an inherent advantage over another learning algorithm for all classes of problems. What matters is the set of assumptions an algorithm exploits in order to learn the world it is trying to model.”

Ho, Y. C. & Pepyne, D. L. (2002), “Simple Explanation of the No-Free-Lunch Theorem and Its Implications”, Journal of Optimization Theory and Applications, 115(3), 549-570.

Or, as Robert Heinlein put it a while back, TANSTAAFL.  There Ain’t No Such Thing As A Free Lunch.

ebXML’s Context Table

ebXML’s context table outlines the different classes of context that the ebXML model recognizes.  “Context” in this case is business context – the set of circumstances around the transmission of business process related information.

[Image: ebXML context representation. Click through for a larger picture of what a context representation in ebXML might look like, and where it fits in.]

The types in this context model are obviously skewed towards electronic interchange of business information (e.g., a high-level context type of “Geopolitical/Regulatory”), but a number of them could be reinterpreted in much broader ways, for example “Partner Role”, “Info Structural Context”, and everybody’s favorite – “Temporal”.

This is interesting because it’s a practical implementation of context that bites the bullet and categorizes all possible contexts as a fixed number of types.
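A fixed set of context types is easy to model directly in code. The sketch below is my own illustration, not part of the ebXML specification: the type names follow the post, but the fields and class names are assumptions.

```python
# A hedged sketch of "context as a fixed set of types", in the spirit
# of the ebXML context table. Only a few of the types are shown.
from dataclasses import dataclass
from enum import Enum

class ContextType(Enum):
    GEOPOLITICAL_REGULATORY = "Geopolitical/Regulatory"
    PARTNER_ROLE = "Partner Role"
    INFO_STRUCTURAL = "Info Structural Context"
    TEMPORAL = "Temporal"

@dataclass(frozen=True)
class ContextAssertion:
    """One circumstance attached to a business-information exchange."""
    type: ContextType
    value: str

# A document's context is then just a bag of typed assertions.
doc_context = [
    ContextAssertion(ContextType.PARTNER_ROLE, "Buyer"),
    ContextAssertion(ContextType.GEOPOLITICAL_REGULATORY, "EU"),
]
```

The design choice the post highlights is exactly the one the `Enum` makes visible: every possible context must fit one of a closed set of types, rather than being open-ended.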

k-anonymity and l-diversity

l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer

This is a great article on how to improve privacy for individuals in datasets that are disseminated. Quick summary:

The paper talks about “quasi-identifiers” – combinations of attributes within the data that can be used to identify individuals. For example, the statistic given is that 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code. Given that three-attribute “quasi-identifier”, a dataset that has only one record with any given combination of those fields is clearly not anonymous – most likely it identifies someone. A dataset is “k-anonymous” when every record is indistinguishable from at least k-1 others with respect to the quasi-identifier.
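Measuring k-anonymity is straightforward: group records by their quasi-identifier values and find the smallest group. A minimal sketch, assuming records are plain dicts (the field names and data are illustrative, not from the paper):

```python
from collections import Counter

def k_anonymity(records, quasi_identifier):
    """Return the k for which this dataset is k-anonymous: the size of
    the smallest group of records sharing one combination of
    quasi-identifier values."""
    groups = Counter(tuple(r[attr] for attr in quasi_identifier)
                     for r in records)
    return min(groups.values())

rows = [
    {"gender": "F", "zip": "53715", "dx": "flu"},
    {"gender": "F", "zip": "53715", "dx": "cancer"},
    {"gender": "M", "zip": "02139", "dx": "flu"},
    {"gender": "M", "zip": "02139", "dx": "flu"},
]
print(k_anonymity(rows, ["gender", "zip"]))  # 2: each QI combination covers two records
```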

The next concept is “l-diversity”. Say you have a group of k different records that all share a particular quasi-identifier. That’s good, in that an attacker cannot identify the individual based on the quasi-identifier. But what if the value they’re interested in (e.g., the individual’s medical diagnosis) is the same for every record in the group? Suppose there are seven records in a group, and I don’t know which one of them is Bob Smith’s, but since I know that all of them are flagged with a diagnosis of cancer, the data has “leaked” that Bob Smith has cancer. (Figuring this out is, unsurprisingly, called a “homogeneity attack”.) The diversity of sensitive values within a group is referred to as “l-diversity”.
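In its simplest ("distinct") form, l-diversity can be measured by counting distinct sensitive values per quasi-identifier group; l = 1 flags a group vulnerable to the homogeneity attack above. A sketch under the same illustrative dict-record assumption as before:

```python
from collections import defaultdict

def distinct_l_diversity(records, quasi_identifier, sensitive):
    """Smallest number of distinct sensitive values in any group of
    records sharing a quasi-identifier combination. l == 1 means some
    group is homogeneous and leaks its members' sensitive value."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[attr] for attr in quasi_identifier)
        groups[key].add(r[sensitive])
    return min(len(values) for values in groups.values())

rows = [
    {"gender": "F", "zip": "53715", "dx": "cancer"},
    {"gender": "F", "zip": "53715", "dx": "cancer"},  # homogeneous group
    {"gender": "M", "zip": "02139", "dx": "flu"},
    {"gender": "M", "zip": "02139", "dx": "asthma"},
]
print(distinct_l_diversity(rows, ["gender", "zip"], "dx"))  # 1: the F/53715 group leaks "cancer"
```

Note this dataset is 2-anonymous yet only 1-diverse, which is exactly the gap the paper is about.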

The paper outlines the mathematical underpinnings of l-diversity, and shows that it is practical and can be implemented efficiently.

Improving both k-anonymity and l-diversity requires fuzzing the data a little bit.  Broadly, there are three ways you can do this:

  • You can generalize the data to make it less specific (e.g., the age “34” becomes “30-40”, or a diagnosis of “Chronic Cough” becomes “Respiratory Disorder”).
  • You can suppress the data: simply delete it (which leads us into our host of “missing data” questions).
  • You can perturb the data: replace the actual value with a random value drawn from the distribution of values for that field. The overall distribution of values for the field stays roughly the same, but the individual data values will be wrong.
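The first and third techniques above can be sketched in a few lines. These are my own toy versions (bucket width and column names are assumptions); note that sampling with replacement preserves the column's distribution only in expectation, not exactly.

```python
import random

def generalize_age(age, width=10):
    """Generalization: replace an exact age with its bucket,
    e.g. 34 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def perturb(column, rng=random):
    """Perturbation: replace each value with one sampled from the
    column's own empirical distribution, so overall frequencies are
    preserved in expectation while individual records are scrambled."""
    return [rng.choice(column) for _ in column]

print(generalize_age(34))   # '30-39'
print(perturb([170, 165, 180, 175]))  # e.g. [165, 180, 165, 170]
```

Suppression needs no code: it is just deleting the field, at the cost of the missing-data problems the post mentions.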

Heilmeier’s Catechism: How to make your presentation matter

Found this on the web: Heilmeier’s Catechism, the key questions you need to answer in order to pitch a good idea. It doesn’t matter whether the idea is a research proposal, a new business, or a plan to colonize Mars.

I believe this was originally written by an engineer, so I’ll attempt to translate into business-speak in the added parenthetical comments to show the parallels to business and even venture capital.

  • What are you trying to do? Articulate your objectives using absolutely no jargon.  (Mission or vision)
  • How is it done today, and what are the limits of current practice? (Market research)
  • What’s new in your approach and why do you think it will be successful?  (Your intellectual capital)
  • Who cares? If you’re successful, what difference will it make?  (Value-add to customers)
  • What are the risks and the payoffs? (What’s the ROI?  Show me the money)
  • How much will it cost? How long will it take? (Investment)
  • What are the midterm and final “exams” to check for success?  (Metrics)