Category Archives: General

Anything else

Trends in usage – data/information that fades from common usage

A very broad category of data that we’ve talked about before is everything that simply fades from common usage because of changes in convention.  This is certainly an obvious reason why something might not be there, but it is in some ways the reverse of information that is so commonly used that it is no longer explicitly noted, i.e., evolving standards.

In a subsequent post I’ll talk about specific examples, such as evolving knowledge and frames of reference in science and the shift in standards for research.


Missing data and Tacit Knowledge

Searching on “Tacit” along with data terms provides a wealth of links to discussions that fit in this domain.  However, tacit knowledge has a meaning that is a bit off from what we are going after.

Unified Data Feed on Web2Express

AJ Chen has just released a new service that allows users to create data feeds that can be published in RSS 2.0, RDF, and ATOM depending on the user’s preference.

His announcement email to the W3C semantic web list:

“One of the big challenges facing semantic web is to encourage people to put out their data in semantic format on the web. I have been looking into practical areas where semantic web can make real difference.  Datafeed is such an area, I believe. Yesterday, I just released a  free new online tool – unified data feed on web2express.org. I hope people will find it useful for creating data feeds for products, news, events, jobs and studies. Besides the feeds, all of the data are also openly available as RDF. “

Worth checking out.  Aside from repackaging existing feeds in a more flexible format, he has a number of feeds for various products and other entities.  The secret sauce of course is whether or not you can get people to agree on using a particular set of schemas for representing, say, product price.  From what I can tell, he’s coming up with his own microformats for that.

Culturally Embedded Computing Group (i.e. cultural assumptions)

The site below offers a look at research into the culturally derived assumptions we make when designing systems.
http://cemcom.infosci.cornell.edu/home.php
—————-

Cornell University Faculty of Information Science and Department of Science & Technology Studies

We analyze, design, build, and evaluate computing devices as they relate to their cultural context. We analyze the ways in which technologies reflect and perpetuate unconscious cultural assumptions, and design, build, and test new computing devices that reflect alternative possibilities for technology. We have a focus on reflective design, or design practices that help both users and designers reflect on their experiences and the role technology plays in those experiences.

Our primary focus is the environment; we are exploring the role IT could play to reconfigure our cultural relationship to the environment. We have worked extensively on affective computing, to develop approaches in which the full complexity of human emotions and relationships as experienced by users is central to design (rather than the extent to which computers can understand and process those emotions).

We draw from, contribute to, and mutually inform the technical fields of human computer interaction and artificial intelligence and the humanist/sociological fields of cultural and science & technology studies.

Jena Application Architecture Issues: Querying across models with SPARQL.

Continuing to work on my Jena-based application, I’ve run into a snag that has to do with how to architect the application for the best performance. They say that in programming, “premature optimization is the root of all evil”. Generally that’s true, but it’s no excuse to pick an architecture that hamstrings your application from the start.

My application stores many different RDF data models. Each model may have a separate associated “annotation model” that contains a set of additional triples referring back to the original model.
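
To make that concrete, here is a minimal sketch of the setup in Jena. The URIs and the dc:title property are just placeholders of my own, and older Jena releases use the com.hp.hpl.jena package names rather than org.apache.jena:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class AnnotationSetup {
    public static void main(String[] args) {
        // One of the "real" data models.
        Model data = ModelFactory.createDefaultModel();
        Resource doc = data.createResource("http://a.com/some-resource");
        Property title = data.createProperty("http://purl.org/dc/elements/1.1/", "title");
        data.add(doc, title, "Some resource");

        // A separate annotation model whose triples refer back to a
        // resource defined in the data model above.
        Model annotations = ModelFactory.createDefaultModel();
        Property tagged = annotations.createProperty("http://b.com/#", "tagged");
        annotations.add(annotations.createResource("http://a.com/some-resource"), tagged, "foo");
    }
}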

The question is how should I best implement this model soup such that users can search for specific triples across all models? Let’s say you’re searching for a tag “foo” that has been associated with a particular model or resource. In my RDF vocabulary, that translates quite easily to the triple:

<http://a.com/some-resource&gt; <http://b.com/#tagged&gt; “foo”.

So all you have to do is search all available models for statements matching:
(?) b:tagged “foo”

Easy enough. But how do you implement such a search? By default, Jena stores different models in different relational tables in the underlying database. There is no single unified relational table (that I can find) where you can search for “foo”. There is no implicit unified model that contains all of the triples of all available models. From what I can tell, my options are:

(1) Create an uber-model that is basically a union of all of the models I have, and search only that model.
(2) Issue the “foo” search n times, where n is the number of models I have; once for each model.
(3) Don’t separate the models in the first place – only ever have one uber-model, and create logical separations in the model by adding additional triples. In other words, smash all models together into one giant model from the start, and create the illusion that they’re separate with additional metadata that allows the application to figure out which triples belong to which “sub-model”.

Option (1) is lousy, because you end up storing every triple twice; once for the model it’s in, and once for the uber-model. Option (2) is good, except that query performance sucks. Additionally, you’ll be wasting time searching many models that won’t have any hits at all. Option (3) is horrendous – complicated to implement, unwieldy, and requiring a lot of code to manage different things that Jena would normally do for me.
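
For what it’s worth, option (2) is at least simple to write. A rough sketch, assuming the application keeps a collection of the open models in memory (the b:tagged property URI is the one from the example above):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.ResourceFactory;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class TagSearch {
    // Option (2): run the same pattern search once per model.
    static List<Statement> findTagged(Collection<Model> models, String tag) {
        Property tagged = ResourceFactory.createProperty("http://b.com/#tagged");
        List<Statement> hits = new ArrayList<>();
        for (Model m : models) {
            // (?) b:tagged "foo" -- any subject, fixed property, fixed literal
            StmtIterator it = m.listStatements(null, tagged, tag);
            while (it.hasNext()) {
                hits.add(it.nextStatement());
            }
        }
        return hits;
    }
}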

SPARQL explicitly supports option (2) with named graphs, and it currently seems like the best option. Jena also provides a way of indexing text with Apache’s Lucene, which doubtless will improve performance, but doesn’t change the architectural problem of having to search n different models for a single query.
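
With the models exposed as named graphs in a single Dataset (how that Dataset gets built depends on the storage layer; this sketch just assumes one is in hand), the whole search collapses into one query:

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class NamedGraphTagSearch {
    // One SPARQL query that looks inside every named graph of the dataset.
    static void findTagged(Dataset dataset, String tag) {
        // String-building is fine for a sketch; a ParameterizedSparqlString is safer.
        String sparql =
            "PREFIX b: <http://b.com/#>\n" +
            "SELECT ?g ?s WHERE { GRAPH ?g { ?s b:tagged \"" + tag + "\" } }";
        QueryExecution qe = QueryExecutionFactory.create(sparql, dataset);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("g") + " -> " + row.get("s"));
            }
        } finally {
            qe.close();
        }
    }
}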

I have been getting good support on the jena-dev mailing list, but I have yet to check out Andy’s SDB, which promises to have support for union queries, a potential extra option.

k-anonymity and l-diversity

l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer, and Venkitasubramaniam

This is a great article on how to improve privacy for individuals in datasets that are disseminated. Quick summary:

The paper talks about “quasi-identifiers” – combinations of attributes within the data that can be used to identify individuals. For example, the statistic given is that 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code. Given that three-attribute “quasi-identifier”, a dataset that has only one record with any given combination of those fields is clearly not anonymous – most likely it identifies someone. A dataset is “k-anonymous” when, for any given quasi-identifier, each record is indistinguishable from at least k-1 others.
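
Checking k under that definition is easy to sketch (Java here only to match the rest of this blog, and assuming each record has already been reduced to its quasi-identifier attributes):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KAnonymity {
    // k is the size of the smallest group of records that share a
    // quasi-identifier (here: gender, date of birth, 5-digit zip).
    // Every record is then indistinguishable from at least k-1 others.
    static long k(List<String[]> records) {
        Map<String, Long> groupSizes = new HashMap<>();
        for (String[] r : records) {
            String quasiId = r[0] + "|" + r[1] + "|" + r[2]; // gender | dob | zip
            groupSizes.merge(quasiId, 1L, Long::sum);
        }
        return groupSizes.values().stream().mapToLong(Long::longValue).min().orElse(0);
    }
}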

The next concept is “l-diversity”. Say you have a group of k different records that all share a particular quasi-identifier. That’s good, in that an attacker cannot identify the individual based on the quasi-identifier. But what if the value they’re interested in (e.g. the individual’s medical diagnosis) is the same for every record in the group? Say there are 7 different records in a group, and I don’t know which one of them is Bob Smith’s, but since I know that all of them are flagged with a diagnosis of cancer, the data has “leaked” that Bob Smith has cancer. (Figuring this out is, unsurprisingly, called a “homogeneity attack”.) The diversity of sensitive values within a group is what “l-diversity” measures.
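
In the simplest “distinct values” reading, l for a dataset is the smallest number of distinct sensitive values found in any quasi-identifier group. The paper develops stronger variants than this, but the sketch below captures the homogeneity-attack intuition:

import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class LDiversity {
    // sensitiveByGroup maps each quasi-identifier group to the list of
    // sensitive values (e.g. diagnoses) of the records in that group.
    // A group where every record carries the same diagnosis has l = 1,
    // which is exactly the homogeneity-attack case described above.
    static int l(Map<String, List<String>> sensitiveByGroup) {
        int min = Integer.MAX_VALUE;
        for (List<String> values : sensitiveByGroup.values()) {
            min = Math.min(min, new HashSet<>(values).size());
        }
        return sensitiveByGroup.isEmpty() ? 0 : min;
    }
}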

The paper outlines the mathematical underpinnings of what l-diversity is, and shows that it is practical and can be implemented efficiently.


Improving both k-anonymity and l-diversity requires fuzzing the data a little bit.  Broadly, there are three ways you can do this (a rough sketch of each follows the list):

  • You can generalize the data to make it less specific.  (E.g. the age “34” becomes “30-40”, or a diagnosis of “Chronic Cough” becomes “Respiratory Disorder”.)
  • You can suppress the data.  Simply delete it.  (Which leads us into our host of “missing data” questions)
  • You can perturb the data.  The actual value can be replaced with a random value out of the standard distribution of values for that field.  In this way, the overall distribution of values for that field will remain the same, but the individual data values will be wrong.
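
Here is that sketch; the age banding and the sampling-based perturbation are illustrative choices of mine, not what the paper prescribes:

import java.util.List;
import java.util.Random;

public class Fuzzing {
    // Generalize: replace an exact value with a coarser one, e.g. 34 -> "30-40".
    static String generalizeAge(int age) {
        int lo = (age / 10) * 10;
        return lo + "-" + (lo + 10);
    }

    // Suppress: delete the value outright (and create a missing-data problem).
    static String suppress(String value) {
        return null;
    }

    // Perturb: replace the value with one drawn from the field's overall
    // distribution, so column statistics survive but individual cells are wrong.
    static String perturb(List<String> columnValues, Random rng) {
        return columnValues.get(rng.nextInt(columnValues.size()));
    }
}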

Heilmeier’s Catechism: How to make your presentation matter

Found this on the web – Heilmeier’s Catechism.  The key questions you need to answer in order to pitch a good idea.  It doesn’t matter whether the idea is a research proposal, a new business, or a plan to colonize Mars.

I believe this was originally written by an engineer, so I’ll attempt to translate it into business-speak in the parenthetical comments added below, to show the parallels to business and even venture capital.

  • What are you trying to do? Articulate your objectives using absolutely no jargon.  (Mission or vision)
  • How is it done today, and what are the limits of current practice? (Market research)
  • What’s new in your approach and why do you think it will be successful?  (Your intellectual capital)
  • Who cares? If you’re successful, what difference will it make?  (Value-add to customers)
  • What are the risks and the payoffs? (What’s the ROI?  Show me the money)
  • How much will it cost? How long will it take? (Investment)
  • What are the midterm and final “exams” to check for success?  (Metrics)