History of an Idea: Missing Data

Entries from July 2007

Prototype screenshots: Ontrospect 0.01 alpha

July 31, 2007 · 1 Comment

Here are some screenshots of the model view, the class view, and the query view respectively.

This stuff is basically a thin web wrapper around the normal functionality that Jena provides.  Click the thumbnails for a full-size screenshot.

Screenshot 1

Screenshot 2

Screenshot 3

This query screenshot shows off Jena’s support for the SPARQL query language, which is nice.  Database persistence in this example is provided by Derby, the Apache Foundation’s all-Java RDBMS.

Categories: Jena · Software

Data Integration: Common Agreements, or Point to Points?

July 31, 2007 · Leave a Comment

 (The following are a few ideas I’ve had kicking around that I’m just now getting around to writing about.  Maybe there’s a DM Review article in here somewhere…)

General Approaches to Data Integration and Information Sharing

At work, there are usually any number of spirited arguments going on about data integration, and how it’s best done.  Granted these are somewhat extreme caricatures of the positions, but broadly, you have:

  • the SOA camp (some of whom think information is magically integrated when you use web services, as if an architectural approach is the same thing as actual integration work)
  • the mediation camp (let everyone store and transmit information in their own format, and we’ll translate somewhere in the middle)
  • the data warehouse camp (move a single copy of all data into one physical location and integrate it there)
  • the semantic camp (the format isn’t very important – get the semantics right and you will be able to translate the format with the use of that common understanding)
  • the Community of Interest/COI camp (let’s get everyone together around a table to agree on an interchange schema, and just use that)

The domain is overwhelmingly conceptualized along the lines of “point to point” connections (e.g. the mediation camp) versus the use of some centralized, agreed-upon format.  I’m not convinced this is the only way to look at the issue, but that is how most people understand it these days.

 Which to Choose?

The question is, which approach is right?  People constantly look to the internet for the answers.  They see the successes of information sharing on the web, and mistakenly assume that a model that works for home users who don’t know one another will somehow work for their large monolithic organization.  Correspondingly, there are rushes to adopt web 2.0, tagging, blogging, video sharing, or any number of other flavors of the month.

The one lesson that no one seems to take from the internet is that it never, ever chooses one model or approach to a problem.  Because it is by nature a large unregulated marketplace of ideas, the internet’s approach to data integration is both to form common agreements and to build point-to-point connections.  There is no one correct approach because the reason for information sharing differs by community, information type, and underlying purpose.

On the internet “common agreement” front, we have Microformats, OASIS, IEEE, NIST, and the W3C.  On the “proliferation of point to points” front, we have thousands of people in their basements developing their own custom schemas and publishing them to the world.  The internet manages to adapt, change, and pick the best of the litter in part because it is so large that it can afford to go in many different directions and let users choose the best solution.

Companies can’t always act as the internet does

Companies and government agencies are in a different situation though.  There is incremental cost to the organization associated with trying something new or different.  The “internet” isn’t a monolithic organization that arranges or guides investment – if it tries 1,000 different approaches simultaneously and only 5 pan out, that’s fine because the risk was borne by the individuals who tried each approach.  This approach is called “Let 1,000 flowers bloom”.  Companies and agencies frequently cannot afford to have only 5 out of 1,000 ideas pan out.  If they are lucky, they have the money to plant three flowers – but they’ll still absolutely need to have one of them come out a real winner.
So what do they do?  Develop point-to-point interfaces, or shoot for some agreed-upon central format (whether semantic or structural)?

The difficult but unavoidable issue is that the answer to this question depends on context and purpose.  The question shouldn’t even be answered outside of the confines of a specific context, because the factors that dictate which approach will actually be successful are context-specific.

Infinite contexts, and missing information

What’s really needed is some sort of general approach to analyzing context and making recommendations on data integration approaches based on context specifics. This article (and several others) provides summary tables of different data integration approaches with their pros and cons.  For example, using data warehouses to integrate data makes queries fast and makes cleaning data up at least possible, but it also makes for stale data, very complex schemas, and a duplicate copy of all of your data.  Matching up those pros and cons to individual contextual considerations is left as an exercise for the reader.

Tying this back to the science of missing information is this notion of context.  I was looking at a potential taxonomy of “missing value reasons” that was organized around the type of context that would lead to information being missing.  For example, maybe it was a “Shared Location Context”, e.g. area code information was omitted from a telephone number because the recipient is in the same area code.  Classifying missing reasons by context turns out to be really difficult because the number of contexts is infinite, and of course doesn’t lend itself to easy categorization.
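To make the “Shared Location Context” case concrete, here is a minimal sketch in Python. The function name, the 7-digit and 10-digit number lengths, and the sample numbers are all assumptions made for illustration; nothing here comes from the taxonomy itself.

```python
from typing import Optional


def complete_phone_number(number: str, recipient_area_code: Optional[str]) -> Optional[str]:
    """Fill in an omitted area code from the recipient's own context.

    Assumes hypothetical 7-digit local and 10-digit full numbers.
    Returns None when the context needed to recover the number is absent.
    """
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) == 10:
        return digits                          # nothing was omitted
    if len(digits) == 7 and recipient_area_code:
        return recipient_area_code + digits    # recovered via shared context
    return None                                # missing, and not recoverable here


print(complete_phone_number("555-0142", "804"))   # 8045550142
print(complete_phone_number("555-0142", None))    # None
```

The same seven digits are complete for one reader and incomplete for another, which is why the missing reason has to be tied to a context rather than to the data alone.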

The unsatisfying conclusion

In order to pick the best data integration approach, you still need really smart people who intimately understand the specifics of your information sharing context.  Consultants probably cannot suggest the best approach, because they will not be able to have enough information about your context to make that decision for you.  Consultants will be useful in applying general knowledge about pros and cons of approaches to the contextual specifics you bring to the table.  In other words, they can advise, but cannot provide a definitive answer to the question of which approach is best.

Categories: General · Information Sharing

Experimentation with Jena & prototyping

July 31, 2007 · Leave a Comment

Over the past few weeks, I have been experimenting with a prototype application written using the Jena framework. Jena is a fairly decent library that supports RDF, OWL, and DAML+OIL ontologies, as well as reasoning across those models.

The basic prototype that I’ve built allows you to view and navigate individual models, as well as perform some basic set operations on those models, such as inspecting their union and difference. Most of what the prototype does right now is just an effort to teach myself the functionality of Jena rather than to build an application that does anything revolutionary.
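As a rough illustration of what those set operations mean, here is a sketch that treats a model as nothing more than a set of (subject, predicate, object) triples. It uses plain Python sets rather than Jena, and the sample triples are made up.

```python
# Each "model" here is just a set of (subject, predicate, object) triples.
model_a = {
    ("ex:Car", "rdf:type", "owl:Class"),
    ("ex:Truck", "rdf:type", "owl:Class"),
    ("ex:Truck", "rdfs:subClassOf", "ex:Vehicle"),
}
model_b = {
    ("ex:Car", "rdf:type", "owl:Class"),
    ("ex:Car", "rdfs:subClassOf", "ex:Vehicle"),
}

union = model_a | model_b        # every statement found in either model
only_in_a = model_a - model_b    # statements A makes that B does not
shared = model_a & model_b       # statements both models agree on

print(len(union), len(only_in_a), len(shared))   # 4 2 1
```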

The prototype itself might have a useful life as an Internet-accessible service to browse ontologies that would otherwise be opaque XML files. I’m not really as interested in the browser functionality right now though.

Where I’d really like to take the application is to be able to make it a feed generator for all things semantic web. For example, let’s say there are two ontologies, A and B, hosted on two different sites. What I’d really like is a system that:

  • Allowed users to subscribe to an RSS feed that contained changes or updates to those ontologies over time
  • Allowed users to subscribe to a feed that corresponded to the union (or difference) of those two models
  • Provided a facility for arbitrary external annotation of those ontologies.
    • Allow users to specify which classes in A are equivalent to which classes in B
    • Associate comments or additional text or properties to classes in A, but have those assertions be stored outside of A’s serialization. (I.e. “add-ons” for A’s ontology, but not actually stored in the file that is A’s content)
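A very rough sketch of the two ideas in that list, again with plain Python sets standing in for models. The function name, the annotation layout, and the class names are hypothetical; a real version would sit on top of Jena and an RSS library.

```python
def diff_snapshots(old: set, new: set) -> list:
    """Return feed-style entries describing what changed between two snapshots."""
    entries = [("added", triple) for triple in sorted(new - old)]
    entries += [("removed", triple) for triple in sorted(old - new)]
    return entries


# External annotations live in their own store, keyed by class name,
# so ontology A's own file is never modified.
annotations = {
    "A:Automobile": {"equivalent_to": "B:Car", "comment": "Same concept, different label"},
}

old_a = {("A:Automobile", "rdf:type", "owl:Class")}
new_a = old_a | {("A:Truck", "rdf:type", "owl:Class")}

for change, triple in diff_snapshots(old_a, new_a):
    print(change, triple)
```

Each entry returned by `diff_snapshots` would become one item in the feed, and the annotation store could be published as a separate model alongside A.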

I’m running into the same problem with this application prototype as I have with many others in the past. I have reached the point in development where making substantial steps towards the goal functionality requires learning loads and loads of things that aren’t as interesting, just to know how the implementation can happen. Examples include database persistence for ontologies, how to get around Jena memory management issues, RSS libraries in Java, and existing annotation standards on the web. Learning that stuff requires a substantial time investment that goes beyond the bounds of what I can accomplish in a “spare time” prototype.

Then again, there’s always an application for research funding. If I could articulate the idea a bit more, maybe it could grow legs.

Categories: Jena · Software

Evolving Taxonomy: Reasons why data would be missing

July 27, 2007 · 1 Comment

I have been working on a sample taxonomy with a fairly substantial list of all of the reasons why data would be missing. Unfortunately, WordPress doesn’t allow me to upload OWL files, and the taxonomy is large enough that screenshots aren’t too helpful. Protege does fortunately have a way of exporting taxonomies as text, so here’s a sample of what it looks like at the moment. (Without documentation or constraints)

T1-ReasonForAbsence

Available

KnownAndAvailable

HandlingError

NotApplicableByContext

NotRelevant

OutOfBand

ByApplication

ByTelephone

OtherSystem

PreviouslySpecified

Restricted

LawRestricted

PrivacyRestricted

RegulationRestricted

SecurityRestricted

AuthenticatedRestricted

AnonymousAccessNotAllowed

AuthorizationRestricted

SharedContext

SharedCultureContext

SharedLocationContext

SharedTechContext

TechnicalError

NotKnownButAvailable

Approximated

Encrypted

ErrorProne

AvoidDecisionMaking

Misclassified

Obfuscated

TimeSpecific

IntervalSpecific

MomentSpecific

WildAssGuess

NotAvailable

KnownButNotAvailable

Deceit

Hidden

Implied

Derivable

Functional

NotKnownAndNotAvailable

Deleted

Dependent

Indicative

MissingAtRandom

MissingCompletelyAtRandom

(NotApplicableByContext)…

NotMissingAtRandom

Subjective

Unknown

Unmeasurable

SubjectRefusedToAnswer

Categories: Taxonomy

Examples dealing with UIs and Medical Records

July 27, 2007 · Leave a Comment

Another area of research is in the negative “content awareness” of a UI e.g., how readily the content of a page is grasped.

(Actually UIs aren’t aware of anything, so it should be re-termed content-self-evidence or comprehensibility. But who am I to take on any industry buzzword).

This also applies to web services or SOA.

My thesis focuses on how fast someone can figure out WHAT IS NOT and WOULD NOT be on that page or more broadly NOT IN THAT INFORMATION RESOURCE.

Example: A computer description:

  1. A text description doesn’t include FSB speed. (You don’t know if it is even part of their awareness.)
    1. We don’t want to confuse you
    2. We want to be brief
    3. We don’t know what it is

How about doing it in a table…

  1. A table description has a field for FSB but it is blank
  2. The field is blank but next to it is a linked note that says, “varies with model”
  3. The field is blank but an error code next to it says data supplied is not displayed as it exceeded established acceptable range/rules.

This type of thing is growing in importance as XML offers optional tags and semi-structured/text data is making its way into content management (RS web database)

In the past we had only fixed fields or text. Now it is harder to know what is missing.

A dialog box in MS Excel lacks a pick option that is available in other conceptually parallel dialogs. The user goes insane trying to figure out how to do the function. Better solution: The option is there but grayed out. Mousing over the option gives a short explanation as to why it is not available in this context.

This raises the importance of methods of knowing what someone is looking for but can’t find – because it isn’t clear.

Medical Records Context

(Just one topic for today)

Classic problem: what does Null mean?

Why does a patient history not have something?

Why is there no family history?

What does blank mean?

  • No one bothered to ask.
  • The patient declined to answer
  • Patient privacy laws/business rules prevent transmission / display
  • Need-to-know not yet established
  • Classification / Security
  • The test was not available at the facility
  • The test could not be done due to time constraints
  • We couldn’t ask as the patient was unconscious
  • The answer was nonsensical or appeared to be a statistical outlier.
  • A test was inconclusive (not the same as an outlier)
  • Test results are pending (…time frame)
  • We have no idea.
  • The data was not sent.
  • An intermediary web service had no such field so data was lost in conversion.
  • The data was rejected due to a formatting problem. (Error type)
  • You are being presented with an abbreviated display for your convenience
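One way to keep a blank from being ambiguous is to store the reason next to the empty value and show it. The sketch below is only an illustration: the `RecordField` class and `render` function are hypothetical names, and the reason is free text where a real system would presumably use a coded list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordField:
    name: str
    value: Optional[str] = None
    missing_reason: Optional[str] = None   # free text here; a coded list would be better


def render(field: RecordField) -> str:
    """Show the value, or say why there is none instead of leaving a bare blank."""
    if field.value is not None:
        return f"{field.name}: {field.value}"
    reason = field.missing_reason or "no reason recorded"
    return f"{field.name}: [blank: {reason}]"


print(render(RecordField("Family history", missing_reason="patient declined to answer")))
print(render(RecordField("Blood type", value="O+")))
print(render(RecordField("Allergies")))
```

Even the last case, where no reason was recorded, tells the reader more than an empty cell does.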

How to get the rest of it…

Categories: General

Information intentionally deleted

July 27, 2007 · Leave a Comment

  • The serial number on a gun or car part – Goal is to illegally prevent access
  • The content on a redacted document – Goal is to provide legal access to the document.
  • Hmmm.. If an electronic document gets redacted, do you get to see how much is concealed or is there just an indefinite <clip>?

Here the information is missing but the scope and context is preserved.
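The electronic redaction question can be sketched in a few lines. This is a toy example with a made-up `redact` function and sample text; it only shows the difference between a redaction that preserves how much was concealed and one that leaves an indefinite marker.

```python
def redact(text: str, secret: str, preserve_extent: bool = True) -> str:
    """Replace every occurrence of `secret` in `text`.

    preserve_extent=True keeps one 'x' per hidden character, so the reader
    can see how much was concealed; False leaves a fixed-size marker.
    """
    replacement = "x" * len(secret) if preserve_extent else "<clip>"
    return text.replace(secret, replacement)


line = "Serial number: 4471-A"
print(redact(line, "4471-A"))                          # Serial number: xxxxxx
print(redact(line, "4471-A", preserve_extent=False))   # Serial number: <clip>
```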

Categories: General

Implications for protecting supervisory control systems

July 27, 2007 · Leave a Comment

DHS/LOGIC

One of the domains that missing-value work impacts is the protection of supervisory control systems.

The trend toward loosely coupled / messaging based architectures increases the importance of proactive monitoring to discover and interpret the non-presence of data.
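A minimal sketch of what “discovering the non-presence of data” could look like: track when each source last sent a message and flag any source that has gone quiet for longer than expected. The source names, intervals, and function name are assumptions for illustration only.

```python
def silent_sources(last_seen: dict, expected_interval: dict, now: float) -> list:
    """Return the sources whose most recent message is older than expected.

    last_seen maps source name -> timestamp (seconds) of its last message.
    expected_interval maps source name -> maximum expected gap in seconds.
    A source that has never reported at all is also flagged.
    """
    overdue = []
    for source, interval in expected_interval.items():
        seen = last_seen.get(source)
        if seen is None or now - seen > interval:
            overdue.append(source)
    return sorted(overdue)


expected = {"pump_station_1": 10.0, "valve_7": 60.0, "tank_level": 30.0}
seen = {"pump_station_1": 95.0, "valve_7": 20.0}
print(silent_sources(seen, expected, now=100.0))   # ['tank_level', 'valve_7']
```

Interpreting the silence (sensor failure, network fault, or tampering) is the harder part, and this sketch does not attempt it.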

Categories: General

Abstract: First Attempt

July 26, 2007 · Leave a Comment

(The following is a first draft at an abstract describing the research focus)

When a person is presented with an information resource, one of the very first things that they do is begin to understand the information and draw conclusions based on the information that the resource provides.  But what about the information the person expects to see, but is absent?  Often, what is missing can give us many additional hints to the sophistication of the information resource author, the audience for which the information was intended, or even evidence of deceit.  This paper attempts to describe a framework for how lacking a piece of information can itself be an important piece of information.  We will attempt to demonstrate how missing information interplays with the purpose and context of the information consumer, and the utility to the information consumer of identifying what is missing. 

Categories: General

Miscellaneous Notes from Rafner

July 26, 2007 · Leave a Comment

(June 2007)

We have a script. We can expect certain pieces of information to be passed back and forth.

  • There are certain key pieces of information you can’t escape, but the more familiarity there is between people, the less information is passed. (A man walks into a restaurant, says nothing, but gets everything that he wants because there’s a script and a set of expectations. Particulars are no longer required because they’re already codified. This is one example of a progression from unfamiliar to familiar.)
  • A person sits down in a different place, and the waiter needs to explain that the olive oil is used for dipping bread into
  • The opposite continuum – historical. Floppy disks produced in the earliest days never said hard vs. soft sectored. Hard sectored meant that the physical geometry of the disk was predetermined. All they could do was flag a sector bad if there was bad data. With soft sectoring, the computer chooses where the sectors are. A small sector of the disk can be excluded based on flaws.
  • You can use the lack of information to date a system, or to locate it culturally. When and where was this data created?
  • Approach to epistemological analysis. Say you have a corpus of data, and you want to figure something out. Take a corpus of medical records. Is there a lack of reporting on a particular thing? Does this mean that it wasn’t tested, or that all results were normal?
  • “Simply characterizing the domain is very exciting to people”. If you can come to it and identify 8 different ways that data can be missing, that’s a perfectly acceptable piece of intellectual achievement.
  • Figuring out the reason why it’s missing is the starting point for determining how to use it.

Categories: General

E.F. Codd Quote Points out the Obvious

July 26, 2007 · Leave a Comment

EF Codd Paper on missing values.

Quote:

The semantics of the fact that a db-value is missing are NOT the same as the semantics of the db-value itself. The former fact applies to any db-value, no matter what its type. The latter fact has semantics depending heavily on the domain (or application data type) from which the attribute draws its values.

Like a variable, a mark is a place-holder. However, it does not conform to the other accepted property of a variable: namely that semantically distinct missing values are represented by distinctly named variables.

SIGMOD RECORD, Vol. 15, No. 4, December 1986

This I think draws a clear line between the aspect of missing information that I’m interested in, and the aspect that I’m not interested in.

If the missing information is interpreted from the context of the “db-value” semantics, then I’m not interested.  What I am interested in is the semantics of the missing information and the “domain (or application data type) from which the attribute draws its values.”

Categories: Links

EF Codd’s Breakdown of Missing Value Reasons

July 26, 2007 · Leave a Comment

(Click image for full-size view)

EF Codd Hierarchy

Taken from:

Missing Information (Applicable and Inapplicable) in Relational Databases by E. F. Codd

(Link to ACM Digital Library PDF)

Categories: Uncategorized

The Final NULL in the Coffin

July 26, 2007 · Leave a Comment

The Final NULL in the Coffin

(Fabian Pascal)

This paper summarizes the drawbacks of the many-valued logic approach to missing data, and SQL’s problematic and poorly implemented flavor of three-valued logic via NULLs, and proposes a possible solution within the two-valued logic/relational framework. It (a) separates unknown and therefore missing data from “inapplicable” and therefore nonmissing data, and provides proper design guidelines to avoid the latter, (b) treats missing data correctly as metadata, and (c) yields logically correct answers with respect to the real world, without the complications and problematics of many-valued logic and SQL’s NULLs.

Categories: Links

caBIG: Missing Value Reason Whitepaper

July 26, 2007 · 1 Comment

Missing Value Reason Whitepaper.

A group was formed as part of caBIG (Cancer Bioinformatics Grid, affiliated with the NIH) to investigate cataloging a list of reasons why information would be missing from biomedical research data sets.

This is where I originally discovered the “flavors of NULL” discussion within the HL7 standard – it was an influencing approach for the caBIG MVR group.

The group does have a collaboration page, but it’s empty. It’s unclear whether these guys are still working on this. The whitepaper has a good Q&A section that deals with their approach.

Categories: Links

Interesting Missing Value Case: Neuropsych evaluations

July 26, 2007 · Leave a Comment

Missing information in neuropsychological evaluations

They’re trying to take neuropsych measurements on people with behavioral disorders due to dementia — the disorder causes them to be unable to gather the information.

The actual lack of information (or inability to gather it) is an indicator of the seriousness of the disorder they’re studying.

Categories: Links

Missing Data – Sci Research & Statistical Imputation

July 26, 2007 · Leave a Comment

I recently did a Clusty search on “missing data”. The focus of the results turns out to be the statistical science of imputing missing data in experiments, etc.

Knowing about this is important for grounding our work and contrasting it with data that is missing for semantic reasons, such as the neuropsych case or semantic assumptions. The generic statistical limits on outliers don’t really apply to what we are doing unless the context is bounded. By that I mean we may be in an open-world context, not a clinical measurement.

Missing Data: A Gentle Introduction
http://www.guilford.com/cgi-bin/cartscript.cgi?page=pr/mcknight.htm&sec=toc&dir=research/res_quant&cart_id=146538.16024

http://www.spss.com/missing_value/

http://www.statistics.com/courses/missing

http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html

http://www.fields.utoronto.ca/programs/scientific/04-05/missing-data/

http://www.lshtm.ac.uk/msu/missingdata/index.html

http://jnnp.bmj.com/cgi/content/abstract/68/6/726

Categories: General

Process gaps can leave assumption-based or TBD (pre-event) data incomplete

July 26, 2007 · Leave a Comment

In many processes, assumptions are made and information is stored in advance of hard facts or the actual event/instance. Afterwards, revision of the data may be flawed, overlooked or not deemed worthwhile.  In such circumstances, information can be inaccurate or may have been left out. 

An example is names of people attending a seminar.  An attendee preregisters himself and one additional colleague, to-be-determined.  At the seminar, organizers may or may not capture the actual name into their system.  A registration system might be designed in many different ways to handle this use case.
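Here is one of those many possible designs, sketched in Python. The `Seat` class, its field names, and the status labels are all hypothetical; the point is only that a to-be-determined name and an assumed name are recorded as different states rather than both ending up as a blank or a guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Seat:
    registered_by: str
    attendee_name: Optional[str] = None   # None means "to be determined"
    verified_at_event: bool = False       # True once organizers capture the real name

    def status(self) -> str:
        if self.attendee_name is None:
            return "to be determined"
        if not self.verified_at_event:
            return "assumed before the event"
        return "verified at the event"


seats = [Seat("J. Smith", "J. Smith"), Seat("J. Smith")]
unresolved = [seat for seat in seats if seat.attendee_name is None]
print([seat.status() for seat in seats])
print(len(unresolved))   # 1
```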

Categories: Uncategorized

Type discipline and missing values

July 26, 2007 · 1 Comment

Type discipline and missing values – from Poetix.

Fabian Pascal rails against the very notion of NULL.

Categories: Links

Past work on missing information

July 26, 2007 · Leave a Comment

  • Quantitatively focused
  • Treat missing information as “random variables” (Orchard)
  • Try to find reasonable/plausible maximums and minimums to constrain guessing
  • Data mining
  • Ignore all records with missing values
  • Replace missing values with mode or mean
  • Infer missing values from other records
  • All approaches assume a large corpus of available information, and that missing values can be computed based on other available information
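For reference, the two simplest approaches in that list look roughly like this. The sample records are made up, and only Python's standard `statistics` module is used.

```python
from statistics import mean, mode

records = [
    {"age": 30, "city": "Richmond"},
    {"age": None, "city": "Norfolk"},
    {"age": 40, "city": None},
    {"age": 29, "city": "Richmond"},
]

# Approach 1: ignore every record that has a missing value.
complete = [r for r in records if None not in r.values()]

# Approach 2: replace missing values with the mean (numeric) or mode (categorical)
# computed from the records where the value is present.
age_fill = mean([r["age"] for r in records if r["age"] is not None])
city_fill = mode([r["city"] for r in records if r["city"] is not None])
imputed = [
    {
        "age": r["age"] if r["age"] is not None else age_fill,
        "city": r["city"] if r["city"] is not None else city_fill,
    }
    for r in records
]

print(len(complete), age_fill, city_fill)
```

Both approaches throw away the fact that something was missing, which is exactly the information this research is interested in.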

Categories: General

Null, Three-Value Logic, and the “Type” of missing information

July 26, 2007 · Leave a Comment

  • “Since Null is not a member of any data domain, it is not considered a “value”, but rather a marker (or placeholder) indicating the absence of value. Because of this, comparisons with Null can never result in either True or False, but always in a third logical result, Unknown. “ – Wikipedia: NULL
  • Missing information cannot be used as a valid basis of comparison with any other atomic value. It isn’t greater, less, longer, shorter, better, worse, cheaper, or more expensive than any other value.
  • Like NULL, two missing pieces of information are never equivalent. In SQL, NULL = NULL does not evaluate to True (it evaluates to Unknown). Because it lacks a specific value, it does not follow the rule of identity.
  • Null is un-typed in relational databases, but in practice “missing information” typically does have a type associated with it:
    • Metadata provided by the system that indicates information is missing (e.g. last name “”)
    • The expectation that noticed the missing information gives it a type
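The three-valued behavior is easy to see for yourself. This sketch uses the SQLite engine that ships with Python; other databases may differ in details, and the `ask` helper is just a convenience defined here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def ask(expression: str):
    """Evaluate a SQL expression; SQL NULL comes back to Python as None."""
    return conn.execute(f"SELECT {expression}").fetchone()[0]


print(ask("NULL = NULL"))    # None: the result is Unknown, not True
print(ask("NULL <> NULL"))   # None: also Unknown, not True
print(ask("1 < NULL"))       # None: no ordering against a missing value
print(ask("NULL IS NULL"))   # 1: testing for the marker itself is two-valued
```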

Categories: General

What does it mean for information to be missing?

July 26, 2007 · Leave a Comment

 

Expectation:  belief about (or mental picture of) the future

  • Expectations are predictions about the future based on mental models derived from past experiences
  • “Mental models” integrate past observations, noting that certain pieces of information will be present.
  • All descriptions of cars will include make and model.
  • The make and model must be some value, but could be any. (Some but any)
  • When information is missing (an expectation has been violated), there are three possible explanations:
  • Option 1: The mental model contained an error, or was the wrong one for the scenario (incorrect conceptualization)
  • Option 2: The mental model was partially correct for a subset of the possible domain observations – it should be widened to integrate the new information (incomplete conceptualization, discovered with new information)
  • Option 3: The mental model is correct. Exploring the mismatch should focus on the information source.
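In code, the simplest form of an expectation is a set of fields that a description is supposed to contain. The sketch below is a toy: the field names and function are made up, and it only detects that an expectation was violated, not which of the three explanations applies.

```python
EXPECTED_CAR_FIELDS = {"make", "model"}   # the "mental model" of a car description


def violated_expectations(description: dict, expected: set = EXPECTED_CAR_FIELDS) -> set:
    """Return the expected fields that are absent or empty in a description."""
    return {name for name in expected if not description.get(name)}


listing = {"make": "Honda", "color": "blue"}
print(violated_expectations(listing))   # {'model'}
```

Deciding whether to fix the expected set (Options 1 and 2) or to go question the source (Option 3) is still a human judgment.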

Categories: General

Assumptions and Background

July 26, 2007 · Leave a Comment

  • The fact that “something is missing” itself constitutes novel information
  • Past work has focused on how to fill in the gap, with the objective of discovering the value of what’s missing. (“value of X”)
  • This approach focuses on how to draw new useful conclusions other than X

Categories: General

HL7 Flavors of Null / Reasons Why Data is Missing

July 26, 2007 · Leave a Comment

Code | Name | Definition
NI | No information | No information whatsoever can be inferred from this exceptional value. This is the most general exceptional value. It is also the default exceptional value.
OTH | other | The actual value is not an element in the value domain of a variable. (e.g., concept not provided by required code system).
UNK | unknown | A proper value is applicable, but not known
ASKU | asked but unknown | Information was sought but not found (e.g., patient was asked but didn’t know)
NAV | temporarily unavailable | Information is not available at this time but it is expected that it will be available later.
NASK | not asked | This information has not been sought (e.g., patient was not asked)
MSK | masked | There is information on this item available but it has not been provided by the sender due to security, privacy or other reasons. There may be an alternate mechanism for gaining access to this information. Note: using this null flavor does provide information that may be a breach of confidentiality, even though no detail data is provided. Its primary purpose is for those circumstances where it is necessary to inform the receiver that the information does exist without providing any detail.
NA | not applicable | No proper value is applicable in this context (e.g., last menstrual period for a male).
NP | not present | Value is not present in a message. This is only defined in messages, never in application data! All values not present in the message must be replaced by the applicable default, or no-information (NI) as the default of all defaults.

Section 1.11.4 of HL7 Data Type Specification
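The codes above map naturally onto an enumeration, and the NP rule (never let “not present” reach application data) can be written as a small lookup. This is my own sketch, not anything from the HL7 specification: the `read_field` function, the message layout, and the default values are assumptions.

```python
from enum import Enum


class NullFlavor(Enum):
    NI = "no information"
    OTH = "other"
    UNK = "unknown"
    ASKU = "asked but unknown"
    NAV = "temporarily unavailable"
    NASK = "not asked"
    MSK = "masked"
    NA = "not applicable"
    NP = "not present"


def read_field(message: dict, name: str, defaults: dict):
    """Read a field from a message, never letting NP leak into application data.

    A field absent from the message counts as "not present"; it is replaced
    by the applicable default, or by NI when no default exists.
    """
    value = message.get(name, NullFlavor.NP)
    if value is NullFlavor.NP:
        return defaults.get(name, NullFlavor.NI)
    return value


msg = {"last_name": "Doe", "phone": NullFlavor.NASK}
defaults = {"country": "US"}
print(read_field(msg, "phone", defaults))        # NullFlavor.NASK
print(read_field(msg, "country", defaults))      # US
print(read_field(msg, "birth_date", defaults))   # NullFlavor.NI
```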

Categories: General

Previously Located Good References

July 26, 2007 · Leave a Comment

Categories: Links

Missing data in social science

July 26, 2007 · Leave a Comment

Guidelines for handling missing data in social science.

Overview

Missing data are ubiquitous in social science research. This document is a guideline for researchers faced with analysing partially observed datasets, and describes the issues that need to be considered. Technical details, however, will vary considerably between analyses, so these are not discussed here. Further information and references can be obtained from www.missingdata.org.uk, or by emailing James.Carpenter@lshtm.ac.uk

Categories: Links

A Non-technical introduction to analysis of datasets with missing values

July 26, 2007 · Leave a Comment

Categories: Links