One of the areas where missing data is most often made explicit is in “unpacking” data values.
Recently, I’ve been working with a system called Trio, which stores uncertainty information and lineage alongside the data in a database. The idea is that you can express fuzzier tuples in a relation. Instead of flatly asserting “Mary saw a Toyota”, you can assign that tuple 60% confidence, or even add an alternative: either Mary saw a Toyota, or Mary saw a Honda (but not both).
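To make the idea of a fuzzy tuple concrete, here is a minimal Python sketch. It is not Trio's data model or API; the class names `Alternative` and `UncertainTuple` are hypothetical, and the rule that exclusive alternatives cannot sum to more than 1 is my own assumption about what "but not both" implies.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Alternative:
    """One possible value for a tuple, with an optional confidence in [0, 1]."""
    values: Tuple[str, ...]
    confidence: Optional[float] = None  # None means no confidence was recorded

    def __post_init__(self) -> None:
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


@dataclass(frozen=True)
class UncertainTuple:
    """A tuple whose alternatives are mutually exclusive: at most one is true."""
    alternatives: Tuple[Alternative, ...]

    def __post_init__(self) -> None:
        # Assumption: confidences of exclusive alternatives cannot exceed 1 in total.
        known = [a.confidence for a in self.alternatives if a.confidence is not None]
        if sum(known) > 1.0 + 1e-9:
            raise ValueError("confidences of exclusive alternatives cannot exceed 1")


# "Mary saw a Toyota", held with 60% confidence.
probably_toyota = UncertainTuple((Alternative(("Mary", "Toyota"), 0.6),))

# Either a Toyota or a Honda, but not both; no confidences recorded.
toyota_or_honda = UncertainTuple((
    Alternative(("Mary", "Toyota")),
    Alternative(("Mary", "Honda")),
))
```

Note that the second example already leaves `confidence` empty, which is where the trouble discussed next begins.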
Trio separates the data items in a table into two groups: those that are certain, and those that are uncertain. Unsurprisingly, uncertain data items may have a confidence stored along with them. The more interesting case is what gets put into the “certain” category, and why. As far as I can tell, the “certain” table has no formal definition; it is simply the set of data items that don’t need a confidence associated with them. So in fact we’re “packing” several different use cases into the designation of a data item as “certain”:
- Honest-to-god certain, meaning that it has a confidence rating of 100%
- Certainty information is unknown, but the item is assumed to be certain
- Certainty information doesn’t apply
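The packing can be shown in a few lines of Python. This is only an illustration under my own assumptions: the function `is_certain` and the variable names are hypothetical, and the rule it encodes is my reading of the informal definition above, not something Trio specifies.

```python
from typing import Optional


def is_certain(confidence: Optional[float]) -> bool:
    """Packed rule: an item is 'certain' if it has no stored confidence or is at 100%."""
    return confidence is None or confidence == 1.0


verified = 1.0          # honest-to-god certain
never_assessed = None   # unknown, but assumed certain
not_applicable = None   # certainty does not apply

# All three situations collapse into the same answer.
results = [is_certain(c) for c in (verified, never_assessed, not_applicable)]
```

Once the values are stored, nothing distinguishes `never_assessed` from `not_applicable`; the reason the confidence is absent has been lost.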
For items in the “certain” group, the missing confidence is effectively a null. Many people will reasonably point out that it’s OK for there to be nulls, and for them to have some special meaning; what matters is that their meaning be consistent. When the meaning of nulls can’t be consistent (because the semantics of the domain require that nulls differ in meaning based on context), you have a missing data problem. The common approach is then to “unpack” those null values and enumerate their actual meanings so that they can be stored alongside the data in the future.
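One way to sketch that unpacking in Python is to replace the bare null with an explicit status. Everything here is hypothetical: `CertaintyStatus`, `Certainty`, and `unpack` are names I made up, and the `applies` flag stands in for whatever domain context tells you whether certainty is even meaningful for the item.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CertaintyStatus(Enum):
    KNOWN = "known"                      # a confidence value was recorded
    ASSUMED_CERTAIN = "assumed_certain"  # nothing recorded; treated as certain
    NOT_APPLICABLE = "not_applicable"    # certainty has no meaning for this item


@dataclass(frozen=True)
class Certainty:
    status: CertaintyStatus
    confidence: Optional[float] = None

    def __post_init__(self) -> None:
        if self.status is CertaintyStatus.KNOWN:
            if self.confidence is None or not 0.0 <= self.confidence <= 1.0:
                raise ValueError("KNOWN requires a confidence between 0 and 1")
        elif self.confidence is not None:
            raise ValueError("only KNOWN may carry a confidence value")


def unpack(confidence: Optional[float], applies: bool) -> Certainty:
    """Turn a nullable confidence into an explicit status, using domain context."""
    if not applies:
        if confidence is not None:
            raise ValueError("confidence given for an item where it does not apply")
        return Certainty(CertaintyStatus.NOT_APPLICABLE)
    if confidence is None:
        return Certainty(CertaintyStatus.ASSUMED_CERTAIN)
    return Certainty(CertaintyStatus.KNOWN, confidence)
```

With this in place, `unpack(None, applies=True)` and `unpack(None, applies=False)` produce different values, so the meaning of the old null is stored alongside the data and no longer has to be inferred from context.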