(The following are a few ideas I’ve had kicking around that I’m just now getting around to writing about. Maybe there’s a DM Review article in here somewhere…)
General Approaches to Data Integration and Information Sharing
At work, there are usually any number of spirited arguments going on about data integration and how it’s best done. Granted, these are somewhat extreme caricatures of the positions, but broadly, you have:
- the SOA camp (some of whom think information is magically integrated when you use web services, as if an architectural approach is the same thing as actual integration work)
- the mediation camp (let everyone store and transmit information in their own format, and we’ll translate somewhere in the middle)
- the data warehouse camp (move a single copy of all data into one physical location and integrate it there)
- the semantic camp (the format isn’t very important: get the semantics right and you will be able to translate the format with the use of that common understanding)
- the Community of Interest/COI camp (let’s get everyone together around a table to agree on an interchange schema, and just use that)
The domain is overwhelmingly conceptualized along the lines of “point to point” connections (e.g. the mediation camp) versus the use of some centralized, agreed-upon format. I’m not convinced this is the only way to look at the issue, but that is how most people understand it these days.
Which to Choose?
The question is, which approach is right? People constantly look to the internet for the answers. They see the successes of information sharing on the web, and mistakenly assume that a model that works for home users who don’t know one another will somehow work for their large monolithic organization. Correspondingly, there are rushes to adopt Web 2.0, tagging, blogging, video sharing, or any number of other flavors of the month.
The one lesson that no one seems to take from the internet is that it never, ever chooses one model or approach to a problem. Because it is by nature a large unregulated marketplace of ideas, the internet’s approach to data integration is both to form common agreements and to build point-to-point connections. There is no one correct approach because the reason for information sharing differs by community, information type, and underlying purpose.
On the internet “common agreement” front, we have Microformats, OASIS, IEEE, NIST, and the W3C. On the “proliferation of point to points” front, we have thousands of people in their basements developing their own custom schemas and publishing them to the world. The internet manages to adapt, change, and pick the best of the litter in part because it is so large that it can afford to go in many different directions and let users choose the best solution.
Companies can’t always act as the internet does
Companies and government agencies are in a different situation, though. Trying something new or different carries an incremental cost to the organization. The “internet” isn’t a monolithic organization that arranges or guides investment; if it tries 1,000 different approaches simultaneously and only 5 pan out, that’s fine because the risk was borne by the individuals who tried each approach. This approach is called “Let 1,000 flowers bloom.” Companies and agencies frequently cannot afford to have only 5 out of 1,000 ideas pan out. If they are lucky, they have the money to plant three flowers, but they’ll still absolutely need one of them to come out a real winner.
So what do they do? Develop point-to-point interfaces, or shoot for some agreed-upon central format (whether semantic or structural)?
The difficult but unavoidable issue is that the answer to this question depends on context and purpose. The question shouldn’t even be answered outside of the confines of a specific context, because the factors that dictate which approach will actually be successful are context-specific.
Infinite contexts, and missing information
What’s really needed is some sort of general approach to analyzing context and making recommendations on data integration approaches based on context specifics. This article (and several others) provides summary tables of different data integration approaches with their pros and cons. For example, using data warehouses to integrate data makes queries fast and makes cleaning data up at least possible, but it also makes for stale data, very complex schemas, and a duplicate copy of all of your data. Matching up those pros and cons to individual contextual considerations is left as an exercise for the reader.
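To make that “exercise” a little more concrete, here is a minimal sketch of what matching pros and cons to a context might look like. The data warehouse traits come straight from the paragraph above; everything else (the `Approach` class, the `score` function, the two contexts and their weights) is a hypothetical illustration, not an established method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approach:
    """One integration approach with the traits a summary table would list."""
    name: str
    pros: frozenset
    cons: frozenset


# Traits taken from the data warehouse example in the text.
DATA_WAREHOUSE = Approach(
    name="data warehouse",
    pros=frozenset({"fast queries", "data cleanup possible"}),
    cons=frozenset({"stale data", "complex schemas", "duplicate copy of data"}),
)


def score(approach, weights):
    """Sum the context's weights for the pros, minus its weights for the cons.

    `weights` maps a trait name to how much this particular context cares
    about it. Traits the context does not mention count as zero.
    """
    gain = sum(weights.get(trait, 0) for trait in approach.pros)
    cost = sum(weights.get(trait, 0) for trait in approach.cons)
    return gain - cost


# Hypothetical context: periodic reporting, where freshness barely matters.
reporting = {
    "fast queries": 3,
    "data cleanup possible": 2,
    "stale data": 1,
    "duplicate copy of data": 1,
}

# Hypothetical context: operational sharing, where stale data is costly.
operational = {"fast queries": 1, "stale data": 5, "complex schemas": 2}

print(score(DATA_WAREHOUSE, reporting))    # pros 5, cons 2
print(score(DATA_WAREHOUSE, operational))  # pros 1, cons 7
```

The arithmetic is the easy part. The weights are the hard part, and they are exactly the context-specific knowledge that a generic table cannot supply.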
Tying this back to the science of missing information is this notion of context. I was looking at a potential taxonomy of “missing value reasons” that was organized around the type of context that would lead to information being missing. For example, maybe it was a “Shared Location Context”, e.g. area code information was omitted from a telephone number because the recipient is in the same area code. Classifying missing reasons by context turns out to be really difficult because the number of contexts is infinite, and of course doesn’t lend itself to easy categorization.
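As a rough illustration of the telephone example, a missing value could carry a tag for the context that explains why it is missing. This is only a sketch: the `MissingContext` enum, the `PhoneNumber` class, and `resolve_area_code` are names invented here, and the enum has two members precisely because a complete list can’t be written down.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MissingContext(Enum):
    """Hypothetical, deliberately incomplete list of 'missing value reasons'."""
    SHARED_LOCATION = "shared location context"
    UNKNOWN = "unknown"


@dataclass
class PhoneNumber:
    local_number: str
    area_code: Optional[str] = None
    # Why the area code is absent, if it is absent.
    area_code_missing_because: Optional[MissingContext] = None


def resolve_area_code(phone: PhoneNumber, recipient_area_code: str) -> Optional[str]:
    """Return the area code, filling it in from the recipient's context when
    the sender left it out because both parties share a location."""
    if phone.area_code is not None:
        return phone.area_code
    if phone.area_code_missing_because is MissingContext.SHARED_LOCATION:
        return recipient_area_code
    # Any other reason: the value stays missing rather than being guessed.
    return None


local = PhoneNumber("555-0100", area_code_missing_because=MissingContext.SHARED_LOCATION)
mystery = PhoneNumber("555-0199", area_code_missing_because=MissingContext.UNKNOWN)

print(resolve_area_code(local, "703"))
print(resolve_area_code(mystery, "703"))
```

Every new kind of context needs another enum member and another branch, which is the categorization problem described above in miniature.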
The unsatisfying conclusion
In order to pick the best data integration approach, you still need really smart people who intimately understand the specifics of your information sharing context. Consultants probably cannot suggest the best approach, because they will not be able to have enough information about your context to make that decision for you. Consultants will be useful in applying general knowledge about pros and cons of approaches to the contextual specifics you bring to the table. In other words, they can advise, but cannot provide a definitive answer to the question of which approach is best.