Continuing to work on my Jena-based application, I’ve run into a snag: how to architect the application for the best performance. They say that in programming, “premature optimization is the root of all evil”. That’s generally true, but it’s no excuse to pick an architecture that hamstrings your application from the start.
My application stores many different RDF data models. Each model may have a separate associated “annotation model” that contains a set of additional triples referring back to the original model.
The question is how best to implement this model soup so that users can search for specific triples across all models. Let’s say you’re searching for a tag “foo” that has been associated with a particular model or resource. In my version of RDF, that translates quite easily to the triple:
<http://a.com/some-resource> <http://b.com/#tagged> "foo" .
So all you have to do is search all available models for statements matching:
(?) b:tagged "foo"
Easy enough. But how do you implement such a search? By default, Jena stores different models in different relational tables in the underlying database. There is no single unified relational table (that I can find) where you can search for “foo”. There is no implicit unified model that contains all of the triples of all available models. From what I can tell, my options are:
(1) Create an uber-model that is basically a union of all of the models I have, and search only that model.
(2) Issue the “foo” search n times, where n is the number of models I have; once for each model.
(3) Don’t separate the models in the first place – only ever have one uber-model, and create logical separations in the model by adding additional triples. In other words, smash all models together into one giant model from the start, and create the illusion that they’re separate with additional metadata that allows the application to figure out which triples belong to which “sub-model”.
Option (1) is lousy, because you end up storing every triple twice: once in the model it belongs to, and once in the uber-model. Option (2) is good, except that query performance sucks: one search becomes n searches, and much of that time is wasted on models that won’t have any hits at all. Option (3) is horrendous: complicated to implement, unwieldy, and requiring a lot of code to manage things that Jena would normally do for me.
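To make option (2) concrete, here is a minimal sketch of the per-model search. It is plain Java with an in-memory map standing in for the separately stored models, not Jena's API; the class name, the model names, the `#title` predicate and the sample triples are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a set of separately stored models; no Jena API is used here.
class PerModelSearch {

    // One RDF statement, with URIs and literals kept as plain strings.
    static final class Triple {
        final String s, p, o;

        Triple(String s, String p, String o) {
            this.s = s;
            this.p = p;
            this.o = o;
        }

        // A null argument acts as a wildcard, like the (?) in the pattern.
        boolean matches(String ps, String pp, String po) {
            return (ps == null || ps.equals(s))
                && (pp == null || pp.equals(p))
                && (po == null || po.equals(o));
        }
    }

    static final String TAGGED = "http://b.com/#tagged";

    // Option (2): run the same pattern once per model and keep the hits per model.
    static Map<String, List<Triple>> searchAll(Map<String, List<Triple>> models,
                                               String s, String p, String o) {
        Map<String, List<Triple>> hits = new LinkedHashMap<>();
        for (Map.Entry<String, List<Triple>> e : models.entrySet()) { // n searches
            List<Triple> found = new ArrayList<>();
            for (Triple t : e.getValue()) {
                if (t.matches(s, p, o)) {
                    found.add(t);
                }
            }
            if (!found.isEmpty()) {
                hits.put(e.getKey(), found);
            }
        }
        return hits;
    }

    // Hypothetical sample data: three models, only one of which carries the tag "foo".
    static Map<String, List<Triple>> sampleModels() {
        Map<String, List<Triple>> models = new LinkedHashMap<>();
        models.put("model-a", Arrays.asList(
            new Triple("http://a.com/some-resource", "http://b.com/#title", "Some resource")));
        models.put("model-a-annotations", Arrays.asList(
            new Triple("http://a.com/some-resource", TAGGED, "foo")));
        models.put("model-b", Arrays.asList(
            new Triple("http://a.com/other-resource", TAGGED, "bar")));
        return models;
    }

    public static void main(String[] args) {
        Map<String, List<Triple>> hits = searchAll(sampleModels(), null, TAGGED, "foo");
        for (Map.Entry<String, List<Triple>> e : hits.entrySet()) {
            for (Triple t : e.getValue()) {
                System.out.println(e.getKey() + ": " + t.s + " " + t.p + " \"" + t.o + "\"");
            }
        }
    }
}
```

The outer loop is the cost I’m worried about: every model gets visited, whether or not it holds a match. With Jena, the inner loop would presumably become a `listStatements` call on each model, but the shape of the problem stays the same.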
SPARQL explicitly supports option (2) with named graphs, and it currently seems like the best option. Jena also provides a way of indexing text with Apache’s Lucene, which doubtless will improve performance, but doesn’t change the architectural problem of having to search n different models for a single query.
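For comparison, here is a rough sketch of what a query over named graphs gives you conceptually: each statement carries the name of the graph it came from, so a pattern like GRAPH ?g { ?s b:tagged "foo" } can return the graph along with each hit. The sketch is plain Java, and the single predicate-and-object index is my own assumption for illustration; it says nothing about how Jena or SDB actually store or index their data. The class, method and graph names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative quad store: one index shared by all graphs (an assumption, not Jena's design).
class QuadIndex {

    // A triple plus the name of the graph (model) it belongs to.
    static final class Quad {
        final String graph, s, p, o;

        Quad(String graph, String s, String p, String o) {
            this.graph = graph;
            this.s = s;
            this.p = p;
            this.o = o;
        }
    }

    // Maps a (predicate, object) pair to every quad carrying it, whatever its graph.
    private final Map<String, List<Quad>> byPredicateObject = new HashMap<>();

    // Builds the lookup key; a newline separates the two parts.
    private static String key(String p, String o) {
        return p + "\n" + o;
    }

    void add(String graph, String s, String p, String o) {
        byPredicateObject
            .computeIfAbsent(key(p, o), k -> new ArrayList<>())
            .add(new Quad(graph, s, p, o));
    }

    // Answers the pattern (?) p o across all graphs with a single lookup.
    List<Quad> find(String p, String o) {
        return byPredicateObject.getOrDefault(key(p, o), Collections.emptyList());
    }

    public static void main(String[] args) {
        String tagged = "http://b.com/#tagged";
        QuadIndex index = new QuadIndex();
        index.add("model-a-annotations", "http://a.com/some-resource", tagged, "foo");
        index.add("model-b-annotations", "http://a.com/other-resource", tagged, "foo");
        index.add("model-b-annotations", "http://a.com/other-resource", tagged, "bar");

        // Each hit reports which graph it came from, so the models stay logically separate.
        for (Quad q : index.find(tagged, "foo")) {
            System.out.println(q.graph + ": " + q.s);
        }
    }
}
```

The separation between models survives as the graph name on each statement, but the lookup no longer loops over the models. Whether a real store can offer this without the bookkeeping burden of option (3) is exactly what I still need to find out.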
I have been getting good support on the jena-dev mailing list, but I have yet to check out Andy’s SDB, which promises to have support for union queries, a potential extra option.