Data, Persistence, and My Frying Pan

January 28, 2013

I like cooking, and my Wiener Schnitzel is rather famous. When preparing Schnitzel, the most important tool is my favourite, teflon-coated frying pan. Admittedly, the teflon coating itself is not really important to make Schnitzel, but it saves me from having to scrape everything else from the bottom of the pan. This makes my coated pan pretty much the only one that is permanently in a usable state, if you know what I mean.

One of the cool things about teflon-coated frying pans is that they usually come with a warranty of at least five years. In my case, for each frying pan, the following happens after about three years: I notice that the teflon coating is damaged, so I plan to return the pan to the household supply store where I originally bought it. Then at least one of the following conditions arises: I cannot remember in which store I bought the pan, or I cannot find the invoice.

Why do I need an invoice to return the pan in the first place? First of all, it serves as proof that the pan is from the store. Second, based on the date of the invoice, they are able to determine whether the pan is still under warranty. But why do I need to bring the invoice? The store must have the same information on record, if only to be able to make correct fiscal statements.

Retrieving the correct invoice is a daunting task, however. If the records are kept on paper (which is not unlikely, even today), browsing binders sequentially is pretty much the only way to find the correct invoice. If I don't remember the exact date, nobody knows where to start (in my case, I usually do not even manage to remember the year, so that would mean a lot of pages to flip through).

Even if the shop keeps its data in a (relational) database, the question is whether its software allows staff to ask a question like "please list all invoices from 2010 where somebody anonymously bought, possibly among other things, a certain frying pan" (we are still blatantly ignoring the fact that somebody will surely have bought one of these frying pans in 2010, but the question remains whether that was me). On the technical side, creating an SQL statement to retrieve the desired information would be a matter of minutes.
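To give an idea of what such a statement might look like, here is a minimal sketch using Python's built-in sqlite3 module. The table names (invoice, invoice_item), the columns, and the sample rows are all made up for illustration; a real shop's schema will certainly look different.

```python
import sqlite3

# Hypothetical schema and sample data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (
        id        INTEGER PRIMARY KEY,
        issued_on TEXT NOT NULL          -- ISO date, e.g. '2010-03-15'
    );
    CREATE TABLE invoice_item (
        invoice_id INTEGER NOT NULL REFERENCES invoice(id),
        product    TEXT NOT NULL
    );
    INSERT INTO invoice VALUES
        (1, '2009-11-02'), (2, '2010-03-15'), (3, '2010-07-21');
    INSERT INTO invoice_item VALUES
        (1, 'frying pan'), (2, 'frying pan'), (2, 'spatula'), (3, 'kettle');
""")


def invoices_with_product(conn, product, year):
    """List invoices from the given year that contain the given product."""
    return conn.execute(
        """
        SELECT DISTINCT i.id, i.issued_on
        FROM invoice AS i
        JOIN invoice_item AS it ON it.invoice_id = i.id
        WHERE it.product = ?
          AND i.issued_on >= ? AND i.issued_on < ?
        ORDER BY i.issued_on
        """,
        (product, f"{year}-01-01", f"{year + 1}-01-01"),
    ).fetchall()


pan_invoices_2010 = invoices_with_product(conn, "frying pan", 2010)
print(pan_invoices_2010)
```

The point is not the particular query, but that nobody had to plan for this question when the tables were designed.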

So regardless of whether the application provides us with the required access path to the data, we can put the ad-hoc reporting capability of a relational database to good use. (Within the scope of this article, let us just assume that every household supply store happily hands out database credentials to anonymous customers and provides them with shell access to its database server.)

What if they used a NoSQL technology? Such technologies seem to make storing data easier, for example because they are, or at least claim to be, schema-less. Retrieving data, however, is only easy if you stick to the access paths that the system was originally designed for. In other words, NoSQL technologies force you to store the data in a format that closely matches the format in which you plan to retrieve it. If you know what you are designing the system for, NoSQL can be a quick win.

As soon as "strange" new requirements come up, like asking who bought a certain frying pan back in 2010, the effort involved in creating a solution might outweigh the original benefit of getting started more quickly. Hand-writing a query would probably still be rather easy for a flat data structure, but if we are dealing with a complex object hierarchy, it becomes increasingly difficult to do so. Remember: sequentially searching through all records or objects is something that always works in theory. In practice, we know that we need to avoid scanning at all costs, because it kills performance and does not scale.
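A small sketch may make the difference between a designed access path and a "strange" new question more tangible. It assumes a key-value style store, modelled here as a plain Python dict, that keeps one nested document per customer; the document layout and all names are hypothetical.

```python
# Hypothetical document store: one nested document per customer ID.
customers = {
    "c1": {
        "name": "Customer A",
        "orders": [
            {"date": "2010-03-15",
             "items": [{"product": "frying pan"}, {"product": "spatula"}]},
        ],
    },
    "c2": {
        "name": "Customer B",
        "orders": [
            {"date": "2009-11-02", "items": [{"product": "frying pan"}]},
            {"date": "2010-07-21", "items": [{"product": "kettle"}]},
        ],
    },
}


def get_customer(store, customer_id):
    """The access path the store was designed for: a single key lookup."""
    return store[customer_id]


def buyers_of(store, product, year):
    """The new requirement: without an index, every document is visited."""
    matches = []
    for customer_id, doc in store.items():      # full scan over all documents
        for order in doc["orders"]:
            if not order["date"].startswith(f"{year}-"):
                continue
            if any(item["product"] == product for item in order["items"]):
                matches.append((customer_id, order["date"]))
    return matches


pan_buyers_2010 = buyers_of(customers, "frying pan", 2010)
print(pan_buyers_2010)
```

The lookup by customer ID stays cheap no matter how many documents there are, while the second function has to touch every order of every customer, and it gets harder to write with every additional level of nesting.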

Since most web applications are both read-intensive and frequently changed, a solution that saves on time-to-market in the short run, but increases maintenance effort by requiring more work to incorporate new requirements, is usually not the best choice.

Since nobody really knows what the future holds, the only way to deal with the problems described above is to create an architecture that does not depend on certain technologies, but abstracts from them: a certain component persists data, and provides you with ways of retrieving that data again. You will have to figure out how before you can choose a solution, and this decision should be based only on the functional and non-functional requirements of your application.
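One possible way to sketch such an abstraction, assuming Python and entirely hypothetical names (Invoice, InvoiceStore, InMemoryInvoiceStore), is an interface that names the questions the application needs answered and leaves the technology behind it open:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Invoice:
    number: int
    issued_on: date
    products: Tuple[str, ...]


class InvoiceStore(ABC):
    """What the application needs from persistence, not how it is done."""

    @abstractmethod
    def save(self, invoice: Invoice) -> None: ...

    @abstractmethod
    def find_by_product_and_year(self, product: str, year: int) -> List[Invoice]: ...


class InMemoryInvoiceStore(InvoiceStore):
    """Stand-in implementation; a relational or NoSQL one could replace it."""

    def __init__(self) -> None:
        self._invoices: Dict[int, Invoice] = {}

    def save(self, invoice: Invoice) -> None:
        self._invoices[invoice.number] = invoice

    def find_by_product_and_year(self, product: str, year: int) -> List[Invoice]:
        return [
            inv for inv in self._invoices.values()
            if inv.issued_on.year == year and product in inv.products
        ]


def warranty_candidates(store: InvoiceStore, product: str, year: int) -> List[int]:
    """Application code depends only on the interface."""
    return [inv.number for inv in store.find_by_product_and_year(product, year)]


store = InMemoryInvoiceStore()
store.save(Invoice(1, date(2009, 11, 2), ("frying pan",)))
store.save(Invoice(2, date(2010, 3, 15), ("frying pan", "spatula")))
print(warranty_candidates(store, "frying pan", 2010))
```

The in-memory class only stands in for whatever backend the requirements eventually call for. The calling code would not have to change if it were swapped out, although each new kind of question still means a new method that every implementation has to answer efficiently.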

Just using technology X, Y, or Z as a data store, and hoping that the access paths this technology offers, in conjunction with the data formats used, will be sufficient to build a scalable and performant application, will put you on a road with a dead end. I know because I have been down that road, and came to a stop right in front of that "dead end" sign.

And yes, I really should keep an indexed record of all my hardware, instead of just throwing all invoices into one box.