Data, Persistence, and My Frying Pan
I like cooking, and my Wiener Schnitzel is rather famous. When preparing
Schnitzel, the most important tool is my favourite, teflon-coated frying
pan. Admittedly, the teflon coating itself is not really important to make
Schnitzel, but it saves me from having to scrape everything else from the
bottom of the pan. This makes my coated pan pretty much the only one that is
permanently in a usable state, if you know what I mean.
One of the cool things about teflon-coated frying pans is that they
usually come with a warranty of at least five years. In my case, for each
frying pan, after about three years, the following things happen: I notice
that the teflon coating is damaged, so I plan to return the pan to the
household supply store where I had originally bought it. Now at least one of
the following conditions arises: I cannot remember in which store I had
bought the pan, or I cannot find the invoice.
Why do I need an invoice to return the pan in the first place? First of
all, it serves as proof that the pan is from the store. Second, based on the
date of the invoice, they are able to determine whether the pan is still
under warranty. But why do I need to bring the invoice? The store must have the
same information on record, if only to be able to make correct fiscal
Retrieving the correct invoice is a daunting task, however. If the
records are kept on paper (which is not unlikely, even today), browsing
binders sequentially is pretty much the only way to find the correct
invoice. If I don't remember the exact date, nobody knows where to start (in
my case, I usually do not even manage to remember the year, so that would
mean a lot of pages to flip through).
Even if the shop keeps their data in a (relational) database, the
question is whether their software allows them to ask a question like
"please list all invoices from 2010 where somebody anonymously bought –
possibly among other things – a certain frying pan" (we are still
blatantly ignoring the fact that, for sure, somebody will have bought one of
these frying pans in 2010, but the question remains whether that was me). On
the technical side, creating an SQL statement to retrieve the desired
information would be a matter of minutes.
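To make "a matter of minutes" concrete, here is a minimal sketch of such an
ad-hoc query, run against an in-memory SQLite database. The schema (tables
invoices and invoice_items, and all column names) and the sample rows are
assumptions made up for illustration; a real store's database will look
different.

```python
import sqlite3

# Hypothetical schema and sample data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        invoice_date TEXT,      -- ISO date, e.g. '2010-03-14'
        customer TEXT           -- NULL for anonymous purchases
    );
    CREATE TABLE invoice_items (
        invoice_id INTEGER REFERENCES invoices(id),
        product TEXT
    );
    INSERT INTO invoices VALUES
        (1, '2010-03-14', NULL),
        (2, '2010-07-02', 'Jane Doe'),
        (3, '2011-01-09', NULL);
    INSERT INTO invoice_items VALUES
        (1, 'frying pan'), (1, 'spatula'),
        (2, 'frying pan'),
        (3, 'frying pan');
""")


def anonymous_invoices_with(product, year):
    """Return (id, date) of anonymous invoices from `year` that include `product`."""
    rows = conn.execute(
        """
        SELECT DISTINCT i.id, i.invoice_date
        FROM invoices AS i
        JOIN invoice_items AS it ON it.invoice_id = i.id
        WHERE it.product = ?
          AND i.customer IS NULL
          AND i.invoice_date >= ? AND i.invoice_date < ?
        ORDER BY i.invoice_date
        """,
        (product, f"{year}-01-01", f"{year + 1}-01-01"),
    )
    return rows.fetchall()


print(anonymous_invoices_with("frying pan", 2010))
```

The application never had to anticipate this question; the join and the
filters are composed at query time.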
So regardless of whether the application provides us with the required
access path to the data, we can put the ad-hoc reporting capability of a
relational database to good use. (Within the scope of this column, let us
just assume that every household supply store happily hands out database
credentials to anonymous customers and provides them with shell access to
its database server.)
What if they used a NoSQL technology? Such technologies seem to make
storing data easier, for example because they are, or at least claim to be,
schema-less. Retrieving data, however, is only easy if you stick to the
access paths that the system was originally designed for. In other words,
NoSQL technologies force you to store the data in a format that closely
matches the format in which you plan to retrieve it. If you know what you
are designing the system for, NoSQL can be a quick win.
As soon as "strange" new requirements come up, like asking who had bought
a certain frying pan back in 2010, the effort involved in creating a
solution might outweigh the original benefit of getting started more
quickly. Hand-writing a query would probably still be rather easy for a flat
data structure, but if we are dealing with a complex object hierarchy, it
becomes increasingly difficult to do so. Remember: sequentially searching
through all records or objects is something that always works in theory. In
practice, we know that we need to avoid scanning at all costs, because it
kills performance and it does not scale.
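The trade-off can be sketched with a plain Python dictionary standing in for
a document store. Everything here (the invoice keys, the document layout, the
helper names scan and build_index) is hypothetical and only meant to show the
shape of the problem: the designed access path is a single key lookup, the
"strange" question needs either a full scan or an extra index that somebody
has to build and maintain.

```python
from collections import defaultdict

# Hypothetical document store: invoices keyed by invoice number,
# each document holding a nested list of line items.
store = {
    "INV-1": {"date": "2010-03-14", "customer": None,
              "items": [{"product": "frying pan"}, {"product": "spatula"}]},
    "INV-2": {"date": "2010-07-02", "customer": "Jane Doe",
              "items": [{"product": "frying pan"}]},
    "INV-3": {"date": "2011-01-09", "customer": None,
              "items": [{"product": "kettle"}]},
}

# The access path the store was designed for: one lookup by key.
invoice = store["INV-2"]


def scan(product, year):
    """Full scan: inspects every document, so the cost grows with the store."""
    return sorted(
        key for key, doc in store.items()
        if doc["date"].startswith(str(year))
        and any(item["product"] == product for item in doc["items"])
    )


def build_index():
    """Secondary index (product, year) -> invoice keys.

    Building it is itself a full scan, and it has to be kept in sync
    with the store on every write.
    """
    index = defaultdict(set)
    for key, doc in store.items():
        year = int(doc["date"][:4])
        for item in doc["items"]:
            index[(item["product"], year)].add(key)
    return index


index = build_index()
print(scan("frying pan", 2010))
print(sorted(index[("frying pan", 2010)]))
```

Both calls give the same answer; the difference is who pays, and when: the
scan pays on every query, the index pays on every write and in extra code.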
Since most web applications are both read-intensive and frequently
changed, a solution that saves on time-to-market in the short run, but
increases maintenance effort by requiring more work to incorporate new
requirements, is usually not the best choice.
Since nobody really knows what the future holds, the only way to deal
with the problems described above is to create an architecture that does not
depend on certain technologies, but abstracts from them: a certain component
persists data, and provides you with ways of retrieving that data again. You
will have to figure out how you need to retrieve the data before you can
choose a solution, and this decision should be based only on the functional
and non-functional requirements of your application.
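One way to read "a certain component persists data" is as an interface that
the rest of the application depends on, with the storage technology hidden
behind it. The following is only a sketch of that idea; the names Invoice,
InvoiceRepository, InMemoryInvoiceRepository and warranty_candidates are
invented here, and a real implementation would sit on top of whatever
database the requirements call for.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Protocol


@dataclass(frozen=True)
class Invoice:
    number: str
    issued: date
    products: tuple[str, ...]


class InvoiceRepository(Protocol):
    """The access paths the application needs, independent of any technology."""

    def save(self, invoice: Invoice) -> None: ...
    def by_number(self, number: str) -> Invoice | None: ...
    def containing(self, product: str, year: int) -> list[Invoice]: ...


class InMemoryInvoiceRepository:
    """Hypothetical stand-in; an SQL or NoSQL backed class could replace it."""

    def __init__(self) -> None:
        self._invoices: dict[str, Invoice] = {}

    def save(self, invoice: Invoice) -> None:
        self._invoices[invoice.number] = invoice

    def by_number(self, number: str) -> Invoice | None:
        return self._invoices.get(number)

    def containing(self, product: str, year: int) -> list[Invoice]:
        return [
            inv for inv in self._invoices.values()
            if inv.issued.year == year and product in inv.products
        ]


def warranty_candidates(repo: InvoiceRepository, product: str, year: int) -> list[str]:
    """Application code: knows the interface, not the storage behind it."""
    return [inv.number for inv in repo.containing(product, year)]


repo = InMemoryInvoiceRepository()
repo.save(Invoice("INV-1", date(2010, 3, 14), ("frying pan", "spatula")))
repo.save(Invoice("INV-2", date(2011, 1, 9), ("frying pan",)))
print(warranty_candidates(repo, "frying pan", 2010))
```

When a new access path turns up, it becomes a new method on the interface,
and the question of which technology can serve it well is answered in one
place instead of all over the application.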
Just using technology X, Y, or Z as a data store and hoping that the
access paths this technology offers (in conjunction with the data formats
used) will be sufficient to build a scalable and performant application
will put you on a road with a dead end. I know because I have
been down that road, and came to a stop right in front of that "dead end"
And yes, I really should keep an indexed record of all my hardware,
instead of just throwing all invoices into one box.
This article originally appeared in Web & PHP magazine.