When it gets hot

October 26, 2012

The first thing I heard after boarding an InterCity Express (ICE) train in Germany recently was this announcement:

Due to technical problems, this train has only half the number of coaches today.

Of course, my seat reservation was for one of the coaches that were not there. "No big problem," I thought, "I am fine standing for about 30 minutes." But then the second announcement was made:

Due to technical problems, the air conditioning system only works in half of the coaches today.

This would not be a problem if the windows in an ICE could be opened. But since it is a bad idea to have open windows when the train travels at 300 km/h, opening the windows on ICE trains is not possible.

When the A/C is not working, the temperature on board an ICE train can easily rise to 50°C or more in the summer. Trust me, this is not a fun place to be.

While I happened to be standing in one of the coaches that had a functioning air conditioning system, this situation made me angry. Why? Because I know that this is a systemic problem: the air conditioning system is only designed to work for outside temperatures up to 32°C. And guess what happens when the passengers migrate from a coach without A/C into a coach with A/C? Simple: another air conditioning system shuts off because it cannot handle the situation.

If you ask me, the design (maybe also the implementation, but I am not an engineer) of the InterCity Express's air conditioning system does not live up to the standard of reliability, quality, and aspiration to perfection that is expected of a product that is "Made in Germany".

In our world of software engineering in general, and of web applications with high user numbers in particular, reliability is the aspect of software quality that deals with questions such as "Can the application cope with high loads?" or "Does the application still function correctly in unusual situations?"

The environment of a web application, for example the size and the behavior of its user base, is constantly changing. What was sufficient yesterday can be insufficient tomorrow. Capacity planning is our tool for recognizing today that we need to scale our application further so that it can still fulfill its functional requirements and quality goals tomorrow.

The product owner and the developers sometimes only have a vague service-level agreement such as

The webserver must be able to respond to N requests per second.

Does serving an HTTP 500 (Internal Server Error) status code count as responding to a request? Technically it does, but it will probably not make the product owner happy.

Furthermore, not all requests are equally expensive with regard to resource usage: serving static assets is cheaper than executing complex business logic. This means that a sensible service-level agreement has to take this into account and specify a performance goal that can be measured.
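To make this concrete, here is a minimal sketch of what a measurable goal could look like: one goal per class of request, each with a latency limit at a given percentile and a maximum share of server errors. The request classes, the field names, and the nearest-rank percentile method are assumptions chosen for illustration, not something a real agreement has to use.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    kind: str      # request class, e.g. "static" or "checkout" (hypothetical names)
    status: int    # HTTP status code that was served
    millis: float  # response time in milliseconds


@dataclass
class Goal:
    percentile: float      # e.g. 95.0 for the 95th percentile
    max_millis: float      # latency limit at that percentile
    max_error_rate: float  # allowed share of 5xx responses, e.g. 0.01


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_goal(requests, kind, goal):
    """Check the measured requests of one class against its goal."""
    sample = [r for r in requests if r.kind == kind]
    if not sample:
        return True  # nothing measured, nothing violated
    # A 5xx answer is a response, but it does not count as a success.
    errors = sum(1 for r in sample if r.status >= 500)
    error_rate = errors / len(sample)
    latency = percentile([r.millis for r in sample], goal.percentile)
    return error_rate <= goal.max_error_rate and latency <= goal.max_millis
```

With separate goals, cheap static assets can be held to a much tighter latency limit than expensive business logic, and an HTTP 500 no longer counts as "responding to a request".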

Such a service-level agreement is something the developers can commit to and design the system for. Imagine the blame game that could ensue when the product owner only vaguely expresses the quality goals.

Traffic can be sporadic and unpredictable at times. To prepare for periodic spikes you need to know how many requests per second a server can manage before its performance degrades below a given threshold. This knowledge will allow you to configure alerts in your system monitoring. You will also know the impact to expect when you add a new server. Combined, these two aspects will hopefully make you aware early enough that you need more machines.
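The arithmetic behind this is simple enough to sketch. The following assumes that you have measured the per-server limit in a load test, that capacity grows roughly linearly with the number of servers, and that you want to keep a safety margin; the function names and the 25% default headroom are hypothetical and only serve the illustration.

```python
import math


def servers_needed(peak_rps, rps_per_server, headroom=0.25):
    """Servers required so that each one stays below its measured limit.

    rps_per_server is the measured rate at which one server's performance
    starts to degrade; headroom is the share of that limit kept in reserve.
    Assumes capacity scales linearly with the number of servers.
    """
    usable = rps_per_server * (1 - headroom)
    return math.ceil(peak_rps / usable)


def alert_threshold(servers, rps_per_server, headroom=0.25):
    """Cluster-wide request rate at which monitoring should raise an alert."""
    return servers * rps_per_server * (1 - headroom)


print(servers_needed(1000, 200))  # expected peak of 1000 requests per second
print(alert_threshold(4, 200))    # current cluster of four servers
```

Linear scaling is an optimistic assumption: a shared database or cache can become the bottleneck long before the web servers do, so the numbers are a starting point for a load test, not a replacement for one.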

When periodic spikes turn into the "new normal" (ideally even before), the service-level agreement has to be renegotiated between the business owner and the developers. Constantly managing a significantly higher load might not be achievable by "just" throwing more hardware at the problem. It may instead require a refactoring of the software itself, for which the developers need time and budget.

The air conditioning system of the InterCity Express was designed decades ago for a maximum outside temperature of 32°C. Nowadays, temperatures above 32°C are no longer periodic spikes, and the requirements for the A/C system should be updated. The business owner has decided not to upgrade the legacy A/C systems and to let the customers sweat instead.

In our world of software engineering it should be a lot easier to replace legacy components and adapt the system to new requirements than it is to replace legacy hardware.