If not only the individual developer, the team involved, and the company concerned are to learn from a project, but our industry as a whole, then IT projects need to be discussed publicly. That, however, is the exception and happens far too rarely: for example, when a 655 million US dollar space probe is lost, or when one Fortune 500 company (Hertz) sues another (Accenture) over a failed IT project.
NASA's Mars Climate Orbiter (MCO) mission is an example of an IT project whose failure is well documented. The probe, which was meant to study the climate of the planet Mars, was lost on September 23, 1999 due to a unit error in the navigation system. Just one week later, on September 30, 1999, NASA made the cause of the error public: part of the team had calculated in inches, feet, and pounds, while the rest of the team had used metric units. One statement in particular is remarkable:
The problem here was not the error, it was the failure of NASA's systems engineering, and the checks and balances in our processes to detect the error.
The technical error that led to the loss of the probe is therefore not seen as the underlying cause, but as a consequence of shortcomings in the development process: too little communication between the teams and a lack of integration tests.
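How could such a mix-up have been caught? The following sketch is purely illustrative and assumes nothing about NASA's actual code: a value object that carries its unit, combined with a simple test at the boundary between the two teams, turns a silent unit mismatch into a loud failure.

```python
# Purely illustrative sketch, not NASA's code: a value object that makes the
# unit explicit, so mixing pound-force seconds and newton seconds fails a test
# instead of silently corrupting a trajectory calculation.
from dataclasses import dataclass

LBF_S_TO_N_S = 4.448222  # one pound-force second expressed in newton seconds


@dataclass(frozen=True)
class Impulse:
    newton_seconds: float

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        return cls(value * LBF_S_TO_N_S)

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)


def test_both_teams_mean_the_same_impulse() -> None:
    # An integration test at the boundary between the two teams: the same
    # physical quantity, expressed once in imperial and once in metric units,
    # must compare as equal.
    imperial = Impulse.from_pound_force_seconds(1.0)
    metric = Impulse.from_newton_seconds(4.448222)
    assert abs(imperial.newton_seconds - metric.newton_seconds) < 1e-6
```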
In October 2018, a Boeing 737 MAX 8 aircraft crashed shortly after take-off in Indonesia (Lion Air Flight 610). In March 2019, an aircraft of the same type also crashed just minutes after take-off in Ethiopia (Ethiopian Airlines Flight 302). In both cases, all people on board were killed. And in both cases, a piece of software, the Maneuvering Characteristics Augmentation System (MCAS), was the cause of the crash.
Robert C. Martin, known to software developers as "Uncle Bob" and author of "Clean Code", has analyzed in an article what is known about the software problems of the MCAS of the Boeing 737 MAX 8. His perspective is not limited to that of an expert in software development: as a pilot, he also knows the context in which the software in question operates. As far as is known, in both cases the MCAS software was supplied with faulty data from an angle-of-attack sensor. Based on this data, the MCAS then took control of the aircraft away from the pilots. The software relied on the data from this one sensor without comparing it with other data such as airspeed, vertical speed, or altitude, or with the data from a second angle-of-attack sensor. Such a cross-check of the data becomes second nature to a pilot during instrument flight training; after all, an instrument can fail in a way that is not directly noticeable. In his article, Robert C. Martin asks why the software developers did not take these basic principles of aviation into account.
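To make the idea of such a cross-check concrete, here is a deliberately simplified sketch. It is not Boeing's MCAS logic, and the thresholds are invented; it only illustrates the principle Robert C. Martin describes: trust an angle-of-attack reading only when the redundant sensor and the other flight data agree with it.

```python
# Deliberately simplified, hypothetical sketch -- not Boeing's actual MCAS
# logic. It only illustrates the principle of cross-checking one instrument
# against redundant and related data before acting on it.
from dataclasses import dataclass

MAX_AOA_DISAGREEMENT_DEG = 5.0  # invented threshold


@dataclass(frozen=True)
class FlightData:
    aoa_left_deg: float       # angle of attack, left sensor
    aoa_right_deg: float      # angle of attack, right sensor
    airspeed_knots: float
    vertical_speed_fpm: float


def angle_of_attack_is_plausible(data: FlightData) -> bool:
    """Trust the angle-of-attack reading only if both sensors agree and the
    reading is consistent with the rest of the flight data."""
    if abs(data.aoa_left_deg - data.aoa_right_deg) > MAX_AOA_DISAGREEMENT_DEG:
        return False  # the sensors disagree: one of them has probably failed
    # A very high angle of attack while the aircraft is fast and climbing is
    # more likely a sensor failure than an actual impending stall.
    high_aoa = max(data.aoa_left_deg, data.aoa_right_deg) > 20.0
    if high_aoa and data.airspeed_knots > 250 and data.vertical_speed_fpm > 0:
        return False
    return True
```

Only if such a plausibility check passes should automation be allowed to take control away from the pilots; otherwise the system should leave the decision to the humans.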
In April 2019, the lawsuit filed by the US car rental company Hertz against its IT service provider Accenture caused a stir. The complaint, which has become public, contains many details that provide deep insights into a failed IT project. In August 2016, Hertz commissioned Accenture to develop a new online presence, which was supposed to go into production in December 2017. This first deadline was missed, as were a second (January 2018) and a third (April 2018). Hertz is now suing Accenture for the $32 million in fees it already paid and for the millions more needed to clean up the mess that was left behind. According to Hertz, Accenture delivered neither a working app nor a working website. Although Hertz required a solution that could be used for the Dollar and Thrifty brands and in markets outside North America, Accenture developed software that could only be used for the Hertz brand and only in North America. And even that would only have been possible if the software had ever been completed.
It may be tempting to gloat over the fact that even big companies like Hertz and Accenture can screw up a project. But the only thing that makes the failure of this project special is that, on the one hand, a lot of things went wrong and, on the other hand, all of it came to light because of a lawsuit.
Accenture did not test the software it developed, at least not thoroughly or in time. The statement of claim can be read to mean that neither automated tests in general nor test-driven development in particular were used. The missing automated tests cannot simply be written retrospectively, because it is not clear enough what the code actually exists for. Why should it be unclear why the code under test exists? That can be read between the lines of the description of other project problems: the Accenture developers, for example, were not able to integrate the back-end code (Java) with the front-end code (Angular) in an error-free, performant, and secure way.
It would be too easy, and it would fall short, to blame Accenture alone for the failure of the project. Hertz did not have a development team of its own that could have implemented the project in-house, because all developers were laid off at the beginning of 2016. For the project to succeed, at least the role of "Product Owner", to use a term from the Scrum world, would have had to be filled in-house. Since this did not happen, the responsibility for deciding which requirements go into the product backlog and in which order they are implemented lay not with the client but with the service provider. The resulting poor communication between Hertz and Accenture was probably one of the main reasons for the failure of the project. And regardless of whether Hertz demanded it or not, Accenture committed itself to a go-live date. Instead of the planned big-bang deployment, which never happened, it would have made sense not only to implement one use case after another in short iterations, but also to roll the software out continuously.
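What could a continuous rollout have looked like in practice? The following sketch is hypothetical (the feature names and markets are invented and have nothing to do with Hertz's actual system); it only illustrates one common technique: shipping each increment behind a feature flag and widening the rollout market by market, so that going live becomes a configuration change rather than a big-bang event.

```python
# Hypothetical sketch of a market-by-market rollout behind feature flags.
# Feature names and markets are invented for illustration.
ROLLOUT: dict[str, set[str]] = {
    "new_checkout": {"US", "CA"},   # already live in North America
    "thrifty_branding": {"US"},     # still limited to a single market
}


def is_enabled(feature: str, market: str) -> bool:
    return market in ROLLOUT.get(feature, set())


# Each short iteration delivers a complete use case; enabling it for the next
# market or brand is a configuration change, not a deployment event.
assert is_enabled("new_checkout", "CA")
assert not is_enabled("thrifty_branding", "DE")
```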
Modern development processes provide for retrospectives and post-mortems so that one can learn from one's own mistakes. These help individual developers, individual teams, and the entire company to become better. I find it commendable when companies make their post-incident analysis public, for example in the company blog, when something has happened to them. GitHub and Travis CI are examples of this. It gives others the opportunity to learn from those mistakes as well.
Of course, you can also learn from things that have gone well in a project. For example, I like to watch videos of the "Classic Game Post-Mortem" talks from the Game Developers Conference. There I learn not only about games like Maniac Mansion or Civilization, which I played on the Amiga as a child, or about how the programmers used clever tricks to overcome the hardware limitations of the computers of that era, but also about how software was developed back then, how projects were managed (or not), and so on. I am always glad that software is developed differently today.
The Chrysler Comprehensive Compensation System (C3 project) is an example of an IT project from which much good has come. In the early 1990s, the US car manufacturer Chrysler decided to develop new software for the payroll accounting of 87,000 employees. Development (in Smalltalk) began in 1994. Two years later, the software had not yet processed a single payroll, and Kent Beck was hired to save the project. Beck in turn brought Ron Jeffries on board. In March 1996, the team estimated that the software would be ready for use a year later. The C3 project went down in software development history because in 1997 the team decided to change the way it worked. This new way of working later became known as "Extreme Programming". The estimate that the software would be ready for use within a year proved to be almost accurate: with a delay of only a few months (due to some unclear requirements), a first version went live and was used to handle the monthly payroll for 10,000 employees. The practices used by Kent Beck, Ron Jeffries, and their team, such as test-first programming, pair programming, and closer involvement of the customer, especially in combination with short feedback cycles, were successfully tried out in the C3 project, formalized and popularized in its aftermath, and have changed the way we all develop software forever.
To develop software successfully means to work in a goal-oriented way. These goals should result from acceptance criteria agreed upon with the business. Without a clear definition of goals (in the form of tasks), a developer runs the risk of getting lost in his work; in particular, he does not know when he is finished with a task. Acceptance criteria can be documented and checked using automated tests. Either way, the goals must be defined before production code is written. This is test-driven development, whether you want to call it that or not.
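As a minimal sketch of this idea (the pricing rule, names, and numbers are invented and have nothing to do with any real requirements): an acceptance criterion is written down as an automated test before the production code exists, and the test defines when the task is done.

```python
# Minimal, hypothetical sketch: the acceptance criterion is written as a test
# first; the production code below is only written to make this test pass.
def test_weekend_rental_gets_ten_percent_discount() -> None:
    # Invented acceptance criterion: a rental that includes a weekend is
    # charged the daily rate for each day, minus a 10 percent discount.
    price = price_for_rental(daily_rate=50.0, days=2, includes_weekend=True)
    assert price == 90.0


def price_for_rental(daily_rate: float, days: int, includes_weekend: bool) -> float:
    total = daily_rate * days
    if includes_weekend:
        total *= 0.9
    return total
```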
Together with Stefan Priebsch and Arne Blankerts, I wrote an article a while ago about why we think that tests do not keep a developer from getting things done.
The main task of a developer is not to write code, but to understand the problem at hand. In his article about the MCAS software of the Boeing 737 MAX 8, Robert C. Martin puts it this way:
Programmers must not be treated as requirement robots. Rather, programmers must have intimate knowledge of the domain they are programming in.
A developer can only do meaningful work if he understands the domain in which the company he develops software for operates. In my experience, IT projects go poorly when "the business" communicates with the "IT cost center" only via tickets, without any context as to why something should be changed or implemented. And in my experience, IT projects go well when the software developers understand how the business works and how a change can contribute to its success. This succeeds when it is clear to everyone involved that software development is more about communication between humans than between humans and machines.