Sunday 16 January 2011

Planning for failure.

Years ago I heard a story - possibly apocryphal - about the emerging electronics industry in the sixties. A large American company wanted to win the contract to build some of the Saturn V / Apollo hardware. They worked on their proposal, costed it and got ready for the meeting with NASA.

They were surprised to find that the NASA team consisted mostly of engineers. This team sat through the company's slick presentation without comment until the end, when they were asked if they had any questions. One of the NASA engineers asked a simple question: "How does it fail?"

The company's marketing men were shocked and did not have an answer. They had prepared to answer questions about cost, timescales and capabilities, but this first question stumped them completely.

So why did NASA want to know how it would fail? And why was it their first question? The answer is simple: they trusted the company to meet the specification requested; after all, that was their job. However, they wanted to ensure that if it failed it would not damage any of the other components made by other companies.

After that, the company always had engineers in their meetings with NASA, and always made sure they knew how failure of their devices would affect the rest of the system.

Many of the common bugs in computer programs are caused by the programmer not planning for failure.

Let us take one simple and common function in the C programming language. malloc() allocates an area of memory for use by the programmer. On the vast majority of occasions it will succeed, returning a pointer to the memory. However, sometimes it will fail, returning NULL instead. It is common to see code where the programmer does not check for this failure case. The reason is that checking for all possible failures takes time, and programmers are more interested in the cases where things work.

For instance, the following line of code, whilst nominally correct, will have me tearing my hair out:
int *broken_ptr = malloc(20);

A better example would be the following:
#include <stdlib.h>   // for malloc(), free() and NULL

int *good_ptr = malloc(20 * sizeof(*good_ptr));
if (good_ptr == NULL)
{
  // Failed to allocate memory, must recover.
}
else
{
  // We can now do something.
  ...
  // We have finished with the buffer. Free the memory.
  free(good_ptr);
  good_ptr = NULL;
}

Even a non-programmer can see that the second example takes far longer to write and requires much more thought. It is, however, much better code (although still not perfect). In particular, the programmer will need to consider exactly how to recover from a failure to allocate the memory. Unfortunately, misuse of malloc() in C is a prominent cause of programming bugs.
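
To make the "recover" branch a little more concrete, here is one possible strategy, sketched out rather than prescribed: when a program genuinely cannot continue without the memory, wrap the allocation in a small helper that reports the failure and exits cleanly. The name xmalloc below is just a common convention, not something from the examples above.

#include <stdio.h>
#include <stdlib.h>

// Allocate memory, or stop with a clear error message.
// Only appropriate when the program cannot usefully continue without it.
static void *xmalloc(size_t size)
{
  void *ptr = malloc(size);
  if (ptr == NULL)
  {
    fprintf(stderr, "Failed to allocate %zu bytes\n", size);
    exit(EXIT_FAILURE);
  }
  return ptr;
}

int main(void)
{
  // The caller no longer has to handle the NULL case itself.
  int *numbers = xmalloc(20 * sizeof(*numbers));
  // ... do something with the buffer ...
  free(numbers);
  numbers = NULL;
  return 0;
}

A library, on the other hand, should normally pass the failure back to its caller - by returning NULL or an error code - so that the caller can decide how to recover; calling exit() from library code takes that decision away from it.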

Similar problems can be seen in many other forms of engineering, most visibly in 'cascade failures', where the failure of one part of a system causes other parts to fail in turn. This is a particular concern in power transmission systems, and engineers strive to design against it.
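
The software equivalent is easy to sketch. In the fragment below (the function names are invented purely for illustration), an unchecked allocation failure in one function does not show up where it happens; the NULL pointer is passed on, and any crash occurs later, in code that did nothing wrong, which makes the original fault far harder to find.

#include <stdlib.h>
#include <string.h>

// Returns a buffer for the caller - but never checks the allocation.
static char *make_buffer(size_t size)
{
  return malloc(size);   // may silently return NULL
}

static void fill_buffer(char *buffer, size_t size)
{
  memset(buffer, 0, size);   // undefined behaviour (typically a crash) if buffer is NULL
}

int main(void)
{
  char *buffer = make_buffer(1024);
  fill_buffer(buffer, 1024);   // the cascade: the fault surfaces far from its cause
  free(buffer);
  return 0;
}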

The key is to give engineers the time to design and implement systems fully. It is relatively trivial to get a system working; the real work lies in making it work properly in all cases, including the unforeseen.
