If, like me, you’ve been gripped by the Moon landing 50th anniversary coverage, you’ve probably been struck by how calmly the Apollo 11 team dealt with the stuff which went wrong, given the wafer-thin margins and mortal risks they faced.
If, like me, you’re an IT person, you’ve probably also noticed how much of the stuff which went wrong was IT-related.
Technology has developed exponentially in the 50 years since Apollo 11, yet I’ve found myself thinking about how many of the IT lessons from the mission are still relevant today.
If anything, they’re more pertinent than ever now that we’ve moved way beyond circuit-breakers and hand-written cheat sheets towards the age of ‘Open the pod bay doors, Hal’.
By the way, for all the entrepreneurs out there, I’m definitely in the market for a smart speaker which answers only to ‘Hal’. If you have no idea what I’m talking about I apologise, twice, because ‘2001: A Space Odyssey’ crops up again later.
And on that subject, the first IT lesson I picked up while I was reliving the moon landings was the enduring importance of the manual override.
As the Lunar Module began its computer-plotted landing trajectory, Neil Armstrong spotted that they were going to land in a boulder field, so he switched to manual control in order to land somewhere safer.
To rub salt in the robots’ wounds, the automated, unmanned Soviet Luna 15 probe crashed the next day in its attempt to prove that a craft could soft-land on the moon and do everything Apollo 11 did without risking anyone’s life.
There’s nothing particularly startling here. Pretty much every autopilot in history so far has had a manual override.
But as IT gets more pervasive and sophisticated, it gets harder to avoid relying on it absolutely, to the point where even the ‘manual’ option requires some software to perform 100% predictably and reliably.
In the terrible Boeing 737 Max accidents, it appears the aircraft may not have responded to the pilots’ intervention in the way they expected, or possibly, and most disturbingly, in the way they had experienced in the simulator.
As automation increases we need to make sure we don’t lose sight of the importance of the human actor, particularly in safety-critical operations, and that we design people into the places where their input is still vital.
Even at much more trivial levels there are risks and downsides to giving the algorithms complete free rein.
Like most people with an online footprint, I’m constantly receiving suggested posts and recommendations based on my buying and social media history. These are powered by undoubtedly terribly clever algorithms and impressively hefty data mining.
And yet I still haven’t spotted the manual override option, the one which allows you to say ‘Good try, but this particular 58-year-old male with minor prostate issues is surprisingly not interested in male incontinence products quite yet, so please don’t suggest that post again.’
It doesn’t seem a stretch for the technology to build in this kind of feedback loop, and it strikes me it would be hugely valuable in refining those terribly clever algorithms.
Yet we seem to prefer to keep the loop closed and not let people interfere.
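To make the idea concrete, here is a minimal sketch of the kind of feedback loop I have in mind, in Python. None of it is any real platform’s API; the function names, the ‘topic’ field and the suppression list are all invented purely for illustration.

suppressed_topics = set()   # topics this user has explicitly said 'no thanks' to

def record_override(topic):
    # The manual override: 'good try, but please don't suggest that post again.'
    suppressed_topics.add(topic)

def filter_recommendations(candidates):
    # Drop anything the user has overridden, but keep the rejections so the
    # terribly clever algorithms can learn from them instead of losing them.
    kept = [c for c in candidates if c["topic"] not in suppressed_topics]
    rejected = [c for c in candidates if c["topic"] in suppressed_topics]
    return kept, rejected

record_override("male incontinence products")
kept, rejected = filter_recommendations([
    {"topic": "male incontinence products", "item": "promoted post"},
    {"topic": "Apollo 11 documentaries", "item": "suggested video"},
])

The filtering is the boring bit; the valuable bit is treating the rejections as feedback for the model rather than throwing them away.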
I believe we need to keep that constructive human/machine interaction if we’re going to avoid the awkward situation where we end up wielding the big screwdriver while the computer sings ‘Daisy.’
(Sorry, that’s the other ‘2001’ reference – get someone who’s seen it to explain it to you).
Back to Apollo 11. On top of the manual control, the longer-than-planned landing run and the rapidly dwindling fuel, Neil Armstrong and Buzz Aldrin now had a master alarm going off in the cabin, just as in all the best disaster movies.
They asked Mission Control to advise on the specific error code triggering the alarm, code 1202.
Fortunately, the same type of error had occurred in an earlier training run. Although that run had been aborted as a result, further investigation identified that the code had been triggered by a marginal overload of the onboard computer, which shouldn’t affect the landing as long as it didn’t keep recurring.
A forward-thinking engineer had prepared a crib sheet of errors they’d seen during these training runs, along with the impact and action to be taken.
Based on this crib sheet Armstrong and Aldrin were advised they didn’t need to abort the landing as long as they didn’t get too many more of these alarms (they got a few but landed anyway).
This incident made me think of three other IT lessons I believe we can learn from Apollo 11.
The first is the potential for applying AI to monitor and manage the complex interrelationships between software and its operating environment, and so prevent costly failures.
I think most people who have worked in IT Operations realise that it’s often the non-functional stuff which bites you hardest in production.
Functional bugs do happen, but the problems which hit the headlines and lead to CIOs suddenly wanting to spend more time with their families are the ones where the code doesn’t play nicely with the infrastructure.
As our IT becomes more and more distributed and non-deterministic, it gets even harder to keep control of the non-functional aspects.
Non-functional requirements in areas like security and interoperability which used to be definable in fairly generic, broad brush terms are becoming increasingly complex and granular.
Clearly that’s partly why DevOps was invented, and it helps a lot, but I believe we’re at the point where we need more help from somewhere to keep the non-functional stuff under control.
I think AI may be the answer, at least in part.
If the software engineers on Apollo 11 could have built in some rudimentary machine learning around the interactions of the software, OS and hardware, it could have been used to suppress the 1202 alarm to the crew unless and until it required an abort decision.
That’s pretty basic stuff; fast forward fifty years and we need sophisticated algorithms and feedback loops to keep up with the ever more complex non-functional challenges.
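To show the shape of that idea (and emphatically not how the real Apollo Guidance Computer worked), here is a toy Python sketch in which an overload alarm is logged quietly and only escalated to the crew if it keeps recurring. The window and threshold numbers are invented, and the ‘learning’ part is left out; this only illustrates the suppress-unless-recurring behaviour.

from collections import deque
import time

WINDOW_SECONDS = 30      # invented: how far back we look for repeat alarms
ESCALATION_COUNT = 5     # invented: more than this in the window needs a crew decision

recent_alarms = deque()

def handle_program_alarm(code, now=None):
    # Log every alarm for the engineers, but only raise the master alarm if the
    # overload keeps recurring and an abort decision is genuinely on the table.
    now = time.monotonic() if now is None else now
    recent_alarms.append(now)
    while recent_alarms and now - recent_alarms[0] > WINDOW_SECONDS:
        recent_alarms.popleft()
    if len(recent_alarms) > ESCALATION_COUNT:
        return f"MASTER ALARM {code}: recurring overload, abort decision needed"
    return f"{code} logged: marginal overload, carry on with the landing"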
Cyber security is seeing some big advances in this space, driven by powerful external pressures, and we’re starting to see some intelligent network management tools too; I think there is still plenty of scope for growth in the other non-functional areas.
The second lesson I drew from the 1202 incident is the continuing need to consider IT issues from the viewpoint of the customer’s priorities.
The 1202 error triggered a master alarm exactly when Armstrong and Aldrin needed it like a hole in a heat shield.
As I understand it, while the error was important in terms of system health, it wasn’t immediately critical to the mission, and there was no immediate remedial action the crew or Mission Control could have taken.
Given 20/20 hindsight, it’s at least questionable whether it should have triggered a master alarm.
As IT professionals, we naturally want to know as soon as anything goes wrong with our systems, and when it does, we want to fix it.
It’s all too easy to become so focused on the problem that we lose sight of the business or user impact. That’s why ITIL makes the vital distinction between incident and problem management.
Getting the balance right gives us some big challenges, all the way back from operation into design.
Shutting down a server in an HA cluster when it loses comms with its peers is necessary to avoid data corruption, yet it’s likely that all the servers in the cluster are doing the same, rendering the client’s business-critical HA system DOA.
I don’t have a specific answer to this kind of dilemma, but I get a sense that we could be doing more with the know-how and technology we have to smooth this kind of bump in the road and design out the nasty surprises – the 1202 master alarms – for our customers.
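For the HA example above there is at least one well-worn design pattern that softens the blow: quorum. A node only fences itself if it can no longer see a majority of the cluster, so a single network wobble can’t take every server down at once. A minimal sketch, with the cluster details invented:

def should_self_fence(reachable_peers, cluster_size):
    # Shut this node down only if it has lost quorum, i.e. it can no longer
    # see a majority of the cluster (counting itself as one vote).
    votes = reachable_peers + 1
    return votes <= cluster_size // 2

# In a five-node cluster, a node that can still see two peers keeps serving...
assert should_self_fence(reachable_peers=2, cluster_size=5) is False
# ...while a node cut off from everyone steps down to protect the data.
assert should_self_fence(reachable_peers=0, cluster_size=5) is True

It doesn’t make the dilemma disappear (losing the majority still hurts), but it does design out the everyone-shuts-down-at-once surprise.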
The last lesson comes from that engineer’s hand-written crib sheet. It’s that documentation is still really important, but only when you need it.
When Agile came along, one of its great liberating benefits was that it got us away from having to design the whole system before we started building it. Instead, the design evolves as the product is developed.
One of the risks around this is that we are so busy delivering functionality that we can be tempted to skimp on recording key information about the system, particularly about non-functional stuff which may not feature on the customer backlog.
For me, the Apollo 11 engineer’s crib sheet is an example of the perfect answer to the question “how much system documentation do I need?”, the answer being “just the right amount.”
It’s no more than a pencil-written page torn off a legal pad and probably took a few minutes to jot down, but the mission controller was able to use it during those critical minutes of the landing to inform a vital decision on whether to abort or continue.
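Just to underline the ‘right amount’ point: the whole crib sheet boils down to a tiny lookup of code, impact and action. In Python it might look like the sketch below, where the 1202 entry paraphrases the story above and the wording and fallback advice are entirely mine.

PROGRAM_ALARM_CRIB = {
    "1202": ("Marginal overload of the onboard computer",
             "Continue the landing unless the alarm keeps recurring"),
}

def advise(code):
    # Anything not seen in training gets the cautious default.
    impact, action = PROGRAM_ALARM_CRIB.get(
        code, ("Not seen in training", "Escalate before committing to land"))
    return f"{code}: {impact}. Action: {action}"

print(advise("1202"))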
Of course, once we’ve given all our systems the intelligence to look after their own operability, security, conformance and the rest, we can, sadly, consign system documentation to history along with that marvellous handwritten list of error codes and whether they’d stop us landing on the moon.
Those are my lessons for IT from Apollo 11. I’m sure there are many more, and I’d love to hear any you may have.
In the meantime, let’s pay tribute once more to the team which made the moon landings happen, and look forward to the first IT error-free, manually overridable Mars landing in due course.