Arkose People

Managing Machine Learning Projects Using Failure Envelopes

March 10, 2020 | 6 min read

Managing a Machine Learning/Artificial Intelligence (ML/AI) project is hard. Even if you just limit yourself to worrying about the technical reasons why your ML/AI project can fail, there are typically more than enough questions to keep you awake at night.

As the manager of an ML/AI project, you will have additional challenges to face beyond making sure your data is clean, that you’re using the right algorithms, that you haven’t waited to build the finished system before exposing your algorithms to your data, and that everything is architected optimally. ML/AI projects can suddenly blow up, but more often they drift slowly towards failure, despite the best intentions.

The good news is that there are only a few ways ML/AI projects commonly fail, and all are driven by factors under our control. We can use the concept of a failure envelope as a useful tool to explore how effort in a project must be correctly balanced.

What is a Failure Envelope?

If you’re unfamiliar with the concept, a failure envelope is a graphical representation of how a thing responds when various stresses are applied to it. That thing can be anything from a rock to an aircraft. Take the aircraft as an example: imagine it flying along with the pilot controlling the airspeed and the control surfaces (heading/pitch). It turns out that only limited combinations of these values will keep the aircraft flying stably, rather than plummeting out of the sky or breaking apart mid-air. Here is a generic airspeed vs load variation diagram:

As you can see, the aim of a pilot is to use their control inputs to keep the aircraft in the green from the beginning of the flight to the very end. Too much movement away from the green area and they risk becoming a case writeup.

The general lesson is clear: Unbalanced effort will lead to failure. What holds true for aircraft also holds true for ML/AI projects. Whereas in the aircraft example we are only looking at two axes, for ML/AI projects I’d like to look at three:

  • Direct ML development effort
  • User training & integration effort
  • Cost containment efforts

Avoiding Project Failure Using ML/AI Failure Envelopes

What this model allows us to do is develop for ourselves an early warning system to identify which failure mode is currently the most likely. We can then correct the balance in our efforts.

1. All projects start at the bottom of the diagram. Technically, this is still a failure state: we have not yet put any effort into the project, so it would, unsurprisingly, be considered a failure. As ML effort ramps up, the project moves upwards in the diagram. Ideally, to minimise project risk, the goal is to stay in the middle of the diagram, at #7.

2. However, cost pressures may arise from the business, leaving insufficient resources (time, staffing or equipment) available for the project. For example, key members of your project may be partially reassigned to other projects. A project in this situation runs the risk of failing by losing momentum and then exiting at the bottom-left of the figure. A robust and honest appraisal of the project scope and strategic value may be required.

3. If this failure state is avoided, the project may then proceed at a healthy pace, and so it moves upwards in the figure. At this stage, the combination of successful ML development effort and robust cost control pressure doesn’t leave much time or money to train staff in how the new system will be used or integrated into their current workflow. Some staff may not have the necessary skills to succeed in this new ML-powered future. If the end-users of the system do not or cannot understand how the project benefits them, that becomes a barrier to adoption. If these barriers are sufficiently large, they can eliminate any potential benefits that the project would have delivered. Conversely, if developers are not engaging sufficiently with the end-users to understand and deliver the features that they need, then there is also no incentive to use the system.

4. The next failure mode is most common when ML efforts are extensive but disconnected from the rest of the business. The danger here comes from insufficient end-user engagement & training combined with insufficient cost pressure. The project moves upwards and exits at the top of the diagram. The result is a White Elephant. These projects are generally incredibly expensive, breathtaking in scope and astounding technical achievements. If only we knew what they did or how to use them. Where possible, these projects tend to be rapidly shelved and everyone involved in them quickly tries to move on. This is not always possible. If you find yourself moving in this direction, focus on delivering rapid iterations that each provide incremental value to the end-users.

5. If too much effort is spent on end-user engagement and training, the development team can find themselves becoming too responsive to user requests and feedback. In this situation the team is expending considerable effort in developing the core features, and even more in engaging with the end-users, redesigning the system and refactoring code. This scope creep increases the cost of the project, potentially to the point where a positive return on investment can never be achieved by the project. The project may be successful both technically and organisationally, but financially it is a failure. Having and communicating a clear vision for what is in and out of scope for the project is essential to avoid this outcome.

6. Finally, the last failure mode is less common overall but more typically a problem for organisations that operate in Consulting & Professional Services. Your organisation may end up spending more time talking to users about the solution than actually implementing it. After establishing a thorough understanding of the problem you are able to identify a brilliant solution. However, faced with insufficient development resources to develop and implement it, all you are able to achieve is to write a compelling report. In academia, you would write a scientific paper instead. In this situation, if it is not possible to rectify the lack of development resources, your best bet will be to develop your theoretical proof of concept into something presented as a Minimum Viable Product.

7. At this stage it may seem like the only way to lead an ML/AI project to success is the impossible solution of investing 100% of effort in each of the three axes. However, the idea of this model is to help you find a balance between each of these factors. “Balance” is the keyword. It is something that’s inherently dynamic, and the optimum mixture of factors will change over the course of the project, as it progresses from a bright idea to an amazing product.
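
To make the early warning idea a little more concrete, the seven positions above can be sketched as a simple rule-based check. This is only an illustration under stated assumptions: the 0.0 to 1.0 scoring of each axis, the Effort class, the likely_failure_mode function and the low/high thresholds are all hypothetical and do not come from the diagram itself. How you actually score effort on each axis is a judgment call for your own project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effort:
    """Relative effort on each axis, scored from 0.0 (none) to 1.0 (maximal).

    The 0-1 scale is an assumption made for this sketch.
    """
    ml_development: float
    user_engagement: float
    cost_containment: float


def likely_failure_mode(effort: Effort, low: float = 0.3, high: float = 0.7) -> str:
    """Map an effort mix to the numbered position it most resembles.

    The `low` and `high` thresholds are illustrative assumptions.
    """
    ml = effort.ml_development
    users = effort.user_engagement
    cost = effort.cost_containment

    if ml < low:
        if cost >= high:
            return "2: losing momentum"   # starved of resources
        if users >= high:
            return "6: report only"       # all talk, no implementation
        return "1: no ML effort yet"      # still at the starting point

    if ml >= high and users < low:
        if cost < low:
            return "4: white elephant"    # unchecked, disconnected build
        return "3: barriers to adoption"  # no time or money left for users

    if users >= high and cost < low:
        return "5: scope creep"           # too responsive to user requests

    return "7: balanced"


if __name__ == "__main__":
    # A team building heavily with little user contact and no cost pressure.
    print(likely_failure_mode(Effort(0.9, 0.1, 0.1)))  # prints "4: white elephant"
```

Because balance is dynamic, a check like this would be re-run at regular intervals over the life of the project rather than once at the start.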