
How to Compare ML Solutions Effectively

Increasing the chances of getting a model to production

When evaluating and comparing machine learning solutions, your first go-to evaluation metric will probably be predictive power. It’s easy to compare different models with one single metric, and this is perfectly fine in Kaggle competitions. In real life, the situation is different. Imagine two models: one model that uses 100 features and a complex architecture, and another model that uses 10 features and XGBoost. The complex model scores just a little bit better than the XGBoost model. In this case, would you go for the best performing model or for the easier one?

This article will give you an overview of different factors you can consider while comparing different machine learning solutions. With an illustrative example, I will show you how to compare models in a better way than using only predictive power. Let’s go!

Besides prediction results, there are several other important factors to consider when comparing machine learning prototypes. These factors provide valuable insights into the overall suitability and effectiveness of the models in real-life scenarios. By looking beyond predictive power, you increase the chances of getting your machine learning solution to production.

The factors are grouped in four categories: maintenance, implementation complexity, costs, and business requirements. Up front, the project team should decide which factors are important for the project. During creation of the prototype solutions, developers can already take notes about the different factors.


Maintenance

How hard is it to collect data or to perform feature engineering? Do you use many different libraries, and is the model sensitive to parameter tuning? Does the project use standard APIs that you can place in a pipeline? These aspects make a solution easier or harder to maintain.

If your data comes from many different internal and external sources, that is a disadvantage compared to a solution that relies solely on internal company data. The reason is that you cannot count on external sources to remain unchanged, and any alteration or update in those sources may require refactoring or adjustments in your solution. This is an example of a maintenance issue that can arise.

Another part of maintainability is monitoring. This involves tracking metrics, detecting anomalies or degradation in performance, and debugging issues that may arise. Some models are easier to monitor and debug than others, which can be an advantage when comparing them.

Implementation complexity

Implementation complexity measures the difficulty and effort involved in deploying a model into a production system. It considers factors like the availability of necessary libraries, the complexity of the model architecture, and the compatibility with existing infrastructure. A model that is straightforward to implement and integrate into existing systems can save valuable time and resources during the deployment phase.

Another factor that can influence implementation complexity is familiarity with the approach. Choosing a model that aligns with the team’s skill set can significantly impact the development timeline.

Complex road structure. Photo by Timo Volz on Unsplash.


Costs

It’s easy to develop a model that costs a lot of money. Costs are an important factor for almost any company. If you need an expensive license for a certain solution, you should be able to justify why that license is worth the cost.

You can spend money on data acquisition, data storage, (re)training, inference, or licenses and subscriptions. Also, the resources for developing the solution have a certain cost. By making an educated guess about these costs up front for every solution, it becomes another factor to compare solutions on.

If the costs exceed the budget (or the value the model will bring), you should reconsider the approach. It can also be the case that two solutions score the same on all factors except costs. In that case the choice is easy: the cheaper solution is the better one.
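The educated guess described above can be written down as a small calculation. The following is a minimal sketch, not a prescribed method: the class name `CostEstimate`, the helper `within_budget`, and every number in the usage example are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    """Rough yearly cost guess for one prototype (hypothetical structure)."""
    name: str
    data_acquisition: float
    data_storage: float
    training: float      # includes retraining
    inference: float
    licenses: float      # licenses and subscriptions
    development: float   # people and other resources for building the solution

    def total(self) -> float:
        """Sum of all cost categories mentioned in the text."""
        return (self.data_acquisition + self.data_storage + self.training
                + self.inference + self.licenses + self.development)


def within_budget(estimate: CostEstimate, budget: float, expected_value: float) -> bool:
    """True if the total cost exceeds neither the budget nor the expected value."""
    return estimate.total() <= budget and estimate.total() <= expected_value


# Made-up numbers, only to show how two prototypes could be compared on cost.
simple = CostEstimate("simple model", 0.0, 500.0, 1_000.0, 2_000.0, 0.0, 20_000.0)
complex_ = CostEstimate("complex model", 5_000.0, 2_000.0, 15_000.0, 10_000.0, 8_000.0, 40_000.0)

for candidate in (simple, complex_):
    ok = within_budget(candidate, budget=50_000.0, expected_value=60_000.0)
    print(f"{candidate.name}: total={candidate.total():.0f}, within budget={ok}")
```

Keeping the categories separate, rather than recording only a total, makes it easier to see where one solution is more expensive than another.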

Business requirements

Finally, business requirements are a critical factor when comparing ML solutions. They can come in many forms. Here are some common ones:


Interpretability

Being able to understand and explain specific predictions is a vital part of some business processes. In that case, a model that is easy to explain can be more important than predictive power. If interpretability is important, you should try to keep the model simple. You can experiment with different interpretation techniques and score how easy it is to use each technique together with the model.


Time-to-market

In competitive industries or when addressing time-sensitive opportunities, the speed at which the model can be developed and deployed may be a critical business requirement. Minimizing the time-to-market can be essential to gain a competitive advantage. Models that can be developed and deployed quickly, with few iterations and without complex preprocessing steps, can be advantageous in such scenarios.

Regulatory compliance

Certain industries, such as finance, healthcare, and insurance, have strict regulations and compliance standards. Business requirements may include ensuring that the selected models adhere to these regulations, such as data privacy laws (e.g., GDPR), industry-specific guidelines, or ethical considerations. Models must be compliant with relevant regulations to avoid legal and reputational risks.

Real-time inference

Some applications require (near) real-time predictions, where decisions need to be made within strict time constraints. Business requirements may specify the need for low-latency models that can quickly process incoming data and generate predictions in real time. Models that offer efficient real-time inference are crucial for time-sensitive applications like fraud detection or recommendation systems.
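One way to turn a latency requirement into something you can compare across prototypes is to time the prediction call repeatedly and look at a high percentile. This is a minimal sketch using only the Python standard library; the function names, the choice of the 95th percentile, and the stand-in `predict` function are assumptions for illustration, not part of any particular framework.

```python
import statistics
import time


def measure_latency_ms(predict, sample, n_runs=200):
    """Call predict(sample) n_runs times and return each duration in milliseconds."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def latency_summary(latencies_ms):
    """Median, 95th and 99th percentile of the measured latencies."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }


def meets_latency_budget(latencies_ms, budget_ms, key="p95_ms"):
    """True if the chosen percentile stays within the latency budget."""
    return latency_summary(latencies_ms)[key] <= budget_ms


# Stand-in for a real model: any callable that takes one input works here.
def predict(features):
    return sum(features) > 1.0


measurements = measure_latency_ms(predict, [0.2, 0.5, 0.9])
print(latency_summary(measurements))
print(meets_latency_budget(measurements, budget_ms=50.0))
```

Measuring on hardware similar to the production environment matters here, since timings taken on a development laptop may not carry over.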

Comparing prototypes

Now that you are aware of the different factors that can play an important role in evaluating and comparing solutions, your next question might be how to compare them.

That doesn’t have to be complicated. First, the team determines the factors that are most important for the use case. Let’s say they want to focus on predictive power, data collection, overall implementation complexity, training costs, and interpretability.

During prototype creation, everyone takes notes about these five topics. In the end, you can fill in a matrix similar to the one below:

Comparing prototypes. Image by author.

Along the top are the factors determined by the team. On the left are the four prototypes in the comparison. The meaning of the dots is as follows: the bigger the dot, the higher the impact. The color of the dot indicates positive (green), neutral (grey), or negative (red). So, the predictive power is really good for prototypes 1, 3, and 4, and okay for prototype 2. Data collection is okay for prototypes 1 and 2, really hard for prototype 3, and also a bit hard for prototype 4.

This is just an example; it’s perfectly fine to create your own comparison method. You can decide to quantify the scores instead of using the dots. What’s nice about this method is that it gives you a clear overview and a direct understanding of which prototype to continue with, which here is prototype 1. You might also consider prototype 3, but that one entails difficult data collection.
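If you prefer numbers over dots, a weighted sum per prototype is one simple option. The sketch below is an assumption about how such a quantification could look, not the method behind the matrix: the scale from -2 (strongly negative) to +2 (strongly positive), the equal weights, and the function names are all hypothetical. Only the two factors whose ratings are described in the text are included; the other three factors are left out because their values appear only in the image.

```python
# Scores on an assumed scale from -2 (strongly negative) to +2 (strongly positive).
# Only the two factors described in the text are filled in.
scores = {
    "prototype 1": {"predictive_power": 2, "data_collection": 0},
    "prototype 2": {"predictive_power": 0, "data_collection": 0},
    "prototype 3": {"predictive_power": 2, "data_collection": -2},
    "prototype 4": {"predictive_power": 2, "data_collection": -1},
}

# Hypothetical weights: the team would agree on these up front.
weights = {"predictive_power": 1.0, "data_collection": 1.0}


def weighted_totals(scores, weights):
    """Weighted sum of factor scores for every prototype."""
    return {
        prototype: sum(weights[factor] * value for factor, value in factors.items())
        for prototype, factors in scores.items()
    }


def rank(totals):
    """Prototypes sorted from highest to lowest total score."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


totals = weighted_totals(scores, weights)
for prototype, total in rank(totals):
    print(f"{prototype}: {total:+.1f}")
```

A single total hides trade-offs, so it is worth keeping the per-factor scores next to it; a prototype with one strongly negative factor may still be a poor choice even if its total looks fine.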


Conclusion

By comparing prototypes for a use case as described in this article, you increase your chances of getting to production. It also becomes easier to explain the decision-making behind the chosen model and to justify it to business stakeholders in the company.

It helps to discuss the important evaluation factors with other project members up front, to make sure everyone is on the same page. Implementation complexity, maintenance, costs, and business requirements are hard to ignore in most projects. By focusing only on predictive power, you might miss complexities that will arise later. During prototype creation, you can take notes on the criteria, discuss them with the team at evaluation time, and choose the prototype that is most likely to succeed.

This article was originally published on Towards Data Science.