Evaluation Stores - a high bias, low variance view

“Feature store” has been one of the hottest buzzwords in the machine learning community in recent years. In my view, however, an “evaluation store” deserves equal or higher priority in many teams. In this post, I will describe the main idea of evaluation stores and explain why I think they are critical.


Suppose you are building ML solutions for a certain loan business.

You developed an ML model that predicts the default probability for each borrower.

And you deployed the model.

Checking model performance

A few weeks later, you wondered how it had been going, so you checked the model performance by doing the following tasks:

Here is a diagram of the above tasks.

Poor performance detected

The output from the above tasks looks like this:

Suppose you found that the model performed very poorly for the period 2023-04-10 ~ 2023-04-23.
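As a minimal sketch of what such a check could look like in code: the snippet below buckets scored loans into fixed-length periods and computes one metric per period. The record fields (`date`, `predicted_pd`, `defaulted`), the choice of the Brier score as the metric, the 14-day window, and the toy data are all assumptions made for illustration, not part of the original workflow.

```python
from collections import defaultdict
from datetime import date, timedelta


def brier_by_period(records, origin, period_days=14):
    """Return {(period_start, period_end): Brier score} in chronological order.

    The Brier score is the mean squared difference between the predicted
    default probability and the observed outcome (0 or 1); lower is better.
    """
    sums = defaultdict(lambda: [0.0, 0])  # period start -> [sum of sq. errors, count]
    for r in records:
        # Index of the period this record falls into, counted from `origin`.
        offset = (r["date"] - origin).days // period_days
        start = origin + timedelta(days=offset * period_days)
        sums[start][0] += (r["predicted_pd"] - r["defaulted"]) ** 2
        sums[start][1] += 1
    return {
        (start, start + timedelta(days=period_days - 1)): total / n
        for start, (total, n) in sorted(sums.items())
    }


# Made-up toy data, only to show the shape of the input and output.
records = [
    {"date": date(2023, 3, 28), "predicted_pd": 0.1, "defaulted": 0},
    {"date": date(2023, 4, 5), "predicted_pd": 0.8, "defaulted": 1},
    {"date": date(2023, 4, 11), "predicted_pd": 0.1, "defaulted": 1},
    {"date": date(2023, 4, 20), "predicted_pd": 0.9, "defaulted": 0},
]
scores = brier_by_period(records, origin=date(2023, 3, 27))
for (start, end), score in scores.items():
    print(f"{start} ~ {end}: Brier score {score:.3f}")
```

In a real setting the predictions and outcomes would come from wherever the team logs them, and the metric would be whatever the business cares about (AUC, calibration error, and so on).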

Looking deeply

You wanted to understand why, so you looked into it with more granular views, such as the state level. For that, you carried out tasks similar to the above, but now with an additional slicing dimension - state.

Now, within the period 2023-04-10 ~ 2023-04-23, it is revealed that the model performed especially badly for the borrowers who live in NY:

Looking more deeply

You were still not sure why, so you dived into this sliced data by further slicing it by occupation, and discovered that the model performed the worst for the group of borrowers in Science & Engineering occupations.
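The drill-down by state and then by occupation is the same computation with a longer list of grouping keys. Here is a rough sketch of that idea, again assuming hypothetical field names (`state`, `occupation`, `predicted_pd`, `defaulted`), the Brier score as the metric, and made-up toy rows:

```python
from collections import defaultdict


def brier_by_slice(records, slice_keys):
    """Group records by `slice_keys`; return (slice, score, count) rows, worst first."""
    groups = defaultdict(list)
    for r in records:
        key = tuple(r[k] for k in slice_keys)
        groups[key].append((r["predicted_pd"] - r["defaulted"]) ** 2)
    rows = [(key, sum(errs) / len(errs), len(errs)) for key, errs in groups.items()]
    # A higher Brier score means worse performance, so sort descending.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up toy data for illustration only.
records = [
    {"state": "NY", "occupation": "Science & Engineering", "predicted_pd": 0.1, "defaulted": 1},
    {"state": "NY", "occupation": "Science & Engineering", "predicted_pd": 0.2, "defaulted": 1},
    {"state": "NY", "occupation": "Education", "predicted_pd": 0.3, "defaulted": 0},
    {"state": "CA", "occupation": "Science & Engineering", "predicted_pd": 0.1, "defaulted": 0},
]
by_state = brier_by_slice(records, ["state"])
by_state_occupation = brier_by_slice(records, ["state", "occupation"])

for key, score, n in by_state_occupation:
    print(key, round(score, 3), n)
```

Returning the row count next to the score matters in practice: a slice that looks terrible may simply contain too few borrowers to say anything.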

Why does the model perform badly for this slice of data (2023-04-10 ~ 2023-04-23 + NY + Science & Engineering)? At this point, you may or may not find the reason. You may need to try other slices. You may also need to dig into macro-economic data, third-party data, bugs in the code, bugs in how the model is applied, etc. You would continue until you get some clues.


The above example is not unusual - many data scientists actually do this kind of work. Is there anything wrong here?

Nothing is wrong if this happens only once, for a single model and a single data scientist. But that is rarely the case.

If you have an ML team, imagine how many times, and how many data scientists, have to do similar things repeatedly over time. Not only that, but the following questions can arise:

So your team needs to discuss how to answer these questions. Note that none of them is about developing sophisticated machine learning models, real-time training, or rigorous deployment. All we want here is simple - to understand how the model performs on some data slices. But as we saw from the example, evaluating models involves many components, each with its own details, which makes it nontrivial in the end. As a result, this kind of analysis may eat up all of your time.

So efficiency and transparency are needed here.

Evaluation Stores

An evaluation store is a single place where model performances are summarized for different data slices or use cases.

It can play a critical role for the team. For example, monitoring and reporting clearly benefit from an evaluation store. Any new idea for improving the current model should also start from an analysis of how the current model performs. And when you talk with other team members about model issues, you need a single source of truth for the metrics to refer to.

So key benefits of evaluation stores are:

Designing Evaluation Stores

There is no standard rule or format for building an evaluation store. It all depends on the team and the business. Building a scalable evaluation store could actually be overkill in many small companies. So each team should develop its own process, taking into account the core ideas of evaluation stores.

Here are some possible designs of evaluation stores that I thought of:

Version 0

Version 1

Version 2

Version 3

Concluding Remarks

There could be a lot of frustration that comes with evaluating models.

Some of it can actually be resolved with feature stores. A feature store enables the team to reuse features effectively and facilitates various tasks in ML pipelines. But it is not easy to build the infrastructure for that. Plus, feature drift is not that common, and small errors in features do not necessarily cause severe impacts on the product.

On the other hand, model performance is directly linked to the product’s success. So the process of diagnosing model performance needs to be rethought. Data scientists should be able to start working on models with a much better understanding of model performance, at almost no cost.

In terms of building an evaluation store, I would prefer a lean approach - starting with a minimal version and letting it evolve as the team, products, and infrastructure progress.



  1. By model experiments, I mean the iterative process of ideation, building new features, tuning hyperparameters, training, validation, backtesting, and model decisions.