Filtering by data distribution

Improving models one slice of data at a time

We want models that perform well not only on whole datasets but also on every potential edge case they might encounter out in the wild. The problem is that, when we strive for that goal, it is easy to be overwhelmed by the number of possibilities and suffer from analysis paralysis.

A better and more realistic approach is to improve the model's performance one slice of data at a time.

The Data distribution tab on the Error analysis panel helps us identify the most common mistakes our models are making. A good strategy, then, is to focus on improving model performance on these error classes iteratively over the next rounds of ML development.

Data distribution

When you click on the Data distribution tab, the Error analysis panel is divided into two parts.

On the left-hand side, you see the labels for your task. In our churn binary classifier, we see the two classes there: “Retained” and “Exited”. Furthermore, right beneath each label, we see the model's performance, measured by aggregate metrics per class. Per-class metrics are particularly important when working with imbalanced datasets, where the model's performance on the majority class might distort the overall metrics.
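To see why per-class metrics matter, here is a minimal sketch of per-class recall, assuming a hypothetical DataFrame with "label" and "prediction" columns holding the class names (the column names and data are illustrative, not taken from the platform):

```python
import pandas as pd

# Toy churn data: the model does well on the majority class ("Retained")
# but misses half of the "Exited" samples.
df = pd.DataFrame({
    "label":      ["Retained", "Retained", "Exited", "Exited", "Retained"],
    "prediction": ["Retained", "Retained", "Retained", "Exited", "Retained"],
})

# Recall per class: of all samples with a given true label, what fraction
# was predicted correctly. Aggregate accuracy (0.8 here) would hide the
# weak recall on the minority "Exited" class.
per_class_recall = (
    df.assign(correct=df["label"] == df["prediction"])
      .groupby("label")["correct"]
      .mean()
)
print(per_class_recall)
```

On this toy data, recall is 1.0 for “Retained” but only 0.5 for “Exited”, which is exactly the kind of gap an aggregate metric can mask.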

On the right-hand side, we see the different error classes. This is a flattened confusion matrix.
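Conceptually, a flattened confusion matrix treats each (label, prediction) mismatch as its own error class and counts its occurrences. A minimal sketch, again with a hypothetical DataFrame and illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "label":      ["Retained", "Exited", "Exited", "Retained", "Exited", "Retained"],
    "prediction": ["Retained", "Retained", "Exited", "Retained", "Retained", "Exited"],
})

# Keep only the misclassified rows, then count each (label, prediction)
# pair. Sorting puts the most common error class first.
errors = df[df["label"] != df["prediction"]]
error_classes = (
    errors.groupby(["label", "prediction"])
          .size()
          .sort_values(ascending=False)
)
print(error_classes)
```

Here the top error class is (label “Exited”, predicted “Retained”), mirroring what the panel surfaces for our churn model.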

As a side note, notice that we have an imbalanced dataset: most of the data is from the Retained class. Churn problems are indeed often imbalanced; after all, at any given time there are (hopefully) many more users who will continue using our platform than users churning.


Error classes

Looking at the Error analysis panel, can you spot the most common mistake our model makes? Can you filter the data to have a closer look?

The most common error class is predicting that users will be retained when in fact they churn. For the next quarter, we might want to focus on improving the model's performance on this error class.


Documenting error classes

Can you filter the dataset to show only the samples our model predicted as Retained but whose label was Exited, and tag them with the name Q4 (so that the whole team knows that this is the priority for the next quarter)?
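The panel does this with a few clicks, but the equivalent operation can be sketched in code: filter the rows in the target error class and attach a tag. The DataFrame, column names, and tag column are all assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "label":      ["Retained", "Exited", "Exited", "Retained"],
    "prediction": ["Retained", "Retained", "Exited", "Retained"],
})

# Select the error class: predicted "Retained", true label "Exited".
mask = (df["prediction"] == "Retained") & (df["label"] == "Exited")

# Tag those samples "Q4" in a hypothetical "tag" column so the whole
# team can filter for them later.
df.loc[mask, "tag"] = "Q4"
print(df)
```

After this, filtering on `tag == "Q4"` recovers exactly the slice the team agreed to prioritize.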


Actionable insight:

  • Focus on one digestible chunk of the data at a time, systematically improving the model’s performance.
