Aggregate metrics, such as accuracy and precision, can be misleading: they might suggest we have a good model when, in fact, we cannot be so sure.
Let’s get back to our churn classification problem.
The 86% accuracy we obtained, as an aggregate metric, summarizes the performance of our model across our whole validation set. It is a useful first metric to look at, but it doesn’t convey the complete story of how our model behaves.
For example, how does our model perform for different user groups? What’s the performance for users aged between 25 and 35? What about for users from different countries?
What we will most likely find out is that the performance of our model is not uniform across different cohorts of the data. Furthermore, we may even encounter some data pockets with low accuracies and specific failure modes.
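To make this concrete, here is a minimal sketch in pandas of how an aggregate accuracy can hide very different per-cohort accuracies. The data and column names (`Age`, `prediction`, `label`) are illustrative, not taken from the actual churn dataset:

```python
import pandas as pd

# Hypothetical validation results: one row per user, with the model's
# prediction and the true churn label (column names are illustrative).
df = pd.DataFrame({
    "Age":        [28, 31, 34, 55, 62, 70, 45, 29, 58, 33],
    "prediction": [1, 0, 1, 0, 1, 1, 0, 1, 0, 0],
    "label":      [1, 0, 1, 1, 0, 0, 0, 1, 1, 0],
})

# The aggregate accuracy summarizes the whole validation set...
df["correct"] = df["prediction"] == df["label"]
overall = df["correct"].mean()

# ...but bucketing users into age cohorts reveals non-uniform performance.
df["cohort"] = pd.cut(df["Age"], bins=[18, 35, 50, 90],
                      labels=["18-35", "36-50", "51-90"])
per_cohort = df.groupby("cohort", observed=True)["correct"].mean()

print(f"overall accuracy: {overall:.2f}")
print(per_cohort)
```

In this toy example the overall accuracy looks decent, yet the 51-90 cohort fails on every row — exactly the kind of data pocket that cohort analysis surfaces.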
The image below illustrates what is often the case with model performance.
Analyzing different cohorts of the data is critical to building trust in your model and to not being surprised by failure modes only after your model is deployed to production.
In this part of the tutorial, we will conduct error cohort analysis to understand how our model performs for different user groups. The key functionality that allows analyzing multiple data cohorts is tagging. For a comprehensive reference on the importance of tagging, check out Andrew Ng’s online course on ML in production.
The first step required to conduct error cohort analysis is being able to easily query our dataset so that we can access the data cohorts we are interested in exploring further.
This can be done with the Filter bar, right below the Error analysis panel.
For example, let’s filter the data to only look at the dataset rows from users aged between 25 and 35.
First, we select the feature we are interested in, which is Age. Then, we select the relationship we want. Since we want to filter a range of values, we select between. Finally, we type the range of values we are interested in and click on Filter.
Now, below the filter bar, we only see the data that satisfies our query.
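The UI filter above is equivalent to a range query over the dataset. A minimal sketch in pandas, assuming illustrative column names and inclusive bounds:

```python
import pandas as pd

# A toy slice of the validation set (column names are illustrative).
df = pd.DataFrame({
    "Age":     [22, 28, 35, 41, 30],
    "Country": ["DE", "US", "DE", "FR", "US"],
})

# Equivalent of the UI filter "Age between 25 and 35": keep only the
# rows whose Age falls inside the (inclusive) range.
cohort = df[df["Age"].between(25, 35)]
print(cohort)
```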
Combining flexible tagging with easy filtering results in endless possibilities to conduct repeatable and precise data cohort analysis.
Now that we filtered to only see the data for users aged between 25 and 35, let’s create a tag for them.
On the dataset, hover over the selection item in the upper left corner. While hovering, the option to Select page appears. If you click it once, you will select all the rows being shown on that page. This is not what we want in our case. What we want is to select the data from all the pages that satisfy our query. We can achieve this by double-clicking the Select page option.
With our rows selected, the option to tag the data appears on the filter bar, right below our query. Let's name that data cohort age_between_25_35 and click on Tag.
Voilà! All of the data samples are now tagged! This is the user group we are going to focus on.
You might have noticed that on the Error analysis panel, there is a tab called Tags. If you click on it, you will see all the tags you already created. In our case, you will only see our newly created age_between_25_35 tag.
Every time you need to have a look or need to show this data cohort to someone, you can simply click on it and the data below will be filtered according to the query used to generate it. This is a great way to document patterns.
Filter with a tag
Clear the filters in the filter bar. You can clear a filter by clicking on the condition in the filter bar and pressing backspace. Then, click on the age_between_25_35 tag to see what happens to the data rows shown below the Error analysis panel.
Back to the Tags tab. Did you notice something interesting?
Right below our newly created tag, you can see our model’s performance for that specific data cohort.
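Conceptually, a tag behaves like a saved query: to score a cohort, you filter to the rows carrying the tag and compute the metric on that subset. A minimal sketch in pandas, with illustrative column and tag names:

```python
import pandas as pd

# Hypothetical tagged validation data: "tags" holds the cohort tags
# assigned to each row (structure and names are illustrative).
df = pd.DataFrame({
    "Age":        [28, 31, 60, 33, 70],
    "prediction": [1, 0, 1, 1, 0],
    "label":      [1, 0, 0, 1, 1],
    "tags":       [["age_between_25_35"], ["age_between_25_35"], [],
                   ["age_between_25_35"], []],
})

# Accuracy for a given tag: filter to the rows carrying it, then score.
def tag_accuracy(df, tag):
    mask = df["tags"].apply(lambda t: tag in t)
    subset = df[mask]
    return (subset["prediction"] == subset["label"]).mean()

print(tag_accuracy(df, "age_between_25_35"))
```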
The model performance for that user group is pretty good! But is it uniform for various user groups?
Inspecting different data cohorts
With what you’ve learned so far, can you check the model's performance for users aged between 50 and 90? What about for male users who live in Germany and purchased between 1 and 5 products?
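The second query in the exercise above combines several conditions. As a sketch of what such a multi-condition filter computes (column names and data are illustrative assumptions):

```python
import pandas as pd

# Hypothetical validation results (column names are illustrative).
df = pd.DataFrame({
    "Gender":       ["Male", "Female", "Male", "Male"],
    "Country":      ["Germany", "Germany", "France", "Germany"],
    "NumPurchases": [3, 2, 4, 5],
    "prediction":   [0, 1, 1, 0],
    "label":        [1, 1, 1, 0],
})

# Chain the conditions, mirroring the multi-filter query in the UI:
cohort = df[
    (df["Gender"] == "Male")
    & (df["Country"] == "Germany")
    & df["NumPurchases"].between(1, 5)
]
accuracy = (cohort["prediction"] == cohort["label"]).mean()
print(f"cohort size: {len(cohort)}, accuracy: {accuracy:.2f}")
```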
Notice that for some user groups, the model performs much better than for others. Not only that, but with such filtering and tagging mechanisms, it is easy to check the most evident potential model biases, for instance, by checking out how the model performance differs for different user genders.
Conducting this kind of error cohort analysis allows you to:
- Quickly identify data pockets with specific model failure modes;
- Mindfully take actions based on your model’s output, knowing how much trust you can place in it. For example, if you know that the model is very good for users aged between 25 and 35, but not so much for users over 50, you can use the model’s results accordingly;
- Know exactly what kind of data is needed if you want to boost your model’s performance. You can either collect and label more data that resembles the underperforming cohorts or generate synthetic data.
Automatically suggested tags give you a head start on the whole exploratory process, which is critical when conducting error analysis. On the Suggested tab in the Error analysis panel, you will find suggested tags that we created for you, pointing to data samples you might want to take a closer look at.
Each suggested tag has a different meaning, and if you click on Create, we will automatically tag all the samples that satisfy its criteria.
For example, one of the most powerful suggested tags is the potential_mislabel tag. It shows data that might have been mislabeled because, among other signals, the model is making mistakes with low uncertainty, i.e., confident mistakes. This is not a guarantee that the points are mislabeled, but it might be worth double-checking them, as training models with mislabeled data will likely hinder their performance.
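One plausible heuristic behind such a tag (an assumption on our part, not necessarily the exact criterion the tool uses) is to flag rows where the model disagrees with the label while being highly confident in its prediction:

```python
import numpy as np

# Hypothetical predicted probabilities and human-provided labels.
proba = np.array([0.97, 0.55, 0.03, 0.91])  # model's P(churn) per row
pred = (proba >= 0.5).astype(int)
label = np.array([0, 0, 0, 1])

# Confidence in the predicted class, whichever class that is.
confidence = np.maximum(proba, 1 - proba)

# Flag confident mistakes: wrong prediction with confidence >= 0.9
# (the 0.9 threshold is an illustrative choice).
potential_mislabel = (pred != label) & (confidence >= 0.9)
print(np.flatnonzero(potential_mislabel))
```

Here only the first row is flagged: the model is 97% confident the user churns while the label says otherwise, so that label deserves a second look.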
Using a suggested tag
Click on Create in one of the suggested tags. Once you create a tag, it will show on the Tags tab with the associated aggregate metrics.
Now that you are familiar with filtering and tagging, let’s move on to the next part of the tutorial, where we will explore additional slicing and dicing capabilities.