Aggregate metrics, such as accuracy and precision, can be very misleading. We might be led into thinking that we have a good model, when, in fact, we cannot be so sure.
Let’s get back to our banking chatbot problem.
The 83% accuracy we obtained, as an aggregate metric, summarizes the performance of our model across our whole validation set. It is a useful first metric to look at, but it doesn’t convey the complete story of how our model behaves.
For example, how does our model perform for different groups of the data? What’s the performance for messages where the overarching theme is the users’ credit card? What about for messages from users that complain about refunds?
What we will most likely find out is that the performance of our model is not uniform across different cohorts of the data. Furthermore, we may even encounter some data pockets with low accuracies and specific failure modes.
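To make this concrete, here is a minimal sketch of the idea using made-up data. The DataFrame, column names, and values are illustrative assumptions, not part of the tutorial's dataset: the point is only that an aggregate accuracy can look reasonable while one cohort drags far behind.

```python
# Illustrative sketch: aggregate accuracy can hide weak cohorts.
# All data and column names here are made up for demonstration.
import pandas as pd

df = pd.DataFrame({
    "text": ["my card was blocked", "where is my refund",
             "reset my password", "card limit increase"],
    "theme": ["cards", "refunds", "account", "cards"],
    "label": ["cards", "refunds", "account", "cards"],
    "prediction": ["account", "refunds", "account", "cards"],
})

# Overall accuracy looks acceptable...
overall = (df["label"] == df["prediction"]).mean()

# ...but per-cohort accuracy tells a different story.
per_cohort = (
    df.assign(correct=df["label"] == df["prediction"])
      .groupby("theme")["correct"]
      .mean()
)
print(overall)     # 0.75
print(per_cohort)  # "cards" cohort is at 0.5
```

In this toy example the overall accuracy is 75%, yet the "cards" cohort is at 50%, exactly the kind of gap aggregate metrics hide.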
The image below illustrates what is often the case with model performance.
Analyzing different cohorts of the data is critical to building trust in your model and avoiding the discovery of failure modes only after your model is deployed to production.
In this part of the tutorial, we will conduct error cohort analysis to understand how our model performs for different user groups. The key functionality that allows analyzing multiple data cohorts is tagging. For a comprehensive reference on the importance of tagging, check out Andrew Ng’s online course on ML in production.
The first step required to conduct error cohort analysis is being able to easily query our dataset so that we can access the data cohorts we are interested in exploring further.
This can be done with the Filter bar, right below the Error analysis panel.
For example, let’s filter the data to only look at the dataset rows from messages that contain the word “card”. To do so, we can simply type “card” in the filter bar and press Enter. Now, below the filter bar, we only see the data that satisfies our query.
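The filter we just applied in the UI is conceptually a substring match over the message text. A rough equivalent in pandas might look like the sketch below (the column name `text` is an assumption for illustration):

```python
# Rough equivalent of the "card" filter, sketched in pandas.
# The column name "text" is an assumption, not from the tutorial.
import pandas as pd

df = pd.DataFrame({
    "text": ["my card was blocked", "where is my refund",
             "I lost my credit card"],
})

# case=False makes the match case-insensitive, like a plain text search
card_rows = df[df["text"].str.contains("card", case=False)]
print(card_rows)  # keeps the two card-related messages
```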
Combining flexible tagging with easy filtering results in endless possibilities to conduct repeatable and precise data cohort analysis.
Now that we’ve filtered the data to only see messages that contain the word “card”, let’s create a tag for them.
On the dataset, hover over the selection item in the upper left corner. While hovering, the option to Select page appears. If you click it once, you will select all the rows being shown on that page. This is not what we want in our case. What we want is to select the data from all the pages that satisfy our query. We can achieve this by double-clicking the Select page option.
With our rows selected, the option to tag the data should have appeared on the filter bar, right below our query. Let’s name that data cohort about_cards and click on Tag.
Voilà! All of the data samples are now tagged! This is the user group we are going to focus on.
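One way to think about what a tag does is as a persisted boolean column over the dataset: once it exists, re-selecting the cohort no longer requires rerunning the query. This is a conceptual sketch with assumed column names, not how the platform stores tags internally:

```python
# Conceptual sketch: a tag behaves like a saved boolean column.
# Column names are assumptions; the platform persists tags for you.
import pandas as pd

df = pd.DataFrame({
    "text": ["my card was blocked", "where is my refund",
             "I lost my credit card"],
})

df["about_cards"] = df["text"].str.contains("card", case=False)

# Re-selecting the cohort later is a simple boolean lookup.
cohort = df[df["about_cards"]]
```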
You might have noticed that on the Error analysis panel, there is a tab called Tags. If you click on it, you will see all the tags you already created. In our case, you will only see our newly created about_cards tag.
Every time you need to have a look or need to show this data cohort to someone, you can simply click on it and the data below will be filtered according to the query used to generate it. This is a great way to document patterns.
Filtering with a tag
Clear the filters in the filter bar. You can clear a filter by clicking on the condition in the filter bar and pressing backspace. Then, click on the about_cards tag to see what happens to the data rows shown below the Error analysis panel.
Back to the Tags tab. Did you notice something interesting?
Right below our newly created tag, you can see our model’s performance for that specific data cohort.
Judging by F1 and precision, the model’s performance for that user group is not very good! But is the performance uniform across user groups?
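As a refresher on what those per-cohort numbers mean, here is a small sketch of precision, recall, and F1 computed by hand for one class. The labels and predictions are made up for illustration, treating "cards" as the positive class:

```python
# Hand-computed precision/F1 for one class, with made-up data.
# "cards" is treated as the positive class for illustration.
y_true = ["cards", "cards", "refunds", "cards", "account"]
y_pred = ["account", "cards", "refunds", "refunds", "cards"]

tp = sum(t == "cards" and p == "cards" for t, p in zip(y_true, y_pred))
fp = sum(t != "cards" and p == "cards" for t, p in zip(y_true, y_pred))
fn = sum(t == "cards" and p != "cards" for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)                            # 1 / 2 = 0.5
recall = tp / (tp + fn)                               # 1 / 3 ≈ 0.33
f1 = 2 * precision * recall / (precision + recall)    # 0.4
```

Even a cohort-level accuracy can mask this: precision and F1 expose how often the model is wrong specifically when it predicts (or should predict) the cohort's class.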
Inspecting different data cohorts
With what you’ve learned so far, can you check what’s the model performance for messages that contain the word “refund”?
Notice that for some user groups, the model performs much better than for others. Knowing this, you can:
- Quickly identify data pockets with specific model failure modes;
- Mindfully take actions based on your model’s output, knowing how much trust you can place in it. For example, if you know that the model is not very good for messages about cards, you can route such messages directly to a human in customer support, instead of trying to respond automatically;
- Know exactly what kind of data is needed to boost your model’s performance. You can either collect and label more data that resembles the weak cohort or generate synthetic data.
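The second point above can be sketched in a few lines. This is a hypothetical routing rule, not part of the tutorial: the function name, the "card" heuristic, and the 0.7 confidence threshold are all assumptions you would tune to your own cohort analysis.

```python
# Hypothetical sketch: act on cohort knowledge by routing messages the
# model handles poorly (here, card-related ones) to a human instead of
# auto-responding. Function name and threshold are illustrative.
def route_message(message: str, prediction: str, confidence: float) -> str:
    in_weak_cohort = "card" in message.lower()
    if in_weak_cohort or confidence < 0.7:
        return "human_support"
    return prediction

print(route_message("my card was blocked", "cards", 0.95))  # human_support
print(route_message("where is my refund", "refunds", 0.9))  # refunds
```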
Automatically suggested tags give you a head start on the whole exploratory process, which is critical when conducting error analysis. On the Suggested tab in the Error analysis panel, you will find some suggested tags that we made for you because we think that these can be data samples that you might be interested in taking a closer look at.
Each suggested tag has a different meaning, and if you click on Create, we will automatically tag all the samples that satisfy that tag’s criteria.
For example, one of the most powerful suggested tags is the potential_mislabel tag. It shows data that might have been mislabeled because, among other things, the model is making mistakes with low uncertainty. This is not a guarantee that the points are mislabeled, but it might be worth double-checking them, as training models with mislabeled data will likely hinder their performance.
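One heuristic behind this kind of tag can be sketched as "flag rows where the model disagrees with the label while being highly confident." The sketch below is an assumption about the general idea, not the platform's actual implementation; the column names and the 0.9 threshold are illustrative.

```python
# Sketch of one "potential mislabel" heuristic: confident disagreement.
# Column names, data, and the 0.9 threshold are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "label":      ["cards", "refunds", "account"],
    "prediction": ["account", "refunds", "cards"],
    "confidence": [0.97, 0.99, 0.55],  # model confidence per prediction
})

# Low uncertainty (high confidence) + wrong prediction -> worth reviewing
df["potential_mislabel"] = (
    (df["label"] != df["prediction"]) & (df["confidence"] > 0.9)
)
```

Only the first row gets flagged: the third row is also a mistake, but the model is unsure there, so a plain label error is less suspicious than a confident one.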
Using a suggested tag
Click on Create in one of the suggested tags. Once you create a tag, it will show on the Tags tab with the associated aggregate metrics.
Now that you are familiar with filtering and tagging, let’s move on to the next part of the tutorial, where we will explore additional slicing and dicing capabilities.