Test Suites & Reports

Overview

Within a task, you can create test suites to run against your models. Test suites help you accomplish two goals:
  1. Uncover errors, biases, or inconsistencies in your model by generating new examples
  2. Raise the bar by ensuring the next versions of your model don't regress on certain data
There are three categories of tests, each with its own purpose.

Invariance

Invariance tests take an existing slice of your data (either randomly sampled or selected from a tag), augment the rows in a chosen manner (all augmentations are enumerated below), and expect the same label as the original row. This is a great way to test whether your model is robust to typos, differing punctuation, names of varying gender or culture, and so on. It can also be used to generate synthetic data that resembles existing rows (by paraphrasing them, for example).

Types

Paraphrase

Generate synthetic data by paraphrasing rows, swapping in words with proximal embeddings.

Add typos

Inject typos by swapping neighboring characters.

Change names

Replace recognized names with others.

Change numbers

Replace integers with others within a 20% interval of the original.

Change locations

Replace city and country names with others.

Strip punctuation

Remove leading and trailing punctuation.

Expand contractions

Expand all contractions within the row (e.g. “isn’t” becomes “is not”).

Contract contractions

Contract expanded forms within the row (e.g. “is not” becomes “isn’t”).
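
For intuition, here is a minimal sketch of the kind of check an invariance test performs, using the typo augmentation as an example. The helper names and the model.predict interface are illustrative assumptions, not Unbox's API; the platform generates and runs these augmented rows for you.

import random

def add_typo(text):
    # Inject a typo by swapping two neighboring characters (illustrative helper).
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def invariance_check(model, row_text, original_label):
    # The test passes when the augmented row keeps the original label.
    augmented = add_typo(row_text)
    return model.predict(augmented) == original_label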

Functional

Functional tests allow you to create entirely new data from templates. This supports {handlebars} syntax so you can inject values from a reserved lexicon (enumerated below). This is a fantastic way to imagine entirely new data and probe for unchecked biases. You may find in a run report, for example, that your model performs poorly on a certain class or on rows that contain a certain token. You can create look-alike data to re-train on and see immediate accuracy gains.
{
  mask,
  male,
  female,
  first_name,
  first_pronoun,
  last_name,
  country,
  nationality,
  city,
  religion,
  religion_adj,
  sexual_adj,
  country_city,
  male_from,
  female_from,
  last_from
}
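
For intuition, here is a minimal sketch of how a template with {handlebars}-style placeholders could be expanded against such a lexicon. The sample lexicon values and the expand helper are purely illustrative assumptions, not Unbox's implementation or its reserved lexicon.

import itertools
import re

# Illustrative values only; the reserved lexicon above is what the platform actually draws from.
LEXICON = {
    "male": ["James", "Ahmed", "Wei"],
    "city": ["Lagos", "Mumbai", "Lima"],
}

TEMPLATE = "{male} from {city} applied for a mortgage last week."

def expand(template, lexicon):
    # Fill every {placeholder} with each combination of lexicon values.
    keys = re.findall(r"\{(\w+)\}", template)
    rows = []
    for combo in itertools.product(*(lexicon[k] for k in keys)):
        row = template
        for key, value in zip(keys, combo):
            row = row.replace("{" + key + "}", value, 1)
        rows.append(row)
    return rows

print(expand(TEMPLATE, LEXICON))  # 9 synthetic rows probing for name/location bias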

Confidence

Finally, confidence tests allow you to create unit or regression tests from existing slices of your data (again, either randomly sampled or from tags). You can also set a confidence threshold, which defines the minimum confidence score the model must assign to the expected label.
If you set a confidence threshold below 1 / (number of classes), you may find that test runs which didn't predict the correct label still succeed, as long as the expected label met the bar. For example, with a threshold of 0.25 on a binary classifier, the expected label might receive a confidence of only 0.3 (while the predicted class, which is not the label, receives 0.7), yet the run will still be marked as a success.
Note that this assumes your model ends with a softmax layer, which we currently do not enforce.
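
To make the threshold arithmetic concrete, here is a minimal sketch of the pass rule described above, assuming softmax outputs; the function name is illustrative, not part of Unbox's API.

def confidence_test_passes(probabilities, label_index, threshold):
    # A row passes when the softmax score assigned to the expected label
    # meets the threshold, even if a different class was actually predicted.
    return probabilities[label_index] >= threshold

# Binary classifier, expected label is class 1, threshold of 0.25:
# the model predicts class 0 (confidence 0.7), but class 1 still scores 0.3 >= 0.25,
# so the run is marked a success.
print(confidence_test_passes([0.7, 0.3], label_index=1, threshold=0.25))  # True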

Test Reports

The output of a test suite run is a test report. Test reports look similar to run reports, but without misprediction filtering and (currently) without row-level explainability annotations.