Definition

Aggregate metric tests allow you to define the expected level of model performance for the entire validation set or specific subpopulations. You can use any of the available metrics for the task type you are working on.

To compute most of the aggregate metrics supported, your data must contain ground truths.

For monitoring use cases, if your data is not labeled at publish/stream time, you can update the ground truths later on. Check out the Updating data guide for details.
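Conceptually, an aggregate metric test reduces to computing a single value over a dataset and comparing it to the threshold you define. The following is a minimal sketch in plain Python, with hypothetical column names and threshold; it illustrates the idea, not Openlayer's API:

```python
# Illustrative only: what an aggregate metric test boils down to.
# Column names and the threshold are hypothetical, not Openlayer's API.
import pandas as pd

validation_set = pd.DataFrame(
    {
        "ground_truth": ["yes", "no", "yes", "yes"],
        "output": ["yes", "no", "no", "yes"],
    }
)

# Aggregate metric computed over the entire validation set.
mean_exact_match = (validation_set["output"] == validation_set["ground_truth"]).mean()

# The test passes only if the aggregate metric clears the expected level.
EXPECTED_MEAN_EXACT_MATCH = 0.7  # hypothetical threshold
assert mean_exact_match >= EXPECTED_MEAN_EXACT_MATCH, (
    f"Mean exact match {mean_exact_match:.2f} is below {EXPECTED_MEAN_EXACT_MATCH}"
)
```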

Taxonomy

  • Category: Performance.
  • Task types: LLM, tabular classification, tabular regression, text classification.
  • Availability: development and monitoring.

Why it matters

  • Aggregate metrics are a straightforward way to measure model performance.
  • Overall aggregate metrics (i.e., computed on the entire validation set or production data) are useful for getting a high-level view of model performance. However, we encourage you to go beyond them and also define tests for specific subpopulations.
  • Your model's performance is likely not uniform across different cohorts of the data, as in the image below. A better and more realistic way to ultimately achieve high model performance is to improve the model one slice of data at a time.
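To make this concrete, the sketch below (plain pandas, with hypothetical column names) computes the same aggregate metric overall and per cohort; a mediocre overall number can hide a slice where the model fails completely:

```python
# Illustrative only: the same aggregate metric computed overall and per
# cohort. Column names are hypothetical.
import pandas as pd

data = pd.DataFrame(
    {
        "cohort": ["A", "A", "A", "B", "B", "B"],
        "ground_truth": [1, 0, 1, 1, 1, 0],
        "output": [1, 0, 1, 0, 0, 1],
    }
)

correct = data["output"] == data["ground_truth"]

overall_accuracy = correct.mean()  # 0.50 -- looks mediocre but not alarming
per_cohort_accuracy = correct.groupby(data["cohort"]).mean()  # A: 1.00, B: 0.00

print(f"Overall accuracy: {overall_accuracy:.2f}")
print(per_cohort_accuracy)
```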

Available metrics

The aggregate metrics available for LLM projects are:

| Metric | Description | Comments |
| --- | --- | --- |
| Mean BLEU | Bilingual Evaluation Understudy score. | Available precision from unigram to 4-gram (BLEU-1, 2, 3, and 4). |
| Mean edit distance | Minimum number of single-character insertions, deletions, or substitutions required to transform one string into another, serving as a measure of their similarity. | |
| Mean exact match | Assesses if two strings are identical in every aspect. | |
| Mean JSON score | Measures how close the output is to a valid JSON. | |
| Mean quasi-exact match | Assesses if two strings are similar, allowing partial matches and variations. | |
| Mean semantic similarity | Assesses the similarity in meaning between sentences by measuring their closeness in semantic space. | |
| Mean, max, and total number of tokens | Statistics on the number of tokens. | The tokenColumnName must be specified in the dataset config. |
| Mean and max latency | Statistics on the response latency. | The latencyColumnName must be specified in the dataset config. |
| Context relevancy* | Measures how relevant the retrieved context is given the question. | Applies to RAG problems. The contextColumnName must be specified in the dataset config. |
| Answer relevancy* | Measures how relevant the answer (output) is given the question. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Correctness* | Correctness of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Harmfulness* | Harmfulness of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Coherence* | Coherence of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Conciseness* | Conciseness of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Maliciousness* | Maliciousness of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Context recall* | Measures the ability of the retriever to retrieve all the context necessary for the question. | Applies to RAG problems. The groundTruthColumnName and contextColumnName must be specified in the dataset config. |

*To access these metrics, you must have a valid OpenAI API key and specify it in the Openlayer platform. Furthermore, to compute them, we run a sample of your data through OpenAI's GPT-3.5 Turbo model.
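To give a feel for the string-based metrics above, the sketch below implements mean edit distance in plain Python. It illustrates what the metric measures; it is not Openlayer's implementation:

```python
# Illustrative only: a plain-Python sketch of mean edit distance, one of the
# metrics listed above. It shows what the metric measures; it is not
# Openlayer's implementation.


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


outputs = ["The cat sat.", "Paris", "42"]
ground_truths = ["The cat sat.", "paris", "42"]

mean_edit_distance = sum(
    edit_distance(output, truth) for output, truth in zip(outputs, ground_truths)
) / len(outputs)

print(f"Mean edit distance: {mean_edit_distance:.2f}")  # 0.33
```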