# Aggregate metrics
This guide explains the model evaluation metrics currently available throughout the Openlayer platform.
## Classification metrics
Metric | Description | Comments |
---|---|---|
Accuracy | The classification accuracy. Defined as the ratio of the number of correctly classified samples and the total number of samples. | |
Precision per class | The precision score for each class. Given by TP / (TP + FP). | |
Recall per class | The recall score for each class. Given by TP / (TP + FN). | |
F1 per class | The F1 score for each class. Given by 2 × Precision × Recall / (Precision + Recall). | |
Precision | For binary classification, the precision considering class 1 as “positive.” For multiclass classification, the macro-average of the precision score for each class, i.e., treating all classes equally. | |
Recall | For binary classification, the recall considering class 1 as “positive.” For multiclass classification, the macro-average of the recall score for each class, i.e., treating all classes equally. | |
F1 | For binary classification, the F1 considering class 1 as “positive.” For multiclass classification, the macro-average of the F1 score for each class, i.e., treating all classes equally. | |
ROC AUC | The macro-average of the area under the receiver operating characteristic curve score for each class, i.e., treating all classes equally. For multiclass classification tasks, it uses the one-versus-one configuration. | The ROC AUC is available only if the class probabilities are uploaded with the model. This is done by specifying a `predictionScoresColumnName` on the dataset configs. Refer to the How to write dataset config guides for details. |
False positive rate | Given by FP / (FP + TN). | The false positive rate is only available for binary classification tasks. |
Geometric mean | The geometric mean of the precision and the recall. | |
Log loss | Measure of the dissimilarity between predicted probabilities and the true distribution. Also known as cross-entropy loss or binary cross-entropy (in the binary classification case). | The log loss is available only if the class probabilities are uploaded with the model. This is done by specifying a `predictionScoresColumnName` on the dataset configs. Refer to the How to write dataset config guides for details. |
Where:
- TP: true positive.
- TN: true negative.
- FP: false positive.
- FN: false negative.
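The formulas above can be sketched in plain Python. The helper below is purely illustrative (its name is hypothetical and it is not part of the Openlayer SDK); it derives the binary-classification metrics from confusion-matrix counts:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Illustrative sketch: binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        # Geometric mean of precision and recall.
        "geometric_mean": (precision * recall) ** 0.5,
    }
```

For example, with 8 true positives, 5 true negatives, 2 false positives, and 1 false negative, precision is 8 / (8 + 2) = 0.8 and recall is 8 / (8 + 1) ≈ 0.889.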
## LLM metrics
Metric | Description | Comments |
---|---|---|
Mean BLEU | Bilingual Evaluation Understudy (BLEU) score. Available with n-gram precisions from unigram to 4-gram (BLEU-1, 2, 3, and 4). | |
Mean edit distance | Minimum number of single-character insertions, deletions, or substitutions required to transform one string into another, serving as a measure of their similarity. | |
Mean exact match | Assesses whether two strings are identical in every aspect. | |
Mean JSON score | Measures how close the output is to a valid JSON. | |
Mean quasi-exact match | Assesses whether two strings are similar, allowing partial matches and variations. | |
Mean semantic similarity | Assesses the similarity in meaning between sentences by measuring their closeness in semantic space. | |
Mean, max, and total number of tokens | Statistics on the number of tokens. | The `tokenColumnName` must be specified in the dataset config. |
Mean and max latency | Statistics on the response latency. | The `latencyColumnName` must be specified in the dataset config. |
Context relevancy* | Measures how relevant the retrieved context is given the question. | Applies to RAG problems. The `contextColumnName` must be specified in the dataset config. |
Answer relevancy* | Measures how relevant the answer (output) is given the question. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Correctness* | Correctness of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Harmfulness* | Harmfulness of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Coherence* | Coherence of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Conciseness* | Conciseness of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Maliciousness* | Maliciousness of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Context recall* | Measures the ability of the retriever to retrieve all the context necessary to answer the question. | Applies to RAG problems. The `groundTruthColumnName` and `contextColumnName` must be specified in the dataset config. |
*To access these metrics, you must have a valid OpenAI key and specify it in the Openlayer platform. Furthermore, to compute them, we run the first 10 rows of your data through OpenAI’s GPT-3.5 Turbo model.
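To make the string-comparison metrics concrete, here is a minimal sketch of edit distance (Levenshtein distance) alongside exact and quasi-exact match. These are illustrative implementations, not the ones Openlayer runs; in particular, the quasi-exact relaxation shown (case and whitespace insensitivity) is one common choice, assumed for the example:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),     # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def exact_match(output: str, reference: str) -> bool:
    """True only if the strings are identical in every aspect."""
    return output == reference

def quasi_exact_match(output: str, reference: str) -> bool:
    """One possible relaxation: ignore case and surrounding whitespace."""
    return output.strip().lower() == reference.strip().lower()
```

For instance, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion), while `quasi_exact_match(" Yes ", "yes")` is true even though the exact match fails.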
## Regression metrics
Metric | Description | Comments |
---|---|---|
Mean squared error (MSE) | Average of the squared differences between the predicted values and the true values. | |
Root mean squared error (RMSE) | The square root of the MSE. | |
Mean absolute error (MAE) | Average of the absolute differences between the predicted values and the true values. | |
R-squared | Also known as coefficient of determination. Quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. | |
Mean absolute percentage error (MAPE) | Average of the absolute percentage differences between the predicted values and the true values. | |
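The regression metrics above can be sketched from their definitions in a few lines of plain Python. This is an illustrative helper (hypothetical name, not part of the Openlayer SDK); note that MAPE assumes all true values are nonzero:

```python
def regression_metrics(y_true: list, y_pred: list) -> dict:
    """Illustrative sketch: MSE, RMSE, MAE, R-squared, and MAPE."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)     # total sum of squares
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
        # MAPE divides by the true values, so they must be nonzero.
        "mape": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }
```

For example, with true values `[3, -0.5, 2, 7]` and predictions `[2.5, 0.0, 2, 8]`, the squared errors are 0.25, 0.25, 0, and 1, giving an MSE of 0.375 and an MAE of 0.5.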