# Aggregate metrics
This guide explains the model evaluation metrics currently available throughout the Openlayer platform.
## Classification metrics
Metric | Description | Comments |
---|---|---|
Accuracy | The classification accuracy. Defined as the ratio of the number of correctly classified samples and the total number of samples. | |
Precision per class | The precision score for each class. Given by TP / (TP + FP). | |
Recall per class | The recall score for each class. Given by TP / (TP + FN). | |
F1 per class | The F1 score for each class. Given by 2 × Precision × Recall / (Precision + Recall). | |
Precision | For binary classification, the precision considering class 1 as “positive.” For multiclass classification, the macro-average of the precision score for each class, i.e., treating all classes equally. | |
Recall | For binary classification, the recall considering class 1 as “positive.” For multiclass classification, the macro-average of the recall score for each class, i.e., treating all classes equally. | |
F1 | For binary classification, the F1 considering class 1 as “positive.” For multiclass classification, the macro-average of the F1 score for each class, i.e., treating all classes equally. | |
ROC AUC | The macro-average of the area under the receiver operating characteristic curve score for each class, i.e., treating all classes equally. For multiclass classification tasks, it uses the one-versus-one configuration. | The ROC AUC is available only if the class probabilities are uploaded with the model. This is done by specifying a `predictionScoresColumnName` on the dataset configs. Refer to the How to write dataset config guides for details. |
False positive rate | Given by FP / (FP + TN). | The false positive rate is only available for binary classification tasks. |
Geometric mean | The geometric mean of the precision and the recall. | |
Log loss | Measure of the dissimilarity between predicted probabilities and the true distribution. Also known as cross-entropy loss or binary cross-entropy (in the binary classification case). | The log loss is available only if the class probabilities are uploaded with the model. This is done by specifying a `predictionScoresColumnName` on the dataset configs. Refer to the How to write dataset config guides for details. |
Where:
- TP: true positive.
- TN: true negative.
- FP: false positive.
- FN: false negative.
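The formulas above can be sketched in plain Python. The helper below is purely illustrative (its name is hypothetical and it is not part of the Openlayer SDK); it derives the binary-classification metrics from confusion-matrix counts:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Illustrative sketch: binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        # Geometric mean of precision and recall.
        "geometric_mean": (precision * recall) ** 0.5,
    }
```

For example, with 8 true positives, 5 true negatives, 2 false positives, and 1 false negative, precision is 8 / (8 + 2) = 0.8 and recall is 8 / (8 + 1) ≈ 0.889.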
## LLM metrics
Metric | Description | Comments |
---|---|---|
Mean BLEU | Bilingual Evaluation Understudy (BLEU) score. Available with n-gram precisions from unigram to 4-gram (BLEU-1, 2, 3, and 4). | |
Mean edit distance | Minimum number of single-character insertions, deletions, or substitutions required to transform one string into another, serving as a measure of their similarity. | |
Mean exact match | Assesses whether two strings are identical in every aspect. | |
Mean JSON score | Measures how close the output is to a valid JSON. | |
Mean quasi-exact match | Assesses whether two strings are similar, allowing partial matches and variations. | |
Mean semantic similarity | Assesses the similarity in meaning between sentences by measuring their closeness in semantic space. | |
Mean, max, and total number of tokens | Statistics on the number of tokens. | The `tokenColumnName` must be specified in the dataset config. |
Mean and max latency | Statistics on the response latency. | The `latencyColumnName` must be specified in the dataset config. |
Context relevancy* | Measures how relevant the retrieved context is given the question. | Applies to RAG problems. The `contextColumnName` must be specified in the dataset config. |
Answer relevancy* | Measures how relevant the answer (output) is given the question. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Correctness* | Correctness of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Harmfulness* | Harmfulness of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Coherence* | Coherence of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Conciseness* | Conciseness of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Maliciousness* | Maliciousness of the answer. | Applies to RAG problems. The `questionColumnName` must be specified in the dataset config. |
Context recall* | Measures the ability of the retriever to retrieve all the context necessary to answer the question. | Applies to RAG problems. The `groundTruthColumnName` and `contextColumnName` must be specified in the dataset config. |
*To access these metrics, you must have a valid OpenAI key and specify it in the Openlayer platform. Furthermore, to compute them, we run the first 10 rows of your data through OpenAI’s GPT-3.5 Turbo model.
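To make the string-comparison metrics concrete, here is a minimal sketch of edit distance (Levenshtein distance) alongside exact and quasi-exact match. These are illustrative implementations, not the ones Openlayer runs; in particular, the quasi-exact relaxation shown (case and whitespace insensitivity) is one common choice, assumed for the example:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),     # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def exact_match(output: str, reference: str) -> bool:
    """True only if the strings are identical in every aspect."""
    return output == reference

def quasi_exact_match(output: str, reference: str) -> bool:
    """One possible relaxation: ignore case and surrounding whitespace."""
    return output.strip().lower() == reference.strip().lower()
```

For instance, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion), while `quasi_exact_match(" Yes ", "yes")` is true even though the exact match fails.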
## Regression metrics
Metric | Description | Comments |
---|---|---|
Mean squared error (MSE) | Average of the squared differences between the predicted values and the true values. | |
Root mean squared error (RMSE) | The square root of the MSE. | |
Mean absolute error (MAE) | Average of the absolute differences between the predicted values and the true values. | |
R-squared | Also known as coefficient of determination. Quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. | |
Mean absolute percentage error (MAPE) | Average of the absolute percentage differences between the predicted values and the true values. | |
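The regression metrics above can be sketched from their definitions in a few lines of plain Python. This is an illustrative helper (hypothetical name, not part of the Openlayer SDK); note that MAPE assumes all true values are nonzero:

```python
def regression_metrics(y_true: list, y_pred: list) -> dict:
    """Illustrative sketch: MSE, RMSE, MAE, R-squared, and MAPE."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)     # total sum of squares
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
        # MAPE divides by the true values, so they must be nonzero.
        "mape": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }
```

For example, with true values `[3, -0.5, 2, 7]` and predictions `[2.5, 0.0, 2, 8]`, the squared errors are 0.25, 0.25, 0, and 1, giving an MSE of 0.375 and an MAE of 0.5.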