Definition

The duplicate rows test checks whether the dataset contains rows that are identical to one another.
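
As a rough illustration of the idea, the check can be sketched in plain Python by counting how often each row occurs. This is only a minimal sketch, not the test's actual implementation: the helper name `find_duplicate_rows` and the toy columns are hypothetical, and it assumes rows count as duplicates only when every column matches exactly.

```python
from collections import Counter


def find_duplicate_rows(rows):
    """Return {row: count} for every row that appears more than once."""
    # Tuples are hashable, so each full row can be counted directly.
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical toy dataset with columns (age, plan, churned).
rows = [
    (34, "basic", 0),
    (51, "premium", 1),
    (34, "basic", 0),  # exact copy of the first row
    (27, "basic", 1),
]

duplicates = find_duplicate_rows(rows)
print(duplicates)  # {(34, 'basic', 0): 2}
```

Under this assumption, the test would pass when the returned dictionary is empty.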

Taxonomy

  • Category: Integrity.
  • Task types: LLM, tabular classification, tabular regression, text classification.

Why it matters

  • Duplicate rows in the training set can lead the model to overfit to the duplicated examples.
  • Duplicate rows in the validation set can distort aggregate metrics, making them overly optimistic or pessimistic, because each duplicated example is counted more than once.
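
The effect on validation metrics can be seen in a small worked example. The sketch below uses made-up (true label, predicted label) pairs and a hypothetical `accuracy` helper; it only illustrates the arithmetic and is not drawn from any real evaluation.

```python
def accuracy(examples):
    """Fraction of (true_label, predicted_label) pairs that match."""
    return sum(label == pred for label, pred in examples) / len(examples)


# Hypothetical validation results: two correct and two incorrect predictions.
unique_examples = [(1, 1), (0, 0), (1, 0), (0, 1)]

# A correctly predicted row duplicated three more times inflates the score.
optimistic = unique_examples + [unique_examples[0]] * 3

# An incorrectly predicted row duplicated three more times deflates it.
pessimistic = unique_examples + [unique_examples[2]] * 3

acc_unique = accuracy(unique_examples)   # 2 / 4 = 0.5
acc_optimistic = accuracy(optimistic)    # 5 / 7, about 0.71
acc_pessimistic = accuracy(pessimistic)  # 2 / 7, about 0.29
```

The model and the distinct examples are the same in all three cases; only the number of copies changes, which is why the direction of the distortion depends on which rows happen to be duplicated.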