Definition

The special characters ratio test allows you to set a threshold on the ratio between the number of rows that only contain special characters and the ones that contain alphanumeric characters for a given column.

Taxonomy

  • Category: Integrity.
  • Task types: LLM, text classification.
  • Availability: and .

Why it matters

  • Often, entries that only contain special characters are a sign of data quality issues.
  • Understanding the extent of rows with only special characters helps in designing models that are robust to such anomalies. If your model is expected to encounter similar data in production, you might want to train it with some level of noise tolerance.