Workers

Quality Score Worker

The Quality Score worker allows you to leverage TextReveal® quality score module so as to filter out noisy or unreadable documents according to a definable threshold.

Textreveal® typical workflow Textreveal® API Quality score worker

The quality-score worker is restricted to certain languages, see the Language Support page for more informations

Definition

The Quality Score is a readability score provided at the document level. Is an indicator of the difference between a document and a set of reference texts to provide a readability value between 0 and 100. To calculate the quality score, the following are used:

  • Specific key performance indicators (KPIs) for each language group

  • Specific reference texts for each site type

The quality score operates at the topological level. It has no semantic feature and therefore is not fit to detect inappropriate document such as adult content.

Reference texts

Reference texts are documents that we consider to be the baseline for good readability documents. There are three groups of them which correspond to the site type currently available in SESAMm's data warehouse:

  • News
  • Blogs
  • Discussions

KPIs

Textreveal® typical workflow KPIs involved in quality score calculation

KPIs are statistical values computed from a document (such as number of char, number of sentences, etc). Human language is complex and depending on the language writing specificity may differ. Considering three differences we defined two main language groups which we called European languages, Japanese and Chinese languages ( Chinese and Japanese ).

There are currently two groups of KPIs including European language group and Chinese and Japanese language group.

European languagesChinese and japanese
KPIs
  • characters count (length of text)
  • words count
  • letters count
  • numbers count
  • non alphanumeric characters count
  • percentage of letters over characters
  • percentage of numbers over characters
  • percent of non alphanumeric characters
  • length of the document
  • number of sentences in the document
  • average length of sentences in the document
  • number of full stops in the document
  • number of commas
  • percentage of japanese or chinese characters over text length
  • percentage of non alphanumeric characters

The group names do not reflects the geographical original area of languages. For example we considered Korean to be into the group European languages because it uses “European” punctuation and every words are separated with a blank space like European languages.