Workers

Quality Score Worker

The Quality Score worker allows you to leverage TextReveal® quality score module so as to filter out noisy or unreadable documents according to a definable threshold.

Textreveal® API Quality score worker

The quality-score worker is restricted to certain languages, see the Language Support page for more information

Definition

The Quality Score is a readability score provided at the document level. Is an indicator of the difference between a document and a set of reference texts to provide a readability value between 0 and 100. To calculate the quality score, the following are used:

Specific key performance indicators (KPIs) for each language group
Specific reference texts for each site type

The quality score operates at the topological level. It has no semantic feature and therefore is not fit to detect inappropriate document such as adult content.

Reference texts

Reference texts are documents that we consider to be the baseline for good readability documents. There are three groups of them which correspond to the site type currently available in SESAMm's data warehouse:

News
Blogs
Discussions

KPIs

KPIs involved in quality score calculation

KPIs are statistical values computed from a document (such as number of char, number of sentences, etc). Human language is complex and depending on the language writing specificity may differ. Considering three differences we defined two main language groups which we called European languages, Japanese and Chinese languages ( Chinese and Japanese ).

There are currently two groups of KPIs including European language group and Chinese and Japanese language group.

	European languages	Chinese and japanese
KPIs	characters count (length of text) words count letters count numbers count non alphanumeric characters count percentage of letters over characters percentage of numbers over characters percent of non alphanumeric characters	length of the document number of sentences in the document average length of sentences in the document number of full stops in the document number of commas percentage of japanese or chinese characters over text length percentage of non alphanumeric characters

The group names do not reflects the geographical original area of languages. For example we considered Korean to be into the group European languages because it uses “European” punctuation and every words are separated with a blank space like European languages.