Data Dictionary
This page documents the input and output data available on the TextReveal® analyze resource. It presents data columns, their descriptions and their coverage.
POST /analyze/dataset
This route allows a client to launch a query to analyze data relevant to a list of entities.
Parameters
Body parameters
name | type | description | scope | required |
---|---|---|---|---|
entities | list[dict] | List of entities to be requested | global | Yes |
entity_of_interest | string | Unique id for the entity of interest | entity | Yes |
keywords | list | List of keywords to search. All keywords with a length strictly lower than 3 characters are filtered out, except for the Japanese, Chinese and Korean languages. | entity | Yes |
concepts | dict[str, list[str]] | List of concepts or risks to be analyzed. Each individual concept is defined by its own list of keywords. Punctuation is not handled in the concept labels. Each concept label must be unique (case insensitive). | global | No |
concepts_filter | dict[str, list[str]] | Same as concepts, but filters out documents that do not contain the concepts. Note: you can use either concepts or concepts_filter. | global | No |
sentiments_filter | dict[str, dict[str, int]] | Partial object containing min/max values for each sentiment. The final analysis will contain only documents that match these filters. Allowed keys: | global | No |
sites_excludes | list | List of websites to exclude from the search. N.B.: use the base domain of the websites. | global | No |
min_match | int | The message must contain at least min_match keywords. When used, each entity must have at least min_match keywords. | global | No |
min_repeat | int | The message must contain at least min_repeat occurrences of a keyword. | global | No |
start_date | date | Format: YYYY-MM-DD | global | Yes |
end_date | date | Format: YYYY-MM-DD | global | Yes |
site_type | list | Type of sites to search (field thread.site_type). Available options are: | global | No |
languages | list[str] | List of languages to search; see the Language Support page for more information. | global | No |
countries | list[str] | List of countries to search (field thread.country). N.B.: use the alpha-2 format. | global | No |
sites | list | List of websites to search. N.B.: use the base domain of the websites. | global | No |
co_mentions | list | List of keywords to search together with the keywords list. Works like a boolean AND. co_mentions is matched in full text and is case insensitive. | global | No |
keywords_exclude | list | List of keywords to exclude from the search. Works like a boolean AND NOT. keywords_exclude is matched in full text and is case insensitive. | global | No |
qscore | float | Quality threshold to filter out unreadable data. The default value is 50. No filtering is applied if the quality-score worker is not provided. | global | No |
neg_keywords | list | List of keywords not used for search but for the named entity resolution or annotation task. | entity | No |
workers | list | Workflow steps definition | global | Yes |
context | string | Context description of the entity | entity | Yes |
similarity_threshold | float | Similarity score threshold for recognized or matched entities. Filters out documents containing entities with a similarity score lower than the threshold. | global | No |
search_in | list[string] | Defines whether the document extraction searches for entity keywords in the title and/or in the text. | global | No |
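For illustration, a minimal sketch of launching a dataset analysis with Python's requests library. The base URL, the authentication header and the worker name are assumptions, not values documented on this page:

import requests

BASE_URL = "https://api.textreveal.com"            # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}      # hypothetical auth scheme

payload = {
    "entities": [
        {
            "entity_of_interest": "apple-inc",                # unique id (entity scope)
            "keywords": ["Apple", "AAPL", "iPhone"],          # keywords shorter than 3 characters would be filtered out
            "context": "Apple Inc. is an American consumer electronics company.",
        }
    ],
    "start_date": "2023-01-01",
    "end_date": "2023-01-31",
    "workers": ["sentiment"],                                 # hypothetical worker name
    "min_match": 1,                                           # each message must match at least one keyword
}

resp = requests.post(f"{BASE_URL}/analyze/dataset", json=payload, headers=HEADERS)
resp.raise_for_status()
instance_id = resp.json()["instance_id"]                      # unique identifier of the analysis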
Response
name | type | description |
---|---|---|
instance_id | string | The unique identifier of the analysis |
POST /analyze/tql
This route allows a client to launch an analysis of data relevant to a list of entities using the TextReveal Query Language (TQL).
The TextReveal Query Language (TQL) is a simple text-based query language for filtering data. It is composed of fields to which values are applied, e.g. site_type:"news". Filters can be combined into a boolean expression with the AND, OR and NOT operators.
Example: (text:"Apple TV" OR title:"Steve Jobs") AND NOT text:"apple tree"
Unlike the dataset route, the TQL route requests all types of sites. The news site type groups news, premium_news and licensed_news. Moreover, workers are implicitly enabled when their associated parameter is used. For example, the quality-score worker is enabled if the qscore parameter is used.
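As a sketch, TQL expressions can also be composed programmatically; the helper below only concatenates strings and assumes the field names shown above:

def tql_clause(field: str, value: str) -> str:
    # A single field filter, e.g. site_type:"news"
    return f'{field}:"{value}"'

def tql_or(*clauses: str) -> str:
    return "(" + " OR ".join(clauses) + ")"

def tql_and(*clauses: str) -> str:
    return "(" + " AND ".join(clauses) + ")"

query = tql_and(
    tql_or(tql_clause("text", "Apple TV"), tql_clause("title", "Steve Jobs")),
    "NOT " + tql_clause("text", "apple tree"),
)
# ((text:"Apple TV" OR title:"Steve Jobs") AND NOT text:"apple tree")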
Parameters
Body parameters
name | type | description | scope | required |
---|---|---|---|---|
entities | list[dict] | List of entities to be requested | global | Yes |
entity_of_interest | string | Unique id for the entity of interest | entity | Yes |
context | string | Context description of the entity. The context is mandatory if you use the similarity_threshold parameter. | entity | No |
query | string | A TQL query defining the entity of interest; it is used for the data extraction. Accepted fields are: Specific values for site_type: Example: ((title:"1&1" AND text:"1&1 DRILLISCH") OR (title:"DRILLISCH" AND text:"1&1 DRILLISCH") AND (ner:"1&1 DRILLISCH")) | entity | Yes |
annotate_keywords | list | List of keywords not used for search but for the named entity resolution or annotation task. | entity | Yes |
concepts | dict[str, list[str]] | List of concepts or risks to be analyzed. Each individual concept is defined by its own list of keywords. Punctuation is not handled in the concept labels. Each concept label must be unique (case insensitive). | global | No |
concepts_filter | dict[str, list[str]] | Same as concepts, but filters out documents that do not contain the concepts. Note: you can use either concepts or concepts_filter. | global | No |
sentiments_filter | dict[str, dict[str, int]] | Partial object containing min/max values for each sentiment. The final analysis will contain only documents that match these filters. Allowed keys: | global | No |
min_match | int | The message must contain at least min_match annotate keywords. When used, each entity must have at least min_match keywords. | global | No |
min_repeat | int | The message must contain at least min_repeat occurrences of an annotate keyword. | global | No |
start_date | date | Format: YYYY-MM-DD | global | Yes |
end_date | date | Format: YYYY-MM-DD | global | Yes |
language | string | Language to search; see the Language Support page for more information. The default value is english. | global | No |
qscore | float | Quality threshold to filter out unreadable data. No filtering is applied if the qscore parameter is not provided. | global | No |
similarity_threshold | float | Similarity score threshold for recognized or matched entities. Filters out documents containing entities with a similarity score lower than the threshold. | global | No |
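A hedged sketch of launching a TQL analysis, reusing the BASE_URL and HEADERS assumptions from the dataset example and the example query from the table above; the entity id is hypothetical:

payload = {
    "entities": [
        {
            "entity_of_interest": "1und1-drillisch",   # hypothetical id
            "query": '((title:"1&1" AND text:"1&1 DRILLISCH") OR (title:"DRILLISCH" AND text:"1&1 DRILLISCH") AND (ner:"1&1 DRILLISCH"))',
            "annotate_keywords": ["1&1", "DRILLISCH"],
        }
    ],
    "start_date": "2023-01-01",
    "end_date": "2023-01-31",
    "language": "english",
    "qscore": 50,   # implicitly enables the quality-score worker
}

resp = requests.post(f"{BASE_URL}/analyze/tql", json=payload, headers=HEADERS)
instance_id = resp.json()["instance_id"]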
Response
name | type | description |
---|---|---|
instance_id | string | The unique identifier of the analysis |
POST /analyze/download
This route allows a client to preview the results of a previously run analysis.
Parameters
Body parameters
name | type | description | required |
---|---|---|---|
instance | string | Id of the instance from which to retrieve the textual data | Yes |
limit | number|dict | When limit is a number (e.g., 100): When limit is a dictionary (e.g., {"by": "entity", "value": 3}): Textual fields (id, title, sentences, url or thread) are only returned if the total number of documents (the "computed limit") is ≤ 2000. | Yes |
date | string | Filter the documents on a given date, using the %Y-%m-%d format. The date must be included in the date range of the analysis. | No |
entity | string | Filter the documents on a given entity. The entity must be an entity of interest of the analysis. | No* |
concept | string | Filter the documents on a given concept. The concept must be present in the analysis. | No |
sort | dict | Sort the documents in ascending or descending order given a field | No |
sort >field | string | The field to sort the documents on. Available fields are: | Yes |
sort >order | string | The order of the sorting. Available values are: | Yes |
fields | list[str] | Collect only the fields you need. By default, all fields except summary are returned. The id field is always returned. Available keys for the fields parameter are: | No |
- If you use the sort parameter, the date parameter can become mandatory if your analysis has generated a large number of results (2,500,000 documents).
- When using the sort parameter with a field that has aggregation functions (e.g., min, max, median, mean), the mean value is used.
- When using one of the entity match fields (document_entity_polarity, document_entity_positive, document_entity_neutral, document_entity_negative) in the sort parameter, the entity parameter is mandatory.
- premium_news text cannot be retrieved. Each sentence is replaced by this placeholder: "The download of licensed text is not allowed."
- The summary field is an experimental feature; we recommend using the /documents route with the document id, as shown on the example page here.
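A sketch of previewing results, under the same BASE_URL/HEADERS assumptions as above; the sort field follows the notes above, but the "desc" order value and the response shape are assumptions:

payload = {
    "instance": instance_id,
    "limit": {"by": "entity", "value": 3},                            # up to 3 documents per entity
    "sort": {"field": "document_entity_polarity", "order": "desc"},   # entity match field, so entity is mandatory
    "entity": "apple-inc",
    "fields": ["title", "url", "sentences"],                          # id is always returned
}

resp = requests.post(f"{BASE_URL}/analyze/download", json=payload, headers=HEADERS)
documents = resp.json()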
Response
name | type | parent | description |
---|---|---|---|
extract_date | datetime | | Date of extraction of the article (YYYY-MM-DD) |
language | string | | Language of the article |
thread | dict | | Parent key for country, site, site_type and title |
country | string | thread | 2-letter ISO country code |
site | string | thread | Site of the article |
site_type | string | thread | Site type of the article. Available options are: |
title | string | thread | Title of the thread, mapped from sentences of type 2 (if there are no such sentences, the title is an empty string) |
url | string | | Url of the article |
id | string | | Id of the article |
title | string | | Title of the document, mapped from sentences of type 1 (if there are no such sentences, the title is an empty string) |
sentences | list[dict] | | List of sentences with their matches and indicators when available |
text | string | sentences | Text of the sentence |
entities | | | |
sentence_id | int | sentences | Id of the sentence |
type | int | sentences | Type of the sentence: |
matches | list[dict] | sentences | List of matched keywords or entities |
results | dict | sentences | List of indicators: |
negative | float | results | Negative sentiment probability |
positive | float | results | Positive sentiment probability |
neutral | float | results | Neutral sentiment probability |
polarity | float | results | The aggregate of the positive and negative sentiment scores at the sentence level. |
polarity_exp | float | results | The aggregate score, at the sentence level, of the difference between the negative and positive sentiment scores, passed through a sigmoid in order to smooth outliers. |
document_entity_polarity | dict | | Evaluates the sentiment level towards the entity of interest in all sentences mentioning the entity in a given document. Formula: |
document_entity_positive | dict | | Evaluates the level of positive sentiment towards an entity of interest in all sentences mentioning the entity in a given document. Formula: |
document_entity_neutral | dict | | Evaluates the level of neutral sentiment towards an entity of interest in all sentences mentioning the entity in a given document. Formula: |
document_entity_negative | dict | | Evaluates the level of negative sentiment towards an entity of interest in all sentences mentioning the entity in a given document. Formula: |
document_{sentiment} | dict | | Evaluates the desired sentiment (1) of a document. Formula: |
document_polarity | float | | Evaluates the sentiment level in all sentences in a given document. Formula: |
nb_sentences | int | | Number of sentences composing the article |
text | string | matches | Mention of the keyword or entity in the sentence |
entity | dict | matches | Identifier of the entity |
count | dict | matches | Prevalence of keywords for the matched concept |
similarity | float | matches | Cosine similarity score between the sentence and the context of the entity. Ranges in [0, 1] |
qscore | float | | Readability score of the document, calculated from KPIs such as the average sentence length and the ratio of non-alphanumeric characters within the document. |
concepts | dict | | Sum of occurrences of the keywords related to a given concept in each sentence. Available with the concept worker |
mentions | dict | | Sum of occurrences of the keywords related to a given mention in each sentence. Available with the raw-matcher worker |
entities | dict | | Sum of occurrences of the keywords related to a given entity in each sentence. Available with the ner-linking worker |
summary | string | | Summary in English of the document's text |
Summaries may exceptionally be empty for some texts that the model is not able to handle.
1. Available sentiment classes: positive, neutral, negative
*Deprecated: The field is deprecated and will be removed in future releases. Please consider updating your code as soon as possible.
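Assuming the preview response is a list of documents shaped like the table above, a sketch of walking the sentences and their sentence-level indicators:

for doc in documents:
    print(doc["id"], doc["language"], doc["thread"]["site_type"])
    for sentence in doc.get("sentences", []):
        results = sentence.get("results", {})
        if "polarity" in results:
            # sentence-level polarity, aggregated from the positive/negative probabilities
            print("  ", sentence["sentence_id"], results["polarity"], sentence["text"][:60])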
POST /analyze/status
This route allows a client to get the status of a previously run analysis.
Parameters
Body parameters
name | type | description | required |
---|---|---|---|
instance | string | The identifier of an analysis. This identifier has to be used to get the results. | Yes |
Response
name | type | description |
---|---|---|
count | | |
filtered | | |
globalSpeed | | |
handled | number | Number of documents in the analysis result set. |
lastErrorMessage | | |
startedAt | date | The time when the analysis started. |
status | string | The current status of the analysis. One of: |
updatedAt | date | The last time the analysis was updated. |
Pending: Your analysis is queued. The limit for concurrent analyses is reached and your analysis will start as soon as another already-running analysis finishes. See the limitation page for more information.
Starting: Your analysis is starting. Necessary resources are being gathered in order to run it.
*Deprecated: The field is deprecated and will be removed in future releases. Please consider updating your code as soon as possible.
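A polling sketch against /analyze/status, reusing the earlier BASE_URL/HEADERS assumptions; the terminal status names other than Pending and Starting are assumptions:

import time

def wait_for_analysis(instance_id: str, poll_seconds: int = 30) -> str:
    # Poll until the analysis leaves the queued/starting/running states.
    while True:
        resp = requests.post(f"{BASE_URL}/analyze/status",
                             json={"instance": instance_id}, headers=HEADERS)
        status = resp.json()["status"]
        if status.lower() not in {"pending", "starting", "running"}:   # assumed state names
            return status
        time.sleep(poll_seconds)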
POST /analyze/{id}/timeseries
This route allows a client to start the computation of a timeseries.
Parameters
Path parameters
name | type | description |
---|---|---|
id | string | The analysis id, as returned by the /analyze/dataset route. The analysis must be completed. |
Body parameters
name | type | description | required |
---|---|---|---|
operands | list[string] | The operators that will be used for aggregation. Must be a list composed of one or more of: min, max, median, mean | No |
output_format | string | The output format of the final result. Must be one of: json, csv. Note: with json as the output_format, concept names are returned in lowercase, while the csv format keeps their original case. | No |
pivots | list[string] | The pivots that will be used for aggregation (in addition to date and entity). Must be a list composed of one or more of: | No |
time_granularity | string | Aggregation granularity period. Must be one of: | No |
volume_only | boolean | Aggregation mode. Set to true to display only volumes | No |
Note that the output format is chosen when launching a timeseries, not when downloading it: you need to run a new timeseries in order to change the output format.
Response
name | type | description |
---|---|---|
hash | integer | The hash of the launched timeseries |
The table above only shows the successful HTTP API response (status code 200). You can expect multiple responses and status codes; please see here for more information.
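A sketch of launching a timeseries computation on a completed analysis, under the same BASE_URL/HEADERS assumptions; the "day" granularity value is an assumption consistent with the extract_day output column:

payload = {
    "operands": ["mean", "max"],     # aggregation operators
    "time_granularity": "day",       # assumed valid granularity
    "output_format": "csv",
    "volume_only": False,
}

resp = requests.post(f"{BASE_URL}/analyze/{instance_id}/timeseries",
                     json=payload, headers=HEADERS)
ts_hash = resp.json()["hash"]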
GET /analyze/{id}/timeseries/{hash}/status
This route allows a client to retrieve the timeseries status of a given instance using its hash.
Parameters
Path parameters
name | type | description |
---|---|---|
id | string | The analysis id. The analysis must be completed. (format: uuid ) |
hash | string | The timeseries hash. |
Response
name | type | description |
---|---|---|
status | string | The status of the timeseries. One of: |
GET /analyze/{id}/timeseries/{hash}/download
This route allows a client to download the timeseries results of a given instance using its hash.
Parameters
Path parameters
name | type | description |
---|---|---|
id | string | The analysis id. The analysis must be completed. (format: uuid ) |
hash | string | The timeseries hash. |
Response
name | type | description |
---|---|---|
{concept_label}_score | float | The percentage of documents containing at least one keyword related to the concept. |
entity | string | The detected entity |
extract_day | string | Extract day of the article. Format: date YYYY-MM-dd |
extract_hour | integer | Extract hour of the article |
extract_minute | integer | Extract minute of the article |
language | string | The language of the article |
{operator}_{sentiment_class} | float | {operator} (1) aggregation sentiment score (4) based on the {sentiment_class} (2) score (4) of all the sentences of all the documents matching the entity of interest for the selected aggregation period |
{operator}_{emotion_class} | float | {operator} (1) aggregation emotion score (4) based on the {emotion_class} (3) score (4) of all the sentences of all the documents matching the entity of interest for the selected aggregation period |
entity_{operator}_{sentiment_class} | float | {operator} (1) aggregation sentiment score (4) based on the {sentiment_class} (2) score (4) of the sentences matching the entity of interest for the selected aggregation period |
entity_{operator}_{emotion_class} | float | {operator} (1) aggregation emotion score (4) based on the {emotion_class} (3) score (4) of the sentences matching the entity of interest for the selected aggregation period |
volume_document | integer | The volume of documents where the entity of interest is matched for the aggregation period |
volume_sentence | integer | The volume of all sentences of all documents where the entity of interest is matched for the aggregation period |
entity_volume_sentence | integer | The volume of sentences where the entity of interest is matched for the aggregation period. |
volume_document_{concept_label} | integer | The volume of documents where the entity of interest AND the specified concept are matched for the aggregation period |
volume_sentence_{concept_label} | integer | The volume of all sentences of all documents where the entity of interest AND the specified concept are matched for the aggregation period |
{concept_label}_sentiment_polarity | integer | Average sentiment polarity of documents that match both the specified concept and the entity |
concepts_keywords_count | dict[str, dict[str, int]] | Represents the count of keywords matched per concepts in the document for the aggregation period. More info on the timeseries indicators page |
1. Available operators: min, max, median, mean. min: lowest value observed for the class over the defined period; max: highest value; median: middle value; mean: average value.
2. Available sentiment classes: positive, neutral, negative
3. Available emotion classes: anger, anticipation, fear, joy, sadness, surprise, trust
4. Sentiment and emotion scores are displayed using scientific notation, meaning that an exponent can appear at the end of the number.
Spaces surrounding the concept label are removed in the result.
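A sketch of checking the timeseries status and saving the result, continuing the earlier sketches; the "done" terminal status name is an assumption:

status = requests.get(
    f"{BASE_URL}/analyze/{instance_id}/timeseries/{ts_hash}/status",
    headers=HEADERS,
).json()["status"]

if status.lower() == "done":   # assumed terminal status name
    data = requests.get(
        f"{BASE_URL}/analyze/{instance_id}/timeseries/{ts_hash}/download",
        headers=HEADERS,
    )
    with open("timeseries.csv", "wb") as f:   # csv was requested at launch time
        f.write(data.content)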
GET /analyze/{id}
This route allows a client to retrieve the payload of a previously run instance using its id.
Parameters
Path parameters
name | type | description |
---|---|---|
id | string | The analysis id. (format: uuid ) |
Response
name | type | description | scope |
---|---|---|---|
entities | list[dict] | List of entities to be requested | global |
entity_of_interest | string | Unique id for the entity of interest | entity |
keywords | list | List of keywords to search | entity |
sites_excludes | list | List of websites to exclude from the search | global |
min_match | int | The message must contain at least min_match keywords. | global |
min_repeat | int | The message must contain at least min_repeat occurrences of a keyword. | global |
start_date | date | Format: YYYY-MM-DD | global |
end_date | date | Format: YYYY-MM-DD | global |
site_type | list | Type of sites to search (field thread.site_type). Available options are: | global |
languages | list | List of languages to search | global |
countries | list | List of countries to search (field thread.country) | global |
sites | list | List of websites to search | global |
co_mentions | list | List of keywords to search together with the keywords list. Works like a boolean AND. co_mentions is matched in full text and is case insensitive. | global |
keywords_exclude | list | List of keywords to exclude from the search. Works like a boolean AND NOT. keywords_exclude is matched in full text and is case insensitive. | global |
qscore | float | Quality threshold to filter out unreadable data. | global |
neg_keywords | list | List of keywords not used for search but for the named entity resolution or annotation task. | entity |
workers | list | Workflow steps definition | global |
context | string | Context description of the entity | entity |
precompute | boolean | Whether to query offline data: | global |
similarity_threshold | float | Similarity score threshold for recognized or matched entities. Filters out documents containing entities with a similarity score lower than the threshold. | global |
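For example, retrieving the original payload of an instance, under the same BASE_URL/HEADERS assumptions:

resp = requests.get(f"{BASE_URL}/analyze/{instance_id}", headers=HEADERS)
original_payload = resp.json()   # echoes the parameters described above
print(original_payload["start_date"], original_payload["end_date"])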
POST /analyze/{id}/stop
This route allows a client to stop a previously run instance using its id.
Parameters
Path parameters
name | type | description |
---|---|---|
id | string | The analysis id. (format: uuid ) |
POST /analyze/{id}/download
Prepare the download of your instance. The result will be available in the analyze/{id}/download/{hash} route once the process is completed.
Parameters
Body parameters
name | type | description | required |
---|---|---|---|
id | uuid | The instance id | Yes |
limit | number|dict | The number of documents to download, or a dictionary specifying a limit per resource | No |
limit >by | string | The resource to limit. Possible values: entity | Yes |
limit >value | number | The limit value | Yes |
fields | list[string] | Collect only the fields you need. By default, all fields except summary are returned. Available keys for the fields parameter are: | No |
date | daterange | Extract only the documents published between the two dates. | No |
date >start | date | The start date of the date range. The format is YYYY-MM-DD. | No |
date >end | date | The end date of the date range. The format is YYYY-MM-DD. | No |
concepts | list[string] | Extract only the documents that contain the concepts. Each concept must be present in the analysis. | No |
entities | list[string] | Extract only the documents that contain the entities. Each entity must be present in the analysis. | No |
sort | dict | Sort the documents in ascending or descending order given a field | No |
sort >field | string | The field to sort the documents on. Available fields are: | Yes |
sort >order | string | The order of the sorting. Available values are: | Yes |
Path parameters
name | type | description |
---|---|---|
id | string | The analysis id. The analysis must be completed. (format: uuid ) |
Response
name | type | description |
---|---|---|
hash | integer | The hash of the download |
GET /analyze/{id}/download/{hash}/status
This route allows a client to retrieve the download status of a given instance using its hash.
Parameters
Path parameters
name | type | description |
---|---|---|
id | string | The analysis id. The analysis must be completed. (format: uuid ) |
hash | string | The download hash. |
Response
name | type | description |
---|---|---|
status | string | The status of the download. One of: |
GET /analyze/{id}/download/{hash}
This route allows a client to download the results of a given instance using its hash.
Parameters
Path parameters
name | type | description |
---|---|---|
id | string | The analysis id. The analysis must be completed. (format: uuid ) |
hash | string | The download hash. |
Response
Array of urls that you can use to retrieve the result of the download. Example:
[
"https://files.textreveal.com/download/company=e8c8d3ba-4ca0-45d1-b4ba-c1b1f2364a12/instance=fabd78aa-5241-4842-8108-fd52ef805cde/download=03d8c58a31/output-0.parquet.gz",
"https://files.textreveal.com/download/company=e8c8d3ba-4ca0-45d1-b4ba-c1b1f2364a12/instance=fabd78aa-5241-4842-8108-fd52ef805cde/download=03d8c58a31/output-1.parquet.gz"
]
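Putting the three download routes together, a sketch that prepares a download, polls its status and fetches every parquet part; it reuses the earlier BASE_URL/HEADERS/time assumptions, and the "done" terminal status name is also an assumption:

payload = {
    "limit": 10000,
    "fields": ["title", "url", "sentences"],
    "date": {"start": "2023-01-01", "end": "2023-01-31"},
}

resp = requests.post(f"{BASE_URL}/analyze/{instance_id}/download",
                     json=payload, headers=HEADERS)
dl_hash = resp.json()["hash"]

while True:
    status = requests.get(
        f"{BASE_URL}/analyze/{instance_id}/download/{dl_hash}/status",
        headers=HEADERS,
    ).json()["status"]
    if status.lower() == "done":   # assumed terminal status name
        break
    time.sleep(30)

urls = requests.get(f"{BASE_URL}/analyze/{instance_id}/download/{dl_hash}",
                    headers=HEADERS).json()
for i, url in enumerate(urls):
    part = requests.get(url)
    with open(f"output-{i}.parquet.gz", "wb") as f:
        f.write(part.content)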
POST /analyze/timeserie
Deprecated: The route is deprecated and will be removed in future releases. Please consider updating your code as soon as possible.
Parameters
Body parameters
Response
name | type | description |
---|---|---|
extract_day | date | Day of extraction of the article (YYYY-MM-DD). Date: UTC+0 |
extract_date | datetime | Date of extraction of the article |
extract_hour | times | Available when time series are aggregated by hour |
extract_minute | times | Available when time series are aggregated by minute |
country | string | Country of the site, determined automatically from the site language, IP and TLD |
entity | string | Entity detected for the record |
id | string | Identifier of the document |
site_type | string | Type of data source for the document. Available options are: |
language | string | Language of the document |
site | string | Website of the document |
url | string | Url of the document |
volume_sentence | int | Number of sentences |
volume_document | int | Number of documents |
mean_<indicator> | float | Mean score calculated for the record |
max_<indicator> | float | Max score calculated for the record |
min_<indicator> | float | Min score calculated for the record |
median_<indicator> | float | Median score calculated for the record |