Asynchronous Download
Useful Links
- Endpoints
- Fields description: Data dictionary
Download data asynchronously
The asynchronous download feature is recommended for larger datasets (beyond 2,000 documents).
- First, call the route POST /analyze/{id}/download, which prepares a download for the specified instance and returns a hash that you can use to retrieve the result of the download. A minimal sketch of this call is shown after the response example below.
POST /analyze/{id}/download
{
  "concepts": [
    "environment"
  ],
  "date": {
    "end": "2019-02-01",
    "start": "2019-02-01"
  },
  "entities": [
    "apple"
  ],
  "fields": [
    "title"
  ],
  "limit": 500,
  "sort": {
    "field": "document_polarity",
    "order": "ASC"
  }
}
Response
{
  "hash": "98756"
}
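For reference, here is a minimal Python sketch of this first call. The host, token and instance id below are placeholders you need to replace; the endpoint path mirrors the full example at the end of this page.
import requests

HOST = 'https://api.example.com'   # placeholder: your API host
TOKEN = 'YOUR_TOKEN'               # placeholder: your bearer token
INSTANCE_ID = 'INSTANCE_ID'        # placeholder: your analysis instance id

# POST /analyze/{id}/download prepares the download and returns a hash
endpoint = f'{HOST}/api/2.0/analyze/{INSTANCE_ID}/download'
headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {TOKEN}'}
payload = {'fields': ['title'], 'limit': 500}

response = requests.post(endpoint, headers=headers, json=payload)
download_hash = response.json()['hash']  # keep this hash to poll the status and fetch the files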
- You can then check the progress by calling the route GET /analyze/{id}/download/{hash}/status. The status can be starting, running, failed, stopped, or completed.
Response
{
  "status": "completed"
}
- When the process completes, retrieve your download via the route GET /analyze/{id}/download/{hash}, which returns an array of URLs, each pointing to a gzipped Parquet file. These files remain accessible for one week. A sketch showing how to save them to disk follows the response example below.
Response
[
"https://files.textreveal.com/download/company=e8c8d3ba-4ca0-45d1-b4ba-c1b1f2364a12/instance=fabd78aa-5241-4842-8108-fd52ef805cde/download=03d8c58a31/output-0.parquet",
"https://files.textreveal.com/download/company=e8c8d3ba-4ca0-45d1-b4ba-c1b1f2364a12/instance=fabd78aa-5241-4842-8108-fd52ef805cde/download=03d8c58a31/output-1.parquet"
]
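If you prefer to keep the raw files on disk instead of reading them in memory, a sketch along these lines works as well; the local file names are an assumption, and urls stands for the array returned above.
import requests

# `urls` stands for the array of file URLs returned by GET /analyze/{id}/download/{hash}
urls = []  # fill with the URLs returned by the API

for i, url in enumerate(urls):
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(f'output-{i}.parquet', 'wb') as f:
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)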
Notes
- The instance must be completed; you can check its status using the /analyze/status route.
- Some fields (id, title, sentences, url, summary and thread) are only available if the total number of documents (after applying the limit) is below 2,000.
- When using the sort parameters, the "first" document will be in the output-0.parquet file and the "last" document will be in the output-N.parquet file (the name can vary, but a number suffix will be present and the files will be ordered in the array). Order is preserved throughout the files (see the sketch below).
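Because the order is preserved, you can read the files in the order given by the array and concatenate them into a single DataFrame. The sketch below assumes pandas and pyarrow are installed and that parquet_files holds the URL array returned by the API.
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
import requests

# `parquet_files` stands for the URL array returned by GET /analyze/{id}/download/{hash};
# the files are already ordered according to the `sort` parameter.
parquet_files = []  # fill with the URLs returned by the API

frames = [
    pq.read_table(pa.BufferReader(requests.get(url).content)).to_pandas()
    for url in parquet_files
]

# Concatenating in array order keeps the global sort order intact
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()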
Understanding the limit parameter
Limit as a number
- If you pass a number directly to the limit parameter (e.g., "limit": 100), you will receive up to that many documents in total, sorted by the rules specified in the sort field (for example, descending order on a specific score). An illustrative payload fragment is shown below.
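For instance, a payload using a plain number for limit could look like the following fragment (the field names reuse those from the request example above):
{
  "limit": 100,
  "sort": {
    "field": "document_polarity",
    "order": "DESC"
  },
  "fields": [
    "title"
  ]
}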
Limit by object (e.g., entity)
- You can also pass a dictionary with this structure:
"limit": {
"by": "entity",
"value": 3
}
- In this scenario, the system returns the top 3 documents per entity, based on the specified sort.
- Example: If your analysis contains 10 distinct entities and you request 3 documents per entity, you could receive up to 30 documents in total. We refer to this final total (30 in this example) as the computed limit.
Access to id, title, sentences, url and thread fields
- These fields are only available if the computed limit is below 2000.
- Example: If you request 3 documents per entity ("limit": {"by": "entity", "value": 3}) but have 1,000 entities, the computed limit is 3,000, which exceeds 2,000. Therefore, the id, title, and sentences fields will not be included.
- If you need these fields for more than 2,000 documents, you can make multiple calls (e.g., split queries, apply filters by date or entity) or adjust your request so that each call stays under the 2,000-document threshold (see the sketch below).
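As an illustration, one way to stay under the threshold is to start several smaller downloads filtered by date; the month boundaries, field list and placeholders below are assumptions, and each returned hash is then polled and fetched exactly as described above.
import requests

HOST = 'https://api.example.com'   # placeholder: your API host
TOKEN = 'YOUR_TOKEN'               # placeholder: your bearer token
INSTANCE_ID = 'INSTANCE_ID'        # placeholder: your analysis instance id

endpoint = f'{HOST}/api/2.0/analyze/{INSTANCE_ID}/download'
headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {TOKEN}'}

# Illustrative date ranges chosen so that each download stays below 2,000 documents
monthly_ranges = [
    ('2019-01-01', '2019-01-31'),
    ('2019-02-01', '2019-02-28'),
    ('2019-03-01', '2019-03-31'),
]

hashes = []
for start, end in monthly_ranges:
    payload = {
        'date': {'start': start, 'end': end},
        'fields': ['id', 'title', 'sentences'],
        'limit': 2000,
        'sort': {'field': 'document_polarity', 'order': 'DESC'},
    }
    response = requests.post(endpoint, headers=headers, json=payload)
    hashes.append(response.json()['hash'])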
Example: Downloading the top 10 documents per entity of an analysis
Here, assume that you started an analysis with 10 distinct entities using the POST /analyze/dataset or POST /analyze/tql routes. That instance is now in the completed state and its id is INSTANCE_ID.
Requirements
In this example we'll use python with the following libraries:
- pandas to compute and manipulate the downloaded data
- pyarrow to read the parquet files
- requests to send the requests
example.py
import pyarrow.parquet as pq
import pyarrow as pa
import requests
import time
import json
# Functions found in the section "Quick start" under "Getting started"
from connect_v2 import read_config, get_token
config = read_config()
host = config['api']['host']
token = get_token(config)
instance_id = 'INSTANCE_ID' # Replace with your instance id
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {token}'
}
# Start a download of the top 10 documents per entity
payload = json.dumps({
    "limit": {
        "by": "entity",
        "value": 10
    },
    "sort": {
        "field": "document_positive",
        "order": "DESC"
    },
    "fields": ["title", "entities", "mentions"]
})
endpoint = f'{host}/api/2.0/analyze/{instance_id}/download'
response = requests.post(endpoint, headers=headers, data=payload)
download_hash = response.json()['hash']
# Wait for the download to be completed
status_endpoint = f'{host}/api/2.0/analyze/{instance_id}/download/{download_hash}/status'
while True:
    status = requests.get(status_endpoint, headers=headers).json()
    if status['status'] == 'completed':
        break
    elif status['status'] in ('failed', 'stopped'):
        # Stop polling if the download cannot complete
        raise RuntimeError(f'Download ended with status: {status["status"]}')
    else:
        print(f'Download status: {status["status"]}')
        time.sleep(1)
# Download the results
download_endpoint = f'{host}/api/2.0/analyze/{instance_id}/download/{download_hash}'
response = requests.get(download_endpoint, headers=headers)
# Read the parquet files
parquet_files = response.json()
for parquet_file_url in parquet_files:
    with requests.get(parquet_file_url) as f:
        df = pq.read_table(pa.BufferReader(f.content)).to_pandas(maps_as_pydicts="strict")
        # Do something with the data
        print(df)
The result will contain up to 100 documents (10 entities * 10 documents per entity). In this example, 3 columns are returned: title, entities and mentions.
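If you want a single DataFrame rather than one per file, a small extension of the loop above is enough; the sketch below assumes you appended each df to a list named frames inside that loop.
import pandas as pd

frames = []  # inside the loop above: frames.append(df)

# Concatenating in file order keeps the sort order from the download
combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(len(combined))  # up to 100 rows: 10 entities * 10 documents per entity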