Asynchronous Download
Useful Links
- Endpoints
- Fields description: Data dictionary
Download data asynchronously
The asynchronous download feature is recommended for larger datasets (beyond 2,000 documents).
- First, call the route POST /analyze/{id}/download, which prepares a download for the specified instance and returns a hash that you can use to retrieve the result of the download. A minimal sketch of this call is shown after the response example below.
POST /analyze/{id}/download
{
  "concepts": [
    "environment"
  ],
  "date": {
    "end": "2019-02-01",
    "start": "2019-02-01"
  },
  "entities": [
    "apple"
  ],
  "fields": [
    "title"
  ],
  "limit": 500,
  "sort": {
    "field": "document_polarity",
    "order": "ASC"
  }
}
Response
{
  "hash": "98756"
}
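For reference, here is a minimal Python sketch of this first call. The host, token and instance id below are placeholders you need to replace; the endpoint path mirrors the full example at the end of this page.
import requests

HOST = 'https://api.example.com'   # placeholder: your API host
TOKEN = 'YOUR_TOKEN'               # placeholder: your bearer token
INSTANCE_ID = 'INSTANCE_ID'        # placeholder: your analysis instance id

# POST /analyze/{id}/download prepares the download and returns a hash
endpoint = f'{HOST}/api/2.0/analyze/{INSTANCE_ID}/download'
headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {TOKEN}'}
payload = {'fields': ['title'], 'limit': 500}

response = requests.post(endpoint, headers=headers, json=payload)
download_hash = response.json()['hash']  # keep this hash to poll the status and fetch the files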
- You can then check the progress by calling the route GET /analyze/{id}/download/{hash}/status. The status can be starting, running, failed, stopped, or completed.
Response
{
  "status": "completed"
}
- When the process completes, retrieve your download via the route GET /analyze/{id}/download/{hash}, which returns an array of URLs, each pointing to a gzipped Parquet file. These files remain accessible for one week. A sketch showing how to save them to disk follows the response example below.
Response
[
"https://files.textreveal.com/download/company=e8c8d3ba-4ca0-45d1-b4ba-c1b1f2364a12/instance=fabd78aa-5241-4842-8108-fd52ef805cde/download=03d8c58a31/output-0.parquet",
"https://files.textreveal.com/download/company=e8c8d3ba-4ca0-45d1-b4ba-c1b1f2364a12/instance=fabd78aa-5241-4842-8108-fd52ef805cde/download=03d8c58a31/output-1.parquet"
]
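If you prefer to keep the raw files on disk instead of reading them in memory, a sketch along these lines works as well; the local file names are an assumption, and urls stands for the array returned above.
import requests

# `urls` stands for the array of file URLs returned by GET /analyze/{id}/download/{hash}
urls = []  # fill with the URLs returned by the API

for i, url in enumerate(urls):
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(f'output-{i}.parquet', 'wb') as f:
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)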
Notes
- The instance must be completed; you can check its status using the /analyze/status route.
- Some fields (id, title, sentences, url, summary and thread) are only available if the total number of documents (after applying the limit) is below 2,000.
- When using the sort parameters, the "first" document will be in the output-0.parquet file and the "last" document will be in the output-N.parquet file (the name can vary, but a number suffix will be present and the files will be ordered in the array). Order is preserved throughout the files (see the sketch below).
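Because the order is preserved, you can read the files in the order given by the array and concatenate them into a single DataFrame. The sketch below assumes pandas and pyarrow are installed and that parquet_files holds the URL array returned by the API.
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
import requests

# `parquet_files` stands for the URL array returned by GET /analyze/{id}/download/{hash};
# the files are already ordered according to the `sort` parameter.
parquet_files = []  # fill with the URLs returned by the API

frames = [
    pq.read_table(pa.BufferReader(requests.get(url).content)).to_pandas()
    for url in parquet_files
]

# Concatenating in array order keeps the global sort order intact
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()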
Understanding the limit parameter
Limit as a number
- If you pass a number directly to the limit parameter (e.g., "limit": 100), you will receive up to that many documents in total, sorted by the rules specified in the sort field (for example, descending order on a specific score). An illustrative payload fragment is shown below.
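For instance, a payload using a plain number for limit could look like the following fragment (the field names reuse those from the request example above):
{
  "limit": 100,
  "sort": {
    "field": "document_polarity",
    "order": "DESC"
  },
  "fields": [
    "title"
  ]
}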
Limit by object (e.g., entity)
- You can also pass a dictionary with this structure:
"limit": {
"by": "entity",
"value": 3
}
- In this scenario, the system returns the top 3 documents per entity, based on the specified sort.
- Example: If your analysis contains 10 distinct entities and you request 3 documents per entity, you could receive up to 30 documents in total. We refer to this final total (30 in this example) as the computed limit.
Access to id, title, sentences, url and thread fields
- These fields are only available if the computed limit is below 2000.
- Example: If you request 3 documents per entity ("limit": {"by": "entity", "value": 3}) but have 1,000 entities, the computed limit is 3,000, which exceeds 2,000. Therefore, the id, title, and sentences fields will not be included.
- If you need these fields for more than 2,000 documents, you can make multiple calls (e.g., split queries, apply filters by date or entity) or adjust your request so that each call stays under the 2,000-document threshold (see the sketch below).
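As an illustration, one way to stay under the threshold is to start several smaller downloads filtered by date; the month boundaries, field list and placeholders below are assumptions, and each returned hash is then polled and fetched exactly as described above.
import requests

HOST = 'https://api.example.com'   # placeholder: your API host
TOKEN = 'YOUR_TOKEN'               # placeholder: your bearer token
INSTANCE_ID = 'INSTANCE_ID'        # placeholder: your analysis instance id

endpoint = f'{HOST}/api/2.0/analyze/{INSTANCE_ID}/download'
headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {TOKEN}'}

# Illustrative date ranges chosen so that each download stays below 2,000 documents
monthly_ranges = [
    ('2019-01-01', '2019-01-31'),
    ('2019-02-01', '2019-02-28'),
    ('2019-03-01', '2019-03-31'),
]

hashes = []
for start, end in monthly_ranges:
    payload = {
        'date': {'start': start, 'end': end},
        'fields': ['id', 'title', 'sentences'],
        'limit': 2000,
        'sort': {'field': 'document_polarity', 'order': 'DESC'},
    }
    response = requests.post(endpoint, headers=headers, json=payload)
    hashes.append(response.json()['hash'])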
Example: Downloading the top 10 documents per entity of an analysis
Here, assume that you started an analysis with 10 distinct entities using the POST /analyze/dataset or POST /analyze/tql routes. That instance is now in the completed state and its id is INSTANCE_ID.
Requirements
In this example we'll use python with the following libraries:
- pandas to compute and manipulate the downloaded data
- pyarrow to read the parquet files
- requests to send the requests
example.py
import pyarrow.parquet as pq
import pyarrow as pa
import requests
import time
import json
# Functions found in the section "Quick start" under "Getting started"
from connect_v2 import read_config, get_token
config = read_config()
host = config['api']['host']
token = get_token(config)
instance_id = 'INSTANCE_ID' # Replace with your instance id
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {token}'
}
# Start a download of the top 10 documents per entity
payload = json.dumps({
    "limit": {
        "by": "entity",
        "value": 10
    },
    "sort": {
        "field": "document_positive",
        "order": "DESC"
    },
    "fields": ["title", "entities", "mentions"]
})
endpoint = f'{host}/api/2.0/analyze/{instance_id}/download'
response = requests.post(endpoint, headers=headers, data=payload)
download_hash = response.json()['hash']
# Wait for the download to be completed
status_endpoint = f'{host}/api/2.0/analyze/{instance_id}/download/{download_hash}/status'
while True:
    status = requests.get(status_endpoint, headers=headers).json()
    if status['status'] == 'completed':
        break
    elif status['status'] in ('failed', 'stopped'):
        # Stop polling if the download cannot complete
        raise RuntimeError(f'Download ended with status: {status["status"]}')
    else:
        print(f'Download status: {status["status"]}')
        time.sleep(1)
# Download the results
download_endpoint = f'{host}/api/2.0/analyze/{instance_id}/download/{download_hash}'
response = requests.get(download_endpoint, headers=headers)
# Read the parquet files
parquet_files = response.json()
for parquet_file_url in parquet_files:
    with requests.get(parquet_file_url) as f:
        df = pq.read_table(pa.BufferReader(f.content)).to_pandas(maps_as_pydicts="strict")
        # Do something with the data
        print(df)
The result will contain up to 100 documents (10 entities * 10 documents per entity). In this example, 3 columns are returned: title, entities and mentions.
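If you want a single DataFrame rather than one per file, a small extension of the loop above is enough; the sketch below assumes you appended each df to a list named frames inside that loop.
import pandas as pd

frames = []  # inside the loop above: frames.append(df)

# Concatenating in file order keeps the sort order from the download
combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(len(combined))  # up to 100 rows: 10 entities * 10 documents per entity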