Get documents

The TextReveal® HTTP API exposes an endpoint named GET /documents that retrieves the text, title, and/or summary of multiple documents.

Usage with the download route

This route was initially designed to be chained with the download route. One possible use case is to use the download route's fields parameter to fetch the document ids and extract_date values, and then pass them to this route to retrieve the document content.

It is still possible to view the text directly from the sentences text field. However, the documents route preserves the original layout of the document, which makes it more readable.

This route allows users to retrieve the content of all documents except premium news. For premium news, users can still access all information except the text.

You can find an example of how to chain the two routes in the examples section.

Usage

BODY

{
  "documents": [
    {
      "extracted": "2022-12-30T22:59:57.502Z",
      "id": "c34ac671a1b0b80078f9acd7e80217e28e8c554e14e1de707fb4370e52299add"
    },
    {
      "id": "1454343030016",
      "extracted": "2022-12-31T23:59:47.000Z"
    }
  ],
  "fields": [
    "title",
    "text"
  ]
}

name	type	description
`documents`	List[Document]	List of documents to retrieve. Required. Maximum `500` documents.
`documents` > `id`	string	The document id, which can be found in TextReveal® Dashboards or in a download result. Required. Maximum length `64`.
`documents` > `extracted`	date	Extraction date of the document, in ISO 8601 format. If absent, the most recent document is used.
`fields`	List[string]	Fields to retrieve. Accepted values: `title`, `text`, `summary`. Default: `["title", "text"]`.
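
The limits in the table above can be checked on the client before sending a request. The following is a minimal sketch, not part of the TextReveal® API: the helper name `build_documents_payload` is hypothetical, and rejecting an empty `documents` list is an assumption based on the parameter being marked as required.

```python
MAX_DOCUMENTS = 500  # maximum number of documents, from the table above
MAX_ID_LENGTH = 64   # maximum id length, from the table above
ACCEPTED_FIELDS = {"title", "text", "summary"}


def build_documents_payload(documents, fields=("title", "text")):
    """Hypothetical helper: validate inputs and return the request body as a dict."""
    if not documents:
        # Assumption: an empty list is treated like a missing required parameter.
        raise ValueError("`documents` is required")
    if len(documents) > MAX_DOCUMENTS:
        raise ValueError(f"at most {MAX_DOCUMENTS} documents per request")

    unknown = set(fields) - ACCEPTED_FIELDS
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")

    body_documents = []
    for doc in documents:
        doc_id = doc.get("id")
        if not isinstance(doc_id, str) or not doc_id:
            raise ValueError("every document needs a non-empty string `id`")
        if len(doc_id) > MAX_ID_LENGTH:
            raise ValueError(f"`id` is longer than {MAX_ID_LENGTH} characters")
        entry = {"id": doc_id}
        # `extracted` is optional; when omitted the most recent document is used.
        if doc.get("extracted") is not None:
            entry["extracted"] = doc["extracted"]
        body_documents.append(entry)

    return {"documents": body_documents, "fields": list(fields)}


payload = build_documents_payload(
    [
        {"id": "1454343030016", "extracted": "2022-12-31T23:59:47.000Z"},
        {"id": "c34ac671a1b0b80078f9acd7e80217e28e8c554e14e1de707fb4370e52299add"},
    ],
    fields=["summary"],
)
```

The returned dictionary can be serialized with `json.dumps` and sent as the request body.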

Response

RESPONSE

[
  {
    "id": "c34ac671a1b0b80078f9acd7e80217e28e8c554e14e1de707fb4370e52299add",
    "extracted": "2022-12-30T22:59:57.502Z",
    "title": "title",
    "text": "..."
  },
  {
    "error": {
      "statusCode": 403,
      "message": "Forbidden – Access to the data is denied",
      "field": {
        "text": {
          "statusCode": 403,
          "message": "The download of licensed text is not allowed."
        }
      }
    },
    "id": "1454343030016",
    "extracted": "2022-12-31T23:59:47.000Z",
    "title": "title"
  }
]

The response is not guaranteed to be in the same order as the input.
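
Because the order is not guaranteed and some items may carry a per-field error, a client will usually want to re-align the response with its input and inspect the `error` object. The sketch below shows one way to do that. The function names are hypothetical, and matching on `id` alone assumes each id appears only once per request.

```python
def align_with_input(requested, response_items):
    """Return response items in the order of `requested`, matched on document id.

    Assumption: each id appears at most once in the request. Items missing
    from the response are returned as None.
    """
    by_id = {item["id"]: item for item in response_items}
    return [by_id.get(doc["id"]) for doc in requested]


def field_errors(item):
    """Return {field_name: message} for fields the API refused, or {} if none."""
    error = item.get("error") or {}
    fields = error.get("field") or {}
    return {name: detail.get("message") for name, detail in fields.items()}


# Sample data shaped like the response above, deliberately out of order.
requested = [{"id": "doc-a"}, {"id": "doc-b"}]
response_items = [
    {
        "id": "doc-b",
        "title": "title",
        "error": {
            "statusCode": 403,
            "message": "Forbidden",
            "field": {
                "text": {
                    "statusCode": 403,
                    "message": "The download of licensed text is not allowed.",
                }
            },
        },
    },
    {"id": "doc-a", "title": "title", "text": "..."},
]

for item in align_with_input(requested, response_items):
    refused = field_errors(item)
    if refused:
        print(item["id"], "is missing fields:", ", ".join(refused))
```

Fields that were not refused, such as `title` in the second item of the response above, remain available on the item even when an `error` object is present.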

Examples

With the POST /analyze/download endpoint

In this example, we'll use the POST /analyze/download endpoint to retrieve the top 500 documents of our analysis, and then use the GET /documents endpoint to retrieve a summary for each of them.

example.py

import json
import requests
# Functions found in the section "Quick start" under "Getting started"
from connect_v2 import read_config, get_token
 
config = read_config()
host = config['api']['host']
token = get_token(config)
 
instance_id = 'INSTANCE_ID' # Replace with your instance id
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {token}'
}
 
# Download the top 500 documents
payload = json.dumps({
    'instance': instance_id,
    'limit': 500, # The documents route is limited to 500 documents
    'sort': {
        'field': 'document_positive',
        'order': 'DESC'
    },
    'fields': ['title', 'extract_date'] # id is automatically added
})
 
endpoint = f'{host}/api/2.0/analyze/download'
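response = requests.post(endpoint, headers=headers, data=payload)
response.raise_for_status()  # Stop early if the download failed
# The download result contains one JSON object per line; skip blank lines
lines = [json.loads(line) for line in response.text.splitlines() if line.strip()]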
 
# Now we can use the id and extract_date to generate a summary
payload = json.dumps(
    {
        "fields": ["summary"],
        "documents": [
            {"id": line["id"], "extracted": line["extract_date"]} for line in lines
        ],
    }
)
 
endpoint = f"{host}/api/2.0/documents"
response = requests.post(endpoint, headers=headers, data=payload)
documents_with_summaries = response.json()
 
# Now join the documents with their summaries
documents = {}
for line in lines:
    documents[line["id"]] = line
for document in documents_with_summaries:
    documents[document["id"]]["summary"] = document.get("summary")
 
# We now have a dictionary with the documents and their summaries
print(json.dumps(documents, indent=2))
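
When a download returns more than 500 documents, the list has to be split before calling the documents route, since each request accepts at most 500 documents. The following is a minimal sketch of such batching. The `chunked` helper is hypothetical and not part of the API.

```python
def chunked(items, size=500):
    """Yield consecutive slices of `items` containing at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder ids standing in for the result of a larger download.
all_documents = [{"id": str(n)} for n in range(1200)]

batches = list(chunked(all_documents))
print([len(batch) for batch in batches])  # prints [500, 500, 200]
```

Each batch can then be sent as the `documents` list of a separate request, and the responses merged by `id` as in the example above.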