Get documents
The TextReveal® HTTP API exposes an endpoint named GET /documents that retrieves the text, title, and/or a text summary of multiple documents.
Usage with the download route
This route was initially designed to be chained with the download route.
One possible use case is to use the fields parameter of the download route to fetch the document ids and extract_date values, and then pass them to this route to retrieve their content.
It is still possible to view the text directly from the sentences text field. However, the documents route preserves the initial layout of the document, making it more readable.
This route allows users to retrieve the content of all documents except premium news. For premium news, users can still access all information except the text.
You can find an example of how to chain the two routes in the examples section.
Usage
{
  "documents": [
    {
      "extracted": "2022-12-30T22:59:57.502Z",
      "id": "c34ac671a1b0b80078f9acd7e80217e28e8c554e14e1de707fb4370e52299add"
    },
    {
      "id": "1454343030016",
      "extracted": "2022-12-31T23:59:47.000Z"
    }
  ],
  "fields": [
    "title",
    "text"
  ]
}
name | type | description |
---|---|---|
documents | List[Document] | List of documents to retrieve. |
documents > id | string | The document id. It can be found in TextReveal® Dashboards or in a download result. |
documents > extracted | date | Extraction date of the document, in ISO 8601 format. If absent, the most recent document is used. |
fields | List[string] | Fields to retrieve, such as title, text, or summary. |
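As a minimal sketch of how this request body could be assembled in Python, the helpers below build the `documents` list from ids and optional extraction dates. `to_iso8601` and `build_documents_payload` are hypothetical names, not part of the API or of any client library. The millisecond UTC timestamp format is an assumption taken from the example body above.

```python
import json
from datetime import datetime, timezone


def to_iso8601(dt):
    """Format a timezone-aware datetime like 2022-12-30T22:59:57.502Z."""
    utc = dt.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def build_documents_payload(documents, fields):
    """Build the request body from (id, extracted-or-None) pairs.

    Hypothetical helper: `extracted` is omitted when no date is given,
    which per the table above selects the most recent document.
    """
    entries = []
    for doc_id, extracted in documents:
        entry = {"id": doc_id}
        if extracted is not None:
            entry["extracted"] = to_iso8601(extracted)
        entries.append(entry)
    return {"documents": entries, "fields": list(fields)}


payload = build_documents_payload(
    [
        (
            "c34ac671a1b0b80078f9acd7e80217e28e8c554e14e1de707fb4370e52299add",
            datetime(2022, 12, 30, 22, 59, 57, 502000, tzinfo=timezone.utc),
        ),
        ("1454343030016", None),  # no date: most recent version is used
    ],
    fields=["title", "text"],
)
print(json.dumps(payload, indent=2))
```

Omitting the key entirely, rather than sending a null value, mirrors the "if absent" wording of the parameter description.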
Response
[
  {
    "id": "c34ac671a1b0b80078f9acd7e80217e28e8c554e14e1de707fb4370e52299add",
    "extracted": "2022-12-30T22:59:57.502Z",
    "title": "title",
    "text": "..."
  },
  {
    "error": {
      "statusCode": 403,
      "message": "Forbidden – Access to the data is denied",
      "field": {
        "text": {
          "statusCode": 403,
          "message": "The download of licensed text is not allowed."
        }
      }
    },
    "id": "1454343030016",
    "extracted": "2022-12-31T23:59:47.000Z",
    "title": "title"
  }
]
The response is not guaranteed to be in the same order as the input.
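Because the order is not guaranteed and an entry can carry an `error` object alongside the fields that were returned, one way to consume the response is to index it by id and collect the field-level errors separately. The sketch below uses a hypothetical helper, `index_response`, and sample data copied from the response above. Keying by id alone assumes each id appears only once in a request.

```python
def index_response(entries):
    """Index response entries by document id and collect field-level errors."""
    by_id = {}
    field_errors = {}
    for entry in entries:
        doc_id = entry["id"]
        by_id[doc_id] = entry
        error = entry.get("error")
        if error:
            # "field" maps each unavailable field name to its own error object
            field_errors[doc_id] = error.get("field", {})
    return by_id, field_errors


# Sample data copied from the response shown above
response_body = [
    {
        "id": "c34ac671a1b0b80078f9acd7e80217e28e8c554e14e1de707fb4370e52299add",
        "extracted": "2022-12-30T22:59:57.502Z",
        "title": "title",
        "text": "...",
    },
    {
        "error": {
            "statusCode": 403,
            "message": "Forbidden – Access to the data is denied",
            "field": {
                "text": {
                    "statusCode": 403,
                    "message": "The download of licensed text is not allowed.",
                }
            },
        },
        "id": "1454343030016",
        "extracted": "2022-12-31T23:59:47.000Z",
        "title": "title",
    },
]

by_id, field_errors = index_response(response_body)
for doc_id, entry in by_id.items():
    if "text" in field_errors.get(doc_id, {}):
        print(doc_id, "text unavailable:", field_errors[doc_id]["text"]["message"])
    else:
        print(doc_id, "text available")
```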
Examples
With the POST /analyze/download endpoint
In this example, we'll use the POST /analyze/download endpoint to retrieve the top 500 documents of our analysis.
Then we'll use the GET /documents endpoint to retrieve a summary for each of them.
import json
import requests

# Functions found in the section "Quick start" under "Getting started"
from connect_v2 import read_config, get_token

config = read_config()
host = config['api']['host']
token = get_token(config)
instance_id = 'INSTANCE_ID'  # Replace with your instance id

headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {token}'
}

# Download the top 500 documents
payload = json.dumps({
    'instance': instance_id,
    'limit': 500,  # The documents route is limited to 500 documents
    'sort': {
        'field': 'document_positive',
        'order': 'DESC'
    },
    'fields': ['title', 'extract_date']  # id is automatically added
})
endpoint = f'{host}/api/2.0/analyze/download'
response = requests.post(endpoint, headers=headers, data=payload)
lines = [json.loads(line) for line in response.text.splitlines()]

# Now we can use the id and extract_date to generate a summary
payload = json.dumps(
    {
        "fields": ["summary"],
        "documents": list(
            map(lambda x: {"id": x["id"], "extracted": x["extract_date"]}, lines)
        ),
    }
)
endpoint = f"{host}/api/2.0/documents"
response = requests.post(endpoint, headers=headers, data=payload)
documents_with_summaries = response.json()

# Now join the documents with their summaries
documents = {}
for line in lines:
    documents[line["id"]] = line
for document in documents_with_summaries:
    documents[document["id"]]["summary"] = document.get("summary")

# We now have a dictionary with the documents and their summaries
print(json.dumps(documents, indent=2))
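The comment in the example above notes that the documents route is limited to 500 documents per request. When more documents are needed, one option is to split the list into batches and send one request per batch. The sketch below is an illustration under that assumption: `chunked`, `fetch_in_batches`, and `fake_send` are hypothetical names, and `fake_send` stands in for the real HTTP call so that the example runs offline.

```python
MAX_DOCUMENTS_PER_REQUEST = 500  # limit taken from the comment in the example above


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(documents, fields, send, batch_size=MAX_DOCUMENTS_PER_REQUEST):
    """Call `send` once per batch and concatenate the returned entries.

    `send` is any callable that takes a request body (dict) and returns
    the list of response entries for that body.
    """
    results = []
    for batch in chunked(documents, batch_size):
        results.extend(send({"documents": batch, "fields": fields}))
    return results


def fake_send(payload):
    """Offline stand-in for the HTTP call to the documents route."""
    return [{"id": doc["id"], "summary": None} for doc in payload["documents"]]


docs = [{"id": str(i)} for i in range(1200)]
results = fetch_in_batches(docs, ["summary"], fake_send)
print(len(results))
```

In real use, `send` would wrap the request to the documents route shown in the example and return `response.json()`.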