Summarize and translate documents

How to retrieve, summarize and translate documents using the TextReveal API.

Many API routes return only a document ID, when what you actually want is a summary of the document and its title in your language.
This guide shows how to get them.

Basics

You'll find document IDs in many routes. For example, in Entities SDG Documents the document ID field is document_id, while in the Download routes it is simply id.

To get the summary, title and text of a document, we'll use the Get Documents route. For translation, we'll use the Translate Batch route.
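As a quick orientation, the request bodies for the two routes can be sketched as below. The payload shapes mirror the ones built in the code example in this guide; the helper names `build_documents_payload` and `build_translation_payload` are hypothetical and exist only for illustration.

```python
import json


def build_documents_payload(document_ids: list[str], fields: list[str]) -> dict:
    """Body for the Get Documents route: requested fields plus one {"id": ...} per document."""
    return {
        "fields": fields,
        "documents": [{"id": document_id} for document_id in document_ids],
    }


def build_translation_payload(
    document_ids: list[str], fields: list[str], language: str
) -> dict:
    """Body for the Translate Batch route: same shape, with a target language added."""
    payload = build_documents_payload(document_ids, fields)
    payload["language"] = language
    return payload


# "DOCUMENT_ID" is a placeholder, not a real identifier.
print(json.dumps(build_documents_payload(["DOCUMENT_ID"], ["title", "summary"]), indent=2))
print(json.dumps(build_translation_payload(["DOCUMENT_ID"], ["title"], "french"), indent=2))
```

Nothing here calls the API; it only shows what gets sent.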

Code example

Generic method on how to extend documents
from copy import deepcopy
from collections import defaultdict
import requests
import json
 
# Documentation on how to get a token: https://docs.textreveal.com/guide/authentication
token = "..."
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {token}",
}
host = "https://api.textreveal.com"
 
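def post(endpoint, data):
    response = requests.post(
        f"{host}{endpoint}", headers=headers, data=json.dumps(data)
    )
    # Fail early on HTTP errors instead of parsing an error body as documents
    response.raise_for_status()
    return response.json()


def get(endpoint):
    response = requests.get(f"{host}{endpoint}", headers=headers)
    # Fail early on HTTP errors instead of parsing an error body as documents
    response.raise_for_status()
    return response.json()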
 
 
def extend_documents(
    *,
    documents: list[dict],
    fields: list[str],
    language: str,
    document_id_field: str = "id",
) -> list[dict]:
    """
    Fetch the title, summary and/or text of documents using the TextReveal API.
    Note: Does not mutate the documents list.

    Args:
        documents (list[dict]): List of documents to extend.
        fields (list[str]): List of fields to extend. Valid fields are title, summary, text, translated_title, translated_text.
        language (str): Language to use for translation. See https://docs.textreveal.com/guide/languages#translation
        document_id_field (str): Name of the field that holds the document id. Defaults to "id".
 
    Returns:
        list[dict]: List of documents with extended fields.
    """
    documents_endpoint = "/api/2.0/documents"
    translation_endpoint = "/api/2.0/documents/translate/batch"
 
    assert all(
        x in ["title", "summary", "text", "translated_title", "translated_text"]
        for x in fields
    ), "Invalid field. Valid fields are title, summary, text, translated_title, translated_text"
 
    # TODO: split in two loops. One with batch size of 25 and one with 500. It will be a bit faster.
    batch_size = 25  # Translate route can accept up to 25 documents at a time
    if not any(x.startswith("translated_") for x in fields):
        batch_size = 500  # Documents route can accept up to 500 documents at a time
 
    documents_by_id_dict = defaultdict(list)
    for document in documents:
        documents_by_id_dict[document[document_id_field]].append(deepcopy(document))
 
    for i in range(0, len(documents), batch_size):
        batch = documents[i : i + batch_size]
 
        # Add summary/text/title
        payload = {
            "fields": [x for x in fields if x in ["summary", "text", "title"]],
            "documents": list(
                map(
                    lambda x: {
                        "id": x[document_id_field],
                    },
                    batch,
                )
            ),
        }
        if len(payload["fields"]) > 0:
            response = post(documents_endpoint, data=payload)
            for document in response:
                document_id = document["id"]
                for doc in documents_by_id_dict.get(document_id, []):
                    if "summary" in fields:
                        doc["summary"] = document.get("summary")
                    if "text" in fields:
                        doc["text"] = document.get("text")
                    if "title" in fields:
                        doc["title"] = document.get("title")
 
        # Add translated text/title
        payload = {
            "fields": [
                x.replace("translated_", "")
                for x in fields
                if x.startswith("translated_")
            ],
            "language": language,
            "documents": list(
                map(
                    lambda x: {
                        "id": x[document_id_field],
                    },
                    batch,
                )
            ),
        }
        if len(payload["fields"]) > 0:
            response = post(translation_endpoint, data=payload)
            for document in response:
                document_id = document["id"]
                for doc in documents_by_id_dict.get(document_id, []):
                    # Check the translated_* names: these are what the caller requested
                    if "translated_title" in fields:
                        doc["translated_title"] = document.get("title")
                    if "translated_text" in fields:
                        doc["translated_text"] = document.get("text")
 
    extended_documents = [
        document
        for document_list in documents_by_id_dict.values()
        for document in document_list
    ]
    return extended_documents
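
The TODO in extend_documents suggests using a separate batch size for each route (500 for the Documents route, 25 for the Translate route, as stated in the code comments). A minimal sketch of the two building blocks this would need is shown below. The helper names `chunked` and `split_fields` are hypothetical, and the sketch makes no API calls.

```python
from typing import Iterator


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` holding at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def split_fields(fields: list[str]) -> tuple[list[str], list[str]]:
    """Separate plain fields from translated ones, stripping the translated_ prefix."""
    prefix = "translated_"
    plain = [f for f in fields if not f.startswith(prefix)]
    translated = [f[len(prefix):] for f in fields if f.startswith(prefix)]
    return plain, translated


# Placeholder ids, for illustration only.
document_ids = [f"doc-{n}" for n in range(60)]
plain_fields, translated_fields = split_fields(["summary", "translated_title"])

# One pass per route, each with its own batch size.
documents_batches = list(chunked(document_ids, 500))
translation_batches = list(chunked(document_ids, 25))
print(plain_fields, translated_fields)
print(len(documents_batches), len(translation_batches))
```

With 60 documents, the Documents route would need a single request and the Translate route three, instead of three requests to each.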
 

How to use this code

This generic method works with any route. The example below uses it with the Universes ESG Documents route.

The parameters are:

  • documents: List of documents, each document has to have a field containing the document id.
  • document_id_field: The field containing the document id. (example: id or document_id)
  • fields: List of fields to extend. Valid fields are title, summary, text, translated_title, translated_text.
  • language: Language to use for translation. See the full list: https://docs.textreveal.com/guide/languages#translation

Example using ESG documents
# 1. Get 10 documents (you can apply filters or use pagination to get more documents)
universe_id = "UNIVERSE_ID"
documents = get(f"/v3/universes/{universe_id}/esg/documents").get("data")
 
# 2. Extend the documents with the summary and the translated title
language = "french"
fields = [
    "summary",
    "translated_title",
]  # ["title", "summary", "text", "translated_title", "translated_text"]
 
extended_documents = extend_documents(
    documents=documents,
    document_id_field="document_id",
    fields=fields,
    language=language,
)
 
# 3. Print the extended documents
print(json.dumps(extended_documents, indent=2))
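
Because extend_documents fills each field with `.get(...)`, a requested field may be None when the API returns no value for it. The sketch below shows one way to read the extended documents defensively. The `describe` helper and the sample data are made up for illustration and are not actual API output.

```python
def describe(document: dict) -> str:
    """Build a one-line label, preferring the translated title when present."""
    title = document.get("translated_title") or document.get("title") or "(no title)"
    summary = document.get("summary") or "(no summary)"
    return f"{title}: {summary}"


# Made-up documents shaped like the output of extend_documents.
sample_documents = [
    {"document_id": "a", "translated_title": "Titre", "summary": "Short summary."},
    {"document_id": "b", "translated_title": None, "summary": None},
]

for document in sample_documents:
    print(describe(document))
```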