High Volume
The Receptiviti API has limits on bundle requests, so the receptiviti.request() function splits texts into acceptable bundles, which are spread across multiple requests. This means the only remaining limitation on the number of texts that can be processed is the memory of the system sending the requests.
The basic way to work around this limitation is to fully process smaller chunks of text at a time. There are a few ways to avoid loading all texts and results into memory at once.
Cache as Output
Setting the collect_results argument to False avoids retaining all batch results in memory as they are received, but it also means results are not returned, so they have to be collected from the cache.
If texts are also too big to load into memory, they can be loaded from files at request time. By default, when multiple files are pointed to as text, the actual texts are only loaded when they are being sent for scoring, which means only bundle_size * cores texts are loaded at a time.
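For example, you could control how many texts are in memory at once by setting those values explicitly. This is a minimal sketch with placeholder paths, assuming bundle_size and cores can be passed directly to receptiviti.request() alongside the arguments used below:

import receptiviti

# a sketch with hypothetical values: at most 1000 * 2 = 2000 texts
# should be held in memory at any one time
receptiviti.request(
    directory="my_texts",    # placeholder: a directory of .txt files
    bundle_size=1000,        # assumed argument: texts per bundle
    cores=2,                 # parallel processes making requests
    collect_results=False,   # write results only to the cache
    cache="my_results",      # placeholder: cache (Parquet dataset) location
)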
We can start by writing some small text examples to files:
from os import makedirs

base_dir = "../../../"
text_dir = base_dir + "test_texts"
makedirs(text_dir, exist_ok=True)

# write 10 small example files
for i in range(10):
    with open(f"{text_dir}/example_{i}.txt", "w", encoding="utf-8") as file:
        file.write(f"An example text {i}.")
We can then load these files and their results with minimal memory use by saving results to a Parquet dataset (the cache). Disabling the request_cache also avoids storing a copy of the raw results.
import receptiviti

db_dir = base_dir + "test_results"
makedirs(db_dir, exist_ok=True)

# send texts for scoring, writing results only to the cache
receptiviti.request(
    directory=text_dir, collect_results=False, cache=db_dir, request_cache=False
)
Results are now available in the cache directory, which you can load by calling the request function again:
# make_request=False ensures no new requests are made if results are not found in the cache
results = receptiviti.request(directory=text_dir, cache=db_dir, make_request=False)
results.iloc[:, 0:3]
|   | id | text_hash | summary.word_count |
|---|---|---|---|
| 0 | ../../../test_texts\example_0.txt | b88ba15b5436224a66b02f13b11b13db | 4 |
| 1 | ../../../test_texts\example_1.txt | 4ab51432a48a54d4fd780d226d0c3f1c | 4 |
| 2 | ../../../test_texts\example_2.txt | 786744d45ec2d0e302394dc2ececa004 | 4 |
| 3 | ../../../test_texts\example_3.txt | bb2ced7fa2cf67ec2aa974bac3cfb609 | 4 |
| 4 | ../../../test_texts\example_4.txt | 507616854eb89e3d12e49001fe4333c8 | 4 |
| 5 | ../../../test_texts\example_5.txt | b80b47a8f2def76f4c256108053dda50 | 4 |
| 6 | ../../../test_texts\example_6.txt | 3c3d708014a7c25ee468afbd32843dab | 4 |
| 7 | ../../../test_texts\example_7.txt | fbc965c3d7b8cb1f487789bb9a313efa | 4 |
| 8 | ../../../test_texts\example_8.txt | 11fa50aa3392a971cb0dc098d8efc5a4 | 4 |
| 9 | ../../../test_texts\example_9.txt | 72c5453f604e9493e3d7619542272068 | 4 |
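Since the cache is a Parquet dataset, you could also read it directly instead of going back through the request function. This is a sketch that assumes pyarrow is installed and that the dataset uses hive-style partitioning (an internal detail that may differ):

import pyarrow.dataset as ds

# read the cache written above as a Parquet dataset
cached = ds.dataset(db_dir, format="parquet", partitioning="hive").to_table().to_pandas()
cached.iloc[:, 0:3]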
Manual Chunking
A more flexible approach would be to process smaller chunks of text normally, and handle loading and storing results yourself.
In this case, it may be best to disable parallelization, and explicitly disable the primary cache (in case it's specified in an environment variable).
import os

res_dir = base_dir + "text_results_manual"
makedirs(res_dir, exist_ok=True)

# using the same files as before
files = [f"{text_dir}/{file}" for file in os.listdir(text_dir)]

# process 5 files at a time, saving each chunk of results to its own file
for i in range(0, len(files), 5):
    file_subset = files[i : i + 5]
    results = receptiviti.request(
        files=file_subset, ids=file_subset, cores=1, cache=False, request_cache=False
    )
    results.to_csv(f"{res_dir}/files_{i}-{i + 5}.csv.xz", index=False)
Now results will be stored in smaller files:
from pandas import read_csv
read_csv(f"{res_dir}/files_0-5.csv.xz").iloc[:, 0:3]
|   | id | text_hash | summary.word_count |
|---|---|---|---|
| 0 | ../../../test_texts/example_0.txt | b88ba15b5436224a66b02f13b11b13db | 4 |
| 1 | ../../../test_texts/example_1.txt | 4ab51432a48a54d4fd780d226d0c3f1c | 4 |
| 2 | ../../../test_texts/example_2.txt | 786744d45ec2d0e302394dc2ececa004 | 4 |
| 3 | ../../../test_texts/example_3.txt | bb2ced7fa2cf67ec2aa974bac3cfb609 | 4 |
| 4 | ../../../test_texts/example_4.txt | 507616854eb89e3d12e49001fe4333c8 | 4 |
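If you later want all of the results in a single frame, the chunked files can be recombined. This is a sketch assuming the res_dir layout produced above:

from glob import glob
from pandas import concat, read_csv

# combine the per-chunk result files into one DataFrame
all_results = concat(
    (read_csv(file) for file in sorted(glob(f"{res_dir}/files_*.csv.xz"))),
    ignore_index=True,
)
all_results.shape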