High Volume
The Receptiviti API has limits on bundle requests, so the receptiviti.request() function splits texts into acceptable bundles, which are spread across multiple requests. This means the only remaining limitation on the number of texts that can be processed is the memory of the system sending the requests.
The basic way to work around this limitation is to fully process smaller chunks of text at a time. There are a few ways to avoid loading all texts and results into memory at once.
Cache as Output
Setting the collect_results argument to False avoids retaining all batch results in memory as they are received, but it also means results are not returned, so they have to be collected from the cache.
If texts are also too big to load into memory, they can be loaded from files at request time. By default, when multiple files are pointed to as text, the actual texts are only loaded when they are being sent for scoring, which means only bundle_size * cores texts are loaded at a time.
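For example, you could control how many texts are in memory at once by setting those values explicitly. This is a minimal sketch with placeholder paths, assuming bundle_size and cores can be passed directly to receptiviti.request() alongside the arguments used below:

import receptiviti

# a sketch with hypothetical values: at most 1000 * 2 = 2000 texts
# should be held in memory at any one time
receptiviti.request(
    directory="my_texts",    # placeholder: a directory of .txt files
    bundle_size=1000,        # assumed argument: texts per bundle
    cores=2,                 # parallel processes making requests
    collect_results=False,   # write results only to the cache
    cache="my_results",      # placeholder: cache (Parquet dataset) location
)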
We can start by writing some small text examples to files:
from os import makedirs

base_dir = "../../../"
text_dir = base_dir + "test_texts"
makedirs(text_dir, exist_ok=True)

# write 10 small example files
for i in range(10):
    with open(f"{text_dir}/example_{i}.txt", "w", encoding="utf-8") as file:
        file.write(f"An example text {i}.")
We can then load these files and their results with minimal memory use by saving results to a Parquet dataset (the cache). Disabling the request_cache also avoids storing a copy of the raw results.
import receptiviti

db_dir = base_dir + "test_results"
makedirs(db_dir, exist_ok=True)

# send texts for scoring, writing results only to the cache
receptiviti.request(
    directory=text_dir, collect_results=False, cache=db_dir, request_cache=False
)
Results are now available in the cache directory, which you can load by calling the request function again:
# make_request=False ensures no new requests are made if results are not found in the cache
results = receptiviti.request(directory=text_dir, cache=db_dir, make_request=False)
results.iloc[:, 0:3]
|   | id | text_hash | summary.word_count |
|---|---|---|---|
| 0 | ../../../test_texts\example_0.txt | b88ba15b5436224a66b02f13b11b13db | 4 |
| 1 | ../../../test_texts\example_1.txt | 4ab51432a48a54d4fd780d226d0c3f1c | 4 |
| 2 | ../../../test_texts\example_2.txt | 786744d45ec2d0e302394dc2ececa004 | 4 |
| 3 | ../../../test_texts\example_3.txt | bb2ced7fa2cf67ec2aa974bac3cfb609 | 4 |
| 4 | ../../../test_texts\example_4.txt | 507616854eb89e3d12e49001fe4333c8 | 4 |
| 5 | ../../../test_texts\example_5.txt | b80b47a8f2def76f4c256108053dda50 | 4 |
| 6 | ../../../test_texts\example_6.txt | 3c3d708014a7c25ee468afbd32843dab | 4 |
| 7 | ../../../test_texts\example_7.txt | fbc965c3d7b8cb1f487789bb9a313efa | 4 |
| 8 | ../../../test_texts\example_8.txt | 11fa50aa3392a971cb0dc098d8efc5a4 | 4 |
| 9 | ../../../test_texts\example_9.txt | 72c5453f604e9493e3d7619542272068 | 4 |
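Since the cache is a Parquet dataset, you could also read it directly instead of going back through the request function. This is a sketch that assumes pyarrow is installed and that the dataset uses hive-style partitioning (an internal detail that may differ):

import pyarrow.dataset as ds

# read the cache written above as a Parquet dataset
cached = ds.dataset(db_dir, format="parquet", partitioning="hive").to_table().to_pandas()
cached.iloc[:, 0:3]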
Manual Chunking
A more flexible approach would be to process smaller chunks of text normally, and handle loading and storing results yourself.
In this case, it may be best to disable parallelization, and explicitly disable the primary cache (in case it's specified in an environment variable).
import os

res_dir = base_dir + "text_results_manual"
makedirs(res_dir, exist_ok=True)

# using the same files as before
files = [f"{text_dir}/{file}" for file in os.listdir(text_dir)]

# process 5 files at a time, saving each chunk of results to its own file
for i in range(0, len(files), 5):
    file_subset = files[i : i + 5]
    results = receptiviti.request(
        files=file_subset, ids=file_subset, cores=1, cache=False, request_cache=False
    )
    results.to_csv(f"{res_dir}/files_{i}-{i + 5}.csv.xz", index=False)
Now results will be stored in smaller files:
from pandas import read_csv
read_csv(f"{res_dir}/files_0-5.csv.xz").iloc[:, 0:3]
|   | id | text_hash | summary.word_count |
|---|---|---|---|
| 0 | ../../../test_texts/example_0.txt | b88ba15b5436224a66b02f13b11b13db | 4 |
| 1 | ../../../test_texts/example_1.txt | 4ab51432a48a54d4fd780d226d0c3f1c | 4 |
| 2 | ../../../test_texts/example_2.txt | 786744d45ec2d0e302394dc2ececa004 | 4 |
| 3 | ../../../test_texts/example_3.txt | bb2ced7fa2cf67ec2aa974bac3cfb609 | 4 |
| 4 | ../../../test_texts/example_4.txt | 507616854eb89e3d12e49001fe4333c8 | 4 |
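If you later want all of the results in a single frame, the chunked files can be recombined. This is a sketch assuming the res_dir layout produced above:

from glob import glob
from pandas import concat, read_csv

# combine the per-chunk result files into one DataFrame
all_results = concat(
    (read_csv(file) for file in sorted(glob(f"{res_dir}/files_*.csv.xz"))),
    ignore_index=True,
)
all_results.shape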