Request
Make requests to the API.
request(
    text=None, output=None, ids=None, text_column=None, id_column=None,
    files=None, directory=None, file_type='txt', encoding=None,
    return_text=False, context='written', custom_context=False,
    api_args=None, frameworks=None, framework_prefix=None,
    bundle_size=1000, bundle_byte_limit=7500000.0, collapse_lines=False,
    retry_limit=50, clear_cache=False, request_cache=True, cores=1,
    collect_results=True, in_memory=None, verbose=False,
    progress_bar=os.getenv('RECEPTIVITI_PB', 'True'), overwrite=False,
    make_request=True, text_as_paths=False, dotenv=True,
    cache=os.getenv('RECEPTIVITI_CACHE', ''), cache_degragment=True,
    cache_overwrite=False,
    cache_format=os.getenv('RECEPTIVITI_CACHE_FORMAT', ''),
    key=os.getenv('RECEPTIVITI_KEY', ''),
    secret=os.getenv('RECEPTIVITI_SECRET', ''),
    url=os.getenv('RECEPTIVITI_URL', ''),
    version=os.getenv('RECEPTIVITI_VERSION', ''),
    endpoint=os.getenv('RECEPTIVITI_ENDPOINT', ''),
)
Send texts to be scored by the API.
PARAMETER | DESCRIPTION |
---|---|
text | Text to be processed, as a string or vector of strings containing the text itself, or the path to a file from which to read in text. If a DataFrame, `text_column` indicates which column contains text. TYPE: `str`, `list[str]`, or `DataFrame` |
output | Path to a file to write results to. TYPE: `str` |
ids | Vector of IDs for each `text`, to be included in the results. TYPE: `str` or `list` |
text_column | Column name in `text` containing text, if `text` is a DataFrame or the path to a CSV file. TYPE: `str` |
id_column | Column name in `text` containing IDs. TYPE: `str` |
files | Vector of file paths, as alternate entry to `text`. TYPE: `list[str]` |
directory | A directory path to search for files in, as alternate entry to `text`. TYPE: `str` |
file_type | Extension of the file(s) to be read in from a directory (`txt` or `csv`). TYPE: `str` |
encoding | Encoding of file(s) to be read in; one of the standard encodings. If this is `None`, the encoding will be detected, which can fail. TYPE: `str` |
return_text | If `True`, includes a column with the original text in the results. TYPE: `bool` |
context | Name of the analysis context. TYPE: `str` |
custom_context | Name of a custom context (as listed by the `norming` function). TYPE: `str` or `bool` |
api_args | Additional arguments to include in the request. TYPE: `dict` |
frameworks | One or more names of frameworks to request. Note that this changes the results from the API, so it will invalidate any cached results without the same set of frameworks. TYPE: `str` or `list[str]` |
framework_prefix | If `False`, removes the framework prefix from column names. TYPE: `bool` |
bundle_size | Maximum number of texts per bundle. TYPE: `int` |
bundle_byte_limit | Maximum byte size of each bundle. TYPE: `float` |
collapse_lines | If `True`, treats the lines of each file as a single text. TYPE: `bool` |
retry_limit | Number of times to retry a failed request. TYPE: `int` |
clear_cache | If `True`, clears the cache before processing. TYPE: `bool` |
request_cache | If `False`, does not temporarily cache each bundle request. TYPE: `bool` |
cores | Number of CPU cores to use when processing multiple bundles. TYPE: `int` |
collect_results | If `False`, does not retain bundle results in memory to be returned. TYPE: `bool` |
in_memory | If `False`, bundles are written to a temporary location rather than being held in memory; defaults to `False` when processing in parallel. TYPE: `bool` |
verbose | If `True`, prints status messages during processing. TYPE: `bool` |
progress_bar | If `True`, displays a progress bar during processing. TYPE: `str` or `bool` |
overwrite | If `True`, overwrites an existing `output` file. TYPE: `bool` |
text_as_paths | If `True`, explicitly treats `text` as paths to files. TYPE: `bool` |
dotenv | Path to a .env file to read environment variables from; by default, will look for a file in the current directory. TYPE: `bool` or `str` |
cache | Path to a cache directory, or `True` to use a default directory. TYPE: `str` or `bool` |
cache_degragment | If `False`, skips defragmenting the cache after writing new results to it. TYPE: `bool` |
cache_overwrite | If `True`, writes results to the cache without checking for existing cached results. TYPE: `bool` |
cache_format | File format of the cache, of available Arrow formats. TYPE: `str` |
key | Your API key (by default, read from the `RECEPTIVITI_KEY` environment variable). TYPE: `str` |
secret | Your API secret (by default, read from the `RECEPTIVITI_SECRET` environment variable). TYPE: `str` |
url | The URL of the API (by default, read from the `RECEPTIVITI_URL` environment variable). TYPE: `str` |
version | Version of the API (by default, read from the `RECEPTIVITI_VERSION` environment variable). TYPE: `str` |
endpoint | Endpoint of the API (by default, read from the `RECEPTIVITI_ENDPOINT` environment variable). TYPE: `str` |
RETURNS | DESCRIPTION |
---|---|
`DataFrame` or `None` | Scores associated with each input text. |
Examples:
# score a single text
single = receptiviti.request("a text to score")
# score multiple texts, and write results to a file
multi = receptiviti.request(["first text to score", "second text"], "filename.csv")
# score texts in separate files
## defaults to look for .txt files
file_results = receptiviti.request(directory = "./path/to/txt_folder")
## could be .csv
file_results = receptiviti.request(
directory = "./path/to/csv_folder",
text_column = "text", file_type = "csv"
)
# score texts in a single file
results = receptiviti.request("./path/to/file.csv", text_column = "text")
Request Process
This function (along with the internal `_manage_request`
function) handles texts and results in several steps:
- Prepare bundles (split `text` into <= `bundle_size` and <= `bundle_byte_limit` bundles).
  - If `text` points to a directory or list of files, these will be read in later.
  - If `in_memory` is `False`, bundles are written to a temporary location, and read back in when the request is made.
- Get scores for texts within each bundle.
  - If texts are paths, or `in_memory` is `False`, will load texts.
  - If `cache` is set, will skip any texts with cached scores.
  - If `request_cache` is `True`, will check for a cached request.
  - If any texts need scoring and `make_request` is `True`, will send unscored texts to the API.
- If a request was made and `request_cache` is set, will cache the response.
- If `cache` is set, will write bundle scores to the cache.
- After requests are made, if `cache` is set, will defragment the cache (combine bundle results within partitions).
- If `collect_results` is `True`, will prepare results:
  - Will realign results with `text` (and `ids` if provided).
  - If `output` is specified, will write realigned results to it.
  - Will drop additional columns (such as `custom` and `id` if not provided).
  - If `framework` is specified, will use it to select columns of the results.
  - Returns results.
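The bundle-preparation step can be sketched in plain Python. This is only an illustration of the splitting logic described above; the `make_bundles` helper is hypothetical, not part of the package:

```python
# Illustrative sketch of bundle preparation: split texts into bundles of
# at most `bundle_size` texts and roughly `bundle_byte_limit` bytes.
# This mirrors the described logic; it is not the package's actual code.

def make_bundles(texts, bundle_size=1000, bundle_byte_limit=7.5e6):
    bundles, current, current_bytes = [], [], 0
    for text in texts:
        size = len(text.encode("utf-8"))
        # start a new bundle if adding this text would exceed either limit
        if current and (
            len(current) >= bundle_size or current_bytes + size > bundle_byte_limit
        ):
            bundles.append(current)
            current, current_bytes = [], 0
        current.append(text)
        current_bytes += size
    if current:
        bundles.append(current)
    return bundles

bundles = make_bundles(["short", "texts"] * 3, bundle_size=4)
print([len(b) for b in bundles])  # → [4, 2]
```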
Cache
If `cache`
is specified, results for unique texts are saved in an Arrow database
in the cache location (`os.getenv("RECEPTIVITI_CACHE")`), and are retrieved with
subsequent requests. This ensures that the exact same texts are not re-sent to the API.
This does, however, add some processing time and disk space usage.
If `cache`
is `True`, a default directory (`receptiviti_cache`) will be
looked for in the system's temporary directory (`tempfile.gettempdir()`).
The primary cache is checked when each bundle is processed, and existing results are loaded at that time. When processing many bundles in parallel, and many results have been cached, this can cause the system to freeze and potentially crash. To avoid this, limit the number of cores, or disable parallel processing.
The `cache_format`
argument (or the `RECEPTIVITI_CACHE_FORMAT`
environment variable) can be
used to adjust the format of the cache.
You can use the cache independently with
`pyarrow.dataset.dataset(os.getenv("RECEPTIVITI_CACHE"))`.
You can also set the `clear_cache`
argument to `True`
to clear the cache before it is used
again, which may be useful if the cache has gotten big, or you know new results will be
returned.
Even if a cached result exists, it will be reprocessed if it does not have all of the variables of new results, but this depends on there being at least 1 uncached result. If, for instance, you add a framework to your account and want to reprocess a previously processed set of texts, you would need to first clear the cache.
Either way, duplicated texts within the same call will only be sent once.
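The effect of caching unique texts can be illustrated with a minimal stand-in. The dict-based cache and `score` function below are simplifications invented for illustration; the real cache is an Arrow dataset on disk:

```python
# Minimal stand-in for the score cache: unique texts are scored once, and
# cached scores are reused, so duplicates never trigger a second "API" call.
import hashlib

cache = {}
api_calls = 0

def score(text):
    """Return a (fake) score for a text, consulting the cache first."""
    global api_calls
    key = hashlib.md5(text.encode("utf-8")).hexdigest()
    if key not in cache:  # only unscored texts reach the "API"
        api_calls += 1
        cache[key] = {"summary.word_count": len(text.split())}
    return cache[key]

results = [score(t) for t in ["same text", "same text", "new text"]]
print(api_calls)  # → 2 (two unique texts, so two "API" calls)
```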
The `request_cache`
argument controls a more temporary cache of each bundle request. This
is cleared after a day. You might want to set this to `False`
if a new framework becomes
available on your account and you want to re-process a set of texts you recently processed.
Another temporary cache is made when `in_memory`
is `False`, which is the default when
processing in parallel (when there is more than 1 bundle and `cores`
is over 1). This is a
temporary directory that contains a file for each unique bundle, which is read in as needed
by the parallel workers.
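The idea behind this temporary bundle store can be sketched with the standard library; the file names and JSON format here are invented for illustration, not what the package actually writes:

```python
# Sketch of the in_memory=False behavior: each bundle is written to a
# temporary file and read back only when its request is being made,
# keeping peak memory low while parallel workers process bundles.
import json
import tempfile
from pathlib import Path

bundles = [["first text", "second"], ["third text"]]
tmp = Path(tempfile.mkdtemp())

# write each bundle out instead of holding all of them in memory
paths = []
for i, bundle in enumerate(bundles):
    path = tmp / f"bundle_{i}.json"
    path.write_text(json.dumps(bundle))
    paths.append(path)

# a worker reads a bundle back in just before sending its request
loaded = json.loads(paths[1].read_text())
print(loaded)  # → ['third text']
```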
Parallelization
`text`s are split into bundles based on the `bundle_size`
argument. Each bundle represents
a single request to the API, which is why they are limited to 1,000 texts and a total size
of 10 MB. When there is more than one bundle and `cores`
is greater than 1, bundles are
processed by multiple cores.
If you have texts spread across multiple files, they can be most efficiently processed in
parallel if each file contains a single text (potentially collapsed from multiple lines).
If files contain multiple texts (i.e., `collapse_lines=False`), then texts need to be
read in before bundling in order to ensure bundles are under the length limit.
If you are calling this function from a script, parallelization will involve rerunning
that script in each process, so anything you don't want rerun should be protected by
a check that `__name__`
equals `"__main__"`
(placed within an `if __name__ == "__main__":`
clause).
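A minimal illustration of such a guard, with a stand-in function in place of the actual `receptiviti.request` call (`fake_score` is invented for this sketch):

```python
# Guarding the entry point so that parallel workers (which re-import this
# script) do not rerun the request themselves. `fake_score` stands in for
# scoring a bundle via the API.
import multiprocessing as mp

def fake_score(bundle):
    # stand-in for sending a bundle to the API; returns per-text word counts
    return [len(text.split()) for text in bundle]

if __name__ == "__main__":
    bundles = [["one two", "three"], ["four five six"]]
    with mp.Pool(processes=2) as pool:
        results = pool.map(fake_score, bundles)
    print(results)  # → [[2, 1], [3]]
```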
Source code in src\receptiviti\request.py