Request
Make requests to the API.
request(text=None, output=None, ids=None, text_column=None, id_column=None, files=None, directory=None, file_type='txt', encoding=None, return_text=False, api_args=None, frameworks=None, framework_prefix=None, bundle_size=1000, bundle_byte_limit=7500000.0, collapse_lines=False, retry_limit=50, clear_cache=False, request_cache=True, cores=1, in_memory=None, verbose=False, progress_bar=os.getenv('RECEPTIVITI_PB', 'True'), overwrite=False, make_request=True, text_as_paths=False, dotenv=True, cache=os.getenv('RECEPTIVITI_CACHE', ''), cache_overwrite=False, cache_format=os.getenv('RECEPTIVITI_CACHE_FORMAT', ''), key=os.getenv('RECEPTIVITI_KEY', ''), secret=os.getenv('RECEPTIVITI_SECRET', ''), url=os.getenv('RECEPTIVITI_URL', ''), version=os.getenv('RECEPTIVITI_VERSION', ''), endpoint=os.getenv('RECEPTIVITI_ENDPOINT', ''))
Send texts to be scored by the API.
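As a minimal usage sketch (the texts shown are placeholders; credentials can be passed directly as below, or picked up from the `RECEPTIVITI_KEY` and `RECEPTIVITI_SECRET` environment variables, optionally via a .env file):

```python
import receptiviti

# Score a couple of texts; key and secret are placeholders here and can be
# omitted if the corresponding environment variables are set.
results = receptiviti.request(
    text=["I am sad.", "I am happy."],
    key="your_key",
    secret="your_secret",
    return_text=True,
)

# Results come back as a pandas DataFrame with one row per input text.
print(results.head())
```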
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`text` | `str \| list[str] \| DataFrame` | Text to be processed, as a string or vector of strings containing the text itself, or the path to a file from which to read in text. If a DataFrame, `text_column` is used to extract the text to be processed. | `None` |
`output` | `str` | Path to a file to write results to. | `None` |
`ids` | `str \| list[str \| int]` | Vector of IDs for each `text`. | `None` |
`text_column` | `str` | Column name in `text` containing the text to be processed. | `None` |
`id_column` | `str` | Column name in `text` containing IDs. | `None` |
`files` | `list[str]` | Vector of file paths, as alternate entry to `text`. | `None` |
`directory` | `str` | A directory path to search for files in, as alternate entry to `text`. | `None` |
`file_type` | `str` | Extension of the file(s) to be read in from a directory (`txt` or `csv`). | `'txt'` |
`encoding` | `str \| None` | Encoding of file(s) to be read in; one of the standard encodings. If this is `None`, encoding will be detected for each file, but detection can fail, so specifying the encoding is more reliable. | `None` |
`return_text` | `bool` | If `True`, includes the original text in the results. | `False` |
`api_args` | `dict` | Additional arguments to include in the request. | `None` |
`frameworks` | `str \| list` | One or more names of frameworks to return. | `None` |
`framework_prefix` | `bool` | If `False`, drops the framework prefix from column names. | `None` |
`bundle_size` | `int` | Maximum number of texts per bundle. | `1000` |
`bundle_byte_limit` | `float` | Maximum byte size of each bundle. | `7500000.0` |
`collapse_lines` | `bool` | If `True`, treats each file as a single text, collapsing multiple lines. | `False` |
`retry_limit` | `int` | Number of times to retry a failed request. | `50` |
`clear_cache` | `bool` | If `True`, clears the cache before it is used. | `False` |
`request_cache` | `bool` | If `False`, disables the temporary cache of raw bundle requests. | `True` |
`cores` | `int` | Number of CPU cores to use when processing multiple bundles. | `1` |
`in_memory` | `bool \| None` | If `False`, bundles are written to a temporary directory and read in by workers as needed; this is the default when processing in parallel. | `None` |
`verbose` | `bool` | If `True`, prints status messages. | `False` |
`progress_bar` | `str \| bool` | If `True`, displays a progress bar. | `getenv('RECEPTIVITI_PB', 'True')` |
`overwrite` | `bool` | If `True`, overwrites an existing `output` file. | `False` |
`text_as_paths` | `bool` | If `True`, treats `text` as a vector of file paths rather than text to be processed. | `False` |
`dotenv` | `bool \| str` | Path to a .env file to read environment variables from. By default, will look for a file in the current directory or `~/Documents`. | `True` |
`cache` | `bool \| str` | Path to a cache directory, or `True` to use the default directory. | `getenv('RECEPTIVITI_CACHE', '')` |
`cache_overwrite` | `bool` | If `True`, writes results to the cache without reading from it. | `False` |
`cache_format` | `str` | File format of the cache; one of the available Arrow formats. | `getenv('RECEPTIVITI_CACHE_FORMAT', '')` |
`key` | `str` | Your API key. | `getenv('RECEPTIVITI_KEY', '')` |
`secret` | `str` | Your API secret. | `getenv('RECEPTIVITI_SECRET', '')` |
`url` | `str` | The URL of the API; defaults to `https://api.receptiviti.com`. | `getenv('RECEPTIVITI_URL', '')` |
`version` | `str` | Version of the API; defaults to `v1`. | `getenv('RECEPTIVITI_VERSION', '')` |
`endpoint` | `str` | Endpoint of the API; defaults to `framework`. | `getenv('RECEPTIVITI_ENDPOINT', '')` |
Returns:
Type | Description |
---|---|
`DataFrame` | Scores associated with each input text. |
Cache
If `cache` is specified, results for unique texts are saved in an Arrow database in the cache location (`os.getenv("RECEPTIVITI_CACHE")`), and are retrieved with subsequent requests. This ensures that the exact same texts are not re-sent to the API. This does, however, add some processing time and disk space usage.
If `cache` is `True`, a default directory (`receptiviti_cache`) will be looked for in the system's temporary directory (`tempfile.gettempdir()`).
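As a sketch, caching results in an explicit directory (the path is a placeholder) rather than the default temporary location:

```python
import receptiviti

# Identical texts will be read back from this cache on later calls instead of
# being re-sent to the API.
results = receptiviti.request(
    text=["I am sad.", "I am happy."],
    cache="receptiviti_cache",  # placeholder directory path
)
```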
The primary cache is checked when each bundle is processed, and existing results are loaded at that time. When processing many bundles in parallel, and many results have been cached, this can cause the system to freeze and potentially crash. To avoid this, limit the number of cores, or disable parallel processing.
The `cache_format` argument (or the `RECEPTIVITI_CACHE_FORMAT` environment variable) can be used to adjust the format of the cache.
You can use the cache independently with `pyarrow.dataset.dataset(os.getenv("RECEPTIVITI_CACHE"))`.
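For example, a sketch of reading the cache directly, assuming the `RECEPTIVITI_CACHE` environment variable points at an existing cache directory (depending on `cache_format`, you may need to pass a `format` argument to `dataset`):

```python
import os

import pyarrow.dataset

# Open the Arrow dataset backing the cache and load it as a pandas DataFrame.
cache = pyarrow.dataset.dataset(os.getenv("RECEPTIVITI_CACHE"))
scores = cache.to_table().to_pandas()
print(scores.shape)
```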
You can also set the `clear_cache` argument to `True` to clear the cache before it is used again, which may be useful if the cache has gotten big, or you know new results will be returned.
Even if a cached result exists, it will be reprocessed if it does not have all of the variables of new results, but this depends on there being at least 1 uncached result. If, for instance, you add a framework to your account and want to reprocess a previously processed set of texts, you would need to first clear the cache.
Either way, duplicated texts within the same call will only be sent once.
The `request_cache` argument controls a more temporary cache of each bundle request. This is cleared after a day. You might want to set this to `False` if a new framework becomes available on your account and you want to re-process a set of texts you processed recently.
Another temporary cache is made when `in_memory` is `False`, which is the default when processing in parallel (when there is more than one bundle and `cores` is over 1). This is a temporary directory that contains a file for each unique bundle, which is read in as needed by the parallel workers.
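A sketch of overriding that default by keeping bundles in memory while still processing in parallel (the directory path is hypothetical):

```python
import receptiviti

# Keep bundles in memory rather than writing them to a temporary directory;
# this avoids extra disk I/O at the cost of higher memory use.
results = receptiviti.request(
    directory="texts_to_score",  # hypothetical directory of .txt files
    cores=4,
    in_memory=True,
)
```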
Parallelization
Texts are split into bundles based on the `bundle_size` argument. Each bundle represents a single request to the API, which is why they are limited to 1000 texts and a total size of 10 MB. When there is more than one bundle and `cores` is greater than 1, bundles are processed by multiple cores.
If you have texts spread across multiple files, they can be most efficiently processed in parallel if each file contains a single text (potentially collapsed from multiple lines). If files contain multiple texts (i.e., `collapse_lines=False`), then texts need to be read in before bundling in order to ensure bundles are under the length limit.
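For instance, a sketch of the one-text-per-file case (the directory path is hypothetical), where `collapse_lines=True` joins each file's lines into a single text:

```python
import receptiviti

# Each file in the directory holds one document spread over several lines;
# collapsing lines lets bundles be built from file sizes without reading
# every file in before bundling.
results = receptiviti.request(
    directory="documents",  # hypothetical directory of .txt files
    collapse_lines=True,
    cores=4,
)
```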
If you are calling this function from a script, parallelization will involve rerunning that script in each process, so anything you don't want rerun should be protected by a check that `__name__` equals `"__main__"` (placed within an `if __name__ == "__main__":` clause).
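A sketch of that guard in a script (the directory, output path, and core count are placeholders):

```python
import receptiviti

def main():
    # Only runs in the original process, not in the workers spawned for
    # parallel bundle processing.
    results = receptiviti.request(
        directory="texts_to_score",  # hypothetical directory
        cores=4,
        output="scores.csv",         # hypothetical output file
    )
    print(results.shape)

if __name__ == "__main__":
    main()
```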
Source code in `src\receptiviti\request.py`