Parse API documentation
TextualParse class
- class tonic_textual.parse_api.TextualParse(
- base_url: str = 'https://textual.tonic.ai',
- api_key: str | None = None,
- verify: bool = True,
Wrapper class for invoking the Tonic Textual API.
- Parameters:
base_url (Optional[str]) – The URL of your Tonic Textual instance. Do not include a trailing slash. The default value is https://textual.tonic.ai.
api_key (Optional[str]) – Your API key. Instead of providing the API key here, we recommend that you set it in your environment as the value of TEXTUAL_API_KEY.
verify (bool) – Whether to verify the SSL certificate. Enabled by default.
Examples
>>> from tonic_textual.parse_api import TextualParse
>>> textual = TextualParse("https://textual.tonic.ai")
- create_azure_pipeline(
- pipeline_name: str,
- credentials: PipelineAzureCredential,
- synthesize_files: bool | None = False,
Create a new pipeline with files from Azure blob storage.
- Parameters:
pipeline_name (str) – The name of the pipeline.
credentials (PipelineAzureCredential) – The credentials to use to connect to Azure.
synthesize_files (Optional[bool]) – Whether to generate a redacted version of the file in addition to the parsed output. Default value is False.
- Returns:
The newly created pipeline.
- Return type:
AzurePipeline
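Examples
A minimal sketch, assuming a TextualParse client named textual (see the class example above). The PipelineAzureCredential import path and constructor fields shown here are illustrative assumptions, not confirmed API.
>>> from tonic_textual.classes.pipeline_azure_credential import PipelineAzureCredential  # import path assumed
>>> creds = PipelineAzureCredential(account_name="myaccount", account_key="...")  # fields assumed
>>> pipeline = textual.create_azure_pipeline("azure-pipeline", credentials=creds)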
- create_databricks_pipeline(
- pipeline_name: str,
- credentials: PipelineDatabricksCredential,
- synthesize_files: bool | None = False,
Create a new pipeline on top of Databricks Unity Catalog.
- Parameters:
pipeline_name (str) – The name of the pipeline.
credentials (PipelineDatabricksCredential) – The credentials to use to connect to Databricks.
synthesize_files (Optional[bool]) – Whether to generate a redacted version of the file in addition to the parsed output. Default value is False.
- Returns:
The newly created pipeline.
- Return type:
- create_local_pipeline(
- pipeline_name: str,
- synthesize_files: bool | None = False,
Create a new pipeline from files uploaded from a local file system.
- Parameters:
pipeline_name (str) – The name of the pipeline.
synthesize_files (Optional[bool]) – Whether to generate a redacted version of the files in addition to the parsed output. Default value is False.
- Returns:
The newly created pipeline.
- Return type:
LocalPipeline
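Examples
A minimal sketch, assuming a TextualParse client named textual (see the class example above).
>>> pipeline = textual.create_local_pipeline("local-pipeline", synthesize_files=True)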
- create_s3_pipeline(
- pipeline_name: str,
- credentials: PipelineAwsCredential | None = None,
- aws_credentials_source: str | None = 'user_provided',
- synthesize_files: bool | None = False,
Create a new pipeline with files from Amazon S3.
- Parameters:
pipeline_name (str) – The name of the pipeline.
credentials (Optional[PipelineAwsCredential]) – The credentials to use to connect to AWS. Not required when aws_credentials_source is from_environment.
aws_credentials_source (Optional[str]) – How to obtain the AWS credentials. Options are user_provided and from_environment. For user_provided, you provide the credentials in the credentials parameter. For from_environment, the credentials are read from your Textual instance.
synthesize_files (Optional[bool]) – Whether to generate a redacted version of the file in addition to the parsed output. Default value is False.
- Returns:
The newly created pipeline.
- Return type:
S3Pipeline
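Examples
A minimal sketch that reads the AWS credentials from the Textual instance, per the aws_credentials_source parameter above.
>>> pipeline = textual.create_s3_pipeline("s3-pipeline", aws_credentials_source="from_environment")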
- delete_pipeline(
- pipeline_id: str,
Delete a pipeline.
- Parameters:
pipeline_id (str) – The identifier of the pipeline.
- get_pipeline_by_id(
- pipeline_id: str,
Gets the pipeline based on its identifier.
- Parameters:
pipeline_id (str) – The identifier of the pipeline.
- Returns:
The pipeline object, or None if no pipeline is found.
- Return type:
Union[Pipeline, None]
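Examples
A minimal sketch; the identifier is a placeholder, and the example assumes the pipeline object exposes its identifier as id, per the Pipeline class parameters below.
>>> pipeline = textual.get_pipeline_by_id("pipeline-id")
>>> if pipeline is not None:
...     textual.delete_pipeline(pipeline.id)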
- get_pipelines() → List[Pipeline]
Get the pipelines for the Tonic Textual instance.
- Returns:
A list of pipeline objects, ordered by their creation timestamp.
- Return type:
List[Pipeline]
Examples
>>> latest_pipeline = textual.get_pipelines()[-1]
- parse_file(
- file: IOBase,
- file_name: str,
- timeout: int | None = None,
Parse a given file. To open binary files, use the "rb" option.
- Parameters:
file (io.IOBase) – The opened file, available for reading, to parse.
file_name (str) – The name of the file.
timeout (Optional[int]) – Optional timeout in seconds. Indicates to stop waiting for the parsed result after the specified time.
- Returns:
The parsed document.
- Return type:
FileParseResult
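Examples
A minimal sketch; opens a local file in binary mode, per the "rb" note above. The file name is a placeholder.
>>> with open("contract.pdf", "rb") as f:
...     parsed = textual.parse_file(f, "contract.pdf", timeout=60)
>>> print(parsed.describe())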
- parse_s3_file(
- bucket: str,
- key: str,
- timeout: int | None = None,
Parse a file stored in Amazon S3. Uses boto3 to fetch the file from Amazon S3.
- Parameters:
bucket (str) – The bucket that contains the file to parse.
key (str) – The key of the file to parse.
timeout (Optional[int]) – Optional timeout in seconds. Indicates to stop waiting for the parsed result after the specified time.
- Returns:
The parsed document.
- Return type:
FileParseResult
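Examples
A minimal sketch; assumes boto3 can resolve AWS credentials from your environment. The bucket and key are placeholders.
>>> parsed = textual.parse_s3_file("my-bucket", "documents/contract.pdf", timeout=120)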
Pipeline class
- class tonic_textual.classes.pipeline.Pipeline(
- name: str,
- id: str,
- client: HttpClient,
Class to represent and provide access to a Tonic Textual pipeline. This class is abstract. Do not instantiate it directly.
- Parameters:
name (str) – Pipeline name.
id (str) – Pipeline identifier.
client (HttpClient) – The HTTP client to use.
- describe() → str
Returns the name and id of the pipeline.
- enumerate_files(
- lazy_load_content=True,
Enumerate the files in the pipeline.
- Parameters:
lazy_load_content (bool) – Whether to lazily load the content of the files. Default is True.
- Returns:
An enumerator for the files in the pipeline.
- Return type:
PipelineFileEnumerator
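Examples
A minimal sketch; iterates over the enumerator and prints each parsed file's path.
>>> for file in pipeline.enumerate_files():
...     print(file.describe())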
- get_delta(
- pipeline_run1: PipelineRun,
- pipeline_run2: PipelineRun,
Enumerates the files in the diff between two pipeline runs.
- Parameters:
pipeline_run1 (PipelineRun) – The first pipeline run.
pipeline_run2 (PipelineRun) – The second pipeline run.
- Returns:
An enumerator for the files in the diff between the two runs.
- Return type:
FileParseResultsDiffEnumerator
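Examples
A minimal sketch; compares the two most recent runs, assuming the pipeline has at least two runs and that get_runs returns them in chronological order.
>>> runs = pipeline.get_runs()
>>> for diff in pipeline.get_delta(runs[-2], runs[-1]):
...     print(diff.describe())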
- get_runs() → List[PipelineRun]
Get the runs for the pipeline.
- Returns:
A list of PipelineRun objects.
- Return type:
List[PipelineRun]
- run() → str
Run the pipeline.
- Returns:
The ID of the job.
- Return type:
str
- set_synthesize_files(
- synthesize_files: bool,
Sets whether the pipeline generates a redacted version of the files in addition to the parsed output.
- Parameters:
synthesize_files (bool) – Whether to synthesize the files.
- upload_file(
- file: IOBase,
- file_name: str,
- csv_config: SolarCsvConfig | None = None,
Uploads a file to the pipeline.
- Parameters:
file (io.IOBase) – The file to upload.
file_name (str) – The name of the file.
csv_config (SolarCsvConfig) – The configuration for the CSV file. This is optional.
Local Pipeline class
- class tonic_textual.classes.local_pipeline.LocalPipeline(
- name: str,
- id: str,
- client: HttpClient,
Class to represent and provide access to a Tonic Textual uploaded local file pipeline.
- Parameters:
name (str) – Pipeline name.
id (str) – Pipeline identifier.
client (HttpClient) – The HTTP client to use.
- add_file(
- file: IOBase,
- file_name: str,
- csv_config: SolarCsvConfig | None = None,
Uploads a file to the pipeline.
- Parameters:
file (io.IOBase) – The file to upload.
file_name (str) – The name of the file.
csv_config (SolarCsvConfig) – The configuration for the CSV file. This is optional.
- Returns:
This function does not return any value.
- Return type:
None
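Examples
A minimal sketch; uploads a local file and then starts a run. The file name is a placeholder.
>>> with open("notes.txt", "rb") as f:
...     pipeline.add_file(f, "notes.txt")
>>> job_id = pipeline.run()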
S3 Pipeline class
- class tonic_textual.classes.s3_pipeline.S3Pipeline(
- name: str,
- id: str,
- client: HttpClient,
Class to represent and provide access to a Tonic Textual Amazon S3 pipeline.
- Parameters:
name (str) – Pipeline name.
id (str) – Pipeline identifier.
client (HttpClient) – The HTTP client to use.
- add_files(
- bucket: str,
- file_paths: List[str],
Add files to your pipeline.
- Parameters:
bucket (str) – The S3 bucket.
file_paths (List[str]) – The list of files to include.
- add_prefixes(
- bucket: str,
- prefixes: List[str],
Adds prefixes to your pipeline. Textual processes all of the files under the prefix that are of supported file types.
- Parameters:
bucket (str) – The S3 bucket.
prefixes (List[str]) – The list of prefixes to include.
- set_output_location(
- bucket: str,
- prefix: str | None = None,
Sets the location in Amazon S3 where the pipeline stores processed files.
- Parameters:
bucket (str) – The S3 bucket.
prefix (str) – The optional prefix on the bucket.
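Examples
A minimal sketch; the bucket names and prefixes are placeholders.
>>> pipeline.set_output_location("output-bucket", "processed")
>>> pipeline.add_prefixes("input-bucket", ["raw/2024/"])
>>> pipeline.add_files("input-bucket", ["raw/one-off/contract.pdf"])
>>> job_id = pipeline.run()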
Azure Pipeline class
- class tonic_textual.classes.azure_pipeline.AzurePipeline(
- name: str,
- id: str,
- client: HttpClient,
Class to represent and provide access to a Tonic Textual Azure blob storage pipeline.
- Parameters:
name (str) – Pipeline name.
id (str) – Pipeline identifier.
client (HttpClient) – The HTTP client to use.
- add_files(
- container: str,
- file_paths: List[str],
Add files to your pipeline.
- Parameters:
container (str) – The container name.
file_paths (List[str]) – The list of files to include.
- add_prefixes(
- container: str,
- prefixes: List[str],
Add prefixes to your pipeline. Textual processes all of the files under the prefix that are of supported file types.
- Parameters:
container (str) – The container name.
prefixes (List[str]) – The list of prefixes to include.
- set_output_location(
- container: str,
- prefix: str | None = None,
Sets the location in Azure blob storage where the pipeline stores processed files.
- Parameters:
container (str) – The container name.
prefix (str) – The optional prefix on the container.
File enumerators
- class tonic_textual.classes.pipeline_file_enumerator.PipelineFileEnumerator(
- job_id: str,
- client: HttpClient,
- lazy_load_content=True,
Enumerates the files in a pipeline.
- Parameters:
job_id (str) – The job identifier.
client (HttpClient) – The HTTP client to use.
lazy_load_content (bool) – Whether to lazily load the content of the files. Default is True.
- next() → FileParseResult
Returns the next file in the enumeration.
- class tonic_textual.classes.file_parse_result_diff_enumerator.FileParseResultsDiffEnumerator(
- job_id1: str,
- job_id2: str,
- client: HttpClient,
Enumerates the files in a diff between two jobs.
- Parameters:
job_id1 (str) – The first job identifier.
job_id2 (str) – The second job identifier.
client (HttpClient) – The HTTP client to use.
- next() → FileParseResultsDiff
Returns the next file diff in the enumeration.
Pipeline file results
- class tonic_textual.classes.parse_api_responses.file_parse_result.FileParseResult(
- response: Dict,
- client: HttpClient,
- lazy_load_content=False,
- document: Dict = None,
A class that represents the result of a parsed file.
- Parameters:
response (Dict) – The response from the API.
client (HttpClient) – The HTTP client to use.
lazy_load_content (bool) – Whether to lazily load the content of the file. Default is False.
- describe() → str
Returns the parsed file path.
- download_results() → str
Downloads the results file.
- Returns:
The results file.
- Return type:
str
- get_all_entities() → List[SingleDetectionResult]
Returns a list of all of the detected entities in the file.
- Returns:
A list of detected entities in the file.
- Return type:
List[SingleDetectionResult]
- get_chunks(
- max_chars=15000,
- generator_config: Dict[str, PiiState] = {},
- generator_default: PiiState = PiiState.Off,
- metadata_entities: List[str] = [],
- include_metadata=True,
Returns a list of chunks of text from the document. Entities in the chunks are redacted, synthesized, or left as-is based on the specified configuration.
- Parameters:
max_chars (int = 15_000) – The maximum number of characters in each chunk.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of "Redaction", "Synthesis", or "Off".
generator_default (PiiState = PiiState.Off) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of "Redaction", "Synthesis", or "Off".
metadata_entities (List[str]) – The entity types to include in the chunk metadata.
include_metadata (bool = True) – Whether to include the metadata in each chunk. Default is True.
- Returns:
A list of strings that contain the chunks of text.
- Return type:
List[str]
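Examples
A minimal sketch for chunking a parsed file, assuming a FileParseResult named parsed (for example, from parse_file). The PiiState import path is an assumption.
>>> from tonic_textual.enums.pii_state import PiiState  # import path assumed
>>> chunks = parsed.get_chunks(max_chars=2000, generator_default=PiiState.Redaction)
>>> len(chunks)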
- get_entities(
- generator_config: Dict[str, PiiState] = {},
- generator_default: PiiState = PiiState.Redaction,
- allow_overlap: bool = False,
Returns a list of entities in the document. The entities are handled according to the generator_config and generator_default settings.
- Parameters:
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of "Redaction", "Synthesis", or "Off".
generator_default (PiiState) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of "Redaction", "Synthesis", or "Off".
allow_overlap (Optional[bool]) – Whether to allow entities with overlapping spans in the results. Default is False.
- Returns:
A list of the detected entities. Each item in the list contains the entity type, source start index, source end index, the entity text, and replacement text.
- Return type:
List[SingleDetectionResult]
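Examples
A minimal sketch; synthesizes given names and redacts all other entity types. NAME_GIVEN is an example entity type, and PiiState is imported as in the get_chunks example above.
>>> entities = parsed.get_entities(
...     generator_config={"NAME_GIVEN": PiiState.Synthesis},
...     generator_default=PiiState.Redaction)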
- get_json() → Dict
Returns the raw JSON generated by Tonic Textual.
- Returns:
The raw JSON that Textual generates when it parses the file, in the form of a dictionary.
- Return type:
Dict
- get_markdown(
- generator_config: Dict[str, PiiState] = {},
- generator_default: PiiState = PiiState.Off,
- random_seed: int | None = None,
Returns the file in Markdown format. In the file, the entities are redacted or synthesized based on the specified configuration.
- Parameters:
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of "Redaction", "Synthesis", or "Off".
generator_default (PiiState = PiiState.Off) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of "Redaction", "Synthesis", or "Off".
random_seed (Optional[int] = None) – An optional value to use to override Textual's default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
- Returns:
The file in Markdown format. In the file, the entities are redacted or synthesized based on generator_config and generator_default.
- Return type:
str
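Examples
A minimal sketch; synthesizes all detected entities and fixes the random seed so that repeated calls produce the same replacement values. PiiState is imported as in the get_chunks example above.
>>> md = parsed.get_markdown(generator_default=PiiState.Synthesis, random_seed=42)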
- get_tables() → List[Table]
Returns a list of tables found in the document. Applies to CSV, XLSX, PDF, and image files.
- Returns:
The tables found in the document.
- Return type:
List[Table]
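Examples
A minimal sketch; counts the tables detected in a parsed spreadsheet or PDF, assuming a FileParseResult named parsed.
>>> tables = parsed.get_tables()
>>> len(tables)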
- is_sensitive(
- sensitive_entity_types: List[str],
- start: int = 0,
- end: int = -1,
Returns True if the element contains sensitive data. Otherwise returns False.
- Parameters:
sensitive_entity_types (List[str]) – A list of sensitive entity types to check for.
start (int = 0) – The start index to check for sensitive data.
end (int = -1) – The end index to check for sensitive data.
- Returns:
Returns True if the element contains sensitive data. Otherwise returns False.
- Return type:
bool
- class tonic_textual.classes.parse_api_responses.file_parse_results_diff.FileParseDiffAction(
- value,
- names=<not given>,
- *values,
- module=None,
- qualname=None,
- type=None,
- start=1,
- boundary=None,
Enum that stores the possible states of a file parse result diff.
- Added = 1
The file was added, so it is new.
- Deleted = 2
The file was deleted.
- Modified = 3
The file was modified.
- NonModified = 4
The file was not modified.
- class tonic_textual.classes.parse_api_responses.file_parse_results_diff.FileParseResultsDiff(
- status: FileParseDiffAction,
- file: FileParseResult,
Stores the file parse result and the file parse result action.
- Parameters:
status (FileParseDiffAction) – The action of the file parse result.
file (FileParseResult) – The file parse result.
- deconstruct() → Tuple[FileParseDiffAction, FileParseResult]
Returns the status and the file parse result of the diff as a tuple.
- describe() → str
Returns the status and the file path of the diff as a string.
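Examples
A minimal sketch; pairs Pipeline.get_delta with deconstruct to act on each added file. The import follows the module path documented above, and runs is assumed to come from Pipeline.get_runs.
>>> from tonic_textual.classes.parse_api_responses.file_parse_results_diff import FileParseDiffAction
>>> for diff in pipeline.get_delta(runs[-2], runs[-1]):
...     action, file = diff.deconstruct()
...     if action == FileParseDiffAction.Added:
...         print(file.describe())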