📖 Redact API documentation

TextualNer class

class tonic_textual.redact_api.TextualNer( base_url: str = 'https://textual.tonic.ai', api_key: str | None = None, verify: bool = True, )

Wrapper class to invoke the Tonic Textual API

Parameters:

base_url (str) – The URL to your Tonic Textual instance. Do not include trailing slashes. The default value is https://textual.tonic.ai.
api_key (Optional[str]) – Optional. Your API token. Instead of providing the API token here, we recommend that you set the API key in your environment as the value of TONIC_TEXTUAL_API_KEY.
verify (bool) – Whether to verify SSL certificates. By default, this is enabled.

Examples

>>> from tonic_textual.redact_api import TextualNer
>>> textual = TextualNer("https://textual.tonic.ai")
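
A minimal sketch of preparing the constructor arguments, assuming the API key is stored in the TONIC_TEXTUAL_API_KEY environment variable as recommended above. The helper normalize_base_url is a hypothetical name and is not part of the tonic_textual package; it only strips trailing slash characters from the URL.

```python
import os


def normalize_base_url(url: str) -> str:
    """Hypothetical helper: drop trailing slash (or backslash) characters."""
    return url.rstrip("/\\")


# The variable name comes from the api_key parameter description.
api_key = os.environ.get("TONIC_TEXTUAL_API_KEY")  # None when the variable is unset

base_url = normalize_base_url("https://textual.tonic.ai/")
print(base_url)
print("API key found in environment:", api_key is not None)
```

The resulting values could then be passed to the constructor as TextualNer(base_url, api_key=api_key), following the signature shown above.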

create_dataset( dataset_name: str, )

Creates a dataset. A dataset is a collection of 1 or more files for Tonic Textual to scan and redact.

Parameters:: dataset_name (str) – The name of the dataset. Dataset names must be unique.
Returns:: The newly created dataset.
Return type:: Dataset
Raises:: DatasetNameAlreadyExists – Raised if a dataset with the same name already exists.

delete_dataset( dataset_name: str, )

Deletes a dataset by name.

Parameters:: dataset_name (str) – The name of the dataset to delete.

download_redacted_file( job_id: str, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, num_retries: int = 6, wait_between_retries: int = 10, ) → bytes

Downloads a redacted file.

Parameters:

job_id (str) – The identifier of the redaction job.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.
generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types that are not specified in generator_config.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
num_retries (int = 6) – An optional value to specify the number of times to attempt to download the file. If the file is not yet ready for download, Textual pauses before it retries. The length of the pause is set by wait_between_retries, which is 10 seconds by default. (The default value is 6)
wait_between_retries (int = 10) – The number of seconds to wait between retry attempts. (The default value is 10)

Returns:

The redacted file as a byte array.

Return type:

bytes
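
The interaction between num_retries and wait_between_retries can be pictured as a simple polling loop. The sketch below illustrates the documented behavior only; it is not the SDK's implementation. The names download_with_retries and fetch are hypothetical, and raising TimeoutError when the attempts run out is an assumption.

```python
import time
from typing import Callable, Optional


def download_with_retries(
    fetch: Callable[[], Optional[bytes]],
    num_retries: int = 6,
    wait_between_retries: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() up to num_retries times, pausing between attempts.

    fetch is a hypothetical callable that returns None while the file is
    not ready and the file bytes once it is.
    """
    for attempt in range(num_retries):
        data = fetch()
        if data is not None:
            return data
        # No pause is needed after the final attempt.
        if attempt < num_retries - 1:
            sleep(wait_between_retries)
    # Assumption: the real client may signal this case differently.
    raise TimeoutError(f"File not ready after {num_retries} attempts")


# Simulated job that becomes ready on the third attempt.
responses = iter([None, None, b"redacted bytes"])
waits = []  # records each pause instead of actually sleeping
result = download_with_retries(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing a recording function as sleep keeps the example fast; in real use the default time.sleep would apply.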

get_dataset( dataset_name: str, ) → Dataset

Gets the dataset for the specified dataset name.

Parameters:: dataset_name (str) – The name of the dataset.
Return type:: Dataset

Examples

>>> dataset = textual.get_dataset("llama_2_chatbot_finetune_v5")

get_files( dataset_id: str, ) → List[DatasetFile]

Gets all of the files in the dataset.

Parameters:: dataset_id (str) – The identifier of the dataset.
Returns:: A list of all of the files in the dataset.
Return type:: List[DatasetFile]

llm_synthesis( string: str, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, ) → RedactionResponse

Deidentifies a string by redacting sensitive data and replacing those values with values generated by an LLM.

Parameters:

string (str) – The string to redact.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.
generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types that are not specified in generator_config.

Returns:

The redacted string, along with ancillary information about the detected entities.

Return type:

RedactionResponse

redact( string: str, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, ) → RedactionResponse

Redacts a string. Depending on the configured handling for each sensitive data type, values are either redacted, synthesized, or ignored.

Parameters:

string (str) – The string to redact.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for an entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.

Returns:

The redacted string along with ancillary information.

Return type:

RedactionResponse

Examples

>>> textual.redact(
...     "John Smith is a person",
...     # only redacts NAME_GIVEN
...     generator_config={"NAME_GIVEN": "Redaction"},
...     generator_default="Off",
...     # Occurrences of "There" are treated as NAME_GIVEN entities
...     label_allow_lists={"NAME_GIVEN": ["There"]},
...     # Text matching the regex ` ([a-z]{2}) ` is not treated as an occurrence of NAME_FAMILY
...     label_block_lists={"NAME_FAMILY": [" ([a-z]{2}) "]},
... )
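
Because the entries in label_allow_lists and label_block_lists are regular expressions, it can help to try a pattern locally before sending it. The sketch below uses Python's re module on the sample string from the example above. Whether Textual applies the patterns exactly the way re.search does is an assumption.

```python
import re

text = "John Smith is a person"

# Block-list pattern from the example above: two lowercase letters between spaces.
block_pattern = re.compile(r" ([a-z]{2}) ")
# Allow-list entry from the example above.
allow_pattern = re.compile(r"There")

block_match = block_pattern.search(text)
allow_match = allow_pattern.search(text)

# Show what each pattern finds in the sample text, if anything.
print("block pattern matched:", block_match.group(1) if block_match else None)
print("allow pattern matched:", allow_match.group(0) if allow_match else None)
```

In this sample the block pattern matches the word "is" and the allow pattern matches nothing, which suggests that neither list would change how "John" or "Smith" is handled.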

redact_bulk( strings: List[str], generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, ) → BulkRedactionResponse

Redacts a list of strings. Depending on the configured handling for each sensitive data type, values are either redacted, synthesized, or ignored.

Parameters:

strings (List[str]) – The array of strings to redact.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for an entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.

Returns:

The redacted strings along with ancillary information.

Return type:

BulkRedactionResponse

Examples

>>> textual.redact_bulk(
...     ["John Smith is a person", "I live in Atlanta"],
...     # only redacts NAME_GIVEN
...     generator_config={"NAME_GIVEN": "Redaction"},
...     generator_default="Off",
...     # Occurrences of "There" are treated as NAME_GIVEN entities
...     label_allow_lists={"NAME_GIVEN": ["There"]},
...     # Text matching the regex ` ([a-z]{2}) ` is not treated as an occurrence of NAME_FAMILY
...     label_block_lists={"NAME_FAMILY": [" ([a-z]{2}) "]},
... )
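
The parameter descriptions state that handling values must be one of "Redaction", "Synthesis", or "Off". A small pre-flight check, sketched below, can catch a misspelled value before a request is sent. The helper check_generator_config is hypothetical and not part of the SDK, and it assumes that values are passed as plain strings, as in the example above.

```python
from typing import Dict

# Values listed in the generator_config and generator_default descriptions.
ALLOWED_STATES = {"Redaction", "Synthesis", "Off"}


def check_generator_config(config: Dict[str, str]) -> None:
    """Hypothetical check: raise ValueError if any handling value is unknown."""
    bad = {
        label: state
        for label, state in config.items()
        if state not in ALLOWED_STATES
    }
    if bad:
        raise ValueError(f"Unsupported handling values: {bad}")


# A valid configuration passes silently.
check_generator_config({"NAME_GIVEN": "Redaction", "NAME_FAMILY": "Off"})

# An unknown value is reported.
try:
    check_generator_config({"NAME_GIVEN": "Mask"})
except ValueError as err:
    print(err)
```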

redact_html( html_data: str, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, ) → RedactionResponse

Redacts the values in an HTML blob. Depending on the configured handling for each entity type, values are either redacted, synthesized, or ignored.

Parameters:

html_data (str) – The HTML for which to redact values.
generator_config (Dict[str, PiiState]) – A dictionary of entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected values.
generator_default (PiiState = PiiState.Redaction) – The default redaction used for any entity type that is not included in generator_config.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). The ignored values are regular expressions. When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). The additional values are regular expressions. When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.

Returns:

The redacted string plus additional information.

Return type:

RedactionResponse

redact_json( json_data: str | dict, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, jsonpath_allow_lists: Dict[str, List[str]] | None = None, ) → RedactionResponse

Redacts the values in a JSON blob. Depending on the configured handling for each sensitive data type, values are either redacted, synthesized, or ignored.

Parameters:

json_data (Union[str, dict]) – The JSON for which to redact values. This can be either a JSON string or a Python dictionary.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.
generator_default (PiiState = PiiState.Redaction) – The default redaction to use for all types that are not specified in generator_config.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
jsonpath_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, path expressions). When an element in the JSON document matches a JSON path expression, the entire text value is treated as the specified entity type. Only supported for path expressions that point to JSON primitive values. This setting overrides any results found by the NER model or in label allow and block lists. If multiple path expressions point to the same JSON node but specify different entity types, then the value is redacted as one of those types, and that type is chosen at random.

Returns:

The redacted string along with ancillary information.

Return type:

RedactionResponse
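
The following sketch shows one way the arguments for redact_json might be assembled. The sample record and the path expression $.customer.name are made up for illustration, and the exact JSON path dialect that Textual accepts is an assumption. No request is sent; the code only builds the keyword arguments.

```python
import json

# Made-up sample document.
record = {"customer": {"name": "John Smith", "notes": "Lives in Atlanta"}}

# json_data may be a dict or a JSON string, per the parameter description.
json_as_string = json.dumps(record)

# Hypothetical mapping: treat the whole value at $.customer.name as NAME_GIVEN.
jsonpath_allow_lists = {"NAME_GIVEN": ["$.customer.name"]}

# Keyword names follow the redact_json signature shown above.
kwargs = {
    "json_data": json_as_string,
    "generator_default": "Redaction",
    "jsonpath_allow_lists": jsonpath_allow_lists,
}
print(sorted(kwargs))
```

With a configured client, these could be passed as textual.redact_json(**kwargs).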

redact_xml( xml_data: str, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, ) → RedactionResponse

Redacts the values in an XML blob. Depending on the configured handling for each entity type, values are either redacted, synthesized, or ignored.

Parameters:

xml_data (str) – The XML for which to redact values.
generator_config (Dict[str, PiiState]) – A dictionary of entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected values.
generator_default (PiiState = PiiState.Redaction) – The default redaction used for any entity type that is not included in generator_config.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.

Returns:

The redacted string plus additional information.

Return type:

RedactionResponse

send_redact_bulk_request( endpoint: str, payload: Dict, random_seed: int | None = None, ) → BulkRedactionResponse: Helper function to send redact requests, handle responses, and catch errors.

send_redact_request( endpoint: str, payload: Dict, random_seed: int | None = None, ) → RedactionResponse: Helper function to send redact requests, handle responses, and catch errors.

start_file_redaction( file: IOBase, file_name: str, ) → str

Starts redaction of a provided file.

Parameters:

file (io.IOBase) – The opened file, available for reading, to upload and redact.
file_name (str) – The name of the file.

Returns:

The job identifier, which can be used to download the redacted file when it is ready.

Return type:

str
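
A rough sketch of how start_file_redaction and download_redacted_file fit together: the first returns a job identifier, and the second uses it to fetch the result. The function redact_file_bytes and the FakeClient class are hypothetical; FakeClient only stands in for a configured TextualNer instance so that the sketch runs without a Textual server.

```python
import io
from typing import Any


def redact_file_bytes(client: Any, content: bytes, file_name: str) -> bytes:
    """Upload content for redaction, then download the redacted result.

    client is expected to expose start_file_redaction and
    download_redacted_file with the signatures documented above.
    """
    job_id = client.start_file_redaction(io.BytesIO(content), file_name)
    return client.download_redacted_file(job_id)


class FakeClient:
    """Stand-in used only to exercise the sketch offline."""

    def start_file_redaction(self, file: io.IOBase, file_name: str) -> str:
        self.uploaded = file.read()
        return "job-123"

    def download_redacted_file(self, job_id: str) -> bytes:
        return b"[redacted] " + job_id.encode()


client = FakeClient()
output = redact_file_bytes(client, b"John Smith", "note.txt")
print(output)
```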

unredact( redacted_string: str, random_seed: int | None = None, ) → str

Removes the redaction from a provided string. Returns the string with the original values.

Parameters:

redacted_string (str) – The redacted string from which to remove the redaction.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The string with the redaction removed.

Return type:

str

unredact_bulk( redacted_strings: List[str], random_seed: int | None = None, ) → List[str]

Removes redaction from a list of strings. Returns the strings with the original values.

Parameters:

redacted_strings (List[str]) – The list of redacted strings from which to remove the redaction.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The list of strings with the redaction removed.

Return type:

List[str]

Dataset class

class tonic_textual.classes.dataset.Dataset( client: HttpClient, id: str, name: str, files: List[Dict[str, Any]], generator_config: Dict[str, PiiState] | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, )

Class to represent and provide access to a Tonic Textual dataset.

Parameters:

id (str) – Dataset identifier.
name (str) – Dataset name.
files (List[Dict[str, Any]]) – Serialized DatasetFile objects that represent the files in a dataset.
client (HttpClient) – The HTTP client to use.

add_file( file_path: str | None = None, file_name: str | None = None, file: IOBase | None = None, ) → DatasetFile | None

Uploads a file to the dataset.

Parameters:

file_path (Optional[str]) – The absolute path of the file to upload. If specified, you cannot also provide the ‘file’ argument.
file_name (Optional[str]) – The name of the file to save to Tonic Textual. Optional if you use file_path to upload the file. Required if you use the ‘file’ argument.
file (Optional[io.IOBase]) – The bytes of a file to upload. If specified, you must also provide the ‘file_name’ argument. You cannot use the ‘file_path’ argument in the same call.

Raises:

DatasetFileMatchesExistingFile – Raised if the file content matches an existing file.
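
The argument rules for add_file (either file_path alone, or file together with file_name) can be summarized as a small check. The helper below is hypothetical and only mirrors the rules described in the parameter list; the SDK may validate its arguments differently.

```python
import io
from typing import Optional


def check_add_file_args(
    file_path: Optional[str] = None,
    file_name: Optional[str] = None,
    file: Optional[io.IOBase] = None,
) -> str:
    """Hypothetical check of the add_file argument rules.

    Returns "path" or "stream" to show which upload style was selected.
    """
    if file_path is not None and file is not None:
        raise ValueError("Provide file_path or file, not both")
    if file is not None:
        if file_name is None:
            raise ValueError("file_name is required when file is provided")
        return "stream"
    if file_path is not None:
        # file_name is optional when uploading from a path.
        return "path"
    raise ValueError("Provide either file_path or file")


print(check_add_file_args(file_path="/data/notes.txt"))
print(check_add_file_args(file=io.BytesIO(b"hello"), file_name="notes.txt"))
```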

delete_file( file_id: str, )

Deletes the given file from the dataset.

Parameters:: file_id (str) – The identifier of the dataset file to delete.

describe() → str

Returns a string of the dataset name, identifier, and the list of files.

Examples

>>> dataset.describe()
Dataset: your_dataset_name [dataset_id]
Number of Files: 2
Number of Rows: 1000

edit( name: str | None = None, generator_config: Dict[str, PiiState] | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, should_rescan=True, )

Edits the dataset. Only the fields that are provided as function arguments are changed. Currently, you can edit the name of the dataset and the generator setup, which indicates how to handle each entity.

Parameters:

name (Optional[str]) – The new name of the dataset. Returns an error if the new name conflicts with an existing dataset name.
generator_config (Optional[Dict[str, PiiState]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored entities). When an entity of the specified type matches a regular expression in the list, the value is ignored and not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, included entities). When a piece of text matches a regular expression in the list, the text is marked as the entity type and is included in the redaction or synthesis.

Raises:

DatasetNameAlreadyExists – Raised if a dataset with the same name already exists.

fetch_all_df()

Fetches all of the data in the dataset as a pandas dataframe.

Returns:: Dataset data in a pandas dataframe.
Return type:: pd.DataFrame

fetch_all_json() → str

Fetches all of the data in the dataset as JSON.

Returns:: Dataset data in JSON format.
Return type:: str

get_failed_files() → List[DatasetFile]

Gets all of the dataset files that encountered an error when they were processed. These files are effectively ignored.

Returns:: The list of files that had processing errors.
Return type:: List[DatasetFile]

get_processed_files() → List[DatasetFile]

Gets all of the dataset files for which processing is complete. The data in these files is returned when data is requested.

Returns:: The list of processed dataset files.
Return type:: List[DatasetFile]

get_queued_files() → List[DatasetFile]

Gets all of the dataset files that are waiting to be processed.

Returns:: The list of dataset files that await processing.
Return type:: List[DatasetFile]

get_running_files() → List[DatasetFile]

Gets all of the dataset files that are currently being processed.

Returns:: The list of files that are being processed.
Return type:: List[DatasetFile]
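
The four getters above partition a dataset's files by processing state, so a quick progress summary might look like the sketch below. The function count_files_by_state and the FakeDataset class are hypothetical; FakeDataset stands in for a real Dataset object so that the example runs offline.

```python
from typing import Any, Dict, List


def count_files_by_state(dataset: Any) -> Dict[str, int]:
    """Count files per state using the four getters documented above."""
    return {
        "queued": len(dataset.get_queued_files()),
        "running": len(dataset.get_running_files()),
        "processed": len(dataset.get_processed_files()),
        "failed": len(dataset.get_failed_files()),
    }


class FakeDataset:
    """Stand-in that returns file names instead of DatasetFile objects."""

    def get_queued_files(self) -> List[str]:
        return ["c.txt"]

    def get_running_files(self) -> List[str]:
        return []

    def get_processed_files(self) -> List[str]:
        return ["a.txt", "b.txt"]

    def get_failed_files(self) -> List[str]:
        return []


summary = count_files_by_state(FakeDataset())
print(summary)
```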

DatasetFile class

class tonic_textual.classes.datasetfile.DatasetFile( client: HttpClient, id: str, dataset_id: str, name: str, num_rows: int | None, num_columns: int, processing_status: str, processing_error: str | None, label_allow_lists: Dict[str, LabelCustomList] | None = None, )

Class to store the metadata for a dataset file.

Parameters:

id (str) – The identifier of the dataset file.
name (str) – The file name of the dataset file.
num_rows (Optional[int]) – The number of rows in the dataset file.
num_columns (int) – The number of columns in the dataset file.
processing_status (str) – The status of the dataset file in the processing pipeline. Possible values are ‘Completed’, ‘Failed’, ‘Cancelled’, ‘Running’, and ‘Queued’.
processing_error (Optional[str]) – If the dataset file processing failed, a description of the issue that caused the failure.
label_allow_lists (Optional[Dict[str, LabelCustomList]]) – A dictionary of custom entity detection regular expressions for the dataset file. Each key is an entity type to detect, and each value is a LabelCustomList object, whose regular expressions should be recognized as the specified entity type.

describe() → str: Returns the dataset file metadata as string. Includes the identifier, file name, number of rows, and number of columns.

download( random_seed: int | None = None, num_retries: int = 6, wait_between_retries: int = 10, ) → bytes

Downloads a redacted file.

Parameters:

random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
num_retries (int = 6) – An optional value to specify the number of times to attempt to download the file. If a file is not yet ready for download, there is a 10-second pause before retrying. (The default value is 6)
wait_between_retries (int = 10) – The number of seconds to wait between retry attempts.

Returns:

The redacted file as a byte array.

Return type:

bytes
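
Since download returns the redacted file as bytes, saving it to disk is a matter of writing those bytes out. The sketch below assumes an object with a download() method as documented above; save_download and FakeDatasetFile are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path
from typing import Any


def save_download(dataset_file: Any, target: Path) -> int:
    """Write the bytes returned by dataset_file.download() to target.

    Returns the number of bytes written.
    """
    data = dataset_file.download()
    target.write_bytes(data)
    return len(data)


class FakeDatasetFile:
    """Stand-in for a DatasetFile so the sketch runs offline."""

    def download(self) -> bytes:
        return b"redacted content"


with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "redacted.txt"
    written = save_download(FakeDatasetFile(), out_path)
    saved = out_path.read_bytes()

print(written, saved)
```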

Redaction response

class tonic_textual.classes.redact_api_responses.redaction_response.RedactionResponse( original_text: str, redacted_text: str, usage: int, de_identify_results: List[Replacement], )

Redaction response object

Variables:

original_text (str) – The original text.
redacted_text (str) – The redacted and synthesized text.
usage (int) – The number of words used.
de_identify_results (List[Replacement]) – The list of named entities that were found in original_text.

class tonic_textual.classes.common_api_responses.replacement.Replacement( start: int, end: int, new_start: int, new_end: int, label: str, text: str, score: float, language: str, new_text: str | None = None, example_redaction: str | None = None, json_path: str | None = None, xml_path: str | None = None, )

A span of text that was detected as a named entity.

Variables:

start (int) – The start index of the entity in the original text.
end (int) – The end index of the entity in the original text. The end index is exclusive.
new_start (int) – The start index of the entity in the redacted/synthesized text.
new_end (int) – The end index of the entity in the redacted/synthesized text. The end index is exclusive.
python_start (Optional[int]) – The start index in Python (if different from start).
python_end (Optional[int]) – The end index in Python (if different from end).
label (str) – The label of the entity.
text (str) – The substring of the original text that was detected as an entity.
new_text (Optional[str]) – The new text to replace the original entity.
score (float) – The confidence score of the detection.
language (str) – The language of the entity.
example_redaction (Optional[str]) – An example redaction for the entity.
json_path (Optional[str]) – The JSON path of the entity in the original JSON document. This is only present if the input text was a JSON document.
xml_path (Optional[str]) – The XPath of the entity in the original XML document. This is only present if the input text was an XML document. NOTE: Arrays in XPath are 1-based.
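
Because end indexes are exclusive, a span's start and end map directly onto a Python slice. The sketch below uses a stand-in Span dataclass with a subset of the Replacement fields, and hand-written sample spans rather than real API output. It assumes that start and end coincide with Python string indexes; where they differ, python_start and python_end would presumably be the values to use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Stand-in with a subset of the Replacement fields listed above."""

    start: int
    end: int
    label: str
    text: str


def mismatched_labels(original_text: str, spans: List[Span]) -> List[str]:
    """Return labels of spans whose slice [start:end) does not equal span.text."""
    return [
        span.label
        for span in spans
        if original_text[span.start:span.end] != span.text
    ]


original = "John Smith is a person"
# Hand-written sample spans, not actual output from the API.
spans = [
    Span(start=0, end=4, label="NAME_GIVEN", text="John"),
    Span(start=5, end=10, label="NAME_FAMILY", text="Smith"),
]
mismatches = mismatched_labels(original, spans)
print(mismatches)
```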