📖 Redact API documentation

TextualNer class

class tonic_textual.redact_api.TextualNer(
base_url: str,
api_key: str | None = None,
verify: bool = True,
)

Wrapper class for invoking the Tonic Textual API.

Parameters:
  • base_url (str) – The URL of your Tonic Textual instance. Do not include a trailing slash.

  • api_key (str) – Your API token. This argument is optional. Instead of providing the API token here, it is recommended that you set the API key in your environment as the value of TONIC_TEXTUAL_API_KEY.

  • verify (bool) – Whether SSL certificate verification is performed. This is enabled by default.
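
As a hedged illustration of the api_key guidance above, the sketch below shows one way that calling code might make the choice between an explicit token and the TONIC_TEXTUAL_API_KEY environment variable visible. The helper name resolve_api_key and the precedence it applies (explicit argument first) are assumptions made for this example; they are not part of tonic_textual.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper: prefer an explicit key, else read the environment."""
    if api_key:
        return api_key
    key = env.get("TONIC_TEXTUAL_API_KEY")
    if not key:
        raise ValueError("Set TONIC_TEXTUAL_API_KEY or pass api_key explicitly.")
    return key
```

The env parameter exists only so that the lookup can be exercised without touching the real process environment.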

Examples

>>> from tonic_textual.redact_api import TextualNer
>>> textual = TextualNer("https://textual.tonic.ai")

create_dataset(
dataset_name: str,
)

Creates a dataset. A dataset is a collection of 1 or more files for Tonic Textual to scan and redact.

Parameters:

dataset_name (str) – The name of the dataset. Dataset names must be unique.

Returns:

The newly created dataset.

Return type:

Dataset

Raises:

DatasetNameAlreadyExists – Raised if a dataset with the same name already exists.

delete_dataset(
dataset_name: str,
)

Deletes a dataset by name.

Parameters:

dataset_name (str) – The name of the dataset to delete.

download_redacted_file(
job_id: str,
generator_config: Dict[str, PiiState] = {},
generator_default: PiiState = PiiState.Redaction,
random_seed: int | None = None,
label_block_lists: Dict[str, List[str]] | None = None,
num_retries: int = 6,
wait_between_retries: int = 10,
) -> bytes

Downloads a redacted file.

Parameters:
  • job_id (str) – The ID of the redaction job.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types not specified in generator_config.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

  • label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.

  • num_retries (int = 6) – The number of times to attempt to download the file. If the file is not yet ready for download, there is a pause (10 seconds by default) before the next attempt. The default value is 6.

  • wait_between_retries (int = 10) – The number of seconds to wait between retry attempts. The default value is 10.

Returns:

The redacted file as a byte array.

Return type:

bytes
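
To make the interaction between num_retries and wait_between_retries concrete, here is a minimal polling sketch. It is not the library's implementation: the function download_with_retries and the fetch callable (which returns None until the file is ready) are hypothetical names introduced for this example.

```python
import time
from typing import Callable, Optional


def download_with_retries(
    fetch: Callable[[], Optional[bytes]],
    num_retries: int = 6,
    wait_between_retries: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() up to num_retries times, pausing between attempts.

    fetch is a hypothetical callable that returns None while the file
    is not ready and the file bytes once it is.
    """
    for attempt in range(1, num_retries + 1):
        data = fetch()
        if data is not None:
            return data
        # Do not sleep after the final failed attempt.
        if attempt < num_retries:
            sleep(wait_between_retries)
    raise TimeoutError(f"File not ready after {num_retries} attempts.")
```

Under these assumptions, the defaults (6 attempts, 10 seconds apart) wait at most about 50 seconds before giving up.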

get_dataset(
dataset_name: str,
) -> Dataset

Gets the dataset for the specified dataset name.

Parameters:

dataset_name (str) – The name of the dataset.

Return type:

Dataset

Examples

>>> dataset = textual.get_dataset("llama_2_chatbot_finetune_v5")

get_files(
dataset_id: str,
) -> List[DatasetFile]

Gets all of the files in the dataset.

Returns:

A list of all of the files in the dataset.

Return type:

List[DatasetFile]

llm_synthesis(
string: str,
generator_config: Dict[str, PiiState] = {},
generator_default: PiiState = PiiState.Redaction,
) -> RedactionResponse

De-identifies a string by redacting sensitive data and replacing these values with values generated by an LLM.

Parameters:
  • string (str) – The string to redact.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types not specified in generator_config.

Returns:

The redacted string, along with ancillary information about the detected entities.

Return type:

RedactionResponse
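
The relationship between generator_config and generator_default amounts to a dictionary lookup with a fallback. The sketch below illustrates that reading with plain strings; handling_for is a hypothetical helper, and the three handling values are the ones this page lists for redact.

```python
from typing import Dict, Optional

# Handling values as listed in the redact parameter descriptions.
VALID_HANDLING = {"Redaction", "Synthesis", "Off"}


def handling_for(
    label: str,
    generator_config: Optional[Dict[str, str]] = None,
    generator_default: str = "Redaction",
) -> str:
    """Hypothetical helper: pick the handling for one entity type."""
    config = generator_config or {}
    # Entity types missing from generator_config fall back to the default.
    handling = config.get(label, generator_default)
    if handling not in VALID_HANDLING:
        raise ValueError(f"Unsupported handling: {handling!r}")
    return handling
```

For example, with generator_config={"NAME_GIVEN": "Synthesis"} and generator_default="Off", only NAME_GIVEN values would be processed and every other entity type would be left as is.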

redact(
string: str,
generator_config: Dict[str, PiiState] = {},
generator_default: PiiState = PiiState.Redaction,
random_seed: int | None = None,
label_block_lists: Dict[str, List[str]] | None = None,
label_allow_lists: Dict[str, List[str]] | None = None,
) -> RedactionResponse

Redacts a string. Depending on the configured handling for each sensitive data type, values can be either redacted, synthesized, or ignored.

Parameters:
  • string (str) – The string to redact.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

  • label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for an entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.

  • label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.

Returns:

The redacted string along with ancillary information.

Return type:

RedactionResponse

Examples

>>> textual.redact(
...     "John Smith is a person",
...     # only redacts NAME_GIVEN
...     generator_config={"NAME_GIVEN": "Redaction"},
...     generator_default="Off",
...     # Occurrences of "There" are treated as NAME_GIVEN entities
...     label_allow_lists={"NAME_GIVEN": ["There"]},
...     # Text matching the regex ` ([a-z]{2}) ` is not treated as an occurrence of NAME_FAMILY
...     label_block_lists={"NAME_FAMILY": [" ([a-z]{2}) "]},
... )

redact_html(
html_data: str,
generator_config: Dict[str, PiiState] = {},
generator_default: PiiState = PiiState.Redaction,
random_seed: int | None = None,
label_block_lists: Dict[str, List[str]] | None = None,
label_allow_lists: Dict[str, List[str]] | None = None,
) -> RedactionResponse

Redacts the values in an HTML blob. Depending on the configured handling for each entity type, values are either redacted, synthesized, or ignored.

Parameters:
  • html_data (str) – The HTML for which to redact values.

  • generator_config (Dict[str, PiiState]) – A dictionary of entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected values.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for any entity type that is not included in generator_config.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

  • label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). The ignored values are regular expressions. When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.

  • label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). The additional values are regular expressions. When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.

Returns:

The redacted string plus additional information.

Return type:

RedactionResponse
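
The block and allow lists are both described as lists of regular expressions keyed by entity type. The following sketch illustrates one plausible reading of those descriptions using only the standard library. It is not how Textual applies the lists internally: the function apply_label_lists, the (start, end, label) span tuples, and the choice of re.search for block patterns are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, exclusive end, entity type)


def apply_label_lists(
    text: str,
    detected: List[Span],
    label_allow_lists: Optional[Dict[str, List[str]]] = None,
    label_block_lists: Optional[Dict[str, List[str]]] = None,
) -> List[Span]:
    """Hypothetical helper: filter and extend detected spans with regex lists."""
    allow = label_allow_lists or {}
    block = label_block_lists or {}
    kept: List[Span] = []
    for start, end, label in detected:
        value = text[start:end]
        # Assumption: a value is ignored if any block pattern is found in it.
        if any(re.search(p, value) for p in block.get(label, [])):
            continue
        kept.append((start, end, label))
    for label, patterns in allow.items():
        for pattern in patterns:
            # Every regex match in the text is marked as the entity type.
            for match in re.finditer(pattern, text):
                span = (match.start(), match.end(), label)
                if span not in kept:
                    kept.append(span)
    return sorted(kept)
```

Whether the service matches patterns against the whole entity value or searches within the surrounding text is not stated on this page, so patterns are worth testing against real responses.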

redact_json(
json_data: str | dict,
generator_config: Dict[str, PiiState] = {},
generator_default: PiiState = PiiState.Redaction,
random_seed: int | None = None,
label_block_lists: Dict[str, List[str]] | None = None,
label_allow_lists: Dict[str, List[str]] | None = None,
jsonpath_allow_lists: Dict[str, List[str]] | None = None,
) -> RedactionResponse

Redacts the values in a JSON blob. Depending on the configured handling for each sensitive data type, values can be either redacted, synthesized, or ignored.

Parameters:
  • json_data (Union[str, dict]) – The JSON whose values will be redacted. This can be either a JSON string or a Python dictionary.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types not specified in generator_config.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

  • label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.

  • label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.

  • jsonpath_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, path expression). When an element in the JSON document matches the JSON path expression, the entire text value is treated as the specified entity type. Only supported for path expressions that point to JSON primitive values. This setting overrides any results found by the NER model or in label allow and block lists. If multiple path expressions point to the same JSON node but specify different entity types, the value is redacted as one of those types, and the type is chosen at random from among them.

Returns:

The redacted string along with ancillary information.

Return type:

RedactionResponse
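
To illustrate the shape of jsonpath_allow_lists and the restriction to primitive values, the sketch below resolves a tiny subset of JSONPath (dotted keys such as $.user.name) with the standard library. It is a simplified stand-in, not Textual's path engine; resolve_simple_path and label_by_path are hypothetical names.

```python
import json
from typing import Any, Dict, List, Tuple, Union


def resolve_simple_path(document: Any, path: str) -> Any:
    """Resolve a dotted path such as "$.user.name" (a tiny JSONPath subset)."""
    if not path.startswith("$."):
        raise ValueError(f"Unsupported path in this sketch: {path!r}")
    node = document
    for key in path[2:].split("."):
        node = node[key]
    return node


def label_by_path(
    json_data: Union[str, dict],
    jsonpath_allow_lists: Dict[str, List[str]],
) -> Dict[str, Tuple[str, Any]]:
    """Map each path that points to a primitive to (entity type, value)."""
    document = json.loads(json_data) if isinstance(json_data, str) else json_data
    labeled: Dict[str, Tuple[str, Any]] = {}
    for label, paths in jsonpath_allow_lists.items():
        for path in paths:
            value = resolve_simple_path(document, path)
            # Only primitive values qualify; objects and arrays are skipped.
            if isinstance(value, (dict, list)):
                continue
            labeled[path] = (label, value)
    return labeled
```

In this sketch, a path listed under two entity types simply keeps the last one seen, which differs from the random selection described above.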

redact_xml(
xml_data: str,
generator_config: Dict[str, PiiState] = {},
generator_default: PiiState = PiiState.Redaction,
random_seed: int | None = None,
label_block_lists: Dict[str, List[str]] | None = None,
label_allow_lists: Dict[str, List[str]] | None = None,
) -> RedactionResponse

Redacts the values in an XML blob. Depending on the configured handling for each entity type, values are either redacted, synthesized, or ignored.

Parameters:
  • xml_data (str) – The XML for which to redact values.

  • generator_config (Dict[str, PiiState]) – A dictionary of entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected values.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for any entity type that is not included in generator_config.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

  • label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.

  • label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.

Returns:

The redacted string plus additional information.

Return type:

RedactionResponse

send_redact_request(
endpoint: str,
payload: Dict,
random_seed: int | None = None,
) -> RedactionResponse

Helper function to send redact requests, handle responses, and catch errors.

start_file_redaction(
file: IOBase,
file_name: str,
) -> str

Redacts a provided file.

Parameters:
  • file (io.IOBase) – The opened file, available for reading, to upload and redact.

  • file_name (str) – The name of the file.

Returns:

The job ID, which can be used to download the redacted file once it is ready.

Return type:

str
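
The job ID returned here is the input to download_redacted_file, so the two calls are usually chained. The sketch below shows that flow against any object that exposes the two documented methods. The wrapper redact_file_bytes is a hypothetical name, and the client is passed in so that the example does not depend on a live Textual instance.

```python
import io
from typing import Any


def redact_file_bytes(
    client: Any,
    file: io.IOBase,
    file_name: str,
    num_retries: int = 6,
    wait_between_retries: int = 10,
) -> bytes:
    """Hypothetical wrapper: upload a file, then fetch its redacted bytes."""
    # Step 1: start the redaction job and keep the returned job ID.
    job_id = client.start_file_redaction(file, file_name)
    # Step 2: download the result, retrying while the job is still running.
    return client.download_redacted_file(
        job_id,
        num_retries=num_retries,
        wait_between_retries=wait_between_retries,
    )
```

With a real client, client would be a TextualNer instance and file an opened file object; opening the file in binary mode is an assumption here, since this page only says that the file must be available for reading.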

unredact(
redacted_string: str,
random_seed: int | None = None,
) -> str

Removes the redaction from a provided string. Returns the string with the original values.

Parameters:
  • redacted_string (str) – The redacted string from which to remove the redaction.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The string with the redaction removed.

Return type:

str

unredact_bulk(
redacted_strings: List[str],
random_seed: int | None = None,
) -> List[str]

Removes redaction from a list of strings. Returns the strings with the original values.

Parameters:
  • redacted_strings (List[str]) – The list of redacted strings from which to remove the redaction.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The list of strings with the redaction removed.

Return type:

List[str]

Dataset class

class tonic_textual.classes.dataset.Dataset(
client: HttpClient,
id: str,
name: str,
files: List[Dict[str, Any]],
generator_config: Dict[str, PiiState] | None = None,
label_block_lists: Dict[str, List[str]] | None = None,
label_allow_lists: Dict[str, List[str]] | None = None,
)

Class to represent and provide access to a Tonic Textual dataset.

Parameters:
  • id (str) – Dataset identifier.

  • name (str) – Dataset name.

  • files (List[Dict[str, Any]]) – Serialized DatasetFile objects representing the files in a dataset.

  • client (HttpClient) – The HTTP client to use.

add_file(
file_path: str | None = None,
file_name: str | None = None,
file: IOBase | None = None,
) -> DatasetFile | None

Uploads a file to the dataset.

Parameters:
  • file_path (Optional[str]) – The absolute path of the file to upload. If specified, you cannot also provide the ‘file’ argument.

  • file_name (Optional[str]) – The name of the file to save to Tonic Textual. This is optional when you upload a file via file_path, but required when you use the ‘file’ argument.

  • file (Optional[io.IOBase]) – The bytes of a file to upload. If specified, you must also provide the ‘file_name’ argument. The ‘file_path’ argument cannot be used in the same call.

Raises:

DatasetFileMatchesExistingFile – Raised if the file content matches an existing file.
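
The three arguments constrain one another, which is easy to get wrong at the call site. The sketch below encodes the documented rules as a pre-flight check. The function check_add_file_args is hypothetical, and falling back to the base name of file_path when file_name is omitted is an assumption, not documented behavior.

```python
import os
from typing import Any, Optional


def check_add_file_args(
    file_path: Optional[str] = None,
    file_name: Optional[str] = None,
    file: Optional[Any] = None,
) -> str:
    """Hypothetical check: validate the arguments and return the name to use."""
    if file_path is not None and file is not None:
        raise ValueError("Provide either file_path or file, not both.")
    if file_path is None and file is None:
        raise ValueError("Provide file_path or file.")
    if file is not None and file_name is None:
        raise ValueError("file_name is required when file is provided.")
    if file_name is not None:
        return file_name
    # Assumption: with only file_path given, default to its base name.
    return os.path.basename(file_path)
```

Running a check like this before calling add_file surfaces an invalid combination locally instead of after an upload attempt.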

delete_file(
file_id: str,
)

Deletes the given file from the dataset.

Parameters:

file_id (str) – The ID of the file in the dataset to delete.

describe() -> str

Returns a string of the dataset name, identifier, and the list of files.

Examples

>>> dataset.describe()
Dataset: your_dataset_name [dataset_id]
Number of Files: 2
Number of Rows: 1000

edit(
name: str | None = None,
generator_config: Dict[str, PiiState] | None = None,
label_block_lists: Dict[str, List[str]] | None = None,
label_allow_lists: Dict[str, List[str]] | None = None,
should_rescan=True,
)

Edits the dataset. Only the fields that are provided as function arguments are edited. Currently supports editing the name of the dataset and the generator setup (how each entity is handled during redaction or synthesis).

Parameters:
  • name (Optional[str]) – The new name of the dataset. Returns an error if the new name conflicts with an existing dataset name.

  • generator_config (Optional[Dict[str, PiiState]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.

  • label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a regular expression in the list, the value is ignored and is not redacted or synthesized.

  • label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, included values). When a piece of text matches a regular expression in the list, the text is marked as the entity type and is included in the redaction or synthesis.

Raises:

DatasetNameAlreadyExists – Raised if a dataset with the same name already exists.

fetch_all_df()

Fetches all of the data in the dataset as a pandas dataframe.

Returns:

Dataset data in a pandas dataframe.

Return type:

pd.DataFrame

fetch_all_json() -> str

Fetches all of the data in the dataset as JSON.

Returns:

Dataset data in JSON format.

Return type:

str

get_failed_files() -> List[DatasetFile]

Gets all of the files in the dataset that encountered an error when they were processed. These files are effectively ignored.

Returns:

The list of files that had processing errors.

Return type:

List[DatasetFile]

get_processed_files() -> List[DatasetFile]

Gets all of the files in the dataset for which processing is complete. The data in these files is returned when data is requested.

Returns:

The list of processed dataset files.

Return type:

List[DatasetFile]

get_queued_files() -> List[DatasetFile]

Gets all of the files in the dataset that are waiting to be processed.

Returns:

The list of dataset files that await processing.

Return type:

List[DatasetFile]

get_running_files() -> List[DatasetFile]

Gets all of the files in the dataset that are currently being processed.

Returns:

The list of files that are being processed.

Return type:

List[DatasetFile]
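
The four getters above partition a dataset's files by processing state. As a rough illustration, the sketch below groups stand-in file records by the processing_status values documented for DatasetFile. The FileInfo class and group_by_status function are hypothetical, and the correspondence between each getter and a status value (for example, get_failed_files and 'Failed') is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class FileInfo:
    """Stand-in for DatasetFile with only the fields this sketch needs."""

    name: str
    processing_status: str


def group_by_status(files: Iterable[FileInfo]) -> Dict[str, List[str]]:
    """Group file names by their processing_status value."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for item in files:
        groups[item.processing_status].append(item.name)
    return dict(groups)
```

A grouping like this can be handy for a quick progress summary while files are still being scanned.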

DatasetFile class

class tonic_textual.classes.datasetfile.DatasetFile(
client: HttpClient,
id: str,
dataset_id: str,
name: str,
num_rows: int | None,
num_columns: int,
processing_status: str,
processing_error: str | None,
label_allow_lists: Dict[str, LabelCustomList] | None = None,
)

Class to store the metadata for a dataset file.

Parameters:
  • id (str) – The identifier of the dataset file.

  • name (str) – The file name of the dataset file.

  • num_rows (Optional[int]) – The number of rows in the dataset file.

  • num_columns (int) – The number of columns in the dataset file.

  • processing_status (str) – The status of the dataset file in the processing pipeline. Possible values are ‘Completed’, ‘Failed’, ‘Cancelled’, ‘Running’, and ‘Queued’.

  • processing_error (Optional[str]) – If the dataset file processing failed, a description of the issue that caused the failure.

  • label_allow_lists (Dict[str, LabelCustomList]) – A dictionary of custom entity detection regular expressions for the dataset file. Each key is the entity type to detect, and each value is a LabelCustomList object whose regular expressions are recognized as that entity type.

describe() -> str

Returns the dataset file metadata as a string. Includes the identifier, file name, number of rows, and number of columns.

download(
random_seed: int | None = None,
num_retries: int = 6,
wait_between_retries: int = 10,
) -> bytes

Downloads a redacted file.

Parameters:
  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

  • num_retries (int = 6) – The number of times to attempt to download the file. If the file is not yet ready for download, there is a pause (10 seconds by default) before the next attempt. The default value is 6.

  • wait_between_retries (int = 10) – The number of seconds to wait between retry attempts. The default value is 10.

Returns:

The redacted file as a byte array.

Return type:

bytes

Redaction response

class tonic_textual.classes.redact_api_responses.redaction_response.RedactionResponse(
original_text: str,
redacted_text: str,
usage: int,
de_identify_results: List[Replacement],
)

Redaction response object.

Variables:
  • original_text (str) – The original text.

  • redacted_text (str) – The redacted/synthesized text.

  • usage (int) – The number of words used.

  • de_identify_results (List[Replacement]) – The list of named entities found in original_text.
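
Each entry in de_identify_results carries start and end offsets into original_text, with an exclusive end index, so the offsets map directly onto Python slices. The sketch below shows that relationship with plain tuples. The helper apply_replacements and the bracketed placeholder strings are hypothetical, and the sketch assumes that start and end are valid Python string indices (the Replacement class lists python_start and python_end for cases where they differ).

```python
from typing import List, Tuple

# (start, exclusive end, replacement text) for one detected entity.
Edit = Tuple[int, int, str]


def apply_replacements(original_text: str, edits: List[Edit]) -> str:
    """Rebuild a redacted string from exclusive-end spans."""
    result = original_text
    # Work from right to left so that earlier offsets remain valid.
    for start, end, new_text in sorted(edits, reverse=True):
        result = result[:start] + new_text + result[end:]
    return result


original = "My name is John Smith"
edits = [(11, 15, "[NAME_GIVEN]"), (16, 21, "[NAME_FAMILY]")]
redacted = apply_replacements(original, edits)
```

Because the end index is exclusive, original[11:15] is exactly "John" with no off-by-one adjustment.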

class tonic_textual.classes.common_api_responses.replacement.Replacement(
start: int,
end: int,
new_start: int,
new_end: int,
label: str,
text: str,
score: float,
language: str,
new_text: str | None = None,
example_redaction: str | None = None,
json_path: str | None = None,
xml_path: str | None = None,
)

A span of text that has been detected as a named entity.

Variables:
  • start (int) – The start index of the entity in the original text.

  • end (int) – The end index of the entity in the original text. The end index is exclusive.

  • new_start (int) – The start index of the entity in the redacted/synthesized text.

  • new_end (int) – The end index of the entity in the redacted/synthesized text. The end index is exclusive.

  • python_start (Optional[int]) – The start index in Python (if different from start).

  • python_end (Optional[int]) – The end index in Python (if different from end).

  • label (str) – The label of the entity.

  • text (str) – The substring of the original text that was detected as an entity.

  • new_text (Optional[str]) – The new text to replace the original entity.

  • score (float) – The confidence score of the detection.

  • language (str) – The language of the entity.

  • example_redaction (Optional[str]) – An example redaction for the entity.

  • json_path (Optional[str]) – The JSON path of the entity in the original JSON document. This is only present if the input text was a JSON document.

  • xml_path (Optional[str]) – The XPath of the entity in the original XML document. This is only present if the input text was an XML document. NOTE: Arrays in XPath are 1-based.
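
The 1-based note on xml_path matters when you map a path back to Python data, where lists are zero-based. The short sketch below demonstrates the difference with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The sample document is made up for this example, and the paths that Textual returns may use XPath syntax that ElementTree does not support.

```python
import xml.etree.ElementTree as ET

xml_data = "<people><person>John</person><person>Jane</person></people>"
root = ET.fromstring(xml_data)

# XPath positions start at 1, so person[1] is the first <person> element.
first = root.find("person[1]")
second = root.find("person[2]")

# The same elements through zero-based Python list indexing.
people = root.findall("person")
```

In other words, an XPath index n corresponds to Python list index n - 1.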