📁 Datasets

A dataset is a collection of files that are all redacted and synthesized in the same way. Datasets are a helpful organization tool to ensure that you can easily track a collections of files and how sensitive data is removed from those files.

Typically, you configure datasets from the Textual application, but for ease of use, the SDK supports many dataset operations. However, some operations can only be performed from the Textual application.

Creating a dataset

To create a dataset:

from tonic_textual.redact_api import TextualNer

textual = TonicTextual("https://textual.tonic.ai")

dataset = textual.create_dataset('my_dataset')

Retrieving an existing dataset

To retrieve an existing dataset by the dataset name:

dataset = textual.get_dataset('my_dataset')

Editing a dataset

You can use the SDK to edit a dataset. However, not all dataset properties can be edited from the SDK.

The following snippet renames the dataset and disables modification of entities that are tagged as ORGANIZATION.

dataset.edit(name='new_dataset_name', generator_config={'ORGANIZATION': 'Off'})

Uploading files to a dataset

You can upload files to your dataset from the SDK. Provide the complete path to the file, and the complete name of the file as you want it to appear in Textual.

dataset.add_file('<path to file>','<file name>')

Viewing the list of files in a dataset

To get the list of files in a dataset, view the files property of the dataset.

To filter dataset files based on their processing status, call:

get_failed_files
get_running_files
get_queued_files
get_processed_files

Downloading a redacted dataset file

To download the redacted or synthesized version of the file, get the specific file from the dataset, then call the download function.

For example:

files = dataset.get_processed_files()
for file in files:
    file_bytes = file.download()
    with open('<file name>', 'wb') as f:
        f.write(file_bytes)

To download a specific file in a dataset that you fetch by name:

file = txt_file = list(filter(lambda x: x.name=='<file to download>', dataset.files))[0]
file_bytes = file.download()
with open('<file name>', 'wb') as f:
    f.write(file_bytes)