tanaos-text-anonymizer-v1: A small but performant Text Anonymization model
This model was created by Tanaos with the Artifex Python library.
This is a Named Entity Recognition model based on tanaos/tanaos-NER-v1 and fine-tuned on a synthetic dataset to recognize Personally Identifiable Information (PII) entities in text. Once identified, the entities are redacted to protect privacy and confidentiality before the text is shared or processed.
While the base NER model was trained to recognize 14 named entity categories, this Text Anonymization model was fine-tuned to focus on the following 5 PII entity categories, which are commonly found in text data and are critical for anonymization:
| Entity | Description |
|---|---|
| PERSON | Individual people, fictional characters |
| LOCATION | Geographical areas |
| DATE | Absolute or relative dates, including years, months and/or days |
| ADDRESS | Full addresses |
| PHONE_NUMBER | Telephone numbers |
How to Use
Use this model for free via the Tanaos API in 3 simple steps:
- Sign up for a free account at https://platform.tanaos.com/
- Create a free API Key from the API Keys section
- Replace `<YOUR_API_KEY>` in the code below with your API Key and use this snippet:
import requests

session = requests.Session()
ta_out = session.post(
    "https://slm.tanaos.com/models/text-anonymization",
    headers={
        "X-API-Key": "<YOUR_API_KEY>",
    },
    json={
        "text": "John Doe lives at 123 Main St, New York. His phone number is (555) 123-4567.",
        "include_mask_type": True,
        "include_mask_counter": True
    }
)
print(ta_out.json()["data"])
# >>> ['[MASKED_PERSON_3] lives at [MASKED_ADDRESS_2] [MASKED_LOCATION_1] His phone number is [MASKED_PHONE_NUMBER_0]']
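To make the effect of `include_mask_type` and `include_mask_counter` more concrete, here is a minimal local sketch of how placeholders of the form `[MASKED_<TYPE>_<N>]` could be built from entity spans. This is not the Tanaos implementation: the `mask_entities` helper, the span format, and the right-to-left numbering are assumptions, chosen only because they are consistent with the sample output above, where the last entity carries counter 0.

```python
from typing import List, Tuple

# Hypothetical span format: (start offset, end offset (exclusive), entity label).
Span = Tuple[int, int, str]


def mask_entities(
    text: str,
    spans: List[Span],
    include_mask_type: bool = True,
    include_mask_counter: bool = True,
) -> str:
    """Replace each span with a placeholder such as [MASKED_PERSON_0]."""
    masked = text
    # Work from the rightmost span to the leftmost so that the character
    # offsets of the spans still to be replaced remain valid.
    ordered = sorted(spans, key=lambda span: span[0], reverse=True)
    for counter, (start, end, label) in enumerate(ordered):
        parts = ["MASKED"]
        if include_mask_type:
            parts.append(label)
        if include_mask_counter:
            parts.append(str(counter))
        masked = masked[:start] + "[" + "_".join(parts) + "]" + masked[end:]
    return masked


text = "John Doe lives at 123 Main St, New York."
# Spans are located with str.index here; a real pipeline would take them
# from the model's predictions instead.
entities = [("John Doe", "PERSON"), ("123 Main St", "ADDRESS"), ("New York", "LOCATION")]
spans = [(text.index(s), text.index(s) + len(s), label) for s, label in entities]

print(mask_entities(text, spans))
print(mask_entities(text, spans, include_mask_type=False, include_mask_counter=False))
```

With both flags disabled, every entity collapses to a plain `[MASKED]` placeholder, which hides the entity type as well as the value.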
Model Description
- Base model: FacebookAI/roberta-base
- Task: Token classification (Named Entity Recognition for Text Anonymization)
- Languages: English
- Fine-tuning data: A custom synthetic dataset of around 10,000 passages, each containing multiple named entities across the 5 Personally Identifiable Information categories.
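Because the task is token classification, the model assigns a label to each token, and consecutive labels have to be merged into whole entities before anything can be redacted. The sketch below illustrates that merging step under the assumption of a BIO tagging scheme (`B-PERSON`, `I-PERSON`, `O`). The tag names and the `bio_to_entities` helper are hypothetical and are not taken from the model's actual label set.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token BIO tags into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if there is one.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            entities.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and tag[2:] == current_label:
            # An I- tag with the same label continues the current entity.
            current_tokens.append(token)
        elif tag.startswith("I-"):
            # An I- tag with no matching open entity is treated as a new one.
            flush()
            current_tokens, current_label = [token], tag[2:]
        else:
            # "O" (outside) ends any open entity.
            flush()
    flush()
    return entities


tokens = ["John", "Doe", "lives", "in", "New", "York"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "B-LOCATION", "I-LOCATION"]
print(bio_to_entities(tokens, tags))
```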
Training Details
This model was trained using the Artifex Python library, which can be installed with:
pip install artifex
Training consisted of providing the following instructions and generating 10,000 synthetic training samples:
from artifex import Artifex

ta = Artifex().text_anonymization
ta.train(
    domain="general",
    num_samples=10000
)
Intended Uses
This model is intended to:
- Anonymize text data by redacting personally identifiable information (PII) such as names, addresses, phone numbers, dates, and locations.
- Ensure privacy and confidentiality in text data for compliance with data protection regulations.
- Be used before sharing or processing text data to protect sensitive information.
- Support GDPR compliance when handling personal data.
Not intended for:
- Scenarios involving highly specialized or domain-specific text without further fine-tuning.