PII NER Azerbaijani v2
PII NER Azerbaijani v2 is the second version of a fine-tuned Named Entity Recognition (NER) model (first version: PII NER Azerbaijani) based on XLM-RoBERTa. It is trained on Azerbaijani PII data to detect personally identifiable information such as names, dates of birth, cities, addresses, and phone numbers in text.
Model Details
Base Model: XLM-RoBERTa (FacebookAI/xlm-roberta-base)
Training Metrics:
Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
---|---|---|---|---|---|
1 | 0.029100 | 0.025319 | 0.963367 | 0.962449 | 0.962907 |
2 | 0.019900 | 0.023291 | 0.964567 | 0.968474 | 0.966517 |
3 | 0.015400 | 0.018993 | 0.969536 | 0.967555 | 0.968544 |
4 | 0.012700 | 0.017730 | 0.971919 | 0.969768 | 0.970842 |
5 | 0.011100 | 0.018095 | 0.973056 | 0.970075 | 0.971563 |
Test Metrics:
Precision: 0.9760
Recall: 0.9732
F1 Score: 0.9746
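The F1 score is the harmonic mean of precision and recall, so the reported test value can be checked from the other two numbers. A minimal sketch (the helper name `f1_score` is illustrative, not part of the model's code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set precision and recall as reported above.
test_f1 = f1_score(0.9760, 0.9732)
print(f"{test_f1:.4f}")  # 0.9746
```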
Detailed Test Classification Report
Entity | Precision | Recall | F1-score | Support |
---|---|---|---|---|
AGE | 0.98 | 0.98 | 0.98 | 509 |
BUILDINGNUM | 0.97 | 0.75 | 0.85 | 1285 |
CITY | 1.00 | 1.00 | 1.00 | 2100 |
CREDITCARDNUMBER | 0.99 | 0.98 | 0.99 | 249 |
DATE | 0.85 | 0.92 | 0.88 | 1576 |
DRIVERLICENSENUM | 0.98 | 0.98 | 0.98 | 258 |
EMAIL | 0.98 | 1.00 | 0.99 | 1485 |
GIVENNAME | 0.99 | 1.00 | 0.99 | 9926 |
IDCARDNUM | 0.99 | 0.99 | 0.99 | 1174 |
PASSPORTNUM | 0.99 | 0.99 | 0.99 | 426 |
STREET | 0.94 | 0.98 | 0.96 | 1480 |
SURNAME | 1.00 | 1.00 | 1.00 | 3357 |
TAXNUM | 0.99 | 1.00 | 0.99 | 240 |
TELEPHONENUM | 0.97 | 0.95 | 0.96 | 2175 |
TIME | 0.96 | 0.96 | 0.96 | 2216 |
ZIPCODE | 0.97 | 0.97 | 0.97 | 520 |
Averages
Metric | Precision | Recall | F1-score | Support |
---|---|---|---|---|
Micro avg | 0.98 | 0.97 | 0.97 | 28976 |
Macro avg | 0.97 | 0.96 | 0.97 | 28976 |
Weighted avg | 0.98 | 0.97 | 0.97 | 28976 |
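The macro average treats every entity type equally, while the weighted average scales each type by its support. The sketch below recomputes both from the per-entity rows of the report. Because those rows are rounded to two decimals, the results only approximate the averages computed from the raw predictions.

```python
# (precision, recall, support) per entity, copied from the classification report.
report = {
    "AGE": (0.98, 0.98, 509),
    "BUILDINGNUM": (0.97, 0.75, 1285),
    "CITY": (1.00, 1.00, 2100),
    "CREDITCARDNUMBER": (0.99, 0.98, 249),
    "DATE": (0.85, 0.92, 1576),
    "DRIVERLICENSENUM": (0.98, 0.98, 258),
    "EMAIL": (0.98, 1.00, 1485),
    "GIVENNAME": (0.99, 1.00, 9926),
    "IDCARDNUM": (0.99, 0.99, 1174),
    "PASSPORTNUM": (0.99, 0.99, 426),
    "STREET": (0.94, 0.98, 1480),
    "SURNAME": (1.00, 1.00, 3357),
    "TAXNUM": (0.99, 1.00, 240),
    "TELEPHONENUM": (0.97, 0.95, 2175),
    "TIME": (0.96, 0.96, 2216),
    "ZIPCODE": (0.97, 0.97, 520),
}

total_support = sum(s for _, _, s in report.values())

# Macro average: unweighted mean over entity types.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)

# Weighted average: each entity type contributes in proportion to its support.
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support
weighted_recall = sum(r * s for _, r, s in report.values()) / total_support

print(total_support)  # 28976
print(f"{macro_precision:.2f} {weighted_precision:.2f} {weighted_recall:.2f}")
```

GIVENNAME alone accounts for roughly a third of the support, which is why the weighted averages stay high even though BUILDINGNUM recall is only 0.75.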
The model recognizes the following entity types:
[
"AGE",
"BUILDINGNUM",
"CITY",
"CREDITCARDNUMBER",
"DATE",
"DRIVERLICENSENUM",
"EMAIL",
"GIVENNAME",
"IDCARDNUM",
"PASSPORTNUM",
"STREET",
"SURNAME",
"TAXNUM",
"TELEPHONENUM",
"TIME",
"ZIPCODE"
]
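The usage code in the next section hard-codes a BIO label map with 33 ids: `O` at id 0, then one `B-` (begin) label per entity type, then one `I-` (inside) label per entity type, both in the order of the list above. A minimal sketch that derives the same mapping from the list (the names `ENTITY_TYPES` and `build_id_to_label` are illustrative; whether the published checkpoint's config stores this mapping is not stated here):

```python
from typing import Dict, List

ENTITY_TYPES: List[str] = [
    "AGE", "BUILDINGNUM", "CITY", "CREDITCARDNUMBER",
    "DATE", "DRIVERLICENSENUM", "EMAIL", "GIVENNAME",
    "IDCARDNUM", "PASSPORTNUM", "STREET", "SURNAME",
    "TAXNUM", "TELEPHONENUM", "TIME", "ZIPCODE",
]


def build_id_to_label(entity_types: List[str]) -> Dict[int, str]:
    """Return {id: label} with 'O' first, then all B- labels, then all I- labels."""
    labels = (
        ["O"]
        + [f"B-{name}" for name in entity_types]
        + [f"I-{name}" for name in entity_types]
    )
    return dict(enumerate(labels))


id_to_label = build_id_to_label(ENTITY_TYPES)
print(len(id_to_label))                  # 33
print(id_to_label[7], id_to_label[23])   # B-EMAIL I-EMAIL
```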
Usage
To use the model for PII entity recognition:
The model is trained on lowercase text. The code below lowercases the input automatically. If you write custom inference code, lowercase the input yourself before tokenizing.
import torch
from transformers import AutoModelForTokenClassification, XLMRobertaTokenizerFast
import numpy as np
from typing import List, Dict, Tuple


class AzerbaijaniNER:
    def __init__(self, model_name_or_path="LocalDoc/private_ner_azerbaijani_v2"):
        self.model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)
        self.tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

        # BIO label map: "O", then B- labels (1-16), then I- labels (17-32).
        self.id_to_label = {
            0: "O",
            1: "B-AGE", 2: "B-BUILDINGNUM", 3: "B-CITY", 4: "B-CREDITCARDNUMBER",
            5: "B-DATE", 6: "B-DRIVERLICENSENUM", 7: "B-EMAIL", 8: "B-GIVENNAME",
            9: "B-IDCARDNUM", 10: "B-PASSPORTNUM", 11: "B-STREET", 12: "B-SURNAME",
            13: "B-TAXNUM", 14: "B-TELEPHONENUM", 15: "B-TIME", 16: "B-ZIPCODE",
            17: "I-AGE", 18: "I-BUILDINGNUM", 19: "I-CITY", 20: "I-CREDITCARDNUMBER",
            21: "I-DATE", 22: "I-DRIVERLICENSENUM", 23: "I-EMAIL", 24: "I-GIVENNAME",
            25: "I-IDCARDNUM", 26: "I-PASSPORTNUM", 27: "I-STREET", 28: "I-SURNAME",
            29: "I-TAXNUM", 30: "I-TELEPHONENUM", 31: "I-TIME", 32: "I-ZIPCODE"
        }

        # Human-readable names for each entity type.
        self.entity_types = {
            "AGE": "Age",
            "BUILDINGNUM": "Building Number",
            "CITY": "City",
            "CREDITCARDNUMBER": "Credit Card Number",
            "DATE": "Date",
            "DRIVERLICENSENUM": "Driver License Number",
            "EMAIL": "Email",
            "GIVENNAME": "Given Name",
            "IDCARDNUM": "ID Card Number",
            "PASSPORTNUM": "Passport Number",
            "STREET": "Street",
            "SURNAME": "Surname",
            "TAXNUM": "Tax ID Number",
            "TELEPHONENUM": "Phone Number",
            "TIME": "Time",
            "ZIPCODE": "Zip Code"
        }

    def predict(self, text: str, max_length: int = 512) -> List[Dict]:
        # The model expects lowercase input; all offsets below refer to the lowercased text.
        text = text.lower()
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_offsets_mapping=True
        )
        offset_mapping = inputs.pop("offset_mapping").numpy()[0]
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)
        predictions = outputs.logits.argmax(dim=2)
        predictions = predictions[0].cpu().numpy()

        entities = []
        current_entity = None
        for idx, (offset, pred_id) in enumerate(zip(offset_mapping, predictions)):
            # Special and padding tokens have an empty (0, 0) offset; skip them.
            if offset[0] == 0 and offset[1] == 0:
                continue
            pred_label = self.id_to_label[pred_id]

            if pred_label.startswith("B-"):
                # A new entity starts; close the one in progress, if any.
                if current_entity:
                    entities.append(current_entity)
                entity_type = pred_label[2:]
                current_entity = {
                    "label": entity_type,
                    "name": self.entity_types.get(entity_type, entity_type),
                    "start": int(offset[0]),
                    "end": int(offset[1]),
                    "value": text[offset[0]:offset[1]]
                }
            elif pred_label.startswith("I-") and current_entity is not None:
                entity_type = pred_label[2:]
                if entity_type == current_entity["label"]:
                    # Same type: extend the current entity to cover this token.
                    current_entity["end"] = int(offset[1])
                    current_entity["value"] = text[current_entity["start"]:current_entity["end"]]
                else:
                    # Type mismatch: close the current entity and drop this token.
                    entities.append(current_entity)
                    current_entity = None
            elif pred_label == "O" and current_entity is not None:
                entities.append(current_entity)
                current_entity = None

        if current_entity:
            entities.append(current_entity)
        return entities

    def anonymize_text(self, text: str, replacement_char: str = "X") -> Tuple[str, List[Dict]]:
        entities = self.predict(text)
        if not entities:
            return text, []

        # Replace from the end of the text so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)
        anonymized_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            length = end - start
            anonymized_text = anonymized_text[:start] + replacement_char * length + anonymized_text[end:]

        entities.sort(key=lambda x: x["start"])
        return anonymized_text, entities

    def highlight_entities(self, text: str) -> str:
        entities = self.predict(text)
        if not entities:
            return text

        # Insert markers from the end of the text so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)
        highlighted_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            entity_value = entity["value"]
            entity_type = entity["name"]
            highlighted_text = (
                highlighted_text[:start] +
                f"[{entity_type}: {entity_value}]" +
                highlighted_text[end:]
            )
        return highlighted_text


if __name__ == "__main__":
    ner = AzerbaijaniNER()
    test_text = """Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?"""

    print("=== Original Text ===")
    print(test_text)

    print("\n=== Found Entities ===")
    entities = ner.predict(test_text)
    for entity in entities:
        print(f"{entity['name']}: {entity['value']} (positions {entity['start']}-{entity['end']})")

    print("\n=== Text with Highlighted Entities ===")
    highlighted_text = ner.highlight_entities(test_text)
    print(highlighted_text)

    print("\n=== Anonymized Text ===")
    anonymized_text, _ = ner.anonymize_text(test_text)
    print(anonymized_text)
=== Original Text ===
Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
=== Found Entities ===
Given Name: əli (positions 18-21)
Surname: hüseynov (positions 22-30)
Date: 15.05.1990 (positions 48-58)
City: bakı (positions 64-68)
Street: 28 may küçəsi (positions 80-93)
Building Number: 4 (positions 94-95)
Phone Number: +994552345678 (positions 132-145)
Credit Card Number: 4169741358254152 (positions 155-171)
=== Text with Highlighted Entities ===
Salam, mənim adım [Given Name: əli] [Surname: hüseynov]du. Doğum tarixim [Date: 15.05.1990]-dır. [City: bakı] şəhərində, [Street: 28 may küçəsi] [Building Number: 4] ünvanında yaşayıram. Telefon nömrəm [Phone Number: +994552345678]-dir. Mən [Credit Card Number: 4169741358254152] nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
=== Anonymized Text ===
Salam, mənim adım XXX XXXXXXXXdu. Doğum tarixim XXXXXXXXXX-dır. XXXX şəhərində, XXXXXXXXXXXXX X ünvanında yaşayıram. Telefon nömrəm XXXXXXXXXXXXX-dir. Mən XXXXXXXXXXXXXXXX nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
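One caveat when adapting the usage code: `predict` computes offsets on `text.lower()`, while `anonymize_text` and `highlight_entities` apply those offsets to the original string. In Python, lowercasing the Azerbaijani capital `İ` yields two code points (`i` plus a combining dot), so every offset after an `İ` shifts by one. The sketch below shows one possible way to map spans back to the original text. The helpers `lower_with_alignment` and `to_original_span` are hypothetical and not part of the model's code; lowercasing character by character is assumed to be equivalent to `str.lower()` for Azerbaijani text (it can differ in context-sensitive cases such as Greek final sigma).

```python
from typing import List, Tuple


def lower_with_alignment(text: str) -> Tuple[str, List[int]]:
    """Lowercase text and record, for each lowered character, its source index."""
    lowered_parts: List[str] = []
    origin: List[int] = []
    for i, ch in enumerate(text):
        low = ch.lower()                # may expand to more than one character
        lowered_parts.append(low)
        origin.extend([i] * len(low))   # every produced character points back to i
    return "".join(lowered_parts), origin


def to_original_span(start: int, end: int, origin: List[int]) -> Tuple[int, int]:
    """Map a non-empty [start, end) span in the lowered text to the original text."""
    return origin[start], origin[end - 1] + 1


text = "İsmayıl İstanbulda yaşayır"
lowered, origin = lower_with_alignment(text)

# Pretend the model returned the span of the second word's stem in the lowered text.
needle = "i\u0307stanbul"
start = lowered.index(needle)
end = start + len(needle)

orig_start, orig_end = to_original_span(start, end, origin)
print(text[orig_start:orig_end])  # İstanbul
print(text[start:end])            # stanbulda (unmapped offsets are shifted)
```

With such a mapping, the `start` and `end` values returned by `predict` could be converted before masking or highlighting the original string.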
CC BY 4.0 License — What It Allows
The Creative Commons Attribution 4.0 International (CC BY 4.0) license allows:
✅ You Can:
- Use the model for any purpose, including commercial use.
- Share it — copy and redistribute in any medium or format.
- Adapt it — remix, transform, and build upon it for any purpose, even commercially.
📝 You Must:
- Give appropriate credit — Attribute the original creator (e.g., name, link to the license, and indicate if changes were made).
- Not imply endorsement — Do not suggest the original author endorses you or your use.
❌ You Cannot:
- Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).
Summary:
You are free to use, modify, and distribute the model — even for commercial purposes — as long as you give proper credit to the original creator.
For more information, please refer to the CC BY 4.0 license.
Contact
For more information, questions, or issues, please contact LocalDoc at v.resad.89@gmail.com.