PII NER Azerbaijani v2

PII NER Azerbaijani v2 is the second version of a fine-tuned Named Entity Recognition (NER) model based on XLM-RoBERTa (first version: PII NER Azerbaijani). It is trained on Azerbaijani PII data to detect and classify personally identifiable information in text, such as names, dates of birth, cities, addresses, and phone numbers.

Model Details

Base Model: XLM-RoBERTa
Training Metrics:

Epoch	Training Loss	Validation Loss	Precision	Recall	F1
1	0.029100	0.025319	0.963367	0.962449	0.962907
2	0.019900	0.023291	0.964567	0.968474	0.966517
3	0.015400	0.018993	0.969536	0.967555	0.968544
4	0.012700	0.017730	0.971919	0.969768	0.970842
5	0.011100	0.018095	0.973056	0.970075	0.971563

Test Metrics:
Precision: 0.9760
Recall: 0.9732
F1 Score: 0.9746
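
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported score can be reproduced from the other two figures. A minimal sketch (the function name `f1_score` is illustrative and not part of the model's code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set precision and recall as reported above.
test_f1 = f1_score(0.9760, 0.9732)
print(f"{test_f1:.4f}")  # 0.9746
```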

Detailed Test Classification Report

Entity	Precision	Recall	F1-score	Support
AGE	0.98	0.98	0.98	509
BUILDINGNUM	0.97	0.75	0.85	1285
CITY	1.00	1.00	1.00	2100
CREDITCARDNUMBER	0.99	0.98	0.99	249
DATE	0.85	0.92	0.88	1576
DRIVERLICENSENUM	0.98	0.98	0.98	258
EMAIL	0.98	1.00	0.99	1485
GIVENNAME	0.99	1.00	0.99	9926
IDCARDNUM	0.99	0.99	0.99	1174
PASSPORTNUM	0.99	0.99	0.99	426
STREET	0.94	0.98	0.96	1480
SURNAME	1.00	1.00	1.00	3357
TAXNUM	0.99	1.00	0.99	240
TELEPHONENUM	0.97	0.95	0.96	2175
TIME	0.96	0.96	0.96	2216
ZIPCODE	0.97	0.97	0.97	520

Averages

Metric	Precision	Recall	F1-score	Support
Micro avg	0.98	0.97	0.97	28976
Macro avg	0.97	0.96	0.97	28976
Weighted avg	0.98	0.97	0.97	28976
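
The macro and weighted rows differ only in how the per-entity scores are combined: the macro average treats all 16 entities equally, while the weighted average scales each entity by its support. The sketch below recomputes both from the per-entity table above. Because the table values are rounded to two decimals, the recomputed averages can differ from the reported ones in the last digit.

```python
# (precision, recall, support) per entity, copied from the report above.
report = {
    "AGE": (0.98, 0.98, 509),
    "BUILDINGNUM": (0.97, 0.75, 1285),
    "CITY": (1.00, 1.00, 2100),
    "CREDITCARDNUMBER": (0.99, 0.98, 249),
    "DATE": (0.85, 0.92, 1576),
    "DRIVERLICENSENUM": (0.98, 0.98, 258),
    "EMAIL": (0.98, 1.00, 1485),
    "GIVENNAME": (0.99, 1.00, 9926),
    "IDCARDNUM": (0.99, 0.99, 1174),
    "PASSPORTNUM": (0.99, 0.99, 426),
    "STREET": (0.94, 0.98, 1480),
    "SURNAME": (1.00, 1.00, 3357),
    "TAXNUM": (0.99, 1.00, 240),
    "TELEPHONENUM": (0.97, 0.95, 2175),
    "TIME": (0.96, 0.96, 2216),
    "ZIPCODE": (0.97, 0.97, 520),
}

total_support = sum(s for _, _, s in report.values())

# Macro average: every entity counts equally, regardless of frequency.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: each entity contributes in proportion to its support.
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support
weighted_recall = sum(r * s for _, r, s in report.values()) / total_support

print(total_support)  # 28976
print(f"macro:    P={macro_precision:.3f} R={macro_recall:.3f}")
print(f"weighted: P={weighted_precision:.3f} R={weighted_recall:.3f}")
```

The low recall of BUILDINGNUM (0.75) pulls the macro recall down more than the weighted recall, because frequent and well-recognized entities such as GIVENNAME dominate the weighted figure.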

The model recognizes the following entity types:

[
    "AGE",
    "BUILDINGNUM",
    "CITY",
    "CREDITCARDNUMBER",
    "DATE",
    "DRIVERLICENSENUM",
    "EMAIL",
    "GIVENNAME",
    "IDCARDNUM",
    "PASSPORTNUM",
    "STREET",
    "SURNAME",
    "TAXNUM",
    "TELEPHONENUM",
    "TIME",
    "ZIPCODE"
]
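
For token classification, these 16 types are expanded into BIO tags: one `O` tag for non-entity tokens, a `B-` tag for the first token of each entity, and an `I-` tag for its continuation tokens, which gives 33 labels in total. The sketch below rebuilds the same layout as the `id_to_label` dictionary hard-coded in the usage code further down; the helper name `build_id_to_label` is illustrative only.

```python
from typing import Dict, List

ENTITY_TYPES: List[str] = [
    "AGE", "BUILDINGNUM", "CITY", "CREDITCARDNUMBER",
    "DATE", "DRIVERLICENSENUM", "EMAIL", "GIVENNAME",
    "IDCARDNUM", "PASSPORTNUM", "STREET", "SURNAME",
    "TAXNUM", "TELEPHONENUM", "TIME", "ZIPCODE",
]


def build_id_to_label(entity_types: List[str]) -> Dict[int, str]:
    """Lay out labels as: O, then all B- tags, then all I- tags."""
    labels = ["O"]
    labels += [f"B-{entity}" for entity in entity_types]
    labels += [f"I-{entity}" for entity in entity_types]
    return dict(enumerate(labels))


id_to_label = build_id_to_label(ENTITY_TYPES)
print(len(id_to_label))                  # 33
print(id_to_label[8], id_to_label[24])   # B-GIVENNAME I-GIVENNAME
```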

Usage

To use the model for PII entity recognition:

The model was trained on lowercase text. The code below lowercases the input automatically. If you write your own inference code, lowercase the input before passing it to the model.

from typing import List, Dict, Tuple

import torch
from transformers import AutoModelForTokenClassification, XLMRobertaTokenizerFast

class AzerbaijaniNER:
    def __init__(self, model_name_or_path="LocalDoc/private_ner_azerbaijani_v2"):
        self.model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)
        self.tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
        
        self.model.eval()
        
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        
        self.id_to_label = {
            0: "O",
            1: "B-AGE", 2: "B-BUILDINGNUM", 3: "B-CITY", 4: "B-CREDITCARDNUMBER",
            5: "B-DATE", 6: "B-DRIVERLICENSENUM", 7: "B-EMAIL", 8: "B-GIVENNAME",
            9: "B-IDCARDNUM", 10: "B-PASSPORTNUM", 11: "B-STREET", 12: "B-SURNAME",
            13: "B-TAXNUM", 14: "B-TELEPHONENUM", 15: "B-TIME", 16: "B-ZIPCODE",
            17: "I-AGE", 18: "I-BUILDINGNUM", 19: "I-CITY", 20: "I-CREDITCARDNUMBER",
            21: "I-DATE", 22: "I-DRIVERLICENSENUM", 23: "I-EMAIL", 24: "I-GIVENNAME", 
            25: "I-IDCARDNUM", 26: "I-PASSPORTNUM", 27: "I-STREET", 28: "I-SURNAME",
            29: "I-TAXNUM", 30: "I-TELEPHONENUM", 31: "I-TIME", 32: "I-ZIPCODE"
        }
        
        self.entity_types = {
            "AGE": "Age",
            "BUILDINGNUM": "Building Number",
            "CITY": "City",
            "CREDITCARDNUMBER": "Credit Card Number",
            "DATE": "Date",
            "DRIVERLICENSENUM": "Driver License Number",
            "EMAIL": "Email",
            "GIVENNAME": "Given Name",
            "IDCARDNUM": "ID Card Number",
            "PASSPORTNUM": "Passport Number",
            "STREET": "Street",
            "SURNAME": "Surname",
            "TAXNUM": "Tax ID Number",
            "TELEPHONENUM": "Phone Number",
            "TIME": "Time",
            "ZIPCODE": "Zip Code"
        }
    
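    def predict(self, text: str, max_length: int = 512) -> List[Dict]:
        # The model was trained on lowercase text, so normalize before tokenizing.
        # The returned offsets and values refer to this lowercased string.
        text = text.lower()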
        
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_offsets_mapping=True
        )
        
        offset_mapping = inputs.pop("offset_mapping").numpy()[0]
        
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = outputs.logits.argmax(dim=2)
        
        predictions = predictions[0].cpu().numpy()
        
        entities = []
        current_entity = None
        
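        for offset, pred_id in zip(offset_mapping, predictions):
            # Special and padding tokens have a (0, 0) offset; skip them.
            if offset[0] == 0 and offset[1] == 0:
                continue
                
            # Cast the NumPy integer to a plain int before the dictionary lookup.
            pred_label = self.id_to_label[int(pred_id)]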
            
            if pred_label.startswith("B-"):
                if current_entity:
                    entities.append(current_entity)
                
                entity_type = pred_label[2:]
                current_entity = {
                    "label": entity_type,
                    "name": self.entity_types.get(entity_type, entity_type),
                    "start": int(offset[0]),
                    "end": int(offset[1]),
                    "value": text[offset[0]:offset[1]]
                }
            
            elif pred_label.startswith("I-") and current_entity is not None:
                entity_type = pred_label[2:]
                
                if entity_type == current_entity["label"]:
                    # Same type: extend the current entity to cover this token.
                    current_entity["end"] = int(offset[1])
                    current_entity["value"] = text[current_entity["start"]:current_entity["end"]]
                else:
                    # Type mismatch: close the current entity and drop this token.
                    entities.append(current_entity)
                    current_entity = None
            
            elif pred_label == "O" and current_entity is not None:
                entities.append(current_entity)
                current_entity = None
        
        if current_entity:
            entities.append(current_entity)
        
        return entities
    
    def anonymize_text(self, text: str, replacement_char: str = "X") -> Tuple[str, List[Dict]]:
        entities = self.predict(text)
        
        if not entities:
            return text, []
        
        entities.sort(key=lambda x: x["start"], reverse=True)
        
        anonymized_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            length = end - start
            anonymized_text = anonymized_text[:start] + replacement_char * length + anonymized_text[end:]
        
        entities.sort(key=lambda x: x["start"])
        
        return anonymized_text, entities

    def highlight_entities(self, text: str) -> str:
        entities = self.predict(text)
        
        if not entities:
            return text
        
        entities.sort(key=lambda x: x["start"], reverse=True)
        
        highlighted_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            entity_value = entity["value"]
            entity_type = entity["name"]
            
            highlighted_text = (
                highlighted_text[:start] + 
                f"[{entity_type}: {entity_value}]" + 
                highlighted_text[end:]
            )
        
        return highlighted_text

if __name__ == "__main__":
    ner = AzerbaijaniNER()
    
    test_text = """Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?"""
    
    print("=== Original Text ===")
    print(test_text)
    print("\n=== Found Entities ===")
    
    entities = ner.predict(test_text)
    for entity in entities:
        print(f"{entity['name']}: {entity['value']} (positions {entity['start']}-{entity['end']})")
    
    print("\n=== Text with Highlighted Entities ===")
    highlighted_text = ner.highlight_entities(test_text)
    print(highlighted_text)
    
    print("\n=== Anonymized Text ===")
    anonymized_text, _ = ner.anonymize_text(test_text)
    print(anonymized_text)

Output:

=== Original Text ===
Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

=== Found Entities ===
Given Name: əli (positions 18-21)
Surname: hüseynov (positions 22-30)
Date: 15.05.1990 (positions 48-58)
City: bakı (positions 64-68)
Street: 28 may küçəsi (positions 80-93)
Building Number: 4 (positions 94-95)
Phone Number: +994552345678 (positions 132-145)
Credit Card Number: 4169741358254152 (positions 155-171)

=== Text with Highlighted Entities ===
Salam, mənim adım [Given Name: əli] [Surname: hüseynov]du. Doğum tarixim [Date: 15.05.1990]-dır. [City: bakı] şəhərində, [Street: 28 may küçəsi] [Building Number: 4] ünvanında yaşayıram. Telefon nömrəm [Phone Number: +994552345678]-dir. Mən [Credit Card Number: 4169741358254152] nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

=== Anonymized Text ===
Salam, mənim adım XXX XXXXXXXXdu. Doğum tarixim XXXXXXXXXX-dır. XXXX şəhərində, XXXXXXXXXXXXX X ünvanında yaşayıram. Telefon nömrəm XXXXXXXXXXXXX-dir. Mən XXXXXXXXXXXXXXXX nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
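
One caveat about the lowercase step: `predict` returns offsets into the lowercased string, while `anonymize_text` and `highlight_entities` apply those offsets to the original text. Python's `str.lower()` turns the capital dotted `İ` (U+0130), which is common in Azerbaijani, into two code points, so the two strings can differ in length and the offsets can shift. The sketch below shows the effect and one possible guard. The helper names `lowercase_is_length_preserving` and `normalize_dotted_i` are hypothetical, and whether replacing `İ` with `i` before inference matches the model's training data is an assumption that should be verified.

```python
def lowercase_is_length_preserving(text: str) -> bool:
    """True if offsets computed on text.lower() also fit the original text."""
    return len(text.lower()) == len(text)


def normalize_dotted_i(text: str) -> str:
    """Replace 'İ' (U+0130) with a plain 'i' so that lower() keeps the length.

    Assumption: the model treats this the same as the default lowercasing.
    """
    return text.replace("\u0130", "i")


sample = "İlham Bakıda yaşayır."

# The lowercased string is one code point longer than the original.
print(len(sample), len(sample.lower()))

safe = normalize_dotted_i(sample)
print(lowercase_is_length_preserving(sample))  # False
print(lowercase_is_length_preserving(safe))    # True
```

If the check fails for an input, one option is to apply the returned offsets to `text.lower()` instead of the original string, at the cost of losing the original casing in the output.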

CC BY 4.0 License — What It Allows

The Creative Commons Attribution 4.0 International (CC BY 4.0) license allows:

✅ You Can:

Use the model for any purpose, including commercial use.
Share it — copy and redistribute in any medium or format.
Adapt it — remix, transform, and build upon it for any purpose, even commercially.

📝 You Must:

Give appropriate credit — Attribute the original creator (e.g., name, link to the license, and indicate if changes were made).
Not imply endorsement — Do not suggest the original author endorses you or your use.

❌ You Cannot:

Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).

Summary:

You are free to use, modify, and distribute the model — even for commercial purposes — as long as you give proper credit to the original creator.

For more information, please refer to the CC BY 4.0 license.

Contact

For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].
