To get access to this model, send an email to adamcoddml@gmail.com and provide a brief description of your project or application. Requests without this information will not be considered, and access will not be granted under any circumstances.

DistilRoBERTa-nsfw-prompt-stable-diffusion

=== V3 ===

This model has been retrained on ~10 million positive prompts from the AdamCodd/Civitai-15M-prompts dataset, evenly split between SFW and NSFW categories (4,858,894 samples of each, ensuring a balanced dataset). This version uses a 90% training / 10% validation split instead of the previous 80/20 split, to maximize the amount of data used for training.

As we are hitting diminishing returns due to architectural limitations, this will be the final version. However, there are still modest improvements, which justify the V3 release. It achieves the following results on the evaluation set:

Loss: 0.19507 (↓ 8.80% from V2)
Accuracy: 0.91981 (↑ 0.92% from V2)
F1: 0.92005 (↑ 1.18% from V2)
AUC: 0.97641 (↑ 0.50% from V2)
Precision: 0.91794 (↓ 1.49% from V2)
Recall: 0.92217 (↑ 3.86% from V2)
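
The percentages above read as relative changes, not absolute differences in percentage points. A minimal sketch of that calculation, assuming the V2 baseline values are the ones reported in the V2 section of this card (the function name `relative_change` is only illustrative):

```python
def relative_change(new: float, old: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# Evaluation metrics as reported in this model card.
v2 = {"loss": 0.2139, "accuracy": 0.9114, "f1": 0.9093,
      "auc": 0.9716, "precision": 0.9318, "recall": 0.8879}
v3 = {"loss": 0.19507, "accuracy": 0.91981, "f1": 0.92005,
      "auc": 0.97641, "precision": 0.91794, "recall": 0.92217}

# Relative change of every metric from V2 to V3, in percent.
changes = {name: relative_change(v3[name], v2[name]) for name in v2}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

A negative value for loss is an improvement, while a negative value for precision is a regression.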

Confusion matrix:

[[445597  40072]
 [ 37835 448275]]

There is a slight trade-off in the latest results: the number of false negatives has been significantly reduced (higher recall), while there is a slight increase in false positives (lower precision). This means the model is now more proactive in flagging NSFW content, identifying more true violations but also generating a few more false alarms. I find this trade-off acceptable, as missing problematic content (false negatives) is generally more concerning than occasional false positives.
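
The trade-off can be checked directly from the V3 confusion matrix. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes, in the order SFW, NSFW); the card does not state the layout, but under this assumption the reported precision, recall and F1 are reproduced to five decimals, and accuracy to about four.

```python
# V3 confusion matrix as reported in this model card.
# Assumed layout: rows = true class, columns = predicted class, order [SFW, NSFW].
tn, fp = 445597, 40072    # true SFW:  predicted SFW, predicted NSFW
fn, tp = 37835, 448275    # true NSFW: predicted SFW, predicted NSFW

precision = tp / (tp + fp)   # share of flagged prompts that are really NSFW
recall = tp / (tp + fn)      # share of NSFW prompts that get flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"false positives: {fp}, false negatives: {fn}")
print(f"precision={precision:.5f} recall={recall:.5f} f1={f1:.5f} accuracy={accuracy:.4f}")
```

With these counts there are fewer false negatives (37,835) than false positives (40,072), which is the behavior described above.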

The V3 model is under the same license as the V2 model (cc-by-nc-4.0). For commercial use rights, please contact me (adamcoddml@gmail.com).

=== V2 ===

This model has been retrained on the improved AdamCodd/Civitai-8m-prompts dataset, on ~5 million positive prompts, evenly split between SFW and NSFW categories (2,820,319 samples of each, ensuring a balanced dataset).

It's a massive improvement over the V1 model. It achieves the following results on the evaluation set:

Loss: 0.2139 (↓ 31.07% over V1)
Accuracy: 0.9114 (↑ 5.46% over V1)
F1: 0.9093 (↑ 5.58% over V1)
AUC: 0.9716 (↑ 3.27% over V1)
Precision: 0.9318 (↑ 5.81% over V1)
Recall: 0.8879 (↑ 5.36% over V1)

Confusion matrix:

[[658795  45843]
 [ 79066 626456]]

The V2 model is less prone to false positives than V1: it avoids classifying descriptions of body parts under clothes as NSFW (the cutoff for the NSFW classification is nsfwLevel == 2 in the dataset).

NB: The new license for the V2 model is cc-by-nc-4.0. For commercial use rights, please contact me (adamcoddml@gmail.com). Meanwhile, the V1 model remains available under the MIT license (on the v1 branch).

The V1 and V2 models are both compatible with Transformers.js.

=== V1 ===

This model uses the DistilRoBERTa base architecture, fine-tuned for a classification task on the positive prompts of the AdamCodd/Civitai-2m-prompts dataset.

It achieves the following results on the evaluation set:

Loss: 0.3103
Accuracy: 0.8642
F1: 0.8612
AUC: 0.9408
Precision: 0.8805
Recall: 0.8427

Model description

This model is designed to identify NSFW prompts for Stable Diffusion. It was trained on a dataset comprising ~2 million prompts, evenly split between SFW and NSFW categories (1,043,475 samples of each, ensuring a balanced dataset). Single-word prompts have been excluded to enhance the accuracy and relevance of the predictions.

Additionally, it is important to note that the model assesses the likelihood of a prompt being NSFW based on statistical occurrences, rather than evaluating the specific words. This approach allows for the identification of NSFW content in prompts that may appear SFW. The accuracy of the model tends to increase with the length of the prompt. Therefore, prompts that are extremely brief, such as those comprising only two or three words, might be subject to less accurate evaluations.

Although this model demonstrates satisfactory accuracy, it is recommended to use it together with this image NSFW detector to improve overall detection capabilities and minimize the occurrence of false positives.

Usage

from transformers import pipeline

prompt_detector = pipeline("text-classification", model="AdamCodd/distilroberta-nsfw-prompt-stable-diffusion")

predicted_class = prompt_detector("masterpiece, 1girl, looking at viewer, sitting, tea, table, garden")
print(predicted_class)
#[{'label': 'SFW', 'score': 0.868}]
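
If the default decision is too lenient or too strict for an application, a custom threshold can be applied to the NSFW score instead of taking the top label. The helper below is a minimal sketch: `is_nsfw` is a hypothetical name, the label `NSFW` is an assumption (only `SFW` appears in the output above), and the scores are hand-written for illustration, not real model output.

```python
from typing import Any, Dict, List


def is_nsfw(scores: List[Dict[str, Any]], threshold: float = 0.5) -> bool:
    """Flag a prompt when its NSFW score reaches the threshold.

    `scores` holds one {'label': ..., 'score': ...} dict per class for a
    single prompt, in the same format as the pipeline output shown above.
    """
    by_label = {item["label"]: float(item["score"]) for item in scores}
    # Assumption: the second class is labeled "NSFW".
    return by_label.get("NSFW", 0.0) >= threshold


# Illustrative scores only (not produced by the model).
example = [{"label": "SFW", "score": 0.62}, {"label": "NSFW", "score": 0.38}]

print(is_nsfw(example))                 # False
print(is_nsfw(example, threshold=0.3))  # True
```

With the real model, per-class scores for a single prompt should be obtainable by calling the pipeline with `top_k=None`, for example `prompt_detector(prompt, top_k=None)`. Lowering the threshold flags more prompts, trading precision for recall.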

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 32
eval_batch_size: 64
seed: 42
optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 150
Mixed precision
num_epochs: 1
weight_decay: 0.01
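
For anyone trying to reproduce a similar run, the list above maps onto the parameter names of `transformers.TrainingArguments` roughly as follows. This is a sketch, not the original training script: it assumes the batch sizes are per device, that "Mixed precision" means fp16 (the card does not say fp16 or bf16), and the output directory is a hypothetical placeholder.

```python
# Keyword arguments mirroring the hyperparameters listed above,
# named after the fields of transformers.TrainingArguments.
training_kwargs = {
    "output_dir": "./outputs",           # hypothetical path, not from the card
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 32,   # assumes the listed size is per device
    "per_device_eval_batch_size": 64,    # same assumption
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 150,
    "num_train_epochs": 1,
    "weight_decay": 0.01,
    "fp16": True,                        # assumption: precision type is not stated
}

for key, value in training_kwargs.items():
    print(f"{key} = {value}")
```

The dictionary can then be unpacked with `TrainingArguments(**training_kwargs)` in an environment where Transformers and PyTorch are installed.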

Training results

Metrics: Accuracy, F1, Precision, Recall, AUC

'eval_loss': 0.3103,
'eval_accuracy': 0.8642,
'eval_f1': 0.8612,
'eval_precision': 0.8805,
'eval_recall': 0.8427,
'eval_roc_auc': 0.9408,

Confusion matrix:

[[184931  23859]
 [ 32820 175780]]

Framework versions

Transformers 4.36.2
Datasets 2.16.1
Tokenizers 0.15.0
Evaluate 0.4.1

If you want to support me, you can here.

Citation and Acknowledgments

The V2 model was utilized in the following arXiv paper:

@misc{li2024art,
      title={ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users}, 
      author={Guanlin Li and Kangjie Chen and Shudong Zhang and Jie Zhang and Tianwei Zhang},
      year={2024},
      eprint={2405.19360},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}