## a. Introduction:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Earlier, in the Data Cleaning and EDA notebook, we looked at how to clean an NLP dataset. In this section, we look at how to model such a dataset. We will fine-tune three models:

- MiniLM from Microsoft
- TinyBERT from Huawei
- DistilBERT from Hugging Face

Note that all of these models are available on the Hugging Face Hub.

Also, the dataset we will be using has already been cleaned. To see how that was done, check out the Data Cleaning and EDA notebook.

## b. Dependencies

### i. Installations



In [2]:
%%capture
!pip install transformers
!pip install accelerate -U
!pip install datasets
!pip install huggingface_hub



### ii. Importing Dependencies

In [3]:
##for handling path of my datasets
import os
from google.colab import drive

##for data handling:

import pandas as pd
import numpy as np

##for visuals-- just in case

import seaborn as sns
import matplotlib.pyplot as plt

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from scipy.special import softmax
from transformers import TrainingArguments
from torch import nn
from transformers import Trainer


##modelling:

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, f1_score
import transformers
from transformers import pipeline
from datasets import load_dataset
import nltk
nltk.download('punkt')
##others
import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## c. Importing dataset from my Google Drive

In [4]:
data_path= '/content/drive/MyDrive/deep-learning/clean_copy.csv'

In [5]:
data= pd.read_csv(data_path)

In [6]:
data.head()

Unnamed: 0.1,Unnamed: 0,clean_tweet,label,agreement
0,0,amp big homie meanboy stegman st,0.0,1.0
1,1,im thinking devoting career proving autism isn...,1.0,1.0
2,2,vaccines vaccinate child,-1.0,1.0
3,3,mean immunize kid something wont secretly kill...,-1.0,1.0
4,4,thanks catch performing la nuit nyc st ave sho...,0.0,1.0


## c. Splitting Dataset

### i. Creating a categorical mapping

In [7]:
data.head()

Unnamed: 0.1,Unnamed: 0,clean_tweet,label,agreement
0,0,amp big homie meanboy stegman st,0.0,1.0
1,1,im thinking devoting career proving autism isn...,1.0,1.0
2,2,vaccines vaccinate child,-1.0,1.0
3,3,mean immunize kid something wont secretly kill...,-1.0,1.0
4,4,thanks catch performing la nuit nyc st ave sho...,0.0,1.0


In [8]:
data= data.dropna()
data= data.drop("Unnamed: 0", axis=1)

In [9]:
##before splitting, convert each tweet into a tuple of its words

data['clean_tweet'] = data['clean_tweet'].apply(lambda tweet: tuple(tweet.split(),))




In [10]:
data.head()

Unnamed: 0,clean_tweet,label,agreement
0,"(amp, big, homie, meanboy, stegman, st)",0.0,1.0
1,"(im, thinking, devoting, career, proving, auti...",1.0,1.0
2,"(vaccines, vaccinate, child)",-1.0,1.0
3,"(mean, immunize, kid, something, wont, secretl...",-1.0,1.0
4,"(thanks, catch, performing, la, nuit, nyc, st,...",0.0,1.0
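Note that a tuple stored in a DataFrame cell is written to CSV as its string representation, so it comes back as a plain string when the file is reloaded later. The sketch below illustrates this with Python's standard `csv` module rather than pandas (an assumption made to keep it dependency-free; `DataFrame.to_csv` is expected to stringify object cells in the same way), using the third tweet shown above.

```python
import csv
import io

tweet = "vaccines vaccinate child"
as_tuple = tuple(tweet.split())

# Write the tuple into an in-memory CSV file; the writer calls str() on it
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["clean_tweet"])
writer.writerow([as_tuple])

# Read the file back: the cell is now a string, not a tuple
buffer.seek(0)
rows = list(csv.reader(buffer))
read_back = rows[1][0]

print(type(read_back).__name__)
print(read_back)
```

If the tokenizer should see the original words instead, one option is to keep `clean_tweet` as a plain string or join the words back with `" ".join(...)` before saving.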


### ii. Data Splitting


In [11]:
train_set, eval_set= train_test_split(data, test_size= 0.2, stratify= data["label"])
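Passing `stratify=data["label"]` asks `train_test_split` to keep the label proportions roughly the same in both splits. As a rough illustration of the idea, here is a hand-rolled sketch. It is not scikit-learn's implementation, and `stratified_split` and the toy label counts are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    rng = random.Random(seed)

    # Group row indices by class
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Take the same fraction of every class for the test split
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy, imbalanced labels (hypothetical counts)
labels = [-1] * 10 + [0] * 50 + [1] * 40
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes 20% of its rows to the evaluation split, so the minority class is not lost by chance.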

In [12]:

train_set.to_csv("/content/train_set.csv")
eval_set.to_csv("/content/eval_set.csv")

### iii. Loading the Dataset

In [13]:
dataset= load_dataset("csv", data_files={"train_set": "train_set.csv", "eval_set": "eval_set.csv"})


In [14]:
dataset

DatasetDict({
    train_set: Dataset({
        features: ['Unnamed: 0', 'clean_tweet', 'label', 'agreement'],
        num_rows: 7976
    })
    eval_set: Dataset({
        features: ['Unnamed: 0', 'clean_tweet', 'label', 'agreement'],
        num_rows: 1994
    })
})

## C. Tokenization

In [15]:
mini= "microsoft/MiniLM-L12-H384-uncased" ##Hugging Face Hub checkpoint ID of the MiniLM model

In [16]:
mini_tokenizer= AutoTokenizer.from_pretrained(mini)

In [17]:
data["clean_tweet"].head()

0              (amp, big, homie, meanboy, stegman, st)
1    (im, thinking, devoting, career, proving, auti...
2                         (vaccines, vaccinate, child)
3    (mean, immunize, kid, something, wont, secretl...
4    (thanks, catch, performing, la, nuit, nyc, st,...
Name: clean_tweet, dtype: object

In [18]:
## our labels are -1, 0 and 1; we map them to 0, 1 and 2 respectively

def transform_labels(example):
  label = example["label"]
  num = 0  # fallback used when the label is not -1, 0 or 1

  if label == -1:
    num = 0
  elif label == 0:
    num = 1
  elif label == 1:
    num = 2
  return {"label": num}

def mini_tokenize(example):
  return mini_tokenizer(example["clean_tweet"], padding= "max_length", truncation=True)
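The label remapping above can also be written as a dictionary lookup. The variant below is only a sketch: `LABEL_MAP` and `transform_labels_strict` are hypothetical names, and raising an error on unexpected labels is a design choice that differs from the cell above.

```python
# Map the original labels (-1, 0, 1) to class ids (0, 1, 2)
LABEL_MAP = {-1: 0, 0: 1, 1: 2}

def transform_labels_strict(example):
    # Labels arrive as floats such as -1.0 after the CSV round trip
    label = int(example["label"])
    if label not in LABEL_MAP:
        raise ValueError(f"Unexpected label: {example['label']!r}")
    return {"label": LABEL_MAP[label]}

print(transform_labels_strict({"label": -1.0}))
print(transform_labels_strict({"label": 1.0}))
```

Failing loudly on an unknown label avoids silently assigning it to class 0.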


In [19]:
dataset= dataset.map(mini_tokenize, batched= True)
remove_columns= ['Unnamed: 0', 'clean_tweet', 'label', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
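The warnings above indicate that no maximum length was available, so `padding="max_length"` and `truncation=True` had nothing to pad or truncate to. The sketch below shows, in plain Python, what padding and truncating to a fixed length is meant to produce. It is not the tokenizer's implementation; `pad_or_truncate`, the length of 6, the pad id of 0 and the toy token ids are all assumptions for illustration.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it with pad_id."""
    ids = list(input_ids[:max_length])
    attention_mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_length - len(ids)
    # 0 in the attention mask marks padding the model should ignore
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With the real tokenizer, passing an explicit `max_length` argument alongside `padding="max_length"` is the usual way to get fixed-length inputs.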



In [20]:
dataset

DatasetDict({
    train_set: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7976
    })
    eval_set: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1994
    })
})

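### d. Dealing with the Imbalanced Class 0

From our EDA, we saw that the -1 class (now class 0) is underrepresented, so this section deals with that imbalance. Each class gets a weight of one minus its share of the rows, so the rarest class receives the largest weight in the loss.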



In [21]:
##compute the weights from the full `data` DataFrame: weight = 1 - class frequency

class_weights= (1-(data["label"].value_counts().sort_index() /len(data))).values
class_weights


array([0.89618857, 0.50992979, 0.59388164])
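To make the formula concrete, the sketch below recomputes the weights by hand. The per-class counts are not printed anywhere in this notebook: they are back-calculated from the weights above and the 9,970 rows (7,976 + 1,994), so treat them as an inference rather than a reported figure.

```python
# Inferred row counts per class (0 = original -1, 1 = original 0, 2 = original 1)
counts = {0: 1035, 1: 4886, 2: 4049}
total = sum(counts.values())  # 9970 rows in total

# weight = 1 - relative frequency, in sorted class order
class_weights = [1 - counts[c] / total for c in sorted(counts)]

for c, w in zip(sorted(counts), class_weights):
    print(c, round(w, 8))
```

Class 0 makes up about 10% of the rows, which is why its weight is close to 0.9.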

In [22]:
##moving the class weights to the GPU

class_weights= torch.from_numpy(class_weights).float().to("cuda")

In [23]:
dataset= dataset.rename_column("label","labels")

In [24]:
dataset

DatasetDict({
    train_set: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7976
    })
    eval_set: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1994
    })
})

### e. Computing loss

In [26]:
class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Take the labels out so the model does not compute its own unweighted loss
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits.float()
        labels = labels.long()
        # Cross-entropy weighted by the class weights computed above
        loss_func = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_func(logits, labels)
        return (loss, outputs) if return_outputs else loss
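For intuition, here is a plain-Python sketch of a class-weighted cross-entropy. It follows what `nn.CrossEntropyLoss(weight=...)` is understood to compute with its default mean reduction: each sample's loss is scaled by the weight of its true class, and the sum is divided by the sum of those weights. `weighted_cross_entropy` and the toy logits are hypothetical; the weights are the ones printed earlier.

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Class-weighted mean of the per-sample negative log-likelihoods."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp for the softmax denominator
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_sum_exp - row[label]
        total += weights[label] * nll
        weight_sum += weights[label]
    return total / weight_sum

weights = [0.89618857, 0.50992979, 0.59388164]
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 0.0]]
labels = [0, 1, 2]

print(weighted_cross_entropy(logits, labels, weights))
```

A mistake on a class 0 example therefore moves the loss more than the same mistake on a class 1 example.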

## f. Modelling

### 1. Fine-tuning MiniLM

In [27]:
mini_model= AutoModelForSequenceClassification.from_pretrained(mini, num_labels= 3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/MiniLM-L12-H384-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  f1 = f1_score(labels, preds, average="weighted")
  return {"f1": f1}
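With `average="weighted"`, the F1 score is computed per class and then averaged with each class weighted by its number of true examples. The sketch below reproduces that idea by hand on toy data; `weighted_f1` and the label lists are hypothetical, and this is an illustration rather than scikit-learn's code.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of the per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n  # weight the class F1 by its support
    return total / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because the weights follow class support, a large class dominates this metric; a macro average would treat all three classes equally.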

In [29]:
batch_size= 16

In [31]:

training_args = TrainingArguments(
    output_dir="Greg-Sentiment-classifier",
    num_train_epochs=3,
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    push_to_hub=True,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [32]:
train_dataset= dataset['train_set'].shuffle(seed=10)
eval_dataset= dataset['eval_set'].shuffle(seed=10)

In [33]:
from huggingface_hub import notebook_login

notebook_login()


In [34]:
trainer = WeightedLossTrainer(
      model= mini_model,
      args= training_args,
      train_dataset= train_dataset,
      eval_dataset= eval_dataset,
      tokenizer= mini_tokenizer,
      compute_metrics=compute_metrics )

Cloning https://huggingface.co/gArthur98/Greg-Sentiment-classifier into local empty directory.



In [35]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


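| Step | Training Loss | Validation Loss | F1 |
|------|---------------|-----------------|----------|
| 500  | 0.9048 | 0.896298 | 0.631849 |
| 1000 | 0.8732 | 0.861065 | 0.643677 |
| 1500 | 0.8592 | 0.909302 | 0.648592 |
| 2000 | 0.8314 | 0.824165 | 0.654250 |
| 2500 | 0.8114 | 0.836766 | 0.656399 |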


TrainOutput(global_step=2991, training_loss=0.8442818690446339, metrics={'train_runtime': 327.0366, 'train_samples_per_second': 73.166, 'train_steps_per_second': 9.146, 'total_flos': 190406486481120.0, 'train_loss': 0.8442818690446339, 'epoch': 3.0})
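The `global_step=2991` above can be checked with a little arithmetic. The sketch assumes a single device and the `TrainingArguments` default of 8 examples per device per step, since the `batch_size = 16` variable defined earlier is never passed to `TrainingArguments`.

```python
import math

num_train_rows = 7976   # rows in train_set
num_epochs = 3          # num_train_epochs
per_device_batch = 8    # assumption: TrainingArguments default, single device

steps_per_epoch = math.ceil(num_train_rows / per_device_batch)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 997 2991

# For comparison: the step count if a batch size of 16 had been used
steps_with_16 = math.ceil(num_train_rows / 16) * num_epochs
print(steps_with_16)  # 1497
```

To actually train with 16 examples per step, the value would need to be passed in, for example through `per_device_train_batch_size`.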

In [36]:
trainer.push_to_hub()

To https://huggingface.co/gArthur98/Greg-Sentiment-classifier
   5c16be4..dc318d0  main -> main

To https://huggingface.co/gArthur98/Greg-Sentiment-classifier
   dc318d0..cc242f2  main -> main


'https://huggingface.co/gArthur98/Greg-Sentiment-classifier/commit/dc318d0b68f2c0cb9f9a9ffc97dc7fb5064150ca'