{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"gpuType":"T4"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","source":["# How to use the ParlaSent model? A step-by-step tutorial (/w stanza)\n","\n","Authors: Mochtak, Michal, Peter Rupnik, Taja Kuzman, and Nikola Ljubešić\n","\n","Date: 1 june 2025"],"metadata":{"id":"EzUt9g09sMfV"}},{"cell_type":"markdown","source":["## Introductory remarks ⛳\n","\n","This is an interactive Jupyter notebook that presents a step-by-step tutorial on how to use the ParlaSent model with your own data. The overall structure of the notebook is organized around two elements: 1) sentence extraction and 2) sentence annotation.\n","\n","If you use this tutorial, please cite the paper:\n","\n","\n","> Mochtak, Michal, Peter Rupnik, and Nikola Ljubešić. 2024. “The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings.” In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), edited by Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, 16024–36. Torino, Italia: ELRA and ICCL. https://aclanthology.org/2024.lrec-main.1393.\n","\n","**NOTE: As of June 1, 2025, the original tutorial using the trankit library no longer works due to broken dependencies that cannot be resolved within the same Colab session. As a workaround, the trankit library has been replaced with the stanza toolkit (how_to_use_authdetect_w_stanza.ipynb). Stanza performs the same functions as trankit and does not have the same dependency compatibility issues. This is the recommended pipeline for Google Colab and serves as a functional alternative to trankit, if needed.**\n"],"metadata":{"id":"Mjg6lAlFsbJ1"}},{"cell_type":"markdown","source":["## Prerequisities ⚡\n","Google Colab is an interactive development environment with access to computational resources that are easy to utilize free of charge (read more about it here: https://colab.research.google.com/).\n","\n","To use the ParlaSent model, you first need to connect to an interactive session with access to a graphical processing unit. To do this, click \"Runtime\" in the top toolbar and select \"Change runtime type\":\n","\n","<br>\n","<br>\n","After a pop-up appears, select any available GPU accelerator and save your selection:\n","\n","\n","\n","Finally, in the top right corner, click \"Connect.\" After a moment, a green check mark (✔) will appear, indicating that your virtual session has been successfully set up:\n","\n",""],"metadata":{"id":"Tyr0BdDlvJJl"}},{"cell_type":"markdown","source":["## Loading data for processing 💾\n","This notebook is designed for a simple use case that expects users to prepare their data outside the Google Colab environment as a plain .csv file and then upload it for further processing. The repository for this paper contains a sample file that you can use as a guide for formatting your own data. The file contains 124 speeches in English from the debate on the proposal for a regulation of the European Parliament and of the Council setting emission performance standards for new passenger cars and new light commercial vehicles held on 3rd March 2018. 
{"cell_type":"markdown","source":["## Processing the data 🎆\n","\n","The processing pipeline can be divided into two steps: 1) sentence extraction and 2) sentence annotation. From this point on, the notebook will also include code cells that can be executed by clicking the small \"play\" icon next to them (hover over the cell to make it visible). The only cell you may need to modify is the one below, which contains a few meta-parameters that will be used in the pipeline."],"metadata":{"id":"tAREYA5V9Pty"}},
{"cell_type":"markdown","source":["### Loading the necessary packages 💻\n","To process the input data, we need to install and load a few packages we will use."],"metadata":{"id":"xedYNVClALEN"}},
{"cell_type":"code","execution_count":1,"metadata":{"id":"D-oYq7EnsHae","executionInfo":{"status":"ok","timestamp":1748795236125,"user_tz":-120,"elapsed":21,"user":{"displayName":"Michal Mochtak","userId":"04685713018345275081"}}},"outputs":[],"source":["# Before we start, we will set a few meta-parameters for the pipeline to use.\n","language = \"english\" # This parameter specifies the language of your input text so stanza can load the appropriate sentence segmentation model. In this example, the speeches are in English; see the list of supported languages in the stanza documentation (https://stanfordnlp.github.io/stanza/). Stanza accepts either the language name (e.g., \"english\") or its code (e.g., \"en\").\n","text_column = \"speech\" # Name of the column in the .csv file containing the input text to be analyzed. In this example, the column we will process is \"speech\".\n","doc_id = \"doc_id\" # Name of the column with the unique identifier for each text to be processed. In this example, the column is \"doc_id\".\n","filename = \"sample_data.csv\" # Name of the file you uploaded to Google Colab that you want to process. The tutorial folder contains a sample dataset you can upload here."]},
{"cell_type":"code","source":["# Install the required libraries in your session; this needs to be done each time you\n","# open the notebook, as Colab sessions are temporary. It takes a while.\n","!pip install simpletransformers\n","!pip install stanza"],"metadata":{"id":"m9Ekm2hy_5zt"},"execution_count":null,"outputs":[]},
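{"cell_type":"markdown","source":["As an optional sanity check (not part of the original pipeline), you can confirm that the session actually sees a GPU before any models are loaded. If the cell below prints `False`, revisit the runtime settings described in the Prerequisites section."],"metadata":{"id":"optGpuCheckMd"}},
{"cell_type":"code","source":["# Optional: verify that a GPU is attached to this Colab session.\n","# torch is pre-installed in Colab and is also a dependency of simpletransformers.\n","import torch\n","print(\"GPU available:\", torch.cuda.is_available())"],"metadata":{"id":"optGpuCheckCode"},"execution_count":null,"outputs":[]},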
{"cell_type":"code","source":["# Load the necessary libraries.\n","import simpletransformers.classification as cl\n","import stanza\n","import pandas as pd"],"metadata":{"id":"ixpDR4gQA_Ri","executionInfo":{"status":"ok","timestamp":1748795389156,"user_tz":-120,"elapsed":28964,"user":{"displayName":"Michal Mochtak","userId":"04685713018345275081"}}},"execution_count":3,"outputs":[]},
{"cell_type":"markdown","source":["### Step 1: Sentence extraction ⛏\n","Now that all the necessary packages are loaded and ready to use, we can proceed with the first step: sentence extraction."],"metadata":{"id":"F-iTvMydBJu4"}},
{"cell_type":"code","source":["# Load the dataset you want to process. This tutorial uses a plain .csv file for simplicity.\n","df = pd.read_csv(filename)"],"metadata":{"id":"d7bi80FCBYKt","executionInfo":{"status":"ok","timestamp":1748795392159,"user_tz":-120,"elapsed":21,"user":{"displayName":"Michal Mochtak","userId":"04685713018345275081"}}},"execution_count":4,"outputs":[]},
{"cell_type":"code","source":["# Download the stanza model for the language you specified earlier and load a\n","# pipeline with the tokenizer (which also performs sentence segmentation).\n","stanza.download(language)\n","p = stanza.Pipeline(language, processors='tokenize')"],"metadata":{"id":"a9Y_JgI4Cosb"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Check the dataset to be sure it was read in correctly.\n","df"],"metadata":{"id":"It1UPAF2CVT9"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Split texts into sentences. We use a simple loop, as the model processes inputs sequentially.\n","# Note that we read the text and the document identifier from the columns named\n","# in the meta-parameters (text_column and doc_id) set at the beginning.\n","sentences = []\n","for n in range(len(df)):\n","    doc = p(df[text_column][n])\n","    one_text = [sentence.text for sentence in doc.sentences]\n","    one_text_df = pd.DataFrame({\n","        'doc_id': df[doc_id][n],\n","        'id': range(1, len(one_text) + 1),\n","        'text': one_text})\n","    sentences.append(one_text_df)\n","\n","# Concatenate the list and reset the index.\n","sentences_final = pd.concat(sentences)\n","sentences_final.reset_index(drop=True, inplace=True)"],"metadata":{"id":"cY7HA7QZC69L","executionInfo":{"status":"ok","timestamp":1748795458827,"user_tz":-120,"elapsed":3790,"user":{"displayName":"Michal Mochtak","userId":"04685713018345275081"}}},"execution_count":9,"outputs":[]},
{"cell_type":"code","source":["# Check the result of sentence extraction. It is a data frame with the following columns:\n","# - doc_id: Refers to the original document.\n","# - id: Refers to the sentence ID within the processed input (e.g., speech).\n","# - text: Contains the extracted grammatical units (i.e., the sentences).\n","\n","sentences_final"],"metadata":{"id":"f-dlGQBPDz27"},"execution_count":null,"outputs":[]},
{"cell_type":"markdown","source":["### Step 2: Sentiment annotation 🌡\n","With the extracted sentences, we can proceed to the second step: sentiment annotation."],"metadata":{"id":"DjTSUVHIwggU"}},
{"cell_type":"code","source":["# Load the ParlaSent model from the Hugging Face Hub. By default,\n","# simpletransformers places the model on the GPU (use_cuda=True).\n","model = cl.ClassificationModel(\"xlmroberta\", \"classla/xlm-r-parlasent\")"],"metadata":{"id":"nlEPS7zQwygU"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Annotate the prepared sentences with the ParlaSent model. The predict method\n","# returns a tuple; its second element holds the raw model outputs, i.e., the\n","# continuous sentiment scores, which we attach as a new column.\n","prediction = model.predict(to_predict=sentences_final[\"text\"].tolist())\n","\n","final_df = sentences_final.assign(predict=prediction[1])"],"metadata":{"id":"RyEbfsrSxPRa"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Check the result. The `final_df` data frame now contains an additional column\n","# called \"predict\" with the predictions made by the model. Since the classification model\n","# predicts the label (score) on a continuous scale, similar to a regression model,\n","# it can produce scores above or below the scale used for training (0-5).\n","# It is worth mentioning that annotating 124 speeches containing 1,289 sentences\n","# took approximately 6 seconds (on a T4 GPU).\n","final_df"],"metadata":{"id":"DXujhnV9yKhk"},"execution_count":null,"outputs":[]},
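{"cell_type":"markdown","source":["As an optional extension (not part of the original tutorial), the sentence-level scores can be aggregated to the document level before saving, which is often the unit of analysis you actually care about. The cell below is a minimal sketch using the mean score per speech; depending on your research question, medians or shares of strongly positive or negative sentences may be preferable."],"metadata":{"id":"optDocAggMd"}},
{"cell_type":"code","source":["# Optional: aggregate sentence-level sentiment to the document (speech) level.\n","# The mean of the \"predict\" column is one simple summary statistic; the\n","# resulting data frame can be merged with your metadata via \"doc_id\".\n","doc_level = (final_df\n","             .groupby(\"doc_id\", as_index=False)[\"predict\"]\n","             .mean()\n","             .rename(columns={\"predict\": \"mean_sentiment\"}))\n","doc_level"],"metadata":{"id":"optDocAggCode"},"execution_count":null,"outputs":[]},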
{"cell_type":"code","source":["# Save the annotated data as a .csv file. The new file will be located in the same\n","# directory as the input file you uploaded at the beginning. If it does not\n","# appear automatically, click the refresh button (the circled arrow) to reload\n","# the folder contents. To download the file, right-click on it and select \"Download\"\n","# to save it to your local machine.\n","final_df.to_csv('output.csv', index=False)"],"metadata":{"id":"x3sZwEXfzeqX","executionInfo":{"status":"ok","timestamp":1748795612149,"user_tz":-120,"elapsed":17,"user":{"displayName":"Michal Mochtak","userId":"04685713018345275081"}}},"execution_count":18,"outputs":[]},
{"cell_type":"markdown","source":["## Closing remarks 👋\n","This tutorial has guided you through the entire annotation pipeline. We demonstrated how easy it is to set up and execute the process on your own data. By defining just a few meta-parameters related to your uploaded document (such as column names and the language you want to analyze), you can quickly annotate your own text data. Whether the text is in English, German, Czech, Polish, or Italian, the model will handle it effectively. The result is a straightforward data frame with annotated sentences that can be further processed or aggregated for specific research purposes (e.g., at the level of speeches, time periods, or groups) and merged with the available metadata using the `doc_id` identifier.\n"],"metadata":{"id":"I4iiq0MF0quY"}}]}