
Classification of Covid-19 Vaccination Using Ensemble Methods with BERT Models


Rabia Bounaama (1)    Mohamed El Amine Abderrahim (2)

(1) Biomedical Engineering Laboratory, Tlemcen University, Algeria
(2) Laboratory of Arabic Natural Language Processing, Tlemcen University, Algeria

rabea.bounaama@univ-tlemcen.dz


Abstract

This paper describes the participation of the "techno" team in the SMM4H 2022 shared tasks. The focus of the study is Task 6, which involves classifying tweets that self-report COVID-19 vaccination status. To enhance the performance of the classification systems, the team tested ensemble methods for two approaches: a classical machine learning approach and a state-of-the-art language model approach using BERT (Bidirectional Encoder Representations from Transformers). The results showed that the state-of-the-art language model achieved the best results, with an F1-score of 0.82.

Keywords

SMM4H 2022, COVID-19, machine learning, BERT.

1 Introduction

The World Health Organization (WHO) has been evaluating COVID-19 vaccines since January 2022 [1], and it supports countries in accelerating and maintaining COVID-19 vaccination. Hospital electronic records, state vaccination registries, and pharmacy records have been used to gather patients' COVID-19 vaccination information.

 

Users on social media platforms share their own experiences with COVID-19 vaccination, including their feelings and reactions toward the vaccine, as well as possible side effects they may have experienced. This makes self-reports about COVID-19 vaccination on social media a valuable data resource. However, social media platforms can also be used as a tool to spread misinformation about COVID-19 vaccination [9].

The impact of this public information on vaccination hesitancy and side effects creates a pressing need for research. For Task 6, the Social Media Mining for Health (SMM4H) shared tasks provide an annotated dataset of Twitter users personally reporting their vaccination status and users discussing vaccination status without revealing their own [10], offering unique insights into population health.

 

This paper summarizes our submission to the shared task "Classification of tweets which indicate self-reported COVID-19 vaccination status (in English)", for which we built two systems. The first system uses traditional machine learning (ML) combined with ensemble learning, while the second system uses an ensemble learning approach combined with the state-of-the-art language model BERT (Bidirectional Encoder Representations from Transformers). The purpose of using two systems is to compare the performance of classical ML with the advanced language model and to examine the potential of ensemble learning to improve model performance.

Related work is presented in the next section; the dataset is described in Section 3; preprocessing and model fine-tuning are covered in Section 4; results and discussion are provided in Section 5; and Section 6 concludes the paper with conclusions and future perspectives.

2 Related work

Several works have been published on the topic of COVID-19 vaccination. In the field of psychosocial studies, the authors of [3] investigated the relationship between nocebo [2] factors and side effects of COVID-19 vaccination, including worry, depression, headache, fatigue, and pain at the injection site. The findings of this study could help reduce vaccine hesitancy and decrease adverse reactions to vaccines.

 

In the Natural Language Processing (NLP) field, and on the same task that we examined, the authors of [1] applied an ensemble method [11] to three BERT transfer learning systems and demonstrated its effectiveness, obtaining an F1-score of 0.80. In the study of [2], on a self-training question-answering model, the authors used an ensemble method to boost their system's performance, achieving an F1-score of 0.76 with a multi-BERT ensemble.

In [4], the authors used transformer-based classification models such as RoBERTa and BERT on multiple COVID-19 vaccination tasks to classify Twitter users' feedback into positive, negative, and neutral classes. They also used an SVM as a meta-learner to improve model performance, applying two ensemble learning techniques: stacking and majority voting. A stacking classifier is an ML ensemble technique that merges several models into a single, more powerful one [3]. The experimental results show that the ensemble learning techniques outperform the individual classifiers. Another study that used these techniques is [5].

Khanday et al. [6] presented a Twitter dataset related to COVID-19 that was manually annotated with hate and non-hate classes. They used an ensemble learning approach and applied TF-IDF and Bag of Words as feature extraction techniques. As a result, the decision tree classifier outperformed the other traditional ML models, with a precision of 0.98. In another study, [7], state-of-the-art transformer-based models such as RoBERTa, BERTweet, and CT-BERT were used together with an MVEDL (Majority Voting-based Ensemble Deep Learning) model to identify informative tweets, in the context of limiting the spread of irrelevant information, in the WNUT-2020 Shared Task 2 [8]. The proposed system proved effective, with an F1-score of 0.91.

3 Dataset description

The SMM4H 2022 shared task organizers provide a Twitter dataset for Task 6. The task consists of identifying self-reported COVID-19 vaccination tweets written in English. The training dataset comprises 13,693 tweets in total; Table 1 shows the data distribution among the training, validation, and test sets.

Table 1: Dataset distribution for Task 6 of the SMM4H 2022 shared tasks

Training dataset:    1496 tweets (Self_reports class) + 12197 tweets (Vaccine_chatter class)
Validation dataset:  2784 tweets
Test dataset:        5923 tweets

About 11% of the training dataset belongs to the Self_reports class, which contains tweets from users clearly stating that they have been vaccinated; the other class contains tweets from users discussing vaccination status without revealing their own [4].

4 Experiments

4.1 Pre-processing

We performed some common text pre-processing steps before beginning feature engineering. This crucial step ensures that the dataset is in a suitable format and that noisy information is removed. We used the regular expression module in Python to perform string replacement operations on URLs, punctuation, and digits. We removed missing values and stop words, and then tokenized the dataset. After applying these steps, our text dataset was more consistent and clean.
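The paper does not list the exact cleaning rules; a minimal sketch of this kind of pipeline, assuming Python's re module and the NLTK stop-word list and tokenizer (both assumptions on our part), could look like the following:

```python
import re

from nltk.corpus import stopwords        # requires: nltk.download('stopwords')
from nltk.tokenize import word_tokenize  # requires: nltk.download('punkt')

STOP_WORDS = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    """Strip URLs, punctuation, and digits, then normalize whitespace and case."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^\w\s]|_", " ", text)              # punctuation
    text = re.sub(r"\d+", " ", text)                    # digits
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text: str) -> list:
    """Tokenize a cleaned tweet and drop English stop words."""
    return [t for t in word_tokenize(clean_tweet(text)) if t not in STOP_WORDS]

print(tokenize("Got my COVID-19 vaccine today! https://t.co/xyz"))
# -> ['got', 'covid', 'vaccine', 'today']
```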

We used a resampling technique to balance the dataset, removing examples from the majority class, which in our case was the Vaccine_chatter class.
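The paper does not name the resampling implementation; one plausible realization is random under-sampling with pandas (the file and column names below are hypothetical):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical file with 'tweet' and 'label' columns

minority = df[df["label"] == "Self_reports"]
majority = df[df["label"] == "Vaccine_chatter"]

# Randomly keep only as many majority-class tweets as there are minority-class tweets.
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=42)  # shuffle

print(balanced["label"].value_counts())  # both classes now have the minority-class size
```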

We then applied another fundamental step of the ML process: dividing the dataset into a training set and a test set. This split is used to evaluate the performance of the model on unseen data.
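Continuing the sketch above, this split is a single scikit-learn call (the 80/20 ratio is our assumption; the paper does not state it):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    balanced["tweet"], balanced["label"],
    test_size=0.2, random_state=42, stratify=balanced["label"],
)
```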

We started the feature engineering step by applying two popular tools from the text mining and NLP fields (CountVectorizer and TfidfTransformer) to build a document-term matrix. The first tool produces a matrix of token counts, in which each row represents a document (example) and each column represents a word from the vocabulary that the tool builds from the text data.

The second tool transforms the count matrix into a tf-idf (term frequency-inverse document frequency) representation, giving more weight to terms that are specific to a document and only marginal weight to terms that are prevalent across all documents.
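A sketch of this two-step feature pipeline in scikit-learn, chaining the two tools exactly as described (all parameters left at their defaults), continuing from the split above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vec = CountVectorizer()                    # builds the vocabulary and counts tokens
counts_train = count_vec.fit_transform(X_train)  # rows = documents, columns = vocabulary terms

tfidf = TfidfTransformer()                       # re-weights raw counts into tf-idf scores
X_train_tfidf = tfidf.fit_transform(counts_train)

# The test set is only transformed, using the vocabulary and weights fitted on training data.
X_test_tfidf = tfidf.transform(count_vec.transform(X_test))
```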

4.2 Model fine-tuning

We used the resulting features as input to build our text classification models. We implemented hyper-parameter tuning for the classical ML models in Python using the scikit-learn library, with the following adjustments (a code sketch follows the list):

- As in the work of [15], for the Naïve Bayes classifier we used the Multinomial model, setting the alpha factor to 0.5 (this adds smoothing to the model) [5].

- For the Logistic Regression classifier, we used the LogisticRegression model, setting the "C" parameter to 4 (it controls the inverse of the regularization strength) [6].

- For the Support Vector Machines classifier [16], we used the SVM model with the default kernel and default parameters.
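These settings map directly onto scikit-learn estimators; a minimal sketch, reusing the tf-idf features built above:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

nb = MultinomialNB(alpha=0.5)   # additive smoothing factor, as in [15]
lr = LogisticRegression(C=4)    # inverse of the regularization strength
svm = SVC()                     # default RBF kernel, default parameters

for name, clf in [("NB", nb), ("LR", lr), ("SVM", svm)]:
    clf.fit(X_train_tfidf, y_train)
    print(name, "accuracy:", clf.score(X_test_tfidf, y_test))
```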

5 Results and discussion

Table 2 provides our results using NB, SVM, and LR with tf-idf as the feature representation technique.

 

Table 2: Techno team submission results using classical machine learning

            NB (tf-idf)   LR (tf-idf)   SVM (tf-idf)   Ensemble learning
Precision   0.63          0.63          0.60           0.62
Recall      0.88          0.96          0.96           0.95
F1-score    0.74          0.76          0.74           0.75

 

After training the NB, LR, and SVM classifiers on the same dataset, we saved their predicted outputs. We then combined these outputs to apply an ensemble learning technique, using the mode() function to identify the most frequently predicted classes. The majority voting method [12] returns the label that was predicted by the majority of models, which serves as the final prediction for the ensemble model.
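The paper names a mode() call without specifying the library; one way to realize this majority vote, assuming pandas and the fitted classifiers from the sketch above, is:

```python
import pandas as pd
from sklearn.metrics import classification_report

# One column of predicted labels per trained classifier.
preds = pd.DataFrame({
    "nb": nb.predict(X_test_tfidf),
    "lr": lr.predict(X_test_tfidf),
    "svm": svm.predict(X_test_tfidf),
})

# Row-wise mode = the label predicted by the majority of the three models.
ensemble_pred = preds.mode(axis=1)[0]
print(classification_report(y_test, ensemble_pred))
```

With two classes and three voters there is always a strict majority, so the row-wise mode is unambiguous.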

The first system employed traditional ML models (SVM, NB, and LR); applying the under-sampling technique proved effective in resolving the issue of imbalanced data for this task. The Edinburgh_UCL_Health team [18] applied a resampling technique with an LSTM model and used three types of embeddings, namely GloVe, Flair-Forward, and Flair-Backward, along with under-sampling methods. Their best results were achieved with the LSTM (Glove_FlairFor_FlairBack_epoch30) model, obtaining an F1-score of 0.77 on the test dataset, which is very close to our F1-score of 0.76.

For our second system, we made multiple submissions using BERT pre-trained models (bert-base-uncased, bert-base-cased, roberta-base) and ensemble learning, without dataset resampling.
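The paper does not report the fine-tuning details; a minimal sketch with the Hugging Face transformers Trainer (the hyperparameters and file names here are our assumptions, not the team's reported configuration) could look like:

```python
import datasets  # Hugging Face datasets library
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

def tokenize_batch(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Hypothetical CSV files with a 'text' column and an integer 'label' column.
data = datasets.load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
data = data.map(tokenize_batch, batched=True)

args = TrainingArguments(output_dir="bert-vax", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=data["train"],
        eval_dataset=data["validation"]).train()
```

Repeating the same loop for bert-base-uncased and roberta-base would yield the three sets of predictions combined in the ensemble column of Table 3.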

Figure 1 summarizes the classification process for both systems, as well as the best-ranked system.

 

Figure 1: The classification process for both systems as well as the best-ranked system.

Table 3 shows the scores of our different submissions.

Table 3: Techno team submission results using language models and the ensemble learning technique

            bert-base-uncased   bert-base-cased   roberta-base [17]   Ensemble learning
Precision   0.94                0.85              0.86                0.88
Recall      0.71                0.78              0.71                0.73
F1-score    0.81                0.82              0.78                0.80

 

The transformer-based architecture of BERT [14] is designed to learn contextual representations of words in sentences. Unlike traditional deep learning models that consider context in only one direction, BERT takes context in both directions into account. This allows it to capture the relationships between words in a sentence and to achieve state-of-the-art results on a wide range of NLP tasks without the need for dataset resampling [13]. The results of our team's first-ranked system are publicly available at [7].

By aggregating the predictions of multiple BERT models, ensemble learning can mitigate the limitations of a single model; however, BERT's strong performance can also be demonstrated without ensemble learning.

The best results for the second system were achieved with the bert-base-cased pre-trained model, which obtained an F1-score of 0.82, as shown in Table 3. This highlights the strength of BERT-based pre-trained models, owing to their capacity to capture the context and interrelationships between words in a sentence. This is particularly significant in NLP tasks where the meaning of a word depends on its context. While traditional ML models may still have their uses, the findings of this study suggest that BERT-based models should be considered the preferred approach for NLP classification tasks. Further investigations involving other traditional ML models could be carried out to generalize the study.

6 Conclusion

 


In this study, we compared the performance of traditional ML models and transformer-based models on the same unseen dataset and applied ensemble learning to improve the results. Our results show that the under-sampling technique can effectively resolve the issue of imbalanced data in classical ML models. As the experimental results demonstrate, the state-of-the-art models reached an F1-score of 0.82 using the BERT language model. In future work, we plan to test the robustness of ensemble learning with different datasets and other ML models.

References

1. Zohair, M., Bhavsar, N., Bhatnagar, A., & Singh, M. (2022, October). Innovators@SMM4H'22: An ensembles approach for self-reporting of COVID-19 vaccination status tweets. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task (pp. 123-125).

 

2.      Xu, C., Barth, S., & Solis, Z. Applying Ensembling Methods to BERT to Boost Model Performance, report from Department of Computer Science. Stanford University, https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15775971.pdf (consulted 28 January 2023).

 

3.      Geers, A. L., Clemens, K. S., Faasse, K., Colagiuri, B., Webster, R., Vase, L., ... & Colloca, L. (2021). Psychosocial Factors Predict COVID-19 Vaccine Side Effects. Psychotherapy and Psychosomatics, 1.

 

 

4.      Ismail, Q., Obeidat, R., Alissa, K., & Al-Sobh, E. (2022, June). Sentiment analysis of covid-19 vaccination responses from twitter using ensemble learning. In 2022 13th International Conference on Information and Communication Systems (ICICS) (pp. 321-327). IEEE.

 

 

5.      Rahman, M. M., & Islam, M. N. (2021). Exploring the performance of ensemble machine learning classifiers for sentiment analysis of COVID-19 tweets. In Sentimental Analysis and Deep Learning: Proceedings of ICSADL 2021 (pp. 383-396). Singapore: Springer Singapore.

 

 

6.      Khanday, A. M. U. D., Rabani, S. T., Khan, Q. R., & Malik, S. H. (2022). Detecting twitter hate speech in COVID-19 era using machine learning and ensemble learning techniques. International Journal of Information Management Data Insights, 2(2), 100120.

 

 

7. Malla, S., & Alphonse, P. J. A. (2021). COVID-19 outbreak: An ensemble pre-trained deep learning model for detecting informative tweets. Applied Soft Computing, 107, 107495.

 

 

8.      Nguyen, D. Q., Vu, T., Rahimi, A., Dao, M. H., Nguyen, L. T., & Doan, L. (2020). WNUT-2020 task 2: identification of informative COVID-19 english tweets. arXiv preprint arXiv:2010.08232.

 

 

9.      Jennings, W., Stoker, G., Willis, H., Valgardsson, V., Gaskell, J., Devine, D., ... & Mills, M. C. (2021). Lack of trust and social media echo chambers predict COVID-19 vaccine hesitancy. MedRxiv, 2021-01.

 

 

10. Weissenbacher, D., Banda, J., Davydova, V., Zavala, D. E., Sánchez, L. G., Ge, Y., ... & Gonzalez, G. (2022). Overview of the seventh Social Media Mining for Health Applications (#SMM4H) shared tasks at COLING 2022. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task (pp. 221-241).

 

11.  Polikar, R. (2012). Ensemble Learning. In: Zhang, C., Ma, Y. (eds) Ensemble Machine Learning. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9326-7_1

 

12.  Kuncheva, L. I. (2014). Combining pattern classifiers: methods and algorithms. John Wiley & Sons.

 

13.  OpenAI. (2021). OpenAI's language models. https://openai.com/language-models/, (consulted 13 February 2023).

 

14.  Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

 

 

15.  McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, No. 1, pp. 41-48).

 

 

16.  Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their applications, 13(4), 18-28.

 

 

17.  Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

 

 

18.  Guellil, I., Wu, J., Wu, H., Sun, T., & Alex, B. (2022). Edinburgh_UCL_Health@ SMM4H'22: From Glove to Flair for handling imbalanced healthcare corpora related to Adverse Drug Events, Change in medication and self-reporting vaccination. In Proceedings of the 29th International Conference on Computational Linguistics (Vol. 2022, pp. 148-152).


[1] https://www.who.int/emergencies/diseases/novel-coronavirus-2019/covid-19-vaccines

[2] A harmless substance or treatment that may cause harmful side effects or worsening of symptoms because the patient thinks or believes they may occur or expects them to occur (https://www.cancer.gov/publications/dictionaries/cancer-terms/def/nocebo)

[3] https://vitalflux.com/stacking-classifier-sklearn-python-example/

[4] https://codalab.lisn.upsaclay.fr/competitions/3536

[5] https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

[6] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

[7] https://codalab.lisn.upsaclay.fr/competitions/3536#results