PhD defense: Ashwin Geet D’Sa (Multispeech)

May 6, 2022 @ 10:00 - 12:00

Ashwin Geet D’Sa (Multispeech) will defend his thesis on May 6th, 2022 at 10 am, in Room A008. His presentation is entitled:

“Expanding the training data for neural network based hate speech classification”

Abstract:

The phenomenal increase in internet usage, catering to the dissemination of knowledge and expression, has also led to an increase in online hate speech. Online hate speech is anti-social communicative behavior that leads to threats and violence against an individual or a group. Deep learning-based models have become the state-of-the-art solution in classifying hate speech. However, the performance of these models depends on the amount of labeled training data. In this thesis, we explore various solutions to expand the training data in order to train a reliable model for hate speech classification.

As the first approach, we use semi-supervised learning to combine the huge amount of unlabeled data, easily available on the internet, with a limited amount of labeled data to train the classifier. For this, we use the label-propagation algorithm. The performance of this method depends on the representation space of labeled and unlabeled data. We show that pre-trained sentence embeddings are label-agnostic and yield poor results. We propose a simple and effective neural-network-based approach for transforming these pre-trained representations into task-aware ones. This method achieves significant performance improvements in low-resource scenarios.
In our second approach, we explore data augmentation, a solution to obtain synthetic samples using the original training data. Our data augmentation technique is based on a single conditional GPT-2 language model fine-tuned on the original training data. Our approach uses a fine-tuned BERT model to select high-quality synthetic data. We study the effect of the quantity of augmented data and show that using a few thousand synthetic samples yields significant performance improvements in hate speech classification. Our qualitative evaluation shows the effectiveness of using BERT for filtering the generated samples.
For our final approach, we use multi-task learning as a method to combine several available hate speech datasets and jointly train a single classification model. Our approach leverages the advantages of a pre-trained language model (BERT) as shared layers of our multi-task architecture. We treat each hate speech corpus as one task, thus adapting the paradigm of multi-task learning to multi-corpus learning. We show that training a multi-task model with several corpora achieves performance similar to that of training several corpus-specific models. Nevertheless, fine-tuning the multi-task model for a specific corpus improves the results. We demonstrate the effectiveness of our multi-task learning approach for domain adaptation on hate speech corpora.
We explore the three proposed approaches in low-resource scenarios and show that they achieve significant performance improvements in very low-resource setups.

Jury members:

PhD Advisors:
Irina ILLINA, Associate Professor, Université de Lorraine
Dominique FOHR, Research Scientist, CNRS
Reviewers:
Richard DUFOUR, Professor, Laboratoire des Sciences du Numérique de Nantes (LS2N)
Pavel KRÁL, Associate Professor, University of West Bohemia
Examiners:
Georges LINARÈS, Professor, Université d’Avignon
François PORTET, Professor, Laboratoire d’Informatique de Grenoble
Josiane MOTHE, Professor, Université de Toulouse
Christophe CERISARA, Research Scientist, CNRS
Invited members:
Dietrich KLAKOW, Professor, Universität des Saarlandes
Angeliki MONNIER, Professor, Université de Lorraine

Details

Date:
May 6, 2022
Time:
10:00 - 12:00

Venue

A008