Comparison of Data Models for Unsupervised Twitter Sentiment Analysis


Bibliographic Details
Published in	Studia Universitatis Babes-Bolyai: Series Informatica Vol. 67; no. 2
Main Author	Sergiu LIMBOI
Format	Journal Article
Language	English
Published	Babes-Bolyai University, Cluj-Napoca 02.06.2023
Subjects	Sentiment Analysis, Twitter, Data Representation, Hashtags, Clustering
ISSN	1224-869X 2065-9601
DOI	10.24193/subbi.2022.2.05

Summary:	Identifying the sentiment of collected tweets has become a challenging and interesting task. In addition, mining and defining relevant features that can improve the quality of a classification system is crucial. The data modeling phase is fundamental for the whole process since it can reveal hidden information from the textual inputs. Two models are defined in the presented paper, considering Twitter-specific concepts: a hashtag-based representation and a text-based one. These models will be compared and integrated into an unsupervised system that determines groups of tweets based on sentiment labels (positive and negative). Moreover, word-embedding techniques (TF-IDF and frequency vectors) are used to convert the representations into a numeric input needed for the clustering methods. The experimental results show good values for Silhouette and Davies-Bouldin measures in the unsupervised environment. A detailed investigation is presented considering several items (dataset, clustering method, data representation, or word embeddings) for checking the best setup for increasing the quality of detecting the sentiment from Twitter’s messages. The analysis and conclusions show that the first results can be considered for more complex experiments.
Received by the editors:	4 April 2023
2010 Mathematics Subject Classification:	68T30, 68T50
1998 CR Categories and Descriptors:	I.2.7 [Artificial Intelligence]: Natural Language Processing – Text analysis
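
The pipeline outlined in the summary (two tweet representations, two vectorizations, clustering into two groups, internal validity scores) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the toy tweets, the helper names `hashtag_representation`, `text_representation` and `evaluate`, the cleaning rules, and the choice of k-means from scikit-learn are all assumptions, since the record does not say which clustering methods or preprocessing steps the author used.

```python
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Toy tweets invented for illustration; the paper's datasets are not reproduced here.
TWEETS = [
    "Loving the new phone, the camera is great #happy #love",
    "What a great sunny day with friends #happy #sunshine",
    "Best concert ever, amazing night #love #music",
    "Terrible service, waited two hours #angry #fail",
    "My flight got cancelled again, awful day #angry #travel",
    "Worst update ever, the app keeps crashing #fail #bug",
]

HASHTAG = re.compile(r"#(\w+)")


def hashtag_representation(tweet):
    """Keep only the hashtags of a tweet (hypothetical hashtag-based model)."""
    return " ".join(tag.lower() for tag in HASHTAG.findall(tweet))


def text_representation(tweet):
    """Keep only the plain text: drop hashtags, mentions and URLs (hypothetical text-based model)."""
    cleaned = re.sub(r"[#@]\w+|https?://\S+", " ", tweet)
    return " ".join(cleaned.lower().split())


def evaluate(docs, vectorizer, n_clusters=2, seed=0):
    """Vectorize the documents, cluster them, and return labels plus both validity scores."""
    # davies_bouldin_score expects a dense array, so the sparse matrix is converted.
    X = vectorizer.fit_transform(docs).toarray()
    # k-means is an assumed stand-in for the unspecified clustering methods;
    # two clusters mirror the positive/negative grouping.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return labels, silhouette_score(X, labels), davies_bouldin_score(X, labels)


results = {}
for rep_name, represent in (("hashtag", hashtag_representation), ("text", text_representation)):
    docs = [represent(t) for t in TWEETS]
    for vec_name, make_vectorizer in (("frequency", CountVectorizer), ("tf-idf", TfidfVectorizer)):
        results[(rep_name, vec_name)] = evaluate(docs, make_vectorizer())

for (rep_name, vec_name), (labels, sil, db) in results.items():
    print(f"{rep_name:8s} {vec_name:10s} silhouette={sil:.3f} davies_bouldin={db:.3f}")
```

A higher Silhouette value and a lower Davies-Bouldin value both indicate more compact, better separated clusters, so comparing the four printed rows is one simple way to contrast the representations and vectorizations on a given dataset.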