Sentiment analysis and spam filtering using the YAC2 clustering algorithm with transferability


Bibliographic Details
Published in Computers & Industrial Engineering Vol. 165; p. 107959
Main Authors Ghiassi, M.; Lee, Sean; Gaikwad, Swati Ramesh
Format Journal Article
Language English
Published Elsevier Ltd 01.03.2022
ISSN 0360-8352
1879-0550
DOI 10.1016/j.cie.2022.107959

Summary:
• Application of the YAC2 clustering algorithm to textual data is presented.
• Efficacy of the approach is measured against KNN, DBSCAN, and Spectral clustering alternatives.
• A domain-transferable feature engineering approach is developed for diverse datasets.
• Intelligent feature engineering can improve performance regardless of the tools used.

Two notable applications of text classification are sentiment analysis and spam filtering. Traditional machine learning approaches to text classification are often complex, non-transferable, and require supervision. This paper introduces an unsupervised approach to text classification that is relatively simple and transfers between problem domains, while providing accuracy comparable to or better than that of established alternatives. We present an integrated solution that combines a new clustering algorithm, Yet Another Clustering Algorithm (YAC2), with a domain-transferable feature engineering approach for Twitter sentiment analysis and spam filtering of YouTube comments. We evaluate the effectiveness of this integrated solution for Twitter sentiment analysis using three datasets: Starbucks, Verizon, and Southwest Airlines. YouTube spam filtering is evaluated using four datasets: Psy, LMFAO, Shakira, and Katy Perry. We compare the results with established clustering solutions: KNN, Spectral, and DBSCAN. Our integrated solution performs better than all the alternatives for sentiment analysis. For spam filtering, YAC2 and KNN perform within 1% of each other and far better than Spectral and DBSCAN on all datasets. Additionally, our feature engineering approach improves accuracy relative to a traditional method, while significantly reducing model dimensionality and matrix sparsity and providing transferability across the datasets tested.
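
The summary's claim about reduced dimensionality and matrix sparsity can be made concrete with a small sketch. The code below is not the authors' feature engineering method, which this record does not describe. It only shows, under stated assumptions, how the two quantities are measured: the example texts, the word lists POSITIVE and NEGATIVE, and the three-column feature layout are all hypothetical and invented for illustration. A plain bag-of-words matrix stands in for the "traditional method".

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Build a document-term count matrix with one column per distinct token."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in tokenize(doc):
            matrix[i][index[tok]] += 1
    return matrix


# Hypothetical word lists, invented for this example only.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "bad", "late"}


def compact_features(docs):
    """Map each document to three counts: positive words, negative words, tokens."""
    rows = []
    for doc in docs:
        toks = tokenize(doc)
        rows.append([
            sum(t in POSITIVE for t in toks),
            sum(t in NEGATIVE for t in toks),
            len(toks),
        ])
    return rows


def sparsity(matrix):
    """Return the fraction of zero entries in a list-of-lists matrix."""
    cells = [value for row in matrix for value in row]
    return sum(value == 0 for value in cells) / len(cells)


# Made-up example texts; they are not drawn from the paper's datasets.
docs = [
    "I love the new latte, great service",
    "My flight was late again, bad experience",
    "Good coverage but I hate the bill",
]

bow = bag_of_words(docs)
compact = compact_features(docs)

# Report (number of columns, fraction of zeros) for each representation.
print("bag of words:", len(bow[0]), round(sparsity(bow), 2))
print("compact:     ", len(compact[0]), round(sparsity(compact), 2))
```

In a bag-of-words matrix the column count grows with the vocabulary, so most cells are zero for short texts such as tweets or comments. A fixed set of engineered features keeps the column count constant across datasets, which is one plausible reading of what makes a feature set transferable between domains.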