A Semi-Supervised Learning Algorithm for Microblog Rumor Detection
Published in | 计算机应用研究 (Application Research of Computers) Vol. 33; no. 3; pp. 744-748 |
---|---|
Main Author | |
Format | Journal Article |
Language | Chinese |
Published | 2016. Unit 61516 of the Chinese People's Liberation Army, Beijing 100094; School of Computer Science and Technology, Shandong University, Jinan 250101 |
Subjects | |
ISSN | 1001-3695 |
DOI | 10.3969/j.issn.1001-3695.2016.03.024 |
Summary: | In microblog rumor detection, correctly labeling rumors requires a great deal of manpower and time, and class imbalance in the data also hampers the correct identification of rumors. To address this problem, the paper proposes an improved method, based on the Co-Forest algorithm, for imbalanced datasets. The method uses the SMOTE algorithm and stratified sampling to balance the data distribution, and uses cost-sensitive weighted voting to improve prediction accuracy on unlabeled samples. Only a small number of training instances need to be labeled with a rumor category for rumors to be detected effectively. Experiments on 10 UCI datasets and 2 microblog rumor datasets demonstrate the effectiveness of the algorithm. |
---|---|
Bibliography: | 51-1196/TP. Keywords: microblog; rumor detection; imbalanced data; semi-supervised learning; Co-Forest algorithm; SMOTE; cost sensitive. Authors: Lu Tongqiang, Shi Bing, Yan Zhongmin, Zhou Pei (1. School of Computer Science & Technology, Shandong University, |
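The summary names two building blocks, SMOTE oversampling and cost-sensitive weighted voting, without giving their details. The following is a minimal illustrative sketch of both ideas, not the paper's implementation: the function names `smote` and `cost_weighted_vote`, the neighbor count `k`, the per-tree `weights`, and the `cost` table are all assumptions introduced here for illustration.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def cost_weighted_vote(votes, weights, cost):
    """Combine per-tree votes into one label.

    votes   -- label predicted by each tree
    weights -- hypothetical per-tree weight (for example, estimated accuracy)
    cost    -- hypothetical cost of missing each class; a higher cost makes
               votes for that class count more
    """
    score = {}
    for label, weight in zip(votes, weights):
        score[label] = score.get(label, 0.0) + weight * cost[label]
    return max(score, key=score.get)


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=4, k=2)

label = cost_weighted_vote(
    votes=["rumor", "normal", "normal"],
    weights=[0.9, 0.6, 0.6],
    cost={"rumor": 2.0, "normal": 1.0},
)
```

In this toy call the single "rumor" vote scores 0.9 × 2.0 = 1.8 against 1.2 for the two "normal" votes, so the minority label wins. With equal costs the majority label would win instead, which is the kind of bias toward the majority class that a cost-sensitive vote is meant to counter.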