MapReduce-Based Similarity Detection for Massive Data
To address the problem of near-duplicate content in today's massive data sets, a method is proposed for detecting similar documents with the SimHash algorithm under MapReduce: the massive document collection stored in the distributed file system is first classified; features are then extracted, and the SimHash algorithm generates SimHash fingerprints and a Sequence File; finally, similarity is computed to produce the detection result. Experiments show that the proposed detection method and the designed similarity algorithm are well suited to similarity detection over massive data and effectively improve work efficiency.
Published in: 实验室研究与探索 (Research and Exploration in Laboratory), Vol. 33, No. 9, pp. 132-136
Main Author: ZHANG Min
Format: Journal Article
Language: Chinese
Published: College of Surveying & Land Information Engineering, Henan Polytechnic University, Jiaozuo 454003, China, 2014
Subjects: similarity; MapReduce; mass data; algorithms; duplicate removal
ISSN: 1006-7167
Bibliography: ZHANG Min (College of Surveying & Land Information Engineering, Henan Polytechnic University, Jiaozuo 454003, China)
Abstract (English): For the problem of near-duplication in big data, this paper offers an approach to finding similar documents using the SimHash algorithm and MapReduce. The approach consists of several steps. First, massive documents stored in the DFS (Distributed File System) are classified; then, data features are extracted, and a SimHash fingerprint and a Sequence File are produced by the SimHash algorithm; finally, the detection result is generated by computing similarity. Experiments show that the presented approach and the designed similarity algorithm suit near-duplicate detection for big data well and can greatly improve work efficiency.
CN: 31-1707/T
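The fingerprinting and similarity-computation steps named in the abstract can be sketched as below. This is a minimal illustration of the standard SimHash technique, not the paper's implementation: the 64-bit width, MD5 as the per-feature hash, uniform feature weights, and the Hamming-distance threshold `k` are all assumptions.

```python
import hashlib

def simhash(features, bits=64):
    """Combine weighted per-feature hashes into one SimHash fingerprint."""
    v = [0] * bits
    for feat, weight in features:
        # Hash each feature to a bits-wide integer (MD5 is an assumption here).
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            # Each hash bit votes +weight or -weight for that fingerprint bit.
            v[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")

def near_duplicate(fp_a, fp_b, k=3):
    """Treat documents as near-duplicates if fingerprints differ in <= k bits."""
    return hamming(fp_a, fp_b) <= k
```

In a MapReduce setting, each mapper would emit the fingerprint for one document (e.g. into a Sequence File, as the abstract describes), and reducers would compare fingerprints by Hamming distance.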