MapReduce-Based Similarity Detection for Massive Data
To address the problem of near-duplicate content in today's massive data sets, a method is proposed for detecting similar documents with the SimHash algorithm under MapReduce: the massive document collection stored in the distributed file system is first classified; features are then extracted, and the SimHash algorithm generates SimHash fingerprints and a Sequence File; finally, similarity is computed to produce the detection result. Experiments show that the proposed detection method and the designed similarity algorithm are well suited to similarity detection over massive data and effectively improve work efficiency.
Published in: 实验室研究与探索 (Research and Exploration in Laboratory), Vol. 33, No. 9, pp. 132-136
Main Author: ZHANG Min
Format: Journal Article
Language: Chinese
Published: College of Surveying & Land Information Engineering, Henan Polytechnic University, Jiaozuo 454003, China, 2014
Subjects: similarity; MapReduce; mass data; algorithms; duplicate removal
ISSN: 1006-7167
Bibliography: ZHANG Min (College of Surveying & Land Information Engineering, Henan Polytechnic University, Jiaozuo 454003, China)
Abstract (English): For the problem of near-duplication in big data, this paper offers an approach to finding similar documents using the SimHash algorithm and MapReduce. The approach consists of several steps. First, massive documents stored in the DFS (Distributed File System) are classified; then, data features are extracted, and a SimHash fingerprint and a Sequence File are produced by the SimHash algorithm; finally, the detection result is generated by computing similarity. Experiments show that the presented approach and the designed similarity algorithm suit near-duplicate detection for big data well and can greatly improve work efficiency.
CN: 31-1707/T
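The fingerprinting and similarity-computation steps named in the abstract can be sketched as below. This is a minimal illustration of the standard SimHash technique, not the paper's implementation: the 64-bit width, MD5 as the per-feature hash, uniform feature weights, and the Hamming-distance threshold `k` are all assumptions.

```python
import hashlib

def simhash(features, bits=64):
    """Combine weighted per-feature hashes into one SimHash fingerprint."""
    v = [0] * bits
    for feat, weight in features:
        # Hash each feature to a bits-wide integer (MD5 is an assumption here).
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            # Each hash bit votes +weight or -weight for that fingerprint bit.
            v[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")

def near_duplicate(fp_a, fp_b, k=3):
    """Treat documents as near-duplicates if fingerprints differ in <= k bits."""
    return hamming(fp_a, fp_b) <= k
```

In a MapReduce setting, each mapper would emit the fingerprint for one document (e.g. into a Sequence File, as the abstract describes), and reducers would compare fingerprints by Hamming distance.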