GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents

Bibliographic Details
Published in	Journal of the Institution of Engineers (India). Series B, Electrical Engineering, Electronics and telecommunication engineering, Computer engineering Vol. 98; no. 5; pp. 467 - 476
Main Authors	Srinivasa, K. G., Shree Devi, B. N.
Format	Journal Article
Language	English
Published	New Delhi: Springer India, 01.10.2017; Springer Nature B.V.
Subjects	Algorithms; Communications Engineering; Data management; Data mining; Engineering; Information retrieval; Networks; Original Contribution; Search algorithms; Search process; String matching; Strings; Tables; Texts; CUDA; N-gram; NVIDIA; String searching; Trigram; GPGPU; Score table
ISSN	2250-2106 2250-2114
DOI	10.1007/s40031-017-0295-3

Summary:	String searching in documents has become a tedious task with the evolution of Big Data. The generation of large data sets demands high-performance search algorithms in areas such as text mining and information retrieval. GPUs have become increasingly popular for general-purpose computing across various applications, so it is of great interest to exploit a GPU’s many threads to provide a high-performance search algorithm. This paper proposes an optimized new approach to the N-gram model for string search in a number of lengthy documents, together with its GPU implementation. The algorithm exploits GPGPUs to search for strings in many documents, employing character-level N-gram matching with a parallel Score Table approach and a search implemented with the CUDA API. The Score Table, which stores the frequency of the N-grams in a document, makes the search independent of the document’s length and allows faster access to the frequency values, thus decreasing the search complexity. The GPU’s many threads are exploited to enable parallel pre-processing of trigrams in a document for Score Table creation and parallel search across a large number of documents, speeding up the whole search process even for large pattern sizes. Experiments were carried out on many documents of varied length, with search strings drawn from the standard Lorem Ipsum text, on NVIDIA’s GeForce GT 540M GPU with 96 cores. The results show that the parallel approach to Score Table creation and searching achieves a good speedup over the same approach executed serially.
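
The Score Table idea in the summary can be illustrated with a minimal serial sketch in Python. This is not the authors' CUDA implementation: the summary does not specify the table layout or the scoring rule, so the dictionary-based table, the fraction-of-trigrams score, and the function names `ngrams`, `build_score_table`, `score` and `search` are all assumptions made for illustration. The sketch only shows why a lookup against a precomputed frequency table costs time proportional to the pattern length rather than the document length.

```python
from collections import Counter


def ngrams(text, n=3):
    """Return all overlapping character-level n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_score_table(document, n=3):
    """Pre-process one document: map each n-gram to its frequency.

    Hypothetical layout; the paper's actual table format is not given
    in the summary.
    """
    return Counter(ngrams(document, n))


def score(pattern, table, n=3):
    """Fraction of the pattern's n-grams that occur in the table.

    Only the pattern is scanned here, so the cost does not depend on
    the length of the original document.
    """
    grams = ngrams(pattern, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if table[g] > 0)
    return hits / len(grams)


def search(pattern, tables, n=3):
    """Score the pattern against every document's table.

    Each table is independent of the others, which is what would allow
    one GPU thread (or block) per document in a parallel version.
    """
    return [score(pattern, t, n) for t in tables]


documents = ["lorem ipsum dolor sit amet", "consectetur adipiscing elit"]
tables = [build_score_table(d) for d in documents]
scores = search("ipsum dolor", tables)
```

In this sketch a score of 1.0 means every trigram of the pattern occurs somewhere in the document; it acts as a filter and does not by itself guarantee that the pattern occurs as one contiguous substring.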