Natural language processing algorithms for domain-specific data extraction in material science: Reseractor

Bibliographic Details
Published in	Journal of materials science Vol. 59; no. 30; pp. 13856 - 13872
Main Authors	Gupta, Antrakrate; Mittal, Divyansh; Goel, Ojsi; Jha, Shikhar Krishn
Format	Journal Article
Language	English
Published	New York; Springer US; 01.08.2024; Springer; Springer Nature B.V
Subjects	Algorithms; Characterization and Evaluation of Materials; Chemistry and Materials Science; Classical Mechanics; Computation & Theory; Computational linguistics; Crystallography and Scattering Methods; Data analysis; Documents; domain; Equipment and supplies; Image analysis; Image processing; Language processing; Materials Science; Natural language interfaces; Natural language processing; Polymer Sciences; Software; Solid Mechanics; Webs
ISSN	0022-2461 1573-4803
DOI	10.1007/s10853-024-09980-z

More Information
Summary:	Many tools and web engines are trained to find journal articles among billions of research papers on millions of topics in different databases, but their high degree of generalizability often comes at the cost of specificity. Scientific work needs a tool that extracts data from selected resources for domain-specific tasks. Current algorithms and generalized tools lack this specificity and are prone to errors when analysing data from an eclectically selected bundle of specific documents. The present work addresses the need for such a tool, one that uses the user's input keywords and phrases to find relevant information in bundles of articles from the web. Reseractor is based on a customized algorithm, Whitespace, combined with output from open-access tools for document image analysis and focused domain data extraction using NLP. The tool is designed for the material science domain and can adopt various generalized and scientific corpora as layers. Tested on two sets of different bundles of papers, it gives an accuracy of 81.12%, a recall of 78.38% and a precision of 84.06%. Because the algorithms are simple and directly applicable, users from other domains can use their own corpora in the algorithms and remodel the tool for their purposes. The work meets the need for extracting domain-specific experimental data and storing it in organized, structured databases for upcoming computational researchers.
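
The summary reports accuracy, recall and precision for keyword-driven extraction, so a small illustration of those ideas may help. The sketch below is not the Whitespace algorithm or any part of Reseractor, neither of which is described in this record. It assumes a naive case-insensitive keyword match over sentences and standard confusion-matrix definitions of the three metrics. The function names `find_relevant_sentences` and `evaluate`, and the sample text, are hypothetical.

```python
import re


def find_relevant_sentences(text, keywords):
    """Return the sentences that mention at least one keyword.

    Assumption: plain case-insensitive substring matching, used here only
    as a stand-in for a real domain-specific extraction step.
    """
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lowered = [k.lower() for k in keywords]
    return [s for s in sentences if any(k in s.lower() for k in lowered)]


def evaluate(predicted, relevant, total):
    """Accuracy, precision and recall from sets of item identifiers.

    predicted: items the tool extracted
    relevant:  items that should have been extracted (ground truth)
    total:     number of candidate items overall
    """
    tp = len(predicted & relevant)   # correctly extracted
    fp = len(predicted - relevant)   # extracted but not relevant
    fn = len(relevant - predicted)   # relevant but missed
    tn = total - tp - fp - fn        # correctly ignored
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical sample text, not taken from the paper.
sample = (
    "The band gap of ZnO is 3.37 eV. "
    "Samples were annealed at 500 C. "
    "We thank the funding agency."
)
hits = find_relevant_sentences(sample, ["band gap", "annealed"])
scores = evaluate(predicted={0, 1, 2}, relevant={0, 1, 3}, total=5)
```

Here `hits` holds the first two sentences of the sample, and `scores` follows from two true positives, one false positive, one false negative and one true negative. A real pipeline would replace the substring match with corpus-based matching of the kind the summary describes.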