Natural language processing algorithms for domain-specific data extraction in material science: Reseractor

Bibliographic Details
Published in Journal of materials science, Vol. 59, no. 30, pp. 13856-13872
Main Authors Gupta, Antrakrate; Mittal, Divyansh; Goel, Ojsi; Jha, Shikhar Krishn
Format Journal Article
Language English
Published New York Springer US 01.08.2024
Springer
Springer Nature B.V
ISSN 0022-2461
1573-4803
DOI 10.1007/s10853-024-09980-z

Summary: Many tools and web search engines are trained to find journal articles among billions of research papers on millions of topics across different databases. Their high degree of generalizability often comes at the cost of specificity. Scientific pursuits need a tool that extracts data from selected resources for domain-specific tasks. Current algorithms and generalized tools lack specificity and are prone to errors when analysing a bundle of specific documents selected eclectically. The current work addresses this need with a tool that uses the user's input keywords and phrases to find relevant information in bundles of articles from the web. Reseractor is based on a customized algorithm, Whitespace, working in synergy with output from open-access tools for document image analysis and focused domain data extraction using NLP. The current tool is designed for the material science domain and can adopt various generalized and scientific corpora as layers. Tested on two different bundles of papers, it gives an accuracy of 81.12%, a recall of 78.38%, and a precision of 84.06%. Because the algorithms are simple and directly applicable, users from other domains can plug their own corpora into the algorithms and remodel the tool for their purpose. The current work fulfills the need for domain-specific experimental data extraction, stored in organized and structured databases, for upcoming computational researchers.
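To put the quoted recall and precision on a single scale, they can be combined into an F1 score, the harmonic mean of the two. The sketch below is only an illustration of that standard formula applied to the figures in the summary. The function and variable names are hypothetical, and this record does not say how the paper itself computed its accuracy figure.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the summary, converted from percentages to fractions.
reported_precision = 0.8406
reported_recall = 0.7838

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1 * 100:.2f}%")  # prints "F1 = 81.12%"
```

The result coincides with the 81.12% reported as accuracy. Whether that figure is defined as an F1 score in the paper cannot be determined from this record.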