Improving Publication Pipeline with Automated Biological Entity Detection and Validation Service

Bibliographic Details
Published in	Data and information management Vol. 3; no. 1; pp. 3 - 17
Main Authors	Xu, Weijia; Gupta, Amit; Jaiswal, Pankaj; Taylor, Crispin; Lockhart, Patti; Regala, Jennifer
Format	Journal Article
Language	English
Published	Warsaw Sciendo 01.03.2019 Elsevier Limited
Subjects	Accessibility; Annotations; Applications programs; Automation; digital curation; digital library; entity extraction; Identification methods; machine learning; natural language processing; ontology; Periodicals; Pilot projects; Recall; text mining; Use statistics
ISSN	2543-9251 2543-9251
DOI	10.2478/dim-2019-0003

Summary:	With the increasing number of digital journal submissions, there is a need to deploy new, scalable computational methods to improve information accessibility. One common task is to identify useful information and named entities in text documents such as journal article submissions. However, many technical challenges limit the applicability of general methods, and general-purpose tools are lacking. In this paper, we present the domain informational vocabulary extraction (DIVE) project, which aims to enrich digital publications by detecting entities and key informational words and by adding annotations. To our knowledge, our system is the first of its kind to engage the authors of peer-reviewed articles and the journal publishers by integrating the DIVE implementation into the manuscript proofing and publication process. The system implements multiple strategies for biological entity detection, including regular expression rules, an ontology, and a keyword dictionary. The extracted entities are stored in a database and made accessible through an interactive web application for curation and evaluation by authors. Through the web interface, authors can add annotations and correct the current results. These updates can then be used to improve entity detection in subsequently processed articles. We describe our framework and deployment in detail. In a pilot program, we have deployed the first phase of development as a service integrated with the journals Plant Physiology and The Plant Cell, published by the American Society of Plant Biologists (ASPB). We present usage statistics from its production launch in April 2018 to date. We compare automated recognition results from DIVE with results from author curation and show that the service achieved, on average, 80% recall and 70% precision per article. In contrast, an existing biological entity extraction tool, A Biomedical Named Entity Recognizer (ABNER), achieves only 47% recall and returns a much larger candidate set.
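
The summary names regular expression rules and a keyword dictionary as two of the detection strategies, and reports per-article precision and recall against author curation. The Python sketch below illustrates how such pieces might fit together. It is not the DIVE implementation: the rule `GENE_ID_RULE` (an Arabidopsis-style locus identifier pattern), the dictionary `KEYWORDS`, and the functions `detect_entities` and `precision_recall` are hypothetical names chosen for illustration, and the ontology-based strategy is omitted.

```python
import re

# Hypothetical rule (assumption): Arabidopsis-style locus IDs such as AT1G01010.
GENE_ID_RULE = re.compile(r"\bAT[1-5CM]G\d{5}\b")

# Hypothetical keyword dictionary (assumption): surface form -> entity type.
KEYWORDS = {
    "arabidopsis thaliana": "species",
    "abscisic acid": "chemical",
}


def detect_entities(text):
    """Return the set of (surface form, type) pairs found by either strategy."""
    # Strategy 1: regular expression rule for gene identifiers.
    found = {(m.group(0), "gene") for m in GENE_ID_RULE.finditer(text)}
    # Strategy 2: case-insensitive whole-phrase dictionary lookup.
    lowered = text.lower()
    for term, kind in KEYWORDS.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add((term, kind))
    return found


def precision_recall(predicted, curated):
    """Score predicted entities against an author-curated set for one article."""
    true_pos = len(predicted & curated)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(curated) if curated else 0.0
    return precision, recall


if __name__ == "__main__":
    article = "AT1G01010 responds to abscisic acid in Arabidopsis thaliana."
    predicted = detect_entities(article)
    # Made-up curation result, for illustration only.
    curated = {("AT1G01010", "gene"), ("abscisic acid", "chemical"),
               ("AT2G46830", "gene"), ("drought", "condition")}
    print(precision_recall(predicted, curated))
```

Representing both the predictions and the curated annotations as sets keeps the scoring a simple set intersection. Averaging the per-article pairs over a corpus would give figures comparable in kind to those quoted in the summary.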