Extracting Variant Forms of Chemical Names for Information Retrieval

Published in Information research Vol. 13; no. 3
Main Author Pirkola, Ari
Format Journal Article
Language English
Published InformationR.net 01.09.2008
ISSN 1368-1613

Abstract Chemical substance names are long, complex and prone to variation. This study investigates the retrieval effects of the variation. A large set of acronyms and associated text parts was extracted from a subset of the Medline collection and used to construct a full name -- acronym index. A longest common subsequence and statistics based technique (named FNV-Finder) was devised to identify MeSH term variants from the full name -- acronym index for use as query terms in searching. The average number of variants for each MeSH term, the performance of the FNV-Finder technique and retrieval performance were evaluated. The average number of unique variants for each MeSH term denoting a chemical substance is 2.82. The FNV-Finder technique achieved 95.0% recall and 97.1% precision. The retrieval experiments showed that the collection contains a substantial number of documents that contain only variant forms of the MeSH terms (and do not contain the MeSH terms or CAS registry numbers). The selection of variant forms for queries from a collection would be very useful or even necessary in chemical name searching. Variant forms can be selected readily from the full name -- acronym index either manually or automatically using the FNV-Finder technique. Adapted from the source document.
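The abstract names longest common subsequence (LCS) matching as one ingredient of FNV-Finder but does not say how it is scored or combined with the statistical component. The sketch below illustrates only the LCS idea: it ranks candidate full names from an index by their LCS overlap with a MeSH term. The function names, the normalisation by the longer string, and the 0.8 threshold are assumptions made for illustration, not details taken from the article.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(term: str, candidate: str) -> float:
    """Assumed normalisation: LCS length over the longer string's length."""
    t, c = term.lower(), candidate.lower()
    if not t or not c:
        return 0.0
    return lcs_length(t, c) / max(len(t), len(c))


def find_variants(term, full_names, threshold=0.8):
    """Return index entries that resemble the term without being identical.

    The threshold is a hypothetical cut-off, not a value from the article.
    """
    return [
        name for name in full_names
        if name.lower() != term.lower()
        and lcs_similarity(term, name) >= threshold
    ]


if __name__ == "__main__":
    index = ["acetyl choline", "dopamine", "Acetylcholine"]
    print(find_variants("acetylcholine", index))
```

A plain LCS ratio like this would presumably over-match on short names, which may be one reason the technique is described as also relying on statistics.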
Introduction. Chemical substance names are long, complex and prone to variation. This study investigates the retrieval effects of the variation.
Method. A large set of acronyms and associated text parts was extracted from a subset of the Medline collection and used to construct a full name - acronym index. A technique based on longest common subsequence and statistics (named FNV-Finder) was devised to identify MeSH term variants from the full name - acronym index for use as query terms in searching. The average number of variants for each MeSH term, the performance of the FNV-Finder technique and retrieval performance were evaluated.
Results. The average number of unique variants for each MeSH term denoting a chemical substance is 2.82. The FNV-Finder technique achieved 95.0% recall and 97.1% precision. The retrieval experiments showed that the collection contains a substantial number of documents that contain only variant forms of the MeSH terms (and do not contain the MeSH terms or CAS registry numbers).
Conclusions. The selection of variant forms for queries from a collection would be very useful or even necessary in chemical name searching. Variant forms can be selected readily from the full name - acronym index either manually or automatically using the FNV-Finder technique.
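The recall and precision figures quoted for FNV-Finder are presumably the standard set-based measures: precision is the share of proposed variants that are correct, and recall is the share of correct variants that were proposed. A minimal sketch of that calculation follows; the function name and the example sets are invented for illustration and are not data from the study.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for a set of proposed items
    against a set of known-correct items."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical variant sets, labelled v1..v5 for illustration only.
    proposed = {"v1", "v2", "v3", "v4"}
    correct = {"v1", "v2", "v5"}
    p, r = precision_recall(proposed, correct)
    print(f"precision={p:.3f} recall={r:.3f}")
```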
Author Pirkola, Ari
Author_xml – sequence: 1
  givenname: Ari
  surname: Pirkola
  fullname: Pirkola, Ari
ContentType Journal Article
Copyright free
Copyright_xml – notice: free
DatabaseName Library & Information Science Abstracts (LISA)
Latindex
DeliveryMethod fulltext_linktorsrc
Discipline Library & Information Science
EISSN 1368-1613
ExternalDocumentID oai_record_508693
ISSN 1368-1613
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
LinkModel OpenURL
OpenAccessLink http://dialnet.unirioja.es/servlet/oaiart?codigo=2863060
PQID 57739382
PQPubID 23477
ParticipantIDs latinindex_primary_oai_record_508693
proquest_miscellaneous_57739382
ProviderPackageCode 77F
PublicationCentury 2000
PublicationDate 2008-09-01
PublicationDateYYYYMMDD 2008-09-01
PublicationDate_xml – month: 09
  year: 2008
  text: 2008-09-01
  day: 01
PublicationDecade 2000
PublicationTitle Information research
PublicationYear 2008
Publisher InformationR.net
Publisher_xml – name: InformationR.net
SourceID latinindex
proquest
SourceType Open Access Repository
Aggregation Database
SubjectTerms Chemical names
Search strategies
Subject indexing
Title Extracting Variant Forms of Chemical Names for Information Retrieval
URI https://www.proquest.com/docview/57739382
http://dialnet.unirioja.es/servlet/oaiart?codigo=2863060
Volume 13
linkProvider ISSN International Centre
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Extracting+variant+forms+of+chemical+names+for+information+retrieval&rft.jtitle=Information+research&rft.au=Pirkola%2C+Ari&rft.date=2008-09-01&rft.pub=InformationR.net&rft.issn=1368-1613&rft.eissn=1368-1613&rft.volume=13&rft.issue=3&rft.externalDocID=oai_record_508693