Comparative analysis of QSAR feature selection methods

Bibliographic Details
Published in AIP Conference Proceedings Vol. 3004; no. 1
Main Authors Davronov, Rifkat; Kushmuratov, Samariddin
Format Journal Article; Conference Proceeding
Language English
Published Melville, American Institute of Physics, 11.03.2024
ISSN 0094-243X
1935-0465
1551-7616
DOI 10.1063/5.0199872

Abstract Quantitative structure-activity relationships (QSAR) describe the relationship between quantitative chemical structural properties (molecular descriptors) and biological activity. QSAR methods are increasingly used in drug discovery and development because they can save significant time and human resources. Several factors affect the predictive performance of QSAR models. On the one hand, various statistical methods can be used to study the linear or nonlinear behavior of a data set. On the other hand, feature selection approaches are used to reduce model complexity, limit the risk of overfitting (overtraining), and select the most important descriptors from lists of hundreds. A mathematical model is then used to relate the selected descriptors to the biological activity of the corresponding molecule. A variety of modeling strategies can be used, some of which involve explicit feature selection. QSAR models are useful for developing new compounds with increased potency in the class under consideration. Only connections that are considered interesting are created. Learning algorithms face the feature selection problem: selecting a meaningful subset of features of interest while ignoring the rest. This paper presents a comparative analysis of the Chi-square, Mutual Information, ANOVA F-value, Fisher Score, Permutation Importance, Recursive Feature Elimination, Random Forest, LightGBM and SHAP feature selection methods used in QSAR modeling. The Python code written to obtain the experimental results in this article has been uploaded to GitHub (https://github.com/kushmuratoff/feature_selection).
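As a rough illustration of one filter method named in the abstract, the sketch below ranks descriptors by Fisher score using only the Python standard library. It is not the authors' code (that is in the linked repository). The function names `fisher_scores` and `select_top_k`, the variable names, and the toy descriptor matrix are hypothetical, and the score follows one common definition: between-class scatter divided by within-class scatter.

```python
import math
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per column of X.

    X: list of rows (one row per molecule, one column per descriptor).
    y: list of class labels (e.g. 0 = inactive, 1 = active).
    Score_j = sum_k n_k * (mean_kj - mean_j)^2 / sum_k n_k * var_kj,
    where k runs over classes. Larger scores mean better class separation.
    """
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    scores = []
    for j in range(len(X[0])):
        overall_mean = fmean(row[j] for row in X)
        between = 0.0
        within = 0.0
        for rows in rows_by_class.values():
            column = [row[j] for row in rows]
            between += len(column) * (fmean(column) - overall_mean) ** 2
            within += len(column) * pvariance(column)
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class variance: perfect separation or a constant column.
            scores.append(math.inf if between > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k descriptors with the highest Fisher score."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy data: 6 molecules, 3 descriptors, binary activity label.
descriptors = [
    [1.0, 5.0, 0.2],
    [1.2, 3.0, 0.9],
    [0.9, 4.0, 0.4],
    [3.0, 4.5, 0.3],
    [3.2, 3.5, 0.8],
    [2.9, 4.0, 0.5],
]
activity = [0, 0, 0, 1, 1, 1]

print(fisher_scores(descriptors, activity))
print(select_top_k(descriptors, activity, k=2))
```

In this toy set the first descriptor separates the two activity classes cleanly, so it ranks first. The second has identical class means and scores zero. A filter like this scores each descriptor independently of any model, whereas wrapper and embedded methods such as Recursive Feature Elimination or tree-based importances depend on a fitted learner.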
Author_xml – sequence: 1
  givenname: Rifkat
  surname: Davronov
  fullname: Davronov, Rifkat
  organization: V.I.Romanovskiy Institute of Mathematics, Uzbekistan Academy of Sciences
– sequence: 2
  givenname: Samariddin
  surname: Kushmuratov
  fullname: Kushmuratov, Samariddin
  email: bekmezonali@gmail.com
  organization: V.I.Romanovskiy Institute of Mathematics, Uzbekistan Academy of Sciences
CODEN APCPCS
ContentType Journal Article
Conference Proceeding
Copyright Author(s)
2024 Author(s). Published by AIP Publishing.
Discipline Physics
EISSN 1551-7616
Editor Shadimetov, Kholmat M.
Durdiev, Durdimurod K.
Hayotov, Abdullo R.
Babaev, Samandar S.
Jalolov, Ozodjon I.
Editor_xml – sequence: 1
  givenname: Kholmat M.
  surname: Shadimetov
  fullname: Shadimetov, Kholmat M.
  organization: V. I. Romanovskiy Institute of Mathematics, Uzbekistan Academy of Sciences
– sequence: 2
  givenname: Abdullo R.
  surname: Hayotov
  fullname: Hayotov, Abdullo R.
  organization: V. I. Romanovskiy Institute of Mathematics, Uzbekistan Academy of Sciences
– sequence: 3
  givenname: Durdimurod K.
  surname: Durdiev
  fullname: Durdiev, Durdimurod K.
  organization: V. I. Romanovskiy Institute of Mathematics, Uzbekistan Academy of Sciences
– sequence: 4
  givenname: Samandar S.
  surname: Babaev
  fullname: Babaev, Samandar S.
  organization: V. I. Romanovskiy Institute of Mathematics, Uzbekistan Academy of Sciences
– sequence: 5
  givenname: Ozodjon I.
  surname: Jalolov
  fullname: Jalolov, Ozodjon I.
  organization: Bukhara State University
ExternalDocumentID 10.1063/5.0199872
acp
Genre Conference Proceeding
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
License Published by AIP Publishing.
0094-243X/2024/3004/050002/5/$30.00
MeetingName INTERNATIONAL SCIENTIFIC AND PRACTICAL CONFERENCE ON “MODERN PROBLEMS OF APPLIED MATHEMATICS AND INFORMATION TECHNOLOGY (MPAMIT2022)”
OpenAccessLink https://proxy.k.utb.cz/login?url=https://pubs.aip.org/aip/acp/article-pdf/doi/10.1063/5.0199872/19722248/050002_1_5.0199872.pdf
PQID 2954998097
PQPubID 2050672
PageCount 5
SubjectTerms Algorithms
Biological activity
Biological properties
Comparative analysis
Feature selection
Machine learning
Mathematical models
Performance prediction
Permutations
Statistical methods
Title Comparative analysis of QSAR feature selection methods
URI http://dx.doi.org/10.1063/5.0199872
https://www.proquest.com/docview/2954998097
https://pubs.aip.org/aip/acp/article-pdf/doi/10.1063/5.0199872/19722248/050002_1_5.0199872.pdf
UnpaywallVersion publishedVersion
Volume 3004