Automating Biomedical Data Science Through Tree-Based Pipeline Optimization

Bibliographic Details
Published in Applications of Evolutionary Computation, pp. 123-137
Main Authors Olson, Randal S., Urbanowicz, Ryan J., Andrews, Peter C., Lavender, Nicole A., Kidd, La Creis, Moore, Jason H.
Format Book Chapter
Language English
Published Cham: Springer International Publishing, 2016
Series Lecture Notes in Computer Science
ISBN 3319312030
9783319312033
ISSN 0302-9743
1611-3349
DOI 10.1007/978-3-319-31204-0_9

Abstract Over the past decade, data science and machine learning has grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning—pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators—such as synthetic feature constructors—that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.
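The idea of tree-based pipeline optimization described in the abstract can be illustrated with a minimal sketch. This is not TPOT's implementation or API: it is a toy, standard-library-only example that assumes a pipeline can be represented as a small expression tree and improved by mutation and selection. All names (`make_data`, `evaluate`, `random_tree`, `mutate`, `evolve`) and all parameter values are hypothetical. The `xor` node stands in for a synthetic feature constructor, and the held-out score hints at the overfitting concern the abstract raises.

```python
import random


def make_data(n, rng):
    """Toy data set: the label is the XOR of two binary features; a third is noise."""
    rows = []
    for _ in range(n):
        a, b, noise = rng.randint(0, 1), rng.randint(0, 1), rng.randint(0, 1)
        rows.append(((a, b, noise), a ^ b))
    return rows


# A "pipeline" here is an expression tree built from nested tuples:
#   ("feat", i)           leaf: raw feature i
#   ("not", child)        negate the child's output
#   ("xor", left, right)  synthetic feature constructed from two subtrees
# The value at the root is used directly as the predicted class.
def evaluate(tree, x):
    op = tree[0]
    if op == "feat":
        return x[tree[1]]
    if op == "not":
        return 1 - evaluate(tree[1], x)
    return evaluate(tree[1], x) ^ evaluate(tree[2], x)


def accuracy(tree, rows):
    return sum(evaluate(tree, x) == y for x, y in rows) / len(rows)


def random_tree(rng, depth=2):
    if depth == 0 or rng.random() < 0.3:
        return ("feat", rng.randrange(3))
    if rng.random() < 0.3:
        return ("not", random_tree(rng, depth - 1))
    return ("xor", random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def mutate(tree, rng):
    """Replace one randomly chosen subtree with a freshly generated one."""
    if tree[0] == "feat" or rng.random() < 0.3:
        return random_tree(rng)
    if tree[0] == "not":
        return ("not", mutate(tree[1], rng))
    if rng.random() < 0.5:
        return ("xor", mutate(tree[1], rng), tree[2])
    return ("xor", tree[1], mutate(tree[2], rng))


def evolve(train, rng, pop_size=30, generations=20):
    """Keep the best third each generation (elitism) and refill with mutants."""
    population = [random_tree(rng) for _ in range(pop_size)]
    history = []  # best training accuracy seen at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda t: accuracy(t, train), reverse=True)
        history.append(accuracy(population[0], train))
        survivors = population[: pop_size // 3]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    best = max(population, key=lambda t: accuracy(t, train))
    return best, history


if __name__ == "__main__":
    rng = random.Random(0)
    train, held_out = make_data(200, rng), make_data(200, rng)
    best, history = evolve(train, rng)
    print("best tree:", best)
    print("train accuracy:", accuracy(best, train))
    print("held-out accuracy:", accuracy(best, held_out))
```

Because the survivors are carried over unchanged, the best training accuracy never decreases from one generation to the next. Scoring the final tree on separate held-out data is a simple way to check whether the search has overfit the training set.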
Author Andrews, Peter C.
Kidd, La Creis
Lavender, Nicole A.
Moore, Jason H.
Olson, Randal S.
Urbanowicz, Ryan J.
Author_xml – sequence: 1
  givenname: Randal S.
  surname: Olson
  fullname: Olson, Randal S.
  email: olsonran@upenn.edu
  organization: Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, USA
– sequence: 2
  givenname: Ryan J.
  surname: Urbanowicz
  fullname: Urbanowicz, Ryan J.
  organization: Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, USA
– sequence: 3
  givenname: Peter C.
  surname: Andrews
  fullname: Andrews, Peter C.
  organization: Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, USA
– sequence: 4
  givenname: Nicole A.
  surname: Lavender
  fullname: Lavender, Nicole A.
  organization: University of Louisville, Louisville, USA
– sequence: 5
  givenname: La Creis
  surname: Kidd
  fullname: Kidd, La Creis
  organization: University of Louisville, Louisville, USA
– sequence: 6
  givenname: Jason H.
  surname: Moore
  fullname: Moore, Jason H.
  organization: Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, USA
ContentType Book Chapter
Copyright Springer International Publishing Switzerland 2016
Copyright_xml – notice: Springer International Publishing Switzerland 2016
DOI 10.1007/978-3-319-31204-0_9
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 3319312049
9783319312040
EISSN 1611-3349
Editor Burelli, Paolo
Squillero, Giovanni
Editor_xml – sequence: 1
  givenname: Giovanni
  surname: Squillero
  fullname: Squillero, Giovanni
  email: giovanni.squillero@polito.it
– sequence: 2
  givenname: Paolo
  surname: Burelli
  fullname: Burelli, Paolo
  email: pabu@create.aau.dk
EndPage 137
ISBN 3319312030
9783319312033
ISSN 0302-9743
IngestDate Wed Sep 17 03:01:49 EDT 2025
IsPeerReviewed true
IsScholarly true
Language English
LinkModel OpenURL
PageCount 15
ParticipantIDs springer_books_10_1007_978_3_319_31204_0_9
PublicationCentury 2000
PublicationDate 2016
PublicationDateYYYYMMDD 2016-01-01
PublicationDate_xml – year: 2016
  text: 2016
PublicationDecade 2010
PublicationPlace Cham
PublicationPlace_xml – name: Cham
PublicationSeriesSubtitle Theoretical Computer Science and General Issues
PublicationSeriesTitle Lecture Notes in Computer Science
PublicationSeriesTitleAlternate Lect.Notes Computer
PublicationSubtitle 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I
PublicationTitle Applications of Evolutionary Computation
PublicationYear 2016
Publisher Springer International Publishing
Publisher_xml – name: Springer International Publishing
RelatedPersons Kleinberg, Jon M.
Mattern, Friedemann
Naor, Moni
Mitchell, John C.
Terzopoulos, Demetri
Steffen, Bernhard
Pandu Rangan, C.
Kanade, Takeo
Kittler, Josef
Weikum, Gerhard
Hutchison, David
Tygar, Doug
RelatedPersons_xml – sequence: 1
  givenname: David
  surname: Hutchison
  fullname: Hutchison, David
  organization: Lancaster University, Lancaster, United Kingdom
– sequence: 2
  givenname: Takeo
  surname: Kanade
  fullname: Kanade, Takeo
  organization: Carnegie Mellon University, Pittsburgh, USA
– sequence: 3
  givenname: Josef
  surname: Kittler
  fullname: Kittler, Josef
  organization: University of Surrey, Guildford, United Kingdom
– sequence: 4
  givenname: Jon M.
  surname: Kleinberg
  fullname: Kleinberg, Jon M.
  organization: Cornell University, Ithaca, USA
– sequence: 5
  givenname: Friedemann
  surname: Mattern
  fullname: Mattern, Friedemann
  organization: CNB H 104.2, ETH Zürich, Zürich, Switzerland
– sequence: 6
  givenname: John C.
  surname: Mitchell
  fullname: Mitchell, John C.
  organization: Stanford, USA
– sequence: 7
  givenname: Moni
  surname: Naor
  fullname: Naor, Moni
  organization: Weizmann Institute of Science, Rehovot, Israel
– sequence: 8
  givenname: C.
  surname: Pandu Rangan
  fullname: Pandu Rangan, C.
  organization: Indian Institute of Technology Madras, Chennai, India
– sequence: 9
  givenname: Bernhard
  surname: Steffen
  fullname: Steffen, Bernhard
  organization: Fakultät Informatik, TU Dortmund, Dortmund, Germany
– sequence: 10
  givenname: Demetri
  surname: Terzopoulos
  fullname: Terzopoulos, Demetri
  organization: Los Angeles, USA
– sequence: 11
  givenname: Doug
  surname: Tygar
  fullname: Tygar, Doug
  organization: University of California, Berkeley, USA
– sequence: 12
  givenname: Gerhard
  surname: Weikum
  fullname: Weikum, Gerhard
  organization: Max Planck Institute for Informatics, Saarbrücken, Germany
SourceID springer
SourceType Publisher
StartPage 123
SubjectTerms Data science
Genetic programming
Hyperparameter optimization
Machine learning
Pipeline optimization
Title Automating Biomedical Data Science Through Tree-Based Pipeline Optimization
URI http://link.springer.com/10.1007/978-3-319-31204-0_9