Low-Bandwidth and Non-Compute Intensive Remote Identification of Microbes from Raw Sequencing Reads

Bibliographic Details
Published in	PLoS ONE, Vol. 8, no. 12, p. e83784
Main Authors	Gautier, Laurent; Lund, Ole
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 31.12.2013
Subjects	Algorithms; Bacteria - classification; Bacteria - genetics; Bandwidth; Bandwidths; Bioinformatics; Biology; Bioreactors; Broadband; Client server systems; Data processing; Deoxyribonucleic acid; DNA; DNA sequencing; Food production; Gene sequencing; Genome; Genomes; Genomics; High-Throughput Nucleotide Sequencing; Humans; Internet; Lists; Matching; Microorganisms; Mobile computing; Molecular biology; Portable equipment; Programming languages; Search engines; Servers; Software; Tracking control; Wireless networks
ISSN	1932-6203
DOI	10.1371/journal.pone.0083784

Summary:	Cheap DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples. We propose a novel general approach to the analysis of sequencing data where a reference genome does not have to be specified. Using a distributed architecture, we are able to query a remote server for hints about what the reference might be while transferring a relatively small amount of data. Our system consists of a server on which known reference DNA is indexed, and a client holding raw sequencing reads. The client sends a sample of unidentified reads and in return receives a list of matching references. Sequences for the references can then be retrieved and used for exhaustive computation on the reads, such as alignment. To demonstrate this approach we have implemented a web server that indexes tens of thousands of publicly available genomes and genomic regions from various organisms and returns lists of matching hits for query sequencing reads. We have also implemented two clients: one running in a web browser, and one as a Python script. Both can handle a large number of sequencing reads on portable devices (the browser-based client running on a tablet), perform their task within seconds, and consume an amount of bandwidth compatible with mobile broadband networks. Such client-server approaches could be developed further in the future, allowing fully automated processing of sequencing data and routine, instant quality checks of sequencing runs from desktop sequencers. Web access is available at http://tapir.cbs.dtu.dk. The source code for a Python command-line client, a server, and supplementary data are available at http://bit.ly/1aURxkc.
Bibliography:	Competing Interests: The authors have the following interests. The following patent application related to the work presented has been filed: Database-driven Primary Analysis of Raw Sequencing Data, PCT/EP2013/071280. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials. Conceived and designed the experiments: LG. Performed the experiments: LG. Analyzed the data: LG. Contributed reagents/materials/analysis tools: LG, OL. Wrote the paper: LG.
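
The exchange described in the summary (a client sends a small sample of reads, a server with indexed references returns a ranked list of matches) can be illustrated with a toy, in-memory sketch. This is not the authors' implementation or the API of their server: the exact k-mer lookup used here is a simple stand-in for whatever indexing the real server performs, and every name in it (sample_reads, build_index, match_reads, the reference names and sequences) is hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def sample_reads(reads, n, seed=0):
    """Client side: reservoir-sample up to n reads from any iterable of reads."""
    rng = random.Random(seed)
    reservoir = []
    for i, read in enumerate(reads):
        if i < n:
            reservoir.append(read)
        else:
            # Keep the new read with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = read
    return reservoir


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Server side (toy stand-in): map each k-mer to the references containing it."""
    index = {}
    for name, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index


def match_reads(index, reads, k):
    """Server side: for each reference, count the reads sharing at least one k-mer."""
    hits = Counter()
    for read in reads:
        matched = set()
        for km in kmers(read, k):
            matched |= index.get(km, set())
        hits.update(matched)
    # Ranked list of (reference name, number of matching reads).
    return hits.most_common()


# Hypothetical toy data: two made-up references and four short reads.
references = {
    "ref_A": "ACGTACGTTAGCCGATAGGCTTAACGGATC",
    "ref_B": "TTGACCATGGCAATCGGTTAACCGGTATGC",
}
reads = ["CGTTAGCCGATAG", "GCCGATAGGCTTA", "TAGGCTTAACGGA", "GGGGGGGGGGGGG"]

index = build_index(references, k=8)      # built once, on the server
query = sample_reads(reads, n=4)          # only this sample leaves the client
ranked = match_reads(index, query, k=8)   # ranked list returned to the client
print(ranked)
```

Only the sampled reads and the returned list cross the network in such a design, which is what keeps bandwidth low; the client can then fetch the top-ranked reference sequences and run exhaustive steps such as alignment locally.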