GenoCore: A simple and fast algorithm for core subset selection from large genotype datasets

Bibliographic Details
Published in	PLoS ONE, Vol. 12, no. 7, p. e0181420
Main Authors	Jeong, Seongmun; Kim, Jae-Yoon; Jeong, Soon-Chun; Kang, Sung-Taeg; Moon, Jung-Kyung; Kim, Namshin
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 20.07.2017
Subjects	Access to Information; Algorithms; Analysis; Bioinformatics; Biology and Life Sciences; Biotechnology; Breeding; Collection; Computer and Information Sciences; Computer programs; Cost analysis; Crop science; Crops; Databases, Genetic; Datasets; Datasets as Topic; Gene Frequency; Gene polymorphism; Genetic aspects; Genetic distance; Genetic diversity; Genetic markers; Genome-wide association studies; Genomes; Genomics; Internet; Markers; Methods; Oryza - genetics; Phenotype; Physical Sciences; Picking; Plant breeding; Polymorphism; Polymorphism, Single Nucleotide; Principal Component Analysis; Reproducibility of Results; Research and Analysis Methods; Rice; Single nucleotide polymorphisms; Single-nucleotide polymorphism; Software; Software development; Soybeans; Studies; Triticum - genetics
Online Access	Get full text
ISSN	1932-6203
DOI	10.1371/journal.pone.0181420

More Information
Summary:	Selecting core subsets from plant genotype datasets is important for improving cost-effectiveness and shortening the time required for analyses such as genome-wide association studies (GWAS) and genomics-assisted breeding of crop species. Recently, large numbers of genetic markers (>100,000 single nucleotide polymorphisms) have been identified from high-density single nucleotide polymorphism (SNP) arrays and next-generation sequencing (NGS) data. However, no software is available for picking an efficient and consistent core subset from such large datasets. Software is needed that can coherently extract the genetically important samples in a population. We present a new program, GenoCore, which quickly and efficiently finds a core subset representing the entire population. We introduce simple coverage and diversity scores, which reflect genotype errors and genetic variation and help to select samples rapidly and accurately from crop genotype datasets. We compare our method with other core collection software on example datasets to validate its performance in terms of genetic distance, diversity, coverage, required system resources, and the number of selected samples. GenoCore selects the smallest, most consistent, and most representative core collection from all samples, uses less memory with more efficient scores, and shows greater genetic coverage than the other software tested. GenoCore is written in R and can be accessed online, with an example dataset and test results, at https://github.com/lovemun/Genocore.
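
Illustrative note: the summary does not define GenoCore's coverage and diversity scores, so the following is only a minimal Python sketch of the general idea of coverage-driven core selection, not the authors' method or the R program's API. It assumes a hypothetical input (a dict mapping sample names to lists of genotype calls, with "NA" for missing calls), defines coverage as the fraction of distinct (marker, genotype) pairs represented in the core, and uses a plain greedy rule. The function name `greedy_core` and the parameter `target_coverage` are invented for this example.

```python
from typing import Dict, List, Set, Tuple


def greedy_core(
    genotypes: Dict[str, List[str]],
    target_coverage: float = 1.0,
) -> Tuple[List[str], float]:
    """Greedily pick samples until the chosen core reaches target_coverage.

    Coverage here is an ASSUMED definition: the share of distinct
    (marker index, genotype call) pairs in the population that appear
    in at least one selected sample. Missing calls ("NA") are ignored.
    """
    # Turn each sample into the set of (marker, call) pairs it carries.
    sample_pairs: Dict[str, Set[Tuple[int, str]]] = {
        name: {(i, call) for i, call in enumerate(calls) if call != "NA"}
        for name, calls in genotypes.items()
    }
    universe: Set[Tuple[int, str]] = set().union(*sample_pairs.values())

    covered: Set[Tuple[int, str]] = set()
    core: List[str] = []
    while universe and len(covered) / len(universe) < target_coverage:
        # Candidates are sorted by name so ties resolve reproducibly.
        candidates = sorted(n for n in sample_pairs if n not in core)
        if not candidates:
            break
        best = max(candidates, key=lambda n: len(sample_pairs[n] - covered))
        if not sample_pairs[best] - covered:
            break  # no remaining sample adds anything new
        core.append(best)
        covered |= sample_pairs[best]

    coverage = len(covered) / len(universe) if universe else 1.0
    return core, coverage


if __name__ == "__main__":
    # Toy dataset: four samples, three markers.
    example = {
        "s1": ["AA", "AA", "AB"],
        "s2": ["AA", "BB", "AB"],
        "s3": ["BB", "BB", "NA"],
        "s4": ["AA", "AA", "AB"],
    }
    print(greedy_core(example))
```

In this toy case the duplicate sample `s4` is never selected, because it adds no genotype not already covered by `s1`. A diversity score and handling of genotype errors, both of which the summary mentions, are left out of the sketch.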
Bibliography:	Competing Interests: The authors have declared that no competing interests exist.