SEXCMD: Development and validation of sex marker sequences for whole-exome/genome and RNA sequencing

Bibliographic Details
Published in	PLoS ONE, Vol. 12, no. 9, p. e0184087
Main Authors	Jeong, Seongmun; Kim, Jiwoong; Park, Won; Jeon, Hongmin; Kim, Namshin
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 08.09.2017
Subjects	Alleles; Animals; Archives & records; Bioinformatics; Biology and life sciences; Biotechnology; Chromosome Mapping; Chromosomes; Computational Biology - methods; Data processing; Datasets; Datasets as Topic; Deoxyribonucleic acid; DNA; DNA methylation; Exome; Gene expression; Gene frequency; Gene sequencing; Genetic Markers; Genetic testing; Genome; Genomes; Genomics; Genomics - methods; High-Throughput Nucleotide Sequencing; Humans; Identification; Learning algorithms; Machine learning; Medical research; Medicine; Nucleotide sequence; Programming languages; Quality control; Reproducibility of Results; Research and analysis methods; Researchers; Ribonucleic acid; RNA; Scripts; Sequence Analysis, RNA; Sex; Sex chromosomes; Sex Determination Processes - genetics; Synteny; W chromosomes; United States > US
ISSN	1932-6203
DOI	10.1371/journal.pone.0184087

More Information
Summary:	Over the last decade, a large number of nucleotide sequences have been generated by next-generation sequencing technologies and deposited to public databases. However, most of these datasets do not specify the sex of individuals sampled because researchers typically ignore or hide this information. Male and female genomes in many species have distinctive sex chromosomes, XX/XY and ZW/ZZ, and expression levels of many sex-related genes differ between the sexes. Herein, we describe how to develop sex marker sequences from syntenic regions of sex chromosomes and use them to quickly identify the sex of individuals being analyzed. Array-based technologies routinely use either known sex markers or the B-allele frequency of X or Z chromosomes to deduce the sex of an individual. The same strategy has been used with whole-exome/genome sequence data; however, all reads must be aligned onto a reference genome to determine the B-allele frequency of the X or Z chromosomes. SEXCMD is a pipeline that can extract sex marker sequences from reference sex chromosomes and rapidly identify the sex of individuals from whole-exome/genome and RNA sequencing after training with a known dataset through a simple machine learning approach. The pipeline counts total numbers of hits from sex-specific marker sequences and identifies the sex of the individuals sampled based on the fact that XX/ZZ samples do not have Y or W chromosome hits. We have successfully validated our pipeline with mammalian (Homo sapiens; XY) and avian (Gallus gallus; ZW) genomes. Typical calculation time when applying SEXCMD to human whole-exome or RNA sequencing datasets is a few minutes, and analyzing human whole-genome datasets takes about 10 minutes. Another important application of SEXCMD is as a quality control measure to avoid mixing samples before bioinformatics analysis. SEXCMD comprises simple Python and R scripts and is freely available at https://github.com/lovemun/SEXCMD.
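The hit-counting rule described in the summary can be illustrated with a minimal Python sketch. This is not the SEXCMD code: the function names `count_hits` and `call_sex`, the exact-substring matching, the toy marker sequences, and the fixed `min_ratio` threshold are all assumptions made for illustration. The summary states only that the pipeline counts hits from sex-specific markers and is trained on a known dataset, so a real threshold would come from that training step.

```python
def count_hits(reads, markers):
    """Count reads that contain at least one marker for each chromosome.

    reads:   iterable of read sequences (strings)
    markers: dict mapping a chromosome label ("X", "Y", "Z", "W")
             to a list of marker sequences (hypothetical toy data here)
    Exact substring matching stands in for whatever matching a real
    pipeline performs; this is an assumption of the sketch.
    """
    totals = {chrom: 0 for chrom in markers}
    for read in reads:
        for chrom, seqs in markers.items():
            if any(seq in read for seq in seqs):
                totals[chrom] += 1
    return totals


def call_sex(totals, system="XY", min_ratio=0.2):
    """Call sex from per-chromosome hit totals.

    XX and ZZ individuals should show (almost) no Y or W hits, so the
    ratio of Y/W hits to X/Z hits separates the sexes. min_ratio is a
    hypothetical fixed cutoff, not a trained value.
    """
    shared, specific = ("X", "Y") if system == "XY" else ("Z", "W")
    if totals.get(shared, 0) == 0:
        return "undetermined"  # no usable signal on the shared chromosome
    ratio = totals.get(specific, 0) / totals[shared]
    has_specific = ratio >= min_ratio
    if system == "XY":
        return "male" if has_specific else "female"    # XY male, XX female
    return "female" if has_specific else "male"        # ZW female, ZZ male


if __name__ == "__main__":
    # Toy markers and reads, invented for this example only.
    xy_markers = {"X": ["ACGTACGTAA"], "Y": ["TTGGCCAATT"]}
    reads = ["GGACGTACGTAACC", "CCTTGGCCAATTGG", "AAAAAAAAAAAA"]
    totals = count_hits(reads, xy_markers)
    print(totals, call_sex(totals, system="XY"))
```

Using a ratio rather than a raw Y or W count keeps the call comparable across samples of different sequencing depth, which is one plausible reason to train the cutoff on samples of known sex.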
Bibliography:	Competing Interests: The authors have declared that no competing interests exist.