Assessing the factors influencing the performance of machine learning for classifying haplogroups from Y-STR haplotypes

Bibliographic Details
Published in Forensic science international Vol. 340; p. 111466
Main Author Fan, Guang-Yao
Format Journal Article
Language English
Published Amsterdam Elsevier B.V 01.11.2022
Elsevier Limited
ISSN 0379-0738
1872-6283
DOI 10.1016/j.forsciint.2022.111466

Summary: Two distinct types of genetic marker, single nucleotide polymorphisms (Y-SNPs) and short tandem repeats (Y-STRs), coexist in the non-recombining portion of the Y chromosome. Because they mutate at different rates, Y-STRs and Y-SNPs play distinct roles in forensic and evolutionary genetics. Current approaches to inferring haplogroup status rely on genotyping a large number of Y-SNP loci. Given the relationship between the haplotype and the haplogroup of a Y chromosome, Y-STR typing offers a cost-effective alternative for haplogroup prediction. Many machine learning algorithms have been proposed for assigning a Y-STR haplotype to a haplogroup. However, a series of issues must be resolved before machine learning methods can be used in practice. In this study, k-nearest neighbor (kNN) classifiers were therefore built under different scenarios, and the factors that may influence the performance of the kNN prediction model for classifying haplogroups were assessed. The training set was based on a diverse ground-truth data set comprising Y-STR haplotypes and their corresponding Y-SNP haplogroups. The results showed that combining different levels of haplogroups in the observations, or attempting transracial prediction, was impractical. Moreover, using more slowly mutating Y-STR loci improved classification accuracy. The preconditions for effective and accurate haplogroup assignment by the kNN classifier were thereby identified.

• The factors influencing the performance of kNN algorithms for classifying haplogroups were assessed.
• Combining all levels of haplogroups into the observations is inappropriate.
• Transracial prediction proved to be impractical.
• Classification accuracy with the slow-mutation (SM) group of Y-STR loci was higher than with the rapid-mutation (RM) group.
• The kNN classifier can be used effectively for accurate haplogroup assignment.
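
For readers unfamiliar with the approach, the idea of assigning a Y-STR haplotype to a haplogroup with a kNN classifier can be sketched as follows. This is an illustration only, not the authors' implementation: the distance measure (sum of absolute repeat-count differences), the value of k, the four-marker haplotypes, and the labels "group_A" and "group_B" are all assumptions, and the toy values are made up rather than taken from the study.

```python
from collections import Counter


def manhattan(a, b):
    """Sum of absolute repeat-count differences between two haplotypes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Return the majority label among the k training haplotypes closest to query.

    train is a list of (haplotype, label) pairs, where each haplotype is a
    tuple of repeat counts in a fixed marker order.
    """
    ranked = sorted(train, key=lambda item: manhattan(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: four unnamed markers, two placeholder labels.
TOY_TRAIN = [
    ((14, 13, 30, 24), "group_A"),
    ((14, 13, 29, 24), "group_A"),
    ((15, 13, 30, 24), "group_A"),
    ((16, 12, 28, 22), "group_B"),
    ((16, 12, 27, 22), "group_B"),
    ((17, 12, 28, 23), "group_B"),
]

# Classify a new, unlabeled haplotype by its three nearest neighbors.
print(knn_predict(TOY_TRAIN, (14, 13, 30, 25)))
```

In a sketch like this, the factors named in the summary would correspond to what goes into the training list: which haplogroup level the labels are drawn from, which population the haplotypes come from, and which Y-STR loci make up each tuple.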