Assessing the factors influencing the performance of machine learning for classifying haplogroups from Y-STR haplotypes
| Published in | Forensic science international Vol. 340; p. 111466 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | Amsterdam: Elsevier B.V, 01.11.2022; Elsevier Limited |
| ISSN | 0379-0738, 1872-6283 |
| DOI | 10.1016/j.forsciint.2022.111466 |
| Summary: | Two distinct genetic markers, single nucleotide polymorphisms (Y-SNPs) and short tandem repeats (Y-STRs), exist simultaneously in the non-recombining portion of the Y chromosome. Because of their different rates of mutation, Y-STRs and Y-SNPs play distinct roles in forensic and evolutionary genetics. Current approaches to inferring haplogroup status rely on genotyping many Y-SNP loci. Given the relationship between the haplotype and the haplogroup of a Y chromosome, cost-effective Y-STR typing offers an advantage for haplogroup prediction. Many machine learning algorithms have been developed for assigning a Y-STR haplotype to a haplogroup. However, a series of issues must be resolved before machine learning methods can be used in practice. In this study, k-nearest neighbor (kNN) classifiers were therefore built separately for different scenarios. We assessed factors that may influence the performance of the kNN prediction model for classifying haplogroups. The training set was based on a diverse ground-truth data set comprising Y-STR haplotypes and corresponding Y-SNP haplogroups. Our results showed that combining different levels of haplogroups in the observations was impractical, as was transracial prediction. Moreover, including more slowly mutating Y-STR loci improved classification accuracy. These results reveal the preconditions for effective and accurate haplogroup assignment by the kNN classifier. Highlights: • The factors influencing the performance of kNN algorithms for classifying haplogroups were assessed. • Combining all levels of haplogroups into the observations is inappropriate. • Transracial prediction proved to be impractical. • Classification accuracy with the slowly mutating (SM) group of Y-STR loci was higher than with the rapidly mutating (RM) group. • The kNN classifier can be effectively used for accurate haplogroup assignment. |
|---|---|
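
The summary describes assigning a Y-STR haplotype to a haplogroup with a kNN classifier. As a rough illustration of that idea only, the sketch below classifies a haplotype by majority vote among its nearest training haplotypes. The record does not state the distance metric, the value of k, or the loci used in the study, so the Manhattan distance on repeat counts, `k=3`, the function names, and the toy repeat counts are all assumptions made for this example and are not real data.

```python
from collections import Counter

# Hypothetical toy training set (NOT real data): each haplotype is a tuple of
# repeat counts at the same ordered set of Y-STR loci, paired with an
# illustrative haplogroup label.
TRAIN = [
    ((13, 24, 14, 11), "R1b"),
    ((13, 24, 14, 10), "R1b"),
    ((13, 23, 14, 11), "R1b"),
    ((15, 21, 16, 10), "O2"),
    ((15, 22, 16, 10), "O2"),
    ((14, 21, 16, 10), "O2"),
]


def manhattan(a, b):
    """Sum of absolute repeat-count differences across loci (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Return the majority haplogroup among the k nearest training haplotypes."""
    nearest = sorted(train, key=lambda item: manhattan(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_predict(TRAIN, (13, 24, 15, 11))
```

In this toy setting, restricting the tuples to a subset of loci (for example only slowly mutating ones) or training on haplotypes from a different population would change which neighbors are nearest, which is the kind of factor the summary says was assessed.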