Is the Corpus Ready for Machine Translation? A Case Study with Python to Pseudo-Code Corpus

Bibliographic Details
Published in: Arabian Journal for Science and Engineering, Vol. 48, No. 2, pp. 1845–1858
Main Authors: Rai, Sawan; Belwal, Ramesh Chandra; Gupta, Atul
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg, 01.02.2023 (Springer Nature B.V.)
ISSN: 2193-567X, 1319-8025, 2191-4281
DOI: 10.1007/s13369-022-07049-0

More Information
Summary: The availability of data is the driving force behind most state-of-the-art techniques for machine translation tasks. Understandably, this availability motivates researchers to propose new techniques and to claim superiority over existing ones using suitable evaluation measures. However, the performance of the underlying learning algorithms can be greatly influenced by the correctness and consistency of the corpus. We investigate the relevance of a publicly available Python-to-pseudo-code parallel corpus for the automated documentation task, along with the studies performed using this corpus. We found that the corpus had many visible issues, such as overlapping instances, inconsistent translation styles, incompleteness, and misspelled words. We show that these discrepancies can significantly influence the performance of learning algorithms, to the extent that they could have led previous studies to draw incorrect conclusions. We performed our experimental study using statistical machine translation and neural machine translation models, and recorded a significant difference (∼10% in BLEU score) in the models' performance after removing the issues from the corpus.
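The "overlapping instances" issue the summary describes can be sketched as two simple checks on a parallel corpus: verbatim duplicate pairs, and source snippets leaking from the training split into the test split (which would inflate BLEU). The helper names and toy data below are illustrative assumptions, not taken from the paper's actual corpus or code.

```python
# Hypothetical sketch of corpus-quality checks; data is a toy example,
# not the real Python-to-pseudo-code corpus.

def find_duplicates(pairs):
    """Return (first_index, repeat_index) for (source, target) pairs
    that occur verbatim more than once in the corpus."""
    seen = {}
    dups = []
    for i, pair in enumerate(pairs):
        if pair in seen:
            dups.append((seen[pair], i))
        else:
            seen[pair] = i
    return dups

def train_test_overlap(train, test):
    """Source snippets appearing in both splits; evaluating on these
    rewards memorization rather than translation quality."""
    train_sources = {src for src, _ in train}
    return [src for src, _ in test if src in train_sources]

corpus = [
    ("x = x + 1", "increment x by 1"),
    ("print(s)", "print s"),
    ("x = x + 1", "increment x by 1"),  # verbatim duplicate instance
]
print(find_duplicates(corpus))  # → [(0, 2)]
```

Deduplicating before splitting (rather than after) is the key design choice: it guarantees no instance can appear on both sides of the train/test boundary.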