An Exploration of Open Source Small Language Models for Automated Assessment

Bibliographic Details
Published in: Proceedings / International Conference on Information Visualisation, pp. 1-6
Main Authors: Sterbini, Andrea; Temperini, Marco
Format: Conference Proceeding
Language: English
Published: IEEE, 22.07.2024
ISSN: 2375-0138
DOI: 10.1109/IV64223.2024.00064

Summary: We explore the classification and assessment capabilities of a selection of Open Source Small Language Models on the specific task of evaluating learners' descriptions of algorithms. The algorithms are described in the framework of programming assignments, which the learners in a Basics of Computer Programming class have to answer. The task requires learners to 1) provide a Python program solving the assigned problem, 2) submit a description of the related algorithm, and 3) participate in a formative peer assessment session over the submitted algorithm descriptions. Can a Language Model, be it small or large, produce an assessment of the algorithm descriptions? Rather than using any of the most famous, huge, proprietary models, here we explore Small, Open Source Language Models, i.e. models that can run on relatively small computers and whose functions and training sources are openly provided. We produced a ground-truth evaluation of a large set of algorithm descriptions taken from one year of use of the Q2A-II system, using an 8-value scale that grades the usefulness of each description for a Peer Assessment session. We then tested the agreement of the models' assessments with this ground truth. We also analysed whether a preliminary, automated, binary classification of the descriptions (as useless/useful for a Peer Assessment activity) helps the models grade the usefulness of the descriptions more accurately.
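
As an illustrative sketch only (not the authors' implementation): the setup described in the abstract can be approximated by prompting a small, openly released instruction-tuned model for a 0-7 usefulness grade and then measuring agreement with the human ground truth. The model name, prompt wording, sample data, and the choice of quadratically weighted Cohen's kappa as the agreement measure are assumptions, not details taken from the paper.

```python
# Illustrative sketch only -- not the paper's code. Model name, prompt,
# sample data, and the agreement measure are assumptions.
import re
from transformers import pipeline
from sklearn.metrics import cohen_kappa_score

# Any small, openly released, instruction-tuned model could stand in here.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

PROMPT = (
    "You are grading a learner's description of an algorithm written for a "
    "programming assignment. Rate how useful the description would be in a "
    "peer-assessment session, on an integer scale from 0 (useless) to 7 "
    "(very useful). Reply with the number only.\n\n"
    "Description:\n{description}\n\nGrade:"
)

def grade(description: str) -> int:
    """Ask the model for a 0-7 grade and parse the first digit it returns."""
    out = generator(
        PROMPT.format(description=description),
        max_new_tokens=8,
        return_full_text=False,   # keep only the newly generated tokens
    )[0]["generated_text"]
    match = re.search(r"[0-7]", out)
    return int(match.group()) if match else 0  # fall back to 0 if unparsable

# Hypothetical sample data: learner descriptions and their ground-truth grades.
descriptions = [
    "Sort the list, then scan adjacent items to find duplicates.",
    "The program works and prints the answer.",
    "Read N numbers, keep a running maximum, print it at the end.",
]
ground_truth = [5, 1, 6]

predicted = [grade(d) for d in descriptions]
# Quadratically weighted kappa suits ordinal (0-7) grades; one possible choice.
print(cohen_kappa_score(ground_truth, predicted, weights="quadratic"))
```

In the same spirit, the preliminary useless/useful classification mentioned in the abstract could be obtained by first asking the same model a binary question about each description before requesting a grade.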