Toward Automated Classroom Observation: Multimodal Machine Learning to Estimate CLASS Positive Climate and Negative Climate

Bibliographic Details
Published in: IEEE Transactions on Affective Computing, Vol. 14, No. 1, pp. 664-679
Main Authors: Ramakrishnan, Anand; Zylich, Brian; Ottmar, Erin; LoCasale-Crouch, Jennifer; Whitehill, Jacob
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2023
ISSN: 1949-3045
DOI: 10.1109/TAFFC.2021.3059209

More Information
Summary: In this article we present a multi-modal machine learning-based system, which we call ACORN, to analyze videos of school classrooms for the Positive Climate (PC) and Negative Climate (NC) dimensions of the CLASS [1] observation protocol that is widely used in educational research. ACORN uses convolutional neural networks to analyze spectral audio features, the faces of teachers and students, and the pixels of each image frame, and then integrates this information over time using Temporal Convolutional Networks. The audiovisual ACORN's PC and NC predictions have Pearson correlations of 0.55 and 0.63 with ground-truth scores provided by expert CLASS coders on the UVA Toddler dataset (cross-validation on n = 300 15-min video segments), and a purely auditory ACORN predicts PC and NC with correlations of 0.36 and 0.41 on the MET dataset (test set of n = 2000 video segments). These numbers are similar to the inter-coder reliability of human coders. Finally, using Graph Convolutional Networks we make early strides (AUC = 0.70) toward predicting the specific moments (45-90 sec clips) when the PC is particularly weak or strong. Our findings inform the design of automatic classroom observation systems as well as more general video activity recognition and summarization systems.
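The abstract evaluates ACORN's predictions by their Pearson correlation with expert CLASS coders' scores. As a minimal sketch of that evaluation metric (the function name and the example score values below are hypothetical, not taken from the paper's data), the correlation between predicted and expert-coded climate scores can be computed as:

```python
import math

def pearson_r(preds, targets):
    """Pearson correlation between model predictions and expert-coded scores."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    return cov / math.sqrt(var_p * var_t)

# Hypothetical predicted vs. expert-coded Positive Climate scores
# (CLASS dimensions are rated on a 1-7 scale)
pred = [5.2, 4.1, 6.0, 3.5, 5.8]
true = [5.0, 4.5, 6.5, 3.0, 5.5]
print(round(pearson_r(pred, true), 3))  # → 0.945
```

In practice a library routine such as `scipy.stats.pearsonr` would be used; the point here is only that the reported 0.55/0.63 figures measure linear agreement between model outputs and human ratings, on the same scale used to quantify inter-coder reliability.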