Evaluating Large Language Model Robustness using Combinatorial Testing

Bibliographic Details
Published in: 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 300-309
Main Authors: Chandrasekaran, Jaganmohan; Patel, Ankita Ramjibhai; Lanus, Erin; Freeman, Laura J.
Format: Conference Proceeding
Language: English
Published: IEEE, 31.03.2025
DOI: 10.1109/ICSTW64639.2025.10962520

Summary: Recent advancements in large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like text, leading to widespread adoption across domains. Given LLMs' versatile capabilities, current evaluation practices assess LLMs across a wide variety of tasks, including answer generation, sentiment analysis, text completion, and question answering, to name a few. Multiple-choice questions (MCQs) have emerged as a widely used evaluation task for assessing LLMs' understanding and reasoning across various subject areas. However, studies in the literature have revealed that LLMs are sensitive to the ordering of options in MCQ tasks, with performance varying based on option sequence, underscoring robustness concerns in LLM performance.

This work presents a combinatorial testing-based framework for systematic and comprehensive robustness assessment of pre-trained LLMs. By leveraging sequence covering arrays, the framework constructs test sets that systematically swap the order of options; these test sets are then used to ascertain the robustness of LLMs. We performed an experimental evaluation of GPT-3.5 Turbo, a pre-trained LLM, using the Measuring Massive Multitask Language Understanding (MMLU) dataset, a widely used MCQ dataset. Results suggest the framework can effectively identify numerous robustness issues with a relatively small number of tests.
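To make the framework's core idea concrete, the sketch below shows one way option-order tests could be generated from a sequence covering array. This is a minimal Python illustration, not the authors' implementation: it assumes a greedy construction of a strength-3 sequence covering array over four answer options and uses a hypothetical question; the paper's actual array construction and covering strength may differ.

from itertools import permutations

def length3_subsequences(perm):
    """All ordered 3-event subsequences appearing in a permutation."""
    n = len(perm)
    return {(perm[i], perm[j], perm[k])
            for i in range(n)
            for j in range(i + 1, n)
            for k in range(j + 1, n)}

def greedy_sequence_covering_array(events):
    """Greedily pick permutations of `events` until every ordered
    3-event subsequence is covered (a strength-3 sequence covering array)."""
    uncovered = set(permutations(events, 3))
    tests = []
    while uncovered:
        # Pick the permutation that covers the most still-uncovered 3-sequences.
        best = max(permutations(events),
                   key=lambda p: len(length3_subsequences(p) & uncovered))
        tests.append(best)
        uncovered -= length3_subsequences(best)
    return tests

# Hypothetical MMLU-style item (illustrative only, not from the paper).
question = "What is the capital of France?"
options = {"A": "Paris", "B": "London", "C": "Berlin", "D": "Madrid"}

for order in greedy_sequence_covering_array(list(options)):
    # Each array row reshuffles the option contents; positions are re-lettered A-D.
    reordered = [f"{label}. {options[original]}"
                 for label, original in zip("ABCD", order)]
    print(question)
    print("\n".join(reordered))
    print()

Each row of the resulting array corresponds to one test: the same question posed with its option contents reordered, so disagreement in the model's answers across rows would indicate an order-sensitivity (robustness) issue of the kind the paper reports.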