Evaluating Large Language Model Robustness using Combinatorial Testing
Published in | 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 300 - 309
---|---
Main Authors | , , ,
Format | Conference Proceeding
Language | English
Published | IEEE, 31.03.2025
DOI | 10.1109/ICSTW64639.2025.10962520
Summary: Recent advancements in large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like text, leading to widespread adoption across domains. Given LLMs' versatile capabilities, current evaluation practices assess LLMs across a wide variety of tasks, including answer generation, sentiment analysis, text completion, and question answering, to name a few. Multiple-choice questions (MCQ) have emerged as a widely used evaluation task to assess LLMs' understanding and reasoning across various subject areas. However, studies from the literature have revealed that LLMs are sensitive to the ordering of options in MCQ tasks, with performance varying based on option sequence, underscoring concerns about the robustness of LLM performance. This work presents a combinatorial testing-based framework for systematic and comprehensive robustness assessment of pre-trained LLMs. By leveraging sequence covering arrays, the framework constructs test sets by systematically swapping the order of options, which are then used to ascertain the robustness of LLMs. We performed an experimental evaluation using the Measuring Massive Multitask Language Understanding (MMLU) dataset, a widely used MCQ dataset, and evaluated the robustness of GPT 3.5 Turbo, a pre-trained LLM. Results suggest the framework can effectively identify numerous robustness issues with a relatively small number of tests.
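As a rough illustration of the approach described in the summary, the sketch below builds a sequence covering array over the four MCQ option positions and uses it to permute each question's options before querying the model. The greedy construction and the `ask_llm` helper are assumptions for illustration, not the paper's implementation, and the resulting array is not necessarily minimal.

```python
# Sketch: greedy construction of a strength-t sequence covering array (SCA)
# and its use to permute MCQ options for robustness testing.
# Assumptions: ask_llm() is a hypothetical stand-in for a call to the model
# under test (e.g., GPT 3.5 Turbo); the greedy construction is a generic SCA
# algorithm, not necessarily the paper's exact procedure.
from itertools import combinations, permutations

def build_sca(symbols, t=3):
    """Greedily pick permutations until every ordered t-subsequence is covered."""
    targets = set(permutations(symbols, t))       # all t-way orderings to cover
    candidates = list(permutations(symbols))      # feasible for 4 MCQ options (4! = 24)
    sca, covered = [], set()
    while covered != targets:
        # a permutation covers the ordered t-subsequences contained in it
        best = max(candidates, key=lambda p: len(set(combinations(p, t)) - covered))
        sca.append(best)
        covered |= set(combinations(best, t))
    return sca

def reorder(options, answer_idx, perm):
    """Apply one SCA row (a permutation of option positions) to an MCQ item."""
    new_options = [options[i] for i in perm]
    new_answer_idx = perm.index(answer_idx)       # where the correct option moved
    return new_options, new_answer_idx

def robustness_check(question, options, answer_idx, ask_llm, t=3):
    """Query the model once per SCA row; report whether all orderings agree."""
    sca = build_sca(range(len(options)), t=t)
    verdicts = []
    for perm in sca:
        opts, gold = reorder(options, answer_idx, perm)
        predicted = ask_llm(question, opts)       # hypothetical LLM call
        verdicts.append(predicted == gold)
    return all(verdicts), sca

if __name__ == "__main__":
    sca = build_sca(["A", "B", "C", "D"], t=3)
    print(f"{len(sca)} orderings cover all 3-way option sequences:")
    for row in sca:
        print("  ", "".join(row))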