Evaluating Large Language Model Robustness using Combinatorial Testing

Bibliographic Details
Published in: 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 300-309
Main Authors: Chandrasekaran, Jaganmohan; Patel, Ankita Ramjibhai; Lanus, Erin; Freeman, Laura J.
Format: Conference Proceeding
Language: English
Published: IEEE, 31.03.2025
Subjects: Cognition; Combinatorial testing; Conferences; Large language models; LLM Evaluation; LLM Robustness; Option Order Swapping; Robustness; Sensitivity; Sentiment analysis; Systematics; Testing AI; Testing LLM
Online Access: Get full text
DOI: 10.1109/ICSTW64639.2025.10962520

Abstract Recent advancements in large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like text, leading to widespread adoption across domains. Given LLMs' versatile capabilities, current evaluation practices assess LLMs across a wide variety of tasks, including answer generation, sentiment analysis, text completion, and question answering, to name a few. Multiple-choice questions (MCQs) have emerged as a widely used evaluation task to assess LLMs' understanding and reasoning across various subject areas. However, studies from the literature have revealed that LLMs exhibit sensitivity to the ordering of options in MCQ tasks, with performance varying based on option sequence, underscoring robustness concerns in LLM performance. This work presents a combinatorial-testing-based framework for systematic and comprehensive robustness assessment of pre-trained LLMs. By leveraging sequence covering arrays, the framework constructs test sets by systematically swapping the order of options, which are then used to ascertain the robustness of LLMs. We performed an experimental evaluation using the Measuring Massive Multitask Language Understanding (MMLU) dataset, a widely used MCQ dataset, and evaluated the robustness of GPT-3.5 Turbo, a pre-trained LLM. Results suggest the framework can effectively identify numerous robustness issues with a relatively small number of tests.
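
The mechanism the abstract names, constructing test sets by swapping MCQ option order according to a sequence covering array, can be illustrated with a short sketch. This is a minimal illustration assuming a greedy construction; the paper's actual generator and LLM prompting harness are not reproduced here, and sequence_covering_array and covered are hypothetical names introduced for this example.

    from itertools import combinations, permutations

    def covered(test, t):
        # All ordered t-subsequences of distinct options appearing in one
        # test, where a test is a full ordering of the answer options.
        return {tuple(test[i] for i in idx)
                for idx in combinations(range(len(test)), t)}

    def sequence_covering_array(events, t):
        # Greedy strength-t sequence covering array: a small set of option
        # orderings such that every ordered t-subsequence of options appears
        # in at least one ordering. Exhaustive search over candidates is
        # feasible here because an MCQ has few options (4! = 24 for A-D).
        remaining = set(permutations(events, t))  # subsequences still uncovered
        tests = []
        while remaining:
            # Pick the ordering covering the most still-uncovered subsequences.
            best = max(permutations(events),
                       key=lambda p: len(covered(p, t) & remaining))
            tests.append(best)
            remaining -= covered(best, t)
        return tests

    # Each resulting ordering yields one variant of the MCQ prompt; a robust
    # model should select the same answer content under every ordering.
    for i, order in enumerate(sequence_covering_array("ABCD", t=3), start=1):
        print(f"test {i}: present options in order {'-'.join(order)}")

For four options at strength t = 3, every relative ordering of any three options is exercised by a handful of orderings rather than all 24 permutations, which is what would let such a framework surface order-sensitivity issues with relatively few tests, consistent with the abstract's claim.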
Author Chandrasekaran, Jaganmohan
Freeman, Laura J.
Lanus, Erin
Patel, Ankita Ramjibhai
Author_xml – sequence: 1
  givenname: Jaganmohan
  surname: Chandrasekaran
  fullname: Chandrasekaran, Jaganmohan
  email: jagan@vt.edu
  organization: Virginia Tech, Sanghani Center for Artificial Intelligence & Data Analytics, Arlington, VA, USA
– sequence: 2
  givenname: Ankita Ramjibhai
  surname: Patel
  fullname: Patel, Ankita Ramjibhai
– sequence: 3
  givenname: Erin
  surname: Lanus
  fullname: Lanus, Erin
  organization: Virginia Tech, National Security Institute, Arlington, VA, USA
– sequence: 4
  givenname: Laura J.
  surname: Freeman
  fullname: Freeman, Laura J.
  organization: Virginia Tech, National Security Institute, Arlington, VA, USA
ContentType Conference Proceeding
DOI 10.1109/ICSTW64639.2025.10962520
EISBN 9798331534677
EndPage 309
ExternalDocumentID 10962520
Genre orig-research
GrantInformation_xml – fundername: U.S. Department of Defense
  funderid: 10.13039/100000005
IsPeerReviewed false
IsScholarly false
Language English
PageCount 10
PublicationCentury 2000
PublicationDate 2025-March-31
PublicationDateYYYYMMDD 2025-03-31
PublicationDate_xml – month: 03
  year: 2025
  text: 2025-March-31
  day: 31
PublicationDecade 2020
PublicationTitle 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)
PublicationTitleAbbrev ICSTW
PublicationYear 2025
Publisher IEEE
StartPage 300
SubjectTerms Cognition
Combinatorial testing
Conferences
Large language models
LLM Evaluation
LLM Robustness
Option Order Swapping
Robustness
Sensitivity
Sentiment analysis
Systematics
Testing AI
Testing LLM
Title Evaluating Large Language Model Robustness using Combinatorial Testing
URI https://ieeexplore.ieee.org/document/10962520