Evaluating Large Language Model Robustness using Combinatorial Testing

Bibliographic Details
Published in: 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 300-309
Main Authors: Chandrasekaran, Jaganmohan; Patel, Ankita Ramjibhai; Lanus, Erin; Freeman, Laura J.
Format: Conference Proceeding
Language: English
Published: IEEE, 31.03.2025
Subjects: Cognition; Combinatorial testing; Conferences; Large language models; LLM Evaluation; LLM Robustness; Option Order Swapping; Robustness; Sensitivity; Sentiment analysis; Systematics; Testing AI; Testing LLM
Online Access: Get full text
DOI: 10.1109/ICSTW64639.2025.10962520

Abstract Recent advancements in large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like text, leading to widespread adoption across domains. Given LLMs' versatile capabilities, current evaluation practices assess LLMs across a wide variety of tasks, including answer generation, sentiment analysis, text completion, and question answering, to name a few. Multiple-choice questions (MCQs) have emerged as a widely used evaluation task to assess LLMs' understanding and reasoning across various subject areas. However, studies from the literature have revealed that LLMs exhibit sensitivity to the ordering of options in MCQ tasks, with performance varying based on option sequence, underscoring robustness concerns in LLM performance. This work presents a combinatorial-testing-based framework for systematic and comprehensive robustness assessment of pre-trained LLMs. By leveraging sequence covering arrays, the framework constructs test sets by systematically swapping the order of options, which are then used to ascertain the robustness of LLMs. We performed an experimental evaluation using the Measuring Massive Multitask Language Understanding (MMLU) dataset, a widely used MCQ dataset, and evaluated the robustness of GPT-3.5 Turbo, a pre-trained LLM. Results suggest the framework can effectively identify numerous robustness issues with a relatively small number of tests.
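
The mechanism the abstract names, constructing test sets by swapping MCQ option order according to a sequence covering array, can be illustrated with a short sketch. This is a minimal illustration assuming a greedy construction; the paper's actual generator and LLM prompting harness are not reproduced here, and sequence_covering_array and covered are hypothetical names introduced for this example.

    from itertools import combinations, permutations

    def covered(test, t):
        # All ordered t-subsequences of distinct options appearing in one
        # test, where a test is a full ordering of the answer options.
        return {tuple(test[i] for i in idx)
                for idx in combinations(range(len(test)), t)}

    def sequence_covering_array(events, t):
        # Greedy strength-t sequence covering array: a small set of option
        # orderings such that every ordered t-subsequence of options appears
        # in at least one ordering. Exhaustive search over candidates is
        # feasible here because an MCQ has few options (4! = 24 for A-D).
        remaining = set(permutations(events, t))  # subsequences still uncovered
        tests = []
        while remaining:
            # Pick the ordering covering the most still-uncovered subsequences.
            best = max(permutations(events),
                       key=lambda p: len(covered(p, t) & remaining))
            tests.append(best)
            remaining -= covered(best, t)
        return tests

    # Each resulting ordering yields one variant of the MCQ prompt; a robust
    # model should select the same answer content under every ordering.
    for i, order in enumerate(sequence_covering_array("ABCD", t=3), start=1):
        print(f"test {i}: present options in order {'-'.join(order)}")

For four options at strength t = 3, every relative ordering of any three options is exercised by a handful of orderings rather than all 24 permutations, which is what would let such a framework surface order-sensitivity issues with relatively few tests, consistent with the abstract's claim.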
Author Chandrasekaran, Jaganmohan
Freeman, Laura J.
Lanus, Erin
Patel, Ankita Ramjibhai
Author_xml – sequence: 1
  givenname: Jaganmohan
  surname: Chandrasekaran
  fullname: Chandrasekaran, Jaganmohan
  email: jagan@vt.edu
  organization: Virginia Tech, Sanghani Center for Artificial Intelligence & Data Analytics, Arlington, VA, USA
– sequence: 2
  givenname: Ankita Ramjibhai
  surname: Patel
  fullname: Patel, Ankita Ramjibhai
– sequence: 3
  givenname: Erin
  surname: Lanus
  fullname: Lanus, Erin
  organization: Virginia Tech, National Security Institute, Arlington, VA, USA
– sequence: 4
  givenname: Laura J.
  surname: Freeman
  fullname: Freeman, Laura J.
  organization: Virginia Tech, National Security Institute, Arlington, VA, USA
ContentType Conference Proceeding
DOI 10.1109/ICSTW64639.2025.10962520
EISBN 9798331534677
EndPage 309
ExternalDocumentID 10962520
Genre orig-research
GrantInformation_xml – fundername: U.S. Department of Defense
  funderid: 10.13039/100000005
IsPeerReviewed false
IsScholarly false
Language English
PageCount 10
PublicationCentury 2000
PublicationDate 2025-March-31
PublicationDateYYYYMMDD 2025-03-31
PublicationDate_xml – month: 03
  year: 2025
  text: 2025-March-31
  day: 31
PublicationDecade 2020
PublicationTitle 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)
PublicationTitleAbbrev ICSTW
PublicationYear 2025
Publisher IEEE
StartPage 300
SubjectTerms Cognition
Combinatorial testing
Conferences
Large language models
LLM Evaluation
LLM Robustness
Option Order Swapping
Robustness
Sensitivity
Sentiment analysis
Systematics
Testing AI
Testing LLM
Title Evaluating Large Language Model Robustness using Combinatorial Testing
URI https://ieeexplore.ieee.org/document/10962520