Evaluating Large Language Model Robustness using Combinatorial Testing
Published in | 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW) pp. 300 - 309 |
---|---|
Main Authors | Chandrasekaran, Jaganmohan; Patel, Ankita Ramjibhai; Lanus, Erin; Freeman, Laura J. |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 31.03.2025 |
Subjects | Cognition; Combinatorial testing; Conferences; Large language models; LLM Evaluation; LLM Robustness; Option Order Swapping; Robustness; Sensitivity; Sentiment analysis; Systematics; Testing AI; Testing LLM |
Online Access | https://ieeexplore.ieee.org/document/10962520 |
DOI | 10.1109/ICSTW64639.2025.10962520 |
Abstract | Recent advances in large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like text, leading to widespread adoption across domains. Given LLMs' versatile capabilities, current evaluation practices assess LLMs across a wide variety of tasks, including answer generation, sentiment analysis, text completion, and question answering, to name a few. Multiple-choice questions (MCQs) have emerged as a widely used evaluation task for assessing LLMs' understanding and reasoning across various subject areas. However, studies in the literature have revealed that LLMs are sensitive to the ordering of options in MCQ tasks, with performance varying with option sequence, underscoring robustness concerns in LLM performance. This work presents a combinatorial testing-based framework for systematic and comprehensive robustness assessment of pre-trained LLMs. Leveraging sequence covering arrays, the framework constructs test sets by systematically swapping the order of options; these test sets are then used to ascertain the robustness of LLMs. We performed an experimental evaluation on the Measuring Massive Multitask Language Understanding (MMLU) dataset, a widely used MCQ benchmark, and assessed the robustness of GPT-3.5 Turbo, a pre-trained LLM. Results suggest the framework can effectively identify numerous robustness issues with a relatively small number of tests. |
---|---|
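The abstract's core mechanism lends itself to a short illustration. Below is a minimal sketch, assuming Python and a simple greedy construction of the sequence covering array (SCA); it is not the authors' implementation from the paper, and the sample question, options, and model-query step are hypothetical placeholders.

```python
# Minimal sketch of the idea described in the abstract: build a t-way sequence
# covering array (SCA) over the 4 MCQ option positions, then use each row of
# the array as an option reordering (one robustness test case per row).
# NOTE: illustrative only -- the greedy construction and the sample MCQ item
# are assumptions, not the authors' implementation.
from itertools import combinations, permutations


def t_subsequences(perm, t):
    """Return every length-t subsequence (order-preserving) contained in perm."""
    return {tuple(perm[i] for i in idx) for idx in combinations(range(len(perm)), t)}


def sequence_covering_array(n, t):
    """Greedily select permutations of 0..n-1 until every ordering of every
    t-subset of positions appears as a subsequence of some selected row."""
    targets = set(permutations(range(n), t))   # P(n, t) t-sequences to cover
    candidates = list(permutations(range(n)))
    rows, uncovered = [], set(targets)
    while uncovered:
        best = max(candidates, key=lambda p: len(t_subsequences(p, t) & uncovered))
        rows.append(best)
        uncovered -= t_subsequences(best, t)
    return rows


# Hypothetical MCQ item standing in for an MMLU question.
question = "Which data structure offers O(1) average-case lookup by key?"
options = ["Hash table", "Binary search tree", "Linked list", "Sorted array"]

for row in sequence_covering_array(n=4, t=3):
    reordered = [options[i] for i in row]
    # Here one would prompt the LLM under test with `question` and `reordered`,
    # flagging a robustness failure if the chosen answer changes across rows.
    print(row, reordered)
```

For four options at strength t = 3 there are P(4,3) = 24 orderings to cover, and each row covers C(4,3) = 4 of them, so at least six rows are needed; the greedy loop finds a cover of roughly that size, far fewer than the 4! = 24 full permutations an exhaustive option-order test would require.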
Author | Chandrasekaran, Jaganmohan (Virginia Tech, Sanghani Center for Artificial Intelligence & Data Analytics, Arlington, VA, USA; jagan@vt.edu); Patel, Ankita Ramjibhai; Lanus, Erin (Virginia Tech, National Security Institute, Arlington, VA, USA); Freeman, Laura J. (Virginia Tech, National Security Institute, Arlington, VA, USA) |
EISBN | 9798331534677 |
Funding | U.S. Department of Defense (funder ID: 10.13039/100000005) |