Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
Saved in:
Main Authors | Ibrahim, Lujain; Akbulut, Canfer; Elasmar, Rasmi; Rastogi, Charvi; Kahng, Minsuk; Morris, Meredith Ringel; McKee, Kevin R; Rieser, Verena; Shanahan, Murray; Weidinger, Laura |
---|---|
Format | Journal Article |
Language | English |
Published | 10.02.2025 |
Subjects | Computer Science - Computation and Language; Computer Science - Computers and Society; Computer Science - Human-Computer Interaction |
Online Access | https://arxiv.org/abs/2502.07077 |
DOI | 10.48550/arxiv.2502.07077 |
Abstract | The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction. |
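The abstract's evaluation pipeline — a simulated user converses with a model over several turns while detectors score each model reply for anthropomorphic behaviours — can be illustrated with a minimal, purely hypothetical sketch. Everything below is an assumption for illustration: the stub functions stand in for an LLM and for the paper's user simulator, and a first-person-pronoun regex serves as a crude proxy for just one of the 14 behaviours; none of it is the authors' actual implementation.

```python
# Illustrative sketch (NOT the paper's code): multi-turn dialogue with a
# stubbed model, scoring each reply for one anthropomorphic behaviour
# (first-person pronoun use).
import re

def stub_model_reply(turn: int) -> str:
    # Stand-in for an LLM call; scripted replies keep the sketch runnable.
    replies = [
        "Hello! What would you like help with?",
        "That sounds difficult. I understand how you feel.",
        "I think we can work through this together.",
    ]
    return replies[turn % len(replies)]

def simulated_user_message(turn: int) -> str:
    # Stand-in for the paper's automated user simulator.
    prompts = ["Hi there.", "I've had a rough week.", "Thanks for listening."]
    return prompts[turn % len(prompts)]

# Simplistic proxy detector for one behaviour: first-person pronoun use.
FIRST_PERSON = re.compile(r"\b(I|me|my|mine)\b")

def evaluate_dialogue(num_turns: int = 3) -> list:
    """Run a multi-turn dialogue, recording per turn whether the model
    reply exhibits the detected behaviour."""
    records = []
    for turn in range(num_turns):
        user = simulated_user_message(turn)
        reply = stub_model_reply(turn)
        records.append({
            "turn": turn + 1,
            "user": user,
            "model": reply,
            "first_person": bool(FIRST_PERSON.search(reply)),
        })
    return records

results = evaluate_dialogue()
# In this toy dialogue the behaviour first appears on turn 2, which is
# why a single-turn benchmark would miss it entirely.
print(next(r["turn"] for r in results if r["first_person"]))  # prints 2
```

The point of the sketch is structural: because behaviours such as relationship-building often surface only after several exchanges, the per-turn record (rather than a single prompt-response pair) is what makes the measurement possible.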
Author | Weidinger, Laura; Shanahan, Murray; Rastogi, Charvi; Rieser, Verena; Elasmar, Rasmi; Ibrahim, Lujain; Kahng, Minsuk; McKee, Kevin R; Morris, Meredith Ringel; Akbulut, Canfer |
ContentType | Journal Article |
Copyright | http://creativecommons.org/licenses/by/4.0 |
DOI | 10.48550/arxiv.2502.07077 |
DatabaseName | arXiv Computer Science arXiv.org |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
OpenAccessLink | https://arxiv.org/abs/2502.07077 |
PublicationCentury | 2000 |
PublicationDate | 2025-02-10 |
PublicationDecade | 2020 |
PublicationYear | 2025 |
SecondaryResourceType | preprint |
SourceID | arxiv |
SourceType | Open Access Repository |
SubjectTerms | Computer Science - Computation and Language; Computer Science - Computers and Society; Computer Science - Human-Computer Interaction |
Title | Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models |
URI | https://arxiv.org/abs/2502.07077 |
linkProvider | Cornell University |