Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Bibliographic Details
Main Authors: Ibrahim, Lujain; Akbulut, Canfer; Elasmar, Rasmi; Rastogi, Charvi; Kahng, Minsuk; Morris, Meredith Ringel; McKee, Kevin R; Rieser, Verena; Shanahan, Murray; Weidinger, Laura
Format: Journal Article (preprint)
Language: English
Published: 10.02.2025
Online Access: https://arxiv.org/abs/2502.07077
DOI: 10.48550/arxiv.2502.07077


Abstract: The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
Copyright: http://creativecommons.org/licenses/by/4.0
Subjects: Computer Science - Computation and Language; Computer Science - Computers and Society; Computer Science - Human-Computer Interaction