Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Bibliographic Details
Main Authors: Ibrahim, Lujain; Akbulut, Canfer; Elasmar, Rasmi; Rastogi, Charvi; Kahng, Minsuk; Morris, Meredith Ringel; McKee, Kevin R; Rieser, Verena; Shanahan, Murray; Weidinger, Laura
Format: Journal Article (preprint)
Language: English
Published: 10.02.2025
Online Access: https://arxiv.org/abs/2502.07077
DOI: 10.48550/arxiv.2502.07077


Abstract: The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
Copyright: http://creativecommons.org/licenses/by/4.0
Subjects: Computer Science - Computation and Language; Computer Science - Computers and Society; Computer Science - Human-Computer Interaction