Building a business email compromise research dataset with large language models

Bibliographic Details
Published in Journal of Computer Virology and Hacking Techniques Vol. 21; no. 1; p. 3
Main Author Dube, Rohit
Format Journal Article
Language English
Published Paris Springer Paris 02.01.2025
Springer Nature B.V
ISSN 2263-8733
DOI 10.1007/s11416-024-00544-y

Summary: Email-based attacks, such as Business Email Compromise, seriously threaten many organizations. In recent years, Large Language Models have improved the potency of email-based attacks by giving attackers an easy-to-use tool to overcome the language barrier and craft believable emails. At the same time, Business Email Compromise research remains hamstrung by the lack of a publicly available dataset. This paper proposes a novel system composed of Large Language Models to create Business Email Compromise datasets. Two datasets are generated. The first one (BEC-1) is a small 20-email proof-of-concept dataset that demonstrates that the system produces a dataset that a human analyst finds credible. The second (BEC-2) is a larger 279-email dataset generated using the same system. BEC-2 is the first public Business Email Compromise dataset available to the email security research community. The paper also proposes an accuracy-like metric called “agreement score” to measure the quality of datasets produced. Both BEC-1 and BEC-2 have high agreement scores – 90 and 93, respectively – validating the effectiveness of the Large Language Model system.