EDGAR-CORPUS: Billions of Tokens Make The World Go Round

Bibliographic Details
Published in	arXiv.org
Main Authors	Loukas, Lefteris; Fergadiotis, Manos; Androutsopoulos, Ion; Malakasiotis, Prodromos
Format	Paper; Journal Article
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 01.10.2021
Subjects	Annual reports; Computer Science - Computation and Language; Management reports; Source code
ISSN	2331-8422
DOI	10.48550/arxiv.2109.14394

More Information
Summary:	We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and showcase their superiority over generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.