Efficient URL and URI Compression

Bibliographic Details
Published in: Proceedings - International Conference on Computer Communications and Networks, pp. 1-9
Main Authors: Savins, Felix; Saric, Kevin; Ramachandran, Gowri Sankar; Jurdak, Raja
Format: Conference Proceeding
Language: English
Published: IEEE, 29.07.2024
ISSN: 2637-9430
DOI: 10.1109/ICCCN61486.2024.10637589

Summary: Web applications use Uniform Resource Identifiers (URIs), a term often used interchangeably with Uniform Resource Locators (URLs), to locate resources such as files and web pages on the Internet. Messaging services, firewalls, content distribution frameworks, event logs, databases and datasets store countless URIs. With the proliferation of the Internet, the number of URIs has increased rapidly, demanding significant storage. Several compression schemes exist in the literature for efficiently storing files on hard disks. However, these schemes are designed for generic content, resulting in sub-optimal storage efficiency for standalone URIs. This paper presents a compression scheme specifically designed for URIs. Our contribution is three-fold: a) an empirical analysis of existing compression schemes for storing URIs, b) a design for a novel URI-focused compression scheme that improves on existing schemes and c) an adaptation of the well-known Huffman coding scheme to URIs using Natural Language Processing (NLP) to create a custom compression dictionary. Evaluation results using five million standalone URI strings show that our novel compression scheme improves storage efficiency by 18%. Furthermore, our customized Huffman coding compression scheme outperforms the standard content-agnostic Huffman technique. Our compression scheme reduces the storage space of a single instance of all URIs in existence, estimated to be more than 130 trillion, by more than 1.2 petabytes (PB) compared to the standard Huffman coding technique. Counting only unique instances of URIs, this minimum saving of 1.2 PB corresponds to approximately USD 24,000 of hard disk storage; in practice, savings many orders of magnitude larger may be possible.
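
The summary gives no implementation details, so the following is only a minimal sketch of the general idea of Huffman coding over a URI-specific symbol set, not the authors' scheme. The token pattern (`TOKEN_RE`), the function names and the tiny sample corpus are all assumptions made for illustration; the paper derives its dictionary with NLP, which is replaced here by a hand-written regular expression.

```python
import heapq
import re
from collections import Counter
from itertools import count

# Assumed, hand-picked dictionary: a few frequent multi-character URI pieces,
# with a single-character fallback so every URI can be tokenized.
TOKEN_RE = re.compile(r"https?://|www\.|\.com|\.org|\.net|.", re.DOTALL)


def tokenize(uri):
    """Split a URI into dictionary tokens and single characters."""
    return TOKEN_RE.findall(uri)


def build_code(corpus):
    """Build a Huffman code (token -> bit string) from token frequencies."""
    freq = Counter(tok for uri in corpus for tok in tokenize(uri))
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(n, next(tie), tok) for tok, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Repeatedly merge the two least frequent nodes; internal nodes are tuples.
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tie), (left, right)))
    code = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            code[node] = prefix
    return code


def encode(uri, code):
    """Encode a URI as a bit string; raises KeyError for unseen tokens."""
    return "".join(code[tok] for tok in tokenize(uri))


def decode(bits, code):
    """Decode a bit string by matching prefix-free codewords greedily."""
    inverse = {bits_: tok for tok, bits_ in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)


# Illustrative corpus only; not data from the paper.
corpus = [
    "https://www.example.com/a",
    "http://www.example.org/b",
    "https://example.net/index.html",
]
code = build_code(corpus)
bits = encode(corpus[0], code)
restored = decode(bits, code)
```

Because frequent pieces such as `https://` become a single symbol, they can receive one short codeword instead of eight separately coded characters, which is the intuition behind tailoring the dictionary to URIs. A practical version would also need an escape mechanism for tokens absent from the training corpus and would pack the bit string into bytes.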