Efficient URL and URI Compression
| Published in | Proceedings - International Conference on Computer Communications and Networks, pp. 1-9 |
|---|---|
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 29 July 2024 |
| ISSN | 2637-9430 |
| DOI | 10.1109/ICCCN61486.2024.10637589 |

Summary: Web applications use Uniform Resource Identifiers (URIs), interchangeably referred to as Uniform Resource Locators (URLs), to locate resources such as files and web pages on the Internet. Messaging services, firewalls, content distribution frameworks, event logs, databases and datasets store countless URIs. With the proliferation of the Internet, the number of URIs has grown rapidly, and storing them demands significant space. The literature offers several compression schemes for storing files efficiently on hard disks. However, these schemes are designed for generic content, which results in sub-optimal storage efficiency for standalone URIs. This paper presents a compression scheme designed specifically for URIs. Our contribution is three-fold: a) an empirical analysis of existing compression schemes for storing URIs, b) a design for a novel URI-focused compression scheme that improves on existing schemes, and c) an adaptation of the well-known Huffman coding scheme to URIs that uses Natural Language Processing (NLP) to create a custom compression dictionary. Evaluation on five million standalone URI strings shows that our novel compression scheme improves storage efficiency by 18%. Furthermore, our customized Huffman coding scheme outperforms the standard content-agnostic Huffman technique. Compared to standard Huffman coding, our scheme reduces the space needed to store a single instance of every URI in existence, estimated at more than 130 trillion, by more than 1.2 petabytes (PB). Counting only unique URIs, this minimum saving of 1.2 PB corresponds to approximately USD 24,000 of hard disk storage; in practice, savings many orders of magnitude larger may be possible.
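
The summary contrasts content-agnostic Huffman coding with a variant built on a custom dictionary. The following is a minimal sketch of that idea, not the authors' implementation: it compares a character-level Huffman code against one whose alphabet also contains a few multi-character URI tokens. The token list `URI_TOKENS`, the helper names, and the sample URIs are all assumptions made for illustration; the paper derives its dictionary with NLP, which is not reproduced here. The cost of storing the code table itself is ignored.

```python
import heapq
from collections import Counter
from itertools import count

# Hypothetical hand-picked tokens; the paper builds its dictionary with NLP instead.
URI_TOKENS = ["https://", "http://", "www.", ".com", ".org", ".html"]

def tokenize(uri, tokens=()):
    """Greedy left-to-right split: emit a known token if one matches, else one character."""
    ordered = sorted(tokens, key=len, reverse=True)  # prefer the longest match
    out, i = [], 0
    while i < len(uri):
        match = next((t for t in ordered if uri.startswith(t, i)), None)
        symbol = match if match else uri[i]
        out.append(symbol)
        i += len(symbol)
    return out

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from freqs."""
    if len(freqs) == 1:
        return {s: 1 for s in freqs}  # a lone symbol still needs one bit
    tie = count()  # unique tie-breaker so the heap never compares the dicts
    heap = [(f, next(tie), {s: 0}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in left.items()}
        merged.update({s: d + 1 for s, d in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def encoded_bits(uris, tokens=()):
    """Total payload bits needed to Huffman-encode all URIs with the given token set."""
    freqs = Counter(s for u in uris for s in tokenize(u, tokens))
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)

# Made-up sample URIs, for illustration only.
sample = [
    "https://www.example.com/index.html",
    "https://www.example.org/docs/page.html",
    "http://example.com/a/b/c",
]
raw_bits = 8 * sum(len(u) for u in sample)
print("raw:", raw_bits)
print("character-level Huffman:", encoded_bits(sample))
print("token-aware Huffman:", encoded_bits(sample, URI_TOKENS))
```

Treating a frequent substring such as `https://` as one symbol lets a single short code replace eight separate character codes, which is the intuition behind a URI-specific dictionary. Whether this pays off on a real corpus depends on the token set and on the cost of storing the dictionary, neither of which this sketch measures.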