A Learned Approach to Design Compressed Rank/Select Data Structures

Antonio Boffa;Paolo Ferragina;Giorgio Vinciguerra

2022-01-01

Abstract

We address the problem of designing, implementing, and experimenting with compressed data structures that support rank and select queries over a dictionary of integers. We shine a new light on this classical problem by showing a connection between the input integers and the geometry of a set of points in a Cartesian plane suitably derived from them. We then build upon some results in computational geometry to introduce the first compressed rank/select dictionary based on the idea of "learning" the distribution of such points via proper linear approximations (LA). We therefore call this novel data structure the la_vector.

We prove time and space complexities of the la_vector in several scenarios: in the worst case, in the case of input distributions with finite mean and variance, and taking into account the kth-order entropy of some of its building blocks. We also discuss improved hybrid data structures, namely, ones that suitably orchestrate known compressed rank/select dictionaries with the la_vector.

We corroborate our theoretical results with a large set of experiments over datasets originating from a variety of applications (Web search, DNA sequencing, information retrieval, and natural language processing) and show that our approach provides new interesting space-time tradeoffs with respect to many well-established compressed rank/select dictionary implementations. In particular, we show that our select is the fastest, and our rank is on the space-time Pareto frontier.

Year of publication

2022

Appears in types:

1.1 Journal Article

Files in this record:

File	Size	Format
3524060.pdf (not available; license: publisher's copyright)	1.31 MB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11382/566784
