
Scalable Compression of Massive Data Collections on HPC Systems

Ferragina, P.
2026-01-01

Abstract

The exponential growth of digital data poses a significant storage challenge, straining current storage systems in terms of cost, efficiency, maintainability, and available resources. For large-scale data archiving, highly efficient compression techniques are vital for minimizing storage overhead, improving communication efficiency, and optimizing data retrieval performance. This paper presents a scalable parallel workflow designed to compress vast collections of files on high-performance computing systems. Leveraging the Permute-Partition-Compress (PPC) paradigm, the proposed workflow optimizes both compression ratio and processing speed. By integrating a data clustering technique, our solution effectively addresses the compression-efficiency and scalability challenges posed by large-scale data collections. Experiments were conducted on the Leonardo petascale supercomputer at CINECA (leonardo-supercomputer.cineca.eu), processing a subset of the Software Heritage archive consisting of about 49 million C++ source files totaling 1.1 TB. Experimental results show significant gains in both compression speed and scalability.
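The PPC paradigm mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the clustering step is replaced here by a plain sort (which places lexicographically similar files adjacently), the block size and worker count are illustrative defaults, and `zlib` stands in for whatever compressor the workflow actually uses.

```python
import zlib
from concurrent.futures import ProcessPoolExecutor

def permute(files):
    # Permute: reorder files so similar ones are adjacent.
    # A simple sort stands in for the paper's clustering technique.
    return sorted(files)

def partition(files, block_bytes):
    # Partition: greedily pack the permuted files into blocks of
    # roughly block_bytes, so blocks can be compressed independently.
    blocks, cur, size = [], [], 0
    for f in files:
        cur.append(f)
        size += len(f)
        if size >= block_bytes:
            blocks.append(b"".join(cur))
            cur, size = [], 0
    if cur:
        blocks.append(b"".join(cur))
    return blocks

def compress_block(block):
    # Compress: each block is compressed on its own, enabling parallelism.
    return zlib.compress(block, 9)

def ppc_compress(files, block_bytes=1 << 20, workers=4):
    blocks = partition(permute(files), block_bytes)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_block, blocks))

if __name__ == "__main__":
    # Many near-identical small "source files": similarity that a good
    # permutation exposes to the block compressor.
    files = [f"int f{i}() {{ return {i % 7}; }}\n".encode() for i in range(1000)]
    compressed = ppc_compress(files, block_bytes=4096)
    total_in = sum(len(f) for f in files)
    total_out = sum(len(b) for b in compressed)
    print(total_out < total_in)
```

The design point is that the three stages decouple cleanly: the permutation decides compression ratio (similar data lands in the same block), the partition decides the unit of parallelism and of random access, and the per-block compressor runs embarrassingly parallel across workers.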
2026
ISBN: 9783031998560, 9783031998577
Files in this record:
EURO-PAR_2025_paper_116 (submission).pdf — Pre-print/Submitted manuscript, Creative Commons license, 611.86 kB, Adobe PDF (authorized users only)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11382/581152
Note: the displayed data have not been validated by the university.

Citations: Scopus 0