Scalable Compression of Massive Data Collections on HPC Systems
Ferragina P.
2026-01-01
Abstract
The exponential growth of digital data poses a significant storage challenge, straining current storage systems in terms of cost, efficiency, maintainability, and available resources. For large-scale data archiving, highly efficient data compression techniques are vital for minimizing storage overhead, improving communication efficiency, and optimizing data retrieval performance. This paper presents a scalable parallel workflow designed to compress vast collections of files on high-performance computing systems. Leveraging the Permute-Partition-Compress (PPC) paradigm, the proposed workflow optimizes both compression ratio and processing speed. By integrating a data clustering technique, our solution effectively addresses the challenges posed by large-scale data collections in terms of compression efficiency and scalability. Experiments were conducted on the Leonardo petascale supercomputer at CINECA (leonardo-supercomputer.cineca.eu) and processed a subset of the Software Heritage archive consisting of about 49 million files of C++ code, totaling 1.1 TB. Experimental results demonstrate significant compression speedups and strong scalability.
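The Permute-Partition-Compress pipeline the abstract refers to can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration only: the shingle-based signature, the sort-by-signature permutation, the fixed-size partitioning, and zlib as the per-block compressor are stand-ins, not the paper's actual clustering technique or HPC implementation.

```python
# Minimal sketch of the Permute-Partition-Compress (PPC) paradigm.
# All names and technique choices here are illustrative assumptions.
import zlib
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def signature(data: bytes, k: int = 8, n: int = 16) -> tuple:
    """Cheap content fingerprint: the n smallest hashes of k-byte shingles.
    Files with similar content tend to get similar signatures."""
    shingles = {data[i:i + k] for i in range(0, max(len(data) - k + 1, 1), k)}
    return tuple(sorted(hash(s) for s in shingles)[:n])


def compress_block(paths: list) -> bytes:
    """Concatenate a block of (hopefully similar) files and compress them
    together, so shared content falls inside the compressor's window."""
    blob = b"".join(Path(p).read_bytes() for p in paths)
    return zlib.compress(blob, 9)


def ppc(files: list, block_size: int = 64, workers: int = 8) -> list:
    # Permute: order files so content-similar ones become adjacent.
    ordered = sorted(files, key=lambda p: signature(Path(p).read_bytes()))
    # Partition: split the permuted sequence into fixed-size blocks.
    blocks = [ordered[i:i + block_size]
              for i in range(0, len(ordered), block_size)]
    # Compress: process blocks independently, in parallel.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_block, blocks))


if __name__ == "__main__":  # guard required for ProcessPoolExecutor
    compressed = ppc(["a.cpp", "b.cpp"])  # hypothetical file list
```

Compressing each block independently is what makes the workflow parallel and scalable, while the permutation step is what recovers most of the compression ratio that a single global compressor would otherwise exploit.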
| File | Type | License | Size | Format | Access |
|---|---|---|---|---|---|
| EURO-PAR_2025_paper_116 (submission).pdf | Pre-print/Submitted manuscript | Creative Commons | 611.86 kB | Adobe PDF | Authorized users only (request a copy) |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

