Scalable Compression of Massive Data Collections on HPC Systems

Loris Belcastro, Paolo Ferragina, Giovanni Manzini, Fabrizio Marozzo, Domenico Talia, et al. · 2025

Compressing Suffix Trees by Path Decompositions

Ruben Becker, Davide Cenzato, Travis Gagie, Sunghwan Kim, Ragnar Groot Koerkamp, et al. · 2025

In this paper, we solve the long-standing problem of designing I/O-efficient compressed indexes. Our solution broadly consists of generalizing suffix sorting and revisiting suffix tree path compression. In classic suffix trees, path compre…

Prefix-free parsing for merging big BWTs

Diego Diaz-Domínguez, Travis Gagie, Veronica Guerrini, Ben Langmead, Zsuzsanna Lipták, et al. · 2025

When building Burrows-Wheeler Transforms (BWTs) of truly huge datasets, prefix-free parsing (PFP) can use an unreasonable amount of memory. In this paper we show how if a dataset can be broken down into small datasets that are not very sim…

Generalization of Repetitiveness Measures for Two-Dimensional Strings

L. Carfagna, Giovanni Manzini, Giuseppe Romana, Marinella Sciortino, C. Urbina · 2025

The problem of detecting and measuring the repetitiveness of one-dimensional strings has been extensively studied in data compression and text indexing. Our understanding of these issues has been significantly improved by the introduction …

On the compressibility of large-scale source code datasets

Antonio Boffa, Roberto Di Cosmo, Paolo Ferragina, André Guerra, Giovanni Manzini, et al. · 2025

Toward Greener Matrix Operations by Lossless Compressed Formats

Francesco Tosoni, Philip Bille, Valerio Brunacci, Alessio De Angelis, Paolo Ferragina, et al. · 2025

Sparse matrix-vector multiplication (SpMV) is a fundamental operation in machine learning, scientific computing, and graph algorithms. In this paper, we investigate the space, time, and energy efficiency of SpMV using various compressed fo…

Toward Greener Matrix Operations by Lossless Compressed Formats

Francesco Tosoni, Philip Bille, Valerio Brunacci, Alessio De Angelis, Paolo Ferragina, et al. · 2024

Sparse matrix-vector multiplication (SpMV) is a fundamental operation in machine learning, scientific computing, and graph algorithms. In this paper, we investigate the space, time, and energy efficiency of SpMV using various compressed fo…

Faster run-length compressed suffix arrays

Travis Gagie, Giovanni Manzini, Gonzalo Navarro, Marinella Sciortino · 2024

We first review how we can store a run-length compressed suffix array (RLCSA) for a text $T$ of length $n$ over an alphabet of size $σ$ whose Burrows-Wheeler Transform (BWT) consists of $r$ runs in $O \left( r \log (n / r) …

Computing the LCP Array of a Labeled Graph

Jarno Alanko, Davide Cenzato, Nicola Cotumaccio, Sunghwan Kim, Giovanni Manzini, et al. · 2024

The LCP array is an important tool in stringology, allowing to speed up pattern matching algorithms and enabling compact representations of the suffix tree. Recently, Conte et al. [DCC 2023] and Cotumaccio et al. [SPIRE 2023] extended the …

Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests

Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, et al. · 2024

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular c…

Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, et al. · 2024

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular c…

The Landscape of Compressibility Measures for Two-Dimensional Data

L. Carfagna, Giovanni Manzini · 2024

In this paper we extend to two-dimensional data two recently introduced one-dimensional compressibility measures: the $γ$ measure defined in terms of the smallest string attractor, and the $δ$ measure defined in terms of the number of distinct s…

A new class of string transformations for compressed text indexing

Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino · 2023

The landscape of compressibility measures for two-dimensional data

L. Carfagna, Giovanni Manzini · 2023

In this paper we extend to two-dimensional data two recently introduced one-dimensional compressibility measures: the $γ$ measure defined in terms of the smallest string attractor, and the $δ$ measure defined in terms of the number of dist…

Computing matching statistics on Wheeler DFAs

Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, et al. · 2023

Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for …

Practical Random Access to SLP-Compressed Texts

Travis Gagie, I Tomohiro, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, et al. · 2022

Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as …

Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform

Travis Gagie, Giovanni Manzini, Marinella Sciortino · 2022

The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty …

Improving matrix-vector multiplication via lossless grammar-compressed matrices

Paolo Ferragina, Giovanni Manzini, Travis Gagie, Dominik Köppl, Gonzalo Navarro, et al. · 2022

As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compr…

A New Class of String Transformations for Compressed Text Indexing

Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino · 2022

Introduced about thirty years ago in the field of Data Compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides being a booster of the performance of memoryless compressors, plays a fundamental role in the …

Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices

Paolo Ferragina, Travis Gagie, Dominik Köppl, Giovanni Manzini, Gonzalo Navarro, et al. · 2022

As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compr…

Space Efficient Merging of de Bruijn Graphs and Wheeler Graphs

Lavinia Egidi, Felipe A. Louza, Giovanni Manzini · 2022

The merging of succinct data structures is a well established technique for the space efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de Br…

Compressing and Querying Integer Dictionaries Under Linearities and Repetitions

Paolo Ferragina, Giovanni Manzini, Giorgio Vinciguerra · 2022

We revisit the fundamental problem of compressing an integer dictionary that supports efficient rank and select operations by exploiting simultaneously two kinds of regularities arising in real data: repetitiveness and approximate linearit…

Space Efficient Merging of de Bruijn Graphs and Wheeler Graphs

Lavinia Egidi, Felipe A. Louza, Giovanni Manzini · 2021

PHONI: Streamed Matching Statistics with Multi-Genome References

Christina Boucher, Travis Gagie, I Tomohiro, Dominik Köppl, Ben Langmead, et al. · 2021

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this cas…

Efficiently Merging r-indexes

Marco Antônio Oliva, Massimiliano Rossi, Jouni Sirén, Giovanni Manzini, Tamer Kahveci, et al. · 2021

Large sequencing projects, such as GenomeTrakr and MetaSub, are updated frequently (sometimes daily, in the case of GenomeTrakr) with new data. Therefore, it is imperative that any data structure indexing such data supports efficient updat…

Repetition- and Linearity-Aware Rank/Select Dictionaries

Paolo Ferragina, Giovanni Manzini, Giorgio Vinciguerra · 2021

We revisit the fundamental problem of compressing an integer dictionary that supports efficient rank and select operations by exploiting two kinds of regularities arising in real data: repetitiveness and approximate linearity. Our first co…

PFP Compressed Suffix Trees

Christina Boucher, Ondřej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, et al. · 2021

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a p…

Compressing and indexing aligned readsets

Travis Gagie, Garance Gourdel, Giovanni Manzini · 2021

Compressed full-text indexes are one of the main success stories of bioinformatics data structures but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same ind…

PFP Data Structures

Christina Boucher, Ondrej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, et al. · 2020

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ o…

Giovanni Manzini