
Benchmarks

Below are some benchmarks measuring Genozip’s performance on a variety of file types.

For additional, peer-reviewed benchmarks, see Publications.

FASTQ
| Title | Sequencer | Data | .fq.gz size | Genozip size | Size reduction |
| --- | --- | --- | --- | --- | --- |
| F1 | Illumina NovaSeq | Human 30x WGS (R1+R2) | 61.2 GB | 13.1 GB | -79% |
| F3 | MGI Tech DNBSEQ-G400 | Human WGS (R1+R2) | 99.9 GB | 48.8 GB | -50% |

Notes:

1. Size reductions are measured against files that are already compressed with .gz, not against uncompressed data.

2. genozip options used: --best. For F1 and F3, --reference and --pair were used as well. For F2, genozip was used with the --multiseq option.
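As a rough illustration of how the options in note 2 might be combined, the Python sketch below only assembles argument lists and never runs genozip. The helper `fastq_command` and the file names (`sample_R1.fq.gz`, `sample_R2.fq.gz`, `reference.fa.gz`) are hypothetical, and the order of the arguments is an assumption; the genozip manual is the authority on the exact invocation.

```python
from typing import List, Optional


def fastq_command(inputs: List[str],
                  reference: Optional[str] = None,
                  paired: bool = False,
                  multiseq: bool = False) -> List[str]:
    """Assemble (but do not run) a genozip argument list.

    Only the options named in the notes are used: --best, --reference,
    --pair and --multiseq. Argument order is an assumption.
    """
    cmd = ["genozip", "--best"]
    if reference is not None:
        cmd += ["--reference", reference]
    if paired:
        cmd.append("--pair")
    if multiseq:
        cmd.append("--multiseq")
    return cmd + list(inputs)


# Paired-end case, as described for F1 and F3 (file names are placeholders).
paired_cmd = fastq_command(
    ["sample_R1.fq.gz", "sample_R2.fq.gz"],
    reference="reference.fa.gz",
    paired=True,
)
print(" ".join(paired_cmd))
```

Building the command as a list rather than a single string keeps file names with spaces intact if the list is later passed to `subprocess.run`.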

BAM
| Title | Name | Feature | Sequencer | Aligner | BAM size (bytes) | Genozip size (bytes) | Size reduction |
| --- | --- | --- | --- | --- | --- | --- | --- |
| T1 | NA12878.final.cram | WGS - NovaSeq | Illumina NovaSeq | bwa | 15,797,182,294 | 9,015,105,386 | -43% |
| T2 | NA12878_S1.bam | WGS - HiSeq 2000 | Illumina HiSeq 2000 | bwa | 121,691,186,161 | 52,005,672,886 | -57% |
| T3 | NA12878.pacbio.bwa-sw.20140202.bam | WGS - PacBio CLR | PacBio | bwa sw | 57,785,894,776 | 25,442,443,993 | -56% |
| T4 | ENCFF047UEJ.bam | RNA-seq - transcriptome alignments (STAR) | Illumina HiSeq 2500 | STAR | 2,065,298,931 | 324,674,984 | -84% |
| T5 | ENCFF575KZB.bam | RNA-seq - genome alignments (STAR) | Illumina HiSeq 2500 | STAR | 2,322,825,802 | 428,436,333 | -82% |
| T6 | ENCFF900XHI.bam | Long read RNA-seq | PacBio Sequel II | minimap2 | 1,308,956,918 | 82,497,181 | -94% |
| T7 | sorted_final_merged.bam | WGS (GIAB) | PacBio | blasr | 146,870,854,017 | 31,796,498,533 | -78% |
| T8 | ENCFF069XDC.bam | DNase-Seq | Illumina HiSeq 4000 | bwa sampe | 6,668,114,400 | 1,598,034,373 | -76% |
| T9 | ENCFF669HBS.bam | STARR-seq | Illumina NovaSeq 6000 | bowtie2 | 5,322,159,747 | 1,242,102,254 | -77% |
| T10 | ENCFF046VPK.bam | scRNA-seq | Illumina NextSeq 2000 | STARsolo | 1,306,491,967 | 277,976,393 | -79% |
| T11 | ENCFF460RWK.bam | totalRNA-seq | Illumina HiSeq 2500 | STAR | 4,171,989,444 | 1,473,370,983 | -65% |
| T12 | hgmm_10k_v3_possorted_genome_bam.bam | 1:1 Mixture of Fresh Frozen Human and Mouse Cells | Illumina NovaSeq | STAR + cellranger | 53,694,972,688 | 15,552,234,159 | -71% |
| T13 | ENCFF283TLK.bam | WGS - Nanopore MinION | Oxford Nanopore MinION | ngmlr | 222,354,403,768 | 107,386,919,618 | -52% |
| T14 | ENCFF786GJA.bam | WGBS paired-end (Methylation) | Illumina HiSeq X Ten | Bismark | 198,026,253,992 | 27,448,746,040 | -86% |

Source: Lan, D., et al. (2022) Genozip 14 - advances in compression of BAM and CRAM files (preprint)

Notes:

1. genozip options used: --best, plus --reference with the appropriate reference file (except for T4 and T12).

2. T1 is a CRAM file, so its size reduction is shown relative to the CRAM size, not a BAM size. Genozip compresses a CRAM file by first converting it to BAM.
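The Size reduction values appear to be one minus the ratio of the Genozip size to the original size, rounded to a whole percent. The Python sketch below checks this against the byte counts of two rows from the table (T4 and T6). The function name `size_reduction` is ours, and the rounding rule is an assumption rather than something the source states.

```python
def size_reduction(original_bytes: int, genozip_bytes: int) -> int:
    """Return the size change as a whole-number percentage.

    A negative value means the Genozip file is smaller than the original.
    Rounding to the nearest whole percent is an assumption.
    """
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return -round(100 * (1 - genozip_bytes / original_bytes))


# Byte counts copied from rows T4 and T6 of the table above.
t4 = size_reduction(2_065_298_931, 324_674_984)
t6 = size_reduction(1_308_956_918, 82_497_181)
print(f"T4: {t4}%")  # T4: -84%
print(f"T6: {t6}%")  # T6: -94%
```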

VCF
| Title | Data | .vcf.gz size | .genozip size | Size reduction |
| --- | --- | --- | --- | --- |
| V1 | 3202 human samples from "1000 Genomes Project" | 27 GB | 7.8 GB | -71% |
| V2 | 1135 plant samples from "1001 Genomes - Arabidopsis Thaliana" | 132.4 GB | 31.6 GB | -76% |
| V3 | 3K Rice samples from "3K Rice Genome" | 1.9 GB | 315.5 MB | -84% |
| V4 | GVCF (single sample, human) | 12.6 GB | 763.5 MB | -94% |
| V5 | Illumina Genotyping | 35 MB | 9.6 MB | -73% |

Notes:

1. Size reductions are measured against files that are already compressed with .gz, not against uncompressed data.

2. genozip options used: --best. For V4 and V5, --reference was used as well.

Other data types

| Title | Data | Type | .gz size | .genozip size | Size reduction |
| --- | --- | --- | --- | --- | --- |
| A1 | Covid-19 Multi-FASTA from coronavirus.innar.com | FASTA | 254.9 MB | 1.5 MB | -99% |
| G1 | Gene annotation from the Telomere-to-telomere consortium | GFF3 | 91.8 MB | 32.3 MB | -65% |
| G2 | Homo_sapiens.GRCh38.108.gtf.gz from Ensembl | GTF | 51.6 MB | 11.3 MB | -78% |
| L1 | Illumina s.locs file | LOCS (Illumina) | 12 MB | 5.9 KB | -99.95% |
| L2 | Illumina s_X_XXXX.locs file from BaseSpace Demo Data | LOCS (Illumina) | 4.4 MB | 3.3 MB | -25% |
| M1 | Consumer DNA test "raw data" | 23andMe | 5.1 MB | 2.9 MB | -43% |

Notes:

1. Size reductions are measured against files that are already compressed with .gz, except for M1, which is a file compressed with .zip.

2. genozip options used: --best. For A1, genozip was used with the --multiseq option.
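Several rows in this section mix units (for example, MB against KB), which makes the reduction harder to check by eye. The Python sketch below parses the human-readable sizes and recomputes the reduction. The helpers `parse_size` and `reduction_percent` are hypothetical, and treating KB, MB and GB as decimal multiples (10^3, 10^6, 10^9) is an assumption, since the source does not say whether decimal or binary units were used.

```python
import re

# Assumption: decimal units. The source does not specify decimal vs binary.
UNITS = {"KB": 1e3, "MB": 1e6, "GB": 1e9}


def parse_size(text: str) -> float:
    """Convert a string such as '5.9 KB' or '12MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def reduction_percent(original: str, compressed: str) -> float:
    """Return the percentage by which the compressed file is smaller."""
    return 100 * (1 - parse_size(compressed) / parse_size(original))


# Sizes as listed for rows L1 and L2 of the table above.
print(f"L1: -{reduction_percent('12 MB', '5.9 KB'):.2f}%")
print(f"L2: -{reduction_percent('4.4 MB', '3.3 MB'):.0f}%")
```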
