Compressing BAM files

How good is Genozip at compressing BAM files?

Relative sizes of .bam files, before and after compression with genozip.

--reference was used for compressing all files; --best was not used. Datasets can be found here.

Compressing a BAM, SAM or CRAM file

In the rest of this page we give examples using BAM files. Genozip can also compress CRAM and SAM files. To compress CRAM, samtools must be installed on your system.

Compressing and uncompressing BAM is straightforward:

$ genozip myfile.bam

$ ls -lh myfile.bam*

-rw-rw-r--+ 1 divon divon 56G Apr 10 2022 myfile.bam
-rw-rw-r--+ 1 divon divon 16G Aug 1 18:44 myfile.bam.genozip

Compressing using a reference file

Better (sometimes significantly so) compression can be achieved by providing a reference file.

$ genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.fa.gz myfile.bam

$ ls -lh myfile.bam*

-rw-rw-r--+ 1 divon divon 56G Apr 10 2022 myfile.bam
-rw-rw-r--+ 1 divon divon 15G Aug 1 19:01 myfile.bam.genozip

Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file which is placed in the same directory as the FASTA. If you already have a .ref.genozip file - you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz

or alternatively:

$ genozip --make-reference ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This obviously makes the compressed file larger.

Note: It is best if this reference is the one with which the BAM file was aligned, but if this original reference is not available, a "similar enough" reference file might also be very beneficial (by similar we mean a reference file that contains at least the contigs that make up the bulk of the alignments in the BAM file).

Co-compressing BAM and FASTQ files (Genozip Deep™)

Genozip supports co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately:

The datasets can be found here.

Note: The Deep method requires inclusion of all FASTQs that contributed reads to the BAM.
Note: There is a lot that can change in the data between a FASTQ file and the corresponding BAM file: sequences may have been trimmed, BQSR may have been applied to quality scores, read names may have changed, reads may have been filtered out, or duplicates may have been collapsed to consensus sequences, to name a few. Deep will nevertheless still work, albeit a bit less efficiently, if such changes have occurred.
Note: The Deep method typically consumes 20%-50% less CPU compared to compressing the BAM and FASTQ files separately. However, it consumes significant RAM - the exact amount of RAM required varies, but as a rule of thumb it is about as much as the size of the BAM file. The RAM consumption is mostly during uncompressing.

Example:

$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fastq.gz myfile-R2.fastq.gz

$ ls -l

-rw-------+ 1 57G May 4 05:23 myfile.bam
-rw-------+ 1 31G May 3 23:28 myfile-R1.fastq.gz
-rw-------+ 1 35G May 3 23:31 myfile-R2.fastq.gz

-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip
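As a rough check of the listing above, the following Python sketch adds up the three input files and compares the total to the Deep file. The sizes are the rounded values printed by ls, so the ratio is only approximate, and the variable names are purely illustrative.

```python
# Sizes in GB, copied from the rounded `ls` listing above.
input_sizes_gb = {
    "myfile.bam": 57,
    "myfile-R1.fastq.gz": 31,
    "myfile-R2.fastq.gz": 35,
}
deep_size_gb = 16  # myfile.deep.genozip

total_gb = sum(input_sizes_gb.values())
ratio = total_gb / deep_size_gb

print(f"{total_gb} GB of input -> {deep_size_gb} GB ({ratio:.1f}x smaller)")
```

Note that the inputs are themselves already compressed (BGZF for the BAM, gzip for the FASTQs), so this ratio is relative to the compressed originals.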

Reducing RAM consumption: Uncompression of a .deep.genozip file is very RAM hungry. Usually, most of the RAM consumption is due to the matching of the QUAL data between the BAM and FASTQ files. Getting rid of the QUAL redundancy between BAM and FASTQ also usually accounts for the majority of the benefit of Deep. Using Deep without matching QUAL drastically reduces the RAM consumption, with the resulting file size falling somewhere between compressing the FASTQ and BAM files separately and full Deep.

Example:

$ genozip --deep=no-qual --reference hs37d5.fa.gz myfile.bam myfile-R1.fq.gz myfile-R2.fq.gz
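The rule of thumb above (RAM roughly equal to the BAM file size, mostly needed when uncompressing) can be turned into a simple pre-flight check. The Python sketch below is only an illustration of that heuristic: the function name and the decision rule are assumptions made here, not something Genozip does itself, and --deep=no-qual may still need more RAM than is available.

```python
GiB = 1024 ** 3


def choose_deep_option(bam_size_bytes: int, available_ram_bytes: int) -> str:
    """Pick a Deep option using the 'RAM is about the BAM size' rule of thumb.

    available_ram_bytes should describe the machine that will later
    uncompress the file, since that is where most of the RAM is needed.
    """
    if available_ram_bytes >= bam_size_bytes:
        return "--deep"
    # Hypothetical fallback: skip QUAL matching to reduce RAM consumption.
    return "--deep=no-qual"


# A 57 GiB BAM on a machine with 32 GiB of RAM:
print(choose_deep_option(57 * GiB, 32 * GiB))
```

In practice bam_size_bytes could come from os.path.getsize("myfile.bam").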

--optimize : even better compression, with some caveats.

While Genozip is primarily a lossless compressor, the --optimize option is provided to allow lossy compression as well. The idea of lossy compression is this: it is common for data to contain information in a higher resolution than we actually need for downstream analysis. Examples include the resolution of base quality scores and the number of decimal digits in fractions. If we could reduce the resolution, we could gain significantly better compression. The compression-enhancing modifications to SAM/BAM/CRAM data which genozip performs when --optimize is specified are described below. They are designed to have negligible impact on downstream analysis in many common cases; however, you should definitely validate this for your own specific use case.

Base quality scores binning: base quality scores ∈[0,93] (which appear in textual SAM files as the ASCII characters '!' through '~') are binned, loosely following Illumina's method. Binning is applied to the main QUAL field, as well as to the following tags which also contain base quality scores: QT:Z CY:Z BZ:Z and in the case of files generated by 10xGenomics or STARsolo, also to the following tags: UY:Z QX:Z sQ:Z 2Y:Z fq:Z. Note that it is not applied to OQ:Z.

Old values    New value
0-2           unchanged
3-9           6
10-19         15
20-24         22
25-29         27
...
85-89         87
90-92         91
93            93

Note: binning is not applied if the main QUAL field is already binned to 8 or fewer values.
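The binning table can be sketched in Python as follows. This is not Genozip's implementation. It assumes that values 10 through 19 all map to 15, and that the rows elided by "..." continue the visible pattern of 5-wide bins mapped to their lower bound plus 2. The already_binned helper is likewise a guess at how "already binned to 8 or fewer values" might be checked. All function names are made up for this illustration.

```python
def bin_quality(q: int) -> int:
    """Map one base quality score (0-93) to its binned value."""
    if not 0 <= q <= 93:
        raise ValueError(f"quality score out of range: {q}")
    if q <= 2:
        return q              # 0-2 are left unchanged
    if q <= 9:
        return 6              # 3-9
    if q <= 19:
        return 15             # assumption: the whole 10-19 range maps to 15
    if q <= 89:
        return q - q % 5 + 2  # 20-24 -> 22, 25-29 -> 27, ..., 85-89 -> 87
                              # (rows between 29 and 85 are assumed)
    if q <= 92:
        return 91             # 90-92
    return 93                 # 93 is left unchanged


def bin_qual_string(qual: str) -> str:
    """Bin a textual SAM QUAL string (ASCII characters '!' through '~')."""
    return "".join(chr(bin_quality(ord(c) - 33) + 33) for c in qual)


def already_binned(qual_strings, max_values: int = 8) -> bool:
    """Assumed check: the data uses at most max_values distinct scores."""
    return len({c for qual in qual_strings for c in qual}) <= max_values


print(bin_qual_string("!+I~"))  # scores 0, 10, 40, 93
```

Because each score is mapped independently, applying the function twice gives the same result as applying it once.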

Floating-point number rounding: all floating-point tags (XX:f and XX:B:f) are rounded: for BAM and CRAM, they are rounded to 10 significant bits (this provides an accuracy of approximately 3 significant decimal digits), and for SAM, they are rounded to 3 significant digits.

Note: when displaying optimized BAM/CRAM data in textual (SAM) format, these numbers might surprisingly not look round. This is because they are displayed in base-10; rest assured that they are indeed round in base-2.
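To make the two kinds of rounding concrete, here is a minimal Python sketch. It assumes round-to-nearest (the rounding mode is not stated above) and works on Python's 64-bit floats rather than the 32-bit floats stored in BAM. The function names are illustrative and this is not Genozip's code.

```python
import math


def round_to_sig_bits(x: float, bits: int = 10) -> float:
    """Keep only the `bits` most significant binary digits of x."""
    if x == 0.0 or not math.isfinite(x):
        return x
    # x == mantissa * 2**exponent, with 0.5 <= abs(mantissa) < 1
    mantissa, exponent = math.frexp(x)
    scaled = round(mantissa * (1 << bits))  # integer with at most `bits` bits
    return math.ldexp(scaled, exponent - bits)


def round_to_sig_digits(x: float, digits: int = 3) -> float:
    """Keep only `digits` significant decimal digits of x (the SAM case)."""
    return float(f"{x:.{digits}g}")


print(round_to_sig_bits(0.1))        # 0.0999755859375
print(round_to_sig_digits(0.12345))  # 0.123
```

The first result, 0.0999755859375, illustrates the note above: it does not look round in base 10, but it equals 819/8192, a value whose mantissa fits in 10 bits.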

IonTorrent flow signal rounding: Relevant only to IonTorrent data: the ZM:B flow signal field is modified such that negative values are changed to zero and positive values are rounded to the nearest 10.

Example: -20,212,427 ➔ 0,210,430
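The same transformation as a short Python sketch. The handling of exact ties (values ending in 5) is not specified above. Rounding them up is an assumption made here, and the function name is illustrative.

```python
def round_flow_signals(zm):
    """Clamp negative flow signals to 0 and round the rest to the nearest 10."""
    # (v + 5) // 10 * 10 rounds to the nearest 10, with ties rounded up (assumed).
    return [0 if v < 0 else (v + 5) // 10 * 10 for v in zm]


print(round_flow_signals([-20, 212, 427]))  # [0, 210, 430]
```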

Note: genounzip and genozip --test verify that the data uncompresses to precisely the same data as it was after the modifications. Note that this does not test the correctness of the modifications themselves.

Note: --optimize can take an argument to fine-tune which fields get optimized, for example:

$ genozip --optimize=QUAL,rq:f     # optimizes only the QUAL and rq:f fields, if they are optimizable

$ genozip --optimize=^QUAL,rq:f    # optimizes all optimizable fields, except QUAL and rq:f
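The include/exclude semantics of the argument can be modelled with a few lines of Python. This sketch only illustrates the behaviour described above and is not Genozip's own parser. The function name and the example set of optimizable fields are assumptions.

```python
def fields_to_optimize(arg: str, optimizable: set) -> set:
    """Return the fields that would be optimized for a given --optimize argument.

    "QUAL,rq:f"   -> only the listed fields, if they are optimizable
    "^QUAL,rq:f"  -> all optimizable fields except the listed ones
    """
    exclude = arg.startswith("^")
    names = arg[1:] if exclude else arg
    listed = {name for name in names.split(",") if name}
    return optimizable - listed if exclude else optimizable & listed


# Hypothetical set of fields that are optimizable in some file:
optimizable = {"QUAL", "rq:f", "sd:f"}
print(sorted(fields_to_optimize("QUAL,rq:f", optimizable)))   # ['QUAL', 'rq:f']
print(sorted(fields_to_optimize("^QUAL,rq:f", optimizable)))  # ['sd:f']
```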

Note: To see which fields are actually optimized use --stats (you can use this with genozip, genocat or genounzip):

$ genozip --optimize --stats test.bam

BAM file: test.bam
BAM alignments: 100,000 (in Prim VBs: 50 in Depn VBs: 53)
Contexts: 55 Vblocks: 10 x 4.0 MB Sections: 658
Main VBs: 8 Prim VBs: 1 Depn VBs: 1
Sorting: Sorted by POS
Aligner: dragen
Buddying: sag_type=BY_SA mate=50% saggy_near=0% prim_far=0.02%
Read name style: MGI-R8
Programs: ID: Hash Table Build;ID: DRAGEN HW build
Fields optimized: QUAL,sd:f
Genozip version: 15.0.60 github
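If you want to pick up the list of optimized fields in a script, the "Fields optimized" line of the report can be extracted as in the Python sketch below. It assumes you have already captured the --stats report as text (how it is captured is left open) and that the line has the layout shown above. The function name is illustrative.

```python
def optimized_fields(stats_text: str) -> list:
    """Extract the field names from the 'Fields optimized:' line of a report."""
    for line in stats_text.splitlines():
        # Split on the first colon only: field names such as sd:f contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Fields optimized":
            return [name.strip() for name in value.split(",") if name.strip()]
    return []  # no such line: nothing was reported as optimized


report = """\
Read name style: MGI-R8
Fields optimized: QUAL,sd:f
Genozip version: 15.0.60 github
"""
print(optimized_fields(report))  # ['QUAL', 'sd:f']
```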


Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a BAM file which consists of MGI Tech reads aligned with Illumina DRAGEN. Note: Compression performance may vary considerably depending on the specific data - in this particular case the incremental gain of --optimize is primarily due to binning of base quality scores.

Uncompressing

Uncompress a file:

$ genounzip myfile.bam.genozip

Uncompress a file into stdout (i.e. the terminal). Note: this outputs the data in SAM (i.e. textual) format. Use --bam or --cram to output in BAM or CRAM formats respectively, or --bgzf to output gz-compressed SAM data.

$ genocat myfile.bam.genozip

Uncompress a file and also generate a BAI index file, using samtools index. samtools needs to be installed for this option to work:

$ genounzip --index myfile.bam.genozip

.bam files are compressed internally with BGZF, as are .sam.gz files. Use --bgzf to specify the re-compression level (from 0=no recompression to 5=maximum recompression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file. Note: this does not apply to CRAM files since they do not use BGZF.

$ genounzip --bgzf exact myfile.bam.genozip

$ genounzip --bgzf 6 myfile.sam.genozip

Uncompress to SAM, BAM or CRAM format by using genocat and either explicitly specifying the format with --sam, --bam or --cram, or implicitly by specifying a .cram, .bam, .sam or .sam.gz file name extension in --output.

$ genocat myfile.bam.genozip --output myfile.cram # outputs a CRAM file

Slicing & dicing your data with genocat

Here's a summary of the filtering and subsetting options available for BAM files. See genocat for more information.

Option               Effect
--downsample         Show only one in every X alignments
--regions -r         Exclude or include certain genomic regions (works even if file is not sorted)
--regions-file -R    Like --regions, but the list of regions is specified in a file
--grep               Show only alignments containing the specified string
--grep-w -g          Like --grep, but match whole words
--lines -n           Show only alignments from a given range of line numbers
--head               Show only a certain number of alignments from the start of the file
--tail               Show only a certain number of alignments from the end of the file
--FLAG               Include or exclude alignments with certain SAM FLAGs
--MAPQ               Include or exclude alignments with at least a certain MAPQ value
--bases              Filter alignments based on the IUPAC nucleotide codes in the sequence data
--no-header          Show only the alignments - exclude the SAM header
--header-only        Show only the SAM header - exclude the alignments
--qnames             Show only alignments with a QNAME specified (or not) in a comma-separated list
--qnames-file        Show only alignments with a QNAME specified (or not) in a file
--seqs-file          Show only alignments with a SEQ specified (or not) in a file

idxstats

Genozip has the ability to calculate per-contig statistics (idxstats). See idxstats.

$ genocat --idxstats myfile.bam.genozip

Per-contig coverage and depth

Genozip has the ability to calculate per-contig coverage. See Coverage and Depth.

$ genocat --coverage myfile.bam.genozip

Retrieving one of the components in a Deep file

$ genocat --fastq myfile.deep.genozip

$ genocat --R1 myfile.deep.genozip

$ genocat --R2 myfile.deep.genozip

$ genocat --R=3 myfile.deep.genozip # in case Deep file consists of more than 2 FASTQs

$ genocat --interleaved myfile.deep.genozip

$ genocat --sam myfile.deep.genozip

$ genocat --bam myfile.deep.genozip

For a full list of options, see the genozip command line reference.

Questions? support@genozip.com