Advanced Features
Summary
​​
Our Advanced Features consist of Deep™, Pair, BAM-assist and Optimize.
They are available to users of Genozip Enterprise, Genozip Premium, Genozip Researcher and Genozip Biobank.
​​
​
DEEP: co-compression of FASTQ and BAM
​
Genozip Deep™ is a patent-pending method for co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately, as can be appreciated from the benchmark above.
The exact compression ratios achieved are very much dependent on the specifics of the data being compressed, but it is not unusual to achieve file size reductions of 85%-90% with this method.
PAIR: co-compression of paired-end FASTQ files
The left bar shows sizes of each .fastq.genozip file when compressed separately, relative
to the combined size of the .fastq.gz files, and the right bar shows the relative size of the
.fastq.genozip file when co-compressed together using --pair.
​
With the --pair option, Genozip exploits redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together. Typically, this results in shrinking the compressed file by an additional 10-15%.
​
By default, decompression recovers the original two FASTQs; however, it is also possible to output them together in interleaved format.
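For readers unfamiliar with the interleaved layout, the following Python sketch shows what it means: records from the two files alternate, R1 then R2, pair by pair. It illustrates the format only and is not Genozip's implementation. The helper names read_fastq_records and interleave are hypothetical, and simple four-line FASTQ records are assumed.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into 4-line FASTQ records (assumes unwrapped sequences)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_lines: Iterable[str], r2_lines: Iterable[str]) -> List[str]:
    """Alternate records: R1 of pair 1, R2 of pair 1, R1 of pair 2, and so on."""
    r1 = list(read_fastq_records(r1_lines))
    r2 = list(read_fastq_records(r2_lines))
    if len(r1) != len(r2):
        raise ValueError("R1 and R2 contain a different number of reads")
    out: List[str] = []
    for first, second in zip(r1, r2):
        out.extend(first)
        out.extend(second)
    return out


# Two tiny, made-up paired files, each holding two reads.
r1 = ["@read1/1", "ACGT", "+", "IIII", "@read2/1", "GGCC", "+", "IIII"]
r2 = ["@read1/2", "TTAA", "+", "FFFF", "@read2/2", "CCGG", "+", "FFFF"]
merged = interleave(r1, r2)
print("\n".join(merged))
```

In the merged output, each read from the first file is immediately followed by its mate from the second file.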
BAMASS: BAM-assisted compression of FASTQ files
The effect of using the --bamass option on CPU consumption: the blue bars show the CPU consumption of compressing FASTQ files
with Genozip (taken as 100%), and the orange bars show the relative CPU consumption when compressing the same files with
Genozip using the --bamass option, showing a reduction of 40-60% in CPU time.
​
Not shown here: In addition to reduction in CPU time, the compression was better as well: the compressed file size was smaller
by 14% in the case of Illumina WGS and 4% in the case of MGI WGS.
​
Background: When Genozip compresses a FASTQ file, it uses its internal Approximate Aligner™ to find a region in the reference genome that is similar enough to the read at hand, so that the coordinates in the reference genome, plus a record of the discrepancies between that region of the reference and the actual read, can describe the read more parsimoniously than the read itself, and hence achieve compression. Unlike regular FASTQ-to-BAM aligners, which try to find the true location on the DNA molecule from which the read originated, Genozip's Approximate Aligner™ only needs to find some location on the reference that is similar enough to the read. This relaxed objective makes the Approximate Aligner™ extremely fast. However, its work still accounts for about half the CPU usage when compressing a FASTQ file.
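The idea of describing a read as "a position in the reference plus its discrepancies" can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Approximate Aligner™ itself: it uses a brute-force scan, handles substitutions only (no insertions or deletions), and the names encode_read, decode_read and max_mismatches are hypothetical.

```python
from typing import List, Optional, Tuple

Mismatch = Tuple[int, str]  # (offset within the read, base found in the read)


def encode_read(
    reference: str, read: str, max_mismatches: int = 2
) -> Optional[Tuple[int, List[Mismatch]]]:
    """Return (position, mismatches) for the first reference window that is
    similar enough to the read, or None if no window qualifies."""
    n = len(read)
    for pos in range(len(reference) - n + 1):
        window = reference[pos:pos + n]
        diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
        if len(diffs) <= max_mismatches:
            return pos, diffs
    return None


def decode_read(reference: str, pos: int, length: int, diffs: List[Mismatch]) -> str:
    """Rebuild the read from the reference window and the recorded mismatches."""
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Made-up reference and a read that differs from it in one base.
reference = "ACGTACGTTAGCCGATAGGCTTAACGGA"
read = "TAGCCGTTAGG"

encoded = encode_read(reference, read)
print(encoded)  # (8, [(6, 'T')])

pos, diffs = encoded
assert decode_read(reference, pos, len(read), diffs) == read
```

Here an 11-base read is reduced to one coordinate and one mismatch, and any sufficiently similar location would do, which is the relaxed objective described above.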
​
In some use cases, both the FASTQ files and the corresponding BAM file are available, but only the FASTQ files are destined for long-term storage and hence require compression, while the BAM file will be discarded. In these cases, Genozip can use the patent-pending BAM-Assisted Compression of FASTQ method to salvage the alignment information contained in the BAM file and reduce the need to use the Approximate Aligner™, typically cutting compression CPU time by 40-60%. In some cases, this may also improve the compression ratio by up to 15% (this is observed in Illumina data with binned quality scores as well as in Ultima Genomics data).
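Why a known alignment saves CPU time can be illustrated with a small Python sketch: if a position for the read is already available, a single comparison replaces a search over the reference. This is a conceptual toy, not Genozip's method. The function locate, the bam_positions dictionary (read name to 0-based position) and the brute-force fallback are all assumptions made for illustration, and real alignments with indels or clipping are ignored.

```python
from typing import Dict, Optional, Tuple


def count_mismatches(reference: str, pos: int, read: str) -> int:
    """Number of substituted bases between the read and the window at pos."""
    window = reference[pos:pos + len(read)]
    return sum(a != b for a, b in zip(window, read))


def locate(
    name: str,
    read: str,
    reference: str,
    bam_positions: Dict[str, int],
    max_mismatches: int = 2,
) -> Tuple[Optional[int], int]:
    """Return (position or None, number of windows compared).

    Tries the position recorded for this read first (hypothetical hint taken
    from an alignment file), and only scans the reference if that fails.
    """
    tried = 0
    last_start = len(reference) - len(read)
    hint = bam_positions.get(name)
    if hint is not None and 0 <= hint <= last_start:
        tried += 1
        if count_mismatches(reference, hint, read) <= max_mismatches:
            return hint, tried
    for pos in range(last_start + 1):
        tried += 1
        if count_mismatches(reference, pos, read) <= max_mismatches:
            return pos, tried
    return None, tried


# Made-up reference, read and alignment hint.
reference = "ACGTACGTTAGCCGATAGGCTTAACGGA"
read = "TAGCCGTTAGG"

with_hint = locate("read1", read, reference, {"read1": 8})
without_hint = locate("read1", read, reference, {})
print(with_hint, without_hint)
```

Both calls find the same position, but the hinted call compares one window while the unhinted call compares every window up to the match.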
​
OPTIMIZE: even better compression, with some caveats.
While Genozip is primarily a lossless compressor, the --optimize option is where we venture into "lossy compression" as well. The idea of lossy compression is this: it is common for data to contain information at a higher resolution than we actually need for downstream analysis. Examples include the resolution of base quality scores and the number of decimal digits in fractions. If we reduce the resolution, we can gain significantly better compression. The compression-enhancing modifications that Genozip performs when --optimize is used are designed to have negligible impact on downstream analysis in many common cases; however, you should validate this for your own data.
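To make the resolution-versus-size trade-off concrete, here is a small Python sketch that bins Phred+33 base quality scores into a few representative values and compares the compressed sizes using zlib. It is an illustration only: the bin boundaries and representative scores are hypothetical and are not the modifications Genozip applies, zlib stands in for a real compressor, and the quality string is random synthetic data.

```python
import random
import zlib

# Hypothetical bins: (lowest score, highest score, representative score).
BINS = ((0, 9, 6), (10, 19, 15), (20, 29, 25), (30, 93, 37))


def bin_quality(qual: str, bins=BINS) -> str:
    """Replace each Phred+33 quality character by its bin's representative."""
    out = []
    for ch in qual:
        q = ord(ch) - 33
        for lo, hi, rep in bins:
            if lo <= q <= hi:
                q = rep
                break
        out.append(chr(q + 33))
    return "".join(out)


# Synthetic quality string with scores spread uniformly between 2 and 40.
rng = random.Random(0)
qualities = "".join(chr(33 + rng.randint(2, 40)) for _ in range(20000))
binned = bin_quality(qualities)

raw_size = len(zlib.compress(qualities.encode("ascii"), 9))
binned_size = len(zlib.compress(binned.encode("ascii"), 9))
print(f"original: {raw_size} bytes, binned: {binned_size} bytes")
```

The binned string uses at most four distinct symbols, so it compresses to fewer bytes, at the cost of no longer being able to recover the original scores. Real quality strings are far less random than this synthetic one, so the size of the gain on actual data will differ.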
​
The additional savings with the Optimize method depend heavily on the details of the specific file being compressed. If the file is already highly optimized, then --optimize might have limited effect. For other files, it might halve the file size or better, compared to genozip without --optimize.
​
Those who resent the Z in the word "optimize" will be glad to know that --optimise works as well 😊.
​
Details of the specific modifications can be found here: FASTQ SAM/BAM/CRAM VCF
Optimize for FASTQ
Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a FASTQ file produced by an MGI Tech sequencer, obtained from here.
Optimize for BAM
Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a BAM file which consists of MGI Tech reads aligned with Illumina DRAGEN.
Optimize for VCF
Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a GVCF file produced by Illumina DRAGEN obtained from here.
Questions? support@genozip.com