How good is Genozip at compressing VCF files?
​Relative sizes of already-compressed .vcf.gz files, before and after compression with genozip.
Used --best. --reference was used for compressing the two GVCF files. Datasets can be found here.
​
Compressing and uncompressing a VCF or BCF file
​
In this page, we will give examples of VCF files. Genozip is also capable of compressing BCF files, provided that bcftools is installed on your system.
​
Compressing​
​
Compressing and uncompressing VCF is straightforward:
​
$ genozip myfile.vcf.gz
​
$ ls -lh myfile.vcf*
-rwxrwxrwx 1 divon divon 88M Aug 1 08:49 myfile.vcf.genozip
-rwxrwxrwx 1 divon divon 244M Feb 10 2020 myfile.vcf.gz
​
Uncompressing:
​​
$ genounzip myfile.vcf.genozip
​​
Viewing:
​
$ genocat myfile.vcf.genozip
​
VCF or BCF output can be selected explicitly with --vcf or --bcf respectively, or implicitly with --output and a filename ending with .vcf, .vcf.gz or .bcf. The compression level of the output VCF or BCF file can be set with --bgzf, where --bgzf=0 means "no compression at all".
​
Compressing VCF using a reference file
​​
It is possible to compress a VCF using a reference file, and in the following cases it is indeed recommended to do so, as the compression improvement is expected to be significant:
1. GVCF files
2. Illumina Genotyping VCF files
3. Ultima Genomics files
4. VCF files with little or no sample or INFO data.
​
Example:
​
$ genozip --reference hs37d5.fa.gz myfile.g.vcf.gz
​
Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):
​​
$ genozip --make-reference hs37d5.fa.gz
​​
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.
​
​
--optimize : even better compression, with some caveats.
While Genozip is primarily a lossless compressor, the --optimize option is provided to allow lossy compression as well. The idea of lossy compression is this: it is common for data to contain information at a higher resolution than is actually needed for downstream analysis. Examples include the number of decimal digits in fractions and the maximum meaningful Phred score. Reducing the resolution can yield significantly better compression. The compression-enhancing modifications to VCF data that genozip performs when --optimize is specified are described below. They are designed to have negligible impact on downstream analysis in many common cases; however, you should validate this for your own specific use case.
​​
Phred-score rounding: annotations containing Phred score values are modified such that the Phred values are rounded to the nearest integer and capped at 60. This method is applied to the following FORMAT fields: PL, SPL, PP, PRI, GQ and also to GP if it contains Phred scores.
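As an illustration only, the rounding-and-capping rule can be sketched in a few lines of Python. This is not Genozip's code: the function names are hypothetical, and the choice to round halves up is an assumption, since the text only says "nearest integer".

```python
import math

PHRED_CAP = 60  # cap stated in the text


def round_phred(value: float, cap: int = PHRED_CAP) -> int:
    """Round a Phred score to the nearest integer, then cap it.

    Assumption: ties (x.5) round up. Genozip's tie-breaking is not
    documented here and may differ.
    """
    rounded = math.floor(value + 0.5)
    return min(rounded, cap)


def optimize_phred_field(field: str) -> str:
    """Apply round_phred to each item of a comma-separated value,
    such as a PL annotation. Missing values ('.') are left untouched."""
    return ",".join(
        item if item == "." else str(round_phred(float(item)))
        for item in field.split(",")
    )


print(optimize_phred_field("0,12.4,99.7"))
```

Values that are already integers below the cap pass through unchanged, so only high-resolution or very large scores are affected.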
​
Conversion to Phred scores: the GL annotation, which contains likelihoods, is converted to the equivalent PL annotation that contains Phred scores. If the GP annotation contains probabilities, it is converted to the equivalent PP annotation that contains Phred scores. The Phred scores generated are rounded and capped as described above.
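The arithmetic behind such a conversion can be sketched as follows. This is a hedged illustration, not Genozip's implementation: it assumes GL holds log10-scaled likelihoods (as in the VCF specification), uses the plain relation Phred = -10 × log10(value) with no per-genotype normalization, and all function names are hypothetical.

```python
import math

PHRED_CAP = 60  # cap stated in the text


def gl_to_phred(gl: float) -> float:
    """GL is already a log10 likelihood, so the Phred scale is -10 * GL."""
    return -10.0 * gl


def prob_to_phred(p: float) -> float:
    """Convert a probability in [0, 1] to the Phred scale.

    Assumption: a probability of 0 maps to the cap, since log10(0)
    is undefined.
    """
    if p <= 0.0:
        return float(PHRED_CAP)
    return -10.0 * math.log10(p)


def round_and_cap(phred: float, cap: int = PHRED_CAP) -> int:
    """Round to the nearest integer (halves up, an assumption) and cap."""
    return min(math.floor(phred + 0.5), cap)


# A GL triplet converted to PL-style values
gls = [0.0, -1.23, -10.0]
print([round_and_cap(gl_to_phred(g)) for g in gls])
```

Tools that emit PL often normalize it so the most likely genotype scores 0; whether Genozip does so is not stated in the text, so the sketch leaves values un-normalized.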
​
INFO sorting: The annotations in the INFO field are re-ordered (but not otherwise modified) so that they appear in the same consistent order across all variants.
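A minimal sketch of the idea, assuming alphabetical order by annotation name. The text does not say which order Genozip actually uses, so the sort key and the function name are assumptions made for illustration.

```python
def sort_info(info: str) -> str:
    """Re-order the ';'-separated INFO annotations without changing them.

    Assumption: the consistent order is alphabetical by key. Flags such
    as 'DB' have no '=' and are sorted by their bare name.
    """
    if info == ".":  # missing INFO field: nothing to sort
        return info
    entries = info.split(";")
    entries.sort(key=lambda entry: entry.split("=", 1)[0])
    return ";".join(entries)


# Two variants carrying the same annotations in different orders
print(sort_info("DP=14;AF=0.5;DB;AC=1"))
print(sort_info("AF=0.5;DP=14;AC=1;DB"))
```

Both calls produce the same string, which is the point: once every variant lists its annotations in the same order, the sequence of names becomes highly repetitive and therefore cheap to encode.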
​
Floating-point number rounding: numerical INFO and FORMAT annotations containing fractions (also known as floating-point numbers), if not already optimized by one of the methods above, are rounded to 3 significant digits.
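Rounding to significant digits differs from rounding to a fixed number of decimal places, as the following sketch shows. It is an illustration under the assumption of ordinary decimal rounding; the function name is hypothetical and Genozip's exact handling of edge cases may differ.

```python
import math


def round_sig(value: float, digits: int = 3) -> float:
    """Round value to the given number of significant digits."""
    if value == 0.0:
        return 0.0
    # Position of the leading digit: 0 for 1.x, 3 for 1234.x, -4 for 0.0001x
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


for x in (0.123456, 1234.5678, 0.000123456, -2.71828):
    print(x, "->", round_sig(x))
```

Small values keep their leading digits (0.000123456 becomes 0.000123), whereas rounding to three decimal places would have reduced it to 0.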
​
QUAL field: Usually, the QUAL field is rounded as described in "Floating-point number rounding". The exception is single-sample DRAGEN-produced files with a FORMAT/GP annotation, where it is rounded as described in "Phred-score rounding".
​
Note: genounzip and genozip --test verify that the data uncompresses to precisely the same data as it was after the modifications. Note that this does not test the correctness of the modifications themselves.
​
Note: --optimize can take an argument to fine-tune which fields get optimized, for example:
​
$ genozip --optimize=VQSLOD,SOR - optimizes only the VQSLOD and SOR fields, if they are optimizable
​​​
$ genozip --optimize=^VQSLOD,SOR - optimizes all optimizable fields, except VQSLOD and SOR.
​​​
Note: To see which fields are actually optimized use --stats (you can use this with genozip, genocat or genounzip):
​​​
$ genozip --optimize --stats test.vcf
VCF file: test.vcf
Samples: 1 variants: 149,870 Contexts: 70 Vblocks: 11 x 4.0 MB Sections: 938
Programs: DRAGEN
Fields optimized: QUAL,INFO,AF,GP,GQ,PL,PRI,AF,MQ,VQSLOD,R2_5P_bias,QD,FS,SOR,MQRankSum,ReadPosRankSum
Genozip version: 15.0.60 github
Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a GVCF file produced by Illumina DRAGEN obtained from here. Note: Compression performance may vary considerably depending on the specific data.