Compressing VCF files

How good is Genozip at compressing VCF files?

Relative sizes of already-compressed .vcf.gz files, before and after compression with genozip.

All files were compressed with --best. In addition, --reference was used for compressing the two GVCF files. Datasets can be found here.

Compressing and uncompressing a VCF or BCF file

The examples on this page use VCF files. Genozip can also compress BCF files, provided that bcftools is installed on your system.

Compressing

Compressing and uncompressing VCF is straightforward:

$ genozip myfile.vcf.gz

$ ls -lh myfile.vcf*

-rwxrwxrwx 1 divon divon 88M Aug 1 08:49 myfile.vcf.genozip
-rwxrwxrwx 1 divon divon 244M Feb 10 2020 myfile.vcf.gz

Uncompressing:

$ genounzip myfile.vcf.genozip

Viewing:

$ genocat myfile.vcf.genozip

VCF or BCF output can be selected explicitly with --vcf or --bcf respectively, or implicitly with --output and a filename ending in .vcf, .vcf.gz or .bcf. The compression level of the output VCF or BCF file can be set with --bgzf, where --bgzf=0 means "no compression at all".

Compressing VCF using a reference file

It is possible to compress a VCF using a reference file, and in the following cases it is indeed recommended to do so, as the compression improvement is expected to be significant:

1. GVCF files

2. Structural variants files

3. Illumina Genotyping VCF files

4. Ultima Genomics files

5. VCF files with little or no sample or INFO data.

Example:

$ genozip --reference hs37d5.fa.gz myfile.g.vcf.gz

Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz

or alternatively:

$ genozip --make-reference ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.
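As a minimal sketch of the environment-variable route described in the note above: the directory /data/refs and the file name myfile.g.vcf.genozip are assumptions made for illustration (the latter follows the naming pattern of the earlier examples). The command is guarded so that it only runs where genounzip and the file are actually present.

```shell
# Hypothetical directory that holds the .ref.genozip file; adjust to your setup.
export GENOZIP_REFERENCE="/data/refs"

# Decompress only if genounzip is installed and the (assumed) file exists.
if command -v genounzip >/dev/null 2>&1 && [ -f myfile.g.vcf.genozip ]; then
    genounzip myfile.g.vcf.genozip
fi
```

Setting the variable once per shell session avoids repeating --reference on every genounzip call.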

--optimize : even better compression, with some caveats.

While Genozip is primarily a lossless compressor, the --optimize option is provided to allow lossy compression as well. The idea of lossy compression is this: it is common for data to contain information in a higher resolution than we actually need for downstream analysis. Examples include the number of decimal digits in fractions and the maximum meaningful Phred score. If we could reduce the resolution, we could gain significantly better compression. The compression-enhancing modifications to VCF data which genozip performs when --optimize is specified are described below. They are designed to have negligible impact on downstream analysis in many common cases; however, you should definitely validate this for your own specific use case.

Phred-score rounding: annotations containing Phred score values are modified such that the Phred values are rounded to the nearest integer and capped at 60. This method is applied to the following FORMAT fields: PL, SPL, PP, PRI, GQ and also to GP if it contains Phred scores.
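To make the effect of this rule concrete, here is a small awk sketch that rounds and caps a comma-separated list of Phred values. It is an illustration only, not Genozip's code: the function name phred_round_cap is made up, missing values (".") are assumed to pass through unchanged, and exact .5 ties are rounded up here, which may differ from what Genozip does.

```shell
# Illustration only: round each Phred value to the nearest integer and cap it at 60.
# Argument: a comma-separated list, such as a FORMAT/PL value.
phred_round_cap() {
    printf '%s\n' "$1" | awk -F, -v cap=60 '{
        out = ""
        for (i = 1; i <= NF; i++) {
            if ($i == ".") {
                v = "."                 # assumption: missing values are kept as-is
            } else {
                v = int($i + 0.5)       # nearest integer for non-negative input
                if (v > cap) v = cap    # cap at 60
            }
            out = out (i > 1 ? "," : "") v
        }
        print out
    }'
}

phred_round_cap "0,33.4,487.6"   # prints 0,33,60
```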

Conversion to Phred scores: the GL annotation, which contains likelihoods, is converted to the equivalent PL annotation that contains Phred scores. If the GP annotation contains probabilities, it is converted to the equivalent PP annotation that contains Phred scores. The Phred scores generated are rounded and capped as described above.
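The GL to PL conversion can be sketched as follows, assuming the usual VCF convention that GL holds log10-scaled likelihoods and that the Phred-scaled value is -10 times GL. The function name gl_to_pl is hypothetical, and whether Genozip additionally normalizes the values (for example so that the smallest PL is 0) is not stated here, so this sketch does not normalize.

```shell
# Illustration only: convert log10 likelihoods (GL) to Phred scores (PL),
# then round to the nearest integer and cap at 60 as described above.
gl_to_pl() {
    printf '%s\n' "$1" | awk -F, -v cap=60 '{
        out = ""
        for (i = 1; i <= NF; i++) {
            pl = int(-10 * $i + 0.5)    # Phred scale: -10 * log10(likelihood)
            if (pl > cap) pl = cap
            out = out (i > 1 ? "," : "") pl
        }
        print out
    }'
}

gl_to_pl "-0.02,-1.33,-8.7"   # prints 0,13,60
```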

INFO sorting: The annotations in the INFO field are re-ordered (but not otherwise modified) so that they appear in the same consistent order across all variants.
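The following sketch shows what a consistent INFO order means in practice. It assumes alphabetical order by key purely for illustration; the order Genozip actually uses is not specified here, and sort_info is a made-up helper name.

```shell
# Illustration only: put INFO entries in one fixed order (here: alphabetical by key).
sort_info() {
    printf '%s\n' "$1" | tr ';' '\n' | LC_ALL=C sort -t= -k1,1 | paste -sd ';' -
}

sort_info "DP=35;AF=0.5;MQ=60"   # prints AF=0.5;DP=35;MQ=60
sort_info "MQ=60;DP=35;AF=0.5"   # prints AF=0.5;DP=35;MQ=60
```

Two variants that carry the same annotations in a different order end up with identical INFO layouts, which is the property that helps compression; the values themselves are untouched.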

Floating-point number rounding: numerical INFO and FORMAT annotations containing fractions (also known as floating-point numbers), if not already optimized by one of the methods above, are rounded to 3 significant digits.
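As a rough illustration of rounding to 3 significant digits, the sketch below uses awk's printf with the %.3g format. The helper name round_sig3 is hypothetical. Note that %.3g switches to exponent notation for very small magnitudes and for values that round to 1000 or more; how Genozip formats such values is not covered here.

```shell
# Illustration only: round a number to 3 significant digits using printf's %g format.
round_sig3() {
    awk -v x="$1" 'BEGIN { printf "%.3g\n", x }'
}

round_sig3 0.123456   # prints 0.123
round_sig3 12.3456    # prints 12.3
round_sig3 58.7712    # prints 58.8
```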

QUAL field: Usually, the QUAL field is rounded as described in "Floating-point number rounding". The exception is single-sample DRAGEN-produced files with a FORMAT/GP annotation, where it is rounded as described in "Phred-score rounding".

Note: genounzip and genozip --test verify that the data uncompresses to precisely the same data as it was after the modifications. Note that this does not test the correctness of the modifications themselves.

Note: --optimize can take an argument to fine-tune which fields get optimized, for example:

$ genozip --optimize=VQSLOD,SOR - optimizes only the VQSLOD and SOR fields, if they are optimizable

$ genozip --optimize=^VQSLOD,SOR - optimizes all optimizable fields, except VQSLOD and SOR.

Note: To see which fields are actually optimized use --stats (you can use this with genozip, genocat or genounzip):

$ genozip --optimize --stats test.vcf

VCF file: test.vcf
Samples: 1 variants: 149,870 Contexts: 70 Vblocks: 11 x 4.0 MB Sections: 938
Programs: DRAGEN
Fields optimized: QUAL,INFO,AF,GP,GQ,PL,PRI,AF,MQ,VQSLOD,R2_5P_bias,QD,FS,SOR,MQRankSum,ReadPosRankSum
Genozip version: 15.0.60 github

Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a GVCF file produced by Illumina DRAGEN obtained from here. Note: Compression performance may vary considerably depending on the specific data.

Slicing & dicing your data with genocat

Here's a summary of the filtering and subsetting options available for VCF files. See genocat for more information.

Option            Short  Effect
--downsample             Show only one in every X variants
--regions         -r     Exclude or include certain genomic regions
--regions-file    -R     Like --regions, but the list of regions is specified in a file
--grep                   Show only variants containing the specified string
--grep-w          -g     Like --grep, but match whole words
--lines           -n     Show only variants from a given range of line numbers
--head                   Show only a certain number of variants from the start of the file
--tail                   Show only a certain number of variants from the end of the file
--no-header              Show only the variants, excluding the VCF header
--header-only            Show only the VCF header, excluding the variants
--header-one      -1     Show only the #CHROM line of the VCF header and the variants
--samples         -s     Show a subset of samples
--drop-genotypes  -G     Output the data without the samples and FORMAT column
--GT-only                Within samples, output only genotype (GT) data, dropping the other tags
--snps-only              Drop variants that are not Single Nucleotide Polymorphisms (SNPs)
--indels-only            Drop variants that are not insertions or deletions (indels)

Note: in multi-sample files with INFO fields that contain aggregate data across all samples, subsetting with --drop-genotypes, --GT-only or --samples will not update the INFO fields. In the case of INFO/DP, it will show -1. To avoid showing -1 in INFO/DP when subsetting, use the --secure-DP option when compressing the file.
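To see why aggregate INFO values can become stale after subsetting, here is a plain awk sketch that mimics the column-level effect of dropping the FORMAT and sample columns from an uncompressed VCF. It is not how genocat is implemented, the function name drop_genotypes is made up, and unlike genocat it copies the whole INFO column verbatim (including DP).

```shell
# Illustration only: keep the first 8 VCF columns, dropping FORMAT and all samples.
drop_genotypes() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^##/ { print; next }                      # meta-information lines pass through
         { print $1, $2, $3, $4, $5, $6, $7, $8 }'  # column header line and variants
}

printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n1\t100\t.\tA\tG\t50\tPASS\tAN=4;AC=1\tGT\t0/1\t0/0\n' | drop_genotypes
```

In the output, AN=4 and AC=1 still describe the two samples that were removed, which is exactly the situation the note above warns about.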