How good is Genozip at compressing VCF files?
[Figure: relative sizes of already-compressed .vcf.gz files, before and after compression with Genozip.]
Compressed with Genozip 15.0.4 using --best; --reference was used for the two GVCF files. Datasets can be found here.
​
Compressing and uncompressing a VCF or BCF file
​
The examples on this page use VCF files. Genozip can also compress BCF files, with some limitations.
​
Compressing​
​
Compressing and uncompressing VCF is straightforward:
​
$ genozip myfile.vcf.gz
​
$ ls -lh myfile.vcf*
-rwxrwxrwx 1 divon divon 88M Aug 1 08:49 myfile.vcf.genozip
-rwxrwxrwx 1 divon divon 244M Feb 10 2020 myfile.vcf.gz
​
Uncompressing
​
$ genounzip myfile.vcf.genozip
​
Viewing
​
$ genocat myfile.vcf.genozip
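Because genocat writes the VCF to standard output, it can feed ordinary text tools. The following is a minimal sketch, not part of Genozip: count_variants is a hypothetical helper that counts the non-header lines of whatever VCF stream it receives, and the genocat call is skipped if the tool or the file is absent.

```shell
# Hypothetical helper: count variant records in a VCF stream on stdin.
# Header lines start with '#'; every other line is counted.
count_variants() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -vc '^#' || true
}

# Only run the pipeline if genocat and the example file are both available.
if command -v genocat >/dev/null 2>&1 && [ -f myfile.vcf.genozip ]; then
    genocat myfile.vcf.genozip | count_variants
fi
```

The same pattern works with any filter that reads VCF text from a pipe.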
​
Compressing VCF using a reference file
​​
A VCF can be compressed against a reference file. This is recommended in the following cases, where the compression improvement is expected to be significant:
1. GVCF files
2. Illumina Genotyping VCF files
3. Ultima Genomics files
4. VCF files with little or no sample or INFO data.
​
Example:
​
$ genozip --reference hs37d5.fa.gz myfile.g.vcf.gz
​
Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can pass it to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly, for example:
​​
$ genozip --make-reference hs37d5.fa.gz
​​
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.
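As a sketch of the choice between a FASTA and an existing .ref.genozip file, the hypothetical helper below prints the .ref.genozip path when one sits next to the FASTA, and the FASTA path otherwise. It assumes the cached file is named by replacing the FASTA extension with .ref.genozip (for example hs37d5.ref.genozip); check the actual file name in your directory before relying on this.

```shell
# Hypothetical helper: choose the argument to pass to --reference.
# Assumption: the cached reference is named <basename>.ref.genozip
# and lives in the same directory as the FASTA.
ref_arg() (
    fasta=$1
    base=${fasta%.gz}       # strip an optional .gz
    base=${base%.fasta}     # strip .fasta, if present
    base=${base%.fa}        # strip .fa, if present
    cached="$base.ref.genozip"
    if [ -f "$cached" ]; then
        printf '%s\n' "$cached"
    else
        printf '%s\n' "$fasta"
    fi
)

# Only run if genozip and both example files are available.
if command -v genozip >/dev/null 2>&1 && [ -f hs37d5.fa.gz ] && [ -f myfile.g.vcf.gz ]; then
    genozip --reference "$(ref_arg hs37d5.fa.gz)" myfile.g.vcf.gz
fi
```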
​
Compressing BCF files
​​
Genozip does not support BCF natively. It uses bcftools to convert BCF files to the VCF format, so bcftools must be installed for BCF compression to work.
​​
$ genozip myfile.bcf
​
To decompress to a BCF file, regardless of whether the original file was VCF or BCF, use --bcf. As above, this option relies on bcftools:
​
$ genocat --bcf myfile.vcf.genozip
​
The --bcf option is implicit if --output is specified with a file name with a .bcf extension.
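Since BCF handling depends on bcftools, a script may want to check for it before attempting a conversion. This is a minimal sketch; have_tool is a hypothetical helper, and the file names are the examples used on this page.

```shell
# Hypothetical helper: succeed only if the named program is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool bcftools && have_tool genocat && [ -f myfile.vcf.genozip ]; then
    # The .bcf extension on the --output file name makes --bcf implicit.
    genocat myfile.vcf.genozip --output myfile.bcf
else
    echo "bcftools, genocat, or the input file is missing; skipping BCF export" >&2
fi
```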
​
Optimizing compression
​
The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all these optimization options.
​​
Option              Fields affected                    Action
--optimize-sort     INFO                               Sorts INFO annotations alphabetically by tag name.
--optimize-phred    FORMAT/PL, FORMAT/PRI, FORMAT/PP,  Phred scores are rounded to the nearest integer and capped at 60.
                    FORMAT/GL (VCF v4.2 or earlier)
--GL-to-PL          FORMAT/GL                          The GL annotation is converted to PL and Phred values are capped at 60.
--GP-to-PP          FORMAT/GP (VCF v4.3 or later)      The GP annotation is converted to PP and Phred values are capped at 60.
--optimize-VQSLOD   INFO/VQSLOD                        The value is rounded to 2 significant digits.
​
More information: VCF optimizations
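To make the rounding rules concrete, the sketch below applies them to single values with awk. It only illustrates the arithmetic described in the table and is not Genozip's implementation; cap_phred and round_2sig are hypothetical names.

```shell
# Illustration only: round a Phred value to the nearest integer and cap it at 60.
cap_phred() {
    awk -v x="$1" 'BEGIN { r = sprintf("%.0f", x) + 0; if (r > 60) r = 60; print r }'
}

# Illustration only: round a value to 2 significant digits.
round_2sig() {
    awk -v x="$1" 'BEGIN { printf "%.2g\n", x }'
}

cap_phred 73.4      # prints 60
cap_phred 12.7      # prints 13
round_2sig 5.4321   # prints 5.4
```

Both transformations discard information, which is why these options change the file rather than just its encoding.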
​
Slicing & dicing your data with genocat
​
Here's a summary of the filtering and subsetting options available for VCF files. See genocat for more information.
​
Option             Short   Effect
--downsample               Show only one in every X variants
--regions          -r      Exclude or include certain genomic regions
--regions-file     -R      Like --regions, but the list of regions is specified in a file
--grep                     Show only variants containing the specified string
--grep-w           -g      Like --grep, but match whole words
--lines            -n      Show only variants from a given range of line numbers
--head                     Show only a certain number of variants from the start of the file
--tail                     Show only a certain number of variants from the end of the file
--no-header                Show only the variants, excluding the VCF header
--header-only              Show only the VCF header, excluding the variants
--header-one       -1      Show only the #CHROM line of the VCF header and the variants
--samples          -s      Show a subset of samples
--drop-genotypes   -G      Output the data without the samples and FORMAT column
--GT-only                  Within samples, output only genotype (GT) data, dropping the other tags
--snps-only                Drop variants that are not Single Nucleotide Polymorphisms (SNPs)
--indels-only              Drop variants that are not insertions or deletions (indels)
​
Note: in multi-sample files with both INFO/DP and FORMAT/DP fields, subsetting with --drop-genotypes, --GT-only or --samples would normally cause INFO/DP and INFO/QD to show as -1 and INFO/BaseCounts to show as '.'. Compressing with --secure-DP avoids this issue, at the cost of slightly worse compression.
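When building a genocat command from several of the options above, it can help to preview it before running it. The wrapper below is a hypothetical sketch: run prints the command when DRY_RUN is set and executes it otherwise. Combining --snps-only with --no-header in one call is an assumption made for illustration.

```shell
# Hypothetical helper: print the command when DRY_RUN is non-empty,
# otherwise execute it unchanged.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Preview a filter that keeps SNPs only and drops the VCF header.
DRY_RUN=1 run genocat --snps-only --no-header myfile.vcf.genozip
# prints: genocat --snps-only --no-header myfile.vcf.genozip
```

Dropping DRY_RUN=1 runs the same command for real, provided genocat is installed and the file exists.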
​