File types compressible with Genozip
​
Genozip is designed to compress the following genomic file formats:
​
FASTQ, BAM, SAM, CRAM, VCF, BCF, FASTA, GFF, GVF, GTF, BED, TRACK, 23andMe, Illumina LOC files.
​
Note: Files that are not of one of these formats are treated as generic and are compressible as well. Genozip is often better and faster than general-purpose compressors even for generic files. The support for generic files allows compression of entire directories with Genozip, even if they contain some generic files as well.
​
Simple compression and decompression
​
genozip sample.bam
genounzip sample.bam.genozip
​
Compressing with a reference
​
To achieve good compression for FASTQ and BAM files, it is highly recommended to compress with a reference. This is also recommended for certain VCF files.
​
genozip --reference hs37d5.fa.gz mydata.fq.gz
genounzip mydata.fq.genozip
​
Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as the argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):
​
genozip --make-reference hs37d5.fa.gz
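
The choice between the FASTA and an existing .ref.genozip file can be scripted. The following is a minimal sketch, not part of Genozip: reference_arg is a hypothetical helper, and the naming rule it applies (hs37d5.fa.gz becomes hs37d5.ref.genozip in the same directory) is an assumption inferred from the file names used on this page. It only prints the command it would run.

```shell
# Hypothetical helper: print the path to pass to --reference.
# Assumption: the .ref.genozip file sits next to the FASTA and is named
# after it with the .fa / .fasta (and optional .gz) extension removed.
reference_arg() {
    fasta=$1
    base=${fasta%.gz}
    base=${base%.fa}
    base=${base%.fasta}
    cached="$base.ref.genozip"
    if [ -f "$cached" ]; then
        echo "$cached"      # reuse the existing .ref.genozip file
    else
        echo "$fasta"       # fall back to the FASTA itself
    fi
}

# Dry run: show the command rather than executing it.
echo "genozip --reference $(reference_arg hs37d5.fa.gz) mydata.fq.gz"
```

Either path is a valid argument to --reference, so the fallback branch is safe when no .ref.genozip file exists yet.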
​
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to the reference file directly.
​
Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.
​
Co-compressing a pair of FASTQ files
​
Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair:
​
genozip --reference hs37d5.fa.gz --pair mysample-R1.fastq.gz mysample-R2.fastq.gz
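
When a directory holds many paired samples, the --pair command above can be generated per sample. This is a sketch under a stated assumption: the two mates differ only by "-R1." versus "-R2." in the file name, as in the example above. mate_of and print_pair_commands are hypothetical helper names, and the commands are printed rather than executed.

```shell
# Derive the R2 file name from an R1 file name.
# Assumption: mates differ only by "-R1." vs "-R2." in the name.
mate_of() {
    printf '%s\n' "$1" | sed 's/-R1\./-R2./'
}

# Print one "genozip --pair" command per R1 file whose mate exists.
# $1 = reference file; remaining arguments = R1 FASTQ files.
print_pair_commands() {
    ref=$1
    shift
    for r1 in "$@"; do
        r2=$(mate_of "$r1")
        if [ "$r2" != "$r1" ] && [ -f "$r2" ]; then
            echo "genozip --reference $ref --pair $r1 $r2"
        else
            echo "skipping $r1: no mate file found" >&2
        fi
    done
}
```

For example, print_pair_commands hs37d5.fa.gz ./*-R1.fastq.gz lists one command per complete pair in the current directory, which can be reviewed before running.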
​
Co-compressing a BAM file and its related FASTQ files
​
Similarly, Genozip can take advantage of redundancies between a BAM file and the FASTQ files from which the BAM was generated, to significantly improve compression when they are compressed together, using --deep:
​
$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fq.gz myfile-R2.fq.gz
​
$ ls -lh myfile*
-rw-------+ 1 57G Feb 7 2020 myfile.bam
-rw-------+ 1 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 35G May 3 23:31 myfile-R2.fq.gz
-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip
​
Important: for --deep to work, all FASTQ files that contributed reads to this BAM file must be included.
​
Compressing and decompressing multiple files into a tar file (preserving directory structure)
​
genozip my_data/ --tar my_data.tar --subdirs
​
tar xvf my_data.tar |& genounzip --files-from - --replace
​
See also: Archiving
​
Using Genozip in a pipeline
​
Example:
​
my-bam-outputting-method | genozip --output mysample.bam.genozip
​
genocat mysample.bam.genozip | my-sam-inputting-method
​
Note: when piping data into genozip, genozip attempts to detect the file type from the data. The file type may also be given explicitly with --input, e.g., --input bam
​
Slicing & dicing your data with genocat
​
Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.
​
Some examples:
​
genocat --regions ^Y,MT mysample.bam.genozip # Displays all alignments except Y and MT contigs.
​
genocat --regions chrM GRCh38.fa.genozip # Displays the sequence of chrM.
​
genocat --samples SMPL1,SMPL2 mysamples.vcf.genozip # Displays 2 samples.
​
genocat --grep 1101:2392 myreads.fq.genozip # Displays reads with “1101:2392” in the description.
​
genocat --downsample 10 mysample.fq.genozip # Displays 1 in 10 reads.
​
Removing the original file after successful compression or decompression
​
genozip --replace myfile.bam
​
genounzip --replace myfile.bam.genozip
​
Balancing compression, speed and memory
​
There is a tradeoff between the execution speed and memory consumption of genozip, and the compression achieved. By default, genozip attempts to strike a good balance. Options --fast, --best and --low-memory can be used to tell genozip to skew the balance in either direction:
​
genozip --best sample.bam # better, but slower, compression
​
genozip --fast sample.bam # faster, but less efficient compression
​
genozip --low-memory sample.bam # uses less RAM, but slower and less efficient compression
​
MD5
​
Genozip verifies that a decompressed file is identical to its original using an Adler32 checksum. To use MD5 instead, and to have the MD5 value reported, use --md5. Note that this usually results in slower compression and decompression.
​
genozip --md5 sample.bam
​
​
Encryption
​
For security and compliance, it is possible to encrypt the file while compressing it. Genozip uses AES (256 bits) for encryption, and using encryption has no negative impact on compression speed or the compression ratio.
​
genozip --password mysecret sample.bam
​
genounzip --password mysecret sample.bam.genozip
​
See: Encryption
​
Multi-threading
By default, genozip attempts to utilize as many cores as it can. For that, it sets the number of threads to be a bit more than the number of cores (a practice known as over-subscription), as at any given moment some threads might be idle, waiting for a resource to become available. The --threads option allows explicit specification of the number of compute threads to be used (in addition, a small number of I/O threads is used too, usually 1 or 2).
​
genozip --threads 20 sample.sam
​
On machines with a large number of cores (> 100), the bottleneck becomes the speed at which Genozip can read from and write to disk, and Genozip will not utilize all the cores available.
​
When running on a personal computer (Windows, Mac or WSL), genozip uses fewer cores than are available, to avoid starving the user interface threads of the operating system or of other running applications, which would cause the computer to feel "stuck". To override this behavior and utilize all available cores, use --threads to specify a number that is 10% higher than the number of cores on your machine.
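
The "10% higher than the number of cores" value can be computed rather than worked out by hand. The sketch below is illustrative only: threads_for_cores is a hypothetical helper, rounding up to the next whole thread is an assumption (the text does not say how to round), and the core count comes from getconf, with a fallback of 1 if that is unavailable. It prints the command instead of running it.

```shell
# Hypothetical helper: cores plus 10%, rounded up, in integer arithmetic.
threads_for_cores() {
    echo $(( ($1 * 110 + 99) / 100 ))
}

# Number of online cores; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Dry run: show the resulting command.
echo "genozip --threads $(threads_for_cores "$cores") sample.sam"
```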
​
Reference file caching in RAM
​
To speed up loading of reference data, Genozip caches (=stores) the reference data in RAM (=memory).
​
To instruct genozip not to cache the reference data, use --no-cache:
​
genozip --reference hs37d5.fa.gz --no-cache mydata.fq.gz
​
To see the list of reference genomes currently cached in RAM:
​
genols --cache
​
To remove a specific reference genome from RAM:
​
genozip --reference hs37d5.ref.genozip --no-cache
​
To remove all genomes from RAM:
​
genozip --no-cache
​
Suppressing automatic testing
​
By default, after compressing a file, genozip verifies the compression by decompressing the compressed file in memory and comparing the digest (Adler32 or MD5) of the original file to that of the decompressed file. Using the --no-test option suppresses this verification, saving execution time. This is strongly discouraged as the post-compression testing is an essential part of Genozip's strategy to ensure data integrity.
​
genozip --no-test myfile.bam
​
Memory (RAM) consumption
In genozip, each compute thread is assigned a segment of the input file, known as a VBlock. By default, the VBlock size is selected based on characteristics of the data; however, it may be set explicitly with --vblock. A larger VBlock usually results in better compression, while a smaller VBlock causes genozip to consume less RAM. The VBlock size can be observed at the top of the --stats report. genozip’s memory consumption is linear with (VBlock-size X number-of-threads).
genozip --vblock 32 sample.bam # 32 MB of source file data per VBlock
​
genocat and genounzip also consume memory linearly with (VBlock-size X number-of-threads), where VBlock-size is the value used by genozip when compressing the particular file (it cannot be modified by genocat or genounzip). Usually, genocat and genounzip consume significantly less memory than genozip.
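
The linear relationship with (VBlock-size X number-of-threads) allows a rough comparison of two configurations before running them. The sketch below only computes that product and the ratio between two settings. It is not a memory estimate: the constant factor and any fixed overhead (such as a loaded reference) are not stated here, and both function names are hypothetical.

```shell
# Product of VBlock size (in MB) and number of compute threads.
vblock_thread_product() {
    echo $(( $1 * $2 ))
}

# A new configuration's product as a percentage of a baseline's.
# Args: baseline VBlock MB, baseline threads, new VBlock MB, new threads.
relative_product_percent() {
    base=$(vblock_thread_product "$1" "$2")
    new=$(vblock_thread_product "$3" "$4")
    echo $(( new * 100 / base ))
}

vblock_thread_product 32 20             # --vblock 32 with --threads 20
relative_product_percent 32 20 16 20    # same threads, VBlock halved to 16 MB
```

Because the product is symmetric, halving either the VBlock size or the thread count reduces it by the same proportion. The two choices differ in what they cost: a smaller VBlock tends to reduce compression, while fewer threads reduce speed.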
​​
When using a reference file, it is loaded into memory too. If multiple genozip / genocat / genounzip processes are running in parallel, only one copy of the reference file is loaded into memory and shared between all processes. Depending on how busy the computer is, that reference file data might persist in RAM even between consecutive runs, avoiding the need to load it again from disk. All of this happens behind the scenes.
​
Use --low-memory to instruct genozip, genounzip or genocat to make tradeoffs in a way that uses less RAM, even at the expense of lesser compression or slower execution.
​
Questions? support@genozip.com
​