Compressing FASTA files
Compressing
​
$ genozip chm13.draft_v1.1.fasta.gz
genozip chm13.draft_v1.1.fasta.gz : Done (15 seconds, FASTA compression ratio: 4.8 - better than .fasta.gz by a factor of 1.4)
testing: genounzip chm13.draft_v1.1.fasta.genozip : verified as identical to the original FASTA
​
$ ls -lh chm13.draft_v1.1.*
-rw-rw-r--+ 1 divon divon 617M Aug 5 00:05 chm13.draft_v1.1.fasta.genozip
-rw-rw-r--+ 1 divon divon 852M May 8 2021 chm13.draft_v1.1.fasta.gz
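
The reported factor can be roughly cross-checked against the file sizes in the listing above. The sketch below uses the rounded sizes printed by ls -lh, so the result is only approximate; the variable names are illustrative.

```shell
# Sizes in MB as printed by `ls -lh` above (rounded, so the result is approximate).
gz_size=852
genozip_size=617

# Ratio of the .fasta.gz size to the .genozip size, rounded to one decimal place.
factor=$(awk -v a="$gz_size" -v b="$genozip_size" 'BEGIN { printf "%.1f", a / b }')
echo "The .genozip file is smaller than the .fasta.gz file by a factor of about $factor"
```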
​
Compressing a FASTA using a reference file
​
For FASTA files that consist of short reads (rather than assembled contigs or long reads), it is highly advisable to compress using a reference file (i.e. another FASTA file consisting of a reference genome), as it can improve compression by 3X to 5X compared to compression without a reference, as demonstrated by this chart:
Relative sizes of a .fasta file generated with Illumina Novaseq WGS 30x coverage.
Showing the benefit of using --reference or --REFERENCE
​
$ genozip --reference hs37d5.fa.gz short-reads.fa.gz
genozip short-reads.fa.gz : Done (2 minutes 50 seconds, FASTA compression ratio: 22.7 - better than .fa.gz by a factor of 5.5)
testing: genounzip short-reads.fa.genozip : verified as identical to the original FASTA

$ ls -lh short-reads.fa.*
-rw-rw-r--+ 1 divon divon 3.9G May 13 04:04 short-reads.fa.genozip
-rw-rw-r--+ 1 divon divon 21G May 13 02:20 short-reads.fa.gz
Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):
​​
$ genozip --make-reference hs37d5.fa.gz
​​
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.
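
For example, decompression with an explicitly specified reference might look like the sketch below. The directory /data/refs and the file name hs37d5.ref.genozip are assumptions made for illustration; substitute the location of your own reference file. Either of the two options is sufficient on its own.

```shell
# Hypothetical location of the reference file (an assumption for this example).
ref_file="/data/refs/hs37d5.ref.genozip"

# Option 1: set the environment variable so that genounzip can locate the reference.
export GENOZIP_REFERENCE="$ref_file"

# Option 2: point to the reference directly on the command line.
# The guard makes this a no-op where Genozip or the files are not present.
if command -v genounzip >/dev/null 2>&1 && [ -f "$ref_file" ] && [ -f short-reads.fa.genozip ]; then
    genounzip --reference "$ref_file" short-reads.fa.genozip
fi
```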
​​
Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This causes the compressed file to be larger.
​
Note: using --reference is only useful for compressing FASTA files that consist of short reads. It is ignored for other types of FASTAs.
​
Note: If the species has multiple versions of its reference genome FASTA, any one of them will work. For example, for human data, one option would be to use hs37d5.fa.gz.
​
Note: If there is no reference genome for the target species, a reference genome of a closely related species may still work better than no reference at all.
Note: For metagenomic applications, it is possible to use a reference FASTA that contains sequences from multiple species, up to a total of 4 Gbp.
Uncompressing
​
$ genounzip chm13.draft_v1.1.fasta.genozip
​
Viewing and analyzing
​
$ genocat chm13.draft_v1.1.fasta.genozip
​
genocat options:
​
--sequential: each sequence is output on a single line - newlines within the sequence are removed.
​
--header-only: shows only the description lines (no sequences).
​
--no-header: shows only the sequences, omitting the description lines.
​
--header-one: shows the description lines truncated at the first space or tab character.
​
--grep string: shows only the sequences in which string is contained in the header.
​
--grep-w string: same as --grep, but string must match a whole word.
​
--regions sequence-name[,sequence-name2...]: shows only the sequences requested. sequence-name is the prefix of the description line up to the first space, tab or newline.
​
--regions-file filename: same as --regions, but the list of sequence names is taken from a file.
​
--head[=lines]: shows the given number of lines from the top of the file. This is similar to piping genocat | head but faster.
​
--tail[=lines]: similar to --head, but shows lines from the end of the file.
​
--lines [first]-[last] or [first]: shows a range of lines.
​
--downsample rate[,shard]: technically works on FASTA files, but is usually not very useful. See Downsampling.
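
To illustrate what some of these options do to the output, the sketch below approximates them with standard tools on a tiny uncompressed FASTA. This is not how genocat is implemented, and the sample data is made up. It also assumes that --header-one still prints the sequence lines, which the description above does not state explicitly.

```shell
# Build a tiny FASTA file for illustration (hypothetical data).
fasta=$(mktemp)
printf '>seq1 first example\nACGT\nACGT\n>seq2 second example\nGGCC\n' > "$fasta"

# Approximates --sequential: each sequence is joined onto a single line.
sequential=$(awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
                  { seq = seq $0 }
                  END { if (seq != "") print seq }' "$fasta")

# Approximates --header-only: description lines only.
header_only=$(grep '^>' "$fasta")

# Approximates --no-header: sequence lines only.
no_header=$(grep -v '^>' "$fasta")

# Approximates --header-one: description lines cut at the first space or tab
# (assumption: sequence lines are still printed).
header_one=$(awk '/^>/ { sub(/[ \t].*/, "") } { print }' "$fasta")

printf '%s\n' "$sequential"
rm -f "$fasta"
```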
​
Note: --grep, --grep-w, --regions and --regions-file require the file to be indexed during compression (only applicable when compressing FASTA files). This can be achieved by adding the --index option when compressing. However, please be aware that --index sometimes negatively impacts compression, and in particular, it causes --reference to be ignored. --index is automatically set for files which contain up to 10,000 sequences which are assembled contigs (i.e. not sequencing reads).
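
As a sketch of the workflow described in this note: compress with --index, then query by sequence name. The sequence name chr1 is an assumption about how the contigs in this file are named; list the description lines with --header-only first to check the actual names.

```shell
in_file="chm13.draft_v1.1.fasta.gz"
out_file="chm13.draft_v1.1.fasta.genozip"

# Guarded so that this is a no-op where Genozip or the input file is not present,
# and so that an existing compressed file is not overwritten.
if command -v genozip >/dev/null 2>&1 && [ -f "$in_file" ] && [ ! -e "$out_file" ]; then
    # Request an index explicitly at compression time.
    genozip --index "$in_file"

    # List the description lines to see which sequence names exist.
    genocat --header-only "$out_file"

    # Show only the requested sequence ("chr1" is an assumed name).
    genocat --regions chr1 "$out_file"
fi
```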
​
Generating a Genozip reference file from a FASTA file
​
A FASTA file may be used to generate a Genozip reference file, using the --make-reference option:
​
$ genozip --make-reference chm13.draft_v1.1.fasta.gz
​
Reference files are a file format used in Genozip internally, and cannot be uncompressed. Their primary use is for compressing other files, using the --reference or --REFERENCE options, which usually results in significantly improved compression.
​
In addition to their primary use, reference files are also useful for analyzing the underlying FASTA: they can be used to easily view sub-sequences of contigs in certain regions (forward or reverse complemented) using --regions and --regions-file, to find IUPAC non-ACGTN pseudo-bases in the file with --show-ref-iupacs, and to see properties of the contigs with --show-ref-contigs. See more here: Reference file options.
​
Note: A reference file is also created implicitly when a FASTQ, SAM/BAM/CRAM, VCF or FASTA file is compressed with the --reference or --REFERENCE option and with a FASTA filename as an argument of the option.
​
For a full list of options, see the genozip command line reference
​
Questions? support@genozip.com
​