Display and analyze a file compressed with genozip.
​
Usage
genocat [options]… [files]…
One or more file names must be given.
General options
​
-e, --reference filename
Load a reference file prior to decompressing. Used only for files compressed with --reference. If not provided, genocat will use the same reference filename as used for genozip.
​
Note: this is equivalent to setting the environment variable $GENOZIP_REFERENCE with the reference filename.
​
Note: if $GENOZIP_REFERENCE is set to a directory name, then the reference file is sought in that directory, with the reference file name used during compression. If it is not found there, the reference file path used for compression is used.
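​
The lookup order in the notes above can be sketched in Python. This is an illustration only, not Genozip's code: the function name and its parameters are hypothetical, and giving --reference precedence over the environment variable is an assumption, since the text does not say which wins when both are set.

```python
import os

def resolve_reference(cli_ref, env_ref, compress_time_ref):
    """Pick a reference path following the documented lookup order.

    cli_ref:           value of --reference, or None (hypothetical parameter)
    env_ref:           value of $GENOZIP_REFERENCE, or None
    compress_time_ref: reference path that was used when compressing
    """
    if cli_ref:                      # assumption: the command line wins
        return cli_ref
    if env_ref:
        if os.path.isdir(env_ref):
            # Directory: look for the compression-time file name inside it.
            candidate = os.path.join(env_ref, os.path.basename(compress_time_ref))
            if os.path.exists(candidate):
                return candidate
            return compress_time_ref  # not found there: fall back
        return env_ref               # a plain file name
    return compress_time_ref
```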
​
--no-cache
Don't store reference genome data in RAM. Can also be used to delete previously cached genomes. See reference genome caching.
​
-f, --force
Force overwrite of the output file.
​
-D, --subdirs
If a file name on the command line is a directory, include all files of that directory (recursively).
​
-o, --output output-filename
Output to this filename.
Note: output-filename can also be a directory name, in which case the output file is written to the specified directory. If the name has a ‘/’ suffix (e.g. “-o my-dir/”), then the directory is created if it doesn’t already exist.
​
-p, --password password
Provide password to access file(s) that were compressed with --password.
​
--count
Rather than displaying the file content, just report the number of lines (FASTQ: reads), excluding the header, that would have been displayed. Useful in combination with filtering options.
​
Limitation: cannot be used in combination with --downsample, --head, --tail or --lines.
​
-z, --bgzf level
Recompress the output to the BGZF format (.gz or .bam extension). level specifies the BGZF recompression level from 0 (no compression) to 5 (best yet slowest compression). The default level is 2.
​
Use --bgzf=exact to instruct genounzip to attempt to re-create the same exact BGZF compression as in the original file. Whether genounzip succeeds in re-creating the exact same BGZF compression ratio depends on the compression library used by the application that generated the original file. See also: Compressing already-compressed files.
​
-q, --quiet
Don't show the progress indicator or warnings.
​
-Q, --noisy
The --quiet option is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings.
​
-@, --threads number
Specify the maximum number of threads. By default genozip allocates 1.1 threads per core in order to maximize usage of all available cores. An exception is Mac and Windows (including WSL), where the default allocation is 0.75 threads per core to keep the operating system's UI responsive.
​
Note: For genounzip and genocat this limit is only approximate. For genozip, it is strictly enforced.
​
--low-memory
Uses less memory than normal, but runs slower.
​
-w, --stats
Show the internal structure of a genozip file and the associated compression statistics.
​
--print-filename
Show the file name for each file.
​
-T, --files-from filename
An alternative to providing input file names on the command line. filename is a text file containing a newline-separated list of files. If filename is - (a hyphen), data is taken from stdin rather than a file.
​
--validate filenames
Tests if one or more files are valid genozip files. If they are, they are ignored. If they are not, an error is printed. Alternatively, with --validate=valid, the names of the valid files are printed, and the invalid files are ignored.
​
Note: This only checks that the file is indeed a genozip file (technically: checks its "magic number"). It does not inspect the contents of the file. For a thorough verification of the contents of a file, use genounzip --test.
​
--log filename
Send non-file output to a log file instead of the terminal.
​
--echo
Output the full command line upon successful or failed completion of execution.
​
--help
Show a link to this page.
​
-L, --license, --licence
Show the license terms and conditions for this product as accepted. Combine with --force to see the version of the license current to the version of Genozip used. If you wish to change your license to the most recent one - make sure your version of Genozip is the latest and re-activate with genozip --activate.
​
-V, --version
Display Genozip's version number.
​
--print-reference
Show the name and MD5 of the reference file that needs to be provided to uncompress this file. Combine with --force to show the name only.
​
Subsetting options
​
Note: subsetting options are options that filter or otherwise modify the data.
​
--downsample rate[,shard]
Applicable data types: all
Show only one in every rate lines (reads in the case of FASTQ; sequences in the case of FASTA).
The optional shard parameter indicates which of the shards is shown - it must be a value between 0 and rate-1.
Other subsetting options (if any) will be applied to the surviving lines.
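​
A minimal Python sketch of the rate and shard parameters follows. It assumes that shards are defined by the 0-based line index modulo rate and that the default shard is 0. The text above does not specify either detail, so treat both as illustrative, and the function name as hypothetical.

```python
def downsample(lines, rate, shard=0):
    """Keep one line in every `rate`, taking the lines that fall in `shard`."""
    if not 0 <= shard < rate:
        raise ValueError("shard must be a value between 0 and rate-1")
    # Assumption: line i belongs to shard (i % rate).
    return [line for i, line in enumerate(lines) if i % rate == shard]
```

Under this assumption, running the same file with shards 0 to rate-1 partitions the lines into rate disjoint subsets.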
​
-r, --regions [^]chr | chr:pos | pos | chr:from-to | chr:from- | chr:-to | from-to | from- | -to | from+len [,...]
Applicable data types: VCF SAM/BAM GFF3/GVF FASTA 23andMe Reference
Show one or more regions of the file.
​
Examples:
​
genocat myfile.vcf.genozip -r 22:1000-2000 # Positions 1000 to 2000 on contig 22
genocat -e myfile.ref.genozip -r 22:2000-1000 # Reverse complement of positions 1000 to 2000 on contig 22 (reference file only)
genocat myfile.sam.genozip -r 22:1000+151 # 151 bases, starting pos 1000, on contig 22
genocat -e myfile.ref.genozip -r 22:1000-151 # Reverse complement of 151 bases, from 1000 to 850, on contig 22 (reference file only)
genocat myfile.vcf.genozip -r -2000,2500- # Two ranges on all contigs
genocat myfile.sam.genozip -r chr21,chr22 # Contigs chr21 and chr22 in their entirety
genocat myfile.vcf.genozip -r ^MT,Y # All contigs, excluding MT and Y
genocat myfile.vcf.genozip -r ^-1000 # All contigs, excluding positions up to 1000
genocat myfile.fa.genozip -r chrM # Contig chrM
​
Note: genozip files are indexed automatically during compression. There is no separate indexing step or separate index file.
​
Note: Indels are considered part of a region if their start position is.
​
Note: Multiple -r arguments may be specified - this is equivalent to chaining their regions with a comma separator in a single argument.
​
Note: For SAM/BAM files, unlike samtools, this works even if the files are not sorted by position.
​
Note: For Reference files, use in combination with --reference (or -e).
​
Note: For FASTA files, only whole-contig regions are possible.
​
Note: Combine with --gpos to see Global POSition values instead of positions on chromosomes.
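​
To make the forward region forms concrete, here is a hedged Python sketch that turns a single region (one comma-separated item, without the ^ prefix) into a (contig, first, last) triple. The function is hypothetical and deliberately partial: it treats a bare token without ':', '-' or '+' as a contig name (so the bare pos form is not covered), it rejects reverse ranges, and it will misread contig names that themselves contain ':', '-' or '+'.

```python
def parse_region(spec):
    """Return (contig, first, last); None means 'all contigs' or 'unbounded'."""
    if ":" in spec:
        contig, rng = spec.split(":", 1)
    elif "-" in spec or "+" in spec:
        contig, rng = None, spec          # range applies to all contigs
    else:
        return spec, None, None           # assumption: bare token is a contig

    if "+" in rng:                        # from+len
        first_s, len_s = rng.split("+", 1)
        first = int(first_s)
        return contig, first, first + int(len_s) - 1
    if "-" in rng:                        # from-to, from-, -to
        first_s, last_s = rng.split("-", 1)
        first = int(first_s) if first_s else None
        last = int(last_s) if last_s else None
        if first is not None and last is not None and last < first:
            raise ValueError("reverse ranges are not handled in this sketch")
        return contig, first, last
    pos = int(rng)                        # chr:pos
    return contig, pos, pos
```

For example, `parse_region("22:1000+151")` gives positions 1000 to 1150, which is 151 bases, matching the example above.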
​
-R, --regions-file [^]filename
Applicable data types: VCF SAM/BAM GFF3/GVF FASTA 23andMe Reference
Show regions from a list in a tab-separated file. To include all regions except those in the file, prefix the filename with ^. If filename is - (or ^-), data is taken from stdin rather than a file.
​
Example of a valid file: The first two rows (ignoring the comment line) produce the same 100-base region, and the
third row is a single base:
​
# Comment lines starting with a # are ignored.
chr22 17000000 17000099
chr22 17000000 +100
chr22 17000000
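​
The three row shapes in the example file can be read with a short Python sketch like the one below. It is an illustration with a hypothetical function name, not Genozip's parser: it covers only the rows shown (contig with from and to, contig with from and +len, contig with a single position) and assumes blank lines can be skipped.

```python
def parse_regions_file(text):
    """Parse tab-separated region rows into (contig, first, last) tuples."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                                  # comment or blank line
        fields = line.split("\t")
        contig, first = fields[0], int(fields[1])
        if len(fields) < 3:
            last = first                              # single base
        elif fields[2].startswith("+"):
            last = first + int(fields[2][1:]) - 1     # +len form
        else:
            last = int(fields[2])                     # explicit end position
        regions.append((contig, first, last))
    return regions
```

Applied to the example file, the first two rows both yield (chr22, 17000000, 17000099) and the third yields (chr22, 17000000, 17000000).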
​
--grep string
Show only lines (FASTA: sequences ; FASTQ: reads) in which string is a case-sensitive substring of the lines (FASTA: description). This does not affect showing the file header.
​
-g, --grep-w string
Same as --grep, but with whole-word matching.
​
-n, --lines [first]-[last] or [first]
Show a certain range of lines. first and last are numbers of lines in the file (starting from 1, excluding header). Combine with --no-header to avoid outputting the file header.
​​
Examples:
​​
genocat --lines 1000-2000 # displays the 1001 lines between 1000 and 2000
genocat --lines=1000- # displays all lines starting from 1000 (optional =)
genocat -n -2000 # displays lines 1 to 2000 (-n instead of --lines)
genocat -n 1000 # displays 10 lines starting from line 1000
​​​
Note on FASTQ: The numbering is of reads rather than lines.
​
Note: Line numbers are taken before any additional filters are applied.
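​
The four argument forms in the examples above map to line ranges as in this Python sketch. The function name is hypothetical; the behavior mirrors the examples, including the 10-line default when only first is given.

```python
def parse_lines_arg(arg, default_count=10):
    """Return (first, last) for a --lines argument.

    last is None when the range runs to the end of the file.
    """
    if "-" in arg:
        first_s, last_s = arg.split("-", 1)
        first = int(first_s) if first_s else 1      # "-2000" starts at line 1
        last = int(last_s) if last_s else None      # "1000-" runs to the end
        return first, last
    first = int(arg)
    return first, first + default_count - 1         # "1000" shows 10 lines
```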
​
--head[=num_lines]
Show num_lines lines from the start of the file (default is 10). Line count excludes header. Combine with --no-header to avoid outputting the file header.
​
--tail[=num_lines]
Show num_lines lines from the end of the file (default is 10). Combine with --no-header to avoid outputting the file header.
​​
​​
VCF options
​
-s, --samples [^]sample_name[,...] or num_samples
Show a subset of samples (individuals).
​
Examples:
​
genocat myfile.vcf.genozip -s HG00255,HG00256 # show two samples
genocat myfile.vcf.genozip -s ^HG00255,HG00256 # show all samples except these two
genocat myfile.vcf.genozip -s 5 # show the first 5 samples
​
Note: This does not change the INFO data (including the AC, AF, AN tags).
Note: sample_name is case-sensitive.
​
Note: Multiple -s arguments may be specified - this is equivalent to chaining their samples with a comma separator in a single argument.
​
Note: The INFO/DP field will display -1 as it requires FORMAT/DP data of all samples to reconstruct. To show the true INFO/DP values, use the --secure-DP option when compressing the file.
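​
The three forms of the -s argument can be illustrated with a small Python sketch. It is not Genozip's implementation: the function name is hypothetical, an all-digit argument is assumed to always mean num_samples, and the output is assumed to keep the samples in file order.

```python
def select_samples(all_samples, arg):
    """Return the sample names kept by a -s argument."""
    if arg.isdigit():
        return all_samples[:int(arg)]            # first num_samples samples
    exclude = arg.startswith("^")
    names = set(arg.lstrip("^").split(","))      # names are case-sensitive
    return [s for s in all_samples if (s in names) != exclude]
```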
​
-G, --drop-genotypes
Output the data without the samples and FORMAT column.
​
Note: This does not change the INFO data (including the AC, AF, AN tags).
​
Note: The INFO/DP field will display -1 as it requires FORMAT/DP data to reconstruct. To show the true INFO/DP values, use the --secure-DP option when compressing the file.
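​
In terms of the VCF columns, dropping the samples and the FORMAT column leaves the eight fixed columns CHROM through INFO. A minimal Python sketch with a hypothetical function name is shown below; it only trims columns and does not reproduce the INFO/DP behavior described in the note above.

```python
def drop_genotypes(vcf_line):
    """Keep the eight fixed VCF columns; drop FORMAT and all sample columns."""
    line = vcf_line.rstrip("\n")
    if line.startswith("##"):
        return line                      # meta-information lines are unchanged
    # Applies to both the #CHROM header line and the data lines.
    return "\t".join(line.split("\t")[:8])
```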
​
--GT-only
Within each sample, output only the genotype (GT) data, dropping the other tags.
​
Note: The INFO/DP field will display -1 as it requires FORMAT/DP data to reconstruct. To show the true INFO/DP values, use the --secure-DP option when compressing the file.
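​
The effect on a single data line can be sketched in Python as follows. The function name is hypothetical, and leaving a line unchanged when its FORMAT column has no GT tag is an assumption, since that case is not described above.

```python
def gt_only(vcf_line):
    """Reduce FORMAT to GT and each sample to its GT value."""
    line = vcf_line.rstrip("\n")
    fields = line.split("\t")
    if line.startswith("#") or len(fields) < 10:
        return line                          # header line, or no sample columns
    keys = fields[8].split(":")
    if "GT" not in keys:
        return line                          # assumption: nothing to keep
    gt_index = keys.index("GT")
    samples = [s.split(":")[gt_index] for s in fields[9:]]
    return "\t".join(fields[:8] + ["GT"] + samples)
```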
​
--snps-only
Drops variants that are not a Single Nucleotide Polymorphism (SNP).
​
--indels-only
Drops variants that are not Insertions or Deletions (indel).
​​
-1, --header-one
Output only the last line of the header (the line with the field and sample names).
​
--bcf
Output as BCF. Note: bcftools needs to be installed for this option to work.
​
--no-PG
Suppresses adding a "##genozip_command" line to the VCF header.
​
BAM/CRAM/SAM options
​
-H, --no-header
Don't output the SAM header lines.
​
-h, --header-only
Output only the SAM header lines.
​
--FLAG {+-^}value
Filter alignments based on the FLAG value: value is a + - or ^ followed by a decimal or hexadecimal value or a flag name (or its unique prefix) from the table below.
​
+ Includes alignments in which all flags in value are set in the line’s FLAG
- Includes alignments in which no flags in value are set in the line’s FLAG
^ Excludes alignments in which all flags in value are set in the line’s FLAG
​
Example: --FLAG -192 includes only alignments in which neither FLAG 64 nor FLAG 128 is set. This can also be expressed as --FLAG -0xC0.
​
Example: --FLAG +SUPP includes only alignments in which the SUPPLEMENTARY flag (2048) is set.
​
| Decimal | Hex | Name | Meaning |
|---|---|---|---|
| 1 | 0x1 | MULTI | template having multiple segments in sequencing |
| 2 | 0x2 | ALIGNED | each segment properly aligned according to the aligner |
| 4 | 0x4 | UNMAPPED | segment unmapped |
| 8 | 0x8 | NUNMAPPED | next segment in the template unmapped |
| 16 | 0x10 | REVCOMP | SEQ being reverse complemented |
| 32 | 0x20 | NREVCOMP | SEQ of the next segment in the template being reverse complemented |
| 64 | 0x40 | FIRST | the first segment in the template |
| 128 | 0x80 | LAST | the last segment in the template |
| 256 | 0x100 | SECONDARY | secondary alignment |
| 512 | 0x200 | FILTERED | not passing filters, such as platform/vendor quality controls |
| 1024 | 0x400 | DUPLICATE | PCR or optical duplicate |
| 2048 | 0x800 | SUPPLEMENTARY | supplementary alignment |
​
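​
The three --FLAG operators reduce to bitwise tests on the FLAG field. The Python sketch below illustrates them using the flag names from the table. The function names are hypothetical, and matching names case-sensitively is an assumption.

```python
FLAG_NAMES = {
    "MULTI": 0x1, "ALIGNED": 0x2, "UNMAPPED": 0x4, "NUNMAPPED": 0x8,
    "REVCOMP": 0x10, "NREVCOMP": 0x20, "FIRST": 0x40, "LAST": 0x80,
    "SECONDARY": 0x100, "FILTERED": 0x200, "DUPLICATE": 0x400,
    "SUPPLEMENTARY": 0x800,
}

def parse_flag_value(text):
    """Accept a decimal value, a 0x-prefixed hex value, or a unique name prefix."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if text.isdigit():
        return int(text)
    matches = [bits for name, bits in FLAG_NAMES.items() if name.startswith(text)]
    if len(matches) != 1:
        raise ValueError(f"{text!r} is not a unique flag name or prefix")
    return matches[0]

def flag_filter_passes(arg, flag):
    """Return True if an alignment with this FLAG survives --FLAG arg."""
    op, value = arg[0], parse_flag_value(arg[1:])
    if op == "+":
        return (flag & value) == value     # all flags in value are set
    if op == "-":
        return (flag & value) == 0         # no flag in value is set
    if op == "^":
        return (flag & value) != value     # exclude if all flags are set
    raise ValueError("argument must start with +, - or ^")
```

Note that - and ^ differ when value has several bits: with -192, an alignment with only FLAG 64 set is dropped, while with ^192 it is kept because not all of the listed flags are set.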
--MAPQ [^]value
Filter alignments based on the MAPQ value: include (or exclude if value is prefixed with ^) lines with a MAPQ greater than or equal to value.
​
--bases [^]value
Filter alignments based on the IUPAC nucleotide codes of the sequence data.
​
Examples:
genocat --bases ACGT # displays only lines in which all characters of the SEQ are one of A,C,G,T
genocat --bases ^ACGT # displays only lines that contain non-A,C,G,T characters (e.g. an N)
​
Note: In SAM/BAM, all alignments missing a sequence (i.e. SEQ=*) are included in positive --bases filters (the first example above) and excluded in negative ones.
​
Note: The list of IUPAC characters can be found here: IUPAC codes
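​
The positive and negative forms of --bases, including the SEQ=* rule from the note above, can be expressed as a short Python sketch. The function name is hypothetical and the comparison is assumed to be case-sensitive.

```python
def bases_filter_passes(arg, seq):
    """Return True if a line with sequence `seq` survives --bases arg."""
    negate = arg.startswith("^")
    allowed = set(arg.lstrip("^"))
    if seq == "*":
        return not negate        # missing SEQ: kept by positive filters only
    all_allowed = all(base in allowed for base in seq)
    # Positive: every base is in the set. Negative: at least one base is not.
    return all_allowed != negate
```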
​
--qnames [^]qname[,qname...]
Filter alignments to only those whose QNAME (read name) is on the comma-separated list. If the list is prefixed with ^, only the alignments that have a QNAME that is not on the list are shown.
​
Example:
genocat mydata.bam.genozip --qnames=A00966:41:H25GJDSXY:3:2663:21793:8390
​
--qnames-file [^]filename
Filter alignments to only those whose QNAME (read name) is listed in the file. If the filename is prefixed with ^, only the alignments that do not have a QNAME that is listed in the file are shown. The format of the file is a newline-separated list of QNAMEs. If an initial '@' character appears in a line, it is ignored, and all characters after the first space or tab character are ignored as well, allowing for entire FASTQ description lines to be used in the qnames file. The comparison is done based on canonical names, so a QNAME listed in the qnames file as "A00910:85:HYGWJDSXX:1:1101:3025:1000" could match an alignment with a QNAME "A00910:85:HYGWJDSXX:1:1101:3025:1000_1:N:0:CAACGAGAGC+GAATTGAGTG".
​
Example:
genocat mydata.fq.genozip --header-only | head -10 > qnames # descriptions of first 10 reads
genocat mydata.bam.genozip --qnames-file qnames # alignments originating from these 10 reads
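​
The way each line of the qnames file is reduced before comparison (an initial '@' is dropped, and everything after the first space or tab is ignored) can be sketched in Python. The function name is hypothetical, and the sketch covers only this line clean-up, not the canonical-name comparison, whose rules are not given here.

```python
import re

def qname_from_line(line):
    """Reduce one line of a qnames file to the text that is compared."""
    line = line.rstrip("\n")
    if line.startswith("@"):
        line = line[1:]                            # initial '@' is ignored
    return re.split(r"[ \t]", line, maxsplit=1)[0] # cut at first space or tab
```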
​
--seqs-file [^]filename
Filter alignments to only those whose SEQ (sequence) appears in the file. If the filename is prefixed with ^, only the alignments that do not have a SEQ that appears in the file are shown. A matching alignment is an alignment with the exact SEQ or the exact reverse-complemented SEQ.
​
Example:
genocat mydata.fq.genozip --seq-only | head -10 > seqs # sequences of first 10 reads
genocat mydata.bam.genozip --seqs-file seqs # alignments containing these sequences
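​
The matching rule (exact SEQ or exact reverse complement) is illustrated by the Python sketch below. The function names are hypothetical, and the complement table is limited to A, C, G, T and N as a simplifying assumption; other IUPAC codes are left unchanged by it.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def seq_matches(seq, wanted):
    """True if seq, or its reverse complement, is in the set `wanted`."""
    return seq in wanted or reverse_complement(seq) in wanted
```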
​
--bam
Output as BAM. This option is implicit if --output specifies a filename ending with .bam. For a Deep file, this will output the BAM/CRAM/SAM component in BAM format.
​
--cram
Output as CRAM. This option is implicit if --output specifies a filename ending with .cram. For a Deep file, this will output the BAM/CRAM/SAM component in CRAM format.
​
--sam
Output as SAM. This option is the default in genocat on BAM/CRAM/SAM data and is implicit if --output specifies a filename ending with .sam or .sam.gz, or if output is to stdout. For a Deep file, this will output the BAM/CRAM/SAM component in SAM format.
​
--no-PG
Suppress adding a @PG line to the file header.
​
​
FASTQ options
--interleaved[=both|either]
For FASTQ data compressed with --pair: Show every pair of paired-end FASTQ files with their reads interleaved: first a read from the first file, then a read from the second file, then the next read from the first file, and so on. The optional argument 'both' (default) or 'either' determines whether both reads of a pair, or only one, must pass for the pair to survive when combining with a subsetting option such as --grep.
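​
The difference between 'both' and 'either' can be shown with a Python sketch in which `keep` stands in for any subsetting filter such as --grep. All names here are hypothetical and each read is treated as one opaque item.

```python
def interleave(reads1, reads2, keep=lambda read: True, mode="both"):
    """Interleave paired reads, keeping a pair according to `mode`."""
    if mode not in ("both", "either"):
        raise ValueError("mode must be 'both' or 'either'")
    out = []
    for r1, r2 in zip(reads1, reads2):
        hits = (keep(r1), keep(r2))
        survives = all(hits) if mode == "both" else any(hits)
        if survives:
            out.extend([r1, r2])   # read from file 1, then its mate from file 2
    return out
```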
​
--R=n
View one of the FASTQ files in a genozip file created with --pair or --deep. Not to be confused with -R (single hyphen).
​
--R1 and --R2
View one of the two FASTQ files in a genozip file created with --pair or --deep. Equivalent to --R=1 and --R=2.
​
--header-only
Output only the description lines.
​
--seq-only
Output only the sequence (nucleotide) lines.
​
--qual-only
Output only the quality lines.
​
Limitation: doesn't work on some long read data, because Genozip compresses long read quality scores with the LONGR codec that requires sequence data to reconstruct correctly.
​
--bases [^]value
Filter lines based on the IUPAC nucleotide codes of the sequence data. See SAM/BAM --bases option.
​
​
DEEP options
Options applicable to genozip files created with --deep.
​​
--bam, --cram, --sam
View the BAM/CRAM/SAM component of the deep file, output in SAM, BAM or CRAM format.
​​
--R=n, --R1, --R2
View one of the FASTQ components of the deep file. Not to be confused with -R (single hyphen).
​
--fastq, --fq
In case of a deep file that contains a single FASTQ file - outputs that file. In case it contains two paired-end FASTQ files, outputs the FASTQ data in interleaved format.
​​
--interleaved[=both|either]
View both paired-end FASTQ files contained in the deep file, in interleaved format, as described above in FASTQ options.
​
All other BAM/CRAM/SAM and FASTQ options are available for deep files when combining with --bam, --cram, --sam or --fastq.
​
​
FASTA options
-1, --header-one
Output the sequence name up to the first space or tab.
​
-H, --no-header
Don't output the header (sequence name) lines.
​
-h, --header-only
Output only the header (sequence name) lines.
​
--sequential
Output in sequential format - each sequence in a single line.
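​
As an illustration of the sequential format, the Python sketch below joins the wrapped sequence lines of each FASTA record into a single line. The function name is hypothetical, and the sketch assumes that header lines start with '>' and that empty lines can be dropped.

```python
def to_sequential(fasta_text):
    """Rewrite FASTA text so that each sequence occupies a single line."""
    out, seq = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if seq:                       # flush the previous record's sequence
                out.append("".join(seq))
                seq = []
            out.append(line)
        elif line:
            seq.append(line)
    if seq:
        out.append("".join(seq))
    return "\n".join(out) + "\n" if out else ""
```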
​
​
Reference file options
​
--reference filename --regions regions [--header-only]
View one or more regions of a reference file.
Note: For reverse complement, use a reverse range, e.g. -r1000000-999995 or equivalently -r1000000-6.
​
Note: --regions-file may be used instead of --regions.
​
Note: Combine with --no-header to suppress output of the chromosome name.
​
Note: Short forms of the options (e.g., -e instead of --reference) are fine too.
​
--gpos
In combination with --reference and --regions or --regions-file - shows coordinates in GPOS (Global POSition) terms - a single genome-wide numeric coordinate - rather than (CHROM,POS).
​
--print-ref-contigs
Show the details of the reference file contigs.
Note: the reference filename is provided without the --reference option.
​​
--print-ref-iupacs
Show non-ACTGN IUPAC pseudo-bases in the reference file.
​
Note: requires using --reference.
​​
23andMe options
--vcf
Output as VCF. --vcf must be used in combination with --reference to specify the reference file as listed in the header of the 23andMe file (usually this is GRCh37). Note: Indel variants ('DD' 'DI' 'II') as well as uncalled sites ('--') are discarded.
See: Converting 23andMe to VCF
Analysis options
--contigs
Applicable data types: VCF SAM BAM FASTA GFF3/GVF 23andMe
List the names of the chromosomes (or contigs) included in the file. Alternative option names: --list-chroms --chroms.
​
--coverage[=all|=one]
Applicable data types: SAM BAM FASTQ
Show the coverage and depth of each contig. Values are approximate when used on FASTQ. See Coverage and Depth.
​
--idxstats
Applicable data types: SAM BAM FASTQ
Shows the count of mapped and unmapped reads by contig. Values are approximate when used on FASTQ. Same output format as samtools idxstats. See idxstats.
​