
Archiving using --tar

With the --tar option, genozip compresses files directly into a standard tar file.

​

Each file is compressed independently and written directly into a standard tar file as the tar file is being formed. This is faster and consumes less disk space than first genozipping the files and then packaging them into a tar file, because no separate .genozip files are created, only the tar file.

​

Note: if you are using the Linux version of Genozip, a copy of the Genozip executables is also added to the tar file, at a negligible overhead of 4 MB.

​

Example 1

 

> # Compressing

> genozip --tar mydata.tar sample1.bam sample2.bam variants.vcf

 

> # Listing the contents of the tar file

> tar tvf mydata.tar

-rw-rw-rw- USER/USER 3424847 2021-06-01 11:34 sample1.bam.genozip

-rw-rw-rw- USER/USER 6765323 2021-03-04 22:04 sample2.bam.genozip

-rw-rw-rw- USER/USER 765323  2021-03-04 22:08 variants.vcf.genozip

​

> # Unarchiving and decompressing all files

> tar xvf mydata.tar |& genounzip --files-from - --replace

​

Note: When unarchiving this way, genounzip displays an informational message for each file that is not in genozip format, including the Genozip executables that were added to the tar file. These messages can be safely ignored, or suppressed with --quiet.
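As an alternative to suppressing the messages, the list of names handed to genounzip can be restricted to the .genozip members. The following is a minimal sketch, not part of Genozip: the helper name list_genozip_members is hypothetical, and it assumes a tar implementation whose "tar tf" prints one member name per line (as GNU tar does).

```shell
# Hypothetical helper (not part of Genozip): print only the archive members
# whose names end in .genozip. "tar tf" lists member names without extracting.
list_genozip_members() {
    tar tf "$1" | grep '\.genozip$'
}

# Intended use (shown as comments; assumes mydata.tar was created with genozip --tar):
#   tar xf mydata.tar
#   list_genozip_members mydata.tar | genounzip --files-from - --replace
```

With this approach genounzip only receives names of genozip-format files, so the other members of the tar file are extracted but left untouched.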

​

Example 2
​

Compress all files in a directory and its sub-directories, using --subdirs

​

> genozip --tar mydata.tar --subdirs my-data-dir

​

Example 3
​

Compress all files in a directory and its sub-directories, using a reference file. The reference file is utilized to significantly improve the compression of FASTQ, BAM and VCF files and ignored for other file types.

 

Note: This will fail if any VCF or BAM files are incompatible with the reference provided. 

​

Note: the reference file itself may optionally be one of the files included in the tar file. If so, it is better to generate it in advance with genozip --make-reference and provide the .ref.genozip file (rather than the .fa.gz file) on the command line.

​

> genozip --tar mydata.tar --subdirs my-data-dir --reference hs37d5.fa.gz

​

- or -

 

> genozip --tar mydata.tar --subdirs my-data-dir --reference my-data-dir/hs37d5.ref.genozip

​

Example 4
 

Compress and archive all BAM files in the current directory and its sub-directories, preserving the directory structure:

​

> find . -name "*.bam" | genozip --tar mydata.tar --files-from -
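Before running the command above on a large directory tree, it can be useful to preview exactly which paths find will hand to genozip. The following is a minimal sketch: the helper name preview_bams is hypothetical, and the added "-type f" and "sort" are assumptions for convenience (to skip directories that happen to match the pattern and to make the listing stable), not something Genozip requires.

```shell
# Hypothetical helper: print the BAM file paths under a directory
# (default: the current directory), one per line, in sorted order.
preview_bams() {
    find "${1:-.}" -type f -name "*.bam" | sort
}

# Intended use (shown as comments):
#   preview_bams                                             # inspect the list first
#   preview_bams | genozip --tar mydata.tar --files-from -   # then archive it
```

Because the paths are printed relative to the starting directory, the same relative paths are what genozip reads from standard input.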

​

Implementation note: Genozip implements the IEEE 1003.1-1988 (“ustar”) standard of tar files, with the size field in binary format for files of 8 GB or larger. The GNU-tar LongLink extension is used for file names longer than 99 characters. This is compatible with most modern tar implementations, including GNU tar.
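To confirm that a tar file carries a ustar-style header, the magic field of its first header block can be inspected. In the ustar header layout this field starts at byte offset 257 and begins with the characters "ustar". The following is a minimal sketch; the helper name tar_magic is hypothetical and it only reads the first five bytes of the field.

```shell
# Hypothetical helper: print the first 5 bytes of the magic field of the
# first header block in a tar file (byte offset 257 in the ustar layout).
tar_magic() {
    dd if="$1" bs=1 skip=257 count=5 2>/dev/null
}

# Intended use (shown as a comment):
#   tar_magic mydata.tar
```

This checks the header format only. It does not validate the rest of the archive or the genozip files inside it.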

​
