Genozip telemetry service
(compression logs)
When you use a free version of Genozip, each time a file is compressed with genozip, a small record is uploaded to and logged on the Genozip server.
The record contains only aggregate statistics about the performance of our compression methods, plus the associated metadata specified below.
We use these logs primarily to catch problems with the quality of the compression, as well as to identify specific combinations of sequencing technologies, base callers, aligners, variant callers, and study types in which Genozip is not doing as well as it should.
We also use these logs to provide you with technical support if you run into issues.
Note: if you have a paid license (i.e. Standard, Enterprise, Premium or Paid Academic), you may opt in to logging. Doing so greatly helps us improve Genozip for your specific use case.
Retention policy: logs may be retained indefinitely, or may be deleted when no longer needed, when required by law or regulation, or at the user's request. To request deletion or to receive a copy of the records submitted under your license, please email support@genozip.com.
The structure of a log record is illustrated by the following example (one record per file compressed). This is the structure used by the most recent version of Genozip, and it may continue to evolve as Genozip develops.
If you have concerns regarding our logging, please feel free to contact us at support@genozip.com to discuss.
Field name | Example | Notes |
---|---|---|
contexts | DIVRQUAL,QUAL,27.8%,NONE,27.8%,N/A,0.0%,0,0.0%,; NONREF,SEQUENCE,5.9%,NONE,5.9%,N/A,0.0%,0,0.0%,; QUAL,QUAL,5.3%,NONE,5.2%,N/A,0.0%,1,0.0%,; SQBITMAP,SEQUENCE,4.8%,NONE,4.2%,RANB,0.6%,2,0.0%,; Q5NAME,QNAME,4.4%,BSC,0.0%,LZMA,4.0%,3536,0.4%,; Q6NAME,QNAME,4.0%,BSC,0.0%,BSC,3.7%,2304,0.3%,; DOMQRUNS,QUAL,18.5%,NONE,18.5%,N/A,0.0%,3,0.0%,; Q4NAME,QNAME,3.7%,BSC,0.0%,ARTw,3.6%,936,0.1%,; P2NEXT,PNEXT,3.0%,BSC,0.0%,BSC,2.9%,822,0.1%,; XS:i,XS:i,2.4%,NONE,2.4%,N/A,0.0%,1,0.0%,; CIGAR,CIGAR,2.3%,BSC,0.6%,BSC,1.5%,853,0.1%,; AS:i,AS:i,1.5%,NONE,0.6%,ARTb,0.9%,9,0.0%,; TLEN,TLEN,1.2%,ARTB,0.2%,ARTB,1.0%,16,0.0%,; P0OS0,POS,1.1%,BZ2,0.0%,ARTW,1.1%,170,0.0%,; Q2NAME,QNAME,0.8%,ARTB,0.0%,RANB,0.8%,5,0.0%,; Q1NAME,QNAME,0.8%,NONE,0.8%,N/A,0.0%,0,0.0%,; Q3NAME,QNAME,0.7%,NONE,0.7%,N/A,0.0%,0,0.0%,; QNAME,QNAME,0.6%,ARTB,0.0%,RANB,0.6%,3,0.0%,; F0LAG0,FLAG,0.6%,ARTB,0.0%,ARTB,0.5%,26,0.0%,; | Aggregate statistics of contexts. For each context: its name, parent name, % of genozip file, codec of local data, % of genozip file of local data, codec of b250 data, % of genozip file of b250 data, number of words in dictionary, % of genozip file of dictionary |
data_type | BAM | |
environment | OS=Windows_10.0.22000; cores=8; physical_GB=16; runtime=0h1'23"; dist=conda; n_files=3; remote=0.0.0.0; local=174.22.10.11; glibc=2.27; filesystem=NTFS | Compute environment, distribution, genozip runtime, number of files compressed in this execution, and local and remote IP addresses |
features (--make-reference) | VBs=2998 X 1.0 MB; num_contigs=24; num_bases=3145129148; | Features of the file |
features (BED) | columns=10;sorted; | Features of the file that affect compression. |
features (FASTA) | Nucleotide_bases;num_sequences=12311; | Features of the file that affect compression. |
features (FASTA) | VBs=196 X 16.0 MB; num_lines=12167; Nucleotide_bases; segconf.line_len=1649; | Features of the file that affect compression |
features (FASTQ) | VBs=531 X 16.0 MB;num_lines=46009532;Qname=Illumina-old/;segconf.line_len=194;segconf.longest_seq_len=76;Sequencer=Illumina;ref_nbases=2542341441;ref_ncontigs=25; | Features of the file that affect compression |
features (GENERIC) | VBs=1 X 16.0 MB; magic="MZ??????????????????????@???????" 4D.5A.90.00.03.00.00.00.04.00.00.00.FF.FF.00.00.B8.00.00.00.00.00.00.00.40.00.00.00.00.00.00.00; extension="exe"; segconf.line_len=0; | Features of the file that affect compression. "magic" is the first 32 bytes of the file; "extension" is the component of the filename following the final ".", but if it is 'gz', 'bz2', 'xz' or 'zip', the before-last component is included too. |
features (GFF) | num_fasta_sequences=1 | Features of the file that affect compression. |
features (SAM/BAM) | VBs=4 X 28.1 MB; num_lines=99909; hdr_contigs=86 (3137454505); ref_contigs=298 (3235006512); Sorted; Mapper=dragen; Paired-End; sag_type=BY_SA; mate=49%; saggy_near=0%; prim_far=0.01%; Qname=Illumina; segconf.line_len=344; segconf.longest_seq_len=151;bisulfite; | Features of the file that affect compression |
features (VCF) | VBs=7 X 32.0 MB; num_lines=6907; num_samples=722; GVCF; segconf.line_len=20964; hdr_contigs=86 (3137454505); ref_contigs=298 (3235006512) | Features of the file that affect compression |
features (reference file) | ref_contigs=298 (3235006512) | Features of the file that affect --make-reference. |
fields_gain | QUAL,53.1%,15.0X; QNAME,18.1%,11.4X; SEQ,16.0%,49.9X; PNEXT,3.2%,11.5X; CIGAR,2.4%,12.1X; AS:i,1.4%,14.9X; TLEN,1.2%,19.1X; POS,1.2%,30.6X; MAPQ,1.1%,10.4X; FLAG,0.6%,30.1X; XS:i,0.6%,35.3X; TXT_HEADER,0.4%,3.4X; SA:Z,0.4%,2.0X; Other,0.2%,587.9X; RNEXT,0.1%,129.0X; XQ:i,0.0%,2.6X; RNAME,0.0%,466.9X; MD:Z,0.0%,1712.4X; NM:i,0.0%,2365.2X; BAM_BIN,0.0%,0.0X; RG:Z,0.0%,4757.6X | For each field: its name, % of the genozip file which is this field, and compression ratio of the field |
flags | best; optimize; reference=EXTERNAL ; file_i=4/12 | Flags that affect compression |
genozip_gain | 17.5 | Compression ratio of genozip vs the uncompressed source file |
hash_issues | TaOKEN,QNAME,512.0 KB,73%,SRR34514354.57574038,SRR10260032.79514335,SRR10260015.78254568,SRR10260015.71887571,SRR10260013.69705869,SRR10260015.55836671 | In rare cases in which a certain field has statistical properties that cause Genozip to run slowly - 6 example values of the field are sent for diagnosis. |
hash_issues | QNAME,,,,SRR11234134.1 1/2,SRR11234134.2 2/2,SRR11234134.3 3/2,SRR11234134.4 4/2,SRR11234134.5 5/2,SRR11234134.6 6/2 | Read names and other similar fields: in extremely rare cases in which Genozip cannot efficiently parse the string due to unsupported formatting, 6 example values are sent for diagnosis |
hash_issues | A00910:85:HYGWJDSXX:1:1101:3025:1000_1:N:0:CAACGAGAGC+GAATTGAGTG;A00910:85:HYGWJDSXX:1:1101:3025:1000 | when using --deep: the first FASTQ read name and the first BAM QNAME in the respective files. Sent for diagnosis in rare cases in which Genozip cannot make sense of the relationship between them. |
license_num | 442123256 | Genozip license of user |
programs (GFF) | Prodigal; | programs that generated the data - deduced from the data format |
programs (SAM/BAM) | MarkDuplicates;bwa; | programs that generated the data - generated from the ID and PN subfields of the @PG header lines |
programs (VCF) | VarScan2; | programs that generated the data - extracted from the VCF header lines |
qual_acgt (SAM/BAM/FASTQ) | I@A?;:>9786≐<,52――I;:986>7≐5<430/1――I;:97865/3140≐>,――I@A?>;≐:<98HG756 | the most common base quality scores corresponding to each of A,C,G,T in the sequence, in descending order of frequency. |
qual_histo (SAM/BAM/FASTQ) | ISD<72+$ | the most common base quality scores, in descending order of frequency. |
runtime | 0h14'25"; 25,13 | runtime of genozip, and the average number of cores used to compress each component |
timestamp | 1/June/2022 9:54 | Time this record was created |
txt_codec | GZ,GZ | for each component: codec of the source file and if GZIP, also the first GZIP header (typically 12-32 bytes) excluding the FNAME field, and the block sizes of the first 2 GZIP blocks. |
txt_size | 8234126873,3463182091 | Sizes of the file(s) prior to Genozip compression: the size after removal of source compression (e.g. .gz), and the original size (i.e. with source compression). |
user_host | john@lab | User and host running genozip |
version | 15.0.25 | Version of Genozip used |
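
For readers who receive a copy of their records and want to inspect them, the following is a minimal sketch of how two of the field layouts described above could be parsed. It assumes only what the table shows: `fields_gain` entries have the form `NAME,PCT%,RATIOX` separated by `;`, and fields such as `environment` and `features` are `;`-separated `key=value` items that may include bare flags (e.g. `Sorted`). The function and class names (`parse_fields_gain`, `parse_key_values`, `FieldGain`) are hypothetical and are not part of Genozip. Since the record structure may change between versions, treat this as illustrative only.

```python
from typing import NamedTuple


class FieldGain(NamedTuple):
    """One entry of the fields_gain field (hypothetical helper type)."""
    name: str
    pct_of_file: float  # % of the genozip file occupied by this field
    ratio: float        # compression ratio of the field


def parse_fields_gain(value: str) -> list[FieldGain]:
    """Parse 'NAME,PCT%,RATIOX; NAME,PCT%,RATIOX; ...' into records."""
    records = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        # Split from the right so the last two items are always pct and ratio.
        name, pct, ratio = entry.rsplit(",", 2)
        records.append(
            FieldGain(name, float(pct.rstrip("%")), float(ratio.rstrip("X")))
        )
    return records


def parse_key_values(value: str) -> dict[str, str]:
    """Parse 'key=value; flag; key=value' into a dict; bare flags map to ''."""
    result = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        result[key.strip()] = val.strip()
    return result


if __name__ == "__main__":
    gains = parse_fields_gain("QUAL,53.1%,15.0X; QNAME,18.1%,11.4X; SEQ,16.0%,49.9X")
    for rec in gains:
        print(rec.name, rec.pct_of_file, rec.ratio)

    env = parse_key_values("OS=Windows_10.0.22000; cores=8; dist=conda")
    print(env["cores"])
```

Values are kept as strings in `parse_key_values` because items such as `hdr_contigs=86 (3137454505)` or `VBs=4 X 28.1 MB` are not plain numbers; any further interpretation is left to the caller.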