top of page

Example of a FASTQ to BAM pipeline using Genozip

Overview

​

This pipeline starts with a .fq.genozip file, cleans it up with fastp, maps it with bwa mem, sorts it with bamsort, removes duplicates with bamstreamingmarkduplicates and finally outputs a .bam.genozip file.

​

This pipeline has Genozip files on both ends, and uses no intermediate files, so it is efficient in storage usage.

​

Genozip adds very little overhead, as its CPU consumption is relatively minor in comparison with the tools used in this pipeline, in particular bwa mem.

​

Notes

​

• The input fastq.genozip files were originally created from pairs of fq.gz files:

genozip sample01-R1.fq.gz sample01-R2.fq.gz --pair

 

• Since the .fq.genozip file was generated with genozip --pairgenocat outputs FASTQ data in interleaved format, i.e. reads from R1 interleaved with reads from R2, which is the format expeceted by fastp --interleaved_in.

​

• This script is designed to have multiple instances of it running in parallel. Each instance will work on different files.

​

• Technical note: ${out}.doing_now is used as a mutex and touch/rm as mutex_lock/unlock. Since touch is not atomic, exclusivity is not 100% guarateed, but in practice it seems to work.

​

Script
​

#!/bin/bash

ref=GRCh38_full_analysis_set_plus_decoy_hla.fa
study=mystudy
fastq=myfastqdir
mapped=mymappeddir

files=($fastq/*.genozip)

processed=1

while (( processed == 1 )); do
    processsed=0
    for file in ${files[@]}
    do
        sample=`echo $file | grep -o -E sample'[[:digit:]]{2}'`  # convert the file name to a sample name
        out=$mapped/${sample}.bam.genozip

        if [ -f $out ]; then continue; fi # already processed
        if [ -f ${out}.doing_now ]; then continue; fi # another instance of this script is working on it

        processed=1
        touch ${out}.doing_now

        echo =========================================
        echo Sample $sample
        echo =========================================

        ( genocat $file -e ${ref%.fa}.ref.genozip || >&2 echo "genocat exit=$?" )|\
        ( fastp --stdin --interleaved_in --stdout --html ${fastq}/${sample}.html --json ${fastq}/${sample}.json || >&2 echo "fastp exit=$?"   )|\
        ( bwa mem $ref - -p -t 54 -T 0 -R "@RG\tID:$sample\tSM:$study\tPL:Illumina" || >&2 echo "bwa exit=$?"     )|\
        ( samtools view -h -OSAM || >&2 echo "samtools exit=$?")|\
        ( bamsort fixmates=1 adddupmarksupport=1 inputformat=sam outputformat=sam inputthreads=5 outputthreads=5 sortthreads=30 level=1 || >&2 echo "bamsort exit=$?" )|\
        ( bamstreamingmarkduplicates inputformat=sam inputthreads=3 outputthreads=3 level=1                     || >&2 echo "bamstreamingmarkduplicates exit=$?" )|\
        ( genozip -e $ref -i bam -o $out || >&2 echo "genozip exit=$?" )

        rm ${out}.doing_now
    done
done

bottom of page