
Matching contig names of the file to those in the reference file

Data Types

​

SAM, BAM, VCF, GFF3, GVF, 23andMe, and chain files

​

Overview

​

The unfortunate reality of the bioinformatics world is that contigs may appear with different names in different reference files, causing problems in analysis pipelines.

​

Examples:

- chr22 ⇆ 22

- M ⇆ chrM ⇆ MT ⇆ chrMT

- chr21_gl000210_random ⇆ GL000210.1

​

Genozip offers a command line option, --match-chrom-to-reference, that updates the contigs of a file to match those of the reference file.

​

Example
​

In the example below, the contig name 1 is updated to chr1 both in the @SQ line of the SAM header and in the third column of the data line, which is the RNAME field.

​

$ cat example.sam

@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 1 9997 34 10M = 10159 324 CCCTTAACCC FFFFFFFF:F NM:i:2

​

$ genozip example.sam --reference hg19.fa.gz --match-chrom-to-reference
genozip example.sam : Done (1 second, SAM compression ratio: 14.4)

​

$ genocat example.sam.genozip

@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 chr1 9997 34 10M = 10159 324 CCCTTAACCC FFFFFFFF:F NM:i:2

​

​

Fields updated
​

Data type    Fields updated

SAM, BAM     @SQ lines in the file header
             RNAME (column 3) and RNEXT (column 7)
             SA, OA and XA tags - contig names

VCF          ##contig lines in the file header
             CHROM (column 1)

Chain        qName and tName fields

GFF3, GVF    SequenceId (column 1)
             optional chr= attribute

23andMe      Chromosome (column 2)
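To make the SAM row of the table concrete, here is a minimal Python sketch of renaming contigs in the text of a SAM file. It is an illustration only and not Genozip's implementation: the function name rename_sam_line and the mapping dict are hypothetical, fields are assumed to be tab-separated as in the SAM format, and the SA, OA and XA tags are left out for brevity.

```python
from typing import Dict

def rename_sam_line(line: str, mapping: Dict[str, str]) -> str:
    """Rename contigs in one SAM line using a hypothetical old->new mapping."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] == "@SQ":
        # Header: rewrite the value of the SN: tag of an @SQ line.
        fields = [
            "SN:" + mapping.get(f[3:], f[3:]) if f.startswith("SN:") else f
            for f in fields
        ]
    elif not line.startswith("@") and len(fields) >= 11:
        # Alignment: RNAME is column 3 and RNEXT is column 7 (1-based).
        # "*" and "=" are not in the mapping, so they pass through unchanged.
        for col in (2, 6):
            fields[col] = mapping.get(fields[col], fields[col])
    return "\t".join(fields)

mapping = {"1": "chr1"}  # hypothetical result of matching against the reference
header = "@SQ\tSN:1\tLN:249250621"
read = ("A00910:85:HYGWJDSXX:2:2502:31647:7701\t99\t1\t9997\t34\t10M\t=\t"
        "10159\t324\tCCCTTAACCC\tFFFFFFFF:F\tNM:i:2")
print(rename_sam_line(header, mapping))
print(rename_sam_line(read, mapping))
```

Other header lines such as @HD are returned unchanged, and a contig name that is absent from the mapping is kept as-is.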

​

Method of converting contig names

 

Each contig name that appears in the file is looked up in the reference file, both in its unmodified form and in variations of its name. Once a match is found in the reference file, the length of the contig, if known, is compared with the length in the reference to confirm that it is indeed the same contig. If no match is found in the reference file, the contig name is left unchanged.
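The lookup described above can be sketched roughly as follows. This is an illustrative Python sketch, not Genozip's code: match_contig, the reference dict (contig name to length) and the variations callable are all hypothetical names. Two details are assumptions, because the text does not specify them: the unmodified name is tried before its variations, and a candidate whose length differs from the known length is skipped.

```python
from typing import Callable, Dict, Iterable, Optional

def match_contig(
    name: str,
    length: Optional[int],
    reference: Dict[str, int],
    variations: Callable[[str], Iterable[str]],
) -> str:
    """Return the reference's name for `name`, or `name` itself if no match."""
    # Assumption: try the unmodified name first, then each variation.
    for candidate in [name, *variations(name)]:
        if candidate not in reference:
            continue
        # If the file states a length (e.g. LN: in an @SQ line), it must agree.
        # Assumption: a candidate with a different length is skipped.
        if length is not None and reference[candidate] != length:
            continue
        return candidate
    # No match found: the contig name is left unchanged.
    return name

# Hypothetical inputs: a one-contig reference and a trivial variation rule.
reference = {"chr1": 249250621}
add_chr = lambda n: ["chr" + n]
print(match_contig("1", 249250621, reference, add_chr))  # chr1
print(match_contig("1", 1000, reference, add_chr))       # 1 (length mismatch)
```

Passing the variation rule as a callable keeps the lookup separate from the naming conventions, so each convention can be checked on its own.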

​

The variations of the contig names considered are:

 

- For chromosomes with a numeric name (e.g. 22 or chr22) as well as X, Y, W and Z, both forms are considered: with and without a lower-case chr prefix.

​

- NC_000001 to NC_000024 with a version number (e.g. NC_000001.10) are considered equivalent to 1, …, 22, X, Y (with or without a chr prefix).

​

- For the mitochondrial chromosome, four options are considered: M, MT, chrM and chrMT.

 

- For contigs that contain an embedded Accession Number, the following variations are considered:

 

Example contig name          Interpreted as…

chr4_gl383528_alt            Accession Number GL383528 version 1
chrUn_JTFH01001867v2_decoy   Accession Number JTFH01001867 version 2
GL000192.1                   Accession Number GL000192 version 1
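The variations listed above could be generated along these lines. This Python sketch is an illustration under stated assumptions, not Genozip's code: the function names and regular expressions are hypothetical and only cover the examples on this page, the NC_ mapping assumes the human numbering in which NC_000023 is X and NC_000024 is Y, and an accession without an explicit version (as in chr4_gl383528_alt) is taken to be version 1, following the table.

```python
import re
from typing import List, Optional, Tuple

MITO = ["M", "MT", "chrM", "chrMT"]

def simple_variations(name: str) -> List[str]:
    """Alternative spellings of a chromosome name (excluding `name` itself)."""
    # Mitochondrial chromosome: the four spellings are interchangeable.
    if name in MITO:
        return [m for m in MITO if m != name]
    # NC_000001 .. NC_000024 with a version number, e.g. NC_000001.10.
    m = re.fullmatch(r"NC_0000(\d\d)\.\d+", name)
    if m and 1 <= int(m.group(1)) <= 24:
        n = int(m.group(1))
        # Assumption: human numbering, where 23 is X and 24 is Y.
        base = {23: "X", 24: "Y"}.get(n, str(n))
        return [base, "chr" + base]
    # Numeric chromosomes and X, Y, W, Z: toggle the lower-case chr prefix.
    m = re.fullmatch(r"(chr)?(\d+|[XYWZ])", name)
    if m:
        return [m.group(2)] if m.group(1) else ["chr" + m.group(2)]
    return []

def parse_accession(name: str) -> Optional[Tuple[str, int]]:
    """Extract (accession, version) from a contig name, if one is embedded."""
    # Assumed shapes: 2 letters + 6 digits (GL383528) or 4 letters + 8 digits
    # (JTFH01001867), optionally followed by ".N" or "vN".
    m = re.search(
        r"(?:^|_)([A-Za-z]{2}\d{6}|[A-Za-z]{4}\d{8})(?:[.v](\d+))?(?:_|$)", name
    )
    if not m:
        return None
    # Assumption: a missing version means version 1.
    return m.group(1).upper(), int(m.group(2) or 1)

print(simple_variations("chr22"))         # ['22']
print(simple_variations("NC_000001.10"))  # ['1', 'chr1']
print(simple_variations("MT"))            # ['M', 'chrM', 'chrMT']
print(parse_accession("chr4_gl383528_alt"))           # ('GL383528', 1)
print(parse_accession("chrUn_JTFH01001867v2_decoy"))  # ('JTFH01001867', 2)
print(parse_accession("GL000192.1"))                  # ('GL000192', 1)
```

Two contig names whose parsed accession and version are equal, such as chr21_gl000210_random and GL000210.1, would then be treated as the same contig.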

​

Questions? support@genozip.com

​
