Dylan Storey

Recovering academic, hacker, tinkerer, scientist.

mpileup Formatting and RNA-Seq

Using mpileup output formats and RNA-Seq data.

While working on the haplotype phasing program we ran across some odd basses in the pileup generated from samtools.

PHYCAscaffold_1 218108 A 7 .,…,. #F#EB PHYCAscaffold_1 218175 A 7 »> #F#CF3F PHYCAscaffold_1 218176 A 7 »> #F#CF3F PHYCAscaffold_1 218177 C 7 »> #F#CF3F PHYCAscaffold_1 218178 G 7 »> #F#CF3F PHYCAscaffold_1 218179 T 7 »> #F#CF3F PHYCAscaffold_1 218180 A 7 »> #F#CF3F PHYCAscaffold_1 218181 C 7 »> #F#CF3F PHYCAscaffold_1 218182 G 7 .$,…,.#F#CF3F

The > and < symbols in the mapped reads string correspond to reference skip. One of our initial problems with deciphering this concept was that initially we believed that the normal in-del notations ([+-]\d+[atgc]+) used in the string should be sufficient to code these bases.

Once we thought about the way that TopHat2 encodes reads across junctions (in the CIGAR with the skipped region code) of it became more apparent why these are encoded so much differently.

Essentially it is saying this:

………………………………………. Reference Genome ****** ***** Encoding Regions *********** Sequencing ++++++++++++++NNNNNN+++++++++++++ How CIGAR Maps the Aligned Read

Those Ns in the CIGAR encoded read are the > s and < s in the mpileup. It’s not that they don’t exist because they’re deleted, simply that the data doesn’t support their presence because its been excised from the RNA already.

blog comments powered by Disqus