Dylan Storey

Recovering academic, hacker, tinkerer, scientist.

Calculating IUPAC Codes Perl

This is a pretty common way of classifying and handling character data in programming languages, especially when you know a priori how certain combinations are going to be made. To be honest I’ve seen it for years in code was always kind of confused by it and didn’t really need to think about it until recently for a project.

Let’s say we have a need to determine that we’ve seen a specific combination of letters, and then emit a second classification for that first set. (i.e. If we’ve seen both A and T , emit W).

It’s fairly easy to imagine how we might do this with a series of nested if/else if/ else switches, the problem with this is that it’s clunky to write, clunky to read, and your program will have to actually test each conditional making the set of statements in - efficient. It also makes it further complicated if you have multiple levels of combination that are meaningful.

Instead we can do something that is less intuitive on the surface but allows us to use a smaller set of statements that’s easier to read and a little more efficient computationally.

The first thing we do is we define what each of our character clases and find out what their individual ASCII codes are:

my @possible = qw /A T G C /;
print "Single Values:\n";
map {print $_ . "\t"} @possible; print "\n";
map {print unpack('C*',$_) . "\t"} @possible;
print "\n";

Which outputs:

Single Values:
A	T	G	C	
65	84	71	67	

Next we look at our pairs of values:

print "Pairwise Values:\n";
print "x\t";
map {print $_ . "\t"} @possible; print "\n";

foreach my $j (@possible){
    print "$j\t";
    foreach my $i (@possible){
        print unpack('C*',$i)+unpack('C*',$j) . " \t";
        }
    print "\n";
    }

Which gives us:

Pairwise Values:
x	A	T	G	C	
A	130 	149 	136 	132 	
T	149 	168 	155 	151 	
G	136 	155 	142 	138 	
C	132 	151 	138 	134 	

Third we look at our classes of 3 (which mathmatically is the sum of all codes minus one) :

print "3 way values:\n";
my $N;
map {$N+=unpack('C*',$_)} @possible;
map {print "not$_ \t"} @possible; print "\n";
map {print $N - unpack('C*',$_) . "\t"} @possible;
print "\n";
print "N is $N\n";
Which Gives us:

3 way values:
notA 	notT 	notG 	notC 	
222	203	216	220	
N is 287

Once we have all of this calculated, determine what your overlapping classes outputs are and put it in a switch statement:

sub iupac{
	my @incoming = shift;
	my $sum = 0;

	map {
		if ($_){
			if (defined $_){
				$sum += unpack('C*',$_);
				}
			}
		} @incoming;
	given($sum){
		when(65) {return 'A'}
		when(84) {return 'T'}
		when(71) {return 'G'}
		when(67) {return 'C'}
		when(136) {return 'R'}
		when(151) {return 'Y'}
		when(138) {return 'S'}
		when(149) {return 'W'}
		when(155) {return 'K'}
		when(132) {return 'M'}
		when(222) {return 'B'}
		when(220) {return 'D'}
		when(216) {return 'H'}
		when(203) {return 'V'}
		when(287){return 'N'}
		when(0){return ''}
		default {die $! . Dumper \@incoming}
		}
	}

Full Table from script:

Single Values:
A	T	G	C	
65	84	71	67	
Pairwise Values:
x	A	T	G	C	
A	130 	149 	136 	132 	
T	149 	168 	155 	151 	
G	136 	155 	142 	138 	
C	132 	151 	138 	134 	
3 way values:
notA 	notT 	notG 	notC 	
222	203	216	220	
N is 287
blog comments powered by Disqus