Dylan Storey

Recovering academic, hacker, tinkerer, scientist.

Streaming Folded Fasta Files For Processing

Alright this is probably old news to people out there but after a year of screwing around with weird look behinds I stumbled across a method for processing fasta records as a stream instead of reading them in to a hash before processing. Here’s the snippet :

#!/usr/bin/perl
use warnings;
use strict;
 
my $default = $/; #store our current line terminator
$/ = ">"; # set our line terminator to the header identifier;
 
open ( FASTA , "<", 'YFO.fasta') || die $!;
 
while (){
    chomp $_;
    if ($_ ne''){ #first record will always be blank
         (my $header , my @tmp ) = split(/\n/,$_);
         my $sequence = join('',@tmp);
         &do_work_son($sequence);
         }
     }
 
$/ = $default; # reset your terminator just to be safe - especially if you're doing work on other files.
 
exit;

The codes pretty straight forward , by changing your defined line terminator to something other than the OS defined new line you can trick the interpreter into treating a fasta record into a single line that is terminated with a “>”. Now you simply split on the “\n” character with the first line being header and everything else being sequence. Join your sequence portion into a string. And do work on the record before you move onto the next one.

Or even better you can slurp the whole file in one go and do stuff in parallel with the resulting array.

#!/usr/bin/perl
use warnings;
use strict;
use threads;
 
my $thread_max = 8; #(dont exceed the max threads on your system);
 
my $default = $/;
$/=">";
 
open (FASTA,'<','YFO.fasta') || die $!;
 
my @records = ;
my @executing = ();
while (@records){
    for(my $i =0; $i < $max_threads; $i++){
        if(@records){
            push(@executing , threads->create(\&do_work_on_sequence , pop @records);  #execute a thread if you can;             
            }
        }
    while @executing {
        shift @executing->join();  ## thread clean up
        }
    }
exit;
 
sub do_work_on_sequence{
    chomp $_[0];
    if ($_[0] eq ''){
        return; 
        } 
   (my $header , my @tmp ) = split (/\n/, $_[0]);
    my $sequence = join '', @tmp;
     
    }

Because whats the point of having 8 processors if you’re not using them am I right ?

blog comments powered by Disqus