Dylan Storey

Recovering academic, hacker, tinkerer, scientist.

Randomly Sub Sampling FASTQ Files

Randomly Sub Sampling FASTQ files using Perl.

Introduction

For whatever reason people seem to think that more sequencing is better when it comes to de-novo assembly. While more can be better , its better to have more insert libraries not just a full chip of sequencing for a relatively small genome. (17Gb of sequencing for a genome thats probably under 100Mb is over kill.) I’ll need a method of getting a subset of these sequences for now this program will only handle paired end sequencing data.

Problem :

Given 2 fastq files print a randomly selected (without replacement) set of reads to two new files.

Assumptions:

Files will be of the fastq format where each record consists of four lines :

  • Header
  • Sequence
  • Secondar Header
  • Phred Quality Scores

Files are paired with each files reads will be in the same order.

Considerations

  • I don’t want to hold the complete Fastq in memory so I need to be able to work on a stream
  • I want to ensure that I only have to loop over a file once so I need to select my records prior to opening and printing records

Problem One : Creating A Unique Random Set of Numbers

My initial thoughts were to do something like this:

my @random_numbers;
my %seen;
while (@records < $wanted_records){ #while the array is holding less than the number of records I want
    my $tmp = int rand ($total_number_records); #get a random number
    if (!(exists $seen{$tmp})){ #haven't seen it
        push @records tmp; # push it on the list
        $seen{$tmp}++; #create a key
        }
    } 

It became pretty clear to me that while this would be sufficient for few records and large pools to pool from, it was going to get slow fast if I needed a lot of records from an only slightly larger pool. Oh and while I think Perl’s RNG isn’t very random , if it truly was we could never guarantee that the loop would terminate. With some digging I found this handy little one liner that uses the List::Util package.

use List::Util 'shuffle';
my @numbers = (shuffle(1..$max_records)[1..$wanted_records];

This shuffles an array of numbers from 1..x in value , and then takes the slice of the first n values. Quick to write , quick to run , and leverages standard modules which is always a plus.

##Problem Two : Processing Complete Fastq Records as a Stream

I had a couple ideas for doing this. Initially I had thought to leverage the $/ variable to split records based on the “@” symbol that precedes the first character of the header file. Unfortunately these are also littered around the Phred score line as well so that was a no go. (Something like this here.) I did look high and low for a non Bio::Perl Fastq iterator package and found one here .Unfortunately it’s not a part of CPAN and I don’t like using packages that would require me to bundle or point people in the right direction. Between this and some extra reading I was able to figure out a really basic iterator.

#!/usr/bin/perl
use warnings;
use strict;
while (my $entry = $fastq_iterator->()){
        do_something_with_entry();
        }
exit;
 
sub iterator{
    my $file_handle = shift // die $!;
    return sub{
        my $record .= readline ($file_handle) // return;
        $record .= readline ($file_handle);
        $record .= readline ($file_handle);
        $record .= readline ($file_handle);
         
        return $record;
        };
     
    }

There is a nice write up on iterators in Perl over here. I think we’ll be getting this into the curriculum for next year ! This iterator takes a file_handle and returns Fastq records by fetching four lines from the file at a time. Pretty simple and almost no error checking.

#!/usr/bin/perl
 
use warnings;
use strict;
use Getopt::Long;
use List::Util 'shuffle';
use File::Basename;
 
my $total_reads = 0;
my $desired_reads = 0;
my $paired_end = 1;
my $pair_1 = '';
my $pair_2 = '';
my $single_end;
my $interleave_out = 0;
my $help = 0;
 
GetOptions( "total=i"   =>   \$total_reads,
            "desired=i" =>   \$desired_reads,
            "paired"  => \$paired_end,
            "1=s"     =>     \$pair_1,
            "2=s"         => \$pair_2,
            "interleave_out=i" => \$interleave_out,
            "help|?" => \$help,
);
 
print_usage() if $help;
 
 
if ($paired_end == 1) {
    die "file $pair_1 doesn't exist" if (! -e $pair_1);
    die "file $pair_2 doesn't exist" if (! -e $pair_2);
    if ($total_reads == 0){
        my $total_reads_1 = $1 if (`wc -l $pair_1` =~ /^(\d+)/);
        my $total_reads_2 = $1 if (`wc -l $pair_2` =~ /^(\d+)/);
         
        #File Sanity Checks# ( Bare minimum )
        die "files not of same length" if ($total_reads_1 ne $total_reads_2);
        die "Malformed Fastq $pair_1" if ($total_reads_1 % 4 ne 0);
        die "Malformed Fastq $pair_2" if ($total_reads_2 % 4 ne 0);
         
        $total_reads = ($total_reads_1 + $total_reads_2) / 4 ; #total records = total lines divide 4
        }
         
## PRINT OPTIONS TO SCREEN #    
    print "Current Options Are:
    Total Reads: $total_reads
    Desired Reads: $desired_reads
    ";
    print"Paired end reads:\t$pair_1 \t $pair_2 \n" if $paired_end == 1;
    print"Single end reads:\t$single_end" if $paired_end == 0;
 
## Get our Randomly selected record numbers 
    $desired_reads = $desired_reads/2 ; # taking half from each file.
    $total_reads = $total_reads/2;
    open(TMP ,'>','temp.txt') || die $!;
    map {print TMP "$_\n"} sort {$a <=> $b}  (shuffle(1..$total_reads)) [0..($desired_reads-1)]; ## sorted list of records desired reads long from a list of numbers containing all reads.
    print TMP "-1"; # padding by one;
    print "Generated Random Numbers\n";
    close TMP; open (TMP ,'<', 'temp.txt') || die $!;
     
     
    open (my $FH1 , '<' , $pair_1) || die $!;
    open (my $FH2 , '<' , $pair_2) || die $!;
     
    my $base_name  = basename($pair_1 , qw".fastq .fq  .fq33  .fq64");
    my $base_name2  = basename($pair_2 , qw".fastq .fq  .fq33  .fq64");
     
    open (ONEOUT , '>' ,  $base_name.".subfq") || die $!;
    open (TWOOUT , '>' ,  $base_name2.".subfq") || die $!;
 
    my $fastq_iterator = iterator($FH1);
    my $fastq_iterator_2 = iterator($FH2);
     
    my $entry = $fastq_iterator->(); 
    my $entry2 = $fastq_iterator_2->();
     
    my $counter = 1;
     
    my $sample_num = readline(TMP);
 
     
     
     
    while ($sample_num != -1){
     
    if (($sample_num == $counter)){
        print  ONEOUT $entry;
        print  TWOOUT $entry2;
        $sample_num = <tmp>;
        }
     $entry = $fastq_iterator->(); 
     $entry2 = $fastq_iterator_2->();    
         
    $counter++;
    }
     
     
     
unlink 'temp.txt';
close $FH2;
close $FH1;
close ONEOUT;
close TWOOUT;
 
if ($interleave_out == 1){
    open (ONE , '<' , $base_name.".subfq") || die $!;
    open (TWO , '<' , $base_name2.".subfq") || die $!;
    open (MIX , '>' , $base_name."_inter.subfq") || die $!;
 
    select MIX;
    while (<one>){
        print $_;
        $_ = <one>;
        print $_;
        $_ = <one>;
        print $_;
        $_ = <one>;
        print $_;
         
        $_ = <two>;
        print $_;
        $_ = <two>;
        print $_;
        $_ = <two>;
        print $_;
        $_ = <two>;
        print $_;
        }
    }
}   
     
     
     
else{
    #implement single end;
    }
 
 
exit;
 
 
 
sub iterator{
    my $file_handle = shift // die $!;
    return sub{
        my $record .= readline($file_handle) // return;
        $record .= readline($file_handle);
        $record .= readline($file_handle);
        $record .= readline($file_handle);
        return $record;
        };
    }
     
 
 
 
 
sub print_usage{
    print "\nUsage is ./SubSample_Fastq <options> \n";
    print "Available Options:
    --total <int>\t total reads to select from [default is calculated from the file(s)]
    --desired  <int>\t desired number of reads 
    --paired <1|0>\t paired end reads [default ON (1)]
    -1 <string>\t file name forward
    -2 <string>\t file name in reverse
    --interleave_out <1|0> interleave the output file\n
    ";
    exit;   
    }
blog comments powered by Disqus