Dylan Storey

Recovering academic, hacker, tinkerer, scientist.

Random Sub Sampling FASTQ Revisited

Randomly Sub-setting a FASTQ Files.

Introduction :

Strangely enough the most common thing my website is getting hits for is randomly subsampling fastq files. Either people don’t do this often or everyone who used to need it used an in-house script. (Probably just head   tail at the command line.) I’ve gotten a few emails from people asking if I could possibly get it to speed up and since I’ve been using it frequently I decided to sit down and re work it.

Problems :

Same as before - but this time I can’t use underlying perl voodoo magic to just make a shuffled list. (In all fairness the underlying magic isn’t in Perl , but it is hidden and at the time I wasn’t entire sure how the list was getting shuffled.)

Also - I didn’t want to re invent the wheel by creating my own code for parsing a fastq file in a sane manner.

Solutions :

The fastq file parsing was actually the easiest portion to solve. Heng Li (lh3 on every board I’ve ever lurked and the gentleman who wrote samtools, BWA , and a few other “lesser known” tools) has a great C header file that handles both normally formatted fastq and fasta files.( Source and example usages over here. )

The array shuffling was a little more difficult mainly just because I needed a reasonably fast method. With some lurking the boards over at stack and a little reading I found the Fisher-Yates shuffle. It’s a simple bit of code to implement and its linear complexity!

The relevant bit of code is here:

/* initialize our array */
records = new int[total_records];
int l;
for (l=0; l < total_records ; ++l){
    records[l] = l;
 }
 
    srand (time(NULL));
/*Yates shuffle (random enough for us)*/
for (l = total_records-1; l > 0 ; l--){
    int random = rand() % l; // get a random number between 0 and current max
    int temp = records[random]; // get the value of random array index
    records[random] = records[l]; // set random index to current max record
    records[l] = temp; // set current max to value of random index
}

This just creates an array with random values. So when we loop over a fastq file we check the value of the record stored in the array and only keep those that are less than the max number of records we want to keep. And that’s really it as nuts and bolts go.

Code is over here .

Make sure you have the zlib.h header files on your system and simply type make.

Usage is pretty simple:

./RandomSubFq file1 .. filen -w  -t 

-t is the total number of RECORDS in the file , not required but if you know the information just pass it.

-w is the total number of RECORDS you want out of the file.

You can use either gzip compressed or regular fastq files.

blog comments powered by Disqus