Geneticists cut transfer times by two thirds using Globus to move data between a sequencing lab and a campus HPC cluster

Genetics researchers at the Semel Institute's Center for Neurobehavioral Genetics are using the vervet as a model organism to study behavioral traits. The vervet is the second Old World monkey to be sequenced (the first being the rhesus macaque) and, unlike the great apes, most of which are near extinction, it is widely available* for biomedical research. Because it is genetically closer to humans than other model organisms such as the mouse, the vervet is a strong genetic model for studying high-level cognitive and behavioral traits, such as novelty seeking and intruder avoidance, as well as primate-specific diseases, such as infection with simian immunodeficiency virus (SIV). The large pedigree of vervet monkeys housed in the Vervet Research Colony (VRC) in North Carolina provides a direct genetic resource for studying various phenotypes in a controlled environment. The Center's current focus is to sequence a large number of VRC monkeys with well-characterized and highly heritable phenotypes, and to carry out genetic mapping to find the genetic loci underlying these phenotypes.

The challenge is the sheer size of the sequence data for these vervets. The average genome is about 3 billion bases (roughly 3 GB at one byte per base), and each genome is sequenced 10X to ensure data quality. Each base also has a sequencing quality score attached, effectively doubling the volume of data. Thus, it takes 60 GB (3 GB × 10 × 2) to store one uncompressed genome. As of April 2012, the Center was halfway to its final goal of sequencing 723 vervets, with cumulative data totaling 18 TB. From these raw sequences, researchers run a workflow to find genetic variants (the positions where the vervets differ from each other). They discard numerous intermediate files and retain the filtered sequences, the alignment files, and the final genotype calls. When all is said and done, they will have accumulated about 20 TB of compressed data.
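The per-genome arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function and constant names are hypothetical, decimal units (1 TB = 1000 GB) are assumed, and the cohort-wide uncompressed total it prints is derived here for illustration rather than reported by the Center.

```python
# Back-of-the-envelope storage estimate using the figures quoted in the text.
# All names are illustrative; decimal units (1 TB = 1000 GB) are an assumption.

GENOME_GB = 3          # about 3 GB per genome
COVERAGE = 10          # each genome is sequenced 10X
QUALITY_FACTOR = 2     # one quality score per base doubles the volume
TARGET_VERVETS = 723   # the Center's final sequencing goal


def per_genome_gb(genome_gb=GENOME_GB, coverage=COVERAGE,
                  quality_factor=QUALITY_FACTOR):
    """Uncompressed storage for one sequenced genome, in GB."""
    return genome_gb * coverage * quality_factor


def cohort_tb(n_animals, gb_per_genome):
    """Uncompressed storage for n_animals genomes, in TB."""
    return n_animals * gb_per_genome / 1000


if __name__ == "__main__":
    one_genome = per_genome_gb()
    print(f"One genome: {one_genome} GB uncompressed")
    print(f"{TARGET_VERVETS} genomes: "
          f"{cohort_tb(TARGET_VERVETS, one_genome):.2f} TB uncompressed")
```

The uncompressed figure is an upper bound on raw reads only; the roughly 20 TB the team expects to retain is smaller because the kept files are compressed and the intermediate files are discarded.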

Data is stored on the sequencing lab's mini cluster, but the research team wants to leverage the power of the Hoffman2 high-performance computing cluster. This requires constantly moving data back and forth between the lab's mini cluster and Hoffman2. The team had used scp as its primary file transfer mechanism but had to contend with slow performance and inconsistent reliability.

Globus removed this bottleneck by using multiple parallel TCP streams via GridFTP, allowing researchers to achieve much higher data transfer rates between the lab's mini cluster and Hoffman2. The result was an up to threefold boost in transfer speed. "It is a great tool and we hope the Hoffman2 staff will continue supporting it," said one researcher. The impact on science is clear: less time wasted dealing with IT issues and more time spent on real research.

Highlights:

  • "The bottleneck here is that scp (copy through ssh) is simply too slow."
  • "Now that Globus Online is setup, it usually takes one day to finish transferring data between the lab and Hoffman2, compared with three days before."
  • "With the time saved in transferring data, we could spend more time in processing data."
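The "three days before, one day now" figure in the quotes can be read either as a speedup or as a reduction in transfer time. The short Python sketch below shows how the two are related; the function names are hypothetical, and the day counts are the approximate values quoted by the researchers, not measured throughput.

```python
# Relating the quoted transfer times to a speedup and a time reduction.
# Function names are illustrative; inputs are the approximate figures quoted.

def speedup(before_days, after_days):
    """How many times faster the transfer completes."""
    return before_days / after_days


def time_saved_fraction(before_days, after_days):
    """Fraction of the original transfer time that is eliminated."""
    return 1 - after_days / before_days


if __name__ == "__main__":
    before, after = 3, 1  # days with scp vs. days with Globus, as quoted
    print(f"Speedup: {speedup(before, after):.1f}x")
    print(f"Time saved: {time_saved_fraction(before, after):.0%}")
```

Under these figures, a threefold speedup and a two-thirds cut in transfer time describe the same improvement.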

Scientists collaborating on this project include: Nelson Freimer (PI), Yu S. Huang, Vasily Ramensky, Susan Service, Nam Tran, and Anna J. Jasinska from the Center for Neurobehavioral Genetics, UCLA; Wesley Warren and George Weinstock from The Genome Institute at Washington University in St. Louis; and Ken Dewar from McGill University and Genome Quebec Innovation Centre.

* Rhesus is widely available in India, but export restrictions imposed by the Indian government make it a less ideal candidate.