ENCODE DNA Data Project, Inspired and Built By Linux

Scientists celebrated a breakthrough in their understanding of the human genome this month – the results of a large collaborative project driven by big data and built with Linux.

On Sept. 5, Nature and two other scientific journals simultaneously published 30 papers with the results of the ENCODE (Encyclopedia of DNA Elements) Project. The five-year project involved nearly 450 scientists from 30 institutions around the globe and produced a wealth of data on how and when genes are regulated.

Their discoveries will serve as the basis for further biological research and advances in medical care.

The project’s success also makes it a model for big data collaborations and scientific analysis, said Mark Gerstein, a bioinformatics and computer science professor at Yale University and a lead researcher on the ENCODE studies. The papers provide documentation of what’s possible with modern data technology and computational methods – not to mention the collaborative process.  (For more in-depth analysis, see ENCODE Consortium coordinator Ewan Birney’s Nature article, Lessons for Big Data Projects.)

“The open source movement was a big inspiration for the genomics community,” Gerstein said. “The genomics world grew up with Linux.”

Built on Linux

Though ENCODE computing and data storage are scattered among the various institutions involved in the project, the Data Coordination Center at the University of California, Santa Cruz is the main repository for the results collected in the ENCODE project studies.

The center keeps roughly 50 Terabytes of nicely packaged and compressed data available for public download online, as well as 200 Terabytes of uncompressed raw data, said Jim Kent, a bioinformatics researcher at UCSC who runs the ENCODE project’s data center.

They use a computer cluster running CentOS and IBM’s GPFS (General Parallel File System), an enterprise storage management system originally developed for large multimedia files that also works well for genomics files, he said. Bonus: It’s free for academic use.

“It’s proven very robust,” Kent said.

In addition to the storage systems, the lab has a compute cluster with 1,000 CPU cores across 256 machines. Key to their computing efficiency is the job scheduler, Parasol, developed in-house especially for running the same DNA sequencing program hundreds of thousands of times on the same data. It’s available for free in portable C code, Kent said.

“It has a lot of steps to be robust when nodes fail and it’s been quite useful,” he said. 
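To illustrate the general idea of running one program over a large batch of inputs and resubmitting the jobs that fail, here is a minimal Python sketch. It is not Parasol and does not use Parasol's interface; the function name `run_batch` and its parameters are hypothetical, and local threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(inputs, task, workers=4, max_attempts=3):
    """Apply `task` to every input, resubmitting failed jobs.

    Hypothetical helper for illustration only. Inputs must be hashable.
    Returns (results, still_failed).
    """
    results = {}
    pending = list(inputs)
    for _ in range(max_attempts):
        if not pending:
            break
        failed = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Submit every pending job, then collect results as they finish.
            futures = {item: pool.submit(task, item) for item in pending}
            for item, future in futures.items():
                try:
                    results[item] = future.result()
                except Exception:
                    # Treat any error as a failed node; retry on the next pass.
                    failed.append(item)
        pending = failed
    return results, pending
```

A caller would pass the list of inputs and the function to run on each one, then inspect the second return value for jobs that never succeeded within `max_attempts` passes.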

Biology of ENCODE

ENCODE was made possible, in large part, by rapid advances in big data processing and DNA sequencing technology over the past five years. It also builds on work completed in 2001 by the Human Genome Project to sequence all 3 billion chemical base pairs in human DNA.

With the DNA sequence in hand, ENCODE set out to map the functions of all those bases. Or, put in programming terms, ENCODE was interested in the logic, not the straight-line code of the DNA, Kent said.

A decade ago it was thought that the main function of DNA was to code for proteins – the chemical dictators of cellular activity. But only about 1 to 1.5 percent of DNA consists of genes that actually code for proteins, he said. The rest of the genome is either devoted to regulating those genes or it’s so-called junk DNA, evolutionary relics that don’t have a current function.

Researchers in the ENCODE project sorted through the DNA in short segments, called “reads,” of about 35 to 75 base pairs, to identify function and map each read to its place in the genome. Sequencing machines produce millions of these reads at a time, creating a massive pile of data to comb through.

Read mapping was just the first of three steps in data analysis, but it required the most intensive compute power.  The results were compiled into UCSC’s central repository.
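As a rough illustration of what mapping a read to its place in the genome means, the following Python sketch indexes a toy genome by short substrings and looks up where a read matches exactly. It is a simplified assumption, not the method used by ENCODE: the function names are hypothetical, and real read mappers tolerate mismatches and use far more compact indexes.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-length substring of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def map_read(read, genome, index, k):
    """Return genome positions where the whole read matches exactly."""
    seed = read[:k]  # look up the first k bases, then verify the rest
    hits = []
    for pos in index.get(seed, []):
        if genome[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


genome = "ACGTACGTTAGC"  # toy sequence; a real genome has about 3 billion bases
index = build_index(genome, k=4)
print(map_read("ACGTTAGC", genome, index, k=4))  # prints [4]
```

The index is built once and reused for every read, which is why this kind of lookup lends itself to being run as many independent jobs across a cluster.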

“Before ENCODE we’d identified less than 1 percent of the regulatory regions,” Kent said, “and with ENCODE we’re close to 75 percent.”

The ENCODE database now serves as an annotated map of the genome. It’s a framework for future discoveries built on Linux and assembled through a large collaborative effort inspired by Linux and the open source community.

Editor’s Note: Mark Gerstein is a professed Linux fan. Check out his 2010 analysis comparing the evolution of the Linux call graph to that of the genome – what he calls the “operating system of a living organism.”  

https://www.youtube.com/watch?v=Y3V2thsJ1Wc