If ever there was a goal with universally undisputed support, it’s finding a cure for cancer. Making that happen, of course, is the trick, thanks in no small part to the sheer volumes of data involved.
Cancer is not a single disease, George Komatsoulis, deputy director of the National Cancer Institute’s Center for Biomedical Informatics and Information Technology, told Linux.com.
Though it’s traditionally been categorized based on where in the body it appears, “the trouble is that it’s not like an infectious disease” with a single infectious agent, Komatsoulis explained. “Here, there are dozens if not hundreds of possible mutations. Pragmatically, we need to molecularly categorize these tumors — and lots of them — so we can start classifying the ones that will respond well to treatment.”
Translate that imperative into IT terms and what do you get? That’s right: a whole lot of data.
2.5 Petabytes of Data
Back in 2005, NCI and the National Human Genome Research Institute launched The Cancer Genome Atlas, a project to collect detailed molecular characterizations from 11,000 tumors, among other data. By the end of fiscal year 2014, some 2.5 petabytes of data will have been collected.
That’s surely a good thing, but here’s the catch: Only researchers at the wealthiest institutions can afford to access it.
“It costs millions of dollars to store that much data, even before you put in the computing resources to analyze it,” Komatsoulis explained — and that’s to say nothing of how long it would take to download the data from the Genome Atlas repository even under the best bandwidth conditions. It’s a download you’d measure “with a calendar, not a stopwatch,” he pointed out.
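To put the "calendar, not a stopwatch" remark in rough numbers, here is a minimal back-of-the-envelope sketch in Python. It takes the 2.5-petabyte figure cited above; the link speeds, the decimal definition of a petabyte and the `efficiency` parameter are illustrative assumptions, not figures from the NCI, and the function name `transfer_days` is hypothetical.

```python
# Rough transfer-time estimate for a 2.5 PB data set.
# Assumptions (not from the article): decimal petabytes, sustained link
# speeds of 0.1, 1 and 10 Gbps, and no protocol overhead by default.

PETABYTE_BYTES = 10**15   # decimal petabyte; a binary pebibyte is about 12.6% larger
SECONDS_PER_DAY = 86_400


def transfer_days(size_petabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Return the days needed to move size_petabytes over a link_gbps link.

    efficiency is the fraction of nominal bandwidth actually achieved
    (a hypothetical knob for modeling real-world overhead).
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    total_bits = size_petabytes * PETABYTE_BYTES * 8
    bits_per_second = link_gbps * 10**9 * efficiency
    return total_bits / bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Print the estimate for a few assumed link speeds.
    for gbps in (0.1, 1, 10):
        print(f"{gbps:>5} Gbps: {transfer_days(2.5, gbps):,.0f} days")
```

Under these assumptions, a sustained 1 Gbps connection would need roughly 231 days to pull down the full 2.5 petabytes, and even a 10 Gbps link would need about 23 days, before any storage or analysis costs are counted.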
The Cancer Genomics Cloud
Enter the NCI’s proposed Cancer Genomics Cloud, an effort now being planned not just to make this treasure trove of data available to researchers at a more reasonable cost but also, on a higher level, to democratize that access.
“The situation right now is that if a smart graduate student comes along with a good idea, they have to go to their institution and ask for a couple million dollars in IT equipment and then take a year to get an answer,” Komatsoulis said. “That’s not where we want to be.”
Toward that end, the NCI has issued a Broad Agency Announcement for three Cancer Genomics Cloud pilot tests, for which it plans to award contracts this year. Those pilots will then be evaluated and used to figure out what a production-scale cloud would look like.
The overriding goal for the cloud-based initiative is that researchers will be able to access the data via a Web browser and analyze it remotely. A standard application programming interface and analysis tools are planned to help make that happen.
Open Source Licenses
While it’s too early to say what types of technologies might be involved in the end result, the NCI began with a relatively open approach, including tapping the IdeaScale crowdsourcing platform for public input on the technology’s requirements.
It has also chosen not to constrain the technology solution: public clouds, private clouds and dedicated hardware, using open or closed source software, can all be acceptable.
“We are tech-agnostic,” Komatsoulis explained. “We’re asking groups out there to propose their best tech solutions, and we’ll see which work best.”
That said, there is a requirement that designs be released to the government under a non-viral open source license. Any new software must be developed and distributed under a non-viral open source license as well. Commercial items will be permitted as long as they are available under standard commercial terms.
The NCI wants to ensure that any third party can build a replica of any of the clouds if that design meets their needs, Komatsoulis noted. It also wants to see the clouds developed as pre-competitive technology that can be reused for commercial or non-commercial, open or closed source derivative works. Avoidance of vendor lock-in is another goal.
A Significant Role
Cloud computing represents “a way for large enterprises and service providers to reduce IT waste and increase speed and efficiency, so it is no surprise to see the NCI doing the same thing, particularly as it confronts the vast amounts of Big Data involved in its work,” Jay Lyman, senior analyst for enterprise software with 451 Research, told Linux.com.
“In terms of open source software, it has taken time for open source to become part of the health care and research technology, but it is increasingly on the radar of health IT organizations,” Lyman added. “In addition, the NCI work involves cloud computing and creating an efficient API, which means open source software will likely play a significant role since it is a huge part of cloud and API creation, management and integration.”
Ultimately, the NCI’s overriding goal is to create a brighter future for those afflicted with cancer.
“The nation’s cancer patients shouldn’t have to wait because of technological limitations,” Komatsoulis concluded. “While this won’t remove all the limitations, I think it will be a good start.”