DNAnexus provides genomic storage and analysis tools in the cloud. Techonomy Contributing Editor Adrienne Burke spoke recently with the company’s leadership about what their approach to managing this unique brand of big data means for scientific research, personalized medicine, and individuals who’ve had their DNA sequenced.
DNA sequencing technologies are advancing so rapidly “that it’s ridiculous,” says Andreas Sundquist, CEO of bioinformatics analysis company DNAnexus. “We’re seeing about a tenfold improvement every two years,” he says.
Genomic data is being churned out at a rate that dwarfs the “big data” challenges in other fields, which are experiencing merely the doubling every two years that Moore’s Law predicts. By one estimate, genomic data is growing four times faster than Moore’s Law.
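To make the gap concrete, the short sketch below compounds the two rates quoted above: a tenfold improvement every two years for sequencing, and a doubling every two years for Moore’s Law. The function names are illustrative only, and the sketch does not try to reconcile these figures with the separate “four times faster” estimate.

```python
import math


def growth_factor(multiple_per_period: float, period_years: float, years: float) -> float:
    """Total growth after `years`, given a fixed multiple every `period_years`."""
    return multiple_per_period ** (years / period_years)


def doubling_time_years(multiple_per_period: float, period_years: float) -> float:
    """Years needed to double at the given compounding rate."""
    return period_years * math.log(2) / math.log(multiple_per_period)


if __name__ == "__main__":
    for years in (2, 4, 6):
        sequencing = growth_factor(10, 2, years)  # tenfold every two years
        moore = growth_factor(2, 2, years)        # doubling every two years
        print(f"{years} years: sequencing x{sequencing:,.0f}, Moore's Law x{moore:,.0f}")

    months = 12 * doubling_time_years(10, 2)
    print(f"Sequencing doubles roughly every {months:.1f} months")
```

Over six years, the tenfold rate compounds to a 1,000-fold improvement, against 8-fold for Moore’s Law. Put another way, a tenfold gain every two years is a doubling roughly every seven months.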
The supercomputer and software that handled the first human genome sequence nine years ago are already relics. But Sundquist says even DNA analysis operations that were equipped as recently as two years ago are now obsolete.
To help genomic research operations get off what he calls the infrastructure-upgrade treadmill, Sundquist co-founded DNAnexus in Mountain View, Calif., in 2009, a year after he completed a computer science PhD at Stanford. The company, backed by Google Ventures, lets genomic research operations large and small manage and analyze data without any investment in computing hardware or software.
Relying on Amazon Web Services and Google Cloud, DNAnexus has in three years built a cloud-based platform that hosts a suite of bioinformatics tools and nearly a petabyte of genomic data. Several thousand customers – big academic and pharma researchers as well as smaller scientific operations, clinical labs, and even DIY biologists – have used the service to store, compare, and otherwise analyze data generated by sequencing instruments. Users may upload genomic data in any of a variety of formats (BCL, FASTQ, BAM, VCF, CSV, or TXT) that are interoperable with DNAnexus data analysis and management tools.
Sundquist and Marc Olesen were in New York recently for a cloud computing conference. Olesen, a former senior VP and general manager of McAfee’s Network and Cloud Security, is now DNAnexus president and COO; his role is to show customers how the platform will help them accelerate sequencing projects. Techonomy had this conversation with them.
What is DNAnexus in a nutshell?
We are a platform that does all the data management and the analysis for DNA. Any biologist can log in to DNAnexus through their web browser, upload the data they’re working on, and click a button to analyze it and get their results right there. They can do things like map genomes, find variants and mutations, visualize the data, share that data, and collaborate with others.
It is smarter and more efficient than hosting a server farm. It’s a really fast way to give more people access to those tools. It’s democratizing DNA sequencing.
Who are your users?
We have thousands across a broad spectrum. Academic researchers were first, and now we have customers in biotech, big pharma, core labs, and a lot of clinical and diagnostic testing labs.
We also serve commercial sequencing providers. They’re the companies that are generating the DNA sequences for customers who send them samples. One of the surprising things about some of those companies is that, in a day and age when we get so much of our information through the Internet and when all of the resources we need for data storage and analysis are accessible through a web browser, they ship the customer a hard drive with all the raw data.
We want to bring genomics into the 21st century. We want to put the tools and data online using cloud computing so that a doctor doesn’t have to get access to a giant data center to work on 100 patients.
Is it all about medical genomics, or are other types of genomic research being done on the DNAnexus platform?
We are like an operating system for genomics, agnostic not just to the technology used to sequence the genome, but also to the particular method of interpreting the genome, the organism that’s sequenced, or the tools and datasets users want to use as references. We handle 40 or 50 different reference genomes, including human and all the model organisms, such as mouse and C. elegans, that are used to study human diseases and drug responses.
But today 90 percent of all the data being uploaded by customers are human genomes. Exome sequences are the most common, which means they’re looking for variants just in the one percent of the genome that is thought to harbor most relevant disease-causing mutations. And we have people uploading cancer tumor genomes. They’re looking for variants between the tumor and normal samples in order to better characterize the type of cancer or its progression.
Is sharing data on the cloud more secure than sharing it on discs or hard drives?
In some ways the cloud does present an opportunity to build better security and access controls. If you ship a hard drive and that box falls off the back of a FedEx truck, that’s a concern. That sort of thing can be mitigated with the cloud. One of our core features is to make sure everybody who is sequencing DNA and working with that data has the same level of security and access controls. Whether the data is on the cloud or not, it should be anonymized or tokenized so it’s not personally identifiable.
What do consumers need to understand about the cloud and how it will be useful for storing their genomes?
In the near term it’s plausible that you can get your genome sequenced for a few thousand dollars. More and more there will be people who won’t have the background or expertise to build a data center, access all the tools and data integration, or implement security and privacy controls that they would need to safeguard their own data. The cloud can put them in the driver’s seat.
I can’t help but think of using the cloud as putting my information “out there” instead of keeping it protected on my own hard drive.
We have heard that, and we’d like to reframe that. When we started the company, a lot of people in our field did equate the cloud with open or public. That’s not the case. The cloud is simply tapping into massive data storage and computing resources. It doesn’t mean it’s out there publicly on the web. In fact the data we have is always kept securely and completely in the control of the customers and the users. If they wanted to share publicly they could do so, but most choose not to.
The data is anonymized. There’s no personally identifiable information that attaches you to any particular genome sequence. Think of it as going into a clinic to have your blood drawn: that blood goes somewhere, they do their analysis, it’s stored, and results are sent back to the clinic. This is very similar: You have a sample drawn, the sample goes off and gets sequenced, there’s a report, and the clinic can take action. It’s bar-coded and doesn’t have your name on it.
I’ve heard that moving genomic data onto and off of the cloud is very time consuming. Is this a problem for your customers?
DNA data transfer from and to the cloud is quite manageable nowadays, given ubiquitous access to modern network connections. In fact, streaming a DNA dataset as it is being produced in real time by a modern sequencing instrument requires a lower bitrate than streaming a movie over the Internet!