
Challenges for Genomics in the Age of Big Data

(Image via Shutterstock)

Scientists don’t dabble much in predictions. They’re comfortable with data and with facts they can observe. So when they do speculate, it’s worth paying attention.

Last week, a group of respected researchers published a commentary about the coming data challenges in genomics. Comparing the projected growth of genomic data to three other sources considered among the most prolific data producers in the world—astronomy, Twitter, and YouTube—these scientists predict that by 2025, genomics could well represent the biggest of big data fields. With the raw data for each human genome taking up about 100 GB, we’re well on our way.
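To get a feel for the scale, here is a rough back-of-envelope sketch in Python. It multiplies the roughly 100 GB of raw data per genome by the billion-plus genomes that scientists predict could be sequenced by 2025. Both inputs are round figures from this article, the function name is ours, and the result is only an upper-bound illustration: it ignores compression and assumes every byte of raw data is kept.

```python
# Back-of-envelope estimate. Both inputs are the article's round figures,
# not measured values, and compression is ignored.
RAW_GB_PER_GENOME = 100          # "about 100 GB" of raw data per human genome
GENOMES_BY_2025 = 1_000_000_000  # "billion-plus" genomes; the lower bound is used

GB_PER_EXABYTE = 10**9  # decimal units: 1 EB = 10^9 GB


def raw_storage_exabytes(genomes: int, gb_per_genome: float = RAW_GB_PER_GENOME) -> float:
    """Return the raw storage, in exabytes, needed for the given number of genomes."""
    return genomes * gb_per_genome / GB_PER_EXABYTE


if __name__ == "__main__":
    print(f"{raw_storage_exabytes(GENOMES_BY_2025):,.0f} EB")
```

Under those assumptions the total comes to about 100 exabytes of raw data. Real storage needs would likely differ, depending on how much raw data is retained and how well it compresses.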

Genomics only recently entered the big data realm, and we have major issues to address before it leapfrogs every other data-generating group. Here are the top four areas Techonomy believes must be improved in the next decade to get the genomics house in order.

Informed consent

Consent forms are a cornerstone of biomedical research. Any research project collecting genomic data begins with informed consent, a pile of paperwork that human subjects have to sign to be admitted to a study. Currently, these consent forms vary from project to project and from institution to institution, with a wide range of permissions and access guidelines not only for how data will be used in the current study, but also for how that data might be used in the future. This landscape makes it far more challenging than it should be for scientists to share data later, or to delve into existing data to draw new conclusions.

Scientists routinely praise an initiative called the Personal Genome Project, run through Harvard Medical School, for having the most broadly useful informed consent policies. Study participants have the option to be contacted for future studies, for example, and they agree to make their data and samples openly available to other labs. Because of that, PGP data has been accessed by researchers around the world who have used it to make important new discoveries. Before data generation ramps up to the billion-plus human genomes that scientists predict could be sequenced by 2025, it’s imperative that institutions embrace informed consent policies like PGP’s, allowing for massive data sharing and maximizing utility of this data.

The authors of the commentary write: “If we do not commit as a scientific community to sharing now, we run the risk of establishing thousands of isolated, private data collections, each too underpowered to allow subtle signals to be extracted.” Successful data sharing in the future depends on significant improvement in informed consent guidelines.

Data security and storage

How many letters have you gotten in the last few years letting you know that your personal data—credit cards, bank accounts, health, insurance—may have been accessed by someone who violated the security of the organization you trusted to keep your information safe?

Those letters could become even more alarming if they’re reporting the theft of your genomic data. We have no idea how this type of data could be used by criminals, but no doubt there will be a market for it. We must invest now in superior data-protection tools for our genomic data if we hope to keep it safer than our financial data is with today’s leaky systems.

Storage methods must improve as well. Too much scientific data is stored in purpose-built databases, each with its own formats and access rules. Cloud computing is often considered a way to improve the situation: store genomic data in one place (with lots of redundancy) where scientists, clinicians, and even consumers could access and run queries on it. But that alone will not improve access to the vast amount of existing data, which we could capitalize on now if only it were connected and easily queried. We need the tech world to help with better options to store standardized data securely, and to add hooks to existing public data repositories to make them more useful.

Analysis tools

Virtually all genomic analysis tools were created by and for the research community. If you’ve never seen a data-crunching program designed by a scientist, let’s just say you’ve avoided a serious headache. With the projected explosion in genomic data, it’s critical to have tools that can be used as easily by consumers and physicians as by experts in genetics.

The myriad potential applications of genomics—from choosing the prescription least likely to cause side effects to smart toilets that analyze microbiome health and disease biomarkers each time you flush—all demand a foundation of rapid, reliable analytical tools that don’t require an expert.

To tackle this challenge, we’ll need to harness the best analytical and coding minds, from those quants doing number crunching on Wall Street to the bright minds creating sleek online games and mobile apps.


Medical and consumer access

Most human genomes sequenced so far have been for research use. Ten years from now, it’s likely that the center of gravity will have shifted to medical diagnosis and treatment, or even directly to consumers themselves, depending on how clinical guidelines evolve. The medical world needs to get ready, and fast.

Today, an average consumer can’t just go out and get her genome sequenced. Most opportunities available to consumers require a physician to prescribe a genome sequence—something doctors frequently refuse to do on the grounds that clinical benefit hasn’t been demonstrated. With demand for genome data continually increasing, physicians must be educated not only on potential uses of this data, but also on the concept that such data isn’t something most patients should be shielded from.

In the meantime, regulators must think hard about whether there’s really a need to always maintain a medical gatekeeper between a person and his genomic data. It seems clear that consumers will eventually have access to their own data. If we make that impossible in the United States, it’s likely that people will get the work done elsewhere. Would it really be in their best interest to rely on countries with lax guidelines? Couldn’t we support citizens better with safe, straightforward policies for genome sequencing right here at home?

Genetic counselors probably have the best background to meet this demand, so they need to be empowered in the medical community. These counselors should be able to order whole genome sequences or gene tests without having a medical doctor sign off. There’s a staggering shortage of genetic counselors, so we also need to make an investment in this field to attract more people to the career.

If we can meet these challenges, we’ll be in good shape when genomics outpaces every other source of data. But we have to get cracking right now.


  • Will Greene

    Nice piece, Meredith. I’m working on a story about how the open data movement could impact health, medicine and scientific research in general. Genomics is certainly a big part of that story, so this piece is helpful. I imagine that issues of privacy, security, and interoperability are just as relevant to other big data inquiries in healthcare. Researchers can probably derive a lot of insight from electronic medical records, for example, but we obviously don’t want people’s lives and welfare compromised in the process.

    What makes genomics particularly thorny, it seems, are the regulations that put walls between people and their own genomic data. I enjoyed your panel session at Techonomy Bio 2015 on this topic. I’m generally an advocate for individual empowerment, but there are clearly some tough ethical and policy considerations that come into play here.