Medicine is among many sectors waiting to be transformed by big data, we often hear. Conducting global studies of disease progression, integrating health records electronically, or analyzing petabyte-size banks of DNA sequence data should hasten the pace of medical discovery and lead to faster cures, the thinking goes.
Not so fast, says computational biologist Michael Liebman. Health information is only as useful as the thought that went into gathering it. And Liebman says not enough thought is being applied to what data should be collected in healthcare.
A good example, he says, is when the federal government’s stimulus package funded electronic health records technology development. “They focused on making EHRs interconnect. They didn’t put any money into determining if EHRs were collecting the right data,” Liebman says. “If you’re not collecting the right data, sharing it doesn’t get you where you want and need to go.”
Liebman has been global head of computational genomics at Roche Pharmaceuticals and director of computational biology at the University of Pennsylvania Cancer Center. Now, as managing director of two startups, Strategic Medicine and IPQ Analytics, and an advisory board member for a major pharmaceutical industry trade group, he is on a mission to prod colleagues into formulating data-gathering questions that can lead to cures.
Techonomy spoke with Liebman about what’s wrong with current big data strategies in healthcare.
What’s the disconnect between medical research and applying its discoveries?
Most researchers are academics who aren’t focused on identifying the real clinical problems, because they don’t work with clinicians. It’s a very simple thing, but it’s incredibly critical.
For instance, it might look like we need a better drug for patients, when the real problem is that we require more accurate diagnosis and stratification of patients and disease to know who should receive specific drugs that are already available. It’s also critical to consider that disease is a process, not a state—that adds to the complexity.
The gap between data and knowledge, unfortunately, continues to grow: new technologies make it ever easier to generate data, but converting that data into clinical utility still requires substantial effort.
Then why is it so popular, especially in genomics, to talk about the importance of collecting more data?
Technology drives data generation. It doesn’t drive knowledge generation. We are buried in the era of big data, but we still need to focus on the right question and not only look to the available data as the right data.
Here’s an example. Breast cancer surgeons came up with a method to “identify and test the first node that the tumor would drain into” to determine if the cancer had spread. If that “sentinel node” tested negative, then they could say there is no metastasis and no reason to remove additional nodes. A surgeon was giving us sentinel node biopsies to evaluate, and I asked him to describe how he determined which node was the sentinel node we were relying on as a cancer diagnostic. He explained: “While the patient is open on the operating table, we inject dye and then, because we can’t leave the patient open, we massage the breast to see where the dye goes.” The path of the dye would presumably mimic the path of the cancer cells to the nodes.
I suggested that where the dye goes under mechanical pressure isn’t necessarily where it might go in normal circulation. There are other factors, such as flow and the amount of time something resides at a given site, that could indicate a different “sentinel node.” The node we are evaluating and relying on to determine if the cancer has metastasized might not actually be the sentinel node. He said, “Yes, but this is how we do it.”
Physicians accept uncertainty. It’s fine to accept it. But you should pass that knowledge on in a way that lets researchers understand the definition and its variability. If you bury it, researchers may look under the wrong lamppost.
You’re saying that this breast cancer surgeon and others following that protocol were having these patients’ nodes analyzed to diagnose whether their cancer had spread, but they might not even have been collecting the right nodes?
That’s right. In addition, when we collect biospecimens, such as tissue biopsies from breast cancer patients, it’s important to also annotate information about the patient. For instance, has psychological stress—a normal response to an abnormal mammogram—increased throughout the follow-up? It’s difficult to assess, but stress does have an impact on hormone production. Or has the patient’s inflammatory response to the biopsy continued to escalate? All of this could change the gene and protein expression.
What do you think about the emphasis on genomic data collection as a driver for big data?
We need to look at genomic data as the architectural plan for an individual just as we would for a house. The same house, built on an oceanside cliff, in the forest, or in a large city, will not function and age the same way because of the environment, access to resources, and exposures during aging. Humans have genomics as their architectural plan, but environmental exposure, lifestyle, and personal histories will drive what actually converts from genomic risk into disease. Without these details, we can only associate the plan with the outcome and not necessarily understand the real cause.