Too much data, too few drugs

By David Ewing Duncan, contributor


(Fortune) -- Like sages of old, they came to San Francisco last weekend, a group of biologists and computer scientists setting out to one-up every library ever conceived, from the great one in ancient Alexandria to Wikipedia today.

This library, however, will not consist of vellum scrolls or e-page entries. It aims to compile and make sense of genetic sequences and other raw biological data that are proliferating so fast that biology is about to move from petabytes to exabytes of data -- from quadrillions to quintillions. Just ten years ago, in 2000, all of digitized biology equaled only about 10 gigabytes (giga=billion).


While this is a stunning technological achievement, it also may be contributing to the dearth of new drugs coming out of the pharmaceutical industry in recent years. The problem is that too much raw scientific data is scattered across too many databases, with too little thought given to organizing it all so that it can be properly mined and used to develop treatments.

Trying to make sense of all this data is what brought two hundred scientists here to the first-ever Sage Congress. Organized by Sage Bionetworks, a new nonprofit based in Seattle, the meeting's attendees have proposed a novel solution: to create a new, open-source model to standardize and link together thousands of databases around the world -- in universities, institutes, governments, and businesses.

"It's time to admit the truth, we're not doing drug development the right way," says Stephen Friend, a co-founder of Sage who until recently headed up Merck's research program in oncology. "75% of cancer drugs don't work."

One day, Sage might allow scientists studying cancer -- or Alzheimer's Disease or diabetes -- to easily access the raw genetic data of thousands of people collected, say, in Ohio, Iceland, and Japan, and connect them to databases detailing cellular mechanisms that may explain how these diseases work.

Sage also wants to build systems that can organize and analyze complex interactions among networks of genes in humans and other organisms. Understanding how these networks react to environmental stimuli -- for instance, an individual's diet and exposure to chemical toxins such as mercury -- is the key to unlocking the secrets of common diseases such as heart disease and diabetes, say scientists.

"We need systems that can mimic the complexity of human biology before we'll really understand how everything works for a disease like diabetes," says Sage co-founder Eric Schadt. A biocomputer scientist, Schadt also recently left Merck (MRK, Fortune 500), where he headed up a team that used super computers and sophisticated tests to study how complex genetic networks and pathways and other molecular entities affect disease.

Creating an über-database is a formidable engineering challenge, but it's not the only barrier. Attitudes also need to change among scientists and institutions used to keeping their data to themselves whenever possible.

"It will require a fundamental change in thinking to realize that sharing data is important," says Friend.

Friend was also a co-founder of Rosetta, a bioinformatics company acquired by Merck in 2001 for $620 million. As part of Merck, Rosetta built one of the fastest supercomputers in the drug industry, running 16 trillion calculations a second. The company also developed specialized chips and computer programs to sequence and analyze tissues throughout the body.

Last year, Merck disbanded Rosetta as part of its downsizing, deciding that building ever more complex models of human biological systems was beyond the resources of a single company. Merck developed several drugs out of the Rosetta project and has agreed to hand over key components of the technology to Sage.

The sheer scale of the effort led Friend and Schadt to turn to open source technology, which can be run by a small staff while drawing on hundreds, or even thousands, of contributors. Open source has been used with great success in developing software systems like Linux. In science, projects like Science Commons, based at the Massachusetts Institute of Technology, are also working to break down legal, financial and infrastructural barriers to sharing studies and data.

So far, Sage has raised several million dollars from private foundations, companies such as Merck and Pfizer (PFE, Fortune 500), and the National Institutes of Health.

Meanwhile, the petabytes, and soon exabytes, of data keep piling up, adding to the urgency of sorting it all out. Sage will need significantly more funding and a staff large enough to wrestle with and organize a Great Library of this size so that we can start maximizing the potential for understanding biology and developing drugs sooner rather than later.

We may even want to stop producing so much data for a period of time and concentrate on organizing what we've got.