So Many Numbers - What Do You Do With The Data?

Author:  Harris Margaret
Institution:  Physics
Date:  June 2002

The genetic instructions for making you, a human being, are written in three billion DNA base pairs and tucked inside the nucleus of every cell in your body (except red blood cells). Of those three billion, only 2% are actually part of your roughly 35,000 genes. The remainder may hold your chromosomes' structure together, play unknown roles in regulating protein production, or simply take up space as "junk" DNA, the detritus of humankind's long evolution from earlier species.

If all that information - junk, genes, and all - were printed on paper, it would fill 200 volumes each the size of a Manhattan telephone book. If you started to read this weighty collection tomorrow, you'd be at it for the next 9.5 years. And if you took the DNA from every cell in your body and laid it end-to-end, as you would inevitably be inclined to do, it would reach pretty much anywhere on Earth you wanted - several times over. (A single copy of the three billion base pairs, stretched out, is only about a meter long.) When describing the size of these numbers, "staggering" is an understatement.
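For readers who like to check such figures, here is a rough sketch of the arithmetic in Python. It uses only the numbers quoted above; the assumption that the reading goes on around the clock, with no breaks, is ours.

```python
# Figures quoted in the article.
BASE_PAIRS = 3_000_000_000
VOLUMES = 200
READING_YEARS = 9.5

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average calendar year

# Letters printed in each phone-book-sized volume.
bases_per_volume = BASE_PAIRS / VOLUMES

# Reading speed implied by the 9.5-year figure, assuming nonstop reading.
bases_per_second = BASE_PAIRS / (READING_YEARS * SECONDS_PER_YEAR)

print(f"{bases_per_volume:,.0f} base pairs per volume")
print(f"{bases_per_second:.1f} base pairs per second")
```

Under that assumption, each volume holds about 15 million letters, and the reader gets through roughly ten of them every second without ever stopping to sleep.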

Computers have, to a certain extent, solved the problem: The Complete Works of You, Vols. 1-200, would fit comfortably onto a reasonably-sized computer hard drive. Stored at one byte per base pair, three billion pairs means about three gigabytes (GB) of disk space - a big number, certainly, but nothing modern computers can't handle.
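The three-gigabyte figure corresponds to storing each base as one text character, or one byte. The sketch below works through that estimate and, for comparison, a tighter packing of two bits per base, which is enough to tell four letters apart. The function name genome_bytes is ours, chosen for illustration, and the two-bit packing is a comparison we have added rather than anything the researchers describe.

```python
BASE_PAIRS = 3_000_000_000  # figure quoted in the article


def genome_bytes(base_pairs: int, bits_per_base: int) -> int:
    """Bytes needed to store a sequence at a given density, rounded up."""
    return (base_pairs * bits_per_base + 7) // 8


# One ASCII character (8 bits) per base: 'A', 'C', 'G' or 'T'.
plain_text = genome_bytes(BASE_PAIRS, 8)

# Four possible letters can be told apart with just 2 bits each.
packed = genome_bytes(BASE_PAIRS, 2)

print(f"plain text: {plain_text / 1e9:.2f} GB")
print(f"2-bit packed: {packed / 1e9:.2f} GB")
```

Either way, a single genome is a manageable file: about 3 GB as plain text, or about 0.75 GB when packed.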

But what happens if researchers want to look at more than one genome at a time? What if they want to examine, say, a few thousand, in order to seek out and compare the genetic quirks that make us unique - or give us rare diseases? What if they want to add their own comments every few lines, as an aid to others' understanding? What should they do with the data?
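To get a feel for how quickly the numbers grow, here is a rough sketch. The count of 3,000 genomes and the 10% allowance for researchers' comments are illustrative assumptions of ours, not figures from any real project.

```python
GENOME_BYTES = 3_000_000_000  # one genome at one byte per base pair


def collection_bytes(n_genomes: int, annotation_overhead: float = 0.0) -> float:
    """Total bytes for n_genomes; annotations add a fractional overhead."""
    return n_genomes * GENOME_BYTES * (1.0 + annotation_overhead)


# Hypothetical study: "a few thousand" genomes, taken here as 3,000.
raw = collection_bytes(3_000)

# Same study with comments assumed to add 10% to each genome's size.
annotated = collection_bytes(3_000, annotation_overhead=0.10)

print(f"{raw / 1e12:.1f} TB raw, {annotated / 1e12:.1f} TB annotated")
```

Under these assumptions the collection runs to roughly nine or ten terabytes, thousands of times the size of the single genome that fit so comfortably on one hard drive.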

Different field, same problem: The next generation of fiber optic cables will be fast enough to transmit the informational content of the entire Library of Congress from New York to Paris in about a millisecond, quicker than you can say "overdue fine." But what happens once it gets there? What do you do with the data?

Consider the weatherman, the poor chap on TV who predicts flurries and gets a blizzard. Creating better mathematical and computer models of weather patterns - or any other near-chaotic, complex, and time-dependent system - requires not only tremendous number-crunching power and an immense amount of storage space, but also some means of organizing and displaying the data in a meaningful way. Teasing predictions out of such a system is, in the understated jargon of science, "nontrivial" - that is, not easy. How do they deal with the data?

The data problem is not new. Researchers have used computers for scientific purposes since the days of vacuum tubes and relays, and the need for effective data storage and access has always been a driving force behind computer architecture. So far, the results have been impressive; an empirical rule of thumb, Moore's Law, holds that computing power doubles roughly every 18 months, and historically the doubling period has often been even shorter.
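Doubling compounds quickly. As a minimal sketch, taking the 18-month doubling period at face value, the growth over any span of years can be worked out as follows; the function name growth_factor is ours.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """How many times over computing power multiplies in `years`,
    assuming it doubles once every `doubling_months` months."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


print(growth_factor(1.5))  # one doubling period
print(growth_factor(15))   # ten doubling periods
```

Fifteen years contain ten doubling periods, so by this rule a computer bought at the end of that span would be about a thousand times (2 to the 10th power, or 1,024) as powerful as one bought at the start.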

In recent years, however, changes in the scientific process have led to such a rapid proliferation of data that advances in storage are no longer keeping up. The biggest culprit is genomics, a new field whose reams of raw data gobble up storage space on desktops and mainframes alike - both in high-profile efforts like the Human Genome Project and in studies of smaller gene fragments or of other species. Unlike older areas of biology, genomics is largely driven by discoveries rather than by hypotheses. In discovery-driven science, researchers mine collected data for anomalies and trends, instead of using the more traditional process of formulating a hypothesis or theory and testing it experimentally. For example, a genomic scientist might track the expression of various genes in yeast samples at different temperatures, to see if differences in temperature lead to changes in the yeast's life cycle. The result is a data-storage-and-access nightmare; even a humble fungus has some 6,300 genes for researchers to monitor over its six-hour life cycle, and it's easy to extrapolate that humans are almost unimaginably more complex.

Other fields also use the discovery approach to research, and they face similar problems with data. Neuroscientists, for example, often compare large amounts of data in an effort to find patterns and trends, and for them data storage and access have proved even more problematic. In one experiment at Duke University, volunteers' brains were scanned as they looked at various words. The resulting images, depicted in false color to show regions of activity in the brain, were recorded on high-density DVDs. This would have been ideal, since DVDs are efficient information-storage platforms. However, once the images were in place, the DVDs were stacked in a closet. Although a closet full of disembodied brain images might sound creepy to a layman, it bothered the scientists even more: How do you compare images and data when all the raw information is literally closeted and no two scans will fit on the computer at the same time?

The scientific data problem is inherently interesting to computer scientists. This is especially true because information proliferation will eventually bring Moore's Law of progress up against a formidable theoretical barrier: The quantum-mechanical limit to how much information and how much computing power can fit onto a silicon wafer. Research on quantum computing - an effort that hopes to use the quantum properties of atoms or nuclei to mimic some functions of a computer's processor and memory - is at least partly aimed at overcoming this physical barrier.

Suggestions for a shorter-term "fix" are not hard to find. In one typically clever proposal - with the catchy title "A Petabyte in Your Pocket" - researchers at the University of Wisconsin-Madison and the Oregon Graduate Institute suggested that a new method of structuring databases could offer unheard-of storage capacity without resorting to exotic quantum- or DNA-based computers. Another approach proposes trading two-dimensional CDs and computer chips for three-dimensional "information cubes"; the addition of a z-axis to the x and y would allow more information to be encoded at a single point.

Whatever the solution, though, the underlying problem is not going to disappear anytime soon. Until it does, the question remains: What do you do with the data?