Authors: Sam Kean
Knot theory hasn’t been the only unexpected math to pop up during DNA research. Scientists have used Venn diagrams to study DNA, and the Heisenberg uncertainty principle. The architecture of DNA shows traces of the “golden ratio” of length to width found in classical edifices like the Parthenon. Geometry enthusiasts have twisted DNA into Möbius strips and constructed the five Platonic solids. Cell biologists now realize that, to even fit inside the nucleus, long, stringy DNA must fold and refold itself into a fractal pattern of loops within loops within loops, a pattern where it becomes nearly impossible to tell what scale—nano-, micro-, or millimeter—you’re looking at. Perhaps most unlikely, in 2011 Japanese scientists used a Tie Club–like code to assign combinations of A, C, G, and T to numbers and letters, then inserted the code for “E = mc² 1905!” in the DNA of common soil bacteria.
DNA has especially intimate ties to an oddball piece of math called Zipf’s law, a phenomenon first discovered by a linguist. George Kingsley Zipf came from solid German stock—his family had run breweries in Germany—and he eventually became a professor of German at Harvard University. Despite his love of language, Zipf didn’t believe in owning books, and unlike his colleagues, he lived outside Boston on a seven-acre farm with a vineyard and pigs and chickens, where he chopped down the Zipf family Christmas tree each December. Temperamentally, though, Zipf did not make much of a farmer; he slept through most dawns because he stayed awake most nights studying (from library books) the statistical properties of languages.
A colleague once described Zipf as someone “who would take roses apart to count their petals,” and Zipf treated literature no differently. As a young scholar Zipf tackled James Joyce’s Ulysses, and the main thing he got out of it was that it contained 29,899 different words, and 260,430 words total. From there Zipf dissected Beowulf, Homer, Chinese texts, and the oeuvre of the Roman playwright Plautus. By counting the words in each work, he discovered Zipf’s law. It says that the most common word in a language appears roughly twice as often as the second most common word, roughly three times as often as the third most common, a hundred times as often as the hundredth most common, and so on. In English, the accounts for 7 percent of words, of about half that, and a third of that, all the way down to obscurities like grawlix or boustrophedon. These distributions hold just as true for Sanskrit, Etruscan, or hieroglyphics as for modern Hindi, Spanish, or Russian. (Zipf also found them in the prices in Sears Roebuck mail-order catalogs.) Even when people make up languages, something like Zipf’s law emerges.
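The counting Zipf did by hand can be sketched in a few lines of Python. This is only an illustration: the function names `rank_frequency` and `zipf_prediction` are made up for this example, and the crude word-splitting rule is an assumption, not anything Zipf specified. The idea is to rank words by how often they appear, then compare each count with the idealized value, the top count divided by the rank.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, count) tuples, most common word first.

    Assumption: a "word" is any run of letters or apostrophes,
    compared case-insensitively.
    """
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_prediction(top_count, rank):
    """Idealized Zipf count: the top word's count divided by the rank."""
    return top_count / rank


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    table = rank_frequency(sample)
    top_count = table[0][2]
    for rank, word, count in table:
        # Real texts only approximate the prediction, and a toy
        # sentence like this one is far too short to show the law.
        print(rank, word, count, round(zipf_prediction(top_count, rank), 2))
```

Run on a full-length book rather than a toy sentence, the observed counts and the predicted ones should track each other roughly, which is all the law claims.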
After Zipf died in 1950, scholars found evidence of his law in an astonishing variety of other places—in music (more on this later), city population ranks, income distributions, mass extinctions, earthquake magnitudes, the ratios of different colors in paintings and cartoons, and more. Every time, the biggest or most common item in each class was twice as big or common as the second item, three times as big or common as the third, and so on. Probably inevitably, the theory’s sudden popularity led to a backlash, especially among linguists, who questioned what Zipf’s law even meant, if anything.
*
Still, many scientists defend Zipf’s law because it feels correct—the frequency of words doesn’t seem random—and, empirically, it does describe languages in uncannily accurate ways. Even the “language” of DNA.
Of course, it’s not apparent at first that DNA is Zipfian, especially to speakers of Western languages. Unlike most languages DNA doesn’t have obvious spaces to distinguish each word. It’s more like those ancient texts with no breaks or pauses or punctuation of any kind, just relentless strings of letters. You might think that the A-C-G-T triplets that code for amino acids could function as “words,” but their individual frequencies don’t look Zipfian. To find Zipf, scientists had to look at groups of triplets instead, and a few turned to an unlikely source for help: Chinese search engines. The Chinese language creates compound words by linking adjacent symbols. So if a Chinese text reads ABCD, search engines might examine a sliding “window” to find meaningful chunks, first AB, BC, and CD, then ABC and BCD. Using a sliding window proved a good strategy for finding meaningful chunks in DNA, too. It turns out that, by some measures, DNA looks most Zipfian, most like a language, in groups of around twelve bases. Overall, then, the most meaningful unit for DNA might not be a triplet, but four triplets working together—a dodecahedron motif.
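The sliding window is simple enough to sketch in Python. This is a minimal illustration, not the method any particular search engine or lab used: the function names and the short DNA string are invented for the example. A window of a given size moves along the text one letter at a time, and each chunk it passes over is collected and counted.

```python
from collections import Counter


def sliding_windows(sequence, size):
    """Return every run of `size` adjacent symbols, moving one symbol at a time."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def count_chunks(sequence, size):
    """Count how often each window of the given size occurs."""
    return Counter(sliding_windows(sequence, size))


if __name__ == "__main__":
    # The ABCD example from the text: pairs first, then triples.
    print(sliding_windows("ABCD", 2))
    print(sliding_windows("ABCD", 3))

    # Hypothetical DNA fragment, made up for illustration only.
    dna = "ACGTACGTACGTTTGACGTACGTACGT"
    for chunk, count in count_chunks(dna, 12).most_common(3):
        print(chunk, count)
```

Sorting the chunk counts from most to least common yields the kind of ranked list that can then be checked against Zipf’s law at different window sizes.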
The expression of DNA, the translation into proteins, also obeys Zipf’s law. Like common words, a few genes in every cell get expressed time and time again, while most genes hardly ever come up in conversation. Over the ages cells have learned to rely on these common proteins more and more, and the most common one generally appears twice and thrice and quatrice as often as the next-most-common proteins. To be sure, many scientists harrumph that these Zipfian figures don’t mean anything; but others say it’s time to appreciate that DNA isn’t just analogous to but really functions like a language.
And not just a language: DNA has Zipfian musical properties, too. Given the key of a piece of music, like C major, certain notes appear more often than others. In fact Zipf once investigated the prevalence of notes in Mozart, Chopin, Irving Berlin, and Jerome Kern—and lo and behold, he found a Zipfian distribution. Later researchers confirmed this finding in other genres, from Rossini to the Ramones, and discovered Zipfian distributions in the timbre, volume, and duration of notes as well.
So if DNA shows Zipfian tendencies, too, is DNA arranged into a musical score of sorts? Musicians have in fact translated the A-C-G-T sequence of serotonin, a brain chemical, into little ditties by assigning the four DNA letters to the notes A, C, G, and, well, E. Other musicians have composed DNA melodies by assigning harmonious notes to the amino acids that popped up most often, and found that this produced more complex and euphonious sounds. This second method reinforces the idea that, much like music, DNA is only partly a strict sequence of “notes.” It’s also defined by motifs and themes, by how often certain sequences occur and how well they work together. One biologist has even argued that music is a natural medium for studying how genetic bits combine, since humans have a keen ear for how phrases “chunk together” in music.
Something even more interesting happened when two scientists, instead of turning DNA into music, inverted the process and translated the notes from a Chopin nocturne into DNA. They discovered a sequence “strikingly similar” to part of the gene for RNA polymerase. This polymerase, a protein universal throughout life, is what builds RNA from DNA. Which means, if you look closer, that the nocturne actually encodes an entire life cycle. Consider: Polymerase uses DNA to build RNA. RNA in turn builds complicated proteins. These proteins in turn build cells, which in turn build people, like Chopin. He in turn composed harmonious music—which completed the cycle by encoding the DNA to build polymerase. (Musicology recapitulates ontology.)
So was this discovery a fluke? Not entirely. Some scientists argue that when genes first appeared in DNA, they didn’t arise randomly, along any old stretch of chromosome. They began instead as repetitive phrases, a dozen or two dozen DNA bases duplicated over and over. These stretches function like a basic musical theme that a composer tweaks and tunes (i.e., mutates) to create pleasing variations on the original. In this sense, then, genes had melody built into them from the start.
Humans have long wanted to link music to deeper, grander themes in nature. Most notably astronomers from ancient Greece right through to Kepler believed that, as the planets ran their course through the heavens, they created an achingly beautiful musica universalis, a hymn in praise of Creation. It turns out that universal music does exist, only it’s closer than we ever imagined, in our DNA.
Genetics and linguistics have deeper ties beyond Zipf’s law. Mendel himself dabbled in linguistics in his older, fatter days, including an attempt to derive a precise mathematical law for how the suffixes of German surnames (like -mann and -bauer) hybridized with other names and reproduced themselves each generation. (Sounds familiar.) And heck, nowadays, geneticists couldn’t even talk about their work without all the terms they’ve lifted from the study of languages. DNA has synonyms, translations, punctuation, prefixes, and suffixes. Missense mutations (substituting amino acids) and nonsense mutations (interfering with stop codons) are basically typos, while frameshift mutations (screwing up how triplets get read) are old-fashioned typesetting mistakes. Genetics even has grammar and syntax—rules for combining amino acid “words” and clauses into protein “sentences” that cells can read.
More specifically, genetic grammar and syntax outline the rules for how a cell should fold a chain of amino acids into a working protein. (Proteins must be folded into compact shapes before they’ll work, and they generally don’t work if their shape is wrong.) Proper syntactical and grammatical folding is a crucial part of communicating in the DNA language. However, communication does require more than proper syntax and grammar; a protein sentence has to mean something to a cell, too. And, strangely, protein sentences can be syntactically and grammatically perfect, yet have no biological meaning. To understand what on earth that means, it helps to look at something linguist Noam Chomsky once said. He was trying to demonstrate the independence of syntax and meaning in human speech.
His example was “Colorless green ideas sleep furiously.” Whatever you think of Chomsky, that sentence has to be one of the most remarkable things ever uttered. It makes no literal sense. Yet because it contains real words, and its syntax and grammar are fine, we can sort of follow along. It’s not quite devoid of meaning.
In the same way, DNA mutations can introduce random amino acid words or phrases, and cells will automatically fold the resulting chain together in perfectly syntactical ways based on physics and chemistry. But any wording changes can change the sentence’s whole shape and meaning, and whether the result still makes sense depends. Sometimes the new protein sentence contains a mere tweak, minor poetic license that the cell can, with work, parse. Sometimes a change (like a frameshift mutation) garbles a sentence until it reads like grawlix—the #$%^&@! swear words of comics characters. The cell suffers and dies. Every so often, though, the cell reads a protein sentence littered with missense or nonsense… and yet, upon reflection, it somehow does make sense. Something wonderful like Lewis Carroll’s “mimsy borogoves” or Edward Lear’s “runcible spoon” emerges, wholly unexpectedly. It’s a rare beneficial mutation, and at these lucky moments, evolution creeps forward.
*
Because of the parallels between DNA and language, scientists can even analyze literary texts and genomic “texts” with the same tools. These tools seem especially promising for analyzing disputed texts, whose authorship or biological origin remains doubtful. With literary disputes, experts traditionally compared a piece to others of known provenance and judged whether its tone and style seemed similar. Scholars also sometimes cataloged and counted what words a text used. Neither approach is wholly satisfactory—the first too subjective, the second too sterile. With DNA, comparing disputed genomes often involves matching up a few dozen key genes and searching for small differences. But this technique fails with wildly different species because the differences are so extensive, and it’s not clear which differences are important. By focusing exclusively on genes, this technique also ignores the swaths of regulatory DNA that fall outside genes.
To circumvent these problems, scientists at the University of California at Berkeley invented software in 2009 that again slides “windows” along a string of letters in a text and searches for similarities and patterns. As a test, the scientists analyzed the genomes of mammals and the texts of dozens of books like Peter Pan, the Book of Mormon, and Plato’s Republic. They discovered that the same software could, in one trial run, classify DNA into different genera of mammals, and could also, in another trial run, classify books into different genres of literature with perfect accuracy. In turning to disputed texts, the scientists delved into the contentious world of Shakespeare scholarship, and their software concluded that the Bard did write The Two Noble Kinsmen—a play lingering on the margins of acceptance—but didn’t write Pericles, another doubtful work. The Berkeley team then studied the genomes of viruses and archaebacteria, the oldest and (to us) most alien life-forms. Their analysis revealed new links between these and other microbes and offered new suggestions for classifying them. Because of the sheer amount of data involved, the analysis of genomes can get intensive; the virus-archaebacteria scan monopolized 320 computers for a year. But genome analysis allows scientists to move beyond simple point-by-point comparisons of a few genes and read the full natural history of a species.
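The general idea of comparing whole texts by their window contents can be sketched in Python, with heavy caveats. This is not the Berkeley team’s software, and nothing here reflects their actual algorithm or parameters. The window size, the function names, the sample strings, and the choice of cosine similarity as the comparison score are all assumptions made for illustration. Each text is reduced to a profile of how often each window occurs, and two profiles are then scored for overlap.

```python
from collections import Counter
import math


def window_profile(text, size):
    """Map each window of `size` letters to its share of all windows in the text."""
    windows = [text[i:i + size] for i in range(len(text) - size + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {window: count / total for window, count in counts.items()}


def cosine_similarity(profile_a, profile_b):
    """Score two profiles from 0.0 (no shared windows) to 1.0 (same proportions).

    Assumption: cosine similarity is used here only because it is simple;
    the source does not say which measure the real software relied on.
    """
    dot = sum(value * profile_b.get(window, 0.0)
              for window, value in profile_a.items())
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up fragments standing in for two known texts and one disputed one.
    known_a = "ACGTACGTACGGACGTACGT"
    known_b = "TTTTGGGGTTTTGGGGTTTT"
    disputed = "ACGTACGTACGTACGGACGT"
    for name, text in (("known_a", known_a), ("known_b", known_b)):
        score = cosine_similarity(window_profile(disputed, 3),
                                  window_profile(text, 3))
        print(name, round(score, 3))
```

The same two functions work on a string of prose as readily as on a string of bases, which is the point of treating genomes and books as texts of the same kind.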