If the DNA molecule is the book of life, it’s a very strange book indeed

by Yohan J. John

[Image: DNA replication]

The DNA molecule is often described as the book of life, as a blueprint for constructing the organism, or as a program for computing the organism. These metaphors have become so pervasive that we often forget that they are metaphors. In this essay I'd like to take this class of metaphor, the life-as-information metaphor, seriously, and investigate what some recent findings in molecular biology look like when mapped onto the world of books, blueprints and programs. I'd like to run with the information metaphor, seeing how far it can take us. I think this will help us understand the limits of the metaphor, but more importantly, it can help us appreciate the richness and complexity of biological processes, and the sheer scale of the ongoing endeavor to understand the science of life. [In part I of this series I looked at the origins of information theory and computer science, and in part II I traced the history of genetics, up to the discovery of the genetic code. This essay continues the themes from those columns, but can be read as a standalone article.]

The discovery of the double helix structure of the DNA molecule in the mid-20th century was the culmination of a quest to understand the nature of heredity that had begun a little over a century before. In the early 19th century, biologists began asking two intertwined questions about organisms: the question of heredity, and the question of development. How did hereditary traits pass from one generation to the next? And what biological, chemical and physical processes were involved in the development of the organism from an embryo? The first question was often described as a question of 'ultimate causes', and was closely linked to the theory of evolution by natural selection. Charles Darwin's theory depended on inheritance, but he could only provide speculative accounts of the physical basis of heredity. Many 19th-century cell biologists were more interested in what they saw as the nuts and bolts of biology, and preferred to investigate the question of development. They believed that only 'proximal causes' could be tested in a lab, and perhaps even witnessed under a microscope. Evolutionary theory, by contrast, seemed more like philosophy.

The two sorts of question 19th century biologists were interested in find their counterpart in two broad spheres of genetics research: transmission genetics, which studies how hereditary traits pass from one generation to the next, and developmental genetics, which studies how genes participate in the physical processes by which traits become manifest in cells and in organisms. The concept of the genotype is useful when thinking about transmission genetics: the genotype is the sum total of the genetic makeup of an organism, and in a sense represents all the potential hereditary traits that can become manifest. Nowadays the word 'genome' is used in a closely related way. When thinking about developmental genetics, the concept of the phenotype is central: it is the sum total of an organism's observable traits, which are not just a product of the genetic makeup, but are also influenced by the environment, and by the developmental process itself. Transmission genetics studies how the inheritance, reassembly and mutation of genetic material lead to the formation of a genotype, whereas developmental genetics studies how the potential latent in the genotype is actualized to give rise to the phenotype.

The emergence of genetics as a new subfield of biology represented a conceptual breakthrough that allowed questions about transmission and development to be situated in a unified framework. In the 1850s Gregor Mendel discovered that traits could be productively studied as discrete units or factors that were transferred from one generation to the next in discrete, all-or-nothing fashion. He also showed that by studying the distribution of traits in each generation, one could infer accurate mathematical laws of heredity. Over the course of the second half of the 19th century, ignorant of Mendel's work, cell biologists progressively zoomed in on the physical components of development: they first identified the sex cells (the sperm and the egg cell) as the key players in fertilization. They then recognized that the nuclei of these cells merge at the moment of fertilization to form the primordial nucleus from which all other cell nuclei in an organism descend. Finally they discovered that each cell nucleus contains structures called chromosomes which engage in a complex dance at the moment when a cell divides, which is the only way new cells are created. When Mendel's work was rediscovered at the dawn of the 20th century, biologists quickly realized that his discrete units or factors of heredity, which were soon christened 'genes', resided on the chromosomes. Though many of the early cell biologists wanted to answer the developmental question, it was the question of heredity that suddenly seemed within grasp. Uncovering how exactly chromosomes carried hereditary material proved to be a highly tractable line of inquiry, whereas progress in answering the developmental question proved to be much more difficult.

Once it was established that chromosomes were the carriers of genetic material, cell biology was displaced by molecular biology, biochemistry and biophysics at the bleeding edge of transmission genetics. Chromosomes contain nucleic acids (DNA and RNA) as well as proteins, and DNA came to be recognized as the key carrier of genetic material. The race to describe the structure of DNA ended in 1953, when James Watson and Francis Crick introduced the double helix to the world. The very structure of DNA suggested to Watson and Crick a mechanism for the replication of DNA during the cell division process. By the 1970s so much progress had been made that transmission genetics was declared a closed chapter in the history of biology.

It was during the mid-20th century build-up to the “end of history” for transmission genetics that the concept of genetic information became popular. As far as I can tell, a detailed history of how the concept of information came to dominate molecular biology has not yet been written. But it seems clear that the intellectual milieu of the mid 20th century played a key role: genetics moved into top gear just as information theory and computer science were coming into their own.

Claude Shannon had shown how all forms of communication could be understood in terms of the transmission of discrete symbols. Shannon's information theory formalized the process of transmitting symbols between two points connected by a channel, allowing it to be performed as efficiently as possible. Shannon's discrete units of information, typically denominated in bits, evoked Mendel's units of genetic material: genes. They also resembled the discrete units that made up the DNA molecule: the nucleotide “letters” G (guanine), A (adenine), T (thymine) and C (cytosine). The process of transmitting genes from one generation to the next, or from one cell to its daughter cells, served as the biological equivalent of the channel in Shannon's framework. The concept of “genetic information” must have seemed quite natural given Shannon's definition of information.
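To see why the four nucleotide letters invite comparison with Shannon's bits, here is a minimal Python sketch that estimates the average information per symbol of a letter sequence from its observed letter frequencies. The function name `shannon_entropy_bits` and the example sequences are my own illustrative choices, not anything drawn from the genetics literature, and the estimate treats each letter as independent of its neighbors.

```python
import math
from collections import Counter


def shannon_entropy_bits(sequence: str) -> float:
    """Average information per symbol, in bits, from observed symbol frequencies."""
    counts = Counter(sequence)
    total = len(sequence)
    # Shannon entropy: H = -sum(p * log2(p)) over the symbols that actually occur.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Four equally frequent letters carry log2(4) = 2 bits per letter.
uniform = shannon_entropy_bits("GATC" * 25)

# A sequence dominated by one letter carries much less information per letter.
skewed = shannon_entropy_bits("A" * 97 + "GTC")

print(uniform, skewed)
```

With four equally likely letters, each nucleotide position carries two bits, which is the sense in which a DNA sequence can be "denominated" in Shannon's units.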

The DNA molecule on which genetic information resides is a long double strand on which the nucleotide pairs are arranged sequentially. For many thinkers in the 1950s, this image must have conjured up Alan Turing's powerful model of general-purpose computation, the Turing machine. The Turing machine consists of three parts: (1) a long tape on which discrete symbols can be written and erased, (2) a 'head' that can move along the tape and write or erase symbols, and (3) a table of instructions that determine what the 'head' does. With a Universal Turing machine, virtually any set of instructions can be encoded on the tape. The Universal Turing machine inspired the design of modern stored-program computers. The DNA molecule may well be biology's very own tape, carrying instructions for how to 'compute' the organism.
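The three parts of a Turing machine listed above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function `run_turing_machine`, the state names `scan` and `halt`, and the `FLIP_BITS` instruction table are hypothetical, and the machine shown is a simple special-purpose one, not a Universal Turing machine.

```python
def run_turing_machine(tape, table, state="scan", blank="_", max_steps=1000):
    """Run a one-tape machine until it reaches the 'halt' state (or max_steps)."""
    cells = dict(enumerate(tape))  # (1) the tape: position -> symbol
    head = 0                       # (2) the head: current position on the tape
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, blank)
        # (3) the table of instructions: (state, symbol) -> (write, move, next state)
        write, move, state = table[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells)).strip(blank)


# A tiny instruction table: flip every bit, then halt at the first blank cell.
FLIP_BITS = {
    ("scan", "0"): ("1", "R", "scan"),
    ("scan", "1"): ("0", "R", "scan"),
    ("scan", "_"): ("_", "R", "halt"),
}

result = run_turing_machine("0110", FLIP_BITS)
print(result)  # 1001
```

Swapping in a different instruction table changes what the same tape-and-head machinery computes, which is the feature that makes the tape such a tempting image for DNA.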

Here's a fleshed-out version of the life-as-information metaphor. Each DNA molecule is a book, or a blueprint, or a molecular 'hard drive'; in essence it is a physical store of abstract symbols that can be quantified as bits or bytes of information. The abstract symbols used by the DNA molecule are the nucleotides: the letters of the genetic code. Each DNA book contains two copies of the exact same information, since each nucleotide pairs with a complementary nucleotide. Each chromosome is a 'volume' in the genetic 'library' of the cell's nucleus. This library constitutes the organism's genome. Normal cells contain two chromosomes of each type: a genetic library consists of two slightly different 'editions' of the same volume, one inherited from each parent. The human genetic library consists of 46 books, or 23 edition-pairs. The sex cells are special: they consist of only one edition of each volume. These volumes are new, 'remixed' versions, with bits chosen from each parent's edition by what seems to have been a coin flip. In a given sperm or egg cell, Chapter 1 of chromosome volume 1 might be excerpted from your father's edition, whereas chapter 2 of the same volume might come from your mother's edition. Even the chapters can be 'remixed' in this way. During fertilization, the nuclei of the sperm cell and the egg cell fuse to create a new nucleus: a new library that contains the usual two editions of each volume. When a cell divides to create two new daughter cells, the genetic library must be copied in its entirety. The nuclear library is not the only room in the cellular house, but most of the building blocks for constructing and maintaining the other rooms in the house depend on the information written in the genetic library. These building blocks are the proteins, and the volumes in the genetic library contain recipes for making them. 
We could have replaced the word 'recipe' here with the phrase 'stored computer program', since like a recipe, a program or code provides step-by-step instructions for achieving some goal.

When we use an information metaphor, we create the conditions for certain implicit assumptions. For example, the information content of a book is not typically viewed as residing in the physical shape of the book, in the texture of the paper, or in the typeface. The result of a computer program does not typically depend on the weather outside. The information metaphor also obscures certain questions, such as 'what is the cellular chef that is following the DNA recipes?' or 'what computer is executing the DNA's program?'

But despite these shortcomings, the information metaphor can go much further before it starts to break down: we can rephrase the two central questions of genetics in terms of books. Transmission genetics is the study of how genetic information from the parental DNA library is remixed and rebound to create a new half-size library in the sex cells; how the parental libraries are combined during fertilization to create a new genetic library; and how the new library in the offspring is copied during each of its cell divisions. Developmental genetics is the study of how the information written in the DNA books is 'read out' by the cellular machinery in order to create more useful machinery.

So how does the transmission of genetic information occur? The key lies in how DNA is copied during cell division. The DNA replication mechanism that Watson and Crick proposed is relatively straightforward to understand. The two strands of the DNA molecule are “unzipped”. Since each nucleotide only pairs with a particular complementary nucleotide — G with C and T with A — each strand is the complement of the other. Each nucleotide has a chemical affinity for its complementary nucleotide, large numbers of which are floating around, unaffiliated with any DNA molecule. A type of enzyme called DNA polymerase steps in to wed these free nucleotides to one of the two unzipped DNA strands. With the help of two DNA polymerase molecules, a single unzipped DNA molecule becomes two almost-identical DNA double helices. We can map this process onto the book metaphor. Let us imagine our DNA books are long scrolls. The scrolls are divided lengthwise down the middle, and each side of the scroll is the complement of the other. If the scrolls are written in a binary code, then wherever the left side has a '0', the right side has a '1' and vice versa. During replication, the two halves of the scroll are slowly torn apart. Free 1's and 0's take up positions in a new copy of the original volume when brought nearby by the DNA polymerase. The DNA polymerase is the photocopier, and the free nucleotides are the copier ink.
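To make the copying step concrete, here is a minimal Python sketch of the complementary-pairing rule described above. The names `COMPLEMENT`, `complement_strand` and `replicate` are my own illustrative choices, and the sketch deliberately ignores strand direction, the enzymes involved and copying errors; it treats each strand as a plain string of letters.

```python
# Pairing rule from the text: G pairs with C, and T pairs with A.
COMPLEMENT = {"G": "C", "C": "G", "A": "T", "T": "A"}


def complement_strand(strand: str) -> str:
    """Build the partner strand by pairing each letter with its complement."""
    return "".join(COMPLEMENT[base] for base in strand)


def replicate(double_helix):
    """Unzip a (strand, partner) pair and rebuild a partner for each half."""
    top, bottom = double_helix
    daughter_from_top = (top, complement_strand(top))
    daughter_from_bottom = (complement_strand(bottom), bottom)
    return daughter_from_top, daughter_from_bottom


original = ("GATTACA", complement_strand("GATTACA"))
daughter_a, daughter_b = replicate(original)

print(original)    # ('GATTACA', 'CTAATGT')
print(daughter_a)  # ('GATTACA', 'CTAATGT')
print(daughter_b)  # ('GATTACA', 'CTAATGT')
```

Each daughter keeps one old strand and gains one newly assembled strand, and the function never needs to know what the letters mean in order to copy them.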

This picture of DNA replication is a simplified caricature, but it gets across the basic mechanism. In principle, understanding transmission genetics does not depend on any understanding of how genetic information is read out or decoded, since the copying and transmission mechanisms can be understood without understanding the meaning of what is being copied. A photocopier can be understood without any knowledge of the content of the books being copied.

But there may well be more to hereditary transmission than the copying of sequences of nucleotide letters. In order to appreciate what might be left out by the DNA replication picture, we must return to the other great question in genetics: how development of the cell and the organism occurs.

The first small step towards understanding the meaning of the information encoded in the DNA molecule involved cracking the genetic code. It was discovered that the building blocks of the cell, proteins, were created with the help of the DNA molecule. A three-letter nucleotide code represents the full set of amino acids that a cell requires to produce proteins. The words in the books of the genetic library, called codons, are three letters long. Each 'sentence' in a DNA book is a string of 3-letter nucleotide 'words', which specifies a protein. A gene is often defined at the molecular level as a protein-coding sentence of this sort. The process of gene expression is the equivalent of reading out a sentence from a DNA book, or computing one line of DNA code.

One of the key players in gene expression is RNA, which is also composed of 4 nucleotides, but instead of thymine (T), RNA uses uracil (U). RNA has a single-stranded structure, unlike DNA which is double-stranded. Gene expression consists of two processes: transcription and translation. During gene transcription, an enzyme called RNA polymerase attaches itself to a stretch of DNA, creating a transcription bubble, a small unzipped section of the DNA double helix. RNA polymerase then leads to the production of a strand of RNA that is the complement of the section of DNA being transcribed. (In RNA, U is the complementary nucleotide for A.) The stretch of DNA being transcribed is called a transcription unit. After a few more steps, a new free RNA strand is created.
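Transcription, reduced to its letter-matching core, can be sketched in Python as follows. The names `DNA_TO_RNA` and `transcribe` and the example sequence are assumptions of mine for illustration; the sketch leaves out RNA polymerase, strand direction and the "few more steps" that follow in a real cell.

```python
# Each DNA letter on the template strand is matched with its RNA complement.
# The only change from DNA-to-DNA pairing is that A pairs with U instead of T.
DNA_TO_RNA = {"G": "C", "C": "G", "T": "A", "A": "U"}


def transcribe(template_dna: str) -> str:
    """Return the RNA strand complementary to a stretch of template DNA."""
    return "".join(DNA_TO_RNA[base] for base in template_dna)


mrna = transcribe("TACGGCATT")
print(mrna)  # AUGCCGUAA
```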

During gene translation, the newly formed RNA, called a messenger RNA (mRNA), is decoded by the ribosome — the cell's protein factory. Proteins are sequences of amino acids, and the mRNA molecule determines the order of amino acids. The mRNA is read 3 letters at a time, or in triplets. Another type of RNA, the transfer RNA (tRNA), transports amino acids to the ribosome. The tRNA consists of two parts — one amino acid attachment site, and one site called an 'anticodon', which is a 3 letter RNA triplet that is complementary to the mRNA triplet currently being read. The tRNA molecule brings its amino acid payload to the ribosomal factory, where it is offloaded and attached to the previous amino acid in the sequence. Thus a chain of amino acids is created in the ribosome, which eventually grows into a protein molecule.
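The triplet-by-triplet reading of mRNA can be sketched in Python like this. The dictionary `CODON_TABLE` is only a small excerpt of the standard genetic code (the full table has 64 entries), and the function name `translate` and the example message are my own illustrative choices. A dictionary lookup stands in for the work that the tRNA and the ribosome do.

```python
# A small excerpt of the standard genetic code: mRNA triplet -> amino acid.
# None marks the three stop codons, which end the chain.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "CCG": "Pro",
    "UAA": None, "UAG": None, "UGA": None,
}


def translate(mrna: str) -> list:
    """Read an mRNA three letters at a time and return the amino acid chain."""
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        amino_acid = CODON_TABLE[codon]  # raises KeyError outside the excerpt
        if amino_acid is None:           # a stop codon ends translation
            break
        chain.append(amino_acid)         # attach to the previous amino acid
    return chain


protein = translate("AUGUUUGGCCCGUAA")
print(protein)  # ['Met', 'Phe', 'Gly', 'Pro']
```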

Gene expression can also be aligned with our book/scroll metaphor. During transcription, RNA polymerase enters the genetic library and notes down a message from one of the recipes in one of the DNA volumes. This message — the mRNA — is then sent outside the library, to the ribosomal workshop. At the workshop, the words that comprise the message are called out one by one — each one corresponds to one component: one amino acid. The transfer RNA brings the right component at the right time, which is attached to the previous component. In this way a protein device is manufactured from amino acid components.*

Developmental genetics revolves around the question of how traits become manifest at the level of the cell and of the organism. Long before the genetic code was cracked, some scientists began proposing that there was 'information' relevant to answering this question that did not reside in the chromosome's genetic code. The first clue that this might be the case lies in the sheer diversity of cell types in multicellular organisms. If all cells originate from a single fertilized egg cell, then they are all genetically identical. How then do the various cell types arise?

Let us imagine that a multicellular organism is like a city, with the cells corresponding to different sorts of buildings. Each building contains its very own genetic library: the nucleus. In the multicellular city, the first room in each building is always the library, which is crucial to the construction of the other rooms. But on its own, the library cannot determine what the final building will look like, because it always contains the same books. Imagine the perspective of a contractor who is asked to create a city containing various types of building but who, instead of being given the city plan, is given a set of books containing instructions for making any tool or building material whatsoever. This encyclopedia is very useful, but it doesn't specify when or where to do anything. All that information must be gleaned from somewhere else. This situation is also like being handed a dictionary and told to write a series of novels in vastly different genres. All the words and their meanings are contained in the dictionary, but the dictionary does not provide any clues for how to structure and order the words into sentences or paragraphs.

Biologists seem to agree that the cell type is not fully specified by the genetic information contained in the nucleus. Genetic information seems instead to represent the potential for the formation of any cell type. The concept of potential is reflected in the descending hierarchy of potency of stem cells: totipotent cells can form all the cell types in a body, pluripotent cells can give rise to most of the cell types, and multipotent cells can give rise to still fewer cell types. The developmental trajectory of the organism must provide the necessary information that constrains what sort of cell emerges from a cell division event. This idea was first popularized by Conrad Waddington, who coined the term 'epigenetic' — meaning “outside” the gene.

Waddington originally proposed a restricted definition of epigenetics: he called it “the branch of biology which studies the causal interactions between genes and their products, which bring the phenotype into being”. He proposed the concept of an epigenetic landscape to describe how a cell could “decide” on a particular form. He envisioned each cell as being represented by a marble at the highest point in a landscape of hills and valleys. From this point the marble can roll into any of the valleys. The bottom of a valley represents a stable cell type. The rolling of the marble down the slope represents the developmental trajectory that determines which valley it ends up in — which cell type it eventually becomes. The further the marble rolls, the fewer valleys become accessible. Since there is a growing number of cells in the developing organism, there are multiple marbles, which may collide with each other, so that different valleys come to be occupied. Thus the very process by which each cell develops can influence how other cells in the neighborhood develop. Waddington proposed the epigenetic landscape idea long before the DNA structure or the genetic code were understood, but it remains a powerful lens with which to interpret the research on the complex dynamics of development. The process by which genotypic potentiality becomes phenotypic actuality is starting to look quite far removed from books in a library, but there are still ways to preserve the original metaphor. The contractors building the multicellular city start constructing various buildings at the same time. The very act of construction changes the landscape, which in turn changes how the contractors make use of their all-encompassing libraries. When surrounded by buildings of type A, start looking up instruction #3; when the height of skyscraper X crosses a certain level, start looking up instruction #42, and so on. 
Creating this sort of complex epigenetic 'look-up table' is a plausible-sounding way to extend and enhance the original genetic information metaphor. But epigenetics may still render the metaphor hard to defend.

Modern epigenetics researchers have broadened Waddington's notion somewhat, describing their field as the bridge between genotype and phenotype. It covers not only the developmental questions that Waddington was interested in, but also the study of traits that are heritable yet do not involve modifications of the underlying genetic code. This aspect of epigenetics is both controversial and somewhat confusing. It's controversial because scientists who have come to know and love the modern evolutionary synthesis see any challenge to the old picture as an unnecessary and potentially harmful distraction. It's confusing because the word 'inheritance' is used in epigenetics in two ways: to describe transmission of information from parent to offspring, and more commonly to describe transmission of information from mother cell to daughter cells within an organism. These two types of transmission are related, but are not quite the same process. And for evolutionary theorists, only a change to the first sort of transmission would be seen as a true threat to the status quo of evolutionary theory.

Let's look into this epigenetic 'challenger': non-genetic inheritance from parent to offspring. As we have seen, genetic inheritance is dependent on the transmission of a half set of chromosomes from each parent to the offspring. The information in a chromosome is represented by the corresponding DNA sequence, and according to the standard story of inheritance, the typical way mutation occurs is by the rare, accidental change in one or more of the letters in a given stretch of DNA. It is the letters that carry the information, and so only by changing the letters can the message be altered. These rare mutations are the raw material for genetic variability, which in turn is the raw material for natural selection. But epigenetic research suggests that this is not the only way a new hereditary message can be sent from parent to offspring. From the standard genetic information perspective, the new epigenetic findings are akin to discovering that changing the typeface of a letter can change the meaning of the inherited message.

The most well-studied mechanism for transgenerational epigenetic inheritance is DNA methylation. DNA methylation is the process by which a methyl group is added to one or more of the cytosine (C) or adenine (A) nucleotides in a DNA molecule. A methylated 'C' is still a 'C' in the standard genetic code, but its meaning has been altered. Methylation is one of the mechanisms through which genes are silenced. DNA methylation may be one of the ways that a cell's role is constrained during the course of development. But in mammals, when the sex cells are formed, the methyl groups that accumulate over the course of an animal's life are usually removed from the genome of the sex cells through a process known as 'reprogramming', so most methylation probably doesn't get transferred to the next generation.
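One way to picture a methylated 'C' that is "still a 'C'" is as an annotation stored alongside the letter sequence, not inside it. The Python sketch below is a toy model under my own assumptions: the class `MarkedDNA`, its methods, and especially the rule that any methyl mark within a stretch silences it are hypothetical simplifications, not a description of how cells actually decide which genes to silence.

```python
from dataclasses import dataclass, field


@dataclass
class MarkedDNA:
    sequence: str
    # Positions carrying a methyl group; kept separate from the letters.
    methylated: set = field(default_factory=set)

    def methylate(self, position: int) -> None:
        """Attach a methyl mark to a C or an A without changing the letter."""
        if self.sequence[position] not in "CA":
            raise ValueError("this sketch only methylates C or A")
        self.methylated.add(position)

    def reprogram(self) -> None:
        """Strip all marks, loosely analogous to reprogramming in sex cells."""
        self.methylated.clear()

    def is_silenced(self, start: int, end: int) -> bool:
        """Toy rule (an assumption): any mark in [start, end) silences it."""
        return any(start <= p < end for p in self.methylated)


gene = MarkedDNA("GATCCGTA")
letters_before = gene.sequence
gene.methylate(3)  # position 3 holds a 'C'

print(gene.sequence == letters_before)  # True: the letters are untouched
print(gene.is_silenced(0, 8))           # True: but the reading has changed
```

The letter sequence compares equal before and after marking, so a copier that only looks at letters would miss the change entirely.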

Even though transgenerational epigenetic inheritance seems to be rare, it is controversial. This is because it can be triggered by the environment, and can therefore provide a mechanism for a notoriously non-Darwinian evolutionary process: the inheritance of acquired characteristics, or Lamarckian evolution. Transgenerational inheritance through DNA methylation is often dismissed as a rare and odd phenomenon, particularly in mammals, and therefore no threat to evolution by natural selection. But recently a group of neuroscientists showed that when mice are taught to fear a previously neutral odor, both their offspring and the subsequent generation are born fearing it. This is a clear example of inheritance of an acquired trait, and appears to have been caused by the demethylation of the gene for an olfactory receptor. [1]

The intriguing possibility of epigenetic inheritance in humans was suggested by a major multigenerational study in Sweden. The Överkalix study found that the risk of cardiovascular disease was influenced by whether the participants' grandparents went through famines or not. They also found sex-specific effects, such as a greater body mass index in the sons of men who began smoking early, but not in the daughters. Establishing that these effects are actually epigenetic rather than cultural is very difficult, however.

[Image: chromatin structures]

Changing the expression of a gene by adding a tiny methyl group to a nucleotide is a bit like changing a word's meaning just by italicizing one of its letters. There are other epigenetic processes that seem more like changing the shape of the book itself. The history of the study of heredity progressed from nucleus to chromosome, and from chromosome to DNA. Most people outside of modern molecular genetics don't realize that the conceptual step from chromosome to DNA involves a leap in scale: there is quite a bit of 'higher order' organization in the DNA molecule, beyond the famous double helix structure. Several epigenetic processes seem to operate at this higher level. A human cell's DNA is around 2 metres in length. Clearly it is very tightly packed in the nucleus, which is only 6 micrometers wide. But it is not just scrunched up in a structureless tangle. Chromosomes are made up of fibers of chromatin, a complex of DNA, protein and RNA. There are 3 levels of chromatin organization:

  • Euchromatin: Double-stranded DNA wraps around histone proteins, forming nucleosomes arranged loosely, like beads on a string.
  • Heterochromatin: Multiple histones wrap into a 30 nanometer fiber that consists of a compact array of nucleosomes.
  • Higher level DNA packaging of the 30-nm fiber: This level of structure arises during cell division.

If each chromosome is a book in the nuclear library, then the higher order structure of chromatin specifies the way the book is bound, and how the pages are arranged. Stretches of DNA that are being regularly transcribed in a particular cell are more loosely packaged, and are therefore in the euchromatin state, whereas DNA stretches that are inactive are more tightly packaged, and therefore found in the heterochromatin state. DNA that is in the euchromatin state constitutes the dog-eared pages of the cell's book: the pages that the cell repeatedly re-reads. The DNA in the heterochromatin state has been filed away for potential use at some other time, or perhaps by a daughter cell. Modification of the histone proteins that bind the nucleosomes together gives the cell a degree of control over which genes are transcribed and which are silenced. Histone modifications appear to be transmitted from a parent cell to daughter cells, but there is not yet any evidence that they can be transmitted from parent to offspring.

Histone modification gives us a glimpse of the dizzying web of complexity being uncovered through genomics and epigenetics. The best way to understand these recent developments is to realize that the overarching three-dimensional morphology of the genetic material in the nucleus influences the biology of the cell. In other words, understanding the link between genotype and phenotype requires elucidating the topology of the genome.

Molecules of DNA do not crumple up at random, but instead fold and coil into structures that help determine which genes are active and which are inactive in a given cell. One such structure is the DNA loop. In a pioneering study of the three-dimensional structure of the genome, researchers recently discovered that the human genome is divided into around 10,000 loops [2]. Each loop is formed when stretches of DNA that are far apart on a chromosome are brought together by specific kinds of protein. These loops in turn form condensed folds called “contact domains”. Stretches of DNA within a given domain tend to have the same histone modification patterns. And domains with the same pattern of histone modification tend to reside in nearby neighborhoods of the nucleus, even if they are far apart on the chromosome. The researchers found that there were at least 6 such neighborhoods, which they call nuclear “subcompartments”, each with a distinct 'flavor' of histone modification. They also found that many of the loops, domains and subcompartments were found in homologous parts of the mouse genome. So the higher order folding topology of the genome may also be partially conserved across different species. The researchers didn't really discuss the possibility, but the inter-species stability of these topological patterns suggests that they may be heritable.

A video that accompanies this research provides a vivid metaphor for the nature of genome topology. The DNA molecule is engaged in a form of origami: starting from a basic piece of genomic paper, different cell types fold up the DNA in different ways, contributing to the creation of the vast array of cell types we see in multicellular organisms. At this point the books in our genetic library are starting to look a bit out of shape. Our scrolls seem to rearrange themselves depending on context, folding up like origami to create all kinds of three-dimensional forms. As Waddington might have suggested, these folds, and the cellular versatility they appear to allow for, are not simply a product of phenomena restricted to the nucleus or even to the cell, but are a result of the overall developmental trajectory of the organism. And this trajectory itself is not immune to events outside the organism. As the study of the mouse fear memory suggests, the environment also provides crucial 'information' that influences development. The clearest evidence for an environmental role in epigenetic processes comes from studies of plants and of insects. Vernalization is a striking example in plants: after being exposed to the cold temperatures of winter, some plants introduced to temperate climates begin to flower earlier than normal. Evidence from mammals is weaker, but still quite suggestive. If a particular stretch of the mouse genome is unmethylated, a gene called agouti is overexpressed, leading to a yellow coat color, obesity and diabetes. Indirect evidence suggests that in humans, nutrition and exposure to toxins can have long-term effects on the phenotype. [3]

A certain degree of hype surrounding epigenetics [4] has led some people to worry that it is being used to resuscitate unproductive old ideas about evolution. But the mere existence of non-Darwinian inheritance mechanisms doesn't change the basic thrust of the theory of evolution by natural selection: epigenetic changes may wear off after several generations, and even if they do not, they may still be rare compared with evolution through mutation and natural selection. If some kind of Neo-Lamarckism emerges in the 21st century, it is likely to turn up only rare, subtle and weak effects. Lamarckism was, after all, a popular theory of evolution well into the late 19th century, and several unsuccessful attempts were made to prove that it was a real phenomenon. Nevertheless, transgenerational epigenetics may at the very least necessitate adding some interesting footnotes to the supposedly closed chapters in the history of transmission genetics.

For those of us who are more interested in developmental questions than heredity or long-term evolution, epigenetics adds more structure and complexity to our quest to understand the physical basis of biological processes. The metaphor of genetic information is a useful way to think about genetic transmission from parent to offspring and from mother cell to daughter cells, but as I hope I have shown here, it becomes increasingly cumbersome when we try to make sense of the processes that allow a cell to develop and fulfill its role in the living organism. We can preserve our “book of life” metaphor, but only at the cost of twisting it almost beyond recognition.

If the only purpose of genetics were to understand how proteins are created by DNA and RNA, then the metaphor of transcription and translation would be quite sufficient. Francis Crick's Central Dogma of molecular biology is often summarized as “DNA makes RNA makes protein”. But the vast majority of the genome (98% in humans) is “non-coding”. So most of the sentences in the library of life do not code for proteins. Some of these sentences may well be “junk DNA”: evolutionary hitchhikers that were neither useful enough nor harmful enough to become targets for natural selection. But many non-coding sections of the genome are transcribed: the RNAs that result from this process take up important roles as transfer RNA, ribosomal RNA, regulatory RNA, and microRNA.

Having surveyed the daunting complexity of genomic structure, we might console ourselves with the relative simplicity of gene expression. Even if the 3D structure of the chromatin plays a role in how genes are expressed, surely the basic decoding of DNA messages is straightforward? It's a code, after all.

It may be a code, but it's a rather peculiar code. Let's revisit the idea of gene expression. What exactly is a gene? So far I've tried to avoid using the word “gene” as much as possible. Originally, a gene was defined as a discrete, abstract unit of heredity: a carrier of a Mendelian factor of inheritance. Early studies of fruit flies identified discrete traits at particular locations on a chromosome. Once the structure of DNA was elucidated, it made sense to think of stretches of DNA as genes. The genes could then be imagined as beads on a string. But as it turns out, high-level traits that we observe at the organism level don't always have a neat one-to-one mapping with a stretch of DNA. Single-trait DNA stretches are the exception, rather than the rule. Freckles, for example, are controlled primarily by the MC1R gene. [5] Huntington's disease has been traced to an alteration in a single gene called Huntingtin. But the overwhelming majority of genetic variants seem to have only very weak correlations with disease, which is one of the reasons the Human Genome Project has led to disappointment in some quarters. Francis Collins, one of the former leaders of the project, confessed that “the Human Genome Project has not yet directly affected the health care of most individuals”. [6]

So most stretches of DNA don't code for proteins, and they don't seem to code for high-level traits. There is no single gene for autism, or depression, or obesity, or schizophrenia. But perhaps the strangest feature of the DNA molecule is the fact that it isn't always clear what an individual “sentence” is. When I described DNA transcription earlier, I left out some details that render the 'read out' metaphor somewhat more complicated. The genetic information in a protein-coding DNA sequence is carried by 'exons', which are interspersed with non-coding 'introns'. Alternative splicing, first observed in viruses in 1977, allows cells to snip out the introns and splice the exons together in various ways before the final messenger RNA molecule reaches the ribosome. So the same stretch of DNA can code for multiple proteins. There are even weirder 'encodings', such as genes within genes, and overlapping genes. Studies of RNA transcripts only render the picture more baffling. It seems that transcription can start at the DNA sequence for one protein, and then just go on 'reading', creating a 'fused transcript' that may result in a different sort of protein. In our book metaphor, now the messages being sent to the ribosomal factory are sometimes ripped up and reordered, like a ransom letter made from newspaper headlines. [7]
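
For readers who think in code, the splicing step described above can be sketched as a toy Python example. Everything in it is invented for illustration: the transcript, the exon coordinates, and the function names `splice` and `all_isoforms` are assumptions, and real splicing is governed by sequence signals and regulatory proteins that this sketch ignores. It only shows how one stretch of sequence can yield more than one message, depending on which exons are kept.

```python
from itertools import combinations

# Hypothetical pre-mRNA: exons in upper case, introns in lower case.
# The sequence is made up purely for illustration.
PRE_MRNA = "AUGGCC" "guaagu" "UUCGAA" "guccag" "GGAUAA"

# (start, end) positions of the three made-up exons within PRE_MRNA.
EXONS = [(0, 6), (12, 18), (24, 30)]


def splice(transcript, exons, keep):
    """Join the exons whose indices are in `keep`, preserving their order."""
    return "".join(
        transcript[start:end]
        for index, (start, end) in enumerate(exons)
        if index in keep
    )


def all_isoforms(transcript, exons):
    """Keep the first and last exon, plus every subset of the middle exons."""
    middle = range(1, len(exons) - 1)
    results = []
    for size in range(len(middle) + 1):
        for chosen in combinations(middle, size):
            keep = {0, *chosen, len(exons) - 1}
            results.append(splice(transcript, exons, keep))
    return results


for message in all_isoforms(PRE_MRNA, EXONS):
    print(message)
```

With three exons the sketch produces two messages, one that skips the middle exon and one that keeps it. The number of possible messages doubles with each additional optional exon, which gives some sense of how quickly the 'one sentence, one meaning' picture breaks down.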

When we investigate developmental genetics closely, we find that our neat metaphors of the DNA molecule as a book or a computer start to sound increasingly tenuous. The meaning of a book typically does not depend on its three dimensional topology. And running a computer program would be rather difficult if the symbols could mean different things in subtly different contexts. But without trying to stretch these metaphors to breaking point, perhaps only specialists would be able to fully appreciate the dynamism and complexity of biological processes.

Over the past few months I have tried to look at information from a variety of angles as I worked my way towards the concept of genetic information, a concept that seems to be creaking under the weight of empirical reality these days. One thing I rarely come across when I read about codes in biology is the idea that the real codes used by humans are typically arbitrary and ad hoc. There is no necessary physical connection between the word 'tea' and the beverage. If someone made up a word and explained to me that it meant tea, I would have no problem adding it to my lexicon. Any symbol can be used to stand for any other object, or person, or concept, or percept. Inheritance may involve discrete traits, and the nucleotides that make up the DNA sequence may seem very much like letters, but unlike human codes, nature's symbols do not seem to be ad hoc. Could we somehow coax a ribosome to turn, say, the symbol-string 'AGGTACCATCATGATGATGAT' into an arbitrary protein of our choice? Could we even invent an artificial system that would do this? I don't think so, because proteins are more than just the sum of the amino acids that make up their chain. Proteins are three-dimensional physical objects that can fold up in several ways. This folding depends on factors other than the amino acids, such as the temperature, the pH, the concentration of salts, and various other physical and chemical factors that are rarely given the privileged title of biological 'information'.
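
To make the 'code' part of this concrete, here is a minimal Python sketch of what a purely symbolic reading of that string would look like. It rests on several simplifying assumptions: the string is treated as a coding-strand DNA sequence read in frame from its first letter, start and stop signals are ignored, and the lookup table contains only the four codons that happen to appear in it (their amino acid assignments are taken from the standard genetic code). The function name `translate` is my own.

```python
# A small subset of the standard genetic code, written in DNA letters.
# Only the codons that occur in the example string are included.
CODON_TABLE = {
    "AGG": "Arg",  # arginine
    "TAC": "Tyr",  # tyrosine
    "CAT": "His",  # histidine
    "GAT": "Asp",  # aspartic acid
}


def translate(dna):
    """Read the string three letters at a time and look up each codon."""
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


chain = translate("AGGTACCATCATGATGATGAT")
print("-".join(chain))  # Arg-Tyr-His-His-Asp-Asp-Asp
```

The lookup returns a list of names and nothing more. Nothing in the output says how the chain would fold, and that missing part is exactly what depends on temperature, pH and the rest of the physical context.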

What is true of the protein molecule is also true of the entire genome. Its three-dimensional structure has a powerful effect on which of its stretches of DNA are allowed to be transcribed. And this structure may well depend on existing levels of gene expression, as well as 'contextual' factors inside as well as outside the cell. There is no denying that some aspects of the genome resemble long strings of abstract symbols. But perhaps this is only a resemblance. For understanding transmission genetics, perhaps this resemblance provided the necessary spark for the researchers' imaginations. But perhaps developmental genetics, and genomics, and epigenetics, and biology more generally, will be better served by Waddington's more grounded metaphor: marbles rolling down a hilly landscape. The marbles are not words to be translated, but real physical objects whose trajectories are influenced by all the processes that are currently ongoing, from the texture of the surface they are rolling on to the subtle movement of the air, as well as the tracks left by marbles that have been down those paths before.

Perhaps abstract, symbolic and informational notions can only take us so far, after which nature invites us to reacquaint ourselves with those of its facets that are irreducibly concrete and physical.


Notes and References

[1] Lamarck revisited: epigenetic inheritance of ancestral odor fear conditioning [Behind a paywall]

[2] DNA Loop-the-Loops

[3] Epigenetics and the environment: emerging patterns and implications [Behind a paywall]

[4] Do Your Grandmother's Experiences Really Make It Into Your Genes?

[5] Observable Human Characteristics

[6] Ten Years On — The Human Genome and Medicine

[7] What is a gene? [Behind a paywall]

* The process of assembling a protein can also be described using a pre-scientific metaphor: 'name magic', which was the topic of my first 3 Quarks Daily essay. “To summon a demon you must know its name.” To summon a protein its name must be copied from one of the DNA scrolls. The name is the exact description of the protein, and in order to summon the protein, its name must be recited at the ribosomal 'altar'. Helpful sprites — the tRNA molecules — will hear the name-spell and bring with them the protein body parts. Name magic of this sort has also been used to talk about how computer programs work.