The Poetry Fan Who Taught an LLM to Read and Write DNA

Ingrid Wickelgren in Quanta:

DNA is often compared to a written language. The metaphor leaps out: Like letters of the alphabet, molecules (the nucleotide bases A, T, C and G, for adenine, thymine, cytosine and guanine) are arranged into sequences — words, paragraphs, chapters, perhaps — in every organism, from bacteria to humans. Like a language, they encode information. But humans can’t easily read or interpret these instructions for life. We cannot, at a glance, tell the difference between a DNA sequence that functions in an organism and a random string of A’s, T’s, C’s and G’s.

“It’s really hard for humans to understand biological sequence,” said the computer scientist Brian Hie, who heads the Laboratory of Evolutionary Design at Stanford University, based at the nonprofit Arc Institute. This was the impetus behind his new invention, named Evo: a genomic large language model (LLM), which he describes as ChatGPT for DNA.

ChatGPT was trained on large volumes of written English text, from which the algorithm learned patterns that let it read and write original sentences. Similarly, Evo was trained on large volumes of DNA — 300 billion base pairs from 2.7 million bacterial, archaeal and viral genomes — to glean functional information from stretches of DNA that a user inputs as prompts.
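To make the analogy concrete, here is a minimal sketch of the training objective behind a genomic language model: treat each nucleotide as a token and teach a model to predict the next base, exactly as ChatGPT predicts the next word. This toy transformer is an illustration only; Evo's actual architecture, tokenization, and scale differ, and every name below is hypothetical.

```python
# Toy next-base prediction over DNA, illustrating the general idea of a
# genomic language model. NOT Evo's real architecture, just the objective.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "T": 1, "C": 2, "G": 3}  # one token per nucleotide base

def encode(seq: str) -> torch.Tensor:
    """Map a DNA string to a tensor of token ids."""
    return torch.tensor([VOCAB[b] for b in seq], dtype=torch.long)

class TinyDNALM(nn.Module):
    def __init__(self, d_model: int = 64, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to earlier bases.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.encoder(self.embed(x), mask=mask)
        return self.head(h)  # logits over {A, T, C, G} at each position

model = TinyDNALM()
seq = encode("ATGCGTACGTTAGC").unsqueeze(0)  # a made-up toy sequence
logits = model(seq[:, :-1])                  # predict base i+1 from bases 0..i
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(VOCAB)), seq[:, 1:].reshape(-1)
)
loss.backward()  # gradient for one training step
```

Trained this way on billions of real genomic bases rather than one toy string, a model can assign higher likelihood to functional sequence than to random strings of A's, T's, C's and G's, which is what lets it "read" DNA.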

More here.
