Adam Rogers in Business Insider:
The inner workings of the large language models at the heart of a chatbot are a black box; the datasets they’re trained on are so critical to their functioning that their creators consider the information a proprietary secret. So Bamman’s team decided to become “data archaeologists.” To figure out what GPT-4 has read, they quizzed it on its knowledge of various books, as if it were a high-school English student. Then they gave it a score for each book. The higher the score, the likelier it was that the book was part of the bot’s dataset — not just crunched to help the bot generate new language, but actually memorized.
In a recent preprint (meaning it hasn't been peer reviewed yet), the team presented its findings: what amounts to an approximation of the chatbot canon.
More here.