The Turing test has been beaten. Soundly, I might add. I think most people who have been observing the changing character of LLM interaction for the past three-odd years[1] would agree that we blew past it a ways back, but recently it was made official when researchers found that GPT 4.5, the largest pre-trained model in the world, was judged to be human by an interrogator a remarkable 73% of the time. Llama 3.1 405B choked out a win, being judged human 56% of the time, and smaller models lost. So it seems that if a model is trained long enough and big enough on a high-quality data distribution, it will pass the Turing test for the vast majority of people.
And yet, we are unsatisfied. We do not feel that we’ve achieved artificial general intelligence.
It turns out, as is often the case with AI research, the thing we were aiming at was not the target we wanted to hit. Intelligence is hard to define, as Turing noted, so he chose an at-the-time preposterous proxy goal: in a five-minute conversation, fool a judge into thinking you were more human than your flesh-and-blood counterpart. The reasoning seems sound. If a system can accomplish this, it implies some level of intelligence. A human interrogator might ask anything, and it's impossible for an operator to "pre-program" all the answers in. So the system must, live, respond to any arbitrary part of the conversation and come across the same way you or I would. It can't forget the conversation halfway through, it has to keep its story relatively straight, and it has to come across as something that feels[2]. So why aren't we satisfied?
Well, lots of reasons. LLMs are still prone to hallucinate facts and are mostly incapable of taking action across long time horizons. Like many people, I use them for software engineering, and have found them far more capable at quickly doing things I already understand how to do. Claude Code, for instance, is a very impressive system. If your task is relatively in-distribution (you're doing something someone else has done hundreds of times), it is snappy and effective. For tasks that are more out-of-distribution, that is, attempts to do something rare or strange, the results are more mixed. As web assistants, they can browse, but pretty slowly; I still can't find a system that can reliably book me a doctor's appointment or a haircut, and I've yet to see any system anywhere of any kind that I can set on a task of reasonable importance without triple-checking that it was completed as I specified. You can use tools and function calling to paper over some gaps, but outside the tight sandbox of known workflows, even the best agents are unreliable. That is, I am not working with an intelligent assistant, but more with a junior employee whose work I can't trust without a review, because it's my ass if they blow it.
That said, it's impossible to argue the systems aren't becoming more impressive and capable with scale and know-how. Evals, the benchmarks designed to help us smoothly measure what we expected to be a slow trickle of capability advancement, keep being totally defeated across diverse fields as quickly as they're developed. More and more providers show up with models that are GPT-4-tier or above, and some of those are open-source. The money is continuing to flood in, the research is still in its earliest stages, and the energy is there. There's no reason to expect the improvements won't continue. So it becomes worth asking yourself: where will you draw the line? What would you want to see a model do before you hang up your hat, take a long swig of something sharp, and say, "Yup. That's an intelligent system."
For some people it's building full-stack applications from written specs. For others it's piloting a robot that can come into an arbitrary house and make a decent cup of coffee. For the nerds it's playing a rousing game of Gen-1 Pokemon all the way through[3]. But for me? For me, it's running a long-form, dramatically satisfying, and tactically engaging game of Dungeons & Dragons.
Why D&D?
I didn't play D&D growing up. I heard about it, but it never interested me. If it had, I doubt I'd have been able to find friends to play it with. I did, on the other hand, read a lot of fantasy novels and play an absolute shitload of video games. I started playing with the PlayStation and Game Boy Color and was spoiled for choice. Pokemon pretty much taught me how to read. It was fun to explore digital worlds where anything could happen at my own pace[4].
When I got a little older I knew some people who played D&D, but I didn't get it. I mostly associated it with LARPing, which I was well aware I did not have the social capital to survive a down payment on. Also, what was the point? I had spent a lot of time play-acting with my little brother that we were in Dragon Ball Z or whatever, and while goofy and fun to do, there was nothing that seemed to justify indulging in it as a particularly thought-provoking activity. You just said you wanted to do some arcane shit, and arcane shit happened or didn't, and in the meantime you played dress-up with other socially awkward people and did goofy voices. Not interested.
Critical Role got popular, and I started listening to it on long drives. Now this was categorically different. Stories were long-form, the players were all talented actors, and the rules meant something. I watched the story change in response to actions taken by the players. A random NPC became an important friend of the party, and they were sad when he died. Narrative arcs slowly emerged for each of them and were resolved within the campaign. Combat was this serious affair that Matthew Mercer ruled over with an impressive amount of knowledge and more than a few books, and if the players screwed up they failed and maybe died and the show went on. I started to get the whole "collaborative storytelling" thing. I reasoned this was an exception; no game I'd ever played was like that, so it was probably pre-written and performed by these talented actors. Still, I had a fondness for D&D at that point. I was in grad school though, focused mostly on deep learning and earning enough money to pay rent, and fun on anybody else's schedule was not on my bingo card.
Then the pandemic hit, and I made a decision only possible during global crisis-induced cabin fever: I agreed to give up four hours every Sunday to play a two-year-long game of D&D. And it was magic.
I spent that year thinking mostly about my ML work and D&D. I was amazed at what our DM did. He made persistent characters, he set up encounters, he took our relatively goofy characters and gave them real dramatic arcs. I found myself invested in this warlock I was playing and whether he would ever be free from the consequences of his pact with a fiend. We started a tavern on a whim and spent a two-hour session mostly seeing whether or not we were profitable. Our DM set up these intricate worlds and characters, and we responded to his investment with equal parts reckless abandon and careful consideration. It felt like a world. Suddenly video games didn't hold my interest very much. They felt limited. I could see the strings of what I was allowed to do and what I wasn't. A few games like Baldur's Gate 3 and Disco Elysium scratched the itch (long-form consequences of my actions, a world that pretends to give a shit about what I choose to do in it), but even then. Every time a character started repeating dialogue, it started to feel hollowed out. I had reached the limits of the sandbox I was placed in. I started to see being a dungeon master as this very rich, human, creative and reasoning-intensive activity.
Coincidentally, all the things I like about a good dungeon master are all the things LLMs struggle with. Their writing tends to be very flat. Their context-length limitations mean an NPC's backstory or characterization is likely to change, if the NPC doesn't just fall off the stack entirely. A game with hallucinated rules is really no game, and without emerging narrative arcs for your characters there's no reason to care. Right now, a campaign run by an LLM is everything I thought LARPing was: chaotic, shapeless, and kind of embarrassing. But does it have to be?
The Gygax Test
And so, The Gygax Test: Can an LLM run a Dungeons & Dragons campaign? One with interesting and developing long-running characters, emerging dramatic narratives, and tactically stimulating combat? Can it interact with real human players and create a satisfying, campaign-length play experience? Right now, the answer seems to be a sound no[5]. But, as far as I can tell, nobody's seriously tried. It's easy to see why: how do you even measure progress? Vibes?[6]
But I want it to exist, so I'm going to try to build it. That's what this blog and Twitter account are for. It's partially an excuse for me to dive into the full LLM stack (data, evals, finetuning), partially a relatively under-studied sandbox of my own to play in as I develop my research muscles, and partially a chance to interact with a system that gives me that same feeling I had during that campaign. Obviously it wouldn't be human even if it were good at its job, and I wouldn't feel the same warmth I felt towards a good friend running a game, but scientifically there are a lot of interesting research questions and no hard blocks.
So what’s to come?
This blog will be part worklog, part research journal. Expect breakdowns of LLM limitations, experiment writeups, and eventually releases: data, eval tools, maybe even a few models. The first experiments will be modest. Think: natural language → combat move resolution. Can a model interpret intent and run the mechanics cleanly? From there we’ll work up to memory, character modeling, and narrative cohesion. The benchmarks will be fuzzy, but that’s part of the fun. And if we get it right? Maybe someday an LLM will run a game worth remembering.
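To make that first experiment concrete, here's a minimal sketch of what a combat-resolution eval harness could look like. This is a toy under my own assumptions, not a committed design: the model under test (hidden behind a hypothetical `call_model` stub) gets handed the dice rolls and a natural-language intent, is asked to emit structured JSON, and gets graded against a deterministic rules resolver.

```python
import json
import random
from dataclasses import dataclass

@dataclass
class Combatant:
    name: str
    ac: int            # armor class
    attack_bonus: int  # added to the d20 roll
    damage_bonus: int  # added to damage dice on a hit

def resolve_attack(attacker: Combatant, defender: Combatant,
                   d20: int, damage_rolls: list[int]) -> dict:
    """Ground-truth 5e-style attack resolution from pre-committed dice."""
    hit = d20 == 20 or (d20 != 1 and d20 + attacker.attack_bonus >= defender.ac)
    damage = sum(damage_rolls) + attacker.damage_bonus if hit else 0
    return {"hit": hit, "damage": damage}

def grade_turn(model_json: str, truth: dict) -> bool:
    """Did the model apply the mechanics correctly? Assumes it was prompted
    to answer with JSON containing 'hit' and 'damage' fields."""
    try:
        claimed = json.loads(model_json)
    except json.JSONDecodeError:
        return False  # unparseable output counts as a rules failure
    return all(claimed.get(k) == truth[k] for k in ("hit", "damage"))

if __name__ == "__main__":
    rng = random.Random(0)  # seeded so every model sees identical dice
    fighter = Combatant("Brunhilde", ac=18, attack_bonus=5, damage_bonus=3)
    goblin = Combatant("Goblin", ac=13, attack_bonus=4, damage_bonus=2)
    d20, dmg = rng.randint(1, 20), [rng.randint(1, 8)]
    truth = resolve_attack(fighter, goblin, d20, dmg)
    # In the real harness this prompt would go to the LLM under test:
    # model_json = call_model(f"Brunhilde says: 'I swing my longsword at the "
    #                         f"goblin.' You rolled {d20} to hit and {dmg} damage.")
    model_json = json.dumps(truth)  # stand-in for a perfect model answer
    print("pass" if grade_turn(model_json, truth) else "fail")
```

Pre-committing the dice in the prompt keeps the task purely about rules application, so scoring can be exact-match rather than, well, vibes.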
[1] ChatGPT came out in late November of 2022. Originally I had written "five-odd" years, but that's just what it feels like.

[2] It has been noted by many high-taste testers of LLMs that large models have a "big model smell" indicative of the vast amount of data they've compressed, and OpenAI and others have noted GPT 4.5 is better at writing than prior models, which tended to collapse into slop with the character of an enthusiastic but not particularly bright high-school essay writer who got all their reps in prepping for the SAT.

[3] Personally I would prefer it scored highly on Pokemon Showdown. If it can figure out the meta it's way smarter than me. Hell, I'd take it if it could figure out the IV/EV stuff and train a "perfect" team. Talk about your long time-horizon tasks, amirite?

[4] That is, a voracious and myopic one.

[5] And everyone involved seems mostly stoked about that lack of capability. Given WOTC's anti-consumer practices, I get it.

[6] Tragically non-scalable, I'm afraid.