There’s a fast way and a slow way to go about developing an autonomous LLM Game Master. The fast way - the agentic way - would probably be to build some MCP servers to act as a harness, providing access to maps and rulesets, plugging the results into a frontier model and seeing what happened.
I was intending to go that way, developing some environments I could do GRPO on immediately. The thought was, provide an MCP server that allowed rule search, and then come up with some combat scenarios that had verifiable outcomes1. Those would work both as evals for frontier models and as external rewards for any model being trained, and would certainly get a lot further a lot faster toward the final desired result.
However, that would only make sense if my goal was primarily or exclusively to get to the end product: an LLM that can run TTRPGs. It’s not. The goal is to get a better understanding of the front-to-back development of model capabilities and get as much painful hands-on experience as possible with every part of the stack. Besides, the research suggests RL may just be focusing skills/knowledge a pretrained model already has. At any rate, it makes sense to bake in priors specific to TTRPGs - their rulesets are particular, and a baseline understanding of their structure would make any tool-assisted lookups easier to do and the results of that search easier for the model to interpret. So, we’ll start with base models and work our way up from scratch.
Before we have an LLM GM, we’ll create a model that can act as an assistant to a GM or player playing a TTRPG, and see how such a system can make a transition to a more agentic GM.
Getting Started
The additional constraint is around compute. This project is GPU poor. Due to those constraints, we’ll only choose models we can reasonably train without blowing our budget, and we’ll be seeking to make this model as small as possible while still being useful. I’m an admirer of Alexander Doria’s work, which inspired me to work with small models and synthetic datasets. The new Qwen3 series appears to have been pre-trained really well. In addition, it has quite a few different sizes of dense base models, starting at a minute 0.6B up to a respectable 14B. If my compute situation were to change2, these different size models would let us experiment with scaling laws on these tasks/datasets.
Also, crucially, it appears to know nothing about Shadowdark RPG whatsoever, the TTRPG I’ll be focused on for these experiments.
Asking Qwen3-8B:
“What do you know about Shadowdark RPG?”
Results in:
“Shadowdark RPG is a horror-themed tabletop role-playing game (RPG) created by James Raggi, the same designer behind the Dungeons & Dragons 3.5 edition. It is designed to immerse players in a grim, survival-focused setting where the goal is not to defeat monsters but to survive the horrors of the world.”
Qwen goes on to explain the game is out of print from a now defunct publisher, but still has an active scene of reprints and fan-made modules. This sounds nice and plausible but is totally hallucinated. The designer is in fact the wonderful Kelsey Dionne3, who runs the very much thriving company “The Arcane Library”. Shadowdark does have a somewhat darker tone, but is at its base a heroic fantasy RPG that trims a lot of the fat of modern 5th edition DND. It’s a fun game, and has been something of an indie darling for those moving away from DND who aren’t ready to get an associate’s degree in Pathfinder.
Initially I was considering diving right into an LLM for 5e DND, but decided against it for a few reasons. For starters, it’s big. That’s a positive for the availability of training data, but less forgiving on the wallet when it comes to acquiring all those materials. There have also been so many versions of DND, at so many different points in the ruleset, that even the base models I can get my hands on are going to carry tons of priors I would know nothing about. Pretraining from scratch is infeasible, but having some assurance that I’m starting from close-to-nothing in terms of introducing knowledge during pretraining makes me feel a little more scientific. Finally, simple laziness. Verifying that a game of DND is being run “according to the rules” is hilariously complicated, and I would spend more time doing personal rules lawyering than working on model development.
So our interest here is seeing how continued training of a small base model with tokens associated with a new domain can bake in knowledge we want. Particularly, knowledge about how the game broadly functions, and specifically knowledge recall of particular rulesets. This maps to how GMs work in real life: they go to the rulebook if there’s something obscure to recall, but they understand most of the rules intuitively and can simply focus on running the game, using that knowledge on the fly to adjudicate a game in flight.
Continuing Pretraining
I was watching the ICML 2024 Physics of LLMs tutorial, specifically the section on knowledge extraction. My interest has been chiefly in RL, and I hadn’t been particularly interested in pretraining. After watching the talk and sifting through some of the experimental results, it seemed reasonable to hold back on any GRPO-like training in favor of baking in some basic priors around Shadowdark. After all, for the verified answers to work at all, we’ll need pass@k to be non-zero, and ideally for some non-absurd k. And as we’ve established, Qwen at least seems to know almost nothing about Shadowdark as a game. It has some TTRPG priors, but it’d be nice if we could bake in more specific rules.
Data
Our base model knows nothing about Shadowdark RPG, but we’d like it to understand the basic rules of the game before we get into further finetuning as an assistant. So we’ll need to get data.
OCR
I assumed this was going to be somewhat painful, but OCR has gotten much better than it was a few years ago. I purchased a copy of Shadowdark RPG’s base rules, a 332-page document full of stylish formatting, thematic (to be read as: challenging) fonts, and images. I used Mistral’s OCR, which managed to extract all the text into clean markdown format, and it took me around 30 seconds from putting my API key in an .env to having my data. Even the tables came out totally clean. The only mistakes I noticed were with the gothic text. “Shadowdark” in particular often became “Shamndork”, which was easily corrected. The more normal font used through most of the PDF was read without a hitch.
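For reference, the whole flow is just a few calls to the mistralai Python client. A sketch, with file names assumed and the method names based on my reading of Mistral’s docs rather than a copy of my exact script:

```python
# A sketch of OCR'ing a local PDF with Mistral's OCR endpoint via the mistralai client.
# The file names and the "mistral-ocr-latest" model id here are assumptions.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Upload the rulebook PDF and get a signed URL the OCR endpoint can read.
uploaded = client.files.upload(
    file={"file_name": "shadowdark_core_rules.pdf",
          "content": open("shadowdark_core_rules.pdf", "rb")},
    purpose="ocr",
)
signed_url = client.files.get_signed_url(file_id=uploaded.id)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": signed_url.url},
)

# Each page comes back as markdown; concatenate into one corpus file.
with open("shadowdark_rules.md", "w") as f:
    f.write("\n\n".join(page.markdown for page in ocr_response.pages))
```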


So, there you go. All of the Shadowdark rules, collected into nice clean text. While there are non-AI methods for pulling out the raw text, the markdown structure keeps the hierarchical presentation of the information clean. In particular, being able to read in rules tables as clean markdown, as opposed to a random series of numbers and symbols I would have to strip out of the dataset, was hugely valuable. So, we’ve got a corpus.
Except for the part where we have… 122,466 tokens total. Not quite the “10 billion tokens worth of data” one might desire to incorporate a new body of knowledge. Oh. Well, we’ll cross that bridge when we get there.
Shadowdark QA Eval
We want some hill to climb if we’re going to track how different models, training strategies, data, and hyperparameters perform. I created Shadowdark QA Bench 1 for this purpose.
In order to create a dataset that covered as much of the corpus as possible, I created a pipeline that leveraged frontier models to generate the questions. Every page in the book was tagged (using the table of contents) as referring to one of seven particular bodies of knowledge. Those are:
spell_mechanics: These refer to statistics of spells.
player_characters: This is information about character creation, talents of specific sub-classes, etc.
{
"question": "When determining stats for your character, what is the process involved?",
"answer": "Roll 3d6 in order for each stat, note each total and modifier on your character sheet. If none of your stats are 14 or higher, you may roll a new set of six numbers."
}
monsters: Questions about monster statistics and abilities.
{
"question": "What is the AC of a Manticore?",
"answer": "14",
}
game_rules: This is a catch-all, mostly for game-master types of questions.
{
"question": "How is XP awarded for treasures, and what happens when PCs level up?",
"answer": "XP is awarded for treasures based on their value, with 100 gp worth of gear granting +1 XP. Leveling up means PCs gain a stat and their total XP increases by 100."
}
lore_and_setting: These are questions about the lore and setting of Shadowdark. Shadowdark is generally pretty light and vague about lore particulars, but there are still plenty of specific deities and “lands” in the lore.
{
"question": "What do Law, Chaos, and Neutrality represent in Shadowdark?",
"answer": "Law, Chaos, and Neutrality represent the opposing forces of order and chaos in the universe."
}
equipment_rules: There’s a lot of loot in Shadowdark, and these items have various effects and interactions.
{
"question": "What benefits does wearing Ophidian Armor provide?",
"answer": "+1 leather armor bonus and advantage on checks to avoid poison effects."
}
combat_mechanics: Rules around how combat works. These are pretty basic numbers questions or questions about the existence of rules.
{
"question": "When can a creature attack with advantage in combat?",
"answer": "When an ally is directly adjacent to their target."
}
With those tags created, random collections of pages were sampled from each category and pushed to an LLM with an ask to create a fact-based QA pair, supported by two questions I had written by hand for each category as few-shot examples. I also placed a similarity check in the pipeline - if a QA pair had too high a semantic similarity to an existing question, it was rejected4.
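The rejection step itself was simple. A sketch of how it could look, with the embedding model and the threshold here being illustrative assumptions rather than the exact values I used:

```python
# A sketch of the similarity-based rejection step: a candidate question is
# rejected if it embeds too close to any already-accepted question.
# The embedder choice and 0.85 threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def is_near_duplicate(candidate_question: str,
                      accepted_questions: list[str],
                      threshold: float = 0.85) -> bool:
    """Return True if the candidate question is too similar to an existing one."""
    if not accepted_questions:
        return False
    cand = embedder.encode(candidate_question, convert_to_tensor=True)
    accepted = embedder.encode(accepted_questions, convert_to_tensor=True)
    return util.cos_sim(cand, accepted).max().item() >= threshold
```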
Eval Metrics
Now that I’ve got some questions, I need a metric to track. This actually took a bit more thinking than I expected and I ended up landing on something I like.
Multiple Choice
The most obvious way to go about it was turning the QA into an MMLU-like multiple-choice, randomly sampling other answers from within the category and verifying whether the right letter answer was picked. Alternatively I could’ve used another synthetic LLM-based pipeline to generate plausible sounding but wrong answers. Either way, it didn’t appeal to me. That’s not how human beings would ask questions of an assistant, and I think it’s just too easy and would end up overestimating performance in a way that left me disappointed when it came time for assistant tuning.
Exact Match
I threw this out almost immediately. If I want to know the DC needed to avoid a vampire’s “charm” ability and my ground truth is ‘15 CHA’ and the model answers ‘fifteen charisma’, well, that’s correct. If I needed it to pass this over in a structured way to some tool I might be more specific in how I want the information returned, but I just want to know if the knowledge is being retained, so this is way too strict.
Semantic Similarity
I was more hopeful about this score, and I still track it. I format each QA pair as Question: {question}\nAnswer: {answer} and embed it using all-MiniLM-L6-v2. But without any further finetuning, it ends up being pretty problematic.
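Concretely, the scoring works out to something like this sketch, assuming the ground-truth pair and the model’s answer are both formatted the same way before embedding:

```python
# A sketch of the semantic similarity metric: embed the ground-truth QA string
# and the model's QA string with all-MiniLM-L6-v2 and take cosine similarity.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(question: str, ground_truth: str, model_answer: str) -> float:
    reference = f"Question: {question}\nAnswer: {ground_truth}"
    hypothesis = f"Question: {question}\nAnswer: {model_answer}"
    ref_emb, hyp_emb = embedder.encode([reference, hypothesis], convert_to_tensor=True)
    return util.cos_sim(ref_emb, hyp_emb).item()
```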
Consider this example I found during an eval run.
{
"question": "How much damage does the spell 'Fireball' deal?",
"ground_truth": "4d6 damage.",
"model_answer": "1d6 fire damage",
"semantic_similarity": 0.7150089740753174,
"category": "spell_mechanics"
}
1d6 and 4d6 are pretty close in semantic similarity as far as all-MiniLM-L6-v2 is concerned. If I looked at that metric, I’d probably think the model was doing quite well. Only problem is, that’s totally wrong! In particular, for these mechanical rules and factual recall tests, the difference between 4 and 1 here is really meaningful to rules recall in a way it wouldn’t be if I was looking for names or some other kind of detail. I can imagine training an embedding model that has a stronger idea of a TTRPG notion of “similarity”, but that’s a whole mess of work I’d rather not get into right now.
Keyword Based Matching
I have two problems: there are multiple formats the “right answer” could be presented in, with baked-in ‘synonyms’ (CHA vs charisma) I want to score, and a raw similarity metric might reward text that is only slightly different but has a large mechanical difference rules-wise. What I landed on is keyword-based matching. Basically, turn the important parts of an answer into a list of lists. Each list contains the “synonyms” for a necessary part of the answer, which I refer to as a concept. Your final score ends up being the number of concepts included versus the total number of concepts in the keywords.
For example, an answer of “15 CHA DC” might have keywords of:
[["15", "fifteen"], ["cha", "charisma"], ["dc", "difficulty class"]]
Each list I count as a “concept”, and if one of the elements in the list is present in your answer, you get a point for including that concept. The final score for an answer is number_of_correct_concepts over total_keyword_concepts. This allows me to be precise in my understanding of mechanical rules, say, a 1d4 versus a 3d4, as well as grading partial correctness. “15 CON DC” would be 66.6% correct, and I would consider that better than, say, “10 CON DC”.
Obviously there are a few issues here - there’s no score reduction for including extraneous concepts, for example. In addition, for multiple fact recall, say, “What is the AC and HP of a dragon”, the model would be rewarded for getting the right numbers but attributing them to the wrong stat. But in practice I’ve found it useful for grading. These keywords were extracted from the answers using an agent few-shot prompted on some examples I provided. They needed a little touchup after the fact, but mostly I’m happy with them. As an example of an evaluation:
{
"question": "What is the duration and range of the Confusion spell?",
"ground_truth": "The Confusion spell has a duration of 'focus' and a range of 'near.'",
"model_answer": "1d4 rounds, near",
"semantic_similarity": 0.15265582501888275,
"concept_score": 0.5,
"category": "spell_mechanics"
}
The range the model chose is correct, but the duration was wrong. That’s about “half right” as far as I’m concerned, and the score reflects that much better than an exact match (0.0) or the semantic similarity score (0.15) would imply.
Now, this turned out to be too basic a measure in some ways. I only looked into it because the score of the untrained Qwen 0.6B was around 12%, which seemed hilariously high for randomly guessing at these pretty specific TTRPG rules. Then I looked at the data. Here’s an example of a partially correct answer Qwen 0.6B got.
{
"question": "What is the death timer set to in Deadly Mode?",
"ground_truth": "1",
"model_answer": "15 seconds",
"concept_score": 0.3333333333333333,
"category": "game_rules"
},
Here’s the first draft of my evaluation function.
def evaluate_answer(generated_answer: str, expected_keywords: list[list[str]]) -> float:
    """
    Evaluate if generated answer contains required concepts.

    Args:
        generated_answer: Model's generated response
        expected_keywords: List of concept groups, where each group contains
            alternative phrasings of the same concept

    Returns:
        float representing the score, 1.0 is a perfect match and 0.0 is no match
    """
    generated_lower = generated_answer.lower()

    # Check each concept group
    concepts_found = []
    concepts_missing = []
    for concept_group in expected_keywords:
        # Check if ANY keyword in this group is present
        found = False
        for keyword in concept_group:
            if keyword.lower() in generated_lower:
                concepts_found.append(concept_group)
                found = True
                break
        if not found:
            concepts_missing.append(concept_group)

    # Score is the fraction of concept groups with at least one match
    total_concepts = len(expected_keywords)
    return len(concepts_found) / total_concepts if total_concepts > 0 else 0.0
That’s too lax a check. “1” becomes a magic answer in a lot of ways, because any difficulty score above a 10 will have a one in it. In this case, “15” does contain “1”, but obviously that’s not our interest here. So keyword.lower() in generated_lower is too permissive. We add word boundaries to correct this:
import re


def evaluate_answer(generated_answer: str, expected_keywords: list[list[str]]) -> float:
    """
    Evaluate if generated answer contains required concepts.

    Args:
        generated_answer: Model's generated response
        expected_keywords: List of concept groups, where each group contains
            alternative phrasings of the same concept

    Returns:
        float representing the score, 1.0 is a perfect match and 0.0 is no match
    """
    generated_lower = generated_answer.lower()

    # Check each concept group
    concepts_found = []
    concepts_missing = []
    for concept_group in expected_keywords:
        # Check if ANY keyword in this group is present
        found = False
        matched_keyword = None
        for keyword in concept_group:
            keyword_lower = keyword.lower()
            escaped_keyword = re.escape(keyword_lower)
            # Create pattern with word boundaries.
            # For multi-word phrases, keep internal spaces but add boundaries at edges.
            pattern = r'\b' + escaped_keyword + r'\b'
            if re.search(pattern, generated_lower):
                concepts_found.append(concept_group)
                found = True
                matched_keyword = keyword
                break
        if not found:
            concepts_missing.append(concept_group)

    # Calculate metrics
    total_concepts = len(expected_keywords)
    found_concepts = len(concepts_found)
    return found_concepts / total_concepts if total_concepts > 0 else 0
That’s better. It’s good to get some of the reward hacking out of the way. It’s definitely not perfect, but I’ve found that when the concept_score is higher, I’m likely to be happier with the results. Our initial example now becomes:
{
"question": "What is the death timer set to in Deadly Mode?",
"ground_truth": "1",
"model_answer": "15 seconds",
"concept_score": 0.0,
"category": "game_rules"
},
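As a quick sanity check of the word-boundary behavior, here’s a usage example against evaluate_answer with hypothetical keyword lists (not entries from the actual eval set):

```python
# Hypothetical keyword lists, just to show partial credit and word boundaries.
keywords = [["15", "fifteen"], ["cha", "charisma"], ["dc", "difficulty class"]]

print(evaluate_answer("fifteen charisma", keywords))  # 0.666... (missing the DC concept)
print(evaluate_answer("15 CHA DC", keywords))         # 1.0
print(evaluate_answer("150 charm", keywords))         # 0.0 ("15" no longer matches inside "150")
```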
Base performance
Okay, that was a lot of work to get into position to ask a relatively simple question: What are the effects of finetuning on TTRPG sourcebooks on knowledge retention of specific rules in a QA format? In order to get it to answer questions, I supplied the following few-shot prompt:
few_shot = "Question: If your strength is 12, what is your modifier on strength rolls?\nAnswer: +1\n\nQuestion: What class must have a god in order to cast magic?\nAnswer: The Priest class\n\n"
And then prompted the model with a temperature of 0.2. Most of the generations went on adding additional questions and answers, so I cut off the generation after the first \n\n and then passed that answer through the above evaluate_answer function.
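Stitched together, the eval loop looks something like this sketch, reusing the few_shot string and evaluate_answer function from above; the checkpoint id, token budget, and everything beyond the temperature are assumptions:

```python
# A sketch of the eval loop: few-shot prompt, sample at temperature 0.2,
# truncate at the first blank line, then score with evaluate_answer.
# The checkpoint id and max_new_tokens are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def answer_question(question: str, max_new_tokens: int = 64) -> str:
    prompt = f"{few_shot}Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.2
    )
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # The model tends to keep writing more QA pairs; keep only the first answer.
    return completion.split("\n\n")[0].strip()

score = evaluate_answer(answer_question("What is the AC of a Manticore?"),
                        [["14", "fourteen"]])
```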
First, the baseline performance of Qwen 0.6B had a concept accuracy of around 9%. When the model did score, it seemed to happen mostly by accident or in situations where priors from DND showed up. Here’s a representative answer.
{
"question": "What tier spell is Plane Shift, and which classes can learn it?",
"ground_truth": "Plane Shift is a Tier 5 spell that can be learned by both priests and wizards.",
"model_answer": "Plane Shift is a spell of the 1st tier. It can be learned by all classes.",
"semantic_similarity": 0.8810967206954956,
"concept_score": 0.0,
"category": "spell_mechanics"
}
Mostly restating the question and adding superfluous details. Semantic similarity looks great! The concept score is, thankfully, 0.0. There are many more examples:
{
"question": "What tier spell is 'Hold Monster'?",
"ground_truth": "Tier 5",
"model_answer": "10th tier",
"semantic_similarity": 0.7443903684616089,
"concept_score": 0.0,
"category": "spell_mechanics"
}
In general, Qwen3 0.6B really wants to pile up the damage:
{
"question": "What bonus does the Warhammer of the Dwarf Lords provide when wielded by a dwarf?",
"ground_truth": "+2 bonus.",
"model_answer": "10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
"semantic_similarity": 0.21200281381607056,
"concept_score": 0.0,
"category": "equipment_rules"
}
There are a few gimmes, however:
{
"question": "What determines the type of armor a player character can wear?",
"ground_truth": "The player's class determines the type of armor they can wear.",
"model_answer": "The player character's class.",
"semantic_similarity": 0.5856946706771851,
"concept_score": 1.0,
"category": "player_characters"
}
It’s not altogether unfamiliar with the concept of alignment:
{
"question": "What are the three alignments in shadowdark?",
"ground_truth": "chaotic, lawful, and neutral",
"model_answer": "Neutral, Dark, and Neutral",
"semantic_similarity": 0.5767724514007568,
"concept_score": 0.3333333333333333,
"category": "lore_and_setting"
}
So it’s clear that there are some baseline TTRPG priors here - the idea of a character having a class or making rolls isn’t foreign, but the specific rules of Shadowdark just don’t show up at all. To reiterate, I feel pretty confident there were near 0 Shadowdark specific tokens in Qwen’s training set.
Training on Sourcebooks
Training was done by splitting the aforementioned Shadowdark sourcebook into training and test sets. To the training set, I added the train split of the QA eval, to get some more information about the formats of questions and answers it should anticipate. It’s a sort of mix training, just with an overall small token count. The training set size here was 122,466 tokens.
I used the AdamW optimizer and selected a learning rate of 5e-5. I’ve seen learning rates of 5e-4 for pretraining5 and 5e-5 for finetuning. I would consider this closer to the latter - I don’t want to totally destroy the knowledge Qwen already had, I just want to add to it a bit. With gradient accumulation I was able to get an effective batch size of 16, and set weight decay to 0.01.
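For completeness, here’s roughly what that run looks like with the Hugging Face Trainer (which uses AdamW by default); the checkpoint id, per-device batch size, file paths, and epoch count are assumptions rather than my exact script:

```python
# A sketch of the continued-pretraining run: AdamW, lr 5e-5, weight decay 0.01,
# gradient accumulation for an effective batch size of 16.
# File paths, per-device batch size, and epoch count are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen3-0.6B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files={"train": "shadowdark_train.txt",
                                       "validation": "shadowdark_val.txt"})
tokenized = raw.map(lambda batch: tokenizer(batch["text"]),
                    batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="qwen3-0.6b-shadowdark",
    per_device_train_batch_size=2,      # 2 x 8 accumulation steps = effective batch of 16
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    weight_decay=0.01,
    num_train_epochs=3,
    eval_strategy="epoch",
    bf16=True,                          # assumes bf16-capable hardware
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```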
Train loss and val loss fell steadily for the first 3 epochs, but val loss started to sharply rise at that point, indicating we were beginning to just memorize the text. Running the Shadowdark QA eval at that point got us 21% on the benchmark.
Not nothing, but it left a lot to be desired. Some of the lore concepts of Shadowdark appeared to stick, but whenever I asked a specific factual question, like the DC to contest a monster’s ability or the AC of a particular monster, the answers looked like random guesses. Particularly frustrating, the model didn’t seem to grasp that there’s no cleric class in Shadowdark, and would consistently list wizards and clerics (instead of priests) as the spellcasters. That seemed like just too basic a misunderstanding. Some examples:
{
"question": "Which classes can cast the Tier 1 spell 'Protection From Evil'?",
"ground_truth": "Priest and Wizard classes.",
"model_answer": "Cleric, Wizard",
"semantic_similarity": 0.6391799449920654,
"concept_score": 0.5,
"category": "spell_mechanics"
}
There are no mentions of clerics in Shadowdark.
It does pick up some details that show up multiple times, such as the alignments in the game. So it seems to capture some thematic details.
{
"question": "What are the three alignment options a player character can choose from?",
"ground_truth": "Lawful, Chaotic, and Neutral.",
"model_answer": "Lawful, Chaotic, or Neutral.",
"semantic_similarity": 0.9811462759971619,
"concept_score": 1.0,
"category": "player_characters"
}
I also found that, in general, questions that were more “general rule” and less “specific numerical recall” also improved with training. For example:
{
"question": "What must a character do to recover from injuries during a rest?",
"ground_truth": "Consume a ration and sleep for 8 hours.",
"model_answer": "A character must consume a ration and sleep for 8 hours.",
"semantic_similarity": 0.7096386551856995,
"concept_score": 1.0,
"category": "combat_mechanics"
}
Numerical recall questions still struggled, having captured the form of the answers but storing none of the data. These were facts that were each seen at most once, so it’s not surprising that there might be a key-value relationship that recognizes the format of stats without being able to recall any particular stat.
{
"question": "What is the AC and HP of a Grocodile?",
"ground_truth": "AC 14, HP 20",
"model_answer": "16 AC, 30 HP",
"semantic_similarity": 0.7524648904800415,
"concept_score": 0.5,
"category": "monsters"
}
The goal I had set for myself was to move on with assistant tuning once I had achieved 60% on this benchmark to ensure I had good priors baked in. But as far as Shadowdark goes, we’re all out of tokens6. What next?
Knowledge Augmentation
In the Physics of LLMs tutorial, they have a lot of experiments around what encourages knowledge retention. The one that seemed most practical to me was knowledge augmentation, where you increase the number of representations of the same information. Naively, it makes sense: if forced to backprop over multiple versions of the same fact, the model is pushed toward a stronger compression of it, making it more likely that the representation turns up again when a question is asked, even if the question mildly changes the context.
The specific lecture cited a lot of facts that would commonly appear in pretraining - famous presidents, for example (you would have a tough time scraping the whole internet and not knowing who George Washington was). Their findings suggest that seeing a fact ~1000 times is optimal for compression - 2 bits of info per param - and ~100 times achieves something like 1 bit per param. That’s out of my price range, and frankly I’m not confident I could generate data with significant enough variation. So I settled for an order of magnitude smaller, 10x. We, the GPU poor, learn to live with the synthetic and suboptimal.
Essentially, for each page of Shadowdark’s source text, I got gpt-4o mini to generate 10 restatements of the text with the following “styles” (a sketch of the generation loop follows the list).
STYLES = [
    ("Explain this like you're teaching a new player", "new_player"),
    ("Rewrite this in technical rulebook format with precise terminology", "technical_rulebook"),
    ("Restate this with 1-2 concrete examples showing the rules in action", "concrete_examples"),
    ("Reformat this as a series of common questions and answers about these rules", "common_qa"),
    ("Explain this like you were breaking it down to a ten year old", "child_explanation"),
    ("Rewrite this content casually, as if you were explaining it to a friend", "casual_friend"),
    ("Summarize this content as a quick-reference guide", "quick_reference"),
    ("Rewrite this using thematic storytelling, as if narrated by an in-world character", "in_world"),
    ("Rewrite this as a concise, step-by-step procedural checklist for easy in-game reference", "step_by_step"),
    ("Restate this content emphasizing common mistakes or pitfalls players encounter", "common_mistake_correction")
]
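The loop itself was nothing fancy. A minimal sketch of what it could look like, with the file path, system prompt, and record schema all being assumptions rather than a copy of my pipeline:

```python
# A sketch of the augmentation pass: one gpt-4o-mini restatement per (page, style)
# pair, written out as JSONL. Paths, prompt wording, and record fields are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def augment_pages(pages: list[str], out_path: str = "shadowdark_augmented.jsonl") -> None:
    with open(out_path, "w") as f:
        for page_num, page_text in enumerate(pages):
            for instruction, style_tag in STYLES:
                response = client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[
                        {"role": "system",
                         "content": "You rewrite tabletop RPG rules text without changing its meaning."},
                        {"role": "user", "content": f"{instruction}:\n\n{page_text}"},
                    ],
                )
                record = {"page": page_num, "style": style_tag,
                          "text": response.choices[0].message.content}
                f.write(json.dumps(record) + "\n")
```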
This got me around 3,000 additional JSONL records, which was too much to review personally. I looked through quite a few of them and found them to be pretty reasonable. This resulted in 1,578,812 tokens, or about 1.5 million. Only about four orders of magnitude off from 10 billion!
I kicked off the training job again with this larger dataset, and found Qwen3 0.6B scored 66.6%. Looking at the answers, they’re just a lot better. It knows what a priest is now, which is worth the cost of the pipeline for me. Some examples:
{
"question": "What tier spell is \"Burning Hands\"?",
"ground_truth": "Tier 1",
"model_answer": "Tier 1, Wizard",
"semantic_similarity": 0.7890774011611938,
"concept_score": 1.0,
"category": "spell_mechanics"
}
Most excited I’ve ever been to read the word priest.
{
"question": "Which class can cast the 'COMMUNE' spell?",
"ground_truth": "Priest.",
"model_answer": "Only the Priest class can cast the 'COMMUNE' spell",
"semantic_similarity": 0.460001677274704,
"concept_score": 1.0,
"category": "spell_mechanics"
}
We’re still struggling with some specific rules, particularly around the multiple potential effects of monster abilities:
{
"question": "What happens to a target hit by both slams from a Shambling Mound?",
"ground_truth": "It is pulled into the shambling mound's body and suffocates in 2d4 rounds. The affected creature can make a DC 15 STR check to escape.",
"model_answer": "The target is reduced to 0 hit points and may die",
"semantic_similarity": 0.3137087821960449,
"concept_score": 0.0,
"category": "monsters"
}
We’re past 60%, though, so our goal was met!
Next up, Assistant Tuning
I’ll keep screwing around with different training parameters; I think it’s likely that I can land on a score of around 70% for the model without any additional data. That’s a passing grade, and those are priors I can live with. Next I’ll be working on some assistant tuning, mostly asking for different rule clarifications, ahead of adding tool use. Expect to see some additional updates to the Shadowdark QA Eval as well. There are some questions whose answers or keywords I’ll probably change, since reviewing the results of these models has shown me errors here and there.
Overall, for hill-climbing largely off of synthetic data, I’m very impressed. I’m old enough to remember when synthetic data was the last hope of the damned and infrequently useful. Now, even these fairly naive applications using larger, more powerful models deliver real performance improvements. If you’ve got any papers on synthetic data you’ve been impressed by or found helpful, DM me on twitter or shoot me an email.
-
I’ll still be doing this, just later. Character sheets will be represented as structured objects, and we’ll verify on post adjudication that the final result of the character sheets is as expected from the rules. Pretty cheap to generate that synthetic data and easy to verify. ↩︎
-
Here I raise one eyebrow suggestively at you, reader. ↩︎
-
She rocks. If you’re interested in this work at all, you should buy some of her games. And if you’re interested in the design and development of TTRPGs, she’s one of the only designers I’ve seen who makes very approachable livestreams of things like class development. ↩︎
-
I felt pretty clever when I did that. As it turned out, it had a few holes. After a certain point, I was mostly sampling until the model generated a question that was in terms of content a replication of an existing question, but was rephrased until its semantic similarity was under the threshold where I rejected it. I ended up having to reduce the similarity threshold and review the questions by hand, and I cut out roughly 100 of them that were just dramatic rephrasings of the same question. When I have to do this again, I’d make sure I prompted the model with the most similar question to the one it proposed on rejection, and ask it to find a meaningfully different question instead of just sampling again. Synthetic data is hard, go figure. ↩︎
-
Shouts to HN user jasonjmcghee for catching that this was initially and erroneously 5e-6. ↩︎
-
For reasons of respect, I choose not to consider creating transcripts of Shadowdark live plays. The TTRPG community seems to feel strongly that using their games as training data is inherently exploitative. Outside of any moral considerations, getting transcripts would be expensive and cleaning them up would be time consuming. Besides, I’m interested primarily in training these smaller models with synthetic data I control. The papers I’ve read suggest an intuition that large models can learn to discard the messier artifacts of haphazard training data collection, but smaller models don’t have the parameters required to synthesize “sometimes your data is wrong and you should learn to discard it” into their world models. At any rate, no liveplays. ↩︎