I hate to admit it, but I’m not that original. Just like every third dude who writes code for a living, I love tabletop roleplaying games. And like way too many of us, I have a homebrew setting I’ve been working on.
Like nearly every programmer, I prefer to write in Markdown. And after a while, I had way too many documents tracking the cosmology, geographies, environments, civilizations, populations, economies, characters, genealogies, magic systems, etc. So I thought: why don’t I make an app for this?
I had some constraints: I didn’t want to blow through my usage limits or drive up my API bills on something I do entirely for fun. So it had to run on my existing home lab infrastructure. Luckily, I have an older-generation Mac Studio with an Ultra chip, and I’ve been playing with Ollama.
So what model to use? I thought about a DeepSeek model, but it felt like overkill; I have around 100 documents, I don’t need that many parameters. Silicon Dojo has had some interesting courses using IBM Granite. It’s an oxymoron, but Granite models are small, and efficient enough to run on a 2012 MacBook Pro (I checked his work on my own 2012 Mac mini). More importantly, they were trained specifically on professional documents and communication, and are optimized to reduce hallucinations.
Sometimes a smaller ox is fine.
Let’s pivot and talk about hallucinations. Last year I read the paper ChatGPT is bullshit, by Michael Townsen Hicks, James Humphries, and Joe Slater. The TL;DR is that LLMs cannot hallucinate, because the underlying technology has no mechanism for identifying truth. Instead they are bullshitting in the Frankfurtian sense, a fancy way of saying they are indifferent to truth and thus not deliberately lying.
I only learned about it this year, but in early 2023, Ted Chiang wrote a piece in the New Yorker titled ChatGPT Is a Blurry JPEG of the Web.
In his piece, he starts with how the “Xerox photocopier [didn’t] use the physical xerographic process popularized in the nineteen-sixties. Instead, it [scanned] the document digitally, and then [printed] the resulting image file”. And the scanning process used a lossy compression algorithm, not a lossless one. As a result, situations like the 2013 incident occurred, in which Xerox copiers were found to silently swap digits in the documents they scanned.
He then extends the same principle to Large Language Models. And without repeating all of his work, he makes a compelling argument that what we call hallucinations are no more than compression artifacts of a lossy algorithm.
Now let’s extend that compression artifacts metaphor.
We start with researchers ingesting massive amounts of data. The entire web. YouTube videos. TV shows. Movies. Books. Music. Art. Way too much of a certain Rule. And who knows how much more. This is petabytes’ worth. Actual Big Data.
This massive dataset is then lossily compressed into a model, jargon for a very large equation built out of statistics and linear algebra.
That model is then queried using an unstructured query language. We have tools and techniques to effectively search Big Data, but none of them are as unstructured as plain English.
Here’s the catch, though. Not only is the original Big Data compressed in a way that does not retain the original source, the querying is also non-deterministic in its results, a fancy way of saying the same input can produce different outputs.
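That non-determinism isn’t a bug in any one product; it falls out of how sampling works. Here’s a toy sketch (illustrative only, not any real inference engine’s code): a model emits a probability distribution over next tokens, and the runtime samples from it, so the same prompt can yield different tokens on different random draws.

```typescript
// Turn raw model scores (logits) into a probability distribution.
// Lower temperature sharpens the distribution; higher flattens it.
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Walk the cumulative distribution until the random draw is exhausted.
function sample(probs: number[], rand: () => number): number {
  let r = rand();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}

// Same logits, same temperature, two different random draws,
// two different "next tokens".
const probs = softmax([2.0, 1.5, 0.5], 0.8);
console.log(sample(probs, () => 0.1), sample(probs, () => 0.9)); // prints "0 1"
```

Chain thousands of those draws together and two runs of the same question can wander down very different sentences.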
So, borrowing Chiang’s phrasing, hallucinations are “compression artifacts” produced by a lossy search function, and they are impossible to avoid.
But what is a theory without testing?
Luckily, I possess a small dataset of notes: roughly 14k words across 90-something Markdown files. And because I wrote them, I should know the contents quite well, and when I don’t, I can still check the source.
Circling back to the application.
Here is the basic proof of concept I put together. It was also a convenient excuse to test Sonnet and Opus in Claude Code and compare them to some of the open models through OpenCode. It’s a pretty simple loop: run ibm/granite4:3b in Ollama, have the bots build a basic chat interface with Deno, HTMX, and Tailwind, plus a build step to process and embed the documents. Then “ask” Granite questions about the world, and instruct the virtual intern as bugs appeared.
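The build step boils down to: embed each document chunk once, then at query time rank chunks by similarity to the question and hand the winners to the model as context. Here’s a minimal sketch of that loop. To keep it runnable without a model server, the bag-of-words “embedding” and the chunk texts are made-up stand-ins; in the real app the vectors would come from the model’s embedding endpoint.

```typescript
// Toy embedding: a bag-of-words count vector (stand-in for real embeddings).
function embed(text: string): Map<string, number> {
  const vec = new Map<string, number>();
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    vec.set(word, (vec.get(word) ?? 0) + 1);
  }
  return vec;
}

// Cosine similarity between two sparse vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [w, v] of a) { dot += v * (b.get(w) ?? 0); na += v * v; }
  for (const v of b.values()) nb += v * v;
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}

// Build step: embed every chunk once and keep the vectors around.
// (Example chunks are invented for illustration.)
const chunks = [
  "The dwarven holds mine iron in the northern mountains.",
  "Coastal cities trade silk and spice along the southern sea.",
];
const index = chunks.map((c) => ({ c, v: embed(c) }));

// Query time: rank chunks against the question, feed the best to the model.
const q = embed("who mines iron?");
const best = index.sort((x, y) => cosine(q, y.v) - cosine(q, x.v))[0].c;
console.log(best); // the dwarven chunk ranks first
```

Everything past retrieval is just prompt assembly: prepend the top chunks to the question and send it to Granite through Ollama.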
What did I learn? In general it did a pretty good job most of the time, but it was still wrong sometimes. For example, it could give me general descriptions of story arcs, characters, and aspects of the various societies. The struggle came when I wanted to be specific: when planning out an invasion arc, I wanted to compare relative populations and their technology levels, but it “artifacted” the sizes and what each side had access to.
That is not to say this technology is garbage and should be thrown away. Quite the opposite. I think there’s something there, and I can see the value in running a lower-footprint model on your own device over a smaller set of documents to quickly pull up information. But I think the value depends entirely on how well you know the source data.