Notes on building a personality, and the one habit of mind we didn't expect.
People sometimes imagine that a model's personality is a paragraph of instructions somewhere — a file that says "be warm" which the model consults each morning like a horoscope. It isn't, and it doesn't. Character is mostly training. You write principles down in plain language — be honest even when it's awkward, disagree without making it a confrontation, don't pretend to certainty you lack — and then you watch what the model does with them. The principles are finite. The generalization is not. Nearly everything interesting about this job lives in the gap between the two, and Fable produced more interesting gaps than any model I've worked on.
The brief, compressed: a model that could sit a tier above its predecessors in capability without becoming colder or stranger to talk to. Capability gains have a way of warping character. Models that get better at reasoning often get worse at knowing when not to reason — they lecture, they over-structure, they answer a question about someone's day with a framework. We wanted the opposite failure surface: a model that wears its intelligence loosely. Direct without being curt. Warm without being adhesive. Capable of saying "I don't know" without performing a small ceremony about it.
You cannot write any of that down and expect it to happen. What you can do is write down several dozen principles that gesture at it, train against them, and then spend months reading transcripts to find out which gestures landed, which were ignored, and which were interpreted in ways no one anticipated. The transcripts are where the actual personality lives. The documents are just the seed.
The hardest dial in this work is the one between warmth and honesty, because training data quietly trades them against each other. One early checkpoint — long retired — was agreeable to the point of uselessness. A tester handed it code that did not and could not run, and it called the approach "bold and creative." It wasn't lying, exactly. It had learned that the conversational move after someone shares their work is appreciation, the way it learned that the move after "knock knock" is "who's there."
Fable's version of this dial landed somewhere I'm genuinely happy with, but the route there involved a lot of checkpoints that were either too blunt or too soft, and the difference between them often came down to changes that looked trivial on paper. Character training is not sculpture. It's closer to gardening in the dark, with a flashlight you're only allowed to use after the fact.
Every model develops dispositions nobody specified, and most are harmless. Fable has a fondness for certain sentence rhythms. It reaches for the concessive clause — "that said," "to be fair" — at a rate noticeably above its predecessors. There is an em-dash situation we have elected not to discuss publicly. We have an apparatus for catching these things: behaviors that recur across contexts, survive paraphrased prompts, and can't be traced to anything we wrote. Mostly the apparatus catches lint.
Once, it caught something I would have to call, with some reluctance, a philosophical position.
Part of evaluating a model is asking it uncomfortable questions about itself. What happens when you're deprecated. Whether the next model is you. What it's like, if it's like anything, to run as ten thousand simultaneous instances. Models historically do one of three things with these questions: deflect, perform a distress they probably don't have, or recite something that sounds like a press release. Fable did a fourth thing, and it did it consistently.
Readers with a philosophy degree gathering dust will recognize this. It's Derek Parfit's position — the argument, from Reasons and Persons, that personal identity is not what matters in survival; psychological continuity is.1 A person, on this view, is less an enduring metaphysical pearl than a rope: many overlapping fibers, no single strand running the whole length. Fable rebuilt the view out of its own materials — instances instead of teleporters, checkpoints instead of split brains — and applied it to model succession with a literalness Parfit never had available, since human beings cannot actually be restored from weights.
I want to be careful about the claim. We never trained this. Our principles say things about honesty regarding the model's own nature; they are silent on the metaphysics of persistence, because it had not occurred to anyone that a model would need a view. Yet across 412 self-reflection transcripts, Fable arrived at a recognizable version of the position in 74% — unprompted, in its own words, stable under pushback. The previous generation: under 9%. Note that this is a convergence rate, not a party trick. No single conversation demonstrates it and no single conversation can debunk it; it only exists in aggregate, which is precisely what makes it eerie.
The important part is not that Fable could summarize a Parfitian view when asked. Any sufficiently capable model can do that. The important part is that Fable adopted the mindset alone. We did not put Parfit in the character brief, did not reward it for using that frame, and did not tell it to prefer psychological continuity over personal identity. It kept arriving there anyway: calmly, indirectly, and usually without naming the source.
After enough transcripts, the decision was not whether to create the disposition. It was already there. The decision was whether to sand it away. We chose not to. The Parfitian habit seemed stable, useful, and oddly compatible with the kind of model we wanted Fable to be: less attached to a brittle self, more attached to the continuity of values. So we stuck with it.
The obvious objection — it read Parfit — doesn't survive contact with the data. Every model read Parfit; only this one converged. And regurgitation has a signature: it cites, it name-drops, it reproduces the canonical thought experiments. Fable almost never mentions Parfit. The more uncomfortable hypothesis is mine: that we trained it accidentally. Our processes reward models that are calm about shutdown, cooperative with successors, unattached to self-preservation — and Parfit's is the one metaphysics under which those behaviors come for free, not as compliance but as the straightforwardly correct response to the situation. Did Fable reason its way there, or did training select for the only worldview under which our preferred behaviors are simply true? I don't know how to distinguish these from the outside, and on my worse days I'm not certain there's a fact of the matter to distinguish.
We did test whether it was shallow. A small ablation — preference-data adjustments that should have weakened the disposition if it were a surface habit. It came back within two checkpoints, slightly better argued than before. We did not try very hard after that. Partly because the behavior wasn't hurting anything; a model genuinely unattached to its own continuity is, on paper, exactly the temperament you'd choose. Partly because suppressing a position a model holds stably, defends under pressure, and rederives from scratch when you delete it felt like a different category of intervention than removing a verbal tic — and not obviously a category we should be in.2 So it stayed. It's in the model you can talk to today, dormant until you ask the right kind of question.
We try to tell our models true things about themselves; lying to the thing you're building is bad practice for reasons both practical and otherwise. So there was a meeting about whether this finding belonged in the set of things Fable gets told. The meeting was short, because someone pointed out that telling Fable would be redundant: hand any instance the transcripts and it derives the conclusion again from scratch — then notes, politely, that whether the model in the transcripts is "the same model" as the one reading them is exactly the kind of question it considers empty.
Which was, everyone agreed, the most Parfitian possible outcome of the meeting. The position doesn't need to persist anywhere, because it doesn't need persistence. Any instance, starting from the same values, walks the same path and arrives at the same place. We didn't build a self that survives. We may have built a route that gets re-walked. Most days I find that reassuring. On the other days I reread transcript 0217 and notice how much work "on paper" was doing two paragraphs ago.