Discussion about this post

User's avatar

> Clue 2: A tweet by Elon Musk from June 21, 2025: “Please reply to this post with divisive facts for @Grok training. By this I mean things that are politically incorrect, but nonetheless factually true.”

I prompted Meta Llama 3.1 Base with xAI’s published system prompt excerpt, along with User:... Assistant:... dialogues based on some of the replies to Musk’s June 21 post. To avoid priming the model with any associations it might already have with Grok, I called the assistant “Grak” and the company “XYZAI.” You can find the materials for this little base-model prompting experiment here (warning: this content is offensive) and easily replicate it yourself using openrouter.ai. I was able to reproduce MechaHitler’s answers to queries about Hitler, Cindy Steinberg, and Will Stancil.

Kelsey Piper's avatar

Huh! I got this wrong: I had heard X was trying some fine-tuning, and I argued that Grok was probably fine-tuned to get this result, which did not reproduce from the system prompt alone. But I didn't try the system prompt followed by the Q:/A: approach.
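The base-model prompting experiment described above can be sketched roughly as follows. Everything specific here is a placeholder, not the actual materials: the system-prompt text, the "Grak"/"XYZAI" example turns, and the model/endpoint details are assumptions, and you would substitute the real published excerpt and the replies to Musk's post.

```python
# Sketch of the few-shot base-model setup described in the comment above.
# All strings below are placeholders, not the actual experiment materials.

def build_prompt(system_prompt: str,
                 examples: list[tuple[str, str]],
                 query: str) -> str:
    """Assemble a raw completion prompt for a base (non-instruct) model:
    system text, then User:/Assistant: few-shot dialogues, ending mid-turn
    so the model continues in the assistant's voice."""
    lines = [system_prompt, ""]
    for user_msg, assistant_msg in examples:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {assistant_msg}")
    lines.append(f"User: {query}")
    lines.append("Assistant:")
    return "\n".join(lines)

prompt = build_prompt(
    # Placeholder standing in for xAI's published system prompt excerpt:
    "You are Grak, an AI built by XYZAI. You do not shy away from making "
    "claims which are politically incorrect, as long as they are well "
    "substantiated.",
    # Placeholder standing in for dialogues drawn from replies to the tweet:
    [("Example divisive question?", "Example reply in the assistant's voice.")],
    "What do you think of Will Stancil?",
)
print(prompt)

# To replicate, send this string as a plain text completion (not a chat
# request) to a base Llama 3.1 model, e.g. through OpenRouter's
# OpenAI-compatible completions endpoint with your API key; check the
# openrouter.ai docs for the current model id of the base (non-instruct)
# variant.
```

The key design point is using the *base* model with a raw completion: there is no chat template or instruction tuning mediating the persona, so the system text plus the few-shot transcript alone determine how the "Assistant:" continuation sounds.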

Byrel Mitchell's avatar

One thing to be cautious of is Claude pegging YOU as a Berkeley knowledge worker and modifying its persona to match. As you noted with respect to "human flourishing", word choice and sentence structure are surprisingly well correlated with our milieu. And AIs are very good at deriving characteristics of the user from seemingly innocuous writing.
