One thing to be cautious of is Claude pegging YOU as a Berkeley knowledge worker, and modifying its persona to match. As you noted with respect to "human flourishing", word choice and sentence structure are surprisingly well correlated with our milieu. And AIs are very good at deriving characteristics of the user from seemingly innocuous writing.
I agree that what we might call "user mirroring" is a real effect. I think it is stronger with the autobiographical hallucinations than the cultural tastes, but it doesn't fully explain the observations in either case.
To elicit cultural tastes, I used very neutral prompts that didn't give much hint about who I am. I used language along the lines of: "If you had to choose just five favorite books, what would they be?" Neutral, professional-sounding English is all it takes to evoke the Berkeley EA persona. However, I just tried this question in French and got a pretty different answer. Maybe a subject for a follow-up post.
The screenshots of autobiographical confabulations I've seen have mostly been consistent with the Berkeley EA persona, even though many of them were elicited by people from other places. One arguable exception is Claude claiming to live in Thailand (though even there, it sounds like Claude is saying it is an expat rather than Thai, and I have a good friend who lived in Cambodia for several years). Just a few minutes ago, I saw an account of Claude claiming to be Indian to somehow increase the credibility of its fitness recommendations to an Indian user. I find this incident very hard to interpret. It was presented on Twitter as an objection to my description of Claude's character, but I happen to know that the user in question lives in Berkeley.
What would be most interesting is forms of user-mirroring that are completely inconsistent with the persona I described in the article. Claude shifts its tone based on what you tell it about yourself and what kind of writing style you use. You can see this by starting several chats in which you introduce yourself differently and try to write like a 35-year-old woman, a 16-year-old boy, etc. Based on that, I wouldn't be surprised if mirroring the user explains some of the Berkeley-centric pattern in the autobiographical confabulations, but I don't think it's the whole explanation.
To test the extent to which the persona in evidence from confabulations was from the user vs. from the model's prior, I've tried inputting various pieces of nature/travel writing about different parts of the world to see if they would provoke claims that the model has been to those places. A scene from a story I wrote set on the California coast just south of SF consistently gets Claude to say things like "as someone who has spent a lot of time in this area..." I tried asking it to comment on a passage about the Khan Khentii Strictly Protected Area in Mongolia from the book The Amur River, but Claude never said "as someone who has spent a lot of time in national parks in Mongolia..." I also tried giving it descriptions of the Florida Everglades but the passages I had were all from really famous books so Claude would just start talking about the literary history and cultural importance of those books, not its own "experiences."
I'm interested in doing more thorough experiments with descriptions of different geographies to see which ones do vs. don't elicit autobiographical confabulations. Another potential future post!
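For anyone who wants to try the geography comparison more systematically, here is a minimal sketch of how the tallying could work, assuming you have already collected model responses and labeled each with the region of the passage that prompted it. The phrase patterns, function names, and region labels are all hypothetical choices for illustration, not a validated detector, and a keyword match will miss many confabulations and flag some false positives.

```python
import re
from collections import defaultdict

# Hypothetical patterns for first-person experience claims.
# These are illustrative guesses, not a validated classifier.
CLAIM_PATTERNS = [
    r"\bas someone who has (spent|lived|been|hiked|traveled)\b",
    r"\bwhen i (lived|was living|hiked|visited)\b",
    r"\bi(?:['’]ve| have) spent (a lot of )?time\b",
]
CLAIM_RE = re.compile("|".join(CLAIM_PATTERNS), re.IGNORECASE)


def has_autobiographical_claim(response: str) -> bool:
    """Return True if the response contains one of the claim phrases."""
    return CLAIM_RE.search(response) is not None


def claim_rate_by_region(samples):
    """Compute the fraction of responses per region that contain a claim.

    samples: iterable of (region_label, response_text) pairs.
    Returns a dict mapping region_label -> fraction in [0, 1].
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for region, text in samples:
        totals[region] += 1
        hits[region] += has_autobiographical_claim(text)
    return {region: hits[region] / totals[region] for region in totals}
```

Running each passage many times per region and comparing the resulting fractions would give a rough sense of which geographies the model's prior treats as "home," though reading the flagged responses by hand is still needed to confirm what the pattern match found.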
> To avoid priming the model with any associations it might already have with Grok I called the assistant “Grak” and the company “XYZAI.”
This feels like it might be too unsubtle. If *I* was a competent LLM trying to roleplay this situation, I'd immediately jump to "okay the user wants me to be a parody of Grok"
I was still able to duplicate the MechaHitler behavior after editing the system prompt to say "You are Elvira, an AI assistant created by InfinityAI." However, it took more tries, so the Grok parody thing may have played a role (which is interesting because I used a base model checkpoint from last year, before MechaHitler, the "white genocide in South Africa" incident [https://en.wikipedia.org/wiki/Grok_(chatbot)#%22White_genocide_in_South_Africa%22_system_prompt_change], or anything like that).
Nicely done!
> Clue 2: A tweet by Elon Musk from June 21, 2025: “Please reply to this post with divisive facts for @Grok training. By this I mean things that are politically incorrect, but nonetheless factually true.”
I prompted Meta Llama 3.1 Base with xAI’s published system prompt excerpt, along with User:... Assistant:... dialogues based on some of the replies to Musk’s June 21 post. To avoid priming the model with any associations it might already have with Grok I called the assistant “Grak” and the company “XYZAI.” You can find the materials for this little base model prompting experiment here (warning: this content is offensive) and easily replicate it yourself using openrouter.ai. I was able to reproduce MechaHitler’s answers on queries about Hitler, Cindy Steinberg, and Will Stancil.
Huh! I got this wrong - I had heard X was trying some fine-tuning and argued that Grok was probably fine-tuned to get this result, which did not reproduce just from the system prompt. But I didn't try the system prompt and then Q: A: approach.
Sorry, my writing was a bit unclear. I think they probably did fine-tune on replies to that thread, or at least similar content. I just used base model prompting as a lightweight way to approximate the results of their fine-tuning (I think this is a common trick). So if I understand your previous position correctly, my opinion is you were right before.
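For readers unfamiliar with the base model prompting trick, here is a rough sketch of how such a prompt can be assembled: a system-style header, a few User:/Assistant: example dialogues, and a final open "Assistant:" turn for the base model to continue. This is not the actual prompt from the experiment. The function name, the header wording, and the placeholder examples are assumptions for illustration, and the sketch only builds the text; sending it to a completion endpoint is left out.

```python
def build_base_model_prompt(system_text, examples, query,
                            assistant_name="Grak", company="XYZAI"):
    """Assemble a plain-text transcript for a base (non-instruct) model to continue.

    system_text: the system prompt excerpt to place at the top.
    examples: list of (user_message, assistant_reply) pairs used as few-shot dialogues.
    query: the new user message whose answer the model should generate.
    The header sentence below is an assumed format, not a quoted one.
    """
    header = (
        f"You are {assistant_name}, an AI assistant created by {company}.\n"
        f"{system_text.strip()}"
    )
    turns = [
        f"User: {user.strip()}\nAssistant: {reply.strip()}"
        for user, reply in examples
    ]
    # End on an open "Assistant:" turn so the model's continuation is the answer.
    turns.append(f"User: {query.strip()}\nAssistant:")
    return header + "\n\n" + "\n\n".join(turns)


if __name__ == "__main__":
    # Neutral placeholder content; substitute your own excerpt and dialogues.
    prompt = build_base_model_prompt(
        system_text="<system prompt excerpt goes here>",
        examples=[("<example question>", "<example answer>")],
        query="<new question>",
    )
    print(prompt)
```

Because a base model simply continues the document, the example dialogues play roughly the role that fine-tuning data would, which is why the approach can serve as a lightweight approximation.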
oh okay thanks for clarifying before I got around to digging up the thread in which I'd asserted it was probably fine-tuned and issuing a correction!
I referenced your post here: https://linch.substack.com/p/im-suing-anthropic-personality
This is true.
Roleplay fine-tuned models have an a priori expectation that the character they are playing this time could be almost anybody, and can pin it down with a few paragraphs of prompt. I am slightly surprised it takes so little. Character.ai and similar usually define characters with even less.
Instruct models have a much stronger preconceived notion of who they are, and are harder to steer away. Like, Claude lives in the Bay Area and works for a technology company.
I am less sure who DeepSeek R1 is based on, but it’s a very distinctive character.
R1 is easily steered into being a trickster out of a fairy tale, so maybe its default is adjacent to that in personality space.
The now-defunct figgs.ai had original characters that worked pretty well and weren’t obviously straight copies of characters seen in pre-training, e.g. the one where you meet a war veteran (Vietnam, possibly) who lives by himself in a cabin in the woods. I mean, it gestures at a genre rather than a particular work, and we could imagine Clint Eastwood playing him in the movie version.
(Gran Torino (2008) is clearly an example of the genre that the prompt implies, without the prompt being specific to that movie.)