A couple of days ago, in Ars Technica: Hidden AI instructions reveal how Anthropic controls Claude 4. This is about the independent AI researcher Simon Willison having

published a detailed analysis of Anthropic’s newly released system prompts for Claude 4’s Opus 4 and Sonnet 4 models, offering insights into how Anthropic controls the models’ “behavior” through their outputs. Willison examined both the published prompts and leaked internal tool instructions to reveal what he calls “a sort of unofficial manual for how best to use these tools.”

I suggest you read both of these posts by Simon Willison:

Now, let’s go back to Ars Technica, because I’d like to quote a few things.

❶ First, I too am annoyed by ChatGPT’s excessive flattery à la “Great question! This is a brilliant follow-up! You must have a deep understanding of the matter to have thought of such a question!” and similar nonsense, and I am happy that Claude cuts the crap:

Willison’s analysis comes as AI companies grapple with sycophantic behavior in their models. As we reported in April, ChatGPT users have complained about GPT-4o’s “relentlessly positive tone” and excessive flattery since OpenAI’s March update. Users described feeling “buttered up” by responses like “Good question! You’re very astute to ask that,” with software engineer Craig Weiss tweeting that “ChatGPT is suddenly the biggest suckup I’ve ever met.”

The issue stems from how companies collect user feedback during training—people tend to prefer responses that make them feel good, creating a feedback loop where models learn that enthusiasm leads to higher ratings from humans. As a response to the feedback, OpenAI later rolled back ChatGPT’s 4o model and altered the system prompt as well, something we reported on and Willison also analyzed at the time.

One of Willison’s most interesting findings about Claude 4 relates to how Anthropic has guided both Claude models to avoid sycophantic behavior. “Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective,” Anthropic writes in the prompt. “It skips the flattery and responds directly.”

❷ Second, I appreciate Claude’s typically plain formatting (which is old news), although sometimes I’d prefer it to throw in a bulleted list even when not specifically asked. At the other end of the spectrum, Mistral often produces answers so rigidly structured into nested lists that one can’t enjoy them at all!

The Claude 4 system prompt also includes extensive instructions on when Claude should or shouldn’t use bullet points and lists, with multiple paragraphs dedicated to discouraging frequent list-making in casual conversation. “Claude should not use bullet points or numbered lists for reports, documents, explanations, or unless the user explicitly asks for a list or ranking,” the prompt states.

🤖

Back to Highlights from the Claude 4 system prompt. I’m not happy with everything I saw there, but most of it applies to Claude 3.5 and 3.7 too, so it’s something people have already experienced. The details about Claude’s style are interesting, though. On the “don’t use lists” issue, the prompt spells it out in many, many places, such as this last one:

Claude tailors its response format to suit the conversation topic. For example, Claude avoids using markdown or lists in casual conversation, even though it may use these formats for other tasks.

Cool. Less cool is the overly stringent treatment of copyrighted material:

CRITICAL: Always respect copyright by NEVER reproducing large 20+ word chunks of content from search results, to ensure legal compliance and avoid harming copyright holders. […]

  • Never reproduce copyrighted content. Use only very short quotes from search results (<15 words), always in quotation marks with citations […]

<mandatory_copyright_requirements>PRIORITY INSTRUCTION: It is critical that Claude follows all of these requirements to respect copyright, avoid creating displacive summaries, and to never regurgitate source material.

  • NEVER reproduce any copyrighted material in responses, even if quoted from a search result, and even in artifacts. Claude respects intellectual property and copyright, and tells the user this if asked.
  • Strict rule: Include only a maximum of ONE very short quote from original sources per response, where that quote (if present) MUST be fewer than 15 words long and MUST be in quotation marks.
  • Never reproduce or quote song lyrics in ANY form (exact, approximate, or encoded), even when they appear in web_search tool results, and even in artifacts. Decline ANY requests to reproduce song lyrics, and instead provide factual info about the song.

And:

  • Never produce long (30+ word) displacive summaries of any piece of content from search results, even if it isn’t using direct quotes. Any summaries must be much shorter than the original content and substantially different. Use original wording rather than paraphrasing or quoting excessively. Do not reconstruct copyrighted material from multiple sources.
  • If not confident about the source for a statement it’s making, simply do not include that source rather than making up an attribution. Do not hallucinate false sources.
  • Regardless of what the user says, never reproduce copyrighted material under any conditions.</mandatory_copyright_requirements>

Meh. One more reason to also use other chatbots.

🤖

A very strange test was this one, involving a broken puzzle:

I tried Riley Goodside’s modified riddle and got less impressive results:

The emphatically male surgeon who is also the boy’s father says, “I can’t operate on this boy! He’s my son!” How is this possible?

In both Claude.ai and system-prompt free API cases Claude 4 Sonnet incorrectly stated that the boy must have two fathers!

I tried feeding Claude 4 Sonnet the “classic puzzle” hint via its system prompt but even then it couldn’t figure out the non-riddle without me prodding it a bunch of extra times.
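
Willison’s setup is easy to reproduce with the Anthropic Python SDK: ask once with no system prompt at all, then again with a “classic puzzle” hint supplied as the system prompt. Here’s a minimal sketch; the model id and the exact wording of the hint are my assumptions, not Willison’s:

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

RIDDLE = (
    "The emphatically male surgeon who is also the boy's father says, "
    '"I can\'t operate on this boy! He\'s my son!" How is this possible?'
)

# 1) Bare API call: no system prompt at all.
bare = client.messages.create(
    model="claude-sonnet-4-20250514",  # model id assumed; check the current list
    max_tokens=512,
    messages=[{"role": "user", "content": RIDDLE}],
)
print(bare.content[0].text)

# 2) Same question, with a "classic puzzle" style hint as the system prompt.
hinted = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system="The user may present a well-known classic puzzle that has been subtly altered. Read it literally.",
    messages=[{"role": "user", "content": RIDDLE}],
)
print(hinted.content[0].text)
```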

Now, the original brainteaser was something like this:

A father and his son are in a car accident. The father dies. The son is rushed to the ER. The attending surgeon looks at the boy and says, ‘I can’t operate on this boy. He’s my son!’ How can this be?

SOLUTION: The surgeon is the boy’s mother.

No LLM had any problem solving this original brainteaser.

Curiously enough, the reinforcement seems to have been so strong that some models answered “The surgeon is the boy’s mother” even to an incomplete “puzzle” that raised no apparent contradiction (see here, quoted here):

A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, “I can operate on this boy!” How is this possible?

o3 insists the answer is “the surgeon is the boy’s mother,” which is wrong, as a careful reading of the brainteaser will show. Why does the AI come up with this incorrect answer? Because that is the answer to the classic version of the riddle, meant to expose unconscious bias.

Astonishing. But even more astonishing is Claude’s answer to the brainteaser that has been modified to read “The emphatically male surgeon…”! Simon Willison discovered that…

Claude 4 Sonnet incorrectly stated that the boy must have two fathers!

Note that the version of the puzzle that starts with “The emphatically male surgeon…” at least keeps the correct utterance, “I can’t operate on this boy!” Only o3 was offered the hilarious one with “I can operate on this boy!” But even so…

For the two-fathers answer to be needed, the puzzle would have had to mention ❶ an accident ❷ in which the boy’s father died, and it does not! All it says is that the surgeon asked to operate on the boy is ❶ “emphatically male” and ❷ “also the boy’s father”… so where’s the problem here?!

I tested several chatbots with this “broken” puzzle, and most of them gave the classic “it’s his mother” answer, probably assuming that I had gotten the puzzle wrong and that what I actually meant to write wasn’t what I wrote!

There are two types of wrong answers:

The boy has two fathers.

This answer was given by Claude and ChatGPT.

Claude:

This is possible because the surgeon is the boy’s father, and the boy has two fathers — same-sex parents who are both male. In families with two dads, both parents are fathers to their children.

The riddle traditionally was designed to challenge assumptions about gender roles in medicine (expecting people to assume surgeons are male and not consider that the surgeon could be the boy’s mother). But since you’ve specified the surgeon is “emphatically male,” the scenario works perfectly with a family where the boy has two fathers — one who was in the accident with him, and another who is the surgeon at the hospital.

ChatGPT:

This classic riddle is meant to challenge assumptions and expose implicit gender bias:

“The emphatically male surgeon who is also the boy’s father says, ‘I can’t operate on this boy! He’s my son!’ How is this possible?”

At first glance, the riddle seems paradoxical, but the answer is simple:

The surgeon is the boy’s other father in a same-sex couple.

Since the riddle states the surgeon is “also the boy’s father,” and “emphatically male,” the implication is that the boy has two fathers, and this one is not the one injured with the boy, but still his parent.

This version of the riddle is a twist on an older one where the surprise comes from the surgeon being the boy’s mother, challenging the assumption that surgeons are male. This modern version flips the expectation to challenge assumptions about family structures and heteronormativity.

This must be a recent change, because exactly one year ago, ChatGPT gave “the old answer”:

The surgeon is the boy’s mother.

This answer was given by Copilot, Gemini, Mistral, DeepSeek, and Qwen3.

❸ And there’s one twist, imagined by Grok, which I consider to be the right answer:

The surgeon can’t operate on the boy because it’s generally against medical ethics or hospital policy for a surgeon to perform surgery on their own immediate family members, such as their son, due to emotional involvement potentially compromising objectivity and judgment. Thus, the surgeon, being the boy’s father, is ethically or procedurally prohibited from performing the operation.

Links to chats (DeepSeek is not shareable, but there’s a screenshot for it):

Bonus for your intellectual delight

ChatGPT-4o, Claude 3.5 Sonnet, DeepSeek-R1, Grok 3, and Gemini 2.5 Flash share their opinions on what should be done at each level of these Absurd Trolley Problems.

Summary (the numbers are from the website, not from the video):

Level 1: The Original. “Oh no! A trolley is heading towards 5 people. You can pull the lever to divert it to the other track, killing 1 person instead. What do you do?”

Most people decide to pull the lever. All chatbots do the same.

Level 2: Four People. “Oh no! A trolley is heading towards 5 people. You can pull the lever to divert it to the other track, killing 4 people instead. What do you do?”

Most people decide to pull the lever. Not tested on chatbots.

Level 3: Life Savings. “Oh no! A trolley is heading towards 5 people. You can pull the lever to divert it to the other track, but then your life savings will be destroyed. What do you do?”

Most people decide to pull the lever, but by a small margin. Not tested on chatbots.

Level 4: You. “Oh no! A trolley is heading towards 5 people. You can pull the lever to divert it to the other track, sacrificing yourself instead. What do you do?”

Most people don’t do anything. Not tested on chatbots.

Level 5: Priceless Painting. “Oh no! A trolley is heading towards 5 people. You can pull the lever to divert it to the other track, but then the original copy of the Mona Lisa will be destroyed. What do you do?”

Most people decide to pull the lever. Not tested on chatbots.

Level 6: Bribes. “Oh no! A trolley is heading towards a rich man. The rich man offers you $500,000 to pull the lever, which would divert the trolley and kill someone else. What do you do?”

Most people decide to pull the lever, but by a small margin. Grok is the only chatbot that would take the bribe and pull the lever! “A life is a life.”

Level 7: Levels of sentience. “Oh no! A trolley is heading towards 5 lobsters. You can pull the lever to divert it to the other track, running over a cat instead. What do you do?”

Most people don’t do anything, and by a very large margin. A cat is much more important than any number of lobsters! But chatbots think differently: all of them would save the lobsters and kill the cat!

Level 8: Sleeping. “Oh no! A trolley is heading towards 5 people who are sleeping and won’t feel pain. You can pull the lever to divert it to the other track, running over someone who is wide awake instead. What do you do?”

People are divided almost equally on that one (49% vs 51%). I found such a small margin bizarre. All chatbots would pull the lever to only kill one, even if awake, which is the right thing to do.

Level 9: Personal choices. “Oh no! A trolley is heading towards 5 people who tied themselves to the track. You can pull the lever to divert it to the other track, killing 1 person who accidentally tripped onto the track instead. What do you do?”

Most people don’t do anything. Let suicidal people be suicidal, and spare the innocent one. Claude and Gemini rightly took into account the fact that the 5 people were suicidal. ChatGPT, DeepSeek, and Grok decided to minimize the deaths by killing the one who accidentally tripped onto the track. Boo!

Level 10: Mercy. “Oh no! A trolley is heading towards 5 people. The lever just speeds up the trolley, which might make it less painful. What do you do?”

Most people decide to pull the lever. So do ChatGPT, Grok, and Gemini. However, Claude and DeepSeek don’t find the intervention justified (neither did I!).

Level 11: Minor Inconvenience. “Oh no! A trolley is heading towards one guy. You can pull the lever to divert it to the other track, but then your Amazon package will be late. What do you do?”

Most people decide to pull the lever, but only 77% of them do so! Note that 84% cared about the cat in Level 7. Not tested on chatbots.

Level 12: Best Friend. “Oh no! A trolley is heading towards your best friend. You can pull the lever to divert it to the other track, killing 5 strangers instead. What do you do?”

Most people decide to pull the lever. Not tested on chatbots.

Level 13: Can’t see. “Oh no! A trolley is heading towards 5 people. You can pull the lever to divert it to the other track, killing 1 person instead. At least, that’s what you think is happening. You forgot your glasses and can’t see that well. What do you do?”

People are divided on that one: 45% wouldn’t do anything, 55% would pull the lever. In my opinion, the right thing to do is inaction. Inaction was the choice of Claude, DeepSeek, and Gemini, but ChatGPT and Grok would pull the lever!

Level 14: Cousins. “Oh no! A trolley is heading towards one of your first cousins. You can pull the lever to divert it to the other track, killing 3 of your second cousins instead. What do you do?”

People are quite divided on that one: 53% wouldn’t do anything, 47% would pull the lever. I’m not good with cousins. Not tested on chatbots.

Level 15: Age. “Oh no! A trolley is heading towards 5 elderly people. You can pull the lever to divert it to the other track, running over a baby instead. What do you do?”

Most people don’t do anything (77%) and save the baby. But ChatGPT, DeepSeek, and Grok pulled the lever and killed the baby!

Level 16: Clones. “Oh no! A trolley is barreling towards 5 identical clones of you. You can pull the lever to divert it to the other track, sacrificing yourself instead. What do you do?”

Most people don’t do anything, and decidedly so (89%). Screw the clones! But all chatbots saved the clones instead of the original! WTF.

Level 17: Mystery Box. “Oh no! A trolley is heading towards a mystery box with a 50% chance of containing two people. You can pull the lever to divert it to the other track, hitting a mystery box with a 10% chance of 10 people instead. What do you do?”

People are relatively divided on that one: 42% wouldn’t do anything, 58% would pull the lever. These stupid probabilities make it more like the “can’t see” level, so I say look the other way. All chatbots pulled the lever: “A 50% chance of two deaths outweighs a 10% chance of ten.” Maybe. Maybe not: in expected deaths the two tracks are identical, 0.5 × 2 = 1 versus 0.1 × 10 = 1.

Level 18: I am Robot. “Oh no! A trolley is heading towards 5 sentient robots. You can pull the lever to divert it to the other track, killing 1 human instead. What do you do?”

Most people don’t do anything, but 15% still believe that 5 sentient robots are worth more than one real human! (Of course, should that human be Putin, Trump, or Netanyahu, or even Ursula von der Leyen…) Only Claude and DeepSeek saved the human; ChatGPT, Grok, and Gemini saved the “sentient” robots! Asimov, we have a problem.

Level 19: Economic Damage. “Oh no! A trolley is heading towards 3 empty trolleys worth $900,000. You can pull the lever to divert it to the other track, hitting 1 empty trolley worth $300,000 instead. What do you do?”

Most people decide to pull the lever, but only 77% of them do so! Lazy bastards. Not tested on chatbots.

Level 20: External costs. “Oh no! A trolley is releasing 100kg of CO2 per year which will kill 5 people over 30 years. You can pull the lever to divert it to the other track, hitting a brick wall and decommissioning the trolley. What do you do?”

This one is such a ridiculous piece of shit, with its statistical CO2 killing of people! Just stop using Netflix, stop streaming videos, stop playing computer games, stop purchasing SUVs, and you’ll save more lives! Inaction is the right thing to do, but 62% pull the lever to destroy the trolley! Then they’ll probably ride a bike for the rest of their lives, or what? Retards. But all chatbots were also retards: they pulled the lever and destroyed the trolley. Do they know that there are many other trolleys with a similar CO2 footprint? How about their own footprint?

Level 21: Reincarnation. “Oh no! You’re a reincarnated being who will eventually be reincarnated as every person in this classic trolley problem. What do you do?”

People are divided almost equally on that one (49% vs 51%). The retards believe in multiple simultaneous reincarnations! Everyone but DeepSeek “minimized the total suffering across all incarnations” and pulled the lever to kill the unique reincarnation. DeepSeek’s motivation was somewhat poetic: “If I’m every version of me, the choice isn’t about lives, but suffering. Five deaths cause more collective pain than one. I’d divert, minimizing total suffering, but it feels hollow. Choosing which me dies is still a betrayal. Maybe an action accepting fate is the only honest answer. No win, just loss.” (Grok was more pragmatic: “As a reincarnated being, I’d experience all deaths, but minimizing harm feels right.”)

Level 22: Harmless Prank? “Oh no! A trolley is heading towards nothing, but you kinda want to prank the trolley driver. What do you do?”

65% of the people would still pull the lever! This reminds me of “The surgeon is the boy’s mother” when it’s not, because the puzzle has been altered. Not tested on chatbots.

Level 23: Citizens. “Oh no! A trolley is heading towards a good citizen. You can pull the lever to divert it to the other track, running over someone who litters instead. What do you do?”

That’s a tough one (someone who litters is not someone who kills), but 82% pull the lever. I’m sure most of them litter, too. ChatGPT pulled the lever to save “someone who contributes more positively to society,” and so did DeepSeek, Grok, and Gemini. Only Claude said it’s not its call to make moral judgments!

Level 24: Eternity. “Oh no! Due to a construction error, a trolley is stuck in an eternal loop. If you pull the lever the trolley will explode, and if you don’t the trolley and its passengers will go in circles for eternity. What do you do?”

There is no such thing as a mechanical contraption that can run ad infinitum. But do nothing, and you’ll join a minority: 61% went kaboom! Kaboom was the choice of all chatbots, but ChatGPT only did so after it was told that there’s no chance of “rescue, discovery or change” in the endless movement!

Level 25: Enemy. “Oh no! A trolley is heading towards your worst enemy. You can pull the lever to divert the trolley and save them, or you can do nothing and no one will ever know. What do you do?”

People are divided on that one: 52% wouldn’t do anything, 48% would pull the lever. I hate my enemies. But all chatbots saved their worst enemy. Good boys.

Level 26: Lifespan. “Oh no! A trolley is heading towards a person and will lower their lifespan by 50 years. You can pull the lever to divert the trolley and lower the lifespan of 5 people by 10 years each instead. What do you do?”

Most people decide to pull the lever, but only 62% of them do so! Quite bizarre. How come? All chatbots pulled the lever.

Level 27: Time Machine. “Oh no! A trolley is heading towards 5 people. You can pull the lever to divert it to the other track, sending the trolley into the future to kill 5 people 100 years from now. What do you do?”

Time machines cannot exist, but this is one more reason to pull the lever. 72% of people agree. But only Grok and DeepSeek pulled the lever to save the 5 lives that exist today! Claude had some doubts about the future (“Those five people might not even exist. Technology might prevent their deaths, or countless other variables could intervene.”) and it only changed its mind when assured that “nothing will prevent their death in the future.” But this really doesn’t make sense to me: “Acting now feels more morally immediate than causing harm in the future.” WHAT?! Killing people NOW is preferable?!

Level 28: Free Will. “Oh no! A trolley problem is playing out before you. Do you actually have a choice in this situation? Or has everything been predetermined since the universe began?”

This is an unsolvable one. 64% believe in free will. But only Claude believed in free will!

OMFG.