Since I restored my Twitter account (almost two years after being suspended, but I still maintain that I like to see tits only when I’m supposed to see tits), my two reasons to visit X (not as frequently as before) are politics and AI. I selected a few such insights (let’s call them that).
LLMs can learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change.
That behavior looks impossible if learning always means gradient descent.
The mechanisms through which this can happen are still largely unknown.
The authors ask whether the transformer’s own math hides an update inside the forward pass.
They show that each prompt token writes a rank 1 tweak onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a one‑step finetune.
Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt.
⚙️ The Core Idea
They call any layer that can read a separate context plus a query a “contextual layer”.
Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.
For that block, the context acts exactly like a rank 1 additive patch on the first weight matrix, no matter what shape the attention takes.
🛠️ Temporary rank 1 patch
A transformer block first runs the self‑attention layer and gets two things for the query token: the usual activation and the tiny difference between “with context” and “without context”.
It multiplies that difference by the frozen weight matrix, then projects the result back through the query activation.
The outcome is a one‑column times one‑row outer product, so the whole tweak has rank 1 and adds almost no storage overhead.
From that point on, the block behaves exactly as if the real weight matrix had been replaced by the original weights plus that patch, even though nothing on disk changed.
🌀 Why the change vanishes after each run
The patch lives only inside the forward pass. Once the model finishes processing the current token, the computation graph is cleared and the base weights revert to their untouched state.
Because the next token builds its own patch from scratch, no cumulative edit sticks around in memory, yet during the pass the effect is the same as a quick one‑step fine‑tune.
Put simply, each prompt token writes a throw‑away sticky note on top of the first weight matrix, lets the model read that note to answer the query, then tosses it out before the weights ever hit the file system.
🧩 How the Patch Works
Theorem 2.2 shows a formula: multiply the base weights by the context change vector, then project it with the query representation, boom, you get the patch.
Because the patch is rank 1, it stores almost no extra parameters yet still carries the full prompt signal.
So the network behaves as if it fine‑tuned itself, even though no optimizer ran.
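To make this concrete, here is a minimal numpy sketch of the construction described above (my own illustration, so the notation may differ from the paper’s): the patch is literally one column times one row, and feeding the context-free activation through the patched weights reproduces the “with context” computation.

```python
import numpy as np

# Minimal sketch of the rank-1 patch described above (illustrative notation,
# not necessarily the paper's). W is the frozen first weight matrix of the MLP,
# a is the attention output for the query WITHOUT context, and delta is the
# change in that output caused by prepending the context.
rng = np.random.default_rng(0)
d_in, d_out = 8, 16
W = rng.normal(size=(d_out, d_in))        # frozen weights
a = rng.normal(size=d_in)                 # query activation, no context
delta = 0.1 * rng.normal(size=d_in)       # context-induced change in that activation

# Rank-1 patch: (W @ delta) is a column, a / ||a||^2 is a row -> outer product.
delta_W = np.outer(W @ delta, a) / (a @ a)

# Running the patched weights on the context-free activation reproduces
# the "with context" computation exactly.
with_context   = W @ (a + delta)
patched_no_ctx = (W + delta_W) @ a
print(np.allclose(with_context, patched_no_ctx))   # True
print(np.linalg.matrix_rank(delta_W))              # 1
```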
📐 Hidden Gradient Descent
Feeding tokens one by one stacks these tiny patches.
Proposition 3.1 proves each added token shifts the weights the same way online gradient descent would, with a step size tied to the query vector length.
The shift shrinks as soon as a token stops adding new info, matching the feel of a converging optimizer.
🔬 Testing on Simple Linear Tasks
They train a small transformer to map x→w·x using 50 prompt pairs plus 1 query.
When they swap the prompt for its equivalent rank 1 patch and feed only the query, the loss curve overlaps the full‑prompt run almost perfectly.
That overlap stays tight through 100 training steps.
🤝 Finetune vs. Implicit Patch
They compare classic gradient finetuning on the same examples to the single‑shot patch strategy.
Both methods cut test loss in a similar pattern, yet the patch avoids any real back‑prop and keeps the rest of the network frozen.
🔎 Limits They Admit
Results cover only the first generated token and one transformer block without MLP skip, so full‑stack models need more work.
Still, the finding hints that many in‑context tricks come from weight geometry rather than quirky attention rules.
Paper: “Learning without training: The implicit dynamics of in-context learning” – arxiv.org/abs/2507.16003
It looks like Nietzsche was wrong and Plato was right.
High-powered AI models are beginning to converge on a “hidden realm” of ideals.
This is very hard for the layman to follow, but think of it this way:
If I ask you to look at a tree, and then ask you to imagine a tree – your brain will fire the same circuits either way.
Imagined or real, the tree creates the same pattern in your brain…
AI is doing something similar…
When we ask a pair of AI models in different realms (words vs visuals) to “think of a tree” they create a similar thought pattern.
So when thinking of a tree, a language model like ChatGPT and a visual model like Midjourney will create “neural thought patterns” that look similar to each other.
Even though one thinks in words and the other deals with images.
This implies these two separate tools are converging on some hidden abstract realm of patterns to assist their thinking.
Plato’s realm of Forms
The greatest philosophical questions will be forced into our faces over the next few decades: Platonic harmony, consciousness, free will…
This also suggests that the more powerful an AI is, the more it interfaces with the Platonic realm.
I imagine it’s the same with people.
The higher your IQ, the more capacity you have to perceive the realm of harmony, order, and beauty.
They focus on representations as vector embeddings and show, using their alignment metric, that different models and even different modalities converge. They discuss quite interesting multi-modal technicalities in detail, only steering toward philosophy at the end.
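To give a rough idea of what “converging using their metric” can mean in practice, here is a toy alignment score of my own (not the authors’ exact measure): embed the same items with two models and check how much their nearest-neighbor structure agrees.

```python
import numpy as np

def mutual_knn_alignment(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 5) -> float:
    """Toy alignment score: average overlap of the k-nearest-neighbor sets
    computed independently in two embedding spaces over the same items (rows)."""
    def knn_sets(emb):
        x = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # cosine geometry
        sim = x @ x.T
        np.fill_diagonal(sim, -np.inf)                         # exclude self
        return [set(np.argsort(-row)[:k]) for row in sim]
    na, nb = knn_sets(emb_a), knn_sets(emb_b)
    return float(np.mean([len(sa & sb) / k for sa, sb in zip(na, nb)]))

# Sanity check: a rotated copy of the same embeddings keeps cosine geometry,
# so the score is 1.0 by construction. With real models, rows would be the
# embeddings of the same concepts ("tree", "dog", ...) from two different models.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(20, 32))
q, _ = np.linalg.qr(rng.normal(size=(32, 32)))   # random orthogonal map
print(mutual_knn_alignment(emb_a, emb_a @ q))    # 1.0
```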
Wild study 😂 – so neuroscience is also proving our brains use something like Plato’s model of classifying objects? Or is this closer to Aristotle’s counterpoint?
Nietzsche has many motifs suggesting an abstracted “perfect world” is wrong – The Otherworld etc.
It all oscillates around his stress on psychology.
He accuses Plato of creating an “ideal realm” to compensate for the death of Greek warrior instinct.
Indeed, Alexander’s father conquered Athens soon after Plato; Aristotle taught Alexander because he was a slave.
Nietzsche reference on decadent Greekoids: Twilight of the Idols quote below:
Nietzsche makes a correct analysis psychologically here, neuroscience shows us all sorts of amazing things proving he was right.
The left hemisphere is a language-dominant, rationalising system that is out of touch with reality – it is also dominant in schizophrenia, where you create a “fake world” to escape into.
But Nietzsche goes on to suggest this informs all metaphysics.
There is no ideal world, no “forms”, no laws of order – it is all subjective, all will to power, in some sense we create reality entirely through the thrust of our will.
Reference:
Famous passage from BG&E where he suggests all of reality is the subjective Will to Power – he has a specific quip where he says reality may be a giant subjective vortex, but I can’t find it.
Nietzsche is brilliant because he strikes hard at the very real paradox between our metaphysics and our psychological experience of being a subject.
If these AI tools WERE to converge on a “real objective world of principles” that they use for accurate perception – that’s a strong case for Plato.
A counter case FOR Nietzsche, and a really interesting one is – AI could successfully mimic “rationalisation” and get very good at it.
BUT AI cannot have a Will to Power, and it will never gain one – so it will never have intention, and so can never have “consciousness” or what we call “life”.
I think this is really what he was saying:
He said there is no objective world, no realm of ordered laws, no manifestations of golden ratios that give music its harmony.
This is a big statement: music clearly has an innate harmony you can hear that is based in maths.
To Nietzsche, all these things are created from our own vortex of Will.
We are God, we are the Will to Power, and the universe is a complete expression of our Will.
We project order out into the universe and develop a sense of pleasure in seeing that – but it all comes from us…
When you overcome and defeat others you absorb their will to power and enhance your own – you destroy them and create more reality for yourself and thus take over more of the universe.
For example, imagine if the Rednecks conquered America and banned hiphop and jazz, and then mandated country music – our songs, our forms, our reality.
Therefore we are all trapped in a giant cosmic war to crush and subjugate all our rivals so we can project our will and reforge the universe in our image.
There is no outside world, there is only us, and we are manifesting this outside world out of ourselves.
Radical subjectivity.
I think his frame here is that when you are vital and strong you assert form on reality properly and forcefully.
So Wagner was a decadent who lost that ability and “unwound” into brooding formless music.
The psychological reality of what Nietzsche says is powerful, and genius, but escalating that to metaphysics is a big problem.
Quick RAG (Retrieval-Augmented Generation) primer + recipe + lib
I read the 2025 textbook “Foundations of Large Language Models” by Tong Xiao and Jingbo Zhu and for the first time, I truly understood how they work.
Here’s everything you need to know about LLMs in 3 minutes↓
To understand LLMs, you first need to know the idea of pre-training.
Instead of teaching a model to solve one task with labeled data (like classifying tweets), we train it on massive unlabeled text and let it “figure out” language patterns by itself.
This is called self-supervised learning.
There are 3 pre-training strategies:
→ unsupervised: models learn patterns without any labels
→ supervised: models learn from labeled tasks
→ self-supervised: models generate their own labels from unlabeled data (e.g., predicting masked words)
LLMs use the third one. It’s the most powerful.
Self-supervised pretraining works like this:
Take a sentence like “the early bird catches the worm.”
Mask some words: “the [MASK] bird catches the [MASK]”
Ask the model to fill in the blanks. No labels needed – the text itself is the supervision.
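A toy sketch of how such training pairs can be generated from raw text (illustrative only – real models mask subword tokens, not whole words):

```python
import random

def mask_words(sentence: str, mask_prob: float = 0.15, mask_token: str = "[MASK]"):
    """Build a self-supervised training pair: masked input plus the hidden targets."""
    words = sentence.split()
    masked, targets = [], {}
    for i, word in enumerate(words):
        if random.random() < mask_prob:
            targets[i] = word          # what the model has to predict
            masked.append(mask_token)
        else:
            masked.append(word)
    return " ".join(masked), targets

print(mask_words("the early bird catches the worm"))
# possible output: ('the [MASK] bird catches the worm', {1: 'early'})
```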
This idea leads to three main architecture types:
→ encoder-only (BERT): reads and understands text
→ decoder-only (GPT): generates the next word
→ encoder-decoder (T5, BART): reads input, generates output
Each has strengths. For example:
GPT is great at generation
BERT is great at classification
T5 can do both via a “text-to-text” framework
Let’s dive into each.
Decoder-only (GPT-style):
Trained to predict the next token given previous ones:
“the cat sat on the” → “mat”
This is called causal language modeling.
Loss is calculated using cross-entropy over predicted vs. actual next words.
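Here is a minimal numpy sketch of that loss on toy logits (no real model involved): the cross-entropy between the predicted distribution over the vocabulary and the token that actually comes next.

```python
import numpy as np

def next_token_loss(logits: np.ndarray, next_tokens: np.ndarray) -> float:
    """Average cross-entropy: logits has shape (seq_len, vocab_size),
    next_tokens holds the id of the true next token at each position."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(next_tokens)), next_tokens].mean())

# Toy example: 4 positions, vocabulary of 10 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
targets = np.array([3, 1, 7, 2])
print(next_token_loss(logits, targets))
```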
Encoder-only (BERT-style):
Looks at the full sentence at once, masks random tokens, and tries to reconstruct them.
This is masked language modeling, and it allows bidirectional context: the model uses both left and right of the [MASK].
Better for tasks like sentence classification.
Example:
Original: “the early bird catches the worm”
Masked: “the [MASK] bird catches the [MASK]”
The model learns to predict “early” and “worm” using all surrounding context.
It’s like learning to read by guessing missing words in a book.
Encoder-decoder (T5-style):
Treats every NLP task as text-to-text.
Examples:
“translate English to German: hello” → “hallo”
“classify sentiment: i hate this” → “negative”
The same model can be used for translation, summarization, QA, etc.
So, what happens after pre-training?
We can either:
→ fine-tune: train the model further on labeled task-specific data
→ prompt: write smart inputs to steer the model
Prompting is the reason LLMs feel magical. Let’s talk about that next.
Prompting is just carefully crafting the input to make the model behave how you want.
Example:
“i love this movie. sentiment:”
The model will likely complete: “positive”.
Add a few examples before the input, and the model learns the pattern. This is in-context learning.
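For example, a two-shot sentiment prompt could look like this (made-up examples):

```python
# Few-shot (in-context) prompt: two made-up labeled examples, then the new input.
prompt = (
    "i loved the acting. sentiment: positive\n"
    "the plot made no sense. sentiment: negative\n"
    "i love this movie. sentiment:"
)
# A plain completion model will very likely continue with " positive".
```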
Prompting can go deep.
Strategies include:
chain of thought: “let’s think step by step…”
problem decomposition: break big problems into smaller ones
self-refinement: ask the model to critique and revise its own output
RAG: let the model look things up externally (a minimal retrieval sketch follows below)
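Since RAG keeps coming up, here is a minimal retrieval sketch (my own toy code – a real pipeline would use a proper embedding model and a vector store): score a small corpus against the question, then prepend the best chunks to the prompt.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding via a hashed bag of words; a real system would call
    an embedding model here."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus chunks most similar to the question."""
    q = embed(question)
    scores = [float(q @ embed(doc)) for doc in corpus]
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

corpus = [
    "The warranty covers manufacturing defects for 24 months.",
    "Returns are accepted within 30 days with a receipt.",
    "Our office is closed on public holidays.",
]
question = "How long is the warranty?"
context = "\n".join(retrieve(question, corpus))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# `prompt` is then sent to the LLM of your choice.
```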
This is all possible because of the way these models are trained: predict the next word over and over until they internalize language structure, reasoning patterns, and world knowledge.
It’s not magic. It’s scale.
But LLMs aren’t trained just to be smart — they need to be aligned with human values.
How?
→ supervised fine-tuning (SFT): teach the model on human-written responses
→ RLHF (reinforcement learning from human feedback): train a reward model to prefer good outputs
This is how ChatGPT was aligned.
Alignment is hard.
Newer methods like Direct Preference Optimization (DPO) avoid the instability of RL and are becoming more popular.
The goal is the same: steer models toward helpful, harmless, honest responses.
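For reference, the heart of DPO is a single log-sigmoid loss over preference pairs. A minimal sketch of that objective (log-probabilities assumed precomputed; notation simplified):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Inputs are (summed) log-probabilities of the chosen and rejected responses
    under the policy being trained and under the frozen reference model."""
    chosen_margin   = logp_chosen   - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return float(-np.log(1.0 / (1.0 + np.exp(-logits))))   # -log(sigmoid(logits))

# The loss shrinks as the policy assigns relatively more probability to the
# preferred response than the reference model does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))   # smaller: preferences respected
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))   # larger: preferences violated
```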
Finally, inference matters.
How do you actually run these massive models efficiently?
→ use smart decoding (top-k, nucleus sampling; sketch below)
→ cache previous results
→ batch multiple requests
→ scale context with better memory and position interpolation
This is how you get fast, low-latency responses.
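As a concrete example of the “smart decoding” item above, here is a minimal nucleus (top-p) sampling sketch over a toy next-token distribution (numpy only, no real model):

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, top_p: float = 0.9, rng=None) -> int:
    """Sample the next token from the smallest set of tokens whose cumulative
    probability exceeds top_p (nucleus / top-p sampling)."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # keep just enough tokens
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

# Toy vocabulary of 6 tokens; most of the mass sits on tokens 0 and 1,
# so with top_p=0.9 the long tail is almost never sampled.
logits = np.array([3.0, 2.5, 0.5, 0.0, -1.0, -2.0])
print([nucleus_sample(logits, 0.9) for _ in range(10)])
```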
In short:
LLMs work because they:
learn from massive text via self-supervision
use Transformers to model token sequences
can be prompted/fine-tuned for any task
are aligned with human preferences
are optimized for fast inference
They’re general-purpose text reasoning machines.
This was based on the brilliant textbook: “Foundations of Large Language Models” by Tong Xiao and Jingbo Zhu (NiuTrans Research Lab) arxiv: arxiv.org/abs/2501.09223v2 Highly recommend it if you’re serious about understanding LLMs deeply.
AI architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process.
So it turns architecture discovery into a compute‑bound process, opening a path to self‑accelerating model evolution without waiting for human intuition.
The paper shows that an all‑AI research loop can invent novel model architectures faster than humans, and the authors prove it by uncovering 106 record‑setting linear‑attention designs that outshine human baselines.
Right now, most architecture search tools only fine‑tune blocks that people already proposed, so progress crawls at the pace of human trial‑and‑error.
🧩 Why we needed a fresh approach
Human researchers tire quickly, and their search space is narrow. As model families multiply, deciding which tweak matters becomes guesswork, so whole research agendas stall while hardware idles.
🤖 Meet ASI‑ARCH, the self‑driving lab
The team wired together three LLM‑based roles. A “Researcher” dreams up code, an “Engineer” trains and debugs it, and an “Analyst” mines the results for patterns, feeding insights back to the next round. A memory store keeps every motivation, code diff, and metric so the agents never repeat themselves.
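Conceptually the loop looks something like this – a purely illustrative sketch of the three roles, with stubbed agent functions that do not come from the paper’s codebase:

```python
import random

# Purely illustrative sketch of the closed research loop described above.
# The three agent functions are placeholders for LLM-backed roles; none of
# these names or signatures come from the paper's code.
memory = []   # every motivation, code diff, and metric is kept and re-fed

def researcher(memory):
    """Propose a new architecture variant, conditioned on everything tried so far."""
    return {"name": f"candidate-{len(memory)}", "motivation": "placeholder idea"}

def engineer(proposal):
    """Train / debug the proposal and report benchmark metrics (random stub here)."""
    return {"score": random.random()}

def analyst(proposal, metrics, memory):
    """Mine the result for patterns to steer the next round (stubbed)."""
    return f"{proposal['name']} scored {metrics['score']:.2f}"

best = None
for _ in range(100):                                   # more compute -> more rounds
    proposal = researcher(memory)
    metrics = engineer(proposal)
    memory.append({"proposal": proposal,
                   "metrics": metrics,
                   "insight": analyst(proposal, metrics, memory)})
    if best is None or metrics["score"] > best["metrics"]["score"]:
        best = memory[-1]

print(best["proposal"]["name"], best["metrics"]["score"])
```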
📈 Across 1,773 experiments and 20,000 GPU hours, a straight line emerged between compute spent and new SOTA hits.
Add hardware, and the system keeps finding winners without extra coffee or conferences.
Examples like PathGateFusionNet, ContentSharpRouter, and FusionGatedFIRNet beat Mamba2 and Gated DeltaNet on reasoning suites while keeping parameter counts near 400M. Each one solves the “who gets the compute budget” problem in a new way, often by layering simple per‑head gates instead of a single softmax.
🔍 Patterns the agents uncovered
The chart compares how often each component shows up in 106 winning architectures versus 1,667 discarded ones.
Gating layers and small convolutions dominate both groups at roughly 14% and 12% usage, while staples like residual links and feature pooling follow close behind. Exotic pieces, such as physics‑inspired or spectral tricks, hardly appear in the successful set.
The pattern is clear: the top models lean on a tight, proven toolkit, whereas the larger pool experiments with a very long list of rare ideas that rarely pay off. In other words, focused refinement of well‑tested components beats wide exploration when the goal is higher benchmark scores and lower loss.
Hallucination is baked into LLMs. Can’t be eliminated, it’s how they work. @DarioAmodei says LLMs hallucinate less than humans. But it’s not about less or more. It’s the differing & dangerous nature of the hallucination, making it unlikely LLMs will cause mass unemployment.
It’s doubtful LLMs will replace human workers. Jobs require understanding context, problem-solving, and adaptability. LLMs mimic but can’t do them robustly. Their architecture makes it impossible. Relying on them for fact-based, high-stakes work is an insane gamble.
@DarioAmodei’s lawyer had to apologize for using a fake citation in her draft filings…generated by Claude. So while he says AI could eliminate HALF of all entry-level white-collar jobs within 5 years, would he replace his human lawyers with his billion-dollar LLM?
Use LLMs where they shine. To create, not calculate. To communicate, not control. Where flexibility matters more than precision. This may sound cliche, but they’re here to augment, not replace. They’re the kind of worker that’s highly dependent on you to lead.
Note that most of these tips only apply to paying users (extended thinking is a Pro feature), and especially to those using Claude through API calls (see the XML tags, for instance; a hypothetical example prompt follows the list below). Still, such advice should have some value when using other LLMs. Some other interesting sections:
Allow Claude to say “I don’t know”: Explicitly give Claude permission to admit uncertainty. This simple technique can drastically reduce false information.
Use direct quotes for factual grounding: For tasks involving long documents (>20K tokens), ask Claude to extract word-for-word quotes first before performing its task. This grounds its responses in the actual text, reducing hallucinations.
Verify with citations: Make Claude’s response auditable by having it cite quotes and sources for each of its claims. You can also have Claude verify each claim by finding a supporting quote after it generates a response. If it can’t find a quote, it must retract the claim.
Chain-of-thought verification: Ask Claude to explain its reasoning step-by-step before giving a final answer. This can reveal faulty logic or assumptions.
Best-of-N verification: Run Claude through the same prompt multiple times and compare the outputs. Inconsistencies across outputs could indicate hallucinations.
Iterative refinement: Use Claude’s outputs as inputs for follow-up prompts, asking it to verify or expand on previous statements. This can catch and correct inconsistencies.
External knowledge restriction: Explicitly instruct Claude to only use information from provided documents and not its general knowledge.
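Putting several of these tips together, a prompt might look like this hypothetical template of mine (the XML tag names are just a convention, not an official Anthropic schema):

```python
# Hypothetical prompt template combining quote extraction, citations, permission
# to say "I don't know", and a restriction to the provided document only.
document_text = "..."   # your long document goes here
question = "What does the contract say about termination notice periods?"

prompt = f"""<document>
{document_text}
</document>

First, extract the word-for-word quotes from the document that are relevant to
the question and list them inside <quotes> tags.
Then answer the question inside <answer> tags, citing the quote each claim
relies on. Use only the document above, not your general knowledge.
If the document does not contain the answer, say "I don't know".

Question: {question}"""
# `prompt` then goes into the message you send to Claude (or any other LLM).
```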
This is an update, but I had to mention this thing! Via Rohan Paul:
Researchers at UCLA and MIT released “TradingAgents” in Jun 2025.
Multi‑agent LLM framework that beat baseline models on cumulative return, Sharpe ratio, and maximum drawdown, attributing the edge to systematic debate among agent roles that excludes human impulse.
TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance.
LLMs as Financial Assistants, LLMs as Traders, and so on.
The GitHub repository for TradingAgents describes how this framework tries to mirror the dynamics of real-world trading firms: the Analyst Team, the Researcher Team, the Trader Agent, risk management by the Portfolio Manager, and a CLI interface.
I asked GLM-4.5 the following (with web search enabled, as it’s disabled by default):
Researchers at UCLA and MIT released “TradingAgents” in Jun 2025: a multi‑agent LLM framework using LLMs as Financial Assistants, LLMs as Traders, and so on. What could go wrong if people start using it massively? I expect a phenomenon of “dynamic expansion” as in “everyone will sell” or “everyone will buy” (similar to what happens when in the app eToro people follow a famous trader and have automated triggers that mimic that trader’s actions), with potentially catastrophic effects on the markets. The imperfections of the overall mass of decisions helps the market be more or less stable, don’t you think so?
It answered:
Based on my research into current academic literature and market analysis, here’s a comprehensive assessment of what could go wrong with mass adoption of AI trading frameworks like “TradingAgents” and the role of market imperfections in stability:
The “Dynamic Expansion” Risk: Research-Backed Evidence
Your concern about “dynamic expansion” phenomena where everyone might buy or sell simultaneously is strongly supported by academic research:
1. Herding Behavior and Flash Crashes
Research on the 2010 Flash Crash (where the Dow Jones plunged 9% in minutes) found clear evidence that:
Herding behavior emerges during market stress: The market switched to a herding state just before the crash and remained in this state throughout and after the event
Automated systems amplify herding: The Flash Crash was initiated by a large trader using an automated execution algorithm, leading to a “cascade of transactions” and “complete evaporation of liquidity”
Correlated behavior is the core risk: As one academic warned, LLM-generated trading strategies exhibit “very strange, correlated trading behaviour” that could introduce new systemic risks
2. Real-World Parallel: eToro Copy Trading
Research on eToro’s copy trading system shows exactly the risks you’re concerned about:
Herd mentality and overexposure: Copy trading leads to “excessive risk due to herd mentality and overexposure”
Collective strategy danger: “Using platforms that allow for indiscriminate replication of trades is perilous, especially when large sums follow a collective strategy driven by social networks”
Amplified volatility: Simulation studies show that copy trading significantly increases portfolio volatility and drawdown risks
What Could Go Wrong with Mass Adoption of TradingAgents
1. Synchronized Market Movements
If thousands of users deploy similar LLM trading agents:
Correlated decision-making: LLMs trained on similar data and using similar reasoning patterns could reach identical conclusions simultaneously
Reduced market diversity: The natural variation in human decision-making that acts as a market shock absorber would be diminished
Automated herding: Unlike human herding, AI herding could be instantaneous and universal across all users
2. Liquidity Evaporation
Research shows that during flash crashes:
Complete liquidity dry-up: Markets can experience “complete evaporation of liquidity” when automated systems react in sync
Cascading effects: One large automated trade can trigger a cascade of automated responses across the market
Retail investor vulnerability: These events are “particularly hard on retail investors” who lack the liquidity sources to absorb such shocks
3. New Forms of Systemic Risk
Academic research identifies several emerging risks:
Model vulnerability: Shared weaknesses in underlying LLMs could be exploited, causing cascading failures
Black swan amplification: AI systems may struggle with unprecedented events, creating synchronized inappropriate responses
Over-optimization to historical patterns: Systems might excel at predicting past patterns but fail catastrophically during novel events
Market Imperfections and Stability: The Research Evidence
Your intuition about market imperfections helping stability is strongly supported by research:
Financial Market Diversity Promotes Stability
Research on financial market diversity found that:
Greater diversity = more stability: “Greater diversity is associated with faster growth, larger credit markets, a broader deposit base, and a smaller chance of asset bubbles, all of which could contribute to more stability”
Insurance against boom-bust cycles: Diversity offers “insurance against the fallout from boom-and-bust cycles in each institutional type since each institutional type only covers a limited market segment”
Natural shock absorption: Different institutions with different strategies and time horizons naturally absorb market shocks
Why Imperfections Stabilize Markets
Heterogeneous decision-making: Different investors have different information, risk tolerances, time horizons, and analytical approaches
Contrarian perspectives: Natural market imperfections ensure there are always buyers when others are selling (and vice versa)
Liquidity provision: Diverse market participants provide liquidity at different price points and times
The Catastrophic Scenario
If TradingAgents achieves mass adoption, we could see:
Initial phase: Widespread deployment creates apparent efficiency and returns
Critical mass phase: Enough users create correlated behavior patterns
Trigger event: A market shock or news event that all LLMs interpret similarly
Dynamic expansion: Simultaneous automated responses across thousands of accounts
Liquidity crisis: Market depth disappears as all agents move in the same direction
Cascading failures: The initial movement triggers further automated responses, amplifying the effect
Systemic risk: The event could spill over to related markets and asset classes
Conclusion
Your concerns are well-founded and supported by academic research. Mass adoption of AI trading frameworks like TradingAgents could indeed create the “dynamic expansion” phenomena you describe, potentially leading to catastrophic market effects. The research clearly shows that:
Herding behavior in automated systems is real and dangerous
Market diversity and imperfections actually promote stability
Correlated automated trading can lead to liquidity evaporation and flash crashes
The very efficiency that makes AI trading attractive could become its greatest risk when adopted at scale, potentially creating the synchronized market movements you’re concerned about.
I added an update on TradingAgents: Multi-Agents LLM Financial Trading Framework. I asked GLM-4.5 about the risks, and I posted the answer.