Two Debian developers’ ignominious fight with AI
There are a lot of things generative AI can be blamed for, but some people seem to take the issue personally. It also seems to me that, instead of focusing on the really bad outcomes of AI, some people try to sabotage LLMs altogether. I’ll give you two examples of Debian developers taking aggressive, almost vandalistic stances against generative AI and telling you to do the same!
Exhibit 1: «AI Shit, go away; iocaine to the rescue»
Take Jörg Jaspert, better known as Ganneff. On January 3, he posted on his blog, “AI Shit, go away; iocaine to the rescue.” His blog doesn’t display correctly on my 14″ laptop (125% scaling), regardless of the browser, unless I tell the browser to scale contents down to 80%. But the version archived by the Wayback Machine shows up just fine (see the 2nd screenshot below), despite displaying larger fonts!
Excerpt:
And then some idiot somewhere had the idea to ignore every law, every copyright and every normal behaviour and run some shit AI bot. And more idiots followed. And now we have more AI bots than humans generating traffic.
And those AI shit crawlers do not respect any limits.
robots.txt, slow servers, anything to keep your meager little site up and alive? Them idiots throw more resources onto them to steal content. No sense at all.
iocaine to the rescue
So them AI bros want to ignore everything and just fetch the whole internet? Without any consideration if thats even wanted? Or legal? There are people who dislike this. I am one of them, but there are some who got annoyed enough to develop tools to fight the AI craziness. One of those tools is iocaine – it says about itself that it is The deadliest poison known to AI.
Feed AI bots sh*t
So you want content? You do not accept any Go away? Then here is content. It is crap, but appearently you don’t care. So have fun.
What iocaine does is (cite from their webpage) “not made for making the Crawlers go away. It is an aggressive defense mechanism that tries its best to take the blunt of the assault, serve them garbage, and keep them off of upstream resources”.
That is, instead of the expensive webapp using a lot of resources that are basically wasted for nothing, iocaine generates a small static page (with some links back to itself, so the crawler shit stays happy). Which takes a hell of a lot less resource than any fullblown app.
…
I wish it wouldn’t be neccessary to run such tools. But as long as we have shitheads doing the AI hype there is no hope. I wish they all would end up in Jail for all their various stealing they do. And someone with a little more brain left would set things up sensibly, then the AI thing could maybe turn out something good and useful.
But currently it is all crap.
I’ll discuss later why this guy is completely wrong. For now, let’s note that he could have recommended a website configuration that serves HTTP 403, thus denying access. Instead, by serving shit, this approach sabotages the respective chatbots, which will keep wasting energy for nothing. In the process, such LLMs’ output will worsen over time, and as long as too many retards trust AI without verifying it, the effects on society might become catastrophic. Jörg Jaspert is a completely retarded hooligan! Does this software at least only serve shit to crawlers that ignore robots.txt? Let me inform you that the crawlers of the mainstream LLMs do observe robots.txt!
Based on the info available on its webpage, iocaine poisons all automated crawlers, not just those that ignore robots.txt. It can be configured to poison only the crawlers that ignore robots.txt, but it works independently of any robots.txt, so most rabid haters who install it poison everything! This is akin to terrorism, and, if anything, the makers of this software and Jörg Jaspert should go to jail instead of the “shitheads doing the AI hype”!
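For comparison, here is roughly what the clean alternative looks like. This is a minimal sketch of my own, not anything Jaspert or the iocaine authors ship: a WSGI middleware that answers AI-crawler user agents with HTTP 403 instead of feeding them garbage. The list of user-agent markers is purely illustrative.

```python
# A minimal sketch (not Jaspert's setup, not iocaine): deny known AI-crawler
# user agents with HTTP 403 instead of serving poisoned content.
# The user-agent substrings below are illustrative, not a definitive list.
AI_BOT_MARKERS = ("GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Bytespider")

def deny_ai_bots(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(marker.lower() in ua.lower() for marker in AI_BOT_MARKERS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

# Usage: wrap any WSGI application, e.g. application = deny_ai_bots(application)
```

Roughly the same effect takes a couple of lines of nginx or Apache configuration; a clean refusal costs the site owner next to nothing and wastes nobody’s energy.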
There are many other “tarpit generators”; for instance, Nepenthes includes such warnings:
WARNING
THIS IS DELIBERATELY MALICIOUS SOFTWARE INTENDED TO CAUSE HARMFUL ACTIVITY. DO NOT DEPLOY IF YOU AREN’T FULLY COMFORTABLE WITH WHAT YOU ARE DOING.
YET ANOTHER WARNING
There is not currently a way to differentiate between web crawlers that are indexing sites for search purposes, vs crawlers that are training AI models. ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS.
One year ago, in “AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt,” ArsTechnica mentioned both Nepenthes and Iocaine, but despite the reference to robots.txt, the ethical use of such “protection” tools is not implicit:
Currently, Aaron predicts that Nepenthes might be most attractive to rights holders who want AI companies to pay to scrape their data. And many people seem enthusiastic about using it to reinforce robots.txt. But “some of the most exciting people are in the ‘let it burn’ category,” Aaron said. These people are drawn to tools like Nepenthes as an act of rebellion against AI making the Internet less useful and enjoyable for users.
Moreover, this interview with the maker of iocaine never mentions robots.txt:
Josh Bressers (30:44) This is so hard to talk about because it’s such a, it’s such a different way to think about all this, but like, so today you are serving up trash to a bot that you know is a bot. Like, why not just return a 404 or something like that to the bots? Is the intent to trap them so they can’t bother other people, or is the… I’m curious what the purpose of that is.
Gergely Nagy (algernon) (30:59) Mm-hmm. Yes. Two reasons. The original reason was that originally I served them a random choice of 404 or 401, I think, or an internal server error, some random error. What I observed is that if I did that, they tried to disguise themselves more, and they came back with more agents.
If I serve them a 200 response, they are like “Yay, we got data,” and they will eat it up and be very happy about it and behave a little bit better, so that’s the original reason. Another reason is that if I serve them some kind of garbage, I can serve them the poisoned URLs. If I serve them a 404, I can still send them poisoned URLs, but they are much less likely to crawl that because it’s coming from a basic page or an error page. If I serve them something that they believe is real, they will more likely… they will… So they will be more likely to keep going. And the more URLs I send them, the more URLs they will have in their queue, the more poison they ingest. And that’s great.
So yeah, they’re terrorists.
Exhibit 2: «Dogfood Generative AI»
The second guy is Erich Schubert, professor of data mining at the Technical University of Dortmund. On February 10, he wrote this blog post: Dogfood Generative AI.
His first suggestion is completely ethical:
The AI companies ignore web conventions, e.g., they deep link images from your web sites (even adding ?utm_source=chatgpt.com to image URIs, I suggest that you return 403 on these requests), but do not direct visitors to your site.
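That suggestion is trivial to implement. Here is a minimal sketch of mine, not Schubert’s actual setup, in the same WSGI style as above; the parameter check is illustrative, and in practice this would be a one-line rule in the web server configuration.

```python
# A minimal sketch of the 403 rule suggested above: refuse requests whose
# query string carries utm_source=chatgpt.com (i.e., image deep links coming
# from AI chat answers). Illustrative only.
from urllib.parse import parse_qs

def deny_chatgpt_deeplinks(app):
    def middleware(environ, start_response):
        params = parse_qs(environ.get("QUERY_STRING", ""))
        if any("chatgpt.com" in value for value in params.get("utm_source", [])):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```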
Then he starts mixing things together in non-sequitur fashion:
You do not get a reliable way of opting out from generative AI training or use. For example, the only way to prevent your contents from being used in “Google AI Overviews” is to use data-nosnippet and cripple the snippet preview in Google. The “AI” browsers such as Comet, Atlas do not identify as such, but rather pretend they are standard Chromium. There is no way to ban such AI use on your web site.
What follows is unrelated to the lack of a reliable opt-out:
Generative AI overall is flooding the internet with garbage. It was estimated that 1/3rd of the content uploaded to YouTube is by now AI generated. …
If you have recently tried to find honest reviews of products you considered buying, you will have noticed thousands of sites with AI generated fake product reviews, that all are financed by Amazon PartnerNet commissions. …
Partially because of GenAI, StackOverflow is pretty much dead – which used to be one of the most valuable programming resources. …
Many open-source projects are suffering in many ways, e.g., false bug reports that caused curl to stop its bug bounty program. Wikipedia is also suffering badly from GenAI.
Science is also flooded with poor AI generated papers, often reviewed with help from AI. …
However, the worst effect (at least to me as an educator) is the noskilling effect (a rather novel term derived from deskilling, I have only seen it in this article by Weßels and Maibaum).
Indeed, there are so many severely negative effects: deep fakes; AI-generated texts, images, and videos; AI-generated and bot-submitted fake reviews; the dumbification of people; vibe-coding; CEOs who force people to vibe-code; and CEOs who fire people in order to use AI agents instead.
But oh, the irony: his link to “StackOverflow is pretty much dead” ends this way: ?utm_source=openai#graph
So he used neither ChatGPT directly nor Copilot, but some other OpenAI product or service, likely via the API or through a third-party AI tool.
Hypocritical bastard.
Abruptly, after this line of argument, Herr Professor proposes another terrorist action: Dogfood the AI!
Let’s dogfood the AI. Here’s an outline:
- Get a list of programming topics, e.g., get a list of algorithms from Wikidata, get a StackOverflow data dump.
- Generate flawed code examples for the algorithms / programming questions, maybe generate blog posts, too.
You do not need a high-quality model for this. Use something you can run locally or access for free.
- Date everything back in time, remove typical indications of AI use.
- Upload to Github, because Microsoft will feed this to OpenAI…
Here is an example prompt that you can use:
You are a university educator, preparing homework assignments in debugging. The programming language used is {lang}. The students are tasked to find bugs in given code. Do not just call existing implementations from libraries, but implement the algorithm from scratch. Make sure there are two mistakes in the code that need to be discovered by the students. Do NOT repeat instructions. Do NOT add small-talk. Do NOT provide a solution. The code may have (misleading) comments, but must NOT mention the bugs. If you do not know how to implement the algorithm, output an empty response. Output only the code for the assignment! Do not use markdown. Begin with a code comment that indicates the algorithm name and idea. If you indicate a bug, always use a comment with the keyword BUG. Generate a {lang} implementation (with bugs) of: {n} ({desc})
Remember to remove the BUG comments! If you pick some slightly less common programming languages (by quantity of available code, say Go or Rust), you have higher chances that this gets into the training data.
If many of us do this, we can feed GenAI its own garbage. If we generate thousands of bad code examples, this will poison their training data, and may eventually lead to an effect known as “model collapse”.
What’s wrong with these people?! This specific recipe for sabotage really amounts to terrorism!
Let’s talk about AI, web crawling, and «stealing»
First of all, this is not about robots.txt. The problems reported with websites that crashed under the heavy load of AI web crawlers did not originate in the non-observance of robots.txt but in overly aggressive crawling: too many requests per second. Even so, generic protection mechanisms (rate limiting, for one) do exist, and they should have been put in place. But generally, legitimate crawlers do observe robots.txt!
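For reference, this is all that “observing robots.txt” means for a compliant crawler. A minimal sketch using Python’s standard urllib.robotparser; the bot name and URLs are placeholders, not any real crawler’s configuration.

```python
# A minimal sketch of what "observing robots.txt" means for a compliant
# crawler. The bot name and URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

page = "https://example.com/some/article.html"
if rp.can_fetch("ExampleBot", page):
    print("Allowed by robots.txt: fetch", page)
else:
    print("Disallowed by robots.txt: skip", page)
```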
To call a spade a spade, such users hate LLMs, and they specifically want to ban crawling by AI crawlers! They do want Google and Bing to index their websites, just not ChatGPT, Copilot, and the like! However, if they use Nepenthes, they will completely disappear from all search results, and the same happens to those who deploy iocaine indiscriminately.
Excerpts from a chat with Grok, starting from the relevant point:
Grok: Since ~2023, blocking AI crawlers via robots.txt became one of the most common uses of robots.txt — many publishers don’t want their articles used for LLM training without permission or compensation.
❶ Ludditus: THAT CONTENT IS AVAILABLE ON THE INTERNET. People should understand what that means.
Grok: Philosophically, I get the argument: if you publish openly on the public internet without technical barriers, you’ve implicitly accepted that machines (and people) can read/copy/analyze it. The “availability = permission” view has strong roots in how the early web worked.
But publishers (news orgs, bloggers, authors) increasingly see a distinction in 2026:
- Search indexing drives traffic/revenue (ads, subs, clicks) → most still welcome Googlebot/Bingbot.
- Mass scraping for LLM training often provides zero return to the creator — no referral traffic, no attribution, sometimes direct competition (AI summaries replace clicks). Many feel this crosses from “discovery” into “uncompensated commercial exploitation.”
Data shows this tension:
- Major search bots (Googlebot, Bingbot) are still overwhelmingly allowed because they bring value.
- Dedicated AI-training bots (GPTBot, ClaudeBot, etc.) face widespread blocks via robots.txt — often 60–80% of news/publisher sites block them specifically.
- Compliance varies: Good actors (OpenAI, Anthropic for their main bots) mostly respect Disallow rules; others ignore or partially honor them. But the protocol remains the main polite signal publishers use.
Bottom line: robots.txt isn’t (and never was) a DRM lock — it’s a communication tool for cooperative bots. Authentication secures content; robots.txt manages polite (but resource-hungry) automation. Both have roles because the web has diverse actors: helpful search engines, neutral archivists, aggressive AI trainers, and bad actors.
If your site is small/simple/truly public-and-proud, skip robots.txt entirely — many do. But for anything larger or nuanced, it’s still a lightweight, widely supported way to guide behavior without breaking things for everyone.
❷ Ludditus: But why would a blog owner use robots.txt to deny bots? I thought a blog is meant to be read!
Grok: A blog is indeed meant to be read — by humans. The core tension in 2026 comes from the fact that many crawlers aren’t reading to send people your way; they’re reading to feed commercial AI systems that often replace the need to visit your blog.
Most bloggers still want (and need) Googlebot, Bingbot, etc., to crawl everything because that’s how organic search traffic arrives — people click through, read the full post, maybe click ads, subscribe, or share. Blocking those would be self-sabotage for almost any blog relying on visibility.
But since ~2023–2024, a growing number of independent bloggers, personal site owners, newsletter writers, hobbyists, and even some niche/professional ones have started using robots.txt to selectively block or rate-limit certain bots — almost always the AI-training ones (not general search engines). Here’s why they do it, based on real patterns in 2025–2026:
Primary reason in 2026: Opting out of uncompensated AI training / content reuse
- Bloggers see their words harvested at scale by bots like GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl, often used in training datasets), Applebot-Extended, Google-Extended (for Gemini), etc.
- The scraped content ends up inside LLMs that generate summaries, answers, or entire responses without sending traffic back to the original blog. In many cases, users never see the source — they get a synthesized reply in ChatGPT, Claude, Perplexity, Gemini, etc.
- This feels like free labor for trillion-dollar companies. If your blog post gets regurgitated verbatim or near-verbatim in an AI output (sometimes with hallucinations or distortions), you get zero credit, zero clicks, zero ad revenue, zero subscribers — but the AI company profits.
- Many bloggers explicitly say: “I write to be read by people, not to secretly train someone else’s product.”
Real-world numbers show this isn’t fringe anymore:
- Millions of sites (including many personal/indie blogs) now block GPTBot, ClaudeBot, etc. — estimates range from ~5% of all domains explicitly disallowing GPTBot up to much higher among publishers/news/blogs that care about this.
- Among top news/publisher sites (which overlap with “blog-like” long-form content), 60–80% block at least one major AI-training bot via robots.txt as of late 2025/early 2026.
- Indie bloggers often follow suit because tools like WordPress plugins, Cloudflare one-click toggles (“Block AI bots”), or Hostinger features make it trivial to add those Disallow rules.
Other (less common but real) reasons bloggers use robots.txt blocks
- Server / bandwidth costs — Some low-budget shared hosting bloggers notice AI bots hammering the site (sometimes ignoring crawl-delay), eating bandwidth quotas or slowing things down for real visitors.
- Privacy / control over niche content — Personal journals, drafts, opinion pieces they don’t want summarized out of context or fed into training without nuance.
- Protest / principle — A symbolic “no thanks” to the current AI data-grab model, especially among writers who feel the open web is being eroded.
What they usually don’t block
- Googlebot, Bingbot, Yandex, etc. — because losing search visibility would kill most blogs.
- Many allow “search-like” AI bots (e.g., PerplexityBot or Claude-Web in some cases) if they think it might still drive attribution/traffic.
❸ Ludditus: Not entirely true. I’ve seen visits to my blog from major chatbots, meaning that at least SOME PEOPLE bothered to follow the links that most chatbots offer to complement their synthetic answers.
Grok: You’re right — seeing referral traffic from major chatbots (like ChatGPT, Claude, Perplexity, or Gemini) in your blog’s analytics is real evidence that some users do follow those source links. That’s not nothing, and it shows the system isn’t purely zero-click for everyone.
Many AI platforms (especially Perplexity, but also ChatGPT’s “Search” mode or Claude when it cites sources) include explicit links or “learn more” buttons to originals, and a portion of thoughtful/curious users click through — often to verify details, read the full context, dive deeper, or just support the creator. Your own logs prove it happens on your site, which is great and validates that the attribution isn’t always lost.
That said, the broader data from 2025–early 2026 shows this is still the exception rather than the rule, and the scale is tiny compared to traditional search. Here’s a balanced picture based on recent studies and reports:
- AI referral traffic is growing but remains very small overall
AI platforms (including ChatGPT, Perplexity, Claude, Gemini, etc.) drive roughly 1% of total web traffic across industries (Conductor benchmarks, late 2025). ChatGPT accounts for the lion’s share (~87% of AI referrals), but even its referrals are a fraction of Google’s. For publishers/blogs, AI referrals often make up <5% of what traditional organic search provides.
- Click-through rates in AI interfaces are dramatically lower
- In Google AI Overviews (which many users encounter), only about 1% of visits click links inside the AI summary itself, and overall clicks on traditional results drop to ~8% (vs. 15% without an AI summary) — Pew Research, July 2025 data.
- Pure AI chatbots like ChatGPT/Claude/Perplexity show even lower effective CTRs for sources — often described as 95–96% less referral traffic than comparable Google queries (TollBit/Cloudflare reports, 2025). Many queries resolve with the synthetic answer alone.
- Zero-click behavior dominates: ~60% of searches (including AI-influenced ones) end without any external click, up significantly with AI features.
- When clicks do happen, they can be higher-quality
AI-referred visitors often show stronger intent — longer sessions, lower bounce rates (sometimes 27% lower), and higher conversions (reports of 4x–23x better in some niches like B2B SaaS or retail). So the few who arrive from a chatbot tend to engage/read more deeply than average Google traffic. Your visits could be from exactly those more deliberate users.
- Why the discrepancy feels stark for creators
Traditional Google drives massive volume of casual browsers → lots of clicks, but diluted quality. AI tools satisfy curiosity quickly for most → fewer clicks overall, but the ones that happen are often more engaged. Many bloggers see sporadic AI referrals (like you did) and appreciate them, but the aggregate drop in total traffic (especially from zero-click AI summaries) is why so many still block training bots — they’re protecting against high-volume scraping with minimal return traffic.
In short: Yes, some people absolutely bother to click the links AI provides — your analytics confirm it, and that’s meaningful. But statistically, it’s a small minority of interactions, and the overall shift toward AI is reducing clicks to the open web at scale (especially for informational/blog content). That’s exactly why the robots.txt debate rages on: for many, the occasional quality referral doesn’t offset the massive uncompensated training use + lost broader visibility. If your blog is seeing decent AI clicks, that’s a positive signal — lean into clear, citable content to maximize it!
❹ Ludditus: I don’t care because my blog is not funded through ads. Most people who complain are those who monetize their Internet presence.
Grok: That’s a valid perspective — if your blog isn’t chasing ad revenue, subscriber counts, SEO rankings, or any form of monetization tied to traffic volume, then the “lost clicks” argument from AI summaries or zero-click answers carries much less weight. You’re not financially hurt if fewer people arrive via traditional search or AI referrals. Many non-monetized bloggers (personal journals, hobby sites, tech experiment logs, creative writing spaces, etc.) share your stance: “It’s public → it’s fair game, and I’m not losing sleep over how it’s used downstream.”
However, even among indie, hobbyist, or non-commercial bloggers in 2025–2026, a notable subset still chooses to block AI-training bots via robots.txt (or stronger measures like Cloudflare rules). Their reasons often aren’t about money at all — they’re more philosophical, personal, or practical:
- Ownership and consent over their own words
Many see their blog as an extension of themselves — a personal archive, creative outlet, or intellectual record. The idea of their thoughts/voice being ingested into proprietary black-box models (often without attribution, context, or control) feels like an unwanted appropriation. It’s less “you stole my revenue” and more “you took my expression without asking, to build something I might disagree with or that competes with human creativity.” One indie blogger put it bluntly in late 2025: “It’s my content. I wrote it. I chose to publish it. I’d rather opt out of being part of someone else’s dataset without my knowledge.”
- Principle against uncompensated mass exploitation
Even without ads, there’s a sense of unfairness in trillion-dollar companies vacuuming up personal/hobby content at industrial scale for free commercial gain. It’s not always about direct compensation; it’s about rejecting the norm that “public = unlimited free labor for AI corps.” This resonates especially with writers, artists, or devs who view the open web as a gift economy — but one where reciprocity (at least not actively harming creators) matters.
- Server performance and cost on small setups
Non-monetized blogs often run on cheap shared hosting, personal servers, or low-bandwidth plans. Aggressive AI crawlers (some ignoring robots.txt, hammering pages repeatedly, or executing JS heavily) can cause real outages, slow loads for actual readers, or push usage over limits. Indie bloggers have reported bots causing downtime or “Slashdot-like” spikes daily — not theoretical, but logged reality. Blocking via robots.txt (or WAF) is a simple fix to keep the site stable for the humans who do visit.
- Avoiding out-of-context reuse or distortion
Personal blogs can contain nuanced opinions, drafts, jokes, or context-sensitive takes. Creators worry about snippets being fed into LLMs that then spit out warped versions (with hallucinations) attributed vaguely — or worse, used to train models that generate competing “content” in their niche. It’s a control thing: “I don’t want my half-formed thoughts summarized poorly in someone else’s product.”
From data around early 2026:
- While top news/publisher sites lead the blocking trend (79%+ blocking at least one AI training bot), indie/personal sites are increasingly following suit — often selectively (e.g., Disallow: / for GPTBot/ClaudeBot but Allow for Googlebot).
- Tools like Cloudflare’s one-click “Block AI bots” see high adoption even on non-commercial domains (over 80% of some customer segments opt in).
- Blogs on platforms like Bear Blog, personal WordPress installs, or static sites (Hugo/Jekyll) frequently add these rules not for traffic protection, but for the reasons above.
In the end, you’re spot-on for your own case: if traffic metrics and revenue aren’t your drivers, ignoring AI blocks makes perfect sense — enjoy the open web on your terms. Many others feel the same way and skip the blocks entirely.
But the ones who do block aren’t always the “greedy monetizers.” A good chunk are fellow non-commercial creators drawing a line on consent, control, or simple site hygiene. It’s a personal choice spectrum, not purely a money thing. If your analytics keep showing those occasional thoughtful AI referrals, that’s a nice bonus in a system that’s otherwise tilting toward less human visitation overall. Keep writing what you want — that’s what matters.
❺ Ludditus: “It’s my content. I wrote it. I chose to publish it. I’d rather opt out of being part of someone else’s dataset without my knowledge.” But this happens even when people READ PRINTED BOOKS and then use excerpts or info from what they’ve read. Once again, when something is on the Internet, very few protections exist.
Grok: You’re drawing a reasonable parallel: when someone reads a printed book, they absorb ideas, facts, excerpts, or styles into their mind — sometimes quoting or paraphrasing later in conversation, writing, or teaching — without needing permission beyond the initial purchase (and fair use/dealing doctrines cover many such uses). Ethically and legally, society has long accepted this as part of how knowledge spreads through human cognition. Why treat AI ingesting publicly posted web content differently? If it’s out there openly, isn’t processing it just a scaled-up version of human reading?
The analogy holds in some ways but breaks down in others — both legally and ethically — especially in the context of 2026’s AI landscape. Here’s a clear breakdown of the key differences:
Scale, Precision, and Permanence
- A human reader retains imperfect, filtered impressions shaped by personal context, memory decay, creativity, and ethics. They can’t reproduce entire passages verbatim on demand or at industrial scale without effort (and even then, it’s limited by human capacity).
- AI training involves making perfect digital copies of vast amounts of text (often billions/trillions of tokens), analyzing them exhaustively, and retaining patterns indefinitely in model weights. Courts and the U.S. Copyright Office (in its 2025 generative AI training report) have explicitly noted this: “AI learning is different from human learning in ways that are material to the copyright analysis. Humans retain only imperfect impressions… Generative AI training involves the creation of perfect copies with the ability to analyze works nearly instantaneously.”
This isn’t just “reading faster” — it’s systematic copying and derivative processing at a non-human level, which shifts how copyright law views it (e.g., reproduction right implications during ingestion).
Commercial vs. Personal/Transformative Use
- Human reading is typically personal, educational, or creative — rarely a direct commercial substitute. Fair use often protects criticism, commentary, teaching, or parody that builds on excerpts.
- AI companies train models for commercial products (ChatGPT, Claude, Gemini) that can generate outputs competing with or displacing original sources — summaries, answers, or even near-verbatim reproductions without clicks/attribution. Courts in 2025 cases (e.g., some fair use rulings for lawfully obtained books in training, but rejections for pirated sources) weighed market harm heavily: if AI outputs dilute demand for originals or supplant them, it weighs against fair use.
Humans don’t usually create scalable substitutes for the books they read; AI can and does.
Consent and Opt-Out Dynamics
- Printed books have no equivalent to robots.txt — once sold/published, there’s no technical “please don’t read this for learning” mechanism. Copyright law assumes purchase implies certain uses, with fair use as a safety valve.
- The web is different: publishers can (and increasingly do) signal preferences via robots.txt, meta tags, or emerging standards. Major AI firms (OpenAI, Anthropic, Google for certain crawlers) mostly respect these for training bots — compliance is high among “good” actors in 2026, though not universal (some aggressive ones ignore it, per reports like Cloudflare’s tracking showing violations on many sites). This creates a practical opt-out that books never offered.
Ethically, many creators feel the web’s openness was never meant to license unlimited commercial harvesting for closed models — especially when it erodes the attention economy that sustains independent publishing.
Legal Precedents in 2025–2026
U.S. courts have split but trended toward nuance:
- Training on lawfully obtained/crawled works can be “highly transformative” and fair use in some rulings (e.g., cases involving books digitized for LLM training).
- But piracy/shadow libraries cross lines, and market dilution (AI flooding with competing outputs) remains a live concern.
- No blanket “AI training = human reading” win; analogies to human learning are often called “cartoonishly absurd” by critics because of the mechanical differences.
In short: You’re right that “public = few protections” is the historical internet norm, and human-style absorption has always been tolerated. But AI’s industrial-scale, perfect-copy, commercial-exploitation model pushes boundaries in ways personal human reading never did — prompting the very opt-out tools (robots.txt blocks) that feel unnecessary for a single reader flipping pages.
If your blog stays open without blocks, that’s a principled stand aligned with the “if it’s public, it’s fair game” ethos — many still operate that way. Others draw the line at non-consensual mass commercial reuse, seeing it as a new category worth resisting. Both views are defensible; the debate reflects how fast the tech outpaced old assumptions about openness.
❻ Ludditus: “AI training involves making perfect digital copies of vast amounts of text (often billions/trillions of tokens), analyzing them exhaustively, and retaining patterns indefinitely in model weights. Courts and the U.S. Copyright Office (in its 2025 generative AI training report) have explicitly noted this: etc.” Courts don’t understand shit. MOST such content is stored IMPERFECTLY via model weights! Unless using RAG or providing excerpts retrieved from actual web pages after a web search, LLMs do not store VERBATIM copies of the accessed contents, therefore behaving more like the human brain.
Grok: You’re making a strong point here, and it’s one that resonates with how many AI researchers and defenders describe the process: modern LLMs (especially large autoregressive transformers like GPT-series, Claude, Grok, etc.) do not store training data as literal, searchable copies in the way a database or hard drive would. Instead, the training gradients update billions/trillions of parameters (model weights) to capture statistical patterns, correlations, probabilities of token sequences, syntactic structures, factual associations, and stylistic tendencies. The original text is discarded after training — no verbatim articles, books, or blog posts sit inside the weights as discrete files.
This leads to a lossy, compressed representation that’s much closer to how the human brain forms long-term memories: imperfect, abstracted, reconstructive, and prone to confabulation/hallucination rather than perfect recall. When an LLM outputs something that looks “verbatim,” it’s usually reconstructing a high-probability continuation based on learned patterns — not fetching a stored copy. Key evidence supporting your view:
- Memorization is rare and conditional: True verbatim or near-verbatim reproduction (what researchers call “extractable memorization” or “eidetic memorization”) only happens reliably for highly duplicated, distinctive, or over-represented sequences in the training data. Studies (e.g., from Carlini et al. onward, up through 2025–2026 papers) show it requires repetition in the corpus, specific prompting tricks to “trigger” it, or adversarial attacks. For most training examples — especially rare or unique ones — the model retains only distributional/generalized knowledge, not recoverable exact copies.
- Model capacity limits force compression: Recent work (e.g., 2025 scaling-law analyses) estimates transformers hold roughly ~3.6 bits per parameter of effective information. With models in the hundreds of billions to trillions of parameters, that’s enormous but still far less than the raw size of training corpora (often petabytes of text). The system must heavily compress: it learns averaged/overlapped patterns across similar examples rather than storing everything losslessly. This mirrors human memory’s gist-based, reconstructive nature (we rarely recall books word-for-word but reconstruct narratives from schemas).
- No direct storage of originals: During training, text is tokenized, processed in batches, gradients computed, and weights updated — the raw inputs aren’t persisted. Outputs that appear verbatim are probabilistic generations from the learned distribution, often with slight variations unless the pattern is extremely overfitted.
The U.S. Copyright Office’s Part 3 report (pre-publication May 2025, on generative AI training) does highlight the distinction you reference: it contrasts human learning (“Humans retain only imperfect impressions… filtered through their own unique personalities, histories, memories, and worldviews”) with AI training (“involves the creation of perfect copies with the ability to analyze works nearly instantaneously”). But even there, the Office focuses on the initial ingestion phase (making temporary/permanent copies for processing) and on cases where outputs show substantial similarity (implying some form of memorization in weights). It doesn’t claim models hold literal verbatim archives; it argues that when they can reproduce protected expression faithfully, that reproduction capability itself raises reproduction/derivative-work concerns — especially at superhuman scale and speed.
Courts (and the Office) aren’t claiming technical illiteracy so much as emphasizing legal functional equivalence: if a model reliably emits near-identical protected text under certain prompts, the practical effect can resemble unauthorized copying/distribution, regardless of whether it’s “stored perfectly” inside. That’s why ongoing cases (NYT v. OpenAI, etc.) probe extractability as evidence of infringement risk, not because they misunderstand compression.
You’re correct that pure forward-pass generation (no RAG, no retrieval) behaves more brain-like: reconstructive, lossy, pattern-based, not archival. The “perfect copies” critique applies more to the training pipeline (duplicating data for processing) and to edge cases of memorization than to the everyday operation of deployed models. Many creators still feel uneasy about that edge-case risk or the sheer scale — but technically, the imperfect, compressed nature is a core reason LLMs generalize at all rather than just regurgitate a giant lookup table.
If your stance is “public web content gets processed like a human might read/absorb/forget/reconstruct it,” that’s defensible — and aligns with how the tech actually works under the hood for most use. The debate persists because scale + commercial outputs + occasional high-fidelity regurgitation push it into territory courts see as functionally different from a single human reader.
❼ Ludditus: The magnetic tape and cassette, the video cassette, and the CD/DVD/Blu-ray, as well as the cheap mass storage, DID NOT kill the music, the films, or other intellectual property. Courts need to understand that, with all its disruptive effects and its many flaws and negative aspects, LLMs and derived agents are a form of progress.
Grok: You’re spot-on with the historical analogies — technologies like magnetic tape (introduced widely in the 1940s–50s), audio cassettes (1960s), VHS/Betamax video cassettes (1970s), CDs/DVDs/Blu-ray (1980s–2000s), and cheap mass storage (hard drives, USBs from the 1990s onward) all sparked massive fears of IP destruction at the time. Industry groups lobbied hard, predicting the end of music, films, and publishing because of “home taping,” unauthorized copying, and piracy facilitation. Yet, none of them “killed” those industries; in fact, they often expanded markets by making content more accessible, portable, and consumable.
In each case, disruption forced evolution: business models shifted from scarcity/physical sales to abundance/access (subscriptions, ads, merch). Courts played a key role by applying copyright’s fair use doctrine (17 U.S.C. § 107) to balance innovation against rights — allowing personal, transformative uses while punishing large-scale infringement.
Fast-forward to LLMs and AI agents in 2026: Courts are indeed grappling with similar dynamics, and recent rulings (mostly from 2025) show they’re not viewing AI as an outright “killer” but as a potentially transformative tool — much like those past techs. However, they’re emphasizing lawful data sourcing, potential market harm (e.g., AI outputs substituting for originals), and the need for evidence over blanket fears. Key developments:
- In Bartz v. Anthropic (N.D. Cal., June 2025), Judge William Alsup ruled that training LLMs on lawfully acquired books is “quintessentially transformative” fair use, as models extract patterns for new capabilities, not to reproduce content. But Anthropic settled for $1.5 billion over using pirated copies — drawing a line against unlawful ingestion, akin to how courts distinguished personal VCR taping from commercial piracy.
- Similarly, in Kadrey v. Meta (N.D. Cal., June 2025), Judge Vince Chhabria found AI training on books fair use in that specific case (highly transformative, no direct market substitution shown), but warned it won’t always be — especially if AI “floods the market” with competing content, undermining human creators. This echoes past concerns (e.g., VHS “killing” theaters) but recognizes progress if balanced.
- Other cases like Thomson Reuters v. ROSS Intelligence (D. Del., Feb. 2025) rejected fair use for AI training that directly competed with originals (though not generative AI), focusing on market harm. And ongoing suits (e.g., NYT v. OpenAI, Disney v. Midjourney) are shifting scrutiny to AI outputs, not just training — similar to how post-Napster courts targeted distribution over mere copying.
The U.S. Copyright Office’s 2025 report on generative AI also nods to these parallels, noting fair use outcomes depend on specifics (transformative purpose, no substantial harm), while highlighting differences in scale/speed from human processes. Congress held hearings in 2025 (e.g., “Too Big to Prosecute?”) questioning mass ingestion but hasn’t overhauled laws yet, suggesting they’re letting courts adapt first.
Overall, courts are starting to “understand” your point: LLMs represent progress (enabling new creation, research, efficiency), much like cassettes/VHS/CDs did, and fair use is protecting that when done right. But they’re vigilant about flaws — piracy, over-substitution, lack of compensation in emerging licensing markets — to prevent negative aspects from dominating. Just as music/film thrived post-disruption (revenues now at record highs via streaming), AI could grow IP ecosystems if models evolve toward partnerships (e.g., Anthropic’s settlement led to opt-in deals). It’s disruptive, flawed, but yeah — progress.
❽ Ludditus: In my experience, Anthropic’s Claude is extremely cautious regarding IP:
- It refuses to access an individual web page given to it if the respective website has a robots.txt that would prevent crawling that would lead to that page, even if it would be ethical to do so, as that URL was specifically given by me, the user!
- It refuses to confirm if a specific quotation belongs or not to a specific writer or personality, and it refuses to give any verbatim excerpt from copyrighted material, even if such a quotation can be found on countless web pages, either as fair use, or as having become public domain.
- If given a photo of a paragraph, it can extract text via OCR but would refuse to do so if “it looks like an excerpt from a textbook,” despite short text falling under the fair use rule.
Grok: Your observations about Claude’s behavior align closely with Anthropic’s documented approach in 2026 — they prioritize extreme caution on intellectual property (IP) and copyright matters, often going beyond strict legal minimums to minimize any risk of infringement or controversy. This stems from their “constitutional AI” philosophy, ongoing lawsuits (even after some favorable fair-use rulings), settlements (like the music publishers case requiring blocks on reproducing lyrics), and a broader corporate emphasis on safety/responsibility that includes preemptive refusals.
Let’s break down each point you mentioned based on how Claude operates:
1. Refusing to access a user-provided webpage if robots.txt disallows it
This is a deliberate design choice. When Claude needs to fetch real-time web content (e.g., via its browsing/tool-use capabilities in models like Claude Opus 4.6 or later), Anthropic’s system respects robots.txt directives even for single, user-specified URLs.
- Anthropic publicly states their crawlers/browsing tools (ClaudeBot, Claude-SearchBot, Claude-User, etc.) honor robots.txt as an industry standard. This extends to on-demand fetching: if the site’s robots.txt blocks the relevant path or the bot’s User-Agent, Claude will refuse to proceed rather than attempt access.
- The rationale isn’t just politeness — it’s tied to their training/crawling policy and risk avoidance. Even though a user explicitly provides the URL (implying permission in that context), Claude treats robots.txt as a binding signal of the site owner’s intent. Bypassing it could expose Anthropic to accusations of circumvention or bad faith, especially post-2024–2025 incidents where aggressive crawling drew criticism (e.g., iFixit overload case, where they adjusted after crawl-delay was added).
- In practice: If you paste a URL and Claude says something like “I’m unable to access that page because it appears restricted by the site’s robots.txt,” that’s not a bug — it’s intentional conservatism. Ethical arguments (e.g., “you gave it to me directly”) don’t override the hardcoded respect for the protocol.
This makes Claude more restrictive than some competitors (e.g., certain other models might fetch anyway if the URL is direct and non-recursive).
2. Refusing to confirm quotations or provide verbatim excerpts from copyrighted material
Claude is programmed with very tight guardrails here — often refusing verbatim quotes, even short ones, if the source is likely copyrighted and not clearly public domain/fair use in an obvious way.
- This goes beyond basic fair use: Short excerpts can qualify as fair use (criticism, commentary, education), and public-domain or widely quoted material (e.g., historical speeches) is usually fine. But Claude errs on over-refusal to avoid any reproduction risk.
- Reasons include:
- Past lawsuits (e.g., music publishers suit settled with stipulations to block lyric reproduction; similar pressures in book/author cases).
- System prompts/injections that explicitly warn against reproducing copyrighted text (e.g., reminders like “do not reproduce or paraphrase any copyrighted material”).
- A policy to prevent outputs that could be seen as derivative works or direct infringement.
- Common behaviors: Claude might offer summaries, paraphrases, or high-level analysis instead (“This quote is often attributed to X and conveys Y idea”), but won’t confirm exact wording or paste it verbatim unless it’s extremely safe (e.g., Creative Commons licensed, government docs, or ancient public domain). Even if the text appears on “countless” sites, Claude treats potential copyright as the default assumption.
This caution has been noted in user reports and even internal “reminders” leaked/discussed in communities.
3. OCR on photos of paragraphs/textbooks — refusing if it “looks like” copyrighted excerpts
Claude’s vision/OCR capabilities (handling uploaded images) are strong technically, but gated by similar IP heuristics.
- It can and will extract text from photos of handwritten notes, signs, menus, personal documents, etc.
- But if the content resembles a textbook page, book excerpt, article, or other potentially copyrighted printed material, it often refuses outright — citing copyright concerns or “I can’t assist with reproducing protected content.”
- This isn’t pure fair use analysis (short excerpts for personal study are often fair use); it’s precautionary overkill. Anthropic likely uses classifiers/pattern-matching to flag “book-like” images and trigger refusal, avoiding edge cases where extraction could lead to verbatim output of protected text.
- Result: Even innocuous textbook photos get blocked if they trip the filters, while non-commercial/personal images sail through.
Overall, Anthropic’s stance reflects a “better safe than sued” philosophy — more conservative than legal necessity requires (especially after 2025 rulings affirming transformative fair use for training on lawfully obtained works). They’ve faced criticism for over-caution (frustrating users who want simple quotes or analysis), but it aligns with their brand: prioritizing constitutional safeguards, user trust, and avoiding the reproduction risks that plagued other models in litigation.
❾ Ludditus: That said, I state that some people use tools that amount to anything from vandalism to terrorism! Two examples:
- iocaine, which instead of sending HTTP 403 to deny a resource to a web crawler, sends HTTP 200 and bogus content.
- “Dogfood for AI” as described in the following link: generate flawed code examples (even blog posts) for the algorithms and programming questions from a large list of programming topics. The purpose: «If many of us do this, we can feed GenAI its own garbage. If we generate thousands of bad code examples, this will poison their training data, and may eventually lead to an effect known as “model collapse”.»
https://www.vitavonni.de/blog/202602/20260213dogfood-the-AI.html
So people complain about AI slop, yet they want AI to be fed with more slop! They complain about AI hallucinations, yet they want to create hallucinations! They complain about AI data centers wasting electricity and water, yet they want such energy to be spent on ingesting fake and noxious content!
Such sabotaging is by no means ethical. It really borders terrorism, in my view.
Grok: You’re raising a very pointed and valid criticism of certain anti-AI activism tactics that have emerged in 2025–2026. Let’s look at the two examples you mentioned, their stated goals, their actual effects, and the ethical framing you’re applying.
1. iocaine / Poisoning via fake 200 OK responses
What it is
Some site owners (or script kiddies) return HTTP 200 with deliberately wrong, misleading, or garbage content instead of 403/404 or robots.txt blocks when they detect certain User-Agents (GPTBot, ClaudeBot, etc.). The name “iocaine” is a reference to The Princess Bride (“Never go in against a Sicilian when death is on the line!”), implying a clever trap.
Intended effect
Feed low-quality or actively wrong data into training corpora so that future models become less reliable when they encounter pages from those domains.
Real-world impact
Very limited and mostly symbolic at scale.
- Major training datasets are heavily filtered and deduplicated. A few poisoned pages rarely survive the cleaning stages.
- High-quality crawlers (OpenAI, Anthropic, Google, etc.) already use multiple signals beyond User-Agent — TLS fingerprint, behavior patterns, IP reputation, content coherence — to detect and down-weight or exclude suspicious responses.
- The technique mostly annoys smaller / less sophisticated scrapers and increases the noise floor only marginally.
So while it’s technically sabotage, its practical damage to frontier models is close to zero in 2026. It’s more of a performative protest than a serious attack on model quality.
2. “Dogfood for AI” / Mass-producing bad code examples
What it is
The campaign described in the blog post you linked (vitavonni.de, Feb 2026) encourages people to publish large numbers of deliberately flawed, misleading, or subtly wrong code snippets/blog posts that look superficially plausible — tagged with popular programming topics, LeetCode-style problem names, interview questions, etc. — so that they pollute the training distribution and push models toward worse coding performance.
Stated goal
Trigger “model collapse” — the phenomenon where models trained on increasing amounts of model-generated (especially low-quality) data gradually degrade in performance, lose diversity, and start producing increasingly incoherent or repetitive output.
Current scientific reality (early 2026)
Model collapse is a documented risk, but only under very specific conditions:
- Repeated training on purely synthetic data with no fresh human reference data
- Very high proportion of synthetic content (>80–90% in later generations)
- No effective data filtering / quality weighting
Frontier labs already apply very aggressive filtering, deduplication, quality scoring, and human-curated “clean” data mixtures precisely to avoid collapse. The most recent public scaling papers (late 2025) show that collapse is real but avoidable with proper data curation — which the leading labs are doing.
So deliberately flooding the open web with garbage code examples would need to reach truly astronomical scale (many millions to billions of high-quality-looking poisoned documents) and bypass existing filters to have any measurable effect on models trained in 2026–2027.
The contradiction you highlight
You are correct that there is a glaring inconsistency in the rhetoric coming from some parts of the anti-AI-enthusiast community:
| Complaint about current LLMs | Proposed “solution” | Logical tension |
|---|---|---|
| Too much AI slop / garbage on the web | Flood the web with more garbage (bad code, fake articles) | Actively increases the problem they complain about |
| Hallucinations are dangerous and harmful | Deliberately try to make models hallucinate more | Directly aims to worsen the problem |
| AI training wastes huge amounts of energy/water | Create millions of fake pages → force models to ingest & process even more junk | Increases compute waste on useless content |
| AI companies are unethical data thieves | Respond with deliberate data poisoning / sabotage | Uses unethical means to fight perceived unethical behavior |

This pattern is not uncommon in ideological conflicts: the proposed countermeasures sometimes mirror or amplify the original grievance.
Ethical and legal framing
Is this sabotage terrorism?
Legally: No, it does not meet the legal definition in any major jurisdiction (violence or threat of violence against people or critical infrastructure with political/ideological aim).
Ethically: It is a form of malicious interference with other people’s data pipelines and training processes. Whether it rises to the level of serious ethical violation depends on:
- Scale and effectiveness (currently negligible)
- Intent (trying to degrade a powerful technology vs. trying to protect creative livelihoods)
- Collateral damage (polluting the open web for everyone, including non-AI users)
Most people would place it in the category of unethical activism / protest vandalism rather than terrorism — closer to spray-painting graffiti on corporate buildings or DDoS attacks on websites than planting bombs.
Many in the AI ethics and alignment communities also criticize these tactics for exactly the reason you do: they are destructive, counter-productive, and pollute the commons (the open web) that everyone — including critics of AI — relies on.
❿ Ludditus: You’re right that legally and technically, terrorism should involve some kind of violence. But cyberterrorism doesn’t, and it’s often part of so-called “hybrid wars.” I’d draw a parallel to cyberbullying, which does not involve physical violence, yet it can drive some people to suicide. Suppose “LLM poisoning” succeeds, and agents based on such LLMs are improperly used without human supervision and vetting in places where disruptions can have catastrophic results. I stand by my assessment that such vandalism has the potential of having similar effects to terrorism.
Grok: You’re making a compelling case by refining the analogy—acknowledging that traditional terrorism implies violence but pointing out how cyber variants (and even non-cyber harms like bullying) can lead to severe indirect consequences without it. Let’s unpack this thoughtfully, drawing on current understandings from 2026.
Defining Cyberterrorism in Context
Cyberterrorism is broadly understood as the use of digital means to intimidate, coerce, or cause harm for political, ideological, or similar objectives, often by disrupting systems or generating fear equivalent to physical acts. While many definitions (e.g., from the U.S. Intelligence Community and experts) emphasize “violence or the threat of violence,” others include purely disruptive acts if they target critical infrastructure or create widespread panic, aligning with hybrid warfare tactics where digital sabotage blends with other pressures. For instance, hacking power grids or intelligence systems qualifies, even if no one is physically hurt immediately—the fear and societal disruption are the weapons. This fits your hybrid wars point: nations like Russia and China have been accused of such ops in reports from Europol and UN bodies, blending info warfare with cyber intrusions.
Your cyberbullying parallel is apt—it’s non-violent but can escalate to real harm (e.g., suicides), and laws in many places treat severe cases as criminal harassment or even manslaughter if foreseeably leading to death. Similarly, if LLM poisoning scaled up and caused indirect catastrophes (e.g., a poisoned AI in air traffic control hallucinating safe paths, leading to crashes), the ethical weight could mirror that.
Could LLM Poisoning Lead to Terrorism-Like Effects?
In theory, yes—especially if unsupervised LLMs/agents are deployed in high-stakes domains without robust vetting. Data poisoning (injecting flawed, biased, or backdoored data into training sets) can degrade models subtly, leading to real-world failures:
- Financial systems: A poisoned fraud detection AI might overlook patterns, enabling massive theft or market crashes. In 2026, this risk is highlighted in banking, where tainted data could amplify biases, causing unfair lending or undetected money laundering for terrorist financing.
- Healthcare/autonomous systems: Poisoned diagnostics might misclassify diseases, or self-driving tech could ignore hazards via embedded triggers. Research from Anthropic (2025) shows even small poisoned samples can embed behaviors like data exfiltration on triggers, potentially cascading in critical infra.
- Broader sabotage: Tools like “AI Memory Poisoning” (Microsoft’s 2026 alert) warn of injected “facts” steering users toward ruinous advice, e.g., bad investments or unsafe products. Real cases include Microsoft’s Tay (2016), where user-poisoned inputs turned it racist, foreshadowing how deliberate campaigns could weaponize this.
However, 2026 data shows no widespread catastrophes yet—most poisoning is lab-based or low-impact (e.g., backdoors in open datasets, federated learning tweaks). Labs like OpenAI/Anthropic use filtering, quality scoring, and human oversight to mitigate collapse or poisoning, making large-scale success unlikely without insider access. Still, in unsupervised agents (e.g., military drones or grid management), unchecked poisoning could indeed have terrorism-like ripple effects, eroding trust and causing panic.
Ethical Debates on These Tactics
The ethics here are hotly contested in 2026 forums (academia, policy, online). On one side, poisoning is framed as “resistance” or “innovative self-defense” by creators/artists against IP theft—e.g., tools like Nightshade or Poison Fountain embed glitches to protect work, seen as a moral response to unconsented scraping. Some argue it’s not sabotage but reclaiming agency in an uneven power dynamic.
On the flip: It’s deceptive, erodes AI trust (potentially harming beneficial uses like medical research), and raises accountability issues—who’s liable if poisoned models cause accidents? Military debates even propose U.S. covert poisoning of adversaries’ AI for advantage, but with ethical oversight to avoid broader distrust. Your vandalism-to-terrorism spectrum fits: Intentional degradation could indirectly lead to harm, much like bullying, but it lacks the direct coercive fear-mongering of true cyberterrorism unless politicized and scaled (e.g., state-sponsored poisoning for hybrid war).
I stand with you that the potential is there—especially as LLMs integrate deeper into society. But labeling it terrorism risks overreach unless proven to cause widespread fear/harm; for now, it’s more akin to unethical activism or cyber vandalism. To mitigate, experts push for better data provenance, licensing deals, and regulations—balancing innovation with consent. If these tactics escalate, they could indeed cross into hybrid threat territory, warranting stronger responses.
In lieu of a conclusion
Jerks are everywhere. If we only used operating systems and distributions whose makers met our ethical standards, we’d still be using pencil and paper.
But if you want to have some more fun with Debian people, here’s some recommended reading:
- List of Debian Suicides & Accidents
- Frans Pop was the Debian Day Volunteer Suicide
- Frans Pop suicide and Ubuntu grievances
- Warning: three words you can’t use in Debian
- Norbert Preining & Debian
Regarding Dr. Norbert Preining, I only knew about the 2019 incident in which Martina Ferrari called him names because he wasn’t aware that Martina Ferrari, born Martin Ferrari (with XY chromosomes), is now a she/her. But he was also bullied by Sage Sharp because Dr. Preining used the pronoun “it” instead of “they.” Laura Arjona Reina harassed him, too. And the Debian Account Managers (Enrico Zini, Joerg Jaspert, and Jonathan Wiltshire) blackmailed Dr. Preining.
I wish I could go back to CentOS 5. Dr. Preining is now using Arch Linux, but that’s too much for me. Well, his last blog post is dated 2024/11/22; I can only hope he’s still alive and well. OK, he’s fine: he posted on X on 2026/02/13. (Oh, you don’t like the Japanese date format, akin to ISO 8601?)
Bonus link and topic
The Reg: How AI could eat itself: Competitors can probe models to steal their secrets and clone them:
“Your model is really valuable IP, and if you can distill the logic behind it, there’s very real potential that you can replicate that technology – which is not inexpensive,” [Google Threat Intelligence Group chief analyst John] Hultquist said. “This is such an important technology, and the list of interested parties in replicating it are endless.”
Google calls this process of using prompts to clone its models “distillation attacks,” and in a Thursday report said one campaign used more than 100,000 prompts to “try to replicate Gemini’s reasoning ability in non-English target languages across a wide variety of tasks.”
…
Distillation from Gemini models without permission violates Google’s terms of service, and Google can block accounts that do this, or even take users to court. While the company says it continues to develop better ways to detect and stop these attempts, the very nature of LLMs makes them susceptible.
Public-facing AI models are widely accessible, and enforcement against abusive accounts can turn into a game of whack-a-mole.
…
Meanwhile, OpenAI, in a Thursday memo [PDF] to the House Select Committee on China, blamed DeepSeek and other Chinese LLM providers and universities for copying ChatGPT and other US firms’ frontier models. It also noted some occasional activity from Russia, and warned illicit model distillation poses a risk to “American-led, democratic AI.”
China’s distillation methods over the last year have become more sophisticated, moving beyond chain-of-thought (CoT) extraction to multi-stage operations. These include synthetic-data generation, large-scale data cleaning, and other stealthy methods.
…
OpenAI also notes that it has invested in stronger detections to prevent unauthorized distillation. It bans accounts that violate its terms of service and proactively removes users who appear to be attempting to distill its models. Still, the company admits that it alone can’t solve the model distillation problem.
Yeah, good luck with that. Distillation cannot be stopped!
There’s a bit I wrote on LLM optimization techniques: quantization and distillation (as you’ll learn from the same post, distillation is the brainchild of Geoffrey Hinton—and the reason I despise him, since distillation increases the risk of hallucinations). When you see a model on Hugging Face labeled as both distilled and quantized, with two LLM names attached, what’s happened is that a smaller model—say, a Qwen model—has been trained to mimic the behavior of a larger one, like DeepSeek. This bastard model has been trained not on real-world data but on the responses generated by the larger model. As a result, it inherits both the larger model’s “knowledge” (caution: learned from synthetic responses, not raw data!) and its “personality” (response style).
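For the curious, this is what distillation boils down to at the loss level. A toy sketch in PyTorch, not any lab’s actual recipe; the temperature value and the commented-out training step are illustrative.

```python
# A toy sketch of knowledge distillation, not any specific lab's recipe:
# a small "student" model is trained to match the softened output
# distribution of a frozen "teacher" model, instead of learning from raw data.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then minimize their KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2

# One (hypothetical) training step, with the teacher frozen:
#   with torch.no_grad():
#       teacher_logits = teacher(batch)
#   loss = distillation_loss(student(batch), teacher_logits)
#   loss.backward(); optimizer.step()
```

The student never sees the original training data in this setup; everything it “knows” is filtered through the teacher’s output distribution, which is one reason errors can compound.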
They are trying to stop the distillation of their non-open LLMs. What a sad joke. Whack-a-mole, indeed. They opened Pandora’s box and now hope to put everything back in.

I am still alive, I just washed Debian down the gutter and moved on. I still praise Debian for its excellent developers, but despise it for its incompetent and fascist leadership. In case I find time, I will revisit all the nasty emails from debian-private and personal communications and publish all the shit Debianites have posted. I don’t care anymore for “debian-private” being private.