The other day, I stumbled upon this piece of news: Google releases Gemini CLI with free Gemini 2.5 Pro. The real takeaway, of course, is that I should head to github.com/google-gemini/gemini-cli and see what I can find there.

  1. Small technical notes
  2. Some caveats and inconveniences
  3. A specific use case
  4. The Good, the Bad and the Ugly
  5. CLI take 1: One big file (just don’t!)
  6. CLI take 2: Midsized files (yay!)
  7. So, what do I think about Gemini CLI?
  8. How about translating by running quantized models locally?

Small technical notes

The official description is somewhat misleading: “Prerequisites: Ensure you have Node.js version 18 or higher installed.” On Ubuntu 24.04 LTS, whose repos ship Node.js 18, npm will complain. Nonetheless, it will work! This:

sudo apt update
sudo apt install nodejs npm
sudo npm install -g @google/gemini-cli

Leads to this:

npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE   package: '@google/genai@1.7.0',
npm WARN EBADENGINE   required: { node: '>=20.0.0' },
npm WARN EBADENGINE   current: { node: 'v18.19.1', npm: '9.2.0' }
npm WARN EBADENGINE }
npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE   package: 'undici@7.10.0',
npm WARN EBADENGINE   required: { node: '>=20.18.1' },
npm WARN EBADENGINE   current: { node: 'v18.19.1', npm: '9.2.0' }
npm WARN EBADENGINE }
npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE   package: 'eventsource-parser@3.0.3',
npm WARN EBADENGINE   required: { node: '>=20.0.0' },
npm WARN EBADENGINE   current: { node: 'v18.19.1', npm: '9.2.0' }
npm WARN EBADENGINE }

But it will go on regardless. Alternatively, you could ditch Ubuntu’s packages and go full commando with NodeSource’s Node.js 20:

sudo apt update && sudo apt install -y curl
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs
node -v
npm -v

To remove the NodeSource packages later and revert to Ubuntu’s:

sudo apt remove -y nodejs
sudo apt autoremove -y
sudo rm /etc/apt/sources.list.d/nodesource.list
sudo apt update

To use Gemini CLI with your Google account, a browser-based authentication flow will follow; Firefox worked just fine in my case. Once authenticated, ~/.config/gemini/oauth_creds.json will be created, and ~/.config/gemini/settings.json will include something like this (your theme will probably differ):

{
  "theme": "Ayu Light",
  "selectedAuthType": "oauth-personal"
}

To use Gemini CLI with an API key instead, head to aistudio.google.com/apikey to generate one. Add it to ~/.bashrc, then source ~/.bashrc if you plan to launch gemini right away from that terminal:

export GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY_HERE"

In this case, you’ll see a different authentication type in settings.json:

  "selectedAuthType": "gemini-api-key"

Some caveats and inconveniences

Authenticating with your Google account is the intended and most straightforward way to get gemini-2.5-pro by default.

If you only use an API key, you’ll get gemini-2.5-flash by default. There are two ways to change that: either run gemini --model gemini-2.5-pro or edit ~/.config/gemini/settings.json to include:

{
  "default_model": "gemini-2.5-pro"
}

Unfortunately, in my case, it briefly switched to gemini-2.5-pro, only to revert to gemini-2.5-flash almost instantly! But I was using the free tier of the API.

So using the API without paying restricted me to the Flash model, which was unacceptable!

A specific use case

When I learned about Gemini CLI, I wanted to perform a specific experiment, namely to translate a long document. Say, a novel. One could take a DRM-free ePub and convert the embedded XHTML files to text. Or one could save any document as plain text anyway, because translating a DOCX directly would be problematic, and a model normally produces plain text as its translation.
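For the ePub route, getting the text out is trivial, because an ePub is just a ZIP archive of XHTML files. A minimal sketch, using only the standard library (the book.epub name is obviously a placeholder, and real ePubs keep their chapters under various paths):

import re
import zipfile
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an XHTML chapter, dropping all markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)
    def text(self):
        # Collapse the excess blank lines left behind by the markup
        return re.sub(r"\n{3,}", "\n\n", "".join(self.chunks)).strip()

with zipfile.ZipFile("book.epub") as epub:
    for name in sorted(epub.namelist()):
        if name.endswith((".xhtml", ".html")):       # usually one chapter per file
            parser = TextExtractor()
            parser.feed(epub.read(name).decode("utf-8", errors="ignore"))
            out_name = name.rsplit("/", 1)[-1].rsplit(".", 1)[0] + ".txt"
            with open(out_name, "w", encoding="utf-8") as out:
                out.write(parser.text() + "\n")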

Whoever tried a web-based chatbot (and I tried ChatGPT, Gemini, Claude, Mistral, Copilot, Grok, Qwen, and DeepSeek) already knows that they would refuse to translate a document over a certain size. The one I wanted to test had 44k words and about 240k characters. So instead of translating it in chunks (e.g., translating individual chapters, which can be easily obtained when starting from an ePub, because they’re usually one chapter per XHTML file), I wanted to see if I could do one of the following by using Gemini CLI:

  1. Translate the entire document in one go.
  2. Automate the handling of the 10–30 files (1–3 chapters per file) with a Python script.

You might object, and you’d be right, that an LLM can’t translate a large document in one go because it would consider the entire document as “context,” and this would literally “suffocate” it. It indeed seems to be the case, but this reflects a stupid design of such LLMs. A smart LLM should automatically decide to translate one paragraph at a time, or it could ask me how much context to keep while translating. But an LLM can’t be smarter than the retards who designed it!

On the other hand, any kind of automation requires the use of the API. No automation is possible in the browser app, and Gemini CLI is, well, CLI-oriented (duh), thus focused on interactivity. Sure, by pressing the up arrow and editing a previous prompt, processing several similarly named files is quite easy, yet it’s still manual.

There might be a way to use Gemini CLI non-interactively, with a bit of scripting, but I didn’t explore it!
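(If you do want to explore it: the CLI appears to have a non-interactive --prompt/-p flag, and it accepts piped input, so a per-file loop along these lines might work. I haven’t tested it, so treat it as a sketch built on that assumption, not a recipe.)

import glob
import subprocess

PROMPT = ("Translate this text from English to Romanian, keeping the "
          "translation accurate but literary. Output only the translation.")

for path in sorted(glob.glob("chapters/*.txt")):        # illustrative layout
    with open(path, encoding="utf-8") as f:
        chapter = f.read()
    # Untested assumption: `gemini -p` runs one prompt non-interactively,
    # with the piped stdin taken as additional context
    result = subprocess.run(
        ["gemini", "-p", PROMPT],
        input=chapter, capture_output=True, text=True, check=True,
    )
    with open(path.replace(".txt", ".ro.txt"), "w", encoding="utf-8") as out:
        out.write(result.stdout)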

I was frustrated to see that I could only use Gemini Flash with my API key (free tier), because its translations are unsatisfactory. But, what the heck, why not give it a try?

I needed to give it a try to see whether setting the temperature would change the accuracy and consistency of a translation. There may be good reasons for a temperature of 0.7 or 1.0 when accessing a chatbot via a browser or an app, but should you want to change it there, it’s simply impossible! One has to access a remote model via an API key to customize it. So I wanted to see what comes out of it, even if the API-accessible model was gemini-2.5-flash (hell would freeze over before I’d pay to use something made by Google).

Now, for the reasons to adjust both temperature and Top P when the task is translation.

In Gemini’s own words:

The temperature parameter in LLMs controls the randomness or “creativity” of the generated output.

  • Low Temperature (close to 0): The model becomes more deterministic and predictable. It will prioritize the most statistically probable next tokens based on its training data and the given input. This leads to more focused, consistent, and “safe” outputs.
  • High Temperature (closer to 1.0 or higher): The model becomes more exploratory and diverse. It gives more weight to less probable tokens, which can lead to more creative, surprising, or even “hallucinated” (factually incorrect but plausible-sounding) outputs.

Why Zero for Translation:

  • Fidelity to Source: In translation, your primary goal is to accurately convey the meaning and intent of the source text in the target language. You want the most probable and direct translation, not a creative interpretation.
  • Reduced Hallucination: A temperature of 0 significantly reduces the chance of the model “hallucinating” or generating content that deviates from the source text. Higher temperatures can introduce errors, omissions, or even add information that wasn’t in the original.
  • Consistency: When translating a long document like a book, you want consistency in terminology, tone, and style across chapters. A low temperature helps ensure that the model consistently applies its translation knowledge, rather than introducing variations due to randomness.

If you find the translations at temperature 0 to be too literal or stiff, you might slightly increase it (e.g., to 0.1 or 0.2) to allow for a very subtle amount of linguistic flexibility, but for pure accuracy, stick to 0.

For accurate translation, just like with temperature, you should generally set Top P (Nucleus Sampling) to a high value, typically 1.0, or close to it (e.g., 0.9 or 0.95). Top P controls the size of the “nucleus” of tokens from which the model samples its next word. It considers the smallest set of tokens whose cumulative probability is greater than the specified P value.

  • High Top P (e.g., 1.0): The model considers a broader range of possible next tokens, including those that are less probable but still linguistically valid and relevant. This gives the model more options to find the most appropriate word for a nuanced translation.
  • Low Top P (e.g., 0.1): The model restricts itself to only the very most probable tokens. While this might seem good for accuracy, it can lead to very constrained, repetitive, or even grammatically awkward translations because it limits the model’s ability to choose from a wider, more natural vocabulary. It can make the translation sound “robotic” or miss subtle meanings.

Why High Top P for Translation:

  • Vocabulary Richness: Human languages are incredibly rich and diverse. A high Top P allows the model to access a wider vocabulary and choose the most fitting word, even if it’s not the absolute most common one in a given context. This leads to more natural-sounding and contextually appropriate translations.
  • Nuance and Idioms: Translation often involves dealing with idioms, cultural nuances, and subtle shades of meaning. A model with a wider selection pool (higher Top P) is better equipped to handle these complexities and provide a more accurate rendition.
  • Avoiding Repetition: A very low Top P can sometimes force the model to repeat words or phrases because its options are too limited. A higher Top P helps prevent this by offering more diverse lexical choices, while preserving accuracy.

For accurate translation, the most common and effective strategy is to set temperature to 0 (or very close to it) to make the model deterministic, and then set Top P to 1.0 (or very close to it, like 0.95 or 0.99) to ensure the model has a full range of vocabulary choices within that deterministic framework. This combination aims for the broadest possible vocabulary selection for the most natural and accurate output.

At least, that’s how it’s supposed to be.
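To jump ahead a bit: with the Python SDK installed in the next section, that advice boils down to two fields in the generation config.

import google.generativeai as genai

# Deterministic decoding, but with the full vocabulary still available
translation_config = genai.GenerationConfig(temperature=0.0, top_p=1.0)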

The Good, the Bad and the Ugly

Maybe I was naive, but here’s what I did. For scripting while using the API key, I needed to install Google’s Generative AI SDK for Python:

python3 -m venv ~/gemini-env
source ~/gemini-env/bin/activate
pip install google-generativeai

I also set the key in .bashrc, set the model in ~/.config/gemini/settings.json (only to be ignored), but I couldn’t set any translation instructions in Gemini.md:

Gemini.md is only used by the CLI and has no effect on the Python API.

Then, in a terminal, I started python3 translate_chapters.py, which was taking its time. It did not use asyncio for parallel processing, so it processed one chapter at a time. (Note that model_name="gemini-2.5-pro" was also set in translate_chapters.py, but it was ignored as well.)
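For reference, the skeleton of such a script looks roughly like this; the paths and the prompt wording are illustrative, not the exact ones I used:

import glob
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel(
    model_name="gemini-2.5-pro",   # requested, but the free tier served Flash anyway
    generation_config=genai.GenerationConfig(temperature=0.0, top_p=1.0),
)

PROMPT = "Translate the following fiction text from English to Romanian:\n\n"

for path in sorted(glob.glob("chapters/*.txt")):
    with open(path, encoding="utf-8") as f:
        chapter = f.read()
    response = model.generate_content(PROMPT + chapter)   # one chapter at a time
    with open(path.replace(".txt", ".ro.txt"), "w", encoding="utf-8") as out:
        out.write(response.text)
    time.sleep(5)   # crude attempt to stay under the free tier's 15 requests/minute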

As it was crunching data (and Gemini, somewhere in the Cloud, was increasing Earth’s temperature), in a different terminal I started Gemini CLI with the same API key.

So, the sequential processing (synchronous execution) wasn’t per process, but per API key! Gemini CLI waited more than 8 minutes before answering, and when it did, my free API quota had already been used up!

WTF. The other WTF concerned the API usage limits, for which I had higher expectations:

Free tier limit (might change):

  • Requests per minute (RPM): 15
  • Tokens per minute (TPM): 1 million
  • Requests per day (RPD): 1,500
  • Concurrent Sessions: 3 per API key

In my case, it reached the ceiling at about 17.5k tokens and 10 requests!

Well, so the API is unusable unless you pay. Thanks, but no thanks.

But then, it worked. Meanwhile, I had deleted settings.json and removed the key from ~/.bashrc, then ran source ~/.bashrc, so I hoped that relaunching gemini would ask me to authenticate with a Google account. It didn’t! Somehow, it was still using the API key, but it worked! (That’s surreal. Or maybe not: re-sourcing ~/.bashrc doesn’t unset a variable that’s already exported in the running shell, so an unset GEMINI_API_KEY or a fresh terminal would have been needed.)

Here’s how it looks when authenticating with Google: Pro by default!

While still in this ghost API session (hence Flash), I tried a small chapter:

But I really wanted to give Gemini CLI a quick try. Gemini, which as a general rule is quite dumb when asked about itself, claimed that:

Gemini CLI often provides access to more generous rate limits (e.g., 60 requests per minute and 1,000 requests per day for Gemini Code Assist/Gemini CLI for individuals) compared to what you might experience with the free browser version, which can sometimes have unstated or lower daily limits.

The fuck it does. As I soon learned, the Pro tokens can expire rather quickly with large contexts.

In Gemini CLI, I ditched Gemini.md and used a simple prompt: “Translate this file from English to Romanian, keeping the translation accurate but literary. Try to detect the style of the fiction work, so that the translation matches the original in the way it conveys the message to the reader.” The file’s path can be introduced after @, which brings up a file browser.

I couldn’t find it anywhere in the documentation, but Gemini CLI can also save files! Just tell it, in the prompt, the name of the file you want its output saved to!

CLI take 1: One big file (just don’t!)

With my usual Google account, I initially wanted to use the set of 10 files that I had started to translate with translate_chapters.py. Each such file contained 2 to 4 chapters, with a “---” separator between the chapters. As it turned out, Gemini stopped the translation upon encountering a “---” line, so the file containing chapters 23 to 26 only got the first one translated (the obvious workaround is sketched at the end of this section):

So, pissed off, I decided to feed it an entire 240 KB TXT file!

🤖

Bad choice, as several things happened, both good and bad:

  • Even if my prompt was in English, some of its answers were in Romanian, as if it were confused as to how much and what should be translated.
  • It decided that for such a long text, saving to a file would be more appropriate. This is how I learned that it could do that!
  • The Pro tokens for my free account got used up during the translation. I had Gemini open in a browser, and it told me it had to switch to Flash and that I could switch back to Pro after about 70 minutes.
  • Meanwhile, Gemini CLI was counting the seconds without providing any feedback!
  • After about 90 minutes, I started scrolling in Gemini CLI, trying to see whether it was stuck or not. And, behold, it detected a “slow response” that made it switch to Flash despite Pro actually being available again! Unfuckingbelievable!

18 minutes later, the output still didn’t provide any relevant information about how far into the translation it was and how much was left, so I cancelled it. It was not what I intended anyway, as it insisted on using Flash when I was once again entitled to use Pro!

Fucking stupid Gemini CLI. It kept duplicating a message saying it would continue the translation and save the result in translated.txt, and that the process “can take a few moments.” Moments? Ages, maybe.
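In hindsight, the saner workaround for the “---” issue would have been to split the files on the separators before feeding them to Gemini, something along these lines (file names are illustrative):

import glob

for path in sorted(glob.glob("part*.txt")):
    with open(path, encoding="utf-8") as f:
        # A line consisting only of "---" separates the chapters within a file
        chapters = [c.strip() for c in f.read().split("\n---\n") if c.strip()]
    for i, chapter in enumerate(chapters, 1):
        with open(f"{path[:-4]}_ch{i}.txt", "w", encoding="utf-8") as out:
            out.write(chapter + "\n")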

CLI take 2: Midsized files (yay!)

The winning solution (as if there were something to win), or rather the decent solution, was to use midsized files. Not the entire 44k-word document, nor 4k-word chunks, but four files of 7 to 15 thousand words each! It just worked, with gemini-2.5-pro, without complaining and without “expiring”!
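Re-chunking the book into such midsized files is a few lines of Python; the ~11k-word target below is just an assumption based on what happened to work for me:

TARGET_WORDS = 11_000   # anything between 7k and 15k words per file worked

with open("book.txt", encoding="utf-8") as f:           # illustrative input name
    paragraphs = f.read().split("\n\n")

chunks, current, words = [], [], 0
for paragraph in paragraphs:
    current.append(paragraph)
    words += len(paragraph.split())
    if words >= TARGET_WORDS:           # cut only at paragraph boundaries
        chunks.append("\n\n".join(current))
        current, words = [], 0
if current:
    chunks.append("\n\n".join(current))

for i, chunk in enumerate(chunks, 1):
    with open(f"part{i:02d}.txt", "w", encoding="utf-8") as out:
        out.write(chunk + "\n")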

The funny thing is that just before that, I had updated Gemini CLI from 0.1.3 to 0.1.5 because it prompted me to. But upon relaunching, it still wanted to update… from 0.1.5 to 0.1.5! Quality software.

As you might have noticed, I didn’t give it any specific instructions regarding the translation at the prompt (nor did I use Gemini.md). Just “translate” and “save”…

“Commencing Translation Task” and “Commencing the Translation” aren’t that informative, and it took it more than 2 and a half minutes until the translation of the first paragraph started!

Why is it the “Initial Paragraph” instead of the “First Paragraph”?

Strange thing: despite being explicitly told to save to a specific file, it wanted me to confirm that I really wanted to “apply this change” without telling me what kind of “change” it was! Who was the fucking retard who designed this shitty behavior?!

I selected “Yes, allow always,” which was interpreted as “Now and in the future, fucking save to the files I fucking tell you to save to!”

Translating the other files was a breeze, and it took about 3 minutes per file, even for the last file that had slightly over 15k words! Somehow, without keeping the previous files as context, the translation of the “Initial Paragraph” of the 2nd, 3rd, and 4th files started much faster than for the 1st file! Or maybe it was simply due to lower server load.

Because of the slower start of the first file, the overall session took 16 minutes.

I’m not sure what the “wall” duration means (presumably wall-clock time, i.e., the total elapsed real time). It literally took 16 minutes to translate. I wasn’t using the API myself!

As for the quality of the translation… nope. At first sight, it’s perfectly intelligible and not incorrect, but it cannot be used as is. Translating fiction requires more than mechanically translating the text, even if the idioms are correctly identified. Gemini 2.5 Pro is not how you translate; at least, not into a minor language.

So, what do I think about Gemini CLI?

In very few words:

  • Gemini CLI is crap.
  • Using the API is a trap to make you pay. Don’t.
  • Still, Gemini CLI can be more practical than the use of Gemini in a browser, at least for some use cases.
  • But then, it’s crappily designed.
  • And yet, it works. Kinda. OK, it does work.
  • But it’s crap.
  • And Gemini as a model is not my first choice, regardless of the type of task.

How about translating by running quantized models locally?

I am the last one to favor automated translations for anything serious. I appreciate even the most broken ones when I try to understand what a page written in Chinese, Turkish, or Finnish is trying to say, because there’s no way I could understand it by myself. In languages I partially understand, the obvious mistranslations annoy me, but automated translations can help sometimes. Boy, but they suck!

After the Gemini CLI experiment described above, I tried to find a small model that I could run locally on my modest hardware. It had to be a quantized model able to run in as little as 8 GB of RAM (I only have 16 GB on one laptop and 8 GB on everything else, because I live frugally), and it had to support Romanian.

No fapping way!

There is a translation model for translating from English (en) to Romanian (ro), huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-ro. It’s based on Opus-MT, but I’m not sure why this en-ro has files from December 2019. Either way, there’s no pre-quantized model to download.

Then, Meta has this NLLB-200 thing: 200 languages within a single AI model: A breakthrough in high-quality machine translation. On Hugging Face, I found huggingface.co/facebook/nllb-200-3.3B, which bears this note:

The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, therefore translating longer sequences might result in quality degradation.

It seems to have a 4-bit quantization: Youseff1987/nllb-200-bnb-4bit.

Then, facebook/nllb-200-distilled-1.3B has two quantizations of possible interest: Emilio407/nllb-200-distilled-1.3B-8bit and Emilio407/nllb-200-distilled-1.3B-4bit.

The even smaller facebook/nllb-200-distilled-600M has its quantizations, too: Emilio407/nllb-200-distilled-600M-8bit and Emilio407/nllb-200-distilled-600M-4bit.

Except that I can’t download them, because:

  • I don’t want to use Python with transformers and shit.
  • I don’t want to use huggingface-cli.
  • I don’t want to manually download 5-7 files per model.
  • I want “plug-and-play” models, meaning that I want to be able to use LM Studio or GPT4All (both of which know how to download from Hugging Face), or Ollama (via Msty or AnythingLLM, which also know how to download for Ollama); but the models that interest me don’t show up in any of them!

They’re not compatible! Apparently, there are almost no translation models converted to GGUF.
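For the record, the way these NLLB checkpoints are meant to be used is precisely the Python-plus-transformers route I refuse. Per the model card, it’s roughly this (and note the 512-token ceiling, which rules out whole chapters anyway):

from transformers import pipeline

# Not my cup of tea, but this is the documented route for NLLB-200
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",           # FLORES-200 language codes
    tgt_lang="ron_Latn",
    max_length=400,                # the model was trained on sequences of at most 512 tokens
)

print(translator("The quick brown fox jumps over the lazy dog.")[0]["translation_text"])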

ChatGPT recommended three models that I could run with LM Studio, GPT4All, or Ollama: two Romanian-focused ones (a Mistral 7B and a Falcon2 8B) and the multilingual EuroLLM 9B.

If I’m able to allocate more than 8 GB of RAM to such a model (10–12 GB, which is borderline possible with 16 GB of RAM in Ubuntu MATE, although those 16 GB are really only 15.3 GB once part of the RAM is claimed as VRAM), a more precise quantization (Q8_0 is better than Q6_K) could be used, such as the Mistral 7B Q8_0 discussed below.

Sure enough, they show up in LM Studio:

The Romanian-focused quantized models (I suppose you can find similar models for many other languages) are a bit puzzling:

  • Mistral: 7.7 GB, Falcon2: 4.7 GB. So Falcon2 uses a more aggressive quantization (Q4_K_S), reducing both its size and its accuracy.
  • However, ChatGPT claims that Falcon2 “may be fine-tuned more thoroughly on Romanian translation, not just chat” (it doesn’t like Mistral, does it?).
  • “But in raw architecture terms, Mistral 7B (Q8_0) is likely superior to Falcon2 8B (Q4_K_S).”

As for EuroLLM (7.51 GB), it’s not necessarily too small for a multi-language model. It’s a 9B model, quantized to 6 bits (Q6_K). But it’s not meant for fiction: “Fine-tuned on ~30 European languages, with heavy EU translation corpora.” That means legal, governmental, policy, news, reports, and essays. It might sound robotic for fiction, and it might lack the understanding of colloquial language, slang, and metaphors.

Because I could, I thought I should. So I downloaded this specific EuroLLM in LM Studio.

There is a reason why I use the term “Eurocrap” so frequently: this model, at least in this quantization, is completely broken!

I attached the first quarter of the book that Gemini did translate, however flatly, and I asked it: “Can you translate this text in Romanian?” The retard: “No citations were found in the user’s files. I will translate the text into Romanian. Could you please provide me with the text that needs translation?” I must be having a bad dream! “It’s in the fucking attachment, you dumbo!” It then “retrieved 3 relevant citations” because there were 3 epigraphs at the beginning of the book, then it started to spit shit:

What. The. Fuck.

OK, let’s load the smaller Falcon2 thing. This time, I wanted to be more explicit: “Please translate the fiction text from the attached file into Romanian.”

This cannot be. There are 1,829,512 models on Hugging Face. How many of them are broken? The ones I tried back in February worked just fine, regardless of the unsatisfactory answers!

For fuck’s sake, there are so many broken people wasting their lives creating such humongous crap! “Oh, we’re so passionate about AI!” You fucking shitheads. Broken, stupid biological robots.