Azure AI: I’m too old for this shit, but Whisper works—locally
I’m not a fan of the “everything in the Cloud” dogma. I prefer local apps; otherwise, why would we need anything other than Chromebooks, if everything is hosted on Azure, AWS, or Google Cloud, with JS frontends in a browser?
This doesn’t apply to the big chatbots, which require such intensive resources (and I refuse to own a “decent” graphics card) that running them locally is not feasible. But I hate it when everybody hosts something “in the Cloud” just because they can. Because that’s how you do things today.
Given that the world decided otherwise over the last ~15 years, I’ve tried to get a grasp of Azure more than once. Each time, I thought I’d have a stroke. The architectural complexity of their shit is abysmal! How can so many people use Azure and still maintain a semblance of sanity?!
It reminds me of another aspect of modern software development that’s typical of the last 20 years: you can’t just build an app and test for bugs. No, you first have to deal with missing dependencies, wrong paths, and a generally fucked-up development environment. Gone are the days when you installed an IDE, created a new project, wrote code, compiled and linked it, and it just worked!
But Azure seems to be the meta-meta-meta version of this kind of shit. You need advanced training just to understand where to start and how to make the 2,000 clicks that would eventually allow you to do something useful. This is beyond insane.
The last time I tried to explore the Azure shit, I made sure not to subscribe to anything that would make me pay, so just the free tier. But Azure is not something where you can get a quick “Hello, World!” proof of concept. It’s modern shit, so it’s fucking complex. So I gave up. It’s an obstacle course (le parcours du combattant, as the French say).

Yesterday, while I was looking for an online voice recognition and transcription service that would accept a 60-minute MP3 and crunch it at no cost, I discovered that Azure AI Speech (formerly the Cognitive Services Speech service, now part of Azure AI Services) offers “5 audio hours free per month” in the free tier. And this can be used in the Azure Speech Playground, “typically accessed via Azure AI Studio or Speech Studio.” Azure this, Azure that.
Speaking of Azure AI Services, I wanted to try the Azure AI Foundry, only to discover that Romanian is not a supported language:
- gpt-4o-mini: Supported languages: en, it, af, es, de, fr, id, ru, pl, uk, el, lv, zh, ar, tr, ja, sw, cy, ko, is, bn, ur, ne, th, pa, mr, te
- o1: Supported languages: en, it, af, es, de, fr, id, ru, pl, uk, el, lv, zh, ar, tr, ja, sw, cy, ko, is, bn, ur, ne, th, pa, mr, te
- gpt-4.5-preview: Supported languages: en, it, af, es, de, fr, id, ru, pl, uk, el, lv, zh, ar, tr, ja, sw, cy, ko, is, bn, ur, ne, th, pa, mr, te
WTF?! All these models support input and output in many more languages when accessed in an online chatbot!
So their offering is utterly pointless and useless.
But I reactivated my “subscription” (no payment, remember?) to be able to try Azure AI Speech, since its Real-time transcription feature promises “Live transcription capabilities on your own audio without writing any code.”
Wow, no code.
For fuck’s sake, it didn’t work! It failed to accept my MP3. Then a 16-bit PCM WAV, allegedly the standard format for Azure Speech Services, failed too!
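And the “no code” promise aside, the with-code route doesn’t look any friendlier. Here’s a rough sketch of the Speech-to-text REST call for short audio, assuming a key and region from the free tier (westeurope, $SPEECH_KEY and input.wav are placeholders, and this endpoint only accepts clips of up to about 60 seconds anyway, so it would be useless for my 60-minute MP3):

# Speech-to-text REST API for short audio; ro-RO = Romanian
curl -X POST \
  "https://westeurope.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=ro-RO" \
  -H "Ocp-Apim-Subscription-Key: $SPEECH_KEY" \
  -H "Content-Type: audio/wav; codecs=audio/pcm; samplerate=16000" \
  -H "Accept: application/json" \
  --data-binary @input.wav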

Is there anything by Microsoft that works? Oh, Copilot, right. But how are people using Azure AI Services if they fail to work more often than not?

So I decided to try Whisper. It just worked, with minor adjustments (a local Python virtual environment was needed):
python3 -m venv whisper_env
source whisper_env/bin/activate
pip install -U openai-whisper
pip install setuptools-rust
whisper input.mp3 --model medium
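One catch: openai-whisper decodes the MP3 through ffmpeg, so ffmpeg has to be installed system-wide first, e.g. on a Debian/Ubuntu-style distro (adjust for yours):

# Whisper shells out to ffmpeg for audio decoding
sudo apt install ffmpeg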

Now, of course, I have to correct the resulting output (generated as txt, srt, vtt, tsv, and json), because there are errors. But the result is surprisingly decent for a modest model running on a €400 laptop with Intel video!
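If you only need one of those formats and want to skip the language auto-detection, the CLI takes a few extra flags (at least in the version I installed; check whisper --help):

# assuming Romanian audio; writes only the .srt into ./transcripts
whisper input.mp3 --model medium --language Romanian --output_format srt --output_dir transcripts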
Fuck you, Azure AI shit!
Oh, and Whisper now has a serious competitor: Mistral’s new Voxtral.
Apparently, Whisper has the worst error rate of all the models compared on the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) benchmark!
Since the Voxtral family are speech understanding models, I don’t understand why anyone would want to use them as text-only models!
OK, so:
These idiots can’t even properly format the text on a fucking web page! (Bold was added by me.) OK, some links:
● Voxtral-Mini-3B-2507
● Voxtral-Small-24B-2507
As I only use simple laptops (and a mini-PC), all with Intel video, I can only use quantized models in GGUF format (the format used by llama.cpp and similar tools), and the only one currently on HF is this:
● Voxtral-3B-But-4B-Text-Only-GGUF
I can’t figure out what it could be used for! For fuck’s sake, it’s text-only! To use it as a dumbed-down Mistral Small 3.1?
OK, I tried it in GPT4All. It sucks. It’s small, so its limited understanding was something to be expected.
For the time being, I can’t see Voxtral as a practical transcription tool for everyone. Sigh.
Oh, but look how clueless the French idiots at Mistral are! They chose the name “Voxtral” (Vox + Mistral), but this name was already taken by the shady people behind Voxtral.org! Nobody knows who they are, but they do exist!
Anyway, Mistral is a huge French failure in IT. Their models were much better at the beginning; now they’re increasingly fucked up. Just a couple of hours ago, Mistral refused to answer a question of mine, both in standard and in Think mode, without giving any reason!
Whisper can also be used locally (with a small model) on Android and iOS through the app NotelyVoice.