Don’t trust the magazines on choosing an OCR

This covidiocy is so alienating that I wasn’t in the mood of writing anything. Better late than never, a few notes on the free OCR software, simply because a stupid British magazine pissed me off.

There is this thing called Web User Magazine, belonging to Penis Publishing (that’s a cheap one, I know), and they recently included a short roundup on “Best free OCR software”:

That’s bollocks. There’s absolutely nothing worth mentioning in their selection. Guaranteed crap. But what OCR should one choose instead?

In the past, I used to swear by ABBYY FineReader, but they didn’t seem to be able to progress: in all the versions I tried (from 9.0 to 15.0), no matter I was selecting the proper language dictionary, all the Romanian words that started with “Î” (at the beginning of a phrase) were identified as starting with “î”. Occasionally, there were other misidentifications that I judged unacceptable for a commercial solution that included a dictionary for the recognized language, and that was fed with good quality images. So I dropped ABBY once and for all.

To the point: what you need to use is tesseract, still the best OCR there is, free, open-source, and able to use dictionaries for most languages. Nothing is perfect on Earth, but for OCR you don’t need to look farther than that. (Alpha builds of version 5.0 for Windows are provided by Universitätsbibliothek Mannheim; make sure you install the additional language data!)

Well, you still have though: unless you love to use the command line for everything, you should be looking for a GUI front-end.

■For Android, there’s Text Fairy, an app that uses Tesseract and works fabulously well (just make sure you select the correct language). And it too is open-source.

■For Windows and Linux, the GUI of choice is gImageReader. It provides Qt-based binaries for Windows (e.g. gImageReader_3.3.1_qt5_x86_64.exe), and mentions that builds exist at least for Debian, Fedora, Ubuntu (PPA), OpenSUSE, Arch Linux. Generally, the Linux GUI is using Gtk+, and it’s called “gimagereader” (very rarely “gimagereader-gtk”); the Qt-based “gimagereader-qt” only exists for Fedora, ALT and Arch.

Using the Windows version is simple, except for an initial hiccup that should happen on most systems:

It tries to download Tesseract’s language data in a folder in Program Files, and no non-elevated program has this right–it’s like this since Vista. So we need to change the language data location from “System-wide paths” to “User paths”:

After downloading the languages for the text to be recognized, everything should work rather smoothly:

There’s a small annoyance for Romanian though: the recognition uses “ţ” (U+0163, t with cedilla), whereas the dictionary-based spelling checker checks against the correct Romanian character “ț” (U+021B, t with comma below). The same for “ş” (U+015F, s with cedilla) instead of the correct “ș” (U+0219, s with comma below). This used to be a problem with Microsoft’s fonts and keyboard layouts prior to XP SP3, but it seems to have a second life even in the Open Source world.

As a matter of fact, the so-called “Romanian (Germany)” layout, which adds the Romanian characters to a German keyboard is using the wrong characters (i.e. with cedilla) even in the last version of xkeyboard-config. Here, in the symbols/de file:

Most Romanians are idiots who couldn’t care less about their language: they never cared to use the proper characters while using a computer; I believe they’re the only country on Earth where the physical keyboards that can be purchased don’t have the specific characters (șțîă) painted on them, but they’re actually either US International or UK keyboards! Try to do this in Poland (ąćęłńóśźż) or in Hungary (áéíóöőúüű)!

In Linux, one should install gimagereader from the official repositories of their distro, and the required dependencies should be retrieved too. Attention though! Tesseract’s language files are not automatically added as dependencies, so one should either install tesseract-ocr-all, or pick only the needed languages (e.g. tesseract-ocr-deu, tesseract-ocr-fra, tesseract-ocr-ron). Also, the corresponding spelling languages are needed, e.g. aspell-de, aspell-fr, aspell-ro.

One issue in Linux is that the languages cannot be managed from within GImageReader, but only by installing or removing the above mentioned language packages. If no tesseract-ocr-* package is found, a confusing error is displayed:

This is obviously crap, but it still confuses some people.

Other than that, it should work just fine once properly installed, should we ignore the “ţ/ț” and “ş/ș” annoyance:

The skinning depends on the distro and of the desktop environment (and, as you noticed, there’s no caption bar):

Under Linux one could also use an older program, gocr (with gocr-tk for a GUI), which in turn uses cuneiform as OCR, but I’d advise against such a choice. It’s mostly abandoned. Some other tools of interest under Linux:

■pdfsandwich, a tool that generates “sandwich” OCR PDF files, i.e. PDF files that contain only images (no text) will be processed by OCR and the text will be added to each page invisibly “behind” the images. It uses tesseract.

■k2pdfopt, a PDF reflowing tool that optimizes multi-column PDF/DJVU files for mobile e-readers and smartphones. It can re-flow text even on scanned PDF files, and it too can create “sandwich” OCR files. Well, there’s also a Windows version of it, but I didn’t try it.

On the other hand, I never managed to get Lios (“Linux intelligent OCR solution”) do something. Anything.

I’ll end with a traditional meme for the French readers 🙂

Béranger on Microsoft + CloudStrike = DeathDave Plummer, former software developer at Microsoft between 1993 and 2003, had some explanations on his YouTube channel. I didn't
Béranger on Is openSUSE at crossroads?Here's why you shouldn't be using Zypper and YaST Software, which means you shouldn't use openSUSE at all.
Béranger on Microsoft + CloudStrike = DeathLet's correct myself, while also explaining what some journos couldn't, because they're retarded. From Why Apple doesn't suffer outages like
Béranger on The dead and the reborn BSDs — now updated!There's no chance for me to use FreeBSD on this laptop (maybe on the old one from 2016): its MT7663
Béranger on Microsoft + CloudStrike = DeathThis is unbelievable! European Commission denies responsibility for massive Microsoft IT outage: A Microsoft spokesperson yesterday suggested to US media
HAL on The dead and the reborn BSDs — now updated!I currently have Ventoy on an 8 GB Integral USB 3.0 stick. On it I have openSUSE Leap 15.6 KDE
Béranger on The dead and the reborn BSDs — now updated!BS. Ignore Richard Brown. Ventoy always played well with openSUSE Tumbleweed. If there's a build that doesn't, then it's on
zugu on The dead and the reborn BSDs — now updated!Nope, it was a brand-new SanDisk stick, these are usually the most compatible kind of USB drives. According to openSUSE
Béranger on Is openSUSE at crossroads?OK, so openSUSE Factory is more like Fedora Rawhide, but I don't understand your claims. After openSUSE Factory, a package
Béranger on Is openSUSE at crossroads?SSO or not, there are people who have used RHEL7 with KDE. It's not about liking it or not liking
Béranger on The shortest distro bashing and appraisal in a long while—now with added Dedoimedo!KDE3? What a great thing that you skipped KDE4! But KDE5 is as usable as KDE6, if not even more.
Béranger on SJVN made my day!to automatically grab TV shows / books / music / basically any kind of media from the internet You lost
zugu on The shortest distro bashing and appraisal in a long while—now with added Dedoimedo!I was recently surprised by KDE Plasma 6. My most recent memories of KDE were from the 3.0 days, when
Béranger on The dead and the reborn BSDs — now updated!Nope. There are distros that are incompatible with Ventoy (or vice versa), but this is not the case. Maybe it
zugu on SJVN made my day!Having a NAS can be quite handy. I built one out of some old hardware (a Sandy Bridge Celeron with
zugu on The dead and the reborn BSDs — now updated!Just my two cents on Ventoy: while it's a really useful tool, it's not very reliable. A couple of weeks
Ex-SUSE on Is openSUSE at crossroads?SUSE (yes, I mean SUSE, not openSUSE) has had a Factory First approach for many years. "Factory" being where openSUSE
Ex-SUSE on Is openSUSE at crossroads?Customers don't care about KDE vs Gnome. I like KDE better but it has a fundamental flaw: no support for
Béranger on Is openSUSE at crossroads?I lacked a repartee (j'ai l'esprit de l'escalier). I should have added the following. Indeed, RHEL is the standard in
Béranger on Is openSUSE at crossroads?Intellectual masturbation, technological masturbation - modern versions of art for art's sake / l'art pour l'art / ars gratia artis.