Don’t trust the magazines on choosing an OCR

This covidiocy is so alienating that I wasn’t in the mood to write anything. Better late than never, here are a few notes on free OCR software, simply because a stupid British magazine pissed me off.

There is this thing called Web User Magazine, belonging to Penis Publishing (that’s a cheap one, I know), and they recently included a short roundup on “Best free OCR software”:

That’s bollocks. There’s absolutely nothing worth mentioning in their selection. Guaranteed crap. But what OCR should one choose instead?

In the past, I used to swear by ABBYY FineReader, but it didn’t seem able to progress: in all the versions I tried (from 9.0 to 15.0), even when I selected the proper language dictionary, all the Romanian words that started with “Î” (at the beginning of a sentence) were identified as starting with “î”. Occasionally, there were other misidentifications that I judged unacceptable for a commercial solution that included a dictionary for the recognized language and that was fed good-quality images. So I dropped ABBYY once and for all.

To the point: what you need to use is tesseract, still the best OCR there is, free, open-source, and able to use dictionaries for most languages. Nothing is perfect on Earth, but for OCR you don’t need to look farther than that. (Alpha builds of version 5.0 for Windows are provided by Universitätsbibliothek Mannheim; make sure you install the additional language data!)
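For the record, calling Tesseract from a script is not much work. Below is a minimal sketch in Python that shells out to the `tesseract` command-line program and returns the recognized text. It assumes `tesseract` is on the PATH and that the language data for the chosen language (here `ron`) is installed; the function names and the file name `scan.png` are made up for illustration.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for a plain-text OCR run.

    Using "stdout" as the output base makes tesseract print the
    recognized text instead of writing it to a .txt file.
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text.

    Raises subprocess.CalledProcessError if tesseract exits with an error
    (for example, when the language data is missing).
    """
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage, with a Romanian scan:
#     text = ocr_image("scan.png", lang="ron")
```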

Well, you still have to, though: unless you love using the command line for everything, you should be looking for a GUI front-end.

■For Android, there’s Text Fairy, an app that uses Tesseract and works fabulously well (just make sure you select the correct language). And it too is open-source.

■For Windows and Linux, the GUI of choice is gImageReader. It provides Qt-based binaries for Windows (e.g. gImageReader_3.3.1_qt5_x86_64.exe), and mentions that builds exist at least for Debian, Fedora, Ubuntu (PPA), OpenSUSE, Arch Linux. Generally, the Linux GUI uses Gtk+, and the package is called “gimagereader” (very rarely “gimagereader-gtk”); the Qt-based “gimagereader-qt” only exists for Fedora, ALT and Arch.

Using the Windows version is simple, except for an initial hiccup that should happen on most systems:

It tries to download Tesseract’s language data to a folder in Program Files, and no non-elevated program has this right; it’s been like this since Vista. So we need to change the language data location from “System-wide paths” to “User paths”:

After downloading the languages for the text to be recognized, everything should work rather smoothly:

There’s a small annoyance for Romanian though: the recognition uses “ţ” (U+0163, t with cedilla), whereas the dictionary-based spelling checker checks against the correct Romanian character “ț” (U+021B, t with comma below). The same for “ş” (U+015F, s with cedilla) instead of the correct “ș” (U+0219, s with comma below). This used to be a problem with Microsoft’s fonts and keyboard layouts prior to XP SP3, but it seems to have a second life even in the Open Source world.
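Until this gets fixed upstream, the recognized text can be cleaned up afterwards. Here is a minimal sketch in Python that swaps the cedilla characters for the comma-below ones, upper case included (U+0162 → U+021A, U+015E → U+0218). The function name is my own invention. Apply it to Romanian text only, because “ş” with cedilla is the correct letter in Turkish.

```python
# Map the legacy cedilla forms to the correct Romanian comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021b",  # ţ (t with cedilla) -> ț (t with comma below)
    "\u0162": "\u021a",  # Ţ (T with cedilla) -> Ț (T with comma below)
    "\u015f": "\u0219",  # ş (s with cedilla) -> ș (s with comma below)
    "\u015e": "\u0218",  # Ş (S with cedilla) -> Ș (S with comma below)
})


def fix_romanian_diacritics(text):
    """Return text with cedilla s/t replaced by their comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)
```

All other characters, including “ă”, “â” and “î”, are left untouched.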

As a matter of fact, the so-called “Romanian (Germany)” layout, which adds the Romanian characters to a German keyboard, uses the wrong characters (i.e. with cedilla) even in the latest version of xkeyboard-config. Here, in the symbols/de file:

Most Romanians are idiots who couldn’t care less about their language: they never cared to use the proper characters while using a computer; I believe Romania is the only country on Earth where the physical keyboards that can be purchased don’t have the specific characters (șțîă) painted on them, but are actually either US International or UK keyboards! Try to do this in Poland (ąćęłńóśźż) or in Hungary (áéíóöőúüű)!

In Linux, one should install gimagereader from the official repositories of their distro, and the required dependencies should be retrieved too. Attention though! Tesseract’s language files are not automatically added as dependencies, so one should either install tesseract-ocr-all, or pick only the needed languages (e.g. tesseract-ocr-deu, tesseract-ocr-fra, tesseract-ocr-ron). Also, the corresponding spelling languages are needed, e.g. aspell-de, aspell-fr, aspell-ro.
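To check which language data Tesseract actually sees, one can ask it directly with `tesseract --list-langs`. The sketch below, in Python, parses that output and reports which of the wanted languages are missing, along with package names built on the `tesseract-ocr-<code>` pattern mentioned above. Everything here is an assumption for illustration: the function names are mine, the exact header line differs between Tesseract versions, and I read both stdout and stderr because I'm not sure every version prints the list to the same stream.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the output of `tesseract --list-langs`.

    The header line ("List of available languages ...") and blank lines
    are skipped; every other line is taken as a language code.
    """
    lines = (line.strip() for line in output.splitlines())
    return {line for line in lines if line and not line.startswith("List of")}


def missing_languages(wanted, installed):
    """Return the wanted language codes that are not installed, sorted."""
    return sorted(set(wanted) - set(installed))


def suggested_packages(codes):
    """Package names following the tesseract-ocr-<code> naming pattern."""
    return [f"tesseract-ocr-{code}" for code in codes]


def installed_languages():
    """Ask the tesseract binary (assumed to be on the PATH) what it has."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read both streams, in case a given version prints the list to stderr.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


# Hypothetical usage:
#     missing = missing_languages(["deu", "fra", "ron"], installed_languages())
#     print(suggested_packages(missing))
```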

One issue in Linux is that the languages cannot be managed from within gImageReader, but only by installing or removing the above-mentioned language packages. If no tesseract-ocr-* package is found, a confusing error is displayed:

This is obviously crap, but it still confuses some people.

Other than that, it should work just fine once properly installed, if we ignore the “ţ/ț” and “ş/ș” annoyance:

The skinning depends on the distro and on the desktop environment (and, as you noticed, there’s no caption bar):

Under Linux one could also use an older program, gocr (with gocr-tk for a GUI), which in turn uses cuneiform as OCR, but I’d advise against such a choice. It’s mostly abandoned. Some other tools of interest under Linux:

■pdfsandwich, a tool that generates “sandwich” OCR PDF files: PDF files that contain only images (no text) are processed by OCR, and the recognized text is added to each page invisibly “behind” the images. It uses tesseract.

■k2pdfopt, a PDF reflowing tool that optimizes multi-column PDF/DJVU files for mobile e-readers and smartphones. It can re-flow text even on scanned PDF files, and it too can create “sandwich” OCR files. Well, there’s also a Windows version of it, but I didn’t try it.
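To get a feeling for what a “sandwich” file is without any of these tools, Tesseract itself can write a searchable PDF from a single image when given the `pdf` output config. The sketch below, in Python, is only that: one image in, one PDF out. It is not how pdfsandwich or k2pdfopt work internally, and the function and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_pdf_command(image_path, output_base, lang="eng"):
    """Build the tesseract call that writes <output_base>.pdf.

    The trailing "pdf" selects tesseract's PDF output, which keeps the
    page image and adds the recognized text as an invisible layer.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def make_searchable_pdf(image_path, output_base, lang="eng"):
    """Run tesseract (assumed to be on the PATH) and return the PDF path."""
    subprocess.run(build_pdf_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.pdf")


# Hypothetical usage:
#     make_searchable_pdf("page.png", "page", lang="ron")  # writes page.pdf
```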

On the other hand, I never managed to get Lios (“Linux intelligent OCR solution”) to do something. Anything.

I’ll end with a traditional meme for the French readers 🙂

