technology

Linux – OCR PDF

One of the few tasks I have not been able to do on Linux since I switched over from Windows more than a decade ago is optical character recognition (OCR) of PDF documents. I work with a lot of PDFs. Most of them were digital documents to begin with, so the text is readily selectable. Occasionally, however, I either have to scan something myself or I receive a document that is just an image with no selectable text. Up until now, I have kept a software package on a Windows virtual machine (in VirtualBox) specifically to OCR PDFs on the rare occasions when I need to. But I think I can safely move past that thanks to recent advances in OCR on Linux.

I recently scanned a chapter I wrote for a book. The scan looked good (especially after I used Scan Tailor’s Dewarping feature to flatten the pages).

I then converted the TIF files from Scan Tailor into PDF files, put them in the correct order, and was ready to OCR them in the software I used in Windows. However, my virtual machine was giving me some issues and required me to install some updates that were going to take a while (’cause, Windows!). I got the updates started, then realized that I hadn’t checked in quite a while (probably a couple of years) whether any progress had been made on OCR on Linux. A quick Google search landed me on Stack Exchange (where I seem to spend a lot of time these days). There, I found two new options for OCR on Linux.

The first option was a command line program called “ocrmypdf.” That sounds like a dream! I quickly installed it on my Kubuntu machine:

$ sudo apt install ocrmypdf

A number of additional packages were installed as well. Once it was installed, I gave it a whirl. The command is pretty easy. Navigate to the directory that holds the PDF you want recognized, then type the following:

$ ocrmypdf input.pdf output.pdf
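If you have a whole folder of scans, the same command can be wrapped in a small loop. What follows is only a sketch under a few assumptions, not something taken from the Stack Exchange thread: the function names `ocr_out_name` and `ocr_dir` and the `-ocr.pdf` suffix are made-up placeholders, the text is assumed to be English (`-l eng`), and `--deskew` is added on the guess that some pages are slightly crooked.

```shell
# Build the output name for an input file: scan.pdf -> scan-ocr.pdf.
# (The "-ocr" suffix is an arbitrary choice for this sketch.)
ocr_out_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# OCR every PDF in a directory (default: current directory),
# skipping files that already have an OCR'd copy.
ocr_dir() (
    dir=${1:-.}
    command -v ocrmypdf >/dev/null 2>&1 || {
        echo "ocrmypdf not found" >&2
        exit 1
    }
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        case $f in *-ocr.pdf) continue ;; esac   # skip our own output files
        out=$(ocr_out_name "$f")
        [ -e "$out" ] && continue                # already processed
        # -l eng: recognize English; --deskew: straighten tilted pages
        ocrmypdf -l eng --deskew "$f" "$out"
    done
)
```

Saved to a file (say, the hypothetical ocr-helpers.sh) and sourced, it can be run on the current directory:

$ . ./ocr-helpers.sh
$ ocr_dir .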

My initial PDF (on the left below) was 14 MB in size and looked fine, but I couldn’t select the text – it was just an image. With a quick command, I ran it through the “ocrmypdf” program and got back a nearly identical PDF that was smaller (just 9 MB) and allowed me to select the text (image on the right below).

Original PDF on the left; OCR PDF on the right.

In the next image, you can see that I can select the text in the OCR’d image:

I can select the text in the OCR’d image.
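Clicking and dragging in a viewer is one way to confirm that the text layer is there. A rough command-line check is sketched below. It assumes `pdftotext` from the poppler-utils package is installed, and the helper names `word_count` and `pdf_word_count` are hypothetical.

```shell
# Count whitespace-separated words arriving on standard input.
word_count() {
    wc -w | tr -d '[:space:]'
}

# Count the words in a PDF's text layer.
pdf_word_count() (
    command -v pdftotext >/dev/null 2>&1 || {
        echo "pdftotext not found" >&2
        exit 1
    }
    # The "-" tells pdftotext to write the extracted text to stdout.
    pdftotext "$1" - | word_count
)
```

An image-only scan should report a count at or near zero, while the OCR’d copy should report roughly the number of words on the pages:

$ pdf_word_count input.pdf
$ pdf_word_count output.pdf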

Finally, the real question: how accurate is the OCR? The image below shows the OCR’d document next to the text copied out of it:

PDF on the left; selected text copied and pasted on the right.

As you can see, this isn’t award-winning software. It’s probably 80 to 90% accurate. But that is more than sufficient for what I need most of the time. And, as FOSS, I’ll take it. My rating: 8 out of 10. But given the speed with which I can do this, I’ll absolutely be using this over the old software I was using on a virtual machine before.

On the same page where I found “ocrmypdf,” there was mention of another software package: “ocrfeeder.” Since I had been dreaming of this kind of software for ages, I figured I’d give it a spin as well. I quickly installed “ocrfeeder”:

$ sudo apt install ocrfeeder

I then tried to pull up the GUI… And… Nothing. Perhaps I needed to reboot. Not sure. But “ocrfeeder” didn’t seem to be working on my install (Kubuntu 18.10) when I tried to run it.

Ubuntu: Adequate OCR for free on Linux

Even though I have mostly switched from Windows to Linux, I do have to emulate Windows for a few things, either because the Linux software isn’t very good or doesn’t work or, in one case, because I haven’t learned the alternative (R rather than SPSS). One of the reasons I would run Windows over Linux was for “optical character recognition,” or OCR, the process of converting scanned text into actual text. On Windows there are a number of good, relatively cheap software packages that do this. The one I regularly used was OmniPage, which is really good. I’ve looked for open source Linux alternatives and would even be willing to pay for one that wasn’t too expensive (i.e., less than $100), but I have had no luck finding something in my price range that works. However, with a recent development in Google Docs, it’s possible to bypass standalone OCR software completely and simply let Google convert your scanned documents into text for you.

As of June 21st, you can upload scanned files (e.g., jpg, pdf) to Google Docs and it will run them through OCR (directions here). The character recognition is mediocre compared to OmniPage, but it works fairly well for simple text. It doesn’t deskew text before converting it, so if your scanned text is skewed at all, it will have a hard time converting it. Also, the Google Docs OCR software doesn’t retain formatting at all. But it works. And it’s adequate for what I do (mostly scanning paragraphs from books to help with my book reviews). However, one thing that is not mentioned with this free service is that it is limited to scans of 10 pages or fewer. I found this out the hard way by uploading a scan that was about 30 pages and realizing it hadn’t recognized 20 pages of the document. So, if you need lots of long documents converted, you’re probably still better off using some other software. But if you just need a few pages of text converted, Google Docs will do it for you and do it fairly well.

So, how do you get all of this to work on Linux? Assuming you already have a scanner (that would be the only cost, aside from your computer, of course), you can use the built-in scanning software “Simple Scan” (included as of Ubuntu 10.04) to scan your document. Select the “text” scan mode, as you don’t need color. Simple Scan is quite easy to use and allows for quick and easy cropping, which I would recommend (you have to do this in OmniPage as well, so it’s not any more time-consuming). Once you’ve scanned and cropped your text selection, save it as a PDF. Then follow the directions here on how to upload it to Google Docs and tell Google Docs to run OCR on it. Et voilà, you have OCR on Linux.

If you do happen to have a longer document you need converted, you can easily break it into 10-page chunks using another piece of free software, PDFsam, which automates the splitting. You can then upload all the pieces of the document to Google Docs simultaneously. So long as each piece is 10 pages or fewer, each will be converted in turn.
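If you would rather stay on the command line, the same 10-page split can be approximated with qpdf. Note that this swaps in a different tool from PDFsam: the sketch assumes `qpdf` is installed, and `chunk_ranges` and `split_pdf` are hypothetical helper names.

```shell
# Print the page ranges that cover TOTAL pages in chunks of SIZE pages
# (SIZE defaults to 10, the limit mentioned above).
chunk_ranges() (
    total=$1
    size=${2:-10}
    start=1
    while [ "$start" -le "$total" ]; do
        end=$((start + size - 1))
        if [ "$end" -gt "$total" ]; then
            end=$total                 # the last chunk may be shorter
        fi
        echo "$start-$end"
        start=$((end + 1))
    done
)

# Split a PDF into numbered part files, one per page range.
split_pdf() (
    in=$1
    size=${2:-10}
    command -v qpdf >/dev/null 2>&1 || {
        echo "qpdf not found" >&2
        exit 1
    }
    total=$(qpdf --show-npages "$in") || exit 1
    n=1
    for range in $(chunk_ranges "$total" "$size"); do
        # Start from an empty PDF and copy in only the pages of this range.
        qpdf --empty --pages "$in" "$range" -- "${in%.pdf}-part$n.pdf"
        n=$((n + 1))
    done
)
```

For a 30-page scan, `chunk_ranges 30` yields the ranges 1-10, 11-20 and 21-30, and `split_pdf scan.pdf` would write scan-part1.pdf through scan-part3.pdf, each small enough to upload on its own.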