Ubuntu: Adequate OCR for free on Linux

Even though I have mostly switched from Windows to Linux, I still have to emulate Windows for a few things, either because the Linux software isn’t very good or doesn’t work, or, in one case, because I haven’t learned the alternative yet (I still use SPSS because I haven’t learned R).  One of the reasons I kept running Windows was “optical character recognition,” or OCR: the process of converting scanned images of text into actual, editable text.  On Windows there are a number of good, relatively cheap software packages that do this.  The one I regularly used was Omnipage, which is really good.  I’ve looked for open source Linux alternatives and would even be willing to pay for one that wasn’t too expensive (i.e., less than $100), but I’ve had no luck finding something in my price range that works.  However, thanks to a recent development in Google Docs, it’s possible to bypass standalone OCR software completely and simply let Google convert your scanned documents into text for you.

As of June 21st, you can upload scanned files (e.g., jpg, pdf) to Google Docs and it will run them through OCR (directions here).  The character recognition is mediocre compared to Omnipage, but it works fairly well for simple text.  It doesn’t deskew text before converting it, so if your scan is skewed at all, it will have a hard time recognizing the text.  Also, the Google Docs OCR software doesn’t retain formatting at all.  But it works.  And it’s adequate for what I do (mostly scanning paragraphs from books to help with my book reviews).  However, one thing that is not mentioned about this free service is that it is limited to scans of 10 pages or fewer.  I found this out the hard way by uploading a scan that was about 30 pages long and realizing that about 20 pages of the document had not been recognized.  So, if you need lots of long documents converted, you’re probably still better off using some other software.  But if you just need a few pages of text converted, Google Docs will do it for you and do it fairly well.

So, how do you get all of this to work on Linux?  Assuming you already have a scanner (that would be the only cost, aside from your computer, of course), you can use the built-in scanning software “Simple Scan” (included as of Ubuntu 10.04) to scan your document.  Select the “text” scan mode, as you don’t need color.  Simple Scan is quite easy to use and allows for quick and easy cropping, which I would recommend (you have to do this in Omnipage as well, so it’s not any more time consuming).  Once you’ve scanned and cropped your text selection, save it as a PDF.  Then follow the directions here on how to upload it to Google Docs and tell Google Docs to convert it to text using OCR.  Et voilà, you have OCR on Linux.

If you do happen to have a longer document you need converted, you can easily break it into 10-page chunks using another piece of free software, PDFsam, which automates the splitting.  You can then upload all the pieces of the document to Google Docs simultaneously.  So long as each piece is 10 pages or fewer, each will be converted in turn.
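
If you would rather script the splitting than click through it, here is a rough sketch of the same idea in Python.  To be clear, this is not how PDFsam works internally and it is not something I have set up as part of my own routine; it assumes you have the third-party pypdf library installed, and the function names (chunk_ranges, split_pdf) and the output file naming are made up for illustration.  The 10-page default simply mirrors the limit I ran into with Google Docs.

```python
from pathlib import Path


def chunk_ranges(total_pages, chunk_size=10):
    """Return 1-based, inclusive (first, last) page ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def split_pdf(source, out_dir, chunk_size=10):
    """Write the PDF at `source` as numbered PDFs of at most chunk_size pages each."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that chunk_ranges() can still be used without it.
    from pypdf import PdfReader, PdfWriter

    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    reader = PdfReader(str(source))
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    written = []
    for number, (first, last) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for index in range(first - 1, last):  # reader.pages is 0-based
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: scan_part01.pdf, scan_part02.pdf, ...
        target = out_dir / f"{source.stem}_part{number:02d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written


if __name__ == "__main__":
    # A 30-page scan, like the one that tripped me up, becomes three chunks.
    print(chunk_ranges(30))  # [(1, 10), (11, 20), (21, 30)]
```

Calling split_pdf("scan.pdf", "chunks") on a 30-page scan (both names are placeholders) should leave three files in the chunks folder, each short enough to upload.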