One of the few tasks I have not been able to do on Linux since I switched over from Windows more than a decade ago is optical character recognition (OCR) of PDF documents. I work with a lot of PDFs. Most of them were digital documents to begin with and the text is readily selectable. However, occasionally I either have to scan something myself or I receive a document that does not have selectable text and is just an image. Up until now, I have kept a software package on a Windows virtual machine (in VirtualBox) specifically to OCR PDFs on the rare occasion when I need to do that. But I think I can safely move past that thanks to recent advances in OCR on Linux.
I scanned a chapter I wrote in a book recently. The scan looked good (especially after I used Scan Tailor’s Dewarping feature to flatten the pages).
I then converted the TIF files from Scan Tailor into PDF files, put them in the correct order, and was ready to OCR them in the software I used in Windows. However, my virtual machine was giving me some issues and required me to install some updates that were going to take a while (’cause, Windows!). I got the updates started, then realized that I hadn’t checked to see if any progress had been made on OCR on Linux for quite a while (probably a couple of years). A quick Google search landed me on Stack Exchange (where I seem to spend a lot of time these days). There, I found two new options for OCR on Linux.
The first option was a command line program called “ocrmypdf.” That sounds like a dream! I quickly installed it on my Kubuntu machine:
$ sudo apt install ocrmypdf
A number of additional packages were installed as well. Once it was installed, I gave it a whirl. The command is pretty easy. Navigate to the directory where you have your PDF you want to have recognized then type in the following:
$ ocrmypdf input.pdf output.pdf
My initial PDF (on the left below) was 14 MB in size and looked fine, but I couldn’t select the text – it was just an image. With a quick command, I ran it through the “ocrmypdf” program and got out a nearly identical PDF that was smaller (just 9 MB) and allowed me to select the text (image on the right below).
In the next image, you can see that I can select the text in the OCR’d image:
Finally, the real question is, how accurate is the OCR? The image below shows the OCR document next to the text:
As you can see, this isn’t award-winning software. It’s probably 80 to 90% accurate. But that is more than sufficient for what I need most of the time. And, as FOSS, I’ll take it. My rating: 8 out of 10. But given the speed with which I can do this, I’ll absolutely be using this over the old software I was using on a virtual machine before.
On the same page where I found “ocrmypdf,” there was mention of another software package: “ocrfeeder.” Since I had been dreaming of this kind of software for ages, I figured I’d give it a spin as well. I downloaded “ocrfeeder” quickly:
$ sudo apt install ocrfeeder
I then tried to pull up the GUI… And… Nothing. Perhaps I needed to reboot. Not sure. But “ocrfeeder” didn’t seem to be working on my install (Kubuntu 18.10) when I tried to run it.
On my professional website, I use wordclouds from the text of my publications as the featured images for the posts where I share the publications. I have used a website to generate those wordclouds for quite a while, but I’m trying to learn how to use the R statistical environment and knew that R can generate wordclouds. So, I thought I’d give it a try.
Here are the steps to generating a wordcloud from the text of a PDF using R.
First, in R, install the following four packages: “tm”, “SnowballC”, “wordcloud”, and “readtext”. This is done by typing the following into the R terminal:
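A minimal sketch of that step, assuming the packages are pulled from CRAN with the standard install.packages() function. The mirror URL is an assumption; any CRAN mirror should work:

```r
# Package names are the four listed above; install.packages() is base R.
pkgs <- c("tm", "SnowballC", "wordcloud", "readtext")

# The repos argument is an assumption; drop it to use your configured mirror.
install.packages(pkgs, repos = "https://cloud.r-project.org")
```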
(NOTE: You may need to install the following packages on your Linux system using Synaptic or apt before you can install the above packages: r-cran-slam, r-cran-rcurl, r-cran-xml, r-cran-curl, r-cran-rcpp, r-cran-xml2, r-cran-littler, python-pdftools, python-sip, python-qt4, libpoppler-dev, libpoppler-cpp-dev, libapparmor-dev.)
Next, you need to load those packages into the R environment. This is done by typing the following in the R terminal:
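Loading is presumably one library() call per package, along these lines:

```r
library(tm)         # text-mining tools: Corpus(), tm_map(), stopwords()
library(SnowballC)  # word-stemming support used alongside tm
library(wordcloud)  # the wordcloud() plotting function
library(readtext)   # readtext() for extracting text from PDFs and other files
```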
Before we begin creating the wordcloud, we have to get the text out of the PDF file. To do this, first find out where your “working directory” is. The working directory is where the R environment will be looking for and storing files as it runs. To determine your “working directory,” use the following function:
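The function in question is presumably base R's getwd():

```r
# Returns the current working directory as a character string.
getwd()
```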
There are no arguments for this function. It will simply return where the R environment is currently looking for and storing files.
You’ll need to put the PDF from which you want to extract data into your working directory or change your working directory to the location of your PDF (technically, you could just include the path, but putting it in your working directory is easier). To change the working directory, use the “setwd()” function. Like this:
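A short sketch of that call. The folder name is hypothetical, so substitute the directory that actually holds your PDF:

```r
# Hypothetical folder; replace with the directory that contains your PDF.
pdf_dir <- path.expand("~/Documents/wordclouds")

# Only switch if the folder exists, so the snippet does not stop with an error.
if (dir.exists(pdf_dir)) {
  setwd(pdf_dir)
}

getwd()  # confirm where R is looking now
```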
Once you have your PDF in your working directory, you can use the readtext package to extract the text and put it into a variable. You can do that using the following command:
wordbase <- readtext("paper.pdf")
“wordbase” is a variable I’m creating to hold the text from the PDF. The variable is actually a data frame (data.frame) with two columns and one row. The first column is the document ID (e.g., “paper.pdf”); the second column is the extracted text. You can see what kind of variable it is using the command:
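In R, typing a variable's name prints it, and class() reports its type. The sketch below uses a small stand-in data frame with the same two columns so that it runs without a PDF; a real readtext object prints a slightly different summary.

```r
# Stand-in for the object readtext() returns (same columns: doc_id and text).
wordbase <- data.frame(doc_id = "paper.pdf",
                       text = "Text extracted from the PDF would appear here.",
                       stringsAsFactors = FALSE)

wordbase         # typing the name prints the object
class(wordbase)  # reports what kind of object it is
dim(wordbase)    # one row, two columns
```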
This gives you the following information:
readtext object consisting of 1 document and 0 docvars.
# data.frame [1 × 2]
1 career.pdf "\" \"..."
R won’t show you all of the text in the text column as it is likely quite a bit of text. If you want to display all the text (WARNING: It may be a lot of text), you can do so by telling R to display the contents of that cell of the data frame, which is row 1, column 2:
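A sketch of that lookup using standard data frame indexing. The data frame here is a stand-in with the same layout as the readtext result, so the snippet runs without a PDF:

```r
# Stand-in with the same layout as the readtext result (doc_id, text).
wordbase <- data.frame(doc_id = "paper.pdf",
                       text = "Text extracted from the PDF would appear here.",
                       stringsAsFactors = FALSE)

wordbase[1, 2]    # row 1, column 2: the full extracted text
wordbase$text[1]  # the same cell, addressed by column name
```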
“readtext” is the package that extracts the text from the PDF. The readtext package is robust enough to be able to extract text from numerous document types (see here) and is even able to determine what kind of document it is from the file extension; in this case, it recognizes that it’s a PDF.
The extracted text can now be converted into a corpus, which is the “tm” package’s container for a collection of text documents (see here for the different data types in R). To do this, we use the following function:
corp <- Corpus(VectorSource(wordbase$text))
In essence, we’re creating a new variable, “corp,” by using the Corpus function, which calls the VectorSource function and applies it to the text column of the variable “wordbase.” Passing only the text column keeps the document ID (the file name) from being treated as a second document.
We’re close to having the words ready to create the wordcloud, but it’s a good idea to clean up the corpus with several commands from the “tm” package. First, we want to make sure the corpus is a plain text:
corp <- tm_map(corp, PlainTextDocument)
Next, since we don’t want any of the punctuation included in the wordcloud, we remove the punctuation with this function from “tm”:
corp <- tm_map(corp, removePunctuation)
For my wordclouds, I don’t want numbers included. So, use this function to remove the numbers from the corpus:
corp <- tm_map(corp, removeNumbers)
I also want all of my words in lowercase. Base R’s tolower function does this; wrapping it in tm’s content_transformer() keeps the result a valid corpus:
corp <- tm_map(corp, content_transformer(tolower))
Finally, I’m not interested in words like “the” or “a”, so I removed all of those words using this function:
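The usual way to do this in “tm” is removeWords() together with the built-in English stop word list; treat the following as a sketch rather than the exact command. The tiny corpus is a stand-in so the snippet runs by itself:

```r
library(tm)

# Stand-in corpus; in the walkthrough, "corp" is the corpus built from the PDF.
corp <- Corpus(VectorSource("the church and the state are a topic of debate"))

# Drop common English words such as "the", "a", "and", and "of".
corp <- tm_map(corp, removeWords, stopwords("english"))

content(corp[[1]])  # the text that remains after stop word removal
```

Because the stop word list is lowercase, this step works best after the text has been converted to lowercase.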
At this point, you’re ready to generate the wordcloud. What follows is a wordcloud command, but it will generate the wordcloud in a window and you’ll then have to do a screen capture to turn the wordcloud into an image. Even so, here is the basic command:
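A sketch of the basic call, using the two parameters discussed next (max.words and random.order). The small corpus is a stand-in, and the second half is an optional alternative to a screen capture that writes the plot to a PNG with base R's png() device; the file name and dimensions are assumptions.

```r
library(tm)
library(wordcloud)

# Stand-in corpus with repeated words; in the walkthrough, use your own "corp".
words <- rep(c("religion", "science", "health", "family"), times = c(8, 6, 4, 3))
corp <- Corpus(VectorSource(paste(words, collapse = " ")))

# Draw the wordcloud in the plot window.
wordcloud(corp, max.words = 100, random.order = FALSE)

# Optional: draw the same cloud into a PNG file instead.
png("wordcloud.png", width = 800, height = 800)
wordcloud(corp, max.words = 100, random.order = FALSE)
dev.off()
```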
To explain the command, “wordcloud” is the package and function. “corp” is the corpus containing all the words. The other components of the command are parameters that can, of course, be adjusted. “max.words” can be increased or decreased to reflect the number of words you want to include in your wordcloud. “random.order” should be set to FALSE if you want the more frequently occurring words to be in the center with the less frequently occurring words surrounding them. If you set that parameter to TRUE, the words will be in random order, like this:
There are additional parameters that can be added to the wordcloud command, including a scale parameter (scale) that adjusts the relative sizes of the more and less frequently occurring words, a minimum frequency parameter (min.freq) that will limit the plotted words to only those that occur a certain number of times, and a parameter for what proportion of words should be rotated 90 degrees (rot.per). Other parameters are detailed in the wordcloud documentation here.
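Here is a sketch of those three parameters in use. To keep it short, it passes a hypothetical word list and frequency vector directly; with the walkthrough's data you would pass “corp” as the first argument instead. The specific values are only illustrative.

```r
library(wordcloud)

# Hypothetical words and counts standing in for a real corpus.
words <- c("religion", "science", "health", "family", "education")
freq  <- c(12, 9, 7, 4, 2)

wordcloud(words, freq,
          scale = c(3, 0.5),     # size of the largest and smallest words
          min.freq = 4,          # skip words that occur fewer than 4 times
          rot.per = 0.25,        # rotate about a quarter of the words 90 degrees
          random.order = FALSE)  # most frequent words in the center
```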
One of the more important parameters that can be added is color (colors). By default, wordclouds are black letters on a white background. If you want the word color to vary with the frequency, you need to create a variable that details to the wordcloud function how many colors you want and from what color palette. A number of color palettes are pre-defined in R (see here). Here’s a sample command to create a color variable that can be used with the wordcloud package:
color <- brewer.pal(8,"Spectral")
The parameters in the parentheses indicate, first, the number of colors desired (8 in the example above) and, second, the palette name from the list noted above. Generating the wordcloud with the color palette applied involves adding one more argument to the command:
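A sketch of the colored version, passing the palette through the colors argument. The word list and counts are hypothetical stand-ins; with the walkthrough's data, “corp” would be the first argument.

```r
library(wordcloud)
library(RColorBrewer)  # provides brewer.pal()

# Eight colors from the "Spectral" palette, as in the command above.
color <- brewer.pal(8, "Spectral")

# Hypothetical words and counts standing in for a real corpus.
words <- c("religion", "science", "health", "family", "education")
freq  <- c(12, 9, 7, 4, 3)

# Colors are assigned by frequency, from least to most frequent.
wordcloud(words, freq, min.freq = 1, random.order = FALSE, colors = color)
```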
There is another package that allows for some more advanced wordcloud creations called “wordcloud2.” It allows for the creation of wordclouds that use images as masks. Currently, the package is having problems if you install from the cran servers, but if you install directly from the github source, it works. Here’s how to do that:
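A sketch of the GitHub install, assuming the devtools package and the repository name given in the wordcloud2 README (lchiffon/wordcloud2); check the README if the repository has moved.

```r
# install.packages("devtools")  # uncomment if devtools is not installed yet
library(devtools)

# Repository name is taken from the wordcloud2 README (an assumption here).
install_github("lchiffon/wordcloud2")

library(wordcloud2)
```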
You can then use the “wordcloud2” package to create all sorts of nifty wordclouds, like this:
Before you can use wordcloud2 to create advanced wordclouds, you need to convert your data (after doing everything above) into a data matrix. Here’s how you do that:
dtm <- TermDocumentMatrix(corp)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
The word frequencies are now contained in the data frame “d”. To see the words in your frequency list ordered from most frequently used to least frequently used, you can use the following command. The number after “d” is how many words you want to see (e.g., you can see the top 10, 20, 50, 100, etc.).
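Base R's head() fits that description. The sketch below builds a small stand-in for “d” with hypothetical counts so that it runs by itself:

```r
# Stand-in frequency table with the same two columns as "d" above.
v <- sort(c(science = 9, religion = 12, family = 4, health = 7), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

head(d, 3)  # the three most frequent words; use head(d, 10) for the top 10
```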
To create a wordcloud using wordcloud2, you use the following command:
wordcloud2(d, color = "random-light", backgroundColor = "grey")
And if you want to create a wordcloud using an image mask (the image has to be a PNG file with a transparent background), you use the following command:
wordcloud2(d, figPath = "figure.png", backgroundColor = "black", color = "random-light")
NOTE: Sources for directions on wordcloud2 are here and here; though see here for converting your corpus into a data matrix, which is what you have to use to create these fancy wordclouds.
NOTE: Color options for wordcloud2 are any CSS colors. See here for a complete list.
UPDATE: As of 12/15/2015, LibreOffice 5.0 broke this feature. See here. This should be fixed as of LibreOffice 5.0.5.
I do a lot of my grading in my classes electronically. As a LibreOffice user, one issue I’ve had with the software is that I haven’t been able to insert comments into the document and then have those comments show up in the margins of the document when I save it to a PDF and return it to my students. I was unable to figure out how to do this when I switched to OpenOffice (and then LibreOffice) almost a decade ago. I’m not sure if it was possible back then, but I recently discovered that it is possible, which is exciting for me!
Here is what I mean. Previously, when I graded students’ papers, I would track my changes. I would insert comments into the text in parentheses, like this:
LibreOffice would print those out, like this (saved as a PDF):
It used to be the case that, if you wanted comments that you inserted in the margins printed, LibreOffice would print them all at the end of the document, which was pretty useless for providing contextual feedback. But now you can print out comments in the margins. Here’s how you do it.
Go to “Tools -> Options”.
Then go to “Writer->Print”. In the resulting window, look to the right and you’ll see options for comments.
To print the comments in the margin of a PDF, choose “in margin.” Then hit “OK.”
Now, add a comment, like this:
When you save the file as a PDF, it looks like this:
Voila! You have comments in PDFs in the margins. Hooray!
I’m not exactly sure why, but with the latest Firefox updates, every time I download a PDF using Firefox or try to open one using Zotero integrated with Firefox, the PDF opens in GIMP. This didn’t use to happen, and it’s really annoying. It’s doubly annoying since you can’t solve it inside Firefox.
It would make sense to be able to change this in one of two ways. The first is to set your system-wide preference for which program opens PDFs, but mine was already set to Okular (you can change the default for most file types by right-clicking a file, selecting Properties, then File Type Options, and choosing the program you want to be the default).
The second logical solution would be to change the default applications in Firefox, but that doesn’t do anything. It turns out, the solution is to edit a different file, changing the order of default applications for opening PDFs. Here’s what you need to do:
Open a terminal and gain root privilege to edit the following file: /usr/share/applications/mimeinfo.cache. Here’s the command I use in Kubuntu:
sudo kate /usr/share/applications/mimeinfo.cache
NOTE: As of Kubuntu 18.04, you now have to use the following command to edit protected files with Kate:
In that file, find the line that begins with “application/pdf=”. The order of the applications after the “=” indicates the order in which they will be used to load PDFs. Right now, GIMP is listed first. All you need to do to fix this is to change the order so Okular is first, like this:
Once you change it, save your changes and then restart Firefox. Your PDFs should now load in Okular.
Those who have used this workaround in the past may have realized that the fix is temporary. The next time you update your software or change something, GIMP gets set as the default for PDF files again. To make the fix permanent, there is another option: create a file called “mimeapps.list” in the /usr/share/applications directory, which overrides the mimeinfo.cache list. Here’s what you would add to the file if you want Okular to be your default PDF reader: