R (Linux) – creating a wordcloud from PDF

On my professional website, I use wordclouds from the text of my publications as the featured images for the posts where I share the publications. I have used a website to generate those wordclouds for quite a while, but I’m trying to learn how to use the R statistical environment and knew that R can generate wordclouds. So, I thought I’d give it a try.

Here are the steps to generating a wordcloud from the text of a PDF using R.

First, in R, install the following four packages: “tm”, “SnowballC”, “wordcloud”, and “readtext”. This is done by typing the following into the R terminal:

install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("readtext")
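
If you rerun these steps often, a small check can skip packages that are already installed. The following is only a sketch: the helper name missing_packages is made up for this example, and the interactive() guard is my own assumption so that nothing gets installed when the code runs in a script.

```r
# Hypothetical helper: return the names of packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tm", "SnowballC", "wordcloud", "readtext")
to_install <- missing_packages(needed)

# Only install when working at the R prompt and something is actually missing
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

install.packages accepts a vector of names, so all the missing packages are installed in one call.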

Next, you need to load those packages into the R environment. This is done by typing the following in the R terminal:

library(tm)
library(SnowballC)
library(wordcloud)
library(readtext)

Before we begin creating the wordcloud, we have to get the text out of the PDF file. To do this, first find out where your “working directory” is. The working directory is where the R environment will be looking for and storing files as it runs. To determine your “working directory,” use the following function:

getwd()

There are no arguments for this function. It will simply return where the R environment is currently looking for and storing files.

You’ll need to put the PDF from which you want to extract data into your working directory or change your working directory to the location of your PDF (technically, you could just include the path, but putting it in your working directory is easier). To change the working directory, use the “setwd()” function. Like this:

setwd("/home/ryan/RWD")
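
Before going further, it can help to confirm that R actually sees the PDF in the working directory. This is a minimal sketch; the helper name pdf_in_dir is hypothetical and "paper.pdf" is just the example file name used in this post.

```r
# Hypothetical helper: TRUE if a file called `name` exists in `dir`
# (by default, the current working directory)
pdf_in_dir <- function(name, dir = getwd()) {
  file.exists(file.path(dir, name))
}

# List every PDF that R can see in the working directory
list.files(getwd(), pattern = "\\.pdf$", ignore.case = TRUE)

if (!pdf_in_dir("paper.pdf")) {
  message("paper.pdf is not in ", getwd(), "; move it there or call setwd()")
}
```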

Once you have your PDF in your working directory, you can use the readtext package to extract the text and store it in a data frame. You can do that using the following command:

wordbase <- readtext("paper.pdf")

“wordbase” is a variable I’m creating to hold the text from the PDF. “readtext” is both the name of the package and the function that extracts the text from the PDF. The readtext package can extract text from numerous document formats (see here) and determines what kind of document it is from the file extension; in this case, it recognizes that it’s a PDF.

The text can now be converted into a corpus, the collection-of-documents structure that the “tm” package works with. readtext returns a data frame with one row per file and two columns, “doc_id” (the file name) and “text” (the extracted text), so only the text column is passed along:

corp <- Corpus(VectorSource(wordbase$text))

In essence, we’re creating a new variable, “corp,” by calling the Corpus function on a VectorSource, which treats each element of “wordbase$text” as one document. With a single PDF, the corpus holds a single document.
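
To see what goes into the corpus without needing a PDF on hand, here is a small sketch that uses a stand-in data frame shaped like readtext output (a doc_id column and a text column). The names wordbase_demo and corp_demo and the sample sentence are invented for illustration. Passing only the text column keeps the file name out of the corpus.

```r
library(tm)

# Stand-in for what readtext() returns: one row per file,
# with the file name in doc_id and the extracted text in text
wordbase_demo <- data.frame(
  doc_id = "paper.pdf",
  text = "Sociologists study religion and science with survey data.",
  stringsAsFactors = FALSE
)

# Build the corpus from the text column only
corp_demo <- Corpus(VectorSource(wordbase_demo$text))

length(corp_demo)    # number of documents in the corpus
content(corp_demo)   # the raw text held by the corpus
```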

We’re close to having the words ready to create the wordcloud, but it’s a good idea to clean up the corpus with several commands from the “tm” package. First, we want to make sure the corpus is plain text:

corp <- tm_map(corp, PlainTextDocument)

Next, since we don’t want any of the punctuation included in the wordcloud, we remove the punctuation with this function from “tm”:

corp <- tm_map(corp, removePunctuation)

For my wordclouds, I don’t want numbers included. So, use this function to remove the numbers from the corpus:

corp <- tm_map(corp, removeNumbers)

Finally, I’m not interested in words like “the” or “a”, so I remove all of those words using this function:

corp <- tm_map(corp, removeWords, stopwords("english"))
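
The cleaning steps can also be bundled into one function so they are easy to rerun on a new PDF. This is a sketch rather than the exact procedure above: the function name clean_corpus is made up, and two steps are my own additions. The first is lowercasing with content_transformer(tolower), so that a capitalized “The” also matches the stopword list. The second is stripWhitespace, which tidies up the gaps left behind by removed words.

```r
library(tm)

# Hypothetical helper that applies the cleaning steps in order
clean_corpus <- function(corp) {
  corp <- tm_map(corp, content_transformer(tolower))       # added: lowercase everything
  corp <- tm_map(corp, removePunctuation)                  # drop punctuation
  corp <- tm_map(corp, removeNumbers)                      # drop digits
  corp <- tm_map(corp, removeWords, stopwords("english"))  # drop "the", "and", ...
  tm_map(corp, stripWhitespace)                            # added: collapse extra spaces
}

# Small made-up example to show the effect
demo <- Corpus(VectorSource("The 3 clouds, and the 2 WORDS!"))
content(clean_corpus(demo))
```

tm may print a “transformation drops documents” warning while running these steps on a corpus built this way; in my experience the cleaned text is still returned.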

At this point, you’re ready to generate the wordcloud. What follows is a wordcloud command, but it will generate the wordcloud in a window and you’ll then have to do a screen capture to turn the wordcloud into an image. Even so, here is the basic command:

wordcloud(corp, max.words = 100, random.order = FALSE)

To explain the command, “wordcloud” is the package and function. “corp” is the corpus containing all the words. The other components of the command are parameters that can, of course, be adjusted. “max.words” can be increased or decreased to reflect the number of words you want to include in your wordcloud. “random.order” should be set to FALSE if you want the more frequently occurring words to be in the center with the less frequently occurring words surrounding them. If you set that parameter to TRUE, the words will be in random order, like this:

There are additional parameters that can be added to the wordcloud command, including a scale parameter (scale) that adjusts the relative sizes of the more and less frequently occurring words, a minimum frequency parameter (min.freq) that limits the plotted words to only those that occur at least a certain number of times, and a parameter for what proportion of words should be rotated 90 degrees (rot.per). Other parameters are detailed in the wordcloud documentation here.
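
Choosing sensible values for max.words and min.freq is easier if you first look at how often each word occurs. Here is a sketch that counts words with a term-document matrix from “tm”; the corpus demo_corp and its text are invented so the example runs on its own, and in practice you would pass your own cleaned corpus instead.

```r
library(tm)

# Made-up corpus so the example is self-contained
demo_corp <- Corpus(VectorSource("cloud cloud cloud word word text"))

# Count how often each term occurs across the corpus
tdm <- TermDocumentMatrix(demo_corp)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

head(freqs, 10)    # the most frequent words
sum(freqs >= 2)    # how many words would survive min.freq = 2
```

If only a handful of words clear the min.freq you had in mind, lower it or the wordcloud will look sparse.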

One of the more important parameters that can be added is color (colors). By default, wordclouds are black letters on a white background. If you want the word color to vary with the frequency, you need to create a variable that tells the wordcloud function how many colors you want and from what color palette. A number of color palettes are pre-defined in the RColorBrewer package, which is loaded along with wordcloud (see here). Here’s a sample command to create a color variable that can be used with the wordcloud function:

color <- brewer.pal(8, "Spectral")

The arguments in the parentheses indicate, first, the number of colors desired (8 in the example above) and, second, the name of the palette from the list noted above. Generating the wordcloud with the color palette applied involves adding one more argument to the command:

wordcloud(corp, max.words = 100, min.freq=15, random.order = FALSE, colors = color, scale=c(8, .3))
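
If you are not sure which palettes exist or how many colors each one supports, the RColorBrewer package can tell you. A short sketch, using “Spectral” only because it is the palette from the example above:

```r
library(RColorBrewer)

# Table of every palette, with the maximum number of colors it supports
head(brewer.pal.info)

# Details for one palette
brewer.pal.info["Spectral", ]

# brewer.pal returns a character vector of hex color codes
color <- brewer.pal(8, "Spectral")
color
```

Calling display.brewer.all() opens a plot window showing every palette, which makes it easier to pick one by eye.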

Finally, if you want to output the wordcloud as an image file, you can adjust the command to generate the wordcloud as, for instance, a PNG file. First, tell R to create the PNG file:

png("wordcloud.png", width = 1280, height = 800)

The text in quotes is the name of the PNG file to be created. The other two arguments set the width and height of the PNG in pixels. Then create the wordcloud with the parameters you want:

wordcloud(corp, max.words = 100, random.order = FALSE, colors = color, scale=c(8, .3))

And, finally, close the PNG file with this function, which writes the wordcloud just created to disk:

dev.off()
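
The three steps (open the PNG, draw, close the PNG) can be wrapped in one function. This is a sketch under a few assumptions: the function name save_wordcloud is made up, it takes a vector of words and a vector of frequencies rather than a corpus so the example runs without a PDF, and the sample words are invented. on.exit makes sure the file is closed even if drawing fails, and set.seed makes the layout repeatable.

```r
library(wordcloud)

# Hypothetical helper: draw a wordcloud straight into a PNG file
save_wordcloud <- function(words, freq, file, width = 1280, height = 800) {
  png(file, width = width, height = height)
  on.exit(dev.off())   # always close the file, even on error
  wordcloud(words, freq, min.freq = 1, max.words = 100,
            random.order = FALSE,
            colors = RColorBrewer::brewer.pal(8, "Spectral"),
            scale = c(8, .3))
  invisible(file)
}

# Made-up words and counts, written to a temporary folder
set.seed(42)   # same layout every time
out_file <- file.path(tempdir(), "wordcloud.png")
save_wordcloud(c("religion", "science", "survey"), c(10, 6, 3), out_file)
```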

If all goes according to plan, you will have created a PNG file with a wordcloud of your cleaned-up corpus of text.

Source Information:

Reading PDF files into R for text mining

Building Wordclouds in R

Word cloud in R

R (Linux) – basic installation

Installing the R programming environment on Linux is pretty straightforward, but it does require a little bit of know-how to find the correct packages. As is typically the case with Linux, there are multiple ways to get things done. I like to use Synaptic for installing and removing software, but you can also use the software manager that comes with your Linux distribution (in Linux Mint it’s called Software Manager) or the command line (in KDE-based distributions, Konsole).

For the most up-to-date installation of R, it’s actually best to install directly from the R repository. A list of Linux repositories for the R environment is located here. In order to install from the repository, you need to update your list of repositories in Synaptic. To access your repository list in Synaptic, click on Settings -> Repositories.

In the new Software Sources window, click on “Additional repositories” and you’ll get this window:

Click on Add a new repository. You’ll get this window:

The exact information you put into that window will vary based on which mirror you chose. Here is what I added in mine:

deb https://cran.cnr.berkeley.edu/bin/linux/ubuntu/ xenial/

In order to ensure you have the right files and to follow best security practices, you should install the signing key as well. Directions for installing the signing key are found here, but it can be done with a simple command from a terminal:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

Once you have done all of that, you can install R from Synaptic.

First, open Synaptic, which will require your password. You’ll get the basic Synaptic Package Manager window:

Next, in the search box, search for “r-base”. Right-click it and select “Mark for installation” to install “r-base”:

In the above screenshot, I have already installed r-base, so the option “Mark for installation” is greyed out. When you select this option, Synaptic will automatically select all the other packages R needs to run (there are more than a dozen: r-cran-class, r-cran-lattice, r-cran-spatial, r-cran-survival, r-cran-codetools, r-cran-nnet, r-cran-mass, r-cran-boot, r-cran-nlme, r-cran-rpart, r-cran-cluster, r-cran-kernsmooth, r-cran-foreign, r-cran-mgcv, r-cran-matrix, r-recommended, r-base-core).

If you plan on installing any other R packages, it’s not a bad idea to also install “r-base-dev,” as it helps fill in dependencies for other packages.

Once you’ve selected r-base, hit Apply in Synaptic and all the software will be installed.

You now have the base software for R installed.

To open the R environment in a terminal, launch a terminal and simply type “R” at the prompt, like this:

Here’s where things can get a little complicated. To do different things in R requires various libraries or packages. Some of these can be installed using the R terminal while others need to be installed from your Linux distribution’s repositories. To install a library or package using the R terminal, you use the following command once you have opened the R environment:

install.packages("PACKAGENAME")

The first time you run this, the R environment will ask you to select a mirror.

Choose one close to your location. R will then install the package, assuming you type everything correctly.
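
If you would rather not be asked for a mirror every session, you can set one yourself before installing anything. A minimal sketch, assuming you are happy with the general-purpose address https://cloud.r-project.org (substitute the mirror closest to you):

```r
# Set the CRAN mirror for this R session so install.packages() does not ask
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Confirm which mirror is now in use
getOption("repos")
```

The setting only lasts for the current session unless you add the same line to your .Rprofile file.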

NOTES:

Before you start trying to install additional R packages, it’s a very good idea to install the following Linux packages:

r-base-dev build-essential

If you run into an error message, there are several possibilities. First, check to make sure you typed everything correctly. R is not forgiving on spelling mistakes. Second, if the error is something like:

installation of package ‘PACKAGENAME’ had non-zero exit status

Or

dependency ‘PACKAGENAME’ is not available

There is a good chance that you need to install a package or library using Synaptic (or from a terminal using apt). For instance, the “tm” package has a dependency, the ‘slam’ package, that could not be installed using the R installer. It can be installed using Synaptic (or, from a terminal, using the command “sudo apt-get install r-cran-slam”). Once you’ve installed the dependency, try re-installing the package and the error messages should go away.
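
When a dependency is missing, it helps to know what to search for in Synaptic. The packages listed earlier suggest that the Linux package names follow the pattern “r-cran-” plus the R package name in lowercase. The sketch below relies on that pattern, which is an assumption (not every R package has a matching Linux package), and the function name suggest_apt_packages is made up.

```r
# Hypothetical helper: for each R package that is not installed,
# guess the name of the matching Linux package
suggest_apt_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  # sprintf returns an empty vector when nothing is missing
  sprintf("r-cran-%s", tolower(pkgs[!installed]))
}

# Packages that tm is described as needing in this post, plus tm itself
suggest_apt_packages(c("slam", "tm"))
```

Any names it returns can be searched for in Synaptic or installed with “sudo apt-get install”.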

NOTES:

I also have found that I like RStudio as an IDE for working with R. It’s a little bit friendlier to use than a straight command line interface as it keeps track of variables and loaded libraries. The personal version for your desktop can be downloaded here.

And a note on RStudio on Linux: I regularly get an offset between where the cursor appears and where it actually is in the command window. It turns out this is a font issue. If you go to Tools -> Global Options -> Appearance and change the font to anything else, this problem will go away.

Switzerland – remaining adventures

I was attending my conference July 4th through the 6th, but skipped out on the last day of the conference (July 7th) to go see CERN (the location of the Large Hadron Collider). Debi, Toren, and Rosemary, meanwhile, had a number of adventures. They took the chocolate train through various parts of Switzerland, visiting the Gruyere cheese factory, the Gruyere castle, and the Maison Cailler chocolate factory.

Here’s a video Debi shot of the chocolate extruding and packaging process at Maison Cailler:

Amazingly, they took a picture in front of the Giger Museum, but didn’t know what it was and didn’t go in (I’ve got to go back just for that).

Toren in front of the Giger Museum.

They also took a boat ride from Montreux to Lausanne one day while I was at my conference:

I did sneak in a visit to the Chillon Castle before my conference started one day:

Debi, Rosemary, and Toren at Chillon Castle.

I didn’t get to see the whole castle as I had to make it to my conference in time for the first session that day, but I got to see some of the castle. Again, I’ll have to go back.

The one day of the conference I did skip was so we could go to CERN. Getting tickets was a bit of a nightmare as they have to be reserved in advance, go on sale at 8:00 am Swiss time, and are usually gone in a matter of minutes. Debi and I spent a few days getting up just before 2:00 am so we could get the tickets and eventually got 4 for the last day of my conference.

You obviously don’t get to go down into the actual collider, which is about 90 meters below ground, but they do give you a tour of a control center and show you some old colliders, like this one where Toren was pushing the self-destruct button:

Toren pushing the “self-destruct” button on an old collider. It was a red button with no label, so I told him it was the self-destruct button and he immediately proceeded to push it.

The tour starts at the welcome center, where they have a nice museum, and then works its way around the campus. We went into a control center, watched a video about particle accelerators, and then got to go into where the original collider is at CERN (from the 1950s; very cool presentation there). There is another museum across the street from the main welcome center, as well as numerous monuments. Here’s a photo in front of one of those monuments:

Rosemary, Toren, and Debi by a monument at CERN.

We also found a little time to stop by the Reformation Wall in Geneva, which is a monument to the Protestant Reformation. We didn’t stay long as we had to get to CERN on time and this happened to be kind of on the way. Here’s a photosphere of the Wall:

I also had Toren pose as though he was each of the individuals remembered by the monument. Here’s one of those photos:

Toren posing as figures in the Reformation Wall.

While Toren played at the park near the Reformation Wall and Rosemary watched him, Debi and I jogged up to a nearby church where John Calvin used to preach, where she got a picture of me trying to gain entry:

No one answered. I guess no one is home?!? ;)

Before heading back for our last night in Switzerland, we stopped for a brief walk around downtown Geneva and got to see the Jet d’Eau and try out some more Swiss chocolate.

Toren and Rosemary at the Jet d’Eau in Geneva.

We then caught a train back to Lausanne to pack up for our flight home the next day.

Switzerland – Matterhorn and Zermatt

Our trip to Iceland occurred because I was presenting some research at a conference in Lausanne, Switzerland. We spent a week in Iceland before heading to Switzerland. We flew into Geneva then took a train to Lausanne, where we stayed in a nice apartment (AirBnB) with an amazing view of Lake Geneva.

The view from our AirBnB in Lausanne.

I really only got to spend two days doing touristy stuff in Switzerland – the day before the conference and the last day of the conference (which I skipped to go to CERN, ’cause it’s CERN). The day before the conference, we decided to head into the Swiss Alps to see the Matterhorn.

From Lausanne, it was a couple of hours on trains to get to Zermatt, which is the small town at the base of the Matterhorn. No cars are allowed in Zermatt, which is kind of nice. We walked from the train station through the town, snapping photos along the way:

The Matterhorn from Zermatt
Toren with the Matterhorn as backdrop in Zermatt.

We walked to one of the ski resorts (Zermatt ZBAG) and then bought tickets to the very top, Matterhorn Glacier Paradise. Matterhorn Glacier Paradise is a peak that has been tunneled into. Inside, they have built a restaurant, some rooms for museums and watching videos, and an entrance into the glacier that covers the mountain. Here are a few photos from inside the glacier:

Toren and Debi by an ice sculpture inside the glacier.
Debi and Toren (with Rosemary in the background) inside the glacier.

There is also a viewing spot on the top of the peak where you can actually look down on the Matterhorn. Here’s the view from there:

The Matterhorn from Matterhorn Glacier Paradise viewing platform.

And a photo of us on the viewing platform:

The three of us on the viewing platform on top of Matterhorn Glacier Paradise (Italy in the background).

We then got to walk out onto the glacier where we snapped a few photos:

Rosemary, Toren, and Debi on the glacier with the Matterhorn in the background.

On the way back down, we stopped to take a few more photos along the way.

Debi in front of the Matterhorn.
Toren, Rosemary, and Debi in front of the Matterhorn Glacier Paradise.
The three of us in front of the Matterhorn.

We got a later start than we hoped and ended up not having a lot of time on the mountain; otherwise, we would have done some hiking. Even so, it was a great initial exposure to the Swiss Alps.

After we took the lift back to Zermatt, we walked through the town looking for a place for dinner. Along the way, we were treated to this fun encounter with a bunch of goats.

We eventually found a fondue place. Toren, Rosemary, and I had cheese fondue (dipped bread and potatoes), while Debi went off in search of a chicken sandwich.

Toren and Rosemary enjoying Swiss fondue in Zermatt.

We found a creperie along the main walkway in Zermatt as well and decided we had to have crepes for dessert:

The train ride itself was quite scenic and took us through the southwestern portion of Switzerland. We ended up getting home quite late, but it was well worth the trip.

Iceland – final post – drone footage

I took my drone to Iceland with us. I knew that there were lots of places where I could fly the drone and it seemed like the ideal opportunity to take advantage of the drone to get shots we couldn’t otherwise get. Here’s my Iceland drone compilation: