R (Linux) – creating a wordcloud from PDF

On my professional website, I use wordclouds from the text of my publications as the featured images for the posts where I share the publications. I have used a website to generate those wordclouds for quite a while, but I’m trying to learn how to use the R statistical environment and knew that R can generate wordclouds. So, I thought I’d give it a try.

Here are the steps to generating a wordcloud from the text of a PDF using R.

First, in R, install the following four packages: “tm”, “SnowballC”, “wordcloud”, and “readtext”. This is done by typing the following into the R terminal:

install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("readtext")

Next, you need to load those packages into the R environment. This is done by typing the following in the R terminal:

library(tm)
library(SnowballC)
library(wordcloud)
library(readtext)
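If some of the four packages might already be installed, a small check can avoid reinstalling them. The following is a minimal sketch using only base R; the helper name missing_packages is made up for this example and is not part of any package.

```r
# Return the names in `pkgs` that are not installed in the current R library.
# (missing_packages is a hypothetical helper written for this post.)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tm", "SnowballC", "wordcloud", "readtext")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the check by itself never changes your system.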

Before we begin creating the wordcloud, we have to get the text out of the PDF file. To do this, first find out where your “working directory” is. The working directory is where the R environment will be looking for and storing files as it runs. To determine your “working directory,” use the following function:

getwd()

There are no arguments for this function. It will simply return where the R environment is currently looking for and storing files.

You’ll need to put the PDF from which you want to extract data into your working directory or change your working directory to the location of your PDF (technically, you could just include the path, but putting it in your working directory is easier). To change the working directory, use the “setwd()” function. Like this:

setwd("/home/ryan/RWD")
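Before trying to read the PDF, it can help to confirm that R actually sees it in the working directory. Here is a small base R sketch; the helper name pdf_in_dir is invented for illustration, and "paper.pdf" is just the example file name used in this post.

```r
# Report whether a file with the given name sits in a directory
# (the current working directory by default).
# pdf_in_dir is a hypothetical helper, not a built-in function.
pdf_in_dir <- function(file_name, dir = getwd()) {
  file.exists(file.path(dir, file_name))
}

print(getwd())                  # where R is currently looking
print(pdf_in_dir("paper.pdf"))  # TRUE if the example PDF is there
print(list.files(getwd(), pattern = "\\.pdf$", ignore.case = TRUE))  # all PDFs found
```

If the check comes back FALSE, either move the PDF or call setwd() again with the right path.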

Once you have your PDF in your working directory, you can use the readtext package to extract the text and put it into a list. You can do that using the following command:

wordbase <- readtext("paper.pdf")

“wordbase” is a variable I’m creating to hold the text from the PDF. “readtext” is the function (from the package of the same name) that extracts the text from the PDF. The readtext package is robust enough to extract text from numerous document types (see here) and can even determine what kind of document it is from the file extension; in this case, it recognizes that it’s a PDF.
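It is worth a quick look at what came back before going further. The sketch below uses a stand-in data frame rather than a real PDF, on the assumption that a readtext result is a data frame with a doc_id column and a text column; the name wordbase_example and its contents are made up for illustration.

```r
# Stand-in for a readtext result: a data frame with doc_id and text columns.
# (Assumption for illustration; the real object comes from readtext("paper.pdf").)
wordbase_example <- data.frame(
  doc_id = "paper.pdf",
  text = "Example text extracted from the PDF.",
  stringsAsFactors = FALSE
)

str(wordbase_example)                         # shows the columns and their types
print(nchar(wordbase_example$text))           # how many characters were extracted
print(substr(wordbase_example$text, 1, 20))   # peek at the first 20 characters
```

Running the same three checks on the real wordbase should show whether any text was extracted at all. A very small character count may mean the PDF is a scan without a text layer.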

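The result can now be converted into a corpus, the structure the “tm” package uses to hold text documents (see here for the different data types in R). readtext returns a data frame, with the extracted text in a column named “text”, so we pass just that column. To do this, we use the following function:

corp <- Corpus(VectorSource(wordbase$text))

In essence, we’re creating a new variable, “corp,” by using the Corpus function that calls the VectorSource function and applies it to the text stored in the variable “wordbase.” (Passing the whole data frame would treat each column, including the file name, as a separate document.)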

We’re close to having the words ready to create the wordcloud, but it’s a good idea to clean up the corpus with several commands from the “tm” package. First, we want to make sure the corpus is plain text:

corp <- tm_map(corp, PlainTextDocument)

Next, since we don’t want any of the punctuation included in the wordcloud, we remove the punctuation with this function from “tm”:

corp <- tm_map(corp, removePunctuation)

For my wordclouds, I don’t want numbers included. So, use this function to remove the numbers from the corpus:

corp <- tm_map(corp, removeNumbers)

Finally, I’m not interested in words like “the” or “a”, so I removed all of those words using this function:

corp <- tm_map(corp, removeWords, stopwords('english'))
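To see what these cleaning steps do to the text, here is a rough base R imitation applied to one sample sentence. It is not how “tm” works internally; the function clean_text, the sample sentence, and the short stopword list are all made up for this sketch.

```r
# Base R stand-ins for the tm cleaning steps, applied to one sample string.
sample_text <- "The 3 cats sat on the mat, and the dog barked at 2 birds."
small_stopwords <- c("the", "a", "and", "on", "at")  # hypothetical short list, not tm's

clean_text <- function(text, stop_words) {
  text <- gsub("[[:punct:]]+", "", text)        # roughly what removePunctuation does
  text <- gsub("[[:digit:]]+", "", text)        # roughly what removeNumbers does
  words <- strsplit(text, "[[:space:]]+")[[1]]  # split into individual words
  words <- words[nzchar(words)]                 # drop any empty strings
  # Drop stopwords; comparing in lowercase is a choice made in this sketch.
  words[!(tolower(words) %in% stop_words)]
}

cleaned <- clean_text(sample_text, small_stopwords)
print(cleaned)
```

What is left after cleaning is the set of words the wordcloud will count.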

At this point, you’re ready to generate the wordcloud. What follows is a wordcloud command, but it will generate the wordcloud in a window and you’ll then have to do a screen capture to turn the wordcloud into an image. Even so, here is the basic command:

wordcloud(corp, max.words = 100, random.order = FALSE)

To explain the command, “wordcloud” is both the package and the function. “corp” is the corpus containing all the words. The other components of the command are parameters that can, of course, be adjusted. “max.words” can be increased or decreased to reflect the number of words you want to include in your wordcloud. “random.order” should be set to FALSE if you want the more frequently occurring words to be in the center with the less frequently occurring words surrounding them. If you set that parameter to TRUE, the words will be placed in random order.

There are additional parameters that can be added to the wordcloud command, including a scale parameter (scale) that adjusts the relative sizes of the more and less frequently occurring words, a minimum frequency parameter (min.freq) that will limit the plotted words to only those that occur a certain number of times, and a parameter for what proportion of words should be rotated 90 degrees (rot.per). Other parameters are detailed in the wordcloud documentation here.

One of the more important parameters that can be added is color (colors). By default, wordclouds are black letters on a white background. If you want the word color to vary with the frequency, you need to create a variable that tells the wordcloud function how many colors you want and from which color palette. A number of color palettes are pre-defined in the RColorBrewer package, which is loaded along with wordcloud (see here). Here’s a sample command to create a color variable that can be used with the wordcloud package:

color <- brewer.pal(8, "Spectral")

The parameters in the parentheses indicate first, the number of colors desired (8 in the example above), and second, the palette title from the list noted above. Generating the wordcloud with the color palette applied involves adding one more argument to the command:

wordcloud(corp, max.words = 100, min.freq=15, random.order = FALSE, colors = color, scale=c(8, .3))
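Picking a sensible min.freq is easier if you know how often words actually occur. The sketch below counts words in a small made-up vector using only base R; the words and the threshold of 2 are placeholders, and with a real document you would count the cleaned words instead.

```r
# Count how often each word occurs and see which ones pass a frequency cutoff.
# The word list here is invented for illustration.
words <- c("religion", "data", "religion", "survey", "data", "religion", "growth")

freq <- sort(table(words), decreasing = TRUE)  # most frequent words first
print(freq)

min_freq <- 2                                  # placeholder threshold
kept <- names(freq)[freq >= min_freq]          # words that occur at least min_freq times
print(kept)
```

If very few words pass the cutoff, lowering min.freq in the wordcloud command will let more of them be plotted.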

Finally, if you want to output the wordcloud as an image file, you can adjust the command to generate the wordcloud as, for instance, a PNG file. First, tell R to create the PNG file:

png("wordcloud.png", width=1280,height=800)

The text in quotes is the name of the PNG file to be created. The other two arguments set the width and height of the PNG in pixels. Then create the wordcloud with the parameters you want:

wordcloud(corp, max.words = 100, random.order = FALSE, colors = color, scale=c(8, .3))

And, finally, close the PNG device, which writes the wordcloud just created to the file, with this function:

dev.off()

If all goes according to plan, you will have created a PNG file with a wordcloud of your cleaned up corpus of text.
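It is easy to forget dev.off(), which leaves the PNG file incomplete. One way around that is to wrap the three steps in a small function. This is only a sketch: save_plot_png is a name invented here, and the example draws a plain placeholder plot so that it runs without the wordcloud package.

```r
# Open a PNG device, run the drawing code, and always close the device afterward.
# save_plot_png is a hypothetical helper, not part of R or the wordcloud package.
save_plot_png <- function(path, draw, width = 1280, height = 800) {
  png(path, width = width, height = height)
  on.exit(dev.off(), add = TRUE)  # runs even if the drawing code fails
  draw()
  invisible(path)
}

# Placeholder drawing; with the packages loaded you could instead pass
# function() wordcloud(corp, max.words = 100, random.order = FALSE)
out_file <- file.path(tempdir(), "example.png")
save_plot_png(out_file, function() plot(1:10, main = "placeholder plot"))
print(file.exists(out_file))
```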

 

 

Source Information:

Reading PDF files into R for text mining

Building Wordclouds in R

Word cloud in R

R (Linux) – basic installation

Installing the R programming environment on Linux is pretty straightforward, but it does require a little bit of know-how to find the correct packages. As is typically the case with Linux, there are multiple ways to get things done. I like to use Synaptic for installing and removing software, but you can also use the software manager that comes with your Linux distribution (in Linux Mint it’s called Software Manager) or the command line (in KDE-based distributions, Konsole).

For the most up-to-date installation of R, it’s actually best to install directly from the R repository. A list of Linux repositories for the R environment is located here. In order to install from the repository, you need to update your list of repositories in Synaptic. To access your repository list in Synaptic, click on Settings -> Repositories.

In the new Software Sources window, click on “Additional repositories” and you’ll get this window:

Click on Add a new repository. You’ll get this window:

The exact information you put into that window will vary based on which mirror you chose. Here is what I added in mine:

deb https://cran.cnr.berkeley.edu/bin/linux/ubuntu xenial/

In order to ensure you have the right files and to follow best security practices, you should install the signing key as well. Directions for installing the signing key are found here, but it can be done with a simple command from a terminal:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

Once you have done all of that, you can install R from Synaptic.

First, open Synaptic, which will require your password. You’ll get the basic Synaptic Package Manager window:

Next, in the search box, search for “r-base”. Right-click it and select “Mark for installation” to install “r-base”:

In the above screenshot, I have already installed r-base, so the option “Mark for installation” is greyed out. When you select this option, Synaptic will automatically select all the other necessary packages (there are more than a dozen additional packages necessary for R to run: r-cran-class, r-cran-lattice, r-cran-spatial, r-cran-survival, r-cran-codetools, r-cran-nnet, r-cran-mass, r-cran-boot, r-cran-nlme, r-cran-rpart, r-cran-cluster, r-cran-kernsmooth, r-cran-foreign, r-cran-mgcv, r-cran-matrix, r-recommended, r-base-core).

If you plan on installing any other R packages, it’s not a bad idea to also install “r-base-dev,” as it helps fill in dependencies for other packages.

Once you’ve selected r-base, hit Apply in Synaptic and all the software will be installed.

You now have the base software for R installed.

To open the R environment in a terminal, launch a terminal and simply type “R” at the prompt, like this:

Here’s where things can get a little complicated. To do different things in R requires various libraries or packages. Some of these can be installed using the R terminal while others need to be installed from your Linux distribution’s repositories. To install a library or package using the R terminal, you use the following command once you have opened the R environment:

install.packages("PACKAGENAME")

The first time you run this, the R environment will ask you to select a mirror.

Choose one close to your location. R will then install the package, assuming you type everything correctly.
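If you would rather not be asked for a mirror each session, you can set one yourself before installing anything. A minimal sketch follows; the URL is the CRAN "cloud" mirror, used here only as an example, and you can substitute any mirror you prefer.

```r
# Set the CRAN mirror for this session so install.packages() does not prompt.
# The URL is an example choice; any CRAN mirror URL can be used instead.
options(repos = c(CRAN = "https://cloud.r-project.org"))
print(getOption("repos"))
```

This only lasts for the current R session unless you add the same line to your R startup file.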

If you run into an error message, there are several possibilities. First, check to make sure you typed everything correctly. R is not forgiving on spelling mistakes. Second, if the error is something like:

installation of package ‘PACKAGENAME’ had non-zero exit status

or

dependency ‘PACKAGENAME’ is not available

then there is a good chance that you need to install a package or library using Synaptic (or from a terminal using apt). For instance, the “tm” package has an unsatisfied dependency (meaning, a library or package that needs to be installed but cannot be installed using the R installer). The dependency is the ‘slam’ package. This can be installed using Synaptic (or, from a terminal, using the command “sudo apt-get install r-cran-slam”). Once you’ve installed the dependency, try re-installing the package and the error messages should go away.

Iceland – final post – drone footage

I took my drone to Iceland with us. I knew that there were lots of places where I could fly the drone and it seemed like the ideal opportunity to take advantage of the drone to get shots we couldn’t otherwise get. Here’s my Iceland drone compilation:

Iceland – Day 7 – The Golden Circle: Gullfoss, Geysir, Strokkur, and Þingvellir National Park

Debi, Toren, and Ryan at Þingvellir National Park

We saved some of the most visited sites for our last day in Iceland. Lots of buses take tourists to visit three sights in a single day: Gullfoss, Geysir, and Þingvellir National Park. This is often referred to as The Golden Circle as you can include Seljalandsfoss and actually make it into a circle. Since we had already visited Seljalandsfoss, we headed straight to Gullfoss.

Gullfoss is a very powerful waterfall with two levels.

To get a good view of how tall the lower falls are, you need to hike up a bit so you can see down into the trench it has carved.

Toren, Debi, and Ryan at Gullfoss

Just down the road from Gullfoss are two geysers, Geysir and Strokkur. Geysir was the first geyser to be documented by modern Europeans and is the source of the English word “geyser.” Geysir no longer regularly erupts, but Strokkur does every few minutes.

We walked around the geysers for a bit and watched several eruptions, then jumped back in the car and headed to our final destination for the day, Þingvellir National Park. Þingvellir is cool for a lot of reasons. First, it was the original seat of Iceland’s Parliament and an important meeting place for the various tribes of Iceland for a long time. Second, it is the location where two continental plates are separating by about 2 centimeters per year, and you can literally see the result as the area is being pulled apart. There is a large canyon you can walk down that is the result of tectonic plates moving. You can see the canyon in this photosphere:

Here’s another photosphere from the Parliament rock, where the laws used to be read:

We spent a couple of hours here walking around the lake, streams, the church, and the canyon.

Ryan, Debi, and Rosemary at Þingvellir National Park

Here’s a short clip of a waterfall that drops right into the canyon:

And a photo of us in front of the waterfall:

Debi, Toren, and Ryan at Þingvellir National Park

We actually had big plans for this evening – it was time to try Icelandic cuisine. We made a reservation for a nice restaurant in Reykjavik, Þrír frakkar, where they serve traditional Icelandic fare. We ordered three appetizers and two entrees to split between the four of us. First up, fermented shark:

fermented shark

Everyone but Debi was able to get their piece of frozen, fermented shark down. Debi gagged on hers. Imagine the most fishy tasting fish you’ve ever had, then leave it to spoil for, let’s say, a week. Then freeze it. That’s what fermented shark tastes like. Not a winner.

Next up was, sadly, puffin breast:

puffin

We asked on our whale and puffin viewing trip if puffins were endangered and they said no, so I didn’t feel bad ordering this. It’s basically thin strips of puffin breast, perhaps lightly cooked, served with a mustard sauce. It tasted kind of like chicken, but more oily and stringy. Everyone tried it, but I ended up eating most of it.

We also ordered fish stew as an appetizer, which wasn’t particularly exotic, and most everyone liked it. For the entrees, it was a lamb steak (split between Debi and Rosemary) and a horse steak (split between Toren and me). The steaks were all good; horse tastes a lot like cow.

Dinner was crazy-expensive, but we got to sample the local cuisine.

After dinner, we headed back to our B&B to pack up and get ready for our early flight the next day. We did stop briefly at the park near our B&B to let Toren run around a bit, but otherwise that pretty much wraps up our trip to Iceland. Though, see my next post where I highlight one other thing we did while we were there…

Iceland – Day 6 – Deildartunguhver, Hraunfossar, and Barnafossar

Our goal on day 6 was to make it from Akureyri back to Reykjavik, while doing a little sightseeing along the way. I found three things that looked cool in Western Iceland (that we could do on the way), but we only managed to visit two of them. One location I wanted to visit, Grábrók, we couldn’t find. Google Maps sent us off on a really sketchy dirt road that we should never have taken. It was a single-lane road with big potholes, cliff edges, rocks, and all the fun stuff that would be great in a large SUV, but not so much in a small, low-to-the-road wagon.

With that side adventure out of the way, our first stop was Deildartunguhver, which is another spot with volcanic activity. This location was pretty cool as it had lots of boiling water and sulphur vents, but was also a location where the country had tapped into the geothermal energy and was using it to heat water.

After Deildartunguhver, we headed to Hraunfossar and Barnafossar, another set of two waterfalls. The first set of waterfalls, Hraunfossar, kind of drizzles out of the side of a cliff, which you can see in the background of this photo.

Ryan, Toren, and Debi in front of Hraunfossar

The second set of waterfalls, Barnafossar, which are about 100 meters up the river, have carved through rock and formed an arch, as seen in this video:

Both were quite beautiful.

From Hraunfossar and Barnafossar, we opted to take the new Hvalfjörður Tunnel, which drops under a channel by going under the seabed (541 feet below sea level). This cuts about 45 minutes off the time to get to Reykjavik and costs about $10.00. It’s deep enough that your ears pop as you drive underneath the sea. Pretty cool to say we have now driven under the ocean!

We had two nights scheduled in a bed and breakfast in Seltjarnarnes, which is the tail end of the peninsula where Reykjavik lies. That wrapped up day 6.