R – create scatterplot with ggplot2

R has pretty amazing capabilities for creating charts and graphs. One of the most common packages for this is ggplot2. However, it’s not the most intuitive package I have used in R. So, I figured I’d illustrate how to make some relatively simple scatterplots in R using ggplot2. I’ll likely post instructions on how to create other graphs/charts in ggplot2 as well.

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

The point of a scatterplot is to visually examine the relationship between two variables. Since it’s well-known that there is a relationship between how much education someone has and how much money they make, let’s examine that relationship visually. The two variables in the GSS we’ll use are: EDUC (years of education ranging from 0 to 20, with 12 being a high school diploma and 16 a Bachelor’s degree) and REALRINC (the respondent’s income in 1986 dollars). Just because I want the resulting scatterplots to be more accurate, let’s go ahead and adjust the REALRINC into 2010 dollars first. Per this website, we need to multiply each income listed in the 2010 GSS by 1.989 to get 2010 dollars from 1986 dollars:

GSS2010$REALRINC2010 <- GSS2010$REALRINC * 1.989

Let’s just make sure the values make sense. To do that, first pull the first 10 incomes:

head(GSS2010$REALRINC, 10)

The “head” command tells R to simply list the first “10” values in the variable “REALRINC” and print them in the console. Here’s what I got as output:

[1] 42735.0 3885.0 NA NA NA 5827.5 42735.0 NA NA NA

We can check to see what the corresponding value in 2010 dollars is for the first value, $42,735, with the following command:

GSS2010$REALRINC2010[match(42735, GSS2010$REALRINC)]

The value I got was $84,999.92, which is $42,735 * 1.989. Groovy. Our formula worked.
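The spot check above looks at a single value. Here is a minimal sketch of checking every row at once. It uses a small made-up data frame (the name toy and its values are hypothetical stand-ins for GSS2010, not real GSS data):

```r
# Hypothetical stand-in for GSS2010; the incomes are illustrative values only
toy <- data.frame(REALRINC = c(42735, 3885, NA, 5827.5))

# Convert 1986 dollars to 2010 dollars with the same multiplier as above
toy$REALRINC2010 <- toy$REALRINC * 1.989

# Missing incomes should stay missing after the multiplication
missing_preserved <- identical(is.na(toy$REALRINC), is.na(toy$REALRINC2010))

# Every non-missing converted value should equal the original times 1.989
values_match <- isTRUE(all.equal(toy$REALRINC2010 / 1.989, toy$REALRINC))

print(toy)
```

If both missing_preserved and values_match come back TRUE, the conversion behaved the same way on every row, not just the first one.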

Now to create the scatterplot. Install ggplot2 (I’m using version 3.2.1) and load the library.

install.packages("ggplot2")
library(ggplot2)

Now begins the rather bizarre part of working with ggplot2 – building the commands to create the charts.

The basic code for creating plots in ggplot2 is odd. You create a base object that includes the variables that will be used for creating the chart like this:

scatterEDUCxINCOME <- ggplot(GSS2010, aes(EDUC, REALRINC2010))

The first portion, “scatterEDUCxINCOME” is the base object.

“ggplot” is the function from the ggplot2 package that creates the plot object.

In the parentheses is, first, the dataset (GSS2010), followed by the “aesthetic” components of the chart (“aes” stands for aesthetic). In this case, it’s the two variables we’re going to use: EDUC on the x axis and REALRINC2010 on the y axis.

This base object is really just the data for the scatterplot but not the scatterplot itself. To illustrate, you can tell R to display the base object:

scatterEDUCxINCOME

You’ll get some of the pieces of a scatterplot, but not the points:

To overlay the scatterplot onto the base object, you have to tell R what objects you want to be overlaid. Obviously, that should include the points, which would be done like this:

scatterEDUCxINCOME + geom_point()

“geom_point()” tells R to overlay the geometric points for the scatterplot onto the base layer. The parentheses are useful as they allow for additional commands or modifiers for the points, as I’ll illustrate below.

Two points to note with the code above. First, when you run it, it will generate the scatterplot (see image below).

Second, you can actually save this entire piece of code into the base object, like this:

scatterEDUCxINCOME <- scatterEDUCxINCOME + geom_point()

However, doing so converts your base object into the scatterplot with the geometric points included. Since you may want to modify your scatterplot with additional features, it’s generally a good idea to not do that. Keep your base object simple and overlay things on top of it as you go until you get to where you want to be.
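A minimal sketch of why keeping the base object separate is handy. The toy data frame and the object names (base, plain, fitted) are hypothetical and only stand in for GSS2010 and scatterEDUCxINCOME:

```r
library(ggplot2)

# Hypothetical toy data standing in for GSS2010
toy <- data.frame(EDUC = c(8, 12, 12, 16, 20),
                  REALRINC2010 = c(15000, 30000, 42000, 60000, 95000))

# Base object: data and aesthetics only, no layers yet
base <- ggplot(toy, aes(EDUC, REALRINC2010))

# Two different plots built from the same untouched base object
plain  <- base + geom_point()
fitted <- base + geom_point() + geom_smooth(method = "lm")

# Count the layers each object carries
n_base   <- length(base$layers)
n_plain  <- length(plain$layers)
n_fitted <- length(fitted$layers)
```

Because the base object was never overwritten, it still carries no layers, and each variant can be printed or modified without affecting the other.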

As noted, the geom_point() code can be modified. One of my favorite modifications is to add a little random noise (“jitter”) to the position of each point, which is helpful when you have a large dataset and variables that take only a limited set of values, like years of education. Otherwise, many of the points are stacked on top of each other. Here’s how:

scatterEDUCxINCOME + geom_point(position = "jitter")
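If you want more control over the jitter, a minimal sketch (again with a hypothetical toy data frame) is to pass position_jitter() instead of the string "jitter". The width, height, and seed values here are arbitrary choices for illustration:

```r
library(ggplot2)

# Hypothetical toy data: many respondents stacked on just two x values
toy <- data.frame(EDUC = rep(c(12, 16), each = 50),
                  REALRINC2010 = rep(c(30000, 60000), each = 50))

base <- ggplot(toy, aes(EDUC, REALRINC2010))

# Jitter horizontally only: spread points up to 0.3 years left or right,
# leave income untouched, and fix the seed so the plot is reproducible
jittered <- base +
  geom_point(position = position_jitter(width = 0.3, height = 0, seed = 1))
```

Setting height = 0 keeps the income values exactly where they are, so only the education axis gets the noise.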

Of course, a scatterplot should have labels – a title and axis labels. ggplot2 tries to add the axis labels based on the names of the variables, but you can overlay these as objects onto the base object as well, like this:

scatterEDUCxINCOME + geom_point() +
ggtitle("Scatterplot of Educational Attainment and Individual Income") +
labs(x = "Educational Attainment (in years)", y = "Individual Income")

ggtitle adds a title to the chart.

labs(x = “”, y = “”) adds the labels for the x and y axes with the labels inside the quotes.

Two additional features are helpful in a basic scatterplot. The first is a line that illustrates the relationship, and I’ll show two versions of that. To start, a standard regression line is always nice:

scatterEDUCxINCOME + geom_point() + geom_smooth(method = "lm", color = "red", alpha = 0.2, fill = "blue") +
ggtitle("Scatterplot of Educational Attainment and Individual Income") +
labs(x = "Educational Attainment (in years)", y = "Individual Income")

The code to add a smoothing line is geom_smooth. In this case, I modified that line to make it linear (method = “lm”), made it red (color = “red”), and changed the fill of the confidence band around the line (fill = “blue”). The last modification was to change the transparency of that band (alpha = 0.2).

Other smoothing options are available. The default fits a flexible smoothed curve rather than a straight line (ggplot2 picks loess for fewer than 1,000 observations and a GAM for larger datasets):

scatterEDUCxINCOME + geom_point() + geom_smooth() +
ggtitle("Scatterplot of Educational Attainment and Individual Income") +
labs(x = "Educational Attainment (in years)", y = "Individual Income")
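To compare the two kinds of line side by side, here is a minimal sketch that names the smoothing method explicitly rather than relying on the default. The toy data and object names are hypothetical, and se = FALSE simply switches off the shaded band:

```r
library(ggplot2)

# Hypothetical toy data standing in for GSS2010
toy <- data.frame(EDUC = c(6, 8, 10, 12, 12, 14, 16, 16, 18, 20),
                  REALRINC2010 = c(12000, 15000, 21000, 30000, 34000,
                                   41000, 58000, 63000, 80000, 99000))

points <- ggplot(toy, aes(EDUC, REALRINC2010)) + geom_point()

# Straight regression line, confidence band switched off
linear <- points + geom_smooth(method = "lm", se = FALSE)

# Local (loess) curve requested explicitly instead of relying on the default
curved <- points + geom_smooth(method = "loess", se = FALSE)
```

Printing linear and then curved shows how much the straight line hides when the relationship bends.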

Finally, the scatterpoints can be modified in a variety of ways (e.g., thickness, transparency, color, etc.). Here’s a simple illustration:

scatterEDUCxINCOME + geom_point(position = "jitter", size = 4, colour = "green")

The points options that can be modified are detailed here.

A script file with the above commands is available here.

NOTE: This was done in R version 3.5.2.

R – find cases (rows) that match specific criteria

I regularly need to find a specific case or set of cases that meet some criteria when analyzing data, often so I can modify those values for one reason or another. The easiest way I have found to find such values in R is the “which” function.

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

In the 2010 GSS there is a variable for race (RACE). The options are: 1 = WHITE, 2 = BLACK, 3 = OTHER. To find all of the cases with a “3” in the dataset, I would use the following code:

which(GSS2010$RACE == "3", arr.ind = TRUE)

Here’s what the command is doing…

“which” is the function that tells R to search for information that meets the criteria detailed in the parentheses.

GSS2010 is the name of the dataset.

RACE is the name of the variable in the dataset. By including the name of the variable, we restrict R to searching just inside that variable rather than the whole dataset. (The $ tells R that RACE is a variable inside the GSS2010 dataset.)

The “==” tests whether two values are equal in R (a single “=” would be an assignment).

The target value, which can be text, characters, or numbers, goes inside the quotes. In this case, we wanted to find all of the cases with the number “3” which is code for “OTHER.”

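arr.ind = TRUE tells R to return array indices (row and column positions) when the input is a matrix or array. GSS2010$RACE is a plain vector, so here it has no effect and can be omitted.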

In the 2010 GSS, if you type in the code above, you’ll get a list like this:

 [1]    1   12   14   19   27   28   42   44   46   50   64   73   96   97  101  102  119  120  121  123  124  130  133  140  145  147  151  152
 [29]  153  154  159  161  180  185  190  194  195  199  211  213  217  220  230  245  263  275  278  287  288  295  297  301  305  312  314  318
 [57]  333  339  345  348  349  371  373  381  403  420  441  446  458  464  465  473  475  477  478  479  483  489  495  501  505  507  508  520
 [85]  550  554  561  564  567  591  593  631  704  712  713  715  716  732  741  749  770  776  792  793  805  807  823  824  829  877  901  956
[113]  973 1092 1112 1125 1186 1193 1218 1224 1264 1281 1291 1304 1307 1331 1336 1342 1345 1347 1352 1355 1356 1364 1365 1411 1423 1440 1441 1442
[141] 1444 1445 1446 1449 1451 1513 1523 1526 1528 1532 1534 1547 1550 1552 1556 1557 1559 1562 1564 1567 1568 1569 1570 1571 1572 1573 1574 1575
[169] 1576 1656 1660 1735 1764 1913 1933 1935 1993 2010 2011 2018 2019 2022 2038

The [1] is indicating that this is the first response. The [29] indicates that the next number is the 29th response. The numbers after the brackets (“[ ]”) indicate the row where that response was found. Thus, I know that the 161st row in my dataset has the value 3 in the variable RACE.

We can check this by using the following code:

GSS2010[161,"RACE"]

The result should be:

[1] 3
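To make it clearer what “which” is doing under the hood, here is a minimal sketch with a short made-up vector (the values are hypothetical, not GSS data). It also shows the one situation where arr.ind changes the result:

```r
# Hypothetical race codes, including a missing value
race <- c(3, 1, NA, 3, 2)

# The comparison yields TRUE/FALSE/NA for every element
is_other <- race == 3

# which() keeps only the positions that are TRUE and drops the NA
hits <- which(is_other)
print(hits)

# arr.ind only matters for matrices/arrays, where it returns row/column pairs
m <- matrix(c(3, 1, 2, 3), nrow = 2)
cells <- which(m == 3, arr.ind = TRUE)
print(cells)
```

The vector result is just positions 1 and 4. The matrix result has one row per match, with a row column and a col column.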

BONUS:

Should you want to modify the value for an individual observation, like the one we just examined, you could use the following code:

GSS2010[161,"RACE"] <- 2

This would change the value for that case from “3” (OTHER) to “2” (BLACK). I’m not really sure why you would want to do this in this instance, but, now you can. (There is a scenario when you might, but there are better ways to recode data.) Basically, the “<-” tells R to set the value of that specific observation to 2, overwriting the 3 that was there.

And if you wanted to change all of the values from 3 to 2, since you have a massive list, the easiest way would be to save all those values as a list, then have R change all the values in one fell swoop, like this:

OTHERRACELIST <- which(GSS2010$RACE == "3", arr.ind = TRUE)
GSS2010[c(OTHERRACELIST), "RACE"] <- 2

The first command creates an object in your environment called “OTHERRACELIST” that holds the row numbers of all the cases with a 3 (technically it is an integer vector rather than an R list). The second command then tells R to look inside the GSS2010 dataset, use those row numbers to find the rows you want changed, and set the RACE variable in those rows to 2. That changes the code for all of the people coded as “3” into a “2.” (OTHERRACELIST is already a vector, so wrapping it in c() is harmless but not required.)
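Here is a minimal sketch of the same two-step recode on a made-up data frame that includes a missing value. The name toy, the name other_rows, and the values are hypothetical. Using which() is useful here because it leaves the NA row out of the row numbers:

```r
# Hypothetical data: five cases, one with a missing race code
toy <- data.frame(ID = 1:5, RACE = c(1, 3, 2, NA, 3))

# Row numbers where RACE equals 3; the NA row is left out
other_rows <- which(toy$RACE == 3)

# Recode only those rows from 3 to 2
toy[other_rows, "RACE"] <- 2

print(toy$RACE)
```

After the recode, rows 2 and 5 hold a 2, and the missing value in row 4 is still missing.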

A script file with the above commands is available here.

NOTE: This was done in R version 3.5.2.

R – delete one or several variables in a dataset

I regularly create variables while analyzing data and then find that I need to delete a variable I created. At times, I just want to get rid of a variable in a dataset (’cause screw that variable). This short tutorial will explain how to delete a variable (or multiple variables if needed).

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

To completely remove a variable from a dataframe, you need to tell R to copy the dataframe minus the variable you want to delete. Here’s the code:

GSS2010 <- subset(GSS2010, select = -(OCC))

Here is what the code above does…

GSS2010 is the name of the dataset. Typically, when I use the subset function, I do so to create a different dataset. However, in this case I actually want to overwrite the dataset, so I’m actually naming the new dataset the same thing as the old dataset, which, effectively, overwrites the dataset, getting rid of the unwanted variables in the process.

The “subset” function tells R that you want to take part of an existing dataset. It’s a very useful function for selecting, for instance, all the men in a sample or all of the people who live in a certain region.

After the “subset” function, inside the parentheses, is the name of the dataset from which we are taking a subset, GSS2010. After the comma inside the parentheses is code to tell R how to select the subset. In this case, we use the “select =” argument to tell R which variables we want. However, the “-” before “(OCC)” actually tells R to select all the other variables BUT not the OCC variable for the subset. Thus, “-(OCC)” tells R to select the entire dataframe except the variable OCC for the subset. In effect, OCC is deleted but, to get there, you actually have to tell R to keep everything but that (pretty stupid, honestly).

NOTE: This is an instance in R when you don’t need to put the name of the variable in quotes (e.g., (“OCC”)) nor do you need to indicate which dataset the variable is in (e.g., (GSS2010$OCC)) since the dataset is already referenced in the subset command.

To remove multiple variables at the same time, the above command can be modified slightly to include other variables by putting them into a vector:

GSS2010 <- subset(GSS2010, select = -c(YEAR, WRKSTAT))

By changing what comes after the “select =” component in the parentheses to a vector (c indicates a vector in R), you can indicate multiple variables that you want deleted from the dataset in one command. Thus, in the above code, the variables YEAR and WRKSTAT would both be deleted from the dataset.

Because it is R, there is always another way. Here are two alternative lines of code that will do the same thing, the first is for removing a single variable and the second is for removing multiple variables:

GSS2010 <- GSS2010[,-match(c("EVWORK"), names(GSS2010))]
GSS2010 <- GSS2010[,-match(c("EVWORK", "PRESTIGE"), names(GSS2010))]

The logic in the above code is very similar, using the “match” command instead of subset.
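One thing to watch with the “match” version is that a misspelled or missing variable name produces an NA index, which can make the command fail. A minimal sketch of a more forgiving alternative, using %in% on a made-up data frame (toy, drop_these, and “NOTAVAR” are hypothetical names for illustration):

```r
# Hypothetical data frame with three variables
toy <- data.frame(ID = 1:3, EVWORK = c(1, 2, 1), PRESTIGE = c(40, 55, 62))

# Names to drop; "NOTAVAR" is deliberately absent from the data frame
drop_these <- c("EVWORK", "NOTAVAR")

# Keep every column whose name is not in drop_these;
# drop = FALSE keeps the result a data frame even if one column remains
toy <- toy[, !(names(toy) %in% drop_these), drop = FALSE]

print(names(toy))
```

Names that are not in the data frame are simply ignored, so only EVWORK is removed here.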

You may find some tutorials that suggest you can remove a variable from a dataframe/dataset using the following code:

GSS2010$GOD <- NULL

This command does remove the variable: assigning NULL to a column of a data frame drops that column entirely, so GOD no longer appears in names(GSS2010). It is the shortest way to delete a single variable. I still reach for the subset approach when I want to drop several variables in one command.
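To see what the NULL assignment does to the list of variables, here is a quick check on a made-up data frame (toy and its values are hypothetical):

```r
# Hypothetical data frame with two variables
toy <- data.frame(ID = 1:3, GOD = c(1, 6, 3))

# Assign NULL to the variable
toy$GOD <- NULL

# Inspect what is left
remaining <- names(toy)
still_there <- "GOD" %in% names(toy)
print(remaining)
print(still_there)
```

On the toy data, only ID remains and still_there is FALSE, so the column is gone rather than left behind empty.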

Here’s a script file for these commands.

NOTE: This was done in R version 3.5.2.

R – create variable filled with zeros

I ran into a situation where I needed to add a variable to a dataset. I knew that I was then going to modify some of the values in the variable, but most of the values were going to be zeros. So, I wanted to create a new variable and fill it with all zeros.

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

Here’s the code I used to create the variable:

GSS2010$TEMPANALYSIS <- replicate(2044, 0)

Here is what the code above does…

GSS2010 is the name of the dataset into which I wanted to create the variable. In this case, it is a copy of the 2010 wave of the GSS.

TEMPANALYSIS is what I called the variable. (The “$” tells R that it is a variable in the dataset.)

The “replicate” function tells R to replicate the second value in the parentheses (0) the number of times noted as the first value in the parentheses (2044). I used 2,044 because that is how many cases there are in the dataset. You can obviously adjust the value for the number of cases in your dataset/dataframe. If you have 320 cases, adjust it to 320.

If the number of values doesn’t match the number of cases (and doesn’t divide evenly into it), you’ll get an error like this:

Error in `$<-.data.frame`(`*tmp*`, TEMPANALYSIS, value = c(0, 0, 0, 0,  : replacement has 2042 rows, data has 2044

That error is saying that you tried to add a variable but R needs to know what to put in every one of the rows and since it is short 2 rows, it can’t do it.

Of course, with R, there is always another way to do something. Here’s an alternative command that will do the same thing:

GSS2010$TEMPANALYSIS2 <- rep(0, times=2044)

I won’t repeat the description of the dataset and variable but will detail what the rest of the code is doing.

“rep” tells R to repeat the first value in the parentheses (0) the number of times specified as the second number in the parentheses (2044; technically, the “times=” portion is not required).
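To avoid typing the number of cases by hand, a minimal sketch is to ask R for it with nrow(). The toy data frame below is a hypothetical stand-in for GSS2010:

```r
# Hypothetical data frame with five cases
toy <- data.frame(ID = 1:5)

# Ask the data frame how many rows it has instead of typing the number
toy$TEMPANALYSIS <- rep(0, times = nrow(toy))

# A single value is recycled to every row, so this gives the same result
toy$TEMPANALYSIS2 <- 0

print(toy)
```

Either line keeps working if the number of cases in the dataset changes later.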

Here’s a script file for these commands.

NOTE: This was done in R version 3.5.2.