R – find cases (rows) that match specific criteria

I regularly need to find a specific case or set of cases that meet some criteria when analyzing data, often so I can modify those values for one reason or another. The easiest way I have found to find such values in R is the “which” function.

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

In the 2010 GSS there is a variable for race (RACE). The options are: 1 = WHITE, 2 = BLACK, 3 = OTHER. To find all of the cases with a “3” in the dataset, I would use the following code:

which(GSS2010$RACE == "3", arr.ind = TRUE)

Here’s what the command is doing…

which” is the function that tells R to search for information that meets the criteria detailed in the parentheses.

GSS2010 is the name of the dataset.

RACE is the name of the variable in the dataset. By including the name of the variable, we restrict R to searching just inside that variable rather than the whole dataset. (The $ tells R that RACE is a variable inside the GSS2010 dataset.

The “==” indicates “equals” in R.

The target value, which can be text, characters, or numbers, goes inside the quotes. In this case, we wanted to find all of the cases with the number “3” which is code for “OTHER.”

arr.ind = TRUE tells R to include the index if the result is an array.

In the 2010 GSS, if you type in the code above, you’ll get a list like this:

 [1]    1   12   14   19   27   28   42   44   46   50   64   73   96   97  101  102  119  120  121  123  124  130  133  140  145  147  151  152
 [29]  153  154  159  161  180  185  190  194  195  199  211  213  217  220  230  245  263  275  278  287  288  295  297  301  305  312  314  318
 [57]  333  339  345  348  349  371  373  381  403  420  441  446  458  464  465  473  475  477  478  479  483  489  495  501  505  507  508  520
 [85]  550  554  561  564  567  591  593  631  704  712  713  715  716  732  741  749  770  776  792  793  805  807  823  824  829  877  901  956
[113]  973 1092 1112 1125 1186 1193 1218 1224 1264 1281 1291 1304 1307 1331 1336 1342 1345 1347 1352 1355 1356 1364 1365 1411 1423 1440 1441 1442
[141] 1444 1445 1446 1449 1451 1513 1523 1526 1528 1532 1534 1547 1550 1552 1556 1557 1559 1562 1564 1567 1568 1569 1570 1571 1572 1573 1574 1575
[169] 1576 1656 1660 1735 1764 1913 1933 1935 1993 2010 2011 2018 2019 2022 2038

The [1] is indicating that this is the first response. The [29] indicates that the next number is the 29th response. The numbers after the brackets (“[ ]”) indicate the row where that response was found. Thus, I know that the 161st row in my dataset has the value 3 in the variable RACE.

We can check this by using the following code:

GSS2010[161,"RACE"]

The result should be:

[1] 3

BONUS:

Should you want to modify the value for an individual observation, like the one we just examined, you could use the following code:

GSS2010[161,"RACE"] <- 2

This would change the values for that case from “3” (OTHER) to “2” (BLACK). I’m not really sure why you would want to do this in this instance, but, now you can. (There is a scenario when you might, but there are better ways to recode data.) Basically, the “<-” tells R to set the value of that specific observation to 2, overwriting the 3 that was there.

And if you wanted to change all of the values from 3 to 2, since you have a massive list, the easiest way would be to save all those values as a list, then have R change all the values in one fell swoop, like this:

OTHERRACELIST <- which(GSS2010$RACE == "3", arr.ind = TRUE)
GSS2010[c(OTHERRACELIST), "RACE"] <- 2

The above two commands would create a list in your environment called “OTHERRACELIST” that includes all of the row numbers of the cases with a 3. The second command then tells R to look inside the GSS2010 dataset and use the list (c(OTHERRACELIST)) to find all the rows you want changed in the RACE variable to “2.” That will then change the code for all of the people coded as “3” into a “2.”

A script file with the above commands is available here.

R – delete one or several variables in a dataset

I regularly create variables while analyzing data and then find that I need to delete a variable I created. At times, I just want to get rid of a variable in a dataset (’cause screw that variable). This short tutorial will explain how to delete a variable (or multiple variables if needed).

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

To completely remove a variable from a dataframe, you need to tell R to copy the dataframe minus the variable you want to delete. Here’s the code:

GSS2010 <- subset(GSS2010, select = -(OCC))

Here is what the code above does…

GSS2010 is the name of the dataset. Typically, when I use the subset function, I do so to create a different dataset. However, in this case I actually want to overwrite the dataset, so I’m actually naming the new dataset the same thing as the old dataset, which, effectively, overwrites the dataset, getting rid of the unwanted variables in the process.

The “subset” function tells R that you want to take part of an existing dataset. It’s a very useful function for selecting, for instance, all the men in a sample or all of the people who live in a certain region.

After the “subset” function, inside the parentheses, is the name of the dataset from which we are taking a subset, GSS2010. After the comma inside the parentheses is code to tell R how to select the subset. In this case, we use the “select =” command to tell R that we want it to select a specific variable. However, the “” before “(OCC)” actually tells R to select all the other variables BUT not the OCC variable for the subset. Thus, “-(OCC)” tells R to select the entire dataframe except the variable OCC for the subset. In effect, OCC is deleted but, to get there, you actually have to tell R to keep everything but that (pretty stupid, honestly).

NOTE: This is an instance in R when you don’t need to put the name of the variable in quotes (e.g., (“OCC”)) nor do you need to indicate which dataset the variable is in (e.g., (GSS2010$OCC)) since the dataset is already referenced in the subset command.

To remove multiple variables at the same time, the above command can be modified slightly to include other variables by putting them into a vector:

GSS2010 <- subset(GSS2010, select = -c(YEAR, WRKSTAT))

By changing what comes after the “select =” component in the parentheses to a vector (c indicates a vector in R), you can indicate multiple variables that you want deleted from the dataset in one command. Thus, in the above code, the variables YEAR and WRKSTAT would both be deleted from the dataset.

Because it is R, there is always another way. Here are two alternative lines of code that will do the same thing, the first is for removing a single variable and the second is for removing multiple variables:

GSS2010 <- GSS2010[,-match(c("EVWORK"), names(GSS2010))]
GSS2010 <- GSS2010[,-match(c("EVWORK", "PRESTIGE"), names(GSS2010))]

The logic in the above code is very similar, using the “match” command instead of subset.

You may find some tutorials that suggest you can remove a variable from a dataframe/dataset using the following code:

GSS2010$GOD <- NULL

What this command does is actually remove all of the data in the variable GOD. However, the variable remains in the dataset, it’s just empty. I prefer one of the above approaches because they completely remove the variable from the dataset.

Here’s a script file for these commands.

R – create variable filled with zeros

I ran into a situation where I needed to add a variable to a dataset. I knew that I was then going to modify some of the values in the variable, but most of the values were going to be zeros. So, I wanted to create a new variable and fill it with all zeros.

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

Here’s the code I used to create the variable:

GSS2010$TEMPANALYSIS <- replicate(2044, 0)

Here is what the code above does…

GSS2010 is the name of the dataset into which I wanted to create the variable. In this case, it is a copy of the 2010 wave of the GSS.

TEMPANALYSIS is what I called the variable. (The “$” tells R that it is a variable in the dataset.)

The “replicate” function tells R to replicate the second value in the parentheses (0) the number of times noted as the first value in the parentheses (2044). I used 2,044 because that is how many cases there are in the dataset. You can obviously adjust the value for the number of cases in your dataset/dataframe. If you have 320 cases, adjust it to 320.

If you don’t include the exact number of cases, you’ll get an error like this:

Error in `$<-.data.frame`(`*tmp*`, TEMPANALYSIS, value = c(0, 0, 0, 0,  : replacement has 2042 rows, data has 2044

That error is saying that you tried to add a variable but R needs to know what to put in every one of the rows and since it is short 2 rows, it can’t do it.

Of course, with R, there is always another way to do something. Here’s an alternative command that will do the same thing:

GSS2010$TEMPANALYSIS2 <- rep(0, times=2044)

I won’t repeat the description of the dataset and variable but will detail what the rest of the code is doing.

rep” tells R to repeat the first value in the parentheses (0) the number of times specified as the second number in the parentheses (2044; technically, the “times=” portion is not required.

Here’s a script file for these commands.