R – delete one or several variables in a dataset

I regularly create variables while analyzing data and then find that I need to delete a variable I created. At times, I just want to get rid of a variable in a dataset (’cause screw that variable). This short tutorial will explain how to delete a variable (or multiple variables if needed).

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

To completely remove a variable from a dataframe, you need to tell R to copy the dataframe minus the variable you want to delete. Here’s the code:

GSS2010 <- subset(GSS2010, select = -(OCC))

Here is what the code above does…

GSS2010 is the name of the dataset. Typically, when I use the subset function, I do so to create a different dataset. In this case, though, I want to overwrite the dataset, so I give the new dataset the same name as the old one. That replaces the old dataset and gets rid of the unwanted variable in the process.

The “subset” function tells R that you want to take part of an existing dataset. It’s a very useful function for selecting, for instance, all the men in a sample or all of the people who live in a certain region.

After the “subset” function, inside the parentheses, is the name of the dataset from which we are taking a subset, GSS2010. After the comma inside the parentheses is code to tell R how to select the subset. In this case, we use the “select =” argument to tell R which variables we want. However, the “-” before “(OCC)” tells R to select all the other variables BUT not the OCC variable for the subset. Thus, “-(OCC)” tells R to select the entire dataframe except the variable OCC. In effect, OCC is deleted but, to get there, you actually have to tell R to keep everything but that (pretty stupid, honestly).

NOTE: This is an instance in R when you don’t need to put the name of the variable in quotes (e.g., (“OCC”)) nor do you need to indicate which dataset the variable is in (e.g., (GSS2010$OCC)) since the dataset is already referenced in the subset command.
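If you don’t have the GSS file handy, here is a minimal sketch of the same idea on a tiny made-up dataframe. The name toy and all of the values in it are my own invention for illustration, not real GSS data:

```r
# Toy stand-in for the GSS data; the values are made up for illustration.
toy <- data.frame(
  YEAR    = c(2010, 2010, 2010),
  OCC     = c(110, 230, 4700),
  WRKSTAT = c(1, 2, 1)
)

names(toy)   # "YEAR" "OCC" "WRKSTAT"

# Keep every column except OCC, and overwrite the original object.
toy <- subset(toy, select = -OCC)

names(toy)   # "YEAR" "WRKSTAT"
```

Checking names() before and after is a quick way to confirm that the variable is really gone.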

To remove multiple variables at the same time, the above command can be modified slightly to include other variables by putting them into a vector:

GSS2010 <- subset(GSS2010, select = -c(YEAR, WRKSTAT))

By changing what comes after the “select =” component in the parentheses to a vector (c indicates a vector in R), you can indicate multiple variables that you want deleted from the dataset in one command. Thus, in the above code, the variables YEAR and WRKSTAT would both be deleted from the dataset.
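Here is a small sketch of the multi-variable version, again on a made-up dataframe (toy, trimmed, and the values are hypothetical). It also shows that the names have to stay unquoted when you use the minus sign, because R cannot negate a character vector:

```r
# Toy stand-in for the GSS data; the values are made up for illustration.
toy <- data.frame(
  YEAR    = c(2010, 2010, 2010),
  WRKSTAT = c(1, 2, 1),
  AGE     = c(31, 54, 27),
  SEX     = c(1, 2, 2)
)

# Drop two columns at once by negating a vector of unquoted names.
trimmed <- subset(toy, select = -c(YEAR, WRKSTAT))
names(trimmed)   # "AGE" "SEX"

# Quoted names do not work with the minus sign: R cannot negate a
# character vector, so this call raises an error, which is caught here.
quoted_fails <- tryCatch(
  {
    subset(toy, select = -c("YEAR", "WRKSTAT"))
    FALSE
  },
  error = function(e) TRUE
)
quoted_fails     # TRUE
```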

Because it is R, there is always another way. Here are two alternative lines of code that will do the same thing. The first removes a single variable and the second removes multiple variables:

GSS2010 <- GSS2010[,-match(c("EVWORK"), names(GSS2010))]
GSS2010 <- GSS2010[,-match(c("EVWORK", "PRESTIGE"), names(GSS2010))]

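The logic in the above code is very similar, but it uses the “match” function instead of subset. match looks up the positions of the named variables in names(GSS2010), and the minus sign in front tells R to drop the columns at those positions.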
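To see what match is doing, here is a sketch on a made-up dataframe (toy, drop_idx, and the values are hypothetical). Adding drop = FALSE is my own suggestion rather than part of the lines above: without it, R simplifies the result to a plain vector when only one column is left. The last lines show a variant with %in%, which, unlike match, does not fail when one of the names isn’t in the dataframe:

```r
# Toy stand-in for the GSS data; the values are made up for illustration.
toy <- data.frame(
  EVWORK   = c(1, 2, 1),
  PRESTIGE = c(45, 62, 38),
  AGE      = c(31, 54, 27)
)

# match() returns the column positions of the names we want to drop.
drop_idx <- match(c("EVWORK", "PRESTIGE"), names(toy))
drop_idx                      # 1 2

# Negative positions drop those columns; drop = FALSE keeps the result
# a dataframe even though only one column remains.
kept <- toy[, -drop_idx, drop = FALSE]
names(kept)                   # "AGE"
is.data.frame(kept)           # TRUE

# Variant with %in%: names that are not in the dataframe are simply
# ignored ("NOT_A_COLUMN" is a deliberately bogus name).
to_drop <- c("EVWORK", "NOT_A_COLUMN")
kept2 <- toy[, !(names(toy) %in% to_drop), drop = FALSE]
names(kept2)                  # "PRESTIGE" "AGE"
```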

You may find some tutorials that suggest you can remove a variable from a dataframe/dataset using the following code:

GSS2010$GOD <- NULL

This works too. Assigning NULL to a variable removes it from the dataframe completely; it does not leave an empty variable behind. Written with $, though, it only handles one variable per line, so I prefer one of the above approaches when I want to remove several variables in a single command.
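One way to check what the NULL assignment does is to run it on a small made-up dataframe (toy and its values are hypothetical) and look at the column names afterwards. In base R, the column is dropped entirely rather than left behind empty:

```r
# Toy stand-in for the GSS data; the values are made up for illustration.
toy <- data.frame(
  GOD = c(1, 6, 3),
  AGE = c(31, 54, 27)
)

# Assigning NULL to a column removes that column from the dataframe.
toy$GOD <- NULL

names(toy)              # "AGE"
"GOD" %in% names(toy)   # FALSE
ncol(toy)               # 1
```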

Here’s a script file for these commands.

NOTE: This was done in R version 3.5.2.
