R – importing data using haven

rcragun

1 year ago

Introduction

I hate to admit that I’ve been using R exclusively for data analysis for about 5 or 6 years and am just now realizing that I likely have been loading data sets into R incorrectly this entire time. Let me explain the issue.

I regularly load datasets into R that are either in SPSS or Stata format (occasionally CSV or DAT files, but those don’t suffer from the same issue as they don’t include data labels). The data mostly come from surveys I have conducted via software like Qualtrics or public datasets I use for various projects (e.g., the GSS or WVS).

I’ll walk you through the issue with illustrative code and output as I go. Note that I’m using R 4.2.2 in RStudio version 2022.07.2+576 on Kubuntu 22.04; I’m using the 2.5.1 version of the haven package.

I’m going to use two sample datasets to illustrate the problem. One is a survey I conducted myself. The other is the SPSS version of the 2010 wave of the GSS. To load those datasets, I use the ‘haven’ package. Here’s my code to get started:

# set working directory
setwd("~/Desktop")
# install haven once if needed, then load it
install.packages("haven")
library(haven)
# import datasets (R object names can't start with a digit, so GSS2010, not 2010GSS)
GSS2010 <- read_sav("~/2010GSS.sav")
utah <- read_sav("~/utah.sav")

Now, on to the issue at the heart of this post. If you check the attributes of variables imported from an SPSS file using the haven package, you’ll see that their attributes will include “haven_labelled” as one of the classes. I’ll use DEGREE (an ordinal variable) in the 2010 GSS to illustrate:

#### check attributes ####
attributes(GSS2010$DEGREE)

Here’s the output after importing the 2010 GSS:

$label
[1] "R's highest degree"

$format.spss
[1] "F1.0"

$class
[1] "haven_labelled" "vctrs_vctr"     "double"        

$labels
LT HIGH SCHOOL    HIGH SCHOOL JUNIOR COLLEGE       BACHELOR       GRADUATE            IAP             DK             NA 
             0              1              2              3              4              7              8              9

What the output above is telling me is: (1) the label of the variable, which is “R’s highest degree,” (2) the format of the variable (from SPSS), (3) the class of the variable: “haven_labelled,” “vctrs_vctr,” and “double,” and (4) the labels for the different numeric values of the variable.
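If you don’t have the GSS file handy, you can build a variable with the same structure by hand. This is just a minimal sketch, assuming haven’s labelled() constructor; the values in the vector are made up and are not real GSS responses.

```r
library(haven)

# Made-up values standing in for a few DEGREE responses (not real GSS data)
degree <- labelled(
  c(0, 1, 3, 4, 1, 2),
  labels = c(
    "LT HIGH SCHOOL" = 0, "HIGH SCHOOL" = 1, "JUNIOR COLLEGE" = 2,
    "BACHELOR" = 3, "GRADUATE" = 4
  ),
  label = "R's highest degree"
)

class(degree)           # includes "haven_labelled"
attr(degree, "label")   # the variable label (what the question asked)
attr(degree, "labels")  # named vector mapping value labels to numeric codes
```

Having a toy variable like this makes it easier to experiment with the fixes without touching the real dataset.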

There are a variety of variable types or classes in R, but the two primary ones that I use are “factor” and “numeric.” Factor variables are categorical variables (i.e., nominal or ordinal variables). Numeric variables are continuous variables (i.e., interval or ratio variables). Once a variable has the right class, R knows which commands or functions can be applied to it. Some functions need the variable to be the right class before they can run.

The Problem

Typically, when I import the GSS into R, I don’t have issues running analyses on most of the variables (though see below). However, with some of my datasets, I can’t even get a simple table command to run if the variable is the class “haven_labelled.” Here’s the code from my other dataset:

table(utah$healing_1)

And here is the error that I get when this doesn’t work.

Error in `as.character()`:
! Can't convert `x` <haven_labelled> to <character>.
Run `rlang::last_error()` to see where the error occurred.

Running the “rlang::last_error()” command recommended gives me the following information:

> rlang::last_error()
<error/vctrs_error_cast>
Error in `as.character()`:
! Can't convert `x` <haven_labelled> to <character>.
---
Backtrace:
 1. base::table(utah$healing_1)
 2. base::factor(a, exclude = exclude)
 5. vctrs:::as.character.vctrs_vctr(y)
Run `rlang::last_trace()` to see the full context.

And running the “rlang::last_trace()” command gives me the following:

> rlang::last_trace()
<error/vctrs_error_cast>
Error in `as.character()`:
! Can't convert `x` <haven_labelled> to <character>.
---
Backtrace:
     ▆
  1. ├─base::table(utah$healing_1)
  2. │ └─base::factor(a, exclude = exclude)
  3. │   ├─base::unique(as.character(y)[ind])
  4. │   ├─base::as.character(y)
  5. │   └─vctrs:::as.character.vctrs_vctr(y)
  6. │     └─vctrs::vec_cast(x, character())
  7. └─vctrs (local) `<fn>`()
  8.   └─vctrs::vec_default_cast(...)
  9.     ├─base::withRestarts(...)
 10.     │ └─base (local) withOneRestart(expr, restarts[[1L]])
 11.     │   └─base (local) doWithOneRestart(return(expr), restart)
 12.     └─vctrs::stop_incompatible_cast(...)
 13.       └─vctrs::stop_incompatible_type(...)
 14.         └─vctrs:::stop_incompatible(...)
 15.           └─vctrs:::stop_vctrs(...)
 16.             └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = vctrs_error_call(call))

I don’t fully understand everything these traces are telling me, but I can see the key message: table() calls factor(), which in turn calls as.character() on the variable. Because the variable has the “haven_labelled” class, the conversion to character is refused, the cast is flagged as “incompatible,” and the command doesn’t run.

This issue also comes up with the data imported from the GSS, but not for most basic analyses. I run into it every time I try to run a Tukey HSD test after ANOVA. To illustrate, I am going to run an ANOVA test with the number of children (CHILDS) as the dependent variable (DV) and the final degree earned (DEGREE) as the independent variable (IV). After it runs, I’m going to try to run a Tukey HSD test on the output. Here’s the code:

#### show that Tukey doesn't run ####
CHILDSxDEGREE <- aov(GSS2010$CHILDS ~ GSS2010$DEGREE)
summary(CHILDSxDEGREE)
TukeyHSD(CHILDSxDEGREE)

And here’s the error that I get when I do this:

Error in replications(paste("~", xx), data = mf) : 
  (converted from warning) non-factors ignored: GSS2010$DEGREE

The problem, I believe, is that R is looking for the class of the variables in order to run the analysis as it needs to know which of the variables is categorical (or factor) and which is continuous (or numeric). The same code works fine if you change the class of the variable, as I’ll show below.
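The contrast between a numeric and a factor grouping variable can be sketched with simulated data. This is only an illustration: the data frame `sim` and its columns are made-up stand-ins for CHILDS and DEGREE, not the GSS itself.

```r
set.seed(1)

# Simulated stand-ins for CHILDS and DEGREE (not real GSS data)
sim <- data.frame(
  childs = rpois(200, lambda = 2),
  degree = sample(0:4, 200, replace = TRUE)
)

# Numeric grouping variable: aov() fits a single slope (1 degree of freedom)
fit_numeric <- aov(childs ~ degree, data = sim)
summary(fit_numeric)

# Factor grouping variable: aov() fits one mean per group (4 degrees of freedom)
sim$degree_f <- factor(sim$degree)
fit_factor <- aov(childs ~ degree_f, data = sim)
summary(fit_factor)

# TukeyHSD() needs the factor version to form pairwise comparisons
tukey <- TukeyHSD(fit_factor)
tukey
```

With five groups, the Tukey output has one row for each of the ten pairwise comparisons.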

Solution 1 – Remove Classes One Variable at a Time

Let’s say that you’ve run into this problem and are now looking for a solution. Well, I’ve got you covered as I have multiple solutions.

First, you can simply change the problematic variable’s class. There are two ways to do this. One option is just to remove all the classes entirely. Here’s the code to do that for DEGREE:

attr(GSS2010$DEGREE, which = "class") <- NULL

This code tells R to assign NULL to the class attribute of the DEGREE variable. If you run that code, then check the attributes for DEGREE, you’ll see that there are no longer any classes listed, including the “haven_labelled” class:

$label
[1] "R's highest degree"

$format.spss
[1] "F1.0"

$labels
LT HIGH SCHOOL    HIGH SCHOOL JUNIOR COLLEGE       BACHELOR       GRADUATE            IAP             DK             NA 
             0              1              2              3              4              7              8              9

The class entry is missing entirely from the output now. This will allow you to run some analyses, but it still won’t work for Tukey’s because R now treats the variable as numeric, which is its default for variables that contain numbers, rather than as a factor. A solution for that will be detailed below. You can see that R thinks it’s numeric if you check the variable’s class:

class(GSS2010$DEGREE)

When you run this, you’ll now see that DEGREE’s class is numeric:

[1] "numeric"

Granted, we haven’t assigned a class to the variable. R is just inferring from the characteristics of the variable what the class is.

Given that this is R, we can, of course, assign the class another way. Here’s a simpler line of code that does the same thing:

class(GSS2010$DEGREE) <- NULL

The end result will be the same.

What we have done here, though, is not to assign the correct class or type to the variable; we’ve just removed the “haven_labelled” class.
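For what it’s worth, haven also ships a helper for this step, zap_labels(). Here is a minimal sketch on a made-up labelled vector (not real GSS data); as far as I can tell, it drops the value labels and the “haven_labelled” class and leaves a plain numeric variable.

```r
library(haven)

# Made-up labelled vector (not real GSS data)
x <- labelled(
  c(0, 1, 3),
  labels = c("LT HIGH SCHOOL" = 0, "HIGH SCHOOL" = 1, "BACHELOR" = 3),
  label = "R's highest degree"
)

# Drop the value labels and the haven_labelled class
plain <- zap_labels(x)

class(plain)
table(plain)  # table() now runs on the plain numeric values
```

Like setting the class to NULL, this only strips the labelled class; it does not turn the variable into a factor.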

Solution 2: Assign the Correct Class

Given that DEGREE is actually an ordinal variable and we don’t want it treated as a numeric variable, we actually do need to assign the right class to it. The easiest way to do this is to create a new variable based on the old one but coerce it into the right class as follows:

GSS2010$DEGREE2 <- as.factor(GSS2010$DEGREE)

You can, of course, overwrite the original variable, but I don’t recommend it as you’ll lose some information, as I detail below. Just make a new variable (you can always switch things back later). This will actually make DEGREE2 into a factor variable, as you can see if you check the attributes now:

attributes(GSS2010$DEGREE2)
class(GSS2010$DEGREE2)

Here’s the output:

> attributes(GSS2010$DEGREE2)
$levels
[1] "0" "1" "2" "3" "4"

$class
[1] "factor"

> class(GSS2010$DEGREE2)
[1] "factor"

So, we now have DEGREE2 as a copy of the data in DEGREE and it is a factor variable in R. But you may also have noticed that we no longer have a label (what the variable asked) or the labels for the numeric values. Those all get deleted when you use this coercion method to make the variable a factor variable. If you don’t care about that information, not a problem. You’ve now got the variable in the right format and can move on. But, if you want those labels, you’ll have to add them back in.

I wish I had an easy solution to this, and perhaps someone will see this and provide one, but I do have a solution that works. Basically, so long as you still have the original label (i.e., what the variable asked) and labels (what the numeric values stand for) in the dataset, which you should since we didn’t over-write DEGREE but made a copy of it we called DEGREE2, we can copy the label and labels over. It took a while to figure this out, but here’s how:

# this copies over the meanings of the numeric values
attr(GSS2010$DEGREE2, 'labels') <- attr(GSS2010$DEGREE, 'labels')
# this copies over the variable label or the question asked
attr(GSS2010$DEGREE2, 'label') <- attr(GSS2010$DEGREE, 'label')

Now, check the attributes again.

attributes(GSS2010$DEGREE2)

You’ll see that we have the correct class – factor – and R recognizes the levels of the variable (0 through 4), but we also have the label and labels:

> attributes(GSS2010$DEGREE2)
$levels
[1] "0" "1" "2" "3" "4"

$class
[1] "factor"

$labels
LT HIGH SCHOOL    HIGH SCHOOL JUNIOR COLLEGE       BACHELOR       GRADUATE            IAP             DK             NA 
             0              1              2              3              4              7              8              9 

$label
[1] "R's highest degree"

That’s a fair amount of work and not an insignificant amount of code, but we now have everything set to do the analyses correctly.
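A possibly shorter route is haven’s own as_factor() function, which builds the factor levels from the value labels rather than from the numeric codes. This is a sketch on a made-up vector, not the GSS, so check the result against your own data before relying on it.

```r
library(haven)

# Made-up values standing in for a few DEGREE responses (not real GSS data)
degree <- labelled(
  c(0, 1, 3, 4, 1, 2),
  labels = c(
    "LT HIGH SCHOOL" = 0, "HIGH SCHOOL" = 1, "JUNIOR COLLEGE" = 2,
    "BACHELOR" = 3, "GRADUATE" = 4
  ),
  label = "R's highest degree"
)

# Convert to a factor whose levels are the value labels
degree_f <- as_factor(degree)

class(degree_f)
levels(degree_f)
table(degree_f)
```

Note the underscore: this is haven’s as_factor(), not base R’s as.factor(). My understanding is that it also carries over the variable label, which you can check with attr(degree_f, "label").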

Note: If you want to add labels to a variable using a package like “expss,” those cannot be added to a factor variable once it has been coerced into that class. Doing so will convert it into a “character” variable and all the information will be removed. Thus, any labels you want to add need to be added before converting a variable into the factor class.

Solution 3 – Remove “haven_labelled” from the Whole Dataset At Once

While it’s probably best practice to go from variable to variable in your dataset to make sure that every one of them is the correct class and is properly labeled, that is also incredibly time-consuming. There is another option from the “labelled” package that allows you to remove the “haven_labelled” class from all the variables in a dataset with one command. As a reminder, here are the attributes of DEGREE in the 2010 GSS right after importing the dataset:

> attributes(GSS2010$DEGREE)
$label
[1] "R's highest degree"

$format.spss
[1] "F1.0"

$class
[1] "haven_labelled" "vctrs_vctr"     "double"        

$labels
LT HIGH SCHOOL    HIGH SCHOOL JUNIOR COLLEGE       BACHELOR       GRADUATE            IAP             DK             NA 
             0              1              2              3              4              7              8              9

Now, with the “unlabelled” command in the labelled package, you can remove that class. Here’s the code:

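# install labelled once if needed, then load it
install.packages("labelled")
library(labelled)
# unlabelled() returns a converted copy, so the result has to be assigned
GSS2010 <- unlabelled(GSS2010)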

That runs through the entire dataset and removes the “haven_labelled” class. Per the documentation, the function converts a labelled variable into a factor when all of its values have labels and otherwise just strips the class. But it’s not a guarantee that every variable will end up as the class you need.
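Here is a small sketch of that behavior on a made-up dataset. The object `toy` and its two columns are hypothetical; one column has labels for every value and the other has a label for only one value.

```r
library(haven)
library(labelled)

# Made-up data (not real GSS data)
toy <- tibble::tibble(
  # every value has a label
  degree = labelled(
    c(0, 1, 3),
    labels = c("LT HIGH SCHOOL" = 0, "HIGH SCHOOL" = 1, "BACHELOR" = 3)
  ),
  # only the value 8 has a label
  childs = labelled(c(0, 2, 8), labels = c("EIGHT OR MORE" = 8))
)

# Assign the result; the original object is left unchanged
toy_clean <- unlabelled(toy)

# Check what each column became
sapply(toy_clean, class)
```

Checking the classes afterwards, as in the last line, is a quick way to see which variables became factors and which were left as numbers.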

This is basically a quick and dirty solution to this problem that often works for me. But it really is a better approach to go variable by variable to make sure they are the class you need for your analysis.
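If you do go variable by variable, it helps to know which variables are affected in the first place. This is a minimal sketch using a made-up data frame (`toy` is hypothetical) and base R’s inherits() to flag the columns that carry the “haven_labelled” class.

```r
library(haven)

# Made-up data (not real survey data): one labelled column, one plain column
toy <- tibble::tibble(
  degree = labelled(
    c(0, 1, 3),
    labels = c("LT HIGH SCHOOL" = 0, "HIGH SCHOOL" = 1, "BACHELOR" = 3)
  ),
  age = c(34, 51, 27)
)

# TRUE for every column that has the haven_labelled class
is_lab <- vapply(toy, inherits, logical(1), what = "haven_labelled")

# Names of the columns that still need attention
names(toy)[is_lab]
```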

Conclusion

I’ve been struggling with this issue for years now. It came to a head when I had a number of datasets with identical questions from different countries open at once and started running into the error detailed above in some datasets but not in others. That forced me down this rabbit hole so I could solve this issue once and for all.

Do note that I only use R. I am not a package maintainer or contributor. There is a very real chance that some of what I discussed above is not a perfectly accurate representation of what is going on. It’s my best attempt to explain the situation.

Finally, I do kind of wonder if the authors of the haven package added the “haven_labelled” class just to make sure that people can’t be lazy as they run their analyses and have to check all of the variables they are using to make sure they are the correct class.

Note:

I occasionally get a related error when recoding my data in R. For instance, in one survey, I failed to set a requirement when asking for people’s year of birth that only numeric information be entered into that field. When I downloaded the data and imported it into R (from an SPSS file), the variable’s class was the same as above, “haven_labelled,” “vctrs_vctr,” and “double.” Some people wrote out the year they were born, e.g., “nineteen fifty-seven,” instead of “1957.” As I tried to recode the text into a number, I got the following error:

Error:
! Assigned data `1957` must be compatible with existing data.
ℹ Error occurred for column `Q53`.
x Can't convert <double> to <character>.
Run `rlang::last_error()` to see where the error occurred.

I was trying to put a number into a variable with the class “double.” To fix this error, you can do the same thing as what I outlined above, either remove the class or coerce the class into “numeric,” as follows:

Q53X <- as.numeric(data$Q53)

You’ll get a warning that NAs were introduced by coercion, because any non-numeric data will be turned into missing values. Since this is creating a new variable, that isn’t a real concern as you still have the original data and can recode it into the new variable. But this error is also related to the variable class issue that stems from importing SPSS data sets into R.
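One way to avoid losing the written-out years is to recode the text before converting. This is a minimal base R sketch; the vector `raw` and the lookup `recode_map` are made up for illustration and are not from my survey.

```r
# Made-up responses: two typed as digits, one written out in words
raw <- c("1957", "nineteen fifty-seven", "1984")

# Hypothetical lookup from written-out years to digits
recode_map <- c("nineteen fifty-seven" = "1957")

# Replace any response found in the lookup, keep the rest as-is
fixed <- ifelse(raw %in% names(recode_map), recode_map[raw], raw)

# Every entry is now digits, so the conversion produces no NAs
birth_year <- as.numeric(fixed)
birth_year
```

Doing the text recode first means as.numeric() has nothing left to turn into a missing value.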