Are you sure your regression weights are correct in R?
Really sure?
I used to be sure! Then I found out I was wrong.
I’m going to run a simple linear regression, with a weight that equals 1 for every observation.
I’m going to use the built-in R data set mtcars
, with mpg
(miles per gallon) as the dependent variable, and hp
(horsepower) as the only independent variable.
All of the code that follows can be copied into your R terminal (fun!).
# Create the weight w <- rep(1, nrow(mtcars))
# Run the regression m1 <- lm(mpg ~ hp, data = mtcars, weights = w)
Let’s look at the weights from that regression.
m1$weights [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
All weights equal 1. Looks good, right?
Not sure why you’re still reading this? Stay with me. I’m going to do something that seems like it shouldn’t affect any of the above.
I’m going to add a column to the mtcars
data.frame.
mtcars$w <- 2
Now if I run the exact same regression as before, literally the exact same call, I’ll get different results. Don’t believe it? Watch this:
m2 <- lm(mpg ~ hp, data = mtcars, weights = w)
m2$weights [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Technically, the regression coefficients didn’t change, since the weights are constant in both models.
But I’m demonstrating that the weights in the second are different, which would change the regression results if the weights weren’t just a made up constant value.
What just happened?
I’m going to point you to a hidden dark corner of the help page for lm
:
All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
The weight argument is always treated as a column name in your data first, even though it’s not quoted.
This isn’t that weird for R, actually, but it is incredibly dangerous if you’re not careful.
Now, you could argue that the second call isn’t really the same as the first, because my data changed, and data is an argument to lm
.
But I’d counter by saying lm
has a specific weights argument, and that didn’t change! The interpretation of the argument changed because it depends on the column names in data.
What Would I Do?
If I could make a change and didn’t have to worry about backwards compatibility, I would remove the weights
argument from lm
, and devise a way to include weights in R formula syntax.
It’s clearer (to me) that names in R formulas are first interpreted as variable names in the data. It is not clear the same is true for the weights argument, until you read the help file.
So something like this would make sense (to me):
m1 <- lm(mpg ~ hp + weight(w), data = mtcars)
Does it get worse?
Maybe not worse, but definitely more confusing.
Suppose instead of a numeric vector, we just have a column name. This might be common if you had a function that runs a regression at some point, and the weight column name is passed into the function as an argument.
So let’s say the argument weight_var
contains 'w'
, and we try to pass that to lm
as a numeric vector.
weight_var <- 'w'
m3 <- lm(mpg ~ hp, data = mtcars, weights = mtcars[, weight_var])
m3$weights [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Looks fine, right?
Now suppose weight_var
already existed as a column in the model data!
mtcars$weight_var <- 2
weight_var <- 'w'
m4 <- lm(mpg ~ hp, data = mtcars, weights = mtcars[, weight_var])
Error in model.frame.default(formula = mpg ~ hp, data = mtcars, weights = mtcars[, : invalid type (list) for variable '(weights)'
Yikes. So you need to check whether the object name containing the weight column name doesn’t exist as a column name in the model data.frame.
If it is, you don’t get a right or wrong result. You get an error!
Prevention
We could debate whether this behavior is good or bad (but it’s bad), but it’s not going to change.
So it’s worth the effort to be proactive about preventing and checking this. You don’t want to find out the weights were wrong after interpreting your model! (Or after publishing a paper!)
The surest way to avoid mistakes is to include your weight in the model data.frame and specify it by name, just like lm
wants.
But again this isn’t always possible, because you might only be able to pass a weight column name as an argument to a function.
But it’s understandable you wouldn’t want to be forced into that.
So how can you prevent mistakes?
One way to prevent is always to put your weight vector inside a list. This will never resolve to a column within the passed data.frame.
# Create a list with the weights as first element. analysis_weights <- list(w = rep(1, nrow(mtcars)))
# lm can't get confused now. m5 <- lm(mpg ~ hp, data = mtcars, weights = analysis_weights$w)
m5$weights [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Maybe putting your weights in a named list feels like overkill. Your alternative is to assert that the weight vector you wanted to use is identical to the weights actually used in the regression:
w <- rep(1, nrow(mtcars))
m6 <- lm(mpg ~ hp, data = mtcars, weights = w)
# Stop if weights aren't right. stopifnot(identical(w, m6$weights))
There’s a drawback to checking this way though, which is that if any observations get dropped from the model due to missing values, this will return FALSE
. So you may need to adjust for your data.