R-Squared Changes Based on How Many Observations There Are

R-squared says nothing about prediction error, even with σ2 exactly the same, and no change in the coefficients. R-squared can be anywhere between 0 and 1 just by changing the range of X. ¹

As you can see below, by simply including more values - the R-Squared can be drastically increased (or decreased); which, is a concern because someone preparing summary statistics of data could manipulate the number of observations to benefit them. All this can be done while other, more reliable predictions, such as mean square error, stay the same (given the regression is the same) no matter the number of observations.

Compare R-Squared & Mean Squared Error

10 X Observations

Setting Up The Regression

Show code

x <- seq(1,10,length.out = 100)
set.seed(1)
y <- 2 + 1.2*x + rnorm(100,0,sd = 0.9)
RS10 <- lm(y ~ x)

R-Squared

Show code

summary(RS10)$r.squared

[1] 0.9383379

Mean Squared Error

Show code

sum((fitted(RS10) - y)^2)/100

[1] 0.6468052

2 X Observations

Setting Up The Regression

Show code

x <- seq(1,2,length.out = 100)
set.seed(1)
y <- 2 + 1.2*x + rnorm(100,0,sd = 0.9)
RS2 <- lm(y ~ x)

R-Squared

Show code

summary(RS2)$r.squared

[1] 0.1502448

Mean Squared Error

Show code

sum((fitted(RS2) - y)^2)/100

[1] 0.6468052

What to do about it?

As you can see, just by changing the number of observations alone makes R-Squared an unreliable prediction measure. However, you can also see that factors as simple as number of observations does not effect the mean squared error. For that reason, I suggest taking some TNT to R-Squared and using the advanced machine learning model options available for free in the R program to get predictions that’d blow R-Squared out of the water! Browse my ‘Projects’ tab to see more on some of these models such as Support Vector Machine and Neural Networks; though, I think good ’ole Google might be a better resource than I.

https://professor-hunt.github.io/ACC8143/Rsquared_Bad.html ↩︎

R-Squared Is Bad…

R-Squared Changes Based on How Many Observations There Are

Compare R-Squared & Mean Squared Error

10 X Observations

2 X Observations

What to do about it?