
Residual standard error: what it is

Linear Regression Essentials in R

kassambara | 11/03/2018 | Regression Analysis

Linear regression (or linear model) is used to predict a quantitative outcome variable (y) on the basis of one or multiple predictor variables (x) (James et al. 2014; Bruce and Bruce 2017).

The goal is to build a mathematical formula that defines y as a function of the x variable. Once we have built a statistically significant model, it's possible to use it to predict future outcomes on the basis of new x values.

When you build a regression model, you need to assess the performance of the predictive model. In other words, you need to evaluate how well the model predicts the outcome of new test data that has not been used to build the model.

Two important metrics are commonly used to assess the performance of the predictive regression model:

  • Root Mean Squared Error, which measures the model prediction error. It corresponds to the average difference between the observed known outcome values and the values predicted by the model. RMSE is computed as RMSE = mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.
  • R-square, representing the squared correlation between the observed known outcome values and the values predicted by the model. The higher the R2, the better the model. (A minimal base-R sketch of both metrics follows this list.)
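
Purely as an illustration, the two metrics can be written in base R as follows; the helper names rmse() and rsq() are ad hoc, and the chapter itself uses caret's RMSE() and R2():

# Base-R sketches of the two metrics; `observeds` and `predicteds`
# are assumed to be numeric vectors of equal length.
rmse <- function(observeds, predicteds) {
  sqrt(mean((observeds - predicteds)^2))   # root of the mean squared error
}
rsq <- function(observeds, predicteds) {
  cor(observeds, predicteds)^2             # squared correlation
}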

A simple workflow to build a predictive regression model is as follows:

  1. Randomly split your data into training set (80%) and test set (20%)
  2. Build the regression model using the training set
  3. Make predictions using the test set and compute the model accuracy metrics

In this chapter, you will learn:

  • the basics and the formula of linear regression,
  • how to compute simple and multiple regression models in R,
  • how to make predictions of the outcome of new data,
  • how to assess the performance of the model

Contents:

  • Formula
  • Loading Required R packages
  • Preparing the data
  • Computing linear regression
    • Quick start R code
    • Simple linear regression
    • Multiple linear regression
  • Interpretation
    • Model summary
    • Coefficients significance
    • Model accuracy
  • Making predictions
  • Discussion
  • References


Formula

The mathematical formula of the linear regression can be written as follows:

y = b0 + b1*x + e

We read this as “y is modeled as beta1 (b1) times x, plus a constant beta0 (b0), plus an error term e.”

When you have multiple predictor variables, the equation can be written as y = b0 + b1*x1 + b2*x2 + … + bn*xn, where:

  • b0 is the intercept,
  • b1, b2, …, bn are the regression weights or coefficients associated with the predictors x1, x2, …, xn.
  • e is the error term (also known as the residual errors), the part of y that cannot be explained by the regression model.

Note that, b0, b1, b2, … and bn are known as the regression beta coefficients or parameters.

The figure below illustrates a simple linear regression model, where:

  • the best-fit regression line is in blue
  • the intercept (b0) and the slope (b1) are shown in green
  • the error terms (e) are represented by vertical red lines

[Figure: simple linear regression model]

From the scatter plot above, it can be seen that not all the data points fall exactly on the fitted regression line. Some of the points lie above the fitted line and some below it; overall, the residual errors (e) have approximately mean zero.

The sum of the squares of the residual errors is called the Residual Sum of Squares, or RSS.

The average variation of the points around the fitted regression line is called the Residual Standard Error (RSE). This is one of the metrics used to evaluate the overall quality of the fitted regression model. The lower the RSE, the better.

Since the mean error term is zero, the outcome variable y can be approximately estimated as follows:

y ~ b0 + b1*x

Mathematically, the beta coefficients (b0 and b1) are determined so as to minimize the RSS. This method of determining the beta coefficients is technically called least squares regression, or ordinary least squares (OLS) regression.
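
For a single predictor, least squares has a closed-form solution: b1 = cov(x, y)/var(x) and b0 = mean(y) - b1*mean(x). As a sketch, with simulated data made up purely for this check, the closed form can be compared against R's lm():

# Closed-form OLS coefficients for one predictor, checked against lm()
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)     # simulate data with b0 = 2, b1 = 3
b1 <- cov(x, y) / var(x)       # slope that minimizes the RSS
b0 <- mean(y) - b1 * mean(x)   # the fitted line passes through the means
c(b0, b1)                      # matches coef(lm(y ~ x))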

Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are significantly different from zero. A non-zero beta coefficient means that there is a significant relationship between the predictors (x) and the outcome variable (y).

The Random Forest method (randomForest)

require(randomForest)
rf.m <- randomForest(y ~ ., data=df[train.ids, ], do.trace=50, mtry=2)
     |    Out-of-bag     |
Tree |    MSE   %Var(y)  |
  50 |  37.52      4.68  |
 100 |  36.98      4.61  |
 150 |  36.74      4.58  |
 200 |  35.90      4.48  |
 250 |  36.21      4.52  |
 300 |  36.34      4.53  |
 350 |  36.35      4.54  |
 400 |  36.82      4.59  |
 450 |  35.87      4.47  |
 500 |  35.72      4.46  |

Note that most functions can be called in several ways: through a formula defining the model, as in the case above, or by specifying the matrix X and the vector Y. Thus, the same model can also be built with
rf.m <- randomForest(x=df[train.ids, -1], y=df[train.ids, "y"], do.trace=50, mtry=2)

Let's display the model statistics

rf.m

Call:
randomForest(formula = y ~ ., data = df[train.ids, ], do.trace = 50, mtry = 2)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2

Mean of squared residuals: 35.72
% Var explained: 95.54

Let's compute the coefficient of determination for the out-of-bag sample (internal validation of the model)

cor(rf.m$predicted, df$y[train.ids]) ^ 2
[1] 0.9556

For random forest models, the plot function returns the dependence of the mean squared prediction error for the out-of-bag sample on the number of trees in the forest

plot(rf.m)

To assess the quality of the model, let's predict the test data set

pred.rf <- predict(rf.m, df[test.ids,])
cor(pred.rf, df$y[test.ids]) ^ 2
[1] 0.9644

Let's plot the predicted values against the observed ones

plot(pred.rf, df$y[test.ids])
abline(0, 1)

Loading Required R packages

  • tidyverse for easy data manipulation and visualization
  • caret for easy machine learning workflow

library(tidyverse)
library(caret)
theme_set(theme_bw())

Preparing the data

We'll use the marketing data set, introduced in the Chapter @ref(regression-analysis), for predicting sales units on the basis of the amount of money spent in the three advertising media (youtube, facebook and newspaper).

We'll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.

# Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
##     youtube facebook newspaper sales
## 58    163.4     23.0      19.9  15.8
## 157   112.7     52.2      60.6  18.4
## 81     91.7     32.0      26.8  14.2
# Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]

Computing linear regression

The R function lm() is used to compute a linear regression model.

Quick start R code

# Build the model
model <- lm(sales ~., data = train.data)
# Summarize the model
summary(model)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
# (b) R-square
R2(predictions, test.data$sales)

Simple linear regression

The simple linear regression is used to predict a continuous outcome variable (y) based on one single predictor variable (x).

In the following example, we’ll build a simple linear model to predict sales units based on the advertising budget spent on youtube. The regression equation can be written as sales = b0 + b1*youtube.

The R function lm() can be used to determine the beta coefficients of the linear model, as follows:

model <- lm(sales ~ youtube, data = train.data)
summary(model)$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   8.3839    0.62442    13.4 5.22e-28
## youtube       0.0468    0.00301    15.6 7.84e-34

The output above shows the estimates of the regression beta coefficients (column Estimate) and their significance levels (column Pr(>|t|)). The intercept (b0) is 8.38 and the coefficient of the youtube variable is 0.046.

The estimated regression equation can be written as follows: sales = 8.38 + 0.046*youtube. Using this formula, for each new youtube advertising budget, you can predict the number of sales units.

For example:

  • For a youtube advertising budget equal to zero, we can expect sales of 8.38 units.
  • For a youtube advertising budget equal to 1000, we can expect sales of 8.38 + 0.046*1000 = 55 units.

Predictions can be easily made using the R function predict(). In the following example, we predict sales units for two youtube advertising budgets: 0 and 1000.

newdata <- data.frame(youtube = c(0, 1000))
model %>% predict(newdata)
##     1     2
##  8.38 55.19

Multiple linear regression

Multiple linear regression is an extension of simple linear regression for predicting an outcome variable (y) on the basis of multiple distinct predictor variables (x).

For example, with three predictor variables (x), the prediction of y is expressed by the following equation: y = b0 + b1*x1 + b2*x2 + b3*x3

The regression beta coefficients measure the association between each predictor variable and the outcome. b_j can be interpreted as the average effect on y of a one-unit increase in x_j, holding all other predictors fixed.

In this section, we'll build a multiple regression model to predict sales based on the budget invested in three advertising media: youtube, facebook and newspaper. The formula is as follows: sales = b0 + b1*youtube + b2*facebook + b3*newspaper

You can compute the multiple regression model coefficients in R as follow:

model <- lm(sales ~ youtube + facebook + newspaper, data = train.data)
summary(model)$coef

Note that, if you have many predictor variables in your data, you can simply include all the available variables in the model using ~.:

model <- lm(sales ~., data = train.data)
summary(model)$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.39188    0.44062   7.698 1.41e-12
## youtube      0.04557    0.00159  28.630 2.03e-64
## facebook     0.18694    0.00989  18.905 2.07e-42
## newspaper    0.00179    0.00677   0.264 7.92e-01

From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. Columns are:

  • Estimate: the intercept (b0) and the beta coefficient estimates associated with each predictor variable
  • Std. Error: the standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confident we are about the estimate.
  • t value: the t-statistic, which is the coefficient estimate divided by its standard error (see the quick check after this list)
  • Pr(>|t|): the p-value corresponding to the t-statistic. The smaller the p-value, the more significant the estimate is.
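
As the quick check promised above: the t value column is just the ratio of the first two columns of the coefficients matrix (reusing the `model` fitted above):

# t value is Estimate divided by Std. Error
coefs <- summary(model)$coef
coefs[, "Estimate"] / coefs[, "Std. Error"]   # reproduces coefs[, "t value"]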

As previously described, you can easily make predictions using the R function predict():

# New advertising budgets
newdata <- data.frame(
  youtube = 2000, facebook = 1000, newspaper = 1000
)
# Predict sales values
model %>% predict(newdata)
##   1
## 283

The partial least squares method (pls)

require(pls)
pls.m1 <- plsr(y ~ ., data=df[train.ids, ], ncomp=2, validation="CV", segments=5, segment.type="random")

Let's use the summary function to display the model statistics

summary(pls.m1)
Data: X dimension: 100 2
Y dimension: 100 1
Fit method: kernelpls
Number of components considered: 2

VALIDATION: RMSEP
Cross-validated using 5 random segments.
       (Intercept)  1 comps  2 comps
CV            28.6    13.87    14.00
adjCV         28.6    13.76    13.79

TRAINING: % variance explained
   1 comps  2 comps
X    69.23   100.00
y    78.68    81.78

To compute the coefficient of determination for the training set and under cross-validation, we use the following calls. Note that a pls model returns the predicted values as a three-dimensional array!

# train
cor(pls.m1$fitted.values[,,2], df$y[train.ids]) ^ 2
[1] 0.8178
# cross-validation
cor(pls.m1$validation$pred[,,2], df$y[train.ids]) ^ 2
[1] 0.7604

For a pls model, the plot function can also be used; it returns a plot of the predicted values against the observed ones

plot(pls.m1)
abline(0, 1) # adds the identity line: through the origin, at a 45-degree angle

Let's predict the values of the test set and assess the quality of the prediction

pred.pls.m1 <- predict(pls.m1, df[test.ids,])
cor(pred.pls.m1[,,2], df$y[test.ids]) ^ 2
[1] 0.775

Let's use another form of the relationship, one that was already applied successfully for linear regression

pls.m2 <- plsr(y ~ poly(x1, 2) + poly(x2, 2), data=df[train.ids, ], ncomp=2, validation="CV", segments=5, segment.type="random")
summary(pls.m2)
Data: X dimension: 100 4
Y dimension: 100 1
Fit method: kernelpls
Number of components considered: 2

VALIDATION: RMSEP
Cross-validated using 5 random segments.
       (Intercept)  1 comps  2 comps
CV            28.6    8.976    6.848
adjCV         28.6    8.246    6.711

TRAINING: % variance explained
   1 comps  2 comps
X    24.62    50.23
y    94.86    95.81

cor(pls.m2$validation$pred[,,2], df$y[train.ids]) ^ 2
[1] 0.9415

As can be seen, the quality of the model has increased considerably. Let's confirm this by assessing the predictive ability of the model on the test data set.

pred.pls.m2 <- predict(pls.m2, df[test.ids,])
cor(pred.pls.m2[,,2], df$y[test.ids]) ^ 2
[1] 0.9344
plot(pls.m2)
abline(0, 1)

Interpretation

Before using a model for predictions, you need to assess the statistical significance of the model. This can be easily checked by displaying the statistical summary of the model.


How to interpret the residual standard deviation/error

Simply put, the residual standard deviation is the average amount by which the real values of Y differ from the predictions provided by the regression line.

We can divide this quantity by the mean of Y to obtain the average deviation in percent (which is useful because it will be independent of the units of measure of Y).

Here’s an example:

Suppose we regressed systolic blood pressure (SBP) onto body mass index (BMI) — which is a fancy way of saying that we ran the following linear regression model:

SBP = β0 + β1×BMI + ε

After running the model we found that:

  • β0 = 100
  • β1 = 1
  • And the residual standard error is 12 mmHg

So we can say that the BMI accurately predicts systolic blood pressure with about 12 mmHg error on average.

More precisely, we can say that 68% of the predicted SBP values will be within ± 12 mmHg of the real values.

Why 68%?

Remember that in linear regression, the error terms are Normally distributed.

And one of the properties of the Normal distribution is that 68% of the data sits within 1 standard deviation of the average (see the figure below).

Therefore, 68% of the errors will be within ± 1 residual standard deviation of zero.

[Figure: Normal curve; 68% of the area lies within one standard deviation of the mean]

For example, our linear regression equation predicts that a person with a BMI of 20 will have an SBP of:

SBP = β0 + β1×BMI = 100 + 1 × 20 = 120 mmHg.

With a residual error of 12 mmHg, this person has a 68% chance of having their true SBP between 108 and 132 mmHg.

Moreover, if the mean of SBP in our sample is 130 mmHg for example, then:

12 mmHg ÷ 130 mmHg = 9.2%

So we can also say that the BMI accurately predicts systolic blood pressure with a percentage error of 9.2%.
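
This computation is easy to reproduce in R. Below is a sketch on simulated data standing in for the SBP/BMI example above; the variable names and numbers are made up for illustration:

# Simulated stand-in for the SBP ~ BMI example
set.seed(42)
bmi <- rnorm(200, mean = 25, sd = 4)
sbp <- 100 + 1 * bmi + rnorm(200, sd = 12)   # true residual SD is 12
fit <- lm(sbp ~ bmi)
rse <- sigma(fit)        # residual standard error, should be close to 12
rse / mean(sbp) * 100    # percent error relative to the mean outcome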

The question remains: Is 9.2% a good percent error value? More generally, what is a good value for the residual standard deviation?

The answer is that there is no universally acceptable threshold for the residual standard deviation. This should be decided based on your experience in the domain.

In general, the smaller the residual standard deviation/error, the better the model fits the data. And if the value is deemed unacceptably large, consider using a model other than linear regression.

What is Linear Regression?

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.

Residual Summary Statistics

The first info printed by the linear regression summary after the formula is the residual summary statistics. One of the assumptions for hypothesis testing is that the errors follow a Gaussian distribution; as a consequence, the residuals should as well. The residual summary statistics give information about the symmetry of the residual distribution. The median should be close to 0, as the mean of the residuals is 0, and symmetric distributions have median = mean. Further, 3Q and 1Q should be close to each other in magnitude; they would be equal under a symmetric zero-mean distribution. The max and min should also have similar magnitude. However, in this case, a violation may indicate an outlier rather than a symmetry problem.

We can investigate this further with a boxplot of the residuals.

boxplot(model[['residuals']], main='Boxplot: Residuals', ylab='residual value')

[Figure: boxplot of the residuals]

We see that the median is close to 0. Further, the 25th and 75th percentiles look approximately the same distance from 0, and the non-outlier min and max also look about the same distance from 0. All of this is good as it suggests correct model specification.

How to Calculate Residual Standard Error?

The Residual Standard Error calculator uses Residual standard error = sqrt(Residual sum of squares/(Number of Observations in data - 2)) to calculate the Residual standard error. The Residual Standard Error formula is defined as the square root of the residual sum of squares divided by the residual degrees of freedom. Residual standard error is denoted by the symbol RSE.

How to calculate Residual Standard Error using this online calculator? Enter the Residual sum of squares (RSS) and the Number of Observations in data (No), and hit the calculate button. Here is how the Residual Standard Error calculation can be explained with the given input values: 2.236068 = sqrt(10/(4-2)).
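
The worked example is straightforward to verify in R; as a sketch, the same formula also recovers what summary.lm reports as the residual standard error (for lm objects, deviance() returns the residual sum of squares):

# The calculator's worked example: RSS = 10, n = 4, so df = n - 2 = 2
sqrt(10 / (4 - 2))   # 2.236068
# For any fitted simple regression `fit`, the equivalent computation is:
# sqrt(deviance(fit) / df.residual(fit))   # equals sigma(fit)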

Coefficients

The second thing printed by the linear regression summary call is information about the coefficients. This includes their estimates, standard errors, t statistics, and p-values.

Estimates

The intercept tells us that when all the features are at 0, the expected response is the intercept. Note that for an arguably better interpretation, you should consider centering your features; this changes the interpretation so that, when the features are at their mean values, the expected response is the intercept. For the other features, the estimates give us the expected change in the response due to a unit change in the feature.

Standard Error

The standard error is the standard error of our estimate, which allows us to construct marginal confidence intervals for the estimate of that particular feature. If $\mathrm{s.e.}(\hat{\beta}_i)$ is the standard error and $\hat{\beta}_i$ is the estimated coefficient for feature $i$, then a 95% confidence interval is given by $\hat{\beta}_i \pm 1.96 \cdot \mathrm{s.e.}(\hat{\beta}_i)$. Note that two things are required for this confidence interval to be valid:

  • your model assumptions hold
  • you have enough data/samples to invoke the central limit theorem, as you need $\hat{\beta}_i$ to be approximately Gaussian.

That is, assuming all model assumptions are satisfied, we can say that with 95% confidence (which is not a probability) the true parameter $\beta_i$ lies in $[\hat{\beta}_i - 1.96 \cdot \mathrm{s.e.}(\hat{\beta}_i),\, \hat{\beta}_i + 1.96 \cdot \mathrm{s.e.}(\hat{\beta}_i)]$. Based on this, we can construct confidence intervals

confint(model)
                    2.5 %        97.5 %
(Intercept)  2.3987332457  2.8924423620
crim        -0.0111943622 -0.0056703707
rm           0.1086963289  0.1769912871
tax         -0.0004055169 -0.0001069386
lstat       -0.0334396331 -0.0256328293

Here we can see that the entire confidence interval for number of rooms has a large effect size relative to the other covariates.
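
These intervals can be rebuilt by hand from the coefficient table. A sketch, assuming the same fitted `model` as above; note that confint() uses the exact t quantile rather than the Gaussian 1.96:

# Manual 95% CI: estimate +/- t quantile * standard error
coefs <- summary(model)$coef
tq <- qt(0.975, df = df.residual(model))   # close to 1.96 for large samples
cbind(lower = coefs[, "Estimate"] - tq * coefs[, "Std. Error"],
      upper = coefs[, "Estimate"] + tq * coefs[, "Std. Error"])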

t-value

The t-statistic is

(1)   $t = \frac{\hat{\beta}_i}{\mathrm{s.e.}(\hat{\beta}_i)}$

which tells us how far our estimated parameter is from a hypothesized value of 0, scaled by the standard deviation of the estimate. Assuming that $\hat{\beta}_i$ is Gaussian, under the null hypothesis that $\beta_i = 0$ this statistic is t-distributed with $n - p - 1$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of parameters we need to estimate.
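
From this t-statistic, the reported p-values follow directly. A short sketch reproducing the Pr(>|t|) column, assuming a fitted lm object `model`:

# Two-sided p-value: tail probability of |t| under the null distribution
coefs <- summary(model)$coef
2 * pt(abs(coefs[, "t value"]), df = df.residual(model), lower.tail = FALSE)
# reproduces coefs[, "Pr(>|t|)"]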


Sources

  • http://www.sthda.com/english/articles/40-regression-analysis/165-linear-regression-essentials-in-r/
  • http://qsar4u.com/files/rintro/03.html
  • https://www.calculatoratoz.com/en/residual-standard-error-calculator/Calc-2702
  • https://quantifyinghealth.com/residual-standard-deviation-error/
  • https://boostedml.com/2019/06/linear-regression-in-r-interpreting-summarylm.html