Linear Regression with R

This post walks through building a linear regression model for the diamonds dataset that ships with the ggplot2 library in R.

Below is the code used to create the linear regression (LR) model for the diamonds dataset.

library(ggplot2)  # provides the diamonds data set
library(caTools)  # provides sample.split()
View(diamonds)

# Create the training and test data frames (65% / 35% split)
set.seed(123)  # arbitrary seed, added so the split is reproducible
split_values <- sample.split(diamonds$price, SplitRatio = 0.65)
train_reg <- subset(diamonds, split_values == TRUE)
test_reg <- subset(diamonds, split_values == FALSE)

# Build the linear model: price regressed on all other columns
mod_regress <- lm(price ~ ., data = train_reg)
result_regress <- predict(mod_regress, test_reg)
Final_Data <- data.frame(Actual = test_reg$price, Predicted = result_regress)
View(Final_Data)

# Calculate the prediction error and the root mean squared error (RMSE)
Final_Data$error <- Final_Data$Actual - Final_Data$Predicted
rmse <- sqrt(mean(Final_Data$error^2))
rmse
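
The RMSE step can also be wrapped in a small function, which makes it easier to reuse and to check by hand. The sketch below is only an illustration: the name rmse_of is hypothetical and not part of the script above, and the toy vectors are made up to show the arithmetic.

```r
# Hypothetical helper (not in the original script): root mean squared error
rmse_of <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy check: the errors are 0, 0 and -2, so the RMSE is sqrt(4 / 3)
toy_rmse <- rmse_of(c(1, 2, 3), c(1, 2, 5))
toy_rmse
```

With the objects from the script above, the equivalent call would be rmse_of(Final_Data$Actual, Final_Data$Predicted).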

[Figure: Diamond Data]

[Figure: Final Data]

[Figure: Regression Model]

[Figure: Results of the Regression Model]

For more background on how to read the diagnostic plots above, the article by Bommae Kim (https://data.library.virginia.edu/diagnostic-plots/) gives a far better explanation than the brief summary that follows.

Residuals Vs Fitted

This plot helps indicate whether the relationship between the predictor variables and the outcome variable is linear. Residuals spread evenly around a horizontal line suggest a linear relationship; an uneven spread around the line could indicate a non-linear relationship. In the plot above, the residuals are spread somewhat evenly around a horizontal line, so the relationship can reasonably be treated as linear. However, the line rises slightly as it goes, which could hint at a term that the model has not accounted for.
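
A Residuals vs Fitted plot like this can be drawn with the plot method for lm objects. The sketch below is a minimal example under one assumption: it fits the model on the full diamonds data to stay self-contained, whereas the script above fits on train_reg. The names mod_sketch, fitted_vals and resid_vals are hypothetical.

```r
library(ggplot2)  # provides the diamonds data set

# Assumption: fit on the full data set to keep the sketch short;
# the script above fits on train_reg instead.
mod_sketch <- lm(price ~ ., data = diamonds)

# Residuals vs Fitted is the first of the standard lm diagnostic plots
plot(mod_sketch, which = 1)

# The same information plotted by hand, with a dashed reference line at zero
fitted_vals <- fitted(mod_sketch)
resid_vals <- resid(mod_sketch)
plot(fitted_vals, resid_vals, pch = ".",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Calling plot(mod_regress, which = 1) on the model from the script should give the corresponding plot for the training data.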

Normal Q-Q

A normal Q-Q plot shows whether the residuals are normally distributed. If they are, the points follow the straight reference line closely; the more they stray from it, the less normal the residuals are. By that standard, the Normal Q-Q plot above looks concerning.
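
For anyone wanting to reproduce this kind of plot, here is a minimal sketch using base R's qqnorm and qqline on the standardized residuals. As an assumption made for brevity, it fits on the full diamonds data rather than on train_reg, and the names mod_sketch and std_resid are hypothetical.

```r
library(ggplot2)  # provides the diamonds data set

# Assumption: fit on the full data set to keep the sketch short
mod_sketch <- lm(price ~ ., data = diamonds)

# Standardized residuals, the quantity plot.lm uses for its Q-Q plot
std_resid <- rstandard(mod_sketch)

# Sample quantiles against theoretical normal quantiles, plus a reference line
qqnorm(std_resid, pch = ".")
qqline(std_resid, col = "red")
```

The built-in shortcut is plot(mod_sketch, which = 2), which draws the same kind of plot directly from the fitted model.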