Understanding and following a machine learning model.
A Support Vector Machine (SVM) starts off with information, or data, you give it. It looks at the data, finds what similar items have in common, and splits the data into categories based on those similarities. Think of it looking at all the different types of food and putting them in different piles: candy in one pile, vegetables in another, and breakfast foods in a third. Except with SVM, you don't have to make all the piles yourself. We show it some food that is already sorted and tell it things, called variables, about each item, and it uses that to sort the rest on its own. Think about it like telling a computer “candy is sweet” or “waffles are only for breakfast” and it using that to sort the pile for you!
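To connect the analogy to the code that follows: the runs below use R's built-in iris dataset rather than food. As a quick sketch, assuming only base R, the four measurement columns play the role of the "things we tell the computer", and the Species column plays the role of the pile each flower belongs to.

```r
# iris ships with base R (datasets package); no extra packages are assumed.
data(iris)

# The four numeric columns are the "variables" the model learns from.
predictors <- setdiff(names(iris), "Species")
print(predictors)

# Species is the label: the "pile" each flower belongs to.
pile_sizes <- table(iris$Species)
print(pile_sizes)
```

Each of the three species has the same number of flowers, which is why the "Prevalence" rows in the output further down all read 0.3333.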
Watch as we run SVM and change the amount of data we give it to learn from.
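Changing "the amount of data we give it" means changing the share of rows used for training. The runs below use caret's createDataPartition, which samples within each species so the piles stay balanced. The base R sketch here only approximates that idea; the helper name stratified_split is made up for illustration, and rounding up with ceiling() is an assumption that happens to match the sample counts printed in the runs.

```r
# Sketch of a stratified split in base R. This approximates the idea behind
# caret::createDataPartition; it is not caret's actual implementation.
stratified_split <- function(labels, p) {
  # Group row numbers by class, then sample a share p from each group.
  idx_by_class <- split(seq_along(labels), labels)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
train_sizes <- sapply(c(0.50, 0.60, 0.75), function(p) {
  length(stratified_split(iris$Species, p))
})
print(train_sizes)  # 75 90 114 for iris (50 flowers per species)
```

Whatever is not picked for training becomes the test set, so the three runs test on 75, 60, and 36 flowers.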
Here, we tell the computer 60% of the information we know about the types of food we’ve given it.
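library(caret)   # createDataPartition(), train(), trainControl(), confusionMatrix()
library(dplyr)   # %>%, mutate(), if_else()

set.seed(1)  # make the random split reproducible
# Stratified split: 60% of each species goes to training, the rest to testing
trainIndex <- createDataPartition(iris$Species, p = .60, list = FALSE, times = 1)
SVMTrain <- iris[ trainIndex,]
SVMTest <- iris[-trainIndex,]
# Linear SVM (fitted via the kernlab package) with 10-fold cross-validation;
# predictors are centered and scaled before fitting
iris_SVM <- train(
  form = factor(Species) ~ .,
  data = SVMTrain,
  trControl = trainControl(method = "cv", number = 10,
                           classProbs = TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10)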
iris_SVM
Support Vector Machines with Linear Kernel
90 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 81, 81, 81, 81, 81, 81, ...
Resampling results:
Accuracy Kappa
0.9666667 0.95
Tuning parameter 'C' was held constant at a value of 1
summary(iris_SVM)
Length Class Mode
1 ksvm S4
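The "Summary of sample sizes: 81, 81, ..." line above comes from 10-fold cross-validation: the 90 training rows are cut into 10 folds, each fold is held out once, and the reported Accuracy is the average over those 10 held-out folds. A rough sketch of where 81 comes from, assuming plain random folds (caret balances its folds by class, so its fold sizes can differ slightly, as in the later runs):

```r
# Assign 90 training rows to 10 folds at random, 9 rows per fold.
set.seed(1)
n_train <- 90
fold_id <- sample(rep(1:10, length.out = n_train))

# Holding out one fold of 9 rows leaves 90 - 9 = 81 rows to fit on.
fit_sizes <- n_train - as.vector(table(fold_id))
print(fit_sizes)
```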
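# Predicted class probabilities for the held-out test rows
svm_Pred <- predict(iris_SVM, SVMTest, type = "prob")
svmtestpred <- cbind(svm_Pred, SVMTest)
# Label each row with its most probable class; ties are flagged as "PROBLEM"
svmtestpred <- svmtestpred %>%
  mutate(prediction = if_else(setosa > versicolor & setosa > virginica, "setosa",
                      if_else(versicolor > setosa & versicolor > virginica, "versicolor",
                      if_else(virginica > setosa & virginica > versicolor, "virginica", "PROBLEM"))))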
table(svmtestpred$prediction)
setosa versicolor virginica
20 18 22
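The nested if_else above simply picks, for each flower, the species with the highest predicted probability. As a hedged alternative sketch, base R's max.col() does the same job in one line. The probabilities below are made-up illustrative values, not output from the model.

```r
# Hypothetical class probabilities for three flowers (illustrative values only).
probs <- data.frame(
  setosa     = c(0.90, 0.05, 0.02),
  versicolor = c(0.07, 0.60, 0.38),
  virginica  = c(0.03, 0.35, 0.60)
)

# max.col() returns, for each row, the position of the largest column.
# ties.method = "first" makes ties deterministic instead of random.
prediction <- names(probs)[max.col(as.matrix(probs), ties.method = "first")]
print(prediction)
```

One difference in behavior: the if_else version labels an exact tie as "PROBLEM", while this sketch quietly takes the first of the tied classes.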
confusionMatrix(factor(svmtestpred$prediction),factor(svmtestpred$Species))
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         17         1
  virginica       0          3        19

Overall Statistics
               Accuracy : 0.9333
                 95% CI : (0.838, 0.9815)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.9
 Mcnemar's Test P-Value : NA

Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8500           0.9500
Specificity                 1.0000            0.9750           0.9250
Pos Pred Value              1.0000            0.9444           0.8636
Neg Pred Value              1.0000            0.9286           0.9737
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2833           0.3167
Detection Prevalence        0.3333            0.3000           0.3667
Balanced Accuracy           1.0000            0.9125           0.9375
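The headline numbers in this output can be rebuilt by hand from the nine counts in the confusion matrix. The sketch below, in base R, copies the counts from the 60% run and uses the usual textbook formulas; the variable names are chosen for illustration.

```r
# Confusion matrix from the 60% run: rows = predicted, columns = actual.
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(
  c(20,  0,  0,
     0, 17,  1,
     0,  3, 19),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(cm)
accuracy <- sum(diag(cm)) / n            # correct / all = 56 / 60

# Sensitivity: of the flowers that truly are class k, the share caught.
sensitivity <- diag(cm) / colSums(cm)

# Pos Pred Value: of the flowers predicted as class k, the share correct.
pos_pred_value <- diag(cm) / rowSums(cm)

# Kappa: accuracy corrected for the agreement expected by chance alone.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, kappa = cohen_kappa), 4))
print(round(sensitivity, 4))
print(round(pos_pred_value, 4))
```

For example, versicolor's sensitivity of 0.85 is 17 correctly labeled flowers out of the 20 that really are versicolor.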
Here, we tell the computer 75% of the information we know about the types of food we’ve given it.
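set.seed(1)  # same seed as before, so only the training share changes
# Stratified split: 75% of each species goes to training, the rest to testing
trainIndex <- createDataPartition(iris$Species, p = .75, list = FALSE, times = 1)
SVMTrain <- iris[ trainIndex,]
SVMTest <- iris[-trainIndex,]
# Same linear SVM setup: 10-fold cross-validation, centered and scaled predictors
iris_SVM <- train(
  form = factor(Species) ~ .,
  data = SVMTrain,
  trControl = trainControl(method = "cv", number = 10,
                           classProbs = TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10)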
iris_SVM
Support Vector Machines with Linear Kernel
114 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 102, 102, 103, 103, 104, 102, ...
Resampling results:
Accuracy Kappa
0.9833333 0.975
Tuning parameter 'C' was held constant at a value of 1
summary(iris_SVM)
Length Class Mode
1 ksvm S4
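# Predicted class probabilities for the held-out test rows
svm_Pred <- predict(iris_SVM, SVMTest, type = "prob")
svmtestpred <- cbind(svm_Pred, SVMTest)
# Label each row with its most probable class; ties are flagged as "PROBLEM"
svmtestpred <- svmtestpred %>%
  mutate(prediction = if_else(setosa > versicolor & setosa > virginica, "setosa",
                      if_else(versicolor > setosa & versicolor > virginica, "versicolor",
                      if_else(virginica > setosa & virginica > versicolor, "virginica", "PROBLEM"))))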
table(svmtestpred$prediction)
setosa versicolor virginica
12 10 14
confusionMatrix(factor(svmtestpred$prediction),factor(svmtestpred$Species))
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         10         0
  virginica       0          2        12

Overall Statistics
               Accuracy : 0.9444
                 95% CI : (0.8134, 0.9932)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : 1.728e-14
                  Kappa : 0.9167
 Mcnemar's Test P-Value : NA

Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8333           1.0000
Specificity                 1.0000            1.0000           0.9167
Pos Pred Value              1.0000            1.0000           0.8571
Neg Pred Value              1.0000            0.9231           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2778           0.3333
Detection Prevalence        0.3333            0.2778           0.3889
Balanced Accuracy           1.0000            0.9167           0.9583
Here, we tell the computer 50% of the information we know about the types of food we’ve given it.
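set.seed(1)  # same seed as before, so only the training share changes
# Stratified split: 50% of each species goes to training, the rest to testing
trainIndex <- createDataPartition(iris$Species, p = .5, list = FALSE, times = 1)
SVMTrain <- iris[ trainIndex,]
SVMTest <- iris[-trainIndex,]
# Same linear SVM setup: 10-fold cross-validation, centered and scaled predictors
iris_SVM <- train(
  form = factor(Species) ~ .,
  data = SVMTrain,
  trControl = trainControl(method = "cv", number = 10,
                           classProbs = TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10)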
iris_SVM
Support Vector Machines with Linear Kernel
75 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 69, 69, 66, 66, 69, 68, ...
Resampling results:
Accuracy Kappa
0.9065476 0.8579072
Tuning parameter 'C' was held constant at a value of 1
summary(iris_SVM)
Length Class Mode
1 ksvm S4
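# Predicted class probabilities for the held-out test rows
svm_Pred <- predict(iris_SVM, SVMTest, type = "prob")
svmtestpred <- cbind(svm_Pred, SVMTest)
# Label each row with its most probable class; ties are flagged as "PROBLEM"
svmtestpred <- svmtestpred %>%
  mutate(prediction = if_else(setosa > versicolor & setosa > virginica, "setosa",
                      if_else(versicolor > setosa & versicolor > virginica, "versicolor",
                      if_else(virginica > setosa & virginica > versicolor, "virginica", "PROBLEM"))))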
table(svmtestpred$prediction)
setosa versicolor virginica
25 23 27
confusionMatrix(factor(svmtestpred$prediction),factor(svmtestpred$Species))
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         23         0
  virginica       0          2        25

Overall Statistics
               Accuracy : 0.9733
                 95% CI : (0.907, 0.9968)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.96
 Mcnemar's Test P-Value : NA

Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9200           1.0000
Specificity                 1.0000            1.0000           0.9600
Pos Pred Value              1.0000            1.0000           0.9259
Neg Pred Value              1.0000            0.9615           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3067           0.3333
Detection Prevalence        0.3333            0.3067           0.3600
Balanced Accuracy           1.0000            0.9600           0.9800
As you can see, when we give it a little information (50% of the data), SVM is right about 91% of the time in cross-validation. When we give it a little more (60%), it's right about 97% of the time. Finally, when we give it even more (75%), it's right about 98% of the time. The accuracy on the held-out test data is less tidy (about 97%, 93%, and 94%, respectively), likely because the test sets are small, at only 36 to 75 flowers. The overall pattern makes sense: if you show somebody only a few examples from a huge pile of food, they might not be good at splitting the pile up. But if you show them many examples, they're much more likely to split the piles up the right way.
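To line the three runs up side by side, the sketch below copies the cross-validated accuracy and the test-set counts from the output above into one small table (base R only; the column names are chosen for illustration).

```r
# Numbers copied from the three runs above.
runs <- data.frame(
  train_share  = c(0.50, 0.60, 0.75),
  cv_accuracy  = c(0.9065476, 0.9666667, 0.9833333),
  test_correct = c(73, 56, 34),
  test_total   = c(75, 60, 36)
)
runs$test_accuracy <- runs$test_correct / runs$test_total
print(round(runs, 4))

# Does accuracy rise with every increase in training share?
cv_rises   <- all(diff(runs$cv_accuracy) > 0)
test_rises <- all(diff(runs$test_accuracy) > 0)
print(c(cv_rises = cv_rises, test_rises = test_rises))
```

The cross-validated accuracy climbs steadily with more training data, while the test-set accuracy moves around more from run to run.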
In accounting, the potential value is also large. Models can be used to estimate accounts such as Warranty Liability, Return Merchandise, and Allowance for Doubtful Accounts more accurately, among many others. Models can even be used to [once your 2nd paper is published I’ll write the title of it here and link the journal]