Understanding and following a machine learning model.
A Support Vector Machine (SVM) starts off with information, or data, that you give it. It then looks at the data, tries to find categories of similar items within it, and splits the data based on those categories. Think of it as looking at all the different types of food and putting them into different piles: candy might go in one pile, vegetables in another, and breakfast foods in a third. Except with SVM, you don’t have to make the piles yourself. We tell it things, called variables, about each food, and it uses those variables to make the piles on its own. Think of it like telling a computer “candy is sweet” or “waffles are only for breakfast” and it using that to sort the pile for you!
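To make that analogy concrete, here is a minimal sketch of the idea in R. The foods data frame, its columns sweetness and is_breakfast, and all of its values are invented for illustration; the call mirrors the caret::train() style used in the examples that follow.
library(caret)   # train() and friends; the "svmLinear" method also needs the kernlab package

# A made-up pile of foods, each described by two variables the model can use.
foods <- data.frame(
  sweetness    = c(9, 8, 7, 2, 1, 3, 4, 5, 3),
  is_breakfast = c(0, 0, 0, 0, 0, 0, 1, 1, 1),
  type         = factor(c("candy", "candy", "candy",
                          "vegetable", "vegetable", "vegetable",
                          "breakfast", "breakfast", "breakfast")))

# Fit a linear SVM; with so few rows we skip resampling and keep C fixed at 1.
food_SVM <- train(
  type ~ sweetness + is_breakfast,
  data      = foods,
  method    = "svmLinear",
  trControl = trainControl(method = "none"),
  tuneGrid  = data.frame(C = 1))

# Ask the model to put a new, sweet, non-breakfast food into one of the piles.
predict(food_SVM, data.frame(sweetness = 8, is_breakfast = 0))
With these made-up values, the model should sort the new item into the candy pile, because that is the pile whose variables it most resembles.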
Watch as we run SVM and change the amount of data we give it to learn from.
Here, we tell the computer 60% of the information we know about the types of food we’ve given it, and hold back the remaining 40% to test how well it learned.
library(caret)   # createDataPartition(), train(), confusionMatrix()
library(dplyr)   # %>% and mutate(), used when labeling the predictions below

set.seed(1)

# Split the data: 60% to learn from, the remaining 40% held back for testing.
trainIndex <- createDataPartition(iris$Species, p = .60, list = FALSE, times = 1)
SVMTrain <- iris[ trainIndex,]
SVMTest  <- iris[-trainIndex,]

# Fit a linear SVM on the training set, using 10-fold cross-validation.
iris_SVM <- train(
  form = factor(Species) ~ .,
  data = SVMTrain,
  trControl = trainControl(method = "cv", number = 10,
                           classProbs =  TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10)
iris_SVM
Support Vector Machines with Linear Kernel 
90 samples
 4 predictor
 3 classes: 'setosa', 'versicolor', 'virginica' 
Pre-processing: centered (4), scaled (4) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 81, 81, 81, 81, 81, 81, ... 
Resampling results:
  Accuracy   Kappa
  0.9666667  0.95 
Tuning parameter 'C' was held constant at a value of 1
summary(iris_SVM)
Length  Class   Mode 
     1   ksvm     S4 
svm_Pred <- predict(iris_SVM, SVMTest, type = "prob")
svmtestpred <- cbind(svm_Pred, SVMTest)
svmtestpred <- svmtestpred %>%
  mutate(prediction = if_else(setosa > versicolor & setosa > virginica, "setosa",
                              if_else(versicolor > setosa & versicolor > virginica, "versicolor",
                                      if_else(virginica > setosa & virginica > versicolor, "virginica", "PROBLEM"))))
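The nested if_else() above simply labels each flower with whichever class received the highest predicted probability. As a side note, the same labels can be obtained more directly; a minimal sketch, using the same iris_SVM, SVMTest, and svm_Pred objects created above, is:
# Option 1: ask caret for the predicted class instead of the class probabilities.
svm_class <- predict(iris_SVM, SVMTest, type = "raw")

# Option 2: take the column with the largest probability in each row of svm_Pred.
svm_class2 <- colnames(svm_Pred)[max.col(svm_Pred, ties.method = "first")]
Either way, the result should match the prediction column built above; the table() and confusionMatrix() calls that follow summarize how those predictions compare to the true species.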
table(svmtestpred$prediction)
    setosa versicolor  virginica 
        20         18         22 
confusionMatrix(factor(svmtestpred$prediction), factor(svmtestpred$Species))
Confusion Matrix and Statistics
            Reference
Prediction   setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         17         1
  virginica       0          3        19
Overall Statistics
                                         
               Accuracy : 0.9333         
                 95% CI : (0.838, 0.9815)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9            
                                         
 Mcnemar's Test P-Value : NA             
Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8500           0.9500
Specificity                 1.0000            0.9750           0.9250
Pos Pred Value              1.0000            0.9444           0.8636
Neg Pred Value              1.0000            0.9286           0.9737
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2833           0.3167
Detection Prevalence        0.3333            0.3000           0.3667
Balanced Accuracy           1.0000            0.9125           0.9375
Here, we tell the computer 75% of the information we know about the types of food we’ve given it.
set.seed(1)
trainIndex <- createDataPartition(iris$Species, p = .75, list = FALSE, times = 1)
SVMTrain <- iris[ trainIndex,]
SVMTest  <- iris[-trainIndex,]
iris_SVM <- train(
  form = factor(Species) ~ .,
  data = SVMTrain,
  trControl = trainControl(method = "cv", number = 10,
                           classProbs =  TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10)
iris_SVM
Support Vector Machines with Linear Kernel 
114 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 
Pre-processing: centered (4), scaled (4) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 102, 102, 103, 103, 104, 102, ... 
Resampling results:
  Accuracy   Kappa
  0.9833333  0.975
Tuning parameter 'C' was held constant at a value of 1
summary(iris_SVM)
Length  Class   Mode 
     1   ksvm     S4 
svm_Pred <- predict(iris_SVM, SVMTest, type = "prob")
svmtestpred <- cbind(svm_Pred, SVMTest)
svmtestpred <- svmtestpred %>%
  mutate(prediction = if_else(setosa > versicolor & setosa > virginica, "setosa",
                              if_else(versicolor > setosa & versicolor > virginica, "versicolor",
                                      if_else(virginica > setosa & virginica > versicolor, "virginica", "PROBLEM"))))
table(svmtestpred$prediction)
    setosa versicolor  virginica 
        12         10         14 
confusionMatrix(factor(svmtestpred$prediction), factor(svmtestpred$Species))
Confusion Matrix and Statistics
            Reference
Prediction   setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         10         0
  virginica       0          2        12
Overall Statistics
                                          
               Accuracy : 0.9444          
                 95% CI : (0.8134, 0.9932)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 1.728e-14       
                                          
                  Kappa : 0.9167          
                                          
 Mcnemar's Test P-Value : NA              
Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8333           1.0000
Specificity                 1.0000            1.0000           0.9167
Pos Pred Value              1.0000            1.0000           0.8571
Neg Pred Value              1.0000            0.9231           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2778           0.3333
Detection Prevalence        0.3333            0.2778           0.3889
Balanced Accuracy           1.0000            0.9167           0.9583
Here, we tell the computer 50% of the information we know about the types of food we’ve given it.
set.seed(1)
trainIndex <- createDataPartition(iris$Species, p = .5, list = FALSE, times = 1)
SVMTrain <- iris[ trainIndex,]
SVMTest  <- iris[-trainIndex,]
iris_SVM <- train(
  form = factor(Species) ~ .,
  data = SVMTrain,
  trControl = trainControl(method = "cv", number = 10,
                           classProbs =  TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10)
iris_SVM
Support Vector Machines with Linear Kernel 
75 samples
 4 predictor
 3 classes: 'setosa', 'versicolor', 'virginica' 
Pre-processing: centered (4), scaled (4) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 69, 69, 66, 66, 69, 68, ... 
Resampling results:
  Accuracy   Kappa    
  0.9065476  0.8579072
Tuning parameter 'C' was held constant at a value of 1
summary(iris_SVM)
Length  Class   Mode 
     1   ksvm     S4 
svm_Pred <- predict(iris_SVM, SVMTest, type = "prob")
svmtestpred <- cbind(svm_Pred, SVMTest)
svmtestpred <- svmtestpred %>%
  mutate(prediction = if_else(setosa > versicolor & setosa > virginica, "setosa",
                              if_else(versicolor > setosa & versicolor > virginica, "versicolor",
                                      if_else(virginica > setosa & virginica > versicolor, "virginica", "PROBLEM"))))
table(svmtestpred$prediction)
    setosa versicolor  virginica 
        25         23         27 
confusionMatrix(factor(svmtestpred$prediction), factor(svmtestpred$Species))
Confusion Matrix and Statistics
            Reference
Prediction   setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         23         0
  virginica       0          2        25
Overall Statistics
                                         
               Accuracy : 0.9733         
                 95% CI : (0.907, 0.9968)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.96           
                                         
 Mcnemar's Test P-Value : NA             
Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9200           1.0000
Specificity                 1.0000            1.0000           0.9600
Pos Pred Value              1.0000            1.0000           0.9259
Neg Pred Value              1.0000            0.9615           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3067           0.3333
Detection Prevalence        0.3333            0.3067           0.3600
Balanced Accuracy           1.0000            0.9600           0.9800
As you can see, when we train on only 50% of the data, SVM’s cross-validated accuracy is about 91%. When we train on 60%, it is right about 97% of the time, and when we train on 75%, about 98% of the time. This makes sense: if you only tell somebody one thing about one type of food in a huge pile, they might not be good at splitting the pile up. But if you tell them ten things about the foods in the pile, they are much more likely to split it up the right way.
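If you would like to reproduce this comparison without copying the code three times, here is a minimal sketch, assuming caret is loaded as above; split_sizes and cv_accuracy are names invented for this example. It loops over the three training proportions and collects the cross-validated accuracy for each.
# Re-fit the same linear SVM for each training proportion and record CV accuracy.
split_sizes <- c(0.50, 0.60, 0.75)
cv_accuracy <- sapply(split_sizes, function(p) {
  set.seed(1)
  idx <- createDataPartition(iris$Species, p = p, list = FALSE, times = 1)
  fit <- train(
    factor(Species) ~ .,
    data = iris[idx, ],
    trControl = trainControl(method = "cv", number = 10, classProbs = TRUE),
    method = "svmLinear",
    preProcess = c("center", "scale"))
  max(fit$results$Accuracy)   # svmLinear keeps C fixed, so this is the single CV accuracy
})
data.frame(train_prop = split_sizes, cv_accuracy = cv_accuracy)
The resulting table should show the same pattern described above: accuracy climbing as the training proportion grows.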
In accounting, the potential value is substantial. Models like this can be used to determine more accurate estimates for accounts such as Warranty Liability, Return Merchandise, and the Allowance for Doubtful Accounts, among others. Models can even be used to [once you 2nd paper is published I’ll write the title of it here and link the journal]