Course Project: Practical Machine Learning | Coursera

Author: Nabeel Mukhtar <nabeelmukhtar@gmail.com>

library(knitr)
opts_chunk$set(fig.width=8, fig.height=8)

This is the course project for Coursera Practical Machine Learning.

Objective

The baseline performance reported for this HAR dataset is 99% accuracy (see References). However, for the purpose of this assignment we will target an out-of-sample accuracy of 95% on the testing set.

Initialization

Here we seed the random number generator and load the training data. Note that the training data contains #DIV/0! values, which need to be parsed as NA.

library(caret)
## set seed for reproducibility
set.seed(32343)
## parse empty strings and "#DIV/0!" as missing values
wle_data <- read.csv("data/pml-training.csv", na.strings = c("", "NA", "#DIV/0!"))
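As a quick sanity check (a sketch, not part of the original analysis; the expected dimensions assume the standard pml-training.csv download), we can confirm the data loaded and that the #DIV/0! strings were parsed as NA:

## verify dimensions and NA parsing
dim(wle_data)                      ## 19622 x 160 for the standard download
sum(colSums(is.na(wle_data)) > 0)  ## number of columns that now contain NAs
table(wle_data$classe)             ## distribution of the five classes A-E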

Features and Training Parameters

We decided to use all the predictors for training, as applying PCA did not result in any improvement. Here we select all predictors that contain no NAs and are not text/date columns (such as user_name and raw_timestamp_part_1). There was no need to remove near-zero-variance predictors because there were none among the remaining variables (hence the commented-out check below).
We also initialize the training control parameters with 5-fold cross-validation (repeated cross-validation with a single repeat) for all classifiers.

predictors <- colnames(wle_data)
## keep only columns that contain no NAs
predictors <- predictors[colSums(is.na(wle_data)) == 0]
## drop the first seven bookkeeping columns (row id, user_name, timestamps, window markers)
predictors <- predictors[-(1:7)]
# nsv <- nearZeroVar(wle_data[, predictors])
# predictors <- predictors[-nsv]
classes <- unique(wle_data$classe)
class_colors <- 1 + as.integer(classes)  ## one plot color per class
fitControl <- trainControl(method="repeatedcv",
                           number=5,
                           repeats=1,
                           verboseIter=FALSE)
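To verify the claim that no near-zero-variance predictors remain, one can run caret's nearZeroVar with saveMetrics (a quick sketch, not part of the original analysis):

## with saveMetrics = TRUE, nearZeroVar returns one row per column with an
## nzv flag; we expect no TRUE values among the remaining predictors
nzv_metrics <- nearZeroVar(wle_data[, predictors], saveMetrics = TRUE)
sum(nzv_metrics$nzv)  ## expected: 0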

Data Partitioning

Here we split the data into training (49%), testing (21%) and validation (30%) datasets: first a 70/30 split into build and validation data, then a 70/30 split of the build data into training and testing (0.7 × 0.7 ≈ 49% and 0.7 × 0.3 = 21% of the total). The validation dataset is used for the ensemble classifier at the end.
Finally we remove unused variables and call gc() to reclaim memory for the later analysis.

inBuild <- createDataPartition(y=wle_data$classe,
                               p=0.7, list=FALSE)
validation <- wle_data[-inBuild, predictors]  ## 30% held out for the ensemble
buildData <- wle_data[inBuild, predictors]
inTrain <- createDataPartition(y=buildData$classe,
                               p=0.7, list=FALSE)
training <- buildData[inTrain, ]   ## 70% of 70% = 49%
testing <- buildData[-inTrain, ]   ## 30% of 70% = 21%
rm(buildData, wle_data, inBuild, inTrain)
clean <- gc(FALSE)  ## reclaim memory after dropping the full dataset
rm(clean)
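A quick check that the resulting proportions match the intended split (a sketch, not part of the original analysis):

## confirm the effective split is roughly 49% / 21% / 30%
n <- nrow(training) + nrow(testing) + nrow(validation)
round(c(training = nrow(training),
        testing = nrow(testing),
        validation = nrow(validation)) / n, 2)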

First Attempt: Decision Tree

The first classifier we tried was a decision tree, because the resulting model is highly interpretable and also gives insight into which predictors matter most, which is useful for later feature selection.
Unfortunately the accuracy of the tree on the testing set was not very good (56%). We tried different configurations but they did not help much. Here are the tree and the results.

modeltree <- train(classe ~., data=training, method="rpart", trControl=fitControl)
library(rattle)
fancyRpartPlot(modeltree$finalModel)
[Figure: fitted decision tree]
predicttree <- predict(modeltree, newdata=testing)
cmtree <- confusionMatrix(predicttree, testing$classe)
plot(cmtree$table, col = class_colors, main = paste("Decision Tree Confusion Matrix: Accuracy=", round(cmtree$overall['Accuracy'], 2)))
[Figure: decision tree confusion matrix plot]
kable(cmtree$byClass, digits = 2, caption = "Per Class Metrics")
Per Class Metrics

|          | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|----------|-------------|-------------|----------------|----------------|------------|----------------|----------------------|-------------------|
| Class: A | 0.62        | 0.96        | 0.87           | 0.86           | 0.28       | 0.18           | 0.20                 | 0.79              |
| Class: B | 0.46        | 0.87        | 0.46           | 0.87           | 0.19       | 0.09           | 0.19                 | 0.66              |
| Class: C | 0.77        | 0.84        | 0.51           | 0.94           | 0.17       | 0.13           | 0.26                 | 0.80              |
| Class: D | 0.49        | 0.78        | 0.31           | 0.89           | 0.16       | 0.08           | 0.26                 | 0.63              |
| Class: E | 0.43        | 1.00        | 0.98           | 0.89           | 0.18       | 0.08           | 0.08                 | 0.71              |
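Since interpretability was the motivation for trying a tree, its variable importance ranking is also worth inspecting; a sketch using caret's varImp (not part of the original analysis):

## rank predictors by importance according to the fitted rpart model;
## the top entries hint at which sensor readings matter most
treeImp <- varImp(modeltree)
plot(treeImp, top = 10)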

Second Attempt: Linear Discriminant Analysis

The second classifier we tried was LDA, which was also taught in the course. We used the default parameters with 5-fold cross-validation. The accuracy improved considerably to 71%, but that was still well short of our target. Here are the results.

modellda <- train(classe ~., data=training, method="lda", trControl=fitControl)
predictlda <- predict(modellda, newdata=testing)
cmlda <- confusionMatrix(predictlda, testing$classe)
plot(cmlda$table, col = class_colors, main = paste("LDA Confusion Matrix: Accuracy=", round(cmlda$overall['Accuracy'], 2)))
[Figure: LDA confusion matrix plot]
kable(cmlda$byClass, digits = 2, caption = "Per Class Metrics")
Per Class Metrics

|          | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|----------|-------------|-------------|----------------|----------------|------------|----------------|----------------------|-------------------|
| Class: A | 0.84        | 0.91        | 0.79           | 0.93           | 0.28       | 0.24           | 0.30                 | 0.87              |
| Class: B | 0.65        | 0.93        | 0.70           | 0.92           | 0.19       | 0.13           | 0.18                 | 0.79              |
| Class: C | 0.68        | 0.90        | 0.59           | 0.93           | 0.17       | 0.12           | 0.20                 | 0.79              |
| Class: D | 0.71        | 0.92        | 0.64           | 0.94           | 0.16       | 0.12           | 0.18                 | 0.82              |
| Class: E | 0.62        | 0.98        | 0.86           | 0.92           | 0.18       | 0.11           | 0.13                 | 0.80              |

Third Attempt: Generalized Boosted Regression Modeling

Finally we tried the GBM classifier. We ran it with the same cross-validation setup, and its accuracy on the testing set was much better (96%), even though it took much longer to train.

modelgbm <- train(classe ~., data=training, method="gbm", trControl=fitControl, verbose = FALSE)
predictgbm <- predict(modelgbm, newdata=testing)
cmgbm <- confusionMatrix(predictgbm, testing$classe)
plot(cmgbm$table, col = class_colors, main = paste("GBM Confusion Matrix: Accuracy=", round(cmgbm$overall['Accuracy'], 2)))
[Figure: GBM confusion matrix plot]
kable(cmgbm$byClass, digits = 2, caption = "Per Class Metrics")
Per Class Metrics

|          | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|----------|-------------|-------------|----------------|----------------|------------|----------------|----------------------|-------------------|
| Class: A | 0.98        | 0.99        | 0.98           | 0.99           | 0.28       | 0.28           | 0.29                 | 0.99              |
| Class: B | 0.94        | 0.99        | 0.95           | 0.99           | 0.19       | 0.18           | 0.19                 | 0.97              |
| Class: C | 0.97        | 0.98        | 0.93           | 0.99           | 0.17       | 0.17           | 0.18                 | 0.98              |
| Class: D | 0.95        | 0.99        | 0.96           | 0.99           | 0.16       | 0.16           | 0.16                 | 0.97              |
| Class: E | 0.95        | 1.00        | 0.98           | 0.99           | 0.18       | 0.18           | 0.18                 | 0.97              |
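caret tunes the GBM hyperparameters (number of trees, interaction depth, shrinkage) over a default grid during cross-validation; the selected combination and the resampling profile can be inspected as sketched below (not part of the original analysis):

## show the hyperparameter combination caret picked by cross-validated
## accuracy, and the accuracy profile over the default tuning grid
modelgbm$bestTune
plot(modelgbm)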

Final Attempt: Ensemble Classifier

Next we built an ensemble of the three classifiers above by stacking their predictions with a random forest, to see if that would further improve the accuracy. Unfortunately the accuracy on the validation set was still 96%, with only a slight improvement for class E.

predicttesting <- data.frame(predicttree, predictgbm, predictlda, classe = testing$classe)
modelensemble <- train(classe ~ ., data = predicttesting, method = "rf")
predictvalidation <- data.frame(predicttree = predict(modeltree, newdata=validation),
                                predictgbm = predict(modelgbm, newdata=validation),
                                predictlda = predict(modellda, newdata=validation),
                                classe = validation$classe)
predictensemble <- predict(modelensemble, predictvalidation)
cmensemble <- confusionMatrix(predictensemble, validation$classe)
plot(cmensemble$table, col = class_colors, main = paste("Ensemble Confusion Matrix: Accuracy=", round(cmensemble$overall['Accuracy'], 2)))
[Figure: ensemble confusion matrix plot]
kable(cmensemble$byClass, digits = 2, caption = "Per Class Metrics")
Per Class Metrics

|          | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|----------|-------------|-------------|----------------|----------------|------------|----------------|----------------------|-------------------|
| Class: A | 0.98        | 0.99        | 0.98           | 0.99           | 0.28       | 0.28           | 0.28                 | 0.99              |
| Class: B | 0.95        | 0.98        | 0.94           | 0.99           | 0.19       | 0.18           | 0.20                 | 0.97              |
| Class: C | 0.96        | 0.99        | 0.93           | 0.99           | 0.17       | 0.17           | 0.18                 | 0.97              |
| Class: D | 0.95        | 0.99        | 0.97           | 0.99           | 0.16       | 0.16           | 0.16                 | 0.97              |
| Class: E | 0.97        | 1.00        | 0.99           | 0.99           | 0.18       | 0.18           | 0.18                 | 0.98              |
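For a fair comparison against the strongest base learner, one can also score the GBM model alone on the same validation set (a sketch, not part of the original analysis):

## baseline for the ensemble: GBM alone on the validation data
cmgbmval <- confusionMatrix(predict(modelgbm, newdata = validation),
                            validation$classe)
cmgbmval$overall['Accuracy']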

Even Better Attempt: Random Forest

We had tried a random forest initially but could not get it to finish in a reasonable amount of time. After evaluating some peer assignments we realized that a random forest would have been an even more accurate classifier, and that it finishes in reasonable time given proper feature selection and a limit of 100 trees. Here is another attempt, which achieved the best accuracy of 98%.
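One plausible way to arrive at such a shortlist is to take the top variables from the importance ranking of an earlier model, for example the GBM fit (a hedged sketch, not how the list below was originally obtained):

## shortlist the seven highest-importance predictors from the GBM model;
## varImp for gbm reports a single "Overall" score per variable
gbmImp <- varImp(modelgbm)$importance
head(rownames(gbmImp)[order(gbmImp$Overall, decreasing = TRUE)], 7)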

modelrf <- train(classe ~ roll_belt + pitch_forearm + magnet_dumbbell_z + yaw_belt +
                   magnet_dumbbell_y + roll_forearm + pitch_belt,
                 data=training, method="rf", ntree = 100)
predictrf <- predict(modelrf, newdata=testing)
cmrf <- confusionMatrix(predictrf, testing$classe)
plot(cmrf$table, col = class_colors, main = paste("Random Forest Confusion Matrix: Accuracy=", round(cmrf$overall['Accuracy'], 2)))
[Figure: random forest confusion matrix plot]
kable(cmrf$byClass, digits = 2, caption = "Per Class Metrics")
Per Class Metrics

|          | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|----------|-------------|-------------|----------------|----------------|------------|----------------|----------------------|-------------------|
| Class: A | 0.99        | 1.00        | 0.99           | 0.99           | 0.28       | 0.28           | 0.28                 | 0.99              |
| Class: B | 0.96        | 0.99        | 0.98           | 0.99           | 0.19       | 0.19           | 0.19                 | 0.98              |
| Class: C | 0.99        | 0.99        | 0.96           | 1.00           | 0.17       | 0.17           | 0.18                 | 0.99              |
| Class: D | 0.99        | 1.00        | 0.98           | 1.00           | 0.16       | 0.16           | 0.17                 | 0.99              |
| Class: E | 0.98        | 1.00        | 0.99           | 1.00           | 0.18       | 0.18           | 0.18                 | 0.99              |

Conclusion

The ensemble classifier could not improve on the individual classifiers, probably because the accuracies of LDA and the decision tree were much lower than GBM's. So in the end we decided to go with the GBM classifier, which achieved 96% accuracy, above our 95% target. Alas, we could not evaluate the random forest (which later proved the most accurate at 98%) before the assignment deadline.

References

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013. http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises