My attempt at the Kaggle competition "San Francisco Crime Classification"
The "San Francisco Crime Classification" challenge, is a Kaggle competition aimed to predict the category of the crimes that occurred in the city, given the time and location of the incident.
In this post, I explain and outline my first solution to this challenge.
Link to the competition: San Francisco Crime Classification
The algorithm chosen for the implemented solution is a random forest, an ensemble learning method used mostly for classification and regression.
The competition provides two datasets: a train dataset and a test dataset. The train dataset is made of 878049 observations and the test dataset of 884262 observations. Both of them contain incidents from January 1, 2003 to May 13, 2015.
In an attempt to produce a lower error rate, three random forests were created, using different predictors and numbers of trees.
Here, they are presented in the same order they were created.
The first step is to load the random forest library, randomForest, followed by loading both the train and test datasets.
To install the package, use this command:
install.packages("randomForest")
library(randomForest)
setwd("~/path/to/working/directory")
train <- read.csv("~/path/to/working/directory/train.csv")
test <- read.csv("~/path/to/working/directory/test.csv")
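Before building any model, a quick check (a minimal sketch, not part of the original code) confirms the sizes and the date range of both sets:
# Quick sanity check: number of observations in each set and the date range
# of the incidents (the 'Dates' column is read as text/factor by read.csv).
dim(train)                                    # 878049 rows expected
dim(test)                                     # 884262 rows expected
range(as.POSIXct(as.character(train$Dates)))  # 2003-01-01 to 2015-05-13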
The first model uses the day of the week ('DayOfWeek') and the police district ('PdDistrict') as the predictors. The forest is made of 25 trees.
rf <- randomForest(Category ~ DayOfWeek + PdDistrict, data = train, ntree = 25)
This model produced an average error rate of 77.95% (the error rate was averaged over several runs).
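Assuming this figure refers to the out-of-bag (OOB) error that randomForest reports, here is a rough sketch of how it can be averaged over several runs (the helper average.oob.error is hypothetical, not part of the original code):
# Hypothetical helper: fit the forest several times and average the final
# OOB error rate stored in the err.rate matrix of each model.
average.oob.error <- function(formula, data, ntree = 25, runs = 5) {
  errors <- replicate(runs, {
    model <- randomForest(formula, data = data, ntree = ntree)
    model$err.rate[ntree, "OOB"]  # overall OOB error after all the trees
  })
  mean(errors)
}
average.oob.error(Category ~ DayOfWeek + PdDistrict, train)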
For the next model, a new column named 'Hour' was created. The value of this new field is the hour (in 24h format) of the incident (taken from the 'Dates' column).
# Make a new column containing the hour (24h format) of the crime
train$Hour <- sapply(train$Dates, function(x) as.integer(strftime(x, format = "%H")))
# Another random forest model using the same predictors as before, plus the hour of
# the crime
rf <- randomForest(Category ~ DayOfWeek + PdDistrict + Hour, data = train, ntree = 25)
The average error rate is 77.53%, a small improvement over the previous model.
Like the previous model, this one also introduces a new column, 'TimeOfDay'. This new variable indicates the time of the day (early morning, morning, afternoon or night) when the incident occurred.
To create it, a custom function was written.
# Function that returns the time of the day (early morning, morning, afternoon
# or night) according to the hour.
timeoftheday <- function(hour) {
  if (hour >= 1 && hour <= 6) { return(as.factor("early morning")) }
  else if (hour >= 7 && hour <= 11) { return(as.factor("morning")) }
  else if (hour >= 12 && hour <= 19) { return(as.factor("afternoon")) }
  else return(as.factor("night"))
}
train$TimeOfDay <- sapply(train$Hour, timeoftheday)
rf <- randomForest(Category ~ DayOfWeek + PdDistrict + TimeOfDay, data = train, ntree = 25)
The average error rate was 77.97% (worse than the previous model).
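A quick way to see how much each predictor contributes to a given forest is the variable importance that randomForest computes (a diagnostic sketch, not part of the original comparison):
# Mean decrease in Gini impurity for each predictor of the current forest;
# varImpPlot draws the same information.
importance(rf)
varImpPlot(rf)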
After realizing how similar the error rates were, my next step was to try to minimize the error rate of the second model, since it had the lowest one. However, even with a larger number of trees, the error rate stayed very similar. Although in several cases the model achieved an error rate lower than 77.53%, the time it took to train was significantly longer.
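For reference, this is a sketch of how that trade-off can be measured (ntree = 100 is an arbitrary value chosen for illustration):
# Train the second model with more trees and record both the final OOB error
# and the time spent training.
elapsed <- system.time(
  rf.larger <- randomForest(Category ~ DayOfWeek + PdDistrict + Hour,
                            data = train, ntree = 100)
)
rf.larger$err.rate[100, "OOB"]  # final OOB error rate
elapsed["elapsed"]              # training time in seconds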
Once the model is trained (the second one, which uses 'DayOfWeek', 'PdDistrict' and 'Hour' as predictors), the next step is to apply it to the test dataset to predict the category of each crime.
# Add the hour column to the test set.
test$Hour <- sapply(test$Dates, function(x) as.integer(strftime(x, format = "%H")))
predictions.result <- predict(rf, test)
The last step is preparing the output file.
# Build a 0/1 table with one row per test observation and one column per
# crime category, number the rows from 0 to 884261 and write the file.
results.for.submission <- table(1:length(predictions.result), predictions.result)
rownames(results.for.submission) <- 0:884261
write.csv(results.for.submission, file = "results.csv")
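Before uploading, a quick check (again, not part of the original code) that the file has one row per test observation and exactly one flagged category per row:
# The submission should have 884262 rows and one column per crime category,
# with a single 1 in each row.
dim(results.for.submission)
all(rowSums(results.for.submission) == 1)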
Upload it to Kaggle!
As expected, the score I received from Kaggle was not that great (26.74064, around #350 on the leaderboard). My plan for the next attempt is to play around with the variables I left untouched this time and to try another classification model.