My attempt at the Kaggle competition "San Francisco Crime Classification"
The "San Francisco Crime Classification" challenge, is a Kaggle competition aimed to predict the category of the crimes that occurred in the city, given the time and location of the incident.
In this post, I explain and outline my first solution to this challenge.
Link to the competition: San Francisco Crime Classification
The algorithm chosen for the implemented solution is a random forest, an ensemble learning method used mostly for classification and regression.
The competition provides two datasets: a train dataset and a test dataset. The train dataset is made of 878049 observations and the test dataset of 884262 observations. Both of them contain incidents from January 1, 2003 to May 13, 2015.
In an attempt to produce a lower error rate, three random forests were created, using different predictors and numbers of trees.
Here, they are presented in the same order they were created.
The first step is to load the random forest library, randomForest, followed by loading both the train and test datasets.
To install the package, use this command:
install.packages("randomForest")
library(randomForest)
setwd("~/path/to/working/directory")
train <- read.csv("~/path/to/working/directory/train.csv")
test <- read.csv("~/path/to/working/directory/test.csv")
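Before building any model, a quick check (a minimal sketch, not part of the original code) confirms the sizes and the date range of both sets:
# Quick sanity check: number of observations in each set and the date range
# of the incidents (the 'Dates' column is read as text/factor by read.csv).
dim(train)                                    # 878049 rows expected
dim(test)                                     # 884262 rows expected
range(as.POSIXct(as.character(train$Dates)))  # 2003-01-01 to 2015-05-13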
The first model uses the day of the week ('DayOfWeek') and the police district ('PdDistrict') as the predictors. The forest is made of 25 trees.
rf <- randomForest(Category ~ DayOfWeek + PdDistrict, data = train, ntree = 25)
This model produced an average error rate of 77.95% (the error rate was averaged over several runs).
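Assuming this figure refers to the out-of-bag (OOB) error that randomForest reports, here is a rough sketch of how it can be averaged over several runs (the helper average.oob.error is hypothetical, not part of the original code):
# Hypothetical helper: fit the forest several times and average the final
# OOB error rate stored in the err.rate matrix of each model.
average.oob.error <- function(formula, data, ntree = 25, runs = 5) {
  errors <- replicate(runs, {
    model <- randomForest(formula, data = data, ntree = ntree)
    model$err.rate[ntree, "OOB"]  # overall OOB error after all the trees
  })
  mean(errors)
}
average.oob.error(Category ~ DayOfWeek + PdDistrict, train)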
For the next model, a new column named 'Hour' was created. The value of this new field is the hour (in 24h format) of the incident (taken from the 'Dates' column).
# Make a new column containing the hour (24h format) of the crime
train$Hour <- sapply(train$Dates, function(x) as.integer(strftime(x, format = "%H")))
# Another random forest model using the same predictors as before, plus the hour of
# the crime
rf <- randomForest(Category ~ DayOfWeek + PdDistrict + Hour, data = train, ntree = 25)
The average error rate is 77.53%, a small improvement over the previous model.
Like the previous model, this one also introduces a new column, 'TimeOfDay'. This new variable indicates the time of the day (early morning, morning, afternoon or night) when the incident occurred.
To create it, a custom function was written.
# Function that returns the time of the day (early morning, morning, afternoon
# or night) according to the hour.
timeoftheday <- function(hour) {
  if (hour >= 1 && hour <= 6) { return(as.factor("early morning")) }
  else if (hour >= 7 && hour <= 11) { return(as.factor("morning")) }
  else if (hour >= 12 && hour <= 19) { return(as.factor("afternoon")) }
  else return(as.factor("night"))
}
train$TimeOfDay <- sapply(train$Hour, timeoftheday)
rf <- randomForest(Category ~ DayOfWeek + PdDistrict + TimeOfDay, data = train, ntree = 25)
The average error rate was 77.97% (worse than the previous model).
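A quick way to see how much each predictor contributes to a given forest is the variable importance that randomForest computes (a diagnostic sketch, not part of the original comparison):
# Mean decrease in Gini impurity for each predictor of the current forest;
# varImpPlot draws the same information.
importance(rf)
varImpPlot(rf)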
After realizing how similar the error rates were, my next step was to try to minimize the error rate of the second model, since it had the lowest one. However, even with a larger number of trees, the error rate stayed very similar. Although in several cases the model achieved an error rate lower than 77.53%, the time it took to train was significantly longer.
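For reference, this is a sketch of how that trade-off can be measured (ntree = 100 is an arbitrary value chosen for illustration):
# Train the second model with more trees and record both the final OOB error
# and the time spent training.
elapsed <- system.time(
  rf.larger <- randomForest(Category ~ DayOfWeek + PdDistrict + Hour,
                            data = train, ntree = 100)
)
rf.larger$err.rate[100, "OOB"]  # final OOB error rate
elapsed["elapsed"]              # training time in seconds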
Once the model is trained (the second one, which uses 'DayOfWeek', 'PdDistrict' and 'Hour' as predictors), the next step is to apply it to the test dataset to predict the category of each crime.
# Add the hour column to the test set.
test$Hour <- sapply(test$Dates, function(x) as.integer(strftime(x, format = "%H")))
predictions.result <- predict(rf, test)
The last step is preparing the output file.
# Build a 0/1 table with one row per test observation and one column per
# crime category, number the rows from 0 to 884261 and write the file.
results.for.submission <- table(1:length(predictions.result), predictions.result)
rownames(results.for.submission) <- 0:884261
write.csv(results.for.submission, file = "results.csv")
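Before uploading, a quick check (again, not part of the original code) that the file has one row per test observation and exactly one flagged category per row:
# The submission should have 884262 rows and one column per crime category,
# with a single 1 in each row.
dim(results.for.submission)
all(rowSums(results.for.submission) == 1)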
Upload it to Kaggle!
As expected, the score I received from Kaggle was not that great (26.74064, around #350 on the leaderboard). My plan for the next attempt is to play around with the variables I left untouched this time and to try another classification model.