Third attempt at the Kaggle competition "San Francisco Crime Classification"
View the Project on GitHub juandes/SFCrimeClassification-R-MultinomialModel
The "San Francisco Crime Classification" challenge, is a Kaggle competition aimed to predict the category of the crimes that occurred in the city, given the time and location of the incident.
In this post, I explain and outline my third solution to this challenge. This time using R (again).
Link to the competition: San Francisco Crime Classification
The algorithm chosen for this solution is a variation of multinomial logistic regression, a classification model based on regression where the dependent variable (what we want to predict) is categorical (as opposed to continuous), implemented here using neural networks.
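To illustrate the idea, here is a minimal, self-contained sketch (not part of the solution itself) that fits a multinomial log-linear model with nnet::multinom on R's built-in iris dataset:
library(nnet)

# Toy illustration: predict a 3-level factor (Species) from two numeric predictors
toy.model <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)

# For every observation, the model returns one probability per class
head(predict(toy.model, iris, type = "probs"))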
The competition provides two datasets: a training dataset and a test dataset. The training dataset is made of 878,049 observations and the test dataset of 884,262 observations.
Both of them contain incidents from January 1, 2003 to May 13, 2015.
These are the features of the datasets: Dates (timestamp of the incident), Category (the category of the crime, which is what we want to predict), Descript, DayOfWeek, PdDistrict (the police department district), Resolution, Address, and X and Y (the longitude and latitude of the incident). The test dataset has an Id column and does not include Category, Descript or Resolution.
For this solution, I used the nnet package. To install it, simply run this command in R:
install.packages('nnet')
Once the package is downloaded, the next step is loading the library, followed by setting the working directory and loading both datasets.
library(nnet)
setwd("~/path/to/working/directory")
train <- read.csv("~/path/to/working/directory/train.csv")
test <- read.csv("~/path/to/working/directory/test.csv")
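A quick, optional sanity check to confirm the data loaded as expected (the row counts mentioned above):
# Optional sanity check of the loaded data
dim(train)  # should report 878049 rows
dim(test)   # should report 884262 rows
str(train)  # names and types of the features listed earlier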
After loading the datasets, the next thing I did was to create new dataframes with just the columns needed (Category, DayOfWeek and PdDistrict); this way, we can save precious memory.
# New dataframes
train.df <- data.frame(Category = train$Category, DayOfWeek = train$DayOfWeek,
PdDistrict = train$PdDistrict)
test.df <- data.frame(DayOfWeek = test$DayOfWeek, PdDistrict = test$PdDistrict)
The next step is to add a new feature, the hour of the incident, to both datasets. This is done by calling the function strftime
on the original date to extract just the hour.
# Create a new column with the hour of the incident
train.df$Hour <- sapply(train$Dates, function(x) as.integer(strftime(x, format = "%H")))
test.df$Hour <- sapply(test$Dates, function(x) as.integer(strftime(x, format = "%H")))
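As a side note, applying sapply over roughly 880,000 rows is slow. Assuming the Dates column follows the "YYYY-MM-DD HH:MM:SS" format used in this competition, a vectorized version of the same idea would be:
# Vectorized alternative (assumes Dates are strings like "2015-05-13 23:53:00")
train.df$Hour <- as.integer(format(as.POSIXct(as.character(train$Dates),
                                              format = "%Y-%m-%d %H:%M:%S"), "%H"))
test.df$Hour <- as.integer(format(as.POSIXct(as.character(test$Dates),
                                             format = "%Y-%m-%d %H:%M:%S"), "%H"))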
Remove the original dataframes.
# Remove the original dataframes
rm(train)
rm(test)
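Optionally, calling gc() afterwards triggers a garbage collection and reports how much memory R is using:
gc()  # run garbage collection and print memory usage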
After pre-processing the data, the next step is to create and train the model. The model will predict the category of the crime using the day of the week, the police district and the hour of the incident as the predictors.
Instead of the default value of 100 iterations, I changed it to 500 (the maxit parameter). Keep in mind that the model will take some time (around 45 minutes on my setup) to finish training.
# Multinomial log-linear model using the day of the week, the district of the crime
# and the hour of the incident as the predictors.
multinom.model <- multinom(Category ~ DayOfWeek + PdDistrict + Hour, data = train.df,
maxit = 500)
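Before predicting on the test set, an optional sanity check is to compute the multi-class logarithmic loss on the training data, which is the metric Kaggle uses to score this competition. This is only a rough sketch and assumes the model converged:
# Optional: multi-class log loss on the training data (rough sanity check).
# predict() returns one probability column per category, in factor-level order.
train.probs <- predict(multinom.model, train.df, type = "probs")

# For each row, pick the probability the model assigned to the true category
true.idx <- cbind(seq_len(nrow(train.df)), as.integer(train.df$Category))
train.logloss <- -mean(log(pmax(train.probs[true.idx], 1e-15)))
train.logloss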
Now we predict!
predictions <- predict(multinom.model, test.df, "probs")
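A couple of quick checks on the resulting probability matrix (each row should be a probability distribution over the crime categories):
dim(predictions)            # one row per test observation, one column per category
head(rowSums(predictions))  # each row should sum to approximately 1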
To get a smaller file, I reduced the output to four significant digits.
submission <- format(predictions, digits=4, scientific = FALSE)
submission <- cbind(id = 0:884261, submission)
submission <- as.data.frame(submission)
write.csv(submission, file = "results.csv", row.names = FALSE)
The score received this time was way better than in my previous attempts. First, I got a score of 26.74064, followed by 26.78360; this time my score was 2.60502, which is a huge improvement. At the moment of writing, I do not have plans to improve the score further, since I would like to tackle more advanced challenges.