Third attempt at the Kaggle competition "San Francisco Crime Classification"
View the Project on GitHub juandes/SFCrimeClassification-R-MultinomialModel
The "San Francisco Crime Classification" challenge, is a Kaggle competition aimed to predict the category of the crimes that occurred in the city, given the time and location of the incident.
In this post, I explain and outline my third solution to this challenge. This time using R (again).
Link to the competition: San Francisco Crime Classification
The algorithm chosen for this solution is a variation of multinomial logistic regression, a classification model based on regression where the dependent variable (what we want to predict) is categorical (as opposed to continuous), implemented here using neural networks.
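To illustrate the idea, here is a minimal, self-contained sketch (not part of the solution itself) that fits a multinomial log-linear model with nnet::multinom on R's built-in iris dataset:
library(nnet)

# Toy illustration: predict a 3-level factor (Species) from two numeric predictors
toy.model <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)

# For every observation, the model returns one probability per class
head(predict(toy.model, iris, type = "probs"))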
The competition provides two datasets: a training dataset and a test dataset. The training dataset is made of 878,049 observations and the test dataset of 884,262 observations.
Both of them contain incidents from January 1, 2003 to May 13, 2015.
These are the features of the datasets: Dates (timestamp of the incident), Category (the category of the crime, which is what we want to predict), Descript, DayOfWeek, PdDistrict (the police department district), Resolution, Address, and X and Y (the longitude and latitude of the incident). The test dataset has an Id column and does not include Category, Descript or Resolution.
For this solution, I used the nnet package. To install it, simply run this command in R:
install.packages('nnet')
Once the package is downloaded, the next step is loading the library, followed by setting the working directory and loading both datasets.
library(nnet)
setwd("~/path/to/working/directory")
train <- read.csv("~/path/to/working/directory/train.csv")
test <- read.csv("~/path/to/working/directory/test.csv")
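A quick, optional sanity check to confirm the data loaded as expected (the row counts mentioned above):
# Optional sanity check of the loaded data
dim(train)  # should report 878049 rows
dim(test)   # should report 884262 rows
str(train)  # names and types of the features listed earlier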
After loading the datasets, the next thing I did was to create new dataframes with just the columns needed (Category, DayOfWeek and PdDistrict); this way, we can save precious memory.
# New dataframes
train.df <- data.frame(Category = train$Category, DayOfWeek = train$DayOfWeek,
PdDistrict = train$PdDistrict)
test.df <- data.frame(DayOfWeek = test$DayOfWeek, PdDistrict = test$PdDistrict)
The next step is to add a new feature, the hour of the incident, to both datasets. This is done by calling the function strftime
on the original date to extract just the hour.
# Create a new column with the hour of the incident
train.df$Hour <- sapply(train$Dates, function(x) as.integer(strftime(x, format = "%H")))
test.df$Hour <- sapply(test$Dates, function(x) as.integer(strftime(x, format = "%H")))
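As a side note, applying sapply over roughly 880,000 rows is slow. Assuming the Dates column follows the "YYYY-MM-DD HH:MM:SS" format used in this competition, a vectorized version of the same idea would be:
# Vectorized alternative (assumes Dates are strings like "2015-05-13 23:53:00")
train.df$Hour <- as.integer(format(as.POSIXct(as.character(train$Dates),
                                              format = "%Y-%m-%d %H:%M:%S"), "%H"))
test.df$Hour <- as.integer(format(as.POSIXct(as.character(test$Dates),
                                             format = "%Y-%m-%d %H:%M:%S"), "%H"))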
Remove the original dataframes.
# Remove the original dataframes
rm(train)
rm(test)
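Optionally, calling gc() afterwards triggers a garbage collection and reports how much memory R is using:
gc()  # run garbage collection and print memory usage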
After pre-processing the data, the next step is to create and train the model. The model will predict the category of the crime using the day of the week, the police district and the hour of the incident as the predictors.
Instead of the default value of 100 iterations, I changed it to 500 (the maxit parameter). Keep in mind that the model will take some time (around 45 minutes on my setup) to finish training.
# Multinomial log-linear model using the day of the week, the district of the crime
# and the hour of the incident as the predictors.
multinom.model <- multinom(Category ~ DayOfWeek + PdDistrict + Hour, data = train.df,
maxit = 500)
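Before predicting on the test set, an optional sanity check is to compute the multi-class logarithmic loss on the training data, which is the metric Kaggle uses to score this competition. This is only a rough sketch and assumes the model converged:
# Optional: multi-class log loss on the training data (rough sanity check).
# predict() returns one probability column per category, in factor-level order.
train.probs <- predict(multinom.model, train.df, type = "probs")

# For each row, pick the probability the model assigned to the true category
true.idx <- cbind(seq_len(nrow(train.df)), as.integer(train.df$Category))
train.logloss <- -mean(log(pmax(train.probs[true.idx], 1e-15)))
train.logloss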
Now we predict!
predictions <- predict(multinom.model, test.df, "probs")
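A couple of quick checks on the resulting probability matrix (each row should be a probability distribution over the crime categories):
dim(predictions)            # one row per test observation, one column per category
head(rowSums(predictions))  # each row should sum to approximately 1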
To get a smaller file, I reduced the output to four significant digits.
submission <- format(predictions, digits=4, scientific = FALSE)
submission <- cbind(id = 0:884261, submission)
submission <- as.data.frame(submission)
write.csv(submission, file = "results.csv", row.names = FALSE)
The score received this time was way better than in my previous attempts. First, I got a score of 26.74064, followed by 26.78360; this time my score was 2.60502, which is a huge improvement. At the moment of writing, I do not have plans to improve the score further, since I would like to tackle more advanced challenges.