A time series analysis about the crime in Puerto Rico from the year 2007 to 2014

Overview

Crime Pattern Detection is one of the many applications of data mining; it deals with the study, analysis and understanding of crimes through discovery of pattern or other significative insight in data. As you might think, detecting a pattern of criminal activity, or even predicting them is something that might seem taken from a movie. However, it is a real application of data mining, and there are many research groups working on that.

In this report I will introduce the concept of time series analysis using the R base package, to inspect the assassination history of Puerto Rico from 2007 to 2014.

The data

This report’s dataset features the total amount of murders perpetrated in Puerto Rico during the period 2007 - 2014. This data is divided by months and years.

Let’s take a look at it.

pr_crime_table <- read.csv("~/Desktop/pr_crime_table.csv", row.names = 1)
print(pr_crime_table)

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2007  66  55  67  47  62  52  41  60  69  65  78  68
## 2008  57  74  76  57  84  52  71  64  70  68  72  70
## 2009  56  80  75  63  91  79  63  75  84  84  85  59
## 2010  76  78  67  64  99  82  97  94  67  76  93  90
## 2011 112  94  99  70  92 107 100  95  97  97  87  86
## 2012  88  83  73  63  70  97  88  89  70  70  83 104
## 2013  76  75  64  69  75  75  72  96  65  88  66  62
## 2014  55  60  48  54  61  71  51  62  45  52  57  65

The first row of the dataset (the header) are the months, and the first column (row names) represent the year. For example, on January 2007 there were 66 registered assassinations.

The data was obtained from the Instituto de Estadisticas de Puerto Rico (Statistics Institute of Puerto Rico).

The analysis

Now that we have seen how the data is, let’s commence the analysis by building the time series model. To build the model we need a vector of the values to be analyzed (the frequency of crimes), and the time interval, seen as frequency on the code, which is 12 since we are using monthly data, and lastly we need to state the starting period (January 2007).

# Transform the table into a vector; ts should just take a vector, it performs
# the split into times by itself
ts.model <- ts(c(t(pr_crime_table)), frequency = 12, start = c(2007, 1))

Once the model is built, the next step is to decompose it into several components. These components are: seasonal, trend and irregular. In the next section, I will explain these concepts in detail.

To decompose the data into the previously mentioned elements, I applied a technique named moving averages - a calculation that shows the average value of a subset over a period of time.

decomposition.ma <- decompose(ts.model)
names(decomposition.ma)

## [1] "x"        "seasonal" "trend"    "random"   "figure"   "type"

The previous output shows the components of this decomposed object. As it was previously mentioned, our objective was to decompose the time series into a seasonal, trend and irregular (named random in the object) component.

The seasonal trend

The seasonal trend is the common trend across all seasons (months).

plot(decomposition.ma$figure, type='b', xaxt='n', xlab='', ylab='Difference of murders'
     , main='Difference of murders from the global average')
months.names <- months(ISOdate(2007,1:12,1))
axis(1, at=1:12, labels=months.names, las=2)

The past plot is the seasonal trend of the data. What you see in the figure is the generalized trend of the crime activity during a year, based on seven years of crime. The x-axis represent the months of the year, and the y-axis shows the difference between the global average number of crimes, and the average number of crimes of that month.

The plot shows that during the period February - April, the amount of crimes decreases until it reaches a global minima at -13.5. Then, on the following month, May, the average number of murders increase by 19. Thus, we have the less active month, and the most active back to back. From April until the end of the year, the crime activity does not vary that much. The following table shows the exact values of the plot.

print(decomposition.ma$figure)

##  [1]  -2.3536706   1.0034722  -4.8655754 -13.5024802   5.5570437
##  [6]   4.4141865  -0.5143849   5.3784722  -1.8239087   1.9618056
## [11]   4.2118056   0.5332341

The trend

The trend decomposition is the trend of the murders across the complete dataset.

plot(decomposition.ma$trend, xlab = 'Year', ylab = 'Number of assassinations',
     main = 'Trend of assassinations in Puerto Rico during 2007 - 2014')

We can see on this plot that the assassinations increases until it reaches a maximum point during mid-2011, meaning that 2011 was the year with the most assassinations. After that year the number decline.

The following table presents the total murders per year. If we compare those numbers with the plot, we can notice that it follows the pattern presented on the previous image. From 2007 until 2011, the frequency of murders increment up to 2011 (this is the peak on the plot). Afterwards, from 2011 until 2014 the recurrence of homicides slows down.

apply(pr_crime_table, 1, function(x) sum(x))

## 2007 2008 2009 2010 2011 2012 2013 2014 
##  730  815  894  983 1136  978  883  681

Decomposition plot

For the last section of the report, I will present a summary of the decomposition using a visualization that has the previous two figures, plus a plot of the entire original dataset, and the random or remainder decomposition. Each value of the remainder component is the difference between the original value from the seasonal plus trend fit.

plot(decomposition.ma)

The first plot is the original values. If we compare it to the second one, the trend, we can notice that it follows the same structure, a peak at mid-2011 and smaller values at the extremes. The third plot, the seasonal decomposition, is the same seasonal trend plot presented before, but repeated for all the years. Finally, the random plot is the remainder decomposition.

Closure

In this report, I presented the basics of a time series analysis using the base R package, and an application of it using crime data from Puerto Rico. The results found indicates that there is a trend of crimes that takes place during a year; during the first three months the homicide activity declines, followed by an increase in April, where it hits its maximum value, and subsequently it decreases and continues an almost similar behaviour until the end of the year. Moreover, the general trend of the data is that from 2008 up until 2011, the number of crimes increased at a slow pace, and after mid-2011 it decreased.