The code in this repo is the result of an interview exercise that was given to me. It involves stocks, some basic stats and a bit of web scraping.
View the Project on GitHub juandes/Stocks-StandardDeviation-Assignment
The work presented in this report is a coding assignment that was given to me during the hiring process for a data engineer position. Because the assignment was kind of cool, I decided to write a report about it.
The task of this assignment was to find the S&P stock with the highest adjusted close price standard deviation for the period from January 1, 2015 through September 28, 2015.
As mentioned previously, the purpose of this assignment was to find the S&P stock with the highest adjusted close price standard deviation during a given period of time. So, the first part of the problem is to get the list of stocks, since it was not given and it would be kind of silly to hardcode them (there are a bit over 500 of them). To do this, I used R's rvest web scraping package.
The data was scraped from the Wikipedia page List of S&P 500 companies. If you visit the page, you will see a table with the 505 common stocks. R's rvest package works by specifying the CSS selector that matches the data we want. To find the selector I needed, I used the SelectorGadget widget. If you use the gadget on the page linked above, you will see that the selector containing the stock symbol is tr:nth-child(i) td:nth-child(1), where i is the position of the stock in the table (starting from 2).
This is the R script for scraping the stocks.
library(rvest)

stocks <- data.frame(stock = character(), stringsAsFactors = FALSE)
stocks.site <- read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

# Rows 2 through 506 of the table hold the 505 stock symbols
for (i in 2:506) {
  stock.symbol <- stocks.site %>%
    html_node(paste0("tr:nth-child(", i, ") td:nth-child(1) .text")) %>%
    html_text()
  stocks[i - 1, 1] <- stock.symbol
}

write.table(stocks, file = 'stocks_list.txt', col.names = FALSE, row.names = FALSE,
            quote = FALSE)
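To make the selector logic concrete without hitting the network, here is a small Python sketch (the solution below is written in Python) that pulls the first cell of every table row, which is what tr:nth-child(i) td:nth-child(1) matches row by row. The two-row SAMPLE_HTML snippet is made up for illustration; the real page has 505 data rows.

```python
from html.parser import HTMLParser

# A tiny stand-in for the Wikipedia table (made-up sample, not real scraped data)
SAMPLE_HTML = """
<table>
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M Company</td></tr>
  <tr><td>ABT</td><td>Abbott Laboratories</td></tr>
</table>
"""

class FirstCellParser(HTMLParser):
    """Collects the text of the first <td> in every <tr>,
    mimicking the selector tr:nth-child(i) td:nth-child(1)."""
    def __init__(self):
        super().__init__()
        self.symbols = []
        self._cell_index = 0
        self._in_first_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cell_index = 0          # new row: reset the cell counter
        elif tag == 'td':
            self._cell_index += 1
            self._in_first_cell = (self._cell_index == 1)

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_first_cell = False

    def handle_data(self, data):
        if self._in_first_cell and data.strip():
            self.symbols.append(data.strip())

parser = FirstCellParser()
parser.feed(SAMPLE_HTML)
print(parser.symbols)  # ['MMM', 'ABT']
```

The header row contributes nothing because it contains only th cells, so only the symbol column survives, just as with the rvest selector.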
Now that we have scraped the data, it's time to write the solution.
The first step is to load the required libraries. As I mentioned at the beginning, we'll be using the Pandas library.
import datetime
import pandas as pd
import pandas.io.data
Read the data and create an empty list that will hold the stock symbols.
# Load the stocks codes from an external file
stocks_file = open('stocks_list.txt', 'r')
stocks_list = []
Specify the time period.
# Note: the stock market is closed on New Year's Day
start_date = datetime.datetime(2015, 1, 1)
end_date = datetime.datetime(2015, 9, 28)
To keep track of the largest standard deviation and its stock, I used a dictionary that is updated each time the program finds a standard deviation greater than the current one.
# This dict has the current stock with the highest std
current_highest_adjclose = {'stock': 'placeholder', 'stdev': -1}
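As a quick illustration of this update pattern (with made-up symbols and standard deviations, not real results): starting 'stdev' at -1 guarantees that the first stock examined replaces the placeholder, since a standard deviation is never negative.

```python
# Running-maximum pattern with made-up values, just to show how the
# dictionary is updated as stocks are processed
current_highest = {'stock': 'placeholder', 'stdev': -1}

fake_stdevs = {'AAA': 3.2, 'BBB': 7.9, 'CCC': 5.1}  # hypothetical symbols
for stock, stdev in fake_stdevs.items():
    if stdev > current_highest['stdev']:
        current_highest = {'stock': stock, 'stdev': stdev}

print(current_highest)  # {'stock': 'BBB', 'stdev': 7.9}
```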
Now that the structures have been created, it's time to add the stocks to stocks_list.
# Add the stocks to a new list while removing \n
for line in stocks_file:
    stocks_list.append(line.strip('\n'))
The next piece of code is the main part of the script. It is a loop that iterates through the stocks list and does the following: it downloads the data for each stock using pd.io.data.get_data_yahoo, which returns a dataframe. From said dataframe, we are interested in the 'Adj Close' column. The loop then computes that column's standard deviation and, if it is greater than the current highest, updates the dictionary.
for stock in stocks_list:
    print stock
    try:
        s = pd.io.data.get_data_yahoo(stock, start=start_date,
                                      end=end_date)['Adj Close']
        current_standard_deviation = s.std()
        if current_standard_deviation > current_highest_adjclose['stdev']:
            current_highest_adjclose['stock'] = stock
            current_highest_adjclose['stdev'] = current_standard_deviation
    except:
        print 'Something happened with ', stock
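One detail worth noting: pandas computes the sample standard deviation (ddof=1) by default, which is the quantity the script compares across stocks. A minimal Python 3 sketch with four made-up prices:

```python
import pandas as pd

# Four made-up adjusted close prices, purely to illustrate the computation
prices = pd.Series([10.0, 12.0, 11.0, 13.0])

# Sample standard deviation (ddof=1): sqrt(5/3), about 1.29
print(prices.std())

# Population version (ddof=0) for comparison: sqrt(5/4), about 1.12
print(prices.std(ddof=0))
```

Since every stock's deviation is computed the same way, the choice of ddof does not change which stock ranks highest, but it does change the reported number.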
Finally, the stock and its standard deviation are printed.
print 'Highest \'Adj Close\' is: %f (%s)' % (
    current_highest_adjclose['stdev'],
    current_highest_adjclose['stock'])
So there you have it! I thought this exercise was kind of cool, mostly because it is a bit different from your standard programming assignment. Even though it is very simple, it captures the essence of a realistic case you could encounter while working with data. As of this writing, I am thinking of adding new things to the script, such as visualizations or further univariate analysis using some of the standard deviations or the adjusted close prices.