An approach to classifying the races of characters from Lord of the Rings using their names as features and naive Bayes

Overview

As a huge fan of the Lord of the Rings and Tolkien's work, I was interested in finding a way of using data from the legendarium with machine learning. While searching for and pondering which problem could be interesting, I had the idea of playing around with the names of the characters and their relation to each character's race.

In this report, I discuss and demonstrate an approach to predicting the races of characters from Lord of the Rings using a naive Bayes classifier and several natural language processing techniques. How is that done, you might ask? The idea is to train an algorithm on the names of the characters and their races. During training, it learns which patterns in the names are associated with each race. For example, suppose we tell the algorithm that Juan, Jose and Jony are Spanish names; in other words, we are teaching the model that a four-letter name starting with J is a Spanish name. After the model is trained, we feed it the name Javi, and if the training was successful, the algorithm will output that the name is Spanish. The dataset consists of 789 observations (characters) and their respective races.

Data fields

name: the name of the character
race: the race of the character. There are four possible races: Man, Elf, Dwarf and Hobbit.

Tools used

Spark (PySpark): for developing and evaluating the model.
R: for scraping, transforming and preparing the data.

Scraping the data

The data used for the study was scraped from the website http://lotrproject.com/ (which is awesome). At the moment of writing, the homepage of the site features a family tree of all the characters from Tolkien's universe. Using Chrome's View Page Source, I copied the HTML code that is related to the characters to a new file.

Then, using R and the rvest web scraping library, I was able to scrape the data I wanted. The next piece of code shows this.

library(rvest)
html_data <- read_html("~/Development/lotr-names-classification/lotr-names-html.html")
characters_data <- data.frame(name = character(0), race = character(0),
                              stringsAsFactors = FALSE)

for (i in 1:952){

  # Get the name
  name <- html_data %>%
    html_nodes(paste0('#', i)) %>%
    html_text()

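  # Get the class attribute and split it into its space-separated tokens
  race_and_class <- strsplit(html_data %>%
    html_nodes(paste0('#', i)) %>%
    html_attr('class'), split = ' ')

  # Keep the entry only if the node exists; the race is the
  # second-to-last token of the class attribute
  if (length(name) > 0) {
    characters_data[i,] <- list(name, race_and_class[[1]][length(race_and_class[[1]]) - 1])
  }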
}

Now we have the data in a dataframe (a table structure; think of an Excel worksheet). However, we are not done yet! As usual, the data is not in the right shape. Some of the observations have ? as the character name, NA entries, or trailing spaces (white space after the end of the word). So, let's clean.

# Remove rows with NA
characters_data <- na.omit(characters_data)
# Remove rows where name is '?'
characters_data <- characters_data[grep('\\?', characters_data$name, invert = TRUE), ]
# Remove \n from the names
characters_data$name <- sub('\n', '', characters_data$name)
# Remove the prefix '1st', '2nd', etc.
characters_data$name <- sub('[0-9]?[0-9][a-z]{2}', '', characters_data$name)

In the previous piece of code, we removed the rows that contain NA, the rows where the character name is ?, and the prefixes 1st, 2nd, etc. that were present in some of the names. If you take a look at the linked website, you will see why the data has these.
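
For readers who prefer Python, the cleaning rules above can be sketched with the standard re module. This is only an illustration and not the code used for the report: the function name clean_names and the sample strings are made up, and unlike na.omit it looks only at the name, not at the race column.

```python
import re

def clean_names(raw_names):
    """Drop missing or '?' names, then strip newlines and ordinal prefixes."""
    cleaned = []
    for name in raw_names:
        # Skip missing entries and placeholder names containing '?'
        if name is None or '?' in name:
            continue
        # Remove the first newline, as sub('\n', '', ...) does in R
        name = name.replace('\n', '', 1)
        # Remove the first ordinal such as '1st', '2nd' or '12th'
        name = re.sub(r'[0-9]?[0-9][a-z]{2}', '', name, count=1)
        cleaned.append(name)
    return cleaned

# Made-up inputs, only to exercise each rule
sample = ['\n1stElros', '?', None, 'Frodo Baggins']
print(clean_names(sample))  # ['Elros', 'Frodo Baggins']
```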

While cleaning the data, I removed the characters whose races do not appear often in the dataset. With so few characters from those races, they would probably do more harm than good at prediction time.

# Subset the races that have a significant number of entries
characters_data <- characters_data[characters_data$race %in%
                        c('Ainur', 'Dwarf', 'Elf', 'Half-elf', 'Hobbit', 'Man'), ]
# Change the half-elves for elves (sorry Elrond)
characters_data$race[characters_data$race == 'Half-elf'] <- 'Elf'

[Table] Note: the name that appears with strange characters in the table should read Dunedain.

So we kept the Ainur, Dwarves, Men, Hobbits, Elves and Half-elves. The last two groups were merged into one, called Elf.

Lastly, I removed the trailing spaces, the entries that are not an actual name but a title (e.g. Master of Lake-town), and the surnames and numerals (e.g. Frodo Baggins -> Frodo and Thorin III -> Thorin).

# Remove trailing spaces
characters_data$name <- sub('[ \t]+$', '', characters_data$name)
# Remove the entries that are not actual names ('Others' and a truncated title)
characters_data <- characters_data[characters_data$name != 'Others' & 
                                     characters_data$name != 'Master of La...', ]

# The names of the characters on this dataframe won't have any surnames or
# numbers on their name; we'll keep just the first name.
characters_no_surnames <- characters_data

# Regex to remove everything after the first whitespace
characters_no_surnames$name <- sub(' .*', '', characters_no_surnames$name)
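
The trailing-space and surname rules can be sketched in Python as follows. This is a hedged illustration rather than the code used for the report; first_name and NOT_NAMES are hypothetical names, and the sample list is made up from the examples mentioned in the text.

```python
import re

# Entries that are not actual names; this set is illustrative only
NOT_NAMES = {'Others'}

def first_name(full_name):
    """Strip trailing whitespace, then keep the text before the first space."""
    trimmed = re.sub(r'[ \t]+$', '', full_name)
    return re.sub(r' .*', '', trimmed, count=1)

names = ['Frodo Baggins', 'Thorin III', 'Aragorn  ', 'Others']
print([first_name(n) for n in names if n not in NOT_NAMES])
# ['Frodo', 'Thorin', 'Aragorn']
```

Trimming before cutting at the first space matters: without it, a single-word name followed by spaces would still be handled, but the order makes the intent of each rule explicit.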

Then the data was exported to a text file. In addition to the dataset without surnames, I also included a second dataset with the full name of the character.

write.csv(characters_no_surnames, file = 'characters_no_surnames.csv', row.names = FALSE)
write.csv(characters_data, file = 'characters_data.csv', row.names = FALSE)

In an early version of this report I used the Ainur race. However, after performing the experiment I discovered that it didn't add much to the prediction model, since the number of cases was really low, so I decided to remove it entirely from the dataset.

# filter() here is the dplyr verb, so dplyr must be loaded
library(dplyr)
characters.no.ainur <- filter(characters_no_surnames, race != "Ainur")
write.csv(characters.no.ainur, file = "characters_no_ainur.csv", row.names = FALSE)

Model development and prediction

Loading and pre-processing of data

Now that we have the data, let's start the actual analysis in Spark. We will start by loading the data.

# Load the CSV exported from R (the version without the Ainur) with spark-csv.
# sqlContext is the SQLContext provided by the PySpark shell.
imported_data = sqlContext.read.format('com.databricks.spark.csv').options(
    header='true') \
    .load('/Users/Juande/Development/lotr-names-classification/characters_no_ainur.csv')

Because the data was exported from R as a CSV file, we need to load it as a CSV. Luckily for us, there is a package for Spark that handles this, spark-csv.

Once the data is loaded, the next step is to create an RDD (a structure that holds the data) whose rows have three fields. These are:

complete_name: the name of the character, e.g. Aragorn
name: the name of the character (in lower case) as a list of characters, e.g. ['a','r','a','g','o','r','n']
race: race of the character, as a number; 0.0 for man, 1.0 for elf, 2.0 for hobbit, and 3.0 for dwarf.

from pyspark.sql import Row

# Map the race to a number
race_to_number = {'Man': 0.0, 'Elf': 1.0, 'Hobbit': 2.0, 'Dwarf': 3.0}

# Build a new RDD in which each row holds the name of the character, the name as a
# list of characters, and the race of the character as a number
data_rdd = imported_data.rdd.map(lambda row: Row(complete_name=row.name, name=list(row.name.lower()),
                                                 race=race_to_number[row.race]))
df = sqlContext.createDataFrame(data_rdd)
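
To see the shape of a single row without running Spark, here is a minimal plain-Python sketch. The Character namedtuple is a hypothetical stand-in for pyspark.sql.Row and encode is an illustrative helper; neither is part of the original code.

```python
from collections import namedtuple

# Plain-Python stand-in for pyspark.sql.Row, used only for illustration
Character = namedtuple('Character', ['complete_name', 'name', 'race'])

RACE_TO_NUMBER = {'Man': 0.0, 'Elf': 1.0, 'Hobbit': 2.0, 'Dwarf': 3.0}

def encode(name, race):
    """Return the three fields described above for one character."""
    return Character(complete_name=name,
                     name=list(name.lower()),
                     race=RACE_TO_NUMBER[race])

row = encode('Aragorn', 'Man')
print(row.name)  # ['a', 'r', 'a', 'g', 'o', 'r', 'n']
print(row.race)  # 0.0
```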

Transformation pipeline

One of the reasons why I did this work was to test Spark's ML pipeline. Normally I use the MLlib library for machine learning, but for this work I wanted to try ML and its pipeline for the first time. A pipeline is a sequence of stages where the data is transformed at each step. For more details, check the official documentation on Spark's pipelines. You might be asking why we need to transform the names. The reason is that with this kind of problem (natural language processing), it is not always optimal to use the text as it is. Normally, you have to transform it in such a way that it is easier for the algorithms to process.

The pipeline used for transforming the data consists of two steps:

n-gram: an n-gram is a contiguous sequence of n items from a given sequence of text or speech (NGram). In this case, the items are the characters of the name. For this problem, I used an n-gram with n=2, also known as a bigram.
HashingTF: the hashing trick, or HashingTF as it is called in Spark, is a technique for turning features into indices of a vector. In other words, this transformation maps each bigram to a numeric index and records how many times it occurs.

To explain the pipeline, let's use the name 'aragorn' as an example. Don't worry about the meaning of the numbers; the purpose of this example is to illustrate the process.

aragorn -> apply n-gram -> ['a r', 'r a', 'a g', 'g o', 'o r', 'r n'] -> apply HashingTF -> [86, 143, 156, 277, 312, 323]
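
The same two steps can be sketched in plain Python. This is an illustration under assumptions, not Spark's implementation: char_ngrams and hashing_tf are hypothetical helpers, and CRC32 stands in for the hash function, so the indices will not match the ones Spark produces.

```python
import zlib

def char_ngrams(chars, n=2):
    """Join every run of n consecutive characters with a space."""
    return [' '.join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def hashing_tf(terms, num_features=500):
    """Map each term to a bucket index and count occurrences per bucket."""
    counts = {}
    for term in terms:
        # CRC32 is a stand-in hash; Spark's HashingTF uses its own hash function
        index = zlib.crc32(term.encode('utf-8')) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts

bigrams = char_ngrams(list('aragorn'))
print(bigrams)  # ['a r', 'r a', 'a g', 'g o', 'o r', 'r n']
print(hashing_tf(bigrams))  # bucket index -> count
```

Because the number of buckets is fixed, two different bigrams can land on the same index. That is the price of not having to build a vocabulary first.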

from pyspark.ml import Pipeline
from pyspark.ml.feature import NGram, HashingTF

# Pipeline consisting of two stages: NGram and HashingTF
ngram = NGram(n=2, inputCol="name", outputCol="nGrams")
hashingTF = HashingTF(numFeatures=500, inputCol="nGrams", outputCol="TF")
pipeline = Pipeline(stages=[ngram, hashingTF])

# Fit the pipeline and apply it to the data
pipelined_data = pipeline.fit(df)
transformed_data = pipelined_data.transform(df)
# Split the data into a training set (80%) and a test set (20%)
training_set, test_set = transformed_data.randomSplit([0.8, 0.2], seed=10)

Once the data is transformed, the dataset is split into a training set made of 80% of the original dataset, and a test set made of the remaining 20%.

Overview of the data

Before going into the actual prediction section, I would like to show some of the data so you can see it for yourself and reach your own conclusions about the similarities between the names (if there are any). When looking at it, think of the example of the Spanish names explained at the start.

Man	Elf	Hobbit	Dwarf
Aragorn	Arwen	Frodo	Durin
Aulendil	Ingwë	Ferumbras	Óin
Atanalcar	Ingil	Fortinbras	Thráin
Vardamir	Galadriel	Isembard	Thorin
Axantur	Celeborn	Flambard	Glóin

Notice any similarities between the races? What do you think?

Now, to the prediction model.

Prediction and results

The prediction model used in this report is a naive Bayes classifier. In most cases, this classifier performs well on text data despite its simplifying assumption that the attribute values are independent of each other given the class. But wait, if we are trying to predict the race from the structure of the name, why make that assumption? Good question. In this kind of problem the terms are in fact conditionally dependent on each other, but let's not think about that.
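
To make the independence assumption concrete, here is a toy multinomial naive Bayes with additive smoothing, written in plain Python. It is a sketch, not Spark's NaiveBayes: the helper names are hypothetical, the training names come from the table above, and the query name Thrór is only an illustrative input. Each bigram contributes its own log probability to the score, which is exactly where the terms are treated as independent.

```python
import math
from collections import Counter, defaultdict

def bigrams(name):
    """Return the character bigrams of a lower-cased name."""
    name = name.lower()
    return [name[i:i + 2] for i in range(len(name) - 1)]

def train(examples, smoothing=1.0):
    """Collect the counts needed for multinomial naive Bayes."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for name, race in examples:
        class_counts[race] += 1
        for term in bigrams(name):
            term_counts[race][term] += 1
            vocabulary.add(term)
    return class_counts, term_counts, vocabulary, smoothing

def predict(model, name):
    """Return the race with the highest log posterior for the name."""
    class_counts, term_counts, vocabulary, smoothing = model
    total = sum(class_counts.values())
    best_race, best_score = None, -math.inf
    for race, count in class_counts.items():
        # Log prior of the race
        score = math.log(count / total)
        denominator = sum(term_counts[race].values()) + smoothing * len(vocabulary)
        # Independence assumption: add one log-likelihood term per bigram
        for term in bigrams(name):
            score += math.log((term_counts[race][term] + smoothing) / denominator)
        if score > best_score:
            best_race, best_score = race, score
    return best_race

examples = [('Thorin', 'Dwarf'), ('Thráin', 'Dwarf'),
            ('Frodo', 'Hobbit'), ('Flambard', 'Hobbit')]
model = train(examples)
print(predict(model, 'Thrór'))  # Dwarf
```

The smoothing term keeps a bigram that was never seen for a race from driving the whole product to zero.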

In the next piece of code, the model is created, trained and tested using the test dataset.

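from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

# Create the model, train and predict
nb = NaiveBayes(smoothing=1.0, modelType="multinomial", featuresCol='TF', labelCol='race')
model = nb.fit(training_set)
predictions = model.transform(test_set)

# Evaluate the results
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='race')
# MulticlassMetrics expects an RDD of (prediction, label) pairs, in that order
result = predictions.select('prediction', 'race')
result_rdd = result.rdd.map(lambda row: (row.prediction, row.race))
metrics = MulticlassMetrics(result_rdd)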

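To examine the model further, I used a statistic called the F1 score, or F-measure. For a given race, it is the harmonic mean of precision (the proportion of characters predicted as that race that really belong to it) and recall (the proportion of characters of that race that were predicted as such).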
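
As a small worked example of how a per-class F1 score is computed, the sketch below counts true positives, false positives and false negatives for each label. The function name f1_per_class is hypothetical and the labels are made up; they are not results from the experiment.

```python
def f1_per_class(labels, predictions):
    """Compute (precision, recall, F1) for every label that appears."""
    scores = {}
    pairs = list(zip(labels, predictions))
    for cls in sorted(set(labels) | set(predictions)):
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cls] = (precision, recall, f1)
    return scores

# Made-up labels and predictions, only to illustrate the arithmetic
labels = [0.0, 0.0, 1.0, 1.0, 2.0]
predictions = [0.0, 1.0, 1.0, 1.0, 2.0]
for cls, (precision, recall, f1) in f1_per_class(labels, predictions).items():
    print(cls, round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy data, class 0.0 has perfect precision but only half of its members are found, so its F1 ends up at about 0.667.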

For this problem we achieved a precision of 0.629139072848, an overall F-score of 0.605332849886, and the following F-scores for each of the races:

Race	F-score
Elf	0.260869565217
Dwarf	0.333333333333
Hobbit	0.567901234568
Man	0.73333333333

Conclusion

In this report we built a naive Bayes classifier model for classifying the races of characters of Lord of the Rings based on their name. While doing it, topics such as classification, pipeline, and data pre-processing were discussed.

What now? The outcome of this experiment was not what I was expecting; however, I am sure that the accuracy can be improved by at least a few percentage points. Our model was based on the bigrams of the name, but other features, such as the length of the name, the ratio of vowels to consonants, and the number of foreign letters, should also be analyzed.
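
As a starting point for that follow-up, the features mentioned above could be extracted along these lines. This is a sketch under assumptions: name_features is a hypothetical helper, "foreign letters" is interpreted as non-ASCII letters such as ó or ë, and accented vowels are not counted as vowels, which is a simplification.

```python
VOWELS = set('aeiou')

def name_features(name):
    """Return length, vowel/consonant ratio and non-ASCII letter count."""
    letters = [c for c in name.lower() if c.isalpha()]
    # Assumption: a "foreign" letter is any letter outside the ASCII range
    ascii_letters = [c for c in letters if ord(c) < 128]
    vowels = sum(1 for c in ascii_letters if c in VOWELS)
    consonants = len(ascii_letters) - vowels
    return {
        'length': len(letters),
        'vowel_consonant_ratio': vowels / consonants if consonants else 0.0,
        'non_ascii_letters': len(letters) - len(ascii_letters),
    }

print(name_features('Glóin'))
print(name_features('Aragorn'))
```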

Tolkien was a genius.