A few weeks before the Rugby World Cup, I decided that I wanted to build an AI that would predict the outcome of games. This is a writeup of how I went about doing it. I’ll try to keep it quite light on the technical aspects, to focus instead on the key topics to understand. If it’s useful to anyone please just shout at me on Twitter and I’ll upload the data and code to GitHub.
Why HOWLAI? It felt weird to name this after anyone other than Rob Howley, the Welsh coach recently sent home after placing “suspicious bets”.
Step 1 – Designing the system
As with anything, the first part is to work out what you actually want to achieve. For this project, I wanted something with a real-world outcome, and I settled on a machine that tells me which games to bet on for the World Cup.
For this, it would need to take a bunch of data to learn from, then take the teams playing in upcoming games, and it would predict the likelihood of each team winning.
I’d then compare this likelihood to the probability implied by the betting shops’ odds. If HOWLAI’s probability was higher, I would place the bet. If it was lower, I wouldn’t. At the end of the tournament, I’d see how HOWLAI fared overall.
Because the outcome here is a probability, it’s best to use a logistic regression (a form of statistical analysis that outputs probabilities).
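For anyone curious what "outputs probabilities" means in practice, here is a minimal sketch of the logistic function that sits at the heart of a logistic regression. The function name `logistic` is just an illustrative choice, not something taken from the HOWLAI code.

```python
import math


def logistic(z: float) -> float:
    """Squash any real-valued score z into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# A score of 0 sits exactly on the fence; larger scores push towards 1,
# smaller (negative) scores push towards 0.
for z in (-2.0, 0.0, 2.0):
    print(z, round(logistic(z), 3))
```

The model's job is to learn how to turn the match data into that score `z`; the logistic function then converts it into a likelihood.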
Step 2 – Data data data
A learning machine is only as intelligent as the data it’s given, so I wanted to get everything I could. I figured there were three main areas I’d be using: World Rugby ranking points (a number published weekly), whether the match was home/away, and form.
As it’s a machine learning project, I wanted to take all past results and link them up to this data, so that I could use the same data ahead of upcoming games to predict the result.
I got the results from ESPN’s surprisingly comprehensive data, by copying and pasting into a spreadsheet. I removed any matches involving teams that weren’t going to the World Cup (for no reason other than low ranked teams are very inconsistent).
I used a spreadsheet query to find the previous Monday before the match took place, and then ran a small Python script to ping the World Rugby API to get the rankings for a given date. This was hidden and I needed to dig it up, which was a bit time-consuming.
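The "previous Monday" lookup is easy to do in Python too. This is a rough sketch rather than the spreadsheet query I actually used, and it assumes that a match played on a Monday should look back a full week. The World Rugby API call itself is left out, since the endpoint isn't given here.

```python
from datetime import date, timedelta


def previous_monday(match_day: date) -> date:
    """Return the most recent Monday strictly before match_day."""
    # weekday() is 0 for Monday through 6 for Sunday.
    # Assumption: a Monday match uses the rankings from the Monday before.
    days_back = match_day.weekday() or 7
    return match_day - timedelta(days=days_back)


# Example: a match on Friday 20 September 2019 maps to Monday 16 September.
print(previous_monday(date(2019, 9, 20)))
```

The date that comes back is what you would pass to the rankings lookup for that match.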
So now for each team I had their ranking points, the opponent’s ranking points, and the difference between the two. The difference was what I wanted to use to make predictions.
The ESPN data already told me which team was home/away, or if the match was neutral. So I made two dummy variables: ‘IsHome’ and ‘IsAway’. Only Japan would be ‘Home’ during the tournament, but I was still curious as to the effect of these variables.
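A dummy variable is just a 0/1 flag. Here is a small sketch of the encoding, assuming the venue arrives as one of the strings "home", "away" or "neutral" (those labels, and the function name `venue_dummies`, are my own illustration). A neutral match gets 0 for both flags, which makes it the baseline the other two are compared against.

```python
def venue_dummies(venue: str) -> dict:
    """Encode the venue as two 0/1 dummy variables, with neutral as the baseline."""
    if venue not in {"home", "away", "neutral"}:
        raise ValueError(f"unknown venue: {venue!r}")
    return {"IsHome": int(venue == "home"), "IsAway": int(venue == "away")}


print(venue_dummies("home"))     # IsHome is 1, IsAway is 0
print(venue_dummies("neutral"))  # both flags are 0
```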
Finally, I needed form. I wanted to calculate: match form (matches won / lost), attacking form (points scored, tries scored, penalties scored), and defensive form (attacking form but conceding) – this gave about 10 different form variables. I wanted this form for the past 3, 5 and 10 matches, as well as for the opposition they were about to play.
I’m sure there’s a smarter way to do this, but I did it (relatively) manually. Using my spreadsheet, I calculated aggregates of previous matches to find current form for when they played a match. I then used a small Python script to find opposition form going into the same match, because that would take hours to do manually.
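For what it's worth, the "smarter way" might look something like the sketch below. It assumes each team's matches are already in date order and that form is an average over the previous n matches (the stat names and the choice of a mean rather than a sum are assumptions for illustration). The key detail is that the match being predicted is never included in its own form.

```python
from typing import Dict, List


def rolling_form(history: List[Dict[str, float]], n: int) -> List[Dict[str, float]]:
    """For each match, average every stat over the previous n matches only."""
    forms = []
    for i in range(len(history)):
        window = history[max(0, i - n):i]  # matches played before match i
        if not window:
            forms.append({})  # no earlier matches to draw on
            continue
        forms.append({
            stat: sum(match[stat] for match in window) / len(window)
            for stat in window[0]
        })
    return forms


# Hypothetical results for one team, oldest first.
history = [
    {"won": 1, "points_for": 30},
    {"won": 0, "points_for": 10},
    {"won": 1, "points_for": 20},
]
print(rolling_form(history, n=2))
```

Running the same function over the opponent's history gives the opposition form going into the match.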
So there I had it: a match result, with ranking difference, home advantage, and all the form data I wanted. The machine will look at those three groups of variables for upcoming matches, and use them to predict the result.
There is lots of other data I wish I could have used, but it’s just impossible to get. The main one would be player stats, namely player caps. Experience is a huge factor in international rugby, so this would have been a lovely predictive feature. But sadly, I couldn’t find this data, so I left it out.
Step 3 – Choosing which variables to use
This whole process ended up with over 50 predictor variables. I only wanted my model to use the ones that would actually add value, so I needed to remove some.
You don’t want your model to use more variables than necessary to make predictions because:
- it slows down training and predicting
- it means you need to calculate and maintain all those variables
- it makes it harder to interpret
For the project this doesn’t matter too much as the overall data set was small enough, but I like to do things by the book.
I used a scoring mechanism called Area Under Curve (AUC) to measure the accuracy of HOWLAI. There is a great explainer here if you’re interested. To choose which variables to use, there’s a fairly simple method called Forward Stepwise Variable Selection. I ran this (and everything else from here on out) using scikit-learn, Python’s wonderful machine-learning library.
The method goes through every variable on its own, and says “if we used only one variable, which one would give us the best AUC score?”. It then says “Ok, assuming we use that best variable and one other, which one should we use?”. It then says “Ok, assuming we use those two and one other…” etc.
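The loop described above can be sketched in a few lines. This is a simplified illustration, not the code I ran: `score` stands in for "train a model on these variables and return its AUC", and the toy numbers below are made up purely to show the mechanics.

```python
from typing import Callable, List, Sequence, Tuple


def forward_stepwise(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Greedily order variables: each round adds the one that scores best."""
    selected: List[str] = []
    remaining = list(candidates)
    order: List[Tuple[str, float]] = []
    while remaining:
        # Try each remaining variable alongside the ones already chosen.
        best_var = max(remaining, key=lambda v: score(selected + [v]))
        best_score = score(selected + [best_var])
        selected.append(best_var)
        remaining.remove(best_var)
        order.append((best_var, best_score))
    return order


# Made-up scoring function: each variable adds a fixed amount to a 0.5 baseline.
toy_gain = {"Difference": 0.20, "IsHome": 0.05, "10For": 0.02}


def toy_score(variables: List[str]) -> float:
    return 0.5 + sum(toy_gain[v] for v in variables)


for name, auc_so_far in forward_stepwise(list(toy_gain), toy_score):
    print(name, round(auc_so_far, 2))
```

In a real run, `score` would fit the logistic regression on the chosen columns and measure AUC on held-out data, which is why the scores stop improving after a while.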
So you end up with an ordered list of the variables that you should use. For me, it looked like this:
For reference, 10For means points scored form (taking opponents level into account) in the last 10 games, and Opps10Tries means try-scoring form of opponents over last 10 games. ‘Difference’ is World Rugby ranking difference.
So now that we’ve got all these in order, the question is when do we stop? How many variables do we keep?
For this, you look at the AUC score as we add each variable. As this starts to tail off, we can probably stop adding variables. The machinery prints off the below:
This orange line is the training data (that the model learns from), and the blue line is the test data (that the model has never seen before). It’s important to separate these two out, so that you can see how the model predicts on things it’s never seen before. Without that, you can’t really see how accurate it is.
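If you want to see what the AUC number on that curve actually measures, here is a plain-Python sketch. It treats AUC as the chance that a randomly picked positive example gets a higher score than a randomly picked negative one, with ties counting as half. It should agree with scikit-learn's `roc_auc_score`, but it is written out by hand here for illustration; the labels and scores in the example are invented.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Share of (positive, negative) pairs where the positive is ranked higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Invented example: three of the four positive/negative pairs are ordered correctly.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

You would compute this once on the training matches and once on the test matches; a big gap between the two is the warning sign that the model has memorised rather than learned.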
I’m going to cut this off at the well-drawn ‘x’ on the graph above, above ‘Opps5PensConc’. This means that we’ll have 11 variables for this predictive model:
- Difference (difference in World Rugby ranking points between the two teams)
- IsAway (if the match was played away from home)
- IsHome (if the match was at home)
- Opps10Tries (10-game try-scoring form of opposition)
- 10For (10-game points scoring form of team)
- Opps5TriesConc (5-game try-conceding form of opposition)
- 10TriesConc (10-game try-conceding form of team)
- 10Form (10-game form of team)
- 5PensConc (5-game penalty-conceding form of team)
- Opps10TriesConc (10-game try-conceding form of opposition)
- Opps5PensConc (5-game pen-conceding form of opposition).
I could probably keep adding more, but I want to keep it as simple as I can. I was surprised here by how little form mattered compared to home/away advantage, and World Ranking difference. I guess the latter includes form though naturally. Anyways…
Step 4 – Training and predictions
There’s not much to say here unfortunately. Training HOWLAI is just a few lines of code, essentially telling him which variables to look at when making predictions. It came up with the following regression coefficients:
- Intercept: -0.084
- Difference: 1.47
- IsAway: 5.93
- IsHome: -6.83
- Opps10Tries: -3.34
- 10For: 6.96
- Opps5TriesConc: -4.11
- 10TriesConc: -1.78
- 10Form: -1.12
- 5PensConc: -1.98
- Opps10TriesConc: 1.73
- Opps5PensConc: 1.52
I noticed when going through this that the machinery was predicting the likelihood of a team losing, rather than winning, so I’d need to invert the outcomes when assessing whether I should place the final bet or not.
I wish I could tell you why it did this, but I’m not entirely sure.
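To make the inversion concrete, here is a rough sketch of how the coefficients above turn into a prediction. The coefficient values are the ones listed; everything else is an assumption for illustration, including the function names and the idea that missing features default to 0. The real feature values depend on how each variable was scaled, which isn't covered here.

```python
import math

INTERCEPT = -0.084
COEFS = {
    "Difference": 1.47, "IsAway": 5.93, "IsHome": -6.83,
    "Opps10Tries": -3.34, "10For": 6.96, "Opps5TriesConc": -4.11,
    "10TriesConc": -1.78, "10Form": -1.12, "5PensConc": -1.98,
    "Opps10TriesConc": 1.73, "Opps5PensConc": 1.52,
}


def loss_probability(features: dict) -> float:
    """Raw model output: the likelihood of the team losing."""
    # Assumption: any feature not supplied is treated as 0.
    z = INTERCEPT + sum(COEFS[name] * features.get(name, 0.0) for name in COEFS)
    return 1.0 / (1.0 + math.exp(-z))


def win_probability(features: dict) -> float:
    """Invert the output to get the number needed for betting decisions."""
    return 1.0 - loss_probability(features)


# Illustrative only: the same match with and without the home flag set.
print(win_probability({}), win_probability({"IsHome": 1}))
```

Read this way, the negative IsHome coefficient lowers the likelihood of losing, which is the same as raising the likelihood of winning.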
Step 5 – Predictions and bets!
I then fed HOWLAI new matches, based on the form, venue and ranking information that we have on teams before the match. I’ll only know how well it performed once the tournament is complete, so I’ll update then.
In terms of betting, I would place a bet if HOWLAI thought an outcome was more likely than the betting shops did. So if it gave Team X a 0.7 likelihood of winning, but the bookies’ odds worked out to 0.6, we’d place the bet.
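The betting rule boils down to one comparison. This sketch assumes the bookies quote decimal odds (so odds of 2.0 mean you double your stake) and ignores the bookmaker's margin; both function names are illustrative.

```python
def implied_probability(decimal_odds: float) -> float:
    """Convert decimal odds into the probability the bookmaker is implying."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    return 1.0 / decimal_odds


def should_bet(model_probability: float, decimal_odds: float) -> bool:
    """Bet only when the model rates the outcome as more likely than the odds do."""
    return model_probability > implied_probability(decimal_odds)


# The example from the text: the model says 0.7, the odds work out to 0.6.
print(should_bet(0.7, 1 / 0.6))
```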
It recommended that the bets to place on the first round are:
- Russia to beat Japan (3% chance vs bookies of 1.4%)
- Fiji to beat Australia (17% chance vs bookies of 13%)
- France to beat Argentina (72% chance vs bookies of 54%)
- South Africa to beat New Zealand (35% chance vs bookies of 33%)
For now, here are some interesting predictions that I didn’t expect. HOWLAI believes that…
- Argentina are going to crash out in the pool stages
- England will pretty easily top their group
- Scotland only have a 62% chance of beating Japan
- Samoa only have an 86% chance of beating Russia
That’s it from me and HOWLAI for now. I’ll update as the tournament progresses to let you know how it’s going!