In order to get FFR off the ground, I quickly (but not haphazardly) created some classification models to predict whether or not a fantasy league player will achieve a high score in a given game. The models accept inputs like the league position of a player’s team, league position of the opponent, player’s average number of minutes played, etc (see variables section below) to produce a single numerical output: the probability that a player will score 6 or more points. Players in different positions score points in different ways so I created a model for each position: Goalies, Defenders, Midfielders and Forwards.
What to do with the forecast probabilities?
Naturally, a fantasy manager will want to have as many high probability players as possible. Given that a probability is always between 0 and 1, yet “scoring 6+ points” is a binary outcome (i.e. either True or False), it is easy to convert the forecast probabilities into boolean predictions: you just take those probabilities greater than or equal to 0.5 and treat those as predicting True; the rest are predicting False. A manager will want to own Trues and not Falses. For example, say my model forecasts probabilities 0.4, 0.1, 0.9, 0.6 for Lee Cattermole, Santi Cazorla, Yaya Toure and Juan Mata respectively. The model is forecasting that Toure and Mata will score 6+ and that Cattermole and Cazorla will not [see footnote 1].
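The conversion described above can be sketched in a few lines of Python. This is only an illustration of the 0.5 threshold rule, using the four example probabilities; the names `forecasts`, `THRESHOLD` and `to_boolean_predictions` are made up for this sketch and are not taken from the FFR code.

```python
# Forecast probabilities from the example above.
forecasts = {
    "Lee Cattermole": 0.4,
    "Santi Cazorla": 0.1,
    "Yaya Toure": 0.9,
    "Juan Mata": 0.6,
}

THRESHOLD = 0.5  # probabilities at or above this are treated as True


def to_boolean_predictions(probabilities, threshold=THRESHOLD):
    """Map each player's probability to a True/False prediction."""
    return {player: p >= threshold for player, p in probabilities.items()}


predictions = to_boolean_predictions(forecasts)
predicted_true = [player for player, flag in predictions.items() if flag]
print(predicted_true)  # ['Yaya Toure', 'Juan Mata']
```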
Of those players that are predicted True (i.e. are predicted to score 6+), the percentage that actually do go on to score 6+ is called the Precision of the model. When creating the models, I explicitly sought the highest Precision possible by trying different model types (see below) and adding and removing variables (again, below). Say my model has a Precision of 0.333 and it predicts three players will score 6+; I would expect 1 of these 3 predictions to actually come true. Instead, say my model has a Precision of 0.667. Now I expect 2 of those predictions to come true. The higher the Precision, the more I can rely on my model to be right. [see footnote 2]
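As a rough sketch, Precision can be computed from a list of boolean predictions and the matching list of actual outcomes. The function name `precision` and the five outcomes below are invented purely for illustration; they are not real FFR results.

```python
def precision(predicted, actual):
    """Share of True predictions that actually came true.

    Returns None when the model made no True predictions at all.
    """
    outcomes_of_true_predictions = [a for p, a in zip(predicted, actual) if p]
    if not outcomes_of_true_predictions:
        return None
    return sum(outcomes_of_true_predictions) / len(outcomes_of_true_predictions)


# Illustrative data: three players predicted True, two of whom scored 6+.
predicted = [True, True, True, False, False]
actual = [True, False, True, False, True]
model_precision = precision(predicted, actual)
print(round(model_precision, 3))  # 0.667
```

Note that the last player (predicted False, actually True) has no effect on the result: Precision only looks at the players predicted True.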
Types of model used in FFR
I conducted the initial analysis using the sklearn Python library but restricted the analysis to classification models that I could readily reproduce myself in Python, knowing that the code was ultimately going to run on the Google AppEngine. AppEngine only runs pure Python code, you see. It supports a few selected compiled add-ins, like NumPy, but not compiled extensions in general. sklearn relies on a lot of compiled code to make it faster and this means it will not run seamlessly on AppEngine. Hence, I had to be prepared to make any classifier I wanted to use work myself in pure Python. Long story short: this left me the Naive Bayes (Gaussian) and k-Nearest Neighbours classifiers. It prevented me using Logistic Regression and Decision Trees (these are targeted enhancements I hope to introduce to FFR in future).
For proper explanations of these models, I would suggest taking a look at the book Data Mining by Witten and Frank (when Googled, the 2nd result is the complete text in pdf. Don’t shoot the messenger) and perhaps also the Wikipedia definitions. However, let me have a cursory stab at explaining them.
To quote Wikipedia, “In simple terms, a naive Bayes classifier assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3″ in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of the presence or absence of the other features.” The classifier calculates the likelihood that a new instance belongs to a target class to be the product of the marginal probabilities based on the observed instances that also belong to the target class. I don’t think there’s a simple way to say that. If you really want to understand it, I recommend going through the Naive Bayes example in Witten & Frank. But for the sake of understanding the principle, know that a naive Bayes classifier will predict that a player is most likely to score 6+ when the player has variable values that are all close to those typically seen in players that have scored 6+ in the past.
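For the curious, here is a minimal pure-Python sketch of a Gaussian naive Bayes classifier. It is not the actual FFR code: the function names `fit_gnb` and `predict_proba`, the two-variable toy dataset and the small variance floor are all assumptions made for illustration. Each class gets a mean and variance per variable, and a new instance is scored by multiplying (here, adding the logs of) the per-variable Gaussian densities with the class prior.

```python
import math
from collections import defaultdict


def fit_gnb(rows, labels):
    """Estimate the prior and the per-variable mean and variance of each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # The small constant keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_proba(model, row):
    """Return P(class | row) for every class under the naive assumption."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        log_score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            log_score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_scores[label] = log_score
    # Subtract the largest log score before exponentiating, for stability.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Toy data: (minutes in last game, mean points in last 3 games) -> scored 6+?
rows = [(90, 6.0), (85, 5.5), (90, 7.0), (20, 1.0), (45, 2.0), (30, 1.5)]
labels = [True, True, True, False, False, False]
model = fit_gnb(rows, labels)
probs = predict_proba(model, (88, 6.5))
print(probs[True] > 0.5)  # True: this player resembles past 6+ scorers
```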
k-Nearest Neighbours, aka kNN
The kNN classifier is a bit easier to understand, thankfully. k is some integer value. To keep it simple, let’s say k = 5, so we are using a 5-nearest neighbours (or 5NN) classifier. Given a new instance (the instance for which we want to make a prediction), the 5NN model will find the 5 instances in its training dataset (the cache of historical data that we have) that are “nearest” to the new instance. Let’s say that, of these 5 training instances, 2 did score 6+ and 3 did not. The 5NN classifier will calculate that the probability of the new instance scoring 6+ is 2/5 = 0.4 and the probability of scoring 5 or less is 3/5 = 0.6.
But wait: what is this vaguely defined “nearest” crap? How far is one observation from another? The distances between observations are calculated using a metric. Let’s keep this simple and define just the metric that matters (the Euclidean metric) which, in two dimensions, is basically the Pythagoras’ theorem we learnt at school. E.g. the distance between the observations (3, 4) and (7, 1) is sqrt( (7-3)^2 + (1-4)^2 ) = sqrt(4^2 + 3^2) = 5.
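Putting the distance and the neighbour count together, a pure-Python sketch of the idea might look like this. The names `euclidean` and `knn_probability` and the toy training set are invented for illustration and are not the FFR implementation; the training set is chosen so that 2 of the 5 nearest neighbours scored 6+.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_probability(training, new_instance, k=5):
    """Fraction of the k nearest training instances that are labelled True.

    `training` is a list of (features, scored_6_plus) pairs.
    """
    nearest = sorted(training, key=lambda item: euclidean(item[0], new_instance))[:k]
    return sum(1 for _, label in nearest if label) / len(nearest)


print(euclidean((3, 4), (7, 1)))  # 5.0

# Toy training data: the two far-away instances never make the nearest 5.
training = [
    ((0, 0), True),
    ((1, 0), False),
    ((0, 1), True),
    ((1, 1), False),
    ((2, 0), False),
    ((9, 9), True),
    ((8, 9), True),
]
print(knn_probability(training, (0.5, 0.5), k=5))  # 0.4
```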
Variables
I need some succinct names for variables, so let me define and explain at the same time:
TEAM_POS: position in league table of player’s team
OPP_POS: position in league table of opponent team
TEAM_FOR_POT: mean number of goals scored per game by player’s team
TEAM_AGAINST_POT: mean number of goals conceded per game by player’s team
OPP_FOR_POT: mean number of goals scored per game by opponent team
OPP_AGAINST_POT: mean number of goals conceded per game by opponent team
(These four POT variables are calculated using just home or away performances, depending on venue. E.g. if Wayne Rooney is about to play West Brom at Old Trafford, his TEAM_FOR_POT score will be the mean number of goals scored by Man Utd at home that season, and his OPP_AGAINST_POT score will be the mean number of goals conceded by West Brom away from home.)
x_MINS: Average minutes played in previous x games
x_POINTS: Average points scored in previous x games
x_GOALS: Average goals scored in previous x games
x_ASSISTS: Average assists made in previous x games
x_BONUS: Average bonus points in previous x games
XYZ_STD: the XYZ variable less the average XYZ score and divided by the standard deviation of the XYZ score.
e.g. TEAM_FOR_POT_STD = [TEAM_FOR_POT - mean(TEAM_FOR_POT)] / stdev(TEAM_FOR_POT)
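A small sketch of this standardisation follows. The function name `standardise` and the three sample values are made up, and I am assuming the population standard deviation here (dividing by n), since the definition above does not say which flavour is used. Standardising puts every variable on the same scale, so that no single variable dominates a Euclidean distance simply because its numbers are bigger.

```python
import math


def standardise(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    stdev = math.sqrt(variance)
    if stdev == 0:
        # Every value is identical, so there is no spread to scale by.
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


# Illustrative TEAM_FOR_POT values for three players.
team_for_pot = [1.0, 2.0, 3.0]
team_for_pot_std = standardise(team_for_pot)
print([round(v, 3) for v in team_for_pot_std])  # [-1.225, 0.0, 1.225]
```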
Models currently used in FFR
Let me abbreviate a Gaussian Naive Bayes model as GNB and a k-Nearest Neighbours model as kNN (e.g. 3NN, 5NN, etc). Then, the models initially in use by FFR are:
GNB(TEAM_POS, OPP_POS, TEAM_FOR_POT, TEAM_AGAINST_POT, OPP_FOR_POT, OPP_AGAINST_POT, 1_MINS, 1_POINTS, 3_POINTS)
4NN(TEAM_POS_STD, OPP_POS_STD, TEAM_FOR_POT_STD, TEAM_AGAINST_POT_STD, OPP_FOR_POT_STD, OPP_AGAINST_POT_STD, 1_MINS_STD, 1_POINTS_STD, 3_POINTS_STD)
7NN(1_MINS, 1_POINTS, 3_POINTS, 7_GOALS, 7_POINTS)
These models were derived fairly rapidly, using common sense whilst trying to maximise model Precision (defined above). That said, some of the variable combinations are a bit weird and interpreting them could yield yet more articles. Also, I became more inquisitive as I progressed so the goalie model is largely built on common sense whereas the midfielder and forward models are more nuanced and pragmatic. I owe this topic a systematic review. I want to improve all of the models. This will be the focus of the next few weeks.
1: Naturally, a manager may decide instead to just maximise the aggregate sum of the probabilities in her/his team without converting the probabilities to booleans. This is equally valid. I have explained the process of converting probabilities to boolean outcomes to facilitate the explanation of Precision.
2: You may spot that I casually discarded all of the players predicted as False. “What percentage of them went on to score 6+?”, you might ask. I would counter that I only care about maximising my score so I only care whether the model is good at predicting players that will score well, not players that will not. The way you appraise a classification model really depends on how you want to use the model and the relative costs of getting things wrong. In the fantasy league setting, I just want a model that predicts True well. I don’t care how well it predicts False. In another setting (disease diagnosis, say), I might want a model that predicts True and False well, so that sick people receive treatment and well people are not unnecessarily treated. It’s to do with the sensitivity and specificity of the test. It is confusing. Occasionally, I read that Wikipedia article to get things straight in my head but after a while it makes my head spin. What is a True Negative again?…etc…
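To untangle the terminology a little, here is a hedged sketch that counts the four possible outcomes and derives Precision, sensitivity and specificity from them. The function name `confusion_counts` and the eight predictions are invented for illustration only.

```python
def confusion_counts(predicted, actual):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1  # predicted 6+, did score 6+
        elif p and not a:
            fp += 1  # predicted 6+, did not score 6+
        elif not p and a:
            fn += 1  # predicted under 6, did score 6+
        else:
            tn += 1  # predicted under 6, did not score 6+
    return tp, fp, tn, fn


predicted = [True, True, True, False, False, False, False, False]
actual = [True, True, False, True, True, False, False, False]
tp, fp, tn, fn = confusion_counts(predicted, actual)

precision = tp / (tp + fp)    # of those predicted True, how many were True
sensitivity = tp / (tp + fn)  # of those actually True, how many were caught
specificity = tn / (tn + fp)  # of those actually False, how many were caught
print(round(precision, 3), sensitivity, specificity)  # 0.667 0.5 0.75
```

A True Negative, then, is a player predicted not to score 6+ who indeed did not.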