In the sporting world, there is no greater achievement for an individual than being inducted into the Hall of Fame for his or her respective sport. The NFL just celebrated its 100th season, and of the many thousands of NFL players to take the field over that span, only 294 are in the Hall of Fame to date. Given the unlikelihood of any single player receiving this honor, there are endless debates among NFL fans, writers, analysts, media, players, and coaches about who will eventually be inducted. So, to end all debate, I’ve created a machine learning model to predict who the next NFL Hall of Fame inductees will be at the quarterback position.
** The following sections include the technical details of this prediction model. If you would like to skip to the predictions, please scroll down to the Results portion of the post.
Gathering sports data is surprisingly difficult. In most cases you’re required to pay for access to organized, usable data. However, I was lucky to stumble upon this gem: sportsreference. It’s a free Python API that allows easy access to data from a variety of sports. After reading countless tutorials and examples, I was able to write code to pull season and career stats for every player to play in the NFL from 1970 to the present.
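One way to roll per-season rows into career totals is a pandas groupby. This is only a sketch, with made-up numbers and hypothetical column names standing in for the real API output:

```python
import pandas as pd

# Hypothetical per-season rows, shaped like what a stats API might return;
# player IDs and all numbers are illustrative, not real totals
seasons = pd.DataFrame({
    "player_id": ["BreeDr00", "BreeDr00", "FlacJo00"],
    "passing_yards": [3284, 4388, 2971],
    "attempted_passes": [526, 556, 428],
    "wins": [8, 10, 11],
})

# Sum each player's counting stats across seasons to get career totals
career = seasons.groupby("player_id", as_index=False).sum()
```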
The next step was to get a list of all the players in the NFL HOF, which I pulled from the HOF website and converted into CSV format. This data became my target values.
Now that I had all the data, I needed to isolate the quarterback position. I did this by filtering for all players who attempted more than 800 passes in their career. Typically, it takes a quarterback about 3 years as a full-time starter to attempt 800 passes, and if a QB has played fewer than 3 years he has no shot at becoming a hall-of-famer and should not be considered.
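With career stats in a DataFrame, this filter is a one-line boolean mask. A minimal sketch, with made-up numbers and an assumed column name:

```python
import pandas as pd

# Illustrative career totals; the attempt counts here are made up
career = pd.DataFrame({
    "name": ["Drew Brees", "Joe Flacco", "Julian Edelman"],
    "attempted_passes": [10000, 6000, 40],
})

# Keep only players with more than 800 career pass attempts
qbs = career[career["attempted_passes"] > 800].reset_index(drop=True)
```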
Once I had filtered down to the players I wanted to consider, I merged the career stats dataset and the HOF dataset (my target) to begin the modeling process. Avoiding data leakage was straightforward: the target came from a separate source, so none of its information could leak into the features.
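The merge can be sketched as a pandas left join, assuming the two tables share a name column; players absent from the HOF list get a 0 target. The numbers below are illustrative:

```python
import pandas as pd

# Illustrative feature table and HOF target table
career = pd.DataFrame({"name": ["Dan Marino", "Joe Flacco"], "wins": [147, 98]})
hof = pd.DataFrame({"name": ["Dan Marino"], "hof": [1]})

# Left-join the HOF list onto the career stats; unmatched players are non-inductees
data = career.merge(hof, on="name", how="left")
data["hof"] = data["hof"].fillna(0).astype(int)
```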
Working with an Imbalanced Dataset
Given the difficulty of becoming a Hall of Fame player, any model predicting HOF inductees will deal with imbalanced data. To account for the randomness of train/test splits and the scarcity of positive classifications, I stratified the data when making the train/test split. This ensured both the training and test sets had proportional representation of each class.
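In scikit-learn this is the `stratify` argument to `train_test_split`. A sketch with synthetic labels mimicking the roughly 12% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # dummy features
y = np.array([1] * 12 + [0] * 88)   # ~12% positives, like the HOF data

# stratify=y keeps the class ratio (roughly) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```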
When working with imbalanced data, the usual method of comparing an accuracy score against the baseline can be misleading. The baseline is the proportion of the more prevalent class. For a balanced dataset the baseline is around 50%-60%, and a model with 70%-80% accuracy can be considered good, but that logic breaks down here. 88% of the players considered are not members of the HOF, so the baseline is 88%, and a seemingly “good” score can be achieved by simply predicting that no player will ever make the HOF. So, in addition to the accuracy score, I used the ROC curve to judge model performance.
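The majority-class baseline is just the larger class proportion, which a trivial “predict nobody” model would match exactly:

```python
import numpy as np

y = np.array([1] * 12 + [0] * 88)  # ~12% of considered QBs in the HOF

# Accuracy of always predicting the majority class ("not in the HOF")
baseline = max(np.mean(y == 1), np.mean(y == 0))
```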
Building The Models
To get the best prediction, I decided to train 3 models (Logistic Regression, Random Forest Classifier, and XGBClassifier), compare the results, and choose the best one for RandomizedSearchCV.
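The comparison loop can be sketched as below. To keep the sketch dependency-light it omits xgboost, but `XGBClassifier` would slot into the same loop; the data here is synthetic, not the real QB dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the QB dataset (~88% negatives)
X, y = make_classification(n_samples=300, weights=[0.88], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LOG": LogisticRegression(max_iter=1000),
    "FOREST": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: val accuracy={model.score(X_val, y_val):.3f}, AUC={auc:.3f}")
```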
Here are the accuracy scores for these models and the ROC-curves.
LOG: Training Accuracy: 1.000
LOG: Validation Accuracy: 0.972
FOREST: Training Accuracy: 0.965
FOREST: Validation Accuracy: 0.945
BOOST: Training Accuracy: 0.965
BOOST: Validation Accuracy: 0.972
As you can see, even with the imbalanced data these models performed very well against the baseline. This is in line with the results from the ROC curves: the area under the curve is almost 1, which means each model predicts nearly all the true positives with only a few false positives.
Given these results there really isn’t a bad choice. However, I chose XGBoost because it had the best accuracy and the best AUC without needing a pipeline, which allowed me to use Shapley plots later.
Because I’m dealing with a small dataset, I felt it important to use cross-validation to be sure all the data was being used to train my model. I used RandomizedSearchCV because it gave me the added benefit of automatically tuning my hyperparameters.
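A sketch of the search, shown with a RandomForestClassifier so it runs without xgboost installed; the real model and parameter grid would differ:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (~88% negatives)
X, y = make_classification(n_samples=200, weights=[0.88], random_state=0)

# Sample 5 random hyperparameter combinations, each scored with 5-fold CV,
# so every row of the small dataset contributes to training and validation
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 200),
                         "max_depth": randint(2, 8)},
    n_iter=5,
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
```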
Based on my model, here are the next QBs to be inducted into the Pro Football Hall of Fame (in no particular order), most likely over the span of the next 10 to 15 years:
- Russell Wilson
- Drew Brees
- Aaron Rodgers
- Tom Brady
- Philip Rivers
- Matt Ryan
- Ben Roethlisberger
- Eli Manning
- Carson Palmer
- Peyton Manning
The QBs considered in this prediction are either still playing or have been retired for fewer than 5 years. To be eligible for the HOF, a player must be retired for at least 5 years.
So let’s take a look at why my model chose these QBs.
I used sklearn’s permutation importance to produce the above chart. It ranks the most important features by their effect on model accuracy, and it’s no surprise that wins matter most in whether a QB makes the HOF. However, the other 3 stats are interesting; see below.
- net_yards_per_pass_attempt — net yards gained per pass attempt, equal to (pass_yards - sack_yards) / (pass_attempts + times_sacked)
- passing_yards_per_attempt — yards gained per passing attempt.
- passing_touchdown_percentage — percentage of total passes that are touchdowns. Percentage ranges from 0–100.
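As a quick worked example of the first formula, with made-up season totals:

```python
# Illustrative totals for one QB (not a real player's numbers)
pass_yards, sack_yards = 4000, 200
pass_attempts, times_sacked = 550, 30

# net yards per pass attempt = (pass_yards - sack_yards) / (pass_attempts + times_sacked)
net_yards_per_pass_attempt = (pass_yards - sack_yards) / (pass_attempts + times_sacked)
# 3800 / 580, about 6.55 net yards per attempt
```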
I find it interesting that these 3 stats have so much pull in determining a HOF QB, because they are all normalized features, meaning all QBs are judged on the same scale no matter how much experience they have. Combine that with number of wins, the most important feature, which benefits players who’ve been in the league a long time with sustained success. The model rewards highly efficient QBs through the normalized stats and QBs with prolonged success through number of wins.
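The chart above came from permutation importance, which shuffles one feature at a time and measures how much the model’s score drops. A self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the QB feature matrix
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature 10 times and record the mean drop in accuracy;
# bigger drops mean the model leaned on that feature more
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
```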
Let’s take a deeper look into some specific predictions using a Shapley force plot.
These plots tell you why a specific player is or isn’t classified as a hall-of-famer. Good stats, shown in red, “force” the prediction toward making the HOF; poor stats, shown in blue, “force” it away. A positive value means a predicted future HOF induction; a negative value means no induction.
Let’s start with the GOAT.
All his stats are good and force his probability of becoming a Hall of Fame quarterback up, and no stats bring his probability down. He’s the GOAT; that’s to be expected.
Now let’s take a look at the best current player in the NFL.
As you can see, Patrick Mahomes beats Tom Brady in most statistics except those involving experience. He has incredible normalized stats, but he hasn’t been playing long enough to accumulate a lot of wins and passing touchdowns, so if his career were to end today he would most likely not make it into the Hall of Fame.
Now let’s take a look at the most controversial successful QB.
Eli Manning gets in because he played for so long and racked up enough wins and passing touchdowns to increase his probability. However, his passing_yards_per_attempt brings him down a little. (Plus, he won two Super Bowls, which is not considered in this model, but that alone should get him into the HOF.)
Finally let’s take a look at the “winningest” mediocre QB.
Although Joe Flacco has a lot of wins, his net_yards_per_pass_attempt is very low, along with his passing_yards_per_attempt. In other words, he won a lot of games in the NFL, but he just wasn’t a very efficient quarterback throughout his career, and therefore he most likely will not be inducted into the Hall of Fame.
There you have it. Based on stats alone these are the most likely candidates to be inducted into the Hall of Fame over the next decade or so.
I realize there are some shortcomings with this model. Accolades like Pro Bowl and All-Pro selections, playoff success, and Super Bowl wins were not considered, all of which weigh heavily in the actual Hall of Fame selection. However, given the data I have, I’m surprised at how well my predictions aligned with the general consensus on future Hall of Fame quarterbacks.
If you’d like to see the code for this project, please check out my GitHub.