Thomas Huang and Chase Stockwell
There has been a significant amount of research into predicting, from their collegiate careers, how significant basketball players will become in the NBA. Because the NBA is now a massive market and a centerpiece of entertainment in the US and around the world, the best agents and teams are constantly improving their predictive tools to make better use of their draft positions.
However, these tools are not widely available to the public for personal use. Although online discussion helps shape perspectives on player rankings and success, there has been less focus on the longevity of player performance in the NBA. Common methods for predicting player ability rely on the NBA's productivity formula, but most of those statistics are used to predict early-career success and draft position. Our study will instead look retroactively at longitudinal data to determine how long players last in the league and how that longevity correlates with aspects of their collegiate careers.
Using data science, we intend to create an algorithm that extracts the statistics that are most significant in indicating and/or predicting longitudinal success in the professional scene among college players. We define success with the following measures:
- Longevity of NBA career relative to position
- Amount of money earned on the player's contract
- Overall 'basketball statistics' that would be considered successful (points, assists, steals, blocks)
- Wins
- College statistics
- Physical qualities
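As a rough illustration of the first measure, career longevity could later be computed from per-season records like the ones loaded below. This is only a sketch; the df_NBA dataframe and its "Player" and "Year" columns are built later in this notebook.
# Sketch: count the number of distinct NBA seasons each player appears in,
# a simple proxy for career longevity. Assumes a dataframe of player-season
# rows with "Player" and "Year" columns, like the df_NBA built further below.
def career_longevity(df_seasons):
    return (df_seasons.groupby("Player")["Year"]
            .nunique()
            .rename("Seasons Played")
            .sort_values(ascending=False))
# Example usage (once df_NBA exists): career_longevity(df_NBA).head()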
Hoops Hype - salaries of individual players, 2000-2020
https://hoopshype.com/salaries/2000-2001/
Will require data scraping. Details the salaries of individual players in the NBA from 2000 to 2020. This is raw data, and the salary figures will be used as one factor of many to determine the success of players. Salaries will also be taken relative to the salary cap and other monetary statistics for the NBA that year, since inflation as well as the compensation of players has changed significantly over the years.
Bart Torvik - on-court stats of individual college players
http://barttorvik.com/trankpre.php
Will not require data scraping - we contacted Bart Torvik, who will send the raw data in csv files. Statistics of individual players in their college years; details include all on-court stats, school, physical characteristics, etc.
NBA Basketball Reference statistics - on-court stats of individual NBA players
https://www.basketball-reference.com/leagues/NBA_2001_totals.html
Does not require data scraping - the data was available as downloadable csv files that can be found in the data_NBA folder.
Data scraping - we want to scrape the databases above from their websites.
We intend to use methods learned in class to achieve this.
For more complicated extractions, we will use Stack Overflow and other online resources to learn how to compile the data.
We may also interact with software APIs to gain more information on the data, using methods we have learned in lecture.
Python + Pandas to interpret and analyze csv raw data
We intend to use the same techniques as performed in class to create DataFrames of the tables.
Some SQL may be used to join and merge tables for a clearer picture of whatever inferential statistic we are pursuing (see the sketch after this list).
Statistics Modeling to analyze data
General stats description: means, medians, etc., by quartile of the "success" groups.
Assignment of success scores of current NBA players based on various factors.
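As a concrete illustration of the SQL-style joins mentioned above, the sketch below loads two pandas DataFrames into an in-memory SQLite database and joins them on player and year. The table and column names here are placeholders, not our final schema.
import sqlite3
import pandas as pd

# Two tiny placeholder tables; the real stats and salary tables are built later in this notebook.
stats = pd.DataFrame({"Player": ["A", "B"], "Year": [2010, 2010], "Points": [500, 800]})
salaries = pd.DataFrame({"Player": ["A", "B"], "Year": [2010, 2010], "Salary": [2.0e6, 5.0e6]})

# Load both DataFrames into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
stats.to_sql("stats", conn, index=False)
salaries.to_sql("salaries", conn, index=False)

# Join on-court stats with salaries for the same player and season.
merged = pd.read_sql("""
SELECT s.Player, s.Year, s.Points, p.Salary
FROM stats AS s
JOIN salaries AS p ON s.Player = p.Player AND s.Year = p.Year
""", conn)
print(merged)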
The inspiration behind this project is that arguably the most influential college basketball player of the last 20 years was drafted by the New Orleans team this year, and we want to see whether the indicators of his NBA success look promising. But this project has much more breadth. It could serve:
- Coaches and scouts who use general success statistics to assess overall talent.
- Video game creators deciding how to quantify a player's skill in sports video games such as the NBA 2K series and the NBA Live series.
- Advertisers figuring out which potential superstars to target for a cheap price before they become superstars.
- Small-market teams who can elect to select value players with an effective skill-to-dollar-paid ratio.
There are no conflicts of interest. This project was not sourced nor paid for by any participating basketball organization nor any subjects of the study. This will be unbiased, objective research using only quantifiable statistics to make empirical assessments.
Coates, Dennis (University of Maryland). "The Length and Success of NBA Careers: Does College Production Predict Professional Outcomes?" ResearchGate, October 11, 2019.
The first step in our analysis of the various datasets is to import the libraries necessary for data collection and analysis, and to make the display of this notebook more suitable.
The first step of any analysis is getting your data. So what data will we be working with? We will be using four different datasets: two are scraped from the web, while the other two are available as csv files.
import pandas as pd
import numpy as np
%matplotlib inline
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
#The output of this cell below is hidden to allow the notebook to appear more organized. Press "o" while selecting this cell to show the output.
The first database we will be looking at is Hoops Hype. Hoops Hype gives us team-level salary information rather than information about individual players, so this dataset will actually be one of the smallest ones we work with. BeautifulSoup greatly helps with data scraping by easing the interpretation of the scraped HTML.
We want to look at the NBA seasons beginning in 2000 to 2018. We will not be using 2019 as part of the data set because the season is unfinished.
#Below are a few more import statements necessary for data scraping from a website.
import requests
from bs4 import BeautifulSoup
'''
The first year from which we will be collecting data from the hoopshype database is 2000.
For this analysis, the seasons will be labeled by the year in which the season began.
Thus, as an example, the NBA season 2019-2020 will be labeled as 2019.
'''
starting = 2000
#An empty list to be populated by team salaries is created.
team_salaries = []
#The beginning index is 0.
index = 0
#We do not want to obtain 2019 data, as the season is still unfinished.
while starting < 2019:
'''
Request formats the URL in which the data will be scraped from. The {} within the
URL will be replaced by the current year.
'''
request = "https://hoopshype.com/salaries/{}-{}/".format(str(starting), str(starting+1))
r = requests.get(request)
# BeautifulSoup allows us to parse the return data from the request.
root = BeautifulSoup(r.content, 'html.parser')
# Prettify then allows us to transform the large block of text into something legible so we can read it.
root.prettify()
# Use find() to save the aforementioned table as a variable
hh_table = root.find('table')
# Use pandas to read the HTML file
list_df = pd.read_html(str(hh_table) )
#print(list_df[0])
# If this iteration of the while loop is the first year scraped,
if starting == 2000:
# The dataframe team_salaries is created with the salary taken from list_df.
team_salaries = list_df[0]
# Set reasonable names for the table columns. Also adds the year of the dataset.
team_salaries.columns = ['Rank', 'NBA Team', 'Total Salary', 'Total Salary (adj.)']
yr = []
# Then iterate through team_salaries and append the current year to yr.
for i in range(len(team_salaries.index)):
yr.append(starting)
# Set the Year column of team_salaries equal to the list of years.
team_salaries['Year'] = yr
else:
team_salaries2 = list_df[0]
# Set reasonable names for the table columns. Also adds the year of the dataset.
team_salaries2.columns = ['Rank', 'NBA Team', 'Total Salary', 'Total Salary (adj.)']
yr = []
for i in range(len(team_salaries2.index)):
yr.append(starting)
team_salaries2['Year'] = yr
team_salaries = pd.concat([team_salaries, team_salaries2])
#Once we have reached the end of this iteration of the while loop, we move on to the next year.
starting = starting + 1
#We now have our working dataframe for NBA team salaries. It is tidy now: the year is stored in its own column
#and the data grows vertically (one row per team per year) instead of horizontally.
df_team_salaries = team_salaries
df_team_salaries.head()
In the table, you can see the Team, Total Salary, the adjusted salary, and the year. This information will be crucial later when we want to compare team salaries and how the pay is distributed among players.
Below are the unique NBA team names found in the Hoops Hype data. As we begin to accrue more data, it is important to have a good understanding of how you plan to merge tables. This is the only table that uses the full name of each team, so we use the map function to change each team name to its three-letter abbreviation.
#This displays every NBA team found in the hoops hype data.
df_team_salaries['NBA Team'].unique()
df_team_salaries['NBA Team'] = df_team_salaries['NBA Team'].map({
'Cleveland' : 'CLE',
'New York' : 'NYK',
'Detroit' : 'DET',
'LA Lakers' : 'LAL',
'Atlanta' : 'ATL',
'Dallas' : 'DAL',
'Philadelphia' : 'PHI',
'Milwaukee' : 'MIL',
'Phoenix' : 'PHO',
'Brooklyn' : 'NJN',
'Boston' : 'BOS',
'Portland' : 'POR',
'Golden State' : 'GSW',
'San Antonio' : 'SAS',
'Indiana' : 'IND',
'Utah' : 'UTA',
'Oklahoma City' : 'SEA',
'Houston' : 'HOU',
'Charlotte' : 'CHA',
'Denver' : 'DEN',
'LA Clippers' : 'LAC',
'Chicago' : 'CHI',
'Washington' : 'WAS',
'Sacramento' : 'SAC',
'Miami' : 'MIA',
'Minnesota' : 'MIN',
'Orlando' : 'ORL',
'Memphis' : 'VAN',
'Toronto' : 'TOT',
'New Orleans' : 'NOA'
})
df_team_salaries.head()
Good. Now that each NBA team has been renamed to its three-letter abbreviation rather than the full team name, merging and joining across different tables is much easier. Let's move on to the next database we will be looking at.
Our next dataset comes from Basketball Reference. This is a great resource for data scientists interested in sports, as it has data on almost every well-documented professional and amateur league. It contains on-court information for every player. Since this dataset was easily downloadable as csv files, we will use the os library to iterate through the data_NBA directory and read each file into a central dataframe named df_NBA.
import os
directory_in_str = "./data_NBA"
directory = os.fsencode(directory_in_str)
#Set the first year to be read in as 2000 and create an empty DataFrame.
yNum = 2000
df_NBA = pd.DataFrame()
# Now, we will iterate through each file in the directory.
# Sort the filenames so the files are read in year order rather than in arbitrary filesystem order.
for file in sorted(os.listdir(directory)):
filename = os.fsdecode(file)
# If the file is a csv file, then it is one we want to read in. The files are ordered by year, so we know exactly which year we are reading in.
if filename.endswith(".csv"):
# A temporary dataframe is created for each file.
df_temp = pd.read_csv((directory_in_str +"/"+ filename), sep=",", encoding='latin-1')
# The current dataframe has each observation set to its respective year.
df_temp["Year"] = yNum
# This current dataframe is then concatenated to the main dataframe and then discarded.
df_NBA = pd.concat([df_NBA, df_temp], ignore_index = True)
# Increase the current year by one.
yNum = yNum+1
else:
continue
# Remove the column Rk, which is the old index of the original table.
del df_NBA["Rk"]
df_NBA.head()
Now that we have the data loaded into a dataframe, we need to tidy it. Most of the column headings are abbreviations, which hurts readability. In the cell below, we take a look at the different columns. Most of this information is very important: these are all on-court stats for each player in each year.
df_NBA.columns
Looking at the columns above, we can probably discern some of these column headers: FT more than likely means free throws, and Tm probably means Team. However, let's change the column headings to increase the readability of our table.
df_NBA.columns = ["Player", "Position", "Age", "Team",
"Games", "Games Started", "Minutes", "Field Goals",
"FG Attempts", "FG Percent","3P FG", "3P Attempts",
"3P Percent", "2P FG", "2P Attempts", "2P Percent",
"Effective FGP", "FT made", "FT Attempts","FT Percent",
"Off Rebounds", "Def Rebounds", "Total Rebounds", "Assists",
"Steals", "Blocks", "Turnovers", "P Fouls", "Points",
"Year"]
df_NBA["Player"] = df_NBA["Player"].str.split("\\").str[0]
df_NBA.head()
Perfect. Now we have a table that is easily interpretable. In addition, the name of each player was stored in a way that combined the full name with a unique identifier. Since the unique identifier is a naming scheme used only by this dataset, it will not be useful for merging with other data. Thus, we used the string split function to remove the latter, coded identifier and leave just the player name.
One issue with the data is that players who were traded in the middle of the season appear as separate observations for each team, plus an additional row containing their total stats for the entire year. The main issue for merging with other dataframes is that salaries are based on what each team paid them. Thus, we will remove the summation rows and keep the individual team observations as separate rows in the table.
# If a record has Team = TOT, then it is the total summation of that player's stats and is thus a duplicate row.
df_NBA = df_NBA.loc[df_NBA["Team"] != "TOT"]
df_NBA["3P% of Total Attempts"] = df_NBA["3P Attempts"]/(df_NBA["3P Attempts"]+df_NBA["FG Attempts"])
df_NBA.head()
One statistic that isn't shown is the frequency of three-point attempts relative to total shot attempts for each player. This will be interesting to analyze due to the changing nature of the NBA. We calculated this stat in the cell above.
Our next dataset is salary data, this time for individual players. Our hypothesis is that salary reflects the projected success and impact a player can have. However, this is not always the case, as players can underperform. In addition, some players may be paid not simply for their skills on the court but also for factors such as their fanbase, presence in the locker room, and brand.
starting = 1999
player_salaries = []
index = 0
while starting < 2019:
for pages in range(1, 16):
#http://www.espn.com/nba/salaries/_/year/2000
request = "http://www.espn.com/nba/salaries/_/year/{}/page/{}".format(str(starting+1), str(pages) )
r = requests.get(request)
#4. Use BeautifulSoup to read and parse the data, as html or lxml
root = BeautifulSoup(r.content, 'html.parser')
#5. Use prettify to view the content and find the appropriate table
root.prettify()
#6. Use find() to save the aforementioned table as a variable
salary_table = root.find('table')
#7. Use pandas to read the HTML file
if salary_table is None:
continue
else:
df_salary = pd.read_html(str(salary_table) )
if starting == 1999:
player_salaries = df_salary[0]
#8 Set reasonable names for the table columns. Also adds the year of the dataset.
player_salaries.columns = ['Rank', 'Player', 'Team', 'Salary']
player_salaries['Year'] = starting
else:
player_salaries2 = df_salary[0]
#8 Set reasonable names for the table columns. Also adds the year of the dataset.
player_salaries2.columns = ['Rank', 'Player', 'Team', 'Salary']
player_salaries2['Year'] = starting
player_salaries = pd.concat([player_salaries, player_salaries2])
starting = starting + 1
#We now have our working dataframe for individual NBA player salaries. It is tidy now: the year is stored in its own column
#and the data grows vertically instead of horizontally.
player_salaries.head()
Our initial scrape of the player salary data from ESPN is now finished. However, just from the first five rows, we can see that the scrape was not clean: the table repeats its header row (RK, NAME, TEAM, SALARY) every ten rows for reference. Let's get rid of those.
player_salaries = player_salaries[player_salaries["Rank"] != "RK"]
player_salaries.head()
Lastly, the player names are again disorganized: the player's position is included in the Player column. We can drop it using the same string split function we used in the previous dataset.
player_salaries['Player'] = player_salaries['Player'].str.split(',').str[0]
player_salaries.head()
Our final dataset to read in from csv files is the ranking data as well as the regular season record for each year.
mega_dataframe = pd.DataFrame()
counter = 1999
while counter < 2019:
r = 'data_NBA/nba standings/{}.csv'.format(str(counter))
rotating_df = pd.read_csv(r, skiprows = [0])
rotating_df['Year'] = counter
mega_dataframe = pd.concat([mega_dataframe, rotating_df], sort=False, ignore_index = True)
counter = counter + 1
df_rank = mega_dataframe[["Rk", "Team", "Overall", "Year"]]
df_rank.head()
Again, one issue with this dataset is that each team is labeled differently from the other datasets. In order to allow merging across tables, we need to map each team name to its respective three-letter abbreviation.
print(df_rank["Team"].unique())
df_NBA["Team"].unique()
Above are the team names in the df_rank dataframe. We want to map all of these team names to the three letter abbreviations seen in the df_NBA teams.
df_rank["Team"] = df_rank["Team"].map({
'Los Angeles Lakers' : "LAL",
'Portland Trail Blazers' : "POR",
'Indiana Pacers' : "IND",
'Utah Jazz' : "UTA",
'Phoenix Suns' : "PHO",
'San Antonio Spurs' : "SAS",
'Miami Heat' : 'MIA',
'Minnesota Timberwolves' : "MIN",
'New York Knicks' : "NYK",
'Charlotte Hornets' : "CHH",
'Philadelphia 76ers' : "PHI",
'Seattle SuperSonics' : "SEA",
'Toronto Raptors' : "TOR",
'Sacramento Kings' : "SAC",
'Detroit Pistons' : "DET",
'Milwaukee Bucks' : "MIL",
'Orlando Magic' : "ORL",
'Dallas Mavericks' : "DAL",
'Boston Celtics' : "BOS",
'Denver Nuggets' : 'DEN',
'Houston Rockets' : "HOU",
'Cleveland Cavaliers' : "CLE",
'New Jersey Nets' : "NJN",
'Washington Wizards' : "WAS",
'Atlanta Hawks' : "ATL",
'Vancouver Grizzlies' : "VAN",
'Golden State Warriors' : "GSW",
'Chicago Bulls' : "CHI",
'Los Angeles Clippers' : "LAC",
'Memphis Grizzlies' : "MEM",
'New Orleans Hornets' : "NOH",
'Charlotte Bobcats' : "CHA",
'New Orleans/Oklahoma City Hornets' : "NOK",
'Oklahoma City Thunder' : "OKC",
'Brooklyn Nets' : "BRK",
'New Orleans Pelicans' : "NOP"
})
df_rank.head()
Okay. Now, we finally have all of our data imported. Let's take a look at all of it. We have four different datasets:
display(df_team_salaries.head())
display(df_NBA.head())
display(player_salaries.head())
display(df_rank.head())
Our next step is to explore our data a little bit. What interesting things can we see?
One main discussion around the meta of the NBA is the rise of the three-point shot. With the rise of Stephen Curry came a change in the offensive style of the game, and the three-pointer became known as the efficient shot to take. Worth 50% more than a shot made in the paint, it yields more points per possession given a decent three-point shot percentage. The increase in three-point shots has also improved spacing on the floor, giving players in the post more room to make plays.
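To make the points-per-possession argument concrete, here is a quick back-of-the-envelope comparison; the shooting percentages are purely illustrative and not taken from our data.
# Expected points per shot = make probability * point value.
two_point_pct = 0.50    # a solid shot near the paint (illustrative)
three_point_pct = 0.36  # a decent three-point shooter (illustrative)
print("Expected points, 2P attempt:", 2 * two_point_pct)    # 1.0
print("Expected points, 3P attempt:", 3 * three_point_pct)  # ~1.08
# Break-even: a three only needs to drop 2/3 as often as a two to be worth the same.
print("Break-even 3P percentage:", 2 * two_point_pct / 3)   # ~0.33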
In order to better understand the change in three-point attempts, a bar graph is a suitable plot for showing the shift in the NBA scoring meta.
#Grouping by each year, the percent of shots taken that were threes were averaged and then plotted as a bar per year.
attempts3_plot = df_NBA.groupby("Year")["3P% of Total Attempts"].mean().plot.bar(figsize = (10, 5))
attempts3_plot.set_title("Percent of total shot attempts that were 3P attempts over Time")
attempts3_plot.set_ylabel("Percent of total shot attempts that were 3P attempts")
attempts3_plot.set_xlabel("Year")
attempts3_plot
As expected, the percentage of shot attempts taken from three has increased over time as the culture of the game has changed. A significant rise in 3P attempts can be seen from 2010 onwards, and the increase becomes consistent from 2012 onwards, the start of what is commonly known as the golden age or dynasty of the Golden State Warriors. Daryl Morey, meanwhile, is well known for driving a statistics-based basketball program.
Morey became general manager of the Houston Rockets and began implementing his data-science-based approach in 2007. Let's take a look specifically at the Houston Rockets and see if there is a significant change.
from scipy import stats
#Create a dataframe that is just the Houston Rockets based on their Team abbreviation HOU.
df_rockets = df_NBA.loc[df_NBA["Team"] == "HOU"]
#Next, lets group by year and find the mean of percent of total field goal attempts that were 3 point attempts of all players in each year.
avg_3P_attempts_percent = df_rockets.groupby("Year")[["3P% of Total Attempts"]].mean()
avg_3P_attempts_percent = avg_3P_attempts_percent.reset_index()
#In order to visualize this, a scatter plot will be suitable.
hou_plot = avg_3P_attempts_percent.plot.scatter(x = "Year", y = "3P% of Total Attempts", figsize = (10, 5))
#A regression line is then fit to each plot using the linregress function.
slope, intercept, r_value, p_value, std_err = stats.linregress(avg_3P_attempts_percent["Year"], avg_3P_attempts_percent["3P% of Total Attempts"])
line = slope*avg_3P_attempts_percent["Year"] + intercept
hou_plot.plot(avg_3P_attempts_percent["Year"], line, 'r', label='fitted line')
hou_plot.set_title("The impact of Morey on the Houston Rockets and their 3P Shot Preference")
hou_plot.set_xlabel("Year")
hou_plot.set_ylabel("Percent of total shot attempts that were 3P attempts")
print(r_value**2)
From the graph, it is evident that the percent of field goal attempts that were three-point attempts has steadily increased over the years, from about 0.125 to 0.325 in the 2017-2018 season. The r-squared value is 0.716, indicating a fairly strong linear trend. Morey was the first general manager hired in the NBA without a traditional basketball background; he began his tenure in 2007. Interestingly enough, the percent of 3P attempts dipped from 2007 onwards but rose greatly when James Harden joined the team in 2012.
Next, lets take a look at the salary data across the years for each team in the NBA
#Next, we will create a graphic of the teams over time to see how their payroll increases.
import matplotlib.pyplot as plt
#We just want to initialize our x and y values here for the graphs. We will not use all of the years or all of the payrolls
x = df_team_salaries['Year']
y = df_team_salaries['Rank']
#There are 30 teams, so we want to make a grid of 30 subplots so we can follow each team's distribution.
fig, axes = plt.subplots(nrows = 6, ncols = 5, sharey = True, sharex = True, figsize = (20,12))
plt.xlim(2000, 2019)
#We want to add each distribution per team. We must first make sure we limit only values that are pertaining to the specific franchise.
index = 0
for row in axes:
for col in row:
x = df_team_salaries[df_team_salaries['NBA Team'] == df_team_salaries['NBA Team'].unique()[index]]
y = df_team_salaries[df_team_salaries['NBA Team'] == df_team_salaries['NBA Team'].unique()[index]]
#Then, we graph the team's payroll rank over time and add a title so the reader knows which team is which.
col.plot(x['Year'] , y['Rank'])
col.set_title(df_team_salaries['NBA Team'].unique()[index])
index = index + 1
plt.gca().invert_yaxis()
This graph demonstrates which teams in the NBA had the largest payroll relative to other NBA teams over the last 20 years; the rank of each of the 30 teams in total player spending has fluctuated over that span. In Milestone 1, we defined one aspect of player success as earning a large contract, and we also wanted to compare each player's salary to his team's total payroll, giving a ratio of how important his contract was to his team. This graph is the first part of solving that problem, showing which teams were spending the most money to satisfy the star players on their squad.
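The second part, sketched below, would divide each player's salary by his team's total payroll for the same season. This is only a sketch: it assumes both salary columns have already been converted to numeric dollars and that the player table carries the same three-letter abbreviations in an 'NBA Team' column, a mapping we have not built at this point. The payroll_share helper is hypothetical.
# Hypothetical sketch: each player's share of his team's payroll for a season.
# Assumes 'Salary' and 'Total Salary' are numeric, and that both tables share
# an 'NBA Team' abbreviation column and a 'Year' column.
def payroll_share(players, teams):
    merged = players.merge(teams[["NBA Team", "Year", "Total Salary"]],
                           on=["NBA Team", "Year"], how="inner")
    merged["Payroll Share"] = merged["Salary"] / merged["Total Salary"]
    return merged.sort_values("Payroll Share", ascending=False)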
The most commonly used efficiency score in the game is calculated using the formula:
(PTS + REB + AST + STL + BLK − Missed FG − Missed FT − TO) / Games Played
This is what you would expect to see in most chart comparisons and on a day-to-day ESPN show. Using our player data, let's calculate this efficiency score for each player.
df_NBA["Missed Field Goals"] = df_NBA["FG Attempts"] - df_NBA["Field Goals"]
df_NBA["Missed FT"] = df_NBA["FT Attempts"] - df_NBA["FT made"]
df_NBA["Standard Eff Score"] = (df_NBA["Points"] + df_NBA["Total Rebounds"] + df_NBA["Assists"] + df_NBA["Steals"] + df_NBA["Blocks"]
- df_NBA["Missed Field Goals"] - df_NBA["Missed FT"] - df_NBA["Turnovers"])/df_NBA["Games"]
df_NBA = df_NBA.sort_values(["Standard Eff Score"], ascending = False)
df_NBA.head()
However, this type of efficiency score limits our ability to see how effective bench players are. Bench players are inherently penalized in that they will not have nearly as much playing time as the stars do.
The next step in our analysis is to try to gain a better understanding of the value of each player. To do so, we want to create some weighted statistics that show how effective each player was in their time on the court. Let's begin by calculating each player's stats per minute. We will base the weighted scores on the number of minutes played rather than raw totals or even total games, due to the inherent bias in how benches are used. Someone like LeBron James with a high play rate will always have higher raw stats than someone like Kawhi Leonard, who is under constant load management. This system of evaluation is meant to measure players by their effectiveness in their given time on the floor.
df_NBA["Minutes"] = df_NBA["Minutes"].fillna(1)
df_NBA["PPM"] = df_NBA["Points"]/df_NBA["Minutes"] # Points per minute
df_NBA["RPM"] = df_NBA["Total Rebounds"]/df_NBA["Minutes"] # Rebounds per minute
df_NBA["APM"] = df_NBA["Assists"]/df_NBA["Minutes"]# Assists per minute
df_NBA["SPM"] = df_NBA["Steals"]/df_NBA["Minutes"]# Steals per minute
df_NBA["BPM"] = df_NBA["Blocks"]/df_NBA["Minutes"]# Blocks per minute
df_NBA["Missed FGPM"] = df_NBA["Missed Field Goals"]/df_NBA["Minutes"]
df_NBA["Missed FTPM"] = df_NBA["Missed FT"]/df_NBA["Minutes"]
df_NBA["TOPM"] = df_NBA["Turnovers"]/df_NBA["Minutes"]
df_NBA["Min_Eff_Score"] = df_NBA["PPM"] + df_NBA["RPM"] + df_NBA["APM"] + df_NBA["SPM"] + df_NBA["BPM"] - df_NBA["Missed FGPM"] - df_NBA["Missed FTPM"] - df_NBA["TOPM"]
df_NBA.head()
Next, let's create some standardized ("weighted") measurements that show how far each per-minute stat is from the league average, in standard deviations. The greater the positive deviation, the better the player is compared to the league average.
df_NBA = df_NBA.fillna(0)
df_NBA = df_NBA.replace([np.inf, -np.inf], 0)
df_NBA["wt_PPM"] = (df_NBA["PPM"]-df_NBA["PPM"].mean())/df_NBA["PPM"].std()
df_NBA["wt_RPM"] = (df_NBA["RPM"]-df_NBA["RPM"].mean())/df_NBA["RPM"].std()
df_NBA["wt_APM"] = (df_NBA["APM"]-df_NBA["APM"].mean())/df_NBA["APM"].std()
df_NBA["wt_SPM"] = (df_NBA["SPM"]-df_NBA["SPM"].mean())/df_NBA["SPM"].std()
df_NBA["wt_BPM"] = (df_NBA["BPM"]-df_NBA["BPM"].mean())/df_NBA["BPM"].std()
df_NBA["wt_Missed FGPM"] = (df_NBA["Missed FGPM"]-df_NBA["Missed FGPM"].mean())/df_NBA["Missed FGPM"].std()
df_NBA["wt_Missed FTPM"] = (df_NBA["Missed FTPM"]-df_NBA["Missed FTPM"].mean())/df_NBA["Missed FTPM"].std()
df_NBA["wt_TOPM"] = (df_NBA["TOPM"]-df_NBA["TOPM"].mean())/df_NBA["TOPM"].std()
df_NBA["wt_score"] = df_NBA["wt_PPM"] + df_NBA["wt_RPM"] + df_NBA["wt_APM"] + df_NBA["wt_SPM"] + df_NBA["wt_BPM"] - df_NBA["wt_Missed FGPM"] - df_NBA["wt_Missed FTPM"] - df_NBA["wt_TOPM"]
df_NBA = df_NBA.sort_values(["wt_score"], ascending = False)
df_NBA[:15]
Here, we suddenly see a completely different side of the NBA: the top 15 players are all young players, playing very few minutes but being very effective in their time on the court. Of course, just like all other metrics, it is important to understand that this pace is not sustainable for these players; they were simply very effective in their few minutes on the court. This alternate "efficiency score", which we are calling a weighted minutes score, seems to favor bench players with very low minutes over superstars. Let's see if there is a possible correlation between the standard efficiency score and the weighted score.
eff_wt_plot = df_NBA.plot.scatter(x = "Standard Eff Score", y = "Min_Eff_Score", figsize = (10, 5))
#A regression line is then fit to each plot using the linregress function.
slope, intercept, r_value, p_value, std_err = stats.linregress(df_NBA["Standard Eff Score"], df_NBA["Min_Eff_Score"])
line = slope*df_NBA["Standard Eff Score"] + intercept
eff_wt_plot.plot(df_NBA["Standard Eff Score"], line, 'r', label='fitted line')
eff_wt_plot.set_title("Relationship between Standard Effeciency Score and Minutes Based Effeciency Score")
eff_wt_plot.set_xlabel("Standard Efficiency Score")
eff_wt_plot.set_ylabel("Minute Based Efficiency Score")
print(r_value**2)
The scatter plot above compares the standard efficiency score to the minutes-based efficiency score. The R-squared value is around 0.409, which means these two metrics are only moderately similar in how they rate the abilities of NBA players. Next, let's make the same comparison using the standardized (weighted) versions of both scores.
df_NBA["wt_Standard Eff Score"] = (df_NBA["Standard Eff Score"] - df_NBA["Standard Eff Score"].mean())/df_NBA["Standard Eff Score"].std()
eff_wt_plot = df_NBA.plot.scatter(x = "wt_Standard Eff Score", y = "wt_score", figsize = (10, 5))
#A regression line is then fit to each plot using the linregress function.
slope, intercept, r_value, p_value, std_err = stats.linregress(df_NBA["wt_Standard Eff Score"], df_NBA["wt_score"])
line = slope*df_NBA["wt_Standard Eff Score"] + intercept
eff_wt_plot.plot(df_NBA["wt_Standard Eff Score"], line, 'r', label='fitted line')
eff_wt_plot.set_title("Relationship between Weighted Standard Efficiency Score and Weighted Minutes Based Efficiency Score")
eff_wt_plot.set_xlabel("Wt Standard Efficiency Score")
eff_wt_plot.set_ylabel("Wt Minute Based Efficiency Score")
print(r_value**2)
The weighted graph, however, has a much lower R-squared value.
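Since the two metrics clearly disagree on magnitude, a quick optional check (not part of our main analysis) is to ask whether they at least rank players similarly, using a Spearman rank correlation on the two weighted scores:
from scipy import stats
# Spearman correlation compares how the two scores rank players, ignoring the raw magnitudes.
rho, p = stats.spearmanr(df_NBA["wt_Standard Eff Score"], df_NBA["wt_score"])
print("Spearman rho:", rho, "p-value:", p)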
The next question we want to investigate is whether it is better to have a team assembled of similarly strong players (a team with great depth) or a few key players who can carry the rest of the team. We will use the standard efficiency score to measure the average and spread of each team.
avg_effscores = df_NBA.groupby(["Year", "Team"])[["Standard Eff Score"]].mean()
avg_effscores.head()
Now that we have the average standard efficiency score for each team, we can use it to compute a relative score for each player: how many times larger a player's efficiency score is than the average score on their team. If a team has invested in a weak bench but has one or two superstars, then those superstars will have a much higher relative score.
def team_effscore_comparison(score, year, team):
avg_score = avg_effscores.loc[(year, team)]
return score/avg_score
df_NBA["rel_score"] = df_NBA.apply(lambda x: team_effscore_comparison(x["Standard Eff Score"], x["Year"], x["Team"]), axis = 1)
df_NBA.head()
The next step is to group the relative scores by team. We want to look specifically at superstars and exceptionally good role players. Thus, we will look at the maximum relative score on each team: a high max rel_score indicates the presence of a star player and a general lack of depth, while a lower max rel_score indicates a deeper bench or a more evenly spread team.
In addition, we will need to take a look at the ranking data we scraped further up in the notebook. We will merge the data sets in an inner join.
top_eff = df_NBA.groupby(["Year", "Team"])[["rel_score"]].max()
top_eff = top_eff.reset_index()
top_eff = top_eff.merge(df_rank, how = "inner", on = ["Team", "Year"])
top_eff.head()
Now, let's see if having a superstar translates to greater success in the regular season.
rel_rk_plot = top_eff.plot.scatter(x = "Rk", y = "rel_score", figsize = (10, 5))
#A regression line is then fit to each plot using the linregress function.
slope, intercept, r_value, p_value, std_err = stats.linregress(top_eff["Rk"], top_eff["rel_score"])
line = slope*top_eff["Rk"] + intercept
rel_rk_plot.plot(top_eff["Rk"], line, 'r', label='fitted line')
rel_rk_plot.set_title("Does having a superstar on a team translate to success in the regular season?")
rel_rk_plot.set_xlabel("Rank of Team at end of Regular Season")
rel_rk_plot.set_ylabel("Highest Relative Score on a Team")
print(r_value**2)
Rank 1 means the highest rank in the season, so the negative slope in the graph indicates a relationship between a team's rank and whether it has a star player: teams with a superstar, rather than an evenly spread, deep roster, tend to perform better in the regular season. However, it is important to note that the R-squared value is quite low, at 0.171. So even though the linear relationship seems to indicate that having a superstar over a deep bench leads to regular season success, the relationship itself is not strong.
Associating Salary with NBA Players
The scores so far are all based on minutes; now, through a join, we will also be able to access how much each specific player made that season.
df_NBA = pd.merge(df_NBA, player_salaries, how = 'inner', on = ['Player', 'Year'])
df_NBA
Now that we have a player's salary, we can generate their salary score, calculated as:
Player Salary / League-wide median player salary in the same year
This way, we control for inflation and the league-wide growth in salaries over time.
#To perform this calculation, we need to clean salary up so that it is a float dtype.
df_NBA['Salary'] = df_NBA['Salary'].str[1:]
df_NBA['Salary'] = df_NBA['Salary'].str.replace("," , '')
df_NBA['Salary'] = df_NBA['Salary'].astype(float)
#Next, we must find the median salary of players for a specific year. We'll call this groupby object avg_sal
avg_sal = df_NBA.groupby(["Year"])[["Salary"]].median()
def player_wtscore_salary(salary, year, player):
avg_score = avg_sal.loc[(year)]
return salary/avg_score
#Finally, we return the salary score of each NBA Player using our function
df_NBA["sal_score"] = df_NBA.apply(lambda x: player_wtscore_salary(x["Salary"], x["Year"], x["Player"]), axis = 1)
df_NBA
Now we are able to calculate our efficiency score. To do so, we will use the following formula:
wt_score / sal_score
This formula gives the total 'efficiency' score of the player: specifically, how impactful that player is on the court, relative to the minutes he has played and the amount he is paid.
Now we will be able to see who the most efficient player in the NBA is, according to 'our' statistics. An efficient player by this measure performs better per minute on conventional NBA stats than the average player, and is also paid less than the average NBA player for that season.
df_NBA['efficiency_score'] = df_NBA['wt_score'] / df_NBA['sal_score']
df_NBA = df_NBA.sort_values(['efficiency_score'], ascending = False)
df_NBA
As you can see from our data, most of the most efficient players have played only a few minutes. To make this a little more interesting, let's limit the scope to players who have appeared in at least 41 games, or half an NBA regular season.
players_41_eff = df_NBA[df_NBA['Games'] >= 41]
players_41_eff
Now we are able to see the players who participated in a reasonable number of games within the season, along with their corresponding efficiency scores.
Machine Learning Component
No data science project is complete without a machine learning component. We elected to use a nearest-neighbors regression approach to make predictions from our dataset.
We wanted to ask: given a player's weighted score, how much should he be paid? General managers can use this information to estimate what a player is essentially worth, assuming his stats remain somewhat consistent. Below is our code for the problem:
Our machine learning example predicts pay based on the weighted score. We will use the average salary of the 5, 30, and 100 nearest neighbors and compare the resulting predictions to each other.
plt = df_NBA.plot.scatter(x = 'wt_score', y = 'Salary', figsize = (10, 10), title = "K Nearest Neighbors of Weighted Score Predicting Salary Pay")
def get_NN_prediction(x_new, k):
#Given new observation, returns the k-nearest neighbors prediction
dists = ((X_train - x_new) ** 2).sum(axis=1)
inds_sorted = dists.sort_values().index[:k]
return y_train.loc[inds_sorted].mean()
X_train = df_NBA[["wt_score"]]
y_train = df_NBA["Salary"]
X_new = pd.DataFrame()
X_new["wt_score"] = np.arange(0, 15, 1)
X_new
colors = ['red', 'green', 'blue']
for i,k in enumerate([5, 30, 100]):
y_new_pred = X_new.apply(get_NN_prediction, axis=1, args=(k,))
y_new_pred.index = X_new["wt_score"]
y_new_pred.plot.line(color = colors[i], label=str(k), legend=True)
The key takeaways from this graph are as follows:
1. The players that have the highest salary do not have overwhelming weighted scores. In fact, some of the players have a weighted score of about zero.
2. It does not appear that wt_score and Salary are positively correlated, for the most part. Over the interval of wt_score from 2 to 4, however, there does appear to be an increase in pay.
3. Players with extremely high wt_scores receive less than average salary. This may be due to high stats relative to a small number of minutes played, giving those players a generous wt_score.
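As an optional sanity check on the hand-rolled nearest-neighbor search above, the same style of prediction could be produced with scikit-learn's KNeighborsRegressor. This is only a sketch, assuming scikit-learn is installed; it is not part of our original pipeline, and the curves should roughly match ours.
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

X_grid = np.arange(0, 15, 1).reshape(-1, 1)  # same grid of wt_score values as above
for k in [5, 30, 100]:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(df_NBA[["wt_score"]].values, df_NBA["Salary"].values)
    # Predicted salary at wt_score = 0, 1, 2 for this choice of k.
    print(k, knn.predict(X_grid)[:3])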
In this notebook, we took a brief look at several aspects of basketball: player statistics, salary, and how much individual players can impact an entire team. The game has evolved significantly in the last decade, driven in large part by the shift toward a statistically focused approach. The three-point shot has been dubbed the "most efficient shot" by Morey, and with that the Houston Rockets, and the rest of the league, have increased their focus on the three-pointer.
The main focus of this project, however, was to investigate what team composition allows for the greatest success. Specifically, we looked into whether emphasizing team depth is more important than having one standout player who serves as a superstar. We found that having such a key player is correlated with a better end-of-season ranking.
We also showed a few different ways of quantifying players. ESPN, the NBA, and various other organizations often use an efficiency score built from basic in-game stats divided by the number of games played. We generated two other methods of quantifying players: a minute-based efficiency score (min_eff_score) and a minute-and-salary-based efficiency score, in which a player's score is compared to how much he is paid. This showed just how much some bench players and rookies are able to outperform their salary. In the current league environment, where most of the money goes to key players, it can be important to recognize these bench players who are able to outperform their expected salary worth.