Basketball Analysis Tutorial

Thomas Huang and Chase Stockwell

Background/ Current Literature

There has been a significant amount of research into predicting basketball players' overall significance in the NBA from their collegiate years. Because the NBA is now a massive market and a center point of entertainment in the US and around the world, agents and teams are constantly improving their predictive tools to make better use of their draft positions.

However, these tools are not widely available to the public for personal use. Although online discussion can help shape perspectives on player rankings and success, there has been less focus on the longevity of player performance in the NBA. Common methods for predicting player ability have relied on the NBA formula for productivity. However, many of these statistics are used to predict early-career success and draft position. Our study will instead examine longitudinal data retroactively to determine how long players stay in the league and how this correlates with aspects of their time in collegiate basketball.

Objective

Using data science, we intend to build an algorithm that extracts the statistics that are most significant in indicating and/or predicting long-term success in the professional game among college players. We define success as the following:

  1. Longevity of NBA career relative to position
  2. Amount of money earned on the player’s contract
  3. Overall ‘basketball statistics’ that would be considered successful (points, assists, steals, blocks)
  4. Wins
  5. College Statistics
  6. Physical Qualities

Databases

Hoops Hype: Salaries of individual players, 2000-2020

https://hoopshype.com/salaries/2000-2001/

Bart Torvik: On-court stats of individual college players

http://barttorvik.com/trankpre.php

NBA Basketball Reference: On-court stats of individual NBA players

https://www.basketball-reference.com/leagues/NBA_2001_totals.html

Hoops Hype Database

Will require data scraping. Salaries of NBA players: details the salaries of individual players in the NBA from 2000 to 2020. This is raw data, and salary will be used as one factor of many to determine the success of players. Salaries will also be considered relative to the salary cap and other monetary statistics for the NBA that year, since inflation and player compensation have changed significantly over the years.
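
One simple way to put salaries from different years on a comparable footing is to express each salary as a share of that season's salary cap. The sketch below is only an illustration: `cap_by_year` is a hypothetical lookup, and both the cap figures and the player rows are placeholders, not values from our data.

```python
import pandas as pd

# Placeholder cap figures for illustration only (assumed, not from the dataset).
cap_by_year = {2000: 35_000_000, 2018: 100_000_000}

# Toy salary rows; the players and amounts are invented.
salaries = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Year": [2000, 2018],
    "Salary": [7_000_000, 20_000_000],
})

# Look up each season's cap and express the salary as a fraction of it.
salaries["Cap"] = salaries["Year"].map(cap_by_year)
salaries["Cap Share"] = salaries["Salary"] / salaries["Cap"]
print(salaries[["Player", "Year", "Cap Share"]])
```

In this toy case both players take up the same share of the cap even though their raw salaries differ almost threefold, which is the kind of distortion the adjustment is meant to remove.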

Bart Torvik Database

Will not require data scraping: we contacted Bart Torvik, who will send the raw data as csv files. Statistics of individual players in their college years. Details include all on-court stats, school, physical characteristics, etc.

NBA Basketball Reference

Does not require data scraping - instead the data was available as downloadable csv files that can be found in the data_NBA folder.

Methods

Data Scraping - we want to scrape the databases above from their websites.

  1. We intend to use methods learnt in class to achieve this.
  2. For more complicated extractions, we will use Stack Overflow and other online resources to learn how to compile the data.
  3. We may also interact with software APIs to gain more information on the data, using methods we have learned in lecture.

Python + Pandas - to interpret and analyze raw csv data.

  1. We intend to use the same techniques as performed in class to create DataFrames of the tables.
  2. Some SQL may be used to join and merge tables for a clearer picture of whatever inferential statistic we are pursuing.

Statistical Modeling - to analyze data.

  1. General descriptive statistics: means, medians, etc., by quartile groups of “success”.
  2. Assignment of success scores to current NBA players based on various factors.

Impact

The inspiration behind this project is that arguably the most influential college basketball player of the last 20 years was drafted by New Orleans this year, and we want to see whether the indicators of his success in the NBA look promising. But this project has much more breadth. It could be useful to:

  1. Coaches and scouts, who use general success statistics to assess overall talent.
  2. Video game creators, who must decide how to quantify player skill in sports video games such as the NBA 2K series and the NBA Live series.
  3. Advertisers, who want to identify potential superstars to sign cheaply before they become superstars.
  4. Small-market teams, who can select value players with an effective skill-to-dollar-paid ratio.

Conflicts of Interest

There are no conflicts of interest. This project was neither sourced nor paid for by any participating basketball organization or any subject of the study. This will be unbiased, objective research using only quantifiable statistics to make empirical assessments.

Sources:

Coates, Dennis, University of Maryland, ResearchGate, The Length and Success of NBA Careers: Does College Production Predict Professional Outcomes? Oct 11th, 2019

The first step in our analysis is to install and import the libraries needed for data collection and analysis, and to adjust the display of this notebook.

Data Acquisition

The first step of any analysis is getting your data. So what data will we need? We will be using four datasets: two are scraped from websites and two are read from csv files.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
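# display and HTML are imported from IPython.display; importing display from IPython.core.display is deprecated.
from IPython.display import display, HTML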
display(HTML("<style>.container { width:95% !important; }</style>"))

#The output of this cell below is hidden to allow the notebook to appear more organized. Press "o" while selecting this cell to show the output.

The first database we will be looking at is Hoops Hype. Here, Hoops Hype gives us team information rather than information concerning individual players. Thus, this dataset will be one of the smallest we work with. BeautifulSoup greatly helps us in data scraping by making the page's HTML easier to parse.

We want to look at the NBA seasons beginning in 2000 to 2018. We will not be using 2019 as part of the data set because the season is unfinished.

In [37]:
#Below are a few more import statement necessary for data scraping from a website.
import requests
from bs4 import BeautifulSoup

'''
The first year from which we will be collecting data from the hoopshype database is 2000.

For this analysis, the seasons will be labeled by the year in which the season began.
Thus, as an example, the NBA season 2019-2020 will be labeled as 2019.
'''
starting = 2000

#An empty list to be populated by team salaries is created.
team_salaries = []

#The beginning index is 0.
index = 0

#We do not want to obtain 2019 data, as the season is still unfinished.
while starting < 2019:
    
    '''
    Request formats the URL in which the data will be scraped from. The {} within the
    URL will be replaced by the current year.
    '''
    request = "https://hoopshype.com/salaries/{}-{}/".format(str(starting), str(starting+1))
    r = requests.get(request)
    # BeautifulSoup allows us to parse the return data from the request.
    root = BeautifulSoup(r.content, 'html.parser')

    # Prettify returns the large block of HTML as an indented, legible string, which helps when inspecting the page.
    # The return value is not stored here, so this call has no effect on the parsing below.
    root.prettify()

    # Use find() to save the aforementioned table as a variable
    hh_table = root.find('table')


    # Use pandas to read the HTML file
    list_df = pd.read_html(str(hh_table) )
    #print(list_df[0])
    
    # In the case in where this instance of the while loop is the first year scraped,
    if starting == 2000:
        # The dataframe team_salaries is created with the salary taken from list_df.
        team_salaries = list_df[0]
        
        # Set reasonable names for the table columns. Also adds the year of the dataset. 
        team_salaries.columns = ['Rank', 'NBA Team', 'Total Salary', 'Total Salary (adj.)']
        yr = []
        
        # Then iterate through team_salaries and append to yr, the current year.
        for i in range(len(team_salaries.index)):
            yr.append(starting)
        
        # Set the Year column of team_salaries equal to yr.
        team_salaries['Year'] = yr
    else: 
        team_salaries2 = list_df[0]

        # Set reasonable names for the table columns. Also adds the year of the dataset. 
        team_salaries2.columns = ['Rank', 'NBA Team', 'Total Salary', 'Total Salary (adj.)']
        yr = []
        for i in range(len(team_salaries2.index)):
            yr.append(starting)
        team_salaries2['Year'] = yr
        
        team_salaries = pd.concat([team_salaries, team_salaries2])
    
    #Once we have reached the end of this iteration of the while loop, we move on to the next year.
    starting = starting + 1

#We now have our working dataframe that we will use for the NBA salaries. It is tidied up now, because the year has been merged into a separate variable
#and the data now grows vertically instead of horizontally.
df_team_salaries = team_salaries

df_team_salaries.head()
Out[37]:
Rank NBA Team Total Salary Total Salary (adj.) Year
0 1.0 Portland $87,395,140 $129,847,169 2000
1 2.0 New York $74,007,738 $109,956,860 2000
2 3.0 Miami $73,472,329 $109,161,375 2000
3 4.0 Brooklyn $68,977,578 $102,483,308 2000
4 5.0 Washington $59,085,969 $87,786,867 2000

In the table, you can see the Team, Total Salary, the adjusted salary, and the year. This information will be crucial later when we want to compare team salaries and how the pay is distributed among players.
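
Note that the salary columns are read in as text such as `$87,395,140`, so they need to be converted before any arithmetic or comparison. A minimal sketch of one way to do this follows; `dollars_to_int` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def dollars_to_int(series):
    """Convert strings such as '$87,395,140' to integers."""
    # Strip the dollar sign and thousands separators, then parse as numbers.
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned).astype("int64")

sample = pd.Series(["$87,395,140", "$74,007,738"])
print(dollars_to_int(sample).tolist())
```

The same helper could be applied to both the `Total Salary` and `Total Salary (adj.)` columns.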

Below are the unique NBA team names found in the Hoops Hype data. As we accrue more data, it is important to plan how the tables will be merged. This table identifies each team by its location (for example, 'Portland'), whereas the player statistics we load later use three-letter abbreviations. Therefore, we use the pandas map function to convert each team name to its three-letter abbreviation.

In [38]:
#This displays every NBA team found in the hoops hype data.
df_team_salaries['NBA Team'].unique()
Out[38]:
array(['Portland', 'New York', 'Miami', 'Brooklyn', 'Washington',
       'LA Lakers', 'Milwaukee', 'San Antonio', 'Indiana', 'Phoenix',
       'Utah', 'Dallas', 'Denver', 'Oklahoma City', 'Boston',
       'Philadelphia', 'Cleveland', 'Houston', 'Memphis', 'Minnesota',
       'Charlotte', 'Sacramento', 'Golden State', 'Detroit', 'Atlanta',
       'Toronto', 'Orlando', 'Chicago', 'LA Clippers', 'New Orleans'],
      dtype=object)
In [39]:
df_team_salaries['NBA Team'] = df_team_salaries['NBA Team'].map({ 
    'Cleveland' : 'CLE', 
    'New York' : 'NYK',
    'Detroit' : 'DET', 
    'LA Lakers' : 'LAL', 
    'Atlanta' : 'ATL',
    'Dallas' : 'DAL', 
    'Philadelphia' : 'PHI', 
    'Milwaukee' : 'MIL', 
    'Phoenix' : 'PHO', 
    'Brooklyn' : 'NJN',
    'Boston' : 'BOS', 
    'Portland' : 'POR', 
    'Golden State' : 'GSW', 
    'San Antonio' : 'SAS', 
    'Indiana' : 'IND',
    'Utah' : 'UTA', 
    'Oklahoma City' : 'SEA', 
    'Houston' : 'HOU', 
    'Charlotte' : 'CHA', 
    'Denver' : 'DEN',
    'LA Clippers' : 'LAC', 
    'Chicago' : 'CHI', 
    'Washington' : 'WAS', 
    'Sacramento' : 'SAC', 
    'Miami' : 'MIA',
    'Minnesota' : 'MIN', 
    'Orlando' : 'ORL', 
    'Memphis' : 'VAN', 
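    'Toronto' : 'TOR', 
    # Note: 'NOA' does not match the New Orleans abbreviations used in the player stats (NOH, NOK, NOP).
    'New Orleans' : 'NOA'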
})

df_team_salaries.head()
Out[39]:
Rank NBA Team Total Salary Total Salary (adj.) Year
0 1.0 POR $87,395,140 $129,847,169 2000
1 2.0 NYK $74,007,738 $109,956,860 2000
2 3.0 MIA $73,472,329 $109,161,375 2000
3 4.0 NJN $68,977,578 $102,483,308 2000
4 5.0 WAS $59,085,969 $87,786,867 2000

Good. Now that each NBA team has been mapped to its three-letter abbreviation rather than its full name, merging and joining across different tables is much easier. Let's move on to the next database.
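
One caveat with a single fixed mapping is that Hoops Hype lists franchises under their current location, while the abbreviations in the player statistics change when a franchise relocates. The sketch below shows one possible year-aware lookup. It is an assumption-laden illustration: `RELOCATIONS` and `abbreviation_for` are hypothetical names, seasons are labeled by the year they began, and only three relocated franchises are covered (Charlotte and New Orleans are left out because their histories are more involved).

```python
# Each entry lists (first season start year, abbreviation) in chronological order.
# Only three relocated franchises are covered; this table is illustrative.
RELOCATIONS = {
    "Memphis": [(2000, "VAN"), (2001, "MEM")],
    "Oklahoma City": [(2000, "SEA"), (2008, "OKC")],
    "Brooklyn": [(2000, "NJN"), (2012, "BRK")],
}

def abbreviation_for(location, year, static_map):
    """Return the abbreviation in use for `location` in the season starting in `year`."""
    eras = RELOCATIONS.get(location)
    if eras is None:
        # Franchise did not move in this period: fall back to the fixed mapping.
        return static_map.get(location)
    abbreviation = None
    for start_year, code in eras:
        if year >= start_year:
            abbreviation = code
    return abbreviation

static_map = {"Portland": "POR"}
print(abbreviation_for("Brooklyn", 2011, static_map))
print(abbreviation_for("Brooklyn", 2012, static_map))
print(abbreviation_for("Portland", 2005, static_map))
```

A lookup like this could be applied row by row with the team name and the `Year` column, so that a later merge on team and year lines up for relocated franchises.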

Our next dataset comes from Basketball Reference. This is a great resource for data scientists interested in sports, as it has data on almost every well-documented professional and amateur league. It contains on-court information for every player. As this dataset was easily downloadable as csv files, we will use the os library to iterate through the data_NBA directory and read each file into a central dataframe named df_NBA.
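
An alternative to counting files is to read the season directly from each file name, which does not depend on the order in which the files are listed. The sketch below assumes hypothetical file names containing a four-digit year (for example `NBA_2001_totals.csv`) that marks the year the season ended; the actual names in data_NBA may differ.

```python
import re
from pathlib import Path

import pandas as pd

def load_season_files(folder):
    """Read every CSV in `folder`, deriving the season from the file name."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        match = re.search(r"(\d{4})", path.name)
        if match is None:
            continue  # skip files without a year in the name
        frame = pd.read_csv(path, encoding="latin-1")
        # Assumed convention: the file is named for the year the season ended,
        # while this tutorial labels seasons by the year they began.
        frame["Year"] = int(match.group(1)) - 1
        frames.append(frame)
    # Concatenate once at the end instead of growing a DataFrame in the loop.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Example call (path taken from this notebook):
# df_NBA = load_season_files("./data_NBA")
```

Collecting the frames in a list and concatenating once also avoids repeatedly copying the growing dataframe.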

In [40]:
import os

directory_in_str = "./data_NBA"
directory = os.fsencode(directory_in_str)

#Set the first year to be read in as 2000 and create an empty DataFrame.
yNum = 2000
df_NBA = pd.DataFrame()

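# Now, we will iterate through each file in the directory. os.listdir returns names in arbitrary order,
# so we sort them to make sure the files are read in a predictable order.
for file in sorted(os.listdir(directory)):
     filename = os.fsdecode(file)
     # If the file is a csv file, then it is one we want to read in. This assumes the sorted file names are in year order, so yNum matches the file being read.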
     if filename.endswith(".csv"):
        # A temporary dataframe is created for each file.
        df_temp = pd.read_csv((directory_in_str +"/"+ filename), sep=",", encoding='latin-1')
        # The current dataframe has each observation set to its respective year.
        df_temp["Year"] = yNum
        # This current dataframe is then concatenated to the main dataframe and then discarded.
        df_NBA = pd.concat([df_NBA, df_temp], ignore_index = True)
        # Increase the current year by one.
        yNum = yNum+1
     else:
         continue
# Remove the column Rk, which is the old index of the original table.
del df_NBA["Rk"]
df_NBA.head()
Out[40]:
Player Pos Age Tm G GS MP FG FGA FG% ... ORB DRB TRB AST STL BLK TOV PF PTS Year
0 Mahmoud Abdul-Rauf\abdulma02 PG 31 VAN 41 0 486 120 246 0.488 ... 5 20 25 76 9 1 26 50 266 2000
1 Tariq Abdul-Wahad\abdulta01 SG 26 DEN 29 12 420 43 111 0.387 ... 14 45 59 22 14 13 34 54 111 2000
2 Shareef Abdur-Rahim\abdursh01 SF 24 VAN 81 81 3241 604 1280 0.472 ... 175 560 735 250 90 77 231 238 1663 2000
3 Cory Alexander\alexaco01 PG 27 ORL 26 0 227 18 56 0.321 ... 0 25 25 36 16 0 25 29 52 2000
4 Courtney Alexander\alexaco02 PG 23 TOT 65 24 1382 239 573 0.417 ... 42 101 143 62 45 5 75 139 618 2000

5 rows × 30 columns

Now that the data is loaded into a dataframe, we need to tidy it. Most of these column headings are shortened for readability. In the cell below, we take a look at the different columns. Most of this information is very important: these are all on-court stats for each player in each year.

In [41]:
df_NBA.columns
Out[41]:
Index(['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year'],
      dtype='object')

Looking at the columns above, we could probably discern some of these column headers. FT more than likely means free throws and Tm probably means team. However, let's change the column headings to increase the readability of our table.
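
Assigning a full list to `df.columns` depends on the columns arriving in exactly the expected order. A slightly more defensive option is to rename by name with a dictionary, sketched here on a small stand-in table (the rows are invented; the new names follow the ones used in this notebook).

```python
import pandas as pd

# A small stand-in for df_NBA with a few of the original headers.
df = pd.DataFrame({"Player": ["A"], "Tm": ["VAN"], "FT": [10], "FTA": [12]})

# Map old names to new ones explicitly; unlisted columns are left unchanged.
readable = {"Tm": "Team", "FT": "FT made", "FTA": "FT Attempts"}
df = df.rename(columns=readable)
print(list(df.columns))
```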

In [42]:
df_NBA.columns = ["Player", "Position", "Age", "Team",
                  "Games", "Games Started", "Minutes", "Field Goals",
                  "FG Attempts", "FG Percent","3P FG", "3P Attempts",
                  "3P Percent", "2P FG", "2P Attempts", "2P Percent",
                  "Effective FGP", "FT made", "FT Attempts","FT Percent",
                  "Off Rebounds", "Def Rebounds", "Total Rebounds", "Assists",
                  "Steals", "Blocks", "Turnovers", "P Fouls", "Points",
                  "Year"]
df_NBA["Player"] = df_NBA["Player"].str.split("\\").str[0]
df_NBA.head()
Out[42]:
Player Position Age Team Games Games Started Minutes Field Goals FG Attempts FG Percent ... Off Rebounds Def Rebounds Total Rebounds Assists Steals Blocks Turnovers P Fouls Points Year
0 Mahmoud Abdul-Rauf PG 31 VAN 41 0 486 120 246 0.488 ... 5 20 25 76 9 1 26 50 266 2000
1 Tariq Abdul-Wahad SG 26 DEN 29 12 420 43 111 0.387 ... 14 45 59 22 14 13 34 54 111 2000
2 Shareef Abdur-Rahim SF 24 VAN 81 81 3241 604 1280 0.472 ... 175 560 735 250 90 77 231 238 1663 2000
3 Cory Alexander PG 27 ORL 26 0 227 18 56 0.321 ... 0 25 25 36 16 0 25 29 52 2000
4 Courtney Alexander PG 23 TOT 65 24 1382 239 573 0.417 ... 42 101 143 62 45 5 75 139 618 2000

5 rows × 30 columns

Perfect. Now we have a table that is easily interpretable. In addition, each entry in the Player column originally combined the player's full name with a unique identifier. Since that identifier is a naming scheme used only by this dataset, it will not be useful in merging with other data. Thus, we used the string split function to remove the trailing identifier and leave just the player name.

One issue with the data is that players who were traded mid-season appear in a separate row for each team they played for that year, plus an additional row (Team = TOT) holding their totals for the entire year. This matters when merging with other data frames, because salaries are based on what each team paid them. Thus, we will remove the total rows and keep the individual team rows.

In [43]:
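# If a record has Team == "TOT", it is the season total for a traded player and duplicates that player's per-team rows.
# .copy() gives us an independent DataFrame, so the column assignment below cannot trigger a SettingWithCopyWarning.
df_NBA = df_NBA.loc[df_NBA["Team"] != "TOT"].copy()

# Note: field goal attempts in this data already include three-point attempts, so this denominator counts them twice
# and the ratio comes out lower than the conventional rate, 3P Attempts / FG Attempts.
df_NBA["3P% of Total Attempts"] = df_NBA["3P Attempts"]/(df_NBA["3P Attempts"]+df_NBA["FG Attempts"])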
df_NBA.head()
Out[43]:
Player Position Age Team Games Games Started Minutes Field Goals FG Attempts FG Percent ... Def Rebounds Total Rebounds Assists Steals Blocks Turnovers P Fouls Points Year 3P% of Total Attempts
0 Mahmoud Abdul-Rauf PG 31 VAN 41 0 486 120 246 0.488 ... 20 25 76 9 1 26 50 266 2000 0.053846
1 Tariq Abdul-Wahad SG 26 DEN 29 12 420 43 111 0.387 ... 45 59 22 14 13 34 54 111 2000 0.082645
2 Shareef Abdur-Rahim SF 24 VAN 81 81 3241 604 1280 0.472 ... 560 735 250 90 77 231 238 1663 2000 0.047619
3 Cory Alexander PG 27 ORL 26 0 227 18 56 0.321 ... 25 25 36 16 0 25 29 52 2000 0.222222
5 Courtney Alexander PG 23 DAL 38 6 472 62 178 0.348 ... 43 63 21 16 3 21 76 160 2000 0.053191

5 rows × 31 columns

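One statistic that isn't shown in the original data is the frequency of three-point attempts relative to total shot attempts for each player. This will be interesting to analyze due to the changing nature of the NBA. In the cell above, we calculated this stat as 3P Attempts / (3P Attempts + FG Attempts). Because field goal attempts in this dataset already include three-point attempts, the values are somewhat lower than the conventional three-point attempt rate (3P Attempts / FG Attempts), though the direction of any trend over time should be the same.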
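
For comparison, the conventional three-point attempt rate divides by field goal attempts alone. The sketch below computes it on toy rows (the numbers are illustrative) and guards against players with no attempts; the column name `3P Attempt Rate` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy rows; FG Attempts already includes three-point attempts.
shots = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "FG Attempts": [246, 111, 0],
    "3P Attempts": [14, 10, 0],
})

# Replace a zero denominator with NaN so players with no attempts get NaN
# rather than an infinite value.
denominator = shots["FG Attempts"].replace(0, np.nan)
shots["3P Attempt Rate"] = shots["3P Attempts"] / denominator
print(shots)
```

Rows with a NaN rate are skipped by default when pandas computes a group mean, so they would not distort a per-year average.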

Our next dataset will be the salary data, for individual players this time. Our hypothesis is that salary accurately reflects the projected success and impact of a player. However, this is not always the case, as players can underperform. In addition, some players may be paid not just for their skills on the court but also for factors such as their fanbase, presence in the locker room, and brand.

In [44]:
starting = 1999
player_salaries = []
index = 0


while starting < 2019:
    
    for pages in range(1, 16):
    #http://www.espn.com/nba/salaries/_/year/2000
        request = "http://www.espn.com/nba/salaries/_/year/{}/page/{}".format(str(starting+1), str(pages) )
        r = requests.get(request)
   
        #4. Use BeautifulSoup to read and parse the data, as html or lxml
        root = BeautifulSoup(r.content, 'html.parser')

        #5. Use prettify to view the content and find the appropriate table
        root.prettify()

        #6. Use find() to save the aforementioned table as a variable
        salary_table = root.find('table')
    
        #7. Use pandas to read the HTML file
        if salary_table is None:
            continue
        else:
            df_salary = pd.read_html(str(salary_table) )
    
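        # Note: this branch runs for every page of the first year, so each page overwrites the previous one
        # and only the last page fetched for 1999 is kept.
        if starting == 1999: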
            player_salaries = df_salary[0]
        
        #8 Set reasonable names for the table columns. Also adds the year of the dataset. 
            player_salaries.columns = ['Rank', 'Player', 'Team', 'Salary']
            player_salaries['Year'] = starting
        
        else: 
            player_salaries2 = df_salary[0]
            #8 Set reasonable names for the table columns. Also adds the year of the dataset. 
            player_salaries2.columns = ['Rank', 'Player', 'Team', 'Salary']
            player_salaries2['Year'] = starting
            player_salaries = pd.concat([player_salaries, player_salaries2])
        
    starting = starting + 1

#We now have our working dataframe that we will use for the NBA salaries. It is tidied up now, because the year has been merged into a separate variable
#and the data now grows vertically instead of horizontally.

player_salaries.head()
Out[44]:
Rank Player Team Salary Year
0 RK NAME TEAM SALARY 1999
0 RK NAME TEAM SALARY 2000
1 1 Kevin Garnett, PF Minnesota Timberwolves $19,600,000 2000
2 2 Shaquille O'Neal, C Los Angeles Lakers $19,285,000 2000
3 3 Alonzo Mourning, C Miami Heat $16,879,000 2000

Our initial scrape of the salary data from ESPN is now finished. However, just from the first five rows, we can see that the scrape was not clean. The source tables repeat the header row (RK NAME TEAM SALARY) every 10 rows for reference. Let's get rid of those.

In [45]:
player_salaries = player_salaries[player_salaries["Rank"] != "RK"]
player_salaries.head()
Out[45]:
Rank Player Team Salary Year
1 1 Kevin Garnett, PF Minnesota Timberwolves $19,600,000 2000
2 2 Shaquille O'Neal, C Los Angeles Lakers $19,285,000 2000
3 3 Alonzo Mourning, C Miami Heat $16,879,000 2000
4 4 Juwan Howard, PF Washington Wizards $16,875,000 2000
5 5 Hakeem Olajuwon, C Houston Rockets $16,685,000 2000

Lastly, the player names are again untidy: the player's position is included in the Player column. We can drop it using the same string split function we used in the previous dataset.

In [46]:
player_salaries['Player'] = player_salaries['Player'].str.split(',').str[0]
player_salaries.head()
Out[46]:
Rank Player Team Salary Year
1 1 Kevin Garnett Minnesota Timberwolves $19,600,000 2000
2 2 Shaquille O'Neal Los Angeles Lakers $19,285,000 2000
3 3 Alonzo Mourning Miami Heat $16,879,000 2000
4 4 Juwan Howard Washington Wizards $16,875,000 2000
5 5 Hakeem Olajuwon Houston Rockets $16,685,000 2000
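
The scraping and clean-up steps above can also be written so that every page of every season is kept. The following is a sketch under stated assumptions rather than the code used in this notebook: `combine_pages` is a hypothetical helper, and the toy pages stand in for the tables returned by `pd.read_html`, so no network request is made.

```python
import pandas as pd

def combine_pages(pages_by_year):
    """pages_by_year maps a season to a list of per-page DataFrames."""
    frames = []
    for year, pages in pages_by_year.items():
        for page in pages:
            page = page.copy()
            page.columns = ["Rank", "Player", "Team", "Salary"]
            page["Year"] = year
            frames.append(page)
    combined = pd.concat(frames, ignore_index=True)
    # Drop the header rows that the site repeats inside the table body.
    combined = combined[combined["Rank"] != "RK"].reset_index(drop=True)
    # Remove the position that follows the comma in the Player column.
    combined["Player"] = combined["Player"].str.split(",").str[0]
    return combined

# Toy pages with invented players, each starting with a repeated header row.
page1 = pd.DataFrame([["RK", "NAME", "TEAM", "SALARY"], ["1", "Player A, PF", "Team X", "$10"]])
page2 = pd.DataFrame([["RK", "NAME", "TEAM", "SALARY"], ["2", "Player B, C", "Team Y", "$5"]])
result = combine_pages({1999: [page1, page2]})
print(result)
```

Appending every page to a list and concatenating once means no page is overwritten, whichever season it belongs to.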

Our final dataset to read in from csv files is the ranking data as well as the regular season record for each year.

In [47]:
mega_dataframe = pd.DataFrame()

counter = 1999

while counter < 2019:
    r = 'data_NBA/nba standings/{}.csv'.format(str(counter))
    rotating_df = pd.read_csv(r, skiprows = [0])
    rotating_df['Year'] = counter
    mega_dataframe = pd.concat([mega_dataframe, rotating_df], sort=False, ignore_index = True)
    counter = counter + 1

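# .copy() makes df_rank an independent DataFrame, so later column assignments do not trigger a SettingWithCopyWarning.
df_rank = mega_dataframe[["Rk", "Team", "Overall", "Year"]].copy()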
df_rank.head()
Out[47]:
Rk Team Overall Year
0 1 Los Angeles Lakers 67-15 1999
1 2 Portland Trail Blazers 59-23 1999
2 3 Indiana Pacers 56-26 1999
3 4 Utah Jazz 55-27 1999
4 5 Phoenix Suns 53-29 1999

Again, one issue with this dataset is that each team is labeled differently from the other datasets. In order to allow for merging across tables, we need to map each team name to their respective three letter team name.

In [48]:
print(df_rank["Team"].unique())
df_NBA["Team"].unique()
['Los Angeles Lakers' 'Portland Trail Blazers' 'Indiana Pacers'
 'Utah Jazz' 'Phoenix Suns' 'San Antonio Spurs' 'Miami Heat'
 'Minnesota Timberwolves' 'New York Knicks' 'Charlotte Hornets'
 'Philadelphia 76ers' 'Seattle SuperSonics' 'Toronto Raptors'
 'Sacramento Kings' 'Detroit Pistons' 'Milwaukee Bucks' 'Orlando Magic'
 'Dallas Mavericks' 'Boston Celtics' 'Denver Nuggets' 'Houston Rockets'
 'Cleveland Cavaliers' 'New Jersey Nets' 'Washington Wizards'
 'Atlanta Hawks' 'Vancouver Grizzlies' 'Golden State Warriors'
 'Chicago Bulls' 'Los Angeles Clippers' 'Memphis Grizzlies'
 'New Orleans Hornets' 'Charlotte Bobcats'
 'New Orleans/Oklahoma City Hornets' 'Oklahoma City Thunder'
 'Brooklyn Nets' 'New Orleans Pelicans']
Out[48]:
array(['VAN', 'DEN', 'ORL', 'DAL', 'WAS', 'MIL', 'SAS', 'BOS', 'SAC',
       'HOU', 'POR', 'DET', 'MIN', 'CHI', 'SEA', 'PHI', 'IND', 'UTA',
       'GSW', 'PHO', 'TOR', 'CLE', 'ATL', 'MIA', 'LAC', 'CHH', 'NYK',
       'LAL', 'NJN', 'MEM', 'NOH', 'CHA', 'NOK', 'OKC', 'BRK', 'NOP',
       'CHO'], dtype=object)

Above are the team names in the df_rank dataframe. We want to map all of these team names to the three letter abbreviations seen in the df_NBA teams.
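
Because `map` silently turns any name missing from the dictionary into NaN, it can be worth checking for unmapped names before relying on the result. A minimal sketch follows; `map_with_check` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def map_with_check(series, mapping):
    """Map values and raise if any value is missing from `mapping`."""
    missing = sorted(set(series.dropna()) - set(mapping))
    if missing:
        raise KeyError(f"No abbreviation for: {missing}")
    return series.map(mapping)

teams = pd.Series(["Utah Jazz", "Miami Heat"])
print(map_with_check(teams, {"Utah Jazz": "UTA", "Miami Heat": "MIA"}).tolist())
```

Failing loudly here is a design choice: a missing team would otherwise only surface later as rows that quietly drop out of a merge.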

In [49]:
df_rank["Team"] = df_rank["Team"].map({ 
    'Los Angeles Lakers' : "LAL",
    'Portland Trail Blazers' : "POR",
    'Indiana Pacers' : "IND",
    'Utah Jazz' : "UTA",
    'Phoenix Suns' : "PHO",
    'San Antonio Spurs' : "SAS",
    'Miami Heat' : 'MIA',
    'Minnesota Timberwolves' : "MIN",
    'New York Knicks' : "NYK",
    'Charlotte Hornets' : "CHH",
    'Philadelphia 76ers' : "PHI",
    'Seattle SuperSonics' : "SEA",
    'Toronto Raptors' : "TOR",
    'Sacramento Kings' : "SAC",
    'Detroit Pistons' : "DET",
    'Milwaukee Bucks' : "MIL",
    'Orlando Magic' : "ORL",
    'Dallas Mavericks' : "DAL",
    'Boston Celtics' : "BOS",
    'Denver Nuggets' : 'DEN',
    'Houston Rockets' : "HOU",
    'Cleveland Cavaliers' : "CLE",
    'New Jersey Nets' : "NJN",
    'Washington Wizards' : "WAS",
    'Atlanta Hawks' : "ATL",
    'Vancouver Grizzlies' : "VAN",
    'Golden State Warriors' : "GSW",
    'Chicago Bulls' : "CHI",
    'Los Angeles Clippers' : "LAC",
    'Memphis Grizzlies' : "MEM",
    'New Orleans Hornets' : "NOH",
    'Charlotte Bobcats' : "CHA",
    'New Orleans/Oklahoma City Hornets' : "NOK",
    'Oklahoma City Thunder' : "OKC",
    'Brooklyn Nets' : "BRK",
    'New Orleans Pelicans' : "NOP"
})

df_rank.head()
Out[49]:
Rk Team Overall Year
0 1 LAL 67-15 1999
1 2 POR 59-23 1999
2 3 IND 56-26 1999
3 4 UTA 55-27 1999
4 5 PHO 53-29 1999

Okay. Now, we finally have all of our data imported. Let's take a look at all of it. We have four different datasets:

  1. df_team_salaries: The team salaries of each team in each year.
  2. df_NBA: Individual player stats for each player in the league during the regular season in each year.
  3. player_salaries: The player salaries for each player in each year.
  4. df_rank: The ranking and record for each team in each year.
In [50]:
display(df_team_salaries.head())
display(df_NBA.head())
display(player_salaries.head())
display(df_rank.head())
Rank NBA Team Total Salary Total Salary (adj.) Year
0 1.0 POR $87,395,140 $129,847,169 2000
1 2.0 NYK $74,007,738 $109,956,860 2000
2 3.0 MIA $73,472,329 $109,161,375 2000
3 4.0 NJN $68,977,578 $102,483,308 2000
4 5.0 WAS $59,085,969 $87,786,867 2000
Player Position Age Team Games Games Started Minutes Field Goals FG Attempts FG Percent ... Def Rebounds Total Rebounds Assists Steals Blocks Turnovers P Fouls Points Year 3P% of Total Attempts
0 Mahmoud Abdul-Rauf PG 31 VAN 41 0 486 120 246 0.488 ... 20 25 76 9 1 26 50 266 2000 0.053846
1 Tariq Abdul-Wahad SG 26 DEN 29 12 420 43 111 0.387 ... 45 59 22 14 13 34 54 111 2000 0.082645
2 Shareef Abdur-Rahim SF 24 VAN 81 81 3241 604 1280 0.472 ... 560 735 250 90 77 231 238 1663 2000 0.047619
3 Cory Alexander PG 27 ORL 26 0 227 18 56 0.321 ... 25 25 36 16 0 25 29 52 2000 0.222222
5 Courtney Alexander PG 23 DAL 38 6 472 62 178 0.348 ... 43 63 21 16 3 21 76 160 2000 0.053191

5 rows × 31 columns

Rank Player Team Salary Year
1 1 Kevin Garnett Minnesota Timberwolves $19,600,000 2000
2 2 Shaquille O'Neal Los Angeles Lakers $19,285,000 2000
3 3 Alonzo Mourning Miami Heat $16,879,000 2000
4 4 Juwan Howard Washington Wizards $16,875,000 2000
5 5 Hakeem Olajuwon Houston Rockets $16,685,000 2000
Rk Team Overall Year
0 1 LAL 67-15 1999
1 2 POR 59-23 1999
2 3 IND 56-26 1999
3 4 UTA 55-27 1999
4 5 PHO 53-29 1999
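
With the team abbreviations aligned, the tables can be joined on team and year. The sketch below uses toy rows with invented teams and figures; it splits an `Overall` record such as `60-22` into wins and losses and then merges the standings with team payroll. The `Wins` and `Losses` column names are introduced here for illustration.

```python
import pandas as pd

# Toy standings and payroll tables (teams and figures are invented).
rank = pd.DataFrame({"Team": ["AAA", "BBB"], "Overall": ["60-22", "41-41"], "Year": [2000, 2000]})
payroll = pd.DataFrame({"NBA Team": ["BBB", "AAA"], "Total Salary": [50_000_000, 70_000_000], "Year": [2000, 2000]})

# Split "W-L" strings into two integer columns.
record = rank["Overall"].str.split("-", expand=True).astype(int)
rank["Wins"] = record[0]
rank["Losses"] = record[1]

# Give the key column the same name in both tables, then join on team and season.
payroll = payroll.rename(columns={"NBA Team": "Team"})
merged = rank.merge(payroll, on=["Team", "Year"], how="left")
print(merged[["Team", "Year", "Wins", "Losses", "Total Salary"]])
```

A left join keeps every standings row, so any team without a matching payroll row shows up with a missing salary instead of disappearing.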

Exploratory Analysis and Characterization of the Data

Our next step is to explore our data a little bit. What interesting things can we see?

One major discussion about the modern NBA is the rise of the three-point shot. The rise of Stephen Curry brought with it a change in offensive style. Three-point shots became known as the efficient shot to take: a made three is worth 50% more than a made two, so it yields more points per attempt given a decent three-point percentage. The increase in three-point shooting has also improved floor spacing, giving players in the post more room to make plays.
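
The efficiency argument comes down to simple arithmetic: the expected points from an attempt are the make probability times the shot's value. The sketch below uses illustrative shooting percentages (assumed, not taken from our data) and also computes the three-point percentage needed to match a given two-point percentage.

```python
def expected_points(make_probability, shot_value):
    """Average points produced per attempt."""
    return make_probability * shot_value

def break_even_three_pct(two_point_pct):
    """Three-point percentage that yields the same points per attempt as `two_point_pct`."""
    return two_point_pct * 2 / 3

# Illustrative percentages only.
print(expected_points(0.36, 3))    # a 36% three-point shooter
print(expected_points(0.50, 2))    # a 50% two-point shooter
print(break_even_three_pct(0.50))  # three-point percentage needed to match 50% on twos
```

In other words, a three-point shooter only needs to make two thirds as many shots as a two-point shooter to produce the same points per attempt.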

In order to better comprehend the change in 3 point attempts, a bar graph will suffice as the plot of choice in explaining the shift in the NBA scoring meta.

In [51]:
#Grouping by each year, the percent of shots taken that were threes were averaged and then plotted as a bar per year.
attempts3_plot = df_NBA.groupby("Year")["3P% of Total Attempts"].mean().plot.bar(figsize = (10, 5))
attempts3_plot.set_title("Percent of total shot attempts that were 3P attempts over Time")
attempts3_plot.set_ylabel("Percent of total shot attempts that were 3P attempts")
attempts3_plot.set_xlabel("Year")
attempts3_plot
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faccb238d68>

As expected, the share of three-point attempts has increased over time as the style of play has changed. An increase is visible from 2010 onwards, and it becomes consistent from 2012 onwards, a period that overlaps with the rise of the Golden State Warriors. Another figure associated with this shift is Daryl Morey of the Houston Rockets, who is well known for driving a statistics-based basketball program.

Morey became general manager of the Houston Rockets in 2007 and began implementing his data-science-based approach. Let's take a look specifically at the Houston Rockets and see if there is a significant change.

In [52]:
from scipy import stats

#Create a dataframe that is just the Houston Rockets based on their Team abbreviation HOU.
df_rockets = df_NBA.loc[df_NBA["Team"] == "HOU"]

#Next, lets group by year and find the mean of percent of total field goal attempts that were 3 point attempts of all players in each year.
avg_3P_attempts_percent = df_rockets.groupby("Year")[["3P% of Total Attempts"]].mean()
avg_3P_attempts_percent = avg_3P_attempts_percent.reset_index()

#In order to visualize this, a scatter plot will be suitable.
hou_plot = avg_3P_attempts_percent.plot.scatter(x = "Year", y = "3P% of Total Attempts", figsize = (10, 5))

#A regression line is then fit to each plot using the linregress function.
slope, intercept, r_value, p_value, std_err = stats.linregress(avg_3P_attempts_percent["Year"], avg_3P_attempts_percent["3P% of Total Attempts"])
line = slope*avg_3P_attempts_percent["Year"] + intercept
hou_plot.plot(avg_3P_attempts_percent["Year"], line, 'r', label='fitted line')

hou_plot.set_title("The impact of Morey on the Houston Rockets and their 3P Shot Preference")
hou_plot.set_xlabel("Year")
hou_plot.set_ylabel("Percent of total shot attempts that were 3P attempts")

print(r_value**2)
0.7161733415770244

From the graph, it is evident that the share of field goal attempts that were three-point attempts has increased steadily over the years, from about 0.125 to 0.325 in the 2017-2018 season. The R-squared value is 0.716, which means that a straight line over time explains about 72% of the variation in the yearly averages, a fairly strong trend. Morey was the first General Manager hired in the NBA without a basketball background, and he began his tenure in 2007. Interestingly, the share of 3P attempts dropped in the years right after 2007 but rose sharply after James Harden joined the team in 2012.
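
For readers unfamiliar with R-squared, here is a minimal sketch of what `r_value**2` measures, written with plain Python so each step is visible. The (year, share) pairs are made-up values, not the Rockets data, and the helper name `r_squared` is ours.

```python
def r_squared(xs, ys):
    """Fit y = slope*x + intercept by least squares and return R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after the fit, versus total variation around the mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented example values, for illustration only.
years = [2010, 2011, 2012, 2013, 2014]
shares = [0.20, 0.19, 0.26, 0.27, 0.31]
print(round(r_squared(years, shares), 3))
```

R-squared describes how much of the variation the line accounts for. Whether the slope is statistically distinguishable from zero is a separate question, answered by the `p_value` that `linregress` also returns.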

Next, let's take a look at the salary data across the years for each team in the NBA.

In [53]:
#Next, we will create a graphic of the teams over time to see how their payroll increases. 
import matplotlib.pyplot as plt

#Pull out the team names once; each team gets its own panel below.
teams = df_team_salaries['NBA Team'].unique()

#There are 30 teams, so we make a 6 x 5 grid of panels, one per team.
fig, axes = plt.subplots(nrows = 6, ncols = 5, sharey = True, sharex = True, figsize = (20,12))
plt.xlim(2000, 2019)

#For each panel, keep only the rows that belong to one franchise.
for ax, team in zip(axes.flat, teams):
    team_df = df_team_salaries[df_team_salaries['NBA Team'] == team]

    #Plot that team's payroll rank by year and title the panel so the reader knows which team is which.
    ax.plot(team_df['Year'], team_df['Rank'])
    ax.set_title(team)

plt.gca().invert_yaxis()

This graph shows which NBA teams had the largest payrolls relative to the rest of the league over the last 20 years. The payroll ranking of each of the 30 teams has fluctuated over that period. In Milestone 1, we defined player success in part as earning a large contract. We also wanted to compare each player's salary to his team's total payroll, which would give a ratio showing how important his contract was to the team. This graph is the first step toward that goal, showing which teams were spending the most money on their rosters.

The most commonly used efficiency score in the game is calculated using the formula:

(PTS + REB + AST + STL + BLK − Missed FG − Missed FT − TO) / Games Played

This is the score used in most chart comparisons and is what you would expect to see on a day-to-day ESPN show. Using our player data, let's calculate the efficiency score for each player.

In [54]:
df_NBA["Missed Field Goals"] = df_NBA["FG Attempts"] - df_NBA["Field Goals"]
df_NBA["Missed FT"] = df_NBA["FT Attempts"] - df_NBA["FT made"]
df_NBA["Standard Eff Score"] = (df_NBA["Points"] + df_NBA["Total Rebounds"] + df_NBA["Assists"] + df_NBA["Steals"] + df_NBA["Blocks"]
                                - df_NBA["Missed Field Goals"] - df_NBA["Missed FT"] - df_NBA["Turnovers"])/df_NBA["Games"]

df_NBA = df_NBA.sort_values(["Standard Eff Score"], ascending = False)
df_NBA.head()
Out[54]:
Player Position Age Team Games Games Started Minutes Field Goals FG Attempts FG Percent ... Steals Blocks Turnovers P Fouls Points Year 3P% of Total Attempts Missed Field Goals Missed FT Standard Eff Score
10391 Giannis Antetokounmpo PF 24 MIL 72 72 2358 721 1247 0.578 ... 92 110 268 232 1994 2018 0.140000 526 186 35.250000
9671 Russell Westbrook PG 28 OKC 81 81 2802 824 1941 0.425 ... 132 31 438 190 2558 2016 0.230983 1117 130 33.827160
10533 Anthony Davis C 25 NOP 56 56 1850 530 1026 0.517 ... 88 135 112 132 1452 2018 0.123826 496 89 33.357143
1704 Kevin Garnett PF 27 MIN 82 82 3231 804 1611 0.499 ... 120 178 212 202 1987 2003 0.025998 807 97 33.134146
10639 James Harden PG 29 HOU 78 78 2867 843 1909 0.442 ... 158 58 387 244 2818 2018 0.350017 1066 104 33.089744

5 rows × 34 columns
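
As a sanity check, the formula can be applied by hand to the top row of the table. Rebounds and assists are hidden behind the "..." in the printed output, so the two values used here are back-calculated from the per-minute columns (RPM and APM times Minutes) shown later in the notebook and should be read as an assumption.

```python
# Season totals for the top row above (Giannis Antetokounmpo, Year 2018).
# rebounds and assists are back-calculated assumptions; the rest are printed
# directly in the table.
points, rebounds, assists = 1994, 898, 424
steals, blocks = 92, 110
missed_fg, missed_ft, turnovers = 526, 186, 268
games = 72

positives = points + rebounds + assists + steals + blocks
negatives = missed_fg + missed_ft + turnovers
standard_eff_score = (positives - negatives) / games
print(standard_eff_score)
```

The result agrees with the 35.25 shown in the Standard Eff Score column for that row.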

However, this type of efficiency score limits our ability to see how effective bench players are. It is inherently biased against them, because they do not get nearly as much playing time as the stars.

The next step in our analysis is to gain a better understanding of the value of each player. To do so, we want to create some weighted statistics that show how effective each player was during his time on the court. Let's begin by calculating each player's stats per minute. We base the weighted scores on minutes played rather than on raw totals or even total games because of the inherent bias in how benches are used. Someone like LeBron James, who plays heavy minutes, will always have higher raw stats than someone like Kawhi Leonard, who is under constant load management. This system of evaluation is meant to measure how effective players are in the time they are given on the floor.

In [55]:
df_NBA["Minutes"] = df_NBA["Minutes"].fillna(1)
df_NBA["PPM"] = df_NBA["Points"]/df_NBA["Minutes"] # Points per minute
df_NBA["RPM"] = df_NBA["Total Rebounds"]/df_NBA["Minutes"] # Rebounds per minute
df_NBA["APM"] = df_NBA["Assists"]/df_NBA["Minutes"]# Assists per minute
df_NBA["SPM"] = df_NBA["Steals"]/df_NBA["Minutes"]# Steals per minute
df_NBA["BPM"] = df_NBA["Blocks"]/df_NBA["Minutes"]# Blocks per minute
df_NBA["Missed FGPM"] = df_NBA["Missed Field Goals"]/df_NBA["Minutes"]
df_NBA["Missed FTPM"] = df_NBA["Missed FT"]/df_NBA["Minutes"]
df_NBA["TOPM"] = df_NBA["Turnovers"]/df_NBA["Minutes"]

df_NBA["Min_Eff_Score"] = df_NBA["PPM"] + df_NBA["RPM"] + df_NBA["APM"] + df_NBA["SPM"] + df_NBA["BPM"] - df_NBA["Missed FGPM"] - df_NBA["Missed FTPM"] - df_NBA["TOPM"]
df_NBA.head()
Out[55]:
Player Position Age Team Games Games Started Minutes Field Goals FG Attempts FG Percent ... Standard Eff Score PPM RPM APM SPM BPM Missed FGPM Missed FTPM TOPM Min_Eff_Score
10391 Giannis Antetokounmpo PF 24 MIL 72 72 2358 721 1247 0.578 ... 35.250000 0.845632 0.380831 0.179813 0.039016 0.046650 0.223070 0.078880 0.113656 1.076336
9671 Russell Westbrook PG 28 OKC 81 81 2802 824 1941 0.425 ... 33.827160 0.912919 0.308351 0.299786 0.047109 0.011064 0.398644 0.046395 0.156317 0.977873
10533 Anthony Davis C 25 NOP 56 56 1850 530 1026 0.517 ... 33.357143 0.784865 0.363243 0.117838 0.047568 0.072973 0.268108 0.048108 0.060541 1.009730
1704 Kevin Garnett PF 27 MIN 82 82 3231 804 1611 0.499 ... 33.134146 0.614980 0.352522 0.126586 0.037140 0.055091 0.249768 0.030022 0.065614 0.840916
10639 James Harden PG 29 HOU 78 78 2867 843 1909 0.442 ... 33.089744 0.982909 0.180677 0.204395 0.055110 0.020230 0.371817 0.036275 0.134984 0.900244

5 rows × 43 columns

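Next, let's create some weighted measurements. Each per-minute stat is standardized by subtracting its mean over all player-seasons and dividing by its standard deviation, which shows how far each player is from the average. The larger the positive value, the better the player is compared to the league average (except for the missed-shot and turnover columns, where a larger value is worse, which is why they are subtracted in the total).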

In [56]:
df_NBA = df_NBA.fillna(0)
df_NBA = df_NBA.replace([np.inf, -np.inf], 0)

df_NBA["wt_PPM"] = (df_NBA["PPM"]-df_NBA["PPM"].mean())/df_NBA["PPM"].std()
df_NBA["wt_RPM"] = (df_NBA["RPM"]-df_NBA["RPM"].mean())/df_NBA["RPM"].std()
df_NBA["wt_APM"] = (df_NBA["APM"]-df_NBA["APM"].mean())/df_NBA["APM"].std()
df_NBA["wt_SPM"] = (df_NBA["SPM"]-df_NBA["SPM"].mean())/df_NBA["SPM"].std()
df_NBA["wt_BPM"] = (df_NBA["BPM"]-df_NBA["BPM"].mean())/df_NBA["BPM"].std()
df_NBA["wt_Missed FGPM"] = (df_NBA["Missed FGPM"]-df_NBA["Missed FGPM"].mean())/df_NBA["Missed FGPM"].std()
df_NBA["wt_Missed FTPM"] = (df_NBA["Missed FTPM"]-df_NBA["Missed FTPM"].mean())/df_NBA["Missed FTPM"].std()
df_NBA["wt_TOPM"] = (df_NBA["TOPM"]-df_NBA["TOPM"].mean())/df_NBA["TOPM"].std()
df_NBA["wt_score"] = df_NBA["wt_PPM"] + df_NBA["wt_RPM"] + df_NBA["wt_APM"] + df_NBA["wt_SPM"] + df_NBA["wt_BPM"] - df_NBA["wt_Missed FGPM"] - df_NBA["wt_Missed FTPM"] - df_NBA["wt_TOPM"]
df_NBA = df_NBA.sort_values(["wt_score"], ascending = False)
df_NBA[:15]
Out[56]:
Player Position Age Team Games Games Started Minutes Field Goals FG Attempts FG Percent ... Min_Eff_Score wt_PPM wt_RPM wt_APM wt_SPM wt_BPM wt_Missed FGPM wt_Missed FTPM wt_TOPM wt_score
10189 Georgios Papagiannis C 20 POR 1 0 4 1 1 1.000 ... 1.250000 0.957429 0.822547 -1.363652 23.764540 -0.880028 -2.598159 -0.927302 -1.755144 28.581442
7620 DeAndre Liggins SG 25 MIA 1 0 1 1 1 1.000 ... 3.000000 11.541783 9.140374 -1.363652 -1.547528 -0.880028 -2.598159 -0.927302 -1.755144 22.171555
7938 Sim Bhullar C 22 SAC 3 0 3 1 2 0.500 ... 1.333333 2.133469 1.746750 4.324213 -1.547528 13.538509 2.202899 -0.927302 -1.755144 20.674959
4606 Steven Hill PF 23 OKC 1 0 2 1 1 1.000 ... 2.500000 4.485547 14.685591 -1.363652 -1.547528 -0.880028 -2.598159 -0.927302 -1.755144 20.660536
10129 Naz Mitrou-Long SG 24 UTA 1 0 1 1 1 1.000 ... 3.000000 18.598020 -1.950062 -1.363652 -1.547528 -0.880028 -2.598159 -0.927302 -1.755144 18.137356
2193 Jackie Butler C 19 NYK 3 0 5 4 4 1.000 ... 2.000000 11.541783 -1.950062 -1.363652 8.577299 -0.880028 -2.598159 -0.927302 4.402693 15.048109
10908 Gary Payton PG 26 WAS 3 0 16 5 8 0.625 ... 1.062500 2.280474 -0.563757 2.902246 7.944497 1.823448 0.102436 -0.927302 0.169180 15.042594
7857 Maalik Wayns PG 22 LAC 2 0 9 1 2 0.500 ... 0.777778 -1.002636 0.514480 2.428258 9.702280 -0.880028 -0.997806 -0.927302 -1.755144 14.442606
3978 Gerald Green SF 22 HOU 1 0 4 3 3 1.000 ... 2.000000 8.013665 3.595156 -1.363652 -1.547528 -0.880028 -2.598159 -0.927302 -1.755144 13.098219
5132 Trey Gilder SF 25 MEM 2 0 5 1 1 1.000 ... 0.800000 0.251806 0.268025 -1.363652 8.577299 -0.880028 -2.598159 -0.927302 -1.755144 12.134056
2702 Martynas Andriuškevicius C 19 CLE 6 0 9 0 1 0.000 ... 0.555556 -2.570689 2.979021 -1.363652 9.702280 -0.880028 -0.997806 -0.927302 -1.755144 11.547185
4342 Marcus Williams SF 21 SAS 1 0 2 0 1 0.000 ... 0.000000 -2.570689 -1.950062 -1.363652 -1.547528 20.747777 4.603427 -0.927302 -1.755144 11.394865
10923 Zhou Qi PF 23 HOU 1 0 1 1 1 1.000 ... 2.000000 11.541783 -1.950062 -1.363652 -1.547528 -0.880028 -2.598159 -0.927302 -1.755144 11.081120
9083 Briante Weber PG 23 MIA 1 0 3 1 1 1.000 ... 1.333333 2.133469 1.746750 4.324213 -1.547528 -0.880028 -2.598159 -0.927302 -1.755144 11.057481
9404 Terrence Jones PF 25 MIL 3 0 6 0 3 0.000 ... 0.333333 -2.570689 3.595156 -1.363652 6.889828 6.329241 4.603427 -0.927302 -1.755144 10.958903

15 rows × 52 columns

Here, we suddenly see a completely different side of the NBA. The top 15 players are all young players who played very few minutes but were very effective in their time on the court. Of course, as with all other metrics, it is important to understand that this production is not sustainable: with only a handful of minutes, a single steal or made shot produces an extreme per-minute rate. This alternate "efficiency score", which we are calling a weighted minutes score, seems to favor bench players with very low minutes played over superstars. Let's see if there is a correlation between the standard efficiency score and the minute-based scores.

In [57]:
eff_wt_plot = df_NBA.plot.scatter(x = "Standard Eff Score", y = "Min_Eff_Score", figsize = (10, 5))

#A regression line is then fit to each plot using the linregress function.
slope, intercept, r_value, p_value, std_err = stats.linregress(df_NBA["Standard Eff Score"], df_NBA["Min_Eff_Score"])
line = slope*df_NBA["Standard Eff Score"] + intercept
eff_wt_plot.plot(df_NBA["Standard Eff Score"], line, 'r', label='fitted line')

eff_wt_plot.set_title("Relationship between Standard Efficiency Score and Minutes Based Efficiency Score")
eff_wt_plot.set_xlabel("Standard Efficiency Score")
eff_wt_plot.set_ylabel("Minute Based Efficiency Score")

print(r_value**2)
0.4088115858646164

The R-squared value is around 0.409, which means that the two metrics are moderately related in how they rate NBA players. Next is the same graph, but with the weighted (standardized) efficiency values.

In [58]:
df_NBA["wt_Standard Eff Score"] = (df_NBA["Standard Eff Score"] - df_NBA["Standard Eff Score"].mean())/df_NBA["Standard Eff Score"].std()
eff_wt_plot = df_NBA.plot.scatter(x = "wt_Standard Eff Score", y = "wt_score", figsize = (10, 5))

#A regression line is then fit to each plot using the linregress function.
slope, intercept, r_value, p_value, std_err = stats.linregress(df_NBA["wt_Standard Eff Score"], df_NBA["wt_score"])
line = slope*df_NBA["wt_Standard Eff Score"] + intercept
eff_wt_plot.plot(df_NBA["wt_Standard Eff Score"], line, 'r', label='fitted line')

eff_wt_plot.set_title("Relationship between Weighted Standard Efficiency Score and Weighted Minutes Based Efficiency Score")
eff_wt_plot.set_xlabel("Wt Standard Efficiency Score")
eff_wt_plot.set_ylabel("Wt Minute Based Efficiency Score")

print(r_value**2)
0.16007960057807463

The weighted graph, however, has a much lower R-squared value of about 0.160.
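
One plausible reason for the lower R-squared is that the weighted score is dominated by players with only a few minutes, whose per-minute rates are extreme. A common remedy is to require a minimum number of minutes before standardizing. The sketch below illustrates the idea on invented data; the `MIN_MINUTES` cutoff and the helper `zscore` are our assumptions, not part of the analysis above.

```python
import pandas as pd

# Toy data: three rotation players and one player with a single minute played.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Minutes": [2400, 1800, 900, 1],
    "Points": [1900, 1100, 400, 3],
})
toy["PPM"] = toy["Points"] / toy["Minutes"]

def zscore(series):
    """Standardize a column: distance from the mean in standard deviations."""
    return (series - series.mean()) / series.std()

# Without a filter, the one-minute player dominates the standardized column.
toy["wt_PPM_all"] = zscore(toy["PPM"])

# MIN_MINUTES is a hypothetical cutoff; the right value is a judgment call.
MIN_MINUTES = 500
qualified = toy[toy["Minutes"] >= MIN_MINUTES].copy()
qualified["wt_PPM"] = zscore(qualified["PPM"])

print(toy[["Player", "PPM", "wt_PPM_all"]])
print(qualified[["Player", "PPM", "wt_PPM"]])
```

Filtering before computing the mean and standard deviation matters: otherwise the extreme rows still distort the scale that everyone else is measured on.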

The next question we want to investigate is whether it is better to assemble a team of similarly strong players with great depth, or to have just a few key players who can carry the rest of the team. We will use the standard efficiency score to measure the average and spread of each team.

In [59]:
avg_effscores = df_NBA.groupby(["Year", "Team"])[["Standard Eff Score"]].mean()
avg_effscores.head()
Out[59]:
Standard Eff Score
Year Team
2000 ATL 7.586210
BOS 7.912348
CHH 8.026443
CHI 8.129920
CLE 7.309115

Now that we have the average standard efficiency score calculated for each team, we can use it to find relative scores for each player. A relative score is how many times larger a player's efficiency score is than the average efficiency score on his team. If a team has one or two superstars but has invested in a weak bench, those superstars will have much higher relative scores.

In [60]:
def team_effscore_comparison(score, year, team):
    #Look up the team's average for that year as a single number, then divide.
    avg_score = avg_effscores.loc[(year, team), "Standard Eff Score"]
    return score/avg_score

df_NBA["rel_score"] = df_NBA.apply(lambda x: team_effscore_comparison(x["Standard Eff Score"], x["Year"], x["Team"]), axis = 1)
df_NBA.head()
Out[60]:
Player Position Age Team Games Games Started Minutes Field Goals FG Attempts FG Percent ... wt_RPM wt_APM wt_SPM wt_BPM wt_Missed FGPM wt_Missed FTPM wt_TOPM wt_score wt_Standard Eff Score rel_score
10189 Georgios Papagiannis C 20 POR 1 0 4 1 1 1.0 ... 0.822547 -1.363652 23.764540 -0.880028 -2.598159 -0.927302 -1.755144 28.581442 -0.622092 0.538691
7620 DeAndre Liggins SG 25 MIA 1 0 1 1 1 1.0 ... 9.140374 -1.363652 -1.547528 -0.880028 -2.598159 -0.927302 -1.755144 22.171555 -0.945798 0.345432
7938 Sim Bhullar C 22 SAC 3 0 3 1 2 0.5 ... 1.746750 4.324213 -1.547528 13.538509 2.202899 -0.927302 -1.755144 20.674959 -1.215553 0.161214
4606 Steven Hill PF 23 OKC 1 0 2 1 1 1.0 ... 14.685591 -1.363652 -1.547528 -0.880028 -2.598159 -0.927302 -1.755144 20.660536 -0.622092 0.532472
10129 Naz Mitrou-Long SG 24 UTA 1 0 1 1 1 1.0 ... -1.950062 -1.363652 -1.547528 -0.880028 -2.598159 -0.927302 -1.755144 18.137356 -0.945798 0.359470

5 rows × 54 columns
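
The same relative score can also be computed without a row-by-row `apply`. The sketch below uses `groupby(...).transform("mean")` on a small invented table; it is an alternative formulation under the assumption that the column names match those above, not the code that produced the output shown.

```python
import pandas as pd

# Invented scores for two seasons, for illustration only.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001],
    "Team": ["ATL", "ATL", "BOS", "ATL", "ATL"],
    "Standard Eff Score": [20.0, 10.0, 8.0, 12.0, 4.0],
})

# transform("mean") returns the team-season average aligned to every row,
# so the division happens column-wise.
team_avg = toy.groupby(["Year", "Team"])["Standard Eff Score"].transform("mean")
toy["rel_score"] = toy["Standard Eff Score"] / team_avg
print(toy)
```

On a table with roughly ten thousand player-seasons, the column-wise version avoids one index lookup per row and is usually noticeably faster.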

The next step is to group the relative scores by team. We want to look specifically at superstars and exceptionally good role players. Thus, we will look at the maximum relative score on each team: a high maximum rel_score indicates the presence of a star player and a general lack of depth, while a lower maximum rel_score indicates a deeper bench or a more evenly spread team.

In addition, we will need the ranking data we scraped earlier in the notebook. We will merge the two data sets with an inner join.

In [61]:
top_eff = df_NBA.groupby(["Year", "Team"])[["rel_score"]].max()
top_eff = top_eff.reset_index()

top_eff = top_eff.merge(df_rank, how = "inner", on = ["Team", "Year"])
top_eff.head()
Out[61]:
Year Team rel_score Rk Overall
0 2000 ATL 2.757420 25 25-57
1 2000 BOS 2.861600 20 36-46
2 2000 CHH 2.462256 14 46-36
3 2000 CHI 2.777528 29 15-67
4 2000 CLE 2.624521 23 30-52

Now, let's see if having a superstar results in greater success in the regular season.

In [62]:
rel_rk_plot = top_eff.plot.scatter(x = "Rk", y = "rel_score", figsize = (10, 5))

#A regression line is then fit to each plot using the linregress function.
slope, intercept, r_value, p_value, std_err = stats.linregress(top_eff["Rk"], top_eff["rel_score"])
line = slope*top_eff["Rk"] + intercept
rel_rk_plot.plot(top_eff["Rk"], line, 'r', label='fitted line')

rel_rk_plot.set_title("Does having a superstar on a team translate to success in the regular season?")
rel_rk_plot.set_xlabel("Rank of Team at end of Regular Season")
rel_rk_plot.set_ylabel("Highest Relative Score on a Team")

print(r_value**2)
0.17082667821596656

Rank 1 is the best rank in the season, so the negative slope in the graph means that teams with a higher maximum relative score tend to finish closer to the top. In other words, teams built around a superstar rather than an evenly spread, deep roster tend to perform better in the regular season. However, it is important to note that the R-squared value is quite low, at 0.171. Thus, even though the linear relationship suggests that having superstars over a deep bench leads to regular season success, the relationship itself is not strong.

Associating Salary with NBA Players

The scores so far are all based on minutes. Now, through a join, we will be able to access how much each player was paid in a given season.

In [63]:
df_NBA = pd.merge(df_NBA, player_salaries, how = 'inner', on = ['Player', 'Year'])
df_NBA
Out[63]:
Player Position Age Team_x Games Games Started Minutes Field Goals FG Attempts FG Percent ... wt_BPM wt_Missed FGPM wt_Missed FTPM wt_TOPM wt_score wt_Standard Eff Score rel_score Rank Team_y Salary
0 Georgios Papagiannis C 20 POR 1 0 4 1 1 1.000 ... -0.880028 -2.598159 -0.927302 -1.755144 28.581442 -0.622092 0.538691 579 Sacramento Kings $185,397
1 Georgios Papagiannis C 20 SAC 16 0 118 17 41 0.415 ... 1.319410 0.331300 -0.927302 0.332258 1.727396 -0.874988 0.389814 579 Sacramento Kings $185,397
2 Sim Bhullar C 22 SAC 3 0 3 1 2 0.500 ... 13.538509 2.202899 -0.927302 -1.755144 20.674959 -1.215553 0.161214 469 Sacramento Kings $507,336
3 Steven Hill PF 23 OKC 1 0 2 1 1 1.000 ... -0.880028 -2.598159 -0.927302 -1.755144 20.660536 -0.622092 0.532472 458 Oklahoma City Thunder $442,114
4 Naz Mitrou-Long SG 24 UTA 1 0 1 1 1 1.000 ... -0.880028 -2.598159 -0.927302 -1.755144 18.137356 -0.945798 0.359470 522 Utah Jazz $815,615
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7959 Chris McCray SG 22 MIL 5 0 12 0 3 0.000 ... -0.880028 1.002634 -0.927302 3.376387 -11.763677 -1.593210 -0.115203 439 Milwaukee Bucks $412,718
7960 Darko Milicic C 27 BOS 1 0 5 0 1 0.000 ... -0.880028 0.282476 -0.927302 10.560530 -16.009575 -1.755063 -0.255935 500 Boston Celtics $854,389
7961 Mindaugas Kuzminskas SF 28 NYK 1 0 2 0 2 0.000 ... -0.880028 11.805014 -0.927302 -1.755144 -17.434525 -1.755063 -0.227269 257 New York Knicks $3,025,035
7962 Von Wafer SF 21 LAC 1 0 1 0 1 0.000 ... -0.880028 11.805014 -0.927302 -1.755144 -17.434525 -1.593210 -0.136596 470 LA Clippers $23,443
7963 Mile Ilic C 22 NJN 5 0 6 0 3 0.000 ... -0.880028 4.603427 -0.927302 13.639449 -23.779126 -1.593210 -0.111739 355 New Jersey Nets $800,000

7964 rows × 57 columns
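
Note that the join key is only (Player, Year). In the output above, both Georgios Papagiannis rows (POR and SAC) receive the same $185,397, and two different players who share a name in the same year would be treated the same way. The sketch below shows one way to surface such rows before merging; the tables and names in it are invented stand-ins, not our data.

```python
import pandas as pd

# Toy stand-ins for df_NBA and player_salaries; all rows are invented.
stats_df = pd.DataFrame({
    "Player": ["Pat Example", "Pat Example", "Sam Sample"],
    "Year": [2010, 2010, 2010],
    "Team": ["AAA", "BBB", "CCC"],  # Pat Example appears for two teams in 2010
})
salaries = pd.DataFrame({
    "Player": ["Pat Example", "Sam Sample"],
    "Year": [2010, 2010],
    "Salary": ["$1,000,000", "$500,000"],
})

# Stats rows that share a (Player, Year) key will all receive the same salary,
# whether they are one traded player or two players with the same name.
shared_key = stats_df[stats_df.duplicated(["Player", "Year"], keep=False)]
print(shared_key)

# validate="many_to_one" makes pandas raise a MergeError if the salary table
# itself has more than one row per (Player, Year).
merged = stats_df.merge(salaries, how="inner", on=["Player", "Year"],
                        validate="many_to_one")
print(merged)
```

Rows flagged this way can then be reviewed by hand, or the join key can be extended with a column that distinguishes the players.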

Now that we have each player's salary, we can generate a salary score. The salary score is calculated as follows:

Player Salary / Median Salary of All Players in the Same Year

Dividing by that year's typical salary lets us control for inflation. (The code below uses the median rather than the mean as the "average" salary.)

In [64]:
#To perform this calculation, we need to clean salary up so that it is a float dtype. 
df_NBA['Salary'] = df_NBA['Salary'].str[1:]
df_NBA['Salary'] = df_NBA['Salary'].str.replace("," , '')
df_NBA['Salary'] = df_NBA['Salary'].astype(float)

#Next, we must find the typical salary of a player for a specific year. Note that this is the median, not the mean. We'll call this Series avg_sal
avg_sal = df_NBA.groupby("Year")["Salary"].median()

def player_wtscore_salary(salary, year, player):
    #The player argument is unused; it is kept so the call below does not change.
    avg_score = avg_sal.loc[year]
    return salary/avg_score

#Finally, we return the salary score of each NBA Player using our function
df_NBA["sal_score"] = df_NBA.apply(lambda x: player_wtscore_salary(x["Salary"], x["Year"], x["Player"]), axis = 1)
df_NBA
Out[64]:
Player Position Age Team_x Games Games Started Minutes Field Goals FG Attempts FG Percent ... wt_Missed FGPM wt_Missed FTPM wt_TOPM wt_score wt_Standard Eff Score rel_score Rank Team_y Salary sal_score
0 Georgios Papagiannis C 20 POR 1 0 4 1 1 1.000 ... -2.598159 -0.927302 -1.755144 28.581442 -0.622092 0.538691 579 Sacramento Kings 185397.0 0.050246
1 Georgios Papagiannis C 20 SAC 16 0 118 17 41 0.415 ... 0.331300 -0.927302 0.332258 1.727396 -0.874988 0.389814 579 Sacramento Kings 185397.0 0.050246
2 Sim Bhullar C 22 SAC 3 0 3 1 2 0.500 ... 2.202899 -0.927302 -1.755144 20.674959 -1.215553 0.161214 469 Sacramento Kings 507336.0 0.230150
3 Steven Hill PF 23 OKC 1 0 2 1 1 1.000 ... -2.598159 -0.927302 -1.755144 20.660536 -0.622092 0.532472 458 Oklahoma City Thunder 442114.0 0.171882
4 Naz Mitrou-Long SG 24 UTA 1 0 1 1 1 1.000 ... -2.598159 -0.927302 -1.755144 18.137356 -0.945798 0.359470 522 Utah Jazz 815615.0 0.221045
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7959 Chris McCray SG 22 MIL 5 0 12 0 3 0.000 ... 1.002634 -0.927302 3.376387 -11.763677 -1.593210 -0.115203 439 Milwaukee Bucks 412718.0 0.165087
7960 Darko Milicic C 27 BOS 1 0 5 0 1 0.000 ... 0.282476 -0.927302 10.560530 -16.009575 -1.755063 -0.255935 500 Boston Celtics 854389.0 0.298787
7961 Mindaugas Kuzminskas SF 28 NYK 1 0 2 0 2 0.000 ... 11.805014 -0.927302 -1.755144 -17.434525 -1.755063 -0.227269 257 New York Knicks 3025035.0 0.819833
7962 Von Wafer SF 21 LAC 1 0 1 0 1 0.000 ... 11.805014 -0.927302 -1.755144 -17.434525 -1.593210 -0.136596 470 LA Clippers 23443.0 0.009377
7963 Mile Ilic C 22 NJN 5 0 6 0 3 0.000 ... 4.603427 -0.927302 13.639449 -23.779126 -1.593210 -0.111739 355 New Jersey Nets 800000.0 0.320000

7964 rows × 58 columns
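
The choice between mean and median as the yearly baseline changes the salary score considerably, because a few very large contracts pull the mean upward. The salaries below are invented to make that effect visible.

```python
from statistics import mean, median

# Hypothetical salaries for one season: most players are modestly paid and
# one star earns a very large contract.
salaries = [500_000, 800_000, 1_200_000, 2_500_000, 30_000_000]

mean_salary = mean(salaries)      # pulled upward by the star contract
median_salary = median(salaries)  # the middle value of the sorted list

player_salary = 2_500_000
score_vs_mean = player_salary / mean_salary
score_vs_median = player_salary / median_salary
print(score_vs_mean, score_vs_median)
```

In this toy season the same player looks underpaid against the mean but well paid against the median, so it is worth stating which baseline a salary score uses.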

Now we are able to calculate our efficiency score. To do so, we will use the following formula:

wt_score / sal_score

This formula gives us the total 'efficiency' score of the player: how impactful that player is on the court, given the minutes he has played and the amount he is paid.

Now, we will be able to see who is the most efficient player in the NBA according to 'our' statistics. Such a player performs better per minute on conventional NBA stats than the average player, and is also paid less than the typical NBA player in his specific season.

In [65]:
df_NBA['efficiency_score'] = df_NBA['wt_score'] / df_NBA['sal_score']
 
df_NBA = df_NBA.sort_values(['efficiency_score'], ascending = False)
df_NBA
Out[65]:
Player Position Age Team_x Games Games Started Minutes Field Goals FG Attempts FG Percent ... wt_Missed FTPM wt_TOPM wt_score wt_Standard Eff Score rel_score Rank Team_y Salary sal_score efficiency_score
367 Andre Ingram SG 32 LAL 2 0 64 8 17 0.471 ... -0.927302 -0.311901 3.452718 1.077364 1.825390 596 Los Angeles Lakers 13824.0 0.003747 921.579100
329 Eric Moreland PF 27 PHO 1 0 5 0 0 0.000 ... -0.927302 -1.755144 3.622909 -0.945798 0.331796 498 Toronto Raptors 17092.0 0.005228 692.974943
0 Georgios Papagiannis C 20 POR 1 0 4 1 1 1.000 ... -0.927302 -1.755144 28.581442 -0.622092 0.538691 579 Sacramento Kings 185397.0 0.050246 568.835405
22 Greg Monroe C 28 BOS 2 0 5 3 5 0.600 ... -0.927302 -1.755144 9.741841 -0.783945 0.370057 486 Toronto Raptors 59820.0 0.018298 532.411828
9 Marcus Williams SF 21 SAS 1 0 2 0 1 0.000 ... -0.927302 -1.755144 11.394865 -1.431357 0.000000 486 San Antonio Spurs 50254.0 0.021990 518.176822
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5734 Coby Karl SG 26 CLE 3 0 5 0 0 0.000 ... -0.927302 22.876205 -23.226528 -1.539259 -0.073930 478 Golden State Warriors 17328.0 0.007534 -3082.930181
7917 Dahntay Jones SF 36 CLE 1 0 12 3 8 0.375 ... 2.202151 0.810621 -6.165344 -0.622092 0.538876 594 Cleveland Cavaliers 5767.0 0.001794 -3437.293703
7869 Aaron Jackson PG 31 HOU 1 0 35 3 9 0.333 ... 0.145653 -0.875453 -4.402120 -0.783945 0.513874 598 Houston Rockets 4608.0 0.001249 -3524.963512
3729 Linton Johnson PF 27 TOR 2 0 10 2 5 0.400 ... -0.927302 7.481612 -9.539916 -1.269504 0.117225 490 Toronto Raptors 4533.0 0.001984 -4809.481384
7902 Billy Thomas SF 32 NJN 4 0 8 0 3 0.000 ... -0.927302 2.093504 -10.517132 -1.512284 -0.061446 491 New Jersey Nets 4533.0 0.001984 -5302.137799

7964 rows × 59 columns
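
One property of this ratio helps explain the bottom of the table: when wt_score is negative, a smaller sal_score makes the result more negative, so the lowest-paid below-average players receive the worst scores. The numbers below are hypothetical and only illustrate the sign behavior.

```python
def efficiency_score(wt_score, sal_score):
    """The ratio used in the cell above."""
    return wt_score / sal_score

# Two hypothetical players with the same below-average performance.
cheap = efficiency_score(wt_score=-2.0, sal_score=0.01)     # paid 1% of the median
expensive = efficiency_score(wt_score=-2.0, sal_score=2.0)  # paid twice the median

print(cheap, expensive)
```

Here the cheaper player scores far lower than the expensive one despite identical production, so negative efficiency scores should be interpreted with care.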

As you can see from our data, most of the most efficient players have played only a few minutes. To make this a little more interesting, let's limit the scope to players who have appeared in at least 41 games, or half of an NBA regular season.

In [66]:
players_41_eff = df_NBA[df_NBA['Games'] >= 41]
players_41_eff
Out[66]:
Player Position Age Team_x Games Games Started Minutes Field Goals FG Attempts FG Percent ... wt_Missed FTPM wt_TOPM wt_score wt_Standard Eff Score rel_score Rank Team_y Salary sal_score efficiency_score
1455 Tim Frazier PG 28 NOP 47 17 909 88 195 0.451 ... -0.555486 0.277145 1.814477 0.052869 0.817592 478 Milwaukee Bucks 196553.0 0.060121 30.180395
399 Samuel Dalembert C 22 PHI 82 53 2197 270 499 0.541 ... 0.132467 -0.549923 3.354572 0.848401 1.482487 174 Philadelphia 76ers 887000.0 0.177400 18.909650
340 Andrei Kirilenko PF 22 UTA 78 78 2895 412 931 0.443 ... 0.421768 0.531445 3.563709 2.063007 2.565895 173 Utah Jazz 956000.0 0.191200 18.638648
74 Greg Stiemsma C 26 BOS 55 3 766 66 121 0.545 ... -0.338998 -0.428717 5.934447 -0.289558 0.801254 502 Boston Celtics 762195.0 0.326521 18.174791
112 Chris Andersen C 30 DEN 71 1 1460 160 292 0.548 ... 0.384496 -0.300039 5.410281 0.595224 1.347470 405 Denver Nuggets 797581.0 0.310077 17.448166
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7654 Eddie House PG 22 MIA 50 0 550 104 247 0.421 ... -0.176233 0.204168 -2.609587 -0.884294 0.390193 349 Miami Heat 316000.0 0.143636 -18.168010
6217 Cartier Martin SF 29 ATL 53 6 822 106 256 0.414 ... -0.242020 -0.519082 -0.855779 -0.561016 0.599476 418 Chicago Bulls 104034.0 0.045808 -18.681965
4142 JaKarr Sampson SF 22 PHI 47 18 691 86 202 0.426 ... 0.974826 0.517285 -2.554306 -0.659973 0.555519 497 Philadelphia 76ers 258489.0 0.106228 -24.045425
7536 Rasual Butler SG 31 LAC 41 2 744 73 226 0.323 ... -0.473027 -0.844711 -2.235689 -0.847107 0.385816 518 LA Clippers 211084.0 0.092072 -24.281998
13 Marcus Williams PG 22 NJN 53 7 854 111 293 0.379 ... -0.487566 0.948824 -0.915535 -0.469401 0.730390 486 San Antonio Spurs 50254.0 0.021990 -41.633576

5009 rows × 59 columns

Now we can see players who participated in a reasonable number of games in the season, along with their corresponding efficiency scores.

Machine Learning Component

No final data science project is complete without some machine learning element. We chose a machine learning regression to make predictions from our dataset.

We wanted to assess, given a player's weighted score, how much he should be paid. General Managers could use this information to estimate how much a player is worth, assuming that his stats remain somewhat consistent. Below is our code for the problem.

Our machine learning example predicts pay from the weighted score. We will average the salaries of the 5, 30, and 100 nearest neighbors and compare the three predictions with each other.

In [67]:
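#Named knn_plot so that it does not overwrite the matplotlib.pyplot alias plt imported earlier.
knn_plot = df_NBA.plot.scatter(x = 'wt_score', y = 'Salary', figsize = (10, 10), title = "K Nearest Neighbors of Weighted Score Predicting Salary Pay")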

def get_NN_prediction(x_new, k):
    #Given new observation, returns the k-nearest neighbors prediction
    dists = ((X_train - x_new) ** 2).sum(axis=1)
    inds_sorted = dists.sort_values().index[:k]
    return y_train.loc[inds_sorted].mean()

X_train = df_NBA[["wt_score"]]
y_train = df_NBA["Salary"]

X_new = pd.DataFrame()
X_new["wt_score"] = np.arange(0, 15, 1)
X_new

colors = ['red', 'green', 'blue']

for i,k in enumerate([5, 30, 100]):
    y_new_pred = X_new.apply(get_NN_prediction, axis=1, args=(k,))
    y_new_pred.index = X_new["wt_score"]
    y_new_pred.plot.line(color = colors[i], label=str(k), legend=True)

The key takeaways from this graph are as follows:

1. The players that have the highest salaries do not have overwhelming weighted scores. In fact, some of them have a weighted score of about zero.
2. For the most part, wt_score and Salary do not appear to be positively correlated. Over the wt_score interval [2, 4], however, there does appear to be an increase in pay.
3. Players that have extremely high wt_scores receive less than the average salary. This may be due to high stats over a small number of minutes played, which gives those players a generous wt_score.
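
To see how the choice of k shapes the prediction lines, here is a minimal one-dimensional version of the same averaging idea in plain Python. The (wt_score, salary) pairs are invented, and `knn_predict` is a hypothetical helper, not the function used above.

```python
def knn_predict(x_new, xs, ys, k):
    """Average the y values of the k training points closest to x_new."""
    by_distance = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical (wt_score, salary in $ millions) pairs, for illustration only.
wt_scores = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 12.0]
salaries = [1.0, 2.0, 5.0, 8.0, 10.0, 15.0, 0.5]

for k in (1, 3, 7):
    print(k, knn_predict(2.8, wt_scores, salaries, k))
```

With a small k the prediction follows the closest few players, while with k equal to the size of the data set it collapses to the overall mean salary. This is why larger k values produce smoother, flatter lines.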

Final Conclusions

In this notebook, we took a brief look at several aspects of basketball: player statistics, salary, and how much individual players can impact the entire team. The game has evolved over the last decade, in large part by becoming more statistically focused. The 3 point shot has been dubbed the "most efficient shot" by Morey, and with that the Houston Rockets, and the rest of the league, have increased their focus on the 3 point shot.

The main focus of this project, however, was to investigate what team composition allows for the greatest success. Specifically, we looked into whether a team is more successful when it emphasizes depth or when it has one excellent player who serves as a superstar. We found that having a key player is correlated with a better end-of-season ranking, although the relationship is weak (R-squared of about 0.17).

We were also able to show a few different ways of quantifying players. ESPN, the NBA, and various other organizations often use an efficiency score that combines some basic in-game stats and divides by the number of games played. We generated two other methods of quantifying players: a minute-based efficiency score (Min_Eff_Score) and a minute- and salary-based efficiency score, in which a player's score is compared to how much he is paid. This showed just how important some bench players and rookies are when they significantly outperform their salary. In the current league environment, where most of the money goes to key players, it can be important to recognize the bench players who outperform their expected salary worth.
