Skip to content

benjamincrom/baseball

Repository files navigation

Table of Contents

Baseball

This package fetches and parses event data for Major League Baseball games. Game objects generated via the _from_url methods pull data from MLB endpoints where events are published within about 30 seconds of occurring. This XML/JSON source data zip file contains event data from MLB games 1974 - 2020.

Installing from pypi

pip3 install baseball

Installing from source

git clone [email protected]:benjamincrom/baseball.git
cd baseball/
python3 setup.py install

Fetch individual MLB game

  • get_game_from_url(date_str, away_code, home_code, game_number)

Fetch an object which contains metadata and events for a single MLB game.

import baseball
game_id, game = baseball.get_game_from_url('2017-11-1', 'HOU', 'LAD', 1)
game_dict = game._asdict()
game_json_str = game.json()

Write scorecard as SVG image:

with open(game_id + '.svg', 'w') as fh:
    fh.write(game.get_svg_str())

2017-11-01-HOU-LAD-1.svg svg

Game Class Structure

Game

  • away_batter_box_score_dict
  • away_pitcher_box_score_dict
  • away_team (Team)
  • away_team_stats
  • start_datetime
  • expected_start_datetime
  • game_date_str
  • home_batter_box_score_dict
  • home_pitcher_box_score_dict
  • home_team (Team)
  • home_team_stats
  • inning_list (Inning list)
  • end_datetime
  • location
  • attendance
  • weather
  • temp
  • timezone_str
  • is_postponed
  • is_suspended
  • is_doubleheader
  • is_today
  • get_svg_str()
  • json()
  • _asdict()

Team

  • abbreviation
  • batting_order_list_list (list of nine PlayerAppearance lists)
  • name
  • pitcher_list (PlayerAppearance list)
  • player_id_dict
  • player_last_name_dict
  • player_name_dict
  • _asdict()

Inning

  • bottom_half_appearance_list (PlateAppearance list)
  • bottom_half_inning_stats
  • top_half_appearance_list (PlateAppearance list)
  • top_half_inning_stats
  • _asdict()

PlateAppearance

  • start_datetime
  • end_datetime
  • batter (Player)
  • batting_team (Team)
  • error_str
  • event_list (list of Pitch, Pickoff, RunnerAdvance, Substitution, Switch objects)
  • got_on_base
  • hit_location
  • inning_outs
  • out_runners_list (Player list)
  • pitcher (Player)
  • plate_appearance_description
  • plate_appearance_summary
  • runners_batted_in_list (Player list)
  • scorecard_summary
  • scoring_runners_list (Player list)
  • _asdict()

Player

  • era
  • first_name
  • last_name
  • mlb_id
  • number
  • obp
  • slg
  • _asdict()

PlayerAppearance

  • start_inning_batter_num
  • start_inning_half
  • start_inning_num
  • end_inning_batter_num
  • end_inning_half
  • end_inning_num
  • pitcher_credit_code
  • player_obj (Player)
  • position
  • _asdict()

Pitch

  • pitch_datetime
  • pitch_description
  • pitch_position
  • pitch_speed
  • pitch_type
  • _asdict()

Pickoff

  • pickoff_description
  • pickoff_base
  • pickoff_was_successful
  • _asdict()

RunnerAdvance

  • runner_advance_datetime
  • run_description
  • runner (Player)
  • start_base
  • end_base
  • runner_scored
  • run_earned
  • is_rbi
  • _asdict()

Substitution

  • substitution_datetime
  • incoming_player (Player)
  • outgoing_player (Player)
  • batting_order
  • position
  • _asdict()

Switch

  • switch_datetime
  • player (Player)
  • old_position_num
  • new_position_num
  • new_batting_order
  • _asdict()

Analyze a game: 2017 World Series - Game 7

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

import baseball

%matplotlib inline

game_id, game = baseball.get_game_from_url('11-1-2017', 'HOU', 'LAD', 1)

pitch_tuple_list = []
for inning in game.inning_list:
    for appearance in inning.top_half_appearance_list:
        for event in appearance.event_list:
            if isinstance(event, baseball.Pitch):
                pitch_tuple_list.append(
                    (str(appearance.pitcher), 
                     event.pitch_description,
                     event.pitch_position,
                     event.pitch_speed,
                     event.pitch_type)
                )

data = pd.DataFrame(data=pitch_tuple_list, columns=['Pitcher', 'Pitch Description', 'Pitch Coordinate', 'Pitch Speed', 'Pitch Type'])
data.head()
Pitcher Pitch Description Pitch Coordinate Pitch Speed Pitch Type
0 21 Yu Darvish Ball (155.47, 160.83) 96.0 FF
1 21 Yu Darvish Called Strike (107.0, 171.09) 83.9 FC
2 21 Yu Darvish In play, no out (115.36, 183.1) 83.9 SL
3 21 Yu Darvish In play, run(s) (80.06, 168.03) 96.6 FF
4 21 Yu Darvish Ball (54.1, 216.52) 84.6 SL
data['Pitcher'].value_counts().plot.bar()

png

for pitcher in data['Pitcher'].unique():
    plt.ylim(0, 125)
    plt.xlim(0, 250)
    bx = [250 - x[2][0] for x in pitch_tuple_list if x[0] == pitcher if 'Ball' in x[1]]
    by = [250 - x[2][1] for x in pitch_tuple_list if x[0] == pitcher if 'Ball' in x[1]]
    cx = [250 - x[2][0] for x in pitch_tuple_list if x[0] == pitcher if 'Called Strike' in x[1]]
    cy = [250 - x[2][1] for x in pitch_tuple_list if x[0] == pitcher if 'Called Strike' in x[1]]
    ox = [250 - x[2][0] for x in pitch_tuple_list if x[0] == pitcher if ('Ball' not in x[1] and 'Called Strike' not in x[1])]
    oy = [250 - x[2][1] for x in pitch_tuple_list if x[0] == pitcher if ('Ball' not in x[1] and 'Called Strike' not in x[1])]
    b = plt.scatter(bx, by, c='b')
    c = plt.scatter(cx, cy, c='r')
    o = plt.scatter(ox, oy, c='g')

    plt.legend((b, c, o),
               ('Ball', 'Called Strike', 'Other'),
               scatterpoints=1,
               loc='upper right',
               ncol=1,
               fontsize=8)

    plt.title(pitcher)
    plt.show()

png

png

png

png

png

plt.axis('equal')
data['Pitch Description'].value_counts().plot(kind='pie', radius=1.5, autopct='%1.0f%%', pctdistance=1.1, labeldistance=1.2)

png

data.plot.kde()

png

fig, ax = plt.subplots()
ax.set_xlim(50, 120)
for pitcher in data['Pitcher'].unique():
    s = data[data['Pitcher'] == pitcher]['Pitch Speed']
    s.plot.kde(ax=ax, label=pitcher)

ax.legend()

png

fig, ax = plt.subplots()
ax.set_xlim(50, 120)
for desc in data['Pitch Type'].unique():
    s = data[data['Pitch Type'] == desc]['Pitch Speed']
    s.plot.kde(ax=ax, label=desc)

ax.legend()

png

fig, ax = plt.subplots(figsize=(15,7))
data.groupby(['Pitcher', 'Pitch Description']).size().unstack().plot.bar(ax=ax)

png

Analyze a player's season: R.A. Dickey - 2017

game_list_2017 = baseball.get_game_list_from_file_range('1-1-2017', '12-31-2017', '/Users/benjamincrom/repos/livebaseballscorecards-artifacts/baseball_files')

pitch_tuple_list_2 = []
for game_id, game in game_list_2017:
    if game.home_team.name == 'Atlanta Braves' or game.away_team.name == 'Atlanta Braves':
        for inning in game.inning_list:
            for appearance in (inning.top_half_appearance_list +
                               (inning.bottom_half_appearance_list or [])):
                if 'Dickey' in str(appearance.pitcher):
                    for event in appearance.event_list:
                        if isinstance(event, baseball.Pitch):
                            pitch_tuple_list_2.append(
                                (str(appearance.pitcher), 
                                 event.pitch_description,
                                 event.pitch_position,
                                 event.pitch_speed,
                                 event.pitch_type)
                            )

df = pd.DataFrame(data=pitch_tuple_list_2, columns=['Pitcher', 'Pitch Description', 'Pitch Coordinate', 'Pitch Speed', 'Pitch Type'])
df['Pitch Type'].value_counts().plot.bar()

png

plt.axis('equal')
df['Pitch Description'].value_counts().plot(kind='pie', radius=2, autopct='%1.0f%%', pctdistance=1.1, labeldistance=1.2)
plt.ylabel('')
plt.show()

png

df.dropna(inplace=True)
ax.set_xlim(50, 100)
df.plot.kde()
ax.legend()

png

fig, ax = plt.subplots()
ax.set_xlim(50, 100)
for desc in df['Pitch Type'].unique():
    if desc != 'PO':
        s = df[df['Pitch Type'] == desc]['Pitch Speed']
        s.plot.kde(ax=ax, label=desc)

ax.legend()

png

Analyze a lineup of pitchers: Atlanta Braves - 2017 Regular Season

import datetime
import dateutil.parser
import pytz
pitch_tuple_list_3 = []
for game_id, game in game_list_2017:
    if game.home_team.name == 'Atlanta Braves' and dateutil.parser.parse(game.game_date_str) > datetime.datetime(2017, 3, 31):
        for inning in game.inning_list:
            for appearance in inning.top_half_appearance_list:
                pitch_tuple_list_3.append(
                    (str(appearance.pitcher),
                     str(appearance.batter),
                     len(appearance.out_runners_list),
                     len(appearance.scoring_runners_list),
                     len(appearance.runners_batted_in_list),
                     appearance.scorecard_summary,
                     appearance.got_on_base,
                     appearance.plate_appearance_summary,
                     appearance.plate_appearance_description,
                     appearance.error_str,
                     appearance.inning_outs)
                )
    if game.away_team.name == 'Atlanta Braves' and dateutil.parser.parse(game.game_date_str) > datetime.datetime(2017, 3, 31):
        for inning in game.inning_list:
            if inning.bottom_half_appearance_list:
                for appearance in inning.bottom_half_appearance_list:
                    pitch_tuple_list_3.append(
                        (str(appearance.pitcher),
                         str(appearance.batter),
                         len(appearance.out_runners_list),
                         len(appearance.scoring_runners_list),
                         len(appearance.runners_batted_in_list),
                         appearance.scorecard_summary,
                         appearance.got_on_base,
                         appearance.plate_appearance_summary,
                         appearance.plate_appearance_description,
                         appearance.error_str,
                         appearance.inning_outs)
                    )

df3 = pd.DataFrame(data=pitch_tuple_list_3, columns=['Pitcher',
                                                     'Batter',
                                                     'Out Runners',
                                                     'Scoring Runners',
                                                     'RBIs',
                                                     'Scorecard',
                                                     'On-base?',
                                                     'Plate Summary',
                                                     'Plate Description',
                                                     'Error',
                                                     'Inning Outs'])

for pitcher in df3['Pitcher'].unique():
    summary = df3[df3['Pitcher'] == pitcher]['Plate Summary']
    s = summary.value_counts(sort=False)
    if len(summary) > 400:
        fig, ax = plt.subplots()
        ax.set_ylim(0, 250)
        s.plot.bar()
        plt.title(pitcher)
        plt.show()

png

png

png

png

png

x = []
for pitcher in df3['Pitcher'].unique():
    #f = df3[df3['Pitcher'] == pitcher]['On-base?'].value_counts()[0]
    s = df3[df3['Pitcher'] == pitcher]['On-base?'].value_counts()
    if len(s) == 2:
        f = s[0]
        t = s[1]
        x.append((str(pitcher), f, t))

df4 = pd.DataFrame(data=x, columns=['Pitcher',
                                    'Did not get on base',
                                    'Got on base'])

df4.index = df4['Pitcher']
df4.sort_values(by=['Got on base']).nlargest(10, 'Did not get on base').plot.bar()

png