Data Set: Major League Baseball

Karen Higgins October 31st, 2021

You may find the referenced data sets in the `openintro` R package (GitHub for now, on CRAN soon) and on this page.

The American pastime of baseball is fun for fans and statisticians alike. Moneyball, the book and movie, is the story of how a general manager with a small budget used data to build a team that won 20 consecutive games in 2002. The Oakland Athletics bested better-funded teams by relying on statistics rather than scouts to assess, acquire and trade players.

OpenIntro has four baseball data sets: `mlb_teams` for team data from 1876 to 2020, `mlb` for 2010 salary data, `mlb_players_18` for 2018 batter statistics, and `mlbbat10` for 2010 hitting statistics. Additional data is in Lahman's Baseball Database, a full set of baseball data from 1871 to 2020, available on the internet and in the `Lahman` R package.

Earned run average (ERA), the number of earned runs allowed per nine innings pitched (nine times earned runs divided by innings pitched), is a popular metric for assessing pitchers. In the `mlb_teams` data set, earned run average is reported for each team for each year. Below is a graph showing earned run average, averaged across teams, by year. The league winner's earned run average for each year is in green.

Average ERA (earned run average) by year for teams from 1871 to 2020. There are two lines, one for the MLB World Series winner and another for all the other teams. What is distinctive is that the World Series winner has had a lower ERA in every single year. Additionally, ERA is typically a bit higher in the last 100 years than in the preceding 50 years.
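
To make the ERA calculation and the winner-versus-others comparison concrete, here is a minimal Python sketch. It does not read `mlb_teams`; the rows, the team labels, and the function names `earned_run_average` and `yearly_era_summary` are made up for illustration, so the numbers say nothing about real seasons.

```python
from collections import defaultdict
from statistics import mean


def earned_run_average(earned_runs, innings_pitched):
    """ERA: earned runs allowed per nine innings pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9 * earned_runs / innings_pitched


# Hypothetical team-season rows (invented values, NOT taken from mlb_teams):
# (year, team, earned_runs, innings_pitched, won_title)
rows = [
    (2001, "A", 600, 1440.0, True),
    (2001, "B", 720, 1440.0, False),
    (2001, "C", 680, 1440.0, False),
    (2002, "A", 640, 1440.0, False),
    (2002, "B", 560, 1440.0, True),
    (2002, "C", 800, 1440.0, False),
]


def yearly_era_summary(rows):
    """Return {year: (winner_era, mean_era_of_all_other_teams)}."""
    winner = {}
    others = defaultdict(list)
    for year, _team, earned_runs, innings, won_title in rows:
        era = earned_run_average(earned_runs, innings)
        if won_title:
            winner[year] = era
        else:
            others[year].append(era)
    # Only years that have a title winner are summarized.
    return {year: (winner[year], mean(others[year])) for year in sorted(winner)}


summary = yearly_era_summary(rows)
for year, (winner_era, others_era) in summary.items():
    print(f"{year}: winner ERA {winner_era:.2f}, other teams' mean ERA {others_era:.2f}")
```

With real data, the same two series per year (the winner's ERA and the mean ERA of the remaining teams) would be the two lines in the plot described above.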