Sean

Data Science, Baseball, and AM I UNDERPAID?

Updated: Mar 22, 2020

If we polled everyone who reads this, I’d wager that someone here feels underpaid at their job. Heck, we all feel that way at some point. What if I told you that there was a definite way to PROVE to your boss that you are worthy of a pay raise? (*Prove is a strong word meant for dramatic effect) Well, I can’t…but I can if you are an MLB player! How is this possible, you may ask? Through the power of machine learning!


Before getting to the machine learning explanation, let’s talk about metrics. As you likely already know, there are many metrics that measure a baseball player’s performance. In this case, I decided to focus on batters and specifically, their Weighted On-Base Average (wOBA). wOBA is an excellent metric that was created by Tom Tango (hell of a name) and it measures a hitter’s overall offensive value, based on the relative values of each distinct event. From FanGraphs: “wOBA is a simple concept: Not all hits are created equal…Weighted On-Base Average combines all the different aspects of hitting into one metric, weighting each of them in proportion to their actual run value”. Here’s the formula from the 2013 season (the weights change each year):
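For reference, this is the 2013 version as I understand FanGraphs lists it. Here uBB stands for unintentional walks, and the weights are worth double-checking against the FanGraphs wOBA page before you reuse them:

```latex
\[
\mathrm{wOBA} =
\frac{0.690\,\mathrm{uBB} + 0.722\,\mathrm{HBP} + 0.888\,\mathrm{1B}
      + 1.271\,\mathrm{2B} + 1.616\,\mathrm{3B} + 2.101\,\mathrm{HR}}
     {\mathrm{AB} + \mathrm{BB} - \mathrm{IBB} + \mathrm{SF} + \mathrm{HBP}}
\]
```

Notice how a home run counts roughly three times as much as an unintentional walk, which is the whole "not all hits are created equal" idea in one line.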




Now that we’re on the same page, you’re probably wondering what hoodoo voodoo I used to figure this all out! Well, I used an unsupervised machine learning algorithm, kMeans clustering, to group batters based on the similarities between their wOBA and salaries. An unsupervised algorithm looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision. What this means is that I did not tell the algorithm how to group the players. I simply provided data (wOBA and salaries) and kMeans grouped the batters by how similar they were to each other. I used a method to determine how many groups to cluster the players into (it came out to three), and using 2016 batting data, this was the result:
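As a rough illustration of the idea (not the code behind the plot), here is a from-scratch kMeans sketch using only the Python standard library. Several things in it are assumptions made for the example: the player numbers are made up, the features are standardized before clustering so that salary in millions does not swamp a wOBA near 0.3, and the starting centroids are picked with a simple farthest-first rule to keep the run deterministic. In practice a library implementation such as scikit-learn's would be the usual choice.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a column to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def sq_dist(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    # Assumption: deterministic farthest-first seeding instead of random starts.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    mean(m[0] for m in members),
                    mean(m[1] for m in members),
                )
    return labels, centroids


# Made-up (wOBA, salary) pairs, purely for illustration.
woba = [0.290, 0.295, 0.300, 0.360, 0.365, 0.370, 0.350, 0.355, 0.360]
salary = [1.0e6, 1.5e6, 2.0e6, 1.2e6, 2.5e6, 1.8e6, 20e6, 22e6, 25e6]

points = list(zip(standardize(woba), standardize(salary)))
labels, centroids = kmeans(points, k=3)
print(labels)
```

The algorithm never sees a label like "cheap and good"; the three groups fall out of the distances alone, which is what makes it unsupervised.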





Quite quickly, you can see the three distinct groups (woooh colored scatter plots). There’s a group with lower wOBA and lower salaries (purple), a group with higher wOBA and lower salaries (greenish-blue), and a group with higher wOBA and higher salaries (yellow). For this, I wanted to look at two things:


  1. From the group with the highest average salary (yellow), were there players whose wOBA was higher than the group’s average and whose salary was lower than the group’s average? Such a player would look underpaid relative to their performance.

  2. From the group with higher wOBA and lower salary (greenish-blue), are there any players we could identify as “bargain” acquisitions? By this, I wanted to find players whose wOBA was above their group’s average and whose salaries were less than $10,000,000.
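Both questions boil down to comparing each player against the averages of their own cluster. A minimal sketch of that filtering, assuming each player is a small record that already carries a cluster label (the records, names, and numbers below are hypothetical, not the real 2016 data):

```python
from statistics import mean

# Hypothetical records; "cluster" would come from the kMeans step.
players = [
    {"name": "Player A", "cluster": "yellow", "woba": 0.350, "salary": 12_000_000},
    {"name": "Player B", "cluster": "yellow", "woba": 0.310, "salary": 25_000_000},
    {"name": "Player C", "cluster": "yellow", "woba": 0.320, "salary": 20_000_000},
    {"name": "Player D", "cluster": "green", "woba": 0.370, "salary": 3_000_000},
    {"name": "Player E", "cluster": "green", "woba": 0.340, "salary": 1_000_000},
    {"name": "Player F", "cluster": "green", "woba": 0.380, "salary": 12_000_000},
]


def members_of(players, cluster):
    """All players assigned to the given cluster."""
    return [p for p in players if p["cluster"] == cluster]


def underpaid(players, cluster):
    """Question 1: wOBA above the cluster mean, salary below the cluster mean."""
    group = members_of(players, cluster)
    avg_woba = mean(p["woba"] for p in group)
    avg_salary = mean(p["salary"] for p in group)
    return [p for p in group if p["woba"] > avg_woba and p["salary"] < avg_salary]


def bargains(players, cluster, salary_cap=10_000_000):
    """Question 2: wOBA above the cluster mean, salary under a fixed cap."""
    group = members_of(players, cluster)
    avg_woba = mean(p["woba"] for p in group)
    return [p for p in group if p["woba"] > avg_woba and p["salary"] < salary_cap]


underpaid_names = [p["name"] for p in underpaid(players, "yellow")]
bargain_names = [p["name"] for p in bargains(players, "green")]
print(underpaid_names, bargain_names)
```

The only difference between the two filters is the salary test: Question 1 compares against the cluster's own mean salary, while Question 2 uses the flat $10,000,000 cutoff.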


Filtering the results for Question 1, I was able to identify 10 players who have a real argument to make to their GMs! Their wOBA was above their group’s average and their salaries were below the group’s average. Here are the ten players with some key metrics:


Mean Salary of Cluster: $16,710,783.97
Mean wOBA of Cluster: 0.320




Filtering the results for Question 2, I was able to identify 10 players who should be considered for acquisition! Why? Because their salaries were less than $10,000,000 and their wOBA was above the group’s average.





Wow, flash forward 4 years and Bryce Harper has cashed in alright...maybe he had me helping him figure this all out...





The results are not groundbreaking, but they serve as an example of how one can use machine learning to analyze MLB data. In this case, a player can use this information as a bargaining chip, or a team can use it to find a player with an above-average wOBA on a salary that fits their budget.





Comments? Questions? Concerns? Comment away! Also, don’t hate on me too much for using 2016 data. Thanks for reading!



Want to see what you can do with baseball data? Check out Sean Lahman's open source MLB dataset here: http://www.seanlahman.com/baseball-archive/statistics/
