“Home Runs by Park – 2011 Season” or “Man the Astros Sucked This Year”

I hate the Giants. Let this be known. What I was hoping to find was another reason to support my claim that their WS win in 2010 was a complete fluke. So when digging through the 2011 game logs from Retrosheet, I noticed that the park with the fewest home runs for the whole season was AT&T Park. Boom! Giants suck, right? I’ve seen little league teams with a higher team ISO! HA! Except I’m just smart enough to know that before I proclaim my findings I should probably dig just a little deeper. So the next step was to break down the total number of home runs by Home Team vs. Visiting Team. My victory was short-lived. It turns out that the Giants, as a team, hit 42 home runs on the season at AT&T Park to the visiting teams’ 39. Just better than 50/50. Fine! So the question then was: who was the worst team by this completely arbitrary and new metric (the home team’s share of all home runs hit in its own park) that I can’t think of a clever name for right now but am open to suggestions… The answer was the Houston Astros. The miserable 2011 Astros hit 46 home runs in their own park (more than the Giants, I might add) and gave up 106 to visiting teams. Only 30.26% of the balls that left Minute Maid Park were hit by the home team. Painful if you’re an Astros fan. And even worse knowing that in 2 years that place is going to be full of American League teams. The winner of this game for the 2011 season is the Atlanta Braves, with 62.07% of home runs at Turner Field coming off of native bats.
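For anyone who wants to check the arithmetic, here is a minimal sketch of the home-share calculation in base R. It uses only the two pairs of counts quoted above; the data frame and its column names are made up for illustration and are not the Retrosheet field names.

```r
# Home runs by park, using only the counts quoted in the post.
# Column names are illustrative, not Retrosheet's.
hr <- data.frame(
  park       = c("AT&T Park", "Minute Maid Park"),
  home_hr    = c(42, 46),
  visitor_hr = c(39, 106)
)

# Share of all home runs in the park that were hit by the home team, in percent
hr$home_share <- round(100 * hr$home_hr / (hr$home_hr + hr$visitor_hr), 2)

# Worst home share first
hr[order(hr$home_share), ]
```

The same two lines would work on a full 30-park table once the home and visitor totals have been summed from the game logs.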

The charts below were made using ggplot2 and Retrosheet game logs:

This One’s Personal: Sanford Koufax vs. Randy Johnson…pffft

I couldn’t let this one go. The conclusion drawn here by this author, that Randy Johnson was “the best pitcher of all time”, was not something I could allow to slip through the cracks. Johnson was awesome. Incredible to watch. Always delivered a great game, without a doubt. But by all measures he was not the greatest pitcher of all time AND not even the greatest left-handed pitcher of all time. Any reasonable person knows that this distinction belongs to Sandy Koufax. The problem, as we all know, is that Koufax’s career was cut short because he was playing at a time when no one was limiting young pitchers the way they do now in order to save their arms and lengthen their careers. He left at the height of his game, and if you watch the press conference where he announced the decision, it’s quite understandable.

So then how do we judge a 21-year major league career against a 12-year career? I propose two methods of doing so. We can compare their seasons by Age, which creates a bit of an issue because Koufax was forced to join the team at 19 because of some rules around how he was drafted that don’t exist anymore, and Johnson didn’t come up to the majors until he was 24. So this limits us to 7 seasons: Ages 24-30. The other way to approach it is to compare their stats by career year, which will give us 12 years to look at, keeping in mind that Johnson had considerably more time to hone his skills in the minors.

The charts below were made with the ggplot2 R package with no doctoring of the images. I’ve also kept it strictly to the stats used in the above-mentioned post: Strike Outs and ERA. While the XY aspect of the plot should be obvious (Strike Outs against Age or Career Year), the thickness of the lines also represents their respective ERAs. Let’s see how this compares, first by Age then by Career Year.

By age Koufax is the clear winner. He had more time in the majors than Randy at that point, but the point of being in the minors is to get prepared for major league baseball. That raises the question of whether minor league experience is better than just being dumped right in. By career year I’d say that at first glance it looks pretty even, but when you look closer at all of Sandy’s sub-3 and sub-2 ERA seasons plus his huge number of strikeouts in his year 11 season, I’m also going to go ahead and give this one to Sanford.

One of the things Johnson is so lauded for is his high number of cumulative strike outs.  So let’s do the same thing.  First by Age then by Career Year:

I don’t even have to comment on this. Koufax = Winner. The question is always out there about what he could have done if he hadn’t blown out his arm. We’ll never know. But just because he didn’t get to put up the cumulative numbers that some players have doesn’t mean he should be excluded from the “Greatest Pitcher Ever” or “Greatest Left Hander Ever” debates.

Home Runs heating up?

My intuition tells me that objects traveling through the air would meet more resistance when there is more moisture in the air. It turns out that my intuition is wrong. It still doesn’t make sense to me but apparently humid air is less dense. And this applies to baseball specifically because the belief is that there are more home runs in the latter half of the season because many parks are in humid areas (east coast bias) and as the summer progresses it gets hotter and hotter and more and more humid. A lot of this is purely anecdotal: “The ball’s really going to start flying out of the park as the weather heats up” and other such nonsense from the mouths of the talking heads we’re forced to listen to while watching a game.
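The reason humid air is less dense is that a water molecule (about 18 g/mol) is lighter than the nitrogen and oxygen it displaces (dry air averages about 29 g/mol). Here is a rough sketch of that calculation in base R using the ideal gas law. The function name, the default sea-level pressure, and the Tetens approximation for saturation vapour pressure are my own choices for illustration; treat the numbers as ballpark figures, not ballpark measurements.

```r
# Density of moist air (kg/m^3) from the ideal gas law, treating air as a
# mixture of dry air and water vapour. Hypothetical helper for illustration.
air_density <- function(temp_c, rel_humidity, pressure_pa = 101325) {
  R   <- 8.314462    # universal gas constant, J/(mol K)
  M_d <- 0.0289652   # molar mass of dry air, kg/mol
  M_v <- 0.018015    # molar mass of water vapour, kg/mol
  temp_k <- temp_c + 273.15

  # Tetens approximation for saturation vapour pressure, in Pa
  p_sat <- 610.78 * 10^(7.5 * temp_c / (temp_c + 237.3))

  p_v <- rel_humidity * p_sat   # partial pressure of water vapour
  p_d <- pressure_pa - p_v      # partial pressure of dry air
  (p_d * M_d + p_v * M_v) / (R * temp_k)
}

dry   <- air_density(30, 0)   # 30 C, bone dry
humid <- air_density(30, 1)   # 30 C, 100% relative humidity
humid < dry                   # TRUE: the humid air is the less dense of the two
```

The difference at a fixed temperature is small, which is consistent with the effect being hard to see in day-to-day home run counts.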

Anyway, after seeing this post at Revolution Analytics I wanted to use the calendar heat map function created by Paul Bleicher (source code is available here). It seemed like a really fitting opportunity to look at how cumulative daily home runs fluctuated over the course of the MLB season. Based on the science behind the humidity factor you would imagine that there would be a somewhat obvious increasing trend, at least until it starts to cool off at the end of September. Here is how that data looks in one of these calendar heat maps.

From this perspective I’m seeing home-run-heavy days sprinkled all over the course of the season. The only conclusions that I can come to are that 1) obviously the science is right, but the sample size on a daily basis is too small not to be skewed by one big game, and 2) the announcers that perpetuate these myths are just parroting each other with no actual check on what comes out of their pie-holes.

Show me your WAR face!

Below is a chart of the top 20 offensive players based on FanGraphs WAR for the 2011 season. The various features and their corresponding metrics are clear in the image. I’ve also included the leader and last place for each metric to give an idea of what the extremes look like, as it’s all normalized. For example, Jose Reyes’s 7 Home Runs this season give him a very narrow face as compared to Jose Bautista’s double wide. (This is highly derivative and I’m painfully aware of this, but I really wanted to play with the Chernoff faces function available in the aplpack R library.)

How does Matt Kemp become Andre Dawson?

While reading this article over at FanGraphs I was inspired to ask myself “what would Matt Kemp have to do between now and the end of his career to be seriously considered for the Hall of Fame?” This question comes out of

1) Being a Dodger fan

and

2) Romanticizing the idea of watching a HoF player’s entire career develop from beginning to end.

Assuming what the author of the above-mentioned FanGraphs article assumes, that Andre Dawson is the baseline for the HoF, and reducing it to the simplest case of cumulative WAR, the question is how much more WAR does Kemp need to match Dawson? Well, 49.1 is the answer. Matt Kemp’s WAR progression is actually pretty interesting in that prior to his terrible 2010 season he was improving at a rate that fits a simple regression line with an R-squared of 0.997. That’s right, he was moving along in almost exactly a straight line. But what is even more interesting is that even with his dip in 2010 he hopped right back on that regression line in 2011, as if 2010 had never happened (actually, if you exclude 2010 then the R-squared drops all the way down to 0.977). He improved 2 seasons’ worth in that time. Assuming that he continues on that same line, which is totally reasonable as most HoF-ers have at least one 10+ WAR season, he will be at 10.2 WAR in his seventh season, and his first seven seasons as compared to Dawson’s will look like the following:

As you can see, Andre Dawson really shined as a rookie (hence the RoY award) and Matt Kemp’s 2010 season is really going to hurt him in cumulative WAR. It isn’t impossible for him to catch up at this point, but he will have to have a string of really stellar seasons and/or quit before he gets too old and into that negative WAR territory.

Below is a look at some key offensive statistics of the two players over their first six seasons.

A discussion of whether or not a player with only 6 major league seasons under his belt is HoF worthy is obviously quite premature. And HoF voting is largely subjective and not based on standardized metrics like WAR. Nevertheless, with the season that we saw from a player with absolutely no protection in his lineup, driving in 126 runs with something like 127 runners on base ahead of him all season, I don’t think this is too far out of the realm of possibility.

R Tools for FEC Campaign Finance Disclosure Data

UPDATE 10/18/2011:

Thanks to some of the comments, I was able to pare this down using R’s read.fwf() function. Here’s the new version.


# makeData_campaignFinance_v1_1.R -- copyright 10.18.2011, christopher compeau (email: my last name aht gmail dot com)
# thanks to the commenters on swordofcrom.wordpress.com for their help with read.fwf()

# use as you please but please attribute credit to christopher compeau if you publish anything
# the use of the FEC campaign finance data is subject to the rules on the FEC website
# have fun my babies. bonus points if you get yourself on some congressional campaign's shit list.

# this uses the 2011-2012 detailed disclosure data files at http://www.fec.gov/finance/disclosure/ftpdet.shtml
# still to be done: write tools for amended individual contributions files and other stuff as yet undiscovered.

# overpunch tool
overpunch = function(x) {
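  # remove leading zeroes, but always keep the last character so that an
  # all-zero amount becomes "0" (numeric 0) instead of "" (NA)
  amount = sub("^0+(?=.)","",x,perl=TRUE)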
  sign = rep(1,length(x))
  changeChar = c(
    expression(sub("\\[$","0",amount)),
    expression(sub("\\]$","0",amount)),
    expression(sub("[{}]$","0",amount)),
    expression(sub("[AJ]$","1",amount)),
    expression(sub("[BK]$","2",amount)),
    expression(sub("[CL]$","3",amount)),
    expression(sub("[DM]$","4",amount)),
    expression(sub("[EN]$","5",amount)),
    expression(sub("[FO]$","6",amount)),
    expression(sub("[GP]$","7",amount)),
    expression(sub("[HQ]$","8",amount)),
    expression(sub("[IR]$","9",amount))
    )
  changes1 = grep("\\]$",amount)
  changes2 = grep("[JKLMNOPQR}]$",amount)
  sign[c(changes1,changes2)] = -1
  for (i in 1:length(changeChar)) {
    amount = eval(changeChar[i])
  }
  holder = as.numeric(sign) * as.numeric(amount)
  return(holder)  
}

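# Committee Master File
# tabs are swapped for single spaces because read.fwf() uses tab as its internal separator;
# a one-for-one swap of every tab keeps the fixed-width columns aligned
writeLines(iconv(gsub("\t"," ",readLines("~/Projects/campaign_finance/data/raw/committeeMaster_2011_2012.dta"),fixed=TRUE),from="ASCII",to="UTF8"),"~/Projects/campaign_finance/data/preprocessed/committeeMaster_2011_2012_UTF8.dta")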
cmteeMasterNames = c("cmID","cmNAME","treasurer","streetOne","streetTwo","cityTown","state","zip","cmDESIG","cmTYPE","cmPARTY","fileFreq","groupCategory","orgName","candidateID")
cmteeMaster = read.fwf("~/Projects/campaign_finance/data/preprocessed/committeeMaster_2011_2012_UTF8.dta",c(9,90,38,34,34,18,2,5,1,1,3,1,1,38,9),comment.char="",strip.white=TRUE,col.names=cmteeMasterNames)
  
# Candidate Master File
writeLines(iconv(gsub("\t"," ",readLines("~/Projects/campaign_finance/data/raw/candidateMaster_2011_2012.dta"),fixed=TRUE),from="ASCII",to="UTF8"),"~/Projects/campaign_finance/data/preprocessed/candidateMaster_2011_2012_UTF8.dta")
candMasterNames = c('cndID','cndName','partyDesig1','filler1','partyDesig3','seatStatus','filler2','candidateStatus','streetOne','streetTwo','cityTown','state','zip','principalCommID','electionYear','currentDistrict')
candMaster = read.fwf(file="~/Projects/campaign_finance/data/preprocessed/candidateMaster_2011_2012_UTF8.dta",c(9,38,3,3,3,1,1,1,34,34,18,2,5,9,2,2),comment.char="",strip.white=TRUE,col.names=candMasterNames)
  
# Individual Contributions
writeLines(iconv(gsub("\t"," ",readLines("~/Projects/campaign_finance/data/raw/individualContributions_2011_2012.dta"),fixed=TRUE),from="ASCII",to="UTF8"),"~/Projects/campaign_finance/data/preprocessed/individualContributions_2011_2012_UTF8.dta")
individualNames = c('filerID','amendIndicator','reportType','primaryGeneral','microfilmLocation','transactionType','contributorName','cityTown','state','zip','occupation','month','transactionDay','transactionCentury','transactionYear','amount','otherID','fecRecord')
individual = read.fwf(file="~/Projects/campaign_finance/data/preprocessed/individualContributions_2011_2012_UTF8.dta",c(9,1,3,1,11,3,34,18,2,5,35,2,2,2,2,7,9,7),comment.char="",strip.white=TRUE,col.names=individualNames)
individual$amount = overpunch(individual$amount)
  
# Contributions from Committees
writeLines(iconv(gsub("\t"," ",readLines("~/Projects/campaign_finance/data/raw/candidatesFromCommittees_2011_2012.dta"),fixed=TRUE),from="ASCII",to="UTF8"),"~/Projects/campaign_finance/data/preprocessed/candidatesFromCommittees_2011_2012_UTF8.dta")
candFromCommitteesNames = c('filerID','amendIndicator','reportType','primaryGeneral','microfilmLocation','transactionType','transactionMonth','transactionDay','transactionCentury','transactionYear','amount','otherID','candidateID','fecRecord')
candFromCommittees = read.fwf(file="~/Projects/campaign_finance/data/preprocessed/candidatesFromCommittees_2011_2012_UTF8.dta",c(9,1,3,1,11,3,2,2,2,2,7,9,9,7),comment.char="", strip.white=TRUE, col.names=candFromCommitteesNames)
candFromCommittees$amount = overpunch(candFromCommittees$amount)

# Transaction from committee to another
writeLines(iconv(gsub("\t"," ",readLines("~/Projects/campaign_finance/data/raw/committeeToCommittee_2011_2012.dta"),fixed=TRUE),from="ASCII",to="UTF8"),"~/Projects/campaign_finance/data/preprocessed/committeeToCommittee_2011_2012_UTF8.dta")
commToCommNames = c('filerID','amendIndicator','reportType','primaryGeneral','microfilmLocation','transactionType','contributorName','cityTown','state','zip','occupation','month','transactionDay','transactionCentury','transactionYear','amount','otherID','fecRecord')
commToComm = read.fwf(file="~/Projects/campaign_finance/data/preprocessed/committeeToCommittee_2011_2012_UTF8.dta",c(9,1,3,1,11,3,34,18,2,5,35,2,2,2,2,7,9,7),comment.char="", strip.white=TRUE, col.names=commToCommNames)
commToComm$amount = overpunch(commToComm$amount)
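For readers who have not met overpunch before: the last character of the amount carries both the final digit and the sign, so `{` and `A` through `I` mean +0 through +9, while `}` and `J` through `R` mean -0 through -9. The sketch below is a compact, vectorized take on the same idea, not a replacement for the function above. The name `decode_overpunch` and the sample strings are made up for illustration, and it does not cover the `[` and `]` variants that the script also handles.

```r
# Hypothetical, simplified overpunch decoder (illustration only).
decode_overpunch <- function(x) {
  x    <- as.character(x)
  n    <- nchar(x)
  last <- substr(x, n, n)       # final character: digit plus sign
  body <- substr(x, 1, n - 1)   # everything before it

  pos <- c("{", "A", "B", "C", "D", "E", "F", "G", "H", "I")  # +0 .. +9
  neg <- c("}", "J", "K", "L", "M", "N", "O", "P", "Q", "R")  # -0 .. -9

  # Translate the final character to its digit; plain digits pass through
  digit <- ifelse(last %in% pos, match(last, pos) - 1,
           ifelse(last %in% neg, match(last, neg) - 1,
                  suppressWarnings(as.numeric(last))))
  sign <- ifelse(last %in% neg, -1, 1)

  sign * as.numeric(paste0(body, digit))
}

# Made-up sample amounts, 7 characters wide like the amount field
decoded <- decode_overpunch(c("000012C", "000012L", "0000100", "000000{", "000005}"))
decoded
```

With those inputs the five values decode to 123, -123, 100, 0 and -50.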

ORIGINAL POST 10/17/2011:

For my first contribution to the blog, I wanted to make some kind of enlightening visualization of campaign finance disclosure data from the Federal Election Commission’s website. It looks like they’re working on some new, easy-to-use data dumps here, but I decided to try to use the more detailed data files here because I couldn’t really tell the difference between the two data pages, and as a rule I always go for the most granular unaggregated data when I have a choice.

Anyway, the FEC dumps the data in some weird fixed-width COBOL format that kept me from using any of the read.delim functions to get the data into R, so I had to write a bunch of little parsing functions for each data file. I spent all day yesterday on these little helpers and I haven’t yet had the opportunity to do anything interesting with the data, so I decided that I would just post the code and work on some visualizations later this week.
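To make “fixed-width” concrete: every field sits at the same character positions in every record, with no delimiter between them. As the update at the top of this post notes, base R’s read.fwf() can read that layout directly. Here is a toy illustration with two invented records and three invented field widths; nothing in it comes from the real FEC files.

```r
# Two made-up fixed-width records: 9-char ID, 10-char name, 2-char state
raw_lines <- c(
  "C00000001ACME PAC  CA",
  "C00000002OTHER CMTENY"
)

# read.fwf() accepts a connection, so no file needs to be written for the demo
con <- textConnection(raw_lines)
toy <- read.fwf(con, widths = c(9, 10, 2),
                col.names = c("cmID", "cmNAME", "state"),
                colClasses = "character",   # keep IDs and names as text
                comment.char = "", strip.white = TRUE)
close(con)
toy
```

The `comment.char=""` and `strip.white=TRUE` arguments are passed through to read.table(), which is why they show up in the real script too.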

So in summary, this code turns each of the FEC data dump files into an R data frame:

  • Committee Master File: cmteeMaster
  • Candidate Master File: candMaster
  • Individual Contributions: individuals
  • Contributions to Candidates from Committees: candFromCommittees
  • Transactions between Committees: commToComm

This data is DIRTY, and it still needs a lot of work… this code just gets it into data frames. More to come.

# makeData_campaignFinance_v1_0.R -- copyright 10.17.2011, christopher compeau (email: my last name aht gmail dot com)

# use as you please but please attribute credit to christopher compeau if you publish anything
# the use of the FEC campaign finance data is subject to the rules on the FEC website
# have fun my babies. bonus points if you get yourself on some congressional campaign's shit list.

# this uses the 2011-2012 detailed disclosure data files at http://www.fec.gov/finance/disclosure/ftpdet.shtml
# still to be done: write tools for amended individual contributions files and other stuff as yet undiscovered.  

# RAW DATA FILE PARSING TOOLS

trim.trailing <- function (x) {sub("\\s+$", "", x)}

# committee master file
cmMaster = function(line) {
  cmID = substr(line,1,9)
  cmNAME = substr(line,10,99)
  treasurer = substr(line,100,137)
  streetOne = substr(line,138,171)
  streetTwo = substr(line,172,205)
  cityTown = substr(line,206,223)
  state = substr(line,224,225)
  zip = substr(line,226,230)
  cmDESIG = substr(line,231,231)
  cmTYPE = substr(line,232,232)
  cmPARTY = substr(line,233,235)
  fileFreq = substr(line,236,236)
  groupCategory = substr(line,237,237)
  orgName = substr(line,238,275)
  candidateID = substr(line,276,284)
  record = c(cmID,cmNAME,treasurer,streetOne,streetTwo,cityTown,state,zip,cmDESIG,cmTYPE,cmPARTY,fileFreq,groupCategory,orgName,candidateID)
  for (i in 1:length(record)) {
    record[i] = trim.trailing(record[i])
  }
  return(record)
}


# candidate master file
candMaster = function(line) {
  cndID = substr(line,1,9) 
  cndName = substr(line,10,47)
  partyDesig1 = substr(line,48,50)
  filler1 = substr(line,51,53)
  partyDesig3 = substr(line,54,56)
  seatStatus = substr(line,57,57)
  filler2 = substr(line,58,58)
  candidateStatus = substr(line,59,59)
  streetOne = substr(line,60,93)
  streetTwo = substr(line,94,127)
  cityTown = substr(line,128,145)
  state = substr(line,146,147)
  zip = substr(line,148,152)
  principalCommID = substr(line,153,161)
  electionYear = substr(line,162,163)
  currentDistrict = substr(line,164,165)
  record = c(cndID,cndName,partyDesig1,filler1,seatStatus,filler2,candidateStatus,streetOne,streetTwo,cityTown,state,zip,principalCommID,electionYear,currentDistrict)
  for (i in 1:length(record)) {
    record[i] = trim.trailing(record[i])
  }
  return(record)
}

# individual candidate contributions, committee to committee transactions
indAndComContribution = function(line) {
  filerID = substr(line,1,9)
  amendIndicator = substr(line,10,10)
  reportType = substr(line,11,13)
  primaryGeneral = substr(line,14,14)
  microfilmLocation = substr(line,15,25)
  transactionType = substr(line,26,28)  
  contributorName = substr(line,29,62)
  cityTown = substr(line,63,80)
  state = substr(line,81,82)
  zip = substr(line,83,87)
  occupation = substr(line,88,122)
  month = substr(line,123,124)
  transactionDay = substr(line,125,126)
  transactionCentury = substr(line,127,128)
  transactionYear = substr(line,129,130)
  amount = substr(line,131,137)
  otherID = substr(line,138,146)
  fecRecord = substr(line,147,153)
  record = c(filerID,amendIndicator,reportType,primaryGeneral,microfilmLocation,transactionType,contributorName,cityTown,state,zip,occupation,month,transactionDay,transactionCentury,transactionYear,amount,otherID,fecRecord)
  for (i in 1:length(record)) {
    record[i] = trim.trailing(record[i])
  }
  return(record)
}

# contributions to candidate from committees
candComContibution = function(line) {
  filerID = substr(line,1,9)
  amendIndicator = substr(line,10,10)
  reportType = substr(line,11,13)
  primaryGeneral = substr(line,14,14)
  microfilmLocation = substr(line,15,25)
  transactionType = substr(line,26,28)
  transactionMonth = substr(line,29,30)
  transactionDay = substr(line,31,32)
  transactionCentury = substr(line,33,34)
  transactionYear = substr(line,35,36)
  amount = substr(line,37,43)
  otherID = substr(line,44,52)
  candidateID = substr(line,53,61)
  fecRecord = substr(line,62,68)
  record = c(filerID,amendIndicator,reportType,primaryGeneral,microfilmLocation,transactionType,transactionMonth,transactionDay,transactionCentury,transactionYear,amount,otherID,candidateID,fecRecord)
  for (i in 1:length(record)) {
    record[i] = trim.trailing(record[i])
  }
  return(record)
}


# overpunch tool
overpunch = function(x) {
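  # remove leading zeroes, but always keep the last character so that an
  # all-zero amount becomes "0" (numeric 0) instead of "" (NA)
  amount = sub("^0+(?=.)","",x,perl=TRUE)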
  sign = rep(1,length(x))
  changeChar = c(
    expression(sub("\\[$","0",amount)),
    expression(sub("\\]$","0",amount)),
    expression(sub("[{}]$","0",amount)),
    expression(sub("[AJ]$","1",amount)),
    expression(sub("[BK]$","2",amount)),
    expression(sub("[CL]$","3",amount)),
    expression(sub("[DM]$","4",amount)),
    expression(sub("[EN]$","5",amount)),
    expression(sub("[FO]$","6",amount)),
    expression(sub("[GP]$","7",amount)),
    expression(sub("[HQ]$","8",amount)),
    expression(sub("[IR]$","9",amount))
    )
  changes1 = grep("\\]$",amount)
  changes2 = grep("[JKLMNOPQR}]$",amount)
  sign[c(changes1,changes2)] = -1
  for (i in 1:length(changeChar)) {
    amount = eval(changeChar[i])
  }
  holder = as.numeric(sign) * as.numeric(amount)
  return(holder)  
}

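# function using parsing tools to make data frames
# 'expsn' is an unevaluated expression for each parsing tool; it is evaluated
# inside this function, so it can refer to properData and i
# some raw data records are not the length stated in data docs, so those are dropped
mkDataFrame = function(data,lineLength,columnNames,expsn) {
  # which() drops records whose length is NA instead of carrying NA entries along
  properData = data[which(nchar(data, allowNA=TRUE)==lineLength)]
  nRecords = length(properData)
  finalMatrix = matrix(nrow=nRecords,ncol=length(columnNames))
  # seq_len() skips the loop cleanly when there are no usable records
  for (i in seq_len(nRecords)) {
    result = eval(expsn)
    finalMatrix[i,] = result
  }
  # keep the fields as character strings rather than factors
  finalDF = as.data.frame(finalMatrix, stringsAsFactors=FALSE)
  names(finalDF) = columnNames
  return(finalDF)
}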

# Now use parsing tools to read data into dataframes    
    
# Committee Master File
cmteeMasterRaw = read.delim(file="~/Projects/campaign_finance/data/committeeMaster_2011_2012.dta", header=FALSE, sep="\n")
cmteeMasterRaw = as.character(cmteeMasterRaw[,1])
cmteeMasterNames = c("cmID","cmNAME","treasurer","streetOne","streetTwo","cityTown","state","zip","cmDESIG","cmTYPE","cmPARTY","fileFreq","groupCategory","orgName","candidateID")
cmteeMaster = mkDataFrame(cmteeMasterRaw,284,cmteeMasterNames,expression(cmMaster(properData[i])))  
  
# Candidate Master File
candMasterRaw = read.delim(file="~/Projects/campaign_finance/data/candidateMaster_2011_2012.dta", header=FALSE, sep="\n")
candMasterRaw = as.character(candMasterRaw[,1])
candMasterNames = c('cndID','cndName','partyDesig1','filler1','seatStatus','filler2','candidateStatus','streetOne','streetTwo','cityTown','state','zip','principalCommID','electionYear','currentDistrict')
candMaster = mkDataFrame(candMasterRaw,165,candMasterNames,expression(candMaster(properData[i])))  
  
# Individual Contributions
individualRaw = read.delim(file="~/Projects/campaign_finance/data/individualContributions_2011_2012.dta", header=FALSE,sep="\n")
individualRaw = as.character(individualRaw[,1])
individualNames = c('filerID','amendIndicator','reportType','primaryGeneral','microfilmLocation','transactionType','contributorName','cityTown','state','zip','occupation','month','transactionDay','transactionCentury','transactionYear','amount','otherID','fecRecord')
individuals = mkDataFrame(individualRaw,153,individualNames,expression(indAndComContribution(properData[i])))
individuals$amount = overpunch(individuals$amount)
  
# Contributions from Committees
candFromCommitteesRaw = read.delim(file="~/Projects/campaign_finance/data/candidatesFromCommittees_2011_2012.dta", header=FALSE, sep="\n")
candFromCommitteesRaw = as.character(candFromCommitteesRaw[,1])
candFromCommitteesNames = c('filerID','amendIndicator','reportType','primaryGeneral','microfilmLocation','transactionType','transactionMonth','transactionDay','transactionCentury','transactionYear','amount','otherID','candidateID','fecRecord')
candFromCommittees = mkDataFrame(candFromCommitteesRaw,68,candFromCommitteesNames,expression(candComContibution(properData[i])))
candFromCommittees$amount = overpunch(candFromCommittees$amount)

# Transaction from committee to another
commToCommRaw = read.delim(file="~/Projects/campaign_finance/data/committeeToCommittee_2011_2012.dta", header=FALSE, sep="\n")
commToCommRaw = as.character(commToCommRaw[,1])
commToCommNames = c('filerID','amendIndicator','reportType','primaryGeneral','microfilmLocation','transactionType','contributorName','cityTown','state','zip','occupation','month','transactionDay','transactionCentury','transactionYear','amount','otherID','fecRecord')
commToComm = mkDataFrame(commToCommRaw,153,commToCommNames,expression(indAndComContribution(properData[i])))
commToComm$amount = overpunch(commToComm$amount)
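The hard-coded substr() positions in the parsing tools and the width vectors passed to read.fwf() in the update describe the same layout. As a rough sketch of how one follows from the other, the start and end positions can be derived from the widths with cumsum(). The widths below are the committee master widths from the update; `split_record` and the sample record are hypothetical and only there to show the idea.

```r
# Field widths for the committee master file, as passed to read.fwf() in the update
widths <- c(9, 90, 38, 34, 34, 18, 2, 5, 1, 1, 3, 1, 1, 38, 9)
ends   <- cumsum(widths)        # last character of each field
starts <- ends - widths + 1     # first character of each field

# Hypothetical helper: split one fixed-width record into right-trimmed fields
split_record <- function(line, starts, ends) {
  trimws(substring(line, starts, ends), which = "right")
}

# Made-up 284-character record: an ID, a name, and blanks for everything else
line <- paste0(sprintf("%-9s", "C00000001"),
               sprintf("%-90s", "ACME PAC"),
               strrep(" ", 185))
fields <- split_record(line, starts, ends)

data.frame(start = starts, end = ends)
```

The resulting positions line up with the ones typed by hand in cmMaster(): the name field runs from 10 to 99 and the record ends at 284.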

Percentage of Organic Farming Operations by State

With data from the USDA on certified organic farms for 2008, I created a map using the Geo Map function from the googleVis API package available in R. I’ve copied and pasted the image below, as WordPress.com sites don’t support scripting, so the rollover functionality is, unfortunately, lost. But because this is really the easiest way to create a choropleth map, I went with it and did some of my own editing. Maine and Vermont top the list for the highest proportion of certified organic farming and livestock operations with a huge total of 8% each, with California coming in a distant second at 4%. The fewest such operations as a fraction of the total number of farms are found across the South, with Mississippi, Arkansas, Louisiana, Tennessee, West Virginia and Alabama all below 0.06%.
