Kennedy wrote:Awesome work, Tessa. This is right up my alley, and I really wish my life had a little more boredom in it, because this is just the kind of project I love to jump into!

The first question I have is: what method do you use to produce a projection? I've been dabbling with it this year since chef-de-race went offline and took their projections with them. But because I'm not that smart, I really just ended up with a straight projection based on the fractions and beaten lengths at any call. It feels inexact though, and I'd be interested to hear how you actually accomplish the "math".

One thought I had in terms of the Derby decline is to actually do a correlation study on the number of starts each winner had prior to the Derby.

I've been thinking about a theory in concept: what has the greatest impact on a horse's ability to run really fast? Age, current fitness, or experience?

Obviously all three matter, and it's tough to make a blanket statement for all horses. But the general decline in Derby performances may correlate with the fact that runners in the Derby are now more often making their 5th lifetime start than their 10th.

How many horses run their best lifetime effort in their 5th start?

I personally don't think horses are getting worse but I do wonder if the Derby is now scheduled at a time of year that is "sooner" in terms of the entrants development than in years past.

Ha, thanks, I have to say I thought of you while writing this!

Regarding the methodology, some of it honestly is pretty hazy because it was several years ago and my notes are poorly organized. I know that I started by looking for the best predictor of final time. I looked at pretty much every fraction and combination of fractions, but ended up finding that the 6f split had the best correlation with final time (and this relationship is surprisingly linear; all the polynomial curves were basically the same). Assuming that exceptional and poor performances would distort this correlation, I took out the outliers and wet tracks to get a better line of fit and generated a starter linear equation. So as a disclaimer, I realize this is a recursive analysis, which is terrible! But there really is no dataset like the Derby, so I'd be hesitant to apply fractional correlations gathered from other races. I'm eager to keep collecting these over the years to refine the equation, and so far it has proved a fairly reliable BSF predictor: for reference, I heard from a number of people in 2014 that their own self-generated figures equated to about a 103 Beyer.
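If it helps to see it concretely, the fitting step looks roughly like this. To be clear, this is a sketch of the idea rather than what I actually ran: the splits and times are invented, and the one-pass z-score cutoff for outliers is an assumption on my part.

```python
# Sketch: regress final time on the 6f split, drop residual outliers
# once, refit. All data values below are invented.
import numpy as np

def fit_final_time(six_f_splits, final_times, z_cut=2.0):
    """Fit final_time = a * six_f_split + b, dropping outliers one time."""
    x = np.asarray(six_f_splits, dtype=float)
    y = np.asarray(final_times, dtype=float)
    a, b = np.polyfit(x, y, 1)                 # initial straight-line fit
    resid = y - (a * x + b)
    keep = np.abs(resid - resid.mean()) <= z_cut * resid.std()
    a, b = np.polyfit(x[keep], y[keep], 1)     # refit without outliers
    return a, b

# Invented example: 6f splits (seconds) vs. 10f final times (seconds)
splits = [70.2, 71.0, 69.8, 72.5, 70.9, 71.8, 75.0]
finals = [121.5, 122.4, 121.0, 124.1, 122.2, 123.3, 130.0]
a, b = fit_final_time(splits, finals)
predicted = a * 71.5 + b   # predicted final time for a 71.5s split
```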

The wind stuff is fairly uninteresting: I just looked at whether races with high winds fell above or below that line of fit, then looked for a correlation between wind speed and reduced final time. It isn't huge, but it's there, and again it's pretty linear, so I added that on as a modifier to final time. Finally, I researched what other people do with track surface and came up with a series of modifications for wet tracks, topping off at +1.3 seconds for sloppy (which lines up pretty well with Beyers again, although slop is tough to work with because it's always a different surface). This allowed me to generate the predicted vs. actual times and work from there.
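The modifier step itself is trivially simple, something like the below. The wind coefficient and the intermediate surface values here are placeholders I made up for illustration; the only real number from my actual table is the +1.3 seconds for sloppy.

```python
# Sketch of linear wind and surface adjustments to a predicted final time.
# Coefficients are invented placeholders except the 1.3s sloppy figure.
SURFACE_MODS = {"fast": 0.0, "good": 0.4, "muddy": 0.9, "sloppy": 1.3}  # seconds
WIND_COEF = 0.05  # hypothetical: seconds of final time per mph of wind

def adjusted_prediction(base_time, wind_mph=0.0, surface="fast"):
    """Predicted final time plus linear wind and surface adjustments."""
    return base_time + WIND_COEF * wind_mph + SURFACE_MODS[surface]

print(adjusted_prediction(122.0, wind_mph=10, surface="sloppy"))  # 122 + 0.5 + 1.3
```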

So this is all basically a way to generate a Beyer without knowing track variant, and it worked pretty nicely for that, but I wanted another method for assessing quality, which is where the final margins came in. Under the assumption that fast races result in big margins, if not big win margins, I wanted to look at beaten lengths. After working with a lot of possibilities, I decided to average the beaten margin of 3rd and 6th place. I didn't want to punish horses for beating quality opponents by small margins, so I didn't use win margin. However, adding the 3rd place margin gives those big winners a slight bonus, while going to sixth gives a bonus to horses for stringing the field out late (incidentally, I found while going through the charts that 1st to 3rd is variable, but there's usually a bunching around 4th to 6th). My thought process here was that while field quality varies from race to race, it's likely about even by the time you get to that sixth tired horse. Unsurprisingly, most of the fastest races also had the biggest margin bonuses, but it helped smooth out some of the irregularities, such as Animal Kingdom's number (which was originally very high). This aspect has not been perfect because I think it punishes some horses *too* much, so I'd love to keep fiddling with it to improve it.
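Spelled out, the margin metric is just an average of two chart entries. The field below is completely made up, just to show the shape of it:

```python
# Sketch of the margin bonus: average the beaten lengths of the 3rd- and
# 6th-place finishers. The chart data is invented.
def margin_bonus(beaten_lengths):
    """beaten_lengths[i] = lengths behind the winner for finish position i+1."""
    third = beaten_lengths[2]
    sixth = beaten_lengths[5]
    return (third + sixth) / 2.0

# Invented chart: winner (0.0), then beaten lengths for 2nd through 6th
field = [0.0, 1.5, 4.0, 5.0, 5.5, 9.75]
print(margin_bonus(field))  # (4.0 + 9.75) / 2 = 6.875
```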

Long story short for the rest of it, I weighted margin by 1.5x, normalized the predicted vs. actual figure, combined the two, then ranked the horses against known Beyers, looked for identical rankings, and used that to generate a linear relationship between the two (and it was incredibly linear, like an r-squared of .99, which was interesting). So technically, these predicted Beyers are relative to the assumption that Monarchos got a 116 and Giacomo got a 100, plus a few others. Even if not entirely accurate, this was a fun way to look at old races, and I think I might extend it back to 1960 (when the first "modern" race times started to pop up in the Derby). Another nice feature is the fact that this method can be applied to all Derby horses, not just winners.
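The combine-and-calibrate step, roughly sketched: everything numeric here is invented except the two anchors I mentioned (Monarchos = 116, Giacomo = 100), and the scores are stand-ins for my actual normalized values.

```python
# Sketch: combine the normalized time component with a 1.5x-weighted margin,
# then fit a line mapping combined scores onto known Beyers.
import numpy as np

def combined_score(time_gain, margin, margin_weight=1.5):
    # time_gain: normalized predicted-vs-actual component
    return time_gain + margin_weight * margin

def calibrate(scores, known_beyers):
    """Linear map from combined score to Beyer, fit on races with known figures."""
    a, b = np.polyfit(scores, known_beyers, 1)
    return lambda s: a * s + b

scores = np.array([2.1, -0.5, 0.8, 1.4])  # invented combined scores
beyers = np.array([116, 100, 108, 112])   # two real anchors, rest invented
to_beyer = calibrate(scores, beyers)
```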

As for career starts, that would be really interesting to look at! Perhaps I could see whether there is a relationship between how many starts they have made and what figure they receive. It's very hard to control for developmental stage, but it would make sense that horses start to run slower if they don't have the same foundation or aren't as far along developmentally.
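A first pass at that check could be as simple as a single correlation coefficient; the starts and figures below are made up purely to show the shape of it, not real Derby data.

```python
# Sketch: correlate prior career starts with the generated figure.
# All data here is invented for illustration.
import numpy as np

starts = np.array([4, 5, 6, 8, 9, 11])             # lifetime starts before the Derby
figures = np.array([99, 101, 103, 106, 105, 108])  # hypothetical figures
r = np.corrcoef(starts, figures)[0, 1]             # Pearson correlation
```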