Jumping off from the previous post, it seems to me that a good a priori statistic to know would be the following:
Suppose that two teams meet in a seven-game series. If you suppose that they are evenly matched, then you'd expect Team A to win the series 50% of the time. Okay. Now, suppose Team A loses the first game. A is down 0-1. What would Team A's odds of winning each game have to be for it to still have a 50% chance of winning the series? Does Team A have to be 55-45 good? 60-40? 70-30? What would bring the odds of winning the series back to parity? This, I think, would be a good set of figures to know in order to have an intuitive grasp of how improbable a series win is for a team that falls behind.
Of course, these figures are just fixed mathematical calculations that you could apply in any sport. Unfortunately I don't know how to do the math to calculate them out, but maybe some bored genius will help us out.
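For what it's worth, here is a minimal sketch of how the calculation could be ground out numerically, assuming each game is an independent coin flip that the trailing team wins with some fixed probability p (the function names here are just placeholders of mine):

def series_win_prob(a_needs, b_needs, p):
    """Probability that Team A wins the series, given that A still needs
    a_needs wins, B still needs b_needs wins, and A wins each game
    independently with probability p."""
    if a_needs == 0:
        return 1.0
    if b_needs == 0:
        return 0.0
    return (p * series_win_prob(a_needs - 1, b_needs, p)
            + (1 - p) * series_win_prob(a_needs, b_needs - 1, p))

def parity_odds_after_game_one_loss(lo=0.5, hi=1.0, tol=1e-9):
    """Bisect for the per-game probability at which a team down 0-1
    (it needs 4 more wins, its opponent needs 3) has a 50% chance of
    taking the series."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if series_win_prob(4, 3, mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(parity_odds_after_game_one_loss(), 3))  # roughly 0.579

Under those assumptions, the answer comes out to roughly 58-42: a team that drops Game 1 has to be about a 58% favorite in each remaining game just to get back to a coin flip's chance at the series.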
Sunday, October 18, 2009
The value of ad hoc statistics
The math certainly bodes well for them. Since Major League Baseball adopted a best-of-seven format for the ALCS in 1985, the team that won Game 2 has advanced to the World Series 17 of 23 times.
I'm always wary of this sort of statistic because it can be misleading. Of course, for any Game n in a series, you expect the eventual series winner to have won that game more often than the loser, if only because in every series the winner is guaranteed four wins spread over those four to seven games, while the loser has anywhere from zero to three. And of course, by definition, you expect the series winner to win Game 7 100% of the time.
So what are the a priori probabilities here? Well, running a simulation of 10,000 seven-game series in which each team has a 50-50 chance of winning each game, these were the results:
Team A wins 4922 out of 10000 series
Series winner wins
Game 1: 6541/10000 (65%)
Game 2: 6565/10000 (66%)
Game 3: 6579/10000 (66%)
Game 4: 6643/10000 (66%)
Game 5: 5896/8732 (68%)
Game 6: 4692/6220 (75%)
Game 7: 3084/3084 (100%)
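The simulation itself is nothing fancy; a Python sketch along these lines should reproduce the numbers above, up to random noise (the function names are mine):

import random

def simulate_series(p=0.5):
    """Play one best-of-seven series in which Team A wins each game with
    probability p. Returns the series winner and the list of game winners."""
    a_wins = b_wins = 0
    games = []
    while a_wins < 4 and b_wins < 4:
        winner = 'A' if random.random() < p else 'B'
        games.append(winner)
        if winner == 'A':
            a_wins += 1
        else:
            b_wins += 1
    return ('A' if a_wins == 4 else 'B'), games

def tally(n_series=10_000, p=0.5):
    """Count how often the eventual series winner won each game."""
    a_series = 0
    winner_won = [0] * 7  # times the series winner won game i
    played = [0] * 7      # times game i was played at all
    for _ in range(n_series):
        series_winner, games = simulate_series(p)
        if series_winner == 'A':
            a_series += 1
        for i, game_winner in enumerate(games):
            played[i] += 1
            if game_winner == series_winner:
                winner_won[i] += 1
    print(f"Team A wins {a_series} out of {n_series} series")
    for i in range(7):
        if played[i]:
            print(f"Game {i + 1}: {winner_won[i]}/{played[i]} "
                  f"({100 * winner_won[i] / played[i]:.0f}%)")

tally()

Calling tally(p=0.6) or tally(p=0.7) instead produces tables like the two further down in this post.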
So a priori, assuming evenly matched teams, we would expect the series winner to win Game 2 about 65% of the time. The statistic in the article said that the series winner in the ALCS has won Game 2 in 17 out of 23 series, or about 74% of the time. Considering that the sample size is very small--23 series--this difference of roughly nine percentage points doesn't seem to be terribly significant.
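To put a rough number on "not terribly significant": if the true rate really were about 65% and the 23 series were independent, seeing the Game 2 winner advance 17 or more times out of 23 would happen roughly a quarter of the time. A quick back-of-the-envelope check:

from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of 17 or more Game 2 wins by the eventual series winner in 23
# series, if the per-series probability really were 0.65.
print(round(binomial_tail(17, 23, 0.65), 2))  # roughly 0.25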
(Bonus section: in the above example, I'm actually being conservative, because I'm assuming that the teams are evenly matched. But of course, in real life the teams are sometimes not evenly matched, in which case we should say that one team has, for example, a 60-40 or 70-30 chance of winning each game. If we run the 10,000-series simulation with 60-40 odds, we get this:
Team A wins 7140 out of 10000 series
Series winner wins
Game 1: 6766/10000 (68%)
Game 2: 6690/10000 (67%)
Game 3: 6784/10000 (68%)
Game 4: 6724/10000 (67%)
Game 5: 5853/8436 (69%)
Game 6: 4456/5748 (78%)
Game 7: 2727/2727 (100%)
If we run it with 70-30 odds, we get this:
Team A wins 8729 out of 10000 series
Series winner wins
Game 1: 7318/10000 (73%)
Game 2: 7286/10000 (73%)
Game 3: 7274/10000 (73%)
Game 4: 7305/10000 (73%)
Game 5: 5558/7532 (74%)
Game 6: 3474/4363 (80%)
Game 7: 1785/1785 (100%)
So what we see here is that, when we account for the fact that teams are not always evenly matched--that sometimes a team will have an odds-on advantage in winning each game--it only nudges the probability that the series winner will win Game 2 upwards. Which makes my case a little bit stronger....
...although we should note that the per-game odds probably never swing too far from 50-50. Remember that, in the regular season, the best team in the league rarely has better than about a 65-35 advantage over its 162 games against all the other teams in the league (which include a lot of crappy and mediocre teams). When you consider that in the ALCS the best teams are playing against each other, I imagine that the odds for the team favored to win don't go much beyond 60-40, if that.)
Anyway, to conclude: the statistic cited in the article is not particularly meaningful. Moreover, it's odd to focus on Game 2 in isolation from the fact that the Yankees also won Game 1. If anything, the statistic we should be asking for is: what are the a priori odds that the Angels will come back from 0-2 to win the series, assuming they're evenly matched with the Yankees (a good assumption, I think)? Well, assigning the Angels to "Team A":
Team A wins 1893 out of 10000 series
Doesn't look too good for the Angels.
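For what it's worth, the simulated figure lines up with the exact value: down 0-2 against an evenly matched opponent, a team has to win four games before its opponent wins two more, which works out to 18.75%. A quick check with a little recursion, under the same coin-flip assumption as the simulation:

def series_win_prob(a_needs, b_needs, p):
    """Probability that Team A wins when it still needs a_needs wins,
    its opponent still needs b_needs, and A wins each game with probability p."""
    if a_needs == 0:
        return 1.0
    if b_needs == 0:
        return 0.0
    return (p * series_win_prob(a_needs - 1, b_needs, p)
            + (1 - p) * series_win_prob(a_needs, b_needs - 1, p))

print(series_win_prob(4, 2, 0.5))  # 0.1875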
(Photo used sans permission from here.)
Labels: baseball, sports, statistics
Tuesday, October 13, 2009
Odds are this guy doesn't know what he's talking about
This annoys:
Repeating as World Series champions is, as the last nine winners can attest, exceedingly difficult. The last defending champion even to reach the World Series was the 2001 Yankees. The last National League team to win consecutive World Series was the 1975-76 Cincinnati Reds, known as the Big Red Machine. So the odds are stacked against the Phillies, not that they mind.
He makes it seem like the fact that the Phillies won the Series last year actually makes it more unlikely that they will win it this year. But that's just the ol' gambler's fallacy at work.
I think the real insight lurking here is that baseball is, relatively speaking, a very stochastic game, so even a very dominant team is going to require a significant amount of luck to make it all the way twice in a row. Compare this to, say, basketball, which is less stochastic, where you see higher winning percentages, and where championship streaks are relatively common.
Labels: baseball, basketball, sports, statistics
Saturday, May 9, 2009
Lies, damn lies, etc.
Via Sullivan, someone quotes some statistics:
There is a 1 in 1.5 million chance that your kid would be abducted and killed by a stranger. It is hard to wrap your mind around those numbers, and everybody always assumes: What if it's my 1 in 1.5 million? If you don't want to have your child in any kind of danger, you really can't do anything. You certainly couldn't drive them in a car, because that's the No. 1 way kids die, as passengers in car accidents.
These kinds of comparisons always bother me, because they don't seem like a very good basis on which to make decisions. Presumably, these "odds" are arrived at by dividing, say, the number of child abductions by the total number of children. But you can't conclude that, if you leave your kid outside unattended, that figure reflects the odds of an abduction. In some areas and in some circumstances, the chances of a child abduction will be higher than in others. Indeed, it could be that the very reason the odds of a child abduction are so slim in the first place is precisely that most parents take lots of precautionary measures to make sure it never happens. If that's the case, then it certainly doesn't make sense to use the statistic as a reason to stop taking those precautions!
It'd be like if someone refused to wear a bike helmet on the grounds that the odds of serious head injury are low, when in fact the very reason why head injury is rare is because everyone wears a helmet. You'd want to know the odds of a serious head injury amongst people who don't wear helmets.
The same holds for car accident deaths. Defensive drivers have much better odds of avoiding accidents than aggressive drivers, and some areas are more prone to accidents than others--there is more traffic in some places, for example. The point is, within the set of all drivers, the probability of an accident varies widely from individual to individual depending on behavior and environment. So general statistics that average over all drivers everywhere don't necessarily tell you your personalized odds of an accident.
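To put some purely made-up numbers on that (illustrative only, not real accident statistics): suppose 80% of drivers are careful, with a 1-in-1,000 yearly accident risk, and 20% are aggressive, with a 1-in-50 risk. The population-wide rate then comes out to about 1 in 208, which overstates the careful driver's risk nearly fivefold and understates the aggressive driver's risk roughly fourfold:

# Purely illustrative numbers -- not real accident statistics.
careful_share, careful_risk = 0.80, 0.001       # 1-in-1,000 yearly risk
aggressive_share, aggressive_risk = 0.20, 0.02  # 1-in-50 yearly risk

overall_risk = careful_share * careful_risk + aggressive_share * aggressive_risk
print(f"population-wide: about 1 in {1 / overall_risk:.0f}")  # ~1 in 208
print(f"careful driver: 1 in {1 / careful_risk:.0f}")
print(f"aggressive driver: 1 in {1 / aggressive_risk:.0f}")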
I mean, I understand that the point the author is trying to make is that we tend to misjudge certain risks, and it is possible that parents are being too risk averse if they, say, deprive their children of a playful childhood because they won't let them go outside unattended or something. But quoting statistics like these certainly doesn't make that case.
Labels: statistics
Sunday, August 17, 2008
What everyone should know about polls
If a candidate's lead in a poll is within the margin of error, that does not mean that it is meaningful to say that the candidates are "in a statistical dead heat". As Kevin Drum once explained:
...what we're really interested in is the probability that the difference is greater than zero — in other words, that one candidate is genuinely ahead of the other. But this probability isn't a cutoff, it's a continuum: the bigger the lead, the more likely that someone is ahead and that the result isn't just a polling fluke. So instead of lazily reporting any result within the MOE as a "tie," which is statistically wrong anyway, it would be more informative to just go ahead and tell us how probable it is that a candidate is really ahead. As a service to humanity, here's a table that tells you:
So, for example, if the margin of error for a poll was 5%, and Candidate A had a lead of 3%, then that means that the probability of Candidate A really being in the lead is 73%. To say that the candidates are in a "statistical tie" or "statistical dead heat" makes you think that there is an equal chance that either candidate could be in the lead--which is wrong.
The moral of the story: when you see Obama ahead in a poll, but the lead is still within the poll's margin of error, don't worry--it's still likely that Obama really is ahead.
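For the curious, here is roughly how a table like Drum's can be computed, assuming a two-candidate race and the usual normal approximation: the poll's 95% margin of error pins down the standard error of each candidate's share, the standard error of the lead is about twice that, and the rest is a normal tail probability. This is my sketch of the idea, not Drum's actual calculation:

from math import erf, sqrt

def prob_really_ahead(lead, margin_of_error):
    """Approximate probability that the reported leader is truly ahead,
    given the lead and the poll's 95% margin of error (both in points).
    Assumes a two-candidate race, so the lead's standard error is about
    twice a single candidate's (margin_of_error / 1.96)."""
    se_lead = 2 * margin_of_error / 1.96
    z = lead / se_lead
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z

print(round(prob_really_ahead(3, 5), 2))  # about 0.72, close to the 73% above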
Labels: poll, statistics