We had two math and statistics professionals look into the likelihood of the New Mexico/Henderson events occurring.
The first report is from an experienced data scientist who prefers to remain anonymous, but whose professional opinion I sought and whose report I will be forwarding to the ethics committee. The data scientist examined three scenarios for the Jan 15 tournament: one using pre-tournament ratings, one using post-tournament ratings, and a third using the lowest published rating over the past year for each Henderson student and the peak rating of each opponent. The data scientist found that the chance of the Jan 15 tournament results occurring, assuming the pre-tournament ratings were accurate, is 0.000000000000000000000000000000888, which is less than one in a nonillion (a 1 with 30 zeroes after it). That is approximately a billion times the number of stars in the observable universe.
Assuming post-tournament ratings led to a probability of 0.000000000045, or about 1 in 22 billion (for comparison, 100 billion is the approximate number of stars in our galaxy).
And the third (most favorable to Henderson) scenario, assuming the Henderson students were at their past-year weakest and their opponents at their lifetime strongest, still found a likelihood of only 0.000000037, which is less than 1 in 10 million.
A second analysis was done by a parent on my team who works in computer programming and statistics. I present his work and conclusions below; for obvious reasons they are very similar to the above. They differ slightly in scenario two because the first statistician assumed post-tournament ratings for both sides, while the second analysis assumed post-tournament ratings only for the New Mexico players. (This scenario was run because an argument is being made that the New Mexican players' ratings were provisional and inaccurate; see below.)
Base Analysis
The main argument is that the EP vs. EG tournament result is highly implausible: the ratings gap between the winners and losers is far too wide for so many simultaneous upsets to occur.
This analysis looked at each individual game, calculated the odds of losing that game, and then calculated the odds of a 0-28 score from the product of those per-game odds. The odds of losing a given game come from the USCF Elo model (see resources below). Specifically, the odds of losing are 1 minus the odds of winning given the two ratings:

P(loss) = 1 - P(win) = 1 - 1 / (1 + 10^((R_opponent - R_player) / 400))
This analysis excludes the possibility of draws, but including them would only lower the odds of losing any given game outright, which would further strengthen this argument.
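The per-game loss odds and their product can be sketched in a few lines of Python. The ratings below are hypothetical placeholders (the actual tournament ratings are not reproduced in this post); only the structure of the calculation is the point.

```python
def win_prob(rating, opp_rating):
    """Expected score for a player under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

def sweep_prob(pairings):
    """Probability that every favorite loses, i.e. a 0-28 style sweep.

    `pairings` is a list of (favorite_rating, underdog_rating) pairs.
    Draws are ignored; including them would make each outright loss
    even less likely, as noted above.
    """
    p = 1.0
    for fav, dog in pairings:
        p *= 1.0 - win_prob(fav, dog)  # chance the favorite loses this game
    return p

# Hypothetical illustration: 28 games, each with a 700-point rating gap.
games = [(1500, 800)] * 28
print(sweep_prob(games))  # an astronomically small number
```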
Given the above, the odds of such a lopsided tournament occurring are one in 1.13 x 10^30. In plain English, that's a one-in-a-nonillion chance of occurring (we had to look that word up; see resources below).
5 sigma is often used as an extreme hurdle for determining validity or significance. Scientists used it to validate the discovery of a new particle (see article). A 5 sigma event occurs about once in 3.5 million times. Not billion. Not trillion.
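The "once in 3.5 million" figure is the one-sided tail probability of a normal distribution five standard deviations out, which can be checked with Python's standard library:

```python
import math

# One-sided tail probability beyond 5 standard deviations of a
# normal distribution: p = (1/2) * erfc(5 / sqrt(2))
p = 0.5 * math.erfc(5.0 / math.sqrt(2.0))
print(1.0 / p)  # about 3.5 million
```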
Post-event Peak Analysis
One argument in defense of the upset team is that the opponents' ratings were provisional and therefore meaningless. It is true that six of the seven winners had provisional ratings. We ran the same test as above, but this time using the opponents' peak ratings from after the suspicious event.
Sure enough, most of the provisionally rated opponents saw their ratings move up (even though much of that movement occurred by beating their much higher-rated opponents in the event in question!). As of April 2018, four players still had provisional ratings, but two of those had played 24 and 25 games respectively, so their ratings are close to non-provisional (26 games are needed for a non-provisional rating).
Using these peak ratings of the opponents, running the same analysis shows the odds of a 0-28 sweep/upset is one in 1.44 x 10^16.
Or, in plain English, one in 14 quadrillion.
This seems like a fair analysis; if you look through the histories of the provisionally rated players, there isn't much to indicate that they are materially or grossly underrated. If anything, they show patterns of consistently losing to lower-rated players.
Even-match Analysis
Finally, all this math aside, the simplest analysis is just to look at the odds of a 0-28 sweep between evenly matched teams (and these teams were far from evenly matched). The odds of such an upset are simply 0.5^28.
Using this method, the odds of this occurring come out to one in 268 million. Remember, 5 sigma is a once-in-3.5-million event, good enough to validate the discovery of a new particle.
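That even-match figure is easy to verify directly:

```python
# Probability that an evenly matched team loses all 28 games outright.
p = 0.5 ** 28
print(1.0 / p)  # 268435456.0, i.e. one in roughly 268 million
```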
Conclusion
Given the above analysis, and especially the final 'even-match' sanity check, it is safe (indeed, astronomically safe) to say that this was not a valid event.
We have seen various analyses of this (including one from a math PhD and professional quantitative analyst/statistician), and the numbers vary due to rounding and other issues, but the conclusion is basically the same: this event is astronomically unlikely to have occurred normally.
Wow. Those are big numbers. I am trying to wrap my head around that... I'd like to believe most of us are rational, reasonable human beings. The odds of dying from an asteroid, meteorite, or comet are 1 in 1,600,000. What does the phrase 'statistically impossible' mean? Is this 'statistically impossible'?
Per multiple statistics websites:
"A statistical impossibility is a probability that is so low as to not be worthy of mentioning. Sometimes it is quoted as although the cutoff is inherently arbitrary. Although not truly impossible the probability is low enough so as to not bear mention in a rational, reasonable argument."
If we are rational reasonable human beings, this is beyond statistically impossible.
My name is Tony Berard (facebook page of the same name). I have been running chess tournaments for some time now, and with the initial ratings being what you describe, this tournament would fail The Ratings Goodness of Fit Test, which is a standard statistical test I modified to be able to compare initial ratings versus performance ratings (not the TPR--the performance ratings calculated by my rating system). Upon failure of this test, I would then have the authority (in my rating system) to change the initial ratings of the most out of spec players (in the case of this tournament, probably all of the players). In short, my rating system would conclude that the initial ratings were wrong.
So, did the lower-rated players all play over their heads to win upsets in every game? No, it is far more likely that the higher-rated players threw their games to lower their ratings. Usually, cheating is not done with collusion (Fischer versus the Russians is a notable exception), but it looks like systemic collusion took place here.
Honestly, anyone with the chutzpah to write a book about winning the U750 section at Nationals is bound to be a person of questionable integrity.
ReplyDeleteStrange things happen with this team...
In addition to the 0-28 results analyzed above, here is another 1-12 performance from that same week, oddly just before the ratings cut-off for Nationals.
http://www.uschess.org/msa/XtblMain.php?201801197852.0
10 upset victories by NM.
Honestly, Henderson and its players should simply be disqualified, all teams and players finishing below them in the U1000 and U750 sections should be moved up accordingly, and new trophies should be purchased and sent out along with a letter explaining what happened and an apology for not taking action when it was brought to the organizers' attention.
Anything short of that will be disgraceful.
The hard part is that, had they not played, much more would have changed. People who lost to them in earlier rounds may have done better had they not faced anyone from that team. Pull them out, and whoever was at the top of the bottom half may have ended up at the bottom of the top half, changing other outcomes in turn. Think of it as "The Butterfly Effect" of the chess world. The impact really reaches much further than just moving everyone up.
True, it reaches much further than moving everyone up, but trying to do it any other way requires pure guesswork. There is no perfect method of undoing all the "Butterfly Effects" that these ineligible players had on the other teams and players in the tournament, but surely the results of such cheating shouldn't be allowed to stand. Unless someone can come up with a better solution, moving everyone up seems to be the fairest option.
I agree with you. I think they should be stripped of the awards/titles. The best option available is to move everyone up, as you say. I just want it on record how far-reaching this is, and that the full extent of the damage is truly undefinable, so that people get a better idea of just how severe it is. Henderson should only be able to participate in open categories, at least as long as this coach is affiliated with them... which they should really reconsider themselves. The players should be made aware of the rules regarding ratings manipulation as well.
Perhaps the best outcome of this mess would be to revamp both the charter and scope of the Ethics Committee.