We had two math and statistics professionals look into the likelihood of the New Mexico/Henderson events occurring.
The first report is from an experienced data scientist, who prefers to remain anonymous, but whose professional opinion I sought and whose report I will be forwarding to the ethics committee. The data scientist examined three scenarios in the Jan 15 tournament: one using pre-tournament ratings, one using post-tournament ratings, and a third using the lowest published rating of each Henderson student in the past year and the peak rating of each opponent. The data scientist found that the chance of the Jan 15 tournament occurring, assuming the pre-tournament ratings were accurate, is 0.000000000000000000000000000000888, which is less than one in one nonillion (a 1 with 30 zeroes after it). That is approximately a billion times the number of stars in the observable universe.
Assuming post-tournament ratings led to a probability of 0.000000000045, or roughly one chance in 22 billion (for comparison, 100 billion is the approximate number of stars in our galaxy).
And the third scenario (most favorable to Henderson), assuming the Henderson students were at their past-year weakest and their opponents were at their lifetime strongest, still found a likelihood of only 0.000000037, which is less than 1 in 10 million.
A second analysis was done by a parent on my team who works in computer programming and statistics. I present his work and conclusions below; for obvious reasons they are very similar to the above. They differ slightly in scenario two because the first statistician used post-tournament ratings for both sides, while the second analysis used post-tournament ratings only for the New Mexico players. (This scenario was run because an argument has been made that the New Mexico players' ratings were provisional and inaccurate; see below.)
Base Analysis
The main argument is that the EP vs. EG tournament result is highly implausible. The ratings difference between the winners and losers is far too wide for that many simultaneous upsets to occur.
This analysis looked at each individual game, calculated the odds of the higher-rated player losing that game, and then calculated the odds of a 0-28 score by multiplying those per-game odds together. The odds of losing a given game come from the USCF Elo model (see resources below). Specifically, the probability of losing a given game is 1 minus the probability of winning given the two ratings:

    P(loss) = 1 - P(win), where P(win) = 1 / (1 + 10^((R_opponent - R_player) / 400))
This analysis excludes the possibility of draws; if draws were included, the odds of an outright loss in any given game would be even lower, which would only strengthen the argument.
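For concreteness, here is a minimal sketch of that calculation in Python. The ratings shown are placeholders, not the actual pairings from the event, and the games are treated as independent with no draws:

    # Standard Elo expected score: probability that player A beats player B.
    def p_win(rating_a, rating_b):
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    # Placeholder pairings (higher-rated player, lower-rated opponent) for the 28 games.
    pairings = [(1500, 800)] * 28

    # Probability that every higher-rated player loses, ignoring draws.
    p_sweep = 1.0
    for favorite, underdog in pairings:
        p_sweep *= 1.0 - p_win(favorite, underdog)
    print(p_sweep)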
Given the above, the odds of such a lopsided tournament occurring are one in 1.13 x 10^30. In plain English, that is a one-in-a-nonillion chance (we had to look that word up; see resources below).
5 sigma is often used as an extreme hurdle to determine validity or significance. Scientists used it to validate the discovery of a new particle (see article). A 5-sigma event is one that occurs about once in 3.5 million trials. Not billion. Not trillion.
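For reference, the "once in 3.5 million" figure is just the one-sided tail probability of a normal distribution at 5 standard deviations; a quick check, assuming SciPy is available:

    from scipy.stats import norm

    p = norm.sf(5)   # one-sided 5-sigma tail probability, about 2.9e-7
    print(1 / p)     # about 3.5 million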
Post-event Peak Analysis
One argument in defense of the upset team is that the opponents' ratings were provisional and therefore meaningless. It's true that six out of the seven winners had provisional ratings. We ran the same test as above, but this time using the peak ratings those opponents reached after the suspicious event.
Sure enough, most of the provisionally rated opponents saw their ratings move up (even though much of that gain came from beating much higher-rated players in the very event in question!). As of April 2018, four players still had provisional ratings, but two of those had played 24 and 25 games respectively, so their ratings are close to non-provisional (26 games are needed for a non-provisional rating).
Running the same analysis with these peak opponent ratings shows that the odds of a 0-28 sweep/upset are one in 1.44 x 10^16.
Or, in plain English, one in 14 quadrillion.
This seems like a fair analysis; if you look through the histories of the provisionally rated players, there isn't much to indicate that they are materially or grossly underrated. If anything, they show patterns of consistently losing to lower-rated players.
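Mechanically, this scenario is just the same per-game calculation with the peak ratings substituted in; a minimal sketch under that assumption (the ratings shown are placeholders, not the actual published peaks):

    # Same Elo helper as in the earlier sketch.
    def p_win(rating_a, rating_b):
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    # Substitute each winner's post-event peak rating (placeholder values shown here).
    peak_pairings = [(1500, 1000)] * 28
    p_sweep_peak = 1.0
    for favorite, underdog in peak_pairings:
        p_sweep_peak *= 1.0 - p_win(favorite, underdog)
    print(p_sweep_peak)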
Even-match Analysis
Finally, all this math aside, the simplest analysis is to look at the odds of a 0-28 sweep between evenly matched teams (and these teams were far from evenly matched). The odds of such a sweep are simply 0.5^28.
Using this method, the odds of such a sweep come out to one in 268 million. Remember, 5 sigma is a once-in-3.5-million event, good enough to validate the discovery of a new particle.
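A quick check of that arithmetic:

    print(0.5 ** 28)   # about 3.7e-09
    print(2 ** 28)     # 268,435,456, i.e. roughly one in 268 million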
Conclusion
Given the above analysis, and even the final 'even-match' sanity check on its own, it is safe (exceedingly, astronomically safe) to say that this was not a valid event.
We have seen various analyses of this (including one from a math PhD who is a professional quantitative analyst/statistician), and the numbers may vary due to rounding and other issues, but the conclusion is basically the same: it is astronomically unlikely that this event occurred normally.