This could be the setup for any number of jokes about judges (send us your favorites and maybe we will post them if they are any good). But the actual question is how many judges are really needed to score a competition under the new system. One proposal at the ISU is to keep the number of judges the same and to use all the judges in the scoring. Another calls for splitting a panel of 12 into six element judges and six Program Component judges, and scoring with the resulting six sets of marks.
Some in U.S. Figure Skating want to go the other way and cut the number of judges on a panel to the bone (five or fewer), and some advocate judging with three element judges and three Program Component judges, determining results using just three sets of scores.
The bottom line is that, for the calculation method adopted in the new judging system, the number of sets of marks needed to obtain high confidence in the results depends on the spread in the judges' marks. The greater the diversity of opinion among the judges, the more scores are needed to obtain a meaningful set of results.
In the 6.0 system the judges are expected to combine all the various skating skills into a placement without a great deal of guidance for how much importance to give each skill. So each judge can come to a different conclusion depending on how much importance they gave a specific skill, strength or weakness. The variety in the importance the judges give each of these is probably the main source of the variety of placements among the judges.
IJS is designed to eliminate that. The importance of each skill, and the impact of each error, is fixed by the mathematical construction of the system. You would expect, then, that the judges would show better agreement under IJS than under 6.0, and fewer judges would be required. But it isn't so. Actually it is significantly worse, so that one needs more sets of marks under IJS than under 6.0 to get the same confidence in the results. Previous calculations show that a good rule of thumb is that under IJS one needs two more judges than are needed under 6.0; that is, the "quality" of an IJS result with seven judges is equivalent to using five judges under 6.0.
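The arithmetic behind panel size is simple: the uncertainty of a panel average shrinks only as the square root of the number of judges, so any growth in the spread of the marks must be paid for quadratically in judges. Here is a minimal sketch of that relationship; `judges_needed` is a hypothetical helper and the numbers are invented, not actual IJS data:

```python
import math

def judges_needed(sigma, margin, z=1.96):
    """Panel size needed so the standard error of the panel's mean score
    stays below a given margin at roughly 95% confidence.

    sigma  : standard deviation of a single judge's score (spread of opinion)
    margin : maximum acceptable uncertainty in the panel average, in points
    """
    # The standard error of a mean of n scores is sigma / sqrt(n), so we need
    # z * sigma / sqrt(n) <= margin,  i.e.  n >= (z * sigma / margin) ** 2
    return math.ceil((z * sigma / margin) ** 2)

# Doubling the spread of the judges' marks quadruples the panel size needed
# for the same confidence (hypothetical numbers):
print(judges_needed(sigma=1.0, margin=0.8))  # 7
print(judges_needed(sigma=2.0, margin=0.8))  # 25
```

The quadratic cost is why modest extra disagreement under IJS cannot be fixed by adding one or two judges.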
Results from the recent Olympic Winter Games illustrate how much diversity of opinion exists among the best judges after three years of experience with the new system, and thus provide insight into how many sets of marks are needed to score a competition.
In terms of diversity of opinion, the worst case in Torino was the dance event. In the Compulsory Dance segment, five couples were scored best by at least one of the twelve judges. No couple was considered best by more than three of the twelve judges. Moreover, for each of these five couples, another judge considered them dreadful. Isabelle Delobel & Olivier Schoenfelder, for example, were scored best by one judge and ninth by another. Ninth! In other words, the panel could not agree whether this couple was the best of the 24, or only belonged in the middle third of the group! The winners of this dance segment, Barbara Fusar Poli & Maurizio Margaglio, were scored best by only three judges, while another judge thought seven other couples were better than the Italians.
| Couple | J1 | J2 | J3 | J4 | J5 | J6 | J7 | J8 | J9 | J10 | J11 | J12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. FUSAR POLI Barbara & MARGAGLIO Maurizio | 8 | 2 | 4 | 2 | 3 | 6 | 2 | 3 | 1 | 6 | 1 | 1 |
| 2. NAVKA Tatyana & KOSTOMAROV Roman | 4 | 7 | 8 | 1 | 1 | 2 | 1 | 7 | 4 | 5 | 4 | 2 |
| 3. DENKOVA Albena & STAVISKI Maxim | 5 | 5 | 5 | 5 | 5 | 4 | 4 | 4 | 3 | 8 | 2 | 3 |
| 4. DUBREUIL Marie-France & LAUZON Patrice | 2 | 1 | 2 | 6 | 6 | 7 | 7 | 1 | 5 | 1 | 3 | 6 |
| 5. GRUSHINA Yelena & GONCHAROV Ruslan | 6 | 4 | 7 | 3 | 2 | 3 | 3 | 8 | 2 | 3 | 5 | 4 |
| 6. BELBIN Tanith & AGOSTO Benjamin | 1 | 6 | 3 | 4 | 4 | 1 | 5 | 5 | 6 | 2 | 7 | 5 |
| 7. DELOBEL Isabelle & SCHOENFELDER Olivier | 3 | 3 | 1 | 8 | 7 | 5 | 6 | 2 | 7 | 4 | 9 | 8 |
Not much agreement for a single skill (Presentation -- PC 2 and 4) either!
| Couple | J1 | J2 | J3 | J4 | J5 | J6 | J7 | J8 | J9 | J10 | J11 | J12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. FUSAR POLI Barbara & MARGAGLIO Maurizio | 7 | 2 | 4 | 2 | 3 | 5 | 2 | 3 | 1 | 5 | 1 | 1 |
| 2. NAVKA Tatyana & KOSTOMAROV Roman | 5 | 3 | 8 | 1 | 1 | 1 | 1 | 6 | 2 | 5 | 2 | 2 |
| 3. DENKOVA Albena & STAVISKI Maxim | 3 | 7 | 5 | 3 | 5 | 4 | 4 | 4 | 2 | 7 | 2 | 4 |
| 4. DUBREUIL Marie-France & LAUZON Patrice | 1 | 1 | 2 | 9 | 6 | 7 | 5 | 1 | 5 | 2 | 6 | 7 |
| 5. GRUSHINA Yelena & GONCHAROV Ruslan | 7 | 4 | 6 | 3 | 2 | 3 | 2 | 8 | 2 | 2 | 4 | 2 |
| 6. BELBIN Tanith & AGOSTO Benjamin | 1 | 4 | 3 | 5 | 4 | 2 | 5 | 4 | 5 | 1 | 4 | 5 |
| 7. DELOBEL Isabelle & SCHOENFELDER Olivier | 3 | 4 | 1 | 5 | 7 | 6 | 7 | 2 | 5 | 2 | 7 | 7 |
In the Original Dance four couples were scored best by at least one judge, and in the Free Dance five. Again, each of the couples thought best by at least one judge was also thought mediocre (as low as eighth place) by another judge. Under the 6.0 system the expected performance for a judge was to get the top four competitors in their top group of four (and the bottom four in their bottom group of four), and usually that was the case. No longer. In the Compulsory Dance only one judge had the panel's top four as their top four. For the Original Dance it was five judges, and in the Free Dance only two. The panels did nearly as badly picking out the bottom four.
Under IJS, the judges are also supposed to be evaluating each individual element and each Program Component according to strictly specified criteria. Under the IJS calculation method, the judges should at least agree on who is the best spinner, or who has the best twizzles, the best footwork, or the best presentation. But no. That isn't true either.
In the Original Dance, for example, eight couples were thought to have the best lifts by at least one judge. Worse, ten couples had the best spins in the view of at least one judge, six couples had the best sequences and twizzles, five had the best skating skills and transitions, and four had the best presentation. And for every judge who had a couple first in a skill, another judge had them very low, in some cases as low as twelfth! Let's see now: you're the best in a skill; no, wait, you're really only a middle-of-the-pack couple. Which is it? And why do the marks turn out this way?
In terms of the overall results, one finds that if all 12 sets of marks are used, only 10 of the 24 places in the Dance event are statistically certain. Nearly 60% of the time the results do not give a statistically definitive answer. This is not a fault of the calculation method; it is a fault in the performance of the judges and their wildly varying assessments of the skating. To produce results in the Dance event where the majority of the places were statistically significant, one would have to double the number of sets of marks!
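One way to make "statistically certain" concrete is to resample the panel and see how stably one skater stays ahead of another. The sketch below uses a simple bootstrap over hypothetical per-judge totals; it illustrates the idea, and is not the article's actual calculation method:

```python
import random

def place_is_certain(scores_a, scores_b, trials=10000, level=0.95, seed=1):
    """Bootstrap check: resample the panel (with replacement) and ask how
    often skater A still outscores skater B on the resampled panel's total.
    If A wins in at least `level` of the trials, call the place 'certain'.
    Hypothetical criterion and numbers for illustration only."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]  # resampled judge panel
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / trials >= level

# Judges agree: A is clearly ahead of B on every judge's card.
a = [62, 61, 63, 62, 60, 61, 63, 62, 61, 62, 63, 61]
b = [58, 57, 59, 58, 56, 57, 59, 58, 57, 58, 59, 57]
print(place_is_certain(a, b))  # True

# Judges split: a similar average gap, but a huge spread of opinion,
# so the ordering is no longer statistically certain.
c = [70, 52, 71, 50, 72, 51, 70, 53, 71, 52, 70, 53]
d = [55, 65, 54, 66, 53, 64, 55, 66, 54, 65, 56, 64]
print(place_is_certain(c, d))  # False
```

The second pair shows the Torino pattern in miniature: the averages differ, but the judges disagree so much that the difference carries no statistical weight.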
In the Torino Dance event, one consequence of this diversity of opinion (using 9 of 12 sets of marks) was that the top couples were essentially tied in both the Compulsory Dance and the Original Dance. The same was also true in the Ladies Short Program, where only 0.71 points separated the top three skaters, and only 0.03 points separated the top two. Further, with so much variety among the scores, the random selection of the judges can have a crucial impact on the calculation of the results. When one recalculates the results of all the skating events using all twelve sets of marks, one finds several place switches in the intermediate segment results, and a small number in the final results.
As an example of the impact of random selection, if all twelve sets of marks are used in the Ladies Short Program, Irina Slutskaya actually wins the Short Program, 0.09 points ahead of Sasha Cohen. In fact, the only thing that kept the Ladies event from being a scandal was that Cohen did not get a second deduction in the Free Skating when she put her hands down on her triple flip. Had that been scored a deduction, then second and third place would have been determined by the random selection of the panel, and not the skating.
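The effect of randomly selecting 9 of 12 sets of marks can be made concrete by enumerating every possible 9-judge panel. In this invented example (not the actual Torino protocol), skater A leads by 3.0 points when all twelve marks are counted, yet exactly half of the 220 possible 9-judge panels would put skater B first:

```python
import itertools

# Hypothetical segment totals from a split 12-judge panel: six judges have
# skater A well ahead, six have skater B ahead, and A leads by 3.0 points
# (723.0 vs 720.0) when all twelve sets of marks are counted.
a = [65.0] * 6 + [55.5] * 6
b = [60.0] * 12

def panel_outcomes(scores_a, scores_b, panel_size=9):
    """Count, over every possible choice of `panel_size` judges from the
    full panel, which skater wins on the selected judges' totals."""
    n = len(scores_a)
    a_wins = b_wins = 0
    for panel in itertools.combinations(range(n), panel_size):
        diff = sum(scores_a[i] - scores_b[i] for i in panel)
        if diff > 0:
            a_wins += 1
        else:
            b_wins += 1
    return a_wins, b_wins

wins_a, wins_b = panel_outcomes(a, b)
print(wins_a, wins_b)  # 110 110 -- the "winner" is a coin flip of panel choice
```

When the spread of opinion is large relative to the margin of victory, which nine judges happen to count decides the medal, not the skating.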
| Skater | J1 | J2 | J3 | J4 | J5 | J6 | J7 | J8 | J9 | J10 | J11 | J12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. PLUSHENKO Evgeni | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2. WEIR Johnny | 4 | 3 | 2 | 2 | 3 | 2 | 4 | 3 | 2 | 4 | 3 | 2 |
| 3. LAMBIEL Stephane | 3 | 2 | 3 | 4 | 5 | 3 | 3 | 2 | 3 | 3 | 2 | 3 |
| 4. JOUBERT Brian | 2 | 4 | 7 | 3 | 2 | 4 | 2 | 4 | 4 | 2 | 6 | 4 |
| 5. TAKAHASHI Daisuke | 6 | 9 | 4 | 6 | 6 | 6 | 10 | 5 | 5 | 5 | 7 | 6 |
| 6. BUTTLE Jeffrey | 5 | 5 | 5 | 5 | 4 | 5 | 5 | 6 | 10 | 6 | 4 | 5 |
In singles and pairs the variety of opinion was a little less than in the Dance event, but still there was a huge spread of opinion for the total value of an event segment and for individual skills. In singles and pairs, overall about 40% of the results are not statistically certain.
So, in the Ladies Free Skate, was Emily Hughes the second best jumper, or the ninth? Was Kimmie Meissner the third best or the eighth? Was Irina Slutskaya the third best or tenth? For perhaps the most obvious of the skills (jumps) the judges show a huge variety of opinion.
The bottom line: whether you look at total points or points for an individual skill, whether you look at results from a single event segment or an entire event, the marks under IJS produce results that are uncertain, and open to question, half the time or more. This is grossly unfair to the skaters and must be fixed. The question is how.
It will not be fixed by decreasing the number of scores included in the results calculation. If anything, experience over the last three years says the number of sets of marks would have to be substantially increased to compensate for the huge variation of opinion among the judges. Something that is, unfortunately, just not practical.
The long term solution lies in far more extensive and intensive training of the judges, coupled with more perceptive monitoring of the judges' performance than currently exists. The ISU, and indeed all the ISU members, have grossly underestimated the difficulty of training the officials to do the job asked of them under IJS. The job in the U.S. is particularly challenging, since the judges must be "calibrated" not only for each event segment and skill within an event segment, but also for the nearly two dozen different competition divisions that exist within the U.S. Figure Skating competition structure. Skaters, coaches and officials should be collectively terrified at the thought of how difficult this will be to accomplish. I know I am.
Copyright 2006 by George S. Rossano