Judging the Judges in CoP

In Judging the Judges we looked at an approach to rating the quality of U.S. judges marking under the 6.0 system at the U.S. National Championships.  The fundamental challenge there was to develop an absolute standard for rating judges who use a scoring system based on relative placements.  Despite this challenge, we found it was possible to reliably distinguish between individuals of greater and lesser judging skill, and to produce a meaningful ranking of the judges.

Under CoP, ranking the judges is in principle a much easier task.  The judges are evaluating specific skating skills to an absolute standard, and the accuracy and frequency with which they conform to that standard should be easily quantifiable.

To set up an accountability system under CoP there are basically three questions: How do you determine the "correct" marks to which the judges' marks should be compared?  How much may a judge's mark depart from the correct mark before it is considered an error?  How often may a judge's marks be in error before the judge is considered unfit?  In this article the term "marks" includes the grade of execution (GoE) for each element and each program component mark.

ISU Communication 1275 describes the approach the ISU is taking to evaluate the performance of its judges.  The approach is to decide, from the actual marks, what the "correct" marks should have been, and then to define bands around the correct marks within which the judges' marks must lie.  Marks outside the band are referred to as "anomalies" (in statistics these are referred to as outliers).  Judges with an unacceptable frequency of anomalies are subject to corrective action.

The ISU evaluation process is conducted in total secrecy.  Ratings of the judges are not published, nor are the names of any judges subject to corrective action.  All the ISU has disclosed is that one person was demoted from referee to judge this past season, and 28 judges were given cautions over the quality of their judging.  No judges were demoted or lost their appointments due to the quality of their judging.  No judges were rated incompetent.

This article is an independent analysis of the judges' marks for the 2003 Grand Prix of Figure Skating.  The analysis makes use of the published marks from all judges in all segments of all events, at all competitions of the 2003 Grand Prix.  We first describe the methodology of the analysis, and then give the results.

Determining the Correct Marks

One approach to determining the correct marks would be to review video tapes of competitions after the fact in gruesome detail, comparing the performances meticulously to the rules and deciding on the correct marks after careful consideration by a panel of the best judges.  Not very practical though.

The approach the ISU takes, and we take here, is that the consensus opinion of the panel is the correct mark the vast majority of the time; so frequently, it is assumed, that anomaly statistics calculated in this way are a fair measure of the quality of the judging.

The question then becomes, how does one determine the consensus mark for each element and program component?  Typical choices include using the average mark, the median mark, a trimmed mean mark or the most common mark.

If the goal is to identify outliers, then you have to assume they are present, and if they are present any form of average may be contaminated by them.  This makes the use of any kind of average undesirable.  In science and engineering, the median value is most commonly used when data are potentially contaminated by outliers.  If the statistical distribution of the data is not symmetric, use of the most common value is usually preferable.

In this study we use the most common mark as the "correct" mark to which we compare the marks of each judge.
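As a concrete sketch, the consensus ("most common") mark for a panel can be found with a simple tally.  This is illustrative Python of our own, not the ISU's or this study's actual software; the function name and the sample panel are hypothetical.

```python
from collections import Counter

def consensus_mark(marks):
    """Return the most common mark on the panel.  Ties are broken by
    whichever mark was encountered first, purely for illustration."""
    return Counter(marks).most_common(1)[0][0]

# A hypothetical panel of nine GoE marks for one element:
panel = [0, 0, 1, 0, -1, 0, 1, 0, 0]
print(consensus_mark(panel))  # prints 0
```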

Identifying Outliers

How much may a judge's mark depart from the correct mark to be considered an error?

One approach to the question is to ask what level of agreement the scoring system requires from the judges for the results to be meaningful.  In CoP, results are determined to 1/100 of a point, which in the men's free skating is one-half of one-hundredth of one percent of the total score.

For a panel of judges to produce a result with this precision, a single judge in a panel of nine must agree with the most common mark to within a few hundredths of one percent nearly all the time.  By this criterion, every mark that is not the most common mark is an outlier, and for all judges the majority of their marks are outliers.  Consequently, by this criterion virtually all judges in the 2003 Grand Prix would have to be identified as incompetent.

This is an unreasonable conclusion, of course, because we all know (I hope) that humans can't do anything with a precision of a few hundredths of one percent.  What is unreasonable about the conclusion is that CoP requires of the judges a level of performance humans can never meet, and so comparing the judges to that standard is unfair to the judges.

Rather than have the scoring system tell us how well the judges have to perform, we instead ask, what level of performance can reasonably be expected from well trained humans, and then see how close the judges in the Grand Prix come to that expectation.

Studies of human perception, and analysis of the Grand Prix scores tell us that the best we can expect of human judges for consistency in the marks is plus or minus 5-7.5%, about 2/3 of the time.  To be on the generous side, we take the upper value and allow the judges a variation in their marks of plus or minus 7.5%, and adopt the following criteria for outliers.

For the grades of execution, any GoE that departs from the most common GoE by more than one GoE is an outlier.  Any GoE that departs from the most common GoE by exactly one GoE is an outlier unless three or more judges assign that GoE.

For the program components, any mark that departs from the most common mark by more than 0.50 is an outlier.
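These two criteria translate directly into code.  The sketch below is our own illustrative Python (the function names are hypothetical); it flags a single judge's mark against the panel:

```python
from collections import Counter

def is_goe_outlier(judge_goe, panel_goes):
    """Apply the two GoE criteria: more than one grade from the most
    common GoE is always an outlier; exactly one grade away is an
    outlier unless three or more judges assigned that same GoE."""
    most_common = Counter(panel_goes).most_common(1)[0][0]
    diff = abs(judge_goe - most_common)
    if diff > 1:
        return True
    if diff == 1:
        return panel_goes.count(judge_goe) < 3
    return False

def is_component_outlier(judge_mark, panel_marks):
    """Flag a program component mark more than 0.50 from the most common mark."""
    most_common = Counter(panel_marks).most_common(1)[0][0]
    return abs(judge_mark - most_common) > 0.50
```

For example, on a panel whose GoEs are [0, 0, 0, 0, 1, 1, 1, -1, -2], a mark of +1 is one grade off but shared by three judges (not an outlier), while the lone -1 and the -2 are both outliers.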

These standards are extremely generous, and are about 50,000 times worse than what the system mathematically requires to determine results to 1/100 of a point; i.e., in this evaluation we allow the performance of the judges to be 50,000 times worse than what the system actually requires.

Frequency of Outliers

Having decided on the definition of an outlier, the next step is to decide how often the judges' marks may be outliers before we consider their performance sub-standard; for in saying the marks should lie within some range of the most common mark, we do not expect that to be the case all the time, just the majority of the time.

We adopt the standard that no more than 30% of a judge's marks should be flagged as outliers according to the two criteria above.  Implicit in this is the assumption that the variation of the judges' marks follows a normal error distribution.  Spot checks of the judges' marks in the Grand Prix indicate that the error distribution appears to be consistent with a normal error distribution, and at the level of interest here it is not worth worrying about whether that is exactly true or just a good approximation.  Further, to allow nearly one-third of the marks to be outliers seems more than generous in the performance expected of the judges, just based on common sense.  After all, how many of us have jobs where we are allowed to be wrong 1/3 of the time and still be considered to be doing an acceptable job!
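The 30% figure is roughly what a normal error distribution predicts if the plus or minus 7.5% tolerance band corresponds to about one standard deviation of a judge's scatter; the fraction of a normal distribution falling outside one standard deviation can be checked directly (our own illustrative calculation, assuming the one-sigma correspondence):

```python
import math

# Fraction of a normal distribution lying outside plus or minus one
# standard deviation.  If the tolerance band corresponds to about one
# sigma, roughly this fraction of even a well-functioning judge's
# marks will fall outside the band.
outside_one_sigma = 1.0 - math.erf(1.0 / math.sqrt(2.0))
print(round(100 * outside_one_sigma, 1))  # prints 31.7
```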

Rating the Judges

For each judge in each event segment, in each competition, we have calculated the frequency of outliers for all elements and program components for all skaters in the event segment and hold them to the following standard:

Performance Standard for Frequency of Outliers

Standard              Percent Outliers
Excellent             0 - 5
Good                  6 - 15
Adequate              16 - 30
Inadequate            31 - 49
Grossly Inadequate    50 - 65
Incompetent           66 - 100
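The rating table maps directly to a simple lookup.  The sketch below is our own illustrative Python transcription of the bands (the function name is hypothetical):

```python
def outlier_rating(percent_outliers):
    """Map a judge's outlier percentage to the rating bands in the
    table above."""
    bands = [(5, "Excellent"), (15, "Good"), (30, "Adequate"),
             (49, "Inadequate"), (65, "Grossly Inadequate")]
    for upper_limit, label in bands:
        if percent_outliers <= upper_limit:
            return label
    return "Incompetent"

print(outlier_rating(28))  # prints Adequate
```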

We feel this standard is more than generous, allowing nearly 2/3 of a judge's marks to be outliers before their performance is considered incompetent.

Outlier statistics for the elements and program components, however, are only a part of the story in assessing the performance of the judges.  So, in addition to these outlier statistics, we also looked at other aspects of the judging. 

For each judge we calculated the total score they arrived at for each skater in each event segment.  If a judge's total score deviated from the actual score of a skater by more than 7.5%, we considered that an outlier for total points.  Using the total scores from each judge for all the skaters in an event segment, we thus calculated the outlier frequency for the total scores in each event segment.

The process for the total score was repeated for each of the five skating skills (e.g., in singles: jumps, spins, sequences, basic skating and presentation).  For each skill we calculated the points from each judge for each skater and compared that to the result for the overall panel in each event segment.  Deviations of more than 7.5% were identified as outliers and the frequency of outliers determined.

In this way, we determine whether a judge is marking the individual elements and program components correctly, as well as whether their total scores and the scores for the five skills are correct.  The seven outlier rates are then combined (using a root-sum-square, or RSS) to determine the overall outlier score for each judge in each event segment.  This overall outlier score is a comprehensive rating of how well a judge functioned under the new judging system.  The lower this score, the better (the less frequently the judge's marks are outliers).
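The RSS combination of the seven rates can be sketched as follows.  Whether the combined score is rescaled afterward is not stated, so this illustrative Python (our own; the function name is hypothetical) takes the plain root-sum-square:

```python
import math

def overall_outlier_score(rates):
    """Root-sum-square (RSS) of the outlier rates: elements, program
    components, total points, and the five skills give seven rates in
    all.  Any normalization of the result is omitted here."""
    return math.sqrt(sum(rate * rate for rate in rates))

print(overall_outlier_score([30.0, 40.0]))  # prints 50.0
```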

To determine how well the judges were marking to the same standard, one additional statistic was calculated.  For each judge and each skater in an event segment, the total score from the judge was divided by the score from the panel as a whole.  The average ratio over all skaters in the event segment was then calculated for each judge.

This average ratio is the factor by which each judge systematically marks on a point scale that differs from the panel as a whole.  (In data analysis this kind of systematic error is referred to as a "bias.")

Because CoP assumes the judges are marking to the same absolute standard, it is important that this average ratio be close to 1.  For example, if the average ratio for a judge is 1.02 this would mean that on the average the judge is marking the skaters 2% higher than the rest of the panel.  Likewise an average ratio of 0.97 would mean the judge, on the average, is marking 3% lower than the rest of the panel.
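The bias ratio just described can be sketched as below (our own illustrative Python; the names and sample scores are hypothetical):

```python
def bias_ratio(judge_totals, panel_totals):
    """Average, over the skaters in a segment, of the ratio of a
    judge's total score for each skater to the panel's total score."""
    ratios = [judge / panel for judge, panel in zip(judge_totals, panel_totals)]
    return sum(ratios) / len(ratios)

# A hypothetical judge who runs about 2% high over a three-skater segment:
print(round(bias_ratio([102.0, 51.0, 204.0], [100.0, 50.0, 200.0]), 2))  # prints 1.02
```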

We adopt the following standard for this metric:

Performance Standard for Total Point Bias

Standard              Percent
Excellent             0 - 1
Good                  2 - 3
Adequate              4 - 5
Inadequate            6 - 10
Grossly Inadequate    11 - 15
Incompetent           16 or more

To put this in context, a 5% systematic error in the men's free skating can be as much as 10 points in the skater's score for a single judge; which is a considerable number of points considering the typical point difference between places in the Grand Prix was 3-4 points.

Evaluation of the Judges

An example of the results of these calculations can be found in these details for the ladies event at Cup of China.  This example shows how consistently poor the judging was in some events, and includes the case of one judge who had an anomaly score of greater than 90% and systematically marked the skaters more than 18% higher than the rest of the panel.

Rather than list the details from all the calculations for each event segment we instead provide here a summary of the frequency with which the judges were found to meet the above standards.

[Due to the anonymity used in CoP, we are treating each set of marks in each event segment as a "judge".  To the extent all judges serve on roughly the same number of event segments the percentages apply to numbers of individual judges.]

Frequency of Anomaly Rankings (In Percent)

Rating               Skate America  Skate Canada  Cup of China  Lalique  NHK  Cup of Russia  Final
Excellent                  0             0             0            0      1        0           0
Good                       2             5             0            3      3        1           6
Adequate                  26            23            20           36     36       25          36
Inadequate                55            64            60           50     49       67          49
Grossly Inadequate        14             6            16           10      8        4           9
Incompetent                2             1             3            0      2        2           0

In four of the six Grand Prix competitions only about 1/4 of the judges had anomaly scores that were adequate or better (anomaly rate less than 30%).  After six competitions, 42% of the judges in the Final had anomaly scores that were adequate or better.  Overall, though, only 32% of the judges in the Grand Prix had anomaly scores that ranked them as adequate or better.  This can be compared to our study of judges at U.S. Nationals where 90% of the judges were rated adequate or better, roughly three times better than the Grand Prix judges.

About 1.5% of the judges had anomaly scores that rated them as incompetent (anomaly rate greater than 65%), and 11% had anomaly scores that rated them as grossly inadequate or incompetent (anomaly rates greater than 50%).  Only 3% of the judges had good or excellent anomaly rates.  In general, 1/2 or more of the judges in each Grand Prix competition were rated inadequate for anomalies, and this was the most common rating in each competition.

There was a slight improving trend in the frequency of anomalies over the course of the Grand Prix.  Nevertheless, by the end of the Grand Prix the majority of the judges still had inadequate anomaly rates or worse, and in the entire Grand Prix only one judge earned an excellent rating for anomalies.

Frequency of Bias Factor Rankings (In Percent)

Rating               Skate America  Skate Canada  Cup of China  Lalique  NHK  Cup of Russia  Final
Excellent                 24            38            16           18     34       29          25
Good                      45            34            35           39     39       44          45
Adequate                  20            17            26           25     16       15          14
Inadequate                 9             9            16           14      8        9          13
Grossly Inadequate         1             1             5            2      1        2           3
Incompetent                0             0             1            1      1        0           0

The judges in the Grand Prix did better at marking on the same scale than they did in anomaly scores.  Only 2.5% of the judges were grossly inadequate or incompetent in terms of bias factor, and 14% scored inadequate or worse.  The majority, however, were adequate or better, and about 1/4 were excellent in this area.

Bottom Line

The protocol of marks provided under CoP allows a detailed understanding of the scoring of each skater and an analysis of their strengths and weaknesses.  It also allows a detailed understanding of the strengths and weaknesses of the judges, and provides the opportunity to rate the judges in a quantitative way.

In order for competition results to be fair and meaningful, the judges must be able to mark the individual elements and program components according to the requirements of the rules in a consistent and accurate way.  Their marks must also result in accurate and consistent total point values and point values for the five skating skills.  In addition, all judges should be marking on the same point scale, without systematic differences from judge to judge.  In the absence of this, it is just dumb luck whether the judges agree or not.

For the 2003 Grand Prix competitions we have scored the judges in the areas of correct marking of the individual elements and program components, correct evaluation of total points, correct evaluation of points for the five skating skills, and marks conforming to a uniform numerical scale.  The standard we have held the judges to is extremely lax, about 50,000 times worse than what the mathematics of the scoring system actually demands, and we only require that the judges agree to plus or minus 7.5% (a range of about 1 on the 6.0 scale).

Even with this very low standard we find that 2/3 of the judges in the Grand Prix did not meet minimum expectations for competence.  Some judges (1.5%) had anomaly rates greater than 65% (some of these greater than 90%)  and systematic scoring bias factors greater than 15%.  It would be interesting to know who these judges are and why the ISU has decided they are qualified to judge ISU competition next season, but the anonymity of the scoring system precludes that.


Copyright 2004 by George S. Rossano

15 August 2004