Judging the Judges

Recently we have been looking at the statistical properties of the marks and ordinals from the 2004 U.S. National Championships to see what they tell us about the quality of judging for events of different size, duration and competition level. We have been looking at U.S. Nationals because we happen to have all the scoring information easily accessible in our computer, and because we believe the judges at U.S. Nationals form a more or less homogeneous group of well-trained judges free of intentional bias.

To carry out that study we had to come up with metrics to quantify the quality of the judging, and this naturally raised the issue that in scoring the overall quality of the judging in events one can also score the performance of the individual judges. So, side-tracking somewhat from the main goal, we took a look at the performance of the judges themselves at the 2004 U.S. National Championships.

The problem of ranking or scoring the performance of judges is a difficult and touchy subject. Over the years attempts to deal with this have mostly ended in failure either because of the difficult part or the touchy part. Having more curiosity than sense, we will ignore the touchiness and focus here on the mathematics of it.

The fundamental reason it is difficult to judge the judges is that there is no absolute truth to compare their performance to. (This absolute truth is known as God’s Truth in the world of instrument calibration, since only God knows the "correct" answer.) In a skating competition we really don’t know with absolute certainty which skater deserves each and every place, so we can never be absolutely certain how close any one judge came to the correct answer (truth) in judging an event. Implicit in our approach is the idea that there is a correct answer; we just aren't perfectly certain what it is.

Our best guess at the truth is the result from the panel as a whole. We assume in this exercise that the official result from the panel as a whole is the truth the vast majority of the time when averaged over many event segments.

It follows, then, that when looking at the placement from any one judge for any one skater we can never be certain if a difference from the official result is due to an error in judgment on the part of the judge, or an error on the part of the panel. However, if the panel’s result is the truth the vast majority of the time, then in a statistically significant sample of events the placements of each judge should agree with the official results to within the natural statistical variation of the events as a whole. In effect, we are assuming that placements with a unique perspective that might actually be more correct than the official placement occur sufficiently infrequently that a judge’s average score is not significantly harmed in the process. In other words, the occasional "anomaly" that might actually be a correct placement in terms of the absolute truth does not wreck a judge’s score in this evaluation process.

In this scheme of things a judge does not have to have perfect agreement with the official result to obtain a high score. The statistical properties of a judge’s performance only have to be consistent with the statistical properties of the event overall. In this way events that are easier to judge (which have less statistical variation among the marks) hold the judges to a higher standard of conforming to the official results, while events that are difficult to judge hold the judges to a looser standard. The result of this is that the metrics used to judge the judges are independent of the difficulty of judging individual events, allowing direct comparison of the judges’ scores from different events.

In this evaluation process we end up with a score for each judge in each event segment that rates how well each judge did in judging the event segment, and ranks the judges from best to worst. However, because we have not tied this process to an absolute measure of truth this is in large part a relative ranking, and the worst judge in an event segment may well still have done a good enough job. Nevertheless, by looking at many event segments and many judges, one thing we hoped to get out of this is some sense of what score might indicate good, average or poor judging; and if individual judges have similar scores over many event segments we might conclude the average score has some absolute meaning.

Several metrics are used for this evaluation process, each of which contributes to the judges’ scores. The maximum possible score from all metrics is 20. Those of you bored by statistics can scroll quickly down to the results section below.

1. Global Deviations

The difference between a judge’s placement and the official result is called the deviation, and the absolute value of that difference is the absolute deviation. For each event segment we calculate the average absolute deviation and root mean square (RMS) deviation for all judges and all skaters. These are global values that characterize the spread in the placements for the event segment as a whole. We also calculate the average absolute deviation and RMS deviation for each judge. If the judge’s average absolute deviation is better (lower) than the global average the judge gets one point, and likewise for the RMS deviation. The judge(s) with the lowest average absolute deviation get one point, and likewise for the lowest RMS deviation. The judge(s) with the worst average absolute deviation lose one point, and likewise for the worst RMS deviation.
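As a rough illustration, the point rules for this metric can be sketched in Python. The function name `global_deviation_points` and the input layout (a dictionary of judge labels mapping to placements, listed in the same skater order as the official places) are assumptions made for this sketch, not part of the original analysis. Under these rules the metric contributes at most 4 points per judge.

```python
import math

def mean_abs(devs):
    """Average absolute deviation of a list of placement deviations."""
    return sum(abs(d) for d in devs) / len(devs)

def rms(devs):
    """Root mean square of a list of placement deviations."""
    return math.sqrt(sum(d * d for d in devs) / len(devs))

def global_deviation_points(placements, official):
    """Score each judge on the global-deviation metric.

    placements: dict of judge label -> list of placements (hypothetical layout),
                in the same skater order as `official`.
    official:   list of official places for the event segment.
    """
    devs = {j: [p - o for p, o in zip(ps, official)]
            for j, ps in placements.items()}
    all_devs = [d for ds in devs.values() for d in ds]
    points = {j: 0 for j in devs}
    for stat in (mean_abs, rms):
        global_value = stat(all_devs)          # spread for the whole segment
        per_judge = {j: stat(ds) for j, ds in devs.items()}
        best = min(per_judge.values())
        worst = max(per_judge.values())
        for j, value in per_judge.items():
            if value < global_value:           # better than the global value
                points[j] += 1
            if value == best:                  # tied judges all get the bonus
                points[j] += 1
            if value == worst:                 # tied judges all get the penalty
                points[j] -= 1
    return points

example = global_deviation_points(
    {"A": [1, 2, 3, 4], "B": [2, 1, 3, 4], "C": [4, 3, 2, 1]},
    official=[1, 2, 3, 4],
)
print(example)
```

In this made-up three-judge example, judge A matches the official result exactly and collects all four points, while judge C has the worst value of both statistics and loses two.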

2. Deviation Spread

For the event segment as a whole we calculate the spread of deviations; i.e., the number of times there was a deviation of -5 (or lower), -4, -3, ... through 4, and 5 (or higher). We then compare the spread of deviations for each individual judge to the global spread. For each of the 11 possible cases, the judges get one point if they do better than the global distribution.
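The text does not spell out what "better than the global distribution" means for each of the 11 cases, so the following Python sketch rests on an assumption: a judge is taken to do better in the zero-deviation case by having a larger share of placements there, and better in every other case by having a smaller share. The function names are hypothetical.

```python
from collections import Counter

def spread(devs, limit=5):
    """Fraction of deviations in each of the 2*limit + 1 bins.

    Deviations beyond +/-limit are folded into the end bins.
    """
    counts = Counter(max(-limit, min(limit, d)) for d in devs)
    return {b: counts.get(b, 0) / len(devs) for b in range(-limit, limit + 1)}

def deviation_spread_points(judge_devs, all_devs, limit=5):
    """One point per bin where the judge beats the global spread.

    ASSUMPTION (not stated in the source): "better" means a larger share
    in the zero bin and a smaller share in every non-zero bin.
    """
    judge = spread(judge_devs, limit)
    panel = spread(all_devs, limit)
    points = 0
    for b in range(-limit, limit + 1):
        if b == 0:
            better = judge[b] > panel[b]
        else:
            better = judge[b] < panel[b]
        points += int(better)
    return points

# Deviations for three made-up judges over four skaters.
judge_a = [0, 0, 0, 0]
judge_b = [1, -1, 0, 0]
judge_c = [3, 1, -1, -3]
everyone = judge_a + judge_b + judge_c

print(deviation_spread_points(judge_a, everyone))
print(deviation_spread_points(judge_c, everyone))
```

Under this reading a judge can only earn a point in a bin that the panel as a whole actually populated, so the full 11 points are reachable only in segments with a wide spread of deviations.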

3. Use of Both Marks

Because the ordinal system uses relative judging, the judges may not be marking on the same absolute scale so we do not look for consistency in the total marks. Nonetheless, the judges should be using the two marks in a meaningful and consistent way to decide if the second mark should go up or down, and by how much. This metric evaluates how well the judges use the two marks.

For each skater we calculate the average and RMS spread of the amount by which the panel goes up or down on the second mark. For the segment as a whole we calculate how often the panel went up or down within the range of the average +/- the RMS spread, how often they were within the range of two times the RMS spread, and how often they exceeded the range of two times the RMS spread. We do the same thing for the judges’ marks individually and compare them to the global statistics. Judges get one point for each of the three cases (within the range, within two times the range, outside two times the range) when they do better than the global statistics.
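The classification step described above can be sketched in Python as follows. The function name `band_fractions`, the input layout, and the small tolerance `EPS` are assumptions for illustration. The three bands are treated here as non-overlapping (within one RMS spread, between one and two, beyond two), which is one possible reading of the text. Because the text does not define "better than the global statistics" for each case, the sketch stops at computing the fractions for the panel and for a single judge and does not award points.

```python
import math

EPS = 1e-9  # tolerance so rounding error does not push a value out of a band

def band_fractions(diffs_by_skater, judges=None):
    """Fractions of marks in three bands around the panel's average move.

    diffs_by_skater: list with one dict per skater, mapping judge label ->
                     (second mark - first mark). Hypothetical layout.
    judges:          optional set of judge labels to count; None counts all.
    Returns [within 1 RMS, between 1 and 2 RMS, beyond 2 RMS].
    """
    counts = [0, 0, 0]
    for diffs in diffs_by_skater:
        values = list(diffs.values())
        mean = sum(values) / len(values)
        rms = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        for judge, value in diffs.items():
            if judges is not None and judge not in judges:
                continue
            offset = abs(value - mean)
            if offset <= rms + EPS:
                counts[0] += 1
            elif offset <= 2 * rms + EPS:
                counts[1] += 1
            else:
                counts[2] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Made-up differences (in tenths of a point) for six judges and two skaters.
data = [
    {"J1": 0, "J2": 0, "J3": 0, "J4": 0, "J5": 0, "J6": 6},
    {"J1": 0, "J2": 0, "J3": 2, "J4": 2, "J5": 1, "J6": 1},
]
panel = band_fractions(data)
judge_6 = band_fractions(data, judges={"J6"})
print(panel, judge_6)
```

The mean and RMS spread are always taken over the whole panel for each skater; the `judges` argument only selects whose marks are counted against those panel statistics.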

4. Sliding Groups of Four

A simple standard for judging the judges that is sometimes used is to see if the judges place the top four skaters from the official results in their individual top four, and the bottom four in their bottom four. This metric applies that criterion for sliding groups of four, meaning first through fourth, then second through fifth, then third through sixth, and so on. For each group of four we count the number of times each judge places skaters outside of each group, and then sum the total number of outliers. If the number of outliers is greater than the total number of sliding groups the judge gets no points. If the number of outliers is fewer than the number of sliding groups the judge gets one point, and if it is fewer than half the number of sliding groups the judge gets a second point.
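A minimal Python sketch of this counting rule is given below. The function name `sliding_group_points` is hypothetical. The text does not say what happens when the number of outliers exactly equals the number of sliding groups; the sketch assumes that case earns no points.

```python
def sliding_group_points(judge_places, official_places, size=4):
    """Count sliding-group outliers for one judge and convert them to points.

    judge_places and official_places list the places of the same skaters
    in the same order. Returns (outliers, groups, points).
    """
    groups = len(official_places) - size + 1
    outliers = 0
    for low in range(1, groups + 1):
        high = low + size - 1
        for judge_place, official_place in zip(judge_places, official_places):
            in_official_group = low <= official_place <= high
            in_judge_group = low <= judge_place <= high
            if in_official_group and not in_judge_group:
                outliers += 1
    if outliers < groups / 2:
        points = 2
    elif outliers < groups:
        points = 1
    else:
        # ASSUMPTION: outliers == groups is scored like outliers > groups.
        points = 0
    return outliers, groups, points

official = [1, 2, 3, 4, 5, 6]
print(sliding_group_points([1, 2, 3, 4, 5, 6], official))
print(sliding_group_points([2, 1, 3, 5, 4, 6], official))
print(sliding_group_points([6, 5, 4, 3, 2, 1], official))
```

With six skaters there are three sliding groups (1-4, 2-5 and 3-6), so a judge needs fewer than 1.5 outliers for two points and fewer than 3 for one point.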

Results

When we score all the judges (37) for all event segments (28) at the 2004 U.S. Nationals according to the above evaluation criteria we obtain the following results:

Dance

For the 11 judges who judged the dance events the average score was 12.3. The median score was 12.4.
The best score for a judge in a dance event segment was 20.
The worst score for a judge in a dance event segment was 3.
The best judge had an average score of 17.3.
The worst judge had an average score of 8.7.
Table for all dance judges.

Singles and Pairs

For the 29 judges who judged the singles and pairs events the average score was 10.7. The median score was 10.5.
The best score for a judge in a singles and pairs event segment was 20.
The worst score for a judge in a singles and pairs event segment was 2.
The best singles and pairs judge had an average score of 15.3.
The worst singles and pairs judge had an average score of 6.5.
Table for all singles and pairs judges.

Looking at the scores for all the judges in all event segments we also conclude the following:

One needs at least four event segments for the average score to be meaningful.
The average judge at 2004 U.S. Nationals typically scored 11-12 in this system.
The top 10% of judges scored about 14 or greater.
The bottom 10% of judges scored about 8 or less in singles and pairs, and about 9 or less in dance.
This type of evaluation process appears to sort out the judges’ performance in a meaningful way.

One omission in this evaluation process is the lack of a specific metric to evaluate the way the judges handle deductions. This is indirectly captured, since agreement with the panel as a whole in short program event segments will depend on the judges assigning the deductions correctly. It would be better, however, to break that out separately. Unfortunately, no record of the deductions assigned by the judges is kept, so evaluating that aspect of judging is impossible at this time.

Except for the best judges (who might enjoy the bragging rights), names are not provided here because (1) it is a touchy subject, and (2) the purpose of this initial exercise was to see if this kind of evaluation process tells us anything useful – which it appears it does. The next step would be to try this analysis on other U.S. events (such as the qualifying competitions, and events with lower level judges), and to also extend a similar kind of analysis to international judging.
