A look at the quality of the judging at U.S. Nationals.
In a previous article we considered the effect start order might have on the placement of skaters by analyzing the results of some of the event segments at the 2004 U.S. National Championships. Looking at the event segments at Nationals of similar size (12-13 skaters), with a totally random draw, we found there was no evidence of any disadvantage in skating first in terms of ultimate placement in the event segment. If anything, the numbers indicated that there might be a small benefit to skating first.
In view of those calculations (and also our concern over the excessive time it takes to conduct skating events at major competitions) it seemed an interesting idea to investigate other aspects of judging that might be related to start order. For example, is the quality of judging related to the length of an event, or to the start order of an event? To answer such questions we decided to try to assess the overall quality of judging in each event segment at 2004 U.S. Nationals to see if there were any significant patterns or trends that might tell us something about the judging process, or the ways in which the process of conducting events affects the quality of judging.
As noted in an earlier report on the performance of the individual judges, we have been looking at U.S. Nationals because we happen to have all the scoring information easily accessible in our computer, and because we believe the judges at U.S. Nationals form a more or less homogeneous group of well-trained judges free of intentional bias. Ultimately we will extend this kind of study to international judging both under OBO and CoP.
To carry out this study we had to come up with metrics to quantify the quality of the judging in each event segment. These metrics are described below. The underlying assumption for many of them is that the tighter the agreement of the panel, the "easier" the event was to judge, the more "reliable" (statistically significant) the result, and the higher the "quality" of the process.
We assume that U.S. judges are sufficiently well trained as a group so that by averaging over all judges on a panel, the conclusions are not compromised by variations in quality of the individual judges. In particular, when comparing event segments within a given event this is a pretty safe assumption since the same judges are used in each event segment. We also assume that by averaging over all the judges in an event segment and by looking at many event segments, the metrics are telling us something about the judging process in general, the effects of human limitations on it, and the way the conduct of events affects the judging process, and not the individual judges.
When interpreting the results for the metrics chosen, we assume the official result from the panel as a whole is a good estimate of the "truth" for the place each skater deserves. As noted in a previous article, this is just an estimate since the absolute truth (God's Truth) is unavailable to us, but if we average over several event segments it should be close enough to the truth to learn something useful.
The following metrics are used in this study. Each is calculated using the results from all nine judges on each panel and thus should be largely independent of variations in the quality of individual judges. Those of you who are bored by statistics, and have barely made it this far, should now scroll quickly down to the results section below.
The difference between a judge's placement and the official result is called the deviation, and the absolute value of that difference is the absolute deviation. For each event segment we calculate the average absolute deviation and root mean square (RMS) deviation for all judges and all skaters. These are global values that characterize the spread in the placements for the event segment as a whole. We also calculate the average absolute deviation and RMS deviation for each skater in the event segment. An average absolute deviation of greater than 1.2 or an RMS deviation of greater than 1.8 indicates an undesirable decline in statistical accuracy of the placements.
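As a minimal sketch of the two global values just described, the following computes the average absolute deviation and the RMS deviation for a small panel. The sign convention (judge's place minus official place), the function name, and the three-judge, four-skater data are assumptions made for illustration; they are not taken from the Nationals results.

```python
import math

def deviation_stats(placements, official):
    """Return (average absolute deviation, RMS deviation) for one event segment.

    placements: one list per judge, giving the place that judge awarded
                to each skater (same skater order as `official`).
    official:   the official place of each skater.
    """
    # Assumed sign convention: judge's place minus official place.
    deviations = [
        judge[s] - official[s]
        for judge in placements
        for s in range(len(official))
    ]
    n = len(deviations)
    avg_abs = sum(abs(d) for d in deviations) / n
    rms = math.sqrt(sum(d * d for d in deviations) / n)
    return avg_abs, rms

# Hypothetical panel: 3 judges, 4 skaters.
official = [1, 2, 3, 4]
placements = [
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
]
avg_abs, rms = deviation_stats(placements, official)
print(f"average absolute deviation = {avg_abs:.3f}, RMS deviation = {rms:.3f}")
```

The per-skater values can be obtained the same way by restricting the list of deviations to a single skater.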
For the event segment as a whole we calculate the spread of the deviations; i.e., the number of times there was a deviation of -5 (or more), -4, -3, and so on through 4, and 5 (or more). If the judging is limited purely by small random differences of opinion among the judges then this distribution should follow a Gaussian (bell curve) error distribution. In this metric we calculate the extent to which the distribution departs from a Gaussian distribution. This metric is sensitive to the presence of systematic effects in the deviations. A value greater than 0.5 indicates an undesirable degree of systematic effects in the placements.
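The tally behind this metric can be sketched as follows. The exact formula used to measure the departure from a Gaussian is not given in this article, so the sketch stops at the histogram of deviations, with everything at or beyond plus or minus 5 collected in the end bins. The function name and the sample deviations are hypothetical.

```python
from collections import Counter

def deviation_histogram(deviations, limit=5):
    """Count deviations in bins -limit..limit, folding larger magnitudes
    into the two end bins ("5 or more")."""
    clipped = [max(-limit, min(limit, d)) for d in deviations]
    counts = Counter(clipped)
    # Include empty bins so the shape of the distribution is explicit.
    return {k: counts.get(k, 0) for k in range(-limit, limit + 1)}

# Hypothetical deviations for one event segment.
hist = deviation_histogram([0, 0, 1, -1, 2, -7, 6, 0])
for value, count in hist.items():
    print(f"{value:+d}: {count}")
```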
For the purposes of this study we define an outlier to be any absolute deviation with a value of 4 or more. The fraction of placements that are outliers is calculated for each event segment. In any statistical process some occurrence of outliers is expected, typically about 5 percent of the placements for the definition of an outlier chosen here. An incidence of outliers greater than 7.5% indicates an undesirable degree of systematic effects in the placements or an undesirable decline in statistical accuracy of the placements. An incidence of outliers significantly less than 5% (and particularly 0%) can also be an indication of undesirable systematic effects in the absence of similarly ideal values in the other metrics.
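A short sketch of the outlier count, using the definition above (absolute deviation of 4 or more). The function name and the ten sample deviations are made up for illustration.

```python
def outlier_fraction(deviations, threshold=4):
    """Fraction of placements whose absolute deviation is >= threshold."""
    outliers = sum(1 for d in deviations if abs(d) >= threshold)
    return outliers / len(deviations)

# Hypothetical deviations: three of the ten are outliers (4, -5, -4).
frac = outlier_fraction([0, 1, -1, 4, -5, 2, 0, 0, 3, -4])
print(f"outliers: {frac:.1%}")
```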
For each event segment we calculate the RMS spread in the amount by which the panel goes up or down on the second mark for each skater. We then average the spread for all skaters. The greater this global average, the more difficulty the panel had agreeing on the use of the second mark. A value of greater than 0.150 indicates an undesirable decline in statistical accuracy of the use of the two marks. In short programs, it can also be due to inconsistent application of deductions by the panel.
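One way to sketch this calculation is shown below. It assumes that "RMS spread" means the population standard deviation, across the judges, of each judge's second mark minus first mark for a given skater; that reading, the function name, and the marks are assumptions.

```python
from statistics import mean, pstdev

def second_mark_spread(marks):
    """Average over skaters of the RMS spread in (second mark - first mark).

    marks: one list per skater, each holding a (first, second) pair per judge.
    """
    per_skater = [
        pstdev([second - first for first, second in skater])
        for skater in marks
    ]
    return mean(per_skater)

# Hypothetical marks: 2 skaters, 3 judges each.
marks = [
    [(5.0, 5.2), (5.1, 5.2), (5.0, 5.3)],  # judges go up by 0.2, 0.1, 0.3
    [(4.8, 4.8), (4.8, 4.8), (4.8, 4.8)],  # panel agrees exactly
]
spread = second_mark_spread(marks)
print(f"average second-mark spread = {spread:.3f}")
```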
A simple standard for judging the judges that is sometimes used is to see if the judges place the top four skaters from the official results in their individual top four, and the bottom four in their bottom four. This metric applies that criterion for sliding groups of four, meaning first through fourth, then second through fifth, then third through sixth, and so on. For each group of four we count the number of times each judge places skaters outside of each group, and then sum the total number of outliers. We then calculate the average number of outliers for the entire panel, normalized by the number of sliding groups in the event segment. A value greater than 1.5 indicates a problem with placing skaters in the appropriate groups of four.
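The sliding-group count can be sketched as below. The normalization shown (total misses divided by the number of judges times the number of sliding groups) is one reading of the description above and should be treated as an assumption, as should the function name and the sample panel.

```python
def sliding_group_metric(placements, official, group_size=4):
    """Average number of skaters placed outside their sliding group,
    per judge and per group."""
    n = len(official)
    n_groups = n - group_size + 1
    total = 0
    for start in range(1, n_groups + 1):
        lo, hi = start, start + group_size - 1
        for judge in placements:
            for s in range(n):
                in_official_group = lo <= official[s] <= hi
                in_judge_group = lo <= judge[s] <= hi
                if in_official_group and not in_judge_group:
                    total += 1
    return total / (len(placements) * n_groups)

# Hypothetical panel: 2 judges, 5 skaters, so two sliding groups (1-4, 2-5).
official = [1, 2, 3, 4, 5]
placements = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 5, 4],  # swaps 4th and 5th: one miss in the 1-4 group
]
metric = sliding_group_metric(placements, official)
print(f"sliding-group metric = {metric:.2f}")
```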
First we look at the above statistics for each event segment at the 2004 U.S. National Championships. These are global statistics, meaning they are characteristic of each event segment as a whole.
Table of statistical properties for event segments
In this table, values that exceed standards of reasonable statistical accuracy are highlighted in red. Although the ideal would be to have no red entries, event segments for which only one or two of the metrics are barely in the red can still be considered good quality. Four or more red entries for an event segment indicate that the judges struggled in that segment and that the statistical quality of the results is less than desirable (and less than what is potentially obtainable).
One possible use of these statistics is this: were they calculated at a competition and the judges debriefed with the goal of determining why (as a group) they struggled with a particular event segment, the feedback could be used to improve the judging process and/or training of the judges.
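The red-entry rule can be sketched with the thresholds quoted in the metric descriptions. The metric names are hypothetical labels, the sample values are invented, and the actual table may contain columns not covered here.

```python
# Thresholds quoted in the metric descriptions; the key names are hypothetical.
THRESHOLDS = {
    "avg_abs_deviation": 1.2,
    "rms_deviation": 1.8,
    "gaussian_departure": 0.5,
    "outlier_fraction": 0.075,
    "second_mark_spread": 0.150,
    "sliding_group": 1.5,
}

def red_entries(metrics):
    """Return the names of the metrics that exceed their threshold."""
    return [name for name, limit in THRESHOLDS.items() if metrics[name] > limit]

# Invented values for one event segment.
segment = {
    "avg_abs_deviation": 1.3,
    "rms_deviation": 1.9,
    "gaussian_departure": 0.2,
    "outlier_fraction": 0.09,
    "second_mark_spread": 0.160,
    "sliding_group": 1.0,
}
flagged = red_entries(segment)
struggled = len(flagged) >= 4  # four or more red entries
print(flagged, struggled)
```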
In singles and pairs, the judges had the most difficulty with Junior Men and all three Senior Short Programs. In dance, the judges had the most difficulty with the Novice Compulsory Dances.
Singles and Pairs
We find that the statistical quality of all singles and pairs short programs is very similar as a group, and likewise for all free skating event segments. The average absolute deviations and RMS deviations show small variation among these event segments, while the percent of outliers varies considerably. A small number of event segments show indications of undesirable systematic effects, particularly the Senior Ladies and Senior Pairs Short Programs.
The statistics for use of the two marks show there is room for improvement in this area. It is also noted that for the junior and senior events, this metric is worse for the six short programs compared to the free skating. It is suggested this is likely due to the inconsistent application of deductions in the short programs. To better understand and improve the judging process it would be advantageous if deductions and base marks were recorded separately and individually available.
Except for Junior Men, the statistics for sliding groups of four were all reasonably good.
There is a distinct difference between the statistics for all short program event segments versus the free skating event segments. The metrics improve dramatically in the free skating segments compared to the short program segments, and in most cases the incidence of outliers was zero or near zero in the free skating. A similar situation is found in dance for the compulsory dances and the free dance. We view this with suspicion.
In a totally random draw we see one set of statistical characteristics while in the event segments seeded by groups we find a significantly different set of characteristics. It is suggested that the seeding of the groups after the short program is a subconscious crutch that affects the decisions of the judges in the free skating in a dramatic way, and that the improved statistical quality in the free skating is an artifact that indicates "thinking outside the group" is an area that could use some improvement. Comparing the numbers for these event segments one also sees that judges had more difficulty thinking outside the group in pairs than they did in singles.
In a totally random draw the judges have to concentrate on not forgetting the skaters in the early warmups, and the numbers show they do a good job at it. For the free skating, however, it appears the judges do less well thinking outside the groups. Though it has no chance of ever coming about (due to the demands of TV) it would actually be a great deal fairer to judge all events in a totally random draw than in seeded draws.
The overall dance statistics show that the panels only struggled with the Novice Compulsory Dances. Unlike singles and pairs, the statistical quality of dance events increased with the event level. Analogous to singles and pairs, the statistical quality of the free dances and original dances was significantly better than the compulsory dances. The statistical quality of the dance events was significantly greater than that for singles and pairs.
The statistical properties of all the dance event segments as a group must be viewed with suspicion. It is implausible that human perception would limit judges in singles and pairs to one degree of statistical consistency while dance judges would do two to three times better -- or that individuals who judge both dance and singles and pairs would have one set of statistical characteristics for their dance events and another for singles and pairs (which is the case).
No explanation is offered for this mystery of the universe, but it is suggested that the observed differences in statistical characteristics for dance vs. singles and pairs is an artificial situation that needs further attention. For those who would make the knee-jerk suggestion that CoP will solve all problems we disagree in advance, and suggest the problem here more likely is in some way related to the nature of ice dance competition itself and/or the thought processes or training of dance judges, and is not intrinsic to the scoring system.
Finally, we note that it is ironic that judging in compulsory dance shows the most nearly normal statistical characteristics and the greatest variety of opinion of any of the dance segments, and yet that is the part of ice dancing that has been reduced in value over the years and is under the greatest pressure for elimination. (A Canadian proposal on the ISU agenda calls for the elimination of compulsory dance from competition). Compulsory dance may not be the most exciting or entertaining part of ice dancing to the casual fan, but the numbers show it is the only part of dance that offers any real mix-it-up diversity of opinion in dance events.
Use of the Second Mark
In looking over the statistics for the various event segments, the following was noted: invariably the panels at Nationals tended to go up on the second mark (i.e., the average second mark for the panel was greater than the average first mark). In the short programs this might be expected to occur frequently, since deductions systematically reduce the first mark. One would expect, however, that for skaters with clean programs the second mark would at least occasionally be lower than the first mark, and the panels would not go up on the second mark every time. In the short programs at U.S. Nationals, however, the panels went up on the second mark for all but one (1) skater in all the short programs!
For all other event segments one would expect the second mark to go down a significant fraction of the time. For a random mix of skaters with strengths in either the technical or presentation aspects of skating, the panel would be expected to go up as often as down. Taking into account that these are the best skaters in the U.S. it would not be unreasonable for the fraction of skaters going up on the second mark to be greater than a random 50%. For the non-short-program segments at Nationals, however, the panels went up on the second mark nearly 90% of the time.
These trends hold true for all disciplines at all levels.
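A sketch of how the "went up on the second mark" fraction can be counted, following the definition given earlier in this section (panel average second mark greater than panel average first mark). The function name and the marks are hypothetical.

```python
def fraction_up_on_second_mark(marks):
    """Fraction of skaters whose panel-average second mark exceeds
    the panel-average first mark.

    marks: one list per skater, each holding a (first, second) pair per judge.
    """
    up = 0
    for skater in marks:
        avg_first = sum(first for first, _ in skater) / len(skater)
        avg_second = sum(second for _, second in skater) / len(skater)
        if avg_second > avg_first:
            up += 1
    return up / len(marks)

# Hypothetical marks: 4 skaters, 2 judges each.
marks = [
    [(5.0, 5.2), (5.1, 5.3)],  # up
    [(4.8, 4.6), (4.9, 4.8)],  # down
    [(5.5, 5.6), (5.4, 5.6)],  # up
    [(5.0, 5.1), (5.2, 5.2)],  # up
]
went_up = fraction_up_on_second_mark(marks)
print(f"panel went up for {went_up:.0%} of skaters")
```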
Table of change in the second mark
It is difficult to believe that the observed frequency with which the panels went up on the second mark accurately reflects the true relative occurrence of skaters stronger in the first or second mark. It is suggested that this characteristic of the judging is indicative of a systematic effect in the way the second mark is used. Further, if well trained judges such as those judging at Nationals have problems using one subjective presentation mark correctly, one must view with extreme skepticism the claim that judges can use the five subjective marks in CoP correctly.
In regard to CoP, one should also note that presentation in CoP is significantly less than 50% of the total score. Taken at face value, the fact that the majority of the best U.S. skaters are getting more of their total mark from presentation than from the technical merit mark says that under CoP these skaters will suffer even more than previously feared.
Repetition of Placements in Event Segments
Another interesting statistic that fell out of this study was the frequency with which skaters receive the same placements in each part of an event segment. One might note it is generally assumed that in the 6.0 system, dance couples always get the same placement in each segment of dance events. It turns out that is not exactly correct, at least in the U.S. A decade or more ago, that was in fact the trend, but in recent years U.S. dance judges, at least, have gotten a lot better at moving the couples around.
Table of repetition of placements
In singles, the fraction of competitors that had the same placement in each event segment was no greater than 24% and was typically about 17%. Unexpectedly, the fraction of pair teams with identical placements was significantly greater on the average: only 17% in Junior Pairs, but 50% in both Novice and Senior Pairs. This would tend to support the suspicion, noted above, that judges have a greater difficulty thinking "outside the group" in pairs than they do in singles.
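The repetition rate used here can be sketched as the fraction of competitors whose placement is identical in every segment of the event. The function name and the four-competitor example are hypothetical.

```python
def repetition_rate(segment_places):
    """Fraction of competitors with the same place in every event segment.

    segment_places: maps each competitor to a list of places, one per segment.
    """
    same = sum(1 for places in segment_places.values() if len(set(places)) == 1)
    return same / len(segment_places)

# Hypothetical two-segment event with four competitors.
event = {
    "A": [1, 1],
    "B": [2, 3],
    "C": [3, 2],
    "D": [4, 4],
}
rate = repetition_rate(event)
print(f"repetition rate = {rate:.0%}")
```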
In dance overall, the repetition of placements was actually less on the average than it was in pairs. In Novice Dance NONE of the couples received the same placement in each segment. In Junior Dance the repetition rate was 38% and in Senior Dance it was 67%.
These numbers for dance offer two possible interpretations, both of which may be at work.
In Novice Dance, where there is generally a lot of "new blood" each year and fewer preconceived notions, the judges had no inhibitions about moving the couples around. In Junior Dance, where there are couples with better known reputations, there was less movement; but still, more than 60% of the couples did not get the same placement in each segment. In Senior Dance, however, there was little real movement when you take into account the fact Lang & Tchernyshev did not complete the event. Even had Lang & Tchernyshev not gotten second place in each event segment, the repetition rate for Senior Dance still would have been 50%.
Thus, one view of the greater repetition rate in Senior Dance (compared to novice and junior) is that at the senior level all the teams are so well known that reputation still has a strong influence on the judges, which inhibits them from moving the couples around in each dance as much as they should. An alternate point of view is that senior dance couples in the U.S. span such a broad range of skill, and are so weak as a group, that the 12 teams couldn't provide the close competition needed to result in movement between the dances.
An alternate way of measuring movement in events is to look not only at the repetition of placements, but also the amount by which the placements change from one segment to the next. Even when there is movement in dance, it is usually by only one place, while in singles changes of several places are common. If one calculates the movement taking into account the amount by which the places change, one finds the amount of movement in dance and pairs was typically half to one-third that in singles. This is consistent with the result from simply counting the frequency of place repetition.
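A movement measure of this kind might be sketched as the average absolute change in place from one segment to the next. The precise weighting used in this study is not stated, so the simple average below, the function name, and the sample data are assumptions.

```python
def average_movement(segment_places):
    """Average absolute change in place between consecutive segments.

    segment_places: maps each competitor to a list of places, one per segment.
    """
    changes = [
        abs(later - earlier)
        for places in segment_places.values()
        for earlier, later in zip(places, places[1:])
    ]
    return sum(changes) / len(changes)

# Hypothetical two-segment event with four competitors.
event = {
    "A": [1, 1],  # no movement
    "B": [2, 4],  # drops two places
    "C": [3, 2],  # moves up one
    "D": [4, 3],  # moves up one
}
movement = average_movement(event)
print(f"average movement = {movement:.2f} places")
```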
Quality vs. Order of Finish
The statistics discussed thus far all refer to event segments as a whole. One can also dig a little deeper and look at some of these statistics on a skater by skater basis, particularly in terms of their relationship to order of finish and start order.
One finds that all event segments can generally be divided up into three groups of statistical quality that roughly follows a top third, middle third and bottom third distribution. The three groups aren't always exact thirds, but the three groups are always present. There is also a trend that the larger the event, the greater the size of the middle group.
The statistical quality for the top group is always the best of the three groups. The bottom group is frequently just as good, though occasionally it is somewhat worse when one has, for example, a few competitors fighting tooth and nail for last place. The middle group always has the worst statistical quality, frequently having half to one-third the statistical accuracy of the other two groups.
This result quantifies the obvious: it is easiest to pick the best and worst out of a group of skaters, but sorting out the messy middle is far more difficult. The best skaters tend to be strong in all aspects of skating and the worst tend to be weak in all aspects of skating. The middle group of skaters, however, is a complex mix of strengths and weaknesses and it is unlikely the judges use identical standards for balancing these strengths and weaknesses when determining the marks.
The reason this situation exists is that the judges are not provided clear (quantitative) guidance for how to balance the different skating skills or for the impact different errors should have on the marks (other than the deduction in the short programs). This weakness in the judging process is easily correctable, and is addressed in several aspects of the Modern Era 6.0 scoring system on the ISU agenda in June.
Quality vs. Start Order
Few of the metrics studied here appear to have a significant correlation with start order. For events with a totally random draw, there is a very weak trend for skaters in the first group to place slightly better on the average than skaters in the later groups. More distinctly, the spread in the ordinals for each skater shows a clear trend for greater statistical uncertainty for later warmup groups; that is, the statistical accuracy of the placements declines with later warmup group.
Typically, the first warmup group in events with totally random draws has the greatest statistical accuracy and the last warmup group the worst, sometimes as much as a factor of two worse. The most plausible explanation for this is judge fatigue as events progress. This effect is superimposed on the quality vs. order of finish effect noted above.
In events with a seeded draw, any possible effect related to start order is masked due to the seeding and the quality vs. order of finish effect. Nevertheless, if the quality of judging shows a fatigue effect in the short programs and compulsory dance it seems likely that it is also present in the free skating and free dance event segments as well. The statistical accuracy of competition results would be improved (and the skaters better served) by minimizing the time it takes to complete event segments.
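A comparison of this kind can be sketched by grouping the deviations by warmup group and computing the RMS deviation for each group. Measuring the spread about the official place (rather than about each skater's mean ordinal) is an assumption, as are the function name and the sample data.

```python
import math
from collections import defaultdict

def rms_by_warmup_group(placements, official, warmup_group):
    """RMS deviation from the official result, computed per warmup group.

    placements:   one list of places per judge (same skater order throughout).
    official:     official place of each skater.
    warmup_group: warmup group number of each skater.
    """
    squares = defaultdict(list)
    for judge in placements:
        for s, place in enumerate(judge):
            squares[warmup_group[s]].append((place - official[s]) ** 2)
    return {g: math.sqrt(sum(v) / len(v)) for g, v in sorted(squares.items())}

# Hypothetical segment: 4 skaters in 2 warmup groups, 2 judges.
official = [1, 2, 3, 4]
warmup_group = [1, 1, 2, 2]
placements = [
    [1, 2, 3, 4],
    [1, 2, 4, 3],  # disagreement only in the second warmup group
]
by_group = rms_by_warmup_group(placements, official, warmup_group)
for group, rms in by_group.items():
    print(f"warmup group {group}: RMS deviation = {rms:.3f}")
```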
At this point a fair question to ask would be whether these statistics are relevant only to the 2004 Nationals, or have any wider meaning. To best answer that question it would be nice to have a complete set of results from recent qualifying competitions. Unfortunately, USFSA does not make results available in a form that lends itself to easily entering them into the software analysis tool used. (In fact, it takes more time and effort to prepare the results for analysis than to do the analysis itself!)
Nevertheless, after finishing the bulk of this article it was decided in the interest of caution to do a complete analysis of the results from the 2003 U.S. National Championships. While the details differ somewhat from 2003 to 2004 in specific events, in general the results and conclusions are the same. All of the trends and quirks discussed here for the 2004 Championships are present in the 2003 Championships, with more or less the same numerical values. The main difference between the 2003 and 2004 Championships was in the area of repetition and movement in the placements in dance and pairs.
In both 2003 and 2004 there was less movement in dance and pairs than there was in singles, and the least movement was found in pairs in both years. In 2003, however, there was a little more movement in dance and pairs than there was in 2004. Taking two-year averages, the repetition of placements in pairs averaged 38% while the repetition in dance averaged 30%. In comparison, the two-year average for repetition of placements in singles was 14%.
We also note that in scoring the performance of the judges in 2003, the results obtained were very similar to what was found in 2004.
Given the excellent agreement of the analysis of the 2003 Championships with the 2004 Championships, and the fact that the two competitions were judged by two completely different sets of judges consisting of nearly 80 different individuals, we conclude that the analysis discussed in this article provides an accurate picture of the general characteristics of judging in the U.S. by National Judges at the Novice level and above.
This article describes the statistical characteristics of two generally well-judged competitions (2003 and 2004 U.S. Nationals), and identifies some areas where improvement in judging quality and the judging process is both desirable and achievable. The overall conclusion is that the quality of the judging was fairly good, but there are areas where a good thing could be made better.
Improved performance goals worth pursuing.
Structural changes in the judging process worth pursuing to help achieve the above goal.
Should any of these goals and changes in the judging process be pursued, the statistics described here would allow quantitative assessment of the effectiveness in meeting the goals. It is suggested that the fraction of "anomalous" event segments (currently about 25%) could easily be cut in half within one season if a coherent effort was made by the USFSA. The fraction of anomalous event segments would also be reduced through adoption of the Modern Era 6.0 system, which includes several structural changes that act to better standardize the judging process.
Copyright 2004 by George S. Rossano
(10 March 2004)