In this article we look at the mathematical properties of CoP as determined by analysis of the marks from the Grand Prix. To accomplish this, the marks from all events at the six Grand Prix competitions and the Final were examined in detail using a variety of standard statistical methods. The purpose of this analysis was to determine how the judges used the marks and how CoP performed, according to rigorous mathematical standards.
To examine the way the judges used the five program component marks during the Grand Prix, we calculated the statistical spread in the marks for each program component, using the marks from all the judges for all performances in the Grand Prix. In a Normal Error Distribution, about 68% of the marks will lie within the statistical spread of the average, and about 95% will lie within twice the statistical spread. The statistical spread is frequently expressed as a percentage of the average value, which is the practice we follow here.
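For concreteness, here is a minimal sketch of this calculation in Python, assuming the marks for one program component of one performance are collected in a list (the values shown are illustrative, not actual Grand Prix marks):

```python
import statistics

# Marks from a ten-judge panel for one program component of one
# performance (illustrative values only).
marks = [6.25, 7.00, 6.25, 6.25, 5.50, 6.25, 6.25, 6.50, 5.75, 6.75]

mean = statistics.mean(marks)
spread = statistics.stdev(marks)  # sample standard deviation

# Express the spread as a percentage of the average value, the
# convention followed throughout this article.
spread_pct = 100.0 * spread / mean
print(f"average {mean:.3f}, spread {spread_pct:.1f}% of average")
```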
For the Grand Prix competitions we find that the spread of the judges' marks calculated for each program component is typically in the range of 12 to 16% of the average value, with some values as low as 5% and others near 20%. Previously we have pointed out that studies of human perception consistently show that humans generally cannot rate observed events (assign a numeric value) to better than about 15%, and sometimes no better than 25%. The marks assigned by the judges in the Grand Prix confirm this, and we conclude the judges are doing about the best that can be expected in assigning marks to a single program component.
The program components are intended to capture in the scoring five completely different and independent aspects of a skating performance. In that case, the spread in one judge's marks for the five components should be no better than about 15% for a skater who has roughly equal skill in each component, since that is the limit to the consistency of human judging. For skaters with different skill in the five components the spread in the marks from one judge for the five components should be worse than 15%, with any excess over 15% due to the difference in skill from one component to the next.
When we calculate the spread in each judge's marks for the five program components for each performance in the Grand Prix, instead of finding a spread worse than 15% we find the spread is typically only about 5%; i.e., about three times better than the spread from one judge to the next. Some judges did move their marks around for some skaters, but most did not to any significant extent.
In other words, when many judges rate one program component the spread in the marks is about 15%, but when one judge rates five components the spread is only about 5%. The tight agreement in the five component scores for each skater says that throughout the Grand Prix the judges were unable to use the components as five independent aspects of skating, and were unable to assign statistically independent ratings for the five components. This held true throughout the entire Grand Prix; i.e., they did not get significantly better at it as time went on.
In comparing the spread of the marks for the individual program components with the spread in the marks for the total of all five program components it is also found that this statistic does not scale as one would expect for five uncorrelated marks. Instead, this statistic also shows that the five program components are highly correlated; i.e., are not being judged as five independent aspects of skating.
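To make the expected scaling explicit: if the five component marks were statistically independent, each with spread σ and average μ, the spread of their total would be σ_T = √5·σ, so the relative spread of the total, σ_T/(5μ) = (σ/μ)/√5, should drop to roughly 15%/2.24, or about 6.7%. Perfectly correlated marks instead give σ_T = 5σ, leaving the relative spread of the total at the full 15%. The Grand Prix totals scale like the correlated case, not the independent one.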
So far as the use of the program components is concerned, the worst of the worst are program components 3 through 5, which score presentation. These program components typically vary by less than 0.25 points throughout the Grand Prix. The judges are a little better at distinguishing Transitions from Skating Skills, but the marks show they were completely incapable of distinguishing the presentation components from each other.
There are three possible reasons for the failure to score the five program components independently: (1) the guidelines for marking the program components were not clear enough; (2) the judges were not adequately trained, and the seven competitions of the Grand Prix were inadequate to get the judges "calibrated"; (3) it is humanly impossible to train judges to use the program components as intended (and required) by the system.
In any event, regardless of the actual cause, the numerical evidence shows that the marking of the program components under CoP in its current form was a complete failure throughout the entire Grand Prix.
The CoP development team has acknowledged this failure, and has assumed the problem in marking the program components was in the guidelines. The guidelines were extensively rewritten after the Grand Prix, but there is no evidence that this was the actual source of the problem, nor that the new guidelines will work better than the first set. In the absence of a second round of testing to confirm that the new guidelines work better, it is just a wild guess and wishful thinking.
Each program element is given a grade of execution (GoE) from each judge with a designation of -3 through +3, for a total of seven grades of execution. The grade of execution is converted to a point value through a lookup table and combined with the base value for the element.
For every element judged in the Grand Prix the statistical spread in the grade of execution was calculated. Throughout the Grand Prix the statistical spread for the GoE was very small; much less than one step in grade of execution. In other words, the judges agreed fairly well on the GoE to be assigned to the various elements.
Since there are seven possible grades of execution from worst to best, each grade of execution spans about 14% of the total range from best to worst. Thus, the statistical spread for the GoEs in the Grand Prix is consistent with the 15% rule that says the best the judges can do is about plus or minus 15% of the average of whatever they are trying to measure. From the data available we conclude that the choice of seven grades of execution is ideally suited to the intrinsic limitations on human ability to assign numeric values to observed activities, and that the judges are doing about the best that can be expected in assigning the GoEs, at least in terms of consistency.
It is clear from inspection of the marks that the judges had no difficulty assigning independent GoEs to each element, so we did not bother to calculate the spread in the GoEs for individual judges marking all the elements for a given skater. Instead a more useful statistic that measures how the judges used the GoEs was calculated: the frequency with which judges assigned the seven different GoEs throughout the Grand Prix.
The most mathematically rigorous approach to using GoEs would be to say that a GoE of 0 corresponds to the typical/average execution of an element. If one then knew the typical spread in quality of execution determined over several seasons for all competitors, one would specify that a GoE of +1 should cover elements executed better than the typical spread, +2 better than twice the typical spread, and so on for all GoEs.
In a Normal Error Distribution, GoEs from -1 through +1 would then encompass about 68% of all assessments, -2 and +2 would include about 27% of all assessments, and -3 and +3 would include the remaining 5% of all assessments. Of course, in particularly high or low quality events this distribution would not be followed exactly, but for many events in many competitions one would expect this would be a reasonable distribution. However, this is not the distribution one finds in the Grand Prix.
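Those ideal percentages follow directly from the Normal Error Distribution; a quick numerical check (a sketch only, not part of the original analysis):

```python
from math import erf, sqrt

def normal_prob(lo, hi):
    """Probability a normally distributed error falls between lo and
    hi, measured in units of the typical spread."""
    cdf = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return cdf(hi) - cdf(lo)

# Band boundaries implied by the scheme above: GoEs -1..+1 cover
# executions within one typical spread of average, +/-2 the next
# spread out, and +/-3 everything beyond that.
p_center = normal_prob(-1.0, 1.0)
p_twos = 2.0 * normal_prob(1.0, 2.0)
p_threes = 1.0 - p_center - p_twos
print(f"-1..+1: {p_center:.0%}  +/-2: {p_twos:.0%}  +/-3: {p_threes:.0%}")
# prints: -1..+1: 68%  +/-2: 27%  +/-3: 5%
```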
Typically, GoEs from -1 through +1 were assigned 85% to 90% of the time, while GoEs of +/- 2 were assigned 5% of the time each, or less. GoEs of -3 were not uncommon, for obvious reasons; but GoEs of +3 were extremely rare, awarded typically less than one-third of 1% of the time. In many events, never.
The distribution of GoEs in the Grand Prix is statistically abnormal. Further, it is difficult to accept that in seven competitions that included the best skaters in the world, elements (some with the minimum difficulty level, such as simple spins and double jumps) were executed with the best possible quality less than 0.3% of the time (less than one attempt in 300).
As for the program components, there are three possible sources for this problem: (1) the guidelines for marking the grades of execution are not clear or are not appropriate; (2) the judges were not adequately trained and the seven competitions of the Grand Prix were inadequate to get the judges calibrated; (3) it is humanly impossible to train judges to use the grades of execution as intended (and required) by the system.
Regardless of the actual cause, it is clear from the numerical evidence that the marking of the GoEs under CoP in its current form was a significant failure throughout the Grand Prix. The CoP development team has made no substantial changes to the judging of the GoEs since the end of the Grand Prix, nor so far as we know, are any planned. Thus, this failure to use the GoEs correctly can be expected to carry over into the future.
It is a well established rule in skating that a well executed double should earn more points than a poorly executed triple (and likewise for singles to doubles, and triples to quads). In CoP, a +3 double Axel, for example, can earn more points than a -3 triple Axel, so this principle has been retained in CoP. In practice, however, the judges so rarely award a GoE of +3 that a well executed double will almost never be scored higher than a poorly executed triple.
Another consequence of primarily awarding GoEs of -1 through +1 is that the total value of the grades of execution has little impact on the scores. In principle, skaters could gain or lose tens of points in grade of execution compared to average execution. In practice, the total grade of execution for all elements is typically only plus or minus 2 points for each skater. Only occasionally is it more than plus or minus 5 points.
All of the complexity to judge the GoE, all of the agonizing over what constitutes a given GoE, all of the effort trying to train all the judges to mark over 1500 GoEs consistently (all GoEs for all possible elements), all of the cost and complexity to incorporate GoE into the scoring process comes down to plus or minus 2 points that at best affects the result by about plus or minus 1 place, only part of the time.
Of all the events, it is found that the GoEs have the least impact on the results in dance. The GoEs have all the characteristics discussed above, and one other not present in singles and pairs.
In singles and pairs, different skaters frequently end up rated differently in each of the skating skills. For example, different skaters may end up rated as the best jumper, vs. the best spinner, vs. best in presentation; or different teams may end up rated as best in lifts and jumps, vs. best in spins, or best in presentation.
In dance it is uncommon for teams to be rated differently in each skating skill. For example, in Compulsory Dance the order of places for the first pattern, second pattern, timing, presentation, or GoE alone, all track more or less in lock-step. Consequently, it doesn't matter which of these skating skills is used to determine the order of finish, since they are all the same. This pattern is also true for the Original Dance and Free Dance, with at best a little bit of variety seen in the ratings for lifts in the Free Dance.
The long-running joke in ice dancing that couples place in more or less the same order for every dance has carried over into CoP. Although there was a little more movement from dance to dance in the Grand Prix under CoP than has typically been the case, there wasn't that much more, and within any one dance all the couples were rated in essentially the same order for each skating skill. This lock-step conformity in the rating of each skating skill is as implausible in dance as it is for the uniformity of the program component marks. It says that throughout the Grand Prix the dance judges were incapable of judging each skating skill within a dance independently from the others.
Results in CoP are calculated to one one-hundredth (1/100) of a point, and thus CoP claims it can distinguish between two programs that differ from each other in value by one-half of one one-hundredth of one percent.
This is equivalent to claiming a panel of judges can look at two people, one three hours after the first, and determine that one is taller than the other by the thickness of a piece of newspaper, without benefit of a ruler. It is equivalent to claiming a panel can watch two people race, one three hours after the other, and determine that one ran the race one millisecond faster than the other, without benefit of a clock. It is equivalent to claiming a panel can watch two skaters jump, one three hours after the other, and determine that one jumps higher than the other by one-thousandth of an inch, or takes off faster by one-hundredth of an inch per second, without benefit of a ruler or a clock. This strains credibility. To determine the real believability of the scores under CoP, the marks were used to determine the formal statistical uncertainty in the scores.
The most basic premise of CoP is that the marks are awarded according to an absolute, consistent standard. The calculated scores are then supposed to be the absolute truth for what each performance deserves according to the absolute standard. However, because there is a large spread in the marks and only a small number of judges (five used to compute the scores), the calculated scores are only an estimate of the absolute score each program deserves. It is like timing a race using a stopwatch with a loose second-hand that flops around randomly. You sort of know the time, but not exactly -- only to give or take a few seconds. Similarly, you only sort of know the true value of the programs in CoP, only to give or take some number of points.
The uncertainty in the scores (their believability) can be determined from the spread of the marks for the individual elements and program components. When the ranges of the marks are combined together using standard statistical methods, one finds that the scores typically have an uncertainty of plus or minus 3/4 of a point. This means the CoP "stopwatch" can only tell time to give or take 3/4 of a "second", but claims it can determine a winner by one-hundredth of a "second".
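The combination step follows standard propagation of independent uncertainties; a minimal sketch of the idea, with purely illustrative spreads (the actual per-element values varied from event to event):

```python
from math import sqrt

# Judge-to-judge spread (one standard deviation, in points) of each
# contribution to a skater's score -- elements and program
# components. Illustrative values only.
spreads = [0.30, 0.25, 0.35, 0.20, 0.30, 0.45, 0.25, 0.40]
n_judges = 5  # marks from five judges are used to compute the score

# Averaging over n judges reduces each spread by sqrt(n); the
# independent uncertainties then combine in quadrature.
uncertainty = sqrt(sum((s / sqrt(n_judges)) ** 2 for s in spreads))
print(f"score uncertainty: +/- {uncertainty:.2f} points")
```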
The believability of the scores can also be determined by calculating the scores for each skater from each judge and inter-comparing the judges' scores. In this approach, the uncertainty in the point totals is typically 2-3 points. However, since this is not the calculation method used by CoP, we will give CoP the benefit of the doubt and stick with the value of 3/4 of a point for the uncertainty in the scores.
Based on the quantitative statistical accuracy of the CoP system, it is impossible to say with any statistical confidence which of two skaters had the better performance if their point totals differ by less than 1.5 points (twice the 3/4 point uncertainty in each score). Based on the mathematical characteristics of the actual marks, skaters whose point totals differ by less than 1.5 points should be considered tied.
During the Grand Prix, more than one-third of the places were determined by a point difference of less than 1.5 points, some places were determined by a point difference of 1/100 of a point, and several medals were determined by point differences less than 1/10 of a point. In terms of the statistical accuracy needed to believably determine places, CoP was a major failure during the Grand Prix.
In Championship events with 20-30 skaters, one can expect the fraction of places to be determined by statistically meaningless point differences to increase, since the average point difference between places in larger events will decrease by a factor of 2 or more compared to the Grand Prix. Without changes that improve the statistical accuracy of the system, CoP will continue to be a failure in the area of believably separating performances of nearly equal value.
In the past we have described how jumps could be expected to make up more than 40% of the points for the men's free skating and presentation only about 30% -- and that spins and sequences would have negligible value in CoP. These values were based on calculations of the expected point totals for programs typically presented by the best skaters.
Using the actual scores during the Grand Prix, the amount by which each class of element contributes to the actual scores was calculated. In singles, for example, we divided the scoring into five classes of skill: jumps, spins, sequences, basic skating and transitions (program components 1 and 2), and presentation (program components 3 through 5). For the Grand Prix we find the following overall distribution of points:
Skill | Men's Free Skating (Typical) | Men's Free Skating (Elite Skaters) | Ladies Free Skating (Typical) | Ladies Free Skating (Elite Skaters)
Jumps | 36% | 41% | 33% | 39%
Spins | 8% | 8% | 10% | 10%
Sequences | 4% | 4% | 5% | 5%
Basic Skating and Transitions | 21% | 20% | 21% | 18%
Presentation | 31% | 28% | 31% | 28%
The actual marks during the Grand Prix confirm previous comments about the distribution of points, and confirm that scores are dominated by the marks for jumps. Spins and sequences together make up only 12-15% of the scores, while jumps and the program components make up 85-88%.
To examine the impact of spins and sequences on the results, the skaters' scores were recalculated omitting the points from spins and sequences. The order of finish determined this way differs from the order of finish using all elements by the occasional place switch among some skaters. The impact of spins and sequences on the results is limited to at most one place for less than one-third of the skaters.
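A sketch of this recalculation, assuming each skater's points have been broken out by element class (names and values are illustrative):

```python
# Total score and spin/sequence portion per skater
# (illustrative values, not actual Grand Prix scores).
totals = {"A": 104.2, "B": 101.7, "C": 99.9, "D": 95.3}
spin_seq = {"A": 13.1, "B": 9.5, "C": 12.6, "D": 10.9}

def order(scores):
    """Skaters sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

without = {s: totals[s] - spin_seq[s] for s in totals}

full_order = order(totals)      # ['A', 'B', 'C', 'D']
reduced_order = order(without)  # ['B', 'A', 'C', 'D']

# Count skaters whose place changes when spins and sequences
# are omitted from the scores.
moved = sum(1 for x, y in zip(full_order, reduced_order) if x != y)
print(f"{moved} of {len(totals)} placements changed")
```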
Since the completion of the Grand Prix, one additional jump has been added to the allowed elements. This change will further shift the distribution of points in favor of jumps, adding a few percentage points to their contribution and decreasing the others. For Free skating, CoP was a jumping contest during the Grand Prix and will be more so in the future. With one additional jump, spins and sequences can be expected to have even less of an impact on the results.
The following table shows the balance of elements for the singles Short Programs, as determined in actual competition.
Skill | Men's Short Program (Typical) | Men's Short Program (Elite Skaters) | Ladies Short Program (Typical) | Ladies Short Program (Elite Skaters)
Jumps | 30% | 36% | 28% | 30%
Spins | 12% | 10% | 14% | 15%
Sequences | 8% | 7% | 10% | 10%
Basic Skating and Transitions | 20% | 19% | 19% | 18%
Presentation | 30% | 28% | 29% | 27%
In the Short Programs, the balance of the elements is not as bad as for the Free Skating, since the numbers of elements of each type are more nearly equal in the Short Programs. Spins and sequences together make up 17-25% of the scores, while jumps and the program components make up 75-83%. Despite the slightly greater weight given spins and sequences in the Short Programs, the impact of spins and sequences on the actual placements is only slightly greater than for the Free Skating.
One contentious feature of CoP is the random selection of the marks. There is no evidence that random selection of marks does anything to deter attempts at misconduct. It is clear from the marks in the Grand Prix, however, that it has a significant negative impact on the scoring.
The most obvious example of how random selection of the marks can skew the results occurs when a panel is nearly evenly divided in rating two skaters. If half the panel gives Skater A higher marks than Skater B, and seven judges are randomly selected, it is obvious that either Skater A or Skater B could be placed first, depending on which judges are selected. Less obvious is that even if as many as seven or eight judges give Skater B higher marks than Skater A, random selection of marks can still result in Skater A receiving the higher score. (In these examples, we assume that there are ten judges, seven of whom are randomly selected.)
Consider the following example for one program component, where seven of ten judges score Skater B higher than Skater A. Using all ten judges, Skater B outscores Skater A.
Skater | J1 | J2 | J3 | J4 | J5 | J6 | J7 | J8 | J9 | J10 | Average
Skater A | 6.25 | 7.00 | 6.25 | 6.25 | 5.50 | 6.25 | 6.25 | 6.50 | 5.75 | 6.75 | 6.275
Skater B | 6.75 | 7.25 | 6.00 | 6.00 | 5.75 | 6.50 | 6.50 | 7.00 | 6.75 | 6.50 | 6.500
If three of the judges who scored Skater B higher than Skater A are eliminated, we end up with the following marks:
Skater | J2 | J3 | J4 | J5 | J6 | J7 | J10 | Average
Skater A | 7.00 | 6.25 | 6.25 | 5.50 | 6.25 | 6.25 | 6.75 | 6.321
Skater B | 7.25 | 6.00 | 6.00 | 5.75 | 6.50 | 6.50 | 6.50 | 6.357
After the random selection of judges, four of the seven remaining judges have scored Skater B higher than Skater A, and Skater B still outscores Skater A.
CoP, however, does not use a simple average; it uses a single trimmed mean, in which the highest and lowest marks are discarded before averaging. If we take the trimmed mean of the seven judges in this example, the marks from two of the four judges who scored Skater B higher are eliminated, and Skater A ends up with the higher score for this component.
Skater | J3 | J4 | J6 | J7 | J10 | Trimmed Average
Skater A | 6.25 | 6.25 | 6.25 | 6.25 | 6.75 | 6.350
Skater B | 6.00 | 6.00 | 6.50 | 6.50 | 6.50 | 6.300
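The whole sequence can be checked mechanically. A sketch using the marks from the tables above (taking the single trimmed mean to drop one highest and one lowest mark, as in the worked example), with an exhaustive check of all 120 possible seven-judge panels added at the end:

```python
from itertools import combinations
from statistics import mean

# Marks for one program component, from the tables above.
a = [6.25, 7.00, 6.25, 6.25, 5.50, 6.25, 6.25, 6.50, 5.75, 6.75]
b = [6.75, 7.25, 6.00, 6.00, 5.75, 6.50, 6.50, 7.00, 6.75, 6.50]

def trimmed_mean(marks):
    """Drop one highest and one lowest mark, then average."""
    s = sorted(marks)
    return mean(s[1:-1])

print(round(mean(a), 3), round(mean(b), 3))  # 6.275 6.5 -- B ahead

# The random selection in the example keeps judges J2-J7 and J10.
keep = [1, 2, 3, 4, 5, 6, 9]  # zero-based indices
a7, b7 = [a[i] for i in keep], [b[i] for i in keep]
print(round(mean(a7), 3), round(mean(b7), 3))  # 6.321 6.357 -- B ahead

# The trimmed mean of the selected marks reverses the result.
print(trimmed_mean(a7), trimmed_mean(b7))  # 6.35 6.3 -- A wins

# How often does each skater come out ahead over every possible
# selection of seven judges from ten?
a_wins = sum(trimmed_mean([a[i] for i in p]) > trimmed_mean([b[i] for i in p])
             for p in combinations(range(10), 7))
print(f"Skater A outscores Skater B on {a_wins} of 120 panels")
```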
Since some judges consistently mark a little higher or lower than others, this effect accumulates over the many program component and element scores, and the net effect is that even with seven of ten judges scoring one skater higher, the other skater can end up with more total points a significant fraction of the time.
The effect of random selection of marks can be even more extreme than in this example. It is not just a matter of how many judges score one skater higher than another; it is also the amount by which they score one higher than another. For example, eight of ten judges could score Skater B higher than Skater A. After random selection of the judges, marks from five of the seven remaining judges could be higher for Skater B than Skater A. After the single trimmed mean, three of the five remaining marks could be higher for Skater B than for Skater A, and yet Skater A can still end up with the higher score if the two judges marking A higher than B do so by a greater point difference than the three marking B higher than A.
To see how often random selection of marks affected the results in the Grand Prix, the order of finish using a single trimmed mean applied to the marks from all the judges was calculated. The difference between the order of finish calculated this way and the official order of finish is due to the random selection of marks in the official results. It is found that random selection of marks typically skews the results for 1/6 to 1/3 of the placements, and in some events it skews the results for as many as 50% of the places. This can be expected to occur even more frequently in Championship events, where the typical difference in points between places can be expected to be about one-third what it was in the Grand Prix. (The smaller the point difference between places, the more sensitive the results are to random selection.)
Random selection of marks has yet another effect when combining the points from two segments of an event, which further increases the impact random selection has on the scores.
Suppose, for example, Skater A scores five points more than Skater B in the Short Program. To win the event in this example, Skater B needs to beat Skater A by more than five points in the Free Skating.
In examining the marks in the Grand Prix one finds that the point difference between two skaters can vary drastically depending on which judges are randomly selected. One can have events where all ten judges score Skater B higher than Skater A, but depending on which judges are selected the margin of victory can vary from one or two points to ten or more points. Depending on the random selection of judges, Skater B may or may not be scored high enough to overcome the margin of victory in the Short Program, even though the average score from all ten judges gives Skater B a margin of victory high enough to win the event. For example, using all the judges, Skater B might have a six point victory in the Free Skating, which is enough to win the event, but due to random selection of marks might actually end up with only a four point margin of victory and place second.
Random selection has a frequent and pervasive negative impact on the results in CoP. It skews the results in individual event segments for a significant fraction of the places, and skews the event point totals on top of that. With random selection of marks CoP is currently a computerized roulette wheel.
One of the characteristics of CoP that has been well received is that it allows a come-from-behind victory without help. The great example of this in the Grand Prix is the ladies event at Cup of China, where Elena Liashenko placed seventh in the Short Program and then won the Free Skating to win the event; however, this is only part of the story for this event.
CoP does allow a small improvement over Total Factored Place (TFP) in combining event segments, but not nearly as much as expected, and there is no free lunch. Some skaters gain by directly summing the points, but others lose.
The following table shows the results from Cup of China. In this table it is seen that under TFP Liashenko would have moved up from 7th only to 2nd place, so using point totals bought her just one place. On the other hand, Jennifer Robinson lost out big-time, ending up three places lower using point totals compared to TFP.
Ladies FS, Cup of China

Skater | SP | FS | Combined Using Points | Combined Using TFP
Liashenko | 7 | 1 | 1 | 2
Onda | 3 | 2 | 2 | 1
Suguri | 1 | 5 | 3 | 3
McDonough | 2 | 6 | 4 | 5
Basova | 5 | 4 | 5 | 4
Corwin | 6 | 7 | 6 | 7
Volchkova | 4 | 9 | 7 | 8
Fang | 8 | 8 | 8 | 9
Robinson | 10 | 3 | 9 | 6
Liu | 9 | 10 | 10 | 10
Gimazetdinova | 11 | 11 | 11 | 11
Another negative of summing points is that a skater can place second in the Short Program, win the Free Skating, and still lose the event. During the Grand Prix this happened to 1/4 of the skaters who won the Free Skating after placing second in the Short Program.
It is also interesting to note that Liashenko benefited not only from the use of point totals, but also from random selection of marks and harsh double penalties directed at Yoshie Onda.
For the 11 sets of marks in the Protocol, the point difference between Liashenko and Onda for the Free Skating varied from a few points in favor of Onda to 44 points in favor of Liashenko. Liashenko lucked out on the random selection of marks, which happened to choose the judges that gave her the largest margins of victory. If all the judges are used, Liashenko's margin of victory is only a few tenths of a point.
As for Onda, in the Free Skating she attempted eight jump elements when only seven are allowed. Her last three jumps were triple toe loop, triple toe loop, and double Axel. Since the second triple toe loop could not be repeated outside a combination it did not count; but neither did the double Axel, so she received points for only six jump elements. Had she done the Axel before the second triple toe loop it would have counted, and without random selection of marks she would have won.
Consequently, Liashenko's come-from-behind victory resulted not only from the way event segments are combined, but also from random selection of marks and the way penalties are assessed. Note, for example, that there is a combination of judges that would have given Onda victory in both the Free Skating and the overall event, had that combination been randomly selected.
Whether summing points from event segments is a good thing or not is to some extent more a philosophical question than mathematical one. Cynthia Phaneuf, who moved up from 8th place in the Short Program to finish 2nd by winning the Free Skating at Four Continents under TFP, would probably have liked summing of points in her competition, but other skaters in the Grand Prix would probably have preferred their overall results calculated using TFP.
Another basic premise of CoP is that by combining the marks from every judge for the different aspects of skating using the same weights, greater consistency is obtained, producing superior results. It is generally assumed that the large range in marks and ordinals sometimes seen in the 6.0 system is due to the judges giving different values to the various aspects of skating and combining them in different ways. A fundamental goal of CoP is to eliminate this inconsistency.
At events marked using the 6.0 system following the conclusion of the Grand Prix, there were occasional cases where skaters received a wide range of scores and ordinals. It was often remarked how terrible this was, how this wouldn't happen under CoP, and how the sooner all competitions switched to CoP the better.
In examining the marks for the ladies Free Skating at Cup of China it was noticed, however, that there was a great deal of spread in the judges' marks. Here we look at the spread of scores under CoP to determine to what extent it is an improvement over the current system.
The following table shows the scores from each judge in the ladies Free Skating at Cup of China, together with the corresponding places (in parentheses). The skaters are listed in official order of finish.
Ladies Free Skating, Cup of China

Skater | J1 | J2 | J3 | J4 | J5 | J6 | J7 | J8 | J9 | J10 | J11
Elena LIASHENKO | 96.44 (2) | 90.27 (3) | 101.07 (1) | 104.74 (1) | 95.97 (1) | 108.47 (1) | 109.74 (1) | 115.64 (1) | 95.47 (5) | 92.74 (2) | 93.87 (1)
Yoshie ONDA | 95.69 (3) | 90.59 (2) | 98.69 (2) | 97.49 (3) | 93.49 (2) | 100.49 (3) | 88.19 (3) | 71.59 (3) | 108.59 (1) | 98.49 (1) | 92.09 (2)
Jennifer ROBINSON | 78.28 (9) | 97.75 (1) | 85.45 (6) | 81.08 (8) | 85.45 (6) | 92.85 (5) | 79.42 (8) | 78.12 (2) | 97.65 (4) | 91.55 (4) | 78.05 (6)
Tatiana BASOVA | 98.24 (1) | 81.19 (8) | 90.79 (3) | 99.59 (2) | 79.69 (7) | 104.39 (2) | 80.99 (6) | 66.76 (4) | 90.39 (6) | 91.89 (3) | 72.99 (7)
Fumie SUGURI | 84.66 (5) | 88.64 (4) | 85.60 (5) | 91.60 (4) | 89.54 (3) | 87.16 (8) | 80.56 (7) | 61.16 (8) | 99.14 (2) | 89.84 (5) | 82.56 (3)
Ann Patrice McDONOUGH | 82.48 (7) | 83.28 (6) | 83.28 (8) | 88.48 (5) | 88.83 (5) | 90.38 (7) | 82.88 (5) | 62.65 (6) | 97.73 (3) | 80.08 (7) | 82.05 (4)
Amber CORWIN | 93.78 (4) | 86.28 (5) | 88.78 (4) | 82.18 (7) | 88.85 (4) | 80.68 (9) | 94.35 (2) | 62.26 (7) | 83.13 (7) | 77.88 (8) | 78.98 (5)
Dan FANG | 76.07 (10) | 80.37 (9) | 77.97 (10) | 73.07 (9) | 79.67 (8) | 92.77 (6) | 74.17 (9) | 63.77 (5) | 81.07 (9) | 84.77 (6) | 72.27 (8)
Viktoria VOLCHKOVA | 81.19 (8) | 81.29 (7) | 79.89 (9) | 83.99 (6) | 79.29 (9) | 78.89 (10) | 86.89 (4) | 60.39 (10) | 83.09 (8) | 77.79 (9) | 70.09 (9)
Yan LIU | 82.76 (6) | 76.46 (10) | 83.86 (7) | 68.99 (10) | 75.16 (10) | 93.86 (4) | 69.26 (10) | 60.59 (9) | 78.86 (10) | 76.46 (10) | 64.66 (10)
Anastasia GIMAZETDINOVA | 64.88 (11) | 62.08 (11) | 63.28 (11) | 60.05 (11) | 55.68 (11) | 72.98 (11) | 65.38 (11) | 47.22 (11) | 67.68 (11) | 65.78 (11) | 49.25 (11)
This table of marks and placements is about as dreadful as one sees in competition, usually at events below the Juvenile level. If this table had been generated from marks in the 6.0 system, people would be screaming bloody murder! Liashenko has places of 1 through 5 and marks that correspond to the high 5's to mid 4's. Onda has marks equivalent to the mid 5's through the low 3's. Robinson has placements of 1 through 9 and marks equivalent to the high 4's to the mid 3's. Corwin has placements of 2 through 9 and marks equivalent to the high 4's to the low 3's. And these are just some of the examples of the inconsistency in these results.
This table also illustrates that, despite a major effort to train the judges, some judges mark systematically higher than average and other judges mark systematically lower than average. It is also found that these systematically high and low marks are not completely removed by the use of the trimmed mean. By examining the marks at the Grand Prix it is found that about half the marks from the high and low judges are not eliminated when the trimmed mean is applied on an element-by-element basis. Getting a large group of humans to score in a consistent way is extremely hard to accomplish (which is why the concept of ordinals was introduced in the first place), and has yet to be accomplished under CoP.
The table speaks for itself. In terms of bringing consistency to the marks and the places, CoP in its current form is a complete failure. The result for this event is not unique; there were many events during the Grand Prix that were just as bad. For CoP to work, the judges must all mark in a consistent way. At this point, clearly they do not. Whether judges can ever be trained to do this remains to be seen. Further training of the judges and further testing will be required to prove that they can -- for, to quote Mark Twain, "supposing is good, but knowing is better."
One can carry this analysis one step further and ask if among the different aspects of skating there was any better consistency. For the five classes of skill (jumps, spins, sequences, basic skating and transitions, and presentation) the number of points in each class of skill and the corresponding places for each skill were calculated. The result of this calculation is that the scores and places for the individual skills show just as much variation as do the total points. The marks and places are all over the map. CoP is no better at determining who is the best jumper or spinner, etc., than it is at determining who is the best overall skater.
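For reference, converting one judge's point totals into places, as done for the tables in this section, is a simple ranking operation; a sketch using a few of the J1 scores from the earlier table:

```python
def places(scores):
    """Place implied by one judge's scores (1 = highest score)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {skater: i + 1 for i, skater in enumerate(order)}

# Judge J1's Free Skating scores for four of the skaters above.
j1 = {"LIASHENKO": 96.44, "ONDA": 95.69, "ROBINSON": 78.28,
      "BASOVA": 98.24}
print(places(j1))
# {'BASOVA': 1, 'LIASHENKO': 2, 'ONDA': 3, 'ROBINSON': 4}
```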
The following table shows the places based on the points from presentation alone (sum of program components 3 through 5) for the ladies Free Skating at Cup of China. The skaters are listed in the official order of finish.
Ladies Free Skating, Cup of China
Skater | J1 | J2 | J3 | J4 | J5 | J6 | J7 | J8 | J9 | J10 | J11 |
Elena LIASHENKO | 4 | 6 | 2 | 3 | 5 | 1 | 2 | 1 | 5 | 8 | 4 |
Yoshie ONDA | 5 | 7 | 5 | 6 | 4 | 6 | 8 | 4 | 2 | 5 | 3 |
Jennifer ROBINSON | 9 | 1 | 8 | 8 | 7 | 7 | 6 | 2 | 4 | 2 | 6 |
Tatiana BASOVA | 5 | 10 | 6 | 4 | 8 | 2 | 7 | 7 | 6 | 4 | 8 |
Fumie SUGURI | 3 | 2 | 1 | 2 | 3 | 4 | 5 | 6 | 3 | 1 | 2 |
Ann Patrice McDONOUGH | 2 | 3 | 3 | 1 | 1 | 3 | 3 | 3 | 1 | 3 | 1 |
Amber CORWIN | 1 | 4 | 4 | 7 | 2 | 9 | 1 | 5 | 6 | 6 | 5 |
Dan FANG | 11 | 9 | 10 | 9 | 8 | 8 | 10 | 9 | 9 | 6 | 9 |
Viktoria VOLCHKOVA | 7 | 5 | 7 | 5 | 6 | 10 | 4 | 8 | 8 | 9 | 7 |
Yan LIU | 8 | 8 | 8 | 10 | 10 | 5 | 11 | 10 | 10 | 10 | 10 |
Anastasia GIMAZETDINOVA | 10 | 11 | 11 | 11 | 11 | 11 | 9 | 11 | 11 | 11 | 11 |
This table also speaks for itself. The agreement among the judges is just about as dreadful as dreadful can be. Five ladies got at least one first-place score for presentation, and seven got at least one second-place score!
The result for this event and this skating skill is not unique. In many events the five skating skills all show equally poor consistency. This degree of disagreement would be considered completely unacceptable in a senior-level event judged under the 6.0 system; and, ironically, when calculated in terms of places, the inconsistency in the marks is significantly worse (by 50-200%) under CoP than what is typically found under the ordinal system. In terms of consistency of judgment, the data show that CoP does not perform nearly as well as the 6.0 system.
This raises the question, if the judges are capable of significantly better agreement under the 6.0 system, why are the same judges incapable of getting equally good consistency under CoP? Is it a fundamental defect in CoP? Is it a result of inadequate training? The only way to find out for sure is further study and testing.
This degree of inconsistency also raises the question of how a meaningful and valid system of accountability can be implemented under CoP when the judges' marks show so much variation. For example, according to some criteria being considered for accountability, one might conclude that more than half the marks in the above table are anomalies!
Among the different events, consistency among the judges is worst in singles, and only slightly better in pairs. As is often the case, dance follows a different pattern, with the scoring of the judges fairly consistent, though still not as much as might be expected given the assumption underlying the construction of CoP.
To the extent it places the better performances more or less at the top and the worst performances more or less at the bottom, CoP gives the appearance of producing plausible results. However, when subjected to objective mathematical testing it is found there is little validity to the specific placements produced by the system for the vast majority of the skaters. Detailed placements under CoP lack validity for the following reasons.
Use of the program components has systematic defects, as does use of the element GoEs. The statistical believability of the results fails to meet reasonable and necessary standards. The contributions to the total scores from the five main types of skating skills are highly unbalanced. Random selection of marks frequently skews the results in significant ways, both for individual event segments and for complete events. The ability to have a come-from-behind victory is a double-edged sword with benefits and drawbacks. Consistency in the judges' marks is significantly worse in CoP compared to the current system of judging.
These calculations indicate that CoP has yet to achieve its desired goal of providing a rigorous, bias-free, absolute standard for evaluating skating. It is clear that improvements are needed in several areas. It is also clear that the judges are not marking the grades of execution and program components with the accuracy and consistency needed. Whether these problems can be solved through revision of CoP and further training of the judges is unknown, and demands further study and testing. The statistical methods used here provide an objective way of determining whether such efforts achieve the improvement required, and which the skaters deserve.
Copyright 2004 by George S. Rossano