An Analysis of ISU Judging Proposals

by Dr. George S. Rossano

INTRODUCTION

In a sport such as skating where results are determined by combining the numerical assessments of a panel of judges, the scoring system (or more correctly the accounting system) must satisfy four basic requirements. The combined result must accurately reflect the consensus of the judges. It must take into account small honest differences of opinion among the judges, and must filter out systematic bias to the greatest extent practical. The workings of the method must be transparent and understandable so skaters and the public trust the results and understand who wins and why. Because the scoring system must meet several equally important needs, care must be taken not to concentrate on one objective to the extent that other essential objectives are compromised.

In the wake of the figure skating judging scandal at the 2002 Olympic Games, the International Skating Union in February announced a series of proposals to overhaul the current judging system. The ISU proposal is to have 14 judges assess the skaters’ performances and then randomly select seven of the assessments to determine the skaters’ placements. A computer program will use the judges’ assessments to assign a numerical value to the performance. The actual assessments made by the judges will be kept secret and the number of judges on a panel from any one regional block of judges is to be limited to four. In comparison, in the current ISU scoring system, if a majority of judges prefer one skater over another, the skater with the majority of judges in their favor beats the other. A third approach is to simply base the result on the actual marks. These different approaches do not always give the same answer.

To understand why the different results occur, think of a baseball game where the Dodgers outscore the Giants in each of five innings while the Giants outscore the Dodgers in each of four innings and also score more runs overall. Under the current ISU system the Dodgers win the game because they won five of the nine innings, regardless of the overall score. Under a system using the actual marks the Giants win because, for the nine innings as a whole, they scored more runs overall. In the new proposed ISU system the game would be 14 innings long and the score from 7 randomly selected innings would determine the winner. Depending on the score in each inning and which innings are selected sometimes the Dodgers win and sometimes the Giants. Of these three methods, most people would agree counting the runs for the whole game is the best.
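
The baseball comparison can be sketched in a few lines of Python. The inning-by-inning scores below are hypothetical, chosen only so that the Dodgers take five innings while the Giants take four and score more runs overall; the function names are likewise illustrative.

```python
# Hypothetical inning scores matching the scenario in the text:
# Dodgers win five innings, Giants win four but score more runs overall.
dodgers = [1, 1, 1, 1, 1, 0, 0, 0, 0]
giants = [0, 0, 0, 0, 0, 3, 2, 2, 1]


def winner_by_innings(a, b, name_a, name_b):
    """Majority rule: the side that wins more individual innings wins."""
    a_wins = sum(x > y for x, y in zip(a, b))
    b_wins = sum(y > x for x, y in zip(a, b))
    if a_wins == b_wins:
        return "tie"
    return name_a if a_wins > b_wins else name_b


def winner_by_total(a, b, name_a, name_b):
    """Actual-marks rule: the side with the larger overall score wins."""
    if sum(a) == sum(b):
        return "tie"
    return name_a if sum(a) > sum(b) else name_b


print(winner_by_innings(dodgers, giants, "Dodgers", "Giants"))  # Dodgers
print(winner_by_total(dodgers, giants, "Dodgers", "Giants"))    # Giants
```

The same data produce two different winners, which is the disagreement between the majority approach and the actual-marks approach described above.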

RANDOM-7

Random Selection of Marks

One characteristic of Random-7 (my name for the proposed system) is that the random selection of seven of the fourteen assessments will all too frequently defeat the consensus of the panel as a whole. In a tie, for example, where the 14 judges are evenly divided between two skaters, a random selection of seven judges will usually force a winner and that winner will be selected by the computer’s flip of the coin. On a six-eight split of opinion there is a 30% chance that four or more of the judges in the minority will be selected and make up a majority of the seven assessments used. On a five-nine split there is a 13% chance the minority will control the combined assessment and on a four-ten split a 4% chance. Overall, about one-quarter of the time the minority of the 14 judges will control the combined assessment and produce a result contrary to the consensus of the entire panel.
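
The split-panel percentages quoted above can be reproduced with a short counting argument. This is a minimal sketch that assumes the seven assessments are drawn uniformly at random without replacement from the 14 judges; the function name is illustrative.

```python
from math import comb


def minority_control_probability(minority, panel=14, drawn=7):
    """Chance that a random draw of `drawn` judges out of `panel` contains
    a majority of judges from a minority group of size `minority`."""
    needed = drawn // 2 + 1  # four of seven forms a majority
    total = comb(panel, drawn)
    favourable = sum(
        comb(minority, k) * comb(panel - minority, drawn - k)
        for k in range(needed, min(minority, drawn) + 1)
    )
    return favourable / total


for minority in (6, 5, 4):
    p = minority_control_probability(minority)
    print(f"{minority}-{14 - minority} split: {p:.1%}")
# 6-8 split: 29.6%
# 5-9 split: 13.3%
# 4-10 split: 3.5%
```

These values round to the 30%, 13% and 4% figures given in the text. For an even 7-7 split the same function returns exactly one half, which is the coin flip described above.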

Suppose, for example, you have black socks and blue socks in a drawer and you want to know which you have more of. The best choice is to count all 14 socks in the drawer. The worst choice is to randomly pick out just one. The ISU proposal would have you randomly pick out seven of the fourteen.

If you have seven black socks and seven blue socks in the drawer and randomly pick out seven you will always pick out more of one color than the other. You always get the wrong answer. Now try it with six black socks and eight blue ones. About one-third of the time your random choice of seven will give a majority of black socks. Again the wrong answer – there are more blue socks in the drawer, not black. Go to the extreme and put four black socks in the drawer and ten blue ones. Pick seven again several times. Sometimes you will still end up with a majority of black socks. Still the wrong answer.

The lesson is simple. If you want to know the count of socks in a drawer nothing beats actually counting the socks, and if you want to know who should win a skating competition nothing beats counting all the judges, preferably a large number of judges.

Panel Size

Why a large number of judges?

If you desperately needed to know what the people in your town feel about a controversial subject – your life depends upon it – you wouldn’t ask just your neighbor; you would ask as many people as possible. The more opinions you got the better you would know the answer. When it comes to mathematically combining the numerical assessments of judges the same is true. You can’t have too many judges – at least mathematically. The current ISU system uses nine judges. The proposed system will use seven for determining results.

In terms of confidence in the result, it is always better to have more judges rather than fewer, and the effect of panel size on the confidence level can be quantified easily. By going to a seven-judge result from the current nine-judge panel the confidence in the result decreases by 14% and the impact of one biased mark on the result increases by 28%.
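
The text does not spell out how these figures are derived. One plausible reading, assumed here, is that the judges' marks are independent, so the uncertainty of a combined result scales as one over the square root of the panel size, while a single mark carries a weight of one over the panel size. Under those assumptions the arithmetic is roughly:

```python
from math import sqrt


def relative_uncertainty(n_judges, reference=9):
    """Uncertainty of a combined result relative to a reference panel,
    assuming independent marks so that uncertainty scales as 1/sqrt(N)."""
    return sqrt(reference / n_judges)


def single_mark_weight(n_judges):
    """Share of the combined result carried by one judge's mark."""
    return 1 / n_judges


# Seven judges compared with the current nine.
print(f"{relative_uncertainty(7) - 1:.1%}")                         # 13.4%
print(f"{single_mark_weight(7) / single_mark_weight(9) - 1:.1%}")   # 28.6%
```

Both values are close to the 14% and 28% quoted above; any small difference presumably reflects rounding or a slightly different statistical model than the one assumed in this sketch.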

Secrecy and Filtering of Biased Marks

The primary benefit of the ISU proposals is that deal making is discouraged because conspirators will never know if they followed through on the plan or if their marks were actually used. But deal making is only one form of bias the scoring system must protect against. Bias may be in the form of honest human error, personal bias in judgement, national bias, block judging or deal making. Most of these are not eliminated or discouraged by secrecy, and secrecy may instead allow many acts of bias to go undetected. National bias, for example, is a solitary act. There is no penalty for trying, and a 50% chance of affecting the result makes for pretty good odds. Further, the chances will be even better. If random selection is done for every skater, as is proposed, a judge has a 75% chance of skewing the result for two skaters by marking one skater up and the other down, since there is only one chance in four that both marks will be eliminated. In addition, the proposal to combine the seven assessments by averaging them eliminates all mathematical protection against bias. Detailed statistical analysis shows that when one to four of the seven judges indulge in some form of bias, Random-7 will fail to reject bias 74 to 96% of the time and deliver a skewed answer.
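
The 50% and 75% odds above follow from simple probability. The sketch below assumes that a judge's mark for each skater is used with probability 7/14, and that the draws for different skaters are independent; the function name is illustrative.

```python
from fractions import Fraction


def chance_some_mark_counts(skaters_marked, judges=14, selected=7):
    """Probability that at least one of a judge's biased marks is among
    those selected, assuming an independent random draw for each skater."""
    p_used = Fraction(selected, judges)
    return 1 - (1 - p_used) ** skaters_marked


print(chance_some_mark_counts(1))  # 1/2
print(chance_some_mark_counts(2))  # 3/4
```

Marking one skater up and a rival down gives the judge two independent chances to have a biased mark counted, which is why the odds rise from one half to three quarters.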

The purpose of the random choice of judges and the secrecy is to discourage back-room deals. The secrecy element of Random-7 offers some discouragement to deal making, but it also has its drawbacks, not the least of which is that secrecy and confidence in the process are mutually exclusive. Marks, even those generated by a computer, ultimately are determined by the judges’ individual assessments which will now be hidden from view. Secrecy also makes it impossible to understand the full story of why skaters were placed where they were. The public and media are unlikely to embrace a system that is cloaked in secrecy.

From the public’s point of view, it would be as if the officials in a basketball game were suspected of making poor calls and being on the take from team owners. The league would then have the officials randomly work only half of each game and have them make all their decisions in secrecy. The public would not trust such a system and the average spectator would find the competition confusing. The public would probably prefer the league improve the officials and watch their work a little more closely – and so would the players, no doubt.

In effect, the ISU approach to stop cheating is to turn the crooks away half the time and allow the crooks to do whatever they want for the remainder. Few people would buy a security system that works only half the time and few crooks would be discouraged by such a system. Thus, randomness will do little to discourage cheating and will create a host of additional problems instead. Secrecy, however, has some small benefit for discouraging deal making. For example, most people would not hire a crook to commit a crime if they could never find out if the crime was committed. This benefit of secrecy, however, can be obtained in other ways.

Block Judging

Limiting the number of judges from any one block could be a valuable tool to help eliminate block judging if done properly. This innovation, however, is compromised in Random-7. Assuming the judging pool consists of two blocks with the remaining judges neutral, the blocks will be unequally represented among the seven selected assessments 75% of the time, and a block of four judges will be a majority or near-majority (three of seven) 20% of the time. Mathematically, it is more effective to have blocks equally represented and to dilute the blocks in a larger panel. A block of four can never be a majority on a nine-judge panel and is only 27% of a 15-judge panel.

Further, there isn’t just one block to keep track of. We in North America are quick to point our fingers at the former Soviet block, but the Eastern Europeans and former Soviet block point right back at us. With Random-7, three-quarters of the time two blocks will be unequally represented among the seven judges selected. Since nobody trusts anybody else, the only fair thing to do is make sure each block is limited in size and equally represented, and to choose the panel size so that no one block can ever be close to a majority. If the ISU is willing to use 14 judges, they might as well go to fifteen, and use them all. In that case, a block of four is only about one-quarter of the panel, and relatively harmless. In addition, increasing the panel size increases the statistical confidence in the result by over one-quarter.

Blocks must also be chosen with care. Geographic boundaries alone will not work well. For example, the former-Soviet republics lie in both Europe and Asia. If there were an Eastern European block and an Asian block and four judges were chosen from each, it is highly likely panels would end up with more than four judges from former-Soviet republics. The North American and Western European publics would not view this as an improvement.

Computer Generated Marks

A key element of the ISU proposals is the use of computer generated marks. This offers intriguing possibilities, but the ISU’s estimate that this can be implemented in only two years is naïve.

The main value of computer generated marks – in which the judges enter assessments and the computer generates marks – is that differences in the criteria judges use to combine all the aspects judged in a skating competition will be greatly reduced. Producing such a program is an extremely complex task, however, which I can attest to from personal experience.

Following the 1999 World Championships I began looking at methods of computerizing the judging process. This began with the development of a graphical interface that allowed the complete annotation of a program, including all content executed and the assessment of the quality of the elements and program. The goal for the interface was complete flexibility to annotate all elements and movements that appear in skating programs in unlimited combination, flexibility to anticipate new movements and elements, ability to completely assess all aspects of each element and movement as performed, and ease of use so that entry of information does not distract the judges from observing the program. Ultimately these goals were achieved, taking about two years to complete.

Beginning with the 2001/02 season I began looking at how the assessments in the computer could be combined to generate the marks. I refer to this as the "point model." Arriving at the correct point model is the most difficult part of this approach. The point model must take into account all elements and movements combined in unlimited ways. It must anticipate expected development in the sport for the foreseeable future. In accordance with the well balanced program requirements it must produce "reasonable results" for all combinations of strengths and weaknesses among all the aspects of skating judged. It must be "well calibrated" so it produces results consistent with the way judges currently assess performances for all levels of skill.

The goals of computer generated marks are probably achievable, but will require considerable effort and testing. There are several ways to implement the point model and to combine the judges’ assessments. The details for these thus far provided by the ISU suggest a simplistic system that lacks the flexibility and sophistication to meet required performance goals. In addition, the ISU’s currently described method of combining assessments is extremely weak in terms of filtering bias from the results, and the method should not be used on that count alone.

The ISU has grossly underestimated the difficulty and time needed to implement computer generated marks. Based on my own efforts in this area and 20 years’ experience writing a similar class of software for use in scientific research, and in evaluating systems developed by others and the processes used in their development, it appears that this effort will require 3-4 years to produce a system with the necessary characteristics if supported at a reasonable level. In addition, the secretive, closed manner in which the ISU is currently developing this system does not lend itself to the successful development of a system with the performance potential obtainable and required.

Judges’ Accountability

In the ISU proposals there is a group of 16 "super-judges" for each event. Two of the judges are selected to be Referee and Assistant Referee. The remaining 14 judge the event. The judges’ assessments are kept secret and will be sent to ISU headquarters for review. The decisions of the judges will not be reviewed in a post-event debriefing.

The idea of a system of accountability in which the judges’ performance is reviewed is excellent, but the method proposed is not well thought out, lacking standards to be met, criteria to evaluate performance, and actions to be taken when remedial action is required. The focus seems to be simply on requiring a high level of agreement (yet to be specified) among the judges. This is an extremely dangerous idea. Past experience from ISU and USFSA competitions has shown that when agreement with the panel forms the emphasis for evaluating the judges, the judges tend to second-guess the panel and do not always vote their true opinions. This effect has influenced medal results in both organizations in the past.

The USFSA moved away from this approach beginning in 1990 and the effects have only been for the better. Statistical analysis shows that the marks from competitions held after this change, as compared to before, are in much better agreement with well-known limits on the human ability to numerically quantify complex assessments of events. Over the past decade, movement in results has increased, and reputation-judging is now virtually non-existent within the USFSA.

Selecting the Referee and Assistant Referee from the pool of judges and eliminating the judges’ review meeting are also dangerous ideas. For all the noise made by the media at the 2002 Olympics, the situation in the pairs event was unearthed by a courageous and ethical referee, Ron Pfenning, carrying out his responsibilities in a professional way for the betterment of the sport. Discovery of the misconduct that led to the suspensions of two officials occurred in the post-event judges’ review.

Rather than making the Referees and Assistant Referees two randomly selected officials with reduced power, the ISU should focus on enhancing the ability of Referees to act as the guardians of judging ethics.

In addition to helping the ISU monitor judges’ performance, the post-event review is also important for ensuring that judges are aware of the criteria they are using to assess programs. It is a key element for ensuring that judges are assessing performances using the same criteria and the same standards. This is particularly important when the judges are expected to mark on an absolute scale, as will be the case for whatever judging system is adopted at the ISU Congress in June. The ISU does not have a substitute plan in place for this important function if the review meeting is eliminated.

If Random-7 does not have viable mathematical properties, what is the answer? Something must be done. The public and skaters demand it, and the skating world, to its credit, is now of a mind to do something.

MEDIAN MARK WITH TIEBREAKERS

If the judges’ scores are viewed as the numerical expression of their independent opinions, then some numerical combination of those scores should be the "right" answer. The skater with the best scores (the most runs in our baseball game) should win. If there were never any biases in the judges’ marks, and all you had were small honest differences of opinion, then the best answer would simply be the average of the judges’ marks.

When bias may be present, the best way to combine the scores is something called the median mark. The median mark is highly immune to contamination even for extreme manipulation of several of the higher and lower marks. To find the median mark for a group of marks, list them in ascending order and take the middle mark in the list; for example, it is the fifth of the nine marks on a nine-judge panel. The median value has been used in science and engineering for centuries to filter out the effects of bias in measurements. Its statistical properties are well known and understood.
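
A small sketch shows why the median resists manipulation while the average does not. The nine total marks below are hypothetical, and the "biased" panel simply assumes that two judges lower their marks sharply.

```python
from statistics import mean, median

# Hypothetical total marks from a nine-judge panel (not real competition data).
honest = [11.5, 11.5, 11.6, 11.6, 11.6, 11.6, 11.7, 11.7, 11.7]

# Same panel, but two judges who gave 11.7 instead give 11.0.
biased = honest[:7] + [11.0, 11.0]

# The median is the fifth of the nine marks once they are sorted.
print(sorted(biased)[4])                                 # 11.6
print(median(honest), median(biased))                    # 11.6 11.6
print(f"{mean(honest):.2f}", f"{mean(biased):.2f}")      # 11.61 11.46
```

In this example the two low marks pull the average down by about 0.15, more than the typical gap between successive places, while the median does not move. In general the median can shift by a small step when marks are manipulated, but it cannot be dragged far as long as most of the panel marks honestly.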

Use of the median total mark alone to determine skating results, however, produces a large number of ties because skaters are marked to only the nearest one-tenth of a point and the difference between successive places is frequently about one-tenth of a point. Thus, if two skaters have the same median total mark, some form of tiebreaker is needed. (To reduce the frequency with which tiebreaking is invoked, the judges could mark to the nearest 0.05 of a point vs. the current 0.1 – but some form of tie breaking will nonetheless still be needed.)

In a method called Median Mark with Tiebreakers (MMTB) that I devised in 1997, ties are broken using other statistics of the marks. This method was developed while studying the mathematical properties of the current ISU accounting system and was passed along to a few ISU officials at the time. The detailed description of MMTB was published in 1997 and can be found at

http://www.iceskatingintnl.com/archive/rules/scoring2.htm.

MMTB offers extremely strong protection against bias. It provides the most reliable consensus of a panel of judges when bias may be present, and it mathematically outperforms the current system and Random-7 in every respect. For one or two biased marks, MMTB is nearly bulletproof. For up to three biased marks on a nine-judge panel or four biased marks on a 15-judge panel, analysis shows it rejects bias from the results 74 to 96% of the time. With an enlarged panel of 15 it also increases statistical confidence by 29% compared to the current nine-judge panel. MMTB also has the characteristic, important to the ISU, that it does not allow place switching for intermediate results.

To discourage deal making, the marks in MMTB would be displayed in ascending numerical order and the association of each mark with a given judge kept secret. The public would see the range of marks to help them understand the action, but it would be impossible for deal-makers to know if their conspirator on the panel gave the high mark, the low mark, or one of the marks in between.

MMTB usually gives the same result as the current ISU system, but for close decisions and in the presence of bias it is superior. For example, for the pairs free skate at the 2002 Olympic Games MMTB places the pairs co-gold medallists Jamie Sale & David Pelletier of Canada first. For the other extremely controversial decision this season, it places Lithuanian dancers Margarita Drobiazko & Povilas Vanagas third in the free dance at the 2002 World Figure Skating Championships instead of fourth.

At the Olympics, five of the nine judges placed Russians Elena Berezhnaya & Anton Sikharulidze first in the pairs free skate with the Russian team receiving total marks of 11.5 through 11.7 and a median mark of 11.6. Sale & Pelletier, on the other hand, received total marks of 11.6 through 11.8 and a median mark of 11.7. Berezhnaya & Sikharulidze placed first under the current ISU system because five of the nine judges placed them above the Canadians, but MMTB places them second because their median mark was less than Sale & Pelletier’s. In effect, the Russians won more innings, but the Canadians scored more runs. (Two of the judges in the pairs event tied the Canadians and Russians in total marks, but even if those two judges had marked the Russians higher than the Canadians, MMTB still picks the Canadian team as the winners.)

A similar situation occurred for the Israeli and Lithuanian ice dance couples at the 2002 World Championships. Under the current ISU system (OBO), the Lithuanians placed fourth in the free dance behind the Israelis on a five-four split of the panel, but the Lithuanians outscored the Israelis on median mark, and also on total marks. When the marks for the two teams are listed side by side in numerical order, it is clear from even casual observation that the marks for the Lithuanians, as a group, are superior to those of the Israelis. Again, the Israelis won more innings but the Lithuanians scored more runs.

SUMMARY

ISU Proposal: Increase the panel size to 14. Select one half the judges randomly. Maintain secrecy in the assessments.

Conclusion: Increase the panel size to 11 to 15. Use all the judges’ marks. Maintain secrecy in identifying which mark was given by each judge but publish all the marks. Combine the judges’ marks using MMTB.

ISU Proposal: Divide the member nations into blocks. Limit block representation on each panel to no more than four.

Conclusion: Choose block membership with great care. Ensure, as much as possible, that blocks are equally represented on panels. For blocks of four, a panel size of 11 or more is required so that no one block can be a majority of the panel.

ISU Proposal: Combine the judges’ assessments using a computer program.

Conclusion: Combine the judges’ marks using MMTB. Expand the scope of the current investigation to study the feasibility of computer aids to judging. Include additional outside experts and subject the study to regular peer review. Do not rush this approach into use prematurely.

ISU Proposal: Select the Referee and Assistant Referee from the pool of judges and eliminate the judges’ review meeting.

Conclusion: Maintain the current process.

 

Comparison of Judging Methods

Criterion: One biased mark
Random Panel of Seven: Constitutes 15% of panel.
MMTB (with 15 judges): Constitutes 7% of panel.

Criterion: One block of four judges
Random Panel of Seven: Is a majority or near-majority of the panel 20% of the time.
MMTB (with 15 judges): Can never be more than 27% of the panel, and thus can never be a majority or near-majority of the panel.

Criterion: Two blocks of four judges
Random Panel of Seven: Are unequally represented 76% of the time.
MMTB (with 15 judges): Are always equally represented.

Criterion: Statistical accuracy
Random Panel of Seven: Is reduced 14% compared to current nine-judge panels. Is inadequate for judges to mark on an absolute point scale.
MMTB (with 15 judges): Is improved 33% compared to current nine-judge panel. Is adequate for judges to mark on an absolute point scale.

Criterion: Effectiveness in filtering out biased marks
Random Panel of Seven: Effective only 4 to 25 percent of the time using the proposed ISU computerized point system.
MMTB (with 15 judges): Effective 74 to 96 percent of the time using the current point system and median filtering.

Criterion: In close decisions
Random Panel of Seven: Result is determined by a flip of the coin.
MMTB (with 15 judges): Result is always determined by a statistically meaningful combination of those marks most likely to be uncontaminated by bias.

Criterion: In split decisions
Random Panel of Seven: Result is controlled by the minority of the panel one-quarter of the time.
MMTB (with 15 judges): Result is always determined by a statistically meaningful combination of those marks most likely to be uncontaminated by bias.

 

The author is an astronomer by profession, and a freelance photojournalist who has covered international competitive skating for 15 years. For over ten years he has studied the mathematical properties of the judging process and the effects of bias on it. During this time he has studied the properties of more than a dozen different scoring systems.

Copyright 2002 by George S. Rossano
