Fewer Is Not Better

(Judges, that is)

In October the ISU Council approved a reduction in the number of judges that are used to determine the results in ISU Championships.  This action was taken to reduce the cost of officials at championships.  In March the Council extended this policy to the number of judges to be used at the Olympics in Vancouver.  The justification for the most recent decision was to establish a uniform panel size at competitions.  It was not to save money since the cost of officials at the Olympics is not carried by the ISU.

In a previous article we took the position that reducing the number of officials was penny wise and pound foolish: it saved the ISU a small amount of money at the cost of reducing the mathematical reliability and integrity of the results and increasing the impact misconduct would have on the results of competitions.  In effect it devalued the ISU's most important "product" -- the reliability and integrity of its competitions.

Reduced panel sizes were used in the ISU Four Continents, European and World Championships this year.  In this article we compare results from these championships with those from the championships held in the 2004/05 season.

When the adoption of IJS was first proposed in 2003 I was told by several ISU officials that it would take up to five years before judges were fully comfortable and adept in the scoring system, and that judgment on the quality of the method should be reserved until the system had been used for several years.

It is now nearly six years since the first test events were held in 2003.  How does the quality of the judging today compare to past seasons?  Inquiring minds would like to know.

But first let's review the basic foundation of IJS.

Under IJS, the basic premise is that the value of a performance will be determined on an absolute scale and the performance with the highest value wins, the next highest value is second, etc.  ISU rules and communications define what aspects of a skating performance will receive points, what aspects will reduce the value of a performance and what aspects will increase the value of a performance.  A panel of officials is used to determine whether the skaters meet the requirements for gaining or losing points in accordance with the rules and communications.

If the rules were complete and unambiguous, a single omniscient person could score a competition with 100% accuracy.  All we would have to do is hire that person to judge all competitions and life would be simple.  Unfortunately GOD is unavailable to take on that job, and we are forced to rely on a collection of mortals to determine the results.

Because even well-trained individuals come to different conclusions about the points the skaters should earn or lose, we use more than one official to get the job done.  The individuals serve on Technical Panels and some number of judges mark the elements and components.  The wider the spread of opinions among the pool of judges, the more judges one needs on a panel to determine the value of performances and hence the order of finish for a competition.  If judges were in fairly narrow agreement in their evaluations we could get by with just a few.  If they have widely divergent conclusions we need many.
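
The link between spread and panel size can be made concrete with a rough sketch.  It assumes the judges' marks are independent and that the uncertainty of a panel's average mark is the spread of the marks divided by the square root of the number of judges (the standard deviation of the mean).  The function name and the target uncertainty of 0.15 are hypothetical choices for illustration, not values taken from ISU rules.

```python
import math

def judges_needed(spread, target_uncertainty):
    """Smallest panel size N for which spread / sqrt(N) is no larger
    than target_uncertainty (hypothetical helper, independent marks assumed)."""
    return math.ceil((spread / target_uncertainty) ** 2)

# Hypothetical spreads: a panel in fairly narrow agreement vs. a divergent one.
narrow = judges_needed(0.4, 0.15)
wide = judges_needed(0.8, 0.15)
print("narrow agreement:", narrow, "judges; wide divergence:", wide, "judges")
```

Under these assumptions the number of judges needed grows with the square of the spread, so doubling the spread roughly quadruples the panel size required for the same confidence.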

For reasons we have discussed elsewhere, and will not repeat here, the use of nine scoring judges in the past was woefully inadequate given the wide range of conclusions judges came to in evaluating performances.  The only valid technical argument for reducing the number of scoring judges to seven would be that the diversity of opinion among the judges is now better than it was in the past (after 6 years of experience) and thus seven is now enough.

The question for this article then becomes: "Is the spread of opinion among the judges better than it was when IJS was first introduced, to an extent that justifies a reduction in the size of panels?"

The answer is no.

Instead, the spread of opinion is nearly the same as it was at the introduction of IJS, and the reliability of the results is now worse than it was because of the reduced panel size.

In the following table we compare the average standard deviation of the GoEs and Program Components for several event segments from the 2004/05 season and the 2008/09 season.  In the right-hand column we compare the percentage of places that are not statistically significant for these events (for the skaters that completed the event).  These events are typical of all championship events for these two seasons.

Statistical Significance of ISU Championship Results

(The percentage of places not significant is given once per event, on the first row for that event.)

Event                     Spread of     Spread of    Percent Places
                          GoE Values    PC Values    Not Significant

2005 4C Ladies SP            0.43          0.45            46
2005 4C Ladies FS            0.40          0.34
2005 Euros Ladies SP         0.45          0.41            33
2005 Euros Ladies FS         0.41          0.34
2005 Worlds Qual A FS        0.41          0.49            54
2005 Worlds Qual B FS        0.35          0.41
2005 Worlds Ladies SP        0.43          0.31
2005 Worlds Ladies FS        0.42          0.41
2005 Worlds Qual A FS        0.44          0.51            42
2005 Worlds Qual B FS        0.34          0.36
2005 Worlds Men's SP         0.48          0.41
2005 Worlds Men's FS         0.41          0.39
2005 Average                 0.41          0.40            44

2009 4C Ladies SP            0.51          0.51            54
2009 4C Ladies FS            0.44          0.36
2009 Euros Ladies SP         0.41          0.39            63
2009 Euros Ladies FS         0.39          0.41
2009 Worlds Ladies SP        0.44          0.43            88
2009 Worlds Ladies FS        0.45          0.48
2009 Worlds Men's SP         0.46          0.42            46
2009 Worlds Men's FS         0.39          0.40
2009 Average                 0.44          0.43            63

Reducing panel size in championship events has clearly had an adverse effect on the reliability and integrity of results this season.  To a cynical person, the only plausible reason to have extended the reduction in panel size to the Olympic Games is that to have not done so would give the appearance that the original decision in October was an error -- which indeed it was.


Mathematical Notes:

One commonly used measure of the spread of values in a data set is called the variance.  The variance is the average of the squared differences between each data value and the mean data value.  The positive square root of the variance is called the standard deviation.

The variance and standard deviation are measures of the extent to which the data values are spread out around the average, some values being greater than the average and some less.  The standard deviation is usually used instead of the variance since it has the same units as the original variable and can be easier to interpret for this reason.  That is, for skating scores, the standard deviation of the points involved is also some number of points (while the variance is some value of points squared) that measures the spread of points about the average points.
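
As a small worked example of these definitions, the sketch below computes the mean, variance and standard deviation for one set of marks.  The nine marks are hypothetical, not taken from any competition, and the variance is computed exactly as described above (dividing by the number of marks); analyses that divide by one less than the number of marks would give a slightly larger value.

```python
import math

# Hypothetical GoE marks from nine judges for a single element.
marks = [1, 0, 1, 2, 1, 0, 1, 1, 2]

# Mean of the marks.
mean = sum(marks) / len(marks)

# Variance: average of the squared differences from the mean (points squared).
variance = sum((m - mean) ** 2 for m in marks) / len(marks)

# Standard deviation: positive square root of the variance (points).
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std dev = {std_dev:.2f}")
```

Here the standard deviation is in points, the same units as the marks, while the variance is in points squared.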

From the standard deviations for each element and component the uncertainty in the value of the program can be calculated -- measured as the standard deviation of the mean.  If the difference in point value for two sequential places is less than the uncertainty in the two program values, the results for those two places are considered to be statistically NOT significant.
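
The test described above might be sketched as follows.  This is an illustration under stated assumptions, not the exact calculation used for the table: it assumes the marks are independent, that the uncertainties of individual elements and components add in quadrature, that the uncertainties of the two programs also combine in quadrature, and it ignores component factors and any dropping of marks.  All function names and numbers are hypothetical.

```python
import math

def std_dev_of_mean(spread, n_judges):
    """Uncertainty of one averaged mark: spread / sqrt(number of marks)."""
    return spread / math.sqrt(n_judges)

def program_uncertainty(item_spreads, n_judges):
    """Uncertainty of a program total, assuming the uncertainties of the
    individual elements and components add in quadrature."""
    return math.sqrt(sum(std_dev_of_mean(s, n_judges) ** 2 for s in item_spreads))

def places_significant(total_a, total_b, unc_a, unc_b):
    """True if two sequential places differ by at least the combined
    uncertainty of the two program values (combined in quadrature)."""
    return abs(total_a - total_b) >= math.hypot(unc_a, unc_b)

# Hypothetical program: 13 scored items, each with a spread of 0.4, nine judges.
unc = program_uncertainty([0.4] * 13, 9)

close_pair = places_significant(150.0, 150.5, unc, unc)   # 0.5 points apart
clear_pair = places_significant(150.0, 151.0, unc, unc)   # 1.0 point apart
print(f"uncertainty per program = {unc:.2f}")
print("0.5 apart significant?", close_pair, "| 1.0 apart significant?", clear_pair)
```

With these made-up numbers, two skaters half a point apart cannot be reliably separated, while two skaters a full point apart can.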

The goal for any scoring system should be that all places in an event are statistically significant.  The more samples (judges' marks) included in the calculation, the smaller the standard deviation of the mean and the greater the confidence in the results.
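
To give a feel for the size of this effect, the sketch below compares a panel of nine marks with a panel of seven.  It assumes independent marks, assumes every mark on the panel is counted, and uses the average GoE spreads from the table (0.41 for 2005, 0.44 for 2009) purely as illustrative inputs; the function name is hypothetical.

```python
import math

def std_dev_of_mean(spread, n_judges):
    """Standard deviation of the mean for n_judges independent marks."""
    return spread / math.sqrt(n_judges)

# Effect of panel size alone: the same spread judged by 7 marks instead of 9.
growth = std_dev_of_mean(1.0, 7) / std_dev_of_mean(1.0, 9)   # equals sqrt(9/7)

# Illustrative inputs: average GoE spreads from the table above.
sem_2005 = std_dev_of_mean(0.41, 9)
sem_2009 = std_dev_of_mean(0.44, 7)

print(f"uncertainty grows by a factor of {growth:.3f} going from 9 to 7 marks")
print(f"2005: {sem_2005:.3f}   2009: {sem_2009:.3f}")
```

Going from nine marks to seven increases the uncertainty of each averaged mark by a factor of the square root of 9/7, about 13 percent, even if the spread of opinion does not change at all.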


Copyright 2009 by George S. Rossano