The IJS calculation method is based on having the skaters marked on an absolute point scale for element scores and Program Component scores. When evaluated during training, judges are expected to give marks that agree with the official panel. In ISU competitions, judges are expected to mark within a defined range (corridor) and the correctness of their judging is evaluated based on being within the corridor. Judging panels are frequently criticized when event protocols show Program Component marks spread over a large range for a given component.
IJS scores are calculated to the nearest 0.01 points. In a perfect system, the calculated scores would have to be correct to better than 0.005 points for every skater so that every place is correctly determined. It is well established that IJS does not yet meet that mathematical standard for accuracy and precision, and hence fairness. The calculation method has inherent flaws that result in rounding errors that are typically several hundredths of a point, and can potentially be as large as many tenths of a point. The spread of marks among the judges due to random differences of opinion is even larger, with a standard deviation typically ¾ of a point in any one event segment. Both rounding errors and random errors are thus significantly greater than 0.01 point, demonstrating that the IJS calculation method is currently far from a perfect system.
In addition to round-off error and random judges’ error, the IJS calculation method is also affected by variations in the judges’ individual marking standards, errors by the Technical Panel, and errors in the Scale of Values.
Examination of event protocols shows that judges clearly do not all mark to the same absolute standard. Each judge has an individual marking standard that departs to some extent from the ideal standard (which itself is not well defined). The issue we examine here is the effect of inconsistent individual marking standards as a source of error in the calculation of IJS results. Errors by the Technical Panel and errors inherent in the SoV will be discussed in future papers.
Just how consistent do the judges’ individual marking standards really have to be?
We begin with marks subject only to random errors and then expand that to marks with both random and systematic errors. Those bored by math can jump from here to the Summary section at the end.
Let:
m_ij be the mark judge "i" gives skater "j"
M_j be the correct mark skater j deserves as specified by the rules, if judges had perfect knowledge of the rules, perfect observational skills, and perfect judgment; i.e., the "truth" mark
D_ij be the difference between the mark judge i gives skater j and the truth mark for skater "j".
Then:
D_ij = m_ij - M_j
or,
m_ij = M_j + D_ij    (1)
Equation (1) assumes that the judges are all marking on the same numerical scale to the same standard, and thus the only errors in the marks are differences of opinion from one judge to the next because the judges do not have perfect knowledge of the rules or skating, perfect observational skills, or perfect judgment.
In the calculation of scores, the marks for the judges are averaged for each skater j. We use "< >" brackets to represent an average over the judges for a given skater.
In averaging the marks the judges actually give, equation (1) becomes
<m_j> = M_j + <D_j>    (2)
That is, the average mark for skater j is the correct mark plus the average of all the errors for that skater. If the errors in the marks (D_ij) are random, the average of the errors will decrease as the number of judges increases, in inverse proportion to the square root of the number of judges. As the number of judges becomes infinite the average error goes to zero and the average of the actual marks approaches the truth mark.
In comparing the marks for two skaters (j and j+1) the difference in calculated score will be:
<m_j> - <m_{j+1}> = (M_j - M_{j+1}) + (<D_j> - <D_{j+1}>)    (3)
Whether the two skaters finish in the correct order depends on the differential error for the two skaters, (<D_j> - <D_{j+1}>). Since results are determined to the nearest 0.01 points, the judges must be sufficiently well trained, and enough judges must be used, that the differential error term is less than 0.01 point. For well trained judges, reducing the differential error term to that level requires the use of tens of thousands of judges, or more.
Because only 5-9 marks are typically used in calculating IJS scores, and not tens of thousands, the differential error term is typically much greater than 0.01, and errors in the order of finish are common in IJS calculations. Analysis of IJS results shows that typically one-third to one-half the results in IJS competitions are not statistically significant (i.e., the statistical uncertainty in the marks is greater than the point difference between sequential places), with the random error typically about ¾ of a point in an event segment.
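As a rough illustration of why so many judges would be needed, the sketch below estimates the size of the differential error term as a function of panel size. It assumes a per-judge random error of 0.75 points (the typical segment value quoted above) and a simple one-standard-deviation criterion; both are assumptions made only for illustration, not figures taken from the IJS rules.

```python
import math

# Assumed per-judge random error (standard deviation) in a skater's
# segment score, taken from the ~3/4 point figure quoted in the text.
SIGMA = 0.75

def differential_error(n_judges, sigma=SIGMA):
    """Standard error of <D_j> - <D_j+1> for a panel of n_judges.

    Each average <D_j> has standard error sigma / sqrt(n); the difference
    of two independent averages has standard error sqrt(2) * sigma / sqrt(n).
    """
    return math.sqrt(2.0) * sigma / math.sqrt(n_judges)

# Typical IJS panel sizes
for n in (5, 7, 9):
    print(f"{n} judges: differential error ~ {differential_error(n):.3f} points")

# Panel size needed for the differential error term to fall below 0.01 point
# (a one-standard-deviation criterion, for illustration only).
target = 0.01
n_needed = math.ceil(2.0 * (SIGMA / target) ** 2)
print(f"Judges needed for < {target} point: about {n_needed}")
```

Under these assumptions a 9-judge panel leaves a differential error of roughly 0.35 points, and pushing it below 0.01 point would take a panel of more than ten thousand judges, consistent with the "tens of thousands" figure above.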
In addition to random errors, judges are not equally calibrated (trained), and do not all mark using the same numerical standard. For example, one judge may mark all skaters in a group systematically higher than another judge, or one judge may mark a group using a greater or lesser span of marks than another judge. These systematic differences in calibration also introduce errors into IJS results.
To examine the effect of systematic errors due to differences in marking standards we model the calibration of the judges as follows, with equation (1) becoming:
m_ij = A_i + B_i*M_j + C_i*M_j^2 + D_ij    (4)
where,
A_i is the "offset" for judge i’s personal marking scale (judge i marks all skaters in the group higher or lower than the correct absolute standard)
B_i is the "gain" for judge i’s personal marking scale (judge i uses a spread of marks greater or less than the correct absolute standard)
C_i is one of several potential non-linear terms that accounts for judge i using a span of marks that is different for the lower scoring skaters compared to the higher scoring skaters. In principle, there could be terms of higher order in M_j, but these will be omitted to avoid cluttering up the discussion.
[What we are doing is taking the individual marking standards of the judges and modeling them as a polynomial expansion, and keeping only the first three terms for the purpose of this analysis.]
If a judge is correctly calibrated and marking on the correct absolute marking scale, then for that judge A_i = 0.0, B_i = 1.0, and C_i = 0.0.
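A minimal sketch of this calibration model follows. The coefficients and "truth" marks used here are hypothetical values chosen only to illustrate how equation (4) generates a judge's mark from the truth mark plus systematic and random terms.

```python
import random

def judge_mark(truth, A, B, C, sigma=0.5):
    """Mark a judge gives a skater under the calibration model of eq. (4):
    m_ij = A_i + B_i*M_j + C_i*M_j^2 + D_ij, with D_ij a random error."""
    return A + B * truth + C * truth**2 + random.gauss(0.0, sigma)

# Hypothetical panel: (A_i, B_i, C_i) for each judge.
# A perfectly calibrated judge would have A = 0.0, B = 1.0, C = 0.0.
panel = [
    (0.0,  1.00, 0.0),    # perfectly calibrated
    (0.5,  1.00, 0.0),    # marks everyone 0.5 higher (offset only)
    (0.0,  1.10, 0.0),    # uses a 10% wider span (gain error)
    (0.0,  1.00, 0.01),   # slightly non-linear marking standard
    (-0.3, 0.95, 0.0),    # low offset and slightly compressed span
]

truth_marks = [4.0, 5.5, 7.0, 8.5]   # illustrative "truth" component marks

for truth in truth_marks:
    marks = [judge_mark(truth, A, B, C) for A, B, C in panel]
    avg = sum(marks) / len(marks)
    print(f"truth {truth:4.2f} -> panel average {avg:5.2f}")
```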
If we average equation (4) over all marks for skater j, we obtain:
<m_j> = <A> + <B>*M_j + <C>*M_j^2 + <D_j>    (5)
where,
<A> is the average A coefficient for the panel
<B> is the average B coefficient for the panel
<C> is the average C coefficient for the panel.
Comparing the marks for two skaters (j and j+1) the difference in calculated score will be the difference between the following two equations:
<m_j> = <A> + <B>*M_j + <C>*M_j^2 + <D_j>
<m_{j+1}> = <A> + <B>*M_{j+1} + <C>*M_{j+1}^2 + <D_{j+1}>
which gives,
<m_j> - <m_{j+1}> = <B>*(M_j - M_{j+1}) + <C>*(M_j^2 - M_{j+1}^2) + (<D_j> - <D_{j+1}>)    (6)
If <B> = 1.0, and <C> = 0.0, then equation (6) reduces to equation (3).
Note that the A coefficients do not affect the point differences between any two skaters. Judges may mark on individual numerical scales with completely different offsets and the results are not affected.
Thus, while it is cosmetically appealing to try and calibrate every judge to give the same mark for a given skater, the reality is that differences in offset are meaningless. For example, if most of a panel marks a group of skaters from 3.50 through 6.75 and one judge marks the same group from 1.50 through 4.75, the fact the judge marked 2.00 points lower is in itself no measure of poor judging, and has no impact on the results.
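A quick numeric check of this point, using hypothetical marks for two skaters and ignoring random error: a judge whose marks are uniformly 2.00 points lower produces exactly the same skater-to-skater differences, so the offset drops out of equation (6).

```python
# Truth marks for two skaters (illustrative values only).
M = [6.75, 6.25]

# Judge X marks on the "expected" scale; judge Y marks 2.00 points lower.
# Both have B = 1.0 and C = 0.0, and random error is ignored here.
judge_x = [0.0 + m for m in M]    # A = 0.0
judge_y = [-2.0 + m for m in M]   # A = -2.0

diff_x = judge_x[0] - judge_x[1]
diff_y = judge_y[0] - judge_y[1]
print(diff_x, diff_y)   # both 0.5: the offset has no effect on the difference
```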
The B and C coefficients, however, are another matter.
For the B coefficients (the linear term), a comparison of equations (3) and (6) shows that when the judges use different gain factors (spread of marks from highest to lowest scoring skater), the following error is introduced in the point difference between two skaters:
LinearError = (<B> - 1) * (M_j - M_{j+1})    (7)
Since results are determined to the nearest 0.01 points, we can place limits on how close to 1.00 <B> must be.
For example, in an event where the typical difference between two sequential places is 5.0 points in PCS, if we require,
| LinearError | < 0.01
then we must have,
0.998 < <B> < 1.002
That means, on the average, the span of the marks used by the judges must agree to within 2/10 of one percent (2 parts in 1000). For example, if the judges use a span of 5.0 points in each Program Component for a given event, the spans used by each judge must agree on the average within 0.01 points (0.002 * 5.0). In other words, they must agree nearly perfectly.
We conclude, then, so far as training and evaluating the judges is concerned, that what matters is not whether the judges mark on the same scale overall (same average score for the group), but whether the judges use the same linear span of marks from highest to lowest scoring skater.
To illustrate this, suppose that on a five judge panel, four judges are perfectly calibrated and use a span of 5.00 points in scoring each Program Component. The fifth judge, however, uses a span of 5.50 points. This 10% greater span by one judge results in an average B coefficient of 1.02 and introduces a point error into the results that will result in place switches for a small but non-zero number of skaters.
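The arithmetic of this example can be checked directly with equation (7), assuming, as in the text above, a 5.0-point PCS gap between the two skaters being compared:

```python
# Spans used by the hypothetical five-judge panel (points, per component).
spans = [5.0, 5.0, 5.0, 5.0, 5.5]
correct_span = 5.0

# Gain factor B_i for each judge, and the panel average <B>.
B = [s / correct_span for s in spans]
B_avg = sum(B) / len(B)
print(f"<B> = {B_avg:.2f}")             # 1.02

# Equation (7): error in the point difference between two skaters
# separated by 5.0 PCS points (the typical gap assumed in the text).
gap = 5.0
linear_error = (B_avg - 1.0) * gap
print(f"LinearError = {linear_error:.2f} points")   # 0.10 point

# Bound on <B> for the error to stay below 0.01 point (as derived above).
tolerance = 0.01 / gap
print(f"|<B> - 1| must be < {tolerance:.3f}")        # 0.002
```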
Using the same approach for the C coefficient, the quadratic error is given by,
QuadraticError = <C> * (M_j^2 - M_{j+1}^2)    (8)
and for the absolute value of this error to be less than 0.01 points, for typical Senior Men competition results this requires,
| <C> | < 0.00025
This requirement says, on the average, if the individual marking standard for each judge departs even negligibly from a linear relation, an error will be introduced that can result in some place switches. Over the full range of allowed Program Component marks, for example, a C coefficient as high as 0.01 would hardly be noticeable in cursory inspection of the marks, yet it has the potential to introduce an error in the marks as large as 1.00 point in an event with a short and long program.
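To make the C coefficient example concrete, the sketch below applies equation (8) with <C> = 0.01 over the full 0-10 Program Component range, reproducing the roughly one-point error mentioned above, and rearranges the 0.01-point requirement into a bound on <C> for an illustrative (assumed, not rule-specified) pair of component marks.

```python
def quadratic_error(C_avg, M_hi, M_lo):
    """Equation (8): error in the point difference between two skaters."""
    return C_avg * (M_hi**2 - M_lo**2)

# A C coefficient of 0.01, applied across the full 0-10 component range,
# produces an error of about 1.0 point (the example given in the text).
print(quadratic_error(0.01, 10.0, 0.0))      # 1.0

# General bound: for the error to stay below 0.01 point,
# |<C>| must be less than 0.01 / (M_j^2 - M_j+1^2).
# Illustrative (assumed) pair of component marks:
M_hi, M_lo = 7.0, 5.0
print(0.01 / (M_hi**2 - M_lo**2))            # ~0.0004 for this pair
```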
Since the errors introduced by individual marking standards among the judges depend on the averages of the B and C coefficients, one way to reduce these errors is again to use a large number of judges to average out the different marking standards, assuming high and low individual marking standards are equally common. Unfortunately, as with random errors, thousands of judges are required to reduce the average B and C coefficients to values where the point error is less than 0.01 point. On the other hand, if all judges are trained systematically high or low, then increasing the number of judges does nothing to remove this source of error.
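A hedged sketch of that averaging argument: if each judge's gain factor B_i is assumed to scatter randomly about 1.0 with some spread, the uncertainty in <B> shrinks as the square root of the panel size, and the panel size needed to bring it below the 0.002 tolerance can be estimated. The 0.10 spread used below is purely an illustrative guess, not a measured value; if instead every judge shares the same bias, no panel size helps.

```python
import math

# Assumed judge-to-judge scatter in the gain factor B_i about 1.0.
# This spread is an illustrative guess, not a measured value.
B_SPREAD = 0.10

# Standard error of the panel-average <B> for various panel sizes.
for n in (9, 100, 1000):
    print(f"{n:5d} judges: std error of <B> ~ {B_SPREAD / math.sqrt(n):.4f}")

# Panel size needed for the standard error of <B> to fall below 0.002
# (the tolerance derived above), assuming unbiased scatter among judges.
n_needed = math.ceil((B_SPREAD / 0.002) ** 2)
print(f"Judges needed: about {n_needed}")

# If every judge shares the same systematic bias (say B_i = 1.02 for all),
# <B> stays at 1.02 no matter how many judges are averaged.
```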
Summary

Individual marking standards among the judges are a source of error in the calculation of IJS results. This source of error is sufficiently large (tenths of a point to a full point) to result in placement errors in competition results.
Offset errors (differences in average score for a group) do not introduce errors in placement. A judge may mark an entire group outside the corridor and introduce NO error into the final results. Evaluating the judges based on a simple corridor alone is not an insightful process.
Gain errors (differences in span of marks from highest to lowest skater) can introduce significant errors into scores and lead to errors in placement in the final results. The span of marks for each judge needs to agree to better than 0.2% for this source of error to be less than 0.01 point. In evaluating judges the span of marks the judges use is of far greater importance than the corridor in which those marks lie.
Non-linear errors (lack of linearity over the span of marks, such as different marking scales used for lower scoring skaters vs. higher scoring skaters, or drift in judgment creeping higher or lower during an event) can introduce significant errors into scores and lead to errors in placement in the final results. Even minute departures from a linear marking standard can introduce place altering errors. In evaluating judges, consistency (linearity) of marking standard from low marks through high marks is of far greater importance than the corridor in which those marks lie.
Use of the corridor to train and evaluate judges is a simple-minded approach that does not serve the judging community well in producing, evaluating, or maintaining quality judges. A more sophisticated process needs to be put in place if the quality of judging is to be improved.
Consistency of judgment throughout an entire event (correct span of marks, linearity, and lack of marking drift) is essential to prevent placement errors due to systematic errors. To achieve this, IJS requires both absolute and relative judgment. Forbidding a direct comparison of the performances in an event is a serious flaw in the IJS judging process that makes IJS scoring less accurate and less consistent, rather than otherwise.
Training and evaluation of judges should focus less on absolute agreement of Program Component values and more on the span of marks used and consistency of judgment throughout an event. However, the goal of obtaining a pool of judges who all mark to the same standard is likely never to be reached so long as the absolute standard for each Program Component criterion remains as vague and obscure as it currently is. So long as judges lack specific guidance for what marks should be given in specific situations, marking to a consistent absolute standard will remain an elusive goal.
Given the current limitations that prevent IJS from determining the value of a program to the nearest 0.01 point, choosing winners to that level of precision is an arbitrary and capricious process. Given current technical limits, either scores need to be rounded off to the nearest whole point, or changes must be made to IJS to more cleanly separate the scores for skaters with very similar degrees of skill.
Copyright 2009 by George S. Rossano