This is probably the correct interpretation of the correlation results for a matrix polynomial of degree 100 or less. We nonetheless tried to illustrate a potentially more informative interpretation of the results, in which rater severity becomes increasingly strongly correlated with task difficulty:
Note that we scaled the rater severity scores corresponding to a rater severity of 6 by dividing by the maximum severity, a scaling that is needed for the correlation results. Other choices are possible, such as setting a minimum severity cutoff.
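The scaling step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for the reported results: the function names are hypothetical, the correlation is assumed to be a plain Pearson coefficient, and the "minimum severity cutoff" alternative is assumed to mean flooring values at a threshold (it could equally mean excluding raters below it).

```python
from math import sqrt


def scale_by_max(severities):
    """Divide every severity score by the largest score in the list."""
    top = max(severities)
    if top == 0:
        raise ValueError("maximum severity is zero; cannot scale")
    return [s / top for s in severities]


def apply_min_cutoff(severities, cutoff):
    """Assumed alternative: raise any score below `cutoff` up to `cutoff`."""
    return [max(s, cutoff) for s in severities]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)
```

With these helpers, `pearson(scale_by_max(severity), difficulty)` would give the correlation between scaled rater severity and task difficulty for two hypothetical lists `severity` and `difficulty` of equal length.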
Together, the results suggest that rater severity can be reliably estimated as the degree of the rater correlations with task severity increases up to some point, after which it declines. It is important to note, however, that such patterns depend on the nature of the data, the particular combinations of examinees, raters, and tasks, and additional factors (e.g., the ratio of examinees, raters, and tasks with acceptable and noisy fit in the linking set).
This figure shows the effect of data set size on rater severity estimates, after omitting examinees with missing data from the linking set. Such omissions are common in many operational data sets, where, for example, missing questions or items may be dropped from the rating set.
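The omission of examinees with missing data could be sketched as a simple listwise deletion. The data layout here is an assumption made for illustration (a dictionary mapping a hypothetical examinee ID to a list of ratings, with `None` marking a missing rating); the original analysis may have stored and filtered its linking set differently.

```python
def drop_incomplete_examinees(ratings):
    """Keep only examinees whose rating list has no missing (None) entries.

    `ratings` maps an examinee ID to a list of scores; this layout is
    assumed for the example and is not taken from the original data.
    """
    return {
        examinee: scores
        for examinee, scores in ratings.items()
        if all(score is not None for score in scores)
    }
```

For example, `drop_incomplete_examinees({"e1": [3, 4], "e2": [2, None]})` keeps only `"e1"`, so the linking set shrinks whenever any rating for an examinee is missing.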