Based on work from the G-theory literature, my colleagues and I derived an
interrater reliability estimator for dealing with situations similar to the
type you describe below (i.e., situations in which raters are neither fully
crossed, nor nested within ratees).

Putka, D. J., Le, H., McCloy, R. A., & Diaz, T. (2008). Ill-structured
measurement designs in organizational research: Implications for estimating
interrater reliability. Journal of Applied Psychology, 93, 959-981.

I can email you the SAS macro for estimating the coefficient described in
the article if you need it.

One thing you will need in order to estimate this coefficient is unique IDs
for each individual rater (i.e., not simply IDs that indicate which
interviewer was 1, 2, or 3 within each panel). Having unique IDs for each
rater will allow you to properly estimate rater main effect variance (e.g.,
leniency/severity effects) and appropriately scale their contribution to
error variance. In the case of your design, rater main effect variance will
contribute to error since each interviewee was not rated by the same
set of raters.
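
To illustrate why unique rater IDs matter, here is a minimal Python sketch
(not the SAS macro mentioned above). It stores ratings in long format with
one unique ID per rater, counts how many raters each pair of interviewees
shares, and uses those counts to scale the rater main effect. The data, the
function names, and the overlap-based multiplier (written here as
q = 1/k_hat minus the average of c/(k_i * k_j) over pairs of interviewees,
where c is the number of shared raters) are assumptions for illustration
and should be checked against the article before use.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical long-format data: (interviewee, unique rater ID, score).
# Rater IDs identify people, not seat positions within a panel.
ratings = [
    ("A", "r1", 4), ("A", "r2", 4), ("A", "r3", 5),
    ("B", "r1", 3), ("B", "r2", 2), ("B", "r4", 3),
    ("C", "r5", 5), ("C", "r6", 4), ("C", "r7", 4),
]


def raters_by_interviewee(rows):
    """Map each interviewee to the set of unique raters who rated them."""
    panels = defaultdict(set)
    for interviewee, rater, _score in rows:
        panels[interviewee].add(rater)
    return dict(panels)


def harmonic_mean_raters(panels):
    """Harmonic mean of the number of raters per interviewee (k_hat)."""
    return len(panels) / sum(1.0 / len(r) for r in panels.values())


def q_multiplier(panels):
    """Assumed scaling for rater main effect variance based on rater overlap."""
    n = len(panels)
    overlap = 0.0
    for a, b in combinations(panels, 2):
        shared = len(panels[a] & panels[b])
        # Each unordered pair stands for two ordered pairs, hence the 2.
        overlap += 2.0 * shared / (len(panels[a]) * len(panels[b]))
    return 1.0 / harmonic_mean_raters(panels) - overlap / (n * (n - 1))


panels = raters_by_interviewee(ratings)
print(harmonic_mean_raters(panels), q_multiplier(panels))
```

Under this sketch the multiplier is 0 when every interviewee is rated by
the same raters and 1/k when no two interviewees share a rater. With
position-only IDs (1, 2, 3) every panel would look identical, so the
overlap could not be computed at all.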

Another option you might consider, if you are truly interested in
"agreement" among interviewers at the level of the individual interviewee,
is rwg or a similar statistic (e.g., rAD). While such indices would
provide you with an indication of the amount of disagreement across
interviewers for any given interviewee, they will be less informative
about the consistency of the rank ordering of interviewees
across interviewers, which is the primary focus of estimates of
interrater reliability.
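
As a rough illustration of agreement indices computed per interviewee, the
Python sketch below calculates the single-item rwg (one minus the ratio of
the observed variance to the variance expected under a uniform null
distribution) and an average deviation about the mean. The 5-point scale,
the example scores, and the function names are hypothetical, and the
uniform null is only one of several possible choices.

```python
from statistics import mean, variance


def rwg(scores, n_options):
    """Single-item rwg for one interviewee, using a uniform null distribution."""
    expected_var = (n_options ** 2 - 1) / 12.0  # variance of a discrete uniform
    return 1.0 - variance(scores) / expected_var  # variance() uses n - 1


def average_deviation(scores):
    """Mean absolute deviation of the ratings from their own mean."""
    center = mean(scores)
    return sum(abs(s - center) for s in scores) / len(scores)


# Hypothetical ratings of one interviewee by a three-rater panel, 5-point scale.
panel_scores = [4, 4, 5]
print(rwg(panel_scores, 5), average_deviation(panel_scores))
```

Because each value is computed within a single interviewee, neither index
requires the same raters to appear across panels.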

The choice between using an index of interrater agreement as opposed to an
index of interrater reliability will depend on how you plan to use your
resulting interview scores.

Suppose you have interview score data from combinations of three-rater
panels that frequently changed members. The result was that virtually
all of the interview panels contained unique groups of three raters.

We originally wanted to do a generalizability study (G and D). However,
typical methods of estimating variance components (e.g., REML, urGENOVA)
failed to converge because virtually all of the panels are unique.

Is there a way to calculate a level of agreement among raters with this type
of data, or perhaps some way to better understand whether raters were
exhibiting an acceptable level of agreement? (Note: We also tried to
correlate matched pairs of raters, but there was only a very limited
number of matched pairs of raters that we could use.)

