[IPAC-List] Interpretation of internal consistency reliability coefficients

For me, I think the crux is identified in what Dennis said... [and I just saw Geoff's message which to me echoes this idea]

"Having said that, playing devil's advocate of my own position, it could be argued that situational judgment tests still arrive at a single total score, which suggests that there is a single something being measured"

If we are measuring a something, for example situational judgment, is there not a "something" that some will have more of and others less of?

Training ability is one such "something" that I am familiar with. While it is clearly multidimensional, I expect that good trainers will have more "training ability" and poor and/or inexperienced trainers will have less. I wouldn't expect high correlations between training items, but I would expect enough of a trend that the items would hang together well enough to give me a KR-20 of .5 or .6.

You could say that "training ability" is made up of many dimensions - knowledge of training methods, knowledge of learning theory, ability to plan lessons, ability to deliver training, effective use of visual aids, effective use of PowerPoint, etc.

But wouldn't a good trainer be good in many of those areas, and wouldn't a poor trainer be good in very few? Depending on a person's aptitude, cognitive ability, experience, education, etc., I would expect him or her to master more of this broad domain or less of it. So while training is multidimensional, in the real world its components are "presented" together to people over and over. A person in the world of training will have numerous opportunities to acquire these "sub-components" of training, maybe not all at once, but over time. In that case, these abilities should hang together "inside" the person.

Thank you all for your responses so far. This is rich material for thought.


Abstract. You have here the classic conflict that arises in any test of a multidimensional domain, or any test based on a criterion-oriented strategy: the content is not unidimensional, yet at the same time you want to calculate a total score. Bottom line, it is impossible to say that a .3, .6 or .9 would be preferable without knowing your exact purpose and the likely correlations between those domains. A .3 could be very good in some situations. Although then a critic might argue - why calculate a total score? Why not calculate a score for each dimension separately?

More detail

First, internal consistency reliability is based on one particular theory of reliability (or several theories of reliability really) coming from a particular view of tests, which sees the best tests as measuring a unidimensional latent trait.

The general theory of situational judgment tests (see especially Mike McDaniel's work) is not compatible with classic notions of internal consistency. In other words, internal consistency is not an appropriate index for situational judgment tests.

Having said that, playing devil's advocate of my own position, it could be argued that situational judgment tests still arrive at a single total score, which suggests that there is a single something being measured.

On your point 2, you could have high test-retest reliability with low internal consistency. It is possible, but again it would depend on your measure and your theory. Technically, under some reliability theories, test-retest is not an acceptable type of reliability at all -- since by definition there is no random sampling of items (we can retain the fiction as long as we do not blatantly violate it).

Bottom line then - a .3 internal consistency might be very appropriate and in fact good for a situational judgment test, especially depending upon its theory of construction. Remember, in developing a test based on a pure criterion oriented approach, such as might be seen with classical BIBs or even situational judgment tests, we would want each additional item to correlate 0 with the items already in the test (while still correlating with the criterion), since that results in the greatest Multiple R. Thus, in that case, lower internal consistencies are better and a high internal consistency would be bad.
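
The trade-off between internal consistency and Multiple R can be illustrated with a small sketch. It rests on simplifying assumptions that are not in the post: every item has the same correlation with the criterion (the hypothetical VALIDITY of .15), every pair of items has the same correlation rho, and the standardized (Spearman-Brown style) alpha stands in for KR-20. Under those assumptions R^2 = k * validity^2 / (1 + (k - 1) * rho).

```python
import math


def multiple_r(k, validity, rho):
    """Multiple R of k items predicting a criterion.

    Assumes (for illustration only) equal item validities and equal
    inter-item correlations rho, which gives
    R^2 = k * validity^2 / (1 + (k - 1) * rho).
    """
    return math.sqrt(k * validity ** 2 / (1 + (k - 1) * rho))


def standardized_alpha(k, rho):
    """Standardized alpha for k items with mean inter-item correlation rho."""
    return k * rho / (1 + (k - 1) * rho)


K = 20           # number of items (hypothetical)
VALIDITY = 0.15  # item-criterion correlation (hypothetical)

for rho in (0.0, 0.05, 0.10, 0.30):
    alpha = standardized_alpha(K, rho)
    r = multiple_r(K, VALIDITY, rho)
    print(f"rho={rho:.2f}  alpha={alpha:.2f}  multiple R={r:.2f}")
```

With item validities held fixed, raising rho pushes alpha up and Multiple R down, which is the sense in which a low internal consistency can be the better outcome for a purely criterion-oriented test.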

So you have competing ideas here. You are calculating, I would guess, a total score, but you are calculating that total score by adding together measures of independent constructs. So a .3 might be good, a .6 might be good, a .9 might be good; it depends on what you are trying to achieve and what the correlation really is between those independent constructs.

One could go into a lot more detail and argument on these concepts, but that is for a different forum.

I have been reading up on internal consistency reliability coefficients (e.g., KR-20 and Cronbach's Alpha) in order to clarify my thinking about them, but I am having trouble finding much of practical use beyond two basic points:

(1) .6 is tolerable, and .9 is the gold standard

(2) you can have high test-retest reliability with low internal consistency
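
For reference, here is a minimal sketch of how KR-20 is computed from a matrix of right/wrong item scores. The function name kr20 and the tiny data set are made up for illustration, and the sketch uses the population variance (dividing by n) for the total scores so that it is consistent with the p * q item variances.

```python
def kr20(scores):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores."""
    n = len(scores)     # number of examinees
    k = len(scores[0])  # number of items

    # Population variance of the total scores.
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in scores) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical data: 6 examinees by 4 items.
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]
print(round(kr20(example), 3))
```

Cronbach's Alpha has the same form, with the item variances replacing p * q, so for dichotomous items the two coincide.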

My question is this:

Suppose you have around 20 items that make up a situational judgment subtest. The domain is defined by job analysis and is supposed to address a competency that entails behaviors such as interacting with customers, solving problems, giving advice, assessing situations, and determining the best action to take, all within a circumscribed context.

How would you interpret an internal consistency reliability coefficient of .3 for such a test? How about .6? And what about .9?
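
One way to put the three values on a common footing is to ask what average inter-item correlation each would imply for a 20-item test. The sketch below inverts the standardized alpha formula, alpha = k * r / (1 + (k - 1) * r). That is an approximation to KR-20 that assumes roughly equal item variances, and the function name implied_mean_r is only illustrative.

```python
def implied_mean_r(alpha, k):
    """Mean inter-item correlation implied by a standardized alpha for k items."""
    return alpha / (k - (k - 1) * alpha)


K = 20  # number of items in the subtest described above
for alpha in (0.3, 0.6, 0.9):
    print(f"alpha={alpha:.1f}  implied mean inter-item r={implied_mean_r(alpha, K):.3f}")
```

Under that approximation, coefficients of .3, .6 and .9 correspond to average inter-item correlations of roughly .02, .07 and .31, so even a .6 on 20 items reflects only a weak tendency for the items to hang together.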

My stab at this is that .3 suggests several unwanted possibilities: among them are (1) the items were "good" but too difficult for the candidates and (2) the items have flaws such as not fully defining the circumstances and/or constraints that need to be taken into account to arrive at the "correct" answer.

Personally, I would expect a well-crafted set of such items, given to an appropriate candidate group, to hold together and have an internal consistency reliability coefficient around .6.

As far as getting a .9 for this sort of test, I think that that would indicate too narrow a focus for a domain that would be expected to cover a pretty wide territory.

I say this because, while it makes sense that people who are more able and motivated to develop expertise in a "broad" competency will tend to be good at much of it and those who are less able and/or motivated will tend to perform poorly in much of it, I would expect some randomization of people's strengths and weaknesses. That would lead some people to perform well in many areas but still fall down in a few (with no discernible pattern) and others to perform poorly in many but still be strong in a few (again with no discernible pattern).

Your thoughts on this would be much appreciated.

Thanks in advance,


