[IPAC-List] Interpretation of internal consistency reliability coefficients

Mark Hammer Mark.Hammer at psc-cfp.gc.ca
Wed Feb 3 12:05:53 EST 2010

The heavy lifting needs to be shared by the reliability tests AND the
test-developer. The coefficient merely confirms (or rather provides
supporting evidence) that the test content you believe to be revolving
around one broad construct is indeed just that. Shoving content that
addresses several different constructs into what gets *called* a single
test can easily generate unacceptable alphas simply because it's not ONE
test but rather several under one banner. In other words, the alphas are
low for all the right reasons. In this case, calling it a single
competency does not necessarily make it so. There may well be 2 or 3
underneath.
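
To put numbers on that, here is a minimal Python sketch (my own toy
example, not anyone's real data) that computes Cronbach's alpha from the
usual variance formula. The scores are hypothetical and idealized: items
1-2 tap one construct, items 3-4 tap another, and the two constructs are
unrelated across the four respondents.

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent."""
    k = len(rows[0])
    items = list(zip(*rows))  # transpose to per-item columns
    item_var_sum = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical data: items 1-2 measure construct A, items 3-4 measure
# construct B, and A and B are uncorrelated across respondents.
scores = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
alpha_all = cronbach_alpha(scores)                   # all four items pooled
alpha_a = cronbach_alpha([r[:2] for r in scores])    # construct A only
alpha_b = cronbach_alpha([r[2:] for r in scores])    # construct B only
print(round(alpha_all, 3), alpha_a, alpha_b)  # 0.667 1.0 1.0
```

In this toy case each pair of items is perfectly consistent on its own,
yet pooling them under one banner pulls alpha down to about .67, with
nothing wrong with any individual item.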

The other thing I will bring up is a point that has stuck with me from
a talk I attended on personality test development in the mid-1980s.
Item-to-total correlations tend to improve over the course of a
personality test: later items correlate better with the total than
earlier ones. The reason suggested in the talk was that test-takers tend
to approach such tests like a concept-formation task. It takes a number
of exemplars for them to "get" what the focus of the test is. Once they
"get it", their responding tends to become more cohesive and consistent,
resulting in higher item-to-total correlations.
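
One way to look for this pattern in pilot data is to compute the
item-to-total correlations in presentation order. Below is a minimal
sketch, assuming the "corrected" version (each item against the sum of
the other items); that choice is mine, not something the speaker
specified. The responses are invented to show the shape of the effect
and are not evidence for it.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corrected_item_total(rows):
    """Correlate each item with the sum of the *other* items."""
    result = []
    for j in range(len(rows[0])):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]
        result.append(pearson(item, rest))
    return result

# Hypothetical 1-5 ratings, one row per respondent, columns in the
# order the items were presented (item 1 first, item 4 last).
responses = [
    [3, 2, 1, 1],
    [1, 2, 2, 2],
    [4, 3, 3, 3],
    [2, 5, 4, 4],
    [3, 4, 5, 5],
]
item_total = corrected_item_total(responses)
for position, r in enumerate(item_total, start=1):
    print(f"item {position}: r = {r:.2f}")
```

In these made-up data the first item correlates only weakly with the
rest of the test while the last one correlates strongly, which is the
kind of serial-position trend the speaker described.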

The speaker suggested a number of remedies/responses. One was to
introduce warm-up items at the start of the test such that they "get it"
before anything counts. In this case, the brevity of the test suggests
that is maybe not such a good idea. The other strategy suggested was
that during test development, one work out a sort of "Latin square"
arrangement of the items, with the same items appearing in either Block
A, B, or C (start, middle, end). That way, an item's item-to-total
correlation does not depend on, and is not spuriously influenced by, its
serial position, and you don't end up throwing out items that were
actually
better than you thought, or keeping ones that weren't necessarily as
good as you thought.
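
The counterbalancing idea can be sketched as a simple cyclic rotation
of item sets across test forms. The set names X, Y, Z and the function
name are hypothetical; the speaker did not describe a specific
procedure, so treat this as one possible arrangement.

```python
def latin_square_forms(item_sets):
    """Rotate the item sets so each one occupies each position once."""
    n = len(item_sets)
    return [
        [item_sets[(start + pos) % n] for pos in range(n)]
        for start in range(n)
    ]

# Three hypothetical sets of items; positions are start, middle, end.
forms = latin_square_forms(["X", "Y", "Z"])
for number, order in enumerate(forms, start=1):
    print(f"Form {number}: start={order[0]} middle={order[1]} end={order[2]}")
```

Across the three forms every item set appears once at the start, once
in the middle, and once at the end, so item statistics pooled over
forms are not tied to any one serial position.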

Is an SJT enough like a personality test to warrant taking such factors
into consideration? I guess opinions differ, but to the extent that
both types of tests permit faking good, and both types of tests depict
sanctioned and nonsanctioned behaviour, one may suggest that there is a
common element of figuring out "who they want me to be" in both.

Mark Hammer

>>> "Shekerjian, Rene" <Rene.Shekerjian at cs.state.ny.us> 2010/02/03 11:36 am >>>
I have been reading up on internal consistency reliability coefficients
(e.g., KR-20 and Cronbach's Alpha) in order to clarify my thinking about
it, but I am having trouble finding much of practical use beyond two
basic points:

(1) .6 is tolerable, and .9 is the gold standard

(2) you can have high test-retest reliability with low internal
consistency

My question is this:

Suppose you have around 20 items that make up a situational judgment
subtest. The domain is defined by job analysis and is supposed to
address a competency that entails behaviors such as interacting with
customers, solving problems, giving advice, assessing situations, and
determining the best action to take, all within a circumscribed

How would you interpret an internal consistency reliability coefficient
of .3 for such a test? How about .6? And what about .9?

My stab at this is that .3 suggests several unwanted possibilities:
among them are (1) the items were "good" but too difficult for the
candidates and (2) the items have flaws such as not fully defining the
circumstances and/or constraints that need to be taken into account to
arrive at the "correct" answer.

Personally, I would expect a well crafted set of such items that are
given to an appropriate candidate group to hold together and have an
internal consistency reliability coefficient around .6.

As far as getting a .9 for this sort of test, I think that that would
indicate too narrow a focus for a domain that would be expected to cover
a pretty wide territory.

This is because, while it makes sense that people who are more able and
motivated to develop expertise in a "broad" competency will tend to be
good at much of it and those who are less able and/or motivated will
tend to perform poorly in much of it, I would expect some randomization
of people's strengths and weaknesses, which would lead some people to
perform well in many areas but still fall down in a few (but with no
discernible pattern) and others to perform poorly in many but still be
strong in a few (again with no discernible pattern).

Your thoughts on this would be much appreciated.

Thanks in advance,


René Shekerjian | Testing Services Division | NYS Department of
Civil Service | 518-473-9937
