[IPAC-List] Interpretation of internal consistency reliability coefficients

Dennis Doverspike dd1 at uakron.edu
Wed Feb 3 12:16:39 EST 2010

Abstract. You have here the classic conflict that arises with any test of
multidimensional domains, or any test based on a criterion-oriented strategy:
the content is heterogeneous, yet at the same time you want to calculate a
total score. Bottom line, it is impossible to say whether a .3, .6 or .9
would be preferable without knowing your exact purpose and the likely
correlations between those domains. A .3 could be very good in some
situations, although a critic might then argue - why calculate a total
score? Why not calculate a score for each dimension separately?

More detail

First, internal consistency reliability is based on one particular theory of
reliability (or several theories of reliability really) coming from a
particular view of tests, which sees the best tests as measuring a
unidimensional latent trait.

The general theory of situational judgment tests (see especially Mike
McDaniel's work) is not compatible with classic notions of internal
consistency. In other words, internal consistency is not an appropriate
index for situational judgment tests.

Having said that, playing devil's advocate of my own position, it could be
argued that situational judgment tests still arrive at a single total score,
which suggests that there is a single something being measured.

On your point 2, yes, you could have high test-retest reliability with low
internal consistency. It is possible, but again it would depend on your
measure and your theory. Technically, under some reliability theories,
test-retest is not an acceptable type of reliability at all -- since by
definition there is no random sampling of items (we can retain the fiction
as long as we do not blatantly violate it).
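
To make the point about test-retest versus internal consistency concrete,
here is a minimal sketch in Python. The data are hypothetical (four test
takers, two uncorrelated dichotomous items), and the assumption that every
answer is perfectly stable across administrations is mine, chosen only to
show the extreme case. The function names are illustrative.

```python
from math import sqrt
from statistics import mean, pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for rows of item scores (one row per test taker)."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: the two items are uncorrelated with each other.
time1 = [[0, 0], [0, 1], [1, 0], [1, 1]]
# Assumption: every test taker gives the same answers on the retest.
time2 = [row[:] for row in time1]

alpha = cronbach_alpha(time1)
retest_r = pearson_r([sum(r) for r in time1], [sum(r) for r in time2])
print(alpha, retest_r)
```

In this deliberately extreme case alpha is 0 while the test-retest
correlation of the total scores is 1. Real data would fall somewhere in
between.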

Bottom line then - a .3 internal consistency might be very appropriate and
in fact good for a situational judgment test, especially depending upon its
theory of construction. Remember, in developing a test based on a pure
criterion-oriented approach, such as might be seen with classical BIBs
(biographical information blanks) or even situational judgment tests, we
would want each additional item to correlate 0 with the items already in
the test, since that results in the greatest Multiple R. Thus, in that
case, lower internal consistencies are better and a high internal
consistency would be bad.
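
The trade-off between Multiple R and internal consistency can be sketched
with a simplified model. The assumptions are mine, not part of the original
argument: k standardized items, each with the same correlation with the
criterion (`validity`) and the same correlation with every other item
(`r_items`). Under those assumptions R squared is
k * validity^2 / (1 + (k - 1) * r_items), and standardized alpha is
k * r_items / (1 + (k - 1) * r_items). The values 20 and .15 are
hypothetical.

```python
from math import sqrt


def multiple_r(k, validity, r_items):
    """Multiple R for k equally valid, equally intercorrelated items."""
    r_squared = k * validity ** 2 / (1 + (k - 1) * r_items)
    return sqrt(r_squared)


def standardized_alpha(k, r_items):
    """Standardized alpha from the common inter-item correlation."""
    return k * r_items / (1 + (k - 1) * r_items)


k, validity = 20, 0.15  # hypothetical test length and item validity
for r_items in (0.0, 0.3):
    print(
        r_items,
        round(multiple_r(k, validity, r_items), 2),
        round(standardized_alpha(k, r_items), 2),
    )
```

With these numbers, uncorrelated items give a Multiple R of about .67 with
an alpha of 0, while items intercorrelated at .3 give a Multiple R of about
.26 with an alpha of about .90.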

So you have competing ideas here. You are calculating, I would guess, a
total score, but you are calculating that total score by adding together
measures of independent constructs. So a .3 might be good, a .6 might be
good, a .9 might be good; it depends on what you are trying to achieve and
what the correlation really is between those independent constructs.
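
It may help to see what each coefficient implies about the items
themselves. The sketch below inverts the standardized alpha formula,
alpha = k * r / (1 + (k - 1) * r), to recover the average inter-item
correlation r for a 20-item test. It assumes standardized items (or roughly
equal item variances), which is my simplification. The function name is
illustrative.

```python
def mean_inter_item_r(alpha, k):
    """Average inter-item correlation implied by standardized alpha."""
    return alpha / (k - alpha * (k - 1))


for alpha in (0.3, 0.6, 0.9):
    print(alpha, round(mean_inter_item_r(alpha, 20), 3))
```

For 20 items, an alpha of .3 corresponds to an average inter-item
correlation of about .02, .6 to about .07, and .9 to about .31.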

One could go into a lot more detail and argument on these concepts, but that
is for a different forum.

Dennis Doverspike, Ph.D., ABPP
Professor of Psychology
Director, Center for Organizational Research
Senior Fellow of the Institute for Life-Span Development and Gerontology
Psychology Department
University of Akron
Akron, Ohio 44325-4301
330-972-8372 (Office)
330-972-5174 (Office Fax)
ddoverspike at uakron.edu

-----Original Message-----
From: ipac-list-bounces at ipacweb.org [mailto:ipac-list-bounces at ipacweb.org]
On Behalf Of Shekerjian, Rene
Sent: Wednesday, February 03, 2010 11:36 AM
To: ipac-list at ipacweb.org
Subject: [IPAC-List] Interpretation of internal consistency

I have been reading up on internal consistency reliability coefficients
(e.g., KR-20 and Cronbach's Alpha) in order to clarify my thinking about it,
but I am having trouble finding much of practical use beyond two basic

(1) .6 is tolerable, and .9 is the gold standard

(2) you can have high test-retest reliability with low internal consistency

My question is this:

Suppose you have around 20 items that make up a situational judgment
subtest. The domain is defined by job analysis and is supposed to address a
competency that entails behaviors such as interacting with customers,
solving problems, giving advice, assessing situations, and determining the
best action to take, all within a circumscribed context.

How would you interpret an internal consistency reliability coefficient of
.3 for such a test? How about .6? And what about .9?

My stab at this is that .3 suggests several unwanted possibilities: among
them are (1) the items were "good" but too difficult for the candidates and
(2) the items have flaws such as not fully defining the circumstances and/or
constraints that need to be taken into account to arrive at the "correct"

Personally, I would expect a well-crafted set of such items, given to an
appropriate candidate group, to hold together and have an internal
consistency reliability coefficient around .6.

As far as getting a .9 for this sort of test, I think that that would
indicate too narrow a focus for a domain that would be expected to cover a
pretty wide territory.

I say this because, while it makes sense that people who are more able and
motivated to develop expertise in a "broad" competency will tend to be good
at much of it, and those who are less able and/or motivated will tend to
perform poorly in much of it, I would expect some randomization of people's
strengths and weaknesses. That would lead some people to perform well in
many areas but still fall down in a few (with no discernible pattern) and
others to perform poorly in many but still be strong in a few (again with
no discernible pattern).

Your thoughts on this would be much appreciated.

Thanks in advance,


René Shekerjian | Testing Services Division | NYS Department of Civil
Service | 518-473-9937
