[IPAC-List] Interpretation of internal consistency reliability coefficients
Gschindler at comp.state.md.us
Thu Feb 4 08:46:42 EST 2010
I have struggled with the interpretation of internal consistency
coefficients, in particular KR-20, for several years now. In my
experience, the reliability index is influenced largely by the number of
items in the subtest. So when I run reliability coefficients on my
separate subtests, which are theoretically more homogeneous than my whole
test, my KR-20s are always much lower than the whole test KR-20 and
often lower than the recommended indices. This has propelled some
people I have worked with to use the whole test index as the measure of
reliability, which gets me reeling because I believe it's an incorrect
use of the statistic. However, when one is thinking about what looks
good on paper and about defending one's test to a judge and jury of lay
persons, part of me understands the argument. It's exasperating for one
who is really trying to be accurate and stay true to good test
Gwen Schindler, Personnel Analyst
Comptroller of Maryland
Office of Personnel Services
Louis L Goldstein Treasury Building
80 Calvert St., Room 209
Annapolis, MD 21404-0466
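The dependence of KR-20 on the number of items, as described in the message above, can be illustrated with a short Python sketch. The function names kr20 and spearman_brown are illustrative and not taken from any package. The Spearman-Brown projection assumes the added items are parallel to the existing ones, which a whole test assembled from different subtests may not satisfy.

```python
def kr20(responses):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores."""
    n = len(responses)
    k = len(responses[0])
    # Proportion of examinees passing each item.
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    # Sum of item variances, p * q with q = 1 - p.
    sum_pq = sum(pj * (1 - pj) for pj in p)
    # Population variance of total scores (divides by n, to match p * q).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor,
    assuming the added items are parallel to the original ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)
```

Under these assumptions, a subtest with a KR-20 of .50 that is lengthened fivefold with parallel items would be projected to reach about .83, so a low subtest coefficient and a high whole-test coefficient can coexist without any change in item quality.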
>>> "Patrick McCoy" <Patrick.McCoy at psc-cfp.gc.ca> 2/3/2010 1:50 PM >>>
If I am not mistaken, a test can have high internal consistency, under
some circumstances, even if it is tapping more than one construct.
My guess is that for this to take place, there needs to be enough items
related to each construct and the answers need to be sound.
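The claim that a test tapping more than one construct can still show high internal consistency can be checked with a small calculation. This is a sketch under simplifying assumptions that are not in the original message: standardized items, two constructs with the same number of items, one common correlation within each construct, and one common correlation between constructs. The function names are hypothetical.

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha from item count and mean inter-item r."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)


def two_construct_alpha(items_per_construct, r_within, r_between=0.0):
    """Alpha for a test made of two equal-sized item clusters."""
    m = items_per_construct
    k = 2 * m
    within_pairs = m * (m - 1)   # item pairs inside either cluster
    between_pairs = m * m        # item pairs spanning the two clusters
    mean_r = (within_pairs * r_within + between_pairs * r_between) / (
        within_pairs + between_pairs
    )
    return standardized_alpha(k, mean_r)
```

With 20 items per construct, a within-construct correlation of .3, and a between-construct correlation of 0, the result is about .87, even though the two constructs are unrelated. With only 5 items per construct, the same correlations give about .61.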
>>> "Dennis Doverspike " <dd1 at uakron.edu> 2010/02/03 12:16 pm >>>
Abstract. You have here the classic conflict in any type of test of
multidimensional domains or a test based on a criterion oriented approach.
However, at the same time you want to calculate a total score. Bottom line,
it is impossible to say that a .3, .6 or .9 would be preferable without
knowing your exact purpose and the likely correlations between those
domains. A .3 could be very good in some situations. Although then one
might argue - why calculate a total score? Why not calculate a score for
each dimension separately?
First, internal consistency reliability is based on one particular theory of
reliability (or several theories of reliability really) coming from a
particular view of tests, which sees the best tests as measuring a
unidimensional latent trait.
The general theory of situational judgment tests (see especially Mike
McDaniel's work) is not compatible with classic notions of internal
consistency. In other words, internal consistency is not an appropriate
index for situational judgment tests.
Having said that, playing devil's advocate of my own position, it could be
argued that situational judgment tests still arrive at a single total score,
which suggests that there is a single something being measured.
On your point 2, you could have high test retest reliability with low
internal consistency. It is possible, but again would depend on your test
and your theory. Technically, under some reliability theories, test retest
is not an acceptable type of reliability at all -- since by definition there
is no random sampling of items (we can retain the fiction as long as we do
not blatantly violate it).
Bottom line then - a .3 internal consistency might be very appropriate and
in fact good for a situational judgment test, especially depending on the
theory of construction. Remember, in developing a test based on a pure
criterion oriented approach, such as might be seen with classical BIBs or
even situational judgment tests, we would want a correlation between each
additional item of 0, since that results in the greatest Multiple R. So
in that case, lower internal consistencies are better and a high internal
consistency would be bad.
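The trade-off between Multiple R and internal consistency can be made concrete with a small sketch. It rests on assumptions that are not in the original message: k standardized items, each with the same correlation with the criterion (validity), and the same correlation with every other item (inter_item_r). The function names are illustrative.

```python
import math


def multiple_r(k, validity, inter_item_r):
    """Multiple R for k equally valid, equally intercorrelated predictors."""
    r_squared = k * validity ** 2 / (1 + (k - 1) * inter_item_r)
    if r_squared > 1:
        # Such a combination of correlations cannot occur in real data.
        raise ValueError("validity too high for this inter-item correlation")
    return math.sqrt(r_squared)


def standardized_alpha(k, inter_item_r):
    """Standardized Cronbach's alpha for the same set of items."""
    return (k * inter_item_r) / (1 + (k - 1) * inter_item_r)
```

For 20 items that each correlate .2 with the criterion, uncorrelated items give a Multiple R of about .89 and an alpha of 0, while items intercorrelated at .3 give a Multiple R of about .35 and an alpha of about .90.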
So you have competing ideas here. You are calculating I would guess a total
score, but you are calculating a total score based on adding together
measures of independent constructs. So a .3 might be good, a .6 might be
good, a .9 might be good, depends on what you are trying to achieve and what
the correlation really is between those independent constructs.
One could go into a lot more detail and argument on these concepts, but that
is for a different forum.
Dennis Doverspike, Ph.D., ABPP
Professor of Psychology
Director, Center for Organizational Research
Senior Fellow of the Institute for Life-Span Development and
University of Akron
Akron, Ohio 44325-4301
330-972-5174 (Office Fax)
ddoverspike at uakron.edu
From: ipac-list-bounces at ipacweb.org
[mailto:ipac-list-bounces at ipacweb.org]
On Behalf Of Shekerjian, Rene
Sent: Wednesday, February 03, 2010 11:36 AM
To: ipac-list at ipacweb.org
Subject: [IPAC-List] Interpretation of internal consistency
I have been reading up on internal consistency reliability
(e.g., KR-20 and Cronbach's Alpha) in order to clarify my thinking,
but I am having trouble finding much of practical use beyond two basic points:
(1) .6 is tolerable, and .9 is the gold standard
(2) you can have high test-retest reliability with low internal consistency
My question is this:
Suppose you have around 20 items that make up a situational judgment
subtest. The domain is defined by job analysis and is supposed to measure a
competency that entails behaviors such as interacting with customers,
solving problems, giving advice, assessing situations, and determining the
best action to take, all within a circumscribed context.
How would you interpret an internal consistency reliability coefficient of
.3 for such a test? How about .6? And what about .9?
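One way to read these three values is to convert each into the average inter-item correlation it implies for a 20-item test. The sketch below inverts the standardized alpha formula. It assumes standardized items, so it only approximates KR-20 or raw alpha, and the function names are hypothetical.

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha from item count and mean inter-item r."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)


def implied_mean_r(alpha, k):
    """Mean inter-item correlation implied by a standardized alpha on k items."""
    return alpha / (k - (k - 1) * alpha)


# Implied average inter-item correlations for a 20-item subtest.
implied = {a: implied_mean_r(a, 20) for a in (0.3, 0.6, 0.9)}
```

Under this assumption, .3 corresponds to an average inter-item correlation of about .02, .6 to about .07, and .9 to about .31.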
My stab at this is that .3 suggests several unwanted possibilities: among
them are (1) the items were "good" but too difficult for the candidates, and
(2) the items have flaws such as not fully defining the circumstances and
constraints that need to be taken into account to arrive at the answer.
Personally, I would expect a well crafted set of such items that are given
to an appropriate candidate group to hold together and have an internal
consistency reliability coefficient around .6.
As far as getting a .9 for this sort of test, I think that that would
indicate too narrow a focus for a domain that would be expected to cover
pretty wide territory.
This is because, while it makes sense that people who are more able and
motivated to develop expertise in a "broad" competency will tend to be good
at much of it, and those who are less able and/or motivated will tend to
perform poorly on much of it, I would expect some randomization of people's
strengths and weaknesses, which would lead some people to perform well in
many areas but still fall down in a few (but with no discernible pattern)
and others to perform poorly in many but still be strong in a few (again
with no discernible pattern).
Your thoughts on this would be much appreciated.
Thanks in advance,
René Shekerjian | Testing
Services Division | NYS Department of
Service | 518-473-9937
IPAC-List at ipacweb.org