[IPAC-List] Interpretation of internal consistency reliability coefficients -- Not trying to beat a dead horse...

Shekerjian, Rene Rene.Shekerjian at cs.state.ny.us
Thu Apr 8 11:07:23 EDT 2010


As I considered everyone's input to this question, I was left with this concern: domain sampling.

If my test or subtest is supposed to address a given area and the items do not correlate with one another (yielding a low alpha or kr-20), then it seems that I have tapped many domains, perhaps as many as I have items.

And, in the extreme case, if I have only one or two items per domain, how do I have adequate coverage of each of the domains?

If I have only one or two items per domain, then the particular part of each domain I happen to tap on any given holding of the test will be the biggest factor in candidates' scores. If the content covered varies from holding to holding, then repeat candidates will get different scores depending on which part of each domain each item happened to tap.

For example, I might have a subtest that I think is covering a single area of knowledge but really covers many, say physics, geometry, art history, social psychology, etc. If there are only one or two items for each of these diverse areas of knowledge, then none is being reliably measured, and candidates' scores will depend on the match between candidates' knowledge of each area and the particular questions I end up asking in each one. If I constructed many of these tests, I would expect candidates' scores to vary significantly from one test to another.

In this case, it seems that not only would I have poor reliability, but I would have poor validity as well.

Now I suppose I could create parallel forms by selecting for each form one or two items that address the same little piece of each domain, but then the domain description wouldn't really be accurate. For example, instead of having some geometry questions, I would have two questions on the Pythagorean theorem.

I know this isn't extremely precise, but I am hoping I have been precise enough to reasonably convey the idea that a collection of uncorrelated questions may be tapping unrelated KSAs, which then would result in too few questions per KSA for there to be reliable measurement.
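To make the concern concrete, here is a minimal sketch in Python. The two small data sets and the function name cronbach_alpha are invented for illustration and do not come from any real test. It computes alpha from raw item scores (for right/wrong items this is the same quantity as KR-20) for a set of items that rise and fall together and for a set of items that are completely uncorrelated with one another.

```python
from itertools import product
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of candidates, each a list of item scores.

    For 0/1 items this is the same quantity as KR-20.
    """
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across candidates, and of the total score.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: five candidates, four items that rise and fall together.
related = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

# Hypothetical data: eight candidates covering every right/wrong pattern on
# three items, so no item predicts any other (inter-item correlations are 0).
unrelated = [list(pattern) for pattern in product([0, 1], repeat=3)]

alpha_related = cronbach_alpha(related)
alpha_unrelated = cronbach_alpha(unrelated)
print(f"related items:   alpha = {alpha_related:.2f}")
print(f"unrelated items: alpha = {alpha_unrelated:.2f}")
```

Under these made-up data the related items give an alpha of about .8, while the uncorrelated items give an alpha of essentially zero, because the variance of the total score is then just the sum of the item variances. That is the situation I am worried about: each item stands alone as the only measure of whatever it taps.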

If my thinking has gone astray here, I would appreciate a correction.

I thank everyone again for their comments to date.

René

René Shekerjian | Testing Services Division | NYS Department of Civil Service | 518-474-3778
====================================================================================================

-----Original Message-----
From: Winfred Arthur, Jr. [mailto:w-arthur at neo.tamu.edu]
Sent: Thursday, February 04, 2010 4:27 PM
To: ipac-list at ipacweb.org
Subject: Re: [IPAC-List] Interpretation of internal consistency reliability coefficients

i suppose whether it is inappropriate "to use the whole test index as
the measure of reliability" or not depends on _*what*_ you are
interpreting as "the test score" b/c that determines what the
appropriate level of analysis shld be. thus, for example, if i am
interpreting the GRE quant score, then i would be interested in the rxx
of that subtest. on the other hand, if i am interpreting the total GRE
score, the rxx metric of interest would be at the level of the total score.

in addition, i am sure you are already aware of this but it is worth
noting that alpha (or kr-20) is influenced not only by the number of
items but also the avg item correlations. that is why it is possible to
have fairly high alphas for tests that measure clearly multi-dimensional
domains (e.g., quant & verbal) -- if they are correlated. this also
serves as the caution for not interpreting alpha as a metric or
indicator of unidimensionality [which is distinct from homogeneity].
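
A rough sketch of that point in Python, using the standardized form of alpha, k * r / (1 + (k - 1) * r), where k is the number of items and r is the average inter-item correlation. The function name and the numbers (20 items per subtest, correlations of .20 within a subtest and .10 between subtests) are assumptions chosen only for illustration.

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized alpha for k items with average inter-item correlation mean_r."""
    if k < 2:
        raise ValueError("alpha needs at least two items")
    return k * mean_r / (1.0 + (k - 1) * mean_r)


# Same average inter-item correlation, different test lengths.
for k in (10, 20, 40, 80):
    print(k, round(standardized_alpha(k, 0.10), 2))

# Hypothetical two-dimensional test: two 20-item subtests, items correlating
# .20 within a subtest and .10 across subtests.
items_per_subtest = 20
within_r, between_r = 0.20, 0.10
within_pairs = 2 * (items_per_subtest * (items_per_subtest - 1) // 2)
between_pairs = items_per_subtest * items_per_subtest
mean_r_total = (within_pairs * within_r + between_pairs * between_r) / (
    within_pairs + between_pairs
)

subtest_alpha = standardized_alpha(items_per_subtest, within_r)
total_alpha = standardized_alpha(2 * items_per_subtest, mean_r_total)
print("subtest alpha:", round(subtest_alpha, 2))
print("whole-test alpha:", round(total_alpha, 2))
```

With these assumed values the whole test comes out with a higher alpha (about .87) than either subtest (about .83), even though it spans two dimensions and its average inter-item correlation is lower. The extra items more than make up the difference, which is why a high alpha by itself says nothing about unidimensionality.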

thanks.

- winfred

On 2/4/2010 7:46 AM, GWEN SCHINDLER wrote:

> IPMAAC Folks,

>

> I have struggled with the interpretation of internal consistency

> coefficients, in particular KR-20, for several years now. In my

> experience, the reliability index is influenced largely by the number

> of items in the subtest. So when I run reliability coefficients on my

> separate subtests, which are theoretically more homogenous than my

> whole test, my KR-20s are always much lower than the whole test KR-20 and

> often lower than the recommended indices. This has propelled some

> people I have worked with to use the whole test index as the measure

> of reliability, which gets me reeling because I believe it's an

> incorrect use of the statistic. However, when one is thinking about

> what looks good on paper and about defending one's test to a judge and

> jury of lay persons, part of me understands the argument. It's

> exasperating for one who is really trying to be accurate and stay true

> to good test development principles.

>

>

>

> Gwen Schindler, Personnel Analyst

> Comptroller of Maryland

> Office of Personnel Services

> Louis L Goldstein Treasury Building

> 80 Calvert St., Room 209

> Annapolis, MD 21404-0466

> Phone: 410-260-6622

> Fax: 410-974-5249

>

>

>

>>>> "Patrick McCoy"<Patrick.McCoy at psc-cfp.gc.ca> 2/3/2010 1:50 PM>>>

>>>>

> If I am not mistaken, a test can have high internal consistency, under

> some circumstances, even if it is tapping more than one construct.

>

> My guess is that for this to take place, there needs to be enough

> items related to each construct and the answers need to be sound.

>

> Pat McCoy

> Ottawa, Canada

>

>

>>>> "Dennis Doverspike "<dd1 at uakron.edu> 2010/02/03 12:16 pm>>>

>>>>

> Abstract. You have here the classic conflict in any type of test of
> multidimensional domains or a test based on a criterion oriented
> strategy. However, at the same time you want to calculate a total
> score. Bottom line, it is impossible to say that a .3, .6 or .9 would
> be preferable without knowing your exact purpose and the likely
> correlations between those domains. A .3 could be very good in some
> situations. Although then a critic might argue - why calculate a total
> score? Why not calculate a score for each dimension separately?

>

> More detail

>

> First, internal consistency reliability is based on one particular

> theory of reliability (or several theories of reliability really)

> coming from a particular view of tests, which sees the best tests as

> measuring a unidimensional latent trait.

>

> The general theory of situational judgment tests (see especially Mike

> McDaniel's work) is not compatible with classic notions of internal

> consistency. In other words, internal consistency is not an

> appropriate index for situational judgment tests.

>

> Having said that, playing devil's advocate of my own position, it
> could be argued that situational judgment tests still arrive at a
> single total score, which suggests that there is a single something
> being measured.

>

> On your point 2, you could have high test retest reliability with low
> internal consistency. It is possible, but again would depend on your
> measure and your theory. Technically, under some reliability theories,
> test retest is not an acceptable type of reliability at all -- since by
> definition there is no random sampling of items (we can retain the
> fiction as long as we do not blatantly violate it).

>

> Bottom line then - a .3 internal consistency might be very appropriate
> and in fact good for a situational judgment test, especially depending
> upon its theory of construction. Remember, in developing a test based
> on a pure criterion oriented approach, such as might be seen with
> classical BIBs or even situational judgment tests, we would want each
> additional item to correlate 0 with the items already in the test
> (while still correlating with the criterion), since that results in
> the greatest Multiple R. Thus, in that case, lower internal
> consistencies are better and a high internal consistency would be bad.

>

> So you have competing ideas here. You are calculating I would guess a
> total score, but you are calculating a total score based on adding
> together measures of independent constructs. So a .3 might be good, a
> .6 might be good, a .9 might be good, depends on what you are trying
> to achieve and what the correlation really is between those
> independent constructs.

>

> One could go into a lot more detail and argument on these concepts,
> but that is for a different forum.

>

>

> Dennis Doverspike, Ph.D., ABPP

> Professor of Psychology

> Director, Center for Organizational Research

> Senior Fellow of the Institute for Life-Span Development and Gerontology
> Psychology Department

> University of Akron

> Akron, Ohio 44325-4301

> 330-972-8372 (Office)

> 330-972-5174 (Office Fax)

> ddoverspike at uakron.edu

>


>

>

> -----Original Message-----

> From: ipac-list-bounces at ipacweb.org

> [mailto:ipac-list-bounces at ipacweb.org]

> On Behalf Of Shekerjian, Rene

> Sent: Wednesday, February 03, 2010 11:36 AM

> To: ipac-list at ipacweb.org

> Subject: [IPAC-List] Interpretation of internal consistency reliability coefficients

>

> I have been reading up on internal consistency reliability
> coefficients (e.g., KR-20 and Cronbach's Alpha) in order to clarify my
> thinking about them, but I am having trouble finding much of practical
> use beyond two basic points:

>

> (1) .6 is tolerable, and .9 is the gold standard

>

> (2) you can have high test-retest reliability with low internal

> consistency

>

> My question is this:

>

> Suppose you have around 20 items that make up a situational judgment
> subtest. The domain is defined by job analysis and is supposed to
> address a competency that entails behaviors such as interacting with
> customers, solving problems, giving advice, assessing situations, and
> determining the best action to take, all within a circumscribed context.

>

> How would you interpret an internal consistency reliability
> coefficient of .3 for such a test? How about .6? And what about .9?

>

> My stab at this is that .3 suggests several unwanted possibilities:
> among them are (1) the items were "good" but too difficult for the
> candidates and (2) the items have flaws such as not fully defining the
> circumstances and/or constraints that need to be taken into account to
> arrive at the "correct" answer.

>

> Personally, I would expect a well crafted set of such items that are
> given to an appropriate candidate group to hold together and have an
> internal consistency reliability coefficient around .6.

>

> As far as getting a .9 for this sort of test, I think that that would

> indicate too narrow a focus for a domain that would be expected to

> cover a pretty wide territory.

>

> Because while it makes sense that people who are more able and
> motivated to develop expertise in a "broad" competency will tend to be
> good at much of it and those who are less able and/or motivated will
> tend to perform poorly in much of it, I would expect some
> randomization of people's strengths and weaknesses, which would lead
> some people to perform well in many areas but still fall down in a few
> (but with no discernible pattern) and others to perform poorly in many
> but still be strong in a few (again with no discernible pattern).

>

> Your thoughts on this would be much appreciated.

>

> Thanks in advance,

>

> René

>

> René Shekerjian | Testing Services Division | NYS Department of Civil Service | 518-473-9937
> =============================================================================

>

>

>

>

> _______________________________________________________

> IPAC-List

> IPAC-List at ipacweb.org

> http://www.ipacweb.org/mailman/listinfo/ipac-list

>


>



>
