Validity and Think-Aloud Protocols


First adapted from the work of experimental psychologists (most notably Ericsson and Simon’s landmark 1993 work, Protocol Analysis), think-aloud protocols are the de facto standard for usability research in both lab and field settings.  If you’ve seen or given a usability test before, then you know what this is: the moderator tells the respondent to use a website or other application and then says, “Hey, tell me what you are thinking.”  Jakob Nielsen and other HCI researchers were quick to trumpet the merits of this technique for uncovering usability problems with sample sizes as small as four people.  Why is the technique so effective?  Well, its validity stems largely from the fact that it’s a direct measure of what’s happening in a subject’s short-term memory.  Other examples of direct measures of human cognition are hard to find… in fact, the two others that are primarily used are response tests (e.g. reaction time indicators) and MRI brain scans!  So to have a direct measure that is cheap and easy to administer and also provides qualitative insights into the user experience is powerful indeed!  But if the interview is poorly moderated, or descends into a Q & A session between moderator and respondent, then this validity flies out the window… so let’s look at the issue more closely.

If a subject is steadily verbalizing while performing a task (i.e. concurrent verbalization), they are assumed to be speaking from short-term memory.  Dumas & Redish (1993) conveniently summarize three levels of think-aloud protocols commonly referred to by HCI researchers:  Level 1 verbalizations, where the emphasis is on pure thoughts with no or minimal explanations; Level 2 verbalizations, which are the same except that the participant is dealing with non-verbal information, like shapes, which must be internally “coded” in order to be articulated verbally; and Level 3 verbalizations, or “thinking plus explanations.”  The latter are also referred to as retrospective reports, because the respondent is recapping and opining about what they actually did earlier.  Dumas summarizes the distinction between Level 3 and the other levels as the stage where the researchers are “no longer getting a read out of short-term memory… rather it is the interpretation of the process they are using or the reasons they have selected a strategy.”  Retrospective reports are not useless to the user researcher; in fact, they are necessary to clarify a respondent’s statements and actions. But they are far less valid.

This leads us to the question of active vs. inactive moderation.  In an inactive moderation scenario, the emphasis is on experimental control and creating a uniform experience for all test subjects.  This is the old-school style, where the researchers stand behind the glass and the respondent sits in the room by themselves talking out loud like a crazy person.  Participants, faced with the unnatural task of constant verbalization, are typically “coached” on how to deliver a think-aloud protocol.  This often takes the form of a warm-up exercise where the participant and experimenter practice thinking aloud with unrelated stimuli, preferably a simple game such as tic-tac-toe (in order to place emphasis on cognitive strategy).  During the actual experiment, the moderator prompts only when the participant ceases to verbalize: “Please keep talking,” they say.  And that’s all they say!
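
To make the inactive moderator's rule concrete, here is a minimal Python sketch of "prompt only when the participant stops talking."  Everything in it is an assumption made for illustration: the function name silence_prompts, the (start, end) timing format, and especially the 15-second silence threshold, which none of the sources cited here specify.

```python
from typing import List, Tuple

# The only thing an inactive moderator says during the session.
KEEP_TALKING = "Please keep talking."


def silence_prompts(
    utterances: List[Tuple[float, float]],
    session_end: float,
    max_silence: float = 15.0,  # hypothetical threshold, not from the literature
) -> List[float]:
    """Return the times (in seconds) at which the reminder would be given.

    utterances: (start, end) times of participant speech, sorted by start.
    A reminder is issued each time max_silence seconds pass with no speech.
    """
    prompts: List[float] = []
    last_spoke = 0.0
    # A zero-length "utterance" at session_end closes the final gap.
    for start, end in utterances + [(session_end, session_end)]:
        t = last_spoke + max_silence
        while t < start:  # still silent at time t, so remind the participant
            prompts.append(t)
            t += max_silence
        last_spoke = max(last_spoke, end)
    return prompts


if __name__ == "__main__":
    # Participant talks for 0-10s and 12-20s, goes quiet, then resumes at 60s.
    for t in silence_prompts([(0, 10), (12, 20), (60, 70)], session_end=80):
        print(f"{t:.0f}s: {KEEP_TALKING}")
```

The point of the sketch is what it leaves out: there is no branch for probing, clarifying, or redirecting, which is exactly what separates this style from the high-intervention style described next.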

In a high moderator intervention scenario, the experimenter employs probing questions to focus the participant’s attention on particular features or to elicit and clarify subjective explanations of their behavior.  Moderators are skilled in asking neutral, non-leading questions to minimize bias.  In addition, active listening techniques are employed to emulate the clinician’s empathic stance. This means paraphrasing a speaker’s comments to assure them that they have been listened to, noted, and understood, as well as other verbal and non-verbal forms of caring, non-judgmental acceptance.  Mike Kuniavsky (2003), whose work is mentioned frequently in these pages, lays out the guidelines for “non-directed” interviewing:  questions should be concentrated on immediate experience, nonjudgmental, focused on a single topic, open-ended, and non-binary (i.e. not answerable with yes/no or true/false).

In a quick look at the literature, you’ll see that Taylor and Dionne (2000) suggest that probes are best deployed for collection and verification of data in retrospective reports, and that they have a detrimental impact on validity if used during concurrent think-aloud protocols.  Preece (1994) suggests that the moderator’s presence both interrupts the participant and imposes additional cognitive load.  Nielsen (1993) is pragmatic on the subject, suggesting that the moderator intervene as little as possible while still directing the flow of the interview to maximize the number of usability issues found.  Nielsen is the spiritual father of the commercial usability field, so it’s no surprise that most usability testing is performed with this degree of pragmatism.  I think it was Jared Spool who once commented that you can only watch participants avoid clicking on the red button so many times before you are compelled to intervene and ask why.  Plus, most clients of usability research do not share the academic’s interest in validity.  They want enough validity to feel good about the process and the results, but ultimately they want their specific questions answered for a reasonable amount of time and expense.  In Nielsen et al.’s (2002) interpretation, the human is a psychological being engaged in a psychological interaction, which cannot be reduced to that which is concurrently verbalized.

Some usability researchers, characterized by Whiteside et al. (1988), have also posited that observing a user’s behavior is not enough to understand what is happening in terms of higher-order thinking and cognitive strategy.  If you set up a priori conditions, then you are limiting yourself to learning only what falls within those conditions.  In this view, subjective experience is the most comprehensive criterion for understanding usability.  Concerns of generalizability are side-stepped: the goal is to obtain rich, experiential data.  Since specific questions are the only effective way to elicit and clarify mental models, the researcher must come to terms with at least a partially subjectivist stance (Tamler, 2001).  This approach values reflexivity:  the subject is a full participant in the study, leading the research into relevant areas for exploration as well as having the ability to respond to the researcher’s interpretations. The philosophical underpinning of this method of inquiry is not a search for “truth” per se.  The assumption is that an expansion of perspectives leads to the exposure of more aspects of learning.

Here are some references (I know, for a blog post, it’s over the top… but this stuff’s important!)

DUMAS, J.S. (2001) “Usability Testing Methods:  Think-Aloud Protocols,” in Design by People For People:  Essays on Usability, UPA,  pp 119-129

DUMAS, J.S. & REDISH, J.C., (1993)  A Practical Guide to Usability Testing.  Norwood, NJ, Ablex Publishing Corp.

ERICSSON, K.A. & SIMON, H.A. (1984, 1993) Protocol analysis:  Verbal reports as data (Rev. ed).  Cambridge, MA:  MIT Press

KUNIAVSKY, M. (2003)  Observing the User Experience: A Practitioner’s Guide to User Research,  San Francisco: Morgan Kaufmann Publishers, Inc.

NIELSEN, J., (1993) Usability Engineering.  Chestnut Hill, MA:  Academic Press, Inc.

NIELSEN, J., CLEMMENSEN, T., & YSSING, C., (2002)  “Getting access to what goes on in people’s heads? – Reflections on the think-aloud technique”, paper presented to NordiCHI, Aarhus, Denmark, October 19-23

PREECE, J. (1994), Human-Computer Interaction, Addison-Wesley, England

TAMLER, H. (2001) “How (Much) to Intervene in a Usability Testing Session,”  in Design by People For People:  Essays on Usability, UPA,  pp 165-171

WHITESIDE, J., BENNETT, J.L., & HOLTZBLATT, K., (1988)  “Usability Engineering: Our Experience and Evolution,” in Handbook of Human Computer Interaction; edited by Helander, M.  New York, NY:  Elsevier Science Publishers


  1. #1 by nickgould on August 7th, 2009

    Great post, Todd! RT @solidstateux: A discussion with @nickgould got me thinking about Validity & Think-aloud Protocols: http://bit.ly/i1EjH
    1:20 PM Aug 5th from TweetDeck
