UX Certification
Get hands-on practice in all the key areas of UX and prepare for the BCS Foundation Certificate.
Usability testing is widely accepted as the de facto method for finding usability problems with a user interface. However, test sessions can suffer from a significant ‘observer effect’. This article describes some of the evidence for the observer effect along with some suggestions for ameliorating it.
If I mentioned the ‘observer effect’ in usability testing, what would it mean to you?
When I ask people this question, they normally say that the ‘observer effect’ means you can’t measure behaviour in a usability test without changing it in some way. It’s a bit like measuring the tyre pressure in your car: you can’t do this without letting some of the air escape (and hence changing the pressure). People can hardly be expected to behave normally if they know they are on camera or if they have a usability specialist next to them peering over their shoulder.
In practice, this kind of ‘observer effect’ tends to be much less of a problem than you might suppose. Except for the occasional nervous participant, I find that so long as my tasks are engaging, participants quickly adapt to the test setting and become absorbed in solving the task. Their behaviour will never be exactly the same as when they are alone, but frankly, if they can’t navigate a web site in the quiet of a usability test, they won’t be able to navigate it at home on the couch, with all the additional distractions of home life. Like the tyre example, what you measure may not be exact, but it’s close enough to identify a problem that needs fixing.
But did you know there’s another kind of observer effect in usability testing? This one is more worrying and much less widely known.
Back in the late 90s, HCI researchers Jacobsen, Hertzum and John asked people to observe usability test videos and report the usability issues that they saw. The authors were interested in this research question: if you asked four different experts to observe exactly the same usability test, would they find exactly the same problems? Of course, it’s unlikely you would get 100% agreement, but how much overlap do you think there would be between the four sets of observations?
In this study, the four experts found 93 usability problems. Stop for a second and think: if you had taken part in this study as one of the four experts, how many of these problems do you think you would have spotted? 90%? 70%?
The answer, in fact, is about half. And the proportion of problems that all four experts agreed on was a measly 20%, or about 19 of the 93 problems. (I can be this specific because I’ve taken these numbers from the original research paper.)
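To make the arithmetic concrete, here is a small sketch. The problem sets below are invented for illustration (they are not the study’s data), but they show how the figures are computed: pool each evaluator’s reported problems as a set, take the union to get everything anyone found, and the intersection to get what everyone agreed on.

```python
# Hypothetical example: four evaluators each report a set of problem IDs.
# (These sets are made up for illustration; the study found 93 problems.)
evaluators = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2, 3, 6, 7},
    "C": {1, 2, 8, 9},
    "D": {1, 2, 3, 10},
}

all_found = set.union(*evaluators.values())      # every problem anyone found
agreed = set.intersection(*evaluators.values())  # problems on every list

print(f"Total problems found: {len(all_found)}")
print(f"Agreed by all four:   {len(agreed)} ({len(agreed) / len(all_found):.0%})")

# Each evaluator's individual hit rate against the combined list:
for name, found in sorted(evaluators.items()):
    print(f"Evaluator {name} found {len(found) / len(all_found):.0%} of all problems")
```

With these invented sets, the agreed-by-all overlap works out at 20% and each evaluator individually finds 40-50% of the combined list, mirroring the shape of the study’s result.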
When I tell people about this study, their immediate reaction is that the lack of overlap must be down to some evaluators being over-zealous: perhaps one or two of them reported trivial problems that aren’t truly important (such as a spelling error or a formatting issue that doesn’t affect usage). In a real study, if you found 93 problems in a product, you would never report them all because the report would be unusable. You would probably select the 20-30 problems you consider most important and report those, essentially abandoning the rest.
Interestingly, the authors examined this too. Remember that the overlap across all problems was 20%. When the authors focussed on only the most severe problems (those that made it onto an evaluator’s ‘top 10’ list), the results indeed looked better. But of the 25 distinct problems that appeared across the evaluators’ top-10 lists, only 52% appeared on more than one list.
In other words, the chances of you spotting a critical usability problem that the other usability experts have found is about 50-50.
If you’re still unconvinced, then you’ll be interested to know that the authors have recently replicated and extended their work. Morten Hertzum and Niels Ebbe Jacobsen (in collaboration with Rolf Molich) had 19 evaluators watch usability test videos of people using the U-Haul web site. Some evaluators watched videos of remote, unmoderated usability test sessions and others watched videos of moderated usability test sessions. During these usability tests, participants were asked to ‘think aloud’ in a kind of stream of consciousness, and evaluators used this (along with other factors they saw and heard) to identify usability issues.
The results were very similar to the previous study. Even when focusing on only those problems rated ‘severe’ or ‘critical’, the agreement across all observers was just 40% (for unmoderated sessions) to 50% (for moderated sessions).
And before you dismiss the results as being due to inexperienced usability professionals, you should know that the list of evaluators reads like a veritable who’s who of usability specialists: on average, the evaluators had conducted over 100 usability tests and had 17 years’ experience doing usability evaluations.
Some usability experts don’t want to hear this finding because they like to think of usability testing as a quasi-scientific activity with a single ‘correct’ set of answers. But this effect has been known in the field of psychological science for some time: it’s akin to inter-rater reliability. For example, if you have six judges evaluating a gymnastic performance, it’s unlikely all six would give exactly the same score. With usability testing, it’s even more nuanced because the ‘judges’ aren’t giving an overall score but instead they are explaining in detail why the ‘performance’ is less than perfect.
And this analogy helps us understand what’s behind the observer effect: spotting problems in a usability test is a complex cognitive activity that requires evaluators to make difficult judgments. This means that the results depend as much on the person doing the observing as on the participant in the test lab.
This is the real ‘observer effect’ in usability testing: what you find in a usability test depends on who’s doing the looking. And it’s one that you need to work hard to counter.
Since any one evaluator is only going to spot around 50% of all the important problems in a usability test, you might think: what’s the point in running a test at all? In fact, there are some simple ways of modifying the way you run tests to mitigate this problem.
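The most obvious mitigation is to pool the observations of more than one evaluator. As a rough, back-of-the-envelope sketch, assuming (simplistically) that each evaluator independently spots any given problem with probability 0.5:

```python
# If each evaluator independently detects a given problem with probability p,
# the chance that a team of k evaluators finds it at least once is
# 1 - (1 - p)**k (the complement of everyone missing it).
def team_detection_rate(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for k in range(1, 5):
    print(f"{k} evaluator(s): {team_detection_rate(0.5, k):.0%}")
# 1 evaluator(s): 50%
# 2 evaluator(s): 75%
# 3 evaluator(s): 88%
# 4 evaluator(s): 94%
```

Real evaluators are not statistically independent, so the true gains are smaller, but the direction is right: a second pair of eyes substantially improves coverage.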
If you'd like to find out more, try my online course on usability testing.
Jacobsen, N. E., Hertzum, M., & John, B. E. (1998). The evaluator effect in usability studies: Problem detection and severity judgments. In Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting (pp. 1336-1340). Santa Monica, CA: HFES.
Hertzum, M., Jacobsen, N. E., & Molich, R. (2014). What you get is what you see: Revisiting the evaluator effect in usability tests. Behaviour & Information Technology, 33(2), 143-161.
Philip Hodgson and Rolf Molich provided helpful comments on an earlier draft of this article.
Dr. David Travis (@userfocus) has been carrying out ethnographic field research and running product usability tests since 1989. He has published three books on user experience including Think Like a UX Researcher. If you like his articles, you might enjoy his free online user experience course.
copyright © Userfocus 2021.