If I mentioned the ‘observer effect’ in usability testing, what would it mean to you?

When I ask people this question, they normally say that the ‘observer effect’ means you can’t measure behaviour in a usability test without changing it in some way. It’s a bit like measuring the tyre pressure in your car: you can’t do this without letting some of the air escape (and hence changing the pressure). People can hardly be expected to behave normally if they know they are on camera or if they have a usability specialist next to them peering over their shoulder.

In practice, this kind of ‘observer effect’ tends to be much less of a problem than you might suppose. Except for the occasional nervous participant, I find that so long as the tasks are engaging, participants quickly adapt to the test setting and become absorbed in solving them. Their behaviour will never be exactly the same as when they are alone, but frankly, if they can’t navigate a web site in the quiet of a usability test, they won’t be able to navigate it when they are at home on the couch with all of the additional distractions of home life. Like the tyre example, what you measure may not be exact — but it’s close enough to identify a problem that needs fixing.

But did you know there’s another kind of observer effect in usability testing? This one is more worrying and much less widely known.

The real observer effect

Back in the late 90s, HCI researchers Jacobsen, Hertzum and John asked people to observe usability test videos and report the usability issues that they saw. The authors were interested in this research question: if you asked four different experts to observe exactly the same usability test, would they find exactly the same problems? Of course, it’s unlikely you would get 100% agreement, but how much overlap do you think there would be between the four sets of observations?

In this study, the four experts between them found 93 usability problems. Stop for a second and think: if you had taken part in this study as one of the four experts, how many of these problems do you think you would have spotted? 90%? 70%?

The answer, in fact, is about half. And if we look at how many problems all four experts agreed on, the figure is a measly 20% — about 19 of the 93 problems. (I can be this specific because I’ve taken these numbers from the original research paper.)
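If you want to see how figures like these are calculated, here is a minimal sketch in Python. The problem lists are made up purely for illustration (they are not the study’s actual data): each evaluator’s findings are treated as a set of problem IDs, and we then work out what proportion of all the problems each evaluator spotted, and what proportion every evaluator spotted.

    # Toy illustration of evaluator overlap. The data are invented and the
    # numbers they produce are illustrative only.
    evaluators = {
        "A": {1, 2, 3, 5, 8, 9, 12},
        "B": {1, 2, 4, 6, 8, 10, 13},
        "C": {1, 3, 4, 7, 8, 11, 12},
        "D": {1, 2, 5, 6, 8, 9, 14},
    }

    # The full problem set is everything found by at least one evaluator.
    all_problems = set.union(*evaluators.values())

    # Problems that every evaluator found (the 20% figure in the 1998 study).
    found_by_everyone = set.intersection(*evaluators.values())

    # Proportion of all problems each evaluator detected individually
    # (the 'about half' figure in the 1998 study).
    for name, found in evaluators.items():
        print(f"Evaluator {name} found {len(found) / len(all_problems):.0%} of all problems")

    print(f"Found by every evaluator: {len(found_by_everyone) / len(all_problems):.0%}")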

When I tell people about this study, their immediate reaction is that the lack of overlap must be down to some of the evaluators being a bit over-zealous: perhaps one or two of them were logging trivial problems that aren’t truly important (such as a spelling error or a formatting issue that doesn’t affect usage). In a real study, if you found 93 problems in a product, you would never report them all because the report would be unusable. You would probably select the 20-30 problems that you consider most important and report those, essentially abandoning the rest.

Interestingly, the authors examined this too. Remember that the overlap across all problems was 20%. When the authors focussed on the most severe problems (classified as appearing on each evaluator’s ‘Top 10’ list), the results indeed looked better. But of the 25 problems that appeared on the four top-10 lists, only 52% appeared on more than one evaluator’s list.

In other words, the chance of you spotting a critical usability problem that another usability expert has found is about 50-50.

If you’re still unconvinced, then you’ll be interested to know that the authors have recently replicated and extended their work. Morten Hertzum and Niels Ebbe Jacobsen (in collaboration with Rolf Molich) had 19 evaluators watch usability test videos of people using the U-Haul web site. Some evaluators watched videos of remote, unmoderated usability test sessions and others watched videos of moderated usability test sessions. During these usability tests, participants were asked to ‘think aloud’ in a kind of stream of consciousness, and evaluators used this (along with other factors they saw and heard) to identify usability issues.

The results were very similar to the previous study. Even when focusing on only those problems rated ‘severe’ or ‘critical’, the agreement across all observers was just 40% (for unmoderated sessions) to 50% (for moderated sessions).

And before you dismiss the results as being due to inexperienced usability professionals, you should know that the list of evaluators reads like a veritable who’s who of usability specialists: on average, the evaluators had conducted over 100 usability tests and had 17 years’ experience of doing usability evaluations.
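As an aside, it helps to know how an ‘agreement’ percentage like this can be calculated. A measure often used in the evaluator effect literature is ‘any-two agreement’: the overlap between each pair of evaluators’ problem sets, averaged over all pairs. I am assuming, rather than asserting, that this is the exact statistic behind the figures quoted above, but the sketch below (in Python, with invented data) shows the idea.

    # Any-two agreement: for each pair of evaluators, the size of the
    # intersection of their problem sets divided by the size of the union,
    # averaged over all pairs. The problem sets below are invented.
    from itertools import combinations

    problem_sets = {
        "E1": {"nav-01", "label-03", "form-07", "search-02"},
        "E2": {"nav-01", "label-03", "checkout-05"},
        "E3": {"nav-01", "form-07", "search-02", "checkout-05", "label-09"},
    }

    pair_scores = [
        len(a & b) / len(a | b)
        for a, b in combinations(problem_sets.values(), 2)
    ]

    print(f"Any-two agreement: {sum(pair_scores) / len(pair_scores):.0%}")

The closer this number is to 100%, the more the evaluators are seeing the same problems; figures of 40-50% suggest that, even for severe and critical problems, experienced evaluators only get about halfway there.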

What causes the observer effect?

Some usability experts don’t want to hear this finding because they like to think of usability testing as a quasi-scientific activity with a single ‘correct’ set of answers. But this effect has been known in psychological science for some time: it is essentially a problem of inter-rater reliability. For example, if you have six judges evaluating a gymnastics performance, it’s unlikely all six would give exactly the same score. With usability testing, it’s even more nuanced because the ‘judges’ aren’t giving an overall score: they are explaining in detail why the ‘performance’ is less than perfect.

And this analogy helps us understand what's behind the observer effect: spotting problems in a usability test is a complex cognitive activity that requires evaluators to exercise difficult judgments. This means that the results depend as much on the person doing the observing as they do on the participant in the test lab.

This is the real ‘observer effect’ in usability testing: what you find in a usability test depends on who’s doing the looking. And it’s one that you need to work hard to counter.

Fighting back against the observer effect

Since any one evaluator is only going to spot around 50% of all the important problems in a usability test, you might think: what’s the point in running a test at all? In fact, there are some simple ways of modifying the way you run tests to mitigate this problem.

  • First, accept it’s real. No matter how much of a usability guru you think you are, you won’t spot all of the critical problems in a usability test. By thinking you’re immune, you become part of the problem and not part of the solution.
  • Make sure it’s not just you observing test sessions. Involve at least one other person and where possible compare notes at the end of each session.
  • Even better, invite the whole design team along to the usability test and, after each participant, build consensus on the severe and critical problems that have been spotted.
  • Use a systematic method for defining severity that doesn’t just depend on your gut instinct. Really think through the criteria that make a problem severe or critical. This makes it more likely that you and your colleagues will agree on the rating. This matters, because in the Hertzum, Jacobsen and Molich study, when two or more evaluators rated a problem as critical, there was much stronger consensus (70% agreement for moderated sessions and 78% agreement for unmoderated sessions). So anything you can do to help people arrive at the same severity definition will help.
  • If, despite your best efforts, you turn out to be the only person who can observe the test and analyse the data, then be systematic in the way you do the analysis. Create detailed logs of participant behaviour and use these logs to identify usability problems (a minimal sketch of what such a log might look like follows this list). This will help prevent you being influenced (and distracted) by just the forehead-slapping moments that we all see in usability tests.
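Here is a minimal sketch, in Python, of what a structured behavioural log with explicit severity criteria might look like. The field names and the severity rules are illustrative assumptions rather than a standard; the point is simply that every observation is recorded in the same way and rated against the same written criteria, not against gut instinct.

    # An illustrative behavioural log. The field names and the severity rules
    # are assumptions made for the sake of example, not a standard.
    from dataclasses import dataclass

    @dataclass
    class Observation:
        participant: str   # e.g. "P3"
        task: str          # the task the participant was attempting
        timestamp: str     # time into the session, e.g. "00:12:40"
        behaviour: str     # what the participant did or said
        task_failed: bool  # did this behaviour prevent task completion?
        recurred: bool     # was the same behaviour seen with other participants?

    def severity(obs: Observation) -> str:
        """Assign severity from explicit criteria rather than gut instinct."""
        if obs.task_failed and obs.recurred:
            return "critical"
        if obs.task_failed or obs.recurred:
            return "serious"
        return "minor"

    log = [
        Observation("P3", "Find a returns form", "00:12:40",
                    "Searched the FAQ instead of the 'Returns' page", True, True),
        Observation("P5", "Find a returns form", "00:09:15",
                    "Hesitated over the 'Help' and 'Support' labels", False, True),
    ]

    for obs in log:
        print(f"{obs.participant} | {obs.task} | {severity(obs)} | {obs.behaviour}")

Reviewing a log like this after the session, rather than relying on memory, makes it much harder for a vivid moment to crowd out the quieter problems.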

If you'd like to find out more, try my online course on usability testing.

Literature cited

Jacobsen, N. E., Hertzum, M., & John, B. E. (1998). The evaluator effect in usability studies: Problem detection and severity judgments. In Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting (pp. 1336-1340). Santa Monica, CA: HFES.

Hertzum, M., Jacobsen, N. E., & Molich, R. (2014). What you get is what you see: Revisiting the evaluator effect in usability tests. Behaviour & Information Technology, 33(2), 143-161.

Acknowledgements

Philip Hodgson and Rolf Molich provided helpful comments on an earlier draft of this article.

About the author

David Travis

Dr. David Travis (@userfocus) has been carrying out ethnographic field research and running product usability tests since 1989. He has published three books on user experience including Think Like a UX Researcher. If you like his articles, you might enjoy his free online user experience course.


