Measurement, Vol. XX, No. 1, pp. 37-46.
Ericsson, K.A., & Simon, H.A. (1984). Protocol Analysis: Verbal reports as data. Cambridge, Massachusetts: MIT Press, pp. 376-377.
Mattel Inc., http://www.fisher-price.com/us (14 January 2003).
Hertzum, M., & Jacobsen, N.E. (2001). The Evaluator Effect: A Chilling Fact About Usability Evaluation Methods. International Journal of Human-Computer Interaction, 13(4), 421-443.
Jacobsen, N.E., Hertzum, M., & John, B.E. (1998). The evaluator effect in usability tests. In Proceedings of the Conference on CHI '98 Summary: Conference on Human Factors in Computing Systems (Los Angeles, April 18-23). ACM, New York, 255-256.
Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D., & Kirakowski, J. (1998). Comparative evaluation of usability tests. In Proceedings of the Usability Professionals Association 1998 (UPA98) Conference, Washington D.C., USA, June 25-26, 1998.
Molich, R., Damgaard Thomsen, A., Schmidt, L., Ede, M., Van Oel, W., & Arcuri, M. (1999). Comparative evaluation of usability tests. In Proceedings of CHI '99 (extended abstracts), 83-84. Data available at http://www.dialogdesign.dk/cue.html, last verified September 2002.
Van Kesteren, I.E.H. (2003). Usability Evaluation Methods with Children. Master's Thesis, Delft University of Technology, Faculty of Industrial Design Engineering.
Vermeeren, A.P.O.S. (1999). Designing scenarios and tasks for user trials of home electronic devices. In W.S. Green and P.W.
Jordan (eds.), Human Factors in Product Design: Current Practice and Future Trends. Taylor & Francis, London, 47-55.
Vermeeren, A.P.O.S., den Bouwmeester, K., Aasman, J., & de Ridder, H. (2002). DEVAN: a tool for detailed video analysis of user test data. Behaviour & Information Technology, 21(6), 403-423.
a chance incident, an artefact of the peculiarities of a single study, or a weakness of a particular usability evaluation method. Their study made them realise that the evaluator effect will, to a considerable extent, have to be managed rather than eliminated. Hertzum and Jacobsen's (2001) conclusions were based on comparing set-ups and outcomes of reported usability studies. Due to the lack of explicitness in the analyses of the reported studies, causes of the individual evaluator effects could not be determined directly from the data, but were inferred by comparing characteristics of the studies.

The present paper examines in more detail how characteristics of the data analysis process may influence the evaluator effect. Three studies were conducted. In each of the three studies, two evaluators independently analysed the same video-recorded user test data. In one of the studies, a data analysis approach comparable to that of Jacobsen et al. (1998) was used. In the other two studies, the data analysis process was prescribed and carried out in much detail, and decisions in the data analysis process were made explicit. For this, the user test data analysis tool DEVAN (DEtailed Video ANalysis; Vermeeren et al., 2002) was used. It was expected that an analysis based on the use of this tool would suffer less from the shortcomings that, according to Hertzum and Jacobsen (2001), contribute to the evaluator effect, i.e. vague evaluation procedures and v
evaluator effect, consumer products, children.

1 Introduction

In research on usability evaluation methods, as well as in design practice, user testing is considered one of the best techniques for getting insights into usability problems. However, recently a number of studies have been published that question the reliability of user test data. For example, in studies by Molich et al. (1998) and Molich et al. (1999), multiple usability laboratories (3 and 6, respectively) were asked to evaluate the same software. The average agreement on usability problems by two laboratories proved to be no more than 7%. Molich et al. (1998) speculate that this limited agreement may be a result of the different approaches taken to user testing. In a study by Jacobsen et al. (1998), four evaluators independently analysed the same video recordings of four user test sessions. They found that, on average, two analysts agreed on only about 42% of the usability problems they collectively found. This suggests that the user test data analysis process itself plays a significant role in the lack of agreement between analysts as well. Jacobsen et al. (1998) call this the "evaluator effect". In the present paper, the evaluator effect is studied in more detail.

Hertzum and Jacobsen (2001) provide an overview of eleven studies on the evaluator effect in relation to user evaluation methods. Three out of the eleven studies concerned user testing. They analysed where the us
hindered by a lack of information about "what went on inside the user's head". Thus, another approach could be to gather more data during the test itself (e.g. eye movements or additional verbal data from retrospective interviews), in order to get more insight into the user's intentions.

Inefficiencies and redundant actions. Five out of the 37 differences in UPTs were caused by the fact that in some cases evaluators differed in judging whether an inefficiency in the interaction should be regarded as problematic or not. In the case of three UPTs, the inefficiencies concerned "unnecessary but harmless" actions, like pressing a rewind button while the recorder had already started rewinding ("just to be sure"). In the case of two out of the five UPTs, the inefficiency did not concern an individual button press, but the choice of an inefficient strategy for performing a task. In both cases, the user was asked to cancel a scheduled recording, which could be done with a single button press. However, the user's strategy was to set both "start time" and "stop time" to "0:00". Evaluators decided differently on whether this should be regarded as a problem or not.

In all five cases, both evaluators had indicated being unsure about how to treat these cases and realised that other evaluators might treat them differently. In the development of DEVAN, evaluators had already run across this problem and an attempt had been made to solve this pro
puzzlement, the subject went through all menu items without setting anything. She expressed her frustration about "how difficult the thing is", and subsequently ran into a number of problems while setting the items in the menu. Almost all of these UPTs were detected by both evaluators. Thus, the interaction directly following the puzzlement clearly revealed that to the subject it was not immediately clear how to schedule the video recorder, and missing the "puzzlement" UPT would not lead to different conclusions as compared to having detected it.

In six cases, the UPTs that were missed by one of the evaluators were not followed by closely related UPTs. Two of these concerned brief instances of puzzlement in the beginning of a task, which were quickly followed by trouble-free and smooth task performance. Thus, these UPTs do not seem to be very significant. Four out of the six UPTs were really unique, and provided new insights. For example, in one case the user was trying to remove a menu from the TV screen. The menu indeed disappeared and the screen changed from showing video images to showing TV images. However, due to the response time of the TV screen, the screen was black for a moment. This confused the user and made her conclude that "the whole thing" did not function anymore. Although one could argue about the severity of this problem, it is a usability problem that was missed by one of the evaluators.

Summarising, it seems
software used by Jacobsen et al. (1998). Thus, comparing Jacobsen's study to the Jammin' Draw study seems to support Hertzum and Jacobsen's statement that the evaluator effect persists across various situations.

Like study one, studies two and three are from different system domains. In these studies, the data analysis process was expected to suffer less from the shortcomings that Hertzum & Jacobsen (2001) considered to be major contributors to the evaluator effect. However, even here considerable evaluator effects were found, although less than in the other studies.

In section 4.1, the evaluator effects of studies two and three are analysed in more detail: using the representations created with DEVAN, it is analysed during which data analysis activities the differences first occurred. Suggestions are made for how to manage the evaluator effect.

4.1 The Evaluator Effect Analysed

In studies two and three, a total of 126 UPTs were detected. Of these, 37 were listed by no more than one evaluator. Below, the differences in the lists of UPTs are analysed. Five main groups of causes emerged during the analysis of differences; these are related to: (1) interpreting verbal utterances and non-verbal behaviour, (2) guessing user intentions, (3) judging to what extent inefficiencies or redundant actions are considered problematic, (4) distinguishing usability problems from problems of the test, and finally (5) inaccuracies in doing the analy
Human-Computer Interaction - INTERACT '03
M. Rauterberg et al. (Eds.)
Published by IOS Press, (c) IFIP, 2003, pp. 647-654

Managing the Evaluator Effect in User Testing

Arnold P.O.S. Vermeeren, Ilse E.H. van Kesteren, Mathilde M. Bekker

Delft University of Technology, Industrial Design Engineering, Landbergstraat 15, NL-2628 CE Delft, The Netherlands
A.P.O.S.Vermeeren@io.tudelft.nl
Technical University of Eindhoven, Technology Management, Den Dolech 2, NL-5600 MB Eindhoven, The Netherlands

Abstract: If multiple evaluators analyse the outcomes of a single user test, the agreement between their lists of identified usability problems tends to be limited. This is called the "evaluator effect". In the present paper, three user tests, taken from various domains, are reported and evaluator effects were measured. In all three studies, the evaluator effect proved to be less than in Jacobsen et al.'s (1998) study, but still present. Through detailed analysis of the data, it was possible to identify various causes for the evaluator effect, ranging from inaccuracies in logging and mishearing verbal utterances to differences in interpreting user intentions. Suggested strategies for managing the evaluator effect are: doing a systematic and detailed data analysis with automated logging, discussing specific usability problems with other evaluators, and having the entire data analysis done by multiple evaluators.

Keywords: usability testing methods,
                                            Occurrences of breakdowns        Unique Problem Tokens (UPTs)
                                            by both  in total  any-two       by both  in total  any-two
Study 1: Jammin' Draw, user A (n=2)            8        15       53%            8        15       53%
Study 1: Jammin' Draw, user B (n=2)            9        15       60%            9        14       64%
Study 2: Thermostat (n=2)                     30        49       61%           21        33       64%
Study 3: TV/video recorder, user A (n=2)      43        54       80%           26        32       81%
Study 3: TV/video recorder, user B (n=2)      66       110       60%           42        61       69%
Jacobsen et al. (1998): Multimedia
  authoring software (n=4)                     -         -        -            27(a)     93(c)    42%

Table 2: Overview of measured evaluator effects. As a measure of the evaluator effect, the "any-two agreement" as proposed by Hertzum and Jacobsen (2001) is used. Its value ranges from 0% in case of no agreement to 100% in case of full agreement; n signifies the number of evaluators.
(a) Average number of UPTs found by two evaluators, based on data from one user. This figure compares best to the figures of studies 1, 2 and 3.
(b) Average number of UPTs found by two evaluators, based on data from four users.
(c) Number of UPTs found by four evaluators, based on data from four users.

Jacobsen et al.'s (1998) study, but the evaluator effect is still considerable. Hertzum and Jacobsen (2001) stated that the evaluator effect persists across differences in system domains, as well as in system complexity, amongst other variables. Clearly, Jammin' Draw is from an entirely different system domain (and user group) than the multimedia authoring software that Jacobsen et al. (1998) evaluated. Considering the total number of UPTs found, the complexity of the Jammin' Draw interface is most likely less than that of the multimedia authoring
ability evaluation methods "fell short of providing evaluators with the guidance necessary for performing reliable evaluations". Two out of the three shortcomings they found directly relate to the analysis of user test data. These are: vague evaluation procedures and vague problem criteria. About the consequences of vague evaluation procedures they state that "differences in ... evaluators' general views on usability, their personal experiences with the system under evaluation, their opinions about it, and so forth, lead them to make some observations and remain blind towards others". In addition, they state that vague problem criteria lead "to anything being accepted as a problem". Hertzum and Jacobsen (2001) argue that the principal reason for the evaluator effect is that usability evaluation involves interpretation. They state that "although some usability problems are virtually self-evident, most problems require the evaluator to exercise judgement in analysing the interaction among the users, their tasks and their systems. ... In general, individual differences ... preclude that cognitive activities such as detecting and assessing usability problems are completely consistent across evaluators."

Figure 1: The products that were tested in the three studies: (a) Jammin' Draw (study 1), (b) thermostat (study 2), (c) TV/video recorder (study 3).

Furthermore, they believe that the evaluator effect cannot be dismissed as
ague problem criteria. In addition, the explicitness of the data analysis process would provide opportunities to better understand which data analysis activities contribute most to the evaluator effect.

In section two, the set-ups of the three studies are described. Section three reports how the user test data have been analysed. Finally, in section four, the results in terms of the measured evaluator effects are described. This is followed by an analysis of what might have caused the evaluator effect in the two studies that were analysed with DEVAN.

2 Three Studies

2.1 Study 1: Interactive Toy
A user test was done on an interactive toy for children of 5 years and up, named Jammin' Draw (Mattel Inc., 2003; see figure 1a). It is a toy with which children can make music by drawing on colour plates that are placed on the toy. Ten children (age 6 to 8) participated in the test. Data from two of the children were used for measuring the evaluator effect. Sessions lasted about 30 minutes. Children were allowed to first play with the product for a maximum of 5 minutes without any further instruction. Subsequently, a number of tasks were given. For example (translated from Dutch): "I brought this colouring plate for you. I would like you to colour it for me. I would like to hear various musical instruments while you are colouring", and "Please choose another background music and tell me when you are ready with it." Video recordings we
blem by defining more specific usability criteria. However, this did not seem to work. To resolve issues like these, it would probably suffice to discuss with other evaluators the specific instances of problems about which an evaluator is unsure.

Test or usability problem. In five out of the 37 cases it was clear that there was a problem, but it was unclear whether it should be regarded as a usability problem or as a problem introduced by the test itself. For example, in three out of the five cases, the observed process of scheduling the thermostat or video recorder was technically correct, but a wrong value was entered for one of the settings (a wrong temperature and wrong dates, respectively). In these cases it was not clear whether the problem was caused by problems in memorising the task, or whether subjects really believed they had to set the chosen value to successfully perform the task.

In two out of the five cases, there were problems that related to interference by the experimenter during the test. For example, in the case of one task in the TV/video recorder test, the subject was asked to tell how many TV programs she had scheduled to be recorded. As the subject had not succeeded in properly scheduling the video recorder, nothing was scheduled (although she believed she had been successful). In looking up how many programs she had scheduled, the subject was not able to understand the timer menu well enough to conclude that nothing was sch
e subjects was used for measuring and analysing the evaluator effect (a subject whose task performance was not too extreme in terms of speed and breakdowns). Subjects were given 12 tasks on paper. Sessions lasted about 20 to 25 minutes. Tasks were formulated as scenarios, in terms of user goals describing a desired behaviour of the heating system. For example: "You are going away on holiday and don't find it necessary that the house is heated during that time. Make settings such that the house will not be heated during the holidays." Subjects were asked and trained to think aloud, using the instructions suggested by Ericsson & Simon (1984). All sessions were recorded on video, showing the subject's hands as well as an overall picture of the subject sitting at the table.

2.3 Study 3: TV/Video Recorder
In the third study, a combined TV/video recorder (Philips type nr. 21PT351A/00; see figure 1c) was tested with twelve subjects. Data sets of two subjects (of age group 30-40 years) were analysed to measure and analyse the evaluator effect. One subject (user A) was a relatively quiet man, who worked in a reasonably systematic way. The other subject (user B) was a talkative woman who seemed to work less systematically and who experienced many problems in performing the tasks. Subjects were asked to perform 38 tasks in a maximum of one hour's time. All tasks were related to the product's video functions and to general TV functions. Subjects
eduled. After some time, the experimenter implicitly suggested the answer to the subject, who then gave the right answer. Evaluators decided differently on whether this should be treated as a problem or not.

As in the previous category, both evaluators indicated being unsure about how to treat such problems. To resolve issues like these, it would probably suffice to discuss the views of other evaluators on specific instances of such problems.

Inaccuracies of the evaluator. Eight out of the 37 differences in UPTs were caused by inaccuracies of the evaluators during the analysis. For example, in three out of the eight cases, button presses were not accurately logged. In two cases, one evaluator had observed that a user repeatedly pressed a button, whilst the other had only seen one long button press. In the third case, a button was not properly pressed and its function was not activated; one evaluator failed to notice this. In one out of the eight cases, a subject forgot to confirm settings that she had just made. Again, one of the evaluators had not noticed this. In yet another case, one evaluator had forgotten to copy a breakdown indication from DEVAN's interaction table to the list of breakdown indications. Finally, in three cases, the difference was caused by vagueness in the description of when to record events as breakdown indications of type "GOAL" (see table 1). This only happened in study two. In study three, the definition of breakd
breakdowns in the interface. This means that for small groups of evaluators, the detection rate becomes overly high. The second measure, Cohen's kappa (Cohen, 1960), presupposes a similar a

Breakdown indication types based on observed actions on the product:
  ACT    - User chooses wrong action
  DISC   - User discontinues an initiated action
  EXE    - User has problems in physically executing an action
  REP    - An action is repeated with exactly the same effect
  CORR   - User corrects or undoes a preceding action
  STOP   - User stops task; task not successfully finished

Breakdown indication types based on verbal utterances or on non-verbal behaviour:
  GOAL   - User formulates an inadequate goal
  PUZZ   - User seems to be puzzled about what to do next
  RAND   - From the user's words it is clear that actions are selected at random
  SEARCH - User indicates to be searching for a specific function and can't find it, or the function does not exist
  DIFF   - User indicates that physical execution of an action is problematic or uncomfortable
  DSF    - User expresses doubt, surprise or frustration after having performed an action
  REC    - From the user's words it is clear that a preceding error is recognised as such, or that something previously not understood has now become clear
  QUIT   - User realises that the current task was not successfully finished, but continues with the next task

Table 1: DEVAN's checklist of breakdown indication types (short descriptions; for full definitions, see Vermeeren et al., 2002).
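For evaluators who process their logs with scripts, the checklist in Table 1 can be encoded directly. The following minimal Python sketch is our own illustration (DEVAN does not prescribe any machine-readable format); it simply records the Table 1 codes with their short descriptions and validates a logged code against them.

# Hypothetical encoding of Table 1; the dictionary keys are the DEVAN codes.
BREAKDOWN_INDICATION_TYPES = {
    # Based on observed actions on the product
    "ACT": "User chooses wrong action",
    "DISC": "User discontinues an initiated action",
    "EXE": "User has problems in physically executing an action",
    "REP": "An action is repeated with exactly the same effect",
    "CORR": "User corrects or undoes a preceding action",
    "STOP": "User stops task; task not successfully finished",
    # Based on verbal utterances or non-verbal behaviour
    "GOAL": "User formulates an inadequate goal",
    "PUZZ": "User seems to be puzzled about what to do next",
    "RAND": "Actions are selected at random",
    "SEARCH": "User searches for a specific function and cannot find it",
    "DIFF": "Physical execution of an action is problematic or uncomfortable",
    "DSF": "User expresses doubt, surprise or frustration after an action",
    "REC": "A preceding error or misunderstanding is recognised as such",
    "QUIT": "User realises the task was not finished, but continues with the next task",
}

def is_valid_code(code: str) -> bool:
    """Check that a logged indication uses one of the Table 1 codes."""
    return code.upper() in BREAKDOWN_INDICATION_TYPES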
iour;
- preliminary segmentation of the interaction, based on threshold pause times between actions;
- deciding on definitive interaction segments, as well as clustering and abstracting these to intermediate-level episodes and task-level episodes.

At the end of stage one, the interaction is represented in the format shown in figure 2 (except for the grey marks (figure 2, item 7) and the breakdown indication codes (figure 2, item 6), which are added in stage two). The interaction table includes all loggings and transcriptions of utterances and non-verbal behaviour that are used as the basis for detecting interaction breakdowns. The segmentation, in combination with the abstractions, makes explicit how evaluators interpreted a subject's interaction with the product.

Stage two: creating a list of breakdowns in the interaction:
- detecting events that indicate the occurrence of a breakdown, by using a checklist of breakdown indication types;
- describing the breakdowns.

The checklist of breakdown indication types (table 1) serves as a list of usability problem criteria. Detected breakdown indications are described using the following elements: (1) a time code reference, (2) a description of the observed event, (3) the context in which the event occurred (task context and product mode), (4) the code for the type of breakdown indication, and (5) a free-form description of the breakdown indication; a minimal record structure along these lines is sketched below.

It should be noted that at different points in ti
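As a concrete illustration of the five descriptive elements, the sketch below shows one possible record structure for a detected breakdown indication. The structure and field names are our own assumption; DEVAN itself only specifies which elements a description should contain.

from dataclasses import dataclass

@dataclass
class BreakdownIndication:
    """One detected breakdown indication, following the five elements listed above."""
    time_code: str        # (1) time code reference, e.g. "0:07:41"
    observed_event: str   # (2) description of the observed event
    context: str          # (3) task context and product mode in which it occurred
    indication_type: str  # (4) code from Table 1, e.g. "PUZZ" or "ACT"
    description: str      # (5) free-form description of the breakdown indication

# Example with hypothetical data:
example = BreakdownIndication(
    time_code="0:07:41",
    observed_event="User presses <rewind> although the tape is already rewinding",
    context="Task 5, recorder in playback mode",
    indication_type="REP",
    description="Redundant button press, possibly 'just to be sure'",
)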
me, multiple indications can exist for the occurrence of a single breakdown. For example, an evaluator may observe that the user erroneously selects a button (first indication of a breakdown), may hear the user say "oops, that was wrong" (second indication), and may then see that the user undoes the erroneous action (third indication). Thus, another step that is needed before comparisons between evaluators' results can usefully be made is to group breakdown indications that refer to the same occurrence of a breakdown.

3.2 Measuring Evaluator Effects
For measuring the evaluator effect, several measures can be used. Hertzum and Jacobsen (2001) discuss three popular measures. These are: the detection rate, Cohen's kappa (Cohen, 1960) and the any-two agreement measure.

Hertzum and Jacobsen (2001) define the detection rate as the average of |P_i| / |P_all| over all n evaluators. In this equation, P_i is the set of problems detected by evaluator i and P_all is the set of problems collectively detected by all n evaluators. The detection rate suffers from the drawback that its minimum value varies with the number of evaluators. Moreover, the detection rate rests on the assumption that the number of breakdowns collectively found by the evaluators is identical to the total number of
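The paper treats the grouping of indications as an evaluator judgement and gives no algorithm for it. Purely as an illustration, the following sketch groups indications whose time codes lie close together, on the assumption that indications occurring within a few seconds of each other often refer to the same breakdown occurrence; a real analysis would still need manual checking of each group.

def group_indications(indications, max_gap_seconds=5.0):
    """Illustrative grouping: indications close together in time are treated as
    referring to the same breakdown occurrence. 'indications' is a list of
    (time_in_seconds, code) tuples."""
    groups = []
    current = []
    last_time = None
    for time_s, code in sorted(indications):
        if last_time is not None and time_s - last_time > max_gap_seconds:
            groups.append(current)
            current = []
        current.append((time_s, code))
        last_time = time_s
    if current:
        groups.append(current)
    return groups

# Example: an erroneous selection, an "oops" utterance and an undo within a few
# seconds end up in one group, i.e. one occurrence of a breakdown.
print(group_indications([(61.0, "ACT"), (63.5, "DSF"), (65.0, "CORR"), (120.0, "PUZZ")]))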
own indication type "GOAL" was improved and the problem did not occur again.

Most likely, doing a systematic and detailed data analysis with automated logging would reduce evaluator effects caused by inaccuracies of the evaluator.

5 Discussion and Conclusion

The results of the present studies show that the evaluator effect is a phenomenon that is found in various system domains and with systems of varying complexity. Insights were gained into possible causes of the evaluator effect. Identified causes lay in differences in interpretations of verbal utterances and non-verbal behaviour, in guessing user intentions, in decisions regarding how problematic inefficiencies in the interaction are, in distinguishing usability problems from problems introduced by the test set-up itself, and in inaccuracies in doing the analysis. Suggested strategies for managing the evaluator effect are: (1) conducting a systematic and detailed data analysis with automated data logging, to minimise errors in logging and in making transcriptions; (2) discussing with other evaluators the specific problems about which an evaluator is unsure (e.g. in case of inefficiencies in interactions, or problems that might have been caused by the test set-up itself); and (3) having the analysis done by multiple evaluators so that multiple views on user intentions can be gathered.

References

Cohen, J. (1960). A Coefficient of agreement for nominal scales. Educational and Psychological
pically, subjects reduced the volume by repeatedly pressing the minus button and then pressing the plus button once to raise the volume a little. Evaluators disagreed on whether this signified an overshoot in reducing the volume or a deliberate way of searching for a convenient volume level.

On a more global level, there were two cases in which the intention behind a series of button presses was unclear. For example, in one case during study three, a subject needed a menu for scheduling the video recorder. The subject systematically went through all available menus, finally arriving at the needed menu. It was unclear whether the user was inspecting each menu to find out whether it was the menu she needed, or whether she was just exploring all menus to learn about them for later use. If an evaluator thinks that a wrong menu was opened because the user expected it to be the correct menu for the task, this should be counted as a problem. However, if a menu is opened just to learn from it, it should not be counted as a problem.

In all cases within this category, differences between evaluators' lists of UPTs seem to have been caused by evaluators having different views on the specific intentions of the users. For this category of differences, involving multiple evaluators seems a useful way of getting new or complementary views on the same data. Different views on intentions can coexist in such cases, as deciding which view is valid is seriously
re made of the children and their interaction with the product. The test was conducted in the context of a study on assessing usability evaluation methods with children as subjects (van Kesteren, 2003).

2.2 Study 2: Thermostat
In study 2, a programmable home thermostat (Honeywell Chronotherm II; see figure 1b) was tested by five subjects (Vermeeren, 1999). None of the subjects had any previous experience with using a programmable thermostat.

Figure 2: General format for DEVAN's interaction overview table (columns: time stamp, log, interaction segments, context, breakdown indications). Legend: (1) column for logging user-product interaction; (2) primary boundary, indicating the start of a new interaction segment; (3) secondary boundary, indicating the possible start of a new interaction segment; (4) column for definitive interaction segment boundaries and descriptions; (5) column for task descriptions and descriptions of intermediate-level episodes; (6) column for breakdown indication type codes; (7) event marked as breakdown indication.

The data from one of thes
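The primary and secondary boundaries in the legend above result from the preliminary segmentation step of stage one, which is based on threshold pause times between actions. The sketch below is a minimal, assumed implementation of that idea; the two thresholds are illustrative values, as the paper does not specify them.

def mark_segment_boundaries(action_times, primary_pause=5.0, secondary_pause=2.0):
    """Return a list of (time, boundary) pairs, where boundary is 'primary',
    'secondary' or None, based on the pause preceding each logged action.
    action_times is a sorted list of action time stamps in seconds."""
    boundaries = []
    previous = None
    for t in action_times:
        pause = None if previous is None else t - previous
        if pause is None or pause >= primary_pause:
            boundaries.append((t, "primary"))    # likely start of a new interaction segment
        elif pause >= secondary_pause:
            boundaries.append((t, "secondary"))  # possible start of a new segment
        else:
            boundaries.append((t, None))
        previous = t
    return boundaries

# Example: a 12-second pause suggests a new segment, a 3-second pause a possible one.
print(mark_segment_boundaries([0.0, 1.0, 4.0, 16.0, 17.5]))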
sat behind a table with the remote control of the product, a user manual, a quick-reference card and a TV guide on it. The experimenter sat next to the subject. All sessions were recorded on video, showing the handling of the remote control as well as the TV in front view.

3 Data Analysis

In studies two and three, the tool DEVAN (Vermeeren et al., 2002) was used for analysing the data. In study one, only DEVAN's checklist of breakdown indication types was used. Below, the most important characteristics of DEVAN are described.

3.1 The DEVAN Tool
DEVAN (Vermeeren et al., 2002) was developed as a tool for structured and detailed analysis of video data from user tests of interactive systems. It provides clear and detailed data analysis procedures and detailed criteria for what constitutes a breakdown in an interaction. Interactions are transcribed into a specified format that assists evaluators in making explicit how they interpreted interactions (figure 2). Additionally, DEVAN provides a checklist that facilitates detecting breakdowns in interactions (table 1), as well as a format for describing the breakdowns. Moreover, it provides a procedural description of the data analysis process. Two main stages are distinguished, consisting of three and two sub-stages, respectively.

Stage one: creating a table that represents the interaction at multiple levels of abstraction:
- logging and transcribing actions, verbal utterances and non-verbal behav
sis. Below, these categories are discussed in more detail.

Interpreting verbal utterances and non-verbal behaviour. For 14 out of the 37 differences, the only indications used to detect the UPTs were verbal utterances or non-verbal behaviour. In nine of these cases, only one evaluator had recorded the utterance or non-verbal behaviour. Missed events included, for example, frowning, a subject visually scanning an interface panel, and verbal utterances that indicate "puzzlement". In the case of the other five UPTs, both evaluators had transcribed the utterance or behaviour, but differed in what they had heard or decided differently on whether it indicated a problem or not.

Most of the cases mentioned above concerned utterances or behaviour that indicated "puzzlement" (code PUZZ in DEVAN's checklist; see table 1). It seems that especially for this type of indication it is difficult to interpret whether the observed event is significant enough to record or to interpret as an indication of a breakdown.

A closer look at the 14 UPTs suggests that this problem may be less disconcerting than the figures seem to indicate. In the case of eight out of the 14 UPTs, closely related UPTs (of a different level of abstraction) were found in addition to the indication of puzzlement. For example, one of the UPTs concerned puzzlement about how to use the timer menu for scheduling the video recorder to record something "tomorrow". After having expressed her
ssumption. It assumes that the total number of breakdowns is known (or that it can reliably be estimated). For small numbers of evaluators this typically is not the case. Therefore, Hertzum and Jacobsen (2001) suggest using the any-two agreement measure in such cases.

The any-two agreement measure is defined as the average of |P_i ∩ P_j| / |P_i ∪ P_j| over all n(n-1)/2 pairs of evaluators. In this equation, P_i and P_j are the sets of problems detected by evaluator i and evaluator j, and n is the number of evaluators. For the studies reported here, the any-two agreement measure is used.

As one of the units of comparison, the number of breakdowns is used. In section 3.1 it was explained how evaluators can use DEVAN to create lists of occurrences of breakdowns in interactions. In studies two and three, DEVAN was used to its full extent; in study one, evaluators only used DEVAN's checklist of breakdown indication types (which is somewhat comparable to Jacobsen et al.'s (1998) list of usability criteria), as well as a specified format for reporting the breakdowns (comparable to that of Jacobsen et al., 1998).

In Jacobsen et al.'s (1998) study, Unique Problem Tokens (UPTs) were chosen as units of comparison. In their study, each evaluator's problem report (problem reports are comparable to breakdown descriptions in DEVAN) was examined as to whether it was unique or duplicated. Thus, a final list of UPTs was created. To be able to compare
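Both measures can be computed directly from the evaluators' problem sets. The following sketch implements the detection rate and the any-two agreement as defined above; the set-based representation and function names are our own.

from itertools import combinations

def detection_rate(problem_sets):
    """Average of |P_i| / |P_all| over all evaluators, where P_all is the union
    of all problems found (assumed to equal the total number of breakdowns)."""
    p_all = set().union(*problem_sets)
    return sum(len(p) for p in problem_sets) / (len(problem_sets) * len(p_all))

def any_two_agreement(problem_sets):
    """Average of |P_i ∩ P_j| / |P_i ∪ P_j| over all n(n-1)/2 pairs of evaluators."""
    pairs = list(combinations(problem_sets, 2))
    return sum(len(p & q) / len(p | q) for p, q in pairs) / len(pairs)

# Two evaluators who share 8 of the 15 problems they collectively found
# (as for Jammin' Draw user A in Table 2) have an any-two agreement of 8/15 = 53%.
p1 = set(range(12))      # hypothetical problem identifiers
p2 = set(range(4, 15))   # overlap of 8, union of 15
print(round(any_two_agreement([p1, p2]) * 100))  # -> 53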
that in only four out of the 14 cases, differences in the evaluators' lists of UPTs contained unique and potentially important information. On the other hand, it should be realised that in the present studies a very thorough analysis was done. It is very likely that in a less explicit analysis, with less systematic transcription of utterances and lacking an extensive checklist of breakdown indication types, many more differences may be expected in this category of causes. Hertzum and Jacobsen (2001) suggest using multiple evaluators to gain better insights into what usability problems may occur in an interaction. It seems that for this category of differences this approach would not be very efficient; doing a detailed and explicit data analysis would probably be more efficient.

Guessing user intentions. Five out of the 37 differences in UPTs related to problems in guessing the users' intentions. This concerned two levels of intentions: intentions at the level of individual button presses and intentions at the level of sequences of button presses. In three out of the five cases, the intention behind a sequence of button presses was clear, whereas the intention behind one of the button presses was not. This led to disagreement between evaluators on whether the single button press indicated a problem or not. For example, in study three, one UPT concerned reducing the TV's sound volume with the "minus" button on the remote control. Ty
the present studies' evaluator effects to the evaluator effect Jacobsen et al. (1998) found, duplicate breakdowns were filtered out. Breakdowns had to be similar in much detail (in content as well as in level of abstraction) to be considered duplicates. For example, the breakdown "user is puzzled about how to get a still image" is considered different from the more concrete, but similar, breakdown "user tries <stop> to get a still image". Also, in its detailed content, the breakdown "user uses <cursor down> instead of <cursor right> while trying to set a menu item" is different from "user uses <cursor down> instead of <cursor left> while trying to set a menu item". Thus these are regarded as unique breakdowns.

4 Results

Table 2 shows the evaluator effects that were found in the three studies, as well as the evaluator effect found by Jacobsen et al. (1998). The evaluator effects were measured in terms of occurrences of breakdowns as well as in terms of UPTs. Of the three studies reported in this paper, the data analysis process of study one (the Jammin' Draw study) compares best to that of Jacobsen et al. The study seems to yield somewhat higher agreements than
    