
User's Manual - School of Computer Science



12. Open the visualization plugins dropdown and see that there are three available by default. Note: Two of these visualization plugins rely on your data having timestamps in one column. As our test data does not, we will only be able to experiment with the third, the pie chart visualization.
13. Click PieChartVisualization to switch views.
14. Select one of your annotations in the dropdown menu on the far left-hand side. This will show you the relative frequency of each label type.
15. Return to the annotation tab. Close the Segmentation & Annotation window to return to the launch panel.

Lesson 3: Creating a new annotation scheme for your data

In this lesson, we'll add a new annotation scheme to represent a new type of information we're interested in predicting or using as evidence.

To add an annotation scheme for a UIMA file:

1. Open SIDE, or if SIDE is already running, navigate to the launch panel.
2. Click the Segmentation & Annotation button.
3. Click the load file button in the top left corner.
4. In the file browser, select the file to edit. We will again be using ES2012a.csv.xmi.
5. Once the data loads, click the new annotation scheme button. A popup window will appear.
6. Select the segmentation type that you want. It is likely that you will want native, which preserves the segmentation from your CSV file.
7. Give your new annotation scheme a name. This should rep
SIDE: The Summarization IDE
Elijah Mayfield, Carolyn Penstein Rosé
Fall 2010

SIDE: The Summarization IDE, 2010, Carnegie Mellon University. User Manual.

Co-authors can be contacted at the following addresses:
Elijah Mayfield: elijah@cmu.edu
Carolyn Rosé: cprose@cs.cmu.edu

Work related to this project was funded through the Pittsburgh Science of Learning Center, the Office of Naval Research Cognitive and Neural Sciences Division, the National Science Foundation, Carnegie Mellon University, and others. Special thanks to Moonyoung Kang, Sourish Chaudhuri, Yi-Chia Wang, Mahesh Joshi, and Eric Rosé for collaboration and contributions to past versions of SIDE.

SIDE is released under the GPL, version 3. The GNU General Public License is a free, copyleft license for software and other kinds of works. This manual is released under the GFDL, version 1.3. The GNU Free Documentation License is a form of copyleft intended for use on a manual, textbook, or other document to assure everyone the effective freedom to copy and redistribute it, with or without modifications, either commercially or non-commercially. These licenses are available in full at http://www.gnu.org/licenses.

Table of Contents

1. SIDE: The Summarization Integrated Development Environment
2. Installation and Setup
3. Using SIDE: Text Annotation
   Lesson 1: Converting your data to SIDE's format for the first time
   Lesson 2: Analyzing annotated
s, etc. More information on how to use regular expressions is available at download.oracle.com/javase/tutorial/essential/regex.
19. To search for a pattern more complex than an n-gram within a document, switch to the Regex Search tab at the top of the window.
20. Enter your regular expression into the Search for text box, following Java regex syntax.
21. To only find whole words that match your regex, rather than finding subsequences within words, check the Match whole words only checkbox.
22. If you want your search to be case sensitive, click the Make search case sensitive checkbox.
23. To create a feature based on your regular expression, click the Add Search button.

To use information from other annotations of your data:

It will occasionally be useful to search for specific information about your data based on other annotations of the data. Note that this will require all of your data, from this point forward, to be annotated with exactly the same layers if you use these features (just as using metafeatures in general requires), which is restrictive for fully automatic systems. Currently, these prior annotations are slow to evaluate in SIDE; we intend to improve efficiency in the future.

24. Switch to the Prior Annotations tab at the top of the window.
25. Select the annotation that you are interested in from the left list, which will produce the set of labels from that annotat
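The whole-word and case-sensitivity options above map directly onto regular-expression constructs. Here is a minimal sketch using Python's `re` module, whose `\b` word boundary and case-insensitive matching behave the same way as in the Java regex syntax that SIDE expects; the sample sentence is invented for illustration.

```python
import re

text = "The cat scattered the catalog"

# A plain search for "cat" also matches inside "scattered" and "catalog".
plain = re.findall(r"cat", text)

# "Match whole words only" corresponds to wrapping the pattern in word
# boundaries; \b means the same thing in Java regex syntax.
whole = re.findall(r"\bcat\b", text)

# Leaving "Make search case sensitive" unchecked corresponds to a
# case-insensitive match.
loose = re.findall(r"\bthe\b", text, flags=re.IGNORECASE)

print(plain)  # ['cat', 'cat', 'cat']
print(whole)  # ['cat']
print(loose)  # ['The', 'the']
```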
Figure 10: Numeric error analysis in SIDE. (The Feature Table & Model Building window's feature analyzer, showing a free-threshold table with an error allowance of 10 percent, absolute. Two tables, Predictions under actual value and Segments within error threshold, list ID, Predicted, Actual, Error, and Text columns, each with an Export to CSV button.)

15. If you are interested in experimenting with your data outside of SIDE, you can click the Export to CSV buttons to export the tables in their respective windows to an external CSV format.
16. Continue to experiment with different cells and different comparisons to get a better understanding of the ways in which your model failed. Once you are done, you can close the Feature Table & Model Building window.

To study a numeric prediction result:

SIDE does offer support for numeric prediction, such as linear regression tasks. The error analysis interface for these problems is different, because there is no concept of a confusion matr
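The threshold tables described above can be reproduced outside of SIDE once predictions are exported to CSV. A minimal sketch, with made-up predicted/actual values and assuming that "error allowed" means an absolute-difference cutoff:

```python
# Hypothetical predicted/actual pairs, like the rows in Figure 10.
segments = [
    {"id": 0, "predicted": 0.21, "actual": 0.79},
    {"id": 44, "predicted": 1.97, "actual": 1.90},
    {"id": 45, "predicted": 4.71, "actual": 3.21},
]

threshold = 0.10  # "error allowed: 10 percent, absolute" (our interpretation)

# Split segments into those within the error threshold and those outside it.
within = [s["id"] for s in segments if abs(s["predicted"] - s["actual"]) <= threshold]
outside = [s["id"] for s in segments if abs(s["predicted"] - s["actual"]) > threshold]

print(within)   # [44]
print(outside)  # [0, 45]
```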
WekaPlugin implementing this interface by default.

FEPlugin: This interface takes as input a list of documents and converts those documents to a feature table. A plugin implementing FEPlugin must be able to take as input a list of strings and convert them to a map of feature names to numbers. SIDE comes with the TagHelperExtractor and DefinedFeatureExtractor implementing this interface by default.

FeatureTableConsumer: This interface converts a feature table to some external representation of your data, which can be used by other programs. SIDE comes with consumers which convert feature tables to HTML and ARFF formats.

DocumentReaderPlugin: This interface converts external documents into the UIMA format that SIDE uses internally to store documents. A plugin implementing this interface must be able to produce .xmi UIMA files from some file input. SIDE comes with document readers which read plaintext, CSV, and DeXML files by default.

EMPlugin: These plugins must produce an ordered list of documents based on some evaluation metric. This is used for summarization, where a way of determining which segments to include must exist. SIDE comes with a length counter and a TF-IDF plugin by default.

SegmenterPlugin: This plugin must be able to take as input a string containing all the text of a data set and produce a list of indices at which to split this text into segments. In addition to native segmentation, each li
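To make the SegmenterPlugin contract concrete (a string in, a list of split indices out), here is an illustrative sketch. SIDE's real plugin interface is Java; this Python version, and the choice of newline as the delimiter, are assumptions for illustration only.

```python
def segment_indices(text, delimiter="\n"):
    """Return the indices at which to split `text` into segments,
    in the spirit of SIDE's SegmenterPlugin contract (a sketch,
    not the actual Java interface)."""
    indices = [0]
    for i, ch in enumerate(text):
        if ch == delimiter:
            indices.append(i + 1)  # a new segment starts after the delimiter
    return indices

text = "hello\nworld\nagain"
idx = segment_indices(text)
# Recover the segments from the indices, trimming trailing delimiters.
segments = [text[a:b].rstrip("\n") for a, b in zip(idx, idx[1:] + [len(text)])]

print(idx)       # [0, 6, 12]
print(segments)  # ['hello', 'world', 'again']
```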
a trained classifier may take a while, particularly for large files.
10. Once this finishes, the new annotation scheme will be in view automatically. Note the new item in the interface: a partially filled bar on the right side of each segment. This shows the classifier's confidence in each prediction.
11. Click save in the bottom left of the window if you wish to use this annotation in the future.

5. Using SIDE: Summarization

At this point, you have a filter that can be used to segment and annotate files. Once a file has gone through this process, that imposed structure can be used in the process of generating a summary, so that you can retrieve just those parts of the document that are relevant for the summary that you want.

The first step is to define the selection criteria which will be used to extract a subset of segments from the text. The simplest model of this criteria is annotated in the AMI meeting corpus data that SIDE comes with. It contains a layer with a Yes/No annotation, Summary, where the segments that should be included in a summary are marked as Yes. There is ample room for adding additional layers, however, by automatically identifying segments based on topic or rhetorical structure, for instance.

The selection criteria is specified in terms of annotations that have been defined. If you refer to a filter in the selection criteria, that filter will be applied to the text. Then the con
data within SIDE's interface
   Lesson 3: Creating a new annotation scheme for your data
4. Using SIDE: Machine Learning
   Lesson 4: Building a feature table to represent your data set
   Lesson 5: Training an automatic classifier using machine learning
   Lesson 6: Performing error analysis on your trained model
   Lesson 7: Defining more complex features for your data by hand
   Lesson 8: Automatically annotating your data with your model
5. Using SIDE: Summarization
   Lesson 9: Defining a recipe for summarization of your data
   Lesson 10: Generating an extractive summary using a recipe
6. Extending SIDE: Writing Plugins

1. SIDE: The Summarization Integrated Development Environment

SIDE, the Summarization IDE, was developed as an infrastructure for facilitating researchers in the task of getting a quick sense of text data. The use of machine learning models can facilitate study of a large set of data by highlighting key features and differentiating the significant factors from noise. However, this work of understanding the data being studied cannot and should not be a fully automatic process. It is more desirable to have a human researcher in the loop, receiving filtered information about the data they are working with, and using human judgment to make the final decision about how to utilize this information.

The SIDE framework offers flexibility in the specifi
Figure 4: Creating an annotation scheme. (Screenshot of the Segmentation & Annotation window: an AMI meeting transcript in the main pane, with controls for adding and removing labels, clearing one or all annotations, choosing a model and segmentation, annotating empty segments only, annotating using a model, exporting to CSV, and saving.)
ean operators to additional filters.
6. Select the built-in TF-IDF plugin from the Evaluation Metric dropdown menu, for the segments that best fit the conditions you have specified in your recipe.
7. Select from the Order dropdown the type of results you wish to receive.
8. In the where n text area, enter the number of results you wish to receive, or the percent of items in the set if you wish to receive a variable number based on the size of the document.
9. To receive results in their original order, or by highest rank, check or uncheck the restore original order checkbox.
10. Fill in the Recipe name text area with a name.
11. Click the create recipe button, and the recipe object will be generated for your use.

To save/load recipes for repeated use:

12. To save this recipe for future use, right-click its name in the summary recipe list and select save.
13. Right-clicking in the summary recipe list and selecting load recipe will allow you to use an existing recipe.
14. At this time, click the summary tab at the top of this window and proceed to Lesson 10.

Lesson 10: Generating a summarization of a text using a recipe

To generate a summary:

1. Before beginning this lesson, complete Lesson 9, or make sure you have loaded a summary recipe in the Summary Panel window.
2. Click add to choose which data to summarize.
3. In the segmentation menu,
to add to your feature space, click ok at the bottom of the window.

Lesson 8: Automatically annotating your data with your model

To automatically annotate a UIMA file:

1. Before beginning this lesson, make sure you have loaded a model in the Feature Table & Model Building window.
2. Click the Segmentation & Annotation button.
3. Click the load file button and select a file to annotate automatically. Open a file which does not currently have the annotation that you are interested in.
4. If there are no current annotations of the text, the right side of the screen will be blank. If this is the case, you must first segment the text by clicking the new annotation scheme button. This will open a popup; choose native segmentation.
5. In the model dropdown menu in the bottom left, select the model to use for annotation.
6. In the segmentation dropdown, ensure that use current segmentation is checked.
7. Give the new scheme a meaningful name in the annotation name text area.
8. Click the annotate using model button. This

Figure 14: Automatic annotation of text using (caption truncated; the screenshot shows the Segmentation & Annotation window with the model and segmentation dropdowns, the annotate empty segments only option, the annotation name field set to Summary Auto, and the annotate using model button.)
Figure 7: Training a classifier using machine learning in SIDE. (The training results pane shows Weka output, including per-feature information scores in bits, SMO classifier parameters such as C 1.0 and L 0.0010, and TagHelperExtractor feature names.)

9. Give a name to the model you want to build in the New model name field.
10. Click the train model button to build a model. Note: Machine learning can be slow. This is especially true of large data sets or feature tables, or complex algorithms.
11. Once your model is built, information about your model will appear on the right window.
12. To focus on specific information about your model, such as the weights themselves for an SVM-style model, or the confusion matrix for preliminary error analysis, click the dropdown menu on the top of the right window.

To save/load models for repeated use of a classifier:

13. Right-click the model name in the list of models menu and choose save.
14. To load a classifier that was previously saved, you can right-click in the list of models and select load training result.
15. At this time, leave the window open and proceed to Lesson 6.

Lesson 6: Performing error an
Figure 1: The launch panel and main menu of SIDE. (Buttons shown include Feature Table & Model Building and Summary Panel.)

to define how a summary can be built: by specifying filters that will be used to apply structure to a document first, and then either specifying how visualizations can display patterns of annotations, or how annotations can be used to find desired portions of a document.

For a running example, this tutorial will use example conversations from the AMI Meeting Corpus. A few of these files are available in our distribution of SIDE, including all of those used in this manual. The entirety of the corpus is available online for free. In these conversations, four participants discuss the design of a new TV remote control. The versions we distribute come annotated for dialogue act tagging and extractive summarization.

Lesson 1: Converting your data to SIDE's format for the first time

In order to provide maximum clarity for how to perform a particular task, these lessons will be giving very basic, step-by-step instructions on how to perform a task. While doing this, it is important not to lose sight of the big picture. You should not consider this process as merely a set of steps to be repeated, but an interactive, flexible, adapting understanding of the nature of your data. Before you can do this, however, you need to understand the basics of the user interface.

To convert a CSV file into UIMA format:

1. O
Thus, machine learning models can be used to assign status tags to individual sentences, and thus impose a structure on what initially looked, at the surface, to be unstructured. That structure can then be used in retrieving certain portions of the argument that might be of interest. For example, perhaps only the supporting arguments are of interest. It would not be possible to summarize the argument by pulling out these portions without first imposing the structure on the text that distinguishes between those three types of sentences.

Conceptually, then, the use of SIDE has two main parts. The first part is to construct filters that can impose structure on the texts that are to be summarized, and the second part is constructing specifications of summaries that refer to that structure and extract subsets of text, or display visualizations of that structure.

To train the system and create a model, the user first has to define a filter. Filters are trained using machine learning technology. As we have stated previously, two customization options are available to analysts. The first is the selection of the machine learning algorithm that will be used. Dozens of options are made available through the Weka toolkit, but some are more commonly used than others. The three options that are most recommended to analysts starting out with machine learning are Naive Bayes, which is a probabilistic model; SMO, which is Weka's implementation of Supp
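To give a feel for the Naive Bayes option mentioned above, here is a toy probabilistic text classifier with add-one smoothing. This is not SIDE's (Weka's) implementation; the four training documents and their labels are invented for illustration.

```python
import math
from collections import Counter

# Tiny invented training set: (label, document) pairs.
train = [
    ("pos", "good great fun"),
    ("pos", "great acting good"),
    ("neg", "boring bad plot"),
    ("neg", "bad boring acting"),
]

# Class priors and per-class word counts.
class_docs = Counter(label for label, _ in train)
word_counts = {"pos": Counter(), "neg": Counter()}
for label, doc in train:
    word_counts[label].update(doc.split())

vocab = set(w for _, d in train for w in d.split())

def predict(doc):
    scores = {}
    for label in word_counts:
        # log prior + sum of smoothed log likelihoods (add-one smoothing)
        score = math.log(class_docs[label] / len(train))
        total = sum(word_counts[label].values())
        for w in doc.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("good fun plot"))  # pos
print(predict("boring bad"))     # neg
```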
Figure 8: High-level classifier information in SIDE. (Normalized weights for TagHelperExtractor n-gram features such as _because, _before, _begin, _best, _big, and _bit.)

To study a machine learning model at a high level:

1. Before beginning this lesson, complete Lesson 5, or make sure the Feature Table & Model Building window is open, with a trained model loaded in the Machine Learning tab.
2. The dropdown menu at the top of the training results page has many options for examination. Open it and click the model option. This will give you the model that your classifier is actually using.

Note: Throughout the process described below, it is important to keep coming back to this model. If you make conclusions based on differences in the data, those conclusions must also be backed up by the actual decision-making structure of your model. For instance, a decision tree must actually make use of a feature at some point reasonably high up in the tree in order for that feature to be important. A linear SVM model must be giving weight to a feature in order for that feature to af
fect the classificati
alysis on an annotation model

In an insightful process of applied machine learning, a practitioner will design an approach that takes into account what is known about the structure of the data that is being modeled. However, typically that knowledge is incomplete, and thus there is always a good chance that the decisions that are made along the way are suboptimal. When the approach is evaluated, it is possible to determine, based on the proportion and types of errors, whether the performance is acceptable for the application or not. And if it's not, then the practitioner should engage in an error analysis process to determine what went wrong and what could be done to better model the structure in the data.

Two common ways to approach an error analysis are top-down, starting with the learned model, or bottom-up, starting with the confusion matrix. In the first case, the model is examined to find the attributes that are treated as most important in the model. These are the attributes that have the biggest influence on the predictions made by the learned model, and thus these attributes provide a good starting point. In the second case, the bottom-up case, one first examines the confusion matrix to identify large off-diagonal cells, which represent common confusions. Consider the sets in a confusion matrix (a table of predicted versus actual labels, e.g. predicted Yes and actually Yes, predicted Yes but actually No, and so on). The error analysis is then the process of determin
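The bottom-up route described above starts from the off-diagonal cells of the confusion matrix. A small sketch, with invented counts, for a binary Yes/No annotation:

```python
# Confusion matrix for a binary Yes/No annotation; keys are
# (actual, predicted) and the counts are made up for illustration.
matrix = {
    ("Yes", "Yes"): 40, ("Yes", "No"): 12,
    ("No", "Yes"): 7,   ("No", "No"): 80,
}

# Off-diagonal cells are the confusions worth examining first.
confusions = {(a, p): n for (a, p), n in matrix.items() if a != p}
worst = max(confusions, key=confusions.get)

print(confusions)  # {('Yes', 'No'): 12, ('No', 'Yes'): 7}
print(worst)       # ('Yes', 'No')
```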
Figure 16: Generating a summary automatically in SIDE. (The Summary Panel showing an extracted summary of the meeting transcript, with the visualization dropdown set to PieChartVisualizationPlugin and summarize and visualize buttons.)

will be shown in the main panel of this window.
8. Choose a visualization plugin and click the visualize button to produce a graphical summary of your data. The summary is also produced in plaintext in the terminal window where SIDE was opened.

6. Extending SIDE: Writing Plugins

SIDE is designed to be a fully extendable framework for machine learning and summarization. Thus, for each of the major functionalities of the program, there is a common interface that must be implemented, which has an expected output. Listed below are those plugin interfaces, along with brief descriptions of how they are used within the SIDE workflow.

MLAPlugin: This interface builds a classification model given a feature table representing a set of documents. A plugin implementing MLAPlugin must be able to build a model based on a feature table of training data, and apply that model to predict the labels of a given unlabeled dataset. SIDE comes with the
choose the segmentation option that you want to use; if you want to use segments as defined in your original file, choose native as your segmentation option.
4. Select, in the summary recipe menu, which recipe you will use for summarization.
5. Name your summary in the summary name field.
6. Click create summary object to add a summary object to the summary list in the bottom left corner of the panel.
7. Click summarize to automatically annotate your document using the model in the recipe, and give a list of the segments that you chose in your recipe as the most important. These results
combine features using a boolean operator, select them all in the middle window by clicking them while holding down Shift or Control.

Figure 12: Regular expression and sequencing-based feature creation. (The config window's Regex Search tab, with Match whole words only and Make search case sensitive checkboxes; a sequencing option to find each result of this search within n turns before one or more of the results of this search; Combine Selected With AND, OR, XOR, NOT, and Sequencing buttons; and Convert To Features, Clear Feature List, and Delete Selected Feature buttons.)

13. Once all the features that you want to use are highlighted, click one of the boolean operators, AND, OR, XOR, NOT, and a new feature will be created that is a combination of those features with that operator.

To define a sequencing-based feature:

If your data set is conversational, that is, each document is meant to follow the previous document temporally, then it may be interesting to note sequences of words occurring throughout the conversation and the order in which they occur. You may want to look for patterns which will give a feature representation of these sequencing criteria. The Sequencing button next to the boolean buttons gives you that option.

14. To create a sequencing feature, highlight exactly two features in the m
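A sequencing feature of the kind described above fires when one pattern occurs within n turns before another. An illustrative Python sketch, not SIDE's implementation; the turns and search words are invented:

```python
def sequencing_hits(turns, first, second, n):
    """Indices of turns containing the word `second` where the word
    `first` occurred within the n turns immediately before (a sketch
    of the idea, not SIDE's actual sequencing feature)."""
    hits = []
    for i, turn in enumerate(turns):
        words = turn.split()
        window = turns[max(0, i - n):i]  # the n previous turns
        if second in words and any(first in prev.split() for prev in window):
            hits.append(i)
    return hits

turns = [
    "shall we start",
    "any questions so far",
    "yes I have a question",
    "let us move on",
]
print(sequencing_hits(turns, "questions", "question", 2))  # [2]
```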
ditions placed on the filter will be applied. What that means is that you can indicate that a subset of the annotations that can be applied to the filter indicate that the corresponding segments should be selected. Using a combination of boolean operators, you can create arbitrarily complex selection criteria involving multiple filters.

Lesson 9: Defining a recipe for summarization of your data

To write a recipe for summarization:

1. Before beginning this lesson, complete Lesson 5, or make sure SIDE is opened to the launch panel and you have loaded a model in the Feature Table & Model Building window.
2. Click the Summary Panel button.
3. In the Expression builder menu, click the black triangle and select is, and then the model that you want to get a label from.
4. Click the < button in this recipe and select the label that you want to use as a filter for this summary.
5. If you would like to include more categories, you can click the green icon to include bool

Figure 15: Defining a summary recipe in SIDE. (The Summary Panel's recipe tab, with an Expression Builder, an Evaluation Metric dropdown set to the TF-IDF plugin, and an Order dropdown.)

Note: When entering percentages, input the number as a decimal; for instance, type 0.35 for 35%.

9. To choose whether to order the resulting
e particularly good examples of argumentation quality to draw attention to as an example for struggling students. Alternatively, an instructor may want to glean a list of questions posed by students during a collaborative learning discussion, in order to gauge which topics are confusing for students. Another possibility would be to monitor discussions related to negotiations over issues such as design discussions, where trade-offs, such as power output versus environmental friendliness, are being made with respect to the design of a power plant. In that case, it might be interesting to classify conversational contributions as indicating a bias towards environmental friendliness or, alternatively, high power output, so that it is possible to display how preferences ebb and flow in the course of the discussion.

2. Installation and Setup

Checking your Java VM

In order to use SIDE, your computer must have a Java Virtual Machine installed, with support for at least Java 6. As Java is platform-independent, this means that almost any system should be capable of running SIDE; however, you must first ensure that you have the appropriate JVM installed. Below you will find instructions for checking your JVM on Windows XP, Mac OS X v10.5, and Fedora Core 11 Linux. Other operating systems should follow a similar general process.

Windows XP:
1. Click Start, then Run.
2. In the Run dialog, type cmd and click OK.
3. Type java -version
e a useful distinguishing characteristic.

Stemming: Stemming is a technique for removing inflections from words, in order to allow some forms of generalization across lexical items; for example, the words stable, stability, and stabilization all have the same lexical root.

Once the reader has grasped these concepts, then it is not much of a stretch to consider that defining a filter has four steps: creating annotated files with user-defined annotations, choosing features to use for machine learning, choosing evaluation metrics, and choosing a classifier to train the system. Without a foundation of concepts in machine learning, these notions will likely sound very foreign.

Figure 2 shows the arrangement of buttons on the SIDE main menu, which follows the logical process of using SIDE. The first button allows the user to input files, either pre-annotated or plain text, and convert them to the internal format, referred to as UIMA. The Segmentation & Annotation interface enables users to define coding schemes that can be applied either by hand, using the Annotation interface to place a structure on a document, or automatically, with a filter that can be defined using the Feature Table & Model Building interface. Structured files that result from annotation, either by hand or automatically, can be summarized. The Summary Panel allows the user

(Screenshot: the SIDE main menu, with buttons including Convert to UIMA and Segmentation & Annotation.)
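The stemming idea above (stable, stability, and stabilization sharing one root) can be illustrated with a deliberately crude suffix stripper. The suffix list below was chosen only to handle these three example words; SIDE's actual stemmer is more general.

```python
# A deliberately crude stemmer: strip the first matching suffix.
# The suffix list is an invented toy, tuned only to the three
# example words from the manual.
SUFFIXES = ["ilization", "ility", "le"]

def crude_stem(word):
    for suf in SUFFIXES:  # longest suffixes listed first
        if word.endswith(suf):
            return word[: -len(suf)]
    return word

words = ["stable", "stability", "stabilization"]
print([crude_stem(w) for w in words])  # ['stab', 'stab', 'stab']
print(crude_stem("device"))            # device (unchanged)
```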
ent prior to conversion. In the future, we will build a way of handling Unicode characters into SIDE, but in the current release, conversion of a Unicode file will fail. Many text editors contain an option to zap gremlins, which will remove these intruding characters for you automatically.

2. SIDE requires that every entry in your data has a value for every column. If there are some cells in your table which are blank, SIDE's conversion will fail. A simple solution to this problem is to simply fill in those spaces with a label, Unknown or Blank, before using the file in SIDE.

3. Remember that in .csv files, commas separate columns. However, in text content, commas are often included in the transcription. Ensure that your text column is quoted if it contains commas, or the conversion will fail.

Can I use non-text data sets in SIDE?

Yes. In order to use a non-text dataset with SIDE, to make use of its error analysis or defined feature functionality, simply uncheck the File contains text data check box. In other windows, the text content of your files will show up as meaningless numbers. This can be safely ignored, and it will not be processed by the machine learning algorithms later on in the process of using SIDE.

Lesson 2: Analyzing annotated data within SIDE's interface

Our example files are meeting transcripts from the AMI Meeting Corpus. They are distributed with SIDE with three annotations intact, though
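Point 3 above, quoting the text column, is easy to get right with a CSV library instead of writing rows by hand. A sketch in Python; the column names are invented, and QUOTE_ALL forces quoting so embedded commas survive:

```python
import csv
import io

# Every row has a value in every column ("Unknown" instead of a blank
# cell), per point 2 above. Column names here are made up for illustration.
rows = [
    {"speaker": "A", "text": "Well, I think we should start.", "summary": "Yes"},
    {"speaker": "B", "text": "Agreed.", "summary": "Unknown"},
]

buf = io.StringIO()
# QUOTE_ALL quotes every field, so commas inside the text column are safe.
writer = csv.DictWriter(buf, fieldnames=["speaker", "text", "summary"],
                        quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)
output = buf.getvalue()
print(output)
```

Reading the result back with a CSV parser confirms the comma stayed inside one field rather than splitting the column.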
er of ambitious ways from this point; however, this is not a film which has the wherewithal to kill off its leading star in the opening ten minutes; the entire sequence is then clearly an exercise for character exposition, with attempts at humour terribly diminished by utter predictability.

To define a boolean feature based on word n-grams:

1. Ensure that SIDE is running and that you are in the Feature Table & Model Building window.
2. Click add to choose the file or files to build features from.
3. Select which annotation you want to learn to predict in the annotation dropdown menu.
4. Check the DefinedFeatureExtractor option in the feature extractor plugins menu. Then right-click on it to open the feature definition window.
5. The top segment of the window will be open to the Word N-Gram tab by default. This gives you three columns of n-grams which appear in your document.
6. To change the length of n-grams in one of these lists, change the number in the NGrams of length text box and click Load to refresh.
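A boolean XOR feature over the two words ambitious and diminished, as in the review excerpt above, evaluates to 1 when exactly one of the words appears in a segment and 0 otherwise. An illustrative sketch over bags of words (the three example segments are paraphrased or invented):

```python
def xor_feature(words, a, b):
    """Boolean XOR feature over word presence: 1 if exactly one of
    the two terms appears in the segment, else 0."""
    return int((a in words) != (b in words))

seg1 = "a promising premise that could have gone any number of ambitious ways".split()
seg2 = "attempts at humour terribly diminished by utter predictability".split()
seg3 = "ambitious but diminished".split()  # invented: both words present

print(xor_feature(seg1, "diminished", "ambitious"))  # 1 (only "ambitious")
print(xor_feature(seg2, "diminished", "ambitious"))  # 1 (only "diminished")
print(xor_feature(seg3, "diminished", "ambitious"))  # 0 (both present)
```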
erent ways of doing this. The most straightforward is to change the algorithm that you are using, either moving to an entirely new algorithm or altering parameters of the current algorithm. However, there is a great deal that can be done in the feature table representation while keeping the same algorithm for learning. The feature table can be modified by removing items which are giving incorrect evidence, filtering down the features in a table as discussed in Lesson 4. You can also write entirely new plugins which offer additional functionality over what SIDE already gives you, as will be discussed in Chapter 6. However, the DefinedFeatureExtractor gives another option for users. This interface can be daunting at first, but it gives a great deal of flexibility in terms of what information you would like to include in a new feature.

The features that we will be constructing can be comprised of a variety of structures. The first ones that we will discuss are boolean features. These are boolean trees that evaluate to 1 as a feature if the statement that is described by the tree is true, and 0 if it is not true. These features can contain important contextual information. For instance, consider a feature such as XOR(diminished, ambitious). This feature may make a distinction between subtleties in a movie review, for instance in the following extract:

this is a promising premise, and mr taylor's film could have gone any numb
Figure 11: N-Gram Defined Features creation window
7. To filter the lists of n-grams to a specific word or set of words, type the word you are searching for in the Filter text box and click Filter Lists or press Enter.
8. To search for stemmed n-grams instead of surface n-grams, click the Turn stemming on button; to change back, click the same button.
9. To add one of these n-grams as a leaf node for a boolean feature, double-click it in the list. It will be added to the list of features in the middle of the window.
10. Double-click a feature in the middle list. This will open up a popup window showing the instances in which this feature occurs in your data.
11. Once you have added a feature to the middle window, click it, and the text area to its right will fill with information about the predictiveness of the feature. The information given is precision, recall, and f-score. These are useful indicators of the specificity and generalizability of a feature. To
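The precision, recall, and f-score readout follows the standard definitions. As a sketch, the figures visible in the window for one example feature (18 out of 49 segments that the feature fires on carry the label; the label occurs on 102 segments in total) can be reproduced as follows:

```python
# Sketch of the precision/recall/f-score arithmetic shown in the feature
# definition window. Illustrative only, not SIDE's code.

def precision_recall_f(true_positives, predicted_positives, actual_positives):
    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# The feature fires on 49 segments, 18 of which carry the target label;
# the label occurs on 102 segments in total.
p, r, f = precision_recall_f(18, 49, 102)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.3673 0.1765 0.2384
```

A high precision with low recall marks a very specific feature; high recall with low precision marks an overly general one.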
he feature table list and select load feature table. This will open a dialog for loading tables into SIDE.
18. The Empty feature extractor will allow you to create an empty feature table with no features extracted. This is useful for working with non-text datasets. Then, in the Machine Learning panel (see lesson 5), you can select the Use Metafeatures checkbox to use the other columns in your dataset as features.
19. To remove the feature tables that you've already created, you can click the clear button.

Lesson 5: Training an automatic classifier using machine learning

To train a classifier on your feature table:
1. Before beginning this lesson, complete Lesson 4, or make sure the Feature Table & Model Building window is open with a feature table loaded in the feature table list.
2. At the top of the window, switch to the Machine Learning tab.
3. In the top left corner, click the Choose button to select the machine learning algorithm you would like to use. This will open a tree-structured window.
4. The three algorithms you will most likely want to choose between will be NaiveBayes (in the bayes folder), SMO (in the functions folder), and J48 (in the trees folder).
5. Open the feature table dropdown menu and select the table (see Lesson 4) to use.
6. You can choose whether to perform cross-validation and select the number of folds i
hink at a high level about the process. First, note that when using SIDE we think in terms of producing summaries, or reports, about documents. Often the term document conjures up images of newspaper articles or reports, but even discussions, or individual contributions to discussions, can be documents. For our purposes, typical documents will be those that come up in instructional contexts, such as answers to open response questions, logs from on-line discussions, email messages from students, posts to course discussion boards, and so forth. Therefore a document can be a single sentence, a single word, or an entire essay, depending on the nature of your data.

SIDE was designed with the idea that documents, whether they are logs of chat discussions, sets of posts to a discussion board, or notes taken in a course, can be considered relatively unstructured. Nevertheless, when one thinks about their interpretation of a document, or how they would use the information found within a document, then a structure emerges. For example, an argument written in a paper often begins with a thesis statement, followed by supporting points, and finally a conclusion. A reader can identify with this structure even if there is nothing in the layout of the text that indicates that certain sentences within the argument have a different status from the others. Subtle cues in the language can be used to identify those distinct roles that sentences might play
iddle window. Then click the Sequencing button, and a popup window appears.
15. To change the window in which you will look for a past feature, change the within X turns dropdown menu's setting. Options available to you range from 0, meaning that the words must occur in the same document, to 5.
16. To change the direction that a feature is looking, the second dropdown menu lets you choose whether to look for the additional feature before, after, or both before and after the current document. This dropdown menu will change that window.
17. To change the ordering of the two features, that is, to look for the second feature listed rather than the first, thereby looking contextually for the first feature rather than the second, click the swap searches button.
18. Once you have defined the sequencing criterion you are interested in, click ok to add this new feature candidate to the middle window.

To search within a document using regular expressions:
Some features that you may be interested in are more complex than n-grams, but are still contained within a single document. For this purpose, you are able to create regular expressions, which will create a feature based on whether that regex matches the text of each document individually. The syntax of these regular expressions matches that of the Java Pattern class, and includes the usual operators and character classes such as \w
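As a sketch of the idea, the snippet below turns a regular expression into a 1/0 document feature. Note that SIDE evaluates expressions with Java's Pattern class; Python's re syntax, used here purely for illustration, is close for simple patterns like this one:

```python
import re

# Illustrative sketch of a regex-based boolean feature: the feature is 1 for a
# document whose text matches the expression, 0 otherwise. SIDE itself uses
# java.util.regex.Pattern; Python's re is close enough for basic patterns.

def regex_feature(pattern, documents):
    compiled = re.compile(pattern)
    return [int(bool(compiled.search(doc))) for doc in documents]

docs = ["i want it now", "ok sure", "we could sell two million of them"]
# Fires on any mention of a quantity such as "two million" (pattern invented
# here for illustration).
print(regex_feature(r"\b(two|three|\d+)\s+(million|thousand)\b", docs))  # [0, 0, 1]
```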
imitive proxy for grammaticality: certain patterns of part-of-speech tags will be very rare in grammatical sentences, while others will be very frequent.

Treat Features as Binary: This is a setting which applies to the features above. In some cases it may be useful to count the number of occurrences of each word in a document; however, in most cases machine learning has been shown to be more effective if a bag-of-words model checks only for the presence of a word (1 or 0). This option toggles between those two settings.

Line Length: This feature simply describes the number of words in a document. This may be useful in conversational data, where short lines should be treated differently from long, extended statements.

Contains Non-Stop Word: This flag can identify whether a statement contains at least some contentful word. It is a boolean value, set to either 0, if all words are filtered out by the stop words list described below, or 1, if a word exists in the document that is not on that list.

Stop words: This flag can weed out whether a contribution is contentful or not, which can be useful when processing chat data rather than newsgroup-style data, for example, making a distinction between contributions like "ok sure" and "the attribution is internal and stable". Often the categories that are appropriate for non-contentful contributions are distinct from those that apply to contentful ones, so this can b
in SIDE. holding down shift for contiguous sections, or control for dispersed segments, and right-clicking once to label all selected segments.
15. You can right-click and select select unannotated segments to select all unlabeled segments at once. Select all segments behaves in a similar way.
16. You can right-click an incorrectly annotated segment and select clear annotation to remove the annotation from that segment.
17. If you choose to remove an annotation type from a scheme altogether, you can click Remove on the list of labels to remove annotations from any segments that were labeled that way.
18. To save the annotations that you have made, click save in the bottom left corner.
19. You may also want to re-export your annotations back to CSV; you can do this by clicking the export to CSV button.
20. Close the Segmentation & Annotation panel once you are finished annotating.

Using SIDE: Machine Learning

Lesson 4: Building a feature table to represent your data set

To predict an annotation using machine learning, a document must first be converted into a form that is understandable by the algorithms that SIDE uses. This means that the document must be equivalent to a vector of features, each one of which has a single value associated with it. What this means in a natural language processing scenario is usually that a model will be constructed in the bag-of-word
ing how examples in set C could have been associated closely enough with examples in set A to be classified as such by your machine learning model. This can be done by identifying attributes that most strongly differentiate sets C and D (a horizontal comparison) and attributes that are most similar between sets A and C (a vertical comparison). The same process applies for the error in set B, or any error cell.
(Screenshot: the Machine Learning tab of the Feature Table & Model Building window, showing a trained SMO model and its normalized attribute weights.)
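A rough sketch of the arithmetic behind these comparisons, assuming a simple list of segments carrying predicted labels, actual labels, and feature values (this illustrates the idea, not SIDE's implementation):

```python
# For each feature, compute its average value within each cell of the
# confusion matrix; horizontal/vertical comparison then ranks features by how
# different (or how similar) those cell means are. Cell keys are
# (predicted, actual). Illustrative sketch only.

from collections import defaultdict

def cell_means(segments):
    """Mean value of each feature within each (predicted, actual) cell."""
    sums, counts = defaultdict(float), defaultdict(int)
    for seg in segments:
        cell = (seg["predicted"], seg["actual"])
        counts[cell] += 1
        for name, value in seg["features"].items():
            sums[(cell, name)] += value
    return {key: sums[key] / counts[key[0]] for key in sums}

segments = [
    {"predicted": "Yes", "actual": "No",  "features": {"um": 1, "ok": 0}},
    {"predicted": "Yes", "actual": "No",  "features": {"um": 1, "ok": 1}},
    {"predicted": "Yes", "actual": "Yes", "features": {"um": 0, "ok": 1}},
]
means = cell_means(segments)
# Horizontal comparison: how differently does "um" behave in the error cell
# (predicted Yes, actual No) versus the correctly classified (Yes, Yes) cell?
print(means[(("Yes", "No"), "um")] - means[(("Yes", "Yes"), "um")])  # 1.0
```

Features with the largest such differences surface at the top of the highlighted feature list under horizontal comparison; vertical comparison instead favors the smallest differences.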
ion.
26. Select the label that you want to search for from the right list by double-clicking on it.
27. If you want to know simply whether the label of the current segment is the same as or different from the previous segment, you can choose those options from the right list as well.
Note: In order to use the prior annotation feature, the segmentation of those annotations must be the same as the annotation that you are trying to automate. Thus, you cannot have a document-level segmentation for the annotation you are predicting while using features based on a sentence-level segmentation.

To finalize the set of features to add to your feature space:
The set of features that you have in the middle window will be comprised of partial or incomplete
ix. Instead, we can consider two different types of errors, overestimation and underestimation, where the predicted value is higher or lower than the actual result, respectively. In fact, that is exactly what the numeric error analysis interface does.
17. When using the numeric interface, there is a slider visible to the user. This slider allows you to customize the margin of correct answers.
18. Select percent or absolute to determine whether you want the error tolerance to be measured as an absolute value (for instance, a prediction within 2.0 of your actual value is acceptable) or as a percentage (for instance, a tolerance of up to 10% error).
19. Moving the slider from side to side will change the accepted tolerance. Note that this can be slow to respond, as it is recalculating categorization for all instances in real time.
20. The confusion matrices underneath the slider are now comprised of only three cells, rather than a matrix. Thus, there is no longer an option of either horizontal or vertical comparison. You can still select one category of error, but it will always be compared based on its differences from the correctly classified instances.

Lesson 7: Defining more complex features of your data by hand

Once you have identified different types of errors that have occurred in your data, it is likely that you will want to attempt to improve your performance based on these insights. There are a few diff
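The tolerance test that the slider applies can be sketched as follows, with the margin interpreted either as an absolute value or as a percentage of the actual value (an illustration of the idea, not SIDE's code):

```python
# Sketch of the numeric error-analysis tolerance test: a prediction counts as
# correct if it falls within an absolute margin of the actual value, or within
# a percentage of it. Illustrative only.

def within_tolerance(predicted, actual, margin, mode="absolute"):
    if mode == "absolute":
        return abs(predicted - actual) <= margin
    # "percent" mode: margin is a percentage of the actual value
    return abs(predicted - actual) <= abs(actual) * margin / 100.0

print(within_tolerance(11.5, 10.0, 2.0))            # True  (off by 1.5 <= 2.0)
print(within_tolerance(11.5, 10.0, 10, "percent"))  # False (off by 15%)
```

Predictions that fail the test are then split into overestimates and underestimates by the sign of predicted minus actual, giving the three cells described above.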
l.
8. Open the comparison dropdown menu and see that you have three options: full, horizontal, and vertical comparison. By default, Full Comparison is selected and shows all segments in the right-hand list. Click Horizontal Comparison.
9. Now that you have selected a comparison type, open the highlighted feature dropdown menu. The contents are sorted by degree of difference between the segments that were incorrectly labeled and the segments that were correct. This means that elements at the top of the list are
(Screenshot: the Feature Analyzer tab, showing the Yes/No confusion matrix (37, 65; 35, 226), the average matrix for the selected feature with means and standard deviations, and the lists of segments predicted Yes but actually No versus segments correctly labelled as Yes.)
les.
Figure 2: The UIMA conversion window in SIDE
is selected (see figure 3).
6. Click the select files button (see figure 3).
7. In the file browser that appears, select the files that you want to import.
8. Once a file is chosen, both the open file list and save file list text areas are filled automatically.
9. Ensure that the File contains text data checkbox is checked. This should be the default setting. This informs SIDE where to look when building a feature table representation of your data.
10. Check the name of text column field to make sure that it matches the header for your text column. In this case the default, text, is correct (see figure 3).
11. Click the convert files button (see figure 3) at the bottom of the window to produce your UIMA document. This may take some time, especially for CSV files with many different columns, as each annotation is processed separately.
12. Once the convert succeeded dialog appears, close the Convert to UIMA window to return to the launch panel.

Why did my file conversion fail?
The most common source of problems in using SIDE is the initial conversion of the file. Don't worry if you've encountered problems. There are three likely sources of errors:
1. UIMA is designed to handle text data in ASCII formatting only. Because of this, any characters encoded in Unicode formats, such as Chinese characters, need to be removed from your docum
ll adjust to match your selection.
7. Look at the list on the left-hand side of the window. These labels are color-coded to match the segments in the main panel.
8. To change the color of an annotation, click the small color box on the left-hand list.
9. Once you have selected a suitable color from this box, click ok, and the annotations will change colors immediately.
10. Click save in the bottom left corner to store the new colors you have selected in the UIMA file.
11. Look above the data panel and see the two tabs, labelled annotation and visualization. Click visualization to switch to a different panel.

Figure 3: The Segmentation & Annotation window in SIDE

To analyze text using visualization:
llow different steps to run SIDE. Once you have completed these steps, SIDE will be running, and you can continue to the next chapter to begin to learn to use the software.

Windows XP: Open the SIDE folder. Double-click the run icon. SIDE will start after a short delay.

Windows 7: Open the start menu and search for cmd. Click the cmd icon. Type cd Desktop\SIDE to navigate from your home folder to the location where SIDE was extracted. If you saved this folder somewhere else, you will have to navigate to it yourself. Type run.bat. SIDE will start after a short delay.

Mac OS X v10.6: Open Finder. Click Applications, then Utilities. Double-click Terminal. Type cd Desktop/SIDE to navigate from your home folder to the location where SIDE was extracted. If you saved this folder somewhere else, you will have to navigate to it yourself. Type run.sh. SIDE will start after a short delay.

Fedora Core 11 Linux: Click Applications, then Administration. Click Terminal. Type cd Desktop/SIDE to navigate from your home folder to the location where SIDE was extracted. If you saved this folder somewhere else, you will have to navigate to it yourself. Type run.sh. SIDE will start after a short delay.

Using SIDE: Text Annotation

Before stepping through how to use the SIDE GUI, it may be helpful to t
many more exist in the full version of the corpus. This includes the Speaker annotation, recording who contributed each line, as well as two annotation schemes. These annotation schemes are DialogAct, which divides statements into one of four high-level categories (Task, Elicit, Minor, and Other) representing the contribution of the statement to the dialogue as a whole, and Summary, which labels individual lines based on whether they are important enough to be included in a conversation summary, with a simple Yes/No distinction. We can analyze this information visually before performing any machine learning. In fact, doing so will in all likelihood be beneficial to our results, as it will allow us to get a feel for the data that we will be working with, and to understand intuitively what a meaningful feature space might look like.

To analyze segmented and annotated text:
1. Open SIDE or, if SIDE is already running, navigate to the launch panel.
2. Click the Segmentation & Annotation button.
3. Click the load file button in the top left corner of the window that appears.
4. Select the file that you want to examine, in our case ES2012a.csv.xmi, and click ok.
5. The data that we are examining will now appear in the main panel.
6. Open the drop-down box below the load file button. Switch between the different annotation schemes and see that the main panel wi
n and press Enter. If an appropriate version of Java is installed on your computer, you will receive a response which includes, somewhere in the text, java version "1.6.0". If your computer gives a similar response to this, you may proceed to installing SIDE. Otherwise, skip to the next section, Installing the Java 6 VM.

Windows 7: Open the start menu, then search for cmd. Click the cmd icon. Type java -version and press Enter. If an appropriate version of Java is installed on your computer, you will receive a response which includes, somewhere in the text, java version "1.6.0". If your computer gives a similar response to this, you may proceed to installing SIDE. Otherwise, skip to the next section, Installing the Java 6 VM.

Mac OS X v10.6: Open Finder. Click Applications, then Utilities. Double-click Terminal. Type java -version and press Enter. If an appropriate version of Java is installed on your computer, you will receive a response which includes, somewhere in the text, java version "1.6.0". If your computer gives a similar response to this, you may proceed to installing SIDE. Otherwise, skip to the next section, Installing the Java 6 VM.

Fedora Core 11 Linux: Click Applications, then Administration. Click Terminal. Type java -version and press Enter. If an appropriate version of Java is installed on your com
n the cross-validation field; it is set to do so by default. Note: If you do not use cross-validation, you will not be able to perform error analysis (lesson 6).
7. You must choose a segmenter from the default segmenter dropdown. However, your choice is not important unless you are performing summarization. For machine learning, this option is ignored.
8. If you wish to use the other annotations on your dataset as features for machine learning, select the Use metafeatures checkbox. This is important if you are using a non-text dataset, for which these extra columns will be your only features. Note: If you use metafeatures, then all future data that you wish to annotate using this model must have those same exact metafeatures available to the model.
(Screenshot: the Machine Learning tab, showing an SMO classifier trained with 10-fold cross-validation and its Weka results summary, including a confusion matrix with 37 Yes and 226 No instances classified correctly, 65 and 35 incorrectly: 263 of 363 correct, 72.4518% accuracy, kappa 0.2511.)
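As a sketch of how such a matrix is tallied from cross-validation output, the following reproduces the Yes/No counts from the screen above (rows are actual labels, columns are predicted labels; illustrative code, not SIDE's):

```python
# Tally a confusion matrix from (actual, predicted) pairs, reproducing the
# Yes/No example above: 263 of 363 instances correct, roughly 72.45%.

from collections import Counter

def confusion_matrix(pairs, labels):
    counts = Counter(pairs)  # (actual, predicted) -> count
    return [[counts[(a, p)] for p in labels] for a in labels]

# 37 Yes predicted Yes, 65 Yes predicted No, 35 No predicted Yes, 226 No predicted No
pairs = ([("Yes", "Yes")] * 37 + [("Yes", "No")] * 65 +
         [("No", "Yes")] * 35 + [("No", "No")] * 226)
matrix = confusion_matrix(pairs, ["Yes", "No"])
correct = matrix[0][0] + matrix[1][1]
print(matrix, round(100.0 * correct / len(pairs), 4))  # [[37, 65], [35, 226]] 72.4518
```

The diagonal cells hold the correctly classified instances; everything off the diagonal is an error cell of the kind lesson 6 examines.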
ne of each file is one segment. SIDE comes with DocumentSegmenter (each file is one segment), SentenceSegmenter (using a trained English sentence identifier), and WordSegmenter (where each word is treated as a separate segment).

VisualizationPlugin: This plugin must be able to take as input a list of documents that have been automatically classified by some model, and produce a Java Swing component which represents that data in some visual way. SIDE comes with the PieChart, TimeSeries, and Periodic visualizations by default.

FeatureAnalysisPlugin: This plugin takes as input a list of documents that have been automatically classified by some model, and produces a Java Swing component which provides some insight into the behavior of that model. SIDE comes with the SideBySideErrorAnalysis plugin by default.

Language Technologies Institute, School of Computer Science, Carnegie Mellon University, www.lti.cs.cmu.edu
nnotation dropdown menu.
6. Choose which features you would like to extract using the feature extractor plugins list. Check the TagHelperExtractor box.
7. Right-click the TagHelperExtractor label to open a configuration popup window. Here you are able to select which features to extract. We will be extracting a simple unigram model, so select unigram and treat features as binary, and click ok.

Figure 6: Exporting a feature table for use outside of SIDE

8. If you wish to filter out features which do not occur frequently, check the remove rare features checkbox. Then type a number into the textbox that appears. This will require a feature to occur in at least that many documents in order to be included in your feature table.
9. Choose a name for this feature set and type it into the new feat
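The rare-feature threshold described above amounts to a document-frequency filter: a feature survives only if it occurs in at least the chosen number of documents. A minimal sketch of that idea (illustrative, not SIDE's code):

```python
# Keep a unigram feature only if it appears in at least `threshold` documents.
# Illustrative sketch of the "remove rare features" option.

def filter_rare_features(documents, threshold):
    document_frequency = {}
    for doc in documents:
        for feature in set(doc.lower().split()):  # presence per document
            document_frequency[feature] = document_frequency.get(feature, 0) + 1
    return {f for f, df in document_frequency.items() if df >= threshold}

docs = ["ok sure", "ok so the remote", "the remote control", "um ok"]
print(sorted(filter_rare_features(docs, 2)))  # ['ok', 'remote', 'the']
```

Raising the threshold shrinks the feature table, trading a little recall for features that generalize better across documents.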
Figure 9: Error analysis interface in SIDE

more likely to be predictive of why a machine learning algorithm is making mistakes.
10. Select a feature from the highlighted feature list. A second confusion matrix will appear below the menu.
11. Study the second confusion matrix. For each cell, this gives the average value of the highlighted feature among members of that cell, with standard deviation given in parentheses.
12. Now switch to Vertical Comparison and reopen the highlighted feature dropdown menu. You will see that the features have now been automatically re-sorted to show similarity between the two cells. In this case, because the cells represent different actual labels, this will again show a possible reason why the classifier made a mistake.
13. You may have noticed that the hide empty features box is checked by default. This option can be unchecked if you are interested in features which do not appear in the documents in the cells you are comparing.
14. To see the entirety of a document's text, hover over the Text column and it will appear as mouseover text.
on of a document. And so on.
3. Now click the Summary option. This page gives a lot of information about our classifier based on cross-validation, including total accuracy, Kappa, and other statistics.
4. Finally, click the Confusion Matrix option. This gives us a brief summary of the results and sources of error. We will look at this matrix in much greater detail now.

To explore a confusion matrix in detail:
5. Switch to the Feature Analyzer tab by clicking it at the top of the screen.
6. Select the model you want to analyze in the model dropdown menu in the top left corner.
7. By default, all segments that were evaluated in cross-validation display in the bottom right scrolling list, and the confusion matrix appears in the middle of the page.
Note: This process requires you to keep track of a lot of information about what categories different sets of data fall into and what these categories mean. As you work, the labels on this window will change to give you some reminders. For instance, the left-hand panel gives the predicted and actual labels of the segments in the cell you have highlighted, while the right-hand panel is labeled with the name of the correctly classified category you are comparing against.
Click a cell that is not along the diagonal of correct cells in the confusion matrix at the top of the page. This will fill the bottom left scrolling list with the contents of that cel
ort Vector Machines, and J48, which is one of Weka's implementations of a Decision Tree learner. SMO is considered state of the art for text classification, so we expect that analysts will frequently find that to be the best choice.

The remaining customization options affect the design of the attribute space. The standard attribute space is set up with one attribute per unique feature; the value corresponds to the number of times that feature occurs in a text. SIDE comes packaged by default to search for standard features from Natural Language Processing, integrating an older package called TagHelper Tools. The following types of features can be extracted automatically:

Unigrams and bigrams: A unigram is a single word, and a bigram is a pair of words that appear next to one another. Unigrams are the most typical type of text feature. Bigrams may carry more information. They capture certain lexical distinctions, such as the difference in the meaning of the word internal between internal attribution and internal combustion.

POS bigrams: Part-of-speech bigrams are similar to the features discussed above, except that instead of words they represent grammatical categories. They can be used as proxies for aspects of syntactic structure. Thus, they may be able to capture some stylistic information, such as the distinction between the answer which is and which is the answer. They may also be used as a pr
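As a quick sketch of what these features look like, the helper below extracts unigrams and bigrams from whitespace-tokenized text (illustrative only; TagHelper Tools performs its own tokenization and part-of-speech tagging):

```python
# Extract word n-grams of length n from a text: n=1 gives unigrams,
# n=2 gives bigrams. Illustrative sketch, not TagHelperExtractor's code.

def ngrams(text, n):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "internal attribution versus internal combustion"
print(ngrams(sentence, 1))  # ['internal', 'attribution', 'versus', 'internal', 'combustion']
print(ngrams(sentence, 2))  # bigrams keep 'internal attribution' distinct
                            # from 'internal combustion'
```

POS bigrams would apply the same sliding window to part-of-speech tags rather than to the words themselves.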
Figure 13: Prior annotation features and matching instances window

features; it is through constructing combinations of these features that you gain information. Obviously, you don't want to add all of the intermediate steps to your feature space. Thus, there is a separate step to create the final set of features to be added.
28. Highlight the feature or features that you want to add to your final feature space in the middle window by clicking on them with Shift and Control.
29. Click the Convert to Features button. The highlighted features will be shifted to the bottom window.
30. To correct mistakes in the bottom list, press the Delete Selected Feature button to delete the currently highlighted feature, or press the Clear Feature List button to clear the list. The features in the middle window will remain.
31. When you have finalized your list of features
participation in the course. Instructors likely do not have time to keep up with all of the correspondence contributed on the course discussion board. One such example is an on-line environment for Civics education, where students participate in debates over time about bills that they propose in what amounts to a virtual legislature. A time series displays student posting behavior over the course of an academic year, in terms of both the number of posts and the level of argumentation quality automatically detected in those messages. From this visualization, one observation is that the frequency of student posts increases over the course of the semester. It is also clear that the proportion of high-quality argumentation, indicated in green, does not consistently increase over the course of the semester. Weeks where the proportion of messages with high-quality argumentation is highest seem to be weeks where there is a larger-than-average frequency of posting, potentially indicative of weeks where there is more intensive debate.

Figure 10: A simple time series visualization in SIDE

Instructors using a visualization like this would be able to determine which weeks students were not particularly engaged in the discussion. It would also help the instruc
pen the CSV file you will be converting and examine it in a program such as Microsoft Excel. The files we will be using are located at SIDE/data/ami.
2. Make a note of which column stores the actual text data that you are importing. In this file, it is text. You will need this information later.
3. Open SIDE or, if SIDE is already running, navigate to the launch panel.
4. Click the Convert to UIMA button, the first button on this list.
5. In the document reader plugins dropdown menu, make sure that the sample plugin CSVFileReader
(Screenshot: the Convert to UIMA window, with the document reader plugins dropdown, the select files button, the open file list and save file list, the File contains text option, the name of text column field, and the convert files button.)
puter, you will receive a response which includes, somewhere in the text, java version "1.6.0". If your computer gives a similar response to this, you may proceed to installing SIDE. Otherwise, skip to the next section, Installing the Java 6 VM.

Installing the Java 6 VM

If you are using a computer running Mac OS X, then you can install the Java 6 VM through the Software Update utility. Open this program by clicking on the Apple icon in the top left corner and selecting the Software Update option. Install jre-6 with the highest update version available for your computer.

If you are using a computer running Windows, Linux, or any other operating system, you will need to download the appropriate file directly from Sun's official website: http://java.sun.com/javase/downloads. Once you select the appropriate file here, you should open it and follow the instructions it gives.

Installing and running SIDE

Now that Java 6 is installed on your computer, you can start using SIDE. All the files that you will need for basic use are available in a single package, located at the following website: http://www.cs.cmu.edu/~cprose/SIDE.html. Save the file to your desktop for easy access. Now extract the package to a folder using whatever archive manager you prefer. The resulting folder should be named SIDE. To run SIDE, open this folder. Depending on the operating system you are using, you will need to fo
…resent the feature you are labeling.

8. Click "segment" to create this annotation scheme. You will notice that there are no label options, and all segments are now unlabeled.

9. Now we need to define labels. Click "add label" twice to make new label types.

10. Change the text in these labels to match what you want to label. Click the color boxes to the left of the labels to change the color appearing in the main panel.

12. Right-click a segment in the Annotation panel and click "set label". This will open a popup window.

13. From the scroll menu, select the radio button next to the annotation matching this segment. Click "ok" to apply the label to this segment.

14. You can annotate multiple segments at a time by…

[Figure: the Segmentation & Annotation window for ES2012a.csv.xmi, showing a meeting transcript in the annotation panel, the annotation scheme dropdown, the new annotation scheme and delete annotation scheme buttons, and a right-click menu with "Select All Segments" and "Select Unannotated Segments" options.]
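Conceptually, an annotation scheme like the one built above is just a named set of label types applied to segments. The following sketch is hypothetical; SIDE's internal representation is UIMA-based and differs, and the scheme name, labels, and colors here are invented for illustration:

```python
# Hypothetical sketch of an annotation scheme: a name, a set of label
# types (each with a display color), and per-segment label assignments.
scheme = {
    "name": "Negotiation",
    "labels": {"proposal": "#cc0000", "response": "#0000cc"},
    # Segment index -> assigned label; unlabeled segments are absent.
    "assignments": {},
}

def set_label(scheme, segment_index, label):
    """Mimic the 'set label' popup: only defined labels may be applied."""
    if label not in scheme["labels"]:
        raise ValueError(f"unknown label: {label}")
    scheme["assignments"][segment_index] = label

set_label(scheme, 0, "proposal")
print(scheme["assignments"])
```

The check in `set_label` mirrors the popup in step 12, which only offers radio buttons for labels that already exist in the scheme.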
…s fashion, where each feature in the vector corresponds to the presence of a word or sequence of words. This is the type of feature table that the default TagHelperExtractor we use in this example will create. In order to improve performance from a baseline, the feature table representation is one of the first places that a researcher should consider. Later lessons will teach you the basics necessary for error analysis (Lesson 6), feature construction (Lesson 7), and including your own code in SIDE's plugin framework (Chapter 6). For now, we will use the simpler representation available from the TagHelperExtractor.

To build a feature table for training a classifier:

1. Open SIDE, or if SIDE is already running, navigate to the launch panel.

2. Click the "Feature Table & Model Building" button.

3. Click "add" in the top left corner to open a file chooser popup window.

[Figure 5: Building a feature table in SIDE. The feature table manager shows the loaded file ES2012a.csv.xmi, the "use filtered list below as table" text box, and tabs for Feature Table, Machine Learning, and Feature Analyzer.]

4. Select files you want to extract features from; you can select multiple files at once by holding control or shift. Click "ok" once you have chosen all files from which you want to extract features. We will use ES2012a.csv.xmi.

5. Select which annotation you want to learn to predict in the a…
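The bag-of-words style of feature table described above can be sketched in a few lines. This is an illustration of the representation only, not SIDE's actual TagHelperExtractor code, and the example segments are invented:

```python
# Build a unigram (bag-of-words) feature table: one column per word in
# the vocabulary, one row per segment, with binary presence values --
# the style of representation the TagHelperExtractor produces by default.
segments = ["welcome to the meeting", "the meeting is over"]

vocabulary = sorted({w for s in segments for w in s.split()})
table = [[1 if w in s.split() else 0 for w in vocabulary] for s in segments]

print(vocabulary)
print(table)
```

Each row of `table` is the feature vector a classifier would see for one segment; changing this representation (for example, to bigrams or stemmed words) is the kind of manipulation later lessons explore.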
…s to take note of, as well as how to search for these features in data, how to analyze an automated model for errors, and how to deliver this information to a user in a structured way. The combination of all of these tasks in a single suite of applications makes SIDE a valuable research tool as well as a utility for supporting institutional practice.

As we have stated earlier, SIDE is an application that makes use of machine learning. This functionality is provided by the Weka toolkit. It is important to understand what machine learning algorithms do. These algorithms are designed to induce rules based on patterns found in structured data representations. A researcher has two types of options in customizing the behavior of a machine learning algorithm. One is to manipulate the structured representation of the text being studied, and the other is to manipulate the selection of the machine learning algorithm. These two choices are not entirely independent of one another. An insightful machine learning practitioner will think about how the representation of their data will interact with the properties of the algorithm they select. Interested readers are encouraged to read Witten & Frank's 2005 book, which provides a comprehensive introduction to the practical side of the field of machine learning.

Example Applications

As a scenario, consider a course where students are heavily involved in online discussions as part of their…
…tor to identify weeks where the argumentation quality is lower than desired, in order to offer suggestions or other support that elevates the level of intensity in the discussion and so provides students with better opportunities to hone their argumentation skills.

Another similar scenario where reports based on analyses of behavior in a discussion forum can be useful is in project-based courses, where students do a lot of their group work outside of the direct supervision of the instructor. Insights about the well-being of student groups can be gleaned from some byproducts of group work, such as the messages left in a groupware environment used to coordinate the group work. Prior work has shown that machine learning models can predict with reasonable accuracy how the instructor would rate the extent to which students have contributed productively to their group that week, with a correlation of R = 0.63 in comparison with instructor-assigned productivity grades. Such an interface could display trends in student participation over time to aid instructors in identifying students who may need more attention and prodding to intensify their contributions to their respective groups.

Visualizations are not the only type of summary that may be useful to instructors. With respect to the example in Figure 1, rather than knowing what proportion of messages exhibited high levels of argumentation quality, the instructor may want to se…
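For readers curious what a figure like R = 0.63 means operationally, a Pearson correlation simply compares two paired lists of scores. The sketch below hand-rolls the computation with the standard library; the predicted and instructor scores are invented for illustration and deliberately do not reproduce the reported 0.63:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical paired scores: model predictions of weekly productivity
# vs. instructor-assigned grades (invented data, for illustration only).
predicted = [3.0, 4.5, 2.0, 5.0, 3.5]
instructor = [3.5, 4.0, 2.5, 4.5, 3.0]

print(round(pearson(predicted, instructor), 2))
```

Values near 1 indicate the model's ratings track the instructor's closely; R = 0.63 indicates a moderately strong agreement.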
…ure table name text area.

11. Click the "create feature table" button, and the features you chose will be extracted. Be patient; this may take a few minutes for a large dataset. The feature table description panel will now show a checklist containing the names of the features you extracted.

12. Right-click a third time and select "save" to open a file chooser dialog where you can save the results of this table for future use within SIDE.

13. At this time, leave the window open and proceed to Lesson 5.

To edit a feature table manually once it has been created:

14. If there are features that you believe are hurting the performance of your classifier, you can uncheck them in the right-hand window.

15. Then, in the "use filtered list below as table" text box, give this filtered list a new name and click "create". The feature table with the changes you have made will be available to you in the list in the bottom left corner.

Other options associated with feature tables in SIDE:

16. Right-click the table name you just created in the feature table list and choose the "export using" option, then select the HTML Feature Table or ARFF Feature Table options. These options will let you view the feature table in a web browser or in the Weka machine learning environment.

17. If you would like to load a table that was previously saved, such as the XML file you just created in step 15, you can right-click t…
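The ARFF Feature Table export mentioned in step 16 produces Weka's plain-text ARFF format. The sketch below shows what a minimal ARFF file looks like; the relation name, attribute names, and rows are invented, and this is not SIDE's actual exporter:

```python
# Write a tiny feature table in Weka's ARFF format: a @relation header,
# one @attribute line per feature column, then comma-separated @data rows.
rows = [
    {"welcome": 1, "meeting": 1, "label": "proposal"},
    {"welcome": 0, "meeting": 1, "label": "response"},
]

lines = ["@relation feature_table", ""]
lines += ["@attribute welcome numeric", "@attribute meeting numeric"]
lines += ["@attribute label {proposal,response}", "", "@data"]
for r in rows:
    lines.append(f"{r['welcome']},{r['meeting']},{r['label']}")

arff = "\n".join(lines)
print(arff)
```

A file in this shape can be opened directly in the Weka Explorer, which is what makes the ARFF export useful for experimenting outside of SIDE.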
