Home
EDITS - Edit Distance Textual Entailment Suite Version 1.0
Contents
1. lt ATTLIST entity name CDATA REQUIRED start CDATA IMPLIED end CDATA IMPLIED source CDATA IMPLIED Fig 3 DTD of the ETAF annotation format 12 B Magnini M Kouleykov E Cabrio 3 2 ETAF Morpho Syntax Level The second level of annotation represents texts as sequences of tokens with morpho syntactic features Common properties are token lemma morpho and part of speech but other linguistic annotations may be used at this level including named entities weights of tokens like IDF and many others For example the following is the morpho syntax representation of the sentence Edison invented the Kinetoscope with user defined attributes for two different pos tagging sets token lemma full morphological analysis full morpho and sentence boundaries lt hAnnotation gt lt string gt Edison invented the Kinetoscope lt string gt lt word id 2232 gt lt attribute lt attribute lt attribute lt attribute lt word gt name wnpos gt n lt attribute gt name token gt Edison lt attribute gt name lemma gt edison lt attribute gt name pos gt NPO lt attribute gt lt word id 2233 gt lt attribute name full_morpho gt invent v parttpast inventedtadj zero invent v indic past lt attribute gt lt attribute lt attribute lt attribute lt attribute lt word gt name wnpos gt v lt attribute gt name token gt invented lt attribute gt name lemma gt invent lt att
2. lt constant name PATH gt home epack edits lt constant gt lt module lt module gt lt conf gt id textpro type eu fbk hlt annotation textpro TextPro gt lt attribute name script gt VAR PATH scripts textpro lt attribute gt lt attribute name tempDir gt VAR PATH temp lt attribute gt lt attribute name tempFile gt question pure lt attribute gt lt attribute name extension gt txp lt attribute gt lt attribute name encoding gt Latin1 lt attribute gt lt attribute name language gt Italian lt attribute gt 7 3 In line and out line definitions A configuration file can be defined in two modalities i in line when modules are nested one into the other without making use of mlinks to other modules ii out line when module identifiers are used to refer to such modules The in l token edit d used as nest lt conf gt lt module lt module lt module lt module lt module lt module lt module lt module lt module ine modality produces a file describing all necessary modules while the out line modality allows to define in a single file more variants for the same EDITS functionality and to activate just one of them using a mlink As an example of out line configuration the following file shows that in a single file while more than one module for the same function i e distance algorithm can be defined i e istance string edit distance and tree edit
3. lt ATTLIST entity name CDATA REQUIRED start CDATA IMPLIED end CDATA IMPLIED source CDATA IMPLIED Fig 10 DTD of the rule file 22 B Magnini M Kouleykov E Cabrio The entailment rule above states that the string invented entails the string pioneered with a proba bility equal to 1 0 Morpho Syntax level lt rule name 2 gt lt t gt lt word gt lt attribute name lemma gt invent lt attribute gt lt word gt lt t gt lt h gt lt word gt lt attribute name lemma gt pioneer lt attribute gt lt word gt lt h gt lt probability gt 1 0 lt probability gt lt rule gt The entailment rule above states that the lemma invent entails the lemma pioneer with a probability equal to 1 0 Syntactic level lt rule name 3 gt lt t gt lt node gt lt attribute name edge to parent gt dobj lt attribute gt lt word gt lt attribute name token gt home lt attribute gt lt word gt lt node gt lt t gt lt h gt lt node gt lt attribute name edge to parent gt dobj lt attribute gt lt word gt lt attribute name token gt habitation lt attribute gt lt word gt lt node gt lt h gt lt probability gt 1 0 lt probability gt lt rule gt The entailment rule above states that the node home in the dependency relation of direct object with its syntactic head e g John bought a house where house is the direct object of the verb to buy entails the node habitation in the dependency relation
4. EDITS allows to define the cost of the edit operations by means of user defined attributes As an additional example the following cost scheme exploits the pre computed frequency of a token to calculate the cost of insertion according to the intuition that the most frequent words should have a lower cost of insertion lt insertion name insertion_frequency gt lt condition gt not null attribute freq B lt condition gt lt cost gt 1 number attribute freq B 20 lt cost gt lt insertion gt lt ELEMENT schemes scheme gt lt ELEMENT scheme constant insertion deletion substitution gt lt ELEMENT constant EMPTY gt lt ELEMENT insertion condition cost gt lt ELEMENT deletion condition cost gt lt ELEMENT substitution condition cost gt lt ELEMENT condition PCDATA gt lt ELEMENT cost PCDATA gt lt ATTLIST scheme name CDATA REQUIRED gt lt ATTLIST insertion name CDATA REQUIRED gt lt ATTLIST deletion name CDATA REQUIRED gt lt ATTLIST substitution name CDATA REQUIRED gt lt ATTLIST constant name CDATA REQUIRED type string number boolean REQUIRED value CDATA REQUIRED Fig 4 DTD of cost scheme file 5 1 Data Types and Functions Conditions and costs are defined using a set of functions expressed in a lisp like syntax Such functions can consider as parameters the source 4 the target B the text T and the hypothesis H This means that
5. value training_set model_path conf path Configuration file path optional algorithm name Algorithm to use Possible values string String Edit Distance default token Token Edit Distance tree Tree Edit Distance optional task value Calculate the threshold per task of the entailment pair false default true optional length value Calculate the threshold per length of the entailment pair false default true optional scheme path Path to a cost scheme file optional rules path Path to an entailment rules file optional simple path Path to file where the system will write the output in simple format entailment decision and score for each pair optional full path Path to file where the system will write the output in full format entailment decision score and edit operations for each pair optional verbose level Verbose level of system messages 0 default 1 2 optional overwrite Overwrites the output files if they exist optional As an example the following lines represent two equivalent commands that run EDITS either using the ECF conf1 zml or using parameters over the development set of RTE3 annotated at the morpho syntax level In both cases the execution of the command produces as output a distance threshold i e the distance model stored in the file modelRTE3dev The full option indicates the file in which the system will save the edit operations performed by the system to transform T in H
6. Data Types and Functions ore cee cece cee eee ence ee eee tenes 15 9 1 1 Data Types ic ac eaeas vans oeni Shae paa ehGea Ta ba a wk Mage Deva eas 15 Ord FUNCIONS ee 6044 ta AS a aed eee a Aaa eee a ed eed ae 16 5 2 How to configure cost schemes cece eee cee tee eee eee eee eee e nes 19 Defining rules in EDITS sj s05b4 449404 oes a pA heb See ea de ed o Hd eS 19 61 Rule format rita Ma amas aes A A A IA ale kU ee ele A AAA 19 6 161 Entalmentrules id da da led tale Ty rants Stee aaa 19 6 1 2 Contradiction rules a sete een ERER a eee a eee eS 21 6 2 Riles TEPOSIO VS Ti mi atoa A era bebe EA A eid Roeland alten ee Mange RA 22 6 3 Ritle activation a 444 1460 anata ete nen A ada een Ss da et 22 EDPS confnouratiom legaat oe oo oe A o ia 23 Module Configuration iach aa A Mae E Oe a at date aS 24 42 Usage of Constants orere sce A A A a eb Gg eee sea animes 24 7 3 In line and out line definitions 0 0 cece eens 25 TA Predefined configurations siii Fh eo AGS HAAR Meee ERE oR EDS ea eed 26 EDITS Edit Distance Textual Entailment Suite 3 1 Introduction EDITS is a software package aimed at recognizing entailment relations between two portions of text see 1 for an introduction to textual entailment The package assumes the task definition as proposed by the RTE Recognizing Textual Entailment evaluation campaigns 2 According to the two way classification task at RTE EDITS takes as input a Tex
7. String set contains HashFile Object returns True if SetFile contains Object 20 B Magnini M Kouleykov E Cabrio Null handling functions null Object returns True if the argument is null For example to express that a word A from T has not an attribute freq the expression null attribute freq A can be used 5 2 How to configure cost schemes The cost scheme file is defined in the ECF As an example the declaration below extracted from the pre defined configuration file conf1 xml in Section 2 2 allows to use the simple scheme aml lt conf gt lt module type entailment engine gt lt mlink idref xml cost scheme gt lt module gt lt module id xml cost scheme type cost scheme className eu fbk hl1t edits distance cost scheme XmlCostScheme gt lt attribute name scheme file gt EDITS_PATH share cost schemes simple scheme xml lt attribute gt lt module gt lt conf gt 6 Defining rules in EDITS EDITS allows the use of sets of rules both entailment rules and contradiction rules in order to provide specific knowledge e g lexical syntactic semantic about transformations between T and H Rules can be manually produced or they can be extracted from any available resource e g WordNet Wikipedia DIRT and stored in XML files which are called Rule Repositories Each rule in EDITS consists of four parts 1 rule name a unique identifier of the rule within a certain rule r
8. String Stringz returns True if String contains Stringz For instance contains reenacting act returns True number String reads a number from String For instance number 3 14 returns 3 14 boolean String reads a boolean from String The possible arguments are true and false to lower case String converts String to lower case length String returns the number of characters in String char String Number returns the character in String at position corresponding to Number substring String Number Number2 returns the sub string of String from the position corre sponding to Number till the position Number distance String Stringz normalize returns the Levenshtein distance between String and Strings If the normalize parameter is present the function returns a normalized distance with respect to the length of the two arguments between 0 and 1 Functions with numeric arguments Number Numberg returns True if Number is equal to Numbers lt Number Number2 returns True if Number is less than Number gt Number Number2 returns True if Number is more than Number Number Number makes a sum of numbers example 1 2 is equal to 3 Number Number subtracts numbers from Number example 5 2 1 is equal to 2 x Number Number multiplies numbers example 2 2 is equal to 4 Number
9. be annotated in the ETAF format In addition the bin edits command allows for a number of options reported below Usage edits test option value test_set conf path Configuration file path optional algorithm name Algorithm to use Possible values string String Edit Distance default token Token Edit Distance tree Tree Edit Distance optional task value Calculate the threshold per task of the entailment pair false default true optional length value Calculate the threshold per length of the entailment pair false default true optional scheme path Path to a cost scheme file optional rules path Path to an entailment rules file optional simple path Path to file where the system will write the output in simple format entailment decision and score for each pair optional full path Path to file where the system will write the output in full format entailment decision score and edit operations for each pair optional verbose level Verbose level of system messages 0 default 1 2 optional overwrite Overwrites the output files if they exist optional model path Model file path As an example the following two lines represent equivalent commands that run EDITS either using the ECF conf1 xml or using command line options In both cases the input is a corpus of RTE3 pairs annotated in the ETAF format at the morpho syntax level and the distance model is the modelRTE3dev created before bin edi
10. edition of the Language Resources and Evaluation Conference Marrakech Morocco 28 30 May Kaizhong Zhang and Dennis Shasha 1990 Fast Algorithm for the Unit Cost Editing Distance Between Trees In Journal of Algorithms vol 11 December 1990
11. example of a SetFile file containing stop words is shown in Figure 9 type value 1 value 2 value 3 Fig 8 SetFile format string a a s able about above according accordingly Fig 9 SetFile example The system also treats the string null as expressing a non existing value 5 1 2 Functions Functions over AnnotatedText string AnnotatedTezt returns the text of the AnnotatedText i e the text of T or H tree AnnotatedText returns the syntactic tree of the AnnotatedText words AnnotatedTezt returns the list of words in the AnnotatedText Functions for accessing Entailment Rules entail SimpleRulesObject SimpleRulesObjectz checks for the existence of an entailment rule see Section 6 where the left hand side of the rule matches SimpleRulesObject and the right hand side of the rule matches SimpleRulesObjectz The two arguments must be of the same data type The allowed types are String Word and Node If the rule exists then the probability associated to the rule is returned otherwise the output of the function is null contradict SimpleRulesObject SimpleRulesObjecta checks for the existence of a contradiction rule see Section 6 where the left hand side of the rule matches SimpleRulesObject and the right hand side of the rule matches Simple RulesObjectz The two arguments must be of the same data type The allowed types are String Word and Node If
12. EDITS Edit Distance Textual Entailment Suite Version 1 0 User Manual Bernardo Magnini Milen Kouleykov and Elena Cabrio 1 FBK Irst Trento Italy 2 University of Trento Italy magnini kouylekov cabrio fbk eu May 22 2009 Abstract EDITS is a software package aimed at recognizing entailment relations between two portions of text termed as text T and hypothesis H The system is based on edit distance algorithms and computes the T H distance as the cost of the edit operations i e insertion deletion and substitution that are necessary to transform T into H EDITS is based on three main modules an edit distance algorithm a cost scheme for the three edit operations a set of rules representing transformations between portion of T and H Each module can be configured by the user through a configuration file The system can work at different levels of complexity depending on the linguistic analysis carried out over T and H Both linguistic processors and semantic resources that are available can easily be used within EDITS resulting in a flexible modular and extensible approach to Textual Entailment This document introduces the main functionalities of the system and provides an in depth descrip tion of how EDITS can be used how available tools and resources can be integrated and how the system can be extended EDITS has been developed by the HLT research unit at FBK irst http hlt fbk eu and can be downloaded as open so
13. EDITS Section 2 1 there are four main steps required in order to use EDITS 1 Configure EDITS main parameters Section 2 2 2 Prepare the RTE datasets both training and test providing the linguistic annotations required by the selected distance algorithm Section 2 3 3 Run EDITS over a training dataset Section 2 4 4 Run EDITS over a test dataset Section 2 5 3 http pascallin ecs soton ac uk Challenges RTE 4 B Magnini M Kouleykov E Cabrio 2 1 Install EDITS EDITS can be downloaded from http edits fbk eu Updated versions will be periodically released and uploaded at the same web page EDITS runs on Unix based Operating Systems and has been tested on MAC OSX Linux and Sun Solaris The system requires SUN Java version 6 4 In order to have EDITS installed it is enough to unzip the package in any directory of your machine The basic organization of files and directories in EDITS is shown in Figure 1 API DOC USER MANUAL BIN m gt Scripts e g edits annotated LIB _ JAVA libraries STRING RTE m ETAF MORPHO SYNTAX OUTPUT EDITS compiled code SYNTAX CONFIGURATIONS SRC EDITS ETAF DTDS ETAF XSDS SHARE COST SCHEMES TEMP RULES TEXT ANNOTATOR PLUGIN ENTAILMENT ENGINE DISTANCE ALGORITHM COST SCHEME RULES REPOSITORY Fig 1 Organiz
14. Number divides Number by the rest example 24 3 2 is equal to 4 o a Functions with boolean arguments and Boolean Boolean returns True if all the arguments are True or Boolean Boolean returns True if at least one of the arguments is True not Boolean returns True if the argument is False and False if it is True Conditional Functions if Boolean Object Objecta if the Boolean is equal to True then the function returns Object otherwise Objecta If Objecta is not defined the function returns null Functions with list arguments member List Object returns True if List contains Object For example the list 1 2 3 contains the number 1 the list sum plus minus contains the string plus size List returns the number of elements in List nth Number List returns the n th Number element of List The first element of a list is returned by nth 1 list the last with nth size list 1 list subseq List Number Number2 returns the sub list of the list from the position corresponding to Number till the position Number Functions handling HashFiles and SetFiles make hash file String Reads a HashFile from the path specified in String make set file String Reads a SetFile from the path specified in String hash value HashFile String returns the value of the map inside HashFile with key
15. Object Complex Rules Object String Word Node List Tree AnnotatedText Fig 5 EDTIS datatype hierarchy Number a real number for example 0 3 14 etc Boolean True False String a sequence of characters for example Dolomiti or Milen Char a symbol for example a or List a sequence of elements words nodes numbers booleans etc SetFile a group of elements of a given type strings numbers or booleans loaded from a file 10 HashFile an object that contains maps from keys to values loaded from a file Keys are strings while values are either strings numbers or booleans AnnotatedText represents the annotated text of T or H accessible using the variables T and H SetFiles and HashFiles are objects that have to be read from an external file i e can not be created inside a cost scheme The format of a HashFile is shown in Figure 6 The type is the data type of the values in the HashFile and the key is separated from a value with a tab An example of a HashFile file containing the IDF of words is shown in Figure 7 type key 1 value 1 key n value n Fig 6 HashFile format number speak 12 23 ride 3 23 read 10 32 Fig 7 HashFile example 18 B Magnini M Kouleykov E Cabrio The format of a SetFile is shown in Figure 8 The type is the data type of the elements in the SetFile An
16. S provides few pre defined configurations aiming at facilitating the use of the system Configuration 1 This configuration conf1 cml makes use of the following modules distance algorithm token distance algorithm i e Levenshtein distance 5 calculated over the tokens of T and H cost scheme a cost scheme where the costs of each edit operation are fixed rules no rules are used in this configuration linguistic annotation T and H are annotated using TextPro The configuration has been experimented using the RTE1 RTE2 and RTES datasets for training and RTE 4 for testing Configuration 2 This configuration makes use of the following modules distance algorithm tree distance algorithm i e Zhang Shasha 8 calculated over the nodes of the dependency tree of T and H cost scheme a cost scheme where the costs of each edit operation are dynamically calculated consid ering the length of T and H rules a repository of entailment rules automatically extracted from WordNet Rules are built ac cording to the following procedure for all pairs in the RTE collection i e all RTE training and test data available so far and for all combination of tokens A in T and tokens B in H we seek in WordNet version 3 0 whether the following relations holds Synonym A B Hypernym A B Hypernym B A If the relation holds then an entailment rule is built between A and B with probability 1 linguistic annotation T and H are annotated
17. Such transformations are considered as edit operations i e deletion insertion and substitution carried out over T and H A main feature of EDITS is that it can use different edit distance algorithms accord ing to the linguistic annotation used to represent T and H More specifically EDITS provides distance algorithms at three levels string edit distance algorithms are used when T and H are represented as sequences of characters i e strings maintaining their original order token edit distance algorithms are used when T and H are represented as sequences of tokens main taining their original order tree edit distance algorithms are used when T and H are represented according to their syntactic structure e g dependency trees Different algorithms make use of different cost functions and different set of entailment rules In the following sections the three classes of algorithms are detailed as well as their requirements 4 1 String Edit Distance At this level the three edit operations are defined over sequences of characters Cost functions can only consider string properties of T and H e g string length and rules may refer to properties of characters e g capitalization This level of annotation is referred in ETAF as string level and is detailed in Section 3 1 As a string edit distance algorithm EDITS provides the Levenshtein distance algorithm 5 imple mented in the class StringEditDistance 4 2 Tok
18. TD of EDITS output file An example of RTE pair from the RTE2 test set annotated by the system is the following lt pair task IE score 0 6470588235294118 original YES id 87 entailment YES confidence 0 16421568627450975 gt 10 B Magnini M Kouleykov E Cabrio Beside the annotation of the entailment relation i e YES NO for each pair the system provides the entailment score calculated by the algorithm and the confidence score of the entailment relation assignment i e the distance from the threshold 5 At the end of the test phase a summary of the system s performance over the test set is printed on screen This summary reports i the accuracy of the annotation of the whole test set ii separate precision recall and F measure scores for the YES and NO test pairs For instance the following is the evaluation made on the RTE2 test set using the configuration confl aml Test Result FEO kk kkk kkk k Accuracy 0 5525 FE A kk kkk kkk k Entailment Class YES Number of Examples 400 Precision 0 54375 Recall 0 6525 FMeasure 0 5931818181818181 Classified as NO 139 YES 261 oaao A k kk kkk kkk k Entailment Class NO Number of Examples 400 Precision 0 565625 Recall 0 4525 FMeasure 0 5027777777777779 Classified as NO 181 YES 219 FO A kk kkk k 3 EDITS Text Annotation Format The entailment corpus is represented in EDITS according to ETAF EDITS Text Annotation Format d
19. all the information about T and H derived from their linguistic processing e g part of speech syntactic structure etc is available for defining conditions and cost functions As an example typical constraints involve checking the token and the part of speech of A and B while typical cost functions are computed considering the lexical similarity between A and B possibly normalizing such value over the length of T and H 5 1 1 Data Types Basic elements for defining constraints and costs in a cost scheme are derived from the three represen tation levels defined in ETAF see Section 3 EDITS provides functions for the most relevant elements defined in the ETAF linguistic representation The arguments of such functions are the variables i e A B T H which are instantiated within a specific cost scheme Functions use the following primitive objects data types hierarchically arranged in Figure 5 1 Word represents a token in T and H and it is instantiated by the variables A and B in a cost scheme 2 Node represents a tree element in T and H and it is instantiated by the variables A and B in a cost scheme 3 Tree represents a syntactic tree it is obtained using the function tree SOR SSS SS 11 EDITS Edit Distance Textual Entailment Suite 17 Object ee A ETAF Object Number File Boolean Character Rules Object HashFile SetFile Ze ee Simple Rules
20. ation All nodes except the root of the tree are re labeled in such way Edit operations on single nodes are defined in the following way Insertion insert a node A from the dependency tree of H into the dependency tree of T When a node is inserted it is attached with the dependency relation of the source label Deletion delete a node A from the dependency tree of T When A is deleted all its children are attached to the parent of A It is not required to explicitly delete the children of A as they are going to be either deleted or substituted on a following step Substitution change the label of a node A in the source tree into a label of a node B of the target tree In case of substitution the relation attached to the substituted node is changed with the relation of the new node 4 4 How to configure an edit distance algorithm EDITS allows to select a certain edit distance algorithm through a module declaration in the ECF As an example the declaration below extracted from the pre defined configuration file conf2 cml reported in Section 2 2 allows to use a tree edit distance algorithm lt conf gt lt module type entailment engine gt lt Active modules in the configuration gt lt mlink idref tree edit distance gt lt module gt lt module id tree edit distance type distance algorithm className eu fbk hlt edits distance algorithms TreeEditDistance gt lt conf gt 5 Defining Cost Schemes
21. ation of EDITS files and directories 2 2 Configure EDITS The simplest way for getting started with EDITS is to choose a pre defined EDITS Configuration File ECF among those available in the edits share configurations directory As an example the follow ing is the configuration file conf2 zml lt conf gt lt module type entailment engine gt lt Active modules in the configuration gt lt mlink idref tree edit distance gt lt mlink idref xml cost scheme gt lt mlink idref entailment rules wordnet gt lt module gt 4 The system can be used with SUN Java 5 by adding the Java Architecture for XML Binding JAXB 2 0 5 jars https jaxb dev java net to the lib directory EDITS Edit Distance Textual Entailment Suite 5 lt module id tree edit distance type distance algorithm className eu fbk hlt edits distance algorithms TreeEditDistance gt lt module id xml cost scheme type cost scheme className eu fbk hlt edits distance cost scheme XmlCostScheme gt lt attribute name scheme file gt EDITS_PATH share cost schemes simple rules scheme xml lt attribute gt lt module gt lt module id entailment rules wordnet type rules repository className eu fbk hlt edits rules DefaultRulesRepository gt lt option name entailment rules gt EDITS_PATH share rules entailment rules wordnet xml lt option gt lt module gt lt conf gt In the upper part of the file enta
22. aultRulesRepository gt lt option name entailment rules id entailment rules wordnet gt EDITS_PATH share rules entailment rules wordnet xml lt option gt lt option name contradiction rules id contradiction rules wordnet gt EDITS_PATH share rules contradiction rules wordnet xml lt option gt lt module gt lt conf gt 6 3 Rule activation The basic way for activating a rule is through one of the two functions entail and contradict which can be used within a cost scheme see Section 5 The two functions check for the existence of an entailment or a contradiction rule between the values assumed by A and B in a cost schema If a rule exists in the specified repository which matches with both A and B then the probability associated to the rule is returned otherwise null The two functions accept four parameters 1 The first two parameters X and Y are portions of T and H managed by the distance algorithm and the cost scheme 2 The name of a set of rules in the rules repository where the search has to be carried on This parameter is optional 3 The search modality Two search modalities are allowed First which selects the first rule that matches the X and Y parameters Max which selects the rule that matches X and Y and with the highest probability The search modality is cab also be declared as a default strategy at the level of configuration file As an example the following call at the entail function en
23. bin edits train rte etaf morpho syntax RTE3_dev xml modelRTE3dev algorithm token full RTE3dev out scheme share cost schemes simple scheme xml bin edits train conf share configurations conf1 xml rte etaf morpho syntax RTE3_dev xml modelRTE3dev At the end of the training phase a summary of the system s performance over the training set is printed on screen This summary reports i the distance model including the distance threshold ii the accuracy of the annotation of the whole training set iii separate precision recall and F measure scores for the YES and NO training pairs 8 B Magnini M Kouleykov E Cabrio Calculated Threshold FE OR A kk kk 0 7948717948717948 FE RA k k A RK AK Training Result AKK kk 3k IO ak 3k 3k ak 3k 3k ak 3K 3K k 3K KK KKK K Accuracy 0 61 Fkk kk 3k ak OGIO IRI IIE Entailment Class YES Number of Examples 412 Precision 0 6243781094527363 Recall 0 6092233009708737 FMeasure 0 6167076167076166 Classified as YES 251 NO 161 Ak kk 3k IO IGOR IO AIR K K K Entailment Class NO Number of Examples 388 Precision 0 5954773869346733 Recall 0 6108247422680413 FMeasure 0 6030534351145037 Classified as YES 151 NO 237 Fkk k kk 3k ak ak 3K AOR IORI IRI K K K 2 5 Test EDITS Once a distance model is available EDITS can be run over a test RTE file using the bin edit command with the test option The input parameter is the path of the test_set file which has to
24. distance just one of them i e tree edit distance is ed by the entailment engine and will be actually activated while running the system In order to change the module to be activated the dref of the corresponding mlink should be simply substituted type entailment engine gt lt mlink idref tree edit distance gt lt mlink idref xml cost scheme gt lt mlink idref entailment rules wordnet gt gt id token edit distance type distance algorithm className eu fbk hlt edits distance algorithms TokenEditDistance gt id string edit distance type distance algorithm className eu fbk hlt edits distance algorithms StringEditDistance gt id tree edit distance type distance algorithm className eu fbk hlt edits distance algorithms TreeEditDistance gt id xml cost scheme type cost scheme className eu fbk hl1t edits distance cost scheme XmlCostScheme gt lt attribute name scheme file gt EDITS_PATH share cost schemes simple scheme xml lt attribute gt gt id entailment rules wordnet type rules repository className eu fbk hlt edits rules DefaultRulesRepository gt lt option name entailment rules gt EDITS_PATH share rules entailment rules wordnet xml lt option gt gt EDITS Edit Distance Textual Entailment Suite 27 lt conf gt It is also possible to have a mixed configuration mixing both in line and out line definitions 7 4 Pre defined configurations EDIT
25. e 7 output The annotated RTE file files annotator name The name of the linguistic tool used for annotation textpro stanford parser optional words Separates a text into words using a tokenizer language lang The language of the entailment corpus English default Italian optional conf path Configuration file path optional verbose level The verbose level of system messages O default 1 2 optional overwrite Overwrites the output files if they already exist optional As an example the following command annotates the development set of RTE2 with TextPro and puts the output of the process i e the annotated file in the directory rte etaf morpho syntax bin annotate annotator textpro overwrite rte etaf string RTE2_dev xml RTE2_dev_textpro xml Obviously the same annotation process using TextPro should be performed also on the test set 2 4 Train EDITS Once EDITS has been configured and the RTE datasets have been annotated the system can be run over a training dataset The command bin edits can be used both for training with the train option and testing with the test option The input is a training set file or directory containing an RTE corpus already annotated with the ETAF format the output the model path is the path to a file specified by the user where EDITS will store the obtained distance model In addition the bin edits command allows for a number of options reported below Usage edits train option
26. e cost of insertion to 10 the cost of deletion to 10 and the cost of substitution to 0 when the two tokens to be substituted referred by the functions attribute token A and attribute token B respectively are the same and to 20 when the two tokens are different For more details on cost schemes see Section 5 lt schemes gt lt scheme gt lt A simple scheme for cost function calculation gt lt Valid constant types are number boolean string and file gt lt constant name INSERTION type number value 10 gt lt constant name DELETION type number value 10 gt lt constant name SUBSTITUTION type number value 20 gt lt insertion name insertion gt lt cost gt INSERTION lt cost gt lt insertion gt lt deletion name deletion gt lt cost gt DELETION lt cost gt lt deletion gt lt substitution name equal gt lt condition gt equals attribute token A attribute token B lt condition gt lt cost gt 0 lt cost gt lt substitution gt lt substitution name not equal gt 6 B Magnini M Kouleykov E Cabrio lt condition gt not equals attribute token A attribute token B lt condition gt lt cost gt SUBSTITUTION lt cost gt lt substitution gt lt scheme gt lt schemes gt Concerning entailment and contradiction rules each rule is defined with a left hand context the text T and a right hand context the hypothesis H and a probability that th
27. e rule preserves either the entailment or the contradiction between T and H As an example the following two rules state that invent entails pioneer with probability 1 and that beautiful contradicts ugly with probability 1 lt rule name 1 gt lt t gt lt string gt invent lt string gt lt t gt lt h gt lt string gt pioneer lt string gt lt h gt lt probability gt 1 0 lt probability gt lt rule gt lt rule name a gt lt t gt lt string gt beautiful lt string gt lt t gt lt h gt lt string gt ugly lt string gt lt h gt lt probability gt 1 0 lt probability gt lt rule gt EDITS provides a pre defined file of entailment rules extracted from WordNet i e the synonyms and the hypernyms of the words contained in the RTE datasets This file can be found in the directory edits share rules 2 3 Preparing the RTE datasets Both the input and the output of EDITS are RTE datasets i e set of T H pairs expressed in the same XML format as defined in the RTE evaluation campaign An example of a RTE pair from the RTE2 development set is the following lt pair id 61 entailment YES task IE gt lt t gt Although they were born on different planets Oscar winning actor Nicolas Cage s new son and Superman have something in common both were named Kal el lt t gt lt h gt Nicolas Cage s son is called Kal el lt h gt lt pair gt The edit distance algorithms used by EDITS often require that the RTE dataset is p
28. ely independent from any linguistic annotation lt hAnnotation gt lt string gt Edison invented the Kinetoscope lt string gt lt hAnnotation gt EDITS Edit Distance Textual Entailment Suite 11 lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ATTLIST entailment corpus pair gt pair t h tAnnotation hAnnotation gt t PCDATA gt h PCDATA gt tAnnotation string word tree semantics gt hAnnotation string word tree semantics gt string PCDATA gt word attribute gt attribute PCDATA gt tree node edge gt node word label gt label PCDATA gt edge EMPTY gt semantics entity relation gt entity EMPTY gt relation EMPTY gt pair id CDATA REQUIRED gt lt ATTLIST lt ATTLIST lt ATTLIST lt ATTLIST entailment YES NO UNKNOWN REQUIRED task CDATA IMPLIED length CDATA IMPLIED lt ATTLIST tree lt ATTLIST node lt ATTLIST edge gt name CDATA from CDATA tAnnotation id CDATA IMPLIED gt hAnnotation id CDATA IMPLIED gt word id CDATA IMPLIED gt attribute name CDATA REQUIRED gt id CDATA IMPLIED gt id CDATA IMPLIED gt IMPLIED REQUIRED to CDATA REQUIRED lt ATTLIST relation gt name CDATA REQUIRED source CDATA IMPLIED
29. en Edit Distance At this level the three edit operations are defined over sequences of tokens of T and H Cost functions may consider properties of tokenized sentences e g sentence length and rules may refer to morpho syntactic properties of tokens such as part of speech lemma morphological features and named entities This level of linguistic annotation is referred in ETAF as morpho syntax and is detailed in Section 3 2 14 B Magnini M Kouleykov E Cabrio As a distance algorithm at the morpho syntax level EDITS provides a token based version of the Levenshtein distance algorithm 5 implemented in the class LinearEditDistance 4 3 Tree Edit Distance At this level the three edit operations are defined over single nodes of a syntactic representation of T and H Cost functions may consider properties of syntactic trees e g tree structure and rules may refer to syntactic relations among nodes such as dependency relations This level of linguistic annotation is referred in EDITS as syntax and is detailed in Section 3 3 As tree edit distance algorithm the current version of EDITS implements the Zhang Shasha algorithm 8 in the class TreeEditDistance Since the Zhang Shasha algorithm does not consider labels on edges while dependency trees provide them each dependency relation R from a node A to a node B has been re written as a complex label B R concatenating the name of the destination node and the name of the rel
30. epository This is used for logging purposes only in order to help the user to understand which rules have been applied by the system for a certain pair If not provided by the user the rule name is automatically generated by the system 2 t a text T i e the left hand side of the rule h a hypothesis H i e the right hand side of the rule 4 rule probability a probability that the rule maintains either the entailment or the contradiction between T and H Both in entailment and contradiction rules a probability equal to 0 means that the relation between T and H is unknown while a probability equal to 1 means that the entail ment contradiction between T and H is fully preserved ew 6 1 Rule format Both T and H can be defined using the Edits Text Annotation Format ETAF described in Section 3 ETAF allows text portions to be represented at three different levels of annotation just as strings i e the STRING object as sequences of tokens with their morpho syntactic features i e the WORD object and as syntactic trees e g the NODE object Rules in EDITS can be defined using the three datatypes provided that they are used consistently in T and H i e either STRING or WORD or NODE The XML Document Type Definition DTD of the rules file is reported in Figure 10 In the current release of EDITS only rules that contain just one element both in t and h i e lexical rules are allowed 6 1 1 Entailment rules Entailment r
31. ertain dataset Only modules defined in a EDITS Configuration File ECF can be used for training and testing with the command bin edits see Sections 2 4 and 2 5 A module may require that another module is defined in order to work Such dependencies are ex pressed in a configuration file through nested modules The whole EDITS configuration is considered itself a module the more global one called the entailment engine which requires three nested mod ules respectively for the distance algorithm the cost scheme and the rules repository The XML Document Type Definition DTD of the configuration file is reported in Figure 11 while the following is a simple configuration file where the whole EDITS is defined as a module of type entailment engine which in turn needs three modules referred with through their identifiers in the mlink elements lt ELEMENT conf constant module gt lt ELEMENT module module attribute option mlink gt lt ELEMENT attribute PCDATA gt lt ELEMENT option PCDATA gt lt ELEMENT mlink PCDATA gt lt ATTLIST module type CDATA IMPLIED id CDATA IMPLIED className CDATA IMPLIED lt ATTLIST attribute name CDATA REQUIRED id CDATA IMPLIED gt lt ATTLIST option name CDATA REQUIRED id CDATA IMPLIED lt ATTLIST mlink idref CDATA REQUIRED Fig 11 DTD of the configuration file lt conf gt lt module type entailment engine gt EDITS Edit Dista
32. eturn true in order the cost scheme to be applied 3 Cost A fixed value or a function that returns a numerical value expressing the cost of the edit operation applied to the source and to the target A cost function can consider as parameters the source element the target element the text T and the hypothesis H EDITS adopts a combination of XML annotations and functional expressions to define the cost schemes The XML Document Type Definition DTD of the cost scheme file is reported in Figure 4 As a simple example EDITS provides a pre defined cost scheme file simple scheme aml introduced in Section 2 2 where costs for the three edit operations are defined in the following way lt schemes gt lt scheme gt lt A simple scheme for cost function calculation gt lt Valid constant types are number boolean string and file gt lt constant name INSERTION type number value 10 gt lt constant name DELETION type number value 10 gt lt constant name SUBSTITUTION type number value 20 gt lt insertion name insertion gt lt cost gt INSERTION lt cost gt lt insertion gt lt deletion name deletion gt lt cost gt DELETION lt cost gt lt deletion gt lt substitution name equal gt lt condition gt equals attribute token A attribute token B lt condition gt lt cost gt 0 lt cost gt lt substitution gt lt substitution name not equal gt lt condition gt not equals at
33. for Edit Operations According to the distance based approach T entails H if there exists a sequence of transformations applied to T such that we can obtain H with an overall cost below a certain threshold The underlying assumption is that pairs between which an entailment relation holds have a low cost of transformation EDITS allows for the definition of the cost for each edit operation carried out by the distance algorithm in order to find the best i e less costly sequence of edit operations that transforms T into H The basic data structure in EDITS for the definition of costs is the cost scheme One or more cost schemes can be associated to each edit operation and they are collected in a cost scheme file that can be created by the user A cost scheme is invoked by the edit distance algorithm with three parameters i an edit operation ii an element of T called the source and referred through the variable A and iii an element of H called the target and referred through the variable B Each cost scheme for a certain edit operation consists of three parts EDITS Edit Distance Textual Entailment Suite 15 1 Name Every cost scheme must have a user defined unique name 2 Condition A set possibly empty of constraints over the source and the target elements which need to be satisfied in order to activate the cost scheme Each constraint is expressed in a lisp like syntax and all constraints must be satisfied i e they have to r
34. hat are available can easily be used within EDITS resulting in a flexible modular and extensible approach to Textual Entailment Given a certain configuration of its three basic components EDITS can be trained over a specific RTE dataset in order to optimize its performance In the training phase EDITS produces a distance model for the dataset which includes a distance threshold S 0 lt S lt K that best separates the positive and negative examples in the training data During the test phase EDITS applies the threshold S so that pairs resulting in a distance below S are classified as YES while pairs above S are classified as NO Given the edit distance ED T H for a T H pair a normalized entailment score is finally calculated by EDITS using the following formula ED T H ED T ED H 1 entailment T H where ED T H is the function that calculates the edit distance between T and H and ED T ED _ H is the distance equivalent to the cost of inserting the entire text of H and deleting the entire text of T The entailment score has a range from 0 when T is identical to H to 1 when T is completely different form H A detailed description of the edit distance approach implemented in EDITS can be found in 4 2 Getting Started with EDITS This Section provides a quick introduction to the main functionalities of the EDITS package More details can be found in the rest of the document After having installed
35. ilment engine the modules that EDITS will actually use in the cur rent configuration are declared using a reference to their IDs the algorithm mlink idref tree edit distance the cost scheme mlink idref xml cost scheme and the rules mlink idref entailment rules wordnet In the rest of the ECF there is a list of definitions of the modules that the system can use Each module is referred by its name id and its function type The attribute of the modules can contain information about resources used by the modules or information about their behaviour For instance the attribute scheme file of the xml cost scheme module specifies the path to the directory containing the cost scheme It is possible to list different modules of the same type e g different distance algorithms and to choose the one you want to use changing the idref in the upper part of the configuration file N B It is necessary to list at least one distance algorithm and one cost scheme More details on ECF can be found in Section 7 Concerning distance algorithms EDITS provides as pre defined both string token and tree edit distance algorithms for more details see Section 4 Pre defined cost schemes i e the module that de termines the cost of the three edit operations are available in the directory edits share cost schemes As an example of cost scheme the one reported below i e the simple scheme xml used in conf1 xml allows to set th
36. nce Textual Entailment Suite 25 lt mlink idref tree edit distance gt lt mlink idref xml cost scheme gt lt mlink idref entailment rules wordnet gt lt module gt lt module id tree edit distance type distance algorithm className eu fbk hlt edits distance algorithms TreeEditDistance gt lt module id xml cost scheme type cost scheme className eu fbk hlt edits distance cost scheme XmlCostScheme gt lt attribute name scheme file gt EDITS_PATH share cost schemes simple scheme xml lt attribute gt lt module gt lt module id entailment rules wordnet type rules repository className eu fbk hlt edits rules DefaultRulesRepository gt lt option name entailment rules gt EDITS_PATH share rules entailment rules wordnet xml lt option gt lt module gt lt conf gt 7 1 Module Configuration Modules are defined by the following pieces of information className attribute a path to the Java class of the module referring to the code that will be executed when the module is activated id attribute a unique identifier for the module assigned by the user type attribute indicates the category of the module being defined Accepted values for the type attribute are entailment engine distance algorithm cost scheme rules repository Such types are also accepted by the bin edits command attribute element used to set the parameters of the module option element used t
37. o set the options of the module mlink element links to nested modules i e submodules that are required by the module being defined As an example in addition to those presented in the previous Section the following is the configuration of a module defining the use of TextPro an interface to a text annotator offered by EDITS see Section 2 3 The module is defined by the id its class path and a number of parameters that will be used to run TextPro lt module id textpro className eu fbk hlt annotation textpro TextPro gt lt attribute name script gt home epack scripts textpro lt attribute gt lt attribute name tempDir gt home epack temp lt attribute gt lt attribute name tempFile gt question pure lt attribute gt lt attribute name extension gt txp lt attribute gt lt attribute name encoding gt Latin1 lt attribute gt lt attribute name language gt Italian lt attribute gt lt module gt 7 2 Usage of Constants EDITS allows that the values of the attributes which are frequently used in the configuration file can be referred through the use of constants declared at the beginning of the configuration file For example the TextPro configuration module showed above could be re written with the use of a constant in the following way so that the location of TextPro can be defined only once and can be easily changed if the package is later relocated 26 B Magnini M Kouleykov E Cabrio lt conf gt
38. ocumented in share zsd etaf xsd The format is used both for representing the input T H pairs and for representing entailment and contradiction rules ETAF allows to represent texts at three different levels of annotation i as simple strings ii as sequences of tokens with their associated morpho syntactic properties iii as syntactic trees with structural relations among nodes The three levels are considered in increasing complexity and levels of higher complexity assume the levels of lower complexity For instance texts annotated at syntactic level assume that the texts are also tokenized At each level a number of properties and their values can be introduced by the user through the use of the attribute element For instance the part of speech of a certain token can be expressed by an attribute pos associated to a certain word element Values of properties are retrieved by means of the attribute function The annotation is considered as a preliminary step before the actual use of the system and both entailment pairs and entailment contradiction rules must be defined according to this format The XML Document Type Definition DTD of the ETAF is reported in Figure 3 3 1 ETAF String level The first level of annotation represents texts as strings i e as a sequence of characters with their properties As an example the text Edison invented the Kinetoscope is simply represented as the sequence of its characters This level is complet
39. of direct object with its syntactic head e g John bought an habitation where habitation is the direct object of the verb to buy with a probability equal to 1 0 6 1 2 Contradiction rules Contradiction rules represent with some degree of confidence the seman tic incompatibility between T and H The following are examples of contradiction rules at the different levels allowed by ETAF String level lt rule name a gt lt t gt lt string gt beautiful lt string gt lt t gt lt h gt lt string gt ugly lt string gt lt h gt lt probability gt 1 0 lt probability gt lt rule gt The contradiction rule above states that the string beautiful contradicts the string ugly with a prob ability equal to 1 0 Morpho Syntax level lt rule name b gt lt t gt lt word gt lt attribute name token gt extend lt attribute gt lt word gt lt t gt lt h gt lt word gt lt attribute name token gt shorten lt attribute gt lt word gt lt h gt lt probability gt 1 0 lt probability gt lt rule gt The contradiction rule above states that the lemma extend contradicts the lemma shorten with a probability equal to 1 0 Syntactic level lt rule name c gt lt t gt lt node gt lt attribute name edge to parent gt amod lt attribute gt lt word gt lt attribute name token gt white lt attribute gt lt word gt lt node gt lt t gt lt h gt lt node gt lt attribute name edge to parent gt amod lt attribute gt l
40. ribute gt s name pos gt VVD lt attribute gt lt word id 2234 gt lt attribute lt attribute lt attribute lt attribute lt attribute lt word gt name full_morpho gt thet adv thetart lt attribute gt name wnpos gt lt attribute gt name token gt the lt attribute gt name lemma gt the lt attribute gt name pos gt ATO lt attribute gt lt word id 2235 gt lt attribute lt attribute lt attribute lt attribute lt word gt name wnpos gt n lt attribute gt name token gt Kinetoscope lt attribute gt name lemma gt kinetoscope lt attribute gt name pos gt NN1 lt attribute gt lt word id 2236 gt lt attribute lt attribute lt attribute lt attribute lt attribute lt attribute lt word gt lt hAnnotation gt name full_morpho gt punc lt attribute gt name sentence gt amp 1t eoskgt lt attribute gt name wnpos gt lt attribute gt name token gt lt attribute gt name lemma gt lt attribute gt name pos gt PUN lt attribute gt 3 3 ETAF Syntactic Level The third level of annotation represents texts as syntactic trees with their structural features Both nodes terminal and non terminal and edges with syntactic relations are represented Nodes are typically described with their morpho syntactic properties The example below shows the output of the Stanford Parser for the sentence Edison invented the Kinetoscope converted into ETAF lt hAnno
41. rocessed by linguistic tools For instance since a tree edit distance algorithm works over the syntactic structure of T and H both training and test data need to be processed by a parser EDITS provides an internal format called ETAF see Section 3 for representing linguistic annotations The format is enough flexible and extensible to accommodate most of the morpho syntactic and syntactic information useful for Textual Entailment and the user should provide converters from the format of the linguistic tools he she wants to the ETAF format In order to facilitate its use EDITS provides external interfaces for a couple of existing tools TextPro 7 a suite of text processors for Italian and English tokenizer lemmatizer named entities recognizer and the Stanford Parser 3 a dependency parser for English Such interfaces are provided separately from the system as plugins and they do not contain the annotation tools which have to be obtained separately and installed according to the readme instructions of the interfaces In order to be used external interfaces have to be installed and then the annotation process is called over a certain RTE dataset through the command bin annotate Here below the command usage with the required parameters Usage annotate option value input output input The file files containing the RTE pairs 5 Interfaces are downloadable from http edits fbk eu EDITS Edit Distance Textual Entailment Suit
42. t T and an Hypothesis H and outputs one of two judgements either YES or NO supported by a confidence score EDITS implements a distance based approach for recognizing textual entailment which assumes that the distance between T and H is a characteristic that separates the positive T H pairs for which the entailment relation holds from the negative pairs for which the entailment relation does not hold More specifically EDITS is based on edit distance algorithms and computes the T H distance as the cost of the edit operations i e insertion deletion and substitution that are necessary to transform T into H The edit distance approach implemented in EDITS is based on three core modules an edit distance algorithm which calculates the minimal set of edit operations that transform T into H a cost scheme which defines the cost associated to each edit operations optional sets of rules both entailment rules and contradiction rules providing specific knowledge e g lexical syntactic semantic about allowed transformations between portion of T and H Each module can be configured by the user through a configuration file i e the EDITS configuration file ECF EDITS can work at different levels of complexity depending on the linguistic analysis carried out over T and H An internal representation format called ETAF EDITS Text Annotation Format is defined such that both linguistic processors and semantic resources t
43. t word gt lt attribute name token gt black lt attribute gt lt word gt lt node gt lt h gt lt probability gt 1 0 lt probability gt lt rule gt EDITS Edit Distance Textual Entailment Suite 23 The contradiction rule above states that the node white in the dependency relation of adjectival modifier with its syntactic head e g Mary wears a white T shirt where white is the adjective modifying the noun T shirt contradicts the node black in the dependency relation of adjectival modifier with its syntactic head e g Mary wears a black T shirt where black is the adjective modifying the noun T shirt with a probability equal to 1 0 6 2 Rules repository In order to be used by EDITS both entailment and contradiction rules have to be stored in a rule repository EDITS allows to declare and use multiple XML rule files as sets of entailment or contradiction rules that can be refereed using user defined identifiers As an example the declaration below defines a rule repository called wordnet rules which contains two rule files called respectively entatlment rules wordnet and contradiction rules wordnet An example of entailment rule file can be found in edits share rules lt conf gt lt module type entailment engine gt lt Active modules in the configuration gt lt mlink idref wordnet rules gt lt module gt lt module id wordnet rules type rules repository className eu fbk hlt edits rules Def
44. tail X Y entailment rules wordnet max searches for the rule with the highest probability among those that are activated by the A and B parameters and that are contained in a rules set in the rules repository called entailment rules wordnet A rule is activated when the X parameter of the entail contradiction function matches with the T part of the rule i e the left hand side and the Y parameter of the function matches with the H part of the rule ie the right hand side All the elements of the X Y argument have to match against all 24 B Magnini M Kouleykov E Cabrio elements of the rule In case the rule contains variables their assignments to corresponding elements of the X Y argument need to be satisfied The entail and contradict functions are called in cost schemes typically the cost scheme defined for the substitution edit operation As an example the following is a cost scheme that calculates the cost of substituting A with B based on the inverse of the probability of the entailment rule between A and B in the repository called entailment rules wordnet lt substitution name entail gt lt cost gt 1 entail A B entailment rules wordnet max lt cost gt lt substitution gt 7 EDITS configuration file The purpose of a configuration file is to define the three modules i e distance algorithm cost scheme and rule repositories and their corresponding parameters that will be actually used while running EDITS on a c
45. tation gt lt string gt Edison invented the Kinetoscope lt string gt lt tree root 2 gt lt node id 1 gt lt word id 1 gt lt attribute name token gt Edison lt attribute gt lt attribute name lemma gt Edison lt attribute gt lt attribute name pos gt NNP lt attribute gt EDITS Edit Distance Textual Entailment Suite 13 lt word gt lt node gt lt node id 2 gt lt word id 2 gt lt attribute name token gt invented lt attribute gt lt attribute name lemma gt invent lt attribute gt lt attribute name pos gt VBD lt attribute gt lt word gt lt node gt lt node id 3 gt lt word id 3 gt lt attribute name token gt the lt attribute gt lt attribute name lemma gt the lt attribute gt lt attribute name pos gt DT lt attribute gt lt word gt lt node gt lt node id 4 gt lt word id 4 gt lt attribute name token gt Kinetoscope lt attribute gt lt attribute name lemma gt Kinetoscope lt attribute gt lt attribute name pos gt NN lt attribute gt lt word gt lt node gt lt edge to 1 name nsubj from 2 gt lt edge to 3 name det from 4 gt lt edge to 4 name dobj from 2 gt lt tree gt lt hAnnotation gt 4 Using Edit Distance Algorithms The EDITS package estimates the entailment relation among two text portions assuming that the en tailment probability is inversely proportional with respect to the cost of transforming T into H
46. the rule exists then the probability associated to the rule is returned otherwise the output of the function is null Functions over Trees nodes Tree returns the list of nodes of a tree parent Node Tree returns the parent of a node in the syntactic tree of T or H children Node Tree returns the children of a node in the syntactic tree of T or H Functions over Nodes word Node returns the word of the node label Node returns the label i e the syntactic category of the node edge Node returns the edge i e the syntactic relation entering in the node is label node Node returns true if the node contains a label is word node Node returns true if the node contains a word A EDITS Edit Distance Textual Entailment Suite 19 Functions over Words attribute String Word returns the value of the attribute String of the word If the attribute is missing the function returns null Functions with string arguments equals String Stringz Returns True if String is equal to Stringz equals ignore case String Stringz compares two strings ignoring their case capitalized String returns true if the string is capitalized starts with String Stringz returns True if String starts with Stringz For instance starts with reading read is True ends with String Strings returns True if String ends with Stringz contains
47. tribute token A attribute token B lt condition gt lt cost gt SUBSTITUTION lt cost gt lt substitution gt lt scheme gt lt schemes gt lt schemes gt This cost scheme applies to elements of T and H referred respectively as A and B which are annotated as word see ETAF in Section 3 The function attribute token A returns the token of the source element Within the example there are four cost schemes 1 for insertion 1 for deletion and 2 for substitution expressing the following costs insertion B 10 inserting an element B no matter what B is always costs 10 deletion A 10 deleting an element A no matter what A is always costs 10 substitution A B 0 if A B substituting A with B costs 0 if the token of A and the token of B are equal i e they are the same string substitution A B 20 if A B substituting A with B costs 20 if the tokens of A and B are not equal Here below is a more complex example of a cost scheme for the substitution operation lt substitution name equal gt lt condition gt and equals attribute lemma A attribute lemma B equals attribute pos A attribute pos B lt condition gt lt cost gt 0 lt cost gt lt substitution gt In the example a token from T is substituted by a token from H with a cost equal to 0 if their lemmas and part of speech are equal 16 B Magnini M Kouleykov E Cabrio As shown in the previous examples
48. ts test rte etaf morpho syntax RTE3_test xml algorithm token EDITS Edit Distance Textual Entailment Suite 9 scheme share cost schemes simple scheme xml model modelRTE3dev bin edits test conf share configurations conf1 xml model modelRTE3dev rte etaf morpho syntax RTE3_test xml The XML Document Type Definition DTD of the output as specified by the options full and simple of EDITS is reported in Figure 2 lt ELEMENT output pair statistics gt lt ELEMENT pair operations gt lt ELEMENT operations operation gt lt ELEMENT operation EMPTY gt lt ELEMENT statistics threshold elementClass gt lt ELEMENT threshold EMPTY gt lt ELEMENT elementClass classifiedAs gt lt ELEMENT classifiedAs EMPTY gt lt ATTLIST pair id CDATA REQUIRED entailment CDATA REQUIRED original CDATA REQUIRED score CDATA REQUIRED confidence CDATA REQUIRED distance CDATA REQUIRED normalization CDATA REQUIRED task CDATA REQUIRED lt ATTLIST operations cost CDATA REQUIRED gt lt ATTLIST operation cost CDATA REQUIRED type CDATA REQUIRED source CDATA IMPLIED target CDATA IMPLIED scheme CDATA REQUIRED lt ATTLIST statistics accuracy CDATA REQUIRED gt lt ATTLIST elementClass name CDATA REQUIRED number CDATA REQUIRED precision CDATA REQUIRED recall CDATA REQUIRED fmeasure CDATA REQUIRED lt ATTLIST classifiedAs name CDATA REQUIRED number CDATA REQUIRED Fig 2 D
49. ules preserve with some degree of confidence the entailment relation between T and H The following are examples of entailment rules at the different levels allowed by ETAF String level lt rule name 1 gt lt t gt lt string gt invented lt string gt lt t gt lt h gt lt string gt pioneered lt string gt lt h gt lt probability gt 1 0 lt probability gt lt rule gt EDITS Edit Distance Textual Entailment Suite 21 lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ELEMENT lt ATTLIST lt ATTLIST lt ATTLIST lt ATTLIST lt ATTLIST lt ATTLIST lt ATTLIST lt ATTLIST rules rule gt rule t h probablity gt t string word tree semantics gt h string word tree semantics gt string PCDATA gt word attribute gt attribute PCDATA gt tree node edge gt node word label gt label PCDATA gt edge EMPTY gt semantics entity relation gt entity EMPTY gt relation EMPTY gt rule name CDATA REQUIRED gt t id CDATA IMPLIED gt h id CDATA IMPLIED gt word id CDATA IMPLIED gt attribute name CDATA REQUIRED gt tree id CDATA IMPLIED gt node id CDATA IMPLIED gt edge name CDATA IMPLIED from CDATA REQUIRED to CDATA REQUIRED gt lt ATTLIST relation name CDATA REQUIRED source CDATA IMPLIED gt
50. urce software from the EDITS web site http edits fbk eu Acknowledgments Work on EDITS has been partially supported by the EU FP6 QALL ME project http qallme fbk eu We thank Matteo Negri and Yashar Mehdad who contributed to test the system and to revise this document Table of Contents EDITS Edit Distance Textual Entailment Suite 0 0 cece eee 1 B Magnini M Kouleykov E Cabrio Introductionias ala bas Ga ba eves oleae ahs A a eae lak Why 6 lima haaa 3 Getting Started with EDITS crisser ec ccc cee eee eee eee eee a eens 3 2 1 Teste EDI St A eae by eet ae ate vada 4 2 2 Configure EDETS ta ee Gente acest yen oath a ls Sante ele asad douse hes a 4 2 3 Preparing the RTE datasets 0 cee eee n eens 5 pe Nias DIAS o O eR eR en ee ee RN 6 Zo Test EDITS 0 ee tet A A A Ui aha AA AAA Sg A ae 7 EDITS Text Annotation Format ois eisd eura enera a AEE eee t teen tenet nn eens 10 SA ETAER String level bus lesa bi ted 11 3 2 ETAF Morpho Syntax Level oooooooooccrrocccr cent cence enn enes 11 3 3 ETAR Syntactic Level meva seg 1 ae a AP DAE a ti 12 Using Edit Distance Algorithms iere ae T E E n tenet LEANA 12 AAs Strine Edit DISCO CE stas alto rad A at to IA a aa AA 13 4 2 Token Edit Distancen viii dt de ae ie PA ee ae iia 13 E Edit Distance AAA O RN 13 4 4 How to configure an edit distance algorithm 0 00 cee ee cece eee 13 Defining Cost Schemes for Edit Operations 0 cece eens 14 5 1
51. using the Stanford parser The configuration has been experimented using the RTE1 RTE2 and RTE3 datasets for training and RTE 4 for testing References Ido Dagan Oren Glickman 2004 Probabilistic Textual Entailment Generic Applied Modeling of Language Variability in Proceedings of the PASCAL Workshop of Learning Methods for Text Understanding and Mining Grenoble France Ido Dagan Oren Glickman Bernardo Magnini 2005 The PASCAL Recognizing Textual Entailment Challenge in Proceedings of the First PASCAL Challenges Workshop on Recognising Textual Entailment Southampton U K 11 13 April Dan Klein and Christopher D Manning 2003 Fast Exact Inference with a Factored Model for Natural Language Parsing in Advances in Neural Information Processing Systems 15 NIPS 2002 Cambridge MA MIT Press pp 3 10 Milen Kouylekov Bernardo Magnini 2005 Tree Edit Distance for Recognizing Textual Entailment in Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP Borovets Bulgaria 21 23 September Vladimir I Levenshtein 1965 Binary codes capable of correcting deletions insertions and reversals in Doklady Akademii Nauk SSSR 163 4 pages 845848 Dekang Lin 1998 Dependency based Evaluation of MINIPAR in Workshop on the Evaluation of Parsing Systems Granada Spain May Emanuele Pianta Christian Girardi Roberto Zanoli 2008 The TextPro tool suite in Proceedings of LREC 6th
Download Pdf Manuals
Related Search
Related Contents
Digitus USB 3.0 - IDE & SATA User`s Manual 販売名:ジマトロン放射線治療用シミュレータ Copyright © All rights reserved.
Failed to retrieve file