Home

Type-based Inference of Size Relationships for XML

1. and n are size annotations for the corresponding types We argue that the precision aimed at cannot be achieved The for expression has the form for x in e do e2 The expression e evaluates to a list and for each element a in that list x gets bound to a and e2 gets executed There are two approaches to typing the for expression We first look at the approach used in XQuery s type system 8 first x is bound to the union of the unit types in e1 s type 71 and e2 is typed once with x having that binding Then type constructors are added to the inferred type based on their Ti lev1 Tu levi 7 levQ n i 1 m ley2 va ee rip lev2 l lam gt T p lev2 min n j lt r lt max n 7 S P min mxn lt p lt max mxn Figure 4 Some inferred types in our first attempt to infer all size relationships precisely l levi Tig levi Tas lev1 lev2 lev2 2 lev2 3 Figure 5 Some inferred types in our next attempt to infer all size relationships precisely occurrences in T1 In inferring a type for the program in Figure 2b if Ts is annotated as shown in Figure 3b the type of expression children root on line 1 is 7 as shown in Figure 4 Fig ure 4 also shows the rest of the types we refer to in this paragraph Suppose we find the type of w Tu by taking the union of the unit types in 7 as done in XQuery s type system The type of children w on line
2. J Clark eds XML transformations XSLT version 1 0 W3C Nov 1999 URL http www w3 org TR xslt 7 Roberto Chinnici eds Web services description language WSDL version 2 0 W3C March 2004 URL http www w3 org TR wsd120 8 Scott Boag eds XQuery the W3C query language for XML W3C working draft W3C November 2003 URL http www w3 org TR xquery 9 Tim Bray eds Extensible markup language XML version 1 0 W3C February 2004 URL http www w3 org TR PR xm1 971208 ps 10 M Fernandez J Sim on and P Wadler An algebra for XML Query In FST TCS pages 11 45 2000 11 M Fitzgerald Relaxer tutorial 2003 URL 12 J Freire J R Haritsa M Ramanath P Roy and J Sim on StatiX making XML count In STGMOD pages 181 191 New York NY 10036 USA June 2002 13 Exolab Group Castor 2002 URL http castor exolab org 14 H Hosoya and B C Pierce XDuce A Typed XML Processing Language In WebDB Dallas TX 2000 15 Haruo Hosoya J r me Vouillon and Benjamin C http www relaxer org doc tutorial tutorial html 10 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Pierce Regular expression types for xml In ICFP pages 11 22 2000 J Hughes L Pareto and A Sabry Proving the correctness of reactive systems using sized types In POPL 1996 M Kempa and V Linnemann Type checking in XOBE In BTW 03
3. ing for languages such as C deal with flat arrays 1 INTRODUCTION Since XML 9 became a W3C recommendation in 1998 XML has been increasingly accepted as the standard for mat for electronic data exchange Two parties who wish to exchange data generally organize their data differently Thus one or both of the parties must transform their data so that it is suitable for the other to use In the context of XML schemas e g XML Schema 21 are used to spec ify data organization When data is exchanged using XML the recipient specifies a schema to which all received XML documents must conform The sender must write a trans formation program to convert data from his own schema to Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page To copy otherwise to republish to post on servers or to redistribute to lists requires prior specific permission and or a fee Copyright 200X ACM X XXXXX XX X XX XX 5 00 the recipient s schema If the sender can determine that his program performs the transformation correctly no run time checks are necessary This gives rise to the XML type checking problem Let T be the set of all XML documents T is a mnemonic for trees all XML documents have a tree structu
4. pages 227 246 February 2003 C Kirkegaard A M ller and M I Schwartzbach Static analysis of xml transformations in java URL http citeseer nj nec com 593778 html Sun Microsystems JAXB 2002 URL http java sun com xml jaxb T Milo D Suciu and V Vianu Typechecking for XML transformers In PODS pages 11 22 2000 M Murata D Lee and M Mani Taxonomy of XML schema languages using formal language theory In Extreme Markup Languages Montreal Canada 2001 Carlo Sartiani A framework for estimating XML query cardinality In WebDB San Diego CA 2003 D De Schreye and S Decorte Termination of logic programs The never ending story Journal of Logic Programming pages 199 260 1994 Z Su and G Wassermann A type based dimensional analysis for XQuery Technical Report CSE 2004 8 University of California Davis April 2004 URL http wwwcsif cs ucdavis edu wassermg research SizeTechRpt ps D Suciu The XML typechecking problem STGMOD Record March 2002 A Tozawa Towards static type checking for XSLT In Document Eng pages 18 27 2001 M Wallace and C Runciman Haskell and XML Generic combinators or type based translation In ICFP pages 148 159 1999 A K Wright and M Felleisen A syntactic approach to type soundness Inf Comput 115 1 38 94 1994 H Xi and F Pfenning Eliminating array bound checking through dependent types In PLDI 1998 H Xi and F Pfenning Dependent types in practic
5. String We show the sequence constructor here to make it explicit that this is a sequence type Fernandez et al s type rule which does not reason about sizes for the for expression looks roughly like r Fe r a h TW for x T1 F ez 72 I for x in e do e2 Tm We have already found 71 and at the top level 71 is a se quence type To find 72 for the second hypothesis we first use the following auxiliary rule Iw for x n F e2 ti T w for x T2 F ez T 5 F CW for Xx Ti T2 F e2 71 7 This rules types the original program fragment by finding the for expression s type as though e had type title finding the for expression s type as though e had type author and putting those two types in sequence Ts catalog book Se author title subtitle ISBN l l l Str Str Str Int Ta catalog book ge N author title subtitle ISBN Str Str Str Int Figure 11 The input type tree annotated let catO Catalog 1 titles for w in children cat0 do 2 case w of 3 wi book gt 4 for x in children wi do 5 case x of 6 xi title gt x1 T x2 gt 8 end 9 w2 gt 10 end 11 isbns for y in children cat0 do 12 case y of 13 y1 book gt 14 for z in children y1 do 15 case Z of 16 zi isbn gt z1 17 z2 gt 18 end 19 y2 gt 20 end Figure 10 A program that makes a list of all titles follo
6. a size annotation in the subtype is assigned to n an annotation in the super type The other matching records that the sum of two anno tations equals a third annotation and it is used by an aux iliary function introduced later in this section We ensure that subgoals in a subtyping goal match size annotations consistently by means of a operator NiON2 means that if n is matched in both N and N2 then Ni n Noa n i e the matchings are consistent A list as a subscript e g Ni 2O0N3 abbreviates several consistency checks e g NiON3 NoON3 A subtyping relationship N 71 lt T2 is read 7 is a subtype of 72 with size annotations matched according to N Our type inference algorithm generates constraints on size annotations and those constraints must be resolved before subtyping We only need to do a subtype check at function calls and at the end of type inference The type inference al gorithm guarantees that the constraint set will not include a path 7 at either of these times After constraint resolution size annotations which are bound to equivalent expressions are renamed so that they are identical Pair types can then be changed into choice types The algorithm then divides type trees into two parts the upper part near the root which may have relational size annotations and the lower part which must not have relational size annotations The algorithm discovers the border between the two when it en co
7. is the size name given to catalog book title Instead of re composing the structure of the input type the second group of auxiliary rules first infers a type for the unit type that causes the for expression to produce output The auxiliary rules then add a star and the size name e g n to the inferred type In this way the auxiliary rules flatten the type structure and declare that the type is repeated n times 4 3 4 The Rules Themselves The second group of auxiliary rules have names of the form ForII In rules ForII1 ForII5 7 can be instantiated to 71 72 m1 172 or lt 71 72 gt The rule FORIIU extracts the path annotation and adds the path the constraint set It also adds a star to the inferred type Hypotheses of the form Az m C require that there must not be a path in the constraint set The rule ForII removes the path from the constraint set uses the J function to map the path back to a size name and adds a constraint that multiplies the number of elements from one iteration by the number of iterations that produce output 4 4 Type Soundness We now state a type preservation theorem for our system THEOREM 1 SUBJECT REDUCTION If Fe 7 C and Etelld then It d 7 C The full proof of this theorem is given in the compan ion technical report 24 It follows the style of Wright and Felleisen 28 The d refers to data as defined in Figure 6 5 SUBTYPIN
8. subtrees of unknown size We also discuss the typing of recursive functions in Section 4 3 2 4 1 Sequence Expressions The type rule for sequence expressions is as follows T m z Preinm Cy TI Feg 72 Co I Fei e2 T1 T2 C1 U Co U n m m n is fresh This rule is straightforward the number of XML elements produced by the sequence expression as a whole is the sum of the numbers of XML elements produced by its subexpres sions The rule adds the constraint that n the size of the sequence expression equals m m2 the sum of the sizes of the subexpressions 4 2 Conditional Expressions Our type system allows the true and false branches of a conditional expression to have different types The type of the conditional expression in most type systems is a choice type composed of the types of the branches r r2 The loss of precision from this approach poses a problem we can determine that the value of the size variable for the conditional expression is within the range of the sizes of its branches but we can no longer conclude that two sizes are equal We address this by means of a pair type introduced in Section 3 plus related constraints The main ideas is this if two different if expressions have the same boolean condition with equivalently bound variables and get executed the same number of times in one run of the program then their true and false branches get executed the same number of times respectively We c
9. were made the length of those lists can vary provided that they are equal to each other Alter natively consider a specification manual that must include the same information in multiple languages The number of headings in one linguistic section can vary provided that it equals the number of headings in every other linguistic sec tion Size relationships arise in settings that include parallel or repeated data The ability to infer size relationships may find application in ensuring the correct composition of Web services 7 No previous technique for XML type checking can accu rately type check a program when the target type has size relations and the program output is not confined to a regular subtype of the output type The decidability of XML type checking has been established using k pebble tree transduc ers when no size relations are present 20 It is unclear how Tt doc 7 doc TEn we oS AON titles isbns titles isbns title isbn title isbn Figure 1 A target type with size relations Con servatively inferred types using existing techniques cause correct programs to fail to type check well these automata based techniques would work in prac tice because of their high computational complexity and more fundamentally how to incorporate size information into these formalisms to retain decidability of type check ing Existing type based approaches 8 may provide more practical if less precise s
10. 2 is then 71b How ever it is not clear what the size annotation on T p means To add clarity we rewrite the size annotation as shown in Figure 4 We can now find the unit type to which x gets assigned as Tub and so the body of the inner for expression on line 3 is typed as T2 We then go back up and compose the type 72 with r When going up again to find the type of the for expression on line 1 we compose the type of the nested for expression with m The result is 7 for clarity r has been simplified out of the constraints Unfortunately all we know about p s value is that it is confined to a given range When the for expression on line 4 is typed the result will also be a starred type whose size is confined to the same range Knowing that two numbers are in the same range is usually not sufficient to relate the two numbers concretely e g describing one as a function of the other Thus this approach is not suitable for discovering interesting size relations The second approach to typing the for expression is the one taken by Fernandez et al 10 find a type for the body of a for expression for every named unit type in 7 and combine them based on the type structure of 71 Given the annotation for Ts in Figure 3b the type of children root is again 71 as shown in Figure 4 Because the number of children n may be different for each of the m lev1 s the first unique unit type here is Tu as shown in Figure 5 The se
11. E can only be constructed bottom up as opposed to allowing named gaps that can be filled in any order Castor 13 and JAXB 19 use Java to generate an object model of XML documents from XML Schema in order to gain a higher level of abstraction 6 3 Size Analysis We view size analysis as seeking to make claims about the sizes of data structure and other closely related aspects of programs The study of size analysis started with the inference of linear constraints for imperative languages 5 This abstract interpretation based approach inferred linear relationships among variables automatically This topic has significance to logic programming in the sense that inferring bounds on argument sizes can ensure termination 23 Type based size analyses relate more closely to our work One such analysis uses dependent types 29 30 They use parameterized types to infer lengths of lists The parameters can be constrained with linear equalities and inequalities to determine size relationships Unlike our type system theirs requires user annotations Hughes Pareto and Sabry type check recursive data structures with size information in the context of a lazy functional language 16 Chin and Khoo build on this approach by inferring sizes for recursive func tions in the context of strict functional languages 3 They define the size of a function as both a relation between input and output parameters and invariants of input parameters across r
12. G XML type checking first requires inferring a type for the transformation program which Section 4 addressed This section addresses the second part of XML type checking checking whether the inferred type is a subtype of the target type A type denotes a set of documents Our notion of sub typing is simply inclusion between the sets denoted by two types A restriction on the types that may be inferred makes deciding subtyping possible Based on observations in Sec tion 1 1 our type system does not infer size relationships for nested repetitions In other words in a type tree a repeated element b will not have a size annotation that relates it to other parts of the type tree if one of its ancestors is also repeated e g a This also implies that none of the annotations of b s descendants will relate their sizes to other parts of the type tree Size annotations must be matched to establish a subtype relationship We use N to record how size annotations are matched and we define it as N 0 NU n m NU n n1 n r Fe n C T Fei ts Cy I w for x m F ez T2 Cg Tr w for g n tF ez 2 Co for label 0 for label start Jr m E Co FOR FoRII I Ffor x in e do e2 T2 I for x in e do e2 2 Ci U C2 C U Co 7 U n m x In r r w z u Hez T x label Q r w fe uttbten 2 x label r FoRU FoRI
13. IU Pw fora ut rFeg 7 C T wW for g ut tF ez 2 C U a ForN ee T for x 00 F ez 00 al I w for x m1 e2 T C I w for x t2 F e2 7 C FORE Jr EC ForII1 T w forz O Fe 0 E ee eer ae 1 T w for x n aoe Ci Tw forr tm Fe 7 C I w for x t2 F 2 75 C2 o g n San a s 5 I w for a t2 Fe2 0 C T ey T2 25 ndr m E C1 2 1 0 FoRS Ir rEeC TAO I w for x T1 T2 F ez Ti T2 FO More aE are ForII2 C U C2 U n m m i r f H o roi rw forz n F ez Tti Cy Beiter men e 7 I w forxr m Fez r C I wW for x T F ez T2 Co arr ec r0 adr m E C12 meets ForII3 FoRC r w fora t r ez T C I wW for x m r2 F e2 r 73 8 C1 U Co r w forx n F ez T C F 1 I w for x 7 e295 Tis Ci I w for x T2 F e2 0 C I W for T2 H ez 73 Cg In nrecC ro m z 7 25 adr m E C12 T E aa ForII4 n px ri notp x r2 W for x t F eg 7 C ForP I wW for x lt t1 72 gt F e2 lt T 73 gt r w forx t1 be2 0 C C U Co U n p x r notp x ro I w for x t2 F e2 7 C U n p X m notp X mo In rEC ro n p p x m2 FoRII5 Tafort k erT at f Pecar pE adr n EC I W for 2 7 F e r x C FORR r w forx r F e T g r w forz r Fez T C Imrec r w for z rx F ez 7 C FoRIIR Figure 14 Type rules for the for expression n is fresh in all rules it appears in The matching n m means that m
14. Type based Inference of Size Relationships for XML Transformations Zhendong Su lt su cs ucdavis edu gt Gary Wassermann lt wassermg cs ucdavis edu gt Department Computer Science University of California Davis Davis CA 95616 8562 USA ABSTRACT XML transformation languages e g XSLT take an XML document as input and produce another XML document as output It is useful to know statically that such transforma tions always produce valid documents for static debugging of the transformation program or for eliminating dynamic checks on the output documents Type and automata theoretic techniques that exploit XML s tree structure have been proposed to address this problem However existing approaches are not capable of reasoning about size informa tion of produced XML documents such as that two locations in the output documents always have the same number of elements which occurs when data is repeated This paper presents a type based inference system to discover size re lationships in output documents from XML transformation programs through refined type checking For example our system can identify program fragments producing the same number of elements for all input documents Programs that use or produce parallel or repeated data will benefit from this analysis The novel aspects of our system are tech niques to deal with the rich tree structure of XML types i e schemas whereas array analyses e g bounds check
15. al programming In POPL pages 214 227 1999
16. an use unification to conservatively determine which variables have the same binding A straightforward analysis can conserva tively determine which boolean expressions are equivalent and are executed the same number of times All if ex pressions are given labels and two if expressions will have the same label if and only if their boolean expressions were found to be equivalent Our type rule for conditional expressions is as follows r e Boolean Cp r Fe n Cy Y F e2 T2 C2 T z T2 z2 if label p n is fresh I Hif e then e else ez lt T T2 gt C U Ci U Co U n p Xx n notp X no The hypothesis if label p extracts the label and uses it as a fresh size variable The rule also uses the label to create a fresh size variable notp which represents the unknown number of times the boolean expression evaluates to false The constraint uses the label to relate the sizes of conditional expressions with equivalent booleans 4 3 Repetition Expressions The for expression is the principle mechanism for pro ducing lists of unspecified length and is the most involved to handle in our type system Consider the input type and the program fragment shown in Figure 9 By inspection of the input type we conclude that the expression children book0 has type T title String author String In other words all data that children book0 produces are included in the set defined by 7 Although
17. anges has the same problems as discussed in Section 1 1 but unless all parts of the boolean expression are static we cannot determine statically which branch of the conditional will be executed To address this we use pair types Like conditional types 1 pair types preserve the relationship between the types of the branches of conditional expressions and true and false evaluations of the boolean expression We relate the size of a pair type to the sizes of the conditional expression s true and false branches as well as to the identity of the boolean condition If in a pre processing phase two boolean conditions can be found to be equivalent then it becomes possible to relate the sizes of the corresponding conditional expressions 2 THE SOURCE LANGUAGE The first and second halves of the program in Figure 2 would match this semantics if the appropriate case expres sions were inserted tag a variable x constant n Cint Cstr Cbool operator op and or lt expression e n x al e e O eope let x e doe for x ine doe _ children e case e of r p gt e x gt e end if e then e else e f e 3e pattern p a s data d n ald d d O Figure 6 XML transformation language tag a type name x size type rise xn scalar type s String Boolean Integer unit type us alr fT s type Zama e eT La AO herent i ae g annotated type T u 2 Figure 7 Type l
18. anguage Figure 6 gives the syntax of our XML transformation lan guage Most of the constructs are standard The expres sion ale constructs XML elements Paths can be expressed through for and case expressions We omit the expres sion to select the parent of an element which can be typed conservatively as is done in other XML transformation lan guages with type systems such as XQuery Beyond that our language does not include for example sorting explicit type casts and modules We do not expect much difficulty in ex tending our technique to cover these language constructs Note that the case expression matches a value against a p defined by the pattern derivation which parallels the unit type derivation in Figure 7 Also as shown by the data derivation we denote XML elements as tag rather than lt tag gt lt tag gt to simplify notation The dy namic semantics for this language is standard but the com panion technical report gives a complete presentation 24 3 OUR TYPE LANGUAGE Figure 7 gives our type language The size type shows that either a constant or a variable can be used as a size an notation on a type The size annotation denotes the number of unit values that may be matched to the annotated type The grammar allows nonsense types to be written e g but our type system only infers meaningful types and pro grams only type check if all types are meaningful The wild car
19. atch the semantics of paths Paths the function lr and the con nection between them are explained in Section 4 3 4 TYPE RULES We use a constraint based formulation of our type system The type judgment I fF e 7 C is read in environment I expression e has type 7 where the size variables in I and T are subject to the constraints C Type environments are defined by the following grammar Pi 0 w x r 0 w for x 7 where for x T is used in typing the for expression A type environment l maps variables and for variables to types according to the following rules if a otherwise T if a I x otherwise r w x rT x I ya 8 gt T w for x 7 2 Due to space constraints we explain here the typing of the three most interesting expressions to size types in increasing order of difficulty The complete list of type rules can be found in the companion technical report 24 In the type rules that we discuss next z u and 7 are as in Figure 7 z is a type without a size annotation u is a unit type without an annotation and 7 is a type with an annotation In Section 4 1 we explain the type rule for sequence ex pressions in which two subexpressions are put in sequence In Section 4 2 we explain the type rule for conditionals In Section 4 3 we explain the typing of the for expression which is the most involved because it is the main language construct used to produce
20. cond is Tuz the third is Tuz etc Trying to infer a type for the for expression by inferring a type for e2 based on Tui Tug Tug 18 cumbersome and requires complicated symbolic reasoning about summations such as 77 nj Why cannot we get precise size information through pre cise source type annotations When a type element in a tree type has a Kleene star e g lev1 in Figure 2 its children in the type tree e g lev2 represent uniformly all lists of child elements of the starred element in an actual document Adding precise size annotations to Kleene starred elements e g lev2 of the input type distinguishes within the type tree the concrete lists that the Kleene starred element rep resents After distinctions have been added to the input type either the type system factors out the distinctions resulting in a loss of precision as shown in Figure 4 or in addressing the distinctions directly the type system faces increased complexity as shown in Figure 5 In this paper we present an approach to overcome these problems 1 2 Our Approach The key insight of our approach is that the elements of a list are usually treated uniformly both in the type and in the program so the only information we need is the to tal number of elements in a concrete tree represented by an element or more precisely by a path in the type tree In some XML transformation languages there is no mech anism to access the elem
21. constant e g 1 title per book the child s size is proportional to the parent s These names get used as size variables when an expression outputs the corresponding portion of the in put tree The precise details on how size names are put on the input type tree and how the transformation program is annotated can be found in the companion technical report see Section 3 2 3 24 Recursion both in the input type and the program is another source of statically unknown output In the case of recursive types we give a name to the total number elements represented by each tag at the first back edge In a recur sive type there is at least one pair of type elements or type constructors where each can be reached from the other We say that a back edge goes to the type element constructor that can be reached from the root of the type tree first A type element constructor with an incoming first back edge has no anscestors with incoming back edges Figure 12 shows a recursive type first textually 10 then graphically The annotation at the choice type constructor means for type Part Basic Composite type Basic basic cost Integer type Composite composite assembly_cost Integer subparts Part basic cost m composite assembly subparts n Integer m n basic composite i TNS cost assembly subparts Integer Integer Figure 12 Annotation of a recursive type type Part2 part total_c
22. cument validation There are a few possible directions for future work In this work we dealt with recursive functions by requiring the user to provide a type signature and type checking to verify the correctness of that signature We may be able to infer a type without being provided a type signature by using XDuce s type system to infer a type signature that does not include size annotations and then adding size annotations on a second pass We may be able to avoid doing two passes by using ideas from Chin and Khoo 3 Finally it would be interesting to implement our inference procedure to gain some practical experiences Acknowledgments We would like to thank Mary Fernandez for helpful an swers to our questions on XQuery and on An Algebra for XQuery 10 We also thank the anonymous reviewers of an earlier version of our paper for their helpful comments 8 REFERENCES 1 A Aiken E L Wimmers and T K Lakshman Soft typing with conditional types In POPL 1994 2 V Benzaken G Castagna and A Frisch CDuce an XML centric general purpose language In C FP 03 volume 38 9 pages 51 63 August 25 29 2003 3 W Chin and S Khoo Calculating sized types In PEPM pages 62 72 1999 4 A S Christensen and A M ller JWIG user manual 2002 URL http www brics dk JWIG manual 5 P Cousot and N Halbwachs Automatic discovery of linear restraints among variables of a program In POPL pages 84 96 1978 6
23. d unit type 7 is defined such that a 7 is a subtype of r for all tags a following Fernandez et al 10 Among the types z T T is the type of two values in sequence The union type is r 7 a value whose type is either of the choices matches it We introduce the pair type lt T T gt for typing conditional expressions Like expression e c n Ix m e ete exe path miu at a constraint C CU n e CU m Figure 8 Constraint language the choice type a value of either of the two types may match it Unlike the choice type the order of the two types is pre served i e m1 T2 T2 71 but lt 71 72 gt lt T2 T1 gt when 7 T2 Because the order is preserved it is possible to reason about the types of different conditional expressions in relation to each other The is an identity for choice types and is needed for typing repetition expressions We also have a constraint language to capture size re lations Figure 8 shows our constraint language There are two kinds of constraints The first kind consists of equality constraints between a size variable and an arith metic expression The function I maps paths to size vari ables constants The second kind of constraint is a path 7 which is not explicitly related to anything else in the con straints or types Paths remain in the constraint set only temporarily during the typing of for expressions that m
24. ecursive calls They infer sizes in terms of array lengths tree heights and integer values All of these previ ous approaches only infer flat sizes even when sizes for trees are inferred it is in terms of their one dimensional height We infer sizes for the richer tree structure and take into account the levels of the subtrees 7 CONCLUSIONS AND FUTURE WORK In this paper we have presented a type system for XML transformations to infer size relationships within the output type Our approach also allows size annotations to be added to the input type that would then be propagated through to the inferred output type In addition to helping program mers confirm properties of XML transformation programs size relations may provide efficiency improvements Knowl edge of concrete sizes can benefit query optimization and database storage so we expect size relations to yield similar benefits 12 22 Finally our type system does not add significant complex ity to either type inference or document validation Our type inference algorithm only infers in the constraints simple and usually small equations which can be solved efficiently using simple symbolic algebra XML Schema is designed so that validation can be implemented by a top down parser with limited look ahead 21 Adding size annotations requires only the addition of counters to keep track the number of elements which does not increase the algorithmic complex ity of performing do
25. ents in a list non uniformly For example it is impossible to remove the first element from a given list In other languages e g CDuce constructs that handle list elements non uniformly can be typed con servatively We take advantage of this in our analysis We annotate the source type as shown in Figure 3c The annota tion n on lev2 in Figure 3c denotes the total number of lev2 elements in the input document that have as parent a lev1 element and as grandparent a root element Note the differ ence between this annotation and a size the elements may not all have a common parent For an alternating sequence of for and case expressions that match the semantics of root lev1 lev2 the body of the innermost case will be executed n times Consequently our type system multiplies the size of the innermost case expression by n to find the size of the outermost for expression Because our annotations do not introduce distinctions into Ts we avoid the trouble shown in Figure 5 We therefore leverage the more powerful second approach to typing the for expression The union operation used in the first ap proach loses information whenever an element type has more than one child type and so cannot achieve the precision nec essary to infer size relationships Conditional expressions are often used to select certain elements of a list and pass over others so they too influence the sizes of output types A solution that leads to sizes confined to r
26. lds Figure 1 gives the output type 7 omitting the scalar chil dren of title and isbn as the programmer intends it We portray types as trees to provide a graphical view of the tree structure of XML types A vertical or diagonal line means that the type at the lower end of the line is a child of the type at the upper end In the type 7 in Figure 1 titles isbns is a sequence type the sequence constructor is implicit be cause titles and isbns are children of the same parent and are next to each other Using existing type inference meth ods the type 7 would be inferred Because 7 7 this correct program fails to type check Repeated data arises in many settings and each time it does this shortcoming of existing techniques limits the amount of automated checking 1 for w in children root do on Toor 2 for x in children w do 3 s lev1 4 for y in children root do lev2 5 for z in children y do 6 T a b Figure 2 A source type with nested repetitions ie root root ur lev1 lev1 lev2 lev2 i 1 m lev2 a b c Figure 3 Two ways to annotate a source type Ts available to programmers 1 1 Difficulties with Size Relation Inference At first consideration it may seem as though the use of integer constraints which enable array analyses in languages such as C would be sufficient for inferring size relationships Surprisingly it is not that simple The main
27. olutions However currently these approaches are unable to infer types with size relations The main contribution of our paper is a type based inference sys tem to discover size relations for XML transformation pro grams To the best of our knowledge ours is the first system capable of reasoning about size relations for XML transfor mations Several languages have been proposed for XML trans formations including XSLT 6 XQuery 8 XDuce 14 CDuce 2 HaXml 27 and Relaxer 11 The XML trans formation language in this paper used to explain our tech nique has much of the expressive power of these languages It includes iterations over subtrees pattern matching based on tag or type conditional expressions etc Section 2 intro duces our language For illustration purposes we consider first the following XQuery program lt doc gt lt titles gt for a in document cat xml catalog book title return a lt titles gt lt isbns gt for b in document cat xml catalog book isbn return b lt isbns gt lt doc gt The program takes an XML catalog of books and creates an output document with lists of titles and ISBNs perhaps for easy ISBN lookup by title in a printed listing Because each book has exactly one title and ISBN the lists rooted at titles and isbns must have the same number of elements that is they must have the same size The programmer would like to confirm that this size relationship ho
28. ons such as alb c lt a c b c We use two functions rmc remove choices and tls top level split Both functions operate at the top level of types by treating elements as atomic and ignoring their children This allows us to keep the upper and lower parts of the type tree separate but it means that the rule will reject some true subtype relationships The rmc function takes a type tree 7 as an argument and returns the set of all type trees derived by selecting one or the other of the types from the choice types in 7 For example rmc alb c a c b c The tls function first calls rmc on its argument For each resulting type tree 7 tls returns the set of pairs of type trees that in sequence describe 7 For example tls a b c O a b cl a b b Jc a b c OJ In the example the size annotated b was split in the second pair so its size annotation was split into ni and ng The split is recorded in N with n n n2 Yri Erme n Wry E rme te IN Ta T E tls z Na iTi lt Ta No T3 lt To Na b pO Na b p Na b pO n m NSEQ U Na o p U n m T1 T2 lt 2 Our subtyping algorithm include nine other rules not shown here This subtyping algorithm is sound and it terminates 6 RELATED WORK 6 1 Automata based Techniques Murata et al classify six ways of representing XML types including XML Schema in terms of expressiveness 21 The ty
29. ost Integer subparts Part2 convert p Part lt basic m composite n gt Part2 lt part m n gt Figure 13 Annotation of a recursive function example that the number of basic tags represented by this recursive type is m Recursive functions typically operate on recursive types We handle recursive functions by re quiring the user to annotate each recursive function with a type signature and type checking the functions based on those annotations Figure 13 shows a second recursive type Part2 For each part the total_cost includes any costs for that part plus the sum of all costs of subparts The func tion convert transforms a Part into a Part2 Due to space limitations we omit the function body Since both basic parts and composite parts are parts the total number of part s should equal the sum of the numbers of basic s and composite s The user need not annotate the function with the size of every type element in Figure 13 only 3 have annotations 4 3 3 Auxiliary Rules for Size Types We write auxiliary rules to make use of the size related in formation discussed in Section 4 3 2 Basically two groups of auxiliary rules are needed one group for those for expres sions that are not known to produce top level repetitions and one group for those that are The first group of auxil iary rules is similar to the rules discussed in Section 4 3 1 they decompose the input type find types for the body of the for ex
30. pes we work with here are regular expression tree types with size annotations which are at least as expressive as the six surveyed One significant automata based work on XML type check ing uses a generalization of traditional top down regular tree transducers called k pebble tree transducers to demonstrate the decidability of type checking for the broad range of que ries that can be expressed by these automata 20 This tech nique was applied to a subset of XSLT for backward type inference 26 However it is unclear how to support size information in these formalisms Adding sizes naively can produce non context free languages and make type checking undecidable 6 2 Type System based Techniques Instead of using automata based approaches many XML transformation languages use type systems to accomplish XML type checking The aspects of XQuery s type system relevant to this work were explained in Section 1 1 Other languages such as XDuce have similar type systems 8 14 Fernandez et al s more precise type system was discussed in Section 4 3 10 However our type system to the best of our knowledge is the first one to support size inference Other work on XML type checking aims at integrating XML into general purpose programming languages One in tegrates XML into Java 18 and the work relies on JWIG 4 an extension of Java XOBE 17 is also an extension of Java with a similar goal but it differs in that XML trees in XOB
31. pression as though it were a let expression iter ating over a single unit type and re compose the inferred types according to the structure of the input type The aux iliary rules are enhanced only to carry size information with them Figure 14 shows the complete list of type rules for the for expression the auxiliary rules named FoR are the first group of rules The ar m C hypothesis will be explained in Section 4 3 4 The second group of rules have a slightly different purpose and function They also decompose the structure of the input type and they find types for the for expression as though it were a let expression iterating over a single unit type However they do not re compose the types according to the structure of the input type The path annotation from Section 4 3 2 tells exactly which of the unit types will cause the for expression to produce a potentially non empty list i e produce output In the program in Figure 10 for example the for expression start ing on line 4 will only produce output when x is bound to a title element all other types will result in the empty sequence Furthermore the path annotation can be mapped back to a size name which could be a constant in the input type so it tells exactly how many times the for expression that iterates over that path will produce output In the program in Figure 10 the first half of the program will pro duce output n times because n
32. problem is that because XML transformations operate on trees a very rich data structure size relationship inference must interrelate tree sub structures Standard array analyses however need only reason about how size information for arrays a flat data structure flows in C like programs Consider the source type Ts and program shown in Fig ure 2 Lines 1 3 and 4 6 of the program in Figure 2 have the same semantics as root where is a wildcard that matches any tag except line 3 substitutes an S for what ever the output would have been and equivalently with T on line 6 In general the semantics of paths can be achieved through nested for and case expressions For example in Figure 10 lines 1 10 are equivalent to catalog book title Clearly this program produces the same number of S s as T s However standard type systems perform a modular analysis using only the types of subexpressions and some global type environments for type checking and inference Therefore in discovering a relation such as size equality between two expressions the type system is restricted to using information it has available at the time of typing both expressions the input type It must discover size relations between the input type and the type of each expression in order to relate the sizes of the expressions types to each other Suppose that in hoping to discover the size relation ships precisely we annotate Ts as in Figure 3b where 1 m
33. re An XML type is a subset of all documents 7 C T often called a schema The XML type checking problem asks for source and target types Ts and 7 respectively and transformation program P is it true that Vx Ts P m 25 One common approach to answer this question is based on type inference an output type 7 is conservatively inferred based on the program and the source type P T C 7 If the inferred type is a subtype of the target type T C 7 then the program successfully type checks We introduce the notion of sizes in XML documents and types a size denotes the number of XML elements and or scalars in a consecutive sequence under a common parent For a particular XML document sizes are always known constants However sizes may not remain constant across all documents conforming to a single type In this case the sizes of the type are represented by variables which may be constrained to allow only values valid for some document within the type When some sizes of a type are constrained in terms of other sizes currently not supported in XML Schema we call those size relations Because of the com mon use of Kleene stars in types it is generally impossible to discover the actual values of sizes Rather we aim at discovering relationships among sizes in output documents Some practical settings require size information For ex ample in a document with parallel lists of movie titles and the years those movies
34. ssion may not produce top level repetition if it is nested in an expression e where e is not a for or case expression and e is nested in the body of a for expression Otherwise the for expression will produce top level repeti tion We answer the second question similarly Consider the nested for and case expressions on lines 1 10 of the pro gram in Figure 10 for which 7 in Figure 11 is the input type Collectively these expressions iterate over the title ele ment or more specifically the path catalog book title in Ts We infer this path by checking for certain syntactic structures e g e1 is a children expression in all for ex pression and each expression uses a variable bound by the expression above it and piecing together the patterns from the case expressions If the relevant syntactic structures do not appear we do not infer a path We make paths explicit for each group of for and case expressions that produce top level repetition by labeling the first expression with start and the last expression with the path it iterates over To answer the third question we first note that the num ber of elements represented by a path in the input type is the same each time an expression iterates over that path We therefore give names to the numbers of elements along each path of the input type Figure 11 shows an example of this naming Starting from the root sizes are given names If the number of child elements per parent element is
35. the n in 7 does not tell us how many elements each member of 7 has it does say that the number of elements produced is the same as the number of the input elements The body of the for let bookO Book book jt TNS 1 for x in children book0 title author 7 2 do case x of l 3 x1 author gt x1 String String 4 x2 gt 5 end a b Figure 9 Simple type tree and expression expression operates on each element individually from the value of the children expression Each execution of the for expression produces data included in the type 1 7 author String Two main ideas allow us to infer such types The first idea is the use of auxiliary rules for typing the for expres sion Auxiliary rules allow us to infer a type for the for expression which has the same regular structure as the type it iterates over The second idea is that the input type and the program contain implicit size related information which we make explicit We determine which information to make explicit based on the argument in Section 1 1 We then equip the for rules to use this information 4 3 1 Auxiliary Rules Auxiliary rules for the for expression were introduced by Fernandez Sim on and Wadler 10 and we review them here The for expression has the form for x in e do eo Suppose that for the program fragment in Figure 9b we have determined that children book0 has type Ti title y author String
36. unters type names or repeated ed types with unbound size annotations these mark the top of the lower part For example in the un annotated type tree in Figure 11 the upper part ends at book and the lower part is everything below book In the absence of relational size annotations our type lan guage describes regular expression types for XML the same type language as XDuce uses XDuce also uses a set based notion of subtyping and it has a sound and complete algo rithm for deciding subtyping 15 To check subtyping algo rithm for the lower part of the type tree e g 0 T lt T we use XDuce s subtyping algorithm In the case of recur sive types we match annotations at the top and use XDuce s subtyping for the types in the absence of annotations Because the upper part is free of recursion we can split choice types and check each of the choices separately N n lt Zz No 172 lt Zz NiONo2 Ni 2O n m NCHOIC Ni U N2 U n m Ti r2 Zal HOICE If an expected annotation is missing an annotation can be added The annotation will be a constant if the size is con stant e g 1 If the annotation is for a sequence type and the sizes of the sequence s members equal the sizes of another sequence the annotation will be identical to the an notation on the other sequence Otherwise the annotation will be fresh which is equivalent to unknown The rule for sequences must check subtype relati
37. wed by all ISBNs from a catalog Other auxiliary rules handle other type constructors such as repetitions If e1 s type is a unit type e g title then the for expression can be typed like a let expression The purpose of these auxiliary rules is to decompose the in put type infer types for the for expression over unit types and compose the inferred types according to the structure of the input type The resulting type of the for expression in Figure 9 is T2 QO gt author author String String 4 3 2 Making Size related Information Explicit In Sections 1 1 and 1 2 we argued that a natural type based analysis can reason about size information at but not below the top level of repetition We define top level repe tition for a given type to be types which are repeated d and which have no repeated ancestors We get the informa tion that we need for reasoning about sizes by answering 1 Which for expressions produce top level repetition in the output type 2 Which parts of the input type do those for expressions iterate over 3 How many elements are in those parts of the input type Aside from conditional expressions which Section 4 2 ad dressed the for expression is the only expression which can both produce output of unspecified length and have other expressions nested in it To answer the first question con servatively we use the syntactic structure of the program A for expre

Type-based Inference of Size Relationships for XML

Contents

Download Pdf Manuals

Related Search

Related Contents