Compaq's Web Language: A Programming Language for the Web

Contents

1. Creating Pieces
   Inserting Pieces
   Deleting Pieces
   Replacing Pieces
   Modules
      Module Base64
      Module Browser
      Module Cookies
      Module Farm
      Module Files
      Module Java
      Module Servlet
      Module Str
      Module Url
      Module WebCrawler
      Module WebServer
   Examples
      Reading Grades
      WebCrawler
      Highlight Proxy
   WebL Quick Reference
      Running WebL Programs
      WebL EBNF
      Operator Precedence
      Operators
      Functions
      Exceptions
      Regular Expressions
   TABLE 1 through TABLE 30
   List of T
2. P directlyinside Q, P !directlyinside Q

   The directlyinside operator returns all the elements of P that are inside (or, in the negated form, not inside) any element of Q and, in addition, are not inside another element of P. Intuitively, this retrieves the outermost element of all nested elements. Given a page of the following form:

       1   <UL>
       2   <LI>First Section</LI>
       3   <LI>Second Section</LI>
       4   <LI>Third Section
       5   <UL>
       6   <LI>First Subsection</LI>
       7   <LI>Second Subsection</LI>
       8   </UL>
       9   </LI>
       10  <LI>Fourth Section</LI>
       11  </UL>

   we can calculate the following:

   All the list items in lists (i.e. lines 2, 3, 4-9, 6, 7, 10):
       Elem(X, "li") inside Elem(X, "ul")

   All the list items in the first list (i.e. the elements on lines 2, 3, 4-9, 6, 7, 10):
       Elem(X, "li") inside Elem(X, "ul")[0]

   All the items directly in the first list (i.e. lines 2, 3, 4-9, 10):
       Elem(X, "li") directlyinside Elem(X, "ul")[0]

   Outermost items in the first list (i.e. lines 2, 3, 4-9, 10):
       var x = Elem(X, "li") inside Elem(X, "ul")[0];
       x !inside x

   P directlycontain Q, P !directlycontain Q

   The directlycontain operator returns all the elements of P that contain (or, in the negated form, do not contain) any element of Q and, in addition, do not contain another element of P. Int
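   The operator examples in this excerpt can be mimicked outside WebL. The following Python sketch is not WebL and not the WebL implementation: it models each element as a hypothetical (begin line, end line) span read off the 11-line listing, and it follows the prose description of directlyinside rather than the formal definition. The names Span, inside, select_inside, select_not_inside and directly_inside are assumptions made for illustration.

```python
from typing import List, Tuple

# (line of begin tag, line of end tag); this span model is an assumption of the sketch
Span = Tuple[int, int]


def inside(x: Span, y: Span) -> bool:
    """True when x lies strictly within y (strictness is a simplification)."""
    return y[0] < x[0] and x[1] < y[1]


def select_inside(P: List[Span], Q: List[Span]) -> List[Span]:
    """Elements of P that are inside any element of Q (models 'P inside Q')."""
    return [p for p in P if any(inside(p, q) for q in Q)]


def select_not_inside(P: List[Span], Q: List[Span]) -> List[Span]:
    """Elements of P that are inside no element of Q (models 'P !inside Q')."""
    return [p for p in P if not any(inside(p, q) for q in Q)]


def directly_inside(P: List[Span], Q: List[Span]) -> List[Span]:
    """Elements of P inside some element of Q and not inside another element of P."""
    return [
        p for p in P
        if any(inside(p, q) for q in Q) and not any(inside(p, r) for r in P)
    ]


# Spans taken from the listing: two UL elements and six LI elements.
ul = [(1, 11), (5, 8)]
li = [(2, 2), (3, 3), (4, 9), (6, 6), (7, 7), (10, 10)]

all_items = select_inside(li, ul)                       # every list item in any list
first_list = select_inside(li, [ul[0]])                 # every list item in the first list
direct = directly_inside(li, [ul[0]])                   # items directly in the first list
outermost = select_not_inside(first_list, first_list)   # models 'x !inside x'
```

   In this model the last two results coincide, which matches the remark that directlyinside retrieves the outermost elements of a nesting.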
3. Retrieving Page Objects

   TABLE 16. Fields of the option object

   Field             Description
   autoredirect      Controls whether moved pages (for example, HTTP status code 302) are automatically fetched from their new locations. The default value is true.
   charset           Overrides the character set used to parse the document. Typical values are ISO-8859-1, UTF8, etc.
   dtd               Overrides the DTD to be used when parsing the page. The value of this field must be a string with the official DTD name as defined in the SGML catalog.
   emptyparagraphs   When this flag is set to true, the HTML parser regards paragraphs (i.e. <p> tags) as empty markup elements instead of the usual <p> </p> pairs. <br> is an example of another empty markup element. This option is sometimes useful when confronted with pages where <p> is used without regard for the HTML specification, for example the incorrect use of <p> inside <font>, and so on. The default value of this flag is false.
   expandentities    When this flag is set to true, character entities like &quot; etc. are expanded as the page is parsed. The expansion is only performed on HTML pages and between markup tags (i.e. not inside attributes). The default value of this flag is false.

   TABLE 16. Fields of the option object

   Field: fixhtml, mimetype, noncompliantPOSTredirect, resolveu
4. j-object
      Allocates a Java object using the specified class name and optional constructor arguments. Valid class names are Java primitive types (int, char, short, etc.) or fully specified Java class names (java.lang.String, java.util.Vector, etc.).
   Class(classname: string): j-object
      Maps the specified class into a WebL object, allowing the static fields of the class to be accessed.
   NewArray(classname: string, size: int): j-array
      Allocates a Java array of the specified type and size.
   Get(A: j-array, i: int): any
      Retrieves index i of array A.
   Set(A: j-array, i: int, v: any): any
      Sets index i of array A to value v.
   Length(A: j-array): int
      Returns the length of the Java array.

   TABLE 32. Conversion of Java types into WebL types

   Java type (class)                   Corresponding WebL value type
   null                                nil
   boolean                             bool
   char                                char
   java.lang.String                    string
   long                                int
   int                                 int
   short                               int
   byte                                int
   float                               real
   double                              real
   Any array type                      j-array
   webl.lang.expr.ObjectExpr           object
   webl.lang.expr.ListExpr             list
   webl.lang.expr.SetExpr              set
   webl.lang.expr.AbstractFunExpr      fun
   webl.lang.expr.AbstractMethExpr     meth
   webl.page.Page                      page
   webl.page.Piece                     piece
   webl.page.PieceSet                  pieceset
   webl.page.TagExpr                   tag
   Any Java array                      j-array
   Any other class not listed above    j-object

   TABLE 33. Convers
5. Farm_NewFarm noworkers 48 end 49 50 Abort meth s s Stop end 51d

   First we need to keep track of all pages visited so far with an associative array (a.k.a. a WebL object), where the fields are the visited URLs and the value is either true or false (line 7). Note that an alternative implementation could use a set instead of an object without any performance penalty. Lines 14 and 15 define the two methods that need to be overridden to customize the crawler. The Visit method is called each time a new page is visited, and the ShouldVisit method indicates whether a specific URL should be crawled or not.

   The Enqueue method (lines 17-30) adds a URL to the queue of pages to be fetched. The first task is to strip off any references from the URL (lines 19-22). Line 24 then checks if we visited the page already. Note the use of the service combinator to catch the exception should the URL not be in the visited array. If the URL is not present and we should visit this page (line 25), we remember that we have seen the page (line 26) and then pass the job of retrieving the page to a farm object (line 27).

   Eventually, when a worker on the farm reaches a new job, the ProcessPage function is invoked (lines 32-44). After the page is fetched (line 34), we call the method Visit to let the crawler process the page (line 35). Lines 38-40 take care of enqueuing all the anchors found on the page.

   A custom crawler

   Now we look at how we can create a cust
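   The Enqueue logic described in this excerpt can be sketched in Python. This is not the WebL crawler: a plain deque stands in for the farm of workers, no page is actually fetched, and the class names, method names and example URLs are hypothetical. It only illustrates the order of the steps: strip the reference, check the visited table, ask ShouldVisit, record the URL, and queue the job.

```python
from collections import deque
from urllib.parse import urldefrag


class Crawler:
    """Minimal stand-in for the crawler object; a deque replaces the worker farm."""

    def __init__(self):
        self.visited = {}    # URL -> True, plays the role of the associative array
        self.jobs = deque()  # URLs waiting to be fetched

    def should_visit(self, url: str) -> bool:
        """Override to decide whether a URL should be crawled."""
        return True

    def visit(self, url: str, page: str) -> None:
        """Override to process a fetched page."""

    def enqueue(self, url: str) -> None:
        url, _fragment = urldefrag(url)           # strip any "#reference" part
        if url in self.visited:                   # already seen, nothing to do
            return
        if self.should_visit(url):
            self.visited[url] = True              # remember that we have seen the page
            self.jobs.append(url)                 # hand the job over to the queue


class SiteCrawler(Crawler):
    """A custom crawler that stays on one (hypothetical) host."""

    def should_visit(self, url: str) -> bool:
        return url.startswith("http://www.example.com/")


crawler = SiteCrawler()
for candidate in [
    "http://www.example.com/a.html#top",
    "http://www.example.com/a.html",
    "http://elsewhere.example.org/x.html",
    "http://www.example.com/b.html",
]:
    crawler.enqueue(candidate)

queued = list(crawler.jobs)
```

   Because the reference is stripped before the lookup, the first two candidate URLs count as the same page and only one job is queued for them.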
6. Turns s into lowercase.
   Turns s into uppercase.
   Removes white-space characters (new lines, carriage returns, tabs, etc.) from the beginning and end of the argument string.

   Module Url

   The Url module performs a number of useful operations on URL strings. For example, it is sometimes useful to break up a URL into its constituent parts, modify some of them, and glue the parts back together again. In the same manner, query strings (the part of a URL that follows the question mark) also need to be manipulated. For example, given the URL of a typical AltaVista query

       http://www.altavista.digital.com/cgi-bin/query?pg=q&kl=XX&q=%2Bjava+%2Bcoffee

   the Split function will return an object as follows:

       [.
         query = "pg=q&kl=XX&q=%2Bjava+%2Bcoffee",
         path = "/cgi-bin/query",
         host = "www.altavista.digital.com",
         scheme = "http"
       .]

   Applying the SplitQuery function to the query field of this object will return the following object:

       [.
         kl = "XX",
         pg = "q",
         q = "+java +coffee"
       .]

   Behind the scenes, decoding the query string involves calls to the Decode function to remove the character encodings (e.g. %2B and so on). The Glue and GlueQuery functions glue those objects back together again. The field names generated by the Split function are summarized in Table 38. Note that the current implementation can process only the http ftp
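   The split, modify and glue cycle described here has a close analogue in Python's standard library, shown below as a sketch. It is not the WebL Url module: urlsplit, parse_qsl, urlencode and urlunsplit are Python functions, the example URL is a reconstruction of the AltaVista query from the text (an assumption), and the replacement value "en" is purely illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Reconstructed example URL (assumption): an AltaVista-style query.
url = "http://www.altavista.digital.com/cgi-bin/query?pg=q&kl=XX&q=%2Bjava+%2Bcoffee"

# Step 1: break the URL into its constituent parts (compare Split).
parts = urlsplit(url)

# Step 2: decode the query string into name/value pairs (compare SplitQuery).
# Decoding turns %2B into "+" and "+" into a space.
query = dict(parse_qsl(parts.query))

# Step 3: modify one field, then glue everything back together
# (compare GlueQuery and Glue).
query["kl"] = "en"
new_url = urlunsplit(
    (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
)
```

   After step 2, query["q"] holds the decoded search terms; re-encoding in step 3 restores the %2B escapes.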
7. long>
   setTime <setTime long void>
   getDay <getDay int>
   setHours <setHours int void>
   setMonth <setMonth int void>
   notifyAll <notifyAll void>
   after <after java.util.Date boolean>
   setDate <setDate int void>
   getHours <getHours int>
   setSeconds <setSeconds int void>
   wait <wait long void wait long int void wait void>
   getMonth <getMonth int>
   toString <toString java.lang.String>
   UTC <UTC int int int int int int long>
   notify <notify void>
   getYear <getYear int>
   before <before java.util.Date boolean>
   equals <equals java.lang.Object boolean>
   getTime <getTime long>
   getTimezoneOffset <getTimezoneOffset int>
   getMinutes <getMinutes int>
   hashCode <hashCode int>
   getClass <getClass java.lang.Class>
   getDate <getDate int>
   setMinutes <setMinutes int void>
   toGMTString <toGMTString java.lang.String>
   toLocaleString <toLocaleString java.lang.String>

   WebL-Java type conversion

   Furthermore, automatic translation between WebL and Java data types is done when calling methods and constructors, or when assigning values to object fields. Table
8. Page 121
   WebL Java integration support. Page 124
   Java Servlet support. Page 131

   TABLE 24. Standard WebL Modules

   Module       Function
   Str          General string related functions. Page 136
   Url          Url manipulation functions. Page 138
   WebCrawler   An extensible web crawler object. Page 141
   WebServer    Implementation of a simple web server. Page 143

   Module Base64

   Base 64 encoding of strings is typically used to scramble transmitted passwords when accessing web pages that require user authentication. The typical pattern for basic HTTP authentication is as follows:

       import Base64;
       var A = "Basic " + Base64_Encode(user + ":" + pw);
       var P = GetURL("http://...", nil, [. "Authorization" = A .]);

   In the code above, user must be set to the user name and pw to the authentication password. The last object passed to the GetURL function contains the authentication header to send to the web server.

   The Base64 module is also used when authenticating users to Web proxies, by adding a Proxy-Authorization header to the HTTP request:

       import Base64;
       var A = "Basic " + Base64_Encode(user + ":" + pw);
       var P = GetURL(url, nil, [. "Proxy-Authorization" = A .]);

   TABLE 25. Module Base64

   Function                    Description
   Encode(s: string): string   Encodes a string in the base64 encoding.
   Decode(s: string): string   Decodes a
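   For readers who want to check the header value that this pattern produces, here is the same computation in Python. It is a sketch and not part of WebL: basic_auth_value is a hypothetical helper name, the credentials are placeholders, and the only library call used is base64.b64encode from the Python standard library.

```python
import base64


def basic_auth_value(user: str, pw: str) -> str:
    """Build the value of a basic authentication header from a user name and password."""
    # The user name and password are joined with a colon before encoding.
    token = base64.b64encode(f"{user}:{pw}".encode("utf-8")).decode("ascii")
    return "Basic " + token


# Header objects corresponding to the two cases in the text (placeholder credentials).
headers = {"Authorization": basic_auth_value("user", "pw")}
proxy_headers = {"Proxy-Authorization": basic_auth_value("user", "pw")}

# Decoding recovers the original string, which is why base64 only scrambles
# the password and does not protect it.
decoded = base64.b64decode(headers["Authorization"][len("Basic "):]).decode("utf-8")
```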
9. WebL Quick Reference

   TABLE 45. WebL Operators

   Operator                                       Description
   overlap (p: piece, q: piece): pieceset         All the elements of p that overlap any element in q.
   overlap (p: pieceset, q: piece): pieceset
   overlap (p: piece, q: pieceset): pieceset
   overlap (p: pieceset, q: pieceset): pieceset
   without (p: piece, q: piece): pieceset         All the elements of p where overlaps with any element of q have been removed.
   without (p: pieceset, q: piece): pieceset
   without (p: piece, q: pieceset): pieceset
   without (p: pieceset, q: pieceset): pieceset

   a. Right bracket fix operator of the form x[i].
   b. Object membership test is based on object field names.

   Functions

   TABLE 46. Built-in Functions

   Function                                       Description
   Assert(x: bool)                                Throws an assertion failed exception if x is false.
   BeginTag(q: piece): tag                        Returns the begin tag of a piece.
   Boolp(x): bool, Charp(x): bool,                Predicates that check if a value is of a specific type.
   Funp(x): bool, Intp(x): bool,
   Listp(x): bool, Methp(x): bool,
   Objectp(x): bool, Realp(x): bool,
   Setp(x): bool, Stringp(x): bool,
   Pagep(x): bool, Piecep(x): bool,
   Tagp(x): bool, Piecesetp(x): bool
   Call(cmd: string): string                      Executes a shell command and returns the output written to standard out while the command is running. The command string may contain references t
   Children(q: piece): pieceset
   Clone(o: object, p: object): object
10. as in Table 19 In this table we use the notation beg to indicate the position of the begin tag of a piece and end to indicate the position of the end tag of a piece Note that the piece comparison operators equal inside after etc are not defined in the WebL language itself the following section will introduce new language operators based on these definitions FIGURE 5 Example of Position Numbering 0 1 1 2 2 3 4 2 We introduce here a fictitious WebL function called pos that returns the numerical posi tion of a tag value WebL A Programming Language for the Web 85 The Markup Algebra TABLE 19 Comparing Pieces x and y Relationship between x and y Definition x equal y beg x beg y A end x end y x inside y beg y lt beg x A end x lt end y A beg x beg y end x end y X contain y beg x lt beg y A end y lt end x A beg x beg y end x end y x after y end y lt beg x x before y end x lt beg y x overlap y beg x lt end y A beg y lt end x A beg x beg y end x end y 86 WebL A Programming Language for the Web Piece Set Operators and Functions Piece Set Operators and Functions All the piece set operators are summarized in Table 20 on page 101 Note that all piece set operators accept both pieces and piece sets as operands Piece operands are converted automatically to a piece set with the operand as the only el
11. make as few changes as possible to the page with the guidance of the DTD. It is thus important to realize that when bad HTML or XML is parsed, the internal representation might not be what you expect from viewing the page source. To give users an idea of what WebL sees, a pretty-printing function called Pretty is included in WebL that displays a representation of the parsed web page in a nicely formatted way. We recommend using this tool, as it often illustrates the badly formatted markup found on the web.

    Badly encoded scripts

    A growing number of web pages contain code written in scripting languages like JavaScript and VBScript. WebL attempts to skip over the contents of these parts of a page so that the HTML parser does not get confused when seeing things that look like markup but are encoded in scripts. Typically this is not a problem, since authors are expected to always place scripts inside HTML comments (<!-- -->) between the tags <script> and </script>. Unfortunately this advice is sometimes ignored and the script code is left uncommented, which can confuse the page parser. For this reason WebL does not parse anything at all between <script> and </script> tags in HTML pages. The whole stretch of text between the two tags remains a single unparsed text segment.

    Retrieving Page Objects

    WebL's internal representation of a web r
12. module was loaded first, the module is automatically reloaded.

    Examples

    The following WebL servlet shows how to set and retrieve a variable on your Web server:

        // File: Example1.webl
        var theval = nil;    // the variable

        // To set the variable to "hello", access
        // http://www.host.com/servlet/webl/Example1_SetVal?x=hello
        export var SetVal = fun(req, res)
            theval = req.param.x;
            res.mimetype = "text/plain";
            res.result = "Set val to " + theval;
        end;

        // To retrieve the value, access
        // http://www.host.com/servlet/webl/Example1_GetVal
        export var GetVal = fun(req, res)
            res.mimetype = "text/plain";
            res.result = "Val is " + theval;
        end;

    Note how the x parameter in the URL is accessed with req.param.x. In the case of multiple parameters with the same name, the particular parameter field will have a list of strings as its value. Programmers should thus be aware that the value type of a parameter is either a string or a list of strings, depending on how many times the parameter occurs.

    Often during servlet development you will need to snoop the servlet request headers. The following example shows how this is done with a WebL servlet:

        // File: Example2.webl
        var Decode = fun(req)
            var s = "Header snoop\n\n";
            every field in req do
                s = s + ToString(field) + " = " + ToString(req[field]) + "\n"
            end;
            s
        end;
13. of XML, WebL keeps all element names and attributes in their original case. In the case of HTML, WebL converts all element names and attribute names to lower case.

    URL resolution

    Many elements have attributes that specify URLs of other documents on the Web. Most of these URLs are specified relative to the document itself. WebL simplifies handling of these URL attributes by resolving them to an absolute URL when the document is fetched. To determine which attributes refer to URLs, WebL uses slightly modified HTML DTDs internally that explicitly denote which attributes of elements contain URLs. No URL resolution is performed for XML documents. HTML URL resolution can be switched off with a page retrieval option (Table 16 on page 65).

    Bad HTML

    A surprisingly large number of pages on the web contain errors. Some of the typical errors encountered include:

    - Forgotten end or start tags.
    - Illegal nesting of elements (forbidden by the DTD).
    - Non-hierarchical markup, where elements overlap instead of nest.
    - The DTD specified by the DOCTYPE SGML directive does not match the markup the document contains.
    - Tags with illegal names, etc.

    WebL tries to take all these problems into account. In SGML terms, WebL is a non-validating processor. WebL only uses DTDs to correct simple mistakes and to add optional tags where needed. WebL also corrects overlapping tags in HTML to ensure that we have a hierarchically structured document. In general we try to
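    The URL resolution step described in this excerpt follows the usual rules for relative references. The Python sketch below shows those rules with urljoin from the standard library; it is not WebL code, and the base URL and attribute values are hypothetical examples.

```python
from urllib.parse import urljoin

# Hypothetical address of the fetched document.
base = "http://www.host.com/docs/index.html"

# Hypothetical attribute values as they might appear in href or src attributes.
attribute_values = [
    "chapter1.html",           # relative to the document's directory
    "../images/logo.gif",      # climbs one directory
    "/top.html",               # relative to the server root
    "http://other.example/x",  # already absolute, left unchanged
]

# Resolve every attribute value against the document's own URL.
resolved = [urljoin(base, value) for value in attribute_values]
```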
14. options arguments are the same as the built-in PostURL function.
    The contents of filename are evaluated as a WebL program. The result returned is the value of the last statement in the program.
    Returns the file names and directory names contained in directory dirname.

    TABLE 30. Module Files

    Function                     Description
    IsDir(name: string): bool    Checks whether name is a valid directory name or not.
    IsFile(name: string): bool   Checks whether name is a valid file name or not.
    Mkdir(name: string): bool    Attempts to create a new directory called name and returns success or failure.
    Delete(name: string): bool   Attempts to delete the file called name and returns success or failure.
    Size(name: string): int      Returns the size in bytes of the file called name.

    Module Java

    The Java module allows you to access Java classes, objects and arrays directly from the WebL programming language. This provides practically transparent access to any functionality provided by the Java class library, at the extra run-time cost of translating between WebL and Java data types. The direction of access is purely from WebL to Java; transparent Java to WebL access is not possible without changes in the Java virtual machine. Note that using module Java requires a knowledge of Java itself and some
15. q piece pieceset s1 pieceset s2 pieceset pieceset q1 piece q2 piece pieceset q piece s pieceset pieceset s pieceset q piece pieceset sl pieceset s2 pieceset pieceset ql piece q2 piece pieceset q piece s pleceset pieceset s pleceset q piece pieceset s1 pieceset s2 pieceset pieceset s pieceset 1 int piece inside p piece q piece pieceset inside p pieceset q piece pieceset inside p piece q pieceset pieceset inside p pieceset q pieceset pieceset inside p piece q piece pieceset inside p pieceset q piece pieceset inside p piece q pieceset pieceset inside p pieceset q pieceset pieceset directlyinside p piece q piece pieceset directlyinside p pieceset q piece pieceset directlyinside p piece q pieceset pieceset directlyinside p pieceset q pieceset pieceset directlyinside p piece q piece pieceset directlyinside p pieceset q piece pieceset directlyinside p piece q pieceset pieceset directlyinside p pieceset p pieceset pieceset contain p piece q piece pieceset contain p pieceset q piece pieceset contain p piece q pieceset pieceset contain p pieceset q pieceset pieceset contain p piece q piece pieceset contain p pieceset q piece pieceset contain p piece q pieceset pieceset contain p pieceset q pieceset pieceset Descripti
16. same page.

    WebL Quick Reference

    TABLE 47. Exceptions thrown by the built-in functions

    Function:
    NewPage(s: string, mimetype: string): page
    NewPiece(q: piece): piece
    NewPiece(s: string, mimetype: string): piece
    NewPiece(t1: tag, t2: tag): piece
    NewPieceSet(p: page): pieceset
    Objectp(x): bool
    PCData(p: page): pieceset
    PCData(p: piece): pieceset

    Exceptions:
    ArgumentError: Incorrect or wrong number of arguments.
    NetException: Fetch failed; the statuscode field of the exception object indicates the reason.
    ArgumentError: Incorrect or wrong number of arguments.
    NotSamePage: The tag arguments to the function do not belong to the same page.
    ArgumentError: Incorrect or wrong number of arguments.
    NotAPiece: The set argument to NewPieceSet must only contain pieces.
    EmptySet: The set argument to NewPieceSet must only contain pieces belonging to the same page.
    NotSamePage: The set argument to NewPieceSet must only contain pieces belonging to the same page.
    ArgumentError: Incorrect or wrong number of arguments.
    ArgumentError: Incorrect or wrong number of arguments.

    TABLE 47. Exceptions thrown by the built-in functions

    Function:
    Page(q: piece): page
    Page(t: tag): page
    Pagep(x): bool
    Para(p: page, paraspec: string): pieceset
    Para(q: piece, paraspec: string): pieceset
    Par
17. set
    ToSet(s: string): set
    ToSet(o: object): set
    ToSet(p: pieceset): set
    ToString(x): string

    Exceptions:
    ArgumentError: Incorrect or wrong number of arguments.
    Timeout: A time-out occurred.
    ArgumentError: Incorrect or wrong number of arguments.
    ArgumentError: Incorrect or wrong number of arguments.
    ArgumentError: Incorrect or wrong number of arguments.
    ArgumentError: Incorrect or wrong number of arguments.
    ArgumentError: Incorrect or wrong number of arguments.
    ArgumentError: Incorrect or wrong number of arguments.

    WebL Quick Reference

    TABLE 47. Exceptions thrown by the built-in functions

    Function              Exceptions
    Trap(x): object       ArgumentError: Incorrect or wrong number of arguments.
    ToString(x): string   ArgumentError: Incorrect or wrong number of arguments.

    Regular Expressions

    Here we summarize the syntax of Perl5 regular expressions, all of which are supported by WebL. However, for a definitive reference you should consult the perlre man page that accompanies the Perl5 distribution, and also the book Programming Perl, 2nd Edition, from O'Reilly & Associates. We need to point out here that, for efficiency reasons, the character set operator is limited to work on only ASCII characters (Unicode characters 0 through 255). Other than this restriction, all Unicode ch
18. 1998 19:45:47 GMT
    "Content-type" = "text/html"
    "URL" = "http://www.digital.com"

    Some HTTP response headers, for example Set-Cookie, might be repeated several times. In such a case the value of the header field will be a list of string values, in the order of occurrence in the HTTP response. The page fields of an HTTP response with multiple headers of the same name might thus look as follows:

        [.
          "Content-type" = "text/html",
          "URL" = "http://www.digital.com",
          "Set-Cookie" = ["id=123", "pw=abc"]
        .]

    The same idea is applied when submitting multiple headers with the same name in a request (1):

        GetURL(url, nil, [. "HeaderA" = "xyz", "HeaderB" = ["id=123", "pw=abc"] .])

    (1) Although WebL supports multiple request headers, the underlying Java implementation does not.

    Overrides

    Unfortunately, many servers return the incorrect MIME type for a page. This incorrect MIME type can be overridden by passing a mimetype field in the options object argument of GetURL and PostURL. The mimetype field of the options object must be of type string, and the value must be taken from Table 14. In case the content encoding is known, an optional charset MIME type parameter can be specified (see MIME types on page 53). For example, we can write the following to override the MIME type of page
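    The rule that a repeated header becomes a list of strings, while a header that occurs once stays a plain string, can be sketched as follows. The Python function collect_headers and the sample header pairs are hypothetical and only model the behaviour described in the text; this is not how WebL is implemented.

```python
from typing import Dict, List, Tuple, Union

HeaderValue = Union[str, List[str]]


def collect_headers(pairs: List[Tuple[str, str]]) -> Dict[str, HeaderValue]:
    """Fold (name, value) pairs into fields; repeated names become lists in order of occurrence."""
    fields: Dict[str, HeaderValue] = {}
    for name, value in pairs:
        if name not in fields:
            fields[name] = value                  # first occurrence: plain string
        elif isinstance(fields[name], list):
            fields[name].append(value)            # third and later occurrences
        else:
            fields[name] = [fields[name], value]  # second occurrence: promote to a list
    return fields


# Sample response headers (hypothetical values modelled on the example in the text).
response = collect_headers([
    ("Content-type", "text/html"),
    ("Set-Cookie", "id=123"),
    ("Set-Cookie", "pw=abc"),
])
```

    Code that reads such a field therefore has to handle both cases, a single string and a list of strings.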
19. 33 (Conversion of WebL types into Java types, on page 130) shows which Java types a specific WebL type is compatible with. Refer to this table when calling a Java method or constructor. Refer to Table 32 (Conversion of Java types into WebL types, on page 129) to see how values returned from methods and field accesses are converted back into WebL types.

    Studying these two tables will show that the type conversion is mostly restricted to converting between primitive Java and WebL types. That means, for example, that WebL objects can only be passed to methods that accept the implementation type of WebL objects, webl.lang.expr.ObjectExpr. This is not as restrictive as it might sound: many methods in the JDK accept java.lang.Object values as arguments, which is of course a superclass of webl.lang.expr.ObjectExpr. For example, it becomes possible to insert WebL objects into Java hash tables.

    Here is a more complicated example, which reads and numbers the lines of a file called test.txt:

        import Java;
        var System = Java_Class("java.lang.System");
        var F = Java_New("java.io.File", "test.txt");
        var R = Java_New("java.io.BufferedReader", Java_New("java.io.FileReader", F));
        var c = 1;
        var L = R.readLine();
        while L != nil do
            System.out.print(c);
            System.out.println(L);
            c = c + 1;
            L = R.readLine()
        end

    After each occurrence of R.readLine(), the resulting line is converted int
20. WebL Quick Reference

    TABLE 46. Built-in Functions

    PostURL(url: string, params: object|string, headers: object, options: object): page
       The options object allows, amongst other functions, the overriding of the MIME type and DTD to be used for parsing the page.
    Pretty(p: page): string
       Returns a pretty-printed version of the page.
    Pretty(q: piece): string
       Returns a pretty-printed version of a piece.
    Print(x, y, z): nil
       Prints arguments to standard output.
    PrintLn(x, y, z): nil
       Prints arguments to standard output, followed by end-of-line.
    ReadLn(): string
       Reads a line from standard input (throws away the end-of-line character).
    Replace(a: pieceset, b: pieceset): nil
       Replaces each piece set of a with copies of all the elements of b.
    Rest(l: list): list
       Returns a list of all list elements except the first element.
    Retry(x): any
       Executes expression x and returns its value. In case x throws an exception, x is re-executed as many times as needed until it is successful.
    Select(l: list, from: int, to: int): list
       Extracts a sublist of l starting at element number from and ending at element number to (exclusive).
    Select(s: set, f: fun): set
    Select(l: list, f: fun): list
    Select(p: pieceset, f: fun): pieceset
       Maps sets, lists and piecesets to sets, lists and pi
    Select(s: string, from: int, to: int): string
21. Algebra TABLE 21 Piece and Piece Set Functions Function Children q piece pieceset Parent q piece piece Flatten s pieceset pieceset Content p page piece Content q piece piece Description Returns a piece set consisting of all the direct children elements of g in the markup parse tree unioned with pieces representing all the text seg ments in g without all the nested text segments Returns the element in which g is nested direct parent in the parse tree Returns a flattened piece set without any overlappings of all the parts of the page that piece set s cov ers Returns a piece that encompasses the whole page p Returns a piece inside q represent ing everything that is inside q excluding the begin tag and end tag of q 104 WebL A Programming Language for the Web Piece Set Operators and Functions TABLE 22 Formal Definitions of Piece Set Operators Operator P Q P Q P Q P inside Q P inside Q P directlyinside Q P directlyinside Q P contain Q P contain Q P directlycontain Q P directlycontain Q P after Q P after Q P directlyafter Q P directlyafter Q P before Q P before Q P directlybefore Q P directlybefore Q P overlap Q P overlap Q Definition PU qeQI 7dppe Pa pequal q pe Pl7 dqqe Qa pequal q pe Plaqqe Qa p equal q pe Plaqqe QA p inside q pe Pl7 dqqe Qa pinside q pe Plaqqe QA p inside q 7dr
22. C x string y string bool C x char y char bool contain p piece q piece pieceset contain p pieceset q piece pieceset contain p piece q pieceset pieceset contain p pieceset q pieceset pieceset Description Numeric substraction Set exclusion Object field access Numeric division Value equality test See Value Equality on page 31 Indexing into a piece set Pieces are numbered 0 to Size 1 List object and string indexing Elements in a list and string are numbered from 0 to Size All the elements of p that are after any element of q All the elements of p that precede any element of q Numerical comparison where C is one of lt lt gt or gt Lexical comparison where C is one of lt lt gt or gt All the elements of p that contain any element of q 170 WebL A Programming Language for the Web Operators TABLE 45 WebL Operators Operator directlyafter p piece q piece pieceset directlyafter p pieceset q piece pieceset directlyafter p piece q pieceset pieceset directlyafter p pieceset q pieceset pieceset directlybefore p piece q piece pieceset directlybefore p pieceset q piece pieceset directlybefore p piece q pieceset pieceset directlybefore p pieceset q pieceset pieceset directlycontain p piece q piece pieceset directlycontain p pieceset q piece pieceset directlycontain p pie
23. Function:
    NewPiece(q: piece): piece
    NewPiece(s: string, mimetype: string): piece
    NewPiece(t1: tag, t2: tag): piece
    NewPieceSet(s: set): pieceset
    NewPieceSet(p: page): pieceset

    Description:
    The headers object specifies the additional headers to include in the HEAD request.
    Inserts a copy of q after the tag t.
    Inserts copies of the elements of s after the tag t.
    Inserts a copy of q before the tag t.
    Inserts copies of the elements of s before the tag t.
    Turns a page object back into a string.
    Turns a piece object back into a string.
    Returns the name of a piece.
    Loads a WebL function implemented in Java.
    Equivalent to NewNamedPiece(name, BeginTag(q), EndTag(q)).
    Returns a new named piece starting before t1 and ending after t2.
    Parses the string s with the markup parser indicated by the MIME type and returns a page object.
    Equivalent to NewPiece(BeginTag(q), EndTag(q)).
    Equivalent to Content(NewPage(s, mimetype)).
    Returns a new unnamed piece starting before t1 and ending after t2.
    Converts a set of pieces into a piece set. Throws an EmptySet exception should s be empty.
    Returns an empty pieceset associated with page p.

    Functions

    TABLE 46. Built-in Functions

    Function:
    Page(q: piece): page
    Page(t: tag): page
    Para(p: page, paraspec: string): pieceset
    Para(p: piece, paraspec: string): pieceset
    Parent(q: piece): piece
    Pat p p
24. TCP/IP connection port.

    Field      Schemes           Description
    path       http, ftp, file   Full path name of the addressed resource.
    query      http              Query string (after ? in URL).
    ref        http, file        Anchor reference string (after # in URL).
    user       ftp               Login user name.
    password   ftp               Login password.
    type       ftp               File transfer type (a, i, d).
    url        Unknown schemes   Contains the complete URL, because no constituents could be extracted when the scheme is unknown.

    Module WebCrawler

    Module WebCrawler exports a single object called Crawler that implements a low-performance, multi-threaded web crawler. To use the web crawler, the methods Visit and ShouldVisit must be overridden by the programmer. The Visit method is called by the crawler each time a page is visited, and the ShouldVisit method returns true when a specific URL must be crawled.

    The crawler is activated by the Start method, which takes as argument an integer specifying how many threads should perform the crawl. At this point the crawler has no pages to crawl yet. Pages are inserted into the crawler queue with the Enqueue method. As each page in the queue is processed, the crawler extracts all the anchor (<A>) tags in that page and calls the ShouldVisit method to determine whether the page referred to by the anchor should be crawled or not. The Abort method can be called at any time to terminate the crawl.

    The following example implemen
25. a similar effect. The problem is that two different pieces (according to our definition above) might have equivalent markup, which blurs the difference between the two pieces. This is a side effect of an unnamed tag becoming invisible when the piece is converted to markup. For example, in Figure 4 piece B is nested inside piece A. Applying the Markup function to A and B strips away the unnamed pieces to return the string "WebL" without any markup. Because of our handling of unnamed pieces as invisible entities (the place holders for patterns), pieces A and B should be equal to each other from the programmer's point of view, but are not according to our earlier definition.

    FIGURE 4. Nested Unnamed Pieces

    The first intuition is that WebL should merge neighbouring unnamed tags so that the equality problem goes away. Unfortunately, experience has shown that merging of unnamed tags is a bad idea. Without going into too much detail, merging of unnamed tags complicates the programmer's mental understanding of the current shape of the page, as merging might happen in unexpected situations. This often causes problems when a page is subsequently modified. To give a flavor of the problems that might occur, suppose piece A of the figure was created by thread A and piece B of the figure was created by an independent thread B. Now let's suppo
26. a specific name contained in piece q.
Returns the end tag of a piece.
Prints arguments to standard error output.
Prints arguments to standard error output, followed by end-of-line.
Evaluates the WebL program coded in string s.
Executes a shell command and returns the exit code returned by the command. The command string may contain references to variables in lexical scope by writing $var or ${var}. The values of these referenced variables are expanded before the command is executed.
Terminates the program with an error code.

TABLE 46. Built-in Functions

Functions:
ExpandCharEntities(p: page, s: string): string
ExpandCharEntities(s: string): string
DeleteField(o: object, fld): nil
First(l: list): any
Flatten(s: pieceset): pieceset
GC(): nil
GetURL(url: string): page
GetURL(url: string, params: object | string): page
GetURL(url: string, params: object | string, headers: object): page
GetURL(url: string, params: object | string, headers: object, options: object): page
HeadURL(url: string): page
HeadURL(url: string, params: object | string): page

Descriptions:
Expands the character entities (e.g. &lt; &amp;) in s to their Unicode character equivalents. The DTD of page p is used for the lookups.
Expands the character entities (e.g. &lt; &amp;) in s to their Unicode character equivalents. The H
27. all the links in a page, to more complex operations that fill in Web forms and process the results returned from a server. Manipulating Web pages might involve rewriting parts of a page (for example, highlighting words) or creating a new page from parts of several other ones. The markup algebra consists of several operators and functions that operate on pages, tags, pieces and piece sets. There are operators and functions to create or build piece sets from pages or from other piece sets, convert pieces to their string representation, modify the content of a page, and so on.

Pages, Tags, Pieces and Piece Sets

After a page is retrieved from the Web and parsed according to its MIME type, the page and its content are accessible for further computation in WebL. The computation that can be performed on a page is determined by the WebL markup algebra.

The markup algebra is based on three concepts: tags, pieces and piece sets. In simple terms, a tag corresponds to a markup tag, a piece identifies a contiguous sub-region of a page, and a piece set is a collection of pieces.

Tags

The first step in parsing a Web page is to identify all the markup tags in the page (enclosed between < and > characters). Each of the tags is converted into a tag, a WebL value of type tag. Conceptually the page then consists of a list of tag objects and text segments or char
28. &y=abc+def, the Echo function could be invoked with the following req object contents:

  protocol: HTTP/1.0
  method: GET
  uri: /bin/echo?x=3&y=abc+def
  path: /bin/echo
  query: x=3&y=abc+def
  param: y = abc def, x = 3
  header:
    Accept-Charset: iso-8859-1, utf-8
    Connection: Keep-Alive
    User-Agent: Mozilla/4.04 [en] (WinNT; I)
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png
    Accept-Language: en
    Host: ck.pa.dec.com:90

and the following res object:

  result: nil
  header:
    Server: WebL
    Date: Thu May 14 15:59:01 PDT 1998
    Content-Type: text/html
  statuscode: 200
  statusmsg: OK

The invoked function can now look at the fields of req to determine how to handle the request, and modify the fields of res to indicate the result to be returned (note that many of the fields are filled in with sensible values when the function is invoked). The meanings of the individual fields of the request and response objects are listed in Table 40 and Table 41 respectively (note that all fields except statuscode are of value type string or object). The most commonly used field is param, which indicates the request parameters received.

TABLE 39. Module WebServer

Start(fileroot: string, port: int): nil
  Starts the web server on
29. an element's begin tag are copied into field variables of each piece. Thus a piece is very similar to the object value type, in that it looks and behaves in many ways like an object. Furthermore, we associate the appropriate name with each piece; in this case the names are the strings "ul", "li" and "li" written above the triangles. Note how the begin and end tag of the comment piece refer to the same tag object. In accordance with our previous definition, a piece that refers to unnamed begin and end tags is called an unnamed piece, which correspondingly has the empty string as name.

FIGURE 2. Piece Notation

Piece Sets

As its name indicates, a piece set is a collection of pieces belonging to the same page. It is a set in the sense that a piece can belong only once to a piece set, but a piece can be a member of several piece sets. A piece set is also a list, because pieces in a piece set are ordered. The piece ordering in a piece set is based on the begin and end tag positions of a piece in a page. We order pieces according to the left-to-right order of the begin tags.

Piece sets play a very important part in WebL. They form the basis of many of the operations that extract data from web pages.

Searching Functions

There are several ways in which piece sets can be created:
• Searching for markup elements by explicitly naming i
30. argname: type1 | type2 to indicate that argname can be of type1 or type2. See Table 15 on page 64.

TABLE 13. Core Built-in Functions

Assert(x: bool)
  Throws an assertion-failed exception if x is false.

Boolp(x): bool, Charp(x): bool, Funp(x): bool, Intp(x): bool, Listp(x): bool, Methp(x): bool, Objectp(x): bool, Realp(x): bool, Setp(x): bool, Stringp(x): bool, Pagep(x): bool, Piecep(x): bool, Tagp(x): bool, Piecesetp(x): bool
  Predicates that check if a value is of a specific type.

Call(cmd: string): string
  Executes a shell command and returns the output written to standard out while the command is running. The command string may contain references to variables in lexical scope by writing $var or ${var}. The values of these referenced variables are expanded before the command is executed.

Clone(o: object, p: object, ...): object
  Makes a new object by copying all the fields of the objects passed as arguments. Fields of p have precedence over fields of o, and so on. The field ordering of the resulting object is defined by enumerating the fields of o, p, and so on, in that sequence.

Error(x, y, z, ...): nil
  Prints arguments to standard error output.

ErrorLn(x, y, z, ...): nil
  Prints arguments to standard error output, followed by end-of-line.

Eval(s: string): any
  Evaluates the WebL program coded in s
31. Modules 46
Service Combinators 47
  Services 47
  Sequential execution S ? T 48
  Concurrent execution S | T 48
  Time-out timeout(t, S) 49
  Repetition Retry(S) 49
  Non-termination Stall() 49
Pages 51
  Basic Protocol Terminology 51
  Markup 54
  Retrieving Page Objects 59
The Markup Algebra 67
  Pages, Tags, Pieces and Piece Sets 67
    Tags 68
    Pieces 68
    Piece Sets 69
  Searching Functions 70
    Element search 70
    Pattern Search 71
    PCData search 73
    Sequence search 74
    Paragraph search 75
  Filtering Pieces 78
  Miscellaneous Functions 80
  Piece Comparison 83
  Piece Set Operators and Functions 87
    I. Basic Operators 88
    II. Positional Operators 89
    III. Hierarchical Operators 93
    IV. Regional Operators 95
    V. Miscellaneous Functions 98
CHAPTER 5 CHAPTER 6 CHAPTER 7
Page Modification
32. • When possible, write WebL regular expressions in single back quotes, e.g. `ab\nc`. This will switch off escape character expansion and prevent WebL from complaining about illegal escape sequences like \d.
• When matching URLs, keep in mind that . and ? do not have a literal meaning in regular expressions. Use character classes to match these symbols, e.g. write `www[.]xyz[.]com` instead of `www.xyz.com`.

TABLE 49. Quantified Atoms with Minimal Matching

{n,m}?   Matches at least n but not more than m times.
{n,}?    Matches at least n times.
{n}?     Matches exactly n times.
*?       Matches 0 or more times.
+?       Matches 1 or more times.
??       Matches 0 or 1 times.

TABLE 50. Atoms

.    Matches everything except \n.
^    Null token matching the beginning of a string or line (i.e. the position right after a newline or right before the beginning of a string).
$    Null token matching the end of a string or line (i.e. the position right before a newline or right after the end of a string).
\b   Null token matching a word boundary (\w on one side and \W on the other).
\B   Null token matching a boundary that is not a word boundary.
\A   Matches only at beginning of string.
\Z   Matches only at end of string (or before newline at
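The advice above about periods in URLs can be checked with any Perl-style regular expression engine. The following sketch uses Python's re module rather than WebL, and the host name and the look-alike string are made up for illustration; Python raw strings play a role comparable to WebL's back-quoted strings in that escape sequences are left unexpanded.

```python
import re

# Hypothetical URL and a look-alike string, chosen only for illustration.
good = "http://www.xyz.com/index.html"
lookalike = "http://wwwaxyzbcom/index.html"

loose = re.compile(r"www.xyz.com")        # "." matches any character
strict = re.compile(r"www[.]xyz[.]com")   # "[.]" matches a literal period only

loose_hits = (bool(loose.search(good)), bool(loose.search(lookalike)))
strict_hits = (bool(strict.search(good)), bool(strict.search(lookalike)))

print(loose_hits)   # the loose pattern accepts both strings
print(strict_hits)  # the strict pattern accepts only the real host name
```

The unescaped pattern also matches the look-alike because each period is free to match "a" and "b"; the character-class form does not.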
33. field. There are two common ways of indexing into object fields:
• The o.x notation denotes a field called x of object o.
• The o[e] notation evaluates e to a value x and retrieves the field x of o. This effectively makes the object an associative array.

The expressions o.x and o["x"] refer to the same field. Trying to access an object field that does not exist will throw an exception. A special assignment expression (:=) is used to insert new fields into an object. If the field already existed, its previous value is overridden. A built-in function called DeleteField allows the removal of a field from an object.

Examples:

  [. .]                      // The empty object
  [. x = 1, y = 1 .]         // Object with x & y field
  var o = [. x = 1, y = 1 .];
  o.x                        // Field x of o
  o["x"]                     // The same field again
  o.y := "hello"             // Defines field y of o
  o[1 + 2] := 42             // Defines field 3 of o
  o[4 - 1]                   // Accesses field 3 of o
  Size(ToList(o))            // # fields of o
  DeleteField(o, "x")        // Remove field x

The associative array behavior is so useful that it is used for other WebL types too. These object-like types are called special objects. Examples include types page and piece. To the programmer these types look very similar to objects, but they have hidden state attached to them (i.e. they function as opaque data types).

Object fields are ordered in the sequence of their definition, i.e. left to right, top to bottom in the object. The ordering of fields has only little impa
34. file. The filename argument to each of the functions in Table 30 is a file name in the syntax supported by the file system underlying WebL.

The combination of the SaveToFile and Eval functions allows some limited persistent storage for WebL. For example, the following program writes a set out to disk and reads it back in again to the variable T:

  import Files;
  var S = {1, 2, 4, 6};
  Files_SaveToFile("test.tmp", ToString(S));
  var T = Files_Eval("test.tmp");

Note that this technique can be used only for externalizing non-recursive values that do not contain functions or methods (the external format of those structures is not a legal WebL program).

TABLE 30. Module Files

AppendToFile(filename: string, val: string): nil
  Appends val to the end of the file.
AppendToFile(filename: string, val: string, charset: string): nil
  As above, but sets the character encoding to use. Typical encodings are iso-8859-1, UTF8, etc.
Exists(filename: string): bool
  Determines if a file with the specified name exists.
LoadFromFile(filename: string, mimetype: string): page
  Loads a page object from a file.
LoadStringFromFile(filename: string): string
  Loads a file as a string object using the default character encoding.
LoadStringFromFile(filename: string, charset: string): string
SaveToFile(filenam
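The SaveToFile and Eval round trip described above has a close analogue in other languages: write a value's literal text to a file, then evaluate that text to rebuild the value. The sketch below shows the idea in Python, not WebL; the temporary directory and the use of ast.literal_eval are choices made for this illustration. It also shows the same limitation, since a function has no literal form that can be read back.

```python
import ast
import os
import tempfile

s = {1, 2, 4, 6}

# Write the literal text of the set to a file (hypothetical location).
path = os.path.join(tempfile.mkdtemp(), "test.tmp")
with open(path, "w", encoding="utf-8") as f:
    f.write(repr(s))

# Read the text back and evaluate it as a literal to rebuild the value.
with open(path, encoding="utf-8") as f:
    t = ast.literal_eval(f.read())

print(t == s)

# Values containing functions cannot be externalized this way:
# their printed form is not a valid literal.
try:
    ast.literal_eval(repr(len))
    function_round_trips = True
except (ValueError, SyntaxError):
    function_round_trips = False

print(function_round_trips)
```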
35. TABLE 47. Exceptions thrown by the built-in functions

Functions (in table order):
Charp(x): bool
Children(q: piece): pieceset
Clone(o: object, p: object, ...): object
Content(p: page): piece
Content(q: piece): piece
Delete(s: pieceset): nil
Delete(q: piece): nil
Elem(p: page): pieceset
Elem(p: page, name: string): pieceset
Elem(q: piece): pieceset
Elem(q: piece, name: string): pieceset
EndTag(q: piece): tag
Error(x, y, z, ...): nil
ErrorLn(x, y, z, ...): nil

Exceptions (in table order):
ArgumentError: Incorrect or wrong number of arguments.
ArgumentError: Incorrect or wrong number of arguments.
ArgumentError: Incorrect or wrong number of arguments.
NoContent: Page or piece has no content.
ArgumentError: Incorrect or wrong number of arguments.
ArgumentError: Incorrect or wrong number of arguments.
ArgumentError: Incorrect or wrong number of arguments.
ArgumentError: Incorrect or wrong number of arguments.
ArgumentError: Incorrect or wrong number of arguments.
No exceptions are thrown.
No exceptions are thrown.

TABLE 47. Exceptions thrown by the built-in functions (continued)

Eval(s: string): any
  ArgumentError: Incorrect or wrong number of arguments.
  SyntaxError: Cannot evaluate due to syntax error in argument.
  IOException: An IO exception occurred during function execution.
  ReturnException: A return statement was executed outside of a fun
36. found in the cookie specification from Netscape.

Multiple Cookie Databases

By default, the cookie database is shared by all threads and web requests of the WebL process. However, it is sometimes useful to have groups of requests using logically separate cookie databases. WebL allows you to specify which cookie database to use for each web request; if no database is specified, the default shared cookie database is used. Cookie databases have programmer-defined names (strings) and are automatically allocated when they are first used. In particular, the cookiedb field of the options parameter to the GetURL and PostURL functions specifies which cookie database is to be used for that request. For example, the following web request reads and also writes cookies from and to a database called DB1:

  GetURL("http://www.abc.com", nil, nil, [. cookiedb = "DB1" .])

The default cookie database is used when no cookiedb option is specified or the database name is the empty string. Note that, as explained before, all cookie databases are discarded after the WebL program ends, i.e. the databases are not stored to disk. If you need this functionality, the Load and Save functions of the Cookies module allow you to read and write cookie databases from and to file storage.

TABLE 27. Module Cookies

Load(filename: string): nil
Save(filename: string): nil
Load(filena
37. import Ident Ident SS Var E 3 Var 1E Var export var IdentInit IdentInit IdentInit Ident E E Value ImportRef E BinOp E UnOp E Statement C E IE C E E FieldRef FieldRef E define expr Value nil Bool String Real Integer Character Object Set List ImportRef Ident _ Ident Bool true false Object Field Field Field Ident Set ant LE ts EY List arma Oat Be ras o os i PIP TE YE pC FieldRef E E lE Ident BinOp erp 1 div mod pt ee Set Meh a a and or py pn jv member inside tinside directlyinside directlyinside contain contain directlycontain directlycontain after after directlyafter directlyafter before before directlybefore directlybefore overlap foverlap without UnOp 4 I Statement WhileStat IfStat FunStat MethStat CatchStat EveryStat LockStat RepeatStat BeginStat ReturnStat WhileStat while SS do SS end IfStat if SS then SS ElseStat end ElseStat else SS elsif SS then SS ElseStat WebL A Programming Language for the Web 163 WebL Quick Reference FunStat fun Ident Ident SS end MethStat meth Ident Ident SS end CatchStat try SS catch Ident on EdoSS end Ident introduced into a new scope EveryStat every Ide
38. new piece is created. This involves inserting new unnamed tags just in front of and just after each pattern occurrence to keep track of the location. For example, Figure 3 shows how a page looks after searching for the word "WebL". In this figure the unnamed tags are indicated by the boxes marked <> and </>.

The unnamed tags created while searching for character patterns are simply pattern locators: they are ignored by many operators and functions, and are automatically removed from the page when not required any more (in some sense they are invisible). Also, when a page is converted back to string format, the unnamed tags are removed. It is also important to know that unnamed tags are always inserted. Thus searching for the same pattern twice will cause two nested and unnamed pieces to be inserted into the page. Another way of saying this is that tags are never shared by more than one piece.

FIGURE 3. Results of Searching for "WebL"

Pattern groups

The Pat function also supports Perl5 regular expression groups. Groups, indicated with parentheses in Perl5 regular expressions, identify constituent parts of the pattern to be matched. For example, a regular expression matching dates might have groups related to the day, month and year of the date. For each pattern matched in the page, the corresponding piece object of that pattern is attributed with fields numbered from 1 onwards th
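The date example mentioned above can be made concrete with any Perl-style regular expression engine. The following Python sketch is an analogy to the Pat function, not WebL code: the dd/mm/yyyy format and the sample sentence are assumptions, and each match is stored in a plain dictionary whose keys 1, 2 and 3 mimic the numbered fields the text describes.

```python
import re

# Hypothetical date format (dd/mm/yyyy); the text names day, month and year
# groups but does not give a concrete pattern.
date = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")
text = "Released 14/05/1998, revised 02/11/1999."

# One entry per match, with the groups stored under the keys 1, 2 and 3.
matches = [
    {"text": m.group(0), 1: m.group(1), 2: m.group(2), 3: m.group(3)}
    for m in date.finditer(text)
]

for entry in matches:
    print(entry["text"], "day:", entry[1], "month:", entry[2], "year:", entry[3])
```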
39. other modules. This allows the construction of a module hierarchy in the form of an import graph. Note that the graph is directed and acyclic: recursive module imports are not allowed and will cause a runtime error.

Service Combinators

We can imagine that many things can go wrong with a computation on such a large distributed scale as the World Wide Web. For example, part of a WebL computation might fail because of a failed web server or a missing web page. Thus the unpredictable nature of the web causes many more exceptions than in a non-distributed environment. To counteract this problem, WebL provides a few convenient ways to handle exceptions. The exception handling mechanism is based on a formalism called service combinators. In this formalism we talk about services (computations that depend on remote web servers) that complete successfully or fail (throw an exception). The service combinators allow several services to be combined in ways that can make a computation more reliable, and in some cases even improve its speed. Note that by service we mean any WebL computation. WebL supports several service combinators: sequential execution, concurrent execution, time-out, repetition and non-termination.

One of the most basic services involves fetching a page from the Web. To make the examples that follow more realistic, we are going to use two of these built-in functions. More details about the exact behavior of these functions can
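To make the combinators listed above more tangible, here is a rough Python model of four of them. It is a sketch under stated assumptions, not WebL's implementation: a service is modelled as a zero-argument function that returns a value or raises an exception, the function names are invented for this illustration, retry is bounded at three attempts, and timeout abandons the worker thread rather than cancelling it.

```python
import concurrent.futures as cf
import time

def sequential(s, t):
    """Run s; if it fails, fall back to t."""
    try:
        return s()
    except Exception:
        return t()

def concurrently(s, t):
    """Run both services at once; return the first successful result."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        errors = []
        for future in cf.as_completed([pool.submit(s), pool.submit(t)]):
            try:
                return future.result()
            except Exception as exc:
                errors.append(exc)
        raise errors[-1]          # both services failed

def timeout(seconds, s):
    """Fail if s does not finish within the given number of seconds."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(s).result(timeout=seconds)
    finally:
        pool.shutdown(wait=False)  # the worker is abandoned, not cancelled

def retry(s, attempts=3):
    """Repeat s until it succeeds (bounded here, by assumption)."""
    last = None
    for _ in range(attempts):
        try:
            return s()
        except Exception as exc:
            last = exc
    raise last

# Hypothetical services used only to exercise the combinators.
def broken():
    raise IOError("server down")

def mirror():
    return "page from mirror"

def slow():
    time.sleep(0.3)
    return "late page"

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("try again")
    return "page after retries"

first = sequential(broken, mirror)
fastest = concurrently(broken, mirror)
retried = retry(flaky)
try:
    timeout(0.05, slow)
    timed_out = False
except cf.TimeoutError:
    timed_out = True

print(first, fastest, retried, timed_out)
```

Non-termination is left out of the sketch because a service that never finishes cannot be demonstrated in a program that is expected to end.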
40. regular expression pattern in page p.
Returns all the occurrences of a regular expression pattern located inside the piece q.
Returns the parsed character data of the page. This corresponds to the individual sequences of text on the page, as delimited by markup tags.
Returns the parsed character data of the piece. This corresponds to the individual sequences of text inside the piece, as delimited by markup tags.
Matches all the occurrences of a sequence of elements identified by pattern. See PCData search on page 73.
Matches all the occurrences of a sequence of elements identified by pattern inside the piece p. See PCData search on page 73.

Miscellaneous Functions

The markup algebra includes several miscellaneous functions for converting between different value types (for example, turning a string into a page and back) and for accessing the begin and end tags of a piece. See Table 18. To give a feeling for how these functions are used, we first define a new page containing a heading and a 2-by-2 table:

  var P = NewPage("<html><body>
  <h1>Test Page</h1>
  <table>
  <tr><td align=center>A</td><td>100</td></tr>
  <tr><td align=center>B</td><td>230</td></tr>
  </table>
  </body></html>", "text/html");

Note that t
41. removed from the page. This is not a problem, seeing that we cannot refer to text segments in the WebL markup algebra.
• Unnamed tags inside q are left untouched.
• The tags of named pieces completely inside q are converted to unnamed tags. They can still be referred to, but essentially become invisible.
• If q is named, then its tags are converted to unnamed tags.
• The tags of named pieces that overlap q but are not inside q are left untouched.

Figure 12 illustrates the situation when overlaps occur during deletions. Note how piece C remains named because its end tag is located outside and to the right of B.

Of course, as we simply leave tags where they are in the page, we can imagine situations where the page fills up with unused tags after several deletions. To counter this problem, a scrubber process is periodically invoked to remove unused tags from a page, i.e. tags that are not accessible to the programmer (as detected by the Java garbage collector).

Examples:

  // Delete all occurrences of the word "cool"
  Delete(Pat(P, "cool"))

  // Remove all H1 and H2 headings
  Delete(Elem(P, "h1") + Elem(P, "h2"))

FIGURE 12. Deleting Pieces

Replacing Pieces

The Repl
42. string representation.
Executes x and returns the exception object that was caught. In case no exception is thrown in x, nil is returned. In addition, the exception object contains a field trace that has extra information about why the exception occurred. This information is useful for logging unexpected exception events in your WebL programs.
Returns the type of x (nil, int, real, bool, char, string, meth, fun, set, list, object, page, piece, pieceset, tag).

a. The class indicated must be a subclass of webl.lang.expr.AbstractFunExpr.

Modules

To facilitate the reuse of code, WebL allows you to package commonly used routines in a module. An example module might contain routines to process pages from a specific web server. Client programs can access the routines by importing the module. This is indicated by the client writing an import statement specifying all the modules that a program requires. After importing a module, the variables declared in that module can be accessed. This is done by writing the module name, followed by an underscore character and the variable name. For example, the following program imports module A, accesses a variable, and calls a function in that module:

  import A;
  PrintLn("The value of x in A is ", A_x);
  A_Doit();

The import statement may occur only at
43. the end).
\n   Newline.
\r   Carriage return.
\t   Tab.
\f   Formfeed.
\d   Digit [0-9].
\D   Non-digit [^0-9].
\w   Word character [0-9a-zA-Z].
\W   Non-word character [^0-9a-zA-Z].
\s   A whitespace character [ \t\n\r\f].
\S   A non-whitespace character [^ \t\n\r\f].
\xnn   Hexadecimal representation of character.
\cD    Matches the corresponding control character.
\nn or \nnn   Octal representation of character, unless a backreference.
\1, \2, \3, etc.   Matches whatever the first, second, third, etc. parenthesized group matched. This is called a backreference. If there is no corresponding group, the number is interpreted as an octal representation of a character.
\0   Matches null character.

TABLE 51. Perl5 Extended Regular Expressions

(?#text)     An embedded comment causing text to be ignored.
(?:regexp)   Groups things like "()" but does not cause the group match to be saved.
(?=regexp)   A zero-width positive lookahead assertion. For example, \w+(?=\s) matches a word followed by whitespace, without including whitespace in the match result.
(?!regexp)   A zero-width negative lookahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that is not followed by "bar". Remember that this is a zero-width assertion, which means that a(?!b)d will match "ad", because a is followed by a character that is not b (the d) and a d follows the zero-width assertion.
(?imsx)      One or more embedde
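The lookahead entries of Table 51 behave the same way in other Perl-style engines, so they can be tried out directly. The sketch below uses Python's re module rather than WebL, with the three patterns quoted in the table; the sample strings are invented for this illustration.

```python
import re

# Positive lookahead: words that are followed by whitespace.
# The whitespace itself is not part of the match result.
words = re.findall(r"\w+(?=\s)", "hello world ")

# Negative lookahead: "foo" not followed by "bar".
# Only the "foo" inside "foobaz" qualifies.
foos = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Zero-width assertion: "a" is followed by a character that is not "b"
# (the "d"), and that same "d" then matches the rest of the pattern.
ad = re.search(r"a(?!b)d", "ad")

print(words)
print(foos)
print(ad is not None)
```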
44. type set. All the expressions between the curly braces are evaluated and their values inserted in the set (if not already an element of the set). There is no ordering between the elements, no restriction on the type of elements, and no restriction on the size of the set.

TABLE 11. Set Expressions

Expression               Value          Value Type
{1 + 1, 2, 3}            {2, 3}         set
{1, 2, 3} + {2, 4}       {1, 2, 3, 4}   set
{1, 2, 3} * {2, 4}       {2}            set
{1, 2, 3} - {2, 4}       {1, 3}         set
Size({1, 2, 3, 6})       4              int

Type Fun

The fun statement constructs values of type fun, or function. The fun statement has the following form:

  fun(arg1, arg2, ...) StatementSequence end

The identifiers in brackets are the formal arguments of the function. A function can be applied with the same number of actual arguments, enclosed in parentheses, following the function constructor. The actual arguments are evaluated and assigned in a paired manner to the formal arguments. The resulting variable bindings form a new context in which the statement sequence of the function is executed. The value of the applied function is the value of the function statement sequence executed in the new context. For example:

  fun(x, y) x + y end (3, 4)

evaluates a function that sums two numbers, with arguments x = 3 and y = 4. More typically, functions are constructed and then assigned to variables for later use. For example, the following program calculates the factorial of 10
45. used as a measure of how many sentences are present, and initials containing periods will skew that count. Finally, lines 13-26 calculate a few common reading scores and return a score object with the results to the caller.

Lines 29-46 calculate reading scores for a list of URLs. Lines 36-37 fetch the individual pages and calculate the score. In lines 38-39 we extend the score object with fields to identify the URL and title of the page. The ScorePageList function returns a list of these score objects.

The purpose of the GetStories function (lines 54-65) is to retrieve a list of URLs representing newspaper articles. After fetching the root page from news.com (line 56), we follow the link called "All the headlines" to a page that contains all the stories of the day (line 57). Lines 58-59 perform the extraction of the story URLs. We locate all the anchors appearing after a bullet symbol (identified by the &#149; character entity) that are not written in the strong font. Lines 60-62 construct a list object of all the URLs found.

The main program starts at line 67. First we fetch the stories and then score them. Lines 70-77 take care of sorting the stories according to score. Finally, lines 79-84 take care of printing the result and writing it to a file.

WebCrawler

In this example we illustrate how to build a simple web crawler framework that can easily be cust
46. with page p.
Returns the page a piece belongs to.
Returns the page a tag belongs to.
Returns a pretty-printed version of the page. (1)

(1) WebL tries to ensure that the pretty-printed page still renders in the browser in the same manner as the original page, by using some limited inbuilt knowledge about markup. For example, HTML preformatted elements (PRE) are not changed.

TABLE 18. Miscellaneous Functions (continued)

Pretty(q: piece): string
  Returns a pretty-printed version of a piece.
Size(p: pieceset): int
  Returns the number of pieces belonging to p.
Text(p: page): string
  Returns the text (sans tags) of a page.
Text(q: piece): string
  Returns the text (sans tags) of a piece.

Piece Comparison

Pieces can be compared for equality, containment, position relative to each other, and so on. These tests play a very important role in the piece set operators introduced in the following section.

Without regard to unnamed pieces, the comparison of named pieces is quite straightforward. For example, piece x is equal to piece y if the following is true:

  BeginTag(x) == BeginTag(y) and EndTag(x) == EndTag(y)

Unfortunately, the situation is more complicated when unnamed pieces are involved. So far we have only seen unnamed pieces being created as a side effect of the Pat function; following sections will illustrate that many other functions have
47. 1          1       int
1.41       1.41    real
'a'        a       char
"abc"      abc     string
`abc` (a)  abc     string

a. Back-quoted string constants differ from regular string constants in that escape sequences contained in the string are not expanded.

Operators

Operators combine expressions into more complicated expressions. Evaluating an expression involves evaluating the operands (constituent expressions), performing some computation on the resulting values, and returning a result. Examples include numerical, boolean and service combinator operators. The evaluation sequence of operands is typically left to right.

TABLE 2. Operator examples

Expression        Value    Value Type
true or false     true     bool
2 + 2             4        int
1.2 * 2           2.4      real
"abc" + "def"     abcdef   string

Statements

WebL uses typical imperative programming language constructs like while, if and try statements. These statements are expressions in WebL, which means that they also evaluate to a value (often to the value nil).

Examples:

  while x > 0 do x = x - 1 end
  if x > y then y else x end
  if x == 1 then y = x + 2 elsif x == 2 then y = x + 7 else y = 1 end
  every s in "Hello World" do PrintLn(s) end
  repeat x = x * 2 until x > y end

Variables and Scoping

A context is a set of variables and associated value bindings. Expression evaluation is performed in a context, which specifies the values of the variables that appear i
48. Compaq's Web Language
A Programming Language for the Web

Hannes Marais
Compaq Systems Research Center (SRC)

This document describes version 3.0 of Compaq's Web Language (hereinafter abbreviated to WebL, its former name). See also http://www.compaq.com/WebL.

Acknowledgements

WebL was initially designed and implemented by Thomas Kistler and Hannes Marais. Service combinators were contributed by Luca Cardelli and Rowan Davies. Tom Rodeheffer suggested many improvements to the language and implementation. Monika Henzinger, Jeff Dean, Brian Eberman and Jin Yu contributed many suggestions, bug fixes and improvements. Cynthia Hibbard, Dominique Marais and Krishna Bharat corrected several mistakes in the user manual. Since the release of the software in July 1998, many corrections and improvements have been made by WebL users themselves. The list of contributors and their contributions is contained in the file BugList.java, which is part of the WebL source distribution.

(c) Copyright Compaq Computer Corporation 1998, 1999. All rights reserved.

The WebL software contains regular expression software developed by Daniel F. Savarese. Copyright (c) 1997-1999 by Daniel F. Savarese. All rights reserved.

Table of Contents

CHAPTER 1  Introduction 11
CHAPTER 2  The Language Core 17
  Basic Terminology 17
    Expressions 17
    Value Ty
49. Elem(X, "b")

FIGURE 6. Operation of P without Q

P intersect Q

The intersect operator intersects each element of P with all the overlapping pieces of Q. The resulting piece set contains all the parts of P that are in common with pieces of Q. As parts of pieces of P are cut away by the intersection, new pieces need to be created, and thus new unnamed tags are inserted into the page. Another way of thinking about the operator is that it calculates the overlap between pieces. Figure 7 shows how this is done.

Example:

  // The parts of a page that are both italic and bold
  Elem(X, "i") intersect Elem(X, "b")

FIGURE 7. Operation of P intersect Q

V. Miscellaneous Functions

Children(p)

The Children function returns all the children pieces of piece p. The children of a piece include all the elements directly contained in the piece and all the text segments directly contained in the piece. Markup elements that are only partially inside p because of overlap are not regarded as children of p. For example, the childre
50. L language have built-in knowledge of web protocols like HTTP and FTP, but it also knows how to process documents in plain text, HTML and XML format. The flexible handling of structured text markup, as found in HTML and XML documents, is an important feature of the language. In addition, WebL also supports features that simplify handling of communication failures, the exploitation of replicated documents on multiple web services for reliability, and performing multiple tasks in parallel. WebL also provides traditional imperative programming language features like objects, modules, closures, control structures, etc.

To give a better idea of how WebL can be applied for web task automation, and also what makes WebL different from other languages, it is instructive to discuss the computational model that underlies the language. In addition to the conventional features you would expect from most languages, the WebL computation model is based on two new concepts, namely service combinators and markup algebra. For now we can describe these two concepts of WebL in the following way.

Service combinators is a formalism that can provide more reliable access to web resources and services. Very succinctly, service combinators is an exception handling mechanism that is powerful enough to encode robust behavior when communication failures occur. This concept is especially important for perf
51. PostURL(url: string): page
PostURL(url: string, params: object/string): page
PostURL(url: string, params: object/string, headers: object): page
PostURL(url: string, params: object/string, headers: object, options: object): page
HeadURL(url: string): page
HeadURL(url: string, params: object/string): page
HeadURL(url: string, params: object/string, headers: object): page

Description

GetURL: Uses the HTTP GET protocol to fetch the resource identified by the URL. The params object or string contains the parameters of a GET that includes a query. The headers object specifies the additional headers to include in the GET request. The options object allows, amongst other functions, the overriding of the MIME type and DTD to be used for parsing the page.

PostURL: Uses the HTTP POST protocol to fetch the resource identified by the URL. The params object or string contains the parameters of a POST to fill in a web form. The headers object specifies the additional headers to include in the POST request. The options object allows, amongst other functions, the overriding of the MIME type and DTD to be used for parsing the page.

HeadURL: Uses the HTTP HEAD protocol to fetch the resource headers identified by the URL. The params object contains the parameters of the HEAD request. The headers object specifies the additional headers to include in the HEAD request.
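The way a params object turns into request data can be illustrated outside WebL. The following is a minimal Python sketch, not WebL code and not the WebL implementation. It assumes the parameters are form-encoded (x-www-form-urlencoded) and that a list value repeats the parameter name; the helper names build_get_url and build_post_body and the example.com URL are hypothetical.

```python
from urllib.parse import urlencode


def build_get_url(url, params=None):
    """Append form-encoded params to url as a query string (hypothetical helper)."""
    if not params:
        return url
    # doseq=True expands list values into repeated name=value pairs.
    return url + "?" + urlencode(params, doseq=True)


def build_post_body(params):
    """Encode params as the body of a form POST (hypothetical helper)."""
    return urlencode(params, doseq=True)


# Dict insertion order is preserved, so fields are emitted in definition order.
get_url = build_get_url("http://www.example.com/search", {"q": "WebL", "n": 10})
post_body = build_post_body({"gender": "male", "language": ["WebL", "Java"]})

print(get_url)
print(post_body)
```

In this sketch the only difference between the GET and POST variants is where the encoded string ends up: after a ? in the URL for GET, in the request body for POST.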
52. Returns the time in milliseconds it takes to evaluate the expression x.
Performs the expression x and returns its value. If the evaluation takes more than the specified amount of time (in milliseconds), an exception is thrown instead.
No operation.
Converts an integer to the equivalent Unicode character.

TABLE 13 Core Built-in Functions

ToInt(c: char): int
   Returns the Unicode character number of a char.
ToInt(i: int): int
   No operation.
ToInt(r: real): int
   Truncates the real value to an integer (rounding towards zero).
ToList(s: set): list
ToList(l: list): list
ToList(s: string): list
ToList(o: object): list
ToList(p: pieceset): list
   Enumerates all the elements of the argument and returns a list. See Every Statement on page 38.
ToInt(s: string): int
   Converts a string to the numeric equivalent.
ToReal(c: char): real
   Same as ToReal(ToInt(c)).
ToReal(i: int): real
   Converts an integer to a real.
ToReal(r: real): real
   No operation.
ToReal(s: string): real
   Converts a string to a real value.
ToSet(s: set): set
ToSet(l: list): set
ToSet(s: string): set
ToSet(o: object): set
ToSet(p: pieceset): set
   Enumerates all the elements of the argument and returns a set. See Every Statement on page 38.
ToString(x): string
   Converts a value to its
Trap(x): object
Type(x): string
53. TML 4.0 DTD is used for the lookups.
Removes the field fld from the object o. Nothing happens if the field fld does not exist.
Returns the first element in a list.
Returns a flattened piece set (without any overlapping) of all the parts of the page covered by s.
Explicitly invokes the Java garbage collector.
Uses the HTTP GET protocol to fetch the resource identified by the URL. The params object or string contains the parameters of a GET that includes a query. The headers object specifies the additional headers to include in the GET request. The options object allows, amongst other functions, the overriding of the MIME type and DTD to be used for parsing the page.
Uses the HTTP HEAD protocol to fetch the resource headers identified by the URL. The params object contains the parameters of the HEAD request.

TABLE 46 Built-in Functions

Function
HeadURL(url: string, params: object/string, headers: object): page
InsertAfter(t: tag, q: piece): nil
InsertAfter(t: tag, s: pieceset): nil
InsertBefore(t: tag, q: piece): nil
InsertBefore(t: tag, s: pieceset): nil
Markup(p: page): string
Markup(q: piece): string
Name(q: piece): string
Native(classname: string): fun
NewNamedPiece(name: string, q: piece): piece
NewNamedPiece(name: string, t1: tag, t2: tag): piece
NewPage(s: string, mimetype: string): page
54. X to be of type plain text:

   GetURL(X, nil, nil, [. mimetype = "text/plain" .])

TABLE 14 Supported MIME Types

MIME Type          Parser Used
text/plain         Plain text
text/html          HTML
text/xml           XML
application/xml    XML

A related problem that we often face when processing HTML pages is that the author of the page either gives no indication which version of HTML is being used, or gives an incorrect indication. The HTML version information is part of the DOCTYPE tag and identifies the HTML DTD to be used to parse the page. WebL relies on this information to parse an HTML page correctly. In the case of an incorrectly authored page, the DTD can be explicitly overridden by the WebL programmer by adding a dtd field to the options object argument. The value of the parameter should be the officially assigned name of the DTD. For example, the following option values identify HTML 4.0, 3.2 and 2.0:

   dtd = "-//W3C//DTD HTML 4.0//EN"
   dtd = "-//W3C//DTD HTML 3.2//EN"
   dtd = "-//IETF//DTD HTML//EN"

The fields of the option argument to GetURL and PostURL are summarized in Table 16 on page 65.

TABLE 15 Functions to Retrieve Web Pages

Function
GetURL(url: string): page
GetURL(url: string, params: object/string): page
GetURL(url: string, params: object/string, headers: object): page
GetURL(url: string, params: object/string, headers: object, options: object): page
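The MIME type override can be pictured as a table lookup in which the caller's options take precedence over what the server reports. The following Python sketch is only an illustration of that idea, not WebL's code. The table contents come from Table 14; the function name choose_parser, the handling of a charset parameter, and the error raised for unknown types are assumptions.

```python
# MIME type -> parser, as listed in Table 14.
PARSERS = {
    "text/plain": "Plain text",
    "text/html": "HTML",
    "text/xml": "XML",
    "application/xml": "XML",
}


def choose_parser(server_mimetype, options=None):
    """Pick a parser, letting options["mimetype"] override the server's value."""
    options = options or {}
    mimetype = options.get("mimetype", server_mimetype)
    # Assumption: drop parameters such as "; charset=..." and ignore case.
    mimetype = mimetype.split(";")[0].strip().lower()
    if mimetype not in PARSERS:
        # Assumption: unknown types are rejected rather than guessed at.
        raise ValueError("unsupported MIME type: " + mimetype)
    return PARSERS[mimetype]


print(choose_parser("text/html; charset=iso-8859-1"))
print(choose_parser("application/octet-stream", {"mimetype": "text/plain"}))
```

The second call shows the override at work: the server's type has no parser, but the explicit mimetype option forces plain-text handling.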
55. ables
TABLE 1 Constant examples 19
TABLE 2 Operator examples 19
TABLE 3 Constructor expressions 22
TABLE 4 Boolean Expressions 23
TABLE 5 Character Expressions 24
TABLE 6 Escape Sequences 24
TABLE 7 String Expressions 25
TABLE 8 Integer Expressions 25
TABLE 9 Real Expressions 26
TABLE 10 List Expressions 26
TABLE 11 Set Expressions 27
TABLE 12 WebL Core Operators 33
TABLE 13 Core Built-in Functions 41
TABLE 14 Supported MIME Types 63
TABLE 15 Functions to Retrieve Web Pages 64
TABLE 16 Fields of the option object 65
TABLE 17 Piece Set Searching Functions 79
TABLE 18 Miscellaneous Functions 82
TABLE 19 Comparing Pieces x and y 86
TABLE 20 Piece and Piece Set Operators 101
TABLE 21 Piece and Piece Set Functions 104
TABLE 22 Formal Definitions of Piece Set Operators 105
TABLE 23 Page Modification Functions 112
TABLE 24 Standard WebL Modules 113
TABLE 25 Module Base64 115
TABLE 26 Module Browser 116
TABLE 27 Module Cookies 118
TABLE 28 Module Farm 120
TABLE 29 Methods of Farm Objects 120
TABLE 30 Module Files 121
TABLE 31 Module Java 128
TABLE 32 Conversion of Java types into WebL types 129
TABLE 33 Conversion of WebL types into Java types 130
TABLE 34 Format of the Servlet request parameter object 134
TABLE 35 Format of the Servlet response parameter object 135
TABLE 36 Module Str 136
TABLE 37 Module Url 139
TABLE 38 URL constituents 140
TABLE 39 Module WebServer 145
TABLE 40 Fields of the Request Object 146
TABLE 41 Fields of the Response Object 147
TABLE 42 WebL Co
56. ace function deletes each piece in the first argument and inserts copies of the pieces of the second argument at that position. The function can be coded using Delete and InsertAfter in the following manner:

   var Replace = fun(A, B)
      every a in A do
         Delete(a);
         InsertAfter(BeginTag(a), B)
      end
   end

Note that this encoding is only possible because the Delete function does not remove any tags from the page, and thus we can apply the BeginTag function to a deleted piece without causing an exception.

Examples

Make all links bold:

   links = Elem(P, "a");
   every L in links do
      Replace(L, NewPiece("<b>" + Markup(L) + "</b>"))
   end

Replace all links with the word censored:

   Replace(Elem(P, "a"), NewPiece("<i>censored</i>"))

TABLE 23 Page Modification Functions

Function
Delete(s: pieceset): nil
Delete(q: piece): nil
InsertBefore(t: tag, q: piece): nil
InsertBefore(t: tag, s: pieceset): nil
InsertAfter(t: tag, q: piece): nil
InsertAfter(t: tag, s: pieceset): nil
NewPiece(t1: tag, t2: tag): piece
NewPiece(q: piece): piece
NewNamedPiece(name: string, t1: tag, t2: tag): piece
NewNamedPiece(name: string, q: piece): piece
Replace(a: pieceset, b: pieceset): nil

Description
Deletes s or q from the page by removing all the pieces from the page data structure.
Inserts a copy of q before the ta
57. acter data. We can use a simple train-like pictorial representation of a page to illustrate the conversion (Figure 1). In the figure, each box represents either a tag or a piece of text. The WebL model also supports unnamed tags, the purpose of which will become clearer soon. The equivalent HTML or XML for an unnamed tag is <>, which of course does not occur in practice. WebL uses unnamed tags as place markers in a page. As their name suggests, unnamed tags do not have a name or attributes.

FIGURE 1 Converting Markup into Tag and PCData Sequences

   <ul><li>Modula-3</li><li>Pascal</li></ul>

Pieces

A piece is a WebL value type that denotes a region of a page. Each piece refers to two tags: the begin tag that denotes the start of the region, and the end tag that denotes the end of the region. The region includes both the begin and end tag, and everything between them in the page. Note that the begin and end tag can also be the same. Another important fact is that pieces never point to text segments.

The most common types of pieces are those that correspond to elements in a page. We extend the box diagram notation to include triangles to denote pieces, and lines from pieces to tags to denote begin and end tags (Figure 2). To allow the programmer to access element attributes, the attributes of
58. age, regexp: string): pieceset
Pat(q: piece, regexp: string): pieceset
PostURL(url: string): page
PCData(p: page): pieceset
PCData(p: piece): pieceset
PostURL(url: string, params: object/string): page
PostURL(url: string, params: object/string, headers: object): page

Description
Returns the page a piece belongs to.
Returns the page a tag belongs to.
Extracts the paragraphs in p according to the paragraph terminator specification paraspec. See Paragraph search on page 75.
Extracts the paragraphs in p according to the paragraph terminator specification paraspec. See Paragraph search on page 75.
Returns the element in which q is nested (direct parent in the parse tree).
Returns all the occurrences of a regular expression pattern in page p.
Returns all the occurrences of a regular expression pattern located inside the piece q.
Uses the HTTP POST protocol to fetch the resource identified by the URL.
Returns the parsed character data of the page. This corresponds to the individual sequences of text on the page, as delimited by markup tags.
Returns the parsed character data of the piece. This corresponds to the individual sequences of text inside the piece, as delimited by markup tags.
The params object or string contains the parameters of a POST to fill in a web form. The headers object specifies the additional headers to include in the POST request.
59.
   var fac = fun(n)
      if n == 1 then 1 else n * fac(n - 1) end
   end;
   fac(10)

The function context created during function execution is nested in the context in which the function was initially created. This allows the construction of higher-order functions and closures. As an example, we define a function that returns a function that adds a certain number to its argument:

   var MakeAdder = fun(c)
      fun(x) x + c end
   end;
   var Add5 = MakeAdder(5);
   Add5(10)   // Add 5 to 10

WebL requires the introduction of a new variable before its first use. This creates a problem when functions need to mutually refer to each other, because one function has not been introduced yet at the place where it is called. Fortunately, as functions are first-class citizens in WebL, the problem can be overcome by declaring mutually recursive functions by introducing the function variable names, and afterwards assigning them values:

   var f, g;
   f = fun() ... g() ... end;
   g = fun() ... f() ... end

Type Object

The object constructor constructs values of type object. Objects have fields, each field having a specific field value. Fields are typically used to store object variables, functions and methods inside the object. The fields themselves may be of any value type, i.e. they are untyped. Indexing an object with the field retrieves the value of that
60. aluated during value construction.

TABLE 3 Constructor expressions

Expression               Value               Value Type
[1, 2, 2+1]              [1, 2, 3]           list
{1, 2, 1+1}              {1, 2}              set
[. a = 1+1, b = 2 .]     [. a = 2, b = 2 .]  object

Dynamic Types

Recall that WebL values have a value type that determines to a large extent what can be done with the value, i.e. what operators can be applied to it. The following paragraphs will explain the characteristics of each value type and give examples.

Type Nil

The keyword nil denotes a special value that indicates that a variable has no value. Note that we refer both to the value and the value type as nil. Variables that are declared without an initial value are initialized to the nil value.

Type Bool

The lexical constants true and false evaluate to a value of type bool (short for boolean). Expressions containing operators that compare values (for example equal or less than) also evaluate to a boolean. Boolean expressions can be combined with logical and and logical or operators, and are evaluated in a short-circuited fashion.

TABLE 4 Boolean Expressions

Expression        Value    Value Type
true              true     bool
false             false    bool
true or false     true     bool
true and false    false    bool
1 == 1            true     bool
1 <= 1            true     bool
1 != 1            false    bool

Type Char

A lexical character constant evaluates to a value of
61. amming Language for the Web Reading Grades 43 end 44 end 45 res 46 end 47 48 var FollowLink fun page anchortext 49 var dest Elem page a contain 50 Pat page anchortext 0 51 GetURL dest href 52 end 53 54 var GetStories fun 55 var res 56 var P GetURL http www news com 57 var H FollowLink P i all the headlines 58 var A Elem H a directlyafter Pat H amp 149 59 inside Elem H strong 60 every a in A do 61 res res a href 62 end 63 PrintLn Size res articles found 64 res 65 end 66 67 var pages GetStories 68 var res ScorePageList pages 69 70 res Sort res 71 fun a b 72 var diff a Kincaid b Kincaid 73 if diff gt 0 0 then 1 74 elsif diff 0 0 then 0 75 else 1 76 end 77 end 78 79 vars 80 every x in res do 81 PrintLn x Kincaid x Title 82 s s x Kincaid x Title r n 83 end 84 Files SaveToFile kincaid txt s j WebL A Programming Language for the Web 151 Examples Lines 3 24 implement the core of the scoring function After extracting the text of the page line 4 we proceed to calculate the number of letters line 5 words line 6 and syllables line 7 on the page using a few simple regular expressions Lines 9 11 take care of removing any initials that might appear in the page This is neces sary as the number of periods in the page is
62. an raw computation speed. It is thus better suited as a rapid prototyping tool than a high-volume production tool.
• WebL is implemented as a stand-alone application that fetches and processes web pages according to programmed scripts.

Programming Language
• WebL is a high-level, imperative, interpreted, dynamically typed, multi-threaded expression language.
• WebL's standard data types include boolean, character, integer (64-bit), double precision floats, Unicode strings, lists, sets, associative arrays (objects), functions, and methods.
• WebL has prototype-like objects.
• WebL supports fast immutable sets and lists.
• WebL has special data types for processing HTML/XML that include pages, pieces (for markup elements), piece sets, and tags.
• WebL uses conventional control structures like if-then-else, while-do, repeat-until, try-catch, etc.
• WebL has a clean, easy-to-read syntax with C-like expression and Modula-like control structures.
• WebL supports exception handling mechanisms (based on Cardelli & Davies' service combinators) like sequential combination, parallel execution, timeout, and retry.
• WebL can emulate arbitrary complex page fetching behaviors by combining services.

Protocols Spoken
• WebL speaks whatever protocols Java supports, i.e. HTTP, FTP, etc.
• WebL can easily fill in web-based forms and navigate between pages.
• WebL has HTTP cookie support.

Progra
63. and file protocol schemes.

Module Url

TABLE 37 Module Url

Decode(s: string): string
   Decodes a string in the MIME type encoding x-www-form-urlencoded. The complementary function is called Encode.
Encode(s: string): string
   Encodes a string in the MIME type encoding x-www-form-urlencoded, which is generally used to encode input form parameters. The complementary function is called Decode.
Glue(obj: object): string
   Takes the constituent parts of a URL (as broken up by Split) and glues them together to form a URL again.
GlueQuery(obj: object): string
   Given an object constructed from SplitQuery, GlueQuery returns the original query string again.
Resolve(base: string, rel: string): string
   Given the base URL base and the relative URL rel, returns the resolved URL.
Split(url: string): object
   Splits a URL into its constituent parts like scheme, host, path, query, etc. Each part becomes a field of the object returned.
SplitQuery(query: string): object
   Splits a query string into its constituent parts. Each part becomes a field of the object returned. Query strings typically follow the ? in URLs.

TABLE 38 URL constituents

Field name    Used in schemes    Description
scheme        All                http, ftp, file, etc.
host          http, ftp, file    Host name
port          http, ftp
64. aracters should be useable in the package's regular expressions. Perl5 regular expressions consist of:
• Alternatives separated by |
• Quantified atoms (Table 48)
• Atoms: regular expressions within parentheses, character classes (e.g. [abcd]), ranges (e.g. [a-z]), and the patterns in Table 50.

Special backslashed characters work within a character class (except for backreferences and boundaries). \b is backspace inside a character class. Any other backslashed character matches itself. Expressions within parentheses are matched as subpattern groups and saved for use by certain methods.

TABLE 48 Quantified Atoms

Pattern    Description
{n,m}      Match at least n but not more than m times
{n,}       Match at least n times
{n}        Match exactly n times
*          Match 0 or more times
+          Match 1 or more times
?          Match 0 or 1 times

By default, a quantified subpattern is greedy. In other words, it matches as many times as possible without causing the rest of the pattern not to match. To change the quantifiers to match the minimum number of times possible, without causing the rest of the pattern not to match, you may use a ? right after the quantifier (Table 49). Perl5 extended regular expressions are fully supported (see Table 51).

Regular Expression Tips

Combining regular expressions and WebL code might sometimes be a little confusing. The following tips might help:
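The difference between greedy and minimal quantifiers is easiest to see on a concrete string. The sketch below uses Python's re module rather than WebL or the Perl5 package the manual describes; the quantifiers shown behave the same way in both dialects, but the sample strings and variable names are made up for illustration.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as possible, so the match runs to the last </b>.
greedy = re.findall(r"<b>.*</b>", text)

# Minimal: a ? right after the quantifier stops at the first </b>.
minimal = re.findall(r"<b>.*?</b>", text)

# Counted quantifier: whole numbers of at least 2 and at most 4 digits.
counted = re.findall(r"\b\d{2,4}\b", "7 20 1998 123456")

print(greedy)
print(minimal)
print(counted)
```

When a pattern unexpectedly swallows several elements of a page at once, a greedy quantifier is usually the cause, and switching to the minimal form is the first thing to try.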
65. arameters in the correct way, as required by the GET and POST protocol variants. In the case of a PostURL request, the correct construction of the parameter object needs to be deduced by the programmer from the Web form where the request originates. This is beyond the scope of this manual. Note that the HTTP specification requires that the POST parameters be submitted in the order they appear in the form on the page. It is thus important to list the param object fields in the same order as the fields in the form (recall that object fields are ordered according to definition sequence).

There are two tricks that are sometimes needed when submitting form data. The first trick involves posting multiple parameters that have the same name. An HTML form might allow the user to pick several options from a list, for example to indicate his or her favorite programming language. This can be specified as follows:

   PostURL("http://...", [. gender = "male", language = ["WebL", "Java"] .])

Note the use of a field of type list to indicate the multiple values. For those readers familiar with the HTTP specification, the data will be posted as follows in the body of the HTTP request:

   gender=male&language=WebL&language=Java

Note how the parameter language appears twice in the submitted data.

The second trick is a work-around for the case when the submitted parameters do not match well with the WebL object type. This might for example be the cas
66. at contain each of the groups occurring from left to right in the pattern. A field named 0 is also added to the piece, which contains the complete matched pattern. For example, the following code fragment recognizes dates of the form day month year:

   Pat(P, `(\d\d) (\w+) (\d+)`)

The date pattern contains three groups: one for the two-digit day, one for a word representing the month, and one for the digits of the year. Given the occurrence of the string 20 Jan 1998 in a page, the corresponding piece object would carry the complete match "20 Jan 1998" in field 0, and the groups "20", "Jan" and "1998" in the fields that follow.

See page 182 for more details on the syntax of Perl5 regular expressions.

Once a piece set has been created with the Elem or Pat functions, we can apply WebL operators and functions to the result to perform further computation. For example, by indexing into a piece set with the indexing operator [], we can extract the nth element of the piece set.

PCData search

The PCData function returns a piece set of all text segments that are contained in a page or piece. The name PCData is derived from the term parsed character data, which denotes the text segments on a page, i.e. what is left over when all markup tags are removed from a page. The PCData function is thus complementary
67. aths are requested. For example, the following program starts the web server on port 90 and publishes a function called Echo that returns an HTML page. Afterwards it goes to sleep while requests are serviced. If a request for the URL /bin/echo is received by the server, the Echo function is invoked.

   import WebServer;
   WebServer_Start("c:/InetPub/wwwroot", 90);
   var Echo = fun(req, res)
      res.result = "<html><body>Hello</body></html>"
   end;
   WebServer_Publish("/bin/echo", Echo);
   while true do Sleep(10000) end

Functions may be published under any (case-sensitive) name. The web server will first consult the list of exported functions when a request is received, by comparing the path of the URL requested to each of the given names of the published functions. Should no published name match the URL path requested, the web server attempts to serve a file in the directory rooted by fileroot.

The formal arguments req and res represent respectively the request the web server received, and the response the web server has to return. The idea is that the invoked function looks at the object req to figure out what to do, and modifies the object res to tell the server what to do, i.e. what data to return, etc. For example, given the following request to the web server running on a machine called ck.pa.dec.com:

   http://ck.pa.dec.com:90/bin/echo?x=3
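The lookup order described above (published functions first, then files below fileroot) can be sketched in a few lines. This is a Python illustration of the dispatch idea only, not the WebL WebServer module. The class TinyServer, its method names, and the request fields "path" and "param" are hypothetical; the response fields statuscode, statusmsg and result mirror names used for the response object.

```python
import os


class TinyServer:
    """Illustrative dispatcher only; this is not the WebL WebServer module."""

    def __init__(self, fileroot):
        self.fileroot = fileroot
        self.published = {}  # URL path -> handler function

    def publish(self, name, handler):
        self.published[name] = handler

    def handle(self, path, params=None):
        # "path" and "param" are hypothetical request field names.
        req = {"path": path, "param": params or {}}
        res = {"statuscode": 200, "statusmsg": "OK", "result": ""}
        handler = self.published.get(path)  # exact, case-sensitive match
        if handler is not None:
            handler(req, res)  # the handler fills in res
            return res
        # No published name matched: try a file below fileroot.
        # (A real server must also reject paths that escape fileroot.)
        filename = os.path.join(self.fileroot, path.lstrip("/"))
        if os.path.isfile(filename):
            with open(filename, "r") as f:
                res["result"] = f.read()
        else:
            res["statuscode"], res["statusmsg"] = 404, "Not Found"
        return res


def echo(req, res):
    res["result"] = "<html><body>Hello</body></html>"


server = TinyServer(fileroot="wwwroot")
server.publish("/bin/echo", echo)
print(server.handle("/bin/echo", {"x": "3"})["result"])
```

Because the published names are compared exactly, a request for /Bin/Echo would not reach the echo handler in this sketch and would fall through to the file lookup instead.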
68. atuscode
   Integer status code to be returned. The semantics of the codes are listed in the HTTP specification.
statusmsg
   Status message that matches this status code.
result
   The page that is to be returned to the client.
header
   The header fields the server will return to the client.

CHAPTER 6 Examples

The purpose of this chapter is to give a feeling for how WebL can be used in real-world programs. It contains three case studies:
• Calculating statistics on newspaper articles
• The implementation of a simple multi-threaded web crawler class
• The implementation of a highlight proxy

Reading Grades

The following program calculates the Kincaid score of a set of headline newspaper articles found on the www.news.com web server, and outputs a sorted table of those article titles to the file kincaid.txt. The Kincaid scoring function is used to judge reading ease of an English document based on its sentence and word characteristics. The function's output ranges from 5.5 to 16.5 in reading grade level.

Note that this implementation does not calculate the correct Kincaid reading grade, as it takes some shortcuts in calculating the number of sentences and syllables in a page. Also, web pages tend to contain a lot of headings and so on, which are not identified correctly as sentences. Web pages differ enough f
69. bL")[0];
   b = Elem(P, "p")[5];
   NewNamedPiece("i", BeginTag(a), EndTag(b))

Turn all occurrences of WWW to a hyperlink:

   every x in Pat(P, "WWW") do
      var p = NewNamedPiece("a", x);
      p.href := "http://www.w3.org"
   end

FIGURE 10 Application of the NewPiece function

Inserting Pieces

The functions InsertBefore and InsertAfter insert a piece into a page either before or after a specified tag. Inserting a piece p involves copying the contents of the piece and inserting the copied tags and text segments one after another at the destination point, according to the following rules:
• All the text segments contained in p are copied.
• All the named tags contained in p are copied (this also includes the named tags of p itself). Also, suppose there exists a piece q that is either inside p or overlaps with p. In case q is inside p, both the begin tag and end tag of q will be copied to the destination. Otherwise, if q overlaps with p and is not inside p, we will, according to our definition, only copy either the begin tag or end tag of q. To prevent this unfortunate situation with dangling pieces, the tag of q outside of p is also copied. In case of many dangling tags outside of p, we copy all of them, making sure that their relative ordering is preserved.
• Unnamed tags contained in p are not copied.

Figure 11 shows ho
70. ce, p: pieceset): pieceset
directlycontain(p: pieceset, q: pieceset): pieceset
directlyinside(p: piece, q: piece): pieceset
directlyinside(p: pieceset, q: piece): pieceset
directlyinside(p: piece, q: pieceset): pieceset
directlyinside(p: pieceset, q: pieceset): pieceset
div(x: int, y: int): int
inside(p: piece, q: piece): pieceset
inside(p: pieceset, q: piece): pieceset
inside(p: piece, q: pieceset): pieceset
inside(p: pieceset, q: pieceset): pieceset
intersect(p: piece, q: piece): pieceset
intersect(p: pieceset, q: piece): pieceset
intersect(p: piece, q: pieceset): pieceset
intersect(q: pieceset, p: pieceset): pieceset
member(x, s: set): bool
member(x, l: list): bool
member(x, o: object): bool
mod(x: int, y: int): int
or(x: bool, y: bool): bool
and(x: bool, y: bool): bool

Description
directlyafter: All the elements of p that follow directly after any element of q.
directlybefore: All the elements of p that are directly before any element of q.
directlycontain: All the elements of p that directly contain any element of q.
directlyinside: All the elements of p that are directly inside any element of q.
div: Whole division.
inside: All the elements of p that are located inside any element of q.
intersect: All the elements of p that overlap an element in q, each of them repeatedly intersected with all overlapping elements in q.
member: Set, list and object membership test.
mod: x mod y.
or, and: Logical operators (short-circuit evaluation).
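The containment operators in this table become easier to reason about when a piece is modelled as a pair of tag positions. The Python sketch below does that for inside and directlyinside. It is an illustration, not WebL's implementation: the tuple representation, the strict-nesting test, and the reading of "directly" as "no other element of p lies in between" are all assumptions.

```python
def _within(p, q):
    """True if piece p = (begin, end) lies strictly inside piece q."""
    return q[0] < p[0] and p[1] < q[1]


def inside(P, Q):
    """Elements of P located inside any element of Q."""
    return [p for p in P if any(_within(p, q) for q in Q)]


def directlyinside(P, Q):
    """Elements of P inside some q in Q with no other element of P in between."""
    result = []
    for p in P:
        for q in Q:
            nested_deeper = any(
                _within(p, other) and _within(other, q) for other in P
            )
            if _within(p, q) and not nested_deeper:
                result.append(p)
                break
    return result


# Tag positions for a list with one nested sub-list:
# <ul>0 <li>1..2 <li>3..4 <li>5 <ul>6 <li>7..8 <li>9..10 </ul>11 </li>12 <li>13..14 </ul>15
lists = [(0, 15), (6, 11)]
items = [(1, 2), (3, 4), (5, 12), (7, 8), (9, 10), (13, 14)]

print(inside(items, lists))                # every item is inside some list
print(directlyinside(items, [lists[0]]))   # only the outermost items
```

In this model the two sub-list items are inside the outer list but not directly inside it, because the third item sits between them and the list.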
71. cedence Level 10 10 10 20 20 20 30 30 30 30 40 40 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 Fix Right bracket Infix Right bracket Prefix Prefix Prefix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Infix Associativity Left Right Right Right Left Left Left Left Left Left Left Left Left Left Left Left Left Left Left Left Left Left Left Left Left 166 WebL A Programming Language for the Web Operator Precedence TABLE 44 Operator Precedence Table Precedence Operator Level Fix Associativity directlybefore 45 Infix Left directlybefore 45 Infix Left overlap 45 Infix Left overlap 45 Infix Left intersect 45 Infix Left without 45 Infix Left lt 60 Infix Left lt 60 Infix Left gt 60 Infix Left gt 60 Infix Left 70 Infix Left l 70 Infix Left and 80 Infix Right or 90 Infix Right 100 Infix Right 100 Infix Right 110 Infix Right 110 Infix Right Note Operators with a higher precedence level smaller numeric values bind tighter than those of a lower precedence level WebL A Programming Language for the Web 167 WebL Quick Reference Operators TABLE 45 WebL Operators Operator I x bool bool x y bool after p piece q piece pieceset after p pieceset q piece pieceset after p piece q pieceset piecese
72. ct in programs; it only defines the sequence in which fields are enumerated and how objects are printed. It does however play an important role for certain functions where parameter ordering is important (see Retrieving Page Objects on page 59). Note that there is no way to remove a field from an object.

Object-based programming in WebL

Combining objects and functions allows us to program in an object-based or object-oriented manner. For example, the following program implements a bank account object with methods to deposit and withdraw money. Note how we need to pass the bank account object as first actual argument to the deposit and withdraw methods. In both cases the self formal argument refers to the bank account object.

   var myaccount = [.
      balance = 0,
      deposit = fun(self, amount)
         self.balance = self.balance + amount
      end,
      withdraw = fun(self, amount)
         self.balance = self.balance - amount
      end
   .];
   myaccount.deposit(myaccount, 100);   // Deposit 100
   myaccount.withdraw(myaccount, 50);   // Withdraw 50

Type Meth

The meth constructor constructs values of type meth or method. Methods behave in all aspects (except for application, i.e. execution) in the same manner as functions. They are in fact used as a notational short-hand for method invocation without the need to pass a self parameter. We can recode the bank account program with metho
73. ction or method while executing the argument Exec cmd string int ArgumentError Incorrect or wrong number of arguments Exit errorcode int nil ArgumentError Incorrect or wrong number of arguments ExpandCharEntities p page s ArgumentError Incorrect or wrong string string number of arguments ExpandCharEntities s string string IOException An IO exception occurred during function execution First 1 list any ArgumentError Incorrect or wrong number of arguments EmptyList Cannot apply first to an empty list WebL A Programming Language for the Web 185 WebL Quick Reference TABLE 47 Exceptions thrown by the built in functions Function Flatten s pieceset pieceset Funp x bool GCO nil GetURLiurl string page GetURLiurl string params object string page GetURLiurl string params object string headers object page GetURLiurl string params object string headers object options object page GetURLiurl string page GetURLiurl string params object string page GetURLiurl string params object string headers object page InsertAfter t tag q piece nil InsertAfter t tag s pieceset nil InsertBefore t tag q piece nil InsertBefore t tag s pieceset nil Exceptions ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or w
74. ctions var P Para page tt i b u s strike big small em string dfn code samp kbd var cite acronym a img applet object font basefont script map q sub sup span bdo iframe input select textarea label button var R every p in P do R R Text p end In conclusion note that Para page is nearly equivalent to PCData page except for the fact that pieces with no content are filtered out WebL A Programming Language for the Web 77 The Markup Algebra Filtering Pieces Even though the piece searching functions introduced so far already provide power ful ways of extracting pieces from a web page it might still not be enough Suppose it is necessary to restrict the contents of a piece set to those elements whose attributes match some criteria For example we might be interested in all HTML anchors that point to a specific site Exactly for this purpose the builtin Select func tion allows you to filter the contents of a piece set according to a selection function The Select function also supports filtering of sets and lists in a similar manner The following code fragment illustrates how the select function might be used in this case import Str var A Select Elem P a fun a Str StartsWith a http site com end Note how the selection function is passed as the second argument to Select The Select function iterates over the elements of its first arg
75. d define the tag names, tag structures and hierarchical organization of documents that conform to the DTD.

HTML, as an instance of SGML, consists of a DTD that defines the exact version of HTML being used, and a set of conventions followed by web browsers for rendering the markup on a computer display. Most of HTML involves how markup should be presented, for example what fonts are used and in what size, colors, spacing, line breaks and so on. The main clients of HTML are real people viewing the pages marked up in this manner.

In contrast, XML is an instance of SGML for the exchange of content or application-specific data over the web. The idea is that if two or more people can agree on a common DTD, that is the markup and structure of a document, they can exchange documents and other information. In a simplistic way, XML can be regarded as a variant of HTML where you may define your own markup. The main clients of XML are programs that process the content of web pages; although XML can be viewed in a browser, it has nothing to do with presentation. XML documents are typically grouped according to the DTD that is used. For example, XML documents using the Content Definition Format (CDF) DTD are used in push media, and XML documents using the Chemical Markup Language (CML) DTD are used to exchange molecular structures.

Tags

At the simplest level, pages consist of sequence
76. d pattern match modifiers i enables case insensitivity m enables multiline treatment of the input s enables single line treatment of the input and x enables extended whitespace comments 198 WebL A Programming Language for the Web Symbols 33 34 33 101 34 after 102 before 102 contain 101 directlyafter 102 directlybefore 103 directlycontain 102 directlyinside 101 inside 101 overlap 103 33 101 33 101 34 33 34 gt 33 gt 33 48 48 A Abstract syntax tree 18 after 86 91 102 and 34 AppendToFile 121 Assert 41 assignment 32 Associative arrays 29 Authentication 115 B before 86 89 102 BeginTag 82 111 Bool 23 boolean 129 Boolp 41 Built in functions 40 byte 129 C Call 41 Case sensitivity 58 Char 24 char 129 Character Entities 57 WebL A Programming Language for the Web 199 Character entities 82 Charp 41 Children 98 104 Class 128 Clone 41 Command line options 159 Comments 56 162 Compare 136 Comparison operators 33 Concurrency 119 Concurrent execution 48 Constants 19 Constructors 22 contain 86 93 101 Content 100 104 Contexts 21 Cookie Databases 117 cookiedb 66 Cookies 54 61 117 Crawler 141 D DDE 116 Decode 115 138 139 Delete 109 111 112 123 DeleteField 42 directlyafter 91 102 directlybefore 90 102 directlycontain 94 102 directlyinside 93 101 Directories 123 div 33 Doc
77. d thus is equivalent to an empty tag. WebL knows about these anomalies by virtue of the HTML DTD.

Processing Instructions

Processing instruction elements instruct the page parser to perform special handling of their contents. They are used only in XML and consist of a single tag:

    <?tagname ... ?>

Here the ellipsis takes the place of the processing instructions. As with comment elements, tagname is defined as the name of the element. The processing instructions between the question marks are mapped onto the content field of the element.

SGML Directives

SGML directives provide information about the DTD of the page being parsed. They have the form:

    <!tagname ... >

Here the ellipsis takes the place of the directive. As before, tagname is defined as the name of the element, and the element has an attribute called content that stores the directive. The most commonly occurring SGML directive is DOCTYPE, which specifies the name of the DTD to be used to parse the remainder of the document.

Optional tags

Parsing of HTML is complicated by an SGML feature called optional tags, a feature that has explicitly been left out of XML. The idea is that the DTD often gives enough contextual information to infer that a start or end tag must be present at a certain position in the document. For several HTML elements, either start or end tags are declared to be optional and should be inserted automatically by the pa
78. ds in the following manner:

    var myaccount = [.
        balance = 0,
        deposit = meth(self, amount) self.balance = self.balance + amount end,
        withdraw = meth(self, amount) self.balance = self.balance - amount end
    .];
    myaccount.deposit(100);    // Deposit 100
    myaccount.withdraw(50);    // Withdraw 50

As can be seen, the only difference from the previous program is the use of the meth keyword and a convenient way of invoking methods. In fact, the internal implementation of methods is equivalent to the bank account object programmed only with functions.

Types Page, Piece, PieceSet and Tag

Value types page, piece, pieceset, and tag are an essential part of the WebL markup algebra. We will not go into details about these value types yet; they are discussed in Chapter 4.

Value Equality

Values of types nil, boolean, int, real, char, string, fun, meth, set, list, and tag are immutable. This means that once a particular value is calculated or declared in a constant, the value cannot change. For example, appending a character to a string creates a new string, inserting an element into a set creates a new set, and so on. In contrast, objects and special objects are mutable: their field values can be modified and new fields can be added.

Two immutable values are equal if their contents:
• are both nil
• have the same boolean value tru
79. e string val string nil SaveToFile filename string val string charset string nil GetURLiurl string filename string page GetURLiurl string filename string param object string page GetURL url string filename string param object string header object page GetURLiurl string filename string param object string header object options object page PostURL url string filename string page PostURL url string filename string param object string page PostURL url string filename string param object string header object page PostURL url string filename string param object string header object options object page Eval filename string any List dirname string list Description Loads a file as a string object using the character encoding specified by charset Typical values for charset are UTF8 Unicode iso 8859 1 etc Saves the string val to the specified file As above but overrides the default character encoding of the saved file Typical encodings are iso 8859 1 UTF8 etc Similar to the built in GetURL func tion except that the retrieved docu ment is saved to the indicated file The url param header and option arguments are the same as the built in GetURL function Similar to the built in PostURL function except that the retrieved document is saved to the indicated file The url param header and
80. e when the server does not conform to the HTTP specification and allows the posting of data in any format. To handle this case, you may pass a string instead of an object as the param argument of PostURL. Of course, in such a case you have to take care of encoding the parameters correctly yourself, and for this Module Url on page 138 is useful. Keeping with our example above, this would be coded as:

    PostURL("http://...", "gender=male&language=WebL&language=Java")

Also note that the GetURL, Files_GetURL, and Files_PostURL functions likewise accept a string argument instead of an object for parameters (see also Module Files on page 121). In the case of the Get variants of the functions, the parameter string is simply appended to the URL itself, with WebL adding the usual ? in between.

Headers

The GetURL and PostURL functions add extra HTTP header fields to a request when the optional header object is passed as an argument. Headers that might need to be added in this way include the client identification, cookies, etc. The response headers of a request, with names converted to lowercase, become part of the page object returned by the functions. For example, the program

    var p = GetURL("http://www.digital.com");
    PrintLn(p)

prints the page fields: server Apache/1.2.4, connection close, date Fri 01 May
81. e Options Option Description D Emit casual debugging output Llogfile Write casual debugging output to a log file C Print performance counters at end of run P Wait for ENTER when the program finishes Script search path By default WebL will search for scripts and modules in the current working directory and in the scripts sub directory inside the WebL jar file The directory search path can be changed by setting a Java system property called webl path to a set of directories This can be done on the command line with the D option java Dwebl path dirl dir2 dir3 WebL Windows java Dwebl path dirl dir2 dir3 WebL Unix Note that setting a webl path shell environment variable won t do because envi ronment variables are not accessible from Java applications Java System Properties WebL programmers can access the system properties of the underlying Java implementation through a global WebL object called PROPS For example to access the user name of the person executing the script you can write PROPS user name 160 WebL A Programming Language for the Web Running WebL Programs The following PROPS object gives here path separator ftpNonProxyHosts http proxyHost http nonProxyHosts an idea of what information is accessible from W W pa dec com www proxyl pa dec com pa dec com http proxyPort 8080 user language en ftpProxyHo
82. e or false
• have the same numerical value (converting ints to reals if necessary)
• have the same character or string
• have identical sets or lists
• have the same function or method with identical dynamic outer context

Two objects are regarded as equal when the internal references to the object data structure are equal, i.e. reference equality.

It is important to note that even though sets and lists are immutable in WebL, operating on these value types does not necessarily mean that the internal data structures are copied for each operation. WebL uses an efficient internal implementation that makes the following operations possible in constant time and space:

• Concatenating two lists
• Applying the First and Rest functions on a list
• Adding or removing an element from a set

In some cases, for example when printing a list or indexing into it, a cost proportional to the number of elements is paid once, after which the cost becomes constant again.

Operators

Table 12 lists the operators of the WebL core language. To illustrate how operators are overloaded, we use a functional notation even though the operators are written in infix, prefix, or right-bracket fix. For example, op(x: T, y: S): U denotes that an infix operator op takes a first operand x of type T and a second operand y of type S, and returns a value of type U. Unary operators ha
83. e siciecscs thee ee aes x 18 CONSTANS E r IE TAS eG ee g NE Se 19 OD EI GIONS ie a eg eRe Seg oe EE 19 Statements Inra a tisha w ORNE 20 Variables and Scoping 0 e eeee00 20 CONSIFUCIOYS Tyas cece eee eens 22 Dynamic Types cece cece eens 23 Dy De Nilin Tren se a ies be ag aes 23 LY DE BOO Les snare Sei dros te at istiane EEE Sadie 2 23 Type Char ieee bo yee EEEa ele eae 24 Type SUG a a ics oe eae og bese BS 25 Type Int rse sna te tele a oie ob Saab D TAA 25 Ly De Realis oped alates ests weir apes aa ee 26 Type List obs ve eg testes Mes Lote aa 26 Type SOB acu ssn sseni neiaie aoaaa 27 Type BUM meran src haere tashana ssmimi as 27 Type OD OCU e365 see rested sense a 29 Ly pe Meth sasi ee ae nina RENEE 30 Types Page Piece PieceSet and Tag 31 Value Equality cece eee eee eee 31 OPETALOLS soe 6 ciescai ss gi 6 Ta a ea 32 Statements esis iscsi idle hs eae SRS 35 Statement Sequences e cee eeees 35 Uf Statement erarerare ected inii i ti cteptia 35 While Statement 0 ce cece ee eeees 36 Repeat Statement 1 ce cee cece eeeee 36 Try Statement 0 cc cec cece cee eeee 36 Every Statement cece cee eececees 38 Lock Statement 0 00 c ccc ce cece eeeee 38 WebL A Programming Language for the Web CHAPTER 3 CHAPTER 4 Begin Statement ecce veces ceeees 39 Return Statement 0 0 cece cece ees 39 Built in Functions eeee eee 40 Modules Ssisicscie fois
84. e we illustrate how to perform transformations on viewed pages in a proxy-like fashion. In particular, we would like to build a highlight proxy that highlights all occurrences of a particular word on the Web in red. The highlight proxy is contacted with

    http://www.host.com:9092/bin/highlight?url=X&word=Y

where X denotes the starting-point URL on the Web, Y denotes the word that is to be highlighted, and www.host.com is the machine the proxy server runs on. Our proxy is written in such a way that all links that are followed from page X onwards are redirected to our proxy again. This is accomplished by rewriting the contents of the page.

    import Url, WebServer;

    var port = 9092;
    var where = "/bin/highlight";

    var Highlight = fun(req, res)
        var url = req.param.url ? "http://www.compaq.com";
        var word = req.param.word ? "Compaq";
        var page = GetURL(url);                       // fetch the page

        every w in Pat(page, word) !inside Elem(page, "title") do
            // wrap a font element around it
            var p = NewNamedPiece("font", w);
            p.size := "+1";                           // define its size attribute
            p.color := "red";                         // define its color attribute
        end;

        every a in Elem(page, "a") do
            a.href = where + "?word=" + Url_Encode(word)    // word parameter
                     + "&url=" + Url_Encode(a.href)         // url parameter
                     ? nil                                  // nothing if no href
        end;

        res.result = Markup(page);                    // this is the r
85. eading score 149 L LastIndexOf 136 Latin 1 162 Length 128 List 26 122 Listp 41 Load 118 LoadFromFile 121 LoadStringFromFile 121 Locks 38 long 129 202 WebL A Programming Language for the Web M Markup 80 81 82 Markup algebra 67 Match 136 member 34 meth 30 Methods 30 Methp 41 Mkdir 123 mod 33 Modules 46 Base64 115 Browser 116 Cookies 117 Farm 119 Files 121 Java 124 Url 138 WebCrawler 141 WebServer 143 Mutual exclusion 38 N Name 80 82 Native 42 Netscape 116 New 128 NewArray 128 NewFarm 120 NewNamedPiece 106 112 NewPage 80 82 NewPiece 82 106 112 NewPieceSet 82 nil 23 Non termination 49 null 129 O Object based programming 30 Objectp 41 Objects 29 Pages 59 Operator precedence 166 Operators 19 32 168 Optional tags 57 Options 159 or 34 overlap 86 92 103 Overrides 63 WebL A Programming Language for the Web 203 autoredirect 65 charset 65 dtd 65 emptyparagaphs 65 fixhtml 66 mimetype 66 resolveurls 66 P Page 81 82 Pagep 41 Pages 59 Searching functions 70 Para 75 79 Paragraph search 75 Paragraph terminators 75 Parent 99 104 Pat 71 Pattern groups 71 Pattern search 71 PCData 73 79 Perform 120 Perl5 195 PI 57 Piece set Operators 87 Piece set functions Children 98 Content 100 Flatten 99 Parent 99 Piece set operators After 91 Before 89 Contain 93 Directlyafter 91 Directlybefore 90 Directlycontain 94 Directly
86.
    // Access the following URL:
    // http://www.host.com/servlet/webl/Example2_Snoop
    export var Snoop = fun(req, res)
        res.mimetype = "text/plain";
        res.result = Decode(req);
    end;

Servlets typically use HTTP cookies to keep track of client state. The following WebL servlet maintains a visit counter inside the client's cookie:

    // File Example3.webl
    // Access the counter with
    // http://www.host.com/servlet/webl/Example3_Count
    export var Count = fun(req, res)
        res.result = "Cookie test\n";

        // Retrieve the cookie named cc
        var count = ToInt(req.cookies.cc) ?
            begin   // executed if no such cookie exists
                res.result = res.result + "No cookie\n";
                0
            end;

        res.result = res.result + "Count=" + count;
        res.mimetype = "text/plain";

        // set the new cookie
        res.cookies.cc := [.
            domain = "www.myhost.com",
            path = "/",
            value = count + 1,
            comment = "",
            maxage = -1,
            version = 0
        .];
    end;

Server setup

Servlet setup can be complicated. First make sure that you can access the demo servlets that come with your web server. Only then continue with this checklist:

1. Put WebL.jar in the CLASSPATH of your web server (server dependent).
2. Add a configuration parameter for the WebL servlet to your web server (server dependent):
       Parameter name: webl.path
       Parameter value: directory search path for WebL scripts
3. Restart your web server for changes to take aff
87. WebL supports both the GET and POST methods as built-in functions. These functions accept the request parameters in a WebL object and perform the correct encoding and packing in the HTTP request, either in the URL or at the end of the request.

Parameter encoding. For each parameter with name N and value V we construct a string N=V. All parameter strings are then concatenated, separated by & symbols, and a question mark is prepended. The URL of a GET request with parameters will thus have the general form

    http://domainname/path/filename.html?N1=V1&N2=V2&N3=V3

Names consist of alphanumeric characters. Values may contain any character except those that are reserved for URLs. To encode the latter characters, we replace them with a percent sign followed by a two-digit hexadecimal number specifying the ASCII code of the character. In addition, spaces are replaced by plus signs.

Request and response headers. HTTP request headers give the web server more information about the request itself, the browser that is being used, etc. HTTP response headers give the browser more information about the page that is returned. In contrast to parameters, which can be freely picked, headers are predefined by the HTTP protocol. A header consists of a name and a value. Although WebL can add request headers and read response headers, scripts seldom need to exercise this control. The main uses o
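The parameter-encoding rules in this excerpt can be illustrated with a short sketch. It is written in Python rather than WebL, and it is only a model of the rules as stated: `encode_params` is a hypothetical helper name, and the standard library's `quote_plus` stands in for whatever encoder WebL uses internally.

```python
from urllib.parse import quote_plus

def encode_params(base_url, params):
    """Build a GET URL from (name, value) pairs: N=V pairs joined by '&' after a '?'."""
    pairs = []
    for name, value in params:
        # quote_plus percent-encodes reserved characters as %XX
        # and replaces each space with a plus sign.
        pairs.append(f"{name}={quote_plus(str(value))}")
    if not pairs:
        return base_url
    return base_url + "?" + "&".join(pairs)

# Example values are made up for illustration.
print(encode_params("http://domainname/path/filename.html",
                    [("gender", "male"), ("q", "a b&c")]))
```

Names are passed through unchanged here because the text restricts them to alphanumeric characters; a stricter version would validate that assumption.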
88. ment. The exception object is automatically assigned to a programmer-specified exception variable. WebL will automatically declare the exception variable in a fresh context.

By definition, any WebL object can be thrown as an exception. By convention, though, most exception objects consist of a string-valued field msg that describes the exception and a string-valued field type that identifies the exception type. In some cases the exception object contains fields that give more specific information on what went wrong, for example the file and line number where it occurred. Table 47 on page 183 lists the exceptions thrown by statements, operators, and built-in functions of the WebL language.

Syntax

    CatchStat = try SS catch Ident { on E do SS } end

Examples

    try
        p = GetURL("http://www.yahoo.com")
    catch E
        on E.statuscode == 404 do PrintLn("page not found")
        on E.type == "HttpError" do PrintLn("connection error")
        on true do nil    // catch everything else
    end;

    Throw([. type = "OutOfMemory", msg = "No space left" .])

Every Statement

The every statement enumerates the elements of sets, lists, strings, objects, and piece sets. (Piece sets will be introduced later.) Set elements are enumerated in an undefined sequence. List elements are enumerated from left to right. Enumera
89. ebL allows programmers to look at both the markup structure of a page and the raw text without any tags.

Module Support

Standard modules supplied with WebL include:
• File manipulation, for writing or downloading pages to disk.
• Displaying pages in your web browser, checking which pages are being viewed in Netscape, and instructing Netscape to navigate to a specific URL (Windows only).
• Multi-processing with workers, jobs, and job queues.
• General string manipulation, including Perl 5 regular expression searches.
• Routines to split and glue together URLs.
• An easily customizable multi-threaded web crawler.
• A multi-threaded web server that allows the direct execution of WebL functions with full access to HTTP state.
• Java servlet support.
• Examples that access information from public services like AltaVista, Yahoo, etc.

Java Support and Integration

• WebL is written almost completely in Java. (The Browser access module needs access to a few Windows API calls.)
• WebL is completely portable on UNIX platforms.
• It is possible, though not recommended, to code directly against the WebL API, thus not writing WebL scripts but still using its functionality.
• It is very easy to add bridges from WebL to Java code.
• Java objects can be called directly from WebL code without extending the WebL system (see module Java).
• Java extensions are loaded dynamically and it is possible to add and remove builti
90. ece pieceset directlyafter p piece q pieceset pieceset directlyafter p pieceset q pieceset pieceset directlyafter p piece q piece pieceset directlyafter p pieceset q piece pieceset directlyafter p piece q pieceset pieceset directlyafter p pieceset q pieceset pieceset before p piece q piece pieceset before p pieceset q piece pieceset before p piece q pieceset pieceset before p pieceset q pieceset pieceset before p piece q piece pieceset before p pieceset q piece pieceset before p piece q pieceset pieceset before p pieceset q pieceset pieceset directlybefore p piece q piece pieceset directlybefore p pieceset q piece pieceset directlybefore p piece q pieceset pieceset directlybefore p pieceset q pieceset pieceset Description All the elements of p that directly contain any element of q All the elements of p that do not directly contain any element of q All the elements of p that are after any element of q All the elements of p that are not after any element of q All the elements of p that follow directly after any element of q All the elements of p that do not follow directly after any element of q All the elements of p that precede any element of q All the elements of p that do not precede any element of q All the elements of p that are directly before any element of q 102 WebL A Programming Lang
91. ecesets respectively according to a membership function f Function f must have a single argument and must return a boolean value indicating whether the actual argument is to be included in the set list or pieceset Extracts a substring of s starting at character number from and ending at character number fo exclusive 178 WebL A Programming Language for the Web Functions TABLE 46 Built in Functions Function Description Seq p page pattern string Matches all the occurrences of a pieceset Seq p piece pattern string pleceset Sign x int int Sign x real int Size l list int Size s set int Size s string int Size p pieceset int Sleep ms int nil Sort 1 list f fun list StallQ Text p page string Text q piece string sequence of elements identified by pattern See PCData search on page 73 Matches all the occurrences of a sequence of elements identified by pattern inside the piece p See PCData search on page 73 Returns 1 0 1 if x lt 0 x 0 and x gt 0 respectively Returns the number of elements in a list Returns the number of elements in a set Returns the number of characters in a string Returns the number of pieces belonging to p Suspends thread execution for the specified number of milliseconds Sorts the elements of according to the comparison function f The func tion f needs to take two formal argu m
92. ect 4 Place WebL scripts in the directory of the search path as indicated by 2 TABLE 34 Format of the Servlet request parameter object ie method string requestURI string servletpath string pathinfo string pathtranslated string querystring string remoteuser string authtype string remoteaddr string remotehost string scheme string servername string serverport string protocol string contenttype string header object param object cookies object i 134 WebL A Programming Language for the Web Module Servlet TABLE 35 Format of the Servlet response parameter object statuscode int statusmsg string result string mimetype string header name val name val cookies cookiename comment string domain string maxage int path string secure bool value string version int l cookiename ae aes cookiename 7 WebL A Programming Language for the Web 135 Modules Module Str The Str module provides several useful operations on string values TABLE 36 Module Str Function Compare a string b string int EndsWith s string regexp string bool EqualsIgnoreCase a string b string bool IndexOf pat string s string int LastIndexOf pat string s string int Match s string regexp string object Replace s string from char to char string Search s string reg
93. ectlyafter Elem(X, "h1")

    // Retrieve the second element directly after H1's, i.e. lines 3, 7, 10, 13
    Elem(X) directlyafter (Elem(X) directlyafter Elem(X, "h1"))

P overlap Q, P !overlap Q

The overlap operator returns the pieces of P that overlap (or do not overlap) any element of Q.

Example

    // Find all the occurrences of words that are italic or partially consist of italic text.
    Pat(X, `\w+`) overlap Elem(X, "i")

III Hierarchical Operators

The hierarchical operators express relationships between pieces involving their hierarchical nesting in the element parse tree.

P inside Q, P !inside Q

The inside operator returns the pieces of P that are nested inside (or not nested inside) any piece of Q.

Examples

    // Retrieve all the rows in the third table.
    Elem(X, "tr") inside Elem(X, "table")[2]

    // Retrieve all the italic elements not in a table.
    Elem(X, "i") !inside Elem(X, "table")

P contain Q, P !contain Q

The contain operator returns the pieces of P that contain (or do not contain) any piece of Q.

Examples

    // Retrieve all the level 2 headings with italic characters.
    Elem(X, "h2") contain Elem(X, "i")

    // Retrieve all the tables that mention "program".
    Elem(X, "table") contain Pat(X, "program")
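One way to build intuition for the inside, contain, and overlap operators is to model each piece as a region of character offsets. The sketch below is in Python, not WebL, and is a deliberately simplified model: the `Piece` type, the offsets, and the exact boundary rules are assumptions for illustration and do not reproduce WebL's formal definitions.

```python
from typing import NamedTuple

class Piece(NamedTuple):
    name: str
    start: int   # assumed offset of the begin tag
    end: int     # assumed offset just past the end tag

def nested_in(p, q):
    """True if region p lies within region q and p is not q itself."""
    return p != q and q.start <= p.start and p.end <= q.end

def overlaps(p, q):
    """True if the two regions share at least one position."""
    return p.start < q.end and q.start < p.end

def inside(P, Q, negate=False):
    """Pieces of P nested inside some piece of Q (or inside none, if negate)."""
    return [p for p in P if any(nested_in(p, q) for q in Q) != negate]

def contain(P, Q, negate=False):
    """Pieces of P that contain some piece of Q (or contain none, if negate)."""
    return [p for p in P if any(nested_in(q, p) for q in Q) != negate]

def overlap(P, Q, negate=False):
    """Pieces of P that overlap some piece of Q (or overlap none, if negate)."""
    return [p for p in P if any(overlaps(p, q) for q in Q) != negate]

# Hypothetical page layout: one table, one heading, two italic runs.
table = Piece("table", 0, 100)
h2 = Piece("h2", 110, 140)
i_in_table = Piece("i", 10, 20)
i_in_heading = Piece("i", 120, 130)

print(inside([i_in_table, i_in_heading], [table]))               # italics in a table
print(inside([i_in_table, i_in_heading], [table], negate=True))  # italics not in a table
print(contain([h2], [i_in_table, i_in_heading]))                 # headings with italics
```

The `negate` flag plays the role of the negated operator forms; each function filters the left operand and never returns pieces from the right operand, which matches the prose descriptions in this excerpt.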
94. ed

    page = Timeout(10000, GetURL("http://www.altavista.digital.com") | GetURL("http://www.altavista.yellowpages.com.au"))

Repetition: Retry(S)

The repetition combinator provides a way to repeatedly invoke a service until it succeeds. The service Retry(S) acts like S, except that if S fails, S starts again. The loop can be terminated by writing Timeout(t, Retry(S)). This program makes repeated attempts in the case of failure, alternating between two services:

    page = Retry(GetURL("http://www.x.com") ? GetURL("http://www.y.com"))

Non-termination: Stall()

The stall combinator never completes or fails. This program repeatedly tries to fetch the URL, but waits 10 seconds between attempts:

    page = Retry(GetURL("http://www.digital.com") ? Timeout(10000, Stall()))

CHAPTER 3 Pages

To set the stage for the next chapter on markup algebra, we must introduce fetching a page from the Web and mapping the page into structures compatible with the WebL language. Although we cannot give a thorough overview of the Web protocols and formats involved in this process, we present a short tutorial to introduce the particular vocabulary used in WebL. Thus most of this chapter is a review of things that might be known to many readers. However, the chapter does contain important information and definiti
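The behaviour of the repetition and time-limit combinators described in this excerpt can be approximated in Python. This is a rough sketch, not WebL's implementation: the names `alternate`, `retry`, `timeout`, and `ServiceFailure` are hypothetical, services are modeled as zero-argument callables that raise on failure, and `max_attempts` is a safety bound added here that the text does not mention.

```python
import concurrent.futures

class ServiceFailure(Exception):
    """Raised when a service cannot produce a result."""

def alternate(*services):
    """Try each service in turn and return the first successful result."""
    def run():
        for service in services:
            try:
                return service()
            except Exception:
                continue
        raise ServiceFailure("all alternatives failed")
    return run

def retry(service, max_attempts=None):
    """Re-run service until it succeeds (max_attempts is an added safety bound)."""
    def run():
        attempts = 0
        while max_attempts is None or attempts < max_attempts:
            attempts += 1
            try:
                return service()
            except Exception:
                pass
        raise ServiceFailure("retry limit reached")
    return run

def timeout(seconds, service):
    """Fail if service has not finished within the given number of seconds."""
    def run():
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(service)
        try:
            return future.result(timeout=seconds)
        except concurrent.futures.TimeoutError:
            raise ServiceFailure("timed out")
        finally:
            # Do not block on the worker; it may still be running.
            pool.shutdown(wait=False)
    return run

# Stand-in services: one is always down, the other succeeds on its third call.
calls = {"n": 0}

def always_down():
    raise ServiceFailure("down")

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServiceFailure("not yet")
    return "page"

fetch = timeout(10, retry(alternate(always_down, flaky)))
print(fetch())
```

Note that this `timeout` only abandons the worker thread; it cannot cancel it, so a real fetcher would also need a network-level timeout.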
95. ed This also has the effect of removing nested elements of P New unnamed pieces are inserted into the page to create these new pieces Figure 8 shows how two overlapping pieces are flat tened FIGURE 8 Flattening a Piece Set L gt leeel lt gt Ileets e i EAI Flatten A B c L A letke kA WebL A Programming Language for the Web 99 The Markup Algebra Content p The Content function returns the content of piece p The content of a piece is the part of the page between the begin and end tag of p exclusive The Content function can also be applied to a page object in which case a piece is returned that starts at the beginning of the page and ends at the end of the page In both cases new unnamed tags are inserted into the page Figure 9 For example given a page lt td gt abc lt i gt def lt i gt lt td gt we can calculate the following Content of the TD element i e abe lt isdef lt i gt Content Elem P td 0 Content of the whole page i e lt tdsabe lt i gt def lt i gt lt td gt Content P FIGURE 9 Application of the Content Function L Wevell Ise8s k ot EAI Content A c Bees ee ese 100 WebL A Programming Language for the Web Piece Set Operators and Functions TABLE 20 Piece and Piece Set Operators Function q1 piece q2 piece pieceset q piece s pleceset pieceset s pleceset
96. egal field assignment e NotAnObject Left hand side is not an object field e NotAVariable Left hand side is not a variable Field definition with can throw the following exceptions e FieldDefinitionError Could not define field e NotAnObject Left hand side is not an object n Indexing into a type with or can throw the following exceptions 182 WebL A Programming Language for the Web Exceptions e IndexRangeError Index is out of range e ArgumentError Index is not of the expected type e NoSuchField Object does not have such field e NotAnObject Left hand side is not an object The if repeat while and catch statements will throw a GuardError exception if the guard expression does not return a boolean value type The every statement will throw a NotEnumerable exception if the object does not have enumerable contents The lock statement will throw a NotAnObject exception if an attempt is made to lock on a non object value type TABLE 47 Exceptions thrown by the built in functions Function Exceptions Assert x bool nil ArgumentError Incorrect or wrong number of arguments AssertFailed Assertion failed BeginTag q piece tag ArgumentError Incorrect or wrong number of arguments Boolp x bool ArgumentError Incorrect or wrong number of arguments Exec cmd string int ArgumentError Incorrect or wrong number of arguments WebL A Programming Language
97. ele ment number fo exclusive Extracts a substring of starting at character number from and ending at character number fo exclusive 42 WebL A Programming Language for the Web Built in Functions TABLE 13 Core Built in Functions Function Select s set f fun set Select 1 list f fun list Select p pieceset f fun pieceset Sign x int int Sign x real int Size l list int Size s set int Size s string int Sleep ms int nil Sort list f fun list Stall Throw o object Time x int Timeout ms int x any ToChar c char char ToChar i int char Description Maps sets lists and piecesets to sets lists and piecesets respectively according to a membership function f Function f must have a single argument and must return a boolean value indicating whether the actual argument is to be included in the set list or pieceset Returns 1 0 1 if x lt 0 x 0 and x gt 0 respectively Returns the number of elements in a list Returns the number of elements in a set Returns the number of characters in a string Suspends thread execution for the specified number of milliseconds Sorts the elements of according to the comparison function f The func tion f needs to take two formal argu ments and return 1 0 or 1 if the actual arguments are less equal or more than each other Program goes to sleep forever Generates an exception
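The contracts of Select, Sign, and Sort described in this table can be mirrored in a few lines of Python. This is only an illustration of the described behaviour, assuming Sign returns -1, 0, or 1 and that the comparison function passed to Sort does the same; the function names below are hypothetical Python stand-ins, not WebL built-ins.

```python
from functools import cmp_to_key

def sign(x):
    """Return -1, 0, or 1 for negative, zero, or positive x."""
    return (x > 0) - (x < 0)

def select(items, keep):
    """Keep the elements for which the membership function returns True."""
    return [item for item in items if keep(item)]

def sort_with(items, compare):
    """Sort using a two-argument comparison function returning -1, 0, or 1."""
    return sorted(items, key=cmp_to_key(compare))

def by_length(a, b):
    # Compare two strings by length only.
    return sign(len(a) - len(b))

words = ["pear", "fig", "apple"]
print(sort_with(words, by_length))
print(select(range(6), lambda n: n % 2 == 0))
```

Both helpers return new lists and leave their inputs untouched, in line with the immutability of lists and sets described elsewhere in the manual.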
98. elements of p that do not follow directly after any element of q All the elements of p that are not directly before any element of q All the elements of p that do not directly contain any element of q All the elements of p that are not directly inside any element of q 168 WebL A Programming Language for the Web Operators TABLE 45 WebL Operators Operator inside p piece q piece pieceset inside p pieceset q piece pieceset inside p piece q pieceset pieceset inside p pieceset q pieceset pieceset overlap p piece q piece pieceset overlap p pieceset q piece pieceset overlap p piece q pieceset pieceset overlap p pieceset q pieceset pieceset ql piece q2 piece pieceset q piece s pleceset pieceset s pleceset q piece pieceset s1 pieceset s2 pieceset pieceset x int y int int x int y real real x real y int real x real y real real x set y set set q1 piece q2 piece pieceset q piece s pleceset pieceset s pleceset q piece pieceset s1 pieceset s2 pieceset pieceset x char y string string x char y char string x string y string string x string y char string x int y int int x int y real real x real y int real x real y real real x list y list list x set y set set q1 piece q2 piece pieces
99. ement. Most of the operators have a formal definition, given in Table 22 on page 105. The remainder of this section attempts to give an intuitive explanation of the operators with the help of examples. In our examples, X will denote a page, P and Q will denote piece sets, and p and q will denote elements of P and Q respectively.

I Basic Operators

Basic piece set manipulation includes the set union, intersection, and exclusion operators.

Set Union: P + Q

The set union operator merges two piece sets into a single piece set and eliminates duplicate pieces from the result.

Example

    // Retrieve level 1 and level 2 headings from a page.
    Elem(X, "h1") + Elem(X, "h2")

Set Exclusion: P - Q

The set exclusion operator removes all pieces from the left operand that are elements of the right operand.

Example

    // Retrieve all level 1 headings except for those that contain the word "Figure".
    Elem(X, "h1") - (Elem(X, "h1") contain Pat(X, "Figure"))

Set Intersection: P * Q

The set intersection operator computes the intersection of its operands.

Example

    // Retrieve all the occurrences of the word "WebL" written in bold and in italic.
    (Pat(X, "WebL") inside Elem(X, "b")) * (Pat(X, "WebL") inside Elem(X, "i"))

II Posit
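The three basic operators (union, exclusion, intersection) behave like ordinary set operations whose results stay in page order. As a small Python model, assuming each piece can be represented by a hypothetical (start, end) offset pair so that sorting yields document order:

```python
def union(P, Q):
    """Merge two piece sets, drop duplicates, and keep document order."""
    return sorted(set(P) | set(Q))

def exclusion(P, Q):
    """Pieces of P that are not also elements of Q."""
    return sorted(set(P) - set(Q))

def intersection(P, Q):
    """Pieces that appear in both P and Q."""
    return sorted(set(P) & set(Q))

# Hypothetical pieces as (start, end) offsets in a page.
h1 = [(0, 10), (50, 60)]
h2 = [(20, 30)]
figure_headings = [(50, 60)]

print(union(h1, h2))
print(exclusion(h1, figure_headings))
print(intersection(h1, figure_headings))
```

Representing pieces as offset pairs is an assumption made for this sketch; it is not how WebL stores pieces internally.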
100. enough to break scripts that worked correctly on the ugly page WebL A Programming Language for the Web 81 The Markup Algebra TABLE 18 Miscellaneous Functions Function BeginTag q piece tag EndTag q piece tag ExpandCharEntities p page s string string ExpandCharEntities s string string Markup p page string Markup q piece string Name q piece string NewPage s string mimetype string page NewPiece s string mimetype string piece NewPieceSet s set pieceset NewPieceSet p page pieceset Page q piece page Page t tag page Pretty p page string Description Returns the begin tag of a piece Returns the end tag of a piece Expands the character entities eg amp lt amp amp in s to their Unicode character equivalents The DTD of page p is used for the lookups Expands the character entities eg amp lt amp amp in s to their Unicode character equivalents The HTML 4 0 DTD is used for the lookups Turns a page object back into a string Turns a piece object back into a string Returns the name of a piece or the empty string in the case of q being unnamed Parses the string s with the mime type indicated markup parser and returns a page object Equivalent to Content NewPage s mimetype Converts a set of pieces into a piece set Thows an EmptySet exception should s be empty Returns an empty pieceset associ ated
101. ent q piece piece Pat p page regexp string pieceset Pat q piece regexp string pieceset PieceSetp x bool Piecep x bool Exceptions ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments MalformedPattern Illegal regular expression passed to Pat function ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments WebL A Programming Language for the Web 189 WebL Quick Reference TABLE 47 Exceptions thrown by the built in functions Function PostURL url string page PostURL url string params object string page PostURL url string params object string headers object page PostURL url string params object string headers object options object page Pretty p page string Pretty q piece string Print x y Z nil PrintLn x y z nil ReadLn string Realp x bool Replace a pieceset b pieceset nil Rest 1 list list Exceptions ArgumentError Incorrect or wrong number of arguments NetException Fetch failed sta tuscode field of the exception object indicates the reason IOException An IO exception occurred during func
102. ents and return -1, 0 or 1 if the actual arguments are less, equal or more than each other. Program goes to sleep forever. Returns the text (sans tags) of a page. Returns the text (sans tags) of a piece.

TABLE 46. Built-in Functions

Text(q: piece, insertspaces: bool): string
    Returns the text (sans tags) of a piece. When insertspaces is true, each HTML tag is mapped into a space and inserted into the result string (inline tags like b, i, em etc. are ignored and not mapped into spaces). This option is useful to correctly identify word boundaries, for example to prevent words flowing together in a case like <li>wordA</li><li>wordB</li>.
Throw(o: object)
    Generates an exception.
Time(x): int
    Returns the time in milliseconds it takes to evaluate the expression x.
Timeout(ms: int, x): any
    Performs the expression x and returns its value. If the evaluation takes more than the specified amount of time (in milliseconds), an exception is thrown instead.
ToChar(c: char): char
    No operat
ToChar(i: int): char
ToInt(c: char): int
ToInt(i: int): int
ToInt(r: real): int
ToInt(s: string): int
ToList(s: set): list
ToList(l: list): list
ToList(s: string): list
ToList(o: object): list
ToList(p: pieceset): list
ToReal(c: char): real
ToReal(i: int): real
103. epts that you will need to know in later chapters.

Basic Terminology

Expressions

WebL programs consist of sequences of expressions separated by semicolons. Running a WebL program involves evaluating the expressions in sequence. Each expression either evaluates to a value (or result), or causes an exception that terminates the program evaluation at that point. We say that the expression throws an exception. More details about exceptions can be found in the sections Try Statement on page 36 and Exceptions on page 182.

The value of one expression is typically used by other expressions in the program. We also define the value of a sequence of expressions to be the value that the last expression in the sequence evaluated to. If no special steps are taken by the programmer, the results of the remainder of the single expressions in an expression sequence are lost.

Running or executing a WebL program involves several integrated steps:
• The program source text is parsed and checked for syntax errors. If syntax errors are detected, the execution of the program is terminated.
• A representation of the program in the form of an abstract syntax tree (AST) is constructed in memory. The AST consists of a sequence of expressions.
• The in-memory sequence of expressions is executed in turn. Side effects of the computation might be printing of results on
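The three steps above (parse, build an AST, evaluate the expressions in turn) can be modelled in a few lines. The sketch below is written in Python rather than WebL and uses Python's own `ast` module as a stand-in for the WebL parser; `run_program` is a hypothetical name, and the handling of non-expression statements is an assumption made only so the example runs.

```python
import ast


def run_program(source):
    """Parse source, build an AST, then evaluate its top-level items in order.

    Returns the value of the last expression, mirroring the rule that the
    value of a sequence is the value of its last expression.
    """
    # Step 1 and 2: parsing raises SyntaxError before anything is executed
    # and yields an in-memory tree whose body is a sequence of items.
    tree = ast.parse(source)

    env = {}
    value = None
    # Step 3: evaluate each item in turn; an exception stops evaluation here.
    for item in tree.body:
        if isinstance(item, ast.Expr):
            code = compile(ast.Expression(body=item.value), "<program>", "eval")
            value = eval(code, env)
        else:
            # Python statements (e.g. assignments) have no value of their own.
            code = compile(ast.Module(body=[item], type_ignores=[]), "<program>", "exec")
            exec(code, env)
            value = None
    return value
```

Only the value of the final expression survives: the intermediate result of `x + 1` in `run_program("x = 2\nx + 1\nx * 10")` is computed and then lost, as the text describes.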
104. er answers with an HTTP response. The response consists of a status code indicating success or failure, a status message, headers, and the page data itself (the contents of /path/filename.html on the server).
• The connection is closed.

Request parameters provide additional information to a web server about the requested data. This information is often used to access a special service on the server that generates appropriate responses dynamically, for example by looking up data in a database. Each parameter consists of a parameter name and a value. Parameters are included in the HTTP request by one of two methods. The HTTP GET request appends the parameters, encoded in a special way, to the URL. The HTTP POST request appends the parameters to the end of the request.

GET requests issued with parameters are recognized by a question mark followed by the parameters (name-value pairs) appended to the URL. In contrast, parameters of a POST request are hidden and not visible from the URL. POST requests are the preferred method for transmitting the contents of an HTML fill-in form to a web server. Their main advantage is that larger amounts of data can be submitted than with the GET method. Note, however, that the GET method is also applicable to fill-in forms and is typically used when parameters are few and relatively short. The GET method is also the default when no parameters are passed.
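To make the difference between the two methods concrete, here is a minimal Python sketch (not WebL) that builds both kinds of request with the standard library without sending them. The host `example.com` and the parameter names are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical parameters; any name-value pairs would do.
params = {"q": "java", "what": "web"}
encoded = urlencode(params)

# GET: the encoded parameters are appended to the URL after a question mark.
get_request = Request("http://example.com/search?" + encoded)

# POST: the same encoded parameters travel in the request body instead,
# so they do not appear in the URL.
post_request = Request("http://example.com/search", data=encoded.encode("ascii"))
```

`Request.get_method()` reports `GET` for the first object and `POST` for the second, because `urllib` switches to POST as soon as a request body is supplied.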
105. esource is a page object. The built-in functions GetURL and PostURL fetch a page from the Web and return a page object. In the next chapter we will introduce functions that turn a page object into a string value and back, search and manipulate markup in interesting ways, etc.

The GetURL and PostURL functions take a variable number of arguments that specify the URL to be fetched, request parameters, additional headers, and options. See Table 15 on page 64.

WebL's ability to process different URL protocols like http, file, and ftp is inherited from the underlying Java implementation, i.e. WebL does not provide support for any additional protocols. The most common URL used in WebL is the one corresponding to the HTTP protocol. Note that page redirects and cookies are handled transparently by WebL, but this default behavior can be overridden if required.

The request parameters passed to the functions are in the form of objects. For example, the URL of a typical AltaVista request has the following form:

    http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&kl=XX&q=%22Hannes+Marais%22

This can be converted into a call to the GetURL function in the following manner:

    GetURL("http://www.altavista.digital.com/cgi-bin/query",
           [. pg="q", what="web", kl="XX", q="\"Hannes Marais\"" .])

Parameters

WebL will automatically take care of packing request p
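The packing of a parameter object into a query string can be illustrated outside WebL. The following Python sketch uses `urllib.parse.urlencode` as a stand-in for whatever WebL does internally (an assumption); `build_get_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_get_url(base_url, params):
    """Append URL-encoded name-value pairs to base_url after a question mark."""
    if not params:
        return base_url
    return base_url + "?" + urlencode(params)


# The parameter values from the AltaVista example, written as a plain dict.
altavista_url = build_get_url(
    "http://www.altavista.digital.com/cgi-bin/query",
    {"pg": "q", "what": "web", "kl": "XX", "q": '"Hannes Marais"'},
)
```

The double quotes become `%22` and the space becomes `+`, so the caller can write the value in its natural form and leave the escaping to the library.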
106. esult
    end;
    WebServer_Publish(where, Highlight);
    WebServer_Start("/dev/null", port);
    Stall();

The highlight proxy consists of a single function called Highlight (lines 6-25). This function is exported with the built-in WebL web server in lines 27 and 28. More information about the built-in Web server can be found in the WebServer module documentation.

On lines 7 and 8 we extract the URL and word parameters passed to the proxy. Note how we use service combinators to provide sensible defaults in case no parameters are present. Lines 11-16 do the actual highlighting of the word on the page. In line 13 a font element is wrapped around the occurrence of the word. In addition, lines 14 and 15 define the size and color attributes of the new font element. Note that we also make sure in line 11 that we only wrap word occurrences outside of the title of the Web page.

The next step is to rewrite the href attribute of all anchors (a elements) in the page to work correctly with our proxy. This involves passing the old href attribute as the URL parameter to our proxy. We use the Url_Encode function to encode special characters in the URL, as dictated by the URL specification.

Finally, in line 24 we regenerate the markup of the now modified page and return it to the browser by assigning it to the appropriate field of the server response object res.
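The link-rewriting step can be sketched independently of WebL. The Python function below is an illustration only: `rewrite_links`, the query parameter names `url` and `word`, and the proxy address are all hypothetical, and the regular expression handles only double-quoted href attributes.

```python
import re
from urllib.parse import quote, urljoin


def rewrite_links(html, base_url, proxy_url, word):
    """Point every <a href="..."> at the proxy, passing the old target along."""

    def replace(match):
        # Resolve relative links against the page URL before encoding them.
        target = urljoin(base_url, match.group(2))
        # Encode special characters so the old URL survives as a parameter.
        proxied = (proxy_url + "?url=" + quote(target, safe="")
                   + "&word=" + quote(word, safe=""))
        return match.group(1) + proxied + match.group(3)

    return re.sub(r'(<a\b[^>]*?\bhref=")([^"]*)(")', replace, html,
                  flags=re.IGNORECASE)
```

Encoding with `safe=""` matters here: without it the slashes and colon of the original URL would be left as they are, and an ampersand inside it would be read as the start of a second proxy parameter.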
107. et q piece s pieceset pieceset s pieceset q piece pieceset sl pieceset s2 pieceset pieceset x int int x real real Description All the elements of p that are not located inside any element of g All the elements of p that do not overlap any element in q Piece set intersection Numeric multiplication Set intersection Piece set union String and character concatena tion Numeric addition x y List concatenation Set union Piece set difference Numeric negation WebL A Programming Language for the Web 169 WebL Quick Reference TABLE 45 WebL Operators Operator x int y int int x int y real real x real y int real x real y real real x set y set set x object y any x int y int int x int y real real x real y int real x real y real real x y bool s pieceset 1 int piece x list i int any x object i any x string i int char after p piece q piece pieceset after p pieceset q piece pieceset after p piece q pieceset pieceset after p pieceset q pieceset pieceset before p piece q piece pieceset before p pieceset q piece pieceset before p piece q pieceset pieceset before p pieceset q pieceset pieceset C x int y int bool C x int y real bool C x real y int bool C x real y real bool
108. evaluates to a value of type int, or integer. The internal representation of integers is 64-bit signed two's complement. Overflows or underflows during integer computations do not throw exceptions.

TABLE 8. Integer Expressions

    Expression    Value    Value Type
    1 + 2         3        int
    2 - 1 - 1     0        int
    6 div 4       1        int
    6 mod 4       2        int

Type Real

A lexical real constant evaluates to a value of type real. The internal representation of reals is 64-bit IEEE 754 floating point. No exceptions are thrown in real math.

TABLE 9. Real Expressions

    Expression    Value    Value Type
    1.2           1.2      real
    2 / 2         1.0      real
    0 / 0         NaN      real
    1 / 0         Inf      real

Type List

The list constructor constructs values of type list. All expressions between the square brackets are evaluated from left to right, and the values are inserted into the list in that sequence. The parallel list constructor [| |] evaluates the expressions between the brackets in parallel using multiple threads instead of left to right. There is no restriction on the size of the list or the value types that it can contain.

TABLE 10. List Expressions

    Expression         Value     Value Type
    1 1 2 a            1 2 a     list
    1 2 3              1 2 3     list
    First([1, 2])      1         int
    Rest([1, 2, 3])    [2, 3]    list
    l1 1 2 21 2 4                list
    Size([1, 2, 6])    3         int

Dynamic Types

Type Set

The set constructor constructs values of
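The two numeric rules above (silent wrap-around for 64-bit integers, and NaN or infinity instead of exceptions for reals) can be modelled in Python. This is a sketch of the described behaviour, not of WebL's implementation: Python integers are unbounded, so `wrap_int64` is a hypothetical helper that reduces a result to the 64-bit two's complement range, and the infinity is created directly because Python raises an error on division by zero where the text says WebL does not.

```python
import math


def wrap_int64(n):
    """Reduce n to the signed 64-bit two's complement range, as on overflow."""
    return (n + 2**63) % 2**64 - 2**63


INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

# Adding 1 to the largest value wraps around to the smallest, with no exception.
overflowed = wrap_int64(INT64_MAX + 1)

# IEEE 754 special values: infinity and not-a-number.
infinity = float("inf")
not_a_number = infinity - infinity
```

`overflowed` equals `INT64_MIN`, and `not_a_number` compares unequal even to itself, which is the standard IEEE 754 way to detect it (or use `math.isnan`).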
109. exp: string): list

Description

Returns -1, 0, 1 if a is less than, equal, or greater than b.
Tests if a string ends with a particular pattern.
Tests if a and b are equal in a case-insensitive manner.
Returns the first position where pat occurs in s, otherwise -1.
Returns the last position where pat occurs in s, otherwise -1.
Tests whether s matches the regular expression regexp. If so, an object is returned where the fields of the object are integers numbered from 1 onwards, each of them having a value corresponding to the Perl5 groups (as indicated by the parenthesized sub-expressions in regexp) that have been matched. Nil is returned otherwise.
Replaces each occurrence of from with char in s.
Searches for all the occurrences of the regular expression regexp in s and returns a list of objects, one for each of them. The fields of the objects are similar to those returned by the Match function. Note that object field 0 is the complete matched character string.

Module Str

TABLE 36. Module Str

Split(s: string, chars: string): list
    Splits the string s at positions where any of the characters of chars appear. The function returns a list of strings.
StartsWith(s: string, regexp: string): bool
    Tests if a string starts with a particular pattern
ToLowerCase(s: string): string
ToUpperCase(s: string): string
Trim(s: string): string
110. f this feature include mimicking a specific web browser model, and retrieving and setting cookies.

MIME types

One of the important pieces of information returned by an HTTP response is the type of the data that is being retrieved, included in a response header. The MIME type specifies if the data is an HTML page, an XML page, an image, a Postscript file, etc. WebL supports only the MIME types corresponding to what it can parse: plain text, HTML, and XML. Attempting to process anything else in WebL causes an exception.

A common MIME type is the one that identifies HTML documents, typically written in one of the following forms:

    text/html
    text/html; charset=us-ascii
    text/html; charset=ISO-8859-1

The charset parameter is optional; it indicates the character encoding (or content encoding) the document is encoded in. WebL uses the charset parameter (or makes an educated guess as to its value when missing) to determine how pages are converted into an internal Unicode format.

Unfortunately, many web servers do not return the correct MIME type information for certain documents, which makes it impossible for WebL to parse the document. To prevent this from occurring, it is possible to override the MIME type of a document explicitly when using the GetURL and PostURL built-in functions.

Cookies

Many web servers today use cookies to store client-side state. For example a typical a
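How a MIME type header splits into a type and an optional charset can be shown with Python's standard library. This sketch is not WebL's parser; `parse_content_type` and `is_parsable` are hypothetical names, and the exact set of type strings accepted as "plain text, HTML and XML" is an assumption.

```python
from email.message import Message

# Assumed spellings of the three kinds of document the text says are parsable.
PARSABLE_TYPES = {"text/plain", "text/html", "text/xml"}


def parse_content_type(header_value):
    """Split a Content-Type header value into (mime_type, charset or None)."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()


def is_parsable(header_value):
    """True if the MIME type is one of the assumed parsable types."""
    mime_type, _charset = parse_content_type(header_value)
    return mime_type in PARSABLE_TYPES
```

A missing charset comes back as `None`, which is the case where a client has to fall back on an educated guess about the encoding.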
111. frm.Perform(F(a, b))

It is important to know that the arguments to the function application are evaluated before the job is started. A typical application of a farm object is the following stupid web crawler program with 10 parallel workers:

    import Farm;

    var F = Farm_NewFarm(10);

    var ProcessPage = fun(url)
        var page = GetURL(url);
        every a in Elem(page, "a") do
            F.Perform(ProcessPage(a.href))
        end
    end;

    F.Perform(ProcessPage("http://www.nowhere.com"));
    while !F.Idle() do Sleep(10000) end

TABLE 28. Module Farm

NewFarm(nofworkers: int): object
    Creates a farm object with the specified number of workers.

TABLE 29. Methods of Farm Objects

Perform(e): nil
    Adds a task or job to the farm queue. An idle worker will eventually remove the job from the queue. Note that e must be an expression where a function is applied.
Idle(): bool
    Returns true if the job queue is empty and all workers are idle.
Stop(): nil
    Kills off all the jobs and stops all of the workers. Adding jobs to the queue after this operation will have no effect.

Module Files

The Files module provides rudimentary functions for testing the existence of a file, saving and loading pages and strings to and from files, and downloading web content to a
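The farm idea (a job queue served by a fixed number of workers, with failed jobs dropped silently) can be modelled with Python threads. This is a sketch under stated assumptions, not the WebL implementation: the class and method names are chosen to echo the tables above, `wait_until_idle` is a hypothetical convenience, and a job is passed as a function plus arguments because Python has no equivalent of WebL's special calling convention.

```python
import queue
import threading
import time


class Farm:
    """A fixed pool of worker threads that take jobs from a shared queue."""

    def __init__(self, workers):
        self._jobs = queue.Queue()
        self._pending = 0  # jobs queued or currently running
        self._lock = threading.Lock()
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def _work(self):
        while True:
            function, args = self._jobs.get()
            try:
                function(*args)
            except Exception:
                # As described in the text: the job ends without any
                # indication and the worker simply becomes idle again.
                pass
            finally:
                with self._lock:
                    self._pending -= 1

    def perform(self, function, *args):
        """Add a job to the queue; an idle worker will pick it up."""
        with self._lock:
            self._pending += 1
        self._jobs.put((function, args))

    def idle(self):
        """True when the queue is empty and no worker is busy."""
        with self._lock:
            return self._pending == 0

    def wait_until_idle(self, poll_seconds=0.01):
        while not self.idle():
            time.sleep(poll_seconds)
```

Because a job that queues further jobs does so before it finishes, the pending count never drops to zero while work remains, so polling `idle()` is a safe way to wait for a crawl-style computation to complete.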
112. g 147 Stop 120 145 String 25 Stringp 41 T Tagp 41 Tags 55 68 Begin tags 68 End tags 68 Optional tags 57 Positions of 84 Unnamed tags 68 Terminology 17 51 Text 80 83 Text segments 68 73 Threads 119 Mutual exclusion 38 Throw 43 throw 36 Time 43 Time out 49 Timeout 43 206 WebL A Programming Language for the Web ToChar 43 ToInt 44 ToList 44 ToLowerCase 137 ToReal 44 ToSet 44 ToString 44 ToUpperCase 137 Trap 44 Trim 137 Type 44 Types 18 Bool 23 Char 24 Fun 27 Int 25 j array 127 J Object 124 List 26 Meth 30 Nil 23 Object 29 Real 26 Set 27 Special objects 29 String 25 U Union 88 Unnamed tags 68 URL Resolution of 58 URLs 51 UTF 8 162 Vv Value types 18 Variables 20 Exported variables 46 W WebCrawler 141 153 WebL jar 159 WebL Java type conversion 125 weblwin32 dll 116 WebServer 143 without 95 103 WebL A Programming Language for the Web 207 XML 55 208 WebL A Programming Language for the Web
113. g t
Inserts copies of the elements of s before the tag t.
Inserts a copy of q after the tag t.
Inserts copies of the elements of s after the tag t.
Returns a new unnamed piece starting before t and ending after t2.
Equivalent to NewPiece(BeginTag(q), EndTag(q)).
Returns a new named piece starting before t and ending after t2.
Equivalent to NewNamedPiece(name, BeginTag(q), EndTag(q)).
Replaces each piece set of a with copies of all the elements of b.

CHAPTER 5
Modules

WebL includes a number of standard modules for the convenient reuse of often required functionality. The purpose of this chapter is to introduce the more common modules shipped with the WebL installation (Table 24). To use most of these modules, a programmer must import the module and refer to the exported variables of the module. See Modules on page 46.

TABLE 24. Standard WebL Modules

Base64
    Encodes and decodes base 64 strings, used for user authentication at many web sites.
Browser
    Provides access to the web browser for displaying web pages. Page 116.
Cookies
    Provides functionality to save and load the HTTP cookie database. Page 117.
Farm
    Introduces a technique for programming and controlling several concurrently executing threads. Page 119.
Files
    Functions to process local files and download pages to files
Java
Servlet
114. he second argument to NewPage defines the parser to be used to parse the string into a page. For the definition above, the following WebL expressions evaluate in the following manner:

    Markup(P)                   // Returns "<html><body>...</html>" as above
    var H = Elem(P, "h1")[0]    // Returns the first H1
    Markup(H)                   // Returns "<h1>Test Page</h1>"
    Text(P)                     // Returns "Test Page A 100 B 230" (including white space)
    Text(H)                     // Returns "Test Page"
    Name(H)                     // Returns "h1"
    var T = Elem(P, "td")       // Returns all the TD elements
    Markup(T[0])                // Returns "<td align=center>A</td>"
    Markup(T[1])                // Returns "<td>100</td>"
    Text(T[0])                  // Returns "A"
    Text(T[1])                  // Returns "100"
    Name(T[0])                  // Returns "td"
    T[0].align                  // Returns "center"
    var x = BeginTag(H), y = EndTag(H)
    Returns true P Page x P Returns true P Returns true

Miscellaneous Functions

The Pretty function is similar to the Markup function, except that it pretty-prints the markup by indenting elements according to their nesting level. This is useful to study the structure of badly formatted HTML and XML pages. Note that pretty-printing a page involves a reformatting of white space and new lines, so the resulting string might differ dramatically from the original page source, sometimes
115. icates that all tags except for font, a, b, i, tt, and img should be regarded as paragraph terminators.

The last example illustrates a very useful application of the Para function. HTML distinguishes between inline elements and block elements. Block elements typically start and end on a fresh line in the displayed web page. Inline elements flow in the text stream and do not typically start or end a fresh line. Sometimes it is necessary to extract the blocks of inline elements that make up the paragraphs of a Web page. As the number of inline HTML 4.0 elements is relatively small, we can accomplish this with the following WebL statement:

    Para(page, "tt i b u s strike big small em strong dfn code samp kbd var cite acronym a img applet object font basefont script map q sub sup span bdo iframe input select textarea label button")

In a similar vein, the Para function can also play a role when extracting text from a Web page. This addresses a problem of the Text function when retrieving the text of a page. For example, applying the Text function to the following page

    <li>word A</li><li>word B</li>

results in the text string "word Aword B", where two words unexpectedly flow together. Whether to insert an extra space at the word boundary depends on whether a breaking tag is present or not. The problem can be solved with a script of the following form:
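The word-boundary problem described above is easy to reproduce and fix outside WebL. The Python sketch below is only an analogy to the approach in the text: it extracts text with the standard `html.parser` module and inserts a space for every tag that is not in a small, assumed set of inline elements. `extract_text` and the `INLINE_TAGS` set are hypothetical.

```python
from html.parser import HTMLParser

# A deliberately short, assumed list of inline elements; the text above
# enumerates a much longer one.
INLINE_TAGS = {"a", "b", "i", "em", "tt", "font", "img"}


class _TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def _mark_boundary(self, tag):
        # Breaking (non-inline) tags become a space; inline tags are ignored.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._mark_boundary(tag)

    def handle_endtag(self, tag):
        self._mark_boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(markup):
    """Return the text of markup with words separated at breaking tags."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    # Collapse the runs of spaces introduced at adjacent tags.
    return " ".join("".join(collector.parts).split())
```

With this rule the two list items come out as separate words, while a word that is only partly bold, such as `w<b>or</b>d`, stays in one piece.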
116. inside 93 Indexing 89 Inside 93 Intersect 97 Overlap 92 Set Exclusion 88 Set Intersection 88 Set Union 88 Without 95 Piece sets 69 Piecep 41 204 WebL A Programming Language for the Web Pieces 68 Comparison of 83 Creation of 82 106 Deleting of 109 Filtering of 78 Graphical notation 69 Insertion of 108 Replacing of 111 Piecesetp 41 Positions 84 PostURL 47 59 61 64 122 Overrides 63 Predicates 41 Pretty 81 82 Print 42 PrintLn 42 Processing Instructions 57 Properties 160 Proxies 115 156 Publish 143 145 R Reading grades 149 ReadLn 42 Real 26 Realp 41 Regular expressions 71 165 195 Repetition 49 Replace 111 112 136 Resolve 139 Rest 42 Retry 42 Running WebL programs 159 S Save 118 SaveToFile 122 Scoping rules 20 Scrubber 110 Search 136 Search path 160 Select 42 43 Seq 74 79 Sequence search 74 Sequential execution 48 Service combinators 47 Services 47 Set 27 128 Setp 41 WebL A Programming Language for the Web 205 SGML 54 SGML Directives 57 Shell commands 41 42 short 129 ShouldVisit 141 ShowPage 116 Sign 43 Size 43 123 Sleep 43 Sort 43 Split 137 139 SplitQuery 138 139 Stall 43 Start 143 145 StartsWith 137 Statements 20 35 Begin statement 39 Every statement 38 If statement 35 Lock statement 38 Repeat statement 36 Return statement 39 Sequences 35 Try statement 36 While statement 36 statuscode 147 statusms
117. ion
Converts an integer to the equivalent Unicode character.
Returns the Unicode character number of a char.
No operation.
Rounds a real value down to an integer.
Converts a string to the numeric equivalent.
Enumerates all the elements of the argument and returns a list. See Every Statement on page 38.
Same as ToReal(ToInt(c)).
Converts an integer to a real.

Functions

TABLE 46. Built-in Functions

ToReal(r: real): real
    No operation.
ToReal(s: string): real
    Converts a string to a real value.
ToSet(s: set): set
ToSet(l: list): set
ToSet(s: string): set
ToSet(o: object): set
ToSet(p: pieceset): set
    Enumerates all the elements of the argument and returns a set. See Every Statement on page 38.
ToString(x): string
    Converts a value to its string representation.
Trap(x): object
    Executes x and returns the exception object that was caught. In case no exception is thrown in x, nil is returned. In addition, the exception object contains a field trace that has extra information about why the exception occurred. This information is useful for logging unexpected exception events in your WebL programs.
Type(x): string
    Returns the type of x: "nil", "int", "real", "bool", "char", "string", "meth", "fun", "set", "list", "object", "page", "piece", "pieceset", "tag".

a. The class indicated must be a subclass of webl.lang.expr.AbstractFunExp
118. ion of WebL types into Java types

    WebL Type    Compatible Java type/class/value
    nil          null
    bool         boolean
    char         char
    string       string java.lang.String and superclasses
    int          int, long, short, byte, float, double
    real         float, double
    object       webl.lang.expr.ObjectExpr and superclasses
    list         webl.lang.expr.ListExpr and superclasses
    set          webl.lang.expr.SetExpr and superclasses
    fun          webl.lang.expr.AbstractFunExpr and superclasses
    meth         webl.lang.expr.AbstractMethExpr and superclasses
    page         webl.page.Page and superclasses
    piece        webl.page.Piece and superclasses
    pieceset     webl.page.PieceSet and superclasses
    tag          webl.page.TagExpr and superclasses
    j-object     Corresponding type of wrapped Java object
    j-array      Corresponding type of wrapped Java array type

Module Servlet

Many web servers today support the Java Servlet standard from JavaSoft. This standard allows the efficient execution of server-side actions. In addition to the built-in Web server support (see module WebServer), WebL also supports the servlet standard directly and transparently. In fact, the WebL Servlet integration is so transparent that no new functions need to be introduced. The description of how to use servlets provided below assumes a fair knowledge of Java and servlets; it is thus advisable to study the servlet documentation before continuing.

Servlet access

The class weblx se
119. ional Operators

Positional operators express relationships between pieces according to their order in a page. Most positional operators have a negated or inverted version, indicated by an operator symbol written with an exclamation point.

Indexing: P[i]
The index operator extracts the nth element of a piece set P. Pieces are numbered from 0 to Size(P) - 1.
Examples:
Extract the 4th table from a page.

    Elem(X, "table")[3]

Extract the 2nd row of the 3rd table.

    Elem(Elem(X, "table")[2], "tr")[1]

Extract the 2nd row of the table containing the word "WebL".

    var t = Elem(X, "table") contain Pat(X, "WebL");
    (Elem(X, "tr") inside t)[1]

P before Q, P !before Q
The before operator returns all the elements of P that are before (or not before) any element of Q. Note that this is equivalent to all the elements of P that are before (or not before) the last element of Q. Consequently, we often need to index into Q to reduce it to a single piece.
Examples:
Retrieve all the H2s before the appendix (assuming only a single appendix is present).

    Elem(X, "h2") before (Elem(X, "h1") contain Pat(X, "Appendix"))

Retrieve all the headings from Chapter 4 onwards.

    Elem(X, "h1") !before (Elem(X, "h1") contain Pat(X, "Chapter 4"))

Retrieve all the italic elements except the last.

    Elem(X, "i") before Elem(X, "i")

Retrieve the last italic elemen
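The behaviour of before and its negated form can be checked with a small model. In the Python sketch below a piece is represented as a (begin, end) pair of positions, and "p is before q" is taken to mean that p ends before q begins; both the representation and that definition are assumptions made for illustration, since the formal definition is given elsewhere in the manual.

```python
def before(P, Q):
    """Elements of P that lie before at least one element of Q."""
    return [p for p in P if any(p[1] < q[0] for q in Q)]


def not_before(P, Q):
    """The negated operator: elements of P that are before no element of Q."""
    return [p for p in P if not any(p[1] < q[0] for q in Q)]


# Three non-overlapping pieces, e.g. three italic elements in page order.
italics = [(0, 1), (2, 3), (4, 5)]

all_but_last = before(italics, italics)
only_last = not_before(italics, italics)

# Being before "any" element of Q is the same as being before the last one.
last_of_q = [max(italics, key=lambda q: q[0])]
```

Applying `before` to a set and itself drops exactly the final piece, which matches the "all the italic elements except the last" example, and `not_before` returns just that final piece.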
120. knowledge about the WebL implementation, which is beyond the scope of this user manual.

The WebL to Java integration works by automatically wrapping Java objects, classes, and arrays with special WebL types, and performing transparent translation of WebL data types to Java data types and vice versa. The Java module introduces two new WebL data types for this purpose. The WebL j-object type is a special object type that wraps Java objects and Java classes. The WebL j-array type wraps Java arrays.

Type j-object

Wrapping a Java object in a WebL j-object is transparent to the WebL programmer. From the WebL programmer's perspective, the object behaves exactly the same as a normal WebL object. That is, the fields and methods of a Java object are directly accessible from the WebL j-object. For example, the following WebL code creates a Java Date object and calls some of its methods to print out some of the details of the date object:

    import Java;

    var D = Java_New("java.util.Date");
    PrintLn("Today's date is ", D.toString());
    PrintLn("Today is ", D.getMonth());
    PrintLn(D);

Notice how the last line prints out the Java object itself. The console output from this statement might look as follows, which illustrates that the methods and fields of the Java date object are reflected 1-to-1 inside the WebL j-object:

    setYear <setYear(int) void> getSeconds <getSeconds() int> parse <parse(java.lang.String
121. markup might be a linear sequence of elements following each other. For example, we might expect an H1 element followed by a sequence of characters followed by a BR element. We will be using this as our example in the following discussion.

Given a page and a string describing such a sequence (called a sequence pattern), the Seq function will return a piece set with all the occurrences of the sequence in the page. That is, each piece refers to an unnamed tag just before and after the first and last element of the sequence.

A sequence pattern is a list of element names separated by space characters. The intention is to match exactly that sequence of elements on the same element nesting level. It is important to note that sequence patterns do not match nested elements. For example, in our example, whether the H1 element contains other elements is irrelevant.

To match sequences of characters we use the symbol #. The # symbol will match the longest sequence of characters or unnamed tags at that position in the page. Unnamed tags are ignored. The following will match all the H1, text, BR sequences in our example:

    Seq(P, "h1 # br")

The H1, text, and BR pieces matched in each of the sequences are accessible by indexing the returned pieces (one for each sequence in the page) with integers from 0 onwards. For example, the following code fragment prints details of the matched sequences:

    var S = Seq(P, "h1 # br");
    every p in S do
        PrintL
122. me: string, database: string): nil
Save(filename: string, database: string): nil

Description

Adds the cookies in filename to the default cookie database.
Saves the default cookie database to filename.
Adds the cookies in filename to the cookie database named database.
Saves the cookie database named database to filename.

Module Farm

Module Farm introduces the concept of a farm object, an object with a hidden implementation. A farm consists of a number of workers that process jobs. The Perform method of a farm object allows the programmer to insert a job into the job queue of the farm. Idle workers (those that are not doing something) periodically pick a job from the queue to perform. When the job queue is empty and no workers are working, we say that the farm is idle.

It is important to note that the workers are simple-minded, in the sense that should an exception occur while performing a job, the job is terminated without any indication to the programmer and the worker becomes idle again. It is thus advisable to include exception handling code in the job itself.

The Perform method of a farm uses a special calling convention. The argument of this method must be an expression denoting a function application. For example, say we would like to turn a function invocation of F with two arguments into a job; we must write:

    var frm = Farm_NewFarm(10);
123. mmand Line Options
String and Character Escape Sequences
Operator Precedence Table
WebL Operators
Built-in Functions
Exceptions thrown by the built-in functions
Quantified Atoms
Quantified Atoms with Minimal Matching
Atoms
Perl5 Extended Regular Expressions

List of Figures

FIGURE 1. Converting Markup into Tag and PCData Sequences
FIGURE 2. Piece Notation
FIGURE 3. Results of Searching for "WebL"
FIGURE 4. Nested Unnamed Pieces
FIGURE 5. Example of Position Numbering
FIGURE 6. Operation of P without Q
FIGURE 7. Operation of P intersect Q
FIGURE 8. Flattening a Piece Set
FIGURE 9. Application of the Content Function
FIGURE 10. Application of the NewPiece function
FIGURE 11. Copying Pieces during Inserts
FIGURE 12. Deleting Pieces

CHAPTER 1
Introduction

WebL (pronounced "webble") is a web scripting language for processing documents on the World Wide Web. It is well suited for retrieving documents from the web, extracting information from the retrieved documents, and manipulating the contents of documents. In contrast to other general purpose programming languages, WebL is specifically designed for automating tasks on the web. Not only does the Web
124. mmers can define HTTP request headers and inspect response headers.
• Programmers can explicitly override MIME types and DTDs used when parsing Web pages.
• Proxy support.
• Support for HTTP basic authentication, both client and proxy authentication.

Markup Algebra

• WebL understands HTML, XML, and plain text MIME types.
• WebL uses a DTD-based HTML parser for extensibility. HTML 2.0, 3.2, and 4.0 DTDs are included.
• WebL has relatively robust page parsing that attempts to make a faithful representation of Web pages.
• WebL supports a markup algebra for extracting elements and text from pages, and functions for manipulating the content of a page.
• Extraction functions include extracting all elements of a specific name, all occurrences of Perl5 regular expressions, and all occurrences of simple element patterns.
• Elements and patterns are mapped onto piece objects in WebL and allow direct access to markup attributes.
• Markup algebra allows complicated access patterns to be expressed easily, for example: extract all the images in the third row of the table that contains the word "abc", and so on.
• WebL can handle overlapping elements internally. Page manipulation is not based on an internal tree-like representation of markup.
• Page manipulation functions include modifying attributes, deleting elements/tags, copying elements/text, and replacing elements/text.
W
125. n Heading p 0 PrintLn Text p 1 end

Searching Functions

Paragraph search

Paragraph search is one of the more complicated WebL page searching techniques; it is rather seldom used, but still performs a useful function that is sometimes required. The purpose of this searching technique is to break up a page or piece into logical paragraphs. Paragraphs in the WebL world are longer regions of a page that logically belong together. Paragraphs in WebL should not be confused with HTML paragraphs marked up with <p>...</p> elements. Example paragraphs in WebL might be sequences of markup each terminated with a br tag, or the regions between a set of images. WebL allows the programmer to define his or her own meaning of the term paragraph.

To allow the WebL programmer to define an own notion of paragraph, we introduce the notion of a paragraph terminator. A paragraph terminator is a tag which denotes the end of a paragraph. For example, the br tag might be denoted as a paragraph terminator. It is important to note that identifying a non-empty HTML element, such as font, as a terminator signifies that both the begin tag <font> and the end tag </font> are to be regarded as paragraph terminators.

Typically, sets of terminators are used to break a page into paragraphs. For example, we can specify that all br and p tags are regarded as paragraph terminators, or that all
126. n the expression A context can be created in several ways The most common way is by the programmer who defines the variables and their values explicitly in the current context using variable declarations After declaration a variable can be 669 assigned arbitrary values with the assignment expression Examples var X Defines the variable x var a b C Defines three variables a b c var name John Defines a variable called name and assigns it a value var Define and initialize several x 1 variables y 2 Z E EOS xX y 2 Assignment expression 20 WebL A Programming Language for the Web Basic Terminology A variable s value is set to nil when no initializer is specified A variable must be declared before it is used for the first time and should be declared only once in any given context Variable declarations are expressions that evaluate to the value the variable is set to Assignment expressions evaluate to the value that is assigned It is important to note that var X x 1 is equivalent to var x nil xX aj x 4 1 Both of these programs lead to a runtime exception because 1 and nil are not type compatible under the plus operator This definition of variable declaration allows the introduction of self recursive functions WebL uses lexical scoping for variables This allows contexts to be nested in each other according to the syntactic structure of the p
127. n be found in "Retrieving Page Objects" on page 59. Note that any WebL computation can be regarded as a service.

Services: GetURL(url, [. param1=val1, param2=val2 .]) and PostURL(url, [. param1=val1, param2=val2 .])

The GetURL function fetches, with the HTTP GET protocol, the resource associated with the URL. It returns a page object that encapsulates the resource. The function fails if the fetch fails. The second argument to GetURL provides the server with query arguments. A similar function called PostURL uses the HTTP POST protocol, which is used to fill in Web-based input forms.

This program simply attempts to fetch the named URL:

    page = GetURL("http://www.digital.com")

This program looks up the word "java" on the AltaVista search engine:

    page = GetURL("http://www.altavista.digital.com/cgi-bin/query", [. pg="q", what="web", q="java" .])

Sequential execution: S ? T

The ? combinator allows a secondary service to be consulted in case the primary service fails for some reason. Thus the service S ? T acts like the service S, except that if S fails it executes the service T. This program first attempts to connect to AltaVista in California, and in the case of failure attempts to connect to a mirror in Australia:

    page = GetURL("http://www.altavista.digital.com") ? GetURL("http://www.altavista.yellowpages.com.au")

Concurrent execu
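The fallback behaviour of sequential execution can be mimicked in other languages. The Python sketch below models a service as a zero-argument function that either returns a result or raises an exception. The names sequential, california and australia are hypothetical stand-ins, and no network access takes place.

```python
def sequential(primary, secondary):
    """Build a service that acts like primary but falls back to secondary."""
    def service():
        try:
            return primary()
        except Exception:
            # The primary service failed, so consult the secondary one.
            return secondary()
    return service

def california():
    # Simulated failure of the primary service.
    raise ConnectionError("primary service unavailable")

def australia():
    return "page from mirror"

fetch = sequential(california, australia)
print(fetch())
```

If the secondary service also fails, its exception propagates to the caller, so the combined service fails only when both services fail.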
128. n functions by editing a standard script.

Applications

WebL is a general-purpose programming language and can thus be used to build whatever you can imagine. The example chapter of this book gives only a small taste of what is possible with WebL. Some of the things that we at Compaq have built with WebL include:
• Web shopping robots.
• Page and site validators.
• Meta-search engines.
• Tools to extract connectivity graphs from the Web and analyze them.
• Tools for collecting URLs, host names, word frequency lists, etc.
• Page content analysis and statistics.
• Reprocessing of results from public services, for example custom rankings of stocks.
• Custom servers and proxy-like entities.
• Locating and downloading multimedia content, and downloading of complete Web sites.

CHAPTER 2 The Language Core

The special features of WebL, like service combinators and markup algebra, are integrated in a small programming language core. Stripped of its special features, the core language is conceptually similar to most other procedural programming languages. To lay some groundwork and to understand the examples introduced in later chapters, we first need to study the language core without touching the special features the language introduces for handling web pages. This chapter introduces many of the essential and basic language conc
129. n of the following TD element consisting of nested J and B elements lt td gt abc lt i gt lt b gt def lt b gt lt i gt ghi lt b gt jkl lt b gt mno lt td gt are the pieces represented by abe lt i gt lt b gt def lt b gt lt i gt ghi lt b gt jkl lt b gt mno Examples Everything inside the first table Children Elem X table 0 Program to walk recursively through a page var walk fun x if Name x then Named piece every p in Children x do walk p end else PrintLn Text x parent Name Parent x end end var P NewPage lt td gt abc lt i gt lt b gt def lt b gt lt i gt ghi lt b gt jkl lt b gt mno lt td gt text xml walk Elem X td 0 98 WebL A Programming Language for the Web Piece Set Operators and Functions Parent p The Parent function returns the direct parent enclosing element of piece p It is implemented by looking at named tags t from right to left starting just before the left tag of p identifying the piece g that tag t belongs to and determining if the corresponding end tag of q follows the end tag of p Example Locate the Parent element of the second table Parent Elem P table 1 Flatten P The Flatten function returns the union of all elements of P Intuitively two overlapping pieces p and q of P are replaced repeatedly by a single joined piece that covers the union of the regions p and q cover
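The repeated joining that Flatten is described as performing can be pictured with plain intervals. The sketch below is in Python rather than WebL and assumes, purely for illustration, that a piece is a (start, end) pair of positions in the page; it is not WebL's implementation, and the name flatten is reused only to point back at the description above.

```python
def flatten(pieces):
    """Merge overlapping (start, end) regions into the regions they jointly cover."""
    merged = []
    for start, end in sorted(pieces):
        if merged and start < merged[-1][1]:
            # This region overlaps the previous one: extend it to cover both.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(flatten([(3, 8), (0, 5), (10, 12)]))
```

Sorting first means each region only has to be compared with the most recently merged one. Regions that merely touch are left separate in this sketch, which is a design choice of the example and not something the text specifies.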
130. ng Language for the Web 191 WebL Quick Reference TABLE 47 Exceptions thrown by the built in functions Function Sleep ms int nil Sort list f fun list Stall Stringp x bool Tagp x bool Text p page string Text q piece string Throw o object Time x int Exceptions ArgumentError Incorrect or wrong number of arguments Interrupted Sleep function inter rupted ArgumentError Incorrect or wrong number of arguments FunctionReturnTypeNotInteger Function argument to Sort did not return an integer value ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments 192 WebL A Programming Language for the Web Exceptions TABLE 47 Exceptions thrown by the built in functions Function Timeout ms int x any ToChar c char char ToChar i int char ToInt c char int ToInt i int int Tolnt r real int Tolnt s string int ToList s set list ToList list list ToList s string list ToList o object list ToList p pieceset list ToReal c char real ToReal i int real ToReal r real real ToReal s string real ToSet s set set ToSet l list
131. nt in E do SS end Ident introduced into a new scope LockStat lock SS do SS end RepeatStat repeat SS until SS end BeginStat begin SS end ReturnStat return E Ident Letter Letter Digit Integer Digit Digit Real Integer Fraction Exponent Fraction Integer Exponent e E Integer String Char VW 1 Char lt Char Char Digit 0 309 Letter a Zz 1 A NZ Char any unicode character 164 WebL A Programming Language for the Web WebL EBNF Strings and characters may contain the escapes listed in Table 43 To write the non standard escapes that occur in regular expressions like w and d it is advisable to use back quoted strings which ignore the string content completely TABLE 43 String and Character Escape Sequences Escape b t n f r y Y Xxx UXXXX Description Backspace Horizontal tab Newline Form feed Carriage return Double quote Single quote Backslash Character of octal value XXX Character of hexadecimal value xxxx WebL A Programming Language for the Web 165 WebL Quick Reference Operator Precedence TABLE 44 Operator Precedence Table Operator O div mod member inside inside directlyinside directlyinside contain contain directlycontain directlycon tain after after directlyafter directlyafter before before Pre
132. nteresting elements, for example all the a or li elements.
• Searching for character patterns that match a regular expression.
• Searching for text segments.
• Searching for stylized sequences of markup patterns.
• Searching for segments delimited by explicitly named markup elements, i.e. paragraph extraction.

Element search

The Elem function returns a piece set of all elements that match a specific name. The function also allows the search scope to be restricted to a page or piece. A piece is constructed for each matching begin and end tag pair of a markup element with the indicated name in the indicated scope, and the resulting pieces are collected into a piece set. For example, the following program fetches a page, calculates a piece set with all the img (image) elements of the page, and proceeds to print out the src attribute of each of those images:

    var P = GetURL("http://www.nowhere.com");
    var images = Elem(P, "img");
    every image in images do
        PrintLn(image.src)
    end

As can be seen from the example, the every statement also allows iteration over the elements (the pieces) of a piece set.

Pattern search

The Pat function searches a page for character patterns that match a regular expression. The Pat function ignores the tag objects in a page; only the pure text stream is searched. For each occurrence of the pattern a
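For readers more familiar with other toolkits, an element search by name can be approximated with Python's standard html.parser module. This is a rough analogue only: the helper elem and the class ElementCollector are names invented for this sketch, and the result is a list of attribute dictionaries, not a WebL piece set with begin and end tags.

```python
from html.parser import HTMLParser

class ElementCollector(HTMLParser):
    """Collects the attributes of every element with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name.lower()
        self.found = []  # one attribute dictionary per matching element

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == self.name:
            self.found.append(dict(attrs))

def elem(markup, name):
    """Return the attributes of all elements called name, in document order."""
    collector = ElementCollector(name)
    collector.feed(markup)
    collector.close()
    return collector.found

page = '<p><img src="a.gif"><IMG SRC="b.gif" alt="x"></p>'
for image in elem(page, "img"):
    print(image.get("src"))
```

The markup string here is a made-up stand-in for a fetched page, so the example runs without network access.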
133. o a WebL string type. When System.out.println is called, the WebL types are converted back into the appropriate Java type.

Statics. The above example also illustrates how to access a static field, namely System.out. The Java_Class function wraps the class into a WebL object from where the field can be accessed directly. In addition, the example also shows how to use constructors with arguments.

Overloading. WebL programmers should be aware that constructors and methods are often overloaded in Java. In this case WebL will attempt to match the best constructor or method by comparing the actual arguments provided in WebL with the formal arguments of the constructors and methods in question. This might lead to problems when the matching involves numeric types. Suppose a Java method named X is overloaded three times with single formal arguments of type int, short, and byte respectively. Which instance of X will be called when an actual argument of type int is used in a call to the function? WebL's approach is to prefer the widest type, which in this case would be X with the formal of type int. Programmers should be aware that this simple heuristic might cause the wrong instance of the overloaded method to be invoked. There is no support for enforcing calls to a specific overloaded method in WebL.

Type j array. In addition to the j object type, the Java module also provide
134. o variables in lexical scope by writing var or var The value of these referenced variables are expanded before the command is executed Returns a piece set consisting of all the direct children elements of g in the markup parse tree unioned with pieces representing all the text seg ments in g excluding all the nested text segments Makes a new object by copying all the fields of the objects passed as arguments Fields of p have prece dence over fields of o and so on WebL A Programming Language for the Web 173 WebL Quick Reference TABLE 46 Built in Functions Function Content p page piece Content q piece piece Delete s pieceset nil Delete q piece nil Elem p page pieceset Elem p page name string pieceset Elem q piece pieceset Elem q piece name string pieceset EndTag q piece tag Error x y z nil ErrorLn x y Z nil Eval s string any Exec cmd string int Exit errorcode int Description Returns a piece that encompasses the whole page p Returns a piece inside q represent ing everything that is inside q excluding the begin and end tag of q Deletes s or q from the page by removing all the pieces from the page data structure Returns all the elements in a page Returns all the elements in page p with a specific name Returns all the elements contained nested in piece q Returns all the elements with
135. CHAPTER 7 WebL Quick Reference

This chapter is a quick reference to the WebL programming language. It contains the WebL EBNF syntax, the operator precedence table, the list of operators and functions, and the Perl 5 regular expression format specification.

Running WebL Programs

Running a WebL program is highly dependent on the host platform. The WebL classes and resources are bundled in a Java JAR file called WebL.jar. The main class in this JAR file is called WebL.class. The main method of this class needs to be executed with the following arguments:

    options filename arg1 arg2

The options are summarized in Table 42. The filename argument specifies the name of the WebL program to be executed, and arg1, arg2, etc. are the arguments passed to the program. The latter argument list can be accessed from the variable called ARGS inside WebL programs. The following list gives an indication of how WebL programs can be executed, depending on one of several Java installation scenarios:
• Java development kit: java WebL options filename arg1 arg2
• Java Runtime Environment: jre -cp WebL.jar WebL options filename arg1 arg2
• Java 2 (a.k.a. JDK 1.2) with extension support: java -jar WebL.jar options filename arg1 arg2

TABLE 42 WebL Command Lin
136. om crawler using the generic crawler above To override the Visit and ShouldVisit methods we use the 154 WebL A Programming Language for the Web WebCrawler Clone builtin applied to the generic crawler and our own object that contains the modifications to the generic crawler we would like to make lines 3 16 1 import Str WebCrawler 2 3 var MyCrawler Clone WebCrawler Crawler 4 Lt 5 Visit meth s page 6 var title Text Elem page title 0 N A 7 PrintLn page URL title title 8 end 9 10 ShouldVisit meth s url 11 Str _StartsWith url 12 http www w pa dec com 13 and 14 Str_EndsWith url htm1l 15 end 16 Pall 17 18 MyCrawler Start 2 19 MyCrawler Enqueue http www srce pa dec com 20 MyCrawler Enqueue http www wrl pa dec com 21 while MyCrawler Idle do Sleep 10000 end Our particular implementation of the Visit method extracts and prints the URL and title of the page lines 5 8 The ShouldVisit method lines 10 15 restricts crawling to host names of the form www pa dec com and URLs that end either in or html Lines 18 20 start up the crawler with two workers and enqueue two starting point URLs Line 21 goes in a loop that checks every 10 seconds whether the workers have become idle in which case the crawler terminates WebL A Programming Language for the Web 155 Examples Highlight Proxy In this exampl
137. omized The basic idea is to define a generic Crawler object of which methods can be overridden to customize its behavior By the way our crawler implementation is provided as standard in WebL in a module called Web Crawler First we define the generic Crawler object as follows 1 import Str Farm 2 3 export var Crawler 4 E 5 Pages visited so far and removed from queue 6 and pages waiting in the queue 7 enqueued 8 9 Will contain the farm after the start method is 10 called 11 farm nil 12 13 Method that should be overridden 14 Visit meth s page PrintLn page URL end 15 ShouldVisit meth s url true end 16 17 Enqueue meth s url 18 First remove everything following 19 var pos Str IndexOf url 20 if pos 1 then 21 url Select url 0 pos 22 end 23 lock s do 24 var present s enqueued url false 25 if present and s ShouldVisit url then 26 s enqueued url true 27 s farm Perform s ProcessPage s url 28 end 29 end 30 end 31 32 ProcessPage fun s url 33 try 34 var page GetURL url fetch the page WebL A Programming Language for the Web 153 Examples 35 s Visit page 36 37 Process all the links from this page 38 every a in Elem page a do 39 s Enqueue a href nil 40 end 41 catch E 42 on true do PrintLn url err E msg 43 end 44 end 45 46 Start meth s noworkers 47 s farm
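The bookkeeping in the Enqueue and ProcessPage methods above (strip the fragment, skip URLs already seen, ask ShouldVisit, then follow links) can be restated compactly. The Python sketch below is an approximation under stated assumptions: fetch_links is a hypothetical function standing in for page fetching and link extraction, visit receives the URL instead of a page object, and a simple work list replaces the worker farm, so nothing runs concurrently even though the lock is kept to mirror the original.

```python
import threading

class Crawler:
    """Generic crawler; override visit and should_visit to customize it."""

    def __init__(self, fetch_links):
        self.fetch_links = fetch_links  # hypothetical: url -> list of linked URLs
        self.enqueued = set()           # URLs visited so far or waiting in the queue
        self.pending = []               # work list standing in for the worker farm
        self.lock = threading.Lock()

    def visit(self, url):
        print(url)

    def should_visit(self, url):
        return True

    def enqueue(self, url):
        url = url.split("#", 1)[0]      # remove everything following '#'
        with self.lock:
            if url not in self.enqueued and self.should_visit(url):
                self.enqueued.add(url)
                self.pending.append(url)

    def run(self):
        while self.pending:
            url = self.pending.pop(0)
            try:
                links = self.fetch_links(url)
            except Exception as err:    # just report errors and carry on
                print(url, "err:", err)
                continue
            self.visit(url)
            for link in links:
                self.enqueue(link)

# A made-up three-page web, so the example needs no network access.
web = {"a": ["b#top", "c"], "b": ["a", "c#x"], "c": []}
crawler = Crawler(lambda url: web[url])
crawler.enqueue("a#start")
crawler.run()
```

Subclassing Crawler and overriding visit or should_visit plays the role that cloning the generic Crawler object plays in WebL.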
138. on Piece set union Piece set difference Piece set intersection Indexing into a piece set Pieces are numbered 0 to Size 1 All the elements of p that are located inside any element of q All the elements of p that are not located inside any element of q All the elements of p that are directly inside any element of q All the elements of p that are not directly inside any element of q All the elements of p that contain any element of q All the elements of p that do not contain any element of q WebL A Programming Language for the Web 101 The Markup Algebra TABLE 20 Piece and Piece Set Operators Function directlycontain p piece q piece pieceset directlycontain p pieceset q piece pieceset directlycontain p piece p pieceset pieceset directlycontain p pieceset q pieceset pleceset directlycontain p piece q piece pieceset directlycontain p pieceset q piece pieceset directlycontain p piece q pieceset pieceset directlycontain p pieceset q pieceset pleceset after p piece q piece pieceset after p pieceset q piece pieceset after p piece q pieceset pieceset after p pieceset q pieceset pieceset after p piece q piece pieceset after p pieceset q piece pieceset after p piece q pieceset pieceset after p pieceset q pleceset pieceset directlyafter p piece q piece pieceset directlyafter p pieceset q pi
139. ons that will be required in the following chapters.

Basic Protocol Terminology

The World Wide Web (WWW) consists of a large number of Web sites, domains, or servers that provide services to clients over the Internet. The typical clients of these services are users who retrieve web pages with a software application called a Web browser. In contrast, WebL is a client that fetches pages in an automated manner under the control of a program. The purpose of this section is to show how this works.

Uniform resource locators. Pages are identified by a uniform resource locator, or URL. A URL identifies the web site, the location of the page on the web site, the filename of the page, and the Internet transmission protocol required to fetch the page. Much simplified, URLs have the following form:

    http://hostname/path/filename.html

Here http refers to the protocol being used, hostname to the web site or machine identification, and path to the directory on that machine where the page called filename.html is stored.

HTTP. The Hypertext Transfer Protocol transfers the page over the Internet. The basic steps are:
• Establish a communications link from the client to the web server identified by hostname.
• The client sends an HTTP request to the server. The request consists of the location or filename of the page to retrieve (/path/filename.html), headers, and optional parameters.
• The serv
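The simplified URL form described above (protocol, hostname, path, filename) can be taken apart mechanically. The following Python sketch uses the standard urllib.parse module; the function name url_parts and the example address are assumptions for illustration and have nothing to do with WebL's own URL handling.

```python
import posixpath
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into the four parts named in the simplified form."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "path": directory,
        "filename": filename,
    }

print(url_parts("http://www.example.com/docs/guide/index.html"))
```

Real URLs can also carry a port, a query string, and a fragment, which this sketch ignores.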
140. orming any reliable computation on the unreliable web structures. It often happens that web services are unavailable, suddenly fail, or become unacceptably slow. These are very serious complications for computations that depend so much on the web infrastructure. Although service combinators cannot make a web-based computation completely failure-proof, they do add a certain amount of robustness to programming on the web. Service combinators are discussed in detail on page 47.

Markup algebra is a formalism for extracting information from structured text documents and for manipulating those documents. It consists of functions to extract elements and patterns from web documents, operators to manipulate what has been extracted in this manner, and functions to change a page, for example to insert or delete parts. The functions and operators all work on the high-level concept of a parsed web page, and there is little need to do lower-level string manipulation. Markup algebra is discussed in detail in Chapter 4.

The purpose of this document is to introduce programmers to the WebL language and its features. Before introducing the language in its totality, however, we will first summarize WebL's main features.

Basic Features
• The WebL language and system is designed for rapid prototyping of Web computations. It is well suited for the automation of tasks on the WWW.
• WebL's emphasis is on high flexibility and high-level abstractions rather th
141. ors and Functions P after after Q The after operator returns all the elements of P that are after or not after any element of Q Note that this is equivalent to all the elements of P that are after or not after the first element of Q Consequently we often need to index into Q to reduce it to a single piece Examples Retrieve all the H2 s after the appendix assuming only a single appendix is present Elem X h2 after Elem X h1 contain Pat X Appendix Retrieve all the headings before Chapter 4 inclusive Elem X hi after Elem X h1 contain Pat X Chapter 4 Retrieve all the italic elements except the last Elem X i before Elem X i Retrieve the last italic element Elem X i before Elem X i P directlyafter directlyafter Q The directlyafter operator returns the pieces of P that are directly after or not directly after any element of Q A piece p of P is directly after a piece q of Q if no other piece in P appears between p and q Exam ples based on the previous page object X Retrieve the italics directly after Hl s i e lines 2 6 9 12 Elem X i directlyafter Elem X h1 Retrieve the italics not directly after Hl1 s i e lines 3 4 7 10 12 Elem X i directlyafter Elem X h1 Retrieve all elements directly after H1 s i e lines 2 6 9 12 Elem X dir
142. peated execution of an expression while a boolean expression (its guard) yields true. The guard is checked before every execution of the expression. The value of a while statement is nil.

Syntax (WhileStat): while SS do SS end

Example:

    while x > 0 do x = x div 2; k = k + 1 end

Repeat Statement

Repeat statements specify the repeated execution of an expression until a boolean expression (its guard) yields true. The guard is checked after every execution of the expression. The value of a repeat statement is nil.

Syntax (RepeatStat): repeat SS until SS end

Example:

    repeat x = x * 2 until x > k end

Try Statement

Execution of an expression may terminate in a failure or exception. We say that the expression has thrown an exception. Exceptions are implemented with objects in WebL. The Throw function accepts any object to throw as an exception. The try expression is used to trap a failed expression or, more commonly said, to catch the exception object. If no exception occurs, the try statement simply executes a statement sequence and returns its value. If an exception occurs in the statement sequence, a sequence of guarded expressions is evaluated. The guards are evaluated in sequence until one evaluates to true, whereafter the associated expression is evaluated and returned as value. If no guard evaluates to true, the exception is automatically re-thrown and may be caught by an enclosing try state
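The guard-by-guard handling described for the try statement can be modelled in a few lines. The Python sketch below is an analogy and not WebL: try_catch is a hypothetical helper, exceptions are ordinary Python exceptions instead of WebL objects, and each handler is a pair of a guard function and the function to run when that guard holds.

```python
def try_catch(body, handlers):
    """Run body; on failure, run the first handler whose guard accepts the exception."""
    try:
        return body()
    except Exception as exc:
        for guard, handler in handlers:
            if guard(exc):
                return handler(exc)
        # No guard evaluated to true: re-throw for an enclosing handler.
        raise

def failing():
    raise ValueError("fetch failed")

result = try_catch(failing, [
    (lambda e: isinstance(e, KeyError), lambda e: "missing key"),
    (lambda e: True, lambda e: "caught: " + str(e)),
])
print(result)
```

A guard that always returns true, as in the second handler, acts as a catch-all and should therefore come last.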
143. pplication of cookies is to uniquely identify customers at a web store front. At startup time WebL knows about no cookies, i.e. the cookie database is empty. As cookies are set by servers during HTTP requests, the cookie database fills up. Each WebL HTTP request is checked against the cookie database, and if necessary WebL returns the appropriate cookies to the server. A special WebL module called Cookies allows you to save the cookie database to a file and reload it at a later time.

Page parsing. Once an HTTP request is completed, WebL parses the page data into an internal format that makes it easy to query and manipulate the page. WebL programmers should have a high-level understanding of how HTML and XML are handled; to this end, the following section gives an overview of basic markup concepts and how they relate to WebL. This background material is a prerequisite for the following chapter on the search algebra and page manipulation features.

Markup. The Hypertext Markup Language (HTML) and Extensible Markup Language (XML) are both instances of the Standard Generalized Markup Language (SGML). SGML was conceived in the mid-1980s as a text markup notation for exchanging hierarchically organized electronic documents. SGML consists of two parts, namely the document markup in the form of tags, and a meta-description of a document class called a Document Type Definition (DTD). DTDs are typically designed for special purposes an
144. r

Exceptions

Exceptions typically indicate unexpected situations occurring during program execution. Exceptions are caught with the try statement (see "Try Statement" on page 36) and generated with the Throw built-in function. Processing exceptions requires knowledge of the format of exception objects, in particular the type of the exception, which allows you to distinguish between the possible situations that occurred.

Table 47 lists the exceptions thrown by the built-in WebL functions. By convention, the exception type (e.g. ArgumentError) is indicated by the type field of the exception object. Also by convention, the msg field of the exception object gives information on why the exception occurred.

Operators and statements can also generate exceptions, as explained in the following paragraphs. All operators throw an OperandMismatch exception if the operands to the operator are not of the expected value type.

Function or method application (e.g. calling a function or method) can throw the following exceptions:
• NoSuchField: Object does not have such field.
• NotAFunctionOrMethod: Left-hand side is not callable.
• NotAnObject: Left-hand side is not an object.
• ArgumentError: Number of actual and formal arguments do not match.

Variable assignment can throw the following exceptions:
• FieldError: Unknown field or ill
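145. $r \in P \land r\ \mathrm{inside}\ q \land p\ \mathrm{inside}\ r$

!directlyinside: $\{\, p \in P \mid \neg\exists q\,(q \in Q \land p\ \mathrm{inside}\ q \land \neg\exists r\,(r \in P \land r\ \mathrm{inside}\ q \land p\ \mathrm{inside}\ r)) \,\}$
contain: $\{\, p \in P \mid \exists q\,(q \in Q \land p\ \mathrm{contain}\ q) \,\}$
!contain: $\{\, p \in P \mid \neg\exists q\,(q \in Q \land p\ \mathrm{contain}\ q) \,\}$
directlycontain: $\{\, p \in P \mid \exists q\,(q \in Q \land p\ \mathrm{contain}\ q \land \neg\exists r\,(r \in P \land r\ \mathrm{contain}\ q \land p\ \mathrm{contain}\ r)) \,\}$
!directlycontain: $\{\, p \in P \mid \neg\exists q\,(q \in Q \land p\ \mathrm{contain}\ q \land \neg\exists r\,(r \in P \land r\ \mathrm{contain}\ q \land p\ \mathrm{contain}\ r)) \,\}$
after: $\{\, p \in P \mid \exists q\,(q \in Q \land p\ \mathrm{after}\ q) \,\}$
!after: $\{\, p \in P \mid \neg\exists q\,(q \in Q \land p\ \mathrm{after}\ q) \,\}$
directlyafter: $\{\, p \in P \mid \exists q\,(q \in Q \land p\ \mathrm{after}\ q \land \neg\exists r\,(r \in P \land r\ \mathrm{after}\ q \land p\ \mathrm{after}\ r)) \,\}$
!directlyafter: $\{\, p \in P \mid \neg\exists q\,(q \in Q \land p\ \mathrm{after}\ q \land \neg\exists r\,(r \in P \land r\ \mathrm{after}\ q \land p\ \mathrm{after}\ r)) \,\}$
before: $\{\, p \in P \mid \exists q\,(q \in Q \land p\ \mathrm{before}\ q) \,\}$
!before: $\{\, p \in P \mid \neg\exists q\,(q \in Q \land p\ \mathrm{before}\ q) \,\}$
directlybefore: $\{\, p \in P \mid \exists q\,(q \in Q \land p\ \mathrm{before}\ q \land \neg\exists r\,(r \in P \land r\ \mathrm{before}\ q \land p\ \mathrm{before}\ r)) \,\}$
!directlybefore: $\{\, p \in P \mid \neg\exists q\,(q \in Q \land p\ \mathrm{before}\ q \land \neg\exists r\,(r \in P \land r\ \mathrm{before}\ q \land p\ \mathrm{before}\ r)) \,\}$
overlap: $\{\, p \in P \mid \exists q\,(q \in Q \land p\ \mathrm{overlap}\ q) \,\}$
!overlap: $\{\, p \in P \mid \neg\exists q\,(q \in Q \land p\ \mathrm{overlap}\ q) \,\}$

Page Modification

Page modification is an important part of the WebL markup algebra. As we have seen already, the attributes of markup elements can be elegantly modified by accessing the fields of pieces. This section focuses on how to insert pieces into a page, delete pieces from a page, and replace pieces of a page.

Creating Pieces

There are several ways to create new pieces (see Table 23). After a new piece has been created, it can be inserted into a page at a specific position. We already introduced the NewPage function, which takes a string and a MIME type as arguments and returns a page object. We also know that then applying the Content func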
146. rls cookiedb

Description
• When this flag is set to true, the HTML parser attempts to correct incorrectly nested HTML elements in a page, for example putting an H2 inside an H1. This has the effect of regularizing badly formatted HTML at the cost of sometimes unintuitive parses. The default value of this flag is false.
• Overrides the MIME type to be used when parsing the page. See Table 14 on page 63 for typical string values this field may assume.
• When this flag is set to true, HTTP POST requests that are redirected by a web server to another URL are automatically changed into a subsequent HTTP GET request to that URL, a behavior which is non-compliant with section 9.3 of the HTTP 1.0 specification and section 10.3 of the HTTP 1.1 specification. Note that all POST request parameters are ignored for the subsequent HTTP request. The default value of this flag is true, as many web browsers do not follow the specifications correctly in this regard.
• When this flag is set to false, the URLs in the page are not resolved to absolute form. The default is true.
• A string value specifying which cookie database to use for the request. See "Multiple Cookie Databases" on page 117.

CHAPTER 4 The Markup Algebra

The WebL markup algebra is used for manipulating web pages and extracting data from them. Extracting information may range from simple operations like iterating
147. rogram. Nested contexts are automatically created at points where sequences of statements can be used, for example inside while, repeat, and if statements. A fresh context can be created explicitly with the begin statement. A variable can be used in a specific context at all positions syntactically following the place where it was declared.

Variable resolution is done by searching for a binding from inner nested contexts to outer contexts. This allows variables in inner contexts to override variables with the same name in outer contexts. For example, in the following program the variable sq is visible only inside the body of the while statement, and the variable i is visible only inside the context defined by the begin statement:

    var sum = 0;
    Print("The sum of the squares between 0 and 100 is ");
    begin
        var i = 0;
        while i < 100 do
            var sq = i * i;
            sum = sum + sq;
            i = i + 1
        end
    end;
    PrintLn(sum)

Constructors

WebL also supports lists of values, sets of values, and objects with fields. Constructors perform the creation of these types of values from simpler values. Table 3 shows that lists are constructed by square brackets, sets by curly braces, and objects by square brackets and a period token. More information about these value types is given in the section "Dynamic Types" on page 23. Note that constructors consist of sub-expressions that are ev
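The inner-to-outer search for a binding described above can be modelled with Python's standard collections.ChainMap, where each nested context is a child mapping. This is an analogy only: the context names program, block and body are invented for the sketch, and the model covers lookup, not WebL's declaration or assignment rules.

```python
from collections import ChainMap

# Outermost context: declares sum.
program = ChainMap({"sum": 0})
# Context opened by a begin ... end block: declares i.
block = program.new_child({"i": 0})
# Context opened by the body of a while statement: declares sq.
body = block.new_child({"sq": 0})

def visible(context, name):
    """True if name resolves in context or in any enclosing context."""
    return name in context

print(visible(body, "sum"), visible(block, "sq"))

# An inner declaration overrides an outer variable of the same name.
shadow = program.new_child({"sum": 99})
print(shadow["sum"], program["sum"])
```

Lookups walk from the innermost mapping outwards, so sq is found only from inside the loop body, while sum is found from every context.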
148. rom the Navy manuals WebL A Programming Language for the Web 149 Examples on which the scoring function is based to let us conclude that we are only calculat ing a relative score between similar pages in a corpus 1 import Str Files 2 3 var Scores fun page 4 var txt Text page 5 var letters Size Str_ Search txt A Za z 6 var words Size Str_ Search txt 0 9a ZA Z 7 var syllables Size Str_ Search txt aeiouy 8 9 var exceptions Pat page A Z0 9 10 Replace exceptions NewPiece X text plain 11 var sentences Size Pat page 12 13 14 Sentences sentences 15 Words words 16 Syllables syllables 17 18 ARI 4 71 letters words 19 0 5 words sentences 21 43 20 Kincaid 11 8 syllables words 21 0 39 words sentences 15 59 22 CLF 5 89 letters words 23 0 3 sentences words 100 15 8 24 Flesch 206 835 84 6 syllables words 25 1 015 words sentences 26 27 end 28 29 var ScorePageList fun L 30 var res 31 var count 1 32 every s in L do 33 try 34 PrintLn count scoring s 35 count count 1 36 var page GetURL s 37 var sc Scores page 38 sc URL S 39 sc Title Text Elem page title 0 40 res res sc 41 catch e just report errors 42 on true do PrintLn e msg 150 WebL A Progr
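Three of the scores computed in the listing above follow well-known published formulas, which can be restated as a small function. The Python sketch below uses the standard forms of the Automated Readability Index, the Flesch-Kincaid grade level, and the Flesch reading ease score with the constants that appear in the listing. The function name readability and the sample counts are assumptions, and the Coleman-Liau line is left out because its operators cannot be recovered reliably from the text.

```python
def readability(letters, words, sentences, syllables):
    """Compute three readability scores from raw text counts."""
    return {
        "ARI": 4.71 * letters / words + 0.5 * words / sentences - 21.43,
        "Kincaid": 11.8 * syllables / words + 0.39 * words / sentences - 15.59,
        "Flesch": 206.835 - 84.6 * syllables / words - 1.015 * words / sentences,
    }

# Hypothetical counts for a 100-word passage.
scores = readability(letters=400, words=100, sentences=5, syllables=150)
for name, value in scores.items():
    print(name, round(value, 2))
```

As the text notes, such figures are best read as relative scores between similar pages, since counting letters, syllables, and sentences with simple patterns is itself approximate.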
149. rong number of arguments ArgumentError Incorrect or wrong number of arguments NetException Fetch failed sta tuscode field of the exception object indicates the reason ArgumentError Incorrect or wrong number of arguments NetException Fetch failed sta tuscode field of the exception object indicates the reason ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments 186 WebL A Programming Language for the Web Exceptions TABLE 47 Exceptions thrown by the built in functions Function Intp x bool Listp x bool Markup P page string Markup q piece string Methp x bool Name q piece string Native classname string fun NewNamedPiece name string t1 tag t2 tag piece NewNamedPiece name string q piece piece Exceptions ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments NativeCodeImportError Class instantiation failed access denied no such method or not a sub class of built in ArgumentError Incorrect or wrong number of arguments NotSamePage The tag arguments to the function do not belong to the
150. rser. For example, the paragraph <P> element is in fact a non-empty element that has a corresponding </P> end tag. However, most HTML documents do not contain these optional end tags, in which case the HTML parser has to infer where paragraphs end. WebL knows the HTML DTD and can thus insert optional tags when needed. In general, WebL attempts to make a faithful internal representation of the documents it parses (including spaces and new lines), except that it inserts optional tags when appropriate. Conversion from the internal format to the external format might thus result in slightly different, but equivalent, pages.

Character Entities. Inserted in the PCDATA stream we often find character entities of the form &...;, where the part between the ampersand and the semicolon is a number or an alphanumeric name denoting a special symbol. For example, &lt; and &gt; denote the less-than and greater-than symbols. This encoding is used both to embed special symbols that might be confused with markup and to provide a human-readable way to represent all Unicode characters. WebL does not perform any translation of character entities by default when fetching a page, but it does provide a built-in called ExpandCharEntities to process them afterwards, and a retrieval option that switches on expansion (Table 16 on page 65).

Case sensitivity. XML is case sensitive and HTML is case insensitive. In the case
151. rvlet.Servlet implements the WebL servlet. By placing this class (or the jar file it is in, namely WebL.jar) on a servlet-enabled web server, it becomes possible to execute WebL code directly on the server. Web surfers may access your WebL servlet by accessing a URL of typically the following form:

    http://www.host.com/servlet/weblx.servlet.Servlet/modulename_variablename?arguments

In case your web server supports aliases, you can alias weblx.servlet.Servlet as webl, which allows access from the following URL:

    http://www.host.com/servlet/webl/modulename_variablename?arguments

In both cases modulename identifies the WebL module that contains the WebL servlet script, and variablename identifies an exported variable in that module. The value type of this variable must be a function with two formal arguments. The module is loaded automatically the first time the URL is accessed; this happens only once, and afterwards the module is cached. Table 34 and Table 35 show the format of the two arguments of the function. The first is the request object and the second is the response object. The explanation for the field names and values is found in the Java servlet specification available from Javasoft. You may modify your WebL servlets while they are in use: before each servlet access, WebL checks whether the WebL module has changed, using the file's last-modified date. If the modified date is different from the modified date when the
152. s a new type called j-array that wraps Java arrays. The reasoning behind providing a separate array type, instead of using an established data type such as type list, is that Java arrays are fundamentally different from WebL data types. Java arrays are mutable (i.e. elements can be overwritten), whereas WebL types (except for type object) are immutable. Thus passing a WebL list to a method that expects a Java array raises the question of what would happen if the method mutates the array. The Java array support in module Java includes functions to allocate an array of a specific type and size (Java_NewArray), to retrieve an element at a specific index (Java_Get), and to overwrite an element at a specific index (Java_Set). The following program allocates, writes and reads the elements of an array:

    import Java;
    var A = Java_NewArray("int", 10);
    Java_Set(A, 0, 42);
    PrintLn(Java_Get(A, 0));
    Java_Set(A, 1, "hello");    // Type mismatch exception

Java Classpath. The Java CLASSPATH environment variable must be set correctly to access the Java classes that are external to the classes in WebL.jar. Programmers should be aware that when using the -jar option of the Java runtime, classes are only searched for in WebL.jar. It is thus better to run WebL with the -cp Java runtime option, where the CLASSPATH must be specified explicitly.

TABLE 31 Module Java Function New classname string
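The distinction drawn above between a mutable, typed array and an immutable list can be illustrated with a Python analogue (not WebL). This is only a sketch: Python's standard `array` module stands in for the j-array type, a tuple stands in for an immutable WebL list, and the helper name `try_store` is hypothetical.

```python
from array import array

# A typed, mutable array of ten ints, loosely analogous to Java_NewArray.
a = array("i", [0] * 10)
a[0] = 42                      # overwriting an element in place is allowed

def try_store(arr, index, value):
    """Return True if the store succeeded, False on a type mismatch."""
    try:
        arr[index] = value
        return True
    except TypeError:          # e.g. storing a string in an int array
        return False

# An immutable sequence cannot be overwritten at all, which is why it is
# unsafe to hand one to code that expects to mutate its argument.
immutable = (1, 2, 3)
```

Calling `try_store(a, 1, "hello")` returns False, mirroring the type mismatch exception in the WebL program.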
153. s of characters (called parsed character data, or PCDATA) and markup symbols called tags. Tags consist of characters enclosed between less-than (<) and greater-than (>) symbols. The tag contents specify a tag name and, optionally, any number of attributes. Tag names are predefined in HTML (for example H1, H2, P, FONT, TITLE, etc.), whereas XML tags are defined according to a DTD. Attributes consist of name-value pairs. The general tag style is as follows:

    <name A="abc" B="123">

Here the tag name name is followed by attributes A and B having the values abc and 123 respectively. Values are always quoted by single or double quotes in XML. HTML values have a more flexible syntax, allowing certain values to be unquoted. HTML also allows attributes to have no value, for example:

    <name A B>

Elements. A hierarchical structure is imposed on a page by collecting tags and parsed character data into elements. We distinguish between comment elements, non-empty elements, empty elements, processing instruction elements, and SGML directive elements. Note that in this manual we diverge from SGML terminology so as to explain the unified view WebL presents to the programmer, by mapping names and attributes of elements to piece objects (introduced later) and object fields. The name of a piece is derived from the start tag of the element.

Comments. Comment elements specify
154. se thread A inserts a character x directly after the begin tag of A. In the case of separate (i.e. non-merged) unnamed tags, the resulting situation is easy to visualize. However, with merged unnamed tags, thread A will insert the character inside the piece B created by thread B, which might be unexpected by thread B. These types of problems caused us to reject unnamed tag merging. Instead, to ensure that A and B are equal, WebL introduces the concept of positions. The position of a tag is a numerical rank of the tag in a page. We number tags from 0 onwards in the order of occurrence in the page, all the while ensuring that sequences of unnamed tags have the same number. Figure 5 shows the position numbering for a more complicated page consisting of named and unnamed tags. Comparison of pieces is then made according to the positions of the begin and end tags of the pieces. For example, our definition of piece equality of x and y becomes:

    pos(BeginTag(x)) = pos(BeginTag(y)) and pos(EndTag(x)) = pos(EndTag(y))

1. Readers concerned about inefficient renumbering of tag positions after inserting or deleting tags should be aware that behind the scenes WebL uses an efficient encoding that prevents renumbering positions for large parts of the page after a modification is performed.

Using the notion of positions we can thus define equality containment etc
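The position numbering described above can be sketched in Python (not WebL). This is an illustration under assumptions, not WebL's actual encoding: a page is modeled as a list of ("tag", name) and ("text", s) items, an unnamed tag has name None, text is assumed to end a run of unnamed tags, and the names `tag_positions` and `pieces_equal` are hypothetical.

```python
def tag_positions(items):
    """Map the index of each tag item to its position number.

    Consecutive unnamed tags (name None) share one position number.
    """
    positions = {}
    pos = -1
    in_unnamed_run = False
    for i, (kind, value) in enumerate(items):
        if kind != "tag":
            in_unnamed_run = False      # assumption: text ends the run
            continue
        unnamed = value is None
        if not (unnamed and in_unnamed_run):
            pos += 1                    # start of a new position
        positions[i] = pos
        in_unnamed_run = unnamed
    return positions

def pieces_equal(positions, x, y):
    """x and y are (begin_tag_index, end_tag_index) pairs."""
    return (positions[x[0]] == positions[y[0]]
            and positions[x[1]] == positions[y[1]])

# Two unnamed pieces A and B created independently around the same word.
PAGE = [("tag", "p"), ("tag", None), ("tag", None), ("text", "WebL"),
        ("tag", None), ("tag", None), ("tag", "/p")]
A, B, PARAGRAPH = (1, 5), (2, 4), (0, 6)
```

With this numbering, `pieces_equal(tag_positions(PAGE), A, B)` holds even though A and B use distinct tags, which is the property the text requires without merging unnamed tags.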
155. ssion that evaluates to the value returned; otherwise nil is returned by default. Note that the return statement can be used to return a value early. Contrast the WebL convention of the last expression of a function or method calculating the return value, which allows returning a value only at the end of the function or method. Also note that it is a runtime exception to execute a return statement outside of a function or method body.

Syntax: ReturnStat return E

Example:

    var F = fun(s)
        if s == nil then return end;
        ToString(s)
    end

Built-in Functions. Several functions are built into the WebL programming language (in contrast to functions written by the programmer). We distinguish between normal built-ins and special built-ins. Normal built-ins evaluate all their actual arguments before invoking the function. Special built-ins defer the evaluation of their arguments to the function being invoked. Examples of special built-ins include Time, Timeout, and Retry. Most built-ins accept only a fixed number of arguments. Some built-ins, like PrintLn, accept any number of arguments of any value type. Variable-length argument built-ins are specified with ellipses in Table 13. An actual argument can be of any value type if no explicit type is given in the table. The pseudotype any denotes values of any type. As a shorthand we sometimes use the notation
156. st www proxy pa dec com user region US ftpProxyPort 8080 Java vendor Sun Microsystems Inc file encoding 8859 1 line separator n file encoding pkg sun io os name Windws NT user name marais awt toolkit sun awt windows WToolkit Java class version 45 3 file separator http proxySet true user timezone PST jJava home C JAVA Java version 11 os arch x86 jJava vendor url http www sun com ftpProxySet false os version 4 0 user dir user home Java class path C Proj WebL3 0 java Z marais W W

WebL Quick Reference

WebL EBNF. WebL programs can be written in the Unicode character set (little- or big-endian byte ordering, with an initial Unicode byte-ordering mark) or the more compact UTF-8 character set. Note that the first 128 characters of UTF-8 correspond to ASCII, which is also the first half of the widely used western ISO 8859-1 (Latin-1) character set. White space and comments are ignored in WebL programs. Comments consist of either
• a double forward slash token (//), which introduces a comment until the end of the line, or
• the token pair /* and */ with the comment in between.
Note that comments of the /* */ style may nest.

The WebL EBNF is Module Import SS Import
157. string in base64 encoding.

Module Browser. The Browser module provides a way to display markup in your web browser. On the Windows platform, the default installed browser is started to display the page. On UNIX platforms, WebL tries to communicate with an already running copy of Netscape. Note that to implement this functionality, WebL has to write the markup to a temporary file in the specified character encoding. On the Windows platform only, module Browser also provides rudimentary support for inquiring and controlling a running copy of a Netscape browser with Dynamic Data Exchange (DDE). Specifically, it is possible to detect what web page is being viewed in the browser and to request Netscape to navigate to a specific URL. Both the support for viewing markup and the DDE functionality are bundled in a Windows platform-specific DLL called weblwin32.dll. The readme.txt file that is part of the WebL distribution contains instructions on how to install this DLL on a Windows machine.

TABLE 26 Module Browser
Function: GetCurrentPage(): object
Description: Windows only. Returns information about the currently viewed page in a running copy of Netscape. The object returned has string fields url and title that specify the viewed URL and the title of the viewed page respectively.
Function: GotoURL(url: string): nil
Description: Windows only. Sends a request to a running copy of Netscape to na
158. t Elem X i before Elem X i

P directlybefore Q, P !directlybefore Q. The directlybefore operator returns the pieces of P that are directly before (or, in the negated form, not directly before) any element of Q. A piece p of P is directly before a piece q of Q if no other piece in P appears between p and q. For example, given that page X contains (excluding the line numbers on the left):

     1  <h1>A</h1>
     2  <i>a</i>
     3  <i>b</i>
     4  <b>c</b>
     5  <h1>B</h1>
     6  <i>d</i>
     7  <i>e</i>
     8  <h1>C</h1>
     9  <i>f</i>
    10  <i>g</i>
    11  <h1>D</h1>
    12  <i>h</i>
    13  <i>i</i>
    14  <h1>E</h1>

we can compute the following:

    // Retrieve the italics directly before H1's, i.e. lines 3, 7, 10, 13
    Elem(X, "i") directlybefore Elem(X, "h1")
    // Retrieve the italics not directly before H1's, i.e. lines 2, 6, 9, 12
    Elem(X, "i") !directlybefore Elem(X, "h1")
    // Retrieve all elements directly before H1's, i.e. lines 4, 7, 10, 13
    Elem(X) directlybefore Elem(X, "h1")
    // Retrieve the second element directly before H1's, i.e. lines 3, 6, 9, 12
    Elem(X) directlybefore (Elem(X) directlybefore Elem(X, "h1"))
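The definition of directlybefore given above can be checked against the example page with a small Python model (not WebL). This is a sketch under assumptions: each piece is reduced to a half-open (begin, end) interval built from its line number, and the names `before`, `directly_before` and `line` are hypothetical.

```python
def before(p, q):
    """p lies entirely before q; pieces are (begin, end) intervals."""
    return p[1] <= q[0]

def directly_before(P, Q):
    """Pieces p of P with some q in Q such that no other piece of P
    lies between p and q."""
    result = []
    for p in P:
        for q in Q:
            blocked = any(before(p, other) and before(other, q)
                          for other in P if other != p)
            if before(p, q) and not blocked:
                result.append(p)
                break
    return result

def not_directly_before(P, Q):
    """The negated form: pieces of P not directly before any q in Q."""
    hits = directly_before(P, Q)
    return [p for p in P if p not in hits]

def line(n):
    """Model the element on line n of the example page as one interval."""
    return (n, n + 1)

ITALICS = [line(n) for n in (2, 3, 6, 7, 9, 10, 12, 13)]
H1 = [line(n) for n in (1, 5, 8, 11, 14)]
ALL = [line(n) for n in range(1, 15)]
```

Note that line 3 counts as directly before the heading on line 5 even though the bold element on line 4 intervenes, because only pieces of P (here the italics) can block the relation.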
159. t after p pieceset q pleceset pieceset before p piece q piece pieceset before p pieceset q piece pieceset before p piece q pieceset pieceset before p pieceset q pieceset pieceset contain p piece q piece pieceset contain p pieceset q piece pieceset contain p piece q pieceset pieceset contain p pieceset q pieceset pieceset directlyafter p piece q piece pieceset directlyafter p pieceset q piece pieceset directlyafter p piece q pieceset pieceset directlyafter p pieceset q pieceset pieceset directlybefore p piece q piece pieceset directlybefore p pieceset q piece pieceset directlybefore p piece q pieceset pieceset directlybefore p pieceset q pieceset pleceset directlycontain p piece q piece pieceset directlycontain p pieceset q piece pieceset directlycontain p piece q pieceset pieceset directlycontain p pieceset q pieceset pieceset directlyinside p piece q piece pieceset directlyinside p pieceset q piece pieceset directlyinside p piece q pieceset pieceset directlyinside p pieceset p pieceset pieceset Description Logical negation Value in equality test See Value Equality on page 31 All the elements of p that are not after any element of q All the elements of p that do not precede any element of q All the elements of p that do not contain any element of q All the
160. tags except i, b, font and tt are regarded as paragraph terminators. Breaking a page into paragraphs with a specific set of paragraph terminators then proceeds as follows:
• Identify all the paragraph terminators on the page.
• Build a result piece set of paragraphs, namely all the regions that appear between successive terminators on the page (bounding terminator tags excluded). This involves the insertion of unnamed tags as placeholders.
• Remove from the result piece set all those pieces p that consist of white space only, i.e. applying Markup(p) returns a string containing only \n, \r, \t and character 160 (the character code of &nbsp;).

The paragraph search function Para expects a piece or page as first argument and a specification of the paragraph terminators as the second argument. The function returns the piece set of paragraphs. The paragraph terminator specification is in the form of a string of tag names delimited by white space. For example,

    var p = Para(page, "br p table li")

indicates that the br, p, table and li elements should be regarded as paragraph terminators. Sometimes it is more convenient to specify the tags that should not be regarded as paragraph terminators. This is done by making the first element name in the paragraph terminator specification a var p Para page font a b i tt img This ind
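The three steps above can be sketched in Python (not WebL) over a simplified token stream. This is an illustration under assumptions: a page is a list of ("tag", name) and ("text", s) tokens, the plain space character is treated as white space in addition to the characters listed, only the positive form of the terminator specification is handled, and the names `para` and `markup` are hypothetical.

```python
WHITESPACE = " \n\r\t\u00a0"   # \u00a0 is character 160 (&nbsp;)

def markup(piece):
    """Render a list of tokens back to a markup string."""
    return "".join(f"<{v}>" if k == "tag" else v for k, v in piece)

def para(tokens, spec):
    """Split tokens into paragraphs at the terminator tags named in spec."""
    terminators = set(spec.lower().split())
    pieces, current = [], []
    for kind, value in tokens:
        # Step 1: identify terminators (start or end tags of listed names).
        if kind == "tag" and value.lstrip("/").lower() in terminators:
            pieces.append(current)      # Step 2: region between terminators
            current = []
        else:
            current.append((kind, value))
    pieces.append(current)
    # Step 3: drop regions whose markup is white space only.
    return [p for p in pieces if markup(p).strip(WHITESPACE)]

TOKENS = [("text", "Hello "), ("tag", "b"), ("text", "world"), ("tag", "/b"),
          ("tag", "br"), ("text", " \n\u00a0"), ("tag", "p"),
          ("text", "Second"), ("tag", "/p")]
```

For example, `para(TOKENS, "br p table li")` yields two paragraphs; the bold tags stay inside the first one because b is not listed as a terminator.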
161. text to be ignored during document parsing. Comments consist of a single tag in the following style:

    <!-- this is the comment text -->

The name of a comment element is In WebL, the comment element has a field called comment, which has as value the text occurring between the tokens.

Non-empty Elements. These consist of a start tag, any number of nested elements or PCDATAs, and a matching end tag. Everything between the start tag and end tag is said to be inside or contained in the element. The general format is as follows:

    <tagname A="abc" B="123"> </tagname>

The names of the start tag and end tag must match. Note how the end tag starts with a forward slash character. Only the start tag may have attributes. The name and attributes of non-empty elements are those of the start tag. In the case of attributes with no value, the attribute of the element is set to the empty string.

Empty elements. Empty elements do not have any content and thus do not require an end tag. They have the format:

    <tagname A="abc" B="123"/>

Note the forward slash that ends the tag. The element name and attributes are those of the tag. Empty elements appear only in XML documents. HTML has something similar to an empty element, but it cannot be distinguished from a start tag. For example, the HTML markup <br> does not have a corresponding end tag an
162. the beginning of a program. Imported variable references must always be explicitly qualified by the module name and an underscore. Note that the choice of the underscore character allows us to separate the variable name space and the module name space, i.e. a module and a variable might have the same name. One of the side effects of importing a module is the loading of the module into memory. WebL keeps track of all loaded modules in a global module list. Before importing a module, the module list is checked to see whether the module has been loaded before; if so, the previously loaded module is reused. Thus a module can be loaded only once. There is no operation to unload a module.

A module is nothing more than a statement sequence stored in a file with the extension .webl. Loading a module involves executing this statement sequence once. The language allows the programmer to export declared variables from the module. The exported variables are visible to clients of the module. Unexported variables cannot be accessed from clients. Exported variables are only allowed in the top-level context. For example, in the following implementation of module A (which must be stored in the file A.webl), the variable y is hidden from clients:

    // Implementation of module A
    export var X = 42;
    var y = 10;
    export var Doit = fun() PrintLn("ok") end;

Service Combinators Modules may import
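The load-once behaviour and export rule described above can be modeled in Python (not WebL). This is a sketch under assumptions, not WebL's loader: a dictionary of source strings stands in for .webl files, an EXPORTS list stands in for the export keyword, and all names (`SOURCES`, `LOAD_LOG`, `import_module`) are hypothetical.

```python
# Stand-ins for module files on disk (hypothetical contents).
SOURCES = {
    "A": "x = 42\ny = 10\nEXPORTS = ['x']\nLOAD_LOG.append('A')\n",
}

LOAD_LOG = []   # records every time a module body is actually executed
_loaded = {}    # the global module list: module name -> exported variables

def import_module(name):
    """Load a module at most once and return only its exported variables."""
    if name in _loaded:                  # loaded before: reuse it
        return _loaded[name]
    namespace = {"LOAD_LOG": LOAD_LOG}
    exec(SOURCES[name], namespace)       # run the statement sequence once
    exports = {k: namespace[k] for k in namespace["EXPORTS"]}
    _loaded[name] = exports
    return exports
```

Importing "A" twice runs its body once and returns the same object both times; the unexported y is not reachable from the result.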
163. the console or communicating over the Internet. The value of the expression sequence (the last expression executed) is discarded.

Value Types. Each value has an associated value type, or type. The type determines how the value can be further used by expressions. For example, it is only possible to multiply two values that have a numerical type. WebL is a dynamically typed programming language. This means that at the point where values are used by expressions, they are checked to be of the correct type as expected by the expression. If they are not, an exception is thrown. The defined value types of WebL are nil, boolean, int, real, char, string, fun, meth, set, list, object, page, piece, pieceset, and tag. See Dynamic Types on page 23 for more details.

1. Note that WebL programs need not be compiled explicitly; compilation is performed automatically just before the program is executed. WebL's execution model is similar to that of most scripting languages.

Basic Terminology

Constants. Constants are simple expressions that evaluate to themselves. They are the simplest WebL expressions. WebL allows nil, boolean, integer, real, character, and string constants. Table 1 lists examples of constant expressions, what they evaluate to, and the resulting value type.

TABLE 1 Constant examples
Expression    Value    Value Type
nil           nil      nil
true          true     bool
false         false    bool
21 2
164. the indicated port and prepares to serve up files located at fileroot in the file system.
Stop(): nil
    Stops the web server.
Publish(name: string, f: fun): nil
    Publishes the function f under a name on the server. The name indicates the URL that will invoke function f. Function f has to be a function with two formal arguments (see discussion above).

TABLE 40 Fields of the Request Object
Field       Description
method      HTTP method (GET, POST, etc.).
protocol    The HTTP protocol version.
uri         The URL of the complete request.
query       The query part of the request.
path        The path of the script requested.
contents    The contents of the request message (typically only has a value for POST methods).
param       Object with fields submitted in either a GET or POST method. In case a particular parameter is repeated in the request, the appropriate field of the param object will be set to a list of strings corresponding to the individual parameter values.
header      Header fields the browser sent with the HTTP request. In case a particular header field is repeated in the request, the appropriate field of the header object will be set to a list of strings corresponding to the individual header field values.

TABLE 41 Fields of the Response Object
Field Description st
165. ting a string gives the individual characters of the string from left to right. Enumerating an object gives the field names of the object. While enumerating, each element is assigned in turn to the loop variable and the body of the every statement is executed. The value of the every statement is nil.

Syntax: EveryStat every Ident in E do SS end

Example:

    every x in [1, 2, 3, 4] do
        PrintLn("X has the value ", x)
    end

Lock Statement. The lock statement is used to prevent race conditions in multi-threaded programs. The statement locks an object, executes a statement sequence, and unlocks the object. The lock on the object can only be held by a single thread at any specific time. In case a thread tries to acquire a lock on an object held by another thread, the thread is suspended until the lock is released.

Syntax: LockStat lock SS do SS end

Example:

    var counter = [.
        val = 0,
        inc = meth(s) lock s do s.val = s.val + 1 end end
    .]

Begin Statement. The begin statement allows the programmer to introduce a new statement sequence in a sub-expression. This is sometimes useful to open a fresh context in which temporary variables can be declared.

Syntax: BeginStat begin SS end

Example: x begin var s a b s s end

Return Statement. The return statement returns the value of a function or method call. The return token is optionally followed by an expre
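The counter object in the lock statement example above has a direct Python analogue (not WebL). This is a minimal sketch: Python's `threading.Lock` stands in for locking the object itself, and the class and function names are hypothetical.

```python
import threading

class Counter:
    """A counter whose increment is protected by a lock."""

    def __init__(self):
        self.val = 0
        self._lock = threading.Lock()

    def inc(self):
        # Only one thread at a time may hold the lock; others are
        # suspended here until it is released.
        with self._lock:
            self.val += 1

def run(n_threads, n_incs):
    """Increment one shared counter from several threads."""
    counter = Counter()

    def worker():
        for _ in range(n_incs):
            counter.inc()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.val
```

Because every increment happens while the lock is held, the final value always equals the number of threads times the number of increments per thread.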
166. tion: S | T. The | combinator allows two services to be executed concurrently. The service S | T starts both services S and T at the same time and returns the result of whichever succeeds first. If both S and T fail, then the combined service also fails. Should one service complete before the other, the slower service is stopped. Stopping the slower service is performed in a controlled manner to ensure that the run-time remains in a consistent state. Typical checkpoints at which WebL will stop a service are function or method call boundaries and the beginning or end of programmed loops. This program attempts to fetch a page from one of two alternate sites. Both sites are attempted concurrently, and the result is that from whichever site successfully completes first:

    page = GetURL("http://www.altavista.digital.com")
         | GetURL("http://www.altavista.yellowpages.com.au")

Time-out: Timeout(t, S). The time-out combinator allows a time limit to be placed on a service. The service Timeout(t, S) acts like S except that it fails after t milliseconds if S has not completed within that time. S will be stopped in a controlled manner when it times out (see the concurrent execution description above for details on how services are stopped). This program attempts to connect to alternative AltaVista mirror sites but gives a limit of 10 seconds to succe
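The semantics of the concurrent execution and time-out combinators described above can be approximated in Python (not WebL) with `concurrent.futures`. This is a sketch under assumptions: services are modeled as zero-argument callables, failure is modeled as raising an exception, and the names `either`, `timeout` and `ServiceFailure` are hypothetical. Unlike WebL, Python cannot stop a running thread, so the slower service is merely abandoned here.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FutureTimeout

class ServiceFailure(Exception):
    """Raised when a combined service fails."""

def either(*services):
    """Run all services concurrently; return the first successful result."""
    pool = ThreadPoolExecutor(max_workers=len(services))
    try:
        futures = [pool.submit(s) for s in services]
        for fut in as_completed(futures):
            if fut.exception() is None:
                return fut.result()
        raise ServiceFailure("all services failed")
    finally:
        pool.shutdown(wait=False)   # do not wait for the slower service

def timeout(ms, service):
    """Act like service, but fail if it takes longer than ms milliseconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(service).result(timeout=ms / 1000)
    except FutureTimeout:
        raise ServiceFailure(f"timed out after {ms} ms")
    finally:
        pool.shutdown(wait=False)

# Demo services standing in for page fetches.
def slow():
    time.sleep(0.3)
    return "slow"

def fast():
    return "fast"

def bad():
    raise RuntimeError("fetch failed")
```

The two compose in the same way as in the text: `timeout(10000, lambda: either(slow, fast))` gives the pair of services ten seconds in total to produce a result.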
167. tion execution ArgumentError Incorrect or wrong number of arguments No exceptions are thrown No exceptions are thrown ArgumentError Incorrect or wrong number of arguments IOException An IO exception occurred during function execution ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments 190 WebL A Programming Language for the Web Exceptions TABLE 47 Exceptions thrown by the built in functions Function Retry x any Select s set f fun set Select 1 list f fun list Select p pieceset f fun pieceset Select s string from int to int string Seq p page pattern string pleceset Seq q piece pattern string pleceset Setp x bool Sign x int int Sign x real int Size s set int Size s string int Size l list int Exceptions ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments IndexRangeError Index into list or string out of bounds FunctionReturnTypeNotBoolean Function argument to Select did not return a boolean value ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments ArgumentError Incorrect or wrong number of arguments WebL A Programmi
168. tion to a page returns a piece covering the whole page. In fact, the code to create a piece in this manner,

    Content(NewPage("<html></html>", "text/html"))

occurs so often that we also use the following shorthand:

    NewPiece("<html></html>", "text/html")

Another way of creating a new piece is to pass the begin tag and end tag of two arbitrary pieces to the NewPiece function (Figure 10). The function returns a new unnamed piece, with new unnamed tags inserted just before and after the begin tag and end tag respectively to wrap its contents. The NewPiece function will also wrap a piece argument in the same manner. The NewNamedPiece function works in a similar manner to NewPiece, except that a new piece with the indicated name is created. Seeing that any begin and end tag pair (not belonging to the same piece) can be passed to this function, programmers should be aware that invalid HTML or XML can be created where elements do not nest properly. As WebL uses a flexible internal page representation, the presence of overlapping elements does not present any problems.

1. Technically the function does not modify the contents of a page, because only unnamed tags are inserted into the page.

Examples Turn the text from the word WebL to the end of the sixth paragraph to italic var a Pat P We
169. to the Elem function and somewhat related to the Pat function. As an example, the following program fetches a page and prints out all the text segments occurring on the page as delimited by markup tags:

    var P = GetURL("http://www.nowhere.com");
    every t in PCData(P) do
        PrintLn(Text(t))
    end

The Text function used above will be introduced a little later; it returns the textual content of a piece. Running this program will typically print a lot of white space. This is because the PCData function regards the empty regions between tags (for example, the area between br and br in the markup some text<br><br>some text) as distinct text segments. The following program shows how to get rid of these empty regions:

    import Str;
    var P = GetURL("http://www.nowhere.com");
    every t in PCData(P) do
        var txt = Text(t);
        if Str_Trim(txt) != "" then PrintLn(txt) end
    end

Note that the PCData function inserts new unnamed tags just in front of and just after each text segment to keep track of their location. This means that for the markup above, the piece identifying the empty region consists of an unnamed begin tag just after the first br and an unnamed end tag just before the second br.

Sequence search. HTML generated on the fly by web servers often contains highly stylized markup patterns without hierarchical structure. The
170. tring indexing Elements in a list and string are numbered from 0 to Size Set list and object membership test
a. Operator is written in the form x[i].
b. Object membership test is based on object field names.

Statements

Statement Sequences. Statement sequences are separated by semicolons. The value of a statement sequence is the value of the last expression in the sequence. An optional trailing semicolon in a statement sequence is ignored by WebL.

If Statement. If statements specify the conditional execution of guarded commands. The boolean expression preceding an expression is called its guard. The guards are evaluated in sequence of occurrence until one evaluates to true, whereafter its associated expression is evaluated. If no guard is satisfied, the statement sequence following the symbol else is executed, if there is one. The value of an if statement is the value of the associated expression whose guard evaluated to true.

Syntax: IfStat if SS then SS ElseStat end ElseStat else SS elsif SS then SS ElseStat

Example: if ch gt a and ch lt z then ReadIdent elsif ch gt 0 and ch lt 9 then ReadNumber elsif ch or ch then ReadString else ReadSpecial end

While Statement. While statements specify the re
171. tring s WebL A Programming Language for the Web 41 The Language Core TABLE 13 Core Built in Functions Function Exec cmd string int Exit errorcode int DeleteField o object fld nil First 1 list any GCO nil Native classname string fun Print x y Z nil PrintLn x y z nil ReadLn string Rest 1 list list Retry x any Select 1 list from int to int list Select s string from int to int string Description Executes a shell command and returns the exit code returned by the command The command string may contain references to variables in lexical scope by writing var or var The value of these refer enced variables are expanded before the command is executed Terminates the program with an error code Removes the field fld from the object o Returns the first element in a list Explicitly invokes the Java garbage collector Loads a WebL function imple mented in Java Prints arguments to standard output Prints arguments to standard output followed by end of line Reads a line from standard input throws away the end of line charac ter Returns a list of all list elements except the first element Executes expression x and returns its value In case x throws an exception x is re executed as many times as needed until it is successful Extracts a sublist of starting at ele ment number from and ending at
172. ts a web crawler that prints the URL and title of all pages visited. The crawl is restricted to pages on the pa.dec.com domain that have a URL that ends in .htm or .html. The queue is initially seeded with two pages, from where the crawl will progress in a breadth-first fashion. Note that the program finally goes to sleep while the crawl is in progress. import Str WebCrawler var MyCrawler Clone WebCrawler Crawler En Visit meth s page var title Text Elem page title 0 This page has no title PrintLn page URL title title end ShouldVisit meth s url Str StartsWith url http www w pal dec com and Str_EndsWith url html end a ie MyCrawler Start 2 Only two threads are used MyCrawler Enqueue http www src pa dec com MyCrawler Enqueue http www wrl pa dec com Stall

Module WebServer. The WebServer module exports an interface to a simple multi-threaded web server. The Start function allows the programmer to start the web server on a specific port on the host machine where WebL is running. After starting the web server, HTML and other files will be served from the fileroot directory indicated when the server was started. The programmer may publish WebL functions to be executed when specific URL p
173. type char or character Char acters are enclosed with single quotes The internal coding of characters is Unicode Character expressions may contain escape sequences that denote special characters Table 6 lists the escape sequences used in WebL TABLE 5 Character Expressions Expression Value gt i r a a n An a b ab Value Type char char string TABLE 6 Escape Sequences Escape b t n Z Xxx UXXXX Description Backspace Horizontal tab Newline Form feed Carriage return Double quote Single quote Backslash Character of octal value XXX Character of hexadecimal value Xxxx 24 WebL A Programming Language for the Web Dynamic Types Type String A lexical string constant enclosed in double quotes evaluates to a value of type string A string consists of a sequence of characters The number of characters in a string is called its size The empty string denoted by contains no characters and has a size of zero There is no limit to the string size Strings may be wrap across several lines in WebL programs Strings may also contain the escape sequences defined in Table 6 Note that escape sequences are not expanded in strings that are written with the back quote character TABLE 7 String Expressions Expression Value Value Type abc abc string ab nc ab nc string abc d abcd string Size abc 3 int Type Int A lexical integer constant
174. TABLE 20. Piece and Piece Set Operators

    !directlybefore(p: piece, q: piece): pieceset
    !directlybefore(p: pieceset, q: piece): pieceset
    !directlybefore(p: piece, q: pieceset): pieceset
    !directlybefore(p: pieceset, q: pieceset): pieceset
        All the elements of p that are not directly before any element of q.

    overlap(p: piece, q: piece): pieceset
    overlap(p: pieceset, q: piece): pieceset
    overlap(p: piece, q: pieceset): pieceset
    overlap(p: pieceset, q: pieceset): pieceset
        All the elements of p that overlap any element in q.

    !overlap(p: piece, q: piece): pieceset
    !overlap(p: pieceset, q: piece): pieceset
    !overlap(p: piece, q: pieceset): pieceset
    !overlap(p: pieceset, q: pieceset): pieceset
        All the elements of p that do not overlap any element in q.

    without(p: piece, q: piece): pieceset
    without(p: pieceset, q: piece): pieceset
    without(p: piece, q: pieceset): pieceset
    without(p: pieceset, q: pieceset): pieceset
        All the elements of p where overlap with any element of q has been removed.

    intersect(p: piece, q: piece): pieceset
    intersect(p: pieceset, q: piece): pieceset
    intersect(p: piece, q: pieceset): pieceset
    intersect(p: pieceset, q: pieceset): pieceset
        All the elements of p that overlap an element in q, each of them repeatedly intersected with all overlapping elements in q.
175. uitively, this retrieves the innermost element of all nested elements. Given the page defined previously, we can calculate:

    // The lists that contain the first subsection, i.e. elements on lines 1-11 and 5-8
    Elem(X, "ul") contain Pat(X, "First Subsection")

    // The list that directly contains the first subsection, i.e. element in lines 5-8
    Elem(X, "ul") directlycontain Pat(X, "First Subsection")

    // Innermost list that contains the first subsection, i.e. element in lines 5-8
    var x = Elem(X, "ul") contain Pat(X, "First Subsection");
    x !contain x

IV. Regional Operators

The regional operators construct new pieces to identify parts of a page. Many other operators return only pieces that already existed before the operator was applied.

P without Q

The without operator returns the pieces of P where parts of Q that overlap with a piece in P are cut away. This might involve creating several new pieces from a piece of P and inserting new unnamed tags as necessary. Figure 6 gives an example where the word WebL is removed from a sentence. Note how unnamed tags are inserted to the left and right of piece A.

Examples:

    // Cut up the second table into its constituent lines
    Elem(X, "table")[1] without Pat(X, "\n")

    // Remove all the bold text from the first paragraph
    Elem(X, "p")[0] without
176. ument, repeatedly invoking the selection function to determine if that element should be included in the result piece set. The selection function must have a single formal argument and must return a boolean value that indicates whether its argument should be included in the result piece set or not. You are free to specify any selection criteria, as long as the result of the function is of type boolean.

TABLE 17. Piece Set Searching Functions

    Elem(p: page): pieceset
        Returns all the elements in a page.

    Elem(p: page, name: string): pieceset
        Returns all the elements in page p with a specific name.

    Elem(q: piece): pieceset
        Returns all the elements that are contained (nested) in piece q.

    Elem(q: piece, name: string): pieceset
        Returns all the elements with a specific name contained in piece q.

    Para(p: page, paraspec: string): pieceset
        Extracts the paragraphs in p according to the paragraph terminator specification paraspec.

    Para(p: piece, paraspec: string): pieceset
        Extracts the paragraphs in p according to the paragraph terminator specification paraspec.

    Pat(p: page, regexp: string): pieceset
    Pat(q: piece, regexp: string): pieceset
    PCData(p: page): pieceset
    PCData(p: piece): pieceset
    Seq(p: page, pattern: string): pieceset
    Seq(p: piece, pattern: string): pieceset
        Returns all the occurrences of a
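As a hedged illustration of the two Elem forms listed in Table 17, the following WebL sketch first searches a whole page and then searches inside a single piece. The URL is a placeholder and the variable names are assumptions made for this example. It relies only on functions that appear in the manual excerpts (GetURL, Elem, Text, PrintLn) and on the every loop.

```
// Sketch only: the URL below is a placeholder, not taken from the manual.
var P = GetURL("http://www.example.com/");

// Elem(p: page, name: string): all the UL elements in the page.
var lists = Elem(P, "ul");

// Elem(q: piece, name: string): the LI elements nested in the first list.
// Indexing with [0] assumes the page contains at least one list.
var items = Elem(lists[0], "li");

every item in items do
  PrintLn(Text(item))
end
```

Searching inside a piece rather than the whole page keeps the result limited to the elements nested in that piece.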
177. ument Type Definition, 54
double, 129
DTD, 54
Dynamic Data Exchange, 116

E
EBNF, 162
Elem, 70, 79
Elements, 56
    Empty elements, 56
    Searching for, 70
Encode, 115, 139
EndsWith, 136
EndTag, 82
Equality, 31
EqualsIgnoreCase, 136
Error, 41
ErrorLn, 41
Escape Sequences, 165
Eval, 41, 122
Exceptions
    Trap function, 44
    Try statement, 36
Exclusion, 88
Exec, 42
Exists, 121
Exit, 42
ExpandCharEntities, 82
expandentities, 65
Expressions, 17

F
Farm, 119
field definition, 32
Fields, 29
Files, 121
First, 42
Flatten, 99, 104
float, 129
Floating point, 26
Fun, 27
Functions, 173
    Built-ins, 40
Funp, 41

G
Garbage collection, 42
GC, 42
Get, 128
GetCurrentPage, 116
GetURL, 47, 59, 61, 64, 122
    Overrides, 63
Glue, 139
GlueQuery, 138, 139
GotoURL, 116

H
HeadURL, 64
Highlight proxy, 156
HTML, 54
    Forms, 52
    Handling of badly formatted HTML, 58
    Parsing of, 54
HTTP, 52
    Cookies, 54
    GET Request, 52
    Headers, 52, 53, 61
    MIME types, 53, 63
    Parameters, 52, 53, 60
    POST Request, 52
    Request, 52
    Response, 52
    Set-cookie header, 117
    Status, 52

I
Idle, 120
Import, 46
Indexing, 34
IndexOf, 136
InsertAfter, 108, 111, 112
InsertBefore, 108, 112
inside, 86, 93, 101
Int, 25
int, 129
intersect, 97, 103
Intersection, 88
Intp, 41
IsDir, 123
IsFile, 123
ISO 8859-1, 162

J
J-array, 127
Java, 124
java.lang.String, 129
Job queues, 119
J-objects, 124

K
Kincaid r
178. ve only a single argument to specify.

Two special operators are not contained in the operator table, since they have special constraints on when they can be used and hence cannot be specified in the syntax just introduced. The two operators are assignment and field definition. The left-hand side of an assignment must denote a variable or an object and field name combination. The left-hand side of a field definition must denote an object and field name combination (e.g. obj.field or obj["field"]). The value of an assignment or field definition is always the right-hand side of the operator. These two operators also differ from the remaining operators in that they have side effects, namely setting the value of a variable or of an object field to the right-hand side of the operator.

TABLE 12. WebL Core Operators

    Operator
    +(x: int, y: int): int
    +(x: int, y: real): real
    +(x: real, y: int): real
    +(x: real, y: real): real
    +(x: char, y: string): string
    +(x: char, y: char): string
    +(x: string, y: string): string
    +(x: string, y: char): string
    +(x: set, y: set): set
    +(x: list, y: list): list
    -(x: int, y: int): int
    -(x: int, y: real): real
    -(x: real, y: int): real
    -(x: real, y: real): real
    -(x: int): int
    -(x: real): real
    -(x: set, y: set): set
    /(x: int, y: int): real
    /(x: int, y: real): real
    /(x: real, y: int): real
179. vigate to this url.

    ShowPage(s: string): nil
        Displays the markup contained in s in a web browser (uses the default locale for externalizing the string).

    ShowPage(s: string, charset: string): nil
        As above, but ensuring that the string is externalized in the indicated character set. Values for charset might be "iso-8859-1", "Unicode", "UTF8", etc.

Module Cookies

Module Cookies allows the programmer to perform some basic operations on the HTTP cookie database. The cookie database contains client-side state that web servers have requested WebL to store for them. By default, the cookie database starts out empty with each WebL run and fills up as cookies are set. At the end of the run the cookie database is discarded. This is the default WebL behavior, and no programmer action is required.

The contents of the cookie database can be overridden by specifying a non-nil Cookie header field as part of the GetURL and PostURL functions. Furthermore, the Save and Load functions of the Cookies module can be used to save the database to a file and later load it again. These functions are required if the cookie database is to outlive a single WebL session. The external file format of the cookie database is one line per cookie, where each cookie is stored in the same format as received in the Set-cookie HTTP header. More details about the HTTP Set-cookie header can be
180. w copying piece B with an overlapping piece C after the begin tag of A results in the copies B and C in the page. Note that the source page (in the top right corner of the figure) remains unchanged.

In case a piece set instead of a piece is passed to these two functions, each of the elements of the argument will be copied in sequence to the destination insertion point. Note that when the piece set contains nested elements, the nested elements will be inserted twice or more times, possibly one after another, in the destination page.

Example:

    // Insert an image at the beginning of each h1 tag
    var p = NewPiece("<img src=\"a.gif\">", "text/html");
    every x in Elem(P, "h1") do
      InsertAfter(BeginTag(x), p)
    end

FIGURE 11. Copying Pieces during Inserts: InsertAfter(BeginTag(A), B)

Deleting Pieces

The Delete function deletes a piece from a page. In case the function is passed a piece set argument, each of the elements of the argument piece set is deleted. One of the problems we face with deletion is that some program variables might still refer to pieces that were previously deleted. Accordingly, accessing these deleted pieces through these variables might cause some problems. To simplify the problem, we define the following semantics for deletion of a piece q:

    • All the text segments contained in piece q are physically
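The piece-set form of Delete described above can be sketched in WebL as follows. This is a minimal illustration under stated assumptions, not an example from the manual: the URL is a placeholder, and the choice of img elements is arbitrary.

```
// Sketch only: the URL below is a placeholder, not taken from the manual.
var P = GetURL("http://www.example.com/");

// Elem returns a piece set; passing it to Delete deletes each of its elements.
Delete(Elem(P, "img"))
```

Any variable that still refers to one of the deleted pieces is subject to the deletion semantics described above.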
181. TABLE 12. WebL Core Operators

    Operator                         Description
    /(x: real, y: real): real        Numeric division
    div(x: int, y: int): int         Whole division
    mod(x: int, y: int): int         x mod y
    *(x: int, y: int): int           Numeric multiplication
    *(x: int, y: real): real
    *(x: real, y: int): real
    *(x: real, y: real): real
    *(x: set, y: set): set           Set intersection
    C(x: int, y: int): bool          Numerical comparison, where C is one of <, <=, > or >=
    C(x: int, y: real): bool
    C(x: real, y: int): bool
    C(x: real, y: real): bool
    C(x: string, y: string): bool    Lexical comparison, where C is one of <, <=, > or >=
    C(x: char, y: char): bool

Descriptions of the rows that precede this excerpt, in table order: Numeric addition; String and character concatenation; Set union; List concatenation; Numeric subtraction; Numeric negation; Set exclusion.

TABLE 12. WebL Core Operators

    Operator                         Description
    ==(x, y): bool                   Value equality test. See "Value Equality" on page 31.
    !=(x, y): bool                   Value inequality test. See "Value Equality" on page 31.
    or(x: bool, y: bool): bool       Logical operators. Short-circuit evaluation.
    and(x: bool, y: bool): bool
    !(x: bool): bool                 Logical negation
    .(x: object, y): any             Object field access
    [](x: list, i: int): any         List, object and s
    [](x: object, i): any
    [](x: string, i: int): char
    member(x, s: set): bool
    member(x, l: list): bool
    member(x, o: object): bool
