        DENODO ITPILOT 4.0 GENERATION ENVIRONMENT MANUAL
         Contents
1. FIGURES

Figure 1 Initial ITPilot Installation Screen
Figure 2 Specification Generation tool. Areas
Figure 4 Denodo WebMail home page
Figure 8 New project created
Figure 9 Creation of a new process
Figure 11 Work area for Process Generation
Figure 12 Selection of the Initialization component
Figure 14 "Wizard" tab in the component configuration area with the Initialization register already created
2. Figure 50 Results returned by the wrapper
Figure 51 Wrapper deployment in an ITPilot execution server
Figure 52 Wrapper storage in a local file system
Figure 53 Use of the Next Interval Iterator component to browse more pages of results
Figure 54 Webmail result page
Figure 55 Assigning examples in the new structure of the Extractor component
Figure 56 Tab for Assigning Tag Attribute Values
Figure 57 Use of Record Sequence component
Figure 59 Record Sequence component Command Editor
Figure 60 Advanced Tab for Back Sequence definition
3. Figure 57 Use of Record Sequence component

Using the Wizard tab, configure the access sequence to the details pages by means of the record sequence editor. This editor is divided into two tabs:

1. Commands. This tab configures the command or commands required to browse from the source page or record to the required details page.
2. Sequences. This tab handles specific configuration tasks such as the back sequence or what ITPilot calls "global form management sequences", which will be explained later on.

In the area at the top of the window, the Commands tab displays the DEXTL specification of a record obtained from the main page using the data provided by the Extractor specification (to do so, the Record Sequence must be directly or indirectly connected to the Iterator providing each of the records for that extractor). Although you should read [DEXTL] for a full understanding of this language, it is intuitive enough for the meaning of the fol
4. Figure 8 New project created

Once the project has been created, click on it to create a new process. This process will enable you to generate a wrapper at the end of this example. Click on the icon to give the process a name. In this case, the name will be WEBMAIL. Figure 9 shows the result in the browsing area.

Figure 9 Creation of a new process

3.5 COMPONENTS IN ITPILOT

Once the project and the process have been created, you can start to develop them. To do so, click on the name of the recently created process, WEBMAIL, to load it in the tool. After a short while, a dialog box like the one in Figure 10 will be displayed, indicating that the process has been successfully created. On accepting this dialog box, the tool displays the workspace, where you can start to assign components (Figure 11).
5. Figure 36 Generating a DEXTL Program

To check that the system properly recognizes all the DETAIL examples entered into the Result Example Definition tab, click the test button once the Generation button has been pressed. Figure 37 shows the correct result of this test. Note that the total number of elements obtained matches the number of messages on the first page, and that there are no wrong elements. The window also shows the number of recognized examples, which matches the number of generated examples. The numbers in parentheses indicate which of the generated examples have been found: in this case, all three (0, 1 and 2).

Figure 37 Specification Execution test

If the retrieved results are not the desired ones, there are different options to evaluate:

• If fewer results than expected are obtained, new examples can b
6. Figure 55 Assigning examples in the new structure of the Extractor component

The DEXTL program is generated in the same way and is tested as in section 3.8.

3.16.2.1 Assigning Tag Attribute Values

Until now, the specification generator tool has allowed us to extract data that could be obtained directly by viewing the Web page in the browser. However, on some occasions, we may wish to extract values from HTML tag attributes. For example, you may want to include the value of the "href" attribute of a link in a simple field (remember that if this value is a relative link, the corresponding level will have to store the base URL from which it sets out).

In this case, it may be wise to save the URL that accesses the message detail data. To do so, use the Tags tab in the Extractor component wizard (see Figure 56).

Figure 56 Tab for Assigning Tag Attribute Values

At this stage, the values of the required tag attributes are assigned to simple fields of the extracted elements. Users carry out the following steps:

1. Select the pattern in which the tag attribute is to be found. DEXTL allows for different patterns to be used within the same spe
7. Figure 64 Adding a data Iterator component from the detail pages

Nothing more will be said about the Iterator component, as its configuration is the same as that indicated in section 3.12.1. As for the Record Constructor component, section 3.12.2 explains how to use it as the output of an iterator, with a single Extractor component as the basis for generating the output record. In this case, the Record Constructor will be used to create an output record based on the data obtained from the main page and from the details page of each message.

Two input values are created in the "Inputs" tab of the Record Constructor component configuration area: the output value of the first iterator (which returned each of the WEBMAIL type records from the information extractor of the page of results) and the output value of the second iterator, to which it is directly connected, which returns each of the DETAILSTRUCT type rec
8. Allows different properties to be configured for money type values:

currencyDecimalPosition: Number of decimals used by the currency in the location. For example, for the euro this value is 2.
currencyDecimalSeparator: Character used as a decimal separator in the currency. For example, the decimal separator for the euro is the comma.
currencyGroupSeparator: Group separator in the currency used for the location. For example, for the euro the group separator is the full stop.
currency: Name of the currency. Example: EURO, POUND, FRANC.
moneyPattern: Specifies the currency format. In currency formats the comma is always used as the separator for thousands and the full stop as the separator for decimals. The character "¤" represents the currency symbol and indicates where the character or characters that represent it should be positioned.

The patterns defined by the java.text.DecimalFormat class in the standard Java Development Kit API are used to analyze currencies (see the Javadoc documentation [JAVADOC] for more information).

• Configuration of time type data.

timePattern: Unit of time in which the values of this type are expressed in this location. The possible values are SECOND, MINUTE, HOUR, DAY, WEEK, MONTH and YEAR.

• Configuration of dates. Configuration of date type data.

datePattern: Indicates the format for dates. To specify the format for dates 
9. FUNCTIONS FOR URL PROCESSING
FUNCTIONS FOR PAGE HANDLING
ADD RECORD TO LIST
Description
Output Values
CONDITION
CREATE LIST
Description
EXECUTE JAVASCRIPT
Output Values
10. Checking Navigation Sequences in Systems with Cookie Based Session Authentication
THE TRANSPOSETABLE BUTTON
THE SELECTANCHOR BUTTON
CONFIGURING AND USING DOMAINS
PROPERTIES OF THE NAVIGATION BAR
Generating Sequences Using an Authenticated Proxy
Criteria for Selecting NSEQL Commands
Choosing the Browse Sequence Type
SELECTION OF PDF AND HTML CONVERTERS
ARITHMETIC FUNCTIONS
TEXT PROCESSING FUNCTIONS
LIST HANDLING FUNCTIONS
DATE PROCESSING FUNCTIONS
11. Figure 19 Home Page
Figure 21 Selection of the Search Domain
Figure 22 Search Domain Data toolbar
Figure 23 Drag & Drop operation on the Main Page
Figure 24 Sequence editor with loaded sequence
Figure 32 Result Examples tab
Figure 36 Generating a DEXTL Program
Figure 37 Specification Execution test
Figure 39 Use of the Iterator component
Figure 40 Use of the Record Constructor component
Figure 42 New record field editor
Figure 43 Creation of a derived attribute from the GETDAY function
Figure 44 Final result of the Output record
Figure 45 Use of the Output component
Figure 46 Complete process of the first part of the example
Figure 48 JavaScript code of the generated wrapper
12. once the system has accessed the next pages of results is shown in section 3.13.

3.16 ACCESS TO DETAILS PAGES

3.16.1 Introduction

Until now, we have developed a specification that obtains, in a structured manner, the list of messages that appear on a page of a Web e-mail application. However, we already know that this page shows only a subset of the data for each message. Elements such as the message body, the absolute date, carbon-copied recipients or attached data appear on the detail page of each message.

In this section we begin the modifications that need to be made to the example already used in order to make all these data available.

The structure now required is the following:

• MESSAGEDATE: date the e-mail was received.
• SENDER: who sent the e-mail.
• SUBJECT: message title.
• SIZE: size of the message.
• MESSAGE: content of the e-mail.

As can be seen, the elements SENDER, SUBJECT and SIZE are maintained. However, new fields are added, such as MESSAGE, which is obtained from the detail pages. Likewise, the MESSAGEDATE field is maintained, but it will also be obtained from the detail page.

We must, therefore, modify the process to add the components that allow browsing to each of the detail pages; we will also have to modify some of the existing ones.

3.16.2 Field Modification in t
13. 2 and is divided into three main areas:

1. Menu. This saves wrappers, launches browsers to help generate wrappers and manages the display of the different task bars.
2. Browsing Area. This is where the projects created are displayed, along with the wrappers for each project. It also displays the list of components that can be used, including those provided by default by the tool and those created by the user (see section 3.20 for further information). This area also allows wrappers to be exported to the ITPilot execution server via the "Data Export Tool" tab. The "Tools" tab provides advanced wrapper generation functions, as will be explained in section 3.17.
3. Workspace. This is the main area of the tool, where the wrapper is created graphically by using, configuring and relating the graphic components as a whole.

Figure 2 Specification Generation tool. Areas

The browsing sequence generation tool can be seen in Figure 3. The function of each of the buttons and options will be explained in detail in section 4, although the following main areas can be found:

Configuration: management and configuration of some of the ITPilot sequence generation commands.

Sequence Generation: These buttons all
14. regenerated. However, bear in mind that changes made to the JavaScript code will have no effect on the component flow. If you decide to regenerate the JavaScript associated with the flow, any changes made to the code will be lost. Read [JSDENODO] for further information on the code generated by ITPilot.

Figure 48 JavaScript code of the generated wrapper

3.14.2 Wrapper Execution

With the wrapper generated, it can now be tested. To 
15. Figure 51 Wrapper deployment in an ITPilot execution server

Now enter the server access data: its URL (//localhost:9999/itpilot by default), login and password. The server data may correspond to that of a Denodo Virtual DataPort (VDP) server, so that the wrapper can be used as another source in any data integration process. To do so, click on the "Create Base Relation" option and complete the field called "Base View Name" with the name of the base view that will reference the recently created wrapper in Virtual DataPort. For further information, consult the Denodo Virtual DataPort documentation [VDP]. ITPilot also lets you configure whether or not the wrapper should be maintained. Click on OK and, provided that the execution server is started, the wrapper will be deployed. For further information on the ITPilot execution server, read [USE].

3.14.3.2 VQL generation for subsequent loading

From the main window of the ITPilot wrapper generation environment, click on "Data Export Tool" in the browsing area. This opens two more elements in this same area: "VQL Generator" and "Server Deploy". Cli
16. Figure 54 Webmail result page

Now we can go back to the Next Results Sequence editor and load the new sequence by pressing the "Load from file" button.

The rest of the editor has the following configuration options:

• Sequence Type: As with the Sequence component, this determines the type of access to be made, whether via a browser, an http 
17. Figure 10 Success Load Process Dialog

Figure 11 Work area for Process Generation

In the browsing area you can see how the "Process Builder" > "Components" section has been automatically opened, with a list of general components that are distributed by ITPilot by default and an initially empty "Custom" list, where the user-created components will be listed.

The more common general components will be explained in this manual. There is also a reference guide in section 6 of this manual. The reason why custom components are recommended, and a description of how they are created, are given in section 3.20.

The workspace is divided into three parts, as indicated below:

1. Components section. The general components are also displayed graphically at the top of the workspace. In both cases, as indicated below, these components are placed in the workspace by drag & drop.
2. Process generation section. This is the workspace itself, where users can drag co
18. Figure 16, click with the left mouse button on the initialization component connector and, without releasing the button, drag it to the sequence component input port (round in shape). Release the mouse button to see the connection between both components.

Figure 16 Relating components

3.7.2 Component Configuration

The sequence component can now be configured. To do so, once again select the component so that its configuration area is loaded. In this case, the "Inputs" tab is enabled, as it can receive inputs from other parameters, more specifically:

• Input Values: input values to be used in Web browsing. In this case, it is necessary to select the register created in section 3.6. To do so, click on the icon of the "Input Values" element and select the register "MAILPARAMS".

2 If you cannot find it, check that it is visible. To do so, go to the View > Toolbars menu and check that the "Components" check box is ticked. You can also press the key combination Ctrl+Shift+1.

• Input Page: The Sequence component also allows an input page to be indicated, from which browsing is to be done. This page must come from another component with a Web page as its output value, such as another Sequence component or a Next Interval Iterator component. This is not necessary in this
19. Figure 63 Use of the Extractor component to obtain information of the detail pages

In our example, the Extractor component responsible for extracting information from the detail page will contain at least one element, MESSAGE, that may contain <BR> type HTML tags as well as links. Therefore, the "STANDARD" tag set of the StandardHTMLexer scanner, used by default in the Extractor, is of no use: it would find patterns within the message, so we would not be able to extract the complete message into a single attribute. The detail level must therefore have a different tag set. Section 3.17 shows how the graphic tool can generate a new tag set associated with a specific scanner.

Read [DEXTL] for a better understanding of the scanner and tag set concepts.

3.16.7 Generating the Access Specification to the Details Page

Once the new scanner and the tag set have been generated and the new structure is established, go to the Examples
20. View > Toolbars > SequenceGenerator

2.1.4.3 Checking that the Specification Generation tool has been installed correctly

From the "bin" directory in the path where the tool has been installed, execute the "startITPAdminTool" file or, optionally, double-click on the icon that you will find on your desktop. A graphical tool such as the one shown in Figure 2 should appear.

2.1.4.4 De-installing the software

First of all, close all the MSIE instances. Otherwise, the de-installation process will not be able to delete the folder in which the software was installed and it will display a message to that effect.

In the Uninstaller folder under <INSTALLATION_PATH> you will find the ITPilot de-installation program. Other options are to use the "Uninstall ITPilot" icon created on the desktop during installation, or the entry added to the ITPilot folder in the "Start" menu.

2.1.5 Introduction to the tools

If you have carried out the verifications described in section 2.1.4.2, you will already have seen the graphic appearance of the tools comprising the ITPilot generation environment. This section describes the basic characteristics of both, although they will be explained in detail in their respective sections (sections 3 and 4 of this manual).

The main screen of the specifications generation tool is shown in Figure
21. a browsing sequence in NSEQL language (see [NSEQL]).

6.20.2 Input Parameters

Sequence accepts the following as input arguments:

• Zero or more records. These elements are used to assign variables to the browsing sequence.
• Optionally, a page from which browsing is made.

6.20.3 Output Values

This returns an element that represents the page resulting from the browsing.

6.20.4 Details of the component

See section 3.7 for a more in-depth explanation of the component.

6.21 STOREFILE

6.21.1 Description

This component stores the contents entered as the input parameter in a file.

6.21.2 Input Parameters

Store File accepts the following as input arguments:

• Value (string or binary type) to be stored.
• String type value with the name of the file where the contents are to be stored.

6.21.3 Output Values

None.

6.21.4 Example

Following the example given in this guide, the group of results is to be stored in a text file. To do so, the Store File component is used. Figure 113 shows the basic structure of the steps to take for the process. After the Extractor component has obtained the list of results from a specific page, it iterates on each one. During each iteration, a Record Constructor component constructs the results to be sent asynchronously as the result of running the wrapper program. An expression is then created that 
22. allows for a disc file to be loaded using the LOCAL path in the wizard's "Connection Type".

• If an input page is also used, the URL value is used by the component as a resource to obtain locally from this page. For example, the URL could have the value "image.jpg" assigned and, therefore, the component would try to access the image.jpg resource on the input page.
• If only one value is assigned for the "Input Page" field, the Fetch component will obtain the contents of the resource to which this element points.

6.8.3 Output Values

String or binary type value.

6.9 FILTER

6.9.1 Description

This component carries out a filtering operation on a list of records, returning those that meet a given condition.

6.9.2 Input Parameters

The component expects a list of records as input and, optionally, one or more records and one or more values (the records and the values can be used to build the filter condition).

6.9.3 Output Values

The Filter component returns the filtered list of records (an empty list if there are none).

6.9.4 Example

Figure 98 shows part of an ITPilot process that filters the results obtained by an Extractor component before iterating on them.

Figure 98 Use of the Filter component

The Extractor component has extracted the s
23.  be added, although in this case it is not necessary, as the link simply has to be followed. On other occasions, it may be necessary to carry out an additional action (e.g. selecting a check box before following the link).

When the ANCHOR is selected at the bottom and its button is clicked, a new window opens, as indicated in Figure 59. Here it is possible to modify the NSEQL program generated by ITPilot by default, in the event of alternative behavior being required. To do so, please read [NSEQL]. This will not be necessary in this example, as the Web application will access the details page by merely clicking on this link. It is also possible to configure the number of retries that this sequence can run in the case of an access error on this page.

Figure 59 Record Sequence component Command Editor (automatic configuration showing the generated ClickOnAnchorByHref command, and a manual configuration option)

The Sequences tab of the record sequence editor allows for advanced configurations on the browsing sequence defined in the previous tab:

• Sequence type: As explained in section 3.7.2, different access protocols to the HTML resources to be browsed can be defined. It is important to note that the access types for one sequence or another or in the use of a Record Se
24.  be seen in Figure 84, the type of transformation required (Word or PDF) must be selected before clicking this button.

Figure 84 Selection of the transformation type in the Select Anchor command

4.7 CONFIGURING AND USING DOMAINS

Sometimes it is advisable to parameterize the NSEQL navigation sequences according to the values received when the ITPilot user applications are executed. For example, if a sequence is being constructed for a wrapper generated using ITPilot, the sequence can include variables that tell the system how the sequence parameters relate to the attributes received as input in the wrapper queries (see section 3.6).

To handle these situations visually, the Navigation Sequences Generator incorporates the concept of a domain. In this context, a domain is a list of logically grouped parameters, together with a list of example values for those parameters. The following sections deal with the definition of domains and their use within the generator.

4.7.1 Creating Domains

Normally, domains are created directly using the Denodo ITPilot generation environment (see section 3.7).

If using the Navigation Sequences Generator without the rest of the ITPilot generation environment, then the
25.  Figure 108 Configuration tab for the Form Iterator component

3. This completes the component. It can be tested independently using the "Test" button (for which a browser must be set to the form page) and using the debugging editor, as explained in section 3.14.2.

6.11 ITERATOR

6.11.1 Description

This component iterates on a list of records, one by one.

6.11.2 Input Parameters

The component expects the list of records on which to iterate as input.

6.11.3 Output Values

For each iteration, the component returns the corresponding record from the input list, in the order in which the records appear in the list.

6.11.4 Details of the component

See section 3.12 for a more in-depth explanation of the component.

6.12 JDBCEXTRACTOR

6.12.1 Description

This component sends a query to any source available through the JDBC protocol, returning a list of records that contains the retrieved results.

6.12.2 Input Parameters

The JDBCExtractor component accepts zero or more records and zero or more values as input arguments. These elements are used to assign variables to the component configuration parameters.

6.12.3 Output Values

A rec
26.  converted into a date type for comparison. To do so, use the TODATE function (by dragging and dropping it from the functions list in the "Functions" area to the left of the editor). The TODATE function, as explained in section 5 (Appendix A), takes two parameters. The first determines the date format, and the second is the character string representing the specific date. In this case, the date format is MM/dd/yyyy (two characters for the month, one slash, two characters for the day, one slash, and four characters for the year) and, therefore, a string-type constant must be created and assigned the value "MM/dd/yyyy". Then, another string-type constant is created, to which the comparison date is assigned: 02/01/2007. Figure 99 shows the status of the process to date.

Figure 99 Creation of string type constants (Filter 1 condition editor with the two string constants)

c. Now the functions that will turn the string-type values into dates must be created. Therefore, drag and drop the TODATE function to the left-hand panel of the "Values" area so that it can then be assigned the constants "MM/dd/yyyy" and "02/01/2007" as parameters, in this order. These actions create the right-hand operand of the filter condition. See Figure 100.
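The reason for converting both operands with TODATE before comparing them can be sketched outside ITPilot. The following Python example is only an analogy under stated assumptions: `to_date` and `filter_records` are hypothetical helpers, `%m/%d/%Y` is Python's counterpart of the MM/dd/yyyy pattern, and the "on or after" comparison is assumed, since the operator used in the manual's condition is not visible in this extract. The sample records reuse values from the inbox listing of the example.

```python
from datetime import datetime

# Python's strptime directive string corresponding to the pattern MM/dd/yyyy.
DATE_FORMAT = "%m/%d/%Y"


def to_date(text):
    """Hypothetical stand-in for TODATE: parse a MM/dd/yyyy string."""
    return datetime.strptime(text, DATE_FORMAT)


def filter_records(records, threshold_text):
    """Keep the records whose DATE is on or after the threshold date."""
    threshold = to_date(threshold_text)
    return [r for r in records if to_date(r["DATE"]) >= threshold]


records = [
    {"SUBJECT": "Data integration approach", "DATE": "01/31/2007"},
    {"SUBJECT": "VDP General Architecture", "DATE": "02/01/2007"},
]
recent = filter_records(records, "02/01/2007")
```

Comparing the raw strings would not be reliable: as text, "12/31/2006" sorts after "02/01/2007", although it is the earlier date. Parsing both sides first avoids that.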
27.  Figure 33 Assigning a Value to an Element (WEBMAIL item with SENDER "John Smith", SUBJECT "Data integration approach", MESSAGEDATE "01/31/2007" and SIZE "1")

Note that not every browser selection can be assigned as a value, since what can be assigned is determined by the tagset chosen in the first tab (see section 3.4).

Values can also be removed with the "Unset Value" option on the same contextual menu. Furthermore, entire examples can be removed with the "Delete" option on the contextual menu of the root element.

Just as new examples can be added, occurrences of the hierarchical levels can also be added by placing the cursor on the node that represents the level and selecting "Add Item" in its contextual menu.

New examples are always added in the same way, but it is generally advisable to take them from different queries to the Web source (e.g. in electronic bookshops, search by different subjects), always checking that the generated specification, as will be seen later, correctly extracts all results from each of them. If it does not, the system needs new examples that represent the query results it cannot extract properly (giving more examples of elements that it can already obtain is not of much help).

In this example, and after providing the first email as example, p
28.  example.

The Wizard tab enables you to access the Sequence Editor by clicking on the "Open Sequence Editor" button. This is where the expected browsing sequence must be loaded. A browsing sequence is a set of NSEQL commands (see [NSEQL] for further information) that describes different events on a Web browser (ITPilot 4.0 uses Microsoft Internet Explorer 6.x in the generation environment). The browsing sequence can be created in two different ways:

• Using the browsing sequence generation tool included in ITPilot. This tool (described in detail in section 4 of this manual), integrated as a task bar in the Internet Explorer browser, allows the NSEQL program to be recorded from user operations in the browser. This is the recommended way, as it is faster and more effective.
• Entering the sequence by hand in NSEQL. NSEQL is a relatively simple language to use and advanced users may prefer this option. There are also Web pages that require advanced commands that are not provided graphically by the sequence generation tool.

In this example, the sequence generation tool is used. To do so, open an Internet Explorer browser, either via the Browser > New Browser option in the main menu or by pressing the key combination Ctrl+B. An Internet Explorer browser window will appear with the browsing sequence recording bar (see Figure 17). If the bar does not appear, check that the menu option "View > Tool Bars > Sequence Generator" is marked. Although
29.  Figure 1 Initial ITPilot Installation Screen

After passing this screen and accepting the license, select the Denodo Platform modules, and the components within each module, that are to be installed at that time. More specifically, the following components must be selected from within the ITPilot module to install the generation tool: "Navigation Sequence Generator", "Wrapper Specification Generator" and, optionally, "Sequence Executor ActiveX Control" (where browsing sequences are to be run in client mode; read the ITPilot User Manual [USE] for further information). Also select "Wrapper Server" for the wrapper server to be on the same machine. Consult the ITPilot User Manual [USE] for further information on the use of each of these components. The OpenOffice and Adobe paths can then be selected.

2.1.4 Post-Installation Tasks

2.1.4.1 Installing the Denodo ITPilot User License

Place the license file received (denodo.lic) in the "conf" directory of the tool distribution. Without this file the Generator tool will not start properly.

2.1.4.2 Checking that the Navigation Sequence Generation tool has been installed correctly

To check that the software has been installed correctly, follow the steps below:

1. Start MSIE.
2. The navigation sequences generator taskbar should be visible on the browser. Where it does not appear, activate it by selecting
30.  page of the sequence. The red light on the semaphore lights up until loading is complete.

Once the semaphore changes to green, the navigation sequence can be generated. To do so, use the browser to carry out the required sequence, keeping the following two points in mind:

• At each page change during sequence generation, wait for the semaphore to turn green before continuing.
• When generating the sequence, all events should be executed using the mouse. Events generated using the keyboard are not registered by the Generator. For example, a form should always be sent by clicking the send button with the mouse and not by pressing the ENTER key.

In our TestMail example, the system could be used to generate a sequence that automatically accesses the content of a user's Inbox folder and sorts the messages by date. To do this, the user enters the user identification (e.g. demos) and password (e.g. JeMo 04), changes the language selection (if required) and presses the send button on the form. Once the semaphore turns green, the "Date" link is clicked to sort the messages by date.

5. At any point during the generation of the sequence, the Play button can be used to reproduce the portion of the sequence generated to date. The system launches a browser window in which automatic exec
31.  run server, as described in this manual.

The section of this manual dealing with the Specifications Generator includes an example of how to use the tool and is split into two parts. The first part provides a small, detailed example of extracting e-mail information from a Web application. The second part expands on this example to practice with functions such as the extraction of data dispersed across different detail pages, browsing through pages of "further results", and other advanced capabilities.

The following section describes the Generation Environment installation and configuration process.

2 INSTALLATION AND CONFIGURATION

2.1 INSTALLATION

2.1.1 Hardware Requirements

The minimum recommended hardware configuration for installing the Process Generation Environment is a Pentium IV 2.4 GHz PC with 1 GB of memory, or equivalent; however, the system can normally operate on lower-performance hardware. Initial installation requires approximately 60 MB of disk space. The space required to install Microsoft Internet Explorer (MSIE) is not considered here.

2.1.2 Software Requirements

The following software must be pre-installed:

• Microsoft Windows Operating System (2000 Server, 2000 Advanced Server, 2003, XP, Vista).
• Internet browser Microsoft Internet Explorer 6.x or 7.x, to be used in the Process Generation Environment.
• Java 2 SDK S
32.  NEXT INTERVAL ITERATOR
RECORD CONSTRUCTOR
6.17 RECORD SEQUENCE
    6.17.3 Output Values
6.18 REPEAT
6.19 SCRIPT
6.20 SEQUENCE
    6.20.4 Details of the component
6.21 STORE FILE
REFERENCES
33.  taskbar

A brief description of the function of each of the interface elements is given below.

• Open: Allows a navigation sequence saved in a disk file to be opened and executed.
• Save: This is only active in record mode (which is accessed by clicking on the Rec button). It allows the current sequence to be saved in a disk file.
• Rec: Starts the process of generating a sequence, requesting the initial URL from the user and changing the Generator to record mode, whereby the events generated by the user are recorded by the system and translated to NSEQL commands. Figure 81 shows how, when adding a URL, the user can also decide whether that URL contains an HTML page (the default) or whether it accesses a resource stored in Microsoft Word or Adobe PDF format. In these cases, ITPilot turns these formats into HTML, using format transformers included in the distribution or dependent on third-party tools (as described in section 2.1.2), so that the generation tool can be used.

Figure 81 URL Initial Selection (dialog with a URL field and a Converter list offering Word, Pdf (Acrobat Html) and Pdf (Acrobat Text))

• Stop: This is only active in record mode. It ends the sequence generation process, returning the browser to normal mode.
• Play: This is only active in record mode. It allows the sequence recorded up to the current moment to be reproduced in a new browser window.
• Select Frame: This is only active in record
34.  the execution server. For more information about the ITPilot execution server, please see [USE].

PART II

This second part shows how to make optimum use of the tool to obtain more complex wrappers.

3.15 EXTRACTING MULTIPAGINATED DATA

Most Web sources present results in several consecutive pages, all with the same format. Any electronic shop or Internet search engine can return hundreds or thousands of results in this manner, so in order to obtain an ample subgroup of data from a specific source, you have to browse through this sequence of "more results" pages.

To do so, the ITPilot specifications generation tool provides a browsing component known as "Next Interval Iterator", which iterates over different pages with a similar structure. Therefore, instead of browsing to a certain page using the Sequence component and running the Extractor component on it just once, you browse to this page using the Sequence component and start a loop in which, every time the Extractor component has extracted data from a page, the Next Interval Iterator component accesses the next page of results using a browsing sequence defined in this component. Below is a description of these steps in the generation tool.

Drag and drop the "Next Interval Iterator" component to the workspace and connect it to the previously created process in the way shown in Figure 53. The changes made are a
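The loop described in section 3.15 (extract from the current page, then move to the next page of results until none is left) can be sketched in a few lines of Python. This is a conceptual sketch, not ITPilot code: `extract_all_pages`, `extract`, `next_page` and the in-memory `PAGES` dictionary are hypothetical names that stand in for the Extractor component, the browsing sequence of the Next Interval Iterator, and the Web source.

```python
def extract_all_pages(first_page, extract, next_page):
    """Collect results from first_page and every following page.

    `extract(page)` returns the records found on a page; `next_page(page)`
    returns the following page, or None when there are no more results.
    """
    results = []
    page = first_page
    while page is not None:
        results.extend(extract(page))
        page = next_page(page)
    return results


# Simulated two-page source standing in for a real Web site.
PAGES = {
    1: {"rows": ["message 1", "message 2"], "next": 2},
    2: {"rows": ["message 3"], "next": None},
}


def extract(page_id):
    return list(PAGES[page_id]["rows"])


def next_page(page_id):
    return PAGES[page_id]["next"]


all_rows = extract_all_pages(1, extract, next_page)
```

Keeping the "how to reach the next page" logic in a separate function mirrors the way the browsing sequence is configured inside the iterator component rather than inside the extractor.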
35.  this section does not intend to give details as to how this tool works, the different steps will be shown in graphic form. For further information, read section 4, which explains the way in which the sequence generation tool works, and [NSEQL].

Figure 17 Denodo Toolbar

Start recording by clicking on the Rec button and entering the home address of the browsing sequence (Figure 18).

Figure 18 Initial URL

The "Converter" selection list indicates whether the resource to be accessed via the URL is a Word or a PDF document (it is left blank by default, which means that an HTML resource is expected). ITPilot is capable of processing documents of this type through automatic HTML conversion.

After clicking on OK, the browser will display the home page (Figure 19).

Figure 19 Home Page (Denodo Mail welcome page with Username and Password fields)

If you now enter the "UserName" and "Password" field values directly, the sequence generation tool will save them as they are. However, the sequence saved is to be as gener
36.  to learn of the information elements being processed at any given time in the process generation by any of its components. The Catalog Explorer enables you to see, in a single window, which records, pages, views or lists exist at that time in a specific process. Therefore, it is now possible to see whether the MAILPARAMS record has been created appropriately.

To start the Explorer, click on its button, which appears by default on the left-hand side of the component configuration area [1]. A window will appear like the one in Figure 15.

Figure 15 Catalog Explorer (showing the MAILPARAMS record)

The following information is available for each type of catalog element:

• Inputs: list of components with this element as input.
• Outputs: list of components with this element as output.
• Structure: Used in Record and List elements, this contains the description of the record structure (or of the underlying record in the case of a list).

3.7 WEB BROWSING AUTOMATION

3.7.1 Component Creation in the Workspace

Once the input parameters have been defined, the next step is to configure the process so that the wrapper knows how to browse to the first user e-mails. To do so, the system must know how to access the main page of the e-mail

[1] If you cannot find it, check that it is visible. To do so, go to the View > Toolbars menu and check that the "General" check box is ticked. You can also pres
37.  Output Values
Using the Derived Attribute Expressions Editor
EXTRACTOR
    Details of the component
6.9 FILTER
6.10 FORM ITERATOR
6.11 ITERATOR
    6.11.2 Input Parameters
6.12 JDBCEXTRACTOR
38.  3.16.5 Individual Test of the Record Sequence Component
    3.16.6 Extracting data from the details page
    3.16.7 Generating the Access Specification to the Details Page
    3.16.8 Iteration on the details page structures and creation of the output record
3.17 TAGSETS AND SCANNERS
3.18 GENERATING FROM/UNTIL PATTERNS
3.19 GENERATING THE DATA EXTRACTION SPECIFICATIONS MANUALLY
3.20 EXPORTING A FLOW AS A CUSTOM COMPONENT
3.21 CHECKING WRAPPER MAINTENANCE
4.1 INTRODUCTION
4.2 DESCRIPTION OF THE NAVIGATION SEQUENCES GENERATOR INTERFACE
4.3 STEPS FOR GENERATING A NAVIGATION SEQUENCE
39. ASCII characters are used to indicate the different units of time. Table 2 shows the meaning of each of the reserved characters used in a date format, their arrangement and an example of use. Example of a date format: d-MMM-yyyy H'h' m'm'. For more information, please read [DATEFORMAT] (classes java.text.DateFormat and/or java.text.SimpleDateFormat).

Character   Meaning                       Arrangement      Example
G           Specifies an Era
y           Year                          Number           1996
M           Month in year                 Text & Number    July & 07
d           Day in month                  Number           10
H           Time in day (0-23)            Number           0
w           Week of the year              Number           2
K           Time in a.m./p.m. (0-11)      Number
z           Time zone                                      Pacific Standard Time
'           Escape character for text
''          Single inverted comma         Literal          '

Table 2 Reserved Characters for Date Format

In Table 2, different values are used to indicate the arrangement of reserved characters. The specific output format depends on the number of times the different elements are repeated:

o Text: 4 or more characters to use the complete form; fewer than 4 characters to use the abbreviated form.

o Number: the number of characters is the minimum number of digits; zeros are added to the left of shorter numbers. The year is a special case: if the number of "y" is 2, the year is shortened to 2 digits.

o Text & Number: 3 or more characters to represent it as t
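To make the repetition rules for the reserved characters more concrete, the sketch below translates a small subset of these Java-style patterns into Python `strptime`/`strftime` directives. It is an illustration only: the function names are hypothetical, only the letters y, M, d, E, H, h, m, s and a are handled, and the mapping reflects Python's directives rather than anything ITPilot does internally. The second pattern used in the example is illustrative.

```python
import re
from datetime import datetime


def _directive(letter, count):
    """Map one run of a pattern letter to a Python directive."""
    if letter == "y":
        return "%y" if count == 2 else "%Y"   # two letters: 2-digit year
    if letter == "M":
        # 4+ letters: full month name, 3: abbreviated name, else number
        return "%B" if count >= 4 else "%b" if count == 3 else "%m"
    if letter == "E":
        return "%A" if count >= 4 else "%a"   # full or abbreviated weekday
    simple = {"d": "%d", "H": "%H", "h": "%I", "m": "%M", "s": "%S", "a": "%p"}
    if letter in simple:
        return simple[letter]
    raise ValueError(f"unsupported pattern letter: {letter!r}")


def java_pattern_to_strptime(pattern):
    """Translate a Java-style date pattern (subset) to a strptime format."""
    out = []
    # Tokens: quoted literal text, a run of one repeated letter, or any
    # other single character (copied through unchanged).
    for token in re.finditer(r"'([^']*)'|([A-Za-z])\2*|(.)", pattern, re.S):
        quoted, letter, other = token.group(1), token.group(2), token.group(3)
        if quoted is not None:
            # An empty pair of quotes stands for one literal quote character.
            out.append((quoted or "'").replace("%", "%%"))
        elif letter is not None:
            out.append(_directive(letter, len(token.group(0))))
        else:
            out.append("%%" if other == "%" else other)
    return "".join(out)


fmt = java_pattern_to_strptime("MM/dd/yyyy")
parsed = datetime.strptime("02/01/2007", fmt)
```

Month and weekday names produced by `%b`, `%B`, `%a` and `%A` depend on the active locale, so numeric patterns are the safer choice when parsing data from a Web source.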
40. Creation of the filtering condition

f. To complete the process, simply drag and drop the condition to the "Result Condition" area and press "Ok". See Figure 102.

Figure 102 Generating the results condition (Filter 1 condition editor; the result condition compares TODATE applied to MAILDEMO.DATE with TODATE applied to the constant 02/01/2007)

6.10 FORM ITERATOR

6.10.1 Description

This component allows for a run loop to be generated for a specific form, where different values for each of the fields included are used in each iteration.

6.10.2 Input Parameters

The Form Iterator requires the following elements as input parameters:

• The input page, where the form on which to iterate is located.
• Zero or more lists of records, zero or more values, zero or more records that can b
41.  Figure 62 Test window of the Record Sequence component (debug log in which each ClickOnAnchorByHref sequence ends with "Execution finished successfully")

3.16.6 Extracting data from the details page

Once ITPilot is able to access the details page for each message, it is time to obtain the data of interest from this page. To do so, use an Extractor component once again, as when data was to be obtained from the first page of results (see section 3.8). Use of the component in the process is shown in Figure 63. The component input is the output of the Record Sequence component, known as DETAILPAGE.
42.  Figure 74 Creating a custom component (process flow including MailMainPageIterator, WEBMAIL_RS, the mail detail page components, MailRecord, Add Record To List, MailOutput, End Iterator 2, EndMailMainPageIterator and Next Interval Iterator)

It can be seen how, at the start of the process, an empty list called "CCReturnList" is created, which will store the records returned by the customized component. A new "Add Record to List" component has been added after the Record Constructor that combines the information from the extractors on the main page and the detail pages, so that the response record for this component is stored in the new list following each iteration.

Lastly, bear in mind that the component will return no element if there is an internal error and, therefore, the error parameters of each component must be suitably configured. In this example, the CONNECTION_ERROR parameter is configured with the value ON_ERROR_IGNORE to avoid undesired errors when the number of "more results" pages is not as expected. ITPilot also offers the value ON_ERROR_RETRY_IGNORE, which retries the action a predetermined number of times and, if the error persists, ignores it.

With the process loaded in the ITPilot specifications generation tool, select the File > Save as custom component menu option (or use the key combination Ctrl+Alt+S). The steps are as follows:
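The difference between the error-handling values mentioned in this manual (raise, ignore, retry and then ignore) can be pictured with a small Python sketch. It is an illustration under assumptions, not the ITPilot implementation: `run_with_policy` is a hypothetical helper, the policy names are written with underscores, and "a predetermined number of times" is modelled as `max_retries` additional attempts after the first failure.

```python
def run_with_policy(action, policy, max_retries=3):
    """Run `action()` under one of three hypothetical error policies."""
    if policy == "ON_ERROR_RAISE":
        return action()                      # any error propagates to the caller
    if policy == "ON_ERROR_IGNORE":
        try:
            return action()
        except Exception:
            return None                      # error swallowed, no result
    if policy == "ON_ERROR_RETRY_IGNORE":
        for _ in range(1 + max_retries):     # first attempt plus the retries
            try:
                return action()
            except Exception:
                continue
        return None                          # error persisted, so it is ignored
    raise ValueError(f"unknown policy: {policy!r}")


attempts = {"count": 0}


def flaky_page_load():
    """Fails twice, then succeeds, like a temporarily unreachable page."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("page not reachable")
    return "page"


result = run_with_policy(flaky_page_load, "ON_ERROR_RETRY_IGNORE")
```

With the ignore-style policies a caller has to be prepared for a missing result, which matches the remark that a component returns no element when an internal error occurs.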
43.  Figure 77 Using a custom component in a new process

3.21 CHECKING WRAPPER MAINTENANCE

The ITPilot Specification Generation tool offers an option that informs the user whether or not the generated wrapper can be maintained by the maintenance server. This server, described in [USE], allows automatic regeneration of a wrapper in case the original source changes.

The "Wrapper Maintenance Check" option can be found on the left side of the component configuration area, as shown in Figure 78. When the button is pressed with an active wrapper, a dialog pops up indicating whether or not ITPilot can try to maintain the wrapper (see Figure 79).

Figure 78 Component Configuration Area

Figure 79 Wrapper Maintenance Check Dialog (reporting that the wrapper is maintainable)

4 NAVIGATIONAL SEQUENCE SPECIFICATION MANUAL

4.1 INTRODUCTION

Denodo ITPilot facilitates trouble-free generation of programs (also called "wrappers") that carry out automation and data extraction tasks on semi-structured web sources. These tasks normally require the automatic creation of complex navigation sequences through Web sites, involving authentication pr
44. L: uses the HTML converter of the Adobe Acrobat Professional software (this product must be installed).
b. Acrobat Text: uses the plain text converter of the Adobe Acrobat Professional software, from which ITPilot generates an HTML file (this product must be installed).
c. PdfBox HTML: uses the PDFBox library [PDFBOX] to generate the HTML file.

In order for the PDF to HTML conversion to work, the PDF converter server must be running. This server (PdfConversionsServer.exe) can be found under the bin directory of DENODO_HOME.

5 APPENDIX A: ITPILOT FUNCTIONS

This appendix describes the functions provided by ITPilot to create attributes derived from other existing ones.

Derived attribute functions are used to generate new attributes by applying a process to the values of the other attributes of the view, to constants and/or to the result of evaluating other functions.

A function is defined as an identifier and a list of arguments that can, in turn, be constants, fields or new functions. In some cases, the parameters received by a function and the value returned by it should all belong to the same data type. For example, the SUM function can add two or more integer values, two or more floating values or two or mor
45. Figure 5 First message screen (Inbox listing with the columns #, Date, From, Subject and Size)

The content of any e-mail can be accessed by clicking on the subject (Figure 6).
46. OUT ERROR: This error is produced if the Web source takes a long time to respond. The waiting time is configurable. If the wrapper is used in the run environment, this parameter is configured in the browser pool used (see [USE]). In the generation environment, this value is configured in the ITPAdminConfiguration properties file available in   DENODO HOME   conf itp admin tool, with the property  IEBrowser MAX DOWNLOAD TIME.

For this example, all of the values will be kept as ON ERROR RAISE, indicating that any error is reported and the run is completed.

Please note that the path can start with a     symbol. For example, Windows paths start by     , so in order to access a specific directory, ftp    c  directory should be written.

3.8 STRUCTURE DEFINITION OF THE DATA TO BE EXTRACTED

Initially, the aim is to obtain the list of e-mails as they appear in the main page, without worrying yet about obtaining further details (such as those obtained by clicking with the left-hand button of the mouse on each of the messages). It is therefore necessary to structure the relevant information of the target page that the Sequence component obtained as a result of browsing.

To do so, use another of the main ITPilot components, known as Extractor. This component generates an HTML page data extraction 
47. RACTIONOUTPUT from the Details tab of the component configuration area.

3.12 ITERATION OF RESULTS OBTAINED

3.12.1 Use of the Iterator component

The Extractor component returns a list of records as its result, each of which contains one of the elements obtained. In this example, each record is a message with its sender, message, date and size fields. To manage the records appropriately, each one must be obtained individually so that filters, conditions, etc. can be set on specific fields or records. The Iterator component is used to iterate over each record in the input list. For each iteration, the component returns one record from the list.

As usual, the Iterator component can be dragged from the browsing area or from the workspace component bar. Figure 39 shows the graphic appearance of the component in the workspace.
48. Figure 41 Record editor (the fields shown include MESSAGEDAY, MESSAGEMONTH and MESSAGEYEAR, together with the "Add new field" option)

As can be seen, an error at the top of the window indicates that some attributes have not yet been defined. New fields created from existing ones can be added from the record editor. To do so, click on the icon of any of them (e.g. MESSAGEDAY) to edit it.

3.12.2.1 Editing New Record Fields

Click on the icon and a new window will be displayed, as shown in Figure 42. In this window, the functions defined in ITPilot can be applied to the fields accessible from the Record Constructor component to generate new derived attributes. Chapter 5 describes each of the functions available in ITPilot. In this case, the date treatment function GETDAY will be used, which accepts a DATE type parameter as input and returns an integer that indicates the day.

On the left of the screen are menus to create the different values that can appear as operands in the expressions:

• Constants: This menu allows constants of the different data types supported by Virtual DataPort to be created.
• Derived attribute functions: This menu allows an invocation of one of the derived attribute functions permitted by Virtual DataPort to be created. The functions can receive constants, attributes or the result of evaluating other functions as parameters. They return one result. The list of available functions and use of each one can be consul
49. Size: Maximum number of connections that the pool may manage at the same time.
   o Ping Query: SQL query used by the pool to verify the status of the connections to be cached. The query must be simple and the table must already exist.
• The third tab is used to execute an SQL query that allows ITPilot to determine the output record structure (the query may use variables obtained from the input records and values of the component). Figure 110 shows how the "billing" table is accessed in the example to obtain the "Customer_Id" field as the clients' unique identifier. Thus, the JDBCExtractor will return a list of "Customer_Id" values.

Figure 110 Obtaining an output record structure in the JDBCExtractor component

6.13 LOOP

6.13.1 Description

This component allows loops to be made in the flow. The loop is repeated as long as the given condition is met (WHILE ... DO).

6.13.2 Input Parameters

Loop accepts zero or more values and zero or more records. These elements are used to assign variables to the loop output condition expression.

6.13.3 Output Values

None.

6.13.4 Example

After an Extracto
50. Figure 111 Example of Loop component operation

6.14 NEXT INTERVAL ITERATOR

6.14.1 Description

This component iterates over different interrelated pages by means of one or more browsing sequences.

6.14.2 Input Parameters

The Next Interval Iterator accepts the following as input:
• An input page that is used as a base from which the remainder are accessed.
• Zero or more input records used as input variables in subsequent browses.

6.14.3 Output Values

The component returns the results page for each iteration.

6.14.4 Details of the component

See section 3.15 for a more in-depth explanation of the component.

6.15 OUTPUT

6.15.1 Description

This component places a record in the wrapper output.

6.15.2 Input Parameters

Output accepts as input a record, which the wrapper returns asynchronously as its result.

6.15.3 Output Values

None.

6.15.4 Details of the component

See section 3.12.3 for a more in-depth explanation of the c
51. Figure 100 Creation of the comparison date

d. Now create the left-hand operand of this condition. Drag&drop another instance of the TODATE function, which is fed with the "MM dd yyyy" string as the first argument and with the DATE attribute of the WMAILDEMO record, which is originally in the list of "Input Values" to the left of the editor.
e. Finally, drag&drop both TODATE functions to the condition created in step (a): first the function created in (d) as the left-hand operand and then that created in (c) as the right-hand operand. See Figure 101.

Figure 101 
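The effect of comparing two TODATE results in a filter can be sketched outside ITPilot. The following Python snippet is only an illustration: it assumes a slash-separated month/day/year pattern, a "later than" comparison, and a hypothetical `to_date` helper standing in for TODATE; none of these names belong to the ITPilot API, and the sample records are made up for the example.

```python
from datetime import datetime

# Assumption: the manual's date pattern corresponds to month/day/year with slashes.
DATE_PATTERN = "%m/%d/%Y"


def to_date(pattern: str, text: str) -> datetime:
    """Hypothetical stand-in for TODATE: parse text according to pattern."""
    return datetime.strptime(text, pattern)


def is_later_than(record_date: str, threshold: str) -> bool:
    """Left operand: the record's DATE field. Right operand: the comparison date."""
    return to_date(DATE_PATTERN, record_date) > to_date(DATE_PATTERN, threshold)


# Illustrative records shaped like the Web mail messages.
records = [
    {"SUBJECT": "Data integration approach", "DATE": "01/31/2007"},
    {"SUBJECT": "VDP General Architecture", "DATE": "02/01/2007"},
]

# Keep only the records whose DATE is later than the comparison date.
recent = [r for r in records if is_later_than(r["DATE"], "01/31/2007")]
```

Parsing both operands with the same pattern before comparing avoids comparing the dates as plain strings, which would give wrong results across year boundaries.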
52. Tag Value". In this case, as indicated above, the EOL tag (although without the HTML tags) will basically be responsible for defining the line breaks and new paragraphs.

NOTE: when creating tags in ITPilot, the HTML opening tags must be written with the following syntax: <TAG    >. For example, the paragraph tag "<P>" should be written as "<P    >". This is required because of the internal functioning of the ITPilot automatic maintenance system (see [USE] for more information about this tool).

The central section ("Nested Tag Values") is used to define attributes of the tag being created. For example, a URL attribute is defined for the ANCHOR tag, to which the value "CompleteURL(href, @URLBASE)" is allocated. CompleteURL is the function that receives a relative URL (e.g. /products?id=3025) and a base URL (e.g. http://www.bookshop.com) as parameters and combines them to return an absolute URL (e.g. http://www.bookshop.com/products?id=3025). In this case, URL  CompleteURL(href, @URLBASE) would be written in the "Nested Tag Values" section.

The tag is saved by clicking on the button. In the event of updating a tag, should you wish to reject the change made and return to the previous version, simp
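The behavior described for CompleteURL (combining a relative URL with a base URL) can be illustrated with the Python standard library. This is a sketch of the general technique, not ITPilot's implementation; `complete_url` is a hypothetical name chosen for the example.

```python
from urllib.parse import urljoin


def complete_url(href: str, url_base: str) -> str:
    """Hypothetical analogue of CompleteURL: resolve href against url_base."""
    return urljoin(url_base, href)


# Relative link as found in an anchor tag, resolved against the site's base URL.
absolute = complete_url("/products?id=3025", "http://www.bookshop.com")
```

An href that is already absolute is returned unchanged by `urljoin`, so the same helper can be applied to every anchor without checking the link type first.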
53. Tool

This tool is divided into three vertical areas, each containing information on the scanners, tag sets and specific tags that currently exist in the ITPilot installation you are working with. In a recently installed standard distribution, as shown in the figure, there are three scanners: StandardHTMLLexer, StandardLexer and StandardLexerJS. By clicking on any of them with the left-hand button of the mouse, you can see their internal characteristics (lexer type and, most importantly, the tag sets included). The central area shows all the existing tag sets and the tags in each one. Lastly, the right-hand area lists the tags created to date.

In the example proposed in this guide, you must create a new tag set belonging to the StandardHTMLLexer scanner that does not contain tags that may prevent the "MESSAGE" field of the details page of the Web mail application from being accessed correctly. To achieve this, the tag set will only contain the EOL (without the HTML tags "<P>" and "<BR>") and TAB tags.

Therefore, the first step is to create a new tag, EOLNOLINEBREAK, which will be defined with the same HTML tags as EOL but without "<BR>" and "<P>". Click on the button in the right-hand area of the scanner configuration window (the area corresponding to tags) and create the required tag, EOLLINEBREAK, which will appear in the list of existing tags. This new tag can be defined in the bottom field known as the  
54. Figure 45 Use of the Output component

Figure 46 shows the complete process.

Figure 46 Complete process of the first part of the example

3.13 WRAPPER ADVANCED OPTIONS: BACK SEQUENCE AND LOCALE

Before finishing the wrapper creation process, some additional capabilities can be configured. Specifically, ITPilot allows a "Back Sequence" to be added to optimize the response time when the wrapper is executed several times. In addition, the default locale information of the wrapper can also be configured.

To do so, we u
55. Y STATE (see [USE]) and a wrapper does not have an explicitly defined back sequence, then Denodo ITPilot will try to obtain a suitable back sequence for the wrapper by itself, depending on the previous runs made. Normally, Denodo ITPilot requires at least two wrapper runs before being able to determine whether there is a back sequence suitable for the wrapper.

This back sequence will be taken by ITPilot as the first Sequence component of the wrapper. It is important to take this into account when building the wrapper. In addition, the browser type used in this back sequence is implicitly the one selected by the first Sequence component of the wrapper.

Consult the Denodo ITPilot User Manual [USE] for further information on the reuse of browsers.

3.13.2 Locale

This area is used to configure the locale information of the wrapper. It supports the integration of information from different countries or geographic areas, expressing the output data in the formats expected by the country in question.

In addition, each Extractor component may contain its own locale configuration, which is taken into account even if it differs from the default one.

There is an internationalization configuration for each of the countries/locations from which data can come. Each of the existing localizations has several configuration parameters, among them the currency, decimal and thousands separator symbols, date format, etc. ITPilot inc
56. ab by clicking on the button, where the examples are assigned. In this case, the text of the message can be assigned to the MESSAGE attribute by using the new tag set. Where a message contains HTML tags belonging to EOLLINEBREAK, the existing text may only be assigned to that tag. Where necessary, EOLLINEBREAK may be modified so that it can also be accepted.

Once the examples have been assigned, go to the generation tab. This tab contains a function that has not yet been described and that is quite useful: FROM UNTIL pattern generation.

This tag set actually already exists in the ITPilot distribution. It is the "STANDARD  TEXTFRAGMENT" tag set that belongs to the "StandardLexerJS" scanner.

3.16.8 Iteration on the details page structures and creation of the output record

As indicated, the Extractor component returns a list of records, even when only one element is returned. Therefore, the component output must be passed through an Iterator to obtain the required records. Once this is done, the Record Constructor component of section 3.12.2 can be reused to generate an output record containing the data obtained from the pages of results and the details pages.

Figure 64 shows the result of adding these last components to the e-mail extraction process.
57. Figure 91 Use of the Condition component

6.2.5 Using the Conditions Editor

The conditions editor (see Figure 92) allows selection conditions to be created. The condition can be written directly in VQL format in the "Selection Condition" box or can be created completely graphically. This last process is described below.

On the left side of the screen, there are menus for creating the various values that can appear as operands in the conditions:

• Constants: This menu allows constants of the various data types supported by ITPilot to be created.
• Functions: This menu allows an invocation of one of the functions permitted by ITPilot to be created. The functions can receive attributes, constants or the result of evaluating other functions as parameters. They return one result. The list of available functions and the use of each of them can be seen in Appendix A (section 5).
• Attributes: This corresponds to the list of attributes to which the condition is applied.
58. al as possible, i.e. dependent on the input parameter values. To do so, create a "Domain" (in ITPilot, a domain is a set of key-value pairs), so that the sequence generation tool can assign variables to Web page input elements (e.g. forms). Thus, the wrapper generated is able to accept different values for different runs.

It is very simple to create domains graphically from the main window of the ITPilot wrapper generation tool. On the left-hand side of the component configuration area, click on the "Domain Editor" button, which opens the Domain Editor window. Then click on the button to add an example to the domain. As many examples as required can be added, although in this case only one is needed. The name of the example, MAILPARAMS, will be displayed on screen from the initialization data created in section 3.6. Underneath, specific example values for each parameter are entered. In this case, the input values of the e-mail Web application will be used, as indicated in section 3.2:

• LOGIN: demos
• PASSWORD: DeMo 04

See Figure 20 to check it.

Figure 20 Domain Editor

Once the examples have been added, the domain must be exported so that the sequence generation tool can use it.

To do so, click on "Export Domain" and select the name wi
59. alues

The component returns a page resulting from browsing.

6.17.4 Details of the component

See section 3.16.3 for a more in-depth explanation of the component.

6.18 REPEAT

6.18.1 Description

This component allows loops to be made in the flow. The loop is repeated until the given condition is met (REPEAT ... UNTIL).

6.18.2 Input Parameters

Repeat accepts zero or more values and zero or more records as input. These elements are used to assign variables to the loop output condition expression.

6.18.3 Output Values

None.

6.18.4 Example

This component works in a very similar manner to Loop; therefore, please see the example described in section 6.13.4.

6.19 SCRIPT

6.19.1 Description

The component allows a program to be written in Javascript (see [JSDENODO]).

This is a very useful option for adding small scriptlets to the process flow when it is not possible, or not worth it, to create a customized component.

6.19.2 Input Parameters

Zero or more elements of any type.

6.19.3 Output Values

None.

6.20 SEQUENCE

6.20.1 Description

This component creates
60. ated, the list of which is displayed in the box on the top right). To assign an attribute or a value already created as an operand of the condition, drag&drop the element to the parameter area. Press the > button that appears beside the condition, and it will appear in the list of conditions created (box center right).

Creating Boolean conditions

To create a new Boolean condition, the following actions are required:

1. Select the required Boolean operator (AND, OR or NOT) in the drop-down menus on the right side of the screen and click on or drag&drop it to the workspace where the Boolean conditions are created (left lower box).
2. The operator selected will appear in the workspace together with an area to specify its operands. The operands can be simple conditions already created (the list of which is shown in the right center box) and other Boolean conditions created beforehand. To assign a condition already created as an operand of the new Boolean condition, drag&drop the condition to the operand area. Press the > button that appears beside the Boolean condition, and it will appear in the list of Boolean conditions created (box bottom right).

Finally, drag&drop the condition to be added to the selection to the "Result Condition" box. On clicking "Ok", you will return to the process creation screen with the condition already created.
61. ates a new date value containing the current date.

• GETDAY: Receives a date type argument and returns a long type object that represents the day of the date received. If no arguments are received, a long type object is created that represents the current day.
• GETHOUR: Receives a date type argument and returns a long type object that represents the hour of the date received. If no arguments are received, a long type object is created that represents the current hour.
• GETMINUTE: Receives a date type argument and returns a long type object that represents the minutes of the date received. If no arguments are received, a long type object is created that represents the current minutes.
• GETSECOND: Receives a date type argument and returns a long type object that represents the seconds of the date received. If no arguments are received, a long type object is created that represents the current seconds.
• GETMONTH: Receives a date type argument and returns a long type object that represents the month of the date received. If no arguments are received, a long type object is created that represents the current month.
• GETYEAR: Receives a date type argument and returns a long type object that represents the year of the date received. If no arguments are received, a long type object is created that represents the current year.
• TODATE: T
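The contract shared by the GET* date functions (an optional date argument, with the current date and time used when none is given) can be mirrored in a few lines of Python. This is a minimal sketch for illustration only; the function names below are hypothetical analogues and not part of ITPilot.

```python
from datetime import datetime
from typing import Optional


def _resolve(d: Optional[datetime]) -> datetime:
    """Fall back to the current date and time when no argument is given."""
    return datetime.now() if d is None else d


def get_day(d: Optional[datetime] = None) -> int:
    return _resolve(d).day


def get_month(d: Optional[datetime] = None) -> int:
    return _resolve(d).month


def get_year(d: Optional[datetime] = None) -> int:
    return _resolve(d).year


def get_hour(d: Optional[datetime] = None) -> int:
    return _resolve(d).hour


# Deriving MESSAGEDAY, MESSAGEMONTH and MESSAGEYEAR style fields from one date.
message_date = datetime(2007, 1, 31, 22, 41, 1)
derived = {
    "MESSAGEDAY": get_day(message_date),
    "MESSAGEMONTH": get_month(message_date),
    "MESSAGEYEAR": get_year(message_date),
}
```

Called without an argument, each helper reads the clock at call time, so two consecutive calls around midnight may legitimately return different days.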
62. 2. A pop-up window containing a drop-down menu will appear, from which one of the available domains can be selected.
3. Once the required domain has been selected, a drop-down menu will appear on the taskbar (beside or under the DOMAIN button), which allows the name of one of the examples associated with the domain to be selected.
4. Once the example has been selected, the values it provides for the parameters that make up the domain will appear on the bar. Figure 86 shows the taskbar with the domain BOOK and the example "Flanders Panel  Reverte" selected.
5. When, in one step of the sequence, you wish to associate the value of a field on a form with one of the domain parameters, just use the drag-and-drop function to bring the value associated with the parameter to the required field on the form. Figure 87 depicts this process graphically. As a result of this action, the NSEQL code created by the generator will associate a variable with the name of the parameter used (prefixed by the character       e g      LA) to said field. In this way, the sequence can be used directly when defining a wrapper that allows the input of attributes with the same names as the parameters of the selected domain.
6. The other steps involved in generating the sequence remain unchanged.
63. Figure 92 Conditions Editor

On the right of the screen, there are menus to select the various operators that can appear in the conditions:

• Condition operators
• Logical operators (AND, OR, NOT): These are used to combine the different simple conditions in a Boolean expression.

The center boxes of the screen allow three types of elements to be constructed, from top to bottom: values that appear in the conditions, simple conditions and compound Boolean conditions. The box on the left of each group is a workspace for creating new elements, while the box on the right displays the elements already created. The following subsections describe in more detail how each of these types of elements is created.

Finally, the "Result condition" box contains the condition eventually created.

6.2.5.1 Creating values for the conditions

To create a new constant value, the following actions are required:

1. Select the data type of the constant in the "Constants" drop-down menu on the left side of the screen and click on or drag&drop it to the workspace where values are created (box on top left).
2. The type selected will appear in the workspace together with a text area to fill in the value of the constant. The value required can be written directly in the text area.
3. On clicking the > button, the new constant will appear in the li
64. be made.
• Time between retries: This indicates the time between one retry and the next in the event of the first failing. The time is defined in milliseconds.

Figure 25 Result of the Sequence Editor

9 Output Data Configuration and Error Processing

To complete the configuration of this parameter, the "Details" tab determines the output name and the error conditions:

• Output Name: The output of a Sequence component is a page. The choice of name is important, as it must be easily accessible by subsequent components using it as input. In this example, INITSEQOUTPUT has been chosen.
• Error conditions: This section configures the behavior of this component regarding certain error types:
   o RUNTIME ERROR: In the event of a runtime error of the component, you can choose to ignore, retry or publish and propagate the error.
   o CONNECTION ERROR: This error occurs when there is some kind of connection problem with the source.
   o SEQUENCE ERROR: Error produced when there is some problem with the sequence (the sequence is not correctly written, some command could not be run, etc.).
   o HTTP ERROR: Produced by an HTTP error.
   o   IME
65. be created from which the required wrapper will be created. To do so, click on "Project Management" in the tool's Browsing Area and the projects that currently exist will be displayed (see Figure 7).

Figure 7 Project creation

To create a new project, click on the icon in "Projects". The workspace will display a text field where the project name can be entered. In this case, call it "MAILWRAPPERS" and click on the OK button. The project is now displayed in the Browsing Area (see Figure 8). The symbol to the right of the project name allows the project to be deleted when it is no longer required. Deleting a project eliminates all of its associated wrappers, so be careful with your selection. On deleting a project, the tool lets you specify whether the wrappers are deleted from the display tool only or also from the hard drive. If they are only deleted from the tool, they will disappear from the project view. They can be retrieved by selecting the Refresh option from the project's contextual menu (by clicking with the right-hand button of the mouse on the specific project) or by selecting the Add Processes option and then choos
66. be necessary to install the scanner in the remote machine. Please see [DEXTL] for more information about how to do it.

The scanner and tag set for the Extractor component can now be updated:

• In the Structure tab of the extraction wizard, select the root node known as "DETAILSTRUCT" and change the scanner to "MyLexer".
• The new data structure is created with two attributes:
   o MESSAGE: String type, which will save the data on the message received.
   o MESSAGEDATE: Date type. In the contextual menu (clicking with the right-hand button of the mouse), select the "Options" option and enter the following in the "Date Pattern" field: d  dd   MMM yyyy HH mm ss. This informs ITPilot of the pattern to be followed by the MESSAGEDATE field.
• Click on the "Change TagSet" button and select the "MYTAGSET" tag set.

3.18 GENERATING FROM UNTIL PATTERNS

It is sometimes difficult to determine the part of the page at which the information extraction process is to begin and the part at which it is to end. This situation can normally be avoided by extending the specification of the pattern to be recognized with parts that, although they are not to be extracted, avoid ambiguity. However, this is often not easy. A typical example is shown below. Figure 69 shows the graphic format used by an on-line bookshop to show information on its products.

Title Author Price  Coraz  n 
67. bed in section 3.7.2, although with certain distinguishing features described below. First, let us see how to generate a browsing sequence to obtain further information on messages that do not reside in the main page but in the following pages of results. Figure 54 shows the home page of results. At the bottom right, you can see a series of links that enable you to browse to the following pages of results. Therefore, save a browsing sequence that clicks on the next page button. This sequence will be used by the component at the end of each iteration to access the next page, so that the Extractor component continues to obtain results.

We will therefore open a Microsoft Internet Explorer browser. Click on the button and, when the dialog in which the start navigation sequence address has to be written appears, close it either by clicking on its close icon or by leaving the URL blank and clicking on OK. We are now recording on the current page. In the navigation panel, click on the arrow ">" ("Next Page"), which brings us to the next results page. Click on the save button to record the sequence (e.g. correoWeb next nsg) and stop the recording process with "Stop".

1  01/31/2007  John Smith  Data integration approach  1 KB
2  01/31/2007  Marty McFly  Web data extraction techniques  709 KB
3  01/31/2007  John Smith  W3C Holds Workshop on Frameworks for Semantics in Web Services 2005  2 KB
68. can be run in the case of error in the browsing sequence.

Figure 107 Selecting values in the form fields (Form Iterator Editor, "Navigation" tab: FindForm command, sequence type "browser pool", and submit sequence FindElementByAttribute(INPUT, VALUE, Search, false); ClickOnSelectedElement(); WaitPages(1))

c. Lastly, the "Configuration" tab is used for different actions, such as limiting the total number of iterations, running parallel iterations and setting the maximum number of parallel iterations that can be run, and reusing the current connection so that the same browser is used for each iteration (which, as explained in section 3.16.3, may be inadequate for parallel iterations). The order in which the attributes are used is also configured, which affects the order of the combinations.

Form Iterator Editor, "Configuration" tab: Maximum retries; Time between retries (ms); Limit iterations; Execute iterations in parallel; Maximum concurrent iterations; Reuse connection; attribute order list: catAbbreviation, query, srchType, minAsk, maxAsk, addTwo, addThree, hasPic
69. cification to provide options, for example. See [DEXTL].

2. Select the tag for which the attribute is to be included in the workspace.
3. From all those possible, select the attribute required for this type of tag (e.g. only the URL attribute is defined for ANCHORS).
4. Lastly, choose the field in which the value of the tag is to be included.

For the example of the Web mail application, the specific actions to be carried out are those described below. As a previous step, a new attribute must be created in the record of the Extractor component. This can be called "SUBJECT_URL".

1. In the "Select pattern" list, select the only pattern available in this example.
2. As can be seen in Figure 56, the main window of the tab shows the different tokens that make up the selected pattern, and the tag from which the attribute is to be collected can be marked using the mouse. In this case, click on the ANCHOR that appears in "ANCHOR() SENDER" (it changes color).
3. The required attribute in this case is "URL" (the only option), and this is selected in the attribute list.
4. In the "Assign ... to" list, select the field in which the tag value is to be included (in this case, SUBJECT_URL).

The update of the structure of an Extractor component implies the repetition of all previous steps (example assignment, specification generation and mark assignment).
70. ck on the first option and configuration data will appear in the workspace, as shown in Figure 52.

Figure 52 Wrapper storage in a local file system (VQL Generator, "Server deploy" panel: Wrapper name WEBMAIL; Maintenance enabled; Create base relation)

Now enter the name to be given to the wrapper (e.g. WEBMAIL). The wrappers can then be loaded in the ITPilot run server using the "Load VQL File" option in the ITPilot Web administration tool. See [USE] for further details.

The wrapper can be used as another source in any data integration process in Denodo Virtual DataPort [VDP]. To do so, the option "Create Base Relation" must be clicked, and the "Base View Name" field must be filled in with the name of the base view that, from that moment on, will point to the wrapper in DataPort. For more information, reading the Denodo Virtual DataPort documentation [VDP] is recommended. In addition, the user can configure whether the wrapper is going to be maintained or not. On pressing OK, ITPilot will store the file anywhere in the local file system. This is the file to be used to deploy the wrapper in
71. client, the FTP protocol or a resource residing in the local file system.

• Repeated Sequence / Different Sequences: Although Web sources in general often replicate the way of accessing the following results from one page to another, this does not have to be the case. For such sources, ITPilot allows for a set of different browsing sequences to be generated, one for each iteration made. This is not necessary in this example and, therefore, the "Repeated Sequence" option will remain marked.
• Sequence Repetitions: This parameter determines the number of times the browsing sequence is to be run (i.e. the number of pages of results to be covered as of the main page). For example, if "2" is entered, ITPilot will try to click twice, and the wrapper would extract data from three pages of results in total.
• Reuse connection: Marked by default, this indicates whether the browser used to date is reused or whether a new browser is launched (maintaining the session data). This option is generally marked, although in some cases, such as when the Iterator component is used, as explained in section 3.12, it may not be useful.
• Maximum Retries: This parameter determines the number of retries to be made.
• Time between retries: This indicates the time between one retry and the next in the event of the first failing. The time is defined in milliseconds.

The effect of these last actions can now be seen by retesting the wrapper, as was the case in section 3.13. The result
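The interplay of the "Sequence Repetitions", "Maximum Retries" and "Time between retries" parameters described in this excerpt can be sketched as a small loop. This is only an illustration under stated assumptions, not ITPilot's implementation: the function name and the two callables (`extract_page`, `go_to_next_page`) are hypothetical stand-ins for the Extractor component and the recorded next-page browsing sequence.

```python
import time
from typing import Callable, List


def extract_with_next_interval(
    extract_page: Callable[[], List[dict]],   # hypothetical: extracts records from the current page
    go_to_next_page: Callable[[], bool],      # hypothetical: runs the browsing sequence, True on success
    sequence_repetitions: int,
    maximum_retries: int = 0,
    time_between_retries_ms: int = 0,
) -> List[dict]:
    """Collect records from the main page plus `sequence_repetitions` further pages."""
    records = list(extract_page())            # the main page of results always counts
    for _ in range(sequence_repetitions):
        attempts = 0
        while not go_to_next_page():
            if attempts >= maximum_retries:
                return records                 # give up: keep what has been extracted so far
            attempts += 1
            time.sleep(time_between_retries_ms / 1000.0)  # parameter is given in milliseconds
        records.extend(extract_page())
    return records
```

With `sequence_repetitions=2` the browsing sequence is run twice, so three pages are read in total, which matches the example given for the parameter.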
72. connection to the JDBC repository:

  o Driver Jar File: Path and name of the .jar file that contains the implementation of the JDBC driver.
  o Driver Class: The driver class to use for connecting to the data source (it can use variables that are obtained from the component input values and records).
  o Driver Properties: It is important to consider the specific characteristics of the databases used as information sources; these fields are optional. If not specified, the general configuration to access the database is used.
  o Database URI: The database connection URL (it may use variables that are obtained from the component input values and records).
  o Login: User name (it may use variables that are obtained from the component input values and records).
  o Password: The user's password (it may use variables that are obtained from the component input values and records).
  o Locale: Source locale information (more information about internationalization and localization in section 3.13.2).

• The second tab of the component is used to configure the connection pool that manages the access to the repository:

  o Use Pool: This checkbox determines whether a connection pool will be used or not.
  o Initial Size: Number of connections for pool initialization. These connections are established in "idle" state, ready to be used.
  o Maximum
73. contains all the output record data stored under the name RECORDCONTENT. Immediately afterwards, this expression is used to link its value to another initially created expression known as RECORDLISTCONCAT that contains the values of each record obtained.

After iteration, it can be seen how the StoreFile component is used (at the end of the figure below) to take the contents of the RECORDLISTCONCAT expression and write it to the file, the name of which is described in the OutputFileName value. This can be seen in Figure 112.

Figure 112 Input parameters of the StoreFile component (File name: OutputFileName; Contents: OUTPUTFILEMESSAGE)

Figure 113 Example of Store File component operation (Extractor, Iterator, Record Constructor, the RECORDCONTENT and RECORDLISTCONCAT expressions, Output and Store File)

REFERENCES

[ADOBE] Adobe Acrobat Professional. http://www.adobe.com
[DATEFORMAT] Java format representation of date formats. http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html
[DEXTL] DEXTL Manual. Denodo Technologies, 2007.
[FRFX] Mozilla Firefox. http://www.mozilla.com/en-US/firefox/
[ISO3166] ISO 3166 country code.
74. cord type than the selected one).

• Target list: name of the list to which the new record is added.

6.1.3 Output Values

This component returns no element.

6.2 CONDITION

6.2.1 Description

Allows for a condition to be defined. Two output connections determine the process flow, depending on whether the condition is met or not.

6.2.2 Input Parameters

Zero or more values, zero or more records.

6.2.3 Output Values

This component returns no element.

6.2.4 Example

Take the case presented in Figure 91. Following the extraction of information from a Web resource by an Extractor component, the process iterates on each of the results obtained. Suppose that only those results in which one of the parameters matches the input parameter provided by the user (using the Init component) are to be displayed. To do so, as can be seen, a Condition component is used. Depending on the result of running the condition expression described ("true" or "false"), the process will access the Record Constructor component to generate the final output record or will simply go to the end of the iteration to continue iterating, where applicable. The condition expression can be created using the component creation wizard.
75. ction, and this will appear in the list of expressions created (box on right).

Figure 97 shows an example in which an expression is used as a page counter.

6.7 EXTRACTOR

6.7.1 Description

This is responsible for extracting structured data from an HTML page, thus generating a DEXTL program [DEXTL].

6.7.2 Input Parameters

This component accepts a Page type element as input (e.g. like that returned by a Sequence component), which is used as a base for information extraction.

6.7.3 Output Values

The Extractor component returns a list of records with the results obtained following the information extraction process from the HTML page.

6.7.4 Details of the component

See section 3.8 for a more in-depth explanation of the component.

6.8 FETCH

6.8.1 Description

This component obtains the contents of the URL or page used as the input argument and returns them either in binary or text format.

6.8.2 Input Parameters

• Optionally, a URL type value.
• Optionally, a page.

Hence, the behavior of the component is as follows:

• If a URL type value is assigned, the Fetch component will access this URL and download the contents of the resource accessed in the format configured in the wizard (binary or text). This
76. d not on the records obtained but rather on changes to the pages through which the process browses. To do so, the Diff component allows for the difference between two HTML pages to be found (generally, the same page at two different times). Therefore, based on the input information, the component can be configured with the following parameters, as shown in Figure 93.

Figure 93 Conditions Editor (Diff Editor fields: Prefix for new content; Suffix for new content; Show removed content; Prefix for removed content; Suffix for removed content; Case sensitive; Ignore tag attributes; Return null if page has not changed)

• Prefix for new content: This text box indicates the prefix to use on generating the results page for the new contents (green HTML tag by default).
• Suffix for new content: This text box indicates the suffix to use on generating the results page for the new contents (green HTML tag by default).
• Show removed content: This checkbox indicates whether the prefix and suffix configuration for the deleted contents is required. This means that, if this option is not marked, the deleted parts will not be displayed. Depending on this option, the following two options may or may not be enabled.
• Prefix for
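The way the prefix, suffix, "Show removed content" and "Return null if page has not changed" options combine can be illustrated with a small sketch built on Python's standard `difflib` module. This is not the algorithm used by the Diff component, which the manual does not describe: the whitespace tokenization, the function name and the `<ins>`/`<del>` default markers are assumptions made for the example, and the "Case sensitive" and "Ignore tag attributes" options are left out.

```python
import difflib
from typing import Optional


def diff_pages(
    old_page: str,
    new_page: str,
    new_prefix: str = "<ins>",        # hypothetical default; the tool uses a green HTML tag
    new_suffix: str = "</ins>",
    show_removed: bool = True,
    removed_prefix: str = "<del>",    # hypothetical default
    removed_suffix: str = "</del>",
    return_null_if_unchanged: bool = True,
) -> Optional[str]:
    """Mark the differences between two versions of a page, token by token."""
    old_tokens, new_tokens = old_page.split(), new_page.split()
    if return_null_if_unchanged and old_tokens == new_tokens:
        return None
    marked = []
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            marked.extend(new_tokens[j1:j2])
            continue
        if show_removed and i2 > i1:   # tokens present only in the old page
            marked.append(removed_prefix + " ".join(old_tokens[i1:i2]) + removed_suffix)
        if j2 > j1:                    # tokens present only in the new page
            marked.append(new_prefix + " ".join(new_tokens[j1:j2]) + new_suffix)
    return " ".join(marked)
```

When `show_removed` is false, the removed-content prefix and suffix have no effect, which mirrors the way the tool enables or disables the two options that depend on the checkbox.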
77. dConstructor (section 3.12.2), and this, in turn, to the Output (section 3.12.3). The tab displays the different trace messages that the specific flow can follow. During the execution, the button is transformed to "Stop wrapper", thus allowing the run to be stopped at any moment.

Lastly, following wrapper execution, the "Results" tab displays the results it has returned. In this case (Figure 50), it can be seen how the wrapper returns the results on the e-mail message Web page. Wrapper generation has been successful.

The values of the input parameters of the wrapper are maintained from one execution to the next. In addition, it is possible to import values from the domain editor by dragging the field name in the editor and dropping it in the wrapper execution dialog field name.

Wrapper tester, "Results" tab: columns SUBJECT, MESSAGEDATE and SENDER (first row: Data integration approach, 01/31/2007, John Smith)
78. do so, click on the "Test Wrapper" button in the main window. A window like the one shown in Figure 49 will be displayed. This test tool consists of three tabs. The first ("Input Values") enables users to enter example values for each of the wrapper input parameters (as defined in the initialization component in section 3.6).

Figure 49 Wrapper testing tool ("Input values" tab: PASSWORD (mandatory), LOGIN (mandatory), Debug level DEBUG; buttons "Execute wrapper" and "Close")

Furthermore, it also allows for the trace level of the wrapper run to be selected. You can choose from among FATAL, ERROR, WARN, INFO, DEBUG and TRACE. The use of the DEBUG level is recommended when testing the wrapper for the first time.

By clicking on the "Execute wrapper" button, the editor goes to the "Execution Trace" tab and launches a browser (as the browsing type was defined as "browser pool" in section 3.7.2 and the generation tool uses a browser pool based on Microsoft Internet Explorer), which starts to browse through the pages defined in the Sequence component (section 3.7). On reaching the message page, the Extractor component (section 3.8) obtains the list of records, after which the iterator (section 3.12) passes the individual records to the Recor
79. domains can be defined using XML files that should be located in the path DENODO_HOME/metadata/seqgenerator/domains. Figure 85 shows the definition of a BOOK domain with searchable parameters TITLE and AUTHOR, and containing two examples for the domain, each of which gives values to the parameters TITLE and AUTHOR.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <DOMAIN name="BOOK">
      <SCHEMA>
        <FIELD name="TITLE"/>
        <FIELD name="AUTHOR"/>
        <FIELD name="PUBLISHING_HOUSE"/>
        <FIELD name="PRICE"/>
      </SCHEMA>
      <EXAMPLES>
        <EXAMPLE alias="Java Norton">
          <PAIR name="TITLE" value="Java"/>
          <PAIR name="AUTHOR" value="Patrick Naughton"/>
        </EXAMPLE>
        <EXAMPLE alias="Flanders Panel Reverte">
          <PAIR name="TITLE" value="The Flanders Panel"/>
          <PAIR name="AUTHOR" value="Arturo Pérez Reverte"/>
        </EXAMPLE>
      </EXAMPLES>
    </DOMAIN>

Figure 85 Definition of the domain BOOK

As can be seen, definition of the domain commences by specifying its name with the tag DOMAIN. Then a list of associated searchable parameters is specified through a list of FIELD tags grouped into a SCHEMA tag. Finally, the EXAMPLE tags allow examples to be defined that provide values for one or several of the domain parameters. Each example also has an associated name.

4.7.2 Use of Domains

To use the domains defined via the bar, follow the steps below:

1. Click on the DOMAIN button.

Navig
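To make the domain file layout of Figure 85 concrete, the following sketch reads a domain definition with Python's standard `xml.etree.ElementTree` module and checks that every example only assigns values to fields declared in the schema. It is an illustration only: ITPilot loads these files itself, and the `load_domain` function and the shape of the returned dictionary are assumptions introduced here.

```python
import xml.etree.ElementTree as ET

# The BOOK domain of Figure 85, encoded as the declaration states.
DOMAIN_XML = """<?xml version="1.0" encoding="ISO-8859-1"?>
<DOMAIN name="BOOK">
  <SCHEMA>
    <FIELD name="TITLE"/>
    <FIELD name="AUTHOR"/>
    <FIELD name="PUBLISHING_HOUSE"/>
    <FIELD name="PRICE"/>
  </SCHEMA>
  <EXAMPLES>
    <EXAMPLE alias="Java Norton">
      <PAIR name="TITLE" value="Java"/>
      <PAIR name="AUTHOR" value="Patrick Naughton"/>
    </EXAMPLE>
    <EXAMPLE alias="Flanders Panel Reverte">
      <PAIR name="TITLE" value="The Flanders Panel"/>
      <PAIR name="AUTHOR" value="Arturo Pérez Reverte"/>
    </EXAMPLE>
  </EXAMPLES>
</DOMAIN>
""".encode("iso-8859-1")


def load_domain(xml_bytes: bytes) -> dict:
    """Parse a domain file into {"name", "fields", "examples"} (hypothetical layout)."""
    root = ET.fromstring(xml_bytes)
    fields = [field.get("name") for field in root.findall("./SCHEMA/FIELD")]
    examples = {}
    for example in root.findall("./EXAMPLES/EXAMPLE"):
        pairs = {pair.get("name"): pair.get("value") for pair in example.findall("PAIR")}
        unknown = set(pairs) - set(fields)
        if unknown:  # an example may only give values to declared parameters
            raise ValueError(f"example {example.get('alias')!r} uses undeclared fields: {sorted(unknown)}")
        examples[example.get("alias")] = pairs
    return {"name": root.get("name"), "fields": fields, "examples": examples}


book_domain = load_domain(DOMAIN_XML)
```

Each example provides values for only some of the declared fields (here TITLE and AUTHOR), which the text explicitly allows.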
80. e  refattribute and relative position on the page.

• Forms: value of the attribute name, value of the attribute action and relative position on the page.
• Frames: value of the attribute name, value of the attribute source and relative position on the page.

In general, the least suitable criterion is that of relative position on the page, as it is more vulnerable to possible changes in the structure of the Web site. However, it can sometimes be a good option when the other alternatives prove inadequate.

If a value that does not exist for a specific element is set (e.g. a frame without a value for the attribute name), the system will try to select by itself another criterion that is more suited to this specific element.

Another important aspect to take into consideration is that the criteria set for forms and frames are global to the entire page. What this means is that, within a specific page of the navigation sequence, the same selection criteria should be used for all the events on elements of the same type (frames or forms) contained in it. If, during the generation of the navigation sequence, different criteria are specified for elements of the same type within the same page, the system will always take the last criterion set.

Lastly, this tab allows the type of NSEQL default page download wait command to be selected. No
81. e Record Constructor component

The process is now prepared. Now generate the wrapper and test it. The generated wrapper returns the expected results asynchronously (it does not wait until the end of the process to return results).

3.17 TAGSETS AND SCANNERS

The ITPilot generation tool allows the creation of as many scanners and tagsets as required by the different levels of our wrappers. In the browsing area, we can click on the Tools > Scanner & TagSet configuration link, which will open a new window in the work area, such as the one shown in Figure 67.

Scanner & TagSet configuration window: Scanners (StandardFormLexer, StandardFormLexerJS, StandardHTMLLexer, StandardHTMLLexerJS, StandardLexer, StandardLexerJS); TagSets (form, standard, standardForm, standardHTMLFragment, standardTextFragment); Tags (EOL, FRAME, TAB, ANCHOR, ENDANCHOR, SELECT, CHECKBOX, RADIO)

Figure 67 Scanner and Tag Set Generation
82. e a look at a small example.

Figure 71 Utility tab (Select Scanner: StandardHTMLLexer; Select Tag Set: STANDARD)

Imagine that we want to create the DEXTL data extraction program for the main results page of the Web mail application. An alternative option to that which we have been looking at up to now is to start the browser from the tool, access this page and, for example, tag the first line that contains a message (see Figure 72).

Figure 72 Selecting Data to be Extracted (message list with the columns Date, From, Subject and Size; the first message line is selected)

Now return to the specifications generator tool and, after having properly selected the scanner and the desired tag set, click on the button. The result is that shown in Figure 73.

The system takes the text tagged by the user and analyzes it, extracting those tags that can be recognized from the scanner and tag set. Those that have not been recognized are maintained in their literal form, so that the user can leave it as it is (if it really is a constant value) or change it with an element val
83. e added from the examples that the system has not extracted. It is also feasible to modify the existing examples.

• If more results than expected are retrieved, the options "Disambiguate" and "Strict Patterns" may be used, as explained previously.
• Alternatively, the generated DEXTL program can be modified manually (if doing this, we recommend that users carefully read [DEXTL]); this option is selected by clicking on the button. The automatically generated program can now be modified.

Once the DEXTL programs of each of the levels have been satisfactorily generated, click on the button and skip the "Marks" tabs, which will be explained in detail in the advanced example available in PART II.

3.11 GENERATING THE SPECIFICATION

In the Specification tab (Figure 38), the DEXTL programs of each level are generated together.

Figure 38 Specification Generation tab (generated DEXTL program for WEBMAIL, using the StandardHTMLLexer scanner and the STANDARD tag set)

In our example, as we have already tested the specification in the Generation tab, we just have to press the button.

Configuration of the extraction component is now complete. Now simply change the name of the component output element to EXT
84. e application automatically moves on to the next tab, where the Search Examples are defined.

3.8.2 Nested Levels in the Component Structure

There may be nested levels in the schema of the data to be extracted. Figure 30 shows an example of an on-line music shop, the data of which can be modeled in line with the schema ALBUM (TITLE, AUTHOR, DATE, EDITION (FORMAT, PRICE)), where EDITION is a composed element. According to this definition, an EDITION value will be composed of a list of records, where each one has two fields known as FORMAT and PRICE.

In the specifications generator tool, the structure would remain as shown in Figure 31.

Figure 30 Music bookstore (two albums by Aretha Franklin, "Spirit in the Dark" and "The Great Aretha Franklin: The First 12 Sides", each listed with its date and one price per format: CD, LP or MC)

    TITLE
    AUTHOR
    DATE
    EDITION
      FORMAT
      PRICE

Figure 31 Structure of Music store

In cases like this one, you may want the data to be "flattened" so that it all belongs to the same level. To this end, the specification generation tool provides a "level flattening" option. Click with the right-hand button of the mouse on a compound element to view the "Flatten level" option. The selection only affects the output structure of the data. For further information, please consult [DEXTL].

3.9 ASSIGNING EXAMPLES OF THE RESULTS

In the second tab, the user may provide different examples of results so that the system can extract data according to th
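The difference between the nested ALBUM structure and its flattened form can be shown with plain Python dictionaries. This is a sketch of the idea only: it assumes that flattening yields one output record per EDITION sub-record, with the parent fields repeated, which the manual does not spell out. The `flatten_level` helper and the sample values are illustrative, not taken from the tool.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def flatten_level(record: Record, compound_field: str) -> List[Record]:
    """Return one flat record per sub-record of `compound_field` (assumed behavior)."""
    parent_fields = {key: value for key, value in record.items() if key != compound_field}
    return [{**parent_fields, **sub_record} for sub_record in record.get(compound_field, [])]


# Nested form: EDITION is a list of records with the fields FORMAT and PRICE.
album = {
    "TITLE": "SPIRIT IN THE DARK",
    "AUTHOR": "ARETHA FRANKLIN",
    "DATE": "12/1993",              # illustrative value
    "EDITION": [
        {"FORMAT": "CD", "PRICE": 10.53},   # illustrative prices
        {"FORMAT": "LP", "PRICE": 15.92},
    ],
}

flat_records = flatten_level(album, "EDITION")
```

After flattening, TITLE, AUTHOR, DATE, FORMAT and PRICE all sit at the same level, at the cost of repeating the album data once per edition.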
85. e double type values, but it will not add an integer value to a floating value. In addition, some functions only operate with elements belonging to a specific data type.

ITPilot provides a series of predefined functions that can be grouped into different types, based on the data type to which they are applied:

• Arithmetic functions
• Functions for text processing
• List handling functions
• Functions for date processing
• Functions for URL processing
• Functions for page handling

The functions supported by the system are described in the following paragraphs.

NOTE: Functions are generally represented in prefix notation, i.e. an identifier is indicated followed by a list of parameters in brackets and separated by commas.

5.1 ARITHMETIC FUNCTIONS

Arithmetic functions are applied to numeric type attributes and literals (int, long, float and double), with the constraint that all the parameters should have the same type. These allow mathematical calculations to be made on attributes and literals.

The supported arithmetic functions are:

• SUM: The sum function receives a variable number of arguments (greater than or equal to two) and returns as a result a new element of the same type containing the sum of the arguments.
• SUBSTRACT: The substract function receives two arguments and returns a new element of the same type with the result of subtracting the value of the second argument from that of the first.
• MULT: The mult f
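The same-type constraint and the prefix notation described for SUM and SUBSTRACT can be mimicked in a few lines of Python. This is a sketch for illustration, not ITPilot's expression engine: Python has only `int` and `float`, so the four ITPilot numeric types are collapsed into those two, and the `_require_same_numeric_type` helper is a name introduced here.

```python
from typing import Union

Number = Union[int, float]


def _require_same_numeric_type(function_name: str, args: tuple) -> None:
    """Reject non-numeric arguments and mixed int/float argument lists."""
    types = {type(arg) for arg in args}
    if not types <= {int, float}:
        raise TypeError(f"{function_name} only accepts numeric arguments")
    if len(types) > 1:
        raise TypeError(f"{function_name} requires all arguments to have the same type")


def SUM(*args: Number) -> Number:
    """Sum of two or more arguments of the same numeric type."""
    if len(args) < 2:
        raise TypeError("SUM requires at least two arguments")
    _require_same_numeric_type("SUM", args)
    return sum(args[1:], args[0])   # start from the first argument to keep its type


def SUBSTRACT(first: Number, second: Number) -> Number:
    """Result of subtracting the second argument from the first."""
    _require_same_numeric_type("SUBSTRACT", (first, second))
    return first - second
```

Calls are written in prefix form, for example `SUM(1, 2, 3)` or `SUBSTRACT(5.0, 2.0)`, while `SUM(1, 2.0)` is rejected because it mixes an integer with a floating value.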
86. e previously generated structure. As many examples as desired can be inserted, and it is recommended that at least two examples be provided for each of the levels.

Where users are sufficiently advanced and wish to write the specification themselves using the DEXTL language (see [DEXTL]), the system can be informed of such by pressing the button to go to the next tab. In this example, the possibility of automatically generating the specification through examples is used.

Initially, a window appears like that shown in Figure 32.

Figure 32 Result Examples Tab (list of items WEBMAIL; item WEBMAIL with the fields SENDER, SUBJECT, MESSAGEDATE and SIZE)

A structure now appears in the window in which to specify the values belonging to the first example. Any number of examples can be added by simply selecting the option "Add Item" on the contextual menu (right button) of the root element. Each atomic item of the structure has an option "Assign Selected Text" in its contextual menu (right button), which allows a value to be added to this specific field in two different ways:

1. By associating text from an Internet Explorer browser (open by clicking on the menu option "Browser > New Browser" in the mai
87. e used to generate the search and run sequences of the specific form.

6.10.3 Output Values

As a return value, this component returns the page generated after running the form in each iteration.

6.10.4 Example

Information is required on vacations in the US through a source of real estate offers. This source offers a search form, where a group of search terms can be entered in a text box. There is also a selection list, where the type of complex required for the summer season can be chosen (apartment, summerhouse, sublet, sale, etc.). With ITPilot, it is possible to create a process that accepts the type of complex on which to make the search as the input argument. However, if the search is to be made on several complexes, an input list provided by the user must be created. In a simpler manner, the FormIterator component configures the input values of a form so that those on which iteration is required are indicated dynamically. In each iteration, the component will assign one of the possible combinations of form input arguments and will run it.

Figure 103 shows part of the described process. A Sequence component positions a browser on the information input page of a form. A FormIterator component is then added, the result of which in each iteration is a page used by an Extractor to obtain the data required.

The steps to follow to configure the FormIterator component are as given below:

1. As input information, the component receives the results page o
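The idea that each iteration assigns "one of the possible combinations of form input arguments" amounts to a Cartesian product over the values chosen for each field. The sketch below shows that enumeration with `itertools.product`. It does not reproduce the FormIterator component itself: the `form_combinations` function, its `limit` parameter and the sample field names are assumptions made for the example.

```python
from itertools import islice, product
from typing import Dict, Iterator, List, Optional


def form_combinations(
    field_values: Dict[str, List[str]],
    limit: Optional[int] = None,     # hypothetical counterpart of a "limit iterations" setting
) -> Iterator[Dict[str, str]]:
    """Yield one dictionary of form inputs per iteration.

    The order of the fields in `field_values` determines the order of the
    combinations: the last field varies fastest.
    """
    names = list(field_values)
    combinations = product(*(field_values[name] for name in names))
    for combination in islice(combinations, limit):
        yield dict(zip(names, combination))


# Illustrative form: a free-text query plus a selection list of complex types.
searches = list(form_combinations({
    "query": ["beach"],
    "complexType": ["apartment", "summerhouse", "sublet"],
}))
```

Here three iterations result, one per type of complex, and each would submit the form once and hand the resulting page to the Extractor.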
88. ecification for this page based on examples provided by the user. This matter is covered in further detail in the next subsection.

Figure 27 Input page of the Extractor component (Input page: INITSEQOUTPUT)

3.8.1 Data Extraction Specification Generation

By clicking on the "Open Extractor Configuration" button, ITPilot opens a new window (the Specifications Generator), as indicated in Figure 28.

Figure 28 Specification Generation Tool ("Structure" tab: Current scanner StandardHTMLLexer, "Change Scanner"; Current tag set, "Change Tag Set"; Extractor locale ES_EURO)

The first step involves defining the output structure for the data extracted from the page (i.e. the type of data these items have). The structure of an element may contain simple fields or hierarchically defined, nested subelements.

The user carries out this action on the first tab of the window (which is accessed directly when it is started for the first time), known as "Structure". In this step, using the options provided by the graphic interface, a tree is created that represents the structure of the elements.

With Figure 5 in mind once again, the information of interest from each e-mail is as follows:

• SENDER: the person to have sent the e-mail.
• SUBJECT: title
89. egister elements by clicking on the button next to the "Add New Field" section for each element and giving each one a name and data type. In this case, two elements are created:

• LOGIN: string type and mandatory, which is set by ticking the "Mandatory" check box.
• PASSWORD: string type and mandatory.

Both elements are marked as mandatory because the Web application requires both fields to be completed. Therefore, any time users run this wrapper, once generated, they must supply values for both register elements as input arguments.

The result of the action can be seen in Figure 13.

Figure 13. Initialization Editor

Now click on "OK" to return to the main window and you will see how the two recently created fields appear in the "Wizard" tab of the component configuration area (see Figure 14).

Figure 14. "Wizard" tab in the component configuration area with the initialization register already created

3.6.1 Use of the Catalog Explorer

Before continuing with process generation, this is a good time to introduce a very useful tool
90. ent Manual

6.6 EXPRESSION

6.6.1 Description

Allows an expression to be defined, based on constants and/or functions provided by ITPilot, that will be evaluated to an output value.

6.6.2 Input Parameters

This component receives zero or more values, zero or more records, or zero or more record lists.

6.6.3 Output Values

This component returns the defined value or a record containing it.

6.6.4 Example

Figure 94 shows part of a process that uses the Expression component to initialize a variable, e.g. CURRENTPAGE, to 1. Figure 96 shows that initialization is as simple as assigning an integer constant as the expression result.

Figure 94. Variable initialization Expression component

Following this initialization, another Expression component can be used within a loop (a Loop component, a Repeat, or an Iterator) to act as a counter, in this case of pages (see Figure 95).

Figure 95. Use of an Expression component as a page counter

The expression is defined from the expressions editor, the handling of which is described below.

6.6.5 Using the Derived Attribute Expressions Editor

The expressions editor is shown in Figure 96. The expression is built either entirely graphically or by writing in the "Expression Value" box. The graphic process is described below.

On the left of the screen are menus to create various values tha
91. ent application, the NSEQL sequence generation tool provides an option known as "Transpose Table", which transposes any table selected with the mouse on the page.

The transpose process flips the table over, transforming row vectors into column vectors. An example is shown in Figure 82. There is a table with two rows and three columns, (A, B, C) and (D, E, F), and you want to obtain its results as three registers with two values each: (A, D), (B, E), (C, F). Select the table (all its elements) and click on the "Transpose Table" button on the Denodo taskbar. The result will be as shown in Figure 83. Any subsequent data extraction process will use the modified table.

Figure 82. Using the "Transpose Table" Button

Figure 83. Result of the "TransposeTable" Command Execution

4.6 THE SELECTANCHOR BUTTON

Denodo ITPilot allows data to be extracted not only from HTML pages but also from resources saved in Microsoft Word and Adobe PDF format. To do so, as mentioned above, the initial browsing URL can reference a Word or PDF resource. If the resource is accessed via a link, this button must be used before clicking on the link itself to inform ITPilot that a transformation will be required. As can
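The effect of the "Transpose Table" option on the two-row, three-column example above can be reproduced in a few lines. This is only an illustrative sketch of the transposition itself, written with plain Python lists; it says nothing about how ITPilot implements the command internally.

```python
def transpose(rows):
    """Turn the row vectors of a table into column vectors."""
    # zip(*rows) pairs up the first cells of every row, then the second
    # cells, and so on; each tuple becomes one row of the new table.
    return [list(column) for column in zip(*rows)]


table = [
    ["A", "B", "C"],
    ["D", "E", "F"],
]

registers = transpose(table)
print(registers)  # [['A', 'D'], ['B', 'E'], ['C', 'F']]
```

After transposition each inner list corresponds to one register with two values, which is the shape a subsequent extraction would see.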
92. environments to help the automatic maintenance system regenerate the specification (if the Web source changes).

• Options. This modal dialog box provides two specific options:
  o Regular Expression. This allows a regular expression defining the representation format for this element to be added. This is useful when the wrapper to be generated is to be maintained by the ITPilot maintenance server and the value obtained is known to vary very frequently or continuously (e.g. a stock exchange value). For further information on maintenance, read [USE]. The regular expression is defined in [REGEX]. This option is not used in this example.
  o Date Pattern. Where the data type is Date, the specific pattern can be represented here using the format defined in [DATEFORMAT]. In this example, for the MESSAGEDATE field, the date pattern is defined as dd MM yyyy.
• Flat Level. When a record is multilevel (see section 3.8.2 for an explanation of this issue), it is possible to indicate that you want the values of a certain level to be flattened, i.e. the attributes of lower levels appear in upper levels.

Once the complete structure has been created, click on the "Set Structure" button on the main menu to set it (the structure can always be modified by going back, although it is important to remember that this deletes all the examples created up to now. Th
93. equence Type. ITPilot provides access to Web resources via different communication protocols. The main ones are browser pool and http, although it is also possible to use the ftp protocol and to access a resource saved in the local file system. In this example, the browser pool type has been used, given that this is a Web source with state. Each one is described below.
  o browser pool. This is the default option. In this case, the sequence will be run using the browser pool configured in the execution server in which the wrapper is run (see [USE]). The browser pool uses browsers to run the NSEQL sequence. These browsers can be based on Microsoft Internet Explorer, on Firefox, or on a mini http-client-based browser. In the first two cases, users do not have to worry about tasks such as JavaScript handling. When the source does not use JavaScript, the http-client-based implementation is normally just as effective and considerably more efficient. The browser pool included in the wrapper generation environment uses minibrowsers based on Microsoft Internet Explorer; therefore, if this option is chosen, the wrapper tests (see section 3.14.2) will be carried out using these minibrowsers.
  o http. This option uses the http client included in ITPilot for browsing sequences without using the pool concept. As indicated, the use of an http client is more efficient and normally works correctly if the source does not use JavaScript. Through th
94. er

Figure 61. Configuration of Sequences with the Record Sequence component

This component can be tested without having to compile and test the entire wrapper. This is also the case with the FormIterator and Next Interval Iterator components.

The following steps are required to do so:

1. Open a browser from the Browser > New Browser option in the main page menu of the wrapper generation tool.
2. Browse to the page of results.
3. In the configuration window of the Record Sequence component, click on the "Test" button. ITPilot will transfer the session data from the Internet Explorer browser to an ITPilot browser, loading the same page. It will also launch a component test window like the one shown in Figure 62. From here, it can be seen how, by clicking on the button, the ITPilot browser accesses the details page for each result, as required. The window displays the trace of the run, as would occur in the run test window of the full wrapper shown in section 3.14.2.
95. es ITPilot 4.0 Generation Environment Manual

1. Assign a name to the custom component (see Figure 75): WebMailCustomComponent.

Figure 75. Assigning a name to a custom component

2. Select the output of the process component which will be used as the custom component output (see Figure 76). In this case, the list just created, known as CCReturnList, is chosen.

Figure 76. Selecting the output type of the custom component

Once these steps are complete, a new component will appear under the "Custom" area in the tool's browsing area. To test it, a small test process can be created that uses this component. Figure 77 shows this small example. Because it is a list, the custom component output is processed by an Iterator component. Data is input through two Expression components that turn the initialization component record fields into input values. The rest of the process is similar to that shown previously in this manual.

This same exercise could have been done directly by checking the option "Use OutputComponent as Output", which uses the Output component's output record.
96. Using the "Transpose Table" Button
Result of the "TransposeTable" Command Execution
Selection of the transformation type in the Select Anchor command
Taskbar with an Example Selected
Assigning Example Values to Form Fields
Proxy Options Window
Browser Sequence Type Selection Window
Variable initialization Expression component
Use of an Expression component as a page counter
Creation of a constant value in the Expressions Editor
Creation of a constant value in the Expressions Editor
Creation of the comparison date
Creation of the filterin
97. ext, otherwise a number is used.

In a date format, the characters that are not found in the ranges "a" to "z" or "A" to "Z" are treated as text in inverted commas. That is, such characters appear in the resulting date even though they are not in inverted commas in the format text.

• Configuration of real numbers. Facilitates the configuration of the data types float and double.
  • doubleDecimalPosition: Indicates the number of decimal positions to be used to represent a double-type or float-type value (real numbers).
  • doubleDecimalSeparator: Represents the decimal separator used in a real number.
  • doubleGroupSeparator: Specifies the group separator for real numbers.

3.14 WRAPPER GENERATION, TESTS AND EXPORTING

3.14.1 Wrapper Generation

Once the graphic creation of the process is complete, it can be tested. To do so, the wrapper must first be generated. ITPilot compiles the flows defining the wrappers into programs expressed in the JavaScript (JS) language. The code is generated by clicking on the "JavaScript" button on the General bar to the left of the component configuration area. If everything is correct, a modal window will be displayed indicating that the JavaScript code has been generated successfully. Click OK on this window and another will be displayed containing the code, as shown in Figure 48. The code can be edited from this window, should any modification have to be made, or it can be
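The three real-number parameters listed above (doubleDecimalPosition, doubleDecimalSeparator and doubleGroupSeparator) can be pictured with a small sketch. It is not ITPilot code: the function name format_real and its defaults are hypothetical, and it only illustrates what a number of decimal positions, a decimal separator and a group separator each control.

```python
def format_real(value, decimal_position=2, decimal_separator=",", group_separator="."):
    """Render a real number using the three locale-style settings.

    The parameter names mirror the manual's doubleDecimalPosition,
    doubleDecimalSeparator and doubleGroupSeparator (an assumption made
    for illustration only).
    """
    # Python's own format uses "," for groups and "." for decimals.
    text = f"{value:,.{decimal_position}f}"
    # Swap in the configured separators, using a placeholder so that the
    # two replacements do not overwrite each other.
    return (
        text.replace(",", "\0")
        .replace(".", decimal_separator)
        .replace("\0", group_separator)
    )


print(format_real(1234567.891))            # 1.234.567,89
print(format_real(1234.5, 1, ".", ","))    # 1,234.5
```

The first call shows a continental European style, the second an English-language style with one decimal position.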
98. f the Sequence component. It may also receive lists, records, or values that may be used as input values on the required form.

Figure 103. Use of the FormIterator component

2. The component wizard is divided into three tabs:
  a. Values: This assigns the different iteration values to each of the form fields. To do so, ITPilot must first be informed of the form on which iteration is to be made. For this, the following steps are taken:
    i. Open a browser from the Browser > New Browser menu option and browse to the form page.
    ii. Mark the form required on the page. To do so, simply select part of the text associated with that form (see Figure 104).

Figure 104. Marking part of the form

    iii. Click on the "Import Selected Form" button. The wizard editor will display information on each of the form fields and their values; the input values are displayed on the left (see Figure 105).

Figure 105. Importing information from the form

    iv. It is now possible to choose the different value
99. fu-berlin.de/diverse/doc/ISO_3166.html

[ISO639] ISO 639 language code. http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt
[JAVADOC] Java Developer Kit Standard API Javadoc Documentation.
[JSDENODO] Denodo ITPilot Developer Guide. Denodo Technologies, 2007.
[MSIE] Microsoft Internet Explorer. http://www.microsoft.com/windows/ie
[NSEQL] NSEQL (Navigation SEQuence Language) manual. Denodo Technologies, 2007.
[OO] OpenOffice Office Suite. http://www.openoffice.org
[PDF] Adobe Portable Document Format. http://www.adobe.com/products/acrobat/adobepdf.html
[PDFBOX] PDF Java Library. http://www.pdfbox.org
[REGEX] Java Format for Regular Expression Pattern representation. http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html
[RFC1738] Request For Comments 1738: Uniform Resource Locators (URL). http://www.rfc-editor.org/rfc/rfc1738.txt
[USE] Denodo ITPilot User Guide. Denodo Technologies, 2007.
[VDP] Denodo Virtual DataPort Administration Guide. Denodo Technologies, 2007.
[WORD] Microsoft Word. http://office.microsoft.com
100. g condition
Use of the FormIterator component
Marking part of the form
Importing information from the form
Selecting values in the form fields
Selecting values in the form [illegible]
Configuration tab for the Form Iterator component
Access to Information from a Relational Database
Obtaining an output record structure in the JDBCExtractor component
Example of Loop component operation
Input parameters of the StoreFile component
Example of Store File component operation

TABLES

Table 1. List of Reserved Words
Table 2. Reserved Characters for Date Format

PREFACE

SCOPE

This document explains how to visually generate wrapper programs with Denodo ITPilot.

WHO SHOULD USE THIS DOCUMENT

This document is aimed at developers and administrators that want t
101. he Extractor component (DATE field)

The Extractor component created in section 3.8 obtained the values of the SENDER, SUBJECT, MESSAGEDATE and SIZE fields for the message. This component is now modified to delete the MESSAGEDATE field (which will be obtained from the details page).

The DEXTL program must then be regenerated by providing examples. ITPilot requires this because adding or deleting fields may modify some of the specification patterns. Luckily, the process remains as simple as in section 3.8:

1. Open a browser from the Browser > New Browser option (or by pressing Ctrl+B) in the main page menu of the wrapper generation tool.
2. Browse to the page of results (the one shown in Figure 5).
3. Drag and drop the example values onto the specific fields of the structure displayed in the "Examples" tab of the Extractor component editor.
4. Generate as many examples as required. In this case, three examples are generated, as the structure of all the results is similar.
5. Go to the next tab by clicking on the corresponding button.

Figure 55 shows how the examples are assigned in the new structure before going to the next tab.
102. he specification generator tool is shown below. In the structure tab, the StandardHTMLLexer must be used; if it does not appear by default under the "Current Scanner" text, select it and press the "Change Scanner" button to set it up.

Even though it is not necessary for this example, for more information about scanners the user should read [DEXTL] and section 3.17 of this Guide.

Once this step is completed, the structure can be created. First, give the type of record to be created a name by double-clicking on the text "record name not set" and entering WEBMAIL. The record type name is updated along with the name of the specific structure. This name can be changed by double-clicking on it (in this case, on WEBMAIL LIST). In this example, it is called WEBMAILINSTANCE. Then, by placing the cursor over each item and clicking with the right-hand mouse button, it is possible to invoke the "AddChild" action, which allows a new item to be created. The "Change name" option or double-clicking allows each item to be named as required. The data type of each element can be defined using the "Change type" option. Create a structure like the one seen in Figure 29.
103. hem into an NSEQL program that replicates these actions.

The Navigation Sequences Generator takes the form of a taskbar that is installed in an MSIE browser. Once installed, it can be used to generate any navigation sequence on the user's browser. The generator records the events generated by the user whilst navigating and automatically translates them into an NSEQL program that replicates the actions.

NOTE: The events necessary for running a browsing sequence in one type of minibrowser do not always match those necessary in another type. This means that the NSEQL programs produced by the Browsing Sequence Generator may have to be adapted before being run with a browser pool configured to use Firefox browsers or mini http-client-based browsers.

This generator can also optionally generate browsing sequences using pattern http requests, whose characteristics and differences in relation to NSEQL are explained in section 4.8.3.

4.2 DESCRIPTION OF THE NAVIGATION SEQUENCES GENERATOR INTERFACE

Figure 80 shows what the Navigation Sequences Generator taskbar looks like when the browser starts up and the bar is selected.

Figure 80 Navigation Sequences Generator
104. his allows text strings representing dates to be converted into date-type elements. Three text-type arguments are given. The first represents a pattern to express dates, following the standard Java syntax specified in [DATEFORMAT]; the second is a date expressed according to that pattern. The third is a text-type parameter indicating the internationalization configuration that represents the "locale" of the date to process. As a result, a date-type element equivalent to the specified date is returned.

FUNCTIONS FOR URL PROCESSING

• ENCODE. This function receives a URL-type value as an argument and encodes it. This is necessary when characters other than those accepted in URLs are to be used [RFC1738]. This function automatically transforms invalid characters into their corresponding encoding.
• URLTOSTRING. This function receives a URL-type value as an argument and obtains its content as a string value.
• TOURL. This function receives a string-type value representing a URL and returns that same value as a URL data type.

FUNCTIONS FOR PAGE HANDLING

• GETLASTURL. This function receives a Page-type object as input argument and returns its URL as a character string.
• GETLASTURLMETHOD. This function receives a Page-type object as input argument and returns its access method (GET or POST) as a character string.
• GETLASTURLPOSTPARAMETERS. This function receives a Page type object as input argu
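The kind of transformation attributed to ENCODE above, replacing characters that are not accepted in a URL with their encoded form, can be sketched with Python's standard library. This is an illustration of percent-encoding in general, not of ITPilot's implementation; the helper name encode_url, the choice of characters left untouched and the example URL are assumptions.

```python
from urllib.parse import quote


def encode_url(url):
    """Percent-encode characters that are not valid inside a URL.

    Characters that give the URL its structure (":", "/", "?", "&", "=",
    "#") and existing "%" escapes are left as they are; this choice of
    "safe" characters is an assumption made for the sketch.
    """
    return quote(url, safe=":/?&=#%")


print(encode_url("http://example.com/search?q=data integration"))
# http://example.com/search?q=data%20integration
```

Only the space is rewritten here; a URL that is already valid passes through unchanged.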
105. igure 40 Use of the Record Constructor component

The component is configured as follows. Select the set of records that can be combined in this component from the "Inputs" tab. In this case, only one will be used, called WEBMAIL. Click on the icon to the right of "Input Values" to view a selection list from which WEBMAIL is chosen.

Once this has been done, access the Record Editor from the "Wizard" tab to build the component output record. In this case, the WEBMAIL fields SENDER, SUBJECT and SIZE are to be returned, and three new fields are created: MESSAGEDAY (which will return the day on which the message was delivered), MESSAGEMONTH (the month) and MESSAGEYEAR (the year). All fields are disabled by default. To use a field in the output record, simply click once with the left-hand mouse button on the icon for that field. Clicking on the corresponding icon disables the field again.

Click on the icon to the right of the "Add new field" message to create new derived attributes. Click it three times and name the new attributes MESSAGEDAY, MESSAGEMONTH and MESSAGEYEAR, respectively.

Figure 41 shows the result of the operation after naming the output record.
106. igure 5).

After checking that the semaphore icon is green, the sequence can either be saved so that it can be loaded into the Sequence component wizard, or imported directly by clicking on the "Import from Browser" button. It is recommended to save the sequence if it is going to be used in other processes. This is achieved by clicking on the save button, naming the file, for example, webMail.nsq, and stopping the recording process. Once the sequence has been saved, it is loaded into the tool by pressing the "Load from File" button and selecting the generated file. Figure 24 shows the result of the action.

Once loaded, the sequence can be altered where necessary, which also implies that the navigation sequence can be written by hand, although in this case we recommend that you first read [NSEQL]. To do so, edit the sequence in the area where it is displayed.

Figure 24. Sequence editor with loaded sequence

This same figure shows that there are some configuration parameters. More specifically:

• S
107. ing the specific project to be retrieved.

Denodo ITPilot provides two templates so that wrappers need not be created from scratch. The "StandardTemplate" creates a wrapper to access a source and obtain structured information from the target page and the "more results" pages. The "StandardDetailTemplate" adds the possibility of accessing a detail page for every item from the main pages. Section 3.21 explains maintenance in more detail; for now, note that if the wrapper requires no further components, wrappers created from these templates are maintained automatically by ITPilot.

Finally, these templates can be deleted by the user, if required.

Processes can be moved from one project to another using the "Copy to Project" or "Move to Project" options from the contextual menu of each process. In addition, processes can be migrated to different working environments by manually copying the file <DENODO_HOME>/metadata/itp-admin-tool/<project name>/<process name>.xml, where <project name> is the name of the project where the process is stored and <process name> is the name of the process to be migrated. It is also possible to migrate a complete project by copying the directory <DENODO_HOME>/metadata/itp-admin-tool/<project name> and the project management file <DENODO_HOME>/metadata/itp-admin-tool/<project name>.xml
108. ion for the end of the search ("To") by using the corresponding button.

In the WEBMAIL example, you will see that this limitation is not necessary, although you can still try to implement a "FROM" by marking the heading of the table on the homepage showing the mails received (see Figure 70) from a browser launched from the generation tool. Once this area has been selected, click on the corresponding button, and ITPilot will extend the specification with the limitation information.

Figure 70. Delimiting the Beginning of the Extraction

3.19 GENERATING THE DATA EXTRACTION SPECIFICATIONS MANUALLY

In some cases you may wish to generate the specification manually without having to enter examples. These cases can arise when the source has a very clear structure for the user, or simply when the user has already acquired a certain skill in managing the DEXTL language. The specification generator tool has a utility that simplifies this task.

This function can be accessed through the "Utility" tab, shown in Figure 71. To see how it works let us tak
109. is option, it is also possible to use an alternative syntax to NSEQL to specify the browsing sequences, simply indicating the request mode (GET/POST) and the access pattern. In general, the access pattern will be the access URL, which may include variables in the same form as NSEQL programs (see [NSEQL]). Example:

  o ftp. This provides access to the resource via the ftp protocol. The format in which access to the resource must be entered in the write area is as follows: ftp://login:password@domain:port, where:
    login: user name
    password: password to access the ftp server
    domain: specific address of the ftp server
    port: port where the server is run (this is port 21 by default)
  o local. Likewise, ITPilot provides access to resources in the local file system. The format to use is file://address, where:
    address: access path and resource name
• Reuse Connection. Marked by default, this indicates whether the browser used to date is reused or whether a new browser is launched, maintaining the session data. This option is generally marked, although in some cases, such as when the Iterator component is used, as explained in section 3.12, it may not be useful.
• Maximum retries. As indicated in the next section, where the processing of errors of some type for this component is configured to ON ERROR RETRY, this parameter determines the number of retries to 
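To make the parts of the ftp access string described above concrete (login, password, domain and port), the following sketch splits such a string with Python's standard library. It assumes the access string follows the usual URL form ftp://login:password@domain:port; the helper name parse_ftp_url and the example host are hypothetical, and the code is independent of ITPilot.

```python
from urllib.parse import urlsplit


def parse_ftp_url(url):
    """Split an ftp access string into its login, password, domain and port."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError("expected an ftp:// URL")
    return {
        "login": parts.username,
        "password": parts.password,
        "domain": parts.hostname,
        # Port 21 is the default when none is given, as in the description above.
        "port": parts.port or 21,
    }


print(parse_ftp_url("ftp://user:secret@ftp.example.com:2121"))
print(parse_ftp_url("ftp://user:secret@ftp.example.com"))
```

The second call omits the port, so the sketch falls back to 21.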
110. lowing tags to be understood, observing Figure 58. The first, known as ANCHOR, indicates that the SENDER attribute (identified as an attribute by its prefix character) has a link (as can be seen in Figure 5, where the sender data contains a link to the message details page). The ENDANCHOR tag indicates the end of that link. There is also the TAB tag, which indicates the existence of some kind of tab in the HTML page. The SUBJECT attribute holds the message subject and is wrapped by a link that also leads to the details page for this message. Finally, there is another tab tag, followed by the SIZE attribute and a fixed text, "kb".

ITPilot allows the way to access a new page to be indicated graphically, by clicking on one of the links displayed in the specification. The Record Sequence component is told how to access the details page from the main page as follows: double-click on one of the ANCHOR tags (the second in Figure 58), so that ITPilot assigns the value corresponding to the URL of the link to a dynamically generated attribute.

Figure 58. Record Sequence editor

By double-clicking on one of the ANCHOR tags, a line known as ANCHOR is added in the lower workspace, known as "Current Commands". The sequence editor allows for further processing commands to
111. ludes internationalization configurations for the most common zones. The zone names correspond with the codes defined in standard ISO 639 [ISO639]. Examples: ES_EURO (Spain), GB (Great Britain).

In the $DENODO_HOME/setup/vdp/metadata/properties/i18n path there is a file with the configured parameters for every zone, used by the Generation tool. The internationalization parameters of a location can be divided into various groups. The groups are listed below, and each of their parameters is described in detail.

NOTE: The internationalization parameters are case insensitive. For instance, "timeZone" and "timezone" correspond to the same key.

• Generic parameters
    • language: Indicates the language used in this location. It is a valid ISO language code. These codes contain two letters in lower case, as defined in ISO 639 [ISO639]. Examples: es (Spanish), en (English), fr (French).
    • country: Specifies the country associated with this location. It is a valid ISO country code. These codes contain two letters in upper case, as defined by ISO 3166 [ISO3166]. Examples: ES (Spain), ES_EURO (Spain with EURO currency), GB (England), FR (France), FR_EURO (France with EURO currency), US (United States).
    • timeZone: Indicates the time zone of the location (e.g. Europe/Madrid for Spain, GMT+01:00, MET, CET).
• Currency configuration
112. ly click on the "Revert to Saved" button.

Once this tag has been defined, a new tag set (myTextTagSet) must be created that contains TAB and EOLNOLINEBREAK. To do so, click on the corresponding icon in the central "TagSets" area and create the new set.

To link the tags to the tag sets, select and edit the new tag set. The arrows between the "tag sets" and "tags" areas are then enabled. Select the tags to be included in the tag set and click on the arrow. The "myTextTagSet" tag set will display the referenced tags in the "Included Tags" field. To complete the stage, save the tag set. If, after updating a tag set, you wish to reject the change and return to the previous version, simply click on the "Revert to Saved" button.

Lastly, create a new scanner and link it to the recently created tag set. The operation is similar to the step above: click on the corresponding icon in the left-hand part of the scanner generation window and create a new scanner (MyLexer). Then, with the new scanner marked, select the "myTextTagSet" tag set and click on the arrow between both areas to allocate it. The scanner generation window will be similar in appearance to that in Figure 68, where the "Included TagSets" field of the scanner area displays the "myTextTagSet" tag set. The "Standard" tag set is also added, as a str
113. Figure 86 Taskbar with an Example Selected

Figure 87 Assigning Example Values to Form Fields

4.8 PROPERTIES OF THE NAVIGATION BAR

By clicking on the Denodo icon on the left of the bar, a dialog opens which allows various properties of the generation process to be configured. The dialog consists of two panels, which configure the preferences for the criteria followed to generate NSEQL commands and the preferences for authenticated proxies, respectively.

4.8.1 Generating Sequences Using an Authenticated Proxy

If the Internet is going to be accessed through a proxy with authentication, it may be necessary to provide a value for the following parameters:

• PROXY LOGIN: user in the proxy.
• PROXY PASSWORD: u
114. ment and returns a character string which represents the POST parameters that have been used to access that page.

• GETCOOKIES: This function receives a Page type object as input argument and returns a character string with the current cookies.

• GETPAGETYPE: This function receives a Page type object as input argument and returns a character string with the access type (pool or http).

• TOPAGE (String connection type, String URL, String url method, String post parameters, String cookies): This function receives the connection type, URL, access method, POST parameters and cookies of a page as input arguments, and returns a Page type object that represents that specific page state.

6 APPENDIX B. CATALOG OF COMPONENTS

This appendix lists and defines each of the components available in Denodo ITPilot for use in the wrapper generation environment.

6.1 ADD RECORD TO LIST

6.1.1 Description

Adds a record to a list. This component must be used when there is a previous list (e.g. created using the CreateList component, section 6.3) to which new records are to be added.

6.1.2 Input Parameters

• Record: record to be added to the list. The number and type of record fields must be consistent with those existing in the list; otherwise, an error will appear on screen with the description "Input List has a different re
115. ment and returns it to the output with all the characters it comprises changed to lower case.

• UPPER: This function receives a text type argument and returns it to the output with all the characters it comprises changed to upper case.

• SUBSTRING: The substring function receives as parameters a text type argument and two integer numbers. It returns as output the part of the first argument that corresponds to the positions indicated by the second (beginning) and third (end) arguments.

• REGEXP: This function allows for transformations on character strings based on regular expressions. It is given three arguments: one text type element, one input regular expression and one output regular expression. The regular expressions must be expressed using the regular expression syntax of the Java language [REGEX]. The function behaves in the following manner: the input regular expression is evaluated against the text from the first argument, and the output regular expression may include the groups defined in the input regular expression. The portions of text matching them will be replaced in the output expression. For example, the result of evaluating:

REGEXP('Shakespeare,William', '(\w+),(\w+)', '$2 $1')

will be the text type value "William Shakespeare".

• REMOVEACCENTS: This function receives a text type argument and returns that same argument value, but with no accents.

• REMOVEWHITESPACES: This function receives a text type argument and returns that same argument value, but with no blanks.

• SIMILARITY (value1:text, value2:text, algorithm:text): This function receives two character strings and returns a value between 0 and 1, which is an estimated measurement of similarity between the strings. The third parameter (optional) specifies the algorithm to use to calculate the similarity measurement. ITPilot includes the following algorithms (if no algorithm is specified, ITPilot chooses the one to apply):

    1. Based on the editing distance between the text strings: ScaledLevenshtein, JaroWinkler, Jaro, Level2Jaro, MongeElkan, Level2MongeElkan.
    2. Based on the appearance of common terms in the texts: TFIDF, Jaccard, UnsmoothedJS.
    3. Combinations of both: JaroWinklerTFIDF.

• TRIM: This function receives a text type argument and returns the same argument with all the spaces and carriage returns at the beginning and end removed.

5.3 LIST HANDLING FUNCTIONS

• SIZE: This function accepts a list as an argument and returns the number of elements comprising it.

• ELEMENTAT: This function accepts a list and an integer as input arguments and returns the record in the position expressed by the integer value in the list. The first position of the list is accessed by the value 0.

5.4 DATE PROCESSING FUNCTIONS

Date functions allow date values to be manipulated.

• NOW: This function cre
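The behavior described for REGEXP and SIMILARITY can be approximated in Python. This is a minimal sketch, not ITPilot's implementation. It assumes that $1, $2, ... in the output expression name the groups of the input expression, and it assumes that an edit-distance similarity can be scaled to the 0 to 1 range by dividing by the length of the longer string (the manual does not define ScaledLevenshtein). The function names are hypothetical.

```python
import re


def regexp(text, input_regex, output_regex):
    """Rewrite text, substituting captured groups into the output expression."""
    # Translate Java-style group references ($1) into Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", output_regex)
    return re.sub(input_regex, template, text)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit-distance similarity scaled to [0, 1] (assumed scaling)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

With these definitions, regexp("Shakespeare,William", r"(\w+),(\w+)", "$2 $1") returns "William Shakespeare", matching the example given for REGEXP.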
117. mode. Whilst frame handling is normally transparent to the user, when the system is used for data extraction tasks it is sometimes necessary for the user to directly specify a frame in the last step of a navigation sequence. See section 4.4 for more information.

• Transpose Table: This allows a table to be transposed, transforming its row vectors into column vectors, which is extremely useful when wishing to obtain results from ITPilot where each register is a column instead of a row. See section 4.5.

• Domain: It may be necessary to parameterize the navigation sequences according to certain values received during execution through a wrapper created using ITPilot. See section 4.7.

• Enabled PopUps: The sequence generator supports the creation of navigation sequences that involve actions on pop-up windows. For this, the "Allow pop-ups" button should be activated on the bar. If it is not activated, no pop-up window will be allowed to appear during the sequence recording.

• Select Anchor: Tells the sequence generator that the link to be followed next in the recording process is not an HTML resource but a PDF or Microsoft Word document. After this button is clicked, when a link is later clicked ITPilot will convert the resource into HTML using Word HTML and Word PDF converters so that the specifications generation to
118. mponents, relate them to each other and configure them.

3. Component configuration section: This contextual area allows for the selected component to be configured. The configuration of each component is divided into three parts: "Inputs", where the input of data to the specific component is indicated; "Wizard", where each component is configured with its specific characteristics; and "Details", where its output (which may be collected as an input by one or more other components) is described along with other characteristics indicated below.

The workspace already displays two components as part of the current process. These components indicate the process initialization and completion statuses. The initialization component is described in the following section, although it is first necessary to explain the types of input and output parameters that can exist.

3.5.1 Input and Output Parameters

The following types of input and output parameters can be used by the components:

• Pages: Some components, such as the browsing sequence (Sequence) component, return a page as the result, which another component will use, for example, to extract information.

• Registers: Other components (e.g. the Record Constructor) return a structured group of information; in this example, this may be the structured representation of an information item.

• Lists of registers: ITPilot allo
119. multi-frame page from which data is going to be extracted using the Denodo ITPilot extraction tools (see section 3.8 and [DEXTL]), one final step must be followed before ending the sequence. This step consists in selecting the frame in which the data to be extracted are found.

To do this, after completing the navigation sequence and before saving it on disk and returning to normal mode, the user should follow the steps below:

1. Use the mouse to highlight any text from the frame to be selected.
2. Press the SelectFrame button.
3. Now save and end the sequence in the usual manner.

4.5 THE TRANSPOSETABLE BUTTON

The process followed by ITPilot for extracting data saved in tables organizes the resulting tuples based on the rows of the table and the fields based on the columns. Hence, from a table with n rows and m columns, a DEXTL program would obtain n registers, each one with m fields or attributes. This is normally sufficient, as it is the logical structure of a table. However, it is sometimes useful for each ITPilot register to take its data from a column (e.g. in tables whose columns provide information on different time periods and where information is to be obtained per period of time). Although a possible solution involves extracting the information row by row to subsequently restructure it from the cli
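The effect of transposing a table, as described in section 4.5, can be shown with a small Python sketch. It is an illustration only and does not reflect how the TransposeTable button is implemented; the function name and the sample data are hypothetical, and the table is assumed to be rectangular.

```python
def transpose(rows):
    """Turn a table of n rows and m columns into m rows and n columns."""
    # zip(*rows) pairs up the i-th cell of every row, i.e. one column.
    return [list(column) for column in zip(*rows)]


# Each row is an indicator and each column a time period.
table = [
    ["Revenue", 10, 12, 15],
    ["Costs", 7, 8, 9],
]

# After transposition each register corresponds to one original column.
registers = transpose(table)
```

Here the 2 x 4 table yields four registers of two fields each, so that a row-based extraction now returns one register per original column.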
120. n window of the specifications generation tool. From this window, it is possible to browse to the results page and mark the text for each value using the mouse. Then, by clicking on the aforementioned "Assign Selected Text" option, the value of the required field will be added and will appear to the right of the field name (FIELD = VALUE).

2. By previously entering the value in the text area displayed upon double-clicking on the field. The assigned value will immediately appear to the right of the field name (FIELD = VALUE).

We recommend that the first option be used wherever possible, so that ITPilot is able to obtain additional information from the DOM tree of the HTML page, thus allowing a more adequate generation of the DEXTL program. Besides, if only the second option is used, all examples must come from the same web page. Returning to Figure 5, we can tag each of the elements of the listing's first e-mail and relate them to the elements of the structure: SENDER (John Smith), SUBJECT (Data Integration Approach), SIZE (1) and DATE (01/31/07). To do this, use the mouse to select the value "John Smith" in the browser window, place the cursor on the element SENDER in the first example of the third tab in the specifications generator tool, then click the right button and select the option "Assign Selected Text". The result is shown in Figure 33: the element SENDER has the "Assigned Value" "John Smith".
121. Figure 6 Content of a message

The Web application displays the messages 20 at a time; to access the next messages, click on the right arrow.

3.3 STARTING THE SPECIFICATION GENERATOR TOOL

After starting the tool, by executing the <DENODO_HOME>/bin/StartITPAdminTool (.bat or .sh) program or double-clicking the "Start Wrapper Generator Tool" icon (if it was generated in the installation process), a window such as that shown in Figure 2 appears.

An example is provided below to show how the application works. The objective is to obtain the e-mail list automatically and in a structured manner.

3.4 CREATING A WRAPPER

Wrapper programs must be created within the context of a project. These projects can be created, modified and deleted from the ITPilot specifications generation tool. In this case, a "MAILWRAPPERS" project is to 
122. nfiguration of Sequences with the Record Sequence component
Test window of the Record Sequence component
Use of the Extractor component to obtain information of the detail pages
Adding a data Iterator from the detail pages
Configuration of input values of the Record Constructor component
Output record of the Record Constructor component
Scanner and Tag Set Generation Tool
Generated Scanner and Tag Set
Tabulated Results of a Bookshop
Result of the DEXTL Program Test on DETAIL
Delimiting the Beginning of the Extraction
Assigning a name to a custom component
Selecting the output type of the custom component
Using a custom component in a new process
Component Configuration Area
Wrapper Maintenance Check
Navigation Sequences Generator taskbar
123. ng or the describing of conditional flows.

The following pages explain the functions of this tool by generating an access wrapper for an e-mail Web source. The first part of the example, as of section 3.2, covers the basic capacities common to almost any Web wrapper, whereas the second, starting at section 3.15, focuses on more advanced matters that provide the tool with greater power and versatility.

PART I

In this first part, a complete and functional example is used to study the basic functions of the specification generator tool.

3.2 PRESENTATION OF THE EXAMPLE

Figure 4 shows the home page of a Denodo Technologies e-mail Web site, accessed at http://mail.demos.denodo.com. In this manual, the specifications generator tool is used to obtain the list of incoming e-mails, with an increasing level of detail, in a structured manner.

Figure 4 Denodo WebMail home page

Enter the UserName "demos" and the Password "DeMo 04" to access the Web e-mail application. The following window will appear (Figure 5).
124. 3.10 GENERATING PATTERNS
3.11 GENERATING THE SPECIFICATION
3.12 ITERATION OF RESULTS OBTAINED
    3.12.2 Individual record management
    3.12.3 Returning of results
3.13 WRAPPER ADVANCED OPTIONS: BACK SEQUENCE AND LOCALE
    3.13.1 Back Sequence
3.14 WRAPPER GENERATION, TESTS AND EXPORTING
    3.14.3 Wrapper Exporting
3.15 EXTRACTING MULTIPAGINATED DATA
3.16 ACCESS TO DETAILS PAGES
    3.16.2 Field Modification in the Extractor component: DATE field
    3.16.3 Access to the Details Page from the Main Page
    3.16.4 Back Sequence in the Browsing Components
125. o carry out any of the following tasks: generate HTML wrappers for use within Denodo Virtual DataPort, and/or use Denodo ITPilot for web automation or data extraction.

SUMMARY OF CONTENTS

More specifically, this document on Denodo ITPilot describes both generation tools:

• Specification Generator:
    o The wrapper will be modeled as a process flow comprising different components. Each component has a specific function on a group of inputs and produces a group of outputs. This document describes the structure of the flows and the components comprising them.
    o Describes how the Process Building tool works with a series of examples, and how to use it to generate wrappers on sources with different levels of difficulty.
    o Explains how to export the recently generated wrapper to the ITPilot running environment.
• Navigation Sequences Generator:
    o Describes the main objectives of visually generating Navigation Sequences.
    o Provides a general overview of its architecture and installation procedures.
    o Describes how to use it to visually generate navigation sequences of any level of complexity.

1 INTRODUCTION AND INSTALLATION

1.1 PRESENTATION

This document centers on the graphic tools from Denodo Technologies that allow information to be extracted visually from web sources. They can also be used to extract information from documents in Word and/or PDF format.

Specificall
126. ocesses, form filling, frame selecting, etc.

Denodo ITPilot includes a command language called NSEQL (Navigation SEQuence Language) for defining complex browsing sequences that are run using a pool of automated browser instances. There can be three types of browsers:

• Instances of Microsoft Internet Explorer (MSIE) [MSIE].
• Instances of Mozilla Firefox [FRFX].
• Instances of a mini browser based on an HTTP client, embedded in ITPilot.

In the first two cases, this language allows the browser event model to be managed and exactly replicates the behavior of a human user of MSIE or Firefox carrying out any browsing sequence. Thus, to implement complex browsing, the developer does not need to worry about low-level aspects such as the use of JavaScript code, session maintenance systems, the use of HTTPS, etc. In the third case, the browsing sequences will normally run more efficiently, but the system will not deal with browsing involving JavaScript code.

Although NSEQL navigation sequences are simple to write [NSEQL], for added comfort and speed Denodo ITPilot also incorporates the Navigation Sequences Generator module dealt with in this manual.

The Navigation Sequences Generator takes the form of a taskbar that is installed in an MSIE browser. Once installed, it can be used to generate any navigation sequence on the user's browser. The generator records the events generated by the user whilst navigating and automatically translates t
127. of the message

• MESSAGEDATE: e-mail reception date.
• SIZE: size of the message in KB.

NOTE: there is a set of ITPilot reserved words which cannot be used as element names of the generated structure. These keywords are shown in Table 1.

ADD, ADD OBJECT TO LIST, ADMIN, ALL, ALTER, AND, ANY, ARRAY, AS, ASC, BASE, CACHE, CATCH, CLEAR, CONDITION, CONNECT, CONSTRAINTS, CONTEXT, CREATE, CREATE LIST, CROSS, DATABASE, DATABASES, DATASOURCE, DATASOURCES, DESC, DF, DISTINCT, DROP, ENCODED, ENUMERATED, EXCEL, EXISTS, EXPRESSION, EXTRACTOR, FALSE, FETCH, FIELD, FILTER, FLATTEN, FOR, FORM ITERATOR, FROM, FULL, GENERIC, GRANT, HASH, HELP, HTML, I18N, I18NS, IF, INIT, INNER, INPUT, INPUTREWRITE, INVALIDATE, IS, ITEM, ITERATOR, JDBC, JOIN, LEFT, LIST, MAP, MAPS, MERGE, MY, NATURAL, NESTED, NEXT INTERVAL ITERATOR, NOS, NOT, NULL, OBL, ODBC, OF, OFF, ON, ONE, OPERATOR, OPERATORS, OPT, OR, ORDERED, OUTER, OUTPUT, OUTPUTLIST, OUTPUTREWRITE, PAGE, PATTERNS, POST, PRIVILEGES, QUERY, QUERYPLAN, RAW, RAWPATTERNS, READ, RECORD, RECORD CONSTRUCTOR, RECORD SEQUENCE, RECORD STRUCTURE, REGISTER, REVERSEORDER, REVOKE, RIGHT, SEARCHMETHOD, SELECT, SEQUENCE, SESSION, SIMPLE, STOREFILE, SWAP, TABLE, TRACE, TRUE, TRY, TII, TYPE, TYPES, UNION, USER, USERS, USING, VAR, VDB, VIEW, VIEWS, VQL, WHERE, WHILE, WRAPPER, WRAPPERS, WRITE, WS, XML, XML2BIN, ZERO

Table 1 List of Reserved Words

These elements form the Web source data extraction structure. How to represent this in t
128. ol can subsequently process it.

• CloseWindow: If, as part of the navigation sequence, a pop-up is to be closed, simply click on the CloseWindow button on the bar and drag it over the pop-up window to be closed. The event will be recorded by the generator and incorporated into the generated NSEQL program.

• Semaphore: Only appears in the recording mode. This element is not a button but an indicator for the user. Each time the browser changes page during the sequence generation process, the red disk on the semaphore lights up until the system is ready again to continue recording events, at which time the green disk lights up. Thus, after accessing a new page during sequence generation, the user should wait for the semaphore to turn green before proceeding.

• Properties: By clicking on the Denodo logo on the left of the bar, it is possible to configure various aspects of the Navigation Sequences Generator functions (see section 4.8).

4.3 STEPS FOR GENERATING A NAVIGATION SEQUENCE

This section provides a step-by-step description of how a navigation sequence is normally generated.

1. Click on the Rec button to enter the record mode.
2. A dialog box appears requesting the initial URL of the sequence. This can either be written directly or pasted from the clipboard (right mouse button, Paste option). For example, the Denodo example site can be used: http://mail.demos.denodo.com.
3. The browser automatically loads the initial
129. omponent.

6.16 RECORD CONSTRUCTOR

6.16.1 Description

This component allows for a record to be constructed using other records generated in the flow, as well as generating attributes derived from existing ones.

6.16.2 Input Parameters

Record Constructor accepts zero or more records and zero or more lists of records as input, which it uses as variables to build the output record, either by linking records or elements from the lists or by constructing derived fields.

6.16.3 Output Values

The component returns one record.

6.16.4 Details of the component

See section 3.12.2 for a more in-depth explanation of the component.

6.17 RECORD SEQUENCE

6.17.1 Description

This component creates a browsing sequence from the results of a record. It allows for sequences to be created for access to other pages from pages processed by the Extractor component.

6.17.2 Input Parameters

The Record Sequence accepts the following as input:

• One record from the extractor, from which the necessary information is obtained to create the new browsing sequence.
• Zero or more input records from other components, used as assignment variables for the new sequence.
• Page: page from which browsing is started.

6.17.3 Output V
130. onsists of carrying out a "Back" action, as if the "Back" button on the browser had been clicked.

c. Where the previous box is not marked, the user may load a specific browsing sequence using the "Load from File" or "Import from Browser" buttons, as indicated above for the Sequence component (see section 3.7).

• Global Form Sequences: Sometimes, the actions carried out on each result from a Web page to access other pages (e.g. the details pages of this example) all belong to a single form. This means that, before being able to click on these links, ITPilot must find the form to which they belong in order to identify it and to know, where necessary, how to run it (e.g. by clicking a "Submit" button or a specific link). In these cases, ITPilot allows browsing sequences to be added in the "Global Form pre-Sequence" area for the actions prior to running the sequence, and in the "Global Form post-sequence" area for subsequent actions. The sequences can be entered by hand, although the "Load from File" and "Import from Browser" buttons allow a browsing sequence to be imported. ITPilot will search for the first "FindFormByXXX" type command (see [NSEQL] for further information) and will copy everything up to and including this command to the "Global Form pre-Sequence" area and everything after it to "Global Form post-Sequence". See [NSEQL] for further detail
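The way an imported sequence is divided between the two Global Form areas can be sketched in Python. This is an illustration of the rule stated above, not ITPilot code. The command strings are placeholders rather than verified NSEQL syntax, the function name is hypothetical, and raising an error when no FindFormByXXX command exists is an assumption, since the manual does not describe that case.

```python
def split_global_form_sequence(commands):
    """Split a sequence at the first FindFormByXXX-type command.

    Returns (pre_sequence, post_sequence). The pre-sequence ends with the
    FindFormByXXX command itself; the post-sequence holds what follows.
    """
    for index, command in enumerate(commands):
        if command.strip().startswith("FindFormBy"):
            return commands[: index + 1], commands[index + 1:]
    # Assumption: the manual does not say what happens in this case.
    raise ValueError("no FindFormByXXX command found in the sequence")


# Placeholder command strings, used only to show the split.
imported = ["StepA", "FindFormByXXX(...)", "StepB", "StepC"]
pre_sequence, post_sequence = split_global_form_sequence(imported)
```

In this example the pre-sequence holds the first two commands and the post-sequence the last two.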
131. oorly, even though, in fact, the sequence is being generated correctly.

In particular, some Web sites only present users with authentication forms when they are accessing the system for the first time after starting up the browser (or after a certain session expiry time lapses).

Thus, if during the generation of a sequence that requires login/password authentication an attempt is made to reproduce the sequence in a new browser window, the reproduction may fail because the session in the Web site is still open and, thus, it is not possible to locate the login/password form (which did appear, however, when the sequence was being generated).

A similar situation can arise when, in a Web site, the effects of any other navigation event vary according to whether or not a session has been established.

The solution to this problem is very simple: the sequence is being generated correctly and the only difficulty arises when checking that it functions correctly. To overcome this difficulty, simply follow the steps below:

1. Save the generated sequence to disk using the Save button.
2. Press the Stop button to end the sequence.
3. Close the active session on the Web site on which the sequence has been generated.
4. Use the Open button to execute the generated sequence and check that it is functioning correctly.

4.4 THE SELECTFRAME BUTTON

When the sequence generated is going to be used to access a 
132. ord list, where the record structure is defined by the query performed on the database.

6.12.4 Example

In many cases, the web applications from which data is retrieved require input parameters that are actually stored in other repositories. One example is the employee identifiers of a financial institution, which the institution uses to access its intranet and thereby perform service quality control of its internal applications. With ITPilot, this task is simplified by using the JDBCExtractor component. Figure 109 shows part of the process. The component executes a query against a relational database, from which an employee list is obtained. Then, an iterator is used so that the internal web application is accessed, one employee id at a time, to extract the data that allow the validation process to work.

Figure 109 Access to Information from a Relational Database

The component configuration can be divided into the following sections:

• The Inputs tab allows values or records to be added that are going to be used as variables in the configuration parameters.

• In the component wizard there are three configuration tabs. The first one is used to configure the  
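The flow of Figure 109 (query a database for a list of records, then iterate over it to drive one web access per record) can be imitated outside ITPilot with a short Python sketch. Everything here is hypothetical: the employees table, its employee_id column and the check_intranet stand-in are invented for illustration, and SQLite replaces the JDBC source so that the example is self-contained.

```python
import sqlite3


def fetch_employee_ids(connection):
    """Play the role of the JDBC extractor: one record per employee."""
    cursor = connection.execute(
        "SELECT employee_id FROM employees ORDER BY employee_id"
    )
    return [row[0] for row in cursor.fetchall()]


def check_intranet(employee_id):
    """Stand-in for the Sequence and Extractor steps run for one employee."""
    # A real flow would browse the internal application here.
    return {"employee_id": employee_id, "status": "checked"}


def run_flow(connection):
    """Iterate over the extracted list, one employee id at a time."""
    return [check_intranet(emp_id) for emp_id in fetch_employee_ids(connection)]
```

The list comprehension in run_flow plays the part of the iterator: each record from the query becomes the input of one web access.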
133. ords from the data extractor of the details page for each message. See Figure 65.

Figure 65 Configuration of input values of the Record Constructor component

The "Wizard" tab provides access to the component editor, where the fields that form part of the wrapper output record can be chosen. As in section 3.12.3, click on the button of the available attributes to enable them, with a result similar to that shown in Figure 66.

Figure 66 Output record of th
134. 6.3 CREATE LIST

6.3.1 Description

Creates an empty list. Some components require a list of records as their input field. In other cases, the results list of a component needs to be enriched with information from other parts of the process.

6.3.2 Input Parameters

This component requires no input parameters.

6.3.3 Output Values

This component returns an empty list.

6.4 DIFF

6.4.1 Description

The Diff component allows two web pages to be compared, returning the differences between them in terms of the HTML code obtained.

6.4.2 Input Parameters

This component has the following input parameters. On the one hand, a character string ("Original page source code") that contains the source code of the original page. On the other hand, the page with which it is compared, which can be entered in two different ways: either as a character string that contains the page code, or as a page type object such as that returned by the Sequence component. If this last option is used, its base URL will be used as the base URL of the output HTML code.

6.4.3 Output Values

The component returns a character string that contains the HTML code of a page that displays the differences between the pages entered as component input parameters.

6.4.4 Use

In some cases, the decisions to take in the Web automation process must be base
135. ourse, the direct use of http sequences is not possible in every kind of Web processing. In general, Web sources that use session variables and JavaScript code for processing forms, links, pop-ups, etc. are unable to use this option.

To select it, select the "Advanced" tab from the "Options" menu of the Denodo task bar and choose the "http" option from the "Sequences Type" section, as indicated in Figure 90.

Figure 90 Browser Sequence Type Selection Window

From this screen, the maximum waiting time for browsers when executing a sequence can also be configured. This parameter is used when the browser is run from the task bar.

4.9 SELECTION OF PDF AND HTML CONVERTERS

When the user presses the Rec button in the sequence generation tool, or selects the "Anchor" type with the Select Anchor button, he/she can decide whether an Adobe PDF or a Microsoft Word converter must be used to extract structured information from those resources. The user can configure the specific extractor to be used from the list provided by ITPilot. The selectable values of the Select Anchor button are the following:

1. Word: use of the Microsoft Word to HTML converter. Currently, ITPilot provides one conversion tool, which uses the Open Office conversion capabilities.
2. PDF: use of the PDF to HTML converter. Currently, ITPilot provides three converters:
a. Acrobat HTM
136. ow for the graphic saving of the browsing sequence to be used.

• Selection of Domain Values: A second bar appears where the selection of values for a domain previously created in ITPilot is necessary to complete specific fields in a browsing sequence (further information on Domains can be found in section 4.7.2).

Figure 3 Sequence Generation tool

Below is a detailed description of both tools and how they complement each other. Firstly, a series of examples will be given to explain the functions of the Specifications Generation tool, where an initial approach to sequence generation is also to be found. The second part of the manual will provide details on the Browsing Sequence Generation tool.

3 SPECIFICATION GENERATION MANUAL

3.1 INTRODUCTION

This section describes the Denodo ITPilot Specifications Generation tool, a graphic application that allows non-technical users to create Web wrappers in an easy and intuitive manner. The basic operation consists of using graphic components to generate workflows that automate access to Web sources. These components implement tasks such as browsing to a certain page, extracting useful information based on examples of results tagged by the user, iterating over the results obtained for subsequent processi
137. ples. When should this option be used? When the generated specification has been checked and is seen to be returning more results than required. Other alternatives are to use the following option ("Strict Patterns") and to manually enter the FROM and TO elements, which delimit the beginning and end of the extraction area (for more information on these elements and their syntax, see [DEXTL] and section 3.18).

• Pattern combination: clicking on the "Merge patterns" check box. This option is marked by default. It is extremely useful when the source page requires the generation of a large number of optional data elements, as it can reduce the number of examples that must be entered to a minimum. Furthermore, the DEXTL program resulting from this option is more compact.

When the Generate button is pressed, the DEXTL program text corresponding to this specific level appears on the screen. See Figure 36 for a specific example on the home page of the Web e-mail application.
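As a rough illustration of what delimiting an extraction area means, the following Python sketch keeps only the text between a start marker and an end marker before any pattern would be applied. It is not DEXTL: the function name `extraction_area`, the use of plain-string markers and the choice to exclude the markers themselves are assumptions made for this example. The actual FROM and TO syntax is the one defined in [DEXTL].

```python
def extraction_area(page_text, from_marker, to_marker):
    """Return the text between from_marker and to_marker, or None if either is missing.

    Hypothetical helper: it only mimics the idea of an extraction area
    delimited by a beginning (FROM) and an end (TO).
    """
    start = page_text.find(from_marker)
    if start == -1:
        return None
    # Assumption: the area begins right after the FROM marker.
    start += len(from_marker)
    end = page_text.find(to_marker, start)
    if end == -1:
        return None
    return page_text[start:end]


if __name__ == "__main__":
    page = "<h1>Inbox</h1><table>rows</table><p>footer</p>"
    print(extraction_area(page, "<table>", "</table>"))
```

Narrowing the area in this way is one reason a specification may stop returning more results than required: text outside the two markers is never examined.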
138. put data they receive. Denodo components allow practically any operation on HTML-based web sources, and also provide some capabilities for extracting information from Microsoft Word [WORD] and Adobe PDF [PDF] files.

Graphically, users can select the components required for a specific Web automation process from a palette. Each component can be linked to others through information input and output relations. Thus, the result of a component may be used as input for others. For example, the extraction component (the "extractor") returns a list of results that may be used as input for an iteration component (the "iterator"). This component, in turn, returns one of the elements of the list in each iteration. There are also fork, transformation and output components. A full description of each component, in the form of a reference guide, can be found in section 6.

Denodo ITPilot generates a wrapper program in JavaScript [JSDENODO] based on this graphic description of components. This program contains the declaration of each component and their relations. The ITPilot components related to browsing sequences and information extraction tasks also use specific ITPilot browsing and extraction languages known as NSEQL (see [NSEQL] for further information) and DEXTL (see [DEXTL] for further information), respectively.

The ITPilot specifications generation tool allows the generated wrapper to be tested and debugged before deploying it in the
139. quence may be different.

• Reuse connection: Ticking this check box tells ITPilot to use the same browser used until now in the process. This is basically for efficiency purposes. If it is not marked, ITPilot launches a new browser and exports the session data of the browser used until then to the newly created one (this is useful and necessary when, for example, parallel runs are made in an iteration).

• Use of the Back Sequence: There are two boxes and a workspace related to this function. ITPilot lets users decide whether ITPilot is responsible for going back to the previous page after each iteration on details pages, whether the users themselves provide a specific browsing sequence, or even whether to go back to the main page at all. The following graphic elements are used:

a) "Use Custom Back Sequence" check box: This is marked when a back sequence is to be used to go back to the previous page. If it is not marked, ITPilot will generate a navigation to the previous page with the same browser through an HTTP POST or GET method. This action is usually slower than navigation by means of a sequence. It is important to emphasize that the back sequence is performed at the beginning of the following iteration, not at the end of the current one.

b) "Default Sequence" check box: This is only enabled when the previous box has been marked. It informs ITPilot that the default sequence will be used, which c
140. Figure 29 Extraction Structure

As can be seen in the advanced example, this hierarchic structure may be as complete as required by adding new hierarchic levels, depending on how the wrapper output structure is to be modeled.

The required "Tag Set" can be selected in each level. A tag represents a regular expression defined using HTML tags. Usually, tags are used to specify in a uniform manner basic representation primitives that can be expressed in different ways in HTML. For example, an ENDOFLINE separator can be defined as follows: EOL = ("<br>" | "</p>" | "</tr>" | "</td>" ([ \n\r\t])* "</tr>"). A tag set is simply a group of tags. The "Standard" tag set is used by default, which is valid for the vast majority of Web source extractions and is used in this first example. For more information, see [DEXTL] and section 3.17.

Click with the right-hand button of the mouse on the structure elements to view a contextual menu with the options to change the name, add a new child node, delete the node and three other options described below:

• Aliases: Some types of wrappers can be automatically maintained by ITPilot (see [USE] for further information). In these cases, the field enables users to assign words as synonyms or tags that can describe this field in different 
141. r component has received information where one of the attributes is a value indicating the birth date of the specific person referred to in the record. Information is to be obtained on each of the years in which this person has been alive. This is done by accessing another resource that accepts as input the specific year for which information is to be obtained, and returns the most relevant events from that year.

Using ITPilot, it is possible to construct a loop in which each iteration accesses one of the specific years. The components acting in this part of the process are shown in Figure 111.

For each record obtained by the Extractor component, and after some kind of conversion in the RecordConstructor component, an Expression component is created that obtains the age of this person by applying the expression:

SUBTRACT(2007, GETYEAR(TODATE('MM/dd/yyyy', Record Constructor 1 output.DATE)))

Here the date of birth appearing in the record is subtracted from the current date to obtain the age of this person (to simplify the example, only the year of birth is taken into account). With this, the output condition of the loop can be created simply:

AGE   0

where AGE is the expression resulting from the previous subtraction.

Within the loop, the only thing remaining is to create another expression that subtracts 1 from the AGE expression in each iteration:

SUBTRACT(AGE, 1)
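The loop described above can be sketched in plain Python to make the arithmetic easier to follow. This is only an illustration, not the code that ITPilot generates. The function names are hypothetical, the MM/dd/yyyy date format is an assumption, and the comparison operator of the loop condition is not legible in the source, so the sketch assumes that the loop runs while AGE is greater than 0. Mapping each AGE value to a calendar year as `current_year - age` is likewise an assumption.

```python
from datetime import datetime


def age_from_birth_date(birth_date_text, current_year=2007):
    """Mirror SUBTRACT(2007, GETYEAR(TODATE(...))): only the year is used."""
    # Assumption: dates are written as month/day/year, e.g. "01/31/1977".
    birth_date = datetime.strptime(birth_date_text, "%m/%d/%Y")
    return current_year - birth_date.year


def years_alive(birth_date_text, current_year=2007):
    """Return the years that the loop would visit, one per iteration."""
    age = age_from_birth_date(birth_date_text, current_year)
    years = []
    # Assumed loop condition: keep iterating while AGE > 0.
    while age > 0:
        # Assumed mapping from the AGE counter to a calendar year.
        years.append(current_year - age)
        # Mirror SUBTRACT(AGE, 1) at the end of each iteration.
        age = age - 1
    return years


if __name__ == "__main__":
    print(years_alive("03/15/2004"))
```

In the real process, each value produced inside the loop would be passed to the component that queries the resource for the events of that year.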
142. r in the workspace together with an area to fill in the value of the parameters of the function. The values of the parameters should be expressions present in the list of created values (right-hand box) or attributes. To assign an expression already created as a parameter of a function, drag and drop the created expression to the parameter area. On clicking the > button beside the function, it will appear in the list of expressions created (right-hand box).

Figure 42 New record field editor

In this case, select the GETDAY function from the Functions area and drag and drop the MESSAGEDATE attribute from the WEBMAIL record to the GETDAY function in the "Values" box. The result can be seen in Figure 43.

Figure 43 Creation of a derived attribute from the GETDAY function

Now simply click on the > button to move it to the right-hand box. Given that no more are required, drag and drop the result to the "Expression Val
143. removed content: This text box indicates the prefix to use for the deleted contents when generating the results page (red HTML tag by default).

• Suffix for removed content: This text box indicates the suffix to use for the deleted contents when generating the results page (red HTML tag by default).

• Case sensitive: This indicates whether the marking of changes is case sensitive. This is not marked by default.

• Ignore tag attributes: This checkbox configures whether the HTML tag attributes are ignored when the pages are compared. This does not affect the results HTML page generation process. This option is not selected by default.

• Return null if page has not changed: This checkbox, marked by default, indicates that, if the results page is equal to any of the two input pages, the component returns "null" instead of the page itself.

6.5 EXECUTE JAVASCRIPT

6.5.1 Description

This component allows JavaScript code to be added, which will be executed on the current page in the browser.

6.5.2 Input Parameters

The Execute JavaScript component accepts an input page (mandatory).

6.5.3 Output Values

The output value of the component is a page that results from executing the JavaScript code.
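Returning to the Diff options listed before section 6.5 (prefixes and suffixes for changed content, case sensitivity and returning null when nothing has changed), the following Python sketch shows how such options could interact. It is not the ITPilot implementation. It compares whitespace-separated tokens with the standard `difflib` module, and the function name `mark_differences` and the default `<ins>`/`<del>` markers are assumptions chosen for the example.

```python
import difflib


def mark_differences(original, modified,
                     added=("<ins>", "</ins>"),
                     removed=("<del>", "</del>"),
                     case_sensitive=False,
                     null_if_unchanged=True):
    """Return modified text with changes wrapped in prefix/suffix markers."""
    old_tokens = original.split()
    new_tokens = modified.split()
    # With case_sensitive=False, tokens are compared in lower case.
    key = (lambda token: token) if case_sensitive else str.lower
    matcher = difflib.SequenceMatcher(
        None, [key(t) for t in old_tokens], [key(t) for t in new_tokens],
        autojunk=False)
    output = []
    changed = False
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            output.extend(new_tokens[j1:j2])
            continue
        changed = True
        if op in ("delete", "replace"):
            # Removed content gets the "removed" prefix and suffix.
            output.append(removed[0] + " ".join(old_tokens[i1:i2]) + removed[1])
        if op in ("insert", "replace"):
            # New content gets the "added" prefix and suffix.
            output.append(added[0] + " ".join(new_tokens[j1:j2]) + added[1])
    if not changed and null_if_unchanged:
        return None  # Mirrors "Return null if page has not changed".
    return " ".join(output)


if __name__ == "__main__":
    print(mark_differences("Price: 10 EUR", "Price: 12 EUR"))
```

With the default settings, two inputs that differ only in letter case count as unchanged, so the function returns `None`. Passing `case_sensitive=True` makes the same inputs produce a marked result.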
144. rkspace together with a text area to fill in the value of the constant. The required value can be written directly in the text area.
3. On clicking the ">" button, the new constant will appear in the list of values created (upper right-hand box).

Figure 97 Creation of a constant value in the Expressions Editor

To create a new function type expression, the following actions are required:

1. Select the required function in the "Functions" drop-down menu on the left side of the screen and click on it or drag and drop it to the workspace for creating expressions (box on left).
2. The function selected will appear in the workspace together with an area to fill in the value of the function parameters. The values of the parameters should be expressions present in the list of created values (right box) or attributes. To assign an expression already created as a parameter of a function, drag and drop the created expression to the parameter area. Press the > button that appears beside the fun
145. rmal indicates that the "WaitPages" command will be used, which makes the browser wait until a certain number of pages have been downloaded before continuing to run the remaining commands of the NSEQL program. The "Extended" option indicates the use of "extendedWaitPages", which performs the same operation but allows the system to check the number of pages remaining before continuing to browse.

Figure 89 shows a view of this configuration window.

Figure 89 NSEQL Options Window

4.8.3 Choosing the Browse Sequence Type

The browse sequence generation tool allows browse sequences to be saved in two different languages. In general, the suitable option, and also the default option, is to generate NSEQL programs. However, if the Web source to be accessed complies with a series of characteristics (described below), it is faster to use pattern http sequences. These sequences are based on http requests (the underlying protocol in all Web communications) without using any browser as an intermediary, hence making them more efficient.

Of c
146. rning the only input record is returned (sender, subject, date and size of message), although new attributes based on the date are added so that the day, month and year values are returned separately.

4 Where no modification is to be made to the records returned by the Iterator component, the Record Constructor component does not have to be used and the Iterator output can be linked directly to the input of the Output component, which is explained below.

As usual, drag the component from the component bar and add it to the process, as indicated in Figure 40. Thus, the output records that the iterator returns after each iteration will be taken as input elements in the Record Constructor component.
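The effect of adding day, month and year attributes to the output record can be pictured with a short Python sketch. It is only an analogy for what the Record Constructor produces. The field names SENDER, SUBJECT, MESSAGEDATE and SIZE come from the example, while the output names DAY, MONTH and YEAR, the month/day/year date format, the sample SIZE value and the function name are assumptions.

```python
from datetime import datetime


def build_output_record(webmail_record, date_format="%m/%d/%Y"):
    """Copy the input record and add DAY, MONTH and YEAR taken from MESSAGEDATE."""
    # Assumption: MESSAGEDATE is a string such as "01/31/2007".
    parsed = datetime.strptime(webmail_record["MESSAGEDATE"], date_format)
    output_record = dict(webmail_record)  # the original fields are kept as they are
    output_record["DAY"] = parsed.day
    output_record["MONTH"] = parsed.month
    output_record["YEAR"] = parsed.year
    return output_record


if __name__ == "__main__":
    record = {
        "SENDER": "John Smith",
        "SUBJECT": "Data integration approach",
        "MESSAGEDATE": "01/31/2007",
        "SIZE": "2",  # illustrative value only
    }
    print(build_output_record(record))
```

The input record is copied rather than modified, which matches the idea that the output record combines the received fields with the newly derived ones.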
147. roceed with more examples to make the process for generating the access pattern more reliable (Figure 34).

Figure 34 Assigning Various Examples

Finally, click on the button, check that the examples have been properly inserted and move on to the next phase: pattern generation.

3.10 GENERATING PATTERNS

Entering example results allows the system to generate the required extraction patterns. This is performed in the "Generation" tab. The initial view of this window is shown in Figure 35.

Figure 35 Pattern Generation Window

It is presumed that by 
148.
3.1 INTRODUCTION
3.2 PRESENTATION OF THE EXAMPLE
3.3 STARTING THE SPECIFICATION GENERATOR TOOL
3.4 CREATING A WRAPPER
3.5 COMPONENTS IN ITPILOT
3.6 PROCESS INITIALIZATION
3.6.1 Use of the Catalog Explorer
3.7 WEB BROWSING AUTOMATION
3.7.1 Component Creation in the Workspace
3.7.3 Output Data Configuration and Error Processing
3.8 STRUCTURE DEFINITION OF THE DATA TO BE EXTRACTED
3.8.1 Data Extraction Specification
3.8.2 Nested Levels in the Component Structure
3.9 ASSIGNING EXAMPLES OF THE RESULTS
149. Figure 39 Use of the Iterator Component

Configuration of this component is very simple. First, select the input list to feed the iteration process. In this case, as can be seen in the above figure, the list corresponds to the extraction component output described in section 3.11: EXTRACTIONOUTPUT. Then, from the Wizard tab, the iterator run mode can be configured. A parallel run can be chosen, in which each iterated element is propagated concurrently to the subsequent components. The other option is the sequential run.

In the Details tab, it can be seen how the name of the output record corresponds to the name assigned by users in the extraction component, as explained in section 3.8.1.

3.12.2 Individual record management

After configuring this component, the component that receives the record resulting from each iteration is added. In this specific case, each of the results is simply to be obtained and returned asynchronously to the application (i.e. as they become available, without waiting for wrapper processing to have finished). To do so, another of the most important ITPilot components is used, the so-called "Record Constructor", which, after receiving a set of records, is responsible for generating an output record that may be the simple combination of those received or a modified version following the editing, transforming or deleting of the fields of each one. In this simple example, the data of interest conce
150. s follows:

1. The Sequence component sends its output to the starting element ("Next_Interval_Iterator") of the Next Interval Iterator component.
2. This starting element is related to the Extractor component that, in the previous example, was directly connected to the Sequence component.
3. The ending component is no longer connected to the Iterator ending component, but to the Next Interval Iterator ending component ("Begin Next Interval Iterator").

Figure 53 Use of the Next Interval Iterator component to browse more pages of results

The component can now be configured. In the Inputs tab of the configuration area of the component's "Next Interval Iterator" element, the input page from which browsing for more intervals is to be carried out can be indicated. Furthermore, input records can be assigned to the component. These records are used when the browsing sequence has variables: ITPilot will use the values of the record attributes whose names match the name of the variable used in the sequence.

The Wizard tab enables users to access the next interval editor. This editor is very similar to the sequence editor descri
151. s on NSEQL commands.

3.16.4 Back Sequence in the Browsing Components

The behaviour of the back sequence can be defined in every browsing component, specifically Sequence (see section 6.20), Next Interval Iterator (see section 6.14), Form Iterator (see section 6.10) and Fetch (see section 6.8). This option is useful to control the browser behaviour in cases such as retries, page refresh actions, and so on. Figure 60 shows the "Advanced" tab in the configuration wizard of these components. The Record Sequence component, now being explained, has this option fully integrated in the "Sequences" tab, as described in the previous section.

Figure 60 Advanced Tab for Back Sequence definition

3.16.5 Individual Test of the Record Sequence Component

Figure 61 shows the results of the sequence configuration of the Record Sequence component in our example: browserpool is used as the sequence type and ITPilot is left to generate the back sequence. The browser connection is reused and pre- and post-sequences are not required, as there is no global form in the page of results.
152. s the combination of keys Ctrl+Shift+3.

application, enter the user Key and Password values and any other type of additional browsing until the first page of results is accessed.

In ITPilot, the browsing tasks are primarily carried out by the Sequence component. This component is intrinsically related to the Browsing Sequence Generation Tool, as it enables users to generate the sequence graphically, in most cases without having to use the internal language NSEQL (see [NSEQL] for further information on this language). The wrapper generation tool is integrated with the Denodo ITPilot sequence generation tool so that browsing sequences can be generated in this source description stage (of course, the sequence may have been previously saved, in which case this step can be skipped. Even so, it is interesting to stop at this point to check how this tool integration allows domains to be automatically created in the browsing sequence tool).

The process is as follows. First, drag the Sequence component either from the browsing area or from the workspace component bar, using its icon. The Sequence component is displayed in the workspace. The default name can be modified by double-clicking on the component (in this example, it is called "InitialSeq").

Now link the initial component to the sequence component to start creating the process flow. To do so, as indicated in
153. s to be used in the iterations for markable fields (selection lists, checkboxes, etc.). For text fields, constant values can be typed or attribute values can be dragged and dropped (see Figure 106). Through these steps, ITPilot is informed of the values to be used in the different iterations. The number of iterations corresponds to the total number of combinations of this data (e.g. if two possible values are entered in a drop-down and two values in a text field, the component will iterate 4 times).

Figure 106 Selecting values in the form fields

b) On the next tab ("Navigation"), the search and submission sequences for the form on the page are configured (this is generally already defined when the form is imported on the previous tab). In this case, the sequence can be loaded from file or imported from the browser, as explained in section 3.7, or ITPilot can automatically generate the sequence using the "Suggest" button (see Figure 107). On this tab it is also possible to configure the number of retries that 
154. se the last option on the left side of the component configuration area, called "Wrapper Options", with its icon. A new window will appear such as the one in Figure 47.

Figure 47 Wrapper Options

3.13.1 Back Sequence

In this window it is possible to define a browse sequence that enables you to return to a specific status of this source in this search tab. This action is used when you define a browser reuse strategy to increase system efficiency. It often occurs that the browse sequences executed by a specific wrapper share a series of initial common steps. For example, imagine that a wrapper has been created to automate the search process in a specific Web source. The source requires an authentication process that involves entering a login and a password. In this example, imagine that the wrapper uses the same login/password pair for all source accesses. Using Denodo ITPilot to create this wrapper, an initial Sequence component would be created that would run the following steps:

1. Connect to the source homepage.
2. Complete the authentication form with the login/password and click on the "Submit" or "Enter" button to authenticate.
3. Once authenticated, click on the link accessing the search page.
4. Complete the search form with the required query.
5. The server returns a page with the query resul
155. ser password in the proxy. • DOMAIN (Windows 2000): Windows domain. Figure 88 shows this window. Figure 88. Proxy Options Window. 4.8.2 Criteria for Selecting NSEQL Commands NSEQL [NSEQL] provides various alternatives for performing certain actions. For example, selecting a link on which a click event will be executed can be carried out either through the command CLICKONANCHORBYTEXT, which identifies a link according to the text it contains, or through the command CLICKONANCHORBYHREF, which identifies a link according to the value of its href attribute. Whereas in most situations it does not matter which criterion is used, certain situations may arise in which this is not the case. For example, criteria based on text can be inadequate when said text varies dynamically each time the web is accessed (e.g., consider the case of a link that provides access to the list of new messages in a webmail system, where the text indicates the number of new messages and, thus, can differ each time the service is accessed). This panel allows the criteria to be varied using the family of commands that refer to the identification of links, maps, forms and frames. The options that exist for each family are: • Links: link text, value of the href attribute and relative position on the page. • Maps: value of th
156. specification through the provision of examples by the user. This means that the user does not have to use the internal language DEXTL (see [DEXTL] for further information): by merely providing the tool with some examples, it is able to automatically generate the DEXTL program. As usual, drag it to the workspace. It will use the information provided by the Sequence component and, therefore, the two must be connected. Lastly, the component is renamed "MainPageExtractor". The expected result is shown in Figure 26. Figure 26. Using an Extraction Component. The first step in the component configuration process is the selection of the input page from which the component is to extract structured data. This page comes from the Sequence component (i.e., its output value INITSEQOUTPUT) and is therefore found in the selection list (see Figure 27). Once chosen, the Wizard tab generates the sp
157. st argument with the second as the exponent. • SQRT: This function is given a numeric argument and returns a double-type value with the square root of the argument. • LOG: This function is given a numeric argument and returns a double-type value with the base-10 logarithm of the argument. 5.2 TEXT PROCESSING FUNCTIONS Text processing functions execute a transformation or calculation on a text-type attribute or literal. • CONCAT: The concatenation function receives a variable number of arguments and returns a text-type element that is the result of concatenating its parameters. The infix version of this function receives 2 arguments and is represented by an operator symbol. • ISNOTNULL: The function receives an integer, string, date, url, or boolean type parameter, or a record list, as input argument, returning "true" if the value is not null, and "false" otherwise. • ISNULL: The function receives an integer, string, date, url, or boolean type parameter, or a record list, as input argument, returning "true" if the value is null, and "false" otherwise. • LEN: The LEN function receives a text-type argument and returns the number of characters it contains. • REPLACE: This function receives 3 text-type arguments and returns the result of replacing the occurrences of the second in the first with the third. • LOWER: This function receives a text type argu
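The behavior described for CONCAT, ISNULL, ISNOTNULL, LEN and REPLACE can be approximated with the Python sketch below. It only mirrors the descriptions given in this appendix and is not ITPilot code: the Python function names are hypothetical, and modeling an ITPilot null value as Python's None is an assumption.

```python
def concat(*parts):
    """CONCAT: join a variable number of text arguments into one string."""
    return "".join(parts)


def is_null(value):
    """ISNULL: true when the value is null (modeled here as None, an assumption)."""
    return value is None


def is_not_null(value):
    """ISNOTNULL: true when the value is not null."""
    return value is not None


def length(text):
    """LEN: number of characters in a text argument."""
    return len(text)


def replace(text, old, new):
    """REPLACE: replace occurrences of the second argument in the first with the third."""
    return text.replace(old, new)


greeting = concat("Denodo", " ", "ITPilot")
print(greeting, length(greeting))
print(replace("a-b-a", "a", "x"))
print(is_null(None), is_not_null(""))
```

Note that in this sketch an empty string is not null; whether ITPilot makes the same distinction is not stated in the text.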
158. st of values created (upper right-hand box). To create a new function-type value the following actions are required: 1. Select the required function in the "Functions" drop-down menu on the left side of the screen and click on it or drag and drop it to the workspace for creating values (top left box). 2. The function selected will appear in the workspace together with an area to fill in the value of each parameter of the function. The values of the parameters should be present in the list of created values (top right box). To assign a value already created as a parameter of a function, drag and drop the value to the parameter area. Press the > button that appears beside the function, and it will appear in the list of values created (top right box). 6.2.5.2 Creating simple conditions To create a new simple condition the following actions are required: 1. Select the required simple condition operator in the drop-down menus on the right side of the screen and click on it or drag and drop it to the workspace where the simple conditions are created (left center box). 2. The operator selected will appear in the workspace together with an area to fill in its operands. The operands can be either attributes of the input view (present in the "Fields" drop-down menu on the left side of the screen) or values already crea
159. Figure 50. Results returned by the wrapper. Before continuing, save the process (File > Save) to avoid losing valuable information and to be able to modify or add functions in the future. 3.14.3 Wrapper Exporting With everything operating correctly, the last step consists of preparing the wrapper for operation in the ITPilot run environment. There are two alternatives for this: direct exporting from the generation environment to the run environment (which means that the run environment must be running at the time of exporting), or saving the wrapper to the local file system in VQL format, which is the ITPilot wrapper run format, for subsequent loading in the run environment. 3.14.3.1 Deployment in the run server From the main window of the ITPilot wrapper generation environment, click on "Data Export Tool" in the browsing area. This opens two more elements in this same area: "VQL Generator" and "Server Deploy". Click on the second option and configuration data will appear in the workspace (as shown in Figure 51).
160. t can appear as operands in the expressions: • Constants: This menu allows constants of the various data types supported by ITPilot to be created. • Functions: This menu allows an invocation of one of the functions permitted by ITPilot to be created, as described in appendix A 5. The functions can receive attributes or the result of evaluating other functions as constant parameters. They return one result. • Attributes: This corresponds to the list of attributes of the wrapper program. The attributes can act as function parameters. Figure 96. Creation of a constant value in the Expressions Editor. The center boxes on the screen allow expressions to be constructed. The box on the left is a workspace for creating new expressions, while the box on the right displays the expressions already created. Finally, the "Expression value" box contains the expression eventually created. To create a new constant expression the following actions are required: 1. Select the data type of the constant in the "Constants" drop-down menu on the left side of the screen and click on it or drag and drop it to the workspace where expressions are created (box on left). 2. The type selected will appear in the wo
161. tan blanco / Marías, Javier / 2200 ptas. Mañana en la batalla piensa en mí / Marías, Javier / 2350 ptas. Negra Espalda del Tiempo / Marías, Javier / 3100 ptas. Figure 69. Tabulated Results of a Bookshop. A DEXTL program generated carelessly would return the heading row as yet another result. Although in almost all cases, and particularly in this one, a careful definition of format tags or other alternatives (see [DEXTL]) allows an unambiguous pattern to be defined, it would clearly be faster and easier to define it in a more intuitive manner, without using additional format tags that are defined specifically for each document and are almost certainly not reusable. To solve this type of problem, the system offers the possibility of limiting the part of a document in which matches for a certain pattern are sought. This limitation, which in the DEXTL language is obtained using the FROM clause of an element (specified using the constructions FROM END FROM) and the TO tag, can be generated graphically as follows. The results page is accessed from a browser launched from the specification generation tool, and the text forming the limitation prior to the pattern to be extracted is selected (the table heading in the example of Figure 69). Once this action has been completed, return to the generation tool and click on the "From" button. The system will extend the specification to include the search limitation. You can also include the limitat
162. tandard Edition (J2SE) 1.4.2_09 or higher must also be available (tested successfully with J2SE 1.5.0_05 and J2SE 1.6.0 also). If extracting information from Adobe PDF resources is required, an Adobe Acrobat technology-based converter can be used. In that case, the system must be run on a Microsoft Windows machine with a previous installation of Adobe Acrobat Professional 7 [ADOBE]. If extracting information from Microsoft Word resources is required, a previous installation of OpenOffice 2.0.x is required [OO]. 2.1.3 Installation The installation process of the ITPilot Generation tool is performed through the Denodo installer, which starts after executing the install.bat file (in a Windows environment) or install.sh (on Linux), even though Windows operating systems are required in order to make use of the advanced navigation capabilities, as described in section 2.1.2. A detailed description of the installation process can be found in the ITPilot User Manual [USE]. Nevertheless, its main steps are described here. The first screen presented to the user is shown in Figure 1. Figure 1. Initial ITPilot Installation Screen.
163. ted in chapter 6. • Input Values: This corresponds to the list of attributes of the view to which the projection is applied. The attributes can act as function parameters. The central area of the screen ("Values") allows expressions to be constructed. The box on the left is a workspace for creating new expressions, while the box on the right displays the expressions already created. Finally, the "Expression" box contains the expression eventually created. The following actions are required to create a new constant expression: 1. Select the data type of the constant in the "Constants" drop-down menu on the left of the screen and click on it or drag and drop it to the workspace where expressions are created (left-hand box). 2. The type selected will appear in the workspace together with a text area to fill in the value of the constant. The value required can be entered directly in the text area. 3. On clicking the > button, the new constant will appear in the list of values created (upper right-hand box). The following actions are required to create a new function-type expression: 1. Select the required function in the "Functions" drop-down menu on the left of the screen and click on it or drag and drop it to the workspace for creating expressions (left-hand box). Place the cursor over the name of the function to view its syntax. 2. The selected function will appea
164. th which it is to be saved (by default, MAILPARAMS.xml). Close the Domain Editor by clicking on "Close". Now use the recently exported Search Domain. Click on the "Domain" button in the sequence generation tool and a dialog box will appear as shown in Figure 21. Figure 21. Selection of the Search Domain. The list shows the existing domains. Click on the "MAILPARAMS" domain on the right side of the toolbar. A second taskbar will appear under the original, showing the data of the Search Domain exported from the specification generator tool (Figure 22). Figure 22. Search Domain Data toolbar. Now, instead of writing the data directly in the search fields "Username" and "Password" of the form on the home page of the e-mail tool, a drag-and-drop operation transfers the values on the bar to the specific fields (text or selection areas) of the form. The result can be seen in Figure 23. Figure 23. Drag and drop operation on the Main Page. In this way, the tool is capable of generating the necessary relations between the input parameters and the fields on the HTML form. Navigation continues until the results page is reached F
165. this time the user will have a suitable number of examples (see section 3.9). Once the user has entered all the values of the examples, the tool will ask the user to specify the document from which these examples are to be extracted (to do so, use the mouse to select some text in the frame containing the examples, in a browser opened from the generation tool). If, however, users do not provide any examples to the generation tool, having selected the "Do not use examples" option from the examples tab, they will be responsible for generating the specification manually by clicking on the "Ed" button and writing the DEXTL program in the main window (this action is only recommended for advanced users and/or in situations in which advanced DEXTL functions not directly available from the graphic tool must be accessed). In this example, the specification is automatically generated. Having already selected the element for which a pattern is to be generated, the corresponding button invokes the processing of those examples corresponding to this level, and of the documents that contain these examples, to return a group of DEXTL programs (for more information, see [DEXTL]). Before clicking on the button, the user should consider two options: • Deleting false examples (by clicking on the checkbox "Remove false examples"): when this option is selected, the system automa
166. tically attempts to detect false examples (i.e., examples the user has accidentally entered, or examples where data from several source examples have been combined). This detection process can in some cases delete all examples, even those entered correctly, so we recommend that you avoid selecting this option unless you suspect that the examples could have been entered incorrectly. • Strict Patterns (clicking on the checkbox "Strict patterns"): if this option is selected, the system tries to generate the most restrictive patterns possible. Specifically, the patterns will contain the largest possible number of text separators, instead of replacing them with elements of the type IRRELEVANT. If the user does not select it, the system minimizes the number of IRRELEVANT elements and maximizes the use of text separators. See [DEXTL] for more information. When should this option be used? In circumstances similar to those of the "Disambiguate" option: when more results than expected are being received. • Elimination of ambiguities (clicking on the checkbox "Disambiguate"): when the user selects this option, the analyzer modifies the patterns of the generated DEXTL program by adding elements at the beginning and end of each one, in order to recognize only those elements that most accurately correspond to the selected examples; that is, the patterns are generated with more restrictions in order to avoid incorrect extraction of data that do not match the provided exam
167. tructured list of e-mail messages, as explained in section 3.8. Before the list of results is iterated by the Iterator component, it should be filtered by the DATE field (String type) so that the iteration is only carried out over messages prior to a certain date (e.g., February 1, 2007). To do so, the following steps are taken: 1. Create a Filter component and position it in the process, as shown in the previous figure. 2. The component input will be the list of records returned by the Extractor component. 3. The Filter component wizard allows a conditions editor to be opened to create a condition expression that will be evaluated for each element in the list of records of the input argument. If the condition is met, the element is kept. This editor is explained in detail in section 6.2.5 of this manual. The specific actions to filter by date are indicated below: a. A simple condition is created that establishes that the value of the record "DATE" attribute must be before a specific date (e.g., February 1, 2007). Hence, to begin with, the condition operator "<" is dragged and dropped from the right-hand side of the editor to the left-hand panel of the "Simple Conditions" area. b. Given that two dates are to be compared, but that the DATE attribute is of the character string type, it has to be
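The filtering logic described for the Filter component (convert the string-typed DATE attribute to a date, then keep only records earlier than February 1, 2007) can be sketched in Python as follows. This is an illustration of the condition, not ITPilot code: the MM/DD/YYYY date format, the function name and the sample records are assumptions.

```python
from datetime import date, datetime

# Assumption: DATE strings look like "01/31/2007" (month/day/year).
DATE_FORMAT = "%m/%d/%Y"


def filter_before(records, cutoff, field="DATE"):
    """Keep the records whose string date field parses to a date before the cutoff."""
    kept = []
    for record in records:
        parsed = datetime.strptime(record[field], DATE_FORMAT).date()
        if parsed < cutoff:
            kept.append(record)
    return kept


# Hypothetical extracted records.
messages = [
    {"SUBJECT": "Status report", "DATE": "01/31/2007"},
    {"SUBJECT": "Meeting notes", "DATE": "02/01/2007"},
    {"SUBJECT": "Release plan", "DATE": "02/03/2007"},
]

older = filter_before(messages, date(2007, 2, 1))
print([m["SUBJECT"] for m in older])
```

Comparing the raw strings instead of parsed dates would order them character by character, which is why the conversion step matters.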
168. ts. The first three steps are common to all queries made to the wrapper. The difference between one query and the next only arises in step four, when the search form is completed according to the specific query to be made at any given time. It would be desirable to save time on these first three steps in each query: ideally, when a new query is received, one browser is already authenticated and located on the search page of the source, so that the new request can be allocated to it. The browser searches immediately (step 4) and returns the results (step 5), thus avoiding the time spent on steps 1-3. A back browse sequence is responsible for returning a browser to a status in which it can be reused in future requests by the same wrapper. Thus, when the wrapper in this example has made a query to the source, the browser used to run the browse sequence remains on the query results page (step 5). For the browser to be used for a new wrapper query, it must return to the search page (step 4). The sequence responsible for achieving this is the aforementioned back sequence. A wrapper can obtain a back sequence in two ways: • Explicitly: the wrapper creator can specify a back browse sequence for a wrapper in the "Wrapper Options" window, in the text field "Back Sequence". • Implicitly: if the allocation strategy has been enabled in the STATE browser pool (ASSIGNMENT STRATEG
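The saving that a back sequence provides can be pictured with a toy model. The Python sketch below is not how ITPilot implements browser reuse: the class, its step counts (3 initial steps, 2 query steps, and a back sequence that costs 1 step) are hypothetical simplifications of the five-step example.

```python
class PooledBrowser:
    """Toy model of one pooled browser that counts navigation steps."""

    def __init__(self, use_back_sequence):
        self.use_back_sequence = use_back_sequence
        self.on_search_page = False
        self.steps_run = 0

    def _initial_sequence(self):
        # Steps 1-3: open the homepage, authenticate, go to the search page.
        self.steps_run += 3
        self.on_search_page = True

    def query(self, term):
        if not self.on_search_page:
            self._initial_sequence()
        # Steps 4-5: fill in the search form and receive the results page.
        self.steps_run += 2
        self.on_search_page = False
        if self.use_back_sequence:
            # Assumed cost: one step to return to the search page.
            self.steps_run += 1
            self.on_search_page = True
        return f"results for {term!r}"


def total_steps(use_back_sequence, queries):
    browser = PooledBrowser(use_back_sequence)
    for term in queries:
        browser.query(term)
    return browser.steps_run


queries = ["invoices", "reports", "meetings"]
print(total_steps(True, queries), total_steps(False, queries))
```

Under these assumed costs the first query is slightly more expensive with a back sequence, and every later query is cheaper because the initial three steps are skipped.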
169. ucture can only use one scanner, and the main level requires this tag set. Figure 68. Generated Scanner and Tag Set. The last step of this process involves saving and generating the scanner so that it can be used by any ITPilot application. To do so, simply click on the corresponding icon in the scanner area, checking that it is correctly generated. The application must be restarted in order for the changes to take effect. Please do not forget to save the process before performing this action. Besides, if the execution server is not installed in the same location as the wrapper generation tool, it will
170. ue Figure 73. Obtaining data from tokens. 3.20 EXPORTING A FLOW AS A CUSTOM COMPONENT Denodo ITPilot enables users to create custom components. These components can be programmed directly in JavaScript (see [JSDENODO] for further information). It is also possible to create custom components using previously created processes, so that they can be reused in other processes. In this section, the custom "WebMail" component will be created using a recently generated process. The first step is the creation of a copy of the "WebMail" process, renamed to "WebMailAsCustom", where the changes required for this process to work as a custom component are made. In this specific case, the component is to return the list of results obtained following data extraction. To do so, a list of records containing all the elements must be created so that it can be returned as the component result. Figure 74 shows the required process flow.
171. ue field and click on OK. By carrying out this same operation with the other two record attributes, but using the GETMONTH and GETYEAR functions, the three new attributes of the output record will have been generated. The result will be similar to that shown in Figure 44. Click on OK to return to the main window of the Generation Environment. Figure 44. Final result of the Output record. 3.12.3 Returning of results The operation is almost complete. Once the output record has been generated, the only thing left to do is use the "Output" component to place the record in the process output. This component is very simple to use, as you merely have to indicate which record it has to place: in this case, the MAILMESSAGEOUT record returned by the Record Constructor component or, where no transformation was necessary, the record returned by the Iterator component. Figure 45 shows the use and configuration of the component.
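The Record Constructor step described earlier in this excerpt, which derives three new attributes from the message date using GETDAY, GETMONTH and GETYEAR, can be sketched in Python as follows. This is only an illustration: the output attribute names DAY, MONTH and YEAR, the function name and the sample record are hypothetical, and the message date is assumed to be already available as a date value.

```python
from datetime import date


def build_output_record(webmail):
    """Copy the WEBMAIL fields and add day, month and year taken from MESSAGEDATE."""
    message_date = webmail["MESSAGEDATE"]
    return {
        "SIZE": webmail["SIZE"],
        "SUBJECT": webmail["SUBJECT"],
        "MESSAGEDATE": message_date,
        "SENDER": webmail["SENDER"],
        "DAY": message_date.day,      # plays the role of GETDAY
        "MONTH": message_date.month,  # plays the role of GETMONTH
        "YEAR": message_date.year,    # plays the role of GETYEAR
    }


# Hypothetical input record.
webmail = {
    "SIZE": 3,
    "SUBJECT": "Status report",
    "MESSAGEDATE": date(2007, 1, 31),
    "SENDER": "John Smith",
}

output_record = build_output_record(webmail)
print(output_record["DAY"], output_record["MONTH"], output_record["YEAR"])
```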
172. unction receives a variable number of arguments (greater than or equal to two) and returns a new element of the same type with the result of multiplying the different arguments. • DIV: The div function receives two numeric-type arguments and returns a new element of the same type with the result of dividing the first argument by the second. If the arguments are integers, the result of the division will also be an integer. • ABS: The abs function receives a single numeric-type argument and returns its absolute value. • MOD: The mod function receives two non-decimal numeric-type arguments and returns the result of the modulo operation between the first argument and the second (the remainder of the integer division of the first argument by the second). • CEIL: This function receives a numeric argument and returns the smallest integer greater than or equal to the argument. • FLOOR: This function receives a numeric argument and returns the largest integer less than or equal to the argument. • ROUND: This function receives a numeric argument and returns the integer closest to the argument. • POWER: This function is given two numeric arguments, the second of which must be an integer. It returns a double-type value obtained through the exponentiation of the fir
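The descriptions of DIV, MOD, CEIL and FLOOR can be checked against a small Python sketch. It mirrors only what the text states and is not ITPilot code: the function names are hypothetical, and the examples use positive operands because the manual does not say how negative integers are rounded in DIV or MOD.

```python
import math


def div(a, b):
    """DIV: integer operands give an integer result; otherwise ordinary division."""
    if isinstance(a, int) and isinstance(b, int):
        # Assumption: shown for positive operands only.
        return a // b
    return a / b


def mod(a, b):
    """MOD: remainder of the integer division of a by b."""
    return a % b


def ceil(x):
    """CEIL: smallest integer greater than or equal to x."""
    return math.ceil(x)


def floor(x):
    """FLOOR: largest integer less than or equal to x."""
    return math.floor(x)


print(div(7, 2), div(7.0, 2), mod(7, 2))
print(ceil(2.1), floor(2.9))
```

The integer case of DIV is the one that tends to surprise: dividing 7 by 2 with integer arguments discards the fractional part, while the same call with a decimal argument keeps it.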
173. ution of the generated sequence can be viewed. A dialog box also displays the execution trace of the NSEQL commands. NOTE: Some Web sites use cookie-based session authentication and maintenance techniques that can cause immediate reproduction of the sequence to function poorly even though, in fact, the sequence is being generated correctly. See section 4.8.2 for more information. 6. Once the desired sequence has been completed, and before clicking on the Stop button, the generated NSEQL command program can be saved to disk by pressing the Save button and selecting the folder and file name as required. This file will contain, in text format, the sequence of NSEQL commands corresponding to the generated navigation sequence. 7. Once the sequence has ended and has been saved to disk, the Stop button should be pressed to end the record mode and return to the normal mode. 8. The sequence can be executed at any time by clicking on Open and selecting the file in which it was saved. It is important to take into account that if the navigation sequence contains any domain variable, the execution will not be satisfactory, since the variable substitution will not be performed. 4.3.1 Checking Navigation Sequences in Systems with Cookie-Based Session Authentication and Maintenance Some Web sites use session authentication and maintenance techniques based on cookies that can cause immediate reproduction of a sequence using the Play button to function p
174. denodo technologies. DENODO ITPILOT 4.0 GENERATION ENVIRONMENT MANUAL. 530 Lytton Avenue, Suite 302, Palo Alto, CA 94301, USA. Phone: (650) 566 8833. Fax: (650) 566 8836. C/ Alejandro Rodríguez, 32, 28039 Madrid. Phone: (+34) 912 77 58 55. Fax: (+34) 912 77 58 60. www.denodo.com. NOTE: This document is confidential and is the property of denodo technologies (hereinafter denodo). No part of the document may be copied, photographed, transmitted electronically, stored in a document management system or reproduced by any other means without prior written permission from denodo. Copyright 2007. This document may not be reproduced in total or in part without written permission from denodo technologies. INDEX: WHO SHOULD USE THIS DOCUMENT; SUMMARY OF CONTENTS; 1.1 PRESENTATION; 1.2 DEVELOPMENT OF COMPONENT BASED WRAPPERS; 2.1.2 Software Requirements
175. 5. To end, click on the "Assign" button. The assignment result appears in the main window as a modification of the DEXTL program shown earlier. By selecting the option "NONE", the attributes assigned to the selected tag can be deleted. As always, once the attribute values of the required tags have been assigned, click on the corresponding button to move on to the Specification tab. 3.16.3 Access to the Details Page from the Main Page The aim now is to build the browsing relation between the main page of results and the details page for each message. Once the Sequence component has obtained the page of results, it is sent to the Extractor component to generate a list of records, each one of which represents one of the e-mail messages on that page. A new component known as Record Sequence can now be used, which provides access to pages related to others, or to pages that are reached through previously extracted records. In this case, the component input will be the Sequence component output page created in section 3.7, known as INITSEQOUTPUT, and the output record of the iterator (WEBMAIL). This component is displayed in the workspace of Figure 57.
176. ws for some components (e.g., the Extractor component) to return lists of registers at their output or to accept them at their input. • Values: Other components return specific values at their output, such as the Expression component. 3.6 PROCESS INITIALIZATION The initialization component receives no parameters from any component, as it always starts the process. It is responsible for storing the structure of the input data, which is the data that the wrapper will receive from the calling application. For example, in this case, certain information is required by the e-mail application to access the messages: more specifically, the user name and password of the specific user. This data may be fixed or variable, so that different queries on the Web application use different values for these parameters. In this example, it is to be variable and, therefore, must be defined. The following steps are required to do so. First, select the initialization component using the left-hand button of the mouse (see Figure 12). Figure 12. Selection of the Initialization Component. Click on the "Wizard" tab in the component configuration area and then on the "Open Init Editor" button to access the Initialization Editor, which will enable you to create an input register (see Figure 13). First, give this register a name so that it is accessible to the rest of the process: MAILPARAMS. Then create the r
177. y, there are two complementary applications: • The Specification Generation tool, which allows non-technical users to generate "wrappers" or web connectors in an easy and intuitive way. This tool automatically generates wrapper programs in JavaScript [JSDENODO], with the convenience and time saving that this brings. • The Navigation Sequence tool, used to define complex navigation sequences on web sources (e.g., to obtain a result list from a web source which requires previous authentication, browsing through different pages and filling out a query form). This tool automatically generates NSEQL programs [NSEQL], which can be used in the wrappers created using the Specifications Generator. 1.2 DEVELOPMENT OF COMPONENT BASED WRAPPERS Most information obtained from WWW (World Wide Web) sources is presented using the HTML tag language, centered on the visualization of data by human beings. However, the constant growth of the Web makes it impossible to access the data unless this is done mechanically. Many Web sources also generate their records automatically, with data repositories that are accessed through HTML front ends. Denodo ITPilot is based on the use and configuration of components, and the relationships among them, in order to build "wrappers": programs that are in charge of automating the web source navigation and extraction processes. Each component accomplishes a specific task, and their behaviour depends on the in
    