
Cameleon#1 User Manual


Contents

1. [Screenshot: Cameleon Studio browser showing the CIA World Factbook page for Albania.]

Note that the address in the Source tab has a fixed part and a part that varies with the selected country. This variable part is stored in our attribute Link. To go to this site, we can simply append the Link attribute to the fixed part of the address: change the variable part of the address in the Source text box to #Link# and add it to the spec.

[Screenshot: spec file tree with the attribute Link (type string, Begin <form name="SelectCountry">, End </form>) and a second source whose URL is https://www.cia.gov/cia/publications/factbook/#Link#.]

Step 5: Define more attributes

Before you add an attribute, click on the source in the spec tree that the attribute belongs to. Otherwise the attribute may be added to the wrong source. In that case, right-click on the attribute and choose delete to remove it.

[Screenshot: spec file tree for ciafactbook.]
2. (?#text)    An embedded comment causing text to be ignored.
   (?:regexp)  Groups things like "()" but doesn't cause the group match to be saved.
   (?=regexp)  A zero-width positive lookahead assertion. For example, \w+(?=\s) matches a word followed by whitespace, without including whitespace in the MatchResult.
   (?!regexp)  A zero-width negative lookahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that isn't followed by "bar". Remember that this is a zero-width assertion, which means that a(?!b)d will match ad, because a is followed by a character that is not b (the d) and a d follows the zero-width assertion.
   (?imsx)     One or more embedded pattern-match modifiers. i enables case insensitivity, m enables multiline treatment of the input, s enables single line treatment of the input, and x enables extended whitespace comments.

Copyright 1997 ORO, Inc. All rights reserved. Original Reusable Objects, ORO, the ORO logo, and "Component software for the Internet" are trademarks or registered trademarks of ORO, Inc. in the United States and other countries. Java is a trademark of Sun Microsystems. All other trademarks are the property of their respective holders.

A.3 Installing Cameleon Server

1. Get the Cameleon#1 zip file. Its contents should look like:

[Screenshot: Windows Explorer listing of the extracted archive, including a Bin folder, a Cameleon folder, camserv.aspx, and camserv.asp.]
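The extended constructs listed in this appendix can be tried outside Cameleon. The following is a minimal sketch that uses Python's `re` module as a stand-in for the OROMatcher Perl5 classes (an assumption: the two engines agree on these particular constructs, but this is not the engine Cameleon itself runs). The sample strings are made up for illustration.

```python
import re

# Positive lookahead: match a word only when whitespace follows,
# without including that whitespace in the match.
words = re.findall(r"\w+(?=\s)", "foo bar baz")

# Negative lookahead: start offsets of "foo" not followed by "bar".
hits = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# a(?!b)d matches "ad": "a" is followed by "d" (not "b"),
# and that "d" is then consumed after the zero-width assertion.
ad = re.search(r"a(?!b)d", "ad")

# Non-capturing group: (?:ab) groups without saving a match,
# so only the parenthesized "c" is reported as a group.
grouped = re.match(r"(?:ab)+(c)", "ababc")

# Embedded modifier: (?i) turns on case-insensitive matching.
insensitive = re.search(r"(?i)cameleon", "CAMELEON Studio")
```

Because lookaheads are zero-width, they are handy in a PATTERN when the text that follows the value must be checked but should not be extracted.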
3. [Screenshot: Cameleon Studio with the source URL https://www.cia.gov/cia/publications/factbook/ entered, and the Fetch, Back, Forward, and Add to Spec buttons.]

If you click on the Spec File Source tab, you will see that the spec file is being constructed automatically in XML format:

    <?xml version="1.0" encoding="UTF-8"?>
    <RELATION name="ciafactbook">
    <SOURCE URI="https://www.cia.gov/cia/publications/factbook/">
    </SOURCE>
    </RELATION>

Step 3: Define an Attribute

Our first attribute will be the link that takes us to the page that has the information. If you view the source of the document, you will notice that the relative links for each country have the following structure:

    <option value="geos/al.html">Albania</option>

Or, if we generalize:

    <option value="Link">Country</option>

where Country is an input country and Link is the corresponding link.

[Screenshot: Cameleon Studio Source tab showing the HTML of the World Factbook page.]
4. [Table residue: end of the spec file syntax summary. Legible rows:]

   BEGIN         Pattern for the beginning of the search region
   END           Pattern for the end of the search region
   OBJECT        Defines the object region that constitutes a tuple
   OBJECT_BEGIN  Pattern for the beginning of the object region
   OBJECT_END    Pattern for the end of the object region
   PATTERN       Pattern that contains the data to be extracted

Appendices

A1 Related Readings and References

1. Aykut Firat, Stuart Madnick, Michael Siegel. The Cameleon Web Wrapper Engine. Proceedings of the Workshop on Technologies for E-Services, September 14-15, 2000, Cairo, Egypt. http://web.mit.edu/smadnick/www/wp/2000-03.pdf
2. Aykut Firat, Denis Peleshchuk, Prakash Rao. i-Wrap: Instant Web Wrapper Generator. http://web.mit.edu/smadnick/www/wp/2000-10.pdf
3. Tarik Alatovic. Thesis: Capabilities Aware Planner/Optimizer/Executioner for COntext INterchange Project. http://web.mit.edu/smadnick/www/wp/2002-01.pdf
4. Aykut Firat. PhD Thesis: Information Integration Using Contextual Knowledge and Ontology Merging. Massachusetts Institute of Technology, Sloan School of Management, August 2003.
5. Shin Wee Chuang. Thesis: A Taxonomy and Analysis of Web Wrapping Technologies. http://web.mit.edu/smadnick/www/wp/2004-08.pdf
6. Aykut Firat, Stuart Madnick, Nor Adnan Yahaya, Choo Wai Kuan, Stéphane Bressan. Information Aggregation us
5. [Screenshot: spec file tree with the attributes Link and MilExpendPercent, and the attribute panel for MilExpendPercent (type string, Begin "Military expenditures percent of GDP", End </table>; the Pattern is illegible) with its match result.]

When you author attributes, you can use the Search menu item to find what you are looking for in the web site or in its source. The Population, GDP, and GDPUnit attributes are declared as follows:

[Screenshot: Cameleon Studio source view of the Population row of the Factbook page (value 3,581,655, July 2006 est.) and the start of the attribute declarations.]
6. [Screenshot: Forms tab listing the Babelfish forms (URL http://world.altavista.com/tr, action http://world.altavista.com/babelfish/tr, method post) with the inputs doit, trurl, intl, and tt, the textarea trtext, and the select lp.]

We then need to edit this form to encode the input attributes, namely the text to be translated and the mode of translation (i.e. from which language to which language). We will call these two input attributes source and language mode.

[Screenshot: spec file tree for mybabelfish after editing, with the textarea trtext and the select lp bound to the input attributes.]

Then we add an attribute for extracting the results.

[Screenshot: spec file tree for mybabelfish with the result attribute (type string; the Begin, Pattern, and End expressions are only partly legible).]

And test with some Input Attribute values.
7. are created dynamically (e.g. a session ID is assigned for each access to the page), or a cookie needs to be established before you can go to the desired page, or there is no simple way of deducing the desired URL without visiting a particular page, or the data is spread over multiple pages. To handle these cases, Cameleon has a feature that lets us wrap multiple pages for a relation. In the figure, for example, the link to the second page is extracted from the first web page. We need this step because there is no simple way of deducing in advance what the link is supposed to be. The link is then supplied to the next source element, which takes us to the page from which we want to extract the values of price and airline.

[Figure residue: annotated spec file. Annotations in the figure: Relation Name, Source declaration, Post method. Illegible values are marked.]

    <?xml version="1.0" encoding="UTF-8"?>
    <RELATION name=[illegible]>
    <SOURCE URI="http://edit.travel.yahoo.com/config/ytravel">
    <POST method="POST">
    <PARAM name=[illegible] value="YG"/>
    <PARAM name="module" value="tripsrch"/>
    <PARAM name="intl" value="us"/>
    <PARAM name="src" value="trv"/>
    <PARAM name="service" value=[illegible]/>
    <PARAM name="tcycgi" value=[illegible]/>
    <PARAM name="smls" value="Y"/>
    <PARAM name="resform" value="YahooFlightsR"/>
    <PARAM name="trip_option" value="roundtrp"/>
    <PARAM name="num_count" value="9"/>
    <PARAM name="dep_arp_cd
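The multi-page idea described here (extract a link from the first page, then use it as the address of the next source) can be sketched in a few lines. This is an illustration only, not Cameleon's implementation: the `PAGES` dictionary, the `fetch` helper, the example.com URLs, and both regular expressions are hypothetical stand-ins so that the sketch runs without network access.

```python
import re

# Hypothetical stand-ins for two fetched pages (no real network access).
PAGES = {
    "http://example.com/search":
        '<meta http-equiv="refresh" content="0; url=http://example.com/results?id=42">',
    "http://example.com/results?id=42":
        "<b>USD 310</b>",
}


def fetch(url):
    """Return the body of a page; a real wrapper would issue an HTTP request."""
    return PAGES[url]


# First source: extract the link to the second page.
first_page = fetch("http://example.com/search")
link = re.search(r'url=([^"]+)', first_page).group(1)

# Second source: its address is the value extracted above.
second_page = fetch(link)
price = re.search(r"USD\s*(\d+)", second_page).group(1)
```

The design point is that the second address is never hard-coded: it only exists after the first extraction has run, which is why the spec file needs two source elements in sequence.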
8. [Screenshot: HTML source of the Babelfish form: the end of a textarea, followed by the select named lp, whose options are language pairs: zh_en (Chinese-simp to English), zt_en (Chinese-trad to English), en_zh, en_zt, en_nl (English to Dutch), en_fr, en_de, en_el, en_it, en_ja, en_ko, en_pt, en_ru, en_es (English to Spanish, selected), nl_en, nl_fr, fr_nl, fr_en, fr_de, fr_el, fr_it, fr_pt, fr_es.]
9. distribution and also the book Programming Perl, 2nd Edition, from O'Reilly & Associates. We need to point out here that, for efficiency reasons, the character set operator [...] is limited to work on only ASCII characters (Unicode characters 0 through 255). Other than that restriction, all Unicode characters should be usable in the package's regular expressions.

   • Alternatives separated by |
   • Quantified atoms
       {n,m}  Match at least n but not more than m times.
       {n,}   Match at least n times.
       {n}    Match exactly n times.
       *      Match 0 or more times.
       +      Match 1 or more times.
       ?      Match 0 or 1 times.
   • Atoms
       o regular expression within parentheses
       o a . matches everything except \n
       o a ^ is a null token matching the beginning of a string or line (i.e., the position right after a newline or right before the beginning of a string)
       o a $ is a null token matching the end of a string or line (i.e., the position right before a newline or right after the end of a string)
       o Character classes (e.g., [abcd]) and ranges (e.g., [a-z])
           Special backslashed characters work within a character class (except for backreferences and boundaries). \b is backspace inside a character class.
       o Special backslashed characters
           \b  null token matching a word boundary (\w on one side and \W on the other)
           \B  null token matching a boundary that isn't a word boundary
           \A  Match only at beginning of string
           \Z  Match only at end of string (or before newline at the end)
           \n
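A few of the quantifiers, character classes, and boundary tokens from this summary, tried on made-up strings. This is a sketch using Python's `re` module rather than the OROMatcher classes (an assumption: the constructs shown behave the same way in both); `re.M` plays the role of the m modifier.

```python
import re

# {n,m}: at least n but not more than m repetitions.
two_or_three = re.fullmatch(r"a{2,3}", "aaa") is not None
too_many = re.fullmatch(r"a{2,3}", "aaaa") is not None

# Character range: runs made only of the letters a through c.
runs = re.findall(r"[a-c]+", "abcdcba")

# \b: word boundary, so "cat" inside "concat" or "cats" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

# ^ with the multiline modifier matches right after each newline too.
line_starts = re.findall(r"^\w+", "one\ntwo", re.M)
```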
10. # symbols around the attribute name, i.e. #attribute_name#. Attributes can be referenced anywhere in a spec file where a string value is expected. They can even be embedded within plain static text, such as in a matching pattern. In these cases an extra attribute, link="true", is needed for the tag.

Line 7: BEGIN tag. The begin tag contains a regular expression that matches the beginning of a region of text that contains the data to be extracted. This limits the scope of matching and thus improves the efficiency of finding patterns. A special XML construct, <![CDATA[regular_expression]]>, is used to encase the regular expression. It ensures that any characters in your regular expression that could invalidate the XML document are ignored. Remember that every spec file is parsed by an XML parser before it is processed, and the CDATA section tells the XML parser not to parse what is inside it. Therefore, always surround your regular expression with a CDATA section.

Line 8: PATTERN tag. The pattern tag pinpoints the location of the piece of data to be extracted. As with the begin tag, the content of the pattern tag is a regular expression surrounded by a CDATA section. The scope of matching is limited to the region set by the BEGIN and END tags. Notice the parentheses within the regular expression. The parentheses indicate the segment of characters to extract, provided the surrounding patterns match. That segment o
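The way BEGIN, END, and PATTERN cooperate can be sketched as follows. This is a minimal illustration under stated assumptions, not the Cameleon engine: the `extract` function, the sample page, and the three expressions are hypothetical, and the sketch assumes the search region starts right after the BEGIN match and stops right before the END match.

```python
import re


def extract(page, begin, pattern, end):
    """Return the first parenthesized group of `pattern` found in the
    region between the first `begin` match and the next `end` match."""
    start = re.search(begin, page)
    if start is None:
        return None
    region = page[start.end():]          # assumption: region follows BEGIN
    stop = re.search(end, region)
    if stop is not None:
        region = region[:stop.start()]   # assumption: region stops before END
    hit = re.search(pattern, region)
    return hit.group(1) if hit else None


# Made-up page fragment loosely modeled on a Factbook row.
page = (
    "<table><tr><td>Area</td><td>28,748 sq km</td></tr></table>"
    "<b>Population:</b><br> 3,581,655 (July 2006 est.)</td></tr>"
)

population = extract(page, r"Population:", r"<br>\s*([\d,]+)", r"</tr>")
area_miss = extract(page, r"Population:", r"([\d,]+) sq km", r"</tr>")
```

Narrowing the region first is what keeps the short pattern safe: the same digits-and-commas expression would otherwise also match the area figure earlier on the page.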
11. the method remotely.

[Screenshot: web service test page for ciafactbook, with a link to the complete list of operations.]

Test: To test the operation using the HTTP POST protocol, click the Invoke button.

[Screenshot: parameter Country with the value Turkey.]

The returned results are in XML, which takes the form of a data set in the .NET framework.

[Screenshot: Internet Explorer showing the response from http://www.aykutfirat.com/ciafactbook.asmx/getCIAFACTBOOKData: a DataSet (xmlns http://www.tempuri.org/) whose schema NewDataSet declares the element ciafactbook with read-only string fields Country and Link; the listing is cut off.]
12. [Screenshot, continued: remaining options of the select lp: de_en, de_fr, el_en, el_fr, it_en, it_fr, ja_en, ko_en, pt_en, pt_fr, ru_en, es_en, es_fr.]

Then save and transfer your spec file to complete the task.

Custom Forms

If you want to use custom forms, you can define them in the Forms tab by entering the URL, method, and name-value pairs and clicking on Add to Spec. This is valuable when you cannot do the transfer as in the previous example, or when your form is somehow different from what is shown on the page.

[Screenshot: Forms tab with URL and Method fields and the Add to Spec button.]

Scripts

This feature is not used as frequently, but it may sometimes be needed. Scripts in web pages are shown in the left-most Forms tab and can be transferred to the Scripts tab with a single cli
13. web pages are obvious web resources, Java servlets, Perl scripts, CGI, or any server-side programs that return data over the web can be web resources as well. The URI attribute of the source tag specifies the location of a web resource. Encode special characters in a URI (e.g. use &amp; for &). The content of any specified web resource will be retrieved for pattern matching to form attributes, as we will see in the discussion of the attribute tag below. Each source tag can contain multiple attribute tags.

Line 6: ATTRIBUTE tag. An attribute tag defines what is to be extracted from a specified source. Pattern matching by regular expression is used to locate the information to be extracted (see the discussion of the PATTERN tag below). Each attribute must be assigned a unique name and a type. Currently, Cameleon treats all extracted data as the string data type, so you can set type="String" for every attribute you define. Once an attribute is defined, it can be referenced by a user-issued query (i.e. via the Cameleon GUI), or it can be referenced internally in the spec file. The latter usage is similar to declaring a variable that stores information extracted from one source for subsequent use within the spec file. For example, a common scenario is to extract a URL from one source and use it as the URI of the next source element. When referencing an attribute internally, simply put
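Internal attribute references behave like variable substitution. The sketch below assumes the reference syntax is the attribute name wrapped in # symbols (as in #Link#); the `resolve` helper and the sample values are hypothetical and only illustrate the idea, not how the Cameleon engine is written.

```python
import re


def resolve(template, values):
    """Replace every #name# reference with its value.
    References to unknown names are left untouched."""
    def lookup(match):
        return values.get(match.group(1), match.group(0))
    return re.sub(r"#(\w+)#", lookup, template)


# A value extracted from the first source becomes part of the next URI.
uri = resolve(
    "https://www.cia.gov/cia/publications/factbook/#Link#",
    {"Link": "geos/al.html"},
)

# A reference embedded in static text, as in a matching pattern.
pattern = resolve("<option[^>]*>#Country#</option>", {"Country": "Albania"})

# An unresolved reference stays visible, which makes mistakes easy to spot.
unresolved = resolve("#Missing#", {})
```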
14. [Screenshot: Cameleon Studio showing the HTML source of the France page (title "CIA - The World Factbook -- France") next to the spec file tree, and the Population attribute panel (Begin: Population; the Pattern and End fields are illegible) with the Match and Add to Spec buttons.]

Step 9: Transfer

Now that we are satisfied, we can transfer our spec file to a location we can reach publicly. This can either be done manually, i.e. by saving it in your local
15. [Screenshot: source of the France page next to the spec file tree. The Population row reads: total: 62,752,136; note: 60,876,136 in metropolitan France (July 2006 est.). The attribute panel shows Name Population, Type string, Begin Population; the Pattern and End fields are illegible.]

The following one works for both France and Albania, so presumably it will work for others as well. Delete the existing attribute (right-click, choose delete) and put in the attribute declaration. Test again with other countries.

[Screenshot: Cameleon Studio with the ciafactbook spec file tree, showing the source URL https://www.cia.gov/cia/publications/factbook/ and the attribute Link of type string.]
16. Cameleon User Manual
Aykut FIRAT
24 May 2008
Working Paper CISL 2008-13

Contents
   What is Cameleon Server?
   What is Cameleon Studio?
   Cameleon Studio by Example
   Manually Authoring Cameleon Spec Files
   Summary of Cameleon Spec File Syntax
   Appendices
      A1 Related Readings and References
      A.2 Regular Expressions
      A.3 Installing Cameleon Server
      A.4 Installing Cameleon Studio

Composite Information Systems Laboratory (CISL), Sloan School of Management, Room E53-320, Massachusetts Institute of Technology, Cambridge, MA 02142

Footnote: Cameleon refers to Cameleon Server and Cameleon Studio.

What is Cameleon Server?

Cameleon Server is a server application coded in C# to extract data from web sites based on
17. ELATION>

Line 1: Standard XML declaration tag. Every XML document, and hence every spec file, starts with a declaration similar to this. You can copy this line and use it in your spec file as is.

Line 2: DTD schema tag. A DTD schema is a way to validate an XML document. The schema located at http://interchange.mit.edu/cameleon_sharp/cameleonspec.dtd contains information about the tags and their signatures. You can copy this second line and use it in your spec file as is. It may sometimes be a good idea to validate your spec file against the DTD schema for debugging purposes. Here is a reliable XML validator for that: http://www.cogsci.ed.ac.uk/~richard/xml-check.html

Below is cameleonspec.dtd. The occurrence indicators and separators of the content models did not survive extraction, so child elements are listed by name only.

    <!ELEMENT RELATION (SOURCE)>
    <!ATTLIST RELATION name CDATA #REQUIRED>
    <!ELEMENT SOURCE (AUTHENTICATION COOKIE JSCRIPT POST ATTRIBUTE)>
    <!ATTLIST SOURCE URI CDATA #REQUIRED>
    <!ATTLIST SOURCE DELAY CDATA #IMPLIED>
    <!ELEMENT AUTHENTICATION (Realm Username Password)>
    <!ELEMENT Realm (#PCDATA)>
    <!ELEMENT Username (#PCDATA)>
    <!ELEMENT Password (#PCDATA)>
    <!ELEMENT COOKIE (#PCDATA)>
    <!ATTLIST COOKIE name CDATA #REQUIRED>
    <!ELEMENT JSCRIPT (#PCDATA)>
    <!ATTLIST JSCRIPT name CDATA #REQUIRED>
    <!ELEMENT POST (PARAM)>
    <!ATTLIST POST method (POST|GET) "POST">
    <!ELEMENT PARAM (#PCDATA)>
    <!ATTLIST PARAM name CDATA #REQUIRED value CDATA #REQUIRED>
18. <!ELEMENT ATTRIBUTE (PREFIX SUFFIX BEGIN OBJECT PATTERN END)>
    <!ELEMENT PREFIX (#PCDATA)>
    <!ELEMENT SUFFIX (#PCDATA)>
    <!ATTLIST ATTRIBUTE name CDATA #REQUIRED type (String) "String">
    <!ELEMENT BEGIN (#PCDATA)>
    <!ELEMENT OBJECT (OBJECT_BEGIN OBJECT_END)>
    <!ELEMENT OBJECT_BEGIN (#PCDATA)>
    <!ELEMENT OBJECT_END (#PCDATA)>
    <!ELEMENT PATTERN (#PCDATA)>
    <!ELEMENT END (#PCDATA)>

Line 3: Comment tag. Comments in an XML file are enclosed by the <!-- your comments --> tag. You can intersperse comments anywhere you like in your spec file, as long as the XML document stays well-formed. Anything within the comment tag is ignored by the Cameleon engine.

Line 4: RELATION tag. Relation is the root element of a spec file. It contains one or more SOURCE tags, which we will look at next. It can be thought of as the declaration of a relation or table in a database. Unlike a traditional database relation, a Cameleon relation can consist of attributes from multiple sources, much like a database view. Although the name attribute of this tag currently has no function, a meaningful name should be assigned to it for readability. A good convention is to use the name of the spec file, e.g. the relation cia is defined in cia.xml.

Line 5: SOURCE tag. Each source tag signifies a single web resource. While static
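Before sending a hand-written spec file to the server, it can help to confirm that it is well-formed and has the expected RELATION, SOURCE, ATTRIBUTE nesting. The sketch below uses Python's standard `xml.etree.ElementTree`, which checks well-formedness only and does not validate against the DTD. The `SPEC` text is a hypothetical minimal spec written for this illustration, and `summarize` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal spec file, for illustration only.
SPEC = """<?xml version="1.0" encoding="UTF-8"?>
<RELATION name="cia">
  <!-- one source, one attribute -->
  <SOURCE URI="https://www.cia.gov/cia/publications/factbook/index.html">
    <ATTRIBUTE name="Link" type="string">
      <BEGIN><![CDATA[<form name="SelectCountry">]]></BEGIN>
      <PATTERN><![CDATA[<option value="([^"]*)">Albania</option>]]></PATTERN>
      <END><![CDATA[</form>]]></END>
    </ATTRIBUTE>
  </SOURCE>
</RELATION>
"""


def summarize(spec_bytes):
    """Parse a spec file and report its relation name, attributes, and patterns.
    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed."""
    root = ET.fromstring(spec_bytes)
    if root.tag != "RELATION":
        raise ValueError("root element must be RELATION")
    sources = root.findall("SOURCE")
    if not sources:
        raise ValueError("RELATION needs at least one SOURCE")
    return {
        "relation": root.get("name"),
        "attributes": [a.get("name") for s in sources for a in s.findall("ATTRIBUTE")],
        "patterns": [p.text for p in root.iter("PATTERN")],
    }


summary = summarize(SPEC.encode("utf-8"))
```

Note how the CDATA section lets the pattern contain `<`, `>`, and `"` unescaped; the parser hands the expression back exactly as written.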
19. [Screenshot: XML query result, a DOCUMENT containing one ELEMENT with gdp 2.585, population 82,422,299, gdpunit trillion, and milexpendpercent 1.5.]

Step 11 (Optional): Web Service Creation

While you were going through spec file creation, web service code was also created from your spec file. To put that web service into action, you need to save it and transfer the files to a server that supports the .NET framework. First save your web service by clicking on the Web Service menu item and then on Save, and identify a location to save to.

[Screenshot: folder browser for choosing the save location.]

Now click on the Web Service menu item and select Transfer. Again, enter the required information for your transfer location.

[Screenshot: transfer dialog with Location, User Name, Password, Directory, Passive, and Save as default fields, and Transfer and Cancel buttons.]

Note that two files are transferred. One of them is the .asmx file, which goes into the directory indicated in the ftp info above. You also need to create an App_Code directory on your server for the .vb file, as that is where the .vb file is assumed to exist, as shown in
20. [Figure residue, continued: remainder of the annotated spec file. It declares four more attributes on the second source, each with BEGIN, PATTERN, and END elements wrapped in CDATA: MilExpendPercent (begins at "Military expenditures percent of GDP", ends at </table>), population (ends at </tr>), GDP (begins at "purchasing power parity", ends at </tr>), and GDPunit (same begin and end as GDP). The regular expressions in the PATTERN elements are illegible. Figure annotations: Input Attribute, Output Attribute, Regular Expressions.]

Figure 2: Example Specification File for the CIA Web Source

[Figure residue: Cameleon demonstration page (CAMELEON HOME > DEMONSTRATION) with the SQL query:]

    Select country, population, GDP, gdpunit, MilExpendPercent From cia Where country = Singap
21. [Figure residue, continued: annotated spec file. Illegible values are marked.]

    _1" value="#Departure#"/>
    <PARAM name="dep_dt_mn_1" value="#Month1#"/>
    <PARAM name="dep_dt_dy_1" value="#Day1#"/>
    <PARAM name="arr_arp_cd_1" value="#Destination#"/>
    <PARAM name="dep_dt_mn_2" value="#Month2#"/>
    <PARAM name="dep_dt_dy_2" value="#Day2#"/>
    <PARAM name="adult_pax_cnt" value="1"/>
    <PARAM name="num_cnx" value="1"/>
    <PARAM name="finished" value="Search"/>
    </POST>
    <ATTRIBUTE name=[illegible] type="String" link="true">
    <BEGIN><![CDATA[[illegible; contains http-equiv refresh]]]></BEGIN>
    <END><![CDATA[>]]></END>
    <PATTERN><![CDATA[[illegible; starts with url]]]></PATTERN>
    </ATTRIBUTE>
    </SOURCE>
    <SOURCE URI=[illegible]>
    [attribute declaration illegible; its PATTERN starts with USD]
    </ATTRIBUTE>
    <ATTRIBUTE name=[illegible]>
    <PREFIX><![CDATA[<img src="http://rg.travelocity.com.edgesuite.net/logos/]]></PREFIX>
    <SUFFIX><![CDATA[[illegible]]]></SUFFIX>
    <BEGIN><![CDATA[View\s+Results\s+by\s+Airline]]></BEGIN>
    <END><![CDATA[[illegible]]]></END>
    <PATTERN><![CDATA[<img src="http

Figure annotations: Parameter Replacements (input from query); Parameter Replacement (input from extraction); Regular expression patterns for identifying the boundaries of a region; Pattern for extraction; Prefix and suffix to be attached to the result.
22. a specification file. For example, the following information, which can be found in the CIA World Factbook, can be extracted in table or XML format by (1) authoring a specification (spec) file for that web site and then (2) sending a query to Cameleon.

[Figure residue: the Factbook page for Singapore (geos/sn.html), People section. Population: 4,492,150 (July 2006 est.). GDP (purchasing power parity): $126.5 billion (2005 est.). Military expenditures, percent of GDP: value illegible.]

Figure 1: Available data in the CIA World Factbook site

The spec file for extracting data from the CIA Factbook web site is shown below. This spec file can be authored visually or manually. After the spec file is completed, a query in simplified SQL form can be sent to the Cameleon server via the HTTP protocol, and results are returned as shown in Figure 3.

[Figure residue: first part of the spec file. The patterns are illegible.]

    <?xml version="1.0" encoding="UTF-8"?>
    <RELATION name="cia">
    <SOURCE URI="https://www.cia.gov/cia/publications/factbook/index.html">
    <ATTRIBUTE name="Link" type="string">
    <BEGIN><![CDATA[[illegible]]]></BEGIN>
    <PATTERN><![CDATA[[illegible]]]></PATTERN>
    <END><![CDATA[[illegible]]]></END>
    </ATTRIBUTE>
    </SOURCE>
    <SOURCE URI="https://www.cia.gov/cia/publications/factbook/#Link#">
    <ATTRIBU
23. ck. Here the script can be edited, named, and added to the spec file.

[Screenshot: Cameleon Studio with the Scripts tab open on a JavaScript function from the Babelfish page, the mybabelfish spec file tree, a Name field with the Add to Spec button, and the result TranslatedText: Hola.]

An example spec file that relies on scripts is Expedia. As shown below, Expedia has a JSCRIPT tag called Time, which refers to the result of interpreting some JavaScript code. In such a case the script can be modified in the Scripts window and added to the spec file.

    <SOURCE URI="http://www.expedia.com/pub/agent.dll">
    <COOKIE name="path"><![CDATA[[illegible]]]></COOKIE>
    <JSCRIPT name="Time"><![CDATA[va
24. d before being passed on to a parser. Since then, regular expressions have taken on a life of their own, appearing in such languages as AWK, TCL, and of course Perl, for all sorts of textual data extraction and manipulation purposes.

The most basic regular expression syntax consists of 4 operations. Let A and B each represent an alphabet (a set of characters), and let s and t represent members of those alphabets.

   Operation                  Representation   Meaning
   Union of A and B           A|B              {s : s is in A or s is in B}
   Concatenation of A and B   AB               {st : s is in A and t is in B}
   Kleene closure of A        A*               Zero or more concatenations of A
   Positive closure of A      A+               One or more concatenations of A

Using this notation, you can define a regular expression for positive integers as follows: digit+. Here digit represents the set of characters 0-9. A range of characters like this can be represented in most regular expression languages as [0-9]. Because this is such a common expression, some languages have a special character for it: \d. Learning a regular expression language is quite simple once you've learned one, because most of the operations are the same. Only the notation changes.

Perl5 regular expressions

Here we summarize the syntax of Perl5 regular expressions, all of which is supported by the OROMatcher(TM) Perl5 classes. However, for a definitive reference you should consult the perlre man page that accompanies the Perl5
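The four basic operations and the digit+ definition can be checked directly. A minimal sketch using Python's `re` module (chosen here only for illustration; the `accepts` helper is a made-up name):

```python
import re


def accepts(pattern, text):
    """True when the whole text belongs to the language of the pattern."""
    return re.fullmatch(pattern, text) is not None


union_ok = accepts(r"a|b", "b")            # union of {a} and {b}
concat_ok = accepts(r"ab", "ab")           # concatenation
kleene_empty = accepts(r"a*", "")          # zero concatenations are allowed
positive_empty = accepts(r"a+", "")        # at least one is required

digit = "[0-9]"                            # the set of characters 0-9
integer_ok = accepts(f"{digit}+", "2008")  # digit+
shorthand_ok = accepts(r"\d+", "2008")     # \d as shorthand for a digit
mixed_ok = accepts(f"{digit}+", "20a8")
```

The only difference between the Kleene closure and the positive closure is the empty string, which is exactly what the two `*_empty` checks show.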
25. d end of the region we are interested in on this page, via regular expressions. Pattern is the regular expression denoting what to extract from this region. In this case the pattern is

    <option value="[capture group illegible]">#Country#</option>

Until we are satisfied with our attribute definition, however, we will use an actual country name instead of the input parameter Country for testing purposes, i.e. the pattern will be

    <option value="[capture group illegible]">Albania</option>

The attribute text boxes have a few hidden functions:

1. If you double-click on the Begin box, it eliminates anything before the first match of that pattern in the Source box.
2. If you double-click on the End box, it eliminates anything after the first match of that pattern.
3. If you double-click on the Pattern box, it highlights the matching text in blue. If there is no match, nothing is highlighted. If the pattern matches more than once, multiple highlights are shown.
4. If you double-click on the Name box, it restores the Source box to its original form.
5. You can also drag some text from the source into the attribute boxes. Unfortunately this is not drag and drop, so be careful: because of a bug in .NET, a text box takes the value as soon as you enter it.

Click on Match and it will show you the match in the results pane.

[Screenshot: attribute panel with Name set to Link.]
26. directory and transferring via your favorite ftp application, or using Cameleon Studio. Click on the Spec File menu item and choose Save, giving the same name specified when you first created the spec file, with the xml extension. Giving a different name will create a problem for the ftp transfer.

[Screenshot: Save dialog listing the spec files in the Cameleon folder]

After saving, click on the Spec File menu item and choose Transfer. Below I show a location to which I have privileges to upload my files. You should replace the contents of the boxes with your own information. Note that you can use passive ftp if you want or your server requires it by clicking o
27. er pages, however, the authentication is done through pop-up password windows. We handle these kinds of cases with the following scheme. The username and password values have to be supplied within the SQL query, since they are coded as references in this spec file.

SOURCE URI http game etrade com cgi bin cgitrade TransHistory gt lt AUTHENTICATION gt lt Realm gt E Trade Player game lt Realm gt lt Username gt username lt Username gt lt Password gt password lt Password gt lt AUTHENTICATION gt

Custom Cookies

In Cameleon, cookie handling is automatic as long as the cookies are set through headers. In some cases cookies can be set in a non-standard way, for example using the JavaScript API. To handle these cases we allow custom cookie setting, as shown in the example below (custom cookies used are jscript 1 path).

SOURCE URI 2 http www expedia com pub agent dll COOKIE name jscript gt 1 lt COOKIE gt COOKIE name path gt lt COOKIE gt

JavaScript Interpretation

JavaScript is used frequently in Web pages to create the html document on the client side. In most cases JavaScript does not pose a problem in wrapping Web pages, because it is usually used for cosmetic reasons. In some cases, however, not being able to interpret JavaScript may block the wrapper engine from getting to a desired page. One real example is the Expedia Web site, which requires interpreting JavaScript code and supplying the result as a po
28. f extracted characters is set to be the value of the particular attribute. The pattern is matched within the BEGIN END region as many times as possible. Therefore an attribute can sometimes have multiple values. When the result of a query involves two or more attributes, the Cartesian product between the individual values of each attribute is returned. The only time a Cartesian product is not taken is when the number of values is the same across all involved attributes. In that case the values are merged instead: the i-th value of attribute a is merged with the i-th value of attribute b, and so on, to produce the i-th row of the result.

Line 9: END tag. The end tag is analogous to the begin tag. It marks the end of a region where pattern matching should be applied for a particular attribute.
Line 10: Signifies the end of an attribute declaration.
Line 11-12: Signify the possibility of more than one attribute tag per source.
Line 13: Signifies the end of a source declaration.
Line 14-15: Signify the possibility of more than one source tag per relation.
Line 16: Signifies the end of the relation declaration.

Features of Spec Files

Disjunctive Patterns

Sometimes it is not possible to discover a single pattern that would match the desired data across all similar pages. In these cases we allow disjunctive patterns, that is, multiple patterns for one attribute. The following is an example of such a situat
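The rule for combining multi-valued attributes (merge when every attribute has the same number of values, Cartesian product otherwise) can be sketched as follows. This is an illustration in Python, not Cameleon's own code; the function name and the dictionary layout are assumptions made for the example.

```python
from itertools import product


def combine(values_by_attribute):
    """Build result rows from a mapping of attribute name to its list of values."""
    names = list(values_by_attribute)
    columns = [values_by_attribute[name] for name in names]
    if len({len(column) for column in columns}) == 1:
        rows = zip(*columns)        # same count everywhere: merge i-th with i-th
    else:
        rows = product(*columns)    # counts differ: Cartesian product
    return [dict(zip(names, row)) for row in rows]


# Two values each: merged row by row.
print(combine({"a": [1, 2], "b": ["x", "y"]}))
# One value against two: every combination is returned.
print(combine({"a": [1], "b": ["x", "y"]}))
```

The merge case is why two attributes extracted from the same table usually line up row by row without any extra work in the spec file.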
29. ght Click on ciafactbook in the specfile tree and choose Run.

Results
Link gecs html
MilExpendPercent 2 6
Population No pattern matched
GDP 1 871
GDPUnit billion

It seems everything is fine, except that Population has a problem.

Step 8 Debug

Right click on ciafactbook in the specfile tree and choose Step. Click on Step a couple of times until you arrive at the second source. Then switch to the Attribute tab, right click on the attribute Population in the specfile tree and click on Transfer Attribute; this will fill in the fields in the lower half of the screen. If you double click on Begin first, then End, and then Pattern, you will notice that the pattern indeed fails, because population is not consistently expressed with the pattern that we found for Albania. We need to look for a more common pattern. Note that you can also put a breakpoint somewhere in your spec file, so that it runs until the breakpoint, by right clicking on the item of interest in the spec file tree. You can remove breakpoints by using Delete Break Point.

[Screenshot: Cameleon Studio stopped at the second source, with the Population attribute shown in the specfile tree]
30. gs that you may find. Please keep in mind that this is a pre-beta version and the software is continuously being updated.

Manually Authoring Cameleon Spec Files

The general layout of a spec file is first presented in this guide, followed by a closer look at some of the useful features. A summary of the spec file syntax is provided at the end.

Cameleon Spec File Structure

A Cameleon spec file is XML based, which means that its content is organized as a hierarchy of nodes, or tags. Always name the spec file with the xml extension (e.g. my spec file xml). Let's examine the sample spec file below to illustrate spec file structure as well as some key elements of a spec file.

1 lt xml version 1 0 encoding UTF 8 gt
2 lt DOCTYPE RELATION SYSTEM http interchange mit edu cameleon_sharp cameleonspec dtd gt
3 lt Comments This is a spec file to wrap the CIA Fact Book site gt
4 RELATION name cia gt
5 SOURCE URI http www odci gov cia publications factbook index html gt
6 ATTRIBUTE name Link type String gt
7 lt BEGIN gt lt CDATA lt body gt lt BEGIN gt
8 lt PATTERN gt lt CDATA lt option value 4 4 gt gt Country gt lt PATTERN gt
9 lt END gt lt CDATA lt Bb oO dD y Y gt gt lt END gt
10 lt ATTRIBUTE gt
11 12 lt ATTRIBUTE gt lt ATTRIBUTE gt
13 lt SOURCE gt
14 15 lt SOURCE gt lt SOURCE gt
16 lt R
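The nesting described in this section (a RELATION containing SOURCE elements, each containing ATTRIBUTE elements with BEGIN, PATTERN and END) can also be produced programmatically. The sketch below is an assumption-laden illustration in Python: it uses the standard xml.etree.ElementTree module, writes escaped text instead of CDATA sections, omits the DOCTYPE line, and uses a made-up URL and pattern.

```python
import xml.etree.ElementTree as ET


def build_spec(relation, uri, attributes):
    """Return spec file XML with one SOURCE and the given ATTRIBUTE definitions."""
    root = ET.Element("RELATION", name=relation)
    source = ET.SubElement(root, "SOURCE", URI=uri)
    for attr in attributes:
        node = ET.SubElement(source, "ATTRIBUTE", name=attr["name"], type="String")
        for tag in ("BEGIN", "PATTERN", "END"):
            # ElementTree escapes < and > in text; a CDATA section carries the same content.
            ET.SubElement(node, tag).text = attr[tag.lower()]
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, chosen only to show the structure.
spec_xml = build_spec("cia", "http://example.org/index.html", [
    {"name": "Link", "begin": "<body>", "pattern": '<a href="(.*?)">', "end": "</body>"},
])
print(spec_xml)
```

Whether the Cameleon engine accepts escaped text in place of CDATA has not been checked here; hand-written spec files in this manual use CDATA.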
31. hem into your spec file directory and run

A.4 Installing Cameleon Studio

1. Download CameleonStudio.zip

[Screenshot: contents of CameleonStudio.zip]

2. Run CamSimulator.exe. If you have any difficulties, make sure you have the latest .NET framework components.
32. http world altavista com tr you need to work with forms.

[Screenshot: Cameleon Studio browser showing the AltaVista Babel Fish Translation page]

If you click on the Forms tab in the babelfish page you will see two forms.

[Screenshot: Forms tab listing the two forms of the page and their inputs]

The first form is the one we are interested in. If you double click on form frmTrText, it will be transferred to the SpecFile Tree tab as a source.

[Screenshot: Cameleon Studio, SpecFile Tree tab]
33. [Screenshot: page source of the CIA World Factbook start page, showing the SelectCountry form and its list of option elements, e.g. option value geos al html for Albania]
34. ing the Caméléon Web Wrapper. EC-Web. http://web.mit.edu/smadnick/www/wp/2005-06.pdf
7 Lynn Wu, Aykut Firat, Tarik Alatovic, Stuart Madnick. Querying Web Sources within a Data Federation. ICIS. http://web.mit.edu/smadnick/www/wp/2006-09.pdf

A.2 Regular Expressions Syntax

What is a regular expression
Perl5 regular expressions

It is beyond the scope of this guide to give a detailed explanation of regular expressions to beginners. The OROMatcher package is geared toward programmers who are already familiar with regular expressions, having used them with other languages, and who now want to apply them in their Java programs. However, we shall make a small attempt to cover the basics and summarize the Perl5 syntax supported by the OROMatcher Perl5 classes. For a detailed exploration of regular expressions, for both beginners and advanced users, we recommend the book Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly & Associates.

What is a regular expression

Part of this discussion is based on page 94 of Compilers: Principles, Techniques, and Tools by Aho, Sethi and Ullman. A regular expression is a pattern, denoted by a sequence of symbols, representing a state machine or mini-program that is capable of matching particular sequences of characters. Regular expressions have their root in lexical analysis and tokenization, where a set of lexemes had to be recognize
35. ion in which two disjunctive patterns are specified for a single attribute.

ATTRIBUTE name LastTrade type String gt BEGIN CDATA Lasts Trade BEGIN END CDATA TR END lt PATTERN gt lt CDATA lt B gt s s lt FONT s SIZE 1 gt lt FONT gt gt lt PATTERN gt lt PATTERN gt lt CDATA lt B gt s d s lt B gt gt lt PATTERN gt lt ATTRIBUTE gt

We should note, however, that these disjunctive patterns are not mutually exclusive, and occasionally special care must be taken to construct patterns whose intersections are empty. Otherwise the same item will be matched multiple times and repeated in the output.

Conjunction

In spec files it is possible to define conjunctive patterns by simply enclosing them in parentheses. The semantics of conjunctive patterns in Cameleon corresponds to the concatenation of pattern matches. The first pattern element in the example above is such a case, with two groups of enclosing parentheses: the first one matches the whole part of the last trade value, the second one the fractional part. The matched elements are then concatenated to form a single value. This feature is very useful when the data to be extracted is not atomic and is separated by unwanted tags.

Multi page transitions

When wrapping web pages we sometimes need to traverse multiple pages to locate the page we want to extract information from. This situation occurs when the URLs
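Disjunctive patterns and the concatenation of parenthesized groups can be mimicked in a short Python sketch. It assumes that every pattern of an attribute is tried against the region and that the groups of each match are joined in order; the HTML snippets and the patterns are hypothetical and are not the ones from the LastTrade example.

```python
import re


def match_attribute(region, patterns):
    """Try every pattern; join the parenthesized groups of each match into one value."""
    values = []
    for pattern in patterns:
        for m in re.finditer(pattern, region):
            values.append("".join(g for g in m.groups() if g is not None))
    return values


PATTERNS = [
    r"<B>\s*(\d+)\s*<FONT SIZE=1>(.*?)</FONT>",   # whole part plus fractional part
    r"<B>\s*(\d+)\s*</B>",                         # whole part only
]

print(match_attribute("<B>12 <FONT SIZE=1>3/4</FONT></B>", PATTERNS))
print(match_attribute("<B> 45 </B>", PATTERNS))
# Overlapping patterns repeat the same item, as warned in the text.
print(match_attribute("<B>7</B>", [r"<B>(\d+)", r"(\d+)</B>"]))
```

The last call shows why the patterns should be written so that at most one of them matches any given item.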
36. meleon User Manual

Source Hello
Languagemode en es

Note that the language mode values should be taken from the acceptable values of babelfish by examining its source, as shown below.

[Screenshot: Cameleon Studio showing the Babel Fish result page, the mybabelfish spec file tree with the TranslatedText attribute, and the input attributes source and languagemode]

[Screenshot: page source of the Babel Fish translation form]
37. n, the sections of source data defined by the BEGIN END tags of each attribute, and the data extracted according to your pattern specification when your spec file works. Examine your extraction rule if the data you want to extract has been retrieved but is not being extracted. Examine the source and parameter specifications if Cameleon cannot even retrieve the page content. Study the error messages, if any.

Summary of Cameleon Spec File Syntax

RELATION (attribute: name). Root element.
SOURCE (attributes: URI, DELAY). URI is the required web resource. DELAY delays by the specified time.
AUTHENTICATION. For websites that require authentication through a pop-up window. Holds the authentication realm, username and password.
COOKIE (attribute: name). 0 or more.
JSCRIPT (attribute: name). Evaluates a JavaScript snippet and stores the output.
POST (attribute: method, POST or GET; default value is POST). Specifies the type of http request.
PARAM (attributes: name, value). Request parameter.
ATTRIBUTE (attributes: name, type; default type is String). Stores extracted values.
Prefix. Adds text to the beginning of the extracted value.
Suffix. Adds text to the end of the extracted value.
BEGIN BEGIN l
38. n the passive check box, and you can also save your login information by checking the Save as default box. Once the information is entered, clicking on Transfer will ftp your spec file to the remote location. As mentioned before, this step can be done manually as well.

[Screenshot: Transfer dialog with Location, User Name, Password and Directory boxes, passive and Save as default check boxes, and Transfer and Cancel buttons]

Step 10 Test on the Web

Now go to a public Cameleon test location. For example:
1. http interchange mit edu cameleon_sharp cameleon html
2. http www aykutfirat com Cameleon Demo aspx

In these demo pages you can use a custom spec file location by entering a registry directory. For example, if you put the spec file in your MIT www directory, use http www mit edu username as your registry directory. Of course, replace username with your username. Then enter the query shown below.

SQL Query: select GDP Population GDPUnit milExpendPercent from ciafactbook where Country Germany
Format: xml
Debug
Custom Spec File Directory

Note that I put the spec file directly in the local directory, so I leave the custom spec file directory empty. You should fill it in appropriately, pointing to the location of your spec file. When you click on Run you should get

lt xml versionz 1 0 encoding IS0 8859 1 7 gt DOCUME
39. newline
\r carriage return
\t tab
\f formfeed
\d digit [0-9]
\D non-digit [^0-9]
\w word character [0-9a-z_A-Z]
\W a non-word character [^0-9a-z_A-Z]
\s a whitespace character [ \t\n\r\f]
\S a non-whitespace character [^ \t\n\r\f]
\xnn hexadecimal representation of character
\cD matches the corresponding control character
\nn or \nnn octal representation of character unless a backreference
\1, \2, \3, etc. match whatever the first, second, third, etc. parenthesized group matched. This is called a backreference. If there is no corresponding group, the number is interpreted as an octal representation of a character.
\0 matches null character
Any other backslashed character matches itself

Expressions within parentheses are matched as subpattern groups and saved for use by certain methods.

By default, a quantified subpattern is greedy. In other words, it matches as many times as possible without causing the rest of the pattern not to match. To change the quantifiers to match the minimum number of times possible, without causing the rest of the pattern not to match, you may use a ? right after the quantifier.

*? Match 0 or more times
+? Match 1 or more times
?? Match 0 or 1 time
{n}? Match exactly n times
{n,}? Match at least n times
{n,m}? Match at least n but not more than m times

Perl5 extended regular expressions are fully supported.

(?#text)
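The difference between greedy and minimal quantifiers, and the use of a backreference, is easiest to see on a small input. The sketch below uses Python's re module, on the assumption that these particular constructs behave as they do in Perl5; the sample strings are made up.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b>, so the two values come back glued together.
greedy = re.findall(r"<b>(.*)</b>", html)

# Minimal: .*? stops at the first </b>, giving one value per tag pair.
lazy = re.findall(r"<b>(.*?)</b>", html)

# Backreference: \1 must repeat whatever the first group matched.
doubled = re.search(r"(\w+) \1", "the the cat").group(0)

print(greedy)
print(lazy)
print(doubled)
```

When a spec file pattern returns far more text than expected, switching a quantifier to its minimal form is usually the first thing to try.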
40. ore Format table country population gdp gdp unit milexpendpercent

Figure 3 Simple SQL Query against the wrapped CIA World Fact Book

What is Cameleon Studio

Cameleon Studio is a visual application that aids the development of spec files and their conversion into web services. It has a built-in browser on the left that also shows the source of a web page and the forms in that web page. On the upper right it shows the spec file in tree form and in original form, together with the auto-produced web service code. On the lower right it has several tabs: for surfing web sites (Source), defining attributes (Attribute), providing values for input attributes (Input Attributes), displaying messages from the program (Messages), displaying scripts from the web sites (Scripts), and authoring custom forms (Forms). Results can also be viewed via the Results tab.

[Screenshot: Cameleon Studio showing the Google home page in the built-in browser]

Figure 4 Cameleon Studio In
41. pe string Begin form name SelectCountry gt Pattern lues n Albania option gt End lt form gt Match Add to Spec Results E Link geos al html

This is what we want. Now change Albania into Country, so that it will work for all countries, and click on Add to Spec. Please see the Cameleon user manual at the end of this manual if you have difficulty understanding what this pattern means.

[Screenshot: spec file tree with the ciafactbook source and its Link attribute]

Step 4 Define the Second Source

Go back to the web site and choose a country to proceed to the next page.

[Screenshot: Cameleon Studio browser showing the Albania page of the CIA World Factbook]
42. [Screenshot: spec file tree with two sources; the second source, at the factbook address followed by Link, defines the attributes MilExpendPercent, Population, GDP and GDPUnit with their Begin, Pattern and End values]

Step 6 Define the Input Attributes

Click on the Input Attributes box and enter Country and a value for a country. Note that these attributes can either be obtained from the query or be extracted from a source. If they are extracted from a source, they can be used in subsequent sources.

[Screenshot: Input Attributes tab with Name and Value columns]

Step 7 Test

Ri
43. r d d new Date print d getTime gt lt JSCRIPT gt POST method GET gt lt PARAM name qscr value fexp gt lt PARAM name flag value q gt lt PARAM name city1 value Departure gt lt PARAM name citd1 value Destination gt lt PARAM name datel value Date1 gt lt PARAM name timel value 362 gt lt PARAM name date2 value Date2 gt lt PARAM name time2 value 362 gt lt PARAM name cAdu value 1 gt lt PARAM name rfrr value 429 gt lt PARAM name zz value Time gt lt POST gt

Other Features

You can open existing spec files with the Spec File Open directive, and view error messages in the Messages pane.

2 Known Bugs

Cameleon Studio is an experimental system and contains a number of known bugs. Some of these are:
1. A Files transferred message does not always mean that the files were actually transferred. If the file is not found in the directory, the message is still shown although the file was not transferred.
2. You always need to start with the Spec File New directive and give a name to your spec file. If you forget this, spec file transfer will not work. You can still transfer your file manually, though.
3. Cameleon Studio and Cameleon Server will produce different results in some rare cases. This is because the simulation performed in Cameleon Studio is slightly different in some cases.

There are also some other bu
44. rg travelocity com edgesuite net logos VO 377 s border 0 s alt Airline Logo gt gt lt PATTERN gt lt ATTRIBUTE gt lt SOURCE gt lt RELATION gt

Figure 1 Cameleon Spec File for Yahoo Travel

Parameter Replacement

Parameter replacement is the use of input or extracted attribute values within subsequent elements in the spec file. In the multiple page traversal case we have seen one example of this: the value of attribute Link was used in the next source element. It is also possible to supply any extracted or input attribute value within the attribute definitions. Consider the following SQL query to the Yahoo Travel Web site:

Select Airline Price from yahootravel where Departure BOS and Destination SFO and Month1 5 and Day1 19 and Month2 6 and Day2 1

When this query is executed, the input attribute values specified in the where clause replace the attributes of the same name enclosed between # signs in the post parameters, as shown in Figure 1.

Get Post Methods & Authentication

Cameleon spec files support both get and post methods when connecting to Web pages. A post example is shown in Figure 1. The method attribute of the post tag determines which method is to be used. Most web pages perform authentication through forms. When connecting to those Web pages, get or post methods with parameter replacement can be used for authentication purposes. In some oth
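Parameter replacement amounts to substituting named placeholders with input or extracted values before a request is made. The Python sketch below illustrates the idea under two assumptions that should be checked against a working spec file: that placeholders are written as a name between # signs, and that unknown names are left untouched. The function name and the example URL are hypothetical.

```python
import re


def substitute(template, values):
    """Replace each #Name# placeholder with values[Name]; leave unknown names as they are."""
    def replace(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)
    return re.sub(r"#(\w+)#", replace, template)


# Hypothetical second-source address built from an extracted Link value.
print(substitute("https://example.org/factbook/#Link#", {"Link": "geos/al.html"}))
# Hypothetical post parameters filled from the where clause of a query.
print(substitute("city1=#Departure#&citd1=#Destination#",
                 {"Departure": "BOS", "Destination": "SFO"}))
```

Leaving unknown placeholders in place makes a missing input attribute easy to spot in the resulting request.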
45. s 0 gt lt xs element name MilExpendPercent msdata ReadOnly true type xs string minOccurs 0 gt lt xs element name GDP msdata ReadOnly true type xs string minOccurs 0 gt lt xs element name Population msdata ReadOnly true type xs string minOccurs 0 gt lt xs sequence gt lt xs complexType gt lt xs element gt lt xs choice gt lt xs complexType gt lt xs element gt lt xs schema gt diffgr diffgram xmlns msdata urn schemas microsoft com xml msdata xmlns diffgr urn schemas microsoft com xml diffgram v1 gt NewDataSet xmlns gt ciafactbook diffgr id ciafactbook1 msdata rowOrder 0 diffgr hasChanges inserted gt Country gt Turkey lt Country gt lt Link gt geos tu html lt Link gt lt MilExpendPercent gt 5 3 lt MilExpendPercent gt lt GDP gt 627 2 lt GDP gt lt GDPUnit gt billion lt GDPUnit gt lt Population gt 70 413 958 lt Population gt lt ciafactbook gt lt NewDataSet gt lt diffgr diffgram gt lt DataSet gt

1 Advanced Features of the Cameleon Studio

Although the above example covers most of the features of Cameleon Studio, there are a couple of things that were not mentioned. These are explained below.

Auto form handling

If the web page you want to wrap requires form submission before you can access your data, you need to define a form submission. For example, if you want to wrap babelfish at
46. [Screenshot: page source of a CIA World Factbook country page, with the Results pane listing the extracted attributes]

Once I do the transfer, I can go to my web service location, which is server name specfilename asmx; in my case this is http www aykutfirat com ciafactbook asmx

[Screenshot: the ciafactbook Web Service page in Internet Explorer, listing the supported operation]

As seen above, this web service has a default method called getCIFACTBOOKData. Click on that, enter a country name and click on Invoke. Note: You should enable remote testing in your server to test
47. st parameter. Cameleon spec files allow the interpretation of JavaScript, as shown in the following example:

SOURCE URI 2 http www expedia com pub agent dll lt JSCRIPT name Time gt var d d new Date print d getTime lt JSCRIPT gt

In this example the output of the JavaScript snippet is assigned to the Time attribute. We should note that the JavaScript code to be interpreted does not have to be static; parameter replacement can be used in JSCRIPT tags as well.

Prefix and Suffixes

Figure 1 shows an example of prefix and suffix declarations. With these constructs it is possible to add static text before and after the extracted data values. By using the parameter replacement feature, it also becomes possible to glue multiple extractions together.

Delays

We should mention another useful feature in spec file creation: delays. This is used when the wrapper engine requests data from a Web site but has to wait a certain amount of time before getting an answer. We cover this case with a delay element in the source declaration, which specifies the waiting time in milliseconds. We show an example below.

SOURCE URI http www qixo com Link DELAY 65000 gt

Debugging

You can test your spec files by setting the debug parameter to true in the web-based query testing harness http interchange mit edu cameleon_sharp cameleon html. It will output all web data retrieved by Cameleo
48. terface

Cameleon Studio by Example

In this section we go through an example by authoring a specification file for the CIA World Fact Book using Cameleon Studio.

Step 1 Give a name

Use the Spec File menu item, click on New and enter a spec name, e.g. ciafactbook.

[Screenshot: Relation Name dialog with ciafactbook entered]

Step 2 Go to the web site

Either type the address directly in the URL text box in the Source tab on the right, or search for it in Google.

[Screenshot: Cameleon Studio showing the CIA World Factbook start page, with its address in the URL box of the Source tab]

We are trying to simulate how the extraction should happen, and this is the page where we start. We click on the Add to Spec button in the Source tab.
49. the Web Service Code tab below

[Screenshot: Web Service Code tab showing the auto-generated Visual Basic code for the ciafactbook web service, next to the page source of the Germany page]
50. [Screenshot: continuation of the option list in the page source, with the Begin, Pattern and End boxes of the Attribute tab filled in]

Here Link is the name of our attribute.

Click on the Attribute tab and enter the info in the boxes as follows:

[Screenshot: Attribute tab with Name set to Link]

form name SelectCountry gt and lt form gt designate the begin an
51. y web Web config

2. Upload these files to a server that supports the .NET framework. In the snapshot below, cgi-bin, MyWeb and _holding.htm were preexisting directories and file; you do not need them.

[Screenshot: hosting control panel file manager listing the bin, Cameleon and cgi-bin folders and the uploaded Cameleon html, camserv aspx and web config files]

3. Spec files go under the Cameleon directory.

4. Test by using Cameleon html.

[Screenshot: Cameleon demo page in Firefox returning a result with capital Ankara]

5. Now you can author spec files, upload t
