        Universiteit Leiden Computer Science
         Contents
2.9.3 Conclusion of Filter4ImageCrawler
Chapter 3  Website for UML Database Collection
3.1 Improvement to Distributed Web Crawler
3.2 Website for Database Collection
3.2.1 Introduction of Drupal
3.2.2 Website Design
3.2.3 Future Work for the Website
3.3 Change of ImageCrawler
Chapter 4  Website for XMI Storage and Query
4.1 Overview of XML
4.1.1 Overview of XML Schema
4.2 Overview of XMI
4.2.1 XMI vs. Relational Database
4.3 Related Work
4.3.1 XML Storage
4.3.2 XML Query
4.4 XMI Database
4.4.1 Database Design
4.4.2 [title illegible in source]
4.5 Website Design for XMI Storage and Query
4.5.1 XMI Storage
4.5.2 XMI Query
4.6 Validation
Chapter 5  [title illegible in source]
Appendix A  User Manual for ImageCrawler
A.1 Image Downloading Part
A.2 Domain Rank Part
Appendix B  User Manual for Filter4ImageCrawl
Figure 19: The initial interface of ImageCrawler

A.1 Image Downloading Part

The image downloading part downloads images from the results of a Google image search for the keywords the user has entered. The information about those images is then saved into an Access database.

Step 1: Select the database that will store the information about the images to be downloaded.
There are two buttons to the right of the "Database" textbox, providing two ways of selecting the database.
The "Create New" button creates an empty Access database file that contains all five tables of our database design: pic, picblack, picwhite, blackcount and whitecount. The new Access file is created in the same folder as the executable program file. It is named "pic" with the current date and time appended. For example, if the database is created at 18:30:25 on December 21, 2012, the name will be "pic20121221183025.mdb". In this way, conflicts between duplicate file names are avoided.
The other way is to click the "Browse" button to select an Access file that already contains the five tables. After the user has selected a database file, the program checks whether the database has the tables and keys required by the program.
When a database file is selected or created, its full path is shown in the "Database" textbox.

Step 2: Select the folder the images will be saved
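The file naming scheme described in Step 1 can be illustrated with a short Python sketch. This is not the program's actual code; the function name `new_database_name` is hypothetical, and only the "pic" + date and time + ".mdb" pattern is taken from the text.

```python
from datetime import datetime


def new_database_name(created_at: datetime) -> str:
    """Build a file name of the form pic<yyyyMMddHHmmss>.mdb (hypothetical helper)."""
    # The timestamp makes the name unique to the second, which is how
    # the manual says duplicate file names are avoided.
    return created_at.strftime("pic%Y%m%d%H%M%S.mdb")


# The example from the manual: 18:30:25 on December 21, 2012.
print(new_database_name(datetime(2012, 12, 21, 18, 30, 25)))
```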
The ImageCrawler program runs on a single PC, which cannot handle the large volume of information on the Internet. A high-performance web crawler can be achieved with a distributed system. Most search engines have several IDCs (Internet Data Centers). Each IDC is in charge of a range of IPs. Within each IDC, many instances of the web crawler work in parallel.

There are two methods used by distributed web crawlers:

1. All web crawlers of the distributed system rely on the same connection to the Internet. They cooperate to fetch data and combine the results within a local area network. The advantage of this method is that it is easy to expand the hardware resources. The bottleneck is the bandwidth of the Internet connection, which is shared by all instances of the web crawler.

2. The web crawlers of the distributed system are geographically far away from each other. The advantage of this method is that insufficient bandwidth is not a problem. However, network conditions differ from place to place, so integrating the results of the different web crawlers is the most important problem.

3.2 Website for Database Collection

As the hardware for building a distributed system is not available at the moment, a website can be established to collect databases from users who have downloaded the ImageCrawler program. The website provides the download link of the program. Wh
The path is shown in the "Save Path" textbox. There is a default value, "C:\UML class diagram". The user can change it by clicking the "Browse" button next to the textbox.

Step 3: Input the keywords to search.
The default keyword is "uml class diagram". The user can enter his own.

Step 4: Set the number of images to be downloaded.
The default number is 100. The user can enter his own.

Step 5: When steps 1 to 4 are finished, the user can start the downloading process by clicking the "Start" button. The program first checks whether the database, save path, keywords and number of images have been set. If not, a hint tells the user to finish that first. The downloading process may take a while, depending on the number of images the user wants and the condition of the user's Internet connection. When it finishes, a message box pops up to notify the user.

A.2 Domain Rank Part

The domain rank part implements an analysis function. The user can build a "blacklist" that contains the images not related to the keywords. He can then get statistical data about the domains. In addition, the program can export the database in the form of csv files.

Function 1: Load the database.
Click the "Load Database" button to choose a database. After a database is chosen, the program checks whether the tables and keys of the database meet the requirements. Then the URLs of the images are loaded. The
1.1 Overview of UML

Unified Modeling Language (UML) is a standard modeling language that was adopted as a standard by the Object Management Group in 1997. Since then, it has been the industry standard for modeling software-intensive systems. In UML 2.2, there are 14 types of diagrams. They are divided into 3 categories: structure diagrams, behavior diagrams and interaction diagrams. Here's the table of their role in the UML standard:

Category            Name                         Role
Structure diagrams  Class diagram                Contains the attributes, methods, and relationships between the classes of a system.
                    Component diagram            Splits a system into components and shows the dependencies among them.
                    Composite structure diagram  Shows the detailed structure of a class and the collaborations that this structure makes possible.
                    Deployment diagram           Describes the hardware deployment of a system.
                    Object diagram               Shows a complete or partial view of the structure of an example modeled system at a specific time.
                    Package diagram              Splits a system into different packages and shows the dependencies among them.
                    Profile diagram              Shows the classes and packages at the metamodel level together with the extensions of them.
Behavior diagrams   Activity diagram             Shows the detailed workflow of the components in a system.
                    State machine diagram        Describes the states inside a system and the transitions among them.
                    Use case diagram             Describes the interact
Universiteit Leiden
Computer Science

A System for Collecting and Analyzing UML Models and Diagrams

Name: Chengcheng Li
Student no: s1026968
Date: 13/08/2012
1st supervisor: Michel Chaudron
2nd supervisor: Bilal Karasneh

Contents

List of Figures
List of Tables
Abstract
Chapter 1  Introduction
1.1 Overview of UML
1.2 Overview of UML Class Diagrams
1.3 Overview of Our Project
Chapter 2  ImageCrawler
2.1 Overview of Web Crawler
2.2 Related Work
2.3 Software Requirements
2.4 Why Google
2.5 Implementation of ImageCrawler
2.5.1 Database Design
2.5.2 Image Downloading Part
2.5.3 Domain Rank Part
2.6 Validation
2.6.1 Source Code with UML
2.7 The Limitation of ImageCrawler
2.8 Focused Web Crawler
2.9 Perceptual Hash Algorithm
2.9.1 Implementation of Filter4ImageCrawler
2.9.2 Validation of Filter4ImageCrawler
attribute

owner     varchar(200)  The id of the class this attribute belongs to
xmi_id    varchar(200)  The id for this attribute
datetime  varchar(20)   The date and time this item is generated
url       varchar(200)  The URL of the source image of this attribute

Table 7: The table to store attributes

tblOperation

Key         Type          Comment
name        varchar(200)  Name of the operation
visibility  varchar(20)   Public, private...
isAbstract  varchar(20)   Whether it's an abstract operation
owner       varchar(200)  The id of the class this operation belongs to
xmi_id      varchar(200)  The id for this operation
datetime    varchar(20)   The date and time this item is generated
url         varchar(200)  The URL of the source image of this operation

Table 8: The table to store operations

tblAssociation

Key          Type          Comment
name         varchar(200)  Name of the association
visibility   varchar(20)   Public, private...
ordering     varchar(200)  The direction of this association
aggregation  varchar(200)  The type of the aggregation
association  varchar(200)  The id of the association this item belongs to
typeofAsso   varchar(200)  The id of the class this association points to
xmi_id       varchar(200)  The id for this association
datetime     varchar(20)   The date and time this item is generated
url          varchar(200)  The URL of the source image of thi
Figure 10: Example of black and white normal UML class diagram

Result: 45.69% of the result is the same as in the standard list.

Experiment 3: Black and white complicated UML class diagram

Example image:
Figure 2: Process of BFS
Figure 3: Process of URL queuing
Figure 4: Interface of ImageCrawler
Figure 5: Class diagram of ImageCrawler
Figure 6: Links of pages of Google image search
Figure 7: Class diagram for Filter4ImageCrawler
Figure 8: Interface for Filter4ImageCrawler
Figure 9: Example of black and white simple UML class diagram
Figure 10: Example of black and white normal UML class diagram
Figure 11: Example of black and white complicated UML class diagram
Figure 12: Example of colored simple UML class diagram
Figure 13: Example of colored normal UML class diagram
Figure 14: Example of colored complicated UML class diagram
Figure 15: An example of XML file
Figure 16: An example of XML Schema
Figure 17: The architecture of MOOGLE
Figure 18: The interface of XML2DB
Figure 19: The initial interface of ImageCrawler
Figure 20: The interface of ImageCrawler in process
Figure 21: The initial interface of Filter4ImageCrawler
Figure 22: User login page
Figure 23: User registration page
Figure 24: Request new password page
Figure 25: Upload page
Figure 26: My settings page
Figure 27: Databases page
Figure 28: Database showing page
Figure 29: Initial interface of XML2DB
Figure 30: Upload page for SQL files
Figure 31: Search models page
Figure 32: Query of URL page

List of Tables

Table 1: 14 kinds of UML diagrams
Table 2: Relationships of class diagrams
Table 3: Process of operation in a queue
Table 4: Design of the database
Table 5: Comparison of XML and relational database
Table 6: The table to store classes
Table 7: The table to store attributes
Table 8: The table to store operations
5.1 XMI Storage

For safety reasons, only the administrator of the website is allowed to upload the SQL file generated from the XML file. After the SQL file is uploaded, the website extracts the commands inside and executes them. Thus, the content of the XML file is written into our database.

4.5.2 XMI Query

We provide a query function for the users of the XMI database. There are two ways to query the XMI database:

• Search by URL
After a search of a database of image information, the result page shows a list of the URLs of the images. The user can tick the checkboxes in front of the items to search for all the information (classes, attributes, operations and associations) contained in the selected images. The result page shows the images and all their information. The information of each image is shown in four tables: class, attribute, operation and association.

• Search by keywords
The user can input keywords, and the models containing the keywords will be selected. We provide the user with all the information of the images containing the models.
If the user inputs more than one keyword, keywords separated by spaces are combined with the relation "AND", so the result contains only items that contain all the keywords. The user can use "OR" to connect keywords. Then the result shows models that contain any of the keywords connected by the "OR" condition.
By default, we provide a fuzzy enquiry
64 results on 8 pages. There are tools based on this API, and they can get at most 64 images. That is far from enough.

• Firefox has a plug-in named "Save images" that can save images from the tab the user currently has open. The problem is that the results of a Google image search shown on the webpage are just thumbnails. When we use the add-on in Firefox, all we get are thumbnails from this one page. The downloaded images are too small to use. What's more, the tool can only download images from pages that have been opened in the browser. It is not feasible to open all the pages of the image list and download them one page at a time.

• Bulk Image Downloader is software that can download full-sized images from almost any thumbnailed web gallery. However, it costs $24.95. Otherwise, its functions are incomplete.

Thus, we have developed our own software named ImageCrawler to collect images from the Internet.

2.3 Software Requirements

Our software has two requirements:
1. We want to get as many images as possible.
2. We want to finish the process efficiently.

To finish the task, we have to implement a web crawler for images all over the Internet. A high-performance web crawler should have two features:
1. It should be able to grab a large volume of data from the Internet. That's why we use it.
2. It should run on a distributed system. As the quantity of data is extremely large in the Internet, a
It is the parent node of all other elements.
An element can have attributes. For example, every <book> element has an attribute called "category". The value of an element appears between the start and end tags of the element. For example, "Everyday Italian" is the value of the first <title> element.

4.1.1 Overview of XML Schema

XML Schema describes the structure of an XML file. It is also a W3C standard. It covers several aspects:

• An XML Schema defines the elements that appear in the XML file.
• An XML Schema defines the attributes that appear in the XML file.
• An XML Schema defines the sequence in which elements appear.
• An XML Schema defines the number of elements.
• An XML Schema defines whether an element is empty.
• An XML Schema defines the data type of elements and attributes.
• An XML Schema defines the default value of elements and attributes.

XML Schema uses the same syntax as XML, so it is easy to understand and write. Here's the XML Schema corresponding to the XML example above:

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
      <xs:element name="bookstore">
        <xs:complexType>
          <xs:sequence>
            <xs:element maxOccurs="unbounded" ref="book"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="bo
Table 9: The table to store associations

Abstract

UML is one of the most widely used languages for building and describing models. There are 14 kinds of UML diagrams. Among them, the class diagram is one of the most important in software design. A class diagram describes the basic elements and structure of a software system. For developers, class diagrams guide the development phase. The standard for designing UML class diagrams is rather loose, and styles vary from developer to developer. Researchers who want to study class diagrams face the same problem: there is no efficient way to collect a large number of UML class diagrams to work on, for example for analyzing and querying. As most models on the Internet are images, we focus on collecting them, on storing and querying the models saved in these images, and on supporting the reuse of these models.

In this thesis, software for collecting a large number of UML class diagram images from the Internet is developed. It can download images of UML class diagrams efficiently and build a database that contains information about these images. As part of the project to transform models in images to XMI, the thesis provides a method for transforming XMI files into a relational database. A website is then built for storing and querying models online.

Keywords: UML class diagram, web crawler, XMI, model search engine

Chapter 1  Introduction
URLs in the table "pic", which contains all the URLs of the images, are shown in the list box on the left. The URLs in the table "picblack", which contains all the URLs of images in the "blacklist", are shown in the list box in the middle.

Function 2: View the images in the database.
When a URL in either the left or the middle list box is selected, the program shows the corresponding image in the picture box at the top right. Before that, the user must click the "Select image folder" button to select where the images are saved.

Function 3: Manipulate the "blacklist".
The user can manually add items to or delete items from the "blacklist". There are two buttons between the list of all URLs and the list of URLs in the "blacklist". Clicking the button with the arrow ">" adds the selected URL in the list of all URLs to the "blacklist". The button with the arrow "<" deletes a URL from the "blacklist".

Function 4: Rank the domains.
After the "blacklist" is established, the user may want to know which domains contribute many items to it.
After the "Domain Rank" button is clicked, the program analyzes all the URLs that are in or not in the "blacklist". The domains of the images are then extracted. The program makes a list of how many images have been downloaded from each domain. The result is shown in the list box on the right side in descending orde
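The domain ranking in Function 4 can be sketched in a few lines of Python. This is only an illustration of the counting and sorting described above, not the program's own code; the function name `rank_domains` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains(urls):
    """Return (domain, image count) pairs, highest count first (hypothetical helper)."""
    # Extract the host part of every URL and count how often each one occurs.
    counts = Counter(urlparse(url).netloc.lower() for url in urls)
    # Sort by descending count; ties are ordered alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = [
    "http://a.example/img/1.png",
    "http://b.example/uml.gif",
    "http://a.example/img/2.jpg",
]
for domain, count in rank_domains(sample):
    print(domain, count)
```

The same function can be applied separately to the URLs inside and outside the "blacklist" to obtain the two kinds of domain statistics.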
click on the "Show chosen database" button, a new page will appear and show the user the corresponding content.
There is a setting called "Show thumbnail". When it is chosen and the database to be shown contains the URLs of images, the thumbnail of an image is shown when the mouse moves over the corresponding URL. However, this function may slow down the user's computer.

C.5 Database Showing Page

Here's the screenshot of the database showing page:

Figure 28: Database showing page

This page does not provide a direct link for the users. It can only be accessed via the "Databases" page or the "My Settings" page by selecting a database to show. It shows the content of the chosen database. If the database contains the URLs of images, the thumbnails are shown to the user. In addition, the user can tick the checkboxes in front of the items and click the "Search for content" button; the information (classes, attributes, operations and associations) of the selected images is then shown to the user.

Appendix D  User Manual for XMI2DB

This program transforms XML files that store models into SQL files that contain "Insert" commands for our XMI database.

Here's the scre
Figure 23: User registration page

If a user has forgotten his password, he can use the "Request new password" function to get a new one. When the username or email address has been entered on this page, a one-time log-in link is sent to the user's email address. The user can then reset the password in the same way as when creating a new account.

Figure 24: Request new password page

C.2 Upload Page

Here's the screenshot of the upload page:

Upload
Upload File: Browse
Please choose the type of table you are uploading:
pic: It contains all the information of the images.
picblack: It contains the url of images in black list.
picwhite: It contains the url of images in white list.
blackcount: It contains the domain information of the url in black list.
whitecount: It contains the domain information of the url in white list.
Submit
Download ImageCrawler: This Program aims to build a database and download UML class diagra
for the user. The result will include models that the keywords partly match. For example, if the user inputs "edu", models with "education" will be matched. The user can choose the precise search mode by selecting the option on the page. Then, models with "education" will only be matched by the keyword "education".

4.6 Validation

The advantages of our website are:

• Simpler transformation
As we have our own OCR process to generate XML files, the transformation from XMI to a relational database is simpler and faster.

• Good efficiency
A relational database is used to store the elements in XML files. We split the models into four parts: class, attribute, operation and association. We sacrifice space to gain querying efficiency. Thus, the query process is faster than searching through model files on the server.

• Model images provided
As our project aims at querying models that are saved in images, users of our website get access to the images we have collected from the Internet. Presenting models in a graphical view is friendlier than presenting them as text, even if the text has been formatted.

• Access for users to communicate
Our website is based on a content management system, so users can make comments and share their ideas with each other.

For future development, a smarter search engine can be developed. Our project can also expand to search with text files. As no official stan
page provides the search function of the models in our XMI database.

Figure 31: Query in models page

The page contains a keyword textbox, a "Precise Enquiry Mode" checkbox and a "Search" button, together with the hint: "You can use OR to expand your keywords. Please use brackets to split the keywords. For example: (education OR agriculture)".

To use this function, the user inputs the keywords in the textbox and clicks the "Search" button. The result is shown on the same page with the URLs of the images corresponding to the models found. The user can view the images instead of text to get a better impression of the models.

The keywords can be separated by spaces or connected by the "OR" keyword. Keywords separated by spaces are combined with the relation "AND": all of them must be matched in the result. For keywords connected by "OR", at least one of them must be matched in the result.

If the checkbox "Precise Enquiry Mode" is selected, the keywords are matched exactly in the result. Otherwise, the result includes items that partly contain the keywords.

E.3 Query of URL Page

When a user chooses to search several images in the list of URLs, this page shows all the information stored in the XMI database for those images. Here's the screenshot of this page:
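The keyword rules described in E.2 (spaces mean "AND", "OR" joins alternatives, fuzzy matching unless "Precise Enquiry Mode" is selected) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the website's implementation: the function names are hypothetical, a model is represented simply as a list of names, and brackets are treated as plain separators.

```python
def parse_query(query):
    """Split a query into AND-groups; each group lists keywords joined by OR."""
    tokens = query.replace("(", " ").replace(")", " ").split()
    groups, join_next = [], False
    for token in tokens:
        if token == "OR":
            join_next = True              # the next keyword joins the previous group
        elif join_next and groups:
            groups[-1].append(token)
            join_next = False
        else:
            groups.append([token])
            join_next = False
    return groups


def matches(names, query, precise=False):
    """Return True if the model's names satisfy the query (hypothetical helper)."""
    lowered = [name.lower() for name in names]

    def hit(keyword):
        keyword = keyword.lower()
        if precise:
            return keyword in lowered                     # exact match only
        return any(keyword in name for name in lowered)   # partial (fuzzy) match

    # Every AND-group must be satisfied by at least one of its OR-keywords.
    return all(any(hit(k) for k in group) for group in parse_query(query))


model = ["Education", "Student"]
print(matches(model, "edu"))                # fuzzy mode: "edu" matches "Education"
print(matches(model, "edu", precise=True))  # precise mode: no exact match
```

A relational implementation would more likely translate the same groups into SQL conditions; the list-based version here only makes the matching rules concrete.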
the process is done, a list of URLs of images is shown in the list box on the right side. These images are considered similar to the example image. The user can view the images by selecting a URL in the list box.

Appendix C  User Manual for Database Collection Website

C.1 User Control System

We use the user control system provided by Drupal. If a user has not logged in or registered, he does not have permission to use the functions. A message is shown on the webpage asking the user to log in or register. Here's the screenshot of the log-in part:

Figure 22: User login page

The "Create new account" link lets an anonymous user create an account. A unique username and email address are required for user registration. When this information has been provided, a confirmation letter is sent to the email address. This email contains a link for a one-time log-in. For safety reasons, the link expires after one day. The user should log in via that link and set the password and other information. The account is then activated and can use the functions without restrictions.

Here's the screenshot of the "Create new account" page:
the starting node. A queue is needed as the data structure to save the nodes. The algorithm is:

1. Put the starting node V into the queue.
2. If the queue is not empty, continue.
3. Take node V from the queue and mark it as visited.
4. Get the nodes that are adjacent to node V and push them into the queue. Then go to step 2.

Here is an example:

Figure 2: Process of BFS

In this figure, we use node A as the starting point, and the process will be:

Operation   Nodes in the queue
A in        A
A out       empty
BCDEF in    BCDEF
B out       CDEF
C out       DEF
D out       EF
E out       F
H in        FH
F out       H
G in        HG
I in        GI
G out       I
I out       empty, and end

Table 3: Process of operation in a queue

So the order in which the nodes are visited will be: A, B, C, D, E, F, H, G, I.

The web crawler starts from a URL as the starting node. If a link leads to a file, such as a PDF file for the user to download, it is a termination point because the crawler cannot get links from it. The process starts from the initial URL and puts the links inside it into the queue. Links that have been visited are marked "visited". If a link has already been visited, it is ignored the next time it is found.

The process can be shown by this picture:

Figure 3: Process of URL queuing

1. Put the initial URL into the Queue New.
2. Get URL A from Queue New, and get all links it has.
3. If the links exist in th
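The breadth-first traversal walked through in Table 3 can be sketched in Python. This is an illustrative sketch, not the crawler's code: the function name `bfs_order` is hypothetical, and the edges of the example graph are an assumption inferred from the queue operations in Table 3 (A links to B, C, D, E and F; E to H; F to G; H to I).

```python
from collections import deque


def bfs_order(graph, start):
    """Return the nodes reachable from start in breadth-first order."""
    visited = {start}          # nodes already placed in the queue at some point
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()             # take the oldest node out of the queue
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:   # ignore nodes that were already found
                visited.add(neighbour)
                queue.append(neighbour)
    return order


# Assumed edges, reconstructed from Table 3.
graph = {"A": ["B", "C", "D", "E", "F"], "E": ["H"], "F": ["G"], "H": ["I"]}
print(bfs_order(graph, "A"))
```

In a crawler, the nodes would be URLs and `graph.get(node, [])` would be replaced by fetching the page and extracting its links; the visited set plays the role of the "visited" marking described above.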
the table of the design of this database:

Table: pic (contains all the information of images)
Key        Type            Primary Key
ID         Auto-increment
Url        Text            Yes
Width      Text
Height     Text
Pixelform  Text
Fname      Text
Comments   Text
isUML      Boolean

Table: picblack (contains the black list)
Key        Type            Primary Key
ID         Auto-increment
Url        Text            Yes

Table: picwhite (contains the white list)
Key        Type            Primary Key
ID         Auto-increment
Url        Text            Yes

Table: blackcount (contains the domain statistics of the black list)
Key        Type            Primary Key
ID         Auto-increment
Domain     Text            Yes
Count      Text

Table: whitecount (contains the domain statistics of the white list)
Key        Type            Primary Key
ID         Auto-increment
Domain     Text            Yes
Count      Text

Table 4: Design of the database

The interface of the program is as follows:

Figure 4: Interface of ImageCrawler

Here's the class diagram:
two images by their "fingerprints". If two images have something in common, the more unique it is (for example, the Eiffel Tower or the portrait of a movie star), the more similar the fingerprints of the images are. UML class diagrams, however, consist mostly of shapes such as rectangles, triangles and diamonds, together with some text. These are not very distinctive features in an image, so the fingerprint of a UML class diagram tends to be an ordinary one. As a result, the accuracy of this program is not good enough: judging by their fingerprints, it mistakes some images for UML class diagrams when they are not.

A suggestion for the future development of this program is to use shape recognition technology. The real fingerprints of UML class diagrams are their elements. If the program can detect shapes such as rectangles with text inside, arrows pointing to rectangles and so on, the accuracy will improve. After all, this program takes an example image instead of keywords to filter images, which makes it difficult to reach a high accuracy.

Chapter 3  Website for UML Database Collection

3.1 Improvement to Distributed Web Crawler

ImageCrawler has the problem that the number of images is limited by Google image search. The solution is to improve the hardware. Because we have implemented ImageCrawler on a PC, starting from zero is not efficient. That's why we use the result of Google image search as the starting node
23. [Screenshot residue: the XML2DB window listing the generated SQL, a series of "Insert into tblClass values ...", "Insert into tblAttribute values ..." and "Insert into tblOperation values ..." statements.]

Figure 18: The interface of XML2DB

4.5 Website Design for XMI Storage and Query

The website has two functions: XMI storage and query.

4
24. acy problem that not all images downloaded are really UML class diagrams, a way to solve it is to use the "Perceptual Hash Algorithm". The algorithm can generate a "fingerprint" for each image and compare the fingerprints. The more similar the fingerprints of two images are, the more similar the two images are.

The algorithm consists of several steps:
1. Shrink the image to a small size, for example a small square 8 pixels wide and 8 pixels high, so that the small image consists of 64 pixels. The details of the image are eliminated and only the structure and the level of brightness are preserved. Thus, the difference between images of different sizes is eliminated.
2. Convert the image to greyscale.
3. Calculate the average grey value of all 64 pixels.
4. Compare the grey value of each pixel with the average. If it is larger than or equal to the average, the corresponding pixel is marked 1; otherwise it is marked 0.
5. Now we have a table of 64 bits, each 1 or 0, that represents the image. To find out how similar two images are, compare their tables and count how many bits are the same.

2.9.1 Implementation of Filter4ImageCrawler

Filter4ImageCrawler is the program that uses the perceptual hash algorithm as a filter for the images downloaded by ImageCrawler. It loads the database built by the ImageCrawler program, locates the folder where the downloaded images are, and takes an image as the 
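The five steps of the perceptual hash algorithm listed above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code of Filter4ImageCrawler (which is a .NET program): the image is assumed to be already converted to grey values and given as a 2-D list, shrinking is done by simple block averaging, and the function names `shrink`, `fingerprint` and `differing_bits` are hypothetical.

```python
def shrink(gray, size=8):
    """Step 1: reduce a 2-D list of grey values to size x size by block averaging."""
    height, width = len(gray), len(gray[0])
    small = []
    for row in range(size):
        y0 = row * height // size
        y1 = max(y0 + 1, (row + 1) * height // size)
        line = []
        for col in range(size):
            x0 = col * width // size
            x1 = max(x0 + 1, (col + 1) * width // size)
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            line.append(sum(block) / len(block))
        small.append(line)
    return small


def fingerprint(gray, size=8):
    """Steps 3 to 5: mark each shrunken pixel 1 if it is >= the average, else 0."""
    values = [v for line in shrink(gray, size) for v in line]
    mean = sum(values) / len(values)
    return [1 if v >= mean else 0 for v in values]


def differing_bits(fp_a, fp_b):
    """Count the positions at which two fingerprints disagree."""
    return sum(a != b for a, b in zip(fp_a, fp_b))
```

Two images can then be compared by calling `differing_bits(fingerprint(a), fingerprint(b))`; a smaller count means more similar fingerprints. Since the fingerprint depends only on the shrunken image, the same picture at two different resolutions yields the same bits in this sketch.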
25. al blackcount: It contains the domain information of the URLs in the black list.
whitecount: It contains the domain information of the URLs in the white list.

Figure 26: My settings page

The page is divided into two parts.
The top part holds the general settings of the user. At present there is only one option: whether to share the user's database with the other users. The user can click the "Save Settings" button to save his choice.
The bottom part lets the user view his database. Choose one table and click the button, and the chosen table will be shown in a new page.

C.4 Databases Page

Here's the page of the list of the databases:

[Screenshot residue: the Databases page with the options "Show chosen database" and "Show thumbnail (Warning: this may slow down your computer)", followed by the list of tables: Core Database pic (all the information of the images), Core Database picblack (the URLs of images in the black list), Core Database picwhite (the URLs of images in the white list), Core Database blackcount and whitecount (the domain information of the URLs in the black and white lists), and user tables such as "usertable User No 1 pic" and "usertable User No 1 picblack".]

Figure 27: Databases page

On this page, the core database and the databases shared by the users are listed. Choose one of the tables in the list and
26. an be generated by the ImageCrawler program. If it's the first time a user uploads the database, his database containing the five tables will be created on the server. The five tables will be named after the user's id, so the users' databases won't have name conflicts. What's more, naming the database after the user's id instead of the username (although the username is also unique) gives the user the most privacy.
4. The function to show the users the core database and the databases of other users. A user can choose whether his personal database is allowed to be shown to others. When a database contains the URLs of images, the user can choose whether to show the thumbnails in the webpage. If so, the corresponding thumbnail will be shown dynamically when the mouse moves over the URL of an image. However, showing thumbnails takes some of the resources of the user's system, which may slow down his computer.

3.2.3 Future Work for the Website

So far, the website mainly serves database operations. Drupal is originally a content management system, so our website can be extended into a community where users comment on the databases and exchange ideas with each other.

There are two drawbacks of the website:
1. When the user uploads his database, the csv file will be uploaded. Although there is a function to check whether the file type is csv, there is no way to check the content of the file. If it's not in the same format as t
27. art=20". (The number of images starts with 0, which means the second page shows the images from 21 to 40.) There are other parameters, such as "hl=en", which means the language of the page is English. They are not important for this process, so we ignore them. Thus, the regular expression of links that point to other pages of the list of images is:

search  u003Fq  a zA Z0 9  amp     start  0 9   a zA Z0 9  amp

When we have found a link, we just add "http://www.google.com" in front of it and the link is fully constructed. Then the link is stored in a list that holds all the links found.

The second regular expression we have to construct is for the URLs of images. This one is relatively easy. The URL of an image should start with "http(s)" and end with an image file extension. So the regular expression is:

http s             2    w             w       Gpgljpeg pnglico bmp gif

When we have found a link to an image, we first check whether it is in the database, which means it has been downloaded before. If not, we save it to the list that contains all the image URLs found. Then comes the process of downloading it.

2.5.2.3 Image Download

After we have got the URL of an image, we can download it using HttpWebRequest to get the data stream of the image and save it as an instance of the System.Drawing.Image class. The next step is to save the information of the image into our database. We use the OleDb namespace to implemen
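The image-URL matching and the prefixing of page links described in this section can be sketched in Python. This is a hedged illustration only: the pattern `IMAGE_URL` is an approximation written for this sketch rather than the expression used in ImageCrawler, and the helper names `find_new_image_urls` and `absolute_page_link` are hypothetical. The original program is a .NET application.

```python
import re

# Approximate pattern: a URL that starts with http or https and ends with one
# of the image extensions named in the text. Written for this sketch only.
IMAGE_URL = re.compile(
    r'''https?://[^\s"'<>]+?\.(?:jpg|jpeg|png|ico|bmp|gif)\b''',
    re.IGNORECASE,
)


def find_new_image_urls(page_data, known_urls):
    """Return image URLs found in page_data that are not already known."""
    found = []
    for url in IMAGE_URL.findall(page_data):
        if url not in known_urls and url not in found:
            found.append(url)
    return found


def absolute_page_link(relative_link):
    """Prefix a relative result-page link with the Google host, as in the text."""
    return "http://www.google.com" + relative_link
```

Here `known_urls` plays the role of the database lookup: any collection that supports the `in` test (for example a set of URLs read from the database) can be passed in.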
28. atabase. It's useful for the following steps of digital image processing, because some methods can only be applied to specific kinds of images. For example, the ExhaustiveTemplateMatching class, which implements the exhaustive template matching algorithm and can be used to detect shapes in the UML class images, can only be applied to grayscale 8 bpp and color 24 bpp images. Another key we have added to the database is "isUML". It is a Boolean value that shows whether the image is a UML class diagram. Although the key words we have input are "uml class diagram", no search engine can ensure that all the images found are strictly UML class diagrams. So there should be a key that shows which images are UML class diagrams and that can be used in other operations.

As not all images downloaded are UML class diagrams, a list is needed that saves the URLs of such images, to save the trouble of downloading them again next time. Thus, a "blacklist" is added to the database, together with a "whitelist". The tables "picwhite" and "picblack" store the URLs of the images that are and are not UML class diagrams, respectively.

We are interested not only in the URLs of the images but also in their domains. We want to find out which websites provide more UML class diagrams than others. Two tables named "blackcount" and "whitecount" store the domains that the images of the black list or white list come from and how many images have been downloaded from each domain.

Here's
29. bsite. Web developers around the world have contributed a lot to making it more and more powerful. At the same time, it has a useful access control system: different permissions can be given to different users of the websites built on Drupal. Last but not least, it is appreciated by graphic designers because the appearance of Drupal websites is easy to adapt. Various "theme" templates are available for different requirements.

3.2.2 Website Design

The target of our website is to set up a database that contains the index of images of UML class diagrams. The functions of our website are:

1. User control system. People can register and log in through this user control system. The only requirements are a unique username and an email address. A confirmation letter will be sent to the email address after registration to provide the user with a one-time login link. Through this link, the user can set his password, portrait and other information. The user control system lets the users build their own databases and protects their privacy. The user can also choose whether to share his database with others.
2. The download link of the ImageCrawler program, for the users to build their own databases of images of UML class diagrams.
3. The function for the users to upload their databases and merge them into our core database. Our core database contains the entire index that the users have collected. The required format of the upload file is csv, which c
30. d explicitly. For example, the element "book" in the XML file has four elements embedded, so in lines 11 and 19 it is described as a complex type.

4.2 Overview of XMI

XMI is short for "XML-based Metadata Interchange". It provides a method for developers to exchange metadata via XML. XMI integrates three industry standards: XML, UML and MOF (Meta Object Facility), an OMG (Object Management Group) language for specifying metamodels.

4.2.1 XMI vs. Relational Database

Our project uses XMI as the output of the OCR process for images of UML class diagrams. We have built a website for XMI storage and query.

First, here is a comparison of an XMI file and a relational database:

                     XMI                                  RDB
Data storage         File system                          Database system
Data structure       Only related to logical structure    Only related to logical structure
Pattern description  DTD, Schema                          Relational patterns
Logical interface    SAX, DOM                             ODBC, JDBC
Query language       XQuery                               SQL
Safety level         Low                                  High
Concurrent query     No                                   Yes
Usage                Standard for Internet data exchange  Method for storing and querying data
Operating language   No standard language                 SQL
Internet usage       Directly                             Need application
Data description     Structural or semi-structural        Structural

Table 5: Comparison of XML and relational database

From the table, we can see that XMI is really a good method for exchanging data through the Internet. H
31. dard of XMI for models is established, many configuration files will be needed to transform different model files into our standard in order to insert them into our database. The website can be expanded into a community for researchers in this field to communicate and exchange ideas.

Chapter 5  Conclusion

In this project, we have built a system that establishes databases of images of UML class diagrams from the Internet, turns the images into XMI models via image processing technology, stores the models in a relational database and queries within the models.

A website is established to share our results. It also lets the users make comments and communicate with each other.

For future research, we plan to rewrite the programs we have developed as web applications and integrate them with our website, making a complete online system for collecting, recognizing and querying models that are presented in images. More efficient web crawler algorithms can be developed to improve the percentage of UML class diagrams retrieved by the ImageCrawler program.

Appendix A  User Manual for ImageCrawler

This is the user interface when the program starts. The top part is the image downloading part, and the bottom part is the domain rank part.

[Screenshot residue: ImageCrawler window with the Browse and Create New buttons, the Keyword and Number of Images fields, and the Load Database, Select image folder, Domain Rank and Generate csv buttons.]
32. date, the actual result may be less. So to solve this, we use several key words for the search to build our database.

Another weak point is accuracy. As the search is driven by key words, the images found are those whose descriptions include the key words. Thus, not all images found are really UML class diagrams. Some of them may be a screenshot of a presentation named "UML class diagram" or a photo of a book that relates to "UML class diagram".

2.8 Focused Web Crawler

A way to improve the efficiency of a web crawler is the focused web crawler. Unlike normal web crawlers, a focused web crawler does not rely on a huge coverage of websites. It mainly focuses on the websites that belong to a certain topic. Thus, a focused web crawler performs better at fetching data for users with a specific topic.

There are three aspects to consider in designing a focused web crawler:
1. The definition of the topic which the crawler is going for.
2. The analysis and filtering of the content of the data fetched by the crawler.
3. The method to search through the Internet.

In this case, the topic is the UML class diagram.

As discussed above, a normal web crawler often uses breadth-first search (BFS) to explore the Internet. The implementation of BFS is relatively simple. In general, websites on the same topic tend to gather together, so BFS can be used in a focused web crawler. Given the start node, the content of no
33. des adjacent to it is considered to belong to the same area. The drawback of BFS in implementing a focused web crawler is that as the coverage becomes bigger, more irrelevant pages will be fetched and downloaded. What's more, when the number of irrelevant pages is much larger than that of relevant ones, the web crawler may be trapped in a vicious cycle.

There is another way to avoid this problem. When the links within a page are detected, the web crawler analyses them before downloading the data. The analysis finds which links are most relevant to the topic and ignores the others. The efficiency is better than that of BFS. However, the prediction cannot be 100% accurate, and some relevant pages may be thrown away.

There are some methods to analyze the quality of a webpage. In general, within a website, the links to the most important pages are provided on its homepage or in other places that are easy for the user to find. If a page is linked to by many websites, it can be considered highly important. PageRank is a technique based on this idea. On the other hand, if the number of links to a certain page is too large, it may not be a valuable one, because it may be just a "junk page" that appears many times only to improve its PageRank value.

A focused web crawler can be embedded in our ImageCrawler program. However, the time spent on webpage analysis may reduce the efficiency.

2.9 Perceptual Hash Algorithm

As for the accur
34. [Example image residue: a UML class diagram with elements such as Synchronized Wrapper, Cache, the Timestamped interface, java.util.Map and java.util.LinkedHashMap.]

Figure 11: Example of black and white complicated UML class diagram

Result: 46.87% of the result is the same as in the standard list.

Experiment 4: Colored simple UML class diagram

Example image:

Figure 12: Example of colored simple UML class diagram

Result: 46.09% of the result is the same as in the standard list.

Experiment 5: Colored normal UML class diagram

Example image:

Figure 13: Example of colored normal UML class diagram

Result: 47.55% of the result is the same as in the standard list.

Experiment 6: Colored complicated UML class diagram

Example image:

Figure 14: Example of colored complicated UML class diagram

Result: 47.16% of the result is the same as in the standard list.

2.9.3 Conclusion of Filter4ImageCrawler

From the results, we can see that the correctness the program achieves is nearly 50%, whether the example image is black and white or colored, simple or complicated. The results remain at a stable level, but one that is not high enough to be trusted.

The reason comes from the algorithm itself. The perceptual hash algorithm compares the similarity between
35. e Queue Visited, ignore them. Otherwise, put them into Queue New.
4. Put the URL A into Queue Visited. Go to step 2.

BFS is the method most widely used by web crawlers. The main reason is that when we design a webpage, we usually put in links to pages whose contents are related to the current page. BFS finds these pages first.

A web crawler should work continuously, so good stability and reasonable use of resources are required. The web crawler is a vulnerable part of a search engine: it deals with web servers that are not under the control of the search engine system. A bug in one webpage may put the web crawler into a severe situation such as a halt, a crash or other unpredictable problems. Thus, the design of a web crawler should take different conditions of the network into account.

2.2 Related Work

Other than with large search engines like Google and Yahoo, it is difficult to get a large number of UML class diagrams in a short time. Even if people use the search engines, no downloading service is available to save the images to the local disk.

There are several tools that are specially designed for downloading images from the Internet. However, they have drawbacks. Some of the tools cannot meet the requirement of downloading a large number of full-sized images. Most of them are commercial.

Google has provided an API for downloading images from the result of image search. However, the API is limited to a maximum of
36. e and more powerful filter for the images. The most important reason we choose Google is that the URL of Google image search can be easily constructed with different parameters. We can use Google image search in our program without going to the web browser first.

2.5 Implementation of ImageCrawler

The ImageCrawler program does not rely on the API of Google image search. It can download as many images as possible from the result of Google image search. The user can set the key words as well as the number of images he wants to get. The program simulates the process of downloading images from Google manually: it implements the theory of the web crawler, gets the data from Google image search and downloads the images automatically. After that, the information of the images is saved in the database. The program can also build a "blacklist" of URLs of images that are not UML class diagrams, and it can rank the domains of the images to show how many images each website contributes.

2.5.1 Database Design

The database is designed for keeping the information of the images we have downloaded. Most of the information is saved in the table "pic". As a URL is unique on the Internet, we use the URL as the primary key. The width and height, as the basic attributes of an image, are also saved in the database. What's more, the pixel format of an image is recorded in the d
37. en a user has built a database, he can upload it to the website. After a database has been uploaded, the website reads its content and writes it into the database on the server that contains all the information from the uploaded databases.

It is an alternative to a distributed web crawler system. If more and more people use the program to collect images and upload the URLs of the images, together with the "whitelist"/"blacklist" (showing which URLs contain images that are (not) UML class diagrams) and the domain statistics from the "whitelist"/"blacklist", a database will be available on the server that contains the index of a large number of images that are UML class diagrams. It's like the index of images built by a search engine, except that the people using the software take the place of web crawlers.

3.2.1 Introduction of Drupal

Drupal is a free, open-source content management system (CMS) that is developed in PHP. It's best used for building a backend system. About 2% of the websites in the world are based on Drupal, including http://www.whitehouse.gov, the website of the White House. Here's a list that shows the famous websites that use Drupal as the framework:

http://www.seo-expert-blog.com/list/the-most-incomplete-list-of-drupal-sites

Drupal is composed of modules, so it is convenient to add or remove functions. In fact, rather than a CMS, it is more often regarded as a framework for a content management we
38. en shot of this program:

[Screenshot residue: XML2DB window with the "Path of the XML file" textbox and the Start button.]

Figure 29: Initial interface of XML2DB

Step 1: Open the XML file.
The user should click on the "Browse..." button and select the XML file. The path of the file will be shown in the textbox near the button.

Step 2: Start the transformation.
By clicking the "Start" button, the program will extract the classes, attributes, operations and associations from the XML file. At the end of the process, a message will pop up to inform the user of the path of the SQL file created. The SQL file is named after the current date and time to avoid duplicate file names.

Step 3: Result of the process.
The SQL commands generated during the process will be shown in the textbox at the bottom. This gives the user a convenient way to check the elements detected and transformed in the XML file.

Appendix E  User Manual for XMI Query Website

E.1 Upload Page for SQL Files

Here's the screenshot of the upload page for SQL files generated by the XML2DB program:

[Screenshot residue: uploadXML page with an "Upload File" field, a "Browse..." button and a "Submit" button.]

Figure 30: Upload page for SQL files

It's a very simple function. The user just clicks the "Browse..." button, chooses the generated SQL file and clicks the "Submit" button. The SQL commands in the uploaded file will then be executed.
For safety reasons, only the administrators have access to this page.

E.2 Query in Models Page

This
39. ent model search engine.

Here's the architecture of the Moogle search engine:

[Diagram residue: User, Query (HTTP), User Interface, Indexes, Model descriptors, Model Extractor, Model repositories (database, local file system, web).]

Figure 17: The architecture of MOOGLE

The system consists of three parts: the model extractor, the searcher and the user interface.

The model extractor can find and read model files from different repositories. It reads the content of the model files, extracts the necessary content and generates model descriptors, which are used by the searcher.

The searcher is based on an open-source search engine, Apache SOLR. With SOLR, an index of the model descriptors is established that contains all the necessary information about them.

The graphical user interface of Moogle provides the users with three functions: simple search, advanced search and metamodel browsing.

Moogle is a good reference for us in building a website for searching models. It supports queries on different kinds of models. However, it has some drawbacks.

The core function, the model extractor, uses a configuration file to get the necessary content from model files. A valid, unique namespace declaration is required. Thus, if the namespace declaration of a model file is damaged, the extractor will stop generating descriptors no matter how well the other parts are preserved.

Moogle is a search engine for model files that are presented as text. What we want to achi
40. er
Appendix C  User Manual for Database Collection Website
    C.1 User Control System
    C.2 Upload Page
    C.3 My Settings Page
    C.4 Databases Page
    C.5 Database Showing Pages
Appendix D  User Manual for XML2DB
Appendix E  User Manual for XMI Query Website
    E.1 Upload Page for SQL Files
    E.2 Query in Models Page
    E.3 Query of URL
Appendix F  Statistics of UML Class Diagrams Collection
    F.1 Table pic
    F.2 Table picwhite
    F.3 Table picblack
    F.4 Table whitecount
    F.5 Table blackcount
Acknowledgements
References

This paper is a gift for my family who have supported me all the way.

List of Figures

Figure 1: An example of a class diagram
41. eve is a project for models that are presented in UML class diagrams. There are already various search engines that deal with text; turning images into text models and querying them is the more difficult task.

Another drawback is the searching method. Moogle uses indexes of model descriptors to implement the query process. It is a search within files, so the efficiency will not be as good as querying a relational database.

4.4 XMI Database

Our standard is based on the XMI files exported from StarUML. We take four categories into account: class, attribute, operation and association. Our database is built on that foundation.

4.4.1 Database Design

Our database contains four tables, tblClass, tblAttribute, tblOperation and tblAssociation, corresponding to classes, attributes, operations and associations. Here are the keys in the tables:

tblClass
Key         Type          Comment
name        varchar(200)  Name of the class
visibility  varchar(20)   Public, private
isAbstract  varchar(20)   Whether it's an abstract class
xmi_id      varchar(200)  The id for this class
datetime    varchar(20)   The date and time this item is generated
url         varchar(200)  The URL of the source image of this class

Table 6: The table to store classes

tblAttribute
Key         Type          Comment
name        varchar(200)  Name of the attribute
visibility  varchar(20)   Public, private
typeofAttr  varchar(200)  The type of this
42. example for the filtering process. As a result, the loaded database will be changed. There is a key named "isUML" in the table "pic" of the database. It is a Boolean value that shows whether the image is a UML class diagram. The value of this column is changed according to the result of the program: if an image is similar enough to the example image, its "isUML" attribute is set to true. The similarity threshold is set to 25. This means that if, after the perceptual hash algorithm has run, fewer than 25 pixels differ between the tested image and the example image, we regard the two as similar.

Here's the class diagram of the program:

[Class diagram residue: classes Filter4Images, AccessConnection, DatabaseOperation and ImageOperation; fields imgPath : string, ac : AccessConnection, cn : OleDbConnection, cmd : OleDbCommand, reader : OleDbDataReader; methods buildConnection, executeSQL and executeSQLReader.]

Figure 7: Class diagram for Filter4ImageCrawler

Here's the screenshot of this program:

[Screenshot residue: Filter4ImageCrawler window with "The folder of images" and "Database" fields, Browse buttons and a list of image URLs.]
43. ferent domains involved. Among them, http://upload.wikimedia.org ranks at the top because 30 UML class diagrams come from this domain. It is followed by websites like http://www.ibm.com with 29 UML class diagrams downloaded, http://ars.els-cdn.com with 25 and http://docs.nullpobug.com with 21. 12 domains contain 10 to 17 UML class diagrams each, 58 domains contain 4 to 9 UML class diagrams, and 641 domains have fewer than 3 UML class diagrams downloaded. In the "blacklist", 417 different domains are involved. http://img.docstoccdn.com, with 53 images that are considered not to be UML class diagrams, ranks at the top of the "blacklist". There are 13 domains that contain over 10 images that are not UML class diagrams, and 133 domains that contain 2 to 9 such images. The remaining 270 domains have only one image each that is not a UML class diagram.

2.6.1 Source Code with UML

Another thing we want to find out is whether source code is provided together with the UML class diagrams. We have searched the domains on the "whitelist" that have provided more than 3 UML class diagrams. Among the 29 websites, 11 have source code provided. They can be classified into 4 categories:
1. UML education standard: http://www.xml.com (6 images provided), http://www.uml.org.cn (15 images provided) and http://www.c-jump.com (6 images provided) belong to this category. The source code on these websites is provided for t
44. [Screenshot residue: list of image URLs displayed in the Filter4ImageCrawler window.]

Figure 8: Interface for Filter4ImageCrawler

2.9.2 Validation of Filter4I
45. hat has similar features to the original picture, the algorithm takes it as a confident match. Google then lists the confident matches on the website for the user.

The advantage of this algorithm is that it simulates the way a human being looks at an image. If the query image has a unique appearance, the result will be very good. The result for unique landmarks like the Eiffel Tower is excellent.

However, when we use this function to search for UML class diagrams, the result is not as good. The reason is that UML class diagrams are mainly composed of shapes like rectangles, arrows and text. There are no unique symbols. When we upload a UML class diagram to Google search by image, the features Google extracts are just ordinary ones. Thus, the result contains not only UML class diagrams but also graphs with curves and rectangles that relate to math research, the stock market, medical research and so on.

Thus, the function we want to use is the traditional way of searching images by key words. At least for UML class diagrams, searching by image offers no accuracy advantage over searching by key words.

In the meantime, Google provides personalized results, which collect images related to the websites that the user has visited or is interested in. This is helpful for collecting the models the user wants.

Other search engines also have similar image search functions, but Google provides a better user interfac
46. he URL that directly links to the image resource. It's the end of one pass of the crawler's work.

Here a problem arises. When we have got the data of the page, not all links are what we need. We need only the two types of URL discussed above. Thus, we have to give the crawler two regular expressions, one for each type of URL, to match against the page data.

The first is the URL which links to another page of image search results. First, we should analyze the structure of Google's image search page. In the web browser, the list of images is shown in the standard version, which dynamically loads pictures from further pages as we scroll down. However, the page data the program can get is the basic version. If we want to go to another page, we have to click on the page number that carries the link.

Figure 6: Links of pages of Google image search (screenshot of the "Searches related to uml class diagram" suggestions and the page numbers 1 to 10 followed by "Next")

Thus, what we want to get are the links behind those page numbers. They start with "search?q=" followed by the key words. The parameter "tbm=isch" means that the search is for images. Another important parameter is "start=", followed by a number. A search yields thousands of images, and the number after the "start=" parameter shows which image is the first one that the URL links to. For example, the URL of page 2 contains  st
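The matching of result-page links described in this fragment can be sketched in Python (the thesis itself is implemented in .NET). The pattern, the sample HTML, its offsets and the helper name `page_links` are illustrative assumptions, not the regular expression actually used in ImageCrawler.

```python
import re
from html import unescape

# Hypothetical pattern: links that start with /search?q=, carry tbm=isch
# (image search) and a numeric start= offset.
PAGE_LINK = re.compile(r'href="(/search\?q=[^"]*?tbm=isch[^"]*?start=(\d+)[^"]*)"')


def page_links(html):
    """Return {start offset: link} for every result-page link found in html."""
    links = {}
    for match in PAGE_LINK.finditer(html):
        # Page source escapes "&" as "&amp;", so unescape before using the link.
        links[int(match.group(2))] = unescape(match.group(1))
    return dict(sorted(links.items()))


# Made-up page data: two result-page links and one unrelated link.
sample = (
    '<a href="/search?q=uml+class+diagram&amp;tbm=isch&amp;start=20">page A</a>'
    '<a href="/search?q=uml+class+diagram&amp;tbm=isch&amp;start=40">page B</a>'
    '<a href="/imghp?hl=en">Google Images Home</a>'
)

for offset, link in page_links(sample).items():
    print(offset, link)
```

Keying the links by their start offset makes it easy to visit the result pages in order and to skip a page that has already been fetched.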
47. he csv files created by the ImageCrawler, undefined content will be merged into the databases on the server.

2. Although there's an option for the user to choose whether he wants to share his database with other users, the content of his database will be included in our core database. Thus, when the core database is shown, a user's privacy may be violated. One way to solve this problem is not to show our core database to the users. However, we suppose that most users of our website are willing to share what they have collected with our program. Thus, showing the users our core database is the better choice.

3.3 Change of ImageCrawler

To cooperate with the website, the ImageCrawler has been changed. It can search by URL instead of by key words. The program behaves as a normal web crawler:

1. It uses the URL provided by the user as a starting node.
2. It gets the page data of the URL and analyzes the page content.
3. It uses breadth-first search to collect all the URLs included in the content of the page and saves them into a list. The list only accepts URLs that are not already included. Otherwise, the program may revisit a URL and become trapped in an infinite cycle.
4. It detects all the links to images within the page content and downloads them.

The other functions remain the same.

The program no longer relies on the result of the Google image search. Users can focus on one website and collect the images from 
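The four crawler steps listed in Section 3.3 can be sketched as follows. This is a minimal Python illustration, not the thesis's .NET code: the regular expressions, the `fetch` callback, the `max_pages` limit and the in-memory `SITE` dictionary (which stands in for real HTTP requests) are all assumptions made for the example.

```python
import re
from collections import deque

# Simplified, hypothetical patterns for page links and image links.
LINK = re.compile(r'<a\s[^>]*href="([^"]+)"', re.I)
IMG = re.compile(r'<img\s[^>]*src="([^"]+)"', re.I)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; returns image URLs in discovery order."""
    queue = deque([start_url])   # step 1: the user's URL is the starting node
    seen = {start_url}           # URLs already queued, so cycles cannot trap us
    images = []
    pages = 0
    while queue and pages < max_pages:
        html = fetch(queue.popleft())        # step 2: get the page data
        pages += 1
        for src in IMG.findall(html):        # step 4: detect image links
            if src not in images:
                images.append(src)
        for link in LINK.findall(html):      # step 3: queue only unseen URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return images


# Made-up site with a cycle (the home page and /a link to each other).
SITE = {
    "http://example.test/": '<a href="http://example.test/a">a</a>'
                            '<a href="http://example.test/b">b</a>'
                            '<img src="http://example.test/1.png">',
    "http://example.test/a": '<a href="http://example.test/">home</a>'
                             '<img src="http://example.test/2.png">',
    "http://example.test/b": '<a href="http://example.test/a">a</a>'
                             '<img src="http://example.test/1.png">'
                             '<img src="http://example.test/3.png">',
}

print(crawl("http://example.test/", lambda url: SITE.get(url, "")))
```

Passing the page fetcher in as a function keeps the traversal logic separate from networking, so the same loop can be run against a real downloader or against test data.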
48. he purpose of education, to explain how UML class diagrams are used.

2. Software reference. http://mdp-toolkit.sourceforge.net (10 images provided) and http://gdal.org (6 images provided) belong to this category. Both websites are references for certain libraries or software toolkits. The source code provided consists of examples that explain how certain functions are used.

3. Software design. http://www.felix-colibri.com (14 images provided), http://staff.aist.go.jp (4 images provided), http://www.codeproject.com (17 images provided), http://www.c-sharpcorner.com (3 images provided) and http://tw.mapandroute.de (6 images provided) belong to this category. The source code provided by these websites is the implementation of the software.

4. Blog of IT articles. http://www.blogjava.net (4 images provided) is the only one in this category. It's a blog of articles related to the IT industry. This category may overlap with any of the three classes above: the source code provided may be the implementation of certain software or example code attached to educational content about the UML standard.

2.7 The Limitation of ImageCrawler

ImageCrawler's strength comes from Google image search, and so does its weakness. The reason is that an image search returns at most 50 result pages, and the average number of images within one page is 20. Thus, the maximum number of images the program can get for one key word is 50 x 20 = 1000. With the URL of some images are out of 
49. hed the "blacklist" using the function in the "Domain Rank Part" of ImageCrawler. Images that are not UML class diagrams, UML class diagrams that are too blurry to read, and screenshots that contain only a part of a UML class diagram are put into the "blacklist". As a result, 947 of the 2341 images have been put into the "blacklist", leaving 1394 UML class diagrams.

Compared to other tools that collect images from the Internet, ImageCrawler has several advantages:

1. Tools that use the Google API to download images from the Internet are limited to 64 images. ImageCrawler does not have this limitation, so it's easy for users to collect a large number of images.
2. The Firefox plug-in named "Save images" has two drawbacks. It can only download the images that appear in the tab currently open in the browser, and it downloads images only as they are represented in the page. When it's used on the thumbnails listed in the result page of Google image search, it cannot download the images in their original sizes, which are what the users really need.
3. ImageCrawler is free for the users. It's developed with the .NET Framework, so it is easy to extend. In future development, it can be transformed from a Windows application into an online application to provide more flexible usability.

The result of the domain name statistics has been saved in the database. In the "whitelist", there are 715 dif
50. imeon, "From XML Schema to Relations: A Cost-Based Approach to XML Storage", Data Engineering, 2002.
[29] Ye Kaizhen, "The Study of XML Storage Technique in Relational Database", Master thesis, Sun Yat-sen University, 2007.
[30] Daniel Lucredio, Renata P. de M. Fortes, Jon Whittle, "MOOGLE: a metamodel-based model search engine", Softw Syst Model, 2012.
51. ing
setName(name: String): void
getSocialSecurityNumber(): String
setSocialSecurityNumber(socialSecurityNumber: String): void
getDateOfBirth(): Date
setDateOfBirth(dateOfBirth: Date): void
calcAgeInYears(): int

Figure 1: An example of a class diagram

There are several relationships between the classes. Here's a table of their definitions and how they are represented in UML class diagrams.

Name | Description | Representation
Generalization | It describes the relationship between particularity and generality. | An arrow pointing to the parent class.
Realization | It describes the relationship between an interface and a class implementing it. | An arrow with a dashed line pointing to the interface.
Association | It's the relationship if one class has certain members or methods of another class. | A line with an arrow pointing to the class that is associated.
Aggregation | It's the relationship if one class owns another. | A line with an empty diamond pointing to the owner.
Composition | It's the relationship if one class is a part of another that cannot exist alone. | A line with a solid diamond pointing to the owner.
Dependency | If a class needs another one to implement, they have the relationship of dependency. | A dashed line with an arrow pointing to the class that's relied on.

Table 2: Relationships of class diagrams

UML class diagrams are very important. They can show the static structure of a 
52. Figure 5: Class diagram of ImageCrawler (the diagram shows the classes ImageDownload, DatabaseOperation and DomainRank with members such as URL: string, NEXTPAGEGOOGLE: string, DomainPattern: string, executeSQL(), executeSQLReader(), domainName: string, domainCount: int, getName(), getCount(), setName() and setCount())

2.5.2 Image Downloading Part

2.5.2.1 Initialization

The program takes the key word the user has input and constructs a URL for searching images from Google. The URL for downloading images from Google is constructed from two parameters: the key word and the search category (image search). For example, the URL for downloading images that relate to "UML" is "http://www.google.com/search?q=uml&tbm=isch".

At the same time, the program takes the number of images that the user has input and saves it into a variable.

Before the process starts, the program requires the user to select a database in which to save the information of the images to be downloaded. A function checkDB() checks whether the database the user has selected matches the format of the database we have designed. In this function, the program gets the names of all the tables inside the database and saves them into DataTable objects. For the keys inside the tables, there is a function getCols() that also uses DataTable objects. The function collects the names of the columns in a table and compares them with those of the database we have designed. If all the tables and columns exist, the database is ready for the 
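The initialization described above (building the search URL and checking the selected database) can be sketched in Python. This is an illustration under stated assumptions, not the thesis's checkDB() implementation: the thesis uses an Access database and .NET DataTable objects, while this sketch uses an in-memory SQLite database as a stand-in and checks table names only. The function names `build_search_url` and `check_db` are hypothetical; the five table names are those listed in Appendix F.

```python
import sqlite3
from urllib.parse import urlencode

# The five tables of the designed database (see Appendix F).
REQUIRED_TABLES = {"pic", "picblack", "picwhite", "blackcount", "whitecount"}


def build_search_url(keyword):
    """Build the image-search URL from the key word and the image category."""
    return "http://www.google.com/search?" + urlencode({"q": keyword, "tbm": "isch"})


def check_db(conn):
    """Return True if the database contains every required table."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return REQUIRED_TABLES <= {name for (name,) in rows}


# Usage with a throwaway in-memory database (SQLite stands in for Access here).
conn = sqlite3.connect(":memory:")
for table in sorted(REQUIRED_TABLES):
    conn.execute(f"CREATE TABLE {table} (url TEXT)")

print(build_search_url("uml"))   # http://www.google.com/search?q=uml&tbm=isch
print(check_db(conn))
```

A column-level check in the spirit of getCols() would compare the column names of each table in the same way; it is omitted because the column layout is not given in this part of the text.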
53. ions between the system and its users. A use case is the target of a functionality provided by the system.

Interaction diagrams:
Communication diagram | Shows the interactions between different kinds of components within the system. It describes both the static structure and the dynamic behavior of the system.
Interaction overview diagram | Describes an overview of a control flow with nodes that can contain interaction diagrams.
Sequence diagram | Uses messages to show the communication between objects within the system as well as their lifecycles.
Timing diagram | Focuses on timing constraints to explore the behavior of objects during a period of time.

Table 1: 14 kinds of UML diagrams

1.2 Overview of UML Class Diagrams

A UML class diagram represents the content of classes and the relationships between them. A class is represented by a rectangle with three parts arranged vertically. The top part holds the name of the class, which is mandatory. The middle part of the rectangle holds the members and attributes. The bottom part holds the methods of the class. Here's an example of a class "Person" with four members and seven methods. The sign "-" in front of an attribute or method means that this attribute or operation is private, and the sign "+" means it is public.

name: String
socialSecurityNumber: String
dateOfBirth: Date
emailAddress: String
getName(): Str
54. it. Compared to the result of Google image search, which contains images from different websites, the result of searching by URL can be more focused on one topic because the images downloaded are mainly from one website.

By combining the ImageCrawler and the website, we obtain an alternative to a distributed system. It's a better approach because the website is not only a place for uploading and showing databases but also a way for people to communicate, and it can easily be extended with more functions.

Chapter 4  Website for XMI Storage and Query

4.1 Overview of XML

XML is short for "Extensible Markup Language". It is a language that uses tags to mark up electronic files and give them structure. It can be used to mark up the data and define the data types. The tags can be defined by the users, which makes the language highly extensible. XML is a subset of SGML (Standard Generalized Markup Language) and a standard recommended by the W3C (World Wide Web Consortium). It is very popular and useful for transmitting data over the web.

The grammar of XML is similar to that of HTML. However, XML is not an alternative to HTML: XML is used to store data, while HTML is used to format and display data on a webpage.

XML consists of tags, and the tags must appear in pairs, which is different from HTML. XML is also case sensitive, so it has to be written with more care. An XML file is like a tree. It starts from a root node and expands to the leaves.

Here's an example of an XML fi
55. le:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>

  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>

  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>

  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Kurt Cagle</author>
    <author>James Linn</author>
    <author>Vaidyanathan Nagarajan</author>
    <year>2003</year>
    <price>49.99</price>
  </book>

</bookstore>

Figure 15: An example of XML file

In this file, the first line is the declaration of the file. It states that the XML version is 1.0 and the encoding is "ISO-8859-1". In the remaining lines (numbered 2 to 30 in the original listing), the entities in the tags are called elements. <bookstore> and </bookstore> are the two parts of the root element. The root element should appear in every XML file 
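To make the tree structure of the example concrete, the following Python sketch parses an abridged copy of the bookstore file with the standard-library `xml.etree.ElementTree` module and walks from the root element down to the leaves. It is only an illustration; the thesis's own tools are written in .NET, and the helper name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the bookstore example (two of the three books).
DOC = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>"""


def summarize(root):
    """Return (category, title, authors) for every <book> under the root."""
    return [
        (book.get("category"),
         book.findtext("title"),
         [author.text for author in book.findall("author")])
        for book in root.findall("book")
    ]


root = ET.fromstring(DOC)   # the root element of the tree
print(root.tag)             # bookstore
for entry in summarize(root):
    print(entry)
```

Attributes such as category are read with `get`, while child elements such as title and author are reached by name, which mirrors the distinction between attributes and elements made in the text.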
56. [Screenshot of ImageCrawler, continued: lists of image URLs and the "Domain Rank" panel with a list of domains; the individual entries are not recoverable from the extracted text.]
57. mageCrawler

To validate how correctly the program recognizes UML class diagrams, several experiments have been conducted. Before the experiments, a database of 1000 UML class diagrams was established. The attribute "is UML" of the database was filled in manually, so the database serves as a standard list that shows whether each downloaded image is actually a UML class diagram. In each experiment, the program takes an image of a UML class diagram as the example image, extracts its structure using the perceptual hash algorithm and compares the structure with the structures of the images we have collected. Then, a Boolean value of "isUML" is given to every image in the database. After the program has finished, we compare the database it has changed with the standard list to evaluate the accuracy of the program. As the filter needs an example image, we take 6 images to perform 6 experiments. They are divided into two groups, "colored" and "black and white". In each group, three subgroups have been made by the content of the images: simple, normal and complicated.

Experiment 1: Black and white simple UML class diagram

Example image:

Figure 9: Example of black and white simple UML class diagram

Result: 47.06% of the results are the same as in the standard list.

Experiment 2: Black and white normal UML class diagram

Example image:
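The comparison step can be illustrated with a small Python sketch. The text names only "the perceptual hash algorithm", so the variant below (an average hash over an 8 x 8 grid of block means, compared by Hamming distance) is an assumption, as are the function names and the threshold of 10 bits. Images are represented as plain 2D lists of grayscale values to keep the sketch self-contained.

```python
def average_hash(pixels, size=8):
    """Hash a grayscale image (2D list, at least size x size) into size*size bits."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        for c in range(size):
            # Mean brightness of the block of pixels that maps to cell (r, c).
            rows = range(r * height // size, (r + 1) * height // size)
            cols = range(c * width // size, (c + 1) * width // size)
            block = [pixels[y][x] for y in rows for x in cols]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # Each bit records whether a cell is brighter than the overall mean.
    return [1 if value > mean else 0 for value in cells]


def hamming(a, b):
    """Number of positions in which two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def is_similar(example, candidate, threshold=10):
    """Hypothetical decision rule: similar if the hashes differ in few bits."""
    return hamming(average_hash(example), average_hash(candidate)) <= threshold


# Made-up 16 x 16 test images: dark left half, bright right half.
base = [[50] * 8 + [200] * 8 for _ in range(16)]
brighter = [[value + 10 for value in row] for row in base]
inverted = [[255 - value for value in row] for row in base]

print(is_similar(base, brighter))   # same structure, different brightness
print(is_similar(base, inverted))   # opposite structure
```

Because each bit is taken relative to the image's own mean, a uniform brightness change leaves the hash unchanged, which is why a hash of this kind compares structure rather than exact pixel values.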
58. ms

Figure 25: Upload page

In the bottom part of this page, the user can download the ImageCrawler program to build his database. Then, he can upload the csv files generated by the ImageCrawler program by clicking on the "Browse..." button. As our program generates five csv files for a database, corresponding to the five tables, the user should choose which table the csv file to be uploaded contains. When all the required information is set, the user clicks the "Submit" button to upload the file.

The upload function first checks the type of the uploaded file. If a non-csv file has been uploaded, a message warns the user. If the upload succeeds, information about the uploaded file, such as its name and size, is shown to the user. Then, the content of the csv files is merged into his existing database on the server as well as into our core database. If it's the first time the user uploads his database, the five tables are built automatically on the server.

C.3 My Settings Page

Here's the screenshot of the page for the user settings:

[Screenshot of the "My Settings" page: a checkbox labeled "I would like to share my database with others", a "Save settings" button, and a list of databases with a "Show chosen database" button. The list includes pic (contains all the information of the images), picblack (contains the URLs of images in the black list) and picwhite (contains the URLs of images in the white list).]
59. n integrated system cannot meet the requirement. The crawlers should work in parallel on a distributed system to improve the efficiency.

However, as our task has to be finished on a normal PC, a normal web crawler would not be efficient enough. Unlike a normal web crawler, which gets every piece of information from the Internet, the crawler we want to implement only has to download images of UML class diagrams from the Internet to establish the database. Thus, we can use the result of an image search website as our starting node.

2.4 Why Google?

Google, as one of the most widely used search engines in the world, can meet our requirements. It can return a large number of images focused on a specific topic. When we go to the image search page of Google, we get a list of images from various websites that are related to the key words we have input. Google image search is extremely fast because the information of billions of images has already been saved on Google's servers. Thus, if our web crawler can use the result of Google image search as a starting node, it will be very efficient.

In October 2009, Google released the function to search by image. When an image is uploaded, an algorithm is applied to it to extract its features. The features can be textures, colors and shapes. Then, the features are sent to the backend of Google and compared with the images on the server. If there is an image t
60. ok">
11     <xs:complexType>
12       <xs:sequence>
13         <xs:element ref="title"/>
14         <xs:element maxOccurs="unbounded" ref="author"/>
15         <xs:element ref="year"/>
16         <xs:element ref="price"/>
17       </xs:sequence>
18       <xs:attribute name="category" use="required" type="xs:NCName"/>
19     </xs:complexType>
20   </xs:element>
21   <xs:element name="title">
22     <xs:complexType mixed="true">
23       <xs:attribute name="lang" use="required" type="xs:NCName"/>
24     </xs:complexType>
25   </xs:element>
26   <xs:element name="author" type="xs:string"/>
27   <xs:element name="year" type="xs:integer"/>
28   <xs:element name="price" type="xs:decimal"/>
29 </xs:schema>

Figure 16: An example of XML Schema

The "xs:" that appears in front of every element is the namespace prefix. The <schema> tag is the root element of an XML Schema file. It must appear.

The <element> elements describe the elements that appear in the XML file. In this case, there are "bookstore", "book", "title", "author", "year" and "price". If the value of an element is a simple type, it will be described as an attribute of the element. For example, the "author" element contains the type "string". Thus, in line 26, the string type is described as an attribute of the "author" element. If the value of an element is a complex type, it will be describe
61. ond", Proceedings of the 17th international conference on World Wide Web, 2008.
[17] Niu Xiamu, Jiao Yuhua, "An Overview of Perceptual Hashing", Acta Electronica Sinica, Vol. 36, No. 7, 2008.
[18] Schaathun, H.G., "On Watermarking/Fingerprinting for Copyright Protection", Innovative Computing, Information and Control, 2006.
[19] Monga, V., Evans, B.L., "Perceptual Image Hashing Via Feature Points: Performance Evaluation and Tradeoffs", Image Processing, IEEE, 2006.
[20] Zhao Yuxin, "A research on hashing algorithm for multi-media application", PhD thesis, Nanjing University of Technology, 2009.
[21] Bai He, Tang Dibin, Wang Jinlin, "Research and Implementation of Distributed and Multi-topic Web Crawler System", Computer Engineering, Vol. 35, No. 19, 2009.
[22] Wu Xiaohui, "An improvement of filtering duplicated URL of distributed web crawler", Journal of Pingdingshan University, Vol. 24, No. 5, 2009.
[23] http://drupal.nl/over-drupal, "Over Drupal".
[24] http://www.w3.org/TR/xml11/, "The specification of XML 1.1".
[25] http://www.w3.org/TR/xmlschema11-1/, "W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures".
[26] http://www.omg.org/spec/XMI/2.4.1/, "Official documents for XMI specification".
[27] Xiao Jihai, "A research of the transformation from XML Schema to entity relationship database", Master thesis, Taiyuan University of Technology, 2006.
[28] Philip Bohannon, Juliana Freire, Prasan Roy, Jerome S
62. or the search engine.

The basic operation of the web crawler is to fetch information from websites. The process of fetching data is the same as when we surf online. The web crawler mostly deals with URLs. It gets the content of the files according to the URL and classifies it. Thus, it's very important for the web crawler to understand what a URL stands for.

When we want to read a website, the web browser, as the client end, sends a request to the server end, gets the source of the page, interprets it and shows it to us. The address of a webpage is called a Uniform Resource Locator (URL), which is a subset of the Uniform Resource Identifier (URI). A URI is the identifier for every resource on the web: HTML documents, images, videos, programs and so on. A URL is the string that describes where a piece of information is located online. It contains three parts: the protocol, the address (host name or IP address) of the server where the resource is, and the path of the resource on the server.

When we put the web crawler into practice, we use it to traverse the entire Internet and get the related information, which is just like "crawling". During this procedure, it regards the Internet as a huge graph in which every page is a node and the links in a page are directed edges. Thus, the crawler will use either depth-first search (DFS) or breadth-first search (BFS). However, in DFS the crawler might go "too deep" or get trapped, so BFS is usually chosen.

In the process of BFS, a node is chosen as
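The three parts of a URL described in this fragment can be shown with Python's standard `urllib.parse` module. This is only an illustration; the helper name `url_parts` and the sample path are made up for the example.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into (protocol, server address, path of the resource)."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


# The path below is a made-up example, not a URL taken from the thesis data.
protocol, server, path = url_parts("http://upload.wikimedia.org/some/path/diagram.png")
print(protocol)   # http
print(server)     # upload.wikimedia.org
print(path)       # /some/path/diagram.png
```

In practice the server part is usually a host name that is resolved to an IP address, and it is this part that a domain ranking such as the one in ImageCrawler would group images by.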
63. Figure 32: Query of URL page (the screenshot shows a BankAccount class diagram with the attributes owner: String and balance: Dollars and the operations deposit(amount: Dollars) and withdrawal(amount: Dollars))

First, the image is shown to the user. Then, under each image, four tables list the classes, attributes, operations and associations included in the image.

Appendix F  Statistics of UML Class Diagrams Collection

F.1 Table "pic"

This table contains the full information of the images downloaded. There are 2341 items in this table, with the width, height, pixel format and other attributes included.

F.2 Table "picwhite"

This table contains the URLs of the images that are considered UML class diagrams. It includes 1394 items.

F.3 Table "picblack"

This table contains the URLs of the images that are not considered UML class diagrams. It includes 947 items.

F.4 Table "whitecount"

This table contains the statistics of the domains that the images in the table "picwhite" come from. It includes 715 different domains.

F.5 Table "blackcount"

This table contains the statistics of the domains that the images in the table "picblack" come from. It includes 417 different domains.

Acknowledgements

Hereby I would like to thank my supervisors, Dr. Michel Chaudron and Mr. Bilal Karasneh, for their guidance and support of my thesis. With
64. out their help, it would have been rather hard for me to go through the final step of my master program.

Special thanks go to Leiden University and the Netherlands, where I have found a platform to finish my master program, during which I have improved not only my knowledge but also my way of thinking.

I also want to express my deepest gratitude to my family. It's hard to study and live alone in a foreign country with a different culture. Their support and encouragement from far away in China have helped me to get through the toughest period of my thesis.

References

[1] http://www.uml.org/, "Object Management Group - UML".
[2] Frakes, W.B., Pole, T.P., "An empirical study of representation methods for reusable software components", Software Engineering, IEEE Transactions, 1994.
[3] Scott Henninger, "An evolutionary approach to constructing effective software reuse repositories", ACM Transactions on Software Engineering and Methodology (TOSEM), Volume 6, Issue 2, April 1997, Pages 111-140.
[4] Mascena, J.C.C.P., de Lemos Meira, S.R., de Almeida, E.S., Cardoso Garcia, V., "Towards an effective integrated reuse environment", 5th ACM International Conference on Generative Programming and Component Engineering (GPCE), short paper, Portland, Oregon, USA, 2006.
[5] Vanderlei, T.A., Garcia, V.C., de Almeida, E.S., de Lemos Meira, S.R., "Folksonomy in a software component search engine: cooperative classification through shared me
65. owever, it's not good at operating on data. On the other hand, a relational database can efficiently query the data, but it needs a flexible way to publish the data on the Internet. Thus, combining XMI with a relational database is a good solution.

4.3 Related Work

4.3.1 XMI Storage

Phil Bohannon's paper provides a way to map XMI files into a relational database. It proposes a cost-based method for XMI storage in a relational database.

In this method, the first step is to generate a physical XML Schema. A physical XML Schema is an equivalent form of the XML Schema. It eliminates the multi-valued elements by generating separate new types for each of them. Then, for the attributes of each type, the size, the minimum and maximum values and the number of distinct values are inserted into the physical XML Schema. After that, the Schema is ready to be mapped to relations.

The physical XML Schema is a good method for mapping from XML Schema to a relational database. However, it ignores element constraints like "minOccurs", "maxOccurs" or "unique". What's more, it deals with the XML Schema, so it is used to create the tables rather than to build a whole database. But it gives us a good idea of how to design a database for an XMI file.

4.3.2 XMI Query System

In the paper by Daniel Lucredio, Renata P. de M. Fortes and Jon Whittle, a metamodel-based search engine named "Moogle" is created. The system is an effici
66. process.

If the user has selected a database, we suppose that it already contains information about many images. To avoid downloading images that are already in the database, a list of URLs is established. When the URL of an image is found in the next steps, the program checks whether the image is already in the database. If so, the download is skipped.

2.5.2.2 URL Analysis

When we have constructed the URL, we should download the data of the page. In .NET, the WebClient class is used. The WebClient class provides methods to send data to and receive data from a resource that is identified by a URI. The DownloadData method downloads the page data from the URI provided. So we pass the URL we have constructed as the parameter of the DownloadData method and get the source of the page. Within the page data, there are links to various kinds of resources. This page is like the starting node in the theory of web crawlers.

So how do we get the links from the source code of the page? One of the most efficient ways is to use regular expressions. A regular expression is a string that can be used for parsing and manipulating text. In complex search and replace operations, and in validating whether a piece of text is well formed, regular expressions have a great advantage over other methods. For the image crawler, we have two kinds of strings to match. One is the URL that represents a link to another page. It's like the "adjacency node". The other is t
67. r by the number of images contributed by the domains.

Function 5: Output to csv files

As our online database requires csv files as input, the program can generate csv files from the database. The user can click on the "Generate csv" button to generate five csv files corresponding to the five tables in the database. The five csv files are created in the same folder as the executable program file. Each is named "csv_for_", followed by the name of the Access database file, "_" and the table name. For example, if a database is named "pictures.mdb", the five csv files will be "csv_for_pictures_pic.csv", "csv_for_pictures_picblack.csv", "csv_for_pictures_picwhite.csv", "csv_for_pictures_blackcount.csv" and "csv_for_pictures_whitecount.csv".

Here's a screenshot of the program in running:

[Screenshot of the ImageCrawler window: the "Database" and "Save Path" textboxes with their "Browse..." and "Create New" buttons, the "Keyword" and "Number of Images" fields, a preview of a BankAccount class diagram, the "URL Black List" panel, and the "Load Database" and "Select image folder" buttons. The URLs listed in the window are not recoverable from the extracted text.]
68. s  operation

Table 9: The table to store associations

4.4.2 SQL Generator

We have developed a program, XML2DB, that converts an XML file into SQL commands that insert its content into our database. The main feature we use is the System.Xml namespace of .NET. The program loads the XML file generated by the OCR process, reads the attributes of the classes, attributes, operations and associations in the file, generates the corresponding "Insert into" commands and writes them into a SQL file. The generated SQL commands are also shown in the textbox.

Here's a screenshot of this program:
[Screenshot of the XML2DB window: a "Path of the XML file" field with a Browse button, and a textbox listing the generated insert commands for tblClass, tblAttribute and tblOperation.]
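The conversion that XML2DB performs can be sketched in Python as below. This is an illustration under stated assumptions, not the thesis's .NET program: the XML layout (`model`, `class`, `attribute` and `operation` elements and their attribute names) and the column order are hypothetical, since the source does not show the OCR output format. Only the table names tblClass, tblAttribute and tblOperation come from the original.

```python
import xml.etree.ElementTree as ET

# Hypothetical input layout; the real OCR-generated XML may differ.
SAMPLE_XML = """<model id="201281002235">
  <class id="UMLClass_24" name="Class1" visibility="public" abstract="false">
    <attribute id="UMLAttribute_25" name="atr01" visibility="public"/>
    <operation id="UMLOperation_27" name="opr01" visibility="public" abstract="false"/>
  </class>
</model>"""


def sql_literal(value):
    """Quote a string for SQL, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def insert_stmt(table, values):
    """Build one INSERT command for the given table and column values."""
    return "INSERT INTO {} VALUES ({});".format(
        table, ", ".join(sql_literal(v) for v in values)
    )


def xml_to_sql(xml_text):
    """Walk the classes and emit one INSERT per class, attribute and operation."""
    root = ET.fromstring(xml_text)
    model_id = root.get("id")
    statements = []
    for cls in root.findall("class"):
        statements.append(insert_stmt("tblClass", [
            cls.get("name"), cls.get("visibility"), cls.get("abstract"),
            cls.get("id"), model_id]))
        for attr in cls.findall("attribute"):
            statements.append(insert_stmt("tblAttribute", [
                attr.get("name"), attr.get("visibility"),
                cls.get("id"), attr.get("id"), model_id]))
        for op in cls.findall("operation"):
            statements.append(insert_stmt("tblOperation", [
                op.get("name"), op.get("visibility"), op.get("abstract"),
                cls.get("id"), op.get("id"), model_id]))
    return statements


statements = xml_to_sql(SAMPLE_XML)
for line in statements:
    print(line)
```

Because the goal here is a SQL file rather than direct execution, the values are written as quoted literals. A program that ran the commands itself would normally use parameterized queries instead.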
69. system. They provide the basic elements for other diagrams and are especially useful for developers. They are mainly used for analysis and design. During the analysis phase, one should not take the technology into account; abstracting the system into modules is an efficient approach, and the UML class diagram is a way to represent that abstraction. When a project enters the design phase, the UML class diagrams from the analysis phase may be changed to suit the development environment. They then become the basis of the project's code. Different people have different coding styles, and the most efficient way for them to communicate is through UML class diagrams.

1.3 Overview of Our Project

Our project aims to build a system that can establish databases of images of UML class diagrams from the Internet, convert the images to XMI models and run queries on the models. It consists of an image collecting part, a transforming part and a querying part.

Images are different from text: it is very difficult to extract information from them. Current research on model search engines focuses on models stored in text files. Our project supports the collection, conversion, storage and querying of models from images. Finding software assets is expensive for developers, which results in a lack of knowledge about reusable assets; this is one of the main reasons why developers are not willing to reuse sof
70. t it. The URL, filename, width, height and pixel format of the images are saved into the database.

2.5.3 Domain Rank Part

As not all downloaded images are UML class diagrams, there must be a function to distinguish them. The program can list all URLs in the database and show the image that corresponds to a selected URL. The user can manually add URLs to the "blacklist" or remove them from it. The "whitelist" contains the URLs of images that are UML class diagrams.

Based on the "blacklist" and the "whitelist", the program can extract the domains the images come from. The program can then report statistics for the domains and rank them.

Our database is built as an Access file. As the database will be used in the website, and csv files are more common in web applications, the program can generate csv files from our database. The user can click the "Generate csv" button, and five csv files corresponding to the five tables in the database are created.

2.6 Validation

With the ImageCrawler we have developed, a database of 2341 images has been established. The keywords we used are "uml class diagram", "uml class diagram example" and "uml class model". The resulting images differ between keywords. The whole process took an hour. The time cost depends not only on the algorithm but also on the Internet connection.

We have manually establis
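The domain ranking of Section 2.5.3 can be sketched in Python as follows. This is a minimal illustration, assuming that a domain is simply the host part of each image URL and that domains are ranked by how many listed images they contribute; the function name `rank_domains` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains(urls):
    """Count images per domain and return (domain, count) pairs, highest first."""
    counts = Counter(urlparse(u).netloc.lower() for u in urls)
    return counts.most_common()


# Hypothetical whitelist entries (URLs of images that are UML class diagrams).
whitelist = [
    "http://a.example.org/diagrams/1.png",
    "http://b.example.com/img/2.png",
    "http://a.example.org/diagrams/3.png",
]

ranking = rank_domains(whitelist)
print(ranking)
```

Running the same function over the blacklist gives the matching statistics for domains that mostly contribute images that are not class diagrams.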
71. tadata, Brazilian Symposium on Software Engineering, Tool Session, Florianopolis, Brazil, 2006.
[6] Brian Pinkerton, "Webcrawler: Finding what people want", PhD thesis, University of Washington, 2000.
[7] Xu Yuanchao, Liu Jianghua, Liu Lizhen, Guan Yong, "Design and Implementation of Spider on Web-based Full-text Search Engine", Control and Automation Publication, Vol. 23, 2007.
[8] https://developers.google.com/image-search/v1/devguide, "Image Search Developer's Guide of Google".
[9] http://support.google.com/images/bin/answer.py?hl=en&answer=1325808, Features of Google's search by image.
[10] http://support.google.com/websearch/bin/answer.py?hl=en&answer=2409866, Details of personal results by Google.
[11] http://googleimagedownloader.com/, Google Image Downloader, the tool to download images with Google API.
[12] Zhou Lizhu, Lin Ling, "Survey on the research of focused crawling technique", Computer Applications, Vol. 25, No. 9, 2005.
[13] Marc Ehrig, Alexander Maedche, "Ontology-focused crawling of Web documents", Proceedings of the 2003 ACM Symposium on Applied Computing, 2003.
[14] Zhou Demao, Li Zhoujun, "Survey of High-performance Web Crawler", Computer Science, Vol. 36, No. 8, 2009.
[15] Liu Jinhong, Lu Yuliang, "Survey on topic-focused Web Crawler", Application Research of Computers, Vol. 24, No. 10, 2007.
[16] Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov, "IRLbot: scaling to 6 billion pages and bey
72. Figure 20: The interface of ImageCrawler in process

Appendix B: User Manual for Filter4ImageCrawler

Here's the screenshot of Filter4ImageCrawler:
[Screenshot of the Filter4ImageCrawler window: "The folder of images" and "Database" fields, each with a Browse button, and the "Open Example Image" and "Start" buttons.]
Figure 21: The initial interface of Filter4ImageCrawler

Step 1: Choose the folder of the images
Click the "Browse" button next to the textbox at the top to set the path of the image folder.
Step 2: Choose the database
Click the "Browse" button next to the "Database" textbox to set the path of the database. The program checks whether the file matches the format of the databases that ImageCrawler creates. If not, a message asks the user to select again. The full path of the chosen database file is shown in the "Database" textbox.
Step 3: Choose the example image
The example image is the reference used by the perceptual hash algorithm: all images are compared to it for similarity. Click the "Open Example Image" button to choose an image as the example. The image is shown in the picture box below the button.
Step 4: Start the process
Click the "Start" button to start the process.
Step 5: View the result
When
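The comparison against the example image in Step 3 can be illustrated with a small Python sketch. It uses average hashing, one simple member of the perceptual hash family, as a stand-in: the manual does not say which variant Filter4ImageCrawler implements, so the algorithm choice, the `max_distance` threshold and all function names here are assumptions. The input is a small grid of grayscale values; a real program would first load each image and shrink it to such a grid with an image library.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: 1 where a pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming_distance(h1, h2):
    """Number of positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def is_similar(example, candidate, max_distance=1):
    """Treat the candidate as similar if its hash is close to the example's."""
    distance = hamming_distance(average_hash(example), average_hash(candidate))
    return distance <= max_distance


# Tiny 2x2 grids standing in for downscaled images.
example = [[0, 255], [255, 0]]
near = [[10, 240], [250, 5]]    # same layout, slightly different brightness
far = [[255, 0], [0, 255]]      # inverted layout

print(is_similar(example, near))
print(is_similar(example, far))
```

Because the hash records only whether each cell is brighter than the image's own mean, small changes in brightness or compression leave it unchanged, while a different layout flips many bits.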
73. tware [2, 3, 4, 5]. For this reason, finding reusable assets is a fundamental aspect of an effective reuse program [30], so we try to find models and support their reuse by converting them into a suitable format that is easy to modify.

This paper covers the image collecting part and the XMI querying part. For the image collecting part, we have developed a program, ImageCrawler, that behaves as a web crawler to collect images of UML class diagrams from the Internet. A website is then established for building a database of the images we have collected. For the XMI querying part, we implement the functions by mapping XMI models into relational databases and running queries on them.

Chapter 2  ImageCrawler

2.1 Overview of Web Crawler

We all know how convenient it is to "Google" something. Google, as the biggest search engine in the world, holds information on billions of websites and keeps it up to date. How can that be done?

Search engine technology emerged in the mid-1990s. As more and more information has been put online, it has become more and more important. A search engine collects information from the Internet, classifies it and provides users with a search function over it. It has become one of the most important Internet services.

The web crawler is a program that can fetch information from the Internet automatically. It plays an important part in a search engine because it can download data from the Internet f
    