
Interactive Network Exploration with Orange

1. Journal of Statistical Software, April 2013, Volume 53, Issue 6. http://www.jstatsoft.org

Miha Štajdohar (University of Ljubljana), Janez Demšar (University of Ljubljana)

Abstract

Network analysis is one of the most widely used techniques in many areas of modern science. Most existing tools for that purpose are limited to drawing networks and computing their basic general characteristics. The user is not able to interactively and graphically manipulate the networks, select and explore subgraphs using other statistical and data mining techniques, add and plot various other data within the graph, and so on. In this paper we present a tool that addresses these challenges: an add-on for exploration of networks within the general component-based environment Orange.

Keywords: networks, visualization, data mining, Python.

1. Introduction

Network analysis is one of the most omnipresent approaches in modern science, with its use ranging from social sciences to computer technology to biology and medicine. Tools for graph drawing and computing their characteristics are readily available, either for free (Gephi, Bastian, Heymann, and Jacomy 2009; Graphviz, Ellson, Gansner, Koutsofios, North, and Woodhull 2001; Network Workbench, NWB Team 2006; NetworkX, Hagberg, Schult, and Swart 2006; Pajek, Batagelj and Mrvar 1998) or in commercial packages (e.g., NetMiner, Cyram 2003). An overview of such
2. Northeastern University, and University of Michigan. URL http://nwb.slis.indiana.edu

Raghavan U, Albert R, Kumara S (2007). "Near Linear Time Algorithm to Detect Community Structures in Large-Scale Networks." Physical Review E, 76(3).

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org

Schwarz G (1978). "Estimating the Dimension of a Model." The Annals of Statistics, 6(2), 461-464.

Štajdohar M, Mramor M, Zupan B, Demšar J (2010). "FragViz: Visualization of Fragmented Networks." BMC Bioinformatics, 11, 475.

Affiliation:
Miha Štajdohar
Bioinformatics Laboratory
Faculty of Computer and Information Science
University of Ljubljana
Tržaška 25, Slovenia
E-mail: miha.stajdohar@fri.uni-lj.si
URL: http://www.fri.uni-lj.si/mihas

Journal of Statistical Software, http://www.jstatsoft.org, published by the American Statistical Association, http://www.amstat.org. Volume 53, Issue 6, April 2013. Submitted: 2009-01-06. Accepted: 2012-11-16.
4. Development of genres

To observe how the musical genres evolved over time, we plotted the same graph as before, but used the Select Data widget (Figure 8) to select subsets of artists based on the year of their first album release, and used this data to select the nodes in the Net Explorer by feeding it to the Items Subset slot. The artists active before the selected year were represented with filled symbols. Figure 9 shows graphs for artists active before years 1985, 1990, 1995 and 2000. Before 1985, most active artists performed classic rock; this probably covers multiple genres which most Last.fm listeners do not distinguish between. We can notice some soul, hard rock and pop groups, and, of course, the 80s tag. In the next image (before 1990), even more groups in these genres are active. We also see the emergence of funk music. By 1995, most of the popular groups from these genres are present, and there are a lot of new genres, such as metal, hip-hop, rap, R&B, indie, punk and hardcore. The last image adds electronic and emo, and the only style still missing is minimal.

Figure 8: Orange Canvas scheme for analysing the appearance of musical genres.

K-means clustering in the graph

Using the widget for k-means clustering (Figure 10(a)), we split the musicia
5. Figure 15: Air traffic network colored by communities. Misclassified nodes are marked.

marked in the Net Explorer, and finally listed in the Data Table (see marked examples in Figure 15). It seems that most of the classification errors were made on peripheral nodes, which could be a result of the incompleteness of the analysed air traffic network.

5. Conclusion

We presented a new tool for visual analysis of network data, the Orange Network add-on. Many similar tools exist, though ours differs in offering a very interactive and flexible, yet user-friendly interface. We have demonstrated how to use the widget for basic operations on graphs: drawing the graph with optimized layout, selection of nodes, magnification of subgraphs, etc. Using the widget for data filtering, we observed the development of the network in time. We showed how to combine the graph with k-means clustering to identify the clusters in the graph. We observed the relation between influence
6. query for similar artists, we got a set with around 26,000 artists. We further reduced it by taking only the 2000 most connected artists. We took the largest connected subgraph (popular music) and retrieved additional data about the corresponding artists and groups: the number of albums released, the year of their first and last album release, the number of times they have been played on Last.fm and the number of distinct listeners, and lists of user-assigned tags, like country, 60s, female vocalists, Russian rock and similar. Last.fm weights the tags according to how well they correspond to a particular artist. For each artist, we identified the tag with the greatest weight and assumed that it corresponds to the genre to which the artist belongs. The data was retrieved in March 2007.

Figure 5: Orange Canvas scheme for showing the graph and the corresponding meta data.

Since some of these data (album count, first and last album release year) were retrieved from another service, AOL music (http://music.aol.com), we removed the artists for which the queries returned ambiguous results, mostly due to more than one group having the same name. The resulting graph contains 1262 nodes representing the most established popular music groups an
7. 1 Numerical Python. URL http://www.numpy.org

Bastian M, Heymann S, Jacomy M (2009). "Gephi: An Open Source Software for Exploring and Manipulating Networks." In International AAAI Conference on Weblogs and Social Media. URL http://gephi.org

Batagelj V, Mrvar A (1998). "Pajek: Program for Large Network Analysis." Connections, 21, 47-57. URL http://pajek.imfm.si

Baur M, Brandes U (2004). "Crossing Reduction in Circular Layouts." In Proceedings of the Workshop on Graph-Theoretic Concepts in Computer Science (WG 2004), pp. 332-343.

Brandes U (2007). "Eigensolver Methods for Progressive Multidimensional Scaling of Large Data." In M Kaufmann, D Wagner (eds.), Graph Drawing, volume 4372 of Lecture Notes in Computer Science, pp. 42-53. Springer-Verlag, Berlin.

Celma O (2010). Music Recommendation and Discovery. Springer-Verlag.

Csardi G, Nepusz T (2006). "The igraph Software Package for Complex Network Research." InterJournal, Complex Systems, 1695.

Cyram (2003). NetMiner User Manual. Seoul. URL http://www.NetMiner.com

De Leeuw J, Mair P (2009). "Multidimensional Scaling Using Majorization: SMACOF in R." Journal of Statistical Software, 31(3), 1-30. URL http://www.jstatsoft.org/v31/i03

Demšar J, Zupan B, Leban G (2004). "Orange: From Experimental Machine Learning to Interactive Data Mining." Faculty of Computer and Information Science, University of Ljubljana. URL http://orange.biolab.si

Ellson J, Gansner
8. [Figure: network of artists with groups of nodes labeled by genre. Legible labels include minimal, punk, electronic, metalcore, hardcore, metal, classic rock and 80s. The caption is not recoverable.]
9. ER, Koutsofios E, North SC, Woodhull G (2001). "Graphviz: Open Source Graph Drawing Tools." Graph Drawing, pp. 483-484. URL http://www.Graphviz.org

Fruchterman TMJ, Reingold EM (1991). "Graph Drawing by Force-Directed Placement." Software: Practice and Experience, 21(11), 1129-1164.

Hagberg A, Schult D, Swart P (2006). NetworkX: High Productivity Software for Complex Networks. URL http://networkx.lanl.gov

Handcock MS, Hunter DR, Butts CT, Goodreau SM, Morris M (2008). "statnet: Software Tools for the Representation, Visualization, Analysis and Simulation of Network Data." Journal of Statistical Software, 24(1), 1-11. URL http://www.jstatsoft.org/v24/i01

Himsolt M (1996). "GML: A Portable Graph File Format."

Jones E, Oliphant T, Peterson P (2001). SciPy: Open Source Scientific Tools for Python. URL http://www.scipy.org

Kononenko I (1994). "Estimating Attributes: Analysis and Extensions of Relief." In F Bergadano, LD Raedt (eds.), Proceedings of the European Conference on Machine Learning (ECML-94), pp. 171-182. Springer-Verlag.

Leung I, Hui P, Liò P, Crowcroft J (2009). "Towards Real-Time Community Detection in Large Networks." Physical Review E, 79(6), 1-10.

Nepusz T (2009). "Reconstructing the Structure of the World-Wide Music Scene with Last.fm." URL http://sixdegrees.hu/last.fm

NWB Team (2006). Network Workbench Tool. Indiana University,
10. [Figure 10(a): Orange Canvas scheme containing the widgets Select Attributes, k-Means Clustering, Net File and Net Explorer.] (a) The Net Explorer widget connected to the k-means clustering.
11. (b) Nodes in the same cluster share the same color.

Figure 10: Finding network clusters by k-means clustering applied on network meta data.

Explorer to select the 50 most connected nodes; these nodes represent the most influential artists. We fed these hubs to the Scatter Plot widget, which marked them by filling the corresponding symbols. With only 50 out of 1262 artists selected, we see that a disproportionate number of them appears at the top of their corresponding clusters.

Manipulating graph data from scripts

We will show how to use the Python module behind the Net Explorer widget directly by scripting in Python. We will verify the finding from the previous example, namely that influential groups (those with more connections) are also among the most popular groups (that is, the most listened to) within their respective genres. To exclude the effect of the genre, we divided the number of plays for each artist by the maximum number of plays in the corresponding genre (genres were defined by the k-means clustering). We then used the Mann-Whitney test to compare the number of plays for the group of the 50 most connected artists with the other artists. The difference was highly significant (p < 0.01).

Figure 11: Graph with the number of albums represented by the node size.

    import numpy, scipy.s
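The normalization and comparison described above can be sketched without any third-party library. This is only an illustration under stated assumptions: the function names `normalize_by_cluster_max` and `mann_whitney_u` and the toy numbers are hypothetical and not taken from the paper, the U statistic is computed by direct pair counting, and the p-value computation is left out.

```python
def normalize_by_cluster_max(plays, clusters):
    """Divide each play count by the largest play count in the same cluster."""
    cluster_max = {}
    for count, cluster in zip(plays, clusters):
        cluster_max[cluster] = max(cluster_max.get(cluster, 0), count)
    return [count / cluster_max[cluster] for count, cluster in zip(plays, clusters)]


def mann_whitney_u(sample_a, sample_b):
    """U statistic of sample_a: pairs (a, b) with a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical toy data: six artists in two clusters (assumed positive play counts).
plays = [100, 50, 10, 400, 200, 40]
clusters = [0, 0, 0, 1, 1, 1]
hubs = {0, 3}  # indices of the most connected artists (made up for this sketch)

normalized = normalize_by_cluster_max(plays, clusters)
hub_plays = [value for i, value in enumerate(normalized) if i in hubs]
other_plays = [value for i, value in enumerate(normalized) if i not in hubs]
u = mann_whitney_u(hub_plays, other_plays)
```

Pair counting is quadratic in the sample sizes, which is acceptable for a sketch; a statistics library would normally be used for both the statistic and the p-value.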
12. ailable libraries like SciPy (Jones, Oliphant, and Peterson 2001), which includes a library for matrix manipulation and linear algebra, Numpy (Ascher, Dubois, Hinsen, Hugunin, and Oliphant 2001), the statistical library stats, and many other scientific libraries. Similar to R (R Core Team 2013), where some libraries are written in low-level languages (Fortran, C) and others in R itself, the fast core of Orange is written in C++ and other modules are written in Python.

(Footnote: In this paper we will mostly use the machine learning terminology related to modeling. In classification problems we are given labelled data: each data instance belongs to a certain class.)

The code below shows how to use the Orange library from Python. It loads Fisher's Iris data set and selects data instances from classes versicolor and virginica. Then it randomly splits the data into two subsets, for fitting the model and for testing it (train and test). Next, we run a learning algorithm. Machine learning makes a distinction between a learning algorithm (or learner) and a classification algorithm (classifier). The latter is a predictive model for discrete outcomes, and the former is an algorithm for fitting the model to the training data. In the code below, the learning algorithm for logistic regression, LogRegLearner, gets the data as input and returns a classifier (variable lr). A classifier is an algorithm that gets a data instance and predicts its class. In the loop at the end of t
13. anvas scheme of the analysis. At first glance, there are two major communities of airports. We confirmed this with the Net Clustering widget, applying the community detection method by Raghavan et al. (2007) and coloring the nodes by the clustering r

Figure 13: Analysis of misclassified examples on Iris data using the Net Explorer widget. (a) Orange Canvas scheme extended from Figure 1. (b) The 3-nearest neighbor network; misclassified instances are marked.

Figure 14: The complete Orange Canvas scheme used in the analysis of air traffic data.
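The community detection method of Raghavan et al. (2007) is label propagation: every node starts with its own label and repeatedly adopts the label most common among its neighbors, with ties broken at random, until no node needs to change. The following is a minimal sketch of that idea, not the code used by the Net Clustering widget; the function name, the `seed` and `max_rounds` parameters and the toy graph are assumptions made for illustration.

```python
import random
from collections import Counter


def label_propagation(adjacency, seed=0, max_rounds=100):
    """Asynchronous label propagation on an undirected graph.

    adjacency maps each node to the set of its neighbors.
    Returns a dict mapping each node to its community label.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # unique initial labels
    nodes = list(adjacency)
    for _ in range(max_rounds):
        rng.shuffle(nodes)  # visit nodes in random order
        changed = False
        for node in nodes:
            if not adjacency[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[neighbor] for neighbor in adjacency[node])
            best = max(counts.values())
            candidates = sorted(label for label, c in counts.items() if c == best)
            if labels[node] not in candidates:
                labels[node] = rng.choice(candidates)  # random tie-breaking
                changed = True
        if not changed:
            break  # every node already holds a most frequent neighbor label
    return labels


# Hypothetical toy graph: two disconnected triangles.
toy_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}
communities = label_propagation(toy_graph)
```

On graphs where communities are linked by a few edges, the outcome can depend on the visiting order and the tie-breaking, so repeated runs with different seeds may give different partitions.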
14. ately observed in the graph itself, since sometimes precise tuning is required to get the graph which reveals interesting relations. A similar problem is filtering of the graph's nodes. A geneticist might wish to filter out certain genes, or a computer network expert might want to plot the graph with only the major nodes or, for another purpose, with every single client. It would be helpful if they were able to perform such operations interactively, in a script, or both.

Interactive tools are also required to explore the local graph structure itself. Questions like "Which computers are at most two connections away from those infected?" are asked more often than "How well do the degrees of the graph's nodes match the power law?"

Finally, exploration of the graph often leaves us with a subgraph or a subset of nodes that we want to explore further. After identifying a strongly connected component of a social network, we might want to learn how this group differs from the general population. If we find a set of drugs with similar effects on the organism, we might want to discover what is typical of their active parts. Such questions can be answered with less effort if the network visualization tool is integrated into a larger statistical, machine learning or data mining framework.

This paper presents our approach to those problems: a collection of widgets and Python modules for interactive network visualization and analysis for the data mini
15. ch node, and display more on the mouse hover over the node. Nodes can be colored or sized according to values of the corresponding objects' attributes. The widths of edges can correspond to their weights, and edges can be colored according to the value of the selected attribute from the edge descriptions.

(Footnote: The algorithm will be published elsewhere.)

Marking and selecting

The widget uses a two-stage procedure for node selection that allows for a very flexible manipulation of subsets of nodes (see, for instance, Figure 6(a)). Nodes can be marked or selected, or both. Marking is often a pre-stage to selecting or deselecting. Selected nodes are shown by filled circles, as opposed to the rest, which are hollow, whereas marked nodes have a black border.

Nodes can be selected manually by clicking or by drawing selection rectangles. Another way to select nodes is to add or remove the marked nodes to or from the selection, or to replace the current selection with the marked nodes. The selected nodes can be moved to manually enhance the layout. The user can also hide the selected or the unselected nodes and then rerun the optimization. The selected subgraph, or the data or distance sub-matrix for the corresponding objects, can also be passed on to other widgets.

The nodes can be marked upon values of attributes of the corresponding objects or upon network properties. To select local portions of the graph t
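The two-stage procedure can be pictured as two independent sets of nodes plus three operations that combine them. The class below is a hypothetical illustration of that bookkeeping only; the name `NodeSelection` and its methods are assumptions and do not describe the widget's actual code.

```python
class NodeSelection:
    """Keeps two independent node sets: marked and selected."""

    def __init__(self):
        self.marked = set()
        self.selected = set()

    def mark(self, nodes):
        """Replace the current marking with the given nodes."""
        self.marked = set(nodes)

    def add_marked_to_selection(self):
        self.selected |= self.marked

    def remove_marked_from_selection(self):
        self.selected -= self.marked

    def replace_selection_with_marked(self):
        self.selected = set(self.marked)


# Mark two nodes and select them, then mark an overlapping set
# and remove it from the selection.
state = NodeSelection()
state.mark({"a", "b"})
state.add_marked_to_selection()
state.mark({"b", "c"})
state.remove_marked_from_selection()
```

Keeping the marking separate from the selection is what lets a marking act as a reusable mask: the same marked set can be added to, removed from, or substituted for the selection.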
16. clust in plays clust

    # the 50 nodes with the highest degree
    hubs, degrees = zip(*sorted(net.degree().items(), key=itemgetter(1))[-50:])
    # normalized plays of hubs and of all other artists
    hubp = numpy.array([normalized[i] for i in hubs])
    not_hubp = numpy.array([plys for i, plys in enumerate(normalized) if i not in hubs])
    u, prob = scipy.stats.mannwhitneyu(hubp, not_hubp)
    print "Mann-Whitney U: %d, p: %e" % (u, prob)

Note that this is only a toy example. The experimental procedure is invalid, since the same data is used to formulate and to validate the hypothesis. Besides, if the similarity between two artists depends upon the number of mutual fans, not normalized by the number of fans of each individual artist, the artists which are more popular get more connections simply due to their popularity, and the discovered relation follows directly from the definition of similarity. We cannot verify this, since Last.fm keeps the exact definition of similarity secret.

4.2. Extending existing data analysis in Orange Canvas

The Net Explorer widget can extend existing Orange Canvas schemes, which often yields additional insights about the problem domain. Consider the example in Figure 1, where the aim is to build a prediction model on the Iris data set. Although the classification accuracy of the k-nearest neighbor classifier is high (0.94), we wish to further explore the instances where the model fails. We connect the Select Data with the Example Distance widget, add the Net from Distances and Net Explorer widgets as in Figure 13(a), and select misclassified instances of the k-nearest neighbors model in the Confusion Matrix widget. In the Net from Distances widget, we connect each node with the 3 most similar nodes to simulate the behavior of the k Nearest Neighbors widget's model exactly. After observing the outliers in Net Explorer (Figure 13(b)) and toying with different distance measures, we notice that it is not possible to further increase the classification accuracy without overfitting the data.

4.3. Network mining

Network widgets seamlessly integrate machine learning algorithms into the analysis of network data. The air traffic network in this example was constructed by parsing timetables of three major airlines: Lufthansa, United and American Airlines. Graph nodes represent airports; a pair of nodes is connected if any of the listed airlines provides a direct flight between them. Airport data was extracted from the World Airport Traffic Report issued by the Airports Council International in 2006. The network data set includes the number of aircraft movements (landing or take-off of an aircraft), the number of passengers (arriving or departing via commercial aircraft), cargo handled (in tonnes), airport category as specified by the Federal Aviation Administration (Large Hub, Medium Hub, Small Hub, Nonhub Primary, and Non-primary Commercial Service), and a class variable FAA Hub, specifying whether the airport is considered a hub by the FAA or not. Figure 14 shows the Orange C
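The construction used in Section 4.2, where every node is linked to its 3 most similar nodes, is a k-nearest-neighbor graph built from pairwise distances. A minimal sketch follows; the function names and the toy points are hypothetical, Euclidean distance is an assumption (the text experiments with several distance measures), and ties are broken by instance index.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_network(instances, k):
    """Return undirected edges linking each instance to its k nearest neighbors.

    Edges are (i, j) index pairs with i < j, so an edge found from
    both endpoints is stored only once.
    """
    edges = set()
    for i, p in enumerate(instances):
        ranked = sorted(
            (euclidean(p, q), j) for j, q in enumerate(instances) if j != i
        )
        for _, j in ranked[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


# Hypothetical toy data: four points on a line, one of them far away.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
edges = knn_network(points, k=1)
```

Because the neighbor relation is not symmetric, a node can end up with more than k edges: the distant point above links to its nearest neighbor even though no other point chooses it.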
18. d artists. The purpose of this study is not to provide new insights into the Last.fm data, but merely to demonstrate the use of the Net Explorer widget. An interested reader will find the detailed analysis of this network in the works of Celma (2010) and Nepusz (2009).

Basic graph manipulation: marking and selecting

The scheme from Figure 5 loads the graph and additional data about nodes (Net File widget). The data is shown in the Data Table and the graph is plotted in the Net Explorer widget. To observe the subgraphs that we are going to select in the Net Explorer, we attached another Net Explorer and Data Table to its output.

When we open the Net Explorer and select variable artist for the node labels, the graph, despite being optimized by the Fruchterman-Reingold algorithm, looks like a huge cloud of unreadable overlapping labels. Say that we are interested in exploring the vicinity of Paul Simon. In the Mark tab, we select Find nodes and type "Paul Simon". This finds and marks the node corresponding to Paul Simon; the node is drawn as a filled circle with black border, as opposed to others that are empty. We can then click the button for selecting the marked nodes. This selects Paul Simon's node; the node is now filled, but has no border. We can proceed by choosing Mark neighbors of selected nodes and set the distance to, say, 2. This marks all nodes that are at most two connections from Paul Simon. To further reduce the visual clutter, we check Show labels on marked nodes only and zoom in on the marked part of the graph. The result is shown in Figure 6(a).

We can again select the marked nodes and output the corresponding subgraph. Since the second Net Explorer shows only the subgraph, the resulting layout is much nicer, as it is not affected by other artists in which we are not interested at the moment. We added the information about genres by coloring the nodes according to the tag that best describes them.

(b) Subgraph with additional data visualization: the number of plays as size and genre as color.

Figure 6: Exploration of artists similar to Paul Simon.
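Marking the neighbors of a selected node up to a given distance amounts to a breadth-first search that stops expanding at the hop limit. The sketch below illustrates this on a plain adjacency dictionary; the function name `nodes_within` is hypothetical, and the edges of the toy graph are invented for the example and do not reflect the Last.fm similarity data.

```python
from collections import deque


def nodes_within(adjacency, start, max_hops):
    """Return {node: hops} for all nodes at most max_hops edges from start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Invented toy graph (undirected, stored as adjacency sets).
toy_graph = {
    "Paul Simon": {"Cat Stevens", "James Taylor"},
    "Cat Stevens": {"Paul Simon", "Nick Drake"},
    "James Taylor": {"Paul Simon"},
    "Nick Drake": {"Cat Stevens", "Elliott Smith"},
    "Elliott Smith": {"Nick Drake"},
}
marked = nodes_within(toy_graph, "Paul Simon", 2)
```

With a limit of 2, the node three edges away from the start is left unmarked, while the start node itself is included with a hop count of 0.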
21. Figure 3: A subgraph of the Pubmed article network, with the selected article "Detecting protein function and protein-protein interactions" visualized together with 14 similar articles.

and others. For the node-level indices (Figure 4(b)), the widget outputs another network, in which the computed indices are appended to each node in the same way as, for instance, the Net File widget appends the data read from the file. This data can then be explored in the Net Explorer widget or other data analysis widgets, for example, to build a prediction model based on the network structure. For an example, refer to the case study in Section 4.3. The computed node-level indices are node degree, average neighbor degree, clustering coefficient, number of triangles in which the node participates, number of cliques, degree centrality, closeness centrality, betweenness centrality, information centrality, core number, eccentricity and others. Com
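A few of the node-level indices listed above are simple enough to compute directly from an adjacency structure. The sketch below covers degree, number of triangles, clustering coefficient and average neighbor degree for an undirected graph; the function name, the dictionary keys and the toy graph are hypothetical, and the code is an illustration of the definitions rather than the Net Analysis widget's implementation.

```python
from itertools import combinations


def node_indices(adjacency):
    """Compute a few node-level indices for an undirected graph.

    adjacency maps each node to the set of its neighbors.
    """
    indices = {}
    for node, neighbors in adjacency.items():
        degree = len(neighbors)
        # A triangle through this node is a pair of its neighbors that are linked.
        triangles = sum(
            1 for u, v in combinations(sorted(neighbors), 2) if v in adjacency[u]
        )
        possible_pairs = degree * (degree - 1) / 2
        clustering = triangles / possible_pairs if possible_pairs else 0.0
        avg_neighbor_degree = (
            sum(len(adjacency[n]) for n in neighbors) / degree if degree else 0.0
        )
        indices[node] = {
            "degree": degree,
            "triangles": triangles,
            "clustering": clustering,
            "avg_neighbor_degree": avg_neighbor_degree,
        }
    return indices


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
idx = node_indices(toy_graph)
```

Appending such values to each node as ordinary attributes is what makes them usable as features for a prediction model, as in the case study in Section 4.3.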
22. Figure 2: An example of a scheme for interactive data exploration.

mis-classified to another class (corresponding to columns). If the user selects one or more cells, the Confusion Matrix gives the corresponding instances to any widgets connected to its output. We connected the Confusion Matrix to another of Scatter Plot's inputs, the one for a subset of instances. Wired like this, the Scatter Plot widget shows the entire data set and marks the instances selected in the Confusion Matrix.

3. Network visualization and exploration in Orange Canvas

The Orange Network add-on consists of widgets and modules for interactive network analysis. Net Explorer is a widget for Orange Canvas which visualizes graphs and lets the user explore them. We implemented several auxiliary widgets for reading the network, constructing it from data, and for computing the general statistical properties of the network. The network object is stored in the data structure defined in the module NetworkX. We have added some new algorithms (e.g., community detection and frequent graph pattern discovery) and methods to seamlessly integrate
23. es about the objects. Such data is copied to the constructed network.

The last widget for construction of networks is SNAP, which downloads the network from the Stanford Network Analysis Project graph library (http://snap.stanford.edu/data).

3.2. Net Explorer widget

Input and output signals

The Net Explorer widget has a number of input and output slots to exchange data with other widgets. The inputs are:

Network: The network to plot.
Items: Data about the nodes; overrides any such data that is already present in the network.
Item Subset: A list of nodes to be marked in the graph.
Distances: Distances between graph nodes, required by some layout optimization algorithms.
Net View: A custom plug-in widget for extending the Net Explorer.

The widget can output a selected subgraph and the corresponding distance matrix, or descriptions of the marked, selected or unselected nodes.

Layout optimization

Net Explorer supports several layout optimization algorithms.

No optimization: If the given network object contains the nodes' placement data (this is supported, for instance, in Pajek format), they are placed accordingly.
Random: The nodes are scattered randomly.
Fruchterman-Reingold (F-R): The standard F-R algorithm (Fruchterman and Reingold 1991) positions pairs of connected nodes to a certain fixed small distance, and the unconnected ones to the fixed large distance. A simulated annealing algorithm is used to optimize the soluti
24. esult indeed revealed two clusters. See results in Figure 15, where colors represent the two discovered communities. We then set the node tooltips to show the attribute city and hovered over some airports from each cluster. We discovered that the airports are clustered according to the continents, and the two large communities represent airports in North America and Europe.

Our next objective was to explore the correlation between the network topology and the airport type; more specifically, to predict whether the airport is a hub or not from the network topology. First, 14 node-level indices were computed with the Net Analysis widget (Figure 4(b)). They were scored and compared in the Rank widget. The top half of the ranked attributes, according to the ReliefF measure of Kononenko (1994), were used to test different learning algorithms: logistic regression, naive Bayes, random forest and support vector machines. Examples that were misclassified by the logistic regression learner, the model with the best classification accuracy (0.73), were selected in the Confusion Matrix widget,
25. [Figure 1(b): A few widgets from the above scheme.] Figure 1: Loading the data, filtering, constructing and testing models in Orange Canvas.
26. he code we count the number of class predictions for the test data (test) that match the actual classes. Finally, the script reports the proportion of correctly classified test instances.

    import Orange

    # Load the Iris data and keep only two of the three classes.
    data = Orange.data.Table("iris")
    data = data.filter(iris=["Iris-versicolor", "Iris-virginica"])
    # Drop the class value that no longer occurs in the data.
    data = data.translate(Orange.data.Domain(data.domain.features,
        Orange.data.preprocess.RemoveUnusedValues(data.domain.class_var, data)))

    # Split the data 70:30 into training and test sets.
    folds = Orange.data.sample.SubsetIndices2(data, 0.7)
    train = data.select(folds, 0)
    test = data.select(folds, 1)

    # Fit logistic regression and count correct predictions on the test set.
    lr = Orange.classification.logreg.LogRegLearner(train)
    corr = 0.0
    for inst in test:
        if lr(inst) == inst.get_class():
            corr += 1
    print "Accuracy:", corr / len(test)

To continue the example, we compared the classification accuracy (implemented in module Orange.evaluation.scoring) of several algorithms (logistic regression with a stepwise variable selection, pruned classification trees and k-nearest neighbor model) using the cross-validation sampling technique from module Orange.evaluation.testing.

    from Orange.classification import logreg, tree, knn
    from Orange.evaluation import scoring, testing

    models = [logreg.LogRegLearner(stepwiseLR=True),
              tree.TreeLearner(min_instances=2, m_pruning=2),
              knn.kNNLearner(k=3)]
    # Cross-validate all three learners on the same data.
    res = testing.cross_validation(models, data)
    cas = scoring.CA(res)
    print "Logistic regression:", cas[0]
    print "Classification
27. he user can mark the neighbors of a node pointed to by the mouse, or the neighbors of the selected nodes up to a certain distance (e.g., up to three edges away). Hubs, poorly connected nodes and similar local phenomena can be observed by marking the given number of the most connected nodes, the nodes with more or fewer edges than a given number, or the nodes with more connections than their average neighbor or than any of their neighbors. Finally, the set of marked nodes can be specified by the data sent from another widget using the input slot Item Subset.

Plug-ins. A network plug-in is a widget that controls which part (subgraph) of the original network is visualized in the Net Explorer. It is particularly useful for visualizing large networks. A plug-in widget connects to the Net Explorer's Net View signal.

The Net Inside View widget is an example of a basic plug-in. It provides a local view on the network. On selection, it smoothly moves the selected node to the center and hides distant nodes (nodes that are more than k edges away). An expert can then interactively explore the network by clicking on neighboring nodes.

Pubmed Network View is a plug-in that provides a view on the large Pubmed article network. The user can first filter the articles and then select a set to display. In addition to selected articles, neighboring articles are also included if they satisfy selection criteria: edge distance from selected, maximum number of neighbors and edge t
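The two marking rules described above (neighbors up to a given number of edges away, and nodes with more connections than their average neighbor) can be sketched in plain Python 3. This is a minimal illustration, not the widget's code: the function names and the adjacency-dictionary representation are assumptions made for the example.

```python
from collections import deque


def neighbors_within(adjacency, sources, max_hops):
    """Nodes reachable from any source in at most max_hops edges (sources excluded)."""
    distance = {node: 0 for node in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {node for node, hops in distance.items() if hops > 0}


def more_connected_than_average_neighbor(adjacency):
    """Nodes whose degree exceeds the mean degree of their neighbors."""
    marked = set()
    for node, neighbors in adjacency.items():
        if not neighbors:
            continue  # isolated nodes have no neighbors to compare with
        mean_neighbor_degree = sum(len(adjacency[n]) for n in neighbors) / len(neighbors)
        if len(neighbors) > mean_neighbor_degree:
            marked.add(node)
    return marked


# Hypothetical toy graph: a path a - b - c - d, with c also linked to e and f.
adjacency = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d", "e", "f"},
    "d": {"c"},
    "e": {"c"},
    "f": {"c"},
}

two_hops_from_a = neighbors_within(adjacency, ["a"], 2)
local_hubs = more_connected_than_average_neighbor(adjacency)
```

A breadth-first search is used because it visits nodes in order of edge distance, so the search can stop expanding as soon as the requested radius is reached.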
28. hreshold. Right-clicking the article in the Net Explorer pops up a menu of options to remove, expand or score the corresponding article. In the example in Figure 3, the article "Detecting protein function and protein-protein interactions from genome sequences" was selected and displayed together with the most similar articles (at most 2 edges away and with edge weight higher than 0.5).

3.3. Network analysis

The Net Analysis widget computes graph- and node-level statistics for the given network. Graph-level indices are computed and displayed in the widget (Figure 4(a)). These include the number of nodes and edges, average node degree, graph diameter, average shortest path length, density, degree assortativity coefficient, graph clique number, graph transitivity, average clustering coefficient, number of connected components, number of attracting components

[Figure: Orange Canvas screenshot showing the Net File, Pubmed Network View and Net Explorer widgets.]
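Two of the graph-level indices listed above follow directly from the node and edge counts. The sketch below, in plain Python 3, assumes an undirected graph without self-loops or multiple edges; the function names are chosen for this example, and the counts (1239 nodes, 3963 edges) are the ones displayed in the paper's Net Analysis screenshot.

```python
def average_degree(n_nodes, n_edges):
    """Mean node degree of an undirected graph: every edge contributes to two nodes."""
    return 2.0 * n_edges / n_nodes


def density(n_nodes, n_edges):
    """Fraction of all possible undirected node pairs that are joined by an edge."""
    return 2.0 * n_edges / (n_nodes * (n_nodes - 1))


# Counts as shown in the Net Analysis widget screenshot in the paper.
n_nodes, n_edges = 1239, 3963
avg_deg = round(average_degree(n_nodes, n_edges), 4)
dens = round(density(n_nodes, n_edges), 4)
```

Rounded to four decimals, these reproduce the average degree and density reported in that screenshot, which suggests the widget uses the standard undirected definitions.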
29. [Figure 7: Genres according to the prevalent tag in each region, as judged by a human expert. Legible region labels: hard rock, soul, R&B, funk.]

and resized them according to their popularity (the number of times their music has been played on Last.fm). The result in Figure 6(b) shows a graph with Paul Simon and a few other artists, in particular Bob Dylan, between two stronger components: classic rock, with the Rolling Stones as the central point, on one side, and various representatives of a more acoustic and vocal style on the other.

Grouping by genres in the graph. We again drew the entire graph with the layout optimized by the Fruchterman-Reingold algorithm and colored all nodes by the most important tag, which presumably gives the genre. The result in Figure 7 shows that there is a good correspondence between genres (the most important tags) and the groups which can be visually observed in the graph. We are indeed able to easily label regions of the graph by the corresponding genre based on the prevalent tag. This confirms that the graph can, with a grain of salt and caution, be used for subjective visual clustering
30. it into the Orange environment.

3.1. Data preparation

The add-on includes three widgets which load or construct graphs.

Net File reads the popular Pajek format, Graph Modeling Language (GML; Himsolt 1996) and a NetworkX graph in Python's pickle format. The objects corresponding to the nodes can be described with vectors of continuous, discrete or textual variables or, in machine learning terminology, attributes. The user can supply additional data in different formats: tab-delimited, comma-separated, or in several other formats used by other software. The second optional file contains the data about the edges. The data needs to include two columns marked u and v with indices of node pairs, and an arbitrary number of other columns with the data about the given pair.

Net from Distances constructs a graph from a distance matrix. Edges are defined by several criteria: connect nodes with distances within the specified interval, connect each node with its k nearest neighbors, or connect nodes with distances up to the i-th percentile. The widget provides a histogram with the number of node pairs at each distance. The distance matrix can come from various sources: a widget that reads it from a file or one that computes distances between objects. Net from Distances also includes some basic filters: output the entire graph, the largest component, or nodes with at least one edge. The distance matrix can include vectors of attribut
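The three edge-definition criteria of Net from Distances can be illustrated with a short plain Python 3 sketch. It is not the widget's implementation: the function names are invented for this example, and the reading of "up to the i-th percentile" as "keep the shortest p percent of all node pairs" is an assumption.

```python
def edges_within_interval(dist, low, high):
    """Connect node pairs whose distance lies in the closed interval [low, high]."""
    n = len(dist)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if low <= dist[i][j] <= high}


def edges_k_nearest(dist, k):
    """Connect every node with its k nearest neighbors (ties broken by node index)."""
    n = len(dist)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (dist[i][j], j))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # store undirected edges once
    return edges


def edges_up_to_percentile(dist, percentile):
    """Keep the shortest `percentile` percent of all node pairs (assumed reading)."""
    n = len(dist)
    pairs = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    keep = int(len(pairs) * percentile / 100.0)
    return {(i, j) for _, i, j in pairs[:keep]}


# Hypothetical example: four points on a line, distances are absolute differences.
points = [0, 1, 3, 7]
dist = [[abs(a - b) for b in points] for a in points]

by_interval = edges_within_interval(dist, 0, 2)
by_knn = edges_k_nearest(dist, 1)
by_percentile = edges_up_to_percentile(dist, 50)
```

The three criteria generally give different graphs for the same matrix: an interval threshold can leave distant nodes isolated, whereas the k-nearest-neighbor rule guarantees every node at least k edges.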
31. [Figure 4: Graph- and node-level indices computed on the Last.fm network. (a) Graph-level indices; (b) node-level indices.]

4. Case study

4.1. Using the Net Explorer widget

We shall demonstrate the Net Explorer widget on a network of musical groups and artists obtained from the Last.fm web radio (http://last.fm/). The website provides a list of the five most similar artists or groups for each given artist, where the similarity is measured by the number of common listeners (the exact definition is not publicly available). The site does not publish the complete list of available artists, so we constructed the network using a search algorithm that started with a small set of artists from diverse musical styles (U2, Johann Sebastian Bach, Norah Jones, Erik Satie and Spice Girls) and then expanded it by querying for the artists who are similar to those in the set. We stopped the search after several days, when the set contained around 320,000 nodes. After removing those for which we did not
32. ng toolbox Orange (Demsar, Zupan, and Leban 2004). There are some excellent Python packages for network analysis, e.g., PyGraphviz, a Python package that interacts with the Graphviz program to create network plots from the NetworkX graph objects. However, the Orange Network add-on is, to our knowledge, the only Python software for interactive network visualization and exploration.

The following section briefly introduces the Orange framework, as much as necessary for understanding the context of the new add-on, which is presented in more detail in the third section. The largest section of the paper is dedicated to several use case demonstrations.

2. Orange and Orange Canvas

Orange is an open-source Python library for machine learning and data mining. The library includes a selection of popular machine learning methods, as well as many preprocessing methods (filtering, imputation, categorization), sampling techniques (bootstrap, cross-validation), and related methods.

Python is a popular modern scripting language featuring clean syntax, a powerful but non-obtrusive object-oriented model, built-in high-level data structures, complete run-time introspection, and elements of functional programming. It ships with a comprehensive library of routines for string processing, system utilities, internet-related protocols, data compression and many others. These can be complemented by freely av
33. ns into 22 clusters. The distance was defined as the Manhattan distance between the weights of tags; tags not appearing at a certain artist were assigned a weight of 0. The number of clusters was determined by the Bayesian information criterion (BIC; Schwarz 1978). BIC estimates the quality of clustering as a combination of log-likelihood and a penalty term for the number of free parameters, which includes the number of clusters. We colored the nodes corresponding to their respective clusters (Figure 10(b)) and found a good correspondence between the apparent graph clusters and k-means clustering. This can be the result of using the tags (the basis for the clustering) in the definition of similarities (the basis for the graph) as reported by Last.fm. However, since the available documentation on Last.fm states that the definition is mostly based on the number of common listeners, the most plausible explanation is that a typical listener sticks to a certain kind of music, denoted by the same set of tags. This can thus be an informal practical verification of the reliability of the tagging system.

Genres and number of albums. Sizing the nodes by the number of released albums (Figure 11) gives a rough impression about the number of albums per genre and its variation. The picture suggests that artists from some genres generally release many more albums than their colleagues belonging to other genres. However, there is a correlation between
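The distance used for the clustering above can be written down in a few lines. The following plain Python 3 sketch assumes each artist is represented as a dictionary mapping tag names to weights; the function name, the representation and the example weights are illustrative assumptions, not taken from the paper's data.

```python
def manhattan_tag_distance(tags_a, tags_b):
    """Manhattan distance between two tag-weight profiles.

    A tag missing from one profile is treated as having weight 0,
    as described in the text.
    """
    all_tags = set(tags_a) | set(tags_b)
    return sum(abs(tags_a.get(tag, 0) - tags_b.get(tag, 0)) for tag in all_tags)


# Hypothetical tag weights for two artists.
artist_a = {"rock": 100, "folk": 40}
artist_b = {"rock": 60, "soul": 30}

# |100 - 60| for rock + |40 - 0| for folk + |0 - 30| for soul
distance = manhattan_tag_distance(artist_a, artist_b)
```

Treating absent tags as zero makes profiles with disjoint tag sets maximally distant, in proportion to their total weight.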
34. on.
F-R Weighted: A variation of the above that also considers the edge weights: the larger the weight, the smaller the desired distance between the two nodes.
F-R Radial: An F-R type algorithm which places a node selected by the user at the center and optimizes the layout around it. The optimization procedure ensures that nodes with shorter paths to the central node are closer to it than those with longer paths.
Circular Crossing Reduction: A local optimization algorithm which puts the nodes around the circle and tries to minimize the number of edge crossings (Baur and Brandes 2004).
Circular Original: Nodes are placed around the circle in the order they are given in the data.
Circular Random: The nodes are placed around the circle in random order.
FragViz: The FragViz (Stajdohar, Mramor, Zupan, and Demsar 2010) algorithm for visualization of networks that consist of multiple unconnected components. The algorithm combines the standard Fruchterman-Reingold algorithm for laying out individual components with an MDS-style algorithm for placement and rotation of components.
MDS: The SMACOF (De Leeuw and Mair 2009) multidimensional scaling algorithm using stress majorization. The stress model simulates a set of balls, corresponding to graph nodes, that are connected by springs. The lengths of the springs correspond to the desired distances between the graph nodes.
Pivot MDS: An appr
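The circular layouts in the list above are simple enough to sketch. The plain Python 3 example below places nodes on a circle in a given order (as in Circular Original) and counts the edge crossings that Circular Crossing Reduction tries to minimize. Function names and the toy graph are assumptions made for this illustration; the crossing-reduction heuristic of Baur and Brandes itself is not reproduced here.

```python
import math


def circular_layout(nodes, radius=1.0):
    """Place nodes evenly on a circle, in the order they are given."""
    n = len(nodes)
    return {
        node: (radius * math.cos(2 * math.pi * i / n),
               radius * math.sin(2 * math.pi * i / n))
        for i, node in enumerate(nodes)
    }


def count_crossings(order, edges):
    """Count pairs of edges that cross when nodes sit on a circle in `order`.

    Two chords cross exactly when their endpoints interleave along the circle.
    """
    position = {node: i for i, node in enumerate(order)}
    chords = [tuple(sorted((position[u], position[v]))) for u, v in edges]
    crossings = 0
    for index, (a, b) in enumerate(chords):
        for c, d in chords[index + 1:]:
            if a < c < b < d or c < a < d < b:
                crossings += 1
    return crossings


# Hypothetical four-node graph with two edges.
edges = [(0, 2), (1, 3)]
layout = circular_layout([0, 1, 2, 3])
crossings_original = count_crossings([0, 1, 2, 3], edges)
crossings_reordered = count_crossings([0, 2, 1, 3], edges)
```

In this toy case, swapping two nodes in the circular order removes the only crossing, which is the kind of local improvement a crossing-reduction procedure searches for.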
35. ouble-clicking the widget brings up a dialog with the widget's settings and, depending upon the widget type, its results. Some widgets from the scheme are shown in Figure 1(b).

The power of Orange Canvas is its interactivity. Any change in a single widget (loading another data set, changing the filter, modifying the logistic regression parameters) instantly propagates down the scheme, unless the propagation is explicitly disabled or the change needs a confirmation by the user.

We can add a Scatter Plot widget to the above scheme and give it the filtered Iris data from Select Data (Figure 2). Then we connect the Test Learners widget to a Confusion Matrix that shows how many data instances from each class (corresponding to rows) were

[Figure 1(a): An example of Orange Canvas scheme.]
36. oximation of the classical MDS algorithm where k pivots (columns) are randomly selected to reduce the distance matrix size (Brandes 2007).

The latter three algorithms require information on preferred distances between the nodes, provided on a separate input slot. The last two algorithms place nodes according to the provided distances only, disregarding the network edges.

The user can specify the number of iterations of the optimization procedure, where applicable. The default is set such that the optimization is expected to take approximately five seconds. With a higher number of iterations the optimization takes longer, but the results are often considerably better. In practice, if the number of nodes is large, the most applicable algorithms are the Fruchterman-Reingold algorithms, with the Random placement used to reinitialize the optimization if it gets stuck in a local minimum. The user can also move individual nodes or groups of selected nodes after, or even during, the layout optimization.

Setting visual parameters. The widget allows the user to set a number of parameters (see, for instance, Figure 6(b)). The widget can print a label with the meta data beside each node. When the number of nodes is excessive, the user can reduce the clutter by reducing the number of shown variables or by printing out only the data about the node below the mouse pointer. Another option is to annotate only the marked nodes. It is also possible to print some data at ea
37. putation of selected network analytic tasks is parallelized to save time. In multi-core computers, one processor core is always left free to enhance the user experience.

3.4. Community detection in graphs

Two label propagation clustering algorithms (Raghavan, Albert, and Kumara 2007; Leung, Hui, Lio, and Crowcroft 2009) are implemented in the Net Clustering widget. As both algorithms are iterative, the user must set the number of iterations. The widget gets a network on the input and appends clustering results to the network data. Communities can then be explored in the Net Explorer widget.

[Figure 4, Net Analysis widget, graph-level indices: number of nodes 1239; number of edges 3963; average degree 6.3971; diameter 21; average shortest path length 8.3594; density 0.0052; graph clique number 7; graph number of cliques 1580; graph transitivity 0.3887; average clustering coefficient 0.4803.]
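The basic idea behind label propagation can be sketched as follows: every node starts with its own label and repeatedly adopts the label most common among its neighbors, for a user-set number of iterations. The plain Python 3 code below is a simplified illustration, not the widget's implementation. In particular, it visits nodes in a fixed order and breaks ties deterministically (keep the current label if it is among the most frequent, otherwise take the largest candidate), whereas the algorithm of Raghavan et al. visits nodes in random order and breaks ties at random. All names are assumptions made for this example.

```python
from collections import Counter


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def label_propagation(adjacency, iterations):
    """Simplified label propagation with deterministic tie-breaking."""
    labels = {node: node for node in adjacency}  # every node starts alone
    for _ in range(iterations):
        changed = False
        for node in sorted(adjacency):
            if not adjacency[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[n] for n in adjacency[node])
            best = max(counts.values())
            candidates = [label for label, count in counts.items() if count == best]
            if labels[node] in candidates:
                continue  # current label is already a majority label
            labels[node] = max(candidates)
            changed = True
        if not changed:
            break  # stable labeling reached before the iteration limit
    return labels


# Hypothetical test graph: two 4-cliques joined by a single bridge edge (3, 4).
clique_a = [(u, v) for u in range(4) for v in range(u + 1, 4)]
clique_b = [(u, v) for u in range(4, 8) for v in range(u + 1, 8)]
adjacency = build_adjacency(clique_a + clique_b + [(3, 4)])
labels = label_propagation(adjacency, iterations=10)
```

On this toy graph the two cliques end up with two different labels. The deterministic tie-breaking makes the run repeatable, at the cost of a bias toward particular labels that the randomized original avoids.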
38. [Figure, panel (a): Vicinity of Paul Simon in the entire graph. Net Explorer screenshots; the layout settings shown are method Fruchterman-Reingold with 500 iterations.]
39. tats Orange

    from operator import itemgetter
    from Orange.clustering import kmeans

    # Load the Last.fm network and the tag data for its artists.
    net = Orange.network.readwrite.read("lastfm.net")
    data = Orange.data.Table("lastfm_tags.tab")

    # Cluster the artists into 22 groups using the Manhattan distance.
    manhattan = Orange.distance.Manhattan
    kmeans = kmeans.Clustering(data, centroids=22, distance=manhattan)

    # Pair each artist's play count with its cluster, then find the maximum per cluster.
    plays_clust = [(row["plays"], clust) for row, clust in zip(data, kmeans.clusters)]
    maxs = [max(plys for plys, clust in plays_clust if clust == c) for c in range(22)]

[Figure 12: Correspondence between the influence and popularity within different genres. (a) Orange Canvas scheme for observing the popularity of network hubs; (b) network with marked hubs and node size set to popularity; (c) a scatter plot with marked hubs (x axis: genre, clusters C1 to C22; y axis: plays; point color: genre; point size: album count).]

    normalized = [plys / maxs[clust] for plys
40. the number of albums released and the year of publication of the first album (Spearman rank correlation is 0.70), so the different number of albums can be attributed to different ages of genres.

Popularity and influence. To demonstrate how the graph widget can be used to select data, we first plotted the artists in a scatter plot in which we separated the groups by their clusters, using the cluster ID (genre) for the x axis, while the y axis represents the number of times the artist was played on Last.fm (Figure 12). The colors represent different clusters and the size of points corresponds to the number of released albums. Then we used the Net

[Figure: network plots by period; legible panel labels: (b) before 1990, (c) before 1995, (d) before 2000.]
41. the number of connections, popularity (the number of plays) and musical genre by filtering the data and plotting it in the scatter plot. Finally, we demonstrated how a more advanced study can be done by the use of network analysis and other data mining widgets. Besides visual programming in the Orange Canvas, we have shown that the network analysis can also be done by calling the functions provided by the library directly from a script in Python.

Powerful exploratory tools naturally increase the danger of conducting inappropriate experimental procedures. The researcher's awareness of what such tools can and cannot do, and of how to test the hypotheses based on the data, is, however, a general problem in data mining.

Orange (http://orange.biolab.si/), the Orange Network add-on (http://bitbucket.org/biolab/orange-network/) and all required third-party libraries are available under the GNU GPL license for Microsoft Windows, Linux and Mac OS X. Networks and data sets that were used in the case study are included in the add-on.

Acknowledgments

We would like to thank the members of the Bioinformatics Laboratory at the Faculty of Computer and Information Science, Ljubljana, Slovenia, for their suggestions during development of the add-on, and in particular Lan Umek for his help in finding a suitable testing procedure for the first case study.

References

Ascher D, Dubois PF, Hinsen K, Hugunin J, Oliphant T 200
42. tools and their capabilities is provided in Table 1.

Most graph visualization and analysis tools provide a picture about the system behind the graph as a whole, e.g., Pajek, igraph (Csardi and Nepusz 2006) and statnet (Handcock, Hunter, Butts, Goodreau, and Morris 2008). Pajek is a package for the analysis of large networks, while statnet's specific focus is simulation of exponential random graph models and statistical analysis. Much less effort has been put into making the graph analysis software interactive by allowing local exploration of the graph, testing how the graph structure depends upon its construction parameters, letting the user extract data from subgraphs and use it for further analysis, and so on.

Table 1: An overview of the software for network analysis. Columns: Open source; Interactive UI; Scripting interface in Python. Rows: Pajek, NetMiner, NetworkX, Graphviz, igraph, statnet, Gephi, Network Workbench, Net Explorer. [Cell marks are not legible in the extracted text.]

For example, edges in most graphs are abstractions of numerical relations. In genetic networks, two genes are connected, for instance, if they are sufficiently co-expressed. Two journal papers are related if they share a sufficient number of keywords, and people in a social network can be connected for having enough shared interests or friends. We would wish for a tool in which changes in the graph construction parameters (e.g., connection thresholds) can be immedi
43. tree:", cas[1]
    print "K-nearest neighbor:", cas[2]

Using such scripts is only suitable for experts with enough programming skills, and does not allow for visual exploration and manipulation of the data.

Orange Canvas provides a graphical interface for these functions. Its basic ingredients are widgets. Each widget performs a basic task, such as reading the data from a file in one of the supported formats or from a database, showing the data in tabular form, plotting histograms and scatter plots, constructing various models and testing them, clustering the data, and so on.

Widgets can require data for performing their function and can output data as a result. The most common data type is a set of data instances; other types include models, model constructors and many others. Types are organized into a hierarchy, so a widget may require, for instance, a general prediction model or a specific one, such as logistic regression.

We set the data flow between the widgets by connecting them into a scheme: we put the widgets onto the canvas and connect their inputs and outputs. Figure 1(a) shows a scheme that performs the same procedure as the last code snippet. The File widget reads the data; Select Data selects instances of versicolor and virginica and gives them to the Test Learners. The latter also gets learning algorithms from Logistic Regression, Classification Tree and k-Nearest Neighbors, performs the cross-validation and shows the results. D
