Wikipedia data analysis
Table 1.1: Summary of the different dump files available for any Wikipedia language

pages-meta-history: complete wiki text and metadata for every change in the wiki. Format: XML. Compression: 7-zip and bzip2. MediaWiki DB tables: page, revision and text.
pages-logging: administrative and maintenance tasks. Format: XML. Compression: gzip. MediaWiki DB table: logging.
pages-meta-current: wiki text and metadata, current version of all pages. Format: XML. Compression: bz2. MediaWiki DB tables: page, revision and text.
stub-meta-history: metadata about changes in all pages. Format: XML. Compression: gzip. MediaWiki DB tables: page and revision.
stub-meta-current: metadata, current version of all pages. Format: XML. Compression: gzip. MediaWiki DB tables: page and revision.
user_groups: list of users with special privileges. Format: SQL. Compression: gzip. MediaWiki DB table: user_groups.
langlinks: links from a wiki page to versions in other languages. Format: SQL. Compression: gzip. MediaWiki DB table: langlinks.
externallinks: links from a wiki page to other pages outside Wikipedia. Format: SQL. Compression: gzip. MediaWiki DB table: externallinks.
categorylinks: category tags inserted in a wiki page. Format: SQL. Compression: gzip. MediaWiki DB table: categorylinks.
pagelinks: links to other wiki pages in the same language. Format: SQL. Compression: gzip. MediaWiki DB table: pagelinks.

http://www.mediawiki.org/wiki/Manual:Database_layout

1.7.2 Complete activity records: stub-meta and pages-meta

The two most popular types of dump files are stub-meta-history and pages-meta-history. These dump files contain information about every single change performed on any wiki page in a given Wikipedia language. In the W
6.1.4 Further reading

You can check similar approaches in a number of previous research works. In particular, Voss [TODO ref measuring Wikipedia], published in 2005, is one of the first and most inspiring papers on general metrics and trends in Wikipedia. Almeida et al. [TODO ref] also studied different metrics to explain the evolution of Wikipedia. Finally, Chi et al. [TODO ref singularity] presented a model to explain the plateau phase that many activity metrics in the English Wikipedia have entered. In my PhD dissertation [TODO ref] I undertake a broader analysis, spanning the top 10 Wikipedias according to the official article count by 2009.

6.2 The study of inequalities

6.2.1 Questions and goals

Contributions from registered users to Wikipedia are known to be highly unequal, with a small core of very active editors who perform a high proportion of all revisions [TODO ref]. In the 10 largest Wikipedias by number of articles this proportion may vary, but approximately 10% of the total number of users made 90% of all revisions in a given language. At a smaller scale, the purpose of this example is to explore the available tools in the ineq R package to study inequalities in the distribution of a certain statistic among the members of a population.

6.2.2 Required data and tools

In the WikiDAT folder inequality you will find two data files, revisions.RData and users.RData, which you can use for experimentation.
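To give a feeling of the computation involved, the following minimal R sketch loads one of these files and derives the Gini coefficient and Lorenz curve with the ineq package. The object name revisions and the column n_revisions are assumptions made only for this example; inspect the actual contents of the .RData files with ls() and str(), and adjust the names accordingly.

  # Minimal sketch: inequality in the number of revisions per registered user.
  # Assumes revisions.RData provides a data frame 'revisions' with a column
  # 'n_revisions' (one row per user); adjust names to the real structure.
  library(ineq)

  load("revisions.RData")    # check with ls() which objects were loaded
  x <- revisions$n_revisions

  Gini(x)                    # Gini coefficient: 0 = perfect equality, 1 = maximal inequality
  plot(Lc(x), main = "Lorenz curve of revisions per user")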
Program 4: Example of XML data stored in pages-logging dump (cont.)

  <logitem>
    <id>2</id>
    <timestamp>2004-12-23T07:30:05Z</timestamp>
    <contributor>
      <username>Netoholic</username>
      <id>515</id>
    </contributor>
    <comment>clearing MediaWiki namespace of legacy items (content was: '#redirect [[Wikipedia:MediaWiki namespace]]')</comment>
    <type>delete</type>
    <action>delete</action>
    <logtitle>MediaWiki:All messages</logtitle>
    <params xml:space="preserve" />
  </logitem>
  <logitem>
    <id>3</id>
    <timestamp>2004-12-23T07:30:18Z</timestamp>
    <contributor>
      <username>Netoholic</username>
      <id>515</id>
    </contributor>
    <comment>clearing MediaWiki namespace of legacy items (content was: '#redirect [[Wikipedia:MediaWiki namespace]]')</comment>
    <type>delete</type>
    <action>delete</action>
    <logtitle>MediaWiki:All system messages</logtitle>
    <params xml:space="preserve" />
  </logitem>
  <logitem>
    <id>4</id>
    <timestamp>2004-12-23T07:32:41Z</timestamp>
    <contributor>
      <username>Netoholic</username>
      <id>515</id>
    </contributor>
    <comment>clearing MediaWiki namespace of legacy items (content was: '#redirect [[Template:CompactTOC]]')</comment>
    <type>delete</type>
Nowadays, data analysts can choose among a wide variety of open source tools and programming languages to implement their studies. In fact, open source tools for data analysis and high performance computing are quickly becoming the preferred solution for many practitioners, scholars and professionals in this field. In this chapter I recap some essential open source tools that you can use to accomplish this endeavour. For sure, this list is far from complete, and I have only focused on the tools integrated in WikiDAT, which we will revisit later when we examine the practical case examples for Wikipedia data analysis.

Furthermore, I have left aside any tools linked with distributed computing solutions, such as Hadoop, based on the map-reduce programming paradigm, or some of its associated projects, such as Pig, Cassandra, Hive or Mahout, to cite but a few instances. If you have read the previous chapters of this document, you already know that I support the thesis that these tools, despite their increasing popularity, come with a high cost in the form of a steep learning curve: time and effort to effectively parallelize complex tasks. Actually, some associated Hadoop projects like Pig or Mahout try to alleviate this problem, providing additional abstraction layers to hide the inherent complexity of programming map-reduce processes. Hence, we will not focus on this kind of tools at this time, although they may be eventually introduced and described in later versions of this document.
The R script inequalities.R makes use of some tools, like the Gini coefficient or the Lorenz curve, to analyze the inequalities in the distribution of revisions made by wikipedians. In this case, as the example data files have a small size, the inequality level is much less extreme than in the case of analyzing the whole population for one of the big Wikipedias. This is also due to the disproportionately high number of casual contributors that we find, with respect to the short list of users included in the very active core of editors.

6.2.3 Conducting the analysis

In this case, you only need to load the file inequalities.R in RStudio and press Source to create the numerical results and graphs.

6.2.4 Further reading

In 2008 I published with Jesús González-Barahona and Gregorio Robles a paper in the HICSS conference [TODO ref] analyzing the evolution of the Gini coefficient for the top 10 Wikipedias by number of articles at that time. In that study, instead of evaluating the aggregate inequality level, we studied the inequality per month, that is, over the total number of contributions per month instead of over the total number of contributions in the whole history of each language. In my PhD dissertation [TODO ref] I further expand the conclusions of this study.

6.3 The study of logging actions

6.3.1 Questions and goals

Whilst the pages-meta-history dump files are by far the most popular of all Wikipedia database dumps, as we have seen, there exist many more alternative data sources that we can use for our studies.
2.3 Operating system and file system support

Depending on the scale of your analysis, that is, whether you work at the macro level with aggregate scores, or at the microscopic level with high resolution data, you may find some surprises while preparing for your data analysis. For macroscopic studies, one usually works with tables containing metrics and aggregated scores that have been computed in the data preparation phase, so your tables and data source files are not going to be very large. In fact, it is perfectly possible to run macroscopic analyses, even for the English Wikipedia, on a modern multi-core laptop with 4GB of RAM and an SSD with good performance.

However, high resolution analysis is another story. If you want to apply NLP techniques over the whole set of words in the wiki text of a certain Wikipedia language, you will probably have to work with the decompressed data file, which can take several TB of storage space, especially for the largest languages. The same holds, for example, for analyzing the changes in the content of articles, computing authorship information and tracing these changes over time for the whole set of encyclopedic entries in a given language. In all these cases, in which there are clear reasons
[Screenshot text omitted: StatMediaWiki home site at the RedIRIS forge, noting that results can be exported as static HTML pages or a CSV file for custom processing, and that StatMediaWiki is free software under the GPL v3 or higher license, with some functionalities adapted from StatSVN.]

Figure 3.2: Project page for StatMediaWiki at RedIRIS forge

WikiEvidens can be interpreted as the evolution of the previous tool, intended to provide statistical and visualization software for wikis. It is still in alpha stage.

3.4 Interacting with the MediaWiki API

Several libraries and applications have been developed to interact with the MediaWiki API, to automate certain tasks in Wikipedia, such as developing bots to perform regular maintenance duties. These tools can also be very useful for researchers, since they provide convenient wrappers to access the API from popular programming languages like Python.

3.4.1 Pywikipediabot

Pywikipediabot is a Python library created to facilitate the development of bots for Wikipedia. As such, it is ideal to access the functionalities of the MediaWiki API for any Wikipedia language from Python. It is also very easy to learn, and numerous examples of existing bots can be found to learn from their code. However, you should take
[Screenshot text omitted: WikiEvidens project page at Google Code; the author can be reached at emijrp@gmail.com, and bugs can be reported through the issues section.]

Figure 3.3: Project page for WikiEvidens at Google Code

…administrators. Only bots that have been officially approved, and get that status in the system, can perform such actions at full speed.

3.4.2 Python-wikitools

A second option for accessing the MediaWiki API from Python is python-wikitools. The tool can be downloaded from the project page hosted at Google Code, or alternatively through PyPI, the official Python packages repository.

3.4.3 Mwclient

Finally, you can also check mwclient, a framework for accessing the MediaWiki API in Python, written by Bryan Tong Minh for personal use in his own bots. According to the information from the README file in the latest version available (0.6.5):

Mwclient is a client to the MediaWiki API <http://mediawiki.org/wiki/API> and allows access to almost all implemented API functions. Mwclient requires Python 2.4. This version supports MediaWiki 1.11 and above. However, for functions not available in the current MediaWiki, a MediaWikiVersionError is raised.

A minimal usage sketch appears at the end of this section.

http://code.google.com/p/python-wikitools
http://sourceforge.net/projects/mwclient

[Screenshot text omitted: the Manual:Pywikipediabot page at mediawiki.org (shortcut PWB): "The Python Wikipediabot Framework (pywikipedia or PyWikipediaBot) is a collection of tools that automate work on M…"]
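The promised sketch, assuming the mwclient API of the 0.6.x era; method names have changed across versions (newer releases expose the text through page.text(), for instance), so check the README of the version you install:

  # Sketch: fetching the wiki text of a page through mwclient (Python 2.x).
  # The calls below follow the classic mwclient 0.6.x idiom; adapt to your version.
  import mwclient

  site = mwclient.Site('simple.wikipedia.org')   # connect to Simple English Wikipedia
  page = site.Pages['Wikipedia']                 # lazy page object
  wikitext = page.edit()                         # in mwclient 0.6.x, edit() returns the wiki text
  print wikitext[:200]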
…organizations around the world that offer storage space and bandwidth pro bono, to help balancing the load. These dump files usually have two possible formats for data representation:

• XML files. XML is the preferred format for large dump files, such as those storing information about changes in wiki pages, or administrative and maintenance actions performed in the wiki. XML presents a neutral format to export this information and recover it locally, in another wiki system (for example, to build a replica of Wikipedia in a given language), or in any other storage system.

• SQL files. SQL files are usually published for small or medium size dump files, like external or internal links in wiki pages, the list of redirects, or the list of special users and their associated privileges.

Since these dump files can consume a lot of storage capacity in the servers, it is a common practice to publish compressed versions of these files, using different algorithms according to the compression rate required to reduce the size of the files within manageable limits. On one hand, very large dump files, like the ones containing all changes recorded in large Wikipedia languages (German, French, etc.), are compressed in 7-zip or bzip2 format. On the other hand, smaller dump files are compressed in gzip format, since it can be decompressed at a faster rate, and we do not need the same high compression ratio as in the larger dumps.

https://www.mediawiki.org/wiki/API:Query
https://www.mediawiki.org/wiki/API:Etiquette
http://toolserver.org
http://toolserver.org/~tparis/articleinfo/index.php
http://dumps.wikimedia.org
…we cannot use this information to identify individual anonymous editors accurately. This involves some technical details about the inner features of Internet communication protocols, but suffice to say that many (really, many) computers can share the same IP address, from the point of view of the Wikipedia system receiving incoming requests. Outgoing connections can pass through proxies and other traffic filtering machines, thus showing up in the destination with the same IP address. As an example, consider the case in which Wikipedia inadvertently blocked an IP address that was performing vandal actions, banning de facto Wikipedia editing from the whole country of Qatar, in which all outgoing connections go through the same Internet Service Provider.

https://en.wikinews.org/wiki/Qatari_proxy_IP_address_temporarily_blocked_on_Wikipedia

• Bots and extremely active editors. Sometimes our study focuses on the activity of Wikipedia editors, and one of the common metrics found in the research literature is the number of revisions per editor. In this case, an initial step should always be eliding all revisions performed by bots: special programs undertaking routine operations, sometimes at a very fast pace. For this, we can use the information in the user_groups table, looking for users in the group bot. However, we must also take some caution not to confuse these bots with extremely active users, some of which can be incredibly prolific regarding the total number of revisions that they have contributed to Wikipedia.
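A sketch of such a query, using only the standard MediaWiki table and column names (revision; user_groups with ug_user and ug_group) rather than any WikiDAT-specific schema:

  -- Revisions per registered editor, excluding accounts flagged as bots
  -- in the user_groups table.
  SELECT rev_user, COUNT(*) AS num_revisions
  FROM revision
  WHERE rev_user > 0
    AND rev_user NOT IN (SELECT ug_user FROM user_groups WHERE ug_group = 'bot')
  GROUP BY rev_user
  ORDER BY num_revisions DESC;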
[…fragment of escaped wiki text from the previous example omitted: …&lt;div style=&quot;float:right…&lt;/small&gt;&lt;/div&gt;&lt;/td&gt;… 4. Juni 2008…]
      </text>
    </revision>
  </page>
</mediawiki>

Program 3: Example of XML data stored in pages-logging dump

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.6/ http://www.mediawiki.org/xml/export-0.6.xsd"
           version="0.6" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://simple.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.20wmf4</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace
[Screenshot text omitted: Italian Wikipedia article on Barack Obama (Nobel Peace Prize and family sections), as displayed through WikiTrust.]

Figure 3.6: Snapshot of the WikiTr
…could also use the rev_parent_id to solve this problem.

As a result, we cannot expect to find two revisions for the same page with the same timestamp.

Now we turn to the same study, but this time focusing on registered users instead of wiki pages. For each user, we want to calculate the time interval between consecutive revisions. We can proceed in the same way, but this time our approach will fail. Why? Well, if we carefully inspect the entries, we will find that for some users there are two, three or even more revisions sharing the same timestamp. How is that possible, anyway? The answer can be found in the values of the rev_is_redirect field. Whenever a user creates a new redirect for a wiki page, two or more entries (depending on how many redirects are created) are generated in the database, all sharing the same timestamp: one for a revision concerning the original page for which the new redirect is created, and one for every new redirect page that has been created. Therefore, we must impose additional filtering, for example grouping by user and timestamp, to ensure that our data preparation process works.

In summary, you should always perform sanity checks of intermediate results, including descriptive graphs, to make sure that your scores make sense, and that no hidden problems can jeopardize the results and conclusions of your analysis.

Chapter 5

Open source tools for data analysis
In the end, the most precious weapon to overcome these hidden obstacles is learning as many details as possible about your data set and its generation process. As an example, I am going to share with you some insights from a recent event history analysis on Wikipedia data. For this type of analysis, the researcher is usually interested in measuring time intervals between consecutive states of interest in different subjects. In the process of preparing the data tables to perform this analysis, I found useful details about the way in which the MediaWiki software records revisions in the database.

Suppose that we want to measure time intervals between consecutive revisions performed on the same Wikipedia page. To calculate this automatically, we can confront two tables: the first with all revisions for that page, ordered by their timestamp, except for the last one; the second with all revisions for that page, ordered in the same way, but this time removing the first one. In this way, we are facing each revision with the next one for this page. This procedure will work whenever we can guarantee that every revision for the same wiki page has a different timestamp. Is this the case? The answer is: yes, it is. The MediaWiki software does not allow concurrent editing of wiki pages, unlike GoogleDocs or similar technologies for online collaborative editing.

http://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits
http://stats.wikimedia.org
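In SQL, the two-table trick can be written as a self-join. The following sketch assumes the standard MediaWiki revision table (rev_id, rev_page, rev_timestamp) and MySQL date functions, so adapt it to the actual schema produced by your extraction tool:

  -- Time elapsed between each revision of a page and the next one,
  -- pairing every revision with the earliest later revision of the same page.
  SELECT a.rev_id,
         TIMESTAMPDIFF(SECOND, a.rev_timestamp, b.rev_timestamp) AS secs_to_next
  FROM revision a
  JOIN revision b ON b.rev_page = a.rev_page
  WHERE a.rev_page = 12345   -- identifier of the page under study
    AND b.rev_timestamp = (SELECT MIN(c.rev_timestamp)
                           FROM revision c
                           WHERE c.rev_page = a.rev_page
                             AND c.rev_timestamp > a.rev_timestamp);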
…do not exist in the wiki any more (well, unless they are restored again in the future; funny enough, right?).

https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

1.7.1 Types of dump files

If you visit the Wikimedia Downloads center and click on database backup dumps, you will arrive at a page listing the progress of the process to generate all dump files for every Wikimedia wiki. Clicking on the link with the name of the project will lead you to the page summarizing all available dump files for it, along with the links to download them. The project code should be easy to interpret. Any link of the form XXwiki, where XX is a 2-letter (sometimes 3-letter) ISO code representing the language, identifies the Wikipedia in that language. Therefore, eswiki links to the dump files of Wikipedia in Spanish, frwiki to the dumps of the French Wikipedia, and so on.

As an example, the filename frwiki-20120601-pages-meta-history.xml.7z tells us that the dump file belongs to the ensemble for the French Wikipedia, the date of creation of the file, the type (in this case, complete wiki text and metadata for every change in the wiki), the format (XML), and the compression algorithm (7-zip).

Table 1.1 summarizes many of the available dump files for the case of Wikipedia projects. You can refer to the MediaWiki DB schema to learn more about the content included in each DB table.
Quite an interesting target can be the dump file storing all logged actions recorded in the database for every Wikipedia, including actions such as blocking users, protection of pages, and other administrative and maintenance tasks.

6.3.2 Required data and tools

In this case, we parse the dump file for the Simple English Wikipedia (simplewiki) to demonstrate some of the utilities in R to represent longitudinal data (data points over time) and to build very simple models to analyze trends in these data. No special tools are required for the analysis in R, since all methods included in the script come with the standard installation.

6.3.3 Conducting the analysis

In the first place, we need to parse the logging table for simplewiki, using the tools provided in WikiDAT. Alternatively, you can download a compressed SQL file, ready to be loaded in MySQL, from http://gsyc.es/~jfelipe/WPAC2012/simplewiki_logging_062012.sql.gz. Then load and execute the R file logging.R in the logging directory of WikiDAT.
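Before running the script, you can verify that the data is reachable from R. A minimal sketch with RMySQL, assuming the dump was loaded into a local database called simplewiki, and using the standard column names of the MediaWiki logging table:

  # Sketch: pull the logged actions into R for exploration (adjust credentials
  # and database name to your local setup).
  library(RMySQL)

  con <- dbConnect(MySQL(), user = "root", password = "password",
                   dbname = "simplewiki")
  log_actions <- dbGetQuery(con,
      "SELECT log_type, log_action, log_timestamp FROM logging")
  dbDisconnect(con)

  # Quick look at the distribution of action types over the whole history
  table(log_actions$log_type)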
In the case of the English Wikipedia, the single dump file that was formerly produced, containing all wiki text and metadata about changes recorded in the database, has been conveniently split into multiple pieces or chunks. Each of these chunks or slices stores information about all changes performed in a range of wiki pages. The limits of the range, in the form of the unique numerical identifiers assigned to every page, are indicated in the file name. Other large dumps, like those containing the abstracts of articles, or those including only metadata about changes in the wiki (without the wiki text of each version), are also split in a similar fashion.

An important advantage of retrieving information from these dump files is that researchers have maximum flexibility as for the type and format of the information they want to obtain. As we will see in the next chapter, this also means that researchers can search for useful metadata present in the wiki text that was not originally included in the fields of the database. An example of this could be searching for special tags that mark quality content in Wikipedia (templates for assessing the status of articles, verifiability, or disputed neutrality), or tracking the evolution of links and references inserted in an article over time. Of course, wh
http://en.wikipedia.org/wiki/GUID_Partition_Table
http://en.wikipedia.org/wiki/Logical_volume_management

In conclusion, I offer in Table 2.2 a list of some of the most important parameters to which we must pay attention in MyISAM. Nevertheless, since InnoDB is the future, this section will be expanded in later versions of this document to include useful tips for InnoDB. You can check the whole list of variables and their meaning in the online manuals for MyISAM and InnoDB. I also strongly recommend the very good book by Schwartz et al. [TODO ref] to get additional insights and many useful tips to configure your server.

Table 2.2: MyISAM server variables with high influence on the performance of the data extraction and analysis process

Variable: key_buffer
Purpose: memory space to load keys that speed up search and ordering operations.
Recommendations: make it as large as possible, without consuming more than 50% of the total available memory if you run the analysis (Python, R) in the same machine, or 75% in a dedicated database server.

Variable: max_allowed_packet and bulk_insert_buffer_size
Purpose: control the size of packets for information insertion in the database.
Recommendations: make them large enough to avoid hindering the data insertion; 256M or 512M is usually enough.

Variable: sort_buffer_size
Purpose: size of the buffer used to sort out values in queries.
Recommendations: in data analysis, we usually sort result values using ORDER BY; if you can allocate some memory (1 or 2 GB) for this purpose, it can speed up this type of queries.
The main argument against this approach is that, if you try to visit too many pages at a very fast pace to retrieve their content, high chances are that you get banned by Wikipedia's system (on the premise of creating excessive traffic load). Moreover, as we are going to learn in the next sections, there are many alternative ways providing you much more information, without the downside of overloading the Wikipedia side with a zillion requests.

1.5 MediaWiki API

A very useful data retrieval channel for real-time request lovers is the MediaWiki API, available for most (if not all) of the Wikipedia languages. Like many other data retrieval APIs, it offers a structured syntax to query the Wikipedia system for the data we are looking for. Requests take the form of HTTP GET requests, accepting numerous possible parameters to refine the terms of our petition to the system. A big advantage of using the MediaWiki API is that it also offers a format parameter that lets consumers control the format in which the answer is represented. Available output formats include the most important data representation standards, like JSON, WDDX, XML, YAML or the PHP native serialization format.

http://stats.grok.se
http://meta.wikimedia.org/wiki/Research_Committee
By no means I meant with this that reading Antonio's thesis is a boring activity. He did quite a remarkable work, and it is written in a clear style. All the same, reading any PhD dissertation is always a daunting task, as you are often exposed to far more information and fine-grained details than you usually need. May the Force be with you.
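As a quick illustration of the request syntax and the format parameter, the sketch below queries the API for basic page metadata in JSON, using only the Python 2.x standard library; the parameter set shown is a typical minimal query, and the API manual documents the full list:

  # Sketch: a minimal MediaWiki API query returning JSON (Python 2.x).
  import json
  import urllib
  import urllib2

  params = urllib.urlencode({
      'action': 'query',          # generic query module
      'titles': 'Data analysis',  # page(s) we ask about
      'prop':   'info',           # basic page metadata
      'format': 'json',           # output format: json, xml, yaml, wddx, php...
  })
  url = 'http://en.wikipedia.org/w/api.php?' + params
  response = urllib2.urlopen(url)
  data = json.loads(response.read())
  print data['query']['pages']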
key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
    </namespaces>
  </siteinfo>
  <logitem>
    <id>1</id>
    <timestamp>2004-12-23T07:28:37Z</timestamp>
    <contributor>
      <username>Netoholic</username>
      <id>515</id>
    </contributor>
    <comment>clearing MediaWiki namespace of legacy items (content was: '#redirect [[Template:1911]]')</comment>
    <type>delete</type>
    <action>delete</action>
    <logtitle>MediaWiki:1911</logtitle>
    <params xml:space="preserve" />
  </logitem>
Once the data is stored locally, we should undertake some sanity checks to ensure that the data has been retrieved properly. Preliminary EDA (Exploratory Data Analysis), including graphics summarizing important data traits, should follow. Extra caution must be taken to identify possible missing values, or any other odd or extreme values that may alter the results of our analysis. Finally, we prepare the data to obtain any intermediate results needed for our study, undertake the analysis, model building and refinement, and interpret the conclusions. This process may be iterative, in case we discover additional insights that could help us improve the model, although we must also pay attention to avoid relying too much on a single data set to build our model, since we could lose generality. In this chapter, I offer some tips and advice that can help you avoid common pitfalls while retrieving and preparing your Wikipedia data.

4.1 The big picture

As I have commented previously, no less than 75% of the total time for data analysis is consumed in the data retrieval and data preparation part. In the case of Wikipedia data analysis, this may increase up to 80-85%, depending on the actual type of analysis that we want to conduct. For studies at the macroscopic level, the proportion of time for data preparation will be closer to the lower bound. However, for fine-grained or high resolution analysis, involving huge data sets and possibly multiple large data sets in longitudinal studies, the scenario can be very complex.
…part, dealing with data retrieval and preparation. It also presents an overview of useful existing tools and frameworks to facilitate this process. The second part includes a description of a general methodology to undertake quantitative analysis with Wikipedia data, and open source tools that can serve as building blocks for researchers to implement their own analysis process. Finally, the last part presents some practical examples of quantitative analyses with Wikipedia data. In each of these examples, the main objectives are first introduced, followed by the data and tools required to conduct the analysis. After this, the implementation and results of the analysis are described, including code in SQL, Python and R to carry out all necessary actions. All examples conclude with pointers to further references and previous works, for readers interested in learning more details.

I hope that this guide can help some researchers to enter the fascinating world of Wikipedia research and data analysis, getting to know available data sources and practical methodologies, so that they can apply their valuable expertise and knowledge to improve our understanding of how Wikipedia works.

Contents

Part I: Preparing for Wikipedia data analysis

1 Understanding Wikipedia data sources
  1.1 Public data sources
  1.2 Different languages, different com
…r-base and recommended packages in Debian-like systems (package r-recommended).

http://dev.mysql.com/doc/refman/5.5/en/index.html
http://dev.mysql.com/doc/refman/5.5/en/installing.html
http://www.percona.com/software
http://www.postgresql.org
http://wiki.postgresql.org/wiki/Python
http://www.r-project.org
http://www.r-project.org/contributors.html
http://www.r-project.org/foundation/main.html
http://cran.r-project.org/mirrors.html

There are binary files available for all major distributions, including RedHat, Suse, Debian and Ubuntu. The primary way to interact with R in GNU/Linux is to simply type R in a terminal; write q() and press Enter to exit:

jfelipe@blackstorm:~$ R

R version 2.15.0 (2012-03-30)
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

• For Window
…status of the system (traffic data), while activity data will refer to actions actually modifying the status of the system. We are not going to cover traffic data in this guide. The main reason is that only some of this data (namely, for page views) is publicly available. However, in case you are interested in obtaining a sampling of all possible actions in Wikipedia traffic data, you must contact the Wikipedia Research Committee directly, to ask for a data stream containing that sample. This data stream will be in all cases conveniently anonymized, to elide all private information about users performing requests to the system. In case you are interested in learning more about the type of analyses that we can conduct with this data, I recommend you to grab a good cup of your favourite stimulating beverage and take a look at the doctoral thesis written by my colleague Antonio J. Reinoso [TODO citation] about Wikipedia traffic.

1.4 Web content

OK, so finally we start to describe some available sources to retrieve Wikipedia data. The first (and probably most obvious) one is to obtain this data from the information displayed in our browser when we request any Wikipedia page. Some people could think that retrieving information about Wikipedia articles in this way (sometimes known as web scraping) is a good procedure to accomplish this. I certainly discourage you from following this path, unless you have very strong and good reasons supporting it.
The bad news is that configuring MySQL (as well as any other database engine) is not an easy task. In fact, another complication is that the optimal configuration for one machine can be perfectly useless in a different computer. The net result is that there is no magic configuration or silver bullet that can be shared across multiple servers. In fact, MySQL configuration is a multifaceted issue that deserves a whole book on its own [TODO ref High Performance MySQL, 3rd edition]. I cannot stress enough the importance of getting a deep knowledge of your database engine if you want to conduct serious data analysis.

Another curious aspect is that, so far, for most of my applications I have been using the MyISAM engine, which has now become deprecated. Its replacement, InnoDB, is without a doubt a more advanced engine for many business applications. However, for the specific needs of data analysis, these advantages do not stand out so clearly. For example, in data analysis we frequently create data tables first, and we only perform read operations thereafter. Concurrent updates (writes), which are common in business applications, are very scarce in data analysis. Moreover, counting queries usually perform much faster in MyISAM than in InnoDB. Finally, InnoDB presents a much longer list of possible parameters for server fine-tuning, which we need to carefully study and test.

http://en.wikipedia.org/wiki/Comparison_of_file_systems
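To give an idea of what this tuning looks like in practice, the following MySQL sketch inspects and adjusts, at runtime, a couple of the variables discussed in this chapter; the values are purely illustrative, and persistent settings belong in the my.cnf configuration file:

  -- Inspect current buffer-related settings
  SHOW VARIABLES LIKE '%buffer%';

  -- Illustrative values only: size them according to your available RAM
  SET GLOBAL key_buffer_size = 2147483648;   -- 2 GB key cache for MyISAM
  SET GLOBAL sort_buffer_size = 268435456;   -- 256 MB per-connection sort buffer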
http://en.wikipedia.org/wiki/Web_scraping
http://www.mediawiki.org/wiki/API:Main_page

This is also a good way to query the MediaWiki database for obtaining answers to precise questions, or subsets of data matching certain conditions. I am not covering yet the many data fields available in the database server, since they will be described in the section about the dump files later on. On the other side, we must keep in mind that we should not push this data retrieval channel very hard, especially since some of the queries must be expensive for Wikipedia servers to attend, in terms of system resources. Admins reserve the right to ban any user that may abuse this service. Thus, this is a very flexible and convenient way to perform concrete questions to the system, when we know in advance that the answers should not return very large data sets. It can also be very useful for software applications that need to query the Wikipedia system in almost real time. However, if you are interested in retrieving the complete activity history of the whole English Wikipedia, arm yourself with a high dose of patience, or better, try one of the other available alternatives.

1.6 The toolserver

The Wikimedia Toolserver is a cluster of servers operated by Wikimedia Deutschland, one of the chapters of
[Chart text omitted: "Active Wikimedia Editors for All Wikimedia Projects" (78,519 active editors; 171 edits per month), with per-project figures: Total 79.8K, English 36.4K, Commons 6.2K, French 4.8K, German 5.4K, Russian 4.0K, Japanese 4.0K, Spanish 2.9K, Italian 2.8K, Chinese 1.8K. Note (9 Aug 2010): editors on Commons are no longer included in the overall total, on the assumption that most also edit on one or more other projects. More precise detection of double counts between any projects and languages is in development, using Single User Login registration.]

Figure 3.1: Example of the new report cards produced by WMF

3.3 StatMediaWiki and WikiEvidens

StatMediaWiki is another tool to calculate metrics and statistics about the evolution of any MediaWiki-powered site, created by Emilio José Rodríguez (Emijrp in Wikipedia). It also exports this pre-processed information for later consumption, either as HTML pages or CSV data files. The code is licensed under the GNU GPLv3.

http://stats.wikimedia.org/index.html#fragment-14
http://reportcard.wmflabs.org
https://github.com/wikimedia/limn
http://statmediawiki.forja.rediris.es/index_en.html

[Screenshot text omitted: StatMediaWiki home site at the Oficina de Software Libre, Universidad de Cádiz: "StatMediaWiki is a project that aims to create a tool to collect and aggregate information available in a MediaWiki installation. Results are static HTML pages, including tables and graphics, that can help to analyze the wiki status and development."]
http://www.python.org/getit
http://mysql-python.sourceforge.net/MySQLdb.html
http://sourceforge.net/projects/mysql-python
http://numpy.scipy.org
http://www.scipy.org
http://matplotlib.sourceforge.net
http://scikit-learn.org/stable

Future versions of WikiDAT will use this library to illustrate the application of machine learning techniques on Wikipedia data, along with implementations in the R programming language, which we are introducing very soon.

5.4 Database engine: MySQL

MySQL is a lightweight but powerful relational database engine. The MySQL manual explains in detail how to get and install MySQL in multiple operating systems and platforms. In Debian and Ubuntu GNU/Linux, you can just install the packages mysql-client and mysql-server. In the installation process, you will be prompted to introduce a password for the root user of MySQL. Please take your time to select a password that you can remember later on.

Of course, MySQL is not the only open source database engine available for these purposes. After the acquisition of Sun Microsystems by Oracle Corporation, some companies have started to offer alternative storage engines and services bas
Wikipedia data analysis: Introduction and practical examples

Felipe Ortega

June 29, 2012

Abstract

This document offers a gentle and accessible introduction to conducting quantitative analysis with Wikipedia data. The large size of many Wikipedia communities, the fact that for many languages they already account for more than a decade of activity, and the precise level of detail of the records accounting for this activity, represent an unparalleled opportunity for researchers to conduct interesting studies for a wide variety of scientific disciplines.

The focus of this introductory guide is on data retrieval and data preparation. Many research works have explained in detail numerous examples of Wikipedia data analysis, including numerical and graphical results. However, in many cases very little attention is paid to explaining the data sources that were used in these studies, how these data were prepared before the analysis, and the additional information that is also available to extend the analysis in future studies. This is rather unfortunate since, in general, data retrieval and data preparation usually consume no less than 75% of the whole time devoted to quantitative analysis, especially with very large datasets.

As a result, the main goal of this document is to fill in this gap, providing detailed descriptions of available data sources in Wikipedia, and practical methods and tools to prepare these data for the analysis. This is the focus of the first
Variable: myisam_sort_buffer_size
Purpose: besides the previous one, MyISAM has its own buffer to sort results.
Recommendations: same as before.

Variable: read_buffer_size and read_rnd_buffer_size
Purpose: control the size of the buffers used to load data from disk.
Recommendations: larger buffers will let you read data from disk faster.

Before closing this section and the chapter, I would like to mention two important features of InnoDB that do have an impact on data analysis. The first aspect is that, by default, InnoDB stores all data, and the keys that speed up search operations, in a single file that can grow up to a huge size. Besides, in case you delete some of the tables or entire databases, the space consumed by this file is never claimed back, preventing us from freeing storage capacity in our system. This can be a problem in platforms with high computational power but limited storage capacity. The solution is to configure the innodb_file_per_table server variable. At least in this way, we can perform some operations to free up some storage space.

The second aspect is that we can set up different tablespaces, to place different tables in different directories, maybe in different devices (such as distinct SSDs). This can be an important factor to speed up the analysis, running different processes that target their own table in different devices. This is a kind of low-level process pa
…each revision, which is omitted. As a result, this can speed up significantly the parsing process to recover all information, if we are only interested in metadata about pages and revisions. Nonetheless, in case we do have any interest in tracing particular tags or types of content in the wiki text, we must use the complete history dump.

You can refer to the current version of the page and revision tables in the MediaWiki database to learn about the meaning of the fields included in these dumps. For many languages, the rev_len field (length of revision in bytes) and the new rev_sha1 field (a hash of the text of every revision using the SHA-1 algorithm) are not available yet. However, we will see that some data extraction tools can compute and store this information locally while parsing the dump file.

1.7.3 User groups

The user_groups dump contains information about special privileges assigned to certain users in the wiki. This is an exact dump, in SQL format, of the user_groups MediaWiki table. In general, the number of different groups (or privilege levels) associated to user profiles will depend on the language. In big languages, we will find most, if not all, of th

http://en.wikipedia.org/wiki/Wikipedia:Namespace
http://www.mediawiki.org/wiki/Manual:Page_table
http://www.mediawiki.org/wiki/Manual:Revision_table
http://en.wikipedia.org/wiki/SHA-1
http://www.mediawiki.org/wiki/Manual:User_groups_table
…ails about them.

http://wikilit.referata.com/wiki/Main_Page
http://wikidashboard.appspot.com

3.2 Wikistats and report cards

Wikistats is a website presenting multiple statistics about the overall evolution of all Wikimedia projects. The analysis is powered by a set of Perl scripts developed by Erik Zachte, data analyst at the Wikimedia Foundation. The code of these scripts is available for download from the same website. This site was the first to provide a comprehensive overview of all Wikimedia projects, and it has evolved over time to improve the display of statistical results in a more meaningful way. From Wikistats, visitors can also download CSV files with all the data generated by the extraction scripts, ready for use.

In this regard, a new project has been launched by the Wikimedia Foundation, with the backup of the new Wikimedia Labs branch, to offer monthly report cards that present similar information in a more interactive and appealing fashion, using new technologies such as JavaScript and HTML5. According to the information on that page, the source code for creating these report cards (Limn) will be available very soon under an open source license, as usual with all projects developed by WMF.
…and also for complex patterns. For example, bear in mind that the template syntax for marking featured articles in languages other than English can be translated into the local language, and you can find yourself composing patterns to match these tags in Hebrew, Farsi or Russian. For this, the use of UTF-8 codification can save us from getting a terrible regular expression headache. In the parser code included in WikiDAT [TODO ref] you can find examples of the patterns used to match the featured article tag in 39 Wikipedia languages, for a recent study that I have implemented with other colleagues, and which is currently under review.

http://en.wikipedia.org/wiki/Wikipedia:Featured_articles
http://en.wikipedia.org/wiki/Regular_expression

In conclusion, my point here is: you can use certain tools, like regular expressions, to look for interesting patterns in the text of revisions. But you must always remember that this comes with a high price: computational time. In particular, if you are thinking about implementing some type of fine-grained NLP analysis (for example, to apply sentiment analysis on the wiki text of all revisions of the English Wikipedia), then you should certainly think about distributed computing alternatives, where terms like cluster, Hadoop, map-reduce, NoSQL and other buddies come into place. All the same, if you are just interested in tracking a small subset of tags, like in our featured articles example, but also for good articles, pages with disputed neutrality, the introduction of templates to call for more references, and many other cases, the pattern matching approach may well fit your needs in a reasonable execution time.
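To make the idea concrete, here is a minimal sketch of this kind of pattern matching in Python; the pattern only covers the English tag {{featured article}}, and real parsers (like the ones in WikiDAT) keep one localized pattern per language:

  # -*- coding: utf-8 -*-
  # Sketch: detecting the featured-article template in the wiki text of a revision.
  import re

  # Case-insensitive match of {{featured article}}, tolerating extra whitespace.
  FA_TAG = re.compile(r'\{\{\s*featured[ _]article\s*\}\}', re.IGNORECASE | re.UNICODE)

  def is_featured(wikitext):
      # wikitext must be a unicode object decoded from UTF-8
      return FA_TAG.search(wikitext) is not None

  print is_featured(u"...article text... {{Featured article}}")   # True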
• Missing editor information. In certain dump files, we can find revisions for which the user identifier is missing. This can be caused by multiple reasons, not necessarily by errors in the dump process. For example, if a user account is completely deleted, its associated revisions may still be present in the database. In WikiDAT, I have marked these revisions with a -1 value in the rev_user field.

• Widespread definitions. The Wikipedia community usually follows some well-known definitions to classify editors, many of them derived from the assumptions used in the official stats page. For example, an active wikipedian is a registered user who performed at least 5 revisions in a given month. Likewise, a very active wikipedian is a registered user who made more than 25 revisions in a given month (the SQL sketch below shows how the first definition translates into a query). Sometimes, when we want to restrict our study to wikipedians who have shown a minimum level of commitment to the project in terms of activity, a usual approach is to take all registered users with at least 100 revisions along their whole lifetime in the project. This is the minimum threshold required to participate in several community voting processes.

4.4 Know your dataset

In the course of your own data analyses, you will probably find situations in which problems or unexpected errors arise in unsuspected ways. This is a common, but (again) not frequently publicized, situation for data analysts.
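The SQL sketch promised above, assuming a revision table with rev_user and a datetime rev_timestamp column (as produced by extraction tools like WikiDAT; adapt the date handling if timestamps are stored as raw MediaWiki strings):

  -- Number of active wikipedians (at least 5 revisions in a month), per month.
  SELECT month, COUNT(*) AS active_editors
  FROM (SELECT rev_user,
               DATE_FORMAT(rev_timestamp, '%Y-%m') AS month,
               COUNT(*) AS monthly_revisions
        FROM revision
        WHERE rev_user > 0                 -- registered users only
        GROUP BY rev_user, month
        HAVING monthly_revisions >= 5) AS per_user_month
  GROUP BY month
  ORDER BY month;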
…are the sections delivering the actual scientific contribution. However, this also makes it quite difficult for other researchers to follow these studies, replicate their results, and extend the analysis beyond its initial scope. This is particularly true for analyses using very large data sets, such as in many studies involving Wikipedia data.

Over the past 5 years, I have reviewed many scientific works and reports dealing with Wikipedia data analysis. To my surprise, I still can find many problems linked to incorrect assumptions about the nature of the process generating the data and their meaning, problems with data preparation (unfiltered items introducing noise, or items that are irrelevant for the purposes of the study), and many other issues that jeopardize the validity of quite interesting research works. My aim in this guide is to (hopefully) help you avoid committing these common pitfalls in your own studies. To achieve this, the first step is to learn all the necessary details about Wikipedia data and the available information sources that you can use to retrieve it.

1.1 Public data sources

The first important characteristic of Wikipedia data sources is that many useful data records are publicly available for download and use. This includes information about Wikipedia content, discussion pages, user pages, editing activity (who, what and when), administrative tasks (reviewing content, blocking users, deleting or moving pages) and many, many other details. Th
2.4 Practical tips and assessment

The aim of this last section is to compile useful advice that you can apply to make your life easier when analyzing Wikipedia data. As I said in the presentation of this chapter, this section should expand in future versions of this document, to bring more pragmatic recommendations.

2.4.1 DRY and ARW

A couple of simple tips that we all (even myself) have broken several times. DRY stands for Don't Repeat Yourself (some people add a nice appeal to the end), whereas ARW accounts for Avoid Reinventing the Wheel. These are two important premises sustaining effective software development. They translate into two important messages:

• If you find that you are writing code that will solve this problem, but not any similar problem that you may find in the future, then think twice before going ahead with your typing. It is difficult to write reusable code, but it will save you precious time later, so it is worth the effort.

• Before trying to code your own super-duper-awesome solution for a given problem, look around with your favourite search engine, exploring well-known public code repositories like SourceForge, GitHub, BitBucket and so forth, to see if anybody else has already attempted to solve the same problem. An in
…browser to the Download Python page, and find out the installer file that matches your flavour and version.

5.1.3 The mysql-python API

MySQLdb is a Python library that brings a convenient API to access MySQL from Python. The user's manual, as well as the project page on SF.net, offer additional information. You must consult the README file in the source package, and follow the instructions to install it in your system. Binary files are also available for some GNU/Linux distributions. For instance, in Debian and Ubuntu, you only need to install the package python-mysqldb to get it working on your system (a minimal usage sketch appears at the end of this section).

5.2 NumPy, SciPy and matplotlib

NumPy is the basic library in the Python programming language for mathematical operations and scientific computing. In addition to this, SciPy delivers a comprehensive toolset for scientific programming in Python, covering many disciplines. Finally, matplotlib is a library to create high quality graphics in a variety of formats, complementing NumPy and SciPy with graphical features. Jointly, these libraries conform a full-featured environment for scientific programming, which can be extended even further with any of the available scikits.

5.3 Python scikit-learn

The scikits are Python libraries providing even more functionalities on top of the baseline framework of NumPy and SciPy. Among these scikits, scikit-learn stands out as a very powerful library for data mining and machine learning. The project is supported by Inria and Google, and currently implements a long list of different algorithms and tools for supervised and unsupervised learning, as well as data loading, data transformation and data visualization, including 3D support with RGL graphs.
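The promised sketch: a minimal MySQLdb session, assuming a local database named wkp_lang created as described in the WikiDAT instructions (adjust user, password and query to your setup):

  # Sketch: querying a local MySQL database from Python 2.x with MySQLdb.
  import MySQLdb

  db = MySQLdb.connect(user="root", passwd="password", db="wkp_lang",
                       use_unicode=True, charset="utf8")
  cursor = db.cursor()
  cursor.execute("SELECT COUNT(*) FROM revision")
  print "Revisions in the database:", cursor.fetchone()[0]
  cursor.close()
  db.close()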
…obtaining their explicit consent. In summary, Wikipedia offers a vast and quite rich data garden for researchers, which is public and made available through some convenient channels that we are going to discover in the next sections. So, let's start our walk through the Wikipedia data store.

1.2 Different languages, different communities

Another important trait of Wikipedia data is that a different MediaWiki site is maintained for every language in Wikipedia. Therefore, in case a user wants to contribute to more than one language with a user account, she must register a new account in each language. Nowadays, a unified login feature has been deployed to let users with presence in multiple languages perform simultaneous login in all of them in a single step. However, in many cases users find that their nickname in one language has already been taken by another person in a different language. For older accounts, this is a very common scenario.

As a result of this, in general we must stick with the assumption that nicknames in Wikipedia identify users only within a certain language community, and that they do not identify users consistently across different languages, unless we can prove otherwise reliably.

Besides, having different communities also means that they may follow distinct policies or conventions to organize their online work. Admittedly, in all cases this does not include the very basic organizational principles underpin
…much more memory, without the downsides of the very slow magnetic technology in traditional drives. Table 2.1 summarizes the main hardware factors that we have commented so far, along with some recommendations.

http://en.wikipedia.org/wiki/Swap_partition

Table 2.1: Key hardware factors that affect performance in data extraction and analysis

Factor: CPU
Influence: execution speed of instructions.
Recommendations: multi-core processors allow us to run several concurrent processes, to extract data and analyse different languages. Faster CPUs (higher frequency) will cut down the execution time of our programs and database operations.
Caveats: limited number of CPUs/cores in a single machine (currently 8-16).

Factor: RAM memory
Influence: speed of calculations; size of data.
Recommendations: more RAM will let us load larger data sets or database tables in memory, reducing computation time.
Caveats: RAM chips can be expensive (ratio cost/bit); limited size of memory allowed in some systems, except for professional servers.

Factor: SSDs
Influence: execution speed of disk-bounded tasks.
Recommendations: solid state drive technology is now mature enough for serious applications; orders of magnitude faster than traditional magnetic drives. We can reserve part of the drive for virtual memory, effectively expanding system memory with little impact on performance. Good trade-off solution (ratio cost/bit).
Caveats: beyond certain capacity (circa 250 GB), SSD devices are still somewhat expensive.
…acquire even an entry-level hardware RAID card from a major brand in the business (such as Adaptec) to get better performance.

2.4.4 Data storage: database engines

Over the past 5 years, MySQL has been the database engine that I have used consistently to conduct Wikipedia data analysis. Admittedly, this is strongly linked to the fact that I have been interested in studies at the macroscopic level. For high resolution analyses, such as working with the whole set of wiki text in a Wikipedia language, modern NoSQL alternatives like MongoDB or Cassandra can be a better choice. As for relational databases, PostgreSQL is the other obvious choice, but practice has taught me that getting to know your database engine extremely well can make the difference between spending several days or a few minutes to complete a query. Therefore, this section is focused on MySQL.

Many people become quite surprised whenever I mention that, running on MySQL, I have had no major issues in Wikipedia data analysis for multiple languages, even for the English Wikipedia (now with nearly 450 million revisions). Likewise, my colleague Antonio Reinoso has been able to process nearly 1 billion entries, with information about Wikipedia traffic for a single month, on MySQL without a problem. The key point here is: the default MySQL server configuration is useless for any serious application, not only for Wikipedia data analysis. Thus, provided that you configure your server in a proper way, things will change dramatically.
Incredibly, only a few books cover the very important task of data cleaning and data preparation. A well-known reference for many years, Data Preparation for Data Mining [TODO ref], is now out of print, and it can be found only through second-hand resellers. A more recent book, Best Practice in Data Cleaning [TODO ref], has been recently published to close this gap, dealing with important topics such as missing values and transformations. Some other books also cover this important stage as part of complete methodologies for applying certain types of analysis. For example, I can recommend An R Companion to Applied Regression [TODO ref] and Regression Modeling Strategies [TODO ref] for linear models, along with a new reference covering these aspects for survival and event history analysis [TODO ref]. All the same, many applied statistics books still work with synthetic or already prepared data sets, to dive into the actual details of the statistical tools and techniques as fast as possible, overlooking this stage. The following sections will help you to find your way through the preparation of Wikipedia data for your own analyses.

4.2 Retrieve and store your data locally

In the first part, we presented different sources from which we can retrieve our data for Wikipedia analysis. Tools like WikiDAT can save us time to build this local store of Wikipedia data.
devices are still somewhat expensive.

Factor: Hybrid hard drives (HHDs)
Influence: Large storage capacity; faster data transfers.
Recommendations: HHDs are another compromise solution, integrating a large magnetic disk with a small SSD acting as a local cache in the same device. Large storage capacity with faster data transfer. Mandatory for working with very large data sets, like the whole wiki text in a Wikipedia language.
Caveats: Not valid for the same applications as SSDs; only required for very large storage capacity.

Factor: On-site clustering
Influence: Parallelization of very complex and time-consuming tasks.
Recommendations: A local cluster connected to a reliable, fast communication network (Gigabit Ethernet or Fibre Channel) is the only way to accomplish very complex analyses involving difficult calculations. Examples are NLP of a whole Wikipedia language, authorship at the word level, and the temporal evolution of social networks.
Caveats: Significant effort to set up such an infrastructure in a proper way. Requires skills with the associated technologies, and physical space for installation. It can be quite expensive.

Factor: Cloud services
Influence: Scalable parallelization of complex tasks.
Recommendations: Cloud services such as Amazon EC2 offer access to scalable computational clustering at an affordable cost.
Caveats: Software, libraries and data must be loaded and configured every time we run a new analysis, so these tasks should be automated. May become expensive in the long term for conducting many analyses.
Wikipedia data. At the time of writing, WikiDAT can retrieve information from both the pages-meta-history and pages-logging dump files. Together with the user_groups dump, which can be directly imported in a MySQL database, these 3 data sources can be combined to undertake multiple interesting analyses on any Wikipedia language. You can consult again Section 3.8 for additional details about the data fields currently retrieved or computed by WikiDAT.

All you need to do is to find the link for these files in the Wikimedia Download center (see Section 1.7). Then, open a terminal and go to the parsers folder in the WikiDAT code. You can execute the following commands to retrieve the data from the pages-meta-history dump file (this will be automated in new versions of WikiDAT):

    jfelipe@blackstorm:WikiDAT/sources$ mysql -u root -ppassword
    mysql> create database wkp_lang;
    mysql> exit
    jfelipe@blackstorm:WikiDAT/sources$ mysql -u root -ppassword wkp_lang < tables_wikidat.sql
    jfelipe@blackstorm:WikiDAT/sources$ python pages_meta_history.py wkp_lang lang/lang-YYYYMMDD-pages-meta-history.xml.7z log_file.log

This will create a new MySQL database and the tables to store the information, and it finally executes the parser on the downloaded dump file. You need to change password to the actual password of the root user in MySQL (you can also create new users; see the manual section for this task). Please note that there is
The code in these examples will be improved progressively, to provide didactic case studies for other researchers interested in following similar approaches in their own analyses. In the last chapter of this document, we visit some of these example cases for illustration.

The following prerequisites must be satisfied to run the examples included in this document using the code included in WikiDAT:

- MySQL server and client, v5.5 or later.
- Python programming language, v2.7 or later (but not the v3 branch), and MySQLdb v1.2.3.
- R programming language and environment, v2.15.0 or later.
- Additional R libraries with extra data and functionalities: RMySQL, lattice (by Deepayan Sarkar), car (by John Fox et al.), DAAG (by John Maindonald and W. John Braun), Hmisc (by Frank E. Harrell Jr., with contributions from many other users), rjson (by Alex Couture-Beil).

http://cran.r-project.org/web/packages/RMySQL/index.html

Please refer to Chapter 5, introducing open source tools for data analysis, for additional information about obtaining and installing these dependencies.

Part II: Conducting Wikipedia data analysis

Chapter 4. Methodology for Wikipedia data analysis

The methodology for Wikipedia data analysis is not very different from the general process to conduct data analysis in other fields. We must first retrieve our data source files, or retrieve the information from
the dump files with the complete history of changes for each Wikipedia language. As we mentioned, one of the advantages of this approach is that in the pages-meta-history dump we have the whole wiki text associated to every revision stored in the file. This opens up an endless number of possibilities for researchers interested in tracking changes in particular parts of the page content, or the introduction of specific templates and tags adopted by the Wikipedia community to flag certain conditions about the page status or its quality. A straightforward example is to track the introduction of the tag to label an article as a featured article, that is, an article of very high quality in that language.

For this purpose, the usual approach is to include some extra logic in the software implementing the data parsing of XML dump files, to look for these special tags in the text of the revision. One of the possible ways to accomplish this is to use regular expressions, a flexible tool (but also somewhat complex and prone to coding errors) that is present in popular programming languages such as Python, Perl or Ruby. However, a word of caution is in order at this point. In the first place, while there are some techniques that can let us speed up the preparation of the regular expression to make the pattern matching process faster, like the use of re.compile in Python, this process may take some time. This is true for revisions with very large text chunks.
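As a minimal illustration of this idea, the following Python sketch compiles a pattern once and applies it to the text of each revision. The concrete template names are assumptions for illustration only (the actual tag is locale-dependent), and detect_fa_tag is a hypothetical helper, not part of WikiDAT.

    # -*- coding: utf-8 -*-
    import re

    # Compile once, outside the parsing loop; re.compile pays off when the same
    # pattern is matched against millions of revisions. Template names are examples.
    FA_PATTERN = re.compile(r'\{\{\s*(featured article|destacado|excellent)\s*\}\}',
                            re.IGNORECASE)

    def detect_fa_tag(rev_text):
        """Return 1 if the revision text carries a featured-article tag, else 0."""
        return 1 if FA_PATTERN.search(rev_text) else 0

    print detect_fa_tag(u"Some wiki text... {{Featured Article}} more text")  # prints 1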
The excerpts in Program 3 and Program 4 show some example XML content of this dump for the case of the Simple English Wikipedia (simplewiki).

http://en.wikipedia.org/wiki/Wikipedia:User_access_levels
http://www.mediawiki.org/wiki/Manual:Logging_table
https://gist.github.com/2906718
http://en.wikipedia.org/wiki/Wikipedia:Flagged_revisions

Program 1: Example of XML data stored in pages-meta-history dump

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.6/
          http://www.mediawiki.org/xml/export-0.6.xsd"
        version="0.6" xml:lang="fur">
      <siteinfo>
        <sitename>Vichipedie</sitename>
        <base>http://fur.wikipedia.org/wiki/Pagjine_princip%C3%A2l</base>
        <generator>MediaWiki 1.19wmf1</generator>
        <case>first-letter</case>
        <namespaces>
          <namespace key="-2" case="first-letter">Media</namespace>
          <namespace key="-1" case="first-letter">Special</namespace>
          <namespace key="0" case="first-letter" />
          <namespace key="1" case="first-letter">Discussion</namespace>
          <namespace key="2" case="first-letter">Utent</namespace>
          <namespace key="3" case="first-letter">Discussion utent</namespace>
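To process listings like this one without loading the whole dump in memory, a stream parser is the usual choice. The following sketch, under the assumption of a bzip2-compressed dump and the 0.6 export namespace shown above, walks the file with xml.etree.cElementTree.iterparse and prints the title and revision timestamps of each page; it is a didactic outline, not the actual WikiDAT parser, and the file name is a placeholder.

    import bz2
    from xml.etree import cElementTree as ET

    NS = '{http://www.mediawiki.org/xml/export-0.6/}'  # matches the dump above
    dump = bz2.BZ2File('simplewiki-pages-meta-history.xml.bz2')  # assumed file name

    for event, elem in ET.iterparse(dump):
        if elem.tag == NS + 'page':
            title = elem.findtext(NS + 'title')
            for rev in elem.findall(NS + 'revision'):
                print title, rev.findtext(NS + 'timestamp')
            elem.clear()  # free memory: discard the subtree we just processed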
The main exception to this general rule is private data about Wikipedia users who registered in any of the more than 280 different languages available in Wikipedia (more on this in the next section). This is for a very good reason: it is not my business, or your business, to learn the real name or e-mail address of Wikipedia users. Privacy must be preserved, and the Wikimedia Foundation does a splendid job in this regard. In any case, people use a nickname that is associated to their user account and that is displayed as their public identifier in the community. For the purpose of tracking individual contributions, this should suffice for most of your studies.

At the same time, this also means that Wikipedia offers detailed records about all the activity performed by users, making it a public forum compliant with the design principles of social translucence [TODO ref] in digital systems. Of course, that means that we can virtually track the complete activity history of every single user in Wikipedia. However, from a research ethics point of view, although many users are fully aware that their digital footsteps can be traced in detail, some of them may not feel very comfortable if they are pinpointed in a study. So, as researchers, we also have the responsibility of reporting this information with care, avoiding unnecessarily precise details in our results, for the sake of respecting the public privacy of Wikipedia users, in particular without
the possible user levels currently considered in Wikipedia. For other languages, with smaller communities, some of these levels may be missing.

1.7.4 Logging dumps: administrative and maintenance tasks

Another interesting dump file, which does not share the same fame and popularity as its revision history siblings, is the pages-logging dump. In practice, this situation stems from the main goals of most data extraction tools for MediaWiki data, which were originally focused on extracting data for pages, revisions and text, to build local replicas of a Wikipedia language. As a result, administrative actions were not as important, and for a long time they have remained quite unnoticed in Wikipedia research literature.

Today, next-generation data extraction tools, such as WikiDAT (presented in the next chapter), solve this problem. The logging table in the MediaWiki DB contains invaluable information about many different administrative and maintenance actions undertaken in a Wikipedia language. Gregor Martynus has recently created a draft list summarizing the different types of actions that we can find in this dump and their meaning. Some interesting examples of actions that we can trace in this dump are the following (a small tallying example is shown after the list):

- Deletions, moves or protection of wiki pages.
- Registration of new users in the system.
- Blocking of users, including the duration of the block.
- Reviewing actions performed in languages that have the flagged revisions extension enabled.
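As a quick way to get a feeling for the distribution of these actions, one can tally the type and action fields of every logitem in the dump. The sketch below is an illustrative outline, assuming a gzip-compressed pages-logging file (the name is a placeholder) and the 0.6 export namespace; it is not the WikiDAT parser itself.

    import gzip
    from collections import Counter
    from xml.etree import cElementTree as ET

    NS = '{http://www.mediawiki.org/xml/export-0.6/}'
    dump = gzip.open('simplewiki-pages-logging.xml.gz')  # assumed file name
    counts = Counter()

    for event, elem in ET.iterparse(dump):
        if elem.tag == NS + 'logitem':
            # Count pairs such as ('delete', 'delete') or ('block', 'block')
            counts[(elem.findtext(NS + 'type'), elem.findtext(NS + 'action'))] += 1
            elem.clear()

    for (log_type, log_action), n in counts.most_common(10):
        print log_type, log_action, n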
based on MySQL, like Percona. Yet another option is using PostgreSQL, another powerful open source database software package with several different libraries to communicate with Python applications.

5.5 R programming language and environment

R is free software which offers the most complete and most powerful statistical programming language and environment available today. The R project website is the entry point to the R world. A core development group of 20 people, who founded the R Foundation to oversee the good progress of this project, is responsible for maintaining and improving the core environment.

Additionally, a very active ecosystem of developers and contributors is constantly augmenting the R suite of tools and features, providing add-on packages known as R libraries. These libraries are published via the Comprehensive R Archive Network (CRAN), a network of mirror servers that provide access to this repository from many different locations. At the time of writing these lines, CRAN lists more than 3,800 libraries, and this number continues to grow exponentially.

5.5.1 Installing R

R is also a multi-platform programming language and statistical environment. You can read the R FAQ to find instructions about how to get and install R on your own computer.

- For GNU/Linux users: search your software management system to find the R binaries. Make sure that you install the base environment (in Debian-like systems, the package
MediaWiki sites. Originally designed for Wikipedia, it is now used throughout the Wikimedia Foundation's projects and on many other MediaWiki wikis. It is written in Python, which is a free, cross-platform programming language.

[Figure 3.4 content: the Pywikipediabot page at mediawiki.org provides links to general information for people who want to use the bot software (release status: beta), including an overview, a quick-start guide, installation and configuration (user-config.py), use by third-party wikis, basic use, bot scripts, and how to participate in pywikipediabot development.]

Figure 3.4: Snapshot of the Pywikipediabot page at mediawiki.org

3.5 WikiTrust for data analysis

WikiTrust is an open source online system to calculate reputation metrics for Wikipedia authors and content. WikiTrust is hosted by the Institute for Scalable Scientific Data Management at the School of Engineering of the University of California, Santa Cruz, and its main developers are Luca de Alfaro, Bo Adler and Ian Pye. WikiTrust has been one of the most popular research tools for calculating metrics about Wikipedia authoring, including
3.8 WikiDAT: Wikipedia Data Analysis Toolkit

II Conducting Wikipedia data analysis

4 Methodology for Wikipedia data analysis
4.1 The big picture
4.2 Retrieve and store your data locally
4.3 Routinary tasks for data cleaning and preparation
4.4 Explore your data
5 Open source tools for data analysis
5.1 Python
5.1.1 Installing Python in GNU/Linux
5.1.2 Installing Python in Windows or Mac OS
5.1.3 …
5.2 NumPy, SciPy and matplotlib
5.3 …
5.4 Database engine: MySQL
5.5 R programming language and environment
5.5.1 Installing R
5.5.2 Installing additional libraries
5.5.3 Graphical user interfaces for R
5.5.4 R documentation and further references
6 Example cases in Wikipedia data analysis
6.1 Activity metrics
6.1.1 Questions and goals
6.1.2 Required data and tools
6.1.3 Conducting the analysis
6.1.4 Further reading
6.2 The study of inequalities
6.2.1 Questions and goals
6.2.2 Required data and tools
6.2.3 Conducting the analysis
Whenever something is too good to be true, there must be some hidden caveat. In this case, the downside is that one needs to use some software already available to import this information from the dump files, or to write one's own code to accomplish this, something that is not within the reach of many people. In the next chapter, we will review some existing tools that can help you in this endeavour. Here, we will describe in detail the content of some of the most popular dump files.

Finally, an important remark about the dump files is that every new file includes again all data already stored in prior versions, plus the new changes performed in the system since the last dump process, and excluding all information and metadata pertaining to pages that have been deleted in that interval. In other words, we must understand these dump files as snapshots of the status of a Wikipedia language (or any other wiki of a Wikimedia project) at the time at which the dump process was triggered. Now and then, new people subscribing to the mailing list devoted to discussing the back-up process status and the features of the dump files ask about the existence of some kind of incremental dump feature. If we only had to deal with new changes, this could certainly be implemented without too much effort. However, the additional complication introduced by page deletions makes this way trickier to implement, and less useful; otherwise, the dump might show information about pages that
presents R code implementing the actual analysis part, and not only the data extraction process (which was written in Python). So this is somehow one of the first available software packages for Wikipedia data analysis that really made an effort to open up the data analysis part, for other researchers to check the code and reuse it for their own purposes.

https://bitbucket.org/halfak/wikimedia-utilities
http://people.aifb.kit.edu/ffl/reverts/
http://meta.wikimedia.org/wiki/WikiXRay

However, advances in available libraries for XML parsing, and new proofs of concept such as wikimedia-utilities, made me think that it could be a good idea to try to rewrite the old code, creating a new tool that I have called WikiDAT [TODO: Citation]. At the time of writing, it is still in an experimental phase, as I am able to find free slots to clean up and organize the multiple (and I really have many of them) code snippets that I have created for undertaking past analyses with Wikipedia data.

At the time of writing (end of June 2012), WikiDAT is shipped with new versions of parsers for the pages-meta-history and pages-logging dump files. Regarding pages-meta-history, the tables [TODO: internal ref] summarize the data fields that are currently extracted to a local MySQL database, organized in 5 different tables:

- Table page stores information about all pages in a Wikipedia language.
- Table revision contains metadata about all revisions performed in a Wikipedia
rev_id | Positive integer > 0 | Unique identifier of this revision.
rev_page | Positive integer > 0 | Unique identifier of the page modified in this revision.
rev_user | Integer (-1, 0 or positive value) | Unique identifier of the registered user who performed this revision. A value of -1 indicates that the user identifier was not recorded in the dump file; 0 indicates anonymous users; positive values indicate the identifier of registered users.
rev_timestamp | Date and time value (YYYY-MM-DD HH:MM:SS) | Timestamp recording this revision in the database.
rev_len | Positive integer > 0 | Length in characters of the wiki page after this revision.
rev_parent_id | Positive integer > 0, or NULL | Link to the numerical identifier of the previous revision undertaken on the same wiki page. A NULL value indicates the first revision of a wiki page.
rev_is_redirect | Binary (1 or 0) | A value of 1 indicates that the page modified in this revision is a redirect.
rev_minor_edit | Binary (1 or 0) | A value of 1 indicates that the user marked the minor edit tick for this revision.
rev_fa | Binary (1 or 0) | A value of 1 indicates that, after this revision, the modified page displays the featured article status tag (locale-dependent).
rev_comment | String (max. 255 characters) | Comment inserted by the user who performed this revision.

Table 3.4: Fields and values included in table logging, as defined in WikiDAT

Field name | Possible values | Description
officially approved in the Wikimedia global movement. Its aim is to maintain replicas of databases from all Wikimedia projects, in order to test new software, provide added-value services and statistics based on these data, and undertake research activities. For example, some of the new statistics about Wikipedia articles are computed by software services running on the toolserver.

In order to get access to this infrastructure, interested researchers must contact the toolserver administrators to obtain a user account and shell access. Before you ask: yes, the servers run on GNU/Linux, so you must have some minimal background about performing connections to remote systems over SSH, and about basic shell usage. The database engine used to store these replicas is MySQL, the same engine used for all Wikimedia projects up to now. The infrastructure is quite powerful, though the usual common sense and etiquette rules apply here as well regarding the use of shared computational resources.

1.7 Wikipedia dump files

The best way to retrieve large portions (or the whole set) of data and metadata about any action performed in a Wikipedia language is using the database dump files. These dump files contain not only the wiki text, but also metadata about all actions that have been recorded in the database of any Wikimedia wiki. The files are publicly available on the Wikimedia Downloads center. Lately, they have also been replicated on some mirrors provided by several institutions
https://addons.mozilla.org/en-US/firefox/addon/wikitrust/
http://www.wikitrust.net/vandalism-api

[Figure 3.5 content: the python-wikitools project page at Google Code describes a Python package to interact with the MediaWiki API, containing general tools for working with wikis, pages and users, and for retrieving data from the MediaWiki API, along with the source code for some en.wikipedia-specific bot scripts (including Mr. Z-bot). It requires Bob Ippolito's simplejson module (or the json module in Python 2.6); Chris AtLee's poster package is needed for file upload support, but is not required for use. Version 1.1.1 was released on 14 April 2010, with source downloads, a Windows installer and an RPM available. The code is licensed under the GNU GPL v3, with content under Creative Commons 3.0 BY-SA, and wikitools roughly follows the MediaWiki release cycle for major releases.]
</type>
      <action>delete</action>
      <logtitle>MediaWiki:CompactTOC</logtitle>
      <params xml:space="preserve" />
    </logitem>
    </mediawiki>

Chapter 2. Data retrieval, preparation and storage

Now that we know more details about the information and metadata available from the different Wikipedia data sources, we turn to introduce some important aspects of the data retrieval process itself. After this, the next chapter will present some available tools to carry out this important task. The main focus of this document is the Wikimedia dump files, and therefore we will cover some tips and useful recommendations that hopefully can save some time for researchers conducting Wikipedia data analysis with these files. In the current version of the document, this chapter encompasses the most critical pieces of advice and recommendations that I could think of regarding Wikipedia data extraction. Thus, it is still far from complete, and it will be expanded as I find more free slots to extend this information with more tips and further assessments.

2.1 RSS to notify updates

A very useful service that is still overlooked by most Wikipedia researchers today is the utilization of RSS notification channels to check for new updates of the dump files that they regularly use. For every Wikipedia language, a special page on the Wikimedia Downloads site lists the latest correct versions of all dump files produced for that language.
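Checking such a channel can itself be automated. The following sketch assumes that the third-party feedparser library is installed, and uses the RSS URL layout of the Wikimedia Downloads site for the French Wikipedia; it simply prints the publication date of the latest pages-meta-history dump, and hooking a download command to it is left as an exercise.

    import feedparser  # assumed installed, e.g. from the Python Package Index

    # RSS channel announcing new pages-meta-history dumps for the French Wikipedia
    URL = ('http://dumps.wikimedia.org/frwiki/latest/'
           'frwiki-latest-pages-meta-history.xml.7z-rss.xml')

    feed = feedparser.parse(URL)
    if feed.entries:
        latest = feed.entries[0]
        print 'New dump published on:', latest.updated
        # Here we could trigger the retrieval and database reload automatically.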
Wikipedia. Later on, this project became pymwdat, a more general framework to implement these analyses, as well as to create overall statistics about any Wikipedia language. According to the online documentation for this project:

- Requirements: Python 2.6, Linux, OrderedDict (available in Python 2.7 or from http://pypi.python.org/pypi/ordereddict), 7-Zip command line (7za).
- Input alternatives: a Wikipedia history dump (e.g. enwiki-20100130-pages-meta-history.xml.7z); Wikipedia page(s) by name or category (use grab_pages.py); Wikipedia page(s) history (.xml, .7z, .bz2, .gz).
- Input processing: detects reverts, reverted edits, revert wars and self-reverts; filtering, labeled revisions, data sets management, etc.; calculates user and page counters, revert rates, page diffs, etc.

http://code.google.com/p/pymwdat/

[Figure content: screenshot of an Italian Wikipedia article on Barack Obama, showing a passage about the 2009 Nobel Peace Prize awarded by the Oslo committee; the figure illustrates article text as displayed with trust/authorship information.]
Wikipedia language.

- Table revision_hash stores the id of every revision, as well as an SHA-256 hash of the wiki text associated to it.
- Table people lists the numerical identifier and nickname of all registered users in a Wikipedia language.
- Table logging holds information about administrative and maintenance tasks undertaken in a Wikipedia language.

The original definition of these tables is based on the tables of the same name defined in MediaWiki, except for table people, which is new. Fields in bold characters identify keys, to speed up search and sorting operations in database queries.

Table 3.1: Fields and values included in table page, as defined in WikiDAT

Field name | Possible values | Description
page_id | Positive integer > 0 | Unique numerical id of the page.
page_namespace | Positive integer >= 0 | Namespace of the page.
page_title | String (max. 255 characters) | Title of the page.
page_restrictions | Binary string | Comma-separated set of permission keys indicating who can move or edit this page (last revision).

Table 3.2: Fields and values included in table people, as defined in WikiDAT

Field name | Possible values | Description
rev_user | Positive integer > 0 | Unique identifier of a registered user.
rev_user_text | String (max. 255 characters) | Nickname of a registered user.

Table 3.3: Fields and values included in table revision, as defined in WikiDAT

Field name | Possible values | Description
Wikipedia terminology, these changes are called revisions. Thus, a revision is any change (creation or modification) that alters the content of a wiki page, producing a new version. The Program 1 caption shows an excerpt of the XML code stored in one of these files.

These dumps, as well as others such as pages-logging, start with information about namespaces, along with their names, which are frequently translated to the local language. This information is a valuable aid to classify data pertaining to different namespaces, as well as to filter out data that is not relevant for the purposes of our study (for example, if we are only interested in articles or discussion pages). After this, the file lists all revisions undertaken in every wiki page. The organization is shown in the Program 1 excerpt, with the metadata about the page (title, namespace, unique identifier and attributes) first, and then the metadata and content for all revisions in that page. In this case, we find a revision that was carried out by an anonymous user, and thus only the IP address of the source of the incoming connection is stored. In the Program 2 excerpt, we see another two entries for revisions on the same page, this time undertaken by registered users. In this case, we have the unique numerical identifier of the user in the system, along with her nickname.

The stub-meta-history dump contains exactly the same information, with the sole exception of the whole wiki text for each
available tools in R, and as a reference to find hands-on examples of how to use these tools. It also includes many pointers to data mining features and R libraries dealing with these techniques. A classic but up-to-date reference to learn effective methods for data analysis with R is Data Analysis and Graphics Using R, now in its 3rd edition. Linear Models with R, by J. Faraway, remains the authoritative introduction to linear models in R. There is also a prior, shorter and free version of this book available in the CRAN manuals. Finally, A Handbook of Statistical Analyses Using R packs a concise reference for the implementation of many common statistical analyses and techniques, for those users with a solid theoretical background.

Chapter 6. Example cases in Wikipedia data analysis

6.1 Activity metrics

The first example case that we are going to review is the production of overall activity trends in Wikipedia. Some of these results are similar to those produced in http://stats.wikimedia.org, as well as in some well-known research papers, such as [TODO ref Voss paper] and [TODO ref Almeida paper]. This work is also based on the overall trend statistics included in my PhD dissertation [TODO ref Thesis Felipe].

6.1.1 Questions and goals

The evolution of the activity registered in Wikipedia has been one of the first topics of interest for quantitative researchers. As usual, the main spot of attention has been the English Wikipedia, in particular for
kind of analysis they want to carry out, but also how they plan to implement it. In practice, this is not quite as serious an issue for other, smaller languages, even if they are in the list of the largest Wikipedias. Nevertheless, it is very probable that if an analysis takes several days to complete in eswiki or frwiki, it will take weeks, or even a whole month, to complete for the English Wikipedia.

This is the kind of situation that we must try to avoid as much as possible. As an example, in the old days (where old is 2008), when the dump process for the English Wikipedia could take several weeks to finish, it was quite frequent that some problem, in the form of a power outage, an error in code, problems in database connectivity or many other imponderables, prevented this process from finishing correctly. As a consequence, researchers had to wait, sometimes more than one year, to have new updated dump files. Fortunately, today the situation is very different, but researchers should take this lesson with them, to overcome the same problems in their own processes. As a rule of thumb, if the execution time of your analysis is dangerously approaching the 7-day threshold, you should redesign the implementation details. Otherwise, you would be leaving too large a room for Murphy's law to smack down your process and force you to start it all over again.

2.3 Computing extra metadata

In the previous chapter, we touched upon the contents included in the
including several research papers that can be accessed from the same website. It has also been applied to the study of vandalism and reverted content in Wikipedia since, along the process, the software tracks the changes or moves experienced by any word inside a Wikipedia article, as well as the author of the revision performing those changes. This authorship and reputation information is available online, to be displayed on any Wikipedia article, thanks to an add-on (plug-in) developed for Mozilla Firefox. A set of 3 different APIs is available for accessing the information computed by WikiTrust. The project authors are actively looking for support to provide these data for Wikipedia languages other than English, after a power outage brought down all the international WikiTrust support once available.

The code is released under a BSD-like open source license. In addition, many files implementing the core algorithm are written in OCaml, a programming language suitable for distributed processing in computer clusters. Since computing this algorithm for an entire Wikipedia language is a daunting task, it must be accomplished by this kind of parallel computing infrastructure. Even then, it can take a substantial period of time to complete: nearly a month for the whole English Wikipedia, according to the last comments from the authors. This also depends on the number of nodes used for the calculation.

http://www.wikitrust.net
https://addons.mozilla.org
into account the usual recommendations about not accessing the Wikimedia API at a very fast pace, since you would run the risk of being banned by the system.

http://code.google.com/p/wikievidens/
http://www.mediawiki.org/wiki/Pywikipediabot

[Figure content: the WikiEvidens project page at Google Code describes a visualization and statistical tool for wikis, developed by Emilio J. Rodriguez-Posada (alpha stage; code under GNU GPL v3, content under Creative Commons 3.0 BY-SA). Features include a wiki dataset downloader, a wiki XML preprocessor, global, page-by-page, user-by-user and sample analyses (summary; activity by year, month, day of week and hour), network visualization, an authorship scanner, and export in several formats.]
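For illustration, this is one way to query the MediaWiki API politely from Python: identify your client with a User-Agent header and pause between calls. The endpoint and query parameters below are standard MediaWiki API options, but the pacing value and the contact address are arbitrary assumptions, and api_query is a hypothetical helper.

    import json
    import time
    import urllib
    import urllib2

    API_URL = 'http://en.wikipedia.org/w/api.php'

    def api_query(params, pause=1.0):
        """Send one API request, identifying ourselves and pausing afterwards."""
        params = dict(params, format='json')
        req = urllib2.Request(API_URL, urllib.urlencode(params),
                              {'User-Agent': 'WikiResearchBot/0.1 (contact@example.org)'})
        result = json.load(urllib2.urlopen(req))
        time.sleep(pause)  # be nice to the servers; tune to your needs
        return result

    # Example: retrieve basic information about one page
    data = api_query({'action': 'query', 'titles': 'Wikipedia', 'prop': 'info'})
    print data['query']['pages']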
directly from the stream created by the compression software while inflating the original data file, capture the relevant information to be stored, and discard the rest. As a result, the memory footprint of these programs should not be very high. However, when we get to the analysis part, there can be quite a large difference between having the indexes of your huge table loaded in memory to look for specific entries in a database, or not. This difference can be in the order of hours, or even days.

Solid-state drives are the new fastest gunmen in storage village. Many manufacturers have been able to rectify the initial problems related to unreliable writing performance in their first models, and they are now producing pieces of the finest hardware, suitable for serious professional applications requiring maximum speed in data transfer at an affordable cost (well, at least while we keep below the 300 GB threshold, at the time of writing). An important property of SSD technology is that, compared to their close relatives based on magnetic technology, SSDs can be several orders of magnitude faster, and now without noticeably degrading their writing performance.

Yet another advantage of SSDs, specially for GNU/Linux users, is that we can configure one of these devices to act as a swap device, expanding the amount of virtual memory granted to the system. In practice, since the read/write speed of many of these disks is now quite fast, what we obtain is a system with much
…[[Furlane]] [[Europe]] [[Vichipedie furlan]] [[Culture Furlane]] [[Païs Basc]]…</text>
    </revision>

Program 2: Example of XML data stored in pages-meta-history dump (cont.). Lines have been formatted to fit the page width.

    <revision>
      <id>55</id>
      <timestamp>2005-01-28T09:55:00Z</timestamp>
      <contributor>
        <username>Klenje</username>
        <id>1</id>
      </contributor>
      <minor />
      <text xml:space="preserve">(more wiki text here)</text>
    </revision>
    <revision>
      <id>2456</id>
      <timestamp>2005-08-30T11:37:22Z</timestamp>
      <contributor>
        <username>CruccoBot</username>
        <id>29</id>
      </contributor>
      <minor />
      <comment>robot Adding: an, bg, ca, co, da, de, en, es, fr, ja, lb, mt, nl, …, scn, simple, sv, tl, uk, vi, zh</comment>
      <text xml:space="preserve">{| width=&quot;100%&quot; cellspacing=&quot;0&quot; border=&quot;1px solid #996600&quot; cellpadding=&quot;5&quot; style=&quot;background: #f3f3f3; font: 95% Verdana&quot; … &lt;big&gt;Tu puedis lei articui in tantis lenghis diferentis in Vichipedie!&lt;/big&gt; …</text>
    </revision>
- Rattle: ideal for exploring R data mining capabilities.

5.5.4 R documentation and further references

One of the notable advantages of R is the abundance of online, freely accessible documentation about the baseline environment and many of the CRAN libraries. In the last versions of R, there has been a remarkable concern among developers, encouraged by the R Core Team, to improve the companion documentation for many of these packages. The following options are essential for novel and experienced R users alike, to learn how to use these libraries and solve their doubts:

- R manuals, and contributed documents and manuals from R users:
  http://cran.r-project.org/manuals.html
  http://cran.r-project.org/other-docs.html
- RSeek (search in help files and R mailing lists): http://www.rseek.org

http://cran.r-project.org/web/packages/RMySQL/index.html
http://rstudio.org
http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/
http://rattle.togaware.com
… communities
Web content
The MediaWiki API
The toolserver
1.7 Wikipedia dump files
1.7.1 …
1.7.2 Complete activity records: stub-meta and pages-meta
1.7.3 …
1.7.4 Logging dumps: administrative and maintenance tasks
2 Data retrieval, preparation and storage
2.1 RSS to notify updates
2.2 The particular case of the English Wikipedia
2.3 Computing extra metadata
2.4 Practical tips and assessment
2.4.1 …
2.4.2 Data storage: hardware
2.4.3 Operating system and file system support
2.4.4 Data storage: database engines
3 Available tools for Wikipedia data extraction and analysis
3.1 Here be dragons
3.2 Wikistats
3.3 StatMediaWiki and Wikievidens
3.4 Interacting with the MediaWiki API
3.4.1 Pywikipediabot
3.4.2 Python-wikitools
3.4.3 …
3.5 WikiTrust for data analysis
3.6 Pymwdat
3.7 Wikimedia utilities
governing the whole Wikipedia project, such as the five pillars. Nevertheless, this does include some variations that may alter substantive characteristics of the activity that generates Wikipedia data and metadata in a given language. For instance, whereas in many languages users promoted to admin status retain it for life (unless it is revoked for good reasons), in the Norwegian Wikipedia they must renew this status periodically. Another example is the habit, extended in the Polish Wikipedia, of deleting talk pages that remain idle for some time, thus inducing us to infer, somewhat incorrectly, that their interest in debates on article contents is abnormally low.

http://en.wikipedia.org/wiki/Wikipedia:Unified_login
http://en.wikipedia.org/wiki/Wikipedia:Five_pillars

Activity vs. traffic

Yet another important remark is that Wikipedia not only offers data about actions that change the status of the system (edits, page moving or deletion, blocking users, etc.), but also about requests that do not alter the system status or content. A clear example of the latter is page views: requests to Wikipedia servers to display a wiki page on our browser. Further examples are searches performed on the site, or clicking on the edit or preview button while changing the content of a wiki page but without clicking on the save button (which is the action actually recorded in the system database). We can call this data about requests that do not modify the
for creating models that try to explain and predict the activity trends in this language for the mid term (see Section 6.1.4 for more information). The main goal has been to characterize the evolution of the editing activity in Wikipedia that nurtures its fantastic growth rate in content and number of articles. However, over the past 3 years, the focus has shifted to explaining the steady-state phase which activity in many of the larger Wikipedias has entered, as well as possible reasons that may have influenced this change in the overall trends, which we will explore in this example.

6.1.2 Required data and tools

In the table revision generated by WikiDAT for any language, we can find the required data to undertake the analysis of overall activity trends. However, it is a sensible approach to use the information in user_groups as well, so that we can make a more accurate analysis by filtering out activity from bots (see Section 4.3 for more information). More precisely, we will use the following fields of the revision table (a query sketch is shown below):

- rev_id, to count the number of revisions in a given time slot.
- rev_page, to aggregate scores about the number of pages.
- rev_user, to aggregate scores about users, as well as to elide information about anonymous users and bots.
- rev_timestamp, to track scores over time.
- rev_fa, marking revisions with a FA label in them.

Additionally, we can also use information in the page table to break down scores by the namespace of pages, or to calculate additional per-page metrics.
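As an illustrative sketch of how these fields can be combined, the following Python fragment runs an aggregate query against a WikiDAT database, counting revisions and distinct registered editors per month while excluding anonymous users (rev_user = 0). The database name and credentials are placeholders, and bot filtering via the user_groups table is omitted for brevity.

    import MySQLdb  # listed among the prerequisites of this document

    db = MySQLdb.connect(user='root', passwd='password', db='wkp_lang')
    cursor = db.cursor()

    # Revisions and distinct registered editors per month, anonymous users excluded.
    cursor.execute("""
        SELECT DATE_FORMAT(rev_timestamp, '%Y-%m') AS month,
               COUNT(*) AS num_revisions,
               COUNT(DISTINCT rev_user) AS active_editors
        FROM revision
        WHERE rev_user > 0
        GROUP BY month
        ORDER BY month""")

    for month, num_revisions, active_editors in cursor.fetchall():
        print month, num_revisions, active_editors
    db.close()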
log_id | Positive integer > 0 | Unique identifier of log actions.
log_type | String | Type of log action performed.
log_action | String | Concrete action undertaken within a given type.
log_timestamp | Date and time value (YYYY-MM-DD HH:MM:SS) | Timestamp recording this log action in the database.
log_user | Positive integer > 0 | Unique identifier of the registered user.
log_username | String | Nickname of the user who carried out the log action.
log_namespace | Positive integer >= 0 | Namespace of the page on which the log action was performed.
log_title | String (max. 255 characters) | Title of the page receiving the log action.
log_comment | String (max. 255 characters) | Comment on the log action.
log_params | String (max. 255 characters) | Additional parameters providing metadata about the log action (e.g. duration of a block).
log_new_flag | Positive integer | For Wikipedias with flagged revisions enabled, rev_id of the last flagged version of this page.
log_old_flag | Positive integer | For Wikipedias with flagged revisions enabled, rev_id of the previous flagged version of this page.

The code of WikiDAT can be retrieved from the project site on GitHub [TODO link]. The project files include some example implementations of common analyses conducted on Wikipedia data, for descriptive purposes.
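The revision_hash table mentioned above makes it cheap to detect identical page states, which is a usual building block for revert detection. Here is a minimal sketch of the idea, assuming UTF-8 encoded revision text and using Python's standard hashlib; it illustrates the hashing step only, not WikiDAT's actual loading code.

    import hashlib

    def revision_hash(rev_text):
        """SHA-256 hash of a revision's wiki text, as stored in revision_hash."""
        return hashlib.sha256(rev_text.encode('utf-8')).hexdigest()

    # Two revisions of the same page with equal hashes indicate that the later
    # one restored a previous state of the page (a revert, in the usual sense).
    h1 = revision_hash(u'Some stable text')
    h2 = revision_hash(u'Some stable text')
    print h1 == h2  # True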
http://www.vps.fmvz.usp.br/CRAN/bin/windows/
http://www.vps.fmvz.usp.br/CRAN/bin/windows64/
http://www.vps.fmvz.usp.br/CRAN/doc/FAQ/R-FAQ.html#How-can-R-be-installed-_0028Macintosh_0029

world to obtain the additional software. Just select the closest mirror to your current location (or any other), then sit and relax while R does all the work for you. Following this procedure, you can install any additional packages (a.k.a. R libraries) in your system. In order to run the examples included in WikiDAT, you will need the following R libraries:

- RMySQL
- car (by John Fox et al.)
- DAAG (by John Maindonald and W. John Braun)
- Hmisc (by Frank E. Harrell Jr., with contributions from many other users)
- rjson (by Alex Couture-Beil)

Regarding RMySQL, please check that the MySQL server is already installed in your system. Specially for GNU/Linux systems, it is also recommendable to ensure that any additional dependencies are met, installing this library directly from your favourite software management tool. For example, in Debian or Ubuntu you can install the package r-cran-rmysql.

5.5.3 Graphical user interfaces for R

A number of alternative GUIs have been created to facilitate the work with R, specially for novice users. The following are some suggestions that can help you to get a better R experience:

- RStudio: good for new R users; it also allows working with a remote server.
- Rcommander: great for novel users; it automates many common analyses
In practice (and I do share my own quota of blame here), researchers are too busy to devote the necessary time to publishing the code and data sets associated to their studies and research reports, so as to facilitate the labour of other colleagues willing to follow their footsteps and expand their line of work. The most important factor behind this lack of interest is the strong focus on reviewing and polishing research documents (reports, journal and conference articles, etc.) in the scientific community, at least in Computer Science. This relegates data sets, tools and, in general, study replicability to an obscure corner in the background.

In spite of this, programming languages and modern statistical environments such as R (which we will visit in the next part) offer built-in support for creating modular applications and libraries that can be easily loaded and integrated in new software. Therefore, it is not the lack of technical support, but the absence of a clear interest from the scientific community so far in study replicability, which stops us from mapping more unexplored territories in detail, for others to enter new research regions.

That said, in Wikipedia research we do not have a hopeless panorama, since there are some open source tools and suites that can save us time implementing our own studies. In this chapter, we briefly present some of these open source tools, along with pointers for readers interested in learning further details
[Figure 3.5 content, continued: the page also notes that, due to developer time constraints, only the most recent release and the SVN version are supported; users of the development (alpha) version of MediaWiki, as Wikipedia does, should consider the development version of wikitools, available via SVN. Documentation is available on the wiki. Some bot scripts (not the framework itself) require the MySQLdb module and a MySQL server; scripts in the pywiki branch directory require Pywikipedia, and are unmaintained. The author is Mr. Z-man (en.wikipedia), with some assistance and code by bjweeks. Downloads include wikitools-1.1.1 as noarch RPM, source RPM, tar.gz, win32 installer and zip, also available on the Python Package Index (PyPI).]

Figure 3.5: Snapshot of the python-wikitools page at Google Code

3.6 Pymwdat

We can find traces of the first attempts to create applications to automate the analysis of Wikipedia data back in 2008. Dmitry Chichkov started to develop wredese, a tool to support the analysis of reverts and vandalism actions in Wikipedia.
parallelization, without entering into the details of distributing different tasks in a high-level programming environment. When using table partitioning, it is important to understand the nature of the operations that we are going to perform, to get the maximum benefit from this maneuver. For example, if we are carrying out a longitudinal analysis by month, it will make sense to partition data among different tables for each month. If our analysis goes per page or per user, we can also partition our tables accordingly.

http://dev.mysql.com/doc/refman/5.0/en/innodb-multiple-tablespaces.html
http://dev.mysql.com/doc/refman/5.1/en/partitioning-overview.html

Chapter 3. Available tools for Wikipedia data extraction and analysis

3.1 Here be dragons

Reviewing previous research literature conducting quantitative analyses with Wikipedia data can be a daunting task. Recent work in progress has reported nearly 2,000 works only for peer-reviewed venues and scientific publications. However, this bright beam of research work turns into darkness (or at least twilight) when we speak about the publication of the source code to undertake these analyses. I am not going to mention here all possible examples, but we have several cases of stunning applications, like HistoryFlow (developed by F. Viégas and Martin Wattenberg while working for IBM, and frequently cited in Wikipedia research) or WikiDashboard, whose code is not publicly available as open source.
references falls beyond the scope of this document. All the same, there exist some well-known references, adequate for R newcomers as well as for R users with some experience willing to explore additional features and tools. [TODO: A bunch of references should be inserted below.]

At the introductory level, one of my favourite references that I keep on recommending is the excellent book Introductory Statistics with R, by Peter Dalgaard. Peter is one of the members of the R Core Team, and the book is written in a concise and very pragmatic way. However, one minor disadvantage is that, although he introduces some details of basic statistical techniques, this short manual is not intended to be a statistical reference for novel students or practitioners. Fortunately, this gap has been filled by a very recent book, Discovering Statistics Using R. The aim of this thick volume (1,000 pages) is to present essential statistical tools and methods for researchers, in an accessible and quite irreverent style. The book is specially suitable for researchers and practitioners in social and experimental sciences, though it also covers many aspects pertaining to observational or correlational studies.

Once you have acquired some basic skills about statistics and R programming, some references may help you to expand your knowledge and explore the many possibilities that R can offer us. R in a Nutshell is a valuable companion manual that serves both as a catalogue of available
a blank space between -u and the user name, but no blank space between -p and the user password. In the last command, change wkp_lang to the name of your database, and lang to the name of the language repository (e.g. eswiki, frwiki, enwiki, dewiki, etc.). The last two arguments are the name of the dump file and the name of a log file that will store any warning or error messages, in case any of these are produced along the process. You can read regular traces to follow the progress of the parsing process. Once it has finished, you can check that the numbers of pages and revisions retrieved concur with the information on the download page.

To retrieve the information stored in the pages-logging file, execute:

    jfelipe@blackstorm:WikiDAT/sources$ python pages_logging.py wkp_lang lang-YYYYMMDD-pages-logging.xml.gz log_file.log

Finally, to import in your local database the information about user groups and privileges, retrieve the file for the Wikipedia language of your interest and type:

    jfelipe@blackstorm:WikiDAT/sources$ gzip -d lang-YYYYMMDD-user_groups.sql.gz
    jfelipe@blackstorm:WikiDAT/sources$ mysql -u root -ppassword wkp_lang < lang-YYYYMMDD-user_groups.sql

For example, if we are interested in recovering data from the dump files created on 2012-06-01 for the French Wikipedia:

    jfelipe@blackstorm:WikiDAT/sources$ mysql -u root -ppassword
    mysql> create database frwiki_062011;
    mysql> exit
    jfelipe@blackstorm:WikiDAT/sources$
- For Windows users: please follow the links for Windows or Windows64 to install the base environment. This should already come with the recommended libraries, plus a nice editor, to avoid using the command-line interface (usually hidden from the eyes of standard users in Windows). You can open this GUI by double-clicking on the new icon that should now appear on your desktop.

- In the case of Mac OS or Mac OSX users: please follow the installation instructions for Mac users in the R FAQ, which will point you to the adequate binaries to get R running on your fashionable hardware. In this case, the binaries also install a simple graphical interface to interact with the R command interpreter.

5.5.2 Installing additional R libraries

The easiest way to install additional R packages is to do it inside R itself. Thus, execute R, either from the GNU/Linux command line or by double-clicking the icon on your desktop in other operating systems. In the R command interpreter, you can now type the following:

    > install.packages("name_of_package", dep=T)

With this command, we request R to install the package whose name comes in double quotes, as the first item in the brackets, and also to retrieve and install all the necessary dependencies (other R packages required for this package to work). If you are installing your first package in this session, you will be prompted to choose one of the many CRAN mirrors around the
versions of this document, to undertake high-resolution analyses on really huge data sets.

5.1 Python

Python is a general-purpose and multi-platform programming language that will act as a sort of glue code, to assemble different pieces together and implement the software to fit our analysis. One of the great advantages of Python is that it offers extremely informative and complete online documentation, for the baseline programming language as well as for many common libraries available to automate common tasks. Browse the Python documentation to discover it by yourself.

http://hadoop.apache.org
http://www.python.org/doc/

5.1.1 Installing Python in GNU/Linux

If you are a Linux user, it is very probable that Python already comes installed in your favourite distribution, as it has become quite a hard requirement for many basic applications. To check this, open a terminal and write:

    jfelipe@blackstorm:~$ python
    Python 2.7.2 (default, Oct 4 2011, 20:06:09)
    [GCC 4.6.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

If you can see this (the command line of the Python interpreter), then Python is installed in your system. To quit the interpreter, press Ctrl-D.

5.1.2 Installing Python in Windows or Mac OS

Since Python is a multi-platform programming language, you can also find installers targeting different versions of other operating systems. Just point your
mysql -u root -ppassword frwiki_062011 < tables_wikidat.sql
    jfelipe@blackstorm:WikiDAT/sources$ python pages_meta_history.py frwiki_062011 frwiki/frwiki-20120601-pages-meta-history.xml.7z log_file.log
    jfelipe@blackstorm:WikiDAT/sources$ python pages_logging.py frwiki_062011 frwiki-20120601-pages-logging.xml.gz log_file.log
    jfelipe@blackstorm:WikiDAT/sources$ gzip -d frwiki-20120601-user_groups.sql.gz
    jfelipe@blackstorm:WikiDAT/sources$ mysql -u root -ppassword frwiki_062011 < frwiki-20120601-user_groups.sql

Now your data is ready for additional preparation and/or cleaning.

4.3 Routinary tasks for data cleaning and preparation

Below you can find some useful tips to prepare Wikipedia data for your own analyses.

- Keep data preparation in the database. When we face the task of data preparation, we usually have the option to prepare our data in the database and then import it into our favourite analysis environment, or to retrieve the raw data and undertake any data preparation and rearrangement tasks outside the database. In my experience, keeping all data preparation tasks in the database, as much as possible, is usually much faster and less error-prone.

- Anonymous editors. The only identifier recorded in the database for anonymous editors is the IP address of the computer from which they performed the revision. No matter how many claims you find telling you otherwise
<namespace key="4" case="first-letter">Vichipedie</namespace>
          <namespace key="5" case="first-letter">Discussion Vichipedie</namespace>
          <namespace key="6" case="first-letter">Figure</namespace>
          <namespace key="7" case="first-letter">Discussion figure</namespace>
          <namespace key="8" case="first-letter">MediaWiki</namespace>
          <namespace key="9" case="first-letter">Discussion MediaWiki</namespace>
          <namespace key="10" case="first-letter">Model</namespace>
          <namespace key="11" case="first-letter">Discussion model</namespace>
          <namespace key="12" case="first-letter">Jutori</namespace>
          <namespace key="13" case="first-letter">Discussion jutori</namespace>
          <namespace key="14" case="first-letter">Categorie</namespace>
          <namespace key="15" case="first-letter">Discussion categorie</namespace>
        </namespaces>
      </siteinfo>
      <page>
        <title>Pagjine principâl</title>
        <ns>0</ns>
        <id>1</id>
        <restrictions>edit=autoconfirmed:move=autoconfirmed</restrictions>
        <revision>
          <id>1</id>
          <timestamp>2005-01-25T06:55:26Z</timestamp>
          <contributor>
            <ip>24.251.243.23</ip>
          </contributor>
          <sha1 />
          <text xml:space="preserve">Benvignût al Friûl … [[Lenghe]] [[Stât]] [[regjon]] …</text>
        </revision>
The structure of the URLs to reach these pages is always the same:

    http://dumps.wikimedia.org/project-name/latest/

where we must place, instead of project-name, the name of the Wikipedia language we want to track (frwiki, dewiki, enwiki, eswiki, etc.). Now, if we inspect the content of this page, we can quickly notice some links containing the tag rss in their URL, like:

    frwiki-latest-pages-meta-history.xml.7z-rss.xml

In fact, this is the link to an RSS syndication channel to which we can subscribe to be notified about new versions of this dump file (in this case, pages-meta-history for the French Wikipedia). Then, we can follow this channel with our favourite aggregator to get these notifications or, even smarter, use these channels to automatically trigger some piece of software that retrieves the new dump file to a local machine and loads the new information in a fresh database copy, as in the feedparser sketch shown earlier.

2.2 The particular case of the English Wikipedia

The English Wikipedia is always a particular case, not only from the data extraction process perspective, but for many other macroscopic studies that we may be willing to implement. As of May 2012, the complete pages-meta-history file of enwiki stores text and metadata for 444,946,704 revisions performed on 27,023,430 pages, in all namespaces, not only in the main namespace of articles. With such scores in mind, it quickly becomes apparent that researchers must think very carefully not only about what kind
An interesting property of software engineering, stated by Fred Brooks a long time ago [TODO ref], is that whenever you face a new problem, you should be prepared to throw your first implementation away. This has nothing to do with your personal skills, but with the inherent nature of problems in software engineering, and with how differently we think about a problem once we have really understood all the low-level details and caveats that were hidden at first sight.

2.4.2 Data storage hardware

As I have explained before, talking about the extra metadata that we can retrieve from the text of revisions, there are two possible alternatives for storing our data locally:

• Using a single machine, as powerful as we can afford, to undertake the data preparation and analysis.

• Resorting to a cluster of machines, applying modern techniques for process parallelization, such as map-reduce, or using cutting-edge distributed database engines following the NoSQL paradigm, like MongoDB or Cassandra.

My personal recommendation is to keep using a single machine unless you are absolutely sure that your problem will take a disproportionate amount of time to complete. It is true that, today, technical and business interests are determined to present new technologies like map-reduce or NoSQL as the answer to all our problems. However, anyone who has been involved in algorithm parallelization for some time can tell you
that this can be anything but a trivial task. Moreover, it is definitely out of reach for many researchers without a strong background in programming, systems architecture and parallelization, and without strong skills in the software packages implied in this approach.

At the same time, some new hardware technologies can, to some extent, facilitate our life in single-machine analysis. Multi-core CPUs are now quite affordable, so provided that we can buy a reasonable amount of RAM for our system (as much as we can afford), we can execute multiple processes for different Wikipedia languages at the same time. We will also be able to take advantage of the multiple-chunk dump format of the English Wikipedia, parsing several chunks at the same time (see the sketch below). With this approach, the latest version of WikiDAT at the time of writing can parse the whole English Wikipedia, with 6 concurrent subprocesses targeting between 26 and 28 chunks each, in about 44 hours. In that time it also calculates some extra metadata, such as the SHA-256 hash of each revision, the length of the text, and the presence/absence of featured article tags in the wiki text.

System memory (RAM) is another key factor with an impact on performance, especially for the data analysis portion. Loading data into a local database, or any other storage infrastructure, does not typically require substantial amounts of memory to execute properly. In fact, the current procedure is usually to recover the data d
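The chunk-level parallelism just described can be sketched with Python's multiprocessing module. The chunk file names and the per-revision work below are placeholders (WikiDAT's real parser streams and decompresses XML); the point is the pool pattern with 6 worker processes.

# Sketch of chunk-level parallel parsing with a pool of 6 worker processes.
# Chunk names and per-revision work are placeholders for WikiDAT's real code.
import hashlib
import multiprocessing

def parse_chunk(path):
    """Stand-in for parsing one dump chunk: hash every line as if it were
    a revision text (the real parser streams XML and extracts revisions)."""
    count = 0
    with open(path, "rb") as f:
        for revision_text in f:
            hashlib.sha256(revision_text).hexdigest()  # extra metadata
            count += 1
    return path, count

if __name__ == "__main__":
    chunks = ["enwiki-chunk-%03d.xml" % i for i in range(160)]  # made up
    pool = multiprocessing.Pool(processes=6)  # 6 concurrent subprocesses
    try:
        for path, count in pool.imap_unordered(parse_chunk, chunks):
            print("%s: %d revisions" % (path, count))
    finally:
        pool.close()
        pool.join()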
to think that we must work with files consuming TBs of storage space, we must think carefully about the type and configuration of the system on which we want to run our analysis. Leaving aside clustering solutions, it is possible to configure a server infrastructure with a storage capacity spanning more than 2 TB, but we need to follow some recommendations:

• 64-bit platforms. It is mandatory for high-resolution analysis to run on 64-bit platforms, prepared to handle large data structures both in memory and in the file system.

• File system. Another important choice is to use a modern filesystem that supports very large file sizes. For example, in GNU/Linux, ext4 or xfs are good options. You should also pay attention to the size of your partitions, using a technology that accepts the configuration of partitions larger than 2 TB. There are several options available, including the use of GPT partition tables, the configuration of multiple RAID devices, and joining several standard partitions of up to 2 TB together using LVM.

Finally, regarding the implementation of your storage infrastructure in GNU/Linux, the performance offered by software RAID configurations using mdadm is quite decent for many applications in data analysis, in particular if we can use SSD devices. However, we must remember that software RAID consumes part of the CPU resources, which will not be available for our processes. Hence, my recommendation is always to a
calculate the number of redirects for articles. In this way we can obtain, for example, how many revisions were received by encyclopedic articles (main), discussion pages (talk), and so forth.

6.1.3 Conducting the analysis

The directory tools/activity in WikiDAT has code to generate numerical results and graphics summarizing general activity metrics, at the macroscopic level, for any Wikipedia language. In the first place, it is necessary to retrieve the information for the 39 different Wikipedia languages included in the file data_august_2011.py, using the parsers as we have explained in Section 4.2. Then we must prepare our data file in CSV format, running this Python script:

jfelipe@blackstorm:~/WikiDAT/sources$ python data_august_2011.py

This will generate the data file data_082011.csv. However, if you want to go ahead and run the R script file directly, you can use a version of this file that has already been computed and included in WikiDAT. In this case, the script computes a series of general descriptive metrics for each Wikipedia language as of August 2011. We can change the timestamp limits in the queries to produce results for other months as well (see the sketch below); this process will be automated in forthcoming updates of WikiDAT. Finally, you can load the R script activity.R in RStudio and click on Source to produce the graphics and numerical results describing our data. The results and graphics can be found in the results and figs folders, respectively.
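For instance, shifting the analysis to another month only requires changing the timestamp bounds in the underlying queries. The following is a hypothetical fragment: the table and column names follow the MediaWiki schema, not necessarily the exact queries used in data_august_2011.py.

# Hypothetical fragment: parameterize the month under study instead of
# hard-coding August 2011 (column names follow the MediaWiki schema).
START, END = "20110801000000", "20110901000000"  # rev_timestamp bounds

query = ("SELECT COUNT(DISTINCT rev_user) FROM revision "
         "WHERE rev_timestamp >= %s AND rev_timestamp < %s")
# cur.execute(query, (START, END))   # cur: an open MySQLdb cursor
# (active_editors,) = cur.fetchone()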
[Figure: the WikiTrust Firefox add-on in action, colouring Wikipedia content according to the reputation metrics computed by the tool. Processing time: 1 week on a Core 2 Duo for the full English Wikipedia dump.]

3.7 Wikimedia Utilities

Wikimedia Utilities [1] is a very interesting piece of software that has not received general attention yet. This software was created by Aaron Halfaker, from the University of Minnesota, while working on a summer internship at the Wikimedia Foundation premises in San Francisco. The software is written in Python, and it has been conceived as a general-purpose application that distributes the analysis of Wikipedia data dumps over different subprocesses running on a multi-core machine. The main advantage of this application is that it is extensible; namely, we can extend the code to include additional calculations or any other tasks that we want to perform on pages, revisions or their associated wiki text as the dump file is decompressed (a sketch of this idea follows below). As an example, this baseline code has inspired the design of a new tool, written by Fabian Flick [16], for accurately tracking reverts of article content in Wikipedia.

3.8 WikiDAT: Wikipedia Data Analysis Toolkit

In April 2009 I released WikiXRay, a Python tool to automate the analyses included in my PhD dissertation, comparing the 10 largest Wikipedia languages at that time (by the official article count) from different quantitative perspectives [TODO CITATION]. The main advantage over other similar tools is that WikiXRay also pr
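To give a flavour of this kind of extension point (this is not Wikimedia Utilities' actual API, just a sketch of the idea), one can stream a decompressed dump and hand every <revision> element to a user-supplied callback:

# Sketch of an extensible per-revision processing loop (not the actual
# Wikimedia Utilities API): stream the XML, invoke a callback per revision.
import xml.etree.ElementTree as ET

def process_dump(xml_stream, on_revision):
    """Iterate a (decompressed) pages-meta-history stream, calling
    on_revision(elem) for every closed <revision> element."""
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag.endswith("revision"):  # tags carry an XML namespace
            on_revision(elem)
            elem.clear()  # free memory: full dumps never fit in RAM

def print_timestamps(rev):
    # Example extension: print each revision timestamp.
    for child in rev:
        if child.tag.endswith("timestamp"):
            print(child.text)

with open("furwiki-pages-meta-history.xml", "rb") as f:  # made-up name
    process_dump(f, print_timestamps)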
6.3 The study of ...
    6.3.1 Questions and goals
    6.3.2 Required data and tools
    6.3.3 Conducting the analysis

Part I Preparing for Wikipedia data analysis

Chapter 1 Understanding Wikipedia data sources

The first step in data analysis is always to find an interesting question, or a set of questions (sometimes hypotheses), that we want to answer or validate. Then we search for data to answer these questions or assess the validity of our hypotheses. Once we have found promising data sources, we need to collect these data, load or store them somewhere, and start the exploratory phase of data analysis. In this stage we try to compute numerical and graphical descriptors of our data set to validate its adequacy for the purposes of our study. In this stage we must also pay attention to detecting any special features in our data set (missing values, extreme values, incorrect labelling, etc.) that may threaten the validity of our analysis.

It is striking how little attention is frequently devoted to this important stage of data analysis. Space limitations in many types of research publications (journals, tech reports, conference papers, etc.) usually force authors to concentrate on the analysis results and discussion part, which