Contents
1. and scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. I claim: 1. A computer-implemented method comprising: creating, by a computer system, a first array of lines of functional program code from a first program source code file, the first program source code file including the lines of functional program code of a first program written in a first programming language and lines of non-functional comments of the first program; creating, by the computer system, a second array of lines of non-functional comments from a second program source code file, the second program source code file including lines of functional program code of a second program written in a second programming language and the lines of non-functional comments of the second program, wherein the first programming language is different from the second programming language, and wherein the second array of non-functional comments is created using a predefined list of programming-language-specific special characters to determine the beginning and end of comments in the second program source code file; comparing, by the computer system, the lines of functional program code from the first array with the
2. archive.org/web/20030510140152/http://www.dcs.warwick.ac.uk/boss/manuals/sherlock.html. Pike et al., "Sherlock: Plagiarism Detector," 2002, retrieved from http://web.archive.org/web/20020804114150/http://www.cs.usyd.edu.au/~scilect/sherlock. (Continued). (65) Prior Publication Data: US 2009/0089754 A1, Apr. 2, 2009. Related U.S. Application Data: (63) Continuation-in-part of application No. 10/720,636, filed on Nov. 25, 2003, now Pat. No. 7,503,035. Primary Examiner: Lewis A. Bullock, Jr. Assistant Examiner: Jue Wang. (51) Int. Cl.: G06F 9/44 (2006.01). (74) Attorney, Agent, or Firm: James H. Salter. (52) U.S. Cl.: 717/123; 434/367; 726/32. (58) Field of Classification Search: None. See application file for complete search history. (57) ABSTRACT: Plagiarism of software source code is a serious problem in two distinct areas of endeavor: cheating by students at schools and intellectual property theft at corporations. A number of algorithms have been implemented to check source code files for plagiarism, each with their strengths and weaknesses. This invention detects plagiarism by comparing statements within source code of a first program to comments within source code of a second program. (56) References Cited, U.S. PATENT DOCUMENTS: 6,282,698 B1, 8/2001, Baker et al., 717/118; 6,976,170 B1, 12/2005, Kelly, 713/181; 7,493,596 B2, 2/2009, Atkin et al., 717/124; 7,568,109 B2, 7/2009
3. be considered as part of the suspected copied program and will be used in the comparison. If the user checks checkbox 509, the programming language selected in drop-down list 503 and the file types specified in field 504 will also be used for the files specified in folder2. If the user does not check checkbox 509, another language drop-down box and file type field will appear, allowing the user to specify a different programming language and different file types to be considered for the second set of files. Threshold drop-down box 510 allows a user to select how many files to be reported. For example, if a user selects a threshold of 8 files, and 9 files in folder2 have comments that are similar to statements in a file in folder1, only the 8 files with the highest similarity scores will be reported. It may be necessary to arbitrarily choose among the files to be displayed if, for example, files 8 and 9 have the same similarity score. The threshold is used to limit the size of the reports that are generated. If checkbox 511 is checked, then the comparison only compares a file in folder2 that has the same name as a file in folder1. This is done because sometimes file names are not changed when they are copied. This speeds up the comparison process but will miss cases where file names have been changed or code was moved from one file to another. The user clicks on compare button 512 to begin the comparison process. FIG. 6A and FIG. 6B illustrate sam
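The reporting threshold and the same-name option described in this excerpt can be illustrated with a minimal Python sketch. The function names `top_matches` and `same_name_pairs` and the dictionary input are assumptions made for illustration only; the patent does not specify an implementation. Ties at the cut-off are resolved by input order here, which is one arbitrary choice among many.

```python
from pathlib import PurePosixPath


def top_matches(scores, threshold):
    """Keep only the `threshold` highest-scoring files (hypothetical helper).

    `scores` maps a folder2 file name to its similarity score. Python's sort
    is stable, so files with equal scores keep their input order; the cut at
    the threshold is therefore arbitrary among tied files.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:threshold]


def same_name_pairs(folder1_files, folder2_files):
    """Pair each folder1 file only with folder2 files of the same base name."""
    by_name = {}
    for path in folder2_files:
        by_name.setdefault(PurePosixPath(path).name, []).append(path)
    return [
        (first, second)
        for first in folder1_files
        for second in by_name.get(PurePosixPath(first).name, [])
    ]


if __name__ == "__main__":
    demo_scores = {f"file{i}.c": 100 - i for i in range(9)}
    for name, score in top_matches(demo_scores, 8):
        print(score, name)
    print(same_name_pairs(["orig/a.c", "orig/b.c"], ["copy/a.c", "copy/z.c"]))
```

Restricting the comparison to same-name pairs reduces the number of file comparisons, at the cost noted in the text: renamed files and moved code are missed.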
4. code from the first array with the lines of non-functional comments from the second array to find similar lines; means for calculating a similarity number based on the similar lines, wherein calculating the similarity number comprises finding a number of matching functional statements in the first array and non-functional comments in the second array, indicating that the non-functional comments of the second program source code file contain functional program code of the first program source code file; and means for presenting to a user an indication of copying of the first program source code file, wherein said indication of copying is defined by the similarity number. 8. The apparatus of claim 7, wherein the means for calculating a similarity number comprises means for finding a number of matching lines in the first and second arrays weighted by the number of characters in the lines. 9. The apparatus of claim 7, wherein the means for calculating a similarity number comprises finding a number of lines in the first and second arrays that have an edit distance less than a given threshold.
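The similarity numbers named in these claims (a count of matching lines, a count weighted by line length, and matches within an edit-distance threshold), together with the binary score described elsewhere in the specification, can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the Levenshtein distance is used as one possible edit distance, and "less than a given threshold" is read as a strict inequality, so a threshold of 1 means exact matches.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution
            ))
        previous = current
    return previous[-1]


def matching_statements(statements, comments, threshold=1):
    """Statements whose distance to at least one comment is below threshold."""
    return [
        s for s in statements
        if any(levenshtein(s, c) < threshold for c in comments)
    ]


def count_score(statements, comments, threshold=1):
    """Number of statements that match some comment."""
    return len(matching_statements(statements, comments, threshold))


def weighted_score(statements, comments):
    """Sum of A_i over exact matches, taking A_i as the matched line length."""
    return sum(len(s) for s in matching_statements(statements, comments))


def binary_score(statements, comments, threshold=1):
    """1 if any statement matches any comment, otherwise 0."""
    return 1 if matching_statements(statements, comments, threshold) else 0


if __name__ == "__main__":
    stmts = ["x = x + 1", "return total", "count = 0"]
    cmts = ["x = x + 1", "return totals", "see the manual"]
    print(count_score(stmts, cmts), weighted_score(stmts, cmts))
    print(count_score(stmts, cmts, threshold=2), binary_score(stmts, cmts))
```

Comparing every statement against every comment is quadratic in the array sizes; that matches the all-pairs comparison the specification describes, and a production tool might index exact matches in a set first.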
5. dom access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein. The present invention may be provided as a computer program product, or software, that may include a machine-accessible medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the present invention. A machine-accessible medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-accessible (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium (e.g., read only memory (ROM), r
6. [FIG. 5: user interface drawing; the rotated drawing text did not survive extraction.] U.S. Patent, Oct. 26, 2010, Sheet 4 of 6, US 7,823,127 B2. [FIG. 6A: CodeCross Basic Report 600.] Version 1.0.0 (601). Date: 11/22/08. Time: 13:09:09. CodeSuite copyright 2003-2008 by Software Analysis and Forensic Engineering Corporation. SETTINGS: Compare files in folder "D S A F E code development CodeSuite test CodeCross files 1", including subdirectories, to files in folder "D S A F E code development CodeSuite test CodeCross files 2", including subdirectories. Programming language: C (602). Filetypes: c, h. Reporting file threshold: 8 files. First result block (603), for a commented file in the "files 1" folder (file name garbled in extraction), listing Score and Compared To File for files in the "files 2" folder: 100, aaa.c; 100, aaa_with_comments.c; 100, abc.c; 100, bpf dump semicolons c; 100, bpf dump strings c; 100, semicolon_test.c. Second result block, for bpf_dump.c in the "files 1" folder: Score, Compared To F
7. For example, the source code file snippet 201 includes double-slash characters (//) that are used to denote comments that start after the double slash and continue until the end of the line. The source code file snippet 201 also includes the character sequence /* to denote the beginning of a comment and the character sequence */ to denote the end of a comment. Once the array creators 301 and 302 create their respective arrays, the comparator 303 performs the comparison using these arrays. The comparator 303 compares each entry in the string array to each entry in the comment array. When source code is copied, functional statements are sometimes commented out and used as a guide for writing new code. Hence, copied source code may contain statements that have been commented out. This comparator 303 calculates a similarity score based on the number of statements in the first file that are similar to comments in the second file. In one embodiment, similar strings consist of exact matches. In other words, the number of matching statements and comments includes only statements and comments where each and every character in the string exactly matches the corresponding character in the text sequence. In another embodiment, similarity score s can represent the number of matching statements and comments in the pair of arrays weighted by the number of characters in the matching lines, and can be determined using the following equ
8. Powell et al., 713/187; 2006/0129523 A1, 6/2006, Roman et al., 707/1. 9 Claims, 6 Drawing Sheets. [Representative drawing, FIG. 4: 401 Determine programming language specific information; 402 Source code file 1: Create statement array; 403 Source code file 2: Create comment array; 404 Statement/Comment matching; 405 Display similarity score.] US 7,823,127 B2, Page 2. OTHER PUBLICATIONS: Engels et al., "Plagiarism Detection Using Feature-Based Neural Networks," 2007, SIGCSE '07. Peer to Patent prior art submission report for 7568109. Hunt et al., "An Algorithm for Differential File Comparison," 1976, Computer Science Technical Report 41, AT&T Bell Laboratories. Spafford et al., "Software Forensics: Can We Track Code to Its Authors?" 1992, Technical Report, Department of Computer Science, Purdue University. Kilgour et al., "A Fuzzy Logic Approach to Computer Science Software Source Code Authorship Analysis," 1997, Fourth International Conference on Neural Information Processing, The Annual Conference of the Asian Pacific Neural Network Assembly. Arwin et al., "Plagiarism Detection across Programming Languages," 2006, Proceedings of the 29th Australasian Computer Science Conference, vol. 48. Print pub: UNIX diff command utility, James Hunt, Jan. 1, 1976. * cited by examiner. U.S. Patent, Oct. 26, 2010. [FIG. 1: Computing Device; Code Plagiarism Detector.]
9. US007823127B2. (12) United States Patent: Zeidman. (10) Patent No.: US 7,823,127 B2. (45) Date of Patent: Oct. 26, 2010. (54) DETECTING PLAGIARISM IN COMPUTER SOURCE CODE. (75) Inventor: Robert Marc Zeidman, Cupertino, CA (US). (73) Assignee: Software Analysis and Forensic Engineering Corp., Cupertino, CA (US). (*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days. (21) Appl. No.: 12/330,492. (22) Filed: Dec. 8, 2008. OTHER PUBLICATIONS: Paul Heckel, "A Technique for Isolating Differences Between Files," Apr. 1978, Communications of ACM, vol. 21, Issue 4, pp. 264-268. J. Howard Johnson, "Substring Matching for Clone Detection and Change Tracking," 1994, Software Engineering Laboratory, National Research Council of Canada. Baker, "On Finding Duplication and Near-Duplication in Large Software Systems," 1995, Reverse Engineering 1995. Michael Wise, "YAP3: improved detection of similarities in computer program and other texts," 1996, SIGCSE '96. Joy et al., "Plagiarism in Programming Assignments," 1999, IEEE Transactions on Education, vol. 42, No. 2, pp. 129-133. Marcus et al., "Identification of High-Level Concept Clones in Source Code," 2001, Automated Software Engineering (ASE 2001). Lucca et al., "An Approach to Identify Duplicated Web Pages," 2002, COMPSAC 2002. Hart et al., "Sherlock User Manual," Nov. 2002, retrieved from http://web
10. a client machine in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines (e.g., computers) that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. The exemplary computer system 700 includes a processor 701, a main memory 702 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 703 (e.g., flash memory, static random access memory (SRAM), etc.), and a secondary memory 708 (e.g., a data storage device), which communicate with each other via a bus 709. Processor 701 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processor 701 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microproc
11. al comments of the first program; creating a second array of lines of non-functional comments from a second program source code file, the second program source code file including lines of functional program code of a second program written in a second programming language and the lines of non-functional comments of the second program, wherein the first programming language is different from the second programming language, and wherein the second array of non-functional comments is created using a predefined list of programming-language-specific special characters to determine the beginning and end of comments in the second program source code file; comparing the lines of functional program code from the first array with the lines of non-functional comments from the second array to find similar lines; calculating a similarity number based on the similar lines, wherein calculating the similarity number comprises finding a number of matching functional statements in the first array and non-functional comments in the second array, indicating that the non-functional comments of the second program source code file contain functional program code of the first program source code file; and presenting to a user an indication of copying of the first program source code file, wherein said indication of copying is defined by the similarity number. 5. The computer-readable storage medium of claim 4, wherein calculating a similarity number
12. andom access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.), a machine (e.g., computer) readable transmission medium (electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.)), etc. FIG. 1 illustrates a block diagram of a system for detecting program code plagiarism in accordance with one embodiment of the invention. The system includes a computing device 101 and a data storage device 103. The data storage device 103 may be a mass storage device, such as a magnetic or optical storage based disk or tape, and may be part of the computing device 101 or be coupled with the computing device 101 directly or via a network (e.g., a public network such as the Internet or a private network such as a local area network (LAN)). The computing device 101 may be a personal computer (PC), palm-sized computing device, personal digital assistant (PDA), server, or other computing device. The computing device 101 hosts a code plagiarism detector 102 that can detect plagiarism by examining source code of two different programs. The code plagiarism detector 102 detects plagiarism by comparing a first computer program source code file with a second computer program source code file. The files being compared may be stored in the data storage device 103. In one embodiment, the code plagiarism
13. ation: $s = \sum_{i=1}^{m} A_i$, where $m$ is the number of matching statements and $A_i$ is the number of matching characters in matching statement $i$. In another embodiment, similar statements and comments are not limited to exact matches and also include partial matches. It may be that in the copied source code the statements were commented out, but in the original source code the statements went through further changes after the source code was copied. Hence, this other embodiment considers partial matches where the distance between a string and a comment is below some predefined threshold. This distance can be some well-known distance measure such as the Levenshtein distance (also known as the edit distance), the Damerau-Levenshtein distance, or the Hamming distance. In yet another embodiment, a similarity score can be calculated as a binary value of 0 or 1. If there is at least one statement in the first source code file that is similar to one comment in the second source code file, the similarity score is 1; otherwise it is 0. This can be done because just the fact that a single statement in the first program appears as a comment in the second program is enough reason to warrant further examination. This binary calculation does not make a value judgment about the commenting but simply directs a user to look more carefully at this suspicious phenomenon. The output display 304 generates an output to a user, such as a report that may inclu
14. comprises finding a number of matching lines in the first and second arrays weighted by the number of characters in the lines. 6. The computer-readable storage medium of claim 4, wherein calculating a similarity number comprises finding a number of lines in the first and second arrays that have an edit distance less than a given threshold. 7. A computer-implemented apparatus comprising: a computer; and a source code matching program on the computer, the source code matching program comprising: means for creating a first array of lines of functional program code from a first program source code file, the first program source code file including the lines of functional program code of a first program written in a first programming language and lines of non-functional comments of the first program; means for creating a second array of lines of non-functional comments from a second program source code file, the second program source code file including lines of functional program code of a second program written in a second programming language and the lines of non-functional comments of the second program, and wherein the second array of non-functional comments is created using a predefined list of programming-language-specific special characters to determine the beginning and end of comments in the second program source code file; means for comparing the lines of functional program
15. de a list of file pairs ordered by the result of the similarity score calculated by comparator 303, as will be discussed in more detail below. FIG. 4 illustrates a flow diagram of one embodiment of a method of detecting source code plagiarism. The method may be performed by processing logic that may comprise hardware (e.g., circuitry, programmable logic, microcode, etc.), software (such as instructions run on a processing device), or a combination thereof. In one embodiment, the method is performed by a code plagiarism detector (e.g., code plagiarism detector 102 of FIG. 1). Referring to FIG. 4, the method begins with processing logic determining program-language-dependent information (block 401). Program-language-dependent information may include, for example, comment delimiter characters. Program-language-dependent information may be hard-coded or provided by a user. At block 402, processing logic creates a statement array for a source code file of a first program. At block 403, processing logic creates a comment array for a source code file of a second program. Processing logic at blocks 402 and 403 may create the above arrays using the program-language-dependent information. At block 404, processing logic compares the statement array of the first source code file to the comment array of the second source code file and creates a list of similar strings. Processing logic uses the number of s
16. detector 102 pre-processes the files being compared prior to performing the comparison. As will be discussed in more detail below, the code plagiarism detector 102 may create data structures (e.g., arrays) for the files being compared and may store the data structures in the data storage 103. The code plagiarism detector 102 may then compare entries of the data structures and calculate a similarity score based on the number of similar entries in the data structures, where the similarity score indicates a possibility of plagiarism. The code plagiarism detector 102 may generate a report and store it in the data storage 103 or display it to a user of the computing device 101 or some other computing device coupled to the device 101 (e.g., directly or via a network). In one embodiment of the present invention, each line of two source code files is initially examined and a string array for each file is created. Statements1 is the collection of functional statements in the first file, and Comments2 is the collection of non-functional comments in the second file. A sample snippet 201 of a source code file to be examined is shown in FIG. 2A. The array of statements 202 and comments 203 for the code snippet 201 is shown in FIG. 2B. Note that whitespace is not removed entirely, but rather all sequences of whitespace characters are replaced by a single space in both source lines and comment lines. In this way, the individual words are preserved in th
17. e network interface device 704. The machine-accessible storage medium 708 may also be used to store source code files 714. While the machine-accessible storage medium 713 is shown in an exemplary embodiment to be a single medium, the term "machine-accessible storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-accessible storage medium" shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention. The term "machine-accessible storage medium" shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media. It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. Although the present invention has been described with reference to specific exemplary embodiments, it will be recognized that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit
18. e strings. Certain separator characters are treated as whitespace. The comment characters themselves (in this case //, /*, and */) are stripped off from the comments. Special characters such as comment delimiters and separator characters are defined in a language definition file that is input to this embodiment of the present invention. FIG. 3 illustrates a block diagram of one embodiment of a code plagiarism detector 102 that compares a source code file of a first program with a source code file of a second program. The code plagiarism detector 102 includes statement array creator 301, comment array creator 302, comparator 303, and output display 304. The statement array creator 301 examines lines of the source code file and creates a statement array. The statement array includes functional statements that are found in the source code. The comment array creator 302 examines lines of the source code file and creates a comment array. The comment array includes non-functional comments that are found in the source code. The comparator 303 compares the statements in the statement array to the comments in the comment array. The output display 304 takes the output of the comparator 303 and displays it to the user. The comment array creator 302 uses a predefined list of special characters, which is programming language specific, to correctly determine the beginning and end of comments in the code in order to construct the comment array.
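The array creators described in this excerpt can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `LanguageDefinition` class, the `C_LIKE` instance with its delimiters and separator characters, and the `split_source` function are all hypothetical names, and the sketch ignores comment delimiters that appear inside string literals.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageDefinition:
    """Hypothetical stand-in for the language definition file."""
    line_comment: str   # starts a comment that runs to the end of the line
    block_open: str     # starts a multi-line comment
    block_close: str    # ends a multi-line comment
    separators: str     # characters treated as whitespace


# Assumed values for a C-like language; not taken from the patent.
C_LIKE = LanguageDefinition("//", "/*", "*/", "{};")


def _normalize(text, lang):
    """Treat separators as whitespace and collapse whitespace runs."""
    for ch in lang.separators:
        text = text.replace(ch, " ")
    return re.sub(r"\s+", " ", text).strip()


def split_source(source, lang=C_LIKE):
    """Return (statements, comments), one normalized string per source line."""
    statements, comments = [], []
    in_block = False
    for line in source.splitlines():
        code, comment = [], []
        i = 0
        while i < len(line):
            if in_block:
                end = line.find(lang.block_close, i)
                if end == -1:
                    comment.append(line[i:])
                    i = len(line)
                else:
                    comment.append(line[i:end])
                    i = end + len(lang.block_close)
                    in_block = False
            elif line.startswith(lang.line_comment, i):
                comment.append(line[i + len(lang.line_comment):])
                i = len(line)
            elif line.startswith(lang.block_open, i):
                in_block = True
                code.append(" ")  # a comment separates tokens like whitespace
                i += len(lang.block_open)
            else:
                code.append(line[i])
                i += 1
        statement_text = _normalize("".join(code), lang)
        comment_text = _normalize(" ".join(comment), lang)
        if statement_text:
            statements.append(statement_text)
        if comment_text:
            comments.append(comment_text)
    return statements, comments


if __name__ == "__main__":
    sample = "int x = 1; // counter\n/* y = x + 2;\n   return y; */\n"
    print(split_source(sample))
```

Applying the same normalization to both arrays means a statement and its commented-out copy reduce to the same string, which is what lets an exact comparison find them.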
19. essor, very long instruction word (VLIW) microprocessor, processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processor 701 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. Processor 701 is configured to execute the processing logic 711 for performing the operations and steps discussed herein. The computer system 700 may further include a network interface device 704. The computer system 700 also may include a video display unit 705 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 706 (e.g., a keyboard), and a cursor control device 707 (e.g., a mouse). The secondary memory 708 may include a machine-accessible storage medium (or more specifically a machine-accessible storage medium 713) on which is stored one or more sets of instructions embodying any one or more of the methodologies or functions described herein. The software 712 may reside, completely or at least partially, within the main memory 702 and/or within the processor 701 during execution thereof by the computer system 700, the main memory 702 and the processor 701 also constituting machine-accessible storage media. The software 712 may further be transmitted or received over a network 710 via th
20. hile requiring a small amount of programming-language-specific information. Such programming-language-specific information includes characters used to delimit comments in the particular programming language. In the following description, numerous details are set forth. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention. Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters
21. ile (604): 100, bpf_dump_strings.c; 100, W32NReg_commented.c (both in the "files 2" folder). [End of FIG. 6A.] [FIG. 6B, Sheet 5 of 6: detailed report comparing file1 (610) to file2 (611) and listing matching statements and comments by file and line; most of the drawing text did not survive extraction.] [FIG. 7, Sheet 6 of 6: block diagram with the labels PROCESSOR, PROCESSING LOGIC, ALPHA-NUMERIC INPUT DEVICE, VIDEO DISPLAY, STATIC MEMORY, CURSOR CONTROL DEVICE, NETWORK INTERFACE DEVICE, SECONDARY MEMORY, MACHINE-ACCESSIBLE STORAGE MEDIUM, and SOURCE CODE FILES; reference numerals 700-711, 713, and 714.] DETECTING PLAGIARISM IN COMPUTER SOURCE CODE. RELATED APPLICATIONS. The present application is a continuation-in-part application of U.S. patent application Ser. No. 10/720,636, filed Nov. 25, 2003, now U.S. Pat. No. 7,503,035, which is incorporated herein by reference. TECHNICAL FIELD. Embodiments of the present invention relate to software tools for comparing program source code files to detect code copied from one file to another. In particular, the present invention relates to finding pairs of source code files that have been copied in full o
22. imilar statements and comments to generate a similarity score. At block 405, processing logic generates a report based on the comparison. FIG. 5 shows one embodiment of a user interface for the present invention. The user interface screen 500 contains a number of fields for accepting user input. Folder1 field 501 allows the user to type a path to a folder containing source code files of the original program to be compared. Alternatively, the user can click on browse button 506, which will allow the user to browse folders and select one that will be automatically entered into folder1 field 501. The user selects a programming language from the drop-down list 503 of known computer programming languages. The user selects a file type or list of file types containing source code from drop-down list 504. Alternatively, the user can type a file type or list of file types into field 504. If the user checks checkbox 505, source code files in all subdirectories of folder1 will also be considered as part of the original program and will be used in the comparison. Folder2 field 502 allows the user to type a path to a folder containing source code files of the suspected copied program to be compared. Alternatively, the user can click on browse button 508, which will allow the user to browse folders and select one that will be automatically entered into folder2 field 502. If the user checks checkbox 507, source code files in all subdirectories of folder2 will also
23. lines of non-functional comments from the second array to find similar lines; calculating, by the computer system, a similarity number based on the similar lines, wherein calculating the similarity number comprises finding a number of matching functional statements in the first array and non-functional comments in the second array, indicating that the non-functional comments of the second program source code file contain functional program code of the first program source code file; and presenting to a user an indication of copying of the first program source code file, wherein said indication of copying is defined by the similarity number. 2. The method of claim 1, wherein calculating a similarity number comprises finding a number of matching lines in the first and second arrays weighted by the number of characters in the lines. 3. The method of claim 1, wherein calculating a similarity number comprises finding a number of lines in the first and second arrays that have an edit distance less than a given threshold. 4. A computer-readable storage medium storing executable instructions to cause a computer system to perform a method comprising: creating a first array of lines of functional program code from a first program source code file, the first program source code file including the lines of functional program code of a first program written in a first programming language and lines of non-function
24. ne deficiency of the aforementioned programs is that they only compare functional code. One program, CodeMatch, developed by Robert Zeidman, the inventor of the present invention, overcomes this deficiency by dividing program source code into elements, including functional code (statements, identifiers, and instruction sequences) and non-functional code (comments and strings), and compares these different elements in the source code files of different programs to each other. Clever programmers will often make significant changes to the appearance, but not the functionality, of the functional source code in order to disguise copying. The resulting functional code looks very different but functions identically to the original code from which it was copied. When trying to disguise copying, a programmer may copy a function from one program's source code into another program's source code and comment it out in order to use the code as a guide for writing a similar function. Often, programmers making changes to disguise functional statements do not make changes to the commented code because it is non-functional and escapes their notice. None of the previously mentioned tools will find this sure sign of plagiarism. Accordingly, it would be beneficial to have a plagiarism detection tool that can compare functional code in one source code file to nonfunctional code in a
25. nother source code file in order to overcome the above limitations of the conventional techniques.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
FIG. 1 illustrates a block diagram of a system for the present invention in accordance with one embodiment of the invention.
FIG. 2 illustrates dividing a file of source code into statements and comments.
FIG. 3 illustrates the software architecture of one embodiment of the present invention.
FIG. 4 illustrates a flow diagram of one embodiment of the present invention.
FIG. 5 illustrates a user interface of one embodiment of the invention.
FIG. 6 illustrates a basic report and a detailed report output in accordance with one embodiment of the invention.
FIG. 7 illustrates a block diagram of an exemplary computer system in accordance with one embodiment of the invention.
DETAILED DESCRIPTION
Methods and systems for detecting copied program code based on source code are described. In one embodiment, signs of possible copying are detected by comparing source code functional statements of a first program with source code non-functional comments of a second program suspected of being copied from the first program.
Embodiments of the invention make use of a basic knowledge of programming languages and program structures to simplify the matching task w
26. ple reports generated by one embodiment of the present invention indicating possible plagiarism. Referring to FIG. 6A, an HTML output report 600 includes a list of file pairs ordered by their total correlation scores. The report 600 includes a report description 601, a header 602 showing the chosen settings, and rankings of file pair matches 603 and 604 based on their similarity scores. Each correlation score on the left in sections 603 and 604 is also a hyperlink to a detailed report for that particular file pair. FIG. 6B illustrates a detailed report 610 showing similar statements and comments in a specific file pair. In this way, experts are directed to suspicious similarities and allowed to make their own judgments. The detailed report 610 includes a header 611 that tells which files are being compared. Furthermore, the detailed report includes a detailed description of the statements in file1 that matched comments in file2, as shown in the table 612. FIG. 7 illustrates a diagrammatic representation of a machine in the exemplary form of a computer system 700 within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed. In alternative embodiments, the machine may be connected (e.g., networked) to other machines in a Local Area Network (LAN), an intranet, an extranet, or the Internet. The machine may operate in the capacity of a server or
27. r in part by detecting functional code in one file that has been commented out in another file.
BACKGROUND
Plagiarism detection programs and algorithms have been around for a number of years but have gotten more attention recently due to two main factors. First, the Internet and search engines like Google have made source code very easy to obtain. Second, the open source movement has grown tremendously over the past several years, allowing programmers all over the world to write, distribute, and share code. In recent years, plagiarism detection techniques have become more sophisticated. A summary of available tools is given by Paul Clough in his paper entitled "Plagiarism in natural and programming languages: an overview of current tools and technologies." Clough discusses tools and algorithms for finding plagiarism in generic text documents as well as in programming language source code files. There are a number of plagiarism detection programs currently available, including the Plague program developed by Geoff Whale at the University of New South Wales; the YAP programs (YAP, YAP2, YAP3) developed by Michael Wise at the University of Sydney, Australia; the JPlag program written by Lutz Prechelt and Guido Malpohl of the University Karlsruhe and Michael Philippsen of the University of Erlangen-Nuremberg; and the Measure of Software Similarity (MOSS) program developed at the University of California at Berkeley by Alex Aiken. O
28. terms, numbers, or the like. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "communicating," "executing," "passing," "determining," "generating," or the like refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices. The present invention also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), ran
29. [Drawing sheets, U.S. Patent US 7,823,127 B2, Oct. 26, 2010, Sheets 1 to 3 of 6. Only the labels below are recoverable from the extracted text.]
[Figure 2A: source code of an fdiv routine (201) and its statement array, with entries labeled Statement1 0 through Statement1 5.]
[Figure 2B: the comment array for the same routine (202, 203), with entries labeled Comment1 0 through Comment1 2: "fdiv routine", "This routine is important", "find the file extension".]
[Figure 3: Code Plagiarism Detector, with a Statement Array Creator fed by source code file 1, a Comment Array Creator fed by source code file 2, and an Output Display (reference numerals 301, 302, 304).]
[Figure 4: 401 Determine programming language specific information; 402 Source code file 1: create statement array; 403 Source code file 2: create comment array; 404 Statement/comment matching; 405 Display similarity score.]
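The flow listed for Figure 4 (blocks 401 to 405) can be illustrated with a minimal sketch. It is not the patented implementation: the function names `split_source`, `edit_distance`, and `similarity` are hypothetical, the comment delimiters are hard-coded for a C-like language as an assumption, and the score combines the character weighting of claim 2 with the edit-distance threshold of claim 3 purely for illustration. String literals that happen to contain `//` or `/*` are not handled.

```python
import re

# Assumption: delimiters for a C-like language. The patent only says that a
# predefined list of language-specific special characters marks comments.
LINE_COMMENT = "//"
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def split_source(text):
    """Return (statements, comments) as lists of whitespace-normalized lines."""
    comments = []

    def grab(match):
        # Keep the comment body without the /* and */ markers.
        comments.extend(match.group(0)[2:-2].splitlines())
        return " "

    code = BLOCK_COMMENT.sub(grab, text)
    statements = []
    for line in code.splitlines():
        head, sep, tail = line.partition(LINE_COMMENT)
        statements.append(head)
        if sep:
            comments.append(tail)

    def clean(lines):
        # Collapse runs of whitespace and drop empty lines.
        return [n for n in (" ".join(l.split()) for l in lines) if n]

    return clean(statements), clean(comments)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(statements, comments, threshold=1):
    """Sum the lengths of statements whose edit distance to some comment is
    below `threshold` (threshold=1 means exact matches only)."""
    score, matches = 0, []
    for s in statements:
        for c in comments:
            if edit_distance(s, c) < threshold:
                score += len(s)
                matches.append((s, c))
                break
    return score, matches


original = """
int j = 0;
while (1) { j++; }
printf("hello world");
"""
suspect = """
// while (1) { j++; }
int k = 0;
/* printf("hello world"); */
for (;;) { k++; }
"""

stmts, _ = split_source(original)   # block 402: statement array from file 1
_, cmts = split_source(suspect)     # block 403: comment array from file 2
score, matches = similarity(stmts, cmts)  # block 404: matching
print(score, matches)               # block 405: report the result
```

Raising `threshold` lets near-copies count as matches, at the cost of more false positives on short lines; weighting by line length keeps trivial matches such as a lone brace from dominating the score.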