Text Algorithms (Textalgorithmen): Lecture Notes, Winter Semester 2003/04
Otto-von-Guericke-Universität Magdeburg, Fakultät für Informatik
Ralf Stiebe, [email protected]

Contents

Introduction
0  Concepts and Notation
1  Exact Search for a Single Word
   1.1  Naive algorithm
   1.2  Borders and periods
   1.3  The Knuth-Morris-Pratt algorithm
   1.4  Searching with deterministic finite automata
   1.5  The Boyer-Moore algorithm
   1.6  The Apostolico-Giancarlo algorithm
   1.7  Vishkin's algorithm
   1.8  The Shift-And algorithm
   1.9  Algorithms with suffix automata
   1.10 The Karp-Rabin algorithm
   1.11 A lower bound for word-search algorithms
2  Exact Search for Complex Patterns
   2.1  Searching for a finite set of words
   2.2  Searching for two-dimensional images
   2.3  Searching for regular expressions
   2.4  Some special cases of regular expressions
3  Inexact Search
   3.1  Introduction
   3.2  Solution by dynamic programming
   3.3  Alignments with a bounded number of errors
   3.4  Alignments with gaps
   3.5  Solution with linear space
   3.6  Solution in subquadratic time
   3.7  Multiple alignments
4  Indexing Texts
   4.1  Introduction
   4.2  Data structures for indexes
        4.2.1  Suffix trees
        4.2.2  Suffix arrays
        4.2.3  Variants and extensions
   4.3  Construction of suffix trees
        4.3.1  Naive algorithm
        4.3.2  McCreight's algorithm
        4.3.3  Ukkonen's online algorithm
   4.4  Construction of suffix arrays
        4.4.1  Construction by refinement
        4.4.2  The skew algorithm of Kärkkäinen and Sanders
   4.5  Applications of suffix trees and suffix arrays
        4.5.1  Repeated search in a fixed text
        4.5.2  Longest common substring
        4.5.3  Ziv-Lempel compression
        4.5.4  The Burrows-Wheeler transformation
Bibliography

Introduction

The processing of long character strings (strings, texts) is one of the fundamental tasks of computer science. It is therefore hardly surprising that, starting around 1970, numerous problems in this field were formulated and solved efficiently. More recently, the rise of the Internet and of bioinformatics has posed new challenges. This lecture gives an introduction to some important problems and algorithms on strings. It is organized into the following parts:

1. Exact word search: Given words P (the pattern) and T (the text), find all occurrences of P in T. We present several algorithms that solve this problem efficiently. In addition, we consider the problem of searching for a finite set of words.

2. Search for regular expressions: Given a text T and a regular expression α, find the substrings of T that belong to the language defined by α.

3. Suffix trees and suffix arrays: These data structures store a string S in such a way that every substring of S can be accessed quickly. A suffix tree stores all suffixes of S in a tree; a suffix array stores the lexicographic order of the suffixes of S. We treat efficient algorithms for the construction of suffix trees and suffix arrays. The most important application of both data structures is repeated searching in a fixed text T of length n. Once the suffix tree for T has been constructed, a search for a word P of length m takes O(m) time, that is, time independent of n. Further applications are the search for regularities in texts (e.g., long repetitions, very frequent substrings) and the Burrows-Wheeler transformation.

4. Similarity of strings: Finding similarities in two or more strings is of great importance in bioinformatics. The simplest task of this type is the computation of the Levenshtein distance (edit distance): given two words S1 and S2, how many operations of the form replace/insert/delete a single character are needed to transform S1 into S2? (A minimal sketch of the dynamic-programming idea follows this list.)
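A minimal sketch in Python of the dynamic-programming solution, assuming unit costs for all three operations; the function name is ours, and the single-row table anticipates the linear-space discussion of Section 3.5:

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance between s1 and s2 with unit costs for
    replace/insert/delete, computed row by row in O(|s1|*|s2|) time
    and O(|s2|) space."""
    n = len(s2)
    d = list(range(n + 1))          # row 0: turn the empty word into s2[:j]
    for i in range(1, len(s1) + 1):
        prev_diag, d[0] = d[0], i   # column 0: delete all of s1[:i]
        for j in range(1, n + 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,                              # delete s1[i-1]
                d[j - 1] + 1,                          # insert s2[j-1]
                prev_diag + (s1[i - 1] != s2[j - 1]),  # replace, or free match
            )
    return d[n]

assert levenshtein("kitten", "sitting") == 3  # replace k->s, e->i, insert g
```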
This script grew out of a lecture within theoretical computer science. Besides presenting the algorithms, it therefore puts correctness proofs and running-time analyses in the foreground.

Literature

In recent years a number of books (in English) on text algorithms have appeared, among others by Apostolico and Galil [1], Crochemore and Rytter [3, 4], Gusfield [6], Navarro and Raffinot [12], Smyth [14], and Stephen [13]. In French there is a book by Crochemore, Hancart, and Lecroq [5]. Exact word search is also treated in numerous general books on algorithms; the book by Heun [7] does so in particular detail.

This script follows the book by Gusfield [6], which is a very well written textbook and moreover conveys a comprehensive view of the applications of text algorithms in bioinformatics. There are also numerous sources on the Internet; only three are mentioned here:

• http://www.dei.unipd.it/~stelo/, an extensive link collection by Stefano Lonardi,
• http://www-igm.univ-mlv.fr/~lecroq/lec_en.html, the homepage of Thierry Lecroq, with an extensive presentation (including animations) of algorithms for exact word search and sequence comparison, as well as a comprehensive bibliography on text algorithms,
• http://dcc.uchile.cl/~gnavarro/eindex.html, the homepage of Gonzalo Navarro, with numerous original papers and survey articles from recent years as well as some useful programs.

Chapter 0  Concepts and Notation

We assume the material usually taught in an introductory course on algorithms and data structures. The notation of algorithms in pseudocode follows the conventions of the standard textbook Introduction to Algorithms by Cormen, Leiserson, and Rivest [2]; in particular, the body of a loop is indicated by the depth of indentation.

When determining running times, we assume that memory can be accessed in constant time and that the basic arithmetic operations on numbers of the order of magnitude of the word lengths considered can be carried out in constant time (unit cost model).

For a word S over an alphabet Σ, the length of S is denoted by |S|. The cardinality of a set M and the absolute value of a real number z are likewise written |M| and |z|; the intended meaning is clear from the context. The word of length 0 is called the empty word and is denoted by ε.

Let u and v be words. If u = u1 v u2, then v is called a substring (or infix) of u. If u = v u2, then v is called a prefix of u. If u = u1 v, then v is called a suffix of u. If v is a substring (prefix, suffix) of u with v ≠ u, then v is called a proper substring (proper prefix, proper suffix) of u. For 1 ≤ i ≤ |u|, u[i] denotes the character at position i of u. The substring of u from position i to position j is u[i..j]; if i > j or i > |u|, then u[i..j] is by definition the empty word. The set of all words over an alphabet Σ is denoted by Σ*, the set of all nonempty words by Σ+. By u^r we denote the word u read from right to left; if u = u^r, then u is called a palindrome.

A fundamental operation is the comparison of two characters. If such a comparison yields agreement, we speak of a match, otherwise of a mismatch.

Let (Σ, <) be a totally ordered alphabet. The lexicographic order <_lex on Σ* is obtained as follows (a small sketch in code follows the definition):

1. ε <_lex α for all α ∈ Σ+.
2. If a < b, then aα <_lex bβ for all α, β ∈ Σ*.
3. If α <_lex β, then aα <_lex aβ for all a ∈ Σ.
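A minimal sketch in Python of this definition, assuming the alphabet order is the order that Python's < places on individual characters; the function name is ours:

```python
def lex_less(alpha: str, beta: str) -> bool:
    """Decide alpha <_lex beta following the three rules above."""
    if not alpha:                  # rule 1: eps <_lex beta for nonempty beta
        return len(beta) > 0
    if not beta:                   # no word is <_lex the empty word
        return False
    if alpha[0] != beta[0]:        # rule 2: distinct first characters decide
        return alpha[0] < beta[0]
    return lex_less(alpha[1:], beta[1:])  # rule 3: strip the common first character

# Agrees with Python's built-in string order:
assert lex_less("ab", "b")
assert not lex_less("abc", "ab")
```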
Finally, we need the notion of a finite automaton. These structures are treated in detail in introductory books on theoretical computer science (e.g., [8]).

A nondeterministic finite automaton with ε-transitions (NFA) is a quintuple A = (Σ, Z, δ, I, Q), where Σ is an alphabet, Z is a finite set of states, δ ⊆ Z × (Σ ∪ {ε}) × Z is a transition relation (a set of transitions), I ⊆ Z is a set of initial states, and Q ⊆ Z is a set of accepting states. An NFA is frequently represented by its graph: a state is drawn as a node, a transition as a directed labeled edge; an initial state is marked by an arrow from the outside and an accepting state by a double circle.

The ε-closure δ′ of A is defined as follows:

1. δ′_0 := {(z, ε, z) : z ∈ Z},
2. δ′_{i+1} := {(y, ε, z) : ∃z′ ((y, ε, z′) ∈ δ′_i ∧ (z′, ε, z) ∈ δ)} for i ≥ 0,
3. δ′ := ⋃_{i≥0} δ′_i.

The relation δ is extended to a relation δ* ⊆ Z × Σ* × Z as follows:

1. δ^0 := δ′,
2. δ^{n+1} := {(y, wa, z) : w ∈ Σ^n ∧ a ∈ Σ ∧ ∃z′ ∃z″ ((y, w, z′) ∈ δ^n ∧ (z′, a, z″) ∈ δ ∧ (z″, ε, z) ∈ δ′)} for n ≥ 0,
3. δ* := ⋃_{n≥0} δ^n.

Intuitively, (y, w, z) lies in δ* exactly when the graph of the automaton contains a path from node y to node z labeled w. The language accepted by the NFA A = (Σ, Z, δ, I, Q) is

L(A) = {w ∈ Σ* : ∃z1 ∃z2 (z1 ∈ I ∧ z2 ∈ Q ∧ (z1, w, z2) ∈ δ*)}.

An NFA is called a (partial) deterministic finite automaton (DFA) if δ is a (partial) function δ : Z × Σ → Z and I contains exactly one state. It is well known that for every NFA there is an equivalent DFA, which, however, can be exponentially larger.
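A minimal sketch in Python of these definitions, assuming transitions are stored as triples (y, a, z) with the empty string standing in for ε; the toy automaton for the language a*b at the end is a hypothetical example, not from the script:

```python
def epsilon_closure(states, delta):
    """Pairs (y, z) with z reachable from y via epsilon-edges alone,
    computed as the limit of the chain delta'_0, delta'_1, ..."""
    closure = {(z, z) for z in states}            # delta'_0
    while True:
        step = {(y, z)
                for (y, mid) in closure
                for (p, a, z) in delta
                if a == "" and p == mid}          # append one epsilon-edge
        if step <= closure:                       # chain has stabilized
            return closure
        closure |= step

def accepts(word, states, delta, initial, accepting):
    """Acceptance test mirroring delta* and L(A): track every state
    reachable on the prefix read so far."""
    eps = epsilon_closure(states, delta)
    current = {z for y in initial for (p, z) in eps if p == y}
    for c in word:
        after = {z for y in current for (p, a, z) in delta if p == y and a == c}
        current = {z for y in after for (p, z) in eps if p == y}
    return bool(current & accepting)

# Toy instance: an NFA for the language a*b.
Z = {0, 1, 2}
delta = {(0, "a", 0), (0, "", 1), (1, "b", 2)}
assert accepts("aab", Z, delta, initial={0}, accepting={2})
assert not accepts("aa", Z, delta, initial={0}, accepting={2})
```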