Project Proposal May Be Jointly Funded by ICT R&D Fund and Other Funding Agencies/Industry

Proposal / Application for ICT-Related Development and Research Grant

Design and Implementation of a Language Independent Software Clone Management Tool Suite for Single and Multiple Systems

Submitted by: Hamid Abdul Basit, Assistant Professor, Department of Computer Science, Lahore University of Management Sciences (LUMS)

Read carefully before filling the form.

1. Please do not alter the layout of the application form. Information must be filled in the spaces provided, under set format.
2. Guidance notes in various fields should not be deleted.
3. Required information should be duly filled in the specified fields.
4. Required heads/fields which are not relevant to the project should be marked N/A (Not Applicable) or left blank and should not be deleted.
5. Specifications, justifications, purposes must be provided against each item in the Budget file.
6. Please do not change the formulas in the budget sheets.
7. We have prepared financial guidelines to evaluate the remuneration for human resource associated with the proposed project. The guidelines are available on our website (click Application Forms > Financial Guidelines for Preparation of R&D and HRD Proposals).

Application for ICT-Related Development and Research Grant Page I

List of Abbreviations and Acronyms

EE: External Evaluators
ICT: Information and Communication Technologies
IPR: Intellectual Property Rights
CPI: Co-Principal Investigator
PI: Principal Investigator
PIO: Principal Investigator’s Organization. "Principal Investigator’s Organization" means the person, company, partnership, undertaking, concern, association of persons, body of individuals, consortium or joint venture which receives funding from the Company to execute a research and development project.
R&D: Research and Development

List of Abbreviations and Acronyms Used by PI in the Proposal (Please add abbreviations and acronyms in the table below, if any.)

VCL: Variant Configuration Language
XVCL: XML-based Variant Configuration Language
CDDC: Clone Detection and Documentation Component
CVAC: Clone Visualization and Analysis Component
CFC: Clone Framing Component
CTC: Clone Tracking Component
BPC: Back Propagation Component
CR: Clones Repository
FR: Frames Repository
CB: Code Base
IDE: Integrated Development Environment
SCC: Simple Clone Class
MCC: Method Clone Class
FCC: File Clone Class
SCS: Simple Clone Structure
MCS: Method Clone Structure
FCS: File Clone Structure
PDG: Program Dependence Graph
AST: Abstract Syntax Tree
SVN: SubVersioN
LSH: Locality Sensitive Hashing
XML: eXtensible Markup Language
HTML: Hypertext Markup Language
RCF: Rich Clone Format
AOP: Aspect-Oriented Programming
STL: Standard Template Library
NERF: Non-Extendible Repeat Finder
RTF: Repeated Token Finder
FCIM: Frequent Closed Itemset Mining
FIM: Frequent Itemset Mining
SME: Small-Medium Enterprise
LUMS: Lahore University of Management Sciences


Application for ICT-Related Development and Research Grant Guidelines and Forms

Introduction

The National ICT R&D Fund was created in January 2007 by the Ministry of IT with the vision to transform Pakistan's economy into a knowledge-based economy by promoting efficient, sustainable, and effective ICT (IT and Telecommunications) initiatives through synergic development of industrial and academic resources. Collaborative efforts between academia, research institutions, and industry are greatly encouraged to ensure that the local economy can reap the monetary benefits of investment in research. This organization has significant funds available for proposals that are geared towards creating ICT-related technologies.

Research grants will be awarded for high-level and promising ICT-related development and research projects by individuals or groups from academia and/or industry actively involved in the research and development individually or collaboratively. These projects should be based on either a universally known technology or a new technology developed by the applicant and should be aimed at achieving economically viable systems, products, or processes beneficial to the nation.

The grant will cover the honoraria of the principal investigator and co-principal investigators, salaries of professional researchers and developers at market rate, stipends for student research assistants, and supporting staff. It will also cover travel within and outside the country for project-related activities and for scientific conferences where a research paper by the project team, written as an outcome of the project, has been selected for presentation. The grant may be used to purchase very specific unavoidable equipment kept to the bare minimum, consumable materials, and other items needed for the project.

Submission Procedure

Duly filled application forms, complete in all respects, along with any documents should be submitted online through the Fund’s website, www.ictrdf.org.pk. A hard copy should also be submitted by registered post or by fax at our mailing address given below. On receipt of the applications, the proposals will be evaluated internally as well as externally as laid down in our policy documents. The PI may need to revise the proposal in light of the evaluators’ recommendations.

There is no deadline for submission of the application forms for Unsolicited Projects. The deadline for Solicited Projects will be given in the RFPs whenever floated.

Joint Funding

The project proposal may be jointly funded by ICT R&D Fund and other funding agencies/industry. The efforts to obtain joint funding will be at the discretion of the Principal Investigator’s Organization (PIO) to which the Principal Investigator belongs. However, any such information must be provided to ICT R&D Fund. The funds released will be provided to the PI.


Agreement

A written agreement will be made between the National ICT R&D Fund and the PI. The PI will undertake to administer the grant according to the agreement and to provide laboratory space and other facilities necessary for the project. The equipment purchased with ICT R&D Fund money for the approved project will remain the property of ICT R&D Fund. The laptops will be returned to ICT R&D Fund after completion of the project. The grantee is required to submit a final narrative and financial report within one month of the completion of the project. IPR issues will be settled according to the policy in force.

For further information, please contact:

Solicitation and Evaluation Department,
National ICT R&D Fund,
6th Floor, HBL Towers, Jinnah Avenue, Blue Area, Islamabad
Tel.: (+92-51) 921 5360 - 65
Fax: (+92-51) 921 5366
Website: www.ictrdf.org.pk


Table of Contents

1. Project Identification
   A. Reference Number
   B. Project Title
   C. Principal Investigator (PI)
   D. Principal Investigator’s Organization (PIO)
   E. Other Organizations Involved in the Project
   F. Key Words
   G. Research and Development Theme
   H. Project Status
   I. Project Duration
   J. Executive Summary
2. Scope, Introduction and Background of the Project
   A. Scope of the Project
   B. Introduction
      B1. Literature Review
      B2. Current State of the Art
   C. Challenges
   D. Motivation and Need
3. Objectives of the Project
   A. Specific Objectives Being Addressed by the Project
4. Research Approach
   A. Development / Research Methodology
   B. Project Team
   C. Team Structure
   D. Project Activities
   E. Key Milestones and Deliverables
5. Benefits of the Project
   A. Direct Customers / Beneficiaries of the Project
   B. Outputs Expected from the Project
   C. Organizational / HRD Outcomes Expected
   D. Technology Transfer / Diffusion Approach
6. Risk Analysis
   A. Risks of the Project
      A1. Comments
7. Contractual Matters
   A. Contractual Obligations under this Project
   B. Ownership of Intellectual Property Rights
   C. Competent Authority of the Principal Investigator’s Organization
8. Project Schedule / Milestone Chart
9. Proposed Budget
Annexure A – Curriculum Vitae
Bibliography



Application for ICT-Related Development and Research Grant

1. Project Identification

A. Reference Number: (for office use only)

B. Project Title: Design and Implementation of a Language Independent Software Clone Management Tool Suite for Single and Multiple Systems

C. Principal Investigator (PI):

Name: Hamid Abdul Basit

Designation: Assistant Professor

Organization: Lahore University of Management Sciences

Mobile # : 03464439267 Tel. # : (042) 3560-8194

Email: [email protected]

(A letter from the competent authority regarding PI’s time commitment for the proposed research project must be provided.)

C1. Co-Principal Investigator (CPI):

Name: Shafay Shamail

Designation: Associate Professor

Organization: Lahore University of Management Sciences

Mobile # : Tel. # : (042) 3560-8187

Email: [email protected]

Co-Principal Investigator (CPI):

Name: Basit Shafiq

Designation: Assistant Professor

Organization: Lahore University of Management Sciences

Mobile # : 0321-2707965 Tel. # : (042) 3560-8366

Email: [email protected]

Co-Principal Investigator (CPI):


Name: Khushro Shahookar

Designation:

Organization: Softech

Mobile # : Tel. # : (042) 3

Email: [email protected]

Co-Principal Investigator (CPI):

Name: Salman Iqbal

Designation:

Organization: Softech

Mobile # : Tel. # : (042) 3

Email: [email protected]

C2. Contact Person: (If different from PI.)

Name:

Designation:

Organization:

Mobile # : Tel. # :

Email:

D. Principal Investigator’s Organization (PIO): (Please indicate the name, address, telephone and fax of the Principal Investigator’s Organization. The Principal Investigator should belong to this organization.)

Name: Lahore University of Management Sciences

Address: Opposite Sector ‘U’, D.H.A. Phase II, Lahore
Registration #: (Please attach certified copy)
National Tax #: (Please attach certified copy)
Tel. # : Fax # :

Website: www.lums.edu.pk

E. Other Organizations Involved in the Project: (Please identify all affiliated organizations collaborating in the project, and describe their role/contribution to the project.)

E1. Industrial Organizations: # Organization Name Role / Contribution 1. Softech Requirement specification; supervision of tool suite development and evaluation 2.

E2. Academic Organizations: # Organization Name Role / Contribution 1. 2.

E3. Funding Organizations: # Organization Name Role / Contribution 1. 2.

E4. Other Organizations: # Organization Name Role / Contribution 1. 2.

F. Key Words: (Please provide a maximum of 5 key words that describe the project. The key words will be incorporated in our database.)

Clone management, reverse engineering, product lines, variability management, computer-aided software engineering (CASE) tools

G. Research and Development Theme: (If the proposal belongs to a theme specified by NICT R&D Fund, please identify the Research Theme.)

Leverage ICT to Improve productivity and quality of products in SME

H. Project Status: (Please mark one.) New / Modification to previous Project / Extension of existing project


H1. Project URL: (The project URL should be provided. This URL should be hosted by the project executing agency. Sufficient details such as executive summary, objectives are expected on the website. Once the proposal is approved, the website should also provide final copy of the proposal and deliverables/progress.)

I. Project Duration: Expected Starting Date: September 1, 2014 Planned Duration in months: 24 months

J. Executive Summary: Pakistan has a growing software industry with revenues exceeding a couple of billion dollars. This is a proposal to design and develop a prototype of a software engineering tool for use by software development organizations. The tool will capitalize on the untapped potential of software similarities (code clones) that abound in existing software systems, helping our software industry to increase its efficiency in software development and to enhance its potential to capture new customers and new market segments globally.

Code clones are similar program parts existing within and across software systems. Clones are known to exist in all kinds of software systems in significant proportions. Clones adversely affect software quality, especially maintainability and comprehensibility. The financial impact of cloning on maintenance is also believed to be very high, given that the cost of post-delivery maintenance is estimated at around 60% to 80% of the total cost incurred during a software system's lifetime. However, because clones are difficult to avoid or remove in practice, researchers and practitioners have agreed that code clones should be detected and managed efficiently.

In our proposed clone management system, clones will be unified using Variant Configuration Language (VCL) (http://vcl.comp.nus.edu.sg), which is a metaprogramming language and technique based on Bassett’s frames. This is a language independent technique that provides unlimited parameterization capability to any level of detail in a given text. Manual transformation of code into Bassett’s frames and managing these frames have been reported to have increased the productivity of several organizations by an order of magnitude and reduced the time to market of their products by 70%.

With VCL, we address the concern of reuse separately from the concern of functionality. VCL-based clone management is compensatory clone management, whereby we unify clones at the meta-level without removing them from the actual source code. In this way we avoid the negative impacts of clones while keeping the essential clones in the system. VCL can be applied on top of other languages and design techniques, complementing and enhancing them in areas where conventional techniques like refactoring, design patterns, generics, or aspects fail to provide a satisfactory solution. Since VCL works at the meta-level without altering the program, there is no risk of breaking a running system, losing performance, or compromising other design goals. Once the clones have been unified, we can also perform preventive clone management by using the variability management and code generation facilities of VCL.

The VCL representation of a clone class is both physically smaller and conceptually easier to understand than the original source code of all the clones in the clone class. This solution also highlights important relationships between various code elements, which helps programmers better understand the cloned code and its variants. This includes the visibility of similarities and differences among clones, and the ability to trace how various features affect the code at various places. Instead of dealing with each clone individually, we can understand clones together. In the VCL representation, we can see what is similar and what is different across specific clones in a clone class, down to every detail of the code. If we want to change one clone instance, we can check whether the change also affects other clone instances. The information contained in the VCL representation reduces the risk of update anomalies, and helps in reusing existing clones when writing new code.

In earlier research, we showed that recurring patterns of clones often indicate the presence of interesting higher-level similarities that we call structural clones. Structural clones are bigger and more meaningful parts of a software system that have been replicated, and often embody important application domain or design concepts. Efficient management of structural clones promises even further gains towards software maintenance.

With two decades of active research behind it and a growing number of publications, code cloning is probably the most vibrant and promising research area in software engineering at present. Over the past several years, the PI has worked extensively on various individual aspects of clone management, such as clone detection, clone visualization, and clone analysis [HBJ14] [BAHJ12] [BJ09] [ZBJ+08] [BSP+07] [BJ05] [BRJ05a] [BRJ05b]. Based on these existing foundations, we are now ready to consolidate these pieces into a comprehensive tool suite that provides complete clone management capabilities, from clone detection to clone documentation, clone visualization, clone tracking, and controlled clone generation.

As part of this project, we will also conduct training sessions for the selected industry personnel to effectively make use of the capabilities provided by the tool suite in managing software clones.

2. Scope, Introduction and Background of the Project

A. Scope of the Project:

Pakistan’s software industry mostly falls in the SME category. Different sectors of the industry are involved in developing and maintaining applications in different software domains such as web, mobile, embedded systems, and desktop-based legacy systems. It is a growing industry, with revenues exceeding several million dollars. To stay competitive in the market, the companies have to continuously improve their practices. One of the well-known but less practiced techniques is to reuse previously developed software components in developing new systems. However, reaping the full benefits of software reuse requires an upfront investment, careful planning, and a disciplined approach to software development. A common shortfall of all kinds of software systems is the code similarities, or code clones, that abound within and across systems. These are mostly the result of ad hoc code reuse, which is simply copying existing pieces of code and pasting them, with or without minor modifications, within the same system or in a different system in order to reuse existing features. Such copying is much easier than writing a generic version of the code that fits both the original and the new context. Sometimes, programmers are also forced to copy because of limitations of the programming language.

Code cloning is a serious problem in industrial software systems. Clones adversely affect software quality, especially maintainability and comprehensibility. For example, code clones increase the probability of inconsistent updates: if a bug is found in a code fragment, all of its cloned fragments must be located so that the bug can be fixed in each of them. Without effective countermeasures, further development may become prohibitively expensive due to the presence of code clones. Moreover, too much cloning increases the system size and often indicates design problems. The financial impact of cloning on maintenance is also believed to be very high, given that the cost of post-delivery maintenance is estimated at around 60% to 80% of the total cost incurred during a system's lifetime. However, it is not easy for developers or maintainers to be aware of all the code clones and maintain all of them consistently, especially in large software systems.

Clone management comprises all process activities targeted at detecting, avoiding or removing clones. Thus, clone management encompasses a broad category of activities including clone detection, tracking of clone evolution, and refactoring (removal) of code clones. Based on the PI’s existing research, we are now ready to consolidate these pieces into a comprehensive tool that provides complete clone management capabilities, from clone detection to clone unification, clone visualization, clone tracking, and controlled clone generation.

Existing clone unification techniques attempt to unify clones using the programming languages in which the code is written. This increases code complexity and reduces the maintainability of the code. In our proposed clone management system, clones will be unified using VCL (http://vcl.comp.nus.edu.sg), which is a meta-programming language and technique based on Bassett’s frames. This is a language-independent technique that provides unlimited parameterization capability to any level of detail in any given text. Manual transformation of code into Bassett’s frames and management of these frames have been reported to have increased the productivity of several organizations by an order of magnitude (10 times) and reduced the time to market of their products by 70% [Bas97]. With VCL, we unify clones at the meta-level and address the concern of reuse separately from the concern of functionality. This separation of concerns is a fundamental principle of the software engineering discipline. VCL-based clone management is compensatory clone management, whereby we unify clones at the meta-level without removing them from the actual source code. In this way we avoid the negative impacts of clones. Once the clones have been unified, we can also perform preventive clone management by using the variability management and code generation facilities of VCL.
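The idea of unifying clones at the meta-level can be illustrated with a minimal sketch. It does not use VCL syntax: Python's standard `string.Template` stands in for a frame, and the frame text, file names, and parameter names are hypothetical. One generic text with marked variation points represents the whole clone class, and each concrete clone is regenerated from it, so the clones remain present in the generated source code.

```python
from string import Template

# One generic "frame" standing in for a clone class.
# The $-placeholders mark the variation points (hypothetical example, not VCL syntax).
FRAME = Template(
    "public $type max$suffix($type a, $type b) {\n"
    "    return a > b ? a : b;\n"
    "}\n"
)

# Parameter settings for each concrete clone (hypothetical example data).
VARIANTS = {
    "MaxInt.java": {"type": "int", "suffix": "Int"},
    "MaxLong.java": {"type": "long", "suffix": "Long"},
}


def generate(frame, variants):
    """Regenerate every concrete clone from the single generic frame."""
    return {name: frame.substitute(params) for name, params in variants.items()}


generated = generate(FRAME, VARIANTS)
print(generated["MaxLong.java"])
```

A change made once in the frame body reaches every generated clone, which is the update-anomaly benefit described above. The differences between clones are confined to the small parameter table.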

Our tool suite will broadly comprise four main components. The Clone Detection and Documentation Component (CDDC) will identify clones among different pieces of code at different levels of abstraction through lexical analysis. The Clone Visualization and Analysis Component (CVAC) will generate a diverse set of visualizations, such as treemaps, scatterplots, and bar charts, to present the cloning data to developers in a meaningful way and to facilitate the analysis of clones. The Clone Framing Component (CFC) will unify the clones within and across systems by providing a generic representation of these clones with VCL. The Clone Tracking Component (CTC) will keep track of the already found clones in updated versions of the code without having to run the complete clone detection again. In addition to these four components, we will also implement two repositories: a Clones Repository (CR) for storing raw clone data and a Frames Repository (FR) for storing the VCL frames.

We will integrate our tool suite with the Eclipse IDE (Integrated Development Environment) (http://www.eclipse.org), which is a popular platform for programming.

After the completion of the tool suite, we will run training sessions for industry professionals and conduct controlled experiments to assess the benefits of integrated clone management, in order to support adoption of the tool in industry. A user guide and a step-by-step tutorial will also be developed.

B. Introduction:

B1. Literature Review: (Detailed summary of what all has been done internationally in the proposed area quoting references and bibliography. Please note that this section demonstrates the depth of knowledge of the project team and builds the confidence of the evaluators about capability of the team in achieving the stated objectives.)

Clones and clone types

Similar parts of source code within or across systems are called code clones. Previous research reports empirical evidence that a significant portion (even as high as 50% [RDL04]) of a typical software system consists of cloned code. A precise definition has not yet been proposed in the literature, due to a number of factors. As defined by Ira Baxter, “Clones are segments of code that are similar according to some definition of similarity” (Ira Baxter, 2002) [Kos06]. Here the definition of similarity is an important factor. Sometimes only exactly identical pieces of code are categorised as clones, while at other times code fragments that have differences as well as similarities also fall under the definition of clones. Based on these considerations, the following clone categories are frequently used in the cloning literature [Kos06] [RC07] [Rie05]:

Type-1 Clones (Exact Clones): Clones of this type are identical code fragments except for variations in whitespace and comments. They are essentially the same code chunks and are based on similarity in the code text.

Type-2 Clones (Renamed Clones): This type of clones is based on structural or syntactical similarity in the code text. However, they may vary in terms of identifier names, literals, types, layout and comments.

Type-3 Clones (Gapped Clones): Type-1 or Type-2 cloned code fragments that also have some additional differences fall under the Type-3 category. These differences can be additions, deletions or modifications of statements. Type-3 clones are also based on text similarity. They usually exhibit lines of code that are absent from one code segment but present in the others. The difference may involve part of a single statement, a complete statement, or multiple consecutive lines of code, i.e. a code chunk. The complete or partial presence and absence of such lines is what categorises code fragments as Type-3 clones.

Type-4 Clones (Semantic Clones): Clones of this type show functional, i.e. semantic, similarity in code. Such code fragments are implemented using different syntactic structures.

These four types are the ones most frequently used to categorise similarity among code fragments. They can also occur together in the same chunk of code. Type-2 and Type-3 clones are one such combination and are jointly known as near-miss clones [RC07].
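The first three clone types can be made concrete with a minimal sketch. The normalization rules below (only `//` comments, a tiny keyword list, and a line-level similarity ratio from Python's `difflib`) are simplifying assumptions chosen for illustration and do not reflect how any particular clone detector works. Type-4 clones need semantic analysis and are outside the scope of this sketch.

```python
import re
from difflib import SequenceMatcher

# Tiny illustrative keyword subset (an assumption; real lexers know the full language).
KEYWORDS = {"int", "long", "return", "if", "else", "for", "while"}


def strip_layout(code):
    """Type-1 view: drop // comments and collapse all whitespace."""
    without_comments = re.sub(r"//[^\n]*", "", code)
    return " ".join(without_comments.split())


def abstract_names(code):
    """Type-2 view: additionally replace identifiers by ID and numbers by LIT."""
    def repl(match):
        word = match.group(0)
        if word in KEYWORDS:
            return word
        return "LIT" if word[0].isdigit() else "ID"
    return re.sub(r"[A-Za-z_]\w*|\d+", repl, strip_layout(code))


def line_similarity(a, b):
    """Type-3 view: similarity ratio over the abstracted, non-empty lines."""
    lines_a = [n for n in (abstract_names(l) for l in a.splitlines()) if n]
    lines_b = [n for n in (abstract_names(l) for l in b.splitlines()) if n]
    return SequenceMatcher(None, lines_a, lines_b).ratio()


original = "int sum = a + b; // add\nreturn sum;"
type1 = "int  sum = a + b;\n\nreturn sum;   // result"        # layout and comments differ
type2 = "int total = x + y;\nreturn total;"                   # identifiers renamed
type3 = "int total = x + y;\nlog(total);\nreturn total;"      # one statement inserted

print(strip_layout(original) == strip_layout(type1))      # exact match after layout removal
print(abstract_names(original) == abstract_names(type2))  # match after renaming is abstracted
print(line_similarity(original, type3))                   # high but below 1.0: a gapped clone
```

A detector would typically accept a Type-3 pair when such a ratio exceeds a chosen threshold. The threshold itself is a tuning decision and is not specified here.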

Other categorisations of code clones have also been proposed, based on criteria other than that discussed above [MLM96] [BMD+99a] [BKA+07] [KFF06] [DBF+95] [Kon97].

Structural Clones

In previous work, the PI proposed the concept of structural clones to mean design-level, large-granularity similar program structures [BAHJ12][BAJ11][BJ10][BJ09][BJ05]. Examples of structural clones are similar modules such as class methods, classes, source files, directories, or recurrent configurations of similar modules.

Figure 1 and Figure 2 show intuitive examples of simple and structural clones. Three similar code fragments (a1,a2,a3) in Figure 1 differ slightly in code details. We can consider them as instances of a Simple Clone Class (SCC), based on some textual similarity criteria.

Figure 1: Simple Clones

Suppose code fragments (a1,a2), (b1,b2), and (c1,c2) form simple clone classes, and they appear in files F1 and F2, as shown in Figure 2(a). Grey code is unique to F1 and black code is unique to F2. The group of code fragments [a1,b1,c1] is a structure that has a clone [a2,b2,c2] in file F2, and together they form a Simple Clone Structure (SCS) Across Files, which is a basic type of structural clone.

Suppose a substantial part of both files F1 and F2 is covered by this SCS. Then we could consider files F1 and F2 as members of a File Clone Class (FCC), a higher-level structural clone, as shown in Figure 2(b). This abstraction step allows us to detect structural clones in a hierarchical way, whereby higher-level, coarse-grained structural clones are formed in terms of smaller-granularity ones.

Suppose further that (X1,X2), (Y1,Y2), and (Z1,Z2) are also File Clone Classes, as shown in Figure 2(c). Then, [F1,X1] is a structure in directory D1 with a clone [F2,X2] in directory D2, forming a File Clone Structure (FCS) Across Directories, one of the highest-level structural clones that we consider in our work. Similarly, cloned structures [Y1,Z1] and [Y2,Z2] form an FCS Within Directory.

Figure 2: Structural Clones

The above examples illustrate that a structural clone class is formed by a group of program structures whose respective elements are similar and are inter-related in similar ways.
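The step from simple clones to a file-level structural clone can be sketched as follows. The data layout, the file sizes, and both thresholds are hypothetical values chosen for illustration; the sketch only mirrors the reasoning in the F1/F2 example, where two files sharing several simple clone classes that cover a substantial part of each file are treated as candidate members of a File Clone Class.

```python
from itertools import combinations

# Hypothetical detector output: for each file, its size in lines and the number
# of lines covered by each simple clone class (SCC) that occurs in it.
FILES = {
    "F1": {"size": 100, "clones": {"a": 30, "b": 25, "c": 20}},
    "F2": {"size": 90, "clones": {"a": 30, "b": 24, "c": 21}},
    "F3": {"size": 120, "clones": {"a": 30}},
}


def shared_structure(f, g):
    """SCCs present in both files: a candidate Simple Clone Structure across files."""
    return set(FILES[f]["clones"]) & set(FILES[g]["clones"])


def coverage(f, classes):
    """Fraction of file f covered by the given simple clone classes."""
    covered = sum(FILES[f]["clones"][c] for c in classes)
    return covered / FILES[f]["size"]


def file_clone_pairs(min_classes=2, min_coverage=0.6):
    """Pairs of files whose shared structure covers a substantial part of both."""
    pairs = []
    for f, g in combinations(sorted(FILES), 2):
        shared = shared_structure(f, g)
        if len(shared) < min_classes:
            continue
        if min(coverage(f, shared), coverage(g, shared)) >= min_coverage:
            pairs.append((f, g, sorted(shared)))
    return pairs


print(file_clone_pairs())
```

With this data only F1 and F2 qualify: F3 shares a single clone class with the others, so it does not form a recurring structure with them.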

The concept of structural clones covers all kinds of large-granularity repeated program structures. Clone Miner [BJ09] detects some specific types of structural clones, which are listed in Table 1 and are explained below.

Clone Miner takes a bottom-up approach towards the detection of structural clones. It detects simple clones first, based on the similarity of the transformed token strings generated by a lexer [BJ09], and groups them in terms of their containers (e.g., methods, files, etc.). Then, Clone Miner detects the recurring configurations of simple clones using frequent closed itemset mining. Each of these recurring configurations of simple clones represents a possible first-level structural clone, where the entities are cloned code fragments and the relationship between the entities is “same container”. Higher-level structural clones are detected in a similar way, with highly similar containers, found from the previous lower-level clone analysis, being treated as the cloned entities. Similar to the simple clone classes (SCC), we have method clone classes (MCC) and file clone classes (FCC), which consist of groups of cloned entities at successively higher levels of abstraction. The other types of clones listed in Table 1 consist of recurring structures of simple clones (SCS), method clones (MCS) or file clones (FCS). The detailed explanation and the mechanisms of detecting all these different types of structural clones are given in [BJ09].
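The mining step can be illustrated with a minimal brute-force sketch. It is not Clone Miner's algorithm or data model: each container (here a file) is treated as a transaction holding the identifiers of the simple clone classes that occur in it, and the function names, sample data, and thresholds are assumptions. An itemset is frequent if enough containers include it, and closed if no larger itemset occurs in exactly the same number of containers.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of containers that include every clone class in the itemset."""
    return sum(1 for t in transactions.values() if itemset <= t)


def frequent_closed_itemsets(transactions, min_support=2, min_size=2):
    """Brute-force frequent closed itemset mining (exponential; tiny inputs only)."""
    items = sorted(set().union(*transactions.values()))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                frequent[candidate] = count
    closed = {}
    for itemset, count in frequent.items():
        # Closed: no proper superset has the same support.
        has_equal_superset = any(
            itemset < other and frequent[other] == count for other in frequent
        )
        if len(itemset) >= min_size and not has_equal_superset:
            closed[itemset] = count
    return closed


# Hypothetical input: file -> simple clone classes found in that file.
CONTAINERS = {
    "F1": {"a", "b", "c"},
    "F2": {"a", "b", "c"},
    "F3": {"a", "b"},
    "F4": {"d"},
}

for itemset, count in frequent_closed_itemsets(CONTAINERS).items():
    print(sorted(itemset), "recurs in", count, "files")
```

Here the configuration {a, b, c} recurs in two files and {a, b} in three, so each is a candidate simple clone structure across files. The sets {a, c} and {b, c} are dropped because they never occur without the remaining class, which is the redundancy that closedness removes. Practical miners avoid enumerating every subset.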

Table 1: Structural Clones found by Clone Miner

Level 1   A   Simple Clone Structures (SCS) Within Methods
          B   Simple Clone Structures (SCS) Across Methods
Level 2   A   Simple Clone Structures (SCS) Within Files
          B   Simple Clone Structures (SCS) Across Files
Level 3       Method Clone Classes (MCC)
Level 4   A   Method Clone Structures (MCS) Within Files
          B   Method Clone Structures (MCS) Across Files
Level 5       File Clone Classes (FCC)
Level 6   A   File Clone Structures (FCS) Within Directories
          B   File Clone Structures (FCS) Across Directories

We typically find ~50% of legacy code contained in software clones. A study of 11 open source systems found that 56% of simple clones are contained in structural clones, many of which are meaningful to developers [BAHJ12] [BAJ11].

The limitation of considering only simple clones, without looking at the bigger picture of similarity in a system, is known in the field, and researchers have applied classification, filtering, and visualization to help analysts understand the cloning information [UKKI02b] [KG04]. Our concept of structural clones addresses this limitation by raising the level of clone analysis from simple clones to design-level, coarse-grained program structures. In experimental studies, we demonstrated the usefulness of structural clones for design recovery and program understanding [BAHJ12] [BAJ11].

How clones arise

Clones arise in software systems both intentionally and unintentionally. Intentional cloning, where a developer copies a piece of code from one place and pastes it at another, with or without modifications, is mainly done as a form of ad hoc reuse. Sometimes this is considered safe development, as cloning an existing piece of code that has been working fine is safer than implementing the same or similar functionality from scratch, where there are more chances of making errors. Cloning a piece of code several times with minor modifications is also considered faster and simpler than writing one generic piece of code representing all the variants.

Unintentional or accidental clones arise from the mental models used by developers [Rie05], or from using the same APIs or the same design patterns [ZR11b]. Several other causes of clones are also discussed in the literature [KG06] [KG08] [Kos06] [RC07].

Other benefits of clone detection

The main purpose of clone detection is clone management for better software maintenance. In addition to this objective, clone detection can also be used for identifying candidates for a software library [IHH+12], program comprehension, aspect mining for detecting cross-cutting concerns [BDET05] [KMT07] [Bru04], bug detection [Bea11] [CCM+11] [JB10] [JSC07] [JHDF08], finding malicious software by comparing a piece of code with known malicious software [BMM07] [WL06], detecting plagiarism and copyright infringement [BFL10] [HKVD11], and code compacting for compact devices, among other uses.

Clone detection techniques

Over the past two decades, various techniques and tools have been developed to detect code clones. These techniques differ in the way they represent code and in their comparison mechanisms [RC07]. Some of the well-known techniques can be grouped as follows.

Plain text comparison: Code is represented as plain text, usually with some normalization to remove differences due to renamed identifiers, comments, and different layouts. Clones are detected by comparing lines with each other [Joh94] [Joh93] [DRD99] [MM01] [RC08a] [ZR12b] [URSH11]. These techniques are usually independent of programming languages but may be sensitive to even very small changes in otherwise similar code [Kos08].
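The line-based idea can be sketched as follows. This is a minimal illustration and not the algorithm of any cited tool: the normalization only strips whitespace and trailing `//` or `#` comments, identifiers are not renamed, and the names `normalize` and `find_line_clones` as well as the window size are assumptions.

```python
import re
from collections import defaultdict

# Simplifying assumption: "//" and "#" always start a line comment.
COMMENT = re.compile(r"//.*$|#.*$")


def normalize(line):
    """Drop a trailing line comment and all whitespace."""
    return re.sub(r"\s+", "", COMMENT.sub("", line))


def find_line_clones(source, window=3):
    """Return lists of 1-based start lines whose next `window` non-empty
    normalized lines are identical."""
    kept = [(no, normalize(text))
            for no, text in enumerate(source.splitlines(), start=1)]
    kept = [(no, text) for no, text in kept if text]

    seen = defaultdict(list)
    for i in range(len(kept) - window + 1):
        chunk = "\n".join(text for _, text in kept[i:i + window])
        seen[chunk].append(kept[i][0])
    return sorted(starts for starts in seen.values() if len(starts) > 1)
```

Because the comparison is exact after normalization, renaming a single identifier in one copy is enough to hide the clone, which illustrates the sensitivity to small changes noted above.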

Comparison of lexical tokens: Code is represented as a string of lexical tokens and similar subsequences are detected as clones [Bak92] [Bak95] [Bak96] [BSP+07] [KKI02] [KYU+09] [LLMZ06]. These techniques are slightly more dependent on the underlying programming language than the plain text based techniques, but can be faster and more robust in the presence of small code differences. The downside is that they may report clones that cross the boundaries of syntactic blocks and are therefore difficult to treat.

Comparison of syntax trees: Code is represented as a syntax tree or parse tree and similar subtrees are reported as clones [BYM+98] [Yan91] [Bah10] [WSWF04] [FFK08] [KFF06] [LT10] [EFM07] [JMSG07] [NNP+11]. These techniques are more dependent on the programming language syntax due to the parsing involved, but have high precision as the detected clones have similar syntactic structure. However, they suffer from low recall rates [BKA+07] [Bel02] [BB02].

Comparison of program dependency graphs: Code is represented as program dependency graphs, showing the data and control dependencies between the program statements. Isomorphic subgraphs are detected as clones [GJS08] [HK09] [HYNK11] [KH01] [Kri01]. These techniques can also detect clones of non-contiguous and reordered code segments, but they are computationally expensive.

Comparison of code metrics: Code is represented as vectors of various code metrics (for example, cyclomatic complexity, fan-in, fan-out, etc.) computed for each function or code block. Similar vectors indicate similar functions or code blocks [KDB+95] [KMM+96] [LPM+97] [MLM96] [DDF02] [LM03]. Certain code metrics may be computationally expensive to calculate, and metric-based techniques may report many false positive clones.
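A minimal sketch of the metric-vector comparison is given below. The metric values, the function names, and the distance threshold are hypothetical; the cited techniques differ in which metrics they use and how they compare them.

```python
import math


def similar_functions(metrics, threshold=1.0):
    """Return name pairs whose metric vectors lie within `threshold`
    Euclidean distance of each other.

    metrics: dict mapping a function name to a tuple of metric values, e.g.
    (cyclomatic complexity, fan-in, fan-out). The values are assumed to be
    already computed by some other tool.
    """
    names = sorted(metrics)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(metrics[a], metrics[b]) <= threshold:
                pairs.append((a, b))
    return pairs
```

Two unrelated functions can easily share the same metric values, so every reported pair is only a clone candidate. This is the source of the false positives mentioned above.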

Comparison of bytecode or assembler code: Code is compiled into bytecode or assembly code on which clones are detected through various techniques [BM98] [San05] [DG10a] [DG10b] [SWP+09]. One limitation of this technique is that it requires syntactically correct code. Even though the technique appears to be robust in the presence of comments and layout differences between clones [DG10], small differences in the source code may produce very different compiled code, leading to low clone recall [BM98]. Also, only class level clones are detected due to the difficulty of finding method or block boundaries in the compiled code.

Other techniques: Some other techniques have also been used for clone detection, such as tracking of clipboard operations (copy and paste) [Wit08] [HJJ09a] [Wec08], anti-unification [BM09] [LT10], formal methods [San05], hybrid techniques [FBCG10] [Lei04], and tracing of abstract memory states at runtime to detect semantic clones [KJKY11].

Evaluation of clone detection techniques

Comprehensive qualitative evaluations of the clone detection techniques are given by Roy & Cordy [RC08b] and Roy et al. [RCK09]. A comprehensive quantitative comparison of clone detection tools was done by Bellon et al. [BKA+07], comparing six tools (CCFinder, CloneDR, Dup, Duploc, Duplix, and CLAN) representing different clone detection techniques. The reference clone benchmark used in this study is the only reference benchmark available so far [Kos08]. An earlier comparative case study was also reported by Bailey and Burd [BB02].

Roy and Cordy have proposed a mutation based framework for the evaluation of clone detection techniques [RC09]. In this framework, a set of artificially created but manually verified clone fragments is injected at various locations of a code base to serve as a reference benchmark.

Incremental clone detection

As software systems evolve, new clones are introduced and old clones get modified, nullifying the results of previous clone detection to some extent. For large code bases, it is not feasible to re-detect clones after every change. In incremental detection of clones, only the modified source code is analyzed for clones, and the new results are merged with the results of the previous clone detection. Evaluation results by Gode and Koschke [GK09] suggest that incremental clone detection can perform four times faster than non-incremental detection. However, the storage of a high volume of data is a common issue with all the incremental clone detection techniques.

iClone [GK09] is an incremental clone detection algorithm that works with n previous revisions of a program together with the changes between every two consecutive revisions. Clone detection in the initial version is done with a typical token-based technique using a generalized suffix tree. For the detection of clones in subsequent versions, only the files that are changed, added or deleted from the previous version are examined.

Hummel et al. [HJHC10] have proposed an index-based incremental clone detection approach, reusing preprocessing and postprocessing of clones from ConQAT. Incremental clone detection is facilitated by creating a clone index data structure, which is a list of tuples containing the file name, the position in the sequence of normalized statements for the file, an MD5 hash code of the chunk of n normalized statements starting at that position, and the start and end lines of the chunk of consecutive statements. Tuples with the same sequence hash indicate clones with at least n statements. Consecutive clones are merged to detect maximal clones. This approach has been extended into a general framework for computing code quality metrics in an incremental and distributed way that can analyze large systems in real time and also provide the history of those results [BHH+12].
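The clone index described above can be sketched as follows. This is an illustration of the tuple layout only, not the ConQAT implementation: the input format and the names `build_clone_index` and `clone_groups` are assumptions, statements are assumed to be normalized already, and the merging of consecutive clones into maximal clones is omitted.

```python
import hashlib
from collections import defaultdict


def build_clone_index(files, n=3):
    """Build index tuples (file, statement index, MD5 hash, start line, end line).

    files: dict mapping a file name to a list of
    (line_number, normalized_statement) pairs (hypothetical input format).
    """
    index = []
    for name, statements in files.items():
        for i in range(len(statements) - n + 1):
            chunk = statements[i:i + n]
            text = "\n".join(stmt for _, stmt in chunk)
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            index.append((name, i, digest, chunk[0][0], chunk[-1][0]))
    return index


def clone_groups(index):
    """Group index tuples by hash; each group of two or more is a clone
    of at least n statements."""
    by_hash = defaultdict(list)
    for name, position, digest, start, end in index:
        by_hash[digest].append((name, position, start, end))
    return sorted(sorted(group) for group in by_hash.values() if len(group) > 1)
```

When a file changes, only its own tuples need to be dropped and recomputed, which is what makes an index of this shape suitable for incremental detection.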

Another incremental clone detection technique based on program dependency graphs has been proposed by Higo et al. [HYNK11]. Both control and data dependencies present in the source code are captured by these PDGs, which are stored in a database. Clones are detected by approximately matching these PDGs, and they are kept synchronized with the changing source code by analyzing the modified source files.

Li and Thompson [LT11] incorporated incremental clone detection in Wrangler using an abstract syntax tree approach. The AST computed from the normalized source code is annotated and serialized. A suffix tree based approach is applied for clone detection from the serialized AST, using an anti-unification technique to avoid false positives. The serialized AST is kept up to date with the changes in the source code.

JSync [NNP+11] (initially called ClemanX [NNA+09]) is an incremental clone detection tool implemented as a plugin to SVN. It runs an initial clone detection and stores the clone classes in N buckets. Based on the change information of the source files from SVN, it can determine the modified, added, or deleted fragments of the code. JSync then filters the changed or deleted fragments from the clone classes. New and modified code fragments are placed in the buckets using the LSH technique. Finally, fragments in each bucket are compared with each other to update the clone classes.

Barbour et al. [BYZ10] centrally maintain clone information for a software system on a server. Whenever a developer commits an update, the server updates the clone information incrementally. The number of comparisons required for incremental clone detection is reduced by selecting representative clones from the existing clone list. The Knuth–Morris–Pratt algorithm for string comparison is used to compare the newly committed code with the representative clones and to update the clone list.
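For reference, the Knuth–Morris–Pratt matching step can be sketched as below. The sketch shows only the standard algorithm on generic sequences (strings or token lists). How [BYZ10] selects representative clones and prepares the committed code is not reproduced here, and the function names are assumptions.

```python
def failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find_all(text, pattern):
    """Return the start indices of all occurrences of `pattern` in `text`."""
    if not pattern:
        return []
    table = failure_table(pattern)
    hits, k = [], 0
    for i, item in enumerate(text):
        # On a mismatch, fall back in the pattern instead of rescanning the text.
        while k and item != pattern[k]:
            k = table[k - 1]
        if item == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]
    return hits
```

Here `pattern` would play the role of a representative clone and `text` the newly committed code, both as token sequences.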

Some other similar techniques are mentioned in [LGS13] [KRC11] [ZR12b] [PLR+13].

Approaches for clone management

Clone management includes all processes for detecting, maintaining, avoiding, and removing clones [Gie07]. This covers a wide spectrum of activities such as clone detection, tracking of clone evolution, and refactoring of code clones. Clone documentation, clone analysis, and clone visualization are also supporting aspects of clone management. Zibran & Roy [ZR12a] present a detailed survey of clone management dimensions and the work done in all these areas.

For managing code clones, Mayrand et al. [MLH96] introduced “problem mining” and “preventive control” activities, which were further supported by a later study by Lague et al. [LPM+97]. Giesecke [Gie07] called them compensatory and preventive clone management, respectively.

Giesecke [Gie07] suggested an alternate categorization of clone management as corrective, preventive, and compensatory clone management. Corrective clone management involves removing existing clones from the system. Preventive clone management prevents new clones from forming in the system. Compensatory clone management limits the negative impact of clones that, for some valid reason, are not removed from the system, by applying techniques like annotation and documentation.

Another categorization of clone management is to consider these activities as proactive or retroactive. Proactive clone management [HJJ09a] [HJJ09b] is another name for preventive clone management, which aims to deal with clones during their creation or soon after they are introduced. Retroactive clone management [CH07], on the other hand, is the post-mortem approach [ZR12b], where clone management activities are initiated after the development process is completed up to a milestone. In ideal circumstances, proactive clone management should be applied to all clones in newly developed systems, but due to practical limitations discussed later, this may not always be possible. Hence, a practically useful clone management system should provide both proactive and retroactive clone management facilities [CH07]. Clone management in legacy systems, in contrast, can only be done by the retroactive approach.

Clone Documentation

Clone detectors document clone information either in terms of clone pairs [KKI02] or clone classes [BJ09]. Individual clones are identified by the file name, together with the range of line numbers [RC08a] [BJ09] or with the character offset and the number of characters in the clone [HJJ09a]. Clones are reported in various formats, like XML, HTML, and plain text. Rich Clone Format (RCF) [HG11] has been proposed as an extensible schema-based data format for storage, persistence, and exchange of clone data.

When a system undergoes changes, the locations of clones as reported by clone detection tools become invalid. To address this problem, it has been proposed to report the relative locations of clones instead of the absolute ones [LT11]. Another approach is based on describing a clone region in terms of the containing file, class, method, and block, in the form of a clone region descriptor (CRD) [DR07] [DR10], independent of the exact code of the clone or its location in the file. This technique is incorporated in the tool CloneTracker [DR08].

Clone Annotation

JSync [NNP+11] provides a clone annotation facility to developers. Using this feature, they can tag clones that they don’t want to refactor immediately, so that the clone management framework does not repeatedly suggest these clones for refactoring.

Clone Refactoring Techniques

The following techniques are used to refactor code clones found in a software system.

Object Oriented Refactoring: A number of object oriented refactoring patterns have been proposed by Fowler [FBB+99] to remove code defects including duplicate code or clones. Several researchers have utilized these patterns for clone removal in their tools [HKKI04a] [HKKI04b] [JH06] [KVB+10] [LBC+10] [SK09] [ZR11a] [ZR11b] [KT14]. These refactoring patterns include Extract Method, Move Method, Pull-up Method, Extract Superclass, and Extract Utility Class. In addition to these fundamental refactorings, some peripheral refactorings are also used to facilitate clone removal. These include Rename Identifier, Rename Method, and Re-order Parameters among others that are used to remove superficial textual differences between otherwise similar clones [ZR11b]. Chained Constructor refactoring [Ker04] has also been used to remove code duplication from the constructors of the same class [NSG07].

For some specific types of structural clones, it has been proposed to directly apply composite refactorings like Form Template Method and Extract Superclass instead of performing a series of smaller refactorings like Extract Method and Move Method [BAHJ12].

Design patterns: Refactoring based on design patterns is a clone unification option that is fairly independent of the underlying programming language but is closely tied with the design of the program. To eliminate the redundant code in a Java software system, Balazinska et al. [BMD+99b] [BMDK00] applied the refactoring based on ‘strategy’ and ‘template’ design patterns, by factoring out the commonalities of methods and parameterizing the differences according to the design patterns.

Generics and Templates: In class libraries, clones often stem from the well-known “feature combinatorics” problem. A proper parameterization mechanism can combat this emergence of clones, increasing software reuse and easing software maintenance. At the language level, we have two flavors of programming language extensions that help us define generic solutions. Firstly, there are techniques tightly integrated with the underlying language, e.g., generics (Ada, Java) and higher order functions. Secondly, there are language extensions loosely integrated with the underlying language, e.g., C++ templates. Templates (or generics) help us write compact, generic code, which aids both reuse and maintenance. In earlier work [BRJ05a] [BRJ05b], the PI carried out two case studies to assess the potential of generics and templates in unifying clones.

Macros: Macro extraction was initially proposed as the mechanism for removing clones [Bak92] [BYM+98], mostly in the context of procedural languages. Most of the macro systems are merely implementation level mechanisms for handling variant features (or changes, in general). Failing to address change at analysis and design levels, macros never evolved towards full-fledged “design for change” methods [Bas97] [KRT97]. Programs instrumented with macros tend to be difficult to understand and test.

Metaprogramming: Metaprogramming is another similar but less restrictive technique of clone refactoring. In this technique, VCL (Variant Configuration Language) is used for a generic representation of code. Earlier studies [BRJ05a] [BRJ05b] have also shown that by using XVCL (an early version of VCL, based on XML syntax), 68% of the code from the original Java Buffer Library was removed [JL06].

Traits: Traits [SDNB03] are a modularization construct of the object oriented paradigm based on the concept of inheritance. As in inheritance, common functionality is implemented in a parent class and variants in child classes, which also helps to avoid and remove duplicate code. Traits can be described as a collection of pure methods that can be used within a class or by another trait or other classes without requiring multiple inheritance. Traits have been applied as a clone refactoring mechanism in an empirical study on the java.io library [MQB05]. With 14 traits, the authors were able to remove 30 duplicated methods by refactoring 12 classes of the Java library.

Aspects: An aspect is a program feature that is related to several parts of the program, but does not belong to the program's primary functionality, consequently crosscutting other program concerns, and violating the principle of separation of concerns [KLM+97]. In Aspect Oriented Programming (AOP), one splits a problem into separate aspects, and by the process of weaving aspects, one can put the solutions for the different aspects of a problem back into a solution for the whole problem. Aspect technologies promise improved modularity of programs, which may also reduce clones.

In addition to the object oriented refactoring patterns, Schulze et al. [SKR08] have also proposed three aspect oriented refactoring patterns to remove clones. These include Extract Feature into Aspect, Extract Fragment into Advice, and Move Method from Class to Interface.

Synchronized modification: This refers to the technique of applying the same edit operations on multiple clones in parallel to avoid update anomalies. Toomim et al. [TBG04] provided synchronized modification support in their tool CodeLink, which is an XEmacs editor extension. When the user manually links two code segments in the editor, CodeLink detects the similarities and differences between them using a longest common subsequence algorithm, and enables simultaneous editing in the similar regions. CPC [Wec08] and CloneBoard [Wit08] also support synchronized modifications.
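The similarity detection step can be illustrated with a textbook longest common subsequence computation over the lines of two linked fragments. This is a sketch and not CodeLink's implementation. The function name `lcs_pairs` and the line-level granularity are assumptions.

```python
def lcs_pairs(left, right):
    """Return index pairs (i, j) of elements matched by one longest common
    subsequence of the sequences `left` and `right`."""
    m, n = len(left), len(right)
    # length[i][j] = LCS length of left[i:] and right[j:]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if left[i] == right[j]:
                length[i][j] = length[i + 1][j + 1] + 1
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])

    # Walk the table from the start to recover the matched positions.
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if left[i] == right[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs
```

The matched pairs mark the similar regions where an edit could be mirrored in both fragments. The unmatched lines are the differences that would stay local to each clone.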

Consistent renaming: When clones are created with a copy and paste action, usually the identifiers in the pasted code fragment need to be renamed to fit in the new context. CReN [JH07] is an Eclipse plugin that checks for inconsistently renamed identifiers in a code fragment and suggests corrections. LexId [JH10] is an extension of CReN that helps in consistently modifying different identifiers in a code fragment. CSeR [JHJ10] provides visual analysis of similarities and differences while editing a copy-paste clone. CnP [HJJ09a] combines CReN and CSeR into a single toolkit for proactive management of copy-pasted clones.

JSync [NNP+11] is a plugin to SVN version control system. For a given clone pair, it compares their ASTs to detect inconsistently renamed identifiers and suggests consistent renamings. Using the versioning information available in SVN, JSync also provides support for clone synchronization when one clone in a clone pair is changed in the new version while the other is still unchanged.
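The idea of checking for inconsistent renaming can be sketched with a positional comparison of identifiers. This is a simplified stand-in: it assumes the two fragments have the same token structure, whereas JSync compares ASTs, and the internals of CReN are not reproduced here. The function name and the regular expression are assumptions.

```python
import re
from collections import defaultdict

IDENT = re.compile(r"[A-Za-z_]\w*")


def inconsistent_renames(original, pasted):
    """Return {identifier: sorted replacements} for identifiers of `original`
    that correspond to more than one name in `pasted`.

    Identifiers are aligned by position, so both fragments are assumed to
    contain the same number of identifiers in the same order.
    """
    mapping = defaultdict(set)
    for old, new in zip(IDENT.findall(original), IDENT.findall(pasted)):
        mapping[old].add(new)
    return {old: sorted(news) for old, news in mapping.items() if len(news) > 1}
```

For a loop copied with `i` renamed to `j` everywhere except in the increment, the sketch reports that `i` maps to both `i` and `j`, which is the kind of inconsistency such tools flag for correction.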

Clone visualization

Clone detectors return huge numbers of clones when a large code base is analyzed. Post-detection analysis of clones is then a must to help users zoom into the cloning areas that are of interest to them. Visualization, along with abstraction, clustering, and filtering, is a cornerstone technique for clone analysis. The human eye can easily recognize visual patterns; hence, visualization techniques can help users understand large amounts of cloning data and identify important regions of cloning in a software system [Rie05].

Many general-purpose visualizations, or visualizations proposed in specific domains, have been adopted in clone detection / visualization tools to make cloning information useful. These techniques are either integrated with a clone detection tool, or are available as independent clone analysis tools. They can meaningfully represent a large number of clones and help software maintainers in analyzing the clones. Some of these visualizations are more popular and are present in multiple tools in variant forms, while others are more specific to a certain tool. Some tools propose new and unique visualizations, while others present the same visualizations in somewhat different ways. Some of the visualizations have limitations, some of which could be addressed. Each visualization technique provides a different view of the cloning data, and is helpful in achieving specific software engineering tasks, such as detecting candidates for reuse, or program understanding.

A recent work by the PI [HBJ14] presents a comprehensive survey of clone visualization techniques that have been proposed in various clone analysis tools. Our work is based on the material gathered from published papers or online web pages of clone analysis tools like VisCad (http://homepage.usask.ca/~mua237/viscad/viscad.html), ConQAT (https://www.conqat.org/), and CCFinder/Gemini [KKI02]. In this work, for each clone visualization technique, we identify its salient features that exist in various software clone analysis tools. Each technique is also analyzed in terms of its limitations and drawbacks. The analysis also contains suggestions that can improve the quality of each technique. Finally, we characterize the capabilities of visualizations in terms of a conceptual model of the cloning information that programmers are typically interested in. We present the result in the form of a faceted classification of visualizations.

There are also other papers that present and analyze new visualizations for clones. Tairas et al. [TGB07] [Tai10] present clone representation, analysis, and maintenance techniques that are used only in their Eclipse plug-ins CloViz (Clone Visualization) and CeDAR (Clone Detection, Analysis, and Refactoring) [TGB07]. They also describe different techniques related to clone representation, including clone visualization and localized representation. Likewise, de Wit [Wit08] discusses a clone management strategy by introducing the CloneBoard plug-in. This tool is able to maintain the history of modifications in clones, and offers several resolution strategies for inconsistently modified clones. De Wit [Wit08] has evaluated the adequacy, usability and effectiveness of the CloneBoard plug-in with an experiment. Additionally, Rieger et al. [RDL04] and Rieger [Rie05] propose visualization techniques for duplicated source elements, elaborating the duplication problems revealed by each view, and explain how each view can support the engineer in his tasks. Still, these papers do not cover all software clone visualizations. Finally, Jiang [Jia06] has proposed three visualization techniques to help researchers in examining cloning in large scale software systems. Clones are classified as external clones and internal clones, and the visualization techniques are evaluated based on this classification of clones.

The cloning situation can be viewed from multiple perspectives, depending on the specific goal of the clone analysis at hand. Typically, a user would initially want to get a global view of cloned modules or sub-systems, and then zoom in to the detailed views of specific regions of the system. There is a rich set of visualization techniques that can help users analyze clones from global and detailed perspectives. Clone detection tools typically support a subset of these techniques. Some of the visualizations that have been used for showing cloning data are:

• Dot plot
• Tree map
• Bar chart
• Edge-bundling wheel view
• Navigation tree
• Bar map
• Parallel coordinates view
• Line chart
• Pie chart

Clone analysis for refactoring

From the output of a clone detection tool, Ducasse et al. [DRGB99] identified two types of clones that could be refactored using two variants of the standard extract method refactoring [FBB+99].

Balazinska et al. [BMDK00] have proposed advanced clone analysis based on the similarities and differences of the clones to identify clone groups that are suitable for refactoring. Various measures of context analysis are also proposed, based on the relationships between a class and its methods, a method and its variables, a method and the other methods that it calls, etc.

Gemini [UHK+02] [UKKI02b] is a clone analysis tool that works with the output from CCFinder [KKI02], and computes various clone metrics to represent different properties of clones [HKKI07]. Visualizations like the Scatter Plot and the metric-graph view further provide visual aids for analysing clones [CYI+11] [HKKI04b].

CCShaper [HKKI04a] [HUK+02] extracts cloned code blocks from the output of CCFinder [KKI02], that gives clones of arbitrary granularity, to aid extract method refactoring. ARIES [HKKI04a] combines the metric-graph view of Gemini with the clone filtering of CCShaper, and also computes additional clone related metrics.

Yoshida et al. [YHK+05] introduced the concept of “chained clones”, referring to the clones from different clone groups that are in a calling or data sharing relationship with each other. It is argued that such clones should be refactored together, and a PDG based technique built on top of the ARIES environment is also proposed to locate these chained clones.

In an empirical study on open source systems [TG10], it was found that sometimes refactoring (extract method) is performed on only part of a clone, such that the remaining clone can still be further refactored. The reason for this phenomenon is suggested to be developers' lack of awareness of, or lack of access to, clone detection and clone management tools.

Libra [HUKI07] is a tool for detecting clones of a specific code fragment so that code can be changed without creating update anomalies, using CCFinder as the underlying clone detection mechanism. A similar tool is proposed by Lee et al. [LRHK10] that finds the k most similar clones of a given code fragment using a feature vector of the code's AST. Another such tool, integrated with the Eclipse IDE, has been developed by Zibran and Roy [ZR12b].

Clone categorization for refactoring

Different categorizations of clones have been proposed based on the type of refactoring that is applicable to them. This is still considered an open problem [Kos06], and more factors could be considered for better categorizations.

Balazinska et al. [BMD+99a] have proposed a categorization of function/method clones focusing on the similarities and differences between them. Koni-N’Sapu [Kon01] proposed an alternate categorization of clones focusing on the position of these clones in the inheritance structure of the system. For each category of clones, a set of refactorings is also proposed.

Schulze et al. [SKR08] consider aspect-oriented refactorings more appropriate than object-oriented refactorings for certain types of clones. To make this decision, they have proposed a clone categorization focusing on the nature of clones and their positions. Kodhai et al. [KVB+10] have also proposed a similar mechanism for clone refactoring.

Using formal concept analysis with a data mining approach, Torres [Tor04] has proposed four categories of concepts that can contain clones, and suggested appropriate refactorings for each category.

A comprehensive categorization of clones is proposed by Kapser and Godfrey [KG04], based on both the position of clones in the system and the types of similarities between the clones. However, no refactorings have been suggested for each category of clones.

Verification of clone modification/refactoring

Manual modification or refactoring of a clone can be error prone. CReN [JH07] checks for inconsistently renamed identifiers when a clone is being modified in the IDE, and suggests consistent renamings. JSync [NNP+11] provides the same behaviour when the code is being checked in to the central code repository. Similarly, refactoring of code clones also needs to be verified to ensure that the program behaviour has not changed [ZR11c].

Cost-benefit analysis and scheduling of clone refactoring

Bouktif et al. [BANM06] have proposed an effort estimation model for the refactoring of clones. This model estimates the refactoring effort by considering clone size, caller-callee relationship, and the extent of modification required for a given refactoring. The refactoring scheduler is framed as a constrained knapsack problem and solved by a genetic algorithm.
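The knapsack framing can be made concrete with a small sketch. Note that it swaps in an exact dynamic programming solution of the plain 0/1 knapsack problem instead of the genetic algorithm of [BANM06], and it ignores any additional constraints among refactorings. The candidate names, effort values, and benefit values are hypothetical.

```python
def schedule_refactorings(candidates, effort_budget):
    """Choose refactorings that maximize total benefit within an effort budget.

    candidates: list of (name, effort, benefit) with positive integer effort.
    Returns (total_benefit, chosen_names).
    """
    # best[e] = best (benefit, names) achievable with total effort <= e
    best = [(0, []) for _ in range(effort_budget + 1)]
    for name, effort, benefit in candidates:
        # Iterate budgets downwards so each candidate is used at most once.
        for e in range(effort_budget, effort - 1, -1):
            prev_benefit, prev_names = best[e - effort]
            if prev_benefit + benefit > best[e][0]:
                best[e] = (prev_benefit + benefit, prev_names + [name])
    return best[effort_budget]
```

With an effort budget of 5, candidates costing 3, 4, and 2 units with benefits 40, 50, and 30 yield the first and third refactoring (total benefit 70), not the single most beneficial one.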

Similarly, Liu et al. [LLMS08] have used a heuristic algorithm for refactoring scheduling. For scheduling, they have considered the conflicts and sequential dependencies among the refactoring activities, and try to maximize code quality indicators while satisfying the constraints. In another approach [LBC+10], the ordering messy genetic algorithm (OmeGA) is used to schedule refactoring using similar constraints and objectives.

Scheduling of clone refactoring has also been formulated as a constraint satisfaction optimization problem [ZR11a] [ZR11b] considering a number of hard and soft constraints. A constraint programming technique was applied to compute an optimal solution. The authors also proposed an effort estimation model for object-oriented code refactoring.

B2. Current State of the Art: (Please describe the current state of the art specific to this research topic.)

Clone management tools

Some initial clone management tools have started to appear recently. Venkatasubramanyam et al. [VSR12] provide a proactive clone management technique. The tool is integrated with an IDE and is activated by the copy/paste operations of the IDE. Constraints are applied to copying a code fragment to make a clone, and also to the modifications made to this piece of code. These constraints are specified on a special representation of the code. However, the clone management facilities provided after the clone has been accepted by the tool are not clear.

Another tool, CPC [Wec08], is implemented as a framework to provide a platform for developing clone management tools.

Hou et al. [HJJ09a] have developed CnP, a toolkit for clone management that handles copy-paste clones. However, there is only limited support for clone management in this toolkit.

Zibran and Roy [ZR11c] have also proposed and developed a versatile clone management tool.

CloneTracker [DR08] uses clone region descriptors (CRDs) for tracking clones during the evolution of the software and supports simultaneous modifications of clone regions.

Clone refactoring tools

An important and challenging aspect of clone management is automated refactoring or removal of clones from the source code. Modern IDEs like Eclipse provide some automated support for simple refactorings like renaming, extract method, extract superclass, pull-up method, etc., but the limiting factor is that the candidate code for refactoring has to be manually selected. Some tools have been developed to further automate the refactoring of code. These include tools that perform automated or semi-automated analysis of clones to select candidates suitable for refactoring.

CeDAR [TG12] automatically categorizes clones for refactoring based on the types of parameterized differences present between them. The authors have automated the process of feeding clone detection results to the Eclipse refactoring engine. They have also improved the Eclipse refactoring engine to cater for more Extract Method refactoring scenarios.

The authors of [TYM+11] point out the difficulty of applying general refactorings to code clones because of the variety of differences that can exist between clones. They propose a categorization of clones based on these differences, together with some specialized extensions of the general refactorings discussed by Fowler et al. [FBB+99].

Volanschi [Vol12] captures repeated domain concepts as stereotypes and describes an approach to manually replace these stereotypes with code generators. Iso-Generation tests based on simple text matching are performed to verify the exact matching of the original code and the generated code.

Fanta and Rajlich [FR99] have proposed a tool for refactoring of C++ code that provides functionality like function insertion, function expulsion, function encapsulation, renaming, and argument reordering. However, the actual implementation has several operational limitations.

Wrangler [LT10] provides the equivalent of extract method refactoring in the functional language Erlang by folding expressions against a function definition.

CloneDR ( http://www.semanticdesigns.com/Products/Clone/ ) provides interactive support for extract method refactoring by creating a parameterized method from a group of cloned blocks.

Clone detection tools integrated in IDEs

Clone detectors are sometimes implemented as standalone tools [KKI02] [RC08a] [JMSG07] [HK09] or they are integrated inside the IDEs. For clone management, an integrated clone detection tool is more appropriate. Here we list clone detection tools that are implemented as plugins to various IDEs but have either limited or no support for further clone management activities.

CloneDetective [JDH09] is an open source framework based on the ConQAT infrastructure ( http://www.conqat.org/ ), which is an integrated toolkit for software quality assessment. The CloneDetective framework is used for implementing custom clone detectors by providing skeletal implementations of the different phases of clone detection. It also provides a stand-alone clone viewer that can be integrated with Microsoft Visual Studio.NET or Eclipse.

SimScan ( http://blue-edge.bg/download.html ) is a parser-based clone detection tool available as a plugin to Eclipse, IDEA, or JBuilder.

DupMan (http://sourceforge.net/projects/dupman/) is a framework integrated with Eclipse to support development of clone detection and clone management tools based on a generic model for describing clones [Gie07].

CloneBoard [Wit08], CPC [Wec08], and CnP [HJJ09a] are other clone detectors that are Eclipse plugins to detect and track clones by using the copy-paste clipboard activities of the editor. CloneBoard and CPC also provide linked editing of clone pairs [TBG04].

SHINOBI [KYU+09] and CodeRush ( https://www.devexpress.com/Products/CodeRush/ ) are add-ons to Microsoft Visual Studio for clone detection. SHINOBI uses CCFinderX’s preprocessor and shifts the clone detection overhead to the CVS server instead of the IDE. It also displays clones of a code fragment under the mouse pointer.

Wrangler [LT10] provides refactoring support of functional programs written in Erlang. Wrangler can be integrated with Emacs or Eclipse IDE. Wrangler can also detect clones from Erlang programs using an AST based approach.

JClone [Bah10] is an Eclipse plugin for detecting clones in Java code using an AST based approach and provides different clone visualizations.

JSync [NNP+11] is a plugin to the SVN version control system for clone detection using similarity of feature vectors computed from an AST representation of code. JSync also provides some clone management features, which are described in the next section.

Copy-Paste Detector (CPD) ( http://pmd.sourceforge.net/pmd-5.1.0/cpd-usage.html ) is part of the PMD tool suite that performs source code analysis of Java programs.

SDD (http://wiki.eclipse.org/index.php/Duplicated_code_detection_tool_(SDD)) is a freely available plugin to Eclipse for clone detection using n-neighbour distance and inverted indexes [LJ05].

Simian ( http://www.harukizaemon.com/simian/ ) is another clone detector implemented as an Eclipse plugin.

CloneDigger ( http://clonedigger.sourceforge.net/download.html ) is also an Eclipse plugin that performs clone detection based on AST and anti-unification on source code written in Java or Python.

Tairas and Gray [TG06] implemented a clone detector that works with suffix trees, as a plugin for the Microsoft Phoenix framework.

CloneDR ( http://www.semanticdesigns.com/Products/Clone/ ) is a commercial AST-based clone detector [RCK09], also implemented as an Eclipse plugin. Besides clone detection, CloneDR offers support for clone removal using preprocessor macros.

CeDAR [TG12] is an Eclipse plugin that can display clone detection results from other clone detection tools (e.g., CCFinder, CloneDR, DECKARD, Simian, or SimScan). It also displays various clone related metrics computed from the results.

Zibran and Roy [ZR12b] have developed an Eclipse plugin to facilitate focused search for clones of a selected code fragment, using a suffix-tree-based k-difference hybrid approach.

Clone management applied in the industry

Some initial studies have shown the success of clone management in industrial settings. Yamanaka et al. [YCY+12] [YCY+13] present the application of a simple clone management system in an industrial setting. Clones are tracked across versions, and changes to existing clones or newly created clones are reported to developers via email. In this experiment, two potentially harmful clones were caught by the system within 15 days, and were handled by the developers.

Inoue et al. [IHY+12] describe the tool CloneInspector, which detects bugs caused by inconsistent changes to code clones. The development of the tool was funded by Samsung, and the tool reported several bugs in their systems.

The proactive clone management tool proposed by Venkatasubramanyam et al. [VSR12] is integrated with an IDE and is being used at Siemens.

Gleirscher et al. [GGIW12] report an experience of applying various static analysis techniques, including code clone detection, to five German SMEs. They found that the effort to introduce these analysis techniques is usually small. They observed a high level of cloning (14% to 79%) in the studied systems. Their industry partners found clone detection relevant enough for inclusion in their regular quality assurance process despite the limited resources and tight schedules usually faced by SMEs. In the authors’ opinion, clone detection can efficiently improve quality assurance in SMEs, provided it is continuously used throughout the development process and is technically well integrated into the tool landscape.

C. Challenges: (Please describe the challenges, specific to this research topic, currently being faced internationally.)

Refactoring Challenges

Automatically unifying and refactoring clones is a challenging problem, compounded by the fact that clones are usually modified after creation. In the most recent work on this topic by Krishnan and Tsantalis [KT14], it was found that, on average, only around 36% of the clones in a system can be refactored through conventional means. Some of the challenges and limitations of current refactoring mechanisms are discussed here.

Limitation of method parameterization: A major limitation of the current refactoring tools is that they can parameterize only a small subset of differences found in clones. Differences like operator and variable type mismatch cannot be parameterized with methods [KT14].

Limitation of refactoring due to different behaviour or control structure of clones: Different clones can have different control structures. The automated refactoring tool by Krishnan and Tsantalis first identifies clones with identical control dependence structures to serve as candidate refactoring opportunities, while the other clones are ignored as unsuitable for refactoring [KT14]. The selected clones are further scrutinized against a set of strict preconditions to determine if they can be parameterized without changing the program behavior. If any precondition is violated, the clones are rejected from refactoring [KT14]. These preconditions are:

Precondition 1: The parameterization of differences between the matched statements should not change the program behavior.

Precondition 2: The unmatched statements should be movable before or after the matched statements without changing the program behavior.

Precondition 3: The duplicated code fragments should return at most one variable of the same type.

Precondition 4: Matched branching statements should be accompanied with corresponding matched loop statements.

Precondition 5: If two clones belong to different classes, these classes should extend a common superclass.

Limitations of refactoring with generics: In an earlier work [BRJ05a] [BRJ05b], the PI carried out two case studies to assess the potential of generics and templates in unifying clones. In the first study [BRJ05a], we experimented with generics in Java. We tried to unify classes in the Java Buffer Library that differed in the type of a buffer element. We observed that type variation also triggered many other non-type parametric differences among similar classes, hindering the application of generics. As a result, despite striking similarities across library classes, only a small part of the library could be transformed into generic classes.

Careful examination revealed that most of the issues that hindered a complete generic solution for the library were specific to Java generics. However, some other issues were of a more fundamental nature. We thought further work was needed to draw the fine line between the two.

The C++ Standard Template Library (STL) provided a perfect example to strengthen the observations made in the Buffer Library case study [BRJ05b]. Firstly, the parameterization mechanism of C++ templates is more powerful than that of Java generics. Due to the light integration of templates with the C++ language core, template parameters are less restrictive than parameters of Java generics. Unlike Java generics, C++ templates also allow constants and primitive types to be passed as parameters. Secondly, the STL not only uses the most advanced template features and design solutions (e.g., iterators), but it is also widely accepted in the research and industrial communities as a prime example of the generic programming methodology.

The STL needs genericity for simple and pragmatic reasons: there are plenty of algorithms that need to work with many different data structures. Without generic containers and algorithms, the STL’s size and complexity would be enormous. Such a simple-minded solution would unwisely ignore similarity among data structures, and also among algorithms applied to different data structures, which offers endless reuse opportunities. Redundant code arising from unexploited similarities would contribute much to the STL’s size and complexity, hindering its evolution. The objective of the STL was to avoid these complications without compromising efficiency. Still, we found much cloning in the STL. Our study confirmed that these clones varied in certain ways that could not be easily unified by template parameters. These restrictive variations include non-parametric variations, non-type parametric variations, type variations not parameterizable by generics, and clones involving coupling between classes [BRJ05a].

Limitations of refactoring with design patterns: To eliminate redundant code in a Java software system, Balazinska et al. [BMD+99b] [BMDK00] applied refactoring based on the ‘strategy’ and ‘template’ design patterns, factoring out the commonalities of methods and parameterizing the differences according to the design patterns. However, the applicability of this technique is restricted to specific types of clones. Also, the reengineering process merged 28 methods but created 84 new methods, and thus actually increased the lines of code.

Limitations of refactoring with macros: Most macro systems are merely implementation-level mechanisms for handling variant features (or changes, in general). They fail to address change at the analysis and design levels and have never evolved towards full-fledged “design for change” methods [Bas97] [KRT97]. Programs instrumented with macros tend to be difficult to understand and test.

Limitations of refactoring with aspects: Aspect technologies promise improved modularity of programs, which may also reduce clones. However, it is argued that, due to restrictive composition rules and the absence of a parameterization mechanism, AOP is limited in its ability to remove a large number of code redundancies [JL06].

Limitation of refactoring due to risk of breaking running code: Another reason for the limited use of refactoring is the risk of breaking a currently working piece of code [Cor03]. Refactoring involves changing the software system, and software development organizations usually have a very conservative approach to maintenance and change. This is because the risk of introducing a new bug is considered proportional to the amount of change made to a software system. Hence, systems are changed as little as possible in the maintenance phase.

D. Motivation and Need: (Please describe the motivation and need for this work.)

The proposed clone management tool suite will help industry maintain software efficiently, which will also increase productivity and performance in new development. The tool suite will add significant value by providing the ability to merge, remove and reuse common code chunks. This will have a major positive economic impact on the industry by enabling optimized code development, saving redundant effort, and making efficient use of existing code for multiple future uses.

Why are we interested in clones?

Code cloning is a serious problem in industrial software systems. Having similar code in multiple places of a software system compromises its quality, correctness and maintainability. If a developer has to write, or even copy-paste, the same piece of code in ten different files, chances are high that they will make an inconsistent change somewhere. The next question is: what if there is a bug in this code segment? Writing a similar bug fix in ten different files has a high probability of injecting more inconsistencies and mistakes. Going further, such cloning may cause a very high software maintenance cost. Another difficulty is comprehending a software system with a high degree of code cloning. What if a developer has to read through ten different files or folders to understand a single function? This makes it harder to become familiar with the code and ultimately increases the cost. Software systems with a high level of cloning are also very difficult to extend in terms of functionality. Clones also increase system size and are associated with design problems such as missing inheritance or missing procedural abstraction. Also, an inflated code base requires more system resources to sustain it. All these factors substantially increase the financial costs during different phases of the software development life cycle, especially in post-delivery maintenance. It is estimated that around 40%-70% of the total cost is incurred on post-delivery maintenance. Empirical studies show that 9%-17% [ZSAR11] of a software system consists of code clones. This proportion may vary from 5% [RC07] up to 50% [RDL04].

Why is clone management important?

Despite the fact that clones negatively impact the maintainability of a software system, it is not practically possible to get rid of all clones in a system through refactoring [Cor03]. Sometimes, these clones are intentional, for example, clones in code generated by code generators that developers don’t want to modify manually. Sometimes the programming language does not provide enough generic programming capability to unify the differences in a clone class. In a related work, while studying the limitations of language-level refactorings, only 34% of the clones detected by a clone detection tool were found to be refactorable [TG12]. This number increased to 36% in a follow-up work by Krishnan and Tsantalis [KT14]. Clones are also sometimes used to decouple classes and components, so that each version of the similar component or class can evolve independently.

Templates (or generics) help us write compact, generic code. The STL is a powerful example of how templates help achieve these goals. Still, our study of the STL [BRJ05b] revealed substantial counter-productive clones across groups of similar class or function templates. Clones occurred, as variations across these similar program structures were irregular and could not be unified by suitable template parameters in a natural way. We encountered similar problems in other class libraries as well as in application programs, written in a range of programming languages.

Removing clones at the language level requires changes to the code. In real systems, clones are often tolerated in spite of their negative effect on maintenance, to avoid the risk of breaking a running system while trying to remove them [Cor03]. When clones are specifically created for performance considerations, it is not advisable to remove them altogether. Similarly, at times the clone resolution may be possible through refactoring [FBB+99], but the result may conflict with other design goals that cannot be compromised [JL06].

In an empirical study on open source systems [TG10], it was found that sometimes refactoring (extract method) is performed on only part of a clone, such that the remaining clone can still be refactored further. The reason for this phenomenon is suggested to be the lack of awareness, or the lack of availability, of clone detection and clone management tools among developers.

Given the above limitations and risks of refactoring clones, the viable option left for the maintenance engineers is to efficiently detect and manage these clones. Researchers and practitioners [Gie07] [HG10] [HJJ09a] [HJJ09b] [LPM+97] [NNP+11] [ZR11c] [ZR12b] unanimously believe that clone management activities should be integrated with the development process to enable proactive clone management.

From the research perspective, most of the work on clones so far has focused on clone detection, followed by work on clone analysis. Lately, clone management has attracted more attention as a research topic. However, the research done on clone management is still far less than that in the other dominant areas of cloning. Hence, there is a significant need for further research in this important and promising area.

Why are we proposing clone management with VCL?

In this project, we are targeting compensative clone management with VCL by limiting the negative impact of existing clones that are to be left in the system.

As we move from tightly integrated language-level techniques to the meta-level techniques, we increase the expressive power of the parameterization mechanism. VCL can be applied on top of other languages and design techniques, complementing and enhancing them in areas where conventional techniques fail to provide a satisfactory solution [BRJ05b]. Since VCL works at the meta-level without altering the program, there is no risk involved in terms of breaking a running system, loss of performance, or compromising other design goals. For example, VCL was applied to unify clones in the STL, with no risk of breaking it or any other system that uses the STL [BRJ05b].

The VCL representation for a clone class is both physically smaller and conceptually easier to understand than the original source code for all the clones in the clone class. This solution also highlights important relationships between various code elements, which helps programmers better understand the cloned code and its variants. This includes the visibility of similarities and differences among clones, and the ability to trace how various features affect the code at various places. Instead of dealing with each clone individually, we can understand clones together. In the VCL representation, we can see what is similar and what is different across specific clones in a clone class, down to every detail of the code. If we want to change one clone, we can check if the change also affects other clones. The information contained in the VCL representation reduces the risk of update anomalies, and helps in reusing existing clones when writing new code [JL06].

Having to deal with an additional meta-level adds a certain amount of complexity to the problem. However, the feedback from our industry partners indicates that, in practice, the benefit of being able to deal with complexity at two levels outweighs the cost of the added complexity. The syntax of VCL is easy to learn, and properly designed meta-structures and tools help mitigate this problem of added complexity at the meta-level.

3. Objectives of the Project

A. Specific Objectives Being Addressed by the Project: (Please describe the measurable objectives of the project and define the expected results. Use results-oriented wording with verbs such as ‘to develop..’, ‘to implement..’, ‘to research..’, ‘to determine..‘, ‘to identify..’ The objectives should not be statements and should not include explanations and benefits. The objective should actually specify in simple words what the project team intends to achieve (something concrete and measurable/ deliverable). Fill only those objectives that are applicable to the proposed project.)

A1. Research Objectives: (if any)

The overall goal of this project is to develop a tool suite for complete clone management capability for software development organizations. Below are the specific research objectives related to development of this end-to-end solution:

1. To implement an efficient token-based simple clone detection tool with high precision and recall and minimal language dependency.
2. To implement efficient data mining based techniques for detecting structural clones.
3. To develop and implement efficient and useful clone visualization techniques for structural clones.
4. To identify VCL framing patterns and corresponding clone categorization based on those framing patterns.
5. To research and develop efficient incremental structural clone detection techniques for changing software.
6. To develop and implement efficient storage structures for clones and frames for a system.
7. To integrate all the above techniques in an Eclipse plug-in component.
8. To publish results on these novel techniques of clone management in leading journals and conferences.

A2. Academic Objectives: (if any)

1. To engage graduate students in the proposed research for their dissertation.
2. To incorporate the research findings of this project into relevant graduate courses like Software Engineering, Distributed Systems, Software Reuse.

A3. Industrial Objectives: (if any)

1. To identify opportunities for local software houses for utilization of total clone management solution to improve code maintenance and reuse.
2. To conduct training sessions for industry professionals in clone management and frame technology.

A4. Human Resource Development Objectives: (if any)

1. To develop human resource expertise in the area of software maintenance and specifically in clone management with VCL.

A5. Other Objectives: (if any)

4. Research Approach

A. Development / Research Methodology: (Please describe the technical details and justification of your development and research plan. Identify specialized equipment, facilities and infrastructure which are required for the project and their utilization plan. The block diagrams, system flow charts, high level algorithm details etc. have to be provided in this section.)

Some of the major tasks that we will be carrying out in this project include:
• Extensive survey of existing clone management techniques
• Implement scalable and accurate incremental structural clone detection
• Standardize output documentation format for structural clones
• Develop suitable visualizations and analysis tool for structural clones
• Develop a catalog of VCL refactoring patterns
• Implement automated framing in VCL
• Implement VCL frame repository
• Implement back-propagation support from code to VCL frames
• Implement frame verification sandbox environment
• Develop training and user manuals for the tool suite
• Run a pilot project at industrial partner site

Figure 3 shows the flow of activities in the proposed system.

Figure 3: Flow of activities in the proposed Clone Management Tool Suite

Architecture and Components: The system will be based on a client-server architecture. Clone detection and clone framing will be performed on the server side, based on the code stored in the version control system of the organization. Developers in large organizations might not have the complete code base stored on their local machines. Clone detection on a developer’s local snapshot of the code is hence not complete, and has to be done on the central server.

A central server will also store the frames repository as well as the intermediate code representation for incremental clone detection. The system will consist of five processing components and two repositories, as shown in Figure 4.

Figure 4: Proposed architecture of the Clone Management Tool

The details of the processing components are discussed below.

Clone Detection and Documentation Component (CDDC): This component will perform the initial clone detection and generate the first report of clones found in the system. We will be using a token-based approach for detection of simple clones, based on earlier work done by the PI [BSP+07]. Token-based code representation provides a suitable abstraction for clone detection. It involves minimal code transformation and gives good results. It is easy to adapt to different languages while retaining awareness and control of the underlying language tokens. Comparative studies involving different clone detection techniques have shown that token-based clone detection tools perform well in terms of precision and recall of the detected clones. However, manipulating all tokens in large software systems is computationally very expensive. Efficient data structures and matching algorithms can help mitigate this problem and make the technique scalable even for very large systems of multi-million lines of code.

Token based techniques make use of a well known data structure, suffix tree, to detect similarities in the token string. An alternative data structure, suffix arrays, can provide the same efficiency for string pattern matching with much reduced space requirements.
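As a rough illustration of the suffix array idea mentioned above, the following Python sketch builds a suffix array by plain sorting and compares adjacent suffixes to find the longest repeated token sequence. It is a minimal sketch and not the implementation used in the proposed tool: the function names are hypothetical, and the naive sort is much slower than the efficient constructions a real detector would rely on.

```python
def suffix_array(tokens):
    """Naive construction: sort suffix start positions by the suffix itself."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def longest_repeat(tokens):
    """Return (length, (pos1, pos2)) of the longest repeated token sequence.

    Repeated sequences show up as common prefixes of suffixes that are
    adjacent in the suffix array, so only neighbours need to be compared.
    """
    sa = suffix_array(tokens)
    n = len(tokens)
    best_len, best_positions = 0, ()
    for a, b in zip(sa, sa[1:]):
        k = 0  # length of the common prefix of the two adjacent suffixes
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        if k > best_len:
            best_len, best_positions = k, tuple(sorted((a, b)))
    return best_len, best_positions
```

The array itself holds only one integer per token, which is where the space saving over a suffix tree comes from.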

Token based techniques usually distinguish between parameter and non-parameter tokens, due to the limitation of language level refactoring techniques. With VCL based clone management, we can have generic frames parameterized in fairly unrestrictive ways. Hence we assume that we can uniformly treat all tokens as equal without differentiating between tokens that can become parameters for generic structures and tokens that cannot.

Our original simple clone detection tool RTF [BSP+07] implements a simple and flexible tokenization mechanism. The language-specific tokenizer just needs to know the token classes of the source language, i.e., keywords, operators, comment markers etc. Each token class of a given language’s grammar is assigned a unique numeric ID. The text related to each token and its location in the source file is also stored by the scanner for output generation. A single large token string is generated from all the source files. Identical segments of this token string are reported as clones, which can either be exact clones or parameterized clones owing to the normalization of certain tokens according to the proposed tokenization. Blank spaces, blank lines and comments are ignored during tokenization.
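The tokenization scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RTF's actual code: the keyword and operator sets, the regular expression, and the decision to give all identifiers one ID and all numeric literals another are hypothetical choices made to show how normalization lets parameterized clones match.

```python
import re

# Hypothetical token classes for a tiny Java-like language (illustrative only).
KEYWORDS = {"int", "long", "return", "public", "private"}
OPERATORS = {"+", "-", "*", "/", "=", "(", ")", "{", "}", ";", ","}

# One numeric ID per token class. All identifiers share one ID and all numeric
# literals share another (an assumption made for this sketch).
OTHER_ID, IDENTIFIER_ID, LITERAL_ID = 0, 1, 2
CLASS_IDS = {tok: i for i, tok in enumerate(sorted(KEYWORDS | OPERATORS), start=3)}

# Line comment, identifier or keyword, number, or any other single character.
TOKEN_RE = re.compile(r"//[^\n]*|[A-Za-z_]\w*|\d+|\S")


def tokenize(source, filename):
    """Return (ids, details): the numeric token string and, for each token,
    its text, file and line number so that clones can be reported later."""
    ids, details = [], []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in TOKEN_RE.finditer(line):
            text = match.group()
            if text.startswith("//"):
                continue  # comments are ignored; whitespace never matches
            if text in CLASS_IDS:
                token_id = CLASS_IDS[text]
            elif text[0].isdigit():
                token_id = LITERAL_ID
            elif text[0].isalpha() or text[0] == "_":
                token_id = IDENTIFIER_ID
            else:
                token_id = OTHER_ID
            ids.append(token_id)
            details.append((text, filename, line_no))
    return ids, details
```

With this scheme, `int x = y + 1;` and `int total = count + 42;` produce the same ID sequence, so they would be reported as a parameterized clone.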

RTF also allows the user to tailor the token string for better clone detection. The first possibility is the suppression of certain insignificant token classes that may be considered noise in source code for clone detection. For example, differences in access modifiers like private, protected, public etc. may not be very interesting between two otherwise identical methods, and hence these tokens can easily be suppressed. The user can simply indicate, from a list of all language tokens, the tokens that should be suppressed during code scanning, and these tokens will not become a part of the final token string. Another option is to equate different token classes. For example, if the user does not want to differentiate between the types {int, short, long, float, double}, the same ID can be used to represent every member of this set of types. In this way, all code fragments that differ only in the type of certain variables become exact replicas of each other in the token string. Users can also choose to equate operators, certain keywords, punctuation symbols, access specifiers, etc., depending upon the requirement.
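The two tailoring options, suppression and equating of token classes, might look as follows in a minimal Python sketch. The class name `TokenNormalizer` and its interface are hypothetical and are not taken from RTF; the sketch only assumes that the input is already split into token texts.

```python
class TokenNormalizer:
    """Maps token texts to numeric IDs, with optional suppression and equating."""

    def __init__(self, suppressed=(), equated=()):
        self.suppressed = set(suppressed)
        # Every member of an equated group is mapped to one representative.
        self.alias = {}
        for group in equated:
            representative = min(group)  # any fixed member would do
            for member in group:
                self.alias[member] = representative
        self.ids = {}  # shared ID table, so separate calls stay comparable

    def encode(self, tokens):
        result = []
        for text in tokens:
            if text in self.suppressed:
                continue  # suppressed tokens never reach the token string
            key = self.alias.get(text, text)
            result.append(self.ids.setdefault(key, len(self.ids) + 1))
        return result
```

For example, suppressing the access modifiers and equating the numeric types makes `public int get ( ) { return x ; }` and `private long get ( ) { return x ; }` encode to the same ID sequence, whereas a normalizer with no options keeps them distinct.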

The RTF tokenizer also locates method boundaries. Clones that start from the middle of one method and end up in the middle of another may not be so meaningful. For this reason, the tokenizer inserts unique sentinel tokens at the method boundaries to break the similarity of token sequences covering two or more consecutive methods at multiple locations. The same is done with the file boundaries. These sentinel tokens marking all the methods and files in a system increase the alphabet size, but the string pattern matching algorithms are so selected that this effect is only marginal.
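The effect of the sentinel tokens can be illustrated with a short Python sketch. It assumes, purely for illustration, that real token IDs are positive integers, so that a fresh negative number can serve as each sentinel; the function name is hypothetical.

```python
from itertools import count


def join_with_sentinels(method_token_ids):
    """Concatenate per-method token ID lists into one token string.

    A fresh negative sentinel follows each method. Because every sentinel
    occurs exactly once, no repeated sequence can contain one, and so no
    repeat can extend across a method boundary.
    """
    sentinel = count(-1, -1)  # yields -1, -2, -3, ...
    combined = []
    for ids in method_token_ids:
        combined.extend(ids)
        combined.append(next(sentinel))
    return combined
```

Each sentinel adds one symbol to the alphabet, which is the alphabet growth referred to above.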

Clone detection in RTF is done by finding non-extendible (NE) repeating substrings (or simply repeats) within the combined token string. A repeating substring is non-extendible (NE) if it is neither always preceded by the same symbol nor always followed by the same symbol. These NE repeats correspond to the simple clone classes, and their multiple occurrences in the string point to the different clone instances.
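To make the definition of an NE repeat concrete, here is a brute-force Python sketch that checks the two conditions directly. It is deliberately naive and is not the algorithm used by RTF: it enumerates every substring, so it is only usable on very small inputs, and the function name and the treatment of the string ends as distinct boundary symbols are assumptions of this sketch.

```python
from collections import defaultdict


def ne_repeats(tokens, min_length=2):
    """Return {substring (tuple): [start positions]} for every substring of at
    least min_length tokens that occurs two or more times and is neither
    always preceded nor always followed by the same symbol."""
    n = len(tokens)
    occurrences = defaultdict(list)
    for length in range(min_length, n):
        for start in range(n - length + 1):
            occurrences[tuple(tokens[start:start + length])].append(start)

    result = {}
    for sub, starts in occurrences.items():
        if len(starts) < 2:
            continue
        # The ends of the string count as symbols different from any token.
        before = {tokens[s - 1] if s > 0 else ("start",) for s in starts}
        after = {tokens[s + len(sub)] if s + len(sub) < n else ("end",)
                 for s in starts}
        if len(before) > 1 and len(after) > 1:
            result[sub] = starts
    return result
```

For the token string `x a b c y a b c z`, only `a b c` is reported: `a b` is always followed by `c` and `b c` is always preceded by `a`, so both can be extended.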

To locate all the NE repeats in the token sequence, we use the suffix array data structure and a straightforward variation of the algorithm described in [AKO04] for locating NE repeating pairs of substrings. Our extension to [AKO04], which we call NERF (Non-Extendible Repeat Finder), ensures that whole sets of NE repeats are computed instead of just pairs. The algorithm runs in linear time and uses linear space in addition to the token string; however, we omit further algorithmic details here.

The advantages of using NERF in the wider RTF approach are twofold. Firstly, the suffix array leads to a significant reduction in memory usage; secondly, as NERF returns full sets of repeats, RTF can directly form clone classes based on its output.

For the initial detection of structural clones, we will adopt the mechanism from Clone Miner [BJ09], an earlier tool developed by the PI. The overall process is explained here.

Clone Miner performs structural clone detection by finding simple clones first, and then gradually raising the level of clone analysis to larger similar program structures. Table 1 lists the structural clones detected by Clone Miner. The overall algorithm for structural clone detection at various levels is shown in Figure 5.


Figure 5. A hierarchy of structural clones detected by Clone Miner and the overall detection process.

Clone Miner uses RTF as the default front-end tool to compute the simple clone classes. Structural clones at Levels 1 and 2 are found by manipulating the simple clones’ data extracted from a software system. For this, we first need to reorganize this data to make it compatible with the input format for the data mining technique that is subsequently applied on this data.

We list simple clones for each method or file, depending on the analysis level. With this arrangement of simple clones, we get a different view of the simple clones’ data, with simple clones arranged in terms of methods or files. From this data, we detect recurring groups of simple clones in different files or methods, to identify the level 2-B and level 1-B structural clones, respectively. The order in which different simple clones appear in a file or method is ignored at this stage. The list of simple clone classes in a file or method is, hence, sorted to facilitate further analysis.
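The reorganization step can be sketched in Python as a simple inversion of the clone data. The input layout (a mapping from simple clone class ID to its instances) and the names clone_classes and clones_by_container are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input: simple clone class ID -> list of (container, start token).
clone_classes: Dict[int, List[Tuple[str, int]]] = {
    9: [("A.java", 10), ("A.java", 80), ("B.java", 5)],
    15: [("A.java", 40), ("B.java", 60)],
}


def clones_by_container(classes: Dict[int, List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """List the simple clone class IDs present in each file (or method)."""
    per_container = defaultdict(list)
    for class_id, instances in classes.items():
        for container, _start in instances:
            per_container[container].append(class_id)
    # Order of appearance is ignored at this stage, so each list is sorted.
    return {c: sorted(ids) for c, ids in per_container.items()}


transactions = clones_by_container(clone_classes)
print(transactions)
```

Each resulting list plays the role of one transaction in the mining step that follows; note that a class ID may appear more than once in the same container.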

To detect recurring groups of simple clones in different files or methods, we apply the same data mining technique that is used for “market basket analysis”. The idea behind this analysis is to find the items that are usually purchased together by different customers from a departmental store. The input database consists of a list of transactions, each one containing items bought by a customer in that transaction. The output consists of groups of items that are most likely to be bought together. The analogy here is that a file or a method corresponds to a transaction and the simple clone classes, represented in that file or method, correspond to the items of that transaction. Our objective is to find all those groups of simple clone classes whose instances occur together in different files or methods.

The market basket analysis uses the “frequent itemset mining (FIM)” technique. The difference between our problem and the standard FIM problem is that in standard FIM, the items in a transaction are considered unique, whereas in our data, one file or method may contain multiple instances of the same simple clone class. We could normalize the data by removing the duplicates, but by doing so we would miss important information, namely the cases where multiple instances of a simple clone class are part of a valid structural clone across files or methods. To let the FIM algorithm differentiate between different instances of the same simple clone class in a given file or method, we encode the tuple (a, b) as an integer, where a is the simple clone class ID and b is the occurrence index of this ID in the given file or method. This is a reversible transformation: once we get the output from the FIM algorithm, we can perform the reverse transformation to get the desired output, for example 9-9-9-15 (three instances of simple clone class 9 together with one instance of class 15) as the detected level 2-B structural clone.
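One possible reversible encoding is sketched below in Python. The specific formula (class ID multiplied by a fixed bound, plus the occurrence index) and the constant MAX_OCCURRENCES are assumptions of this sketch; the proposal does not state which encoding Clone Miner uses.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed upper bound on how often one clone class may occur in one container.
MAX_OCCURRENCES = 1000


def encode(class_id: int, occurrence: int) -> int:
    """Pack (class ID, occurrence index) into a single integer item."""
    if not 0 <= occurrence < MAX_OCCURRENCES:
        raise ValueError("occurrence index out of range for this encoding")
    return class_id * MAX_OCCURRENCES + occurrence


def decode(item: int) -> Tuple[int, int]:
    """Recover (class ID, occurrence index) from an encoded item."""
    return divmod(item, MAX_OCCURRENCES)


def encode_transaction(class_ids: Iterable[int]) -> List[int]:
    """Give each repeated class ID in one file or method its own distinct item."""
    seen: Dict[int, int] = {}
    items: List[int] = []
    for cid in sorted(class_ids):
        k = seen.get(cid, 0)
        items.append(encode(cid, k))
        seen[cid] = k + 1
    return items


def decode_itemset(items: Iterable[int]) -> List[int]:
    """Map a mined itemset back to the clone class IDs it stands for."""
    return [decode(i)[0] for i in sorted(items)]


items = encode_transaction([9, 15, 9, 9])
print(items, "->", "-".join(str(c) for c in decode_itemset(items)))
```

After encoding, every item in a transaction is unique, so a standard itemset miner can be applied unchanged, and decoding an output itemset restores the group of clone class IDs including repetitions.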

Mining all frequent itemsets returns many frequent itemsets that are subsets of bigger frequent itemsets. More suitable for our problem is “Frequent Closed Itemset Mining” (FCIM), where only those itemsets are reported which are not subsets of any bigger frequent itemset with the same support.

One of the input parameters for FCIM is the minimum support count, or simply support, of a frequent itemset. In our context, it indicates the minimum number of files or methods that should contain the detected group of simple clone classes for it to be reported as frequent. Because FCIM is a general-purpose problem, the standard algorithms let the user adjust the minimum support level. In our case, we have hard-coded the support value at 2, so that a group of simple clone classes is reported even if it is present in only two files, as it could still be significant because of its length.
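For intuition, closed frequent itemsets can be computed by brute force on small inputs, as in the Python sketch below. It relies on the standard fact that an itemset is closed exactly when it equals the intersection of all transactions that contain it. The enumeration is exponential in the number of transactions, so this is an illustration only and not one of the efficient FCIM algorithms that a real tool would use; the function name is an assumption.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def closed_frequent_itemsets(
    transactions: Iterable[Iterable[int]], min_support: int = 2
) -> Dict[FrozenSet[int], int]:
    """Return each closed itemset with support >= min_support and its support."""
    sets: List[FrozenSet[int]] = [frozenset(t) for t in transactions]
    candidates = set()
    # Every closed itemset is the intersection of the transactions supporting it.
    for r in range(min_support, len(sets) + 1):
        for group in combinations(sets, r):
            common = frozenset.intersection(*group)
            if common:
                candidates.add(common)
    return {c: sum(1 for s in sets if c <= s) for c in candidates}


found = closed_frequent_itemsets([[1, 2, 3], [1, 2, 4], [1, 2, 3, 5]])
for itemset, support in sorted(found.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

In this example the group {1, 2} is reported with support 3 and {1, 2, 3} with support 2. Subsets such as {1} or {2, 3} are not reported, because a larger itemset occurs in exactly the same transactions.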

Another input parameter is the minimum size of the item set. If we assume that all the items are of equal importance, then the number of items will determine the importance of an item set. In our case, however, the number of simple clones is not the only factor that determines the importance of a repeating group. The length of the simple clones is sometimes more important than their number. Currently, these lengths are not reflected directly in the input data for FCIM.

The output from FCIM is a list of frequent itemsets along with their support count, indicating, in our case, the number of files or methods containing those groups of simple clones that are frequent (Fig. 6). As FCIM only deals with detecting frequent item sets, an important piece of information missing here is the identification of those files or methods that contain these frequent groups of simple clones. With some postprocessing, we can find this information as well.
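The postprocessing step amounts to a containment check of each reported group against every transaction. A minimal Python sketch follows; the function name, the example file names, and the item values are hypothetical.

```python
from typing import Dict, Iterable, List


def supporting_containers(
    itemset: Iterable[int], transactions: Dict[str, List[int]]
) -> List[str]:
    """Return the files (or methods) whose items include the whole itemset."""
    wanted = set(itemset)
    return sorted(name for name, items in transactions.items() if wanted <= set(items))


# Hypothetical encoded transactions, one per file.
encoded = {
    "A.java": [9000, 9001, 15000],
    "B.java": [9000, 15000],
    "C.java": [9000],
}
print(supporting_containers([9000, 15000], encoded))
```

A single linear pass over the transactions per reported itemset is enough, so this step adds little cost compared with the mining itself.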

Due to the limitation of the FCIM technique, only repeating groups of simple clones across different files and methods can be detected, even though such groups may be present within a given method or file. To detect level 1-A and 2-A structural clones, we apply a simple and straightforward follow-up technique to compute these locally repeating groups of simple clones separately, based on sorting and brute-force combination generation.

File clone classes (level 5) and method clone classes (level 3) are found by the process of clustering from the significant level 2-B and 1-B structural clones, respectively. Using this mechanism, we expect to find groups of highly similar files and methods. These clusters of similar files and methods indicate even larger granularity similarities than level 1-B or 2-B structural clones, with more defined boundaries.

The significance of a level 1-B or 2-B structural clone class is measured using the raw length of its structural clone instances and the percentage of the container (method or file) that each instance covers. By default, we use the number of tokens to compute the length of level 1-B and 2-B structural clones, and also the length of their containers (methods and files).


To perform clustering of highly similar files and methods, again we have to represent the data in a suitable format for the ease of analysis. Instead of representing files and methods, we start by representing clusters. To start with, each structural clone at level 1-B and 2-B is considered as the description of a unique cluster, which contains all the methods or files containing that structural clone. The length and coverage values for the instances of these structural clones indicate the significance of each cluster. We let the user specify a minimum average length (minLen) and a minimum average coverage (minCover) value to filter the significant clusters. These significant clusters are our file clone classes and method clone classes.
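The filtering of clusters by minLen and minCover can be sketched as follows in Python. The Instance record, its field names, and the use of a percentage scale for coverage are assumptions made for the example; lengths are counted in tokens, as stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    container: str          # method or file holding the structural clone instance
    clone_tokens: int       # tokens covered by the structural clone instance
    container_tokens: int   # total tokens in the container

    @property
    def coverage(self) -> float:
        """Percentage of the container covered by the clone instance."""
        return 100.0 * self.clone_tokens / self.container_tokens


def is_significant(instances: List[Instance], min_len: float, min_cover: float) -> bool:
    """Keep a cluster only if its average length and average coverage are high enough."""
    avg_len = sum(i.clone_tokens for i in instances) / len(instances)
    avg_cover = sum(i.coverage for i in instances) / len(instances)
    return avg_len >= min_len and avg_cover >= min_cover


cluster = [Instance("A.java", 180, 200), Instance("B.java", 160, 200)]
print(is_significant(cluster, min_len=100, min_cover=80))
```

Here the cluster has an average length of 170 tokens and an average coverage of 85%, so it passes with minCover = 80 but would be filtered out with minCover = 90.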

In the same way as finding repeating groups of simple clones across different methods and files, we find repeating groups of method clones across different files to form level 4-B structural clones.

Level 4-A structural clones, forming locally repeating groups of method clones within files, are again found by sorting and brute-force combination generation.

From file clone classes, we can move on to the level 6 and level 7 structural clones. For finding level 6-B structural clones, the previously formed file clone classes play the same role as the simple clone classes in finding level 1-B and 2-B structural clones. The containers for these file clones that we currently consider are the directories. Directories provide a naive modularization of software source code, usually reflecting the modules and subsystems of the given system. We also allow the user to specify the file groupings based on the actual modularization of the software, if it is different from the directory-based grouping of files.

The transition from level 6-B to level 7 is similar to the transition from level 2-B to level 5 via clustering. Finally, level 6-A structural clones, representing repeating groups of file clones within directories, are detected in the same way as Level 4-A structural clones mentioned previously.

We will perform a literature review of relevant material to identify recent improvements to the existing algorithms used in RTF and Clone Miner for clone detection. This comparison will lead to the implementation of an optimized clone detection tool for different programming languages.

Clone Visualization and Analysis Component (CVAC): We will review the current state of the art in clone visualization and explore how these techniques could be extended to represent structural clones as well. The selected techniques will be implemented and incorporated in this component. One of our preferred visualizations is the treemap, as it effectively shows the directory structure of the system along with the relative sizes of its parts, such as files and directories. We will also explore other visualizations that suit our specific objectives. For this component, we will also be collaborating with Prof. Jack van Wijk of Technical University of Eindhoven, who is an expert on visualizations.

Clone Framing Component (CFC): The objective of this component is to provide an algorithm for optimal frame-based clone unification. We will explore the various VCL framing patterns and categorize simple and structural clones based on these framing patterns. The preliminary theoretical work on this component has been done by a former MS student of the PI [Tar13].

This component will be integrated with the Clones Repository (CR) to directly obtain the cloning information, and with the Frames Repository (FR) to store the resultant frames.

Clone Tracking Component (CTC): Clone detection in a large code base can consume a significant amount of time and resources, whereas clone management during the development process demands a quick response. To meet this objective, we will preserve the initial clone detection results, track changes in the code base, and incrementally update the clone detection results by comparing the modified and newly added code to the existing results. Moreover, the clone detection results will also be updated to remove references to any deleted source code.

CTC is responsible for tracking clones and frames once the software undergoes changes. Newly inserted code can result in the following changes to the clones database of the system: a) changes to existing clone classes, b) creation of a new clone instance of an existing clone class, c) creation of a new clone class, or d) invalidation of an existing clone class. We will adopt an index-based technique [HJHC10] to track these changes and update the clones database incrementally, without needing to run clone detection on the entire code base from scratch. The clone tracking component will be activated whenever a new file is checked in to the code repository. We will clear the previous clone records for this file and perform a fresh incremental clone detection for it.
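The check-in workflow described above can be sketched as a hash-table index over fixed-size token windows. This Python sketch only illustrates the general idea of index-based tracking and is not the technique of [HJHC10]; the class name CloneIndex, the WINDOW constant, and the choice to report only matches against other files are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

WINDOW = 3  # assumed minimum clone size in tokens (kept tiny for the example)


class CloneIndex:
    """Index of all WINDOW-sized token segments, keyed by the segment itself."""

    def __init__(self) -> None:
        self._index: Dict[Tuple[str, ...], Set[Tuple[str, int]]] = defaultdict(set)
        self._entries_by_file: Dict[str, List[Tuple[Tuple[str, ...], int]]] = {}

    def remove_file(self, path: str) -> None:
        """Clear all previous records for one file."""
        for key, pos in self._entries_by_file.pop(path, []):
            self._index[key].discard((path, pos))
            if not self._index[key]:
                del self._index[key]

    def check_in(self, path: str, tokens: Sequence[str]) -> List[Tuple[int, str, int]]:
        """Re-index one file; return (position, other file, other position) matches."""
        self.remove_file(path)
        matches: List[Tuple[int, str, int]] = []
        entries = []
        for pos in range(len(tokens) - WINDOW + 1):
            key = tuple(tokens[pos:pos + WINDOW])
            for other_path, other_pos in sorted(self._index.get(key, ())):
                matches.append((pos, other_path, other_pos))
            entries.append((key, pos))
        # The file's own segments are added last, so only other files are matched.
        for key, pos in entries:
            self._index[key].add((path, pos))
        self._entries_by_file[path] = entries
        return matches


idx = CloneIndex()
first = idx.check_in("A.java", list("abcde"))
second = idx.check_in("B.java", list("xbcdy"))
third = idx.check_in("A.java", list("abqde"))  # edited copy of A.java
print(first, second, third)
```

Only the checked-in file is rescanned: its old entries are dropped and its new segments are matched against the index, so the cost depends on the size of the changed file rather than on the size of the code base.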

Another possibility for incremental clone detection is to apply incremental techniques to the individual steps employed in the detection of simple and structural clones. For example, new techniques are being explored for incrementally updating suffix arrays [LMS12] [SLLM10] and for incremental frequent closed itemset mining (FCIM) [OPW+04] [BSV12] [LXZ+13]. We will evaluate the two possibilities and pick the most suitable one for our system.

Back Propagation Component (BPC): The developers will be able to make changes either directly to the frames or to the code generated from the frames. In the latter case, BPC will incorporate these changes back into the frames to keep the frames consistent with the generated code. Even if a developer changes code that is not part of any clone, this can still affect the framed clones, because such changes can move the cloned regions away from the positions recorded when the initial clone detection was performed.

The two repositories are:

Frames Repository (FR): This repository will keep track of all the frames in the system. It will be populated by the CFC and later updated by the BPC. In order to link the clones that have been framed, the Clones Repository will also be linked to the Frames Repository.

Clones Repository (CR): This repository will contain up-to-date information about the clones in the system. It will be initially populated by the Clone Detection and Documentation Component (CDDC) and later updated by the Clone Tracking Component (CTC). In order to track the clones that have been framed, the Clones Repository will also be linked to the Frames Repository (FR).


Verification case study

We will conduct a comprehensive case study on a real system from our industry partner Softech to validate the effectiveness of the tool suite.

Aspects of Clone Management implemented in the system

In the light of the previously mentioned aspects of clone management, here we give the details of our proposed tool suite for each of those aspects.

Type of clone management: Our proposed solution is primarily compensatory clone management, whereby we unify clones at the meta-level and do not remove them from the actual source code. In this way, we avoid the negative impacts of clones. Once the clones have been unified, the developers can also perform preventive clone management by using the variability management and code generation facilities of VCL.

Similarly, we are combining both retroactive and proactive treatment of clones. Normally, the retroactive or post-mortem approach will be used whereby clones will be treated after creation and detection. The proactive approach will be applicable when the developer will have a clear understanding of the clone he is going to introduce in the system.

Centralized architecture: This will be a centralized solution following the client-server model. The frames will also be part of the central repository managed by the server. All developers using the client integrated in their IDEs will be able to access and work on the VCL frames while working on the source code.

Human triggered and system triggered activities: In our tool suite, we will have a combination of both human triggered and system triggered activities. The developer will be able to trigger any clone management action at any time. For example, the developer can decide when to run the initial clone detection of the entire system. Similarly, he can select any unframed simple or structural clone for framing at any point.

The system will also initiate some clone management activities upon meeting certain conditions or on the scheduled time. For example, when a developer integrates his code to the repository, the system will check for clones and suggest integrating with existing frames.

Integrated tool: Developers are more likely to use tools that are available in their development environments, as compared to those that run independently. We will integrate our tool suite with the popular Eclipse IDE to facilitate the developers.

Clone documentation: We will extend the Rich Clone Format (RCF) [HG11], the proposed standard format for clone documentation, to incorporate various types of structural clones as well, and report our clone results using that format.

An alternative form of clone documentation available in our system will be the VCL frames, as they represent not only the clones but also all the differences between the members of a clone class, whether simple or structural.


When the code is changed or updated, the back-propagation component (BPC) will keep the VCL representation and the original source code synchronized.

Clone tracking: This is a challenging and still wide-open research problem. We will develop a technique for incremental detection of clones to locate new clones in the updated version of the software system without having to detect all the clones in the system. This technique will be based on creating an index of all clone-size segments of the code in a hash-table, and matching new code segments with these indexed segments for clone tracking.

The tool will also let the developer select any code fragment in the existing code base that needs to be copied, transform it into a frame, and identify the parameters in the fragment, so that further copies can be generated simply by providing the correct parameter values.

Clone annotation: As mentioned before, the VCL frames will be kept synchronized with the generated source code. Special markers in the source code view will indicate regions of source code that have been framed. Moreover, the developer will also be able to write comments in the frames to annotate the purpose of the clones and other information. The developer will also be able to mark/tag a code fragment that he/she plans to frame in future.

Handling of comments: As VCL has a powerful parameterization mechanism that is language-independent, we can parameterize and unify comments together with the source code with equal ease.

Synchronized modification: As VCL is a code generation technique too, our tool will provide an equally powerful alternative for synchronized modification capability. The required modifications will be done in the VCL frame and will be automatically reflected in all the code regions generated from that VCL frame. The frame verification capability of the CFC will also provide a sandbox type environment where the developer could verify the generated code before inserting it in the original place.

Consistent renaming: VCL multi-valued meta-variables take care of the consistent renaming problem. Different copies of a framed clone are generated through a looping mechanism in VCL that iterates over the values of a multi-valued variable. In each iteration of the loop, the next value of the multi-valued variable is used. The developer only needs to provide values consistently for all the multi-valued variables. This is easier than providing those values consistently in the source code, because the values are stated explicitly in the VCL representation, whereas in the source code they are hidden inside other parts of the code.
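The effect of the looping mechanism can be imitated in Python with the standard string.Template class. This is only an analogy written in Python and does not use VCL syntax; the template text, the variable names, and the expand function are hypothetical.

```python
from string import Template
from typing import Dict, List

# A parameterized "frame" for a getter method (hypothetical example).
frame = Template("public $Type get$Name() { return this.$field; }")

# Multi-valued meta-variables: the i-th values together describe the i-th copy.
variables: Dict[str, List[str]] = {
    "Type": ["int", "String"],
    "Name": ["Age", "Title"],
    "field": ["age", "title"],
}


def expand(template: Template, multi_valued: Dict[str, List[str]]) -> List[str]:
    """Generate one copy per value position of the multi-valued variables."""
    lengths = {len(values) for values in multi_valued.values()}
    if len(lengths) != 1:
        raise ValueError("every multi-valued variable needs the same number of values")
    copies = lengths.pop()
    return [
        template.substitute({name: values[i] for name, values in multi_valued.items()})
        for i in range(copies)
    ]


generated = expand(frame, variables)
print("\n".join(generated))
```

All values for one copy sit side by side in the variable table, so an inconsistent rename (for example, a getter name that does not match its field) is visible at a glance rather than buried in the generated code.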

Refactoring patterns: Similar to the OOP and AOP refactoring patterns, we will also develop a catalog of VCL refactoring patterns. Unifying different types of clones requires different VCL constructs. Sometimes, multiple constructs can provide the same kind of unification. Simple clones can normally be unified with a single x-frame, while structural clones are unified with a hierarchy of frames at different levels. Single frames and frame hierarchies can be used to compose bigger hierarchies representing clones at much higher levels.

Tool support for refactoring patterns: The clone framing component (CFC) will automatically transform the detected clones into VCL frames with minimal or no human supervision using the VCL refactoring patterns catalog. The back-propagation component (BPC) will keep the changed source code synchronized with the VCL frames.

Visualization of clones: Most of the visualizations used for clone analysis are borrowed from the software visualization domain. We have conducted a thorough survey of clone visualizations, analyzing pros and cons of each visualization and suggesting possible improvements [HBJ14]. We will extend the existing visualizations to represent structural clones and will also develop new visualizations. Treemap and Wheel view are some of the visualizations that we will be focusing on first.

Identifying clones for refactoring: Even though VCL has a very strong parameterization mechanism that is not restricted by the grammar of the underlying language of the code, it may still not be possible to refactor every clone with VCL. We have identified scenarios in real systems where different clones overlap with each other. In these situations, we have to decide whether to frame only one of the clones or only the overlapping part. Similarly, there could be other situations where the VCL solution is not feasible or practical. We will incorporate heuristics-based intelligence in our tool to identify such situations.

Categorizing clones for refactoring: For automated framing of clones, we need automated categorization of clones so that the appropriate framing pattern can be applied to each clone. CFC will automatically identify the correct category for each clone and mark it accordingly.

Verification of clone framing: As mentioned previously, our system will provide a sandbox environment to the user to see the output of any selected frame by providing the customization parameters for that frame. In this way, the user can see the simulated effect of the provided parameters in the generated code immediately. Once the developer is satisfied with the outcome, he can port these changes back to the original frame.


Figure 6: Use cases of the Clone Management Tool Suite

B. Project Team: (Please attach the curriculum vitae (CV) of PI and CPI(s). Also attach the CVs of key research/ development personnel if available. Please follow the format included in Annexure A. The numbers in the table below must tally with the HR Cost sheet in the Budget file.)

Title / Position | Number
PI - Hamid Abdul Basit | 1
Co-PIs - Shafay Shamail, Basit Shafiq, Salman Iqbal, Khushro Shaookar | 4
Team Leads | 1
Researchers / Developers | 2
Researcher / Development Assistants | 3
Support Staff | 2
Contract Staff (please specify) |
Others (please specify) |
(Add more rows if required)

C. Team Structure: (Please define the team structure (organogram) and role/key responsibilities of each member. If in collaboration with another partner, the division of manpower at various locations of partners be provided.)

Title/Position (of each member) | Role/Key Responsibilities | Minimum Qualification Required | Expertise / Background Required | Minimum Experience Required (years)
Hamid Abdul Basit | Role -- PI. Project Management; Requirement analysis; algorithm development / evaluation | | |
Shafay Shamail | Role -- Co-PI. Requirement analysis; algorithm development / evaluation | | |
Salman Iqbal | Role -- Co-PI. Requirement specification; supervision of tool development / evaluation | | |
Khushro Shaookar | Role -- Co-PI. Requirement specification; supervision of tool development / evaluation | | |
Team Lead | Coordination among the research and development team members; component interface specifications; tool development; UI design; testing and evaluation | MS in Computer Science or Software Engineering | Experience in Java EE, C/C++; Database programming | 3 years
Researcher (or Ph. D. student) | Development and implementation of algorithms for various research tasks; component integration; unit and integration testing. | MS in Computer Science or Software Engineering | Java programming; C/C++ | 0 years
Research Assistant / Software Developer | Incorporate algorithms into tool; component integration; UI implementation; testing / evaluation | BS in Computer Science or Software Engineering | Java programming; C/C++ | 1 year
Finance / Accounts Staff | Prepare and maintain financial records for the project; Assist in annual audit process; Assist in financial monitoring of the project | BS in Accountancy / Commerce / Finance / Business | - | -
Project Coordinator | Assist in liaison with ICT; Assist in travel arrangements; Assist in recruitment and procurement process; Assist in secretarial work | BS in any field | - | -

D. Project Activities: (Please list and describe the main project activities, including those associated with the transfer of the research results to customers/beneficiaries. The timing and duration of research activities are to be shown in the Gantt chart in Section 8.)

Activity 1: Development of Clone Detection and Documentation Component
subactivity 1.1 Requirements gathering & research
subactivity 1.2 Requirements analysis
subactivity 1.3 Selection and simulation of algorithms
subactivity 1.4 Design
subactivity 1.5 Implementation
subactivity 1.6 Testing
subactivity 1.7 Deployment

Activity 2: Development of Clone Visualization and Analysis Component
subactivity 2.1 Requirements gathering & research
subactivity 2.2 Requirements analysis
subactivity 2.3 Selection and simulation of algorithms
subactivity 2.4 Design
subactivity 2.5 Implementation
subactivity 2.6 Testing
subactivity 2.7 Deployment

Activity 3: Development of Clone Framing Component
subactivity 3.1 Requirements gathering & research
subactivity 3.2 Requirements analysis
subactivity 3.3 Selection and simulation of algorithms
subactivity 3.4 Design
subactivity 3.5 Implementation
subactivity 3.6 Testing
subactivity 3.7 Deployment

Activity 4: Development of Clone Tracking Component
subactivity 4.1 Requirements gathering & research
subactivity 4.2 Requirements analysis
subactivity 4.3 Selection and simulation of algorithms
subactivity 4.4 Design
subactivity 4.5 Implementation
subactivity 4.6 Testing
subactivity 4.7 Deployment

Activity 5: Development of Back Propagation Component
subactivity 5.1 Requirements gathering & research
subactivity 5.2 Requirements analysis
subactivity 5.3 Selection and simulation of algorithms
subactivity 5.4 Design
subactivity 5.5 Implementation
subactivity 5.6 Testing
subactivity 5.7 Deployment

Activity 6: Development of Clones Repository
subactivity 6.1 Requirements gathering
subactivity 6.2 Requirements analysis
subactivity 6.3 Selection and simulation of algorithms
subactivity 6.4 Design
subactivity 6.5 Implementation
subactivity 6.6 Testing
subactivity 6.7 Deployment

Activity 7: Development of Frames Repository
subactivity 7.1 Requirements gathering
subactivity 7.2 Requirements analysis
subactivity 7.3 Selection and simulation of algorithms
subactivity 7.4 Design
subactivity 7.5 Implementation
subactivity 7.6 Testing
subactivity 7.7 Deployment

Activity 8: Validation of integrated tool suite on industry partner case study system

Activity 9: Transfer of research results to customer/beneficiaries
The research results and algorithms developed will be incorporated into the final tool suite. This tool suite will be developed in close collaboration with our industry partner (Softech) and will be developed and deployed on their premises. Moreover, the developed tool suite will undergo the standard testing and evaluation procedures that are practiced in the industry. This will ensure that the requirements of end users and customers are considered right from the start and that the research results are transferred to the industry.


E. Key Milestones and Deliverables: (Please list and describe the principal milestones and associated deliverables of the project. A key milestone is reached when a significant phase in the project is concluded, e.g. selection and simulation of algorithms, completion of architectural design and design documents, commissioning of equipment, completion of test, etc.) The timing of milestones is also to be shown in the Gantt chart in Section 8. Quarterly deliverables are preferred.

The information given in this table will be the basis of monitoring and release of funds by the National ICT R&D Fund.

No. | Elapsed time from start of the project (in months) | Milestone | Deliverables
1. | 3 | Completion of Requirements gathering & research, Requirements analysis, Selection and simulation of algorithms for CDDC | Requirement specification document for CDDC. (Activity 1, subactivities 1.1, 1.2, 1.3)
2. | 3 | Completion of Requirements gathering & research, Requirements analysis, Selection and simulation of algorithms for CVAC | Requirement specification document for CVAC. (Activity 2, subactivities 2.1, 2.2, 2.3)
3. | 3 | Completion of Requirements gathering & research, Requirements analysis, Selection and simulation of algorithms for CFC | Requirement specification document for CFC. (Activity 3, subactivities 3.1, 3.2, 3.3)
4. | 6 | Completion of architectural design, low level design of CDDC | Design document for CDDC (Activity 1, subactivity 1.4)
5. | 6 | Completion of architectural design, low level design of CVAC | Design document for CVAC (Activity 2, subactivity 2.4)
6. | 6 | Completion of architectural design, low level design of CFC | Design document for CFC (Activity 3, subactivity 3.4)
7. | 6 | Completion of Requirements gathering & research, Requirements analysis, Selection and simulation of algorithms for CR | Requirement specification document for CR. (Activity 6, subactivities 6.1, 6.2, 6.3)
8. | 9 | Completion of implementation of CDDC | CDDC internal release for testing (Activity 1, subactivity 1.5)
9. | 9 | Completion of implementation of CVAC | CVAC internal release for testing (Activity 2, subactivity 2.5)
10. | 9 | Completion of implementation of CFC | CFC internal release for testing (Activity 3, subactivity 3.5)
11. | 9 | Completion of architectural design, low level design of CR | Design document for CR (Activity 6, subactivity 6.4)
12. | 12 | Completion of Testing and Deployment of CDDC | CDDC final release (Activity 1, subactivities 1.6 and 1.7)
13. | 12 | Completion of Testing and Deployment of CVAC | CVAC final release (Activity 2, subactivities 2.6 and 2.7)
14. | 12 | Completion of Testing and Deployment of CFC | CFC final release (Activity 3, subactivities 3.6 and 3.7)
15. | 12 | Completion of implementation of CR | CR internal release for testing (Activity 6, subactivity 6.5)
16. | 12 | Completion of Requirements gathering & research, Requirements analysis, Selection and simulation of algorithms for CTC | Requirement specification document for CTC. (Activity 4, subactivities 4.1, 4.2, 4.3)
17. | 12 | Completion of Requirements gathering & research, Requirements analysis, Selection and simulation of algorithms for BPC | Requirement specification document for BPC. (Activity 5, subactivities 5.1, 5.2, 5.3)
18. | 15 | Completion of architectural design, low level design of CTC | Design document for CTC (Activity 4, subactivity 4.4)
19. | 15 | Completion of architectural design, low level design of BPC | Design document for BPC (Activity 5, subactivity 5.4)
20. | 15 | Completion of Testing and Deployment of CR | CR final release (Activity 6, subactivities 6.6 and 6.7)
21. | 15 | Completion of Requirements gathering & research, Requirements analysis, Selection and simulation of algorithms for FR | Requirement specification document for FR. (Activity 7, subactivities 7.1, 7.2, 7.3)
22. | 18 | Completion of implementation of CTC | CTC internal release for testing (Activity 4, subactivity 4.5)
23. | 18 | Completion of implementation of BPC | BPC internal release for testing (Activity 5, subactivity 5.5)
24. | 18 | Completion of architectural design, low level design of FR | Design document for FR (Activity 7, subactivity 7.4)
25. | 21 | Completion of Testing and Deployment of CTC | CTC final release (Activity 4, subactivities 4.6 and 4.7)
26. | 21 | Completion of Testing and Deployment of BPC | BPC final release (Activity 5, subactivities 5.6 and 5.7)
27. | 21 | Completion of implementation and testing of FR | FR internal release for testing, FR final release (Activity 7, subactivities 7.5, 7.6 and 7.7)
28. | 24 | Completion of validation case study | Case study report (Activity 8)
29. | 24 | Organizing training workshop for industry professionals | Training workshop (Activity 9)
(Please add more rows if required.)

5. Benefits of the Project

A. Direct Customers / Beneficiaries of the Project: (Please identify clearly the potential customers/beneficiaries of the research results and provide details of their relevance, e.g. size, economic contribution, etc.) The local software industry in the SME sector will be the main beneficiary of the proposed Clone Management Tool Suite. The tool suite will be useful to any software development organization, whether it is a software house or the software development branch of a larger organization such as a university or a government department. The tools will help organizations manage clones better and ease the maintenance of their software systems, and will support systematic reuse of clones in new software systems.

B. Outputs Expected from the Project:

Clone management tool suite: This will be the main output of this project. The suite will contain the following tools, integrated into the Eclipse IDE:

• Clone Detection and Documentation Component (CDDC)
• Clone Visualization and Analysis Component (CVAC)
• Clone Framing Component (CFC)
• Clone Tracking Component (CTC)
• Back Propagation Component (BPC)

Tool evaluation report: As a first test case, the tool suite will be run on a major legacy system from Softech to evaluate its usefulness. This exercise will be documented and published.

Research output: During the course of the project, new algorithms will be developed for fast and efficient clone detection, clone tracking, and clone framing. New visualizations will also be developed for effectively presenting simple and structural clone data to developers. These research outputs will be published in reputed journals and conferences. The initial case study applying the tool suite to a real system will likewise be published, to share the results with the wider research community.

One PhD and two MS students will be working on this project and this work will be reported in their dissertations.

C. Organizational / HRD Outcomes Expected: Graduate students and software developers working on the project will gain hands-on experience in developing state-of-the-art software engineering tools on the Eclipse IDE and in designing efficient algorithms for the challenging parts of the project. Once the tools are developed, we will conduct the first case study of the tool suite on the industry partner’s legacy system. This will train the software developers at the partner’s organization in using the clone management tool suite.

The graduate students at LUMS will be directly exposed to the software maintenance issues faced by our local industry, and will get an opportunity to work on real and relevant problems in their research.

In addition, the research and development results of this project will be incorporated in relevant graduate courses that the PI and Co-PIs teach at LUMS. These courses include 1) Software Reuse, 2) Design Patterns and Refactoring, 3) Software Engineering, 4) Distributed Software System Development, 5) Distributed Systems. This will help students gain the knowledge and skill set needed by the IT industry, which is increasingly adopting new tools in the software development life cycle.

D. Technology Transfer / Diffusion Approach: (Please describe how the outputs of the project will be transferred to the direct beneficiaries/customers. Please also state if the project outputs are sustainable, i.e. if they can be utilized without further external assistance.) An industrial partner (Softech) is involved from the beginning of the project to ensure that the stakeholders’ requirements are incorporated in the tool suite. In addition, Softech will supervise tool suite development and testing/evaluation in an operational environment. This will also facilitate technology transfer from research results to product development.

Once development is complete, we will run training sessions for industry professionals and conduct controlled experiments to assess the benefits of integrated clone management, with the aim of improving adoption of the tool suite in industry. A user guide and a step-by-step tutorial will also be developed.

6. Risk Analysis

A. Risks of the Project: (Please describe the factors that may cause delays in, or prevent implementation of, the project as proposed above; estimate the degree of risk.)

(Please mark  where applicable) Low Medium High  Technical risk     Timing risk     Budget risk   

A1. Comments:

Technical risk factors
1. Equipment - The equipment consists of standard computing and storage resources (a server and computers). Therefore the risk with respect to this factor is negligible.
2. Software - We will be using the open source Eclipse IDE for development, so the risk with respect to this factor is also negligible.
3. Human resource - The technical team includes PhD students in the area of software engineering / distributed systems and software developers with programming expertise in Java and C/C++. Since the PI and the two Co-PIs are faculty members at LUMS with expertise in relevant areas, recruiting PhD students for research work does not pose any risk. Similarly, finding good software developers with Java and C/C++ expertise is not a big challenge. Retaining trained resources may be an issue given the excessive demand in the IT area. However, the software development team will be working closely with a leading IT company, which offers the incentive of future employment within the same company. Therefore, the risk with respect to human resource is low.

Timing risk factors
1. Availability of resources - The risk with respect to human resources and equipment is low, as discussed earlier.
2. Timely availability and release of funds.

Budget risk factor 1. Variation in human resource salaries at the time of grant approval and project execution.


7. Contractual Matters

A. Contractual Obligations under this Project: (Please indicate any contractual obligations with third parties that will be entered into for this project.)

B. Ownership of Intellectual Property Rights:

All newly developed intellectual property rights arising out of or capable of legal recognition with respect to the projects implemented by the National ICT R&D Fund (The “Company”) shall vest with the Company.

The Company may assign or license its rights in the said intellectual property to any person on such terms as it may deem appropriate.

C. Competent Authority of the Principal Investigator’s Organization: (Documentary proof of the Competent Authority (VC/Rector/CEO..) as being the authorized signatory for the PIO is mandatory for approval of the Project Proposal. Please attach copy of the proof.)

Name: Designation: Email:

Date: Signature & stamp:


8. Project Schedule / Milestone Chart

(A project schedule prepared using MS-Project (or a similar tool), with all tasks, deliverables, milestones, cost estimates, and payment schedules clearly indicated, is preferred.)


9. Proposed Budget

Please use the embedded Excel Worksheet for providing budget details.

Double click the icon to open the worksheet.


Annexure A – Curriculum Vitae

Please provide relevant information and also attach CVs of key research / development personnel (if available) and PI, CPI.

A. Professional Information

1. Name : Dr. Hamid Abdul Basit 2. Title or Position Held : Assistant Professor 3. Experience : (yrs) 9 years 4. Email Address : [email protected]

B. Research Papers in Relevant Area

In Refereed Journals
1. Hammad, M., Basit, H.A., and Jarzabek, S. “A Survey of Software Clone Visualization Techniques”. Ready for submission to ACM Computing Surveys. 2014.
2. Basit, H.A., and Jarzabek, S. “A Data Mining Approach for Detecting Higher-level Clones in Software”, IEEE Transactions on Software Engineering 35(4):497–514, July–August 2009. ISSN 0098-5589. DOI: 10.1109/TSE.2009.16. IEEE Computer Society.
3. Yuan, L., Dong, J.S., Sun, J., and Basit, H.A., “Generic Fault tolerant software architecture: reasoning and customization”, IEEE Transactions on Reliability 55(3):421–435, September 2006.

In Refereed Conferences & Workshops
1. Koznov, D., Luciv, D., Basit, H. A., Lieh, O. E., and Smirnov, M. “Clone Detection in Documentation Reuse”, ready for submission. 2014.
2. Basit, H. A., and Dajsuren, Y. “Handling Clone Mutations in Simulink Models with VCL”, Int. Workshop on Software Clones, IWSC’2014, CSMR Workshop, February 2014, Antwerp, Belgium.
3. Basit, H. A., Ali, U., Haque, S., and Jarzabek, S. “Things Structural Clones Tell that Simple Clones Don’t,” Int. Conference on Software Maintenance, ICSM’2012, Trento, Italy, September 2012, pp. 275-284.
4. Basit, H. A., Ali, U. and Jarzabek, S. “Viewing Simple Clones from a Structural Clones’ Perspective,” Int. Workshop on Software Clones, IWSC’2011, ICSE Workshop, Honolulu, USA, May 2011, pp. 1-6.
5. Basit, H. A., and Jarzabek, S., “A case for structural clones”. In proceedings 3rd International Workshop on Software Clones, March 2009, Kaiserslautern, Germany.
6. Zhang, Y., Basit, H. A., Jarzabek, S., Anh, D., and Low, M., “Identifying Useful Design-Level Similarity Patterns based on Clone Detection Output”. In proceedings 24th IEEE International Conference on Software Maintenance, September 28 - October 4, 2008, Beijing, China, pp. 376-385.
7. Basit, H. A., Smyth, W. F., Puglisi, S. J., Turpin, A., and Jarzabek, S., “Efficient Token Based Clone Detection with Flexible Tokenization”. Short paper in proceedings 11th European Software Engineering Conference and 15th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, ACM Press, September 2007, Dubrovnik, Croatia, pp. 513-516.
8. Basit, H.A. and Jarzabek, S. “Detecting Higher-level Similarity Patterns in Programs”, In proceedings 10th European Software Engineering Conference and 13th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, ACM Press, September 2005, Lisbon, Portugal, pp. 156-165.
9. Basit, H.A., Rajapakse, D.C, and Jarzabek, S. “An Empirical Study on Limits of Clone Unification Using Generics”, In proceedings 17th Int. Conference on Software Engineering and Knowledge Engineering, SEKE'05, July 2005, Taipei, Taiwan, pp. 109-114.
10. Basit, H.A., Rajapakse, D.C., and Jarzabek, S. “Beyond Templates: a Study of Clones in the STL and Some General Implications”, In proceedings 27th Int. Conf. on Software Engineering, ICSE’05, May 2005, St. Louis, USA, pp. 451-459.
11. Basit, H. A., Rajapakse, D. C., and Jarzabek, S., "Extending Generics for optimal Reuse", poster presentation at 8th Intl. Conf. on Software Reuse (ICSR'04), 2004, Madrid, Spain.

C. Courses Taught in Relevant Area

Undergraduate
• Software Engineering
• Databases
• Data Structures and Algorithms
• Introduction to programming with C++
• Computing Structures
• Software Engineering Project

Graduate
• Distributed Software System Development
• Design Patterns and Refactoring
• Software Reuse

D. Thesis / Projects Supervised in Relevant Area

MS Theses
• “Clone Management in XVCL”, Shaheen Tariq.
• “Comparing Popular Web Development Technologies for Cloning: An Empirical Study”, Aisha Khan.
• “Developing Clone Visualization Techniques for Structural Clones”, Umber Nisar.
• “Proposing and Evaluating a New Gapped Clone Detection Technique”, Fatima Shahid.
• “Evaluating Clone Visualization Techniques from User Goals Perspective”, Muhammad Hammad.

MS Project
• “A comparative study of cloning in Ruby and Java systems”, Haider Mahmood.

E. Grants Received in Relevant Area

• PI, “Design and Implementation of a Language-Independent Software Clone Management Tool Suite for Single and Multiple Systems”, LUMS Faculty Initiative Fund, Rs. 1,000,000, May 2014 – April 2015.

F. Industrial Work Done in Relevant Area

2000 – 2002 Elixir Technologies Pvt. Ltd. Islamabad, Pakistan. Software Engineer


Please provide relevant information and also attach CVs of key research / development personnel (if available) and PI, CPI.

A. Professional Information

1. Name : Dr. Shafay Shamail 2. Title or Position Held : Associate Professor CS / Director OSP 3. Experience : (yrs) 23 Years 4. Email Address : [email protected]

B. Research Papers in Relevant Area International Journals 1. A Randomized Partitioning Approach for CBR Based Autonomic Systems to Improve Retrieval Performance, Malik Jahan Khan, Mian M. Awais, Shafay Shamail, The Computer Journal, Oxford University Press, (2013) 56(2), 175-183, First published Online April 19, 2012, IF 1.327 (2011) 2. An empirical study of modeling self-management capabilities in autonomic systems using case-based reasoning. Malik Jahan Khan, Mian M. Awais, Shafay Shamail, Irfan Awan: Simulation Modelling Practice and Theory 19(10): 2256-2275 (2011), IF 0.728 (2010) 3. Nomenclature Unification of Software Product Measures, Zeeshan Ali Rana, Mian Muhammad Awais, Shafay Shamail, IET Software, Volume 5, Issue 1, p.83–102, IET Digital Library, DOI:10.1049/iet-sen.2010.0016. Available online Feb 17, (2011).,IF 0.671 (2010) 4. Improving Efficiency of Self-Configurable Autonomic Systems Using Clustered CBR Approach, Malik Jahan Khan, Mian Muhammad Awais, Shafay Shamail, IEICE Transactions on Information and Systems, Vol. E93-D, No.11, pp. 3005-3016, November 2010, IEICE Press. IF 0.369 (2009) 5. Automatic Case Generation for Pattern Classification, Zahra A Shah, M.M. Awais & S. Shamail, Expert Update, The Specialist Group on AI, Spring 2010 (Vol. 10, No. 1), IF 0.821 (2007) 6. Applying fuzzy logic to measure completeness of a conceptual model, Tauqeer Hussain, M. M. Awais, S. Shamail, Applied Mathematics and Computation, Vol. 185 Issue 2, (2007) 1078– 1086. [IF(2007) 0.821], IF 1.349 (2013) 7. Dimensionally Reduced Krylov Subspace Methods for Large Scale Systems, M MAwais, S Shamail, and Nisar Ahmed, J of Control and Intelligent Systems, 191(1), 21-30, (2007). ACTA Press. [Publisher: ACTA Press, Indexed by the American Mathematical Society - Mathematical Reviews, Cambridge Scientific Abstracts, Compendex (Engineering Information) and INSPEC, ISSN: 1480-1752] 8. Automatic Arabic Speech Segmentation System, M Jamil Anwar, M. M. Awais, S Masud, S. 
Shamail, International Journal of Information Technology, Information Communication Institute of Singapore (ICIS), Vol. 12 No.6, pp. 102-111. 2006. 9. An Assessment of Current Level and Future Prospects of Internet Adoption by Banks in Pakistan, Shafay Shamail, Mian M. Awais, Tauqeer Hussain, Saadiya Waheed Raza, WSEAS Transactions on Information Science and Applications, Issue 8, Volume 2, August 2005, pp 1167-1172 [Publisher: WSEAS Press, ISSN: 1790-0832]. 10. Determinants of Electronic Commerce in Pakistan: Preliminary Evidence from Small and Medium Enterprises, Electronic Markets J, Seyal AH, Awais MM, Shamail S, Abbas A, Vol. 14, No. 4. (December 2004), pp 372-387. [Publisher: Taylor & Francis, ISI Indexed, ISSN: 1019- 6781], IF 0.429 (2012) 11. An Aspect-to-Class Advising Architecture Based on XML in Aspect Oriented Programming, T.


Hussain, M. M. Awais, S. Shamail, M. A. Adnan, WSEAS Transactions of Information Science and Applications, Issue 1, Volume 1, July 2004, pp 209-214. [Publisher: WSEAS Press, ISSN: 1790-0832] 12. Multi-Valued Relationship Attributes in Extended Entity Relationship Model and Their Mapping to Relational Schema, T. Hussain, S. Shamail, Mian M. Awais, WSEAS Transactions of Information Science and Applications, Issue 1, Volume 1, July 2004, pp 269-273. [Publisher: WSEAS Press, ISSN: 1790-0832]

Book Chapters 13. Enterprise Architectures for e-Government Development Muhammad Kashif Farooq, Shafay Shamail and Mian Muhammad Awais in Developing E-Government Projects: Frameworks and Methodologies, Editor: Dr Zaigham Mahmood, pages 139-164, Publisher: IGI Global, Hershey, PA, USA, June 2013. 14. An FIS for Early Detection of Defect Prone Modules, Zeeshan Ali Rana, Mian Muhammad Awais and Shafay Shamail, Emerging Intelligent Computing Technology and Applications. With Aspects of Artificial Intelligence, Volume 5755/2009, Lecture Notes in Computer Science, 144-153, September 19, 2009, 5th International Conference on Intelligent Computing, ICIC 2009 Ulsan, South Korea, September 16-19, 2009 Proceedings 15. Devolution in a Virtual Enterprise, Muhammad Kashif Farooq, Shafay Shamail, Mian M. Awais: Pervasive Collaborative Networks, IFIP TC 5 WG 5.5 Ninth Working Conference on Virtual Enterprises (PRO-VE), September 8-10, 2008, Poznan, Poland. IFIP Springer 2008, ISBN 978-0-387-84836-5. Also in Proceedings of IFIP TC 5 WG 5.5 Ninth working conference on Virtual Enterprises, September 8-10, 2008, Poznan, Poland. 16. On Vowels Segmentation and Identification Using Formant Transitions in Continuous Recitation of Quranic Arabic. Hafiz Rizwan Iqbal, Mian M. Awais, Shahid Masud, Shafay Shamail: New Challenges in Applied Intelligence Technologies. Studies in Computational Intelligence Vol. 134 Springer 2008, 155-162. ISBN 978-3-540-79354-0. 17. Achieving Self-Configuration Capability in Autonomic Systems Using Case-Based Reasoning with a New Similarity Measure, Malik Jahan Khan, Mian Muhammad Awais, Shafay Shamail, Communication in Computer and Information Science, Volume 2, Springer Berlin Heidelberg, Also In Proceedings of International Conference on Intelligent Computing (ICIC'07), QinagDao, China (August 2007) 18. Arabic Phoneme Identification Using Conventional and Concurrent Neural Networks in Non Native Speakers, Mian M. 
Awais, Shahid Masud, Junaid Ahktar and Shafay Shamail, Advanced Intelligent Computing Theories and Applications. With Aspects of Theoretical and Methodological Issues, Lecture Notes in Computer Science, Volume 4681/2007, Also In Proceedings of International Conference on Intelligent Computing (ICIC'07), Qinag Dao, China (August 2007) 19. A Fuzzy Based Approach to Measure Completeness of an Entity-Relationship Model, Tauqeer Hussain, Mian M. Awais, Shafay Shamail, Proceedings of ER 2005 Workshops in 24th International Conference on Conceptual Modeling (ER 2005), Oct. 24-28, 2005, Klagenfurt, Austria published as Lecture Notes in Computer Science (LNCS) by Springer-Verlag Vol. 3770/2005 (Perspectives in Conceptual Modeling) Chapter pp. 410-422 (Book Chapter) 20. A Hybrid Multi-Layered Speaker Independent Arabic Phoneme Identification System, M. M. Awais, S. Masud, S. Shamail, J. Akhtar, Presented at 5th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL04), August 25-27, 2004, Exeter, UK, published as Lecture Notes in Computer Science by Springer-Verlag, Exeter, UK, Vol. 3177/2004, pp. 416-423 (Book Chapter), [Acceptance Rate: 25%]

International Conferences 21. Identifying Association between Longer Itemsets and Software Defects, Zeeshan A Rana, Sehrish A Malik, Shafay Shamail, and Mian M Awais, (Short Paper), The 20th International Conference on Neural Information Processing (ICONIP 2013), 3-7 November 2013, Daegu, Korea. 22. Internet Voting: A Smarter Way to Vote in Pakistan, Umar Muneer and Shafay Shamail, (Short Paper), 7th International Conference on Theory and Practice of Electronic Governance


(ICEGOV 2013), 22-25 October, 2013, Seoul, Korea 23. Agile Model Adaptation for E-Learning Students Final Year Project, Muhammad Adnan Ashraf, Shafay Shamail and Zeeshan A. Rana, Teaching, Assessment and Learning for Engineering (TALE), 2012, 20-23 August 2012, Hong Kong, pp T1C-18-T1C-21 24. Using Association Rules to Identify Similarities between Software Datasets, Saba Anwar, Zeeshan Rana, Shafay Shamail, Mian Muhammad Awais, 8th International Conference on the Quality of Information and Communications Technology, Lisbon, Portugal, 3-6 September 2012 (http://2012.quatic.org:9000/tracks/thematic-tracks/ict-verification-and-validation/) 25. A comparison between Cadastre 2014 and cadastral systems of different countries, Sajid Manzoor, Taimur A. Qureshi, Muhammad D. Liaqat, Muhammad K, Farooq, Shafay Shamail, 3rd International Conference on Theory and Practice of Electronic Governance (ICEGOV 2009), Bogota, Colombia, November 10-13, 2009, ACM International Conference Proceedings, Vol. 322, pp 293-297, Session Standards and Guidelines 26. Intelligent project approval cycle for local government: case-based reasoning approach, Muhammad K. Farooq, Malik Jahan Khan, Shafay Shamail, Mian M. Awais, 3rd International Conference on Theory and Practice of Electronic Governance (ICEGOV 2009), Bogota, Colombia, November 10-13, 2009, ACM International Conference Proceedings, Vol. 322, pp 68-73, Session Knowledge Management 27. Decision Making in Autonomic Managers using Fuzzy Inference System, Malik Jahan Khan, Mian M. Awais, Shafay Shamail, ICAS 2009, Valencia, Spain, April 20-25, 2009, IEEE Computer Society. 28. Survey of Frameworks, Architectures and Techniques in Autonomic Computing, Amina Khalid, Mouna Abdul Haye, Jahan Khan, Shafay Shamail, ICAS 2009, Valencia, Spain, April 20-25, 2009. IEEE Computer Society. 29. Ineffectiveness of Use of Software Science Metrics as Predictors of Defects in Object Oriented Software, Zeeshan A. Rana, Shafay Shamail, Mian M. 
Awais, World Congress on Software Engineering, May 19-21, 2009, Xiamen, China. IEEE Computer Society. 30. Reference Model for Devolution in E-Governance, Muhammad, Kashif Farooq, Shafay Shamail, Mian M. Awais. 2nd International Conference on Theory and Practice of Electronic Governance (ICEGOV 2008), Dec 1-4, 2008, Cairo, Egypt. Publisher ACM. 31. Enabling self-configuration in autonomic systems using case-based reasoning with improved efficiency. Khan, M. J., Awais, M. M., & Shamail, S., In Autonomic and Autonomous Systems, 2008. ICAS 2008. Fourth International Conference on (pp. 112-117). IEEE. (2008, March). 32. Self-configuration in autonomic systems using clustered CBR approach. Khan, Malik Jahan, Mian M. Awais, and Shafay Shamail. In Autonomic Computing, 2008. ICAC'08. International Conference on, pp. 211-212. IEEE, June 2008. 33. Towards a generic model for software quality prediction. Rana, Zeeshan Ali, Shafay Shamail, and Mian Muhammad Awais. In Proceedings of the 6th International Workshop on Software Quality (WoSQ), pp. 35-40. ACM. Leipzig, Germany, 10 - 18 May 2008. (Co-located with ICSE 2008). 34. Devolution of E-Governance among Multilevel Government Structure, Awais, M.M.; Farooq, M.K.; Shamail, S.; Innovations in Information Technology, Dubai, Nov. 2006, Page(s):1 - 5 35. Continuous Arabic Speech Segmentation using FFT Spectrogram, Ahmad, W.; Awais, M.M.; Masud, S.; Shamail, S.; Innovations in Information Technology, Dubai, Nov. 2006 Page(s):1 – 6 36. Measuring Quality of a Conceptual Model through Fuzzy Completeness Index, Tauqeer Hussain, M. M. Awais, S. Shamail, Proceedings of International Conference on Intelligent Computing (ICIC’05), August 23-26, 2005, Hefei, China. [Acceptance Rate = 33%] 37. Automatic Arabic Speech Recognition Systems, M Jamil Anwar, M. M. Awais, S Masud, S. Shamail, International Conference on Intelligent Computing (ICIC’05), August 23-26, 2005, Hefei, China, [Acceptance Rate = 33%] 38. 
A Collaboration Portal for Social Enterprises, Shafay Shamail, Mian. M. Awais, Tauqeer Hussain,, International Conference on Web Technologies, Applications, and Services, (WTAS 2005), July 04-06, 2005, Calgary, Canada 39. On Measuring Structural Complexity of a Conceptual Model, Tauqeer Hussain, S Shamail, and M MAwais, IASTED International Conference on Software Engineering, Novosibirsk,


Russia - June 20-24, 2005, pp 71-75. [Acceptance Rate: 40%] 40. Improving Quality in Conceptual Modeling, Tauqeer Hussain, Shafay Shamail, Mian M. Awais, Proceedings Companion 19th Annual ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), October 24-28, 2004, Vancouver, BC, Canada. [CiteSeer CS Impact Factor: 2.05 (top 2.29% - rank 28 / 1221)] 41. Application of Concurrent Regression Neural Networks for Arabic Phoneme Identification, S. Sehgal, M. M. Awais, S. Masud, S. Shamail, J. Akhtar, International Conference on Neural and Computational Intelligence (NCI2003), IASTED, February 22-25, 2004, Switzerland 42. IT Support for Distance Learning to Social Enterprises in Pakistan, Shafay Shamail, Mian M. Awais, M. Asim Noor, ICSDS'02, November 30-December 1, 2002, Beijing, China. 43. Distributed Memory Diesel Engine Simulation Using Transputers, Proceedings of Ninth International Symposium on Computer and Information Sciences, 7-9 November 1994, Antalya, Turkey.

Local Conferences 44. Data Analysis and Visualization using Spectral Decomposition and Feature Selection, Areeb Kamran, Shafay Shamail, and Mian Muhammad Awais, International Conference on Emerging Technologies (ICET 2013), Islamabad, 9-10 December 2013. 45. Finding Focused Itemsets from Software Defect Data, Hafsa Zafar, Zeeshan Rana, Shafay Shamail, Mian M Awais, 15th IEEE International Multi-topic Conference (INMIC 2012), December 13-15, 2012, Islamabad (Best Paper Award) 46. Automated optimum test case generation using web navigation graphs. Ahmad Shahzad, Sajjad Raza, Muhammad Nabeel Azam, Khurram Bilal, and Shafay Shamail. In 5th International Conference on Emerging Technologies, 19-20 Oct. 2009. ICET 2009, Islamabad, Pakistan. 47. Blending Six Sigma and CMMI - An Approach to Accelerate Process Improvement in SMEs, Maria Habib, Sana Ahmed, Amna Rehmat, Malik Jahan Khan, Shafay Shamail, INMIC08, December 24-25, 2008, Karachi. 48. Software Quality Prediction Techniques: A Comparative Analysis, Sana Shafi, Syed Muhammad Hassan, Afsah Arshaq, Malik Jahan Khan, Shafay Shamail, 4th International Conference on Emerging Technologies (ICET 08), Islamabad, Oct 18-19, 2008 49. Analytical Hierarchy Process Approach to Rank Measures for Structural Complexity of Conceptual Models, Hussain, Tauqeer; Tahir, Ahmed Salman; Awais, Mian Muhammad; Shamail, Shafay; IEEE Multitopic Conference, INMIC '06, Dec. 2006 Page(s):255 - 258 50. Comparative Study of Various Artificial Intelligence Techniques to Predict Software Quality, Khan, Malik Jahan; Shamail, Shafay; Awais, Mian Muhammad; Hussain, Tauqeer; IEEE Multitopic Conference, INMIC '06, Dec. 2006 Page(s):173 - 177 51. A comparative study of spatial complexity metrics and their impact on maintenance effort, Rana, Z.A.; Khan, M.J.; Shamail, S.; International Conference on Emerging Technologies. ICET '06, 2006 Page(s):714 – 718 52. 
Software Automated Testing Guidelines, Khan, M.J., Qadeer, A., and Shamail, S., Proceedings of the 3rd CIIT Workshop on Research in Computing, Wah Cantt, April 2006 53. Schema Transformations – A Quality Perspective, Tauqeer Hussain, Shafay Shamail and Mian M. Awais, INMIC 2004, December 24-26, 2004, FAST, Lahore, pp 645 - 649. [Acceptance Rate = 27%] 54. CFD Based Combustion Modeling for Industrial Scale Combustors, Bashir Ahmed, M. M. Awais and S. Shamail, INMIC 2004, December 24-26, 2004, FAST, Lahore, pp 547 - 552. [Acceptance Rate = 27%] 55. On Dependency Preservation and BCNF, Tauqeer Hussain, Shafay Shamail, M. M. Awais, Short paper in Proceedings of 2nd International workshop on Frontiers of Information Technology (FIT 2004), December 20-21, 2004, Islamabad. 56. A Comparative Study of Feed Forward Neural Networks and Radial Basis Neural Networks for Modeling Tokamak Fusion Process, Muhammad Awais, Shafay Shamail, Nisar Ahmed, Saman Shahid, INMIC 2003, December 8-9, 2003, Islamabad, Pakistan. 57. A Novel Approach to Increase the Robustness of Speaker Independent Arabic Speech, Muhammad Shoaib Bashir, Sh Faisal Rasheed M. Muhammad Awais, Shahid Masud, Shafay


Shamail, INMIC 2003, December 8-9, 2003, Islamabad, Pakistan. 58. Eliminating Process of Normalization in Relational Database Design, Tauqeer Hussain, Shafay Shamail, Mian M. Awais, INMIC 2003, December 8-9, 2003, Islamabad, Pakistan. pp. 408-413. 59. An XML based Architecture for Aspect-to-Class Advising in Aspect Oriented Programming, A.A. Hussain, T. Hussain, M.M. Awais, S. Shamail, M.A. Adnan, Second National Workshop on Trends in Information Technology (NWTIT 2003), November 15-16, 2003, Islamabad, Pakistan. 60. Simulation of Arabic Phoneme Identification through Spectrographic Analysis, Muhammad Shoaib Bashir, Faisal Rasheed, M.M. Awais, Shahid Masud, Shafay Shamail, Second IBCAST Conference, July 2003, Bhurban, Pakistan 61. Diesel Engine Simulation, S. Shamail, M.M.Awais, Second IBCAST Conference, July 2003, Bhurban, Pakistan 62. Development and Implementation of a new Discretized Quantum Based Radiative Model for Industrial Scaled Combustion (QRM), M.M. Awais, S. Shamail, A. Salam, INMIC2002, December 27-28, 2002, Karachi, Pakistan. 63. An Intelligent Approach Towards Solving State Space Models for Fusion Process, Mian M. Awais, Shafay Shamail, Nisar Ahmad, Saman Shahid, INMIC2002, December 27-28, 2002, Karachi, Pakistan. 64. An E-Platform for Capacity Building of Social Enterprises in Pakistan, Arsalan I. Anwer, Shafay Shamail, Mian M. Awais, INMIC2002, December 27-28, 2002, Karachi, Pakistan. 65. Development and Implementation of a Quantum Based Radiative model for Industrial Scale Combustion (QRM), Attuq-us-Salam, M. M. Awais, Shafay Shamail, Printed in Poster Abstracts, ISCON 2002, IEEE Student Conference; OPSTEC, pp 6-7, August 16-17, 2002 66. Development of Digital Certification Authority in Pakistan, Proceedings of IEEE INMIC International Conference; Lahore University of Management Sciences, December 28-30, 2001. 67.
Data Communication for Parallel Diesel Engine Simulation, Proceedings of International Workshop on Computer Vision and Parallel Processing, Quaid-i-Azam University, Islamabad, January 2-5, 1995. 68. Application of Block Predictor Corrector Method to Diesel Engine Simulation, Proceedings of the National Seminar on Software Development in Pakistan, Pak-AIMS, Lahore, July 7, 1993.

C. Courses Taught in Relevant Area at LUMS

Undergraduate
• Circuits and Systems I
• Data Structures and Algorithms
• Advanced Programming Techniques
• Software Engineering
• Programming in Java
• Problem Solving & Comp Programming (C++)
• Introduction to Computing

Graduate
• Software Development Tools and Processes
• Information Technology Architecture
• Research Trends in Software Engineering

D. Thesis / Projects Supervised in Relevant Area

PhD Theses
• Tauqeer Hussain, “Improving Conceptual Modelling in Database Design”, Lahore University of Management Sciences (LUMS), 2006.
  o Co-Supervised with Dr. Mian M. Awais
  o LUMS First PhD Graduate
• Faisal Shah, “Optimum Software Process Improvement Paradigm for Quality Practices in Software Industry” (PU), Dec 2011, Co-supervised with Prof. Dr. Niaz Ahmad
• Malik Jahan Khan, “Achieving Self-Management Capabilities in Autonomic Systems using Case-Based Reasoning”, (LUMS), 2012, Co-Supervised with Dr. Mian M. Awais

E. Grants Received in Relevant Area

F. Industrial Work Done in Relevant Area

1999-2000 SoftNet Systems Pvt. Ltd. Lahore, Pakistan. Manager Projects. Responsible for establishing Microsoft Technology Department; involved in the Web Commerce Projects.

2002-2008 Consultancy Assignment. Task Manager, IT Support for Social Enterprises in Pakistan, LUMS-McGill Social Enterprise Development Program, LUMS, Lahore.


Please provide relevant information and also attach CVs of key research / development personnel (if available) and PI, CPI.

A. Professional Information

5. Name : Dr. Basit Shafiq
6. Title or Position Held : Assistant Professor
7. Experience : (yrs) 8 years
8. Email Address : [email protected]

B. Research Papers in Relevant Area

Journal Papers

1. A. Paliwal, B. Shafiq, J. Vaidya, H. Xiong, and N. Adam, "Semantics Based Automated Service Discovery," IEEE Transactions on Services Computing, Vol. 5, No. 2, pp. 260-275, April/June 2012.
2. J. Vaidya, B. Shafiq, W. Fan, D. Mehmood, and D. Lorenzi, “A Random Decision Tree Framework for Privacy-preserving Data Mining,” IEEE Transactions on Dependable and Secure Computing, accepted, 2013.
3. N. Adam, B. Shafiq, and Robin Staffin, "Spatial Computing and Social Media in the Context of Disaster Management," IEEE Intelligent Systems, Nov/Dec 2012.
4. B. Shafiq, V. Atluri, J. Vaidya, S. Chun, and G. Nabi, “Resource Sharing using UICDS Framework for Incident Management,” Transforming Government: People, Process and Policy, Vol. 6, No. 1, pp. 41-61, 2012.
5. D. Lorenzi, J. Vaidya, S. Chun, B. Shafiq, and V. Atluri, “Enhancing the Government Service Experience through QR Codes on Mobile Platforms,” Government Information Quarterly, accepted, 2013.
6. B. Shafiq, J. Joshi, E. Bertino, and A. Ghafoor, “Secure Interoperation in Multi-Domain Environment Employing RBAC Policies,” IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 11, November 2005, pp. 1557-1577.
7. R. Bhatti, B. Shafiq, J. Joshi, E. Bertino, and A. Ghafoor, “X-GTRBAC Admin: A Decentralized Administration Model for Enterprise Wide Access Control,” ACM Transactions on Information and System Security, Vol. 8, No. 4, pp. 388-423.
8. R. Bhatti, B. Shafiq, M. Shehab, and A. Ghafoor, “Distributed Access Management in Multimedia IDCs,” IEEE Computer, Vol. 38, No. 9, September 2005, pp. 60-69.
9. B. Shafiq, H. Fahmi, S. Baqai, A. Khokhar, and A. Ghafoor, “Wireless Network Resource Management for Web-Based Multimedia Document Services,” IEEE Communications Magazine, Vol. 41, No. 3, March 2003, pp. 138-145.
10. J. Joshi, K. Li, H. Fahmi, B. Shafiq, and A. Ghafoor, “A Model for Secure Multimedia Document Database System in a Distributed Environment,” IEEE Transactions on Multimedia, Vol. 4, No. 2, June 2002, pp. 215-234.

Conference Papers

11. J. Vaidya, A. Basu, B. Shafiq, and Yuan Hong, “Differentially Private Naive Bayes Classification,” Proceedings of the 2013 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2013), 17-20 November 2013, Atlanta, GA, USA.
12. B. Shafiq, J. Vaidya, S. Chun, N. Badar, and N. Adam, “Secure Composition of Cascaded Web Services,” in Proceedings of the 8th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2012), October 14–17, 2012, Pittsburgh, Pennsylvania, USA.
13. X. He, J. Vaidya, B. Shafiq, and N. Adam, “Privacy Preserving Maximum-Flow Computation in Distributed Graphs,” IEEE International Conference on Information Privacy, Security, Risk and Trust (PASSAT 2012), Amsterdam, Netherlands, September 2012.
14. D. Mehmood, B. Shafiq, J. Vaidya, Y. Hong, N. Adam, and V. Atluri, “Privacy-preserving Subgraph Discovery,” 26th Annual WG 11.3 Conference on Data and Applications Security and Privacy (DBSec), July 11-13, 2012, Institut Mines-Telecom, Paris, France.
15. B. Shafiq, J. Vaidya, A. Ghafoor, and E. Bertino, “A Framework for Verification and Optimal Reconfiguration of Event-driven Role Based Access Control Policies,” 17th ACM Symposium on Access Control Models and Technologies (SACMAT), Newark, NJ, USA, 2012.
16. D. Lorenzi, J. Vaidya, S. Chun, B. Shafiq, G. Nabi, and V. Atluri, “Using QR Codes for Enhancing the Scope of Digital Government Services,” 13th Annual International Conference on Digital Government Research (dg.o 2012), University of Maryland, College Park, MD, USA.
17. Y. Hong, J. Vaidya, H. Lu, and B. Shafiq, “Privacy-preserving Tabu Search for Distributed Graph Coloring,” in Proceedings of the 3rd IEEE International Conference on Information Privacy, Security, Risk and Trust (PASSAT '11), October 9-11, 2011, MIT, Boston, USA.
18. V. Atluri, B. Shafiq, S. Chun, G. Nabi, and J. Vaidya, “UICDS-based Information Sharing among Emergency Response Application Systems,” in Proceedings of the 12th Annual International Conference on Digital Government Research (dg.o 2011), June 12-15, 2011, University of Maryland, College Park, MD, USA. (Demo Paper - Best Demo or Poster Award)
19. X. He, J. Vaidya, B. Shafiq, N. Adam, and X. Lin, “Reachability Analysis in Privacy-Preserving Perturbed Graphs,” Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
20. J. Vaidya, V. Atluri, B. Shafiq, and N. Adam, “Privacy-preserving Trust Verification,” ACM Symposium on Access Control Models and Technologies, 2010.
21. V. Atluri, B. Shafiq, J. Vaidya, S. Chun, M. Trocchia, N. Adam, C. Doyle and L. Skelly, “Information Sharing Infrastructure for Pharmaceutical Supply Chain Management in Emergency Response,” in Proceedings of the 2010 IEEE International Conference on Technologies for Homeland Security, 8-10 November 2010, Westin Hotel, Waltham, MA.
22. B. Shafiq, J. Vaidya, V. Atluri, and S. Chun, “UICDS Compliant Resource Management System for Emergency Response,” 11th International Digital Government Research Conference (dg.o 2010).
23. V. Atluri, S. Chun, J. Ellenberger, B. Shafiq, and Jaideep Vaidya, “Integrated Resource and Logistics Management through Secure Information Sharing for Effective Emergency Response,” Workshop on Emergency Management: Incident, Resource, and Supply Chain Management (EMWS09), November 5-6, 2009.
24. X. He, J. Vaidya, B. Shafiq, N. Adam, and V. Atluri, “Preserving Privacy in Social Networks: A Structure-Aware Approach,” Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence.
25. X. He, J. Vaidya, B. Shafiq, and N. Adam, “Efficient Privacy-Preserving Link Discovery,” 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-09).
26. B. Shafiq, J. Vaidya, V. Atluri, N. Adam, S. Chun, and A. Lieb, "Ontology Driven Resource Management for Emergency Response," in Proceedings of the Workshop on Secure Knowledge Management (SKM 2008), November 3-4, 2008.
27. X. He, B. Shafiq, J. Vaidya, and N. Adam, “Privacy-Preserving Link Discovery,” in Proceedings of the 23rd Annual ACM Symposium on Applied Computing, Data Mining Track, March 16-20, 2008, Fortaleza, Ceara, Brazil.
28. N. Adam, V. Atluri, S. Chun, J. Ellenberger, B. Shafiq, J. Vaidya, and H. Xiong, “Secure Information Sharing and Analysis for Effective Emergency Management,” in Proceedings of the 9th Annual International Conference on Digital Government Research, Montreal, Canada, May 2008.
29. N. Adam, T. White, B. Shafiq, and J. Vaidya, “Privacy Preserving Integration of Health Care Data,” AMIA 2007 Annual Symposium.
30. A. Paliwal, N. Adam and B. Shafiq, “Policy based Web Service Orchestration and Goal Reachability Analysis using HMSC and CP Nets,” in Proceedings of 9th IEEE Conference on E-Commerce Technology (CEC '07) and the 4th IEEE Conference on Enterprise Computing, E-Commerce and E-Services (EEE '07), July 2007.
31. N. Adam, V. Janeja, A. Paliwal, B. Shafiq, C. Ulmer, V. Gersabeck, A. Hardy, C. Bornhoevd, and J. Schaper, “An Approach for discovering and handling crisis in a service-oriented environment,” in Proceedings of the IEEE International Conference on Intelligence and Security Informatics 2007 (ISI 2007), May 2007.
32. N. Adam, A. Kozanoglu, A. Paliwal, and B. Shafiq, “Secure Information Sharing in a Virtual Multi-agency Team Environment,” in Proceedings of the Second International Workshop on Security and Trust Management (STM’06), Hamburg, Germany, September 2006.
33. B. Shafiq, A. Samuel, E. Bertino, and A. Ghafoor, “A Technique for Optimal Adaptation of Time-Dependent Workflows with Security Constraints,” in Proceedings of the IEEE International Conference on Data Engineering, Atlanta, GA, April 2006 (short paper).
34. B. Shafiq, A. Samuel, and H. Ghafoor, “A GTRBAC Based System for Dynamic Workflow Composition and Management,” in Proceedings of the IEEE International Symposium on Object Oriented Real-time Distributed Computing, Seattle, WA, May 2005.
35. J. Joshi, E. Bertino, B. Shafiq, and A. Ghafoor, "Dependencies and Separation of Duty Constraints in GTRBAC," in Proceedings of the ACM Symposium on Access Control Models and Technologies, Como, Italy, June 2-3, 2003.

Patent

Christof Bornhoevd, Aabhas Paliwal, Nabil Adam, Basit Shafiq. “Service-based processes using policy-based model-to-model conversion and validation techniques.” Publication number: US8601432 B2.

C. Courses Taught in Relevant Area

• Service-oriented Computing (CS 585)
• Distributed Systems (CS 582)
• Database Systems (CS 340)
• Operating Systems (CS 370)


D. Thesis / Projects Supervised in Relevant Area

Ph.D. Committee
• “Privacy Preserving Analysis of Graph Structured Data,” Ph.D. Dissertation, Xiaoyun He, Rutgers University (graduated in 2011)

M.S. Thesis
• “Automated Business Process Recommendation and Composition System,” Sobia Urooj.
• “Privacy-preserving Collaborative Business Process Composition,” Hassaan Irshad (expected graduation date: Dec 2014).

E. Grants Received in Relevant Area

1. PI, “A Privacy-preserving Framework for Collaborative Business Process Composition,” LUMS Faculty Initiative Fund, Rs. 450,000, July 2013 – June 2014.
2. Co-PI, “Information Sharing across States’ Incident Management Systems using UICDS,” United States Army Armament Research, Development and Engineering Center (ARDEC), $498,431, October 2009 - October 2011. (Vijay Atluri, PI)
3. Co-PI, “information Technology for Emergency mAnageMent (iTEAM),” Rutgers University Academic Excellence Fund, $40,000, July 2008 – June 2009. (Nabil R. Adam, PI)
4. Co-PI, “Technology Transfer: A Context Sensitive Alert Management System,” NSF Supplement Grant extension to NSF Award IIS-0306838, $50,000, June 2008 – May 2009. (Nabil R. Adam, PI)
5. Co-PI, “Service-oriented Architecture Supporting Networks of Public Security,” SAP Germany, $566,000, July 2007 – June 2009. (Nabil R. Adam, PI)
6. Co-PI, “Secure Agency Interoperation for Effective Data Mining in Border Control and Homeland Security Applications,” NSF, September 2003 – August 2008 (joined as Co-PI from September 2006). (Nabil R. Adam, PI)
7. Co-PI, “Improving Business Knowledge Management and Analysis through the Use of Semantic Web Services and RFID Technology,” SAP Research Labs, $100,000, January 2007 - June 2007. (Nabil R. Adam, PI)

F. Industrial Work Done in Relevant Area

Please provide relevant information and also attach CVs of key research / development personnel (if available) and PI, CPI.

A. Professional Information

9. Name : Dr. Khushro Shahookar
10. Title or Position Held : Chief Software Architect
11. Experience : (yrs) 20
12. Email Address : [email protected]

B. Research Papers in Relevant Area

[1] K. Shahookar, P. Mazumder, “A Genetic Approach to Standard Cell Placement with Meta-Genetic Parameter Optimization,” IEEE Transactions on Computer-Aided Design, May 1990.

[2] K. Shahookar, P. Mazumder, “Standard Cell Placement by the Genetic Algorithm,” Proc. European Design Automation Conference, 1990.

[3] K. Shahookar, P. Mazumder, “VLSI Cell Placement Techniques,” ACM Computing Surveys, Vol. 23, No. 2, pp. 143-220, June 1991.

[4] H. Chan, K. Shahookar, P. Mazumder, “Macro-Cell Placement by Genetic Adaptive Search with Bitmap-Represented Chromosomes,” Integration, the VLSI Journal, pp. 49-77, Nov. 1991.

[5] K. Shahookar, P. Mazumder, “Standard Cell Placement and the Genetic Algorithm,” Chapter 4 in Advances in Computer-Aided Engineering Design, Vol. 2, pp. 159-233, JAI Press, 1990.

[6] K. Shahookar, W. Khamisani, P. Mazumder, S.M. Reddy, “Genetic Beam Search for Gate Matrix Layout,” Proc. Sixth International Conference on VLSI Design, pp. 208-213, Jan. 1993.

[7] K. Shahookar, P. Mazumder, “Genetic Multiway Partitioning,” Proc. Eighth International Conference on VLSI Design, Jan. 1995.

C. Courses Taught in Relevant Area

• Software Engineering
• Object Oriented Analysis and Design


D. Thesis / Projects Supervised in Relevant Area

E. Grants Received in Relevant Area

F. Industrial Work Done in Relevant Area

Feb. 1998 – Present: Chief Software Architect, Softech Systems
Responsible for the technical foundation of the company. Responsibilities include:
• Define / validate the development processes, methodologies, and standards.
• Design the overall system architecture.
• Assist Project Managers in project sizing and estimation.
• Assist Team Heads / Analysts in the object and data design.
• Business development and proposal writing.
• Validate the Requirement and Functional Specifications, design, and QA measures.
• Act as technical consultant for analysts / developers.
• Analyze development activities and implement measures for improvement.

Projects at Softech Systems
Complete Requirements Engineering, Specifications and Project Management of the entire project from conception to delivery for the following projects:
• AssetConnect Mutual Fund Management System
• MarginConnect Stock Market Financing, Investment and Risk Management System for Banks
• National Clearing and Settlement System of Pakistan - provides a common clearing/settlement and risk management platform for all three stock exchanges in Pakistan through a country-wide on-line system.
• Several capital markets-related projects developed by Softech Systems for EFA Software Services, Calgary, Canada, including:
  o Java gateway framework, and several gateways for linking various stock exchanges, clearing, depository and settlement systems countrywide
  o Complete Reporting module for the new Clearing depository and settlement system of EFA
  o Java-based GUI interface to the AS/400-based Clearing depository and settlement system of EFA
• Commodity and Currency Trading Back Office System – for the international trading in spot, futures, and options. Includes comprehensive back office processes, including client order input, internal cross-matching, external trading, client portfolio, financial transactions, automatic bank transaction system, commissions, rebates, margins, real time market feed, IVR system for automated telephone transactions.
• Internet-based Day-trading application, which receives live stock feed and displays the ticker, level 2 screen, board view, portfolio, real-time moving graphs and charts, and allows order-entry.
• Wireless Nursing Informatics Project - an integrated on-line patient care and treatment information system to provide timely and efficient nursing and medical services to patients at the hospital.
• EdConnect – Distance Learning and On-Line Examinations System
• Internet Billing and Customer Care System for Orbit BCS Internet Service Provider
• Activity Monitor – Automated Time Sheet Software for Consultants
• Countrywide Integrated Point of Sale and inventory System for Mobilink, Pakistan

Consultancy Projects at Softech Systems
• Consultancy for National Commodity Exchange – Making of Business process requirements for commodity trading and settlement system
• Consultancy for National Commodity Exchange – IT Security Audit for Commodity Trading and Settlement Systems, and IT department of NCEL.
• Consultancy for Karachi Stock Exchange – Network Access and Security Policy

Aug. 1996 – Feb. 1998: Principal Software Engineer, CresSoft
Headed the Tools and Technologies department, the main focus of which was to conduct R&D on the latest software tools and cutting-edge technology for enterprise software development. Among the tools and technologies explored were Object Databases (Versant), Web technologies (Java Beans, Servlets, Java development tools), and CLIPS expert system. Also worked on the Transport Layer project, for interfacing object-oriented software with relational databases. The main functions of the transport layer are flattening complex objects into simple data types for storage in an RDBMS, and reconstructing objects from the data returned by the RDBMS. It supports object relationships, such as inheritance, aggregation, association etc., and determines the mappings of the member variables to database tables and fields. The software was developed in ParcPlace Smalltalk / VisualWorks, with Sybase RDBMS. CresSoft is a software export company based in Pakistan, with clients in USA and Europe. It primarily develops enterprise business applications in an object-oriented environment, using Smalltalk or Visual C++.

Jun. - Nov. 1994: Independent Consultant
Worked on a project for Retail Target Marketing Systems (RTMS), Wisconsin, USA, for the computation of direct marketing information from several gigabytes of transaction data. The software simultaneously reads from four 20 GB tape drives, and does merging, summarization, and statistical model computation. The output is stored in an inverted index database for high speed access. RTMS now uses the RS-6000 desktop workstation to do the multi-gigabyte processing that was so far being done on a supercomputer. In order to run in a reasonable time, the software required intense speed optimization, and an intimate knowledge of the hardware, behavior and I/O concurrency of peripherals, such as tape drives. The software is in C, on the RS-6000 under the AIX operating system.


May - Dec. 1992: Software Development Internship, General Motors Research Lab
Worked as part of a team developing VLSI CAD software for analog channel routing. Developed a dogleg channel router algorithm incorporating analog constraints, such as critical net minimization, net crossing minimization, and via minimization. Developed the output subsystem for the router, including outputs in PostScript, HPGL, and GDS-II formats, with zooming and multi-page color printing. Interfaced the channel router with commercial VLSI design software (Chipgraph) through the GDS-II data format.

Please provide relevant information and also attach CVs of key research / development personnel (if available) and PI, CPI.

A. Professional Information

13. Name : Dr. Salman Iqbal
14. Title or Position Held : Chief Executive, Softech Systems (Pvt.) Ltd
15. Experience : (yrs) 20+ Years
16. Email Address : [email protected]

B. Research Papers in Relevant Area

1. Iqbal M. S., “An adaptive message passing system for transputer networks”, proceedings of the Australasian Workshop on Parallel and Real-Time Systems, 7-8 July, 1994, Melbourne, Australia.

2. Khan G.N., Mahmud K., Iqbal M.S., and Rashid H., "RSM-A restricted shared memory architecture for high speed interprocessor communication", Microprocessors and Microsystems Journal (U.K.), Volume 18, No. 4, pp.193-203, May 1994.

3. Iqbal M. S., Khan G. N., Rashid H., and Mahmud M., "A Store-and-Forward Router for Transputer Networks", proceedings of Parallel Computing and Transputers (PCAT-93), 3-4 November, 1993, Brisbane, Australia, pp. 237-242. Eds. D. Arnold, R. Christie, J. May and P. Roe, Published by IOS Press Amsterdam.

4. Iqbal M. S., and Poon F.S.F., "A simplified and an efficient packet level internet access control scheme", proceedings of the ICCS/ISITA'92, 16-20 November 1992, pp. 963-967, The Westin Stamford and Westin Plaza Singapore. ISBN: 0-7803-0803-4.

5. Iqbal M.S., and Poon F.S.F., "Packet level access control scheme for internetwork security", IEE Proceedings (U.K.) Part I, Vol. 139, No. 2, pp. 165-175, April 1992.

6. Poon F.S.F., and Iqbal M.S., "Design of a physical layer security mechanism for CSMA/CD networks", IEE Proceedings (U.K.) Part I, Vol. 139, No. 1, pp. 103-112, February 1992.

7. Iqbal M.S., Development of novel security mechanisms for ISO/OSI physical and network layers, Ph.D. Thesis, University of Sussex, September, 1990.

8. Poon F.S.F., and Iqbal M.S., "A physical layer security system for IEEE 802.3/(ISO 8802/3) protocol based local area networks", proceedings of the 1990 Bilkent international conference on new trends in communication, control, and signal processing, Bilkent University, Ankara, Turkey, pp. 148-154, July 2-5 1990. Elsevier Science Publishers, 1990. ISBN: 0-444-88762-8.

9. Poon F.S.F., and Iqbal M.S., "An access control mechanism for internetwork communications", Proceedings of the 1990 Bilkent international conference on new trends in communication, control, and signal processing, Bilkent University, Ankara, Turkey, pp. 162-168, July 2-5 1990. Elsevier Science Publishers, 1990. ISBN: 0-444-88762-8.

10. Poon F.S.F., and Iqbal M.S., "Method to implement packet level access control in multinetworks", IEE Electronics Letters U.K., Vol. 25, No. 25, pp. 1742-1744, December 1989.

11. Iqbal M.S., and Khan G.N., "A message routing system for a transputer network", proceedings of national workshop on parallel computing: architectures and applications, department of electronics Quaid-i-Azam University, Islamabad, Pakistan, pp. 52-67, April 26-30 1992.

12. Khan G.N., and Iqbal M.S., "TDRP: transputer based dynamic reconfigurable parallel computer system", proceedings of national workshop on parallel computing: architectures and applications, department of electronics Quaid-i-Azam University, Islamabad, Pakistan, April 26-30 1992.

13. Rashid H., Mahmud K., Iqbal M. S., and Khan G. N., “Software tools for the KSM-1 Parallel Computer System”, Proc. International Workshop on Computer Vision and Parallel Processing, Quaid-i-Azam University, Islamabad, Pakistan, 2-5 January, 1995.

C. Courses Taught in Relevant Area

2001-2002: Adjunct faculty member, LUMS (Lahore University of Management Sciences), Lahore, Pakistan. Taught the distributed systems course to the MS Computer Science class.

D. Thesis / Projects Supervised in Relevant Area

E. Grants Received in Relevant Area

F. Industrial Work Done in Relevant Area

March 98 – Present: Chief Executive - Softech Systems (Pvt.) Limited, Lahore, Pakistan
Established and directed the start-up of Softech Systems and provided strong and visionary leadership to progressively make it one of the leading product and technology services companies in the region. It is focused on the design and development of niche products in the capital, financial, healthcare and telecom sectors. These products have captured a large share of the capital and investment markets. Also carried out customized software development and consulting projects for a large, prestigious customer base in Canada, UK, Singapore, Ghana, UAE, Turkey and Pakistan. Some of these products/projects are mentioned below:

Capital Market Products and Projects:
• Backconnect: A comprehensive online trading and back-office product for equities and commodity markets. Over 40 leading capital market organizations, including JPMorgan, Crosby, KASB, JS, BMA, DataBank and Cal Bank (Ghana), use this product.
• Assetconnect: A versatile asset management product for mutual, provident and pension funds portfolio and unit management. Leading asset management companies, such as NIT (largest mutual fund organization in Pakistan), JS Investments, Atlas, HBL, Al-Habib, BMA, NAFA and Askari Investments, are using this product. Stanbic, one of the leading banks in Ghana, also uses this product. The size of funds under management on this product exceeds US$ 2 billion.
• Marginconnect: Investment banking and capital market financing and risk management system for commercial and investment banks to finance against shares, margin trading and outright portfolio investment. Six leading commercial banks, including MCB (largest private sector bank in Pakistan) and Askari, are live on this product.
• Bespoke development for EFA Software Services, Calgary, Canada, a leading capital market product development company which at one time had implementations in 30 stock exchanges world-wide.
• Designed and developed the national clearing system for consolidated clearing and settlement of all the stock exchanges in Pakistan, which has been live since December 2001.
• Also designed and developed the clearing, settlement and depository system for Ghana Stock Exchange, which has been in live operation since October 2008.
• Providing consulting services to prestigious institutions such as Central Depository Company, Karachi Stock Exchange, National Commodity Exchange, Ghana Stock Exchange and State Bank of Pakistan.

Telecom/Mobile/Wireless Based Projects:
• World-wide partnership with Conceptwave (now Ericsson), Canada, which has designed the world’s leading high performance customer order and product catalog management solution. In partnership, carried out projects for BMW, Thai Telecom, Turk Telecom, Hutchison, Hong Kong and Etisalat, UAE.
• In partnership with Mi-Pay, UK, Softech has developed a number of mobile based solutions for pre-paid payments and top-ups for telecom operators, a mobile banking application for the MiSys banking product, and the UBC Music purchase and download system for the entertainment industry.
• Developed a mobile based multi-exchange stock market dissemination and online trading product. It is in live use by several brokers.
• Designed wireless/mobile based bespoke applications for a Singapore client, including process improvement for nursing-care at Tan Tock Seng Hospital, Singapore and Digital Audio Broadcast based mobile applications for an education based field assignment system.
• Provided a number of customized solutions for the telecommunication sector, such as mobile content management (IVR based song dedication and ring-tone system), ISP billing and customer care, and point of sale/inventory management for mobile operators.

Healthcare Product:
• Softech has designed a healthcare product for managing the complete lifecycle of modern hospitals, from patient registration, outpatient and inpatient management, and lab and radiology automation to comprehensive procurement, inventory and financial management systems. Structured as a scalable product that can be deployed as a service in a cloud computing environment.

May 94 – Feb. 98: Manager Information Systems - Lahore Stock Exchange, Lahore, Pakistan
A senior technical management position reporting directly to the President of Lahore Stock Exchange (LSE). Conceived a long-term LSE software services and infrastructure automation plan; consequently, a number of software development projects were designed, developed and commissioned in the areas of computerised equities trading, clearing/settlement, exposure management and real-time information dissemination. These projects had a budgetary outlay in excess of US$ 3 million.

Oct. 1995 – Feb. 1998: Project Manager for the implementation of Computerised Trading System at Lahore Stock Exchange, Lahore, Pakistan
Pioneering work in the successful implementation of the first fully screen-based computerised trading system in Pakistan, which resulted in increased daily volumes and market depth at LSE and made it the most transparent exchange of Pakistan. This project changed the very landscape of capital markets of Pakistan, resulting in further automation initiatives by other capital market organizations, which tremendously helped to improve market efficiency and transparency.

The trading project had a capital outlay of over 2 million US dollars. Reviewed and defined, in conjunction with EFA (a Canadian-based software company), the design review document, the rules document, the change and project management plan, and documents for the test, conformance and acceptance plan of the trading system. The operations environment included four AIX-based RISC/6000 systems and over 400 trader workstations.

Feb 1991 to May 94: Senior Engineer at Engineering Research Labs, Rawalpindi, Pakistan A member of the group involved in various aspects relating to software and hardware design of distributed as well as shared memory parallel systems based on Transputers. Responsible for designing and developing communication support software for the parallel systems. Additionally, involved with the group in addressing the design level issues for system level software aids, such as, network configuration and exploration software. Software development was done under ANSI Parallel C for Transputers. A number of international papers were published as a result of this research and implementation.

October 1987 - October 1990: Ph.D. studies at University of Sussex, UK Designed and developed several novel computer network security protocols for computer data networks. Based on the designed techniques a software access control system was developed on SUN/UNIX, and a physical layer security system was implemented for CSMA/CD (Ethernet) class of networks. These techniques can be used in developing secure network protocols. The novelty of the developed schemes can be established by the fact that six international papers describing the schemes were published in reputed journals (two in IEE proceedings UK) and conferences.

Consultancy
• 1995 - present: Consultant to the Central Depository Company of Pakistan. Also Chairman of their Information Technology Steering Committee.
• 2004 onwards: Member IT Steering Committee of National Commodity Exchange of Pakistan
• 2003: Consultancy services in the area of IT security to State Bank of Pakistan
• 1998 onwards: Consultancy services/member of IT Steering Committee, Karachi Stock Exchange, Pakistan
• Represented Lahore Stock Exchange (1995-1998) as a member of the Technology Committee at the Federation of Euro-Asian Stock Exchanges (a federation of over 25 emerging stock exchanges of the world), Istanbul, Turkey. Also chaired the technology committee meetings in 1996.

Bibliography

[AKO04] Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E., “Replacing suffix trees with suffix arrays”, Journal of Discrete Algorithms, vol. 2(1), 2004, pp. 53–86.

[Bah10] M. Bahtiyar. JClone: Syntax tree based clone detection for Java. Master’s thesis, Linnaeus University, 2010.

[BAJ11] Basit, H., Ali, U. and Jarzabek, S. “Viewing Simple Clones from a Structural Clones’ Perspective,” Int. Workshop on Software Clones, IWSC’2011, ICSE Workshop, Honolulu, USA, May 2011

[Bak96] B. Baker. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences, 52(1):28–42, 1996.

[Bak95] B. Baker. On finding duplication and near-duplication in large software systems. In WCRE, pages 86–95, 1995.

[Bak92] B. Baker. A program for identifying duplicated code. Computing Science and Statistics, 24:49–57, 1992.

[BANM06] S. Bouktif, G. Antoniol, M. Neteler, and E. Merlo. A novel approach to optimize clone refactoring activity. In GECCO, pages 1885–1892, 2006.

[Bas97] Bassett, P., Framing software reuse - lessons from real world, Yourdon Press, Prentice Hall, 1997.

[BAHJ12] Basit, H. A., Ali, U., Haque, S., and Jarzabek, S. “Things Structural Clones Tell that Simple Clones Don’t,” Int. Conference on Software Maintenance, ICSM’2012, Trento, Italy, September 2012, pp. 275-284

[BB02] J. Bailey and E. Burd. Evaluating clone detection tools for use during preventative maintenance. In SCAM, pages 36–43, 2002.

[BDET05] M. Bruntink, A. van Deursen, R. van Engelen, and T. Tourwe. On the use of clone detection for identifying crosscutting concern code. IEEE Trans. Softw. Eng., 31:804–818, 2005.

[Bea11] M. Beard. Extending bug localization using information retrieval and code clone location techniques. In WCRE, pages 425–428, 2011.

[Bel02] S. Bellon. Vergleich von Techniken zur Erkennung duplizierten Quellcodes. Diploma thesis, Universität Stuttgart, 2002.

[BFL10] R. Brixtel, M. Fontaine, B. Lesner, C. Bazin, and R. Robbes. Language-independent clone detection applied to plagiarism detection. In SCAM, pages 77–86, 2010.

[BHH+12] Bauer, V., Heinemann, L., Hummel, B., Juergens, E., & Conradt, M. (2012, September). A framework for incremental quality analysis of large software systems. In Software Maintenance (ICSM), 2012 28th IEEE International Conference on (pp. 537-546). IEEE.

[BJ10] Basit, H. and Jarzabek, S. “Towards Structural Clones: Analysis and semi-automated detection of design-level similarities in software,” VDM Verlag, 2010 (172 pages)

[BJ09] Basit, H.A., and Jarzabek, S. “A Data Mining Approach for Detecting Higher-level Clones in Software”, IEEE Transactions on Software Engineering 35(4):497–514, July–August 2009. ISSN 0098-5589. DOI: 10.1109/TSE.2009.16. IEEE Computer Society.

[BJ05] Basit, H.A. and Jarzabek, S. “Detecting Higher-level Similarity Patterns in Programs”, In proceedings 10th European Software Engineering Conference and 13th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, ACM Press, September 2005, Lisbon, Portugal, pp. 156-165

[BKA+07] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo. Comparison and evaluation of clone detection tools. IEEE Trans. on Softw. Engg., 33(9):577–591, 2007.

[BM09] P. Bulychev and M. Minea. An evaluation of duplicate code detection using anti-unification. In IWSC, 2009.

[BM98] B. Baker and U. Manber. Deducing similarities in java sources from bytecodes. In USENIX ATEC, pages 15–15, Berkeley, CA, USA, 1998. USENIX Association.

[BMDK00] M. Balazinska, E. Merlo, M. Dagenais, and K. Kontogiannis. Advanced clone-analysis to support object-oriented system refactoring. In WCRE, pages 98–107. IEEE Computer Society Press, 2000.

[BMD+99a] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, and K. Kontogiannis. Measuring clone based reengineering opportunities. In METRICS, pages 292–303, 1999.

[BMD+99b] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, and K. Kontogiannis. Partial redesign of java software systems based on clone analysis. In WCRE, pages 326–336. IEEE Computer Society, 1999.

[BMM07] D. Bruschi, L. Martignoni, and M. Monga. Code normalization for self-mutating malware. IEEE Security and Privacy, 5:46–54, 2007.

[BRJ05a] Basit, H.A., Rajapakse, D.C., and Jarzabek, S. “An Empirical Study on Limits of Clone Unification Using Generics”, In proceedings 17th Int. Conference on Software Engineering and Knowledge Engineering, SEKE'05, July 2005, Taipei, Taiwan, pp. 109-114

[BRJ05b] Basit, H.A., Rajapakse, D.C., and Jarzabek, S. “Beyond Templates: a Study of Clones in the STL and Some General Implications”, In proceedings 27th Int. Conf. on Software Engineering, ICSE’05, May 2005, St. Louis, USA, pp. 451-459

[Bru04] M. Bruntink. Aspect mining using clone class metrics. In WARE, 2004.

[BSP+07] Basit, H. A., Smyth, W. F., Puglisi, S. J., Turpin, A., and Jarzabek, S., “Efficient Token Based Clone Detection with Flexible Tokenization”. Short paper in proceedings 11th European Software Engineering Conference and 15th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, ACM Press, September 2007, Dubrovnik, Croatia, pp. 513-516

[BSV12] Bhadane, C., Shah, K., & Vispute, P. (2012). An Efficient Parallel Approach for Frequent Itemset Mining of Incremental Data. International Journal of Scientific & Engineering Research, 3(2).

[BYM+98] I. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone detection using abstract syntax trees. In ICSM, page 368, 1998.

[BYZ10] L. Barbour, H. Yuan and Y. Zou, “A Technique for Just-In-Time Clone Detection in Large Scale Systems”. In IEEE 18th International Conference on Program Comprehension (ICPC), 2010, pp. 76-79

[CCM+11] D. Chatterji, J. Carver, B. Massengil, J. Oslin, and N. Kraft. Measuring the efficacy of code clone information in a bug localization task: An empirical study. In ESEM, pages 20–29. IEEE Computer Society, 2011.

[CH07] A. Chiu and D. Hirtle. Beyond clone detection. CS846 Course Project Report, University of Waterloo, 2007.

[Cor03] J. Cordy. Comprehending reality: Practical barriers to industrial adoption of software maintenance automation. In IWPC, pages 196–206. IEEE Computer Society, 2003.

[CYI+11] E. Choi, N. Yoshida, T. Ishio, K. Inoue, and T. Sano. Extracting code clones for refactoring using combinations of clone metrics. In IWSC, pages 7–13. ACM, 2011.

[DBF+95] N. Davey, P. Barson, S. Field, R. Frank, and D. Tansley. The development of a software clone detector. International Journal of Applied Software Technology, 1(3/4):219–236, 1995.

[DDF02] G. Di Lucca, M. Di Penta, and A. Fasolino. An approach to identify duplicated web pages. In COMPSAC, pages 481–486, 2002.

[DG10a] I. Davis and M. Godfrey. Clone detection by exploiting assembler. In IWSC, pages 77–78. ACM, 2010.

[DG10b] I. Davis and M. Godfrey. From whence it came: Detecting source code clones by analyzing assembler. In WCRE, pages 242–246, 2010.

[DR10] E. Duala-Ekoko and M. Robillard. Clone region descriptors: Representing and tracking duplication in source code. ACM Trans. Softw. Eng. Methodol., 20:3:1–3:31, 2010.

[DR08] E. Duala-Ekoko and M. Robillard. CloneTracker: tool support for code clone management. In ICSE, pages 843–846, 2008.

[DR07] E. Duala-Ekoko and M. Robillard. Tracking code clones in evolving software. In ICSE, pages 158–167, 2007.

[DRD99] S. Ducasse, M. Rieger, and S. Demeyer. A language independent approach for detecting duplicated code. In ICSM, pages 109–118, 1999.

[DRGB99] S. Ducasse, M. Rieger, G. Golomingi, and B. Bym. Tool support for refactoring duplicated OO code. In ECOOP Workshop Reader, number 1743 in LNCS, pages 2–6. Springer-Verlag, 1999.

[EFM07] W. Evans, C. Fraser, and F. Ma. Clone detection via structural abstraction. In WCRE, pages 150–159. IEEE Computer Society, 2007.

[FBB+99] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts. Refactoring: Improving the Design of Existing Code. Addison Wesley Professional, 1999.

[FBCG10] M. Funaro, D. Braga, A. Campi, and C. Ghezzi. Combining syntactic and textual approach in clone detection. In IWSC, 2010.

[FFK08] R. Falke, P. Frenzel, and R. Koschke. Empirical evaluation of clone detection using syntax suffix trees. Empirical Softw. Engg., 13:601–643, 2008.

[FR99] R. Fanta and V. Rajlich. Removing clones from the code. Journal of Software Maintenance: Research and Practice, 11(4):223–243, 1999.

[GGIW12] Gleirscher, M., Golubitskiy, D., Irlbeck, M., & Wagner, S. (2012). On the benefit of automated static analysis for small and medium-sized software enterprises. In Software Quality. Process Automation In Software Development (pp. 14-38). Springer Berlin Heidelberg.

[Gie07] Simon Giesecke, “Generic modeling of code clones”, In DRSS, pages 1–23, 2007.

[GJS08] M. Gabel, L. Jiang, and Z. Su. Scalable detection of semantic clones. In ICSE, pages 321–33

[GK09] N. Gode and R. Koschke. Incremental clone detection. In CSMR, pages 219–228, 2009.

[HBJ14] Hammad, M., Basit, H.A., and Jarzabek, S. “A Survey of Software Clone Visualization Techniques”. Submitted for publication. 2014.

[HG11] J. Harder and N. Gode. Efficiently handling clone data: RCF and cyclone. In IWSC, pages 81–82. ACM, 2011.

[HG10] J. Harder and N. Gode. Quo vadis, clone management? In IWSC, pages 85–86. ACM, 2010.

[HJHC10] B. Hummel, E. Juergens, L. Heinemann, and M. Conradt. Index-based code clone detection: incremental, distributed, scalable. In ICSM, pages 1–9, 2010.

[HJJ09a] D. Hou, P. Jablonski, and F. Jacob. CnP: Towards an environment for the proactive management of copy-and-paste programming. In ICPC, pages 238–242, 2009.

[HJJ09b] D. Hou, F. Jacob, and P. Jablonski. Exploring the design space of proactive tool support for copy-and-paste programming. In CASCON, pages 188–202. ACM, 2009.

[HK09] Y. Higo and S. Kusumoto. Enhancing quality of code clone detection with program dependency graph. In WCRE, pages 315–316, 2009.

[HKKI07] Y. Higo, T. Kamiya, S. Kusumoto, and K. Inoue. Method and implementation for investigating code clones in a software system. Inf. Softw. Technol., 49:985–998, 2007.

[HKKI04a] Y. Higo, T. Kamiya, S. Kusumoto, and K. Inoue. Aries: Refactoring support environment based on code clone analysis. In IASTED-SEA, pages 222–229. ACTA Press, 2004.

[HKKI04b] Y. Higo, T. Kamiya, S. Kusumoto, and K. Inoue. Refactoring support based on code clone analysis. PROFES, (LNCS 3009):220–233, 2004.

[HKVD11] A. Hemel, K. Kalleberg, R. Vermaas, and E. Dolstra. Finding software license violations through binary code clone detection. In MSR, pages 63–72. ACM, 2011.

[HUK+02] Y. Higo, Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue. On software maintenance process improvement based on code clone analysis. In PROFES, pages 185–197. Springer-Verlag, 2002.

[HUKI07] Y. Higo, Y. Ueda, S. Kusumoto, and K. Inoue. Simultaneous modification support based on code clone analysis. In APSEC, pages 262–269. IEEE Computer Society, 2007.

[HYNK11] Y. Higo, U. Yasushi, M. Nishino, and S. Kusumoto. Incremental code clone detection: A PDG-based approach. In WCRE, pages 3–12, 2011.

[IHH+12] Ishihara, T., Hotta, K., Higo, Y., Igaki, H., & Kusumoto, S. (2012, October). Inter-project functional clone detection toward building libraries-An empirical study on 13,000 projects. In Reverse Engineering (WCRE), 2012 19th Working Conference on (pp. 387-391). IEEE.

[IHY+12] Katsuro Inoue, Yoshiki Higo, Norihiro Yoshida, Eunjong Choi, Shinji Kusumoto, Kyonghwan Kim, Wonjin Park, and Eunha Lee, “Experience of Finding Inconsistently-Changed Bugs in Code Clones of Mobile Software”, International Workshop on Software Clones (IWSC), 2012.

[JB10] K. Jalbert and J. Bradbury. Using clone detection to identify bugs in concurrent software. In ICSM, pages 1–5, 2010.

[JDH09] E. Juergens, F. Deissenboeck, and B. Hummel. CloneDetective - a workbench for clone detection research. In ICSE, pages 603–606, 2009.

[JH10] P. Jablonski and D. Hou. Renaming parts of identifiers consistently within code clones. In ICPC, pages 38–39. IEEE Computer Society, 2010.

[JHJ10] F. Jacob, D. Hou, and P. Jablonski. Actively comparing clones inside the code editor. In IWSC, pages 9–16. ACM, 2010.

[JH07] P. Jablonski and D. Hou. CReN: a tool for tracking copy-and-paste code clones and renaming identifiers consistently in the IDE. In ETX, pages 16–20, 2007.

[JH06] N. Juillerat and B. Hirsbrunner. An algorithm for detecting and removing clones in java code. In SeTra, pages 63–74, 2006.

[JHDF08] E. Juergens, B. Hummel, F. Deissenboeck, and M. Feilkas. Static bug detection through analysis of inconsistent clones. In TESO, pages 443–446, 2008.

[Jia06] Jiang, Z.M. 2006. Visualizing and Understanding Code Duplication in Large Software Systems, M.S. thesis, Department of The David R. Cheriton School of Computer Science, University of Waterloo, Ontario

[JL06] S. Jarzabek and S. Li. Unifying clones with a generative programming technique: a case study. Journal of Software Maintenance and Evolution: Research and Practice, 18(4):267–292, 2006.

[JMSG07] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. DECKARD: Scalable and accurate tree-based detection of code clones. In ICSE, pages 96–105, 2007.

[Joh94] J. Johnson. Substring matching for clone detection and change tracking. In ICSM, pages 120–126, 1994.

[Joh93] J. Johnson. Identifying redundancy in source code using fingerprints. In CASCON, pages 171–183. IBM Press, 1993.

[JSC07] L. Jiang, Z. Su, and E. Chiu. Context-based detection of clone-related bugs. In ESEC-FSE, pages 55–64. ACM, 2007.

[KDB+95] K. Kontogiannis, R. DeMori, M. Bernstein, M. Galler, and E. Merlo. Pattern matching for design concept localization. In WCRE, pages 96–103, 1995.

[Ker04] J. Kerievsky. Refactoring to Patterns. Addison Wesley, 2004.

[KFF06] R. Koschke, R. Falke, and P. Frenzel. Clone detection using abstract syntax suffix trees. In WCRE, pages 253–262, 2006.

[KG08] C. Kapser and M. Godfrey. “Cloning considered harmful” considered harmful: patterns of cloning in software. Empirical Software Engineering, 13:645–692, 2008.

[KG06] C. Kapser and M. Godfrey. “Cloning considered harmful” considered harmful. In WCRE, pages 19–28, 2006.

[KG04] C. Kapser and M. Godfrey. Aiding comprehension of cloning through categorization. In IWPSE, pages 85–94. IEEE Computer Society, 2004.

[KH01] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In SAS, pages 40–56. Springer-Verlag, 2001.

[KJKY11] H. Kim, Y. Jung, S. Kim, and K. Yi. Mecc: memory comparison-based clone detector. In ICSE, pages 301–310. ACM, 2011.

[KKI02] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7):654–670, 2002.

[KLM+97] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J. Loingtier, and J. Irwin. Aspect-oriented programming. In ECOOP, volume 1241, pages 220–242. Springer-Verlag, 1997.

[KMM+96] K. Kontogiannis, R. Mori, E. Merlo, M. Galler, and M. Bernstein. Pattern matching for clone and concept detection. Autom. Softw. Eng., 3(1/2):77–108, 1996.

[KMT07] A. Kellens, K. Mens, and P. Tonella. In A. Rashid and M. Aksit, editors, Transactions on aspect-oriented software development IV, chapter A survey of automated code-level aspect mining techniques, pages 143–162. Springer-Verlag, 2007.

[Kon01] G. Koni-N’Sapu. A scenario based approach for refactoring duplicated code in object oriented systems. Diploma thesis, University of Bern, 2001.

[Kon97] K. Kontogiannis. Evaluation experiments on the detection of programming patterns using software metrics. In WCRE, pages 44–54, 1997.

[Kos06] R. Koschke. Survey of research on software clones. In DRSS, pages 1–24, 2006.

[Kos08] R. Koschke, “Frontiers of Software Clone Management”, 2008.

[KRC11] Keivanloo, I., Rilling, J., & Charland, P. (2011, October). Internet-scale real-time code clone search via multi-level indexing. In Reverse Engineering (WCRE), 2011 18th Working Conference on (pp. 23-27). IEEE.

[Kri01] J. Krinke. Identifying similar code with program dependence graphs. In WCRE, pages 301–309, 2001.

[KRT97] Karhinen, A., Ran, A. and Tallgren, T. “Configuring designs for reuse,” Proc. International Conference on Software Engineering, ICSE’97, Boston, MA., 1997, pp. 701-710.

[KT14] G. P. Krishnan and N. Tsantalis. Unification and refactoring of clones. CSMR-18/WCRE-21 Software Evolution Week, 2014.

[KVB+10] E. Kodhai, V. Vijayakumar, G. Balabaskaran, T. Stalin, and B. Kanagaraj. Method level detection and removal of code clones in C and Java programs using refactoring. In IJJCET, pages 93–95. Gopalax Publications & TCET, 2010.

[KYU+09] S. Kawaguchi, T. Yamashina, H. Uwano, K. Fushida, Y. Kamei, M. Nagura, and H. Iida. SHINOBI: A tool for automatic code clone detection in the IDE. In WCRE, pages 313–314, 2009.

[LBC+10] S. Lee, G. Bae, H. Chae, D. Bae, and Y. Kwon. Automated scheduling for clone-based refactoring using a competent GA. Softw. Pract. Exper., 41(5):521–550, 2010.

[Lei04] A. Leitao. Detection of redundant code using r2d2. Software Quality Control, 12:361–382, 2004.

[LMS12] Léonard, M., Mouchard, L., & Salson, M. (2012). On the number of elements to reorder when updating a suffix array. Journal of Discrete Algorithms, 11, 87-99.

[LGS13] Liu, H., Guo, X., & Shao, W. (2013). Monitor-Based Instant Software Refactoring. IEEE Transactions on Software Engineering, 39(8).

[LJ05] S. Lee and I. Jeong. SDD: high performance code clone detection system for large scale source code. In OOPSLA, pages 140–141, 2005.

[LLMS08] H. Liu, G. Li, Z. Ma, and W. Shao. Conflict-aware schedule of software refactorings. IET Software, 2(5):446–460, 2008.

[LLMZ06] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: finding copy-paste and related bugs in large-scale software code. Software Engineering, IEEE Transactions on, 32(3):176–192, 2006.

[LM03] F. Lanubile and T. Mallardo. Finding function clones in web applications. In CSMR, pages 379–386. IEEE Computer Society, 2003.

[LPM+97] B. Lague, D. Proulx, J. Mayrand, E. Merlo, and J. Hudepohl. Assessing the benefits of incorporating function clone detection in a development process. In ICSM, pages 314–321. IEEE Computer Society, 1997.

[LRHK10] M. Lee, J. Roh, S. Hwang, and S. Kim. Instant code clone search. In FSE, pages 167–176, 2010.

[LT11] H. Li and S. Thompson. Incremental clone detection and elimination for erlang programs. In FASE/ETAPS, pages 356–370. Springer-Verlag, 2011.

[LT10] H. Li and S. Thompson. Similar code detection and elimination for Erlang programs. Practical Aspects of Declarative Languages, 5937:104–118, 2010.

[LXZ+13] Li, Y., Xu, J., Zhang, X., Li, C., & Zhang, Y. (2013, November). An Incremental Closed Frequent Itemsets Mining Algorithm Based on Shadow Prefix Tree. In Web Information System and Application Conference (WISA), 2013 10th (pp. 440-445). IEEE.

[MLH96] J. Mayrand, B. Lague, and J. Hudepohl. Evaluating the benefits of clone detection in the software maintenance activities in large scale systems. In WESS, 1996.

[MLM96] J. Mayrand, C. Leblanc, and E. Merlo. Experiment on the automatic detection of function clones in a software system using metrics. In ICSM, pages 244–253, 1996.

[MM01] A. Marcus and J. Maletic. Identification of high-level concept clones in source code. In ASE, pages 107–114, 2001.

[MQB05] E. Murphy-Hill, P. Quitslund, and A. Black. Removing duplication from java.io: a case study using traits. In OOPSLA, pages 282–291. ACM, 2005.

[NNA+09] T. Nguyen, H. Nguyen, J. Al-Kofahi, N. Pham, and T. Nguyen. Scalable and incremental clone detection for evolving software. In ICSM, pages 491–494, 2009.

[NNP+11] H. Nguyen, T. Nguyen, N. Pham, J. Al-Kofahi, and T. Nguyen. Clone management for evolving software. IEEE Trans. on Softw. Engg., 1(1):1–19, 2011.

[NSG07] S. Nasehi, G. Sotudeh, and M. Gomrokchi. Source code enhancement using reduction of duplicated code. In IASTED, pages 192–197. ACTA Press, 2007.

[OPW+04] Otey, M. E., Parthasarathy, S., Wang, C., Veloso, A., & Meira, W. (2004). Parallel and distributed methods for incremental frequent itemset mining. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 34(6), 2439-2450.

[PLR+13] J. Park, M. Lee, J. Roh, S. Hwang and S. Kim. Surfacing code in the dark: an instant clone search approach. Knowl Inf Syst. 2013

[RC09] C. Roy and J. Cordy. A mutation/injection-based automatic framework for evaluating code clone detection tools. In ICSTW, pages 157–166, 2009.

[RC08a] C. Roy and J. Cordy. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In ICPC, pages 172–181, 2008.

[RC08b] C. Roy and J. Cordy. Scenario-based comparison of clone detection techniques. In ICPC, pages 153–162, 2008.

[RC07] C. Roy and J. Cordy. A survey on software clone detection research. Tech Report TR2007-541, School of Computing, Queen's University, Canada, 2007.

[RCK09] C. Roy, J. Cordy, and R. Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Sci. Comput. Program., 74:470–495, 2009.

[RDL04] M. Rieger, S. Ducasse, and M. Lanza. Insights into system-wide code duplication. In WCRE, pages 100–109, 2004.

[Rie05] M. Rieger. Effective Clone Detection Without Language Barriers. PhD thesis, Institut für Informatik und angewandte Mathematik, Germany, 2005.

[San05] A. Santone. Clone detection through process algebras and java bytecode. In IWSC, pages 73–74. ACM, 2011.

[SDNB03] N. Scharli, S. Ducasse, O. Nierstrasz, and A. Black. Traits: Composable units of behaviour. In ECOOP, volume 2743 of LNCS, pages 327–339. Springer Berlin / Heidelberg, 2003.

[SK09] S. Schulze and M. Kuhlemann. Advanced analysis for code clone removal. In WSR, pages 1–2, 2009.

[SKR08] S. Schulze, M. Kuhlemann, and M. Rosenmuller. Towards a refactoring guideline using code clone classification. In WRT, pages 6:1–6:4. ACM, 2008.

[SLLM10] Salson, M., Lecroq, T., Léonard, M., & Mouchard, L. (2010). Dynamic extended suffix arrays. Journal of Discrete Algorithms, 8(2), 241-257.

[SWP+09] A. Sæbjørnsen, J. Willcock, T. Panas, D. Quinlan, and Z. Su. Detecting code clones in binary executables. In ISSTA, pages 117–128. ACM, 2009.

[Tai10] Tairas, R. 2010. Representation, Analysis, And Refactoring Techniques To Support Code Clone Maintenance, Ph.D. dissertation, Department of Computer and Information Sciences, University of Alabama, Birmingham

[Tar13] Tariq, S. 2013. Clone Management in XVCL, MS Thesis, Department of Computer Science, Lahore University of Management Sciences, Lahore

[TBG04] M. Toomim, A. Begel, and S. Graham. Managing duplicated code with linked editing. In VLHCC, pages 173–180. IEEE Computer Society, 2004.

[TG12] Robert Tairas, Jeff Gray, “Increasing clone maintenance support by unifying clone detection and refactoring activities”, Information and Software Technology 54 (2012) 1297–1307.

[TG10] R. Tairas and J. Gray. Sub-clone refactoring in open source software artifacts. In SAC, pages 2373–2374. ACM, 2010.

[TG06] R. Tairas and J. Gray. Phoenix-based clone detection using suffix trees. In ACM-SE, pages 679–684, 2006.

[TGB07] Tairas, R., Gray, J., and Baxter, I. 2007. Visualization of Clone Detection Results, In the Proceedings of the 22nd IEEE/ACM international conference on Automated software engineering, ASE '07, November 2007, Atlanta, Georgia, USA, 549-550

[Tor04] R. Torres. Source code mining for code duplication refactorings with formal concept analysis. M.Sc. thesis, Vrije Universiteit Brussel, Belgium, 2004.

[TYM+11] Masayuki Tokunaga, Norihiro Yoshida, Kazuki Yoshioka, Makoto Matsushita, Katsuro Inoue, “Towards Collection of Refactoring Patterns Based on Code Clone Classification”, Asian Conference on Pattern Languages of Programs (AsianPLoP), 2011

[UHK+02] Y. Ueda, Y. Higo, T. Kamiya, S. Kusumoto, and K. Inoue. Gemini: Code clone analysis tool. In ISESE, volume 2, pages 31–32, 2002.

[UKKI02a] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue. On detection of gapped code clones using gap locations. In APSEC, pages 327–336, 2002.

[UKKI02b] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue. Gemini: Maintenance support environment based on code clone analysis. In METRICS, pages 67–76. IEEE Computer Society Press, 2002.

[URSH11] M. Uddin, C. Roy, K. Schneider, and A. Hindle. On the effectiveness of simhash for detecting near-miss clones in large scale software systems. In WCRE, pages 13–22, 2011.

[Vol12] Nic Volanschi, “Safe Clone-Based Refactoring through Stereotype Identification and Iso-Generation” International Workshop on Software Clones (IWSC), 2012.

[VSR12] Radhika Venkatasubramanyam, Himanshu Singh, K Ravikanth, “A Method for Proactive Moderation of Code Clones in IDEs”, International Workshop on Software Clones (IWSC), 2012

[Wec08] V. Weckerle. CPC: an Eclipse framework for automated clone life cycle tracking and update anomaly detection. Master’s thesis, Freie Universität Berlin, Germany, 2008.

[Wit08] M. de Wit. Managing Clones Using Dynamic Change Tracking and Resolution. M.Sc. thesis, Delft University of Technology, 2008.

[WL06] A. Walenstein and A. Lakhotia. The software similarity problem in malware analysis. In DRSS, pages 1–10, 2006.

[WSWF04] V. Wahler, D. Seipel, J. Wolff, and G. Fischer. Clone detection in source code by frequent itemset techniques. In SCAM, pages 128–135, 2004.

[Yan91] W. Yang. Identifying syntactic differences between two programs. Softw. Pract. Exper., 21:739–755, 1991.

[YCY+12] Yuki Yamanaka, Eunjong Choi, Norihiro Yoshida, Katsuro Inoue, “Industrial Application of Clone Change Management System”, International Workshop on Software Clones (IWSC), 2012

[YCY+13] Yamanaka, Y., Choi, E., Yoshida, N., Inoue, K., & Sano, T. (2013, May). Applying clone change notification system into an industrial development process. In Program Comprehension (ICPC), 2013 IEEE 21st International Conference on (pp. 199-206). IEEE.

[YHK+05] N. Yoshida, Y. Higo, T. Kamiya, S. Kusumoto, and K. Inoue. On refactoring support based on code clone dependency relation. In METRICS, pages 16–25. IEEE Computer Society, 2005.

[ZBJ+08] Zhang, Y., Basit, H. A., Jarzabek, S., Anh, D., and Low, M., “Identifying Useful Design-Level Similarity Patterns based on Clone Detection Output”. In proceedings 24th IEEE International Conference on Software Maintenance, September 28 - October 4, 2008, Beijing, China, pp. 376-385

[ZR12a] Minhaz F. Zibran and Chanchal K. Roy, “The Road to Software Clone Management: A Survey”, Technical Report, University of Saskatchewan, Canada, 2012.

[ZR12b] M. Zibran and C. Roy. IDE-based real-time focused search for near-miss clones. In ACM-SAC (SE Track), pages 1–8, 2012.

[ZR11a] M. Zibran and C. Roy. Conflict-aware optimal scheduling of code clone refactoring: A constraint programming approach. In ICPC, pages 266–269, 2011.

[ZR11b] M. Zibran and C. Roy. A constraint programming approach to conflict-aware optimal scheduling of prioritized code clone refactoring. In SCAM, pages 105–114, 2011.

[ZR11c] M. Zibran and C. Roy. Towards flexible code clone detection, management, and refactoring in IDE. In IWSC, pages 75–76, 2011.

[ZSAR11] M. Zibran, R. Saha, M. Asaduzzaman, and C. Roy. Analyzing and forecasting near-miss clones in evolving software: An empirical study. In ICECCS, pages 295–304, 2011.
