Proceedings of the 2018 7th International Conference on Software and Information Engineering (ICSIE 2018)

The British University in Egypt, Egypt May 2-4, 2018

ISBN: 978-1-4503-6469-0

The Association for Computing Machinery, 2 Penn Plaza, Suite 701, New York, New York 10121-0701

ACM COPYRIGHT NOTICE. Copyright © 2018 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept., ACM, Inc., fax +1 (212) 869-0481, or [email protected].

For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, +1-978-750-8400, +1-978-750-4470 (fax).

ACM ISBN: 978-1-4503-6469-0

Table of Contents


Preface ...... v
Conference Committees ...... vi

Session 1- Software Engineering and Information Security

Dynamic Code Loading to a Bare-metal Embedded Target ...... 1
Andrew Sadek, Mohamed Elmahdy and Tarek Eldeeb

An Empirical Study with Function Point Analysis for Software Development Phase Method ...... 7
Jalal Shah, Nazri Kama and Saiful Adli Ismail

Analysing Log Files For Web Intrusion Investigation Using Hadoop ...... 12
Marlina Abdul Latib, Saiful Adli Ismail, Othman Mohd Yusop, Pritheega Magalingam and Azri Azmi

Privacy-Preserving Personal Health Record (PHR): A Secure Android Application ...... 22
Saeed Samet, Mohd Tazim Ishraque and Anupam Sharma

The Role of Ethnography in Agile Requirements Analysis ...... 27
Ali Meligy, Walid Dabour and Alaa Farhat

Predicting the Survivors of the Titanic - Kaggle, Machine Learning From Disaster - ...... 32
Nadine Farag and Ghada Hassan

Using Fuzzy Logic in QCA for the Selection of Relevant IS Adoption Drivers in Emerging Economies ...... 38
Nayeth I. Solorzano Alcivar, Luke Houghton and Louis Sanzogni

Using SMOTE and Heterogeneous Stacking in Ensemble learning for Software Defect Prediction ...... 44
Sara Adel El-Shorbagy, Wael Mohamed El-Gammal and Walid M. Abdelmoez

Session 2- Computer Vision and Image Processing


Extraction of Egyptian License Plate Numbers and Characters Using SURF and Cross Correlation ...... 48
Ann Nosseir and Ramy Roshdy

Automatic Extraction of Arabic Number from Egyptian ID Cards ...... 56
Ann Nosseir and Omar Adel

Automatic Identification and Classifications for Fruits Using k-NN ...... 62
Ann Nosseir and Seif Eldin Ashraf Ahmed

Computer Aided Diagnosis System for Liver Cirrhosis Based on Ultrasound Images ...... 68
Reham Rabie, Mohamed Meselhy Eltoukhy, Mohammad al-Shatouri and Essam A. Rashed

Image Denoising Technique for CT Dose Modulation ...... 72
Haneen A. Elyamani, Samir A. El-Seoud and Essam A. Rashed

An Interactive Mixed Reality Imaging System for Minimally Invasive Surgeries ...... 76
Samir A. El-Seoud, Amr S. Mady and Essam A. Rashed

A Computer-Aided Early Detection System of Pulmonary Nodules in CT Scan Images ...... 81
Hanan M. Amer, Fatma E.Z. Abou-Chadi, Sherif S. Kishk and Marwa I. Obayya

Session 3- Computer Science and Applications


Directer: A Parallel and Directed Fuzzing based on Concolic Execution ...... 87
Xiaobin Song, Zehui Wu and Yunchao Wang

A New Approach for Implementing 3D Video Call on Cloud Computing Infrastructure ...... 93
Nada Radwan, M. B. Abdelhalim and Ashraf AbdelRaouf

Interactive Mobile Learning Platform at the British University in Egypt ...... 97
Ihab Adly, Mohamed Fadel, Ahmed El-Baz and Hani Amin

A RESTful Architecture for Portable Remote Online Experimentation Services ...... 102
Mohanad Odema, Ihab Adly, Ahmed El-Baz and Hani Amin

Adaptive security scheme for real-time VoIP using multi-layer steganography ...... 106
Shourok AbdelRahim, Samy Ghoneimy and Gamal Selim

Clickbait Detection ...... 111
Suhaib R. Khater, Oraib H. Al-sahlee, Daoud M. Daoud and Samir Abou El-Seoud

Pedagogical and Elearning Logs Analyses to Enhance Students’ Performance ...... 116
Eslam Abou Gamie, Samir Abou El-Seoud, Mostafa A. Salama and Walid Hussein

Efficient Architecture for Controlled Accurate Computation using AVX ...... 121
DiaaEldin M. Osman, Mohamed A. Sobh, Ayman M. Bahaa-Eldin and Ahmad M. Zaki

A Framework to Automate the Generation of Movies’ Trailers Using Only Subtitles ...... 126
Eslam Amer and Ayman Nabil

Example-Based Machine Translation: Matching Stage Using Internal Medicine Publications ...... 131
Rana Ehab, Eslam Amer and Mahmoud Gadallah

Positive and Negative Feature-feature Correlation Measure: AddGain ...... 136
Mostafa A. Salama and Ghada Hassan

Preface

The 2018 7th International Conference on Software and Information Engineering (ICSIE 2018) provided a forum for accessing the most up-to-date and authoritative knowledge from both the industrial and academic worlds and for sharing best practices in this exciting field. ICSIE 2018 was held in Cairo, Egypt, during May 2-4, 2018.

The event featured presentations delivered by researchers and scholars from the international community, including keynote speeches and highly selective lectures.

The proceedings of ICSIE 2018 comprise 26 papers selected from 64 submissions received from universities, research institutes and industry. All of the papers were peer-reviewed by conference committee members and international reviewers, and were selected on the basis of their quality and their relevance to the conference. Studies presented in this volume cover the following topics: Artificial Intelligence, Bioinformatics, Communication Systems and Networks, Computer Vision & Pattern Recognition, Design Patterns and Frameworks, Distributed and Intelligent Systems, Software Requirements Engineering, Technology Transfer, Web Engineering, etc.

The conference dinner took place with a glittering 2-hour dinner cruise along the iconic Nile River and was greatly enjoyed by all participants.

I would like to take this opportunity to thank many people. First and foremost, I want to express my deep appreciation to the keynote speakers, session chairs, and all the reviewers for their efforts and kind help with this conference.

Final thanks go to all authors and participants at ICSIE 2018 for helping to make it a successful event.

Conference Chair

Prof. Samir A. El-Seoud, The British University in Egypt, Egypt

Conference Committees

International Advisory Committees

Prof. Dr.sc., Dr.-Ing. Michael E. Auer, Vice Rector at Carinthia University of Applied Sciences (FH Kärnten), Austria

Prof. Omar H. Karam, Dean of the Faculty of Informatics and Computer Science (ICS), The British University in Egypt (BUE), Egypt

Prof. Mohamed F. Tolba, Ex-Vice President for Student Affairs, Ain Shams University, Egypt

Prof. Amr Goneid, Ex-Chair of the Computer Science Department and Ex-Director of Graduate Programs, The American University in Cairo (AUC), Egypt

Prof. Ibrahim El-Henawy, Ex-Dean, Faculty of Computers and Informatics & Head of the Department of Computer Science, Zagazig University, Egypt

Prof. Magne Jørgensen, Simula Research Laboratory, Norway

Prof. Sunil Vadera, Dean of the School of Computing, Science and Engineering, University of Salford, UK

Prof. Amr El-Abbadi, University of California, Santa Barbara, USA

Prof. Ashraf S. Hussein, Vice-President for Education and Information Technology and Dean of the Faculty of Computing at Arab Open University, Kuwait

Honorary Chair

Prof. Ahmed Mohamed Hamad, the President of The British University in Egypt, Egypt

Conference Chair

Prof. Samir A. El-Seoud, The British University in Egypt, Egypt

Local Chairs

Prof. Mostafa Abdel Aziem Mostafa, Arab Academy for Science and Technology and Maritime Transport, Egypt

Prof. Atef Zaki Ghalwash, University of Helwan, Egypt

Program Chairs

Prof. Hesham H. Ali, University of Nebraska Omaha, USA

Prof. Christopher Nwosisi, The College of Westchester & Pace University New York, USA

Prof. Jeffrey McDonald, University of South Alabama, USA

Prof. Naoko Fukami, Research Station Cairo, Japan Society for the Promotion of Science (JSPS)

Local Organizing Committees

Miss Samah Mettawie (Senior Administrator, ICS-BUE)

Mrs. Hoda Hosin (Director, Vice President Office, Research and Postgraduate Studies, BUE)

Mrs. Hanan El Saadawi (Director of PR, Marketing & Alumni, BUE)


Session 1 Software Engineering and Information Security

Dynamic Code Loading to a Bare-metal Embedded Target

Andrew Sadek, Valeo, Giza, Egypt, [email protected]
Mohamed Elmahdy, German University in Cairo, Cairo, Egypt, [email protected]
Tarek Eldeeb, Valeo, Giza, Egypt, [email protected]

ABSTRACT
Dynamic code loading at run-time is a challenging task in embedded systems. While a dynamic linker feature is provided by many operating systems for ELF files, such as Linux, bare metal embedded systems shall not depend on any OS support. Indeed, various studies have deployed the Position Independent Code (PIC) approach instead of dynamic linking, allowing the code to run regardless of its memory location. The work presented here aims at providing an efficient methodology for run-time code loading of multiple applications to a bare metal embedded target. In the first place, the code is compiled in position-independent form then linked with the base image at compile time. Correspondingly, the resulting program is considered as an add-on to the base image and sent to a specified section in the target memory. Furthermore, 'GCC' and 'Binutils' were customized to enhance the current implemented methodology of PIC. This allows referencing data by offset from the start of the text section instead of using the Global Offset Table (GOT), hence making position independent code smaller and more efficient. After all, the work-flow was implemented on an FPGA board using the Microblaze processor and tested with the Dhrystone benchmark. Markedly, the results and performance analysis have proven better efficiency for the proposed work-flow.

CCS Concepts
• Computer systems organization → Embedded software;

Keywords
Embedded Systems; GCC; Microblaze; Dynamic Loading; Position Independent Code

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. ICSIE '18, May 2–4, 2018, Cairo, Egypt. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05...$15.00. DOI: https://doi.org/10.1145/3220267.3220568

1. INTRODUCTION
Embedded systems are found everywhere in present times, ranging from automotive to televisions, telecommunications, medical devices, and military applications. These systems require inexpensive micro-controllers in order to perform the needed functionality. Due to the rise of using such devices, their processing capabilities are continuously increasing, which allows them to be maintained and extended with new features. Dynamic reprogramming of embedded systems has become a highly demanded task recently. It saves time to dynamically load a code to an embedded target without the need of recompiling the whole base image. For instance, wireless sensor networks require to be reconfigured at run-time due to the complexity of re-building them once deployed [1, 2].

Dynamic code loading is essentially based on memory space assignation and symbol relocation in case of a function or variable. There are two methods for symbol relocation: either static linking or dynamic linking. In static linking, symbols are resolved with absolute and/or relative addresses at compile time. This implies that external symbol addresses and the program location in memory are already known before the loading. While in dynamic linking, symbols are resolved at run-time. Thus the code shall be relocatable in a way that all symbol names and locations referenced by it shall be provided to the target system to relocate them before program execution.

In fact, OS support has become widely available for different embedded devices like SOS, Mate, TOSBoot, RETOS. In addition, Contiki and SenSpire are considered as examples of flexible operating systems for tiny network sensors featuring the capability of dynamic loading and unloading of different applications at run-time [1, 3].

In case of bare metal systems, a running code shall not depend on any operating system functionality. The work presented in [4] provided an approach for extending a Robot Control Framework to support bare metal embedded computing nodes. They relied on static linking of the final image when building their system instead of the usual procedure of dynamic linking in case of running with an OS.

Equally important, various operating systems are enabled with the Position Independent Code (PIC) feature, like Enix [5], SOS and TinyOS [6]. In fact, PIC allows the code to be loaded and run anywhere from the memory regardless of its absolute address. Moreover, it is target dependent and in most cases has to be supported by the compiler [7].

Notably, in [8, 9, 10, 11], the idea of a position-independent application was preferred over dynamic linking, generally in order to avoid symbol resolution in the code at load-time. This may cause time overhead and may not be supported as well in some targets (e.g. ELF file [9]). However, the experiment done in [12] on a Linux OS showed that a position independent executable (PIE) is not often a recommended option as it introduces run-time performance overheads.

Moreover, relocatable code was preferred over PIC in [1, 2, 13] in order to avoid execution efficiency degradation due to the indirect addressing made in the code, where a symbol address has to be loaded from memory before using it.

By the same token, reprogramming embedded systems has been deployed in previous research using various methodologies. In [2], dynamic linking was preferred over static linking, as re-linking is not required when there are updates in the Contiki OS, which makes the loading more flexible. However, in [1] they benefit from the pre-linking concept as it saves more time in the first loading of the new image. Moreover, they send the necessary relocations along with the ELF file for further re-allocation. In fact, both [2] and [1] avoided PIC due to compiler dependency as well as the performance cost of indirect addressing.

Conversely, the PIC methodology was applied in [10] for dynamic application loading to time-critical systems in order to ensure composability and predictability of the loading process, while relying on GOT for global variables storage as well as virtual function tables for external OS and debug APIs. Similarly, a self-relocation method was proposed in [9] for embedded systems without ELF loading support. In particular, they compile the target program in PIC and update GOT entries after the loading.

On the other hand, another way for position-independent executables was implemented in [8] by relying on indirection tables which consist of the OS symbols required by the loaded programs. Results showed a run-time overhead compared to the static linking model as well as a restriction for accessing the application resources only through the standard interfaces.

In this paper, the focus has been directed to dynamic code loading for bare-metal embedded systems, thus the reprogramming shall be independent from OS support. Moreover, the work-flow considers efficiency and flexibility as major influencing factors.

2. BACKGROUND AND DESIGN TOOLS
In this section, a brief summary is given about the GCC compiler, Position Independent Code practices and the Microblaze architecture, which are the tools and/or methodologies for the work-flow implementation in this paper.

2.1 GCC
The GNU Compiler Collection is the standard compiler of most UNIX-based systems. Being originally written as the compiler for the GNU operating system in its first release in 1987, it was meant to handle only C programming, where the abbreviation stood for 'GNU C Compiler'. Currently, GCC covers different programming languages for front ends; it supports programs written in C, C++, Objective C, Fortran, Ada, and Go. Additionally, various instruction sets are supported for back ends, such as the x86 ISA, MIPS and also for embedded systems like ARM, Microblaze, AMCC. Not only is GCC open-source software, but its highly mature architecture has also grabbed the attention of many professionals to work on it for different interests.

Furthermore, its compilation flow is aligned with Unix conventions [14] as follows:
• The user runs a language-specific driver program (e.g. gcc, g++ for C, C++ respectively), which parses the code and turns it into a low-level intermediate representation called 'Register Transfer Language' (RTL).
• Then, the assembler is invoked on the output and transfers it to assembly code according to the instruction set of the target.
• Optionally, the user may run the linker on the resulting object files in order to produce an integrated executable binary.

2.2 Microblaze Architecture
'Microblaze' is a soft microprocessor designed by Xilinx to run on FPGA modules and programmable SoC families. As a RISC-based Harvard architecture, Microblaze satisfies the conditions for various applications such as medical, industrial, automotive and consumer markets.

Indeed, the soft processor solution offers high flexibility with over 70 configuration options, which lets users achieve the optimal system design according to their needs. Particularly, the core supports:
• 3-stage pipeline for optimal size, 5-stage pipeline for optimal performance,
• Instruction and Data caches,
• Big-endian/little-endian support,
• Optional memory management unit,
• Optional floating point unit,
and many other features as mentioned in [15].

2.3 Position Independent Code
As previously stated, PIC is a form of code that runs properly regardless of its place in the memory. Indeed, there exist different methodologies for implementing such a form.

As a general convention in GCC, all symbol addresses shall be located in a memory section called the 'Global Offset Table' (GOT) instead of resolving them inside the code instructions. Consequently, all symbol references in the code require loading their GOT entries before processing them. In addition, the support of a Procedure Linkage Table (PLT) may be offered in order to complete the lazy binding for external functions included in the 'GOT' [9].

Furthermore, some embedded architectures such as ARM [16, 8] allow PC-relative addressing, thus the data can be referenced by offset from the program counter.

Fig. 1 illustrates the general PIC implementation in GCC compared with the normal non-PIC code. In detail, the example provided intends to load the data inside address 0xFFF0000 into r3. To clarify, a base register is dedicated to store the GOT start address and evaluated in the prologue of each function (e.g. r20 in Microblaze, r9/r10 in ARM).

3. WORK-FLOW
This paper aims at providing an efficient and flexible methodology for run-time module loading, generally for embedded systems with even no OS support. As a high-level abstraction, the work-flow will cover the following sequential steps:

2 r20 contains the start 24: addik r3, r0, 55 address of GOT (0xFFA0000) 28: imm 0 R_MICROBLAZE_64 dataArray

Text Section Text Section 2c: sbi r3, r19, 0 ……………………….. ……………………….. ... lw r3, r0, 0xFFF0000 lw r4, r20, 0x10000 ……………………….. lw r3, r4, 0 ……………………….. Listing 2: Compiled Object.o

0xFFA0000 GOT Section ……………………….. 4.1 Pre-Linked Module (Static Linking) 0xFFB0000 0xFFF0000 In the first place, suppose that the base image file flashed ……………………….. on the embedded target that shall receive code in Listing 2 Data Section Data Section 0xFFF0000 ……………………….. ……………………….. is called ‘System.elf’. As shown in Listing 1, local references 0xFFF0000 0x12345678 0x12345678 ……………………….. ……………………….. as ‘dataArray’ and ‘var’ were already defined and just needs address assignation from the linker. On the other hand, other external references such as ’print hello world’ have non-PIC PIC unknown address. Correspondingly, the linker executable is invoked on ‘Ob- Figure 1: Illustration of PIC with GOT ject.o’ with command ‘-R System.elf’ or ‘–just-symbols = System.elf’. In this way, it refers symbolically to the ab- solute memory locations defined in ‘System.elf’ and thus in • Compile module code using PIC, this case shall resolve ‘print hello world’. Notably, there exists a section dedicated in the base im- • Static Link of code with the base image, age for dynamically loaded code named ‘.load’ which start • Copy the resulting executable to SD-RAM at run-time address has been declared in the linker script for the static and jump to it, linking above and has been resolved by the ‘R’ option. Consequently, the resulting ELF can be safely loaded to • Loaded code should be able to run and accesses the the target memory and run properly. However, the major exported symbols from the embedded software as ex- issue here relies in prior knowledge to the start address of pected. Safe return to the basic image is a must. the loaded ELF before linking. Hence, it is not possible to relocate it anywhere in the memory. Moreover, the paper treats PIC run-time performance is- Additionally, the issue will rise with the need of loading sues and code size excess as well, therefore the cost resulted multiple applications. Either the object file shall be re-linked by indirect addressing is decreased by using the data-text on the host as done in [1] or dynamically linked on the target relative concept. by sending a relocatable ‘elf’ along with the relocations to be evaluated in run-time. 4. IMPLEMENTATION The piece of code below and the corresponding assembly 4.2 Compile with PIC provide the necessary use cases throughout the upcoming At this point, it has been decided to use ‘Position Indepen- section. It shall be sent and run on the embedded target. dent Code’ methodology (PIC). No matter where the code Please refer to [15] for Microblaze instruction manual. The is located, it can be loaded and run from any place in the demonstrated subset are described in the appendix. memory. Accordingly, invocation of GCC command option ‘-fPIC’ in compilation stage produces assembly in Listing 3 extern void print_hello_world (void); for ‘Object.c’. As previously mentioned in Fig. 1, symbol char dataArray[8] = {0xFF,0,0,0,0,0,0,0}; addresses reside now in the ‘GOT’ section and therefore re- long var = 1; quires update in run-time according to the code’s location void run(unsigned char i) { offset. Moreover, indirect addressing can cause performance print_hello_world(); overhead due to the added instructions for loading addresses var = 123; from GOT. For instance, when the linker hits the relocation dataArray[i] = 55; in Listing 3 (line 0x28), it will calculate its offset from the } start address of GOT in r20 (previously evaluated in line 0x18). 
Forthwith, the code loads the ‘var’ address in ‘r3’ Listing 1: Object.c where it stores the value ‘123’ (lines 0x30, 0x34). Unfortunately, Microblaze does not support PC-relative 00000000 : addressing unlike ARM where ‘PC’ can be declared as a ... base register [16], thus the data references could have been c: imm 0 resolved by offset to the program counter instead of relying R_MICROBLAZE_64_PCREL print_hello_world on ‘GOT’. 10: brlid r15, 0 14: addk r19, r5, r0 00000000 : 18: addik r3, r0, 123 ... 1c: imm 0 10: mfs r20, rpc R_MICROBLAZE_64 var 14: imm 0 20: swi r3, r0, 0 R_MICROBLAZE_GOTPC_64 _GLOBAL_OFFSET_TABLE_+0x8

3 18: addik r20, r20, 0 To emphasize, the relocation in Listing 4 (lines: 0x2c, 0x30) 1c: imm 0 points to the difference between ‘var’ and ‘start of text’ ad- R_MICROBLAZE_PLT_64 print_hello_world dresses which shall be added to ‘r20’ instead of absolute 20: brlid r15, 0 addressing in non-PIC code shown in Listing 2 (lines: 0x1c, 24: addk r19, r5, r0 0x20). 28: imm 0 R_MICROBLAZE_GOT_64 var 4.3.2 Binutils Customization 2c: lwi r3, r20, 0 Following the GCC adaptation for PIC, 2 new relocations 30: addik r4, r0, 123 have been added in the GNU binutils (assembler and linker 34: swi r4, r3, 0 tools): 38: imm 0 R_MICROBLAZE_GOT_64 dataArray • ‘R MICROBLAZE TEXTPCREL 64’ (resolves offset 3c: lwi r3, r20, 0 of current PC to start of text) 40: addik r4, r0, 55 • ‘R MICROBLAZE TEXTREL 64’ (resolves offset of 44: sb r4, r19, r3 ... mentioned data reference to start of text) The assembler outputs these relocations according to the di- Listing 3: Compiled PIC Object.o rectives from the compiler then the linker perform the cor- responding calculations. Another adjustment was done in the linker for external calls and data. Here the data-text 4.3 PIC Enhancements relative approach does not apply since all references in the In this paper, the solution introduced aims at avoiding base image are constant (i.e. their location shall not change). indirect addressing overhead while referencing the data by Hence, the linker modifies all relative addressing branches to offset to the start of text based on data-text relative con- be absolute if they are referring to external symbols coming cept. Accordingly, the distance between data and instruc- from ‘System.elf’ by the new command option ‘–adjust-insn- tions in memory shall be constant. GCC back end has been abs-refs’. Not to forget, that for the external data references customized in order to change the base register for data ad- the base register shall be reverted back from ‘r20’ to ‘r0’. In dressing. In addition, 2 new relocations have been added in detail, Listing 5 shows the resulting ELF after static link- binutils and handled in both assembler and linker. ing of the enhanced PIC in Listing 4, with arbitrary start address (0xa8086708). 4.3.1 GCC Compiler Customization As shown in Listing 4 (line 0x10 to 0x18), special instruc- Contents of section .text: tion ‘mfs’ [15] moves the PC to r20, then addition of the a8086708: dcff2130 0000e1f9 1c0061fa 200081fa constant difference between current PC and start of text oc- ... Contents of section .data: curs. This is the same function prologue of the original PIC a8086760: 01000000 ff000000 00000000 in Listing 3 but the base register ‘r20’ currently contains Disassembly of section .text: beginning of text address instead of start index of GOT. a8086708 : ... 00000000 : a8086718: mfs r20, rpc ... a808671c: imm -1 10: mfs r20, rpc a8086720: addik r20, r20, -16 14: imm 0 a8086724: imm -22522 R_MICROBLAZE_TXTPCREL_64 _TEXT_START_ADDR+0x8 a8086728: bralid r15, -18504 18: addik r20, r20, 0 a808672c: addk r19, r5, r0 1c: imm 0 a8086730: addik r3, r0, 123 R_MICROBLAZE_64_PCREL print_hello_world a8086734: imm 0 20: brlid r15, 0 a8086738: swi r3, r20, 88 24: addk r19, r5, r0 a808673c: addik r4, r0, 55 28: addik r3, r0, 123 a8086740: imm 0 2c: imm 0 a8086744: addik r3, r20, 92 R_MICROBLAZE_TXTREL_64 var a8086748: sb r4, r19, r3 30: swi r3, r20, 0 ... 34: addik r4, r0, 55 38: imm 0 R_MICROBLAZE_TXTREL_64 dataArray Listing 5: Executable Enhanced PIC Object.elf 3c: addik r3, r20, 4 40: sb r4, r19, r3 4.4 Data Re-locations ... 
Equally important, the data section may include pointers to other references as mentioned in [9]. All pointers assigned Listing 4: Compiled Enhanced PIC Object.o with addresses outside the code (e.g. ‘varPointer’ in List- Accordingly, all data references with base register ‘r0’ (i.e. ing 6) are allocated in a memory section called ‘.data.rel’. no offset, r0’s value is always 0) [15] can be resolved in run- Henceforth, the corresponding relocations are sent for re- time by the following formula: evaluation in run-time according to the program’s location in memory compared with the arbitrary start address set in addr = textstartAddr + (addr − textstartAddr) (1) linking.
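To illustrate the run-time re-evaluation of '.data.rel' entries described above and the offset-from-text-start relation expressed in Equation (1), the following C sketch shows how a loader on the target could patch each recorded pointer once the actual load address is known. This is a minimal, hypothetical sketch, not the authors' implementation: the record layout and the function name are assumptions, and only the adjustment by the difference between the actual start address and the arbitrary start address chosen at static link time follows the text.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical relocation record sent by the host along with the ELF
 * sections: 'offset' is the byte position, inside the loaded image, of a
 * pointer that belongs to '.data.rel' and must be re-evaluated at run-time. */
typedef struct {
    uint32_t offset;
} data_reloc_t;

/* Re-evaluate every '.data.rel' pointer after loading.  The adjustment is
 * the difference between the actual start address and the arbitrary start
 * address assumed at link time, so each address keeps a constant distance
 * from the start of the loaded image, as in Equation (1). */
static void patch_data_relocations(uint8_t *load_base,
                                   uint32_t arbitrary_start,
                                   uint32_t actual_start,
                                   const data_reloc_t *relocs,
                                   size_t count)
{
    int32_t delta = (int32_t)(actual_start - arbitrary_start);

    for (size_t i = 0; i < count; i++) {
        /* Pointer value stored inside the loaded image at this offset. */
        uint32_t *slot = (uint32_t *)(load_base + relocs[i].offset);
        *slot = (uint32_t)(*slot + delta);
    }
}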

Figure 2: Dynamic Loading Work-flow (host-target interaction steps: Compile code (-mpic-data-text-relative); Link with base image (-R); Sends ELF size + MD5 checksum; Validate MD5 + check for space; Return result; Allocate space; ELF content + Data relocations; Receive ELF in assigned memory location; Entry address + Arbitrary start address; Data relocation update; Start program execution)

int var;
int* varPointer = &var;

Listing 6: Data Relocation example

4.5 Full Structure and Interaction with Host
To sum up, Fig. 2 shows the whole work-flow and how the host PC interacts with the embedded target. Given that no OS support exists in our case here, major changes and restructuring could be introduced in the base image. Therefore, the add-on shall be re-linked and even re-compiled every time a modification takes place on the base ELF, which is not considered as a drawback like in [2]. First, compilation to PIC occurs with the new GCC command '-mpic-data-text-relative', then static linking with the base ELF (-R) is invoked. Second, the host sends the MD5 check-sum of the base image along with the add-on size. Therefore, the target checks if the MD5 matches with the one of the flashed image and if there is enough space, otherwise it outputs an error and re-linking may be required. If the second step passed, the host sends the binary content of the programmable ELF sections along with the entry address and the arbitrary start address that was set in the linking process. Finally, the target allocates each byte received in the assigned location and updates data relocations (e.g. .data.rel) according to the offset of the actual start address to the arbitrary one.

5. EVALUATION
After all, the work-flow was tested on an FPGA board (Xilinx Spartan 6) and MicroBlaze was configured with a 5-stage pipeline and a CPU clock of 100 MHz. In particular, the Dhrystone benchmark [17] was picked from the GCC test suite and set to run for 5000 loops. It was compiled in non-PIC (only static linking), PIC (-fPIE), and the new enhanced PIC (-fPIE and -mpic-data-text-relative) with all GCC optimization levels. Then, the host sends it as an add-on to the target where the external functions for printing and time calculation reside. Notably, the benchmark fires a test report after execution and it went correctly for all tests in Table 1.

Table 1: Dhrystone Time Results (ms)
               -O0     -O1     -O2     -O3
Non-PIC        62.51   28.11   22.61   21.76
PIC            72.81   39.06   29.46   28.72
Enhanced PIC   66.66   32.41   22.71   22.41

Fig. 3 and Table 1 show a performance improvement for the enhanced PIC over the original PIC. First, in -O0, where no optimization takes place, the overhead has decreased from 16% to 6% due to the absence of instructions that load addresses from GOT, which could introduce cache misses as well. Markedly, in enhanced PIC the base register was just changed from r0 to r20 in address calculation, and the rest of the 6% overhead is caused mainly by the calculation of r20 in each function prologue. Second, in -O1, the overhead decrease remains nearly the same, as the common sub-expression and dead code eliminations could not optimize the r20 calculation.

Finally, in -O2 and -O3, the instruction re-scheduling techniques could improve the performance significantly by accelerating the pipeline, avoiding stalls and data hazards, thus making the overhead decrease from original PIC to enhanced PIC >= 90%.

By the same token, Fig. 4 shows an overhead decrease of [75-80]% for all optimization levels due to the absence of the GOT itself, which saves the data section size along with the instructions for loading addresses, saving the text section size.

Figure 3: Normalized Dhrystone Time Results

Figure 4: Normalized Dhrystone ELF Size Results
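The normalized time results plotted in Fig. 3 can be reproduced from Table 1 by scaling every measurement to the non-PIC -O0 time (taken as 100), which appears to be how the figure presents the data. The short C sketch below performs that scaling; the Table 1 values are hard-coded purely for illustration.

#include <stdio.h>

/* Dhrystone execution times in ms, copied from Table 1. */
static const char  *levels[]  = { "-O0", "-O1", "-O2", "-O3" };
static const double non_pic[] = { 62.51, 28.11, 22.61, 21.76 };
static const double pic[]     = { 72.81, 39.06, 29.46, 28.72 };
static const double enh_pic[] = { 66.66, 32.41, 22.71, 22.41 };

int main(void)
{
    /* Scale each measurement to the non-PIC -O0 time (= 100), so that,
     * for example, PIC at -O0 comes out around 116 and enhanced PIC
     * around 106, matching the discussed 16% vs 6% overheads. */
    const double baseline = non_pic[0];

    for (int i = 0; i < 4; i++) {
        printf("%s: non-PIC %5.1f  PIC %5.1f  enhanced PIC %5.1f\n",
               levels[i],
               100.0 * non_pic[i] / baseline,
               100.0 * pic[i]     / baseline,
               100.0 * enh_pic[i] / baseline);
    }
    return 0;
}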

5 6. CONCLUSION AND FUTURE WORK international conference on Mobile systems, To summarize, the work-flow proposed in this paper aimed applications, and services, pages 163–176. ACM, 2005. at providing an efficient solution for dynamic reprogram- [7] Leandro Batista Ribeiro and Marcel Baunach. ming of bare-metal embedded systems. In essence, code is Towards dynamically composed real-time embedded compiled in PIC and statically linked with the base ELF systems. In Logistik und Echtzeit, pages 11–20. that was flashed on the target. Since MicroBlaze does not Springer, 2017. support PC relative addressing for data references, the orig- [8] Nermin Kajtazovic, Christopher Preschern, and inal PIC methodology in GCC was enhanced using the data- Christian Kreiner. A component-based dynamic link text relative concept in order to minimize the cost of indirect support for safety-critical embedded systems. In addressing in terms of performance and code size. Accord- Engineering of Computer Based Systems (ECBS), ingly, assembler and linker in GNU binutils were customized 2013 20th IEEE International Conference and to align with GCC adaptation. Workshops on the, pages 92–99. IEEE, 2013. In addition, the linker adjusts the instructions referring [9] Tang Xinyu, Zhang Changyou, Liang Chen, Khaled to variables and/or function in the base image by reverting Aourra, and Li YuanZhang. A code self-relocation back the base register to r0 in case of data references and method for embedded system. In Computational converting relative branches to absolute in case of functions. Science and Engineering (CSE) and Embedded and After all, enhanced PIC has proved better performance over Ubiquitous Computing (EUC), 2017 IEEE the original PIC with an increase range of [60%-98%] as well International Conference on, volume 1, pages 688–691. as minimized code size. IEEE, 2017. In future work, the focus shall be directed to the secu- [10] Shubhendu Sinha, Martijn Koedam, Rob Van Wijk, rity aspect for code unloading, loading other dynamic codes Andrew Nelson, Ashkan Beyranvand Nejad, Marc and even multiple code loads. This includes dealing with Geilen, and Kees Goossens. Composable and memory fragmentation in case of full allocation. Notably, predictable dynamic loading for time-critical removal and un-linking of pointers passed from the add-on partitioned systems. In Digital System Design (DSD), to the base image shall take place in case of code unloading. 2014 17th Euromicro Conference on, pages 285–292. IEEE, 2014. 7. ACKNOWLEDGMENTS [11] Adriaan Van Buuren. Dynamic loading and task The research for this paper was financially supported by migration for streaming applications on a composable Valeo Egypt. It is a part of a Research and Develop- system-on-chip. In Student Thesis, Faculty of ment program, to enrich Valeo’s innovative employees in the Electrical Engineering, Delft University of Technology, embedded-software automotive industry. 2012. [12] Mathias Payer. Too much pie is bad for performance. 8. REFERENCES Technical Report 766, ETH Zurich, Switzerland, 2012. [1] Wei Dong, Chun Chen, Xue Liu, Jiajun Bu, and [13] Francisco Javier Acosta Padilla. Self-adaptation for Yunhao Liu. Dynamic linking and loading in Internet of things applications. PhD thesis, Rennes 1, networked embedded systems. In Mobile Adhoc and 2016. Sensor Systems, 2009. MASS’09. IEEE 6th [14] Steve Pate. UNIX filesystems: evolution, design, International Conference on, pages 554–562. IEEE, and implementation, volume 10. John Wiley & Sons, 2009. 2003. 
[2] Adam Dunkels, Niclas Finne, Joakim Eriksson, and [15] Microblaze soft processor core. http://www.xilinx. Thiemo Voigt. Run-time dynamic linking for com/products/design-tools/microblaze.html. reprogramming wireless sensor networks. In [16] ARM information center. http://infocenter.arm.com/. Proceedings of the 4th international conference on [17] Alan R Weiss. Dhrystone benchmark: History, Embedded networked sensor systems, pages 15–28. analysis, scores and recommendations. EEMBC White ACM, 2006. Paper, 2002. [3] Richard Oliver, Adriana Wilde, and Ed Zaluska. Reprogramming embedded systems at run-time. In APPENDIX A: USED MB INSTRUCTIONS The 8th International Conference on Sensing This shows a selected subset of the used Microblaze in- Technologies, 2014. structions and the respect functionality for each. [4] Steffen Schutz,¨ Max Reichardt, Michael Arndt, and Karsten Berns. Seamless extension of a robot control Instr. Functionality framework to bare metal embedded nodes. In GI-Jahrestagung, pages 1307–1318, 2014. imm 16-bit immediate value [5] Yu-Ting Chen, Ting-Chou Chien, and Pai H Chou. lw(i) Load Word (Immediate) Enix: a lightweight dynamic operating system for sw(i) Store Word (Immediate) tightly constrained wireless sensor platforms. In Proceedings of the 8th ACM Conference on Embedded sb(i) Store Byte (Immediate) Networked Sensor Systems, pages 183–196. ACM, add(i)k Add (Immediate) and Keep Carry 2010. br(a)lid Branch (Absolute) and Link Imm with Delay [6] Chih-Chieh Han, Ram Kumar, Roy Shea, Eddie mfs Move From Special Purpose Register Kohler, and Mani Srivastava. A dynamic operating system for sensor nodes. In Proceedings of the 3rd

An Empirical Study with Function Point Analysis for Software Development Phase Method

Jalal Shah, Universiti Teknologi Malaysia, 54100, Jalan Semarak, Kuala Lumpur, [email protected]
Nazri Kama, Universiti Teknologi Malaysia, 54100, Jalan Semarak, Kuala Lumpur, [email protected]
Saiful Adli Ismail, Universiti Teknologi Malaysia, 54100, Jalan Semarak, Kuala Lumpur, [email protected]

ABSTRACT
It is important to know the actual size and complexity of the software before predicting the amount of effort required for it to be implemented. The two most common methods used for software size estimation are: (i) Source Lines of Code (SLOC) and (ii) Function Point Analysis (FPA). Estimating the size of a software with the SLOC method is only possible once the process of coding is completed. On the other hand, estimating software size with the FPA method is possible in early phases of the Software Development Life Cycle (SDLC). However, one main challenge from the viewpoint of the software development phase is the presence of inconsistent states of software artifacts, i.e. some of the classes are completely developed, some are partially developed, and some are not developed yet. Therefore, this research uses the newly developed model, i.e. Function Point Analysis for Software Development Phase (FPA-SDP), in an empirical study to overcome this challenge. The results of the FPA-SDP model can help software project managers in: (i) knowing the inconsistent states of software artifacts, and (ii) estimating the actual size of a change request with its complexity level for the software development phase.

CCS Concepts
• Software and its engineering➝Agile software development.

Keywords
Software Development phase; Software Change Management; Function Point Analysis; Source Lines of Code; Software Size Estimation; Effort Estimation; Impact Analysis.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. ICSIE '18, May 2–4, 2018, Cairo, Egypt. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00. DOI: https://doi.org/10.1145/3220267.3220268

1. INTRODUCTION
One of the causes for the failure of a software project is continuously changing requirements during the SDLC. Accepting a massive amount of change requests might increase the time and budget of the software. On the other hand, accepting a very low number of change requests may increase the level of disappointment of the customer [1, 2].

There are two commonly used techniques, (i) Impact Analysis (IA) and (ii) Effort Estimation (EE), that help software project managers in accepting or rejecting a change request [3, 4]. Impact Analysis (IA) is the process of estimating the results of a change request on software artifacts [5]. Whereas, Effort Estimation (EE) is the process of predicting the amount of effort required for the implementation of a change request [6, 7]. Before estimating the amount of effort for a required change request it is important to know its actual size and complexity [6, 7]. One of the main reasons for an incorrect effort estimation is the inappropriate size estimation of change requests.

There are two most common methods that are used for software size estimation, namely: SLOC and FPA. The SLOC method uses the number of source lines which are developed for the specific change request for measuring the size of the change request. It is only possible when the task of coding is completed. Thus, it is difficult for software project managers to use SLOC for early phase estimation. On the other hand, the FPA method is used for software size estimation in early phases of the SDLC.

The current challenge of relying solely on the FPA method is that this technique [8, 9] is mostly used for the software maintenance phase, where software artifacts are consistent. On the other hand, in the software development phase, software artifacts are in inconsistent states [10, 11]. So it becomes a challenging task for software project managers to accept or reject change requests during the software development phase [12, 13].

Based on the above challenge, we have proposed the FPA-SDP model. This model has the capability to support the above challenge.

This paper is structured as follows: Section 2 presents related work, section 3 describes the proposed model, section 4 presents the evaluation process, section 5 contains the discussion and section 6 presents the conclusion and future work.

2. RELATED WORK
The four most related keywords involved in this research are Impact Analysis, Effort Estimation, Source Lines of Code, and Function Point Analysis.

2.1 Change Impact Analysis
Change Impact Analysis (CIA) is the method of finding possible consequences of a change, or "estimating what needs to be modified to accomplish a change" [14].

There are two types of IA techniques: (i) Static Impact Analysis (SIA) and (ii) Dynamic Impact Analysis (DIA) [15]. The SIA technique considers static information from software artifacts to produce a set of possible impact classes. On the other hand, the DIA technique considers dynamic information created by implementing the code to generate a set of potential impact classes [2].

The studies [3, 12, 16] present the integration of the SIA technique with the DIA technique as a new approach. In addition, these studies consider the fully and partially developed classes during the software development phase.

On the basis of the above studies we have selected one of the CIA techniques [16], that is the integration of SIA and DIA. According to Asl and Kama [16], their technique can be used for IA during the software development phase. Furthermore, it considers all the

7 classes that are fully developed, partially developed or not i = GSC from 1 to 14. developed yet. Ci = degree of influence for each General System Characteristic. 2.2 Effort Estimation Σ = summation of 14 GSC. So, after getting the value of VAF from Equation (2) the final Effort Estimation (EE) is the method of predicting that how much value of FPs can be calculated from Equation (1). work and how many hours of work is required to develop a software. Normally it describes in man-days or man-hours unit [12]. 3. PROPOSED MODEL According to Idri, et al. [9] EE is one of the most interesting task An integration of Function point analysis method has been done for software project managers. Several EE models have been with Change Impact Analysis technique in the newly developed developed [17, 18]. Some of the most common used EE models model i.e. FPA-SDP model. In FPA-SDP model IA technique is are: Expert Judgement [9, 19]; Estimation by Analogy [20]; and used for predicting the impact of a change request on software Regression Analysis [9]. artifacts. Furthermore, it is also used to identify the status of However, before estimating the amount of effort for a change classes which are influenced after implementing a new change request, it is important to estimate its accurate size [21]. It is still a request. Whereas, FPA method is used for the size estimation of challenging task for software project managers to estimate the the new accepted change request during early phase of SDLC. accurate size of a change request[13]. For this purpose, two most common methods which are used are: (i) SLOC and (ii) FPA. FPA-SDP model has four main stages, which are (i) Change 2.3 Source Lines of Code Request Analysis, (ii) Change Impact Analysis, (iii) Calculating Function Points and (iv) Estimating Change Effort as shown in Most effort estimation models use SLOC method for software size Figure 1. The model has been discussed in detail in the previous estimation because it is easy to calculate the number of source studies. While in this paper a brief explanation is given. lines of code when it is developed [22]. However, SLOC method cannot accurately estimates the size of a change request until the process of coding is completed. Therefore, estimating the size of a change request in early phases of SDLC becomes practically impossible [8]. Since software size is the most important input for an effort model, a poor SLOC estimate will lead to a bad effort estimation. 2.4 Function Point Analysis Function Point Analysis (FPA) method was developed by Allan Albrecht in 1979 [7, 23]. It is a technique of counting the size and complexity of a software system in terms of the functions that the system provides to its end users[8, 24]. The main goals of FPA method are: (i) independent of development technology, (ii) simple to apply, (iii) can be estimated from requirement specifications and (iv) meaning full to end users [21]. Additionally, a systematic literature review was performed on EE by [22] in which they specified that FPA is one of the most solid and reliable estimation technique.

In FPA technique, Function Points (FPs) of a software are calculated by adding Unadjusted Function Points (UFPs) with Value Adjustment Factor (VAF) as shown in Equation (1). The procedure of calculating UFPs and VAF are given in International Figure 1. Function Point Analysis for Software Development Function Point User Group (IFPUG) manual [25] . Phase (FPA-SDP)

FPs = UFP * VAF (1) Stage 1: Change Request Analysis: A change request will be submitted through a change request form and change request Whereas, analysis process will be started. If the change request is accepted FP stands for Function Points than change request specification will be described. The process UFP stands for Unadjusted Function Points. will repeat for every change request. UFPs is the sum of all functions i.e. External Interface (EI's), Stage 2: Change Impact Analysis / Impact Analysis: In this stage External Output (EO's), External Queries (EQ's), Internal Logical the process of IA will be executed. It will receive two inputs i.e. Files (ILF's) and External Interface Files (EIF's) with its level of change request specification and a set of software artifacts which complexity (low, average and high). are updated during previous change request. Afterwards, the VAF stands for Value Adjustment Factor process of impact analysis will be done in two categories i.e. Value Adjustment Factor (VAF) can be calculated from fourteen Static Impact Analysis (SIA) and Dynamic Impact Analysis (DIA). General System Characteristics (GSC) [25] as shown in Equation In SIA the impact of a change request will be observed in design (2). and class artifacts. Whereas in DIA the status of code will be observed. The process of IA [16] will be repeated for every VAF = 0.65 + [(Σni=1Ci) *0.1] (2) change request. Stage 3: Calculating Function Points: In this stage the FPs for a Where: change request will be identified and calculated. It will be

8 calculated by using the updated set of software artifacts and code CR-14 Coding Deletion as an input. In this stage the size and complexity of the change CR-15 Coding Addition request will be observed by calculating the number of FPs using IFPUG manual [25]. The process will be repeated for every CR-16 Coding Modification change request. CR-17 Coding Addition

Stage 4: Estimating Change effort: In this stage the amount of CR-18 Coding Modification required effort for implementing a change request will be CR-19 Testing Deletion estimated. The effort will be estimated by using FPs as an input. Whereas, the efforts will be equal to productivity multiplied by CR-20 Testing Addition number of function points. For example, if the productivity is 8 CR-21 Testing Deletion hours for 1 function point. Then efforts will be equal to CR-22 Testing Modification Productivity multiplied by FPs as shown in Equation (3).
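To make Equations (1) and (3) concrete, the short C sketch below computes the adjusted function points and the corresponding effort for one change request. It is only an illustration: the helper names and the sample UFP and VAF values are hypothetical, and only the figure of 8 hours per function point is taken from the example in the text; UFP and VAF are assumed to have been obtained beforehand following the IFPUG manual [25].

#include <stdio.h>

/* Equation (1): FPs = UFP * VAF. */
static double adjusted_function_points(double ufp, double vaf)
{
    return ufp * vaf;
}

/* Equation (3): Efforts = Productivity * FPs. */
static double change_effort_hours(double productivity, double fps)
{
    return productivity * fps;
}

int main(void)
{
    double ufp = 12.0;          /* hypothetical unadjusted function points */
    double vaf = 1.05;          /* hypothetical value adjustment factor    */
    double productivity = 8.0;  /* 8 hours per function point, as in the text */

    double fps = adjusted_function_points(ufp, vaf);
    double effort = change_effort_hours(productivity, fps);

    printf("FPs = %.2f, estimated effort = %.2f hours\n", fps, effort);
    return 0;
}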

Efforts = Productivity * FPs (3) 4.3 Evaluation Metric

An evaluation metric has been used for evaluating the results 4. EVALUATION PROCESS produced by FPA-SDP model which is the Magnitude of Relative This section defines the method for conducting the empirical Error (MRE). It has calculated a rate of the relative errors in both study. During the study four main factors has been considered. cases of over-estimation or under-estimation as shown below. These factors are: (i) Case selection, (ii) Data Collection, (iii) Evaluation Metric, and (iv) Evaluation Results. [Actual Results- Estimated Results] 4.1 Case Selection MRE = To evaluate the results of new developed model i.e. FPA-SDP. An Actual Results empirical study is conducted by selecting a case i.e. Vending Machine Control System (VMCS). The data used in this study is collected from the selected case study, which has been 4.4 Evaluation Results developed by 6 (six) experienced members. These members are Table 2 shows the experimental results of the empirical study. The master of software engineering students, having experience in table indicates that the estimated amount of effort for a change software development industry. Hence, can count them as expert request with the actual amount of effort. The change request members. estimation produced by the actual implementation effort and MRE 4.2 Data Collection value (percentage of discrepancy between estimated effort and During the study 22 (Twenty-Two) change requests have been actual implementation effort) sorted by the Change Request ID. collected mostly from all phases of SDLC. Later, these change There are 22 (Twenty-Two) change requests have been introduced requests have been analyzed and a change request specification document has been derived. to the case selection software project during the software Table 1 is showing the data collection with Change Request ID, development phase. Change Request occurrence Stage and Change Request Types i.e. Addition, Deletion and Modification. Table 2. Evaluation Results Change Estimated Actual amount MRE% Table 1: Change Requests and Change Request Types Request amount of of effort for Change Change Request Change Request ID effort Change Change Request ID Stages Types Request Request CR-1 27.72 26.91 CR-1 Analysis Addition 0.0301 CR-2 9.9 12.00 CR-2 Analysis Addition 0.175 CR-3 5.94 4.92 CR-3 Analysis Addition 0.20732 CR-4 2.97 4.95 CR-4 Analysis Modification 0.4 CR-5 14.85 12.93 CR-5 Analysis Deletion 0.14849 CR-6 6.93 7.00 CR-6 Design Deletion 0.01 CR-7 20.79 18.94 CR-7 Design Addition 0.09768 CR-8 13.86 14.95 CR-8 Design Modification 0.07291 CR-9 6.93 8.85 CR-9 Design Deletion 0.216949 CR-10 6.93 5.00 CR-10 Design Modification 0.386 CR-11 9.9 11.84 CR-11 Design Addition 0.163851 CR-12 5.94 7.00 CR-12 Design Deletion 0.151429 CR-13 13.86 11.96 CR-13 Design Modification 0.15886

9 International Conference on Electrical, Electronics, and CR-14 3.96 5.93 0.332209 Optimization Techniques (ICEEOT), 2016, pp. 505-509. CR-15 6.93 8.84 0.216063 [5] N. Kama and F. Azli, "A Change Impact Analysis CR-16 20.79 17.93 0.15951 Approach for the Software Development Phase," in CR-17 4.95 6.91 2012 19th Asia-Pacific Software Engineering 0.283647 Conference, 2012, pp. 583-592. CR-18 13.86 11.93 0.16178 [6] W. W. Royce, "Managing the development of large CR-19 11.88 9.97 software systems," in proceedings of IEEE WESCON, 0.191 1970, pp. 1-9. CR-20 8.91 5.96 0.49497 [7] A. J. Albrecht, "AD/M productivity measurement and estimate validation," IBM Corporate Information CR-21 2.97 4.69 0.366738 Systems, IBM Corp., Purchase, NY, 1984. CR-22 20.79 17.08 0.21721 [8] P. Agrawal and S. Kumar, "Early phase software effort estimation model," in 2016 Symposium on Colossal 5. DISCUSSION Data Analysis and Networking (CDAN), 2016, pp. 1-8. [9] A. Idri, M. Hosni, and A. Abran, "Systematic literature To review the results of the empirical study we have come up with review of ensemble effort estimation," Journal of some solutions that FPA-SDP model can help software project Systems and Software, vol. 118, pp. 151-175, 8// 2016. managers in: (i) knowing the inconsistent states of software [10] B. Sufyan, K. Nazri, H. Faizura, and A. I. Saiful, artifacts (ii) estimating the actual size of a change request with its "Predicting effort for requirement changes during complexity level during software development phase. software development," presented at the Proceedings of 6. CONCLUSION AND FUTURE WORK the Seventh Symposium on Information and The new developed model i.e. FPA-SDP, can estimate the amount Communication Technology, Ho Chi Minh City, Viet of required effort for a change request during the software Nam, 2016. development phase. FPA-SDP model uses the combination of [11] J. Shah and N. Kama, "Extending Function Point Change Impact Analysis (CIA) technique with Function Point Analysis Effort Estimation Method for Software Analysis (FPA) method to support estimation during software Development Phase," in Proceedings of the 2018 7th development phase. The results of new EE model show that it can International Conference on Software and Computer be helpful to software project managers in predicting the impact Applications, 2018, pp. 77-81. of a required change on software artifacts. Furthermore, it also [12] S. Basri, N. Kama, and R. Ibrahim, "A Novel Effort helps software project managers in predicting the size and Estimation Approach for Requirement Changes during complexity of a change request for a change during software Software Development Phase," International Journal of development. Software Engineering and Its Applications, vol. 9, pp. The results of the paper are part of our ongoing research to 237-252, 2015. overcome the challenges of change acceptance decisions for the [13] J. Shah and N. Kama, "Issues of Using Function Point requested changes in software development phase. For future Analysis Method for Requirement Changes During work, we aim to conduct intensive tests of this method by Software Development Phase.," presented at the Asia considering more change requests from different case studies. In Pacific Requirements Engeneering Conference, Melaka addition to this, we will also try to do an empirical study to Malaysia, 2018. estimate the accuracy rate of our new model with the existing FPA [14] D. Kchaou, N. Bouassida, and H. Ben-Abdallah, "UML method. 

Analysing Log Files For Web Intrusion Investigation Using Hadoop

Marlina Abdul Latib, Saiful Adli Ismail, Othman Mohd Yusop, Pritheega Magalingam, Azri Azmi
Advanced Informatics School, Universiti Teknologi Malaysia, Jalan Sultan Yahya Petra, 54100 Kuala Lumpur, Malaysia
{saifuladli,othmanyusop,mpritheega.kl,azriazmi}@utm.my, [email protected]

ABSTRACT
The process of analyzing large amounts of data from log files helps organizations identify web intruders' activities as well as the vulnerabilities of a website. However, analyzing them is a great challenge, as the process is time consuming and can be inefficient, and existing or traditional log analyzers may not be able to handle such large chunks of data. Therefore, the aim of this research is to produce analysis results for web intrusion investigation in a Big Data environment. In this study, web logs were analyzed based on attacks captured in web server log files. The web logs were cleaned and refined through a log-preprocessing program before being analyzed, and an experimental simulation was conducted using the Hadoop framework to produce the required analysis results. The results of this experimental simulation indicate that a Hadoop application is able to produce analysis results from large web log files in order to assist web intrusion investigation. Besides that, the execution time performance analysis shows that the total execution time does not increase linearly with the size of the data. This study also provides a solution for visualizing the analysis results using Power View and Hive.

CCS Concepts
• Security and privacy➝Malware and its mitigation.

Keywords
Big Data; Hadoop; log pre-processing; web intrusion; web log file.

1. INTRODUCTION
Hosting and managing hundreds of websites is not an easy task in terms of ensuring their accessibility and security. Upon an incident, the incident response team needs to run an investigation in order to identify the cause of the incident and analyze its impact. Log files of the web servers are the main source that the investigators rely upon most of the time. The analysis will reveal whether an attacker has run an exploit and taken advantage of a system vulnerability [1]. Logs have an important role in determining any vulnerability, source of attack, or malicious activity, and in producing statistical data [2]. This guides the incident handling team regarding certain incidents and the procedures that need to be followed to fix the vulnerability. Occasionally, the analysis is done manually or with log analyzer tools.

The process of analyzing logs, system events, and network flows for forensic purposes has been an issue for decades, as conventional technologies are not always adequate to support large-scale analytics. This is due to the inefficiency of performing analytics over large unstructured datasets and complex queries. Moreover, things get worse when the datasets are incomplete and noisy. Even though various Security Information and Event Management (SIEM) tools are widely used, some of them are not designed to manage and analyze unstructured data and are restricted to a predefined schema. In addition, tools that are capable of managing and analyzing large datasets, such as SAP software, are very expensive and require a strong business case [3].

Based on these issues, log data can be considered Big Data, which relates to massive volumes of data. Fortunately, by implementing a Big Data solution, those huge log data can be put to good use, since such a solution has the ability to identify large-scale patterns when diagnosing and preventing problems [4].

Big Data's popularity grows each year as more applications related to it become part of security management software. The reason is that they are able to prepare, clean, and query data efficiently, even in diverse, noisy, and incomplete formats. Therefore, many Big Data tools such as the Hadoop framework are giving new opportunities for processing and analyzing data, because they are commodifying the deployment of large-scale and reliable clusters [3].

Hadoop, as mentioned earlier, is an open source application that has the ability to process huge amounts of data in parallel on large clusters. It is a combination of MapReduce and the Hadoop Distributed File System (HDFS). Besides that, Hadoop allows distributed processing of huge data sets across multiple clusters of computers, as it uses a programming standard that is easy to implement. The entire technology is a constitution of a Distributed File System (DFS), shared utilities, analytics and platforms for information storage, as well as an application layer responsible for administering activities such as configuration management, distributed processing, parallel computation, and workflow. Beyond providing high availability of data when handling massive, complex, or diverse data sets, Hadoop is cost-effective compared to traditional approaches, and it also offers immense speed and scalability [5].

2. RESEARCH BACKGROUND
Security incidents can be considered daily events: each day hundreds of new attacks are launched through discovered vulnerabilities, using various types of hacking tools and malware. Therefore, incident response and investigation processes require thorough analysis of the log files related to the incident.

A web log file is an example of Big Data because its volume can be excessive, especially when a huge number of websites constantly receive high access every day. Therefore, it is important to choose a tool or technology that can support Big Data, such as the Hadoop framework, which is known to handle high volumes of data such as log files.

2.1 Web Intrusion
Web intrusion relates to the vulnerability of web applications, which invites threats and web attacks by malicious hackers who intend to access sensitive data [6]. There are several types of intrusion, and this research uses log files to detect and identify them.

A vulnerability is a hole or weakness in an application, which can be a design flaw or an implementation bug, that allows an attacker to cause harm to the stakeholders of the application. When someone exploits a web vulnerability, that particular web application is being attacked, and there are various ways for a web vulnerability to be exploited. Some web vulnerabilities, such as information leakage or improper error handling, are difficult to detect [7]. The Open Web Application Security Project (OWASP) released a report on the ten most critical web application security risks for 2013, which focused on identifying the most serious risks for a broad array of organizations. The risks are as follows [8]:
1. Injection
2. Broken Authentication and Session Management
3. Cross-Site Scripting (XSS)
4. Insecure Direct Object References
5. Security Misconfiguration
6. Sensitive Data Exposure
7. Missing Function Level Access Control
8. Cross-Site Request Forgery (CSRF)
9. Using Components with Known Vulnerabilities
10. Unvalidated Redirects and Forwards

Several attacks from the OWASP list can be detected by analyzing log files. Based on [9], there are also other types of web attacks against the Hypertext Transfer Protocol (HTTP), namely denial-of-service (DoS) attacks and HTTP tunneling attacks.

2.2 Detecting Attacks From Web Log Files
Studies related to web server security incident detection usually focus on server log analysis and network traffic. For server log analysis, each HTTP request received by the web server is recorded in the server's access log file, which becomes one of the main sources of information about the server's health as well as its security. At the same time, both regular user requests and malicious requests are recorded in the access log, including requests related to attacks against the server. The logs are later analyzed to identify the symptoms of security incidents.

The analysis results enable us to detect anomalous activities such as a brute force attack, which can be detected when too many requests receive error statuses within a small period of time. Besides that, the Uniform Resource Identifier (URI) query field in the log file gives information for detecting any attempt or attack involving Structured Query Language (SQL) injection or Cross-Site Scripting (XSS) [10]. SQL injection is a technique where intruders or attackers inject code into SQL queries sent to the back-end database with the intention of accessing and manipulating unauthorized database data. An XSS attack, on the other hand, injects malicious scripts into a website. Both types of attacks can be detected by analyzing the log file fields [11].

Therefore, the goal of the experiment in this study was to detect the log records that contain the anomalous activities of hackers. The traces of the intruders' activities help in the web intrusion investigation.
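As a concrete illustration of the detection ideas above, the sketch below scans Common Log Format records for a few well-known SQL injection and XSS indicators and counts client-error responses per IP. This is a minimal sketch written for this article, not the authors' tool, and the patterns are deliberately simplistic; a real investigation would use a broader, tuned rule set and the time-window analysis mentioned above for brute-force detection.

```python
import re
from collections import Counter

# Illustrative indicators only; not an exhaustive or production rule set.
SQLI = re.compile(r"(union\s+select|or\s+1=1|information_schema)", re.IGNORECASE)
XSS = re.compile(r"(<script|%3Cscript|javascript:)", re.IGNORECASE)

def scan(log_lines):
    """Flag requests whose request field matches SQLi/XSS patterns and count 4xx errors per IP."""
    findings, errors_per_ip = [], Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue                       # skip malformed records
        ip = line.split()[0]
        request = parts[1]                 # e.g. 'GET /page?id=1 HTTP/1.1'
        status = parts[2].split()[0]
        if SQLI.search(request):
            findings.append((ip, "possible SQL injection", request))
        if XSS.search(request):
            findings.append((ip, "possible XSS", request))
        if status.startswith("4"):
            errors_per_ip[ip] += 1         # many 4xx from one IP may indicate brute force
    return findings, errors_per_ip
```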

2.3 Web Log Files
A web server log file resides on the web server and records the activity of each visitor who accesses the website through a browser. Generally, much useful information can be retrieved from web log files to help web administrators or investigators detect web intrusion activities. The common useful information includes records of the pages requested by visitors, errors that occurred while accessing the web, the status returned by the server, and the size of the packets sent from server to server [10].

The contents, or basic information, of a web server log file are as follows [1]:
(a) User name: information identifying the visitor who accessed the web site, such as the IP address.
(b) Visiting path: records of the path used by the user while browsing the web site.
(c) Path traversed: the path that the user took within the website using various links.
(d) Timestamp: the time the user spent on each web page while going through the web site; this is what we call a session.
(e) Page last visited: the page visited by the user before he or she left the website.
(f) Success rate: based on the number of successful downloads and the number of copying activities done by the user.
(g) User agent: information on the user's browser software and version from which the request was sent to the web server.
(h) URL: information on the resource being accessed by the user, such as a CGI program, an HTML page, or a script.

2.4 Log Data in Big Data Environment
Basically, Big Data is defined as datasets that cannot be perceived, acquired, managed, and processed within an endurable time by traditional IT and software or hardware tools [12]. Big Data is not only characterized by the three V's of volume, velocity, and variety [13]; recent studies emphasize two further characteristics, veracity and value, that also need to be considered [14][15]. Therefore, web server logs can be considered Big Data, as they involve massive amounts of data: the log files of web servers nowadays can reach petabytes or terabytes of storage.

2.4.1 Hadoop Framework for Log Analysis
Hadoop is a framework created by Doug Cutting, written in Java and released under the Apache License. It is used for analyzing and processing Big Data in a distributed computing environment, as well as supporting the running of related applications. Furthermore, Hadoop addresses three main challenges created by Big Data, as listed below [16][17][18][19]:
(a) Volume: Hadoop provides a framework to scale out to large data sets. It is used in systems where multiple nodes are present, which can process terabytes of data.
(b) Velocity: Hadoop is able to handle an intense rate of incoming data from very large systems. It uses its own file system, HDFS, which facilitates fast transfer of data and can sustain node failure while avoiding failure of the system as a whole.
(c) Variety: Hadoop supports complex tasks in order to deal with a variety of unstructured data. It uses the MapReduce algorithm, which breaks the Big Data down into smaller chunks and performs the operations on them.

The Hadoop framework may vary from one application to another, depending on the needs and expected output; the following subsections describe the Hadoop architecture, and Section 2.5 surveys research on Big Data log analysis.

2.4.2 Hadoop Architecture
Hadoop consists of two major components, MapReduce and HDFS, as illustrated in Figure 1 [17].

Figure 1. High-level architecture of Hadoop.

2.4.2.1 HDFS
HDFS, the Hadoop Distributed File System, is the default storage layer on which the computations rely. The HDFS architecture follows a master-slave model, where a Hadoop cluster is composed of a NameNode as the master and many DataNodes as slaves. The file system namespace, its tree structure, and other metadata are managed by the NameNode, while the file data are stored on the DataNodes. The DataNodes report back to the NameNode in order to maintain data consistency, providing a single directory system and file namespace [20][17]. The architecture of HDFS is shown in Figure 2.

Figure 2. HDFS architecture.

2.4.2.2 MapReduce
MapReduce is a programming paradigm, or software programming model, that has been used by Google to process large data sets in a distributed fashion on large clusters of hardware in a reliable, fault-tolerant way. In MapReduce, a task is separated into small parts that are distributed to a large number of nodes for processing (map), and the final answer (reduce) is based on the summarized results. Hadoop uses MapReduce for data processing purposes; the various processing functions are written in the form of Hadoop jobs [18]. Figure 3 demonstrates the MapReduce processing flow of data. The MapReduce library supports map operations that can be executed independently [21].

Figure 3. The flow of data in MapReduce processing [21].
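To make the map/shuffle/reduce flow described above concrete, the following self-contained sketch simulates a simple job (counting requests per IP in access-log records) in plain Python on a single machine. On a real cluster the same mapper and reducer logic would run as a Hadoop job, for example via Hadoop Streaming; the sample records are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit (ip, 1) for each access-log record."""
    fields = line.split()
    if fields:
        yield fields[0], 1

def reducer(key, values):
    """Reduce phase: sum the counts emitted for one key."""
    return key, sum(values)

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce on a single machine."""
    mapped = [kv for line in lines for kv in mapper(line)]
    mapped.sort(key=itemgetter(0))                      # shuffle/sort by key
    return [reducer(k, (v for _, v in group))
            for k, group in groupby(mapped, key=itemgetter(0))]

if __name__ == "__main__":
    sample = ['1.2.3.4 - - [..] "GET /a HTTP/1.1" 200 10',
              '1.2.3.4 - - [..] "GET /b HTTP/1.1" 404 10',
              '5.6.7.8 - - [..] "GET /a HTTP/1.1" 200 10']
    print(run_job(sample))   # [('1.2.3.4', 2), ('5.6.7.8', 1)]
```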

2.5 Related Works Using Hadoop
The number of studies proposing the use of the Hadoop framework to solve Big Data issues increases each year. Many of them have been exploring and recommending new log analysis approaches using Hadoop in distinct domains. This section discusses related work by other researchers that focuses on log analysis in a Big Data environment using the Hadoop framework.

Research in [22] proposed the use of the Hadoop framework with MapReduce for log analysis aimed at system threat and problem identification. The MapReduce algorithm consists of Map phases and Reduce phases: the log file to be processed is the input to the Map phase, the output of each map phase is assigned to particular keys, and the Reduce function then produces the final result or log report. The proposed system provides an efficient way of collecting and correlating logs in order to identify system threats and problems, and the researchers found that it showed a significant improvement in response time, achieved through the use of MapReduce.

In [20], the researchers designed and implemented an enterprise web log analysis system based on the Hadoop architecture, with HDFS and MapReduce as well as the Pig Latin language, as illustrated in Figure 4. The main purpose of their system was to help system administrators quickly capture and analyze the potential value hidden in massive data, thus providing an important basis for business decisions. The research showed that the structure of the MapReduce program was an effective solution for very large web log files in the Hadoop environment. Besides that, the log requirements were easy to analyze using the Pig programming language, which also gave better performance. The system succeeded in providing AP server traffic statistics that helped system administrators identify potential problems and predict future trends.

Figure 4. Weblog analysis system flowchart using Hadoop [20].

A scheme to overcome the problem of Big Data analysis using Apache Hadoop was proposed in [18]. The log processing involved four steps: creating a server configuration using Amazon web services, importing data from a database into Hadoop, performing jobs in Hadoop, and exporting the data back to the database. In step two, data was stored in MongoDB, a NoSQL database. Then MapReduce was used to perform six Hadoop jobs; a Hadoop job consists of the mapper and reducer functions. The output of the data processing in the Hadoop jobs had to be exported back to the database, and the old values in the database had to be updated immediately in order to prevent loss of valuable data. The researchers concluded that the application is able to perform operations on Big Data in optimal time and produce output with minimum utilization of resources.

In another work, a distributed K-Means clustering algorithm based on the Mahout/Hadoop MapReduce model was used to analyze a high volume of log files [23]. The findings showed that the performance was better than that of a standalone log analyzer, as the approach was capable of supporting huge log sizes. Their method proved able to extract new knowledge from millions of log entries, which could not be obtained without the scalability of Hadoop and the proposed analysis.

Another approach combining Hadoop and the MapReduce paradigm was applied to NASA's web log file of 77 MB containing 445,454 records [24]. The researchers conducted an experiment comparing MySQL (an RDBMS) with Hadoop for analyzing users' web activity. The results showed that the proposed log analyzer improved response time: the time required for the Extract, Transform, and Load (ETL) process and analysis using Hadoop was approximately 20 times less than with MySQL.

Hadoop was also proposed as a solution for analyzing flow logs of network traffic traces in China. The solution was proposed to overcome the expanding size of the flow logs, which had increased to 870 GB per day for a single city [25]. The system used HDFS for log storage and the MapReduce framework for the analysis jobs, and the authors also created their own script language called Log-QL. The results of the experiment showed that the new system enabled them to analyze terabytes of data, compared to the existing centralized system that could only process up to 10 GB. However, the system only performed better as the size of the data grew.

The Hadoop MapReduce programming model has also been applied to analyzing web log files in a cloud computing environment in order to retrieve the hit count of a specific web application. The experiment used HDFS to store the web log file, and the MapReduce programming model was used to write the analysis application. The log files used contained 100,000 records, with each record having different fields such as URL, date, hit, age, and others [26]. The applied model of HDFS and MapReduce produced analysis results in minimal response time, and the performance tests against the number of records and the number of nodes in the cluster showed that the performance of the system increased with the number of nodes.

Table 1. Summary of related research on log analysis using Hadoop

Research | Hadoop component | Other tools or algorithm | Type of log analysis | Results of research
[18] | MapReduce, HDFS | MongoDB | Amazon log file | Optimal operation time
[20] | HDFS, MapReduce | Pig language | Weblog analysis | Better performance; able to predict trends
[22] | MapReduce | (none) | System logs, to identify system threats | Improved response time
[23] | MapReduce | K-Means, Mahout | Apache web server log | Supports huge file sizes
[24] | MapReduce | MySQL (RDBMS) | NASA's web log files | Improved response time
[25] | HDFS, MapReduce | Log-QL (script) | Flow logs of China network traffic | Able to analyze larger datasets (TBs); performs better as the data size grows
[26] | HDFS, MapReduce | (none) | Web log files in cloud computing | Minimal response time; performs better as the number of nodes increases

Table 1 summarizes the related research on log analysis using the Hadoop framework. The proposed frameworks, involving various types of log analysis, suggest that Hadoop is able to cater for and handle a variety of data. Mainly, the research applied MapReduce as the main Hadoop component for analyzing the log files and HDFS as the data storage, and the researchers also used other tools and algorithms together with the Hadoop framework to fulfil their analysis purposes. The results of these studies showed that implementing the Hadoop framework made it possible to minimize the response time of the analysis process as well as to analyze larger datasets (TBs). Based on the related research, the Hadoop framework is capable of analyzing log files, as it can perform in sufficient time and handle large volumes of data. Furthermore, Hadoop is stable and widely used in this research area [27].

2.6 Data Preprocessing
When the log data has been collected, it has to go through a preprocessing stage before proceeding to the log analysis stage. This is supported by the data process flow in Figure 5, which shows that the Extract, Transform, and Load (ETL) process is needed in a log analysis flow. Data preprocessing is the part of ETL that transforms data into the desired format.

Figure 5. Data process flow [28].

It is crucial for the data to undergo a preprocessing operation in order to deal with the various imperfections in raw collected data, as it may contain noise such as errors, redundancies, outliers, and other ambiguous or missing values. The main operations in data preprocessing are:
1. Data cleaning/data refinement: handle missing values and noise as well as data inconsistency.
2. Data integration: a process of integrating duplicated data.
3. Data transformation: the collected data is converted to the format of the destination data system [29][30].

Not all log records are useful or necessary. Therefore, before web log data analysis is done, the data cleaning phase needs to be carried out. The data cleaning process involves removing [26]:
1. Records that have missing values, for example when the execution process is suddenly terminated and the log record is not completely written.
2. Illegal records that have exception status numbers, for example 400 or 404, caused by HTTP client errors, bad requests, or requests not found.
3. Irrelevant records that have no significant URLs. Some files are generated automatically when a web page is requested, for example files with .txt, .jpg, .gif, or .js extensions.

In log analysis, the purpose of log preprocessing is to improve the log quality and to increase the accuracy of the results. The preprocessing phase helps to filter and organize only the appropriate information; it is applied before the MapReduce algorithm so that noise does not affect the analysis result.

3. Design and Implementation
This research proposes a simulation model of web log file analysis for the purpose of web intrusion investigation. The simulation implementation involved two main processes, log preprocessing and log analysis, and experimented with samples of real log data captured from previous incidents. Log preprocessing cleans the sample log files, while log analysis produces the analysis results for the simulation. The implementation of the simulation is illustrated in Figure 6 below.

Figure 6. Research simulation workflow.

3.1 Experimental Setup
The experiment used only a single machine to perform the simulation for both data preprocessing and log analysis, as shown in Table 2. This machine ran the data preprocessing Python code as well as the Hortonworks Data Platform (HDP) when analyzing the log files.

Table 2. Hardware and software requirements

Hardware | Workstation (Intel® Core™ i7-5500U processor) with 12 GB RAM
Software | Windows 8.1 64-bit OS
         | Log preprocessing: Python 3.5; pygeoip package
         | Log analysis: Hortonworks Sandbox with HDP 2.3.2; Oracle VM VirtualBox 5.0.10

3.2 Data Collection
For this research experiment, several web log files were collected from a university's web server. The web server hosted the university's main web portal as well as over 800 domains related to the teaching and learning business. The server used the Apache web server to host the websites; therefore, the type of log used in this experiment was the Apache access log. In order to minimize storage, the logs of all the involved domains were written centrally to one particular Apache log. In this research, several samples of the Apache access log files from previous incidents were used. The web server was configured to write the Apache access log entries in the Common Log Format (CLF). An example of a single record of the Apache access log file is illustrated in Figure 7.

157.56.92.165 - - [20/Sep/2013:04:12:44 +0800] "GET /xxx/kelab_usahawan.html HTTP/1.1" 200 21891

Figure 7. Example of an Apache access log record.

3.3 Log Preprocessing
The log-preprocessing phase involved two main processes, data cleaning and data transformation. The large size of the log files required the logs to be preprocessed in order to discard any unnecessary information and to transform the logs into the desired format. Preprocessing the log files produces cleaned log files that make the subsequent analysis process much easier. Figure 8 illustrates the proposed data preprocessing flow of the research: the Apache access log samples (Sample A, Sample B, and Sample C) go through data cleaning and data transformation to produce cleaned log files in the desired format.

Figure 8. Proposed data preprocessing flow.

3.4 Data Cleaning
An algorithm was developed to clean the samples of the Apache access log file based on certain data cleaning rules. However, for web intrusion investigation, some information should not be removed because it could be important for tracking intruders' traces of activity; therefore, the list of removed items may differ from that used in web mining [10]. The algorithm considered several irrelevant details that needed to be removed from the log files, listed below:
(a) Robot requests: records generated by automatic agents, such as Googlebot, that act as crawlers.
(b) Methods other than GET and POST: only requests with GET and POST methods are needed for the web intrusion investigation; other request methods such as HEAD or OPTIONS can be removed.
(c) Unnecessary multimedia file requests: files that are generated automatically when a certain web page is requested, for example files with .txt, .jpg, .gif, or .js extensions.
(d) Blank URI requests: records with missing values, where the log record is not completely written; this can happen when the execution process is suddenly terminated.
(e) Unnecessary log records generated by the web server: log records generated by certain systems or the server itself may not be important for tracking hacker activities and can be ignored. For example, if the request field does not consist of a method, a URI, and an HTTP protocol, the record is considered not legitimate and must be removed.
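A minimal sketch of how cleaning rules (a)-(e) could be expressed in Python is shown below. It is illustrative only and is not the authors' preprocessing program; the exact rules, file extensions, and agent strings would have to follow the algorithm described in the paper.

```python
import re

STATIC_EXT = (".txt", ".jpg", ".gif", ".js")                   # (c) unnecessary multimedia/static files
REQUEST_RE = re.compile(r'"(GET|POST) (\S+) (HTTP/\d\.\d)"')   # method, URI, protocol must all be present

def keep(line):
    """Return True if the record should be kept for intrusion analysis."""
    if "googlebot" in line.lower() or "robots.txt" in line:    # (a) robot/crawler requests
        return False
    m = REQUEST_RE.search(line)
    if m is None:                                              # (b)/(d)/(e) non-GET/POST, blank or malformed request
        return False
    uri = m.group(2)
    if uri.lower().endswith(STATIC_EXT):                       # (c) static/multimedia files
        return False
    return True

def clean(in_path, out_path):
    """Write only the records that pass the cleaning rules, e.g. clean('access.log', 'access_cleaned.log')."""
    with open(in_path, encoding="utf-8", errors="replace") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(line for line in src if keep(line))
```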

3.5 Data Transformation
The log files need to be transformed into a specific format before the analysis process can be done; the purpose of transforming the log file is to provide easier and faster log analysis. In this research, the request field in the log files was split into three new fields: the method, the URI, and the HTTP protocol. The other fields, namely the IP address, date and time, status code, and response size, remained in the same format. In addition, a new field was added to the output log file: the country code. This extra field logs the origin country of the IP address for each request record; this detail is usually important for web intrusion investigation, where the origin country may indicate the legitimacy of an attack. In order to generate the country code, a Python API called pygeoip was used. The package reads an IP address and maps it to its origin country based on the GeoLiteCity database.
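The transformation step might look roughly like the sketch below. It is an illustrative reconstruction, not the authors' code: the tab-separated output layout, the GeoLiteCity.dat path, and the dropped timestamp field are assumptions made here for brevity, and it assumes a pygeoip lookup against a locally available copy of the GeoLiteCity database.

```python
import pygeoip   # legacy GeoIP reader used with the GeoLiteCity database

# Path and output layout are illustrative assumptions, not from the paper.
geo = pygeoip.GeoIP("GeoLiteCity.dat")

def transform(line):
    """Split the request field into method/URI/protocol and append a country code.

    Assumes an already-cleaned Common Log Format record; timestamp handling is omitted here.
    """
    head, request, tail = line.split('"')[:3]
    ip = head.split()[0]
    method, uri, protocol = (request.split() + ["-", "-", "-"])[:3]
    status, size = tail.split()[:2]
    record = geo.record_by_addr(ip) or {}
    country = record.get("country_code") or "??"
    return "\t".join([ip, method, uri, protocol, status, size, country])
```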

3.6 Analyzing Logs Using the Hadoop Framework
In this research experiment, the latest version of the Hortonworks Data Platform (HDP), 2.3.2, was used to analyze the cleaned and preprocessed log files. HDP provides an enterprise-ready data platform that runs Apache Hadoop, an open source framework for distributed storage and processing of large data sets. However, in order to run the experiment on a single node, the Hortonworks Sandbox was used, because it enables developers to use HDP anywhere without a data center, a cloud service, or even an Internet connection, and it gives the developer full control of the environment. Figure 9 presents the HDP process flow applied in the research experiment: the preprocessed log file is uploaded to HDFS (using Hue and the File Browser), analyzed with Ambari and Hive, and the analysis results are then visualized with Hive.

Figure 9. Hortonworks Data Platform (HDP) process flow.

The Hortonworks Sandbox provides a web interface for the user to easily access and use the available Hadoop applications. For this experiment, Hue was used for uploading files and creating tables, while Ambari was used for creating Hive queries and monitoring their execution.

3.6.1 Uploading the Data to HDFS
In this experiment, the Hadoop Distributed File System (HDFS) was used because it is the primary storage for Hadoop applications. Therefore, all sample log files first had to be uploaded to HDFS using the File Browser application. Even though the sample log files were in text format, they could be uploaded directly, because HDFS accepts various types of files, which minimizes the experiment processing time.

3.6.2 Analyzing Data with Hive
The sample log files were in a semi-structured format, so Hive was suitable for exploring and analyzing them. Hive uses an SQL-like language called HiveQL to query, summarize, and analyze data. By using Hive, data analysts do not have to write any complex MapReduce programs, because Hive queries are converted into MapReduce programs in the background by the Hive compiler. In order to easily manage and monitor the execution of the created queries, Apache Ambari was used, because it provides a simplified web-based user interface. The HiveQL was created using the Query Editor, where each query was written on a worksheet, as shown in Figure 10.

Figure 10. Query Editor of Hive.

4. Experimental Results
4.1 Preprocessed Log Data
The result of executing the data preprocessing program in the experiment showed that the data cleaning algorithm used in the simulation successfully reduced unnecessary or unwanted log records. All three samples of the log file were reduced by more than 50% from the original data, and both the size and the number of records showed a significant reduction compared to the original log files. Therefore, the experiment showed that the proposed algorithm turned the log files into more efficient log data for the further analysis process. The percentage reduction equations are as follows:

(i) Percentage of reduction (by number of web log records)
= (total number of web log records removed / total original number of web log records) × 100%

(ii) Percentage of reduction (by size of web log file)
= ((original file size − file size after reduction) / original file size) × 100%

Table 3. Data cleaning process results

Data cleaning process (records removed) | Sample Log A | Sample Log B | Sample Log C
Log records with robots.txt | 2143 | 659 | 1879
Log records with methods other than GET or POST | 32863 | 26131 | 30107
Unnecessary multimedia file requests | 776106 | 672470 | 2960196
Log records with blank URL requests | 36469 | 9069 | 20887
Unnecessary log records generated by the web server | 29 | 71 | 70
Code execution results | | |
Original records | 1342887 (156 MB) | 1095490 (125 MB) | 4045347 (1 GB)
Final preprocessed records | 494953 (55.8 MB) | 386430 (42 MB) | 1032121 (111 MB)
Reduction percentage (by number of records) | 63.14% | 70.90% | 74.50%
Reduction percentage (by size of file) | 64.10% | 66.40% | 88.90%
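As a quick consistency check of equations (i) and (ii) against Table 3, Sample Log A gives

\[
\frac{1342887 - 494953}{1342887} \times 100\% \approx 63.14\%,
\qquad
\frac{156\ \mathrm{MB} - 55.8\ \mathrm{MB}}{156\ \mathrm{MB}} \times 100\% \approx 64.2\%,
\]

which matches the reported 63.14% and, approximately, the reported 64.10% reduction figures.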

4.2 Producing Analysis Results
The main purpose of this research was to analyze the log files for web intrusion investigation. Therefore, this experiment used several HiveQL query statements to produce several outputs based on the sample log files. The focus of the intrusion case in this experiment was web defacement, or web tampering, which commonly targets government agencies' websites. The target of this experiment was therefore to identify the attackers and the traces of their activities.

Within the experiment, sample query statements were written to gather the required information, such as:

(a) General information about the web activities:
- The most active IP addresses accessing the web pages.
- The most requested URIs.
- The countries from which the web pages were accessed most.
- The number of requests for each HTTP status code.

(b) Traces of anomalous activities:
Based on the suspicious list of IPs from the results of (a), we then identify their trace activities, such as the list of URIs they visited, any login attempts they made, or any files they successfully uploaded. These traces may help in collecting evidence of any intrusion or web defacement activities.
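The general statistics in (a) are simple one-pass aggregations. The sketch below mirrors them in plain Python over the tab-separated transformed records assumed in the earlier sketch; in the experiment itself these were HiveQL queries run through Hue and Ambari, not Python.

```python
from collections import Counter

def summarize(cleaned_lines, top_n=10):
    """Aggregate the kinds of 'general information' listed in (a): top IPs, URIs, countries, status codes."""
    ips, uris, statuses, countries = Counter(), Counter(), Counter(), Counter()
    for line in cleaned_lines:
        # Field layout follows the illustrative tab-separated format used above.
        ip, method, uri, protocol, status, size, country = line.rstrip("\n").split("\t")
        ips[ip] += 1
        uris[uri] += 1
        statuses[status] += 1
        countries[country] += 1
    return {"top_ips": ips.most_common(top_n),
            "top_uris": uris.most_common(top_n),
            "top_countries": countries.most_common(top_n),
            "status_counts": dict(statuses)}
```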

4.3 Performance Analysis
The experiment was also carried out with samples of different sizes in order to test and evaluate Hadoop's performance when handling different sizes of log data. Table 4 shows the total execution time for three samples of log files with different sizes; each sample was used to execute the same set of queries.

Table 4. Total execution time (seconds)

Query | Sample A (55.8 MB) | Sample D (401 MB) | Sample E (825 MB)
Query 1 | 353 | 731 | 157
Query 2 | 20 | 17 | 16
Query 3 | 28 | 77 | 76
Query 4 | 25 | 58 | 60
Query 5 | 28 | 60 | 52
Query 6 | 23 | 54 | 58
Query 7 | 34 | 96 | 96
Query 8 | 43 | 64 | 74
Query 9 | 43 | 69 | 81
Query 10 | 46 | 66 | 74

The graph in Figure 11 indicates that the shortest execution times came from the smallest sample. However, the total execution times for samples D and E were almost similar, even though sample E was double the size of sample D.

Figure 11. Graph of query execution time.

4.4 Data Visualization
In order to produce an understandable analysis result, several tools were used to transform the analysis results into a visual form for an optimal view.

4.4.1 Using Power View in Excel
Figure 12 shows a global view of the countries of origin of the websites' users, based on their total numbers. The countries viewed in the map are the top 10 obtained from the log records. This visual result enables the investigator to judge whether the activities are normal, for instance when too many requests have been received from a certain country. For example, if the websites are mainly for local users, there should not be a high number of requests from other countries.

Figure 12. Global view of the origin countries of the website users.

4.4.2 Hive Visualization with Ambari
Figure 13 illustrates a bar chart produced by Hive with Ambari using its visualization tools; the results of the HiveQL execution automatically produce a visualization. The chart in Figure 13 indicates that the IP address xx.xx.77.76 was the most active user detected in the log file records. Usually the investigator would be alerted by this spike and would initiate further investigation of this IP address.

Figure 13. Bar chart of the number of requests by IP address.

5. Conclusion
In conclusion, this study indicates that a Hadoop application is able to produce analysis results from web log files in order to assist web intrusion investigation. The analysis also shows that even when the size of the log is doubled, there is no significant increase in execution time. This finding is consistent with past studies showing that Hadoop is capable of improving the performance of log analysis and of handling larger log data. In addition, the research has presented visualizations of the log data using Power View and Hive for easier understanding.

6. Future Work
The results presented here may facilitate improvements in the study of analyzing web log files using Hadoop in the field of web intrusion investigation. For future work, the research can be repeated using a multiple-node cluster in order to obtain more accurate and reliable results. Besides that, larger data sizes should be introduced to measure the capability of the Hadoop application; this can be done by mirroring the log data to another workstation or a server. Furthermore, the log preprocessing process could be integrated and run on the same Hadoop platform, so that the log data only has to be transferred once.

7. ACKNOWLEDGMENTS
This research is a part of the Research Academic Grant (Vot Q.K1300000.2538.11H12). The authors would like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia (UTM) for the support in R&D.

8. REFERENCES
[1] L. K. J. Grace, V. Maheswari, and D. Nagamalai, "Web Log Data Analysis and Mining," in Advanced Computing, N. Meghanathan, B. K. Kaushik, and D. Nagamalai, Eds. Springer Berlin Heidelberg, 2011, pp. 459-469.
[2] A. Oliner, A. Ganapathi, and W. Xu, "Advances and Challenges in Log Analysis," Communications of the ACM, vol. 55, no. 2, pp. 55-61, Feb. 2012.
[3] A. A. Cardenas, P. K. Manadhata, and S. P. Rajan, "Big Data Analytics for Security," IEEE Security & Privacy, vol. 11, no. 6, pp. 74-76, Nov. 2013.
[4] S. Sharma and V. Mangat, "Technology and Trends to Handle Big Data: Survey," in 2015 Fifth International Conference on Advanced Computing & Communication Technologies (ACCT), 2015, pp. 266-271.
[5] B. Kotiyal, A. Kumar, B. Pant, and R. H. Goudar, "Big Data: Mining of log file through Hadoop," in 2013 International Conference on Human Computer Interactions (ICHCI), 2013, pp. 1-7.
[6] T. Mantoro, N. Binti Abdul Aziz, N. D. Binti Meor Yusoff, and N. A. Binti Abu Talib, "Log Visualization of Intrusion and Prevention Reverse Proxy Server against Web Attacks," in 2013 International Conference on Informatics and Creative Multimedia (ICICM), 2013, pp. 325-329.
[7] H. Kapodistria, S. Mitropoulos, and C. Douligeris, "An advanced web attack detection and prevention tool," Information Management & Computer Security, vol. 19, no. 5, pp. 280-299, Nov. 2011.
[8] "OWASP Top 10 - 2013: The Ten Most Critical Web Application Security Risks," 11-Jun-2013. [Online]. Available: https://www.owasp.org/index.php/Top10#OWASP_Top_10_for_2013. [Accessed: 07-May-2015].
[9] M. Wei, Y. Liu, G. Xu, and Y. Cui, "A study on an intrusion detection technique protecting Web server," 2009, pp. 268-271.
[10] S. E. Salama, M. I. Marie, L. M. El-Fangary, and Y. K. Helmy, "Web Server Logs Preprocessing for Web Intrusion Detection," Computer and Information Science, vol. 4, no. 4, pp. 123-133, Jul. 2011.
[11] P. Dange and S. Dr Deven, "Web Log Analysis for Security Compliance Using Big Data," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 5, no. 3, p. 23, Mar. 2014.
[12] M. Chen, S. Mao, and Y. Liu, "Big Data: A Survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, Apr. 2014.
[13] S. Sagiroglu and D. Sinanc, "Big data: A review," in 2013 International Conference on Collaboration Technologies and Systems (CTS), 2013, pp. 42-47.
[14] P. Saporito, "The 5 V's of Big Data," Best's Review, no. 7, p. 38, Nov. 2013.
[15] S. Madden, "From Databases to Big Data," IEEE Internet Computing, vol. 16, no. 3, pp. 4-6, May 2012.
[16] R. Lu, H. Zhu, X. Liu, J. K. Liu, and J. Shao, "Toward efficient and privacy-preserving computing in big data era," IEEE Network, vol. 28, no. 4, pp. 46-50, Jul. 2014.
[17] K. Singh and R. Kaur, "Hadoop: Addressing challenges of Big Data," in 2014 IEEE International Advance Computing Conference (IACC), 2014, pp. 686-689.
[18] J. Nandimath, E. Banerjee, A. Patil, P. Kakade, S. Vaidya, and D. Chaturvedi, "Big data analysis using Apache Hadoop," in 2013 IEEE 14th International Conference on Information Reuse and Integration (IRI), 2013, pp. 700-703.
[19] M. Mohandas and P. M. Dhanya, "An approach for log analysis based failure monitoring in Hadoop cluster," in 2013 International Conference on Green Computing, Communication and Conservation of Energy (ICGCE), 2013, pp. 861-867.
[20] C. H. Wang, C. TsorngTsai, C. C. Fan, and S. M. Yuan, "A Hadoop Based Weblog Analysis System," in 2014 7th International Conference on Ubi-Media Computing and Workshops (UMEDIA), 2014, pp. 72-77.
[21] A. K P, D. K. C. Gouda, and D. N. H R, "A Study for Handelling of High-Performance Climate Data Using Hadoop," International Journal of Innovative Technology and Research, vol. 0, no. 0, pp. 197-202, Apr. 2015.
[22] S. S. Vernekar and A. Buchade, "MapReduce based log file analysis for system threats and problem identification," in 2013 IEEE 3rd International Advance Computing Conference (IACC), 2013, pp. 831-835.
[23] J. Therdphapiyanak and K. Piromsopa, "Applying Hadoop for Log Analysis Toward Distributed IDS," in Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, New York, NY, USA, 2013, pp. 3:1-3:6.
[24] H. Hingave and R. Ingle, "An approach for MapReduce based log analysis using Hadoop," in 2015 2nd International Conference on Electronics and Communication Systems (ICECS), 2015, pp. 1264-1268.
[25] J. Yang, Y. Zhang, S. Zhang, and D. He, "Mass flow logs analysis system based on Hadoop," in 2013 5th IEEE International Conference on Broadband Network & Multimedia Technology (IC-BNMT), 2013, pp. 115-118.
[26] S. Narkhede, T. Baraskar, and D. Mukhopadhyay, "Analyzing web application log files to find hit count through the utilization of Hadoop MapReduce in cloud computing environment," in 2014 Conference on IT in Business, Industry and Government (CSIBIG), 2014, pp. 1-7.
[27] I. Polato, R. Ré, A. Goldman, and F. Kon, "A comprehensive view of Hadoop research—A systematic literature review," Journal of Network and Computer Applications, vol. 46, pp. 1-25, Nov. 2014.
[28] X. Lin, P. Wang, and B. Wu, "Log analysis in cloud computing environment with Hadoop and Spark," in 2013 5th IEEE International Conference on Broadband Network & Multimedia Technology (IC-BNMT), 2013, pp. 273-276.
[29] H. Mousannif, H. Sabah, Y. Douiji, and Y. O. Sayad, "From Big Data to Big Projects: A Step-by-Step Roadmap," in 2014 International Conference on Future Internet of Things and Cloud (FiCloud), 2014, pp. 373-378.
[30] Y.-H. Kim and E.-N. Huh, "A rule-based data grouping method for personalized log analysis system in big data computing," in 2014 Fourth International Conference on Innovative Computing Technology (INTECH), 2014, pp. 109-114.

Privacy-Preserving Personal Health Record (P3HR): A Secure Android Application

Saeed Samet, Mohd Tazim Ishraque, and Anupam Sharma
School of Computer Science, Faculty of Science, University of Windsor, Windsor, ON, Canada
[email protected], [email protected], [email protected]

ABSTRACT
In contrast to Electronic Medical Record (EMR) and Electronic Health Record (EHR) systems, which are created to maintain and manage patient data by health professionals and health organizations, Personal Health Record (PHR) systems are operated and managed by patients. This necessitates increased attention to security and privacy challenges, as patients are most often unfamiliar with the potential security threats that can result from the release of their health data. On the other hand, the use of PHR systems is increasingly becoming an important part of the healthcare system by sharing patient information among the circle of care. To have a system with a favorable interface and a high level of security, it is crucial to provide a mobile application for PHR that fulfills six important features: (1) ease of use for various patient demographics and their delegates, (2) security, (3) quick transfer of patient data to their health professionals, (4) the ability for the patient to revoke access, (5) ease of interaction between patients and their circle of care, and (6) informing patients about any instances of access to their data by their circle of care. In this work, we propose an implementation of a Privacy-Preserving PHR system (P3HR) for Android devices that fulfills the above six characteristics, using Ciphertext Policy Attribute-Based Encryption to enhance the security and privacy of the system, as well as providing access revocation in a hierarchical scheme of the health professionals and organizations involved. Using this application, patients can securely store their health data, share the records, and receive feedback and recommendations from their circle of care.

CCS Concepts
• Information systems→Mobile information processing systems.

Keywords
Personal Health Records; Patient-centric Data Privacy; Attribute-Based Encryption.

1. INTRODUCTION
As with many industries and businesses, technological advances and their capabilities are nowadays widely utilized in every health sector. EHR and EMR systems are just two popular examples that are broadly used in clinics, hospitals, and health organizations by health professionals and policy-makers to speed up health services, enable consistency and efficiency, and allow them to make important prognostic and diagnostic predictions and decisions, in order to improve patients' quality of life. However, these data-intensive systems are controlled and operated by the health professionals; patients usually have no access to them, and almost always have no control over who accesses their sensitive health information, or how, when, and to what extent. They are not able to revoke someone's access to their data under most circumstances. All these limitations have the potential to cause many issues within the healthcare system and can adversely affect the patient's quality of life in various ways.

The Strategy for Patient-Oriented Research (SPOR) has increased the importance of the use of PHRs by patients, to engage them as a crucial partner in the healthcare system and consequently improve their quality of life by promoting an easy and cost-effective way of interaction between them and their circle of care. However, this introduces a set of new challenges into existing healthcare systems in terms of storage, security, and control over patients' sensitive health information. More importantly, because patients often lack the training and technological skills needed to work with sophisticated and complex computer systems, and are unaware of the various privacy acts and policies, it is very important to provide them with a secure and user-friendly system that gives them ease of use anytime, anywhere. The nature of these requirements led us to implement a PHR application for mobile devices, while taking into consideration and ensuring the protection of patients' privacy.

In this paper, we extend and propose an implementation of the PHR system proposed in [1], which is privacy-preserving, gives patients fine-grained access control over their health data, and creates an environment for proactive interaction between the patient and their circle of care. This not only improves the communication time and the frequency of interaction between them, it also significantly brings the costs down by reducing the amount of unnecessary physical visits and other related overheads. The implementation of this PHR system is done as a mobile application that will be installed and used on Android devices by patients and health professionals. This application will empower patients to have full control over their health records, with built-in security, transparency, and ease of accessibility. The system will enable patients to share and offer a consistent view of their records to their healthcare providers in order to receive better and more effective care on time. The rest of this paper is organized as follows: in Section II we present some preliminaries and the related work. The details of the proposed system, risk management, and analysis are explained in Section III, followed by the implementation and experimental results in Section IV. The paper is concluded in Section V with some possible future work and conclusions.

implementation and experimental results in Section IV. The paper will be concluded in Section V by stating some possible future work and a conclusion.

2. PRELIMINARIES AND RELATED WORK

2.1 Preliminaries

Personal Health Record (PHR): According to the definition provided by the Office of the United States National Coordinator for Health Information Technology [2], "a personal health record is an electronic application used by patients to maintain and manage their health information in a private, secure, and confidential environment." By their definition, a PHR system has the following characteristics: (1) it is managed by patients, (2) it can contain information from different sources, (3) it should help patients securely store and monitor their health information, (4) it must be separate from the health records stored, maintained, and operated by the healthcare provider, and (5) it is separate from portals that enable patients to communicate with their healthcare providers. However, in this work we modify the last feature by giving patients the capability to receive alerts and recommendations from their circle of care, as well as to be notified of who accessed their health data and when. We also emphasize the importance of giving patients the capability to revoke access to their health records by any authorized person when needed.

Attribute-Based Encryption: Standard cryptosystems of both symmetric and asymmetric types, such as the Advanced Encryption Standard (AES) [3] and RSA [4], can be used in various applications. However, in a given PHR system, data should be encrypted such that they can be selectively decrypted and shared by a group of users, and at any given time data owners must be able to revoke access to them for a particular set of authorized users. These standard cryptosystems do not deliver such fine-grained access control for a privacy-preserving PHR system, because of their low scalability for systems with a large user base, which stems from their one-to-one encryption characteristics, and because of their lack of key management. To overcome these shortcomings, an attribute-based encryption system that follows one-to-many encryption is more appropriate. Attribute-based encryption is a public-key encryption scheme in which both the decryption key of a user and the ciphertext depend on the values of some attributes. A particular key can decrypt a specific ciphertext only if the attributes of the ciphertext match the decryption key. This offers a fine-grained access control tool in terms of attributes and access policies [5].

Ciphertext Policy Attribute-Based Encryption (CP-ABE): In this specific version of attribute-based encryption, a user's decryption key is related to a set of attributes, and the encrypted message is related to an access policy over those attributes [5]. This means that a user is able to decipher the encrypted message if and only if the attribute set of their decryption key satisfies the access policy stated in the ciphertext.

Mediated CP-ABE (mCP-ABE): An extended version of CP-ABE that makes instant access revocation possible [6]; it contains five major methods: Setup, Key Generation, Encryption, Decryption, and m-Decryption.

Access policy: A tree structure over the attributes, expressing as boolean values which attributes are required for a given access right [5]. Using this access policy, a given user is able to decrypt an encrypted content if the access policy associated with that content satisfies the attribute set of their secret key.

2.2 Related Work

A patient-centric access control (PEACE) method has been proposed for personal health record systems by Barua et al. [7]; it preserves the privacy of patients' health information by providing various access rights to users and associating different attribute sets with them. It utilizes identity-based encryption for security purposes. Chase et al. proposed a Patient Controlled Encryption (PCE) method [8], by which a patient can partially share their access rights with someone else, and also search over their encrypted health information, using both symmetric and asymmetric cryptosystems. However, existing schemes have several limitations, such as a lack of fine-grained access control and efficient revocation mechanisms.

A PHR system has been proposed by Debnath et al. [1] using mCP-ABE to give fine-grained access control to the patients: their data is securely stored, and they can grant and revoke access to their health data for health providers. We adopt this system and extend it to provide other features, including interaction between the patients and their healthcare providers and informing patients about their circle of care's activities on their health records, and we implemented it as a mobile Android PHR application.

In [9], the authors presented a PHR system called LifeSensor, which allows patients to store health information, medical history, and images, and to authorize selected healthcare team members to view, add, or update their data. The authors also presented the HealthSpace application, a secure online personal health organizer available to patients in the United Kingdom, which allows patients to access a centralized view of their health records and to input their needs and preferences, as well as offering alerting services [9].

In [10], the authors presented that patient engagement is the key to PHR system adoption, and that secure messaging between the patient and the Patient Aligned Care Team (PACT) members led to the greatest adoption.

In [11], the authors presented several types of PHRs, including standalone PHRs such as a smartcard PHR used in Germany, consolidator PHRs such as Microsoft's HealthVault and Google's Google Health, and integrated PHRs which compile patient data from several sources, such as electronic medical records (EMR), and present it in a single unified view. The authors found that PHR use for chronic illnesses resulted in positive findings in a majority of the cases (109 of 112 articles relating to PHR system use during a literature review) [11].

In [12], the authors presented that preferences for PHR functionalities varied based on individual patients' illnesses, suggesting that system customizability may be preferred rather than a one-size-fits-all approach. They proposed that, unlike an EMR which just monitors health status, PHRs have the potential to monitor and promote lifestyle changes and educate the patient through empowerment and self-management [12].
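The CP-ABE and mCP-ABE constructions above are defined in [5] and [6], and their cryptographic details are outside the scope of this summary. Purely to illustrate the access-policy idea — a boolean tree over attributes that a key's attribute set must satisfy before decryption can succeed — the following minimal Python sketch evaluates such a policy tree. The policy encoding, attribute names, and function name are illustrative assumptions, not part of the paper's implementation (the actual client is a Java/Android application).

```python
# Illustrative only: evaluates a boolean access-policy tree against a user's
# attribute set, mirroring the CP-ABE notion that decryption succeeds iff the
# key's attributes satisfy the ciphertext's policy. Not a cryptographic scheme.

def satisfies(policy, attributes):
    """policy is either an attribute string (leaf) or a tuple
    ("AND" | "OR", [subpolicies]); attributes is a set of strings."""
    if isinstance(policy, str):          # leaf node: a single required attribute
        return policy in attributes
    op, children = policy
    results = (satisfies(child, attributes) for child in children)
    return all(results) if op == "AND" else any(results)

# Hypothetical policy: cardiologists in department A of region 1,
# or the patient's family doctor.
policy = ("OR", [
    ("AND", ["region:1", "department:A", "profession:cardiologist"]),
    "role:family-doctor",
])

print(satisfies(policy, {"region:1", "department:A", "profession:cardiologist"}))  # True
print(satisfies(policy, {"region:1", "profession:nurse"}))                         # False
```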

3. PROPOSED PHR SYSTEM, RISK MANAGEMENT AND ANALYSIS

As indicated in [1], there are six major components in this PHR system: (1) the Authorization Server, which handles registration and authorization; (2) the Trusted Authority, which handles the generation, storage, and transmission of the master and secret keys; (3) the Revocation Server, which handles the distribution of decryption tokens and retains the Attribute Revocation List; (4) the Patient (health data owner); (5) the Health Professional (data user), who can be a medical doctor, nurse, pharmacist, or another type of caregiver; and (6) the Storage Provider, which offers a storage facility for the encrypted health data.

Every person who installs the application on their Android mobile device should first register as a patient or data owner; they will then be able to enter or upload their health data into the system and indicate which of their healthcare providers can access that health information. The patient can revoke access from anyone in their circle of care, and also receives alerts and/or recommendations from them based on their health status. Furthermore, anytime a health professional accesses a patient's health data, the patient will be informed, which enhances their awareness of who is looking at their health information and improves transparency. Health professionals should also register with the system before being able to access their patients' health data. These data users are hierarchically structured, such that a healthcare provider belongs to a section of a health profession, in a health department, of a health organization in a specific region. By having such a hierarchical structure, we are able to extend our system in the future with features such as regional epidemic alerts, disseminating policy updates, etc.

The other four components take care of key generation, authorization, revocation, and data storage, as previously indicated. The major activities in this framework are as follows:

Setup: Specifying the attributes that describe the various health system hierarchies, generating the various public, master and secret keys and distributing them among the users, and mapping registered users to an authorized set of attributes.

Data Encryption and Storage: Any health data entered into the system by a patient is encrypted using an access policy specified by the patient and the encryption key. The encrypted data is then stored at the data server, which is cloud-based storage. In the future, the data owner can change the access policy corresponding to a specific piece of data and re-encrypt and store it, to change its accessibility for their healthcare providers.

Data Usage: Any authorized data user can request data from the storage provider and decrypt it using the secret key corresponding to the access policy, provided the attributes inside the access policy are not in the revocation list.

Sending Alerts/Recommendations: Based on the observation of a given patient's data and their health status, an authorized health professional in their circle of care can send appropriate messages to the patient.

Data Observation Log: Every patient can configure the system to notify them about any observation of their health information by any of their healthcare providers.

Access Revocation: After receiving a request from a patient, through the trusted authority, to revoke an attribute, the revocation server updates the attribute revocation list accordingly. From this point, the access of the specified data user is revoked.

As with any system, it is impossible to be devoid of issues and risks. Here we analyze and address potential risks and concerns surrounding our system, as well as strategies to manage and mitigate them. The application attempts to mitigate most of the common risks through mechanisms intentionally implemented for the authentication, authorization, and accounting of data access. Even with the most secure and complex encryption and authentication mechanisms, if the physical data or the user is compromised, then the application's intended goal will be fruitless.

A potential challenge to our application would be the lack of widespread adoption. If the application is not used by many patients, then the overall adoption will be low due to the resulting low adoption rate by the circle of care team. In [13], the authors assessed the usage of their Personal Health Information Management System (PHIMS) in a federally funded senior citizen housing facility. They found that the residents' ability to use their PHR system was limited by poor computer and internet skills, technophobia, low health literacy, and limited or declining physical/cognitive ability [13]. This is a major concern for all PHR systems, as a significant portion of the users will potentially be senior citizens.

The user must be provided with adequate training and documentation so that they can navigate through the system without difficulty. The user must also be made aware of the significance of unauthorized access to their health data and provided with training in order to counter social engineering, which is a well-known major source of digital security risk. We will provide users with an FAQ guide to address usability concerns and some common best-practice guidelines to be followed, not just for the use of our application but for technology use in general, to mitigate the risk of the user being taken advantage of by malicious third parties. We will take adequate measures to ensure that the system user interface is not cluttered and offers ease of use, so that it is easily accessible to senior citizens. Overall, the authors in [13] found that PHR use three or more times a week resulted in an improvement in the overall quality of health care received by the patients. Therefore, we will encourage users to use the application several times a week to maximize the potential benefit.

Another challenge to our system is inconsistency in the formatting of the data uploaded by the patient. For the uploaded health record to be useful to the care team, it needs to have a certain degree of detail, accuracy, and depth, which may not necessarily be possible for the patient to provide. In [14], the authors reported that the senior citizens using their PHR system required one hour to enter their health record data into the system with the help of a nurse, and updating the records took them 5-15 minutes with the help of a nurse [14]. From this observation, the key takeaway is that we must enable caregivers or family members to help the user manage their personal health records if they require it.

Other potential risks include loss of support for system dependencies (database, Android API version, etc.) and misuse of the health data by authorized circle of care members. We intend to stay up to date on system support and dependency policies to ensure our application is not using any deprecated components which may introduce additional risks and vulnerabilities.
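As a concrete illustration of how the Data Usage and Access Revocation activities described above could interact, the hedged sketch below keeps an attribute revocation list and refuses decryption when any attribute in a record's access policy has been revoked. The class and method names are hypothetical; in the actual system this check happens inside the mCP-ABE m-Decryption step performed by the revocation server [6].

```python
# Hypothetical control-flow sketch of the revocation check described in the
# Access Revocation and Data Usage activities; not the paper's actual code.

class RevocationServer:
    def __init__(self):
        self.revoked = set()            # the Attribute Revocation List

    def revoke(self, attribute):
        """Called (via the trusted authority) when a patient revokes an attribute."""
        self.revoked.add(attribute)

    def can_decrypt(self, policy_attributes):
        """A data user may decrypt only if no attribute in the record's
        access policy appears on the revocation list."""
        return not (set(policy_attributes) & self.revoked)

server = RevocationServer()
record_policy = {"region:1", "department:A", "profession:cardiologist"}

print(server.can_decrypt(record_policy))     # True: nothing revoked yet
server.revoke("profession:cardiologist")     # patient revokes cardiologist access
print(server.can_decrypt(record_policy))     # False: access now denied
```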

There is currently no mechanism in place to address the risk of authorized individuals taking screenshots or physically recording the patient data they were given access to, in the event that the data is misused in the future, even after access has been revoked. It can be debated what degree of legal policy enforcement and monitoring constitutes a perfect balance between security and expected privacy, as a certain degree of trust must be present between the patient and the individuals to whom they provide access to their health data. Therefore, the log records of data access should suffice for the time being as a precautionary act of due diligence, since it is expected that the patient's care team is acting according to the patient's best interests. Legal protection is in place in many jurisdictions, an example being the Personal Health Information Protection Act (PHIPA) in Ontario, Canada, which provides enforceable legal policies and best-practice guidelines for health data management, including e-health data [15].

As with any networked system, another major risk that requires consideration is internet security and physical theft of the mobile device after acquiring the user's login credentials. As rare and unlikely as these events are, they are potentially the most detrimental. There is no surefire way to prevent these risks, except to practice due diligence: not accessing untrusted networks or sites, to mitigate the risk of hacking or malware being installed onto the mobile device, and enabling a PIN or password to unlock the device, to prevent unintended access.

4. IMPLEMENTATION AND EXPERIMENTAL RESULTS

4.1 Software environment

The system currently accommodates Android 6.0 (Marshmallow) to 8.0 (Oreo), supporting API levels 23 through 26. We justify limiting the application to these API levels because older Android versions have a higher risk of system vulnerabilities from deprecation and lack of support; taking into account the sensitive nature of health data, it made sense to make the application compatible with the newer versions of Android. The versions selected for our application are currently fully supported and are expected to be stable, running on most of the newer devices on the market today. Java was used for the client-side development, XML for the user interface design, and SQL for the data definitions and queries. MySQL Server 5.7 was used for the database system. The system was developed using Android Studio version 3.01 on the Microsoft Windows platform (Windows 10).

4.2 Hardware environment

The system was built using two laptop computers with the following configurations:
• Installed memory (RAM): 16 GB
• Processor: Intel® Core™ i7-5500U CPU @ 2.40 GHz, and Intel® Core™ i5-4690K CPU @ 3.50 GHz

The system was primarily run and tested on LG Nexus 4 and LG Nexus 5 devices (Android 6.0 Marshmallow) emulated using Android Studio's AVD manager, and also physically on an LG Nexus 5 device (Android 6.0 Marshmallow) with the following specifications:
• Installed memory (RAM): 2 GB
• Processor: 2.3 GHz Qualcomm Snapdragon 800 quad-core

4.3 Experimental Results

The P3HR system covers the key features proposed, including user profile abstraction and authentication (patients vs. healthcare providers), data storage and encryption, data access authorization, access revocation, and data access logging. The system offers ease of use, intuitive navigation through the features, and aesthetic minimalism, as observed in the following screenshots, including the patient and health professional login and registration pages in Figure 1, and the patient information, access permission and revocation pages in Figure 2. Also, the sequence diagram of the system is shown in Figure 3. The P3HR application provides users with the benefit of managing their own health records, while controlling access by granting permissions to specific individuals, including members of their circle of care, as well as having the ability to revoke or limit existing permissions. The application enables health professionals to interact with the patients by accessing the patient's personal health records and sending the patient messages relevant to their health or regarding specific records they have accessed.

[Figure 1. Login and Registration pages.]

[Figure 2. Patient information, Permission & Revocation.]

These messages are referred to as notes in the P3HR application. For example, if a patient visits their dietitian to obtain a new diet plan, uploads this new diet as a record into the application, and grants their primary care physician access to the record, their physician can send them feedback through the application's "note" feature and can ask the patient to follow up with them or set up an appointment. The application does not store the patient's health records on the record accessor's device; it just enables them to view the records through an interface, to prevent the data from being accessed in the event that permission has been revoked. The application data is stored on a cloud-based system to enhance system scalability. The application's database system is normalized to reduce data redundancy, and all stored data is encrypted. A web server is used to enable interaction with the database through the application. HTTP POST calls are used to send and receive data to and from the user's device.
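To illustrate the HTTP POST exchange just described, the hedged sketch below posts an encrypted record to a web-server endpoint. The actual P3HR client is written in Java for Android; the endpoint URL, field names, and use of Python's requests library here are assumptions for illustration only.

```python
# Illustrative sketch of sending an encrypted health record over HTTP POST.
# The endpoint URL and payload fields are hypothetical; the real client is a
# Java/Android application talking to the system's own web server.
import base64
import requests

def upload_record(server_url, patient_id, ciphertext: bytes):
    payload = {
        "patient_id": patient_id,
        # ciphertext produced by the (m)CP-ABE encryption step, base64-encoded
        "record": base64.b64encode(ciphertext).decode("ascii"),
    }
    response = requests.post(f"{server_url}/records", data=payload, timeout=10)
    response.raise_for_status()
    return response.json()   # e.g. an acknowledgement containing a record id

# Example (not executed here, since it needs a running server):
# upload_record("https://example-phr-server.local", "patient-42", b"...encrypted bytes...")
```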

Three separate databases are used for the purposes of abstraction and system integrity. The first database is the trusted authority server, which handles the permission-granting transactions. The second one is the revocation server, which handles permission revocation and access restriction transactions. The third one handles the user data, including health record upload and removal transactions.

[Figure 3. Sequence Diagram of the proposed P3HR system.]

5. CONCLUSION AND FUTURE WORK

In this work, we have implemented an Android application for a privacy-preserving PHR system that empowers patients to easily gather and store their health information in a secure system, maintain full control by sharing it with their healthcare providers as they wish and revoking any previously given access, and receive quick alerts/feedback from their health professionals, all in one place. This system will not only help patients easily and securely manage their health data, but it also provides a cost-effective solution for rapid communication and interaction between them and their circle of care. Furthermore, patients will be notified about every access to their health records by any of their care providers. As future work, we will integrate the application with two important features. First, as the data users are stored in a hierarchical structure of various regions, health authorities and policy makers can use the aggregated information for communicating epidemic alerts or for policy decision making to improve public health. The second feature is continuous authentication to enhance the security and privacy of the system, by which the system intelligently distinguishes its users from unauthorized individuals using a touch-behavior mechanism and locks the program if someone else is using the device in the absence of its authorized owner. We also intend to build an iOS app of the system to reach a wider user segment.

6. ACKNOWLEDGMENTS

This work has been partially supported by the School of Computer Science, University of Windsor, as well as the Natural Sciences and Engineering Research Council of Canada (NSERC).

7. REFERENCES

[1] Mitu Debnath, Saeed Samet, and Krishnamurthy Vidyasankar. A Secure Revocable Personal Health Record System with Policy-Based Fine-Grained Access Control. The 13th International Conference on Privacy, Security and Trust (PST 2015), July 2015, Izmir, Turkey.
[2] https://www.healthit.gov/providers-professionals/faqs/what-personal-health-record
[3] W. Stallings, The Advanced Encryption Standard. Cryptologia, 2002.
[4] R. L. Rivest, A. Shamir, and L. Adleman, A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, 1978.
[5] J. Bethencourt, et al., Ciphertext-policy attribute-based encryption. In IEEE Symposium on Security & Privacy, 20-23 May 2007, Berkeley, CA, USA.
[6] L. Ibraimi, et al., Mediated ciphertext-policy attribute-based encryption and its application. 10th International Workshop, WISA 2009, 2009, pp. 309-323.
[7] M. Barua, et al., PEACE: An efficient and secure patient-centric access control scheme for eHealth care system. Computer Communications Workshops, 2011.
[8] J. Benaloh, et al., "Patient controlled encryption: ensuring privacy of electronic medical records." ACM Workshop on Cloud Computing Security, 2009.
[9] Pagliari, C., et al., Potential of electronic personal health records. British Medical Journal, 2007, 335(7615), 330.
[10] Nazi, K. M., The personal health record paradox: health care professionals' perspectives and the information ecology of personal health record systems in organizational and clinical settings. Journal of Medical Internet Research, 2013, 15(4).
[11] Daglish, D., and Archer, N., Electronic personal health record systems: a brief review of privacy, security, and architectural issues. In Privacy, Security, Trust and the Management of e-Business (CONGRESS '09), 2009, pp. 110-120.
[12] Archer, N., et al., Personal health records: a scoping review. Journal of the American Medical Informatics Association, 2011, 18(4), 515-522.
[13] Kim, E. H., et al., Challenges to using an electronic personal health record by a low-income elderly population. Journal of Medical Internet Research, 2009, 11(4).
[14] Lober, W. B., et al., Barriers to the use of a personal health record by an elderly population. In AMIA Annual Symposium Proceedings (Vol. 2006, p. 514). American Medical Informatics Association.
[15] https://www.ontario.ca/laws/statute/04p03


The Role of Ethnography in Agile Requirements Analysis

Ali Meligy, Menofiya University, 00201068626996, [email protected]
Walid Dabour, Menofiya University, 00201153030618, [email protected]
Alaa Farhat, Menofiya University, 00201142163649, [email protected]

ABSTRACT

The integration of ethnography analysis with agile methods is a new topic in software engineering research. In agile development, ethnography is particularly effective at discovering two types of requirements: the functions requested by customers and the functions observed by the ethnographic analyst. The proposed model depends on the role of the ethnographic analyst in understanding how people actually operate and discovering requirements that support software functionality. This helps to predict implicit system requirements that are not defined by the organization. The proposed ethnographic model requires that the ethnographic analyst remain in the organization and observe the actual ways in which people work, rather than only the formal requirements documented by the organization.

CCS Concepts
• Software and its engineering → Agile software development.

Keywords
Agile Requirements; Ethnography; Observation; Interview; Requirements elicitation.

ICSIE '18, May 2-4, 2018, Cairo, Egypt. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00. DOI: https://doi.org/10.1145/3220267.3220273

1. INTRODUCTION

Requirements Engineering (RE) is the process of establishing the services that the customer requires from a system and the constraints under which it operates and is developed. The main goal of the requirements engineering process is creating a system requirements document for knowledge sharing, while Agile Development (AD) methods focus on face-to-face communication between customers and agile teams to reach a similar goal. Agile means being able to "Deliver quickly. Change quickly. Change often." While agile techniques vary in practices and emphasis, they follow the same principles behind the agile manifesto:
• "Working software is delivered frequently (weeks rather than months).
• Working software is the principal measure of progress.
• Customer satisfaction by rapid, continuous delivery of useful software.
• Even late changes in requirements are welcomed.
• Close daily cooperation between business people and developers.
• Face-to-face conversation is the best form of communication.
• Projects are built around motivated individuals, who should be trusted.
• Continuous attention to technical excellence and good design.
• Simplicity.
• Self-organizing teams.
• Regular adaptation to changing circumstances." [1]

Requirements elicitation techniques are basically the ways and procedures used to obtain user requirements and then implement them in the system to be developed so that it satisfies the needs of stakeholders [2].

The Agile Manifesto gathered representatives from Extreme Programming (XP), Dynamic Systems Development Method (DSDM), Adaptive Software Development (ASD), Scrum, Crystal Methods, Feature-Driven Development (FDD), and others who saw the need for an alternative to documentation-driven, heavyweight traditional software development processes. Agile methods universally rely on an incremental approach to software development and delivery. In incremental development, specification, development and validation activities are interleaved rather than separate, with rapid feedback across activities. Each increment of the system incorporates some of the functionality that is needed by the customer. Generally, the early increments of the system include the most important or most urgently required functionality. The current increment has to be changed and, possibly, new functionality defined for later increments. Incremental delivery is an approach to software development where some of the developed increments are delivered to the customer and deployed for use in an operational environment.

Once an increment is completed and delivered, customers can put it into service. This means that they take early delivery of part of the system functionality. They can experiment with the system, and this helps them clarify their requirements for later system increments. As new increments are completed, they are integrated with existing increments so that the system functionality improves with each delivered increment. Note: a system can develop incrementally without being delivered incrementally to users, but not vice versa.

Ethnography is an observational technique that can be used to understand operational processes and help derive support requirements for these processes. An analyst immerses himself or herself in the working environment where the system will be used. The day-to-day work is observed and notes are made of the actual tasks in which participants are involved.

The value of ethnography is that it helps discover implicit system requirements that reflect the actual ways that people work, rather than the formal processes defined by the organization [10].

Requirements elicitation and analysis is among the most communication-rich processes of software development. It engages different stakeholders, from both the customer and the developer sides, who need to intensively communicate and collaborate. As a key part of the requirements engineering process, requirements elicitation has a great impact on the later development activities; any omission and incompleteness may lead to important mismatches between the customer's needs and the released product. Elicitation techniques include questionnaires and surveys, interviews and workshops, documentation analysis and participant observation. During this phase, requirements should be negotiated and analyzed carefully, since many software projects have failed because their requirements were poorly negotiated among stakeholders.

Ethnography was selected as the research method because it provides an iterative approach to data collection and because sense-making is a key characteristic of producing rich descriptions on which to build understanding and knowledge [3]. [3] describes ethnography as 'the study and representation of culture as used by particular people, in particular places, at particular times', being a pragmatic way to determine what culture does. Observation is key to ethnographic studies, and both participant and non-participant observation are legitimate forms of ethnography. Observation is almost always complemented with other forms of data collection such as interviews or document analysis. The role the researcher takes with respect to the observed community will partly define which areas of the community can be accessed [4]. An ethnographical approach requires understanding a problem 'from within', i.e. everyday activities which are meaningful for end-users and performed in real-life settings. This approach provides us with the opportunity to realize what people actually do, beyond what they say they do or actually do in more formal settings, e.g. meeting rooms [5]. The main value of ethnography is its capacity to make visible the 'real world' sociality of a setting. As a mode of social research, it is concerned with producing detailed descriptions of the 'workday' processes of the ethnographic analyst within specific contexts [6].

2. PROBLEM STATEMENT

Agile software development emerged as a lightweight alternative to plan-driven development. There are many approaches to rapid software development that share some fundamental characteristics, and every approach has specific characteristics. The following lines highlight some of these approaches and their distinctive features.

2.1 Extreme programming
It involves a number of practices such as incremental development, pair programming, test-first development, continuous integration, refactoring, etc.

2.2 Scrum process
It provides a management framework for the project. It can therefore be used with more technical agile approaches such as extreme programming.

2.3 Scrumban method
It combines the most important Scrum practices (release and sprint planning, regular delivery of increments, frequent feedback) with basic Kanban principles (visualization of workflow, limiting work in progress, change management).

2.4 Agile modeling
It keeps the amount of models and documents as low as possible, so it is a supplement to other agile methodologies.

Since each of the agile methods has its individual characteristics, it would be good to combine two or more methods with each other. This research presents an ethnography analysis to unify these approaches into one coherent method. The proposed method will hold the benefits of both. Hence, the goal of this research is: ethnography analysis for agile software requirements.

Agile methods are very successful for some types of system development:
1. Development of a product where a software organization develops a small or medium-sized product for sale.
2. Production software for a particular customer.
3. Software that is delivered to customers so as to get rapid feedback from them.
4. Custom system development within an organization, where there is a clear commitment from the customer to become involved in the development process and where there are not a lot of external rules and regulations that affect the software.

Agile methods universally rely on an incremental approach to software development and delivery. In incremental development, specification, development and validation activities are interleaved rather than separate, with rapid feedback across activities. Each increment of the system involves some services and functionality that is required by the customer. Regularly, the early increments of the system include the most important or highest-priority required functionality. The current increment may be changed (tasks added or deleted) and, possibly, new functionality redefined for later increments. Incremental delivery is an approach to software development where some of the developed increments are delivered to the customer and deployed for use in an operational environment. Once an increment is developed and delivered, customers can operate it in the customer environment. This means that the customer receives an early part of the system functionality. They can test the increment with the system as a whole, and this helps them clarify the functions for later system increments. As current increments are developed, they are integrated with previous increments so that the system functionality becomes more complete with each delivered increment.

Note: a system can develop gradually without being delivered gradually to users, but not vice versa.

3. RELATED WORK

Helen Sharp, Yvonne Dittrich and Cleidson de Souza describe how researchers in empirical software engineering would benefit from embracing ethnography [4]. This can be achieved by explaining the four roles that ethnography plays in supporting empirical software engineering: to encourage studies into the social and human aspects of software engineering; to enhance software engineering design; to improve the development processes; and to inform research.

Andrea Rosales, Valeria Righi, et al. [5] conduct ethnographical techniques, inspired by concurrent ethnography, to get end-users involved in both generating further design ideas and identifying and solving technical implementation issues, which often appear at intermediate stages. An ethnographical approach requires understanding a problem from within, i.e. everyday activities which are meaningful for end-users and performed in real-life settings. This approach provides the opportunity to realize what people actually do, beyond what they say they do or actually do in more formal settings, e.g. meeting rooms. They have been exploring ethnographical techniques in two requirements and design projects, and present them together with the main lessons learned. John Hughes, et al. [6] identified the different uses of ethnography within design:

3.1 Concurrent ethnography
Design is determined by a continuous ethnographic study taking place at the same time as the systems are developed, as shown in Figure 1.

[Figure 1. The use of concurrent Ethnography [6].]

3.2 Quick and dirty ethnography
Short ethnographic studies are offered to provide a wide but informed sense of the setting for designers.

3.3 Evaluative ethnography
The ethnographic study is undertaken to test or verify a set of formulated design decisions.

3.4 Re-examination of previous studies
Previous studies are re-examined to inform the initial design study.

Carol Passos et al. [7] focus on the social interactions, communications, and relationships that arise as an intrinsic part of adopting agile software development practices. For that, they applied an ethnographic approach, employing participant observation, interviews, and document analysis. The main idea of their approach involved performing ethnography with its holistic and contextual vision, including some characteristics of action research, such as its collaborative and reflexive approach, used to recommend to the company how to improve the studied practices. Andrew Mara, Liza Potts and Gerianne Bartocci [8] describe design and agile ethnography conducted in multinational corporate settings, consisting of day-to-day observation of work patterns to document and solve the specific design problem being addressed. The end goal is to provide a picture of the landscape and design requirements to improve and make that work more efficient through technology to support the end users—employees.

4. AGILE REQUIREMENTS PROCESSES

The proposed model defines the agile requirements processes depicted in Figure 2: define requirements, break down into a set of increments, and simple design. In the first process, the customer defines all requirements that the system should include. These must contain all requirements, such as functional requirements, non-functional requirements, and all requirements required from all stakeholders. In the second process, the requirements are broken down into a set of increments; each increment involves tasks that share the same priority. In the final process, enough design is carried out to meet the current requirements in one increment and no more.

Defined agile requirements include two types of requirements: the first refers to the requirements documented in the contract between the customer and the organization that develops the software, and the second refers to requirements that are derived from observing how people actually work. The second type can be achieved by requirements elicitation and analysis techniques. Requirements elicitation and analysis are the most communication-rich processes of software development. They engage different stakeholders, from both the customer and the developer sides, who need to intensively communicate and collaborate. As a key part of the requirements engineering process, requirements elicitation has a great impact on the later development activities; any omission and incompleteness may lead to important mismatches between the customer's needs and the released product. Elicitation techniques include questionnaires and surveys, interviews and workshops, documentation analysis and participant observation. During this phase, requirements should be negotiated and analyzed carefully, since many software projects have failed because their requirements were poorly negotiated among stakeholders. In the proposed model, we focus on ethnography. An ethnographic analyst immerses himself in the working environment where the system will be used. The day-to-day work is observed and notes are made of the actual tasks in which participants are involved. The value of ethnography is that it helps discover implicit system requirements that reflect the actual ways that people work, rather than the formal processes defined by the organization [9].

[Figure 2. Requirements processes for agile method: Define Reqs → Break Down to Increments → Simple Design.]
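To make the "break down to increments" step of Section 4 concrete, the toy sketch below groups requirements that share a priority into increments and orders them so the most important functionality is delivered first. The requirement names and data structures are illustrative assumptions only; the paper does not prescribe any tooling for this step.

```python
# Toy illustration of the "break down into a set of increments" process:
# group requirements by priority so each increment contains tasks of equal
# priority, highest-priority increment first. Purely illustrative.
from collections import defaultdict

requirements = [
    ("user registration", 1),        # (requirement, priority; 1 = highest)
    ("record entry form", 1),
    ("search existing records", 2),
    ("export monthly report", 2),
    ("configurable dashboards", 3),
]

increments = defaultdict(list)
for name, priority in requirements:
    increments[priority].append(name)

for priority in sorted(increments):
    print(f"Increment {priority}: {increments[priority]}")
```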

5. AGILE ETHNOGRAPHIC MODEL

The agile ethnographic model includes activities that determine how the requirements are defined. In addition to the traditional activities used to define requirements, this model involves other activities such as ethnography, observation and interviews. This model has four processes, as shown in Figure 3: requirements elicitation, develop prototyping, evaluate prototyping and increment prototyping.

[Figure 3. Agile ethnographic model: Requirements elicitation (Traditional Interaction: Interviews, Meetings, Gathering tasks; Observation session: Ethnography analysis, Discover Reqs) → Develop prototyping → Evaluate prototyping → Increment prototyping.]

5.1 Requirements elicitation

This process divides into two types: observation session and traditional interaction. Traditional interaction includes interviews, meetings and gathering tasks. An interview is one method to discover requirements and functions held by potential stakeholders of the system under development. There are two types of interviews: open interviews, where there is no written contract document and a range of issues are explored with stakeholders, and closed interviews, where a predetermined set of questions are answered. In fact, the interview is a useful method for getting an overall understanding of what customers want and how they interact with the system. All agile approaches consider the interview an effective way to communicate with customers and to increase trust between the two sides.

Meetings are held continuously among the customers and stakeholders to accommodate changes in requirements. Gathering tasks involve any method by which the system requirements can be identified.

The observation session involves ethnography analysis and discovers requirements. Ethnography is an observational technique [9]. In this method, the ethnographic analyst observes in detail the activities of people as they actually work, for a period of time, while in the meanwhile other methods are used to collect the requirements needed. The day-to-day work is observed and notes are made of the actual tasks in which participants are involved. Ethnography is a kind of field work done in order to observe a particular workplace, its stakeholders and the relationships between them. The analyst immerses himself completely in the working environment to understand its socio-organizational requirements. It is used side by side with other elicitation techniques, such as interviews and questionnaires, to accommodate all requirements.

In the observation technique, the requirements engineer observes the user's environment without interfering in their work. This technique is used when a customer is not able to explain what they want to see in the system. It is often used in combination with other requirements elicitation techniques like interviews. Observation can be done actively or passively. Passive observation is when the analyst doesn't interact with the user while he is observing. Active observation is when the user is interrupted for questions during observation.

5.2 Develop prototyping

Ethnography may be integrated with prototyping. The ethnography develops the prototype, so some prototype refinement cycles are required. Moreover, the prototyping focuses the ethnography: it identifies problems and questions to be addressed with the ethnographic analyst. The ethnographic analyst should then look for the answers to these questions during the next phase of the model.

5.3 Evaluate prototyping

The prototyping with the ethnography model is re-evaluated to ensure that it accommodates all variables, has solved all problems, and covers the social methods for process problems.

5.4 Increment prototyping

After the prototype has been evaluated, the requirements are broken into increments. Each increment involves tasks that have the same priority and are important for the customer. The system is released incrementally. All increments together must cover all requirements.

6. CONCLUSION

The combination of the ethnography method with agile software development is a new topic in software engineering research. Ethnography is one method of requirements elicitation, used side by side with other requirements elicitation techniques. This paper describes agile requirements processes that involve defining requirements, breaking them down into increments, and simple design. In the define-requirements process, this model uses ethnographic techniques that help in discovering requirements. Ethnography is not a complete approach to requirements elicitation and should be used to complement other approaches such as interviews, observation and use cases. In the requirements processes, there are two types: traditional requirements and the observation session. Traditional requirements include all activities that contribute to defining and discovering requirements. An observation session involves an ethnography analyst, meaning that he stays in the working environment and observes how the people work to discover some requirements that the customer cannot define. Using the previous method to define requirements drives the second process, which is develop prototyping. Developing a prototype means creating a prototype involving all requirements that should be included in the system. In agile methods, the system is released incrementally, and all increments compose the overall system.

7. REFERENCES

[1] Andrea, L. and Abdallah, Qasef. 2010. Requirements Engineering in Agile Software Development. Journal of Emerging Technologies in Web Intelligence, Vol. 2, No. 3.

[2] Helen, S., Yvonne, D. and Cleidson, S. 2016. The Role of Ethnographic Studies in Empirical Software Engineering. IEEE Computer Society, pp. 1-25. DOI= http://dx.doi.org/10.1109/TSE.2016.2519887.
[3] Katie, T. Work in Agile Software Development Teams. https://www.semanticscholar.org/paper/Work-in-Agile-Software-Development-Teams-Taylor/50f1e432c636db75c574fc5e62696ef9ee22ca5d
[4] Masooma, Y. and M. Asger. 2015. Comparison of Various Requirements Elicitation Techniques. International Journal of Computer Applications, Vol. 116, No. 4, April 2015.
[5]
[6] Carol, P., Daniela, Tore, D. and Manoel, M. 2012. Challenges of Applying Ethnography to Study Software Practices. ESEM '12, September 19-20, 2012, Lund, Sweden. DOI= https://www.researchgate.net/publication/261075101.
[7] John, H., Val, K., Tom, R. and Hans A. 1994. Moving Out from The Control Room: Ethnography in System Design. ACM Conference on Computer Supported Cooperative Work, 429-439.
[8] Andréa, R., Valeria, R., Sergio, S. and Josep, B. Ethnographic Techniques with Older People at Intermediate Stages of Product Development. DOI= http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.460.6541&rep=rep1&type=pdf.
[9] Andrew, M., Liza, P. and Gerianne, B. 2013. The Ethics of Agile Ethnography. The 31st ACM International Conference on Design of Communication.
[10] Somerville. 2011. Software Engineering, 9th edition. Boston, New York: Addison-Wesley.

Predicting the Survivors of the Titanic - Kaggle, Machine Learning From Disaster -

Nadine Farag, The British University in Egypt, Cairo, Egypt, [email protected]
Ghada Hassan, Ain Shams University & The British University in Egypt, Cairo, Egypt, [email protected]

ABSTRACT

April 14th, 1912 was very unfortunate for the most powerful ship ever built at that time, the Titanic. Grievously, 1503 out of 2203 passengers perished in the sinking, but the rationale behind survival still remains a question mark. In an effort to study the Titanic passengers, Kaggle, a popular data science website, assembled information about each passenger from the days of the Titanic into a dataset and made it available for a competition titled "Titanic: Machine Learning from Disaster." This research aims to use machine learning techniques on the Titanic data to analyze the data for classification and to predict the survival of the Titanic passengers using data-mining algorithms, specifically Decision Trees and Naïve Bayes. The prediction and efficiency of these algorithms depend greatly on data analysis and the model. The paper presents an implementation which combines the benefits of feature selection and machine learning to accurately select and distinguish characteristics of passengers' age, class, cabin, and port of embarkation, and consequently infer an authentic model for an accurate prediction. The dataset is described, and the implementation details and prediction results are presented and then compared to other results. The Decision Tree algorithm accurately predicted 90.01% of the survival of passengers, while the Gaussian Naïve Bayes achieved 92.52% accuracy in prediction.

CCS Concepts
• Information systems → Clustering

Keywords
Data Mining; Machine Learning; Decision Trees; Naïve Bayes; Supervised Learning; Kaggle

ICSIE '18, May 2-4, 2018, Cairo, Egypt. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00. DOI: https://doi.org/10.1145/3220267.3220282

1. OVERVIEW & PROBLEM STATEMENT

Only 712 passengers out of 2456 on board survived the shipwreck. Modern culture and cinema portray the sinking as a collision with an iceberg. While this is true, some people were lucky enough to survive as upper-class passengers and women, while others perished. This research aims to use the benefit of machine learning techniques to identify which passengers were more likely to survive. The prediction accuracy of the Decision Tree versus that of Naïve Bayes is presented and compared. The results indicated that the predictors fare, sex/title, age, passenger class and price were the most relevant features and variables in predicting the survival of each passenger. Here comes the significance of machine learning, which helps tune a model by "learning" or "studying" each passenger to provide accurate prediction results. The model provides answers to the following questions:
• Why were some people survivors, and who were they?
• Did they just survive because they paid more money?
• Did they survive because they were of a higher class, or because of their gender?
• Did they survive because of the location of their cabins?

The answers to these questions can be pinpointed using the power of data mining and analysis. Interestingly, the Titanic followed the "women-and-children-first" code of conduct which was adhered to at that time. As historically known, Captain Edward Smith called out "Women and children first" at the moment of collision, and they did actually save a lot of women and children aboard. However, there were some others that couldn't make it. Here comes the role of data analysis, which helps to provide statistical summaries.

[Figure 1: Percentage of Survival by class and Gender [1]]

1. Men missed the boat and their survival rate was only 20%. The rest of the survivors were women and children.
2. Third-class women were more likely to survive than first-class men.
3. 44% of passengers in first class were women.
4. Only 23% of passengers in third class were women.
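The class- and gender-based survival patterns summarised above (and in Figure 1, taken from [1]) can be checked directly against the Kaggle training file. The sketch below assumes the file is available locally as train.csv with the standard Kaggle column names (Survived, Sex, Pclass); it is an illustration only, not the paper's code, and its numbers describe only the 891 training passengers rather than everyone on board.

```python
# Quick look at survival rates by sex and passenger class from Kaggle's
# train.csv (standard column names: Survived, Sex, Pclass). Illustrative only.
import pandas as pd

train = pd.read_csv("train.csv")

# Overall survival rate, then broken down by sex and by class and sex.
print(train["Survived"].mean())
print(train.groupby("Sex")["Survived"].mean())
print(train.groupby(["Pclass", "Sex"])["Survived"].mean().unstack())
```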

2. DATA-SET DESCRIPTION & DATA ANALYSIS

[Figure 2: Model of the Titanic: Cabin locations and Passenger Class [2]]

Information about each passenger has been gathered and made publicly accessible through Kaggle's dashboard. The competition is titled "Titanic: Machine Learning from Disaster". It started on September 28th, 2012 and will continue until January 7th, 2020. The dataset provided consists of three CSV files: a training file, a test file and a submission file. All work and analysis is done on the training file, which we use to build our machine learning model, since for every passenger the training set provides the "ground truth" by including the response variable "Survived". In addition, the dataset also includes 11 other descriptive variables associated with 891 passengers. In an effort to improve the model, "features" like passengers' class and sex are taken into consideration. The paper aims to show the prediction results obtained after working on the dataset provided by Kaggle, and compares them with other works. In particular, this paper discusses the results of each of the Decision Trees and the Naïve Bayesian classifier. The results indicate whether a Titanic passenger actually survived or not. The supervised learning problem dealt with requires a "model" before testing the algorithm on unseen data. The following sections describe the problem, the dataset, and the problems associated with the dataset in further detail. The Titanic dataset is a supervised learning problem: Kaggle provides the "ground truth", or the outcome, for each passenger, which means that class labels are accounted for. Hence, it is a binary classification problem.

Table 1: An Overview of the Kaggle Titanic Data [3]
- Train.csv — Description: 12 unique features; the "Survived" feature exists; 891 unique rows, one per passenger. Significance: the "Survived" feature is the ground truth for each passenger.
- Test.csv — Description: same features as in Train.csv; the "Survived" feature does not exist; 491 unique rows referring to 419 passengers. Significance: no ground truth (used for testing).
- Gender Submission File — Description: two columns, "PassengerID" and "Survived"; 419 rows. Significance: the ground truth for the expected outcome for the test file, used to evaluate the performance of the predictive model.

Having been given this training data, analysis is the first step in order to build a model which represents it, so it can be fed to the algorithm to provide prediction results. As mentioned before, sex was prioritized over class, and this paper emphasizes that by ensuring its presence in most parameters. The test file does not show the response variable Survived; however, it does contain the 11 other variables for 418 passengers, who are different from the passengers in the training set. First and foremost, the "Age" variable was noticed to be measured in fractions (newborn children) and years. We deduced that if the age appeared as a floating-point number, that particular field could have been originally missing but filled in by estimation. Furthermore, we also noticed variables that are associated with family relations, like "SibSp" and "Parch", which stand for Siblings/Spouses and Parents/Children respectively. Siblings account for sisters, brothers, stepsisters or stepbrothers on board, and spouses account for the presence of a husband or wife on board. Meanwhile, any mistresses or fiancés are not accounted for. Parch accounts for the presence of each mother, father, daughter, son, stepdaughter or stepson. Family relations may benefit from feature engineering to know more about family groups, including nephews/nieces, cousins and aunts/uncles. This will require text analytics to engineer each of the Name, SibSp and Parch features in order to group families together using their surnames, since we can hypothesize whether families sink or swim together. Also, if a child was accompanied only by a nanny, or by friends or neighbors, then the Parch attribute will be set to 0. Needless to say, there are problems associated with the Titanic dataset: there are missing values and unwanted features, which this section will cover. Also, Kaggle did not specify any details regarding how this information was gathered and assembled into such datasets.
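Before walking through each variable below, here is a minimal sketch of how the three Kaggle files might be loaded and inspected for the missing values and types just mentioned. The file names and the choice of pandas are assumptions for illustration; the paper does not state which tooling was used for this step.

```python
# Load the three Kaggle files and check shapes, dtypes and missing values,
# matching the training/test/submission split described above. Illustrative.
import pandas as pd

train = pd.read_csv("train.csv")                    # 891 rows, includes "Survived"
test = pd.read_csv("test.csv")                      # no "Survived" column
submission = pd.read_csv("gender_submission.csv")   # PassengerId + Survived

print(train.shape, test.shape, submission.shape)
print(train.dtypes)
print(train.isna().sum())    # Age, Cabin and Embarked contain missing values
```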

PassengerID: A numeric feature that represents the number of the passenger. It does not make sense to say that if the passenger ID equals 20 then he is going to die. The number here does not add any significance to the predictive model; hence, it is eliminated.

Survived: A numeric feature, 1 or 0 for survived or not, respectively. This is considered a feature by default, but in this case we do not want to treat it as a feature, because this is the thing that we are trying to predict. Hence, it is changed from a feature to a label. In addition, it is also marked as numeric. The problem is that these are numbers that we do not want to actually treat as numbers: while it is a number, it does not actually represent a quantity; it represents whether a passenger has survived or not. It does not make sense to average these 0s and 1s together or do any sort of math with them, so in this case it is changed from Numeric to Categorical.

Pclass: A numeric feature (1, 2, 3) representing three classes of passengers, from the wealthiest classes to the ones doing the Irish dancing. Similarly, it is changed from Numeric to Categorical because it represents a category more than just a number.

Name: A relevant string feature that might use some text analytics to extract reputable titles like Colonel, Reverend, Countess, etc.

Sex: A string feature which represents the gender of each passenger. According to this dataset, 65% of passengers are male and 35% are female. Again, this feature is not meant for text analytics or the like; hence it is changed from String to Categorical.

Age: A numeric feature that might actually be highly relevant. Back in the days of the Titanic, they did actually board the children and women first, so if it makes sense to categorize passengers into three sets (Children, Adults or Elderly), it might as well add value to the predictive model. However, the problem associated with this column is that there are lots of missing values. There are two different ways to estimate these missing values:
1. Using the average/median of the ages
2. Using a logic equation from the general age difference between that of a woman and her child

SibSp & Parch: These two features represent family units. They are both numeric, representing the number of Siblings/Spouses and Parents/Children. They may be relevant features to group family units together.

Ticket: This feature is a string, but it is very irrelevant and contains a lot of uncertainty. There is not a single pattern in this feature, as the ticketing system back in the times of the Titanic probably had a way of designing the tickets that is not known to us. In this case, this feature is eliminated, as it is very difficult to parse.

Fare: A numeric feature that represents how much a person has paid for the ticket, which seems actually relevant. However, this feature does not by any means relate to the Pclass feature. In other words, there is no direct relationship between "Pclass" and "Fare". According to the movie Titanic, someone did not pay a single dollar and won his first-class ticket through gambling. This means that these two features may not be directly related.

Cabin: This interesting string feature tells us the cabins some passengers have booked. Unfortunately, this feature contains a lot of missing values. However, this can be handled by changing this feature from letters to binary, i.e. whether or not a person has a cabin. Meanwhile, another super interesting approach to deal with this is by looking at the map of the Titanic in Figure 2: we can actually see which cabins were located at the borders of the ship, and these were the ones that were actually closer to the lifeboats, so passengers with cabins close to the borders are more likely to survive, while those in cabins in the middle of the ship are more likely to die.

Embarked: This is a string feature representing the three ports from which the Titanic embarked before heading into the Atlantic. The three ports of embarkation are Southampton, Queenstown and Cherbourg. Similarly, this feature is changed from String to Categorical because it represents three different ports. This column has two missing values, but they were estimated to be equal to Southampton: from the dataset, most of the passengers actually embarked from Southampton and their passenger class was three. Hence, the two missing values for Embarked, having Pclass=3, shall be set to Southampton.

3. RELATED WORK (STATE-OF-ART)

The disaster affected western culture and its media. It also influenced a lot of intensive research in the field of machine learning. The following sub-sections review and analyze the findings of other works done in this area. These works have shown varying differences in accuracies according to their different predictive models.

3.1 Background

Large databases experience growth such that their size becomes impossible for humans to analyze on their own [4]. This leads to what is known as "Predictive Analytics", which incorporates the use of mathematical formulas and computational methods for generating useful patterns or determining important features in large amounts of data [5]. Predictive analytics algorithms are either supervised or unsupervised.

3.2 Supervised Learning vs. Unsupervised Learning

Supervised learning, also known as predictive modeling, uses a single column or a group of columns in a dataset to predict a target variable. The target variable may be either categorical or continuous; in the case of a categorical variable we use classification techniques, while in the case of continuous variables we use regression techniques. In contrast, unsupervised learning, sometimes known as descriptive modeling, is more inclined to clustering in order to build a model for data that does not have a target variable [5].

3.3 Literature Survey

[6] implemented each of Naïve Bayes, SVM and Decision Trees, using several combinations each time. His results proved that the Decision Tree algorithm was the most accurate, as it got 70.43% of the predictions right, while Naïve Bayes performed the weakest with a score of 76.79%. SVM was in between, as it scored 77.99%. [7]'s findings showed that the sex feature was the most relevant, giving an 81% accuracy using the Decision Tree classifier. Correspondingly, [8] generated a "gender-based model" which depends greatly on the sex feature, and it performed with the highest accuracy, 0.78469, only 2% higher than Random Forests.

3.4 Analysis of the Related Work

[6] compared three different machine learning algorithms, specifically SVM, Decision Trees and Naïve Bayes, in terms of survival results. The data analyzed showed that 36.38% of passengers in the test data survived and that 74.20% of females survived while the rest were male survivors, which indicates that sex was a highly significant feature. The findings show that more than 90% of women in first and second class survived. For males, their chance of survival was 36.89% when in first class. [7] found "Sex" to be the most significant feature in survival. Class was found to be the second most significant, as passengers in first class were more likely to survive than those in third class. Adults aged between 20 and 49 were found likely to perish, and passengers embarking from Southampton were more likely to survive. [8]'s strength lies in the model, which took each of gender, class, fare, port, ticket, and family size as parameters into account. They found that passengers with Ticket values that start with 'A', 'SOTON' or 'W' barely survived, while tickets that start with 'PC 17755' were linked to survival. Combining these features altogether improved their accuracy results, compared to their implementation with Random Forests.

4. WORK METHODOLOGY

4.1 The Gaussian Naïve Bayes

The Naïve Bayes algorithm was implemented first as a benchmark. The following features were taken into consideration for constructing the Naïve Bayes model: Passenger Class, Sex, Age, Fare, Embarked and Ticket. Naïve Bayes was used to construct a classifier for each passenger in the test set, based on a model that assigns the label "Survived" to each instance. Usually an instance is referred to as a vector holding feature values, as the classifier treats each feature independently. First, the probabilities of death and survival were calculated. The probability of survival was calculated by adding up the grand total number of survivors and dividing the result by the total number of records.

exactly three values: 1, 2 and 3. The next step is calculating the conditional probability of these two features, given the label “Survived”, to specify whether a passenger has survived or not. Similarly, for the Sex feature, the total number of male survivors was divided by the total number of survivors (the total number of records with Survived=1). In the same fashion, probabilities of other features were calculated. In practice, when there are too many features, as seen in this research, the Bayes theorem can be used to reformulate the traditional conditional probability model. The Bayes theorem was applied as follows:

$P(s \mid f) = \frac{P(f \mid s)\,P(s)}{P(f)}$

However, to calculate parameter estimates for each of the Fare and Age features, the Gaussian distribution was used by computing the mean and variance first. After handling the continuous values of each of the Age and Fare, the conditional probabilities of these features can be estimated, given the label “Survived”, whether or not a passenger has survived. The continuous values in each of the Age and Fare vectors were measured to compute the variance in each class, Survived = 0 and Survived = 1. The feature values for each of the Age and Fare vectors were distributed according to the Gaussian distribution. The conditional probability of each feature from the training set was calculated using the following equation:

$P(f \mid s) = \frac{1}{\sqrt{2\pi\sigma_s^{2}}}\exp\!\left(-\frac{(f-\mu_s)^{2}}{2\sigma_s^{2}}\right), \qquad \mu_s = \frac{1}{n_s}\sum_{j=1}^{n_s} f_j, \quad \sigma_s^{2} = \frac{1}{n_s}\sum_{j=1}^{n_s}\left(f_j-\mu_s\right)^{2}$

Where f refers to a particular feature and s refers to the label “Survived”. At this point, iterating over every record j in the training set gives the conditional probability distribution subject to Survival. The conditional probability distribution can be used to find the probability of survival of any test point, after passing in the feature set. Furthermore, the MAP estimate, which refers to Maximum a Posteriori Probability, was used to obtain the maximum likelihood as follows:

$\hat{s} = \arg\max_{s}\; P(s)\prod_{i} P(f_i \mid s)$

Where f refers to the dataset features and s tells whether a passenger has survived or not, as it refers to True or False. The probability of each feature is multiplied given positive and negative outcomes. In other words, the probability of each feature is multiplied given a non-desirable outcome and compared with the probability of each feature given a desirable outcome. The greater probability makes the prediction. Several trials were made, as the way features were combined each time varied, which affected the accuracy percentage.

Table 2: Naïve Bayes with Several Features as Parameters

Pclass | Gender | Age | Fare | Embark | Ticket | Accuracy
✓ | ✓ | ✓ | ✗ | ✓ | ✗ | 62%
✓ | ✓ | ✓ | ✗ | ✓ | ✗ | 73%
✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 69%
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 78%
✓ | ✓ | ✓ | ✗ | ✓ | ✓ | 92%

Several different features were combined together and fed to the Naïve Bayes Classifier, as demonstrated in the previous table. The first two trials experienced different results, even though the combinations were exactly the same. The reason behind this is that in the first trial the age was passed in as a categorical value. When the “Age” feature was categorized into four bins (Children, Youths, Adults and Elderly), it resulted in a very poor prediction. However, when the values in “Age” remained continuous, the prediction accuracy improved from the 62% obtained after discretization to 73%. Furthermore, adding “Fare” as a feature to the Classifier was expected to improve the prediction results, but surprisingly it did not, and it decreased the accuracy from 73% to 69%. From this implementation, the sex of the passenger implied a strong influence. The gender feature seemed to be the most powerful indicator of whether a passenger has survived or not. Even when adding other features to the Naïve Bayes model, the prediction accuracy did not significantly improve, and this is because the “Sex” feature is strongly related to the “Survived” label. In other words, the sex of the passenger dominates the model of this dataset using the Naïve Bayes classifier. In view of this, excluding “sex” and adding other features to the model also improved the accuracy of the classifier, which indicates that despite not being as strong as sex, they still do correlate to the survival of passengers at some point.

4.2 Decision Trees
The simple Decision Tree algorithm enables clear identification of patterns through predictive models, by making use of simple probability calculations. There are several variations of Decision Trees, from simple tree structures to regression and ensemble trees. They are super easy to read, yet they are extremely powerful machine learning algorithms. Conceptually, decision trees start with a root node at the top, which stands for the best predictor upon which the data is broken down to further levels, where the last level indicates the decision (leaf nodes). Decision tree algorithms deal with both numerical and categorical data. In the scope of this research, the decision tree structure was built upon three features: Embark, Sex and Class. The “Ticket” feature was disregarded because it is neither categorical nor numerical. The data was first split at Embarked into Q, S and C starting at the root node (Q=Queenstown, S=Southampton and C=Cherbourg). Then the data (representing passengers) was split according to Sex, since “Sex” is highly correlated to one's chance of survival, as confirmed by several data scientists who have looked into this problem. Furthermore, passengers were divided once again according to Class, since it was mentioned previously that men in first class had a higher chance of survival than men in third class. With the strength of these features, the Age and Fare were eliminated from the decision tree. The age domain, as previously mentioned, is continuous. When it was changed to a categorical feature representing (Children, Youths, Adults and Elderly), it confused the classifier and resulted in a poor estimation, even though it was thought of as a good decision boundary. Back in the days of the Titanic, it was true that children had higher priorities than the elderly; but the kaggle data, when fed into the classifier, resulted in a huge error rate. Surprisingly, reducing the number of features achieved a classification accuracy result of 91.01%. The following figure shows the decision tree structure, splitting the data first at root node Embarked, then second at Sex and lastly at Class.
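As an aside before the figure, both models described in Sections 4.1 and 4.2 have close library equivalents. The sketch below is not the paper's implementation; it assumes the Kaggle train.csv columns and a simple numeric encoding of the categorical features.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")
X = pd.DataFrame({
    "Pclass": df["Pclass"],
    "Sex": (df["Sex"] == "female").astype(int),
    "Age": df["Age"].fillna(df["Age"].median()),
    "Fare": df["Fare"],
    "Embarked": df["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2}),
})
y = df["Survived"]

# GaussianNB fits a class prior plus a per-class Gaussian mean and variance for
# every feature, then takes the MAP decision, matching the equations above.
print("Naive Bayes:", cross_val_score(GaussianNB(), X, y, cv=5).mean())

# A three-level tree restricted to Embarked, Sex and Pclass, as in Section 4.2.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print("Decision tree:", cross_val_score(tree, X[["Embarked", "Sex", "Pclass"]], y, cv=5).mean())
```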

Figure 3: Decision Tree Spanning the Titanic Data

5. RESULTS & COMPARISON TO OTHER WORKS

5.1 Testing and Performance Evaluation

Table 3: Comparison of Accuracy Results with other Works

Algorithm | This research | [6] | [9] | [8]
Naïve Bayes | 92.5% | 76.79% | ✗ | ✗
Decision Tree | 90.1% | 79.43% | 79.46% | ✗
Modified Gender Based | ✗ | ✗ | ✗ | 78%
SVM | ✗ | 77.9% | 77.9% | 77.03%
Random Forests | ✗ | ✗ | 81.3% | 76.0%

Table 4: A Comparison on Implementation Details Using NB

Parameters (Features) | This research | [6]
Pclass | ✓ | ✓
Sex | ✓ | ✓
Age | ✓ | ✓
Fare | ✓ | ✓
Ticket | ✓ | ✗

This research confirms that Naïve Bayes Classification outperforms Decision Trees and Random Forests. This classification estimated the necessary parameters using simple Gaussian probability distribution equations to normalize the data.

Table 5: A Comparison on Implementation Details Using DBT

Hierarchy | This research | [6] | [9] | [7]
Root | Embark | Sex | Sex | Sex
Inner | Sex | Class | Class | Class, Embark
Child | Class | Age | Age | Age
Accuracy | 90% | 79% | 79% | 81%

Table 4 shows the difference between the two implementations of the Naïve Bayes using different combinations of parameters each time. Surprisingly, the “Ticket” feature has significantly affected the predictive model, as all passengers with tickets having “PC 17755” were alive. Different tree structures have achieved different accuracies, as presented in Table 5. Researchers [6] and [9] emphasized the Sex feature, as they considered it to be the best predictor to start with. Researcher [7] emphasized the “Sex” feature too, as well as Embark. However, their work did not consider “Embark” as the best predictor, as they paid more attention to “Sex”. In contrast, this research did not consider adding the Age, neither as a numerical nor as a categorical feature, and presumed Embark to be the root node (best predictor). In this fashion, having a reduced tree with only three levels and disregarding the “Age” feature has resulted in an almost 9-10% increase in accuracy.

5.2 Naïve Bayes vs. Decision Trees
After implementing two methods on the kaggle data, the results proved that Naïve Bayes performs best with 92.5% and worst with 62.03%, while the Decision Tree performs best with 90.1% accuracy when it was reduced to only three levels. The only difference here is about 2%. The only logic behind this is that the Naïve Bayes has included the Ticket as a feature, as the Decision Tree failed to include such kind of data. The decision tree algorithm decided upon three features (Embark, Sex and Class) but did not specify weights. In other words, the tree has treated each feature independently, which was not a bad solution after all.

Table 6: Accuracy Results Summary

 | Accuracy
Naïve Bayes | 92.52%
Decision Trees | 90.01%

6. CONCLUSION
Conceding that sex was highly correlated to survival, as confirmed by data breakdown findings and the media, a passenger's ticket proved to be strongly correlated to survival as well. Passengers with tickets having “PC 17755” were all alive, regardless of their fare, class and sex. The logic behind this may correspond to their locations on board, as the passengers with tickets having “PC 17755” could have been the passengers located at places closer to the edges, and hence closer to the life-boats, while ones with other ticket codes were more likely to be stuck in the middle of the ship. In any of the three models, knowing the number of relatives aboard did not help with classification, but perhaps, if given the links between passengers, then it may be possible to infer more knowledge about the survival rate. Since family units tend to either die altogether or survive altogether, knowing the family links would have been useful. In this research, classification results confirmed a 10-15% rise in accuracy compared to other models. The explanation behind this refers to the strong significance of the Ticket feature when used as a parameter in the Naïve Bayes classifier. Also, the simplicity of the tree structure performed at best when it disregarded features like

Age, Fare, SibSp and Parch. Briefly, the kaggle data-set gave out several features, some of which, like “Age” and “Fare”, were found to be less relevant than expected, and not very useful in this problem. As previously mentioned, the “Fare” feature confused the Naïve Bayes classifier and resulted in poor accuracy results, but the case was different when the feature was disregarded, as the accuracy tended to significantly increase from 78% to 92%.

7. ACKNOWLEDGMENTS
I sincerely acknowledge my dear advisor, Dr. Ghada Hassan, for her undying patience, willingness to help me throughout this process and her encouraging attitude to pursue this research experience. I would like to thank my friends Ahmed Moussa, Lily El Bishry, Jayda Shaalan, Belal Medhat, Thomas Jalabert and Peter Fletcher for their constant support and belief in my success. I gratefully thank my beloved parents for their unconditional support. Lastly, I would like to thank all of my beloved artists, dead or alive, for their works that kept me inspired to work throughout this research.

8. REFERENCES
[1] C. Anesi, Titanic official casualty figures, 1997.
[2] M. M. Nichol, Titanic model, 1998.
[3] "Kaggle.com," Kaggle, [Online]. Available: https://www.kaggle.com/c/titanic. [Accessed 27 2 2018].
[4] S. Finlay, Predictive Analytics, Data Mining and Big Data: Myths, Misconceptions and Methods, Palgrave Macmillan, New York, 2014.
[5] D. Abbott, Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst, 1st ed., Indianapolis: John Wiley & Sons, Inc., 2014.
[6] E. Lam, CS229: Titanic – machine learning from disaster, 2012.
[7] J. S. M. M. L. C. Shawn, Classification of titanic passenger data and chances of surviving the disaster, 2014.
[8] Z. Z. L. L. Kunal Vyas, Titanic - machine learning from disaster, 2014.
[9] T. Mitchell, Machine Learning, New York: McGraw Hill, 1997.

37

Using Fuzzy Logic in QCA for the Selection of Relevant IS Adoption Drivers in Emerging Economies

Nayeth I. Solorzano Alcivar, Escuela Superior Politécnica del Litoral (ESPOL), Campus Gustavo Galindo, Guayaquil, Ecuador, [email protected]
Luke Houghton, Griffith University, Nathan Campus, Brisbane, Australia, [email protected]
Louis Sanzogni, Griffith University, Nathan Campus, Brisbane, Australia, [email protected]

ABSTRACT In this investigation, Public Ecuadorian Organizations (PEOs) are This paper argues that typical adoption studies fail to capture the used as the case study of a LAT region for the analysis. nuances and realities of emerging economies in Latin American Some studies use two forms of QCA such as crisp-set (csQCA) (LAT) regions. Existing research has a long list of factors that are and fuzzy-set (fsQCA) rather than QCA in its multi-value form based on studies outside of the LAT region, which is a problem (mvQCA) to examine necessary and/or sufficient conditions to because there are almost no studies that capture the unique explore complex organizational parameters [3, 4] from a large set perspective of the LAT context. These issues, in turn, creates of variables. Other authors such as Servant and Jones [5] use other uncertainty because the context in LAT varies widely from the fuzzy logic techniques such as an automatic code-history-analysis. economies where most of these studies are conducted. To begin to This approach that takes advantage of the fuzzy history graph to address this problem, the authors used a Qualitative Comparative improve the accuracy of a fundamental task in code history Analysis (QCA) using fuzzy logic to refine the selection of drivers analysis to identifying the revisions of a large set of code lines. obtained from earlier studies. The study revealed fourteen themes Servant and Jones [5] argued that this technique provides higher as being relevant candidate drivers for comparative future accuracy than existing models to obtain fine-grained code history research purposes. It is argued that these results provide local from extensive coding sets. stakeholders with a set of drivers relating to IS adoption within a specific context, namely in LAT economies and provide a In an early three-stage process, a large set of candidates IS contextual frame to develop more meaningful studies in LAT adoption drivers were initially identified from existing IS/IT economies. adoption theories, local secondary data, and the opinion of local experts/practitioners. This set was obtained by using mixed- CCS Concepts method analysis strategies. NVivo, which helps to make the data • Computing methodologies→Vagueness and fuzzy logic. analysis process transparent and faster [6], was the research tool used to code and categorize the data. However, several of the Keywords identified themes were not dichotomous, imposing the need to Fuzzy logic; Qualitative Comparative Analysis; fsQCA; choose fs/QCA to refine the set and to analyse causal relations. Information System Adoption; Latin America; Ecuador; Public fs/QCA as a comparative analysis strategy that is applied to reveal Organization. patterns of association across the set of formed themes (each theme is considered a ―Case‖), and to provide support for the 1. INTRODUCTION existence of causal relations between determined conditions in Qualitative Comparative Analysis (QCA) is an analytic approach, relation to the cases [2]. Based on the results, a fine-grained complemented by a set of research tools that helps determine the selection of the themes formed as candidate drivers of SISA was necessary or sufficient conditions [1] to evaluate significantly identified to be tested in different organizational LAT contexts. varied outcomes of a selection process. This technique is also These factors can be anticipated for further studies. 
Therefore, the considered to be for to evaluating empirical analysis based on research question, ―Which themes identified from existing IS/IT qualitative approaches [2]. In this instance, QCA is used to better adoption theories, local secondary data, local experts/practitioners‘ explain and justify which of the large number of empirically opinion, are the most prominent candidate drivers affecting SISA refined themes obtained from a previous study should be selected in LAT organizational contexts?‖ was answered. as the most prominent candidate drivers affecting SISA in public organisation of emerging economies such as LAT regions. 2. APPLYING FUZZY QCA AS BRIDGING METHODOLOGY Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that 2.1 Scoring in the analysis process copies are not made or distributed for profit or commercial advantage and As part of a bridging process in a large mixed method study, that copies bear this notice and the full citation on the first page. fs/QCA is applied as a comparative analysis strategy revealing Copyrights for components of this work owned by others than ACM must association patterns of formed themes (as cases) and bringing be honored. Abstracting with credit is permitted. To copy otherwise, or support to identify causal relations between determined conditions republish, to post on servers or to redistribute to lists, requires prior related to the cases [2]. Fuzzy-set scores used by QCA are applied specific permission and/or a fee. Request permissions from [email protected]. to normalise the frequency of reference by relevance (FrR) ICSIE '18, May 2–4, 2018, Cairo, Egypt relationship of the 50 themes identified from the outcomes in © 2018 Association for Computing Machinery. three previous stages of this study (see Table 1). The FrR ACM ISBN 978-1-4503-6469-0/18/05…$15.00 calculated ranges from 0.0000 to 0.1205 over 50 themes as the DOI:https://doi.org/10.1145/3220267.3220285

38 Maximum level of FrR per stage (see also Table 1, columns FrR from four level sets (e.g., 0, 0.33, 0.67, and 1) to continuous sets of Stages 1, 2, 3). (where the fuzzy score can take any value between zero and one). Cases on different sides of the crossover point per stage can be According to Ragin [7], the fuzzy-set scores range from 0 to 1 and qualitatively different, while cases differing from the FrR in the can describe different case conditions in a set (the 50 themes set on the same stage of the crossover point may differ in the previously identified are considered as the ‗cases‘ in the fs/QCA degree of relevance for a complete set [7] (see Table 1 Initial process). A set can be assumed as formalised representations of studies columns). concepts. In this research, cases can be evaluated regarding their frequency of reference by relevance (as we named FrR) obtained An fs/QCA applied score to standardize these ranges, used a more by each identified theme. From each of them, the relationship fine-grained relationship in which a fuzzy score can take any significance extracted from Literature analysed of existing IS/IT value between 0 and 1. Based on Legewie [1] and Ragin [7] adoption theories (Stage 1), local secondary data (Stage 2), local suggestions, in this study, the cases are normalized in four-level experts and practitioners‘ opinion transcripts (Stage 3) were sets 0, 0.33, 0.67, and 1. The FrR of each theme were obtained qualitatively evaluated to reach the saturation point [8, 9]. Then from the three stages of the initial study (see Table 1). These were the FrR was calculated in each stage. used to measure the relevance of the selected themes in different sets of conditions. The normalized scores represented by the four Table 1. The initial set of themes as possible SISA drivers fine-grained fuzzy scores are shown in Table 2. Initial Study: Frequency of References Outcomes fsQCA Table 2. Normalized scores [10] Stage 1 Stage 2 Stage 3 A+B Result Sec Themes -candidate Drivers- (Caseid) NR NS FrR NR NS FrR NR NS FrR ST FrR Range obtained fs/QCA 1 Accessibility-Interconnectivity 2 2 0.0032 99 16 0.0739 184 44 0.1109 1 Measure 2 Age 7 7 0.0132 - - 0 49 26 0.0131 from NVivo Score 3 Attitude Towards Using-Intention to Use 28 19 0.0575 45 12 0.0536 105 38 0.039 0.67 0 0 No References 4 Communication Channels 12 7 0.0211 5 3 0.004 53 25 0.0153 0.001 to 0.0402 0.33 Low-Medium Level of References 5 Compatibility & Standardization 6 6 0.0141 33 9 0.0234 70 28 0.032 6 Corruption 2 2 0.0055 3 2 0.0027 46 24 0.0137 0.0403 to 0.0803 0.67 Medium-High Level of References 7 Cultural & Values Aspects 25 4 0.09 3 1 0.0027 54 32 0.0131 1 0.0804 to 0.1205 1 High Level of References 8 Defined Processes 1 1 0.0011 4 3 0.0023 46 29 0.017 9 Economic Aspects 15 6 0.0399 3 2 0.0003 18 10 0.0012 10 Education & Skills 16 10 0.0355 47 16 0.0278 103 41 0.0345 2.2 Consistency of the fuzzy score range 11 Gender 6 6 0.0043 1 1 0.0009 5 4 0.0003 12 Individual Income 3 3 0.0066 - - 0 12 9 0.002 To justify the sensitivity of the cutoff point, the researchers 13 Information Availability - - 0 93 15 0.0948 93 32 0.0434 1 discussed consistency and the coverage of the fuzzy score range. 
14 Information Quality 6 6 0.0217 25 11 0.0303 44 27 0.0081 15 Intellectual Property and Software Rights (*) - - 0 70 11 0.063 49 20 0.0203 0.67 For example, Ragin [7] states that conditions or combinations of 16 Internet Facilities 5 5 0.0129 36 14 0.032 71 31 0.0245 conditions in which all cases fit in a relation of necessity or 17 Job Relevance 6 6 0.0099 - - 0 5 2 0.0006 18 Labour Force 2 2 0.0052 - - 0 24 16 0.0091 sufficiency are rare. At least a few cases will usually differ from 19 Language 1 1 0.003 - - 0 19 9 0.0017 the general patterns. Therefore, it is necessary to evaluate how 20 Leadership Continuity - - 0 - - 0 26 13 0.014 well the themes, as cases in different sets, fit a relation of 21 Loyalty 2 2 0.0069 - - 0 1 1 0 22 Market Environment 9 5 0.0218 5 3 0.0041 40 20 0.0092 necessity or sufficiency [1]. 23 National Plan-ICT Inclusion - - 0 26 10 0.0259 24 12 0.0096 24 National Telecommunication Environment 1 1 0.0013 4 1 0.0052 14 10 0.0021 Furthermore, the outcome, evidencing computing consistency and 25 Nature of Development 4 3 0.0071 11 5 0.0122 106 29 0.0665 0.67 26 Net Benefits Perception 11 10 0.0428 21 11 0.0244 49 21 0.0259 0.67 resembling the idea of significance in statistical models, involves 27 Observability 3 3 0.006 - - 0 - - 0 the degree measurement of necessity or sufficiency condition 28 Organisational Aspects 12 8 0.0371 6 2 0.0063 3 3 0.0003 29 Organisational Experience & Slack 15 7 0.0331 - - 0 27 19 0.003 between causal conditions or the combination of conditions. Thus, 30 Organisational Structure 4 4 0.0061 33 5 0.0298 4 3 0.0001 the fs/QCA software computes consistency of the fuzzy scores 31 Perceived Ease Of Use 21 15 0.0577 5 3 0.0061 57 26 0.0184 0.67 ranges used. The value range "0" indicates no consistency and "1" 32 Perceived Usefulness 25 17 0.0778 52 13 0.0488 94 37 0.0436 0.67 33 Political Aspects 7 6 0.0138 7 4 0.0053 83 25 0.0271 indicates perfect consistency, providing a measure of empirical 34 Population Changes 1 1 0.0019 - - 0 1 1 0 relevance. This range of measurements is analogous to the 35 Regulation & Policies (*) 15 8 0.0346 129 19 0.1205 99 33 0.0429 1 36 Service Quality 4 4 0.0149 23 10 0.019 30 17 0.0099 variance contribution of a variable in a statistical model [7]. 37 Subjective Norms & Motivation 18 11 0.0581 3 2 0.0023 84 31 0.0228 0.67 38 System Characteristics 9 6 0.0218 9 2 0.0123 33 16 0.0173 39 System Development & Implementation - - 0 51 9 0.0545 53 23 0.0284 0.67 3. THE FUZZY LOGIC OF QCA APPLIED System Maintenance-Continuing 40 Improvements - - 0 17 5 0.0122 34 20 0.0092 TO THE THEMES SELECTION AS CASES 41 System Obsolescence - - 0 2 2 0.0007 18 12 0.0042 42 System Quality 4 4 0.01 32 7 0.0244 32 17 0.0059 43 System Security Perception 7 4 0.0218 23 8 0.0282 115 30 0.0251 3.1 The selection criteria for the fs/QCA 44 Technology Costs & Budget 3 3 0.0073 27 8 0.031 116 36 0.0373 45 Technology Infrastructure 10 6 0.0326 48 15 0.0429 72 26 0.0269 0.67 outcomes 46 Technology Maturity & Awareness 6 3 0.0151 34 6 0.034 48 25 0.0154 Definitions stated in relation to QCA were mostly textually taken 47 Timeframes 4 4 0.014 4 1 0.0042 68 33 0.0255 48 Trust & Leadership Governance 5 3 0.013 25 12 0.0342 162 45 0.0554 0.67 from Alcivar et al. [10], which helps to better explain 49 Usage Behaviour and Use 20 15 0.0627 - - 0 85 31 0.0445 0.67 terminologies for the fs/QCA process and fsQCA software 50 User Satisfaction 10 8 0.0361 - - 0 34 16 0.0099 application used. 
The definitions used are expressed as follows: Sources analysed in each Stage 28 34 55 50 50 50 Number of joint Themes refined Case/s_ set of official representations of concepts used as part of a Leyend: Number of References (NR); Number of Sources Analysed (NS); Frequency of References (FrR) (*) last joined Themes by related mening and computed results qualitative analytic technique ([11], [12], [13], [14], [15] as cited in Legewie [1]). In this study, the cases are the 50 themes The three anchor points can define a set between the three stages: obtained from three stages of the early study phase. Each theme is ―High Level of References‖ (indicated by relationship score of 1), considered a case (named in the fsQCA software as ―caseid‖). ―No references‖ at all (relationship score of 0), and a crossover point (a probable relationship score of 0.5). However, between the Scoring Criteria_ the cases scoring 0.67 or 1 on the conditions, extremes of full level of references and non-level of references, a necessary or sufficient to the outcomes, are the themes recognised set can have fine-grained relationship levels of references, ranging

39 as significant to all stages or in a stage [10]. The most prominent measure the theme’s reference level prominent to all the stages in themes identified are selected for further analysis. the same proccess (see Table 3). Necessity Condition_ is the condition of a determined set of These themes achieved the necessary conditions determined by themes (named as A-themes) necessary for the outcome Y the Level of References of Necessary Themes LRTNS causal (candidate drivers of SISA), if Y is not possible without the recipe with a narrow standard deviation of 0.1495882 (see Table inclusion of A-themes. Therefore, in all the cases, outcome Y 4). Also, we identified the six themes in the STHR recipe (as shares the presence of the A-themes’ condition [1]. It is higher selected themes of LRTNS) in which only a theme scoring determined as the necessary condition to the outcomes or the 1 or 0.67 (using the fs/QCA scale) was computed, obtaining a necessity of determined outcomes [10]. very low standard deviation (9.93411E-09). Sufficiency Condition_ is the condition of a group of themes (A- Table 3. Programming in fsQCA themes) or the combination of other themes (X-themes) sufficient for the outcomes of Y (candidate drivers of SISA). Y will always rise if A-themes are present. However, other conditions, besides A-themes, may produce the outcome Y. These other conditions indicate that all cases in which A-themes are included share the occurrence of the outcome Y [1]. The definition of this statement is known as the sufficiency condition to the outcome [10]. INUS Condition_ is a single condition of Z-themes neither necessary nor sufficient by itself, but which can be part of the combination of one or more conditions that are sufficient for the outcome Y (candidate drivers of SISA) [1, 10]. Causal Recipes_ Causal Recipes are the conditions (using a set of theories or Boolean algebra) used to formally analyses to what degree some of these conditions, or a combination of them, are necessary or sufficient for the outcome [1, 10]. The formulas for the Causal Recipes used and computed in the fsQCA software are presented in Table 3. The intention is to cover Sufficiency, Necessity and Inus conditions that can be obtained. 4. THE QCA PROCESS: DISCUSSION AND RESULTS 4.1 The fs/QCA analysis Initially, we analysed the 50 identified themes as caseid to differentiate the Sufficiency, Necessity and possible Inus conditions. We used the fsQCA software to upload the spreadsheet with the original calculated FrR and to automatically convert the percentage to the proposed normalised fuzzy scores (see Table 2). The results obtained by computing all the causal recipes for the 50 themes, were stored in a new table (see Table 3). In this process, the different proposed recipes evidencing necessity or sufficiency aspects of the relationship were closely computed and examined. During the fs/QCA process, the causal recipes, named Selected Themes with Higher Relevance (STHR), formulated as a combination of stages 1, 2 and 3 in the analysis process (see Table 3), were identified as the conditions that best suit the outcomes. These recipes accomplish the necessity of including Medium-high and High referenced themes obtained from the outcomes of Stage 3 (S3), which are also relevant emerging themes from Stage 2 (S2) and S3, and Stage 1 (S1) as it is shown in Figure 1. Consequently, the level of references of 50 themes originally obtained, were normalized by using the fs/QCA score. 
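Several of the operations attributed above to the fsQCA software reduce to a few lines of code: normalising FrR values onto the four-level fuzzy scores of Table 2, the STHR-style selection of cases scoring 0.67 or 1, and a consistency check. The consistency formula is not spelled out in the paper; the sketch below uses Ragin's standard fuzzy-set consistency for a sufficiency relation, the sum of min(x, y) divided by the sum of x. Python is used here purely for illustration (the study itself used the fsQCA software), and the outcome memberships at the end are hypothetical values.

```python
def fuzzy_score(frr: float) -> float:
    """Map a frequency of reference by relevance (FrR) onto the four-level set of Table 2."""
    if frr == 0:
        return 0.0           # no references
    if frr <= 0.0402:
        return 0.33          # low-medium level of references
    if frr <= 0.0803:
        return 0.67          # medium-high level of references
    return 1.0               # high level of references (FrR observed up to 0.1205)

def consistency(condition, outcome):
    """Ragin-style consistency of 'condition is sufficient for outcome' over paired fuzzy scores."""
    return sum(min(x, y) for x, y in zip(condition, outcome)) / sum(condition)

# Stage 3 FrR values for a small illustrative subset of themes, taken from Table 1.
frr_stage3 = {"Accessibility-Interconnectivity": 0.1109,
              "Perceived Usefulness": 0.0436,
              "Regulation & Policies": 0.0429,
              "Age": 0.0131}
scores = {theme: fuzzy_score(v) for theme, v in frr_stage3.items()}

# STHR-style selection: keep only the cases scoring 0.67 or 1 on the condition.
prominent = {theme: s for theme, s in scores.items() if s >= 0.67}
print(prominent)

# Hypothetical outcome memberships, only to show how a consistency value is obtained
# (1 indicates perfect consistency, 0 indicates none).
outcome = [1.0, 1.0, 0.67, 0.33]
print(consistency(list(scores.values()), outcome))
```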
From this process, six themes were identified as the most relevant (see Figure 1). At the same time, we identify which theme was the most significant to all stages in the study case (see the joining point in Figure 1). This was done by computing the LRAND recipe _Level of Reference joining the three themes_, used to

40 Table 4. Range to be used from additional content analysis, that the themes named Intellectual Properties & Software Rights and Regulations & Policies (see Table 5) are both defined as rules of law related to IS/ICT in the context of the current research. Therefore, they were regrouped within one theme as Regulations & Policies without affecting the results of the recipes ST and STHR.

Furthermore, the decision to regroup these two themes was supported due to the fact that the fuzzy scores of Intellectual Properties & Software Rights (in the recipes LRTNS=0.33, ST=0.67) are lower than or equal to the Regulations & Policies score (in the recipes LRTNS=0.67, ST=0.67). Therefore, the Boolean operation of this union shows that the fuzzy scores of Regulations & Policies remain equal (LRTNS= 0.33 + 0.67 = 0.67 and ST = 0.67 + 0.67 = 0.67). Then, STHR=LRTNS conditioned by themes >= 0.67 presents the same result of Regulation & Policies before the combination of both (see Appendix I, Section 2). Consequently, the joining of the two last-mentioned themes reduced the themes selected to 14 (see Figure 1).
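The "+" in the expressions above denotes the fuzzy-set (Boolean OR) union, which takes the maximum of the two memberships rather than an arithmetic sum; that is why 0.33 combined with 0.67 stays at 0.67. A minimal sketch:

```python
def fuzzy_or(a: float, b: float) -> float:
    """Fuzzy-set union (Boolean OR) of two membership scores is their maximum."""
    return max(a, b)

print(fuzzy_or(0.33, 0.67))   # LRTNS scores of the two regrouped themes -> 0.67
print(fuzzy_or(0.67, 0.67))   # ST scores of the two regrouped themes    -> 0.67
```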

Figure 1. SISA in LAT selected themes structure. 5. DISCUSSION OF THE OUTCOMES 4.2 The fs/QCA results 5.1 Relationship relevance among the selected Even though STHR explains the necessary condition (Necessity themes condition) to identify the most relevant themes for combined The 14 themes obtained with the higher level of relevance from stages, it does not consider the inclusion of all the possible themes the fs/QCA technique and their relationship within the three stages, that are highly significant to any specific stage. These incluisons were determined as a “Sufficiency condition” to answer the are better explained in the Selected Themes (ST) recipe as a research question. As a result, six of the themes selected by the “Sufficiency condition” to answer the QCA research question (see acronyms —ACIN, REPO, INAV, PEUS, ATUI, and USBU — Table 3). Initially, 15 themes emerged from the ST recipe were identified as the most relevant themes, determined as the application (Table 1, last column), which included the “Necessity condition” for SISA outcomes (see and Figure 1). combination of the “Necessity condition” in STHR, and the “Inus From the selected themes, PEUS was determined as the principal conditions” of THRS1, THRS2, and THRS3 (see Table 3). Even driver mentioned by all the sources. Finally, eight remaining though the ST recipe outcomes include the same number of themes—CUVA, NADE, TRLG, TEIN, NEBP, PEOU, SNMO, themes as the LRT recipe which also allows identifying themes and SYDI— were identified as highly important, but only in one relevant to each of the stages at the same time, we choose to work of the three stages at a time. These were recognised as “Inus with ST due to its lower standard deviation (0.14593) in relation conditions” accomplished (see Figure 1). to LRT (see Table 4). ST also helps to identify more closely the significant relationship of the themes between the stages The obtained results were then anticipated as the sufficiency performed in the previous phases of this study. condition, represented by the ST recipe and confirmed with the STHR recipe. The 14 themes obtained are then proposed as the From these evaluations, a preliminary result of 15 selected themes most prominent candidate drivers of SISA in public LAT was obtained. Acronyms to identify each selected theme as driver organisations (see Table 5). of SISA were used, as it is shown in Table 5. Then, we closely re- examined the content, concept, meaning, and opinions obtained 5.2 Grouping the results from the sources in relation to selected themes. Thus, we noted, To determine the nature of the themes and to better examine their influence as candidate drivers of SISA in LAT, we clustered them Table 5 fsQCA Scores of selected themes by related characteristics. The Control Characteristics group were determined based on existing literature and theories previously analysed. Thus, the selected themes were examined and reorganised into groups; related to Subjective Aspects, Technological Aspects, and Public Aspects. We kept consistent with the organisation undertaken in previous stages in which these themes were identified, or they emerged and were clustered according to their similarity. In the group containing themes with characteristics related to Subjective Aspects, six of them were identified as highly significant based on the fs/QCA process. This group was determined with a distribution of 41% over the 100% calculated from the set of drivers selected (see Figure 2). 
Perceived Usefulness (named with the acronym PEUS) was the only theme evidencing high significance in the three previous stages (see also

41 Table 5 fsQCA Scores of selected themesTable 5. In To conclude, these results provide local stakeholders with a set of Technological Aspects, four themes were selected from this group drivers relating to IS adoption within a specific context, namely in with a frequency of distribution of 36% over 100% (see Figure 2): LAT regions. For future researchers, the findings will provide a In this group, we highlight that INAV and SYDI, are emerging contextual frame to develop future investigation to do themes from local sources (identified in S2 and S3). This means comparative studies validating the selected drivers of IS.in in that these themes were not previously proposed as drivers of IS different organizations of LAT economies. adoption in the review of existing theories. In Public Aspects cluster, three themes were identified as relevant with 24% over 7. ACKNOWLEDGMENTS 100% (see Figure 2). Regulations & Policies (named with the Our thanks to the Escuela Superior Politécnica del Litoral, ESPOL acronym REPO) was identified as highly significant in S2 and S3 who sponsored the presentation of this paper. We would also like but also mentioned in S1. to acknowledge the Griffith University in which the Ph.D. thesis containing the complete related investigation was undertaken, and to the ACM SIGCHI for allowing us to modify templates they had developed. 8. REFERENCES [1] Legewie, N.: ‗An Introduction to Applied Data Analysis with Qualitative Comparative Analysis‘, in Editor (Ed.)^(Eds.): ‗Book An Introduction to Applied Data Analysis with Qualitative Comparative Analysis‘ (2013, 2013-07-31 edn.), pp. 1-30 [2] Schneider, C.Q., and Wagemann, C.: ‗Standards of Good Practice in Qualitative Comparative Analysis (QCA) and Fuzzy-Sets‘, Comparative Sociology, 2010, 9, (3), pp. 397- 418 [3] Wagemann, C., Buche, J., and Siewert, M.B.: ‗QCA and

business research: Work in progress or a consolidated Figure 2. Funnel chart showing the frequency distribution of agenda?‘, Journal of Business Research, 2016, 69, (7), pp. the selected drivers, clustered by similar characteristics 2531-2540 In the end, from these clustering process, we obtained an [4] Viswanathan, M., Bergen, M., Dutta, S., and Childers, T.: organisational structure to grouping the selected themes for ‗Does a single response category in a scale completely further analysis of their relationship as drivers affecting SISA in capture a response?‘, Psychology and Marketing, 1996, 13, PEOs (see Figure 2). (5), pp. 457-479 6. SUMMARY AND CONCLUSION [5] Servant, F., and Jones, J.A.: ‗Fuzzy fine-grained code-history analysis‘, in Editor (Ed.)^(Eds.): ‗Book Fuzzy fine-grained In this paper using a case study to identify the candidate drivers of code-history analysis‘ (IEEE Press, 2017, edn.), pp. 746-757 adoption in PEOs, we aim to explain the applicability of fs/QCA in IS adoption studies. To this end, we apply fuzzy logic [6] Beekhuyzen, J., Nielsen, S., and von Hellens, L.: ‗The Nvivo techniques to refine a set of identified drivers of adoption. The looking glass: Seeing the data through the analysis‘, in Editor application of fuzzy logic in the selection process was done by (Ed.)^(Eds.): ‗Book The Nvivo looking glass: Seeing the using existing software named as fsQCA. This approach helps to data through the analysis‘ (2010, edn.), pp. avoid ambiguities which are difficult to overcome in qualitative [7] Ragin, C.C.: ‗Redesigning social inquiry: Fuzzy sets and studies and provides clear and measurable outcomes. As a result, beyond‘ (University of Chicago Press, 2008. 2008) the application of QCA process using fsQCA to compute and normalize earlier outcomes, lead to the selection of 14 themes [8] Creswell, J.W.: ‗Qualitative inquiry & research design: representing the candidate drivers of SISA in LAT regions. Choosing among five approaches‘ (Sage, 2013. 2013) Particularly these were tested by accomplishing sufficiency [9] Charmaz, K., and Belgrave, L.: ‗Qualitative interviewing and conditions as drivers of adoption in PEOs. We anticipated the grounded theory analysis‘, The SAGE handbook of interview possibility that the selected candidate drivers can be tested in research: The complexity of the craft, 2012, 2, pp. 347-365 other LAT contexts. However further investigation should be [10] Alcivar, N.I.S., Sanzogni, L., and Houghton, L.: ‗Fuzzy QCA done to further test these assumptions. applicability for a refined selection of drivers affecting IS The results recognise the criteria used to select the relevance of adoption: The case for Ecuador‘, in Editor (Ed.)^(Eds.): the drivers chosen. The selection of relevant drivers includes the ‗Book Fuzzy QCA applicability for a refined selection of accomplishment of Inus, Necessity, and Sufficiency conditions drivers affecting IS adoption: The case for Ecuador‘ (IEEE, enclosing the three previous stages of early phases of the current 2016, edn.), pp. 1-6 study. In the end, the process of using fuzzy logic on QCA [11] Blatter, J.: ‗Ontological and epistemological foundations of involves the identification of relevant drivers from existing studies, causal-process tracing: Configurational thinking and timing‘, local secondary data from LAT, and local primary data from ECPR Joint Sessions, Antwerpen, 2012, pp. 10-14 PEOs. 
Therefore, we answered the stated research question ―Which themes identified from existing IS/IT adoption theories, [12] George, A.L., and Bennett, A.: ‗Case studies and theory local secondary data, local experts/practitioners‘ opinion, are the development in the social sciences‘ (Mit Press, 2005. 2005) candidate drivers affecting SISA in LAT organisational contexts?‖.

42 [13] Gerring, J.: ‗Case study research: principles and practices‘ [15] Strauss, A.L., and Corbin, J.M.: ‗Basics of qualitative (Cambridge University Press, 2006. 2006) research: techniques and procedures for developing grounded [14] Mahoney, J.: ‗The logic of process tracing tests in the social theory‘ (Sage Publications, 1998. 1998) sciences‘, Sociological Methods & Research, 2012, pp. 0049124112437709

43 Using SMOTE and Heterogeneous Stacking in Ensemble learning for Software Defect Prediction

Sara Adel El-Shorbagy, College of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport, Egypt, +2 0111 91 44 800, [email protected]
Wael Mohamed El-Gammal, College of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport, Egypt, +2 0100 69 95 002, [email protected]
Walid M. Abdelmoez, College of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport, Egypt, +2 01113718555, [email protected]

correcting software defects. To do this the defective software “modules” need to be predicted before delivery [1]. Being unable ABSTRACT to identify defective modules early will cause inefficient Nowadays, there are a lot of classifications models used for utilization of the limited resources of the company [2]. predictions in the software engineering field such as effort estimation and defect prediction. One of these models is the Defect prediction is very complex and very challenging problem ensemble learning machine that improves model performance by due to the high imbalanced, high dimensionality and linearly combining multiple models in different ways to get a more inseparable datasets. In this paper our work will focus on the high powerful model. imbalance problem. When a class is highly imbalanced this means that the modules which are correct are much more than the One of the problems facing the prediction model is the modules that have defects, that is normal but in defect prediction misclassification of the minority samples. This problem mainly we target the defects. Unfortunately, class imbalanced leads most appears in the case of defect prediction. Our aim is the classifiers to miss those defects which lead to poor performance. classification of defects which are considered minority samples To handle the problem of the imbalance class there are 3 ways: during the training phase. This can be improved by implementing ensemble classifiers or data-level approach or algorithm-level the Synthetic Minority Over-Sampling Technique (SMOTE) approach.[3] Data-level approach (Sampling methods) reduces the before the implementation of the ensemble model which leads to samples from the majority class (under-sampling) or adding over-sample the minority class instances. samples to the minority class (over-sampling) to balance class In this paper, our work propose applying a new ensemble model distribution. Algorithm-level approach points to cost-sensitive by combining the SMOTE technique with the heterogeneous learning (CSL) which says that misdiagnosed minority samples stacking ensemble to get the most benefit and performance in cost more than misdiagnosed majority samples. Its target is to training a dataset that focus on the minority subset as in the minimize the cost [1], [4]. software prediction study. Our proposed model shows better In this paper, Synthetic Minority Over-Sampling Technique performance that overcomes other techniques results applied on (SMOTE) was used to make new instances from the minority the minority samples of the defect prediction. samples (the defective class modules). This technique will make the dataset more balanced and enabling classifiers to perform

CCS Concepts better with the minority class instances [1][5][6]. Also, the • Software and its engineering➝ Software verification and precision measure was used instead of accuracy measure because validation. we are dealing with an imbalanced data set in defect prediction. Finally, adding the use of stacking in our experiment as it Keywords performs another classification task (Meta classifier) on the result Machine Learning; Ensemble; SMOTE; Software Engineering; of the individual classifiers (base classifiers) and the produced Defect Prediction; Classification; Stacking; Heterogeneous. output will give the final prediction [7]. 1. INTRODUCTION The rest of this paper is organized as follows. Section 2 is These days software companies grow very quickly in size & dedicated to the background in which the basic concepts are complexity. Quality is an essential process in the success of any defined. Section 3 describes the dataset. In section 4, the proposed software company which is never done without finding and approach is presented. Section 5 discusses the experimental results. Permission to make digital or hard copies of all or part of this Related work is discussed in section 6. Finally, the conclusion and work for personal or classroom use is granted without fee provided that future directions of work in section 7. copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. 2. BACKGROUND Copyrights for components of this work owned by others than ACM must In this section, the basic background concepts are introduced to be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior make use of in this research work. According to Oxford dictionary specific permission and/or a fee. Request permissions from ensemble means “A group of musicians, actors, or dancers (people) [email protected]. who perform together”. But in software engineering domain, ICSIE '18, May 2–4, 2018, Cairo, Egypt ensemble is a learning technique that improves prediction model © 2018 Association for Computing Machinery. performance by combining multiple models in different ways to ACM ISBN 978-1-4503-6469-0/18/05…$15.00 get a more powerful model. Researchers were attracted by the fact DOI:https://doi.org/10.1145/3220267.3220286

44 of “Wisdom of the crowds” which means that the number of votes improvement that happen when using both (Stacking with for a question combined from each person is very close to the SMOTE) techniques and also the improvement if using each of right answer including the expert and naïve answers which cancel them separately. each other. It helps Software engineering in predicting the effort of a project or if a module contains bugs [8] [9] [10]. Data Set Ensemble can be grouped to homogeneous and heterogeneous. Homogeneous: An ensemble that combines one base model with at least two different configurations. Heterogeneous: An ensemble that combines at least two different base models [11] SMOTE Low diversity may cause overfitting [12]. So, using different classifiers with different capabilities enhances the model over all.

For example, “J48 have a good effect on big sample classes” [13]. 3classifier base It is a kind of decision trees used in prediction. Lessmann et.al suggested using Naive Bayes as it performs well with other classifiers [14]. It belongs to linear classification family [7]. MLP NB MLP J48 is a kind of neural network that is composed of at least 3 layers of Stacking nodes. In the Meta layer Boosting and Bagging were used in our model. Bagging is an ensemble method that uses a single type of base learner to produce different base models [2]. Boosting is a machine learning model that aims to build a strong model using Classifier Meta many weak ones. [15] Boosting Or Bagging 3. DATASET DESCRIPTION We used PROMISE software engineering repository dataset for software defects created by NASA metrics data program titled as (PC1/software defect prediction) that contain about 1109 set of Figure 1: SMOTE + Heterogeneous Stacking data and about 22 attribute including ("line of code", "design In our work using Weka software enables you to use a collection complexity", "volume", "program length", "difficulty", of machine learning algorithms for data mining tasks. It helps to "intelligence", "effort", "time estimator", "unique operators", apply algorithms to any dataset that you choose in your "unique operands", …) our main focus will be on the “defects” experiment. attribute that will be used to predict the software defects. The “defects” attribute is true in case the software has defects. 5. EXPERIMENT AND RESULT Our results are shown in Table 1 that includes many performance This data set is composed of 1032 false defect data and 77 true measures that were used in our comparisons. defect which mean that there is a very small number of defected software compared to the whole number of data. To solve this In Weka, SMOTE is a supervised instance library that must be problem, (SMOTE) Synthetic Minority Over-Sampling Technique enabled /added from the “Package Manager” found in Weka menu was applied on this imbalanced dataset before applying any under “Tools”. Then can be found in Weka Explorer in the Machine learning on it. “Preprocess” tab when you choose the filter under Filtersupervisedinstance. [17] 4. PROPOSED MODEL At the beginning we run our experiment on Weka by applying the In this section, we introduce the proposed model. Figure 1 shows SMOTE to our initial data set and save a copy from the new data the model used to predict software defects by applying SMOTE to set after the oversampling to use the initial and the new data sets the Data Set then train the Heterogeneous Stacking model in the comparisons. Then, after applying Stacking machine composed by three base classifiers (Naïve Bayes, Multilayer learning with Meta classifier AdaBoost two times. Adding the use Perception, J48) and a meta classifier (Boosting or Bagging). of three diverse base classifiers (Naïve Bayes, MLP, J48) to In our work we applied stacking ensemble instead of voting. implement the Heterogeneous concept, first time without SMOTE Especially by using AdaBoost and Bagging ensembles (as they are and second time with SMOTE to compare them. It was found that the most commonly used in Software Defect Prediction) as a Meta the Stacking using SMOTE results is much better than without it classifier. Then comparing them to check which of them is better as shown in Table 1. Especially when comparing the TP rate of to improve minority class classification. 
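Outside Weka, the same SMOTE-plus-heterogeneous-stacking setup can be approximated with scikit-learn and imbalanced-learn. The sketch below is only an approximation of the model in Figure 1, with GaussianNB, an MLP and a CART tree standing in for NB, MLP and J48 and AdaBoost as the meta classifier; the pc1.csv file name and the "defects" column follow the PROMISE description above but are assumptions about the local data file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier, AdaBoostClassifier
from imblearn.over_sampling import SMOTE

data = pd.read_csv("pc1.csv")                      # PROMISE PC1 defect data (file name assumed)
X, y = data.drop(columns=["defects"]), data["defects"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

# Over-sample the minority (defective) class; here only the training split is resampled.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Heterogeneous stacking: three diverse base learners with a boosting meta classifier.
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("mlp", MLPClassifier(max_iter=500, random_state=42)),
                ("tree", DecisionTreeClassifier(random_state=42))],
    final_estimator=AdaBoostClassifier(random_state=42),
    cv=5)
stack.fit(X_res, y_res)
print("held-out accuracy:", stack.score(X_test, y_test))
```

Resampling only the training split, rather than the full dataset as in the Weka experiment, keeps synthetic neighbours of test instances out of training; swapping a BaggingClassifier in as the final_estimator reproduces the second configuration compared in Table 1.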
[1] the “True” Class that represent the defected modules for 0.143 without SMOTE compared to 0.468 with SMOTE. Also, when the To apply diversity in our model since different classifiers discover ROC was compared, it shows that with SMOTE we accomplished different subsets of defects, suggesting the use of Heterogeneous 0.876 compared to 0.749 without SMOTE. In the PRC areas a concept to help in the targeting of the defects represented by the value of 0.513 was reached with SMOTE and 0.193 without minority class. By using the previous work best practice regarding SMOTE. the number of base classifiers used in Stacking which is practically three classifiers. Mainly, the Naïve Bayes (NB), After this the same comparison was done but using the Bagging as Multilayer Perceptron (MLP) and the C4.5 (J48 implementation) Meta classifier for the Stacking machine learning instead of decision trees which are practically the best classifiers used for AdaBoost. Once again, it was found that the Stacking using Software Defect Prediction (SDP). [1][16]. SMOTE results is much better than without it as shown in Table 1. When comparing the TP rate of the “True” Class that represent the In our experiment applying and comparing multiple sets of results defected modules for 0.169 without SMOTE compared to 0.455 for multiple learning techniques combinations to proof the

45 with SMOTE. Also, when the ROC was compared, it showed that and AdaBoost it was found that the values are almost similar for with SMOTE we accomplished 0.871 compared to 0.773 without TP, Recall, ROC and PRC are for Bagging are 0.455, 0.455, 0.871 SMOTE. In the PRC areas a value of 0.603 with SMOTE and and 0.603 respectively And for AdaBoost, values are 0.468, 0.468, 0.320 without SMOTE. By comparing the Stacking using Bagging 0.876 and 0.513 respectively. In the second part of our experiment we run each of our base And over all in our second part of comparison the J48 with classifiers used in the Stacking separately two times (with and SMOTE achieved better performance than the MLP and NB. But, without SMOTE) to check if SMOTE always enhance the the comparison show that the values are not very far while performance of any classifier or only with Stacking, also to comparing each classifier using SMOTE compared to the same compare if the Heterogeneous (classifiers diversity) improve the classifier without SMOTE which mean that the diverse stacking at performance. The results of our second part of the experience the same time with SMOTE has much performance than applying demonstrate that the SMOTE always enhance the performance SMOTE to one classifier alone without diversity. using any of classifiers used in this paper. Table 1: Experiment results

6. RELATED WORK Our research effort makes use of machine learning techniques voting neglects the minorities which are the targeted samples in to build a prediction model in software engineering domain. In defect prediction. The stacking was introduced by Wolpert. It this section, we will present the related work in this area. consists of two layers, classifiers layer and a Meta layer. The Meta layer uses output from classifiers as an input which makes the final In [18], Zhiqiang et al. proved that heterogeneous defect prediction. So, if a defect was detected by a classifier it will not be prediction (HDP) targets the defective modules. But; ignored as voting would ignore it [7]. That is why our proposed unfortunately, it does not take into account that data could be model contradict with combining ensemble classifiers with voting highly imbalanced or linearly inseparable. [1] as majority voting

In [1], Hamad et al. proposed using SMOTE to make the data Other similar area is Software effort estimation where it was realized more balanced. And also, combining this technique with that a method can never be ranked but in our experiment stacking ensemble classifiers improves the classifiers performance. proofed to be better in defect detection performance. So, it depends Jean Petrie et al. use the suggestion that the diverse ensemble on how & why you will use this method. More over combining models predict different errors as same classifiers results in the different techniques enhances the prediction [19]. same predicition. In addition to use stacking and not voting as

7. CONCLUSION AND FUTURE WORK
To conclude, there are two things that should be carefully considered when building ensemble models. First, combining the outputs from all classifiers should be done in a way that encourages the correct decisions to be amplified and ignores incorrect decisions; this can be implemented in the learning phase by applying SMOTE on the training datasets. Second, ensembles should be built from diverse classifiers [20], and these ensembles should include classifiers that make different incorrect predictions, because classifiers that make the same prediction errors do not add any information; different classifiers find different software defects [7]. Our findings in this paper conclude that the best technique for software defect prediction is to combine SMOTE and stacking in the same model to get the best performance.

As future work, our plan is to try the same experiment on many datasets to identify the best classifier that can be used in stacking with SMOTE. Given the good results in our experiment, it is also recommended to try other boosting classifiers and compare their results to determine which of them has the highest performance.

8. REFERENCES
[1] Hamad Alsawalqah, Hossam Faris, Ibrahim Aljarah, Loai Alnemer, and Nouh Alhindawi. 2017. Hybrid SMOTE-Ensemble Approach for Software Defect Prediction. In Software Engineering Trends and Techniques in Intelligent Systems, Proceedings of the 6th Computer Science On-line Conference 2017 (CSOC2017), Vol. 3, Springer, pp. 355-366. DOI: 10.1007/978-3-319-57141-6_39
[2] Tim Menzies, Ekrem Kocagüneli, Leandro Minku, Fayola Peters, Burak Turhan. 2014. Sharing Data and Models in Software Engineering. Morgan Kaufmann Elsevier, MA 02451, USA.
[3] Mahmoud O. Elish, Tarek Helmy, and Muhammad Imtiaz Hussain. 2013. Empirical Study of Homogeneous and Heterogeneous Ensemble Models for Software Development Effort Estimation. Mathematical Problems in Engineering, vol. 2013, Article ID 312067, 21 pages. doi: 10.1155/2013/312067
[4] Haonan Tong, Bin Liu, Shihai Wang. 2018. Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning. Information and Software Technology, Vol. 96, 2018, pp. 94-111, ISSN 0950-5849. https://doi.org/10.1016/j.infsof.2017.11.008
[5] Wattana Punlumjeak, Sitti Rugtanom, Samatachai Jantarat, and Nachirat Rachburee. 2017. Improving Classification of Imbalanced Student Dataset Using Ensemble Method of Voting, Bagging, and Adaboost with Under-Sampling Technique. In IT Convergence and Security 2017, Springer Nature Singapore, pp. 27-34. DOI: 10.1007/978-981-10-6451-7_4
[6] Chenggang Zhang, Jiazhi Song, Zhili Pei and Jingqing Jiang. 2016. An Imbalanced Data Classification Algorithm of De-Noising Auto-Encoder Neural Network Based on SMOTE. MATEC Web of Conferences, vol. 56, 2016, 4 pages. DOI: 10.1051/matecconf/20165601014
[7] Jean Petric, David Bowes, Tracy Hall, Bruce Christianson, and Nathan Baddoo. 2016. Building an Ensemble for Software Defect Prediction Based on Diversity Selection. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '16). ACM, New York, NY, USA, Article 46, 10 pages. DOI: https://doi.org/10.1145/2961111.2962610
[8] Tim Menzies, Laurie Williams, Thomas Zimmermann, Leandro L. Minku. 2016. Perspectives on Data Science for Software Engineering. Morgan Kaufmann Elsevier, MA 02139, USA.
[9] R. Polikar. 2006. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21-45, Third Quarter 2006. doi: 10.1109/MCAS.2006.1688199
[10] Dinesh R. Pai, Kevin S. McFall and Girish H. Subramanian. 2013. Software Effort Estimation Using a Neural Network Ensemble. Journal of Computer Information Systems, Vol. 53(4), July 2013, pp. 49-58. DOI: 10.1080/08874417.2013.11645650
[11] Ali Idri, Mohamed Hosni, Alain Abran. 2016. Systematic literature review of ensemble effort estimation. Journal of Systems and Software, Vol. 118, 2016, pp. 151-175, ISSN 0164-1212. https://doi.org/10.1016/j.jss.2016.05.016
[12] Leandro L. Minku and Xin Yao. 2013. An analysis of multi-objective evolutionary algorithms for training ensemble models based on different performance measures in software effort estimation. In Proceedings of the 9th International Conference on Predictive Models in Software Engineering (PROMISE '13). ACM, New York, NY, USA, Article 8, 10 pages. DOI: 10.1145/2499393.2499396
[13] Ying Wang, Yongjun Shen and Guidong Zhang. 2016. Research on Intrusion Detection Model using ensemble learning methods. 7th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 2016, pp. 422-425. doi: 10.1109/ICSESS.2016.7883100
[14] Issam H. Laradji, Mohammad Alshayeb, Lahouari Ghouti. 2015. Software defect prediction using ensemble learning on selected features. Information and Software Technology, Vol. 58, 2015, pp. 388-402, ISSN 0950-5849. https://doi.org/10.1016/j.infsof.2014.07.005
[15] Necati Demir. Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results. https://www.toptal.com/machine-learning/ensemble-methods-machine-learning, last access on 22 February 2018.
[16] Improving Predictions with Ensemble Model. Posted by Valiance Solutions, August 2016. https://www.datasciencecentral.com/profiles/blogs/improving-predictions-with-ensemble-model, last access on 22 February 2018.
[17] Eibe Frank, Mark A. Hall, and Ian H. Witten. 2016. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Elsevier, Fourth Edition, 2016.
[18] Zhiqiang Li, Xiao-Yuan Jing, Xiaoke Zhu, Hongyu Zhang. 2017. Heterogeneous Defect Prediction through Multiple Kernel Learning and Ensemble Learning. IEEE International Conference on Software Maintenance and Evolution, 2017, pp. 91-102. DOI: 10.1109/ICSME.2017.19
[19] T. Menzies, E. Kocaguneli and J. W. Keung. 2012. On the Value of Ensemble Effort Estimation. IEEE Transactions on Software Engineering, vol. 38, pp. 1403-1416, 2012. doi: 10.1109/TSE.2011.111
[20] Azzeh, Mohammad, Nassif, Ali and L. Minku, Leandro. 2015. An empirical evaluation of ensemble adjustment methods for analogy-based effort estimation. Journal of Systems and Software, Elsevier, 2015, pp. 36-52. DOI: http://dx.doi.org/10.1016/j.jss.2015.01.028


Session 2 Computer Vision and Image Processing

Extraction of Egyptian License Plate Numbers and Characters Using SURF and Cross Correlation

Ann Nosseir
Institute of National Planning (INP) & British University in Egypt (BUE)
ICS Department, Cairo, Egypt
[email protected]

Ramy Roshdy
British University in Egypt (BUE)
ICS Department, Cairo, Egypt
[email protected]

ABSTRACT
In Egypt, traffic police or traffic officers usually write down car license numbers and characters to enforce traffic rules. This is subject to errors of writing or reading the numbers and characters. The proposed work utilises the advantage of the wide spread of mobile phones: officers can take pictures of car license plates and the system converts the pictures of the plate numbers and characters into digital numbers and letters. Arabic characters are challenging because, unlike the English characters, some are very similar to each other; for example, the difference between feh (ف) and Qaaf (ق), or noon (ن) and ba (ب), is minor. The challenge of this work is to extract the Arabic characters and numbers with high accuracy from pictures of the new and old car plate designs taken by regular people.

The algorithm has five steps: image acquisition, pre-processing, segmentation, feature extraction, and character recognition. To improve the processing time, the pre-processing step tests the cropped area, converts the picture into gray scale, reverses its colors, and converts it into a binary image; it then applies a morphological operation, namely dilation. To improve the accuracy, the feature extraction step uses the SURF (Speeded Up Robust Features) and cross correlation algorithms in the character recognition. The system is tested with 21 plate pictures; the accuracy is 95% and only one plate picture was missed.

CCS Concepts
Image Processing

Keywords
Image recognition; mobile; matching template.

1. INTRODUCTION
With the new traffic law presented in 2014, officers have to report more incidences of excessively using horns, crossing traffic lights, using a mobile while driving, or parking in the wrong place [1] & [2]. Cameras are used on highways; however, on city roads, officers still write down the car license number. This introduces the error of writing a wrong number or character. Mobile cameras can be used to take pictures of car plates, which could be a step towards reducing the errors of bad handwriting or missed numbers or characters.

In Egypt, the report of the Central Agency for Public Mobilisation and Statistics (CAPMAS) in 2015 states that 88.1% of families own mobile phones [3]. Officers can take pictures of the car license plate and, with a system that automatically extracts the numbers and characters from the picture, errors can be minimised.

Automatic Number Plate Recognition (ANPR) is one of the emerging technologies in the image processing domain. Different algorithms have been added to ANPR systems to recognise Arabic numbers and characters [4],[5],[6],[7],[8].

2. RELATED WORK
The work of automatically identifying car plate numbers has been going on for several years. Different algorithms have been experimented with to extract and recognise the numbers and characters from the car plates.

The Canny edge detector, or optimal detector algorithm, is commonly used to extract the edges of the image because it has a low error rate [9]. Pham et al. used template matching to identify English characters and numbers. The algorithm compares the extracted image with a template of images, shown in Figure 1. It compares pixel by pixel against all images in the template and then selects the picture with the greatest match [10].

Hough Transform, used by Duan et al. [11], is a technique commonly used in the feature extraction process. It works by extracting lines and shapes in an image and it is used in detecting curves in images. The main benefit of the Hough Transform is that it allows the existence of gaps in the boundaries of the image and it is not affected by noise in the image [11]. They tested this algorithm and the accuracy is 98.76%.

Figure 1: Example of template matching [10]

The histogram presents the count of the pixels within each area in the image. A histogram can be used to correct the brightness of the image and it can also be used to balance the image [12]. Figure 2 shows an example of the histogram of a car plate; each character is represented in the graph. Many projects used the histogram in plate recognition [13],[14],[15]. It is effective when dealing with English characters but ineffective when dealing with Arabic characters.
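As a hedged illustration of the histogram idea described above (not code from the cited projects), the following MATLAB fragment computes the column-wise projection of a binarised plate image; the valleys between the peaks are what histogram-based segmenters use to separate characters. The file name is a placeholder.

% Column-wise "histogram" (vertical projection) of a binary plate image.
% Peaks correspond to character strokes, valleys to gaps between characters.
plateBW = imbinarize(rgb2gray(imread('plate.jpg')));  % assumed input picture
colSum  = sum(plateBW, 1);            % number of white pixels in each column
gapCols = find(colSum == 0);          % candidate boundaries between characters
bar(colSum);                          % visualise the projection profile
title('Vertical projection of the plate');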

Figure 3: Example of cross correlation operation [20].

End points detection is a technique for identifying features of an image by marking the end points of each character. The number of end points of each character must be known; for example, character (S) has two end points while character (P) has one. This method is not as effective with Arabic numbers and characters because they are more complicated and similar to each other, unlike the English numbers and characters [24].
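The end-point count described above can be computed directly from a character's skeleton; the sketch below is an illustration under the assumption that the file name and variable charBW stand for a single binarised character, and it is not part of the cited work.

% Count stroke end points of one binarised character using its skeleton.
charBW    = imbinarize(rgb2gray(imread('char_S.png')));  % placeholder input
skel      = bwmorph(charBW, 'skel', Inf);  % thin the character to a 1-pixel skeleton
endPoints = bwmorph(skel, 'endpoints');    % mark pixels with exactly one neighbour
numEnds   = nnz(endPoints);                % roughly 2 for 'S', 1 for 'P'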

Figure 2: Example of histogram of license plate [12]

K-nearest neighbour [16],[17], Support Vector Machine (SVM) [18], and Artificial Neural Networks (ANN) [19],[20] have shown their ability in pattern recognition and classification. They are well known for building supervised learning models that classify and compare the features of an image with the features of other images. For example, given a set of training patterns, each marked as belonging to one or the other of two categories, SVM and K-nearest neighbour training algorithms build models that assign new examples to one category or the other.

Morphological operators are a collection of non-linear operations related to the shape, or morphology, of features in an image. They rely only on the relative ordering of pixel values. Erosion and dilation are two basic operators in mathematical morphology. The basic effect of the erosion operator on a binary image is to erode away the boundaries of foreground pixels (usually the white pixels); areas of foreground pixels therefore shrink in size, and "holes" within those areas become larger. The basic effect of dilation on binary images is to enlarge the areas of foreground pixels (i.e. white pixels) at their borders; the areas of foreground pixels thus grow in size, while the background "holes" within them shrink [21].

Speeded Up Robust Features (SURF) focuses on blob structures in the image. These structures can be found at corners of objects and at locations where the reflection of light on specular surfaces is maximal (i.e. light speckles). The algorithm detects the features of an image by returning SURF points which contain the features of this image [22].

Cross correlation measures the similarity between two images. A test is made between all images to return a correlation coefficient between images, and the largest correlation number represents the highest matching. Figure 3 shows an example of cross correlation; as shown, the last image gets the largest correlation coefficient, which means it matches [23].

Figure 4: Example of end points [24]

Some of the Arabic numbers and characters are composed of similar building blocks. The new Egyptian car plates use only 27 numbers and characters in total, of which only 17 are characters, as shown in Figure 5 [4].

Figure 5: Arabic characters used in the new design of the Egyptian car plates [4].
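To make the effect of the two morphological operators concrete, here is a small MATLAB sketch (Image Processing Toolbox) contrasting erosion and dilation on a binary plate image; the file name and the structuring-element size are illustrative assumptions, not values from the paper.

% Erosion shrinks white (foreground) regions and enlarges holes;
% dilation grows white regions and closes small holes.
bw      = imbinarize(rgb2gray(imread('plate.jpg')));  % placeholder binary image
se      = strel('square', 3);       % 3x3 structuring element (assumed size)
eroded  = imerode(bw, se);          % character boundaries eaten away
dilated = imdilate(bw, se);         % character strokes thickened
subplot(1,3,1); imshow(bw);      title('original');
subplot(1,3,2); imshow(eroded);  title('eroded');
subplot(1,3,3); imshow(dilated); title('dilated');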

Some Arabic numbers and characters are similar to each other, unlike the English characters and numbers, which are distinct. Some characters are alike in shape and are only differentiated by dots, which makes the recognition step quite challenging; clear examples are the similarity between the characters feh (ف) and Qaaf (ق), or noon (ن) and ba (ب), and more. There is also similarity between a few numbers and characters. For example, the shapes of the character (ع) and the number four (٤) are very close, and the alef character (أ) and the number one (١) can cause recognition errors because of the Hamza (ء).

Arab countries have different car plate designs, with either Arabic numbers only or both Arabic numbers and characters. Because of this difference, a number of research works have addressed the detection of car plate numbers and characters [4],[5],[6],[7].

Massoud et al. [4] worked with the new Egyptian car plate design to extract the Arabic numbers and characters. Their work implemented the generic steps of pre-processing, character detection, feature extraction, and character recognition. In the character detection, the characters and numbers were cut into blocks of fixed size and cross correlation was used to identify them; their algorithm managed to identify 91% of the plates correctly.

Al-Shami et al. [5] presented an algorithm to recognise the Saudi plates' numbers and characters. They compare three approaches to extract the characters: 1. feature selection, 2. the reference image, and 3. the image difference technique. Feature selection is an algorithm that searches for the horizontal or vertical line containing a specific feature, or a set of derived features, that uniquely identifies a certain character in all the training datasets. The reference image technique compares the candidate image with at least one reference image, and the image difference technique is the reverse. The accuracy of the feature selection technique is lower than the other two, at 79.29%; the accuracy of the reference image and the image difference techniques is the same, at 98.22%.

Shawky et al. [6] focused on the design of the old Egyptian car plate. They segmented the characters by counting the number of black pixels in each column along the vertical projection. The peaks represent the existence of a digit, and the valleys represent the isolation or boundary between digits. The segmented plate characters were evaluated with a recognition rate of 97.6%.

Basalamah [7] developed an algorithm to recognise the Saudi plate characters. The algorithm segments numbers by moving left to right in the lower left image until finding empty vertical lines between numbers; meanwhile, it segments characters by moving left to right in the lower right image until finding empty vertical lines between characters. Then, all segmented characters are resized to 50 × 50 pixels and compared with all reference characters. The recognition method has a success rate of 81%.

Youssef et al. [8] worked on the design of the new Egyptian car plate. In the segmentation process, they use the Stroke Width Transform (SWT) to separate each character and a vertical projection is applied; every character is then cropped into a separate image and resized to 70 × 50 pixels. Their algorithm using the template matching technique detected 80.4% of the plates' characters correctly, compared to Massoud et al. [4], where only 60% of the plates' characters were detected correctly.

This work proposes an algorithm that recognises the Arabic numbers and characters on Egyptian plates. It tests the algorithm with pictures of the characters and numbers of the new and old car plate designs taken by personal mobiles. Some pictures are not clear because they are either laminated, i.e., have light reflection, or are taken from the side.

Figure 6. Proposed algorithm for the car plate recognition system.

3. PROPOSED SOLUTION
The proposed solution for the car license plate detection algorithm has five steps. Figure 6 shows these steps, which are image acquisition, pre-processing, segmentation, feature extraction, and character recognition. In the following, these steps are explained.

3.1 Image Acquisition
The algorithm uses a number of images of the new design of car plates. All the new Egyptian standard plates have a fixed size, with a height of 17 cm and a width of 32 cm.

3.2 Pre-processing
The Egyptian car plate consists of two areas. The upper area of the image shows the color of the plate, where each color represents the type of the vehicle; the light blue color represents private cars, while other colors such as dark blue (police cars) or red mark other vehicle categories. This part of the plate is not needed in the recognition stage. The other part of the plate is the region which contains the Arabic characters and numbers; it is the important part that will be used in the recognition stage [22]. The pre-processing has five steps that are shown in Figure 7.

Figure 7. Pre-processing steps.

3.2.1 Crop image
The challenge is to include an algorithm that is able to detect the rectangular number plate region in the image, which is called the Region Of Interest (ROI). To detect the rectangle, we first need to identify the ratio of the area we need to crop. Table 1 shows the results of the different cropping ratios; according to the results, the ratio 0.4 gives the best result.

TABLE I: THE DIFFERENT CROP RATIOS (cropped images are shown for the original plate and for the ratios 0.2, 0.3, and 0.4)

3.2.2 Convert to gray scale
The algorithm converts the image from RGB to gray scale. This step is important because it reduces the processing time; the RGB version of the image is more complex than the gray scale version (see Figure 8).

Figure 8. Converted cropped image from RGB to gray scale.

3.2.3 Color reverse
The algorithm converts the zero pixels into ones and the one pixels into zeros; in other words, black and white pixels are reversed. This renders the characters in white and the other objects of the plate in black, as shown in Figure 9.

Figure 9. The results of converting the gray scale image to its color reverse.

3.2.4 Binarise the image
Binarising the image clearly divides the image into the target area and the background. The image, whatever its size, gets transformed into a binary matrix of fixed pre-determined dimensions, which establishes uniformity in the dimensions of the input. Hence, it makes the area of characters and numbers, which will be used in the detection, white and any other area black (see Figure 10).

Figure 10. The results of binarising the image.

3.2.5 Morphological operation (dilation)
The last step in pre-processing is dilation. It pursues the goal of removing imperfections by accounting for the form and structure of the image. Dilation is part of the morphological operations and makes the binary pixels clearer and easier to recognise. Figure 11 shows the effect of applying dilation.

Figure 11. Applying the dilation operation.

At that point, the image is ready to pass through the following steps, which are segmentation, feature extraction and recognition. Figure 12 shows the details of these steps.

3.3 Segmentation
This step is the most important and difficult step because the accuracy of the feature extraction and the number and character recognition steps depends on it.
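The pre-processing chain of Sections 3.2.1 to 3.2.5 can be sketched in a few lines of MATLAB (Image Processing Toolbox). This is an illustrative reconstruction rather than the authors' code: the file name, the crop rectangle derived from the 0.4 ratio, and the structuring-element size are assumptions.

% Pre-processing sketch: crop the plate region, convert to gray scale,
% reverse the colors, binarise, then dilate to strengthen the strokes.
rgb       = imread('plate.jpg');            % placeholder plate photograph
[h, w, ~] = size(rgb);
roi   = imcrop(rgb, [1, round(0.4*h), w, h - round(0.4*h)]);  % keep lower part (assumed reading of the 0.4 ratio)
gray  = rgb2gray(roi);                      % 3.2.2 gray scale
rev   = imcomplement(gray);                 % 3.2.3 color reverse (characters become bright)
bw    = imbinarize(rev);                    % 3.2.4 binary image, characters in white
clean = imdilate(bw, strel('square', 3));   % 3.2.5 dilation with an assumed 3x3 element
imshow(clean);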


Figure 12. Details of segmentation, feature extraction, and recognition.

3.3.1 Boundaries
Character and number detection is the step of detecting the characters and numbers in the image to be used in the later steps. It marks the boundaries around the objects of the image. To achieve this task, we use the bwboundaries() function in MathWorks MATLAB software to mark the exterior boundaries of the image in red as well as the holes inside the objects in green, as shown in Figure 13.

Figure 13. The image after running the boundaries function.

3.3.2 Boundary boxes
The second step in segmentation is applying boundary boxes around the characters to crop, as shown in Figure 14.

Figure 14. The step of applying the boundary boxes.

3.4 Feature Extraction
There are many factors which make this step difficult, such as the small dots in the image, the frame of the plate, and the small English characters on the plate; these objects end up in boundary boxes as well (see Figure 14). Therefore, the following steps are applied to enhance the feature extraction step.

3.4.1 The ratio to crop
To extract the Arabic characters and numbers, we compute the ratio of each object by calculating the number of white pixels and dividing it by the number of black pixels. The ratio for characters is between 0.5 and 1.9, so objects smaller or larger than these values are neglected.

3.4.2 Crop
The characters and numbers are cropped from the plate based on this ratio.

3.4.3 SURF
In this step, we extract the features of each cropped character and number using the SURF feature, which is a function in MATLAB that returns the feature points of each image.

3.5 Recognition
Now we have the features of the cropped characters from the image. Before getting to the recognition step, we define a dataset with all the Arabic characters and numbers to compare with the plate characters and numbers in the recognition step. The dataset contains different representations of each character and number (see Figure 15).

Figure 15. Sample of the characters and numbers.
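A minimal MATLAB sketch of Sections 3.3 and 3.4 follows; it assumes the binarised, dilated plate image clean from the pre-processing sketch above. The 0.5 to 1.9 ratio filter follows the text, while the use of regionprops and detectSURFFeatures (Computer Vision Toolbox) is our illustrative choice and not necessarily the authors' exact implementation.

% Segment candidate characters with boundary boxes, filter them by the
% white/black pixel ratio, then extract SURF features from each crop.
% (The paper visualises object boundaries with bwboundaries; regionprops is
%  used here only to obtain the bounding boxes and cropped objects.)
clean = imdilate(imbinarize(imcomplement(rgb2gray(imread('plate.jpg')))), strel('square', 3));
stats = regionprops(clean, 'BoundingBox', 'Image');   % one entry per connected object
features = {};
for k = 1:numel(stats)
    obj = stats(k).Image;                        % cropped binary object
    if any(size(obj) < 10)                       % skip tiny specks (practical guard)
        continue;
    end
    ratio = nnz(obj) / max(nnz(~obj), 1);        % white pixels / black pixels
    if ratio < 0.5 || ratio > 1.9                % reject dots, frame parts, etc.
        continue;
    end
    pts = detectSURFFeatures(double(obj));       % SURF interest points
    features{end+1} = extractFeatures(double(obj), pts);
end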


The recognition step is the last step in the automatic number plate recognition. Most projects that work on English plates use the histogram algorithm; however, it does not detect the details of the Arabic characters. Hence, the properties of the images in the dataset are adjusted: first, they are converted to grey scale and then to binary, like the cropped image from the plate; they are all then resized to the same size as the cropped images; and the last step is to extract the features of each image in the dataset by using SURF features in MATLAB (see Figure 16).

Figure 16. The characters and numbers recognition steps.

To compare the characters and numbers extracted from the plates against the ones in the dataset, cross correlation is used. It measures the similarity between two functions [21]. Cross correlation in MATLAB is defined by z = corr2(x, y). Let x be the cropped image of the character from the plate and let y be one of the images of a character from the dataset; see Equation (1).

Cross-correlation(Image1, Image2) = Σ(x,y) Image1(x, y) * Image2(x, y)    (1)

The function returns the correlation coefficient z between x and y (for example, 0.3738). This function is placed in a for loop over the size of the dataset, so that it loops over all the images in the dataset and returns the one with the largest correlation coefficient as the best matched character.

Figure 17: An example of the results of comparing a plate character and the dataset characters.

As shown in Figure 17, the result of the recognition step shows a cropped character and its corresponding match based on the similarity of the features. The red character is the one which was detected on the plate and the white one is the one matched from the dataset of characters.

4. EVALUATION

Figure 18: The images in the dataset.

We have asked volunteers to take pictures of their car plates. The pictures are of the old and the new car plate designs, as shown in Figure 18; some pictures are laminated or taken from the side. The algorithm managed to recognize the numbers and characters of 20 car plates correctly out of 21 pictures, i.e., 95%. Tests are shown in Figures 19, 20, & 21. In Figure 18, plate number 18 was not recognised because the characters were too squeezed.
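The matching loop described above can be written directly around corr2. The sketch below is an approximation under assumed names (templates as a cell array of binarised dataset images, labels as their names, cropChar as one cropped plate character) and is not the authors' published code.

% Template matching with 2-D correlation: compare one cropped character
% against every dataset template and keep the best-scoring one.
% Assumes: cropChar is a binary character image, templates is a cell array
% of binary template images, labels is a cell array of their names.
bestScore = -Inf;
bestLabel = '';
for k = 1:numel(templates)
    tmpl  = imresize(double(templates{k}), size(cropChar)) > 0.5;  % corr2 needs equal sizes
    score = corr2(double(cropChar), double(tmpl));                 % coefficient in [-1, 1]
    if score > bestScore
        bestScore = score;
        bestLabel = labels{k};
    end
end
fprintf('Best match: %s (corr = %.4f)\n', bestLabel, bestScore);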

Figure 19: New car plate design picture.

Figure 20: New car plate design from an angle.

Figure 21: Old car plate design.

5. CONCLUSIONS
In this work, we have presented an algorithm to recognize the Arabic numbers and characters of car plates. The algorithm consists of pre-processing, character detection, feature extraction, and character recognition. To improve the accuracy of the detection of the Arabic numbers and characters on Egyptian car plates, this work has done the following. In the image crop step, we have tested the Region of Interest (ROI) threshold to accurately detect the rectangular number plate region. To extract the features of each detected character, the Speeded Up Robust Features (SURF) feature is used to return the feature points of the character; from the results, this proved to be more efficient than using the histogram algorithm.

Cross correlation is used to measure the similarity between two functions. Before applying it, the images in the dataset are converted into grey scale and then to binary to have the same properties as the cropped character.

The results show the potential of this algorithm. It identifies 95% of the images of old and new Egyptian design car plates, including plates with light reflection and plates photographed from the side; only one plate out of 21 was not correctly identified.

6. REFERENCES
[1] D. Hashish, "New Egyptian Traffic Amendments End License Withdrawal Law", Scoop Empire, 2017. [Online]. Available: http://scoopempire.com/new-egyptian-traffic-amendments-end-license-withdrawal-law/. [Accessed: 28-May-2017].
[2] "11 New Egyptian Driving Laws", Cairo Scene, 2014. [Online]. Available: http://www.cairoscene.com/LifeStyle/11-New-Egyptian-Driving-Laws. [Accessed: 28-May-2017].
[3] "88.1 pct of Egyptian families own mobile phones: CAPMAS", http://english.ahram.org.eg/NewsContent/3/12/236203/Business/Economy/-pct-of-Egyptian-families-own-mobile-phones-CAPMAS.aspx. [Accessed: Nov. 2017].
[4] M. A. Massoud, M. Sabee, M. Gergais, R. Bakhit, Automated new license plate recognition in Egypt, Alexandria Engineering Journal, Volume 52, Issue 3, September 2013, pp. 319-326.
[5] S. Al-Shami, A. El-Zaart, A. Zekri, K. Almustafa, and R. Zantout, (2017) Number Recognition in the Saudi License Plates using Classification and Clustering Methods, Applied Mathematics & Information Sciences, 11(1), pp. 123-135.
[6] A. Shawky, A. Hamdy, H. Keshk, and M. El_Adawy, November 2009. License Plate Recognition Of Moving Vehicle, Journal of Engineering Sciences, Assiut University, 37(6), pp. 1489-1498.
[7] S. Basalamah, February 2013, Saudi License Plate Recognition, International Journal of Computer and Electrical Engineering, 5(1), pp. 30-38.
[8] A. M. Youssef, M. S. El-Mahallawy and A. Badr, 2014, Egyptian license plate recognition using enhanced stroke width transformation and fuzzy artmap, Journal of Computer Science, 10(6), pp. 961-969.
[9] J. Canny, (June 1986) A Computational Approach to Edge Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), pp. 679-698.
[10] Iq. Pham, R. Jalovecky, M. Polasek, (Sept. 2015) Using template matching for object recognition in infrared video sequences, Digital Avionics Systems Conference (DASC), 2015 IEEE/AIAA 34th, pp. 13-17.
[11] T. D. Duan, D. A. Duc, and T. L. H. Du, Combining Hough transform and contour algorithm for detecting vehicles' license plates, in International Symposium on Intelligent Multimedia, Video and Speech Processing, 2004, pp. 747-750.
[12] S. Banerjee, D. Mitra, Automatic number plate recognition system: a histogram based approach, IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE), 11(1) Ver. IV (Jan.-Feb. 2016), pp. 26-32.
[13] L. Weerasinghe, T. Tennegedara, and T. Jinasena, Histogram based number plate recognition with template matching for automatic granting, Proceedings of the 8th International Research Conference, KDU, November 2015.
[14] M. Nejati, A. Majidi, M. Jalalat, (Dec. 2015) License plate recognition based on edge histogram analysis and classifier ensemble, Signal Processing and Intelligent Systems Conference (SPIS), 2015, pp. 16-22.
[15] D. Goswami, R. Gaur, (November 2014) Automatic License Plate Recognition System using Histogram Graph Algorithm, International Journal on Recent and Innovation Trends in Computing and Communication, 2(11), pp. 3521-3527.
[16] A. Rosebrock, k-NN classifier for image classification - PyImageSearch, PyImageSearch, 2016. [Online]. Available: http://www.pyimagesearch.com/2016/08/08/k-nn-classifier-for-image-classification/. [Accessed: 03-Jun-2017].
[17] S. S. Tabrizi and N. Cavus, A Hybrid KNN-SVM Model for Iranian License Plate Recognition, Procedia Computer Science, 102, 2016, pp. 588-594.
[18] E. S. Gopi and E. S. Sathya, SVM approach to number plate recognition and classification system, Proceedings of 2005 International Conference on Intelligent Sensing and Information Processing, 2005.
[19] A. P. Nagare, License Plate Character Recognition System using Neural Network, International Journal of Computer Applications, 25(10), July 2011, pp. 0975-8887.
[20] H. E. Kocer and K. K. Cevik, Artificial neural networks based vehicle license plate recognition, Procedia Computer Science, 3, 2011, pp. 1033-1037.
[21] C. Pardeshi and P. Rege, Morphology Based Approach for Number Plate Extraction, the International Conference on Data Engineering and Communication Technology, Advances in Intelligent Systems and Computing, Singapore, 2017.
[22] E. Oyallon and J. Rabin, An Analysis of the SURF Method, Image Processing On Line (IPOL), 2015. [Online]. Available: http://www.ipol.im/pub/art/2015/69/. [Accessed: 02-Jun-2017].
[23] M. I. Khalil, Car plate recognition using the template matching method, International Journal of Computer Theory and Engineering, 2(5), October 2010, pp. 70-79.
[24] C. Jia, B. Xu, An improved entropy-based endpoint detection algorithm, Institute of Automation, Chinese Academy of Sciences, Beijing, 2017.

Automatic Extraction of Arabic Number from Egyptian ID Cards

Ann Nosseir
Institute of National Planning (INP) & British University in Egypt (BUE)
ICS Department, Cairo, Egypt
[email protected]

Omar Adel
British University in Egypt (BUE)
ICS Department, Cairo, Egypt
[email protected]

ABSTRACT
ID checking is a vital process to verify people's identity and to allow entry to different places like universities, check points or banks. This process is usually done by just looking at the ID or writing it down. To improve the check process and make it quick and easy, our work develops a novel system that extracts the Arabic numbers from a picture of the ID.

The proposed algorithm uses morphological operations, more specifically dilation, to maximally eliminate the non-Region Of Interest (ROI) and enhance the ROI. Moreover, the algorithm applies the Speeded Up Robust Features (SURF) algorithm to extract the feature points of each image and the correlation-based template matching technique to recognise the characters.

The approach has been evaluated with 17 ID card pictures taken by amateur people with their mobiles, plus other pictures from the internet. The images were too bright, too dark, or taken from an angle. The algorithm was tested with these pictures and all ID numbers were identified correctly.

CCS Concepts
Image Processing

Keywords
Image recognition; mobile; matching template.

1. INTRODUCTION
ID cards are usually checked in places like university gates, banks, or road check points. In some cases, a person writes down the numbers or takes a picture of the ID. This can take time and increases the risk of missing a number or reading a number wrong because of the handwriting.

Optical Character Recognition (OCR) technology has been used in different applications to recognise numbers and letters in different languages, e.g. English and Arabic. OCR converts images of numbers into digital numbers and characters. Recently, the applicability of OCR has increased to different fields such as invoice imaging, the legal industry, banking, healthcare, etc. [1].

A mobile picture and an OCR system that extracts the number can be a solution to this problem. This could be feasible because, in Egypt, a high percentage of the population carries mobile phones [2].

This paper develops an algorithm that identifies the Arabic numbers on the Egyptian ID cards. It starts with related work in this area and techniques used to solve other problems. Then the details of the proposed algorithm are presented. This is followed by the evaluation, and finally the conclusions and future work are discussed.

2. RELATED WORK
The OCR technique has a number of known steps, namely image acquisition, pre-processing, segmentation, feature extraction and the recognition process. The image acquisition is from either a scanned document or a captured photograph. The pre-processing involves modifying the raw data and correcting deficiencies in the data acquisition process. Segmentation is the process of separating lines, words and characters from an image. Feature extraction is finding a set of features that define the shape of the underlying character as precisely and uniquely as possible. The most important step of the recognition process is the selection of the features to achieve high recognition performance. Different algorithms have been experimented with to extract and recognise the numbers and characters [1].

The Canny edge detector, or optimal detector algorithm, is commonly used to extract the edges of the image because it has a low error rate [3]. Pham et al. used template matching to identify English characters and numbers. The algorithm compares the extracted image with a template of images, shown in Figure 1. It compares pixel by pixel against all images in the template and then selects the picture with the greatest match [4].

Hough Transform, used by Duan et al. [5], is a technique commonly used in the feature extraction process. It works by extracting lines and shapes in an image and it is used in detecting curves in images. The main benefit of the Hough Transform is that it allows the existence of gaps in the boundaries of the image and it is not affected by noise in the image [6]. They tested this algorithm and the accuracy is 98.76%.

Figure 1: Example of template matching [6]

The histogram presents the count of the pixels within each area in the image. A histogram can be used to correct the brightness of the image and it can also be used to balance the image [7]. Figure 2 shows an example of the histogram of a car plate; each character is represented in the graph. Many projects used the histogram in plate recognition [8],[9]. It is effective when dealing with English characters but ineffective when dealing with Arabic characters.

Figure 2: Example of histogram of license plate [8]

K-nearest neighbour [10],[11], Support Vector Machine (SVM) [12], and Artificial Neural Networks (ANN) [13],[14] have shown their ability in pattern recognition and classification. They are well known for building supervised learning models that classify and compare the features of an image with the features of other images. For example, given a set of training patterns, each marked as belonging to one or the other of two categories, SVM and K-nearest neighbour training algorithms build models that assign new examples to one category or the other.

Morphological operators are a collection of non-linear operations related to the shape, or morphology, of features in an image. They rely only on the relative ordering of pixel values. Erosion and dilation are two basic operators in mathematical morphology. The basic effect of the erosion operator on a binary image is to erode away the boundaries of foreground pixels (usually the white pixels); areas of foreground pixels therefore shrink in size, and "holes" within those areas become larger. The basic effect of dilation on binary images is to enlarge the areas of foreground pixels (i.e. white pixels) at their borders; the areas of foreground pixels thus grow in size, while the background "holes" within them shrink [15].

Speeded Up Robust Features (SURF) focuses on blob structures in the image. These structures can be found at corners of objects and at locations where the reflection of light on specular surfaces is maximal (i.e. light speckles). The algorithm detects the features of an image by returning SURF points which contain the features of this image [16].

Cross correlation measures the similarity between two images. A test is made between all images to return a correlation coefficient between images, and the largest correlation number represents the highest matching. Figure 3 shows an example of cross correlation; as shown, the last image gets the largest correlation coefficient, which means it matches [17].

Figure 3: Example of cross correlation operation [17].

End points detection is a technique for identifying features of an image by marking the end points of each character. The number of end points of each character must be known; for example, character (S) has two end points while character (P) has one. This method is not as effective with Arabic numbers and characters because they are more complicated and similar to each other, unlike the English numbers and characters [18].

Figure 4: Example of end points [18]

The Arabic numbers are shown in Figure 5.

Figure 5: Arabic numbers used in the ID.

This paper is based on a combination of thresholding, labelling, a hole-filling approach and the region-props method. Segmentation of the card numbers was achieved by a horizontal ratio method, and the number recognition was accomplished with template matching and cross correlation.

This work proposes an algorithm that recognises the Arabic numbers on the Egyptian ID cards. It tests the algorithm with pictures taken by personal mobiles. Some pictures are not clear because they are either laminated, i.e., have light reflection, or are taken from the side.
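As a hedged MATLAB sketch of the combination just mentioned (thresholding, labelling, hole filling and region properties), under the assumption of a placeholder ID image file; it illustrates the idea only and is not the authors' implementation.

% Threshold, fill holes, label connected components, then measure regions.
idGray = rgb2gray(imread('id_card.jpg'));       % placeholder ID image
bw     = imbinarize(imcomplement(idGray));      % digits become white foreground
bw     = imfill(bw, 'holes');                   % fill holes inside the digits
[labelsImg, numObjects] = bwlabel(bw);          % label each connected object
stats  = regionprops(labelsImg, 'BoundingBox', 'Area');   % per-object measurements
% The objects can then be filtered, e.g. by area or by the horizontal position
% of their bounding boxes, to keep only the digits of the ID number.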

Figure 6. Proposed algorithm for the ID recognition system.

3. PROPOSED SOLUTION
The proposed solution for the ID detection algorithm has five steps. Figure 6 shows these steps, which are image acquisition, pre-processing, segmentation, feature extraction, and character recognition. In the following, these steps are explained.

3.1 Image Acquisition
The algorithm initially uses an image of the Egyptian ID.

3.2 Pre-processing
The pre-processing has five steps that are shown in Figure 7.

Figure 7. Pre-processing steps.

3.2.1 Crop image
The challenge is to include an algorithm that is able to detect the rectangular number region in the image, which is called the Region Of Interest (ROI). A number of cropping ratios were tried; according to the results, the ratio 0.8 gives the best result.

Figure 7. Result of applying the cropping ratio.

3.2.2 Convert to gray scale
The algorithm converts the image from RGB to gray scale. This step is important because it reduces the processing time; the RGB version of the image is more complex than the gray scale version (see Figure 8).

Figure 8. Converted cropped image from RGB to gray scale.

3.2.3 Color reverse
The algorithm converts the zero pixels into ones and the one pixels into zeros; in other words, black and white pixels are reversed. This renders the characters in white and the other objects of the card in black, as shown in Figure 9.

Figure 9. The results of converting the gray scale image to its color reverse.

3.2.4 Binarise the image
Binarising the image clearly divides the image into the target area and the background. The image, whatever its size, gets transformed into a binary matrix of fixed pre-determined dimensions, which establishes uniformity in the dimensions of the input. Hence, it makes the area of characters and numbers, which will be used in the detection, white and any other area black (see Figure 10).

Figure 10. The results of binarising the image.

3.2.5 Morphological operation (dilation)
The last step in pre-processing is dilation. It pursues the goal of removing imperfections by accounting for the form and structure of the image. Dilation is part of the morphological operations and makes the binary pixels clearer and easier to recognise. Figure 11 shows the effect of applying dilation.

Figure 11. Applying the dilation operation.

At that point, the image is ready to pass through the following steps, which are segmentation, feature extraction and recognition. Figure 12 shows the details of these steps.

Figure 12. Details of segmentation, feature extraction, and recognition.

3.3 Segmentation
This step is the most important and difficult step because the accuracy of the feature extraction and the number and character recognition steps depends on it.

3.3.1 Boundaries
Character and number detection is the step of detecting the characters and numbers in the image to be used in the later steps. It marks the boundaries around the objects of the image. To achieve this task, we use the bwboundaries() function in MathWorks MATLAB software to mark the exterior boundaries of the image in red as well as the holes inside the objects in green, as shown in Figure 13.

Figure 13. The image after running the boundaries function.

3.3.2 Boundary boxes
The second step in segmentation is applying boundary boxes around the characters to crop, as shown in Figure 14.

Figure 14. The step of applying the boundary boxes.

3.4 Feature Extraction
There are many factors which make this step difficult, such as the small dots in the image, the frame, and the small English characters; these objects end up in boundary boxes as well (see Figure 14). Therefore, the following steps are applied to enhance the feature extraction step.

3.4.1 The ratio to crop
To extract the Arabic characters and numbers, we compute the ratio of each object by calculating the number of white pixels and dividing it by the number of black pixels. The ratio for characters is between 0.5 and 1.9, so objects smaller or larger than these values are neglected.

3.4.2 Crop
The characters and numbers are cropped based on this ratio.

3.4.3 SURF
In this step, we extract the features of each cropped character and number using the SURF feature, which is a function in MATLAB that returns the feature points of each image.

3.5 Recognition
Now we have the features of the cropped numbers from the image. Before getting to the recognition step, we define a dataset with all the representations of the Arabic numbers to compare with the ID numbers in the recognition step.
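A small MATLAB sketch of how such a digit-template dataset could be prepared so that every template shares the properties of the cropped ID digits (grey scale, binary, common size) is given below. The folder name, the 42-by-24 template size and the file-name labelling are illustrative assumptions rather than the authors' implementation.

% Build a normalised template set for the Arabic digits 0-9.
files     = dir(fullfile('digit_templates', '*.png'));   % assumed template folder
templates = cell(1, numel(files));
labels    = cell(1, numel(files));
for k = 1:numel(files)
    img = imread(fullfile(files(k).folder, files(k).name));
    if size(img, 3) == 3                       % color templates are converted first
        img = rgb2gray(img);
    end
    bw           = imbinarize(img);            % same binary form as the cropped digits
    templates{k} = imresize(double(bw), [42 24]) > 0.5;   % assumed common template size
    labels{k}    = files(k).name(1);           % e.g. '3_sampleA.png' -> label '3' (assumed naming)
end

These templates can then be matched against a cropped ID digit with corr2, exactly as in the recognition step described next.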

The recognition step is the last step in the automatic number recognition. To compare the extracted numbers with the different representations of the numbers, the properties of the numbers' images in the dataset are adjusted through the following steps. First, the numbers in the dataset are converted to grey scale and then to binary. They are all then resized to the same size as the cropped images, and the last step is to extract the features of each image in the dataset by using SURF features in MATLAB (see Figure 15).

Figure 15. The numbers recognition steps.

To compare the numbers extracted from the ID against the ones in the dataset, cross correlation is used. It measures the similarity between two functions [17]. Cross correlation in MATLAB is defined by z = corr2(x, y). Let x be the cropped image of the number from the ID and let y be one of the images of a number from the dataset; see Equation (1).

Cross-correlation(Image1, Image2) = Σ(x,y) Image1(x, y) * Image2(x, y)    (1)

The function returns the correlation coefficient z between x and y (for example, 0.3738). This function is placed in a for loop over the size of the dataset, so that it loops over all the images in the dataset and returns the one with the largest correlation coefficient as the best matched number.

Figure 16: An example of the results of comparing an ID and the dataset numbers.

As shown in Figure 16, the result of the recognition step shows cropped numbers and their corresponding matches based on the similarity of the features. The red character is the one which was detected on the ID and the white one is the one matched from the dataset of numbers.

4. EVALUATION

Figure 17: The images in the dataset.

We have asked volunteers to take pictures of 17 IDs, shown in Figure 17. The pictures are not the same size; some reflect light and are too bright, others are too dark or taken from an angle. The algorithm managed to recognize them all.

5. CONCLUSIONS
In this work, we have presented an algorithm to recognize Arabic numbers on ID cards. The algorithm consists of pre-processing, character detection, feature extraction, and character recognition. To improve the accuracy of the detection of the Arabic numbers on Egyptian IDs, this work has done the following. In the image crop step, we have tested the Region of Interest (ROI) threshold to accurately detect the rectangular number region. To extract the features of each detected character, the Speeded Up Robust Features (SURF) feature is used to return the feature points of the character; from the results, this proved to be more efficient than using the histogram algorithm.

Cross correlation is used to measure the similarity between two functions. Before applying it, the images in the dataset are converted into grey scale and then to binary to have the same properties as the cropped character.

The results show the potential of this algorithm. It identifies all of the 17 different ID card images, including ones that are laminated or taken from the side.

6. REFERENCES
[1] A. Mir Asif, S. Abdul Hannan, Y. Perwej, M. Vithalrao, An overview and applications of optical character recognition, International Journal of Advance Research In Science and Engineering, IJARSE, 3(7), July 2014, pp. 261-267.
[2] "88.1 pct of Egyptian families own mobile phones: CAPMAS", http://english.ahram.org.eg/NewsContent/3/12/236203/Business/Economy/-pct-of-Egyptian-families-own-mobile-phones-CAPMAS.aspx. [Accessed: Nov. 2017].
[3] J. Canny, (June 1986) A Computational Approach to Edge Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), pp. 679-698.
[4] Iq. Pham, R. Jalovecky, M. Polasek, (Sept. 2015) Using template matching for object recognition in infrared video sequences, Digital Avionics Systems Conference (DASC), 2015 IEEE/AIAA 34th, pp. 13-17.
[5] T. D. Duan, D. A. Duc, and T. L. H. Du, Combining Hough transform and contour algorithm for detecting vehicles' license plates, in International Symposium on Intelligent Multimedia, Video and Speech Processing, 2004, pp. 747-750.
[6] S. Banerjee, D. Mitra, Automatic number plate recognition system: a histogram based approach, IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE), 11(1) Ver. IV (Jan.-Feb. 2016), pp. 26-32.
[7] L. Weerasinghe, T. Tennegedara, and T. Jinasena, Histogram based number plate recognition with template matching for automatic granting, Proceedings of the 8th International Research Conference, KDU, November 2015.
[8] M. Nejati, A. Majidi, M. Jalalat, (Dec. 2015) License plate recognition based on edge histogram analysis and classifier ensemble, Signal Processing and Intelligent Systems Conference (SPIS), 2015, pp. 16-22.
[9] D. Goswami, R. Gaur, (November 2014) Automatic License Plate Recognition System using Histogram Graph Algorithm, International Journal on Recent and Innovation Trends in Computing and Communication, 2(11), pp. 3521-3527.
[10] A. Rosebrock, k-NN classifier for image classification - PyImageSearch, PyImageSearch, 2016. [Online]. Available: http://www.pyimagesearch.com/2016/08/08/k-nn-classifier-for-image-classification/. [Accessed: 03-Jun-2017].
[11] S. S. Tabrizi and N. Cavus, A Hybrid KNN-SVM Model for Iranian License Plate Recognition, Procedia Computer Science, 102, 2016, pp. 588-594.
[12] E. S. Gopi and E. S. Sathya, SVM approach to number plate recognition and classification system, Proceedings of 2005 International Conference on Intelligent Sensing and Information Processing, 2005.
[13] A. P. Nagare, License Plate Character Recognition System using Neural Network, International Journal of Computer Applications, 25(10), July 2011, pp. 0975-8887.
[14] H. E. Kocer and K. K. Cevik, Artificial neural networks based vehicle license plate recognition, Procedia Computer Science, 3, 2011, pp. 1033-1037.
[15] C. Pardeshi and P. Rege, Morphology Based Approach for Number Plate Extraction, the International Conference on Data Engineering and Communication Technology, Advances in Intelligent Systems and Computing, Singapore, 2017.
[16] E. Oyallon and J. Rabin, An Analysis of the SURF Method, Image Processing On Line (IPOL), 2015. [Online]. Available: http://www.ipol.im/pub/art/2015/69/. [Accessed: 02-Jun-2017].
[17] M. I. Khalil, Car plate recognition using the template matching method, International Journal of Computer Theory and Engineering, 2(5), October 2010, pp. 70-79.
[18] C. Jia, B. Xu, An improved entropy-based endpoint detection algorithm, Institute of Automation, Chinese Academy of Sciences, Beijing, 2017.

Automatic Identification and Classifications for Fruits Using k-NN

Ann Nosseir
Institute of National Planning (INP) & British University in Egypt (BUE)
ICS Department, Cairo, Egypt
[email protected]

Seif Eldin Ashraf Ahmed
British University in Egypt (BUE)
ICS Department, Cairo, Egypt
[email protected]

ABSTRACT
Most fruit recognition techniques combine different analysis methods such as color-based, shape-based, size-based and texture-based methods. This work classifies fruits using features based on the color RGB values and on texture values from the first-order statistics and the second-order statistics of the Gray Level Co-occurrence Matrix (GLCM). It applies different classifiers: Fine K-NN, Medium K-NN, Coarse K-NN, Cosine K-NN, Cubic K-NN, and Weighted K-NN. The accuracy of each classifier is 96.3%, 93.8%, 25%, 83.8%, 90%, and 95% respectively. The system is evaluated with 46 images taken by amateur photographers of the fruits in season at the time, namely strawberry, apple and banana; 100% of these pictures were recognised correctly.

CCS Concepts
Image Processing

Keywords
Image Recognition; K-NN classifier.

1. INTRODUCTION
Fruits and vegetables classification is important in the domain of robotic fruit harvesting [1], [2], to check the progress of crop harvesting [3], to detect diseased crops [4], and to recognise the difference between vegetable and fruit types [5].

In the supermarket, a paper-based barcode labels the fruit type and its cost. However, customers cannot tell the cost until they reach the cashier. The paper-based barcodes could be partially damaged, which may lead to more work to find out the price, errors in identifying the price, and consequently customer frustration [6]. Adapting a camera at the supermarket scanner that identifies fruits and vegetables based on color and texture can provide a solution. The visually impaired and blind cannot easily identify fruit types while shopping; an assistant app that tells them the fruit types would be of use [7].

An image processing algorithm goes through a number of steps. It starts with image acquisition, then image pre-processing. Next it extracts the features of the image, which is followed by applying a classification method, and it ends up with identifying the object. Research applies different algorithms in each step to improve the recognition accuracy [8],[9].

The presented work classifies fruits using features based on color and texture. The RGB values and the first-order and second-order statistics of the Gray Level Co-occurrence Matrix (GLCM) are used to extract the features of each fruit. The system applies and compares different classifiers such as Fine K-NN, Medium K-NN, Coarse K-NN, Cosine K-NN, Cubic K-NN, and Weighted K-NN to get the best accuracy.

This paper starts with reviewing the literature in the area of automatic classification of fruits and vegetables for different purposes. The following section describes the details of the algorithms. In section three, the results of the experiment are presented and finally the conclusions and future work are discussed.

2. RELATED WORK
Image recognition goes through basic known steps, namely image acquisition, pre-processing, feature extraction and classification. The image acquisition is from captured photographs. The pre-processing involves modifying the raw data and correcting deficiencies in the data acquisition step. Feature extraction is finding a set of features that define the color or texture of the fruits or vegetables to uniquely identify the object. The most important step is a classifier technique such as Support Vector Machine (SVM), Artificial Neural Networks (ANN), or K-Nearest Neighbor (k-NN), to identify the object accurately. Different algorithms have been developed and tested to extract and recognise fruits and vegetables accurately [6].

Kaur and Sharma [10] developed a system that classifies whether lemon fruits are defective. The classification is based on color, shape and size features and uses an ANN for classification.

Patel et al.'s [3] work detects fruits and counts them. A few techniques like a Gaussian low-pass filter, HLS, conversion of the RGB color space into L*a*b, and binarisation are used in the pre-processing. In the segmentation, the Sobel operator is used to detect the fruit, and orthogonal least squares is used to extract the features. The accuracy was calculated based on the difference between the manually counted fruits and the fruits counted by the algorithm; out of 100 images of apple, pomegranate, orange, peach, plum and litchi, 98% were recognised accurately.

Zawbaa et al. [11] present an automatic fruit recognition system for classifying and identifying orange, strawberry, and apple. The dataset has 178 fruit images. They compared extracting different features of shape and color, and classifiers such as k-NN and SVM. Their system was evaluated using 46 orange pictures and 55 strawberry pictures. The evaluation was

based on grouping the fruits into three groups: apple and orange are similar in shape and different in color; apple and strawberry are similar in color and different in shape; orange and strawberry are different in both shape and color. The system records the highest accuracy for the third group when it combines features of color and shape and uses SVM.

Goldenstein et al. [6] worked on an algorithm to automatically classify fruits and vegetables. The image dataset contained 15 produce categories and 2633 images collected on-site over a period of 5 months. All features are simply concatenated and fed independently to each classification algorithm, which gives the class a binary representation. The accuracy of their system is 85%.

Ninawe and Pandey [12] presented an algorithm that recognises the fruit type. The classification is based on color, shape and size. They used 36 pictures of 6 fruits, namely red apple, green banana, green guava, green melon, orange, and watermelon, and a k-NN classifier. The accuracy reached 95%.

3. PROPOSED SOLUTION
The solution proposed in this paper is an algorithm that classifies different types of fruits. The algorithm has four steps. Figure 1 shows these steps, which are image acquisition, pre-processing, feature extraction, and classification. In the following, these steps are explained.

3.1 Image Acquisition
The algorithm classifies four different types of fruits, i.e., apples, mango, banana, and strawberry. The images are from COFI-Lab [13], Shutter-Stock [14], I-Stock [15] and other sources.

3.2 Pre-processing
The pre-processing has three steps that are shown in Figure 2.

Figure 2. Pre-processing steps.

3.2.1 Convert to gray scale
The algorithm converts the image to gray scale. This step is important because it reduces the processing time; the RGB version of the image is more complex than the gray scale version. Figure 3 gives a sample of the grayscale image.

Figure 3. Converted image to gray scale.

3.2.2 Enhance contrast
Enhancing the contrast is used to reduce noise and improve the quality of the image. The basic idea behind this image processing technique is to make details more obvious or to simply highlight certain features of interest in an image [16]. Contrast is determined by the difference in the color and brightness of the object compared with other objects. Figure 4 shows the results of applying this algorithm.

Figure 4. The results of contrast enhancement.

3.2.3 Median filtering
Median filtering is a nonlinear method used to remove noise from images. It is widely used because it is very effective at removing noise while preserving edges, and it is particularly effective at removing "salt and pepper" noise. The median filter works by moving through the image pixel by pixel, replacing each value with the median value of the neighboring pixels [16] (see Figure 5).

Figure 5. The results of median filtering.

Now the image is ready to pass through the following steps, which are feature extraction and classification. Figure 6 shows the details of these steps.

Figure 6. Steps of segmentation, feature extraction, and classification.

3.3 Feature Extraction
3.3.1 Color features
The algorithm starts by extracting the histogram color values. This is a representation of the distribution of colors in an image. A color histogram represents the number of pixels that have colors in each of a fixed list of color ranges (see Figure 7).

3.3.2 Texture features
This step is the most important and difficult step, because the accuracy of texture feature extraction depends on it. The Canny edge detection algorithm is applied to the output image. Canny edge detection is responsible for reducing the amount of data, detecting useful regions in the image and detecting the range of edges in the image [16]. From the Canny edge results, the texture feature information is calculated. These are the first-order statistical features, the higher-order statistical features, and the Fast Fourier Transform. Figure 7 shows an example of these values.

3.3.2.1 First-order statistical features
The first-order statistics depend on the pixel values and compute the mean, standard deviation, smoothness and entropy. The mean measures the average intensity: if the regions in the image have a high grey level, the mean will be high, and if they have a low grey level, the mean will be low. The standard deviation measures the average contrast of an image; it describes the distribution of the grey-level regions in the image. If the pixels are distributed over a wide range, the standard deviation will be high. Smoothness is related to the standard deviation: if the standard deviation is high, the smoothness will be low, and vice versa. Entropy measures randomness [16].

3.3.2.2 Higher-order statistical features
These depend on both the pixel values and the relationship between the pixel values. Gray Level Co-occurrence Matrix (GLCM) features are computed from pairs of pixel values (grey levels) separated by a specific angle and distance. The GLCM yields the higher-order statistical features, which are the correlation, contrast, energy, and homogeneity values of the image. Correlation measures how correlated a pixel is to its neighbors; its value ranges from 1 to -1. Contrast measures the intensity contrast between a pixel and its neighbors. Energy returns the sum of squared elements in the co-occurrence matrix. Homogeneity returns a value that measures the closeness of the distribution of elements [16].

3.3.2.3 Fast Fourier Transform
The Fourier Transform is an important image-processing tool used to decompose an image into its sine and cosine components. It accesses the geometric characteristics of a spatial-domain image. Because the image in the Fourier domain is decomposed into its sinusoidal components, it is easy to examine or process certain frequencies of the image, and thus influence the geometric structure in the spatial domain. The output of the transformation represents the image in the Fourier or frequency domain, while the input image is the spatial-domain equivalent. In the Fourier-domain image, each point represents a particular frequency contained in the spatial-domain image [17]. The values of the color and texture features are used in the following step to identify the different fruits.
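To make the pipeline concrete, a minimal sketch of the pre-processing and feature-extraction steps above is given next. It is not the authors' implementation: scikit-image and NumPy are assumed library choices, the function name is hypothetical, and the histogram bin counts and GLCM distance/angle are illustrative parameters (the GLCM helpers are named greycomatrix/greycoprops in older scikit-image releases).

import numpy as np
from skimage import io, color, filters, feature
from skimage.feature import graycomatrix, graycoprops

def extract_fruit_features(path):
    """Illustrative color + texture feature vector for one fruit image."""
    img = io.imread(path)
    gray = color.rgb2gray(img)                       # float image in [0, 1]
    gray_u8 = (gray * 255).astype(np.uint8)

    # Pre-processing: median filtering to suppress salt-and-pepper noise
    gray_u8 = filters.median(gray_u8)
    gray = gray_u8 / 255.0

    # Color feature: per-channel color histogram (16 bins per channel is arbitrary)
    color_hist = np.concatenate(
        [np.histogram(img[..., c], bins=16, range=(0, 255))[0] for c in range(3)])

    # Canny edge map, the basis of the texture analysis
    edges = feature.canny(gray)

    # First-order statistics: mean, standard deviation, smoothness, entropy
    mean, std = gray.mean(), gray.std()
    smoothness = 1.0 - 1.0 / (1.0 + std ** 2)
    hist, _ = np.histogram(gray, bins=256, range=(0, 1))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))

    # Higher-order statistics from the GLCM (distance 1, angle 0 are illustrative)
    glcm = graycomatrix(gray_u8, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    glcm_feats = [graycoprops(glcm, prop)[0, 0]
                  for prop in ("correlation", "contrast", "energy", "homogeneity")]

    # Frequency-domain descriptor from the FFT magnitude spectrum
    fft_mag = np.abs(np.fft.fftshift(np.fft.fft2(gray)))
    fft_energy = float(np.log1p(fft_mag).mean())

    return np.hstack([color_hist, mean, std, smoothness, entropy, glcm_feats,
                      fft_energy, edges.mean()])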


Figure 7. Example of Canny edge detection and the extracted feature values.

3.4 Classification
The k-nearest neighbor algorithm (k-NN) is a non-parametric method used for classification and regression [17]. The nearer neighbors contribute more to the decision than the more distant ones. The feature table produced in the earlier step can be thought of as the training set for the algorithm, although no explicit training step is required. With this information, six types of k-NN classifier were trained to identify the most accurate one: Fine k-NN, Medium k-NN, Coarse k-NN, Cosine k-NN, Cubic k-NN, and Weighted k-NN. Fine k-NN makes detailed distinctions between classes, with the number of neighbors set to one. Medium k-NN makes fewer distinctions than Fine k-NN, with the number of neighbors set to ten. Coarse k-NN makes coarse distinctions between classes, with the number of neighbors set to one hundred. Cosine k-NN uses the cosine distance metric, Cubic k-NN uses the cubic distance metric, and Weighted k-NN uses distance weighting. Figures 8-13 show the confusion matrices for each classifier. The training accuracies of Fine k-NN, Medium k-NN, Coarse k-NN, Cosine k-NN, Cubic k-NN, and Weighted k-NN are 96.3%, 93.8%, 25%, 83.8%, 90%, and 95%, respectively (see Figures 8-13).

Figure 8: Confusion matrix represents the accuracy of Fine k-NN (96.3%).
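The six variant names match the presets of MATLAB's Classification Learner; a rough scikit-learn equivalent is sketched here as an assumption rather than the authors' setup. X and y stand for the feature table and fruit labels from the previous step, and the choice of ten neighbors for the Cosine, Cubic and Weighted variants is illustrative.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn_variants = {
    "Fine k-NN":     KNeighborsClassifier(n_neighbors=1),
    "Medium k-NN":   KNeighborsClassifier(n_neighbors=10),
    "Coarse k-NN":   KNeighborsClassifier(n_neighbors=100),
    "Cosine k-NN":   KNeighborsClassifier(n_neighbors=10, metric="cosine"),
    "Cubic k-NN":    KNeighborsClassifier(n_neighbors=10, p=3),          # cubic (Minkowski p=3) distance
    "Weighted k-NN": KNeighborsClassifier(n_neighbors=10, weights="distance"),
}

def compare_knn_variants(X, y, folds=5):
    """Print cross-validated accuracy for each k-NN variant on the feature table."""
    for name, clf in knn_variants.items():
        acc = cross_val_score(clf, X, y, cv=folds).mean()
        print(f"{name:14s} accuracy: {acc:.3f}")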

Figure. 9: Confusion matrix represents the accuracy of Medium k-NN (93.8%).


Figure 10: Confusion matrix represents the accuracy of Coarse k-NN (25%).

Figure 14: Sample of the pictures taken by volunteers.

4. EVALUATION
We asked volunteers to take pictures of the seasonal fruits, i.e., apple, strawberry and banana. They took four in each category, 46 pictures in total. The pictures are not the same size; some reflect light (illumination) or are blurred, and others were taken from different viewpoints. The algorithm managed to recognise them all. Figures 15, 16, and 17 show the results.

Figure 11: Confusion matrix represents the accuracy of Cubic k-NN (90%).

Figure. 15: The images in the dataset

Figure 12: Confusion matrix represents the accuracy of Cosine k-NN (83.8%).

Figure. 16: The images in the dataset

Figure 13: Confusion matrix represents the accuracy of Weighted k-NN (95%).

Figure 17: The images in the dataset

5. CONCLUSIONS
In this work, we have presented an algorithm to classify four types of fruits, namely mango, strawberry, apple and banana. The algorithm follows these steps: pre-processing, segmentation, feature extraction, and classification.

To improve the accuracy of the detection, the color and the texture features were extracted using the GLCM and the FFT. Six k-NN classifiers were used and their results were compared; the Fine k-NN has the best accuracy of 96.3%. The developed algorithm was tested with 46 pictures, taken by amateur photographers, of the fruits in season at the time, namely strawberry, apple and banana. 100% of these pictures were recognised correctly.

6. REFERENCES
[1] "Apple-Picking Robot Prepares to Compete for Farm Jobs", MIT Technology Review, https://www.technologyreview.com/s/604303/apple-picking-robot-prepares-to-compete-for-farm-jobs/ [accessed Jan. 2018].
[2] S. Nandyal and M. Jagadeesha, "Crop Growth Prediction Based on Fruit Recognition Using Machine Vision", International Journal of Computer Trends and Technology (IJCTT), 4(9), September 2013, pp. 3132-3138.
[3] H. N. Patel and A. D. Patel, "Automatic Segmentation and Yield Measurement of Fruit using Shape Analysis", International Journal of Computer Applications, 45(7), May 2012, pp. 19-24.
[4] R. Linker, O. Cohen and A. Naor, "Determination of the number of green apples in RGB images recorded in orchards", Computers and Electronics in Agriculture, 81, 2012, pp. 45-57.
[5] R. Zhou, L. Damerow and M. M. Blanke, "Recognition Algorithms for Detection of Apple Fruit in an Orchard for early yield Prediction", Precision Agriculture, 13(5), 2012, pp. 568-580.
[6] A. Rocha, D. Hauagge, J. Wainer and S. Goldenstein, "Automatic fruit and vegetable classification from images", Computers and Electronics in Agriculture, 70(1), 2010, pp. 96-104.
[7] T. Chowdhury, S. Alam, M. Hasan and I. Khan, "Vegetables detection from the glossary shop for the blind", IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE), 8(3), Nov.-Dec. 2013, pp. 43-53.
[8] S. Mahalakshmi, H. Srinivas, S. Meghana and C. Sai Ashwini, "Identification and Classification Techniques: A Review Using Neural Networks Approach", International Journal of Advanced Research in Computer and Communication Engineering, 4(12), December 2015, pp. 234-241.
[9] W. Seng, "A New method for fruits recognition system", International Conference on Electrical Engineering & Informatics, Selangor, 2009.
[10] A. Ritika and S. Kaur, "Contrast Enhancement Techniques for Images: A Visual Analysis", International Journal of Computer Applications, 64(17), February 2013, pp. 0975-8887.
[11] H. M. Zawbaa, M. Abbass, M. Hazman and A. Hassenian, "Automatic Fruit Image Recognition System Based on Shape and Color", AMLTA, Springer International Publishing Switzerland, 2014, 488, pp. 278-290.
[12] P. Ninawe and S. Pandey, "A Completion on Fruit Recognition System Using K-Nearest Neighbors Algorithm", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 3(7), July 2014, pp. 2352-2360.
[13] Cofilab, www.cofilab.com/ [accessed Jan. 2018].
[14] Shutterstock, shutterstock.com/ [accessed Jan. 2018].
[15] iStock, istock.com/ [accessed Jan. 2018].
[16] D. H. Ballard and C. M. Brown, Computer Vision, Prentice Hall, 1982.
[17] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression", The American Statistician, 46(3), 1992, pp. 175-185.

Computer Aided Diagnosis System for Liver Cirrhosis Based on Ultrasound Images

Reham Rabie1, Mohamed Meselhy Eltoukhy2, Mohammad al-Shatouri3, Essam A. Rashed1,4 1 Image Science Lab., Dept. of Math., Faculty of Science, Suez Canal University, Ismailia, Egypt 2 Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt 3 Dept. of Radiology, Faculty of Medicine, Suez Canal University, Ismailia, Egypt 4 Faculty of Informatics and Computer Science, The British University in Egypt, Cairo, Egypt Email: [email protected]

ABSTRACT
This work introduces a computer-aided diagnosis (CAD) system for diagnosing liver cirrhosis in ultrasound (US) images. The proposed system uses a set of features obtained from different feature extraction methods. These features are the first order statistics (FOS), the fractal dimension (FD), the gray level co-occurrence matrix (GLCM), the Gabor filter (GF), the wavelet (WT) and the curvelet (CT) features. The measured features are presented to two different classifiers, the support vector machine (SVM) and the k-nearest neighbors (K-NN) classifier. The proposed system is applied on a dataset consisting of 72 cirrhosis and 75 normal regions, each of 128×128 pixels. The classification accuracy rates are calculated using a 10-fold cross validation. A correlation-based feature selection (CFS) is used, resulting in better accuracy predictions. The results showed that the SVM and K-NN classifiers achieved higher performance with the combination of the wavelet and curvelet feature vectors than with the other feature extraction methods.

CCS Concepts
• Computing methodologies➝Image processing.

Keywords
Computer aided diagnosis (CAD); ultrasound images; liver diseases; feature extraction; classification.

ICSIE '18, May 2-4, 2018, Cairo, Egypt. © 2018 Association for Computing Machinery. ISBN 978-1-4503-6469-0/18/05. DOI: https://doi.org/10.1145/3220267.3220283

1. INTRODUCTION
The liver is one of the biggest organs of the human body. It constitutes 2.5% of the human body weight [1]. The liver's main job is to filter the blood coming from the digestive tract before passing it to the rest of the body. Cirrhosis is long-term damage to the liver from several potential diseases that leads to permanent scarring; the liver becomes unable to function well. Cirrhosis develops when scar tissue replaces normal, healthy tissue in the liver. It happens after the healthy cells are damaged over a long period of time, usually many years. Early detection and quantification is important to reduce the mortality caused by liver diseases.

Ultrasound (US) imaging is one of the most widely utilized noninvasive and real-time diagnostic methods. It is used to guide doctors in diagnosing diffuse liver diseases. An ultrasound of the liver uses high-frequency sound waves to create a live image from inside a patient's body. It is a painless test that is very commonly used in the medical field. A computer-aided diagnosis (CAD) system helps radiologists to diagnose the US images [2]. A CAD system usually consists of segmentation of the liver, extraction of features, and finally identification of tissues by means of a classifier. Texture analysis of ultrasound liver images is always a challenge for researchers.

Virmani et al. [3] introduced a CAD system for categorizing the liver into normal, cirrhotic and hepatocellular carcinoma (HCC) using wavelet packet transform (WPT) texture descriptors; they extracted statistical features such as the standard deviation, energy and mean. An accuracy of 88.8% was achieved using an SVM classifier. Ahmadian et al. [4] proposed a Gabor wavelet texture feature extraction method for categorizing different liver diseases. Features were extracted and images were classified into normal, cirrhosis and hepatitis groups using the Gabor wavelet transform, the dyadic wavelet transform and statistical moments. The results showed a sensitivity of 85% in the distinction between normal and hepatitis liver images, and 86% in distinguishing between normal and cirrhosis liver images. They concluded that the Gabor wavelet is more appropriate than dyadic wavelet and statistically based methods.

Ribeiro et al. [5] identified and classified different stages of chronic liver disease. The classifiers used are SVM, K-NN, and decision tree. The best results were obtained using SVM with a radial-basis kernel, with a 73.20% overall accuracy rate. Lee et al. [6] suggested feature extraction methods based on fractal geometry and spatial-frequency decomposition. Accuracies of 93.6% in the distinction of cirrhosis and hepatoma and 96.7% in the distinction of normal and abnormal liver were obtained by utilizing a Bayes classifier. Recently, Lee [7] proposed an ensemble-based data fusion strategy to differentiate normal, hepatoma and cirrhosis. The algorithm chose a satisfactory classifier with high recognition rate and diversity; an accuracy of 95.67% was accomplished. Cao et al. [8] extracted liver features using 2D phase congruency to differentiate among normal, cirrhosis and fibrosis of the liver. Accuracies of 96.27% for normal liver, 95% for cirrhosis and 86.6% for fibrosis were accomplished. Wu et al. [9] proposed a two-stage feature fusion strategy to characterize ultrasonic liver tissue images in three classes: normal, hepatitis and cirrhosis.

GLCM, multi-resolution energy features and multi-resolution fractal features were extracted. The resulting fused feature set was utilized in an SVM. Accordingly, an accuracy of 96.25% ± 1.91 was achieved.

The rest of this paper is organized as follows. Section 2 presents the image segmentation for the CAD system, the feature extraction methods, the feature selection method and the classification methods. Section 3 describes the performance evaluation of the CAD system. Section 4 summarizes the experimental tests, presents the classification results and discusses the results obtained using the proposed system. Finally, conclusions and potential future research directions are given in Section 5.

2. METHODOLOGY
The proposed system consists of four different steps: region of interest (ROI) identification, feature extraction, feature selection and classification, as described in the following.

2.1 Segmentation (ROI Identification)
The segmentation method decomposes a US image into small regions for further examination. Liver infections can be partitioned into two classes: diffuse infections (e.g. hepatitis, cirrhosis) and focal infections (e.g. hepatoma, hemangioma). In diffuse infections the abnormality is dispersed throughout the liver volume, whereas in focal infections the irregularity is concentrated in a small zone of liver tissue. Accordingly, in diffuse infections the segmented region can be any area of the image; preferably, a physician determines the region that contains the disease. In this work, a dataset consisting of 72 ROIs for cirrhosis and 75 ROIs for normal tissue is used. From each image, an ROI of 128×128 pixels is cropped manually (see Fig. 1).

Fig. 1. Liver representation of cirrhosis (left) and normal condition (right).

2.2 Features Extraction
For each ROI, several features have been derived from the FOS, GLCM, FD, GF, WT and CT. The first order statistics (FOS) deal with the data extracted from isolated pixels. The advantage of this method is that the features are extracted quickly. The features extracted from FOS are the mean, average, energy, entropy, skewness and kurtosis [10, 11]. The gray level co-occurrence matrix extracts features based on the intensity values of pixel pairs that differ in rotational angle (0°, 45°, 90°, 135°) and distance to the neighbor pixel [12].

The fractal dimension is used to describe geometric shapes that exhibit self-similarity. The fractal dimension is calculated by the box dimension strategy. The features extracted from FD are the mean, standard deviation and lacunarity [13]. The Gabor filter decomposes a single image using a linear combination of various frequencies and angles. In the 2D Gabor filter, a number of Gabor filter banks are applied on the images, which are filtered by varying the wavelengths and the orientation angles; the result is then evaluated and compared for each Gabor filter bank [14].

The wavelet transform is a multi-scale analysis of an image. It has only three directions: horizontal, vertical, and diagonal. The 2D wavelet transform is the combination of two 1D wavelet transforms: first a 1D wavelet transform is applied along the rows, and afterwards a 1D wavelet transform is applied along the columns. The wavelet image decomposition provides a representation that is simple to interpret [15]. Each sub-image contains data of a specific scale and orientation, which is helpfully isolated, and spatial data is retained inside the sub-images.

The curvelet transform is a multi-scale and multi-directional analysis that evolved from the wavelet transform; it comprises a number of sub-bands at various scales consisting of various orientations in the frequency domain. From all these sub-bands the statistical features are computed [16-18]. Table 1 summarizes the number of features extracted from each feature extraction method.

2.3 Features Selection
The feature selection (FS) method can assess individual features and rank them based on their correlation with the classes. Feature selection can be utilized for reducing the computation time and the processing complexity and for improving the performance of the CAD system. This work applied the correlation-based feature selection (CFS) approach to select the most significant features.

CFS is a straightforward filter algorithm that ranks feature subsets according to a correlation-based heuristic evaluation function. The bias of the evaluation function is toward subsets that contain features that are highly correlated with the class, yet uncorrelated with each other [19]. CFS's feature subset evaluation function is

$$\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (1)$$

where $\mathrm{Merit}_S$ is the heuristic "merit" of a feature subset $S$ containing $k$ features, $\overline{r_{cf}}$ is the mean feature-class correlation ($f \in S$), and $\overline{r_{ff}}$ is the average feature-feature inter-correlation.

2.4 Classification
Image classification analyzes the numerical properties of different image features and organizes the data into classes. The methods used to classify the liver tissue are SVM and K-NN. SVM uses the input features to determine a maximum-margin hyperplane that separates the training data of the two classes. A special kernel function can be used, in case the features are not linearly separable, to transform the data to a higher-dimensional feature space [20, 21]. On the other hand, the K-NN classifier uses the distances to the k nearest neighbors to make the decision of class attribution; the classifier is based on the majority vote of the k nearest neighbors of a test sample. In the present work, the Euclidean distance metric is used [22].
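Returning to the feature selection step, a small sketch of the merit of Eq. (1) follows. It is only illustrative: the function name is hypothetical, and plain Pearson correlations are substituted for the symmetrical-uncertainty measure used in the original CFS formulation of [19].

import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of Eq. (1) for the feature columns of X indexed by subset."""
    k = len(subset)
    # mean absolute feature-class correlation, r_cf
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    # mean absolute feature-feature inter-correlation, r_ff
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs]) if pairs else 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)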

3. PERFORMANCE EVALUATION OF THE CAD SYSTEM
To determine the performance of the proposed system, some quality measures need to be calculated. These factors are accuracy, Receiver Operating Characteristic (ROC) curves, F-measure, precision, and the Area Under the Curve (AUC) [23]. The classification results are tabulated in a confusion matrix [24]. The confusion matrix has four outcomes: true positives (TP) are positive cases correctly classified as positive; true negatives (TN) are negative cases correctly identified as negative; false positives (FP) are negative cases incorrectly classified as positive; false negatives (FN) are positive cases classified by the system as negative [25]. The corresponding rates are:

TP rate = TP / (TP + FN)
FP rate = FP / (FP + TN)
TN rate = TN / (TN + FP)
FN rate = FN / (FN + TP)

4. RESULTS AND DISCUSSION
In this work, 147 ROIs are obtained from 72 cirrhosis images and 75 normal tissues. All cases were acquired and provided by our radiologist consultant. A region of interest (ROI) that demonstrates the focus area of liver cirrhosis is marked in each image. The ROI size is 128×128 pixels. A total of 624 features have been extracted from each ROI of the liver ultrasound images, namely 6 using FOS, 14 using GLCM, 7 using GF, 3 using the fractal approach, 108 using WT, and 486 using CT. The correlation-based filter approach has been used for obtaining the most relevant features, which improved the CAD system performance. The obtained features are forwarded to the SVM and K-NN classifiers for evaluation purposes. TP, FP and ROC are measured for each feature extraction method, and combinations of some features are also tested.

The obtained results are illustrated in Tables 2 and 3. The best overall accuracy is achieved using the SVM and K-NN classifiers with the combination of the WT and CT feature vectors. We think that the main reason for this is that the combined WT and CT feature vectors capture more detail and more texture features. We can achieve a 99.31% classification accuracy rate. A comparison demonstrating the performance of each classifier is shown in Figure 2. It is difficult to identify a classifier with higher performance from these results, as the ROC values are very close in almost every case.

Table 1. The number of features extracted using different feature extraction methods

Feature extraction method    Number of features
FOS                          6
GLCM                         14
GF                           7
FD                           3
WT                           108
CT                           486

Table 2. Summary of the performance of the SVM classifier with different sets of features

Feature set    TP      FP      ROC
FOS            0.871   0.129   0.873
GLCM           0.89    0.105   0.893
GF             0.782   0.221   0.781
FD             0.898   0.100   0.899
WT             0.966   0.033   0.966
CT             0.946   0.053   0.946
FOS+GLCM       0.891   0.105   0.893
FOS+GF         0.905   0.095   0.905
FOS+FD         0.966   0.033   0.966
FOS+WT         0.966   0.033   0.966
FOS+CT         0.952   0.046   0.953
GLCM+GF        0.898   0.101   0.898
GLCM+FD        0.918   0.081   0.919
GLCM+WT        0.966   0.033   0.966
GLCM+CT        0.946   0.053   0.946
GF+FD          0.939   0.060   0.939
GF+WT          0.966   0.033   0.966
GF+CT          0.993   0.007   0.993
FD+WT          0.966   0.033   0.966
FD+CT          0.952   0.046   0.935
WT+CT          0.993   0.007   0.993
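For illustration, figures of the kind reported in Tables 2 and 3 (TP rate, FP rate, ROC) could be obtained with a 10-fold cross-validation as sketched here. The library choice and classifier settings are assumptions, not the authors' exact configuration.

import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate_feature_set(X, y):
    """10-fold cross-validated TP rate, FP rate and ROC AUC for SVM and k-NN."""
    results = {}
    classifiers = {"SVM": SVC(kernel="rbf"), "K-NN": KNeighborsClassifier()}
    for name, clf in classifiers.items():
        pred = cross_val_predict(clf, X, y, cv=10)
        tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
        results[name] = {"TP rate": tp / (tp + fn),      # sensitivity
                         "FP rate": fp / (fp + tn),      # 1 - specificity
                         "ROC AUC": roc_auc_score(y, pred)}
    return results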

Fig. 2. Comparison of the performance of the SVM and k-NN classifiers for all different features.

5. CONCLUSION
This paper proposed a computer-aided system for liver cirrhosis detection. Several feature extraction methods are highlighted and evaluated in order to identify the best set of features. SVM and K-NN classifiers are used to accomplish the classification task. The results showed that the SVM and K-NN classifiers achieved the highest classification accuracy rate, reaching 99.31%, with the combination of the WT and CT feature vectors. The results were improved by applying combinations of different sets of features. A new direction for future work is to use swarm intelligence methods to identify the most important features, which could further improve the CAD system performance.

70 Table 3: Summary of the performance of KNN classifier [10] A. Zaid, W. Fakhr, A. Mohamed. “Automatic Diagnosis of method with different set of features Liver Diseases from Ultrasound Images”. International Conference on Computer Engineering and Systems, pp.313- Feature set TP FP ROC 319, 2006. FOS 0.844 0.156 0.848 [11] S. Poonguzhali, G. Ravindran. “Automatic classification of GLCM 0.864 0.136 0.856 focal lesions in ultrasound liver images using combined GF 0.735 0.266 0.725 texture features”. Inform Tech J 2008; Vol 7, pp. 205–209. FD 0.857 0.143 0.857 [12] M. Liang, “3D co-occurrence matrix based texture analysis applied to cervical cancer screening”, Department of WT 0.980 0.021 0.979 Information Technology, UPPSALA UNIVERSITET 2012. 0.966 0.033 0.963 CT [13] M. C. Breslin, J. A. Belward. “Fractal dimensions for rainfall FOS+GLCM 0.837 0.163 0.829 time series.” Mathematics and Computers in Simulation, Vol FOS+GF 0.891 0.108 0.882 48, PP.437-446, 1999. FOS+FD 0.946 0.054 0.955 [14] J. Wu, G. An, and Q. Ruan. (2009). “Independent gabor FOS+WT 0.980 0.021 0.979 analysis of discriminant Features fusion for face recognition”. FOS+CT 0.959 0.039 0.952 Signal Processing Letters, IEEE, Vol 16(2), pp.97–100 doi: GLCM+GF 0.878 0.123 0.864 10.1109/LSP, 2008. GLCM+FD 0.946 0.054 0.948 [15] M. M. Eltoukhy, I. Faye, and B. B. Samir, "A statistical GLCM+WT 0.980 0.021 0.979 based feature extraction method for breast cancer diagnosis GLCM+CT 0.959 0.039 0.958 in digital mammogram using multiresolution representation," Computers in Biology and Medicine, vol. 42, pp. 123-128, GF+FD 0.891 0.109 0.895 2012. 0.993 0.007 0.989 GF+WT [16] J.L. Starck, E.J. Candes, D.L. Donoho, “The curvelet GF+CT 0.993 0.007 0.989 transform for image denoising, Image Process”, IEEE Trans. FD+WT 0.891 0.108 0.882 Vol 11, pp. 670–684, 2002. FD+CT 0.939 0.059 0.937 [17] M. M. Eltoukhy and I. Faye, "An optimized feature selection WT+CT 0.993 0.007 0.993 method for breast cancer diagnosis in digital mammogram using multiresolution representation," Applied Mathematics 6. REFERENCES & Information Sciences, vol. 8, pp. 2921-2928, 2014. [1] B. Vollmar, M. D. Menger. “The hepatic microcirculation: [18] M. M. Eltoukhy, “Mammographic Mass Detection Using mechanistic contributions and therapeutic targets in liver Curvelet Moments” Applied Mathematics & Information injury and repair”. Physiol. Rev. 1269-1339, 2009. Sciences, vol. 11(3), pp. 717-722, 2017. [2] K. Doi, M. L. Giger, H. MacMahon, et al. “Computer-aided [19] M. A. Hall, “Correlation-based feature selection for machine diagnosis: development of automated schemes for learning,” PhD, Department of Computer Science, The quantitative analysis of radiographic images”. Seminars in University of Waikato, Hamilton, 1999 Ultrasound CT MR, Vol 13(2), pp.140–152, 1992. [20] C. J. Burges: “A tutorial on support vector machines for [3] J. Virmani, V.Kumar, N.Kalra, N.Khandelwal. “SVM-based pattern recognition”. Data Min Knowl Disc Vol 2(2), pp.1– characterization of liver ultrasound images using wavelet 43, 1998. packet texture descriptors”, Journal of Digit Imaging, Vol [21] I. Guyon, J. Weston, S. Barnhill, V. Vapnik: “Gene selection 26(3), pp.530-543, 2013. for cancer classification using support vector machines”. J [4] A. Ahmadian, A. Mostafa, M. D. Abolhassani, and Y. Machine Learn Vol 46(1–3), PP.389–422, 2002. Salimpour, “A texture classification method for diffused liver [22] Y. M. Kadah,A. A. Farag, J. M. Zurada, A. M. Badawi, A. 
diseases using Gabor wavelets”, 27th Annual International M. Youssef. Classification algorithms for quantitative tissue Conference of the Engineering in Medicine and Biology characterization of diffuse liver disease from ultrasound Society, IEEE-EMBS 2005- Shanghai- China, pp. 1567 – images. IEEE Trans Med Imaging, Vol 15, pp.466-478, 1996. 1570, Print ISBN: 0-7803-8741-4, 2005. [23] T. Fawcett. “An introduction to ROC analysis. Pattern [5] R. Ribeiro, R. Marinho, J. Velosa, F. Ramalho, and J. M. Recognition Letters”, Vol 27(8), pp. 861–874, 2006. Sanches, “Chronic liver disease staging classification based [24] N. V. Chawla “Data mining for imbalanced datasets: An on ultrasound, clinical and laboratorial data,” in IEEE overview”. In Data mining and knowledge discovery International Symposium on Biomedical Imaging: From handbook. Springer US, pp. 853–867, 2005 Nano to Macro, pp. 707-710, 2011. [25] F. Provost, T. Fawcett. “Robust classification for imprecise [6] W. L. Lee, Y. C. Chen, K. S. Hsieh. “Ultrasonic liver tissues environments”. Machine Learning, vol. 42, pp.203–231 2001 classification by fractal feature vector based on M-band wavelet transform”. IEEE Trans Med Imaging. Vol 22(3), pp. 82-92, 2003. [7] W. L. Lee. “An ensemble-based data fusion approach for characterizing ultrasonic liver tissue”. Appl Soft Comput. Vol 13(8), pp. 83-92, 2013. [8] G. Cao, P. Shi, B. Hu. Ultrasonic liver discrimination using 2-D phase congruency. IEEE Trans Biomed Eng. Vol 53(10):2116- 2119, 2006. [9] C. C. Wu, W. L. Lee, Y. C .Chen, C. H. Lai, K. S. Hsieh. Ultrasonic liver tissue characterization by feature fusion. Expert Syst Appl. Vol 39(10), pp. 9389-9397, 2012.

71 Image Denoising Technique for CT Dose Modulation Haneen A. Elyamani Samir A. El-Seoud Essam A. Rashed Image Science Lab., Dept. of Math., Faculty of Informatics and Computer Faculty of Informatics and Computer Faculty of Science, Suez Canal Science, The British University in Science, The British University in University, Ismailia, Egypt Egypt, Cairo, Egypt Egypt, Cairo, Egypt [email protected] [email protected] [email protected]

ABSTRACT exposure during a scan length and amongst patients to meet a Low-dose computed tomography (LDCT) imaging is considerably certain image quality level [6, 7]. ATCM systems determine the recommended for use in clinical CT scanning because of growing patient attenuation and describe changes to the scanner output fears over excessive radiation exposure. Automatic exposure specially designed to the specific patient and body region to meet control (AEC) is one of the methods used in dose reduction the required image quality and especially useful for body parts techniques that have been implemented clinically comprise that are not uniform in size or non-homogenous scan regions as showing significant decrease scan range. The quality of some shown in Fig. 1. On the other hand, dose reduction by lowering images may be roughly degraded with noise and streak artifacts the tube current lead to a noisy image and artifacts, which effect due to x-ray flux, based on modulating radiation dose in the of the diagnostic accuracy. Therefore, there is a need to improve a and slice directions. In 2005, the nonlocal means (NLM) technique which could denoise images in ATCM without losing algorithm showed high performance in denoising images image quality. corrupted by LDCT. The proposed method incorporates a prior Non-local means (NLM) denoising algorithm is an effective knowledge obtained from previous high-quality CT slices to image denoising technique that exploits the high degree of improve low-quality CT slice during the filtering process because redundant information present in most images. It uses the self- of the anatomical similarity between the arranged image slices of similarity property of images to suppress noise by replacing the the scans. The proposed method is evaluated using real data and intensity of each pixel with a weighted average of its neighbors CT image quality is notably improved. according to similarity based on patches. CCS Concepts In this paper, we propose a novel approach for signal to noise • Computing methodologies➝Image processing. ratio (SNR) enhancement of real chest images based on the NLM algorithm by using series of images with different level of noises Keywords result by using ATCM. We consider the case where the ATCM is X-ray CT; denoising; dose reduction; nonlocal means filtering not accurate such that the dose is extremely low in some slices which causes significant artifacts. However, some neighbor CT 1. INTRODUCTION slices can be found with relatively high-dose that result in high- X-ray computed tomography (CT) has been widely used in quality images. The information extracted from high-quality slices medical domain for diagnosis, image-guided radiotherapy, are used as a searching domain for a NLM-based algorithm to screening, and image-guided surgeries. CT scan is a radiation- improve the quality of low-quality slices. Evaluating by real data intensive procedure [1]. However, the high radiation dose exhibits that the proposed method leads to the improvement of delivered to the patients has potential harmful effects including image quality. cancerous disease and genetic diseases [2]. This fact raised safety The remainder of this paper is organized as follows: Section 2 is concerns in the medical physics community. Therefore, high- an overview of the NLM algorithm. In Section 3, we describe the quality CT images reconstructed from low-dose CT are highly proposed method for tube current modulation. Section 4 describe desired. 
Hence, several models have been proposed for dose the experimental results and Section 5 discusses the results and reduction in CT scan to decrease radiation related risks [3,4]. present conclusions. Recent developments in CT technology, including implementation of automatic exposure control and optimizing the system 2. OVERVIEW OF NLM ALGORITHM parameters [5]. Automatic exposure control (also known as Buades et al. [8] proposed the classical formulation to the NLM automatic tube current modulation (ATCM) or dose modulation) algorithm. It assumes that the images have similarity of elements is implemented on CT scanners allow reduction in radiation and patterns. When filtering each pixel, the filter searches for Permission to make digital or hard copies of all or part of this similar pixels from throughout the local image, named search work for personal or classroom use is granted without fee provided that window (SW), giving equal weight to neighboring and non- copies are not made or distributed for profit or commercial advantage neighboring local image regions. and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy Due to the success of the NLM filter, many of regularization otherwise, or republish, to post on servers or to redistribute to lists, models has been based to NLM filter were also proposed for requires prior specific permission and/or a fee. Request permissions different inverse problems in image denoising. One of from [email protected]. regularization model uses previous high-quality scan (preHQ) ICSIE '18, May 2–4, 2018, Cairo, Egypt image instead of using noisy image based on the presence of a © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00 prior radiography of the patient [9]. Some patients need to repeat CT scans that in several clinical applications such as longitudinal DOI:https://doi.org/10.1145/3220267.3220288

72 studies, disease monitoring, and image-guided radiotherapy. To improve the benefit of radiation dose, the initial scan can be acquired with high quality to set up a guide, followed by a series of low-dose scans. A preHQ-guided NLM filtering method proposed by Ma et al. [10] for CT image. Assume that ( ) is the pixel in an image, the NLM corresponding to image pixel (noisy image) is a weighted sum of arbitrary image pixels (reference image) that satisfy some statistical properties.

$$u_{\mathrm{NLM}}(i) = \sum_{j} w(i,j)\, u^{\mathrm{pre}}(j) \qquad (1)$$

where $v$ denotes the pixels of the current noisy (low-dose) image, $u^{\mathrm{pre}}$ denotes the pixels of the previous high-quality image registered to the current low-dose image, $w(i,j)$ is the weighting coefficient and $u_{\mathrm{NLM}}(i)$ is the new intensity value of the pixel located at $i$ after filtering. The weighting coefficients satisfy the constraints $0 \le w(i,j) \le 1$ and $\sum_{j} w(i,j) = 1$, and $w(i,j)$ is given as

$$w(i,j) = \frac{1}{Z(i)} \exp\!\left(-\frac{\lVert v(N_i) - u^{\mathrm{pre}}(N_j)\rVert_2^2}{h^2}\right), \qquad Z(i) = \sum_{j} \exp\!\left(-\frac{\lVert v(N_i) - u^{\mathrm{pre}}(N_j)\rVert_2^2}{h^2}\right) \qquad (2)$$

where $N_i$ and $N_j$ are two patches of the same size centered at pixels $i$ and $j$ (called patch windows (PW), e.g. 5×5 pixels in the 2D case), $h$ is the filtering parameter that controls the decay of the weights, and $\lVert\cdot\rVert_2$ is the Euclidean distance between two patch windows in the high-dimensional space.

Figure 1: (a)-(c) are different slices representing different tissue densities that require CT dose modulation. The corresponding position in the sagittal plane view is shown in the bottom right image.

3. PROPOSED METHOD
The preHQ-guided filtering can further improve the results compared with the standard NLM algorithm. However, a previous high-quality image of the same patient is not always available. With ATCM techniques in CT scanners, the use of different x-ray doses may result in images of different quality. The goal here is to use the neighboring high-quality slices to improve the successive low-quality slices within the framework of NLM. The authors developed an adaptive approach using NLM for CT image denoising in [11]; in that method, the search window is limited to pixels belonging to the same slice. In the current work, this method is extended for CT dose modulation applications. The proposed method can be implemented as follows:

[STEP 1] Acquire a few slices (x(1) to x(t)) with high-dose CT.
[STEP 2] Gradually reduce the x-ray power along with the patient bed motion for slices (x(t+1) : x(n)).
[STEP 3] For each x(i), t < …

The proposed method is called ATCM-NLM. The initial high-dose slices are used as a search window for NLM to improve the next slice. However, this process can be used for only a few slices within a local neighborhood of the high-dose slices. The improved slices are then used as a search window for later ones, and so on. It is expected that the image quality improves while the statistical noise problem gradually decreases. However, the efficiency of the NLM filtering becomes weaker when the filtered slice is far from the high-dose slice. Therefore, this process is repeated several times (STEP 4) until a reasonable image quality is achieved.

4. EXPERIMENTAL RESULTS
4.1 Image Quality Measures
To evaluate the performance and estimate the image quality of the proposed method, we used the peak signal-to-noise ratio (PSNR):

$$\mathrm{PSNR}(x,\hat{x}) = 20 \log_{10}\!\left(\frac{\max(x)}{\sqrt{\mathrm{MSE}(x,\hat{x})}}\right) \qquad (3)$$

where MSE is the mean square error, computed as

$$\mathrm{MSE}(x,\hat{x}) = \frac{1}{N}\sum_{i=1}^{N}\bigl(x(i)-\hat{x}(i)\bigr)^2 \qquad (4)$$

where $x(i)$ is the pixel value of the true reference image, $\hat{x}(i)$ is the corresponding pixel of the denoised image, and $N$ is the number of pixels. The relative root mean square error (RRME) is also used.
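As a rough illustration (not the authors' code), the guided filtering of Eqs. (1)-(2) and the PSNR of Eqs. (3)-(4) could be prototyped as follows; the patch size, search-window size and h are illustrative choices, the function names are hypothetical, and boundary handling is omitted.

import numpy as np

def guided_nlm_pixel(v, u_pre, i, j, h=0.05, pw=2, sw=5):
    """Eq. (1)-(2) update for one interior pixel (i, j) of the noisy slice v."""
    patch_v = v[i - pw:i + pw + 1, j - pw:j + pw + 1]
    weights, values = [], []
    for m in range(i - sw, i + sw + 1):
        for n in range(j - sw, j + sw + 1):
            patch_u = u_pre[m - pw:m + pw + 1, n - pw:n + pw + 1]
            d2 = np.sum((patch_v - patch_u) ** 2)       # squared patch-window distance
            weights.append(np.exp(-d2 / h ** 2))        # un-normalized weight of Eq. (2)
            values.append(u_pre[m, n])
    weights = np.asarray(weights)
    return float(np.sum(weights * np.asarray(values)) / np.sum(weights))   # Eq. (1)

def psnr(x, x_hat):
    """PSNR of Eq. (3) using the MSE of Eq. (4)."""
    mse = np.mean((x - x_hat) ** 2)
    return 20.0 * np.log10(x.max() / np.sqrt(mse))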

Moreover, evaluations using RRME and PSNR measurements are


Fig. 2: Result for different slices (a) True image, (b) noisy image, (c) denoised image using standard NLM algorithm [8], and (d) denoised image using proposed ATCM-NLM algorithm.

4.2 Results used to validate the effect of ATCM-NLM method and results are We have used a CT volume representing chest imaging. An illustrated in Fig. 4. It is also observed that the proposed method artificial statistical noise is added within projection domain and has a high effect on the final image quality. The reason for this image is reconstructed to simulate low-dose CT. Figure 2(a) image quality improvement is explained as follows. When the shows the images from the normal dose scan and the images with search domain for arbitrary pixel is set to the same low-quality low-dose noise are shown in Fig. 2 (b). Images restored from the slice, it is likely that the selected set of pixels are suffering from low dose scan by using the standard NLM algorithm and the noise and other artifacts. One the other hand, when the search proposed ATCM-NLM algorithm are shown in Fig 2 (c) and (d), domain is set to a high-quality neighbor slice, it would likely lead respectively. Fig. 3 shows the zoomed images corresponding to to higher quality as the anatomical structures are not changed Figure 1. It can be observed that noise in the low-dose CT images significantly and the image pixels are not suffering from low-dose is effectively suppressed using the ATCM-NLM method. noise. This proposed method can be easily implemented and can

74 lead to produce a diagnosis useful image of low-dose for the sake [8] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A of patient protection. review of image denoising algorithms, with a new one.


Figure 3: Zoomed ROIs corresponding to Figure 2 slice 34: (a) the normal dose scan (b) the low dose scan (c) the restored image from the low-dose scan by NLM algorithm (d) the restored image from the low-dose scan by the ATCM-NLM method.

5. CONCLUSION This paper presents ATCM-NLM method has a strong power in detecting noise and image restoration may already be able to generate well image quality for LDCT because filtering in current low dose image is usually problematic, the introduction of high quality images can significantly enhance image. Experimental results show that the proposed method leads to a notable enhancement in image quality. 6. REFERENCES [1] Willi A. Kalender. Dose in X-ray computed tomography. Physics in Medicine & Biology, 59, no. 3, R129, 2014. [2] David J Brenner and Eric J Hall. Computed tomographyan increasing source of radiation exposure. New England Journal of Medicine, 357(22):2277–2284, 2007. [3] Mannudeep K Kalra, Michael M Maher, Thomas L Toth, Bernhard Schmidt, Bryan L Westerman, Hugh T Morgan, and Sanjay Saini. Techniques and applications of automatic tube current modulation for ct. Radiology, 233(3):649–657, 2004. Figure 4. RRME and PSNR with different of slices. Top is RRME curves and bottom is PSNR curves. [4] Hsieh, Jiang. "Computed tomography: principles, design, artifacts, and recent advances." Bellingham, WA: SPIE, 2009. Multiscale Modeling & Simulation, 4(2):490–530, 2005. [5] Yu, Lifeng, Xin Liu, Shuai Leng, James M. Kofler, Juan C. [9] Zhang, Hao, Dong Zeng, Hua Zhang, Jing Wang, Zhengrong Ramirez-Giraldo, Mingliang Qu, Jodie Christner, Joel G. Liang, and Jianhua Ma. "Applications of nonlocal means Fletcher, and Cynthia H. McCollough. "Radiation dose algorithm in low‐dose X‐ray CT image processing and reduction in computed tomography: techniques and future reconstruction: A review." Medical physics 44, no. 3 1168- perspective." Imaging in medicine 1, no. 1, 2009) 1185, 2017. [6] John A Bauhs, Thomas J Vrieze, Andrew N Primak, Michael [10] Jianhua Ma, Jing Huang, Qianjin Feng, Hua Zhang, R Bruesewitz, and Cynthia H McCollough. CT dosimetry: Hongbing Lu, Zhengrong Liang, and Wufan Chen. Low-dose comparison of measurement techniques and devices. computed tomography image restoration using previous Radiographics, 28(1):245–253, 2008. normal-dose scan. Medical physics, 38(10):5713–5731, 2011. [7] Jerrold T Bushberg and John M Boone. The essential physics [11] Haneen. A. Elyamani, Samir A. El-Seoud, Hiroyuki Kudo of medical imaging. Lippincott Williams & Wilkins, 2011. and Essam A. Rashed, "Adaptive image denoising approach for low-dose CT," The 12th IEEE International Conference on Computer Engineering and Systems (ICCES 2017), Cairo, Egypt, Dec. 2017

75 An Interactive Mixed Reality Imaging System for Minimally Invasive Surgeries Samir A. El-Seoud Amr S. Mady Essam A. Rashed Faculty of Informatics and Faculty of Informatics and Faculty of Informatics and Computer Science (ICS) Computer Computer Science (ICS) Science (ICS) The British University in Egypt The British University in Egypt The British University in Egypt (BUE) (BUE) (BUE) El Sherouk City, Misr-IsmaliaRoad, El Sherouk City, Misr-IsmaliaRoad, El Sherouk City, Misr- Cairo-Egypt Cairo-Egypt IsmaliaRoad, Cairo-Egypt [email protected] [email protected] [email protected]

ray-casting. ABSTRACT 1. INTRODUCTION In orthopedic surgery, it is important for physicians to completely understand the three-dimensional (3D) anatomical The current evolution in medicine and technology should structures for several procedures. With the current revolution proceed at the same level. Furthermore, medicine should take in technology in every aspect of our life, mixed reality in the advantage in the speedily development in technology. One of medical field is going to be very useful. However, medicine these significantly important parts of tech-medicine has a visualization problem hindering how surgeons operate. applications are the visualization of human anatomy [1]. The surgeons are required to imagine the actual 3D structure Interventional radiology procedures using imaging guidance of the patient by looking at multiple 2D slices of the patients’ such as CT/MRI does not meet 100% surgeon’s satisfaction. body. This process is time consuming, exhausting and requires In current procedures, radiologists must scan the patient from special skill and experience. Moreover, patients and surgeons different positions. Thereafter, surgeons and radiologists must are exposed to extra x-ray doses. investigate the scanned images to better locate the problem. Consequently, doctors and patients are exposed to heavy Therefore, it is important to provide the surgeon with a better radiation. However, some types of medical imaging systems way to diagnose the patient; a way that is more accurate and provide a series of scans that can be viewed as a 3D model locates where the problem is in a faster and more efficient using appropriate software that can guide interventional manner. Medical imaging systems usually provide 3D images clinical procedures. that can guide interventional clinical procedures. However, it is difficult to map the 3D anatomical structure with real Our developed system should be a one step forward to solve objects. This project investigates and solves this problem by the problem of visualizing human bodies. providing a mixed reality technology solution that merges the Volume rendering of three-dimensional (3D) image data of 3D image with real objects to facilitate the work progress of patient’s multiple slices is the revolution in imaging human the surgeon. The proposed solution is an interactive mixed body [2]. A voxel is the 3D equivalent to a pixel and are the reality (MR) system for minimally invasive surgeries. The smallest element in a 3D object [3]. Voxels are used to build system is based on mapping the patient volume scan using 3D objects, mostly used in computer graphical applications computed tomography (CT) or Magnetic Resonance Imaging like computer games, but also used to render a volume. (MRI) to a 3D model of the patient’s body. The rendered Applications on volume rendering have taken a large part in model can be used in MR system to view 3D human structures interventional and minimally invasive surgeries over the past through a set of wearable glasses. couple of years. Before volume rendering there were other CCS Concepts techniques that concentrate on visualization via surface shading: • Computing methodologies➝Image processing. • It transforms the volumetric data into geometric primitives then screen the pixels Keywords component: mixed reality, volume rendering, medical imaging, • It is good but not the best for visualization. 
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that When it comes to volume rendering, the technique displays copies are not made or distributed for profit or commercial advantage the information inside the volume, it is a direct display, the and that copies bear this notice and the full citation on the first page. technique transforms volumetric data to screen pixels directly, Copyrights for components of this work owned by others than ACM and also it uses transparency to see through volumes. must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, In 2008, researchers from university of München, Germany, requires prior specific permission and/or a fee. Request permissions started a project that maps the CT scans obtained with patient from [email protected]. body. Their Augmented Reality (AR) system of optical ICSIE '18, May 2–4, 2018, Cairo, Egypt tracking and video see-through head mounted device (HMD) © 2018 Association for Computing Machinery. for visualization was developed to keep track of the objects in ACM ISBN 978-1-4503-6469-0/18/05…$15.00 the scene. This process is carried out by two separate optical DOI:https://doi.org/10.1145/3220267.3220290

ACM ISBN 978-1-4503-6469-0/18/05…$15.00 DOI:https://doi.org/10.1145/3220267.3220268 76 tracking systems. Four infrared ARTtrack2 cameras have been different ways with some GUI features to help the user to mounted to the room’s ceiling to obtain an outside-in optical interact with it more easily, such as: tracking system, while an IR camera mounted directly on the HMD is used as an inside-out optical tracking system [4]. • Increasing visibility This work aims to develop a MR software that will be used in • Increasing and Decreasing Opacity minimally invasive surgeries and interval procedures. Our • Clipping (removing parts of the object) on the X, Y goal is to reduce the heavy load of scans visualization, as well and Z axes. as saving time and effort. This procedure will be much cheaper than the previous methods. Our system requires only • Rotation and Translation a smartphone and a MR ready headset. Mixed Reality is the Breifly, the considered scenario may be summarized as combination of Virtual Reality (VR) and AR [5]. This follows: combination brings together the real world and the digital one in one reality [6, 7]. It allows the users to interact with both 1. First obtain volumetric medical data physical and virtual items. (DICOM/RAW).

2. MATERIAL AND METHODS 2. Preprocess the data to the best possible losslessly form of useable data.

2.1 System Overview 3. The data are stored afterwards on a The introduced system visualizes medical images (CT, MRI) smartphone then mount it on a VR headset that as a 3D object. First, we use the developed software to has a passthrough camera feature. visualize the medical images by using volume rendering ray- 4. The software will render the preprocessed data casting technique. The term volume rendering is used to as a 3D object into reality using augmented describe techniques which allow the visualization of three- reality technology through the virtual reality dimensional data. Volume rendering is a technique for headset. visualizing sampled functions of three spatial dimensions by computing 2D projections of a colored semitransparent 5. User will interact with the 3D object via Gear volume. The technique works as follows: VR controller.

Step 1: Concluding this scenario, surgeons and radiologists will be able to see the scanned slices of the patient as a real 3D object • Trace from each pixel a ray into object space in front of them and will be able to interact with it through a • Compute and accumulate color/opacity value controller. They have the capability to zoom in or out or even along the ray to make parts of the object transparent as well as clipping parts of it. • Assign the obtained value to the pixel

Fig.1 How ray-marching works [8, 10].

Step 2: Fig. 2 Compositing of pixels’ color/opacity along the ray [9, 10]. In this step, we use compositing (alpha blending), i.e. the iterative computation of discretized volume integral. Figure 2 illustrates on how alpha blending works while each ray goes 2.2 Image Acquisition through the object on its direction. CT or MRI scanners first scan the patient. Afterwards, data is The developed software runs on Samsung Gear VR headset, sent to an online archive to be stored and registered. and using its pass-through camera feature. It will enable the Thereafter, the data has to be sent to the smartphone via software to augment the 3D object of the medical scans in real wireless for processing. world space. Interactiveness with the augmented object will be performed using the Gear VR controller. Users could manipulate the viewed 3D object generated from the medical images sliced by hiding parts of the object or view it in
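As an illustration of the ray-marching and front-to-back compositing outlined in Steps 1 and 2 above, the sketch that follows shows a per-ray accumulation loop. It is not the developed Gear VR software (which renders on the GPU); "volume", "transfer" and the sampling step are assumed placeholders.

import numpy as np

def cast_ray(volume, origin, direction, transfer, step=0.5, max_steps=512):
    """Front-to-back compositing of color/opacity along one ray through a 3D volume."""
    color = np.zeros(3)
    alpha = 0.0
    pos = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    for _ in range(max_steps):
        idx = tuple(np.round(pos).astype(int))
        if any(i < 0 or i >= s for i, s in zip(idx, volume.shape)):
            break                                          # ray has left the volume
        rgb, a = transfer(volume[idx])                     # sample color and opacity
        color = color + (1.0 - alpha) * a * np.asarray(rgb)   # accumulate front-to-back
        alpha = alpha + (1.0 - alpha) * a
        if alpha >= 0.99:                                  # early ray termination
            break
        pos = pos + step * d
    return color, alpha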

77 3. RESULTS AND DISCUSSION

3.1 Results The figures below are samples from the first dataset (221 slices).

Fig3.a, b, c samples from the first used CT scans dataset In Fig. 4(a-c), the images show the surface of the patient’s body as a 3D object viewed from three different angles using the first dataset.

Fig. 4a (left) A 3D object rendered with ray-marching by using the first dataset, full opacity, no clipping, front facing, Fig. 4b (center) rotated 90 degrees on the Y-axis, Fig. 4c (right) rotated 90 degrees on X-axis

In Fig. 5 (a-c), showing the inside of the body after decreasing the opacity viewed from three different angles using the first dataset.

Fig. 5a (left) Same 3D object, 0.03 opacity, no clipping, front facing, Fig. 5b (center) rotated 90 degrees on the Y-axis, Fig. 5c (right) rotated 90 degrees on X-axis

78 In Fig. 6 (a-c), showing only half of the rendered object viewed from three different angles using the first dataset.

Fig. 6a (left) Same 3D object, 0.03 opacity, clipped 50% of it on X-axis, front facing, Fig. 6b (center) rotated 90 degrees on the Y-axis, Fig. 6c (right) rotated 90 degrees on X-axis. The figures below are samples from the second dataset (361 slices).

Fig7.a, b, c samples from the second used CT scans dataset In Fig. 8(a-c), the images show the surface of the patient’s body as a 3D object viewed from three different angles using the second dataset.

Fig. 8a (left) A 3D object rendered with raymarching using the second dataset, full opacity, no clipping, front facing, Fig. 8b (center) rotated 90 degrees on the Y-axis, Fig. 8c (right) rotated 90 degrees on X-axis. In Fig. 9 (a-c), showing the inside of the body after decreasing the opacity viewed from three different angles using the second dataset.

Fig. 9a (left) Same 3D object, 0.03 opacity, no clipping, front facing, Fig. 9b (center) rotated 90 degrees on the Y-axis, Fig. 9c (right) rotated 90 degrees on X-axis. In Fig. 10 (a-c), showing only half of the rendered object viewed from three different angles using the second dataset.

79

Fig. 10a (left) Same 3D object, 0.05 opacity, clipped 50% of it on Y-axis, front facing, Fig. 10b (center) rotated 90 degrees on the Y-axis, Fig. 10c (right) rotated 90 degrees on X-axis.

[4] Wieczorek, M. et al., 2010 GPU-accelerated Rendering 3.2 Discussion for Medical Augmented Reality in Minimally-invasive The proposed software will help in minimizing the visualization of Procedures.,” in Bildverarbeitung für die Medizin, 574, 102-106. medical images, saving time and effort for surgeons and radiologist, [5] Virtual Reality Vs. Augmented Reality Vs. Mixed Reality with relevantly fast run time. It requires few minutes to render a - Intel data-set of 300 images. https://www.intel.com/content/www/us/en/tech-tips-and- 4. CONCLUSION. tricks/virtual-reality-vs-augmented-reality.html In this paper, we discussed the developed software that will be used [6] Ohta, Y., and Tamura H. 2014 Mixed reality: merging real as a new method in visualization of medical images. This software and virtual worlds. Springer can deliver better visualizations to surgeons and radiologists, helping to create a better environment for surgeries. [7] Billinghurst, M. and Kato, H. 1999 Collaborative mixed reality. Proceedings of the First International Symposium on 5. REFERENCES Mixed Reality [8] Volume ray casting, En.wikipedia.org, 2017. [1] Rowe, S. P. and Fishman, E. K. 2017 Image Processing from 2D to 3D, Springer, 2017 https://en.wikipedia.org/wiki/Volume_ray_casting [2] Udupa, J. K. and Goncalves, R. J. 1993 Medical image [9] Komura, T 2008, Volume Rendering, Visualization – rendering. American journal of cardiac imaging 7.3, 154-163 Lecture 10, The University of Edinburgh [3] What is a Volume Pixel (Volume Pixel or Voxel)? - Definition [10] Möller T., Direct Volume Rendering. University of Vienna. from Techopedia", Techopedia.com. https://www.techopedia.com/definition/2055/volume-pixel- volume-pixel-or-voxel

80 A Computer-Aided Early Detection System of Pulmonary Nodules in CT Scan Images Hanan M. Amer Fatma E.Z. Abou-Chadi Sherif S. Kishk Marwa I. Obayya Assistant Teacher, Professor and Head of Professor, Department of Associate Professor, Department of Electrical Engineering Electronics and Department of Electronics Electronics and Department, Faculty of Communications and Communications Communications Engineering, The British Engineering, Faculty of Engineering, Faculty of Engineering, Faculty of University of Egypt. Engineering, Mansoura Engineering, Mansoura Engineering, Mansoura University University University hanan.amer@yahoo [email protected] [email protected] marwa_obayya .com du.eg @yahoo.com

ABSTRACT CCS Concepts In the present paper, computer-aided system for the early • Applied computing→Computer-aided design detection of pulmonary nodules in Computed Tomography (CT) scan images is developed where pulmonary nodules are one of the Keywords critical notifications to identify lung cancer. The proposed system Image Processing; Histogram Thresholding; Histogram of consists of four main stages. First, the raw CT chest images were Oriented Gradients; Lung Segmentation; Nodule Extraction; preprocessed to enhance the image contrast and eliminate noise. Principal Component Analysis; Discrete Wavelet Transform; Second, an automatic segmentation stage for human's lung and Genetic Algorithm;; Support Vector Machine; pulmonary nodule candidates (nodules, blood vessels) using a two-level thresholding technique and a number of morphological 1. INTRODUCTION operations. Third, the main significant features of the pulmonary Lung cancer has become one of the most important diseases that nodule candidates are extracted using a feature fusion technique pose a great threat to humanity because of the high rates of air that fuses four feature extraction techniques: the statistical pollution, the spread of smoking in recent years and the difficulty features of first and second order, Value Histogram (VH) features, of treatment. Developing early detection of this disease has Histogram of Oriented Gradients (HOG) features, and texture become the concern of scientists in medical fields [1]. features of Gray Level Co-Occurrence Matrix (GLCM) based on Early detection of lung cancer increases the chance of survival of wavelet coefficients. To obtain the highest classification accuracy, the patient for a period of up to 5 years by up to a percentage of three classifiers were used and their performance was compared. 70%, as well as it increases the chance of success of treatment These are; Multi-layer Feed-forward Neural Network (MF_NN), whenever diagnosed in the early stages, this led to the increasing Radial Basis Function Neural Network (RB-NN) and Support importance of work on the development of early detection Vector Machine (SVM). To assess the performance of the systems [1]. proposed system, three quantitative parameters were used to compare the classifier performance: the classification accuracy One of the most important techniques used in the diagnosis of rate (CAR), the sensitivity (S) and the Specificity (SP). The lung cancer is Computerized Tomography (CT) of the patient's developed system is tested using forty standard Computed chest. It is one of the most accurate examination methods, because Tomography (CT) images containing 320 regions of interest (ROI) it allows lung imaging on many sections, which results a large obtained from an early lung cancer action project (ELCAP) number of images, enabling radiologists and physicians to association. The images consists of 40 CT scans. The results show examine all parts of the lung [1]. But this large number of images that the fused features vector which resulted from GA as a feature resulting from the CT examination in addition to the use of low selection technique and the SVM classifier gives the highest CAR, radiation doses to protect the patient from the risk of exposure to S, and SP values of99.6%, 100% and 99.2%, respectively. large amounts of radiation, made the examination of these images by a radiologist difficult and onerous task [1]. 
This motivated Permission to make digital or hard copies of all or part of this scientists to develop computerized systems that process and work for personal or classroom use is granted without fee provided that analyze these images and allow automatic determination of the copies are not made or distributed for profit or commercial advantage presence of pulmonary nodules. These systems are known as and that copies bear this notice and the full citation on the first page. Computer-Aided Detection (CAD) systems [2]. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy In general, any CAD system for the automatic detection of otherwise, or republish, to post on servers or to redistribute to lists, pulmonary nodules is composed of four main stages: a requires prior specific permission and/or a fee. Request permissions preprocessing stage for contrast enhancement and noise reduction, from [email protected]. the automatic segmentation stage that aims to extract the human's ICSIE '18, May 2–4, 2018, Cairo, Egypt lung area and nodules followed by a feature extraction procedure © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00 of the pulmonary nodules and the final stage is the classifier [2]. Figure 1 shows the basic stages of a CAD system. DOI:https://doi.org/10.1145/3220267.3220568

81 association [7].The images in this database are available in format of Digital Images and Communication in Medicine (DICOM) and have a resolution of 0.760.761.25. The size of pulmonary nodules that were considered in this work varies from 3 mm to 30 mm. Figure 2 shows a typical example of the chest CT images.

Figure 2. A typical example of CT lung image from ELCAP database image . Figure 1. A block diagram of CT-lung CAD system

The accurate extraction of the lungs from the CT chest images is 3. METHODOLOGY an essential step in the CAD systems. Techniques previously The proposed CAD system consists of four main stages as follows: reported in lung segmentation are based on intensity variation (thresholding methods), image region (merge region, split region, 3.1 Image Preprocessing and the region growing techniques), and others that are based on the object texture and edge detection [2]. In the present work, a Physicians use a low radiation doses during the CT scan to protect novel Image Size Dependent Normalization Technique (ISDNT) the patient from the risk of exposure to large amounts of radiation was adopted. but this leads to low-resolution images. On the other hand, processing the CT scan itself is accompanied by the exposure of According to clinical opinions of physicians, the blood vessels images to noise from different sources which reduces the image and pulmonary nodules are presented in the CT scan image as quality. Preprocessing was accomplished in two steps: enhancing having lower contrast values and higher gray values [2]. Several the image contrast and denoising of the CT chest image. attempts were reported for nodule extraction. The thresholding techniques [2] where the extraction of the nodule candidates is 3.1.1 Image Contrast Enhancement based on the intensity variation between the lung parenchyma and Contrast enhancement of CT scan images increases the accuracy the nodule candidates were utilized. of nodule detection. Hence, a comparative study of three image contrast enhancement techniques, namely; histogram equalization, For feature extraction, four types of feature extraction techniques adaptive Histogram Equalization, and a novel Image Size [3, 4] were utilized: the statistical features, the Value Histogram Dependent Normalization Technique (ISDNT) [8] were conducted. (VH) feature, the Histogram of Oriented Gradients (HOG) The visual comparison of the contrast enhanced images showed features, and the texture features of Gray Level CO-Occurrence that the ISDNT technique gives the best results. Figure 3 shows Matrix (GLCM) based on wavelet coefficients. the CT image before and after contrast enhancement using the Classification approaches have been proposed such as Artificial ISDNT. Neural Networks (ANN), linear discriminate analysis classifier, rule-based, Bayesian classifier, support vector machine (SVM), and k-NN [5]. In the present work Artificial Neural Network (ANN), Radial Basis Function Neural Network (RBF-NN), and Support Vector Machine (SVM) were used.

One of the most important methods to increase the classifier (a) (b) accuracy is the fusion technique. It can be classified into three Figure 3. (a)The raw CT chest image and (b) the results different levels, namely, data fusion at the level of data, fusion at obtained using INSDT the level of features, and fusion at the decision level [6]. A fusion 3.1.2 Denoising of the CT Lung Images step at the feature level has been adopted. Artifacts decrease of the quality of CT images. A previous study The performance of the proposed system is compared with that of [9] compared the performance of six image denoising techniques: previous reported classifiers: ANN classifier, RBF-NN classifier average filter, weighted average filter, Gaussian filter, median and SVM classifier [5]. The organization of the paper is as filter, Wiener filter, and wavelet filter and concluded that the follows: Section 2 is a description of the dataSET used. Section Weiner filter gives the best results [9]. Figure 4 shows an example 3describes the different stages the proposed system. Section 4 of a denoised image using Wiener filter. discusses the experimental results and Section 5 is the final conclusion. 2. THE DATASET Forty CT scans containing 320 regions of interest (ROI) were used from the Early Lung Cancer Action Project (ELCAP) Figure 4. A denoised CT lung image.

82 3.2 Lung Segmentation Having preprocessing CT chest images, the next step is to extract 3.2.2 Segment the Lungs within the Thorax the human lungs area from CT chest images. The proposed The goal of this step is to separate human lungs area from the algorithm of human lungs segmentation [10] consists of three thoracic area. To extract the lung area, the bi-level thresholding main steps; calculation of an optimal threshold, then segment the technique was applied to get a binary image. Then the thorax from the background, and finally segment the lungs. morphological operations and a median filter were used to obtain To calculate the optimal gray-level threshold, a diagonal gray- the lung binary mask. This was multiplied with the thorax image level histogram was constructed using the diagonal pixels to get the segmented lung area. Figure 7 shows the resulted intensity of all CT chest images of a complete scan. The resulted images. histogram was found to have three clear peaks (Figure 5). This is a common feature in all CT chest scan images. These peaks represent the following; peak P1is formed from black background pixels intensity, peak P2is formed from the pixels of low intensity representing the external region that surrounds the thorax area and the internal parenchyma of lung, and peak P3is formed from the pixels of high intensity which represent blood vessels, bones of the rib cage, heart, and pulmonary nodules. Accordingly, the choice of a gray-level point that divides the distance between the second and third peaks equally as an optimal gray-level threshold was used in the present automatic segmentation work. The Figure 7. The steps of lungs segmentation technique (a) The optimal gray-level threshold is calculated according to the segmented thorax image, (b) The binary image, (c) The following equation; filtered image, (d) The dilated image, (e) The filled image, (f) The closed image, (g) The segment lungs image. L = (P2 + P3)/2. (1) 3.3 Extraction of Nodule Candidates 3.2.1 Segment the Thorax from Background The main objective of this step is to extract the regions of interest The thorax extraction includes the removal of all image (ROIs) that composed of the nodule candidates using the bi-level components external to the chest area. First, the bi-level thresholding technique and a median filter of size 5*5. The thresholding technique was applied to obtain a binary image. resultant image was multiplied with the gray level lung image to Then a morphological operation and median filter of size 15*15 obtain the ROIs in the CT chest images. To evaluate the were applied to obtain the thorax binary mask which will be performance of the proposed framework, the resulted area were multiplied with the preprocessed image to get the segmented compared with those obtained by three other different techniques. thorax area. Figure 6 shows the steps in detail. These techniques are Otsu thresholding, local entropy-based transition region extraction and thresholding, and the basic global thresholding [11]. The region non-uniformity criteria [11] was used to compare the performance of the four thresholding methods.

3.3.1Region Non-uniformity Region non-uniformity is defined as:

F  2 T f (2) NU  2  BF TT 

2 2 where is the whole image variance, and f represents the

foreground variance, BT and FT denote the background and Figure 5. The diagonal gray-level histogram of CT scan foreground area pixels in the segmented image [11]. According to images. a non-uniformity (NU) measure the segmented image of smallest NU measure is the best histogram thresholding technique. The calculated NU measures of each segmented image for all applied histogram thresholding techniques are shown in Table 1. Table 1. The NU measure calculated for the segmented images using the four histogram thresholding techniques

Figure 6. The steps performed to segment the thorax (a) The preprocessed image, (b) The binary image, (c) The filled image, (d) The filtered image, (e) The segmented thorax.

83 Table 4 The specificity (SP) of the three classifiers.

By visual comparison of the images in Figure 8 and comparing 3.5 Nodules Detection the results tabulated in Table 1 shows that the proposed The final stage is to classify the resulted nodule candidates into framework gives the highest accuracy to detect the pulmonary nodules and non-nodules. Three classifiers were selected and their nodule candidates. performance was compared. These are: Artificial Neural Network (ANN), Radial Basis Function Neural Network (RBF-NN) [14] and Support Vector Machine(SVM) [15]. The classifiers are trained and their performance was evaluated using the classification accuracy, sensitivity, and specificity measures for each classifier [16]. For the training and testing steps of each classifiers, 25% of the available data set size was used for the training phase and they were tested using 75% of the available

dataset size.

4. EXPERIMENTAL RESULTS For each classifier, the classification accuracy rate (CAR), sensitivity (S) and specificity (SP) were calculated using the four types of features and hybrid feature vector. Tables 2-4 show the CAR,S, and SP obtained from the 3 different classifiers. Comparing the results of classification shown in the three tables, it is clear that the use ofthe feature fusion technique led to the highest classification results for the three classifiers. The CAR reached 96.3%, 97% and 95%, for the three classifiers ANN, Figure 8. (a) Otsu's method, (b) Basic Global Thresholding, (c) RBF-NN and SVM, respectively. The specifity (S) reached 99.1%, Local Entropy-Based Transition Region, and (d) the proposed 100% and100% for the three classifiers ANN, RBF-NN and SVM, framework. respectively. SP reached to 94%, 91% and 95%, False Positive (FP) of values 0.06, 0.1, and 0.058 and the False Negative (FN) Table 2. The classification accuracy rate(CAR) for the values of 0.008,0.0 and 0.0 for the three classifiers Artificial three classifiers. Neural Network (ANN), Radial Basis Function Neural Network (RBF-NN) and Support Vector Machine (SVM), respectively. Classifiers Table 5 depicts the number of features before and after selection ANN RBF-NN SVM Features using the GA algorithm and the CAR, S and SP corresponding to Wavelet Features 88.4% 81.7% 89.6% each classifier. VH Features 90.6% 94.3% 95% HOG Features 63% 63.6% 77.7% Table 3 The sensitivity (S) of the three classifiers. Statistical Features 75.8% 70.5% 93% Hybrid Features 94% 91% 95%

3.3 Feature Extraction Feature extraction is the process of defining a set of features which represent the information that is important for analysis and classification. In the present work, four different techniques of feature extraction were used; the statistical features of first and second order [12], the Value Histogram (VH) feature [12], the Table 4 The specificity (SP) of the three classifiers. Histogram of Oriented Gradients(HOG) features [13], and the texture features of Gray Level CO-Occurrence Matrix (GLCM) based on wavelet coefficients [13]. 3.4 Feature Fusion In the process of feature fusion, a new set of features was created from different sets of features after removing the insignificant and redundant features. Therefore, the four different feature vectors Table 5 The classification accuracy rate (CAR), the sensitivity were fused in a new hybrid feature vector using a simple (S) and the specificity (SP) of the three classifiers and the concatenation procedure. number of hybrid features before and after using the genetic algorithm (GA) technique. Having formed the hybrid feature vector, the next step is to remove any redundant and correlated information which is known as “feature selection”. In the present work, the GA algorithm was applied to the hybrid feature vector as a feature selection technique. The performance of each feature vector and the new hybrid feature vector was then compared.

84 Table 4 The specificity (SP) of the three classifiers.

As clear from Table 5, the application of GA feature selection feature fusion technique increased the detection accuracy of technique has increased the CAR, S and SP in addition to pulmonary nodules and improves the system performance . reducing the feature vector size. This has led also to a reduction in the computational time. The number of features resulted from An attempt was made to increase the classification accuracy, using the RBF-NN classifier decreased significantly but the CAR enhance the system performance and to reduce the computational time using the Genetic Algorithm (GA) as a feature selection and SP are relatively lower than those of the other two classifiers. While the results show that both the ANN and SVM have equal algorithm on the hybrid features vector. The CAR, S and SP values of CAR but the number of features in the case of the SVM results of the three learned classifiers; ANN, RBF-NN, and SVM is less than that of ANN. showed an increase in the values of CAR, S and SP of the three classifiers. The CAR reached 99.6%, 99.2% and 99.6% for the 5. CONCLUSION three classifiers respectively. Based on these results, it can be A Computer-Aided Detection system (CAD) for early detection of concluded that applying the (GA) as a feature selection technique lung nodules in CT scans images has been developed. The system to the hybrid feature vector increases the classification consists of four main stages. These are; image preprocessing performance of the system significantly. stage to enhance the quality of the CT images, an automatic In conclusion, the SVM classifier gives the highest CAR, S, and segmentation stage to automatically extract the human's lung and SP values of99.6%, 100% and 99.2%, respectively. Table 6 shows the pulmonary nodule candidates, a feature extraction and a comparison of the performance of the suggested system and selection stage and a classification stage to identify the pulmonary five systems reported in previously published researches. The nodules. comparison shows that the suggested system achieves the best Table 6 Comparison of the accuracy and false positives of the classification rate and the lowest false positives. proposed system and previous published work Still much work is needed for discriminating benign and malignant tumors of the lung nodules. This is the aim of the next stage of the work. 6. REFERENCES [1] Manikandan, T., "Challenges in lung cancer detection using computer-aided diagnosis (CAD) systems – a key for survival of patients", Arch Gen Intern Med Volume 1 Issue 2, 2017. [2] Manikandan, T., Bharathi, N., "A Survey on Computer- Aided Diagnosis Systems for Lung Cancer Detection", International Research Journal of Engineering and Technology, Forty CT scans with 320 regions of interest (ROI) were made Vol. 3, May-2016. available from the early lung cancer action project (ELCAP) [3] Shen, R., Cheng, I., Basu, A., "A hybrid knowledge-guided association to train and test the classifiers. The size of pulmonary detection technique for screening of infectious pulmonary nodules that were considered varies from 3 mm to 30 mm. tuberculosis from chest radiographs", IEEE Trans. Biomed. Eng., A novel Image Size Dependent Normalization Technique (ISDNT) vol. 57, no. 11, pp. 2646–56, Nov. 2010. was utilised to enhance the CT image contrast and aWiener filter [4] Nuzhnaya, T., Megalooi, V., Ling, H., Kohn, M., Steine, R., was used to ameliorate the CT image quality in the image "Classification of texture patterns in CT lung imaging", Proc. 
preprocessing stage. SPIE, vol.7963, pp. 1–7, 2011. For the automatic segmentation stage, the bi-level thresholding [5] Unnikrishnan, S., Shamya, C., Neenu, P.A., " An Overview technique wasapplied to the preprocessed CT images and median of CAD Systems for Lung Cancer Detection", International filter and mathematical morphological operations were utilised to Journal of Engineering Research and General Science, vol. 4, suppress any unwanted pixels. March 2016. In the third stage, four feature extraction techniques were utilized. [6] Mangai, U.G., Samanta, S., Das, S., Chowdhury, P.R., " A These are: are the statistical features of first and second order, the Survey of Decision Fusion and Feature Fusion Strategies for Value Histogram (VH) feature,the Histogram of Oriented Pattern Classification ", IETE technical review, vol 27, No. 4, Gradients (HOG) features, and the texture features of Gray Level Jul-Aug., 2010. CO-Occurrence Matrix (GLCM) based on wavelet coefficients. A [7] Early Lung Cancer Action Program(ELCAP), available from: feature fusion step was employed on the four different sets of http://www.via.cornell.edu/lungdb.html.[Last cited on 2011 Dec extracted features to produce the hybrid features vector. The five 05]. feature vectors were then used as the input to three types of classifiers and their performance was evaluated. The classifiers [8] Al-Ameen, Z., Sulong., G., Gapar., M., Johar, M., are: Artificial Neural Network (ANN), Radial Basis Function "Enhancing the Contrast of CT Medical Images by Employing a Neural Network (RBF-NN), and Support Vector Machine (SVM). Novel Image Size Dependent Normalization Technique", Int. Each classifier was trained using 25% of the dataset and tested Journal of Bio-Science and Bio-Technology, vol. 4, no. 3, using the remained 75% of available data input. September, 2012. [9] Abou-Chadi, F.E.Z., Amer, H.M., Obayya, M.I., "A The Classification Accuracy Rate (CAR), the Sensitivity (S), and Computer-Aided System for Classifying Computed Tomographic the Specificity (SP) were calculated for each classifier using each (CT) Lung Images Using Artificial Neural Network and Data of the five feature vectors. Comparing the CAR, S, and SP Fusion", Int. Journal of Computer Science and Network Security, resulted from each classifier has showed that the hybrid features vol.11 no.10, Oct. 2011. gave the highest CAR, S, and SP. This leads to conclude that the

85 [10] Raju, D.R., Neelima, "Image Segmentation by using [15] Choi, W.J., Choi, T.S., " Automated Pulmonary Nodule Histogram Thresholding", IJCSET, vol 2, Issue 1, pp. 776-779 Detection System in Computed Tomography Images: A January,2012. Hierarchical Block Classification Approach", Entropy 2013, vol. [11] Zuoyong, L., Zhang, D., Xuc, Y., Liu, C., "Modified local 15, pp. 507-523, 2013. entropy-based transition region extraction and thresholding", [16] Kuruvilla, J., Gunavathi, K., "Lung cancer classification Applied Soft Computing, vol. 11, pp. 5630–5638, 2011. using neural networks for CT images", computer methods and [12] Liu, X., Ma, L., Song, L., Zhao, Y., "Recognizing Common programs in biomedicine, vol. 113, pp. 202–209, 2014. CT Imaging Signs of Lung Diseases Through a New Feature [17] Demira, O., Çamurcub, A.Y., "Computer-aided detection of Selection Method Based on Fisher Criterion and Genetic lung nodules using outer surface features",Bio-Medical Materials Optimization", IEEE Trans. Biomed. and Health Informatics, vol. and Engineering, vol. 26, pp. S1213–S1222, 2015. 19, no. 2, March 2015. [18] Manikandan, T., Bharathi, N.," Lung Cancer Detection Using [13] Orozco, H.M, Villegas, O.V, Sánchez, V.G., Domínguez, Fuzzy Auto-Seed Cluster Means Morphological Segmentation and H.O., " Automated system for lung nodules classification based SVM Classifier", J Med Syst., vol. 40, no. 181, 2016. on wavelet feature descriptor and support vector machine", [19] Sweetlin, J.D., Nehemiah,H.Kh., Kannan, A., "Computer Madero Orozco et.al. Bio-Medical Engineering On Line, vol. 14, aided diagnosis of pulmonary hamartoma from CT scan images no. 9, 2015. using ant colony optimization based feature selection", Alexandria [14] Rao, M.V., Murty, N.V., "Early Lung Cancer Detection University, Alexandria Engineering Journal, ISSN:111.-0168, using Radial Basis Function Neural Networks ", vol. 2, no. 8, pp. 2017. 44-48, 2015.

86

Session 3 Computer Science and Applications

Directer: A Parallel and Directed Fuzzing based on Concolic Execution

Xiaobin Song1,2, Zehui Wu1,2, Yunchao Wang1,2 1.State Key Laboratory of Mathematical Engineering and Advanced Computing 2.China National Digital Switching System Engineering and Technological Research Center Zhengzhou, China [email protected]

ABSTRACT arrival of the target by combining the advantage of the fuzzing speed based on the provided target, which is mainly applied in Fuzzing is a widely used technology to find vulnerabilities, but the [1] current technology is mostly based on coverage and there are patch detection , crash reproduce, etc. At present, symbol relatively few research in the field of directed fuzzing. In this execution is usually adopted to solved the problem of reachable, paper, a parallelized testing technique combining directed fuzzing because symbol execution can find the target accessible paths and the inputs of the paths can be got by constraint solving. Common and concolic execution will be proposed. It extracts path space [2] [3] within the level of basic block in the function call chain through tools such as KLEE , S2E , etc. In addition, another fuzzing the program control flow analysis and function call relationship. technology is used to reduce the mutated domain in test case by Concolic execution is used to implement the target reachable paths combining with taint analysis, which bytes of seed may trigger abnormal program termination and focus on these bytes mutation guidance, in order to achieve the goal of rapid arrival. In the [4] experimental stage, the developed Directer was used to test on in future, the use of such technology tool such as Vuzzer . Marcel Böhme puts forward a kind of directed greybox fuzzing LAVA dataset, which shows better performance than the existing [5] fuzzers. tool AFLGo . It doesn't need heavy machinery of symbolic execution and constraint solving, casting reachability as a CCS Concepts optimization problem by prioritizing smaller distance test case to • Security and privacy→Software security engineering; achieve the goal and only need to take a lightweight program analysis. The method in the experiment is effective. Keywords directed fuzzing, concolic execution, parallel 2. BACKGROUND 1 int main(int argc, char **argv) { 1. INTRODUCTION In recent years, with the convenience of software development, the 2 int magic = 6845; number of software shows an explosive growth trend. Due to the 3 char bug_loc[ ] = "BUG-LOCATION"; lack of normative software testing standards, the number of 4 if (magic == atoi(argv[1])){ vulnerabilities in software also increases year by year, posing a great threat to users. In the years of exploration and development, 5 printf("Magic number is ok, executing...\n"); vulnerability mining and analysis technology has formed a mature 6 } system and the most widely used method for vulnerability mining 7 else{ is fuzzing. AFL is one of the most widely used fuzzers. It through the method of instrumentation to achieve paths recording, using 8 printf("Invalid magic number, exit!\n"); genetic algorithm to generate a large number of inputs, improving 9 exit(1); the code coverage, but as a result of the fuzzing technology based on coverage lack certain guidance, a large number of seeds in the 10 } irrelevant paths exploration wasting too much time, resulting in 11 if (strncmp(argv[2], bug_loc, strlen(bug_loc)) == 0){ low efficiency. 
Different from traditional fuzzing based on 12 program_bug(); coverage, directed fuzzing is designed to achieve the gradual 13 exit(1); Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that 14 } copies are not made or distributed for profit or commercial 15 else{ advantage and that copies bear this notice and the full citation on 16 execut_branch(); the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is 17 next_stage(); permitted. To copy otherwise, or republish, to post on servers or to 18 } redistribute to lists, requires prior specific permission and/or a fee.Request permissions from [email protected]. 19 return 0; ICSIE '18, May 2–4, 2018, Cairo, Egypt 20 } © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00 Figure 1: Example that illustrates issues in fuzzers DOI:https://doi.org/10.1145/3220267.3220272

87 Applications usually deal with two types of user inputs, one is Directed fuzzing module. By using the LLVM compiler to common input and its valid value range is very wide. The other is instrument and calculate distance, the target approximation a specific type of input that needs to conform to a particular value strategy based on simulated annealing algorithm is adopted to and the range of valid values is very limited. Due to these special achieve the sensitive target. The existing seeds are updated inputs are usually located in certain offsets in an input, the user according to the results of the symbol execution module in order does not know its specific meaning, which is called magic bytes, to realize the breakthrough of magic byte inspection. The shorter these bytes will often determine the program’s execution flow. seed distance is, the faster arrival will be. Because most inputs adopting blindly mutation strategy are A scheduling module based on seed distance. The module based difficult to meet the inspection. Therefore, the code behind these on the variation of seed distance in fuzzing, selective invokes checking is often difficult to be tested, thus reducing fuzzing symbol execution module. At the same time, it uses Pin[6] to efficiency. This section provides a simple example to illustrate in record the execution function chain of each seed in the minimum Fig 1. distance seeds queue. According to the function chain, the jump This program first compares the first input parameter with a location waiting to flip of each seed in the minimum distance constant. It will enter the next phase of the magic byte checking if seeds queue is determined. Finally, the new solved seed is added the first succeed. The second magic byte will trigger the exception to the queue of solved to wait for the fuzzing module to update. of the program after success. Symbol execution module. This moudle is implemented based on BAP[7]. Firstly, the set of instructions for the specific bytes main main pollution in the seed waiting to solve are analyzed and these bytes are selected. Then according to the jump condition location Long time information obtained by scheduling module, the jump instruction 6845 6845 to be flipped in the middle language file is selected. The execution flow transformation is implemented by flipping specific jump condition. Long time 4. METHODOLOGY Invalid BUG-LOCATION Invalid BUG-LOCATION In this section, we elaborate each component in Fig. 3 in details 4.1 Fuzzing

execut_branch execut_branch This module combines the idea of AFLGo. First of all, it uses afl- BUG BUG clang-fast compiler to generate the call graph(CG) and the control flow graph(CFG) of related functions and then use Djikstra Figure 2: Different effects between AFL(L) with AFLGo(R) algorithm to calculate the distance between functions and the As shown, AFL executing the program for a long time did not distance between the basic blocks to get files contain each basic discover a bug of the program. The reason is that the exact value block distance. of the original field offsets must be found exactly at the first if The advantage of AFLGo lies in the strategy of target condition. If the match fails, it will enter the else branch, because approximation, which is different from the traditional coverage- based fuzzer. The directed fuzzing is more suitable for the specific the code coverage has improved, so it will retain the else branch [8] test case. It will lead to the program crash becoming unreachable. state space test after mastery of certain information of program. By contrast, the consumption of time in the first checking may be Since AFLGo is an AFL-based improvement, its main advantage relatively small in AFLGo. Though it is able to reduce the is the rapid increasing in code coverage, which is inefficient in possibility of exploring the else branch, it also takes a long time to face of magic byte checking. Because of the seed selection run into the second if condition. strategy of AFL, once it discovers new path, the subsequent process will more focus on exploring that path space. Compared 3. OVERVIEW with AFL, the seed selection strategy of AFLGo includes the Directer can test the code hidden behind the magic byte checking, influence of seed distance, the seed will try its best to avoid further combining with the speed of fuzzing and the ability to solve the exploration of irrelevant branches, but it will also cause the complex input of symbol execution. The tool is composed of three reduction of efficiency due to the difficulty of changing a specific modules. A general overview of these components is as follows, bit accurately, so it will also stuck in the check for a long time. which will be elaborated later. Based on this problem, the naming rule of seed is modified, which contains the distance value of the mutation seed in order to call the Solved symbol execution module based on the updated of the distance queue value subsequently. For the newly generated solved seed, by starting a new process to poll the solved queue before the new round of seed testing of fuzzing test begins. Once a new seed is Mutation Concolic Program Fuzzing found, the corresponding seed in the original seed queue will be test cases execution updated in order to solve the problem of fuzzing efficiency because of the magic byte checking. As shown in the following, the example has only a single target, Shortest setting the target function for vulnerability, AFL and AFLGo may scheduling distance queue not reach the vulnerability function due to check_magic_value. Execution paths Directer queries distance values of seeds generated by AFLGo. Once it is found that the minimum distance between seeds remains unchanged for a period of time and the seed execution flow does Figure 3: A high-level overview of Directer

88 main main main main

read_file read_file read_file read_file

error handle Check_magic_value error handle Check_magic_value error handle Check_magic_value error handle Check_magic_value

error_magic file_buffer error_magic file_buffer error_magic file_buffer error_magic file_buffer

match not_match match not_match match not_match match not_match

other vulnerability default other vulnerability default other vulnerability default other vulnerability default

exit exit exit exit Figure 4: The nodes Figure 5: The nodes Figure 6: The nodes Figure 7: The nodes initially found by the found by the first found by new test case found by the second fuzzer invocation of concolic invocation of concolic execution execution not contain vulnerability function, the symbol execution module will be called. By comparing with the reachable path of the objective function extracted in the previous stage, the non-uniform jump position of : the seed is obtained. The new solved seed can reach the file_buffer Algorithm Even-and-odd function pair search algorithm according to the result in Fig 5, though it will not be stopped by Input:Trace checking the magic bytes. The seed further explores the state space, but it may enter the not_match security function and still Output:addr_dict will not reach the vulnerability function. In this case, the minimum 1: f ← read(trace) seed distance is maintained at a certain size again. After a period of mutation, the symbol is executed for the second time, and 2: repeat according to the seed execution path, it will find that the 3: line ← readline(f) file_buffer should be flipped at a jump position to enter the match function. Finally, it will reach the vulnerability function and the 4: if ‘0x’ and ‘end’ not in line: effect is shown in Fig 7. 5: if func_pair: 4.2 Scheduling 6: num(func) + 1 Through the extraction of program call graph from pre-test, the 7: else: call chain of objective function reachable can be accessed. 8: append func to func_pair According to this call chain, it records the reachable control flow of the basic blocks containing the calling function of the internal 9: end if function of the chain in the control flow graph of each function 10: elif ‘end’ in line: from the chain. At the same time, the conditions of conditional 11: num(func) + 1 jump instructions in the above control flow are also recorded to provide guidance for subsequent symbol execution. When the 12: else: fuzzing is carrying out, the seeds from the initial fuzzing phase 13: for func from func_pair(last) to func_pair(first) do: were arranged in ascending order according to the distance value 14: if num(func) is odd: and a new seed queue was maintained. Before the symbol execution, the test program should be instrumented in the call 15: append addr to addr_dict(func) functions and the basic blocks with the Pin. The execution path of 16: end if the seed is recorded by using pintool for selection of subsequent jump conditions. The records are represented as function pairs like 17: end for ['main', 'main: end']. In this module, a process pool is set up, 18: end if which includes a seed queue generating process and a symbol 19: until line is None execution process, two processes are processed in parallel, the symbol execution process is selectively invoked according to the distance of the seeds from queue. The polling time is set in the 4.3 Concolic Execution symbol execution process. It varies according to the test procedure When the new seed queue is polled and the shortest seed distance and is mainly based on the overall time taken for a single seed to don't update, the symbol execution module is invoked. In this process, the new seed queue suspend update. A BAP-based undergo mutation testing in the fuzzing phase. 
The record of the improvement solution is used, BAP is an open source binary basic blocks addresses of the seed execution path uses an even- and-odd function pair search algorithm, because each layer program static analysis platform developed by CMU Cylab. It first function needs to be processed according to the function execution converts the assembly code disassembled binary code into a BIL flow in consideration of the existence of complex structures such intermediate language, and then use different components to as nesting and looping. analysis the middle language.

89 program ::= stmt* 5. EVALUATION stmt ::= var := exp | jmp exp | cjmp exp,exp,exp | assert exp Dolan-Gavitt et al. developed LAVA, an automated vulnerability injection technique[11].It constructs a large number of vulnerable | label label_kind | addr address | special string corpora by the way of source code injection, and each

exp ::= load(exp, exp, exp, 휏reg) | store(exp, exp, exp, exp, vulnerability is accompanied by a trigger input. It is mainly used for testing fuzzer and symbolic execution performance. We use 휏 ) | exp ♢b exp reg the LAVA-1 dataset for the convenience of specifying the location | ♢u exp | var | lab(string) | integer | cast(cast_kind, 휏reg, exp) of the vulnerability using file as the injection program, which | let var = exp in exp | unknown(string, 휏) contains 69 buffer overflow vulnerabilities, each of which uses 4 bytes for triggering. There are two types of injection, the first one Figure 8: BIL main syntax definition is 2-byte unsigned on the input and the other 2 bytes are large The magic byte checking in the assembly code is usually in the enough to trigger a vulnerability. The other type is a magic value form of CMP instructions. Therefore, the assembly instructions that uses 4 bytes of unsigned integer and is within a range of are scaned to get the addresses of all CMP instructions and variation. For the performance, here we quote some of the results operands at first. We take advantage of an idea from Vuzzer to from the LAVA paper. The experimental environment for this track taint at the byte level based on Datatracer[9], dynamic taint experiment is two 32-bit 4-core intel CPU and 4GB of memory analysis prior to fuzzing and it results in CMP instructions used in Ubuntu16.04 LTS system. magic byte checking. It will record the offset position in the Table 1: The LAVA-1 dataset performance test original seed corresponding to the CMP instruction operands. In the subsequent of the symbol execution, the CMP instruction can Tool Crashes Found be searched reversely according to the trace and the CMP Type Range KT instruction list acquired in the previous phase. If the corresponding 20 27 214 221 228 instruction is not found in the CMP instruction list, dynamic taint Total analysis of the current seed is required to find the original byte (12) (10) (11) (14) (12) (10) offsets of the CMP instruction. After the offset position is obtained, AFL 0 0 4 12 9 4 the designated offset is marked as taint according to the corresponding seed and then the corresponding trace file is SES 1 0 1 3 0 1 obtained by executing the seed. The file is converted to the AFLGo 0 0 4 12 9 5 intermediate language (il) file through the concolic execution. The process also record the path constraints during execution. Directer 1 1 5 12 9 7 After getting the il file of seed corresponding to the specified Directer has some improvement over the amount of crash in offset bytes, the next step needs to flip the specific condition in the AFLGo, but the magnitude is smaller, which is mainly related to il. According to the il grammar rules, the conditional jump the ability of symbolic execution. Meanwhile, the average time for instruction starts with assert ~. Firstly, the total number of different tools to trigger different types of crashes is counted. conditional jump in the il file are counted and then flip at specific Three different tools were tested in the experiment for 4 hours. conditon according to the execution of the seed. Symbol execution Here were selected for each type of three test programs, statistics process through the IDA to obtain the target function call chain on the average vulnerability trigger time from different tools. and control flow graph of the functions in the chain, constructing the path list containing the basic blocks that call function from the Table 2: Different tool crash time comparison chain. 
Firstly, the function execution flow obtained through the Serial AFL Scale AFLGo Direct Scale Pin instrumentation is compared with the function call chain Number (s) (k) (s) er(s) (k) whose objective function is reachable to obtain the last public function in the two chains and the internal basic block level path 1370 951.4 982.3 655.1 158.6 611.3 of the function is compared with the list of pre-extracted paths to 214 14331 10319.5 982.1 1224.9 276.5 599.1 obtain the longest common sub-sequence of the path. Then the last basic block in the subsequence backwards in order to find the 4049 1035.3 981.8 14696.3 560.3 665.6 conditional jump instruction, flipping according to il file with the 660 20.7 982.7 40.8 41.7 611.6 same jump instruction. Subsequent seeds repeat the above process. 21 If there is no comparison instruction, the seed is segmented to taint 2 2048 48.6 982.8 16.3 16.3 611.6 and the above steps will not be repeated. Until the jump 14960 1838.6 981.9 1288.2 733.6 599 instruction in basic block is found. If the seed is small, the seed 3612 1849.7 983.8 1736.2 1350.3 611.3 can be full-byte tainted. Then the position of the conditional jump instruction in the il file will be recorded. The position is used as 228 4192 1.6 982.9 0.9 0.9 611.5 the parameter, deleting all the subsequent after the jump condition 4961 1195.5 981.8 1965.3 1665.3 611.3 and finally obtain the new il file. The constraint conditions in the new il file are collected with topredicate[10] tool of BAP to 14314 5542.6 981.8 3355.8 471.5 599.1 generate the constraint paradigm file and then the constraint KT 1460 1475.3 982.8 942.4 193.5 611.1 paradigm file is solved by STP constraint solver. According to the result of the solution, a new seed file is constructed by combining 2655 5804.3 982.5 1103.6 490.4 611.5 the taint mapping file with the original seed file. The new seed It can be seen that Directer trigger time compared to AFLGo have will be put into the solved queue and wait for the subsequent different degrees of reduction. We take file-5.22.14314 as an fuzzing moudle to update. example, the codes to trigger the condition of the vulnerability is as follows.

90 After debugging, we found that the last 4 bytes in the first 8 bytes high-quality input seeds compared to non-application methods. of the seed file must satisfy the condition of 0xda89 and the first 4 This method can improve the quality of seeds to target and bytes need to be larger than 0x2ab. The above conditions for the approaching speed, but the method is limited to rely too much on AFLGo mutation strategy takes longer time to meet, but Directer the quality of seeds, causing that the method has certain limitation can use the symbol execution module to process and time reduced in some cases. by nearly eight times. However, there is no obvious improvement Hybrid fuzzing technology. Istvan Haller puts forward Dowser[14], in the test program with a short target exposure time. This is a "guide" fuzzer, it combines taint tracking, program analysis and because the symbolic execution consumes a large amount of symbolic execution to find buffer overflow and underflow resources in solving process. vulnerabilities that go deep into program logic. Specifically, it first uses taint analysis to determine which input bytes affect the array … index and executes the program symbolically. By constantly if (pos != (off_t)-1) stepping the result of the branch into the path that is most likely to cause an overflow, a deep error can eventually be detected in the (void)lseek(fd, pos, SEEK_SET); actual program. Wang proposed a directed fuzzing technique close_and_restore(((ms)) + ((lava_get()) & 0xffff) * (0xda89 based on checksum-aware and designed the prototype TaintScope == ((lava_get()) >> 16)),inname,fd,&sb); system[15]. This system uses the dynamic and static methods to out: check and bypass check points and perform fuzzing test. The core idea of TaintScope is that taint propagation information can be return rv == 0 ? file_getbuffer(ms) : NULL; used to detect and bypass checksum-based integrity checking and … to drive the generation of malicious test cases. Then it will repair the check field in the test case by combining concolic techniques. Figure 9: file-5.22.14314 critical code snipped TaintScope can change the execution path of the target program at At the same time, the symbolic execution is scheduled according the location of integrity checkpoint. Its fine-grained taint label can to execution time and the quality of the seed, which leads to the be used to determine exactly which input byte can reach the target vulnerability being triggered before scheduling. Although security sensitive point. parallel processing is used, it does not affect the overall efficiency of fuzzing, so the solution of symbolic execution does not play a 7. CONCLUSION significant role in overall. The above data also shows that the size In this paper, a kind of parallel technology combining the fuzzing of the input threshold range that is satisfied also limit symbolic with the symbol execution technology is proposed, which avoids execution. However, due to avoiding the global path exploration the influence of the complex mechanism of symbol execution on of traditional symbolic execution, this method shows in the the fuzzing test efficiency. At the same time, aiming at the experimental results that it is not limited by the inability to solve problem that the fuzzing can not be further explored due to magic due to the path explosion. byte checking. A solution to the problem of specific point inversion by using the program dynamic analysis is given, so as to 6. RELATED WORK realize the optimization of the seed. 
We rely on the byte tracking This section will summarizes the existing directed fuzzing of the special instructions in the early stage to prevent the problem techniques and makes a brief analysis, which is divided into of path constraint solving difficult caused by over pollution. whitebox fuzzing technology based on symbolic execution, According to the result of the solution, the quality of the original directed fuzzing based on taint analysis and hybrid fuzzing. seed is improved by replacing the seeds generated during the Whitebox fuzzing based on symbolic execution. Godefroid et al. fuzzing. The distance value is taken as the measurement index to proposed a whitebox fuzzer for x86 Windows applications named achieve the fast approximation of the target position. SAGE[12]. SAGE begins with an initial input and records the trace, Finally, the LAVA dataset was tested and compared with AFL and then symbolically executes the path while storing the input AFLGo. The results show that not only the quantity of constraints. For each constraint being denied, a new input of vulnerabilities found but also the target exposure time are different execution flows is obtained to increase code coverage. significantly improved compared with AFL, which is also Nick Stephens et al. proposed a balanced approach to the use of improved to a certain extent compared with AFLGo. The result fuzzing and selective concolic execution of the technology shows that the parallelization process of fuzzing test and symbolic driller[13] to find deeper errors. Firstly, the application of execution has certain feasibility for directed test and it has certain lightweight fuzzing will be used and concolic implementation improvement for test performance. The shortcome of this used to generate input to meet the complex check to achieve part technique is that symbolic execution still consumes relatively of the branch jump to test the deeper path to improve code large resources in program analysis. In the future, we will further coverage. By combining the advantages of these two technologies, research in this issue. their respective weaknesses are alleviated and the problem of path explosion and fuzzing defects in their analysis are avoided. The 8. ACKNOWLEDGMENTS experimental results show that driller has achieved good results in We thank the anonymous reviewers for their comments to practical application through proper combination of two improve the quality of the paper. This work was supported by technologies. Ministry of Science and Technology of China under Grant Fuzzing technology based on taint analysis. Sanjay Rawat et al. 2017YFB0802901. proposed a fuzzing technique for applying application-aware evolutionary fuzzing strategy without any prior knowledge of 9. REFERENCES application or input format. In order to maximize the coverage and [1] Marinescu P D, Cadar C. KATCH: high-coverage testing of exploration of deeper paths, the basic properties of the application software patches[C]//Proceedings of the 2013 9th Joint are inferred using control and data flow characteristics based on Meeting on Foundations of Software Engineering. ACM, static and dynamic analysis. This allows for faster production of 2013: 235-245.

91 [2] Cadar C, Dunbar D, Engler D R. KLEE: Unassisted and Meeting on Foundations of Software Engineering. ACM, Automatic Generation of High-Coverage Tests for Complex 2017: 627-637. Systems Programs[C]//OSDI. 2008, 8: 209-224. [9] Stamatogiannakis M, Groth P, Bos H. Looking inside the [3] Chipounov V, Kuznetsov V, Candea G. S2E: A platform for black-box: capturing data provenance using dynamic in-vivo multi-path analysis of software systems[J]. ACM instrumentation[C]//International Provenance and Annotation SIGPLAN Notices, 2011, 46(3): 265-278. Workshop. Springer, Cham, 2014: 155-167. [4] Rawat S, Jain V, Kumar A, et al. Vuzzer: Application-aware [10] Brumley D, Jager I, Schwartz E J, et al. The BAP evolutionary fuzzing[C]//Proceedings of the Network and handbook[J]. 2013. Distributed System Security Symposium (NDSS). 2017. [11] Dolan-Gavitt B, Hulin P, Kirda E, et al. Lava: Large-scale [5] Böhme M, Pham V T, Nguyen M D, et al. Directed greybox automated vulnerability addition[C]//Security and Privacy fuzzing[C]//Proceedings of the 2017 ACM SIGSAC (SP), 2016 IEEE Symposium on. IEEE, 2016: 110-121. Conference on Computer and Communications Security [12] Godefroid P, Levin M Y, Molnar D A. Automated whitebox (CCS’17). 2017. fuzz testing[C]//NDSS. 2008, 8: 151-166. [6] Luk C K, Cohn R, Muth R, et al. Pin: building customized [13] Stephens N, Grosen J, Salls C, et al. Driller: Augmenting program analysis tools with dynamic Fuzzing Through Selective Symbolic Execution[C]//NDSS. instrumentation[C]//Acm sigplan notices. ACM, 2005, 40(6): 2016, 16: 1-16. 190-200. [14] Haller I, Slowinska A, Neugschwandtner M, et al. Dowsing [7] Brumley D, Jager I, Avgerinos T, et al. BAP: A binary for Overflows: A Guided Fuzzer to Find Buffer Boundary analysis platform[C]//International Conference on Computer Violations[C]//USENIX Security Symposium. 2013: 49-64. Aided Verification. Springer, Berlin, Heidelberg, 2011: 463- 469. [15] Wang T, Wei T, Gu G, et al. TaintScope: A checksum-aware directed fuzzing tool for automatic software vulnerability [8] Li Y, Chen B, Chandramohan M, et al. Steelix: program-state detection[C]//Security and privacy (SP), 2010 IEEE based binary fuzzing[C]//Proceedings of the 2017 11th Joint symposium on. IEEE, 2010: 497-512.

92 A New Approach for Implementing 3D Video Call on Cloud Computing Infrastructure Nada Radwan M. B. Abdelhalim Ashraf AbdelRaouf College of Computing and College of Computing and Faculty of Computer Science, Information Technology, Information Technology, Misr International University, Cairo, Arab Academy for Science and Arab Academy for Science and Egypt Technology & Maritime Transport, Technology & Maritime Transport, Cairo, Egypt Cairo, Egypt [email protected] [email protected] [email protected]

ABSTRACT 3D video call is a set of technologies, which allow a caller to feel Traditional video-conferencing systems still fail to meet the the depth of the other caller and to give the real-life feeling. 3D challenge of providing a feasible alternative for physical business video call is a developing technology that can be presented by travel, which characterized by unacceptable delays, and costs [2]. peer-to-peer architecture. Cloud-based technologies are driving 3D video call technology that presented nowadays has a problem positive changes in the way organizations can communicate. In that not scalable and expensive implementations [3]. running a global business, the need for travel and being available in meetings is a must. However, with expensive travel costs, an The urgent need for 3D video call is encouraging to enhance a alternative solution to overcome this problem is required. This system that makes communication more natural and clear between paper presents, a new approach that enhances current 2D video people. The addressed issues were the inspiration to work on a calls to 3D video calls benefiting from the unlimited features of solution that handles this issue. To deliver clear, fast, pure 3D the cloud-computing. Three technologies were implemented, video communication, video as a service (VaaS) shown in figure 1 OpenStack cloud, webRTC call and 3D anaglyph effect to achieve applied in cloud infrastructure [4]. the sense of 3D video. CCS Concepts • Software and its engineering→Agile software development.

Keywords cloud computing; peer-to-peer; Openstack; webRTC; 3D anaglyph.

1. INTRODUCTION Currently, 3D video is entering broad in the technology market. Figure 1. VaaS Solution and Platform The technology is now maturated, providing excellent quality. It becomes increasingly interesting for other applications such as home entertainment, mobile devices and 3D video systems. In our research, we benefit from previous research in three Cloud Computing has become a significant research topic of the different disciplines, which are cloud computing, video call and scientific and industrial communities since 2007 because of its 3D video. As shown in figure 2, Cloud-computing is used as an management strategy, reliability, speed, scalability, and infrastructure [5] to setup a video call using webRTC technology convenient services offered to clients [1]. [6]. Then, 3D video is created using image processing techniques to generate 3D video that can be watched by Red/Cyan glasses [7]. Finally, we setup cloud Virtual Machine (VM) to handle and test Permission to make digital or hard copies of all or part of this the performance of the 3D video call and compare it with work for personal or classroom use is granted without fee provided that traditional 3D peer-to-peer video call. copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. ICSIE '18, May 2–4, 2018, Cairo, Egypt © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00 DOI:https://doi.org/10.1145/3220267.3220274 Figure 2. Research Implementation Diagram

93 The goal of our research is to develop for the first time a new approach that implements 3D video call on top of cloud computing infrastructure. Section II discusses the related work done in our approach research areas. Section III explain in details our proposed framework. Section IV escribes our approach conclusions and future work. 2. RELATED WORK Real-time communication is enhancing along time started with telephone call 1876 [8], text chatting 1973 [9], voice calls 1973 [10], video calls 1980 [11] then video conferencing systems 1991 [12]. Cloud computing provides three main service models Software as a Service (SaaS), Infrastructure as a Service (IaaS) and Platform Figure 4. webRTC API components Architecture as a Service (PaaS). There are many solutions developed to put cloud computing in implementation. OpenStack is one from those communities of cloud computing platforms [5]. OpenStack as The term of "3D" was discovered in 1850's. In 1853, the first shown in figure 3, is an open source and free platform under the person who presented the idea of anaglyph using blue and red rules of the Apache license that has a set of tools for the creation lines on a black field was W. Rollman [16]. Rollman used blue and management of public, hybrid and private cloud computing, and red glasses to perceive the anaglyph effect. A mixture of two used because of its modularity, scalability, and flexible set of images from the perspective of the right and left eyes is called utilities [13]. anaglyph 3D image. One eye will perceive through a red filter and the other eye will perceive through a different color filter such as cyan [7]. The Anaglyph 3D method of stereoscopic visualization is both cost effective and compatible with all full-color displays [17]. 3. PROPOSED FRAMEWORK The proposed framework presents a new approach to develop 3D video call on top of cloud computing infrastructure; it is a combination of OpenStack cloud computing, webRTC technology for video chatting system and anaglyph algorithm for generating 3D video. Figure 5 shows our proposed framework. It starts by constructing cloud infrastructure to handle the video communication using OpenStack cloud architecture, create Virtual Machine (VM) to handle and test the communication, then setup video call system using webRTC on the Virtual machine. While capturing the video using video input device apply the 3D filter to generate 3D anaglyph video. Figure 3. OpenStack cloud platform Architecture 3.1 Cloud Construction OpenStack consists of seven different service code projects to WebRTC is a free, open source project that provides browsers and make it modular. The Virtual machine was created on top of mobile applications with Real-Time Communications (RTC) OpenStack with nova service, which provides the service for capabilities via simple APIs. WebRTC, built on JavaScript provisioning and un-provisioning of virtual machines on-demand Sockets programming, Communication held on between two basis. The virtual machine created by storage service divided into networks with real-time video streaming feature with help of two main projects Cinder and Swift. Block storage (Cinder) used special protocol as well as reliable communication [14], rich high to store data over running the instance and get lost when instance quality RTC can be developed on browsers and mobile platforms is terminated. Object storage (swift) allows the OpenStack users [15]. Figure 4 explains the webRTC architecture. to store or retrieve files. 
Ubuntu operating system [18, 19] was installed on the VM using image service (Glance). A 1 Gbps network was used to connect the cloud over the Internet using Neutron. Finally, to manage and monitor the virtual machine over the cloud infrastructure dashboard (Horizon) is used. Figure 6 shows how the Virtual Machine operates on top of OpenStack infrastructure.

94 (cloud infrastructure) without any firewalls or NATs between them. Signaling Mechanism based on reliable data channel, what is required is session negotiation before establishment a connection between browsers, this is done by WebRTC signaling mechanism. To build signaling mechanism node.js was used and web-socket library to pass the requests between candidates. WebRTC is using many codecs to encode and decode the video and audio streams such as H.264, iSAC, Opus and VP8 [15]. When two browsers connect together, they choose the most optimal supported codec between two-users. Figure 7 shows how WebRTC technology operates on top of Cloud computing infrastructure such as OpenStack.

Figure 5. The proposed approach block diagram

3.2 Virtual Machine Creation The host machine is the actual machine on which the virtualization takes place, the guest machines are the virtual machines functioning through the host [20]. OpenStack was built on top of host machine with 32 GB RAM and 500 GB storage. Figure 7. WebRTC on top of OpenStack Then to host sufficient virtual machine on top of OpenStack minimum specification required is 8 GB RAM with 80 GB and four Virtual CPUs (vCPUs), this is considered as large VM on 3.4 3D Video Anaglyph Construction OpenStack infrastructure. The Operating system is installed by Anaglyph video construction is done via JavaScript and HTML5 creating Ubuntu image using Glance service. Then to connect the tags to be implemented on the web. The video was captured and VM to the internet must assign Floating IP that allows external encoded using WebRTC technology. WebRTC capture video with access from outside networks or Internet to an OpenStack virtual 30 Frames per Second (FPS) [15]. To construct anaglyph images, machine, This IP used also to test connectivity of the VM by ping two RGB images must be combined (frames). Since only one it from a remote computer in LAN. RGB image (frame) used as our input, we have to duplicate the color frame and apply pixel shifting to have right and left images to create stereoscopic view for the images [7]. Combination of these two frames will create the anaglyph image (frames) the generated frames will be assembled to generate 3D anaglyph video, then by wearing the anaglyph glasses we can feel the sense of 3D depth. Figure 8 describes the process to creating anaglyph video.

Figure 6. VM on Openstack infrastructure

3.3 WebRTC Setup Figure 8. process of creating anaglyph video WebRTC (Web Real-Time Communication) achieves a peer-to- peer real-time multimedia communication on the web [6]. The core architecture of webRTC is based on multimedia 4. Conclusion and Future Work communication process includes voice module, video module, and This paper proposed a new approach that implements 3D video transmission module. In the delivery of real-time data, timeliness call on top of cloud infrastructure OpenStack. WebRTC and low latency can be more important than reliability. In both technology used to create the video call system. The 3D video peers of data transmission, one of the fundamental requirement is generated by using anaglyph technique applied on 2D video. The the ability to locate and identify each other on the network, in our sense of 3D video can be viewed by red/cyan glasses (anaglyph implementation; both peers are located in the same network glasses).

95 The major challenge we faced is the difficulty in cloud computing [9] “Online Chat.” Wikipedia, Wikimedia Foundation, 15 Dec. systems to connect it with external devices such as webcams and 2017, en.wikipedia.org/wiki/Online_chat. headphones to be viewed on the VM. To overcome this difficulty, [10] Pathan, Mukaddim.” Advanced Content Delivery, Streaming, we used MP4 video to test the implementation on top of and Cloud Services”. Wiley, 2014. OpenStack. This obstacle can be targeted in the future for the full run of the proposed approach. Also, we must have some tests to [11] Harrison, S. Media Space 20 Years of Mediated Life. measure the quality of generated anaglyph 3D video compared Springer, 2009. with anaglyph in the market. Also, we have to compare the quality [12] Telemerge Inc. Follow. “The History of Video of our generated 3D video call with another implementation such Conferencing.” LinkedIn SlideShare, 23 Jan. 2015, as peer-to-peer connection. www.slideshare.net/Telemerge/the-history-of-video- conferencing-by-telemerge. REFERENCES [1] Manish Kumar Aery, “Mobile Cloud Computing: Security [13] Daniel Grzonka “The Analysis of OpenStack Cloud Issues and Challenges”, International Journal of Advanced Computing Platform: Features and Performance”, Journal of Research in Computer Science, Volume 7, No. 3, May-June telecommunications and Information Technology, March 2016 2015. [2] Schreer, O., et al. “3D Presence - a System Concept for [14] “WebRTC Home, webrtc.org/. Multi-User and Multi-Party Immersive 3D Video [15] Zafran M R M, Gunathunga L G K M, Rangadhari M I T, Conferencing.” IET 5th European Conference on Visual Gunarathne M D D J, Kuragala K R S C B, and Mr Dhishan Media Production, 2008. Dhammearatchi,” Real Time Information and [3] Kelion, Leo. “Skype Confirms 3D Video Calls Are under Communication Center based on webRTC“, International Development.” BBC News, BBC, 29 Aug. 2013, Journal of Scientific and Research Publications, Volume 6, www.bbc.com/news/technology-23866593. Issue 4, April 2016. [4] D. Kesavaraja and Dr. A. Shenbagavalli, “Cloud Video as a [16] Ray Zone, Stereoscopic Cinema and the Origins of 3-D Film, Service [ VaaSj with Storage, Streaming, Security and 1838-1952,2007 Quality of service Approaches and Directions”, 2013 [17] Woods, Andrew J., and Chris R. Harris. “Comparing Levels International Conference on Circuits, Power and Computing of Crosstalk with Red/Cyan, Blue/Yellow, and Technologies. Green/Magenta Anaglyph 3D Glasses.” Stereoscopic [5] Rohit Kamboj and Anoopa Arya, “Openstack: Open Source Displays and Applications XXI, Apr. 2010. Cloud Computing IaaS Platform”, International Journal of [18] Get Images¶.” OpenStack Docs: Get Images, Advanced Research in Computer Science and Software docs.openstack.org/image-guide/obtain-images.html. Engineering, Volume 4, Issue 5, May 2014 [19] Canonical. “Ubuntu Enterprise Summit.” The Leading [6] Cui Jian and Zhuying Lin, “Research and Implementation of Operating System for PCs, IoT Devices, Servers and the WebRTC Signaling via WebSocket-based for Real-time Cloud | Ubuntu, www.ubuntu.com/. Multimedia Communications”,5th International Conference [20] M. Kuttera and F. A. P. Petitcolasb, “A fair benchmark for on Computer Sciences and Automation Engineering image watermarking systems”, Electronic Imaging '99. [7] Makhzani Niloufar, Kok-Why Ng and Babaei Mahdi,” Security and Watermarking of Multimedia Contents, vol. Depth-Based 3D Anaglyph Image Modeling” International 3657, Jan 1999. 
Journal of Scientific Knowledge, Vol. 4, No.7, March 2014. [8] “Telephone Call.” Wikipedia, Wikimedia Foundation, 8 Dec. 2017, en.wikipedia.org/wiki/Telephone_call.

96 Interactive Mobile Learning Platform at the British University in Egypt

Ihab Adly Mohamed Fadel Ahmed El-Baz Hani Amin Centre for Emerging Centre for Emerging Mechanical Eng. Dept., Centre for Emerging Learning Technologies Learning Technologies The British University in Learning Technologies (CELT) (CELT) Egypt (BUE) (CELT) The British University in The British University in Cairo-Suez Desert Road, The British University in Egypt (BUE) Egypt (BUE) Cairo-Egypt Egypt (BUE) Cairo-Suez Desert Road, Cairo-Suez Desert Road, +202-26890000 Cairo-Suez Desert Road, Cairo-Egypt Cairo-Egypt Ahmed.Elbaz@bue. Cairo-Egypt +202-26890000 +202-26890000 edu.eg +202-26890000 [email protected] Mohamed.Fadel@bue. [email protected] m edu.eg .eg

ABSTRACT In recent years, mobile technology has been rapidly developed and now plays an important role in education. Traditional course Keywords offerings are on the change towards M-Learning. However, such Interactive Learning; M-Learning; Online Tools; Portable shift requires combined and integrated efforts from course Learning Platform. planners, system designers, software developers, teachers, and 1. Introduction students. With the recent advances in smartphone technology, powerful This paper introduces the design of an online M-Learning processors, efficient graphical processing unit (GPU) and interactive teaching and learning platform that has been developed increased storage capacity in most of portable devices, new and deployed at the British University in Egypt (BUE). Different learning approaches are heading towards mobile learning (M- real cases of interactive learning applications have been designed, Learning). In addition, the rapid rise in mobile phone usage and developed, integrated within the platform and evaluated by availability of low-cost internet connectivity both empower M- students. Feedbacks from students show promising results on Learning and support it to overcome limitations of past different aspects; 1) significant improvement of engagement in the pedagogical online learning approaches, as it provides mobility learning processes, 2) better understanding of abstract concepts and efficiency for both instructor and student. through the visualization interactivity provided through the The growth of mobile broadband has largely outpaced that of learning applications. In addition, students showed motivation to fixed broadband, with mobile-broadband prices dropped by 50% use this kind of ICT-based learning techniques in different on average over the last three years. These factors have resulted in subjects. about half of the world’s population getting online and broadband Although different types of applications have been developed and services being available at much higher speeds. Based on latest integrated within the platform, focus will be given on the design update by the International Telecommunication Union (ITU), and implementation of interactive online tools where students can mobile-broadband subscriptions have grown more than 20% use calculations/simulations and visualization activities to better annually in the last five years and are expected to reach 4.3 billion understand and even imagine the effect of different parameters on globally by end 2017 [1]. the behavior of targeted concept. Considering the new generation, often called the digital CCS Concepts generation, growing with electronic devices in hand and electronic • Applied computing→Interactive learning environments. contents available anytime and anywhere. The traditional educational materials and systems cannot fulfill the needs of such generation [2]. Although the predominantly face-to-face learning Permission to make digital or hard copies of all or part of this has some irreplaceable advantages, such as facial and body work for personal or classroom use is granted without fee provided that language in communication, emotional transfer, and active copies are not made or distributed for profit or commercial advantage and experience [3], online learning and mobile learning have been the that copies bear this notice and the full citation on the first page. 
hot keywords in all educational institutions and will be the trend Copyrights for components of this work owned by others than ACM must in the future. be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior Mobile learning (M-Learning) is the latest iteration of ubiquitous specific permission and/or a fee. Request permissions from (anytime, anywhere) learning technique, where it provides a [email protected]. personalized (offered materials, feedback …) fully portable ICSIE '18, May 2–4, 2018, Cairo, Egypt platform. M-Learning is a form of micro-learning; defined as the © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00 process of gaining knowledge in a chosen subject, anywhere at any time, however it is based on modern forms of content that are DOI:https://doi.org/10.1145/3220267.3220279 DOI: http://dx.doi.org/10.1145/12345.67890

97 designed to engage learners. The term mobility involves three key aspects [4]: MODEL  Mobility of learners Updates Manipulates  Mobility of learning

 Mobility of technology VIEW CONTROLLER This paper presents the development of an interactive mobile learning (M-Learning) platform and associated online tools for Sees teaching and learning at the Centre for Emerging Learning Uses Technologies (CELT), the British University in Egypt (BUE. Different interactive learning tools will be presented along with USER students’ feedback.

2. Platform and Website Figure 1. Interactions within MVC based platform Prior to building the target platform, a set of specifications has Accordingly, to satisfy that set of targeted platform specifications, been put forward to consider during the technology selection and the following set of technologies have been selected for the development phases. The following summarizes the main development path: requirements drafted based on previous experience [5], many meetings with different stakeholders and other related work as in • Model-View-Controller (MVC) architecture-based [6][7][8]: development that allows for better testability of modules and scalability as shown in Figure 1. a. OS independent, with the plethora of OS systems currently in ties the market, it was clear that any OS dependency will • JavaScript language and libraries. significantly limit a wide spread usage of the application • Bootstrap UI elements: Bootstrap is a free and open-source b. Platform independent, where the decision has been made in front-end for designing websites and web every stage that developed M-Learning tools and activities applications. It contains HTML- and CSS-based design should be compatible with any portable devices and not templates. Bootstrap UI elements are sleek, intuitive, fast and limited to Desktop PCs powerful especially for mobile first web development. c. Framework independent, many frameworks are currently • Client-side processing: The client-side environment used to existing in the market, such as Moodle, while such run JavaScript is only a browser. The processing takes place frameworks allows a rich set of features, they highly limit the on the end users computer. The source code is transferred possibilities in the UI design, resulting in either old fashion UI from the web server to the user’s computer over the internet or very slow website and run directly in the browser. d. Web application style rather than static pages • Responsive web design (RWD) is an approach to web design aimed at allowing desktop webpages to be viewed in response e. Scalability, where the main target is to develop many activities to the size of the screen or web browser one is viewing with. and online tools, adding new tools easily is very important to the project sustainability • Server-side authentication and validation: At the Server Side, the input submitted by the user is being sent to the server and f. Security is of prime concern to protect used credentials, by validated using server-side programs. After the validation deploying SSL based security, ensuring users trust and process on the Server, the feedback is sent back to the client. improving the site ranking Validating user input on Server Side has the advantage of g. Light weight footprint for all developed activities to make the protecting against malicious users, who can easily bypass download faster and enhance the web application response client-side codes and submit dangerous input to the server. h. Future possibility to transform the web application into a • Server-side Database based on MySQL: One of the major uses mobile application, of server-side scripting is to interact with a database which i. User authentication is a feature primary provided through resides on the server. 
By interacting with this database, it is existing framework, since the decision was made to develop a possible to change the content displayed to the user on a web completely new application from the bottom-up, it is of prime page without updating the HTML with new information. importance to provide user authentication module in the Figure 2, shows the developed web application main blocks, as developed work, and finally that fulfils those needs meanwhile allowing for a flexible j. Modern UI elements including stylish controls and attractive scalability and easy integration of new modules. charts.

98 Server Side Client Side interactive applications, students and instructors can use on their portable devices any time and even anywhere. For instance, Server App CELT-BUE App instructor can use online tools in classroom to emphasize key concepts, to dynamically visualize effects of parameters’ change Index App modules on the outcome/result. Online tools have been optimized for use App files configuration App Styles on PCs, tablets and smart phones with easy to use sliders and

Handling of client flexible charting options as shown in Figure 3. side request 3.2 Case study: Wind Simulator Tool This interactive online tool enables the user to estimate the Bootstrap based odules M View diameter and rotational speed of a fixed pitch Horizontal Axis Directives Wind Turbine (HAWT) based on the required output power in kW, available wind speed at site in m/s using three types of airfoil Database Server Scripts Controller Interaction sections for manufacturing the turbine blades [9]. Once the user Online Tool Module enters the required data using the data selection menu as shown in Experiment Module Figure (a), the tool enables the user to view the variation of blade

Database chord length with blade radius in graphical format as shown in Fig. 4(b). In addition, the tool can be used to view blade shape (chord length at different radii) for turbine blade as shown in Fig. 4(c). Finally, the tool will also calculate the expected output power of Figure 2. Application Structure turbine with wind speed assuming variable turbine rotational speed control as presented in Fig. 4(d). 3. Applications Advanced JavaScript-based charting libraries have been used in Although different interactive teaching and learning applications the implementation of the different visualizations; both 2d and 3d have been developed and integrated within the M-Learning charts are scaling appropriately on smartphone screen. The 3D platform, focus will be given on the development, integration and chart can rotate in different directions allowing the user to use of interactive online tools at the British University in Egypt accurately investigate the blade profile. The slider control is (BUE). Students’ feedback and reflections on the integration of mobile friendly allowing the user to easily control the different these new teaching tools with standard delivery will be presented. parameters on touchscreens. 3.1 Online Tools Online tools are mainly interactive simulation-based visualization activities in which the user (students/instructors) might change different input parameters to study their effect on targeted concept through both calculations of required parameters and responsive visualization of specific behavior.

a) Data selection menu

b) Blade chord length versus radius

Figure 3. Screenshots of UI examples on smart phone This will help students get sense of behavior though visualization c) Graphical view of blade development versus radius of targeted parameters/concepts. Online tools are very promising

99

Figure 6. Students’ response; the tool provides good technical content for the covered topic

4 d) Developed power of turbine Figure 4. Wind Simulator Tool GUI 4. STUDENTS’ FEEDBACK To measure the impact of using interactive online tools in teaching and learning, a students’ questionnaire has been designed based on System Usability Scale (SUS) standards [10]. The questionnaire has been deigned to include two parts: part 1 related Figure 7. Students’ response; using the tool will improve my to the tool interface and associated issues, and part 2 related to the engagement in lecture technical contents of the tools and how much the tool supported to achieve the targeted learning outcomes. Figure 6 shows students’ feedback on question in part 2; the tool provides good technical content for the covered topic. The questionnaire has been used in stress analysis course delivered to chemical engineering students. The used tool As for the impact of using the tool to improve students’ “Analysis of Simple Beam Structure” has been designed to engagement in lectures, Fig. 7 shows students’ feedback on introduce the basic analysis concepts of simply supported beams, question in part 2; using the tool will improve my engagement in where beams are analyzed under different loading conditions, lecture. including uniform loads and a combination of point loads. The tool allows students to develop an in-depth understanding and 5. CONCLUSIONS imagination of simple beam behavior under several loading Development, deployment and integration of interactive online conditions. tools with an M-Learning platform has been implemented at the British University in Egypt (BUE). The developed interactive Figure 5 shows students’ feedback on question in part 1; the tool tools can be used either in classroom or even off campus to provides a good support to understand and visualize the topic. provide a continuous support for student to better understand of key concepts in different courses. The M-Learning platform has been designed to provide a fully portable teaching and learning capabilities and features for current and future online learning activities and applications based on modern technologies. Students’ feedback shows the following results; 1) more than 60% of students supported the use of such online tools in teaching and learning, 2) 74% of students confirmed that the use of such online tool improved their engagement in lectures, and 3) 84% of students showed interest to have similar tools in different topics. 6. ACKNOWLEDGMENTS Figure 5. Students’ response; the tool provides good technical The development of the interactive online wind simulator tool has content for the covered topic been funded through a Newton-Musharfa Institutional Links Grant, STDF ID#26134 in collaboration with the Centre for Renewable Energy Systems Technology (CREST), Loughborough University, UK. 7. REFERENCES [1] ICT Facts and Figures 2017, available online http://www.itu.int/en/ITU- D/Statistics/Pages/facts/default.aspx

100 [2] Marc Prensky, “Digital natives, digital immigrants”, On the [7] Hoober, S., and P. Shank, “Making mLearning usable: How Horizon, MCB University Press, Vol. 9, No. 5, pp 1–6, we use mobile devices”, The eLearning Guild Research October 2011. Report, April 2014. [3] Martyn Stewart, “Learning through research: An introduction [8] Dennen, V., and S. Hao., “Intentionally mobile pedagogy: to the main theories of learning”, JMU Learning and The M-COPE framework for mobile learning in higher Teaching Press Vol. 4, Issue 1, pp 6-14, 2004. education. Technology”, Pedagogy and Education 23(3): [4] Pandey K., Singh N., Mobile Learning: Critical Pedagogy to 397–419, 2014. doi:10.1080/1475939X.2014.943278. Education for All. In: Zhang Y. (eds) Handbook of Mobile [9] P. J. Schubel and R. J. Crossley, “Wind turbine blade Teaching and Learning. Springer, Berlin, Heidelberg, 2015 design,” Energies, vol. 5, pp. 3425–3449, 2012. [5] Hani Ghali, “Remote Online Experimentation Platform at the [10] John Brooke, “SUS: A “quick and dirty” usability scale”. In British University in Egypt (BUE)”, eLearning Africa 2016 - Usability evaluation in industry, Edited by: Jordan, P. W., 11th International Conference on ICT for Development, Thomas, B. A. Weerdmeester and McClelland, I. L. 189–194. Education and Training, 24–26 May 2016, Cairo – Egypt. London: Taylor & Francis, 1996. [6] Haag, J., & Berking, P., “Design considerations for mobile learning”. In Y. Zhang (Ed.), Handbook of mobile teaching and learning (pp. 41–60). Berlin: Springer Verlag, 2015.

101 A RESTful Architecture for Portable Remote Online Experimentation Services Mohanad Odema Ihab Adly Ahmed El-Baz Hani Amin Centre for Emerging Centre for Emerging Mechanical Eng. Dept., Centre for Emerging Learning Technologies Learning Technologies The British University in Learning Technologies (CELT) (CELT) Egypt (BUE) (CELT) The British University in The British University in Cairo-Suez Desert Road, The British University in Egypt (BUE) Egypt (BUE) Cairo-Egypt Egypt (BUE) Cairo-Suez Desert Road, Cairo-Suez Desert Road, +202-26890000 Cairo-Suez Desert Road, Cairo-Egypt Cairo-Egypt [email protected] Cairo-Egypt +202-26890000 +202-26890000 .eg +202-26890000 Mohanad.odema@gma [email protected] [email protected] il.com

ABSTRACT the students. Yet as the number of students grow, they are placed In this paper, an architecture is proposed to deliver portable in condensed groups near the testing kits, all assigned with remote online experimentation services. This can benefit the conducting the same set of procedures at designated times. In educational and academic sectors in terms of providing remote addition, institutions are faced with two obstacles; one is they are online accessibility to real experiment setups. Thus, the users can obliged to provide redundancy of kits to contain students’ groups be relieved from geographical and time dependence for the working simultaneously. Thus, costs and operational difficulties experiment to be conducted. Nowadays, almost all web services arise from this large-scale deployment. The second is that some leverage the efficiency and prevalence of the REST experiments are not convenient for redundant deployment due to (Representational State Transfer) architecture. Hence, this space and cost constraints, limiting the students to conduct their proposed remote online service has been implemented in experiments at specific time slots. compliance with the RESTful architectural style. As education methodology is constantly evolving because of the Web-based experiments require compatibility with any of the rapid technological advancements. The trend for adapting online users’ portable devices and accessibility at any time. A RESTful education for students has been gaining popularity. Not just in architecture can fulfill these requirements. In addition, different providing high quality online courses, but also in rendering lab experiments can be made available online based on this experiments online based for the students to access them from architecture while sharing the same infrastructure. A case study anywhere through their personal devices [1] [2]. has been selected to obtain measurements of different force For online experimentation, two approaches have dominating the components existing inside wind tunnels. The complete scene. One is to deploy virtual labs [3]; this approach provides its implementation of this system is provided starting from the users with a simulated environment of the experiment based on a embedded controller retrieving sensor measurements to the web mathematical model that provides the output similar to that of the server development and user interface design. real experiment. There are no actual testing or measurements conducted on real hardware or experiment setups. Users are able CCS Concepts to access these labs remotely over the internet. • Applied computing→Interactive learning environments. The other approach is to adapt web-based remote online Keywords experimentation [4]. It differs from the previous approach in terms RESTful architecture; Online testing and experimentation; of having real hardware and equipment installed in a lab. Through Remote testing facilities; a control unit connected to the hardware, the experiment can be rendered available for users to access it via the internet and 1. INTRODUCTION conduct their tests through a designated portal. Output from this In most of scientific institutions, conducting experiments is real-time experiment is fed back online to the users. 
crucial to promote theoretical comprehension and validation for Advantages of the second approach can be emphasized in the fact Permission to make digital or hard copies of all or part of this that users are able to interact with the actual equipment and not work for personal or classroom use is granted without fee provided that merely a simulation. This serves greatly in extending users classes copies are not made or distributed for profit or commercial advantage and to include not only students, but also scientific researchers that copies bear this notice and the full citation on the first page. conducting their tests without any geographical or time constraints. Copyrights for components of this work owned by others than ACM must Thus, real time measurements can be obtained while adding the be honored. Abstracting with credit is permitted. To copy otherwise, or capability of live streaming the experiment’s operation. republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from When it comes to students, they can access their experiments even [email protected]. after their day hours. The need for having them available near the ICSIE '18, May 2–4, 2018, Cairo, Egypt setup at specific times is discarded. In addition, students are no © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00 longer to be placed in condensed groups operating one set of equipment. Instead, each student can now connect to the DOI:https://doi.org/10.1145/3220267.3220280

102 experiment remotely through his personal device, conduct the Besides having the database for data storage in this architecture, procedures and obtain real time results. This enhances their the database plays the vital role of acting as the intermediary learning experience and promotes their understanding of the between the embedded controller and the web server. Both these experiment’s theory and objectives. design components cannot directly communicate with each other in a simple manner. However, each component can have a specific A broader scope of this approach is to have a list of remote online connector installed and included in its own executable script for experiments available through a portal where students can view connecting with the database. Another advantage is that these the status of their experiments. Aside from performing the real three components can be all implemented on a Single Board time experiments, they are able to evaluate experiments’ previous Computer (SBC) with embedded software. Consequently, performance, review historic data or redo the experiment. development cost can be minimized in addition to constraining the Subsequent section 2 presents the proposed architecture for implementation of several components into a single entity. deploying the remote online experimentation. Section 3 provides a Implementing this system as web-based serves for having a low case study of adapting the proposed architecture for wind tunnels cost open platform accessible by any portable device via any web measurements. Final conclusions are provided in section 4. browser. Many approaches have been previously implemented to 2. PROPOSED ARCHITECTURE adapt the remote online experimentation scheme. However, portability and compatibility requirements were missing in a For adapting the web-based remote online experimentation number of them. For example, the model proposed in [7] bounded approach, the aspects of portability, compatibility and availability its users by the requirement of having a digital TV for experiment need to be covered. In this context, the proposed architecture access and live streaming. Also, the remote experiment in [8] follows a RESTful architectural style for communication between required NI LabVIEW Runtime engine to be installed on the users’ the clients and server. This fulfills the required functionalities by computer. In this context, this proposed model gains the edge rendering the experiment available for access through internet via over its counterparts in terms of filling both portability and any of the clients’ devices (e.g. laptop, smartphone) while compatibility requirements as any users’ device with internet providing the potential of reaching users at any time of the day. connection can access the experiment in real time. A RESTful architecture is a resource-based architecture identified by URIs (Universal Resource Identifiers) defined in the server 3. CASE STUDY script [5] [6]. Consequently, these URIs are identified in each 3.1 Wind Tunnel Experiment user’s request. These requests can be issued through a uniform The chosen experiment to be rendered for remote online testing is interface between clients and server as HTTP requests. Thus, for obtaining measurements in wind tunnels. In this experiment, stateless communication is provided as each request is self- the measurements of , drag and moment are acquired through descriptive containing enough context to be processed. In addition, three force sensors. 
Each of these force components is affected by the capability of providing users with code on demand is supplied the orientation and speed of the wind flowing inside the tunnel. through executable client-side scripts. For this purpose, NodeJS and JavaScript are chosen for the server and client sides scripting For the measurements to be retrieved, a balance mechanism respectively as both languages provide open source structure has been constructed encompassing a specific implementations fulfilling the needed requirements. architecture of loadcells reacting to the applied forces on the mechanism; where each one is responsible for measuring a certain The complete proposed architecture is shown in Figure 1. The force component. Through tensions and compressions suffered by selected experiment is to be connected to an embedded controller the loadcells, voltage signals are produced at loadcells’ outputs via interfacing circuits. This controller is responsible for obtaining which are then retrieved and measured by a control unit. Figure 2 vital data from the experiment and relaying it to be stored in a shows a screenshot of the balance structure. database. A published web server oversees responding to the different clients’ requests by retrieving the corresponding information from the database to be displayed for the clients. In addition, clients can pass on certain configuration parameters to the server, and consequently the controller, for their remote experimentation experience to be complete.

On-Site Clients

Web Web Server Database Clients

Figure 2. The Balance Structure Experiment Interfacing Embedded Setup Circuits Controller Through this control unit, the experiment will be set for remote online testing where students should be able to access it through a secure portal via their personal laptops or mobile devices. Sensed Figure 1. Proposed Architecture measurements by the balance structure will be displayed as well as the provided data analysis. Means for configuring and

103 calibrating the system for optimal performance are provided. Start Achieving this objective is contingent to implementing the control unit to fulfill the various requirements of rendering this Initialize experiment web based. parameters

Retrieve users 3.2 System Design input data To have the remote online testing experiment, a complete Retrieve Calibration complementary system is implemented for retrieving the readings Calibration True from the loadcell sensors placed in the balance structure. This Parameters system is also responsible for performing analysis on the readings, False Set configuration Compute New parameters; True Configuration storing data in their respective database tables and publishing a reference Values designated web server for the system to be accessible. Configuration = False False Having this system web-based serves for having a hybrid Retrieve loadcell Set reference functionality of remote online or on-site testing. Once the system readings values is operational, the web server is published on a specified address. Compute average and False RMS Clients should be able to access the server either on-site through Calibration = False an HDMI screen attached to the control unit, or remotely through Store results in entering the server’s address in any web browser. database A Raspberry Pi has been selected as the core of development. The Terminate True End Pi is an affordable SBC with high computing and processing capabilities with embedded Linux. The Pi performs the Figure 3. Main code flowchart functionalities of the embedded controller in addition to having the web server and the database deployed on it. On top of that, a 3.3 Developed Interface PCB was fabricated to be mounted on top of the SBC to have the As students and scientific researchers are to be the primary users loadcells’ interfacing circuits connected to their relative pins. of this system. The interface has been developed taking into Hence, all these desired functional components are integrated into account thoroughness of options, simplicity of usage and a single device. The interfacing circuits render voltage signal interactive display. The system is accessible either through the readings depending on the applied force on their corresponding HDMI touch screen attached to the SBC or through any web loadcells. With the connection intact between the SBC and the browser on the users’ devices. Users are able to choose from three circuits, a separate interfacing code is run to obtain the readings different modes of operation namely measurement, configuration from the loadcells. and calibration respectively. Figure 4 shows the operational As was mentioned, the database deployed on the Pi acts as the developed system with its interface. intermediary between the controller and the server. Data acquired Measurement mode allows the users to view real time sensed through the controller are stored in the database and relayed to the forces in the balance mechanism. Data can be chosen to be server upon request. This is achieved through having the displayed as tables or plotted in charts. Charts serve as an asset in interfacing code and the server script both communicating with showing the effect of real-time turbulence forces suffered at the the database; mainly the former to store data while the latter to loadcells in a dynamic manner. Users are provided with the retrieve them. capability to access and display historic data through entering At the SBC, the main code queries the interfacing circuits for their their desired time periods and a corresponding GET request is readings with a specified sampling frequency. Once readings have initiated on behalf of the user to the server. 
been retrieved, readings’ averages and Root Mean Squares (RMS) are computed. The readings and computed results are then stored in their relative tables in the database. Also, the code queries the database each run for any changes by the users in the configuration table parameters (e.g. sampling frequency). A separate instructions’ sequence is also provided in case calibration of the interfacing circuits readings is required. This code’s generic flow chart is shown in Figure 3. The web server has been developed using NodeJS scripting language. The server listens constantly to users’ GET or POST requests and queries the database accordingly. Retrieved data is rendered in a convenient format to be displayed for the user. In addition, added client-side scripting provides the users with the Figure 4. Balance System interface capability to run their JavaScript executables and insert their designated configuration parameters and calibration information. Configuration mode provides a set of options through which users can set their configuration parameters including the sampling JavaScript is used for client-side scripting. jQuery JavaScript frequency and number of points plottable within a chart. The last library is utilized with AJAX (Asynchronous JavaScript and XML) mode is concerned with calibrating the loadcells through calls intensively in order to update the web server with new data providing a series of instructions for the user to follow in order to in real time. In addition, HighCharts JavaScript library has been compute new reference values for the loadcells’ interfacing chosen for plotting of data in charts. circuits.

104 3.4 Testing and Results solution. This represents a bargain taking into consideration the For testing the functionality of this system, three separate tests extent of the service provided. In addition, the prospect of adding needed to be conducted. One is to apply known weight forces onto more hardware setups to be integrated onto the same SBC is also the balance structure at different sampling frequencies. The maintained providing multiple available online testing scenarios. second is done by applying a turbulent force on the structure and observing the system response in comparison to a strain gauge 5. ACKNOWLEDGMENTS meter. The final test was to validate the functionality of the This work has been funded through a Newton-Musharfa calibration mode. Institutional Links Grant, STDF ID#26134 in collaboration with the Centre for Renewable Energy Systems Technology (CREST), For the first test, weight forces of 1, 3 and 5 kgs. were applied in Loughborough University, UK. several combinations ensuring reasonable measurements by the system. Retrieved measurements as well as computed averages 6. REFERENCES and RMSs were displayed in real time. Testing was repeated at [1] Marquez-Barja, J. M. et al. 2014. FORGE: enhancing different sampling frequencies of 0.25, 0.5, 1 and 2 samples/sec. elearning and research in ICT through remote Data updating interval was changed accordingly while experimentation. In Global Engineering Education maintaining the basic functionalities. At higher frequencies, Conference (EDUCON) (Istanbul, Turkey, April 03 – 05, system’s response was constrained by the database’s query 2014). 1 – 7. DOI = response time. https://doi.org/10.1109/EDUCON.2014.7130485 The second test incorporated applying fluctuating turbulent forces [2] Mikroyannidis, A. et al. 2016. Applying a methodology for on the mechanism similar to those it will suffer in the wind tunnel. the design, delivery and evaluation of learning resources for Applied turbulent forces measurements can be plotted remote experimentation. In Global Engineering Education dynamically at a corresponding force’s chart alongside the Conference (EDUCON) (Abu Dhabi, United Arab Emirates, computed averages and RMSs plots for comparison purposes. April 10 – 13, 2016). 448 – 454. DOI = System’s response was found to be equivalent to that of the strain https://doi.org/10.1109/EDUCON.2016.7474592 gauge meter ensuring its reliability. A screenshot of the dynamic [3] Bose, R. 2013. Virtual Labs Project: A Paradigm Shift in chart is show in Figure 5. Internet-Based Remote Experimentation. IEEE Access. 1 (Oct. 2013), 718 – 725. DOI = https://doi.org/10.1109/ACCESS.2013.2286202 [4] Angulo, I., Garcia-Zubia, J., Rodriguez-Gil, L., and Orduna, P. 2016. A new approach to conduct remote experimentation over embedded technologies. In 13th International Conference on Remote Engineering and Virtual Instrumentation (REV) (Madrid, Spain, February 24 – 26, 2016). 86 – 92. DOI = https://doi.org/10.1109/REV.2016.7444445 [5] Lelli F., and Pautasso C. 2011. Design and Evaluation of a RESTful API for Controlling and Monitoring Heterogeneous Devices. In: Davoli F., Meyer N., Pugliese R., Zappatore S. (eds) Remote Instrumentation Services on the e- Figure 5. Measurements’ dynamic chart Infrastructure. Springer, Boston, MA. 3 – 13. DOI = https://doi.org/10.1007/978-1-4419-5574-6_1 For the calibration test, arbitrary reference values were set at the beginning for the interfacing circuits. The calibration sequence [6] Wenhui, H. et al. 2017. 
Study on REST API Test Model was initiated by selecting the designated loadcell. Three-point Supporting Web Service Integration. In IEEE 3rd calibration at reference weights of 1,3 and 5 kgs. is achieved International Conference on Big Data Security on through applying the relative weight and computing the actual Cloud(BigDataSecurity), (Beijing, China, May 26 – 27, reference values. For the same calibration point, reference value 2017). 133 – 138. DOI = should be consecutively computed at least twice with no more https://doi.org/10.1109/BigDataSecurity.2017.35 than 5% fault tolerance between the two computations before [7] Dos Santos, R. A., et al. 2015. Remote experimentation setting the corresponding reference point. Calibration was model based on digital TV. In 3rd Experiment International conducted multiple times successfully for the three force sensors. Conference (exp.at'15) (Ponta Delgada, Portugal, June 02 – 04, 2015). 321 – 324. DOI = 4. CONCLUSION https://doi.org/10.1109/EXPAT.2015.7463288 Proposing a RESTful design architecture for remote online [8] Cotfas, P. A., Cotfas, D. T., and Gerigan C. 2015. Simulated, experimentation delivers a portable, compatible and 24/7 available Hands-on and Remote Laboratories for Studying the Solar online testing service. This open source implementation allows for Cells. In 2015 Intl Aegean Conference on Electrical altering of the system’s design flow and capabilities in addition to Machines & Power Electronics (ACEMP), 2015 Intl serving greater number of clients through adding or removing a Conference on Optimization of Electrical & Electronic few lines of code. Equipment (OPTIM) & 2015 Intl Symposium on Advanced The implementation of the experiment’s complementary system Electromechanical Motion Systems (ELECTROMOTION) including the SBC, the interfacing circuits with their fabricated (Side, Turkey, September 02 – 04, 2015). 206 – 211. DOI = boards, and attached HDMI screen is considered as a low-cost https://doi.org/10.1109/OPTIM.2015.7426953

105 Adaptive security scheme for real-time VoIP using multi- layer steganography

Shourok AbdelRahim Samy Ghoneimy Faculty of Business Infromation Faculty of Informatics and Gamal Selim System Computer Science Faculty of Engineering Canadian International College, British University in Egypt, Arab Academy for Science and Cairo, Egypt ELsherouk city, Egypt Technology, Cairo, Egypt [email protected] [email protected]. [email protected] om eg

research attention in many different aspects. Steganography means ABSTRACT to hide messages existence in a particular medium such as audio, Nowadays Voice over Internet Protocol (VoIP) is one of the most video, image, text [1] . widely used technologies to transmit the voice. With the widely However, the area of steganography for real-time systems is spreading in such technology many counters attaches tried to apply largely unexplored. This may be due to the fact that the real-time different counter measure. In this paper we tried to build a characteristic of real-time systems is a double-edged sword. While counter countermeasure which increases the security of specific the real-time nature actually offers better security for secret messages by performing a complicated three security stages. These messages, it does not allow many complex operations, which stages are; embedding the selected voice into RGB image, hidden increases the difficulty in assuring security. Nevertheless, given its the image in voice signal and perform data integrity using real time potential advantages, steganography for real-time systems may protocol (RTP). Following such a proposed algorithm, the process soon become a worthy subject of further studies [2]. of eavesdrop or counter attacks will not be able to break such a multi-layer security process. In this paper, we propose an Adaptive VoIP steganography approach to hide the audio information within Voice over Internet Protocol (VoIP) communication is one of the images to enhance the security of the voice communications. The most popular real-time services on the Internet. VoIP has more proposed system is completely implemented and developed using advantages than traditional telephony, since the Internet allows C++ in OPNET Modeler. Simulation results showed that the VoIP to provide low-cost, high-reliability, and global services. proposed system is robust enough to overcome many attacks such VoIP streams often have a highly redundant representation, which as denial of service, man-in-the-middle and eavesdrop without usually permits the addition of significantly large amount of secret affecting network performance or quality of service. data by means of simple and delicate modifications that preserve the perceptual content of the underlying cover object. With the CCS Concepts increasing percentage of VoIP streams in all of the Internet traffic, VoIP is considered to be a better cover object for information • Security and privacy→Network security hiding compared with “static” cover objects such as text files, image files, and audio files. Besides, VoIP connection is usually Keywords very short, and so it is unlikely for attackers to detect the hidden Real Time Protocol; Steganography; Voice over Internet Protocol; data within VoIP streams. Their real-time characteristics may be Least Significant Bit, audio security. used to improve the security of the hidden data embedded in VoIP “dynamic” streams [3]. Real Time Protocol (RTP) is the most 1. INTRODUCTION important one in transport protocols. RTP is used in conjunction In order to solve the drawback of data communication through the with User Datagram protocol (UDP) for transport of digital voice Internet, many data security techniques have been proposed. stream. Recently, in the last few years steganography has drawn increasing VoIP is a valuable technique which is used to enable telephone Permission to make digital or hard copies of all or part of this calls via a broadband Internet connection. 
Owing to its advantages work for personal or classroom use is granted without fee provided that of low cost and advanced flexible digital features, VoIP has copies are not made or distributed for profit or commercial advantage and become a popular alternative to the public-switched telephone that copies bear this notice and the full citation on the first page. network (PSTN), and extensive research on it has been conducted Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or [4]. republish, to post on servers or to redistribute to lists, requires prior Recently, some researchers have noticed the advantages of this specific permission and/or a fee. Request permissions from technology and carried out useful studies on steganography over [email protected]. ICSIE '18, May 2–4, 2018, Cairo, Egypt VoIP [5] [6] [7] [8] [9]. However, all of these studies adopt the © 2018 Association for Computing Machinery. same simple embedding strategy, replacing the least significant bits ACM ISBN 978-1-4503-6469-0/18/05…$15.00 (LSBs) of the cover speech with the binary bits of secret messages or their encrypted form without leveraging the characteristics of DOI:https://doi.org/10.1145/3220267.3220281

106 the LSBs. Although LSBs modifications usually have little impact Using the LSB to embed the voice streams into an RGB image. on the quality of cover media, blind substitution potentially risks The new image is then divided into packets and gets sent through being detected by statistic methods. This view has been widely the network along with other voice parts by using the RTP protocol. accepted by researchers working on steganography on storage media [10] [11] [12]. 2.1.1 Algorithm for Embedding Process Step 1: Get the selected voice data and read it and separate it to In fact, the key criteria for steganography is perfect transparency to frames then loop of the frame values. non-authenticated entities and high capacity for carrying secret messages. The first criterion, a measure of embedding distortion, is Step 2: Confirm that all the values are positive number and if there often more important. An acknowledged belief is that the smaller are any negative ones they should be converted to positive by the embedding distortion, the harder the detection of the adding a fixed sum to all of the values. embedding changes. Therefore, in this paper, we focus on an Step 3: Convert the values of each frame into a stream of bits. adaptive steganography scheme for VoIP, which aims at minimizing the distortion of speech quality to enhance the Step 4: Store the stream of bits into a vector or a list. imperceptibility of the steganography system [13]. Step 5: Choose a suitable image based on the size of the selected The paper is organized as follows. In section II we proposed a voice data. system about VoIP steganography. The design of the whole system using simulator in section III. In section IV discussed the results Step 6: Read the image information and separate it into pixels, about the system. Finally ended with conclusion in section V. while taking padding operation into consideration as it’s a 24 bit image (RGB). 2. PROPOSED SYSTEM Step 7: Loop through the pixels of the image. In each iteration, get The proposed system consists of two parts the first is transmitter the LSB of the RGB values and store them in a vector. security and the second is receiver detection and retrieval. The Step 8: In each 8 bits in the image, select the LSB to hide the main constraints are the time limits on secure messages or secure vector of voice data. conversation length and Quality limitation of voice message as to limit the message size. Step 9: Repeat step 6 until the whole vector of voice data bits is processed.

Step 10: Overwrite the values of the LSB vector with the values of the audio bit vector. Step 11: Prepare to rewrite the image and keep the LSB of the first few bits empty to store the length of the audio data. Step 12: Write the length of the audio data into the first bits, then proceed to use the vector which contains the overwritten LSB with the audio data to rewrite the original image. Step 13: Divide the newly rewritten image to packets and put a key in those packets to differentiate them from normal audio packets and proceed to send them over the network. 2.2 Receiver and Retrieval Part All Received packets are subjected to the examination of the packet detection subsystem which determines if the packet is a container for the image information which will be converted to audio later or just a regular voice packet. All voice packets are Figure 1. General architecture of the proposed VoIP presented to the user, but all the image packets are collected till a steganography framework. signal that specifies the end of image packet is received. The accumulated image packets are then arranged and used to 2.1 Transmitter Security Part build an image, this image is then subjected to the voice extraction As the conversation starts with the voice signals which are detected system which decodes all of the image pixels and converts them to from the user’s microphone are captured once the users chooses to their simplest form which is bits. After the image is converted to secure a certain part of his conversation or message. Started the bits all of the LSB bits are extracted and collected and used to process to capture audio information and converted into their build the secured audio information so that it can be relayed to the simplest form which is bits. Those bits are encoded into an RGB user. Image using the LSB Technique after converting the image also into bits. 2.2.1 Algorithm for Extracting Process Step 1: Receive all packets sent through the network. The least significant bit (LSB) is the bit that when flipped from 0 to 1 or from 1 to 0, which means there are no significant changes Step 2: Subject all received packets to the detection subsystem. that will occur to the total size of the original image, because this Step 3: All detected packets that contain the image are stored in a bit located at the right side in each one byte, which is not affected separate container until they are fully collected. about the image, but if made the reverse process means change the left bit it was affected about the image. Step 4: arrange the collected packets to their order of the building.

2.2 Receiver and Retrieval Part
All received packets are subjected to examination by the packet detection subsystem, which determines whether a packet is a container for the image information that will later be converted to audio, or just a regular voice packet. All voice packets are presented to the user, while the image packets are collected until a signal that specifies the end of the image packets is received.

The accumulated image packets are then arranged and used to build an image. This image is passed to the voice extraction system, which decodes all of the image pixels and converts them to their simplest form, which is bits. After the image is converted to bits, all of the LSBs are extracted, collected, and used to rebuild the secured audio information so that it can be relayed to the user.

2.2.1 Algorithm for Extracting Process
Step 1: Receive all packets sent through the network.
Step 2: Subject all received packets to the detection subsystem.
Step 3: Store all detected packets that contain the image in a separate container until they are fully collected.
Step 4: Arrange the collected packets in their order of construction.
Step 5: Use the packets that represent the first eight pixels, which are always reserved to store the audio size, and extract their LSBs to find the length of the audio data inside the image file.
Step 6: Use the rest of the packets and subject them to the decoding process, considering the size of the audio data and the reserved bits, to write the audio file header.
Step 7: Extract the LSBs of all pixels until the length of the audio bit counter is reached.
Step 8: Use the extracted audio bits to write the audio frames while doing the conversions.
Step 9: Remove the fixed value that was previously added to all frames to turn negative audio values positive, in order to restore the audio frames to their original state.
Step 10: Use the restored frames to construct the audio file.
Step 11: Present the decoded and constructed audio file to the user.
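A matching sketch of the extraction side is shown below; it reverses the embedding example given earlier (same assumed 32-bit length header) and is again only an illustration of the steps, not the authors' code.

```python
# Minimal LSB-extraction sketch, mirroring the embedding example above.

def extract_voice(stego: bytes, header_bits: int = 32) -> bytes:
    """Recover the hidden voice bytes from the LSBs of an RGB byte array."""
    # Step 5: read the reserved length field from the first LSBs.
    length = 0
    for i in range(header_bits):
        length = (length << 1) | (stego[i] & 1)
    # Step 7: collect exactly `length` payload bits that follow the header.
    bits = [stego[header_bits + i] & 1 for i in range(length)]
    # Step 8: regroup the bits into bytes (audio frames).
    out = bytearray()
    for i in range(0, length, 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)
```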

3. SIMULATION SETUP
We used the OPNET simulator to simulate our network scenario and Visual Studio to implement our algorithm in the C++ programming language. Figure 2 shows the topology of the company network; the two offices are shown as subnets interconnected by a dedicated T1 (1.5 Mbps) link.

In Site A, a LAN node composed of 5 computers and two IP phones are connected through a switch, while Site B consists of two switches: Switch_0 is connected to a LAN node composed of 5 computers and one IP phone, and Switch_1 is linked to two IP phones and a LAN node composed of 4 computers, as shown in Figure 3. A gateway router connects the two sites.

Figure 2. Network Topology

Figure 3. Sites topologies. Left: Site_A network topology. Right: Site_B network topology.

After the sender starts the conversation and decides to secure the selected message, the message is coded using RAW PCM (8,000 Hz, 8 bit), which is the codec we focused on.

As a main constraint, we assume that embedding and retrieving must be possible without causing delays or interventions during the VoIP communication. Therefore, for embedding the selected voice stream into the RGB image we used a Least Significant Bit (LSB) scheme, which provides high capacity and low complexity. The stego image is then divided into packets that are sent through the network along with the other voice parts using the RTP protocol.

After the receiver receives the packets, it uses the key to know which packets contain the secret message, and then performs the reverse process to recover the original embedded message: extracting the RGB image from the voice streams and then recovering the message with the LSB algorithm mentioned above.

In order to make statements about the security and non-perceptibility of the described scenario, a third person acts as an attacker who is interested in detecting the hidden message of the sender. We assume the attacker is capable of accessing the network and detecting VoIP communication, of finding the communication between sender and receiver, and of trying to detect the hidden message by analyzing the transmitted VoIP packets.

3.1 Performance Measures
3.1.1 Payload
Using the G.711 codec, voice is converted into packets, each carrying 20 ms of sampled voice. These samples are encapsulated in a VoIP packet with a fixed payload length of 160 bytes; adding 40 bytes of headers (12 bytes for RTP, 8 bytes for UDP and 20 bytes for IP) produces a packet size of 200 bytes, which corresponds to an 80 Kbps bit rate consisting of 64 Kbps of voice payload and 16 Kbps of headers.
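The packet-size and bit-rate figures quoted in Section 3.1.1 can be reproduced with a few lines of arithmetic; the short sketch below is only a worked check of those numbers (20 ms of 8 kHz, 8-bit G.711 samples plus RTP/UDP/IP headers), not part of the simulation code.

```python
# Worked check of the G.711 VoIP payload figures from Section 3.1.1.

SAMPLE_RATE_HZ = 8000        # G.711 sampling rate
BYTES_PER_SAMPLE = 1         # 8-bit samples
FRAME_MS = 20                # voice carried per packet

voice_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000   # 160 bytes
header_bytes = 12 + 8 + 20                                           # RTP + UDP + IP = 40 bytes
packet_bytes = voice_bytes + header_bytes                            # 200 bytes

packets_per_second = 1000 // FRAME_MS                                # 50 packets per second
total_kbps = packet_bytes * 8 * packets_per_second / 1000            # 80 Kbps
voice_kbps = voice_bytes * 8 * packets_per_second / 1000             # 64 Kbps
header_kbps = header_bytes * 8 * packets_per_second / 1000           # 16 Kbps

print(voice_bytes, packet_bytes, total_kbps, voice_kbps, header_kbps)
```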

3.1.2 VoIP Quality of Service Parameters
3.1.2.1 Delay
Delay is the sum of codec delay, queuing delay and propagation delay. For the codec delay we use the standard numbers for G.711: a bit rate of 64 Kbps, a frame size of 10 ms, a codec delay of 0.125 ms and a look-ahead of 0 ms.

Queuing delay is the time a packet has to wait in the queues at the input and output ports before it can be processed. It depends on the network traffic intensity and nature, and on the network design (links, equipment and structure).

Propagation delay is the time needed to propagate the information through the links between the sender and the receiver; it is a dynamic delay introduced by Internet routers and caused by store-and-forward processing and congestion.

ITU-T Recommendation G.114 [14] recommends one-way transmission time limits for connections and defines three bands of one-way delay, as shown in Table 1.

TABLE 1: Delay limit specifications
Delay          Description
0-150 ms       Acceptable for most user applications.
150-400 ms     Acceptable for international connections.
Above 400 ms   Unacceptable for general network planning purposes.

3.1.2.2 Packet Loss
The Internet not only introduces delay but can also discard packets, resulting in packet loss. When a router in an IP network becomes heavily congested and its buffer overflows, it can no longer accept packets for queuing and has no option but to drop incoming packets, so the receiver experiences lost audio packets.

How much packet loss a codec can handle depends on its bit rate and design; the tolerable percentage lies between 1 and 5 percent. Loss can be handled well if the lost packets are randomly distributed and do not occur in bursts. According to Walsh and Kuhn [15], a 5% packet loss can make a call catastrophic. Previous research also states that the maximum tolerable packet loss is 3%; however, some codecs perform better than others at 3% packet loss.

3.1.3 Security Robustness
Attackers typically target the most popular and well-publicized systems and applications, and VoIP has become one of them. Several VoIP weaknesses have been revealed recently, so protocol designers need to address them before VoIP can be successfully deployed on a global scale. The most common attacks on the VoIP infrastructure are Denial of Service (DoS), eavesdropping, masquerading and toll fraud.

4. SIMULATION RESULTS AND DISCUSSION
Using the G.711 codec, the payload of the RTP packet did not change after embedding the voice in the color image. The average end-to-end delay obtained from the OPNET simulator is between 55 and 60 ms, as shown in Figure 4. Compared with the ITU-T standard, this is an acceptable delay because it does not exceed the 150 ms limit.

Figure 4. End-to-End Delay from Site_A to Site_B

The packet loss of the VoIP call in the WAN network, i.e. from Site_A to Site_B as configured in the OPNET simulator mentioned above, was measured at 4%, and at less than 1% within each site, i.e. in the LAN network. Consistent with previous research, the packet loss in the multi-layer system did not exceed 5%.

A comparison between the VoIP calls with and without steganography, captured with a sniffing tool, is shown in Figure 5 and Figure 6.

Figure 5. VoIP without steganography
Figure 6. VoIP with steganography

Examining the multi-layer system with the sniffing tool against the attacks threatening the VoIP network mentioned above showed that the multi-layer system is robust against such attacks; moreover, the sniffing tool could not reveal the embedded secret voice message in the VoIP call.

5. CONCLUSION
In this paper, a multi-layer steganography scheme for VoIP built from four layers of technologies is proposed. The experimental results demonstrate that the delay of the multi-layer steganography scheme does not exceed the limit specifications of ITU-T Recommendation G.114 and that the packet loss in the VoIP network does not exceed 5%. The experimental results also show that the multi-layer system is robust against eavesdropping attacks.

6. REFERENCES
[1] A. Pandey and J. Chopra, "Comparison of Various Steganography Techniques Using LSB and 2LSB: A Review," International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278-0882, vol. 6, no. 5, May 2017.

[2] H. Tian, K. Zhou, H. Jiang, Y. Huang, J. Liu and D. Feng, "An Adaptive Steganography Scheme for Voice over IP," in CSE Conference and Workshop Papers, 2009.
[3] S. Tang, Y. Jiang, L. Zhang and Z. Zhou, "Audio steganography with AES for real-time covert voice over internet protocol communications," Science China Press and Springer-Verlag Berlin Heidelberg, vol. 57, no. 3, pp. 1-14, March 2014.
[4] B. Goode, "Voice Over Internet Protocol (VoIP)," Proceedings of the IEEE, vol. 90, no. 9, pp. 1495-1517, Sept. 2002.
[5] O. Al-Farraji, "New Technique of Steganography Based on Locations of LSB," International Journal of Information Research and Review, vol. 04, no. 01, pp. 3549-3553, January 2017.
[6] S. Wankhade and P. R. Shahabade, "Hiding Secret Data through Steganography in VoIP," International Journal of Computer & Communication Technology, ISSN (PRINT): 0975-7449, vol. 4, no. 3, 2013.
[7] J. Dittmann, T. Vogel and R. Hillert, "Design and evaluation of steganography for voice over IP," in IEEE International Symposium on Circuits and Systems, 21-24 May 2006.
[8] W. Mazurczyk and Z. Kotulski, "Covert Channel for Improving VoIP Security," in Proc. of Multiconference on Advanced Computer Systems (ACS), Oct. 2006.
[9] H. Tian, K. Zhou, Y. Huang, D. Feng and J. Liu, "A Covert Communication Model Based on Least Significant Bits Steganography in Voice over IP," in Proc. of the 9th International Conference for Young Computer Scientists, Nov. 2008.
[10] H. Neal and H. ElAarag, "A Reliable Covert Communication Scheme Based on VoIP Steganography," in Transactions on Data Hiding and Multimedia Security X, DeLand, FL, USA, Springer, Berlin, Heidelberg, 2015, pp. 55-68.
[11] Z. Wei, B. Zhao, B. Liu, J. Su, L. Xu and E. Xu, "A novel steganography approach for voice over IP," Journal of Ambient Intelligence and Humanized Computing, vol. 5, no. 4, pp. 601-610, August 2014.
[12] H. Moodi and A. Naghsh-Nilchi, "A New Hybrid Method for VoIP Stream Steganography," Journal of Computing and Security, vol. 3, no. 3, pp. 175-182, July 2016.
[13] H. Tian, K. Zhou and D. Feng, "Dynamic matrix encoding strategy for voice-over-IP steganography," Journal of Central South University of Technology, vol. 17, no. 6, pp. 1285-1292, December 2010.
[14] "ITU-T Recommendation G.114: One-way transmission time," ITU-T Telecommunication Standardization Sector, 2003.
[15] T. Walsh and R. Kuhn, "Challenges in securing voice over IP," IEEE Security and Privacy Magazine, vol. 3, no. 3, pp. 44-49, June 2005.

Clickbait Detection

Suhaib R. Khater1, Oraib H. Al-sahlee2, Daoud M. Daoud3 and M. Samir Abou El-Seoud4
1,2,3 PSUT, Amman, Jordan; 4 The British University in Egypt (BUE), Cairo, Egypt
[email protected]; [email protected]; [email protected]; [email protected]

Abstract
Clickbait is a term that describes deceiving web content that uses ambiguity to provoke the user into clicking a link. It aims to increase the number of online readers in order to generate more advertising revenue. Clickbaits are heavily present on social media platforms, wasting the time of users. We used supervised machine learning to create a model trained on 24 features extracted from a dataset of social media posts to classify the posts into two classes. This method achieved an F1-score of 79% and an area under the ROC curve of 0.7. The method highlights the importance of using features extracted from different elements of a social media post along with the traditional features extracted from the title and the article. In this research, we show that it is possible to identify clickbaits using all parts of the post while keeping the number of features as small as possible.

CCS Concepts
• Computing methodologies → Support vector machines.

Keywords
Clickbait; F1-score; ROC curve; SVM

1. INTRODUCTION
A clickbait is a deceiving headline that aims to increase the number of readers, and thereby advertisement revenue, without offering adequate content or content that is close to the advertised title. A post is a clickbait if it withholds information needed to understand what the article is about. Saying "you won't believe what this team did!" instead of "Real Madrid wins its 12th UEFA Champions League" is an example of a clickbait. This research aims to solve the problem by training a model on a labeled dataset consisting of a number of social media posts of the two classes (clickbait, non-clickbait) to classify posts into these classes.

Since 2015, researchers have tried to produce a method that detects clickbaits effectively using a model applicable for market use. While this is not the first time this problem has been tackled using supervised machine learning, researchers have not yet found a method that efficiently and effectively identifies such social media posts, and research is still active in this topic. In this research we try to extract the minimum number of features that best describe the problem from different parts of a social media post in order to provide an effective and efficient model.

Identifying clickbaits is an important part of blocking them from the user's social media feed. Facebook, Twitter and other social media websites have faced a lot of criticism for not identifying clickbaits and down-ranking them in the user's feed. An example of a clickbait post on Facebook is shown in Figure 1.

Figure 1. Clickbait example [1]

The aim is to achieve an acceptable F1-score and area under the ROC curve, proving the correctness of the chosen approach, which is to consider all parts of social media posts in the feature extraction process, while retaining an acceptable performance by choosing a limited number of features and using feature engineering to further reduce the number of features in the final model.

ICSIE '18, May 2–4, 2018, Cairo, Egypt
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00
DOI: https://doi.org/10.1145/3220267.3220287

2. LITERATURE REVIEW
In computer science, the problem of classifying false and misleading content has existed for a long time, and various attempts to solve, or at least classify, such content have been made in various applications, using approaches ranging from spam detection to fake news detection. The approaches used varied in complexity, and some were too complex to be feasibly used. Below we discuss previous approaches used to tackle the clickbait problem.

Yahoo's research team developed a clickbait detection algorithm [2]. They analyzed the article and the title to extract features used to classify clickbaits, and used several text formality measures to help identify clickbaits. Their objective was to prove that there is a relation between article informality and clickbaits; text formality is an index of how formal a given text is. They extracted 7000 features from 4000 articles, and the approach achieved an F1-score of 74.9%. We used one of the formality measures that was proven relevant in that paper as a feature; this is discussed in detail in the methodology section.

A Bauhaus-Universität Weimar study used 2992 tweets with a model of 215 features and produced an F1-score of 76% [3]. This was the first published paper tackling the clickbait problem. The approach relied heavily on the bag-of-words algorithm; sentiment analysis and readability measures were used as well.

Clickbait detection is a part of detecting fake news, but due to the complexity of detecting fake news, researchers try to solve problems in that domain hoping to get closer to solving the problem on a bigger scope. An interesting linguistic approach used to tackle the fake news problem relied on cues present in the text that show that the writer is lying (Feng & Hirst, 2013). Such cues were based on the frequencies of pronouns in the text and the percentage of negative words used in the article.

A common aspect of the previously mentioned approaches is that they were based mainly on the relevance of the title and the article, dismissing other elements commonly existing in social media posts. While the results are already promising, our aim is to enhance them further, mainly by considering the rest of the elements in the classification process and by adding other features, while keeping the number of features as low as possible to retain acceptable performance without affecting the integrity of the classification. The number of features used in the previously mentioned papers is huge, which affects the performance and the ability of the model to be used in a real-time application. Some of the features in our approach have been used in other research and proved to be relevant enough that anyone working on this problem cannot dismiss them; this is discussed in the next chapter.

3. METHODOLOGY
To train a model to classify a post, supervised machine learning was used. Supervised machine learning is the process of using labeled data to train a modeling algorithm to discriminate between the labeled classes. The input of the algorithm is a set of feature values together with their label. Features are a set of attributes that best describe the differences between the labels. To best describe the problem, we need to identify features that best capture the differences between the two classes, extract the chosen features from the dataset, and then train the model on the extracted features. Below we describe in detail the dataset, the feature extraction and elimination process, and the modeling algorithm used.

3.1 Data Set
The dataset used was provided by Bauhaus-Universität Weimar as part of the clickbait detection challenge organized by the university [4]. The data was annotated by 5 judges. The dataset contained 22,033 posts and was divided into 2495 posts for training the model and 19,538 posts used for the validation of the model. The dataset contained 2 JSON files and a media archive where images were placed if the post contained images. One of the JSON files contained the ID of the post and its label. The fields present in the other JSON file are described in Table 1; the fields used in feature extraction are highlighted there.

Table 1. Dataset description

The ID of the post is used to search for the label in the label file. The post text is text written by the person who shared the article as a comment or description of what to expect when reading the article; it can be considered as a title written by the one who shared the post. The target title is the title of the article and is supposed to reflect what the article is about, although that does not happen with clickbaits. The description and the keywords are data extracted from the meta tags in the source code of the site. The target paragraphs are the article text.

Another dataset was used to implement the bag-of-words algorithm; the algorithm and the data are discussed in the next section.

3.2 Feature Extraction
Generally, feature extraction is a process that relies heavily on domain experience and on previous research that proved the usefulness of certain features. This process is the most critical part of model training: if the extracted features have low correlation with the labels, the model will fail to classify correctly no matter which modeling algorithm is used. Initially, we extracted 28 features. The most common features used here are listed below; a small illustrative sketch of two of them follows the list.

1) Similarity, to measure the similarity between the article's text, the article's title, and the post.
2) Formality, to determine how formal a text is by measuring the frequency of different part-of-speech tags in the text. The formality metric used is called the F-measure [5].
3) Readability, to measure how readable the text is. The metric used is the Automated Readability Index [6], which is used to determine the difficulty level of text in standardized examinations.
4) Bag of words, to extract frequently used words. A 1-gram algorithm was implemented to extract these words on a separate dataset that included 6080 non-clickbait titles and 5637 clickbait titles [7].
5) Noun extraction, to measure the ambiguity factor present in clickbaits.
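As an illustration of how two of these features can be computed, the sketch below implements the Automated Readability Index from its published formula and a simple word-overlap similarity between a post and an article title. It is a hedged example only: the exact similarity measure and preprocessing used by the authors are not specified in the text, so the Jaccard overlap here is an assumption.

```python
import re

def automated_readability_index(text: str) -> float:
    """Automated Readability Index: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

def title_post_similarity(post_text: str, target_title: str) -> float:
    """Assumed similarity feature: Jaccard overlap of lower-cased word sets."""
    a = set(re.findall(r"[a-z0-9']+", post_text.lower()))
    b = set(re.findall(r"[a-z0-9']+", target_title.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0

# Example usage on a made-up post:
post = "You won't believe what this team did!"
title = "Real Madrid wins its 12th UEFA Champions League"
print(automated_readability_index(post), title_post_similarity(post, title))
```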

3.3 Feature Engineering
As mentioned above, 28 features were extracted. The more features we have, the more time is needed to extract them, which might negatively affect the model's performance. Although adding features can increase the accuracy, this is not always the case: if a feature has a high correlation with another feature, or is derived from another feature, the accuracy might decrease. Moreover, the more features are added, the more data is required to ensure there are enough samples for each combination of values.

Because of these disadvantages of a high-dimensional model, we decided to perform recursive feature elimination to decrease the dimensionality of the model. Recursive feature elimination works by recursively considering smaller and smaller sets of features. We used recursive feature elimination because of the several factors affecting feature selection: statistically, features with correlation close to zero should be eliminated, but that is not the only factor to be considered, since a feature with high correlation to another feature can also decrease the performance of the model. To ensure that we chose the most discriminatory features, we implemented the recursive method. After applying the algorithm, 4 features were eliminated.

3.4 Modeling the Data
Even though feature elimination decreased the dimensionality of the data, the data is still high dimensional. This can affect the accuracy of the model if a modeling algorithm that performs poorly with high-dimensional data is chosen, so we decided to use a support vector machine (SVM) [8].

SVM is an algorithm that takes each instance in the dataset as a vector, plots it in a high-dimensional space, and then constructs a hyperplane to separate each class from the other. The separator can be a straight plane or a curve, depending on the linearity of the data; the linear SVM performed better on our data, owing to its linearity. The hyperplane is chosen so that the distance between the plane and the nearest data point of each class is maximized. The model is designed to handle high-dimensional data and has a high noise tolerance. Due to the high dimensionality of the data we were unable to plot it, but Figure 2 shows a linear SVM hyperplane separating three-dimensional data into two classes.

Fig 2. 3-dimensions plot

4. RESULTS AND DISCUSSION
4.1 Results Using Different Modeling Algorithms
The model was trained on 2495 posts consisting of 762 clickbait posts and 1697 non-clickbait posts. To validate the model we used 19,487 posts consisting of 14,774 non-clickbait posts and 4713 clickbait posts. The distribution of the data was specified by the university that organized the challenge. In addition to using SVM with a linear kernel, we tried logistic regression, an algorithm that is known to work well for binary classification problems and linearly separable data. Logistic regression tries to separate the two classes using probabilities modeled by the sigmoid function, unlike SVM, which uses Euclidean distance and tries to find the widest possible separating margin. We achieved 79% accuracy with both models.

The results of the validation process were:

Table 2. Results
Algorithm             Precision   Recall   F1
Logistic regression   0.79        0.79     0.79
Linear SVM            0.78        0.79     0.79

The results were very similar for both algorithms, which are known to perform similarly. The reason we used SVM is its capability of handling noise in the data. The content of the web evolves and clickbaits might change their style, which makes it necessary to add data and retrain the model at some point. To make our algorithm practical, we decided to use SVM, which uses a margin instead of a line to separate the two classes; the margin is the gap between the two classes. Unlike logistic regression, which needs all the data during training, SVM only uses the data closest to the margin. The margin is placed at a point such that the distance between the margin and the points of the two classes is maximized. The size of the margin and the kernel linearity were determined by using a grid search. This makes SVM much faster to train than logistic regression. Another reason SVM was chosen is that it deals better with noise: data parsed from the web can be messy, and a lot of noise can be present. SVM is known to perform well on noisy and missing data [9].
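A compact sketch of this training setup is shown below, using scikit-learn's linear SVM, logistic regression, recursive feature elimination and grid search. The feature matrix `X` and labels `y` are placeholders for the extracted features described in Section 3.2, and the particular hyperparameter grid is an assumption, not the authors' configuration.

```python
# Hedged sketch of the modeling pipeline (assumed hyperparameters, stand-in data).
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, roc_auc_score

# X: (n_posts, 28) matrix of extracted features, y: 1 = clickbait, 0 = not clickbait.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 28)), rng.integers(0, 2, size=500)

# Recursive feature elimination down to 24 features (4 eliminated, as in Section 3.3).
selector = RFE(LinearSVC(dual=False), n_features_to_select=24).fit(X, y)
X_sel = selector.transform(X)

# Grid search over the SVM regularization strength.
svm = GridSearchCV(LinearSVC(dual=False), {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X_sel, y)
logreg = LogisticRegression(max_iter=1000).fit(X_sel, y)

for name, model in [("Linear SVM", svm), ("Logistic regression", logreg)]:
    pred = model.predict(X_sel)
    print(name)
    print(classification_report(y, pred, digits=2))
    # decision_function gives a score usable for the ROC curve and its area.
    print("AUC:", roc_auc_score(y, model.decision_function(X_sel)))
```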

4.2 Validation and Training Accuracy
Training accuracy is the accuracy of the model in classifying the same data it was trained on, while validation accuracy is the accuracy of the model in classifying different data. The training accuracy of the model is 74.5% and the validation accuracy is 79.4%. This measure is useful for better understanding the model's behavior. Usually, the training accuracy is slightly higher than or equal to the validation accuracy; here the model's validation accuracy is 4.9% higher than the training accuracy, because the distribution of the training data is different from the distribution of the validation data.

4.3 Results Analysis
The distribution of the data affected the results, as shown in the table below. The better performance in identifying the not-clickbait class is a result of the difference in the number of instances in the two classes.

Table 3. Precision, Recall, & F1-score
Class              Precision   Recall   F1-score   Instances number
Not-clickbait      0.85        0.88     0.87       14774
Clickbait          0.58        0.50     0.54       4713
Weighted average   0.78        0.79     0.79       19487

Precision is the ratio of true positives predicted correctly over the total positive predictions in the data. Recall is the ratio of true positives predicted correctly over the total number of instances in the class. F1-score is the weighted average of both precision and recall, and it is measured using the following formula:

F1 = 2 x (Precision x Recall) / (Precision + Recall)

To better evaluate the model, the ROC curve was plotted. The ROC curve is used to illustrate the diagnostic ability of a binary classifier; it is the plot of the true positive rate against the false positive rate. The area under the ROC curve is used as a metric to measure the feasibility of the model, representing the ability of the algorithm to correctly classify data. The area under the ROC curve is 0.70. A plot of the curve is shown in Figure 3.

Fig 3. ROC Curve

These numbers show that the model has a fairly good ability to classify the posts. They also prove that it is possible to extract relevant features from all parts of the post and keep the dimensionality as low as possible. As mentioned in the literature review, most of the previous research was based on the title and the article; those approaches achieved good numbers in some cases, but in a short amount of time the topics of the articles change, and so does the style of the titles, making such a model useless. The usage of all parts of the post was an attempt to extract features that generally describe the problem and maintain the functionality of the model for a long period of time. Six out of ten of the top features in the model do not use the title or the article, which proves that other parts of a post are relevant to identifying clickbaits.

5. CONCLUSIONS
In this research we intended to give a different approach to the clickbait classification problem by highlighting the relevance of the elements of a social media post in this process, introducing some new features that add to its effectiveness, and demonstrating that such a process can be done with an acceptable number of features. All of this has been achieved with the methodology explained above, and the results were demonstrated by several commonly used metrics for evaluating machine-learning-based binary classification, leading to the following:

• Clickbait detection is possible on social media platforms with better performance if the elements of posts on such platforms are used properly.
• A low number of features can still be effective for classifying clickbaits, which helps in building a real-time classifier and moves this idea from theory to application.

6. FUTURE WORK
Many different modifications and experiments were left out due to the limited time and support. Future work may include different learning approaches, different methods for feature extraction, and some modifications that can be more useful in a product than in a research setting. Experimenting with the Arabic language can also be interesting, but it is not feasible currently due to the lack of appropriate tools for processing text in that language. We list here some of the ideas that we might apply in the future:

- Limiting the training of the model to the features extracted from all parts of a post excluding the article, making the classifier faster and reducing the data storage and processing required for fetching and saving the article.
- Determining the features using unsupervised machine learning techniques, which can result in higher accuracy but cannot be done currently due to the need for a larger and more diverse dataset and the long time required for each run of the learning process; the same unsupervised learning techniques could also replace the SVM model we chose and lead to interesting and different classification results.
- Experimenting with Arabic, which is limited by the available NLP tools for the language and the available datasets to be used in the learning process that precedes the classification, but will surely be doable in the near future when better tools and appropriate datasets become available.

7. REFERENCES
[1] Josh Constine (2017), Facebook feed change fights clickbait post by post in 9 more languages. [https://techcrunch.com/2017/05/17/facebook-anti-clickbait/]
[2] Prakhar Biyani, Kostas Tsioutsiouliklis, and John Blackmer (2016), "8 Amazing Secrets for Getting More Clicks": Detecting Clickbaits in News Streams Using Article Informality, Yahoo Labs, Sunnyvale, California, USA.
[3] Martin Potthast, Sebastian Köpsel, Benno Stein and Matthias Hagen (2016), Clickbait Detection, Bauhaus-Universität Weimar.
[4] Bauhaus-Universität Weimar (2017), Clickbait Challenge. [clickbait-challenge.org]
[5] Francis Heylighen & Jean-Marc Dewaele (1999), Formality of Language: definition, measurement and behavioral determinants, Center "Leo Apostel", Free University of Brussels.

[6] E. A. Smith and R. J. Senter (1967), Automated Readability Index, Aerospace Medical Research Laboratories.
[7] Saurabh Mathur (2017), Clickbait Detector. [github.com/saurabhmathur96/clickbait-detector/blob/master/data]
[8] Corinna Cortes and Vladimir Vapnik (1995), Support-Vector Networks, AT&T Bell Labs.
[9] Padmavathi Janardhanan, Heena L., and Fathima Sabika (2015), Effectiveness of Support Vector Machines in Medical Data Mining, Journal of Communication, Software and Systems.
[10] Peter Bourgonje, Julian Moreno Schneider and Georg Rehm (2017), From Clickbait to Fake News Detection: An Approach based on Detecting the Stance of Headlines to Articles, Second Workshop on Natural Language Processing meets Journalism.
[11] Sophie Chesney, Maria Liakata, Massimo Poesio and Matthew Purver (2017), Incongruent Headlines: Yet Another Way to Mislead Your Readers, Second Workshop on Natural Language Processing meets Journalism.

Pedagogical and Elearning Logs Analyses to Enhance Students' Performance

Eslam Abou Gamie1, M. Samir Abou El-Seoud2, Mostafa A. Salama3 and Walid Hussein4
The British University in Egypt, Cairo, Egypt
[email protected]; [email protected]; [email protected]; [email protected]

ABSTRACT
This paper introduces a model to analyze and predict students' performance based on two dimensions: teaching style and eLearning activities. Such data will be collected from educational settings within an academic institution. The analyzed data is used to reveal knowledge and useful patterns from which critical decisions can be made.

The suggested model should be able to:
• Classify modules according to their module nature
• Analyze different kinds of students' interaction with eLearning
• Classify teaching styles and pedagogical approaches and their effect on students' performance
• Classify students and their final grades according to their background and characteristics
• Utilize different correlation analysis and feature selection techniques

CCS Concepts
• Information systems → Data mining • Computing methodologies → Feature selection.

Keywords
Data mining; education data mining; MOODLE; feature selection; correlation analysis; learning activities; pedagogical approaches; classification

1. INTRODUCTION
Educational management information systems (EMIS) are collections of components that include input-process-output modules, where feedback is integrated to achieve a specific goal. The modules in an EMIS include the e-learning system, the student record system, the student attendance system, and the student basic information. Educational systems make use of the data in these modules to perform different analysis tasks, for example the analysis of student failure based on grades or on basic information like family status, or the analysis of how successfully a taught module and its resources are presented to the students. Each of these tasks is applied separately, based on the module the data is extracted from. The integration of the data gathered from multiple resources in the EMIS is an important goal for enhancing the accuracy of decision making, and the consideration of the attributes existing in each of the modules composing the EMIS is currently being investigated heavily in different areas. Educational settings in academic institutions accumulate large amounts of data across the years; such data can be analyzed to reveal useful knowledge. Systems like the Moodle Learning Management System generate statistical and behavioral student data. Various machine learning techniques are applied to extract frequent student behavioral patterns; such patterns are important in detecting learning styles and at-risk students. Most of the prediction techniques are based on feature extraction from eLearning activities such as online assignments, online quizzes and online forums, and from eLearning resources such as accessed files and labels. As an example of weekly logs, the number of student course logs can be broken down into 13 weeks and mapped into 13 features, with the final student grade taken from the student record system (SRS). This forms a dataset of 13 features and one class feature (the final grade), of which 90% is used to apply machine learning techniques like support vector machines and decision trees, so that student grades can be predicted from such logs (a minimal sketch of this setup is given below).

The work here proposes a new approach to the integration of educational data models. The approach is based on integrating the data of the different modules according to a specific perspective. These perspectives, named factors, may include common attributes. The factors considered are the student's data, the module data and the teacher data; each factor includes a set of attributes from different modules. An example is the students' activity factor, which includes attributes from different modules such as performance from the student record system, student activities from e-learning, module nature from the module specification, teaching style from online surveys, and student characteristics and background from the student record system.

The aim of the proposed educational system analysis is to predict the success rates of students enrolled in a certain module, along with the detection of the main features characterizing these rates. Those features are categorized into three categories: student activities and demography, module nature, and teaching style, in addition to the cross-correlation between the features in these categories. The outcome should enhance the accuracy of prediction results and also provide proper recommendations/interpretations to the administrative/academic departments as early student-at-risk alerts.

ICSIE '18, May 2–4, 2018, Cairo, Egypt
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00
DOI: https://doi.org/10.1145/3220267.3220289
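The following sketch illustrates that weekly-log setup: 13 weekly log counts per student plus a final-grade class, split 90/10 and fed to a support vector machine and a decision tree. The data here is randomly generated for illustration only; the real features would come from the Moodle logs and the student record system described above.

```python
# Hedged sketch of the 13-weekly-log-features + final-grade setup (synthetic data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n_students = 300
weekly_logs = rng.poisson(lam=5, size=(n_students, 13))      # logs per week, weeks 1..13
final_grade = (weekly_logs.sum(axis=1) > 60).astype(int)     # toy class feature (e.g. pass/fail)

# 90% of the dataset for training, 10% held out, as described in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    weekly_logs, final_grade, train_size=0.9, random_state=0)

for model in (SVC(kernel="linear"), DecisionTreeClassifier(max_depth=4)):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))
```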

The research starts by reviewing and analyzing different methodologies in previous work, then explores the dimensions and attributes to be included in an integrated model along with the students' final grades. Finally, data collection, analysis and results are introduced.

2. PREVIOUS WORK
The work in [5] focused on the extraction of rules from eLearning systems using rare association rule mining (RARM) techniques. Four association mining algorithms were compared: Apriori-Frequent, Apriori-Infrequent, Apriori-Inverse and Apriori-Rare. The paper explored applying RARM to detect infrequent student behavior; it also stated that normal association rules (such as the Apriori algorithms) do not take infrequent associations into consideration, despite the fact that relatively infrequent associations could be of significant interest. Three student online activities were counted (assignments, quizzes and forums) together with the predicted final course grade. The paper restricted itself to rule extraction over different types of association rules only, and no behavioral attributes were taken into consideration, which could certainly have enhanced the accuracy of prediction. In [6] the author deals with the variance of course types and the number of activities generated from eLearning systems: he detects the relationship between the activities and resources in a certain course and the students' final grades by applying different Multiple Instance Learning techniques and comparing the results. Although the research is well organized, the main focus was on the techniques, not on the data attributes, without mentioning the reason behind choosing only three specific student online activities. The author in [7] starts with the question of whether it is possible to predict a student's success in a course with a small dataset, noting that datasets associated with students are considered small even with a big number of students. The student attributes considered in that paper for prediction are: gender, year of birth, employment status, registration, type of study, exam condition and activities.

Most of the previous work studying the behaviour of students over the academic period is based on several perspectives. The first perspective partitions the factors affecting student behaviour according to institutional and family support and the degree of student awareness [8]. The second perspective studies the students who improve during their study at the university [9]. Another perspective adds external factors like economic status [10]. Finally, the current perspective adds, to the known factors, the interaction with electronic educational systems [11]. Current research trends in this area examine the different activities performed by the student on electronic learning systems. The work in [11] studies the frequency of online interaction of the students, in particular the percentage of access to the virtual classroom and discussion boards. The work in [12] provides an evaluation of e-learning systems by categorizing the different factors that may affect student performance; these factors are divided into six dimensions: system quality, service quality, content quality, learner perspective, instructor attitudes, and supportive issues. The purpose of the previous work is to gain the benefit of all factors that affect student performance and build a machine learning model that enables decision makers to alter the teaching methodology. None of the previous work builds a model that simulates the multiple dimensions of these factors.

3. PROPOSED SOLUTION
The problem investigated in this work is to enhance the accuracy of predicting the final grades of the students. Depending on students' eLearning logs only is not sufficient for classifying students according to their grades, and current research does not evaluate pedagogical attributes along with the eLearning analysis. Students could perform badly in a module not because they are not diligent, but because the materials of the module on the e-learning system or the teaching style are not attractive. Students could also lose interest or feel disappointed if the resources of the module are not added properly or are added late. On the other hand, every module can have a different type relative to the other modules. This work extends the current research in the area of e-learning data mining by detecting the correlation between three general areas: the first area is the module delivery on the e-learning system, the second area is the student performance and interaction with the delivered contents, and the third area is the students' performance in the final marks of the module.

This work considers the fact that the factors affecting student performance are semantically categorized into several dimensions. Although various research makes use of these dimensions [12], prediction models of student performance do not consider this important fact. Ensemble methods that combine the results of various heterogeneous classifiers are a fitting tool to guarantee the enhancement of the accuracy of the utilized machine learning models [13]. Each classifier in the ensemble model could target one of the dimensions of the learning factors, and which classifier is used for each dimension depends on the nature of the data in that dimension. In this work, we assume that all classifiers could perform equally well on all the dimensions. Figure 1 provides a picture of the proposed model, which distributes the classifiers used in the ensemble technique over the datasets of the dimensions. This distribution is applied randomly in this work; future work is to use an evolutionary model like genetic algorithms to perform this distribution.

Figure 1: The proposed model for predicting the student performance based on the E-Learning log analysis.
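As a rough sketch of the ensemble idea in Figure 1, the code below trains one classifier per dimension-specific feature subset and combines their predictions by majority vote. The three classifier types, the fixed classifier-to-dimension assignment and the feature splits are assumptions made for illustration; the paper assigns classifiers to dimensions randomly and leaves an optimized assignment to future work.

```python
# Hedged sketch of a per-dimension ensemble with majority voting (synthetic data).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 9))
y = rng.integers(0, 2, size=400)          # e.g. pass / fail

# Assumed split of the feature columns into the A (activities), T (teaching style)
# and M (module nature) dimensions; the assignment below is fixed for illustration.
dimensions = {"A": slice(0, 3), "T": slice(3, 6), "M": slice(6, 9)}
classifiers = {"A": SVC(), "T": DecisionTreeClassifier(max_depth=3), "M": GaussianNB()}

for name, cols in dimensions.items():
    classifiers[name].fit(X[:, cols], y)

def predict_majority(x_row):
    """Predict with each dimension's classifier and take the majority vote."""
    votes = [int(classifiers[name].predict(x_row[cols].reshape(1, -1))[0])
             for name, cols in dimensions.items()]
    return int(sum(votes) >= 2)

print(predict_majority(X[0]))
```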

3.1 Population and Data Collection
This work considers two main dimensions that influence the performance of the students. These dimensions can be named as: student activities via eLearning (abbreviated as A), and teaching style from the pedagogical approach (abbreviated as T). Each dimension includes a set of features (factors) that describes its different characteristics. The aim of this work is to study the effect of these factors on determining the results of the students by utilizing data mining techniques. The features and the corresponding values of each factor can be listed as follows.

3.1.1 E-Learning Activities A
In trying to test the effect of e-learning frameworks on educational outcomes, this work examines the case of a local private university. The university, established in 2005, has one of the oldest Learning Management Systems; its total enrolment was 9228 students in 2016, and enrolment remained around this level in the earlier years. The e-learning framework at the university covers all modules across all faculties. Materials are uploaded week by week and mainly include PowerPoint presentations covering that week's sessions and a tutorial sheet. Each student has a specific ID number, and the e-learning system uses that ID to track students' usage, collecting data such as the number of times a student accessed a certain module, his/her enrolment date, the duration of e-learning use, and which materials the student accessed and/or downloaded. Students could also lose interest or become disengaged if module resources are not added properly or are added late. For academic purposes this data was made available, and coupled with the fact that the British University in Egypt has a well-established e-learning system, the university was chosen as the case study of this research.

A total of 5 variables were selected to be included in the analysis, based on previous literature on the topic. The variables, their explanation, and the previous literature in which they were used can be viewed in Table 1. All variables were obtained from the British University in Egypt's e-learning server, with the help of the E-learning Department. The e-learning activity features considered are:
• Delay in enrolment to the module [0-100 days]
• Number of accesses to the module and its resources in the semester [0-100 times]
• Average number of accesses to the module per week [0-10 times]
• Average time delay in accessing the lectures, counted from the upload time [0-100 days]
• Average time of uploading the assignment answers, subtracted from the deadline time [0-100 days]
Such data is collected from the eLearning system log file and the university database.

3.1.2 Teaching Style T
The style of teaching presented by the instructor may not be suitable for all students. For example, some students may like the use of the whiteboard marker during the lecture, while others prefer PowerPoint presentations. This type of data will be collected from the student evaluation form provided by the students at the end of the semester. Current research has not evaluated tutor performance or the quality of the delivered contents along with the analysis of student data. Materials presentation is an important factor: the instructor could deploy the lectures properly, for instance uploading the lectures simultaneously with the labs, weekly, or within the first week, or the lecture contents may not be up to date with each session.

3.1.3 Module Nature M
The module nature reflects the direction of the module, whether it is a scientific, mathematical, programming, or theoretical module. The module specification includes detailed information about the module, including the nature of the assessment, for example whether the assessments focus on lab tests, projects or unseen exams. The module specification may also include the topics covered by the module and the number of hours of labs, lectures and tutorials. The distribution of the marks also reflects the focus of the module on theoretical, laboratory or mathematical contents. The module reading list can be categorized according to the category of the materials on well-known websites like Amazon, or in a local library like the library of the university or the school. The module nature could reflect the interest of the student and the points of strength and weakness of each student. Students may perform badly in a module not because they are not diligent, but because the materials of the module on the e-learning system or the teaching style are not attractive. This will help the university decision makers to adjust the module according to the students' needs and capabilities without ignoring the needs of industry.

3.2 Data Analysis
The study is divided into two phases. The first phase combines the features of the two dimensions along with the student results in one data set, and then applies a set of feature selection techniques to detect the most discriminating features, the correlation among features, and the patterns that infer the final performance of the student (a small sketch of this correlation ranking is given after Table 1).

The second phase of this study is to construct three different data sets based on the features of the three factors: the student activities, the teaching style, and the content categorization. The features of the student characteristics factor (S) are common among the three data sets. For each data set, a set of classifiers is tested to select the most appropriate one, i.e. the one whose classification accuracy percentage is the maximum. For testing a new student, the trained classifier specific to each data set is applied to predict the final performance of the student; if two classifiers lead to a certain prediction while the third classifier gives a different result, the final prediction goes to the majority. The utilized classifiers are neural networks, decision trees, support vector machines and Bayesian belief networks. Finally, a comparison is conducted between the two phases according to the classification accuracy percentage. If the first phase shows a better accuracy percentage, this indicates that the correlation between the features of different factors is highly important in predicting the students' final performance.

Table 1. E-Learning activities variables
Variable                   Explanation
Student Grade              The final grades for each module, across the 243 students. This variable is used as a proxy for measuring educational outcomes.
Number of Course Log Ins   The total number of times a student logged into a module's page on e-learning during the whole year. This variable is used as a proxy for e-learning usage.
School Leaving Grade       The final high school grade for each student, in percentage terms.
Module Type                Specifies whether the grade of the student belongs to a mathematical or a theoretical module.
Attendance                 Measures the overall attendance level of students by specifying whether a mandatory attendance policy existed.
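The first-phase step of ranking features by their relationship with the final outcome can be sketched as below. The pandas-based correlation ranking is an assumed, simplified stand-in for the feature selection techniques named in Section 3.2, using the Table 1 variables as column names and randomly generated values.

```python
# Hedged sketch of phase 1: rank candidate features by correlation with the grade.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 243   # number of students, as in Table 1
df = pd.DataFrame({
    "school_leaving_grade": rng.uniform(60, 100, n),
    "course_log_ins": rng.integers(1, 380, n),
    "attendance": rng.uniform(0, 1, n),
    "module_type": rng.integers(1, 3, n),
})
# Toy target: grade loosely driven by school grade and log-ins (illustration only).
df["student_grade"] = (0.5 * df["school_leaving_grade"]
                       + 0.1 * df["course_log_ins"]
                       + rng.normal(0, 10, n))

# Rank features by absolute Pearson correlation with the final grade.
ranking = (df.corr(numeric_only=True)["student_grade"]
             .drop("student_grade")
             .abs()
             .sort_values(ascending=False))
print(ranking)
```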

The summary statistics of the data can be viewed in Table 2 below. The statistics show that the total number of observations is not constant across all variables, because some data is missing for some of the variables; this is not a problem, since the software used, STATA, is designed to automatically drop missing data. The total number of observations ranges between 3455 and 3000. The mean of the student grade variable is 53, which is equal to a C in the grading scale of the British University in Egypt. The grades have a very wide range, between 0 and 98, given that the highest possible grade is 100. The average number of course log-ins is close to the average of the grade, standing at nearly 49 times; however, the standard deviation differs greatly, being 37, showing that the number of course log-ins is less concentrated around the mean. This can be attributed to the wider data range of the number of log-ins, which lies between 1 and 379. The mean of the school leaving grade is significantly higher than that of the student grade, at 80%, suggesting that students tend to perform better in high school than in university. The standard deviation of the school leaving grade is much smaller than the others, showing that the majority of the data centers around the high mean and suggesting that the majority of the students scored relatively high grades. The lowest reported school leaving grade is 63% and the highest is 105%.

Table 2. Summary statistics
Variable                   Observations   Mean       Standard deviation   Min    Max
Student grade              3,092          53.00356   14.80982             0      98
Number of course log ins   3,000          48.88733   37.55451             1      379
School leaving grade       3,247          80.58451   8.959488             62.9   104.9
Module type                3,455          1.449493   0.4975146            1      2

4. RESULT ANALYSIS
This work studies the factors that affect the performance of students in their final year. The experimental work addresses a collective data set related to higher education students. The collective data set includes all the subclasses/dimensions that categorize these factors into four main dimensions: the student, the module, the teaching style and the student activity on e-learning represent the categories of all the factors that may affect student performance. The first step in this study is to find the main factors that are highly discriminative for classifying the student rank; the statistical correlation between the ranking feature and the rest of the factors was therefore measured.

It appears that the school leaving grade possesses the highest correlation with the ranking feature, followed by the number of log-ins to the educational system (e-learning). On the other hand, student attendance appears to have the lowest discriminating effect. The school leaving grade reflects the dimension of the student characteristics and original behaviour, while the number of log-ins reflects the degree of student interaction related to the e-learning dimension. This shows that a single dimension is not enough to reach an accurate prediction of student performance, and provides evidence that data analysis in this field must comprise all the factors, while taking into consideration that these factors are categorized.

The second step is the classification of the data features against the ranking feature. The classifier that shows the highest classification accuracy is the Naïve Bayesian network. Bayesian models consider a univariate model of the input data set; this behavior shows that, because the data attributes are gathered from different resources, the dependency or correlation between these attributes decreases to a minimum. Otherwise another technique, such as a neural network, a support vector machine or even evolutionary algorithms, would be preferable. The accuracy of the classifier is calculated as the number of instances predicted correctly relative to the total number of instances, and the numbers of instances per category were equalized to ensure the fairness of the classification process.

The confusion matrix of the classifier shows the error rate in each category, where the error rate in the middle category is the highest. This is because its attribute values lie between the top and bottom classes, which increases the confusion of the applied classifier. When the Faculty, Cohort, Module and Module Type features are removed, the classification accuracy remains the same. When the number-of-log-ins feature is removed, the classification accuracy decreases to 86.25%, while removing the high school (school leaving) grade decreases the accuracy to 49.76%. This provides evidence that the student ranking depends mainly on this evaluation obtained before joining the higher education stage.
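A minimal sketch of this second step is shown below: a Gaussian Naïve Bayes classifier trained on the Table 2 variables with a three-level ranking target (Top/Medium/Bottom) and its confusion matrix. The synthetic data, the equal-sized classes and the discretization into three ranks are assumptions for illustration; the 86.25% and 49.76% figures quoted above come from the authors' experiments, not from this sketch.

```python
# Hedged sketch of step 2: Naive Bayes classification of the student ranking.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(4)
n = 900   # kept divisible by 3 so the Top/Medium/Bottom classes are equal-sized
school_grade = rng.uniform(63, 105, n)
log_ins = rng.integers(1, 380, n)
attendance = rng.uniform(0, 1, n)
X = np.column_stack([school_grade, log_ins, attendance])

# Toy ranking driven mostly by the school leaving grade (tertiles -> Bottom/Medium/Top).
score = 0.8 * school_grade + 0.05 * log_ins + rng.normal(0, 3, n)
rank = np.digitize(score, np.quantile(score, [1 / 3, 2 / 3]))   # labels 0, 1, 2

model = GaussianNB().fit(X, rank)
pred = model.predict(X)
print("accuracy:", accuracy_score(rank, pred))
print(confusion_matrix(rank, pred))   # the middle class typically shows the most confusion
```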

Figure 2: The grades and the high school values according to each of the three ranking values (series plotted: Grade, No of Log ins, School leaving Grade, Ranking)

Figure 2 presents the values of the current grades of the students, the corresponding grades in high school, and the number of login activities on the e-learning system. These values are distributed over three ranking values (Top, Medium, and Bottom). The chart shows that the high school grades are always higher than the current grades of the students, and a low correlation appears between the number of login activities and the current and high school grades.

5. EXTENDED RESULTS FOR INTEGRATION
Another dimension that could make a great contribution to educational systems is handwritten document detection. The documents scanned and uploaded by students are saved in the educational system as a black box, with no use made of their contents. Detecting the text could help in automatic marking of the contents and in recording the results in the student records. This would ensure online evaluation of the uploaded document, providing appropriate feedback to each student separately.

In this work, line detection is deferred to the segmentation step. Segmentation algorithms attempt to split a document into pieces: pages into lines, lines into words, and words into characters. These algorithms produce candidate regions for recognition. Each word is validated by finding its baseline and rotating it around its center of gravity so that the baseline becomes horizontal. The resulting text is compared to the model answer, and a reasoned mark is produced automatically online for the students. The resulting mark is one feature; several other features could be detected, such as the time of submitting the answers, the handwriting clarity, and the answer organization.

A general accuracy of about 95% was found with this technique on an internally gathered dataset. Training was done on 50 words each of handwritten and machine-printed images, and the testing data consisted of documents containing a total of 286 machine-printed and 104 handwritten words.

6. CONCLUSION AND FUTURE WORK
This work extends the current research in the area of learning analytics and data mining by detecting the correlation between three general dimensions: the first dimension is the student activities via eLearning, the second dimension is the teaching style, and the third is the student result. The data gathering phase has been completed for the eLearning dimension and is ready for analysis. Future work will include the analysis of these dimensions and their results; other dimensions, like demographic data and pedagogical attributes from open linked data, will then be included for further accuracy improvements.

7. REFERENCES
[1] Koichiro Ishikawa, M. F. (2013, December 6). Log Data Analysis of Learning Histories in an e-Learning. International Journal of Information and Education Technology.
[2] Beth Dietz-Uhler, J. E. (2013, spring). Using Learning Analytics to Predict (and Improve) Student Success: A Faculty Perspective. Journal of Interactive Online Learning.
[3] Nor Bahiah Hj Ahmad, S. M. (2010). A Comparative Analysis of Mining Techniques for Automatic Detection of Student's Learning Style. 10th International Conference on Intelligent Systems Design and Applications (pp. 877-882).
[4] Fatos Xhafa, S. C. (2011). Using Massive Processing and Mining for Modelling and Decision Making in Online Learning Systems. 2011 International Conference on Emerging Intelligent Data and Web Technologies (pp. 94-98).
[5] Romero, C. R. (2010). Mining rare association rules from e-learning data. Proceedings of the 3rd International Conference on Educational Data Mining, International Educational Data Mining Society (pp. 171-180). Pittsburgh.
[6] Ventura, A. Z. (2009). Predicting Student Grades in Learning Management Systems with Multiple Instance Genetic Programming. Proceedings of the 2nd International Conference on Educational Data Mining (pp. 307-314). Cordoba.
[7] Srečko Natek, M. Z. (2014). Student data mining solution–knowledge management system related to higher education institutions. Expert Systems with Applications.
[8] Khurshid, Fauzia (2014). Factors Affecting Higher Education Students' Success. Asia Pacific Journal of Education, Arts and Sciences, Vol. 1, No. 5, November 2014.
[9] Hijazi, S. T., & Naqvi, S. M. M. (2006). Factors affecting students' performance. Bangladesh e-journal of Sociology, 3(1).
[10] Farooq, M. S., Chaudhry, A. H., Shafiq, M., & Berhanu, G. (2011). Factors affecting students' quality of academic performance: a case of secondary school level. Journal of Quality and Technology Management, 7(2), 1-14.
[11] Davies, J., & Graff, M. (2005). Performance in e-learning: online participation and student grades. British Journal of Educational Technology, 36(4), 657-663.
[12] Ozkan, S., & Koseler, R. (2009). Multi-dimensional students' evaluation of e-learning systems in the higher education context: An empirical investigation. Computers & Education, 53(4), 1285-1296.
[13] Whalen, S., & Pandey, G. (2013, December). A comparative analysis of ensemble classifiers: case studies in genomics. In Data Mining (ICDM), 2013 IEEE 13th International Conference on (pp. 807-816). IEEE.

Efficient Architecture for Controlled Accurate Computation using AVX

DiaaEldin M. Osman, Ain Shams University, Cairo, Egypt ([email protected])
Mohamed A. Sobh, Ain Shams University, Cairo, Egypt ([email protected])
Ayman M. Bahaa-Eldin, Misr International University, on leave from Ain Shams University, Cairo, Egypt ([email protected])
Ahmad M. Zaki, Ain Shams University, Cairo, Egypt ([email protected])

ABSTRACT
Several applications have problems with the representation of real numbers because of its drawbacks, such as the propagation and accumulation of errors. These numbers have a fixed-length format representation that provides a large dynamic range, but on the other hand it truncates parts of a number that needs to be represented by a long stream of bits. Researchers have suggested many solutions for these errors; one of them is the Multi-Number (MN) system. The MN system represents a real number as a vector of floating-point numbers with controlled accuracy, adjusting the length of the vector to accumulate the non-overlapping real number sequences. The main drawback of the MN system is that MN computations are iterative and time consuming, making it unsuitable for real-time applications. In this work, the Single Instruction Multiple Data (SIMD) model supported in modern CPUs is exploited to accelerate MN computations. The basic arithmetic operation algorithms have been adjusted to make use of the SIMD architecture and to support both single- and double-precision operations. The new architecture maintains the same accuracy as the original one when implemented for both single and double precision. Also, in this paper the normal Gauss-Jordan elimination algorithm is proposed and used to get the inverse of the Hilbert matrix, as an example of ill-conditioned matrices, instead of using iterative and time-consuming methods. The accuracy of the operations was verified by computing the inverse of the Hilbert matrix and checking that the multiplication of the inverse and the original matrix produces the identity matrix. Hilbert matrix inversion was accelerated and achieved a 3x speedup compared to the original MN operations. In addition, the accelerated MN system was used to solve the polynomial regression problem.

CCS Concepts
• Mathematics of computing → Mathematical software performance

Keywords
Error-free transformation; Hilbert matrix; ill-conditioned matrices; Vandermonde matrix; polynomial regression; floating-point

ICSIE '18, May 2-4, 2018, Cairo, Egypt. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00. DOI: https://doi.org/10.1145/3220267.3220292

1. INTRODUCTION
Real numbers are a problem to represent in the form of binary bits in digital systems. They can be represented either in fixed-point form or in floating-point form, and each of these forms has its pros and cons. Depending on the representation, problems appear that affect the accuracy of the stored numbers, such as rounding and the limited number of bits of the representation. The floating-point representation [1, 2] has many accuracy issues, mentioned in [5, 8, 10, 12], although it is more efficient and dynamic than fixed point. The precision issue comes from the fact that some numbers need more bits in the significand part to be represented. Another issue is rounding, which arises while performing operations on the numbers. Researchers have been working on accurate algorithms that can be used to overcome the issues of the floating-point representation of real numbers [6, 7, 11, 14]. One set of such error-free transformation algorithms is the one Zaki et al. [15, 16] introduced to perform the basic mathematical operations (addition, subtraction, multiplication, and division) on a vector of multi-numbers of an arbitrary length n while preserving accuracy. The length n of the vector is used to control the accuracy of the operations, so the length is configurable based on the application that uses the algorithm and on the required accuracy of the results.

The Single Instruction Multiple Data paradigm can be found in most modern computers; it applies a single instruction (for example addition or subtraction) to different elements in parallel. It was first introduced in supercomputers for processing data vectors. Later, the silicon vendors of modern commercial processors extended their architectures to support SIMD for data-processing parallelization. Both Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) are versions of SIMD introduced by Intel [3]. The important difference between SSE and AVX is the amount of data that can be processed simultaneously: the registers in SSE are 128 bits wide compared to 256 bits in AVX. The registers can be loaded with different data types based on the application where they are used, so they can be loaded with integers or single/double-precision floating-point numbers. The number of variables that can be packed in a register varies based on which SIMD version is used (SSE or AVX) and on the size of the variable's data type.
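As a concrete illustration of the register width discussed above, the short C program below (a minimal sketch, not taken from the paper; it assumes a compiler with AVX support, e.g. gcc -mavx) packs four doubles into one 256-bit register and adds them element-wise with a single instruction.

```c
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* Four doubles fill one 256-bit AVX register (4 * 64 = 256 bits). */
    double a[4] = { 1.0, 2.0, 3.0, 4.0 };
    double b[4] = { 0.5, 0.5, 0.5, 0.5 };
    double c[4];

    __m256d va = _mm256_loadu_pd(a);      /* pack a[0..3]                     */
    __m256d vb = _mm256_loadu_pd(b);      /* pack b[0..3]                     */
    __m256d vc = _mm256_add_pd(va, vb);   /* four additions in one instruction */
    _mm256_storeu_pd(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%f\n", c[i]);
    return 0;
}
```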

2. MULTI-NUMBER SYSTEM
The MN algorithms are based on error-free transformation algorithms that can overcome the floating-point representation issues. The core of the MN system is the TwoSum and TwoProduct algorithms (Algorithms 1 and 5), as they do not cancel the errors of the calculations. The TwoSum operation adds two floating-point numbers and returns two floating-point numbers: one is the result of the addition and the other is the error due to the floating-point representation issues. The TwoProduct operation multiplies two floating-point numbers and returns two floating-point numbers: one is the result of the multiplication and the other is the error due to the floating-point representation issues.

The MN addition and subtraction operations use MN-SumK to add the elements of the two operands (vectors) without losing accuracy. For the subtraction operation, a negative sign can be added to the elements of the subtracted MN variable.

The MN multiplication is a dot-product operation; it is done by multiplying each element of one of the MN vectors by all the elements of the other vector using the TwoProduct operation. After this the result is reduced by SumK to a vector of length n.

The division is an iterative algorithm based on the Newton-Raphson division method.

Algorithm 1 [a, b] = TwoSum(x, y)
  a = fl(x + y)
  c = fl(a - x)
  b = fl((x - (a - c)) + (y - c))

Algorithm 2 [a] = VecSum(a)
  for j = 1 to n - 1 do
    (a_j, a_j-1) = TwoSum(a_j, a_j-1)
  end for

Algorithm 3 [sum] = SumK(a, K)
  for j = 1 to K - 1 do
    a = VecSum(a)
  end for
  sum = fl((a_1 + a_2 + ... + a_n-1) + a_n)

Algorithm 4 [k, l] = Split(x)
  {s = p/2, p = 24 for IEEE 754 single precision}
  f = fl(2^s + 1)
  y = fl(f . x)
  k = fl(y - fl(y - x))
  l = fl(x - k)

Algorithm 5 [k, l] = TwoProduct(x, y)
  k = fl(x . y)
  [x1, x2] = Split(x)
  [y1, y2] = Split(y)
  l = fl(x2 . y2 - (((k - x1 . y1) - x2 . y1) - x1 . y2))

3. PROPOSED UTILIZATION OF AVX TO ACCELERATE MULTI-NUMBER OPERATIONS
This work proposes using AVX to support double precision. AVX has a larger vector size of 256 bits, represented by the data type __m256 that abstracts the content of the AVX register. The AVX register can be packed with different data types such as int, double, and float. The AVX instructions deal with the data packed in the register in a parallel way, so the same operation is done at the same time on all elements of the vector. For the MN system the AVX register can be packed with four double-precision numbers (4 * 64 = 256). The same representation of the operands can be used as in the SSE version of MN, where a number is represented in a four-element vector [9]; the length of the MN must be a multiple of four as a limitation. The AVX versions of the MN algorithms are Algorithms 6, 7, 8, 9, 10, 11, and 13. Transposing a square matrix is not supported by AVX functions; the transpose function is proposed in this work to be done using the shuffle and permute functions, as shown in Algorithm 12.

3.1 Addition and Subtraction
The TwoSum operation is the core operation of the addition and subtraction algorithms of the MN system. The AVX version applies the TwoSum operation to multiple double-precision numbers simultaneously using the AVX intrinsic instructions. This AVX implementation of TwoSum is done by packing multiple double-precision numbers in a SIMD data type (__m256) and applying the same operation over the packed numbers. The AVX data type can hold four double-precision numbers of the C type "double". An MN double-precision vector is represented with n double-precision floating-point numbers, where n is a multiple of four, so each MN double-precision number can be packed in n/4 AVX variables (n/4 * 256 bits).

For MN addition or subtraction there are two MN numbers to be added or subtracted, so a total of n/2 AVX variables are used. Applying the TwoSum algorithm to the first two AVX variables, each element in one AVX variable is TwoSummed with the corresponding element in the other. After the TwoSum operation is applied, one of the variables holds the sum and the other holds the error due to rounding: the sum is stored in the second vector and the rounding error is saved in the first vector (replacing the original values). The next step applies TwoSum to the next AVX variable and the sum produced by the previous TwoSum. The TwoSum operation is then applied consecutively to the n/2 AVX variables. Finally, the n/2 AVX variables are summed and sorted in ascending order (the largest elements end up in the last vector due to the sequential nature of the VecSum operation). The AVX VecSum operation can be considered as four VecSum functions done simultaneously in the vertical direction on the elements of the AVX variables. After that, TwoSum is applied to the horizontal elements as well; this is achieved by transposing every four AVX variables, which sorts the summation in the horizontal direction. The above assumes that n/2 is a multiple of four; if it is not, two more AVX variables with zero values are added. With the previous steps, one iteration of the AVX implementation is performed.
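Algorithms 1, 4, and 5 are the classical error-free transformations (Knuth's TwoSum and Dekker's Split/TwoProduct). A direct C translation is shown below as a reference point for the AVX variants that follow; it is a sketch rather than the authors' code, and it assumes round-to-nearest IEEE 754 double arithmetic with no compiler re-association (e.g. no -ffast-math). For double precision the splitting constant is 2^27 + 1 rather than the single-precision value noted in Algorithm 4.

```c
#include <stdio.h>

/* Algorithm 1: a + b = s + e exactly, with s = fl(a + b). */
static void two_sum(double a, double b, double *s, double *e)
{
    *s = a + b;
    double c = *s - a;
    *e = (a - (*s - c)) + (b - c);
}

/* Algorithm 4: split x into hi + lo, each with at most 26 significant bits. */
static void split(double x, double *hi, double *lo)
{
    const double f = 134217729.0;   /* 2^27 + 1 for IEEE 754 double */
    double y = f * x;
    *hi = y - (y - x);
    *lo = x - *hi;
}

/* Algorithm 5: a * b = p + e exactly, with p = fl(a * b). */
static void two_product(double a, double b, double *p, double *e)
{
    double a1, a2, b1, b2;
    *p = a * b;
    split(a, &a1, &a2);
    split(b, &b1, &b2);
    *e = a2 * b2 - (((*p - a1 * b1) - a2 * b1) - a1 * b2);
}

int main(void)
{
    double s, e;
    two_sum(1.0, 1e-20, &s, &e);    /* the 1e-20 survives in the error term */
    printf("sum = %.17g, err = %.17g\n", s, e);
    return 0;
}
```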

After K iterations, the last vector is part of the result of the addition; but because of the sequential nature of the original algorithm (VecSum), the original sequential SumK must be applied to the remaining vectors for K iterations, where K > 3 [13], to get the errors in the vectors summed and sorted, and hence to derive the second part of the result of the addition. Subtraction is performed with the same procedure, but the second number is negated before any of the iterations. Algorithms 6, 7, and 8 illustrate the steps of the proposed implementation using AVX.

Algorithm 6 [a1, .., an] = MN-AVX-Addition(x1, .., xn, y1, .., yn)
  {the addition of 2 MultiNumber (MN) variables}
  K = 2  ; number of iterations
  a = MN-SumK-add-AVX(x, y, K, n)

Algorithm 7 [a1, .., an] = MN-AVX-Subtraction(x1, .., xn, y1, .., yn)
  {the subtraction of 2 MultiNumber (MN) variables}
  K = 3  ; number of iterations
  a = MN-SumK-add-AVX(x, -y, K, n)

Algorithm 8 [a1, .., an] = MN-SumK-add-AVX(x1, .., xn, y1, .., yn, K, n)
  K_Sequential = 4
  Array[(2*n)-4]
  AVX_Var(1) = x1, x2, x3, x4
  AVX_Var(2) = y1, y2, y3, y4
  ......
  AVX_Var((n/2)-1) = xn-3, xn-2, xn-1, xn
  AVX_Var(n/2) = yn-3, yn-2, yn-1, yn
  for j = 1 to 2*K do
    for i = 1 to (n/2)-1 do
      AVX_result = AVX_Var(i) + AVX_Var(i+1)
      AVX_temp = AVX_result - AVX_Var(i)
      AVX_Var(i) = (AVX_Var(i) - (AVX_result - AVX_temp)) + (AVX_Var(i+1) - AVX_temp)
      AVX_Var(i+1) = AVX_result
    end for
    for i = 1 to (n/8) do
      TRANSPOSE_AVX(AVX_Var((i*4)-3), AVX_Var((i*4)-2), AVX_Var((i*4)-1), AVX_Var(i*4))
    end for
  end for
  Array = {AVX_Var(1), .., AVX_Var((n/2)-1)}
  SumK(Array[1:((2*n)-4)], K_Sequential)
  a[1:n-4] = Array[(n+1):((2*n)-4)]
  a[n-4:n] = AVX_Var(n/2)

The operators + and - above are AVX arithmetic function calls that handle the packed AVX variables' elements simultaneously.

3.2 Multiplication
The core of the multiplication operation is the TwoProduct operation. The AVX version of MN multiplication (Algorithm 9) is based on Algorithms 10, 4, and 8 to produce an accurate multiplication result [4, 16]. Multiplying one element by all the elements of the other number produces a vector of length 2*n; this vector can be reduced to length n by applying SumK with a suitable number of iterations K.

After finishing the multiplication of all elements, we have n vectors, each of length n. The n vectors, each of n floating-point numbers, can be packed in (n*n)/4 AVX variables of the __m256 data type (each holding four double-precision variables). The next step is to sum these AVX variables: applying the TwoSum algorithm to the (n*n)/4 AVX variables consecutively, as in addition, the (n*n)/4 AVX variables are summed and sorted in ascending order, since four VecSum functions are done simultaneously on the elements of the AVX variables. After that, every four consecutive AVX variables are transposed and TwoSum is applied to the transposed variables. This constitutes one iteration for summing the reduced output of the TwoProduct of two MN numbers, and the process is like what was used before in addition/subtraction. After applying the SIMD VecSum operation over the AVX variables for K iterations, the original SumK operation is applied over the last n variables for K_Sequential iterations. The result can then be extracted from the last n/4 AVX variables (n elements).

Another improvement is the parallelization of the TwoProduct operation itself, as originally TwoProduct is performed n*n times sequentially. Using AVX, four elements can be multiplied by another four simultaneously (the width of __m256). Also, because the numbers that constitute the MN number are sorted, any zero elements will be the last elements; so if the multiplication is started from the largest element, it can be stopped once a zero element is found, because the remaining elements will be zeroes anyway.

Algorithm 9 [c1, .., cn] = MNMultiplication-AVX(x1, .., xn, y1, .., yn)
  {the multiplication of 2 MultiNumber (MN) variables}
  K_Sequential = 3 ; number of iterations
  K = 7
  b_AVX_packed(1) = {y1, y2, y3, y4}
  ......
  b_AVX_packed(n/4) = {yn-3, yn-2, yn-1, yn}
  i = n
  while (i != 0 and xi != 0) do
    a_AVX_packed = {xi, xi, xi, xi}
    for m = 1 to n/4 do
      R[((m-1)*8)+1 : m*8] = TwoProduct-AVX(a_AVX_packed, b_AVX_packed(m))
    end for
    r[((i-1)*8 + 1):(i*8)] = MN-SumK(R[1:2*n], K)
    i = i - 1
  end while
  res = SumK-add-nMN-AVX(r, K)

Algorithm 10 res = TwoProduct-AVX(x, y)
  {the product of 2 AVX variables x and y using error-free transformation operations}
  [a1, a2] = Split(x)
  [b1, b2] = Split(y)
  result = x*y
  res[1:n] = (a2 . b2 - (((result - a1 . b1) - a2 . b1) - a1 . b2))
  res[n+1:2*n] = result
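The vertical TwoSum step that Algorithms 8 and 9 apply to packed AVX variables can be written directly with AVX intrinsics. The sketch below is illustrative only (it is not the authors' implementation and assumes IEEE 754 semantics with AVX enabled at compile time); it performs four independent TwoSum transformations at once, mirroring one inner-loop step of Algorithm 8, where the error lanes would overwrite AVX_Var(i) and the sum lanes AVX_Var(i+1).

```c
#include <immintrin.h>

/* One vertical TwoSum step on four lanes: on return, *sum holds fl(a+b)
 * per lane and *err holds the corresponding exact rounding error. */
static void two_sum_avx(__m256d a, __m256d b, __m256d *sum, __m256d *err)
{
    __m256d s = _mm256_add_pd(a, b);                    /* s = fl(a + b)   */
    __m256d c = _mm256_sub_pd(s, a);                    /* c = s - a       */
    __m256d x = _mm256_sub_pd(a, _mm256_sub_pd(s, c));  /* a - (s - c)     */
    __m256d y = _mm256_sub_pd(b, c);                    /* b - c           */
    *sum = s;
    *err = _mm256_add_pd(x, y);                         /* per-lane error  */
}
```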

3.3 Division
The MN division algorithm is an iterative algorithm based on the Newton-Raphson division method, which iterates to estimate the reciprocal of the divisor [16]. The MN division algorithm uses the MN operations for multiplication and subtraction, and in both SIMD versions its execution time has been improved thanks to the improvement of the multiplication and subtraction operations.

Algorithm 11 [a1, .., an] = SumK-add-nMN-AVX(x1, .., x(n*n), K)
  K_Sequential = 6
  Array[4*n]
  for j = 1 to (n*n/4) do
    AVX_Var(j) = x((j-1)*4)+1, x((j-1)*4)+2, x((j-1)*4)+3, x(4*j)
  end for
  for j = 1 to 2*K do
    for i = 1 to (n*n/4) do
      AVX_result = AVX_Var(i) + AVX_Var(i+1)
      AVX_temp = AVX_result - AVX_Var(i)
      AVX_Var(i) = (AVX_Var(i) - (AVX_result - AVX_temp)) + (AVX_Var(i+1) - AVX_temp)
      AVX_Var(i+1) = AVX_result
    end for
    for i = 1 to (n*n/16) do
      TRANSPOSE_AVX(AVX_Var((i*4)-3), AVX_Var((i*4)-2), AVX_Var((i*4)-1), AVX_Var(i*4))
    end for
  end for
  Array = {AVX_Var((n*n/4)-n : n*n/4)}
  SumK(Array[1:4*n], K_Sequential)
  a[1:n] = Array[3*n:4*n]

Algorithm 12 Transpose-AVX(row0, row1, row2, row3)
  {transposes the elements of 4 AVX vectors __m256d}
  __m256d tmp3, tmp2, tmp1, tmp0;
  tmp0 = _mm256_shuffle_pd(row0, row1, 0);
  tmp2 = _mm256_shuffle_pd(row2, row3, 0);
  tmp1 = _mm256_shuffle_pd(row0, row1, 15);
  tmp3 = _mm256_shuffle_pd(row2, row3, 15);
  row3 = _mm256_permute2f128_pd(tmp0, tmp2, 0x20);
  row2 = _mm256_permute2f128_pd(tmp1, tmp3, 0x20);
  row1 = _mm256_permute2f128_pd(tmp0, tmp2, 0x31);
  row0 = _mm256_permute2f128_pd(tmp1, tmp3, 0x31);

Algorithm 13 [d1, .., dn] = MN-AVX-Division(x1, .., xn, y1, .., yn)
  {the division of 2 MultiNumber (MN) variables}
  TWO = [0 0 0 0 .... 2]
  X = [x1, ..., xn]
  Y = [y1, ..., yn]
  T = [0 0 0 0 .... 0]
  Told = [0 0 0 0 .... 1]
  K = 0
  T(n) = 1/yn
  while (Told != T and K <= MAXIMUMITERATION) do
    V1 = MNMultiplication-AVX(Y, T)
    V2 = MN-AVX-Subtraction(TWO, V1)
    Told = T
    T = MNMultiplication-AVX(T, V2)
    K = K + 1
  end while
  D = MNMultiplication-AVX(X, T)

4. CASE STUDIES
4.1 Ill-Conditioned Matrix Inversion
Researchers have developed methods to compute the inverse of ill-conditioned matrices, but most of them are iterative, time consuming, and approximate the result instead of producing the accurate exact result. The Hilbert matrix can act as a measurement of the accuracy of the operations and data representation used. This paper proposes using the MN system to compute the inverse of the Hilbert matrix with the Gauss-Jordan elimination method. The accuracy was verified by multiplying the inverse of the Hilbert matrix by the original matrix to obtain the identity matrix and then calculating the sum of the squares of the errors. The MN system was able to compute the inverse of the Hilbert matrix up to size 200x200. The SIMD version of the inversion has an execution time of 1/3 of the original MN version, as shown in Figure 1.

Figure 1. Hilbert Matrix Inversion Execution Time - Double Precision

4.2 Polynomial Regression
Polynomial regression is used to find the nth-degree polynomial that fits a set of points (x, y), where the number of points is greater than n. The least-squares fitting method uses the Vandermonde matrix X, formed from the x coordinates of the given set of points, to obtain the coefficients of the polynomial terms that fit the provided points. At high polynomial degrees, the Vandermonde matrix is ill-conditioned. For example, trying to recover the polynomial y = x^20 from its points using the double primitive datatype found in C/C++ fails dramatically. On the other hand, using the accelerated MN system the coefficients are calculated correctly with error equal to zero. Figure 2 shows the execution time comparison; the accelerated version of the MN system achieves a 3.6x speedup. These measurements were run on an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz, with an MN vector of length n = 8.
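Both case studies hinge on how quickly plain double precision breaks down on ill-conditioned matrices. The sketch below is illustrative only and uses ordinary doubles, not the MN system: it builds an n x n Hilbert matrix H with H[i][j] = 1/(i+j+1), inverts it by Gauss-Jordan elimination, and reports the sum of squared errors of H * H^-1 against the identity, the same check the paper uses. With plain doubles this residual grows rapidly once n exceeds roughly 12, which is exactly the failure the MN representation is meant to avoid.

```c
#include <math.h>
#include <stdio.h>

#define N 12   /* already badly conditioned in plain double precision */

/* Invert A in place by Gauss-Jordan elimination with partial pivoting. */
static int gauss_jordan_inverse(double a[N][N], double inv[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            inv[i][j] = (i == j) ? 1.0 : 0.0;

    for (int col = 0; col < N; col++) {
        int piv = col;
        for (int r = col + 1; r < N; r++)
            if (fabs(a[r][col]) > fabs(a[piv][col])) piv = r;
        if (a[piv][col] == 0.0) return 0;              /* singular */
        for (int j = 0; j < N; j++) {                  /* swap pivot row */
            double t = a[col][j]; a[col][j] = a[piv][j]; a[piv][j] = t;
            t = inv[col][j]; inv[col][j] = inv[piv][j]; inv[piv][j] = t;
        }
        double d = a[col][col];
        for (int j = 0; j < N; j++) { a[col][j] /= d; inv[col][j] /= d; }
        for (int r = 0; r < N; r++) {                  /* eliminate column */
            if (r == col) continue;
            double f = a[r][col];
            for (int j = 0; j < N; j++) {
                a[r][j] -= f * a[col][j];
                inv[r][j] -= f * inv[col][j];
            }
        }
    }
    return 1;
}

int main(void)
{
    double h[N][N], hcopy[N][N], inv[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            h[i][j] = hcopy[i][j] = 1.0 / (double)(i + j + 1);

    if (!gauss_jordan_inverse(h, inv)) { puts("singular"); return 1; }

    /* Sum of squared errors of H * H^-1 against the identity matrix. */
    double err = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++) s += hcopy[i][k] * inv[k][j];
            double e = s - (i == j ? 1.0 : 0.0);
            err += e * e;
        }
    printf("sum of squared errors for n=%d: %.3e\n", N, err);
    return 0;
}
```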

Figure 2. Polynomial Regression Execution Time - Double Precision

Figure 3. Hilbert Matrix Inversion Execution Time - Double Precision

5. CONCLUSION
In this paper, an efficient set of algorithms for the basic arithmetic operations on real numbers has been proposed. The base of the proposed algorithms (the MN system) was accelerated by modifying the algorithms to utilize AVX, which is available in all modern CPUs. The MN system is used here to solve the polynomial regression problem, and also to compute the inverse of ill-conditioned matrices directly with Gauss-Jordan elimination instead of the iterative approximate methods. The proposed accelerated version showed that the execution time is reduced to approximately 35% of the original one.

6. REFERENCES
[1] 1985. 754-1985, IEEE Standard for Binary Floating-Point Arithmetic. (1985).
[2] 2008. 754-2008, IEEE Standard for Floating-Point Arithmetic. (2008).
[3] Intel Corporation. 2016. Intel® Architecture Instruction Set Extensions Programming Reference. Manual (February 2016).
[4] T. J. Dekker. 1971. A Floating-Point Technique for Extending the Available Precision. Numer. Math. 18 (1971), 224–242.
[5] D. Dobkin and D. Silver. 1990. Applied computational geometry: Towards robust solutions of basic problems. J. Comput. System Sci. 40, 1 (1990), 70–87. https://doi.org/10.1016/0022-0000(90)90019-H
[6] S. Graillat. 2008. Accurate simple zeros of polynomials in floating-point arithmetic. Computers & Mathematics with Applications 56, 4 (2008), 1114–1120.
[7] S. Graillat and V. Morain. 2012. Accurate summation, dot-product and polynomial evaluation in complex floating-point arithmetic. Information and Computation (March 2012).
[8] W. Kahan. 1996. A Test for Correctly Rounded SQRT. Lecture note (May 1996).
[9] Abdalla D. M., A. M. Zaki, and A. M. Bahaa-Eldin. 2014. Acceleration of accurate floating point operations using SIMD. In Computer Engineering and Systems (ICCES), 2014 International Conference. IEEE, Cairo, Egypt, 225–230.
[10] G. Masotti. 1993. Floating-point numbers with error estimates. Computer-Aided Design 25, 9 (1993), 524–538. https://doi.org/10.1016/0010-4485(93)90069-Z
[11] V. Y. Pan, B. Murphy, G. Qian, and R. E. Rosholt. 2009. A new error-free floating-point summation algorithm. Computers & Mathematics with Applications 57, 4 (2009), 560–564.
[12] S. Schirra. 1997. Precision and robustness in geometric computations. In Algorithmic Foundations of Geographic Information Systems. Lecture Notes in Computer Science, Vol. 1340. Springer Berlin / Heidelberg, 255–287.
[13] O. Takeshi and R. Siegfried. 2005. Accurate Sum and Dot Product. SIAM J. Sci. Comput. 26, 6 (June 2005), 1955–1988.
[14] V. Ch. Venkaiah and S. K. Sen. 1988. Computing a matrix symmetrizer exactly using modified multiple modulus residue arithmetic. J. Comput. Appl. Math. 21, 1 (1988), 27–40.
[15] A. M. Zaki, A. M. Bahaa-Eldin, M. H. El-Shafey, and G. M. Aly. 2010. A new architecture for accurate dot-product of floating-point numbers. In Computer Engineering and Systems (ICCES), 2010 International Conference. IEEE, Cairo, Egypt, 139–145.
[16] A. M. Zaki, A. M. Bahaa-Eldin, M. H. El-Shafey, and G. M. Aly. 2011. Accurate floating-point operation using controlled floating-point precision. In Communications, Computers and Signal Processing (PacRim), 2011 IEEE Pacific Rim Conference. IEEE, Victoria, BC, 696–701.

A Framework to Automate the Generation of Movies' Trailers Using Only Subtitles

Eslam Amer, Associate Professor, Misr International University, Cairo, Egypt ([email protected])
Ayman Nabil, Assistant Professor, Misr International University, Cairo, Egypt ([email protected])

ABSTRACT
With the rapidly increasing rate of user-generated videos over the World Wide Web, it is becoming a high necessity for users to navigate through them efficiently. Video summarization is considered one of the promising and effective approaches for efficacious realization of video content by means of identifying and selecting descriptive frames of the video. In this paper, a proposed adaptive framework called Smart-Trailer (S-Trailer) is introduced to automate the process of creating a movie trailer for any movie through its associated subtitles only. The proposed framework utilizes only English subtitles as the language of usage. S-Trailer parses the subtitle file to extract meaningful textual features that are used to classify the movie into its corresponding genre(s). Experiments on real movies showed that the proposed framework achieves a considerable classification accuracy (0.89) in classifying movies into their associated genre(s). The introduced framework generates an automated trailer that achieves on average about 43% accuracy in terms of recalling the same scenes issued in the original movie trailer.

CCS Concepts
• Information systems applications → Data mining → Collaborative filtering.

Keywords
Movie trailer, natural language processing, classification

ICSIE '18, May 2-4, 2018, Cairo, Egypt. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00. DOI: https://doi.org/10.1145/3220267.3220293

1. INTRODUCTION
The diffusion of available high-speed Internet access nowadays is the main reason that videos have become the most familiar information medium on the Web. It has become easy to create and search for videos related to a topic and to watch movies through YouTube (www.youtube.com). The amount of videos that are produced or indexed is growing at an accelerated rate. This is coupled with a rapidly increasing rate in supplying and demanding video content. The main issue that becomes obvious is that the time required to watch such a huge amount of video is limited, which explains the human inability to keep up with such an enormous amount of video data. Therefore, humans need assistance to understand video content and hence to produce a summarization of a whole video in just a few minutes.

Movie trailers can be viewed as an application of video summarization; the objective of a trailer is to encapsulate a whole 2-3 hour movie into only 2-3 minutes.

The major hindrance affecting video summarization is how to find distinguishable chunks of sub-videos or scenes that can be considered significant or interesting while bypassing other chunks that are neither worthy nor expressive to viewers. Currently, there are many handy applications that are utilized by humans to edit a video and to select and merge video scenes, like Apple iMovie [1], Microsoft Windows Movie Maker [2], and some other online applications such as Movavi [3], MakeWebVideo [4], and IBM Watson. Nonetheless, such applications generally necessitate loading the whole video or movie to carry out their tasks. Producers probably also need to work through many phases that include observing the movie, picking out the best shots that characterize the movie, and finally aggregating such shots in an elegant order.

Manual production of a movie trailer is laborious and time-consuming. As reported by an Independent article [5], the production of a movie trailer can take three to four months. Production companies that create a movie trailer generally try to find some attractive keywords through which they think they will bring people to buy a ticket and watch their movie.

Some approaches have been implemented to ease the process of generating movie trailers. One of them was introduced by Amy Pavel et al. in [6] for segmenting a video into different partitions and enriching them with short text summaries and thumbnails for each particular partition. Viewers become able to read and navigate to their favorite partitions by browsing the summary. However, the approach is not effective in the case of movies, as there is no technique yet by which a trailer creator provides a word and gets back the corresponding scenes associated with that word.

Another work was introduced by Zhe Xu et al. in [7] to create an automatic movie trailer by using featured frames and shots from the movie. The work proposed by Zhe Xu uses some movie trailers as training examples to acquire some features. The obtained features are then employed as patterns to fetch similar shots when a new movie is given.

The drawback of current approaches is that they still require full user participation to generate trailers. Such involvement results in a considerable delay in time and great effort by movie editors when working on a movie in order to generate a trailer.

In this paper, a framework is introduced that automates the process of producing movie trailers using only the textual features indexed in the movie's subtitles. The framework utilizes Natural Language Processing and Machine Learning to analyze the text included in the subtitle file. The framework initially classifies the movie into its related genre(s) and generates a set of significant featured keywords associated with the different genre(s).

The subtitle file associated with the movie is converted to a graph of sub-scenes, where each node in the graph is a sub-scene, and each sub-scene is connected to another sub-scene if they have similar contents. We utilize the PageRank algorithm proposed by [8] to retrieve effective sub-scenes and capture their corresponding time frames.

This paper is organized as follows. Section 2 describes the related works. The proposed approach is presented in Section 3. Initial results are presented in Section 4, and finally, Section 5 presents the conclusion and future work.

2. RELATED WORKS
In this section, a concise overview of some approaches is introduced. As there are rarely any related works so far that specifically generate an automatic movie trailer, most of the current approaches fall into the more general task of video summarization. The field of automatic trailer generation can therefore be considered an almost untouched field of research.

Text mining approaches that generate an automatic trailer were introduced by Konstantinos Bougiatiotis et al. [9] and R. Ren et al. [10]. They demonstrate in their works the ability to extract topic representations from movies based on subtitle mining, by inspecting the presence of a similarity correlation between the content of a movie and low-level textual features from particular subtitles. The approaches presented in [9-10] generate a topical model browser for movies which allows users to scrutinize the various aspects of similarities between movies. However, the approaches presented in [9-10] do not take the movie genre(s) into account; they can be seen as a recommendation system for movies based on the similarity of topics.

Amy Pavel et al. [6] introduced a work that creates an affordable digest that enables video browsing and skimming by segmenting videos into separate sections and providing short text summaries for each segment. Users can navigate to a certain subject of the video by reading the summary section and picking out the corresponding video part that is relevant to the section in the textual summary. Although the work presented in [6] provides a decent infrastructure to handle the problem of searching inside a video, the work is mainly used to partition videos according to titles, chapters, or sections. If a title is missing for any topic inside the video, the system becomes unable to summarize it correctly.

Another work was introduced by J. Nessel et al. [11] that recommends movies to users based on extracting words from user examples. It then compares user preferences and examples with the textual contents of movies. The developed system works recursively in the context of decidable languages and computable functions. However, the system cannot capture any extra preferences or opinions from users, as it relies only on anonymous keywords.

Multimedia mining is another approach used to generate movie trailers. In the work presented by Go Irie et al. [12], a trailer generation method called Vid2Trailer (V2T) automatically generates impressive trailers from original movies by identifying audiovisual components as well as featured key symbols such as the title logo and the theme music. V2T showed more appropriateness compared to conventional tools. The major drawback of the V2T system is the huge processing effort, which is considered excessive because speech filtering happens after considering the top words of the whole movie. The processing could be reduced if the generation of top-impact keywords happened after filtering trivial words out of the subtitles rather than processing the whole subtitle file.

Howard Zhou et al. [13] suggested a trailer system based on scene categorization. The system utilizes intermediate structured temporal-level features to improve the classification performance over the use of low-level visual features alone. Despite the slight enhancement in classification performance, the system relies on a bag of visual words, which requires huge storage to save the bag of visual words related to each movie genre.

Alan F. Smeaton et al. [14] introduced an approach that selects specific shots from action movies to facilitate the process of creating a trailer. The approach makes use of visual scenes to produce a structure of shots by identifying shot boundaries for a movie. The approach also analyzes the audio track of a movie to learn how to distinguish the presence of categories like speech, music, silence, speech with background music, and other audio. Due to the mixture of genres in today's movies, it becomes necessary for any trailer generator to reflect such a mixture. Alan's approach looks promising; however, it is designed specifically to suit action movies. Therefore, the system cannot produce a pleasing trailer if the movie has many genres.

The approaches presented in this section represent considerable efforts to analyze textual content, audio-video content, or both together. However, they show a deficiency in grasping user preferences and opinions. Today's movies, as well as their associated trailers, come in a variety of forms due to the diversity of cultural environments; therefore, systems that generate movies' trailers should be adaptive to suit the diversity of cultural environments as well as different user perspectives.

Figure 1. Smart-Trailer architectural Model

3. PROPOSED SYSTEM (S-TRAILER)
In this section, the proposed framework Smart-Trailer (S-Trailer) is introduced. As shown in Figure (1), S-Trailer contains two main phases, namely the training phase and the processing-output phase. In the following subsections each phase is described in detail.

3.1 Training Phase
The training phase is essential to acquire a bag of words, or generally a corpus, that outlines the most common words or sentences that characterize each genre. The process is done by collecting English subtitles for the top rated movies in each genre according to the IMDB top rated movies by genre [15]. In the training phase, the top-20 movies for each genre were used to extract the bag of words for each movie genre or category.

A movie subtitle file contains three parts:
1 - A number that indicates the scene index.
2 - A time interval that points out when the subtitle appears and when it disappears.
3 - The script of that scene.

Figure (2) outlines a sample subtitle from the Titanic movie (http://www.imdb.com/title/tt0120338/):

633
00:51:29,650 --> 00:51:31,686
Why can't I be like you, Jack?

634
00:51:31,770 --> 00:51:35,045
Just head out for the horizon
whenever I feel like it.

Figure 2. Sample subtitle sequences

Here (633, 634) indicate the order of the subtitles in the movie sequence, 00:51:29,650 --> 00:51:31,686 shows the time duration that tells when the subtitle appears on the screen and when it disappears, and the text "Why can't I be like you, Jack?" represents the script itself.

Initially, all subtitle files are preprocessed to exclude unneeded text. This includes removing trivial characters and words which are considered insignificant to represent featured text of a movie genre. As reported by experimental observations in [16-19], words that are annotated as nouns or adjectives are considered meaningful and worthy. Therefore the framework extracts all nouns and adjectives that occur individually or exist in a pattern like adjective + noun and prunes everything else. The preprocessing step relies on the part-of-speech tagger in the NLTK library to tag words.

The objective of the training phase is to build a model for each movie genre. This is accomplished by constructing hierarchical n-grams of unique words and/or the co-occurrences of words with other words in the processed documents and their frequencies. The framework relies on the methodologies presented in [18, 20] to build the generation model for each genre.

Each word or keyphrase in the resulting model is given a rank using the keyphrase ranking algorithm presented in [20] with some minor modification:

  rank(i) = f(pi, TFi, TIi)                                 (1)

where pi is the position of entry i, computed as (L - Ls), where L is the total number of lines in the document and Ls is the first sentence where entry i occurs, and TFi and TIi indicate the term frequency and influence weight of entry i, respectively.

The final outcome of the training phase is a set of genre models; each model contains a list of weighted keyphrases. Each model can be viewed as a signature that represents a specific movie genre. Eventually, the generated models are stored in a genre dictionary.
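The three-part cue structure described above maps directly onto the .srt format. The C sketch below is illustrative only (it is not the framework's Python implementation); it parses the time interval of the Figure 2 cue into milliseconds, which is the form the later ranking stage needs in order to cut the corresponding scene.

```c
#include <stdio.h>

/* Parse the time line of one subtitle cue, e.g.
 *   "00:51:29,650 --> 00:51:31,686"
 * and return both endpoints in milliseconds. */
static int parse_interval(const char *line, long *start_ms, long *end_ms)
{
    int h1, m1, s1, ms1, h2, m2, s2, ms2;
    if (sscanf(line, "%d:%d:%d,%d --> %d:%d:%d,%d",
               &h1, &m1, &s1, &ms1, &h2, &m2, &s2, &ms2) != 8)
        return 0;
    *start_ms = ((h1 * 60L + m1) * 60L + s1) * 1000L + ms1;
    *end_ms   = ((h2 * 60L + m2) * 60L + s2) * 1000L + ms2;
    return 1;
}

int main(void)
{
    long start, end;
    const char *cue_time = "00:51:29,650 --> 00:51:31,686"; /* from Figure 2 */
    if (parse_interval(cue_time, &start, &end))
        printf("cue 633: %ld ms -> %ld ms (duration %ld ms)\n",
               start, end, end - start);
    return 0;
}
```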

3.2 Processing-Output Phase
The processing phase is the essential phase of the framework model. In this phase, the user supplies the model with the subtitle of the movie for which a trailer is to be generated. The supplied subtitle undergoes the same process as in the training phase to produce the featured words from the user subtitle.

The featured-word list generated from the user subtitle is compared against the genre lists stored in the movie genre models in order to classify the user subtitle into its related movie genre(s). The classification is done with a Naive Bayes classifier. In general, a movie can be related to several genres in different percentages depending on the type and number of scenes related to each genre. For example, the Titanic movie contains two genres (Drama and Romance). Therefore, the classification results represent percentages of ordered genres that are closely related to the user subtitle (e.g., 70% Action, 30% Drama). The percentages returned by the classifier are used as guidance to the proposed framework in allocating the trailer with types of scenes.

When it comes to ranking the influential scenes inside the movie, it is necessary to first build a graph of movie scenes. The established graph is represented as an adjacency matrix M, where N is the number of sequences in the subtitle file. The values of the rows and columns of the matrix are computed using the following equation:

  M(i, j) = sim(Si, Sj)                                     (2)

where M(i, j) denotes the relationship between item i and item j in the adjacency matrix and sim is the cosine similarity used to measure the similarity value between two sequences in the subtitle file. The value of M(i, j) reflects the degree of similarity between two sequences in the subtitle file [18]. The generated matrix represents the relations, in terms of similarity, between scenes or sequences inside the subtitle file, where vertices represent scenes and edges show how scenes are related to each other.

The proposed framework utilizes the PageRank algorithm proposed by [8] to rank the graph represented in the resulting adjacency matrix. PageRank is used to weight influential sequences, which are the most popular sequences in the subtitle file.

The result of the ranking module is a set of ordered weighted sequences, whose associated time frames are then fetched. The proposed framework selects the top-K sequences to be presented in the trailer scenes. The value of K is determined by the required duration of the trailer.

The time frames associated with the top-K sequences are passed to a video editing library in Python; the library contains methods that fetch the corresponding scenes from the original movie. All fetched scenes are aggregated, merged, and finally passed to the user as the output trailer.
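The scene-ranking step above combines two standard components: a cosine-similarity adjacency matrix over scene term vectors and a PageRank iteration over that matrix. The C sketch below is illustrative only; it is not the authors' Python implementation, the term vectors are toy values, and the damping factor 0.85 is an assumption, but it shows the core computation. The top-K scenes by rank would then be mapped back to their time intervals in the subtitle file.

```c
#include <math.h>
#include <stdio.h>

#define N_SCENES 4
#define N_TERMS  5

/* Cosine similarity between two term-frequency vectors. */
static double cosine(const double *a, const double *b, int n)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < n; i++) { dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i]; }
    return (na > 0.0 && nb > 0.0) ? dot / (sqrt(na) * sqrt(nb)) : 0.0;
}

int main(void)
{
    /* Toy term-frequency vectors for four scenes (values are illustrative). */
    double tf[N_SCENES][N_TERMS] = {
        {2, 0, 1, 0, 0}, {1, 1, 0, 0, 2}, {0, 2, 1, 1, 0}, {2, 0, 1, 0, 1}
    };

    /* Adjacency matrix M(i,j): cosine similarity between scenes i and j. */
    double m[N_SCENES][N_SCENES];
    for (int i = 0; i < N_SCENES; i++)
        for (int j = 0; j < N_SCENES; j++)
            m[i][j] = (i == j) ? 0.0 : cosine(tf[i], tf[j], N_TERMS);

    /* Weighted PageRank: power iteration with damping factor d = 0.85. */
    const double d = 0.85;
    double rank[N_SCENES], next[N_SCENES];
    for (int i = 0; i < N_SCENES; i++) rank[i] = 1.0 / N_SCENES;

    for (int it = 0; it < 50; it++) {
        for (int i = 0; i < N_SCENES; i++) {
            double sum = 0.0;
            for (int j = 0; j < N_SCENES; j++) {
                double out = 0.0;          /* total outgoing weight of scene j */
                for (int k = 0; k < N_SCENES; k++) out += m[j][k];
                if (out > 0.0) sum += m[j][i] / out * rank[j];
            }
            next[i] = (1.0 - d) / N_SCENES + d * sum;
        }
        for (int i = 0; i < N_SCENES; i++) rank[i] = next[i];
    }

    for (int i = 0; i < N_SCENES; i++)
        printf("scene %d: rank %.4f\n", i, rank[i]);
    return 0;
}
```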

4. EXPERIMENTS AND RESULTS
In this section, experimental tests and evaluations are presented in order to prove the validity of the proposed framework. Evaluation of the framework focuses on the evaluation of the movie genre(s) lists and on the accuracy of the generated trailer.

The movie genre models resulting from the training phase are tested against the Kaggle movie dataset, which is available for download from [21]. The Kaggle movie dataset contains about 5000 movies related to different genres. Each movie in the dataset comes with its associated IMDB genre(s). For the purpose of testing the effectiveness of the generated movie genre(s) model, a group of 500 random movies was selected from the Kaggle movie dataset. For each movie in the selected dataset, its English subtitle was downloaded from the Subscene homepage (https://subscene.com/).

Experimental results on the first outcome generated by the proposed framework, which is the movie genres model (Table 1), showed a strong correlation in classification between the movies' genres originally indexed in IMDB and the generated movie genres model.

Table 1: Evaluating S-trailer genre classification accuracy

The order of genre appearance is taken into consideration in the evaluation of classification accuracy. For example, if the original appearance of the genres in a movie is Action, Crime, and Drama, the classifier has to return an identical order of genres, or the classification result is regarded as a misclassification.

As there is no standard golden corpus that could be used to classify movies into their related categories or genre(s), the movie genre model resulting from the framework can be viewed as a seed corpus for movie classification.

The second experimental evaluation concerns the produced trailer. Evaluations of generated trailers are based on the precision, recall, and F-measure metrics. Precision is defined as the ratio of the number of relevant scenes retrieved to the total number of scenes indexed in the original trailer [22]:

  Precision = (relevant scenes retrieved) / (scenes indexed in the original trailer)            (3)

Recall is defined as the ratio of the number of correct or relevant scenes retrieved to the total number of relevant scenes indexed in the produced trailer [22]:

  Recall = (relevant scenes retrieved) / (relevant scenes indexed in the produced trailer)      (4)

The F-measure is the weighted harmonic mean of precision and recall [22]:

  F-measure = 2 x Precision x Recall / (Precision + Recall)                                     (5)

For the purpose of evaluation, 10 random movies were selected for each genre, amounting to a total of 50 movies used to evaluate the performance of the proposed model. Table (2) shows the average performance of S-Trailer in retrieving similar scenes present in the original trailer for the top 10, 30, and 50 scenes, respectively.

Table 2: Evaluating S-trailer Performance according to retrieved scenes
(PR, RC, and FM stand for precision, recall, and F-measure respectively.)

A key observation is that the accuracy of the generated trailer increases in terms of precision and recall as the number of test scenes increases, which indicates the reliability of the proposed framework in fetching valuable scenes. However, it is observed that some movies' trailers contain a lot of silent scenes. It seems that the producers' main focus is to catch the viewer's anxiety or fear, especially in horror movies and some action movies. For example, in the trailer of a movie like American Sniper, it was observed that the majority of scenes used in the official trailer contain no speech (silence), or a silent scene with background dialog that is not related to the scene. In such cases, the performance of the framework-generated trailer becomes very weak.

The experimental results show that the Smart-Trailer framework provides promising results. It achieves a recall accuracy ratio of 43% in the top-50 scenes retrieved by the smart trailer. A major drawback of the proposed framework is that it cannot fetch silent scenes, which are scenes with no speech. It is noted that producers like that type of scene for the purpose of attraction or surprise, especially in horror and romance movies. However, this drawback is going to be overcome in future work.
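For clarity, Equations (3)-(5) reduce to a few lines of arithmetic once the scene counts are known. The snippet below is a hedged illustration only; the counts are made up and are not the values behind Table 2.

```c
#include <stdio.h>

/* Precision, recall, and F-measure from scene counts (Equations 3-5). */
static void prf(int relevant_retrieved, int scenes_in_original_trailer,
                int relevant_in_produced_trailer)
{
    double p = (double)relevant_retrieved / scenes_in_original_trailer;
    double r = (double)relevant_retrieved / relevant_in_produced_trailer;
    double f = (p + r) > 0.0 ? 2.0 * p * r / (p + r) : 0.0;
    printf("PR=%.2f RC=%.2f FM=%.2f\n", p, r, f);
}

int main(void)
{
    prf(20, 45, 50);   /* illustrative counts only, not values from Table 2 */
    return 0;
}
```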

5. CONCLUSION AND FUTURE WORK
In this paper, a framework called Smart-Trailer is proposed to automate the process of trailer generation, relying on natural language processing and machine learning. The Smart-Trailer framework, originally introduced in [23], revealed how it can be used successfully in the field of marketing through the phases described before in order to generate an attractive trailer for the audience. The framework efficiently establishes a golden corpus for each movie category, through which it can classify any movie into its related genre(s). The main contribution of Smart-Trailer is its capability to generate a trailer without any human involvement. Smart-Trailer returns an average of 43% accuracy in recalling scenes that exist in the original trailer.

Future work includes enhancements to the framework by extracting latent information indexed in silent scenes, which could be reflected in an increase of the accuracy rate. The framework will also add a recommendation module that grasps user behaviour and suggests or recommends to the user special scenes that likely match user preferences.

6. REFERENCES
[1] iMovie - Apple. https://www.apple.com/lae/imovie/
[2] Windows Movie Maker. https://www.windowsmovie-maker.org/
[3] Movie Trailer Maker - How to Make a Movie Trailer. Movavi. https://www.movavi.com/support/how-to/how-tomake-a-movie-trailer.html
[4] Create Your Own Movie Trailer With Our Online Video Maker. https://www.makewebvideo.com/en/make/movie-trailervideo
[5] The Independent. "We spoke to the people who make film trailers." 17 Jan. 2017. http://www.independent.co.uk/artsentertainment/films/features/film-trailers-editors-interviewcreate-teasers-tv-spots-a7531076.html
[6] A. Pavel, C. Reed, B. Hartmann, and M. Agrawala. Video digests: a browsable, skimmable format for informational lecture videos. In UIST: User Interface Software and Technology. ACM, New York, NY, USA, October 2014, pp. 573-582.
[7] Z. Xu and Y. Zhang. Automatic generated recommendation for movie trailers. In Broadband Multimedia Systems and Broadcasting (BMSB), 5-7 June 2013, London, UK. IEEE, October 2013. [Online]
[8] Ying Ding, et al. PageRank for Ranking Authors in Co-citation Networks. Journal of the American Society for Information Science and Technology, 60(11):2229–2243, 2009.
[9] K. Bougiatiotis and T. Giannakopoulos. Content representation and similarity of movies based on topic extraction from subtitles. In SETN '16: Proceedings of the 9th Hellenic Conference on Artificial Intelligence, May 18-20, 2016.
[10] R. Ren, H. Misra, and J. Jose. Semantic based adaptive movie summarization. In S. Boll, Q. Tian, L. Zhang, Z. Zhang, and Y.-P. Chen, editors, Advances in Multimedia Modeling, volume 5916 of Lecture Notes in Computer Science, pages 389-399. Springer Berlin Heidelberg, 2010.
[11] J. Nessel and B. Cimpa. The movieoracle - content based movie recommendations. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on, 22-27 Aug. 2011, Lyon, France. IEEE, October 2011.
[12] G. Irie, T. Satou, A. Kojima, T. Yamasaki, and K. Aizawa. Automatic trailer generation. In MM '10: Proceedings of the 18th ACM international conference on Multimedia, Firenze, Italy. ACM, New York, NY, October 2010, pp. 839-842.
[13] H. Zhou, T. Hermans, A. V. Karandikar, and J. M. Rehg. Movie genre classification via scene categorization. In MM '10: Proceedings of the 18th ACM international conference on Multimedia, October 25-29, 2010, Firenze, Italy. ACM, New York, NY, October 2010, pp. 747-750.
[14] A. F. Smeaton, B. Lehane, N. E. O'Connor, C. Brady, and G. Craig. Automatically selecting shots for action movie trailers. In MIR '06: Proceedings of the 8th ACM international workshop on Multimedia Information Retrieval, Santa Barbara, California, USA. ACM, New York, NY, October 2006, pp. 231-238.
[15] www.imdb.com/
[16] Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Collins, M., Steedman, M. (eds.) Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216-223 (2003).
[17] Wan, X., Xiao, J.: Single document keyphrase extraction using neighborhood knowledge. In: Proceedings of the 23rd National Conference on Artificial Intelligence, AAAI 2008, vol. 2, pp. 855-860. AAAI Press (2008).
[18] Eslam Amer. Enhancing Efficiency of Web Search Engines through Ontology Learning from unstructured information sources. In Proceedings of the 16th IEEE International Conference on Information Integration and Reuse (IRI 2015), pp. 542-549, 13-15 August 2015, San Francisco, USA.
[19] Youssif, Aliaa A.A., Atef Z. Ghalwash, and Eslam A. Amer. "HSWS: Enhancing efficiency of web search engine via semantic web." In Proceedings of the International Conference on Management of Emergent Digital EcoSystems, pp. 212-219. ACM, 2011.
[20] Aliaa A.A. Youssif, Atef Z. Ghalwash, and Eslam Amer. KPE: An Automatic Keyphrase Extraction Algorithm. In Proceedings of the IEEE International Conference on Information Systems and Computational Intelligence (ICISCI 2011), pp. 103-107, 2011.
[21] Kaggle. https://www.kaggle.com/
[22] Eslam Amer and Khaled Foad. "AKEA: an Arabic keyphrase extraction algorithm." In International Conference on Advanced Intelligent Systems and Informatics, pp. 137-146. Springer, Cham, 2016.
[23] Mohammed Hesham, Bishoy Hany, Nour Foad, and Eslam Amer. "Smart Trailer: Automatic generation of movie trailer using only subtitles." The First International Workshop on Deep and Representation Learning (IWDRL 2018), pp. 26-30. IEEE, 2018.

Example-Based English to Arabic Machine Translation: Matching Stage Using Internal Medicine Publications

Rana Ehab, Computer Science Department, Modern Academy for Computer Science and Management Technology, Cairo, Egypt ([email protected])
Eslam Amer, Faculty of Computer Science, Misr International University, Cairo, Egypt ([email protected])
Mahmoud Gadallah, Computer Science Department, Modern Academy for Computer Science and Management Technology, Cairo, Egypt ([email protected])

ABSTRACT
Automatic machine translation has become an important source of translation nowadays. It is a software system that translates a text from one natural language into one (or many) natural languages. On the web, there are many machine translation systems that give a reasonable translation, although the systems are not very good. Medical records contain complex information that must be translated correctly according to its medical meaning, not its English meaning only, so the quality of machine translation in this domain is very important. In this paper, we present the use of the matching stage from the Example-Based Machine Translation technique to translate a medical text from English as the source language to Arabic as the target language. We have used 259 medical sentences extracted from internal medicine publications for our system. Experimental results on the BLEU metric showed a decreased performance of 0.486 compared to GOOGLE translation, which has an accuracy result of about 0.536.

CCS Concepts
• Computing methodologies → Machine translation.

Keywords
Automatic machine translation, Natural Language Processing, Example-Based Machine Translation

ICSIE '18, May 2-4, 2018, Cairo, Egypt. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6469-0/18/05…$15.00. DOI: https://doi.org/10.1145/3220267.3220294

1. INTRODUCTION
Smooth communication is an important issue to take into consideration, and in order to address it, we must break the barriers of language. Here comes the mission of a machine translation system, whose role is to translate a text from one language to another with or without human involvement.

Users are generally interested in obtaining a rough idea of a text's topic or what it means [1]. However, some applications require much more than this [1]. For example, the beauty and correctness of the text may not be important in the medical field, but the adequacy and precision of the translated message are very important [1]. Because of this, machine translation has been applied in the medical domain.

Computer technology has been applied to technical translation to improve one or both of the following factors [2]:
• Speed: Translation by or with the aid of machines can be faster than manual translation.
• Cost: Computer aids to translation can reduce the cost per word of a translation.

Although the concept of machine translation (MT) has been around since the 30's and 40's, it gained popularity only in the 60's and 70's, when it was touted as the perfect solution for text translation, capable of rendering translated text of human translation quality [3]. Machine translation systems are developed using four approaches, depending on their difficulty and complexity [4]. These approaches are rule-based, knowledge-based, corpus-based, and hybrid MT. Rule-based machine translation approaches can be classified into the following categories: direct machine translation, interlingua machine translation, and transfer-based machine translation [4].

The construction of a machine translation system does not rely only on the machine translation technique that is used, but also on the dataset used for training and testing the system. The datasets used in some projects were a reason for a low evaluation of the system. For that reason, we must take the quality of the dataset into consideration.

Medical records contain complex information that must be translated correctly according to its medical meaning, not its English meaning only. They typically contain complex information intended for healthcare professionals, not consumers [3]. So machine translation in the medical domain will benefit both physicians and patients by facilitating the communication between them and making medical information more understandable.

Our goal is to create an Example-Based Machine Translation system, using the matching stage only, to translate medical records correctly from English as a source language into Arabic as a target language. As English is a universal language, most researchers in MT concentrate mainly on the translation between English and Arabic, because automatic English-to-Arabic translation is still an active area, and this will help in simplifying Arab communication with other countries [5].
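Since only the matching stage of example-based machine translation is used, a minimal illustration of what matching against a bilingual example base can look like is given below. This is a hedged sketch only: the scoring function, the example sentences, and their Arabic translations are illustrative and are not the system or data described in this paper.

```c
#include <stdio.h>
#include <string.h>

/* A tiny bilingual example base: English source with its Arabic translation. */
struct example { const char *en; const char *ar; };

static const struct example base[] = {
    { "the patient has a fever",        "المريض يعاني من حمى" },
    { "the patient takes the medicine", "المريض يتناول الدواء" },
    { "blood pressure is high",         "ضغط الدم مرتفع" },
};

/* Crude word-overlap score: count tokens of a that occur somewhere in b. */
static int overlap(const char *a, const char *b)
{
    char buf[256];
    int score = 0;
    strncpy(buf, a, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *tok = strtok(buf, " "); tok; tok = strtok(NULL, " "))
        if (strstr(b, tok)) score++;
    return score;
}

int main(void)
{
    const char *input = "the patient has a high fever";
    int best = 0, best_score = -1;
    for (int i = 0; i < (int)(sizeof base / sizeof base[0]); i++) {
        int s = overlap(base[i].en, input);
        if (s > best_score) { best_score = s; best = i; }
    }
    printf("closest example: \"%s\"\ntranslation:     %s\n",
           base[best].en, base[best].ar);
    return 0;
}
```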

As English is a universal language, most researchers in MT concentrate mainly on the translation between English and Arabic, because automatic English-to-Arabic translation is still an active area and this will help in simplifying Arab communication with other countries [5].

Arabic belongs to the Semitic language family [2]. It is the mother tongue of more than 356 million people as a native language, in an area extending from the Arabian Gulf in the East to the Atlantic Ocean in the West. Choosing Arabic as a target translation language is because of the difficult morphology of Arabic sentences. Translating from English to Arabic faces some problems, one of them being that the English language has a structured way to build a sentence, which is different from Arabic, which can be structured under many combinations of SVO, VSO, VOS, and OVS. Its alphabet consists of 28 characters, where the shape of each character depends on its position within a word [2]:

(SVO) Sami takes the book – سامي أخذ الكتاب
(VSO) Takes Sami the book – أخذ سامي الكتاب

There are commercial MT systems: "Al-Mutarjim Al-Arabey", which translates English text into Arabic; "Golden Al-Wafi translator", which also translates English text into Arabic; and the "Sakhr CAT" translator, a computer-aided translation system supporting bidirectional bilingual translation between English and Arabic [5]. These systems are generic machine translation systems, not specific-domain systems.

Translation via websites has become widespread nowadays, but translating a medical text is different from translating any other English text because of the complex information that a medical record contains, so translating via existing translation systems causes some problems in the translation of the medical record.

The organization of the paper is as follows: the next section presents related research on machine translation in the medical domain. The third section contains a description of the issues of the medical domain. The fourth section describes our attempt to use an Example-Based Machine Translation system to translate English medical sentences into Arabic medical sentences. The fifth section presents our experiments. The sixth section presents the results and their accuracy. Finally, we introduce the conclusion of our work.

2. RELATED WORK
Many machine translation systems have been developed in the medical domain, each with its own way to translate the medical text. In this section, we give some examples of existing machine translation systems in the medical domain.

(Eck et al., 2004) [6] proposed a statistical machine translation system using the Unified Medical Language System (UMLS) as their database to translate dialogues between patients and doctors from Spanish to English. They made many experiments to improve their system. In the first experiment, the dialogues were collected from research projects they did and used as training data; the rest of the training data were non-medical data from the C-Star project. The test data were also from dialogues.

In the second experiment, they extracted data from the UMLS database by extracting singular terms from Spanish and English and combining each singular term from Spanish with terms related to it from English. They added these dictionaries to the system without changing the language model. In the third experiment, they added English to the language model data.

During the fourth experiment, they used the semantic information to generalize the training data. They filtered the Spanish-English dictionary they extracted from UMLS to keep only words and phrases from two semantic types: "Body Part, Organ, or Organ Component" and "Body Location or Region" [6].

They added two more semantic types in the last experiments instead of the "Body Part" semantic type, which are "Findings/Sign or Symptom" and "Disease or Syndrome" [6]. They used in that experiment the three semantic types "Body Location or Region", "Findings/Sign or Symptom" and "Disease or Syndrome".

The experiment that achieved the best evaluation was the last one. However, the system has a disadvantage: the extracted dictionaries contain technical terms and no colloquial terms [6] that patients use, and this affects the performance of the system.

(Qing Zeng-Treitler et al., 2009) [3] discussed how the usage of multilingual machine translation could make medical record content more understandable to patients. They used the Babel Fish tool, which is a freely available machine translation tool. They translated 213 records from English into Russian, Spanish, Korean and Chinese. They obtained their medical records from the MedLEE demo site and MT (Medical Transcript) resources.

They first translated one record from English into the four target languages to check the quality of the translated text. Then they used two testing variables, as other studies did [7, 8], for the understandability and correctness of the translated sentence. Because of some errors in the phrases, such as grammar, they added a third variable, the understandability of the original sentence. Then they used five medical researchers, proficient in English and native speakers of one of the four target languages, for the evaluation of the translated text.

The disadvantage of this system is the lack of data used to develop the system; 213 medical sentences are not enough to build an efficient system. Also, the system was good only when translating from English to Spanish and then translating to the other languages. Also, using humans during evaluation costs a lot, although the accuracy of the evaluation is better than automatic evaluation.

(S. Dandapat et al., 2010) [9] used Example-Based machine translation and Translation Memory to translate medical text from English to Bangla. They translated receptionist dialogues, medical and primarily appointment scheduling [9]. First, they collected their English corpus from receptionist dialogues with patients and then translated it manually into Bangla. The corpus contains 380 parallel sentences. The second step is that they built a Translation Memory automatically from the corpus of patient dialogue using Moses. They created two Translation Memories; the first contains phrase pairs that are aligned and the second one contains the word-aligned file.

Finally, they built their Example-Based machine translation system, using the first Translation Memory in the first stage, which is the matching stage, and the second one in the final stage, which is the recombination stage.

They made five different experiments to show the accuracy of their system. In the first experiment, they used the OpenMtrEx system [9], which is an open-source statistical machine translation system. In the second experiment, they used the matching stage output and their Example-Based machine translation system. In the third experiment, they used their system with the first Translation Memory they constructed. In the fourth experiment, they used their system with both the first and the second Translation Memory. In the last experiment, they used statistical machine translation with Example-Based machine translation, as they translated the unmatched portion of the input using statistical machine translation.

As a result, the highest-accuracy systems they constructed were the third and the fourth ones; however, some errors appeared. The first is wrong source-target equivalents in both Translation Memory systems. The second is in the recombination step, where some words are translated separately.

(Costa-jussà et al., 2012) [10] concentrated on checking whether a freely available translation system on the web can be used in the medical domain without extra resources. They used the Google Translate system as a statistical machine translation system. They used a corpus obtained with a tool developed by the Universal Doctor project that contains real medical English questions and answers. They used professional translators to translate the corpus into six languages, which are Basque, French, German, Portuguese, Russian and Spanish, to be used as a reference.

After translating from English as a source language into the six target languages using Google Translate, they evaluated the system using automatic and manual evaluation, and both evaluations showed good performance when translating into French, Portuguese, Spanish and German, but the performance for the other two languages was very low. The problem they explored is that when using statistical machine translation to translate into Russian, unknown words, incorrect word order and word disagreement appear [10]. Also, wrong declination appears when translating into Basque [10].

(Jianri Li et al., 2014) [11] developed a hybrid system by combining a dictionary-based approach and a statistical machine translation (SMT) approach, and then they compared the results to the result of a phrase-based statistical machine translation (PBSMT) system. The translation source and target languages were German to English and vice versa.

For the PBSMT system training phase, they used a parallel corpus that is a mix of EMEA, MuchMore, Wikipedia-titles, Patient-abstract, claim, title and the Unified Medical Language System (UMLS) [11]. They used monolingual corpora for the English-German pair: Wikipedia-articles, Patient descriptions and UMLS descriptions [11]. For the German-English pair, they also used monolingual corpora: Wikipedia-articles, Patient descriptions, UMLS descriptions, AACT, GENIA, GREC, FMA and PIL [11]. They used the parallel corpora Wikipedia-articles and the UMLS dictionary and a monolingual corpus of Wikipedia-articles for the query translation system [11]. They used the Moses toolkit for PBSMT.

The hybrid system has three cases. First, if the query consists of many phrases, then it is first divided into single phrases. The first case happens if a single query matches an example in the dictionary; then the translated text is the target side of the translation dictionary. The second case is if the query is implied in an example from the translation dictionary; then the translated text is the SMT result whose words all appear on the target side of the translation dictionary. The last case happens when a left phrase or right phrase of a query matches an example in the translation dictionary; then the translated text is the SMT result that contains all the words in the target side of the dictionary. The disadvantage of this system is that there is no analysis of the query, where a term may have many different meanings according to its position in the sentence.

(Dušek et al., 2014) [12] described the Khresmoi systems submitted to the WMT 2014 Medical Translation Task. They translate summaries and queries for the languages Czech, German and French. The translation is done from English to these languages and vice versa. The system is based on the Moses phrase-based translation toolkit and standard methods for domain adaptation [12].

(Krzysztof Wołk et al., 2015) [13] aimed to build a statistical machine translation system to translate medical data from English to Polish and vice versa. They used Polish data constructed by the European Medicines Agency (EMEA). During the preparation of the Polish data, they used the Moses toolkit to remove long sentences, with the limit set to 80 tokens. Then they prepared the English data, which is less complicated than the Polish. During the translation experiments, they made 13 experiments by applying a modification to the system in each experiment, using the Moses open-source SMT toolkit with its Experiment Management System.

(Wołk et al., 2015) [1], after building the previous system, proposed an experiment on neural-based machine translation to compare its results to the SMT system they had built. The corpus used was the European Medicines Agency (EMEA) corpus. The system translates Polish medical text into English medical text. A Moses-based SMT system was used, and the Groundhog and Theano tools were used for neural network machine translation. The results show that neural network machine translation needs more work to achieve better results; SMT achieved better results.

(Amer et al., 2016) [14] constructed Wiki-Transpose, which is a query translation system for cross-lingual information retrieval (CLIR). They relied on Wikipedia as a source for translations. Their purpose was to check the coverage ratio of Wikipedia for specialized queries related to the medical domain. The system is used to check how reliable Wikipedia is for getting corresponding translation coverage of Portuguese-to-English and also English-to-Portuguese queries. They made two experiments: in the first one, they used the English Open Access, Collaborative Consumer Health Vocabulary Initiative dataset [14]. In the second one, they used a collection of Portuguese medical terms that were assessed by medical experts as medical terms [14]. They reached a coverage ratio in Wikipedia of about 81% and about 80% for single English and Portuguese terms respectively.

133 2014) [11] who translated queries using hybrid technique between Dictionary based machine translation and Statistical 4. OUR APPROACH machine translation their problem were that there is no analysis 4.1 Data Preparation to the query where there may be any term that has many different In the previous section, we said that the data is small and that meanings according to its position in the sentence. because we extracted from the internal medicine publications the As a result of the previous review of the related works the most indications and side effects in both languages English and Arabic used SMT system in their translation system. The problem with for internal diseases only. After that, we made some processing ( Eck et al., 2004) [6] is that they extracted data from Unified on English data as tokenization, a lower casing, and final Medical Language system that contain technical terms and has no cleaning. colloquial terms that patients use. ( Qing Zeng-Treitlera et al., For Arabic data, we didn’t apply any preprocessing and this 2009) [3] that used Babel Fish tool, their problem was that they because of the morphology of the sentences. Any change in any translated only 213 records and there is not enough to build an word in the Arabic sentence will make a change in the meaning efficient system. (Costa-jussà, et al., 2012) [10] who used of the sentence. In the medical domain, the meaning of a Google translate as a statistical machine translation system, their sentence is very important. problem appear when translating into Russian that there were unknown words, incorrect word order, and word disagreement. 4.2 Translation System EBMT is based on the idea of performing translation by 3. ISSUES WITH MEDICAL DOMAIN imitating translation examples of similar sentences [15]. In this Building an efficient Machine Translation system for a Medical translation system technique, a large bi/multi-lingual translation Domain has two main issues that we are going to explain them examples are stored in a database and input sentences are later. rendered in the target language by restoring from the database that example that is most closely to the input. 3.1. Parallel Corpus Collection In Example-Based machine translation system, the first stage is Our first task is to collect an efficient medical English and to find source language example(s) that match the input sentence Arabic data that are always used by patients to improve the closely. In our approach, we find (Sc) for the input sentence (S) English– Arabic parallel corpus. For this purpose, we consider that will be translated from the example-base. For this purpose, using internal medicine publications. Thus, our corpus contains we used word-based edit distance metric (Levenshtein, 1965; 259 sentences; each sentence contains 8 words on average (which Wagner and Fischer, 1974) to find the closest match sentence is considered as a small corpus). The small size of the corpus from the example- base (Si) and this based on the following because of using medical data for only internal diseases. equation [9].

3.2. Size and Type of Corpora
Example-Based Machine Translation is a data-driven machine translation technique. The first thing that is needed is a machine-readable parallel corpus [9]. To construct a data-driven machine translation system, there is an important question to ask: how many examples are needed? As we mentioned before, our parallel corpus is very small in comparison with the standard data-driven parallel corpora, which may reach millions of parallel sentences. However, many systems have been developed with such a small corpus [9]. Table 1 lists some of the EBMT systems developed using a small amount of parallel data (details can be found in Somers, 2003).

Table 1. Size of Example database in EBMT systems
System     Language Pair         Size
TTL        English -> Turkish    488
TDMT       English -> Japanese   350
EDGAR      German -> English     303
ReVerb     English -> German     214
ReVerb     Irish -> English      120
METLA-1    English -> French     29
METLA      English -> Urdu       7

4. OUR APPROACH
4.1 Data Preparation
In the previous section, we said that the data is small, and that is because we extracted from the internal medicine publications the indications and side effects, in both English and Arabic, for internal diseases only. After that, we applied some processing on the English data, namely tokenization, lowercasing, and final cleaning.

For the Arabic data, we did not apply any preprocessing, because of the morphology of the sentences. Any change in any word in an Arabic sentence will change the meaning of the sentence, and in the medical domain the meaning of a sentence is very important.

4.2 Translation System
EBMT is based on the idea of performing translation by imitating translation examples of similar sentences [15]. In this translation technique, a large set of bi/multi-lingual translation examples is stored in a database, and input sentences are rendered in the target language by retrieving from the database the example that is closest to the input.

In an Example-Based machine translation system, the first stage is to find source language example(s) that closely match the input sentence. In our approach, we find the closest match (Sc) for the input sentence (S) that will be translated from the example base. For this purpose, we used the word-based edit distance metric (Levenshtein, 1965; Wagner and Fischer, 1974) to find the closest matching sentence from the example base (Si), based on the following equation [9]:

Score(S, Si) = 1 - ED(S, Si) / max(|S|, |Si|)      (1)

where |S| and |Si| denote the lengths of the input sentence and the example-base sentence, and ED(S, Si) refers to the word-based edit distance between S and Si.

Based on the above scoring method, we can choose the closest match for the input sentence that will be translated. For the following three input sentence examples in (2), the closest matched sentences from the example base are given in (3).

(2) a- palpitations
    b- fast or irregular heartbeats
    c- extra heartbeats

(3) a- i) blood clots in the veins
       ii) sugar in the urine
       iii) blood clot in the lungs
       iv) blood clot in the legs
       v) a blockage in the bowels
    b- irregular heartbeat
    c- slow heart rate

As seen, for the example 2(a) we get five closest results from the example base. Then we get the associated translations (Sc) in (4) from the example base for the closest matched source sentences in (3).

(4) a- i) جلطات دموية في الأوردة
       ii) نسبة السكر في البول
       iii) الجلطة الدموية في الرئتين
       iv) الجلطة الدموية في الأرجل
       v) انسداد في الأمعاء
    b- عدم انتظام ضربات القلب
    c- بطء سرعة ضربات القلب
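The matching stage described above can be prototyped in a few lines. The sketch below is a minimal illustration under our own assumptions, not the authors' implementation: the example base is a hypothetical list of (English, Arabic) pairs taken from the examples above, and the score follows equation (1) with a word-level Levenshtein distance.

def word_edit_distance(s, si):
    """Word-based Levenshtein distance between two token lists."""
    m, n = len(s), len(si)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == si[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def match_score(s, si):
    """Score(S, Si) = 1 - ED(S, Si) / max(|S|, |Si|), as in equation (1)."""
    return 1.0 - word_edit_distance(s, si) / max(len(s), len(si))

def translate(sentence, example_base):
    """Return the target side(s) of the closest source example(s)."""
    tokens = sentence.lower().split()            # tokenization + lowercasing
    scored = [(match_score(tokens, src.lower().split()), src, tgt)
              for src, tgt in example_base]
    best = max(score for score, _, _ in scored)
    return [(src, tgt) for score, src, tgt in scored if score == best]

# hypothetical example base drawn from the indications above
example_base = [
    ("blood clots in the veins", "جلطات دموية في الأوردة"),
    ("irregular heartbeat", "عدم انتظام ضربات القلب"),
    ("slow heart rate", "بطء سرعة ضربات القلب"),
]
print(translate("fast or irregular heartbeats", example_base))

Run on input 2(b), this toy example base returns the "irregular heartbeat" pair, mirroring the closest match shown in (3).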

4.3 System Steps
Let us describe the translation system, using the matching step from the Example-Based Machine Translation technique, in the following steps:

Step 1: Input the source text in the English language.
Step 2: Apply preprocessing to the input (tokenization, lowercasing, and final cleaning).
Step 3: Compute the word-based edit distance metric between the output of Step 2 and each sentence in the example base.
Step 4: Get the sentences from the example base that have the highest score.
Step 5: Get the translation from the Arabic dictionary for the outputs of Step 4.

5. EXPERIMENTS
We constructed two different experiments. First, we used Google Translate, which is a statistical machine translation system [10]. Second, we used the matching stage from the Example-Based Machine Translation technique: we acquire the closest translation and consider it as the output of our system for the given input.

6. RESULTS
We have used the BLEU score to automatically evaluate our system. The BLEU score captures the fluency of the translation. As shown, our system scores a lower accuracy than Google Translate.

Table 2. Systems accuracies by BLEU metric
System             BLEU
Google Translate   53.56
EBMT               48.86
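For reference, a BLEU score of this kind can be computed with standard tooling. The following is a small sketch, assuming NLTK is available and that system outputs and gold Arabic references are plain whitespace-tokenized strings; it is not the exact evaluation script behind Table 2.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_percent(hypotheses, references):
    """Corpus-level BLEU (as a percentage), one reference per sentence."""
    hyp_tokens = [h.split() for h in hypotheses]
    ref_tokens = [[r.split()] for r in references]   # one reference list per hypothesis
    smooth = SmoothingFunction().method1             # avoids zero scores on short sentences
    return 100.0 * corpus_bleu(ref_tokens, hyp_tokens, smoothing_function=smooth)

# hypothetical toy usage
system_output = ["بطء سرعة ضربات القلب"]
gold_arabic   = ["بطء سرعة ضربات القلب"]
print(bleu_percent(system_output, gold_arabic))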
7. CONCLUSION
We find that our system has lower accuracy compared to Google Translate, and that may be due to the small size of our corpus. We also notice that some medical terms have no equivalent terms with the same Arabic or English meaning, which makes it difficult to find a good closest example from the example base. As our translation system output may contain some inappropriate fragments, we consider removing those fragments using translation memory. Also, as Google Translate is a Statistical Machine Translation system and because it achieved higher accuracy than our system, we consider constructing a hybrid machine translation system combining a Statistical Machine Translation system and an Example-Based Machine Translation system.

8. REFERENCES
[1] Wołk, K. and Marasek, K., 2015. Neural-based machine translation for medical text domain. Based on European Medicines Agency leaflet texts. Procedia Computer Science, 64, pp. 2-9.
[2] Akeel, M. and Mishra, R., 2014. ANN and rule based method for English to Arabic machine translation. Int. Arab J. Inf. Technol., 11(4), pp. 396-405.
[3] Zeng-Treitler, Q., Kim, H., Rosemblat, G. and Keselman, A., 2010. Can multilingual machine translation help make medical record content more comprehensible to patients?. Studies in Health Technology and Informatics, 160(Pt 1), pp. 73-77.
[4] Alawneh, M., Omar, N., Sembok, T., Almuhtaseb, H. and Mellish, C., 2011. Machine Translation from English to Arabic. In International Conference on Biomedical Engineering and Technology.
[5] Agiza, H.N., Hassan, A.E. and Salah, N., 2012. An English-to-Arabic Prototype Machine Translator for Statistical Sentences. Intelligent Information Management, 4(01), p. 13.
[6] Eck, M., Vogel, S. and Waibel, A., 2004, August. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics (p. 792). Association for Computational Linguistics.
[7] Chatzichrisafis, N., Bouillon, P., Rayner, M., Santaholma, M., Starlander, M. and Hockey, B.A., 2006, June. Evaluating task performance for a unidirectional controlled language medical speech translation system. In Proceedings of the Workshop on Medical Speech Translation (pp. 5-12). Association for Computational Linguistics.
[8] Nyberg, E.H., Mitamura, T. and Carbonell, J.G., 1994, August. Evaluation metrics for knowledge-based machine translation. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1 (pp. 95-99). Association for Computational Linguistics.
[9] Dandapat, S., Morrissey, S., Kumar Naskar, S. and Somers, H., 2010. Statistically motivated example-based machine translation using translation memory.
[10] Costa-Jussà, M.R. and FMaS, J., 2012. Machine Translation in Medicine. A quality analysis of statistical machine translation in the medical domain. In Conference on Advanced Research in Scientific Areas (ARSA-2012).
[11] Li, J., Kim, S.J., Na, H. and Lee, J.H., 2014. Postech's System Description for Medical Text Translation Task. In Proceedings of the Ninth Workshop on Statistical Machine Translation (pp. 229-232).
[12] Dušek, O., Hajič, J., Hlaváčová, J., Novák, M., Pecina, P., Rosa, R., Tamchyna, A., Urešová, Z. and Zeman, D., 2014. Machine translation of medical texts in the Khresmoi project. In Proceedings of the Ninth Workshop on Statistical Machine Translation (pp. 221-228).
[13] Wołk, K. and Marasek, K., 2015. Polish-English statistical machine translation of medical texts. In New Research in Multimedia and Internet Systems (pp. 169-179). Springer, Cham.
[14] Amer, E. and Abd-Elfattah, M., 2016. Can Wikipedia Be A Reliable Source For Translation? Testing Wikipedia Cross Lingual Coverage of Medical Domain. IOSR Journal of Computer Engineering (IOSR-JCE), 18(3), pp. 16-22.
[15] Papageorgiou, H., Cranias, L. and Piperidis, S., 1994, June. Automatic alignment in parallel corpora. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (pp. 334-336). Association for Computational Linguistics.

Positive and Negative Feature-Feature Correlation Measure: AddGain

Mostafa A. Salama, Dept. of Computer Science, British University in Egypt, [email protected]
Ghada Hassan, Faculty of Computers and Information, Ain Shams University; British University in Egypt, [email protected]

ABSTRACT
Feature selection techniques search for an optimal subset of the features required by machine learning algorithms. Techniques like statistical models have been applied for measuring the correlation degree of each feature separately. However, the mutual correlation and effect between features is not taken into consideration. The proposed technique measures the constructive and the destructive effect (gain) of adding a feature to a subset of features. This technique studies feature-feature correlation in addition to the feature-class label correlation. The optimality of the resulting subset of features is based on searching for a highly constructive subset of features with respect to the target class label. The proposed feature selection technique is tested by measuring the classification accuracy results of a data set containing subsets of constructively correlated features. A comparative analysis shows that the resulting classification accuracy and the number of selected features of the proposed technique are better than those of other feature selection techniques.

CCS Concepts
• Computing methodologies → Feature selection

Keywords
Feature selection technique; Security; Ranking algorithm; Machine learning.

1. INTRODUCTION
One of the major problems in data mining is the high number of features in data sets; this problem is known as the curse of dimensionality. The features may contain irrelevant or redundant features that may affect the accuracy of the classification methods in machine learning. The selection of the most relevant features for the classification problem is essential and important as a preprocessing step in data mining. The selected features are considered to have a higher power in the discrimination between different target class labels. The goal of a feature selection technique is to find the subset of features that leads to the highest classification accuracy. The usage of the least number of features in high dimensional data sets decreases the complexity of the used classification algorithms, as it avoids wasting resources on measuring redundant features. And the removal of irrelevant features increases the understanding of the reason behind the real-life classification results [1]. The categorization of feature subset selection is either based on the dependency on the machine learning algorithms or on the complexity of the used technique [2]. Feature selection techniques have three types according to the machine learning dependency: wrapper techniques, which are considered a black-box scoring method of a subset of features according to their classification performance; filter techniques, which are machine learning algorithm independent; and embedded techniques, which perform a hybridization between the previous two techniques. Information gain measures the mutual information between a feature and the target class label [4, 5, 6, 12]. And according to the complexity of feature subset selection techniques, the used algorithms are exponential, sequential or randomized [3]. An exponential technique is an exhaustive search for the optimal solution, like a branch and bound algorithm. A sequential technique is an iterative search in a sorted list of ranked features. A randomized technique is the usage of randomness in the selection of different trials of feature and instance subsets, like genetic algorithms.

A new dimension in feature selection is introduced in this research based on the correlation between features. The measurement of the correlation between features depends on grouping a set of highly ranked features; for example, if two features are highly correlated to the target class label, these features are highly correlated to each other. This research measures the correlation between features based on their mutual effect on each other, with respect to the target class label. If two features are highly ranked based on any statistical technique, these two features could have a positive, negative or neutral effect on each other. For example, if the features are ranked based on a filter or a wrapper method, and the highest two features are selected from a single data set, the discrimination of these two features together could decrease, resulting in low classification accuracy, if they are negatively correlated. In this case, the discriminatory power of a subset of features can be measured by the evaluation of the correlation between each pair of features in this subset. A new measure, named the addGain value, is introduced in this work, which evaluates the correlation between a pair of features. The addGain value is calculated by measuring the classification accuracy percentage values for two single-featured data sets and for one double-featured data set. The average of all addGain values for a subset of features is considered as the evaluation of this subset. In the practical application of the average value on different data sets, the classification accuracy of the measured subsets and this average value are noticed to be directly proportional.

For some kinds of data sets, the correlation between features and the target class labels could be higher than the correlation between features. In this case, the value that represents this class label correlation is added to the addGain value formula. This measure could also be used in the visual representation of correlation among features. This is helpful in the clarification and understanding of the effect between different features in real-life problems.

The rest of this paper is organized as follows: Section 2 provides general information about the current feature evaluation techniques, and Section 3 presents the proposed addGain feature subset evaluation technique. Section 4 shows the experimental work, and finally the conclusion is discussed in Section 5.

2. LITERATURE REVIEW
Feature selection techniques use filter components like information gain for ranking features. The forward feature selection framework is implemented in an iterative procedure whereby, in each iteration, the most important feature in D is identified among a set of remaining features based on some filter component. Feature ranking methods detect the relevance of the features to the target class labels and score each feature accordingly. Information gain measures the mutual information between a feature and the target class label [4]. It evaluates the feature according to the number of class-specific values or ranges of values. The mutual information I(X; Y) measures how much the uncertainty of X is reduced if Y has been observed [5]. MI is based on entropy, which is often considered a measure of uncertainty as it measures the marginal probability distribution. MI is considered as a linear correlation coefficient between a single feature and the target class label. Chi-square and Chi-merge methods are other filter components that measure the dependency of an attribute on the target class labels according to the variance of the values of this attribute [6]. F-score is based on calculating the average values of the i-th feature for the whole data set and for each of the two classes in the data set. The increase of the F-score value for a feature indicates the increase of its discriminating power and of its rank accordingly [7]. Relief is an iterative weighting algorithm [8]; it updates the feature weights in each iteration based on a randomly selected instance. The algorithm changes the weight of each feature according to its Euclidean distance from the nearest instances existing in the different target classes.
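As a concrete illustration of the filter scores discussed above, the information gain (mutual information) of a discrete feature with respect to the class label can be estimated from simple counts. This is a generic sketch written for this survey, not code from the cited works; scikit-learn's mutual_info_classif provides a related off-the-shelf estimator.

import math
from collections import Counter

def entropy(labels):
    """H(Y) from a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X."""
    n = len(labels)
    h_y = entropy(labels)
    h_y_given_x = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        h_y_given_x += (len(subset) / n) * entropy(subset)
    return h_y - h_y_given_x

# toy example: a feature that perfectly separates the two classes scores 1 bit
print(information_gain(["a", "a", "b", "b"], [0, 0, 1, 1]))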

A disadvantage of these ranking techniques is the ignorance of the correlation or the mutual information among the features. Feature ranking algorithms that take the correlation between features into consideration appear to have better results. The Joint Mutual Information in a data set is calculated by summing the pairwise Mutual Information between features in this data set. These approaches are pure filter methods that do not take the learning machine into consideration. The joint mutual information (JMI) between a set of features (X1, X2, X3, ..., XN) and the target label searches for the least redundant and most relevant set of features. The definition of the JMI mentions that adding a feature to the pre-calculated features will never decrease the JMI value. The exhaustive JMI search calculates the JMI estimates for each one of the possible feature subsets. To avoid the high complexity of the exhaustive search, a kind of forward feature selection (FFS) method is applied. The FFS method adds, in a stepwise mode, the feature with the highest MI value to the selected feature subset.
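A stepwise forward selection of the kind just described can be sketched as follows. The relevance function is a placeholder (for example, the information_gain above or a JMI estimate that also depends on the already selected subset), and the stopping criterion of a fixed number of features is our assumption.

def forward_feature_selection(features, relevance, k):
    """Greedily add, at each step, the remaining feature with the highest
    relevance score given the already selected subset."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: relevance(f, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# hypothetical relevance: ignore the current subset and use a precomputed score
scores = {"f1": 0.9, "f2": 0.4, "f3": 0.7}
print(forward_feature_selection(scores, lambda f, _: scores[f], k=2))  # ['f1', 'f3']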
Other multivariate feature ranking algorithms are based on the random forest learning method. Random forest feature selection is based on building different decision trees constructed from a random set of instances and a random subset of features. Randomness in this algorithm ensures the variety of the resulting decisions; then the best split is applied according to the best result [9]. The problem here is that the best split is calculated only within this random subset of features, while a better split could be calculated according to another subset of features. The rank of each feature is calculated by permuting the values of this feature in the testing samples in each tree in the forest and then comparing the accuracy predicted before and after permuting in all trees. As the average difference between the two predicted values over all trees increases, the importance of this feature increases. This algorithm deals with the importance of a feature in the presence of groups of highly correlated features. The randomness and the averaging method could result in an accurate measure of the correlation among a specific set of features [10].

Other techniques use sparsity regularization [11], like l1-SVM, which uses l1-norm regularization or a combination of the l1-norm and the l2,1-norm. The existence of an upper bound on the sample size and the assumption that the target class labels are binary are considered limitations of this approach.

3. FEATURE-FEATURE CORRELATION MEASURE: ADDGAIN
3.1 AddGain Model Steps
In this work, a new definition of the correlation between features is introduced and proved theoretically and practically. There are three types of relation among each pair of features: positive relation, negative relation and no relation. In other words, some features can be complementary to each other, clashing with each other, or neutral to each other. The classification accuracy of a data set that contains a feature may increase, decrease or not change when adding another feature. Figure 1 presents the types of correlation between two features x and y graphically as a set of patterns or waves.

Figure 1 – Complementary, Redundant and clashing features

The proposed model searches for a subset of features that do not contain redundant features and do not have clashes in between. The existence of clashing or negative feature pairs in a data set may have a greater effect in deteriorating the resulting classification accuracy. The detection of the type of correlation is applied through the calculation of the classification accuracy of each single feature and each pair of features in separate data sets, followed by the detection of whether adding two single-featured data sets into one data set will enhance the accuracy, decrease it, or leave it unchanged.

In order to find the optimal solution of the most related and positively correlated subset of features, it is enough to find the subset of features that follows formula (1). The complexity of this method will be O(n).

{ λxy > λx } ∧ { λxy > λy }      (1)

In some cases, one of the two conditions in formula (1) is not satisfied, while the other condition is satisfied and cancels the effect of the first condition, as shown in formula (2).

{ λxy - λx } >> { λxy - λy }      (2)

In this case, it is considered that the coupling between the two features x and y with respect to one feature (i.e., x) is much greater than with respect to the other feature (i.e., y). Formula (1) can be updated to formula (3), which includes the cases when the complementarity of feature y to feature x is much greater than the clashing of feature x on feature y.

λxy > εxy      (3)

where εxy is the average of the classification accuracies of the two features x and y:

εxy = (λx + λy) / 2      (4)

The relation between the two features x and y can be measured according to formula (1). The measure is named addGain (AG), as it shows the gain of adding two features together in the same data set. The AG of the two features x and y can be measured as follows:

AGxy = λxy - εxy      (5)

The addGain value AGxy can be adjusted to give privilege to a feature pair whose classification accuracy is higher than that of another pair. For example, the gain of adding two features {x, y} could be higher than that of two other features {a, b}, but the accuracy of each of the {a, b} features is higher than the accuracy of each of the {x, y} features. If this difference is not taken into consideration, it would negatively affect the correctness of the ranking. According to these accuracy differences with respect to other features, the AGxy value can be updated as follows:

(6)

where maxλ and minλ are the maximum and minimum of the classification accuracy values of the single-featured data sets of each feature, respectively.
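A minimal sketch of the unadjusted addGain of formulas (4) and (5) follows: the λ values are taken as cross-validated accuracies of a classifier trained on the single-feature and pair-feature projections of the data. Scikit-learn is assumed to be available, X is assumed to be a NumPy array of shape (n_samples, n_features) with labels y, and GaussianNB is used here only as a stand-in for the naïve Bayesian tree classifier used in the experiments reported below.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def accuracy(X, y, columns):
    """λ for the data set restricted to the given feature columns
    (10-fold cross-validated accuracy, in percent)."""
    clf = GaussianNB()
    scores = cross_val_score(clf, X[:, list(columns)], y, cv=10, scoring="accuracy")
    return 100.0 * scores.mean()

def add_gain(X, y, i, j):
    """AGxy = λxy - (λx + λy) / 2, following formulas (4) and (5)."""
    lam_i  = accuracy(X, y, [i])
    lam_j  = accuracy(X, y, [j])
    lam_ij = accuracy(X, y, [i, j])
    return lam_ij - (lam_i + lam_j) / 2.0

# hypothetical usage on a loaded data set (feature indices are 0-based here):
# print(add_gain(X, y, 6, 7))

A positive return value corresponds to a complementary pair, a negative one to a clashing pair, and a value near zero to a neutral pair, in the sense of Figure 1.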

The feature selection technique is the search for the optimal subset of s features from a set of n features, where s is less than n. The optimality is measured by the classification accuracy obtained with any machine learning algorithm.

4. EXPERIMENTAL WORK
4.1 Mutagenicity Data Set
The Mutagenicity data set contains 23 extracted features and 260 instances divided equally into two categories to ensure the fairness of the test. In this test, 17,328 combinations of 5 features out of the 23 features are tested in both ways, by applying the classification accuracy test and the MaxCGS evaluation. The classification accuracy test is the accuracy percentage of the 10-fold training and testing of the input data set based on the naïve Bayesian tree method. The 17,328 combinations and the two resulting test values are sorted by the classification accuracy values. Then the two test values are plotted against the same sorted list of all combinations in Figure 2. The figure shows that the sum of all addGain values of the feature-feature connections in the subset is proportional to the classification accuracy percentage, and this is considered as the proof of the definition in formula (4).

Figure 2: Accuracy and PGS vs combinations
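The combination test described above can be sketched as follows: every candidate 5-feature subset is scored both by its cross-validated accuracy and by the sum of the pairwise addGain values of its feature-feature connections, which is the quantity the text relates to CGS. This is an illustration under those assumptions, not the authors' code; it reuses the accuracy and add_gain helpers from the previous sketch, and in practice the pairwise gains would be precomputed and cached.

from itertools import combinations

def evaluate_combinations(X, y, n_features=23, subset_size=5):
    """Score every subset two ways: 10-fold accuracy and summed pairwise addGain."""
    # accuracy() and add_gain() as defined in the previous sketch
    results = []
    for subset in combinations(range(n_features), subset_size):
        acc = accuracy(X, y, subset)
        gain_sum = sum(add_gain(X, y, i, j) for i, j in combinations(subset, 2))
        results.append((subset, acc, gain_sum))
    return sorted(results, key=lambda r: r[1])        # sorted by accuracy, as in Figure 2

# the subset the method would select maximises the summed gain:
# best = max(evaluate_combinations(X, y), key=lambda r: r[2])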

The maximum classification accuracy achieved is 67.69%, by the set of features {6, 7, 12, 17}. This set of features is the resulting selection of the most highly positively correlated features and the features most relevant to the classification problem. When the classical chimerge ranking method followed by the forward feature selection technique is applied, the selected subset of features is {15, 17, 6, 13}, with a classification accuracy equal to 66.53%. This shows that the proposed method has a higher classification accuracy percentage than classical methods. In some cases the adjustment shown in (6) gives higher results; in that case the CGS and MaxCGS values are adjusted as follows:

(7)

The maximum correlation gain value MaxCGS is calculated as follows:

(8)

The classifier is applied on the pair-attributed data sets of the combinations of pairs of features selected by both techniques, chimerge and AddGain, as shown in Table 4. The Euclidean calculation is applied by measuring the average values of features 1 and 2 of the instances in class a relative to the average values in class b. Every pair of features in the subset selected by the proposed AddGain method shows a high classification accuracy and a very small Euclidean distance, relative to the subset selected by the chimerge method.

The high Euclidean distance values between the centroids of the two classes in the chimerge-selected features are due to the way of selecting attributes that show a high discrimination between objects in the two classes. Chimerge makes use of features 1 and 2 and ignores feature 7; however, the classification accuracy of features 1 and 2 as single-attributed data sets is lower than that of feature 7. The reason for such behavior is that features 1 and 2, together with features 3 and 23, show a higher distance between objects of the two classes than feature 7.

Table 4 – Applying the Classifier on the Pair Attributed Data Sets
Feature 1   Feature 2   Classification Accuracy (%)   Euclidean distance
1           2           48.63                         7.98
1           3           55.13                         5.77
1           23          50.34                         5.77
2           3           55.13                         5.52
2           23          50.34                         5.51
3           7           56.16                         0.49
3           23          55.13                         0.24
7           23          54.79                         0.43

The data distribution of the instances of each of these pairs is captured; the pairs considered here are {1, 3}, {1, 23}, {7, 3}, {7, 23}, as shown in Figure 3, since feature 1 and feature 7 are uniquely selected with respect to chimerge and AddGain respectively. The figure shows that, with respect to feature 7, the instances concentrated behind the vertical line are nearly 22/40 or 55% of the total distribution of the data, which means that the outliers take up only 45% of the total distribution of the data. On the other hand, with respect to feature 1, the instances concentrated behind the vertical line are nearly 400/1000 or 40%; the outliers take up 60% of the total distribution of the data. The vast space occupied by outliers in feature 1 leads to misleading selection by statistical techniques like the chimerge method.

Figure 3: Distribution of features 1 and 7, with respect to the common features 3 and 23 in the two subsets of selected features.

4.2 AddGain Result Analysis
Figure 4 shows the difference between the average feature values corresponding to class 1 and the values of class 2. The addGain value of a pair of features appears to decrease if both features have high difference values. If one feature has a high difference (positive or negative) and the other feature has a low difference, the addGain value appears to increase. For example, the addGain of the feature pair {2, 3} is -0.6 and of the feature pair {3, 4} is 0.68, where the difference values of features {2, 3, 4} are {-0.32, -0.37, -0.4} respectively, while the addGain of the feature pair {2, 7} is 11.4 and of the feature pair {3, 8} is 14.1, where the difference values of features {7, 8} are {-0.01, 0.05} respectively. The average difference values of the set of features {2, 3, 4} are much less than those of {7, 8}; the coupling of feature pairs from each set leads to a high addGain value. The reason for this is that the weak pattern in features {7, 8} has a weak negative effect on the pattern existing in features {2, 3, 4}.

Figure 4: The addGain analysis vs the average values of the features.
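The pair-level measurements behind Table 4 and the class-difference analysis of Figure 4 reduce to simple per-class averages. The sketch below shows one way to compute them under our own assumptions (NumPy arrays, binary 0/1 labels, 0-based feature indices); it is not the exact procedure used by the authors.

import numpy as np

def centroid_distance(X, y, i, j):
    """Euclidean distance between the class centroids of the feature pair (i, j)."""
    a = X[y == 0][:, [i, j]].mean(axis=0)
    b = X[y == 1][:, [i, j]].mean(axis=0)
    return float(np.linalg.norm(a - b))

def class_mean_difference(X, y, i):
    """Difference between the average values of feature i in the two classes."""
    return float(X[y == 0][:, i].mean() - X[y == 1][:, i].mean())

# hypothetical usage:
# print(centroid_distance(X, y, 3, 23), class_mean_difference(X, y, 7))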
5. CONCLUSION
The proposed technique in this work presents a new direction for the evaluation of a subset of features altogether. The evaluation of each feature in a separate manner leads to the loss of an important factor in the classification method, which is the correlation between features. A new measure is proposed here, the addGain value, which measures the interaction between features existing in the same data set. According to the results, the classification accuracy percentage and the addGain values are directly proportional, and this appears in the enhancement of the results of the selected features over the classical sequential feature selection techniques. On the other hand, these results are helpful in the analysis of the selected features and in the detection of the value of the correlation between them relative to the target class label.

6. REFERENCES
[1] M. Grimaldi, P. Cunningham, and A. Kokaram, 2003. An evaluation of alternative feature selection strategies and ensemble techniques for classifying music. In Proc. of Workshop on Multimedia Discovery and Mining, Dubrovnik.
[2] Luka Cehovin and Zoran Bosni, 2010. Empirical evaluation of feature selection methods in classification, Intelligent Data Analysis journal, vol. 14, pp. 265-281.
[3] Lauren Burrell, Otis Smart, George J. Georgoulas, Eric Marsh, George Vachtsevanos, 2007. Evaluation of Feature Selection Techniques for Analysis of Functional MRI and EEG. In Proc. of the International Conference on Data Mining, DMIN 2007, June 25-28, Las Vegas, Nevada, USA.
[4] Howard Hua and John Moody, 2006. Feature selection based on Joint Mutual Information, Independent Component Analysis and Blind Signal Separation, Lecture Notes in Computer Science, vol. 3889, pp. 823-830.

[5] Georgia D. Tourassi, Erik D. Frederick, Mia K. Markey, Carey E. Floyd, 2001. Application of the mutual information criterion for feature selection in computer-aided diagnosis, Medical Physics, vol. 28 (12), pp. 2394-2402.
[6] Bidgoli, Amir-Massoud, Naseri Parsa, Mehdi, 2012. A Hybrid Feature Selection by Resampling, Chi squared and Consistency Evaluation Techniques, World Academy of Science, Engineering & Technology, vol. 68, pp. 276.
[7] Yi-Wei Chen and Chih-Jen Lin, 2006. Combining SVMs with Various Feature Selection Strategies, Feature Extraction Studies in Fuzziness and Soft Computing, vol. 207, pp. 315-324.
[8] Yi-Wei Chen and Chih-Jen Lin, 2006. Combining SVMs with Various Feature Selection Strategies, Feature Extraction Studies in Fuzziness and Soft Computing, vol. 207, pp. 315-324.
[9] A. Hapfelmeier, K. Ulm, 2013. A new variable selection approach using Random Forests. Journal of Computational Statistics & Data Analysis archive, vol. 60, pp. 50-69.
[10] Long Han, Mark J. Embrechts, Boleslaw Szymanski, 2006. Random Forests Feature Selection with Kernel Partial Least Squares: Detecting Ischemia from MagnetoCardiograms. In Proc. of the European Symposium on Artificial Neural Networks, Bruges, Belgium, pp. 221-226.
[11] Nie, F.P., Huang, H., Ding, C., 2010. Efficient and Robust Feature Selection via Joint l2,1-Norms Minimization. In: 22nd Ann. Conf. Neural Information Processing Systems, pp. 1813-1821. MIT Press.
[12] B. Chandra, Manish Gupta, 2011. An efficient statistical feature selection approach for classification of gene expression data, Journal of Biomedical Informatics, vol. 44 (4), pp. 529-535.
[13] Cheminformatics database: http://cheminformatics.org/datasets/
[14] ChemAxon Software: http://www.chemaxon.com/
[15] UCI database: http://archive.ics.uci.edu/ml/


Author Index

A
Ahmad M. Zaki 121
Ahmed El-Baz 97
Alaa Farhat 27
Ali Meligy 27
Amr S. Mady 76
Andrew Sadek 1
Ann Nosseir 48, 56, 62
Anupam Sharma 22
Ashraf AbdelRaouf 93
Ayman M. Bahaa-Eldin 121
Ayman Nabil 126
Azri Azmi 12
D
Daoud M. Daoud 111
DiaaEldin M. Osman 121
E
Eslam Abou Gamie 116
Eslam Amer 126, 131
Essam A. Rashed 68, 72, 76
F
Fatma E.Z. Abou-Chadi 81
G
Gamal Selim 106
Ghada Hassan 32, 136
H
Hanan M. Amer 81
Haneen A. Elyamani 72
Hani Amin 97, 102
I
Ihab Adly 97, 102
J
Jalal Shah 7
L
Louis Sanzogni 38
Luke Houghton 38
M
M. B. Abdelhalim 93
Mahmoud Gadallah 131
Marlina Abdul Latib 12
Marwa I. Obayya 81
Mohamed A. Sobh 121
Mohamed Elmahdy 1
Mohamed Fadel 97
Mohamed Meselhy Eltoukhy 68
Mohammad al-Shatouri 68
Mohanad Odema 102
Mohd Tazim Ishraque 22
Mostafa A. Salama 116, 136
N
Nada Radwan 93
Nadine Farag 32
Nayeth I. Solorzano Alcivar 38
Nazri Kama 7
O
Omar Adel 56
Oraib H. Al-sahlee 111
Othman Mohd Yusop 12
P
Pritheega Magalingam 12
R
Ramy Roshdy 48
Rana Ehab 131
Reham Rabie 68
S
Saeed Samet 22
Saiful Adli Ismail 7, 12
Samir A. El-Seoud 72, 76, 111, 116
Samy Ghoneimy 106
Sara Adel El-Shorbagy 44
Seif Eldin Ashraf Ahmed 62
Sherif S. Kishk 81
Shourok AbdelRahim 106
Suhaib R. Khater 111
T
Tarek Eldeeb 1
W
Wael Mohamed El-Gammal 44
Walid Dabour 27
Walid Hussein 116
Walid. M. Abdelmoez 44
X
Xiaobin Song 87
Y
Yunchao Wang 87
Z
Zehui Wu 87