
JSEIS EDITORIAL BOARD

Founding Editor in Chief

Muhammad Imran Babar Army Public College of Management & Sciences (APCOMS) University of Engineering & Technology Taxila, Pakistan [email protected]

Co-Editor in Chief

Masitah Ghazali, Universiti Teknologi Malaysia, Malaysia. [email protected]
Dayang N.A. Jawawi, Universiti Teknologi Malaysia, Malaysia. [email protected]

Advisory Editorial Board

Vahid Khatibi Bardsiri, Bardsir Branch, Islamic Azad University, Iran. [email protected]
Zeljko Stojanov, University of Novi Sad, Serbia. [email protected]

Basit Shahzad, National University of Modern Languages, Pakistan. [email protected]
Muhammad Siraj, King Saud University, Saudi Arabia. [email protected]

Vladimir Brtka University of Novi Sad, Serbia. [email protected]

Editors

Rafa E. Al-Qutaish, Ecole de Technologie Superieure, Canada. [email protected]
Ahmed Hamza Usman, King Abdulaziz University, Saudi Arabia. [email protected]

Alessandra Pieroni, Marconi International University, Florida, USA. [email protected]
Arta M. Sundjaja, Binus University, Indonesia. [email protected]

Nur Eiliyah Wong, Universiti Teknologi Malaysia, Malaysia. [email protected]
Noemi Scarpato, Università Telematica, Rome, Italy. [email protected]

Manu Mitra, Alumnus of University of Bridgeport, USA. [email protected]
Hikmat Ullah Khan, COMSATS, WAH, Pakistan. [email protected]

Venkata Druga Kiran Kasula, K L University, Vaddeswaram, India. druga [email protected]
Kirti Seth, INHA University, Tashkent, Uzbekistan. [email protected]

Mustafa Bin Man, Universiti Malaysia Terengganu, Malaysia. [email protected]
Anitha S. Pillai, Hindustan University, India. [email protected]

Gule Saman, Shaheed Benazir Bhutto Women University, Pakistan. [email protected]
Farrukh Zeeshan, COMSATS, Lahore, Pakistan. [email protected]

Mohammed Elmogy, Mansoura University, Egypt. [email protected]
Abid Mehmood, King Faisal University, Saudi Arabia. [email protected]

Nadir Omer Fadi Elssied Hamed, University of Khartoum, Sudan. [email protected]
Vladimir Brtka, University of Novi Sad, Serbia. [email protected]

Ashraf Alzubier Mohammad Ali, International University of Africa, Khartoum, Sudan. [email protected]
Abubakar Elsafi, International University of Africa, Sudan. [email protected]
Sim Hiew Moi, Southern University College, Johor Bahru, Malaysia. [email protected]
Vikas S. Chomal, The Mandvi Education Society, Institute of Computer Studies, India. [email protected]

Mohd Adham Isa, Universiti Teknologi Malaysia, Malaysia. [email protected]
Ashraf Osman, Alzaiem Alazhari University, Sudan. [email protected]

Awad Ali Abder Rehman, University of Kassala, Sudan. [email protected]
Shahid Kamal, Gomal University, Pakistan. [email protected]

Shafaatunnur Hasan, Universiti Teknologi Malaysia, Malaysia. [email protected]
Philip Achimugu, Lead City University Ibadan, Nigeria. [email protected]

Arafat Abdulgader, University of Bisha, Saudi Arabia. [email protected]
Golnoosh Abaei, Shahab Danesh University, Iran. [email protected]

Hemalatha K.L., Dept of ISE, SKIT, Bangalore, India. [email protected]
Raad Ahmed Hadi, Iraqia University, Baghdad, Iraq. [email protected]

Mohammed Abdul Wajeed, Keshav Memorial Institute of Technology, Hyderabad, India. [email protected]
Mohd. Muntjir, Taif University, Saudi Arabia. [email protected]

Adila Firdaus, Limkokwing University, Cyberjaya, Malaysia. [email protected]
Razieh Haghighati, Universiti Teknologi Malaysia. [email protected]

Wasef Al-matarneh, Petra University, Amman, Jordan. [email protected]
Anwar Yahya Ebrahim, Babylon University, Iraq. [email protected]

Neelam Gohar Shaheed Benazir Bhutto Women University, Peshawar, Pakistan. [email protected]

Managing Editor/Linguist

Summaya Amra Army Public College of Management & Sciences, Rawalpindi, Pakistan. [email protected]

Regional Steering Committee

Muhammad Imran Babar, APCOMS, Rawalpindi, Pakistan. [email protected]
Kashif Naseer Qureshi, Bahria University Islamabad, Pakistan. [email protected]

Hikmat Ullah Khan, COMSATS, WAH, Pakistan. [email protected]
Sheikh Muhammad Jahanzeb, APCOMS, Rawalpindi, Pakistan. [email protected]

Khalid Mehmood Awan, COMSATS, Attock, Pakistan. [email protected]
Muhammad Zahid Abbas, COMSATS, Vehari, Pakistan. [email protected]

JOURNAL OF SOFTWARE ENGINEERING & INTELLIGENT SYSTEMS ISSN 2518-8739 30th April 2018, Volume 3, Issue 1, JSEIS, CAOMEI Copyright © 2016-2018 www.jseis.org

Contour extraction for medical images using bit-plane and gray level decomposition
1Ali Abdrahman M Ukasha, 2Ahmed B. Abdurrhman, 3Alwaleed Alzaroog Alshareef
1,2,3Department of Electrical and Electronics Engineering, University Sebha, Libya
Email: [email protected]

ABSTRACT
In this paper we implement contour extraction and compression for digital medical images (X-ray and CT scan) using the most significant bit (MSB), maximum gray level (MGL), discrete cosine transform (DCT), and discrete wavelet transform (DWT). The transforms are combined with different contour extraction methods, namely Sobel, Canny, and SSPCE (single step parallel contour extraction). To remove noise from the medical image, a pre-processing stage (filtering by a median filter and enhancement by linear contrast stretching) is performed. The extracted contour is compressed using the well-known Ramer method. Experimental results and analysis show that the proposed algorithm is trustworthy. Signal-to-noise ratio (SNR), mean square error (MSE), and compression ratio (CR) values obtained from the MSB, MGL, DCT, and DWT methods are compared. Experimental results show that the contours of the original medical image can be extracted easily with few contour points, at compression ratios exceeding 87% in some cases. The simplicity of the method with an accepted level of reconstruction is the main advantage of the proposed algorithm. The results indicate that this method improves the contrast of medical images and can support better diagnosis after contour extraction. The proposed method is also suitable for real-time applications.

Keywords: bit-plane and gray level decomposition; contour edge extraction and compression; image compression; DCT; DWT;

1. INTRODUCTION
In recent years, a huge amount of digital information has been circulating all over the world by means of the World-Wide Web. Most of this data is exposed and can easily be forged or corrupted, so the need for intellectual property rights protection arises. The use of multimedia technology and computer networking is worldwide. Image resolution enhancement is the process of manipulating an image so that the resulting image is of good quality, and image enhancement can be done in various domains. The conventional method using bit-plane decomposition [1] gives an image that is better in terms of visual quality and PSNR. For better resolution, a new method that uses gray-level decomposition is employed, and its results are compared with the existing methods. Medical image contour extraction based on the most significant bit / maximum gray level has been proposed as one possible way to deal with this problem and to keep information safe. Feature extraction in medical imaging (i.e. magnetic resonance imaging (MRI), computed tomography (CT), and X-ray) is very important in order to perform diagnostic image analysis [2]. Edge detection reduces the amount of data and filters out useless information while protecting the important structural properties of an image [3]. Contour extraction has become a very popular approach for data reduction. Several contour compression techniques have been developed and a large number of methods have been proposed, but the best-known method to compress contours is Ramer [4], which has high quality compared with others such as the Centroid [5], Triangle [6, 7], and Trapezoid [8, 9] methods. The contour can be extracted from a binary image using the single step parallel contour extraction (SSPCE) method [10, 11], or simply using the Sobel and Canny edge detectors [12-14].

2. THE ANALYSED ALGORITHM
Figure 1 shows the sequence of steps to be followed before contour compression of the CT / X-ray images. When CT / X-ray images are viewed on a computer screen they look black and white, but they actually contain some primary colour (RGB) content. So, for further processing, each image must be converted to a proper grayscale image in which the red, green, and blue components all have equal intensity in RGB space. The pre-processing step is required for the gray-level decomposition so that edges can be computed efficiently and accurately from the medical images. This step is carried out to improve the quality of the image and make it ready for further processing; the improved and enhanced image helps in detecting edges and improves the quality of the overall result. The edge detector step is used for contour extraction. Finally, the extracted contours can be compressed using the well-known Ramer method with different threshold values.
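As a rough illustration of this processing order (a sketch, not code from the paper), the following Python snippet converts an acquired CT / X-ray image to grayscale before the later stages; the file name and the use of OpenCV and NumPy are assumptions.

import cv2
import numpy as np

# Load an acquired CT / X-ray image (the file name is only an example).
rgb = cv2.imread("ct_slice.png", cv2.IMREAD_COLOR)

# Screen renderings of CT / X-ray data still carry three (R, G, B) channels,
# so collapse them to one grayscale channel before any further processing.
gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)

# The remaining stages follow the block diagram of Figure 1:
# pre-processing -> bit-plane / gray-level / DCT / DWT decomposition ->
# edge detection -> morphological operations -> Ramer contour compression.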

3. PRE-PROCESSING STAGE
This stage is necessary when gray-level decomposition is used.


Medical images are usually captured with some undesired components, and the median filter can remove them. In this work, medical images (CT scan or X-ray) captured under foggy conditions are highly degraded, suffering from poor contrast and loss of colour characteristics [15]. A contrast enhancement algorithm is therefore applied to such degraded medical images to obtain contour extraction of high quality and, later, good compression. Besides being simple, experimental results show that the proposed method is very effective for the contrast and colour of the image after resizing to 256 x 256 pixels at 8 bit/pixel (bpp) precision. Each pixel has a gray value between 0 and 255; for example, a dark pixel may have a value of 10 and a bright pixel a value of 230.
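A minimal Python sketch of this pre-processing stage is given below; the 3 x 3 median kernel and the stretch to the full 0-255 range are assumptions rather than values stated in the paper.

import cv2
import numpy as np

def preprocess(gray):
    """Pre-processing stage: resize, median filtering and linear contrast stretch."""
    # Resize to 256 x 256 pixels at 8 bit/pixel precision.
    img = cv2.resize(gray, (256, 256)).astype(np.uint8)
    # Median filter removes the undesired (impulsive) components.
    img = cv2.medianBlur(img, 3)
    # Linear contrast stretch to the full 0..255 range.
    lo, hi = int(img.min()), int(img.max())
    stretched = (img.astype(np.float32) - lo) / max(hi - lo, 1) * 255.0
    return stretched.astype(np.uint8)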

Figure. 1 Block diagram of the analysed algorithm: image acquisition (CT / X-ray) -> pre-processing (resizing, filtering, contrast adjustment) -> bit-planes / gray-levels / DCT / DWT transforms -> binary image -> edge detectors -> morphological operations -> contour compression using Ramer -> comparisons

4. BIT-PLANE DECOMPOSITION
We assume a 256 x 256 pixel medical image given at 8 bit/pixel (bpp) precision. The entire image can be considered as a two-dimensional array of pixel values. We consider the 8 bpp data in the form of eight bit-planes, each bit-plane associated with a position in the binary representation of the pixels; 8-bit data is thus a set of eight bit-planes. Each bit-plane may have a value of 0 or 1 at each pixel, but together all the bit-planes make up a byte with a value between 0 and 255. Shown below (Figures 3 and 4) are the most significant bit-planes of the two test images (shown in Figure 2).
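The bit-plane decomposition described above can be sketched in Python with simple bit operations on an 8-bit image (assumed to be a NumPy uint8 array).

import numpy as np

def bit_planes(img):
    """Decompose an 8 bit/pixel image into its 8 bit-planes (plane 7 = MSB)."""
    return [(img >> k) & 1 for k in range(8)]

def msb_plane(img):
    """Most significant bit-plane used for contour extraction (binary 0/1 image)."""
    return (img >> 7) & 1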

Figure. 2 Test images: (a) X-ray, and (b) CT scan




Figure. 3 X-ray hand skeleton images: (a), (b), and (c) bit-planes no. 7, 8, and (7 & 8), respectively


Figure. 4 CT scan images: (a), (b), and (c) bit-planes no. 7, 8, and (7 & 8), respectively

5. GRAY-LEVEL DECOMPOSITION
The idea is as follows: given a set of gray-level patterns to be memorized, (1) decompose them into their corresponding binary patterns, and (2) build the corresponding binary associative memory (one memory for each binary layer) with each training pattern set (by layers). A given pattern, or a distorted version of it, is recalled in three steps: (1) decomposition of the pattern by layers into its binary patterns, (2) recall of each of its binary components, layer by layer, and (3) reconstruction of the pattern from the binary patterns recalled in step 2. The proposed methodology operates in two phases: training and recalling. Conditions for perfect recall of a pattern, either from the fundamental set or from a distorted version of it, are also given. Figures 5 and 6 show the gray-level decomposition.
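One plausible reading of the maximum / half / quarter gray-level layers used in Figures 5 and 6 is simple thresholding of the 8-bit image; the sketch below follows that assumption and is not necessarily the authors' exact decomposition.

import numpy as np

def gray_level_layer(img, level):
    """Binary layer containing the pixels at or above a given gray level."""
    return (img >= level).astype(np.uint8)

# Hypothetical layers corresponding to the maximum, half and quarter gray levels
# of an 8-bit image; the exact levels are an assumption.
mgl_layer     = lambda img: gray_level_layer(img, int(img.max()))
half_layer    = lambda img: gray_level_layer(img, 128)
quarter_layer = lambda img: gray_level_layer(img, 64)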




Figure. 5 X-ray hand images: (a), (b), and (c) gray-level decomposition using the maximum gray level, quarter gray levels, and half gray levels, respectively


Figure. 6 CT scan images: (a), (b), and (c) gray-level decomposition using the maximum gray level, quarter gray levels, and half gray levels, respectively

This paper also compares the zonal method with another zonal sampling method that consists in selecting one block of the spectral image (i.e. the shadow region) as an LPF for image compression, while the other coefficients are taken into account in the contour reconstruction stage. This algorithm is referred to as algorithm II and is shown in Figure 2 of [16].

6. SINGLE STEP PARALLEL CONTOUR EXTRACTION
Detection of the edge points (pixels) of a 3-dimensional physical object in a 2-dimensional image, and of its contour, is one of the main research areas of computer vision. The extraction of object contours and object recognition depend on the correctness and completeness of the edges [17]. Edge detection is required to simplify images and to facilitate image analysis and interpretation [18]. It extracts and localizes points (pixels) around which a large change in image brightness has occurred, and is based on the relationship a pixel has with its neighbours: if the grey levels around a pixel are similar, the pixel is unsuitable to be recorded as an edge point; otherwise, the pixel may represent an edge point.
• Sobel Metric
The Sobel response is defined as the square root of the sum of $G_x^2$ and $G_y^2$, where $G_x$ and $G_y$ are obtained by convolving the image with a row mask and a column mask, respectively. The Sobel operator performs a 2-D spatial gradient measurement on an image and so emphasizes regions of high spatial gradient that correspond to edges. Typically, it is used to find the approximate absolute gradient magnitude at each point in an input grayscale image. The Sobel edge detector uses a pair of 3x3 convolution masks, one estimating the gradient in the x-direction (columns) and the other estimating the gradient in the y-direction (rows). A convolution mask is usually much smaller than the actual image, so the mask is slid over the image, manipulating a square of pixels at a time. In theory at least, the operator consists of a pair of 3x3 convolution masks, one mask simply being the other rotated by 90°, as shown in Figure 7.


Gx = [ -1 0 +1 ; -2 0 +2 ; -1 0 +1 ]        Gy = [ +1 +2 +1 ; 0 0 0 ; -1 -2 -1 ]

Figure. 7 Sobel cross convolution masks
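A small Python sketch of the Sobel gradient magnitude using the masks of Figure 7; SciPy's convolve2d is an assumed implementation choice.

import numpy as np
from scipy.signal import convolve2d

# Sobel masks of Figure 7 (x and y directions).
GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
GY = np.array([[ 1, 2, 1], [ 0, 0, 0], [-1, -2, -1]], dtype=float)

def sobel_magnitude(gray):
    """Approximate absolute gradient magnitude sqrt(Gx^2 + Gy^2)."""
    gx = convolve2d(gray.astype(float), GX, mode="same", boundary="symm")
    gy = convolve2d(gray.astype(float), GY, mode="same", boundary="symm")
    return np.hypot(gx, gy)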

• Canny Metric
The Canny detector is optimal for step edges corrupted by white noise. In evaluating the performance of various edge detectors, Canny defined three criteria [19-22] for optimal edge detection in a continuous domain:
✓ Good detection: the maximum of the ratio of edge points to non-edge points on the edge map.
✓ Good localization: the detected edge points must be as close as possible to their true locations.
✓ Low-responses multiplicity: the maximum of the distance between two non-edge points on the edge map.
The Canny operator was designed to be an optimal edge detector (according to particular criteria; there are other detectors that also claim to be optimal with respect to slightly different criteria). It takes a grey-scale image as input and produces as output an image showing the positions of the tracked intensity discontinuities.
• SSPCE Metric
The SSPCE (single step parallel contour extraction) method is applied to the binary image obtained by applying a suitable threshold value to the noisy digital image [10, 11]. The eight rules of edge extraction are applied and coded using an 8-directional chain code, as shown in Listing (1).

LISTING (1): IMPLEMENTATION OF THE EIGHT RULES FOR CONTOUR EXTRACTION (3X3 WINDOWS)

a(i,j) := 0;  i = 1,2,...,N;  j = 1,2,...,N;
for i = 2,3,...,N-1;  j = 2,3,...,N-1;
{
  if b(i,j) and b(i+1,j) and [b(i,j+1) or b(i+1,j+1)] and [not [b(i,j-1) or b(i+1,j-1)]]
     then a(i,j) := a(i,j) or 2^0    { edge 0 }
  if b(i,j) and b(i+1,j) and b(i+1,j-1) and [not [b(i,j-1)]]
     then a(i,j) := a(i,j) or 2^1    { edge 1 }
  if b(i,j) and b(i,j-1) and [b(i+1,j) or b(i+1,j-1)] and [not [b(i-1,j) or b(i-1,j-1)]]
     then a(i,j) := a(i,j) or 2^2    { edge 2 }
  if b(i,j) and b(i,j-1) and b(i-1,j-1) and [not [b(i-1,j)]]
     then a(i,j) := a(i,j) or 2^3    { edge 3 }
  if b(i,j) and b(i-1,j) and [b(i,j-1) or b(i-1,j-1)] and [not [b(i,j+1) or b(i-1,j+1)]]
     then a(i,j) := a(i,j) or 2^4    { edge 4 }
  if b(i,j) and b(i-1,j) and b(i-1,j+1) and [not [b(i,j+1)]]
     then a(i,j) := a(i,j) or 2^5    { edge 5 }
  if b(i,j) and b(i,j+1) and [b(i-1,j) or b(i-1,j+1)] and [not [b(i+1,j) or b(i+1,j+1)]]
     then a(i,j) := a(i,j) or 2^6    { edge 6 }
  if b(i,j) and b(i,j+1) and b(i+1,j+1) and [not [b(i+1,j)]]
     then a(i,j) := a(i,j) or 2^7    { edge 7 }
}
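For reference, a direct (unoptimized) Python translation of Listing (1) might look as follows; the 0/1 convention for the binary image b and the uint16 code image are assumptions.

import numpy as np

def sspce_edges(b):
    """b: binary (0/1) image; result a holds the 8-directional edge codes (bit k set => edge k)."""
    n, m = b.shape
    a = np.zeros_like(b, dtype=np.uint16)
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            if not b[i, j]:
                continue
            if b[i+1, j] and (b[i, j+1] or b[i+1, j+1]) and not (b[i, j-1] or b[i+1, j-1]):
                a[i, j] |= 2 ** 0   # edge 0
            if b[i+1, j] and b[i+1, j-1] and not b[i, j-1]:
                a[i, j] |= 2 ** 1   # edge 1
            if b[i, j-1] and (b[i+1, j] or b[i+1, j-1]) and not (b[i-1, j] or b[i-1, j-1]):
                a[i, j] |= 2 ** 2   # edge 2
            if b[i, j-1] and b[i-1, j-1] and not b[i-1, j]:
                a[i, j] |= 2 ** 3   # edge 3
            if b[i-1, j] and (b[i, j-1] or b[i-1, j-1]) and not (b[i, j+1] or b[i-1, j+1]):
                a[i, j] |= 2 ** 4   # edge 4
            if b[i-1, j] and b[i-1, j+1] and not b[i, j+1]:
                a[i, j] |= 2 ** 5   # edge 5
            if b[i, j+1] and (b[i-1, j] or b[i-1, j+1]) and not (b[i+1, j] or b[i+1, j+1]):
                a[i, j] |= 2 ** 6   # edge 6
            if b[i, j+1] and b[i+1, j+1] and not b[i+1, j]:
                a[i, j] |= 2 ** 7   # edge 7
    return a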


Morphological filters [18] are used for sharpening medical images. In this method, after locating edges with gradient-based operators, a class of morphological filters is applied to sharpen the existing edges. Morphological operators, by increasing and decreasing intensities in different parts of an image, play an important role in processing and detecting the various objects present in the image. Locating edges in an image using the morphological gradient is an example that has performance comparable with classic edge detectors such as Canny and Sobel [23, 36].

7. DISCRETE COSINE TRANSFORM
Spectral-domain transforms such as Karhunen-Loeve [24], Fourier, Haar [25], Periodic Haar Piecewise-Linear (PHL) [26], Walsh-Hadamard [27, 28], the Discrete Cosine transform (DCT) [29] and, more recently, wavelets [30, 31] can be used to extract medical contour points. In this section, image compression using a low-pass filter (LPF) and contour extraction using a high-pass filter (HPF) are investigated and compared with the Sobel and Canny detectors. The algorithm uses the Discrete Cosine Transform (DCT), and the effectiveness of the contour extraction is evaluated for different classes of images. The main ideas of the procedure for both contour extraction and image compression are presented. To compare the results, the mean square error and signal-to-noise ratio criteria were used. The simplicity and small number of operations are the main advantages of the proposed algorithms. A high-pass filter is a filter that passes high frequencies and attenuates low frequencies. In high-pass filtering the objective is to remove the low-frequency, slowly changing areas of the image and to bring out the high-frequency, fast-changing details. This means that if we were to high-pass filter an image of a box we would only see an outline of the box, because the edge of the box is the only place where neighbouring pixels differ from one another. Contour representation and compression are required in many applications, e.g. computer vision, topographic or weather map preparation, medical imaging, and image compression. The results are compared with the Sobel and Canny edge detectors for contour extraction [12, 18, 32-34].
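A hedged sketch of DCT-based zonal sampling along these lines is shown below; the square low-frequency zone and SciPy's dctn/idctn are assumed choices, not necessarily the exact zone shape used by the authors.

import numpy as np
from scipy.fft import dctn, idctn

def dct_zonal(gray, block=150):
    """Zonal sampling on the 2-D DCT spectrum: the low-frequency block is kept
    for image compression (LPF) and the remaining coefficients are used for
    contour extraction (HPF). Block size 150 mirrors the experiments."""
    spec = dctn(gray.astype(float), norm="ortho")
    lpf = np.zeros_like(spec)
    lpf[:block, :block] = spec[:block, :block]   # retained low-pass zone
    hpf = spec - lpf                             # high-pass residual
    compressed = idctn(lpf, norm="ortho")        # LPF reconstruction (compression)
    detail     = idctn(hpf, norm="ortho")        # HPF image used for contours
    return compressed, detail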

8. DISCRETE WAVELET TRANSFORM
Wavelet analysis is a relatively new method for solving difficult problems in mathematics, physics, and engineering, with modern applications as diverse as wave propagation, data compression, signal processing, image processing, pattern recognition, computer graphics, the detection of aircraft and submarines, and medical imaging technology [31, 35]. Wavelets allow complex information such as music, speech, images, and patterns to be decomposed into elementary forms at different positions and scales and subsequently reconstructed with high precision. Wavelets are obtained from a single prototype wavelet, called the mother wavelet, by dilations and shifting using equation (3):

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right) \qquad (3)$$

9. ZONAL SAMPLING METHOD
Among the zonal sampling methods described in [16], the best scheme for compression and contour extraction is the one illustrated in Figure 8. The fit criterion of the algorithm consists in selecting one squared block of the spectral image (e.g. the shadow region) as the LPF for image compression, while the other coefficients are taken into account in the contour reconstruction stage. This method is mainly used in this work with the DCT transform.
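A minimal sketch of the single-level Haar DWT step, assuming the PyWavelets package and a simple sum of the absolute detail sub-bands as the edge response.

import numpy as np
import pywt

def dwt_edge_map(gray, threshold=10):
    """Single-level Haar DWT: the detail (high-pass) coefficients give a binary
    edge map after thresholding; the approximation (low-pass) coefficients give
    the compressed image. Threshold 10 follows the experiments."""
    cA, (cH, cV, cD) = pywt.dwt2(gray.astype(float), "haar")
    detail = np.abs(cH) + np.abs(cV) + np.abs(cD)
    edges = (detail > threshold).astype(np.uint8)
    return cA, edges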

Figure. 8 LPF & HPF filters zonal method for the spectral image using the algorithm I

10. RAMER METHOD
A contour is represented as a polygon when it fits the edge points with a sequence of line segments. Several algorithms are available for determining the number and location of the vertices and for computing the polygonal approximation of a contour.


The best known is the Ramer method, which is based on a polygonal approximation scheme [4]. The simplest approach to polygonal approximation is a recursive splitting process. Splitting methods work by first drawing a line from one point on the boundary to another; we then compute the perpendicular distance from each point along the curve segment to that line, and if this exceeds some threshold, we break the line at the point of greatest error. The idea of this first curve approximation is illustrated in Figure 9. The process is then repeated recursively for each of the two new lines until no further splitting is needed. For a closed contour, we can find the two points that lie farthest apart and fit two lines between them, one for each side, and then apply the recursive splitting procedure to each side. In summary: first, use a single straight line to connect the end points; then find the edge point with the greatest distance from this straight line and split the line into two straight lines that meet at this point; repeat this process for each of the two new lines, recursively, until the maximum distance of any point to the poly-line falls below a certain threshold. Finally, draw the lines between the vertices of the reconstructed contour to obtain the approximating polygon, as shown in Figure 9.

Figure. 9 Contour compression using the Ramer algorithm: the point (C or D) with the maximum perpendicular distance d from the line AB determines where the segment is split
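A compact recursive sketch of the Ramer splitting procedure for an open contour segment is given below (for a closed contour, the two farthest-apart points would first be found and each side processed separately, as described above); the NumPy-based distance computation is an implementation choice.

import numpy as np

def ramer(points, threshold):
    """Recursive Ramer polygonal approximation.
    points: (N, 2) array of contour points; returns the retained vertices."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    # Perpendicular distance of every point to the chord start-end.
    chord = end - start
    norm = np.hypot(chord[0], chord[1]) or 1.0
    dist = np.abs(np.cross(chord, points - start)) / norm
    k = int(np.argmax(dist))
    if dist[k] <= threshold:
        return np.vstack([start, end])            # segment approximated by one line
    left = ramer(points[:k + 1], threshold)       # split at the farthest point
    right = ramer(points[k:], threshold)
    return np.vstack([left[:-1], right])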

11. APPLIED MEASURES The compression ratio of the analyzed methods is measured using the equation (4).

$$CR = \frac{L_{CC} - L_{AC}}{L_{CC}} \times 100\% \qquad (4)$$

where $L_{CC}$ is the input contour length and $L_{AC}$ is the approximating polygon length. The quality of a contour approximation during the approximating procedure is measured by the mean square error (MSE) and signal-to-noise ratio (SNR) criteria, given by relations (5) and (6) respectively [11, 35].

$$MSE = \frac{1}{L_{CC}} \sum_{i=1}^{L_{CC}} d_i^{\,2} \qquad (5)$$
where $d_i$ is the perpendicular distance between the $i$-th point on the curve segment and the straight line between the two successive vertices of that segment.

$$SNR = -10\log_{10}\!\left(\frac{MSE}{VAR}\right) \qquad (6)$$

where VAR is the variance of the input sequence. The mean square error (MSE) and peak signal-to-noise ratio (PSNR) criteria were used to evaluate the distortion introduced during the image compression and contour extraction procedures. The MSE criterion is defined by the following equation:

$$MSE(I,\tilde{I}) = \frac{1}{n\,m} \sum_{i=0}^{n}\sum_{j=0}^{m} \left( I(i,j) - \tilde{I}(i,j) \right)^2 \qquad (7)$$



where $I$ and $\tilde{I}$ are the grey-level and reconstructed images, respectively. The PSNR is defined by the following formula:

$$PSNR(I,\tilde{I}) = 10\log_{10}\frac{(L-1)^2}{MSE(I,\tilde{I})} \qquad (8)$$
where $L$ is the number of grey levels.
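The measures of equations (4), (7) and (8) can be computed with a few NumPy helpers, sketched below under the stated definitions.

import numpy as np

def contour_cr(l_cc, l_ac):
    """Compression ratio of equation (4): l_cc input contour length,
    l_ac approximating polygon length (numbers of points)."""
    return (l_cc - l_ac) / l_cc * 100.0

def image_mse(img, rec):
    """Mean square error of equation (7) between the grey-level image and its reconstruction."""
    return np.mean((img.astype(float) - rec.astype(float)) ** 2)

def image_psnr(img, rec, levels=256):
    """Peak signal-to-noise ratio of equation (8); levels = number of grey levels L."""
    return 10.0 * np.log10((levels - 1) ** 2 / image_mse(img, rec))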

12. RESULTS OF THE EXPERIMENTS
To visualize the experimental results, a CT scan image and an X-ray hand image were selected; they are shown in Figure 2. Selected results for the tested images are shown in Figures 10 to 19, with the related results in Tables 1 to 10. In the figures and tables, CE is contour extraction, CC is contour compression, BP is bit plane, GL is grey level, and PN is the number of contour points.

Figure. 10 Contour extraction & compression using the most significant bit (MSB): (a) binary image using MSB; (b), (c) CE and CC using Sobel; (d), (e) CE and CC using Canny; (f), (g) CE and CC using SSPCE



Table. 1 Results of hand medical image contour extraction & compression using MSB, Ramer, and the Sobel, Canny, & SSPCE methods with threshold = 0.1

Extraction method | Original contour points (MSB) | Compressed contour points (Ramer) | MSE | SNR | CR [%]
Sobel | 1865 | 401 | 0.0337 | 14.7248 | 78.4987
Canny | 2079 | 443 | 0.2467 | 6.0782 | 87.1054
SSPCE | 1759 | 1399 | 0.0155 | 18.1044 | 20.4662

Figure. 11 Contour extraction & compression using the maximum gray level (MGL) with input image intensity values adjusted to [0.3 0.5]: (a) binary image using MGL; (b), (c) CE and CC using Sobel; (d), (e) CE and CC using Canny; (f), (g) CE and CC using SSPCE



Table. 2 Results of hand medical image contour extraction & compression using MGL, Ramer, and the Sobel, Canny, & SSPCE methods with threshold = 0.1

Extraction method | Original contour points (MGL) | Compressed contour points (Ramer) | MSE | SNR | CR [%]
Sobel | 1379 | 365 | 0.0155 | 18.1044 | 73.5315
Canny | 1662 | 366 | 0.0198 | 17.0387 | 77.9783
SSPCE | 1392 | 701 | 0.0109 | 19.6339 | 49.6408

Figure. 12 Contour extraction & compression using DCT with zonal block 150 & threshold = 33, and Ramer: (a) binary image using DCT; (b), (c) CE and CC using Sobel; (d), (e) CE and CC using Canny; (f), (g) CE and CC using SSPCE



Table. 3 Results of hand medical image contour extraction & compression using DCT, Ramer, and the Sobel, Canny, & SSPCE methods with zonal sampling block = 150

Extraction method (threshold) | Original contour points (DCT) | Compressed contour points (Ramer) | MSE | SNR | CR [%]
Sobel (0.1) | 1623 | 415 | 0.0184 | 17.3441 | 74.4301
Canny (0.1) | 1881 | 416 | 0.0224 | 16.5064 | 77.8841
SSPCE (0.5) | 1667 | 672 | 0.0154 | 18.1259 | 59.6881

Figure. 13 Contour extraction & compression using DWT with threshold = 10, and Ramer: (a) binary image using DWT; (b), (c) CE and CC using Sobel; (d), (e) CE and CC using Canny; (f), (g) CE and CC using SSPCE



Table. 4 Results of hand medical image contour extraction & compression using DWT (Haar) detail coefficients and the Sobel, Canny, & SSPCE methods with threshold = 10

Extraction method | Original contour points (DWT) | Compressed contour points (Ramer) | MSE | SNR | CR [%]
Sobel | 1454 | 562 | 0.0282 | 15.5025 | 61.3480
Canny | 1800 | 696 | 0.0271 | 15.6753 | 61.3333
SSPCE | 1521 | 696 | 0.0126 | 19.0003 | 54.2406

Figure. 14 Image compression using (a) DCT (zonal sampling), and (b) DWT (Haar)

Table. 5 Results of hand medical image compression using DCT and DWT

Method | MSE | PSNR | CR [%]
(a) DCT | 19.1210 | 35.3157 | 34.30
(b) DWT | 255.4064 | 24.0585 | 69.50

Figure. 15 Contour extraction & compression of the chest image using the most significant bit (MSB): (b), (c) CE using Sobel (before and after morphological operations), (d), (e) CC using Sobel; (f) CE using Canny, (g) CC using Canny after morphological operations; (i) CE using SSPCE after morphological operations, (h), (j) CC using SSPCE



Table. 6 Results of chest medical image contour extraction & compression using MSB, Ramer, and the Sobel, Canny, & SSPCE methods with threshold = 0.1

Extraction method | Original contour points (MSB) | Compressed contour points (Ramer) | MSE | SNR | CR [%]
Sobel | 2039 | 1649 | 0.0331 | 14.8082 | 19.1270
Canny | 2705 | 1729 | 0.0430 | 13.6685 | 36.0813
SSPCE | 1427 | 1664 | 0.0261 | 15.8323 | 14.2428

Figure. 16 Contour extraction & compression of the chest image using the maximum gray level (MGL) with input image intensity values adjusted to [0.4 0.44]: (a) binary image using MGL; (b) CE using Sobel, (c) CE using Sobel after morphological operations, (d) CC using Sobel; (e) CE using Canny, (f) CE using Canny after morphological operations, (g) CC using Canny; (h) CE using SSPCE, (i) CE using SSPCE after morphological operations, (j) CC using SSPCE

Table. 7 Results of chest medical image contour extraction & compression using MGL, Ramer, and the Sobel, Canny, & SSPCE methods with threshold = 0.1

Extraction method | Original contour points (MGL) | Compressed contour points (Ramer) | MSE | SNR | CR [%]
Sobel | 1681 | 1450 | 0.0263 | 15.7969 | 13.7418
Canny | 2258 | 1490 | 0.0123 | 19.1176 | 35.5624
SSPCE | 1720 | 1525 | 0.0299 | 15.2400 | 11.3372



Figure. 17 Contour extraction & compression of the chest image using DCT with zonal block 150 & threshold = 33, and Ramer: (a) binary image using DCT; (b) CE using Sobel, (c) CE using Sobel after morphological operations, (d) CC using Sobel; (e) CE using Canny, (f) CE using Canny after morphological operations, (g) CC using Canny; (h) CE using SSPCE, (i) CE using SSPCE after morphological operations, (j) CC using SSPCE

Table. 8 Results of chest medical image contour extraction & compression using DCT, Ramer, and the Sobel, Canny, & SSPCE methods with zonal sampling block = 100

Extraction method (threshold) | Original contour points (DCT) | Compressed contour points (Ramer) | MSE | SNR | CR [%]
Sobel (0.1) | 1874 | 1640 | 0.0307 | 15.1242 | 12.4867
Canny (0.1) | 2501 | 1640 | 0.0418 | 13.7921 | 34.4262
SSPCE (0.5) | 1758 | 1510 | 0.0280 | 15.5213 | 14.1069

Figure. 18 Contour extraction & compression of the chest image using DWT with threshold = 10, and Ramer: (a) binary image using DWT; (b) CE using Sobel, (c) CE using Sobel after morphological operations, (d) CC using Sobel; (e) CE using Canny, (f) CE using Canny after morphological operations, (g) CC using Canny; (h) CE using SSPCE, (i) CE using SSPCE after morphological operations, (j) CC using SSPCE

Table. 9 Results of chest medical image contour extraction & compression using DWT (Haar) detail coefficients and the Sobel, Canny, & SSPCE methods with threshold = 10

Extraction method | Original contour points (DWT) | Compressed contour points (Ramer) | MSE | SNR | CR [%]
Sobel | 2233 | 1847 | 0.0365 | 14.3772 | 17.2862
Canny | 2817 | 1847 | 0.0453 | 13.4372 | 34.4338
SSPCE | 1943 | 1551 | 0.0294 | 15.3182 | 20.1750

Figure. 19 Image compression of the chest image using (a) DCT (zonal sampling), and (b) DWT (Haar)

Table. 10 Results of chest medical image compression using DCT and DWT

Method | MSE | PSNR | CR [%]
(a) DCT | 72.4032 | 29.5332 | 15.30
(b) DWT | 70.9038 | 29.6241 | 80.40



13. CONCLUSIONS
Medical image enhancement technologies have attracted much attention since advanced medical equipment came into use in the medical field. Enhanced medical images are desired by surgeons to assist diagnosis and interpretation, because medical image quality is often deteriorated by noise, data acquisition devices, illumination conditions, etc.; for these reasons a pre-processing stage (filtering and enhancement) is used before the main processing. The Sobel and Canny edge detection operators and single step parallel contour extraction (SSPCE) have been applied to CT / X-ray sample images. The contours of the tested images can also be extracted using DCT high-pass filter coefficients obtained with the zonal sampling method, and using a single level of the DWT with its high-pass detail coefficients. The extracted contours are compressed using the well-known Ramer method. Simulation results obtained with MATLAB show that this kind of algorithm has satisfactory performance, with a compression ratio exceeding 87% (see Table 1) for the hand X-ray. Using the DWT, the compressed image can be obtained from the approximation (low-pass) coefficients with a compression ratio exceeding 80% and a quality approaching 30 decibels (see Figure 19 & Table 10). In the future, the proposed strategy can be used to detect, analyze, and extract tumors from patients' CT scan images.

REFERENCES
1. M. Petrou and C. Petrou, "Image Processing: The Fundamentals", Wiley, Amsterdam, 2010.
2. D. W. McRobbie, E. A. Moore, M. J. Graves, M. R. Prince, "MRI: From Picture to Proton", 2nd ed., New York: Cambridge University Press, 2007.
3. Rafael C. Gonzalez and Richard E. Woods, "Digital Image Processing", 3rd ed., New Jersey: Pearson Prentice Hall, 2008.
4. Ramer U., "An iterative procedure for the polygonal approximation of plane curves", Computer Graphics and Image Processing, Academic Press, Volume 1, Issue 3, pp. 244-256, 1972.
5. Dziech A., Baran R. & Ukasha A., "Contour compression using centroid method", WSEAS Int. Conf. on Electronics, Signal Processing and Control (ESPOCO 2005), Copacabana, Rio de Janeiro, Brazil, pp. 225-229, 2005.
6. Dziech A., Ukasha A. and Baran R., "Fast method for contour approximation and compression", WSEAS Transactions on Communications, Volume 5, Issue 1, pp. 49-56, 2006.
7. Ukasha A., Dziech A. & Baran R., "A new method for contour compression", WSEAS Int. Conf. on Signal, Speech and Signal Processing (SSIP 2005), Corfu Island, Greece, pp. 282-286, 2005.
8. Ukasha A., Dziech A., Elsherif E. and Baran R., "An efficient method of contour compression", International Conference on Visualization, Imaging and Image Processing (IASTED/VIIP), Cambridge, United Kingdom, pp. 213-218, 2009.
9. Ukasha A., "Arabic letters compression using new algorithm of trapezoid method", International Conference on Signal Processing, Robotics and Automation (ISPRA'10), Cambridge, United Kingdom, pp. 336-341, 2010.
10. Dziech A., Besbas W. S., "Fast algorithm for closed contour extraction", Proc. of the Int. Workshop on Systems, Signals and Image Processing, Poznań, Poland, pp. 203-206, 1997.
11. Besbas W., "Contour Extraction, Processing and Recognition", Poznan University of Technology, Ph.D. Thesis, 1998.
12. Scott E. Umbaugh, "Computer Vision and Image Processing", Prentice-Hall, 1998.
13. Nalini K. Ratha, Tolga Acar, Muhittin Gokmen, and Anil K. Jain, "A distributed edge detection and surface reconstruction algorithm", Proc. Computer Architectures for Machine Perception (Como, Italy), pp. 149-154, 1995.
14. Yali Amit, "2D Object Detection and Recognition", MIT Press, 2002.
15. Veysel Aslantas, "An SVD based digital image watermarking using genetic algorithm", IEEE, 2007.
16. A. Ukasha, "An efficient zonal sampling method for contour extraction and image compression using DCT transform", The 3rd Conference on Multimedia Computing and Systems (ICMCS'12), Tangier, Morocco, May 2012.
17. Nalini K. Ratha, Tolga Acar, Muhittin Gokmen, and Anil K. Jain, "A distributed edge detection and surface reconstruction algorithm", Proc. Computer Architectures for Machine Perception (Como, Italy), pp. 149-154, 1995.



18. G. Economou, S. Fotopoulos, and M. Vemis, "A novel edge detector based on non-linear local operations", Proc. IEEE International Symposium on Circuits and Systems (London), pp. 293-296, 1994.
19. Kim L. Boyer and Sudeep Sarkar, "On the localization performance measure and optimum edge detection", IEEE Transactions on Pattern Analysis and Machine Intelligence 16, pp. 106-108, 1994.
20. J. F. Canny, "A computational approach to edge detection", IEEE Transactions on Pattern Analysis and Machine Intelligence 8, pp. 679-698, 1986.
21. Didier Demigny and Tawfik Kamle, "A discrete expression of Canny's criteria for step edge detector performances evaluation", IEEE Transactions on Pattern Analysis and Machine Intelligence 19, pp. 1199-1211, 1997.
22. Hemant D. Tagare and Rui J.P. deFigueiredo, "Reply to on the localization performance measure and optimal edge detection", IEEE Transactions on Pattern Analysis and Machine Intelligence 16, pp. 108-110, 1994.
23. Chen, T., Wu, Q.H., Rahmani-Torkaman, R. and Hughes, J., "A pseudo top-hat mathematical morphological approach to edge detection in dark regions", Pattern Recognition, 35, pp. 199-210, 2002.
24. A. K. Jain, "Fundamentals of Digital Image Processing", New Jersey: Prentice Hall International, 1989.
25. Brigham E.O., "The Fast Fourier Transform", Prentice-Hall, Englewood Cliffs, 1974.
26. A. Dziech, F. Belgassem & H. J. Nern, "Image data compression using zonal sampling and piecewise-linear transforms", Journal of Intelligent and Robotic Systems: Theory & Applications, 28(1-2), Kluwer Academic Publishers, pp. 61-68, June 2000.
27. Walsh, J. L., "A closed set of normal orthogonal functions", Amer. J. Math. 45, pp. 5-24, 1923.
28. Wolfram, S., "A New Kind of Science", Champaign, IL: Wolfram Media, pp. 573 and 1072-1073, 2002.
29. Clarke R. J., "Transform Coding of Images", Academic Press, 1985.
30. Gerhard X. Ritter and Joseph N. Wilson, "Computer Vision Algorithms in Image Algebra", CRC Press, New York, 1996.
31. Vetterli, Martin & Kovacevic, Jelena, "Wavelets and Subband Coding", Prentice Hall Inc., 1995.
32. D.H. Ballard and C.M. Brown, "Computer Vision", Prentice Hall, Englewood Cliffs, NJ, 1982.
33. R.M. Haralick and L. G. Shapiro, "Computer and Robot Vision", Addison-Wesley Publishing Co., 1992.
34. B.K.P. Horn, "Robot Vision", The MIT Press, Cambridge, MA, 1986.
35. Gonzalez R. C., "Digital Image Processing", Second Edition, Addison Wesley, 1987.
36. Mahmoud, T.A. and Marshall, S., "Medical image enhancement using threshold decomposition driven adaptive morphological filter", Proceedings of the 16th European Signal Processing Conference, Lausanne, Switzerland, 25-29 August 2008, pp. 1-5.

AUTHORS PROFILE



An appraisal for features selection of offline handwritten signature verification techniques
1Anwar Yahy Ebrahim, 2Hoshang Kolivand, 3Mohd Shafry Mohd Rahim
1Babylon University, Babylon, Iraq
2Department of Computer Science, Liverpool John Moores University, Liverpool, UK, L3 3AF
3Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia
Email: [email protected], [email protected], [email protected]

ABSTRACT
This research provides a summary of widely used feature selection techniques for handwritten signature verification. The focus is on selecting the best features for signature verification, characterized by the number of features represented for each signature, with the aim of discriminating whether a given signature is genuine or a forgery. We present how the discussion of the advantages and drawbacks of feature selection techniques has been handled by several researchers in the past few decades, along with recent advancements in the field.

Keywords: signature verification; feature extraction; dimension reduction; feature selection; handwritten signature;

1. INTRODUCTION
The handwritten signature is a widely utilized and recognized authentication technique throughout the world, and thorough examination of the signature image is important before drawing a conclusion about the writer. Variation among original signatures makes it difficult to distinguish between original and forged signatures. Signature identification and verification methods can strengthen authentication procedures and distinguish between genuine and forged signatures [1]. The handwritten signature also has considerable importance in online banking applications and cheque processing [2]. For the authentication of passports, biometric methods can be utilized, in particular signature verification [3]. Extracted features can be defined as characteristics derived from the signature itself. These features play an important role in developing a robust system, as all other phases are based on them. A large number of features may decrease the FRR (the number of genuine signatures rejected by the system) but at the same time increase the FAR (the number of forged signatures accepted by the system). However, little effort has been devoted to measuring the consistency of these attributes. This consistency measurement is important to determine the effectiveness of a method, and in order to measure the consistency of these features, the best attribute set among them must be chosen [4]. There are two major procedures in signature identification and authentication: one is the identification of the signer, and the other is the classification of a sample as original or forged [5]. The focus of this research is on off-line signature authentication methods. The remainder of this study is organized as follows: Section 2 reviews the already published methods of off-line signature verification, Section 3 presents a critical analysis of existing research studies, and Section 4 concludes the research.

2. PROBLEM STATEMENT
In the literature on offline handwritten signature verification, we can find multiple ways of defining the problem. In particular, one matter is critical for comparing related work: whether or not skilled forgeries are used for training. Some authors do not use skilled forgeries at all for training [7, 8]; other researchers use skilled forgeries for training writer-independent classifiers, testing these classifiers on a separate set of users [9]; lastly, some papers use skilled forgeries for training writer-dependent classifiers, and test these classifiers on a separate set of original signatures and forgeries from the same set of users.



Boosting feature selection is achieved by attribute selection methods that choose the single most discriminant attribute of a set of attributes and find a threshold that separates the two categories to be trained, effectively a decision stump. Attributes are then chosen in a greedy fashion according to the boosting weights while training is conducted by the feature selection technique. Because a very large number of features is present, the result is a committee built on the best selected attributes representing the training samples [10]. The concept of feature selection has also been used to propose a system for signatory recognition based on a reduced number of features from the signature [11]; this approach, when applied to signatures, provides a good way of compressing the signature while maintaining acceptable identification rates.

3. SIGNATURE VERIFICATION
Handwritten signatures have been applied as biometric features that distinguish persons. It has been confirmed that signature samples are an accurate biometric feature with a low conflict proportion. Some signature samples may be similar, but there are different technical methods to distinguish between them and to disclose forged signatures. There are two classes of handwritten signature verification systems:
3.1 Verification system of offline (static) signature
The signature is written offline, for example a signature written on a bank cheque; the technique reads a scanned sample of the signature and compares it with the signature samples stored in the database. Off-line signatures are shown in Figure 1.

Figure. 1 Offline signatures [12]
3.2 Verification system of online (dynamic) signature
The signature is signed onto a reactive electronic device and read on-line, and the signature sample is compared with those on file for the individual to test for validity. Several of the selected best features used with on-line signature samples are not accessible for off-line ones. An online handwritten signature is displayed in Figure 2.

Figure. 2 Online signatures [12]
4. DATASETS
The availability of datasets is one of the most important requirements in any research area, and the same is the case for signature analysis and recognition. A number of datasets comprising signature samples have been developed over the years, mainly to support signature verification, signature segmentation, and signer recognition tasks. Especially during the last few years, a number of standard datasets in different scripts and languages have been developed, allowing researchers to evaluate their systems on the same databases for meaningful comparisons.


JOURNAL OF SOFTWARE ENGINEERING & INTELLIGENT SYSTEMS ISSN 2518-8739 30th April 2018, Volume 3, Issue 1, JSEIS, CAOMEI Copyright © 2016-2018 www.jseis.org developed allowing researchers to evaluate their systems on the same databases for meaningful comparisons. Some notable datasets of signature samples along with their exciting measurements are presented in Table 1. Table. 1 Summary of notable signature dataset Dataset Name Language Signatures GPDS [13] English 8640 CEDAR [14] English 2640 Arabic dataset [15] Arabic 330 Japanese dataset [16] Japanese 2000 Persian Dataset [17] Persian 2000 Chinese NZMI dataset [18] Chinese 1200 5. PRE-PROCESSING For effective recognition of a signatory from offline signature samples, the signature must be distinguishable from the background allowing proper segmentation of the two. Most of the signatory identification techniques developed to date depend on selected features which are extracted from binary signatures with white background and black ink trace. An exclusion to this is the search of Wirotius [19], where the authors argue that like online signature sample, grayscale images also contain information about pen pressure, the intensity of the gray value at a particular pixel being proportional to the pen pressure. Zuo et al. [20] also supported this idea and conducted a series of signatory identification experiments both on gray scale and binary images. The experiments on gray scale images reported slightly better results than the binary images with an overall identification rate of 98%. It should however, be noted that feature extraction from the gray ink trace is quite complex as opposed to the binary version. A large set of useful attributes can be extracted from binarized version of signature and consequently most of the contributions to signatory identification are based on binary images of signature [20]. A number of standard thresholding systems have been developed to binarize images into foreground and background [21], and these methods can also be applied to signature samples. Most of the research employs the well-known Otsu’s thresholding algorithm [21], to compute a global threshold for the signature image and convert the gray scale image into binary [22]. Signature images may present variations in terms of pen thickness, scale, rotation, etc., even among authentic signatures of a person. Common pre-processing techniques are: signature extraction, noise removal, application of morphological operators, size normalization, centering and binarization [23]. The experiments on gray scale images reported slightly better results than the binary images with an overall identification rate of 98%. It should however, be noted that feature extraction from the gray ink trace is quite complex as opposed to the binary version. A large set of useful attributes can be extracted from binarized version of signature and consequently most of the contributions to signer identification are based on binary images of signature [24].

6. FEATURE EXTRACTION
6.1 Global and local feature extraction

Local and global features carry information that is effective for signature recognition. The choice of features matters because different features are needed for any style recognition and classification method. Global attributes are extracted from the complete signature. The set of these local and global attributes is then used for reporting the identity of genuine and forged signature samples from the dataset. The global attributes extracted from a sample are described as follows [25].
Width (length): for a binary image, the width is the distance between two pixels in the horizontal projection and must include more than three points of the signature.
Height: for a binary signature, the height is the distance between two pixels in the vertical projection and must include more than three points of the signature.
Aspect ratio: the aspect ratio is a global attribute that represents the ratio of the width to the height of the signature image [26].



Horizontal projection: the horizontal projection is calculated from both the binary and the skeletonized signature, as the set of dark points obtained from the horizontal projections of the binary and skeletonized images.
Vertical projection: the vertical projection is the set of dark points obtained from the vertical projections of the binary and skeletonized images.
Local attributes are extracted from the gray level and binary signatures. From small areas of the whole signature, the local attributes represent height, width, aspect ratio, and horizontal and vertical projections. To obtain a group of global and local attributes, both attribute groups are collected into a feature vector that is given as input to the classification techniques for matching [27, 28].
6.2 Orientation
Orientation represents the direction of the image lines. This attribute is necessary and helps to know how the signatory signed the image, e.g. which letters come first towards corners and peaks. Orientation is obtained using the proportion of the angle at the main axis [29].
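A minimal sketch of these global attributes for a binary signature image (ink pixels equal to 1) is given below; the "more than three points" rule follows the definitions above, while the row/column axis convention is an assumption.

import numpy as np

def global_features(binary):
    """Width, height, aspect ratio and horizontal / vertical projections
    of a binary signature image."""
    rows = np.where(binary.sum(axis=1) > 3)[0]   # rows with more than 3 ink points
    cols = np.where(binary.sum(axis=0) > 3)[0]   # columns with more than 3 ink points
    height = rows.max() - rows.min() + 1 if rows.size else 0
    width = cols.max() - cols.min() + 1 if cols.size else 0
    aspect_ratio = width / height if height else 0.0
    horizontal_projection = binary.sum(axis=1)   # dark points per row
    vertical_projection = binary.sum(axis=0)     # dark points per column
    return width, height, aspect_ratio, horizontal_projection, vertical_projection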

7. DIMENSION REDUCTION
This section introduces dimension reduction and the difficulties of classification for high-dimensional multivariate data. Figure 3 shows the main idea of this study.

Figure. 3 Representation of the data reduction methods
The basic concept is to reduce large amounts of information down to its significant parts. Data reduction is the procedure of decreasing the set of arbitrary inputs under consideration [32]. It can be split into attribute selection, discussed in detail in the next sub-sections, and feature extraction. Data reduction is beneficial because it enhances the performance of the machine learning model. The first part of dimension reduction is feature selection, which tries to find the most useful of the original features. In some situations, a task such as classification can be performed more accurately in the reduced space than in the original space, for example with the Sparse PCA technique [33]. Linear and nonlinear reduction methods that depend on the estimation of local data have been suggested recently; this section gives a logical comparison of these methods by identifying the weaknesses of current linear and nonlinear techniques.
7.1 Linear dimension reduction
Linear methods achieve dimension reduction by projecting the information into a subspace of lower dimension. There are different ways to do so, such as Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) [33]. LDA is a popular data-analytic tool for studying the category relationship between data points, and LDA is supervised. A main disadvantage of LDA is that it fails to capture the local geometrical structure of the data manifold [34]. Dimension reduction is the task of reducing the dimension of the available data. The data processing required in dimension reduction, often linear for computational simplicity, is determined by optimizing an appropriate figure of merit that quantifies the amount of information preserved after a certain reduction in the data dimension. The 'workhorse' of dimension reduction is PCA [33], which has been extremely popular for data dimension reduction since it entails only linear data processing.
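A short, hedged example of linear dimension reduction of signature feature vectors with scikit-learn's PCA and LDA estimators; the feature matrix X, labels y, and the number of components are hypothetical.

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def linear_reduction(X, y, n_components=50):
    """Project signature feature vectors onto a lower-dimensional subspace with
    PCA (unsupervised) and LDA (supervised)."""
    x_pca = PCA(n_components=n_components).fit_transform(X)
    x_lda = LinearDiscriminantAnalysis().fit_transform(X, y)  # at most n_classes - 1 dimensions
    return x_pca, x_lda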



7.2 Non-linear techniques for dimension reduction
This section discusses two non-linear methods for dimension reduction, Kernel PCA (KPCA) and Multi-Dimensional Scaling (MDS). These methods attempt to preserve the structure of the original data in the low-dimensional representation [36]. As shown in Figure 4, KPCA calculates the kernel matrix K of the data points xi; KPCA is a kernel-based method, and the mapping performed by kernel principal components depends on the choice of the kernel function. A main shortcoming of KPCA is that the size of the kernel matrix is proportional to the square of the number of cases in the database [37].
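For completeness, the two non-linear methods can be sketched with scikit-learn as well; the RBF kernel for KPCA and the target dimensionality are assumed choices.

from sklearn.decomposition import KernelPCA
from sklearn.manifold import MDS

def nonlinear_reduction(X, n_components=2):
    """Non-linear reduction of the feature matrix X with kernel PCA and
    multi-dimensional scaling."""
    x_kpca = KernelPCA(n_components=n_components, kernel="rbf").fit_transform(X)
    x_mds = MDS(n_components=n_components).fit_transform(X)
    return x_kpca, x_mds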

Figure. 4 Kernel principal component analysis
Honarkhah et al. [38] applied MDS, but a major disadvantage of MDS is that it provides a global measure of dis/similarity and does not give much insight into subtleties [34]; it is also susceptible to the curse of dimensionality and to the problem of finding the smallest eigenvalues in an eigenproblem. PCA is sensitive to the relative scaling of the original attributes [39]. Feature extraction produces new features from the original features, while feature selection returns a subset of the original features [40]. The number of principal components is at most the number of original features, as shown in Figure 5(a). The principal components are orthogonal directions of greatest variance in the data, and the projection along PC1 discriminates the information most strongly, as shown in Figure 5(b).

Figure. 5 (a & b) Principal components analysis
In Figure 6 (a) & (b), PCA can be obtained by computing the Singular Value Decomposition (SVD) of the data matrix, which is closely related to the eigendecomposition of X^T X (the empirical covariance matrix of X). In Figure 6 (a), vector V1 (yellow) is added to another vector V2 (blue); SVD is a factorization of a real or complex matrix.


Figure. 6 (a) Vector addition, (b) singular value decomposition (SVD)
In Figure 7 (a) & (b), each eigen-gene is expressed only in the corresponding eigen-array with the corresponding eigen-expression level. PCA can be obtained by computing the SVD of the feature matrix, X = U D V^T, where the columns of U D (equivalently X V) are the principal components, the columns of V are the corresponding loadings (the eigenvectors that diagonalize the covariance matrix X^T X), and the squared singular values on the diagonal of D are the eigenvalues, ordered so that the first principal component captures the maximum variance.
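A small NumPy sketch of this SVD-based view of PCA is given below; the random matrix is a hypothetical stand-in for a feature matrix and is centred before decomposition.

```python
# PCA via SVD with NumPy: columns of V (rows of Vt) are the loadings,
# X @ V (= U * s) are the principal components, and the squared singular
# values give the variance captured by each component.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
Xc = X - X.mean(axis=0)                   # centre the features first

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt.T                    # the principal components (same as U * s)
explained_var = s**2 / (Xc.shape[0] - 1)  # eigenvalues of the covariance matrix

print(components.shape, explained_var[:3])
# The first column of `components` has the largest variance, the second the next, etc.
```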

Figure. 7 (a & b) PCA on expression data
In Figure 8, dataset X is mapped to dataset X' (here of the same dimension) such that the variance of X retained in X' is maximal. The first dimension of X', e1 (PC1), is the direction of maximal variance; the second, e2 (PC2), is orthogonal to the first.

Figure. 8 Each eigenvalue measures the variance along its eigenvector e
The major weakness of PCA is that the size of the covariance matrix grows with the dimensionality of the data points. As a result, computing the eigenvectors may be infeasible for very high-dimensional data [35].


8. FEATURE SELECTION
Feature reduction is one of the most interesting and widespread techniques in offline handwritten signature verification. In some cases the available features do not improve verification capability, and there are too many of them (high dimensionality), which reduces the efficiency of the classification process; for this reason the best features must be selected from those produced by feature extraction, as shown in Figure 9. Many researchers [41, 10] have proposed feature selection techniques to select features from the signature image and achieved good results. Several works have used a feature selection approach for signature verification. Rivard et al. [10] trained a writer-independent classifier by first extracting a large number of features from each signature (over 30 thousand), applying feature extractors at different scales of the image. Their method consisted of training an ensemble of decision stumps (equivalent to a decision tree with only one node), where each decision stump used only one feature. With this approach, they obtained a smaller feature representation (fewer than a thousand features) that achieved good results on the Brazilian and GPDS datasets. Eskander et al. [41] extended Rivard's [10] work to train a hybrid writer-independent-writer-dependent classifier, by first training the writer-independent classifier to perform feature selection and then training writer-dependent classifiers using only the features selected by the first model. This strategy gave good results once a sufficient number of samples per user was available.
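The sketch below illustrates the general idea of stump-based feature selection (boosted one-node trees, keeping only the features the stumps split on); it is not the exact procedure of [10, 41], and the synthetic feature matrix and labels are assumptions made for the example (scikit-learn ≥ 1.2 argument names are used).

```python
# Stump-based feature selection sketch: boost an ensemble of one-node decision
# trees and keep only the features that the stumps actually split on.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 1000))                 # stand-in for a large per-signature feature vector
y = (X[:, 3] + 0.5 * X[:, 42] > 0).astype(int)   # hypothetical genuine/forgery labels

ensemble = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),   # decision stumps
    n_estimators=100, random_state=0)
ensemble.fit(X, y)

# Each stump uses one feature; the union of those features is the reduced set.
selected = sorted({int(stump.tree_.feature[0])
                   for stump in ensemble.estimators_
                   if stump.tree_.feature[0] >= 0})
print(len(selected), "features kept out of", X.shape[1])
X_reduced = X[:, selected]
```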

Figure. 9 Representation of the best features from the feature selection methods as input to the classification technique

9. CRITICAL EVALUATION
The number of feature types selected for signature verification differs across studies: Srihari et al. [30] selected 3 types of features, Biswas et al. [42] and Pushpalatha et al. [43] each selected 5 types, Nguyen et al. [27] and Pourreza et al. [28] selected 8 types for offline signature verification, and Siddiqi and Vincent [44] selected 10 types for offline handwriting recognition. Representing each signature through feature selection in this way is the first effort of its kind in signatory recognition, as summarized in Table 2.
Table. 2 Number of features used in the classification, identification and verification process

Author / year | Number of features | Types of features
Kalera (2004) [13] | 3 types of features extracted in offline signature verification | Eccentricity, rectangularity and orientation
Nguyen (2007) [27], Daramola (2010) [26] and Pourreza (2011) [28] | 8 types of features extracted in offline signature verification | Vertical projections, horizontal projections, upper profiles, lower profiles, elongation, solidity, eccentricity and rectangularity
Biswas (2010) [42] | 5 types of features extracted in signature verification | Height-width proportion of the signature, occupancy, dimension proportion computation at the boundary, computation of the distance, and computation of the set of symbols of the sample
Pushpalatha (2014) [43] | 5 types of features extracted in offline signature verification | Set of cross-points, set of edge-points, eccentricity, mass and centre of mass
Siddiqi (2010) [44] | 10 types of features extracted in offline handwritten recognition | Vertical projections, horizontal projections, upper profiles, lower profiles, elongation, solidity, eccentricity, rectangularity, orientation and perimeter
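For illustration, the sketch below computes a few of the feature types listed in Table 2 (projections, rectangularity and eccentricity) from a binary signature image; the input array and the moment-based eccentricity formula are assumptions for this example rather than the exact definitions used by the cited authors.

```python
# Compute simple signature shape features from a binary image (ink pixels = 1).
import numpy as np

def signature_features(img):
    ys, xs = np.nonzero(img)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1

    vertical_proj = img.sum(axis=0)            # ink count per column
    horizontal_proj = img.sum(axis=1)          # ink count per row
    rectangularity = img.sum() / float(h * w)  # ink area / bounding-box area

    # Eccentricity from the second-order central moments of the ink pixels.
    cov = np.cov(np.vstack([xs, ys]).astype(float))
    evals = np.sort(np.linalg.eigvalsh(cov))
    eccentricity = np.sqrt(1.0 - evals[0] / evals[1]) if evals[1] > 0 else 0.0

    return vertical_proj, horizontal_proj, rectangularity, eccentricity

demo = np.zeros((60, 200), dtype=np.uint8)
demo[25:35, 20:180] = 1                        # crude stand-in for a signature stroke
print(signature_features(demo)[2:])            # rectangularity and eccentricity
```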

10. CONCLUSION
Several researchers have proposed different systems for signature verification. In spite of these advancements, the results still report fairly large error rates for distinguishing genuine signatures from skilled forgeries when large public datasets such as GPDS are used for testing. This research addresses some of the main problems encountered in designing signature verification systems: the limited number of individuals, the large set of features extracted from signatures, the high intra-personal variability of signatures, and the lack of forgeries as counter-examples. A new technique for feature selection is suggested for the accurate design of signature verification methods; it integrates feature extraction and feature selection. Recently, feature selection methods combined with classification techniques for signer verification have emerged as an effective way of characterizing the signer of a signature, and their results are found to be better than those of other feature sets for signature verification. In conclusion, selecting the best features from among many features will help to improve the performance of signature verification. As this study consists of an evaluation of the literature, the suggested direction for future research is to propose new techniques that further decrease the EER.
ACKNOWLEDGEMENT
The authors are thankful to Babylon University, Babylon, Iraq, for providing the facilities needed to complete this research successfully.
REFERENCES
1. M. Tomar, and P. Singh, A directional feature with energy based offline signature verification network. International Journal on Soft Computing, vol.2, pp. 48–57, February 2011, doi: 10.5121/ijsc. 2. K. Harika, and T.C.S. Ready, A tool for robust offline signature verification. International journal of advanced research in computer and communication engineering, 2013, vol.2, pp. 3417–3420, September. 3. S. Odeh, and M. Khalil, Apply multi-layer perceptron neural network for off-line signature verification and recognition. IJCSI International Journal of Computer Science Issues, 2011, vol.8, pp. 261–266, November. 4. Zulkarnain, Z., Rahim, M. S. M., & Othman, N. Z. S., Feature Selection Method For Offline Signature Verification. Jurnal Teknologi, 2015, 75(4). 5. V.M. Deshmukh, and S.A. Murab, Signature recognition & verification using ANN. International Journal of Innovative Technology and Exploring Engineering, 2012, vol.1, pp. 6–8, November. 6. G.P. Patil, and R.S. Hegadi, Offline handwritten signatures classification using wavelets and support vector machines. International Journal of Engineering Science and Innovative Technology, 2013, vol.2, pp. 573–579. 7. D. Rivard, Multi-feature approach for writer-independent offline signature verification. Ph.D. dissertation, 2010, École de technologie supérieure. 8. G. Eskander, R. Sabourin, and E. Granger, Hybrid writer-independent-writer-dependent offline signature verification system. IET Biometrics, 2013, vol. 2, no. 4, pp. 169–181, Dec. 9. M. B. Yilmaz, Offline Signature Verification With User-Based And Global Classifiers Of Local Features. Ph.D. dissertation, 2015, Sabanci University. 10. Rivard, D., Granger, E. and Sabourin, R., Multi-Feature Extraction and Selection in Writer-Independent Offline Signature Verification. International Journal on Document Analysis and Recognition, vol.16, no.1, pp.83-10. 11. Zou, H., Hastie, T., & Tibshirani, R., Sparse principal component analysis.
Journal of computational and graphical statistics, 15(2), 265-286.


12. Helli, B. and Moghaddam, M. E., A Text-Independent Persian Writer Identification Based On Feature Relation Graph (FRG). Pattern Recognition, 2010, 43(6), 2199-2209. 13. M. Ferrer, J. Alonso, and C. Travieso, Offline geometric parameters for automatic signature verification using fixed-point arithmetic. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 993–997, Jun. 2005. 14. M. K. Kalera, S. Srihari, and A. Xu, Offline signature verification and identification using distance statistics. International Journal of Pattern Recognition and Artificial Intelligence, 2004, vol. 18, no. 07, pp. 1339– 1360,Nov. 15. Ismail, M.A., Gad, S., Offline Arabic Signature Recognition and Verification. Pattern Recognition, 2000,vol. 33, no. 10, pp. 1727—1740. 16. Ueda, K., Investigation of Off-Line Japanese Signature Verification Using a Pattern Matching. In ICDAR, 2003, (p. 951). 17. Chalechale, A., & Mertins, A., Line segment distribution of sketches for Persian signature recognition. In TENCON 2003. Conference on Convergent Technologies for the Asia-Pacific Region (Vol. 1, pp. 11-15). IEEE. 18. Lorette, On-Line Handwritten Signature Recognition Based on Data Analysis and Clustering. P R,1984, Vol. 2, pp. 1284-1287. 19. Wirotius, M. and Vincent, N., Stroke Inner Strukture Invariance in Handwriting. In Proc. 11th Conference of the International Graphonomics Society (IGS), (2003). Scottsdale, Arizona, USA. pp. 308-311. 20. Zuo, L., Wang, Y. and Tan, T., Personal handwriting identification based on pca. In Second International Conference on Image and Graphics. International Society for Optics and Photonics, (2002). pp. 766-771. 21. Otsu, N., A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man and (1979) Cybernetics, 9(1), 62-66. 22. Shekar, B. H., and Bharathi, R. K., DCT-SVM-Based Technique for Off-line Signature Verification. In Emerging Research in Electronics, Computer Science and Technology, 2014, (pp. 843-853). Springer India. 23. D. Impedovo and G. Pirlo, Automatic signature verification: the state of the art. Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2008, IEEE Transactions on, vol. 38, no. 5, pp. 609–635. 24. Ferrer, M. A., Vargas, J. F., Morales, A., & Ordóñez, A., Robustness of offline signature verification based on gray level features. 2012, IEEE Transactions on Information Forensics and Security, 7(3), 966-977. 25. Miroslav, B., Petra, K. and Tomislav, F., Basic On-Line Handwritten Signature Features for Personal Biometric Authentication. In Proceedings of the 34th International Convention MIPRO, Opatija, May 2011,pp. 1458-1463. 26. Daramola, S. A., & Ibiyemi, T. S., Offline signature recognition using hidden markov model (HMM). (2010). International journal of computer applications, 10(2), 17-22. 27. Nguyen, V., Blumenstein, M., Muthukkumarasamy, V., & Leedham, G., Off-line signature verification using enhanced modified direction features in conjunction with neural classifiers and support vector machines. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) (Vol. 2, pp. 734- 738). IEEE. 28. Pourreza, Hamid Reza, Omid Mirzaei and Hassan Irani, Offline Signature Recognition Using Modular Neural Networks with Fuzzy Response Integration. 2011 International Conference on Network and Electronics Engineering, IPCSIT vol.11, Singapore. 29. Yilmaz, M.B., Yanikoglu, B., Tirkaz, C. 
and Kholmatov, A., Offline Signature Verification Using Classifier Combination of HOG and LBP Features. Biometrics (IJCB), 2011 International Joint Conference on Biometrics Compendium, IEEE, 978-1-4577-1359. 30. Srihari, S. N., Xu, A., and Kalera, M. K., Learning strategies and classification methods for off-line signature verification. In Ninth International Workshop on Frontiers in Handwriting Recognition, 2004. IWFHR-9 2004. (pp. 161-166). IEEE. 31. Shailaja Dilip Pawar, A Survey on Signature Verification Approaches. (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (2), 2015, 1068-1072. 32. Saul, L. K., and Rahim, M. G., Maximum likelihood and minimum classification error factor analysis for automatic speech recognition. (2000). IEEE Transactions on Speech and Audio Processing, 8(2), 115-125.


33. Lu, H., Plataniotis, K. N., and Venetsanopoulos, A. N., A survey of multilinear subspace learning for tensor data. Pattern Recognition, (2011). 44(7), 1540-1551. 34. Roweis, S. T. and Saul, L. K., Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science (2000). 290 (5500):2323–2326. 35. van der Maaten, L.J.P., Postma, E.O., and van den Herik, H.J., Dimensionality reduction: A comparative review. Tilburg, Netherlands: Tilburg Centre for Creative Computing, Tilburg University, Technical Report 2009-005. 36. Hoffmann, H. (2007). Kernel PCA for novelty detection. Pattern Recognition, 40(3), 863-874. 37. Turk, M., and Pentland, A., Eigenfaces for recognition. Journal of cognitive neuroscience, (1991). 3(1), 71-86. 38. Honarkhah, M and Caers, J, Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling. Mathematical Geosciences, (2010). 42: 487–517. 39. Abdi, H., and Williams, L.J., Principal component analysis. (2010). Computational Statistics, 2:433–459. 40. Ding, C., He, X., Zha, H., and Simon, H. D., Adaptive dimension reduction for clustering high dimensional data. In 2002 IEEE International Conference on Data Mining, ICDM 2002, Proceedings. (pp. 147-154). IEEE. 41. G. Eskander, R. Sabourin, and E. Granger, Hybrid writer-independent-writer-dependent offline signature verification system. 2013 IET Biometrics, vol. 2, no. 4, pp. 169–181, Dec. 42. Biswas, Samit, Bhattacharyya, Debnath, Kim, Tai-hoon, Bandyopadhyay, Samir Kumar, Extraction of Features from Signature Image and Signature Verification Using Clustering Techniques. (2010). Security-Enriched Urban Computing and Smart Grid, Springer, pp. 493-503, ISBN 3642164439. 43. Pushpalatha, K. N., Gautham, A. K., Shashikumar, D. R., ShivaKumar, K. B., & Das, R., Offline Signature Verification with Random and Skilled Forgery Detection Using Polar Domain Features and Multi Stage Classification-Regression Model. (2013). International Journal of Advanced Science and Technology, 59, 27-40. 44. Siddiqi, I. and Vincent, N., Text independent writer recognition using redundant writing patterns with contour-based orientation and curvature features. (2010), Pattern Recognition, 43(11), 3853-3865.

AUTHORS PROFILE
Dr. Anwar Yahya Ebrahim received her B.Sc. degree from Babylon University, Iraq, in 2000, her M.Sc. degree from MGM College, Dr. Babasaheb Ambedkar Marathwada University, India, in 2009, and her PhD from Universiti Teknologi Malaysia (UTM), Malaysia, in 2016. Currently she is a lecturer in computer science at Babylon University in Iraq. Her research interests are in signature identification and verification, feature recognition, image analysis and classification, enhancement and restoration, human activity recognition, data hiding (digital watermarking and steganography), image encryption, image compression, image fusion, image mining, digital image forensics, object detection, segmentation and tracking.
Dr. Hoshang Kolivand received the B.S. degree in Computer Science & Mathematics from Islamic Azad University, Iran, in 1997, and the M.S. degree in Applications of Mathematics and Computer from Amirkabir University, Iran, in 1999. He received his PhD from Universiti Teknologi Malaysia (UTM), Malaysia, in 2013. He was a lecturer in the Faculty of Computing, UTM, Malaysia, in 2015. Currently, he is a lecturer in the Department of Computer Science, Liverpool John Moores University, Liverpool, UK.

Dr. Mohd Shafry Mohd Rahim received his BSc and MSc degrees in computer science from Universiti Teknologi Malaysia (UTM) in 1999 and 2002, respectively, and his PhD degree in computer science from Universiti Putra Malaysia (UPM). Currently, he is an Associate Professor in the Faculty of Computing, UTM, and head of ViCubeLab, K-Economy Research Alliance. His research interests include computer graphics, visualization, spatial modelling, image processing, and geographical information systems.


Indoor navigation to estimate energy consumption in android platform 1Hasan Sajid Atta Al Nidawi, 2Ammar Khaleel, 3Kareem Abbas Dawood 1,3 Department of Software Engineering and Information System, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Malaysia. 2Department of Computer, Faculty of Education for Girls, University of Kufa, Iraq. Email: [email protected] , [email protected] , [email protected]

ABSTRACT Consumption of a mobile application energy (battery and data traffic) is still a primary concern to mobile manufacturers. It has been noted earlier that the consumption of a particular mobile application depends heavily on its software architecture. Therefore, mobile developers can make necessary design decisions based on the comparative study performed on different software architectures. This work presents the consumption analysis of two different software architectures: server-centric architecture and mobile-centric architecture, in order to show the least energy-consumption in Android-mobile application which is implemented to execute effectively the primitive operations for indoor navigation. To do so, either PowerTutor1.4 or Trepn Power Profiler will be applied to estimate the energy consumption. Ultimately, this application will be implemented on indoor navigation environment of first floor - IMAM HUSSEIN LIBRARY (IHL) at Imam Hussein Shrine, Karbala, IRAQ. KEYWORDS: software architecture; energy consumption; android mobile application; indoor navigation; 1. INTRODUCTION The popularity of smartphones and mobile apps has been increasing since the beginning of this century. According to [1] the number of mobile-cellular subscriptions increases from 738 million to about 7 billion currently. The energy conservation of mobile application relies heavily on its resource consumption such as battery use and network traffic [2][3] these two elements are considered as factors to determine the success of mobile applications. The resource consumption of mobile devices and their applications is a topic that has garnered significant attention recently [4]. Most researches have focused on optimizing the consumption of applications after they have been developed. Due to the limited resources of mobile devices, conducted industrial context studies are increased. This increase makes them an interesting subject of study in terms of consumption efficiency. Mobile applications that drain the device’s resources are soon rejected by their users [5]. Therefore, the development of a mobile application should include analysis of consumption patterns. These consumption patterns are determined primarily by software architecture [6]. The most established architecture for mobile apps is the server-centric approach [7], whereby mobile devices are acting as simple clients and tasks such as information storage, processing, and communication tasks are delegated in the cloud. However, there are other emerging mobile-centric architectures [8] inspired by distributed processing which are gaining in relevance for mobile-to-mobile service provisioning. Resource consumption is also a concern for these architectures. In this context, there has been no work assisting developers in choosing the most suitable software architecture for their applications in terms of resource consumption except [9], and their work still limited to a number of case studies, architectures, and real applications. Thus, in this research the authors aim to present analysis of two architectures (server-centric architecture and mobile centric architecture) to show which of these architectures is less demanding for energy consumption in mobile devices. Generally, wireless technology oriented indoor navigation application will be deployed at first floor of IMAM HUSSEIN LIBRARY (IHL), Karbala as the case for this study. Wireless technologies such as Bluetooth, Wi-Fi, signals of cellular towers and ZigBee are common. 
Amongst these technologies, Wi-Fi is the most popular one because it does not require additional infrastructure to locate each mobile device. Therefore, a "Wi-Fi trilateration approach" will be used in our indoor navigation application. This approach uses the signal strength to estimate the distance between the user and each transmitter. Moreover, the Spherical Trilateration Algorithm, which uses the parameters of known Wi-Fi networks such as frequency, signal strength, network MAC address and the real coordinates of the Wi-Fi access points, will be implemented in this work. This research presents the consumption analysis of navigation for localization within the indoor environment of the first floor of IMAM HUSSEIN LIBRARY (IHL), Karbala, Iraq. An effective application (for Android smartphones) will be implemented to execute primitive operations (like post, get) for indoor navigation. This application will be built with two different architectures (server-centric or mobile-centric) in order to


identify the least energy-consuming architecture. In addition, either PowerTutor 1.4 or the Trepn Power Profiler will be used to estimate the energy consumption [10]. The remainder of the paper is organized as follows: Section 2 describes the related work, Section 3 presents the method, Section 4 discusses the limitations and, finally, Section 5 concludes the paper.

2. RELATED WORKS Resources of mobile devices are generally limited. One of the most important resources in mobile devices is energy recourse (battery). Various strategies have been proposed by [11] to reduce battery consumption. Technique off-loading resource-consuming tasks to cloud servers, for example, has been adopted by commercial mobile applications. However, it is not applicable for application that is processing data stored locally (not on server). Besides, managing resource consumption at the level of the device’s operating system has been proposed; however, most developers found that it is challenging to perform this task. Studies such as [12][13] focus on the battery consumptions of different mobile networking technologies including Wi-Fi and 3G and the authors have proposed a new communication protocol to reduce energy consumption by delaying some communications or increasing data traffic through pre-fetching information. Various energy-saving methods such as scheduling data transmission between mobile devices and cloud servers are reported by [14][15]. In order to characterize the energy consumption, energy demands of mobile devices are determined from both hardware and software [13]. Their study has led to the creation of an energy- aware operating system for mobile devices designed to reduce the energy consumption of mobile applications. The resource consumptions within specific applications are studied by [16]. As reported, a fine-grained energy profiler for smartphone applications is applied to measure the energy spent within an application in performing tasks such as rendering images on the screen or building an internal database for the application. While this information is beneficial for developers seeking to improve resource consumption, the application must be built before the analysis can be executed. Thus, this strategy is not useful at the design stage of a mobile application [16]. The resource consumption of a wide array of sensors embedded in mobile applications has been studied by [17], They have proposed a solution to manage the sensing requirements of all the applications running on a mobile device in order to reduce the energy consumption. However, detailed information is not provided on how to design the least-consuming application. While, a set of indicators has been proposed in [18] to measure power consumption. The authors concluded that McCabe cyclomatic complexity, weighted methods per class, nested block depth, number of overridden method, number of methods, total lines of code, method lines of code and number of parameters have strong bivariate correlations with the power consumption. Therefore, these metrics can be adopted as indicators to estimate the power consumptions of mobile applications. So far, various techniques have been proposed to measure energy consumption such as external power monitor [19][20]. Also, the consumption information from the battery and the modified kernel has been evaluated by [21]. In general, consumption information obtained from the devices is reliable for different types of analyses and experiments such as those proposed in the present work. A conceptual framework has been proposed by [9] to help mobile developers during the architectural decision making process. By estimating the energy consumption of mobile applications constructed under different software architectures, the proposed framework allows developers to analyze the resource consumption and its variations as the applications are scaled up. 
To that end, the framework analyzes the consumption of a set of primitive operations that can be used to compose complex social applications. In short, topic such as resource consumption of mobile devices has garnered significant attention in recent years [22]. Most of the studies focus on optimizing the consumption of applications upon the development stage. However, to the best of the authors’ knowledge, work related to choosing the most suitable software architecture for mobile applications in terms of resource consumption is rather limited [6]. 3. METHODOLOGY The main concern is to identify which approach; server-centric architecture or mobile-centric architecture, is going to consume less energy in android mobile applications to execute primitive operations for indoor navigation of IHL, Karbala. Due to the fact that there is no useable map within (IHL), an indoor map is created as well in the current work. This map (include all rooms and exits) can be assessed near the emergency exits. After the structure of the inner first floor-IHL building is modeled, walkable area inside the floor is then defined which is essential for route calculation. The indoor localization method based on Wi-Fi signal strength trilateration technique is considered. It is simple in realization and estimation and can localize position of a

JOURNAL OF SOFTWARE ENGINEERING & INTELLIGENT SYSTEMS ISSN 2518-8739 30th April 2018, Volume 3, Issue 1, JSEIS, CAOMEI Copyright © 2016-2018 www.jseis.org mobile device within building. The methodology focuses on many activities to achieve analysis of these two architectures (server and mobile-centric architecture) for energy consumption as shown in Section 3.1. 3.1 Wi-Fi trilateration approach In this approach, signal strength will be used to predict the distance between user and three access points. Here, the Spherical Trilateration Algorithm will be implemented whereby the distance is estimated by the signal strength which is presented as a circle centered at each access point. The three circles may intersect to form a point or an area of receiver. Figure 1 shows indoor localization area provided by the trilateration approach.

Figure. 1 Indoor localization area provided by the trilateration approach
3.2 Application architectures
The indoor navigation application can be implemented with either the server-centric (SC) or the mobile-centric (MC) architecture, but its behaviour differs depending on which architecture is used. With a server-centric architecture (see Figure 2), the user's location is computed and stored on a server. The client (mobile device) is assumed to have Wi-Fi capability, and the Wi-Fi connection is turned on automatically as the application starts. Using Android's WifiManager API, the device scans for all available connections; this information contains the Service Set Identifier (SSID), Received Signal Strength Indicator (RSSI) and MAC address of each access point. The current module selects only the pre-defined SSIDs and plugs the corresponding RSSIs into the Wi-Fi localization algorithm, which runs on the server to produce the coordinates (X, Y). Therefore, the RSSI readings must be posted to the server in order to obtain the user location from it. On the contrary, with a mobile-centric architecture (see Figure 3), the user location is computed and kept internally on the client (mobile device) and provided as a service to the application.
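The sketch below contrasts where the localization step runs under the two architectures; the server endpoint, payload format and helper names are assumptions made for illustration, not part of the described system.

```python
# Rough sketch (assumed endpoint and response schema) of the two architectures.
import requests

def locate_server_centric(rssi_readings,
                          server_url="http://example.org/locate"):  # hypothetical endpoint
    """Server-centric: post the RSSI readings; the server runs the
    localization algorithm and returns the (x, y) position."""
    resp = requests.post(server_url, json={"rssi": rssi_readings}, timeout=5)
    resp.raise_for_status()
    return tuple(resp.json()["position"])   # assumed response schema

def locate_mobile_centric(rssi_readings, localize):
    """Mobile-centric: the device runs the localization algorithm locally
    (e.g. the trilateration sketch above); no request is sent, so the energy
    cost shifts from network traffic to on-device computation."""
    return localize(rssi_readings)
```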

Figure. 2 Server-centric architecture for IHL


Figure. 3 Mobile-centric architecture for IHL 3.3 Experimental setup Predicting the resource consumption of an application is not a simple task and trying to estimate that consumption under different architectures is even harder. To get the most accurate measurements, a prototype including the most significant functionalities would have to be built for each architecture. However, this is generally unfeasible because of the cost and effort required. To this end, We plan to apply the new approach (framework) presented by [9] via identifying the commonest operations like (post, get) of an app and measures its consumption. Then, the important functionalities of an app can be composed from these primitive operations, and the expected consumption of an app can be extrapolated based on the consumptions of the primitives. This method has been used at high abstraction levels on social network case study [9]. In the current work, we aim to develop an application indoor navigation applying the same technique utilized in social network on indoor navigation case study to identify the least energy-consuming architecture. The consumption of the battery and the data traffic (either received or transmitted) can be measured with each primitive operation executed and registered by the PowerTutor1.4 or Trepn Power Profiler. To ensure measurement accuracy, this indoor navigation app must be executed without relegating to the background by the operating system. Also, the executions of other applications are not permissible during the measurement. Finally, based on the analysis of the two architectures, a decision should be made to conclude: under which architecture (server or mobile-centric architecture) this application will be less energy-consuming for battery life in an indoor navigation. 4. LIMITATIONS Due to the fact that there are no useable maps within IHL, an indoor map should be created as well in the current work. Also, to get the most accurate measurements, a prototype includes the most significant functionalities would have to be built for each architecture. These map and prototypes would then be used to perform different simulations in conditions close to the real execution environments [23]. However, this is generally unfeasible because of the cost and effort required. Furthermore, the efforts put in would not be reusable for measuring the consumption of other applications since these would also require their own prototypes with which to compare the different architectures. 5. CONCLUSION The study focuses on the analysis of two architectures (server and mobile-centric architecture): which of the two architectures (server or mobile-centric) is less demanding for energy consumption in mobile devices. Generally, wireless technology oriented indoor navigation application will be deployed at first floor of Imam Hussein Library, Karbala as the case for this study. In this research, some facts will be found out that contribute to the body of knowledge and this expected fact can be summarized as such: firstly, this research will detail up and conclude which architecture; server centric architecture or mobile-centric architecture consumes less energy in android mobile applications (indoor navigation). This fact at the design phase is crucial for developers to be able to reduce resource consumption and hence, increase the likelihood of success of their apps. Secondly, to


develop a prototype to implement the aforementioned two architectures by executing primitive operations for indoor navigation of the first floor of IMAM HUSSEIN LIBRARY (IHL) at Karbala, IRAQ.
ACKNOWLEDGMENT
The authors owe the Imam Hussein Library Management Centre a great debt of gratitude for their indispensable support and cooperation. Sincere gratitude also extends to the students and other individuals who are either directly or indirectly involved in this project.
REFERENCES
1. B. Sanou, “Facts & Figures,” 2015. 2. A. Merlo, M. Migliardi, and L. Caviglione, “A survey on energy-aware security mechanisms,” Pervasive Mob. Comput., vol. 24, pp. 77–90, 2015. 3. K. Lee, A. Member, J. Lee, S. Member, and Y. Yi, “Mobile Data Offloading: How Much Can WiFi Deliver?,” vol. 21, no. 2, pp. 536–550, 2013. 4. R. Hans, D. Burgstahler, A. Mueller, M. Zahn, and D. Stingl, “Knowledge for a Longer Life: Development Impetus for Energy-efficient Smartphone Applications,” 2015. 5. C. Wilke, S. Richly, G. Sebastian, C. Piechnick, and U. Aßmann, “Energy Consumption and Efficiency in Mobile Applications: A User Feedback Study,” 2013. 6. A. Khaleel, “ENERGY CONSUMPTION PATTERNS OF MOBILE APPLICATIONS IN ANDROID PLATFORM: A SYSTEMATIC LITERATURE REVIEW,” vol. 95, no. 24, 2017. 7. W. Xu, Wu, Daneshmand, Liu, “A data privacy protective mechanism for WBAN,” Wirel. Commun. Mob. Comput., no. February 2015, pp. 421–430, 2015. 8. J. Guillen, J. Miranda, J. Berrocal, J. Garcia-Alonso, J. M. Murillo, and C. Canal, “People as a service: A mobile-centric model for providing collective sociological profiles,” IEEE Softw., vol. 31, no. 2, pp. 48–53, 2014. 9. J. Berrocal et al., “Early analysis of resource consumption patterns in mobile applications,” 2016. 10. J. Bornholt, T. Mytkowicz, and K. S. McKinley, “The model is not enough: understanding energy consumption in mobile devices,” Power (watts), vol. 1, no. 2, p. 3, 2012. 11. N. Vallina-Rodriguez and J. Crowcroft, “Modern Mobile Handsets,” pp. 1–20, 2012. 12. N. Balasubramanian, “Energy Consumption in Mobile Phones: A Measurement Study and Implications for Network Applications,” pp. 280–293, 2009. 13. N. Vallina-Rodriguez, P. Hui, J. Crowcroft, and A. Rice, “Exhausting Battery Statistics,” no. February, pp. 9–14, 2010. 14. M. B. Terefe, H. Lee, N. Heo, G. C. Fox, and S. Oh, “Energy-efficient multisite offloading policy using Markov decision process for mobile cloud computing,” Pervasive Mob. Comput., vol. 27, pp. 75–89, 2016. 15. T. Shi, M. Yang, X. Li, Q. Lei, and Y. Jiang, “An energy-efficient scheduling scheme for time-constrained tasks in local mobile clouds,” Pervasive Mob. Comput., vol. 27, pp. 90–105, 2016. 16. A. Pathak and Y. C. Hu, “Fine Grained Energy Accounting on Smartphones with Eprof,” pp. 29–42, 2012. 17. A. A. Moamen, “ShareSens: An Approach to Optimizing Energy Consumption of Continuous Mobile Sensing Workloads,” 2015. 18. C. K. Keong, K. T. Wei, A. Azim, A. Ghani, and K. Y. Sharif, “Toward using Software Metrics as Indicator to Measure Power Consumption of Mobile Application: A Case Study,” pp. 172–177, 2015. 19. L. Zhang, R. P. Dick, Z. M. Mao, Z. Wang, and A. Arbor, “Accurate Online Power Estimation and Automatic Battery Behavior Based Power Model Generation for Smartphones,” pp. 105–114. 20. W. Jung, C. Kang, C. Yoon, D. Kim, and H.
Cha, “DevScope : A Nonintrusive and Online Power Analysis Tool for Smartphone Hardware Components,” pp. 353–362, 2012. 21. C. Yoon, D. Kim, W. Jung, C. Kang, and H. Cha, “AppScope : Application Energy Metering Framework for Android Smartphones using Kernel Activity Monitoring.” 22. R. Hans, D. Burgstahler, A. Mueller, M. Zahn, and D. Stingl, “Knowledge for a Longer Life : Development Impetus for Energy-efficient Smartphone Applications,” no. June, 2015. 23. R. Mittal, A. Kansal, and R. Chandra, “Empowering developers to estimate app energy consumption,” Proc. 18th Annu. Int. Conf. Mob. Comput. Netw. - Mobicom ’12, p. 317, 2012.

AUTHORS PROFILE


A review of software fault detection and correction process, models and techniques 1Sabia Sulman, 2Bushra Nisar 1,2International Islamic University, Islamabad, Pakistan Email: [email protected], [email protected]

ABSTRACT In software development life cycle, the most important activity is software maintenance, in order to get a reliable and quality product. Huge amount of time, cost and effort is involved in it. Maintenance of software encompasses various activities like prediction, detection, prevention and correction of fault. Due to refined and multifaceted applications, clustered architecture, artificial intelligence and commercial hardware are in use. Hence, in this research work a review is conducted in the field of software fault detection and correction. There are a lot of software reliability growth models and techniques which help in software fault detection and correction, nevertheless, the room for more models and processes is vacant to detect and correct faults. Keywords: software maintenance; reliability; software fault detection; software fault prevention; reliability models; 1. INTRODUCTION An error, bug or fault in computer program is a software defect. Quality of the software is decreased if a system produces an incorrect, unwanted and unexpected result, which occurs due to the software faults. Software maintenance is of paramount importance as it involves a great quantity of efforts and cost. For the development of high assurance systems, reliability and performance requirements are essential. Software testing is the way toward practicing a program with the specific expectation of finding faults preceding delivery to the clients. At times, fault correction is not performed promptly once a failure is recognized [1]. Schneidewind [2] announced that the software developers may delay fault correction in situations where the failures were classified as non- crucial on the current release, and were not viewed as crucial for one to three releases later on. The software is a solo entity which has recognized a strong influence on all the domain software. For their accurate and reliable service need, the domain activities always demand for quality software [3-5]. Software quality means a faultless product, which will be capable of producing anticipated results within the limitations of cost and time. Furthermore, fault detection and testing the system has become quite important processes in software life cycle. In order to detect faults and correct them, numerous fault prediction techniques, fault detection and correction processes, and reliability growth models are proposed and analyzed [6]. The contribution of this paper is to discuss and present the work done to detect and correct software faults.

2. RELATED WORK To gauge and to equate testing scope approach, [7] authors suggested transformation analysis. In order to check the viability of test suits, a statistical investigation was made to check the observational outcomes. This was hard to discover the genuine faults in software and approve the basis that is the reason countless faults were infused in the software to think about the test suits on the product (software). The mutants were created to demonstrate the fault discovery successfully. The outcomes were extremely predictable over the examined criteria as they have demonstrated that the utilization of transformation administrators was reliable outcome. The advantage of the mutant era was that the mutant administrators portrayed in precise way and the fault infusion process executed effectively to recognize flaws. An extensive number of mutants diminished the effect of arbitrary variety in the analysis and permitted the utilization of various examination approaches. This approach concentrated on the decision of low cost and also viability of test suits. In [8] authors proposed the mechanized static analysis to recognize faults in software. The quantity of procedures as software audit, software review and testing were utilized to identify faults, however, every strategy was not compelling to distinguish mistakes. To dispose of faults and errors before the delivery of software, the static analysis tools were utilized. The static examination was a modest approach and an inexpensive approach. There was no need of execution because the faults were noticed physically and consequently in Nortel programming items. Automated static analysis utilized Orthogonal Defect Classification Proposition to recognize task and checking flaws. The programming errors which brought on security susceptibilities were additionally identified via automated static investigation strategies. The outcomes show that to improve the quality of software


JOURNAL OF SOFTWARE ENGINEERING & INTELLIGENT SYSTEMS ISSN 2518-8739 30th April 2018, Volume 3, Issue 1, JSEIS, CAOMEI Copyright © 2016-2018 www.jseis.org in financial way, the computerized static examination was an effective method. Be that as it may, there were a few restrictions additionally as the outcomes just gave by utilizing three bigger Nortel systems for C/C++. In [9] researchers proposed a joined way to identify faults in in embedded control software framework. At two levels that is programming level and controlled-framework level, the viewed data was checked with the suitable basics. The inserted control framework contains both programming and equipment parts. A model-based procedure was utilized to identify faults amid run time operation. The proposed approach was utilized to look at the conduct of programming that the usefulness was performed in coveted way or not. When the examined conduct was dismissed by a software level screen, the software fault was distinguished. To segregate the faults in programming, the I/OEFA (input-output extended finite automata) were utilized. To discover and identify faults various existing procedures including N-version programming and Formal verification were utilized, however there were a few constraints like as intricacy and absence of finish determination and so forth. Along these lines, a two-tier method was utilized to discover all faults in software. In any case, the approach guarantied no false- alert. In [10] researchers used automated search and suggested statistical testing to identify faults in software. Statistical and the basic testing were dignified to identify faults. The statistical testing was ideal to identify faults than the basic and irregular testing since it considered the approach of both procedures to minimize the issues. To identify faults in real world applications, this research depended on the robotized search techniques. The uniform irregular testing was more productive in vast test sizes than statistical testing and the dispersions have demonstrated the reduced fault discovery capability. To create test data in samples, the probability distribution was utilized. Be that as it may, it was extremely hard to derive probability distribution for huge and multifaceted software. Search Based Software Engineering (SBSE) was utilized to derive probability distribution in statistical testing which was actualized physically. It has demonstrated that the method was powerful and applicable for bigger input area/domain. In [11] authors projected an immaculate model-based way to deal with challenges in the distributed frameworks. Automatic system monitoring and recuperation enhanced the trustworthiness viably in distributed programming frameworks yet it was exceptionally troublesome to discover the faults. The strategies considered are Bayesian estimation and Markov decision theory to give recuperation in a framework. This approach investigated more focal points by relationship of checking and recuperation techniques in examination of utilizing them independently. The faults were infused in practical web-based business frameworks to approve this approach. By utilizing automatic system monitoring and recuperation methods, the constancy of disseminated programming framework was enhanced. The recuperation was exceptionally troublesome on the grounds that it was hard to pick devices and strategies to discover the faults in distributed software system. 
The outcomes have demonstrated that the approach was of minimal cost and gave high accessibility solution for distributed framework. 2.1 Software fault prediction techniques The expenses of finding and adjusting software deserts have been the costliest movement in software improvement. The exact forecast of defect‐prone software modules can help the product testing exertion, decrease costs, and enhance the software testing process by concentrating on fault inclined module. The new methodology is proposed for software fault prediction, which is the combination of two techniques “bagging and swarm optimization”. From the result it is found that the proposed method is best for the improvement in software quality as compared to other fault prediction techniques [12].The software administration under various datasets with the criticality disclosure is played out and in addition keeps up the product framework execution under the fault prediction models and characterized the fault examination based model utilizing the metric examination and the fault based assessment. Author performed the dependability and additionally the execution estimation. An anomaly identification approach is characterized, so that contrast metric investigation should be possible effectively [13]. In order to identify and recognize the system reliability and accessibility under different software features, an Artificial Immune system-based fault prediction program is characterized. The ongoing programming framework investigation, under the constancy appraisal by dealing with the software prerequisite, is defined. Author additionally play out the characterization of these modules under the fault criticality level and also in view of the fault recurrence. The individual module investigation is played out and also the entire framework examination under the information examination and in addition class level investigation is performed. The model introduced by the author was motivated from the Immune framework adapted in various applications. The fault prediction demonstrating at the class level, is characterized [14].
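As a rough illustration of metric-based fault prediction, the sketch below trains a bagging ensemble on hypothetical module metrics; the swarm-optimisation component combined with bagging in [12] is omitted, and the metric columns, labels and scikit-learn ≥ 1.2 argument names are assumptions for the example.

```python
# Minimal sketch of metric-based fault prediction with bagging (scikit-learn).
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
# Columns stand in for e.g. cyclomatic complexity, lines of code, coupling, inheritance depth.
X = rng.gamma(shape=2.0, scale=5.0, size=(400, 4))
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=3.0, size=400) > 18).astype(int)  # 1 = fault-prone

model = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                          n_estimators=50, random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="f1").mean())
```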


To find out the value of object-oriented matrices in foreseeing fault-prone classes, various design attributes are examined and concluded that the coupling and complexity are more responsible for the occurrence of fault in the classes [15]. Different approaches to reduce difficulties in different procedures related to software development are analyzed such as, the defects in software modules by using object-oriented approach to minimize testing efforts and time for software development, development time analysis for better reliability, extensive case analysis, attribute selection for better software classification and the prediction model to identify software reliability [16]. An examination on the design metrics alongside fault investigation for the object-oriented programming modules is considered. The product framework investigation under the machine learning calculation is described, so that the better outcome examination will be performed. In a research work, authors performed work, datasets and experiments demonstrated the powerful validation and seclusion of faults in these software systems [17]. A suit of configuration measures to foresee the software fault in object-oriented programs is exhibited. For playing out the fault discovery in an object-oriented software system, a probabilistic approach is defined. The principle target was to concentrate on the product item and the quality. The product examination under the recurrence investigation of the faults particular to the classes and the fault frequency is considered. The fault inclination in the product system is additionally concentrated. [18] A calculation examination based forecast model for object- oriented software is exhibited, so that the product quality will be recognized under the software fault investigation and programming cost investigation. For the quality characteristic forecast with different, a constructing model with clustering approaches is characterized. Author played out the investigation under the FCM and neural system based forecast display so that the successful characterization of the product measurements will be done. Likewise, the experimentation on the product framework under the quality investigation and the fault inclination is measured. Experimentation for the viability assessment is considered so that the more precise framework will be constructed. [19] There are innumerable techniques proposed to predict software faults but none has proven to be perfect and complete. While the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is tested on the basis of java based object-oriented software system and results showed that this technique is useful for prediction of fault proneness. The proposed technique was tested at different numbers of matrices and was concluded that with the reduced number of matrices accuracy increases significantly which proves it to be the best technique for software fault prediction [20]. The quality vector analysis under the software fault investigation is characterized. In order to do more exact module investigation, the prediction model under the performance examination is considered. Author characterized the work for java based programming framework [21]. A fault prediction analysis-based review for the open source software framework is demonstrated. Author characterized the part for the software applications and also characterized a level program so that the software adequacy will be drawn. 
The code-based examination under the attributes investigation and the investigation is performed. The framework under the fault prediction abilities is characterized [22]. A fault forecast based review under the UML framework displayed, so that the fault analysis in the product framework will be made. The forecast demonstrate for the logistic regression is considered and also the UML measurements examination for the software codes are described. The code estimation under the direct scaling is additionally played out, so that the unit change will be evaluated completely and fault prediction will be done adequately [23]. 2.2 Software fault detection and correction process Many automatic tools and techniques analyzed that, these automatic tools are not enough to detect and correct architectural software design defects fully automatically so, still some manual effort is required in these tools. There is a need in future work to propose the fully automatic ASD defect detection tool and semi-automatic fault correction tools [24]. A novel modeling system for software FDP and FCP is produced and its parameter estimation with WLS technique is contemplated. The exploration depends on the speculation on GOS that the identification time and redress span of every fault around and autonomously takes after certain stochastic distribution, and the created FDP and FCP models have a general structure. Utilizing two datasets of which one is from distributed work and the other one is gathered amid the development of a practical software system. The use of the models represented that the proposed model with the WLS evaluation has brought down expectation mistakes. This is on account of the proposed demonstrate contains the fault correction data and additionally the detection data, and the WLS estimation weights more on the errors of recent information, which have more bits of knowledge with the future test process. There are a few constraints in ebb and flow research, for example, it sets the exponential order statistic as illustration, and this model neglects to fit the S-formed fault discovery dataset [25].
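To make the paired detection/correction idea concrete, the sketch below evaluates a simple combined model in which detection follows a Goel-Okumoto NHPP and each detected fault is corrected after an exponentially distributed delay; this is an illustrative formulation under assumed parameter values, not the specific model of [25].

```python
# Expected detected vs. corrected faults over time for a simple FDP/FCP pair:
# detection mean value m_d(t) = a(1 - exp(-b t)); corrections lag detections
# by an exponential(c) repair time, giving the closed form m_c(t) below (b != c).
import numpy as np

a, b, c = 100.0, 0.05, 0.10           # total faults, detection rate, correction rate (illustrative)
t = np.linspace(0.0, 120.0, 7)

m_d = a * (1.0 - np.exp(-b * t))
m_c = m_d - a * b / (b - c) * (np.exp(-c * t) - np.exp(-b * t))

for ti, di, ci in zip(t, m_d, m_c):
    print(f"t={ti:5.1f}  detected={di:6.2f}  corrected={ci:6.2f}")
```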


Exploration of fault detection mechanism, and additionally fault counteractive action mechanism in connection to the late pattern of the most recent advances have been talked about. Their fault discovery and software frameworks used to analyze the immeasurable number of strategies and methods, however not each tech suits each framework. Selection of, technology system arrangement, technology platform, size and intricacy of adaptableness and reliability targets, is driven by critical factors. Automated approach to distinguish an inclination to more elevated amounts in hybrid mining methods and statistical models are in inclining toward more traditional systems-oriented solutions for diagnostics and aversion. Fault handling in advanced applications are in the early phases of research and the solution architecture attempt to manufacture resistance level however much as could reasonably be expected [26].The software reliability modeling is analyzed and a Bayesian way to deal with gauge the obscure parameters for FDP with failure checks is displayed. A far reaching and efficient review on the Bayesian estimation for consolidated FDP and FCP is carried out. The techniques are created under the Bayesian system to produce a progression of arbitrary samples from the posterior distributions of the parameters. To show the proposed technique, a reproduction study is directed to look at the performance of the proposed Bayesian approach with that of MLE strategy. Simulation studies expose that the Bayesian technique performs exceptionally well in numerous settings without torment from the convergence issue. A numerical illustration is applied and demonstrated the proposed Bayesian approach can be connected to more broad circumstances contrasted with MLE technique. Additionally, the proposed Bayesian technique is more effective in computational speed than MLE strategy [27]. Precompiled Fault Detection (PFD) technique is proposed to overcome the problem of fault occurrence in runtime. This technique is used to detect and repair the fault before the compilation of source code. PFD technique is tested by using the simulators and it is concluded that by using this technique errors are removed appropriately before the execution of source code and the reliability of software increased [28]. A structure of Software Reliability Estimation in view of Fault Detection and Correction Processes is produced and parameter estimation issue in this circumstance is considered. The correct expulsion time interim of every fault is considered. To depict this kind of dataset, the fault expelling matrix is characterized. The proposed demonstrating structure is connected to a real three-edition test dataset. The outcomes demonstrated that the proposed process shows structure with Maximum Likelihood estimation delivered enhanced parameter measures and a dependable stochastic model. One test dataset with three discharges from a viable software project is connected with the proposed structure, which indicated suitable performance. Likewise, the proposed process shows system can give a decent estimation and expectation on the dataset with numerous discharges [29]. 2.3 Software reliability growth Models Software reliability is defined as the probability of errors occurred in software, considered as the measurement scale of software quality. Usually software growth models are used only for fault detection. 
Detecting faults alone is not enough; they must also be corrected to ensure the reliability of the software. Therefore, several useful techniques for the detection and correction of faults are discussed, and experiments show that maximum likelihood estimation (MLE) has a better fault-prediction capability than least squares estimation (LSE). Both testing time and testing coverage are important for reliability [30]. Previously, a one-dimensional approach was used in software reliability growth models, focusing either on testing time or on testing coverage; therefore, a multi-release, two-dimensional growth model is introduced in which testing time and testing effort are treated as inputs that affect the testing resources consumed [31]. A process-level redundancy (PLR) approach has been proposed for soft errors, which have a serious impact on software reliability. PLR is a software-based technique that focuses on detecting errors that fall within the sphere of replication (SoR). PLR exploits emerging multi-core processors and lets the operating system freely schedule the replicated processes onto all available resources; deploying the technique requires no modification of the operating system or the hardware [32]. Several software reliability growth models based on statistical assumptions have been introduced in the last three decades for fault detection and correction. Many of them assume that fault detection and correction are done in parallel, but in reality there is always a time lag between the detection and correction processes, and allocating resources between them is a recurring problem. A mathematical optimization model is proposed for the resource allocation of fault detection and correction [33]. A study of software reliability growth models shows that these models assume either an infinite or a finite number of failures. One-dimensional models work on either testing time or testing effort, which is not enough for accurate software reliability estimation; to resolve this problem, two-dimensional models are preferred [34].


Queuing-based models are well suited to achieving reliability accuracy; however, these models do not take into account the amount of resources consumed during the fault detection and fault correction processes. Therefore, a new model integrates the Exponentiated Weibull and Logistic testing-effort function (TEF) techniques, which are very useful for capturing the amount of resources consumed during the testing process; this model improves the estimation and evaluation of software [35]. In the past, SRGMs were based on the assumption that fault detection and removal happen at the same time; in reality this is not possible. Observation shows that mutually dependent faults can only be removed if the leading faults are removed first. Along these lines, the notions of fault dependency and a time-dependent delay function are incorporated into software reliability growth modeling [36]. Testing is an essential phase in every software development life cycle (SDLC), and fault detection and correction is one of its important steps. Various SDLCs have been developed over the last three decades, but the majority of them were developed using a static approach. A main reason testing is important is that it consumes resources in very large quantities. To consume resources efficiently, a mathematical SRGM is introduced, and Pontryagin's maximum principle is used to solve the model; moreover, Differential Evolution (DE) is used for resource allocation [37]. SRGMs have their own approaches: some are flexible according to the situation while others are not, so selecting an appropriate process model is not an easy task. To overcome this problem, many authors have contributed different approaches. It is observed that the random lag function, infinite server queuing theory, and hazard rate function approaches differ in their assumptions but are mathematically equivalent; it is also concluded that the hazard rate function is best for general use and provides a common platform for perfect and imperfect SRGMs [38]. SRGMs were developed for the detection and correction of faults in software, but many of them do not consider resource expenses. To address this issue, a new model is introduced that combines the resource expenditure spent on the software debugging process with a change point. A real software failure project demonstrated the adequacy of the proposed models, and numerical results show that the new queuing model gives a better fit and better estimates [39]. To meet the high demand for software reliability, an SRGM is needed that estimates testing effort and resource utilization. To overcome these problems, an improved GO model is introduced that considers two factors affecting fault detection effectiveness: human learning ability and a fault detection rate based on observing the remaining faults after fault removal. This model was tested on failure data and proved to fit more accurately than other models [40]. This analysis is based on the introduction and renewal of different approaches to reach a certain level of reliability and perfection in software development; for instance, an evolutionary algorithm is defined to design an effective, fault-free application, a resource-utilization approach is defined, and more [41].
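To make the kind of model fitting surveyed above concrete, the following is a minimal, illustrative sketch rather than the implementation of any cited study: it fits the classic Goel-Okumoto NHPP mean value function m(t) = a(1 - e^(-bt)) to a small, invented set of weekly fault counts by maximum likelihood, and contrasts the result with a crude grid-based Bayesian posterior mean for the same parameters. The data, the flat prior and the grid ranges are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Invented weekly fault counts; real studies use project failure data.
t = np.arange(1, 9, dtype=float)                  # end of each test week
n = np.array([12, 9, 7, 6, 4, 3, 2, 1], float)    # faults detected per week

def mvf(t, a, b):
    """Goel-Okumoto mean value function: expected cumulative faults by time t."""
    return a * (1.0 - np.exp(-b * t))

def log_lik(params):
    a, b = params
    if a <= 0 or b <= 0:
        return -1e12                               # keep the optimizer in-bounds
    mu = np.diff(mvf(np.concatenate(([0.0], t)), a, b))   # expected faults per week
    return float(np.sum(n * np.log(mu) - mu - gammaln(n + 1)))   # Poisson log-likelihood

# Maximum-likelihood estimate.
res = minimize(lambda p: -log_lik(p), x0=[n.sum(), 0.2], method="Nelder-Mead")
a_mle, b_mle = res.x

# Crude Bayesian estimate: flat prior over a coarse parameter grid, posterior mean.
A, B = np.meshgrid(np.linspace(30, 120, 150), np.linspace(0.02, 1.0, 150))
logp = np.vectorize(lambda a, b: log_lik((a, b)))(A, B)
post = np.exp(logp - logp.max())
post /= post.sum()

print(f"MLE:       a={a_mle:.1f}, b={b_mle:.3f}")
print(f"posterior: a={(A * post).sum():.1f}, b={(B * post).sum():.3f}")
```

With real failure data, the same likelihood can be reused unchanged; only the count vector and the time grid need to be replaced.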
Object-oriented metrics are introduced to separate faulty from non-faulty modules and to assess the design using automated tools so that fault analysis in modules can be done effectively. There are a number of strategies and methods used to recognize faults in a software framework, but every strategy also has some constraints. SRGMs involve stochastic processes for the statistical examination of software. How an SRGM is used to estimate the time delay and minimize the cost of software products is examined. Applications of SRGMs in some critical systems, for example safety-critical flight software, online banking services and large-scale business software, are considered. The proposed work is to apply an SRGM to web applications to test and compute their consistency. First, the web application is tested to check the presence or absence of faults, and then the SRGM is used to discover the fault identification rate and to determine the reliability of the web application [42]. A new model is introduced to compare the fault detection and correction processes in terms of the number of faults. Most SRGMs are used for fault detection while fault correction is ignored, because it is assumed that faults are eliminated immediately and efficiently; this is not realistic. Fault correction is the difficult part, because it is hard to correct all faults. When faults are detected, they are located and removed by making changes to the code during the fault correction process. A logistic function is used to model both the fault detection and the fault correction processes based on an NHPP, and the dependency between the two processes is described by the ratio of the corrected fault number to the detected fault number in the software [43]. A software reliability growth model incorporating Burr type XII testing effort is introduced for the detection and correction of faults in software, which helps increase reliability, an important factor for maintaining software quality. This model is statistically evaluated and compared with other models involving different techniques. The experimental results show that the model is much better at detecting and correcting software faults, which decreases the failure rate of the software [44].
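As a rough illustration of the paired detection and correction modelling summarised above, the sketch below evaluates two logistic mean value functions in which the correction curve simply lags the detection curve, and reports the ratio of corrected to detected faults over time. The parameter values and the fixed lag are assumptions for illustration only, not the calibration used in the cited work.

```python
import numpy as np

def logistic_mvf(t, a=100.0, k=0.5, t_mid=10.0):
    """Expected cumulative number of faults by time t under a logistic curve."""
    return a / (1.0 + np.exp(-k * (t - t_mid)))

detected  = lambda t: logistic_mvf(t, t_mid=10.0)
corrected = lambda t: logistic_mvf(t, t_mid=13.0)   # corrections lag detections

for t in (5, 10, 15, 20, 30):
    d, c = detected(t), corrected(t)
    print(f"t={t:2d}  detected={d:6.1f}  corrected={c:6.1f}  corrected/detected={c/d:4.2f}")
```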


3. CONCLUSION A lot of work is being done in the field of software maintenance. To minimize testing effort, various fault prediction, fault detection and fault correction techniques have been explored. In the areas of software maintenance, fault detection and fault correction, considerable research has been done and is ongoing. In relation to the recent trend of advanced technologies, research on fault detection and fault correction mechanisms is discussed in this survey paper. Fault handling in modern applications is in the early stages of research, and solution architectures try to build in as much fault tolerance as possible. More models and techniques are needed to improve fault detection and correction activities and processes. ACKNOWLEDGEMENT Special thanks to International Islamic University, Islamabad, Pakistan for supporting this research. REFERENCES

1. Dai, Y.-S., M. Xie, and K.-L. Poh, Modeling and analysis of correlated software failures of multiple types. IEEE Transactions on Reliability, 2005. 54(1): p. 100-106. 2. Schneidewind, N.F. An integrated failure detection and fault correction model. in Software Maintenance, 2002. Proceedings. International Conference on. 2002: IEEE. 3. Monden, A., et al., Assessing the cost effectiveness of fault prediction in acceptance testing. IEEE Transactions on Software Engineering, 2013. 39(10): p. 1345-1357. 4. Wedyan, F., D. Alrmuny, and J.M. Bieman. The effectiveness of automated static analysis tools for fault detection and refactoring prediction. in Software Testing Verification and Validation, 2009. ICST'09. International Conference on. 2009: IEEE. 5. Liu, S., et al., Formal specification-based inspection for verification of programs. IEEE Transactions on software engineering, 2012. 38(5): p. 1100-1122. 6. Neeraj Mohan, P.S.S., and Hardeep Singh, Impact of Faults in Different Software Systems: A Survey 2010. 7. Andrews, J.H., et al., Using Mutation Analysis for Assessing and Comparing Testing Coverage Criteria. IEEE Transactions on Software Engineering, 2006. 32(8): p. 608-624. 8. Zheng, J., et al., On the value of static analysis for fault detection in software. IEEE Transactions on Software Engineering, 2006. 32(4): p. 240-253. 9. Zhou, C., R. Kumar, and S. Jiang. Keynote: Hierarchical Fault Detection in Embedded Control Software. in 2008 32nd Annual IEEE International Computer Software and Applications Conference. 2008. 10. Poulding, S. and J.A. Clark, Efficient Software Verification: Statistical Testing Using Automated Search. IEEE Trans. Softw. Eng., 2010. 36(6): p. 763-777. 11. Joshi, K.R., et al., Probabilistic Model-Driven Recovery in Distributed Systems. IEEE Transactions on Dependable and Secure Computing, 2011. 8(6): p. 913-928. 12. Romi Satria Wahono1, a.N.S., Combining Particle Swarm Optimization based Feature Selection and Bagging Technique for Software Defect Prediction. International Journal of Software Engineering and Its Applications, 2013. 13. Oral Alan, An Outlier Detection Algorithm Based on Object-Oriented Metrics Thresholds. IEEE, 2009. 14. Catal, C., B. Diri, and B. Ozumut. An Artificial Immune System Approach for Fault Prediction in Object- Oriented Software. in Dependability of Computer Systems, 2007. DepCoS-RELCOMEX '07. 2nd International Conference on. 2007. 15. Jeenam Chawla, A.A., Object-Oriented Design Metrics to Predict Fault Proneness of Software Applications. Jeenam Chawla et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, 2015. 16. Bharavi Mishra, Impact of Attribute Selection on Defect Proneness Prediction in OO Software. International Conference on Computer & Communication Technology, 2011. 17. Rathore, S.S. and A. Gupta. Investigating object-oriented design metrics to predict fault-proneness of software modules. in 2012 CSI Sixth International Conference on Software Engineering (CONSEG). 2012. 18. J. Daly, V.P.L.B., et al., Predicting Fault-Prone Classes with Design Measures in Object-Oriented Systems, in Proceedings of the The Ninth International Symposium on Software Reliability Engineering. 1998, IEEE Computer Society. p. 334.


19. Cong, J., et al. Quality prediction model of object-oriented software system using computational intelligence. in 2009 2nd International Conference on Power Electronics and Intelligent Transportation System (PEITS). 2009. 20. Supreet Kaur, D.K., Quality Prediction of Object Oriented Software Using Density Based Clustering Approach. IACSIT International Journal of Engineering and Technology, 2011. 21. Shatnawi, R. Improving software fault-prediction for imbalanced data. in 2012 International Conference on Innovations in Information Technology (IIT). 2012. 22. Singh, P. and S. Verma. Empirical investigation of fault prediction capability of object oriented metrics of open source software. in 2012 Ninth International Conference on Computer Science and Software Engineering (JCSSE). 2012. 23. Cruz, A.E.C. Exploratory study of a UML metric for fault prediction. in 2010 ACM/IEEE 32nd International Conference on Software Engineering. 2010. 24. Gu´eh´eneuc, N.M.a.Y.-G.e., On the Automatic Detection and Correction of Software Architectural Defects in Object-Oriented Designs. 2005. 25. Liu, Y., et al., A general modeling and analysis framework for software fault detection and correction process. Software Testing, Verification and Reliability, 2016. 26(5): p. 351-365. 26. Holley, R., How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine, 2009. 15(3/4). 27. Wang, L., Q. Hu, and M. Xie. Bayesian analysis for NHPP-based software fault detection and correction processes. in Industrial Engineering and Engineering Management (IEEM), 2015 IEEE International Conference on. 2015: IEEE. 28. Prattana Deeprasertkul, P.B., Automatic detection and correction of programming faults for software applications. The Journal of Systems and Software, 2005. 29. Liu, Y., et al. A New Framework and Application of Software Reliability Estimation Based on Fault Detection and Correction Processes. in 2015 IEEE International Conference on Software Quality, Reliability and Security. 2015. 30. Y.P. Wu, Q.P.H., M. Xie and S.H. Ng, Detection and Correction Process Modeling Considering the Time Dependency. 12th Pacific Rim International Symposium on Dependable Computing IEEE, 2006. 31. Shrivastava, A.K., Generalized Modeling for Multiple Release of Two Dimensional Software Reliability Growth Model. INTERNATIONAL JOURNAL OF INNOVATIVE RESEARCH AND DEVELOPMENT, 2016. 32. Connors, A.S.T.M.V.J.R.J.B.D.A., Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance. 2007. 33. Nasar, M., Resource Allocation Policies for Fault Detection and Removal Process. Modern Education and Computer Science, 2014. 34. Rashmi Upadhyay, P.J., Review on Software Reliability Growth Models and Software Release Planning. International Journal of Computer Applications, 2013. 35. Zhang, N., Software Reliability Analysis using Queuing based Model with Testing Effort. JOURNAL OF SOFTWARE, 2013. 36. Chin-Yu Huang1, C.-T.L., Sy-Yen Kuo2, Michael R. Lyu3, and Chuan-Ching Sue4, Software Reliability Growth Models Incorporating Fault Dependency with Various Debugging Time Lags. 2005. 37. Johri, M.N.a.P., Testing and Debugging Resource Allocation for Fault Detection and Removal Process. International Journal of New Computer Architectures and their Applications, 2014. 38. P.K. KAPUR1, A.G.A., and SAMEER ANAND2, A New Insight into Software Reliability Growth Modeling. International Journal of Performability Engineering, 2009. 39. 
Nan, Z., Infinite Server Queuing Models with Resource Expenditures and Change-point for Software Reliability. International Journal of Hybrid Information Technology, 2016. 40. Chandra Mouli Venkata Srinivas Akana, D.C.D., 3Dr. Ch. Satyanarayana, Residual Fault Detection and Performance Analysis of G-O Software Growth Model. Journal of Current Computer Science and Technology, 2015. 41. Marshima Mohd Rosli, The Design of a Software Fault Prone Application Using Evolutionary Algorithm. IEEE Conference on Open Systems, 2011. 42. Tamak, J., A Review of Fault Detection Techniques to Detect Faults and Improve the Reliability in Web Applications. International Journal of Advanced Research in Computer Science and Software Engineering, 2013. 3(6). 43. Shu, Y., et al. Considering the Dependency of Fault Detection and Correction in Software Reliability Modeling. in 2008 International Conference on Computer Science and Software Engineering. 2008.


44. Md. Zaffar Amam, S.S. Ahmed, Analysis of Software Fault Detection and Correction Process Models with Burr Type XII Testing Effort. IEEE, 2016.

AUTHORS PROFILE


Using peer comparison approaches to measure software stability 1Liguo Yu, 2Yingmei Li, 3Srini Ramaswamy 1Indiana University South Bend, South Bend, Indiana 46634, USA 2Harbin Normal University, Harbin, Heilongjiang 150080, China 3ABB Inc., Cleveland, Ohio 44125, USA Email: [email protected], [email protected], [email protected]

ABSTRACT Software systems must change to adapt to new functional requirements and new nonfunctional requirements. This is called software revision. However, not all the modules within the system need to be changed during each revision. In this paper, we study how frequently each module is modified. Our study is performed by comparing the stability of peer software modules. The study is performed on six open-source Java projects: Ant, Flow4j, Jena, Lucence, Struct, and Xalan, in which classes are identified as basic software modules. Our study shows that (1) about half of the total classes never changed; (2) frequent changes occur to a small number of classes; and (3) the number of changed classes between the current release and the next release has no significant relation to the time duration between the two releases. Keywords: software evolution; software revision; software stability; class stability; open-source project; Java class; 1. INTRODUCTION Software systems must continually evolve to fix bugs or adapt to new requirements or new environments. The changes made to an existing system generate a new version of the system. This process is called revision. During the software revision process, some modules within the system are modified and some other modules are unchanged. The ability of a software module to remain unchanged is called its stability. Stability is an important measure of software modules and software systems. It is commonly agreed that software stability can affect software quality [1]. For example, if more frequent and dramatic changes are made to a software module, it is more likely that errors will be introduced into the code, and accordingly the quality of the module and the quality of the product could be compromised. It is also commonly agreed that less stable modules are more difficult to maintain than stable modules [2, 3]. For example, regression faults could be introduced in software maintenance. Keeping modules stable is also important for software product lines, where the stability of core assets is essential for software reuse [4−6]. Software stability is an area that is under extensive research. For example, Fayad and Altman described a Software Stability Model (SSM) [7, 8], which has been applied to software product lines to bring multiple benefits to software architecture, design, and development [9]. Xavier & Naganathan presented a probabilistic model to enhance the stability of enterprise computing applications, which allows software systems to easily accommodate changes under different business policies [10, 11]. In Wang et al.'s research, stability is used as a measure to support component design [12]. Identification of stable and unstable software components is one of the important tasks in this area of research. For example, Hamza applied a formal concept analysis method to identify stable software modules [13]. Grosser et al. utilized case-based reasoning to predict class stability in object-oriented systems [14]. Bevan & Whitehead mined software evolution history to identify unstable software modules [15]. This paper studies software stability by comparing peer modules. The study is performed on six open-source Java projects. The remainder of the paper is organized as follows. Section 2 reviews the currently available measurements of software stability. Section 3 presents our research method and introduces our new measurement of software stability. Section 4 describes the data source used in this study.
Section 5 presents the results and the analysis of the case studies. Conclusions and future work are presented in Section 6.


2. LITERATURE REVIEW A review of the literature shows that there are basically two ways to measure software module stability: static measurement and dynamic measurement. Static measurement analyzes the interdependencies between software modules in order to study their probability of co-evolution: changes made to one module could require corresponding changes to another module [16, 17]. If a software module has weak dependencies on other modules, change propagation is less likely to happen and the module is more stable. If a software module has strong dependencies on other modules, change propagation is more likely to happen and the module is less stable. This measurement is related to module coupling, which represents the architecture of the system. Because this measurement examines the interactions of different modules within one version of a software system, it is called static measurement. The concept of static measurement is illustrated in Figure 1(a). In dynamic measurement, software evolution history is used to measure module stability, where stability is represented as differences between two versions of an evolving software product. The differences between two versions of an evolving software module can be measured with differences in program metrics, such as the number of variables, the number of methods, etc. [18, 19]. Because this measurement examines the differences between two versions of one module, it is called dynamic measurement. The concept of dynamic measurement is illustrated in Figure 1(b). In Threm et al.'s latest work, information-level metrics based on Kolmogorov complexity are used to measure the difference between several versions of software products [20]. Using normalized compression distance, various evolutionary stability metrics of software artifacts are defined, including version stability, branch stability, structure stability, and aggregate stability. Again, this is a dynamic measurement based on information entropy, which also belongs to Figure 1(b).
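The normalized compression distance used in the information-level metrics above can be approximated with any off-the-shelf compressor. The sketch below uses zlib and two placeholder file names; it only illustrates the idea and is not the tooling of the cited work.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for almost identical inputs, near 1 for unrelated ones."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Placeholder file names: two versions of the same source file.
v1 = open("Module_v1.java", "rb").read()
v2 = open("Module_v2.java", "rb").read()
print(f"version distance: {ncd(v1, v2):.3f}")
```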

Figure. 1 Three measurements of module stability: (a) static measurement; (b) dynamic measurement; and (c) peer comparisons.
3. RESEARCH METHOD In this study, we analyze module stability by comparing peer modules in one system. Instead of looking at interactions between modules (static measurement) or differences between versions of one module (dynamic measurement), we compare the change frequencies of different modules.


We call our approach peer comparisons. The concept of peer comparisons is illustrated in Figure 1(c). In this measurement, the stability of a module is measured by comparing its frequency of changes with the change frequencies of other modules. For example, in Figure 1(c), if Module A is modified once, Module B twice, and Module C three times over three revisions, we can say that Module A is more stable than Module B, and Module B is more stable than Module C.
4. DATA SOURCE The data used in this study are retrieved from the Helix Software Evolution Data Set [21]. The evolution data of six open-source Java projects are downloaded and analyzed: Ant, Flow4j, Jena, Lucence, Struct, and Xalan. Table 1 shows general information about the six Java projects. Please note that (1) the release months are shown in mm/yy format; (2) the first and last release dates refer to the data collected by Helix and do not necessarily represent the data available on the project web sites; and (3) the number of classes is counted on the last release of each project as specified in the table.

Table. 1 The general information of six Java projects

                     Ant      Flow4j   Jena     Lucence   Struct   Xalan
Num. of releases     18       29       25       19        18       15
First release        07/00    10/03    09/01    06/02     09/00    03/00
Last release         04/10    08/05    06/10    06/10     09/09    11/07
Duration (days)      3573     660      3168     2931      3288     2803
Num. of classes      561      274      915      398       910      1198

5. ANALYSIS AND RESULTS First, we study the retirement rate of classes. During the evolution of an object-oriented software product, new classes may be added to the project and existing classes may be removed from it. The removal of a class from a software system is called the retirement of the class. The definition of class retirement rate is given below. Definition 1. The class retirement rate of a project is the ratio of the number of retired classes to the total number of classes that ever existed in the project. Class retirement rate measures the stability of a software system as a whole: a higher class retirement rate indicates lower stability of the system and a lower class retirement rate indicates higher stability. Table 2 shows the class retirement rates of the six Java projects studied in this research. It is worth noting that Row 2 shows the total number of classes that ever existed in each project. From Table 2, we can see that the class retirement rates range from 6.5% (Ant) to 52.2% (Jena). Based on the data in Table 2, it is fair to say that Ant is more stable than Jena. Class retirement rate represents the stability of the entire system, but not of individual classes (modules). To study the stability of individual classes (modules), we need to examine them in more detail. For all the current classes in the six Java projects, a class that has never been changed during the revisions is called an unchanged class; a class that has been changed during the revisions is called a changed class. Figure 2 illustrates the percentage of changed classes and the percentage of unchanged classes in all six Java projects, in which Lucence has the largest percentage of changed classes and Struct has the largest percentage of unchanged classes.
Table. 2 The class retirement rate of six Java projects

                     Ant      Flow4j   Jena     Lucence   Struct   Xalan
Num. of classes      600      403      1916     478       1498     1585
Current classes      561      274      915      398       910      1198
Retired classes      39       129      1001     80        588      387
Retirement rate      6.5%     32.0%    52.2%    16.7%     39.3%    24.4%
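Definition 1 is straightforward to compute once class histories have been mined from the repository. The sketch below uses a small hand-written history rather than the Helix data; the class names and releases are invented.

```python
def retirement_rate(class_history: dict[str, set[str]], last_release: str) -> float:
    """Definition 1: retired classes / all classes that ever existed in the project."""
    retired = sum(1 for releases in class_history.values() if last_release not in releases)
    return retired / len(class_history)

# Toy history: each class maps to the releases it appears in (invented data).
history = {
    "Order":    {"r1", "r2", "r3"},
    "Customer": {"r1", "r2", "r3"},
    "Cart":     {"r2", "r3"},
    "LegacyUI": {"r1"},              # no longer present in r3, i.e. retired
}
print(f"retirement rate = {retirement_rate(history, 'r3'):.1%}")   # 25.0%
```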


Figure. 2 The percentages of unchanged and changed classes in each system.
For changed classes, Figure 3 shows the frequency of the number of times a class is changed in all the revisions. It can be seen that many classes are changed only a few times and few classes are changed many times. Some classes are not introduced in the first version of the product; instead, they are added later during the revisions. Therefore, these later-added classes did not experience the full evolution lifetime, and accordingly the number of changes made to them cannot accurately represent their stability. To account for this shortcoming of using the number of changes as a direct measure of class stability, we introduce a new metric, the revision rate. Definition 2. The revision rate of a class is the ratio of the number of times the class is changed to the total number of revisions the class experienced. For example, if a class experienced 4 revisions and changes were made in 1 of them, the revision rate of this class is 1/4 (25%). It follows from the definition that the revision rate is in the range [0%, 100%]. Figure 4 shows the frequency of classes with different revision rates over their lifetime of evolution. It should be noted that Figure 4 includes both changed and unchanged classes, as illustrated in Figure 2. From Figure 4, we can see that more classes have low revision rates while fewer classes have high revision rates. For example, in Ant, Lucence, and Xalan, some classes have about a 90% chance of being changed in a revision.
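Definition 2 can likewise be computed per class from the mined change history. The sketch below uses toy data; the first class reproduces the 25% worked example above.

```python
def revision_rates(changed_in: dict[str, set[int]], lived_through: dict[str, set[int]]) -> dict[str, float]:
    """Definition 2: revisions in which a class changed / revisions the class experienced."""
    return {cls: len(changed_in.get(cls, set())) / len(revs)
            for cls, revs in lived_through.items()}

# Toy data: revision numbers each class existed in and was changed in.
lived_through = {"Order": {1, 2, 3, 4}, "Cart": {3, 4}}
changed_in    = {"Order": {2},          "Cart": {3, 4}}
for cls, rate in revision_rates(changed_in, lived_through).items():
    print(f"{cls}: {rate:.0%}")      # Order: 25%, Cart: 100%
```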


Figure. 3 For changed classes, the frequency of the number of times a class is changed in all the revisions: (a) Ant; (b) Flow4j; (c) Jena; (d) Lucence; (e) Struct; and (f) Xalan.
As described in Section 3, peer comparisons are used in this study to evaluate module stability. Figure 3 and Figure 4 show that different classes in one system have different numbers of changes and different likelihoods of being changed in a revision. Accordingly, classes with a high likelihood of being changed in a revision (a high revision rate) are considered less stable than classes with a low likelihood of being changed (a low revision rate). Next, we study the relationship between the amount of change and the duration of each revision. If a revision takes a longer time, it is more likely that major changes are being made and more classes are being modified; if a revision takes a shorter time, it is more likely that minor changes are being made and fewer classes are being modified. To see whether this reasoning is correct, we study the correlation between the percentage of classes modified in the previous release and the duration between the previous release and the current release in each revision. Table 3 shows the results of Spearman's rank correlation test, where significance at the 0.05 level is bolded. Figure 5 illustrates the scatter plots of the percentage of classes modified against the duration between the previous release and the current release.


Figure. 4 The frequency of classes with different revision rates: (a) Ant; (b) Flow4j; (c) Jena; (d) Lucence; (e) Struct; and (f) Xalan.
From Table 3 we can see that 5 out of the 6 correlations are positive and 3 of the 5 positive correlations are significant at the 0.05 level. Based on this result, we cannot conclude that the amount of change made to classes is correlated with the duration of each revision. In other words, the amount of change depends on the revision activities. On the other hand, the duration of each revision is also related to the maintenance activity. Accordingly, the amount of change and the duration of each revision are indirectly correlated, but with no causal relation.

Table. 3 Spearman's correlation test on percentage of classes modified and duration between previous release and current release

                     Ant      Flow4j   Jena     Lucence   Struct   Xalan
Num. of datasets     17       28       24       18        17       14
Correlation (r)      0.554    0.175    -0.170   0.498     0.583    0.363
Significance (p)     0.02     0.37     0.43     0.04      0.01     0.20
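The test reported in Table 3 can be reproduced with a standard statistics library. The sketch below runs Spearman's rank correlation on a short invented series, not on the Helix measurements.

```python
from scipy import stats

# Invented per-revision measurements: % of classes modified and days between releases.
pct_modified  = [4.1, 12.5, 3.0, 22.8, 7.9, 15.2, 5.5, 9.6]
duration_days = [30,  120,  25,  300,  95,  150,  45,  90]

r, p = stats.spearmanr(pct_modified, duration_days)
print(f"Spearman r = {r:.3f}, p = {p:.3f}")
```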


Figure. 5 The scatter plots between the percentage of classes modified and the duration to the previous release: (a) Ant; (b) Flow4j; (c) Jena; (d) Lucence; (e) Struct; and (f) Xalan.
6. CONCLUSIONS AND FUTURE WORK In this paper, we introduced a new method to measure the stability of software systems and of software modules. Using this approach, we studied the class stability of 6 open-source Java projects. We compared the system stability of these 6 Java systems and measured the revision rate of each class, which represents the likelihood that a class will be changed during a revision. Our study found that (1) about half of the total classes never changed in the studied lifetime of software evolution; (2) frequent changes occur to a small number of classes; and (3) the number of changed classes between the current release and the next release has no significant relation to the time duration between the two releases. In practice, our proposed approach can be used to identify stable and unstable modules, which can help improve software design quality and reusability. In addition, our proposed metrics, such as retirement rate and revision rate, can be easily implemented on top of a version control and configuration management system such as Subversion or Git. In our future research, we will create a software tool integrated with GitHub so that it can be used to measure the project stability and module stability of any project on GitHub. Other similar research has been done to study change sets, which are characterized in terms of the architectural features of the program [22]. Our definition of stable and unstable modules can be used to evaluate these studies.


In addition, our findings about the relation between the amount of change and release timing could be further validated with other tools and on other projects.

REFERENCES 1. Eick, S. G., Graves, T. L., Karr, A. F., Marron, J. S., & Mockus, A. 2001. “Does code decay? assessing the evidence from change management data”. IEEE Transactions on Software Engineering, 27(1), 1–12. 2. Mohagheghi, P., Conradi, R., Killi, O. M., & Schwarz, H. 2004. “An empirical study of software reuse vs. defect-density and stability”. In Proceedings. 26th International Conference on Software Engineering (pp. 282–291). IEEE. 3. Menzies, T., Williams, S., Boehm, B., & Hihn, J. 2009. “How to avoid drastic software process change (using stochastic stability)”. In Proceedings of the 31st International Conference on Software Engineering (pp. 540–550). IEEE Computer Society. 4. Leavens, G. T., & Sitaraman, M. 2000. “Foundations of component-based systems”. Cambridge University Press. 5. Dantas, F. 2011. “Reuse vs. maintainability: revealing the impact of composition code properties”. In Proceedings of the 33rd International Conference on Software Engineering (pp. 1082–1085). ACM. 6. Figueiredo, E., Cacho, N., Sant'Anna, C., Monteiro, M., Kulesza, U., Garcia, A., Soares, S., Ferrari, F., Khan, S., Castor Filho, F. and Dantas, F. 2008. “Evolving software product lines with aspects”. In Proceedings of the 30th ACM/IEEE International Conference on Software Engineering (pp. 261–270). IEEE. 7. Fayad, M. E., & Altman, A. 2001. “Thinking objectively: an introduction to software stability”. Communications of the ACM, 44(9), 95–98. 8. Fayad, M. (2002). “Accomplishing software stability”. Communications of the ACM, 45(1), 111–115. 9. Fayad, M. E., & Singh, S. K. 2010. Software stability model: software product line engineering overhauled. In Proceedings of the 2010 Workshop on Knowledge-Oriented Product Line Engineering (p. 4). ACM. 10. Xavier, P. E., & Naganathan, E. R. 2009. “Productivity improvement in software projects using 2- dimensional probabilistic software stability model (PSSM)”. ACM SIGSOFT Software Engineering Notes, 34(5), 1–3. 11. Naganathan, E. R., & Eugene, X. P. 2009. “Architecting autonomic computing systems through probabilistic software stability model (PSSM)”. In Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human (pp. 643–648). ACM. 12. Wang, Z. J., Zhan, D. C., & Xu, X. F. 2006. STCIM: a dynamic granularity oriented and stability based component identification method”. ACM SIGSOFT Software Engineering Notes, 31(3), 1–14. 13. Hamza, H. S. 2005. “Separation of concerns for evolving systems: a stability-driven approach”. In ACM SIGSOFT Software Engineering Notes (Vol. 30, No. 4, pp. 1–5). ACM. 14. Grosser, D., Sahraoui, H. A., & Valtchev, P. 2002. “Predicting software stability using case-based reasoning”. In Proceedings of the 17th IEEE International Conference on Automated Software Engineering (pp. 295–298). IEEE. 15. Bevan, J., & Whitehead Jr, E. J. 2003. “Identification of Software Instabilities”. In WCRE (Vol. 3, p. 134– 145). 16. Yau, S. S., & Collofello, J. S. 1980. “Some stability measures for software maintenance”. IEEE Transactions on Software Engineering, (6), 545–552. 17. Yau, S. S., & Collofello, J. S. 1985. “Design stability measures for software maintenance”. IEEE Transactions on Software Engineering, (9), 849–856. 18. Kelly, D. 2006. “A study of design characteristics in evolving software using stability as a criterion”. IEEE Transactions on Software Engineering, 32(5), 315–329. 19. Yu, L., & Ramaswamy, S. 2009. 
“Measuring the evolutionary stability of software systems: case studies of Linux and FreeBSD”. IET software, 3(1), 26–36. 20. Threm, D., Yu, L., Ramaswamy, S., & Sudarsan, S. D. 2015. “Using normalized compression distance to measure the evolutionary stability of software systems”. In Proceedings of the 26th IEEE International Symposium on Software Reliability Engineering (pp. 112–120). IEEE. 21. Vasa, R., Lumpe, M., & Jones, A. 2010. “Helix-Software Evolution Data Set”.


22. Wong, S., Cai, Y., Valetto, G., Simeonov, G., & Sethi, K. 2009. “Design rule hierarchies and parallelism in software development tasks”. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering (pp. 197–208). IEEE Computer Society.

AUTHORS PROFILE Dr. Liguo Yu received his Ph.D. degree in Computer Science from Vanderbilt University. He received his BS degree in Physics from Jilin University. He is currently an associate professor in the Computer Science Department, Indiana University South Bend, USA. His research interests are in software engineering, information technology, complex systems, and computer education. Prof. Yingmei Li is a full professor at Harbin Normal University, China. Her research interest is in software engineering and computer science education.

Dr. Srini Ramaswamy received his Ph.D. degree from the University of Louisiana at Lafayette. His specialty areas include technical/engineering management (software and systems), program/project management, new technology scouting and evaluation, vision/strategy development and execution, software development process adaptation and improvement, industrial automation systems, energy, power & water systems, healthcare systems, healthcare IT, big data, and cloud computing. He is currently a software project manager at ABB Inc.


A framework for software re-documentation using reverse engineering approach 1Nasrin Ismail Mohamed, 2Nisreen Beshir Osman College of Computer Science and Information Technology, Sudan University of Science and Technology Department of Computer Science, Bayan College for Science and Technology, Khartoum, Sudan E-mail: [email protected], [email protected]

ABSTRACT During software evolution, programmers spend time and effort on the comprehension of programs, largely because the documentation is often incomplete, inconsistent and outdated. In order to avoid these problems, software can be re-documented. Software re-documentation enables the understanding of software, which aids its support, maintenance and evolution. Re-documentation is implemented by different approaches; reverse engineering is one of these approaches and provides maintainers and developers with a better understanding of an existing system, especially when they are faced with a large and evolving legacy system. This study proposes a framework for system re-documentation based on a reverse engineering approach. The re-documentation is done using a reverse engineering tool that generates graphical representations of a system, which are then used to produce documentation in the form of standard UML notation. Since the quality of the generated documentation is important for program understanding and software evolution, the study also proposes a model for evaluating the quality of the generated documentation. The Documentation Quality Model (DQM) was validated, and the results of the evaluation showed that the documentation generated using reverse engineering was usable, up-to-date and complete. Keywords: reverse engineering; software re-documentation; code understanding; software maintenance; documentation quality model;

1. INTRODUCTION Software documentation is an important aspect of both software projects and software engineering in general. Unfortunately, the current perception of documentation is that it is outdated, irrelevant and incomplete, and for the most part this perception is probably true. A key goal of re-documentation is to provide easier ways to visualize relationships among program components so that we can recognize and follow paths clearly. There are many approaches that have been used to implement the re-documentation process; one of these is the reverse engineering approach. Developers tend to focus on the source code, since code is the most reliable source to refer to as the system representation. Therefore, reverse engineering can be used to generate the documentation directly from the source code. The quality and usefulness of documentation is one of the key factors that affect the quality and consistency of software. Accordingly, generating quality documentation through the re-documentation process is important for program comprehension and software evolution [8]. The study presented in this paper proposes a framework for system documentation based on a reverse engineering approach. It also proposes a model for evaluating the quality of the generated documentation.

2. BACKGROUND 2.1 Software re-documentation Re-documentation is the process of analyzing a software system to represent it with a meaningful alternate view intended for a human audience. Tilley [10] defined re-documentation as follows: "Program re-documentation is one approach to aiding system understanding in order to support maintenance and evolution. It is the process of retroactively creating program documentation for existing software systems. It relies on technologies such as reverse engineering to create additional information about the subject system. The new information is used by the engineers to help make informed decisions regarding potential changes to the application". The main purpose of re-documentation is to make sure the software teams understand the legacy system [9]. Two steps are necessary to re-document a system. These are: • Extract facts about the system, which can be done starting from a static representation of the system (e.g., the source code).


• These extracted facts need to be combined and transformed into the correct documentation format (e.g., UML diagrams) [5]. Figure (1) shows the steps of the re-documentation process.

Fact Extraction → Fact Representation → Documents Generation

Figure. 1 Re-documentation process
There are many re-documentation techniques in industry and research. Most of these techniques can be categorized into a few approaches. Some of the significant approaches and tools which can contribute to the development of quality documentation are listed below:
I. XML Based Approach: The XML based approach is one of the common re-documentation approaches. By using XML, the technical writer or software engineer can define their own tags and document format. The hierarchical nature of XML helps in understanding the program more easily; it also validates the data captured from the program [5].
II. Model Oriented Re-documentation (MOR): The first step in the Model Oriented Re-documentation approach is to transform the legacy system into formal models. These formal models are written using a formal language and transformed into technological spaces (TSs). The generated TSs are stored in a repository and documentation is produced in a uniform way [5].
III. Incremental Re-documentation: One of the common issues in maintaining a system is recording the changes to the source code requested by customers or users. The incremental re-documentation approach rebuilds the documentation incrementally after the changes are made by the programmer [5].
IV. The Ontology Based Approach: It produces a schema from the legacy system to describe the context of the software system. The schema should be able to capture the artifacts from the latest version of the software system by establishing a reverse engineering environment [6].
2.2 Reverse engineering The most common approach to re-documenting software is reverse engineering. The IEEE Standard for Software Maintenance [3] defines reverse engineering as: "the process of extracting software system information (including documentation) from source code." In the context of software engineering, reverse engineering is defined by Chikofsky and Cross [1] as: "the process of analyzing a subject system to identify the system's components and their interrelationships and create representations of the system in another form or at a higher level of abstraction". Reverse engineering captures the design information from the existing source code. It is the inverse of forward engineering: whereas the forward engineering process goes from a specification or target architecture toward building the system, the reverse engineering process goes from a low level of abstraction to a higher one, as shown in Figure 2.

Figure. 2 Difference between forward and reverse engineering


Software systems that are targets for reverse engineering, such as legacy applications, are often large, with hundreds of thousands or even millions of lines of code. As a result, it is highly desirable to automate reverse engineering activities. 2.3 Documentation quality The quality of a software product is a significant driver for its success. However, the majority of the applied quality assurance methods mainly focus on the executable source code, and quality reviews of the software documentation are often omitted. Software documents such as requirements specifications, design documents, or test plans represent essential parts of a software product. Therefore, the quality of such documents influences the overall quality of a software product considerably. That means documentation is a key component of software quality, and improving the documentation process will have a considerable impact on improving the quality of software. Documents describe the product at all levels of development, including the finished product, and therefore should be up-to-date, complete, consistent and usable [4]. According to Kitchenham, "quality" means different things to different people; it is highly context dependent. As there is no universally accepted definition of quality, there can be no single, simple measure of software quality that is acceptable to everyone. However, defining quality in a measurable way makes it easier for others to understand a given viewpoint. To understand and measure quality, many models of quality and quality characteristics have been introduced. IEEE Std 1061-1998 [3] uses a set of factors such as functionality, reliability, efficiency, maintainability, portability and usability to assess quality. Garousi [2] presented a hybrid methodology for software documentation quality that provides a meta-model in which the quality of content is modeled as Content Quality, with several attributes as its subclasses: accessibility, accuracy, author-related, completeness, consistency, correctness, information organization, format, readability, similarity, spelling and grammar, traceability, trustworthiness, and up-to-dateness [2]. A key performance indicators (KPIs) framework was also created by focusing on the quality attributes of the document: structure, contextual, accuracy and accessibility [7]. 3. METHODOLOGY This part summarizes the two practical steps used to re-document software using the proposed framework and to measure the quality of the generated document. 3.1 System re-documentation The selected system was a Shopping Management System, a service-oriented system that provides services for purchasing items and is backed by a large database. In the shopping system, customers can request to purchase one or more items from the supplier. The customer provides personal details, such as address and credit card information, which are stored in a customer account. If the credit card is valid, then a delivery order is created and sent to the supplier. The supplier checks and confirms the order. The system contains a large number of modules, variables and structures, as well as a large number of dependencies. 3.2 Rigi tool The Rigi tool environment provides automated reverse engineering activities that can be used to understand and analyze the overall structure of a system. Static information is generated for the whole software system and visualized using the tool. Figure (3) shows the conceptual architecture of the Rigi tool.

Figure. 3 Rigi conceptual architecture


The Rigi Workbench window and a root window are shown in figure (4).

Figure. 4 Rigi workbench window

▪ Fact Extraction. Rigi includes parsers to read the source code of the subject system. It also enables viewing information stored in Rigi Standard Format (RSF) files. ▪ Fact Representation. The extracted information is represented and visualized as a directed graph; the initial window titled Root in Figure (4) is used to display the parent(s) of the subsystem hierarchy (SQL/DS node). A tree-like structure, represented in Figure (5), is used to manage the complexity of large information spaces.

Figure. 5 Fact representation


▪ Documentation Generation. To generate the documentation, it was not possible to automate all the work, so some parts required human expert interpretation in order to be completed. UML notation was chosen to generate standard graphical documentation. Rigi has been used for static reverse engineering, and the extracted static information is viewed as directed graphs. The static dependency graph contains approximately the same information as a class diagram. Table (1) enumerates the main UML class diagram constructs and the Rigi graph constructs that can be used to express their meaning; the correspondence is characterized as replacing if such a Rigi construct exists (a small parsing sketch of this mapping is given after Figure 6).

Table. 1 Class diagram constructs vs Rigi graph constructs

A UML class diagram construct   A Rigi static dependency graph construct   Correspondence
Class                           Class (node type)                          Replacing
Interface                       Interface (node type)                      Replacing
Method                          Method (node type)                         Replacing
Variable                        Variable (node type)                       Replacing
Generalization                  Inherit (arc type)                         Replacing
Association                     (arc type)                                 Replacing

In Rigi, classes and interfaces have their own node types, as do the methods and variables inside a class in a UML class diagram, as shown in Figure (6).

Figure. 6 Shopping management system class diagrams
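As a simplified illustration of how extracted facts can be grouped into class-diagram information along the lines of Table 1, the sketch below reads RSF-style triples (one "verb subject object" fact per line) and collects classes with their methods, variables and inheritance arcs. The file name and the fact verbs "type", "contain" and "inherit" are assumptions made for illustration; real RSF files and Rigi's domain model are richer than this.

```python
from collections import defaultdict

def load_rsf(path: str) -> dict[str, list[tuple[str, str]]]:
    """Read whitespace-separated 'verb subject object' triples, grouped by verb."""
    facts = defaultdict(list)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 3:
                verb, subj, obj = parts
                facts[verb].append((subj, obj))
    return facts

facts = load_rsf("shopping_system.rsf")                 # placeholder file name
node_type = dict(facts.get("type", []))                 # e.g. {"Order": "Class", ...}
for cls in (n for n, t in node_type.items() if t in ("Class", "Interface")):
    members = [m for owner, m in facts.get("contain", []) if owner == cls]
    methods = [m for m in members if node_type.get(m) == "Method"]
    fields  = [m for m in members if node_type.get(m) == "Variable"]
    parents = [p for c, p in facts.get("inherit", []) if c == cls]
    print(f"{cls}: methods={methods} variables={fields} inherits={parents}")
```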


3.3 Document quality model To classify the quality-related attributes of the documentation generated by the tool, a Document Quality Model (DQM) was constructed. The model consists of two main attributes: Usability and Benefit. The attributes of the model are shown in Figure (7).

Figure. 7 Document quality model

4. RESULTS AND DISCUSSION The quality of the generated documents was measured using the DQM. Table (2) below shows the results of the evaluation. According to these results, the generated document was understandable, consistent, accurate, accessible, simple, and readable. The results also showed that the document was up-to-date and complete.

Table. 2 Document evaluation results

            Document quality attribute   Evaluation
Usability   Understandability            √
            Consistency                  √
            Accuracy                     √
            Accessibility                √
            Simplicity                   √
            Readability                  √
Benefit     Up-to-dateness               √
            Completeness                 √


Usability was measured by a set of attributes: understandability, consistency, accuracy, accessibility, simplicity and readability. The UML notation chosen is an accepted standard for visualizing, understanding and documenting software systems, which indicates that the UML document format satisfies all usability aspects of the model. The model measured the benefit criteria through the content of the documents, which should be up-to-date and complete. The documentation generated for the system is up-to-date since it depends on the latest version of the software system, and it is complete and accurate since it is derived from the actual source code. 5. CONCLUSION Legacy software systems can be re-documented with approaches other than those traditionally used; one of them is reverse engineering. Documentation made manually by developers is in some cases inconsistent: some change requests, updates, or bug fixes are not included in the documentation as the software evolves. Developers tend to focus on the source code rather than on the documentation; consequently, code is the most reliable source to refer to as the system representation. Generating the documentation directly from the source code keeps the resulting document consistent with the code at all times. Therefore, reverse engineering is very effective for understanding large software systems and then re-documenting them. REFERENCES

1. Chikofsky E., Cross J., "Reverse Engineering and Design Recovery: A Taxonomy", IEEE Software, vol. 7(1), Jan. 1990, pp. 13-17. 2. Garousi G., "A Hybrid Methodology for Analyzing Software Documentation Quality and Usage", University of Calgary, September 2012. 3. Institute of Electrical and Electronics Engineers, Standard for Software Maintenance, New York, IEEE Std. 1219-1998. 4. Jemutai N. Kipyegen and William P. K. Korir, "Importance of Software Documentation", IJCSI International Journal of Computer Science Issues, Vol. 10, Issue 5, No 1, September 2013. ISSN (Print): 1694-0814 | ISSN (Online): 1694-0784, www.IJCSI.org 5. Naisan I. and Ibrahim S., "Reverse Engineering Process to Support Software Design Document Generator", 2010. 6. Nallusamy S., Ibrahim S., "A Software Redocumentation Process Using Ontology Based Approach in Software Maintenance", International Journal of Information and Electronics Engineering, 2011. 7. Sufi Abdi B., "Framework for Measuring Perceived Quality in Technical Documentation", University of Gothenburg / Chalmers University of Technology, February 2013. 8. Sugumaran N., Ibrahim S., "An Evaluation on Software Redocumentation Approaches and Tools in Software Maintenance", Vol. 2011 (2011), Article ID 875759, http://www.ibimapublishing.com/journals/CIBIMA/cibima.html 9. Sugumaran N., Ibrahim S., "A Review of Re-documentation Approaches", University Technology Malaysia, 2009. 10. Tilley, S. 2008. "Three Challenges in Program Re-documentation for Distributed Systems", in Proceedings of IEEE Conference 2008.

AUTHORS PROFILE



A review of adoption of e-learning in Middle East countries
1Asmala Ahmad, 2Ali Fahem Nemeah, 3Hassan Mohammed
1,2,3University Technical Malaysia, Melaka, Malaysia
Email: [email protected]

ABSTRACT
E-learning has provided people with new opportunities in teaching and learning procedures. A historical review of the educational systems literature reveals that e-learning has spread among people much faster than any other learning method. E-learning, as a state-of-the-art technology, has brought great innovations in materials development to societies in which new methods and procedures could hardly ever be accepted. Technological innovations and the development of telecommunications, such as television stations and channels, satellites, mobile phones, and the Internet, have made it possible for children and teenagers in the Middle East to access the latest news and information. Of course, these developments have also challenged both political and educational systems in some respects. The present paper points to some of the recent developments in the field of e-learning in the Middle East and examines the reactions of political and educational systems to this phenomenon.
Keywords: e-learning; technology; culture; learning; higher education; educational systems;
1. INTRODUCTION
E-learning (electronic learning) is the unifying term used to describe the fields of online learning, web-based training, and technology-delivered instruction. The widespread accessibility of the World Wide Web and the ease of using tools to browse the resources on the Web have made e-learning technology extremely popular and the means of choice for distance education and professional training. The concept and use of e-learning were adopted in the mid-1980s by several institutes in the United States. Approximately 1.9 million learners participate in e-learning at institutions of higher education, a million of whom are from Australia, New Zealand, and the United Kingdom. The number of people applying for e-learning courses all over the world increases at a rate of 25 percent each year [1]. Some Middle East countries have introduced and are successfully running e-learning in their educational institutes and business organizations. The education systems of Middle East countries are under some pressure to provide additional educational opportunities for a growing population and to boost the literacy rate. With over 50% of these countries' populations under the age of 20 and one of the highest birth rates in the world, higher education institutions have been facing a growing demand for enrolment. This paper presents a review of e-learning in Middle East countries.
2. E-LEARNING
E-learning has not only affected young people's methods of learning but has also modified the relations between social structures and the young generation. The emergence of mass communication and its outcome, electronic learning, has made various kinds of information accessible to young people [2]. For centuries, being grown up and experienced was a basic and necessary condition in the Middle East for gaining access to such information. Nowadays, this change in such an ancient region is so fundamental that everybody has to give it priority. During the last century, the emergence of radio, newspapers, and television had a tremendous effect on the relations between the older and younger generations of society. The history of fifty years of social transformation in the Middle East distinctly shows how each of these new media has increased anxiety and tension among parents, teachers, politicians, and clergymen. It has also widened the extent of misunderstanding among generations.
Today, we are confronted with a new phenomenon: the emergence of the Internet and e-learning. These new technologies have affected the security of traditional societies in the Middle East. As a result, the act of learning has turned into a national security issue [3]. It has become a security issue because the sense of equilibrium that prevailed in the relation between the older and the newer generation has been lost. Parents do not feel secure anymore because their children become familiar with the relationship between the sexes quickly and before the expected time. They believe that their children communicate with strangers and play a number of games which are not only time-killers but also bothersome and annoying. Politicians do not feel secure because they are no longer able to exercise their influence on the young generation as the only lasting political source. They have to fight their political opponents both in practice and in the virtual world. To the politicians in this critical region, nothing is more perilous than the minds of young people crammed with opponents' ideas. Finally, teachers and educationalists will no longer be considered the only right sources and criteria for gaining knowledge; students can surpass their teachers in acquiring new information.
3. EDUCATIONAL SYSTEM
The composition of the population in this part of the world shows that children and young people form a considerable part of the age pyramid. The educational system is one of the first social structures to come under the impact of this composition. During the last two decades, educational systems have seen an increase in registration at all academic levels. For example, girls form more than 60% of all university students in Iran, and more than 90% of high-school graduates in the UAE register at colleges and universities [4]. However, the increase in the usage of e-learning is not limited to universities; access to e-learning can be seen even at lower age levels. Available citations show that the educational systems of Iran, Kuwait, the UAE, Israel, and Turkey are under the influence of global privatization, and therefore the number of kindergartens and elementary and secondary schools connected to the Internet is increasing [5]. Today, teachers in the Middle East observe that the Internet has been gaining popularity among young people, though at a much slower pace than television and radio. In addition, experimental evidence showing that in the Middle East there are many more computer-literate young people than adults indicates the younger generation's greater interest in and aptitude for technological advances. The educational system should make an effort not to lag behind children and young people.
4. E-LEARNING IN MIDDLE EAST COUNTRIES
In the Middle East, none of the current ways of teaching and learning is considered more fascinating than e-learning. The reasons are poor economic conditions, cultural and social impediments, and the lack of a capable and engaging method of learning. Historical experience in developing countries such as those of the Middle East shows that the lack of financial resources for procuring costly installations is the main reason [6], but the increase in the oil price in the 1970s enabled some of the countries in this part of the world to solve this problem to some extent. Compared to other developing countries, these countries have been empowered to build new schools and equip them with new technologies [7]. The second impediment to the acceptance of these new technologies is the social and cultural resistance of these societies. Opposition to new technologies, overestimation of the negative aspects of their application, and the disinclination of families are some inextricable characteristics of the societies in the Middle East. Surprisingly, it should be acknowledged that the Internet has created an ideal environment for all children, juveniles, girls, and women of all classes of society. It has given the young generation the chance not only to observe the traditional limitations of their societies but also to make contact with others easily. They believe the Internet has been able to present a new view and meaning for some Islamic concepts such as Hijab (veil), Meraj (ascension), Hijrat (migration), global brotherhood, and the equality of women. Consequently, it can be said that e-learning has overthrown cultural and social obstacles to some extent.
E-learning has been accepted quickly in families and instructional settings. As a matter of fact, compared to other educational technologies, there are two reasons for the young generation's enthusiasm towards learning through the Internet: little cost and a lot of attraction. In spite of some shifting views among adults about children and the Internet in the Middle East, the overall responses continue to supply a broad range of strongly positive views about the benefits of Internet use, especially about its value as an information source and its growing use for involvement in online communities [8]. Because of social, cultural, and economic restrictions in traditional societies such as those of the Middle East, old information methods could not provide people with appropriate learning opportunities. This problem has been resolved by e-learning. During a period of seven years (2000-2007), Middle East countries achieved considerable growth in Internet usage, equal to four times the global growth over the same period. In fact, the Middle East is an upcoming market, as experts suggest today, even though major western e-learning and IT suppliers expanded their boundaries into the Middle East years ago.
4.1 E-learning in the Kingdom of Jordan
The Hashemite Kingdom of Jordan is located in the heart of the Middle East and has a population of 6.316 million. His Majesty King Abdullah II strongly believes that the Information and Communication Technology (ICT) sector offers great potential to positively shape the future of education systems in the kingdom [9]. This is demonstrated through the Jordan Education Initiative (JEI) project, which was launched in 2003 and focused on developing a partnership with Cisco Systems to create an effective model of Internet-enabled learning [10]. It is evident that large expenditure and substantial effort have been made by the Ministry of Education in Jordan to successfully implement e-learning developments in schools. While Jordanian school students recognize the potential of e-learning to support teaching and learning, infrastructure often limits student-student and tutor-student interactivity [11]. Jordan has rapidly expanded its higher education system, although it has not yet produced a sufficient qualitative leap [12]. Reflecting the worldwide university sector moving forward with e-learning, Jordanian higher education institutions are responding accordingly. E-learning offers alternative approaches to traditional Jordanian higher education institutions, encouraging them to re-evaluate the way they operate. In doing so, it provides the potential to accommodate new information and communication technologies to enhance the student learning experience. The demand for e-learning in Jordan is expected to rise in the next few years [13]. This is due to the sharp growth of Internet and mobile users and the high literacy rates, considered to be the highest among the countries in the region. Owing to these booming advances in information technology, it is important for higher education institutions to embrace technological developments, redesigning teaching trends and developing researchers in the educational domain [14]. The increased demand from students to change the teaching methods of traditional lectures pushes higher education institutions to consider e-learning to provide online courses and e-training programs. Jordanian students realise that information technology is the future and, therefore, they are looking for more flexible learning opportunities that help them to develop their skills and educational outcomes. Accordingly, many Jordanian institutions have adopted e-learning to meet the increased demand for enhanced and flexible teaching methods.
4.2 E-learning in the UAE
Tertiary education institutions in the UAE are preparing students for a rapidly changing, information- and technology-driven world. The UAE needs graduates who are ready for the workplace and who have a high level of knowledge and confidence in the use of technology to help them in their lifelong learning. The UAE is a small country of approximately 4 million inhabitants, situated at the toe of the Arabian Peninsula and bounded by the Kingdom of Saudi Arabia and the Sultanate of Oman. Driven by oil discoveries, the UAE's vibrant economy has experienced unprecedented economic growth in the last 10 years. Described as one of the most-wired countries on earth [15], the UAE has been brought into the globalized world over the last 30 years, from being an impoverished region of small desert principalities to becoming a modern independent country. Universities and Higher Colleges of Technology in the UAE are increasingly using online learning, or e-learning as it is more commonly called, as part of the curriculum. E-learning is the currently fashionable term used to describe the diverse use of information and communications technologies to support and enhance learning, teaching, and assessment, from resource-based learning (in which students carry out face-to-face tasks supplemented by a range of online resources) to fully online courses [16]. Online learning is often used interchangeably with the term e-learning. Brennan et al.
[17] describe online delivery as computer technology which enhances, extends, and replaces traditional teaching and training practices. A small number of studies have been carried out in the UAE to investigate the use of e-learning in tertiary education. The majority of studies have focused on the perceptions of educators about how to integrate the new technologies into their teaching and learning, along with their perceptions of the value, or not, of e-learning in the UAE context. Other local studies focus on language competence and language use to access the Web [18], and on the impact of globalism on Arab and higher education. A recent study being carried out at Zayed University in the UAE by Birks, Hunt & Martin [19] researches the use of information literacy web resources by Arabic students; findings are yet to be released. This study aims to focus on the actual lived reality of students as they participate in the world of e-learning. Our goal is to ensure that we understand the barriers for students which impact the effectiveness of their e-learning experiences. It is anticipated that the findings will enable the researchers to make suggestions for improving the e-learning and teaching environment for students in the UAE.
4.3 E-learning in Qatar
The first e-learning initiatives in Qatar were developed in Education City among the U.S. satellite campuses. For example, the new building of Weill Cornell Medical College Qatar (renamed Weill Cornell Medicine – Qatar) has been equipped with completely online and blended resources since its inception in 2002-3. Due to shortages of specialty faculty to teach highly technical courses, live video feed courses are run from the main campus in New York City. One course, Psychology 101, is taught by recorded lecture from the Ithaca, New York campus while a teaching assistant in Doha manages the Qatar classroom, responds to questions, and administers exams. With the availability of low-cost or free VoIP suites and videoconferencing software such as Skype or FaceTime, all higher education institutions in Qatar have some teleconferencing capabilities, with WCM-Q, Georgetown SFS – Qatar, and Texas A&M – Qatar having fully equipped state-of-the-art teleconferencing theatres. In K-12 education, the earliest e-learning initiatives were K-Net and e-schoolbag, implemented by ictQatar and the Infocomm Development Authority of Singapore (IDA). E-schoolbag was launched as a pilot project at Al Wakrah Independent School for Girls in 2006. Over two hundred 7th-grade students received Tablet PCs with "e-contents on science, maths, and English, which will be used by teachers as ready-to-use materials mapped to the Qatari curriculum standards and allow them to customize and add their own materials to fit their students' needs" [20]. Knowledge Net, based on Microsoft products and implemented by ITWorx, has been described as "a three-way educational portal that connects students, parents, and teachers any time, day or night. Utilizing a unique Learning Management System, Knowledge NET provides teachers with instructional tools and resources; parents with instant access to teachers, coursework and upcoming tests; and students with the ability to communicate with peers and submit homework assignments. Knowledge Net improves content delivery, facilitates accessibility, enhances communication and expedites administrative tasks" [21]. By 2011, Qatar had made impressive gains in ICT implementation in education, with the following milestones: 93% of primary and secondary schools in Qatar had broadband Internet access, and 98% of schools had some form of Internet access; 100% of all educators in Qatar and 96% of students could access a PC for personal or educational purposes; and 71% of K-12 teachers had received general ICT training [22]. Another government pilot project was a specially built e-Maturity Diagnostic and Self-Assessment Tool so that "Schools can evaluate their current e-maturity level, compare themselves to other schools, and develop targeted action plans to update and improve their technology". Qatar University's Continuing Education Office (CEO) signed an e-learning MoU partnership with Malomatia in 2015. Malomatia, a government technology and services provision company based in Qatar, will provide e-learning support, programmes, and training for the CEO [23]. The Connected Learning Gateway (CLG) of the Egyptian-based ITWORX Education has been used throughout Qatari schools. CLG is a K-12 social-media-based Virtual Learning Environment that supports mobile devices [24]. E-learning has also impacted the Qatari workplace, and most major companies are now using e-learning for skills upgrading (online self-paced short courses). Qatar Islamic Bank (QIB) requires all of its new employees to complete an e-learning course on operational risk to help develop a robust and vigilant risk-management culture at the bank. The largest and most successful e-learning project implemented by ictQatar is the Qatar National e-Learning Portal (www.elearning.ictqatar.qa), developed by Malomatia. Offering standard business, nursing, management, and IT security courses, individual courses and learning plans are accredited by such organizations as the National Association of State Boards of Accountancy, The Six Sigma Program, the Board of Registered Nurses, and the Association for Operations Management. Technical deficits among workers in all areas of government have been widely recognized as a serious issue. Only one decade ago, many ministries recorded public data in large books by hand, and many offices were not computerized.
Thus, the International Computer Driving License (ICDL) was a common training programme offered through online modules to government employees to teach basic computer skills. Although e-Learning Portal courses are available to any citizen or holder of a Residence Permit (RP), as of 2016 individual registration is not open, only organizational access. In a related project, Malomatia partnered with the International Human Resources Development Corporation (IHRDC) in 2015. IHRDC, a training and consulting company serving the petroleum industry, will offer workers in the oil and gas industries access to its four major e-learning libraries as well as web-based tools for assessment and competency management. Although not specifically an e-learning platform, Qatar's Hukoomi government e-portal, which delivers electronic public services, has propelled the nation from an international rank of 62 in 2010 to 44th in 2014 in the United Nations' e-governance maturity and readiness survey entitled "E-Government for the Future We Want" [25].
4.4 E-learning in Iran
The Iranian ministry of research and science established Payam Noor University in 1988 with the aim of offering distance education and part-time degree programs. The decision proved to be fruitful, as it paved the way for many subsequent similar programs offered at other higher education institutes in Iran. In fact, "history of e-learning in Iran at present time does not exceed more than 10 years, yet from a realistic point of view we might say that e-based learning in Iran has had an eight year experience and even younger" [26]. In 1991, a distance education program was placed on the agenda of the University of Tehran, which started to offer nine courses to incoming and newly matriculated students. Around the same time, the Iranian ministry of science, research and technology (MSRT) declared that the first virtual university would be founded, accredited by the Ministry as a non-profit institution [27]. Despite all the effort, time, and energy, as well as the contributions from both the private and governmental sectors, e-learning and e-teaching in Iran are still at their initial stages of development, and there are only a few accredited online programs available. Amirkabir University of Technology, Iran University of Science and Technology, Shiraz Virtual University, and some Islamic virtual colleges and centers, such as the Islamic virtual centers and the Faculty of the Science of Hadith, offer accredited online academic programs. E-learning can be seen as a tool for extending the scope of higher education, especially to geographically remote places and underprivileged rural areas. However, the challenges of virtual learning continue to persist.
4.5 E-learning in the Kingdom of Saudi Arabia
To support the implementation of e-learning in Saudi Arabia, a National Centre for E-learning and Distance Learning was established in 2005, with the aim of creating a complementary educational system that uses e-learning technologies. Many outstanding projects have been adopted by this center to assist in the transition to a digital society and support the implementation of e-learning in Saudi Arabia, such as the Saudi Digital Library Project. Despite the growing availability of educational technology (mainly e-learning) and the awareness of its potential contributions to enhancing learning outcomes, teachers still face complexities in using existing e-learning material, and the implementation of successful e-learning and online instruction in Saudi Arabia's educational system is still very limited. Even in cases where e-learning has been applied in Saudi Arabia, there is no measurable evidence of its effectiveness for students' learning outcomes, and no clear framework or policy to implement e-learning in Saudi schools [28]. Saudi Arabia needs to generate a clear plan for implementing new technology in the educational setting. Strong evidence emphasizes that previous efforts were unsuccessful not because of a lack of effective effort, but because the implementation was not planned thoroughly [29]. The implementation of an effective e-learning system in the Saudi educational system is a vital step towards accomplishing government policy in the information technology area.
5. CONCLUSION
We cannot forget the reality that many children and youths in the Middle East appreciate Internet content that deals credibly with topics they may find difficult to discuss with parents or adults, such as personal relationships, sexuality, AIDS, drugs, self-esteem, etc. In addition, the youth in countries with widespread poverty, corruption, and political turmoil also seek realistic, relevant, and meaningful content to help them understand and cope with the hardships they face in their daily lives. This fact demonstrates that youths do not care much about the approval of parents or political or religious leaders. Also, while avoiding any exaggeration, either positive or negative, about the effects of the Internet, regional differences should be taken into account. As a matter of fact, family income has a major effect on people's level of information technology literacy and also on their access to the Internet. Use of the Internet requires a fairly complex set of skills and technology which is not always available to many young people. Therefore, we should avoid exaggerating the Internet's effects on youths in the Middle East.
REFERENCES
1. Wilson, D. N. 2001. The future of comparative and international education in a globalised world. In M. Bray (Ed.), Comparative Education: Continuing Tradition, New Challenges, and New Paradigms. London: Kluwer Academic Publishers.
2. DSES International Forum. 2007. Security and defense learning. The fourth International Forum on technology assisted training for defense, security and emergency services. Retrieved on December 19, 2008 from www.Newsecurityfoundation.org
3. Arani, A. M. & Abbasi, P. 2008. The education system of Iran. In H. J. Steyn and C. C. Wolhuter (Eds.), Education Systems: Challenges of the 21st Century. Orkney: Keurkopie Uitgewers.
4. Internet World Stats. 2007. Middle East Internet Usage and Population Statistics. Retrieved on 20 Mar 2008 from http://www.Internetworldstats.com/stats5.html
5. Coombs, P. H. 1985. The world crisis in education: The view from the Eighties. N.Y.: Oxford University Press.
6. Mideast Youth. 2007. The Internet's affect on Arab youth, according to Sameh Sameer. Retrieved on July 1, 2008 from http://www.mideastyouth.com/2007/07/07/wwwarab-youthbloggersnet
7. Center for the Digital Future. 2008. Seventh Annual Study Finds Significant Concerns about Online Predators and Children's Participation in Online Communities. Annual Internet Survey by the Center for the Digital Future. Retrieved on July 6, 2008 from http://www.digitalcenter.org/pdf/2008-Digital-Future-Report-Final-Release.
8. MoICT. (2006). The e-readiness Assessment of the Hashemite Kingdom of Jordan. [Online]. Available at http://www.moict.gov.jo/Jordan%20e-Readiness%20Assessment%20Fina.pdf (Accessed 14th May 2011).
9. Cisco (2005). Jordan Education Initiative: the way for global e-learning opportunities. [Online]. Available at http://newsroom.cisco.com/dlls/2005/ts_012805.html (Accessed 13th May 2011).
10. Alomari, A. M. (2009). Investigating online learning environments in a web-based math course in Jordan. International Journal of Education and Development using Information and Communication Technology (IJEDICT), 2009, Vol. 5, Issue 3, pp. 19-36.
11. Sabri, H. & El-Refae, G. (2006). Accreditation in Higher Business Education in the Private Sector: The Case of Jordan. [Online]. Available at http://www.informaworld.com/openurl?genre=article&id=doi:10.1300/J050v16n01_03 (Accessed 26th Jan 2012).
12. Hinnawi, I. (2011). Demand for e-learning projected to soar. [Online]. Available at http://www.menafn.com/qn_news_story_s.asp?storyid=1093461074 (Accessed 26th Jan 2012).
13. Diabat, B. (2011). The Extent of Acquiring E-Learning Competencies by Faculty Members in Jordan Universities. European Journal of Social Sciences, 27(1), pp. 71-81.
14. Walters, T., & Quinn, S. (2003). Living Books: A culturally sensitive, adaptive e-Education process. Paper presented at the SSGRR International Conference on Advances in Infrastructure for e-Business, e-Education, e-Science, e-Medicine and Mobile Technologies on the Internet, L'Aquila, Italy.
15. Fitzgerald, J. (2006). IT blended and e-learning learning committee. Presented at Abu Dhabi Women's College in-house PD forum. Accessed May 2006 from http://www.e-learningcentre.co.uk/.
16. Brennan, R. (2003). One size doesn't fit all. In Guthrie, H. (ed.) (2003) Online learning: research findings, NCVER and ANTA (55-68).
17. Peel, R. (2004). The Internet and language use: A Case Study in the United Arab Emirates. International Journal on Multicultural Studies, 6(1), 79-91.
18. Birks, J., Hunt, F. & Martin, J. (2007). Research into the use of information literacy web resources by Arabic students. Zayed University, UAE.
19. Gulf Times. (September 2006). E-learning project starts at girls' school in Wakrah. [Online]. Available: http://www.gulf-times.com/
20. ictQatar. (January 2011). Knowledge Net. Supreme Council of Information and Communication Technology. [Online]. Available: http://www.ictqatar.qa/output/page442.asp
21. ictQatar. Qatar's ICT Landscape 2011. Doha: ictQatar, 2011.
22. Qatar Tribune. (Feb 2015). E-learning a growing trend in Qatar's classrooms. [Online]. Available: www.qatar-tribune.com
23. The United Nations (U.N.). E-government survey 2014: E-government for the future we want. [Online]. Available: http://unpan3.un.org/egovkb/Portals/egovkb/Documents/un/2014-Survey/E-Gov_Complete_Survey-2014.pdf
24. Yaghoubi, J., Malek Mohammadi, I., Attaran, M., Iravani, and Gheidi, H. 2008. Virtual students' perception of e-learning in Iran. The Turkish Online Journal of Educational Technology, 7, 159-173.
25. Tabatabaei, M. 2010. Evolution of distance education in Iran. Procedia, 2, 1043-1047.
26. Dahan, M. 2002. Internet usage in the Middle East: Some political and social implications. Retrieved on 26 Mar 2008 from http://www.mevic.org/papers/inet-mena.html
27. CIA World Factbook. (October 2016). Population: Qatar. [Online]. Available: https://www.cia.gov
28. A. S. Weber, "Revolution in Arabian Gulf education," in N. Bakić-Mirić & D. Gaipov (Eds.), Current Trends and Issues in Higher Education: An International Dialogue, Cambridge: Cambridge Scholars Press, 2016, pp. 141-156.
29. General Secretariat for Development Planning (GSDP). The Qatar National Development Strategy 2011-2016 (QNDS), 2011, Doha: QNDS, pp. 130-131.

AUTHORS PROFILE



DNA-based cryptography: motivation, progress, challenges, and future
1A.E. El-Moursy, 2Mohammed Elmogy, 3Ahmad Atwan
1,2,3Information Technology Dept., Faculty of Computers and Information, Mansoura University, Egypt
Email: [email protected], [email protected], [email protected]

ABSTRACT
Cryptography is about constructing protocols by which different security measures are added to our precious information to block adversaries. The properties of DNA have been employed in different sciences and for cryptographic purposes. Biological complexity and computing difficulty provide twofold security safeguards and make such schemes difficult to penetrate. Thus, a development in cryptography is needed, not to negate the tradition but to make it applicable to new technologies. In this paper, we review the most significant research achieved in the DNA cryptography area. We analyse and discuss its achievements, limitations, and suggestions. In addition, some modifications are suggested to bypass some detected inadequacies of these mechanisms and to increase their robustness. Biological characteristics and the limitations of current cryptography mechanisms are discussed as motivations for heading in the DNA-based cryptography direction.
Keywords: DNA; cryptography; encryption; DNA computing; bio-inspired cryptography;
1. INTRODUCTION
With technological development, our valuable information, including financial transactions, is transmitted back and forth over public communication channels, posing a considerably high challenge in confronting unintended intruders. One suggestion is cryptography, which is about constructing mathematically and theoretically strong protocols by which different security measures are added to such precious information. DNA computing is a new science that emerged in recent years and has proven to be very efficient in energy consumption, information storage capacity, and parallel processing. Deoxyribonucleic acid (DNA) consists of molecules arranged in a certain sequence to encode the information needed for building and maintaining the vital operations of an organism, similar to the way in which binary bits appear in a certain order to form different information in our digital world [1].
1.1 DNA computer
A DNA computer, or biomolecular computer, is a computer whose input, system, and output are wholly or partially made of DNA molecules, biochemistry, and molecular biology hardware instead of silicon chip technologies. The complexity and ingenuity of living beings are built on a simple coding system functioning with only four components of the DNA molecule, similar to the binary coding system of traditional computers. This coding system makes DNA very well suited as a medium for storing and processing data.
1.2 DNA biological anatomy
DNA is the blueprint of a living organism; it carries the instructions for its vital processes. DNA is a collection of molecules stuck together to form long strands; certain combinations of these DNA sequences code for amino acids, which are the building blocks of a living organism. Amino acids, in turn, combine to form proteins, proteins create living cells, and cells create organs. The bases of DNA nucleotides are of four types (guanine, adenine, thymine, and cytosine), labelled G, A, T, and C, respectively, and DNA usually exists in nature in the form of double-stranded molecules, see Figure 1. Human DNA consists of about 3×10^9 bases, and more than 99 percent of those bases are the same in all people.



Figure. 1 A short section of a DNA helix and its associated base pairs [1]
1.2.1 Central dogma and genetic code
The central dogma is the overall process by which DNA nucleotides are used to synthesize protein (see Figure 2), which performs the body's major processes.

Figure. 2 The central dogma process: DNA → (transcription) → RNA → (translation) → protein
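As a toy software illustration of the first step of this process (not a biological simulation), the minimal sketch below transcribes a DNA coding strand into the corresponding messenger RNA by replacing thymine (T) with uracil (U). Biologically, the polymerase reads the complementary template strand; working on the coding strand is the usual programming shorthand.

    def transcribe(dna: str) -> str:
        """Transcribe a DNA coding strand into mRNA: thymine (T) is replaced by uracil (U)."""
        if not set(dna) <= set("ACGT"):
            raise ValueError("sequence may only contain A, C, G, T")
        return dna.replace("T", "U")

    # The example sequence used in Section 1.2.1.
    print(transcribe("AGAGTCTGAGCA"))   # AGAGUCUGAGCA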

DNA can be presented as a sequence of nucleotides: AGAGTCTGAGCA. The genetic code is a DNA code written in the form of triplets, named codons. Each triplet uniquely codes for one of the amino acids, see Figure 3; there are also three codons reserved as 'stop' or 'nonsense' codons, indicating the end of the coding portion. Since codons are three-letter combinations of four different bases, there are 64 possible codons. These encode the 20 standard amino acids, providing redundancy, so that most amino acids are encoded by more than one codon.

Figure. 3 DNA-to-amino-acid three-letter abbreviation coding table [2]
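To make the codon idea concrete, the following minimal sketch splits a DNA sequence into triplets and looks each one up in a small, partial codon table; only a handful of the 64 codons are listed here for illustration, whereas a complete table corresponds to Figure 3.

    # Partial codon table (three-letter amino acid abbreviations); '*' marks a stop codon.
    CODON_TABLE = {
        "ATG": "Met", "TTT": "Phe", "GAA": "Glu", "AGA": "Arg",
        "GTC": "Val", "TGA": "*",   "GCA": "Ala", "TAA": "*",
    }

    def translate(dna: str) -> list:
        """Split a DNA sequence into codons and map each codon to an amino acid abbreviation."""
        codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
        return [CODON_TABLE.get(c, "unknown") for c in codons]

    # A real ribosome would stop translating at the stop codon; this toy simply maps every codon.
    print(translate("AGAGTCTGAGCA"))   # ['Arg', 'Val', '*', 'Ala']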



1.2.2 Hybridization
DNA molecules associate to form two long complementary strands (joined by the hybridization process defined by the Watson-Crick complementarity rule, see Figure 4) running anti-parallel to form a double-helix structure, each strand being made of repeating subunits called nucleotides.
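A minimal sketch of the Watson-Crick rule just described: each base pairs with its complement (A with T, G with C), and, in this simplified software model, two strands of equal length can hybridize when one is the exact reverse complement of the other. This is an illustrative check, not a model of the biochemical process.

    COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def reverse_complement(strand: str) -> str:
        """Return the Watson-Crick reverse complement of a DNA strand."""
        return "".join(COMPLEMENT[b] for b in reversed(strand))

    def can_hybridize(s1: str, s2: str) -> bool:
        """In this toy model, two equal-length strands hybridize if they are exact reverse complements."""
        return s2 == reverse_complement(s1)

    print(reverse_complement("AGAGTCTGAGCA"))              # TGCTCAGACTCT
    print(can_hybridize("AGAGTCTGAGCA", "TGCTCAGACTCT"))   # True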

Figure. 4 The hybridization process: a) pairing is unstable and cannot go further because the sequences have different, non-complementary bases, so the strands come apart; b) base-pairing continues because the sequences are complementary [1]
1.3 Cryptography
Cryptographic techniques are the kernel of the whole information security field. Cryptography not only means preventing data from being hijacked; it is also used for authentication. Three types of cryptographic schemes achieve these goals: secret-key (symmetric) cryptography, public-key (asymmetric) cryptography, and hash functions. Two techniques, namely block ciphers and stream ciphers, can be implemented in hardware or software.
1.3.1 Symmetric vs asymmetric
Symmetric encryption is the oldest and most widely known type; it transforms the plaintext in a particular way, which might be as simple as changing the sequence of the letters or shifting the letters by a fixed number of places in the alphabet. As long as both sender and recipient agree on the same secret key, they can easily encrypt and decrypt all messages that use that key. The problem with this method is that the key needs to be exchanged over a public network while preventing it from falling into the wrong hands. Asymmetric encryption is a suggestion to bypass this: a mechanism that uses two related keys, one for encryption, named the public key and made freely available to anyone who wants to send a message, and the other for decryption, named the private key and kept secret. The decryption process cannot be completed without the corresponding private key. The asymmetric mechanism is therefore more secure for key distribution, but it lags behind the corresponding symmetric mechanisms in speed. A combination of the speed advantage of secret-key systems and the security advantages of public-key systems in a hybrid system is the best solution. Such a protocol is called a digital envelope.
1.3.2 Block cipher vs stream cipher
A block cipher applies encryption to a block of data at once and requires high computation capability. The key elements in a block cipher are diffusion and confusion. The Data Encryption Standard (DES) and the Advanced Encryption Standard (AES) are examples of block ciphers. A stream cipher performs substitution, bit-wise operations, etc., on each bit of the plaintext independently, in the manner of a one-time pad.
1.3.3 Hardware vs software implementation
Cryptographic modules can be executed either in hardware or in software. Software implementations are known for being easier to maintain and develop, but they are less secure than their hardware equivalents. The reason is that software solutions make use of shared memory space, run on top of an operating system, and are more susceptible to modification.
1.3.4 Cryptography challenges
The more information technology progresses, the more challenges it will confront in terms of assurance, integrity, and security, as most of the currently used cryptographic schemes rely on computational hardness assumptions that cannot keep up with the rapid pace of information technology development. Cryptography challenges stem from several factors: one is the limitations of existing algorithms, another is the cryptanalysis attempts against them, in addition to the enormous development of new computing paradigms such as quantum computing and molecular computing.
Quantum Computing: The pace at which scientists keep minimizing the transistor size on the silicon chip of a classical computer will one day reach its limit, and quantum computing may be a replacement. Quantum computers use quantum bits, called qubits, for example based on a single electron instead of digital circuits; the difference is that a qubit does not have to be in an exact state. This is called quantum superposition: the same qubit may be zero, one, or a superposition of both, governed by quantum mechanics. A quantum computer is not simply faster than a classical computer; the key advantage is parallelism, as qubits can take part in multiple calculations at once, in addition to the fact that one "transistor" on a quantum computer can be about the size of one atom. Quantum computing is one of the threats that undermine today's cryptography algorithms, although we are still far from a practical implementation.
DNA Computing: DNA computing takes advantage of the massive parallelism embedded in its molecules to try different possibilities at once. If a DNA computer can be realized in a real-world implementation, it could be faster and smaller than any other computer built so far. For instance, the prime factorization problem might be solved in an affordable time using a molecular computer.
Secure Channels: There is no perfectly secure channel in the real world; at best, an insecure channel can be made less insecure. Secure channels are a way to prevent intruders from overhearing or tampering. Confidential channels prevent overhearing but not tampering, and authentic channels prevent tampering but not overhearing. A quantum communication channel may be a solution, allowing a quantum state to be transmitted as photons through an optical fibre or free space.
Cryptographic Algorithms Limitations: Cryptography, in general, is a standard procedure with a different key for each encryption process, so the attacker focuses on only one variable, which is guessing the key used in the algorithm. DES has a key of 56 bits, which can be brute-forced, as demonstrated ten years ago (see Table 1), and the key size also raises some potential challenges when encrypting data sizes of gigabytes, which are not that big nowadays. In addition, its small block size makes it susceptible to linear and differential cryptanalysis; it is half the speed of the Advanced Encryption Standard (AES) and has half the key size.
Table. 1 The DES key strength vs. modern computers [3]

Type of attacker        Budget          Time per key recovered, 40 bits    Time per key recovered, 50 bits
Pedestrian hacker       $400            1 week                             Infeasible
Small business          $10,000         12 minutes                         556 days
Corporate department    $300,000        0.18 sec.                          3 hours
Big company             $10,000,000     0.005 sec.                         6 minutes
Intelligence agency     $300,000,000    0.002 sec.                         12 seconds

3DES is a trick to keep the DES implementation alive a while longer by cascading three instances of DES. 3DES is believed to remain viable for at least 100 more years, but it is six times slower than the Advanced Encryption Standard (AES) with the same key size, especially in software, as it was designed for hardware implementation from the beginning. AES is the replacement for DES, which did its job very faithfully and was never compromised mathematically.
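To get a rough feel for the arithmetic behind figures like those in Table 1, the sketch below computes the expected brute-force time for a given key size and guess rate. The guess rates are illustrative assumptions for the sake of the example, not measurements from the cited study.

    def brute_force_time(key_bits: int, guesses_per_second: float) -> float:
        """Average seconds to find a key by exhaustive search (about half the key space)."""
        keyspace = 2 ** key_bits
        return (keyspace / 2) / guesses_per_second

    # Illustrative guess rates (assumptions only).
    for bits in (40, 56):
        for rate in (1e6, 1e9, 1e12):
            days = brute_force_time(bits, rate) / 86400
            print(f"{bits}-bit key at {rate:.0e} guesses/s: about {days:.3g} days")

Each additional key bit doubles the key space, which is why the step from 40-bit to 56-bit keys in the discussion above moves brute-force times from minutes or days to years for the same attacker.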



AES keeps the tradition of DES; it is a symmetric encryption algorithm designed for use by US federal organizations and approved by the National Security Agency (NSA). It uses 128, 192, or 256-bit keys to bypass the DES challenges, and it is efficient in both software and hardware implementations. It is the best so far; however, AES being unbreakable now means no more than DES being unbreakable ten years ago. The key advantage of AES over DES is that it reduced the number of rounds to 10 instead of the 16 in DES, which simplifies the time complexity, in addition to the parallelism implied by the doubled plaintext block size. It is, however, susceptible to side-channel attacks, which do not attack the AES cipher itself but rather its implementation. Another challenge to making AES unlimited in secrecy is the key exchange procedure, which gives an intruder the possibility to overhear or tamper with the key even if it is being sent over a supposedly secure channel. In addition, AES is complex, while many applications require reduced complexity. RSA is the most widely used public-key cryptosystem; it involves four steps: key generation, key distribution, encryption, and decryption. RSA involves two keys, one for encryption and the other for the decryption process, and its security is based on the factorization problem. The security of RSA exceeds that of other cryptography algorithms, but at the expense of speed, so it is used for encrypting the encryption keys of symmetric cryptography algorithms for secure transmission in hybrid encryption techniques.
Cryptanalysis is the field in which hidden or protected information and procedures are studied in order to gain access to the content of the encrypted information. Brute-force attacks are so named because they do not require much intelligence in the attack process; they simply try each possible key until the correct key is found. For a key size of n bits, this takes about 2^(n-1) steps on average (half the key space) and 2^n steps in the worst case. Parallel and distributed attacks are systematic brute-force attacks that work in a parallel and distributed manner: with N processors, the key can be found roughly N times faster than with only one processor. Cryptanalytic attacks, contrary to brute-force attacks, rely on techniques that involve some intelligence ahead of time, which provides a significant reduction of the search space. While an in-depth discussion of cryptanalytic techniques is beyond the scope of this research, the currently known cryptanalytic attacks are discussed briefly: differential cryptanalysis analyzes how differences in the plaintext correspond to differences in the ciphertext, and linear cryptanalysis uses linear approximations to describe the internal functions of the encryption algorithm. An eternal challenge confronting cryptography is how to know when an insecure channel worked securely (or, perhaps more importantly, when it did not). In addition, the static procedures in encryption algorithms give the eavesdropper a clue about what to do to reverse the encrypted data. The bottom line is that, according to the obstacles mentioned above, there is no perfect cryptography algorithm, and the key length is a double-edged sword when dealing with cryptography algorithms. The encryption algorithms themselves may not be the weak point of an encryption product; rather, there are implementation flaws or key management errors. The intrinsic idea of a cryptosystem is a one-way function. A function is called a one-way function if it is easy to compute but extremely hard, though not impossible, to invert: for f(x) = y, it is easy to compute f(x) for every x in the domain, while it is computationally infeasible, given y in the range, to find any x such that y = f(x). A trapdoor one-way function is a one-way function that becomes easy to invert given some additional information (the trapdoor information) [4].
1.3.5 DNA computing
The organization of organisms is based on a coding system with four components, making it very well suited to data processing and storage. According to different calculations, one gram of DNA could potentially hold 512 exabytes; on top of that, the theoretical maximum data transfer speed would be enormous due to the massive parallelism of the computation. DNA computing is the field of science in which biology and computer science merge. The development of biocomputers has been made possible by the science of nano-biotechnology, created by combining the technology of nanoscale materials with biologically based materials. Though DNA computing is still at a primitive stage, it has been applied in many fields and has proved efficient in successfully solving hard problems, including, but not limited to, NP-complete problems (nondeterministic polynomial time), the 0-1 planning problem, the integer planning problem, optimization problems, graph theory, cryptography, databases, etc. Utilizing DNA's extraordinary characteristics and integrating them into information technology science will produce an incredible leap in the information technology field in the next few years.
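To make the one-way function idea above concrete, here is a toy sketch of the direction that RSA relies on: multiplying two primes is fast, while recovering them from the product by naive trial division quickly becomes impractical as the numbers grow. The primes below are tiny placeholders chosen only for illustration; real RSA moduli are hundreds of digits long and far beyond trial division.

    def multiply(p: int, q: int) -> int:
        """The 'easy' direction: a single multiplication."""
        return p * q

    def factor(n: int) -> tuple:
        """The 'hard' direction: naive trial division, whose cost grows exponentially in the bit length of n."""
        d = 2
        while d * d <= n:
            if n % d == 0:
                return d, n // d
            d += 1
        return n, 1   # n is prime

    n = multiply(10007, 10009)    # easy
    print(n)                      # 100160063
    print(factor(n))              # (10007, 10009), already noticeably more work than the multiplication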



1.3.6 The idea
Biological information is complex and voluminous; this gave rise to the science of bioinformatics for understanding and analyzing these data, since no human brain has the ability to cope with them. However, when conventional machines reached the ceiling of their capability, scientists needed another approach to deal with information that the machine cannot handle by itself. One suggestion is a bio-inspired computer, a computer inspired by biological operations. Adleman [5] performed the first experiment when he solved a Hamiltonian path graph problem with DNA molecules, implementing the experiment biologically. Later, the idea was extended to computational biology by replacing wet-lab experiments with computers, known as the DNA computer. Two features of the DNA structure account for its remarkable impact on science: the core idea is as simple as its string nature, and its complementarity resembles the digital structure. The genes themselves are made of information, stimulating research in molecular DNA storage.
1.3.7 Molecular computing history
Some researchers have shown that biological molecules such as DNA and enzymes can be engineered to act like electrical circuits [6]. Someday, these biological circuits could be used, say, to make sensors in cells that would know when to release drugs into the body. The tools of molecular computing start with Adleman [5], a professor at the University of Southern California, whose pioneering work in 1994 set the stage for biocomputing research combined with the field of mathematics. He took advantage of the biochemical level for solving problems that require an enormous amount of computation or are unsolvable by conventional computers. He encoded a small graph, by means of the Hamiltonian path problem, in DNA molecules; the computational "operations" were performed with standard molecular protocols and enzymes. Adleman's experiment demonstrated the feasibility of carrying out computations at the molecular level. Adleman's mechanism was as follows:
1. Encode all candidate solutions to the computational problem of interest;
2. Generate all possible solutions to the computational problem;
3. Keep only paths that start at S and end at E;
4. Keep only paths that have N vertices;
5. Keep only paths that visit each vertex once;
6. Use Polymerase Chain Reaction (PCR) technology to amplify the remaining DNA molecules and read out the solution.
The technique was later extended by various research, including in cryptanalysis, as Boneh et al. [7] explained that the Data Encryption Standard (DES) cryptographic protocol could be broken. From here it follows that if DNA computing can break codes, it can also be exploited to encrypt data, as Pramanik and Setua [1] presented a cryptography technique using the DNA molecular structure, a one-time-pad scheme, and DNA hybridization techniques. Gearheart et al. [8] were able to demonstrate a novel logic gate design based on chemical reactions in which observation of the double-stranded sequence indicated a truth evaluation. Ogihara and Ray [9] suggested an implementation of DNA-based Boolean circuits. In 2002, researchers at the Weizmann Institute of Science [10] unveiled a computing storage device made of enzymes and DNA molecules instead of silicon microchips. Finally, in March 2013, researchers created a transcriptor (a biological transistor).
1.3.8 DNA technologies
Polymerase Chain Reaction: PCR is critical in DNA computing, as it is the technology used to extract the problem's solution. Furthermore, it makes it possible to amplify a sample of DNA over several orders of magnitude using primers. From a cryptographic point of view, PCR can be useful because it requires two primers to accomplish the amplification process. For an adversary, it would be extremely difficult to amplify the message-encoding sequence with PCR without the correct primers, chosen from about 2 × 10^23 kinds of sequences.
DNA Fragment Assembly refers to aligning and merging fragments of a DNA sequence to reconstruct the original sequence. The technique is used in DNA sequencing, which cannot read whole genomes at once. There are many other DNA technologies, such as gel electrophoresis, which is used to separate mixed DNA fragments, and DNA chip technology, which is used for gene expression analysis and DNA profiling, but they are beyond the scope of this research, as the authors here focus only on technologies that can be used in cryptography.
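As a rough software analogue of Adleman's generate-and-filter steps listed in Section 1.3.7 (a brute-force enumeration in Python, not the biochemical protocol itself), the sketch below generates every ordering of the vertices and keeps only those that start at S, end at E, and follow existing edges; the tiny graph is made up purely for illustration.

    from itertools import permutations

    def hamiltonian_paths(verts, arcs, start, end):
        """Generate every vertex ordering (step 2) and filter it as in steps 3-5 of Adleman's mechanism."""
        paths = []
        for order in permutations(verts):                            # all candidate orderings
            if order[0] != start or order[-1] != end:                # keep paths from S to E
                continue
            if all((a, b) in arcs for a, b in zip(order, order[1:])):  # keep edge-respecting paths
                paths.append(order)                                  # each vertex appears once by construction
        return paths

    verts = {"S", "A", "B", "E"}
    arcs = {("S", "A"), ("A", "B"), ("B", "E"), ("S", "B"), ("A", "E")}
    print(hamiltonian_paths(verts, arcs, "S", "E"))   # [('S', 'A', 'B', 'E')]

Step 1 corresponds here to representing vertices as strings, and step 6 (the PCR read-out) to simply returning the surviving candidates; in the wet-lab version each filtering step is a separate biochemical operation over an entire test tube of strands at once.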



1.3.9 DNA advantages in computing
Recent research on DNA computing has focused on DNA as an information carrier, from ultra-compact information storage to ultra-large-scale computation. The advantages of DNA can be listed as follows:
• Parallel processing: about 10^26 operations/sec
• Low power consumption
• Incredibly light weight
• Data capacity: 2.2 exabytes per gram
• Imperishable storage

1.3.10 Limitations and challenges
Despite its bright future, research on DNA cryptography is still at an initial stage, and many areas remain uncovered. Moreover, it is confronted with some obstacles, the same ones that confront DNA computing research, which can be summarized as follows.
Theoretical problems: Shannon's theory established that a powerful tool for generating keys in encryption algorithms should use complex mathematical procedures, and DNA cryptography does not yet have a mature mathematical background to support this theory.
Difficult implementation: A great deal of material and many biological and laboratory experiments are needed to produce a DNA-based cryptosystem; this might be one of the reasons why only a few examples of reliable DNA cryptography mechanisms have been exhibited. A solid foundation connecting the biological structure with computer science is required as the standard for evolving efficient and stable algorithms for DNA computing; therefore, researchers are still excavating for much more theoretical than practical foundation. Further, its computing speed is slow, and solution analysis in a molecular computer is a lot harder than in a digital one. So, there is a need to create a bridge between existing and new technologies and to open possibilities for a hybrid cryptographic system that provides high confidentiality and a stronger authentication mechanism.
1.3.11 DNA digital coding
From a computational point of view, or more precisely in the eyes of the coder, anything can be coded in binary. Researchers [4] were able, using the same tactic, to store a JPEG image and an audio file in digital DNA storage; this kind of storage device is much more compact than currently used magnetic tapes, owing to the data density inherent in DNA, in addition to its longevity advantages. When DNA digital coding is mentioned, it means any scheme that represents digital data in terms of a DNA sequence (see Figure 5).

Figure. 5 The DNA digital coding [11]
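As an illustration of such a coding scheme, the sketch below maps each pair of bits to one base using one common convention (00→A, 01→C, 10→G, 11→T); the exact assignment in Figure 5 and reference [11] may differ, so treat this mapping as an assumption for the example.

    BIT_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}   # one possible convention
    BASE_TO_BIT = {base: bits for bits, base in BIT_TO_BASE.items()}

    def bytes_to_dna(data: bytes) -> str:
        """Encode arbitrary binary data as a DNA sequence, two bits per base."""
        bits = "".join(f"{byte:08b}" for byte in data)
        return "".join(BIT_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

    def dna_to_bytes(seq: str) -> bytes:
        """Decode a DNA sequence produced by bytes_to_dna back into bytes."""
        bits = "".join(BASE_TO_BIT[b] for b in seq)
        return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

    encoded = bytes_to_dna(b"Hi")
    print(encoded)                 # CAGACGGC
    print(dna_to_bytes(encoded))   # b'Hi'

At two bits per base, n bytes of data need 4n bases; the density figures quoted in Section 1.3.9 come from how little mass those bases occupy.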


1.3.12 DNA cryptography
Nowadays, the field of biology and that of cryptography have come together (see Figure 6). DNA computing opens a new way for cryptography. The nucleotide bases have the capability of creating self-assembling structures that provide excellent means of executing computations. Currently, several DNA computing algorithms are implemented for encryption, cryptanalysis, key generation, and steganography, so, from the cryptographic perspective, DNA is very powerful.
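One of these uses, key generation, can be sketched in a few lines: a secret DNA sequence is hashed to derive key material. This is purely an illustrative construction of the general point above, not a scheme taken from the literature surveyed later; hashlib and secrets are standard Python library modules.

    import hashlib
    import secrets

    def random_dna(length: int) -> str:
        """Generate a random DNA sequence to act as shared secret material."""
        return "".join(secrets.choice("ACGT") for _ in range(length))

    def key_from_dna(dna: str, key_bytes: int = 32) -> bytes:
        """Derive a fixed-size key by hashing the DNA sequence (SHA-256 here)."""
        return hashlib.sha256(dna.encode("ascii")).digest()[:key_bytes]

    secret_strand = random_dna(64)
    print(secret_strand)
    print(key_from_dna(secret_strand).hex())

Each base contributes two bits of entropy, so a 64-base strand carries up to 128 bits of randomness, comparable to an AES-128 key.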

Figure. 6 DNA cryptography at the intersection of computer science, information security, and biological science

Why DNA encryption instead of digital encryption? Most of the cryptographic systems have been broken at least partially, if not completely or may likely to be a break in the future by the new generation computers. Thus, threats grow exponentially with the growth of technology, from here data transmission and storage have become vulnerable more day after another. As the key element of breaking any encryption mechanism is brute force attacks. One proved when Shamra et al., [12] showed the capability of breaking the Simplified DES (SDES) algorithm at an affordable cost and in a reasonable time. Bhateja and Kumar [13] showed that with the aid of genetic algorithms he could break the Vigenere cipher without the key using elitism with a novel fitness function. Furthermore, about key generation matter, any key is generated in the form of binary code making the exponential power to be two, in contrast to the DNA code exponential power which is 4, making a single bit key eight times stronger. The bottom line is, the complexity and randomness of DNA structure add an extra layer of security by the mean of cryptography, in addition to its biological capability in high data capacity and parallel processing. From here, the concept of integrating DNA in the field of cryptography has been identified as a possible technology that brings forward a new hope for raising more robust algorithms. 1.3.13 DNA cryptography future The last few years have witnessed a high leap in this area of DNA cryptography and have seen real progress in applying DNA methodologies into cryptography, and there are a quite number of schemes that perceived the interest of cryptography at the biological level. At present, the work in bio-inspired cryptography, especially from DNA, is focused on applying some technique to encode binary data in the form of DNA sequences. Nevertheless, it still needs much more theoretical and practical implementation; DNA cryptography brings new hope to the future of the information security. The ultimate goal is to enable the discovery of new computational biology insights in addition to creating a global perspective from which unifying principles in biomolecular computing can be discerned. The rest of this paper is organized as follows: Section 2 lists and categorizes existing cryptography schemes inspired by DNA structure, Section 3 represents current research topics and challenges faced, and the paper is finally concluded in Section 4. 2. RELATED WORK DNA Cryptography is a new information security branch; it encrypts the information in the form of DNA sequence, making use of its biological properties. In general, existing DNA cryptography techniques use modern biological technologies as implementation tool and DNA as an information carrier. Common biological technologies among recent literature are included DNA Hybridization, PCR amplification, DNA Synthesis DNA digital coding, etc., this section will categorize exertion carried 74

2.1 Security techniques
Steganography, key generation, and encryption are the major techniques involved in any cryptosystem. These techniques and their types are shown in Figure 7.

Figure. 7 Security techniques used in cryptography: steganography, encryption, authentication, and key generation

2.2 Steganography
DNA steganography is the branch of cryptography in which a message is hidden inside other messages. Because DNA sequences of any desired length can be synthesized, DNA is well suited to data hiding; if the random character of DNA is exploited, the resulting technique is in principle very hard to defeat. The principle of DNA steganography is to conceal the valuable information that needs protection within a large number of irrelevant DNA sequences. Only the intended recipient can find the correct DNA fragment, based on information agreed upon before the data transfer. Such a process may not be considered encryption, since the plaintext is not encrypted but only disguised within other media.
2.2.1 Encryption
Encryption means converting ordinary information (called plaintext) into an unintelligible form (called ciphertext); decryption is the reverse. The detailed operation of a cipher is controlled both by the algorithm and by a key, usually a short string of characters. Encryption is the art of protecting data by transforming it into an unreadable form (ciphertext) using a pre-agreed scheme that is publicly available; only those who possess the secret key can decipher the message back to the plaintext. In recent years, scientists have implemented DNA cryptography using modern biological processes as the tool and DNA as the information carrier. DNA encryption can be classified into two subfields:
a) Symmetric key DNA cryptography: in symmetric key (also known as secret key) cryptography, the sender and the receiver share the same key for encryption and decryption.
b) Asymmetric key DNA cryptography: asymmetric key cryptography is also known as public key cryptography. It uses two keys: one to encrypt the data, labeled the public key, which is distributed freely to everyone, and one to decrypt it, labeled the private key, which each party keeps secret. The two keys are related and generated together.
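To make the concealment principle of Section 2.2 concrete, the following minimal Python sketch (an illustration under simple assumptions, not one of the surveyed schemes) hides a secret DNA fragment inside a long run of randomly generated filler bases; the agreed offset and fragment length play the role of the pre-shared information that only the intended recipient holds.

```python
import random

BASES = "ACGT"

def random_bases(n: int) -> str:
    """Generate n random filler bases."""
    return "".join(random.choice(BASES) for _ in range(n))

def hide(secret: str, offset: int, total_length: int) -> str:
    """Embed a secret DNA fragment inside random filler bases at an agreed offset."""
    if offset + len(secret) > total_length:
        raise ValueError("carrier sequence too short for the secret fragment")
    return random_bases(offset) + secret + random_bases(total_length - offset - len(secret))

def reveal(carrier: str, offset: int, length: int) -> str:
    """Only a party who knows the agreed offset and length can extract the fragment."""
    return carrier[offset:offset + length]

carrier = hide("ATGGTACCT", offset=128, total_length=1024)
assert reveal(carrier, offset=128, length=9) == "ATGGTACCT"
```

Note that, exactly as the text above states, nothing here is encrypted; the security of the sketch rests entirely on the secrecy of the offset and on the fragment being indistinguishable from the surrounding filler.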


2.2.2 Authentication
Authentication determines whether someone or something is, in fact, who or what it is declared to be, by adding extra information to the original data. Authentication may take the form of digital watermarking [14] or fingerprinting [15]. Digital watermarking is an authentication technique used to verify the originality of the data by adding extra bits to the original data; these bits are designed to go completely unnoticed, unlike printed watermarks, which are intended to be visible because the reference is ultimately human perception.
2.2.3 Key generation
Key generation is the process of generating the keys used in cryptography. The key element of a strong scheme is randomness, typically provided by a pseudorandom number generator (PRNG), a computer algorithm that produces data that appears random under analysis. Modern cryptographic schemes use a combination of symmetric and asymmetric key algorithms, since asymmetric algorithms tend to be much slower than symmetric ones.
2.3 Inspired procedure
DNA cryptography can be implemented using classical computational operations, using biologically inspired processes, or using a combination of both. Biologically inspired: biological operations such as PCR, DNA synthesis, DNA hybridization, DNA fragment assembly, translation, transcription and splicing can be involved in the process of encryption and decryption. These operations are summarized in Figure 8:

• PCR: a molecular biology technique used to amplify a piece of DNA across several orders of magnitude. Without the two primers used in PCR, the process is infeasible to reverse.

• Splicing: cutting DNA from one sequence and pasting it into another DNA sequence.

• DNA hybridization: the process of combining two complementary single-stranded DNA molecules and allowing them to form a double-stranded molecule through base pairing.

• Translation: the process of relating a DNA sequence to the amino acids in a protein.

• Transcription: the process of transcribing genetic information from DNA to mRNA.

• Bio-XOR: a logic function based on biological characteristics.

• DNA fragment assembly: a technique used to assemble many fragments of a DNA sequence into one long DNA chain.

Figure. 8 Biologically inspired procedures involved in the cryptography algorithms
Computationally inspired: computational operations (arithmetical, mathematical, etc.) can likewise be involved in the encryption and decryption process within a DNA cryptography technique [16]. This type of encryption uses a DNA coding scheme and then treats the DNA sequence as zeros and ones; hence, any arithmetic operation can be applied to build the encryption algorithm, or classical encryption techniques can be used directly on DNA sequences (see Figure 9). A minimal sketch of such a coding scheme is given below.
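To make the idea of treating a DNA sequence as zeros and ones concrete, the sketch below assumes the common two-bit coding A=00, C=01, G=10, T=11 (one of several conventions used in the surveyed papers) and layers an ordinary bitwise XOR with a key on top of it; any other classical operation could be substituted for the XOR in the same way.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}   # one common DNA digital coding
DECODE = {v: k for k, v in CODE.items()}

def dna_to_bits(seq: str) -> str:
    return "".join(CODE[b] for b in seq)

def bits_to_dna(bits: str) -> str:
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def xor_encrypt(plain_dna: str, key_dna: str) -> str:
    """Bitwise XOR of two equal-length DNA sequences via their binary encodings."""
    p, k = dna_to_bits(plain_dna), dna_to_bits(key_dna)
    cipher_bits = "".join("1" if a != b else "0" for a, b in zip(p, k))
    return bits_to_dna(cipher_bits)

cipher = xor_encrypt("ACGTACGT", "TTGACCAA")
assert xor_encrypt(cipher, "TTGACCAA") == "ACGTACGT"   # XOR is its own inverse
```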


• Bitwise operations: operations such as complement, substitution, XOR, insertion, binary addition and binary subtraction can be applied to a DNA-encoded sequence to add an extra layer of security to the algorithm.

• DNA indexing: each part of the OTP key is assigned an index number, shifted one by one for every DNA sequence, and the sequence is then replaced with the array of indices.

• Matrix operations: a number of basic operations can be applied to modify matrices, including addition, scalar multiplication, transposition, matrix multiplication, row operations and submatrix extraction.

Figure. 9 Computationally inspired procedures involved in cryptography algorithms
Khalifa and Atito [17] demonstrated a steganography approach using basic biological DNA concepts. The method is implemented at two main levels: first, a DNA-based Play-fair cipher is applied to encrypt the message; the second level uses a two-by-two generic complementary substitution rule to hide the encrypted DNA within a reference sequence. A performance analysis was presented on hiding capacity as well as robustness against attacks, and the proposed technique was tested on different real DNA sequences considering parameters such as time performance and capacity. In conclusion, the proposed scheme can encrypt information into DNA sequences and additionally hide these data inside another reference DNA sequence, increasing the security level. Pramanik and Setua [1] presented a simple encryption method based on DNA sequencing, the Watson-Crick complementary rule, and a one-time pad (OTP). The plaintext is first converted to binary via the ASCII table, and a DNA sequence of a prefixed length (agreed by sender and receiver) is created as the OTP. The sender scans the binary bits and the DNA sequence in reverse order; for each 1 bit he takes the Watson-Crick complement of the corresponding DNA base, for each 0 bit he leaves it unchanged, and the result is sent to the receiver. The technique is distinguished by minimizing time consumption, although it lacks encryption complexity and does not maximize data capacity. Javheri and Kulkarni [18] proposed a symmetric encryption technique based on mathematical calculations and DNA digital coding. The data is first arranged in a matrix and a substitution is performed with a key generated from a modulo operation to enhance complexity; another layer of encryption is added in the same way, after extra information is appended to the cipher, to produce the primary cipher. Finally, DNA digital coding is performed to produce the final cipher. An implementation methodology and experimental results are presented, showing better time performance than DES. From a cryptographic point of view the proposed algorithm is sound, but its use of DNA is limited to DNA digital coding only. Jain and Bhatnagar [19] proposed a novel symmetric encryption technique based on two security levels, one based on a spiral transposition approach and the other on a DNA sequence dictionary table. Experimental results and analysis showed that the technique enhances overall security, although it increases key space complexity. Wang and Zhang [20] introduced a way to add a DNA encryption layer to RSA by converting the text into a DNA sequence, then into numbers, and finally encrypting the numbers with RSA. The extra layer did not enhance security, since RSA itself already provides a high level of security, but it did increase the time consumed, and only text can be encrypted. Vijayakumar et al. [21, 22] proposed to overcome this time limitation by applying the same idea with hyperelliptic curve cryptography. Cui et al. [23] produced an encryption scheme that uses PCR amplification as a steganography technique, concealing the message in a microdot in addition to DNA digital coding; the primers and the coding module serve as the key of the scheme. Clelland et al. [24] showed the capability of performing actual PCR amplification in the laboratory on encrypted data concealed in mixtures of genomic DNA from different organisms, in order to block any attempt to use a subtraction technique. The technique can also send individual secret messages to several recipients, with a unique set of primers used to amplify the message intended for each recipient. The work won a "Junior Nobel Prize."
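The one-time-pad construction of Pramanik and Setua [1] summarized above can be read roughly as follows; the sketch is an illustrative interpretation of that description (one pad base per plaintext bit), not the authors' exact implementation. Each plaintext bit selects whether the corresponding base of the shared OTP DNA sequence is sent unchanged (bit 0) or replaced by its Watson-Crick complement (bit 1); the receiver recovers the bits by comparing what arrives against the shared pad.

```python
WC = {"A": "T", "T": "A", "C": "G", "G": "C"}   # Watson-Crick complements

def to_bits(text: str) -> str:
    return "".join(format(ord(ch), "08b") for ch in text)

def from_bits(bits: str) -> str:
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

def encrypt(text: str, otp: str) -> str:
    """Complement the pad base where the plaintext bit is 1, keep it where the bit is 0."""
    bits = to_bits(text)
    assert len(otp) >= len(bits), "pad must be at least as long as the message bits"
    return "".join(WC[base] if bit == "1" else base for bit, base in zip(bits, otp))

def decrypt(cipher: str, otp: str) -> str:
    """A received base that differs from the pad encodes a 1 bit, an identical base a 0 bit."""
    bits = "".join("0" if c == p else "1" for c, p in zip(cipher, otp))
    return from_bits(bits)

pad = "ACGT" * 20                      # shared one-time pad (80 bases)
assert decrypt(encrypt("Hi", pad), pad) == "Hi"
```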


Sabry et al. [25] discussed a significant modification to the classic Play-fair cipher by applying it to DNA and amino acid structures. The data is pre-processed and converted into a DNA sequence, the corresponding amino acids are derived, and these amino acids then pass through a Play-fair encryption process. To bridge the gap between the 20 amino acids and the 25-letter Play-fair alphabet, some additional procedures are implemented. The presented technique enhances the security of the Play-fair cipher, and a performance analysis is presented that confirms the increased security. Sadeg et al. [26] proposed a new symmetric key block cipher algorithm that supports the confusion and diffusion factors. Many of its operations are computationally inspired, such as permutation, substitution and XOR, while others are biologically inspired, notably a new bio-XOR technique proposed in the paper. The paper also presents a key generation technique based on a bio-XOR indexing table used as a logic function. The work is distinguished by combining biological and computational factors, which gives the technique more complexity; in addition, the DNA module presented in this work is unpredictable to intruders. The performance analysis, however, showed an increase in time complexity over AES. Zhang et al. [11] proposed a new symmetric encryption technique based on DNA fragment assembly combined with DNA digital coding. The embedded key indicates how the original plaintext was arranged before the fragmentation process. The technique cuts the sequence randomly, which has the advantage of making it difficult to restore the original text without the key. The following table summarizes DNA-based cryptography research.
Table. 2 DNA-based cryptography

No. | Paper | Year | Type | Inspiration | Involved algorithms / techniques
1 | Hiding messages in DNA microdots [24] | 1999 | Steganography | Biologically | DNA digital coding, microdotting, PCR, 2 primer keys, DNA lab
2 | DNA-Based Steganography [27] | 2001 | Steganography | Biologically | DNA digital coding, random number generator
3 | A Novel Generation Key Scheme Based on DNA [28] | 2008 | Key generation | Computationally | Gene-Bank, key expansion matrix
4 | DNA Computing Based Cryptography [20] | 2009 | Asymmetric encryption | Computationally | RSA, PCR
5 | A Pseudo-DNA Cryptography Method [29] | 2009 | Symmetric encryption | Biologically | Central dogma
6 | An Encryption Algorithm Inspired from DNA [26] | 2010 | Symmetric encryption | Computationally, biologically | Transposition, matrix permutation, bio-XOR, central dogma
7 | DNA Encoding Based Feature Extraction for Biometric Watermarking [2] | 2011 | Watermarking | Computationally | DNA encoding, biometric watermarking, discrete wavelet transform
8 | Bi-serial DNA Encryption Algorithm (BDEA) [30] | 2011 | Asymmetric encryption | Computationally, biologically | PCR, BDEA
9 | Index-Based Symmetric DNA Encryption Algorithm [31] | 2011 | Symmetric encryption | Computationally | DNA encoding, GeneBank, logistic map
10 | High-Capacity DNA-based Steganography [17] | 2012 | Steganography | Biologically, computationally | Play-fair, DNA digital coding, substitution, DNA reference, complementary rules
11 | Integration of DNA Cryptography for Complex Biological Interactions [32] | 2012 | Symmetric encryption | Biologically | DNA digital coding, PCR, microdotting
12 | DNA Cryptography [1] | 2012 | Symmetric encryption | Computationally | OTP, DNA hybridization
13 | DNA Cryptography Based on DNA Fragment Assembly [11] | 2012 | Symmetric encryption | Biologically | DNA digital coding, DNA fragment assembly
14 | Three Reversible Data Encoding Algorithms based on DNA and Amino Acids' Structure [33] | 2012 | Symmetric encryption | Computationally, biologically | Central dogma
15 | Hardware Implementation of DNA Based Cryptography [34] | 2013 | Symmetric encryption, key generation | Computationally | AES (modified with DNA key), DNA code set, OTP
16 | A DNA Encryption Technique Based on Matrix Manipulation and Secure Key Generation Scheme [35] | 2013 | Symmetric encryption, key generation | Computationally, biologically | Matrix manipulation, DNA primers, central dogma
17 | A Novel DNA Sequence Dictionary Method for Securing Data in DNA using Spiral Approach and Framework of DNA Cryptography [19] | 2014 | Symmetric encryption | Biologically, computationally | Spiral transposition, DNA sequence dictionary table
18 | Algorithm for Enhanced Image Security Using DNA and Genetic Algorithm [36] | 2015 | Symmetric encryption | Computationally | Chaotic function, DNA digital coding, genetic algorithm
19 | Implementation of DNA Cryptography in Cloud Computing and using Socket Programming [37] | 2016 | Symmetric encryption | Biologically, computationally | Bi-directional DNA Encryption Algorithm (BDEA)

Table. 2 Continued

No. | Advantages | Disadvantages | What to enhance | Performance analysis
1 | Implementation simplicity; microdot increases data capacity | Security depends upon the referenced DNA sequences, which are available on the internet | Insert a key generation technique | None
2 | Automated way from data to DNA; hard to guess owing to its biological structure; large capacity | High-tech biomolecular laboratory required | None | -
3 | Improves independence; improves the strict avalanche criterion | Increases computation because of the matrix operation | Replace the expansion matrix or reduce the key size | High change in the matrix output for a slight change in the random DNA sequence input
4 | Two-level security | Basic biological operations | Increase the security by adding an additional mathematical operation | None
5 | Decreased time complexity; reduced cipher capacity | Liable to differential analysis | Add an extra level of security (traditional cryptography) | Data recovered successfully
6 | OTP; DNA sequence key; the DNA module (bio-XOR) is new and unpredictable | Key size (matrix); time complexity over AES/OpenSSL | Exclude the matrix operation | Outperforms AES but is surpassed by it in OpenSSL mode
7 | Secret code redundancy; additional security level | Image encryption only | Add binary encryption | Retrieval without loss
8 | Double-layer encryption; increased cipher capacity; asymmetric keys | None | Eliminate PCR amplification; use compression | -
9 | Reduced encrypted data; key selected by receiver and sender | Encrypts text only | Add a binary encryption feature | High key-change sensitivity
10 | Maximum hiding capacity | Encryption is too simple | Add encryption factors (confusion and diffusion) | Time performance, capacity, and payload
11 | Repeating encryption of the same plaintext produces a different ciphertext | Rudimentary operations; lacks DNA encryption factors | Add DNA encryption factors | None
12 | OTP; minimizes time complexity | No encryption used; no DNA encryption involved; security depends upon the key only; no analysis | Add an extra layer of encryption or involve an encryption technique | Security is as strong as the OTP
13 | The cutting is very random | Increased computation | Substitute the fragment assembly stage | Some errors in the overlap and layout stages
14 | Serves biological experiments and DNA computing | Does not include a secret key; encodes English characters only | Insert a key generation technique | Encoding is reversible without data loss
15 | Encryption and key generation techniques; primers increase security; the modified AES increases security | Text only (64); the modified AES is not illustrated; the DNA encryption is poor | Extend the 64-entry DNA set table to cover the 256 ASCII characters | Increased security
16 | Different cipher for the same key and the same data every time | Encrypts text only; matrix manipulation increases time complexity | Substitute the matrix operation | -
17 | Increased security at two levels; binary encryption | DNA sequence dictionary table increases key space complexity; ciphertext is too large; the binary security level has no key, so procedures have to be sent to receivers | Key generation technique at the binary level; reduce the DNA sequence dictionary table by a key | Reduced complexity
18 | High entropy; low correlation; sensitive to changes | Image encryption only | Add binary encryption capability | Retrieval without distortion
19 | Extends the BDEA encryption technique to Unicode characters | Not applicable to images and other data types | Apply to binary data to encrypt images and other data types | Applied and worked on a real-world web server

3. CURRENT RESEARCH TOPICS AND CHALLENGES
Current research is concentrated in two main directions. The first is enhancing the security of existing cryptographic techniques by adding an extra layer of security based on DNA characteristics; the main challenge in this direction is that the resulting mechanisms are not applicable to all data types and are often confined to images or ASCII characters. The second direction is not to discard the classical techniques but to extend them to new technologies such as DNA computing; the main challenge here is that the classical mechanisms were designed for binary data in the first place. Researchers have concentrated on involving DNA structure and bio-computing in their work, attempting to simulate biological processes and manipulate them by different means, often regardless of the fact that an ideal cryptographic scheme rests on both space complexity and time complexity. A common shortcoming is that many researchers implement data encryption for one data type only and avoid dealing with other data types. Another common shortcoming is that researchers add an extra layer of security to a classical encryption technique without accounting for time complexity, or vice versa. In particular, steganography research tends to hide specific data in enormous DNA microdots, making storage and transmission harder; key generation techniques have no backbone structure for generating and sharing keys; and the encryption techniques reviewed show that researchers add an extra layer of security using DNA structure while failing to preserve the processing time factor.
4. CONCLUSION
Combining the new field of DNA computing with conventional encryption algorithms, and mathematical operations with biological operations, increases confusion and produces a more robust, more secure algorithm that is hard to decipher without the key; this is DNA cryptography. The field is still confronted with obstacles at the infrastructure level, both theoretical problems and difficult implementation, as explained in this work. The authors have focused on these obstacles so that researchers can take advantage of the painstaking effort expended so far in the DNA cryptography research area, utilize the analysis and benefits of recent works, and try to bypass the limitations found in them. The authors also compared the various DNA cryptographic techniques; these comparisons should help future researchers design and improve DNA storage techniques for secure data storage more efficiently and reliably.
REFERENCES 1. Pramanik, S. and Setua, S. (2012) "DNA Cryptography," Proceedings of the 7th International Conference on Electrical and Computer Engineering, Dhaka, Bangladesh, pp. 551-554. 2. Arya, M.S., Jain, N., Sisodia, J. and Sehgal, N. (2011) "DNA encoding based feature extraction for biometric watermarking," 2011 International Conference on Image Information Processing (ICIIP), pp. 1-6. 3. Leech, D. and Chinworth, M. (2001) "The Economic Impacts of NIST's Data Encryption Standard (DES) Program," Strategic Planning and Economic Analysis Group, Planning Report 01-2. 4. Jacob, G. and Murugan, A. (2013) "DNA-based Cryptography: An Overview and Analysis," International Journal of Emerging Sciences (IJES), vol. 1, pp. 36-42. 5. Adleman, L. (1994) "Molecular computation of solutions to combinatorial problems," Science, 266(5187), pp. 1021-1024. 6. Chen, J. (2003) "A DNA-based, biomolecular cryptography design," Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03), vol. 3, pp. 822-825.


7. Boneh, D., Dunworth, C. and Lipton, R. (1995), “Breaking DES Using a Molecular Computer”, Department of Computer Science, Princeton University, USA, Technical Report CS-TR-489-95. 8. Gearheart, C., Rouchka, E. and Arazi, B. (2012) “DNA-Based Active Logic Design and Its Implications” Journal of Emerging Trends in Computing and Information Sciences, VOL. 3, NO. 5. 9. Ogihara, M. and Ray, A. (1996) “Simulating Boolean circuits on a DNA computer,” Technical Report 631, University of Rochester. 10. Stefan, L. (2003) “Computer Made from DNA and Enzymes", http://news.nationalgeographic.com/news/2003/02/0224_030224_DNAcomputer.html (Accessed 20 Jan 2017). 11. Zhang, Y., Fu, B. and Zhang, X. (2012) “DNA cryptography based on DNA Fragment Assembly” 8th International Conference on Information Science and Digital Content Technology, pp.179-182. 12. Shamra, L., Bhupendra, K. and Ramgapol, S. (2012) “Breaking of Simplified Data Encryption Standard using Genetic Algorithm” Global Journal of Computer Science and Technology, Vol. 12. 13. Bhateja, A. and Kumar, S. (2014) “Genetic Algorithm with elitism for cryptanalysis of Vigenere cipher” International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), Ghaziabad, 2014, pp. 373-377. 14. Hamad, S.; Khalifa, A., (2013) “Robust blind image watermarking using DNA-encoding and discrete wavelet transforms” 8th International Conference on Computer Engineering & Systems (ICCES), Cairo, Egypt, pp. 221-227. 15. Ghany, K., Hassan, G., Hassanien, A., Hefny, H., and Schaefer, G., (2014) “A hybrid biometric approach embedding DNA data in fingerprint images” International Conference on Informatics, Electronics & Vision (ICIEV), pp.1-5. 16. Rakheja, P. (2011) “Integrating DNA computing in international data Encryption algorithm “IDEA,”' International Journal of Computer Applications, 26(3), pp. 1–6. 17. Khalifa, A. and Atito, A., (2012) “High-capacity DNA-based steganography” 8th International Conference on Informatics and Systems (INFOS), Cairo, pp. 37-43. 18. Javheri, S. Kulkarni, R., (2014) “Secure Data Communication and Cryptography based on DNA-based Message Encoding,” International Journal of Computer Applications, pp. 35-40. 19. Jain, S. Bhatnagar, V., “A novel DNA sequence dictionary method for securing data in DNA using spiral approach and framework of DNA cryptography” International Conference on Advances in Engineering & Technology Research (ICAETR), pp. 1-5. 20. Wang, X. Zhang, Q. (2009) “DNA computing-based cryptography” 4th International Conference on Bio-Inspired Computing, pp.1-3. 21. Vijayakumar, P. Vijayalakshmi, V., and Zayaraz, G. (2011) “DNA Computing based Elliptic Curve Cryptography” International Journal of Computer Applications, pp. 18-21. 22. Vijayakumar, P. Vijayalakshmi, V., and Zayaraz, G. (2013) “Enhanced Level of Security using DNA Computing Technique with Hyperelliptic Curve Cryptography” ACEEE International Journal of Network Security, Vol. 4, No. 1. 23. Cui, G. Qin, L. Wang, Y. and Zhang, X. (2008) “An encryption scheme using DNA technology” 3rd International Conference on Bio-Inspired Computing: Theories and Applications (BICTA), pp. 37-42. 24. Clelland, C., Risca, V. and Bancroft, B. (1999) “Hiding messages in DNA microdots” in Nature: Vol. 399, pp. 533- 534. 25. Sabry, M. Hashem, M. Nazmy, T. and Khalifa, M. (2010) “A DNA and Amino Acids-Based Implementation of Play- fair Cipher,” International Journal of Computer Science and Information Security, Vol. 8, No. 3. 26. Sadeg, S., Gougache, M., Mansouri, N. 
and Drias, H. (2010) “An encryption algorithm inspired from DNA” International Conference on Machine and Web Intelligence (ICMWI), vol., no., pp.344-349. 27. Bancroft, F., and Clelland, C. (2001) “DNA-BASED STEGANOGRAPHY” United States Patent 6312911. 28. Xin-she, L., Lei, Z., and Yu-pu, H. (2008) “A Novel Generation Key Scheme Based on DNA” International Conference on Computational Intelligence and Security, USA, vol.1, no., pp.264-266, 13-17. 29. Ning, K. (2009) “A Pseudo-DNA Cryptography Method”, Cornell University Libarar: http://arxiv.org/abs/0903.2693 (Last accessed on 20/01/2017). 30. Prabhu, D., Adimoolam, M. (2011) “Bi-serial DNA Encryption Algorithm (BDEA)” Cryptography and Security, arXiv:1101.2577.


31. Yunpeng, Z., Yu, Z., Zhong, W. and Sinnott, R. (2011) “Index-based symmetric DNA encryption algorithm” 4th International Congress of Image and Signal Processing (CISP), Shanghai, pp. 2290-2294. 32. S. Dhawan and A. Saini, (2012) “Integration of DNA Cryptography for complex Biological Interactions” International Journal of Engineering, Business and Enterprise Application, pp. 121-127. 33. Sabry, M., Hashem, M, and Nazmy T. (2012) “Three Reversible Data Encoding Algorithms based on DNA and Amino Acids Structure” International Journal of Computer Applications, pp. 24-30. 34. Naveen, J.K.; Karthigaikumar, P.; SivaMangai, N.M.; Sandhya, R.; Asok, S.B., (2013) “Hardware implementation of DNA-based cryptography” International Conference of Information & Communication Technologies (ICT), pp.696- 700. 35. T. Mandge and V. Choudhary, “A DNA encryption technique based on matrix manipulation and secure key generation scheme,” International Conference on Information Communication and Embedded Systems (ICICES), Chennai, pp. 47-52. 36. Saranya, M.R.; Mohan, A.K.; Anusudha, K., (2015) “Algorithm for enhanced image security using DNA and genetic algorithm” International Conference on Signal Processing, Informatics, Communication, and Energy Systems (SPICES), pp.1-5. 37. Prajapati, B. and Barkha, P., (2016) “Implementation of DNA cryptography in cloud computing and using socket programming” International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, pp. 1- 6.

AUTHORS PROFILES


Big data and data quality dimensions: a survey 1Onyeabor Grace Amina, 2Azman Ta’a 1,2School of Computing, Universiti Utara Malaysia, 06010, Sintok, Kedah, Malaysia Email: [email protected], [email protected] [email protected]

ABSTRACT
Data is a vital asset in virtually all types of organizations. These days, data, or the information obtained from data analysis, is the basis of decision making in businesses and organizations in general, and it offers numerous benefits by supporting accurate and dependable processes. Degradation of data quality has serious consequences, leading to wrong insights and decisions. Moreover, these are the days of Big Data (BD), which brings vast amounts of unprecedented varieties of data of unknown quality, making Data Quality (DQ) evaluation very challenging. DQ is therefore critical for data operations and management processes in order to detect associated performance problems. Besides, data of high quality enables an organization to deliver top services through enlarged prospects. Nonetheless, recognising the different characteristics of DQ, from its definition to the different Data Quality Dimensions (DQDs), is crucial for equipping methods and processes to improve DQ. This paper reviews BD and the DQDs most commonly used for BD, which are the basis for assessing and evaluating the quality of BD.
Keywords: big data; data quality; data quality dimensions; big data quality;
1. INTRODUCTION
The rate of data explosion has never been more apparent than it is today. Data of many varieties, from diverse sources, is mounting immensely in volume, arriving with unprecedented velocity, and the veracity of much of it is uncertain. Volume, Variety, Velocity and Veracity constitute the initial 4Vs definition of BD [1-5]. This new drift has given birth to the phenomenon called Big Data (BD). The trend has also prompted a change in organizational policy and strategy from classical traditional management systems to cloud-enabled BD, which brings flexible and scalable management of data and has proved to be cost effective and efficient [6-7]. Moreover, the growth of unstructured data in particular indicates that data processing has gone beyond ordinary tables and rows [8-11]. This is noteworthy because data is considered an asset in small and large business organizations, especially in an era in which insights for strategic business decisions are drawn from BD [12-17]. According to [18], these insights offer organizations new ways of working by leveraging fresh types of analytics on new kinds of data. The challenge for organizations is then to create fresh actions based on the benefits offered by these sorts of analysis [19]. Bearing in mind that data from its sources, as well as the products of data analytics, are valuable to organizations, practitioners and researchers view data as one of the significant assets of a business [20-21]. For this reason, the need for more attention to Data Quality (DQ) in BD should not be overlooked [22-23]. One of the keys to successful data management in an organization is attaining high DQ. Poor DQ has led organizations into several issues such as wrong decisions, high costs and an inability to provide customer satisfaction [24]. As data is a vital resource in all application areas within business organizations and government agencies, DQ is vital for decision makers to be able to resolve performance-related concerns [25-27]. To achieve high quality data, diverse techniques and strategies must be employed. According to [28-30], these strategies are divided into (1) data-driven and (2) process-driven. Data-driven strategies handle the data as it is, enhancing DQ by directly altering data values through techniques and activities such as integration or cleansing, while process-driven strategies try to find the original sources of poor DQ and enhance DQ by redesigning the process of data creation or modification. Generally, process-driven DQ strategies have proven to perform better than data-driven strategies, since they emphasise removing the causes of DQ problems. Additionally, data-driven strategies tend to be costlier than process-driven strategies in both the short and the long term [31]. There is a common phrase among quality control practitioners, noted in [32], that one cannot improve what one cannot measure; therefore, attempts should be made to define and measure DQ operationally. The measurement of DQ is sometimes compared with the measurement of a physical product; as [33-34] noted, in contrast to measuring a physical product, DQ is a multidimensional problem.
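As a small, generic illustration of the data-driven strategy (a hypothetical cleansing step, not one prescribed by the cited authors), the sketch below improves quality by acting directly on stored values, trimming whitespace, normalising a date format and dropping exact duplicates; a process-driven strategy would instead redesign the form or pipeline that produced the errors.

```python
from datetime import datetime

def clean_record(rec: dict) -> dict:
    """Data-driven fixes applied directly to the values of one record."""
    out = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
    # normalise a date field to ISO format, assuming 'joined' may arrive as DD/MM/YYYY
    if out.get("joined"):
        try:
            out["joined"] = datetime.strptime(out["joined"], "%d/%m/%Y").date().isoformat()
        except ValueError:
            pass  # already ISO or unparsable; leave for manual review
    return out

def deduplicate(records: list) -> list:
    """Drop records that become identical after cleansing."""
    seen, unique = set(), []
    for rec in map(clean_record, records):
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [{"name": " Ada ", "joined": "01/02/2015"}, {"name": "Ada", "joined": "2015-02-01"}]
print(deduplicate(rows))   # the two rows collapse into one after cleansing
```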


Moreover, data is a multidimensional concept that can be measured along various dimensions such as consistency, accuracy and timeliness [35-36]. These dimensions are characteristics for the measurement and management of data and information quality across diverse domains, and the metrics used for measurement differ from context to context [37]. This paper reviews studies on DQ and its various dimensions, from the era of traditional data management systems to their applicability in this era of BD. The rest of this paper is organized as follows: Section 2 discusses BD and DQ, Section 3 discusses DQDs, Section 4 discusses DQDs for BD, and concluding remarks are given in the last section.
2. BIG DATA AND DATA QUALITY
BD is a term used to describe huge data sets of diverse formats created at very high speed, whose management is nearly impossible using traditional database management systems. Organizations and businesses today are producing large datasets, and in the same way enormous amounts of data are being acquired and received from various sources and stored [38], [39]. This is the era of BD, which started to be recognized only a few years ago. Its initial definition represented the term poorly; the only idea it really conveys most frequently is of a volume of data too large to be managed by current computer processors [40], [7]. However, according to [41], [42], BD does not only concern the large volume of data; it also includes the ability to search, process, analyze and present meaningful information obtained from huge, varied and rapidly moving datasets. These attributes lead to the foundational definition of BD in terms of volume, variety and velocity. Furthermore, [43] defined BD as high-volume, high-velocity and high-variety information assets demanding cost-effective, innovative forms of information processing for improved insight and decision making. Data is created from an extensive range of sources such as social media, the internet, databases, websites, sensors, and so on, and before this data is stored, processing and cleansing are performed on it with the help of numerous analytical algorithms [44], [45]. However, because of the nature of BD, organizations often encounter issues and challenges: the data acquired is of large volume, of different varieties and of unprecedented velocity, which makes it challenging to manage. These concerns and challenges need to be addressed so that the stored data can easily be retrieved for making proper business decisions [46]. [47] identified the challenges, and the riskiest of them is DQ. DQ as a concept is not easily defined. Studies related to DQ began as far back as the 1990s, the days of database management systems, and since then various researchers have proposed diverse definitions of DQ [48]. According to [35], the Total Data Quality Management group led by Professor Richard Wang of MIT, through in-depth research in the area of DQ, defined it as fitness for use. Subsequently, other researchers in the field offered their own definitions in the literature, such as meeting users' expectations [49] (Sebastian-Coleman, 2012) or data suitable for use by data users [50-53], [54-55, 40]. DQ is defined by the International Organization for Standardization/International Electrotechnical Commission 25012 standard (ISO/IEC 2008) [56] as the extent to which a set of features of data meets requirements. All of the above definitions clearly indicate that DQ is highly dependent on the context in which data is used, on the customers' requirements, and on the ability to use and access the data [18]. As pointed out in [57], two strategies are involved in enhancing DQ: data-driven and process-driven. The data-driven strategy handles the data as it is, using methods and actions like cleansing to enhance its quality, while the process-driven strategy tries to detect the sources of poor DQ and then redesigns the way the data is produced. DQ problems existed well before the introduction of BD. The researchers in [13] categorized DQ issues and challenges into (i) error correction, (ii) conversion of unstructured data to structured data, and (iii) integration of data from various data sources. In addition to the issues mentioned above, there are quite a number of BD-specific challenges, including the large volume of data generated by Web 2.0, moving at an unusual speed and contained within schema-less structures. Other BD quality issues related to BD features have also been identified [35, 58-60]. Because of these combined issues, BD cleaning and sifting are phases to be implemented before analysing data whose quality is unknown. In [61] it is pointed out that DQ problems are more pronounced when dealing with data from multiple data sources, which obviously multiplies the data cleansing needs. Also, the huge amount of data arriving at unprecedented speed creates an overhead on the cleansing processes [13]. Given the magnitude of the data generated, the velocity at which it arrives and its huge variety, the quality of this data leaves much to be desired. Inaccurate data has been estimated to cost US businesses 600 billion dollars yearly [62]. The error rate in enterprise data is typically estimated at between 1% and 5%, and for some organizations it is well above 30% [63-64]. In the majority of data warehouse projects, data cleaning accounts for 30% to 80% of the development time and budget spent on enhancing DQ, as against building the system. Regarding web data, about 58% of the available files are XML, and of this volume only one-third of the XML documents with an associated XSD/DTD are valid [65]. Also, about 14% of the documents are not well formed, owing to simple mistakes such as mismatched or omitted tags, which renders the whole XML technology unusable over these documents. All of this points to the pressing need for DQ management to make sure that the data in databases represents the real-world objects to which it refers in a reliable, consistent, precise, comprehensive, timely and unique way. There has been an increase in demand from business organisations to develop DQ management systems, with the sole aim of detecting and efficiently correcting data errors; this adds accuracy and value to the underlying business processes. Indeed, it is estimated that the DQ tools market is growing at 16% annually, far above the average estimate of 7% for other IT sectors [66]. In a DQ system, data is subjected to auditing, profiling and the application of quality rules, with the aim of keeping and/or improving quality. The DQ concept has long been known in the database community and has been an active area of database management research for many years [67], [68-69]. Nevertheless, applying these quality concepts directly to BD encounters serious problems with respect to the cost and time of data processing, made worse by the fact that these techniques were designed in the context of structured data [70]. Within the context of BD, any DQ application must be designed based on the origin, domain, format and type of the data to which it is applied. Properly managing these DQ systems is essential for solving the many problems that arise in dealing with such vast data sets. In addition, for DQ to be managed it must be measurable using DQDs, which are reviewed in the following section.
3. DATA QUALITY DIMENSIONS
DQ can be analyzed from multiple dimensions. A Data Quality Dimension (DQD) is a feature or part of the information used to express data requirements. DQDs provide the means to measure and manage DQ [27, 57, 71-72], [73]. A DQD is a quantifiable property of DQ that represents some feature of the data, such as accuracy, consistency or completeness, and is used to guide the process of understanding quality [74]. Consequently, a specific data set could be said to be of high quality with respect to one or multiple dimensions. It is common to find different terms denoting the same dimensions in the literature; for example, currency is sometimes referred to as timeliness, since the use of data is universal [75]. DQDs are also often referred to as characteristics or attributes [76]. Data is usually altered owing to factors such as sensor readings, human data entry errors, missing values, social media data and all sorts of unstructured data. These factors should be identified and categorised under the DQDs, especially when quality requires improvement and evaluation [13].
This is because DQ problems, usually referred to as dirty or poor data, typically manifest within a particular DQD; for example, format glitches surface under the accuracy dimension, and when data lacks the appropriate format it cannot be regarded as quality data [77]. According to [78], various terms are used to describe DQ-related issues, as well as the mapping between the various problems and the relevant dimensions. The researchers in [56, 79] listed details of dirty data, the quality component it affects and the dimensions it is associated with. The table below, adapted from Taleb (2016), gives a short list of familiar DQ issues and the DQDs associated with them.
Table. 1 Data quality issues vs data quality dimensions
Level | Related data quality issue | Accuracy | Completeness | Consistency
Instance level | Missing data | X | X |
Instance level | Incorrect data, data entry errors | X | |
Instance level | Irrelevant data | X | |
Instance level | Outdated data | X | |
Instance level | Misfielded and contradictory values | X | X | X
Schema level | Uniqueness constraints, functional dependency violation | | | X
Schema level | Wrong data type, poor schema design | | | X
Schema level | Lack of integrity constraints | X | X | X
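A rough sketch of how the instance-level issues in Table 1 might be detected automatically is given below; the field names, rules and the dimension assigned to each issue are illustrative assumptions rather than part of the cited works.

```python
def audit(record: dict, schema: dict) -> list:
    """Return (issue, affected dimension) pairs for one record.

    schema maps field -> expected Python type, e.g. {"age": int, "email": str}.
    The dimension tags below are illustrative choices, not a definitive mapping.
    """
    findings = []
    for field, expected in schema.items():
        value = record.get(field)
        if value in (None, ""):                    # missing data
            findings.append((f"{field}: missing value", "completeness"))
        elif not isinstance(value, expected):      # wrong data type
            findings.append((f"{field}: wrong type", "consistency"))
    # a simple contradictory-value rule: ages cannot be negative
    if isinstance(record.get("age"), int) and record["age"] < 0:
        findings.append(("age: contradictory value", "consistency"))
    return findings

print(audit({"age": "37", "email": ""}, {"age": int, "email": str}))
```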


3.1 Types of data quality dimensions
There are several types of DQDs in the literature, and each of them is linked to a specific metric [80-85]. The researchers in [86] identified forty DQDs in work published from 1985 to 2009, and [73] revealed one hundred and twenty-seven DQDs from the analysis of the sixteen sources selected for their study. Although the DQDs commonly seen in the literature are categorized into intrinsic and contextual according to [32, 31, 80-81], [35, 86] initially grouped DQDs into four categories, Intrinsic, Accessibility, Contextual and Representational, based on their dimensions:
• Intrinsic DQDs refer to data features that are native to the data and objective.
• Accessibility DQDs are characterized by fundamental issues relating to technical data access.
• Contextual DQDs refer to data features that are dependent on the context in which the data is perceived or used.
• Representational DQDs refer to how data is presented.
The table below shows the categorization of DQDs:
Table. 2 Data quality categories and dimensions
DQ category | DQ dimensions
Intrinsic DQ | Accuracy, Objectivity, Believability, Reputation
Accessibility DQ | Accessibility, Access security
Contextual DQ | Relevancy, Value-added, Timeliness, Completeness, Amount of data
Representational DQ | Interpretability, Ease of understanding, Concise representation, Consistent representation
Furthermore, other researchers have proposed different frameworks and methodologies for the assessment and improvement of DQ using various approaches and methods over DQDs [27]. These scholars provided descriptions of DQDs and brought more significant DQDs to attention [27, 82, 84, 87], [2, 11, 12, 22].
4. DATA QUALITY DIMENSIONS FOR BIG DATA
Some studies regarding DQDs in BD have been conducted in organizations and in academia. It has been observed from the various research works that, in most cases, the DQDs used for measurement in traditional data management systems are also applicable to the measurement of DQ in BD. The DQDs for BD are categorized into intrinsic and contextual [32]: contextual DQDs are connected with the values of the data, and intrinsic DQDs are related to the data intension, that is, the schema of the data [31, 13]. The intrinsic category is the one most commonly used and most frequently found in the literature [18, 13]; it consists of the following dimensions: accuracy, completeness, consistency and timeliness. These DQDs are associated with the ability of data to match the interest of the data user [88]. The intrinsic DQ dimensions comprise:
i. Accuracy: measures whether data was logged correctly and shows precise values.
ii. Timeliness: measures whether data is up to date; it is occasionally expressed through data volatility and currency [89].
iii. Consistency: measures the agreement of data with its format and structure. Studies on BD quality use conditional functional dependencies as DQ rules to identify semantic faults [90-91].
iv. Completeness: measures whether all relevant data is correctly recorded, without missing values or entries [13].
The features of BD, that is, volume, velocity, variety and veracity, all have an effect on DQ. The concern is that DQ can no longer be described by the traditional DQDs alone; BD characteristics also need to be taken into account. These are called BD quality dimensions.
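The four intrinsic dimensions above can be turned into simple dataset-level scores. The following sketch uses illustrative formulas and field names (not drawn from the cited studies): completeness as the share of fully populated records, consistency as the share of values matching an agreed format, timeliness as the share of records updated within a freshness window, and accuracy as agreement with a trusted reference where one exists.

```python
import re
from datetime import datetime, timedelta

def quality_scores(records, email_field="email", ts_field="updated_at",
                   reference=None, key="id", max_age=timedelta(days=30)):
    """Compute rough intrinsic DQ scores (all in [0, 1]) for a list of record dicts."""
    n = len(records) or 1
    email_rx = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    now = datetime.utcnow()

    completeness = sum(all(v not in (None, "") for v in r.values()) for r in records) / n
    consistency = sum(bool(email_rx.match(r.get(email_field, "") or "")) for r in records) / n
    # records without a timestamp simply count as not timely
    timeliness = sum((now - r[ts_field]) <= max_age for r in records if ts_field in r) / n
    accuracy = None
    if reference:   # reference: trusted {key: record} mapping, if one is available
        accuracy = sum(reference.get(r.get(key)) == r for r in records) / n

    return {"completeness": completeness, "consistency": consistency,
            "timeliness": timeliness, "accuracy": accuracy}

rows = [{"id": 1, "email": "ada@example.org", "updated_at": datetime.utcnow()},
        {"id": 2, "email": "not-an-email", "updated_at": datetime.utcnow() - timedelta(days=90)}]
print(quality_scores(rows))   # e.g. completeness 1.0, consistency 0.5, timeliness 0.5
```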
The authors in [19] even merge the BD characteristics (the 3 Vs) with DQDs based on the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) standard. Thus, for example, additional dimensions such as performance, relevancy, popularity and credibility are measured for the quality of social media data [92-93], while accuracy, completeness, consistency and timeliness have also been used to evaluate BD in the health sector [8].
5. CONCLUSION
DQ is a serious issue for organizational operation processes, which must be able to identify associated performance issues, since data is a vital resource in organizations, businesses and governmental agencies [28, 31, 62]. Organizational data is no longer limited to databases as new technologies emerge, and BD sources have become significant in organizations. This paper reviewed the literature on BD and on DQDs for both traditional data and BD. From the viewpoint of BD quality research, much ground remains to be covered, compared with traditional data, in using DQDs for the assessment and evaluation of BD. The literature, right from the inception of DQ, has defined DQ in different ways and has identified various vital DQDs, reaching up to one hundred and twenty-seven DQDs. The review also shows that the DQDs most commonly used for BD are accuracy, consistency, completeness and timeliness. Therefore, there is still much to be done on DQDs for BD for the effective and efficient measurement of BD quality.


REFERENCES 1. Chen, M., S. Mao, and Y. Liu, Big data: A survey. Mob. Netw. Appl., 2014. 19(2): p. 171–209. 2. Philip Chen C. L. and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Inf. Sci., 2014. 275: p. 314–347. 3. Wielki, J., The opportunities and challenges connected with implementation of the big data concept, in Advances in ICT for Business, Industry and Public Sector, 4. C. M. Olszak, and T. Pe_ech-Pilichowski, Editors. 2015, Springer. p. 171–189. 5. Hashem, I. A. T., I. Yaqoob, N. B. Anuar, S. Mokhtar, A. Gani, and S. Ullah Khan, The rise of 'big data' on cloud computing: Review and open research issues. Inf. Syst., 2015. 47: p. 98–115. 6. Hu, H., Y. Wen, T.-S. Chua, and X. Li, Toward Scalable Systems for Big Data Analytics: A Technology Tutorial, IEEE Access, 2014. 2, p. 652–687. 7. N. I. of Standards, Draft NIST big data interoperability framework: security and privacy, Report, 2015. US Department of Commerce. 8. Serhani, M. A., H. T. El Kassabi, I. Taleb & A. Nujum, An hybrid approach to quality evaluation across big data value chain. In Big Data (BigData Congress), 2016. IEEE International Congress, p. 418-425. 9. N. I. of Standards, Draft NIST big data interoperability framework: reference architecture, Report, 6. 2015. US Department of Commerce. 10. Gubbi, J., R. Buyya, S. Marusic, M. Palaniswami, Internet of Things (IoT): A vision, architectural elements, and future directions, Future Generation Computer Syst., 2013. 29(7), p. 1645–1660. http://dx.doi.org/10.1016/j.future.2013.01.010. URL: http://www.sciencedirect.com/science/article/pii/S0167739X13000241 11. Pääkkönen, P. and D. Pakkala, Reference architecture and classification of technologies, products and services for big data systems, Big Data Res., 2015. 10.1016/j.bdr.2015.01.001. 12. Madnick, S. E., R. Y. Wang, Y. W. Lee, and H. Zhu, Overview and framework for data and information quality research, Journal of Data Inf. Quality, 2009. 2. 13. Taleb, I., H. T. El Kassabi, M. A. Serhani, Dssouli, R., & Bouhaddioui, C., Big data quality: A quality dimensions evaluation, in Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), 2016. Intl IEEE Conferences, p. 759-765. 14. Bhatia, S., J. Li, W. Peng, and T. Sun, Monitoring and analyzing customer feedback through social media platforms for identifying and remedying customer problems, in Proc. IEEE/ACM Int. Conf. Adv. Soc. Netw. Anal. Mining (ASONAM). 2013, p. 1147-1154. 15. Antunes, F. and J. P.
Costa, Integrating decision support and social networks, Adv. Human-Comput. Interact., 2012, 9. 16. Fabijan, A., H. H. Olsson, and J. Bosch, Customer feedback and data collection techniques in software R&D: A literature review in Software Business (Lecture Notes in Business Information Processing), 2015. 210. Springer. p. 139-153. 17. Ferrando-Llopis, R., D. Lopez-Berzosa, and C. Mulligan, Advancing value creation and value capture in data-intensive contexts in Proc. IEEE Int. Conf. Big Data. 2013, p. 5-9. 18. Izham Jaya, M., F. Sidi, I. Ishak, L.I. L. L. Y. Suriani Affendey, & M. A. Jabar, A Review Of Data Quality Research In Achieving High Data Quality Within Organization. Journal of Theoretical & Applied Information Technology, 2017.95(12). 19. Merino, J., I. Caballero, B. Rivas, M. Serrano, & M. Piattini, A data quality in use model for big data. Future Generation Computer Systems,2016. 63, p.123-130.




An evolutionary-based adaptive neuro-fuzzy expert system as a marriage counsellor using astrology science
1Seyed Muhammad Hossein Mousavi
1Independent Researcher, Tehran, Iran
Email: [email protected]

ABSTRACT

The divorce rate is increasing around the globe, and the consequences of this phenomenon could be severe for coming generations. Many approaches to reducing divorce have been tried over time; some have succeeded and some have failed. Expert systems reduce human error. In this paper a new approach to divorce rate reduction is proposed: combining astrology with artificial intelligence techniques to build an automatic expert system for this purpose. The main idea is to base every decision on the couple's birth charts, computed according to Vedic astrology. Such an expert system could serve as an assistant to, or even a replacement for, a family counsellor. In this approach wealth, money and similar factors do not matter; the only thing that matters is the way of thinking, which is inferred through astrology. The proposed expert system outputs three classes: "do marry" for compatibility above 60%, "do not marry" for compatibility below 40%, and "marriage with caution" for compatibility around 50 ± 10%. The data were trained and tested with different meta-heuristic optimization algorithms (ACO, DE, PSO and GA) and neural-network training methods (hybrid and back-propagation). Error measures (MSE, RMSE, mean error and STD of error) were calculated for each approach as validation results, and very satisfactory results were achieved. The long-term aim is a better life for couples everywhere and a divorce rate approaching zero.

Keywords: divorce; expert system; astrology; optimization algorithm; neural network;

1. INTRODUCTION

Automatic expert systems can lift a heavy burden from our shoulders. But what kind of automatic expert system is needed here? In this case, we want to reduce the divorce rate by reasoning about couples' thoughts and behaviours. Should it be a psychologist, a physician, or a family counsellor? Each has its pros and cons. A psychologist or family counsellor works well in this setting, especially before marriage, but is it not better to set everything right from the start? Psychologists and physicians act on what the patient tells them, but the patient may omit something, state it imprecisely, or simply forget. With an automatic astrological expert system, a general picture of the subject's mind is available and can be compared with another person's to check the compatibility between the couple. More details follow in the sections below, but from a general perspective there are four structural elements for each of us, each compatible with some of the others: fire, earth, air and water. Each element is compatible with itself and with certain other elements, but not with its opposite. Thus, in this method a water sign is not compatible with a fire sign, and such a combination is taken to lead toward divorce. It should be stressed that this is not voodoo or black magic: no claims are made about contacting spirits or predicting the future; the method only describes present-time attributes and works on a set of numbers.

Section 1 defines the necessary fundamentals. Section 2 reviews other researchers' work on expert systems for family counselling, marriage counselling and divorce rate reduction. As this is a novel topic, not much prior work could be found. Sections 3 and 4 present, respectively, the proposed evolutionary ANFIS expert system based on astrology and the results obtained on the dataset using different evolutionary optimization algorithms as learning algorithms. To validate the results, a dataset was obtained from a human expert (an astrology expert); it is described thoroughly in Section 4. Finally, Section 5 concludes the paper and offers some suggestions for building an even better system.
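To make the three output classes described above concrete, the following is a minimal sketch of the thresholding rule stated in the abstract; the function name and the exact handling of the boundary region are illustrative assumptions, not part of the authors' implementation:

```python
def classify_compatibility(score_percent: float) -> str:
    """Map a couple's compatibility score (0-100 %) to one of the three
    output classes used by the expert system.

    Thresholds follow the abstract: >= 60 % -> "do marry",
    <= 40 % -> "do not marry", roughly 50 +/- 10 % -> "marriage with caution".
    """
    if score_percent >= 60:
        return "do marry"
    if score_percent <= 40:
        return "do not marry"
    return "marriage with caution"


if __name__ == "__main__":
    for s in (72.5, 50.0, 33.0):
        print(s, "->", classify_compatibility(s))
```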
According to a 2012 United Nations world demographic report [11] covering 70 countries that report divorce rates, the first five were Russia, Aruba, Belarus, Latvia and Lithuania, with divorce rates of 4.5, 4.4, 4.1, 3.6 and 3.5 respectively; Iran was in 31st place with a divorce rate of 2.


Figure 1 shows the number of divorces per year and the average age at marriage per year in Iran between 1975 and 2016. The plot shows that the number of divorces has been increasing over the years.

Figure. 1 Statistical chart of divorce and the average age of marriage in Iran, 1975–2016 [12]

1.1 Evolutionary computation

The fundamental metaphor of evolutionary computing relates natural evolution to a particular style of problem solving, that of trial and error [1]. In artificial intelligence, an evolutionary algorithm (EA) is a subset of evolutionary computation: a generic, population-based meta-heuristic optimization algorithm. An EA uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions. Differential evolution (DE), one of the popular evolutionary algorithms, is used here to find the best initial cluster centers. DE is a relatively recent heuristic (created in the mid-1990s) proposed by Kenneth Price and Rainer Storn [2-4] and designed to optimize problems over continuous domains; it originated from Price's attempts to solve the Chebyshev polynomial fitting problem posed to him by Storn [5]. Another example is the genetic algorithm (J. Holland, K. De Jong, 1960s) [6], which evolves an initial population to solve optimization problems; its operators are inspired by natural genetic variation and natural selection. Particle swarm optimization (PSO) [7], formulated by Kennedy and Eberhart in 1995, is based on the social behaviour of animals such as fish and birds [8]. In computer science and operations research, the ant colony optimization algorithm (ACO) is a probabilistic technique for solving computational problems that can be reduced to finding good paths through graphs. It belongs to the family of ant colony algorithms within swarm intelligence and constitutes a meta-heuristic optimization. Initially proposed by Marco Dorigo in 1992 in his PhD thesis [9, 10], the first algorithm aimed to search for an optimal path in a graph, based on the behaviour of ants seeking a path between their colony and a source of food. The original idea has since been diversified to solve a wider class of numerical problems, and several variants have emerged, drawing on various aspects of ant behaviour.

1.2 Expert system

An adaptive neuro-fuzzy inference system or adaptive network-based fuzzy inference system (ANFIS) is a kind of artificial neural network based on the Takagi–Sugeno fuzzy inference system. The technique was developed in the early 1990s [13]. Since it integrates both neural networks and fuzzy logic principles, it has the potential to capture the benefits of both in a single framework. Its inference system corresponds to a set of fuzzy IF–THEN rules that have learning capability to approximate nonlinear functions [14]. Hence, ANFIS is considered a universal estimator [15].
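As an illustration of how one of these population-based optimizers works, the following is a minimal sketch of a differential evolution loop in Python/NumPy (the DE/rand/1/bin scheme with fixed F and CR). The fitness function and all parameter values are illustrative assumptions, not the exact settings used in this paper:

```python
import numpy as np

def differential_evolution(fitness, dim, pop_size=20, max_iter=400,
                           lower=-5.0, upper=10.0, F=0.5, CR=0.2, seed=0):
    """Minimise `fitness` over a continuous search space (DE/rand/1/bin)."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lower, upper, size=(pop_size, dim))
    cost = np.array([fitness(x) for x in pop])
    for _ in range(max_iter):
        for i in range(pop_size):
            # pick three distinct individuals different from i
            a, b, c = rng.choice([j for j in range(pop_size) if j != i],
                                 size=3, replace=False)
            mutant = np.clip(pop[a] + F * (pop[b] - pop[c]), lower, upper)
            # binomial crossover: mix mutant and current individual
            mask = rng.random(dim) < CR
            mask[rng.integers(dim)] = True
            trial = np.where(mask, mutant, pop[i])
            # greedy selection
            trial_cost = fitness(trial)
            if trial_cost < cost[i]:
                pop[i], cost[i] = trial, trial_cost
    best = int(np.argmin(cost))
    return pop[best], cost[best]

if __name__ == "__main__":
    # toy usage: minimise the sphere function in 5 dimensions
    x, fx = differential_evolution(lambda v: float(np.sum(v ** 2)), dim=5)
    print(x, fx)
```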


1.3 Astrology

Astrology is the study of the movements and relative positions of celestial objects as a means of divining information about human affairs and terrestrial events [18-20]. Astrology has been dated to at least the 2nd millennium BCE and has its roots in calendrical systems used to predict seasonal shifts and to interpret celestial cycles as signs of divine communications [21]. Many cultures have attached importance to astronomical events, and some, such as the Indians, Chinese and Maya, developed elaborate systems for predicting terrestrial events from celestial observations. Western astrology, one of the oldest astrological systems still in use, can trace its roots to 19th–17th century BCE Mesopotamia, from which it spread to Ancient Greece, Rome, the Arab world and eventually Central and Western Europe. Throughout most of its history astrology was considered a scholarly tradition and was common in academic circles, often in close relation with astronomy, alchemy, meteorology and medicine [22].

1.4 Vedic astrology

Jyotisha (or Jyotishyam, from Sanskrit jyotiṣa, from jyótis-, "light, heavenly body") is the traditional Hindu system of astrology, also known as Hindu astrology, Nepalese Shastra, Indian astrology and, more recently, Vedic astrology. The term Hindu astrology has been in use as the English equivalent of Jyotiṣa since the early 19th century, whereas Vedic astrology is a relatively recent term that entered common usage in the 1980s with self-help publications on Āyurveda or Yoga. Vedanga Jyotisha is one of the earliest texts about astronomy within the Vedas [23-25]. However, some authors have claimed that horoscopic astrology in the Indian subcontinent came from Hellenistic influences, post-dating the Vedic period [26]. In the epics Ramayana and Mahabharata, only electional astrology, omens, dreams and physiognomy are used.

1.5 Natal astrology

Natal astrology, also known as genethliacal astrology, is the system of astrology based on the concept that each individual's personality or path in life can be determined by constructing a natal chart for the exact date, time and location of that individual's birth. Natal astrology is found in the Indian (Jyotish), Chinese and Western astrological traditions. In horoscopic astrology an individual's personality is determined by constructing the horoscope or birth chart for the individual involved (known as the native), showing the positions of the Sun, Moon, planets, ascendant and midheaven, and the angles or aspects among them.

1.6 Astrological signs

In Western astrology, the astrological signs are the twelve 30° sectors of the ecliptic, starting at the vernal equinox (one of the intersections of the ecliptic with the celestial equator), also known as the First Point of Aries. The order of the signs is Aries, Taurus, Gemini, Cancer, Leo, Virgo, Libra, Scorpio, Sagittarius, Capricorn, Aquarius and Pisces. The concept of the zodiac originated in Babylonian astrology and was later influenced by Hellenistic culture. According to astrology, celestial phenomena relate to human activity on the principle of "as above, so below", so the signs are held to represent characteristic modes of expression [27]. When we are born, there are coordinates on the Earth for us; these coordinates are projected onto the other planets, the Sun, the Moon and the other celestial bodies of the solar system, and each of these bodies shapes a part of the personality. Knowing all of these parts makes it possible to describe someone quite completely, which helps a great deal in the question of marriage. Figure 2 presents the 12 signs of the zodiac in detail. In Vedic astrology, which is taken here as the most accurate system, there are 12 signs (lion, scorpion, goat, ...), 12 date partitions, 12 lords (Aries, Scorpio, Libra, ...), 12 planets (one for each lord), 4 structural elements (fire, earth, air and water), 12 houses and 12 constellations, and each sign has its own traits, good and bad.
This system was devised about 2000 years ago, when the correspondence between calendar dates and constellations was different from what it is today; later astronomical observation showed that this correspondence has drifted. For this reason, tropical astrology, which simply assigns signs to fixed one-month calendar intervals in each season, is treated here as inaccurate, whereas Vedic (sidereal) astrology corrects for this drift and is based on the actual positions and motions of the planets around the Sun. Moreover, two further bodies, the Moon and Chiron, are very important in astrology, and Pluto (a dwarf planet), the Sun (a star) and Rahu and Ketu (the lunar nodes) are also important for describing someone's personality. Each of these bodies is related to a part of a person's characteristics. As mentioned before, people all around the world, from ancient times until now, have used this system for better relationships; it is especially widespread in the USA, China and Europe, and in many countries there are clinics, counsellors and even university programmes devoted to it. Note that in many countries people use astrology only for fun and rely on the Sun sign alone, which covers only about 25% of a person's characteristics, and they often use tropical astrology, which makes the result even less accurate; they do not consider the Moon, Mercury, Mars and ascendant signs. If this matter is important, the precise form of the knowledge should be used, and it is harder than it looks.
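To make the sign-and-degree bookkeeping concrete, the following is a minimal sketch of how a tropical ecliptic longitude might be converted to a sidereal (Vedic-style) sign and in-sign degree. The ayanamsa value and the function name are illustrative assumptions; real chart software uses precise ephemeris data:

```python
SIGNS = ["Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
         "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces"]

# Approximate tropical-to-sidereal offset (ayanamsa) in degrees for recent
# years; an assumption for illustration only.
AYANAMSA_DEG = 24.1

def sidereal_sign_and_degree(tropical_longitude_deg: float):
    """Return (sign index 1-12, sign name, degree within the sign 0-30) for a
    body whose tropical ecliptic longitude is given in degrees."""
    sidereal = (tropical_longitude_deg - AYANAMSA_DEG) % 360.0
    index = int(sidereal // 30)          # 0..11
    degree_in_sign = sidereal - 30 * index
    return index + 1, SIGNS[index], degree_in_sign

if __name__ == "__main__":
    # e.g. a body at 210.0 degrees of tropical longitude
    print(sidereal_sign_and_degree(210.0))
```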


To avoid confusion, the word "planet" is used below for all the celestial bodies considered in the chart, including the Sun. Figure 3 shows a person's chart computed according to Vedic astrology. On the left-hand side of the figure, the planets are listed with their respective degrees. Each planet can be at 1–30 degrees within a sign, which shows how strongly that planet is affected by that lord; each planet also has its own lord, and if a lord sits in its own planet's sign, that is a sign of strength in the corresponding part of the personality. The signs appear in the third column. Among the planets, eight are important and four are the most important for describing a good share of the subject's characteristics: the eight are ascendant, Sun, Moon, Mercury, Mars, Venus, Rahu and Ketu, and the four most important are Moon, Sun, ascendant and Mercury. In fact, the Moon sign (and its degree) is the most important of all. It was mentioned above that some people use astrology casually and incorrectly; even if they computed the chart according to Vedic astrology and for all planets, a problem would remain: the degree. Consider the Moon sign, the most important of all, because the Moon is held to govern the mind; the Moon passes through a sign roughly every two to three days. If the Moon is in Libra but at 26 degrees, then in this reading the person's mind does not really operate on Libra, because Libra's constellation is nearly finished by about 28 degrees; the mind is effectively based on 26 degrees of Scorpio, with only a few degrees left in Libra, so most of the person's actions in life will follow the Scorpio pattern. On the other hand, some planets such as Uranus and Neptune are less important, because their orbital periods are so long that a single placement covers a whole generation. Figure 3 presents one subject's personality; the other 146 combinations can be looked up in [28-32]. This person was born on 30 July 1990. The ascendant (rising sign) gives a general view of the whole chart and some aspects of physique. He is Pisces ascendant, which is read as intelligent and quiet: he will think about something a great deal before speaking, and because Pisces is a water sign he is calm and relaxed, interested in other-worldly subjects, and could be a good employee. The Sun sign shows the inner personality, which is not displayed much to other people. In this case the Sun is at 13 degrees of Cancer, which, like Pisces, is a water sign, so calmness is again evident; Cancer is read as the most well-rounded sign of the zodiac, able to learn anything given enough time. Cancer is also the lord of the fourth house, called the mother of the zodiac, so these people are read as good parents: emotional, caring toward their loved ones and loyal, and, like the other water signs, drawn to other-worldly subjects. Since the Sun is at 13 degrees, Cancer contributes 13 degrees and Leo 17, so the person may show some Leo-like showing off and aggressiveness, but the Cancer side dominates because the Crab constellation is still prominent. Then there is the Moon sign, which, as mentioned, is the most important of the zodiac: the Moon represents emotions, habits, behaviour with others and way of thinking. Here the Moon is in the third house and nominally occupied by the Libra lord, but because it is at 26 degrees it is effectively no longer Libra but Scorpio, the next lord. This is a common mistake in astrology, which is why the degree is so important.
Again, like the other water signs, Scorpio is calm, but it is read as the most dangerous sign, especially in the Moon: secretive, loyal, powerful, dominant, hard-working and wealthy. Scorpio is ruled by the eighth house, the house of the underworld, and such people like to know a great deal about hidden matters; it is better not to provoke them, because they can be harmful. Mars here is in Aries, which is its ideal placement. Mars shows aggressiveness, actions and desires; Aries is read as the most violent sign of the zodiac, but in this position the person could make, for example, an excellent army commander or police officer, dominant like Scorpio but more aggressive. Mercury represents communication, logic and speech. Mercury in Leo means this person speaks well as a leader and has strong powers of absorption, especially when at the centre of a crowd; because Mercury is related to business, Leo in this position can make a creative and powerful leader, faithful and generous, and the world is said to provide for such people. Leo itself is ruled by the Sun, the only star in the chart, on which the other planets depend. Venus represents passion, art, beauty and love; its best placement is Gemini, and in this example Venus is in Gemini at 19 degrees, so it is affected by Cancer as well. After Gemini, Scorpio, Aries and Leo follow in strength of passion. Rahu and Ketu are in their ideal places in this example: Rahu in the tenth house (Capricorn) and Ketu in the fourth house (Cancer), and Capricorn and Cancer are the father and mother of the zodiac; when Rahu and Ketu sit in these houses, fame and wealth are said to come toward the person. It should be noted that it is the combination of planets, houses and lords together with their degrees that represents the whole personality, and some placements can soften or strengthen others with a smaller effect. For example, Leo in the Moon together with Pisces in Mars or the ascendant means the Pisces influence reduces the aggressiveness of Leo, which is a good balance. Sometimes the combination is unfortunate, as in Hitler's birth chart, which is dominated by Aries: too many planets in Aries, with Jupiter amplifying their power. Aries is a leader sign like Leo and Scorpio, but it is a fire sign, quick to flare up and to cool down; with no planet to damp the fire, and others feeding it instead, the result is a disaster, of which Hitler is cited as an example. The remaining planets are less important; for more information about them the reader is referred to [28-32].


According to the author's experience, combinations of Aries, Leo, Scorpio, Cancer, Capricorn and Gemini in the proper houses can make an excellent person. These characteristics combine with the year sign, the family and social environment and personal beliefs to make up a person's personality, but the birth chart calculated with Vedic astrology is taken to account for about 70% of it and to influence the other parts as well. Table 1 shows the 17 factors considered important in pre-marriage consultation, which are extracted from the couple's birth charts; these values are used in the proposed system. The first 10 factors are the most important: if they are compatible between the two charts, that is a good sign for marriage. The system works with two approaches, planet-based and birth-date-based. Data for 98 couples, structured as in Table 1, were received from a human astrology expert and are used in Section 4.

Table. 1 Astrology factors in consultation before marriage

First approach (planet-based)
     Variable name   Value range   Degree range
1    Ascending       [1 12]        [1 30]
2    Sun             [1 12]        [1 30]
3    Moon            [1 12]        [1 30]
4    Mars            [1 12]        [1 30]
5    Mercury         [1 12]        [1 30]
6    Jupiter         [1 12]        [1 30]
7    Venus           [1 12]        [1 30]
8    Saturn          [1 12]        [1 30]
9    Rahu            [1 12]        [1 30]
10   Ketu            [1 12]        [1 30]
11   Uranus          [1 12]        [1 30]
12   Neptune         [1 12]        [1 30]
13   Pluto           [1 12]        [1 30]
14   Chiron          [1 12]        [1 30]
15   X               [1 12]        [1 30]
16   Y               [1 12]        [1 30]
17   Z               [1 12]        [1 30]

Second approach (birth-date-based)
     Day       Month     Year
1    [1 31]    [1 12]    [1900 2018]
2    [1 31]    [1 12]    [1900 2018]
3    [1 31]    [1 12]    [1900 2018]
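As a concrete illustration of the planet-based representation in Table 1, the following is a minimal sketch of how one couple's chart data might be encoded as a numeric feature vector for the expert system. The field names, the pairing of the two partners' charts into one vector and the class labels are illustrative assumptions about the data layout, not a documented file format:

```python
from dataclasses import dataclass
from typing import List

PLANETS = ["Ascending", "Sun", "Moon", "Mars", "Mercury", "Jupiter", "Venus",
           "Saturn", "Rahu", "Ketu", "Uranus", "Neptune", "Pluto", "Chiron",
           "X", "Y", "Z"]

CLASSES = {"do marry": 1, "marriage with caution": 2, "do not marry": 3}

@dataclass
class Chart:
    signs: List[int]    # 17 values in [1, 12], one per entry in PLANETS
    degrees: List[int]  # 17 values in [1, 30]

def encode_couple(partner_a: Chart, partner_b: Chart, label: str) -> List[float]:
    """Concatenate both partners' sign/degree values and append the class id."""
    assert len(partner_a.signs) == len(PLANETS)
    features = (partner_a.signs + partner_a.degrees +
                partner_b.signs + partner_b.degrees)
    return [float(v) for v in features] + [float(CLASSES[label])]

if __name__ == "__main__":
    a = Chart(signs=[12, 4, 8, 1, 5, 4, 3, 10, 10, 4, 9, 9, 7, 4, 1, 1, 1],
              degrees=[15, 13, 26, 2, 7, 11, 19, 24, 5, 5, 8, 13, 17, 3, 1, 1, 1])
    b = Chart(signs=[2, 7, 11, 6, 7, 2, 8, 12, 4, 10, 9, 10, 8, 5, 1, 1, 1],
              degrees=[4, 21, 9, 14, 25, 6, 12, 18, 22, 22, 3, 9, 11, 27, 1, 1, 1])
    print(len(encode_couple(a, b, "do marry")))  # 17*2*2 + 1 = 69 values
```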

2. PRIOR RELATED WORKS

Unfortunately, there is not much work on marriage-counsellor or family-counsellor expert systems; only two such cases were found, so some other evolutionary-based ANFIS expert systems in other areas are also discussed. In 2017, Mousavi, MiriNezhad and Lyashenko built an evolutionary-based adaptive neuro-fuzzy expert system as a family counsellor before marriage with the aim of divorce rate reduction, using evolutionary algorithms and neural networks for the learning and classification process [33]. In 2017, Hamidreza Saghafi and Milad Arabloo developed an expert system for estimating carbon dioxide equilibrium adsorption isotherms using adaptive neuro-fuzzy inference systems (ANFIS) and regression models [34]. Kemal Polat and Salih Güneş developed an expert system approach based on principal component analysis and an adaptive neuro-fuzzy inference system for the diagnosis of diabetes [35].


Servet Soyguder and Hasan Alli built an expert system for humidity and temperature control in HVAC systems using ANFIS and optimization with a fuzzy modelling approach [36]. Melek Acar Boyacioglu and Derya Avci developed an adaptive network-based fuzzy inference system (ANFIS) for the prediction of stock market returns, applied to the Istanbul Stock Exchange [37].


Figure. 2 The 12 signs of the zodiac in detail, with their corresponding traits, figures, date ranges, constellations and symbols

Figure. 3 A sample chart based on Vedic astrology

3. PROPOSED METHOD

In this paper, an evolutionary-based adaptive neuro-fuzzy expert system acting as an astrological marriage counsellor before marriage, with the aim of reducing the divorce rate, is proposed. The main goal is to combine evolutionary algorithms with fuzzy logic and to infer nature-inspired results for this kind of natural event (divorce). Learning is performed with the hybrid, back-propagation, GA, PSO, DE and ACO learning algorithms, and the results are compared with each other. Figure 4 shows the flowchart of the proposed method. To generate the FIS, FCM clustering [38, 39] is used to reduce the number of fuzzy rules and output membership functions. There are 17 inputs (neurons, or variables) and input membership functions.


There are also 5 rules and output membership functions and a single output variable. Training the FIS with hybrid learning, and testing on the train and test portions of the dataset, are shown in Figures 5, 6 and 7 respectively. Figure 8 shows the train and test errors for our dataset under DE learning; these plots are for the DE learning algorithm only and merely illustrate the procedure of the proposed method. As Figure 8 indicates, the train and test errors are close to zero.
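As a rough illustration of this pipeline, the sketch below clusters the training data with a small hand-written fuzzy c-means step to obtain rule prototypes, and then evaluates a simple Sugeno-style fuzzy inference whose rule consequents could in turn be tuned by an evolutionary optimizer such as the DE loop sketched in Section 1.1. All names, the Gaussian membership form and the parameter values are illustrative assumptions, not the exact ANFIS configuration used in this paper:

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=5, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Return (cluster centers, membership matrix U) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        new_U = 1.0 / (dist ** (2 / (m - 1)))
        new_U /= new_U.sum(axis=1, keepdims=True)
        if np.abs(new_U - U).max() < tol:
            U = new_U
            break
        U = new_U
    return centers, U

def sugeno_predict(x, centers, sigma, consequents):
    """Zero-order Sugeno FIS: Gaussian firing strengths around each rule
    center, then a weighted average of per-rule constant consequents."""
    w = np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * sigma ** 2))
    return float(np.dot(w, consequents) / (w.sum() + 1e-12))

if __name__ == "__main__":
    # toy data: 10 couples x 17 features, targets are class ids 1..3
    rng = np.random.default_rng(1)
    X = rng.integers(1, 13, size=(10, 17)).astype(float)
    y = rng.integers(1, 4, size=10).astype(float)
    centers, U = fuzzy_c_means(X, n_clusters=5)
    consequents = (U ** 2).T @ y / (U ** 2).sum(axis=0)  # crude initial guess
    print([round(sugeno_predict(x, centers, sigma=3.0, consequents=consequents), 2)
           for x in X])
```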

Figure. 4 Proposed Method’s procedure

Figure. 5 Training the FIS using hybrid learning over 20 epochs


Figures 6 and 7. Testing the trained FIS on the train data and on the test data

Figure. 8 Train and test errors for our dataset (DE learning)

4. EVALUATIONS AND RESULTS

4.1 Data

The results are obtained from a dataset collected over two years from 98 subjects (couples). The 17 most important consultation factors were selected as features (variables). There are three classes in the dataset: do marry, do not marry, and marriage with caution, with 69 subjects in the do marry class, 9 in the do not marry class and 19 in the marriage with caution class. The dataset was obtained from a human expert (an astrology expert).

4.2 MSE, RMSE, mean error and STD error

Tables 2 and 3 report the calculated errors (MSE [40], RMSE [41], mean error [42] and STD of error [42]) for the different learning algorithms used, on the train and test sets respectively. The dataset was split into 62% train data and 38% test data, and the number of hidden neurons is 10.
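A minimal sketch of how the described 62/38 train/test split might be reproduced, assuming the encoded couples and class labels are held in NumPy arrays; the shuffling and the exact split procedure are assumptions, as the paper does not specify them:

```python
import numpy as np

def train_test_split_62_38(X, y, seed=0):
    """Shuffle the samples and split them into ~62% train and ~38% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(round(0.62 * len(X)))
    train, test = idx[:n_train], idx[n_train:]
    return X[train], y[train], X[test], y[test]

if __name__ == "__main__":
    # class ids: 1 = do marry, 3 = do not marry, 2 = caution (counts as reported)
    y = np.r_[np.ones(69), 3 * np.ones(9), 2 * np.ones(19)]
    X = np.zeros((len(y), 17))      # placeholder feature matrix
    Xtr, ytr, Xte, yte = train_test_split_62_38(X, y)
    print(Xtr.shape, Xte.shape)
```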


Table 4 lists the parameters of the evolutionary learning algorithms. The mean squared error (MSE), or mean squared deviation (MSD), of an estimator (a procedure for estimating an unobserved quantity) measures the average of the squares of the errors or deviations, that is, of the differences between the estimator and what is estimated. If $\hat{Y}$ is a vector of n predictions and $Y$ is the vector of observed values corresponding to the inputs of the function that generated the predictions, then the MSE of the predictor can be estimated as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2 \qquad (1)$$

The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between the values predicted by a model or estimator and the values actually observed. The RMSD represents the sample standard deviation of the differences between predicted values $\hat{y}_t$ and observed values $y_t$. For n predictions it is computed as the square root of the mean of the squared deviations:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{t=1}^{n}\left(\hat{y}_t - y_t\right)^2}{n}} \qquad (2)$$
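The four reported error measures can be computed directly from the model outputs and the targets; a minimal sketch, assuming both are given as NumPy-compatible arrays:

```python
import numpy as np

def error_metrics(y_pred, y_true):
    """Return MSE, RMSE, mean error and standard deviation of the error."""
    err = np.asarray(y_true) - np.asarray(y_pred)   # target minus output
    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    return {"MSE": mse, "RMSE": rmse,
            "Mean Error": float(np.mean(err)),
            "STD Error": float(np.std(err))}

if __name__ == "__main__":
    print(error_metrics(y_pred=[1.1, 2.8, 1.0], y_true=[1, 3, 1]))
```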

Equivalently, the RMSE is simply the square root of the MSE. Mean error and STD error are the mean and standard deviation of the train and test error, where the error is the difference between the targets in the dataset and the model outputs.

Table. 2 MSE, RMSE, mean error and STD error for the different learning algorithms (train)

TRAIN               MSE     RMSE    Mean Error   STD Error
Hybrid              0.03    0.01    -0.03        0.02
Back Propagation    0.08    0.31    -0.27        0.31
GA                  0.003   0.37    -0.004       0.23
PSO                 0.007   0.06    -0.0007      0.18
DE                  0.08    0.37    -1.80        0.28
ACO                 0.05    0.34    -0.0027      0.20

Table. 3 MSE, RMSE, mean error and STD error for the different learning algorithms (test)

TEST                MSE     RMSE    Mean Error   STD Error
Hybrid              0.38    0.37    0.02         0.63
Back Propagation    0.22    0.48    -0.008       0.55
GA                  0.41    0.58    0.44         0.51
PSO                 0.31    0.90    0.30         0.61
DE                  0.57    0.62    0.20         0.57
ACO                 0.29    0.24    0.31         0.46

Table. 4 Evolutionary learning algorithm parameters

PARAMETERS                                    GA        PSO      DE       ACO
Number of Decision Variables                  800       800      800      800
Size of Decision Variables Matrix             [1,800]   [1,800]  [1,800]  [1,800]
Lower Bound of Variables                      -5        -5       -5       -5
Upper Bound of Variables                      10        10       10       10
Maximum Number of Iterations                  700       1000     400      350
Population Size                               100       40       20       20
Crossover Percentage                          0.6       -        0.2      -
Number of Offsprings (Parents)                75        -        -        -
Mutation Percentage                           0.4       -        -        -
Number of Mutants                             45        -        -        -
Mutation Rate                                 0.2       0.3      0.1      0.3
Selection Pressure                            7         -        -        -
Inertia Weight                                -         1        -        -
Inertia Weight Damping Ratio                  -         0.98     -        -
Personal Learning Coefficient                 -         1        -        -
Global Learning Coefficient                   -         2        -        -
Lower Bound of Scaling Factor                 -         -        0.3      -
Upper Bound of Scaling Factor                 -         -        0.8      -
Sample Size                                   -         -        -        55
Intensification Factor (Selection Pressure)   -         -        -        0.3
Deviation-Distance Ratio                      -         -        -        1
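For readers who want to reproduce such experiments, the settings in Table 4 might be organised as plain configuration dictionaries passed to the corresponding optimizers; a sketch under that assumption (the keys and structure are illustrative, not taken from the authors' code):

```python
# Illustrative configuration derived from Table 4; entries marked "-" in the
# table are simply omitted for the optimizers that do not use them.
OPTIMIZER_CONFIGS = {
    "GA": {"n_vars": 800, "lower": -5, "upper": 10, "max_iter": 700,
           "pop_size": 100, "crossover_pct": 0.6, "n_offspring": 75,
           "mutation_pct": 0.4, "n_mutants": 45, "mutation_rate": 0.2,
           "selection_pressure": 7},
    "PSO": {"n_vars": 800, "lower": -5, "upper": 10, "max_iter": 1000,
            "pop_size": 40, "mutation_rate": 0.3, "inertia_weight": 1.0,
            "inertia_damping": 0.98, "c_personal": 1.0, "c_global": 2.0},
    "DE": {"n_vars": 800, "lower": -5, "upper": 10, "max_iter": 400,
           "pop_size": 20, "crossover_pct": 0.2, "mutation_rate": 0.1,
           "scale_factor_min": 0.3, "scale_factor_max": 0.8},
    "ACO": {"n_vars": 800, "lower": -5, "upper": 10, "max_iter": 350,
            "pop_size": 20, "mutation_rate": 0.3, "sample_size": 55,
            "intensification": 0.3, "deviation_distance_ratio": 1.0},
}

def get_config(name: str) -> dict:
    """Look up an optimizer's hyper-parameters by name."""
    return OPTIMIZER_CONFIGS[name]

if __name__ == "__main__":
    print(get_config("DE"))
```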


5. CONCLUSION AND DISCUSSION

Using evolutionary algorithms as learning algorithms and combining them with fuzzy logic produced strong results, able not only to compete with neural-network learning algorithms but sometimes to outperform them. These evolutionary learning algorithms worked well on our data and returned satisfactory results. The main goal was to combine AI techniques with astrology to decrease the divorce rate by building an expert system that acts as an astrological marriage counsellor. As Table 3 shows, back-propagation and ACO achieved good results on these data and PSO achieved weaker results, but all of the methods worked reasonably well. Combining other evolutionary algorithms, such as the bat algorithm [43], ICA [44] and GGO [45], as learning algorithms is suggested for achieving different or possibly better results.

ACKNOWLEDGEMENT

Special thanks to my teachers for teaching me what I did not know and encouraging me to continue, to my family for supporting me the entire time, and to God for helping me when I was too exhausted to continue.

REFERENCES

1. Eiben, A. E., and J. E. Smith, Introduction to Evolutionary Computing. Vol. 53. Heidelberg: Springer, 2003.
2. Storn, R., and K. Price, Differential Evolution: A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces. Technical Report TR-95-012, International Computer Science Institute, Berkeley, California, March 1995.
3. Storn, R., and K. Price, Differential Evolution – A Fast and Efficient Heuristic for Global Optimization over Continuous Spaces. Journal of Global Optimization, 11:341–359, 1997.
4. Price, K. V., R. M. Storn, and J. A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization. Springer, Berlin, 2005. ISBN 3-540-20950-6.
5. Coello Coello, C. A., G. B. Lamont, and D. A. Van Veldhuizen, Evolutionary Algorithms for Solving Multi-Objective Problems. Vol. 5. New York: Springer, 2007.
6. Mitchell, M., An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press, 1996. ISBN 9780585030944.
7. Kennedy, J., and R. Eberhart, Particle swarm optimization. Proc. IEEE International Conference on Neural Networks (Perth, Australia), IEEE Service Center, Piscataway, NJ, 1995.
8. Eiben, A. E., and J. E. Smith, Introduction to Evolutionary Computing. Vol. 53. Heidelberg: Springer, 2003.
9. Colorni, A., M. Dorigo, and V. Maniezzo, Distributed optimization by ant colonies. Proceedings of the First European Conference on Artificial Life, Paris, France, Elsevier Publishing, 134–142, 1991.
10. Dorigo, M., Optimization, Learning and Natural Algorithms. PhD thesis, Politecnico di Milano, Italy, 1992.


11. https://divorcescience.org/for-students/world-divorce-statistics-comparisons-among-countries/
12. http://www.afkarnews.ir/
13. Jang, J.-S. R., ANFIS: adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man and Cybernetics, 23(3), 1993. doi:10.1109/21.256541.
14. Abraham, A., Adaptation of fuzzy inference system using neural learning, in Nedjah, N., and L. de Macedo Mourelle (eds.), Fuzzy Systems Engineering: Theory and Practice, Studies in Fuzziness and Soft Computing, 181, Germany: Springer Verlag, 2005, pp. 53–83. doi:10.1007/11339366_3.
15. Jang, J.-S. R., C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing. Prentice Hall, 1997, pp. 335–368. ISBN 0-13-261066-3.
16. Giarratano, J., and G. Riley, Expert Systems: Principles and Programming. PWS Publishing Company, 1998, p. 321.
17. Kennedy, J., and R. Eberhart, Particle swarm optimization. Proc. IEEE International Conference on Neural Networks (Perth, Australia), IEEE Service Center, Piscataway, NJ, 1995.
18. "astrology". Oxford Dictionary of English. Oxford University Press. Retrieved 11 December 2015.
19. "astrology". Merriam-Webster Dictionary. Merriam-Webster Inc. Retrieved 11 December 2015.
20. Bunnin, N., and J. Yu, The Blackwell Dictionary of Western Philosophy. John Wiley & Sons, 2008, p. 57.
21. Koch-Westenholz, U., Mesopotamian Astrology: An Introduction to Babylonian and Assyrian Celestial Divination. Copenhagen: Museum Tusculanum Press, 1995, pp. Foreword, 11. ISBN 978-87-7289-287-0.
22. Kassell, L., Stars, spirits, signs: towards a history of astrology 1100–1800. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 41(2), 2010, pp. 67–69. doi:10.1016/j.shpsc.2010.04.001.
23. Thompson, R. L., Vedic Cosmography and Astronomy. 2004, pp. 9–240.
24. Jha, P., Āryabhaṭa I and His Contributions to Mathematics. 1988, p. 282.
25. Puttaswamy, T. K., Mathematical Achievements of Pre-Modern Indian Mathematicians. 2012, p. 1.
26. Pingree (1981), pp. 67ff, 81ff, 101ff.
27. Mayo (1979), p. 35.
28. Frawley, D., The Astrology of Seers: A Comprehensive Guide to Vedic Astrology. Motilal Banarsidass Publishers, 1992.
29. Harness, D. M., The Nakshatras: The Lunar Mansions of Vedic Astrology. Motilal Banarsidass Publishers, 2004.
30. Dreyer, R. G., Vedic Astrology: A Guide to the Fundamentals of Jyotish. Weiser Books, 1997.
31. Charak, K. S., Elements of Vedic Astrology. Institute of Vedic Astrology, 2002.
32. http://www.astrologykrs.com/
33. Mousavi, S. M. H., S. Y. MiriNezhad, and V. Lyashenko, An evolutionary-based adaptive neuro-fuzzy expert system as a family counselor before marriage with the aim of divorce rate reduction. Education 1: 5, 2017.
34. Saghafi, H., and M. Arabloo, Estimation of carbon dioxide equilibrium adsorption isotherms using adaptive neuro-fuzzy inference systems (ANFIS) and regression models. Environmental Progress & Sustainable Energy, 2017.
35. Polat, K., and S. Güneş, An expert system approach based on principal component analysis and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease. Digital Signal Processing, 17(4), 2007, pp. 702–710.
36. Soyguder, S., and H. Alli, An expert system for the humidity and temperature control in HVAC systems using ANFIS and optimization with fuzzy modeling approach. Energy and Buildings, 41(8), 2009, pp. 814–822.
37. Boyacioglu, M. A., and D. Avci, An adaptive network-based fuzzy inference system (ANFIS) for the prediction of stock market return: the case of the Istanbul Stock Exchange. Expert Systems with Applications, 37(12), 2010, pp. 7908–7912.
38. Dunn, J. C., A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. 1973, pp. 32–57.


39. Bezdek, J. C., Pattern Recognition with Fuzzy Objective Function Algorithms. Springer Science & Business Media, 2013.
40. Lehmann, E. L., and G. Casella, Theory of Point Estimation (2nd ed.). New York: Springer, 1998. ISBN 0-387-98502-6. MR 1639875.
41. Hyndman, R. J., and A. B. Koehler, Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 2006, pp. 679–688. doi:10.1016/j.ijforecast.2006.03.001.
42. Everitt, B. S., The Cambridge Dictionary of Statistics. CUP, 2003. ISBN 0-521-81099-X.
43. Yang, X.-S., A new metaheuristic bat-inspired algorithm. Nature Inspired Cooperative Strategies for Optimization (NICSO 2010), 2010, pp. 65–74.
44. Atashpaz-Gargari, E., and C. Lucas, Imperialist competitive algorithm: an algorithm for optimization inspired by imperialistic competition. IEEE Congress on Evolutionary Computation (CEC 2007), IEEE, 2007.
45. Mousavi, S. M. H., S. Y. MiriNezhad, and M. H. Dezfoulian, Galaxy Gravity Optimization (GGO): an algorithm for optimization, inspired by comets' life cycle. CSI 2017.

AUTHORS PROFILE

Seyed Muhammad Hossein Mousavi received his M.Sc. degree in Artificial Intelligence from Bu-Ali Sina University, Hamadan, Iran, in 2017. His research interests are evolutionary algorithms, pattern recognition, image processing, fuzzy logic, human-computer interaction, classification and clustering, artificial intelligence, RGB-D data, expert systems, Kinect, data mining, facial expression recognition, face recognition, age estimation and gender recognition.
