
Performance Monitoring of Network Systems

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Anand Sreenivasan, B.E.

Graduate Program in Computer Science and Engineering

The Ohio State University

2011

Thesis Committee

Dr. Rajiv Ramnath

Mark Fullmer

Dr. Jay Ramanathan

Copyright By

Anand Sreenivasan

2011

Abstract

Performance monitoring of network systems allows you to monitor network performance and efficiency. Many of the current approaches involve methods like a round-robin ping service and/or parallel monitoring using non-blocking Input/Output (I/O). While these methods are good enough for a small number of nodes, they fail to generalize and scale to highly complex network infrastructures. Our approach describes an architecture and toolkit for configuring tests for active, passive and end-to-end measurements using Simple Network Management Protocol (SNMP) objects within these systems. The contents of SNMP replies are analyzed in parallel and in real time by the toolkit. The toolkit also provides a network-oriented approach for storing network data and provides an Application Programming Interface (API) with JavaScript Object Notation (JSON) output for front-end visualization, along with access to other real-time information like time series charts and intensity maps.

Dedication

Dedicated to

Mark Fullmer

Dr. Rajiv Ramnath

My parents

Acknowledgements

I would like to express my sincere gratitude to my manager Mark Fullmer and my advisor Dr. Rajiv Ramnath for their continuous support during the course of this thesis work.

I would also like to thank my committee member Dr. Jay Ramanathan and the staff at OARnet for their encouragement and feedback.

Finally I would like to thank my parents for their support and guidance in every phase of my life.

Vita

Start Date End Date Details

Aug. 1992 April 2002 Kilbil St. Joseph’s High School

Aug. 2002 April 2004 R.Y.K. Science College

Aug. 2004 April 2008 Vishwakarma Institute of Information Technology (Bachelor in Engineering, Computer Science and Engineering)

July 2008 July 2009 JP Morgan Chase India

Sept. 2009 Current The Ohio State University (Master of Science in Computer Science and Engineering)

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract

Dedication

Acknowledgements

Vita

List of Tables

List of Figures

Chapter 1: Introduction

1.1 Introduction

1.2 Problem

1.3 Objective

1.4 Organization of the Thesis

Chapter 2: Background

2.1 Key Concepts

2.2 Related Work

Chapter 3: Architecture

3.1 Components of the System

3.2 Data Flow

Chapter 4: Approach

4.1 Speed

4.2 Easy to Configure

4.3 Flexible Visualization

Chapter 5: Implementation, Results and Feedback

5.1 Implementation

5.2 Results

5.3 Feedback

Chapter 6: Conclusion

6.1 Conclusion

6.2 Contribution

6.3 Future Work

References

List of Tables

Table 1. RPM Test Configuration Values

List of Figures

Figure 1. Measurement points in Network Monitoring

Figure 2. Current network monitoring approach

Figure 3. Current Visualization approach

Figure 4. Model of Network Management used by SNMP

Figure 5. RPM client-server operation

Figure 6. RPM configuration options for Juniper Devices

Figure 7. Architecture of Performance Monitoring System

Figure 8. Data Flow of Monitoring Process

Figure 9. Data Flow of Visualization

Figure 10. Python vs C

Figure 11. No. of Node vs No. of Iterations vs Time

Figure 12. Multi-threading vs Non-multi-threading

Figure 13. Client-side request based approach vs Server-side always ready approach

Figure 14. Impact of a continuously failing node

Chapter 1: Introduction

1.1 Introduction

Figure 1. Measurement points in Network Monitoring

Performance monitoring of network systems allows a user to view and administer performance quality, data analysis and errors of an Internet Protocol (IP) network. A performance monitoring tool is an essential asset for a network administrator to constantly keep track of the operation of the network based on the data collected at regular intervals. A typical network consists of different hardware devices like Bridges, Repeaters, Switches, Routers etc. Monitoring a network involves verifying that these devices function correctly, while also ensuring the availability of the interconnecting medium.

As shown in Figure 1, the different measurement points are:

Network Centric: Measurement points are taken at ingress and egress points of the network.

Router Centric: Measurement points are taken at the routers directly.

Based on whether measurements are network or link centric, the two kinds of measurements are:

Active Measurements: New tests are configured on the nodes, and the traffic for these tests flows alongside the regular traffic. Real time performance monitoring in Juniper switches is an example of active measurements.

Passive Measurements: These measurements evaluate the traffic at the interface of the nodes.

In this work, we primarily focus on designing a tool that will probe the nodes within a network. It will then check for errors at every interface and along the links connecting the nodes by performing both active and passive measurements.

1.2 Problem

A comprehensive list of network monitoring tools can be found here (Network Monitoring Tools). These tools collect network data using protocol-based approaches such as Transmission Control Protocol (TCP), Secure Shell (SSH), Simple Network Management Protocol (SNMP) etc. A ping mechanism is used in some instances to test a node or a node-link.

The following mechanism is a common way of configuring a test node in most monitoring tools. Configuring a test on a node involves setting parameters like the protocol, the number of packets to be transmitted for probing, the frequency of probes etc. The configured hosts are then pinged in a round-robin manner, beginning with the first node. The data is stored in a database, from which it is collected by the visualization module, which displays the performance or errors in a network using time-series, tree charts etc. For web based solutions, the visualization is concentrated in the server. Configuration and test of a node-link works in a similar manner.

Figure 2 shows the inherent problem in having a round-robin approach to probe each node. Pinging each node in turn can lead to a situation in which a node fails to respond or takes longer than expected to respond. In such situations the probing of the next node is delayed, causing latency in the output. This delays error reporting and correction by the support team or system administrators, who expect real-time feedback on network performance or errors.

Another problem with serial probing of nodes is that it might sometimes fail to detect an error. For instance, consider a node that drops incoming packets when the buffers it has for storing packets are full. This might happen as a result of an increase in the network traffic. By the time the monitoring tool reaches this particular node, the network traffic could have gone down, so that the node is no longer dropping packets. A serial network monitoring tool will be unable to detect that any error occurred at all.
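To make the delay concrete, the following minimal sketch compares when each node's result becomes available under serial and parallel probing. The response times are hypothetical values chosen for illustration, and the functions only model the timing arithmetic; they do not send any probes.

```python
from itertools import accumulate


def serial_report_times(latencies):
    """Time at which each result is available when nodes are probed one after another."""
    return list(accumulate(latencies))


def parallel_report_times(latencies):
    """Time at which each result is available when all probes start at time zero."""
    return list(latencies)


# Hypothetical per-node response times in seconds; the second node hits a 5 s timeout.
latencies = [0.1, 5.0, 0.1, 0.1]
serial = serial_report_times(latencies)
parallel = parallel_report_times(latencies)
```

With these values the third node is reported only after roughly 5.2 seconds in the serial schedule, because it waits behind the timed-out node, while in the parallel schedule it is reported after 0.1 seconds.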

As shown in Figure 3, in most of these systems the visualization module is tightly coupled to the monitoring system. For instance, if a monitoring tool is developed in a programming language like Java, then the visualization module and the monitoring module usually communicate and exchange data with Java Objects. This makes the modules difficult to reuse when an open source developer wishes to branch off from the existing code base. Most monitoring tools provide visualization in a tree-view and require some amount of human intervention. In order to extract details of a particular node that is hidden deep in a logical network hierarchy, a user has to click through various levels until the target node is found.

Figure 2. Current network monitoring approach

Figure 3. Current Visualization approach

Monitoring tools that take a web based approach mostly implement the visualization using server side technologies. Visualization of this kind increases the load on the server and, because the entire visualization is transmitted to the client, causes latency in viewing the output. Server based visualizations are mostly images, which makes it difficult to build interactivity on top of them.

1.3 Objective

The objective of this thesis is to design a network performance monitoring tool that overcomes the limitations of existing systems as described in 1.2.

The purpose of this work is to build a parallel network probing tool, which is easy to configure for a user along with providing flexible visualization for support teams and system administrators.

1.4 Organization of the Thesis

Chapter 2 explains the key concepts and terminologies within networking and data processing. Chapter 3 discusses the high level architecture of the system. Chapter 4 describes the approach taken to fulfill each key point within the Objective. Chapter 5 discusses the test methodology, results and feedback provided by the support team.

Chapter 2: Background

2.1 Key Concepts

The following key concepts will aid a reader in future chapters.

2.1.1 Computer Network

Computer Network is a collection of autonomous computers interconnected by a single technology [2]. The primary hardware components (or nodes) of a computer network are Bridges, Switches, Routers etc. These components are connected via a wired medium (Optical fibres, Coaxial cables etc.) or wireless medium (WLANs, Microwave, Infrared etc.). Depending on the scale of the network, computer networks can be classified into Personal area Network (PAN), Local area Network (LAN), Metropolitan area network (MAN), Wide area network (WAN), Backbone network and Internet.

2.1.2 Network Monitoring

Network Monitoring is the process of continuously monitoring a computer network. The objective of a network monitoring tool is to detect errors in the system, analyze network performance, predict network failures etc.

2.1.3 Simple Network Management Protocol (SNMP)

SNMP defines a protocol for exchanging information between one or more management systems and a number of agents. It provides a framework for formatting and storing management information. SNMP also defines a number of general purpose management information variables or objects [3].

Figure 4. Model of Network Management used by SNMP

The key elements in the model of network management used by SNMP (Figure 4) can be described as follows [3]:

1. Management Station: Typically a standalone device, which acts as an interface between the human network manager and the network management system. A management station usually has a set of management applications for data analysis, fault recovery etc., an interface by which the network manager monitors and controls the network, a protocol to exchange information with managed entities and a database of information extracted from managed entities.

2. Management Agent: Hosts, Bridges, Routers etc. are equipped with SNMP agent software so that they may be managed from a management station. The management agent may respond to requests from the management station or provide information to the management station.

3. Management Information Base: Resources in a network are represented as objects. Each object is a data variable that represents one aspect of a managed system. The collection of objects is referred to as the Management Information Base (MIB). These objects are standardized across systems of a particular class. In addition, proprietary extensions can be made.

4. Network Management Protocol: The management station and agents are linked by a network management protocol, which includes capabilities like Get, Set and Trap.

2.1.4 Real Time Performance Monitoring (RPM)

RPM can be defined as a probe query sent out by a source (client) to a destination (server). The probe query consists of a packet sent to the server, to which the server responds with an acknowledgement [4].

Figure 5. RPM client-server operation

For each probe, RPM makes several measurements. For each measurement RPM calculates the minimum, maximum, average, peak-to-peak, standard deviation etc.
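As an illustration of these summary values, the sketch below computes them for a list of round-trip times. The function name, the millisecond unit and the use of the population standard deviation are assumptions made for this example; the source does not state how the device derives its standard deviation.

```python
import statistics


def summarize_rtts(rtts_ms):
    """Summarize a non-empty list of round-trip times (milliseconds) for one probe."""
    lowest = min(rtts_ms)
    highest = max(rtts_ms)
    return {
        "min": lowest,
        "max": highest,
        "avg": statistics.fmean(rtts_ms),
        "peak_to_peak": highest - lowest,
        # Population standard deviation; an assumption for this sketch.
        "stddev": statistics.pstdev(rtts_ms),
    }
```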

The different types of packets that can be sent include Internet Control Message Protocol (ICMP) echo/timestamp, HyperText Transfer Protocol (HTTP) get, User Datagram Protocol (UDP) echo/timestamp, TCP.

An example of the common options that can be configured for RPM in Juniper Devices is shown in Figure 6.

2.1.5 Measurement Points and Key Performance Indicators [5]

Measurement Points

Measurement points indicate where measurements can be taken from a service provider network.

Network-centric measurements: These are measurements taken at measurement points that map to ingress and egress points for the network.

Router-centric measurements: These measurements are taken directly from the routers themselves.

Key Performance Indicators

Availability: It measures the reachability of one measurement point from another measurement point at the network layer (e.g. ICMP Ping).

Health: It measures the number and type of errors that are occurring on the provider network, such as hardware failures or packet loss.

Performance: It measures how well the network can support IP services (e.g. delay or utilization).

Figure 6. RPM configuration options for Juniper Devices

2.2 Related Work

A comprehensive list of various networking tools can be found in [1]. As a part of this thesis work, various tools were studied and evaluated on the basis of speed, ease of configuration and visualization.

WebNMS framework [6] is a network management model built on top of the WebNMS API. Although it is not open source, it had a 45-day trial period, which allowed us to evaluate the product. It provides flexibility in terms of configuration and defines an architecture that is suited for scaling to a large number of devices in a network. It provides an HTML/JSP client and its open API allows a user to extend the SNMP API in Java, C and .NET.

Statscout [7] is a tool that is used within OARnet [8] for network monitoring. The tool provides a detailed report of the devices being monitored and generates various port reports and device reports. It is not an open-source tool, and feedback from OARnet suggests that it is difficult to configure tests on the devices. A high level visualization output is also absent.

Other tools such as SNMP Informant [9] and Denika [10] use SNMP for polling network information and provide visualization and configuration support. But these tools are not open-source and do not leverage parallelism (multi-threading). As with the other tools, a high level visualization is absent here too.

The Kansei testbed [11] developed at The Ohio State University is designed to facilitate research on networked sensing applications at scale. It implements a multi-threaded daemon for test-bed scheduling, administration management and experimentation.

Chapter 3: Architecture

3.1 Components of the System

Figure 7. Architecture of Performance Monitoring System

3.1.1 Main Process

As shown in Figure 7, the monitoring process acts as a moderator for the entire performance monitoring system. Its main functions are processing the input stored in the input files, starting and terminating threads at various stages, initiating calls to pull network data, storing the network information and converting data to JSON format for the visualization system.

The monitoring process also handles all the errors the system encounters. Every other component in the system reports its errors back to the monitoring process, which then logs them.

The major sub-components in the main process are:

Network Data Process: Provides wrapper functionality around the JSON library, in order to process the network information it parses from the input network data files.

Stats Calculator: The Stats Calculator calculates essential network information like Input Utilization, Output Utilization etc. Each of these values is calculated in parallel.

BSNMPBulkWalk: This module performs the SNMP Bulk Walk to probe each node in parallel, get SNMP objects in bulk and store them in files. Each object name has a unique file to which it is written. The process of probing the various nodes and retrieving each requested object within them is done in parallel.

Results: The results process module retrieves all the network information stored within files and stores it in a hash table. The module retrieves the data and processes it into a hash table in parallel.

Cache Data: The data cache module caches the latest results, delta values of snmp counters and latest stats values. The unique identifier for the data is the timestamp at which the Main Process had triggered the flow.

Errors Find: The errors check module walks through the processed network data and checks for existing errors. A flag is set against each interface that has an error. The visualization module checks the error flag while color coding the output.
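The error check described above can be sketched as follows. This is a minimal illustration, assuming that an interface counts as erroneous when the delta of an error counter such as ifInErrors is greater than zero; the function name, the data layout and that criterion are assumptions, not the tool's actual code.

```python
def find_errors(deltas_by_object, error_objects=("ifInErrors",)):
    """Return {interface: flag}, where flag is True if any error counter increased.

    deltas_by_object maps an SNMP object name to {interface index: counter delta}.
    """
    flags = {}
    for object_name in error_objects:
        for interface, delta in deltas_by_object.get(object_name, {}).items():
            # Assumed criterion: a positive delta means new errors since the last poll.
            flags[interface] = flags.get(interface, False) or delta > 0
    return flags
```

A visualization layer could then map a True flag to an error color for the corresponding interface.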

3.1.2 Input Network Data Files

The input network data files contain input data for the monitoring process and its sub-components. The input files contain the network, node, location and SNMP object information. This is the primary input that is fed into the system. An example of a network data file is:

{
  "network" : {
    "name" : "oarnet",
    "nodes" : [ {
      "name" : "clmbn-r50.etech.oar.net",
      "community" : "public",
      "latitude" : "39.96118",
      "longitude" : "-82.99879",
      "properties" : {
        "description" : "Test Node",
        "type" : "Juniper EX-3200 24T"
      },
      "snmp" : ["ifInErrors", "ifHCInOctets", "ifSpeed"]
    },{
      "name" : "clevs-r50.etech.oar.net",
      "community" : "public",
      "latitude" : "41.49950",
      "longitude" : "-81.69541",
      "snmp" : ["ifInErrors", "ifHCInOctets", "ifSpeed"]
    }]}
}

The optional "properties" key allows a user to set up custom properties such as meta information.
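A file of this shape can be expanded into one probe target per node and SNMP object. The sketch below is illustrative only: it uses Python's standard json module rather than the JSON-C library used by the tool, and the function name and tuple layout are hypothetical.

```python
import json

# The example network data file from above, written with plain ASCII quotes.
CONFIG_TEXT = """
{
  "network": {
    "name": "oarnet",
    "nodes": [
      {"name": "clmbn-r50.etech.oar.net", "community": "public",
       "latitude": "39.96118", "longitude": "-82.99879",
       "properties": {"description": "Test Node", "type": "Juniper EX-3200 24T"},
       "snmp": ["ifInErrors", "ifHCInOctets", "ifSpeed"]},
      {"name": "clevs-r50.etech.oar.net", "community": "public",
       "latitude": "41.49950", "longitude": "-81.69541",
       "snmp": ["ifInErrors", "ifHCInOctets", "ifSpeed"]}
    ]
  }
}
"""


def list_probe_targets(config_text):
    """Return one (network, node, community, object name) tuple per requested object."""
    network = json.loads(config_text)["network"]
    targets = []
    for node in network["nodes"]:
        for object_name in node["snmp"]:
            targets.append((network["name"], node["name"], node["community"], object_name))
    return targets


def node_properties(config_text):
    """Return {node name: properties}, using an empty dict when the key is omitted."""
    network = json.loads(config_text)["network"]
    return {node["name"]: node.get("properties", {}) for node in network["nodes"]}
```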

3.1.3 Libraries

The libraries component contains various open-source libraries that are used for parsing JSON configuration files, parsing MIB files and carrying out the SNMP BulkWalk.

JSON-C (Version 0.9)

The JSON-C library [12] is a simple, lightweight library for parsing JSON files. The library has an MIT license.

BSNMP (Version 1.12)

BSNMP [13] is a BSD-licensed SNMP library. This is a lightweight library that provides functionality for Abstract Syntax Notation One (ASN.1) encoding and decoding of SNMP information, creating Protocol Data Units (PDU) and sending and receiving requests.

The initial implementation of this library was not thread-safe. It allowed only one connection to be created at a time. In order to have a multi-threaded environment that probes all the nodes in parallel, this library had to be modified. The modified version now allows multiple SNMP requests to be sent and received in parallel, while also giving the developer who is using this library the freedom to allocate and free the required variables as needed. This is beneficial in processes where the SNMP data needs to be pulled at a continuous frequency. An API is added to allocate and free the processes.

The library has an SNMP definition generator, which parses enterprise specific MIB files and stores them in a .def file, which is later referred to when converting user object names to machine object IDs.

BSNMPTools (Version 1)

BSNMPTools [14] is a library written on top of the BSNMP library. It is licensed under BSD and provides functionality for SNMP Get, Walk, Set and GetBulk. The initial implementation of BSNMPTools was in the form of an application, rather than a library with an API for developers. The application also wrote its output directly to the console rather than storing it in a data structure or a file. Since the application was using the BSNMP library, it was not thread-safe either.

Changing the application to a library involved going through the source code and modifying the function calls as per the data in the input configuration files. Also, the network results are now written to a file rather than the console. The whole library is now thread-safe.

The initial implementation of BSNMPTools did not have an SNMP BulkWalk implementation. BulkWalk is the most efficient form of network data retrieval. After working with the author of the library, the Tools library now has a BulkWalk implementation that is synchronized with our changes.

Another change involved loading the definitions file. The definitions file contains the mapping from Object name to Object ID. The earlier implementation loaded the definitions during each iteration. We have extracted that implementation and made it a part of the Main Process system. These mappings are loaded only once and are cached for further use.
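The load-once behaviour can be sketched as a small cache around a loader function. The class and loader below are hypothetical; the loader returns a hard-coded mapping for three standard IF-MIB objects instead of parsing a real .def file, whose format is not described here.

```python
class DefinitionCache:
    """Loads the object name to object ID mapping on first use and reuses it afterwards."""

    def __init__(self, loader):
        self._loader = loader
        self._mapping = None
        self.load_count = 0  # exposed only so the load-once behaviour can be observed

    def lookup(self, object_name):
        if self._mapping is None:
            self._mapping = self._loader()
            self.load_count += 1
        return self._mapping[object_name]


def load_standard_definitions():
    """Hypothetical loader; a real one would parse the generated definitions file."""
    return {
        "ifSpeed": "1.3.6.1.2.1.2.2.1.5",
        "ifInErrors": "1.3.6.1.2.1.2.2.1.14",
        "ifHCInOctets": "1.3.6.1.2.1.31.1.1.1.6",
    }
```

Repeated lookups across iterations then cost a dictionary access rather than a file parse.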

LibMicroHTTPD (Version 0.5.3)

Although not core to the system, libmicrohttpd [14] is a lightweight library for running an HTTP Server, which processes all the incoming web requests. The wrapper code written around this library provides an implementation for processing Ajax requests. The request arguments are JSON strings, which are parsed by the JSON-C library.

3.1.4 Network Data Storage

The network information is stored in files arranged within different directories. Each SNMP test contains a set of information associated with it. For instance a request to pull network information contains the following information:

Network: Oarnet

Node: clmbn-r50.etech.oar.net

Timestamp: 2010/05/01 22:00:00

Object name: ifInErrors

This test result is stored in the following directory hierarchy: oarnet/clmbn-r50.etech.oar.net/2010/05/01/220000/ifInErrors

This approach makes it easier to search for a particular file, in case the source of an error needs to be traced or information needs to be extracted for data analysis and visualization.

An approach like this also favors parallel retrieval of results. Since each file contains information pertaining to a single object name, multiple threads can be started that search through the files to get different object names.
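The mapping from a test to its file location can be sketched as below. The function name is hypothetical, and the layout is inferred from the single example path given above.

```python
import posixpath
from datetime import datetime


def result_path(network, node, timestamp, object_name):
    """Build network/node/YYYY/MM/DD/HHMMSS/object for one SNMP result file."""
    return posixpath.join(
        network,
        node,
        timestamp.strftime("%Y/%m/%d/%H%M%S"),
        object_name,
    )
```

For the example test, result_path("oarnet", "clmbn-r50.etech.oar.net", datetime(2010, 5, 1, 22, 0, 0), "ifInErrors") reproduces the directory hierarchy shown above.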

Data is stored in binary format, so that it is faster to read. Each file contains a “key[interface]=value” representation of the object. An example representation of the result file:

ifInErrors[1]=0
ifInErrors[2]=0
ifInErrors[4]=0

This representation is the same as the one BSNMPTools provides. Storing data in a hierarchical format makes it easier to process it and store it in data structures like a Hash Table, which in turn makes it easier to convert to data formats like JSON.
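The conversion from this representation to a hash table can be sketched as follows, assuming the file content is available as text in the form shown. The function name and the nested dictionary layout are assumptions for this example.

```python
import re

# Matches lines such as "ifInErrors[1]=0".
LINE_PATTERN = re.compile(r"^(?P<key>\w+)\[(?P<interface>\d+)\]=(?P<value>.*)$")


def parse_result_text(text):
    """Return {object name: {interface index: value string}} for well-formed lines."""
    table = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        interfaces = table.setdefault(match["key"], {})
        interfaces[int(match["interface"])] = match["value"]
    return table
```

The resulting nested dictionary can be passed directly to a JSON serializer.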

3.1.5 Visualization

The visualization module is a JavaScript-based web visualization. On each user request, the visualization module sends out a request which interfaces with the libmicrohttpd module. The visualization module uses JQuery for implementing basic JavaScript functionality. The Google Visualization and JQPlot JavaScript libraries are used for representing network information. The workings of the visualization module are covered further in 4.3.

3.1.6 Network

The network backbone is the production environment for the application. It is an OARnet eTech public broadcast network. These networks are connected to end devices which are powered by NetVX. Monitoring the backbone involves monitoring the various nodes, node links and the end devices in the network. Most of the routers within the network are Juniper, and hence Juniper Real Time Performance Monitoring tests are configured within these nodes.

Figure 8. Data Flow of Monitoring Process

Figure 9. Data Flow of Visualization

3.2 Data Flow

The flow of data within the system as shown in Figure 8 and Figure 9 can be described as:

1. The monitoring process is the moderator of the system. It parses the input files and retrieves network information. The monitoring process stores this information in a linked list of thread input.

2. With the thread input in place, a call is made to the BSNMPBulkWalk module to initialize the definitions. These definitions contain mapping information for human readable Object names and machine readable Object IDs. This is done only once during the entire process.

3. After the mappings are loaded, the main process gets the latest time. This timestamp is carried forward for the entire process, and it is used as a unique identifier for a result during visualization and data analysis.

4. The timestamp and the network information are concatenated to get the target file names, which are then added to the thread input.

5. The monitoring process then starts a series of threads, one thread per entry in the linked list of thread input, and each thread makes a call to the SNMPBulkWalk in the BSNMPBulkWalk module.

6. The BSNMPBulkWalk module works along with the underlying BSNMP library in making requests to the network and getting SNMP data.

7. This data is then written to the file indicated in the thread input.

8. Once all the threads calling the BSNMPBulkWalk come to an end, the monitoring process starts another series of threads, which parse the files named in the thread input and store their contents in a result data structure.

9. Once the threads come to an end and the data is collected, the result data structure is converted to a hash table for further use.

10. The monitoring process then caches the latest result by the timestamp of the process.

11. The monitoring process then calls the stats calculator module, which initially calculates the delta of the result. The delta is calculated from the current result passed to the module and the previous result taken from the cache.

12. Once the delta is calculated, the stats calculator starts threads equal to the number of stats that are to be calculated. Each thread looks up the cached delta data if required and writes the result into a hash table, which is then cached according to the timestamp.

13. Once the threads complete, control is passed back to the monitoring process, which walks through the result to check if it has any error in it. The outcome is written to a file.

14. The visualization module retrieves the errors file, along with other meta information, and parses it. It extracts the error information from it and calls the Google Maps API.

15. The visualization module also looks up the location information, retrieves it and passes it to the Google Maps API. Both the error data and location data are displayed for a node as per its location on the map.

16. Supporting visualizations such as a Table View and a Stacked bar view help a user interpret the data more accurately.
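Steps 11 and 12 above can be sketched as follows. The utilization formula (octets converted to bits, divided by interface speed times polling interval) is a commonly used definition and an assumption here, as are the function names, the data layout and the handling of a single 64-bit counter wrap; the source does not spell out its formulas.

```python
from concurrent.futures import ThreadPoolExecutor

COUNTER64_MODULUS = 2 ** 64  # ifHCInOctets is a 64-bit counter


def counter_delta(current, previous, modulus=COUNTER64_MODULUS):
    """Difference between two counter readings, allowing for one wrap-around."""
    return (current - previous) % modulus


def input_utilization(delta_octets, if_speed_bps, interval_s):
    """Percentage of link capacity used by inbound traffic over the interval."""
    return 100.0 * delta_octets * 8 / (if_speed_bps * interval_s)


def compute_stats(current, previous, if_speed, interval_s):
    """Compute deltas once, then evaluate each statistic in its own thread."""
    deltas = {
        interface: counter_delta(value, previous[interface])
        for interface, value in current.items()
        if interface in previous
    }

    def input_utilization_stat():
        return {
            interface: input_utilization(delta, if_speed[interface], interval_s)
            for interface, delta in deltas.items()
        }

    # One entry per statistic; further statistics would be added to this table.
    stat_functions = {"inputUtilization": input_utilization_stat}
    with ThreadPoolExecutor(max_workers=len(stat_functions)) as pool:
        futures = {name: pool.submit(fn) for name, fn in stat_functions.items()}
        return {name: future.result() for name, future in futures.items()}
```

Computing the deltas once before starting the threads mirrors the description above, where each statistic thread only reads the cached delta data.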

Chapter 4: Approach

4.1 Speed

The primary objective of this work is to design a faster network monitoring tool. The following sections highlight the different approaches that were taken in order to decrease the execution time of the process.

4.1.1 Low Level Programming Language

The initial implementation of the tool was in a higher level language, Python. Although it made coding and data processing easy, Python had a major drawback in terms of its speed of execution. Being interpreted, it faces issues when running on low memory devices. An obvious replacement was to switch to a low level, compiled programming language like C. C offers a considerable advantage in terms of speed when compared to Python. It leaves a smaller memory footprint and runs on devices with low memory. ANSI standards have ensured that C is portable across various operating systems and architectures too.

4.1.2 Threading

The initial idea of building this tool centered around the concept of threading. Several earlier solutions were limited by their serial implementation, and hence the design of this tool originated and progressed with threading as an integral part of the system. The threading solution divided the process into a set of atomic operations. Each sub operation within a single atomic operation could be executed in parallel. The support libraries were checked for thread-safety. The JSON-C library and the libmicrohttpd library were thread-safe from the beginning, while changes had to be made to the BSNMP library and the BSNMP Tools library to make them thread-safe.
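The one-thread-per-node pattern can be sketched as below. The probe callable stands in for the SNMP BulkWalk call and is hypothetical; the sketch only shows the fan-out, the join, and how a failing node is recorded without blocking the others.

```python
import threading


def probe_all(targets, probe):
    """Run probe(target) in one thread per target and collect the outcomes.

    A raised exception is stored as the outcome for that target, so one
    failing node does not prevent the others from being reported.
    """
    results = {}
    lock = threading.Lock()

    def worker(target):
        try:
            outcome = probe(target)
        except Exception as exc:  # record the failure instead of losing it
            outcome = exc
        with lock:
            results[target] = outcome

    threads = [threading.Thread(target=worker, args=(target,)) for target in targets]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The lock guards the shared result dictionary, which reflects the kind of shared state that had to be made thread-safe in the underlying libraries.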

4.1.3 Parallel Data Storage and Retrieval

Files are advantageous when deploying the data on another system. Storing the various results according to the SNMP object names in different files within the hierarchy of the network, node and the timestamp, makes it easier to search through them. Having a good segregation of the data also makes it easier to have parallel access to them.

4.1.4 SNMP Configuration Load

SNMP Configurations involve retrieving the meta information about the interfaces within a node, for example name (ifName), description (ifDescr) etc. This information is retrieved only once and is cached for use by the visualization module. At the beginning of each day, the meta information is refreshed again.

In order to probe enterprise SNMP objects, mappings from the MIB’s need to be looked up. These mappings are loaded at the beginning once and are cached for further use. Each time the

4.1.5 Data Cache

After each iteration, when the results are collected, they are timestamped and cached for use by the visualization module. Caching is also done for the meta information (interface names and descriptions), delta information, stats output and errors. The data in the cache is flushed out at the end of each 24 hour cycle. API calls are present to get the current information contained within the cache. An implementation is also provided for regenerating the information within the cache at a later time.
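A minimal sketch of such a cache is shown below, assuming timestamps are expressed in seconds and entries are keyed by timestamp and kind (results, deltas, stats and so on). The class, its method names and the key layout are hypothetical.

```python
DAY_SECONDS = 24 * 60 * 60


class ResultCache:
    """Keeps timestamped entries and discards them all when a 24 hour cycle ends."""

    def __init__(self, cycle_start):
        self.cycle_start = cycle_start
        self._entries = {}

    def put(self, timestamp, kind, data):
        self._entries[(timestamp, kind)] = data

    def get(self, timestamp, kind):
        return self._entries.get((timestamp, kind))

    def latest(self, kind):
        """Return the most recent entry of the given kind, or None if there is none."""
        stamps = [stamp for stamp, entry_kind in self._entries if entry_kind == kind]
        return self._entries[(max(stamps), kind)] if stamps else None

    def flush_if_cycle_ended(self, now):
        """Clear the cache once 24 hours have passed since the cycle started."""
        if now - self.cycle_start >= DAY_SECONDS:
            self._entries.clear()
            self.cycle_start = now
            return True
        return False
```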

4.1.6 Load Next Iteration

At the end of each iteration, when a sleep call is invoked before the next iteration, the system runs the initialization code for the next iteration. This involves getting the system time of the next iteration (the new system time is calculated by taking the system sleep time into account), generating file names, initializing threads etc.

4.2 Easy to Configure

4.2.1 Data Representation

The requirements for representing the input data were

Lightweight: The data should be easy to process, and parsing it should not increase the execution time by a large factor.

Easy to understand: Understanding what the input files actually convey should be easy.

Parsing: The parsing library should be easily available, widely supported or a parser should be easy to write from scratch.

These requirements are easily met by JavaScript Object Notation (JSON) files. JSON is a key-value representation of the data. It is easy to read, involves little overhead while parsing, and there are plenty of open source libraries available in almost all languages. JSON is built into the JavaScript object model: a JavaScript object has a one-to-one mapping with JSON data. Additionally, it is easy to convert JSON data into a hash table.

The configuration files are JSON. They represent the relevant information that the system requires for carrying out the process. They can be extended to represent more information as well. The configuration files have specific keywords which are used to uniquely identify a particular element or an object.

The SNMP Data files have a “Key[Interface]=Value” format. This representation is easy to read and interpret. It is also easy to convert a data of this kind into a hash table representation.

The cache data is a key value representation of various results calculated during the course of the process. API exists for converting the data to JSON for usage by the visualization.

The request and the response data of the visualization module are also JSON.

Because of the one-to-one correspondence between JSON data and Javascript objects, it is easy to process the response from the server.

4.2.2 Support Scripts

Support scripts exist which allow a user to configure network information. Create, Read, Update and Delete (CRUD) operations are provided which allow a user to modify configurations easily. A web interface offers the same functionality. It provides a GUI representation of the existing data, allows a user to search through the existing configuration and provides options to make the necessary changes.

4.3 Flexible Visualization

Most of the applications studied as a part of this thesis took varying approaches to network visualization. Desktop solutions had client side visualizations, but not all web based solutions provided client side or Javascript visualizations. Server side solutions involved processing the data on the server, passing it to a visualization API and converting the output to an image file, which the client then receives as a response.

This approach increases the load on the server and adds latency in transmitting the image file. User interaction with an image file is also difficult to implement, since it requires precise knowledge of where the key points in the image are.

Client side/Javascript solutions send the necessary data to the client on request, and the client uses a Javascript charting solution to provide the visualization. A similar approach is used here. When a client requests the latest result from the server, the libmicrohttpd wrapper processes the request, extracts the input arguments and makes the appropriate function call. The result of that function call is converted to JSON, which libmicrohttpd then returns as the response to the client.
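The request flow described above (extract the arguments, call the matching function, return JSON) can be sketched as follows. The sketch is in Python and does not use libmicrohttpd; the path /latest_errors, the function latest_errors and its argument are hypothetical names chosen for illustration.

```python
import json
from urllib.parse import parse_qs, urlparse


def latest_errors(node):
    """Hypothetical query function; a real one would read from the cache."""
    return {"node": node, "errors": 0}


# Maps a request path to the function that serves it.
HANDLERS = {"/latest_errors": latest_errors}


def handle_request(url):
    """Return (status_code, json_body) for a request URL."""
    parsed = urlparse(url)
    handler = HANDLERS.get(parsed.path)
    if handler is None:
        return 404, json.dumps({"error": "unknown call"})
    # parse_qs returns a list per argument; keep the first value of each.
    args = {name: values[0] for name, values in parse_qs(parsed.query).items()}
    try:
        result = handler(**args)
    except TypeError:
        # Missing or unexpected arguments; a fuller version would validate explicitly.
        return 400, json.dumps({"error": "bad arguments"})
    return 200, json.dumps(result)
```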

The visualization libraries that we have looked at are Google Visualization and JQPlot. Both provide various kinds of charts and APIs for them. These charts sufficiently represent the necessary network information and give a user multiple views of the network data. The primary form of visualization is a Google Maps visualization. The location data is extracted from the input files and then passed to the Google Maps API.

Most visualization techniques involve constant human intervention. A user has to dig into multiple levels of network and node details to find the actual error in the system. A primary goal of this visualization is to have a simple error visualization which is mapped to the location of the node. Where multiple nodes map to the same location, an error is shown if any one of the nodes at that location has an error.

Further interactivity allows a user to search easily to find the source of that error.
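The rule that a location shows an error when any node mapped to it has an error can be sketched as below. The node record fields (lat, lon, has_error) are assumptions for illustration and do not reflect the tool's actual data layout.

```python
from collections import defaultdict


def errors_by_location(nodes):
    """Map each (lat, lon) location to True if any node at that location has an error."""
    status = defaultdict(bool)  # locations default to "no error"
    for node in nodes:
        location = (node["lat"], node["lon"])
        status[location] = status[location] or node["has_error"]
    return dict(status)
```

The map then needs only one marker per location, and a user can drill into a flagged marker to find which node is the source of the error.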

Feedback from the support team and the technical team at OARnet has enhanced the quality of the visualization output.

Chapter 5: Implementation, Results and Feedback

5.1 Implementation

5.1.1 Specification

The operating system of the development box, as printed by the command sysctl -n kern.osrelease, is

RELENG8-20101124

This is a custom name given to the FreeBSD 8 version of the OS.

The development box has the following hardware specification, as printed by the command sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu':

hw.machine: i386
hw.model: Intel(R) Xeon(TM) CPU 2.40GHz
hw.ncpu: 1
hw.machine_arch: i386

The production box will have the same hardware and software specification.

5.1.2 Configuring Juniper Real-Time Performance Monitoring (RPM)

A Juniper RPM test involves the client sending out probe requests, to which the server responds. Configuring a Juniper RPM test involves setting the following parameters:

Probe Type: Type of probe to send as a part of the test (http-get, icmp-ping, tcp-ping, udp-ping).

Target Address: Destination of the probe.

Source Address: Source address of the probe.

Probe Interval: Wait time (in seconds) between each probe transmission.

Test Interval: Wait time (in seconds) between tests.

Probe Count: Total number of probes sent for each test.

Data Size: Size of the data portion of the ICMP probes.

5.1.3 SNMP Objects

The following SNMP objects are pulled as a part of monitoring the network:

ifInErrors: Number of inbound packets that contained errors, preventing them from being delivered [16].

ifOutErrors: Number of outbound packets that contained errors, preventing them from being transmitted [17].

jnxRpmResSumPercentLost: Provides the percentage of packet loss in an RPM test [4].

ifSpeed: An estimate of the interface’s current bandwidth in bits per second [18].

ifHCInOctets: The total number of octets received on the interface [19].

ifHCOutOctets: The total number of octets transmitted out of the interface [20].
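Since ifHCInOctets and ifHCOutOctets are 64-bit counters that only increase (and wrap around), a per-interval delta is needed before they can be compared with ifSpeed. The following Python sketch shows one common way of doing this; the utilization formula is an assumption made for illustration and is not stated in this thesis as the statistic computed by the tool.

```python
COUNTER64_MODULUS = 2 ** 64  # ifHC* objects are 64-bit counters


def counter_delta(previous, current, modulus=COUNTER64_MODULUS):
    """Difference between two counter samples, tolerating a single wrap-around."""
    return (current - previous) % modulus


def utilization(prev_octets, curr_octets, interval_seconds, if_speed_bps):
    """Fraction of the link used between two samples (assumed formula)."""
    if interval_seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    bits = counter_delta(prev_octets, curr_octets) * 8  # octets -> bits
    return bits / (interval_seconds * if_speed_bps)
```

As a worked example, 125,000,000 octets received over 10 seconds on a 1 Gbit/s interface is 10^9 bits out of a possible 10^10, a utilization of 0.1.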

5.1.4 Error Testbed

An error testbed was set up in order to verify the correctness of the tool in a relatively bad environment. This tests the tool's ability to detect frequent errors, while also indicating any effects on performance in a constantly failing environment.

The test setup involved configuring two Juniper Ex-3200 24T switches.

An RPM test was configured with the parameter values listed in Table 1.

Table 1. RPM Test Configuration Values

The Juniper interface was connected to a JDSU VCB0+1NC1.0NC1.0NC Optical Attenuator, which allowed us to introduce errors in the node link. The Zterm emulator was used to connect to the attenuator and reduce the power of the node link, thereby introducing errors. The results of this test are discussed in 5.2.5.

5.2 Results

5.2.1 Python vs C

The initial implementation was done in Python 2.6, while the current implementation is in C (gcc version 4.2.1). The change from Python to C was made with one of the primary claims of the tool in mind, namely speed. C, being compiled and highly optimized, performs well when compared to Python, which is interpreted. The memory footprint of the C implementation is also lower than that of the Python implementation. The results in Figure 10 show the improvement in speed that resulted from the change in implementation language.

No. of Iterations: 10

Figure 10. Python vs C

The implementation for the above consisted of threads issuing SNMP requests and storing the results. These results were then pulled by the Javascript code and processed as required by the visualization.

The running time was measured by calling the function time() before and after the code runs and subtracting the first value from the second.
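The measurement method can be sketched in Python as follows, using time.time() from the standard library. The helper name timed is an assumption for illustration.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds) measured with wall-clock time."""
    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start
    return result, elapsed
```

Wall-clock timing of this kind includes time spent waiting on the network, which is appropriate here because the SNMP round trips are part of what is being compared.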

As can be seen from the above visualization, the total running time of the C implementation is much less than that of the Python implementation. The relative increase in running time as the number of nodes grows is also smaller in C than in Python.

5.2.2 No. of nodes vs No. of Iterations vs Time (Language: C)

Figure 11. No. of Nodes vs No. of Iterations vs Time

This test consisted of a C implementation in which different threads probed the nodes. Once those results were returned, multiple threads processed the data and calculated different statistics. After that, a single thread ran across all the calculated statistics, checked for errors and formatted the data as per the visualization API.

As seen from Figure 11, the total time taken per iteration increased with the number of nodes and the number of iterations.

The increase with the number of nodes was primarily due to the error checking and formatting of data for the visualization. The increase across iterations was primarily due to the continuous allocation of large amounts of memory, cleanup activities, re-initialization activities and the caching of the results.

5.2.3 Parallel vs Serial approach (Language: C)

Figure 12 shows the comparison of the implementation with threading involved in various stages versus a non-threaded implementation.

No. of Iterations: 10

Figure 12. Multi-threading vs Non-multi-threading

In a threaded environment, multiple threads pull network data and store it. A different set of threads then parses the data and caches it locally. In a non-threaded environment, the whole process is done by a single thread. Hence only one connection to any node exists at any given time for pulling the network information, and a single thread goes through all the results to get the stored data. As can be seen, the two approaches perform very differently.
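The difference between the two approaches can be illustrated with the following Python sketch. It does not issue SNMP requests: poll_node is a stand-in that sleeps to imitate network latency, and the function names, worker count and returned values are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll_node(node, delay=0.01):
    """Stand-in for an SNMP request: sleeps to imitate network latency."""
    time.sleep(delay)
    return node, {"ifInErrors": 0}


def poll_serial(nodes):
    """One connection at a time: total time grows with the number of nodes."""
    return dict(poll_node(node) for node in nodes)


def poll_parallel(nodes, workers=8):
    """Several requests in flight at once: waiting periods overlap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(poll_node, nodes))
```

Both functions return the same data; only the elapsed time differs, since in the threaded version the waiting on different nodes happens concurrently.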

5.2.4 Request based approach vs Always ready approach

The initial implementation in C (and the final implementation in Python) wrote the result data to files, after which the Javascript request would dynamically format the data as per the API and send it back to the client.

The latest implementation (language: C) adds an extra step after the data is written to the files: it parses these files, checks for errors within them and then formats the data as per the visualization API. The test here compared whether a client-side request-based approach is faster than a server-side always-ready approach.

The total running time in this case was calculated as:

Total Time = Δ Client Side Process + Δ Server Side Process

where Δ Client Side Process = Time at which response is received by client - Time at which request is issued by client

and Δ Server Side Process = Time taken to complete an iteration of the entire process

The above test was carried out for the following case:

Number of iterations: 10

Figure 13. Client-side request based approach vs Server-side always ready approach

Figure 13 clearly indicates that the server-side always-ready approach is faster. Although this may not hold in general, the always-ready approach works well here because the client periodically requests the same data for all the nodes.

5.2.5 Error Testbed Results

The results of the test scenario described in 5.1.4 are as follows:

Figure 14. Impact of a continuously failing node.

The test process involved one test case without the error node and one test case with the error node. This was done in order to check whether a continuously failing node affects the performance of the system in any way.

The number of nodes in both test cases is the same. In the test case with the error node, a regular node was deleted from the configuration.

As can be seen from the result in Figure 14, the time taken per iteration actually drops. This can be attributed to the fact that the error node has only one active interface. Hence, during an SNMP bulk walk, its data is returned in a single query, while the other test nodes usually have many active interfaces, which causes multiple request-response exchanges between the tool and the node.

5.3 Feedback

Feedback from the OARnet technical team was noted as a part of improving the experience of the user in handling the tool.

5.3.1 Configuration file

A configuration file that is easy to create, update and read was a primary requirement. It was also necessary that the monitoring tool could process the configuration file without significant overhead. JSON was hence an appropriate way of representing the network data: JSON files are easy to create and read, and parsing them involves little overhead.

In order to further enhance the ease of editing configuration files, a web interface exists. This web interface allows a user to perform the same task of creating, editing, deleting and updating network configuration data.

5.3.2 Visualization

The primary goal for the visualization was to require minimal interaction in viewing the output. The disadvantage of bar charts, column charts, scatter plots and similar charts is that once the number of nodes increases, they become cluttered within the web page, which makes it difficult to identify errors with them.

A map based approach gives a user a definite view of where the error has occurred, and since multiple nodes can be mapped to the same location, it is easy to represent many nodes without causing too much clutter. Additionally, it is easier to represent errors on node links by drawing a path between two nodes.

Chapter 6: Conclusion

6.1 Conclusion

The current implementation supports the three initial claims: speed, ease of configuration and flexibility of visualization have all improved compared to the current tools.

The change in programming language, the use of standard formats for configuring input and the client side approach to visualization have all worked in favor of these claims.

The results shown in Chapter 5 highlight the various scenarios where performance has increased compared to earlier approaches. Feedback from the OARnet team has helped in increasing the usability of the visualization and configuration files.

6.2 Contribution

As a part of this thesis, I worked with OARnet to decide the requirements of the tool. After researching various tools to see how they work and evaluating the approaches they take, I designed the architecture of my tool. The BSNMP library that we wanted to use at OARnet for pulling SNMP data was not thread safe initially. Neither was BSNMPTools, which is built on top of BSNMP and implements the SNMP request algorithms. I worked on making these two libraries thread safe. In the case of BSNMPTools, the SNMPBulkWalk feature was absent in the initial implementation, and I worked with the author to add it to the tools.

I also worked on implementing the sub modules in the architecture, deciding how input data and results are to be stored, and designing the visualization.

Apart from this, working with OARnet to make changes as per the requirements and documenting the work were also a part of my thesis work.

6.3 Future Work

6.3.1 Three Level error system

The current implementation has only a two level error system: a node with an error or a node without any error. A future enhancement could add an extra level to the error system. This extra level would cover those nodes that recently had errors but are fixed now. These nodes would be represented differently, and if no more errors occur after a certain period of time, such a node would be classified as a no-error node.
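The proposed three level classification can be sketched as a small function. This is an illustration of the idea only, since the feature is future work; the state names, the function name classify_node and the recovery_window parameter are assumptions.

```python
def classify_node(now, last_error_time, has_error_now, recovery_window):
    """Return 'error', 'recovering' or 'ok' for a node.

    now and last_error_time are timestamps in seconds; last_error_time is
    None if the node has never reported an error. recovery_window is how
    long a node stays in the intermediate level after its last error.
    """
    if has_error_now:
        return "error"
    if last_error_time is not None and now - last_error_time < recovery_window:
        return "recovering"  # recently had errors but is fixed now
    return "ok"
```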

6.3.2 Proactive monitoring

Presently, errors are detected after they occur. Proactive monitoring will involve a prediction algorithm to determine whether an error can occur at a future time. This would involve collecting various data when an error occurs in any node and using them to predict future errors.

6.3.3 End-to-End measurement

The end systems in the OARnet backbone are the NetVX systems. Monitoring these devices will act as an acceptance test for the rest of the network. For instance, the tool might not detect any error within the network while the end devices are not delivering content. In that case the tool itself has an error, which can then be corrected.

References

1. Network Monitoring Tools http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html, May 2011
2. Andrew Tanenbaum. 2002. Computer Networks (4th ed.). Prentice Hall Professional Technical Reference.
3. William Stallings, “SNMP and SNMPv2: The Infrastructure for Network Management,” IEEE Communications Magazine, Vol. 36, No. 3, March 1998, pp. 37-43
4. Real-Time Performance Monitoring on Juniper Network Devices http://www.juniper.net/us/en/local/pdf/app-notes/3500145-en.pdf
5. Understanding Measurement Points, Key Performance Indicators, and Baseline Values http://www.juniper.net/techpubs/en_US/junos10.2/topics/concept/measurement-points-kpi-and-baseline-settings-junos-nm.html
6. WEBNMS http://www.webnms.com/webnms/index.html
7. Statscout statscout.oar.net
8. OARnet www.oar.net
9. SNMP Informant http://www.snmp-informant.com/
10. Denika http://www.plixer.com/products/denika.php
11. E. Ertin, A. Arora, R. Ramnath and M. Nesterenko. Kansei: A testbed for sensing at scale. Proceedings of the 5th Symposium on Information Processing in Sensor Networks (IPSN/SPOTS track), pp. 399-406, 2006.
12. JSON-C library http://oss.metaparadigm.com/json-c/
13. BSNMP library http://people.freebsd.org/~harti/bsnmp/
14. BSNMPTools http://wiki.freebsd.org/BsnmpTools
15. LibMicroHTTPD http://www.gnu.org/software/libmicrohttpd/
16. SNMP Object Navigator http://tools.cisco.com/Support/SNMP/do/BrowseOID.do?objectInput=ifInErrors&translate=Translate
17. SNMP Object Navigator http://tools.cisco.com/Support/SNMP/do/BrowseOID.do?objectInput=ifOutErrors&translate=Translate
18. SNMP Object Navigator http://tools.cisco.com/Support/SNMP/do/BrowseOID.do?objectInput=ifSpeed&translate=Translate
19. SNMP Object Navigator http://tools.cisco.com/Support/SNMP/do/BrowseOID.do?objectInput=ifHCInOctets&translate=Translate
20. SNMP Object Navigator http://tools.cisco.com/Support/SNMP/do/BrowseOID.do?objectInput=ifHCOutOctets&translate=Translate