Network Performance Monitoring

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Shriram Raghavendra Ramamurthy

Graduate Program in Science and Engineering

The Ohio State University

2012

Dissertation Committee:

Dr. Rajiv Ramnath, Advisor

Mark Fullmer, Advisor

Dr. Prasad Calyam, Advisor

Dr. Jay Ramanathan

Copyright by

Shriram Raghavendra Ramamurthy

2012

Abstract

Network management is the use of tools that pertain to the operation, administration and provisioning of networked systems. Operation is the continuous monitoring of the network for problems and, once they are identified, the proper fixing of those problems before they become widespread. Operation thus refers to keeping the network “up and running” smoothly. Maintaining the correct operation of every network component is of great importance in order to avoid domino effects that ultimately result in total network failure.

Administration refers to observing all the resources of the network to see how they are utilized. The main goal of administration is to ensure appropriate usage of the network resources. Finally, provisioning means making the services of the network available to its users. When a user needs a particular service, there must be an easy way to facilitate the request, and that service, once deployed, must also be monitored to maintain the sanity of the network.

This thesis concentrates on providing various means to monitor a network, and various tools were created to enable network administrators to keep nodes from going down. The chapters range from monitoring wired nodes on a local area network to wireless, mobile-device-based monitoring. Experiments were also run on what is called the “Academic Cloud” to analyze the capability of perfSONAR, which helps one to monitor the problems associated with wide area networks.


Dedication

This document is dedicated to my parents and sister


Acknowledgments

Firstly, I would like to express my sincere gratitude to my advisors Dr. Rajiv Ramnath, Mark Fullmer of OARnet and Dr. Prasad Calyam of the Ohio Supercomputer Center for their continuous support of my Masters study and research. They have been a great motivation for me throughout, and their guidance has helped me complete my Masters. Their constant encouragement and thorough knowledge of the subject have made me learn and explore more about my research.

My sincere thanks also go to my thesis committee member Dr. Jay Ramanathan, for all the knowledge she imparted to my team and me during our CSE 793 course.

I would also like to thank Matthew Jaffee of Indiana University for helping me out during my LAMP testing, and my Mathematics teacher Mr. N. Srinivasan, for teaching me the basic application of education in life.

I would also like to sincerely thank the administration of OARnet for letting me learn, implement and experiment with many of the latest technologies.

Finally, I would like to extend my gratitude to my family and friends for all their help and motivation, without whom this Masters would not have been possible.


Vita

2006 - 2010 ...... B.E. Computer Science & Engineering,

Coimbatore Institute of Technology,

[Under Anna University Chennai],

Coimbatore, Tamil Nadu, India.

2010 - 2012 ...... Masters Student,

Department of Computer Science and

Engineering,

The Ohio State University.

Fields of Study

Major Field: Computer Science and Engineering


Table of Contents

Abstract ...... iii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vi

Fields of Study ...... vi

Table of Contents ...... vii

List of figures ...... x

1 Wired Networks ...... 1

1.1 Introduction ...... 1

1.2 SysUpTime Monitor ...... 1

1.2.1 SNMP ...... 1

1.2.2 sysUpTime ...... 3

1.2.3 Implementation ...... 3

1.3 Juniper Alarm Monitor ...... 5

1.3.1 jnxAlarmRelayMode ...... 5

1.3.2 jnxYellowAlarm ...... 6


1.3.3 jnxRedAlarm ...... 6

1.3.4 Implementation ...... 7

1.4 Google Maps Integration ...... 7

2 Wireless Network ...... 11

2.1 Introduction ...... 11

2.2 NanoBSD based throughput analyzer ...... 11

2.2.1 Introduction to NanoBSD ...... 11

2.2.2 Implementation ...... 12

2.3 iPhone based throughput analyzer ...... 13

2.3.1 Web page response analyzer ...... 14

2.3.2 Download throughput analyzer ...... 15

3 Wide Area Networks ...... 18

3.1 Introduction ...... 18

3.2 Global Environment for Networking Innovations ...... 18

3.3 perfSONAR ...... 19

3.3.1 perfSONAR architecture ...... 19

3.3.2 perfSONAR Problem Example ...... 20

3.3.3 perfSONAR-ized Measurement Tools ...... 23

3.3.4 perfSONAR resource protection ...... 26


3.3.5 Investigating LAMP/perfSONAR to explore “infrastructure measurement slices” in GENI ...... 28

4 Conclusion and Future Work ...... 42

Bibliography ...... 44


List of figures

Figure 1 Juniper alarm MIB ...... 2

Figure 2 sysUpTime OID view ...... 3

Figure 3 Google map view ...... 9

Figure 4 Web page response time graph ...... 15

Figure 5 Throughput analyzer graph ...... 17

Figure 6 The user and resource location ...... 22

Figure 7 Traceroute points ...... 22

Figure 8 Locating the pain points ...... 23

Figure 9 Storing the measurement data ...... 23

Figure 10 Problem areas identified based on the data collected ...... 23

Figure 11 perfSONAR resource protection ...... 28


1 Wired Networks

1.1 Introduction

This chapter contains the explanation and description of the tools that were deployed to manage and monitor the nodes on the OARnet backbone. Two scripts cover this section: the sysUpTime Monitor and the Juniper Alarm Monitor.

These tools, though operating on the back end, are aimed at providing information to the support team, which closely monitors whether the nodes on the OARnet backbone are up or down and whether they are performing up to the required threshold as per the SLA. Google Maps was also used to project the output of these scripts in a geographically mapped way.

1.2 SysUpTime Monitor

1.2.1 SNMP

SNMP stands for Simple Network Management Protocol. Many devices, such as routers and switches, support SNMP. SNMP is used for monitoring these devices and was created by the Internet Engineering Task Force (IETF). Services like the Dynamic Host Configuration Protocol (DHCP) can also be managed using SNMP. The architecture of SNMP consists of the managed device, the Network Management System (NMS) and the agent.

The managed device is the node that is managed. It is an end point that implements the SNMP interface so that it can be monitored by SNMP services. The Network Management System is the application that controls and monitors the managed device. Finally, the agent is the software that resides on the managed device and provides information from it to the NMS.

The other main entities that contribute to the SNMP management and monitoring process are the MIB and the OID. A MIB, or Management Information Base, is a collection of managed objects in the communication network. In a MIB, the information is organized hierarchically in a tree-like structure. An identifier called the Object Identifier, or OID, uniquely identifies each entry of the MIB.

Figure 1 Juniper alarm MIB

1.2.2 sysUpTime

The SNMP MIB has an OID called sysUpTime.

Figure 2 sysUpTime OID view

The virtue of sysUpTime is that it gives the time since the system was last reinitialized. The time is represented in ticks, where a tick is one hundredth of a second.
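For example, a reported sysUpTime of 8,640,000 ticks corresponds to 8,640,000 / 100 = 86,400 seconds, i.e., exactly one day since the last reinitialization.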

1.2.3 Implementation

System up-time detection aims at reporting whether a particular host is down or not. The report is displayed via Google Maps, classified by danger level, and also sent via e-mail for record keeping. The plug-in was coded in Python using the net-snmp libraries. sysUpTime gives the up time of a particular host in time ticks. We have to provide the community string (password) along with the host name to the snmpget library call to query the up time of a particular host.

The condition to raise an alarm is: a host is considered to be down when its current up time is less than the up time obtained by the previous poll of the same host. But there are some factors which need to be taken into consideration before confirming that a host is down. Whenever the host is rebooted, it will have a low up time, but this does not necessarily imply that the host has failed. The second scenario is counter overflow. sysUpTime is an unsigned 32-bit counter, so the maximum number of ticks it can hold is 4294967295. After this value there is an integer overflow, which results in the time ticks being smaller. This can be misinterpreted as downtime for the host and be reported as such via Google Maps / e-mail; in order to prevent this, the following arithmetic logic was used:

Arithmetic: if previous_ticks + difference(previous, current) > maximum_integer_value, then current_ticks = difference(previous, current) - (maximum_integer_value - previous_ticks)

If the above arithmetic succeeds, we classify the drop as an integer overflow and no report is filed.
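A minimal sketch of this check in Python; the two-polling-interval cut-off used to recognize a wrap is an illustrative assumption, not necessarily the exact rule in the production script:

MAX_TICKS = 4294967295              # sysUpTime is an unsigned 32-bit counter
POLL_INTERVAL_TICKS = 5 * 60 * 100  # roughly 5 minutes between polls, in ticks

def classify(previous, current):
    # Decide whether a drop in sysUpTime means a counter wrap or a real restart.
    if current >= previous:
        return "up"                                   # uptime still increasing
    elapsed_if_wrapped = (MAX_TICKS - previous) + current
    if elapsed_if_wrapped <= 2 * POLL_INTERVAL_TICKS:
        return "overflow"                             # consistent with a 32-bit wrap; no report
    return "down"                                     # host was reinitialized; raise an alarm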

There are various other scenarios which are also handled. For example, certain hosts may sometimes not reply to the current poll. This is noted, and if the total number of times that the host did not reply is above 300 (this number is based on the interval between successive polls, which is around 5 minutes, so 300 missed polls signifies no response from that host for over a day), then it is reported as an error via e-mail, and the host name and time of no response are recorded in noresponse.txt. A similar mechanism is also used when the hosts cannot be connected to at all, which may be due to a password error or an error in the host name.

The threading feature of Python is used to make the code efficient.

1.3 Juniper Alarm Monitor

The Juniper alarm monitor aims at relaying the alarms arising from the Juniper box’s chassis. The chassis is the sheet metal that houses the complete hardware of the Juniper box. The alarm relaying has been made possible via the usage of an enterprise MIB released by Juniper Networks. The exact version used can be found at http://www.juniper.net/techpubs/en_US/junos11.1/topics/reference/mibs/mib-jnx-chassis-alarm.txt

There are three objects which make up the Juniper alarm MIB: jnxAlarmRelayMode, jnxRedAlarm and jnxYellowAlarm.

1.3.1 jnxAlarmRelayMode

This mode denotes the alarm relay mode of the craft interface panel for the yellow and red alarms.

• 1–Other: Other or unknown state.

• 2–passOn: Alarms are passed on. The alarm relay is activated to pass on the yellow or red alarms to audible sirens or visual flashing devices.

• 3–cutOff: Alarms are turned off. Both the yellow and red alarms are cut off from the alarm relays and do not get passed on to audible sirens or visual flashing devices.

1.3.2 jnxYellowAlarm

jnxYellowAlarm has three sub-objects: jnxYellowAlarmState, jnxYellowAlarmCount and jnxYellowAlarmLastChange. The yellow alarm is on when there are system warnings such as a maintenance alert or a significant temperature increase.

• jnxYellowAlarmState denotes the yellow alarm state on the craft panel and the router chassis.

• jnxYellowAlarmCount shows the number of currently active yellow alarms.

• jnxYellowAlarmLastChange shows the value of sysUpTime when the state of the yellow alarm last changed from on to off or vice versa. This object returns 0 if the alarm state has not changed since sysUpTime was last reset, or if the value is unknown.

1.3.3 jnxRedAlarm

jnxRedAlarm indicates that there is a system failure, a power supply failure, a malfunction, or a condition where a threshold has been breached.

The sub-objects of jnxRedAlarm are:

• jnxRedAlarmState denotes the red alarm state on the craft panel and the router chassis.

• jnxRedAlarmCount shows the number of currently active red alarms.

• jnxRedAlarmLastChange shows the value of sysUpTime when the state of the red alarm last changed from on to off or vice versa. This object returns 0 if the alarm state has not changed since sysUpTime was last reset, or if the value is unknown.

1.3.4 Implementation

The Juniper alarm monitor basically reports whether the alarms of the Juniper boxes are on or off and, if on, whether it is the red alarm or the yellow alarm. The basic requirement which drove the writing of this plug-in is that once an alarm of a particular host is up, there should be an e-mail and Google Maps alert based upon whether it is a red or a yellow alarm.

The host name and community string are obtained from the OARnet JSON API and filtered to select the devices running JUNOS. This data is the input to the command generator, which performs the get-next operation in the PySNMP library. The command generator operates on port 161, over which the UDP connection is established. The output of the command generator is 3 (ON), 2 (OFF) or 1 (UNKNOWN). This output is then mapped onto Google Maps and also sent as an e-mail. The format of the command generator call is:

CommunityData('agent-name', 'community_string')
UdpTransportTarget((hostname, 161))
OID with a return value of 3 (ON), 2 (OFF) or 1 (UNKNOWN)

The library used for this tool is PySNMP, whose command generator is used to perform the SNMP get-next operation.
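A minimal sketch of such a query with the PySNMP one-liner command generator; the host name and community string are placeholders, and the OID prefix shown is illustrative (the exact jnxRedAlarmState and jnxYellowAlarmState OIDs come from the Juniper chassis alarm MIB referenced above):

from pysnmp.entity.rfc3413.oneliner import cmdgen

# Illustrative OID prefix only; take the exact alarm-state OIDs from the MIB file.
JNX_ALARMS_OID = (1, 3, 6, 1, 4, 1, 2636, 3, 4)

generator = cmdgen.CommandGenerator()
errorIndication, errorStatus, errorIndex, varBindTable = generator.nextCmd(
    cmdgen.CommunityData('agent-name', 'community_string'),
    cmdgen.UdpTransportTarget(('hostname', 161)),
    JNX_ALARMS_OID)

if errorIndication:
    print(errorIndication)                # connection or timeout problems
else:
    for varBinds in varBindTable:
        for name, value in varBinds:
            # value is 3 (ON), 2 (OFF) or 1 (UNKNOWN); map it to a marker colour / e-mail
            print('%s = %s' % (name, value))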

1.4 Google Maps Integration

Both of the above scripts are integrated with Google Maps for plotting which nodes are up or down. The main advantage of this, apart from the e-mail feature which informs the support team, is that the maps give a more visual representation of the failure or success data, so the data can be interpreted easily. One of the main extensions is that we can plot the node failure or success data for all nodes of a particular location or county.

The engineering web server of OARnet was used to integrate the Google Maps JavaScript API. Apache Tomcat runs the web server. The implementation runs as an HTML file backed by a Python-coded CGI script. Both the Juniper alarm monitor and the sysUpTime monitor run as cron jobs on the FreeBSD system. The outputs of these scripts are stored in the cgi-bin of the web server in the format host-name:alarm-color. The alarm color can be red, green or yellow. The file is read by a Python script, which converts it to JSON format. There is another file in the cgi-bin which stores the node names against the addresses where these nodes are located, in the format node-name:address. Finally it all comes to the HTML file in htdocs, which on load initially displays a map of the world centered on Columbus, Ohio. The address of each node is mapped to a geo coordinate and plotted exactly onto the map loaded on the HTML page.
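A minimal sketch of the status-file-to-JSON conversion mentioned above, assuming a cgi-bin file of host-name:alarm-color lines (the file name is illustrative, not the exact path used on the OARnet server):

#!/usr/bin/env python
# Sketch: convert "host-name:alarm-color" lines into JSON for the map page.
import json

status = {}
for line in open('alarm-status.txt'):
    line = line.strip()
    if line:
        host, color = line.split(':', 1)
        status[host] = color              # e.g. "clmbk-r1" -> "green"

print('Content-Type: application/json')
print('')
print(json.dumps(status))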

The first AJAX call converts the addresses in the JSON file into map geo coordinates. This is done using the geocoding service of the Google Maps API. Geocoding means taking an address (or a description of a location) and converting it into geographic coordinates, or vice versa. Here we provide the address of each node and, with the help of Google geocoding API calls, convert it into geo coordinates.

The next AJAX call then takes the result of the first call, places it on the map, and drops the green / red / yellow marker at that location, based on the JSON which stores the node name against the current alarm. The basic map declaration is:

var myOptions = {
    zoom: 6,
    center: new google.maps.LatLng(39.96, -83.00),  // approximate coordinates of Columbus, Ohio, as described above
    mapTypeId: google.maps.MapTypeId.ROADMAP
};
var map = new google.maps.Map(document.getElementById("map_canvas"), myOptions);

A Google maps screenshot:

Figure 3 Google map view


2 Wireless Network

2.1 Introduction

This chapter is focused on creating tools to manage enterprise-level wireless connections. In a huge enterprise spanning a large area with a unified wireless network, the strength of the connection at one place can be totally different from the strength of the connection at another place. Users cannot be expected to monitor and report the problem whenever it happens, and this can sometimes end up in a loss of productivity. To solve this problem, a notification-based monitoring feature was created.

One tool is based on a NanoBSD box (manufactured by Netgate), and taking a cue from the implementation of this tool gave birth to the iPhone-based throughput analyzer, written natively in Objective-C.

2.2 NanoBSD based throughput analyzer

2.2.1 Introduction to NanoBSD

NanoBSD creates a FreeBSD system image for embedded applications, suitable for use on a compact flash card (or other mass storage medium). This image can be part of a custom-made computer appliance, such as an embedded box which has been configured and is looking for an operating system to boot up and perform operations. With FreeBSD being a favorite choice of operating system for applications satisfying networking demands, we can use NanoBSD in embedded boxes aimed at that very same purpose. But since embedded computer applications generally have little memory, and the embedded system's storage is based on a flash device, it cannot afford to hold the whole FreeBSD image. Thus NanoBSD, a flash-disk-bootable FreeBSD that is still capable of supplying the full features of the FreeBSD image, came into existence.

Features of NanoBSD:

• All ports functionality of FreeBSD

• Full functionality, unless explicitly removed.

• Read only at run time.

• Customization is easy.

2.2.2 Implementation

This tool was implemented on a Netgate wireless box which had NanoBSD installed. As a first step, the box was made capable of connecting to the wireless networks; the test connections were oarnet-staff and oarnet-visitor. The connection to a wireless service is established by configuring the box with the help of rc.conf and wpa_supplicant.conf.

The idea for measuring the throughput is to connect to the FTP site of OARnet, ftp://ftp.eng.oar.net, and download a 100-megabyte file. An alarm timer runs in parallel with the download, so there are two threads: a downloading thread and an alarm thread. The plug-in operates as follows: if the alarm expires first, we can conclude the throughput is surely less than file-size/alarm-time kbps; if the download completes first, we can state the throughput as file-size/(end-time - start-time) kbps. Based on the throughput obtained and the threshold value provided, we can raise corresponding e-mail alerts. This was achieved on the Netgate box. As a next step, the Chestnut box, a similar NanoBSD box at a remote location, was taken; the code was pushed to its flash memory and executed to check the throughput of the remote location.
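A minimal sketch of the two-thread idea in Python (the thesis does not give the tool's source, so the URL, alarm limit and reporting below are illustrative):

# Sketch: download a test file in a worker thread and bound the wait with an alarm.
import threading, time, urllib

TEST_URL   = 'ftp://ftp.eng.oar.net/testfile'    # illustrative path on the OARnet FTP site
ALARM_SECS = 120                                 # illustrative alarm limit
result = {}

def download():
    start = time.time()
    data = urllib.urlopen(TEST_URL).read()       # fetch the whole test file
    result['secs']  = time.time() - start
    result['bytes'] = len(data)

worker = threading.Thread(target=download)
worker.setDaemon(True)
worker.start()
worker.join(ALARM_SECS)                          # plays the role of the alarm thread

if worker.isAlive():
    # Alarm expired first: throughput is surely below file-size / alarm-time.
    print('download exceeded the alarm limit; raising an e-mail alert')
else:
    kbps = result['bytes'] * 8 / 1000.0 / result['secs']
    print('throughput: %.1f kbps' % kbps)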

The main advantage is that when a wireless network goes down we get to know the exact statistics associated with it, and the code is compatible with any network.

The problem with this approach is that, even though the Netgate box is portable, it needs to be connected to a Mac/PC to execute the code, and it is also command-line centric, making it a difficult process for non-technical users. Having separate boxes at various locations in an enterprise does not seem to be a useful option. Taking these disadvantages into consideration, an iPhone-based tool was developed.

2.3 iPhone based throughput analyzer

Taking cues from the NanoBSD-based throughput analyzer above, the idea was formed to make a mobile-based tool, and an application was developed for the iPhone. The main advantages are that code changes are easier to implement and push to the device: one developer can update the application at any time, or revamp it totally, and all users just need to download the update shown by the Apple App Store. The lower diversity of Apple devices also makes it easier to adapt the code to either the tablet or the mobile format. With all these advantages in mind, the application was developed natively on the iPhone. Two features were developed. One is a web page response analyzer, which monitors the response time taken by a web page to load via a wireless connection over a period of time. The other feature is the download throughput analyzer.

2.3.1 Web page response analyzer

This feature analyzes the response time of a web page loaded at regular intervals over a period of time. Once the data is collected, the Google Visualization API is used to plot a line chart onto the UIWebView of the iPhone. The steps taken in this approach were as follows.

First, the sanity of the web page under consideration is checked to ensure that the web page name entered by the user is correct; otherwise a notification is sent to the user. There is also a default test, which is run if the user intends to test the same web page repeatedly. Then a brain class takes care of the web page loading using NSURLConnection. It makes a synchronous call and waits for the web page to be loaded so that the actual response time can be calculated. The data finally obtained is pushed to an XML file stored locally on the phone. This XML file is then parsed using JavaScript and the parsed data is fed to the Google Visualization chart API.

The Google Visualization API provides a method to visualize the data on a web page. Line charts are used in this case, and the XML parsing is done using the jQuery AJAX methodology.
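For illustration only, the same timing idea expressed in Python rather than the Objective-C / NSURLConnection code used in the app (the URL, interval and sample count are placeholders):

# Illustration: time a full page fetch at regular intervals, as the app does natively.
import time, urllib2

URL, INTERVAL_SECS, SAMPLES = 'http://www.example.org/', 60, 5
samples = []
for _ in range(SAMPLES):
    start = time.time()
    urllib2.urlopen(URL).read()              # wait for the whole page, like the synchronous call
    samples.append((start, time.time() - start))
    time.sleep(INTERVAL_SECS)

for when, secs in samples:
    print('%s  %.3f s' % (time.ctime(when), secs))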

Figure 4 Web page response time graph

2.3.2 Download throughput analyzer

This feature is a mobile version of the FTP download feature provided by the NanoBSD box. Inspired by the social networking site Foursquare, it takes the geographical coordinates of the user's location at a particular spot in an enterprise, observes the throughput of the wireless connection at that spot, and records it. The storyboard of the app is as follows.

The user needs to create a new session at the beginning, or whenever he/she needs to use a new download file as a testing input. Establishing a new session means that the user must stick to one download file throughout the session, as only that provides a meaningful way to compare the throughput over various locations.

Once the session is set, the user needs to check in to the location. Checking in means the user will be geo-fenced to an area. If the user is standing at a particular location and that location has not previously been checked in to, the user will be prompted to enter a nickname for that location. That entry will be stored in the phone's file system. Then the FTP download test will be performed and the results of the test will be pushed to an XML file.

There is also an option for the user to simply load the app and view the graph being plotted without having to check in or perform the download. This plots the graph based on the available data.

NSInputStream and NSOutputStream handle the download process. The NSInputStream is read-only and is responsible for reading the data from the server as chunks of bytes, while the NSOutputStream pushes the data to the phone's file system. The bytes that are read are handled in the scheduled run loop, which acts as a callback mechanism.

Location detection is taken care of by CLLocationManager and CLLocation (which is supplied by the location manager). The manager is configured to provide an accuracy of 100 m for the calculation of the geo coordinates. This is because, as this is an enterprise-based application, the wireless signal strength may not vary greatly over a distance of 100 m, and it also results in less power consumption on the mobile device.

The download observations at various locations at OARnet:

Figure 5 Throughput analyzer graph


3 Wide Area Networks

3.1 Introduction

The monitoring concepts dealt with in this section are aimed at the performance analysis of systems scattered over a wide area. A Wide Area Network (WAN) is a geographically dispersed telecommunications network. The analysis was done using the open source tool perfSONAR on the GENI framework.

3.2 Global Environment for Networking Innovations

GENI is a network research infrastructure suite sponsored by the National Science Foundation (NSF). The core concepts of the GENI infrastructure are as follows:

• Programmability – researchers are allowed to modify the behavior of a node by installing new software to meet their needs.

• Virtualization and other forms of resource sharing – whenever feasible, nodes implement virtual machines, which allow multiple researchers to simultaneously share the infrastructure; each experiment runs within its own isolated slice created end-to-end across the experiment's GENI resources.

• Federation – the various parts of the GENI infrastructure are owned by various organizations.

• Slice-based experimentation – researchers are able to leverage the concept of slice-based experimentation. There is a timed aspect and an encompassing aspect, which allow the resources acquired by the researcher to be kept together under one umbrella and the usage timings to be controlled, enriching the proper management of resources.

A slice is a container which holds the resources and connects the researcher with those resources. For example, we can picture it as a container holding a topology of three powered nodes connected to each other with a bandwidth capacity of 100 Mbps. The researcher is able to access the resources he requested for a period of time. The resources within the slice are connected as a coherent virtual network. For security and fair performance, a slice is isolated from other researchers' slices, which enables the experiments to run smoothly without any interference.

3.3 perfSONAR

perfSONAR is a multi-domain performance monitoring framework, which defines a set of protocol standards for sharing data between measurement and monitoring systems. It is basically a tool for network performance monitoring.

3.3.1 perfSONAR architecture

The perfSONAR architecture is composed of three classes of services: data services, analysis/visualization and infrastructure.

The data services are made up of:

1. Measurement points – These are the lowest layer of the monitoring infrastructure. They directly interact with the tool which performs the monitoring. MPs communicate with a measurement archive to store measurement data.

2. Measurement archive – The results of the network tests are stored here. Using the MA, one can issue queries to access the data. This data helps in analyzing the condition of the network.

3. Transformations – These perform correlation and aggregation. They are used to combine various statistics, such as network utilization, bandwidth and latency, and do a combined analysis.

Then there is analysis and visualization, which allows the user and network administrator to view the data and analysis via GUIs, web page dashboards or network operating center alarms, such as alerts.

Finally there is the infrastructure, composed of information services. The information services are facilitated by the lookup service, which basically answers queries over the data distributed across the MAs at the various MPs. This allows fresh data to propagate in a dynamic network setting. Then there is the topology service, which is a universal topology storage service. Here we can make changes to the tests or the node configuration. The test specifics can be modified at any time and pushed to the topology service, which maintains a persistent state of the topology.

3.3.2 perfSONAR Problem Example

The best way to explain what perfSONAR does is to consider an example: one needs to download a file from a remote location, and a download failure occurs or the download is slow; the problem can be located using perfSONAR. Usually, under such failure circumstances, the following steps are followed to debug the problem.

Application level – The first check is made at the application level; the application developers usually refute the problem, stating that it pertains to networking.

Protocol level – The next step is to look at the protocol deployed and the problems pertaining to it, but this is usually tied up with the application.

Host level – The next check is whether the host machine is experiencing problems that are causing the download problems. But the end user is not necessarily a hardware person who can find the root cause, and requiring debugging at the hardware level reflects poorly on the quality of the product.

LAN networks – When the local area network managers are contacted, they usually tell the end user that the problem pertains to the backbone network or WAN.

Backbone networks – They say that the network is functioning properly at their point.

So now the end user does not know where the problem is actually originating. This leads to total confusion, and the end user is at a disadvantage. To prevent such problems, perfSONAR provides a solution: separating an entire network geography into various measurement points and providing dashboards for the user to find where the problem has occurred.

For example, a request to a particular node where a downloadable file is hosted passes through various points of travel for the packets. The following are the travel points for a packet travelling from OARnet at Kinnear Road to a node in downtown Columbus.

bash-3.2$ traceroute 131.187.67.108
traceroute to 131.187.67.108 (131.187.67.108), 64 hops max, 52 byte packets
 1  clmbk-fw1-ge-0-0-8s3094.ls.oar.net (131.187.117.1)  0.855 ms  0.504 ms  0.452 ms
 2  clmbk-r1-ge-0-0-1s3300.cpe.oar.net (131.187.80.57)  0.958 ms  1.144 ms  0.907 ms
 3  chest-e40-vl3045.ls.oar.net (131.187.80.22)  4.027 ms  1.965 ms  4.947 ms
 4  chest-dhcp-131-187-67-108.ls.oar.net (131.187.67.108)  1.559 ms  1.645 ms  1.460 ms

Here we can see that the packets travel through various points before they reach the destination. If each of these points had a database storing the results of tests run between the points using the perfSONAR-ized tools, then it would be easy for users, network managers and researchers to monitor problems over a period of time. This data can be presented via user GUIs, web pages or network operating center alarms to indicate problems over the period and effectively reduce the steps needed for a user to identify where the problem originates.

The series of diagrams below shows the problem's pattern and how it can easily be mapped to a certain path which is under stress.

Figure 6 The user and resource location

Figure 7 Traceroute points


Figure 8 Locating the pain points

Figure 9 Storing the measurement data

Figure 10 Problem areas identified based on the data collected

3.3.3 perfSONAR-ized Measurement Tools

There are various perfSONAR measurement tools used in the investigation. They are:

1. BWCTL

2. OWAMP

3. PINGER

4. perfSONAR BUOY

3.3.3.1 BWCTL

BWCTL stands for Bandwidth Test Controller. It is a wrapper around throughput testing tools such as Thrulay, Iperf and Nuttcp. The main intent is to measure TCP bandwidth, or to run UDP tests, tuning them according to the user's requirements. There is a daemon process called bwctld that runs on the server end. The client application sends a request to the daemon process, and the daemon process initiates the tests between the sender and the receiver. BWCTL is a three-party application. Tests can also be arranged between two servers on different systems; if the local system is a part of the tests, acting as an end point, then bwctl will look for a local bwctld running and target the tests based on that.

The main advantage of bwctl is the ability to run non-specific tests between two systems, so we do not have to specify account details and can just run the tests. This enables us to measure the bandwidth, which may give us throughput-specific information on whether the network capability will match our needs or not.

There are certain base requirements which need to be met prior to using BWCTL to measure the network capability. The firewall must be properly configured to allow client interaction. The tools which bwctl uses to run the throughput tests need to be present beforehand. The systems that are part of the BWCTL interaction must be synced with NTP. NTP is basically a universal clock and a networking protocol used to synchronize time over a packet-switched network.
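For illustration, a hedged sketch of driving such a test from Python, assuming the bwctl client is installed and a bwctld is reachable for both end points (host names are placeholders; -s names the sender, -c the receiver and -t the test length in seconds):

# Sketch: run a BWCTL throughput test between two hosts and print its report.
import subprocess

proc = subprocess.Popen(['bwctl', '-s', 'senderhost', '-c', 'receiverhost', '-t', '10'],
                        stdout=subprocess.PIPE)
output, _ = proc.communicate()
print(output)                                # iperf-style throughput report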

3.3.3.2 OWAMP

OWAMP stands for the One-Way Active Measurement Protocol. It is a command-line application used to determine one-way latencies between hosts. The main benefit of OWAMP is that one-way measurement enables one to determine the direction in which congestion is actually taking place; thus, in situations where data transfer in a network is taking place in both directions, we can easily spot the problematic direction.

OWAMP follows a client-server model in which the server process is the always-running daemon owampd. Incoming connections are accepted and then forked off to handle their own operation, keeping the server open for other incoming connections. The heuristics which can be obtained using owamp as a tool are singleton characteristics such as loss, delay and jitter, and also non-singleton characteristics like expected arrival time and the interval gap between packets. Even though it is much easier to ping between the nodes, OWAMP gives more accurate results.

owping is the client program which connects to the server, and owampd is the restricting service on the server end that accepts or rejects the incoming client connections based on the restrictions applied by the administrator.

3.3.3.3 perfSONAR BUOY

The perfSONAR-BUOY Measurement Archive is a framework that handles test scheduling and also provides a test data storage framework. The archived measurement data is stored in a MySQL database, and the data can be accessed via web interfaces. Tools such as OWAMP and BWCTL are used by the BUOY service. In the BUOY service we need to specify the end points that are a part of the test. The owmesh configuration file takes care of the scheduling. The owmesh.conf is structured into group and node tags. The node tags describe a particular node along with its IP and description. The group tag specifies the nodes that form a particular group. Then there is the test spec tag, where one specifies the test characteristics, such as which group is a part of the test and what the test conditions are. Finally, the measurement set tag brings all the above tags together.

3.3.4 perfSONAR resource protection

The basic aim of perfSONAR resource protection is to enable multiple tests to be run over a series of nodes making up the topology. As we know, the base tests which are part of the perfSONAR framework are OWAMP, BWCTL and pinger. Of these tests, BWCTL and OWAMP are interdependent on each other, whereas pinger does not depend on any resources used by either BWCTL or OWAMP. So when a BWCTL and pinger test, or an OWAMP and pinger test, are assigned to the perfSONAR BUOY service, there will usually be a serial execution of these tests, even though the two tests do not depend on the same resources. And when many users request tests like OWAMP and BWCTL to be run on the same node pair at the same time, the results may not be accurate, as the users share the same channel and the bandwidth usage will affect the results. So in order to prevent inaccuracy of results in such large-scale research networks, we use perfSONAR resource protection.

The base concept of resource protection is to differentiate between users based on their priority. There are three classes of users: the normal user, the power user and the administrator. These users can request tests on the same set of nodes, and in order to schedule the tests so that there is no inaccuracy in the results, we use the meta-scheduler.

The meta-scheduler, based on the tools and the nature of the user, allots the tests at interleaved time intervals. So if there are users requesting owamp and pinger tests on the same pair of nodes over the same time period, the meta-scheduler can schedule the tests at the same time, as owamp and pinger do not depend on a common resource. But when tests that need to be run at the same time use resource-interdependent tools, the tests are allotted based on the priority level of the users.

The base architecture of this implementation is as follows. The user provides an input to the meta-scheduler. An automation script constantly monitors this input; whenever there are any changes to the input configuration file on which the meta-scheduler operates, the script stops the currently running perfSONAR services and then restarts the process flow.
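A sketch of the interleaving decision described above; the resource sets, priorities and names are illustrative assumptions, not the thesis implementation:

# Sketch: tools that share no resources may run together; otherwise serialize by user priority.
RESOURCES = {'bwctl': {'bandwidth'}, 'owamp': {'bandwidth'}, 'pinger': set()}
PRIORITY  = {'administrator': 0, 'power': 1, 'normal': 2}

def schedule(requests):
    # requests: list of (user_class, tool) pairs targeting the same node pair.
    slots = []                                  # each slot is a set of tests run together
    for user, tool in sorted(requests, key=lambda r: PRIORITY[r[0]]):
        for slot in slots:
            # place the test in an existing slot only if no resources overlap
            if all(RESOURCES[tool].isdisjoint(RESOURCES[t]) for _, t in slot):
                slot.append((user, tool))
                break
        else:
            slots.append([(user, tool)])        # otherwise open a new time slot
    return slots

# Example: owamp and pinger share a slot; the two bwctl requests are serialized by priority.
print(schedule([('normal', 'bwctl'), ('power', 'bwctl'), ('normal', 'pinger'), ('administrator', 'owamp')]))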


Figure 11 perfSONAR resource protection

3.3.5 Investigating LAMP/perfSONAR to explore “infrastructure measurement slices” in GENI

LAMP stands for Leveraging and Abstracting Measurements using perfSONAR. Here the entire implementation of LAMP was tested, and recommendations were then made to the LAMP team based on the experiments that were conducted.

3.3.5.1 Purpose

This section describes the findings from our experiments to deploy LAMP/perfSONAR (http://groups.geni.net/geni/wiki/LAMP) on the GENI infrastructure and to explore the instrumentation and measurement related capabilities for setting up “infrastructure measurement slices”.

Although a LAMP tutorial exists at http://groups.geni.net/geni/wiki/LAMP/Tutorial, we initially faced difficulties in reproducing the tutorial steps. In the course of our experiments between October 2011 and December 2011, we worked closely with the LAMP project team to get our issues resolved, and also provided feedback to them to update their documentation and scripts to help future experimenters.

3.3.5.2 Findings Summary

• Resolved Issues

1. In the first steps, we encountered a hardware error problem: a hardware error was thrown during the process of creating a sliver, or getting a ticket for the sliver creation. The error stated: OS 'GeniSlices/UBUNTU91-LAMP' (OS-2283) does not run on this hardware type! This error was reported to the LAMP team and they sent us a patch update that resolved the problem.

2. LAMP does not support the version 2 Rspec format. The lamp-sendmanifest.py script's execution threw an error indicating the incompatible Rspec version. Also, when the code was manually modified to use the correct Rspec version, the PEM Passphrase request did not pop up. This problem was reported, and finally we got a corrected script from the LAMP team. The corrected script has now been posted on the main LAMP wiki.

3. The error-handling messages were not informative; the wrong script that was hosted on the main LAMP wiki, when used, did not throw an error when the PEM Passphrase was not prompted. The process continued and the lamp-getcertificate.py script threw an error.

4. The documentation for bundle certificates was not mentioned on the main LAMP wiki. This resulted in a problem where the tests, once scheduled and pushed to UNIS, could not be pulled to view the results. This bug was fixed after the LAMP team got back to us saying that the getcacerts script at /usr/local/etc/protogeni/ssl/getcacerts needs to be run before the systems are rebooted. It was not run earlier because the main LAMP wiki said that the bootstrap command invokes this script, but in our case, as well as in other cases mentioned to us by the LAMP team, it failed to do so.

• Open Issues

1. Slice renewal is a problem. The slice currently expires automatically after around 6 hours and we lose all the data, and we have to follow all of the installation steps again to continue running LAMP in our slice. We are trying to automatically kick off a renewal script run via the crontab feature in Linux to see if we can keep the slice up and running for extended durations.

2. BWCTL (throughput) results were not available after the tests were pushed to the UNIS server. The LAMP team suggested that we check back for the results after some considerable time had passed, but the results are still not available. We plan to follow up with the LAMP team about getting this issue resolved.

3.3.5.3 Technical Terms

What is a LAMP Portal?

A useful resource for experimenters which enables configuration, querying and visualization of I&M services data. When a slice with LAMP on multiple nodes is set up, a LAMP image is deployed on a separate node that hosts the portal, which can then be used as a starting point to interact with the other nodes and the services in the slice.

What is UNIS?

The Unified Network Information Service provides a combined service: the Lookup Service and the Topology Service. The topology data of the slice (specified in the Rspec) needs to be uploaded to UNIS. After the upload, users can interact with the perfSONAR services within the slice.

3.3.5.4 Step-by-step Process that Worked

We now describe the steps we took to create a slice, get a LAMP certificate for the slice, upload the topology to UNIS, and finally enable the instrumentation and measurement services for the slice.

Step-1: Creating the Rspec

We need to create an Rspec, which contains the various resource specifications and the topology. LAMP uses a variant of the Ubuntu image. This section follows the usage of the default Rspec that has been provided in the LAMP wiki.

The default Rspec declares the LAMP extension namespace xmlns:lamp="http://protogeni.net/resources/rspec/0.2/ext/lamp/1". Each of the three nodes carries startup_command="/usr/local/etc/lamp/bootstrap.sh urn:publicid:IDN+emulab.net+slice+slice_name urn:publicid:IDN+emulab.net+user+user_name", and the link between the nodes specifies a bandwidth of 100 and a latency of 0.05.

NOTE: The Rspec has a bootstrap script, i.e., the getcacerts script, which needs to be run once the system boots. If we do not specify this, we need to run it every time we restart the perfSONAR services. As pointed out in the Resolved Issues section, this script often has to be run manually even though it is specified in the bootstrap.

NOTE: After completion of step 1, you should download the protogeni-testscripts bundle for further progress. These scripts are available at http://www.emulab.net/downloads/protogeni-tests.tar.gz. Then make sure that your Mac/PC is running Python version 2.6.x. This is a strict requirement: the protogeni-testscripts are compatible with Python 2.6.x alone, and you may get errors if you try Python 2.7 even though it is an enhanced release. There must also be an M2Crypto Python library present on your Mac/PC; the absence of this library may cause certificate errors. Make sure you are using version 0.20.1 or above of M2Crypto, as recommended by the official website http://chandlerproject.org/bin/view/Projects/MeTooCrypto.

NOTE: You may need to resolve a set of dependencies when trying to import the M2Crypto library for Python 2.6. Many operating systems ship the latest or some other version of Python, so once the Python 2.6 tarball has been downloaded and installed, we need to use the ln command to change the hard link to python. Usually the Python executable is in /usr/bin or /usr/local/bin; the which python command will tell the path, and after getting the path do ln -s path/to/python2.6 existing/python/path. M2Crypto depends on SWIG, which has a regular expression parser and hence requires PCRE. The latest PCRE builds can be obtained at http://www.pcre.org/. Once PCRE has been installed, we can go ahead and install SWIG; the latest builds can be found at http://www.swig.org/download.html. Now make sure that the OpenSSL installed on your Mac/PC is at least version 0.9.8. OpenSSL might require the openssl-devel package to be installed for M2Crypto to compile cleanly. M2Crypto uses SWIG features which are essential for building the M2Crypto package; the flags are: -python -I/usr/local/include/python2.6 -I/usr/include/openssl -includeall -D__i386__ -cpperraswarn. These flags are essential for a graceful build of M2Crypto. For Fedora users, try sh fedora_setup.sh build and sh fedora_setup.sh install. This will fix the include flag dependency by using the proper SWIG features. If this still throws dependency errors about missing SWIG config files, try uncommenting the finalize_options function's code in the setup.py file in the M2Crypto untarred directory. Please reboot your Mac/PC at the end to make the changes effective.

Step-2: Slice Creation

Next we need to create a slice. Before starting the process, please make sure you have your public key uploaded to your Emulab profile and also have the public/private key pair in the ~/.ssl directory. Also make sure that you have the SSL certificate generated in the Emulab profile and have it downloaded into ~/.ssl/encrypted.pem. For more information on slice creation and certificate generation in Emulab, please visit http://www.protogeni.net/trac/protogeni/wiki/Tutorial.

Step-3: Get Credential

We need to get the credential from the Slice Authority (SA), which grants the user a signed document containing the privileges and permissions. This credential is provided to the user by the local SA, i.e., Emulab in our case. Please run the getcredential script from the protogeni-testscripts bundle.

python getcredential.py

This provides us with the signed credential.

Step-4: Register a Slice

Now we need to register a slice name. Please run the registerslice script.

python registerslice.py -n slice_name

If the -n option is omitted, a default slice named mytestslice is created. So please either provide the -n option consistently, or do not use the -n option with any of the scripts.

Step-5: Allocating Resources

Now we need to allocate the resources. This can be done in two ways: we can either request a ticket, redeem the ticket and then renew it, or we can go ahead with the createsliver option (the recommended option).

Once you have created a slice, we next need to establish a sliver, which holds the resources. Please run the createsliver script.

python createsliver.py -n slice_name rspec_file.xml

This returns the manifest for this instance, which we can store in the lamp-manifest.xml file. You may not get a manifest returned here, as createsliver.py is a bundled script which does all the operations of requesting and redeeming a ticket; in that case please use the getmanifest.py script, which is part of the protogeni-testscripts bundle, to get the manifest output.

Once we have created the sliver, we need to renew it before we start any experiments. The createsliver script creates the sliver for a specific amount of time, which is usually not enough to do prolonged work, and you suddenly lose the sliver, which causes the work to stop abruptly. So we also renew it straight away.

python renewslice.py -n slice_name time_to_renew_in_minutes

NOTE: Renewing the slice automatically renews the contained sliver too. You can see this in the output.

Step-6: Upload to UNIS

After creating the slice and the sliver, and renewing the sliver, we need to upload the topology to UNIS. The topology, which is represented as an Rspec, needs to be converted into the UNIS schema. This is taken care of by the script lamp-sendmanifest.py. But this script has some requirements: UNIS always requires the slice credential for identification, as well as the manifest we obtained earlier when we allocated the resources.

We use the getslicecredential script to obtain the slice credential.

python getslicecredential.py -n slice_name > lamp-credential.xml

NOTE: We need to remove the first line in the stored file lamp-credential.xml so that the signature check sees it as a signature.

python lamp-sendmanifest.py lamp-manifest.xml urn:publicid:IDN+emulab.net+slice+slice_name lamp-credential.xml

We get an output which shows the message we sent to UNIS. Then we get a prompt for the pass phrase. Upon entering the correct pass phrase, we get the "data elements successfully replaced" message. This means that the topology is now uploaded successfully in UNIS. Use the lamp-getmanifest.py script to obtain the lamp-manifest.xml file's content.

Step-7: LAMP Certificate

We now have to upload the LAMP certificate to all the nodes in our slice. We need to upload the certificate to /usr/local/etc/protogeni/ssl/lampcert.pem. The file name on the nodes must be lampcert.pem, but you may use any file name in your local directory.

python lamp-getcertificate.py -n slice_name

The resulting certificate needs to be stored in a file and then uploaded to the location specified above. [Assume it is stored in lampcert.pem in the local directory as well.]

bash-3.2$ grep "login" lamp-manifest.xml

We need to upload the lampcert.pem in our local directory to /usr/local/etc/protogeni/ssl/lampcert.pem on the nodes obtained from the grep above. Now all we need to do is to ssh to those nodes and load this lampcert.pem file into the specified location.

NOTE: You may get a permission denied error when you try to ssh into the nodes. Use the -i flag of ssh to direct ssh to read from ~/.ssl/your_private_key.

Now, after uploading the lampcert.pem from the local directory to the location in the nodes [this can be achieved either by scp or by simply copy-pasting after ssh-ing to the nodes], we need to run the following:

bash-3.2$ cat shell.sh
for node in node1 node2 node3; do
    ssh -i ~/.ssl/your_private_key user_name@$node "sudo mv lampcert.pem /usr/local/etc/protogeni/ssl/lampcert.pem"
    ssh -i ~/.ssl/your_private_key user_name@$node "sudo chown root.perfsonar /usr/local/etc/protogeni/ssl/lampcert.pem"
    ssh -i ~/.ssl/your_private_key user_name@$node "sudo chmod 440 /usr/local/etc/protogeni/ssl/lampcert.pem"
    ssh -i ~/.ssl/your_private_key user_name@$node "sudo /usr/local/etc/lamp/bootstrap.sh urn:publicid:IDN+emulab.net+slice+geni1 urn:publicid:IDN+emulab.net+user+shriram"
    ssh -i ~/.ssl/your_private_key user_name@$node "sudo perl /usr/local/etc/protogeni/ssl/getcacerts"
    ssh -i ~/.ssl/your_private_key user_name@$node "sudo /etc/init.d/psconfig restart"
done

Step-8: Browser Access

Now, after restarting the perfSONAR-PS services on the nodes, we can access the LAMP portal.

NOTE: To access the LAMP portal, we need to upload a certificate to the browser. Follow the steps below to upload a certificate if you are using the Firefox web browser. The portal access works fine with Firefox and Google Chrome; it does not work well with Safari.

• Go to your user profile page on the Utah Emulab.
• Click 'Generate SSL Cert' on the left.
• Enter a password and remember it.
• Immediately click on the PKCS12 format download and store it locally.
• Then click on the Download [1st option] option, copy the certificate, and paste it into ~/.ssl/encrypted.pem. This is to update your local certificate, as you have changed it now.
• Open Firefox and select Preferences [command+, on a Mac].
• Select the Advanced tab and choose the Encryption sub-tab.
• Check the Select One Automatically radio button.
• Click View Certificates and select the Your Certificates tab.
• Import the PKCS12 file you downloaded locally and save the settings.
• Now open https://pc142.emulab.net/lamp/ [here pc142 is the lamp node].
• Only the lamp node (see the Rspec) can be accessed via the web portal.
• Then select Add Exception, and you will reach the lamp portal of the node.

Step-9: Completing LAMP Deployment

In this step we are able to access the basic browser view for the lamp node, and thus we can schedule and run tests in the web interface provided.

3.3.5.5 Conclusion

In the above, we have described the complete details of the first five steps along with the [NOTE] tags, which mark the likely error possibilities in getting LAMP successfully deployed in your GENI slice. Although the main LAMP wiki is very helpful, and the LAMP team is very responsive to help requests, our overall experience showed that the documentation in the main LAMP wiki needs to be more informative. Also, due to the dependencies of the LAMP software, other challenges may arise across Mac/PC platforms.

4 Conclusion and Future Work

This work has been centered on network management and monitoring.

Spanning wired, wireless and wide area networks, this work addresses specific problems and fixes for those problems. One way to elaborate on this work would be to build a universal framework which addresses problems and tool fixes in wired networks, and which is also able to accept future contributions of fixes for other problems. The data obtained and stored should also be made universal, so that many developers can leverage the data and perform interesting analyses. Such a framework could then be deployed as network management software.

The iPhone app has another feature, implementing Test TCP with the iPhone as a client, under rapid development. This feature will complete the set of features for testing a particular wireless connection: web page (HTTP), file (FTP) and data buffer transfer. With the release of the app, users can perform download tests at various points in an enterprise or on a college campus; based on that data, suggestions can be made on where the network strength is healthier, and the user can go to that location for a seamless experience.

The perfSONAR and LAMP implementation and testing were mostly done on VMs. perfSONAR has a lot of advantages in monitoring widely scattered nodes, and the GENI framework makes testing perfSONAR an easy process, as allocating and obtaining a node is simplified in the case of GENI. Once the OpenFlow network project is completed, the variety of nodes obtainable over the geography of the country will be much richer, and thus we can spot network differences over the period of testing. Also, having a unified interface which includes the meta-scheduler along with the perfSONAR services will be a great advantage, as that will enable users to handle all the services at one point.


Bibliography

[1] LAMP references:

http://groups.geni.net/geni/wiki/LAMP/Tutorial

http://groups.geni.net/geni/wiki/GIR3.2_LAMP

Inputs from Matthew Jaffee [email protected]; [email protected]

[2] GENI references:

http://groups.geni.net/geni/wiki/GeniNewcomers

GENI system overview document ID: GENI-SE-SY-SO-02.0

[3] Protogeni references:

http://www.protogeni.net/trac/protogeni/wiki/Tutorial

https://www.protogeni.net/trac/protogeni/wiki/FlashClientSetup

[4] perfSONAR references:

http://www.perfsonar.net

http://www.internet2.edu/workshops/npw/materials/WJT2011/20110202-NPW-

pS-Architecture.pdf

[5] perfSONAR tools OWAMP:

http://www.internet2.edu/performance/owamp/

[6] perfSONAR tools BWCTL:

http://www.internet2.edu/performance/bwctl/

[7] perfSONAR BUOY:

http://psps.perfsonar.net/psb/


[8] CISCO OID browser:

http://tools.cisco.com/Support/SNMP

[9] Juniper alarm MIB:

http://www.juniper.net/techpubs/en_US/junos11.1

[10] NanoBSD:

http://www.freebsd.org/
