
IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications 21-23 September 2009, Rende (Cosenza), Italy

Large scale monitoring of home routers

S. Costas-Rodríguez+, R. Martínez-Álvarez+, F.J. González-Castaño∗+, F. Gil-Castiñeira∗, R. Duro×
+Gradiant, ETSI Telecomunicación, Campus, 36310 Vigo, Spain
∗Departamento de Ingeniería Telemática, Universidad de Vigo, Spain
×Grupo Integrado de Ingeniería, Universidad de La Coruña, Spain
Tel: +34 986 813788, : +34 986 812116
E-mail: {scostas,rmartinez}@gradiant.org, {javier,xil}@det.uvigo.es, [email protected]

Abstract – This paper describes our experience with concurrent asynchronous monitoring of large populations of end-user broadband-access routers. Despite the wealth of research in large-scale monitoring, which assumes that it is possible to query individual nodes efficiently, end-user access routers usually have manual legacy interfaces, either HTTP- or telnet-oriented. They seldom offer a direct interface to other programs. Moreover, the uptime of end-user routers is unpredictable. For all these reasons, commercial large-scale monitoring tools such as SNMP collectors are useless. This research is motivated by the fact that some telecommunications operators do not let end-users buy their routers in the consumer electronics market. Instead, they rent the routers and maintain them under long-term contracts. By monitoring line signal-to-noise ratio, transmission and reception power, memory usage or uptime, the operators can predict many types of router failures. In any case, this information may feed their data warehouses for future use. In our field tests we monitored the routers of the Spanish ISP and VoIP operator Comunitel (www.comunitel.es). With our approach, a full monitoring cycle of 22,300 such routers took less than five minutes.

Keywords: Monitoring, access networks, SNMP

I. INTRODUCTION

This paper describes our experience with concurrent asynchronous monitoring of large populations of end-user broadband-access routers.

Despite the wealth of research in large-scale monitoring, which assumes that it is possible to query individual nodes efficiently, end-user access routers usually have manual legacy interfaces, either HTTP- or telnet-oriented. They seldom offer a direct interface to other programs. Moreover, the uptime of end-user routers is unpredictable. For all these reasons, commercial large-scale monitoring tools such as SNMP collectors are useless.

Obviously, at the network core we will find complex nodes with adequate monitoring interfaces. However, end-user device monitoring is of paramount importance for operators that rent and maintain those devices instead of allowing end-users to purchase them. The operators that follow this model waste large sums of money attending service calls, due to router failures or service degradations that are often predictable from monitoring data. Typical examples are uncontrolled growth of NAT tables or undeclared p2p filesharing activities blocking too many ports. Measures like line signal-to-noise ratio, transmission and reception power, memory usage or uptime are useful to detect and fix many potential problems.

In our real tests we focused on home/office ADSL routers, although our results are valid for any other access technology. We monitored the routers of the Spanish ISP and VoIP operator Comunitel (www.comunitel.es). With our approach, a full monitoring cycle of 22,300 such routers took less than five minutes.

The rest of this paper is organized as follows: In section II we review the background, comprising academic research and existing industrial solutions. In section III we describe the practical difficulties in end-user router monitoring. In section IV we present our solution, based on concurrent asynchronous connections. Finally, section V concludes.

II. BACKGROUND

A. Academic research

Large-scale scalable monitoring typically follows a distributed schema [1]. Instead of relying on a single collector, there are multiple monitoring devices working concurrently. This schema is highly robust against network failures.

The research in [2] identifies three types of distributed monitoring: static decentralized, programmable decentralized and active distributed. In the last of these, mobile agents control monitoring nodes. They migrate through the network and identify the optimal nodes to activate monitoring functions. Thus, the active distributed monitoring architecture adapts itself to the state of a dynamic network. This strategy is also followed in [3].

Regardless of the number of monitoring nodes in a network and their evolution, each monitoring node must collect data from a large number of network nodes. A monitoring node must be as efficient as possible, because otherwise it could compromise scalability. This paper focuses on this practical problem, and thus our solution is valid for any monitoring system, either static or dynamic.

Regarding monitoring interfaces, previous work tried to solve the limitations of SNMP. It has been proposed to adopt CORBA or Java RMI-based interfaces [4] to access data. The corresponding objects would retrieve data (via SNMP) and add extra information like physical location in the network or the relationship with other devices. In practice, as we will see in section III, end-user devices have extremely limited monitoring interfaces.

Although less related to this paper, there are many other research lines in network monitoring. Among them, we can cite the following:

• Passive monitoring [5], [6] analyzes data packets at different points of the network using packet sniffers. These sniffers may extract packet headers, transmission rates, number of retransmissions, packet sizes, etc. Passive monitoring is somewhat limited, but it may help to feed advanced network analysis tools with the data they need.
• AI techniques may assist monitoring systems in network failure detection [7]. Ideally, this will enable automatic problem fixing, to reduce the number of warnings human operators must handle.

B. Industrial systems

By default, many ISPs and operators employ the commercial program HP OpenView [8], [9]. It relies on SNMP for data acquisition, but it can be enhanced with external collectors. Specifically, it admits plain text data, XML files or inputs from SQL databases.

In the free software arena, RRDtool [10] is the preferred solution. RRD is a database format designed to store monitoring data in an efficient way. RRDtool makes it possible to create RRD databases, monitor devices and store their results periodically, as well as to extract data and create graphs from them. Data acquisition follows the SNMP protocol. In theory, RRDtool admits external collectors as HP OpenView does.

Many monitoring programs like Cricket [11] (a system to generate statistics from RRDtool data) or BigSister [12] employ RRD.

Finally, OpenNMS is a tool with growing popularity [13]. It also relies on SNMP. It determines if diverse protocols (HTTP, ping...) are available.

To sum up, the most popular tools and solutions rely on SNMP monitoring. As we will see in the next section, unlike high-performance routers, end-user devices do not have full SNMP support. They still depend on “manual” protocols to report their state.

III. MONITORING END-USER DEVICES

A. Typical difficulties

The programs in section II-B are highly advantageous for large-scale network monitoring. They employ SNMP to acquire data, or rely on complex software like virtual machines. Current backbone routers support SNMP, since they must report glitches and failures as quickly as possible. Thus, manufacturers add as many MIBs as they can.

However, when it comes to end-user routers the key issue is cost: when a provider has to distribute thousands of devices, a small saving per unit really matters. Hence, manufacturers eliminate non-critical elements to lower costs and gain a significant market share. Among other consequences, this implies severely limited firmware.

Because end-users usually manage their routers themselves, all interesting parameters (used memory, CPU load, SNR and attenuation...) are available via telnet or www interfaces. SNMP services are often limited to a single MIB to report uptime. Unfortunately, manufacturers are reluctant to add new MIBs, as it would increase cost in products with a low profit margin. Finally, the telnet or web monitoring interfaces are not standard, so each device is managed in a different way.

Another problem in this context is that users shut their routers down quite often. This, together with the high monitoring response times of end-user routers and the unpredictable load of the access network (which also affects the response time of the devices), makes it useless to adopt a sequential monitoring algorithm, namely an algorithm that monitors individual devices in a purely sequential manner (waiting for the acknowledgement of each device before monitoring the next one). Existing commercial software works this way.

Finally, ISP operators do not want to be constrained by the choice of a specific router model. They always have a number of alternatives, and replace routers for technical or marketing reasons quite often. Therefore, large-scale monitoring software for end-user routers must be flexible and quickly adaptable to any kind of device.

All these problems render the high-end solutions in section II-B useless, since they only admit SNMP and perform sequential monitoring (thus being unable to handle arbitrary device shutdowns).

B. End-user routers in this research

In this research¹ we monitored the 22,300 end-user routers of the Spanish operator Comunitel [14] at the time this paper was written. There were six different models (table I). Comunitel was a nation-wide ISP and VoIP operator. Table II shows the variables Comunitel requested us to track and the built-in monitoring methods in each device.

¹ Cátedra Comunitel grant, funded by Comunitel Global SA (www.comunitel.es), Spain

Model                Number
Telsey CPVA500        2,500
Telsey CPVA3         11,000
Telsey Gada             200
Zyxel Prestige 700        -
OneAccess 200         1,900
Cisco 800 series      6,700
Total                22,300

Table I. END-USER COMUNITEL ROUTERS IN THE FIELD TESTS

Model               SNRD    SNRU    AttD    AttU    CCPU    Memo    ENAT    Uptime  PING  PVOX
Telsey CPVA500      Telnet  Telnet  Telnet  Telnet  Telnet  Telnet  Telnet  SNMP    ICMP  ICMP
Telsey CPVA3        Web     Web     Web     Web     -       Web     Web     SNMP    ICMP  ICMP
Telsey Gada         Telnet  Telnet  Telnet  Telnet  Telnet  Telnet  -       SNMP    ICMP  ICMP
Zyxel Prestige 700  Telnet  Telnet  Telnet  Telnet  Telnet  -       Telnet  SNMP    ICMP  ICMP
OneAccess 200       Telnet  Telnet  Telnet  Telnet  SNMP    SNMP    SNMP    SNMP    ICMP  ICMP
Cisco 800 series    SNMP    SNMP    SNMP    SNMP    Telnet  Telnet  Telnet  SNMP    ICMP  ICMP

SNRD: downlink SNR
SNRU: uplink SNR
AttD: downlink attenuation
AttU: uplink attenuation
CCPU: CPU load
Memo: memory usage
ENAT: number of entries in NAT table
Uptime: time (in seconds) since the router was last started
PING: if it answers a PING to data IP
PVOX: if it answers a PING to voice IP

Table II. MONITORING INTERFACES FOR THE VARIABLES IN EACH DEVICE MODEL
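Table II can also be read as a lookup table that tells a collector which interface to use for each variable of each model. The following Python sketch encodes it that way. The names VARIABLES, TABLE_II and interfaces_needed are illustrative choices, not part of the system described here, and None stands for a variable that a model does not expose.

```python
from typing import Dict, Optional, Set

# Column order of Table II.
VARIABLES = ["SNRD", "SNRU", "AttD", "AttU", "CCPU",
             "Memo", "ENAT", "Uptime", "PING", "PVOX"]

TELNET, WEB, SNMP, ICMP = "Telnet", "Web", "SNMP", "ICMP"


def row(*methods: Optional[str]) -> Dict[str, Optional[str]]:
    """Pair each Table II column with the interface that exposes it."""
    if len(methods) != len(VARIABLES):
        raise ValueError("one entry per Table II column is required")
    return dict(zip(VARIABLES, methods))


# One row per router model, copied from Table II (None = not available).
TABLE_II: Dict[str, Dict[str, Optional[str]]] = {
    "Telsey CPVA500": row(TELNET, TELNET, TELNET, TELNET, TELNET,
                          TELNET, TELNET, SNMP, ICMP, ICMP),
    "Telsey CPVA3": row(WEB, WEB, WEB, WEB, None,
                        WEB, WEB, SNMP, ICMP, ICMP),
    "Telsey Gada": row(TELNET, TELNET, TELNET, TELNET, TELNET,
                       TELNET, None, SNMP, ICMP, ICMP),
    "Zyxel Prestige 700": row(TELNET, TELNET, TELNET, TELNET, TELNET,
                              None, TELNET, SNMP, ICMP, ICMP),
    "OneAccess 200": row(TELNET, TELNET, TELNET, TELNET, SNMP,
                         SNMP, SNMP, SNMP, ICMP, ICMP),
    "Cisco 800 series": row(SNMP, SNMP, SNMP, SNMP, TELNET,
                            TELNET, TELNET, SNMP, ICMP, ICMP),
}


def interfaces_needed(model: str) -> Set[str]:
    """Return the distinct interfaces a collector must speak for one model."""
    return {m for m in TABLE_II[model].values() if m is not None}
```

For instance, interfaces_needed("Telsey CPVA3") yields Web, SNMP and ICMP, which illustrates why a single device may require several monitoring connections.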

C. Remarks

We conclude this section with the following remarks:

• Some telecommunications operators need to monitor end-user routers. This was the case of Comunitel (Spain), for example. Their profile corresponds to operators that keep the ownership of the routers and maintain them under contract, with many SMEs as end-users.
• End-user routers do not have full SNMP support. They have non-standard www and telnet interfaces instead.
• ISPs offer end-users a variety of different router models, and replace them quite often.
• Commercial high-end monitoring software does not admit legacy interfaces and it is ill-suited to the frequent shutdowns of end-user routers.

IV. MONITORING END-USER ROUTERS

Since the monitoring software will retrieve practically all data through web or telnet interfaces, we must establish a unified, flexible way to collect data automatically. In order to do so, we have decided to split the process into two stages. First we get the raw data from the device, in HTML format or from the output of a telnet session. Then, at a second stage, we process the data to extract the values we want. These values will be stored or passed to other programs, which will process and/or display them (HP OpenView, BigSister, OpenNMS, or others). We wrap each manual legacy interface to offer a unified interface, and we build a collector on top of the resulting middleware. If necessary, a group of these collectors can feed high-end monitoring software with the data from a large population of end-user routers. However, a single collector running on a Pentium IV is enough to handle the population in table I.

A. Telnet sessions

Figure 1 shows an example of a typical telnet monitoring session. If we analyze it, we see that it starts with a string requesting a username, waiting for us to type it. Then the device prints another string asking for the password, and waits for the answer again. Finally, it displays a prompt and allows us to type commands. Once a command has been executed, the device waits for new commands, until we enter the logout command.

Fig. 1. telnet monitoring session example

The first command we run is sysinfo, which returns the number of active processes, the CPU load in 1, 5 and 15 minute intervals, and a table with memory usage.

The next command is cat /proc/meminfo, which provides information about memory usage (in deeper detail than the previous command).

Then, the command adsl show --info returns information about the current status of the ADSL line.

Finally, the logout command terminates the session, so the router closes the connection.
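The session described above maps naturally onto an ordered list of (expected string, answer) pairs. The Python sketch below plays such a list against an in-memory stand-in for a router and returns the raw text for the second stage. It is only an illustration under stated assumptions: the prompt strings, the %USER% and %PASS% placeholders, the canned command output and the FakeRouter class are all hypothetical, and the transport is left abstract as two callables rather than tied to a particular telnet library.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical placeholders that the collector replaces with per-device values.
USER_CODE = "%USER%"
PASS_CODE = "%PASS%"

# (string expected from the router, string sent back). The prompts are
# assumptions; the commands are the ones named in the text.
DIALOGUE: List[Tuple[str, str]] = [
    ("Login:", USER_CODE),
    ("Password:", PASS_CODE),
    (">", "sysinfo"),
    (">", "cat /proc/meminfo"),
    (">", "adsl show --info"),
    (">", "logout"),
]


def run_dialogue(dialogue: List[Tuple[str, str]],
                 credentials: Dict[str, str],
                 read_until: Callable[[str], str],
                 send: Callable[[str], None]) -> str:
    """First stage: play the dialogue and return everything the router printed."""
    raw = []
    for expected, answer in dialogue:
        raw.append(read_until(expected))          # wait for the router's prompt
        for code, value in credentials.items():   # expand the placeholders
            answer = answer.replace(code, value)
        send(answer + "\n")
    return "".join(raw)


class FakeRouter:
    """In-memory stand-in for a telnet endpoint, for illustration only."""

    def __init__(self) -> None:
        self.pending = "Login: "
        self.received: List[str] = []

    def read_until(self, expected: str) -> str:
        # Return buffered text up to and including the expected string.
        end = self.pending.index(expected) + len(expected)
        out, self.pending = self.pending[:end], self.pending[end:]
        return out

    def send(self, line: str) -> None:
        command = line.strip()
        self.received.append(command)
        if len(self.received) == 1:      # username received
            self.pending += "Password: "
        elif len(self.received) == 2:    # password received
            self.pending += "> "
        else:                            # any command: canned output, new prompt
            self.pending += f"[{command} output]\n> "


router = FakeRouter()
raw_data = run_dialogue(DIALOGUE, {USER_CODE: "admin", PASS_CODE: "secret"},
                        router.read_until, router.send)
```

Supporting a new router model then amounts to supplying a different DIALOGUE list, while run_dialogue itself stays unchanged.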
In order to write the wrapper for this particular interface, we observe that it is possible to define a telnet session as a sequence of send/receive strings, where the router sends a string (asking for username, password or new command) and the monitor answers with another (the username, the password or a command). The monitor must accept some kind of control codes to send specific usernames and passwords, as every device could have its own password. Thus, the answer string associated with the username request should contain a control code, which the monitor replaces with the real username, obtained from a database. All returned data will be raw data, to be analyzed at the second stage (section IV-D) to extract variable values.

B. Web sessions

HTTP data is well suited to automatic processing (by web browsers, for example). In fact, the telnet interface handles web connections in a straightforward way, just by sending an HTTP header² to port 80.

Unfortunately, the routers often force their clients to authenticate themselves with a username and a password. There are mainly two authentication procedures: HTTP authentication functions and GET/POST functions.

In the case of the former, the system sends the username and the password in the HTTP header, coded in Base64 (MIME64) format. Therefore, it is possible to employ the telnet interface just by adding a new control code that sends that string.

In the case of the GET and POST functions, the username and the password travel as variables in the data field after the header. This makes our task more difficult, because the header must include the length of the data field, which depends on the length of the username and the password. In order to solve this problem there is a new control code in the telnet interface that sends a well-formed HTTP header with an arbitrary parameter line, where the user can insert the username and the password, and other variables.

² An HTTP header is simply a GET /file.html HTTP/1.0 string, followed by two carriage return/line feed pairs.

C. SNMP sessions

The SNMP protocol was specifically designed for remote monitoring, so there are no special difficulties in applying it when possible. In most of the end-user routers in our study we can only obtain uptime via SNMP, since this is the only MIB available (table II).

D. Extracting monitoring data

In this section we describe the second stage in the monitoring process.

After obtaining the raw data, we must find the numeric values and assign them to the different variables. Since we need high flexibility, we decided to assign this task to small subprograms or scripts, a different one for each end-user router model. These scripts are launched at system startup, receive raw data from the standard input and deliver the processed variables and their values at the standard output. A server process receives the processed data and stores them on disk.

This schema has two advantages:

• The scripts are extremely simple in any programming language. They can be added to the main program without recompiling it. Thus, when the operator offers a new router model, it only has to provide the send/receive strings and a small script to process raw data. The main code does not change.
• Since the scripts are tailored for each device, it is possible to perform arithmetic and comparison operations. For example, it is possible to monitor the sum of program and operating system memory.

Figure 2 shows our architecture. The main program employs the right send/receive strings for each router model, and stores the raw data until the end of the session. Then, the external scripts process the raw data and return the value of the variables. There is a single instance of each script, which processes the raw data from all devices sequentially. This saves resources and CPU, as no time is wasted in killing child processes and launching them again.

Fig. 2. Architecture of the monitor

E. Large-scale efficient monitoring

End-user routers have low computing capabilities, so their monitoring response time is long. Our system needs between 1 and 5 seconds to get all the raw data from one of the devices in table I, in ideal conditions.

Fig. 3. Comparison between synchronous (up) and asynchronous monitoring (down). Each line pattern corresponds to a different router

Fig. 4. Comparison keeping the order of the connections or not. Each line pattern corresponds to a different router

As previously said, another problem is that many users shut the routers down when they leave their workplace. This means that a sequential system should wait for a TCP timeout.
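To put the cost of a strictly sequential approach in numbers, the short Python calculation below combines the population of table I (22,300 routers) with the 1 to 5 seconds needed per device. The function name and the optional offline_fraction and timeout_seconds parameters are illustrative assumptions; the paper does not report which fraction of routers is switched off at any given time.

```python
def sequential_cycle_hours(routers: int,
                           seconds_per_router: float,
                           offline_fraction: float = 0.0,
                           timeout_seconds: float = 0.0) -> float:
    """Duration of one strictly sequential monitoring cycle, in hours.

    Reachable routers cost `seconds_per_router` each; unreachable ones
    cost one connection timeout each.
    """
    offline = routers * offline_fraction
    online = routers - offline
    total_seconds = online * seconds_per_router + offline * timeout_seconds
    return total_seconds / 3600.0


ROUTERS = 22_300  # total in table I

# Every router switched on, at the two ends of the 1-5 s range.
best_case = sequential_cycle_hours(ROUTERS, 1.0)
worst_case = sequential_cycle_hours(ROUTERS, 5.0)

print(f"sequential cycle: {best_case:.1f} h to {worst_case:.1f} h")
```

Even with every router switched on, one sequential pass takes roughly 6 to 31 hours, far beyond a five-minute cycle, and each switched-off router adds a full timeout on top of that.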

The default value of the TCP connection timeout is over three minutes. By changing some kernel parameters in Linux, it is possible to reduce it to 3 seconds.

We conclude that we cannot use a synchronous monitoring method, i.e. asking one device and waiting for its answer before asking the next one. This is because we would spend at least a second per device, and most of the time just waiting for the answers. This is a long time considering the large number of deployed routers (over 20,000). Furthermore, some devices like the Telsey CPVA3 need three monitoring connections.

Apparently, we could solve this with many child processes, working in parallel to monitor all the devices. This is how web browsers work [15]: when a page has several embedded contents, they are fetched by concurrent requests (usually, up to four at a time). Besides, persistent connections make it possible to leave TCP connections open between consecutive operations, and the pipelining technique makes it possible to send multiple requests (or responses) in a single TCP segment. For 30,000 routers we would need a large number of child processes. This is a heavy burden for a commercial operating system, which should also manage process synchronizations. On the other hand, we would not take any advantage of pipelining or persistent connections.

Therefore, in order to achieve our goals, we decided to launch asynchronous concurrent connections to monitor end-user routers, so that socket reads and writes are non-blocking. An asynchronous connection does not block the call, but it returns immediately to the caller. Asynchronous procedure calls are common in distributed systems: this mechanism allows the caller to run in parallel with a call and to collect the results of the call later, i.e. the requests to the server are non-blocking [16]. We follow a similar schema, with a relevant difference: in our system, the non-blocking operation is the connection establishment itself.

The main program has to check the sockets to determine when the monitoring connections are established. This makes it possible to launch a connection and resume working (reading, writing, creating or closing other connections) while waiting for the establishment of that connection. Therefore, a single process can check all end-user routers concurrently. This way, as shown in figure 3, there is no idle time due to router shutdowns.

In our scenario (table I), a full sequential monitoring cycle would take several hours. Nevertheless, a Pentium IV at 2.4 GHz can perform it in less than five minutes with concurrent asynchronous monitoring³.

A last relevant issue is connection overload. If we must monitor thousands of devices every X minutes, we cannot launch all connections (for all devices) at the beginning of the monitoring cycle, since this would produce a traffic peak and probably saturate the routing tables, followed by no traffic at all during the last part of the X-minute interval. To avoid this, we decided to distribute the connections along the whole monitoring cycle. Thus, for two connections per device, if we must monitor 15,000 devices every 5 minutes, the program will launch one hundred connections per second, producing a nearly constant traffic rate.

In order to evenly spread monitoring traffic across the monitoring cycle, we set a connection establishment order. By doing so, the spacing between connection startup times will be nearly constant and, as a consequence, response traffic will be smooth on average (there may be minor differences in response times, unexpected router shutdowns, etc.). Otherwise, connections would start every X + t minutes, where t is a random value between 0 and X. In figure 4 we illustrate this with an example. At the upper half we see that, by keeping order, the elapsed time between consecutive accesses to the same device is constant. At the lower half we see that, if we alter the order at each monitoring cycle, elapsed time may be half the cycle length or nearly double it. Thus, monitoring traffic peaks may result.

³ A full monitoring cycle of five minutes was a project specification. We did not try to decrease it, although there was a considerable margin. In practice, there are diverse bottlenecks that affect the monitoring cycle: the rate at which the collector emits the queries, the rate at which the collector processes the answers and the network load.
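The combination of non-blocking connection establishment, a fixed launch order and evenly spaced launches can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: asyncio stands in for whatever non-blocking socket mechanism the collector actually used, the names launch_spacing, fetch_raw and monitoring_cycle are hypothetical, and fetch_raw simply reads until the peer closes instead of playing a full send/receive dialogue.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple, Union

Target = Tuple[str, int]                      # (host, port)
Result = Union[bytes, BaseException]


def launch_spacing(devices: int, connections_per_device: int,
                   cycle_seconds: float) -> float:
    """Seconds between consecutive connection launches for a flat traffic rate."""
    return cycle_seconds / (devices * connections_per_device)


async def fetch_raw(target: Target, timeout: float = 3.0) -> bytes:
    """Connect without blocking the event loop and read until the peer closes."""
    host, port = target
    reader, writer = await asyncio.wait_for(
        asyncio.open_connection(host, port), timeout)
    try:
        return await asyncio.wait_for(reader.read(), timeout)
    finally:
        writer.close()


async def monitoring_cycle(
    targets: List[Target],
    spacing: float,
    probe: Callable[[Target], Awaitable[bytes]] = fetch_raw,
) -> Dict[Target, Result]:
    """Start one probe per target, in list order, `spacing` seconds apart."""
    tasks = []
    for target in targets:                    # same order in every cycle
        tasks.append(asyncio.ensure_future(probe(target)))
        await asyncio.sleep(spacing)          # earlier probes keep running
    # Unreachable routers come back as exceptions instead of stalling the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return dict(zip(targets, results))


# 15,000 devices, two connections each, one cycle every 5 minutes:
# 0.01 s between launches, i.e. one hundred connections per second.
SPACING = launch_spacing(15_000, 2, 5 * 60)
```

A cycle could then be run with asyncio.run(monitoring_cycle(targets, SPACING)), where targets is the ordered list of router addresses. A router that is switched off only produces a timeout entry in the result dictionary, without delaying the probes launched after it.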
V. CONCLUSIONS

Monitoring end-user routers is a challenging task because of their large number and slow legacy interfaces. Large-scale SNMP-oriented commercial tools have been designed to monitor high-performance routers. They cannot handle large populations of end-user routers, for the reasons explained in section III-C (legacy non-standard interfaces, unpredictable shutdowns, etc.). However, they may be assisted by external collectors that aggregate monitoring data. Our system wraps legacy monitoring interfaces to offer a unified interface, so it is easy to integrate heterogeneous end-user routers. Our collector performs concurrent asynchronous monitoring to deal with the long response times of end-user router interfaces. Consequently, we can efficiently monitor over 20,000 routers (table I) every 5 minutes, without generating traffic peaks.

As a future upgrade, we could use persistent asynchronous invocations as, for example, the QRPC (Queued RPC) mechanism in the Rover Toolkit [17]. Rover is a software toolkit for mobile devices with intermittent connectivity and limited network bandwidth. QRPC supports mobile-aware and mobile-transparent applications, queueing requests and results at intermediate elements, and thus hiding network latencies and connection interruptions (intermediate elements keep on trying requests in parallel with the execution of applications).

VI. ACKNOWLEDGEMENTS

This work has been partially supported by Ministerio de Fomento, Spain (ROSA-MITUS T38/2006 PEIT grant), Ministerio de Educación y Ciencia, Spain (TEC2007-67966-C03-02 grant) and Xunta de Galicia, Spain (MIND-GAP-5 PGIDIT08TIC010CT grant).

REFERENCES

[1] A. Sahai and C. Morin, “Towards Distributed and Dynamic Network Management,” Proc. NOMS’98.
[2] A. Liotta, G. Pavlou and G. Knight, “Exploiting Agent Mobility for Large-Scale Network Monitoring,” IEEE Network 16-3 (2002), pp. 7-15.
[3] T. C. Du, E. Y. Li and A.-P. Chang, “Mobile Agents in Distributed ”, Communications of the ACM 46-7 (2003).
[4] P. Haggerty and K. Seetharaman, “The benefits of CORBA-based network management,” Communications of the ACM 41-10 (1998).
[5] A. Moore, J. Hall, C. Kreibich, E. Harris and I. Prat, “Architecture of a network monitor,” Proc. of the Fourth Passive and Active Measurement Workshop (PAM 2003).
[6] K. G. Anagnostakis, S. Ioannidis, S. Miltchev, J. Ioannidis, M. B. Greenwald and J. M. Smith, “Efficient packet monitoring for network management,” Proc. IFIP/IEEE Network Operations and Management Symposium (NOMS) 2002.
[7] G. M. Weiss, J. P. Ros and A. Singhal, “ANSWER: network monitoring using object-oriented rules,” Proc. of the Tenth Conference on Innovative Applications of Artificial Intelligence, AAAI Press, 1998.
[8] http://www.openview.hp.com
[9] http://www.openview.hp.com/news/success/index.html
[10] http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/
[11] http://cricket.sourceforge.net/
[12] http://bigsister.graeff.com/
[13] http://www.opennms.org/wiki/
[14] http://www.comunitel.es/
[15] H.F. Nielsen, J. Gettys, A. Baird-Smith, E. Prud’hommeaux, H.W. Lie and C. Lilley, “ of http/1.1, css1, and png,” Proc. ACM SIGCOMM ’97, pp. 144-166, 1997.
[16] B. Liskov and L. Shrira, “Promises: Linguistic support for efficient asynchronous procedure calls in distributed systems,” Proc. ACM SIGPLAN 1988 Conf. on Programming Language Design and Implementation (PLDI ’88), pp. 260-267. ACM Press, 1988.
[17] A.D. Joseph, J.A. Tauber and M.F. Kaashoek, “Mobile computing with the Rover toolkit,” IEEE Transactions on Computers 46-3 (1997), pp. 337-352.