
Magpie: Distributed request tracking for realistic performance modelling

Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan
Microsoft Research Cambridge
James Bulpin
University of Cambridge
12 November 2003

Performance in distributed systems
- Faults in distributed systems are notoriously hard to diagnose
- Performance problems are even more subtle to debug
  - Often transient or affect only a subset of requests / users
  - Frequently involve complex interactions between multiple machines
  - Aggregate statistics (e.g. utilization) may look perfectly normal

Magpie Approach
- Track individual requests end to end
  - Observe control flow (causality)
  - Monitor resource consumption: CPU, bandwidth, disk
  - Debug performance “in the small”
- Build a probabilistic workload model from the aggregate requests
  - Cluster similar requests according to their observed behaviour
  - Debug performance “in the large”

How do we use this information?
- Performance debugging
  - Why did this request take much longer than that request?
- Fault detection
- Configuration and management
- Performance prediction
  - Realistic workload models for capacity planning
  - Obtain automatically on a “live” system

Magpie components
- Instrumentation
  - System activity recorded to logs
- Generic request parser
  - Extract individual requests from logs according to an event schema
- Model construction
  - Behavioural clusters
  - Probabilistic state machine

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

What is a request?
- System activity which takes place in response to an action initiated by the application being traced
  - HTTP request
  - Database query
  - File open request
- We describe a request as
  - The sequence of application components involved in its processing
  - The resource consumed at each stage
    - CPU, bandwidth, disk transfer size, (latency)

A typical e-commerce site (1)
[Figure: site topology. Labels: Internet, Web Front Ends, SQL Servers, Storage]

A typical e-commerce site (2)
[Figure: component stacks on the Web Server and the SQL Server. Labels: IIS, CLR, Application Logic, Static Content, Filter, Stored procedures, ASP.NET, ADO.NET, Data, WinSock2 API, http.sys, Kernel]

HTTP request: detailed view
[Figure: timeline of one HTTP request across threads WEB.eec, WEB.398 and SQL.9c4, with Disk, Net RX and Net TX rows; time axis marked 10.051s, 10.100s, 10.155s. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other. Callouts: HTTP request packets; IIS worker thread picks up request from http.sys; ASP.NET worker thread takes over; ASP.NET thread blocks after RPC to database; Sync WinSock send to SQL Server; TDS request and reply packets sent and received; SQL thread unblocks; HTTP response packet sent back to client; IIS worker thread wakes up to write log]

Why is request tracking hard?
- Many components, multiple machines
  - Must track control flow across machines
- No globally unique request ID
  - Components are developed independently
- Multiple thread pools
  - Many threads participate in processing a request
- Asynchronous communication
  - Must match send/recvs between threads/machines
- Hand-rolled synchronization primitives
  - SQL server has user-mode scheduler

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

Event Tracing for Windows
- Low-overhead event mechanism
  - Events timestamped with cycle counter
  - Global ordering on events on a single machine
  - Can enable/disable sets of events at runtime
- Using ETW in Magpie
  - Each instrumentation point posts an event
  - Events are logged to disk
  - Logs are post-processed to extract requests
  - Can also consume events in real time

Instrumentation points
- Existing ETW event providers
  - IIS, kernel
- App-specific hooks
  - IIS, ASP.NET, SQL Server
- Detours
  - Wrap dlls to trap Win32 and WinSock2 calls
- WinPcap
  - Capture packets on the wire

CPU usage from kernel events
- The ETW kernel logger records every context switch
- How do we know which cycles are used for which request?
- We can attribute cycles to a request by
  - An application-specific event which occurs within a delimited sector of CPU time, or
  - The current context of execution, e.g. thread id

Example: protocol processing in a DPC
[Figure: timeline of events (cswitch, DPC start, pkt recv, DPC end, cswitch) with cycle counts attributed to Request 1 and Request 2]

Application and middleware events
- Cover points where flow of control moves between components
- Cover points where resources are multiplexed and demultiplexed
  - E.g. user-level scheduling primitives
- Propagation of a global request id is not required!
  - Magpie used to do this but not any more

Instrumenting a web service
[Figure: the Web Server and SQL Server component stacks with instrumentation added. Component labels: IIS, CLR, Application Logic, Static Content, Filter, Stored procedures, ASP.NET, ADO.NET, Data, WinSock2 API, http.sys, Kernel. Instrumentation labels: Extended SPs, HTTPModule, Wrappers, ISAPI Filter, CLR profiler, Intercept (at the WinSock2 API on both machines), Event Tracing for Windows (both machines), Packet capture (both machines)]

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

Generic request extraction
- No inbuilt assumptions about the system or the application
  - No common unique identifier
- Schema specifies semantics of events
  - Easy to add new event types
- Parser stitches events into requests based on event semantics

Terminology
- Namespace
  - Event parameter which references an entity in the system, e.g. thread id
- Timeline
  - Instantiation of a namespace with a unique value, e.g. thread id = 0xa
- Events bind or unbind requests to timelines
- Bindings capture the semantics of each event for a particular request type

Example: connecting events
[Figure: events (cswitch, DPC start, TCP pkt, DPC end, Enter Recv, Recv returns) connected across the timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd, for Request 1 and Request 2]

End-to-end request extraction
- An instance of the request parser runs on each machine in the distributed system
  - Online or offline mode
- Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

Clustering for workload generation
- Target the Indy performance modelling tool
  - Calculates throughput, bottlenecks
  - Needs transaction mix, resource consumption
- Previously: microbenchmark approach
  - Run 10000 of each “transaction type” (URL)
  - Divide aggregate resource usage by 10000
- Aim: provide realistic workload models
  - From real, mixed workloads
  - Derive transaction “types” automatically

Single request: cartoon view
- Partial ordering of events
- Annotated with resource usage
[Figure: event graph for one request, annotated with CPU times, network transfer sizes and disk reads. Key: IIS CPU, ASP.NET CPU, SQL Server CPU, Network, Disk]

Behavioural clustering of requests
- Represent requests as event strings
  - “Flatten” out any concurrency
- Use Levenshtein string edit distance
  - Modified to factor in resource usage vectors
- Cluster requests based on this distance
  - Linear-time algorithm
- Each cluster is a request “type”
  - Select representative from near centroid

Build a workload model by clustering similar requests
[Figure: five request clusters labelled A to E, each drawn as an annotated event graph; cluster shares shown are 7%, 10%, 15%, 5% and 63%]
- Requests in the same cluster often have different URLs, and one URL may appear in many clusters

Taking it further: work-in-progress
- Online and incremental modelling:
  - Detect component failure
  - Detect sudden shifts in workload
- More sophisticated models
  - Learn the probabilistic state machine for each request
  - cf. flowcharts annotated with performance information
- “Bayesian watchdogs”
  - Compute the likelihood of a request’s behaviour as it moves through the system
  - Deal with “unlikely” requests appropriately

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

Current status
- Recent focus has been developing a generic request extraction scheme
- Prototype for 2-machine e-commerce site
  - TPC-W style workload
- Prototype for single machine SQL Server 2000
  - Challenge is user mode scheduler
  - TPC-C workload
- Other applications on the way
  - Large-scale
  - “Real” systems with “real” performance problems

Conclusion
- Magpie is a tool for performance analysis in a distributed system
- Bottom up, per-request approach
- Complementary to existing techniques:
  - Performance counters
  - Program profiling
- Feeds into performance debugging and prediction tools

Work-in-progress: learning the probabilistic state machine
- Infer a stochastic context free grammar from a sample set of strings
  - Each state transition emits a character and has an associated probability
- Use the Alergia algorithm (Carrasco & Oncina ’94)
  - Construct a prefix tree from the sample set
  - Merge similar subtrees
- Apply to Magpie requests
  - “Just” event strings…

Ongoing work with Alergia
- Tuning the similarity criterion
- Factoring in resource usage information
- Can we identify event sequences with suspiciously low probability?
  - Run online for anomaly detection?
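
As a rough illustration of the two Alergia steps named in the slides (construct a prefix tree from the sample strings, then decide which subtrees are similar enough to merge), the Python sketch below builds a frequency prefix tree and applies the Hoeffding-bound compatibility test described by Carrasco & Oncina. It is not Magpie's code. The single-character event symbols, the sample strings, the names `Node`, `build_prefix_tree`, `differ`, `compatible` and `string_probability`, and the default `alpha=0.05` are all assumptions made for this example. The merge step itself and any use of resource usage information are left out.

```python
import math


class Node:
    """One state of the frequency prefix tree (hypothetical structure)."""

    def __init__(self):
        self.visits = 0     # number of sample strings passing through this node
        self.ends = 0       # number of sample strings ending at this node
        self.children = {}  # event symbol -> child Node


def build_prefix_tree(samples):
    """Build a frequency prefix tree from an iterable of event strings."""
    root = Node()
    for s in samples:
        node = root
        node.visits += 1
        for sym in s:
            node = node.children.setdefault(sym, Node())
            node.visits += 1
        node.ends += 1
    return root


def differ(f1, n1, f2, n2, alpha):
    """Hoeffding-bound test: True if f1/n1 and f2/n2 differ significantly."""
    if n1 == 0 or n2 == 0:
        return False
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )
    return abs(f1 / n1 - f2 / n2) > bound


def compatible(a, b, alpha=0.05):
    """True if the subtrees rooted at a and b look statistically similar."""
    # Compare the frequency with which strings end at each node.
    if differ(a.ends, a.visits, b.ends, b.visits, alpha):
        return False
    # Compare the frequency of each outgoing symbol, then recurse.
    for sym in set(a.children) | set(b.children):
        ca = a.children.get(sym)
        cb = b.children.get(sym)
        fa = ca.visits if ca else 0
        fb = cb.visits if cb else 0
        if differ(fa, a.visits, fb, b.visits, alpha):
            return False
        if ca and cb and not compatible(ca, cb, alpha):
            return False
    return True


def string_probability(root, s):
    """Empirical probability of string s under the (unmerged) prefix tree."""
    node = root
    p = 1.0
    for sym in s:
        child = node.children.get(sym)
        if child is None:
            return 0.0  # this event sequence was never observed
        p *= child.visits / node.visits
        node = child
    return p * node.ends / node.visits


# Hypothetical event strings: each character stands for one event type.
samples = ["ab", "ab", "ac", "a"]
root = build_prefix_tree(samples)
```

On the unmerged tree, `string_probability(root, s)` is simply the fraction of sample strings equal to `s`, so a sequence that was never observed gets probability zero. Merging compatible subtrees is what lets the model generalise beyond the samples, which matters if low-probability sequences are to be flagged online.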