Magpie: Distributed request tracking for realistic performance modelling

Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan (Microsoft Research Cambridge)

James Bulpin (University of Cambridge)

12 November 2003

Performance in distributed systems
- Faults in distributed systems are notoriously hard to diagnose
- Performance problems are even more subtle to debug
  - Often transient, or affect only a subset of requests or users
  - Frequently involve complex interactions between multiple machines
  - Aggregate statistics (e.g. utilization) may look perfectly normal

Magpie approach
- Track individual requests end to end
  - Observe control flow (causality)
  - Monitor resource consumption: CPU, bandwidth, disk
  - Debug performance “in the small”
- Build a probabilistic workload model from the aggregate requests
  - Cluster similar requests according to their observed behaviour
  - Debug performance “in the large”

How do we use this information?
- Performance debugging
  - Why did this request take much longer than that request?
- Fault detection
- Configuration and management
- Performance prediction
  - Realistic workload models for capacity planning
  - Obtain automatically on a “live” system

Magpie components
- Instrumentation
  - System activity recorded to logs
- Generic request parser
  - Extract individual requests from logs according to an event schema
- Model construction
  - Behavioural clusters
  - Probabilistic state machine

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

What is a request?
- System activity which takes place in response to an action initiated by the application being traced
  - HTTP request
  - Database query
  - File open request
- We describe a request as
  - The sequence of application components involved in its processing
  - The resources consumed at each stage
    - CPU, bandwidth, disk transfer size, (latency)

A typical e-commerce site (1)

[Figure: the Internet, web front ends, SQL servers and storage.]

A typical e-commerce site (2)

[Figure: software components on each tier. Web server: IIS, static content, filter, ASP.NET, CLR, application logic, ADO.NET, WinSock2 API, http.sys, kernel. SQL Server: stored procedures, data, WinSock2 API, kernel.]

HTTP request: detailed view

[Figure: timeline of a single HTTP request between 10.051s and 10.155s, showing web server threads WEB.eec and WEB.398 and SQL Server thread SQL.9c4, with Net TX, Net RX and Disk activity on each machine. Annotated events: HTTP request packet received; IIS worker thread picks up the request from http.sys; ASP.NET worker thread takes over; ASP.NET thread blocks after the synchronous WinSock send (RPC to the database) to SQL Server; TDS request and reply packets sent and received; SQL thread unblocks; HTTP response packets sent back to the client; IIS worker thread wakes up to write the log. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other.]

Why is request tracking hard?
- Many components, multiple machines
  - Must track control flow across machines
  - No globally unique request ID
  - Components are developed independently
- Multiple thread pools
  - Many threads participate in processing a request
- Asynchronous communication
  - Must match sends and receives between threads and machines
- Hand-rolled synchronization primitives
  - SQL Server has a user-mode scheduler


Event Tracing for Windows
- Low-overhead event mechanism
  - Events timestamped with cycle counter
  - Global ordering on events on a single machine
  - Can enable/disable sets of events at runtime
- Using ETW in Magpie
  - Each instrumentation point posts an event
  - Events are logged to disk
  - Logs are post-processed to extract requests
  - Can also consume events in real time

Instrumentation points
- Existing ETW event providers
  - IIS, kernel
- App-specific hooks
  - IIS, ASP.NET, SQL Server
- Detours
  - Wrap DLLs to trap Win32 and WinSock2 calls
- WinPcap
  - Capture packets on the wire

CPU usage from kernel events
- The ETW kernel logger records every context switch
- How do we know which cycles are used for which request?
- We can attribute cycles to a request by
  - An application-specific event which occurs within a delimited sector of CPU time, or
  - The current context of execution, e.g. thread id
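The second attribution rule (current context of execution) can be sketched as follows. This is a minimal illustration, not Magpie's implementation: the event tuple layout, the `thread_to_request` map and the `"unattributed"` label are all assumptions made for the example.

```python
from collections import defaultdict

def attribute_cycles(cswitch_events, thread_to_request):
    """Sum the cycles each request consumed on one CPU.

    cswitch_events: list of (cycle_timestamp, new_thread_id), sorted by time
                    (hypothetical layout for a context-switch record).
    thread_to_request: hypothetical map from thread id to request id.
    """
    usage = defaultdict(int)
    # Pair each context switch with the next one to get a run interval.
    for (start, tid), (end, _) in zip(cswitch_events, cswitch_events[1:]):
        # The thread switched in at `start` runs until the next switch.
        request = thread_to_request.get(tid, "unattributed")
        usage[request] += end - start
    return dict(usage)

events = [(100, 0xa), (250, 0xb), (400, 0xa), (450, 0xc), (500, 0xa)]
owners = {0xa: "request-1", 0xb: "request-2"}
usage = attribute_cycles(events, owners)
print(usage)
```

In this sketch the final context switch opens an interval that is never closed, so it contributes no cycles; a real log would need an end-of-trace marker.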

Example: protocol processing in a DPC

[Figure: the event sequence cswitch, DPC start, pkt recv, DPC end, cswitch along a time axis, with the cycle count split between Request 1 and Request 2.]

Application and middleware events
- Cover points where flow of control moves between components
- Cover points where resources are multiplexed and demultiplexed
  - E.g. user-level scheduling primitives
- Propagation of a global request id is not required!
  - Magpie used to do this but not any more

Instrumenting a web service

[Figure: the web server and SQL Server software stacks from the earlier e-commerce diagram, with instrumentation added. Instrumentation points shown: ISAPI filter, HTTPModule, CLR profiler, wrappers, extended SPs, WinSock2 API intercepts, and Event Tracing for Windows and packet capture on both machines.]


Generic request extraction
- No inbuilt assumptions about the system or the application
  - No common unique identifier
- Schema specifies semantics of events
  - Easy to add new event types
- Parser stitches events into requests based on event semantics

Terminology
- Namespace
  - Event parameter which references an entity in the system, e.g. thread id
- Timeline
  - Instantiation of a namespace with a unique value, e.g. thread id = 0xa
- Events bind or unbind requests to timelines
- Bindings capture the semantics of each event for a particular request type
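One way to picture binding and unbinding is the sketch below. It is an assumption-laden illustration rather than the Magpie parser: the schema format (`starts`, `bind`, `unbind` keys), the event names and the attribute names are all hypothetical.

```python
def stitch(events, schema):
    """Group events into requests by following timeline bindings.

    events: list of (event_type, {namespace: value}) in timestamp order.
    schema: hypothetical map from event type to a rule with optional keys
            "starts" (bool), "bind" and "unbind" (lists of namespaces).
    """
    bound = {}      # timeline (namespace, value) -> request id
    requests = {}   # request id -> list of event types
    next_id = 1
    for etype, attrs in events:
        rule = schema.get(etype, {})
        timelines = list(attrs.items())
        # Find a request already bound to any timeline this event mentions.
        rid = next((bound[t] for t in timelines if t in bound), None)
        if rid is None:
            if not rule.get("starts"):
                continue  # event belongs to no request being tracked
            rid = next_id
            next_id += 1
            requests[rid] = []
        requests[rid].append(etype)
        for ns in rule.get("bind", []):
            bound[(ns, attrs[ns])] = rid
        for ns in rule.get("unbind", []):
            bound.pop((ns, attrs[ns]), None)
    return requests

schema = {
    "PktRecv": {"starts": True, "bind": ["conn"]},
    "RecvReturns": {"bind": ["tid"]},
    "SendReply": {"unbind": ["tid", "conn"]},
}
events = [
    ("PktRecv", {"conn": 0xd}),                 # starts a request on Connid=0xd
    ("Compute", {"tid": 0xb}),                  # unrelated thread, ignored
    ("RecvReturns", {"conn": 0xd, "tid": 0xa}), # binds Tid=0xa to the request
    ("Compute", {"tid": 0xa}),
    ("SendReply", {"conn": 0xd, "tid": 0xa}),   # unbinds both timelines
    ("Compute", {"tid": 0xa}),                  # thread is free again, ignored
]
stitched = stitch(events, schema)
print(stitched)
```

Note that no request id appears in any event: the request is recovered purely from which timelines are bound at the moment each event occurs.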

Example: connecting events

[Figure: the events TCP pkt, DPC start, DPC end, Enter Recv, Recv returns and two cswitch events laid out against the timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd, showing which events are connected into Request 1 and Request 2.]

End-to-end request extraction
- An instance of the request parser runs on each machine in the distributed system
  - Online or offline mode
- Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier
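The offline joining step can be sketched as a union of fragments that share an identifier. This is only an illustration under assumed data: the fragment tuple layout, machine names and packet ids are hypothetical, and the union-find approach is a choice made here, not one stated in the talk.

```python
def join_fragments(fragments):
    """Merge per-machine request fragments that share a packet identifier.

    fragments: list of (machine, local_request_id, set_of_packet_ids);
               this layout is hypothetical.
    Returns a sorted list of groups, each a sorted list of (machine, id).
    """
    parent = list(range(len(fragments)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # packet id -> index of first fragment that saw it
    for i, (_, _, packets) in enumerate(fragments):
        for p in packets:
            if p in first_seen:
                parent[find(i)] = find(first_seen[p])
            else:
                first_seen[p] = i

    groups = {}
    for i, (machine, rid, _) in enumerate(fragments):
        groups.setdefault(find(i), []).append((machine, rid))
    return sorted(sorted(g) for g in groups.values())

fragments = [
    ("WEB", 1, {101, 102}),
    ("SQL", 7, {102}),
    ("WEB", 2, {201, 202}),
    ("SQL", 8, {202}),
]
joined = join_fragments(fragments)
print(joined)
```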


Clustering for workload generation
- Target the Indy performance modelling tool
  - Calculates throughput, bottlenecks
  - Needs transaction mix, resource consumption
- Previously: microbenchmark approach
  - Run 10000 of each “transaction type” (URL)
  - Divide aggregate resource usage by 10000
- Aim: provide realistic workload models
  - From real, mixed workloads
  - Derive transaction “types” automatically

Single request: cartoon view
- Partial ordering of events
- Annotated with resource usage

[Figure: one request drawn as a partially ordered sequence of stages, each annotated with its resource usage (CPU time in ms, data sizes in k, disk reads). Key: IIS CPU, ASP.NET CPU, SQL Server CPU, Network, Disk.]

Behavioural clustering of requests
- Represent requests as event strings
  - “Flatten” out any concurrency
- Use Levenshtein string edit distance
  - Modified to factor in resource usage vectors
- Cluster requests based on this distance
  - Linear-time algorithm
- Each cluster is a request “type”
  - Select representative from near centroid
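A minimal sketch of distance-based clustering over event strings follows. It uses the plain Levenshtein distance, leaving out the resource usage modification mentioned above, and a single-pass "leader" scheme as a stand-in for the clustering algorithm, which the slides do not name. The event strings and the threshold are made up for the example.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def cluster(requests, threshold):
    """Single-pass leader clustering (an assumed algorithm, not Magpie's).

    Each request joins the first cluster whose leader is within `threshold`,
    otherwise it becomes the leader of a new cluster.
    """
    leaders, clusters = [], []
    for r in requests:
        for k, leader in enumerate(leaders):
            if levenshtein(r, leader) <= threshold:
                clusters[k].append(r)
                break
        else:
            leaders.append(r)
            clusters.append([r])
    return clusters

# Each letter stands for one event type in a flattened request.
requests = ["ABCD", "ABCE", "XYZ", "ABD", "XYZZ"]
clusters = cluster(requests, threshold=1)
print(clusters)
```

Each request is examined once, so the pass is linear in the number of requests when the number of clusters stays small.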

Build a workload model by clustering similar requests

[Figure: five request clusters, labelled A to E, each shown as a representative request annotated with resource usage; the cluster shares shown are 63%, 15%, 10%, 7% and 5%. Callout: requests in the same cluster often have different URLs, and one URL may appear in many clusters.]

Taking it further: work-in-progress
- Online and incremental modelling:
  - Detect component failure
  - Detect sudden shifts in workload
- More sophisticated models
  - Learn the probabilistic state machine for each request
  - cf. flowcharts annotated with performance information
- “Bayesian watchdogs”
  - Compute the likelihood of a request’s behaviour as it moves through the system
  - Deal with “unlikely” requests appropriately


Current status
- Recent focus has been developing a generic request extraction scheme
- Prototype for 2-machine e-commerce site
  - TPC-W style workload
- Prototype for single machine SQL Server 2000
  - Challenge is user mode scheduler
  - TPC-C workload
- Other applications on the way
  - Large-scale
  - “Real” systems with “real” performance problems

Conclusion
- Magpie is a tool for performance analysis in a distributed system
  - Bottom up, per-request approach
- Complementary to existing techniques:
  - Performance counters
  - Program profiling
- Feeds into performance debugging and prediction tools

Work-in-progress: learning the probabilistic state machine
- Infer a stochastic regular grammar (a probabilistic finite state machine) from a sample set of strings
  - Each state transition emits a character and has an associated probability
- Use the Alergia algorithm (Carrasco & Oncina ’94)
  - Construct a prefix tree from the sample set
  - Merge similar subtrees
- Apply to Magpie requests
  - “Just” event strings…
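The first Alergia step, building a prefix tree with frequency counts, can be sketched as below. The subtree-merging step is deliberately left out, and the node layout and sample strings are assumptions made for illustration.

```python
def new_node():
    # count: strings passing through; end: strings ending here.
    return {"count": 0, "end": 0, "next": {}}

def build_prefix_tree(samples):
    """Build a frequency prefix tree from a sample set of strings."""
    root = new_node()
    for s in samples:
        node = root
        node["count"] += 1
        for ch in s:
            node = node["next"].setdefault(ch, new_node())
            node["count"] += 1
        node["end"] += 1
    return root

def string_probability(root, s):
    """Probability the tree assigns to s, using relative frequencies."""
    node, p = root, 1.0
    for ch in s:
        child = node["next"].get(ch)
        if child is None:
            return 0.0  # transition never observed in the sample
        p *= child["count"] / node["count"]
        node = child
    # Finally, the probability of stopping at this node.
    return p * node["end"] / node["count"]

# Each letter stands for one event in a request's event string.
tree = build_prefix_tree(["ab", "ab", "ac", "a"])
print(string_probability(tree, "ab"))
print(string_probability(tree, "b"))
```

Before any merging, the tree simply reproduces the sample frequencies; merging similar subtrees is what lets the model generalise to unseen strings.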

Ongoing work with Alergia
- Tuning the similarity criterion
- Factoring in resource usage information
- Can we identify event sequences with suspiciously low probability?
  - Run online for anomaly detection?
