TECHNICAL PAPER

DirectShow: A New Media Architecture

By Amit Chatterjee and Andrew Maltz

The desktop revolution in production and post-production has dramatically changed the way film and television programs are made, simultaneously reducing equipment costs and increasing operator efficiency. The enabling digital innovations by individual companies using standard computing platforms have come at a price: these custom implementations and closed solutions make sharing of media and hardware between applications difficult if not impossible. Microsoft's DirectShow™ Streaming Media Architecture and Windows Driver Model provide the infrastructure for today's post-production applications and hardware to truly become interoperable. This paper describes the architecture, supporting technologies, and their application in post-production scenarios.

Presented at the 31st SMPTE Advanced Motion Imaging Conference (paper 31-6), New York, N.Y., February 6 to 8, 1997. Amit Chatterjee is with Microsoft Corp., Redmond, WA 98052; Andrew Maltz (who co-read the paper) is with Digital Media Technologies, Inc., Van Nuys, CA 91411. An unedited version of this paper appears in The Age of Compression: Nonlinear Editing, Digital Broadcasting, and Other Wonders, SMPTE, 1997. Copyright © 1997 by the Society of Motion Picture and Television Engineers, Inc.

The year 1989 marked a turning point in post-production equipment design with the introduction of desktop digital nonlinear editing systems. While IBM-compatible personal computers and Apple Macintoshes had been used in linear editing and control systems for several years prior, these new editing systems forsook analog videotape as an online storage medium and instead brought video and audio source material "inside the box" in a practical sense for the first time. Other editing equipment manufacturers quickly jumped on the desktop bandwagon. Those that didn't (and some that did) were destined to become footnotes in the annals of broadcast equipment history.

Regardless of the platform on which they were built, all of these pioneering digital nonlinear systems, as well as most of those in production today, had one thing in common: the internal media handling was performed by custom engines. Each implementation relied on proprietary and incompatible file formats, unique and incompatible programming interfaces, and nonexistent data interchange methods. Additionally, every implementation had to fight with constraints and surprises, particularly in the areas of internal stream synchronization, external device synchronization, and data throughput, to name a few.

Any review of desktop production technologies would be incomplete without mentioning standardization attempts such as Apple's QuickTime™ multimedia architecture and Avid Technologies' Open Media Framework Interchange format. However, these and other efforts have not successfully addressed cross-platform issues on the operating systems or cross-application issues much beyond digital audio interchange. Microsoft's own Video for Windows multimedia architecture, designed for low bit rate consumer multimedia applications, is also insufficient for the high-performance demands of professional users.

The formation of the OpenDML committee in 1995, consisting of a very vocal group of Windows-based video tools and hardware manufacturers, underscored the desire for open standards in digital media handling and interchange. Given the "flattening" of the historically vertical tools market, a standardized multimedia architecture would enable better interchangeability of video and audio hardware with application software as well as merge traditionally separate application spaces such as video editing, video conferencing, and server-based media streaming. Other motivating factors are the new hardware buses such as the IEEE 1394 serial bus and Universal serial bus (USB), which are designed with multimedia devices in mind and promise to enable broad new classes of audio and video application programs.

To address these and other requirements, Microsoft introduced DirectShow™, a next-generation media-streaming architecture for the Windows and Macintosh platforms. In development for two and a half years, DirectShow was released in August 1996, primarily as an MPEG-1 playback vehicle for Internet applications, although the infrastructure was designed with a wide range of applications in mind. DirectShow's follow-on release, incorporating substantial input from OpenDML and key application software developers, addresses many high-end tools issues and is now in Beta release. High data rate video capture and playback, seamless nonlinear playback from multiple files, support for transitional effects, improved interstream synchronization, an improved media file format, external device control, and a new hardware driver model are just part of DirectShow's feature set. Other benefits include container format independence, location transparency of software modules, and hardware and software codec interchangeability and scalability.

Inside DirectShow: Supporting Technology

A look under the hood of DirectShow reveals that it builds on several Microsoft technologies, although it should not be considered "just another layer" of system software (Fig. 1). DirectShow provides functional partitioning, module connection, media type negotiation, and stream control services to applications. Other components provide high-performance media services, low-latency stream handling, hardware independence, and an overall model for component construction, loading, and intercomponent communication.

SMPTE Journal, December 1997 865

Authorized licensed use limited to: IEEE Xplore. Downloaded on October 22,2019 at 16:39:29 UTC from IEEE Xplore. Restrictions apply. MICROSOFT DIRECTSHOW: A NEW MEDIA ARCHITECTURE

Figure 1. Microsoft multimedia architecture.

Taking a bottom-up approach, the lowest level component is the hardware abstraction layer (HAL) common to all Microsoft operating systems. As the basis for hardware independence, this is the key to having one's application run transparently on either an Intel x86, DEC Alpha, or PowerPC processor, requiring only recompilation with the proper code generator in place. Multiprocessor implementations are also handled at this level.

Also part of the lowest layer are kernel and base driver services, containing high-priority system tasks, plug-and-play support, and power management. It is at this level that low-latency and time-critical software resides. This subject will be covered in more depth further on.

At the top of the kernel services layer, and extending above it as well, are additional Microsoft DirectX Media technologies, comprised of DirectDraw for two-dimensional graphics, DirectSound for audio, Direct3D for three-dimensional graphics, DirectInput for joysticks and game consoles, and DirectPlay for multimachine, multiplayer gaming. All of the DirectX drivers were initially released for the Windows 95 operating system, and most are now being ported to Windows NT. They are designed for low-latency and high-performance control of hardware, although at present they are primarily user mode modules. This begs the question "What is user mode?"

The Windows NT operating system takes advantage of two distinct privilege modes of today's microprocessors, commonly called "user mode" and "kernel mode." The primary difference between them is that access to the underlying hardware and memory space is unrestricted for software running in kernel mode and highly restricted for software running in user mode. The result of this privilege separation is a robust computing environment: application software and many operating system components run in separate and protected user mode memory spaces. Since they do not have direct access to sensitive areas of the operating system running in kernel mode, it is extremely difficult for them to "crash" the system.

The downside to this separation is that any direct hardware register request or other kernel mode call made by an application program incurs a time penalty for each user-to-kernel mode transition. This is due to the necessary saving of registers and environment state, known as a "context." High-performance applications have recourse for this, as we will see a bit later.

The Windows driver model (WDM) is a new driver model that makes development of low-latency, cross-platform drivers more practical (Fig. 2). WDM drivers reside completely in kernel mode and can communicate directly with other WDM drivers, thus eliminating the performance hit associated with mode transitions. WDM drivers are modular, using a class driver/minidriver structure. This layered approach separates drivers into logical partitions, allowing a particular "stack" to be configured for a given
peripheral using a given protocol connected via a given bus.

Microsoft is supplying a class driver for streaming media that contains all of the code necessary to support plug-and-play, memory management, property and method set management, and other core services. This reduces the amount of code development for a hardware manufacturer to a relatively simple minidriver that can be written in C.

Figure 2. Windows driver model.

Before moving on to the details of DirectShow, one other Microsoft technology must be introduced: the component object model (COM) (Fig. 3). COM is the underlying object-oriented software technology upon which most of Microsoft's newer products are based. It defines the communications rules by which individual software components connect, control, and transfer data. The software is indifferent to programming languages and guarantees binary compatibility between objects. COM is also cross-platform, supported on Windows, Macintosh, and UNIX operating systems.

One of the essential COM concepts is location transparency. COM objects can communicate whether physically present within the same process space on a single machine, or running on a computer connected via a network yet physically located half a world apart. Location transparency is made possible by remote procedure calls (RPC), which are part of the industry standard distributed computing environment specification. RPCs handle the actual communication between COM objects.

COM also defines a contract between the object itself and the rest of the world through interfaces. Interfaces are strongly typed and semantically related groups of functions called methods. They are described by a globally unique identifier (GUID), a 128-bit number that is pretty hard to accidentally assign to more than one interface. One of the fundamental COM rules is: interfaces are immutable. That is, if an interface changes, it gets a new name and a new GUID. This rule simplifies software versioning and guarantees that software written to use a particular interface will always work (once fully debugged, of course).

Inside DirectShow: Pins, Filters, and Graphs

DirectShow is a set of COM objects specifically designed to enable streaming media applications (Fig. 4). Because DirectShow is COM-based, it is truly extensible. That is, new features can be added by defining additional interfaces and new and extended implementations of DirectShow's basic objects as the need arises. These basic objects, known as filters, pins, and filter graphs, are the building blocks from which all DirectShow applications are constructed.

Filters are objects that perform a specific task such as file reading, file writing, compression, transitional effects, or image display. They are commonly grouped into three basic categories: source, transform, and renderer, although some filters, such as those used for external device control, defy simple categorization.

Filters transfer time-stamped data called media samples between one another through another object called a pin. Pins are exposed by a filter as either input or output, depending on whether each pin sinks or sources media samples. Pins are the point of connection between filters and are the objects that actually handle the transfer of media samples. DirectShow defines many standard video, audio, and text media types, and developers are free to define their own as required.

The method of sample transfer between filters is called a transport. DirectShow currently supports three transports: local memory-based, video overlay, and hardware-based. The local memory transport uses system memory to store media samples, and an efficient implementation passes pointers to the media samples to avoid performance-killing memory copies. The overlay transport enables traditional analog overlay and digital inlay hardware. Hardware-based transports allow an add-in adapter's own memory to be used, an example being a video capture card with on-board memory supplying media sample buffers for direct reading by a bus mastering SCSI host adapter. Custom transports can also be defined for more specialized applications.

A collection of filters connected together is called a filter graph. An application can allow DirectShow to automatically construct a filter graph for certain applications such as media file playback and capture, or it can manually assemble the filters together to serve its own special purposes. The filter graph's overall behavior is controlled by an object called the filter graph manager. In the case of MPEG
playback, for example, stream control such as run, stop, and pause are handled via the filter graph manager's control interfaces. Applications can also communicate directly with control interfaces on individual filters and pins to control specialized aspects of their behavior, such as video and audio parameter adjustments.

Figure 3.

Professional Application Issues

The bane of most software media engines' existence is proper synchronization, both interstream and with external references. In the video world, almost every piece of hardware has the ubiquitous reference or genlock input connector, and phase-locked loops handle the job of synchronizing to a reference video signal quite well. The computer world has consistently ignored this basic requirement of video (and audio) systems design by trusting free-running and unstable audio cards to supply the master reference clock. If streams slipped, then dropping a frame or five here and there would bring things back together. On top of this, applications generally were not informed of this poor state of affairs until it was too late. Hardware-locked video and audio sample clocks are standard features on high-end hardware today, but this solution generally requires integrated audio and video hardware from the same manufacturer, which limits end user choice. This approach also does not enable applications that want to display real-time video on a free-running video graphics array (VGA) display without artifacts such as "tearing" caused by the asynchronous video clocks.

DirectShow addresses this problem on several fronts. First, it allows any filter to supply a time-stamped reference clock to drive the filter graph. A video capture card, SMPTE time code reader, or other device capable of exposing a clocking signal can then be the system master. The time stamp is usually the system time, and since that is always kept current, a filter or application can determine the amount of time that has passed since the reference clock value was updated. This feature is very important for determining the existence of latencies and compensating for them.

Second, new facilities are provided for interstream rate matching, particularly with audio. This is done at three levels: coarse filtering of audio samples, dynamic hardware sample clock rate adjustment, and low-latency sample rate conversion using kernel mode filters. The technique selected depends on the required quality, hardware capabilities, and available computing horsepower.

Reference clock suppliers also provide notification services to other filters using events. Events are operating system objects that trigger when preset conditions are met, and can be configured in one-shot or multivibrator modes. The net effect of a triggered event is to unblock a waiting thread of execution in a filter, thus synchronizing that filter's processing with the clock source.

DirectShow's concept of time is important because it impacts synchronization and is the basis for the quality management mechanism described below. DirectShow and its applications maintain four distinct yet related time values:

Media Time: a temporal position within a seekable medium, i.e., the byte position within a data file.

Reference Time: absolute or "wall clock" time, established by a reference clock. It is always counting regardless of graph state.

Stream Time: the offset from the time the graph was started; also called relative reference time.

Presentation Time: the reference time at which media samples should be presented. This is tied directly to the reference time and is calculated by
the following formula:

[(Media Time - Starting Media Time) / Playback Rate] + Starting Reference Time

This value is used by a rendering filter to determine exactly when to render a sample.

The third synchronization technique used by DirectShow is a rich set of quality management controls, which are driven by the time model previously described. Based on comparisons of presentation times and reference time, filters can report buffer overflow, underflow, or watermark conditions to other filters in the stream so that adjustments can be made before samples must be dropped or duplicated. For example, in the case of video capture to disk, a compressor filter can be told to increase its compression ratio by the downstream filewriter filter if the disk subsystem falls behind. When the "flood" condition corrects itself, the compressor can be told to restore its previous settings.

Figure 4. DirectShow components.

Editing Support

Editing applications have additional requirements beyond basic media handling. One of the most basic is support for nonlinear playback of material from multiple sources. Since multiple source filters can exist in a filter graph, an application can easily mix material sourced from an audio video interleaved (AVI) file, a wave (WAV) file, an MPEG-1 (MPG) file, and even a QuickTime MOoVie file. Additionally, DirectShow provides interfaces that allow for cut-list-driven playback through a low-level technique known as dynamic graph reconstruction (Fig. 5).

Figure 5. Dynamic graph reconstruction.

Playback segments can be divided into reusable portions of filter graphs, or graphlets. In the diagrammed case, the graphlets consist of source filters, parsers, and stub filters that are connected to downstream target filters by the filter graph manager at reference times determined by a DirectShow playlist. This technique is also quite useful for reordering complex effects and compositing filter graphs, where hardware codecs or effects processors must appear in different positions at different reference times.

The internal and target filters shown in Fig. 5 have another benefit. They allow an application to directly remove or introduce media samples from a running filter graph. This can be very useful if an application wants to read time code values directly or source images to and sink from non-DirectShow applications.

Other higher level compositional notations are currently under consideration for applications that do not require such fine control of filter graph construction.

Improvements have also been made to the AVI file format to support video editing. Extremely large files can now be created, and the resulting large frame indexes are distributed throughout the file for efficient access. Other refinements include support for video fields, stream interleave control, and SMPTE time code and film source information.

Since video and audio data are typically supplied to editing systems on videotape, and these systems output to videotape as well, frame accurate control of external devices is yet another
requirement. DirectShow defines interfaces for complex machine control, and filter implementations for a wide selection of professional and industrial videotape machines will soon be available. SMPTE time code is also defined as a standard media type and control interfaces are defined to enable time code acquisition and generation.

Sample Filter Graphs

Figures 6 and 7 show sample filter graphs for various applications possible with DirectShow. In Fig. 6 the important filter is the capture source filter shown on the left. It connects via private interfaces to the Microsoft Stream Class Driver, which in turn talks to the capture hardware through vendor-supplied minidrivers. While only a single source filter is shown, it can also be broken down into separate audio and video capture filters. The tee filters are also interesting because they can be installed ad infinitum, up to the practical processing limit of the host CPU.

Figure 6. Sample filter graph: video capture.

Distributed applications (Fig. 7) are installed in many facilities today, with centralized image or stream storage supplied by remotely located file servers. For news editing and even episodic television editing, which are both multi-editor and short-turnaround time programs, DirectShow enables the simultaneous capturing and playback required for efficient throughput.

Figure 7. Sample filter graph: distributed application.

High-Performance Applications: WDM Streaming

DirectShow addresses a great many of the systems issues for streaming media applications. Its services and clients, however, run in user mode. In some situations, the latencies and nondeterministic behavior of a user mode process can get in the way of achieving "really timely" performance. In these cases, applications can be developed using a new, high-performance Microsoft technology: the WDM Connection and Streaming Architecture, or WDM Streaming for short. WDM Streaming extends DirectShow streaming services down into the kernel, where filters can take advantage of more predictable scheduling services and can pass media samples directly between drivers without costly mode transitions.

Some background on performance issues with respect to the Windows NT operating system is appropriate at this point. There are many priority levels at which various software components will execute, and a given system's performance is directly related to how the various tasks are distributed among them. At the risk of oversimplification, the order of execution priority from highest to lowest goes something like this:

Interrupt Service Routine (ISR): usually triggered by a hardware event and runs to completion unless interrupted by a higher priority interrupt. ISRs are typically very short pieces of code that usually read registers or collect data for later processing.

Deferred Procedure Call (DPC): usually scheduled by an ISR and runs to completion unless interrupted by an ISR. DPCs usually handle non-time-critical processing.

System Thread: can also be scheduled for lengthier driver processing. System threads are preemptible by other system threads and can be interrupted by ISRs.

User Mode Thread: the standard execution object for an application program. User mode threads are assigned one of 32 priority levels and can be both interrupted and preempted. User mode threads generally have the least deterministic behavior and as such are poor choices for time-sensitive applications.

The first three execution components operate strictly in the kernel, and thus are generally not available to application programs.

In a typical DirectShow playback graph, incoming samples from a piece of hardware must cross the boundary into user mode for processing by the downstream filters (Fig. 8). The samples must then cross back again into kernel mode for output. And as previously stated, user mode processes are scheduled at relatively low priorities and can incur varied and unacceptably long latencies if higher priority and time-consuming tasks are also running.

Figure 8. WDM Streaming vs. DirectShow.

In a WDM Streaming playback graph, kernel mode drivers appear to applications via a DirectShow proxy filter. Microsoft supplies a generic proxy filter that translates a set of control interfaces into kernel mode commands called input/output controls (IOCTLs). The proxy is also responsible for marshalling the kernel filter's responses back to the application. WDM Streaming also defines another transport called device input/output interface (IDevIO). IDevIO is a memory-based transport similar to DirectShow's, but it enables direct kernel driver-to-kernel driver media streaming. The net result is that high-priority and low-latency execution components are now made more available to applications through the WDM Streaming architecture.

A natural use of WDM Streaming is for real-time audio echo cancellation in desktop teleconferencing applications. This has latency requirements in the 1-msec range. While very difficult to achieve in a user mode filter graph, this is certainly possible with WDM Streaming.

Such high performance, however, is not without its costs. Developing kernel mode components is one of the more difficult software development challenges and carries with it the potential for compromising overall system stability if not done correctly.

Conclusion

Microsoft DirectShow and its related multimedia technologies provide a rich, robust, and extensible platform for high-performance media applications. The acceptance of these technologies in the professional tools market is made evident by the substantial applications expected to be available in 1997.

Acknowledgments

The authors would like to thank the Microsoft teams responsible for the ongoing development of the technologies described herein. We would also like to thank the many independent software and hardware vendors that contributed valuable input and feedback to Microsoft.

THE AUTHORS

Amit Chatterjee completed his Bachelor of Technology Degree with honors in electronics and electrical communications from the Indian Institute of Technology, Kharagpur, in 1984. He received the President of India Gold Medal for being judged the best graduating student for the period 1979-1984. Amit completed his Master of Technology Degree in computer engineering from the same institute in 1985 and received the Technology Alumni Association Gold Medal. Chatterjee has been working at the Microsoft Corp. since 1988. During this period he has been part of the Windows development team that shipped Windows 95, 3.1, 3.0, and 2.1. He is currently the development lead for Microsoft's DirectShow multimedia architecture.

Andrew Maltz is an entertainment technology consultant in Los Angeles. He pioneered nonlinear video editing systems as a principal developer of the Emmy award-winning Ediflex and its digital successor, developed computer animation and digital audio systems, and has worked in film and television production and postproduction. Maltz's current work is in media engines for the Microsoft Windows platforms, digital television equipment for film mastering, and video compression for high-quality desktop applications. Maltz has a BSEE from SUNY Buffalo and is a member of several SMPTE Working Groups.
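As a supplementary illustration, the presentation time formula given under Professional Application Issues can be worked through numerically. The sketch below is not DirectShow code: the function names `presentation_time` and `render_delay` are hypothetical, and it assumes all four quantities are expressed in seconds, which the paper does not specify.

```python
def presentation_time(media_time, start_media_time, playback_rate, start_reference_time):
    """Evaluate the paper's formula:
    [(Media Time - Starting Media Time) / Playback Rate] + Starting Reference Time.
    Units (seconds) are an assumption made for this sketch.
    """
    if playback_rate == 0:
        raise ValueError("playback rate must be nonzero")
    return (media_time - start_media_time) / playback_rate + start_reference_time


def render_delay(presentation, reference_now):
    """Hypothetical helper: how long a renderer would still wait before
    presenting a sample. A negative result means the sample is already late."""
    return presentation - reference_now


if __name__ == "__main__":
    # A sample 10 s into the clip, played at double speed, in a graph whose
    # playback started when the reference clock read 100 s.
    due = presentation_time(12.0, 2.0, 2.0, 100.0)
    print(due)                       # 105.0
    print(render_delay(due, 104.5))  # 0.5
```

At double speed the 10 seconds of media elapse in 5 seconds of reference time, so the sample falls due at 105 s on the reference clock. A comparison like `render_delay` is the kind of presentation-time versus reference-time check that the quality management discussion relies on.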

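The relationship between filters, pins, media samples, and the local memory transport described under Pins, Filters, and Graphs can also be shown with a toy model. This is a minimal sketch and not the DirectShow API: every class and function name here (`MediaSample`, `InputPin`, `OutputPin`, `SourceFilter`, `InvertTransform`, `CollectingRenderer`, `build_graph`) is hypothetical, and passing Python object references merely stands in for the pointer passing that the paper says avoids memory copies.

```python
class MediaSample:
    """Time-stamped unit of media data passed between filters."""
    def __init__(self, timestamp, data):
        self.timestamp = timestamp
        self.data = data  # a mutable bytearray shared by reference


class InputPin:
    """Sinks samples and hands them to the filter that owns the pin."""
    def __init__(self, owner):
        self.owner = owner

    def receive(self, sample):
        self.owner.process(sample)


class OutputPin:
    """Sources samples to whichever input pin it is connected to."""
    def __init__(self):
        self.peer = None

    def connect(self, input_pin):
        self.peer = input_pin

    def deliver(self, sample):
        if self.peer is None:
            raise RuntimeError("output pin is not connected")
        self.peer.receive(sample)  # the reference is passed; no copy is made


class SourceFilter:
    """Source category: produces one time-stamped sample per frame."""
    def __init__(self, frames):
        self.frames = frames
        self.output = OutputPin()

    def run(self):
        for index, data in enumerate(self.frames):
            self.output.deliver(MediaSample(timestamp=index, data=data))


class InvertTransform:
    """Transform category: inverts each byte in place, then forwards."""
    def __init__(self):
        self.input = InputPin(self)
        self.output = OutputPin()

    def process(self, sample):
        for i, value in enumerate(sample.data):
            sample.data[i] = 255 - value
        self.output.deliver(sample)


class CollectingRenderer:
    """Renderer category: here it simply records what it was given."""
    def __init__(self):
        self.input = InputPin(self)
        self.rendered = []

    def process(self, sample):
        self.rendered.append(sample)


def build_graph(frames):
    """Manually assemble source -> transform -> renderer."""
    source = SourceFilter(frames)
    transform = InvertTransform()
    renderer = CollectingRenderer()
    source.output.connect(transform.input)
    transform.output.connect(renderer.input)
    return source, renderer
```

In this model, `build_graph` plays the role of an application assembling a filter graph by hand. A real filter graph manager would additionally negotiate media types between pins and handle run, stop, and pause, none of which is modeled here.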