Infrastructure architecture essentials, Part 5: Content delivery and distribu... http://www.ibm.com/developerworks/library/ar-infraarch5/

Infrastructure architecture essentials, Part 5: Content delivery and distribution network design

Concepts and techniques

Sam Siewert, Principal Software Architect/Adjunct Professor, University of Colorado

Summary: Discover the methods for content delivery and distribution of Web-based media in the Web 2.0 world.

Date: 11 Nov 2008
Level: Intermediate

The concept of Web caches has existed as long as the Web and evolved from storage of frequently accessed files on an individual's personal computer or proxy server to Internet-based Web cache servers provided by companies as a paid subscription service for content providers. As multimedia has become more prevalent on the Web, content delivery networks (CDNs) have become a critical component of the Internet and an enabler of Web 2.0 applications like IPTV, mobile Web and TV devices, and content-rich Web databases such as Wikimedia.

Frequently used acronyms

ACL: Access control list
GUI: Graphical user interface
HTML: Hypertext Markup Language
HTTP: Hypertext Transfer Protocol
I/O: Input/output
ISP: Internet service provider
RAID: Redundant array of independent disks
SOA: Service-Oriented Architecture
UDP: User Datagram Protocol
URI: Uniform resource identifier
XML: Extensible Markup Language

Continuing the Infrastructure architecture essentials series, this fifth article provides an overview of CDNs and distribution networks. It shows how they have evolved from simple Web caches for optimizing access to content on the Web to much more sophisticated and even intelligent content-management systems. Most of us who have used the Web since the beginnings of Web browsing (circa the early 1990s) recall that browsers have always had the option to cache frequently or recently accessed Web content. By the late 1990s, many new Web-based companies had launched Internet-based Web cache servers as a paid subscription service for Web content providers. This network of cache servers became known as a content distribution network and has vastly improved the performance of user access to Web content. The basic distribution services have since evolved significantly to include security for content providers and multimedia streaming. The result has become known as a content delivery network, with the focus extended from distribution alone to delivery services for rich content, including video, audio, and multimedia databases.

For a Web services architect designing a content-delivery system today, the scope of a CDN system is daunting both in scale and in the diversity of services, media types, and access performance these systems will be expected to provide. One of the stated goals of Web 2.0 (the second-generation World Wide Web) is collective intelligence. Although collective intelligence is perhaps an academic goal of Web 2.0, significant business opportunity exists and is proven by the emergence of social networks, viral video, IPTV, mobile Web services and media, and advertisement insertion in all these. Clearly, content providers will be looking to well-designed CDNs more and more to reach users. This article arms the systems designer and solution architect with methods for success in the design of content distribution and delivery systems with an eye toward the future of what these systems may become.

The reach and health of the Internet today

Collective intelligence

Over time, the Internet has evolved from a simple network on which to share files and communicate through e-mail to the World Wide Web of information, which quickly became backed by sophisticated cache servers and eventually CDNs. Given the wealth of content, human knowledge, and interaction on the Web, it has become a nexus for collective intelligence. Exactly what collective intelligence is can be debated, but most would agree that intelligence requires rich media that matches human sensory experience and cognition, including video, audio, and text that can be used interactively. Web 2.0 promises to enrich content with much more real-time, high-definition video and audio along with massive knowledge databases that can support new levels of human coordination, cooperation, collaboration, and cognition.

The Massachusetts Institute of Technology (MIT) Center for Collective Intelligence states that, "Our basic research question is: How can people and computers be connected so that—collectively—they act more intelligently than any individuals, groups, or computers have ever done before?" From this stated goal, it is clear that access, interaction, and the reach of Web-based content and services will be critical for collective intelligence.

Between 2000 and 2008, use of the Internet has grown by more than 100 percent in North America, by more than 200 percent in Europe, and by more than 400 percent in Asia. Explosive growth in Africa and the Middle East has exceeded 1000 percent in both of those regions. Despite this continued high growth rate, saturation has still not been achieved even in North America, where more than 70 percent of the population uses the Internet (see Resources ). Of course, access and reach alone are not the only measures of success. Quality of the Internet experience measured by broadband data rates, access latency, and richness of content is also important.

A list of the worldwide top 500 Web sites reveals an interesting trend (see Resources). Looking at the Alexa ranking, it's not surprising that the top two sites are Web search services. Perhaps more revealing is the fact that in the top 10 you now find social networking, encyclopedia, blogger, and viral video Web services. It is also interesting to note that many of the emerging sites are ingesting user content, including text from bloggers as well as images (for example, photobucket.com) and video (for example, youtube.com). The trend is clear: the second Internet revolution, often referred to as Web 2.0, will include rich content, greater user interaction, real-time media, and more user collaboration, such that users not only consume content but also generate it.

Elements of a CDN

Web services have evolved from scalable Web servers, as shown in Figure 1, to much more complex systems such as the Wikimedia CDN (see Resources). The evolution from single-site scalable Web servers to distributed content servers, as shown in Figure 2, has numerous advantages.

Figure 1. Example of traditional Web services

Figure 2. Example of content delivery services

First, the content servers can be placed geographically so that clients in a given region are served with lower latency and less traffic encounters congestion on backbone networks. The early CDNs were often called edge servers because of their placement closer to users, and they were considered distribution servers because one of the main goals was to eliminate network congestion and to improve user experience. More recently, as the richness of Internet content has grown to include more real-time media, CDNs have become known as content delivery networks, because many of these systems are becoming more specialized to provide not only better distribution of files but also better streaming of real-time media.

Another, more recent trend has been the extension of CDNs to the users themselves such that personal computers can participate in distribution with peer-to-peer (P2P) services (see Resources). Among the major decisions for a CDN designer are the degree to which the CDN will include P2P features, as opposed to being a more traditional distributed system of servers, and the forms of media the CDN will handle. No matter what the goals are for CDN design, a CDN should include the following basic services:

Web caches: For example, Squid
Database servers: For example, MySQL or commercial databases such as IBM® DB2®
Web servers: For example, Apache
CDN management: Content authoring, transcoding, and management tools; IT configuration, monitoring, and management; bug tracking; and so on
High-access storage: RAID arrays, solid-state disk (SSD) drives, and P2P distributed storage and access
Network attached storage (NAS) heads: File servers to support heterogeneous clients, including Linux®, UNIX®, Mac OS X, and Windows®

It is clear that centralized Web services, as shown in Figure 1, will lead to network congestion as worldwide clients are routed through backbone networks to reach a single-server site. Even if that Web server can be scaled in terms of network bandwidth, processing, and storage I/O in order to keep up, the user experience will suffer because of wait times in queues and backbone network loading. One of the very first solutions to this problem in Web 1.0 was to use local Web caches with individual client browsers. This worked well when total content on the Internet was not so rich and varied; but today, most users browse a much broader range of content on many more sites than they did in the 1990s. Such browsing of extensive content led to the use of proxy servers that would cache Web content for major client locations like a university (and in some cases would filter content access). Neither client caches nor proxy caches solved the problem of wait time for users of the most popular content providers as Internet access and total content have grown. The evolution of CDNs was required to better distribute loading over the Internet as a whole for the most popular sites.
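
The caching idea behind browser and proxy caches can be illustrated with a minimal sketch. It assumes a least-recently-used (LRU) eviction policy and a caller-supplied fetch_from_origin function; both are illustrative choices rather than anything prescribed by a particular browser or proxy, and the class name LRUWebCache is hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class LRUWebCache:
    """Tiny in-memory cache keyed by URL with least-recently-used eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._store: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, url: str, fetch_from_origin: Callable[[str], bytes]) -> bytes:
        """Return the cached body for url, fetching from the origin on a miss."""
        if url in self._store:
            self.hits += 1
            self._store.move_to_end(url)  # mark as most recently used
            return self._store[url]
        self.misses += 1
        body = fetch_from_origin(url)  # hypothetical origin fetch (assumption)
        self._store[url] = body
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return body
```

A small cache like this only helps while the set of popular objects fits in it, which is one way to see why a broader range of content per user reduced the benefit of purely local caches.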

By comparison, Figure 2 shows a simple geographical distribution of CDNs to replace the centralized Web servers shown in Figure 1. This new design adds some complexity to overall Web site management, because it requires content distribution within the distributed set of CDN servers from content origin servers. However, CDNs can distribute content over private networks with guaranteed quality of service (although the Internet might still be used) and can deliver that content in bulk, with interactive access to it pushed to the CDN edge servers. Clients can still connect to one domain and are routed to their regional CDNs for interaction and access. With careful forecasting and measurement of access statistics at each regional CDN edge server, you can scale the edge servers independently to meet local demand and to relieve the Internet from backbone congestion. Furthermore, you can schedule distribution of new content from origin servers during non-peak usage times, and use the variation of time zones to balance network and server loading over time and location.
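
As a rough sketch of the two ideas above, the following Python routes a client to a regional edge server and finds the UTC hours at which each edge is in a local off-peak window for content pushes. The site names, region codes, fixed UTC offsets, and the 01:00 to 05:00 window are all assumptions made for illustration; a production CDN would more likely rely on DNS-based or measured-latency routing and real traffic statistics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EdgeSite:
    name: str
    region: str
    utc_offset_hours: int  # fixed offset; ignores daylight saving time


# Hypothetical edge sites (assumptions for illustration only).
EDGES = [
    EdgeSite("edge-us", "NA", -7),
    EdgeSite("edge-eu", "EU", 1),
    EdgeSite("edge-asia", "AS", 8),
]


def route_client(client_region: str, edges: List[EdgeSite], default: EdgeSite) -> EdgeSite:
    """Pick the edge in the client's region, falling back to a default site."""
    for edge in edges:
        if edge.region == client_region:
            return edge
    return default


def local_hour(utc_hour: int, edge: EdgeSite) -> int:
    """Convert an hour of the day in UTC to the edge site's local hour."""
    return (utc_hour + edge.utc_offset_hours) % 24


def is_off_peak(utc_hour: int, edge: EdgeSite, start: int = 1, end: int = 5) -> bool:
    """True when the edge's local hour falls in the assumed off-peak window."""
    return start <= local_hour(utc_hour, edge) < end


def push_schedule(edges: List[EdgeSite]) -> Dict[int, List[str]]:
    """Map each UTC hour to the edges that can receive origin pushes then."""
    return {h: [e.name for e in edges if is_off_peak(h, e)] for h in range(24)}
```

Because the off-peak windows fall at different UTC hours, the origin servers can stagger bulk pushes across regions instead of sending everything at once.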

Skills and competencies: Configuration

Nodes in a CDN include cached content as well as stored content and often a database such as MySQL. The stored content requires ingestion of large volumes of video or audio transport streams such as MPEG-2 and MPEG-4, as well as content-rich XML and HTML with JPEG images embedded, for example. Ingestion is most often done through a NAS head on a SAN RAID for large-scale CDNs. The NAS head provides content creators, post-production editors, and content managers or distributors the client access to scalable file systems that they need.

Content on a CDN often includes hundreds or thousands of video or audio transport files that are many gigabytes in size, so storage needs are most often terabyte scale and, in some cases today, petabyte scale. As discussed in "Infrastructure architecture essentials, Part 2: Find, avoid, and eliminate system bottlenecks" (see Resources), many CDN designers are leveraging emerging technologies like tiered storage and SSDs with scalable RAID storage for content editing, management, and distribution NAS clusters. The richness of content on CDNs continues to grow with incorporation of high-definition video, which is often 20 to 50Mbps in a compressed MPEG-2, MPEG-4, or VC-1 transport stream, compared to standard definition, which is typically 3.75Mbps or less. Eliminating I/O and storage access bottlenecks is therefore becoming one of the most important design considerations for scalable CDNs intended to host rich content.
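
To make the storage impact of these bit rates concrete, here is a small worked calculation. It assumes constant-bit-rate streams, decimal units (1 GB = 10^9 bytes, 1 TB = 1000 GB), and a two-hour title length; the function names and the title length are illustrative assumptions.

```python
def gigabytes_per_hour(mbps: float) -> float:
    """Storage in decimal GB for one hour of a constant-bit-rate stream."""
    bits = mbps * 1_000_000 * 3600  # bits per second times seconds per hour
    return bits / 8 / 1_000_000_000


def titles_per_terabyte(mbps: float, hours_per_title: float = 2.0) -> int:
    """Whole titles that fit in 1 TB (1000 GB) at the given bit rate."""
    return int(1000 / (gigabytes_per_hour(mbps) * hours_per_title))


if __name__ == "__main__":
    for label, rate in [("SD", 3.75), ("HD low", 20.0), ("HD high", 50.0)]:
        print(label, gigabytes_per_hour(rate), "GB/hour,",
              titles_per_terabyte(rate), "two-hour titles per TB")
```

Under these assumptions, one hour of standard definition at 3.75Mbps needs about 1.7 GB, while one hour of high definition needs 9 GB at 20Mbps and 22.5 GB at 50Mbps, which is why a library of a few thousand HD titles quickly reaches tens of terabytes or more.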

Tools and techniques: Web cache configuration

The Squid Web cache is a great tool for CDNs to better serve clients and to relieve Internet and intranet network congestion for content-rich Web services hosted on Windows or Linux systems. The Squid Web cache service provides CDN configuration options that include:

Access control: More than 25 different ACL types, ranging from source IP addresses to user names and TCP ports.
File caches and file system tuning: squid.conf directives for specifying cache directories and schemes (file system type such as UFS, the UNIX file system), sizing, and cache policies. Careful tuning of file systems and block device interfaces is critical. For example, you can increase the read-ahead for a block interface to storage with blockdev --setra 4096, which sets read-ahead to 4096 512-byte sectors, or 2MB. Likewise, file systems like XFS may be better adapted to content hosting than traditional UNIX file systems.
Interception and Squid-to-Squid interfaces: Traffic on the CDN is routed to Squid without any special client configuration; instead, routers in the CDN are configured to divert HTTP connections to the machine on which Squid is running. This also requires configuration of iptables on Linux to handle intercepted connections.
URI redirectors: Using Squid, URI redirectors can modify a user's HTTP page request to point to a new page.
Authentication: Squid provides several methods to manage user names and passwords for Web interfaces, including the basic htpasswd methods on common Web servers as well as more advanced methods.
Logs and monitoring: Squid operations can be managed through log files, including cache.log, access.log, store.log, and additional optional logs that you can configure. These log files provide warnings to indicate when Squid is not configured or running efficiently.
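
The read-ahead arithmetic in the file system tuning item can be checked with a short sketch. blockdev --setra takes its value in 512-byte sectors; the helper names and the device path /dev/sdb below are hypothetical, and the command string is only built, not executed.

```python
SECTOR_BYTES = 512  # blockdev --setra counts 512-byte sectors


def readahead_bytes(sectors: int) -> int:
    """Bytes of read-ahead for a given --setra sector count."""
    return sectors * SECTOR_BYTES


def sectors_for_mib(mib: int) -> int:
    """Sector count to pass to --setra for a read-ahead of mib MiB."""
    return mib * 1024 * 1024 // SECTOR_BYTES


def setra_command(mib: int, device: str) -> str:
    """Build (but do not run) the blockdev command line for a device."""
    return f"blockdev --setra {sectors_for_mib(mib)} {device}"


if __name__ == "__main__":
    print(readahead_bytes(4096))          # bytes of read-ahead for 4096 sectors
    print(setra_command(2, "/dev/sdb"))   # /dev/sdb is a hypothetical device
```

Larger read-ahead values tend to suit the long sequential reads typical of video files, but the right value depends on the storage hardware and workload, so it is worth measuring rather than assuming.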

Content delivery vs. content distribution

As the Web has evolved from the original text and still-image character typical of Web 1.0 in the 1990s to much richer real-time, high-definition video and audio content in the new millennium, Web 2.0 is envisioned not as distributing files and images to personal computers but rather delivering real-time media to a wide range of desktop and mobile devices.

Multimedia content delivery and interactive systems

As CDNs evolve from file-access distribution networks to multimedia delivery networks, the complexity of designing a CDN becomes greater. The same principles that have made CDNs successful for file distribution will be helpful with content ingestion and multimedia streaming, because both benefit tremendously from lower latency, less congestion, and better overall load balancing. The CDN designer and CDN IT personnel will have to learn new standards and tools for multimedia content management, distribution, authoring, transcoding, and monitoring. Mobile device clients complicate this matter further. For example, the Digital Video Broadcasting-Handheld (DVB-H), MediaFLO, and Multimedia Broadcast Multicast Service (MBMS) standards for delivery of Web content and digital media to mobile devices have emerged and are in competition. Their presence may require a CDN to support multiple mobile media distribution standards and data formats.

Likewise, video encoding and transport streams may include MPEG-2, MPEG-4, VC-1, or the open Ogg/Theora encoding format. Audio may similarly be encoded as MP3 or AC-3, not to mention high-definition surround sound formats. True streaming, where digital video and audio content is delivered in real time, is far different from the present-day viral video download-and-start-playback-as-you-go approach. The emergence of IPTV with video on demand in fact goes beyond current CDN capability. But CDNs can be extended and will certainly play a critical role in providing content to real-time, on-demand headends that provide true real-time streaming. Interactive game networks and social networks present new requirements for CDNs, which can play a key role in distributing user content and new game titles globally.

Skills and competencies: Designing for video and audio

Content management includes many steps and processes beyond the CDN, including post-production and distribution processing such as encoding transport streams or transcoding them between formats or down-converting them from higher to lower bit rates. Ultimately, the CDN must have content-ingestion interfaces to clients that support preparation of content for distribution. Much of the content will be HTML and XML, but richer streaming content will also include transport streams.

Transport streams encapsulate encoded program streams such as MPEG-2, which in turn combine elementary streams of video or audio. A program stream may include an elementary video stream and one or more audio streams, for example. Likewise, a transport stream may encapsulate a single program or multiple program streams in a Multiprogram Transport Stream (MPTS). These transport streams typically support on-demand streaming rather than download-and-stream encapsulations, so many CDNs host file formats suited to media players such as Windows Media Player. Media players most often also include a digital rights management (DRM) capability to protect content so that it is only consumable by users who in fact have purchased the content.
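
For readers who want to see how streams are multiplexed at the packet level, here is a minimal sketch that reads MPEG-2 transport stream packet headers. The details used here (188-byte packets, a 0x47 sync byte, and a 13-bit packet identifier, or PID, that tags each packet with the stream it belongs to) come from the MPEG-2 systems standard rather than from this article, and the function names are illustrative.

```python
from typing import Dict

TS_PACKET_SIZE = 188  # fixed MPEG-2 transport stream packet length in bytes
SYNC_BYTE = 0x47      # first byte of every transport stream packet


def parse_ts_header(packet: bytes) -> dict:
    """Decode a few fields from the 4-byte header of one TS packet."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a transport stream packet must be 188 bytes")
    if packet[0] != SYNC_BYTE:
        raise ValueError("missing 0x47 sync byte")
    return {
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],  # 13-bit stream identifier
        "continuity_counter": packet[3] & 0x0F,
    }


def count_packets_by_pid(stream: bytes) -> Dict[int, int]:
    """Count whole packets per PID; a trailing partial packet is ignored."""
    counts: Dict[int, int] = {}
    for off in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pid = parse_ts_header(stream[off:off + TS_PACKET_SIZE])["pid"]
        counts[pid] = counts.get(pid, 0) + 1
    return counts
```

Running count_packets_by_pid over a test file such as the W6RZ patterns mentioned later gives a quick view of how many distinct streams a transport file carries. The sketch assumes the data is already packet-aligned and does not attempt resynchronization.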

Methods to deliver content end to end are evolving at a rapid pace, especially for mobile content distribution over third-generation cell phone networks to handheld and portable devices. Most of these delivery and streaming services include proprietary headends and user devices that will require integration into a larger overall CDN design. The article "SoC drawer: SoCs and the digital content revolution" (see Resources ) may be helpful to readers wanting to dig deeper into delivery of real-time media to more embedded user devices.

Tools and techniques: VLC Server

VLC Server provides a great Windows- or Linux-based open source tool for transcoding and providing streaming services over UDP. Although most content-delivery systems include DRM and proprietary encoding, VLC Server provides a method to manage and deliver open source content. The amount of open source content available is small, but open source servers and tools like VLC Server combined with open source content such as W6RZ can be very helpful for testing CDNs and learning about management of real-time content delivery (see Resources ).

VLC bundles client and server together and provides GUI-based as well as scripted interfaces for streaming over UDP and for transcoding MPEG program and transport streams. CDN designers taking an open source approach or integrating both proprietary DRM-ed and open source content delivery may also want to keep up with the advancements of LinuxMCE (see Resources ).
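
To give a feel for what streaming a transport stream over UDP involves, the sketch below groups 188-byte transport stream packets into datagram payloads and sends them with a plain socket. It is not VLC's implementation: packing seven packets (1316 bytes) per datagram is a common convention assumed here because it fits within a typical 1500-byte Ethernet MTU, the function names are hypothetical, and the sender does no pacing, which a real streaming server would need in order to match the stream bit rate.

```python
import socket
from typing import Iterator

TS_PACKET_SIZE = 188
PACKETS_PER_DATAGRAM = 7  # 7 * 188 = 1316 bytes (assumed convention)


def datagram_payloads(stream: bytes,
                      packets_per_datagram: int = PACKETS_PER_DATAGRAM) -> Iterator[bytes]:
    """Yield payloads holding whole TS packets; a trailing partial packet is dropped."""
    chunk = TS_PACKET_SIZE * packets_per_datagram
    usable = len(stream) - (len(stream) % TS_PACKET_SIZE)
    for off in range(0, usable, chunk):
        yield stream[off:min(off + chunk, usable)]


def send_stream(stream: bytes, host: str, port: int) -> int:
    """Send the stream as UDP datagrams without pacing; return the datagram count."""
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for payload in datagram_payloads(stream):
            sock.sendto(payload, (host, port))
            sent += 1
    return sent
```

Because UDP provides no delivery guarantee, a receiver can use the continuity counter in each transport stream packet header to detect lost packets.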

P2P CDNs

The emergence of P2P networks became well known through the controversy surrounding early MP3 digital audio file-sharing networks and the digital rights management (DRM) of that content. P2P content distribution continues to be somewhat controversial, because it turns user computers into servers as well as clients, which may not be desirable for local IT groups trying to manage their client or server resources. However, CDNs clearly benefit from pushing content closer to the edge, and the natural result is moving content all the way to the user. Within an organization, P2P services may also be quite useful, because much of the content sharing will naturally be within co-located user groups. Likewise, many client systems sit idle during non-peak hours and could contribute compute, I/O, and storage resources that otherwise go to waste. Overall, it seems that P2P will have a place, but it must be balanced against security, resource contention, and DRM issues.

Looking ahead

The growth of CDN service providers and equipment vendors has been explosive as Web 2.0 emerges and is expected to continue. During the genesis of the Internet, the growth of ISPs was similar; today, ISPs have largely been consolidated into multiple-system operators (MSOs) run by cable, satellite, and telecommunications network companies. Traditionally, content has come from content providers like networks, Hollywood, or publishers; but clearly, the viral, P2P, and user content revolution is changing this pattern dramatically.

In general, Web 2.0 is imagined to be a much more interactive, more inclusive, and better user experience, with richer content coming not only from traditional content providers but from a much longer tail of content sources, including each and every one of us. This will reinforce the need for CDNs to ingest regional content and to distribute that content to other regional edge servers based on access patterns, as a much broader cross-section of humanity communicates and collaborates on the Internet. It is safe to say that CDNs are here to stay, will grow in importance, and will require very savvy designers as expectations for performance and content diversity rise and a much wider range of traditional and mobile clients makes use of CDNs.

Resources

Learn

Check out the other parts of the Infrastructure architecture essentials series:
Part 1, "Build a reliable yet inexpensive infrastructure architecture" (developerWorks, Sep 2008)
Part 2, "Find, avoid, and eliminate system bottlenecks" (developerWorks, Oct 2008)
Part 3, "System design methods for scaling" (developerWorks, Oct 2008)
Part 4, "Scalable enterprise systems management" (developerWorks, Oct 2008)

The IBM Systems Journal article, "A Web content serving utility" (P. Gayek, R. Nesbitt, H. Pearthree, A. Shaikh, and B. Snitzer, Vol. 43, No. 1), is a great resource for understanding the background, evolution, and state of CDNs as of 2004. CDN evolution has continued at a rapid pace since 2004, but this article really helps with fundamentals for any CDN architecture.

See an image of the Wikimedia CDN on Wikipedia.

The open source Squid-Cache for Web site content acceleration and distribution, which runs on UNIX and Windows systems, is a great place to learn about hierarchical Web caches and experiment with one. Squid-cache is used to accelerate access to Wikipedia content.

The Internet World Stats Big Picture Web site has global Internet reach and access statistics.

The global deployment of the Internet and rich content available to all has led Thomas Friedman to declare The World Is Flat in his well-known book on Internet collective intelligence and its social impact. In his follow-on book, Hot, Flat, and Crowded, it would appear that the value of Internet collective intelligence has arrived just in time to help humanity tackle difficult global challenges like climate change and the energy crisis.

For periodically updated activity on the global Internet, see the Internet Traffic Report , which has breakdowns by continent, country, and routers.

Charting of Web site access by Alexa provides insight into content and service access by global Internet users and by specific country .

Dig deeper into delivery of real-time media to more embedded user devices in the article " SoC drawer: SoCs and the digital content revolution " (Sam Siewert, developerWorks, Jan 2007).

The W6RZ home page provides MPEG-2 transport stream test patterns and tools.

LinuxMCE is a free, open source add-on to .

O'Reilly has a number of must-read books for anyone designing a CDN, including:
Programming Collective Intelligence: Building Smart Web 2.0 Applications (Toby Segaran, 2007)
Squid: The Definitive Guide (Duane Wessels, 2004)
Building Scalable Web Sites: Building, scaling, and optimizing the next generation of Web applications (Cal Henderson, 2006)
High Performance MySQL (Baron Schwartz, Peter Zaitsev, Vadim Tkachenko, Jeremy Zawodny, Arjen Lentz, and Derek J. Balling, 2008)
High Performance Web Sites: Essential Knowledge for Front-End Engineers (Steve Souders, 2007)

Get products and technologies

This list of CDN services and software from web-caching.com may be useful for content delivery designers considering outsourcing or building a CDN of their own.

The VLC tools and client/server can be downloaded from the VideoLAN Media Player Web site for Windows or Linux.

Deploying Web-based services and applications in an SOA can be expedited using IBM WebSphere® software.

Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2, Lotus®, Rational®, Tivoli®, and WebSphere.

Discuss

Check out developerWorks blogs and get involved in the developerWorks community .

Participate in the IT architecture forum to exchange tips and techniques and to share other related information about the broad topic of IT architecture.

About the author

Dr. Sam Siewert is a systems and software architect who has worked in the aerospace, telecommunications, digital cable, and storage industries. He also teaches at the University of Colorado at Boulder in the Embedded Systems Certification Program, which he co-founded in 2000. His research interests include high-performance computing, broadband networks, real-time media, distance learning environments, and embedded real-time systems.
