TUT7317 A Practical Deep Dive for Running High-End, Enterprise Applications on SUSE

Holger Zecha, Senior Architect, REALTECH AG, [email protected]

Table of Contents

• About REALTECH

• About this session

• Design principles

• Different layers which need to be considered


About REALTECH 1/2

REALTECH Software:

• Business Service Management

• Service Operations Management

• Configuration Management and CMDB

• IT Infrastructure Management

• Change Management for SAP

REALTECH Consulting:

• SAP Mobile

• Cloud Computing

• SAP HANA

• SAP Solution Manager

• IT Technology

• IT Infrastructure

About REALTECH 2/2

Our Customers

Manufacturing, IT services, Healthcare, Media, Utilities, Consumer products, Automotive, Logistics, Finance, Retail


The Inspiration for this Session

• Several performance workshops at customers

• Performance escalations at customers who migrated from UNIX (AIX, Solaris, HP-UX) to Linux

• This session presents the experience gained at these customers

• Helping the audience avoid performance degradation caused by:

– Significant design mistakes

– Wrong architecture assumptions

– Having no architecture at all

Performance Optimization: The False Estimation

Upgrading a server with CPUs that are 12.5% faster does not improve application performance by 12.5%.

• Identify the layer where you lose your performance

– e.g. the server (CPU) share of the overall response time is 37% of 500 ms, which is 185 ms

– No additional parallelization is necessary, because transactions are not waiting for CPU cycles

– The rest is wait time in the SAN and network layers

– Exchanging CPUs with 3.2 GHz for CPUs with 3.6 GHz clock speed

– CPU performance improvement of 12.5%

– Transaction time improves by 23.13 ms, to 476.87 ms

• An overall improvement of 4.63%, not 12.5%
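The slide's arithmetic can be reproduced in a short sketch. It follows the slide's simplifying model, in which the CPU share of the response time shrinks by the full 12.5% (the exact values are 23.125 ms, 476.875 ms and 4.625%, which the slide rounds):

```python
# Reproduce the slide's estimation: a 12.5% faster CPU does not give a
# 12.5% faster transaction, because only the CPU share (37% of 500 ms)
# of the response time benefits from the upgrade.
total_ms = 500.0
cpu_share = 0.37
cpu_speedup = 0.125              # 3.2 GHz -> 3.6 GHz, per the slide's model

cpu_ms = total_ms * cpu_share            # 185 ms spent on CPU
saved_ms = cpu_ms * cpu_speedup          # ~23.13 ms saved
new_total_ms = total_ms - saved_ms       # ~476.87 ms
overall_gain = saved_ms / total_ms       # ~4.63%, not 12.5%

print(f"saved: {saved_ms:.2f} ms, new total: {new_total_ms:.2f} ms, "
      f"overall gain: {overall_gain:.2%}")
```

The rest of the response time (SAN and network wait) is untouched by the CPU upgrade, which is why the overall gain is so much smaller than the component gain.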

Clarification: What is a High-End Enterprise Application?

• The (high) end of an application is defined by the smallest entity which cannot be sliced and scaled out

• An enterprise application is defined by its importance for the company's ongoing operation and therefore the company’s revenue

– Web shop and its related backend systems of an online retailer

– Supply Chain Management system from an automotive supplier

What this Session Covers and What Not

• Frameworks for running Linux will not be covered

– e.g. SUSE Cloud, vSphere clusters

– Evaluating a framework is part of a Proof of Concept

• Technical components which are needed for the design principles will be covered

– KVM

– File system layouts and the file systems themselves

– Storage architectures

– Memory configuration

– Network throughput optimization


Design Principles in Theory

• Slice and dice your system into appropriate layers

• Create a proper architecture for every layer and make sure you can reuse it

• Bring the layers together

• The design principles that work well for a High End Enterprise Application will also work for a Web Server

• The design principles that work well for a Web Server will not necessarily work well for a High End Enterprise Application

Considerations in Core Design, Illustrated Using the IO Scheduler as an Example

• Peak IOPS needed, and how to ensure that we can get them without too much administrative overhead in the file system layout

• Avoid “hot spots”

• Illustration: IO schedulers and their impact on IO performance

– CFQ scheduler vs NOOP scheduler

• Architecture examples:

– Linux Server running on VMware

– Linux Server running in Amazon EC2

– Linux Server running on bare metal

Avoiding Hot Spots: The IO Scheduler Example

• NOOP scheduler: 1 IO thread per block device

• Block device: an lvol is not a block device

• Striping: best results with sdc1, sdd1, sdc2, sdd2, …

• Stripe size: physical extent size

• Use partitions to increase performance

• Appropriate disk size to reduce administrative effort

[Diagram: logical volume DATA in volume group VG3, striped across partitions sdc1–sdc4 and sdd1–sdd4 on disks sdc and sdd]

Linux Server Core Design: The Oracle Example

Storage Layer Core Design

[Diagram: volume groups VG1, VG2 and VG3 on LUN 1, LUN 2 and LUN 3 (partitions sda1, sdb1, sdc), accessed through FC HBA1 and FC HBA2; LUNs 1–4 are spread across three physical storage array groups. Concept and Visio by Manuel Padilla and Holger Zecha]

VMware Core Design: 1st Usage of the Storage Layer Core Design

Bringing the Layers Together: The VMware Example

VMware Example

• SAN layer is part of VMware infrastructure

• Mapping between SAN and Linux disks based on VMware storage infrastructure

• Use virtualization-solution-specific optimizations for throughput

– Paravirtualized SCSI controllers

– Paravirtualized NICs

– …

Bringing the Layers Together: The Amazon EC2 Example

• Why is there no architecture diagram here?

• Because the server layer is the only tier we have access to, and therefore the only target for performance optimization

• No SCSI controllers – disks get mapped directly into guest OS (hwinfo | grep xvdb)

– E: DEVPATH=/devices/xen/vbd-51792/block/xvdf

– E: DEVNAME=/dev/xvdf

• Spread needed IOPS across sufficient disks and avoid hot spots

– Be aware of IO scheduler behavior

– Use appropriate striping

Amazon Example 1/2

• Amazon guarantee: a dedicated number of IOPS per volume (e.g. 3,000 IOPS)

• Question: What does this mean for a 1 TB database which needs 20,000 IOPS in peak OLTP operations?

• Answer: We need at least 7 volumes

• Considerations:

– Do appropriate striping across all volumes to get access to all 21,000 IOPS

– The NOOP IO scheduler is no bottleneck (7 × 145 GB disks), because we have enough disks, each with its own IO thread; therefore no additional partitions are needed!
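The volume count above follows directly from the per-volume IOPS guarantee; a minimal sketch using the slide's figures:

```python
import math

# Size the number of volumes from the required peak IOPS.
# The 3,000 IOPS/volume guarantee is the slide's example figure.
peak_iops = 20_000
iops_per_volume = 3_000

volumes = math.ceil(peak_iops / iops_per_volume)   # round up: partial volumes don't exist
usable_iops = volumes * iops_per_volume            # total IOPS available after striping

print(volumes, usable_iops)  # 7 21000
```

Striping across all 7 volumes is what actually makes the combined 21,000 IOPS reachable; without it, a single hot volume caps out at 3,000 IOPS.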

Amazon Example 2/2

• Question: The database size is now 10 TB – what changes?

• Answer: Use 10 × 1 TB disks and create 4 partitions on each disk to eliminate scheduler hot spots

• Considerations:

– The NOOP IO scheduler can become a bottleneck

– Create 4 partitions of 250 GB each on every disk

– Do appropriate striping across all partitions on all volumes to access all IOPS equally: sdc1, sdd1, sde1, sdf1, sdg1, sdh1, sdi1, sdj1, sdk1, sdl1, sdc2, sdd2, …
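The stripe order on this slide can be generated mechanically — partition 1 of every disk first, then partition 2 of every disk, and so on — so that each disk's single NOOP IO thread is loaded evenly:

```python
# Build the round-robin stripe order from the slide: 10 disks
# (sdc .. sdl), 4 partitions of 250 GB each per 1 TB disk.
disks = [f"sd{c}" for c in "cdefghijkl"]           # sdc, sdd, ..., sdl
partitions_per_disk = 4

stripe_order = [f"{disk}{part}"
                for part in range(1, partitions_per_disk + 1)
                for disk in disks]

print(stripe_order[:11])
# ['sdc1', 'sdd1', 'sde1', 'sdf1', 'sdg1', 'sdh1', 'sdi1', 'sdj1', 'sdk1', 'sdl1', 'sdc2']
```

This ordering is what you would pass as the physical volume list when creating the striped logical volume, so that consecutive stripes land on different disks.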

Bringing the Layers Together: Bare Metal Example – 2nd Usage of the Storage Layer Core Design

[Diagram: volume groups VG1–VG3 on LUN 1 and LUN 2 (partitions sda1, sdb1) and on partitioned disks sdc/sdd (sdc1–sdc4, sdd1–sdd4), accessed through FC HBA1 and FC HBA2; LUNs 1–4 are spread across three physical storage array groups. Concept and Visio by Manuel Padilla and Holger Zecha]

• Also no hot spots, because of Fibre Channel multipathing

• The CFQ scheduler avoids the IO scheduler bottleneck

Bare Metal Example

• Question: Do we need 2 disks for the data file system?

• Answer: No

– The IOPS from the storage are completely usable in one LUN

– The IO scheduler is no longer a bottleneck, because the CFQ scheduler is used

– CFQ scheduler has one scalable IO queue per process

• Considerations:

– Take data integrity into account!


What Layers Do We Have?

• SAN/NAS

• Network

• Server hardware

• Virtualization solutions

• Server configuration

• Application

SAN/NAS: What is Important in this Layer?

• IOPS

– Tiered storage

– Traditional layout

• Data integrity

– Depends on the technology used by your storage vendor

• Redundant access paths

– Wherever possible use redundant access paths for load balancing and high availability

– Multipath on the server level

– Server virtualization layer

– Storage virtualization

Network: How to Optimize Network Throughput?

• Jumbo frames or standard 1518-byte frames?

– It depends on your application. Measurements for different applications sometimes show advantages from using jumbo frames.

• Distribute your network traffic onto dedicated NICs

– Separation of user, data and backup LAN

• If you want to use VM relocation, also use a dedicated LAN for it

• Use bonding if possible
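The raw protocol-overhead difference between standard and jumbo frames can be estimated with a back-of-the-envelope sketch. It assumes bulk TCP/IPv4 traffic with full-sized segments and no TCP options; whether this theoretical gain materializes depends on the application, as noted above:

```python
# Wire efficiency of bulk TCP traffic for standard vs jumbo frames.
# Per-frame Ethernet overhead on the wire: preamble + SFD (8) +
# header (14) + FCS (4) + interframe gap (12) = 38 bytes.
ETH_OVERHEAD = 8 + 14 + 4 + 12   # bytes per frame on the wire
TCPIP_HEADERS = 40               # IPv4 + TCP headers, no options

def efficiency(mtu: int) -> float:
    payload = mtu - TCPIP_HEADERS            # application bytes per frame
    return payload / (mtu + ETH_OVERHEAD)    # share of wire bytes that is payload

std = efficiency(1500)    # standard frames (1518 bytes incl. header/FCS)
jumbo = efficiency(9000)  # jumbo frames

print(f"standard: {std:.1%}, jumbo: {jumbo:.1%}")
```

The theoretical gain is only a few percent of throughput; the bigger practical effect of jumbo frames is usually the reduced per-packet CPU load, which is why measurements vary per application.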

Server Hardware

• The number of CPU cores and the clock speed do not necessarily guarantee a linear performance gain

• Pay attention to crucial features of individual components

– CPU – implemented virtualization features

– Memory – error correction code

• New: Isolation features

– e.g. physical hardware partitioning on Fujitsu PRIMEQUEST servers

– 2 sockets per board (30 cores)

– 3TB RAM per board

– 4 boards in total

– Bare metal and virtualization on one physical server

Virtualization Overview

Virtualization Solutions 1/5: Hardware Partitions, KVM, XEN, VMware, Hyper-V, Containers

• Number of supported CPUs for guest systems is no performance indicator

– Take NUMA into account

– Take maturity of virtualization solution into account

• Hardware Partitions

• Hypervisor, fully virtualized

– Every guest OS system call is captured and translated into a system call of the host

– System call translation from the guest OS to the host causes some overhead

– Every OS can be run on fully virtualized servers

Virtualization Solutions 2/5: Hardware Partitions, KVM, XEN, VMware, Hyper-V, Containers

• Hypervisor, paravirtualized

– Certain guest OS system calls bypass the virtualization layer and access the host hardware directly

– Paravirtualized guests need a dedicated paravirtualized kernel and/or drivers for the guest OS (e.g. VMware Tools)

– Paravirtualized guest OS system calls are usually faster, because they do not need to be translated by the virtualization layer

– Different virtualization solutions support different levels of paravirtualization

Virtualization Solutions 3/5: Hardware Partitions, KVM, XEN, VMware, Hyper-V, Containers

• XEN Hypervisor

– Hypervisor which runs on hardware

– Paravirtualized guests

– Hardware virtual machines

– Paravirtualized hardware for certain application-specific usage types, e.g. SCSI bus sensing for failover cluster solutions

• KVM Hypervisor

– The KVM hypervisor is implemented as a kernel module

– Allows shared usage of the hardware (server and hypervisor)

– Paravirtualized device drivers

Virtualization Solutions 4/5: Hardware Partitions, KVM, XEN, VMware, Hyper-V, Containers

• VMware Hypervisor

– Hypervisor which runs on hardware

– Paravirtualized device drivers

– RDM or VMDKs available for guests

• Hyper-V Hypervisor

– Paravirtualized

– Part of Windows Server and Windows 8

– Pass through disks and VHD disks

• Containers

– Special solution for virtualization on OS level

Virtualization Solutions 5/5: Gartner Magic Quadrant

• Breakdown of technologies:

1. VMware

2. Hyper-V

3. XEN

4. Linux Containers

5. KVM

• Reflects our experience regarding stability and performance

Performance Measurement: KVM Compared to Physical Servers

[Chart: runtime in seconds of different OLTP transactions, comparing bare metal + SSDs, bare metal, a KVM virtualized server with paravirtualized NICs, and a plain KVM virtualized server]

Linux Containers

• Currently not relevant in high-end environments because of the performance requirements of these applications

• Parallels is a proprietary solution

– Mature

– Stable

– Never seen running productive high-end enterprise applications until now

• LXC/Docker

– Not mature enough for running high-end enterprise applications

Server Configuration

• Adjust kernel parameters for memory management

– Use huge pages if possible

• Adjust network parameters for tuning network throughput

• Create a well-designed file system layout for optimized IO throughput, avoiding hot spots in every layer

• Reduce the number of page-out and page-in operations if the application does not depend on paging
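As an illustration of the huge pages recommendation, sizing the huge page pool for a database memory area comes down to one division; the 64 GiB shared-memory size below is a hypothetical figure, and 2 MiB is the common x86_64 huge page size:

```python
import math

# Number of 2 MiB huge pages needed to back a (hypothetical) 64 GiB
# database shared memory area, i.e. the value for the vm.nr_hugepages
# kernel parameter.
HUGE_PAGE_MIB = 2
shared_mem_gib = 64               # hypothetical SGA / shared memory size

nr_hugepages = math.ceil(shared_mem_gib * 1024 / HUGE_PAGE_MIB)
print(nr_hugepages)  # 32768
```

The resulting value would typically be set via the `vm.nr_hugepages` sysctl; whether and how the application actually allocates its memory from huge pages depends on the software vendor.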

File System Options 1/2

• BTRFS

– Still no option for high-end enterprise systems

– BTRFS advantages (e.g. rollback) are usually achieved in this system class using appropriate change management processes and acceptance procedures for implementing changes

• EXT3

– Stable and reliable file system

– Used for OS file systems (/, /usr, /opt, …)

– Not optimized for throughput in large file systems

– fsck requires much more time on large file systems

File System Options 2/2

• EXT4

– Recommended for huge file systems with high IO throughput

– Stable and reliable

• XFS

– Recommended for huge file systems with high IO throughput

– Stable and reliable

• Publicly available benchmarks do not show significant differences between ext4 and XFS

• Your choice depends on your preference and on what is supported by your software vendor (e.g. the Oracle support statement)

NUMA 1/2

[Diagram: two NUMA nodes (Node 0 and Node 1), each with CPUs 1–4 and local memory, linked by an intersocket connection; a CPU accesses its own node's memory locally and the other node's memory remotely]

NUMA 2/2: Consider It – Or Ignore It?

• 2 or 3 levels of NUMA configuration (depending on whether virtualization is used)

– 1st level is the BIOS of the physical server

– 2nd level is the configuration of the virtualization layer and the BIOS of the virtual machine

– 3rd level is the NUMA behavior of the application

• Options:

Physical server | Virtual server | Application
Enabled         | Enabled        | Enabled
Enabled         | Enabled        | Disabled
Enabled         | Disabled       | Disabled

• Test appropriate settings with your application

Do Not Forget Your Application

• Do not group applications with contrary resource consumption profiles on the same server

– 1st example: Applications that need huge, static main memory and react to intensive page-out and page-in operations with performance degradation

– 2nd example: Applications with a very dynamic memory usage profile, which are designed to use intensive page-out and page-in operations for handling a huge number of users and/or transactions

• Consolidate applications with the same resource usage profile on servers to reduce OS resource management overhead

Speed up your Linux Environment! www.realtech.com

Thank you.


Unpublished Work of SUSE. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary, and trade secret information of SUSE. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.