Title: Running Oracle DB on IBM AIX and Power7 - Best Practices for Performance & Tuning
2011 IBM Power Systems Technical University, October 10-14 | Fontainebleau Miami Beach | Miami, FL | IBM

Session: 17.11.2011, 16:00-16:45, Room Krakau | Speaker: Frank Kraemer

Frank Kraemer, IBM Systems Architect, mailto:[email protected]
Francois Martin, IBM Oracle Center, mailto:[email protected]

Materials may not be reproduced in whole or in part without the prior written permission of IBM.

Agenda - Deutsche Oracle Anwenderkonferenz DOAG 2011
- CPU
  - Power7
- Memory
  - AIX VMM tuning
- IO
  - Storage considerations
  - AIX LVM striping
  - Disk/Fiber Channel driver optimization
  - Virtual disk/Fiber Channel driver optimization
  - AIX mount options
  - Asynchronous IO
- NUMA optimization

IBM Power7 (Socket/Chip/Core/Threads)
- Each Power7 socket = 1 Power7 chip.
- Each Power7 chip = 8 Power7 cores running at ~4 GHz.
- Each Power7 core = 4 hardware SMT threads.
- In AIX, when SMT is enabled, each SMT thread is seen as a logical CPU.
- Example: 32 sockets = 32 chips = 256 cores = 1024 SMT threads = 1024 AIX logical CPUs.

Scalability with IBM POWER Systems and Oracle
Oracle certifies for the O/S, and all Power servers run the same AIX = complete flexibility for Oracle workload deployment.
Server range: Power 795, Power 780, Power 770, Power 750, Power 740 (2S/4U), Power 730 (2S/2U), Power 720 (1S/4U), Power 710 (1S/2U), Blade Servers.
Consistency:
- Binary compatibility
- Mainframe-inspired reliability
- Support for full HW-based virtualization
- AIX, Linux and IBM i OS

POWER7 Server RAS Features

Virtualization:
- PowerVM is core firmware
- Thin bare-metal hypervisor
- Device-driver free hypervisor
- HW-enforced virtualization
- IBM integrated stack
- Dynamic LPAR operations
- Redundant VIOS support
- Resource monitor & control
- Live Partition Mobility (LPM)
- EAL 4+ security certification

AIX v6/v7:
- LVM and JFS2
- SMIT (reduces human errors)
- Integrated HW / OS / FW error reporting
- Hot AIX kernel patches
- WPAR and WPAR mobility
- Configurable error logs
- Concurrent remote support

(Diagram: Virtual I/O Server 1 and 2 and OS LPARs #1..#N running on the PowerVM Hypervisor, above General / CPU / Memory / Network / Disk resources.)

General:
- First Failure Data Capture
- Hot-node add/repair
- Concurrent firmware updates
- Redundant clocks & service processors
- Dynamic clock failover
- Light path diagnostics
- Electronic service agent

CPU:
- Dynamic CPU deallocation
- Dynamic processor sparing
- Processor instruction retry & alternate processor recovery
- Partition availability priority
- Processor-contained checkstop
- Uncorrectable error handling
- CPU CoD

Memory/Cache:
- Redundant bit steering
- Dynamic page deallocation
- Error-checking channels & spare signal lines
- Memory protect keys (AIX)
- Dynamic cache deallocation & cache line delete
- ECC on fabric busses
- HW memory & L3 scrubbing
- Memory CoD

I/O:
- Integrated I/O product development - stable drivers
- Redundant I/O drawer links
- Independent PCIe busses
- PCIe bus error recovery
- Hot-add GX bus adapter
- I/O drawer hot add / concurrent repair
- Hot-swap disk, media & PCIe adapters

POWER7 SMT performance
- Requires POWER7 mode (POWER6 mode supports SMT1 and SMT2 only).
- Operating system support: AIX 6.1 and AIX 7.1, IBM i 6.1 and 7.1, Linux.
- Dynamic runtime SMT scheduling: spread work among cores to execute in the appropriate threaded mode; can dynamically shift between SMT1 / SMT2 / SMT4 as required.
- LPAR-wide SMT controls: ST, SMT2 and SMT4 modes (smtctl / mpstat commands).
- Mixed SMT modes supported within the same LPAR (requires the use of "Resource Groups").

POWER7 specific tuning
- Use SMT4: gives a CPU performance boost for handling concurrent threads in parallel.
- Disable HW prefetching: this usually improves performance for database workloads on big SMP Power systems.
  # dscrctl -n -b -s 1   (dynamically disables HW memory prefetch and keeps this setting across reboots)
  # dscrctl -n -b -s 0   (resets HW prefetching to its default value)
- Use terabyte segment aliasing: improves CPU performance by reducing SLB misses (segment address resolution).
  # vmo -p -o esid_allocator=1   (on by default on AIX 7)
- Use Large Pages (LP), i.e. 16 MB memory pages: improves CPU performance by reducing TLB misses (page address resolution); a consolidated sketch follows below.
  # vmo -r -o lgpg_regions=xxxx -o lgpg_size=16777216   (xxxx = number of 16 MB segments you want)
  # vmo -p -o v_pinshm=1
  # chuser capabilities=CAP_NUMA_ATTACH,CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle   (enables the oracle user to use Large Pages)
  export ORACLE_SGA_PGSZ=16M before starting Oracle (as the oracle user)
  Check large-page usage with "svmon -mwU oracle".
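As an illustration, a consolidated large-page setup for a hypothetical 20 GB SGA (20480 MB / 16 MB = 1280 regions); the SGA size and the region count are assumptions, adjust them to your instance:
  # vmo -r -o lgpg_regions=1280 -o lgpg_size=16777216   (reserves 1280 x 16 MB = 20 GB of large pages; -r values take effect at the next reboot)
  # vmo -p -o v_pinshm=1   (allow pinned shared memory)
  # chuser capabilities=CAP_NUMA_ATTACH,CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle
  $ export ORACLE_SGA_PGSZ=16M   (as the oracle user, in the environment that starts the instance)
  # svmon -mwU oracle   (verify that the SGA is now backed by 16 MB pages)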

Virtual Memory Management
1 - System startup: AIX starts and applications load computational pages into memory. AIX then tries to take advantage of the free memory by using it as a file-system cache (FS CACHE) to reduce IO on the physical drives.
2 - DB activity (needs more memory): activity increases and the database needs more memory, but there are no free pages available. LRUD (the AIX page stealer) starts to free pages in memory.
3 - Paging activity: with the default settings, LRUD pages out some computational pages instead of removing only pages from the file-system cache, which leads to performance degradation.

Objective: tune the VMM to protect computational pages (programs, SGA, PGA) from being paged out and force LRUD to steal pages from the FS cache only.

Memory: VMM Tuning for Filesystems (lru_file_repage)
This VMM tuning tip applies to AIX 5.2 ML4+, AIX 5.3 ML1+ and AIX 6.1+.
- lru_file_repage: this parameter prevents computational pages from being paged out. By setting lru_file_repage=0 (1 is the default value) you tell the VMM that your preference is to steal pages from the FS cache only.
- "New" vmo tuning approach (recommendations based on VMM developer experience). Use "vmo -p -o" to set the following parameters:
  lru_file_repage = 0
  page_steal_method = 1   (reduces the scan:free ratio)

  minfree = max(960, 120 x # lcpus*) / # memory pools**
  maxfree = minfree + (j2_maxPageReadAhead*** x # lcpus*) / # memory pools**
  * to see # lcpus, use "bindprocessor -q"
  ** to see # memory pools, use "vmstat -v"
  *** to see j2_maxPageReadAhead, use "ioo -L j2_maxPageReadAhead"
  Increase minfree and recalculate maxfree if "Free Frame Waits" keeps increasing over time (vmstat -s).
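Worked example of the formulas above (hypothetical values: 32 logical CPUs, 4 memory pools, j2_maxPageReadAhead = 128):
  minfree = max(960, 120 x 32) / 4 = 3840 / 4 = 960
  maxfree = 960 + (128 x 32) / 4 = 960 + 1024 = 1984
  # vmo -p -o minfree=960 -o maxfree=1984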

  minperm% = 3 to 10
  maxperm% = maxclient% = 70 to 90
  strict_maxclient = 1

Memory: use AIX dynamic LPAR and Oracle dynamic memory allocation (AMM) jointly
Scenario: the Oracle tuning advisor indicates that SGA + PGA need to be increased to 11 GB. memory_target can be increased dynamically to 11 GB, but real memory is only 12 GB, so it needs to be increased as well.
- Initial configuration: memory_max_size = 18 GB, real memory = 12 GB, memory_target = 8 GB (SGA + PGA + free).
- Final configuration: memory_max_size = 18 GB, real memory = 15 GB, memory_target = 11 GB (SGA + PGA + free).
- Step 1 - Increase the physical memory allocated to the LPAR to 15 GB (the managed-system and LPAR names are omitted here):
  ssh hscroot@hmc "chhwres -r mem -m <managed-system> -o a -p <lpar-name> -q 3072"
- Step 2 - Increase the SGA + PGA allocated to the database instance to 11 GB:
  alter system set memory_target=11G;

=> Memory allocated to the system has been increased dynamically, using AIX DLPAR.
=> Memory allocated to Oracle (SGA and PGA) has been increased on the fly.

IO: Database Layout

- Having a good storage configuration is a key point:
  - because disk is the slowest part of an infrastructure
  - because reconfiguration can be difficult and time consuming

Stripe and mirror everything (S.A.M.E.) approach:
- The goal is to balance I/O activity across all disks, loops, adapters, etc.
- Avoid/eliminate I/O hotspots.
- Manual file-by-file data placement is time consuming, resource intensive and iterative.
- Additional advice to implement SAME:
  - apply the SAME strategy to data and indexes
  - if possible, separate the redo logs (+ archive logs)

IO: RAID Policy with ESS, DS6/8K
RAID-5 vs. RAID-10 performance comparison (storage HW striping across LUN 1..LUN 4):

  I/O Profile        RAID-5      RAID-10
  Sequential Read    Excellent   Excellent
  Sequential Write   Excellent   Good
  Random Read        Excellent   Excellent
  Random Write       Fair        Excellent

- With enterprise-class storage (with a huge cache), RAID-5 performance is comparable to RAID-10 for most customer workloads.
- Consider RAID-10 for workloads with a high percentage of random write activity (> 25%) and high I/O access densities (peak > 50%).
- Use RAID-5 or RAID-10 to create striped LUNs.
- If possible, try to minimize the number of LUNs per RAID array to avoid contention on the physical disks.

IO: 2nd Striping (LVM)
(Diagram: an AIX volume group built on hdisk1-hdisk4, each hdisk backed by LUN 1-LUN 4 with storage HW striping; an LVM-striped LV, raw or with jfs2, spans the hdisks.)

1. LUNs are striped across physical disks (stripe size of the physical RAID: ~64k, 128k, 256k).
2. LUNs are seen as hdisk devices on the AIX server.
3. Create AIX volume group(s) (VG) with LUNs from multiple arrays.
4. The logical volume is striped across the hdisks (stripe size: 8M, 16M, 32M, 64M).
=> Each read/write access to the LV is well balanced across the LUNs and uses the maximum number of physical disks for best performance.

IO: LVM Spreading / Striping
Two different ways to stripe at the LVM layer:
1. Stripe using the Logical Volume (LV): create the logical volume with the striping option: mklv -S ...
   Valid LV stripe sizes:
   - AIX 5.2: 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1M
   - AIX 5.3+: the AIX 5.2 stripe sizes plus 2M, 4M, 16M, 32M, 64M, 128M
2. PP (disk Physical Partition) striping (a.k.a. spreading): create a volume group with an 8M, 16M or 32M PP size (the PP size becomes the "stripe size").
   - AIX 5.2: choose a Big VG and change the "t factor": # mkvg -B -t ... -s ...
   - AIX 5.3+: choose a Scalable volume group: # mkvg -S -s ...
   - Create the LV with the "Maximum range of physical volumes" option to spread the PPs over the different hdisks in a round-robin fashion: # mklv -e x ...

jfs2 filesystem creation advice: if you create a jfs2 filesystem on a striped (or PP-spread) LV, use the INLINE logging option. It avoids a "hot spot" by creating the log inside the filesystem (which is striped) instead of using a PP stored on one hdisk: # crfs ... -a logname=INLINE
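A minimal end-to-end sketch of PP spreading with an INLINE log, assuming four LUNs seen as hdisk4-hdisk7 and hypothetical names (oradatavg, oradatalv, /oradata); the sizes are examples only:
  # mkvg -S -s 16 -y oradatavg hdisk4 hdisk5 hdisk6 hdisk7   (scalable VG; the 16 MB PP size acts as the "stripe size")
  # mklv -e x -t jfs2 -y oradatalv oradatavg 512   (512 PPs spread round-robin across the four hdisks)
  # crfs -v jfs2 -d oradatalv -m /oradata -A yes -a logname=INLINE -a agblksize=4096
  # mount -o noatime,cio /oradata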

IO: Disk Subsystem Advice
- Check whether the definition of your disk subsystem is present in the ODM.
- If the description shown in the output of "lsdev -Cc disk" contains the word "Other", AIX does not have a correct definition of your disk device in the ODM and uses a generic device definition.
  # lsdev -Cc disk

  Generic device definition => bad performance

- In general, a generic device definition provides far from optimal performance because it does not properly customize the hdisk devices. Example: hdisks are created with queue_depth=1.

1. Contact your vendor or go to their web site to download the correct ODM definition for your storage subsystem. It will set up the hdisks properly, according to your hardware, for optimal performance.
2. If AIX is connected to the storage subsystem with several Fiber Channel cards for performance, do not forget to install a multipath device driver or path control module:
   - sdd or sddpcm for IBM DS6000/DS8000
   - powerpath for EMC disk subsystems
   - hdlm for Hitachi, etc.

IO: AIX IO tuning (1) - LVM physical buffers (pbuf)

(Diagram: each hdisk in the volume group, hdisk1-hdisk4, has its own pbuf in front of it; LVM striping of the LV, raw or with jfs2.)

1. Each LVM physical volume has a physical buffer (pbuf).
2. The "vmstat -v" command helps you detect a lack of pbufs.
3. If there is a lack of pbufs, there are 2 solutions:
   - add more LUNs (this adds more pbufs), or
   - increase the pbuf count per physical volume: lvmo -v <vgname> -o pv_pbuf_count=XXX
   (check the "vmstat -v" output to confirm the shortage)
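A short check-and-tune sequence for pbufs; the VG name (oradatavg) and the new count are examples:
  # vmstat -v | grep "pending disk I/Os blocked with no pbuf"   (a growing counter indicates a pbuf shortage)
  # lvmo -v oradatavg -a   (shows pv_pbuf_count and the per-VG blocked IO counter)
  # lvmo -v oradatavg -o pv_pbuf_count=1024   (increase the pbuf count per physical volume in that VG)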

IO: AIX IO tuning (2) - hdisk queue_depth (qdepth)
(Diagram: each hdisk, hdisk1-hdisk4, has a pbuf and a queue of size queue_depth; the nmon DDD panel shows these per-disk queues; LVM striping of the LV, raw or with jfs2. Example values: application IO queue time 1.9 ms, service time 0.2 ms.)

1. Each AIX hdisk has a queue whose size is called the queue depth. This parameter sets the number of parallel requests that can be sent to the physical disk.
2. To know whether you have to increase the queue depth, use nmon (DDD in interactive mode, -d in recording mode) and monitor: service time, wait time and servQ full.
3. If service time < 2-3 ms, the storage is behaving well (it can handle more load); if, in addition, wait time > 1 ms, the disk queue is full and IOs are waiting to be queued => INCREASE the hdisk queue depth: chdev -l hdiskXX -a queue_depth=YYY
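For example, checking and raising the queue depth on one hdisk (hdisk4 and the value 20 are assumptions):
  # lsattr -El hdisk4 -a queue_depth   (current value)
  # iostat -D hdisk4 5   (watch avgserv, avgtime and sqfull in the queue statistics)
  # chdev -l hdisk4 -a queue_depth=20 -P   (-P defers the change to the next reboot; omit -P when the disk is not in use)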

IO: AIX IO tuning (3) - HBA tuning (num_cmd_elems)
(Diagram: the hdisk queues, pbufs and LVM striping feed two HBAs, fcs0 and fcs1, each with its own num_cmd_elems queue; MPIO/sddpcm multipathing to the external storage.)

1. Each FC HBA has a queue, num_cmd_elems. This queue plays the same role for the HBA as the queue depth does for the disk.
2. Rule of thumb: num_cmd_elems = (sum of the hdisk queue depths) / number of HBAs.
3. Changing num_cmd_elems: chdev -l fcsX -a num_cmd_elems=YYY. You can also change max_xfer_size=0x200000 and lg_term_dma=0x800000 with the same command (see the "fcstat fcs0" output).

These changes use more memory and must be made with caution; check first with "fcstat fcsX".
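A hedged example applying the rule of thumb: 8 LUNs with queue_depth=20 spread over 2 HBAs gives num_cmd_elems = (8 x 20) / 2 = 80, rounded up here to 200 for headroom (all values are assumptions):
  # fcstat fcs0   (a non-zero "No Command Resource Count" means num_cmd_elems is too small)
  # chdev -l fcs0 -a num_cmd_elems=200 -a max_xfer_size=0x200000 -a lg_term_dma=0x800000 -P   (deferred; takes effect after a reboot or after reconfiguring the adapter)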

Virtual SCSI: Virtual I/O helps reduce hardware costs by sharing disk drives
(Diagram: a POWER server with micro-partitions, AIX 5L V5.3, AIX 6.1 and Linux, accessing external storage through a shared VIOS with Fiber Channel and shared SCSI adapters over virtual SCSI and the POWER Hypervisor.)

- The VIOS owns the physical disk resources:
  - LVM-based storage on the VIO server
  - physical storage can be SCSI or FC, local or remote
  - configured as logical volumes on the VIOS, or an entire hdisk can be assigned to a single client
- The micro-partition sees the disks as vSCSI (Virtual SCSI) devices:
  - multiple LPARs can use the same or different physical disks
  - they appear as hdisks on the micro-partition
  - virtual SCSI devices are added to a partition via the HMC
  - LUNs on the VIOS are accessed as vSCSI disks
  - the VIOS must be active for the client to boot

VIOS: VSCSI IO tuning

(Diagram: on the AIX client, an hdisk with its queue depth on vscsi0; through the hypervisor to vhost0 on the VIOS, where the backing hdisk has its own queue depth and the fcs0 queue; AIX MPIO on the client, MPIO/sddpcm on the VIOS.)

On the AIX client LPAR:
- "lsdev -Cc disk" shows: hdisk0 Available Virtual SCSI Disk Drive. The storage driver cannot be installed on the LPAR.
- HA (dual-VIOS configuration): change vscsi_err_recov to fast_fail: chdev -l vscsi0 -a vscsi_err_recov=fast_fail
- Performance: the default qdepth=3 (!) gives bad performance => chdev -l hdisk0 -a queue_depth=20, then monitor svctime / wait time with nmon to adjust the queue depth.
- No tuning can be made on the vscsi adapter itself; each vscsi adapter can handle 512 command elements (2 are reserved for the adapter and 3 are reserved for each vdisk).

On the VIOS:
- Install the storage subsystem driver on the VIOS.
- Monitor VIO FC activity with nmon (interactive: press "a" or "^"; record with the "-^" option).
- Adapt num_cmd_elems according to the sum of the queue depths; check with "fcstat fcsX".

* You have to set the same queue_depth for the source hdisk on the VIOS. Use the following formula to compute the number of disks you can attach behind a vscsi adapter:

Nb_luns = (512 - 2) / (Q + 3), with Q = the queue depth of each disk. If more disks are needed => add another vscsi adapter.
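Worked example of the formula, assuming queue_depth=20 on every vSCSI disk:
  Nb_luns = (512 - 2) / (20 + 3) = 510 / 23 ≈ 22
  So at most 22 disks should be mapped behind one vscsi adapter; for more disks, add another vscsi/vhost pair.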

N-Port ID Virtualization
(Diagram: a VIOS with an 8Gb PCIe Fiber Channel adapter connected to a FC switch, SAN storage and a tape library; AIX 6.1, AIX 5.3 and Linux client LPARs connected through the POWER Hypervisor.)

- The VIOS Fiber Channel adapter supports multiple World Wide Port Names / Source Identifiers.
- The physical adapter appears as multiple virtual adapters to the SAN / end-point device.
- A virtual adapter can be assigned to multiple operating systems sharing the physical adapter.
- LPARs have direct visibility on the SAN (zoning/masking).
- The I/O virtualization configuration effort is dramatically reduced.
- NPIV does not eliminate the need to virtualize the PCI family.
- Tape library support.

NPIV Simplifies SAN Management

(Diagram: Virtual SCSI model on POWER5 or POWER6 vs. N-Port ID Virtualization on POWER6 or POWER7. With virtual SCSI, the client LPAR sees virtualized generic SCSI disks behind shared FC adapters in the VIOS; with NPIV, the client sees the real DS8000/EMC disks through virtual FC adapters, and the VIOS simply passes the FC traffic to the SAN.)

VIOS: NPIV IO tuning

(Diagram: on the AIX client, the hdisk queue depth and a virtual FC adapter fcs0 with its own WWPN and MPIO/sddpcm; through the hypervisor to vfchost0 and the physical fcs0 queue on the VIOS.)

On the AIX client LPAR:
- "lsdev -Cc disk" shows: hdisk0 Available MPIO FC 2145. The storage driver must be installed on the LPAR; the default qdepth is set by the driver.
- HA (dual-VIOS configuration): chdev -l fscsi0 -a fc_err_recov=fast_fail and chdev -l fscsi0 -a dyntrk=yes
- Performance: monitor "svctime" / "wait time" with nmon or iostat to tune the queue depth; monitor FC activity with nmon (interactive: option "a" or "^"; recording: option "-^"); adapt num_cmd_elems; check with "fcstat fcsX".

On the VIOS:
- HA: chdev -l fscsi0 -a fc_err_recov=fast_fail and chdev -l fscsi0 -a dyntrk=yes
- Performance: monitor FC activity with nmon (interactive: option "^" only; recording: option "-^"); adapt num_cmd_elems; check with "fcstat fcsX".

=> On the VIOS, the physical adapter's num_cmd_elems should equal the sum of the num_cmd_elems of the virtual FC adapters connected to the backend device.
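For reference, a hedged NPIV checklist (fscsi0 is an example instance; on the VIOS the lsmap command runs from the padmin shell):
  # chdev -l fscsi0 -a fc_err_recov=fast_fail -a dyntrk=yes -P   (on the client LPAR and on the VIOS; applied at the next reboot or device reconfiguration)
  $ lsmap -all -npiv   (on the VIOS: lists each vfchost, its backing physical fcs adapter and the client partition)
  # fcstat fcs0   (check "No Command Resource Count" before and after raising num_cmd_elems)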

IO: Filesystems Mount Options (DIO, CIO)
(Diagram: the write path from the database buffer to the file system: through the FS cache with a normal file system, bypassing the FS cache with Direct IO (DIO), and bypassing the FS cache and the inode lock with Concurrent IO (CIO).)

IO: Filesystems Mount Options (DIO, CIO)
If Oracle data are stored in a filesystem, some mount options can improve performance:

- Direct IO (DIO), introduced in AIX 4.3:
  - Data is transferred directly from the disk to the application buffer, bypassing the file buffer cache and hence avoiding double caching (filesystem cache + Oracle SGA).
  - Emulates a raw-device implementation.
  - To mount a filesystem in DIO: $ mount -o dio /data

- Concurrent IO (CIO), introduced with jfs2 in AIX 5.2 ML1:
  (Chart: benchmark throughput over the run duration; higher tps indicates better performance.)
  - Implicit use of DIO.
  - No inode locking: multiple threads can perform reads and writes on the same file at the same time.
  - Performance achieved using CIO is comparable to raw devices.
  - To mount a filesystem in CIO: $ mount -o cio /data
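To make the CIO mount options persistent across reboots, the filesystem stanza in /etc/filesystems can be updated with chfs (the mount point /data is an example):
  # chfs -a options=cio,noatime /data
  # mount /data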

IO: Benefits of CIO for Oracle

Benefits:
1. Avoids double caching: some data are already cached in the application layer (SGA).
2. Gives faster access to the backend disks and reduces CPU utilization.
3. Disables the inode lock, allowing several threads to read and write the same file (CIO only).

Restrictions:
1. Because data transfers bypass the AIX buffer cache, jfs2 prefetching and write-behind cannot be used. These functionalities can be handled by Oracle:
   (Oracle parameter) db_file_multiblock_read_count = 8, 16, 32, ..., 128 according to the workload.
2. When using DIO/CIO, IO requests made by Oracle must be aligned with the jfs2 block size to avoid demoted IO (a fallback to normal IO after a Direct IO failure).
   => When you create a jfs2 filesystem, use the "mkfs -o agblksize=XXX" option to match the FS block size to the application needs. Rule: IO request = n x agblksize.
   Examples: if the DB block size > 4k, then jfs2 agblksize=4096. Redo logs are always written in 512-byte blocks, so their jfs2 agblksize must be 512.
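For instance, a dedicated redo-log filesystem could be created with a 512-byte agblksize and an INLINE log (redolv and /redo are hypothetical names):
  # crfs -v jfs2 -d redolv -m /redo -A yes -a agblksize=512 -a logname=INLINE
  # mount -o noatime,cio /redo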

IO: Direct IO demoted - Step 1
(Diagram: the application issues a 512-byte CIO write; the write fails because the IO is not aligned with the 4096-byte block size of the AIX jfs2 filesystem.)

IO: Direct IO demoted - Step 2
(Diagram: the AIX kernel reads one 4096-byte FS block from the jfs2 filesystem into virtual memory.)

IO: Direct IO demoted - Step 3
(Diagram: the application writes its 512-byte block into the 4096-byte page in memory.)

IO: Direct IO demoted - Step 4
(Diagram: the AIX kernel writes the 4k page back to the jfs2 filesystem, and because of the CIO access the 4k page is then removed from memory.)

To conclude, a demoted IO consumes:
- more CPU time (kernel)
- more physical IO: 1 logical write = 1 physical read + 1 physical write

IO: Direct IO demoted
- Extract from an Oracle AWR report (made in Montpellier with Oracle 10g).

Waits on the redo log (with demoted IO, FS block size = 4k):
  Event          Waits      %Time-outs  Total Wait Time (s)  Avg wait (ms)  Waits/txn
  log file sync  2,229,324  0.00        62,628               28             1.53

Waits on the redo log (without demoted IO, FS block size = 512):
  Event          Waits      %Time-outs  Total Wait Time (s)  Avg wait (ms)  Waits/txn
  log file sync  494,905    0.00        1,073                2              1.00

How to detect demoted IO, using the AIX trace facility:
  # trace -aj 59B,59C ; sleep 2 ; trcstop ; trcrpt -o directio.trcrpt
  # grep -i demoted directio.trcrpt

FS mount options advice
- Oracle binaries: standard mount -o rw (cached by AIX fscache) -> optimized mount -o noatime (still cached by AIX fscache; noatime reduces inode modifications on reads).
- Oracle datafiles: standard mount -o rw (cached by AIX fscache and by the Oracle SGA) -> optimized mount -o noatime,cio (cached by the Oracle SGA only).
- Oracle redo logs: standard mount -o rw (cached by AIX fscache and by the Oracle SGA) -> optimized mount -o noatime,cio with jfs2 agblksize=512 to avoid IO demotion (cached by the Oracle SGA only).
- Oracle archive logs: standard mount -o rw (cached by AIX fscache) -> optimized mount -o noatime,rbrw (uses jfs2 write-behind, but the memory is released after the write).
- Oracle control files: standard mount -o rw (cached by AIX fscache) -> optimized mount -o noatime (cached by AIX fscache).

IO: Asynchronous IO (AIO)

- Allows multiple requests to be sent without having to wait until the disk subsystem has completed the physical IO.
- Using asynchronous IO is strongly advised whatever the type of filesystem and mount option implemented (JFS, JFS2, CIO, DIO).

(Diagram: the application posts IO requests into the AIO queue; aioserver kernel processes pick them up and issue them to disk.)

- POSIX vs. Legacy: since AIX 5L V5.3, two types of AIO are available, Legacy and POSIX. For the moment, the Oracle code uses the Legacy AIO servers.

IO: Asynchronous IO (AIO) tuning
- Important tuning parameters. Check the AIO configuration with:
  AIX 5.x: lsattr -El aio0
  AIX 6.1: ioo -L | grep aio

- maxreqs: size of the asynchronous IO queue.
- minservers: number of aioserver kernel processes to start (system-wide on AIX 5L).
- maxservers: maximum number of aioservers that can be running, per logical CPU.

Rule of thumb:
  maxservers = (10 x number of disks accessed concurrently) / number of CPUs
  maxreqs (a multiple of 4096) > 4 x number of disks x queue_depth

Only testing allows minservers and maxservers to be set correctly; a worked example is sketched below.
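Worked example of the rule of thumb (assumptions: 40 disks accessed concurrently, 8 logical CPUs, queue_depth=20):
  maxservers = (10 x 40) / 8 = 50 (per logical CPU)
  maxreqs > 4 x 40 x 20 = 3200, rounded up to the next multiple of 4096 = 4096
  AIX 5.x: # chdev -l aio0 -a maxservers=50 -a maxreqs=4096
  AIX 6.1: # ioo -p -o aio_maxservers=50 -o aio_maxreqs=4096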

Monitoring: in Oracle's alert.log file, if maxservers is set too low you will see:
  "Warning: lio_listio returned EAGAIN"
  "Performance degradation may be seen"

The number of aioservers in use can be monitored via "ps -k | grep aio | wc -l", "iostat -A" or nmon (option A).

IO: Asynchronous IO (AIO) fastpath
With fast_path, IOs are queued directly from the application into the LVM layer without any aioserver kproc operation.

- Better performance compared to non-fast_path.
- No need to tune the min and max aioservers.
- No aioserver processes => "ps -k | grep aio | wc -l" is no longer relevant; use "iostat -A" instead.

(Diagram: with fastpath the application hands the IO to the AIX kernel, which issues it directly to disk.)

- Raw devices / ASM:
  Check the AIO configuration with "lsattr -El aio0" and enable the asynchronous IO fast_path:
  AIX 5L: chdev -a fastpath=enable -l aio0   (default since AIX 5.3)
  AIX 6.1: ioo -p -o aio_fastpath=1   (default setting)
- Filesystems with CIO/DIO and AIX 5.3 TL5+:
  Activate fsfast_path (comparable to fast_path, but for filesystems with CIO/DIO):
  AIX 5L: add the following line to /etc/inittab: aioo:2:once:aioo -o fsfast_path=1
  AIX 6.1: ioo -p -o aio_fsfastpath=1   (default setting)

IO: AIO, DIO/CIO & Oracle Parameters

How to set the filesystemio_options parameter. Possible values:
- ASYNCH: enables asynchronous I/O on file system files (default)
- DIRECTIO: enables direct I/O on file system files (disables AIO)
- SETALL: enables both asynchronous and direct I/O on file system files
- NONE: disables both asynchronous and direct I/O on file system files

Since version 10g, Oracle opens data files located on a JFS2 file system with the O_CIO option if the filesystemio_options initialization parameter is set to either DIRECTIO or SETALL.

Advice: set this parameter to 'ASYNCH' and let the system manage CIO via the mount options (see the CIO/DIO implementation advice above).

Note : set the disk_asynch_io parameter to ‘true’ as well
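A hedged example of setting both parameters from sqlplus; they are static parameters, so SCOPE=SPFILE and an instance restart are required:
  $ sqlplus / as sysdba
  SQL> ALTER SYSTEM SET filesystemio_options='ASYNCH' SCOPE=SPFILE;
  SQL> ALTER SYSTEM SET disk_asynch_io=TRUE SCOPE=SPFILE;
  SQL> SHUTDOWN IMMEDIATE
  SQL> STARTUP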

NUMA architecture
- NUMA stands for Non-Uniform Memory Access. It is a memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

Oracle NUMA feature
Oracle DB NUMA support was introduced in 1998 on the first NUMA systems. It provides memory/process models that rely on specific OS features to perform better on this kind of architecture.

Since 10.2.0.4, NUMA features are enabled by default on x86 hardware.

On AIX, the NUMA support code has been ported; the default is off.

Oracle NUMA parameters
The following shows all the hidden parameters that include 'NUMA' in the name:
  _NUMA_instance_mapping      Not specified
  _NUMA_pool_size             Not specified
  _db_block_numa              1
  _enable_NUMA_interleave     TRUE
  _enable_NUMA_optimization   FALSE
  _enable_NUMA_support        FALSE
  _numa_buffer_cache_stats    0
  _numa_trace_level           0
  _rm_numa_sched_enable       TRUE
  _rm_numa_simulation_cpus    0
  _rm_numa_simulation_pgs     0

Enable Oracle NUMA rset naming
- Oracle 11g has NUMA optimization disabled by default.
- _enable_NUMA_support=true is required to enable the NUMA features.
- When _enable_NUMA_support = TRUE, Oracle checks for an AIX rset named "${ORACLE_SID}/0" at startup.
- For now, it is assumed that it will use the rsets ${ORACLE_SID}/0, ${ORACLE_SID}/1, ${ORACLE_SID}/2, etc. if they exist.

Prepare the system for Oracle NUMA optimization
The test is done on a POWER7 machine with the following CPU and memory distribution (dedicated LPAR). It has 4 affinity domains, each with 32 logical CPUs (8 cores) and more than 27 GB of memory. If the lssrad output shows unevenly distributed domains, fix that problem before proceeding.
- Listing the SRADs (affinity domains):
  # lssrad -va
  REF1  SRAD  MEM       CPU
  0     0     27932.94  0-31
        1     31285.00  32-63
  1     2     29701.00  64-95
        3     29701.00  96-127

- We will set up 4 rsets, namely SA1/0, SA1/1, SA1/2 and SA1/3, one for each domain:
  # mkrset -c 0-31 -m 0 SA1/0
  # mkrset -c 32-63 -m 0 SA1/1
  # mkrset -c 64-95 -m 0 SA1/2
  # mkrset -c 96-127 -m 0 SA1/3

- Required Oracle user capabilities:
  # lsuser -a capabilities oracle
  oracle capabilities=CAP_NUMA_ATTACH,CAP_BYPASS_RAC_VMM,CAP_PROPAGATE

_enable_NUMA_support
- The system is ready to be tested, so we enable the NUMA parameters in the init.ora parameter file:

#*._enable_NUMA_optimization=true #deprecated, use _enable_NUMA_support

*._enable_NUMA_support=true

Making process private memory local
- Before starting the DB, set the vmo options that cause process private memory to be allocated locally:

# vmo -o memplace_data=1 -o memplace_stack=1

Starting the NUMA-enabled database
- Once started, Oracle sets the following parameters:
  _NUMA_instance_mapping      Not specified
  _NUMA_pool_size             Not specified
  _db_block_numa              4
  _enable_NUMA_interleave     TRUE
  _enable_NUMA_optimization   FALSE
  _enable_NUMA_support        TRUE
  _numa_buffer_cache_stats    0
  _numa_trace_level           0
  _rm_numa_sched_enable       TRUE
  _rm_numa_simulation_cpus    0
  _rm_numa_simulation_pgs     0
- The following messages are found in the alert log: Oracle finds the 4 rsets and treats them as NUMA domains.
  LICENSE_MAX_USERS = 0
  SYS auditing is disabled
  NUMA system found and support enabled (4 domains - 32,32,32,32)
  Starting up Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production

Oracle shared segments differences
- Check the shared memory segments. There are 7 in total, one of which is owned by ASM. The SA1 instance has 6 shared memory segments instead of 1.

Without NUMA optimization:
  # ipcs -ma | grep oracle
  m 2097156    0x8c524c30 --rw-rw---- oracle dba oracle dba 31 285220864   13369718 23987126 23:15:57 23:15:57 12:41:22
  m 159383709  0x3b5a2ebc --rw-rw---- oracle dba oracle dba 59 54089760768 17105120 23331318 23:16:13 23:16:13 23:15:45

With NUMA optimization:
  # ipcs -ma | grep oracle
  m 2097156    0x8c524c30 --rw-rw---- oracle dba oracle dba 29 285220864   13369718 7405688  23:27:32 23:32:38 12:41:22
  m 702545926  00000000   --rw-rw---- oracle dba oracle dba 59 2952790016  23987134 23920648 23:32:42 23:32:42 23:27:21
  m 549453831  00000000   --rw-rw---- oracle dba oracle dba 59 2952790016  23987134 23920648 23:32:42 23:32:42 23:27:21
  m 365953095  0x3b5a2ebc --rw-rw---- oracle dba oracle dba 59 20480       23987134 23920648 23:32:42 23:32:42 23:27:21
  m 1055916188 00000000   --rw-rw---- oracle dba oracle dba 59 3087007744  23987134 23920648 23:32:42 23:32:42 23:27:21
  m 161480861  00000000   --rw-rw---- oracle dba oracle dba 59 42144366592 23987134 23920648 23:32:42 23:32:42 23:27:21
  m 333447326  00000000   --rw-rw---- oracle dba oracle dba 59 2952790016  23987134 23920648 23:32:42 23:32:42 23:27:21

Post-setup background process inspection
- Take a look at the Oracle processes. The Oracle background processes that have multiple instantiations are automatically attached to the rsets in a round-robin fashion.
- All other Oracle background processes are also attached to an rset, mostly SA1/0. lgwr and psp0 are attached to SA1/1.

Standard Oracle Memory Structures
(Diagram: server processes 1 and 2 and the background processes each have their own PGA; the SGA shared memory segment contains the shared pool, streams pool, large pool, Java pool, database buffer cache and redo log buffer.)

NUMA Oracle Memory Structures (to be confirmed)
(Diagram: server and background processes with their PGAs; one database buffer cache plus one NUMA pool per SRAD (SRAD0-SRAD3); the redo log buffer sits in a single domain; the large pool, shared pool, streams pool and Java pool are striped across the domains.)

What Goes into Which Shared Memory Segment?
- Notice that the shared pool is split into a shared pool and a numa pool. Shared pool items that can be partitioned are moved to the numa pool.
- Notice also that the log buffer is piggy-backed onto the affinitized shared memory segment on SRAD1, which is where lgwr is attached.

Affinitizing User Connections
- If Oracle shadow processes are allowed to migrate across domains, the benefit of NUMA-enabling Oracle is lost. Therefore, arrangements need to be made to affinitize the user connections.
- For network connections, multiple listeners can be arranged, with each listener affinitized to a different domain. The Oracle shadow processes are children of the individual listeners and inherit their affinity from the listener.
- For local connections, the client process can be affinitized to the desired domain/rset. These connections do not go through any listener; the shadows are children of the individual clients and inherit their affinity from the client. Example below using sqlplus as a local client:
  # attachrset SA1/2 $$
  # sqlplus itsme/itsme

SQLnet configuration example for a 4-node Oracle NUMA configuration using the IPC protocol (SID=SA)

listener.ora:
  LISTENER0 =
    (DESCRIPTION_LIST =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = IPC)(KEY = IPC0)(QUEUESIZE = 200))
        )
      )
    )
  LISTENER1 =
    (DESCRIPTION_LIST =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = IPC)(KEY = IPC1)(QUEUESIZE = 200))
        )
      )
    )
  ...

tnsnames.ora:
  SA0 =
    (DESCRIPTION =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = IPC)(KEY = IPC0))
      )
      (CONNECT_DATA =
        (SERVICE_NAME = SA)
      )
    )
  SA1 =
    (DESCRIPTION =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = IPC)(KEY = IPC1))
      )
      (CONNECT_DATA =
        (SERVICE_NAME = SA)
      )
    )
  ...
  SA =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = IPC)(KEY = IPC0)(QUEUESIZE = 1000))
      (ADDRESS = (PROTOCOL = IPC)(KEY = IPC1)(QUEUESIZE = 1000))
      (ADDRESS = (PROTOCOL = IPC)(KEY = IPC2)(QUEUESIZE = 1000))
      (ADDRESS = (PROTOCOL = IPC)(KEY = IPC3)(QUEUESIZE = 10000))
    )

Listener startup procedure example
- Listener startup procedure:
  # for i in 0 1 2 3
  do
    execrset SA1/${i} su - oracle -c "
      export MEMORY_AFFINITY=MCM
      lsnrctl stop LISTENER${i}
      lsnrctl start LISTENER${i}
    "
  done
- Database registration: the init.ora service_names parameter should be set to: *.service_names = SA
  # su - oracle -c sqlplus sys/ibm4tem3 as sysdba <

A Simple Performance Test
- Four Oracle users, each having its own schema and tables, are defined. The 4 schemas are identical other than the name.
- Each user connection performs queries using random numbers as keys and repeats the operation until the end of the test.
- The DB cache is big enough to hold all 4 schemas in their entirety; it is therefore an in-memory test.
- All test cases are the same, except for the domain-attachment control. Each test runs a total of 256 connections, 64 for each Oracle user.

Relative Performance

                        Case 0  Case 1  Case 2      Case 3      Case 4       Case 5
  NUMA config           No      Yes     Yes         Yes         Yes          Yes
  Connection affinity   No      No      RoundRobin  RoundRobin  Partitioned  Partitioned
  NUMA_interleave       NA      Yes     Yes         No          Yes          No
  Relative performance  100%    113%    113%        112%        137%         144%

  * RoundRobin = 16 connections of each Oracle user run in each domain; Partitioned = 64 connections of one Oracle user run in each domain (the listener is attached to the rset).
  * The relative performance shown applies only to this individual test and can vary widely with different workloads.

Performance Analysis
- It is apparent that the biggest boost from NUMA optimization is achieved when user connections are affinitized and the workload does not share data across domains (cases 4 and 5).
- Cases 2 and 3 have user connection affinity, but the workload shares data across domains. In this sample test, they do not perform any better than the case without connection affinity (case 1).
- The meaning of the _enable_NUMA_interleave parameter is not documented in detail. One would assume that it enables/disables some kind of memory interleaving. Considering test cases 2 and 3, the throughput is the same regardless of whether it is enabled, probably because these test cases have low memory locality anyway. In contrast, test case 5 gets a decent boost over case 4 by disabling interleaving. This suggests that if the workload already has high memory locality, disabling _enable_NUMA_interleave can increase memory locality further.

A Real Benchmark Result
- Very promising feature for CPU-bound workloads:
  - speed-up of up to 45% on an ISV benchmark case (Power 780 & 11gR2)
  - applicable from Power 740 to Power 795 LPARs with more than 8 cores
- Further benchmark tests in progress:
  - the NUMA_optimization parameter
  - size of the Oracle rsets: P7 chip vs. P7 node
  - compatibility with SPLPARs, AME, ...

Thank you. Questions? mailto:[email protected]

© Copyright IBM Corporation 2011