Analysis of Real-Time Multimedia Systems in Chetan Sethi, Lead Engineer

Abstract— Modern infotainment processors have sophisticated hardware accelerators for high-resolution video processing, but still there are some use-cases that stretch them to the limit, requiring near real-time processing realizing them (e.g. BD). On GPOS, meeting these requirements can be a challenge resulting in video artifacts like jerkiness, freeze, audio video loss and distortion. This paper talks about mechanisms and tools in Linux to analyze and fix such issues taking the BD video/graphics data-path as an example.

Keywords—Kernel Tracers, trace-cmd, Blu-ray, de-interlace, blending

time rendering target and ways for 1. INTRODUCTION analyzing these issues.

Typical media processing solutions like BD, Target audience would include developers STB, DISC players need to video and working on embedded platforms having graphics data and render it. Different post stringent video rendering requirements processing operations like de-interlacing, with pre-knowledge of audio/video alpha blending, color conversion, scaling, software development. etc are accomplished using variety of hardware accelerators to prepare one 2. TERMS AND DEFINITIONS composite signal to be rendered.

Typical video/graphics sub-system post the Term Description decoding stage consists of below DEI De Interlacer • Post processing of video, image GPU Graphics Processing Unit (2D/3D • Blending videos and graphics acceleration)/ Blender • Rendering video, image, graphics DSS Display Sub System The software stack for video/graphics sub- IP Intellectual Property system needs to ensure that it is capable of DRM Linux DRI Direct Rendering Interface Linux rendering the frames at LCD refresh rate. To OPP Operating Point/Frequency ensure, the rendering is done flawless and in real-time, it’s essential that each block in video/graphics software pipeline does its unit task in well defined real-time boundary. This paper talks about various artifacts that might avoid achieving the real-

Sasken Confidential 1 March 04, 2016

User Interface Application 3. BD Data Path

Middleware Application

Decoded Data FullHD FullHD Video/Graphics Middleware compositor Primary @30i DeInte @60p Video NV12 rlacer Alpha V4L2 Driver BltsVille DRM/DRI SD@30i SD@60p Secondary DeInte Blending Video NV12 rlacer DEI GPU DSS HD@60p Graphics ARGB User Graphics Blended Output Users of compositor module Full HD@20fps Full HD@60fps Middleware composing Legend Display Sub System video/graphics and rendering

LCD RGB Hardware Accelerators size 666 LCD Figure 2: Typical BD Software stack

Figure 1: Typical BD Data Path DEI, GPU and DSS are independent hardware accelerators and parallel Figure 1 above, shows a typical BD execution is highly desired to achieve video/graphics Data path post decoding for optimal performance and throughput. The one of the worst case combinations. BD key idea is to reduce the interdependency, composite signal consists of primary video, define independent features/task so that secondary video and graphics. For the execution is done in true parallel way to interlaced video content, de-interlacing feed data at real-time intervals. operation is performed and then primary video, secondary video and BD graphics At software level, a pipeline based signals are blended. This composite BD threading model can be used for effective signal is then finally blended with user usage of these accelerators by defining one graphics before rendering onto the LCD execution context () per hardware display. accelerator. These threads are responsible picking up buffer for processing, executing 4. Typical Implementation post processing operation using the Figure 2 below, gives a possible top level hardware blocks and outputting the overview of general software architecture resultant buffer to next hardware module. design for video/graphics composite signal These execution contexts can have input rendering use cases considering BD as an and output queues defined so that, output example. of each stage can be fed into input of another stage.

Sasken Confidential 2 March 04, 2016

Figure 3 shows the pipeline based threading system behavior and long duration model, possible for such scenario. execution. Kernel in Linux provides a light- weight, time-accurate capability to inspect system events occurring between specific DEI GPU DSS points in the application execution. This Thread Thread Thread Input Output information is invaluable in ascertaining the Buffer Buffer system’s impact on the performance of a specific application and identifying the bottlenecks. The commands in Linux to achieve this are trace-cmd and KernelShark. DEI GPU DSS Trace-cmd is the tool that initiates the trace operation and generates the log while Thread KernelShark is the GUI front-end for visual Legend inspection of the log. Buffer Below are few snapshots that demonstrate the usage of trace-cmd and kernel shark Figure 3: Pipeline based threading model tools to analyze such issues in the BD use- case.

Figure 5 below shows a snapshot of kernel 5. Video issues/artifacts and tracers dat file used for performance its Isolation measurement of video post processing IPs. Command used for obtaining below This chapter lists different video snapshot: #trace-cmd record –e ftrace issues/artifacts and probable reasons for the same as well as methods used for isolating/analyzing these issues. Issues like jerky video, audio video -sync loss, distortion, video freeze, black screen can be noticed. Multiple reasons like frame drops, reception of delayed frames at rendering level, multiple user-kernel transactions, issues, locking contentions, clock related issues, high /DMA loads etc. can result in these. Finding the root cause of above mentioned artifacts, invariably requires an insight into what the system as a whole is doing at the Figure 5: Time Measurement using Kernel instant of the artifact. Logs are of limited Tracers use when the analysis has to span multiple independently-developed components,

Sasken Confidential 3 March 04, 2016

Figure 6 below captures different thread scheduling time-stamps and shows one particular instance where due to higher priority thread of one particular process, video rendering thread belonging to other process was not being able to schedule leading to video breaks/jerky video issues. By appropriately tuning the thread priorities of both the threads this issue can be resolved. Command used for obtaining below snapshot: #trace-cmd record –e sched* -o sched.dat #kernelshark sched.dat Figure 7: User/Kernel Logging

Figure 8 below shows interrupt statistics that can be captured using trace-cmd. Also trace-cmd extends interface for easy adaptation of data in excel format for further analysis. As can be seen in below example, one interrupt occurs per ms indicating a very high rate of interruption which is not expected adding to un- necessary load on the system. This particular instance of high rate of interrupt was traced to a power management issue in one of the video post processing hardware IPs. Command used for obtaining below snapshot: #trace-cmd record –e irq

Time Stamp Function name IRQ No Interrupt Name Figure 6: Scheduling Issues 1065.393695 irq_handler_entry irq=187 name=48468000.dei 1065.394736 irq_handler_entry irq=187 name=48468000.dei Figure 7 below shows way of common 1065.395776 irq_handler_entry irq=187 name=48468000.dei logging between and kernel 1065.396817 irq_handler_entry irq=187 name=48468000.dei space. Such method has been extremely 1065.397861 irq_handler_entry irq=187 name=48468000.dei useful in figuring out video freeze issues 1065.398904 irq_handler_entry irq=187 name=48468000.dei having locking contentions at application 1065.399954 irq_handler_entry irq=187 name=48468000.dei layer. 1065.400998 irq_handler_entry irq=187 name=48468000.dei Command used for obtaining below 1065.402035 irq_handler_entry irq=187 name=48468000.dei snapshot: #trace-cmd record –e ftrace 1065.403076 irq_handler_entry irq=187 name=48468000.dei 1065.404123 irq_handler_entry irq=187 name=48468000.dei

Figure 8 Interrupt statistics

Sasken Confidential 4 March 04, 2016

Figure 9 below shows one more example of The key advantages of the trace-cmd and analysis using system calls where one KernelShark tools are: particular (NR 91- sysmunmap)

is taking around 235ms of time for returning. Also all details to figure out the • It can be used for debugging or analyzing reason for the same can be seen in the latencies and performance issues. same trace file. • It has minimal impact on the CPU load and is very accurate. Such method is extremely useful in • It has ability to trace syscall entry and analyzing system specific issues like tuning exit, and signal delivery, to a process of DMA channel priorities (which can also used for debugging a Command used: - #trace-cmd record –e all process) • Excellent user level representation tools TID TimeStamp System Call Details like kernel shark. NR 91 (95b03000, 1944 1439.713837 sys_enter 1c2000, 0, 0) • Dynamic control for recording/stopping

1944 1439.713867 irq_handler_entry irq=30 name=arch_timer traces. 1944 1439.713867 hrtimer_cancel hrtimer=0xca6b3f28 • Complete thread level scheduling 1944 1439.713867 hrtimer_expire_entry hrtimer=0xca6b3f28 statistics for the entire system can be comm=trace-cmd generated. 1944 1439.713867 sched_stat_sleep pid=17003 • Provide interface for plotting graphs in 1944 1439.79306 sched_wakeup 1944?? + 1946100? excel sheet. 1944 1439.79306 hrtimer_expire_exit hrtimer=0xe9fadac8 • 1944 1439.79306 irq_handler_exit irq=30 ret=handled It gives detailed information on multiple 1944 1439.798889 irq_handler_entry irq=30 name=arch_timer system parameters like , kernel 1944 1439.79892 hrtimer_cancel hrtimer=0xc1211a98 events etc. 1944 1439.79892 hrtimer_expire_entry hrtimer=0xc1211a98 • Gives unified interface for logging across 1944 1439.79892 softirq_raise vec=1 [action=TIMER] kernel and user space with single log file 1944 1439.79892 rcu_utilization c078c644 generation. 1944 1439.79892 rcu_utilization c078c744 1944 1439.79892 hrtimer_expire_exit hrtimer=0xc1211a98 1944 1439.79892 irq_handler_exit irq=30 ret=handled 7. CONCLUSION 1944 1439.79892 softirq_entry vec=1 [action=TIMER] 1944 1439.79895 softirq_exit vec=1 [action=TIMER] Kernel-shark and trace-cmd are ideal tools 1944 1439.806702 irq_handler_entry irq=30 name=arch_timer for analyzing performance related 1944 1439.806732 hrtimer_expire_entry hrtimer=0xc1211a98 issues/latencies and they have quite a few 1944 1439.806732 softirq_raise vec=1 [action=TIMER] advantages for performance measurement 1944 1439.806732 rcu_utilization c078c644 (drm_gem_vm_close+0x2 over traditional methods like 1944 1439.949371 kfree 8) gettimeofday(). Apart from performance 1944 1439.949371 kmem_cache_free (remove_vma+0x48) measurement quite a few advance 1944 1439.949371 sys_exit NR 91 = 0 debugging techniques are exposed by them. Figure 9 System Call statistics

Future work: 6. Salient features of tools • Explore trace-cmd and KernelShark more to see if performance issues due to other

Sasken Confidential 5 March 04, 2016

system causes (DMA, cache, etc) can be article and review of its drafts were easily debugged. provided by Nandkishor Pore and Vinod • Explore other performance debugging Vasudevan. tools in Linux (, stap, ktap, ebpf, , , ffsb) • Apply these techniques to other 9. REFERENCES performance-sensitive applications like protocol stacks http://lwn.net/Articles/341902/ http://lwn.net/Articles/425583/ http://www.ti.com/lit/ds/symlink/dra746.pdf https://en.wikipedia.org/wiki/Direct_Renderi 8. ACKNOWLEDGEMENT ng_Manager Knowledge underlying this article has been http://graphics.github.io/bltsville/ gained through experience of many people http://www- 01.ibm.com/support/knowledgecenter/linuxo in Sasken Communication Technologies Ltd. nibm/liaat/liaatbestpractices_pdf.pdf working in embedded software development section. Specific inputs to this

Sasken Confidential 6 March 04, 2016