
Making OpenVX Really “Real Time”∗

Ming Yang¹, Tanya Amert¹, Kecheng Yang¹,², Nathan Otterness¹, James H. Anderson¹, F. Donelson Smith¹, and Shige Wang³
¹University of North Carolina at Chapel Hill   ²Texas State University   ³General Motors Research

Abstract

OpenVX is a recently ratified standard that was expressly proposed to facilitate the design of computer-vision (CV) applications used in real-time embedded systems. Despite its real-time focus, OpenVX presents several challenges when validating real-time constraints. Many of these challenges are rooted in the fact that OpenVX only implicitly defines any notion of a schedulable entity. Under OpenVX, CV applications are specified in the form of processing graphs that are inherently considered to execute monolithically end-to-end. This monolithic execution hinders parallelism and can lead to significant processing-capacity loss. Prior work partially addressed this problem by treating graph nodes as schedulable entities, but under OpenVX, these nodes represent rather coarse-grained CV functions, so the available parallelism that can be obtained in this way is quite limited. In this paper, a much more fine-grained approach for scheduling OpenVX graphs is proposed. This approach was designed to enable additional parallelism and to eliminate schedulability-related processing-capacity loss that arises when programs execute on both CPUs and graphics processing units (GPUs). Response-time analysis for this new approach is presented and its efficacy is evaluated via a case study involving an actual CV application.

1 Introduction

The push towards deploying autonomous-driving capabilities in vehicles is happening at breakneck speed. Semi-autonomous features are becoming increasingly common, and fully autonomous vehicles at mass-market scales are on the horizon. In realizing these features, computer-vision (CV) techniques have loomed large. Looking forward, such techniques will continue to be of importance as cameras are both cost-effective sensors (an important concern in mass-market vehicles) and a rich source of environmental perception.

To facilitate the development of CV techniques, the Khronos Group has put forth a ratified standard called OpenVX [42]. Although initially released only four years ago, OpenVX has quickly emerged as the CV API of choice for real-time embedded systems, which are the standard’s intended focus. Under OpenVX, CV computations are represented as directed graphs, where graph nodes represent high-level CV functions and graph edges represent precedence and data dependencies across functions. OpenVX can be applied across a diversity of hardware platforms. In this paper, we consider its use on platforms where graphics processing units (GPUs) are used to accelerate CV processing.
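As a concrete illustration of this programming model, the following is a minimal sketch of how a small OpenVX graph might be built and run using the standard C API. The particular kernels (Gaussian blur, Sobel gradients, gradient magnitude) and the image dimensions are arbitrary choices for illustration, not taken from the paper.

    #include <VX/vx.h>

    int main(void)
    {
        vx_context ctx = vxCreateContext();
        vx_graph graph = vxCreateGraph(ctx);

        /* Graph input/output: an 8-bit camera frame and a signed 16-bit gradient magnitude. */
        vx_image input  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
        vx_image output = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_S16);

        /* Virtual images carry intermediate results along graph edges. */
        vx_image blurred = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
        vx_image grad_x  = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_S16);
        vx_image grad_y  = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_S16);

        /* Nodes are high-level CV functions; the images shared between them define the edges. */
        vxGaussian3x3Node(graph, input, blurred);
        vxSobel3x3Node(graph, blurred, grad_x, grad_y);
        vxMagnitudeNode(graph, grad_x, grad_y, output);

        /* The runtime validates the graph once; each vxProcessGraph() call then
         * executes the entire graph end-to-end. */
        if (vxVerifyGraph(graph) == VX_SUCCESS)
            vxProcessGraph(graph);

        vxReleaseContext(&ctx);
        return 0;
    }

Note that vxProcessGraph() runs the verified graph as a whole, which is precisely the monolithic execution model discussed next.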
Unfortunately, OpenVX’s alleged real-time focus reveals a disconnect between CV researchers and the needs of the real-time applications where their work would be applied. In particular, OpenVX lacks concepts relevant to real-time analysis such as priorities and graph invocation rates, so it is debatable as to whether it really does target real-time systems. More troublingly, OpenVX implicitly treats entire graphs as monolithic schedulable entities.¹ This inhibits parallelism and can result in significant processing-capacity loss in settings (like autonomous vehicles) where many computations must be multiplexed onto a common hardware platform.

In prior work, our research group partially addressed these issues by proposing a new OpenVX variant in which individual graph nodes are treated as schedulable entities [23, 51]. This variant allows greater parallelism and enables end-to-end graph response-time bounds to be computed. However, graph nodes remain as high-level CV functions, which is problematic for (at least) two reasons. First, these high-level nodes still execute sequentially, so some parallelism is still potentially inhibited. Second, such a node will typically involve executing on both a CPU and a GPU. When a node accesses a GPU, it suspends from its assigned CPU. Suspensions are notoriously difficult to handle in schedulability analysis without inducing significant capacity loss.
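To make the suspension issue concrete, the following minimal sketch shows the typical CPU/GPU phases of such a coarse-grained node, written in C against the CUDA runtime API. The launch_gradient_kernel() wrapper is a hypothetical stand-in, not a function from the paper’s HOG implementation.

    #include <cuda_runtime.h>
    #include <stddef.h>

    /* Hypothetical wrapper that launches a GPU kernel asynchronously. */
    extern void launch_gradient_kernel(const float *d_in, float *d_out, size_t n);

    void coarse_grained_node(const float *h_in, float *h_out, size_t n)
    {
        float *d_in, *d_out;
        cudaMalloc((void **)&d_in, n * sizeof(float));
        cudaMalloc((void **)&d_out, n * sizeof(float));

        /* CPU phase: copy the node's input to the GPU. */
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        /* GPU phase: the CPU thread launches the kernel asynchronously... */
        launch_gradient_kernel(d_in, d_out, n);

        /* ...and then blocks (suspends from its assigned CPU) until the GPU finishes. */
        cudaDeviceSynchronize();

        /* CPU phase: copy the result back to host memory. */
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_in);
        cudaFree(d_out);
    }

Under the fine-grained approach described in Sec. 3, phases like the copies and the kernel execution above become separate graph nodes, each of which uses either a CPU or the GPU, but not both.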
Contributions. In this paper, we show that these problems can be addressed through more fine-grained scheduling of OpenVX graphs. Our specific contributions are threefold.

First, we show how to transform the coarse-grained OpenVX graphs proposed in our group’s prior work [23, 51] to fine-grained variants in which each node accesses either a CPU or a GPU (but not both). Such transformations eliminate suspension-related analysis difficulties at the expense of (minor) overheads caused by the need to manage data sharing. Additionally, our transformation process exposes new potential parallelism at many levels. For example, because we decompose a coarse-grained OpenVX node into finer-grained schedulable entities, portions of such a node can now execute in parallel. Also, we allow not only successive invocations of the same graph to execute in parallel but even successive invocations of the same (fine-grained) node.

Second, we explain how prior work on scheduling processing graphs and determining end-to-end graph response-time bounds can be adapted to apply to our fine-grained OpenVX graphs. This adaptation requires new analysis for determining response-time bounds for GPU computations. We show how to compute such bounds for recent NVIDIA GPUs by leveraging recent work by our group on the functioning of these GPUs [1]. Our analysis shows that allowing invocations of the same graph node to execute in parallel is crucial in avoiding extreme capacity loss.

Third, we present the results of case-study experiments conducted to assess the efficacy of our fine-grained graph-scheduling approach. In these experiments, we considered six instances of an OpenVX-implemented CV application called HOG (histogram of oriented gradients), which is used in pedestrian detection, as scheduled on a multicore+GPU platform. These instances reflect a scenario where multiple camera feeds must be supported. We compared both analytical response-time bounds and observed response times for HOG under coarse- vs. fine-grained graph scheduling. We found that bounded response times could be guaranteed for all six camera feeds only under fine-grained scheduling. In fact, under coarse-grained scheduling, just one camera could (barely) be supported. We also found that observed response times were substantially lower under fine-grained scheduling. Additionally, we found that the overhead introduced by converting from a coarse-grained graph to a fine-grained one had modest impact. These results demonstrate the importance of enabling fine-grained scheduling in OpenVX if real time is really a first-class concern.

Organization. In the rest of the paper, we provide needed background (Sec. 2), describe our new fine-grained scheduling approach (Sec. 3), present the above-mentioned GPU response-time analysis (Sec. 4) and case study (Sec. 5), discuss related work (Sec. 6), and conclude (Sec. 7).

∗Work supported by NSF grants CNS 1409175, CPS 1446631, CNS 1563845, and CNS 1717589, ARO grant W911NF-17-1-0294, and funding from General Motors.
¹As discussed in Sec. 3, a recently proposed extension [18] enables more parallelism, but this extension is directed at throughput, not real-time predictability, and is not available in any current OpenVX implementation.

2 Background

In this section, we review prior relevant work on the real-time scheduling of DAGs and explain how this work was applied. In this model, each DAG $G^i$ comprises $n_i$ tasks $\tau_1^i, \ldots, \tau_{n_i}^i$ (its nodes); each task releases a succession of jobs, with $\tau_{v,j}^i$ denoting the $j$th job of task $\tau_v^i$, and intra-task parallelism refers to allowing successive jobs of the same task to execute concurrently.

Ex. 1. Consider DAG $G^1$ in Fig. 1. Task $\tau_4^1$’s predecessors are tasks $\tau_2^1$ and $\tau_3^1$, i.e., for any $j$, job $\tau_{4,j}^1$ waits for jobs $\tau_{2,j}^1$ and $\tau_{3,j}^1$ to finish. If intra-task parallelism is allowed, then $\tau_{4,j}^1$ and $\tau_{4,j+1}^1$ could execute in parallel. ♦

For simplicity, we assume that each DAG $G^i$ has exactly one source task $\tau_1^i$, with only outgoing edges, and one sink task $\tau_{n_i}^i$, with only incoming edges. Multi-source/multi-sink DAGs can be supported with the addition of singular “virtual” sources and sinks that connect multiple sources and sinks, respectively. Virtual sources and sinks have a worst-case execution time (WCET) of zero.

[Figure 1: DAG $G^1$ — a four-task DAG with source $\tau_1^1$, intermediate tasks $\tau_2^1$ and $\tau_3^1$, and sink $\tau_4^1$.]

Source tasks are released sporadically, i.e., for the DAG $G^i$, the job releases of $\tau_1^i$ have a minimum separation time, or period, denoted $T^i$. A non-source task $\tau_v^i$ ($v > 1$) releases its $j$th job $\tau_{v,j}^i$ after the $j$th jobs of all its predecessors in $G^i$ have completed. That is, letting $r_{v,j}^i$ and $f_{v,j}^i$ denote the release and finish times of $\tau_{v,j}^i$, respectively, $r_{v,j}^i \geq \max\{f_{w,j}^i \mid \tau_w^i \text{ is a predecessor of } \tau_v^i\}$. The response time of job $\tau_{v,j}^i$ is defined as $f_{v,j}^i - r_{v,j}^i$, and the end-to-end response time of the $j$th invocation of the DAG $G^i$ as $f_{n_i,j}^i - r_{1,j}^i$.

Deriving response-time bounds. An end-to-end response-time bound can be computed inductively for a DAG $G^i$ by scheduling its nodes in a way that allows them to be viewed as sporadic tasks and by then leveraging response-time bounds applicable to such tasks. When viewing nodes as sporadic tasks, precedence constraints must be respected. This can be ensured by assigning an offset $\Phi_v^i$ to each task $\tau_v^i$ based on the response-time bounds applicable to “up-stream” tasks in $G^i$, and by requiring the $j$th job of $\tau_v^i$ to be released exactly $\Phi_v^i$ time units after the release time of the $j$th job of the source task $\tau_1^i$, i.e., $r_{v,j}^i = r_{1,j}^i + \Phi_v^i$, where $\Phi_1^i = 0$. With offsets so defined, every task $\tau_v^i$ in $G^i$ (not just the source) has a period of $T^i$.
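As a small worked example of this offset-based construction (consistent with the definition above, though the paper’s own assignment may differ in its details), suppose each task $\tau_v^1$ of the DAG $G^1$ in Fig. 1 has a per-task response-time bound $R_v^1$. Then one natural choice of offsets is

$\Phi_1^1 = 0, \quad \Phi_2^1 = \Phi_3^1 = R_1^1, \quad \Phi_4^1 = R_1^1 + \max(R_2^1, R_3^1).$

With these offsets, jobs $\tau_{2,j}^1$ and $\tau_{3,j}^1$ are guaranteed to have finished by the time $\tau_{4,j}^1$ is released, and $\Phi_4^1 + R_4^1$ bounds the end-to-end response time of the $j$th invocation of $G^1$.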