
Introducing scheduling decisions in llvm: an automated approach

Martin Takeya Weber
École Polytechnique Fédérale de Lausanne
martin.weber@epfl.ch

ABSTRACT

Userspace threads are execution contexts that are managed by a userspace runtime rather than the kernel. For any two userspace threads that run on top of the same kernel thread, a context switch between those two userspace threads does not require any transition to kernelspace, as all thread context data is managed and available in userspace. Context switching there is thus significantly more lightweight than between kernel threads. However, as there is no way to interrupt processes without the kernel, userspace threads cannot be preemptively scheduled purely from userspace; instead, they must run cooperatively, i. e. they must regularly yield control over the CPU to allow other userspace threads to run as well.

Writing software for such an environment requires yields to be inserted in the right places: if some long-running execution path in the code was not considered and contains no yields, it will end up preventing other processes from running.

We therefore propose a mechanism to relieve the programmer from having to handle yielding manually: we have written an llvm [5] Pass that instruments the code at compile time and automatically inserts yields at specific positions. This allows code to be written for userspace threads as if it would run in a preemptively scheduled environment, leaving scheduling decisions to the compiler.

Preliminary benchmark results indicate that our method does not severely degrade throughput performance. Microbenchmarks run on single cores show that our instrumented variants of userspace threads have execution time overheads of 5 to 8 % for certain parameters compared to kernel threads. With some more refinement, we believe that this performance can be improved further.

1. INTRODUCTION

Writing a server application normally consists of reading data from a socket, processing it, and responding to the client. One way to implement such software is to create a separate thread per connection: each thread handles incoming data for a single connection, and maintains only that connection’s state. This multithreaded approach is fairly intuitive; however, it does not scale particularly well with a growing number of connections, as the kernel may not be able to spawn tens of thousands of threads at once.

An alternative is to keep only a single thread that waits for events on all sockets (e. g. with epoll_wait), and processes all incoming data within the context of the corresponding connection. This scales a bit better than the aforementioned multithreaded approach, as we do not spawn any new threads; however, the experienced latency for individual connections may suffer: if one connection misbehaves and keeps our single thread busy for a significant amount of time, other connections are not handled within reasonable delays. This can be improved by using thread pools, i. e. a fixed number of worker threads, where the main thread can simply dispatch incoming data to the next available thread, without risking becoming blocked on problematic input data itself. This ensures fairness between connections; however, this also makes matters more complex. Furthermore, as threads will handle data for various connections, connection states cannot be reflected in an intuitive way (simply by the progress within the code), but must instead be encoded in and loaded from some independent context data.

1.1 Fibers

The issue stems from the fact that the kernel cannot reasonably create an arbitrary number of threads. We can avoid this limitation by not using kernel threads, but fibers [3]: a fiber is an execution context that is created and maintained on top of a kernel thread, typically by a userspace runtime. A fiber can be considered a “userspace thread” insofar as it is associated with an individual stack and can exist independently from the fiber that created it; just like a thread, a fiber can wait for and join with other fibers, or create new fibers on its own; and just like threads, fibers rely on a separate component that manages and schedules them.

As all the required metadata and stack memory is maintained in userspace, context switching between fibers is a matter of saving a few registers on the current fiber’s stack, pointing the stack pointer to the new fiber’s stack, then restoring the registers from there, and jumping to the new fiber’s current execution position (also restored from the stack); this is less costly than having to switch to kernel space to perform a full thread context switch.
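
A minimal illustration of such a purely userspace switch, using the portable <ucontext.h> API rather than libfiber’s own switching routines (the names main_ctx, fiber_ctx and fiber_body are ours; this is a sketch of the principle, not of libfiber’s implementation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, fiber_ctx;

    static void fiber_body(void) {
        printf("fiber: first slice\n");
        swapcontext(&fiber_ctx, &main_ctx);   /* "yield" back to main */
        printf("fiber: second slice\n");
    }                                         /* returning resumes uc_link (main_ctx) */

    int main(void) {
        char *stack = malloc(64 * 1024);      /* the fiber's private stack */
        if (!stack)
            return 1;
        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = stack;
        fiber_ctx.uc_stack.ss_size = 64 * 1024;
        fiber_ctx.uc_link = &main_ctx;        /* continue here once fiber_body returns */
        makecontext(&fiber_ctx, fiber_body, 0);

        swapcontext(&main_ctx, &fiber_ctx);   /* run the fiber until it yields */
        printf("main: fiber yielded\n");
        swapcontext(&main_ctx, &fiber_ctx);   /* resume the fiber until it finishes */
        free(stack);
        return 0;
    }

Note that the glibc implementation of swapcontext also saves and restores the signal mask, which costs a system call per switch; this is one reason fiber runtimes typically implement their own, leaner switching code.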

As the kernel has no knowledge of fibers running on top of its threads, it cannot preemptively interrupt and schedule fibers (it can only do so for entire threads). Context switching to another fiber therefore requires that the fibers cooperate: as long as a fiber does not yield on its own, the userspace scheduler cannot schedule another fiber. This is a major appeal of cooperative multitasking: the lack of kernel involvement allows tighter control over the execution flow of a multitasked program. However, it requires the programmer to strategically place yields in their code to ensure that fibers do not prevent each other from running, which can be tricky: if any execution follows a path in the code that the programmer has not accounted for (and thus placed yields insufficiently), this may result in a fiber blocking execution for other fibers.

Another issue is what happens when a fiber makes a blocking system call: ideally another fiber could take over and continue running while the first one is waiting for the system call to return. However, the kernel has no knowledge of fibers; such a system call will simply block the entire kernel thread. Fibers are therefore often written in a way that they only make non-blocking system calls; however, this might impose a specific design choice on the program that may not always be suitable.

Some fiber implementations work around this issue and provide mechanisms to allow programmers to use blocking system calls with fibers. Anderson et al. proposed so-called scheduler activations [2], i. e. an extension to the kernel such that it can distinguish regular threads from threads that run fibers and interact with userspace schedulers and “activate” them in specific events. Brian Watling’s libfiber [8] on the other hand provides a drop-in replacement wrapper (shim) around blocking system calls; the shim calls the respective non-blocking variants, and then yields as long as the call returns EWOULDBLOCK, and only returns once the system call returns a real result, giving the fiber the illusion that it made a blocking system call. A major advantage of this approach over scheduler activations is that this can be done entirely in userspace and does not require any special support in the kernel.
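
Such a shim can be sketched as follows. The wrapper name fiber_read, the assumption that the file descriptor has already been put into non-blocking mode, and the void signature declared for fiber_yield are our assumptions for illustration, not libfiber’s actual wrapper code:

    #include <errno.h>
    #include <unistd.h>

    extern void fiber_yield(void);   /* provided by libfiber */

    /* Pseudo-blocking read: the calling fiber sees blocking semantics,
     * while other fibers on the same kernel thread keep running. */
    ssize_t fiber_read(int fd, void *buf, size_t count) {
        for (;;) {
            ssize_t n = read(fd, buf, count);
            if (n >= 0)
                return n;                              /* real result */
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return -1;                             /* real error */
            fiber_yield();                             /* would block: let other fibers run */
        }
    }

A real shim would typically also park the fiber until the descriptor becomes readable instead of busy-polling as this sketch does.
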
On the other hand, no fiber implementation has any mechanisms to guarantee fairness in cases where a fiber misbehaves and does not yield on its own. One approach would be to use interprocess communication mechanisms like interrupts or signals that trigger a routine in the userspace scheduler, which preemptively switches to another fiber. However, we would lose all benefits of having lightweight context switches between fibers if each context switch was preceded by comparatively costly IPC involving the kernel.

For the sake of performance, we prefer to have fibers yield on their own, but for the sake of robustness (and for practical reasons), we would like to relieve the programmer from having to insert yields manually. Our proposal is thus a mechanism that automatically inserts yields into the program at compile time.

1.2 LLVM

llvm [5] is a compilation framework that handles various aspects of compilation, including optimisation and code generation. As input, it expects a specific llvm bitcode (rather than code in a regular programming language). The goal is to relieve language designers from having to write a full-blown compiler; instead, they only need to write an llvm frontend for their language (lexer and parser), transform it to llvm bitcode, and have the llvm backend handle the rest of the compilation procedure. A well-known llvm frontend for the C language family is Clang [1].

This modular design is also applied to the internals of the backend: llvm implements the various components involved in the compilation procedure—such as optimisation—as so-called Passes that each run on the code. Third-party developers can add Passes themselves and thereby add further, more specific functionality to llvm if desired.

For our work, we intend to write an llvm Pass that automatically injects yields into the input code.

1.3 Libfiber

Libfiber [8] is a userspace library that provides an API similar to posix threads for creating and joining fibers and creating mutexes, semaphores and spin locks, and provides a function fiber_yield that wraps around some fiber management operations and fiber context switches.

As it provides a familiar API, handles blocking I/O purely in userspace, and does not pose any specific restrictions on the number of fibers that can be created, we have decided to use libfiber as a basis for our work.

2. IMPLEMENTATION

Libfiber has been designed to work on multi-core systems (i. e. it can multiplex fibers on top of multiple kernel threads), and by default, it uses a work-stealing scheduler, implemented in fiber_scheduler_wsd.c. However, if limited to a single kernel thread, we have observed that this scheduler only switches back and forth between the same two fibers, with the remaining fibers never getting CPU time at all. We have thus enabled the other scheduler also provided in libfiber, implemented in fiber_scheduler_dist.c, which schedules fibers in a round-robin fashion.

For writing our Pass, we have used llvm and Clang 7.0. We have derived our Pass from the predefined ModulePass class, as it poses the fewest restrictions on what our Pass can do (but also predefines fewer things compared to other, more specifically applicable Passes).

To inject yields into the code, we assumed that any long-running code will eventually make function calls and/or run in a loop of some sort. In our initial version, we therefore iterated through all functions and all loops in the input code, and inserted a call to fiber_yield() in each terminating basic block there; essentially, at the end of each function and each loop, the resulting code would yield.

However, as context switching itself adds some execution time overhead, we cannot yield at every such point (the program would spend a significant portion of its time just context switching between fibers and not doing any real work). We therefore decided to keep a counter that is incremented each time execution passes through such a point, and yield only when the counter reaches a certain threshold (yielding every single time would thus correspond to a threshold value of 1); our llvm Pass thus inserts something like this:

    ++globalcount;
    if (globalcount >= YIELD_THRESHOLD) {
        globalcount = 0;
        fiber_yield();
    }

The code is available in our public git repository¹.

¹ https://gitlab.gnugen.ch/mtweber/cooperative-multitasking
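
To make the effect of this instrumentation concrete, the following hand-written sketch shows at the C level where the inserted sequence ends up for a simple function containing one loop. The helper maybe_yield, the example function and the threshold value are hypothetical; the real Pass emits the counter check inline in llvm bitcode and links against libfiber’s fiber_yield, which is stubbed out here so the file compiles on its own:

    #include <stddef.h>

    #define YIELD_THRESHOLD 100000            /* illustrative value */

    static unsigned long globalcount;

    void fiber_yield(void) { }                /* no-op stub; libfiber provides the real one */

    static void maybe_yield(void) {
        ++globalcount;
        if (globalcount >= YIELD_THRESHOLD) {
            globalcount = 0;
            fiber_yield();
        }
    }

    /* Original code: a long-running loop that contains no yields of its own. */
    long sum_array(const long *data, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            sum += data[i];
            maybe_yield();    /* inserted at the end of the loop body */
        }
        maybe_yield();        /* inserted in the function's terminating basic block */
        return sum;
    }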

3. EVALUATION

The goal is to compare the cost of context switching between fibers with the cost of context switching between kernel threads, but also to see how much of an overhead is added by maintaining an additional counter for determining when to yield.

In this section, unless otherwise denoted, we refer to both userspace and kernel threads as simply “threads”.

3.1 Setup

We have written two simple throughput benchmarks (we measure the time between the first thread starting and the last thread finishing; both benchmark bodies are sketched in code right after this list):

• Counting: Let each thread increment a thread-specific variable from 0 to some given, relatively high value.

• Pointer Chasing: Generate a list of elements that each contain pointers to their respective predecessors and successors, and store the elements in an array in a random order; then let each thread walk through that list until it ends up at its starting position. To avoid that threads “profit” from each others’ memory reads by avoiding cache misses, each thread starts in a different position on the chain and—given the randomised nature of the chain in memory—should follow different paths.
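
Minimal sketches of the two benchmark bodies, with hypothetical names (the measured code lives in the repository referenced in section 2); each function is what a single thread or fiber executes, and the element layout of two 8-byte pointers matches the 16 bytes per element used in the parameters further below:

    #include <stddef.h>

    struct node {
        struct node *prev;
        struct node *next;    /* successor, stored at a randomised array position */
    };

    /* Counting: increment a thread-specific variable up to a given value. */
    void counting_worker(volatile unsigned long *counter, unsigned long target) {
        for (*counter = 0; *counter < target; ++*counter)
            ;   /* the instrumentation inserts the counter/yield check in this loop */
    }

    /* Pointer chasing: walk the randomised circular list until we arrive
     * back at our (thread-specific) starting element. */
    size_t chasing_worker(const struct node *start) {
        size_t hops = 1;
        const struct node *cur = start->next;
        while (cur != start) {
            cur = cur->next;
            ++hops;
        }
        return hops;
    }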

Both benchmarks are compiled in multiple ways:

• With libfiber, instrumented with our llvm Pass and with yields inserted automatically;

• With libfiber, uninstrumented and with yields inserted manually;

• With libfiber, uninstrumented and without any yields;

• With posix (kernel) threads.

For the variants that contain yield calls, we create multiple executables, one for each threshold level among the following: 5, 10, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸ and 10⁹. The executables are built without any compiler optimisations (-O0). The test system is an Intel Core i5-2500 CPU (second generation, “Sandy Bridge”) with 6144 KiB cache, at 3.30 GHz, and 8 GiB of main memory, running Linux 4.20.3 on Arch Linux.

We also want to exclude effects on our measurements caused by true parallelism (i. e. threads running on multiple cores); we are purely interested in comparing context switch times on a single CPU core. For the variants that use fibers we therefore initialise libfiber with a single kernel thread, and for the benchmarks using posix threads we “pin” all kernel threads to a single core with taskset.

We run each variant and each threshold level (where applicable) with the following parameters:

• 2, 3, 4, 5, 10, 20, 50 and 100 threads;

• Counting from 0 to 500,000,000; and pointer chasing through 16,000,000 elements à 16 bytes each (256 MB).

We run each variant–parameter combination three times, and sum up the measured running times.

3.2 Expectation

We expect that maintaining a yield counter adds some overhead, making our fiber implementation slower than kernel threads in this regard; however, we also expect context switching between fibers to take less time than switching between kernel threads, making fibers faster.

Furthermore, since yielding itself adds an overhead to the execution time, we expect the programs compiled with a higher yield threshold (thus yielding less frequently) to perform better than programs with a lower threshold. Consequently, we expect the extreme case (uninstrumented, without any yields) to perform best of all, even better than kernel threads, as it does not perform any more actions than the kernel threads, but also never context-switches during the execution of a fiber.

We hope that the overhead added by maintaining a counter is outweighed by the benefits of faster context switching, and that for some of the higher (but still reasonably low) threshold values, the fibers perform better than kernel threads. If not, we hope that fibers do not run too much slower than kernel threads (not more than 5 or 10 percent added overhead).

3.3 Results

3.3.1 Counting

The execution times for 100 threads counting from 0 to 500,000,000 can be seen in figure 1. As expected, the execution time appears to be inversely proportional to the yield threshold, although this relation remains largely unnoticeable for thresholds from 10⁹ to 10⁴ (all with overheads of around 5.2 to 8 % in comparison to kernel threads); execution times only start rising noticeably when we get as low as 100 (37.8 % overhead), and rather extremely with 10 (289 %) and 5 (580.2 %). We get very similar results for any other number of threads.

Figure 1: 100 threads counting, with automatically inserted yields (execution time in seconds)

Contrary to our expectations, the uninstrumented fiber variant (that neither counts nor yields) does not run significantly faster than kernel threads. And more confusingly, if we write the benchmark in a way that we manually count and yield (instead of instrumenting the code with the llvm Pass), the execution becomes faster than both kernel threads and non-yielding fibers (figure 2).

Figure 2: 100 threads counting, with manually inserted yields (execution time in seconds)

As the behaviour is otherwise rather similar to the instrumented variants (more yields equals slower overall execution), this would indicate that our llvm Pass simply produces less optimised code. To see if compiler optimisations would even out these differences and cause the instrumented and manual variants to give more similar results, we ran the benchmark built with -O3, and it turns out that compiling with optimisations changes the results wildly and unexpectedly (figure 3 for illustration): for one, we get the lowest execution times with threshold values of around 100 (which completes in around 70 % of the execution time taken by the kernel threads), whereas the larger thresholds all result in roughly the same execution times as the kernel threads. There is an awkward detail, however: if we compare the absolute running times for the manual variants, we observe that the optimised versions actually run slower than the non-optimised versions.

Figure 3: 100 threads counting, compiled with optimisations (execution time in seconds)

3.3.2 Pointer chasing

The execution times for 100 threads chasing pointers over 16,000,000 elements can be seen in figure 4. As expected, the execution times go up again for the very low yield threshold values.

Figure 4: 100 threads chasing pointers (execution time in seconds)

We have observed that neither changing the size of the chain nor changing the number of threads changes this pattern, with one exception: if the benchmark is run with only 2 threads, the behaviour changes, as shown in figure 5.

Figure 5: 2 threads chasing pointers (execution time in seconds)

Using the performance data obtained with linux-perf, we can compare the data for the different variants (table 1 shows a comparison between a low-threshold and a high-threshold variant). We observe that the low-threshold variants generally appear to have significantly fewer L1 cache misses, which appears to result in a higher number of non-stalled instructions per cycle. The lower cache miss rate for the lower yield threshold values appears to indicate that the chain is not sufficiently randomised and the fibers thus follow too similar paths, so that we get an effect of fibers profiting from each others’ memory reads; however, the results do not change when running on chains of different sizes (both longer and shorter), nor even after shuffling the chain with various different random seeds.

Unfortunately, linux-perf was unable to get the number of last-level cache (LLC) misses from the CPU.

Table 1: Comparison of linux-perf data for two variants of “100 threads chasing pointers”

4. DISCUSSION

The counting microbenchmark mostly gave us results as expected. However, to understand why compiling the benchmark with compiler optimisations should result in (slightly) slower code, the resulting binaries would have to be inspected more closely. The pointer chasing microbenchmark, on the other hand, behaved rather unexpectedly. The data gathered by linux-perf shows that caching has some impact on the results, which might indicate that the microbenchmark was written in a way that does not adequately reflect throughput performance there.

We currently use a very simple approach for deciding where to inject (potential) yields. If we use a heuristic method that chooses the insertion points more carefully, the overhead added by the instrumentation might be reduced.

Furthermore, given that we only count loops and functions, our approach is not very balanced; it favours fibers that run very short functions or loops over fibers that run very long functions or loops. There is also the problem that the yield counter is currently only reset by our inserted yields, not when fibers context switch by some other means (e. g. a blocking system call), which may lead to a fiber only receiving very little execution time, because it happens to be scheduled between a blocking system call and reaching the yield counter threshold very soon after that.

Most importantly, our heuristic fails to address the case where a fiber would make a long-running library call, where the library was not instrumented by our llvm Pass. As there are no yields inserted in the library, fairness is no longer guaranteed: yields would only happen if the library makes a blocking system call. A possible approach to mitigate this is to have the llvm Pass detect calls to external functions, and insert code that makes the application fall back to IPC mechanisms for interrupting the fiber for the duration of the library call. Depending on the density of library calls in the application code, however, this might add major overhead (as every single library function call would add additional work before and after the call), and the mere act of switching this mechanism off and on is already relatively heavyweight, not to mention the impact of the IPC mechanisms on their own.

Another important aspect that we have not tested is the impact on latency: setting a high yield threshold should result in faster total execution times, but will also result in higher latencies, as fibers yield less frequently. A latency benchmark to test that aspect should also be performed. Providing a compatibility API with posix threads would also allow us to build real-world applications with fibers, to see how their performance changes with our method.

5. RELATED WORK

Arachne [7] is a userspace runtime that manages fibers with a focus on distributing them to multiple kernel threads and isolating them from other threads. The latter is achieved by the Core Arbiter, a subcomponent that splits the CPU cores into multiple cpusets, and assigns threads running fibers to dedicated cores, while preventing the kernel from preemptively scheduling other threads on that core and degrading the performance of the fibers.

Although we consider separating fibered threads from regular threads to be rather elegant, Arachne does not specially handle blocking system calls; it assumes and encourages that fibers only make non-blocking system calls. Furthermore, the internal structure of the fiber scheduler limits the number of simultaneously existing fibers in a thread/core to only 56, which does not fit our usecase and the initially mentioned scaling problem we are trying to solve.

Boost.Fiber [4] is another fiber implementation written in (and intended for use with) C++. Like libfiber, it can handle an arbitrary number of fibers. It also documents how to make system calls in a pseudo-blocking fashion²; however, it appears to be more involved than the mechanism provided by libfiber. Nevertheless, it may be worth testing how our benchmark performance changes if used with Boost fibers.

² https://www.boost.org/doc/libs/1_69_0/libs/fiber/doc/html/fiber/nonblocking.html

6. CONCLUSION

Based on the results obtained from our microbenchmarks, it is unfortunately not possible to draw any final conclusions about the usability of our approach. Yet, despite the inconsistencies, most results indicate that performance (at least the throughput) is not substantially worse than with kernel threads, and our method could thus be used to write scalable server applications with fibers, in an intuitive and convenient way.

Information gained from latency benchmarks might also tell us if this approach could be of use in ZygOS [6].

7. ACKNOWLEDGEMENT

First and foremost, I would like to thank Marios Kogias for supervising my work, introducing me to the world of cooperative multitasking and userspace threads, always supplying me with all the necessary resources on that topic, and giving me feedback and ideas for how to approach this project (and also these vastly irritating results). Furthermore, I thank Prof. Edouard Bugnion for letting me do this project in the EPFL Datacenter Systems Laboratory. Many thanks also go to Adrien Ghosn for his input and repeatedly taking the time to discuss the project and giving me various valuable tips for the benchmarks; and the DCSL team overall for participating in the project presentation and discussion, and raising questions about the benchmarks (and their results).

8. REFERENCES

[1] Clang. https://clang.llvm.org/.
[2] Anderson, T. E., Bershad, B. N., Lazowska, E. D., and Levy, H. M. Scheduler Activations: Effective Kernel Support for User-level Management of Parallelism. ACM Transactions on Computer Systems 10, 1 (Feb 1992), 53–79.
[3] Goodspeed, N., and Kowalke, O. Distinguishing coroutines and fibers. ISO/IEC JTC1/SC22 – Programming languages and operating systems – C++ (WG21) (May 2014).
[4] Kowalke, O. Boost.Fiber. https://www.boost.org/doc/libs/1_69_0/libs/fiber/doc/html/index.html, 2013.
[5] Lattner, C., and Adve, V. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04) (Mar 2004).
[6] Prekas, G., Kogias, M., and Bugnion, E. ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks. SOSP ’17 (Oct 2017).
[7] Qin, H., Li, Q., Speiser, J., Kraft, P., and Ousterhout, J. Arachne: Core-Aware Thread Management. 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18) (Oct 2018), 145–160.
[8] Watling, B. libfiber. https://github.com/brianwatling/libfiber.
