Light Weight Kernel Threading and User Processes
WWW.DRAGONFLYBSD.ORG

LWKT SUBSYSTEM / USERLAND PROCESS SCHEDULER

(Diagram: each cpu runs its own userland process runq scheduler on top of its own LWKT scheduler. On CPU #1 and CPU #2 the fixed-priority BGL threads and the current process, CURPROC ON CPU1 / CURPROC ON CPU2, run under the per-cpu LWKT scheduler; other runnable processes wait on the process runq.)

IPI Messaging

●Abstraction promotes cpu isolative algorithms.
●IPI messages avoid mutexes. Software crossbar approach.
●Critical sections to interlock against interrupts and IPIs.
●Many IPIs can be passive. Pipelining can be made optimal.

(Diagram: CPU #1 and CPU #2 each run their own scheduler. THREAD1 on CPU #1 schedules THREAD2 on CPU #2 by sending an async SCHEDULE IPI message rather than taking a lock. THREAD1 allocates memory from its own cpu's allocator; when THREAD2 later frees that memory from CPU #2, the free is returned to the owning CPU #1 with a passive FREE IPI message, sketched below.)
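To make the passive-free idea concrete, here is a minimal C sketch. It is not the DragonFly allocator code: the structures, the free_remote/drain_remote_free helpers, and the passive IPI notification mentioned in the comment are illustrative assumptions. The point it demonstrates is that the freeing cpu never takes a mutex; it pushes the block onto the owner's queue with an atomic op, and the owning cpu drains the queue at its leisure.

    #include <stdatomic.h>
    #include <stddef.h>

    struct freeblk {
        struct freeblk *next;
    };

    struct percpu_alloc {
        _Atomic(struct freeblk *) remote_free;   /* pushed to by other cpus */
        struct freeblk           *local_free;    /* touched only by the owner */
    };

    /* Called from any cpu: hand a block back to the cpu that owns it. */
    static void
    free_remote(struct percpu_alloc *owner, struct freeblk *blk)
    {
        struct freeblk *head = atomic_load(&owner->remote_free);

        do {
            blk->next = head;
        } while (!atomic_compare_exchange_weak(&owner->remote_free, &head, blk));
        /*
         * A passive IPI would be posted to the owning cpu here; it may be
         * coalesced with other pending IPIs and processed lazily.
         */
    }

    /* Called only by the owning cpu, e.g. the next time it allocates. */
    static void
    drain_remote_free(struct percpu_alloc *pc)
    {
        struct freeblk *list = atomic_exchange(&pc->remote_free, (struct freeblk *)NULL);

        while (list != NULL) {
            struct freeblk *next = list->next;

            list->next = pc->local_free;
            pc->local_free = list;
            list = next;
        }
    }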

Light Weight Kernel Thread Messaging

●Message and port style API.
●Highly flexible.
●Semisynchronous: direct call to target's port agent.
●UP/SMP optimized.
●Very fast synchronous path.

(Diagram, synchronous case: THREAD #1 sends a message to a target port owned by T2. The port agent runs in the sender's context, the app processes the message, and the call returns synchronously (SYNC RETURN).)

(Diagram, asynchronous case: THREAD #1 sends a message to the target port owned by THREAD #2. The port agent queues the message, wakes up T2, and returns EASYNC. T2 picks the message up with GETMSG, processes it, and sends the reply to the sender's reply port; THREAD #1 blocks in WAITMSG and is woken when the reply message arrives.)
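The semisynchronous send can be illustrated with a short C sketch. This is a schematic of the idea only, not the real lwkt_msg/lwkt_port API: the struct layouts, the EASYNC value, and the wait_for_reply helper are assumptions made for the example. What it shows is that the sender always calls the target port's agent function directly, so the common case is as cheap as a function call, and only the queued case pays for a wakeup and a reply.

    #define EASYNC  0x1000   /* illustrative "message was queued" return value */

    struct msg;

    struct port {
        /* the target thread installs its agent function on its port */
        int (*agent)(struct port *port, struct msg *msg);
        /* message queue, owning thread, etc. omitted */
    };

    struct msg {
        struct port *replyport;   /* where the async reply is eventually sent */
        int          error;
        /* command and payload omitted */
    };

    int wait_for_reply(struct port *replyport, struct msg *msg);  /* illustrative */

    static int
    send_msg(struct port *target, struct msg *msg)
    {
        int error;

        /*
         * Fast path: call the target port's agent directly.  On UP, or when
         * the agent can satisfy the request immediately, this behaves like a
         * plain function call and nothing is queued.
         */
        error = target->agent(target, msg);

        /*
         * Slow path: the agent queued the message on the target thread and
         * returned EASYNC; block on our reply port until the target replies.
         */
        if (error == EASYNC)
            error = wait_for_reply(msg->replyport, msg);
        return (error);
    }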

Slab Allocator

●Per-CPU localization.
●Backed by kernel_map.
●No more kmem_map.

(Diagram: threads on CPU #1 and CPU #2 allocate from their own per-cpu slab cache without locking. A free issued from the wrong cpu is handed back to the owning cpu's cache via a passive IPI. Each per-cpu cache refills from, and drains back to, the shared back end in bulk transfers of typically 128K, under a lock, with hysteresis on the free side.)
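A userspace-flavored C sketch of the per-cpu front end over a locked bulk back end follows. It is not the DragonFly slab code: the chunk layout, the bulk_alloc/bulk_free helpers, and the pthread mutex standing in for the kernel back-end lock are all assumptions. It shows the property the slide is after: the per-cpu free list is touched only by its owning cpu, so the fast paths take no lock, and the shared lock is paid only once per bulk transfer.

    #include <stddef.h>
    #include <pthread.h>

    #define BULK_BYTES  (128 * 1024)   /* "BULK = 128K typ." from the slide */

    struct chunk {
        struct chunk *next;
    };

    struct percpu_cache {
        struct chunk *free;     /* touched only by the owning cpu: lockless */
        int           nfree;
        size_t        chunksize;
    };

    /* Shared back end protected by one lock; in the real allocator this is
     * where kernel_map pages would be carved up. */
    static pthread_mutex_t backend_lock = PTHREAD_MUTEX_INITIALIZER;

    struct chunk *bulk_alloc(size_t chunksize, int count);   /* illustrative */
    void          bulk_free(struct chunk *list, int count);  /* illustrative */

    static void *
    slab_alloc(struct percpu_cache *pc)
    {
        struct chunk *c;

        if (pc->free == NULL) {                  /* cache empty: bulk refill */
            int count = (int)(BULK_BYTES / pc->chunksize);

            pthread_mutex_lock(&backend_lock);
            pc->free = bulk_alloc(pc->chunksize, count);
            pthread_mutex_unlock(&backend_lock);
            if (pc->free == NULL)
                return (NULL);
            pc->nfree = count;
        }
        c = pc->free;
        pc->free = c->next;
        pc->nfree--;
        return (c);
    }

    static void
    slab_free(struct percpu_cache *pc, void *p)
    {
        struct chunk *c = p;

        c->next = pc->free;
        pc->free = c;
        pc->nfree++;
        /*
         * Crude hysteresis: only when well over a bulk unit's worth of
         * chunks has accumulated is the cache dumped back to the back end,
         * so light alloc/free churn never touches the lock.
         */
        if ((size_t)pc->nfree * pc->chunksize > 2 * BULK_BYTES) {
            pthread_mutex_lock(&backend_lock);
            bulk_free(pc->free, pc->nfree);
            pthread_mutex_unlock(&backend_lock);
            pc->free = NULL;
            pc->nfree = 0;
        }
    }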

Threaded Network Protocol Stacks

●Work being done by Jeffrey Hsu.
●Utilizes LWKT messages and ports.
●1 thread per cpu per protocol typ.
●UP is a degenerate case of SMP.
●Mbuf free path expected to cross cpus.
●Cache and return, or cache and reuse?

(Diagram: incoming packets are allocated as mbufs on CPU #0 and dispatched through the NETIF and protocol switch, via IPI/LWKT messages, to per-cpu NETISR TCP A/B and NETISR UDP A/B threads, each owning its own PCBs. The mbuf is freed after the user read, which may run on CPU #1, so the free travels back to CPU #0 as a passive IPI.)

Threaded Network Protocol Stacks: Wildcard PCBs for TCP

●Hash on (srcaddr, srcport, destaddr, destport).
●Replicate wildcard listen sockets.
●Create normal PCB on accept().
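A small C sketch of the dispatch decision follows; the hash function and its mixing are illustrative assumptions, not DragonFly's actual code. The idea is that the connection 4-tuple deterministically selects one cpu's TCP thread, and because wildcard (listen) PCBs are replicated on every cpu, that thread can always complete the lookup locally.

    #include <stdint.h>

    /* Pick the cpu (and thus the NETISR TCP thread) that owns this flow. */
    static int
    tcp_dispatch_cpu(uint32_t saddr, uint16_t sport,
                     uint32_t daddr, uint16_t dport, int ncpus)
    {
        uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16) ^ dport;

        h ^= h >> 16;                /* mix so the low bits see all inputs */
        return (int)(h % (uint32_t)ncpus);
    }

    /*
     * The selected cpu's TCP thread receives the packet as an LWKT message
     * and owns every PCB (including its replica of each wildcard PCB) that
     * hashes to it, so no PCB locking is needed inside the protocol thread.
     */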

CPU #1          CPU #2          CPU #3          CPU #4
(TCP THREAD)    (TCP THREAD)    (TCP THREAD)    (TCP THREAD)

PCB 1           PCB 2           PCB 3           PCB 4
PCB 5           PCB 6           PCB 7           PCB 8
PCB 9           PCB 10          PCB 11          PCB 12

WILDCARD        WILDCARD        WILDCARD        WILDCARD
PCB #30         PCB #30         PCB #30         PCB #30

WILDCARD        WILDCARD        WILDCARD        WILDCARD
PCB #31         PCB #31         PCB #31         PCB #31

(Each TCP thread owns the normal PCBs that hash to its cpu; the wildcard listen PCBs #30 and #31 are replicated on every cpu.)

Threaded Network Protocol Stacks: Wildcard PCBs

●Reference counters require synchronization.
●So don't use reference counters.

(Diagram: when the program closes the listen socket, a CLOSE MSG is delivered to each per-cpu TCP thread, and each thread frees its own replica of wildcard PCB #30; a sketch of the idea follows.)
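A brief C sketch of the replica teardown follows; the message structure and the send_async_msg helper are illustrative stand-ins for the LWKT message and port calls, not the actual networking code. Because each per-cpu TCP thread owns its replica outright, a close simply fans one message out to every cpu and each thread frees its own copy, with no shared reference count to synchronize.

    struct wildcard_close_msg {
        /* an lwkt-style message header would go here */
        int lport;        /* identifies the listen pcb replica to tear down */
    };

    /*
     * Illustrative helper: queue a small message (copied by value) to the
     * TCP thread bound to the given cpu.  Stands in for the LWKT port send.
     */
    void send_async_msg(int cpu, const struct wildcard_close_msg *msg);

    static void
    wildcard_pcb_close(int lport, int ncpus)
    {
        struct wildcard_close_msg msg;
        int cpu;

        msg.lport = lport;
        /*
         * One message per cpu; each per-cpu TCP thread unlinks and frees its
         * own replica when the message arrives.  No shared reference count,
         * no atomic ops, no lock.
         */
        for (cpu = 0; cpu < ncpus; cpu++)
            send_async_msg(cpu, &msg);
    }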

The Namecache

FREEBSD                                   DRAGONFLY
optional, disconnected namespace          mandatory, connected namespace
vnode locked                              namecache locked

struct vnode                              struct vnode
    LIST_HEAD(, ncp) v_cache_src;             struct ncp_list v_namecache;
    TAILQ_HEAD(, ncp) v_cache_dst;
    struct vnode *v_dd;

struct namecache (ncp)                    struct namecache (ncp)
    LIST_ENTRY(ncp) nc_src;                   TAILQ_ENTRY(ncp) nc_entry;
    TAILQ_ENTRY(ncp) nc_dst;                  TAILQ_ENTRY(ncp) nc_vnode;
    struct vnode *nc_dvp;                     struct ncp_list nc_list;
    struct vnode *nc_vp;                      struct vnode *nc_vp;

The Namecache

●Namespace locked by namecache, not by vnode.
●Negative namecache entries can be created and locked.
●Aliased vnodes do not confuse the kernel.

VOP_RENAME

    *BSD:       VOP_RENAME(fdvp, fvp, fcnp, tdvp, tvp, tcnp)
    DRAGONFLY:  VOP_NRENAME(fncp, tncp)

struct filedesc
    struct vnode *fd_cdir;
    struct vnode *fd_rdir;
    struct vnode *fd_jdir;
    struct namecache *fd_ncdir;
    struct namecache *fd_nrdir;
    struct namecache *fd_njdir;

The Namecache

●Simplified VFS interface: VOP_NRESOLVE(ncp, cred).
●Simplified kernel interface: nlookup().
●No more namei(), no more lookup().
●No more componentnames, except in compatibility shims.
●Separate NLOOKUPDOTDOT for “..” lookups.
●Topological guarantees for referenced namecache records.
    ●Node remains intact in the topology even if the file is deleted.
    ●Traversal to root is guaranteed (getcwd, fstat, and more).
    ●Most dvp references replaced by ncp references.
●VFS involvement only for attribute access (for now).
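As an illustration of how the simplified kernel interface is meant to be used, here is a rough sketch of a path lookup with nlookup(). The flag, field, and header names are written from memory of the API of that era and may not match the tree exactly, so treat it as illustrative rather than authoritative.

    #include <sys/nlookup.h>    /* DragonFly kernel header (assumed) */

    static int
    example_lookup(const char *path)
    {
        struct nlookupdata nd;
        int error;

        /* Set up the lookup; the path here is a kernel-space string. */
        error = nlookup_init(&nd, path, UIO_SYSSPACE, NLC_FOLLOW);
        if (error == 0) {
            /* Resolve the path to a referenced, locked namecache entry. */
            error = nlookup(&nd);
        }
        if (error == 0) {
            /*
             * nd.nl_ncp is the resolved ncp; the topology from it up to the
             * root stays intact while the reference is held.
             */
        }
        nlookup_done(&nd);      /* release the ncp and any temporary state */
        return (error);
    }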

The Namecache: Future work

●Intended to be hooked into a cache coherency layer (aliasing, SSI, NFS).
●Direct attribute caching.
●Good place for a MAC implementation.
●Clean up VFSs and remove shims supporting obsolete VOPs.

Big Giant Lock Removal

Matthew Dillon
DragonFly BSD Project
April 2005

Current BGL Coverage

(Diagram: kernel subsystems marked by whether they still require the Big Giant Lock: serial, fastint, and clock interrupt threads; softint threads; the LWKT subsystem (scheduler and messaging); the idle thread; netif, disk, and timer threads; threads running in the kernel and in user mode; NETISR; the route table; IPI messages; syscalls; the VFS; and the memory subsystem (slab cache and kmem_alloc). Legend: BGL not required / BGL required.)

Next Stage BGL Removal

(Diagram: the same subsystem map, showing which additional pieces come out from under the BGL in the next stage. Legend: BGL not required / BGL required.)

BGL Removal – Network Detail

●Work being done by Jeffrey Hsu.
●Utilizes LWKT msgs and ports.
●Multiple threads per protocol.
●No uni-processor penalty.

(Diagram: packets flow from the NETIF through the protocol switch to per-cpu NETISR TCP A/B and NETISR UDP A/B threads, each owning its own PCBs. Legend: BGL not required.)

The DragonFly BSD Project
April 2005

Joe Angerson
Galen Sampson
David Xu
YONETANI Tomokazu
Hiroki Sato
Craig Dooley
Simon Schubert
Liam J. Foy
Joerg Sonnenberger
Robert Garrett
Justin Sherrill
Jeffrey Hsu
Scott Ullrich
Douwe Kiela
Jeroen Ruigrok van der Werven
Sascha Wildner
Todd Willey
Emiel Kollof
Kip Macy
and Fred
Andre Nathan
Erik Nygaard
Max Okumoto
Hiten Pandya
Chris Pressey
David Rhodus

WWW.DRAGONFLYBSD.ORG