KDDM: A Generic Basis For Distributed Kernel Infrastructure

Renaud Lottiaux (Kerlabs), Erich Focht (NEC)

June 28th - OLS 2007 - www.kerlabs.com

Goal of this BoF

- Introduce the KDDM concept
- Measure the interest in such a sub-system
- Identify who could be interested in using it
- Get an idea of how far we could be from an integration in the mainline

Linux and Clustering

- The Linux community was quite skeptical regarding clustering
- However, several cluster projects are already included, or close to being included, in the mainline:
  - Transparent Inter-Process Communication (TIPC)
  - Distributed Lock Manager (DLM)
  - GFS
  - Oracle Cluster File System 2 (OCFS2)
  - Distributed IPC (DIPC)
  - ...
- This is just the beginning!
- A good time to set up the basis of a kernel-level distributed infrastructure: KDDM

Definition of KDDM

- Distributed objects:
  - A generic mechanism to share objects between nodes
  - Ensures access to data that is transparent, efficient and coherent!
- Objects are accessed through a set of functions:
  - Don't care about data localization
  - Just use the data!
- KDDM can host any kind of data: memory pages, data structures

Object identifier

- Objects are identified using 3 values: object id, set id, name space id
- A set is a collection of objects
  - You can freely define your sets, e.g. pages from the same System V IPC segment, ...
- A name space is a collection of sets
  - You can freely define your name spaces, e.g. regular Linux name spaces, ...
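The three-part identifier behaves like a composite key. The sketch below is a user-space illustration only. The type names, the field widths and the hash function are assumptions, not KDDM's actual definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical types: the slides only state that an object is named by
 * three values; the field widths below are assumptions. */
typedef uint32_t kddm_ns_id_t;
typedef uint32_t kddm_set_id_t;
typedef uint64_t kddm_obj_id_t;

struct kddm_obj_key {
	kddm_ns_id_t  ns_id;   /* name space: a collection of sets */
	kddm_set_id_t set_id;  /* set: a collection of objects */
	kddm_obj_id_t obj_id;  /* object inside the set */
};

/* Two keys designate the same object only if all three values match. */
static bool kddm_key_equal(const struct kddm_obj_key *a,
			   const struct kddm_obj_key *b)
{
	return a->ns_id == b->ns_id && a->set_id == b->set_id &&
	       a->obj_id == b->obj_id;
}

/* Combine the three values into one hash, e.g. to index a per-node object
 * table (word-wise xor/multiply mix using the 64-bit FNV constants). */
static uint64_t kddm_key_hash(const struct kddm_obj_key *k)
{
	const uint64_t parts[3] = { k->ns_id, k->set_id, k->obj_id };
	uint64_t h = 14695981039346656037ULL;

	for (int i = 0; i < 3; i++) {
		h ^= parts[i];
		h *= 1099511628211ULL;
	}
	return h;
}
```

A node-local object table could use kddm_key_hash() to pick a bucket and kddm_key_equal() to resolve collisions.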

Object coherence

- R/W-semaphore-like access to objects: single writer / multiple readers
- Objects are transparently moved / duplicated between nodes
- Duplication implies a coherence problem
- Coherence is managed using an “invalidation on write” mechanism
- Lighter coherence mechanisms will be implemented:
  - Update on time-out
  - Hardware-assisted data sharing (specific network needed)
  - ...
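Here is a minimal user-space model of a single object replicated over a few nodes, to make “invalidation on write” concrete. It rests on simplifying assumptions: there is no network, no concurrency, and ownership only moves on a write. All names are hypothetical, and this is not the KDDM protocol itself.

```c
#include <stdbool.h>
#include <string.h>

#define NR_NODES 4
#define OBJ_SIZE 16

/* Hypothetical user-space model of ONE object replicated over NR_NODES
 * nodes; network messages are replaced by memcpy. Simplified: the owner
 * always keeps a valid copy and read requests never change ownership. */
struct node_copy {
	bool valid;                /* does this node hold an up-to-date copy? */
	char data[OBJ_SIZE];
};

struct replicated_obj {
	struct node_copy copy[NR_NODES];
	int owner;                 /* node holding the reference copy */
};

static void obj_init(struct replicated_obj *o, int owner)
{
	memset(o, 0, sizeof(*o));
	o->owner = owner;
	o->copy[owner].valid = true;   /* initial, zero-filled content */
}

/* Fetch the owner's copy if this node has none. */
static void fetch_copy(struct replicated_obj *o, int node)
{
	if (!o->copy[node].valid) {
		memcpy(o->copy[node].data, o->copy[o->owner].data, OBJ_SIZE);
		o->copy[node].valid = true;
	}
}

/* Read access ("get"): several nodes may hold a copy at the same time. */
static const char *obj_get(struct replicated_obj *o, int node)
{
	fetch_copy(o, node);
	return o->copy[node].data;
}

/* Write access ("grab"): invalidate every other copy, then take ownership. */
static char *obj_grab(struct replicated_obj *o, int node)
{
	fetch_copy(o, node);
	for (int n = 0; n < NR_NODES; n++)
		if (n != node)
			o->copy[n].valid = false;
	o->owner = node;
	return o->copy[node].data;
}
```

Replaying the “Hello World” scenario: node 0 grabs the object and writes "Hello ". Node 1 then grabs it, which invalidates node 0's copy, and appends "world !". A later get on node 0 fetches a fresh read copy from node 1.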

Basic KDDM interface

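/* Get read access to an object (shared: multiple readers) */
void * _kddm_get_object ( struct kddm_set *kddm_set, objid_t objid );

/* Grab write access to an object (exclusive: single writer) */
void * _kddm_grab_object ( struct kddm_set *kddm_set, objid_t objid );

/* Release an object obtained with get or grab */
void * _kddm_put_object ( struct kddm_set *kddm_set, objid_t objid );

/* Remove an object from the KDDM set */
void * _kddm_remove_object ( struct kddm_set *kddm_set, objid_t objid );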
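The get / grab / put discipline resembles taking and releasing a read/write semaphore. The mock below shows that discipline, plus first-touch allocation and removal. It is restricted to a single node and uses hypothetical mock_* names. Unlike the real calls, it returns NULL instead of blocking when an access conflicts.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MOCK_MAX_OBJS 8

/* Hypothetical single-node stand-in for a KDDM set. Where real KDDM
 * calls would block on a conflicting access, these mocks return NULL. */
struct mock_obj {
	void *data;     /* NULL until the first touch */
	int readers;    /* number of outstanding gets */
	bool writer;    /* an outstanding grab? */
};

struct mock_set {
	size_t obj_size;
	struct mock_obj obj[MOCK_MAX_OBJS];
};

/* First touch: allocate a zeroed object on its first access. */
static struct mock_obj *mock_lookup(struct mock_set *s, unsigned long objid)
{
	struct mock_obj *o;

	if (objid >= MOCK_MAX_OBJS)
		return NULL;
	o = &s->obj[objid];
	if (o->data == NULL)
		o->data = calloc(1, s->obj_size);
	return o->data ? o : NULL;
}

/* get: shared read access, refused while a writer holds the object. */
static void *mock_get_object(struct mock_set *s, unsigned long objid)
{
	struct mock_obj *o = mock_lookup(s, objid);

	if (o == NULL || o->writer)
		return NULL;
	o->readers++;
	return o->data;
}

/* grab: exclusive write access, refused while anyone holds the object. */
static void *mock_grab_object(struct mock_set *s, unsigned long objid)
{
	struct mock_obj *o = mock_lookup(s, objid);

	if (o == NULL || o->writer || o->readers > 0)
		return NULL;
	o->writer = true;
	return o->data;
}

/* put: release the access obtained by get or grab. */
static void mock_put_object(struct mock_set *s, unsigned long objid)
{
	struct mock_obj *o;

	if (objid >= MOCK_MAX_OBJS)
		return;
	o = &s->obj[objid];
	if (o->writer)
		o->writer = false;
	else if (o->readers > 0)
		o->readers--;
}

/* remove: free the object; refused while it is still held. */
static int mock_remove_object(struct mock_set *s, unsigned long objid)
{
	struct mock_obj *o;

	if (objid >= MOCK_MAX_OBJS)
		return -1;
	o = &s->obj[objid];
	if (o->writer || o->readers > 0)
		return -1;
	free(o->data);
	o->data = NULL;
	return 0;
}
```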

IO Linkers

- KDDM sets are instantiated by IO linkers, which:
  - Determine the nature of the hosted objects
  - Define object input/output functions
- One kind of IO linker per kind of object to share: memory pages, file cache pages, inodes, ...
- IO linkers define links between objects and physical nodes:
  - File pages are linked to the node hosting the file data
  - Process memory pages are not linked to a given node

IO Linker functions

- IO linkers are mainly a set of function pointers
- They define the behavior of the KDDM set:
  - Object allocation / free
  - First touch
  - Object invalidation
  - Object export / import
  - Object synchronization
  - Etc.
- Default functions are provided for kmalloc-based objects
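The “function pointers with defaults” idea can be sketched with a reduced linker that has only two hooks. Everything here is an assumption made for a runnable user-space example, including the struct and function names and the use of calloc/free in place of kmalloc. The real struct iolinker_struct has many more hooks.

```c
#include <stdlib.h>

/* Hypothetical, trimmed-down IO linker. A NULL hook means "use the
 * default", here malloc-based stand-ins for the kmalloc defaults. */
struct obj_entry {
	void *object;
};

struct mini_iolinker {
	void *(*alloc_object)(struct obj_entry *e, size_t size);
	void (*remove_object)(struct obj_entry *e);
	const char *linker_name;
};

static void *default_alloc_object(struct obj_entry *e, size_t size)
{
	(void)e;
	return calloc(1, size);
}

static void default_remove_object(struct obj_entry *e)
{
	free(e->object);
	e->object = NULL;
}

/* Core side: call the linker's hook when present, else the default. */
static int core_alloc_object(const struct mini_iolinker *l,
			     struct obj_entry *e, size_t size)
{
	e->object = l->alloc_object ? l->alloc_object(e, size)
				    : default_alloc_object(e, size);
	return e->object ? 0 : -1;
}

static void core_remove_object(const struct mini_iolinker *l,
			       struct obj_entry *e)
{
	if (l->remove_object)
		l->remove_object(e);
	else
		default_remove_object(e);
}

/* A linker that only names itself relies entirely on the defaults. */
static const struct mini_iolinker hw_like_linker = { .linker_name = "hw" };
```

This is presumably why a linker that only sets its name and id, such as hw_linker in the “Hello World” example, can still host 64-byte objects.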

The IO Linker Structure

struct iolinker_struct {
	int (*instantiate) (struct kddm_set *set, void *private_data, int master);
	void (*uninstantiate) (struct kddm_set *set, int destroy);
	int (*first_touch) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
	int (*remove_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
	int (*invalidate_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
	int (*flush_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
	int (*insert_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
	int (*put_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
	int (*sync_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
	void (*change_access) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid, dsm_state_t state);
	void *(*alloc_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
	int (*import_object) (struct kddm_obj *entry, struct rpc_desc *desc);
	int (*export_object) (struct rpc_desc *desc, struct kddm_obj *obj_entry);
	void (*freeze_object) (struct kddm_obj *obj_entry);
	void (*warm_object) (struct kddm_obj *obj_entry);
	int (*is_frozen) (struct kddm_obj *obj_entry);
	char linker_name[16];
	iolinker_id_t linker_id;
};

KDDM Architecture

Figure: two distributed services sit on top of a single KDDM core; below the core, two I/O linkers each sit above their own local resource manager.

Outline

- General overview
- Hello world with KDDM!
- Quick KDDM architecture overview
- System V Shared Memory example

KDDM “Hello World!” (1/3)

/* Only name and id are set: default (kmalloc-based) functions are used */
struct iolinker_struct hw_linker = {
	.linker_name = "hw",
	.linker_id = 1,
};

struct kddm_set *hw_set;

void hello_world_setup (void)
{
	register_io_linker (1, &hw_linker);

	hw_set = create_new_kddm_set (kddm_def_ns,  /* Default name space */
				      1,            /* IO linker id */
				      KDDM_SET_NOT_LINKED,
				      64,           /* Size of objects to share */
				      NULL, 0, 0);
}

KDDM “Hello World!” (2/3)

void hello_world_node0 (void)
{
	char *buf_en, *buf_fr;

	buf_en = kddm_grab_object (hw_set, 0);   /* write access to object 0 */
	strcpy (buf_en, "Hello ");
	kddm_put_object (hw_set, 0);

	buf_fr = kddm_grab_object (hw_set, 1);   /* write access to object 1 */
	strcpy (buf_fr, "Bonjour ");
	kddm_put_object (hw_set, 1);
}

void hello_world_node1 (void)
{
	char *buf_en, *buf_fr;

	buf_en = kddm_grab_object (hw_set, 0);
	strcpy (&buf_en[6], "world !");          /* append after "Hello " (6 chars) */
	kddm_put_object (hw_set, 0);

	buf_fr = kddm_grab_object (hw_set, 1);
	strcpy (&buf_fr[8], "monde !");          /* append after "Bonjour " (8 chars) */
	kddm_put_object (hw_set, 1);
}

KDDM “Hello World!” (3/3)

Node 0:
	hello_world_setup ();
	hello_world_node0 ();

Node 1:
	hello_world_setup ();
	hello_world_node1 ();

Reading the objects back:
	char *buf;

	buf = kddm_get_object (hw_set, 0);   /* read access */
	printk ("%s\n", buf);
	kddm_put_object (hw_set, 0);

	buf = kddm_get_object (hw_set, 1);
	printk ("%s\n", buf);
	kddm_put_object (hw_set, 1);

	kddm_remove_object (hw_set, 0);
	kddm_remove_object (hw_set, 1);

Output:
	Hello world !
	Bonjour monde !

How does that work?

Figure: kddm_grab_object (hw_set, 0). A KDDM set spans both nodes; each node has its own I/O linker and memory. Object 0 holds "Hello" in one node's memory.

How does that work?

Figure: kddm_grab_object (hw_set, 0). "Hello" now appears in the memory of both nodes.

How does that work?

Figure: kddm_grab_object (hw_set, 0). A single copy, "Hello world !", remains in memory.

How does that work?

Figure: kddm_get_object (hw_set, 0). "Hello world !" is present in the memory of both nodes.



KDDM Design

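Figure (KDDM design): the KDDM core and its interface, an object server, and manager modules for name spaces (NS), sets, objects, protocol and IO linkers; below them, a KDDM communication interface with HotPlug and Comm Layer components, running over TIPC.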


Building Distributed SHM with KDDM

- Based on KDDM, building a distributed SHM mechanism is quite simple
- We need to share:
  - Segment content
    - One SHM memory data IO linker
    - A set of KDDM sets instantiated with this linker, one per memory segment
  - SHM ids
    - One SHM ids IO linker
    - A unique KDDM set hosting existing ids cluster wide

Distributed SHM implementation

- On a new segment creation:
  - Hook in the kernel newseg function
  - Create a new KDDM set for the segment data
  - Create a new entry for the segment in the ids KDDM set
  - Make the link between the SHM id and the KDDM data set id
- On segment removal:
  - Hook in the kernel do_shm_rmid function
  - Destroy the data KDDM set
  - Remove the entry in the ids KDDM set
- On segment mapping:
  - Hook in the kernel shm_mmap function
  - Set the vm_ops field of the VMA to our own set of functions

Distributed SHM implementation

- On a segment lookup:
  - Hook in the kernel shm_lock function
  - Check if the requested id exists in the ids KDDM set (kddm_get_object)
- VM operations:
  - no_page: kddm_get_object / kddm_grab_object on the data KDDM set
  - wp_page (present in the 2.2 series): kddm_grab_object on the data KDDM set
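On a page fault, the handler needs two things. It must know which object of the data set to fetch, and whether a read copy (get) or write access (grab) is needed. The sketch assumes the object id is the page index inside the segment, computed the usual way from the VMA start and vm_pgoff. The names fake_vma and MOCK_PAGE_SHIFT, and the helper functions, are hypothetical.

```c
#include <stdbool.h>

#define MOCK_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Which KDDM call the fault handler should issue (hypothetical names). */
enum kddm_access { USE_KDDM_GET, USE_KDDM_GRAB };

/* Minimal stand-in for the VMA fields needed here. */
struct fake_vma {
	unsigned long vm_start;   /* first address of the mapping */
	unsigned long vm_pgoff;   /* offset of the mapping in the segment, in pages */
};

/* Page index inside the segment, used as the object id in the data set. */
static unsigned long shm_fault_objid(const struct fake_vma *vma,
				     unsigned long address)
{
	return ((address - vma->vm_start) >> MOCK_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Read fault: a read copy is enough. Write fault: exclusive access. */
static enum kddm_access shm_fault_access(bool write_access)
{
	return write_access ? USE_KDDM_GRAB : USE_KDDM_GET;
}
```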

Container use in Kerrighed SSI OS

- Used as a basic building block to implement:
  - Process memory migration
  - Memory sharing cluster wide
  - File cache sharing cluster wide
  - Inode sharing cluster wide
  - Cluster-wide locks
  - Signal sharing
  - Etc.
- Could be used by some other projects: DIPC, OpenSSI, etc.

Conclusion

- KDDM is a high-level abstraction to share data between nodes at kernel level
- KDDM can be used to implement distributed services (more) easily
- It could be a very good basis for a distributed kernel infrastructure

Backups

Object Localization

- Localization of objects on the local node: KDDM object table
- Localization across the cluster:
  - Localization of object copies
    - For an object, one node is said to be the owner
    - The owner hosts a copy set: the list of nodes owning a copy
    - The owner is the last node that made a grab
    - The owner changes over time depending on object accesses
  - Localization of the owner: probable owner chain
    - Each node hosts a pointer to a probable owner node
    - The default probable owner is set to a default value depending on the KDDM set type
    - The chain is updated during coherence management
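Following the probable owner chain amounts to pointer chasing. This user-space model rests on several assumptions. The owner points to itself, and chains are assumed to be acyclic. Every node on the path is afterwards pointed straight at the owner, a simple stand-in for the chain updates the slide mentions, whose exact rules are not given here.

```c
#define NR_NODES 5

/* prob_owner[n] is the node that n believes to own the object;
 * the real owner points to itself. Assumes the chain has no cycle. */
static int prob_owner[NR_NODES];

/* Follow the chain from 'from' to the owner, counting the hops, then
 * point every node visited directly at the owner. */
static int find_owner(int from, int *hops)
{
	int node = from, owner;

	*hops = 0;
	while (prob_owner[node] != node) {
		node = prob_owner[node];
		(*hops)++;
	}
	owner = node;

	for (node = from; node != owner; ) {
		int next = prob_owner[node];

		prob_owner[node] = owner;
		node = next;
	}
	return owner;
}
```

With the chain 4 → 0 → 1 → 2 → 3, a first lookup from node 4 takes four hops; once the chain is shortened, the next lookup takes one.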

Data Coherence Management

An object state is associated with each object on each node. Used states:
- INV_COPY: no local copy, not owner
- INV_OWNER: no local copy, owner
- READ_COPY: local copy, read-only, not owner
- READ_OWNER: local copy, read-only, owner
- WRITE_OWNER: local copy, write, owner
- WRITE_GHOST: local copy, write, owner, not used locally
- WAIT_OBJ_READ: no local copy, wait for a read copy
- WAIT_OBJ_WRITE: no or read-only local copy, wait for a write copy
- WAIT_ACK_INV: wait for invalidation ACK to send a write copy
- WAIT_ACK_WRITE: wait for invalidation ACK to get write access
- WAIT_CHG_OWN_ACK
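The state names map naturally onto a C enum, with helper predicates for the properties listed in the table. Only the state names and their meanings come from the slide. The type name, the ordering and the helpers are illustrative assumptions.

```c
#include <stdbool.h>

/* States from the slide; type name and numeric values are illustrative. */
typedef enum {
	INV_COPY,          /* no local copy, not owner */
	INV_OWNER,         /* no local copy, owner */
	READ_COPY,         /* local read-only copy, not owner */
	READ_OWNER,        /* local read-only copy, owner */
	WRITE_OWNER,       /* local writable copy, owner */
	WRITE_GHOST,       /* local writable copy, owner, not used locally */
	WAIT_OBJ_READ,     /* waiting for a read copy */
	WAIT_OBJ_WRITE,    /* waiting for a write copy */
	WAIT_ACK_INV,      /* waiting for invalidation ACK to send a write copy */
	WAIT_ACK_WRITE,    /* waiting for invalidation ACK to get write access */
	WAIT_CHG_OWN_ACK   /* no description on the slide */
} kddm_obj_state_t;

/* Transient states: relies on all WAIT_* values being declared last. */
static bool state_is_waiting(kddm_obj_state_t s)
{
	return s >= WAIT_OBJ_READ;
}

/* Meaningful for stable states only; waiting states return false. */
static bool state_is_owner(kddm_obj_state_t s)
{
	return s == INV_OWNER || s == READ_OWNER ||
	       s == WRITE_OWNER || s == WRITE_GHOST;
}

/* Meaningful for stable states only; waiting states return false. */
static bool state_has_local_copy(kddm_obj_state_t s)
{
	return s == READ_COPY || s == READ_OWNER ||
	       s == WRITE_OWNER || s == WRITE_GHOST;
}
```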

Alloc / First Touch / Remove

void *shm_alloc_object (struct kddm_obj *obj_entry, struct kddm_set *set, objid_t objid)
{
	/* Allocate a physical page to host the object */
	return alloc_page (GFP_HIGHUSER);
}

int shm_first_touch (struct kddm_obj *obj_entry, struct kddm_set *set, objid_t objid)
{
	/* First access to the object: create its page */
	obj_entry->object = alloc_page (GFP_HIGHUSER);
	if (obj_entry->object == NULL)
		return -ENOMEM;

	return 0;
}

int shm_remove_object (struct kddm_obj *obj_entry, struct kddm_set *set, objid_t objid)
{
	/* Drop the reference held on the object's page */
	page_cache_release ((struct page *) obj_entry->object);
	return 0;
}

Import / Export

int shm_import_object (struct kddm_obj *obj_entry, struct rpc_desc *desc)
{
	struct page *page = (struct page *) obj_entry->object;
	char *data;

	/* Receive the page content from the remote node */
	data = kmap (page);
	rpc_unpack (desc, 0, data, PAGE_SIZE);
	kunmap (page);

	return 0;
}

/* Argument order matches the export_object hook of struct iolinker_struct */
int shm_export_object (struct rpc_desc *desc, struct kddm_obj *obj_entry)
{
	struct page *page = (struct page *) obj_entry->object;
	char *data;

	/* Send the page content to the requesting node */
	data = kmap (page);
	rpc_pack (desc, 0, data, PAGE_SIZE);
	kunmap (page);

	return 0;
}
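Export and import are two halves of one transfer. The node holding the page packs PAGE_SIZE bytes into an RPC message, and the receiving node unpacks them into its own page. The mock below replaces struct rpc_desc, rpc_pack and rpc_unpack with hypothetical in-memory stand-ins, so the round trip can run in user space.

```c
#include <string.h>

#define MOCK_PAGE_SIZE 4096   /* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct rpc_desc: a single-message channel. */
struct mock_rpc_desc {
	unsigned char buf[MOCK_PAGE_SIZE];
	size_t len;
};

static int mock_rpc_pack(struct mock_rpc_desc *d, const void *data, size_t size)
{
	if (size > sizeof(d->buf))
		return -1;
	memcpy(d->buf, data, size);
	d->len = size;
	return 0;
}

static int mock_rpc_unpack(struct mock_rpc_desc *d, void *data, size_t size)
{
	if (size != d->len)
		return -1;
	memcpy(data, d->buf, size);
	return 0;
}

/* Export on the node holding the page, import on the requesting node. */
static int transfer_page(const unsigned char *src, unsigned char *dst)
{
	struct mock_rpc_desc desc;

	if (mock_rpc_pack(&desc, src, MOCK_PAGE_SIZE))
		return -1;
	return mock_rpc_unpack(&desc, dst, MOCK_PAGE_SIZE);
}
```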

Invalidate

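int shm_invalidate_object (struct kddm_obj *obj_entry, struct kddm_set *set, objid_t objid)
{
	struct page *page = (struct page *) obj_entry->object;
	struct zone *zone = page_zone (page);

	lock_page (page);

	/* Unmap the page from every process mapping it */
	SetPageToInvalidate (page);
	try_to_unmap (page);
	ClearPageToInvalidate (page);

	/* Remove it from the page cache and drop the page cache reference */
	remove_from_page_cache (page);
	page_cache_release (page);

	unlock_page (page);

	/* Take the page off the LRU lists (requires the zone LRU lock) */
	spin_lock_irq (&zone->lru_lock);
	if (TestClearPageLRU (page))
		del_page_from_lru (zone, page);
	spin_unlock_irq (&zone->lru_lock);

	/* Drop the reference held by KDDM on the page */
	page_cache_release (page);

	return 0;
}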
