Mikio Hirabayashi
Total Page:16
File Type:pdf, Size:1020Kb
IntroductionIntroduction toto TokyoTokyo ProductsProducts Mikio Hirabayashi <[email protected]> TokyoTokyo ProductsProducts •• TokyoTokyo CabinetCabinet – database library Tokyo Tyrant applications •• Tokyo Tyrant Prome database server nade – custom storage Tyrant Dystopia •• TokyoTokyo DystopiaDystopia Cabinet – full-text search engine file system •• TokyoTokyo PromenadePromenade – content management system •• openopen sourcesource – released under LGPL •• powerful,powerful, portable,portable, practicalpractical – written in the standard C, optimized to POSIX TokyoTokyo CabinetCabinet -- databasedatabase librarylibrary -- FeaturesFeatures •• modernmodern implementationimplementation ofof DBMDBM – key/value database • e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley DB – simple library = process embedded – Successor of QDBM • C99 and POSIX compatible, using Pthread, mmap, etc... • Win32 porting was abolished (inherited to KC) •• highhigh performanceperformance – insert: 0.4 sec/1M records (2,500,000 qps) – search: 0.33 sec/1M records (3,000,000 qps) •• highhigh concurrencyconcurrency – multi-thread safe – read/write locking by records •• highhigh scalabilityscalability – hash and B+tree structure = O(1) and O(log N) – no actual limit size of a database file (to 8 exabytes) •• transactiontransaction – write ahead logging and shadow paging – ACID properties •• variousvarious APIsAPIs – on-memory list/hash/tree – file hash/B+tree/array/table •• scriptscript languagelanguage bindingsbindings – Perl, Ruby, Java, Lua, Python, PHP, Haskell, Erlang, etc... TCHDB:TCHDB: HashHash DatabaseDatabase •• staticstatic hashinghashing bucket array – O(1) time complexity •• separateseparate chainingchaining key value – binary search tree key value – balances by the second hash key value •• freefree blockblock poolpool key value – best fit allocation key value – dynamic defragmentation key value •• combinescombines mmapmmap andand key value pwrite/preadpwrite/pread key value key value – saves calling system calls key value •• compressioncompression key value – deflate(gzip)/bzip2/custom TCBDB:TCBDB: B+B+ TreeTree DatabaseDatabase •• B+B+ treetree key value – O(log N) time complexity B tree index key value key value •• pagepage cachingcaching key value – LRU removing – speculative search key value key value •• standsstands onon hashhash DBDB key value – records pages in hash DB key value – succeeds time and space key value efficiency key value •• customcustom comparisoncomparison key value functionfunction key value – prefix/range matching key value key value •• cursorcursor key value – jump/next/prev key value TCFDB:TCFDB: FixedFixed --lengthlength DatabaseDatabase •• arrayarray ofof fixedfixed -- lengthlength elementselements array – O(1) time complexity value value value value – natural number keys value value value value – addresses records by value value value value multiple of key value value value value •• mostmost effectiveeffective value value value value value value value value – bulk load by mmap value value value value no key storage per – value value value value record value value value value extremely fast and – value value value value concurrent TCTDB:TCTDB: TableTable DatabaseDatabase •• columncolumn basedbased – the primary key and named columns – stands on hash DB bucket array •• flexibleflexible structurestructure – no data scheme, no data type primary key name value name value name value – various structure for each record primary key name value name value name value •• queryquery mechanismmechanism primary key name value name value name value – various operators matching column values primary key name value name value name value – lexical/decimal orders by column values primary key name value name value name value •• columncolumn indexesindexes – implemented with B+ tree primary key name value name value name value – typed as string/number – inverted index of token/q-gram column index – query optimizer OnOn --memorymemory StructuresStructures •• TCXSTR:TCXSTR: extensibleextensible stringstring – concatenation, formatted allocation •• TCLIST:TCLIST: arrayarray listlist (( dequeuedequeue )) – random access by index – push/pop, unshift/shift, insert/remove •• TCMAP:TCMAP: mapmap ofof hashhash tabletable – insert/remove/search – iterator by order of insertion •• TCTREE:TCTREE: mapmap ofof orderedordered treetree – insert/remove/search – iterator by order of comparison function OtherOther MechanismsMechanisms •• abstractabstract databasedatabase – common interface of 6 schema • on-memory hash, on-memory tree • file hash, file B+tree, file array, file table – decides the concrete scheme in runtime •• remoteremote databasedatabase – network interface of the abstract database – yes, it's Tokyo Tyrant! •• miscellaneousmiscellaneous utilitiesutilities – string processing, filesystem operation – memory pool, encoding/decoding ExampleExample CodeCode #include < tcutil.h > #include < tchdb.h > #include <stdlib.h> #include <stdbool.h> #include <stdint.h> int main(int argc, char **argv){ TCHDB *hdb; int ecode; char *key, *value; /* create the object */ hdb = tchdbnew (); /* open the database */ if(! tchdbopen (hdb, "casket.hdb", HDBOWRITER | HDBOCREAT)){ ecode = tchdbecode(hdb); /* traverse records */ fprintf(stderr, "open error: %s¥n", tchdberrmsg(ecode)); tchdbiterinit (hdb); } while((key = tchdbiternext2(hdb)) != NULL){ value = tchdbget2 (hdb, key); /* store records */ if(value){ if(! tchdbput2 (hdb, "foo", "hop") || printf("%s:%s¥n", key, value); !tchdbput2 (hdb, "bar", "step") || free(value); !tchdbput2 (hdb, "baz", "jump")){ } ecode = tchdbecode(hdb); free(key); fprintf(stderr, "put error: %s¥n", tchdberrmsg(ecode)); } } /* close the database */ /* retrieve records */ if(! tchdbclose (hdb)){ value = tchdbget2 (hdb, "foo"); ecode = tchdbecode(hdb); if(value){ fprintf(stderr, "close error: %s¥n", tchdberrmsg(ecode)); printf("%s¥n", value); } free(value); } else { /* delete the object */ ecode = tchdbecode(hdb); tchdbdel (hdb); fprintf(stderr, "get error: %s¥n", tchdberrmsg(ecode)); } return 0; } TokyoTokyo TyrantTyrant -- databasedatabase serverserver -- FeaturesFeatures •• networknetwork serverserver ofof TokyoTokyo CabinetCabinet – client/server model – multi applications can access one database – effective binary protocol •• compatiblecompatible protocolsprotocols – supports memcached protocol and HTTP – available from most popular languages •• highhigh concurrency/performanceconcurrency/performance – resolves "c10k" with epoll/kqueue/eventports – 17.2 sec/1M queries (58,000 qps) •• highhigh availabilityavailability – hot backup and update log – asynchronous replication between servers •• variousvarious databasedatabase schemaschema – using the abstract database API of Tokyo Cabinet •• effectiveeffective operationsoperations – no-reply updating, multi-record retrieval – atomic increment •• LuaLua extensionextension – defines arbitrary database operations – atomic operation by record locking •• purepure scriptscript languagelanguage interfacesinterfaces – Perl, Ruby, Java, Python, PHP, Erlang, etc... AsynchronousAsynchronous ReplicationReplication master and slaves dual master (load balancing) (fail over) write query master server client client database query if the master is dead query read query standby master update log with load balancing active master database database replicate the difference slave server slave server update log update log replicate the difference database database update log update log ThreadThread PoolPool ModelModel epoll/kqueue listen accept the client connection if the event is about the listener queue first of all, the listening socket is enqueued into the epoll queue accept epoll_ctl(add) queue back if keep-alive epoll_wait task manager epoll_ctl(del) queue enqueue move the readable client socket from the epoll queue to the task queue deque worker thread worker thread do each task worker thread LuaLua ExtentionExtention •• definesdefines DBDB operationsoperations asas LuaLua functionsfunctions – clients send the function name and record data – the server returns the return value of the function •• optionsoptions aboutabout atomicityatomicity – no locking / record locking / global locking front end request back end - function name - key data Tokyo Tyrant - value data Clients Tokyo Cabinet Lua processor response - result data script database user-defined operations case:case: TimestampTimestamp DBDB atat mixi.jpmixi.jp 20 million records •• 20 million records mod_perl – each record size is 20 bytes home.pl update •• moremore thanthan 10,00010,000 TT (active) show_friend.pl updatesupdates perper sec.sec. database view_diary.pl – keeps 10,000 connections search.pl •• dualdual mastermaster replication other pages replicationreplication TT (standby) each server is only one – database •• memcachedmemcached list_friend.pl fetch compatiblecompatible protocolprotocol list_bookmark.pl – reuses existing Perl clients case:case: CacheCache forfor BigBig StoragesStorages • worksworks asas proxyproxy • clients 1. inserts to the storage – mediates insert/search 2. inserts to the cache – write through, read through •• LuaLua extensionextension Tokyo Tyrant MySQL/hBase – atomic operation by record atomic insert locking database Lua – uses LuaSocket to access the storage database • properproper DBDB schemescheme atomic search • Lua – TCMDB: for generic cache database – TCNDB: for biased access TCHDB: for large records cache – database such as image 1. retrieves from the cache if found, return – TCFDB: for small records 2.