
COMPENDIUM OF TECHNICAL WHITE PAPERS

Compendium of Technical White Papers from Kx


Contents

Machine Learning

1. Machine Learning in kdb+: kNN classification and pattern recognition with q
2. An Introduction to Neural Networks with kdb+

Development Insight

3. Compression in kdb+
4. Kdb+ and Websockets
5. API for kdb+
6. Efficient Use of Adverbs

Optimization Techniques

7. Multi-threading in kdb+: Performance Optimizations and Use Cases
8. Kdb+ tick Profiling for Throughput Optimization
9. Columnar Database and Query Optimization

Solutions

10. Multi-Partitioned kdb+ Databases: An Equity Options Case Study
11. Surveillance Technologies to Effectively Monitor Algo and High Frequency Trading


1 Machine Learning in kdb+: kNN classification and pattern recognition with q

Author:

Emanuele Melis works for Kx as a kdb+ consultant. Currently based in the UK, he has been involved in designing, developing and maintaining solutions for Equities data at a world-leading financial institution. Keen on machine learning, Emanuele has delivered talks and presentations on pattern recognition implementations using kdb+.

CONTENTS

MACHINE LEARNING IN KDB+: KNN CLASSIFICATION AND PATTERN RECOGNITION WITH Q
1.1 LOADING THE DATASET
1.2 CALCULATING DISTANCE METRIC
1.3 K-NEAREST NEIGHBORS AND PREDICTION
1.3.1 Nearest Neighbor k=1
1.3.2 k>1
1.3.3 Prediction test
1.4 ACCURACY CHECKS
1.4.1 Running with k=5
1.4.2 Running with k=3
1.4.3 Running with k=1
1.5 FURTHER APPROACHES
1.5.1 Use slave threads
1.5.2 Code optimization
1.5.3 Euclidean or Manhattan Distance?
1.6 CONCLUSIONS


1.1 Loading the Dataset

Once downloaded, this is how the dataset looks in a text editor:

The last number on the right-hand side of each line is the class label, and the other sixteen numbers are the class attributes: 16 numbers representing the 8 pairs of Cartesian coordinates sampled from each handwritten digit.

Figure 1. CSV dataset

We start by loading both test and training sets into a q session:

q) test:`num xkey flip ((`$'16#.Q.a),`num)!((16#"i"),"c"; ",") 0:`pendigits.tes

q)show test
num| a  b   c  d   e  f  g   h  i   j  k   l  m   n   o   p
---| -------------------------------------------------------
8  | 88 92  2  99  16 66 94  37 70  0  0   24 42  65  100 100
8  | 80 100 18 98  60 66 100 29 42  0  0   23 42  61  56  98
8  | 0  94  9  57  20 19 7   0  20  36 70  68 100 100 18  92
9  | 95 82  71 100 27 77 77  73 100 80 93  42 56  13  0   0
9  | 68 100 6  88  47 75 87  82 85  56 100 29 75  6   0   0
..

q) tra:`num xkey flip ((`$'16#.Q.a),`num)!((16#"i"),"c"; ",") 0:`pendigits.tra

q)show tra
num| a  b   c  d   e   f   g   h   i  j  k   l  m   n  o   p
---| --------------------------------------------------------
8  | 47 100 27 81  57  37  26  0   0  23 56  53 100 90 40  98
2  | 0  89  27 100 42  75  29  45  15 15 37  0  69  2  100 6
1  | 0  57  31 68  72  90  100 100 76 75 50  51 28  25 16  0
4  | 0  100 7  92  5   68  19  45  86 34 100 45 74  23 67  0
1  | 0  67  49 83  100 100 81  80  60 60 40  40 33  20 47  0
..

For convenience, the two resulting dictionaries are flipped into tables and keyed on the class label so that we can later leverage q-sql and some of its powerful features. Keying at this stage is done for display purposes only. Rows taken from the sets will be flipped back into dictionaries and the class label will be dropped² while computing the distance metric.

² Un-keying and flipping a table into a dictionary to drop the first key is more efficient than functionally deleting a column.
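
For illustration, the two approaches can be compared with \ts once the one-row table tes1 is defined below; the iteration count is arbitrary and timings are omitted here as they are system-dependent:

q)\ts:100000 1_flip 0!tes1              // un-key, flip into a dictionary, drop the key entry
q)\ts:100000 flip delete num from tes1  // functional delete of the class label, then flip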


The column names do not carry any information, so the first 16 letters of the alphabet are chosen to name the 16 integer columns representing the class attributes, while the class label, stored as a character, is assigned the mnemonic tag "num".

1.2 Calculating distance metric

As mentioned previously, in a k-NN classifier the distance metric between instances is the distance between their feature arrays. In our dataset, the instances are rows of the tra and test tables, and their attributes are the columns. To better explain this, we demonstrate with two instances from the training and test set:

q)show tra1:1#tra
num| a  b   c  d  e  f  g  h i j  k  l  m   n  o  p
---| -----------------------------------------------
8  | 47 100 27 81 57 37 26 0 0 23 56 53 100 90 40 98

q)show tes1:1#test
num| a  b  c d  e  f  g  h  i  j k l  m  n  o   p
---| ---------------------------------------------
8  | 88 92 2 99 16 66 94 37 70 0 0 24 42 65 100 100

Figure 2-1: tra1 and tes1 point plot
Figure 2-2: tra1 and tes1 visual approximation

Both instances belong to the class digit “8”, as per their class labels. However, this is not clear by just looking at the plotted points and the num column is not used by the classifier, which will instead calculate the distance between matching columns of the two instances. That is, calculating how far the a to p columns of tes1 are from their counterparts in tra1. While only an arbitrary measure, the classifier will use these distances to identify the nearest neighbour(s) in the training set and make a prediction.


In q, this will be achieved using a dyadic function whose arguments can be two tables. Using the right adverbs, it is applied column by column, returning one table that stores the result of each iteration in the a to p columns. The main benefit of this approach is that it relieves the developer from the burden of looping and indexing lists when doing point-to-point computations.

The metric that will be used to determine the distance between the feature points of two instances is the Manhattan distance, \sum_{i=1}^{k} |x_i - y_i|, calculated as the sum of the absolute differences of their Cartesian coordinates. Using a Cartesian distance metric is intuitive and convenient, as the columns in our set indeed represent X or Y coordinates:

q) dist:{abs x-y}

q)tra1 dist' tes1
num| a  b c  d  e  f  g  h  i  j  k  l  m  n  o  p
---| ----------------------------------------------
8  | 41 8 25 18 41 29 68 37 70 23 56 29 58 25 60 2

Now that the resulting table represents the distance metric between each attribute of the two instances, we can sum all the values and obtain the distance between the two instances in the feature space:

q)sums each tra1 dist' tes1
num| a  b  c  d  e   f   g   h   i   j   k   l   m   n   o   p
---| ----------------------------------------------------------
8  | 41 49 74 92 133 162 230 267 337 360 416 445 503 528 588 590

The q verb sums adds up the columns, from left to right, and the last column, p, holds the total: 590, which represents the Manhattan distance between tra1 and tes1. Expanding to run against the whole training set, tra, will return the list of Manhattan distances between tes1 and all the instances in the set, upon which the classifier will try to make its prediction. Calculating those distances can be done trivially using the adverb each.

Calculating the distance metric is possibly the heaviest computing step of a k-NN classifier, and \ts can be used to compare the performance of the each-left and each-right variations of each, capturing their total execution time after an arbitrarily high number of iterations (5000 in the example was enough to show a difference). As both sets are keyed on num, and tra contains more than one instance, a change of paradigm and adverb is necessary:
• Un-key and remove the num column from tes1
• Update the adverb so that dist gets applied to all rows of the tra table:

q)\ts:5000 tra dist\: 1_flip 0!tes1
203770 5626544

q)\ts:5000 (1_flip 0!tes1) dist/: tra
167611 5626544

Keeping tes1 as the left argument while using the each-right adverb makes the execution a little more time efficient, due to how tes1 and tra are serialized and how tra is indexed. Moreover, each-right makes the order of the operations clearer: we are calculating the distance between the left argument (the validation instance) and each row of the table in the right argument (the training set).
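
As a minimal illustration of the difference between the two adverbs, using arbitrary operands rather than the dataset:

q)1 2 3 -\: 10    // each-left: apply the items of the left argument one at a time
-9 -8 -7
q)10 -/: 1 2 3    // each-right: apply the items of the right argument one at a time
9 8 7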

In q, lambda calculus is supported and functions are “first class citizens”:


q){sums each (1_x) dist/: tra} flip 0!tes1
num| a  b   c   d   e   f   g   h   i   j   k   l   m   n   o   p
---| -------------------------------------------------------------
8  | 41 49  74  92  133 162 230 267 337 360 416 445 503 528 588 590
2  | 88 91  116 117 143 152 217 225 280 295 332 356 383 446 446 540
1  | 88 123 152 183 239 263 269 332 338 413 463 490 504 544 628 728
4  | 88 96  101 108 119 121 196 204 220 254 354 375 407 449 482 582
1  | 88 113 160 176 260 294 307 350 360 420 460 476 485 530 583 683
6  | 12 20  106 106 139 147 224 234 304 320 357 381 412 461 541 621
4  | 88 96  97  124 134 165 174 176 206 277 350 423 446 462 496 596
..

Note, there is no argument validation done within the lambda (this has minimal memory footprint and compute cost), which means x must be a dictionary with the same keys as the table tra, num column excluded.

The first step of the classifier is complete. Let's store the lambda in a variable called apply_dist. Columns a to o are not needed and can be removed, and p is renamed "distance":

q) apply_dist:{[d;t] select num,distance:p from sums each t dist/: d}

Projecting apply_dist into a monadic function allows for further optimization. As dist expects a dictionary, the test instance can be passed as the second parameter either as a dictionary, razing it after removing the num column, or as a table, using the adverb each after deleting the num column. The returned result will be a list of tables if the parameter is passed as a table of length greater than 1; otherwise, a single table.

q)apply_dist[tra;]each delete num from tes1
+`num`distance!("821416405098597339225158640481857093549275633632543101453206..

q)apply_dist[tra]each 2#delete num from tes1
+`num`distance!("821416405098597339225158640481857093549275633632543101453206..
+`num`distance!("821416405098597339225158640481857093549275633632543101453206..

q)apply_dist[tra;]raze delete num from tes1
num distance
------------
8   590
2   540
1   728
4   582
..


1.3 k-Nearest Neighbors and prediction

The classifier is almost complete. As the column p now holds the distance metric between the test instance and each training instance, to validate its behaviour we shall find if there are any instances of tra where distance<590.

As the result of apply_dist is a table, we use q-sql to keep the code readable without compromising performance. The benefits of this approach will be clearer at the end of this chapter, after having analysed the possible implementations for k=1 and k>1.

1.3.1 Nearest Neighbor k=1

If k=1 the prediction is the class of the nearest neighbour - the instance with the smallest distance in the feature space. As all distance metrics are held in the column distance, we want to "select the class label of the row from the result of applying the distance metric where the distance is the minimum distance in the whole table". In q-sql:

q)select Prediction:num, distance from apply_dist[tra;]raze[delete num from tes1] where distance=min distance
Prediction distance
-------------------
8          66

q)select from tra where i in exec i from apply_dist[tra;]raze[delete num from tes1] where distance=min distance
num| a  b  c d   e  f  g  h  i  j k l  m  n  o   p
---| ----------------------------------------------
8  | 76 85 0 100 16 63 81 31 69 0 6 25 34 64 100 95

Querying the virtual column i, we can find the row index of the nearest neighbour used for the prediction. Plotting this neighbour we see several points now overlap, while the distance between non-overlapping points is definitely minimal compared to the previous example (Figure 2-2):

Figure 3: tes1 and its nearest neighbour


Another approach to finding the k-nearest neighbours is to sort the result of apply_dist in ascending order by distance; this way, the first k rows are the k-nearest neighbours. Taking only the first row is equivalent to selecting the nearest neighbour.

q)select Prediction:num from 1#`distance xasc apply_dist[tra;]raze delete num from tes1
Prediction
----------
8

Or instead limit to the first row by using the row index:

q)select Prediction:num from `distance xasc apply_dist[tra;]raze[delete num from tes1] where i<1
Prediction
----------
8

Sorting on distance has the side effect of applying the sorted attribute to the column. This allows kdb+ to use a different search algorithm on that column, for the benefit of speed, with very little trade-off.
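
This can be verified with the attr keyword; a minimal illustration on arbitrary values:

q)attr asc 3 1 2      // asc and xasc apply the s# (sorted) attribute
`s
q)attr exec distance from `distance xasc ([]distance:590 540 728f)
`s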

1.3.2 k>1

For k>1, the prediction is instead the predominant class among the k-nearest neighbors. Tweaking the above example so that 1 is abstracted into the parameter k:

q)k:3; select Neighbour:num,distance from k#`distance xasc apply_dist[tra;]raze delete num from tes1
Neighbour distance
------------------
8         66
8         70
8         75

q)k:5; select Neighbour:num,distance from k#`distance xasc apply_dist[tra;]raze delete num from tes1
Neighbour distance
------------------
8         66
8         70
8         75
8         92
8         94

At this point, knowing which results are the k nearest neighbors, the classifier can be built further to have class “prediction” functionality. It is common to use odd values of k to prevent uncertainty when determining a class, however it can still happen that there is no clear predominant class amongst the neighbors. In this scenario one can choose to either drop the classification for the uncertain sample, add a weight to their distance metric, or fall back to use smaller values of k. The steps needed to build the k-NN circle can be summarized into a function get_nn:

q) get_nn: {[k;d] select nn:num,distance from k#`distance xasc d}

1.3.3 Prediction test

The next step is to make a prediction based on the classes identified in the k-NN circle. This will be done by counting the number of occurrences of each class among the k-nearest neighbors and picking the one with the highest count.


As the k-NN circle returned by get_nn is a table, the q-sql fby construct can be used to apply the aggregating function count to the virtual column i, counting how many rows per nn class label are in the k-NN circle table, and comparing that with the highest count:

q) predict: {1#select Prediction:nn from x where ((count;i)fby nn)=max (count;i)fby nn}

For k>1, it is possible that fby can return more than one instance, should there not be a prevailing class. Given that fby returns entries in the same order they are aggregated, class labels are returned in the same order they are found. Thus, designing the classifier to take only the first row of the results has the side effect of defaulting the behaviour of predict to k=1. Consider this example:

q) foo1: { select Prediction:nn from x where ((count;i)fby nn)=max (count;i)fby nn};

q) foo2: { 1#select Prediction:nn from x where ((count;i)fby nn)=max (count;i)fby nn};

q)(foo1;foo2)@\: ([] nn:"28833"; distance: 20 21 31 50 60) // Dummy table, random values
+(,`Prediction)!,"8833"  // foo1, no clear class
+(,`Prediction)!,,"2"    // foo2, default to k=1; take nearest neighbour

Let’s now test predict with k=5 and tes1:

q)predict get_nn[5;] apply_dist[tra;]raze delete num from tes1
Prediction
----------
8

Spot on!

1.4 Accuracy checks

At this stage, we can feed the classifier with the whole test set, enriching the result of predict with a column Test holding the class label of the test instance, and a column Hit which will be true when prediction and class label match:

q) test_harness:{[d;k;t] select Test:t`num, Prediction, Hit:Prediction=' t`num from predict get_nn[k] apply_dist[d] raze delete num from t };

1.4.1 Running with k=5

q) R5:test_harness[tra;5;] peach 0!test

q)R5
+`Test`Prediction`Hit!(,"8";,"8";,1b)
+`Test`Prediction`Hit!(,"8";,"8";,1b)
+`Test`Prediction`Hit!(,"8";,"8";,1b)
+`Test`Prediction`Hit!(,"9";,"9";,1b)
..

As the result of test_harness is a list of predictions, it needs to be razed into a table to extract the overall accuracy stats. The accuracy measure will be the number of hits divided by the number of predictions made for the corresponding validation class:


q)select Accuracy:sum[Hit]%count i by Test from raze R5
Test| Accuracy
----| ---------
0   | 0.9614325
1   | 0.9505495
2   | 0.9945055
3   | 0.9940476
4   | 0.978022
5   | 0.9761194
6   | 1
7   | 0.9532967
8   | 0.9970238
9   | 0.9672619

1.4.2 Running with k=3

q) R3:test_harness[tra;3;] peach 0!test

q)select Accuracy:sum[Hit]%count i by Test from raze R3
Test| Accuracy
----| ---------
0   | 0.9641873
1   | 0.9587912
2   | 0.9917582
3   | 0.9940476
4   | 0.9807692
5   | 0.9701493
6   | 1
7   | 0.967033
8   | 0.9970238
9   | 0.9583333

1.4.3 Running with k=1

q) R1:test_harness[tra;1;] peach 0!test

q)select Accuracy:sum[Hit]%count i by Test from raze R1
Test| Accuracy
----| ---------
0   | 0.9614325
1   | 0.9505495
2   | 0.9917582
3   | 0.9940476
4   | 0.978022
5   | 0.961194
6   | 1
7   | 0.956044
8   | 0.9940476
9   | 0.9583333

1.5 Further Approaches

1.5.1 Use slave threads

Testing and validation phases will benefit significantly from the use of slave threads in kdb+, applied through the use of the peach function:

-s 0 (0 slaves) - Run time: ~47s


q)\ts test_harness[tra;1;] each 0!test
47225 2946880

q)47225 % 1000
47.225

-s 4 (4 slaves) - Run time: ~24s

q)\ts test_harness[tra;1;] peach 0!test
24431 744656

q)24431 % 1000
24.431

1.5.2 Code optimization

The function dist wraps a series of operations to calculate the distance metric. This is a convenient approach for non-complex metrics and for testing, as it can be changed on the fly without having to change apply_dist's implementation. However, it is executed 16 times for each row in d, which makes it worth testing a more performant approach. As mentioned earlier, q is an array processing language; doing the arithmetic on the table directly will remove levels of indirection and leverage internal optimization, with a net gain in performance:

q)\ts:250 {[d;t] select num,distance:p from sums each abs t dist/: d}[tra;] raze delete num from tes1
3335 2314720    // 13.34 per run

q)\ts:250 {[d;t] select num,distance:p from sums each abs t -/: d}[tra;] raze delete num from tes1
3002 2314720    // 12.00 per run

The speed difference between the two approaches will become more obvious when using more complex metrics, as examined in section 1.5.3.

1.5.3 Euclidean or Manhattan Distance?

The Euclidean distance metric, \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}, can be intuitively implemented as:

q) apply_dist:{[d;t] select num,distance:p from sqrt sums each t {(x-y) xexp 2}/: d}

However, following the approach in section 1.5.2, the implementation can be further optimized by squaring each difference instead of using xexp:

q)\ts:1000 r1:{[t;d] t {(x-y)xexp 2}/: d}[tra;]raze delete num from tes1
550727 19530080    // Slower and requires more memory

q)\ts:1000 r2:{[t;d] t {x*x:x-y}/: d}[tra;]raze delete num from tes1
5723 8917312

The function xexp has two caveats:
• for an exponent of 2 it is faster to not use xexp and, instead, multiply the base by itself
• it returns a float, meaning each cell in the result set will increase in size by 4 bytes


q)min over (=) . (r1;r2)
1b    // Same values

q)(~) . (r1;r2)
0b    // Not the same objects

q)type @''' value each (r1;r2)
+`a`b`c`d`e`f`g`h`i`j`k`l`m`n`o`p!(-9 -9 -9 -9 -9 -9 -9 -9 -9 -9h;-9 -9 -9 -9..
+`a`b`c`d`e`f`g`h`i`j`k`l`m`n`o`p!(-6 -6 -6 -6 -6 -6 -6 -6 -6 -6h;-6 -6 -6 -6..

q)-22!'(r1;r2)
1467 827

Another approach would be to square the result of t -/: d. Despite being as fast as squaring each cell, its memory utilization depends on the size of the training set. With tra it uses almost 3 times as much memory:

q)\ts:1000 r3:{[t;d] t*t:t -/: d}[tra;]raze delete num from tes1
5947 25177360

q)\ts:1000 r2:{[t;d] t {x*x:x-y}/: d}[tra;]raze delete num from tes1
5723 8917312

q)min over (=) . (r2;r3)
1b

q)(~) . (r2;r3)
1b

Choosing the optimal implementation, we can benchmark against the full test set:

q) apply_dist:{[d;t] select num,distance:p from sqrt sums each t {x*x:x-y}/: d}

q)\ts R1:test_harness[tra;1;] peach 0!test
31584 744704    // ~7 seconds slower than the Manhattan distance benchmark

q)select Accuracy:sum[Hit]%count i by Test from raze R1
Test| Accuracy
----| ---------
0   | 0.9752066
1   | 0.9587912
2   | 0.9945055
3   | 0.9910714
4   | 0.9752747
5   | 0.9701493
6   | 1
7   | 0.956044
8   | 0.9970238
9   | 0.9583333

How does this compare to the Manhattan distance implementation?


q)select Accuracy_Manhattan:(avg;med)@\: Accuracy from select Accuracy:sum[Hit]%count i by Test from raze R1
Accuracy_Manhattan
------------------
0.9745429
0.9697272

q)select Accuracy_Euclidean:(avg;med)@\: Accuracy from select Accuracy:sum[Hit]%count i by Test from raze R1
Accuracy_Euclidean
------------------
0.97764
0.9752407

1.6 Conclusions

In this paper, we saw how trivial it is to implement a k-NN classification algorithm with kdb+. Using tables and q-sql it can be implemented with three select statements at most, as shown in the util library at www..com/kxcontrib/knn. We also briefly saw how to use adverbs to optimize the classification time, and how data structures can influence performance.

Benchmarking this lazy implementation with a dataset available on the UCI website, using the Euclidean distance metric, showed an average prediction accuracy of ~97.7%. After optimization of the code, the classification time was 18ms per instance, and the total validation time decreased significantly when using 4 slave threads (-s 4).


2 An Introduction to Neural Networks with kdb+

Author:

James Neill works as a kdb+ consultant for one of the world's largest investment banks, developing a range of applications. James has also been involved in the design of training courses in data science and machine learning as part of the First Derivatives training programme.

CONTENTS

An Introduction to Neural Networks with kdb+
2.1 INTRODUCTION
2.2 FEEDFORWARD NETWORKS
2.2.1 The Perceptron
2.2.2 Threshold Function
2.2.3 Perceptrons as Linear Predictors
2.2.4 A Neural Network
2.2.5 Bias Neurons
2.2.6 Weight Initialisation
2.2.7 The Feedforward Network in kdb+
2.3 TRAINING THE NETWORK
2.3.1 Back-propagation
2.4 OUTPUT FUNCTIONS FOR REGRESSION AND MULTI-CLASS CLASSIFICATION
2.4.1 Multiclass Outputs
2.4.2 Nonlinear Regression Outputs
2.5 CLASSIFICATION FOR 3+ CLASSES
2.6 STOCHASTIC GRADIENT DESCENT
2.6.1 Classification for 3+ classes using stochastic batch training
2.7 CONCLUSION


2.1 INTRODUCTION

Due to the desire to understand the brain and mimic the way it works by creating machines that learn, neural networks have been studied with great interest for many decades. A simple mathematical model for the neuron was first presented to the world by Warren McCulloch and Walter Pitts in 1943.

With modern advances in technology and the computational power available, this field of research has had massive implications on how computers can be used to ‘think’ and learn to solve problems given an appropriate algorithm. A few interesting examples of this research in action include:

- Aiding in medical diagnostics http://www.ijesit.com/Volume%202/Issue%202/IJESIT201302_33.pdf

- Interpreting art and painting images http://googleresearch.blogspot.ca/2015/06/inceptionism-going-deeper-into-neural.html http://arxiv.org/pdf/1508.06576v1.pdf

- Performing stock market predictions http://reference.wolfram.com/applications/neuralnetworks/ApplicationExamples/12.2.0.html

A number of different algorithms have been developed around this field of research, and this paper is going to focus on the implementation of a feedforward neural network in kdb+. A feedforward network (also known as a multi-layer perceptron) is a type of supervised machine learning algorithm which uses a series of nonlinear functions layered together with an output. It can be used for classification or regression purposes and has been shown to be a universal approximator – an algorithm that can model any smooth function given enough hidden units1.

This design of feedforward networks can be represented through operations on matrices and vectors. Array-programming languages such as q/kdb+ are well suited to computational implementations in this format due to their vectorised operations on lists.

All tests were run using kdb+ version 3.2 (2015.05.07)

1 Kurt Hornik, “Approximation Capabilities of Multilayer Feedforward Networks”, Neural Networks, Vol. 4, pp. 251-257, 1991


2.2 FEEDFORWARD NETWORKS

As mentioned in the introduction, a feedforward neural network is a type of supervised machine learning algorithm. This means that it is trained on datasets for which the output for given inputs is already known.

A neural network consists of a set of neurons connected together in some order. Each neuron receives inputs from other neurons, performs some action on these inputs and outputs a signal that becomes input for another connected neuron.

2.2.1 The Perceptron

A single neuron by itself is often referred to as a perceptron (Figure 1). The idea is to construct a predictor based on a linear combination of the inputs and a set of weights which control the impact an input has on the system. This linear combination (the dot product of the inputs and the weights) is then used as the input to a chosen threshold function which produces the desired output.

Figure 1: A Perceptron

2.2.2 Threshold Function

A threshold function represents the activation of the perceptron based on its inputs. A common choice for this function is the sigmoid function. This function provides a smooth curve bounded asymptotically by 0 and 1 on the vertical axis (Figure 2).

\sigma(x) = \frac{1}{1 + e^{-x}}

q)sigmoid:{1%1+exp neg x}
q)output:sigmoid[inputs mmu weights]
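
For example, with an arbitrary input sample and weight vector (these values are purely illustrative and not taken from any dataset used in this paper):

q)inputs:1 0 1f                // illustrative input sample
q)weights:0.5 -0.3 0.1         // illustrative weights
q)inputs mmu weights           // linear combination (dot product)
0.6
q)sigmoid[inputs mmu weights]
0.6456563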


Figure 2: A plot of the sigmoid function

2.2.3 Perceptrons as Linear Predictors

Perceptrons can only solve linearly separable problems. A linearly separable problem is one where two sets of points in a plane can be cleanly divided by a straight line. We can demonstrate this by looking at plots of truth table outputs.

Let us say that a red dot represents true and a blue dot represents false. If we take the standard truth table inputs of 1 or 0 and add some random noise to them so that they are either slightly smaller than 1 or slightly larger than 0, then we get the plots shown in Figure 3.
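
A sketch of generating such jittered points is shown below; the 0.1 noise scale is an arbitrary choice and the output is random, so it is not shown:

q)pts:`float$(0 0;0 1;1 0;1 1)          // truth-table input pairs
q)noisy:{x+(0.1*(count x)?1.0)*1-2*x}   // nudge 0s slightly up and 1s slightly down
q)noisy each pts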

Figure 3: Truth Table Plots

Notice how we can easily separate the true results from false in the AND and OR truth tables with a straight line. Perceptrons are good predictors for these problems. However, if the data is not so easily separable, as with the XOR plot, we need a way to be able to make nonlinear predictions. Solving the XOR problem will form our motivation for connecting many perceptrons into a network.


2.2.4 A Neural Network

A network consists of neurons connected together in layers as shown in Figure 4. Each neuron in one layer is connected to each neuron in the next layer and each connection has an associated weight. The first layer of neurons is the input layer. The neurons in this layer represent input values from a data sample.

Following the input layer are the hidden layers. The number of neurons in a hidden layer is determined through a combination of background understanding of the problem and trial and error.

Finally there is the output layer. The number of neurons in this layer is determined by the type of problem to be solved. For example one might wish to classify an input data sample into one of two categories (e.g. true or false as in the XOR problem). This type of problem is referred to as a binary classification problem. For this type of problem we only need one output neuron because the sigmoid function will return values close to 1 or 0 after the network is trained.

However, the function which acts on the linear combination of inputs and weights at the output layer is not always the same as the threshold function used throughout the rest of the network. This is because we do not always desire a value between 0 and 1. Output functions for addressing different types of problems will be discussed in detail in Section 2.4.

Figure 4: A feedforward network


2.2.5 Bias Neurons

A bias neuron is a neuron external to the main network. One is added to the input layer and one to each of the hidden layers. The value it passes to neurons in the next layer is always 1 and it receives no inputs from previous layers (see Figure 4). The purpose of bias neurons is analogous to the intercept parameter of a simple linear model, commonly written as

y = mx + c = \beta_1 x_1 + \beta_0 x_0

The absence of \beta_0 x_0 in the simple linear model results in the predicted line always passing through (0, 0) and the model will perform poorly when attempting to predict unknown values. Hence we always set x_0 to 1 and alter \beta_0 as we find the line of best fit. In the neural network we represent the network's version of the \beta_0 x_0 term as a bias neuron and associated weights. For more information on bias terms in statistical modelling see Chapter 3 in [2] and Chapter 7 in [1].

Bias neurons are added to the input layer by adding a 1 to each of the input samples in the data set used to train the network.

// Inputs and expected target values for XOR problem
q)input:((0 0f);(0 1f);(1 0f);(1 1f))
// Add a bias neuron to each input
q)input:input,'1.0
q)target:0 1 1 0f

2.2.6 Weight Initialization

The initialization process begins by assigning values to the weights present in the network. The weights are randomly assigned such that the values of the weights between the input nodes and a receiving node on the next layer are in the range (-1, 1) with mean 0.

Since there will be multiple neurons in the next layer we can represent all the weights between two layers as a matrix, where the number of rows represents the number of inputs and the number of columns represents the number of neurons in the next layer. An example of how this weight matrix and the input matrix interact is shown in Figure 5.

wInit:{
  // If only one input neuron is detected exit
  // This is most likely due to a missing bias neuron
  if[1=x;:"Number of input neurons must be greater than 1."];
  flip flip[r]-avg r:{[x;y]x?1.0}[y]each til x
  }

// Init weights between 3 inputs and 4 outputs
// The first column represents the weights (connections) between
// the 3 inputs and the first neuron in the next layer.
// The second column is the weights leading to the second neuron
// in the next layer and so on.
q)wInit[3;4]
-0.3586151  0.09051553  -0.2815408  -0.05282783
0.02154042  0.4219367   0.2320934   -0.05853578
0.3370747   -0.5124522  0.04944742  0.1113636


Figure 5: Diagram showing 2 input neurons (green and blue neurons) connecting to 2 hidden neurons. The colours in the matrices correspond with the area of the network those values are found during execution of a forward pass.

2.2.7 The Feedforward Network in kdb+

Once we have prepared the input data and the weights they can be applied to the network to provide output. We will use the network to predict the outputs of the XOR function.

// weights between input layer and hidden layer (2 inputs + 1 bias neuron)
q)w:wInit[3;4]
// weights between hidden layer and output layer (4 hidden neurons + 1 bias neuron)
q)v:wInit[5;1]
q)ffn:{[input;w;v]
  // Apply inputs and their weights to the hidden layer,
  // prepending the hidden-layer bias node so dimensions match v
  z:1.0,/:sigmoid[input mmu w];
  // Use output from hidden layer to generate an output
  sigmoid[z mmu v]
  }
q)ffn[input;w;v]
0.5028818 0.5136649 0.4891303 0.5

The network has produced an output, but these values are not close to the target values. This is understandable as the weights have been randomly initialized. In order to produce the desired output the network must learn a more appropriate set of weights.


2.3 TRAINING THE NETWORK

Training a network to produce accurate results involves determining the weights which minimise the errors of the output nodes. The problem of minimising the error by adjusting the weights is solved by a technique called back-propagation - a form of gradient descent.

Gradient descent begins by calculating the derivative of the error function with respect to the weights. This derivative gives information about the direction needed to move along the surface described by the error function in order to arrive at a minimum. The weights are gradually adjusted using this information until an acceptable minimum in error has been reached.
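
As a toy illustration of the idea, separate from the network code that follows, consider gradient descent on the one-dimensional error function E(w) = w^2, whose derivative is 2w:

q)grad:{2*x}                 // derivative of E(w)=w*w
q)step:{[lr;w] w-lr*grad w}  // one gradient-descent update with learning rate lr
q)(step[0.1]/)[10;5.0]       // 10 iterations from w=5 move towards the minimum at 0
0.5368709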

2.3.1 Back-propagation

Back-propagation trains the network by propagating information about the error at the output back through the network and adjusting all the weights of the connecting neurons. For an output node that applies the sigmoid function the error function is the cross-entropy error function defined as:

-\sum_t \left[ y^t \log \hat{y}^t + (1 - y^t) \log(1 - \hat{y}^t) \right]

This gives us the following update rule for adjusting the weights between the output node and the hidden layer²:

\Delta v_h = \sum_t z_h^t (y^t - \hat{y}^t)

v_h \leftarrow v_h + \alpha \Delta v_h

where z_h^t is the output after evaluating the hidden neuron h for input sample t, v_h is the weight between the output neuron and hidden neuron h, y^t is the target for sample t, \hat{y}^t is the calculated output for sample t and \alpha is the rate at which we adjust the weights (usually < 0.1).

Once the change in the above weights has been calculated we propagate the error to the hidden layer and generate the update rule for the weights between the input layer and the hidden layer:

\Delta w_{hj} = \sum_t (y^t - \hat{y}^t) v_h z_h^t (1 - z_h^t) x_j^t

w_{hj} \leftarrow w_{hj} + \alpha \Delta w_{hj}

where w_{hj} is the weight between hidden neuron h and input neuron j and x_j^t is the input from neuron j for some sample t.

Using these formulas we can update our feedforward network function to implement back-propagation and allow training:

² The derivation of the update rules for back-propagation is beyond the scope of this paper. See Chapter 11 in [3] and Chapter 11 in [2].


// inputs - the input data set with bias node
// targets - known outputs corresponding to inputs
// lr - learning rate 'alpha'
// d - dictionary with 3 items: output
//     weights between input and hidden layers
//     weights between hidden and output layers
q)ffn:{[inputs;targets;lr;d]
  z:1.0,/:sigmoid[inputs mmu d`w];
  o:sigmoid[z mmu d`v];
  // Error of output neurons
  deltaO:(targets-o);
  // Error of hidden neurons
  deltaZ:1_/:$[deltaO;flip d`v]*z*1-z;
  `o`v`w!(o;d[`v]+lr*flip[z] mmu deltaO;d[`w]+lr*flip[inputs] mmu deltaZ)
  }

// Example - the XOR problem
q)inputs
0 0 1
0 1 1
1 0 1
1 1 1
q)targets
0 1 1 0f
q)w
0.3257579   0.348099    -0.4320058  0.3356597
-0.07237444 -0.3028193  0.3088185   -0.3069554
-0.2533834  -0.04527963 0.1231874   -0.02870423
q)v
-0.0133154 0.04739764 0.2894549 -0.3235371
// Two training passes shows little change
q)finalResult:(ffn[inputs;targets;0.1]/)[2;`o`w`v!(0,();w;v)]
q)finalResult`o
0.5025557 0.5133001 0.4888265 0.4996545
// 10000 training passes shows significant improvement
q)finalResult:(ffn[inputs;targets;0.1]/)[10000;`o`w`v!(0,();w;v)]
q)finalResult`o
0.009305227 0.9890354 0.9890087 0.01142469
q)\ts finalResult:(ffn[inputs;targets;0.1]/)[10000;`o`w`v!(0,();w;v)]
164 2992


Now that the network has been trained it can be applied to a random permutation of XOR inputs to see how it performs on a single pass.

// generate a random permutation of list 0 to 99
q)rp:-100?til 100
// use rp to shuffle inputs and targets
// we generate rp first so that the shuffled indices of
// the inputs and targets match up
q)inputs:(100#inputs)rp
q)targets:(100#targets)rp
// using weights of trained network solve test inputs
q)rw:finalResult`w; rv:finalResult`v
q)res:(ffn[inputs;targets;0.1]/)[1;`o`w`v!(0;rw;rv)]
// Are all the predictions correct?
q)all raze[`int$res`o]=targets
1b


2.4 OUTPUT FUNCTIONS FOR REGRESSION AND MULTI-CLASS CLASSIFICATION

There are three types of output we are looking for when using a network of this type. The first we have already discussed in Section 2.2.4 – binary classification. We have also worked through an example of this problem when applying the network to predict XOR outputs. The other two problems we will discuss are multiclass outputs and nonlinear regression outputs.

2.4.1 Multiclass Outputs

For multiclass outputs the goal is to determine the correct classification of an input into three or more possible classifications. Unlike the binary classification situation, in the output layer there will be one neuron for each possible classification. The target values are transformed using one-hot encoding. This gives us a unique list of 0s and 1s for each possible classification that is the same length as the number of possible classifications. For example, if there are classifications A, B and C the transformations are 0 0 1, 0 1 0 and 1 0 0 – giving the output layer target patterns to match for training and testing. The output function used is the softmax function:

\hat{y}_i^t = \frac{\exp S_i^t}{\sum_k \exp S_k^t}

where \hat{y}_i^t is the output from neuron i for sample t, S_i^t is the linear combination of outputs from the hidden layer and the weights connecting the hidden layer to output neuron i for sample t, and S_k^t is the linear combination of outputs from the hidden layer and the weights connecting the hidden layer to output neuron k for sample t.

By using the softmax function we ensure that the sum of the outputs from each of the neurons in the output layer is 1. That allows us to pick the neuron with the highest output value as the most likely to be the classification we expect for the given input; the ‘winning’ neuron will be assigned a value of 1 and the other neurons a value of 0 resulting in a match to one of the one-hot encoded classifications. The cross-entropy error function in this case is:

-\sum_t \sum_i y_i^t \log \hat{y}_i^t

where y_i^t is the target value for output neuron i with sample t.

The update rules are:

\Delta v_{ih} = \sum_t (y_i^t - \hat{y}_i^t) z_h^t

\Delta w_{hj} = \sum_t \left[ \sum_i (y_i^t - \hat{y}_i^t) v_{ih} \right] z_h^t (1 - z_h^t) x_j^t

where v_{ih} is the weight between output neuron i and hidden neuron h.

An example implementation of the softmax output will be shown in Section 2.5.


// one-hot encoding function
q)oneHot:{x!`float$x=/:x}
q)oneHot `A`B`C
A| 1 0 0
B| 0 1 0
C| 0 0 1
q)smax:{exp[x]%sum flip exp x}
q)smaxErr:{neg sum sum flip x*log y}
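
As an illustration of how softmax outputs map back to the one-hot patterns, using arbitrary linear-combination values for two samples and three output neurons:

q)o:smax(1.2 0.3 -0.8;0.1 2 0.4)
q)sum each o             // each row sums to 1
1 1f
q)`float$o=max each o    // the 'winning' neuron per sample
1 0 0
0 1 0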

2.4.2 Nonlinear Regression Outputs

If the goal is to predict a real value, not necessarily constrained within the boundaries imposed by a threshold function, the output function is just the linear combination of the outputs from the hidden layer:

\hat{y}^t = \mathbf{v} \cdot \mathbf{z}^t

where v is the vector of weights between the hidden layer and the output layer and z^t is the vector of outputs from the hidden layer.

In this case we change the error function from cross-entropy to the sum-of-squared errors:

\frac{1}{2} \sum_t (y^t - \hat{y}^t)^2

The update rules for a regression output are:

\Delta v_h = \sum_t (y^t - \hat{y}^t) z_h^t

\Delta w_{hj} = \sum_t (y^t - \hat{y}^t) v_h z_h^t (1 - z_h^t) x_j^t

q)lin:{x}
q)linErr:{0.5*sum sum a*a:x-y}

It’s useful now to put the different functions for error and output in dictionary format as this will allow us to use the same ffn function for all 3 types of classification:

// x is linear combination of hidden outputs and weights
q)outputFuncs:`sig`smax`lin!({1%1+exp neg x};{exp[x]%sum flip exp x};{x})
// x is target value, y is calculated
q)errFuncs:`sig`smax`lin!({neg sum sum flip(x*log y)+(1-x)*log 1-y};{neg sum sum flip x*log y};{sum sum a*a:x-y})


q)ffn:{[inputs;targets;lr;of;d]
  // Calculate the outputs of the hidden layer
  // and add bias node
  z:1.0,/:sigmoid[inputs mmu d`w];
  o:outputFuncs[of][z mmu d`v];
  // Error of output neurons
  deltaO:(targets-o);
  // Error of hidden neurons
  // Hidden bias node is not connected to any
  // input layer nodes so we drop it
  deltaZ:1_/:$[deltaO;flip d`v]*z*1-z;
  `o`v`w`err!(o;
    d[`v]+lr*flip[z] mmu deltaO;
    d[`w]+lr*flip[inputs] mmu deltaZ;
    errFuncs[of][targets;o])
  }


2.5 CLASSIFICATION FOR 3+ CLASSES

As an example, we will study a set of Iris flower data which was originally introduced into research by Ronald Fisher in 1936. It contains samples from three different species of the Iris flower and has become a standard test case for many classification techniques within machine learning. By taking measurements of certain metrics (eg. length and width of sepals) the plants can be classified and computationally distinguished from each other. The data and a description of the data can be found in the links at http://archive.ics.uci.edu/ml/machine-learning-databases/iris.

We one-hot encode the different possible species of Iris, resulting in a neural network with 5 inputs (including the bias neuron), 7 hidden neurons (including the bias neuron) and 3 outputs. The data set is randomly shuffled to reduce the likelihood of a biased output. From this randomised selection of the data a random selection of 20 samples is taken as the test set and the other 130 samples are used in training.
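
The code below assumes a table Iris has already been loaded from the UCI csv file and extended with the one-hot encoded column as Iris1h. A minimal sketch of how this might be done is shown here; the file path and column names are assumptions rather than part of the original text, and any trailing blank lines in the raw file may need to be removed first:

q)Iris:flip `slength`swidth`plength`pwidth`species!("FFFFS";",") 0: `:iris.data
q)Iris1h:update onehot:oneHot[distinct species]species from Iris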

// one-hot encoding of classes
q)IrisOneH:oneHot[distinct Iris.species]
q)IrisOneH
Iris-setosa    | 1 0 0
Iris-versicolor| 0 1 0
Iris-virginica | 0 0 1
q)Iris1h
slength swidth plength pwidth species     onehot
-------------------------------------------------
5.1     3.5    1.4     0.2    Iris-setosa 1 0 0
4.9     3      1.4     0.2    Iris-setosa 1 0 0
4.7     3.2    1.3     0.2    Iris-setosa 1 0 0
..
// Random permutation of the dataset
q)IrisRP:Iris1h{neg[x]?til x}count Iris1h
// Pick a test set - samples from data not used in training
q)IrisTest:IrisRP[-20?count IrisRP]
q)IrisTrain:IrisRP except IrisTest
// Init weights, input and output variables
q)w:wInit[5;6]
q)v:wInit[7;3]
q)input:1.0,'flip flip[IrisTrain]`slength`swidth`plength`pwidth
q)output:IrisTrain.onehot
// Train network
q)resIris:(ffn[input;output;0.01;`smax]/)[800;`o`w`v`err!(0;w;v;1f)]
// After 800 iterations (or epochs) how well does it perform?
q)all(IrisOneH?"f"$"j"$resIris`o)=IrisOneH?output
0b
q)100*sum[(IrisOneH?"f"$"j"$resIris`o)=IrisOneH?output]%count output
96.89922
// Init variables for test data
q)tinput:1.0,'flip flip[IrisTest]`slength`swidth`plength`pwidth
q)toutput:IrisTest.onehot
// Run test data through network without training
q)resIrisT:(ffn[tinput;toutput;0.01;`smax]/)[1;`o`w`v`err!(0;resIris`w;resIris`v;1f)]
q)all (IrisOneH?"f"$"j"$resIrisT`o)=IrisOneH?toutput
1b


Plot of error vs training epoch is given in Figure 6. We see that the error settles into an oscillation pattern around training epoch 700. While these oscillations are slowly converging it is likely that overfitting begins to take place shortly after the 700th iteration of training.

Training Feedforward Network for Iris Dataset

160

140

120

100

80 Error

60

40

20

0 0 100 200 300 400 500 600 700 800 Training Epoch

Figure 6: Error while training network on Iris dataset


2.6 STOCHASTIC GRADIENT DESCENT

Often, in practice, computing the error and gradient for the entire training set can be very slow and in some cases not possible if the data will not fit entirely in memory. Stochastic gradient descent solves this problem by performing the back-propagation training on small batches of the training set chosen at random.

Randomly chosen batches simulate real-time streaming of data and help to reduce bias in the final trained network (see chapter 7 in [2] for more details on bias and variance in machine learning).

sffn:{[inputs;targets;lr;of;bsize;d]
  // random selection (with replacement) of bsize indices into the training set
  rp:bsize?til count inputs;
  ffn[inputs[rp];targets[rp];lr;of;d]
  }

// inputs[rp] and targets[rp] assume the input and targets
// are all contained in memory. If taking the data from
// disk a different query would be required to get the random
// selection of data.
// The important thing is that the input-target pairs in a
// batch are selected randomly with replacement (i.e. the
// same input-target pair can appear more than once in a
// batch)

2.6.1 Classification for 3+ classes using stochastic batch training

This time we will again use Fisher’s Iris dataset, but the network will be trained using randomly selected small batches from the training data.

q)IrisTest:Iris1h[-20?count Iris1h]
q)IrisTrain:Iris1h except IrisTest
q)input:1.0,'flip flip[IrisTrain]`slength`swidth`plength`pwidth
q)target:IrisTrain`onehot
// choose a batch size of 12
q)resIris:(sffn[input;target;0.01;`smax;12]/)[800;`o`w`v`err!(0;w;v;1f)]
q)tinput:1.0,'flip flip[IrisTest]`slength`swidth`plength`pwidth
q)toutput:IrisTest.onehot
q)resIrisT:(ffn[tinput;toutput;0.01;`smax]/)[1;`o`w`v`err!(0;resIris`w;resIris`v;1f)]
q)all (IrisOneH?"f"$"j"$resIrisT`o)=IrisOneH?toutput
1b


(Chart: Stochastic Training Error for Batch Size 12, error vs training epoch)

(Chart: Stochastic Training Error for Batch Size 25, error vs training epoch)


2.7 CONCLUSION

In this whitepaper we have explored a proof-of-concept implementation of a feedforward network in kdb+. By constructing the model for the network using linear algebra (inputs, outputs and weights represented by matrices) we have shown that an array processing language is well suited to developing a complex system.

We have shown that multiple types of output functions can be easily applied to the network depending on the desired result. This generalisation demonstrates the flexibility of kdb+ to many problems where numerical data can be arranged into lists, arrays or tables.

In the event that there is not enough main memory to carry out the calculations on all the training data in one pass we presented an alternative approach, stochastic gradient descent, which allows the network to be trained on small batches of the data.

The network exhibited in this paper forms the groundwork for more complicated networks. Adding additional hidden layers and different options for threshold function will allow more complex convolutional and deep networks to be developed.

All tests were run using kdb+ version 3.2 (2015.05.07)

References:

[1] Murphy, K. P. (2012). Machine Learning A Probabilistic Perspective. MIT Press.

[2] Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning. Springer, New York. (link to online version: http://statweb.stanford.edu/~tibs/ElemStatLearn/)

[3] Alpaydin, E. Introduction to Machine Learning, Second Edition. MIT Press.


3 Compression in kdb+

Author:

Eoin Killeen is based in New York. Eoin has worked as a kdb+ consultant on the design and development of a wide range of high-performance trading and analytics applications. He is currently working on a global real-time electronic trading visualization platform and an equity derivatives analytics portal at a US investment bank.


CONTENTS

Compression in kdb+
3.1 INTRODUCTION
3.2 COMPRESSION OPTIONS
3.2.1 Converting saved data to compressed format using -19!
3.2.2 Saving in-memory data directly to compressed format on disk
3.2.3 Field-by-field compression
3.2.4 Compression defaults
3.3 READING COMPRESSED DATA
3.4 WHAT DETERMINES THE COMPRESSION RATIO?
3.4.1 Number of distinct values vs. the total count of the vector
3.4.2 Level of contiguity of repeated values
3.4.3 Datatype of the vector
3.5 EFFECTS OF COMPRESSION ON QUERY PERFORMANCE
3.5.1 Setup
3.5.2 Procedure
3.5.3 Results
3.5.3.1 Select all data from the table
3.5.3.2 Select all data for a subset of symbols
3.5.3.3 Aggregate values by a subset of symbols
3.5.3.4 Aggregate values by a subset of symbols using time buckets
3.5.3.5 Asof join a trade and quote table
3.6 CONCLUSION


3.1 INTRODUCTION

As the rate of data generation in financial markets continues to increase, there is a strong impetus to investigate how large data volumes can be more efficiently processed. Even if disk is considered cheap, it can be a valuable exercise for many applications to determine what improvements can be gleaned from an analysis of the use of compression. Aside from reduced disk costs, some use cases can gain significant performance improvements through a problem-specific approach to compression. For example, some systems have fast CPUs, but find that disk i/o is a bottleneck. In some such cases, utilizing CPU power to reduce the amount of data being sent to disk can improve overall performance.

Prior to kdb+ version 2.7, compression was achieved via file systems with inbuilt compression, such as ZFS. Version 2.7 introduced built-in OS-agnostic compression, which allowed on-disk data to be converted to compressed format using a range of algorithms and compression levels. This was expanded upon in version 2.8, with the addition of the ability to stream in-memory data directly to compressed format on disk. ZFS compression is still useful for some kdb+ applications, as it keeps cached data available for multiple processes. However, this paper will focus on the inbuilt data compression options provided by kdb+, which are available on all supported architectures.

Each system will have its own characteristics which determine the appropriate compression configurations to use. This paper will cover an introduction to compression in kdb+, a discussion of contributing factors to compression ratios and performance, and an analysis of how the use of compressed data can affect performance of some sample use cases.

All tests were run using kdb+ version 3.1 (2013.09.05)


3.2 COMPRESSION OPTIONS

There are two high-level approaches to saving on-disk data in compressed format. The first, available since kdb+ 2.7, is a two-step approach: save data to disk in the regular uncompressed format using set, then convert it to a compressed format using the -19! command. The second approach, available since kdb+ 2.8, is to stream data directly from memory to compressed format on disk by modifying the left-hand-side argument to set.

The first approach is useful for archiving existing historical data, or in cases where it is significantly faster to save the data to disk uncompressed, without the overhead of first compressing the data. In many other cases, it can be more convenient and/or performant to compress the data on the fly while saving down.

3.2.1 Converting saved data to compressed format using -19!

The specification for -19! from code.kx.com is:

-19! (`:sourceFile;`:targetFile;logicalBlockSize;compressionAlgorithm;compressionLevel)

where

• logicalBlockSize is a power of 2 between 12 and 20 (pageSize or allocation granularity to 1MB - pageSize for AMD64 is 4kB, sparc is 8kB. Windows seems to have a default allocation granularity of 64kB). Note that this argument affects both compression speed and compression ratio (larger blocks can be slower and better compressed)

• compressionAlgorithm is one of the following, 0 - none, 1 - kdb+ ipc, 2 - gzip

• compressionLevel is between 0 and 9 (valid only for gzip, use 0 for other algorithms)

The compression algorithm and compression level inputs are self-explanatory. The logical block size determines the amount of data which will be compressed at a time in each block. A larger block will afford more opportunity for identification of repeated values, and hence a higher overall compression ratio. But this parameter also determines the minimum amount of data which can be decompressed at a time. If it is large relative to the amount of data that is likely to be accessed at a time, then a lot of unnecessary decompression work may be carried out during queries. Its lower bound is the system's allocation granularity, because dividing the data into smaller chunks than that would result in wasted space.


The various combinations of inputs will be discussed further in the following sections, but sample commands to save data to disk and then convert it to compressed format would look like this:

`:/db/trade_uncompressed set trade
-19! (`:/db/trade_uncompressed; `:/db/trade_compressed; 16; 1; 0)

Note that if this approach is used to compress data, it is preferable to have the source and target files on separate physical disks. This will reduce the number of disk seeks required to move the data iteratively in chunks. For migrating uncompressed databases to compressed format, Kx have provided a convenient script which can be found at http://code.kx.com/wsvn/code/contrib/simon/compress/cutil.q.

3.2.2 Saving in-memory data directly to compressed format on disk

In many cases, it can be preferable to save data to disk in compressed format in a single step. This is more convenient than having a separate process to convert uncompressed on-disk data to compressed format after it has been saved. It can also be faster than the two-step method, depending on the compression algorithm and level used, and on system CPU and disk characteristics. Direct streaming compression has been implemented by overriding the left-hand-side argument to set:

(`:targetFile;logicalBlockSize;compressionAlgorithm;compressionLevel) set table

3.2.3 Field-by-field compression

It is also possible to apply compression on a field-by-field basis, i.e. to use different compression rules from one table or column to another. Files of various compression levels (including no compression at all) can happily coexist inside a single table. So if some columns do not compress well or their usage patterns mean they don't benefit from compression, they can have different compression algorithms applied or be left uncompressed. This can be achieved by modifying the left argument to set to include a dictionary, mapping field names to the compression parameters to be used for each one. The null symbol in the key of this dictionary defines the default behavior, which will be applied to any fields not specified explicitly.

(`:splay/;``a`b!((17;2;9);(17;2;6);(17;2;6)))set([]a:asc 1000000?10;b:asc 1000000?10;c:asc 1000000?10)


3.2.4 Compression defaults

Rather than specifying the compression parameters individually every time set is called, we also have the option of defining default compression parameters which will be used if set is called the old-fashioned way, i.e. `:filename set table. This is done by defining the "zip defaults" variable in the dotz namespace, .z.zd. The format is the same as the non-filename parameters which are passed to set:

.z.zd:(17;2;6);`:zfile set asc 10000?`3
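
A convenient way to confirm how a file was written is the -21! internal function, which returns a dictionary of compression statistics for a compressed file (including the compressed and uncompressed lengths, algorithm, logical block size and level) and an empty dictionary for an uncompressed file. Continuing from the defaults set above:

q)-21!`:zfile    / compression statistics for the file written with the .z.zd defaults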


3.3 READING COMPRESSED DATA

The process of decompression is automatic and transparent to the user. Data consumers can read compressed files in the same way that they would read uncompressed ones, e.g. using get or by memory-mapping and querying the files. However, it is important to allow for some resource consumption associated with decompression when querying compressed data. While querying memory-mapped compressed data, any file blocks used by select will be decompressed into memory and cached there for the duration of the query. This is to avoid having to decompress the data twice – once during constraint evaluation, and once when being used as a selected column in the result set. Only required data will be decompressed into memory, but kdb+ will allocate enough memory to decompress the entirety of each vector touched by a query on compressed data. This means that memory use shown in top may indicate a higher number than expected, but only the required sub-set of this amount will actually be used. Increasing system swap space may be required, even if swap itself will not be used.
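
For example, a compressed vector saved with set is read back with get exactly as an uncompressed one would be (the file name here is illustrative):

q)(`:zvec;17;2;6) set til 1000000
`:zvec
q)count get `:zvec
1000000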

For frequently-repeated queries or regular usage patterns where the data is usually present in the OS cache, compressing data may be counterproductive. The OS can only cache the compressed data, and the decompressed queried data will only be cached by kdb+ for the duration of a query. So the decompression overhead must be overcome on every query. Because of this, columns which need to be accessed very frequently may perform better if left uncompressed.

It should be noted that because of the convenience in terms of automatic decompression, OS-level tools such as scp should be used if compressed data just needs to be moved about on disk or between hosts, without being manipulated in any way. This will prevent needless de/re-compression of the data, which would occur if q processes and IPC were used for this purpose.


3.4 WHAT DETERMINES THE COMPRESSION RATIO?

The compression ratio (uncompressed size vs. compressed size) is influenced by a number of factors. The choice of logical block size, compression algorithm and compression level will have a big impact here. But the compression ratio of any combination of inputs will also be heavily influenced by the nature of the data which is being compressed.

Some data-driven factors that influence the compression ratio are:

3.4.1 Number of distinct values vs. the total count of the vector

A low number of distinct values in a large vector could, for example, be grouped and referenced by index in each compressed block. A vector of entirely unique values offers less potential for efficient storage.

Similarly, sparsely populated columns will compress well, as the nulls can be compressed like a highly repeating value would be.
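A quick way to observe this is to compare the -21! statistics of a highly repetitive vector against an almost entirely distinct one. The file names below are illustrative and the exact ratios will vary by system:

q)ratio:{r:-21!x; r[`uncompressedLength]%r`compressedLength}
q)(`:rep;17;2;6) set 1000000#`aapl            / one distinct value, repeated
q)(`:uniq;17;2;6) set `$string til 1000000    / (almost) entirely distinct values
q)ratio each (`:rep;`:uniq)                   / the repetitive vector compresses far better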

3.4.2 Level of contiguity of repeated values

A contiguous array of identical values can be described by the distinct value and the number of times it occurs. In some cases we can deliberately increase the level of contiguity by sorting tables on multiple columns before compressing. In the following example, we will compare the compression ratios achieved on a table of two highly-repeating columns – first unsorted, then sorted by one column, then sorted by both columns.

n:100000000
t:([]sym:n?``goog`aapl`tsla`spx;venue:n?`nsdq`nyse)
`:uncompressed/ set .Q.en[`:uncompressed] t
(`:unsorted/;16;2;5) set .Q.en[`:unsorted] t
(`:symSorted/;16;2;5) set .Q.en[`:symSorted] `sym xasc t
(`:symVenueSorted/;16;2;5) set .Q.en[`:symVenueSorted] `sym`venue xasc t

Because the data is so highly repeating, we get a nice compression ratio of 10 even if the table is unsorted. But sorting on one column yields a significant improvement, and sorting on both means the entire 100m rows of these two fields can be persisted using a little over 1MB of disk. Obviously this sorting only increases the compression ratios for the columns that are actually sorted – other columns won’t see improvements.

Table Version          Compression Ratio
Unsorted               10.19
Sorted by sym          30.47
Sorted by sym, venue   669.55


3.4.3 Datatype of the vector

If a vector of high-precision datatype such as long contains a high proportion of values which could have been expressed as ints or shorts, some of the unused precision can be “reclaimed” in a way through compression. Booleans also compress particularly well, but this is a special case of the above point regarding repeating values, as there are only two possible distinct values in a Boolean column.

The below test table contains Boolean, int and long columns. The int and long columns contain the same values. Taking a sample compression configuration of gzip with logical block size 16 and compression level 5, we can see how each data type compresses:

n:100000000
ints:n?0Wi
t:([]Boolean:n?01b;integer:ints;longint:`long$ints)
`:nocomp/ set t
(`:comp/;16;2;5) set t

The compression ratios achieved on each field show that Booleans have compressed by a large factor, and the longs have also compressed significantly. A lot of the storage cost of using long precision has been reclaimed.

Field      Compression Ratio
Boolean    5.97
Int        1
Long Int   1.62

In some cases, the extra precision afforded by longs is required only for a small proportion of values. In such cases, compression allows us the benefit of increased precision where it is needed, while staying close to the disk cost of regular ints where it is not.


3.5 EFFECTS OF COMPRESSION ON QUERY PERFORMANCE

The user-specifiable parameters for compression provide a lot of room for experimentation and customization, according to the demands of a particular system. For a given use case, the choice of compression parameters should be informed by a number of factors:

• CPU time taken to perform the compression
• change in disk i/o time
• compression ratio achieved
• usage patterns for the on-disk data

The relative weighting of these factors will vary from one system to another. If reduction in physical disk usage is the primary motivating concern behind the use of compression, then naturally the compression ratio itself must be maximized, while constraining the other parameters to acceptable levels.

The focus of our tests will be on the impact to downstream users of the data rather than the impact on the writing process. Random access to compressed data is supported, which means only required blocks of the file will be decompressed. The choice of logical block size will determine the minimum amount of data which can be decompressed at a time, so if logical block size is large relative to the average amount of data retrieved at a time, then redundant work may be performed in decompressing the data. If the gzip algorithm is used, then the tradeoff between compression level and time taken to (de/)compress must also be considered.

3.5.1 Setup

Each system will have its own performance characteristics, and experimentation with the compression parameters should be done on a case-by-case basis. We will measure the impact of using compressed rather than uncompressed data across a range of query types. The tests will be run first with the data present in the OS cache, and then with the cache completely cold, having been emptied before each test using the cache flush functionality of the io.q script provided by Kx at http://code.kx.com/wsvn/code/contrib/simon/io/io.q. Each test will be performed 10 times (with the cache being fully flushed before each iteration of the cold cache tests) and timings will be averaged. For hot cache results, the first query time will not be included in the average. This is to allow for initial caching from disk.

All tests are performed using kdb+ 3.1 2013.09.05 on an Intel(R) Xeon(R) L5520 16 CPU/4 core 2.27GHz with 144GB RAM. The disk used is ext3 SAN.


We will use a basic test schema of a trade and quote table, with 100m rows of trades and 400m rows of quotes, evenly distributed throughout a trading day.

n:100000000
st:.z.D+09:30
et:.z.D+16:00
trade:([]sym:asc n?`3; time:"p"$st+((et-st)%n-1)*til n; price:n?1000.; size:n?100)
n*:4
quote:([]sym:asc n?`3; time:"p"$st+((et-st)%n-1)*til n; bp:n?1000.; ap:n?1000.; bs:n?100; as:n?100)

3.5.2 Procedure

The above test tables were saved to disk, both uncompressed and compressed using the kdb+ IPC algorithm, with low, mid- and high-range logical block sizes.

The relative changes in performance (compressed vs. uncompressed) of the following query types were measured:

• Select all data from the table
• Select all data for a subset of symbols
• Aggregate values by symbol
• Aggregate values by symbol and time bucket
• Asof join straight from disk

3.5.3 Results

The following charts plot the ratio of performance of queries on compressed data vs. the performance of the same query on uncompressed data. So a value of 2 means the query on compressed data took twice as long as its uncompressed counterpart, and a value less than 1 means the performance improved on compressed data.

The compression ratios achieved did not change significantly across logical block sizes:

Logical Block Size   Compression Ratio
12                   1.5
16                   1.55
20                   1.55


3.5.3.1 Select all data from the table

Query: select from trade where i<>0

Chart: Query all ids – ratio of compressed vs. uncompressed query time by logical block size, cold cache and hot cache.

When querying the entire table with cold OS cache, we see significant improvements in performance using compressed data. This is because there is a reduction in the amount of raw data which must be read from disk, and the cost of decompression is low when compared to this benefit. The relative improvement increases with logical block size, as decompressing an entire vector can be done more efficiently when there are fewer blocks to read and coalesce.

In the case of hot OS cache, we see that performance has degraded to 2-3 times its original level. This illustrates the cost of the OS having access only to compressed raw data. When the cache is hot, the decompression cost is high relative to the total time taken to get the uncompressed data into memory.


3.5.3.2 Select all data for a subset of symbols

Query: select from trade where sym in ids

Ids is a vector of 100 distinct randomly selected symbols from the trade table, which was created prior to cache flushing.

Chart: Query 100 ids – ratio of compressed vs. uncompressed query time by logical block size, cold cache and hot cache.

Querying on 100 symbols with a cold cache, we see moderate improvements in query performance on compressed data at low logical block sizes. With larger logical blocks, we need to carry out some redundant decompression work, as we only require sub-sections of each block. The cost of this redundant work eventually cancels out the disk read improvement for queries on this particular number of symbols.

With a hot cache, querying on a subset of symbols gets a lot slower with large logical blocks. This illustrates the benefit of fully random access (i.e. no redundant work), which is afforded by using uncompressed data. However, these queries still return in the range of hundreds of milliseconds, so a significant increase in query time may still be worth it, depending on the compression and performance desired in a particular case.

Compression in kdb+ 48 | Page

Technical Whitepaper

3.5.3.3 Aggregate values by a subset of symbols

Query: select size wavg price by sym from trade where sym in ids

Chart: Aggregate by sym – ratio of compressed vs. uncompressed query time by logical block size, cold cache and hot cache.

When we add the additional work of aggregating data by symbol, we don’t see much change in the case of a cold cache. But with a hot cache, the increased CPU load of aggregation reduces the relative advantage of compressed data. Data read time is now slightly less of a contributing factor towards overall performance.

3.5.3.4 Aggregate values by a subset of symbols using time buckets

Query: select size wavg price by sym,5 xbar time.minute from trade where sym in ids

Chart: Aggregate by sym, time bucket – ratio of compressed vs. uncompressed query time by logical block size, cold cache and hot cache.

Aggregating by time bucket in addition to by sym, we see that the relative decrease in performance on compressed data has again reduced, as data retrieval time is less of a factor.


3.5.3.5 Asof join a trade and quote table

Query: aj[`sym`time;select from trade where sym in ids;select from quote]

Chart: Asof join trade & quote – ratio of compressed vs. uncompressed query time by logical block size, cold cache and hot cache.

The final test is a highly CPU-intensive asof join of the trade and quote records for 100 symbols. Extending the pattern of previous tests, we can see that performance against compressed trades and quotes is quite comparable to results against uncompressed tables.

3.6 CONCLUSION

In any given application, the metrics which will determine if and how compression should be used will be informed by the individual use case. Our tests have shown that for this particular setup, assuming some OS caching, the more CPU-intensive queries will see significantly less overall impact than simple reads on small, randomly distributed pieces of data. This implies that we could expect to see acceptable performance levels for intensive queries, while achieving worthwhile reductions in disk usage, i.e. our compression ratio of 1.5. Experimenting with logical block sizes suggested that we should keep this parameter at the low end of the spectrum for our queries, or we will see a falloff in performance. This is reinforced by the relative lack of change in compression ratio across logical block sizes. Depending on the individual use case, other applications may choose to optimize for overall compression ratio or time taken to save down the data.


Kdb+ and WebSockets

4 Kdb+ and Websockets

Author:

Chris Scott works as a kdb+ consultant at one of the world's largest financial institutions, developing a range of kdb+ applications which use WebSockets as a form of communication. Chris has also been involved in designing HTML5 training courses and building HTML5 mobile and desktop applications for First Derivatives' Kx technology.

CONTENTS
Kdb+ and Websockets ...... 52
4.1 INTRODUCTION ...... 54
4.2 WEBSOCKETS ...... 55
4.2.1 What Are WebSockets ...... 55
4.2.2 Why Not Just Use .z.ph? ...... 55
4.3 CONNECTING TO KDB+ USING WEBSOCKETS ...... 56
4.3.1 The Handshake ...... 56
4.3.2 .z.ws Message Handler ...... 56
4.3.3 Simple Example ...... 58
4.3.4 Pushing Data To The Client Using neg[h] ...... 60
4.4 CONVERTING KDB+ TO JSON ...... 61
4.4.1 Using json.k Within a q Process (Server-side Parsing) ...... 61
4.4.2 Using c.js Within JavaScript (Client-side Parsing) ...... 63
4.5 WEBSOCKET SECURITY ...... 65
4.6 A SIMPLE EXAMPLE – REAL-TIME DATA ...... 66
4.7 CONCLUSION ...... 69


4.1 INTRODUCTION

Since the release of kdb+ 3.0, it has been possible to make use of WebSockets when connecting to a kdb+ process. This has big implications on how we can build front-end applications that use a kdb+ back-end, in particular for applications wanting to display real-time information.

WebSockets are supported by most modern browsers, and so for the first time it is possible to hook a web application up directly to a kdb+ process to allow real-time communication over a persistent connection. This gives us the ability to build a pure HTML5 real-time web application which can connect to kdb+ through JavaScript, which is native to web browsers.

Using web applications to build GUIs for kdb+ applications is becoming increasingly popular. Only a web browser is required to run the GUI, so it is OS independent. This means that when developing the GUI, only one codebase is required for all platforms, and maintenance and upgrades are much easier and more efficient to perform.

This whitepaper will introduce what WebSockets are and what benefits they hold over standard HTTP. It will also take the reader through the set-up of a simple web page that uses WebSockets to connect to a kdb+ process and the steps involved in passing data through the connection.

We will look at different methods for converting data between kdb+ and JavaScript and ways to help ensure the connection is secure.

The paper will finish up by providing a full working example (including code) of a simple web application which uses a kdb+ back-end to provide real-time updating tables based on user queries.

NOTE: As well as q, this paper will make significant use of HTML, CSS and JavaScript. A basic understanding of these will be necessary to take full advantage of this paper, though any complex structures will be examined here in detail. For a crash course on the above languages, please refer to the following resources:
- http://www.w3schools.com
- http://www.codecademy.com

Throughout the paper, kdb+ code will be in grey text boxes while HTML and JavaScript code will be in green text boxes.


4.2 WEBSOCKETS

4.2.1 What Are WebSockets

WebSockets are a long awaited evolution in client and web server communication technology. They provide a protocol between a client and server which runs over a persistent TCP connection. The client-server connection can be kept open as long as needed and can be closed by either the client or the server. This open connection allows bi-directional, full-duplex messages to be sent over the single TCP socket connection - the connection allows data transfer in both directions, and both client and server can send messages simultaneously. All messages sent across a WebSocket connection are asynchronous.

Without WebSockets, bi-directional messaging can be forced by having distinct HTTP calls for both the client sending requests and the server publishing the responses and updates. This requires either the client to keep a mapping between outgoing and incoming messages, or the use of a proxy server in between the client and server (known as HTTP tunneling). WebSockets simplify this communication by providing a single connection which both client and server can send messages across.

WebSockets were designed to be implemented in web browsers and web servers, but they can be used by any client or server application. The ability to provide bi-directional, real-time communication makes them a basis for creating real-time applications on both web and mobile platforms. The WebSocket API and protocol have been standardised by the W3C and the IETF respectively.

4.2.2 Why Not Just Use .z.ph?

It has previously been possible to connect to a kdb+ process using HTTP. HTTP requests could be processed using the .z.ph and .z.pp handlers. To illustrate this, simply start up a q process with an open port and then type the hostname:port into your web browser. This will give a very basic view of the q process.
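For example (the port number here is arbitrary):

q)\p 5000
/ browsing to http://localhost:5000 now shows the built-in .z.ph view of the process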

Straight out of the box, this is very simple, and provides a useful interface for viewing data and running basic queries without being limited to the character limits of a q console. If, however, you want to do more than just simple analysis on the q process, this method presents a few drawbacks:

1) Customisation is quite a complicated process that requires you to manipulate the functions in the .h namespace which form the basis of the in-built HTTP server code. The HTML markup is generated entirely by the q process.
2) Data must be requested by the browser, so some form of polling must occur in order to update the webpage. This makes the viewing of real-time data impossible without the continuous polling of the kdb+ server.
3) .z.ph uses synchronous messaging, and the webpage is effectively refreshed every time a query is sent.

Instead of relying on .z.ph to serve the entire web page, an application could make use of AJAX techniques to send asynchronous GET and POST requests to a kdb+ server. .z.ph and .z.pp can be modified to handle these requests. AJAX (Asynchronous JavaScript And XML) is a descriptive name that covers a range of web development techniques that can be used to provide asynchronous messaging between a web application and server.


This allows requests to be made by the client in the background without having to reload the web page, and as there is no reliance on .z.ph to generate the HTML markup, the application can be fully customised independent of the kdb+ process. In order to receive updates the application will still have to poll the server, which is not ideal for real-time data. The full implementation of this method is beyond the scope of this whitepaper.

As WebSocket connections are persistent, the server is able to push updates to the client rather than relying on the client to poll for the information. Additionally, while each HTTP request and response must include a TCP header, WebSockets only require this header to be sent during the initial handshake.

4.3 CONNECTING TO kdb+ USING WEBSOCKETS

4.3.1 The Handshake

In order to initialise a WebSocket connection, a WebSocket ‘handshake’ must be successfully made between the client and server processes. First, the client sends a HTTP request to the server to upgrade from the HTTP protocol to the WebSocket protocol:

// Client WebSocket request header
GET /text HTTP/1.1
Host: localhost:5001
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: xc0DPEL132mlMtGdbWJPGQ==
Sec-WebSocket-Version: 13

The server then sends a HTTP response indicating that the protocol upgrade was successful:

// Server WebSocket response header
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: 1thgMROs9ylOWOMkc2WUWGRzWdY=

4.3.2 .z.ws Message Handler

Like the .z.ph HTTP GET handler, kdb+ has a separate message handler for WebSockets called .z.ws, meaning all incoming WebSocket messages will be processed by this function. There is no default definition of .z.ws; it should be customised by the developer to handle the incoming messages as required. Later, several examples of a customised .z.ws handler will be shown, but initially we will look at a very basic definition:

q) .z.ws:{neg[.z.w].Q.s value x;}


For now, let’s assume that x (the query from the client) is being passed in as a string. value x simply evaluates the query, and passing this into .Q.s presents the result in console output format (i.e. plain text).

As mentioned before, all messages passed through a WebSocket connection are asynchronous. This means that we must handle the server response within the .z.ws message handler. neg[.z.w] does this by asynchronously pushing the results back to the handle which raised the request.


From the server side, using neg[.z.w] within the handler will push the result back to the client once it has been evaluated. The client does not wait for a response and instead the open WebSocket connection will receive the response some time later, so we must handle this on the client side as well.

Fortunately, JavaScript has a native WebSocket API which allows us to handle this relatively easily. The main parts of the JavaScript API are explained below.

4.3.3 Simple Example

Here is an example of a simple client application connecting to a kdb+ server using a WebSocket. This is a slightly modified version of the example that can be found on the code.kx.com WebSocket cookbook. http://code.kx.com/wiki/Cookbook/Websocket

First, start up our kdb+ server and set our .z.ws handler. Here we will add some error trapping to the handler to send the error message to the client.

C:\Users\Chris>q -p 5001    // Open a port on 5001
KDB+ 3.2 2014.09.21 Copyright (C) 1993-2014
w32/ 2()core 3689MB Chris chris-hp 169.254.80.80 NONEXPIRE

q).z.ws:{neg[.z.w].Q.s @[value;x;{`$x}];}
q)

Next we create a HTML document we will use to connect to the above kdb+ server. The below code can be copied into a text editor and saved as a .html file, then opened in any modern web browser.


[HTML listing: a “WebSocket Demo” page with an input box, a ‘Go’ button and an output area, whose JavaScript opens a WebSocket to ws://localhost:5001, sends the typed query and displays the response]

This example sets our webpage up as a q console. As it is using .Q.s to convert the kdb+ output into string format, it is subject to the restrictions of the q console size set by \c.

Figure 3.2 – This example provides a web console application which allows q commands to be executed from the browser

4.3.4 Pushing Data To The Client Using neg[h]

Above, we have used neg[.z.w] within the .z.ws handler to return the result to the client immediately after it is received and evaluated. But in some situations we don’t want to just provide a response, but rather set up a subscription that continually pushes data through our WebSocket connection.

As the WebSocket is persistent and obeys normal IPC protocol, we can push data through it asynchronously at any time using neg[h], where h is the WebSocket handle. To see this, in the input box of the example application above, type the following code and click ‘Go’.

.z.ts:{[x;y]neg[x].Q.s .z.T}[.z.w]; system"t 1000"

You should now see the current time being output every second, without the browser having to poll for the data.
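The same idea extends naturally to a simple publish/subscribe arrangement in which the server tracks its WebSocket handles using the .z.wo (WebSocket open) and .z.wc (WebSocket close) handlers. The sketch below is not part of the example application above; the subs variable and pub function are hypothetical names:

q)subs:`int$()                           / currently open WebSocket handles
q).z.wo:{subs::subs,x}                   / register a handle when a WebSocket opens
q).z.wc:{subs::subs except x}            / remove it when the connection closes
q)pub:{[m]{neg[x].Q.s y}[;m] each subs}  / push a message to every subscriber
q).z.ts:{[now]pub now}                   / broadcast the timer timestamp
q)\t 1000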


4.4 CONVERTING kdb+ TO JSON

Converting kdb+ into string format using .Q.s means that not only is the message limited by the console size, but also that as the client is receiving plain text the data will be unstructured and very difficult to parse into something that can be manipulated by the client.

JavaScript has a built-in technology called JSON (JavaScript Object Notation) which can be used to create JavaScript objects in memory to store data. These objects are collections of name/value pairs which can be easily accessed by JavaScript and used to build the data structures displayed on the web page.

We want to make use of JSON to hold our data so that we can easily display and manipulate our data on the client. There is a very important relationship between kdb+ and JSON – kdb+ dictionaries and JSON objects are comparable. This means that we can parse our kdb+ data into JSON structures very easily, with tables just becoming arrays of JSON objects.

Depending on the application, it may be of benefit to do this conversion on the server side within the q process, or on the client side in the JavaScript code. We will explore the two options along with an example to show both in action.

4.4.1 Using json.k Within a q Process (Server-side Parsing)

kdb+ 3.2 includes a new .j namespace which contains functions for translating between kdb+ and JSON format. For older versions of kdb+, Kx provides the same functionality inside a script named json.k which can simply be loaded into a q process. That script can be found here: http://kx.com/q/e/json.k

There are two main functions that we will make use of:
- .j.j parses kdb+ into a JSON string
- .j.k parses a JSON string into kdb+

C:\Users\Chris\Documents\Whitepaper>q
KDB+ 3.2 2014.09.21 Copyright (C) 1993-2014 Kx Systems
w32/ 2()core 3689MB Chris chris-hp 169.254.80.80 NONEXPIRE

q)tab:([]col1:`a`b`c`d;col2:1 2 3 4)
q)tab
col1 col2
---------
a    1
b    2
c    3
d    4
q).j.j tab
"[{\"col1\":\"a\",\"col2\":1},{\"col1\":\"b\",\"col2\":2},{\"col1\":\"c\",\"col2\":3},{\"col1\":\"d\",\"col2\":4}]"
q).j.k "{\"a\":1,\"b\":2,\"c\":3,\"d\":4}"
a| 1
b| 2
c| 3
d| 4
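One caveat to bear in mind when parsing on the server side: JSON does not distinguish integers from floats, so numeric values that make a round trip through .j.j and .j.k come back to q as floats. A small sketch:

q)type ([]col2:1 2 3 4)`col2
7h
q)type (.j.k .j.j ([]col2:1 2 3 4))`col2
9h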


Here is an example of how we can use this to manipulate data in our web app. First set up our server and use .j.j to convert the kdb+ output to JSON:

C:\Users\Chris>q -p 5001    // Open a port on 5001
KDB+ 3.1 2014.03.27 Copyright (C) 1993-2014 Kx Systems
w32/ 2()core 3689MB Chris chris-hp 169.254.80.80 NONEXPIRE

q).z.ws:{neg[.z.w].j.j @[value;x;{`$"'",x}];}
q)

We can edit the ws.onmessage JavaScript function from the example in section 4.3.3 to handle the messages coming through as JSON strings rather than just plain text.

ws.onmessage=function(e){
  // we will use the variable t to build our HTML
  // we parse the JSON string into a JSON object using JSON.parse
  var t, d = JSON.parse(e.data);
  if (typeof d == "object") {      // either a table or dictionary
    if (d.length) {                // if d has a length it is a table
      // we iterate through the object, wrapping it in the HTML table tags
      t = '<table><tr>';
      // loop through the keys to create the table header
      for (var x in d[0]) { t += '<th>' + x + '</th>'; }
      t += '</tr>';
      // loop through the rows, putting tags around each column value
      for (var i = 0; i < d.length; i++) {
        t += '<tr>';
        for (var x in d[0]) { t += '<td>' + d[i][x] + '</td>'; }
        t += '</tr>';
      }
      t += '</table>';
    } else {
      // if the object has no length, it must be a dictionary, so we iterate over
      // the keys to print the key|value pairs as would be displayed in a q console
      t = "";
      for (var x in d) { t += x + " | " + d[x] + "<br />"; }
    }
  } else {
    // if not an object, then the message must have a simple data structure,
    // in which case we will just print it out.
    t = d;
  };
  output.innerHTML = cmd + t + "<br />" + output.innerHTML;
}


This gives us the same web console application as before, but the distinction is that we are now handling JSON objects which are easier to manipulate than strings of plain text. The JavaScript code will also be aware of data type information which is not the case with plain strings.

Using the .j functions within the q process is very straightforward, and if it is suitable for the application to transfer messages across the WebSocket as strings this can be a good solution. However, in some cases it may be preferable to serialise the data into binary format before transmitting it across the connection.

4.4.2 Using c.js Within JavaScript (Client-side Parsing)

Instead of parsing the data within the kdb+ server, we could instead use -8! to serialise the data into kdb+ binary form and then deserialise it on the client side directly into a JSON object. With client machines generally being a lot more powerful than in the past, it is reasonable to provide the client side JavaScript code with some extra workload.

This approach requires a little more understanding of JavaScript. However, Kx provides the script c.js which contains the functionality to serialise and deserialise data on the client side. The deserialize function converts kdb+ binary data into JSON, while the serialize function converts our message into kdb+ binary format before sending it to the server. c.js can be found here: http://kx.com/q/c/c.js

Section 4.4.1 showed how we can parse q structures into JSON strings and send them to our client. In this example, we will instead do all of the parsing on the client side to produce the same result. The client will send strings to the server but the server will send serialised data back to the client. Outgoing messages are serialised using the -8! operator:

C:\Users\Chris>q -p 5001    // Open a port on 5001
KDB+ 3.2 2014.09.21 Copyright (C) 1993-2014 Kx Systems
w32/ 2()core 3689MB Chris chris-hp 169.254.80.80 NONEXPIRE

q).z.ws:{neg[.z.w] -8! @[value;x;{`$"'",x}];}
q)
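In this example only the server’s responses are serialised; the client still sends plain strings. If the client were also to serialise its outgoing messages (as the real-time example later in this paper does via the serialize function), the message would arrive in .z.ws as a byte vector and would need to be deserialised with -9!. A possible sketch, not taken from the paper:

q).z.ws:{m:$[10h=type x;x;-9!x]; neg[.z.w] -8! @[value;m;{`$"'",x}]}  / text or binary frames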

We need to make a few adjustments to our JavaScript code from section 4.4.1 in order to handle the serialised responses. Firstly, we need to load the c.js script into our web page.

<script src="c.js"></script>

The above line should be placed on the line before the opening <script> tag of our own JavaScript code.


websockets.js

This script will be loaded into the web page by the HTML. Make sure this is saved as a .js file in the same directory as the above HTML file.

// initialise variables
var ws,
    syms = document.getElementById("selectSyms"),
    quotes = document.getElementById("tblQuote"),
    trades = document.getElementById("tblTrade");

function connect(){
  if ("WebSocket" in window) {
    ws = new WebSocket("ws://localhost:5001");
    ws.binaryType = 'arraybuffer';   // using serialisation
    ws.onopen=function(e){
      // on successful connection, we want to create an initial subscription
      // to load all the data into the page
      ws.send(serialize(['loadPage',[]]));
    };
    ws.onclose=function(e){console.log("disconnected");};
    ws.onmessage=function(e){
      // deserialise incoming messages
      var d = deserialize(e.data);
      // messages should have format ['function',params]
      // call the function name with the parameters
      window[d.shift()](d[0]);
    };
    ws.onerror=function(e){console.log(e.data);};
  } else alert("WebSockets not supported on your browser.");
}

function filterSyms() {
  // get the values of checkboxes that are ticked and convert into an array of strings
  var t = [], s = syms.children;
  for (var i = 0; i < s.length ; i++) {
    if (s[i].checked) { t.push(s[i].value); };
  };
  // call the filterSyms function over the WebSocket
  ws.send(serialize(['filterSyms',t]));
}

function getSyms(data) {
  // parse an array of strings into checkboxes
  syms.innerHTML = '';
  for (var i = 0 ; i < data.length; i++) {
    syms.innerHTML += '<input type="checkbox" value="' + data[i] + '">' + data[i];
  };
}

function getQuotes(data) { quotes.innerHTML = tableBuilder(data); }

function getTrades(data) { trades.innerHTML = tableBuilder(data); }

function tableBuilder(data) {
  // parse array of objects into HTML table rows
  var t = '<tr>';
  for (var x in data[0]) { t += '<th>' + x + '</th>'; }
  t += '</tr>';
  for (var i = 0; i < data.length; i++) {
    t += '<tr>';
    for (var x in data[0]) {
      t += '<td>' + (("time" === x) ? data[i][x].toLocaleTimeString().slice(0,-3)
                    : ("number" == typeof data[i][x]) ? data[i][x].toFixed(2)
                    : data[i][x]) + '</td>';
    }
    t += '</tr>';
  }
  return t ;
}


5 C API for kdb+

Author:

Jeremy Lucid, who joined First Derivatives in 2013, is a kdb+ consultant based in London. He works for a global investment bank, specializing in real-time best execution analytics.


CONTENTS
C API for kdb+ ...... 76
5.1 INTRODUCTION ...... 78
5.2 K OBJECT ...... 78
5.3 CONNECTING TO A KDB+ PROCESS ...... 79
5.3.1 Starting a q server ...... 79
5.3.2 Opening a socket connection ...... 79
5.3.3 Running queries using the k function ...... 81
5.3.4 Error Signaling ...... 83
5.3.5 Synchronous vs Asynchronous communication ...... 85
5.4 EXTRACTING DATA FROM A K OBJECT ...... 85
5.4.1 Accessing a K object element ...... 85
5.4.2 Object type element t ...... 85
5.4.3 List length element n ...... 86
5.4.4 Extracting atoms ...... 87
5.4.5 Extracting simple lists ...... 87
5.4.6 Extracting mixed lists ...... 88
5.4.7 GUID ...... 90
5.4.8 Dictionaries ...... 91
5.4.9 Tables ...... 92
5.5 CREATING K OBJECTS ...... 94
5.5.1 Creating and passing vector arguments ...... 94
5.6 CREATING A SIMPLE LIST OBJECT ...... 95
5.6.1 Creating a mixed list object ...... 96
5.7 MEMORY MANAGEMENT ...... 97
5.7.1 Reference counting ...... 97
5.8 SHARED C LIBRARIES FOR KDB+ ...... 99
5.8.1 Loading a shared library using 2: ...... 100
5.8.2 User defined function - Dot product ...... 100
5.8.3 Hashing algorithms - md5 and sha256 ...... 102
5.9 SUBSCRIBING TO A KDB+ TICKERPLANT ...... 103
5.9.1 Test tickerplant and feedhandler setup ...... 104
5.9.2 Extracting the table schema ...... 104
5.9.3 Subscribing to a tickerplant data feed ...... 106
5.10 PUBLISHING TO A KDB+ TICKERPLANT ...... 107
5.10.1 Publishing a single row using a mixed list object ...... 108
5.10.2 Publishing multiple rows using a mixed list object ...... 108
5.10.3 Adding a timespan column ...... 109
5.11 CONCLUSION ...... 110


5.1 Introduction

In its traditional financial domain, and across an increasingly broad range of industries, one of the main strengths of kdb+ is its flexibility in integrating and communicating with external systems. This adoption-enhancing feature is facilitated through a number of interfaces, including C and Java APIs, ODBC support, HTTP and WebSockets. This paper will illustrate how the C API can be used to enable a C program to interact with a kdb+ process, and so leverage the real-time streaming and processing strengths of kdb+. Particular consideration will be given to how the API can facilitate subscription and publication to a kdb+ tickerplant process, a core component of any kdb+ tick capture system. Just as trade and quote data can be piped into a financial kdb+ application through a C feedhandler, interesting new datasets in non-financial industries can readily be consumed and processed with minimal setup work in C. For example, in a recent whitepaper exploring the application of kdb+ in the astronomy domain, http://code.kx.com/q/wp/kdb_in_astronomy.pdf, a standard scientific file format is loaded into kdb+ for fast calculation of the recessional velocities of celestial bodies.

While https://code.kx.com/q/interfaces/c-client-for-q/ should still be referred to as the primary source for up to date information on the C API, the examples presented in this paper are designed to form a complementary set of practical templates. These templates can be combined and leveraged to facilitate the application of kdb+ across a broad range of problem domains, and are made available for reference and reuse at https://github.com/kxcontrib/capi.

5.2 K Object

To begin understanding how the C API works, it is best to start by examining the contents of the k.h header file. This header file needs to be included in the C or C++ code in order to interact with a q process. In particular, the header includes the definition of the data type K, which is a pointer to the k0 struct, shown below. This data type is important because all q types, such as atoms, lists, dictionaries and tables are encapsulated at the C level as instances of the K data type. Having K defined as a pointer is advantageous from a memory perspective, resulting in greater efficiency when passed as an argument to a C function.

/* Structure below corresponds to version 3.0 and higher */
typedef struct k0 {
  signed char m, a, t;   // m, a are for internal use
  C u;
  I r;
  union {
    G g; H h; I i; J j; E e; F f; S s;
    struct k0 *k;
    struct { J n; G G0[1]; };
  };
} *K;


5.3 Connecting to a kdb+ process

5.3.1 Starting a q server

In order for a C program to interact with a q process using the API, it is necessary that the q process be enabled to listen for incoming connections. To start a q process in server mode, whereby it listens for requests on a particular port number, the process should be started from the terminal using the -p command line argument, as shown below.

$ q -p 12345
q)\p          // Command used to show what port number the process is listening on
12345i

The process can now receive incoming connections via TCP/IP on port 12345. Alternatively, the port number can be (re)set using the command \p while within the q session.

$ q
q)\p
0i            // Default value when no command line argument is given
q)\p 12345
q)\p
12345i

Without setting a port number via the command line argument, the q session will default to a port value of 0i, meaning no socket listening. Explicitly setting the port to 0i, as shown below, will close any previously open port. This will result in all new connection attempts being rejected.

$ q -p 12345
q)\p 0        // Close previously opened port

5.3.2 Opening a socket connection

The khpu function prototype, shown below, is defined in the k.h header file and can be used to open a network socket connection to a listening q process.

typedef int I;     // Already defined within k.h
typedef char *S;   // Already defined within k.h

I khpu(S hostname, I portNumber, S usernamePassword)

The function takes three parameters and returns the integer value associated with the connection handle. This integer value returned is the OS file descriptor. The first and second parameters expect the hostname and port number of the listening q process, respectively. The third parameter, usernamePassword, expects a string argument containing the username and password credentials required for authentication. The username and password in this string argument should be separated by a single colon. If the listening q process is not using authentication for incoming connections then the username and password values will be ignored.


khpu("localhost", 12345, "username:password");

The C program below shows how khpu can be used to create a socket handle to a q process already listening on port 12345. After the handle value is successfully error checked, using the convenience function handleOk (code given below), its value is printed to stdout and the handle closed using the kclose function.

/* File name: portopen.c */ #include "common.h"

int main() { I handle; I portnumber= 12345; S hostname= "localhost"; S usernamePassword= "kdb:pass";

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE;

printf("Handle value is %d\n", handle);

kclose(handle); return EXIT_SUCCESS; }

Compile and Run The c.o header file, required for compilation, can be found here https://github.com/KxSystems/kdb/tree/master/l32

$ gcc -DKXVER=3 -o portopen portopen.c c.o -lpthread
$ ./portopen
Handle value is 3

When khpu connects successfully the handle value returned is always a positive number. Positive values indicate success, with negative values being reserved for error conditions. Possible return values for the khpu function are listed below, and form the basis for the error handling logic performed by handleOk. This function is one of a pair of error handling utilities which will be used across multiple subsequent examples. All utility functions are defined in the common.h header file which is loaded as part of each example. Particular attention should be paid towards error handling when using the API, and it is for this reason that such logic is included within the examples presented herein. Further details on error signaling will be presented in section 5.3.4.

handle > 0    "Active handle"
handle = 0    "Authentication error"
handle = -1   "Error"
handle = -2   "Timeout" (when khpun is used instead of khpu)


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include "k.h"

I handleOk(I handle) {
  if(handle > 0)
    return 1;
  if(handle == 0)
    fprintf(stderr, "Authentication error %d\n", handle);
  else if(handle == -1)
    fprintf(stderr, "Connection error %d\n", handle);
  else if(handle == -2)
    fprintf(stderr, "Time out error %d\n", handle);
  return 0;
}

The ”Timeout” error is associated with the khpun function, which is similar to khpu but also allows a connection timeout to be specified. The error value -2 indicates that the specified timeout was exceeded before the connection could be created. The function prototype for khpun is given below.

typedef int I;     // Already defined within k.h
typedef char *S;   // Already defined within k.h

I khpun(S hostname, I portNumber, S usernamePassword, I timeout)

// Timeout is specified in units of milliseconds
// A 5 second timeout would require a timeout value of 5000

5.3.3 Running queries using the k function

The k function is used to send a message over a socket connection to a q process. Valid message instructions can be used to perform multiple actions, which include, but are not limited to, requests for data, table inserts, variable initialization and function execution. The response is encapsulated on the C level as a K object and returned by the k function. Within the k.h header file the function is defined as follows.

typedef int I;
typedef char *S;

K k(I handle, const S query, ...)

The first parameter to the k function is the handle value returned by the khpu function call. The second is the message string containing the instructions to be executed by the receiving q process. By definition, the ellipsis indicates the function can take an indeterminate number of arguments following the message string. These additional arguments would be the required inputs associated with execute instructions. For example, if the instruction is to execute the kdb+ insert function, the additional arguments required would be the table name and data object to be inserted, see 5.2 for such an example.


Note that the k function requires the user to always input a sentinel value, (K)0, following the additional argument list. The presence of this sentinel is used as a control variable enabling the function to detect the end of the additional argument list. This value should also be present in the case where no additional arguments are required, see below.

// When no additional parameters are required the (K)0
// sentinel should be added as the third parameter
K k(handle, "query", (K)0)

// When additional parameters are required, (K)0
// should be added following the last parameter
K k(handle, "query", param1, ..., paramN, (K)0)

The k function returns an instance of a K object, from which an accessor function/macro can be used to access the data. Below is a simple example demonstrating how the k function can be used to confirm that a q process is 'alive'. Where 'alive' means that the socket connection can be established and that the server can respond to a trivial request. Such a test could be performed periodically to determine the q server state.

/* File name: alivecheck.c */ #include "common.h"

int main() { I handle; I portnumber= 12345; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K result;

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE; printf("Handle value is %d\n", handle); result= k(handle, "2.0+3.0", (K) 0); if(isRemoteErr(result)) { kclose(handle); return EXIT_FAILURE; } printf("Value returned is %f\n", result->f);

r0(result); kclose(handle); return EXIT_SUCCESS; }

Compile and Run

$ gcc -DKXVER=3 -o alivecheck alivecheck.c c.o -lpthread
$ ./alivecheck
Handle value is 3
Value returned is 5.000000


In the above example, another convenience function, isRemoteErr, is introduced to handle any errors encountered during the remote evaluation. The function is defined as follows and is explained in further detail in the following section.

/* common.h */

I isRemoteErr(K x) {
  if(!x) {
    fprintf(stderr, "Network error: %s\n", strerror(errno));
    return 1;
  } else if(-128 == xt) {
    fprintf(stderr, "Error message returned : %s\n", x->s);
    r0(x);
    return 1;
  }
  return 0;
}

5.3.4 Error Signaling

In general, there are two common types of errors: those generated during the socket handle initialization process and those which can occur during the evaluation of a request. As described in section 5.3.2, where the error handling function handleOk was introduced, initialization errors can be identified by examining the integer value returned by the khpu and khpun function calls. As an example of an initialization error, the ”Authentication error” will occur when the username or password credentials passed to khpu are invalid. In kdb+, a q process can be started in a restricted access mode where only users with valid credentials can connect, see http://code.kx.com/q/wp/kdb_and_websockets.pdf. Similarly, the ”Connection error” can result from incorrect hostname or port numbers being passed to khpu.

Evaluation errors occur from invalid queries being passed to the q process using k. In this case, the type element (x->t) of the K object, x, returned by k should be checked, see section 5.4.2. If the K object returned is null, a network error has occurred. If the type value is -128, then x->s will give the error message resulting from the invalid query. In error.c, below, three invalid queries are passed to a q process and their associated error messages printed. For all subsequent examples in this paper, the function isRemoteErr will be used to capture such errors. For the complete set of error codes see http://code.kx.com/q/ref/errors/#errors.


/* File name: error.c */ #include "common.h"

int main() { K result ; I port= 12345; I timeout= 5000; I handle;

handle= khpun("localhost", port, "kdb:pass", timeout); if(!handleOk(handle)) return EXIT_FAILURE;

result= k(handle, "1+`2", (K) 0); // Handle network error if(!result) { perror("Network Error\n"); kclose(handle); return EXIT_FAILURE; }

if(-128 == result->t) { fprintf(stderr, "Error message returned : %s\n", result->s); } r0(result);

result= k(handle, "`a`b`c=`a`b", (K) 0); // Handle network error if(!result) { perror("Network Error\n"); kclose(handle); return EXIT_FAILURE; } if(-128 == result->t) { fprintf(stderr, "Error message returned : %s\n", result->s); } r0(result);

result= k(handle, "select sym t :([] sym:`a`b)", (K) 0); // Handle network error if(!result) { perror("Network Error\n"); kclose(handle); return EXIT_FAILURE; } if(-128 == result->t) { fprintf(stderr, "Error message returned : %s\n", result->s); } r0(result);

kclose(handle); return EXIT_SUCCESS; }


Compile and Run

$ gcc -DKXVER=3 -o error error.c c.o -lpthread
$ ./error
Error message returned : type
Error message returned : length
Error message returned : from

5.3.5 Synchronous vs Asynchronous communication

If a C process is using the API to publish messages to a q process, such as a tickerplant, then either a synchronous or asynchronous call can be used. The sign of the first parameter to the k function determines whether the call executes synchronously or asynchronously. In the example above, k executed synchronously because the handle value passed to the function was positive. A synchronous call will block until a response is received, which is appropriate if the sending process requires an acknowledgement; however, it will reduce the potential message rate. In order to send an asynchronous message, the k function should be passed the negative handle value.

k(-handle, "1.0+2.0", (K)0);

The asynchronous option is recommended when maximum data throughput is desired and the sender does not require an acknowledgment. Greater technical details on synchronous vs asynchronous requests can be found in the following whitepapers, http://code.kx.com/q/wp/common_design_principles_for_kdb_gateways.pdf

http://code.kx.com/q/wp/query_routing_a_kdb_framework_for_a_scalable_load_balanced_system.pdf

5.4 Extracting data from a K object

5.4.1 Accessing a K object element

As demonstrated in the previous section, the k function can be used to send messages over a socket connection to a q process. The data returned from the q process is then encapsulated at the C level as a K object. That is to say, any q object such as an atom, list, dictionary or table will be contained within the structure referenced by the K pointer. To extract data from a K object, the -> operator can be used to access the member elements of the k0 structure.

5.4.2 Object type element t

Given a K object x, the member element x->t identifies the object's data type. This element can also be accessed using the equivalent macro provided, xt.

#define xt x->t


The following reference table gives the q data type (or error) for the given t value.

t value            Object type or error
t < 0              Scalar (atom) or error type
t = 98             Table
t = 99             Dictionary or keyed table
t = -128           Error
t > 0              List where elements are of the same data type
t within 0 19      List
t within 20 76     Enumerated types
t within 77 97     Function types
t within 100 112   Function types
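The same type numbers are reported by the q keyword type, which can be a convenient cross-check when examining results on the q side; for example:

q)type each (1 2 3; `a`b!1 2; ([]a:1 2); 2015.01.01; "abc")
7 99 98 -14 10h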

5.4.3 List length element n

The element n indicates the length of a list object. The list may contain atomic elements or other K objects. Given a K object x, the n member element can be accessed using the notation x->n. The example below demonstrates usage for the case of a list containing atomic float elements.

The element n indicates the length of a list object. The list may contain atomic elements or other K objects. Given a K object x, the n member element can be accessed using the notation x->n. The example below demonstrates usage for the case of a list containing atomic float elements.

/* File name: length.c */ #include "common.h"

int main() { I handle; I portnumber= 12345; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K list;

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE;

list= k(handle, "2*1 2 3f", (K) 0); if(isRemoteErr(list)) { kclose(handle); return EXIT_FAILURE; }

printf("List length is %lld\n", list->n);

r0(list); kclose(handle); return EXIT_SUCCESS; }

Compile and Run

$ gcc -DKXVER=3 -o length length.c c.o -lpthread
$ ./length
List length is 3

For more complex data structures such as dictionaries, n indicates the number of keys and the number of values associated with each key. Similarly, for tables, n can be used to extract the number of rows and columns present. Working examples for dictionaries and tables are given further below.


5.4.4 Extracting atoms

From the reference table above, when the value of t is less than 0 and not equal to -128 the object is an atom (or scalar). In this case, the value of t identifies the q data type, using the lookup table below. Corresponding to each data type is the associated accessor function used to extract the scalar value from the K object. The accessor functions access the union member elements of the structure.

Note that the type numbers below are given for vectors (positive values). By convention, for atoms (single elements) the type number is negative. For example, -6 is the type number of an atomic int value.

Data Type                     Accessor                           Type number   Type Name
mixed list                    kK                                 0             -
boolean                       x->g                               1             KB
guid                          x->U                               2             UU
byte                          x->g                               4             KG
short                         x->h                               5             KH
int                           x->i                               6             KI
long                          x->j                               7             KJ
real                          x->e                               8             KE
float                         x->f                               9             KF
char                          x->i                               10            KC
symbol                        x->s                               11            KS
timestamp                     x->j                               12            KP
month                         x->i                               13            KM
date                          x->i                               14            KD
datetime                      x->f                               15            KZ
timespan                      x->j                               16            KN
minute                        x->i                               17            KU
second                        x->i                               18            KV
time                          x->i                               19            KT
table/flip                    x->k                               98            XT
dict/table with primary key   kK(x)[0] (keys), kK(x)[1] (values) 99            XD

5.4.5 Extracting simple lists

When the value of t is within the range 1 to 19 the object is a simple list. A simple list is analogous to a C array in which each element in the structure has the same data type. As is the case with atoms, the value of t identifies the data type of the simple list elements. To access the list values, the following accessors are provided.

When the value of t is within the range 0 to 19 the object is a simple list. A simple list is analogous to a C array in which each element in the structure has the same data type. As is the case with atoms, the value of t identifies the data type of the simple list elements. To access the list values, the following accessors are provided.

#define kG(x) ((x)->G0)
#define kC(x) kG(x)
#define kH(x) ((H*)kG(x))
#define kI(x) ((I*)kG(x))
#define kJ(x) ((J*)kG(x))
#define kE(x) ((E*)kG(x))
#define kF(x) ((F*)kG(x))
#define kS(x) ((S*)kG(x))
#define kK(x) ((K*)kG(x))

The following example demonstrates use of these functions for the case of a simple list containing ints, floats and symbols.


/* File name: lists.c */ #include "common.h"

int main() { I handle ; I portnumber= 12345; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K x;

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE;

x= k(handle, "reverse `A`B`C", (K) 0); if(isRemoteErr(x)) { kclose(handle); return EXIT_FAILURE; }

printf("Symbols: %s %s %s\n", kS(x)[0], kS(x)[1], kS(x)[2]); r0(x); x= k(handle, "reverse 1 2 3", (K) 0); if(isRemoteErr(x)) { kclose(handle); return EXIT_FAILURE; }

printf("Ints: %lld %lld %lld\n", kJ(x)[0], kJ(x)[1], kJ(x)[2]); r0(x); x= k(handle, "reverse 1.0 2.0 3.0", (K) 0); if(isRemoteErr(x)) { kclose(handle); return EXIT_FAILURE; }

printf("Floats: %f %f %f\n", kF(x)[0], kF(x)[1], kF(x)[2]); r0(x);

kclose(handle); return EXIT_SUCCESS; }

Compile and Run

$ gcc -DKXVER=3 -o lists lists.c c.o -lpthread
$ ./lists
Symbols: C B A
Ints: 3 2 1
Floats: 3.000000 2.000000 1.000000

5.4.6 Extracting mixed lists

In the case where x->t is exactly zero, the K object contains a mixed list of other K objects. Each element in the list is a pointer to another K object. In the example below, a mixed list containing two elements is returned from the q process. Element 1 is a list of 5 integers and element 2 is a list of 3 floats. The example demonstrates how the length and type of each list element can be determined, along with how the values can be extracted.

In the case where x->t is exactly zero, the K object contains a mixed list of other K objects. Each element in the list is a pointer to another K object. In the example below, a mixed list containing two elements is returned from the q process. Element 1 is a list of 5 integers and element 2 is a list of 3 floats. The example demonstrates how the length and type of each list element can be determined along with how the values can be extracted.


/* File name: mixedList.c */ #include "common.h"

int main() { J i; I handle; I portnumber= 12345; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K mixedList, elementOne, elementTwo;

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE;

mixedList= k(handle, "(1 2 3 4 5;20.0 30.0 40.0)", (K) 0); if(isRemoteErr(mixedList)) { kclose(handle); return EXIT_FAILURE; }

printf("Object mixed list type: %d\n", mixedList->t); printf("Mixed list contains %lld elements\n", mixedList->n);

elementOne= kK(mixedList)[0]; elementTwo= kK(mixedList)[1];

printf("elementOne contains %lld elements\n", elementOne->n);
printf("elementTwo contains %lld elements\n", elementTwo->n);
printf("elementOne has type %d\n", elementOne->t);
printf("elementTwo has type %d\n", elementTwo->t);

for(i= 0; i < elementOne->n; i++) { printf("elementOne[%lld] = %lld\n", i, kJ(elementOne)[i]); }

for(i= 0; i < elementTwo->n; i++) { printf("elementTwo[%lld] = %f\n", i, kF(elementTwo)[i]); }

r0(mixedList); kclose(handle); return EXIT_SUCCESS; }

Compile and Run

$ gcc -DKXVER=3 -o mixedList mixedList.c c.o -lpthread
$ ./mixedList
Object mixed list type: 0
Mixed list contains 2 elements
elementOne contains 5 elements
elementTwo contains 3 elements
elementOne has type 7
elementTwo has type 9
elementOne[0] = 1
elementOne[1] = 2
elementOne[2] = 3
elementOne[3] = 4
elementOne[4] = 5
elementTwo[0] = 20.000000
elementTwo[1] = 30.000000
elementTwo[2] = 40.000000


5.4.7 GUID

Since kdb+ version 3.0, the globally unique identifier (GUID) data type is supported. This is a 16 byte data type which can be used for storing arbitrary 16 byte values, typically transaction ids. In general, since v3.0, there should be no need to use char vectors for ids; instead, ids should be int, sym or GUID. For table lookups, GUIDs are also much faster when performing comparisons in the where clause relative to the string representation, which also takes up 2.5 times more space. The memory saving can be easily illustrated by comparing the byte size of a table containing a single column of GUIDs with that of a table containing the single-column string equivalent. The internal q function, -22!, conveniently returns the table byte size. [Compute ratio: q)%[;] . -22!'(([]string 100?0Ng);([]100?0Ng))]. The example below demonstrates how to extract a single GUID or a list of GUIDs from a q process.

/* File name: guid.c */#include "common.h"

int main() { J i, j; I handle; I portnumber= 12345; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K singleGuid, multiGuid;

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE;

singleGuid= k(handle, "rand 0Ng", (K) 0); if(isRemoteErr(singleGuid)) { kclose(handle); return EXIT_FAILURE; }

printf("Single guid: type %d\n", singleGuid->t); printf("Single guid: length %lld\n", singleGuid->n);

for(i= 0; i < 16; i++) { printf("%02x", kU(singleGuid)->g[i]); } r0(singleGuid); printf("\n");

multiGuid= k(handle, "2?0Ng", (K) 0); if(isRemoteErr(multiGuid)) { kclose(handle); return EXIT_FAILURE; }

printf("Multi guid: type %d\n", multiGuid->t); printf("Multi guid: length %lld\n", multiGuid->n);

for(i= 0; i < multiGuid->n; i++) { for(j= 0; j < 16; j++) { printf("%02x", kU(multiGuid)[i].g[j]); } printf("\n"); } r0(multiGuid); kclose(handle); return EXIT_SUCCESS; }

C-API for kdb+ 90 | Page Technical Whitepaper

Compile and Run

$gcc -DKXVER=3 -o guid guid.c c.o -lpthread
$./guid
Single guid: type -2
Single guid: length 1
be6cc222695dd8964e46c709b06f3573
Multi guid: type 2
Multi guid: length 2
730690ccf4ebd95c6d2d039255d11ae6
45135fc5509403e4ab7eae705ba54437

5.4.6 Dictionaries

In kdb+, a dictionary is a data type which creates an association between a list of keys and a list of values. From the lookup table in section 4.2, its numerical type, x->t, has value 99 (encoded type name XD). In the next example, the k function is used to return a simple dictionary object, x. As expected, x contains a list of length two, where the first element corresponds to the dictionary keys, and the second corresponds to the dictionary values. In particular, the keys element has type 11 (encoded type name KS), indicating that the keys are a symbol list, see 4.4. The second element has type 0, indicating that the values of the dictionary are contained in a mixed list. Therefore, not all keys have an associated value of the same data type. Below, the mixed list (values) is iterated through to extract and display the individual elements of each simple list element.

/* File name: dict.c */ #include "common.h"

int main() { J i; I handle; I portnumber= 12345; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K x, keys, values;

handle= khpu(hostname, portnumber, usernamePassword);

if(!handleOk(handle)) return EXIT_FAILURE;

x= k(handle, "`a`b`c!((1;2;3);(10.0;20.0);(`AB`CD`EF`GH))", (K) 0);

if(isRemoteErr(x)) { kclose(handle); return EXIT_FAILURE; }

printf("Dict type = %d\n", x->t); printf("Num elements = %lld (keys and values)\n", x->n);

keys= kK(x)[0]; values= kK(x)[1];

printf("Keys type = %d\n", keys->t); printf("Num keys = %lld\n", keys->n);

printf("Values type = %d\n", values->t); printf("Num values = %lld\n", values->n);

C-API for kdb+ 91 | Page

Technical Whitepaper

for(i= 0; i < keys->n; i++) { printf("keys[%lld] = %s\n", i, kS(keys)[i]); }

for(i= 0; i < values->n; i++) { printf("values[%lld] has length %lld\n", i, kK(values)[i]->n); }

J *values0= kJ(kK(values)[0]); F *values1= kF(kK(values)[1]); S *values2= kS(kK(values)[2]);

printf("values[0;] = %lld %lld %lld\n", values0[0], values0[1], values0[2]); printf("values[1;] = %lf %lf \n", values1[0], values1[1]); printf("values[2;] = %s %s %s %s \n", values2[0], values2[1], values2[2], values2[3]);

r0(x); kclose(handle); return EXIT_SUCCESS; }

Compile and Run

$gcc -DKXVER=3 -o dict dict.c c.o -lpthread
$./dict
Dict type = 99
Num elements = 2 (keys and values)
Keys type = 11
Num keys = 3
Values type = 0
Num values = 3
keys[0] = a
keys[1] = b
keys[2] = c
values[0] has length 3
values[1] has length 2
values[2] has length 4
values[0;] = 1 2 3
values[1;] = 10.000000 20.000000
values[2;] = AB CD EF GH

5.4.7 Tables

Extracting data from a K object containing a table is similar to that for a dictionary. The element x->k will contain a two element list, representing a dictionary of columns to values. The first element, kK(x->k)[0], corresponds to the column names, with the second, kK(x->k)[1], corresponding to the values. Table keys are always symbol vectors (KS). The number of columns present in the table is easily determined using the n element (see columns->n below). The values object is a mixed list, where each element in the list is another K object containing a simple or mixed list. The example below shows how the column names and values of a kdb+ table can be extracted.

C-API for kdb+ 92 | Page Technical Whitepaper

/* File name: table.c */ #include "common.h"

int main() { J i; I handle; I portnumber= 12345; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K table, columns, values, col1, col2;

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE;

// Execute a query which performs an aggregate on table t
// Table t can be defined on the q process as follows
// q)t:([] sym:`a`a`b`b`c`c;price:1.0 2.0 3.0 4.0 5.0 6.0)
table= k(handle, "select from t where price=(max;price) fby sym", (K) 0);
if(isRemoteErr(table)) { kclose(handle); return EXIT_FAILURE; }

// Extract columns and values elements
columns= kK(table->k)[0];
values= kK(table->k)[1];

// Show column count and print column names
printf("Number of Columns = %lld\n", columns->n);
printf("Columns: %s %s\n", kS(columns)[0], kS(columns)[1]);

// Access the table values
printf("values object has type = %d\n", values->t);

col1= kK(values)[0]; col2= kK(values)[1];

printf("Number of elements in column 1 is %lld\n", col1->n);

for(i= 0; i < col1->n; i++) { printf("%s %lf\n", kS(col1)[i], kF(col2)[i]); }

r0(table); kclose(handle); return EXIT_SUCCESS; }

Compile and Run

$q
q)t:([] sym:`a`a`b`b`c`c; price:1.0 2.0 3.0 4.0 5.0 6.0)
q)\p 12345

$gcc -DKXVER=3 -o table table.c c.o -lpthread
$./table
Number of Columns = 2
Columns: sym price
values object has type = 0
Number of elements in column 1 is 3
a 2.000000
b 4.000000
c 6.000000

C-API for kdb+ 93 | Page Technical Whitepaper

5.5 Creating K objects

In the code example alivecheck.c, the query evaluated on the q process was very simple, "2.0 + 3.0", and didn't require additional arguments. However, if the query were of a functional form requiring a single scalar argument, for example func:{[x] 1.0 + x}, the scalar argument should be passed immediately after the query. For the argument to be passed successfully, the C variable must first be converted to its corresponding K object using conversion functions such as those shown below.

// Pass a single integer argument
result = k(handle, "{[x] 1 + x}", ki(2), (K) 0);

// Pass a single float argument
result = k(handle, "{[x] 1.0 + x}", kf(2.0), (K) 0);

// Pass multiple float arguments
result = k(handle, "{[x;y] x + y}", kf(1.0), kf(2.0), (K) 0);

// Pass a single symbol argument
result = k(handle, "{x in `ABC`DEF`GHI}", ks((S) "ABC"), (K) 0);

The table below lists the functions used to perform this conversion for the data types not covered in the examples above.

K ka(I);         // Create an atom of type
K kb(I);         // Create a boolean
K ku(U);         // Create a guid
K kg(I);         // Create a byte
K kh(I);         // Create a short
K ki(I);         // Create an int
K kj(J);         // Create a long
K ke(F);         // Create a real
K kf(F);         // Create a float
K kc(I);         // Create a char
K ks(S);         // Create a symbol
K ktj(-KP, J);   // Create a timestamp
K kt(I);         // Create a time
K kd(I);         // Create a date
K ktj(-KN, J);   // Create a timespan
K kz(F);         // Create a datetime
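As a brief sketch, a few of the constructors from this table can be used directly as k arguments, in the same style as the snippets above. The handle and result variables are assumed to exist already, and the one-line q functions shown are arbitrary examples.

// Pass a long argument
result = k(handle, "{[x] x + 1}", kj(42), (K) 0);

// Pass a date argument (kd(0) is 2000.01.01)
result = k(handle, "{[x] x + 1}", kd(0), (K) 0);

// Pass a timespan argument (one second, 0D00:00:01)
result = k(handle, "{[x] x}", ktj(-KN, 1000000000), (K) 0);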

5.5.1 Creating and passing vector arguments

It is common that a function call or query requires a vector as opposed to a scalar argument. An example of a vector argument in kdb+ would be a simple or mixed list. To create a list object, be it simple or mixed, using the API library, the following functions are available in the k.h header file.

// Create a simple list, where type does not take value zero
K ktn(I type, I length);

// Create a mixed list
K ktn(0, I length);
K knk(I length, ...);   // Common and useful shortcut

Both of these functions will be described in greater detail further below.

C-API for kdb+ 94 | Page Technical Whitepaper

5.6 Creating a simple list object

A simple list in kdb+ is a collection of elements each with the same data type, analogous to a C array. To create a list object, using the API, the function ktn is used. The first parameter to this function specifies the data type of the list elements. From the reference table shown at the top of the code.kx reference page, http://code.kx.com/q/interfaces/c-client-for-q/, either the kdb+ type number or encoded type name values can be used to specify the data type. The second parameter specifies the length of the list. The example below shows how to create an integer list object of length 2.

// Encoded type name is used to specify the type.
// List length initialised to 2
K integerList = ktn(KI, 2);

With the list object initialized, it is straightforward to insert elements into the list using the interface list accessor functions given in the reference table. The example below demonstrates how the elements of the list can be initialized using the kI accessor function. This two-element list is then passed as a parameter to an insert function on a q process, which inserts it as a new row into the table t.

I i = 1; I j = 2; // List of type int initialised with two elements allocated

K integerList = ktn (KI,2);

kI(integerList)[0]=i;// Insert element i to position 0

kI(integerList)[1]=j;// Insert element j to position 1

k(handle, "insert",ks((S) "t"), integerList, (K)0);

If it is not possible to know in advance the length of the list object, an alternative approach would be to initialize a list object with length 0 and subsequently append elements to the list using the provided join functions shown below.

typedef void V;
typedef char *S;
K ja(K*, V*);   // Join an atom to a list
K js(K*, S);    // Join a string to a list
K jk(K*, K);    // Join another K object to a list
K jv(K*, K);    // Join another K list to the first

The example below shows how an integer list of size zero can be initialized and later appended to using the ja function. Note that because the first and second parameters to this function are of type pointer (*), the value passed needs to be the memory address of the given object.

C-API for kdb+ 95 | Page Technical Whitepaper

I i = 1;
I j = 2;

// List of type KI (int) initialised with size zero
K integerList = ktn(KI, 0);

integerList = ja(&integerList, &i);   // Append element i to the list
integerList = ja(&integerList, &j);   // Append element j to the list

k(handle, "insert", ks((S) "t"), integerList, (K) 0);

5.6.1 Creating a mixed list object

A mixed list in kdb+ is a collection of elements where each element may be of a different data type, analogous to a C struct. To create a mixed list object using the API, the function knk is used. From the definition above, the knk function is defined as a variable-argument function, meaning the number of parameters is not fixed, as shown below.

// Create a mixed list, where n is the number of elements
typedef int I;
K knk(I n, ...);

Above n defines the number of elements in the list, where each list element, 1 to n, may be of a different data type. In the example below, a mixed list object of length 3 is created and initialized with elements of a different data type; symbol, float and integer. This example demonstrates how a mixed list can be passed to the k function for insertion into a table. In this case, the mixed list is inserted as a single row into the table trade.

K mixedList;
mixedList = knk(3, ks((S) "ABC"), kf(10.0), ki(20));

k(handle, "{[x] `trade insert x}", mixedList, (K) 0);

A mixed list can also be created where each element is itself a simple list. This object is typically used for bulk inserts in which multiple rows are inserted at once.

multipleRow = knk(3, ktn(KS, n), ktn(KF, n), ktn(KI, n));

The below code segment demonstrates how a mixed list object can be populated with simple lists, each of length two.

C-API for kdb+ 96 | Page

Technical Whitepaper

I int_1 = 1;
I int_2 = 2;
K integerList = ktn(KI, 2);   // Simple list object of type int
kI(integerList)[0] = int_1;
kI(integerList)[1] = int_2;

F float_1 = 1.5;
F float_2 = 2.5;
K floatList = ktn(KF, 2);     // Simple list object of type float
kF(floatList)[0] = float_1;
kF(floatList)[1] = float_2;

S sym_1 = ss("IBM");
S sym_2 = ss("ABC");
K symList = ktn(KS, 2);       // Simple list object of type symbol
kS(symList)[0] = sym_1;
kS(symList)[1] = sym_2;

// 2 rows with 3 columns
multipleRow = knk(3, symList, floatList, integerList);

5.7 Memory Management

Whereas kdb+ manages a program's memory requirements automatically, C provides several functions for manual memory allocation and management. These functions, which include malloc, calloc and free, allow for fine-tuned memory management: an important requirement for software destined for mobile or miniature devices. In the case where a C program is required to handle K objects, the C API provides a few simple functions, detailed below, to manage memory by incrementing or decrementing a K object's reference count.

5.7.1 Reference counting

kdb+ automatically manages reference counts to determine when an object is safe to destroy. As the number of references to an object increases or decreases, its associated reference count is incremented or decremented accordingly. When the reference count reaches zero the object can be safely destroyed. In the case of a memory resource, when the object is destroyed the memory is freed and made available for reallocation. In kdb+, the reference count can be accessed directly using the internal function -16!. The following example shows the reference count incrementing and decrementing as expected.

q)a:5?10.0
q)a
7.85033 5.347096 7.111716 4.11597 4.931835
q)-16!a
1i
q)b:a
q)-16!a
2i
q)-16!b
2i
q)delete a from `.
q)-16!b    // Since a was deleted, b's ref count was decreased
1i

C-API for kdb+ 97 | Page

Technical Whitepaper

To increment or decrement the reference count manually within a C program the following API functions are available.

/* Increment the object's reference count */
r1(K x);

/* Decrement the object's reference count */
r0(K x);

The member element r stores the current reference count for a K object.

/* File name: refcount.c */ #include "common.h"

int main() {
  K a, b;
  khp("", -1);   // initialise memory if not opening a connection before allocation

  a= ki(3);      // Create K object a
  printf("New K object a has ref count = %d\n", a->r);

  b= a;          // Set b equal to a
  r1(a);         // Increment a's reference count
  printf("incremented ref count of a = %d\n", a->r);
  printf("ref count of b = %d\n", b->r);

  r0(b);         // Decrement b's reference count
  printf("decrement ref count of b = %d\n", b->r);
  printf("ref count of a was decremented to = %d\n", a->r);

  printf("value of a is still available: %d\n", a->i);

  r0(a);         // When an object with ref count 0 is passed to r0, the object is destroyed
  return EXIT_SUCCESS;
}

Compile and Run

$gcc -DKXVER=3 -o refcount refcount.c c.o -lpthread
$./refcount
New K object a has ref count = 0
incremented ref count of a = 1
ref count of b = 1
decrement ref count of b = 0
ref count of a was decremented to = 0
value of a is still available: 3

One needs to be very careful in the case where a K object is passed as an argument to the k function but is still required within the C program afterwards. This is because the object's reference count will be decremented when the k function returns. If the object's reference count was zero prior to being passed, the object will be destroyed. This behavior is seen clearly in the example below, where the integer value held in the K object a is no longer available after a is passed as an argument to k.

C-API for kdb+ 98 | Page

Technical Whitepaper

/* File name: refcountk.c */ #include "common.h"

int main() {
  I handle;
  I portnumber= 12345;
  S hostname= "localhost";
  S usernamePassword= "kdb:pass";
  K response, a;

  // NOTE: connection must be opened before creating K objects
  handle= khpu(hostname, portnumber, usernamePassword);
  if(!handleOk(handle)) return EXIT_FAILURE;

a= ki(5); printf("ref count of a is %d\n", a->r); printf("int value of a is %d\n", a->i);

response= k(handle, "{x+3i}", a, (K) 0); // it is illegal to access a after this call as object has been destroyed

if(isRemoteErr(response )) { kclose(handle); return EXIT_FAILURE; } printf("response value is %d\n", response->i);

r0(response); kclose(handle); return EXIT_SUCCESS; }

Compile and Run

$gcc -DKXVER=3 -o refcountk refcountk.c c.o -lpthread
$./refcountk
ref count of a is 0
int value of a is 5
response value is 8

To ensure a K object is not destroyed as a result of this behavior, simply increment the object's reference count, using r1 as described, prior to calling k. In addition to the aforementioned memory management functions, there are also a number of functions which are required for multi-threaded program execution, such as m9 and setm. Detailed usage examples of these are outside the scope of this paper; however, information can be found on the code.kx reference page, see http://code.kx.com/q/interfaces/c-client-for-q/#managing-memory-and-reference-counting
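A minimal sketch of this pattern, based on the refcountk.c example above: incrementing the reference count before the call means a remains valid after k returns.

a= ki(5);                                   // a created with ref count 0
r1(a);                                      // bump ref count to 1 before handing a to k
response= k(handle, "{x+3i}", a, (K) 0);    // k decrements the count back to 0 on return
printf("a is still usable: %d\n", a->i);    // a remains valid here
r0(a);                                      // release a when no longer needed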

5.8 Shared C libraries for kdb+

In kdb+, it is possible to dynamically load shared libraries to extend functionality at run time. Being able to create libraries and taking advantage of pre-existing libraries is a good way to accelerate development and enable multiple applications to share common modules. To create user-defined functions in C which can be loaded and used by a q process, the function should be defined such that the input parameters and return value are K objects.

C-API for kdb+ 99 | Page

Technical Whitepaper

K myfunc(K param1, ..., K paramN) {
  K result;

  return result;
}

5.8.1 Loading a shared library using 2:

The 2: function is used to dynamically import C functions from a shared library into a q process. It is a dyadic function where arguments can be passed using either the postfix or infix notation as shown below.

2:[ argumentOne ; argumentTwo ]

or

argumentOne 2: argumentTwo

Argument one is a symbol representing the name of the shared dynamic library (excluding the .so extension), which should be in the same location as your q executable. Argument two is a two-element list. The first element is a symbol corresponding to the function name, and the second is the integer number of arguments that function accepts. For a shared library, myLib.so, and a single-argument library function, myfunc, loading would look as follows.

q)f:`myLib 2:(`myfunc;1)

// If the shared library is not located in the same folder
// as the q executable, it can be loaded by giving the explicit path
q)f:`:/path/to/myLib 2:(`myfunc;1)

5.8.2 User defined function - Dot product

Below is an example of a user-defined function to compute the dot product of two vectors. The function first performs data type and length checks on the arguments passed before continuing to the calculation. If these checks fail, the krr function can be used to return the error to the calling q process.

C-API for kdb+ 100 | Page

Technical Whitepaper

/* File name: mathLib.c */ #include "k.h"

K dotProduct(K x, K y) {
  J i;
  double result= 0.0;

  // Check data types are consistent
  if(x->t != KF || y->t != KF) { return krr("type"); }

  // Check vector lengths are equal
  if(x->n != y->n) { return krr("length"); }

for(i= 0; i < x->n; i++) { result+= kF(x)[i] * kF(y)[i]; }

return kf(result); }

Below, the mathLib.c file is compiled into a shared library and subsequently loaded into a q session. Two simple lists, a and b, of type float are created and the dot product computed using the loaded function. The result can be verified using the built-in kdb+ dot product operator, $.

Compile and Run

$gcc -DKXVER=3 -c -Wall -Werror -fpic mathLib.c -lpthread
$gcc -DKXVER=3 -shared -o mathLib.so mathLib.o -lpthread
$q
q)dotProduct:`mathLib 2:(`dotProduct;2)
q)a:5?10.0
q)b:5?10.0
q)a
3.927524 5.170911 5.159796 4.066642 1.780839
q)b
3.017723 7.85033 5.347096 7.111716 4.11597
q)a$b
116.2861
q)dotProduct[a;b]
116.2861
q)b:6?10.0
q)dotProduct[a;b]
'length
  [0]  dotProduct[a;b]
       ^
q)b:5?10
q)dotProduct[a;b]
'type
  [0]  dotProduct[a;b]
       ^
q)

C-API for kdb+ 101 | Page Technical Whitepaper

5.8.3 Hashing algorithms - md5 and sha256

The following example demonstrates how to create a shared library, cryptoLib.so, which makes use of the open-source OpenSSL library. OpenSSL is an open-source project which contains general purpose cryptographic tools, used below. The first user-defined C function, md5, uses the library to compute the md5 hash (https://en.wikipedia.org/wiki/MD5) of a given string input. This function is already implemented natively in kdb+ and so can be used to validate the function's operation. The second C function, sha256, uses the library to compute the more secure hashing algorithm, https://en.wikipedia.org/wiki/SHA-2, which is not native to kdb+.

/* File name: cryptoLib.c */
#include <openssl/md5.h>
#include <openssl/sha.h>
#include "k.h"

K md5(K inputString) { K outputHash; if(KC != inputString->t) return krr("type"); outputHash= ktn(KG, MD5_DIGEST_LENGTH); MD5(kC(inputString), inputString->n, kG(outputHash)); return outputHash; }

K sha256(K inputString) { K outputHash; if(KC != inputString->t) return krr("type"); outputHash= ktn(KG, SHA256_DIGEST_LENGTH); SHA256(kC(inputString), inputString->n, kG(outputHash)); return outputHash; }

K sha512(K inputString) { K outputHash; if(KC != inputString->t) return krr("type");

outputHash= ktn(KG, SHA512_DIGEST_LENGTH); SHA512(kC(inputString), inputString->n, kG(outputHash)); return outputHash; }

C-API for kdb+ 102 | Page Technical Whitepaper

Compile and Run

$gcc -DKXVER=3 -c -fpic cryptoLib.c -lpthread -lcrypto -lssl
$gcc -DKXVER=3 -shared -o cryptoLib.so cryptoLib.o -lpthread -lcrypto -lssl
$q
q)cmd5:`cryptoLib 2:(`md5;1)
q)cmd5"kx tech"
0x78fb4fda39a5b41ec9ffb5fdbdd83940
q)md5"kx tech"
0x78fb4fda39a5b41ec9ffb5fdbdd83940
q)csha256:`cryptoLib 2:(`sha256;1)
q)csha256"kx tech"
0xa92b19579d02ed9ebde54951d3....

5.9 Subscribing to a kdb+ tickerplant

A kdb+ tickerplant is a q process specifically designed to handle incoming, high-frequency, data feeds from publishing processes. The primary responsibility of the tickerplant is to manage subscription requests and publish data quickly to its subscribers. In the vanilla kdb+ setup, illustrated below, the Real-time database (RDB) and chained tickerplant q processes are the most common type of subscriber, however, C applications are also possible using the API.

[Figure: vanilla kdb+ tick setup - tickerplant publishing to subscribers such as the real-time database (RDB), chained tickerplants and the historical database]

For further information regarding the above vanilla setup, please refer to the following whitepaper devoted to the subject, http://code.kx.com/q/wp/building_real_time_tick_subscribers.pdf.

C-API for kdb+ 103 | Page Technical Whitepaper

5.9.1 Test tickerplant and feedhandler setup

To facilitate testing of a C subscriber process, the following kdb+ tickerplant will be used, https://github.com/KxSystems/kdb-tick/blob/master/tick.q

For the sake of demonstration, the publisher will be configured to send a mock market data feed in the form of trade records to the tickerplant. The trade table schema to be used is defined below.

/* File name: trade.q */
/* File should be stored in the tick folder */
trade:([] time:`timespan$(); sym:`symbol$(); price:`float$(); size:`long$() )

The tickerplant process is started from the command line as follows.

$q tick.q trade C:/logs/tickerplant/ -p 5010

Above, the first argument following tick.q is the name of the table schema file to use. The second argument is the location where the tickerplant log file will be created, and the value following the -p flag is the port the tickerplant will listen on. The C process will use this port number when initializing the connection. The final step in this setup is to create a mock kdb+ feedhandler process which will act as the trade data source for the tickerplant. Below is a simple publishing process which is sufficient for the demonstration.

/* File name: feed.q */ /* Run: feed.q */

h:hopen 5010
syms:`VOD`IBM`APL

publishtrade:{[] nrow:first 1?1+til 4; h(".u.upd";`trade;(nrow#.z.N;nrow?syms;nrow?3?1.0*1+til 10;nrow?10*1+til 10)) }

.z.ts:{[] publishtrade[] }

\t 2000

Once the tickerplant and feedhandler processes are up and running the C subscriber process can connect to the tickerplant to receive the data.

5.9.2 Extracting the table schema

Subscriber processes are required to make an initial subscribe request to the tickerplant in order to receive data, see https://code.kx.com/q/cookbook/publish-subscribe/. This request involves calling the .u.sub function with two parameter arguments. The first argument is

C-API for kdb+ 104 | Page

Technical Whitepaper

the table name and the second is the list of symbols to subscribe to. Specifying a backtick character for either parameter of .u.sub means "all", as in all tables and/or all symbols. If the .u.sub function is called synchronously, the tickerplant will return the table schema. The following program demonstrates how the initial subscribe request can be made and the column names of the table schema extracted. In the case below, a request is made for all trade records.

/* File name: schema.c */ #include "common.h" int main() { J i; I handle; I portnumber= 5010; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K response, table, columnNames;

handle= khpu(hostname, portnumber, usernamePassword);
if(!handleOk(handle)) return EXIT_FAILURE;

response= k(handle, ".u.sub[`trade;`]", (K) 0);
if(isRemoteErr(response)) { kclose(handle); return EXIT_FAILURE; }

// .u.sub returns a two element list
// containing the table name and schema
// q)h:hopen 5010
// q)h".u.sub[`trade;`]"
// `trade
// +`time`sym`price`size!(();();();())
if(response->t != 0 || response->n != 2 || kK(response)[0]->t != -KS || kK(response)[1]->t != XT) {
  fprintf(stderr, "Subscription response is of unknown shape. Top level type is %d\n", response->t);
  r0(response);
  kclose(handle);
  return EXIT_FAILURE;
}

printf("Number of elements returned is %lld\n", response->n);
printf("Table name: %s\n", kK(response)[0]->s);

table= kK(response)[1]->k; columnNames= kK(table)[0];

printf("Num colNames: %lld\n", columnNames- >n); for(i= 0; i < columnNames->n; i++) { printf("Column %lld is named %s\n", i, kS(columnNames)[i]); }

r0(response); kclose(handle); return EXIT_SUCCESS; }

Compile and Run

$gcc -DKXVER=3 -o schema schema.c c.o -lpthread
$./schema
Number of elements returned is 2
Table name: trade
Num colNames: 4
Column 0 is named time
Column 1 is named sym
Column 2 is named price
Column 3 is named size

C-API for kdb+ 105 | Page

Technical Whitepaper

5.9.3 Subscribing to a tickerplant data feed

Having received the table schema in the previous step, the next action is to listen for incoming data from the tickerplant. In the subsequent code snippet, after the initial subscription request is made, the program enters a repeating while loop in which the k function is called to check if data has been received. Once received, the data can be processed. In this example, the processing function simply prints the incoming trade records to stdout.

/* File name: subscriber.c */ #include "common.h" #include "time.h"

void printTime(J t) { time_t timval= t / 1000000000; struct tm *timeInfo= localtime(&timval);

printf("%dD%02d:%02d:%02d.%09lld ", timeInfo->tm_yday, timeInfo- >tm_hour, timeInfo->tm_min, timeInfo->tm_sec, t % 1000000000); }

I shapeOfTrade(K x, K tableName) {
  K columns;
  // check that we received a 3 element list (`upd;`trade;table)
  if(x->t != 0 || x->n != 3) return 0;
  // check that the second element is a table name
  if(kK(x)[1]->t != -KS || kK(x)[1]->s != tableName->s) return 0;
  // check if the last element is a table
  if(kK(x)[2]->t != XT) return 0;
  // check that the number of columns >= 4
  columns= kK(kK(x)[2]->k)[0];
  if(columns->n < 4) return 0;
  // you can add more checks here to ensure that types are as expected
  // likely trade update
  return 1;
}

int main() { J i; I handle; I portnumber= 5010; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K response, table, tableName, columnNames, columnValues;

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE; tableName= ks("trade"); response= k(handle, ".u.sub", r1(tableName), ks(""), (K) 0); if(isRemoteErr(response)) { r0(tableName); kclose(handle); return EXIT_FAILURE; } r0(response); while(1) { response= k(handle, (S) 0); if(!response) break;

if(shapeOfTrade(response, tableName)) { table= kK(response)[2]->k; columnNames= kK(table)[0]; columnValues= kK(table)[1];

C-API for kdb+ 106 | Page

Technical Whitepaper

printf(" %s %s %s %s\n", kS(columnNames)[0], kS(columnNames)[1], kS(columnNames)[2], kS(columnNames)[3]); printf(" ------\n");

for(i= 0; i < kK(columnValues)[0]->n; i++) { printTime(kJ(kK(columnValues)[0])[i]); printf("%s ", kS(kK(columnValues)[1])[i]); printf("%lf ", kF(kK(columnValues)[2])[i]); printf("%lld \n", kJ(kK(columnValues)[3])[i]); } } r0(response); } r0(tableName); kclose(handle); return EXIT_SUCCESS; }

Compile and Run

$gcc -DKXVER=3 -o subscriber subscriber.c c.o -lpthread
$./subscriber
 time sym price size
 ----------------------------------------
0D20:43:58.044653000 APL 4.000000 20
0D20:44:00.045138000 IBM 7.000000 20
0D20:44:00.045138000 VOD 1.000000 90
 time sym price size
 ----------------------------------------
0D20:44:04.044732000 VOD 1.000000 90
0D20:44:04.044732000 APL 1.000000 100

5.10 Publishing to a kdb+ tickerplant

As mentioned earlier, kdb+ tickerplants are designed to handle data feeds from publishing processes. Some examples of data publishers include:
• Market data feedhandler processes for financial data and algorithmic trading
• Smart sensor data feeds from Internet of Things (IoT) devices
• Satellite imaging data for geospatial analytics

The next example will demonstrate the steps required to publish data to a kdb+ tickerplant, continuing with the use case of market trade data. To do so, the publisher will use the k function as described previously to send trade records. The k function will be passed four arguments.

1. Handle value - The integer value returned by the khpu function when a socket connection is established.

2. Update function name (.u.upd) - The function executed on the tickerplant which enables the data insertion. It may be called synchronously or asynchronously and takes two arguments, (3) and (4).
3. Table name - The name of the table the data should be inserted into. The table name should be passed as a symbol.
4. Data - The data which will form the rows to be appended to the table. This argument is typically passed as a mixed list object.

C-API for kdb+ 107 | Page

Technical Whitepaper

I handle;
K mixedList;
K tableName = ks((S) "tablename");

// To continue using the K object tableName after the k call, increment its reference count
k(handle, ".u.upd", r1(tableName), mixedList, (K) 0);

5.10.1 Publishing a single row using a mixed list object

The next example shows how a K object containing a mixed list, corresponding to one row of data, can be passed to the .u.upd function. Below, the knk function is used to create the mixed list containing a symbol, float and integer, constituting a single row. The function .u.upd is called via the k function with two parameters being passed: a symbol object corresponding to the table name, and the singleRow object.

/* File name: singleRow.c */ #include "common.h" int main() { I handle; I portnumber= 5010; S hostname= "localhost"; S usernamePassword= "kdb:pass"; K result, singleRow; handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE; singleRow= knk(3, ks((S) "ABC"), kf(10.0), kj(20));

// Perform single row insert, tickerplant will add the timestamp column itself
result= k(handle, ".u.upd", ks((S) "trade"), singleRow, (K) 0);
if(isRemoteErr(result)) { kclose(handle); return EXIT_FAILURE; }
r0(result);
kclose(handle);
return EXIT_SUCCESS; }

5.10.2 Publishing multiple rows using a mixed list object

For performance-related reasons, previously highlighted in the following whitepaper, http://code.kx.com/q/wp/kdbtick_profiling_for_throughput_optimization.pdf, it is recommended to publish multiple rows at once so that bulk inserts can be performed on the tickerplant, maximizing tickerplant efficiency.

"Feeds should read as many messages off the socket as possible and send bulk updates to the tickerplant if possible. Bulk updates will greatly increase the maximum throughput achievable."

The next example demonstrates how such bulk inserts can be performed using a single k function call. Again the knk function is used to create a mixed list object, however this time each element in the list is itself a simple list object.

C-API for kdb+ 108 | Page

Technical Whitepaper

/* File name: multiRow.c */ #include "common.h"

int main() { int i, n= 3; I handle; I portnumber= 5010; S hostname= "localhost"; S usernamePassword= "kdb:pass"; S symbols[]= { "ABC", "DEF", "GHI" }; K result;

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE;

K multipleRow= knk(3, ktn(KS, n), ktn(KF, n), ktn(KJ, n)); for(i= 0; i < n; i++) { kS(kK(multipleRow)[0])[i]= ss(symbols[i % n]); kF(kK(multipleRow)[1])[i]= 10.0 * i; kJ(kK(multipleRow)[2])[i]= i; }

// Perform multiple row insert, tickerplant will add the timestamp column itself
result= k(handle, ".u.upd", ks((S) "trade"), multipleRow, (K) 0);
if(isRemoteErr(result)) { kclose(handle); return EXIT_FAILURE; }

r0(result); kclose(handle); return EXIT_SUCCESS; }

5.10.3 Adding a timespan column

In the previous example, the tickerplant added a timespan column to each row received because one was not already present. This is standard tickerplant functionality and is used to record when the data was received by the tickerplant. Instead, it is possible for the publisher to create the timespan column itself, preventing the tickerplant from adding one. The example below shows how the previous code can be extended to include a timespan column in the mixedList object sent to the tickerplant.

C-API for kdb+ 109 | Page

Technical Whitepaper

/* File name: rowswithtime.c */
#include <time.h>
#include "common.h"

J castTime(struct tm *x) { return (J)((60 * x->tm_hour + x->tm_min) * 60 + x->tm_sec) * 1000000000; }

int main() { J i, n= 3; I handle; I portnumber= 5010; S hostname= "localhost"; K result; S usernamePassword= "kdb:pass"; S symbols[]= { "ABC", "DEF", "GHI" }; time_t currentTime; struct tm *ct;

handle= khpu(hostname, portnumber, usernamePassword); if(!handleOk(handle)) return EXIT_FAILURE;

K multipleRow= knk(4, ktn(KN, n), ktn(KS, n), ktn(KF, n), ktn(KJ, n)); time(&currentTime); ct= localtime(&currentTime);

for(i= 0; i < n; i++) { kJ(kK(multipleRow)[0])[i]= castTime(ct); kS(kK(multipleRow)[1])[i]= ss(symbols[i % n]); kF(kK(multipleRow)[2])[i]= 10.0 * i; kJ(kK(multipleRow)[3])[i]= i; }

result= k(handle, ".u.upd", ks((S) "trade"), multipleRow, (K) 0); if(isRemoteErr(result)) { kclose(handle); return EXIT_FAILURE; }

r0(result); kclose(handle); return EXIT_SUCCESS; }

5.11 Conclusion

This document covered multiple aspects of the C API for connecting to the Kx Systems kdb+ database. Topics covered included the creation of socket connections, execution of queries, error handling, memory management, and the creation and extraction of data from K objects such as lists, dictionaries and tables. Practical examples formed the basis for the construction of a C subscriber process, capable of consuming a kdb+ data feed, and a feedhandler process designed to publish data to a kdb+ tickerplant. Finally, the use of shared C libraries to extend the functionality of kdb+ was also demonstrated.

C-API for kdb+ 110 | Page

Technical Whitepaper

6 Efficient Use of Adverbs

Author:

Conor Slattery is a Financial Engineer who has designed kdb+ applications for a range of asset classes. Conor is currently working with a New York based investment firm, developing kdb+ trading platforms for the US equity markets.

Technical Whitepaper

CONTENTS Efficient Use of Adverbs ...... 112 6.1 INTRODUCTION ...... 114 6.2 BASIC USE OF ADVERBS WITH FUNCTIONS ...... 115 6.2.1 Monadic functions ...... 115 6.2.1.1 Each-Both ...... 115 6.2.1.2 Each-Prior...... 116 6.2.1.3 Over ...... 116 6.2.1.4 Scan ...... 116 6.2.2 Dyadic functions ...... 116 6.2.2.1 Each-Both ...... 116 6.2.2.2 Each-Prior...... 117 6.2.2.3 Each-Right ...... 118 6.2.2.4 Each-Left ...... 119 6.2.2.5 Over ...... 119 6.2.2.6 Scan ...... 120 6.2.3 Higher valence functions ...... 120 6.2.3.1 Each-Both ...... 120 6.2.3.2 Over ...... 120 6.3 COMBINING ADVERBS ...... 122 6.4 USING ADVERBS FOR RECURSION ...... 125 6.5 ADVERBS VS. LOOPS ...... 127 6.6 DEALING WITH NESTED COLUMNS ...... 129 6.7 CONCLUSION ...... 132

Efficient Use of Adverbs 113 | Page

Technical Whitepaper

6.1 INTRODUCTION

In addition to the large number of built in functions and the ability to create your own functions quickly and easily, kdb+ provides adverbs which can alter function and verb behavior to improve efficiency and keep code concise. Employing adverbs correctly can bypass the need for multiple loops and conditionals with significant performance enhancements.

This whitepaper provides an introduction to the basic use of the different adverbs available in kdb+ along with examples of how they differ when applied to monadic, dyadic and higher valence functions. It also covers how adverbs can be combined to further extend the functionality of the built in functions. Common use cases, such as using adverbs for recursion and using adverbs to modify nested columns in a table, are looked at in more detail. These examples provide solutions to common problems that are encountered when building systems in kdb+ and demonstrate the range of situations where adverbs can be used to achieve the desired result.

All tests were run using kdb+ version 3.1 (2013.08.09)

NOTE: Functions in kdb+ can be one of four basic types. These are lambdas (type 100h), primitives (types 101h-103h), projections (type 104h) and compositions (type 105h). In general, the effects of an adverb on a function will depend only on the number and type of the parameters passed to it, not on the type of the function itself. For this reason, we will not distinguish between the four types of functions listed above when talking about adverbs.
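As a quick illustration, the type of each kind can be checked directly in a q session. The lambda and projection shown are arbitrary examples:

q) type {x+1}        / lambda
100h
q) type (+)          / primitive operator
102h
q) type {x+y}[1]     / projection
104h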

Efficient Use of Adverbs 114 | Page

Technical Whitepaper

6.2 BASIC USE OF ADVERBS WITH FUNCTIONS

There are six different adverbs, each of which will either modify the functionality of a function, modify the way a function is applied over its parameters or in some cases make no changes at all. Understanding the basic behavior of each adverb and how this behavior varies based on the valence of the underlying function is key for both writing and debugging q code. Note that some adverbs may not be defined for certain types of functions e.g. each-right and each-left will return an error if applied to monadic functions.

Symbol   Name
'        Each-both
':       Each-prior
/        Over
\        Scan
/:       Each-right
\:       Each-left

Table 1: List of adverbs in kdb+

6.2.1 Monadic functions

6.2.1.1 Each-Both The each-both adverb, when used with a monadic function will apply the function to each element of a list.

q) type'[(1;2h;3.2)]
-7 -5 -9h

If the parameter is atomic, then the each-both adverb will have no effect on the function.
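For instance, applied to an atom the result is the same as calling the function directly:

q) type'[3]
-7h
q) type 3
-7h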

The same behavior described in the above example can also be achieved using the each function. This is a dyadic function which will take the left argument, in this case the type function, and apply it to each element in the right argument, the list.

q) parse "each" k){x'y}

q) type each (1;2h;3.2) -7 -5 -9h

Efficient Use of Adverbs 115 | Page

Technical Whitepaper

6.2.1.2 Each-Prior The each-prior adverb with a monadic function is used to apply the function to each element of a list, using slave threads when available. Slave threads can be set using the –s command line parameter. In the event that no slave threads are set each-prior behaves identically to the each function. This adverb is used in the definition of the peach function.

q) parse "peach" k){x':y}

q) type peach (3;4.3;2h) -7 -9 -5h

6.2.1.3 Over The over adverb, when used with a monadic function will apply the function recursively i.e. the result of the function is calculated repeatedly with the result of one iteration being used as the parameter of the next iteration. This is covered in more detail in Section 3.

q) {2*x}/[10;2]
2048

6.2.1.4 Scan The scan adverb is the same as the over adverb, but it will return the result of every iteration, instead of just the result of the final iteration. This is covered in more detail in Section 3.

q) {2*x}\[10;2]
2 4 8 16 32 64 128 256 512 1024 2048

6.2.2 Dyadic functions

6.2.2.1 Each-Both The each-both adverb, when used with a dyadic function, will apply the function to each element of the two arguments passed to the function. One, or both of the parameters passed to the function may be atoms. If both arguments are atoms then the adverb will have no effect.

q) 1 ~' 1
1b

Efficient Use of Adverbs 116 | Page

Technical Whitepaper

In the case that one of the arguments is an atom and the other is a list then the atom will act as if it is a list of the same length as the non-atom argument.

q) 1 ~' 1 2 3
100b

If both parameters are lists then they must be of the same length.

q) 1 2 3 in' (1 2 3;3 4 5)
'length

q) 1 2 3 in' (1 2 3;3 4 5;5 6 7)
100b

6.2.2.2 Each-Prior The each-prior adverb, when used with a dyadic function, will apply the function to each element of a list and the previous element. It is equivalent to taking each adjacent pair in the list and applying the dyadic function to each of them. The result will be a list of the same length as the original list passed to the function. A common use of this is in the deltas function.

q) parse "deltas"
-':

q) deltas 4 8 3 2 2
4 4 -5 -1 0

It can also be useful in tracking down errors within lists which should be identical, e.g. the .d files for a table in a partitioned database. The example below uses the differ function to check for inconsistencies in .d files. differ uses the each-prior adverb and is equivalent to not ~':

q) parse "differ"
~~':

q) 1_date where differ {get hsym `$"/mydb/",string[x],"/trade/.d"} each date
2013.05.03 2013.05.04

In this case the values of the .d files are extracted from each partition. The differ function, which uses the each-prior adverb, is then used to compare each element in the list pair-wise. If a .d file is different to the previous .d file in the list, then that date will be returned by the above statement. The first date returned is dropped, as the first element of the list will be compared to the -1th element of

Efficient Use of Adverbs 117 | Page

Technical Whitepaper

the list, which is always null, and so they will never match. For the above example the .d files for the 2013.05.03 and 2013.05.04 partitions are different, and should be investigated further.

The prior function uses this adverb and can be used to achieve the same functionality. The example below will return all adjacent pairs of a given list. Note that the first element of the first list in the result will be null.

q) parse "prior"
k){x':y}

q) {y,x}prior til 5
0N 0
0 1
1 2
2 3
3 4

6.2.2.3 Each-Right The each-right adverb will take a dyadic function and apply the left argument to each element of the right argument. If the left argument is an atom then the each-both adverb will produce the same result as the each-right adverb. However, if the left argument is a list then the each-right adverb must be used.

q) "ab",'("cde";"fgh";"ijk") 'Length

q) "ab",/:("cde";"fgh";"ijk") "abcde" "abfgh" "abijk"

A useful example which involves this adverb is finding the file handle of each column of a table.

q) `:/mydb/2013.05.01/trade,/:key[`:/mydb/2013.05.01/trade]except `.d
`:/mydb/2013.05.01/trade`sym
`:/mydb/2013.05.01/trade`time
`:/mydb/2013.05.01/trade`price
`:/mydb/2013.05.01/trade`size
`:/mydb/2013.05.01/trade`ex

The above statement joins the file handle of the table to each element in the list of columns, creating five lists of length two. The each-right adverb can then be used with the inbuilt kdb+ sv function to create the file handles of each column.

Efficient Use of Adverbs 118 | Page

Technical Whitepaper

q)` sv/: `:/mydb/2013.05.01/trade,/:key[`:/mydb/2013.05.01/trade]except `.d
`:/mydb/2013.05.01/trade/sym
`:/mydb/2013.05.01/trade/time
`:/mydb/2013.05.01/trade/price
`:/mydb/2013.05.01/trade/size
`:/mydb/2013.05.01/trade/ex

6.2.2.4 Each-Left The each-left adverb behaves the same way as the each-right adverb, except it applies the right argument to each element of the left argument.

q) ("cde";"fgh";"ijk"),\:"ab" "cdeab" "fghab" "ijkab"

6.2.2.5 Over A dyadic function with an over adverb applied to it can be passed one or two arguments. With one list argument, the first and second elements of the list are passed to the function. The function is then called with the result of the previous iteration as the first parameter and the third element of the original argument list as the second parameter. The process continues in this way for the remaining elements of the argument list.

q) */[7 6 5 4 3 2 1]
5040

With two arguments, the second one being a list, the function is called with the left argument as its first parameter and the first element of the right argument as the second parameter. Next, the function is called with the result of the previous iteration as the first parameter and the second element of the right argument as the second parameter. The process continues in this way for the remaining elements of the list.

q) {ssr[x] . y}/["hello word." ;(("h";"H") ;(".";"!");("rd" ;"rld"))] "Hello world!"

The over function uses this adverb and can be used to achieve the same functionality.

Efficient Use of Adverbs 119 | Page

Technical Whitepaper

q) parse "over"
k){x/y}[k){$["\\"=*x;(system;1_x);-5!x]}]

q) {x+y} over (1 2 3 4)
10

6.2.2.6 Scan The scan adverb behaves the same way as the over adverb, but it will return the results of each iteration.

q) *\[7 6 5 4 3 2 1]
7 42 210 840 2520 5040 5040

q) {ssr[x] . y}\["hello word.";(("h";"H");(".";"!");("rd";"rld"))]
"Hello word."
"Hello word!"
"Hello world!"

The scan function uses this adverb and can be used to achieve the same functionality.

q) parse "scan"
k){x\y}[k){$["\\"=*x;(system;1_x);-5!x]}]

q) {x+y} scan (1 2 3 4)
1 3 6 10

6.2.3 Higher valence functions

6.2.3.1 Each-Both

q) 1 2 3 in' (1 2 3;3 4 5;5 6 7)
100b

The each-both adverb for higher valence functions has the same behavior as the each-both adverb for dyadic functions. Again, all parameters have to be either atoms or lists of uniform length.

6.2.3.2 Over The over adverb can also be used on higher valence functions.

q) ({x+y+z}/)[1;2 3 4;5 6 7]
28

Efficient Use of Adverbs 120 | Page

Technical Whitepaper

In the above example the first iteration will use 1 as the x parameter, 2 as the y parameter and 5 as the z parameter. The result of this iteration will then be passed as the x parameter, with 3 as the y parameter and 6 as the z parameter. The process continues in this manner through the remaining elements of the lists.

For this to work, the 2nd and 3rd arguments in the example above must be either atoms or lists of equal length. If one of the parameters is an atom then that value is used in each iteration. For example, the following two statements are equivalent

q) ({x+y+z}/)[1;2 3 4;5]
q) ({x+y+z}/)[1;2 3 4;5 5 5]
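Their equivalence is easy to check directly, since both evaluate to 25:

q) ({x+y+z}/)[1;2 3 4;5]~({x+y+z}/)[1;2 3 4;5 5 5]
1b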

Since the release of kdb+3.1 (2013.07.07), the exponential moving average of a list can be calculated using the scan adverb. While it was possible to define an exponential moving average function prior to this release, using the new syntax will result in faster execution times.

//Function defined using the old syntax
q) ema_old:{{z+x*y}\[first y;1-x;x*y]}

//Function defined using the new syntax
//Will only work in kdb+3.1 2013.07.07 and later releases
q) ema_new:{first[y](1-x)\x*y}

q) t:til 10
q) ema_new[0.1;t]
0 0.1 0.29 0.561 0.9049 1.31441 1.782959 2.304672 2.874205 3.486784

//Functions produce the same results but ema_new is significantly faster
q) ema_old[0.1;t]~ema_new[0.1;t]
1b
q) t2:til 1000000
q) \t ema_old[0.1;t2]
421
q) \t ema_new[0.1;t2]
31

Efficient Use of Adverbs 121 | Page

Technical Whitepaper

6.3 COMBINING ADVERBS

Multiple adverbs can be used within the same statement, or even applied to the same function, in order to achieve a result which cannot be obtained using only one adverb. In this section, we will take a look at some of the most commonly used and useful examples. While the results produced by these examples might seem confusing initially, by taking each adverb in order and applying it to its monadic, dyadic or higher valence function we can see that the rules outlined in section 6.2 are still being followed.

The example below uses both the each-prior adverb and the scan adverb to return the first n rows of Pascal’s Triangle.

q) pascal:{[numRows] fn:{(+':) x,0}; fn\[numRows;1] };
q) pascal[7]
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1

In order to understand what is happening here first look at the definition of the function fn. Here the each-prior adverb is applied to the plus operator. This means that this function will return the sum of all adjacent pairs in the argument passed to it. Zero is appended to the end of the argument in order to maintain the final one in each row of the final result.

The scan adverb is applied to the monadic function fn in order to use the results of one iteration of this function as the argument of the next iteration. After numRows iterations the result of each iteration, along with the initial argument passed to fn, will be returned.

A commonly used example of applying multiple adverbs to a function is illustrated in the following piece of code ,/:\:, which will return all possible combinations of two lists by making use of the each-left and each-right adverbs and applying them to the join “,” function. The order of the adverbs will affect the result.

Efficient Use of Adverbs 122 | Page

Technical Whitepaper

For example:

q) raze (1 2 3),/:\:4 5 6
1 4
1 5
1 6
2 4
2 5
2 6
3 4
3 5
3 6

q) raze (1 2 3),\:/:4 5 6
1 4
2 4
3 4
1 5
2 5
3 5
1 6
2 6
3 6

To get an idea of how q handles the above statement, note that the following are equivalent:

q) (raze (1 2 3),/:\:4 5 6)~(1,/:4 5 6),(2,/:4 5 6),(3,/:4 5 6)
1b

q) (raze (1 2 3),\:/:4 5 6)~(1 2 3,\:4),(1 2 3,\:5),(1 2 3,\:6)
1b

Another example of joining adverbs is ,//. This will flatten a nested list repeatedly until it cannot be flattened any more. The two over adverbs in this example are doing different things. The inner over is applied to the dyadic join function, so it joins the first element of the list passed to it to the second, then joins the result to the third element and continues through the remaining elements of the list. (Note that ,/ acts in the same way as the raze function)

Efficient Use of Adverbs 123 | Page

Technical Whitepaper

q) ,/[(1 2 3;(4 5;6);(7;8;(9;10;11)))]
1
2
3
4 5
6
7
8
9 10 11

q) raze (1 2 3;(4 5;6);(7;8;(9;10;11)))
1
2
3
4 5
6
7
8
9 10 11

This means the ,/ is a monadic function, as it takes a single list as its argument (Like all monadic functions with over or scan applied, it is possible to pass two arguments but it is still treated like a monadic function). When the outer over is applied to this function, it will implement the ,/ function recursively until the result is the same as the input, i.e. the list cannot be flattened any more.

q) ,//[(1 2 3;(4 5;6);(7;8;(9;10;11)))]
1 2 3 4 5 6 7 8 9 10 11

The each-both adverb can also be combined with itself in order to apply your function to the required level of depth in nested lists.

q) lst:(3 2 8;(3.2;6h);("AS";4))

q) type lst
0h

q) type'[lst]
7 0 0h

q) type''[lst]
-7 -7 -7h
-9 -5h
10 -7h

q) type'''[lst]
-7 -7 -7h
-9 -5h
(-10 -10h;-7h)

Efficient Use of Adverbs 124 | Page

Technical Whitepaper

6.4 USING ADVERBS FOR RECURSION

As explained above, when the over or scan adverbs are used with monadic functions they will operate recursively.

A monadic function with an over or scan adverb may be passed one or two parameters. If two parameters are used then the first parameter is either:
1. An integer representing the number of iterations to run before stopping, or
2. A function to which the result of each iteration is passed. If the condition defined by the function in this parameter does not hold, then the scan/over terminates and the result is returned.

//Define a function to calculate a Fibonacci sequence using the over adverb
q) fib:{x,sum -2#x}/

//Call the function with an integer as the first parameter
q) fib[10;1 1]
1 1 2 3 5 8 13 21 34 55 89 144

//Call the function with a function as the first parameter
q) fib[{last[x]<200};1 1]
1 1 2 3 5 8 13 21 34 55 89 144 233

If no exit condition is supplied, the function will continue until one of the following two conditions is detected:
1. The result of the ith iteration is equal to the result of the (i-1)th iteration, or
2. The result of the ith iteration is equal to the original parameter passed to the function, within tolerance levels if floating point numbers are used.

However, certain loops will not be caught and will result in infinite loops. Consider the function defined below, which is illustrated in Fig. 1.

q) ({3.2*x*(1-x)}\)[30;0.4]

Efficient Use of Adverbs 125 | Page

Technical Whitepaper

[Figure 1: Infinite looping function example - function output value plotted against iteration number]

From Fig.1 above, it is evident that this results in a loop with period 2 (at least within floating point tolerance). However, if no exit condition is supplied then the above function will not terminate.

Note: When working with recursive functions, it may be a good idea to set the timeout in your q session via the \T command. This will cause any functions to terminate after a set number of seconds and means that infinite loops will not lock your instance indefinitely.
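For example, a five-second limit for any single evaluation can be set with:

q)\T 5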

The self function .z.s can also be used in recursion, and can allow greater flexibility than using the over or scan adverbs. For example

q) {$[0h=type x;.z.s'[x];10h=abs type x;upper x;x]}(`a`n;(1 2;"efd");3;("a";("fes";3.4)))

The above function will operate on a list of any structure and data types, changing strings and characters to upper case and leaving all other elements unaltered. Note that when using .z.s the function will error out with a 'stack error message after 2000 loops. This can be seen using the example below:

q) {.z.s[0N!x+1]}0

No such restriction exists when using the scan or over adverbs, and the .z.s function should only be used in cases where it is not possible to use the scan or over adverbs.

Efficient Use of Adverbs 126 | Page

Technical Whitepaper

6.5 ADVERBS VS. LOOPS

Many of the built in kdb+ operators have been overloaded to work with atoms, lists or a combination of both. For example, the plus operator can take two atoms, an atom and a list or two lists as arguments. In cases where more control over the execution is desired, or where we are working with user defined functions, either loops or adverbs can be used to achieve the desired results. In almost all cases, adverbs will allow us to create shorter code with lower latency, while also avoiding the creation of unnecessary global variables.

Often the implementation is relatively easy, using each, each-left and each-right to cycle through a list and amend the elements as desired. As an example, suppose we wanted to check whether either of the integers 2 or 3 is present in each of the lists provided. This can be achieved using a while loop:

q) {i:0;a:();while[i<count x;a,:any 2 3 in x[i];i+:1];a}[(1 2 3;3 4 5;4 5 6)]
110b

q)\t:100000 {i:0;a:();while[i<count x;a,:any 2 3 in x[i];i+:1];a}[(1 2 3;3 4 5;4 5 6)]

However, by using adverbs we can create neater, more efficient code:

q) any each 2 3 in/: (1 2 3;3 4 5;4 5 6)
110b

q)\t:10000 any each 2 3 in/: (1 2 3;3 4 5;4 5 6)
374

Similarly, we can use the over adverb to easily deal with situations which would be handled by loops in other, C-like languages. For example, suppose you wanted to join a variable number of tables.


//Create a list of tables, of random length
q) tlist:{1!flip (`sym;`$"pr",x;`$"vol",x)!(`a`b`c;3?50.0;3?100)}each string til 2+rand 10

//Join the tables using a while loop
q) {a:([]sym:`a`b`c);i:0;while[i<count x;a:a lj x[i];i+:1];a}tlist

q)\t:100 {a:([]sym:`a`b`c);i:0;while[i<count x;a:a lj x[i];i+:1];a}tlist

//Join the tables using the over adverb
q) 0!(lj/)tlist
sym pr0      vol0 pr1      vol1 pr2      vol2
---------------------------------------------
a   35.2666  53   38.08624 95   1.445859 57
b   19.28851 39   6.41355  50   12.97504 24
c   23.24556 84   13.62839 19   6.89369  46

q)\t:100 0!(lj/)tlist
82


6.6 DEALING WITH NESTED COLUMNS

While it is usually best practice to avoid nested columns, there are some situations where operating on nested data is necessary or may result in lower execution time for certain queries. The main reason for this is that the function ungroup, which flattens a table containing nested columns, is computationally expensive, especially when you are only dealing with a subset of the entire table. There are also situations where storing the data in a nested structure makes more sense. For example you may want to use strings, which are lists of characters, instead of symbols, which are atoms, in order to avoid a bloated sym file. For this reason we will now take a look at using adverbs to apply functions to a table as a whole, and to apply functions within a select statement.

Adverbs can be used to examine and modify tables. In order to do this, an understanding of how tables are structured is necessary. In kdb+, a table is a list of dictionaries, which are themselves lists which may have a non-integer domain.

This means that we can apply functions to individual elements, just like any other nested list or dictionary structure. For example:

q) a:([]a:`a`b`c`d;b:1 2 3 4;c:(1 2;2 3;3 4;4 5))

q) type[a]
98h

q) type'[a]

99 99 99 99h

q) type''[a]
a   b  c
--------
-11 -7 7
-11 -7 7
-11 -7 7
-11 -7 7

We can see here that type[a] returns 98h, a table as expected. type'[a] returns the type of each element of the list a, which are dictionaries. type''[a] will find the type of each element in the range of each dictionary in a. It returns a list of dictionaries which collapses back to a table showing the type of each field in the table a.

q) distinct type''[a]

In this way, the statement can be used to ensure that all rows of the table are the same type. This is useful if your table contains nested columns, as the meta function only looks at the first row of nested columns. If the table is keyed then the function will only be applied to the non-key columns in this case.

q) a:([]a:`a`b`c`d;b:1 2 3 4;c:(1 2;2 3;3 4.;4 5))

q) meta a
c| t f a
-| -----
a| s
b| j
c| J

q) distinct type''[a]
a   b  c
--------
-11 -7 7
-11 -7 9

Looking only at the results of meta, we may conclude that the column c contains only integer lists. However, distinct type''[a] clearly shows that the column c contains lists of different types, and thus is not mappable. This is a common cause of error when writing to a splayed table.
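One possible remedy, sketched here as an assumption on my part rather than taken from the original text, is to cast the offending column to a single list type before saving, for example to float lists:

q) a:update c:"f"$'c from a
q) distinct type''[a]    / now collapses to a single row: -11 -7 9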

Dealing with nested data in a table via a select/update statement often requires the use of adverbs. In order to illustrate this, let us define a table with three columns, two of which are nested.

q) tab:([]sym:`AA`BB`CC;time:3#enlist 09:30+til 30;price:{30?100.0}each til 3)

Suppose we wanted to find the range of each row. This can be easily done by defining a range function as:

q) rng:{max[x]-min[x]}

We can then make use of this function within a select statement with an each adverb to apply the function to each row of the table.

q) select sym, rng'[price] from tab
sym price
------------
AA  96.3872
BB  95.79704
CC  98.31252


Suppose instead that we wanted to find the range of a subset of the data available in the table. One way to do this would be to ungroup the table and then find the range as follows:

q) select rng price by sym from ungroup tab where time within 09:40 09:49
sym| price
---| --------
AA | 77.67457
BB | 80.14611
CC | 67.48254

However, it is faster to index into the nested list as this avoids the costly ungroup function. First find the index of the prices which fall within our time range:

q) inx:where (exec first time from tab) within 09:40 09:49

Then use this to index into each price list and apply the rng function to the resulting prices.

q) select sym, rng'[price@\:inx] from tab
sym inx
------------
AA  77.67457
BB  80.14611
CC  67.48254

This offers a significant improvement in latency over using ungroup.

q)\t:10000 select rng price by sym from ungroup tab where time within 09:40 09:49
198

q)\t:10000 inx:where (exec first time from tab) within 09:40 09:49;select sym, rng'[price@\:inx] from tab
65

If the nested lists are not uniform the code needs to be changed to the following:

q) inx:where each (exec time from tab) within 09:40 09:49
q) select sym, rng'[price@'inx] from tab
sym inx
------------
AA  77.67457
BB  80.14611
CC  67.48254


6.7 CONCLUSION

This whitepaper provided a summary of the adverbs available in kdb+, showing how they modify the behavior of different types of functions. It showed, through the use of examples, that by looking at the valence of a function and the adverb being applied to it, the effect of the adverb on the function can be determined. Even more complicated examples using multiple adverbs can be broken down in this manner to understand their behavior.

Certain uses of adverbs, such as the creation of recursive functions and the application of adverbs to functions within select statements, were examined in more detail, as these are areas which are often poorly understood yet useful in many situations. Some common uses were also examined in order to demonstrate the ability of adverbs to reduce execution times.

The purpose of this whitepaper was to illustrate how adverbs can be used to easily extend the functionality of built in and user defined functions, allowing for code that takes full advantage of the ability of the q language to process large volumes of data quickly. Correctly using adverbs on data minimizes the amount of manipulation necessary to achieve the desired result, while also allowing for more concise code which is easier to maintain.

All tests were run using kdb+ version 3.1 (2013.08.09)


7 Multi-threading in kdb+: Performance Optimizations and Use Cases

Author:

Edward Cormack is a Financial Engineer who has worked on developing financial data solutions for a range of clients. Edward is currently based at a top-tier investment bank in London, developing a kdb+ platform for global equity risk.


CONTENTS
Multi-threading in kdb+: Performance Optimizations and Use Cases ...... 134
7.1 INTRODUCTION ...... 136
7.2 SETTING UP Q FOR PARALLEL PROCESSING ...... 137
7.2.1 Functions for parallel execution ...... 137
7.3 VECTOR OPERATIONS IN PARALLEL ...... 138
7.3.1 Vector operations and parallel cut ...... 139
7.3.2 Workload balancing ...... 140
7.4 QUERY OPTIMISATIONS USING MULTI-THREADING ...... 142
7.4.1 Overview of test data ...... 142
7.4.2 Performance effects of standard queries with slaves ...... 143
7.4.3 Map-reduce with multi-threading ...... 144
7.4.4 Parallel access and segmentation ...... 146
7.5 FILE OPERATIONS ...... 148
7.5.1 DB maintenance ...... 148
7.5.2 Writing flat files ...... 149
7.6 CONCLUSION ...... 151


7.1 INTRODUCTION

Parallel processing techniques have become increasingly prevalent in mainstream computing as a result of developments in computer architecture. Multi-core processors are now widely used in all systems from mobile devices to high-performance servers.

With this comes the need to develop applications that can make full use of parallel architecture to realise performance gains. Parallel computing techniques typically involve dividing large tasks into several parts, which can be executed simultaneously, taking less time than it would take to run all the parts sequentially. This concurrent operation shares the load, using the full resources of the system to optimise performance.

This whitepaper intends to introduce the methods available for parallel processing in kdb+. We will demonstrate a number of cases where parallel techniques can be employed and discuss the potential performance enhancements.

In many cases, operations on large vector datasets, which are typical of those processed and analysed by kdb+, are easily parallelised. Where this is the case, complex operations on long lists can be broken up and executed on multiple threads.

We will also address use cases where parallel processing may not be appropriate in kdb+ and some of the important factors to consider in each case.

The OS disk cache is flushed when required using the io.q script which can be obtained from the Kx code repository ( http://code.kx.com/wsvn/code/contrib/simon/io/io.q ).

All tests performed using kdb+ version 3.1 (2013.06.25)


7.2 SETTING UP Q FOR PARALLEL PROCESSING

Kdb+ starts in single threaded mode by default. This ensures data consistency and that there are no race conditions as all commands are executed in the order they are received. Multi-threading in kdb+ is achieved by spawning multiple slave threads.

To perform parallel operations in kdb+, the q process is started up using the -s n option. When called with a value of n>1, kdb+ will start n slave threads. For optimal performance when using multi-threading, the number of slaves should be equal to or greater than the number of available cores. One should take into consideration the overall usage of the host when determining the number of slaves to allocate.

Parallel execution is explicitly invoked by using two built-in functions: peach and .Q.fc. Parallel execution is also implicitly invoked in kdb+ when used as part of a multi-threaded HDB. This mode of operation will be reviewed in more detail in section 4.

7.2.1 Functions for parallel execution

peach, or parallel-each is used in the same way as the each adverb. It will execute the function over multiple slaves, passing the arguments and results between the slaves and the main thread using IPC serialisation.

q) f peach x // execute function f on x over slave threads

Additionally, there is the parallel cut function .Q.fc. This modifies an existing function such that a vector argument is distributed across multiple threads. The differences between peach and .Q.fc and the various use cases are covered in more detail in section 3.

q) .Q.fc[f] z // cuts z into n parts and executes f in parallel

There are however, some limitations on what operations can be executed in parallel. In order to maintain thread safety, q utilises thread-local storage. Global variables may be read directly from the main thread but each thread maintains its own copy of local variables and arguments.

To avoid concurrency issues, only the main thread may update global variables. Attempting to set or modify a global while in a slave thread will result in a 'noupdate error. There are also restrictions on calling system commands while using peach.

q) { update ex:"N" from `trade where sym=x } peach `GOOG`AAPL`YHOO
'noupdate


7.3 VECTOR OPERATIONS IN PARALLEL

In this section, we will look in more detail at the use of peach and .Q.fc for operations on vectors. We analyse the performance of a range of use cases using \t to time execution on a process with 6 slaves for both sequential and parallel operations.

Consider the lambda demonstrated below. This will create a list of floats of the same length as the argument and raise it to a specific power, then return the sum of the result.

// Run operation 6 times, for a 1 million element list on each iteration

q)\s 6
q)\t { sum (x?1.0) xexp 1.7 } each 6#1000000
802
q)\t { sum (x?1.0) xexp 1.7 } peach 6#1000000
136

The advantage of using peach in this case is clear. Distributing the operation over all 6 threads increases the speed by a factor of approximately 5.9.

Data and results are copied between slaves and the main thread using IPC serialisation. As such, there is an associated overhead cost with transferring data which should be taken into account. If we modify the existing function so that it returns the entire vector instead of the sum then we can see this effect.

// Returning the full result is worse as the overhead is increased

q)\t { (x?1.0) xexp 1.7 } each 6#1000000
795
q)\t { (x?1.0) xexp 1.7 } peach 6#1000000
152

Now, the single threaded operation is slightly faster than before as it does not need to compute the sum. The parallel operation must now return a 1 million element vector to the main thread, and it is slower as a result. The performance scaling is now reduced to 5.2.

The parallel overhead from copying data to slave threads can be estimated by using the -8! operator to serialise the argument and then de-serialising with the -9! operator.

q)\t:100 -9!-8!1000000?1.0
732

The operation must be computationally expensive enough to justify this messaging overhead. In the case of a very simple operation, using peach can be slower than just executing sequentially.


// Peach takes longer when just doing a product

q)\t:1000 { x*1?10 } each 10?1.0
3
q)\t:1000 { x*1?10 } peach 10?1.0
18

// Increasing the complexity makes the overhead worthwhile

q)\t:100 { sqrt (1000?x) xexp 1.7 } each 10?1.0
145
q)\t:100 { sqrt (1000?x) xexp 1.7 } peach 10?1.0
33

7.3.1 Vector operations and parallel cut

Just as when operating in single-threaded mode, operations on vectors are generally more efficient. Where appropriate, it is preferable to call a single function with one long vector argument than to call the function many times; i.e. iterating over a list of atomic arguments.

// Single-threaded mode. Atomic vs. vector argument

q)a:1000000?1.0

// Taking the exponent of each value and sum the result

q)sum { x xexp 1.7 } each a
370110f
q)\t sum { x xexp 1.7 } each a
321

// Apply sum and exponent operations to the whole vector instead

q){ sum x xexp 1.7 } a
370110f
q)\t { sum x xexp 1.7 } a
132

To carry out vector operations in parallel we use .Q.fc, or parallel cut. As the name suggests, this will break a vector argument into n parts, and pass each part to a slave thread for execution.

In the worst case example below, peach is used over a long list of atomic values so that each value must be passed to the slave thread and the value returned. This maximises the data transfer overhead. Using .Q.fc is much more efficient as the function is only executed 6 times – once on each thread.


// Performance of peach is slower than single-thread each
q)\t { sum x xexp 1.7 } peach a
448

// Parallel speed-up is observed using .Q.fc
q)\t .Q.fc[{ sum x xexp 1.7 }] a
28

7.3.2 Workload balancing

Another factor to take into consideration when using multi-threading is the distribution of work across the threads. This is particularly important when the process is handling a large number of tasks of uneven size.

The way in which q splits work between threads depends on which mode of parallel execution is used. Both parallel functions use pre-assignment, but with slightly different behaviour. When using peach over 2 slaves the first slave will get arguments 0, 2, 4… and the second slave gets arguments at position 1, 3, 5… etc.

This allocation can be shown in q using the number of slaves and the modulus function, mod.

// 6 input arguments over 2 cores
q)(til 6) mod 2
0 1 0 1 0 1

// 14 input arguments over 6 cores
q)(til 14) mod 6
0 1 2 3 4 5 0 1 2 3 4 5 0 1

As discussed earlier, .Q.fc distributes work by cutting the vector argument into n slices. Each thread will then execute one slice of the original list.
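The slicing can be made visible by returning only the length of the slice each thread receives. This is a sketch of my own: with 2 slave threads a 10-element vector would be expected to split into two slices of 5, although the exact chunking may differ between kdb+ versions.

q) .Q.fc[{enlist count x}] til 10
5 5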

To demonstrate the effect of uneven distribution we define a function f with an execution time proportional to the input argument.

q)f: { x:7000*x;sum sqrt (x?1.0) xexp 1.7;};
q)\t f[100]
102
q)\t f[1000]
1053

A q process is now started up with 2 slaves; this makes it clearer to follow the operation. The test function is executed over a list of varying input arguments. We demonstrate with the examples below that the ordering of the large and small arguments can have a significant effect on performance.


// Single threaded
q)\t f each 10 10 10 10 1000 1000 1000 1000
4162
// Single threaded unbalanced
q)\t f each 10 1000 10 1000 10 1000 10 1000
4163
// Multi threaded unbalanced
q)\t f peach 10 1000 10 1000 10 1000 10 1000
4131
// Multi threaded balanced
q)\t f peach 10 10 10 10 1000 1000 1000 1000
2090

Executing in single-threaded mode, we find the total time is similar to the sum of the arguments, as we would expect. In the second case however, there is little to no improvement from using peach. This is because the fast and slow jobs alternate in the input list, which means all the slow tasks are assigned to a single thread. The main process must wait for the second thread to execute all 4 slow jobs even though the first thread is finished. The final case is a balanced distribution, with fast and slow jobs assigned evenly to each thread. Now we see a parallel speed-up factor of approximately 2 as expected.

For .Q.fc, the situation is reversed. This function splits a vector argument evenly along its length and assigns one slice to each slave. Alternating values in the list will result in balanced threads but contiguous blocks of large or small values will not.

// Unbalanced – first 4 fast arguments sent to first thread
q)\t .Q.fc[{f each x}] 10 10 10 10 1000 1000 1000 1000
4134

// Balanced distribution
q)\t .Q.fc[{f each x}] 10 1000 10 1000 10 1000 10 1000
2112

This pre-assignment can be overridden by passing a nested list to peach. If you wanted to send specific arguments to each thread, then these can be positioned in the input with the index of the top-level list corresponding to the thread number.

For example, the following input will execute values 0, 1, 4 from x on the first thread and 2, 3, 5 on the second.

q)x
10 10 1000 1000 10 1000
q)\t f peach x[0 1 4], x[2 3 5]
2075


7.4 QUERY OPTIMISATIONS USING MULTI-THREADING

The partitioned database structure in kdb+ is well suited to parallel processing. In a standard date partitioned DB, the data is arranged such that all data for a given day resides in a specific directory under the root. The advantage of this is that the q process is only required to read data from the partitions specified in the date constraint.

This ability to access sections of the database independently can be extended using slaves, with each slave being assigned a date from the where clause to process. Consider the following query:

q) { select from trade where date = x } peach d

We can see how peach will assign each date in the list d to a slave thread for processing. In practice, q handles multi-threaded HDB queries under the covers without the need for any additional functions. It will automatically distribute work across slaves and aggregate the results back to the main thread.

This section discusses various applications for the multi-threaded HDB and what performance improvements can be achieved.

7.4.1 Overview of test data

The test database is a standard TAQ schema containing simulated data for the S&P 500 components for one month partitioned by date. There are approximately 20,000 quotes per tick symbol per day.

The data is sorted by symbol and time. The parted attribute is applied on sym.

q)meta quote
c    | t f a
-----| -----
date | d
sym  | s   p
time | t
bid  | e
ask  | e
bsize| j
asize| j
ex   | c

q)meta trade
c    | t f a
-----| -----
date | d
sym  | s   p
time | t
price| e
size | j
ex   | c


The use of attributes to optimize datasets has been reviewed in an earlier paper from this series, Columnar Database and Query Optimization (March 2013).

In each of the test cases outlined in this section, the q session is restarted between each test and the disk cache flushed using io.q.

7.4.2 Performance effects of standard queries with slaves

If we consider a normal select from the quote table, for a range of symbols:

q) select from quote where date in d, sym in `GOOG`AAPL`YHOO`AMZN`EBAY

This query will return approximately 100,000 rows per day. The query is executed for an increasing number of days, and the performance measured. The test is then repeated on a process using additional slave threads for comparison. The results are shown in Table 1 below.

Days in query   0 slaves   6 slaves   12 slaves
                     Execution time (ms)
 5              195.8      162.5      160.5
10              384.6      328        317.9
15              595.6      507.5      477.3
20              839.8      677        624.1

Table 1: Performance of normal select with increasing days

Figure 1: Effect of number of slave threads on query time (elapsed time in ms against number of days in query, for 0, 6 and 12 slaves)


The results in Fig.1 show a gradual increase in performance as we add more slave threads. The relationship between execution time and the number of days in the query remains roughly linear; however, the factor by which it increases is reduced as slaves are added.

7.4.3 Map-reduce with multi-threading

One area in which there are significant advantages to using parallel operation is when we are performing aggregations over a partitioned HDB. For most common aggregation operations, kdb+ has a built-in map-reduce feature.

For example, if we wanted to calculate an average value over multiple days then we can calculate the sum and count for each partition. We then take the total sum over the total row count to reduce to a final value. This saves the overhead of loading the data from all partitions into memory to perform the calculation.
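As an illustration of the idea only (a hand-rolled sketch, not kdb+'s internal implementation), an average over several partitions can be assembled from per-partition partial aggregates:

// map step: per-partition (per-date) sums and counts
q) m:select s:sum price, c:count i by date from trade where date in 5#date
// reduce step: combine the partials into the overall average
q) exec sum[s]%sum c from 0!m
// equivalent to: select avg price from trade where date in 5#date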

Map-reduce can be used for the following operations: avg, cor, count, cov, dev, first, last, max, min, prd, sum, var, wavg and wsum. If kdb+ detects that a compatible operation is being used in the query then it will map the query over the partitions.

The advantage of map-reduce can be seen by comparing the following queries which calculate a VWAP over multiple days.

// Standard query form using map-reduce
q)t1:select vwap: size wavg price by sym from trade where date in 5#date

// Non map-reduced form of the same query
q)t2:select vwap:(sum size*price)%sum size by sym from select sym,size,price from trade where date in 5#date

// Both queries return the same result
q)t1 ~ t2
1b

The second query is written such that it will delay the aggregation step until after data has been loaded from all partitions. This will not be optimised by map-reduce.

Days in query   Map-reduce   Non map-reduce
                     Elapsed time (ms)
 5              692.4        963.4
10              1370.2       1923.8
15              2053         2906.6
20              2726.6       4085.6

Table 2: Comparison of using map-reduce for aggregations


Figure 2: Performance increases from using map-reduce (elapsed time in ms against number of days in query, for map-reduce (MR) and non map-reduce (NMR) queries)

It can be seen in Figure 2 that there is a significant slowdown when not using map-reduce operations. The execution time increases more rapidly with query size than the map-reduced version.

The multi-day VWAP query above is now repeated using slaves. In this case, kdb+ will distribute the map calculations to its slaves. Again, this functionality is automatic and requires no specific functions to be called.

Days in query   0 slaves   6 slaves   12 slaves
                     Elapsed time (ms)
 5              692.4      233.6      221.2
10              1370.2     477.6      477.8
15              2053       720.8      737.6
20              2726.6     954.6      898.2

Table 3: Map-reduce performance using slaves


Figure 3: Execution time of multi-threaded map-reduce queries with date range (elapsed time in ms against number of days in query, by number of slaves)

Figure 3 shows that there is a significant increase in the performance of the map-reduce query when used in multi-threaded mode. This is in line with our expectations, as the calculations can be spread across slaves and executed in parallel.

It is notable that there is little increase in performance from using 12 slaves instead of 6. A slight increase can be observed for the maximum number of dates in this sample database, although we can extrapolate that a greater number of slaves will perform better as the size of the query increases. The likely explanation for this behaviour is I/O saturation. Increasing the number of slaves available for calculation does not improve performance significantly as reading the data from disk becomes the limiting factor. Adding additional slaves will only increase the number of processes attempting to read from the disk simultaneously.

NOTE: This result is for demonstration purposes only, and the actual usage of parallel queries will depend on the specific system set-up of each user. Thorough testing across a range of database parameters is recommended before implementing any parallel query arrangement.

7.4.4 Parallel access and segmentation

kdb+ supports parallel I/O to address the disk contention issue outlined earlier. This requires a specific hardware arrangement, usually direct attached storage, with multiple disks residing on separate I/O channels.

To allow simultaneous access across partitions the database structure is modified to use segments. Segmentation adds an additional layer above the partitioned structure. A simple example of this structure would be to split a date partitioned database across two disks, by saving alternating days on each segment.
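For illustration only (the mount points below are hypothetical), such a layout is declared in a par.txt file placed in the database root, with one segment directory per line; kdb+ then looks for date partitions under each of the listed segments:

/mnt/disk1/db
/mnt/disk2/db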


When using a segmented database for parallel queries, it is recommended to have at least one slave thread per disk array. This will allow maximum access for date range queries. In the case of computationally expensive queries, e.g. calculating VWAP, then 2 threads can be assigned per core. In this case, one thread is performing the calculations while another is loading the data.

A fully multi-threaded segmented database requires a custom hardware set up and is generally tailored to the needs of a specific system. As such, benchmarking the performance of this sort of database is beyond the scope of this paper.


7.5 FILE OPERATIONS

7.5.1 DB maintenance

Adding or modifying a column in a partitioned database can be slow when carried out on large tables. This operation can be improved using peach. A shorter execution time for these maintenance operations can help reduce the duration required to make changes to production systems.

In the example below, we have added a column to the quote table on the test database. This is done using the functions in the dbmaint.q script;

(http://code.kx.com/wsvn/code/contrib/fdsupport/database%20maintenance/dbmaint.q)

q)\t addcol[hsym `$"/tmp/db"; `quote; `newcol; 0N]
2013.05.26 14:26:12 adding column newcol (type -7) to `:/tmp/db/2013.07.01/quote
...
4388
q)deletecol[hsym `$"/tmp/db"; `quote; `newcol]
2013.05.26 14:26:33 deleting column newcol from `:/tmp/db/2013.07.01/quote
...

The standard addcol operation takes just over 4 seconds to run across one month of test data. The new column is then removed, and the test repeated using a modified version, p_addcol, which updates all partitions in parallel.

// Define p_addcol using peach

q)p_addcol: {[dbdir;table;colname;defaultvalue] if[not validcolname colname;'(`)sv colname,`invalid.colname]; add1col[;colname;enum[dbdir;defaultvalue]] peach allpaths[dbdir;table];}

q)\s 6
q)\t p_addcol[hsym `$"/tmp/db"; `quote; `newcol; 0N]
2013.05.26 14:27:26 adding column newcol (type -7) to `:/tmp/db/2013.07.06/quote
...
1743

The disk maintenance operation is now measured to take less than half the time when executing in parallel.


7.5.2 Writing flat files

Reading and writing data from disk, particularly in comma or tab delimited format, can be a time consuming operation. We can use slaves to speed up this process, particularly when we need to write multiple files from one process.

For example, we may wish to save the result of a kdb+ query which spans several days in CSV format, creating a separate file for each date. Alternatively, there may be a number of flat files which we want to load into a kdb+ session.

The example below uses an extract from the trade table on the test database. The table is cut into 5 slices and each slice is saved to disk with a separate filename.

q)\s 6
q)td: select from trade where date=2013.05.01, sym in 5?sym
q)count td
49815
q)tds: (floor (count td)%5) cut td
q)\t {(hsym `$"/tmp/testfile",string 1+x) 0: csv 0: tds[x]} each til 5
75
q)\t {(hsym `$"/tmp/testfile",string 1+x) 0: csv 0: tds[x]} peach til 5
29

q)count each tds
9963 9963 9963 9963 9963
q)system "ls /tmp | grep test*"
"testfile1"
"testfile2"
"testfile3"
"testfile4"
"testfile5"

The second query, using peach is measured to take less than half the time to execute. The reverse operation is then demonstrated, reading all of the files back into the q session.

q)read0 `$"/tmp/testfile1"
"date,sym,time,price,size,ex"
"2013-07-01,GOOG,09:30:00.069,1.657921,2,A"
"2013-07-01,EBAY,09:30:00.232,1.463019,3,A"
"2013-07-01,AAPL,09:30:00.266,0.569119,3,N"
"2013-07-01,EBAY,09:30:00.333,1.681702,2,N"
...


q)\t raze {flip (cols trade)!1_/:("DSTFIC";csv) 0: (hsym `$"/tmp/testfile",string 1+x) } each til 5
50
q)\t raze {flip (cols trade)!1_/:("DSTFIC";csv) 0: (hsym `$"/tmp/testfile",string 1+x) } peach til 5
17

q)td: raze {flip (cols trade)!1_/:("DSTFIC";csv) 0: (hsym `$"/tmp/testfile",string 1+x) } peach til 5
q)count td
49815
q)meta td
c    | t f a
-----| -----
date | d
sym  | s
time | t
price| f
size | i
ex   | c

Again, the parallel operation is more than twice as fast at reading the data. Combining all the results together we get the same schema and row count as the original data.


7.6 CONCLUSION

This paper has summarised a range of applications for multi-threading in kdb+. These included optimisation of queries, parallel execution of algorithms, vector operations and file operations. The mechanism behind slave threads has been introduced and the relevant functions explained.

For each of these cases, the potential improvements have been measured and any limitations or potential problems identified. While there is a lot to be gained from using parallel operation appropriately, it is far from a catch-all solution to performance problems. Prior to attempting parallelisation, operations should be fully vectorised to ensure maximum benefit.

Used incorrectly, parallel processing can be less efficient than the equivalent single-threaded operation due to the overheads of transferring data between threads or unbalanced workloads. Also, some operations are unsuitable for multi-threading, as they may require access to global variables.

The use of multi-threading in kdb+ should therefore be treated on a case by case basis. It is a feature which should often be considered by developers, but performance testing is recommended before implementation.

System management solutions are available to assist with administering multi-process environments, such as Kx for Control, which provides a range of tools for visualising process workflow, task scheduling and system resource monitoring. These features are especially helpful for managing master/slave type configurations, where client queries or the loading of data are divided between processes in order to increase throughput.

All tests performed using kdb+ version 3.1 (2013.06.25)


8 Kdb+ tick Profiling for Throughput Optimization

Author:

Ian Kilpatrick has worked on several kdb+ systems. Based in Belfast, Ian is a Technical Architect for the Kx suite of high performance data management, event processing and trading platforms.


CONTENTS
Kdb+ tick Profiling for Throughput Optimization ...... 152
8.1 INTRODUCTION ...... 154
8.2 SETUP ...... 155
8.2.1 Feed simulator code ...... 155
8.2.2 Tickerplant code ...... 156
8.2.3 RDB code ...... 157
8.3 TESTS ...... 158
8.3.1 Number of rows in each update ...... 159
8.3.2 Size of row in bytes ...... 160
8.3.3 Publish frequency ...... 161
8.3.4 Number of subscribers ...... 163
8.4 CONCLUSION ...... 164


8.1 INTRODUCTION

kdb+ is seen as the technology of choice for many of the world’s top financial institutions when implementing a tick capture system. kdb+ is capable of processing large amounts of data in a very short space of time, making it the ideal technology for dealing with the ever-increasing volumes of financial tick data. The core of a kdb+ tick capture system is the tickerplant. Kx have made available the source code for their product kdb+tick, which will form the basis of this paper. The purpose of this whitepaper is to discuss factors which influence messaging and throughput performance for a kdb+ based tick capture system and to present a methodology with which this performance can be profiled to assist in optimizing the tick system configuration.

Some of the possible factors are:

• Number of rows in each update
• Size of the data in bytes
• Tickerplant publish frequency
• Number of subscribers
• Network latency and bandwidth
• Disk write speed
• TCP/IP tuning
• Version of kdb+

This paper examines the first 4 of these. All tests were performed on 64 bit Linux with 8 CPUs, using kdb+ version 3.1 (2014.02.08).

For more information on kdb+ tickerplants, see http://code.kx.com/wiki/Startingkdbplus/tick.

NOTE: The results presented in this whitepaper are for indication purposes only. Results will vary for each individual kdb+ system, especially when hardware specifications are taken into consideration.


8.2 SETUP

We will run a feed simulator which will publish trade data to a tickerplant (TP) on a timer. The trade table has the following schema:

trade:([]time:"P"$();sym:`g#"S"$();price:"F"$();size:"I"$();cond:())

We will also run an RDB (real-time database) which subscribes to the tickerplant for trade messages and inserts into an in-memory table. All the processes will run on the same server and we will use taskset to tie each process to a different CPU, e.g.

taskset -c 0 q tp.q
taskset -c 1 q rdb.q
taskset -c 2 q feedsim.q

8.2.1 Feed simulator code

/ connect to tickerplant
h:hopen 8099

/ number of extra columns to add for test 8.3.2
ex:0

/ number of rows to send in each update
r:10

/ number of updates to send per millisecond
u:1

/ timer frequency
t:1

/ timer function, sends data to the tickerplant
.z.ts:{
  data:(r#.z.p;r?`3;100*r?1.0;10*r?100;r#" ");
  if[ex>0; data,:ex#enlist r#1f];
  if[r=1; data:first each data];
  do[u;neg[h](`upd;`trade;data);neg[h][]];
 }

system"t ",string t


8.2.2 Tickerplant code

/ listen on port 8099
\p 8099

/ dictionary to contain handles to publish to
subs:enlist[`trade]!();

/ function to subscribe to a table
sub:{[t] subs[t],:neg .z.w;}

/ remove subscriber if connection is lost
.z.pc:{subs::subs except\:neg x;}

/ create empty log file
logFile:`$":sym",string .z.D;
logFile set ();
numMsgs:0;
fileHandle:hopen logFile;

/ write data to logFile and publish to subscribers, called by the feed sim
upd:{[t;x]
  tm1:.z.p;
  fileHandle@enlist(`upd;t;x);
  numMsgs+:1;
  tplog,:0.001*.z.p-tm1;
  tm2:.z.p;
  subs[t]@\:(`upd;t;x;tm2);
  tppub,:0.001*.z.p-tm2;
 }

The tickerplant described above differs from a standard tickerplant in a number of ways. A standard tickerplant would:

• Check to see if the log file already exists on startup and then read the number of messages it contains
• The sub function could be enhanced to handle subscribing for certain syms only
• The upd function could add a time to the data if it is not present
• The upd function would only send (`upd;t;x) to the subscribers
• At EOD it would send a message to any subscribers and roll the log file

In the tickerplant described above, we are capturing some extra timing metrics which will allow us to profile the messaging and throughput statistics for the tickerplant. These are described in section 3.


8.2.3 RDB code

/ define trade table
trade:([]time:"P"$();sym:`g#"S"$();price:"F"$();size:"I"$();cond:())

/ number of extra columns to add for test 8.3.2
ex:0
if[ex>0; trade[`$"col",/:string til ex]:ex#enlist"F"$()]

/ define upd to insert to the table, called by tickerplant
upd:{[t;x;tm2]
  tm3:.z.p;
  insert[t;x];
  tm4:.z.p;
  rdbrecv,:0.001*tm3-tm2;
  rdbupd,:0.001*tm4-tm3;
 }

/ connect to tickerplant and subscribe for trades
h:hopen 8099;
h(`sub;`trade);

Again, the RDB described here differs from a standard RDB. A standard RDB would:

• Replay the tickerplant log file on startup to get trades from earlier in the day
• upd would only take 2 parameters, the table name and data
• There would be an EOD function defined which would write intra-day tables to disk and then empty the tables

Similarly to the TP code, the code for the RDB records some timing metrics to measure throughput on the RDB. These are described in section 3.


8.3 TESTS

For each test we will vary certain parameters in the feed simulator code and record the median values for tplog, tppub, rdbrecv and rdbupd in the code above, where:

• tplog = median time in microseconds for the tickerplant to write the data to the log file
• tppub = median time in microseconds for the tickerplant to publish the data to the subscribers
• rdbrecv = median time in microseconds for the RDB to receive the message from the tickerplant
• rdbupd = median time in microseconds for the RDB to insert the data

We will also record:

• Rows per upd = number of rows received by the tickerplant per update
• Rows per sec = number of rows received by the tickerplant per second
• TP CPU = CPU usage of the tickerplant seen using top (with the timing code in the upd function removed)
• RDB CPU = CPU usage of the RDB seen using top (with the timing code in the upd function removed)

Testing Specifications

• All tests were run on 64-bit Linux with 8 CPUs, using kdb+ 3.1 2014.02.08.
• The tickerplant is writing to local disk and the write speed is 400MB/s.
• All the processes run on the same server and are run using taskset. Note that kdb+ does not implement IPC compression when publishing to localhost.
• Each test is run until the median times stop changing, which takes roughly 1 or 2 minutes.


8.3.1 Number of rows in each update

In this test we will vary the number of rows sent in each update by changing the values of r (number of rows per update), u (number of updates per timer frequency) and t (timer frequency in milliseconds) in the feed simulator code. The number of rows received by the tickerplant per second is calculated as:

Rows per second = r * u * 1000 / t
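Expressed as a small q helper (my own illustration, not part of the test code):

q) rps:{[r;u;t] r*u*1000%t}
q) rps[10;1;1]
10000f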

Rows per   Upds per    Timer      Rows      tplog   tppub   TP CPU   rdbrecv   rdbupd   RDB CPU
upd (r)    timer (u)   freq (t)   per sec
1          10          1          10,000    14      3       31       71        4        12
10         1           1          10,000    19      4       6        80        10       2
100        1           10         10,000    35      7       1        106       46       1
1          30          1          30,000    13      3       92       80        4        24
10         3           1          30,000    16      4       4        85        7        3
100        3           10         30,000    30      6       1        99        44       1
10         10          1          100,000   15      4       32       82        7        17
100        1           1          100,000   32      6       6        103       46       4
1000       1           10         100,000   121     23      2        224       378      3
100        5           1          500,000   28      6       32       105       42       22

Table 1 – Microseconds taken and percent CPU by rows per update

As seen in Table 1, the results mainly depend on the rows per upd (r) parameter and not the other 2 parameters. When publishing the data in single rows (rows per upd = 1) we can only achieve about 30,000 rows per second before the TP CPU usage approaches 100%. If the data is in 10 rows per update we can handle over 100,000 rows per second. Significant time is saved when processing 10 rows at a time. In fact, it takes only a little more time than processing 1 row. This goes to the heart of q: in a vector-based language bulk operations are more efficient than multiple single operations.

Rows per upd (r)   1    10   100
tplog              13   17   31
tppub              3    4    6
rdbrecv            75   90   103
rdbupd             4    8    44

Table 2 – Microseconds taken by rows per update

From these results we can conclude that feeds should read as many messages off the socket as possible and send bulk updates to the tickerplant if possible.


8.3.2 Size of row in bytes

In this test we will vary the number of columns in the trade table by changing ex (number of extra columns) in the feedsim and RDB code.

• The times for ex=0 have already been found in section 8.3.1.
• For each update message, we determine the size using count -8!x (see the sketch below).
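For example, a rough sketch (my own, not from the original text) of measuring the serialised size of a single-row update message; the exact byte count will vary with the random data:

q) data:(1#.z.p;1?`3;100*1?1.0;10*1?100;1#" ")
q) count -8!(`upd;`trade;data)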

Num extra   Rows per   Rows per   tplog   tppub   TP CPU   rdbrecv   rdbupd   RDB CPU   Size of upd
cols (ex)   upd (r)    sec                                                              in bytes
0           1          10,000     14      3       31       82        4        12        73
10          1          10,000     14      4       34       87        6        16        213
50          1          10,000     17      5       43       79        12       38        773
0           10         100,000    15      4       32       86        7        17        334
10          10         100,000    18      5       40       78        8        19        1194
50          10         100,000    33      7       60       92        14       40        4634
0           100        100,000    28      6       6        106       41       4         2944
10          100        100,000    50      8       13       187       46       8         11004
50          100        100,000    141     16      28       336       61       23        43244

Table 3 – Microseconds taken and percent CPU by size of row

We can see that the tplog, tppub and rdbupd times increase as the number of columns increases, as we would expect. The increase is more noticeable for large bulk updates: if there is only 1 row per update then adding 10 columns to the published data only increases the CPU usage of the tickerplant by 10%.


8.3.3 Publish frequency

In this test we will alter the tickerplant behaviour so that it publishes to subscribers on a timer. Specifically, we will examine the following three scenarios:

• The tickerplant writes each update to disk individually and publishes each update to subscribers as soon as it is received (standard zero-latency tickerplant)
• The tickerplant writes each update to disk individually but publishes to subscribers on a timer
• The tickerplant both writes to disk and publishes to subscribers on a timer

By batching up updates we can reduce the load on the tickerplant and the RDB. However, this will also result in a delay to RDB receiving the data. Of course, publishing on a timer is not suitable if the subscribers need the data immediately.

We need to make the following adjustments to the tickerplant code in order to publish on a timer:

/ define trade table
trade:([]time:"P"$();sym:`g#"S"$();price:"F"$();size:"I"$();cond:())

/ write to disk and insert to the local table
upd:{[t;x]
  tm:.z.p;
  fileHandle@enlist(`upd;t;x);
  numMsgs+:1;
  insert[t;x];
  tpupd,:0.001*.z.p-tm;
 }

/ publish the data in the local table and clear
.z.ts:{
  tm:.z.p;
  {[t]
    if[0=count value t; :()];
    subs[t]@\:(`upd;t;value t);
    .[t;();0#];
  } each enlist`trade;
  tpflush,:0.001*.z.p-tm;
 }

To further reduce the load we will buffer the messages and only write to the on-disk log file on a timer as well. However, it should be noted that in the event of a tickerplant going down, more data will be lost if the data is being logged on a timer than if the data is being written on every update.


In order to write updates to disk on a timer, we need to make the following changes to the tickerplant code:

/ define trade table
trade:([]time:"P"$();sym:`g#"S"$();price:"F"$();size:"I"$();cond:())

/ insert to the local table
upd:{[t;x]
  tm:.z.p;
  insert[t;x];
  tpupd,:0.001*.z.p-tm;
 }

/ publish the data in the local table, write to disk and clear
.z.ts:{
  tm:.z.p;
  {[t]
    if[0=count value t; :()];
    subs[t]@\:(`upd;t;value t);
    fileHandle@enlist(`upd;t;value t);
    numMsgs+:1;
    .[t;();0#];
  } each enlist`trade;
  tpflush,:0.001*.z.p-tm;
 }

Rows per   Rows per   Timer   Pub on   Write on   tpupd   tpflush   TP CPU   rdbupd   RDB CPU
upd        sec        freq    timer    timer
1          10,000     0       N        N          13      0         31       3        12
1          10,000     100     Y        N          13      36        22       258      0.1
1          10,000     100     Y        Y          3       169       9        273      0.1

Table 4 – Microseconds taken and percent CPU by publish frequency

Where

• Timer freq = frequency the timer in the tickerplant is run, in milliseconds
• tpupd = median time in microseconds to run upd in the tickerplant
• tpflush = median time in microseconds to run the timer (.z.ts) in the tickerplant


In Table 4 above we can see that when publishing on the timer, the tickerplant upd function still takes roughly the same time as in zero-latency mode, but we are only publishing data 10 times a second, which reduces the overall load: the TP CPU usage has decreased from 31% to 22%. The RDB CPU usage decreases from 12% to 0.1% as it is only doing 10 bulk updates per second instead of 10,000 single updates per second. Only writing to disk 10 times a second reduces the load on the tickerplant further. The improvements will be greater the more updates the tickerplant receives per second.

8.3.4 Number of subscribers

In this test, we will examine how the number of processes that are subscribing to a tickerplant’s data can affect the throughput of the tickerplant. We will run multiple subscribers/RDBs and see the effect on the tickerplant publish time and RDB receive time. We will collect the following metrics:

• Last rdbrecv = median time in microseconds for the last RDB in the tickerplant subscription list (subs dictionary) to receive the message from the tickerplant
• First rdbrecv = median time in microseconds for the first RDB in the tickerplant subscription list (subs dictionary) to receive the message from the tickerplant
• Num subs = number of subscribers

Rows per   Upds per   Timer   Rows per   Num subs   tppub   First     Last      TP CPU
upd        timer      freq    sec                           rdbrecv   rdbrecv
1          1          1       1,000      1          3       85        85        3
1          1          1       1,000      3          4       172       178       4
1          1          1       1,000      5          6       148       296       6
1          1          1       1,000      10         10      265       343       10
10         1          1       10,000     1          3       88        88        3
10         1          1       10,000     3          5       175       181       5
10         1          1       10,000     5          6       155       318       7
10         1          1       10,000     10         11      224       540       12
100        1          1       100,000    1          21      97        97        6
100        1          1       100,000    3          58      177       324       10
100        1          1       100,000    5          95      257       330       15
100        1          1       100,000    10         185     449       682       30

Table 5 – Microseconds taken by number of subscribers

We can see that increasing the number of subscribers increases the tickerplant publish time, first RDB receive time and last RDB receive time. The first RDB receive time increases because the data is written to each internal message queue and the queues are not flushed until the tickerplant publish function returns. If there are multiple subscribers to a tickerplant it might be worth considering a chained tickerplant to reduce the number of subscribers. Only the chained tickerplant and the subscribers which need the data as quickly as possible would subscribe to the main tickerplant. Then other subscribers would subscribe to the chained tickerplant to get the data.
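A minimal sketch of such a chained tickerplant is shown below. The port numbers are hypothetical and it assumes the upstream tickerplant publishes plain (`upd;t;x) messages, as a standard kdb+tick tickerplant does (the instrumented tickerplant in section 8.2.2 adds a timestamp argument, which would need to be accommodated):

/ chained tickerplant: subscribe to the main tickerplant and republish
\p 8098                              / port for downstream subscribers
subs:enlist[`trade]!();
sub:{[t] subs[t],:neg .z.w;}         / downstream processes call this to subscribe
.z.pc:{subs::subs except\:neg x;}
upd:{[t;x] subs[t]@\:(`upd;t;x);}    / republish whatever the main tickerplant sends
h:hopen 8099;                        / connect and subscribe to the main tickerplant
h(`sub;`trade);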


8.4 CONCLUSION

This whitepaper examined some key factors which can influence the performance and throughput of a kdb+ tickerplant. We established a methodology to profile the performance of a kdb+ tickerplant and used it to focus on four key areas which can affect the tickerplant’s throughput:

1. Number of rows per update
Feeds should read as many messages off the socket as possible and send bulk updates to the tickerplant if possible. Bulk updates will greatly increase the maximum throughput achievable, e.g. we found processing a bulk update of 10 messages took only slightly longer than processing an update of 1 message, and hence the CPU usage was nearly 10 times lower (for the same number of messages per second).

2. Size of each update in bytes
Reducing the size of the data can improve throughput as writing to disk and sending the data will be faster. The improvement is more noticeable on bulk updates.

3. Publish frequency
Buffering the messages in the tickerplant and publishing on a timer improves throughput. The CPU usage of the RDB also decreases as the data is being batched up into fewer updates per second. However, it is not suitable if the subscribers need the data immediately. Writing the messages to disk on a timer improves throughput further but more data could be lost if the tickerplant dies. The improvement is more noticeable as the number of updates per second increases.

4. Number of subscribers
Adding more subscribers increases the load on the tickerplant. To reduce this, consider using a chained tickerplant. Even existing subscribers will be affected if more subscribers are added, since the internal message queues for each subscriber are all written to before the queues are flushed.

It should be noted that these are not the only factors that contribute to a tickerplant’s performance. A complete analysis would also examine the effects that network latency and bandwidth, disk write speed and TCP/IP tuning have on tickerplant performance.

The results shown in this paper are not representative of kdb+ systems as a whole. Results for each individual kdb+ system will vary due to various hardware and software considerations. However, the code and methodology used in this paper could serve as a starting point to any developers wishing to profile and optimize their own systems.

All tests were run using kdb+ version 3.1 (2014.02.08)


9 Columnar Database and Query Optimization

Author:

Ciarán Gorman is a Financial Engineer who has designed and developed data management systems across a wide range of asset classes for top tier investment banks.


CONTENTS
COLUMNAR DATABASE AND QUERY OPTIMIZATION ...... 166
9.1 INTRODUCTION ...... 168
9.2 OVERVIEW OF TEST DATA ...... 169
9.3 QUERY STRUCTURE ...... 170
9.3.1 QUERY STRUCTURE EXAMPLE ...... 170
9.4 PRECALCULATION ...... 173
9.4.1 PRECALCULATION AS QUERY GROWS ...... 173
9.5 TAKE ADVANTAGE OF THE BUILT-IN MAP-REDUCE ...... 175
9.5.1 MAP REDUCE AS TABLE GROWS ...... 175
9.6 USE OF ATTRIBUTES ...... 177
9.6.1 Parted Attribute: `p# ...... 177
9.6.2 Parted Attribute As Table Grows ...... 177
9.6.3 Sorted Attribute ...... 179
9.6.3.1 Sorted Attribute As Table Grows ...... 179
9.6.4 Unique Attribute: `u# ...... 182
9.6.4.1 Unique Attribute As Table Grows ...... 182
9.6.4.2 Unique Attribute as Query Grows ...... 184
9.6.5 Grouped Attribute: `g# ...... 185
9.6.5.1 Grouped Attribute As Table Grows ...... 185
9.6.5.2 Grouped Attribute to Retrieve Daily Last ...... 188
9.6.5.3 Grouped Attribute as Query Grows ...... 188
9.7 CONCLUSION ...... 191


9.1 INTRODUCTION

The purpose of this whitepaper is to give an overview of some of the methods that are available to a kdb+ developer when trying to optimize the performance of a kdb+ database when queried.

kdb+ has a well deserved reputation as a high performance database, appropriate for capturing, storing and analyzing massive amounts of data. Developers using kdb+ can use some of the techniques outlined in this whitepaper to optimize the performance of their installations. These techniques encompass variations in the structure of queries, how data is stored, and approaches to aggregation.

Adjustments to column attributes are made using the dbmaint.q library, which was written by First Derivatives consultants and is available from the Kx wiki (http://code.kx.com/wsvn/code/contrib/fdsupport/database%20maintenance/dbmaint.q). Where appropriate, OS disk cache has been flushed using Simon Garland’s io.q script, which is available at the Kx wiki (http://code.kx.com/wsvn/code/contrib/simon/io/io.q).

Tests performed using kdb+ version 2.8 (2012.05.29)


9.2 OVERVIEW OF TEST DATA

The data used for the majority of the examples within this whitepaper is simulated NYSE TaQ data, with a minimal schema and stored in a date partitioned db. The data consists of the constituents of the S&P 500 index over a 1-month period, which has in the region of 10,000 trades and 50,000 quotes per security per day. Initially, no attributes are applied, although the data is sorted by sym, and within sym on time. Where an example requires a larger dataset, we will combine multiple partitions or create a new example dataset.

q)meta trade
c    | t f a
-----| -----
date | d
time | t
sym  | s
price| f
size | i
stop | b
cond | c
ex   | c

q)meta quote
c    | t f a
-----| -----
date | d
time | t
sym  | s
bid  | f
ask  | f
bsize| i
asize| i
mode | c
ex   | c

q)(select tradecount:count i by date from trade)lj(select quotecount:count i by date from quote)
date      | tradecount quotecount
----------| ---------------------
2010.12.01| 4940021    24713334
2010.12.02| 4940300    24714460
2010.12.03| 4940482    24708283
...
2010.12.30| 4982917    24933374
2010.12.31| 4994626    25000981


9.3 QUERY STRUCTURE

Before considering optimizations that can be achieved through amendments to data structures, etc, it is important to approach query construction correctly. This is important in all types of databases, but especially when dealing with on-disk data where the database is partitioned, as poorly structured queries can cause the entire database to be scanned to find the data needed to satisfy the query.

While kdb+ generally evaluates expressions right-to-left, the constraints on select statements are applied from left to right following the where clause. It is important to make sure that the first constraint in a request against a partitioned table is against the partition column (typically this will be date).

For partitioned and non-partitioned tables, one should aim to reduce the amount of data that needs to be scanned/mapped as early as possible through efficient ordering of constraints.

9.3.1 Query Structure Example

The purpose of this example is to examine the performance of queries with differently-ordered constraints as the database grows. In order to vary the size of the database, we will use .Q.view. To measure performance, we will use \ts to measure the execution time in ms and space used in bytes.

Note that while the non-optimal requests complete successfully in our relatively small test database, these queries run much slower on a larger database. These queries should be executed with caution.

The following queries will be performed, with OS disk cache flushed between requests by restarting the process and using the flush functionality in io.q.

//set the date to be requested
q)d:first date

//sym before date (non-optimal order)
q)select from trade where sym=`IBM, date=d

//date before sym (optimal order)
q)select from trade where date=d, sym=`IBM


             sym before date         date before sym
dates in db  time (ms)  size (b)     time (ms)  size (b)
 1           470        75,499,920   78         75,499,984
 5           487        75,878,400   78         75,499,984
10           931        75,880,624   78         75,499,984
15           1,209      75,882,912   78         75,499,984
20           1,438      75,885,072   78         75,499,984

Table 1 - Results from queries as database grows

Figure 1: Effect of constraint order on execution times (time elapsed in ms against dates in db, for non-optimal and optimal constraint order)

The results show that with non-optimal constraint order, the elapsed time for queries increases with the size of the db, while the request time for optimal constraint order remains constant. The reason for the increase is that kdb+ has to inspect a much larger range of data in order to find the data needed to satisfy the query.

One result of interest is the performance for the non-optimal request in a db containing 1 date; since we are only looking at 1 date, one might expect that the performance would be similar. In order to explain the difference here we have to recall that date is a virtual column in the database – inferred by kdb+ by inspecting the folder names in the partitioned db. When the partition column is used as a constraint, but is not the first constraint, kdb+ has to materialize this data for comparison. What we are observing is the cost of this promotion and comparison.


Figure 2: Effect of constraint order on workspace used (space used as reported by \ts (b) vs dates in db, for non-optimal and optimal constraint order)

The scale of the chart above has been adjusted to show that there is a slight difference in space used as reported by \ts for each request. This is because the virtual column is promoted to an in-memory vector of date values for comparison if it is used as a constraint other than the first constraint.

It should be noted that even for optimal constraint order, this is a high amount of space to use, given the size of the result set. We will address this when we look at attributes later.


9.4 PRECALCULATION

For any system, it is advisable to pay particular attention to the queries that are commonly performed and analyse them for optimisations that can be made by precalculating components of the overall calculation. For example, if the db is regularly servicing requests for minute-interval aggregated data, then it probably makes sense to precalculate and store this data. Similarly, daily aggregations can be precalculated and stored for retrieval.
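As an illustration of the minute-interval case, the following is a minimal sketch of precalculating minute bars per date and saving them as their own partitioned table. The table name bar and the bucketing by time.minute are assumptions for illustration, not part of the example database built below:

//build minute bars for each date and save as a partitioned table named bar
q){bar::0!select open:first price, high:max price, low:min price, close:last price, volume:sum size by sym, minute:time.minute from trade where date=x; .Q.dpft[`:.;x;`sym;`bar];} each date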

9.4.1 Precalculation As Query Grows
The aim of this example is to show the impact on request times of precalculation and storage of intermediate/final calculation results. We will make baseline requests against our example date-partitioned trade table. For the precalculated data, we will create a new table (ohlc) with daily aggregated data, and save this as a date-partitioned table in the same database. Between each test, kdb+ will be restarted and the OS cache will be flushed.

//make dictionary containing groups of distinct syms drawn from our sym universe
q)syms:n!{(neg x)?sym}each n:1 10 100 500

//make call for open prices for ranges of security drawn from the dictionary

q)select open:first price by sym from trade where date=first date, sym in syms 1
...
q)select open:first price by sym from trade where date=first date, sym in syms 500

Now that we have a baseline for values drawn from the trade table, we save a table with daily aggregates for each security.

//create ohlc table with a number of different day-level aggregations, and save to disk using dpft function.

//perform across each date in the db
q){ohlc::0!select open:first price, high:max price, low:min price, close:last price, vwap:size wavg price by sym from trade where date=x; .Q.dpft[`:.;x;`sym;`ohlc];}each date

//This will create one row per security/date pair.
q)select count i by date from ohlc
date      | x
----------| ---
2010.12.01| 500
2010.12.02| 500
...
2010.12.30| 500
2010.12.31| 500


//reload db
q)\l .

//repeat the same requests, but target our new aggregation table.
q)select open by sym from ohlc where date=first date, sym in syms 1
...
q)select open by sym from ohlc where date=first date, sym in syms 500

           from trade                from ohlc
sym values time (ms)  size (b)       time (ms)  size (b)
1          1          265,056        0          2,896
10         13         2,100,224      0          3,504
100        53         16,781,680     0          12,400
500        125        134,228,736    0          48,752

Table 2 - Results from queries as more securities are requested

Figure 3: Effect of using precalculated data on execution times (time elapsed (ms) vs sym values requested, for queries against trade and ohlc)

We can see that for a relatively low expenditure on storage, there are significant performance gains by pre-calculating commonly requested data or components of common calculations.


9.5 TAKE ADVANTAGE OF THE BUILT-IN MAP-REDUCE CAPABILITIES OF kdb+ PARTITIONED DATABASES

For certain aggregation functions, kdb+ has built-in map-reduce capabilities. When a query crosses partitions - for example a multi-day volume weighted average price (VWAP), kdb+ identifies whether the aggregation can be map-reduced. If so, it will perform the map operation across each of the partitions. Some aggregations that support map-reduce are avg, count, max, min, sum, wavg.

Map-reduce implementation is transparent to the user. For requests that do not involve aggregation, the map component is simply a retrieval of the required rows, and the reduce combines to form the result table.
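To make the map and reduce steps concrete, the following is a minimal sketch of what an equivalent hand-rolled multi-day VWAP calculation might look like. The partial-sum column names sumpxsz and sumsz are illustrative only; kdb+ performs the equivalent work internally and automatically:

//map: compute partial aggregates for each of the first 5 date partitions
q)parts:{select sumpxsz:sum size*price, sumsz:sum size by sym from trade where date=x} each 5#date

//reduce: combine the per-partition partials and complete the weighted average
q)select vwap:sum[sumpxsz]%sum sumsz by sym from raze 0!/:parts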

9.5.1 Map-reduce As Table Grows
In this example, we will look at the impact of increasing the amount of data being aggregated. We will perform the same query over widening time periods.

//perform the same query over widening date windows
//restart kdb+ and flush os cache between each request

q)select size wavg price by sym from trade where date in 5#date
...
q)select size wavg price by sym from trade where date in 20#date

dates in query time (ms)
5              1,863
10             3,152
15             4,596
20             7,597

Table 3 - Results from queries using map-reduce

When we look at the performance of the requests, we can see that due to map-reduce, the time for the calculation is scaling near-linearly as the number of dates increases. If there were no map-reduce implementation for this aggregation, the cost of mapping in all data then calculating over this dataset would lead to performance worse than linear scaling.


Figure 4: Map-reduce execution times as db grows (time elapsed (ms) vs dates in query)

It is possible to take advantage of this behavior with kdb+’s slaves, where the map calculation components are distributed to slaves in parallel, which can lead to very fast performance. The use of slaves is outside the scope of this paper.


9.6 USE OF ATTRIBUTES

Attributes can be applied to a vector, table, or dictionary to inform kdb+ that the data has certain properties that allow different treatments to be applied when dealing with that data.

9.6.1 Parted Attribute: `p#
The parted index is typically used to provide fast access to sorted data in splayed on-disk tables, where the most commonly filtered and/or grouped field (often the instrument) has a sort applied prior to saving. It indicates that a vector is organized into contiguous (but not necessarily sorted) chunks.

Applying `p# allows kdb+ to identify the unique values within a vector quickly, and to determine where in the vector each segment begins. The result of this is an improvement in the time required to perform a request, and also a reduction in the amount of data required to perform some calculations. The performance improvement will be most noticeable where the values in the vector are sufficiently repetitive. There is a memory overhead to setting the `p# attribute on a vector, and kdb+ will verify that the data is actually parted as described.
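A minimal sketch of this verification on small in-memory vectors (the symbol values are illustrative): the attribute is accepted only when equal values are stored contiguously.

//parted data: the attribute is accepted
q)attr `p#`aa`aa`bb`bb`cc
`p

//non-parted data: kdb+ rejects the attribute (error trapped here)
q)@[{`p#x};`aa`bb`aa;{"attribute rejected"}]
"attribute rejected"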

9.6.2 Parted Attribute As Table Grows
If we apply the `p# attribute to the sym columns in the on-disk database we have created for testing and retry the example from the Query Structure example in section 9.3.1, we can see the benefits of the reduced work that kdb+ has to carry out to find the data it is looking for. We will then be able to compare the results and examine the impact from applying attributes.

In this test, we want to examine the performance of queries applied against data partitioned by date and parted on sym, with the `p# attribute applied to the sym column. The database will be varied in size using .Q.view, and between each test we will restart kdb+ and flush the os cache.

//add `p# attribute using setattrcol from dbmaint.q
//NOTE: setattrcol does not adjust the order of the data. Our data is already segmented by sym.
q)setattrcol[`:.;`trade;`sym;`p]
q)\l .

//set the date to be requested
q)d:first date

//sym before date (non-optimal order)
q)select from trade where sym=`IBM, date=d

//date before sym (optimal order)
q)select from trade where date=d, sym=`IBM


            no attribute (from 9.3.1)                         `p# set
            sym before date        date before sym            sym before date       date before sym
dates in db time (ms)  size (b)    time (ms)  size (b)        time (ms)  size (b)   time (ms)  size (b)
1           470        75,499,920  78         75,499,984      3          445,024    0          445,088
5           487        75,878,400  78         75,499,984      24         889,936    0          445,088
10          931        75,880,624  78         75,499,984      26         892,816    0          445,088
15          1,209      75,882,912  78         75,499,984      64         896,336    0          445,088
20          1,438      75,885,072  78         75,499,984      177        898,576    0          445,088

Table 4 - Results from queries on data with `p# attribute

As we can see in the table above, when we compare optimal to non-optimal query construction there is still a significant increase in timings and space required when a non-optimal constraint order is used. However, in general the time and space required is much lower than the corresponding examples without `p# applied.

Figure 5: Effect of `p# on execution times (time elapsed (ms) vs dates in db, for optimal and non-optimal constraint order, with and without `p#)

When we look at the results of the `p# and non-optimal constraint order examples, we can see the benefits of both applying `p# and using an optimal constraint order in the query. Initially, both optimal and non-optimal requests against the db with `p# set are faster than their counterparts without attributes. As the database grows, we can see that the optimally ordered request without `p# starts to outperform the non-optimal request with `p# set. The reason for this is that requests with optimal constraint order only have to search within the data for one date. While `p# has optimized the search

Columnar Database and Query Optimization 178 | Page

Technical Whitepaper

within each date for the non-optimal request, the data it has to search through is growing as the db grows, and before long the work being done surpasses the work being done in the non-`p# database.

As the db grows it becomes clear that the best-performing version is with `p# applied and optimal constraint order.

Figure 6: Effect of `p# on workspace used (space used as reported by \ts (b) vs dates in db, for optimal and non-optimal constraint order, with and without `p#)

The queries where `p# is set also require significantly less space for calculation than those without `p#. This is because kdb+ is able to move directly to the location in the sym vector that is of interest, rather than scanning sym to find values that match the constraint.

9.6.3 Sorted Attribute: `s#
The sorted attribute indicates that the vector is sorted in ascending order. It is typically applied to temporal columns, where the data is naturally in ascending order as it arrives in real time. When kdb+ encounters a vector with `s# attribute applied, it will use a binary search instead of the usual linear search. Operations such as asof and within are much faster using the binary search. Requests for min and max can be fulfilled by inspecting the start and end of the vector, and other comparison operations are optimized.

Setting `s# attribute on a vector has no overhead in terms of memory required, and kdb+ will verify that the data is sorted ascending before applying the attribute.
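A minimal sketch of both points on small in-memory vectors (the values are illustrative): the attribute is applied to ascending data, and refused if the data is not sorted.

//ascending data: `s# is applied
q)v:`s#til 1000000
q)attr v
`s

//unsorted data: kdb+ verifies the sort and rejects the attribute (error trapped here)
q)@[{`s#x};3 1 2;{"attribute rejected"}]
"attribute rejected"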

9.6.3.1 Sorted Attribute As Table Grows
Let's look at the impact on query performance of applying `s# as the dataset grows. We will be applying the `s# attribute to the time column by sorting the data on time as we select it from the on-disk data.


As we are not interested in the contents of the calculation results, we can combine data from multiple dates in the database and sort on time to simulate growing realtime tables. Between each request, we will restart kdb+.

//create time we will use for test requests
q)t:first 09:30:00.000+1?07:00:00.000

//create temporary table using one or more days of saved data – vary from 1 through 10 on each iteration
q)rtquote:{`time xasc select from quote where date in x#date}1
...
q)rtquote:{`time xasc select from quote where date in x#date}10

//make requests against rtquote for IBM data. Repeat as the rtquote table grows
q)select from rtquote where time=t, sym=`IBM

//repeat process with `s removed from time vector
q)rtquote:update `#time from{`time xasc select from quote where date in x#date}1
...
q)rtquote:update `#time from{`time xasc select from quote where date in x#date}10

              no attribute              `s# set
rows in table time (ms)  size (b)       time (ms)  size (b)
25,000,000    49         33,554,944     0          1,248
50,000,000    102        67,109,392     0          1,248
75,000,000    154        134,218,288    0          1,248
100,000,000   213        134,218,288    0          1,248
125,000,000   247        134,218,288    0          1,248
150,000,000   299        268,436,016    0          1,248
175,000,000   368        268,436,016    0          1,248
200,000,000   406        268,436,016    0          1,248
225,000,000   523        268,436,016    0          1,248
250,000,000   568        268,436,016    0          1,248

Table 5 - Results from queries on data with `s# attribute


Figure 7: Effect of `s# on execution times (time elapsed (ms) vs rows in table, with no attribute and with `s# set)

We can see in the chart above the benefits from setting `s# on the time column. While the request times without the attribute grow linearly with the increase in data, the times for the requests against the data with the attribute set are uniform – in this case returning instantly.

Figure 8: Effect of `s# on workspace used (space used as reported by \ts (b) vs rows in table, with no attribute and with `s# set)


We observe similar results here for the amount of space used for the request. kdb+ is able to perform the calculation with a much smaller subset of the data. We can see that the curve representing space used for requests without an attribute set has a series of steps. These steps reflect the fact that kdb+’s buddy memory algorithm allocates according to powers of 2.

9.6.4 Unique Attribute: `u#
The unique attribute is a hash map used to speed up searches on primary key columns, or keys of simple dictionary maps, where the list of uniques is large and/or lookups are very frequent. There is a memory overhead for applying `u# to a vector, and kdb+ will verify that the data is unique before applying the attribute.
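A minimal sketch of this behaviour on a small vector (the values are illustrative): the attribute is applied to unique data, and refused when a duplicate is present.

//unique data: the hash index is built and the attribute applied
q)k:`u#`$string til 1000
q)attr k
`u

//duplicate present: kdb+ verifies uniqueness and rejects the attribute (error trapped here)
q)@[{`u#x};`a`b`a;{"attribute rejected"}]
"attribute rejected"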

9.6.4.1 Unique Attribute As Table Grows
This example will examine the performance of the `u# attribute as the dataset being queried grows. The example data we’ve used for the previous examples does not contain values with a high enough degree of uniqueness to be used here. We will create some new data for this example using the q rand function. The table that we will use for this example will be similar to a last value cache, although to observe the behaviour, we will probably grow it beyond a reasonable size for tables of this form.

//function to create example last value cache keyed table with x rows
//also creates global s to retrieve query value later
q)mktbl:{([sym:s::(neg x)?`7];lprice:0.01*x?10000)}

//make example table, vary through the values on each iteration
q)tbl:mktbl 100
...
q)tbl:mktbl 10000000

//create test value
q)req:first 1?s

//select value
q)select lprice from tbl where sym=req

//directly index to retrieve value and create table
q)enlist tbl[req]

//repeat with `u# applied to sym column
q)tbl:update `u#sym from mktbl 100
...
q)tbl:update `u#sym from mktbl 10000000


              no attribute                                   `u# set
              select                  index                  select                index
rows in table time (ms)  size (b)     time (ms)  size (b)    time (ms)  size (b)   time (ms)  size (b)
100           0          1,040        0          528         0          1,040      0          528
1,000         0          1,536        0          528         0          1,040      0          528
10,000        0          16,896       0          528         0          1,040      0          528
100,000       0          131,584      0          528         0          1,040      0          528
1,000,000     2          1,049,088    0          528         0          1,040      0          528
10,000,000    23         16,777,728   7          528         0          1,040      0          528
100,000,000   233        134,218,240  96         528         0          1,040      0          528

Table 6 - Results from queries on data with `u# attribute as table grows

Figure 9: Effect of `u# on execution times as table grows (time elapsed (ms) vs rows in table, for select and index access with and without `u#)

The chart above shows that as the table size crosses 1m data points, the time taken for requests against non-attributed data starts to grow. Requests against attributed data remain uniform in response time, in this case returning instantly.

In addition, it can be observed that directly indexing into the keyed table with no attributes applied is faster and uses less data than performing a select against it.


9.6.4.2 Unique Attribute as Query Grows
For this example, we’ll create a static last value cache table and grow the number of sym values being requested.

//function to create example last value cache keyed table with x rows
//create global s to retrieve query value later
q)mktbl:{([sym:s::(neg x)?`7];lprice:0.01*x?10000)}

//create fixed size example table
q)tbl:mktbl 10000000

//make dictionary containing groups of distinct syms drawn from our sym universe
q)syms:as!{(neg x)?s}each as:1 10 100 1000 10000 100000

//select value
q)select lprice from tbl where sym in syms 1
...
q)select lprice from tbl where sym in syms 100000

//directly index to retrieve value and create table
q)tbl each syms[1]
...
q)tbl each syms[100000]

//repeat with `u# applied to sym
q)tbl:mktbl 10000000
q)update `u#sym from `tbl

           no attribute                                 `u# set
           select                 index                 select                 index
sym values time (ms)  size (b)    time (ms)  size (b)   time (ms)  size (b)    time (ms)  size (b)
1          22         16,777,776  0          608        0          1,344       0          608
10         77         83,886,624  60         1,392      0          1,344       0          1,392
100        77         83,886,624  462        9,872      0          6,288       0          9,872
1,000      77         83,886,624  7,454      88,976     0          53,008      0          88,976
10,000     78         83,886,624  48,822     1,033,616  7          648,208     7          1,033,616
100,000    85         83,886,624  484,486    9,546,128  90         5,821,968   82         9,546,128

Table 7 - Results from queries on data with `u# as query grows


Figure 10: Comparison of query times with `u# attribute as query size grows (time elapsed (ms) vs datapoints in query, for select without attribute, select with `u# and index with `u#)

NB: In order to observe the lower points on the curve, the ‘index without attribute’ curve is not charted.

As we can see, selects against vectors with `u# applied significantly outperform those without an attribute applied. As the queries grow, however, there can be limitations to the performance gains with large queries. For the index syntax, the reason for this is that our example code is written in such a way that we are operating in a scalar fashion across the requested sym vector.

This example is provided as commentary, not as an example of recommended design. As mentioned previously, `u# is typically applied to aggregation tables or dictionary keys, so it is unlikely to be required to service requests of the size we reach in this example. In a scenario where there is a large query domain, it may be faster to break up into smaller queries and combine the results.
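A minimal sketch of that last suggestion, using the tbl and syms objects defined above (the chunk size of 1000 is an arbitrary illustration):

//break the largest requested sym list into chunks and combine the result sets
q)raze {select lprice from tbl where sym in x} each 1000 cut syms 100000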

9.6.5 Grouped Attribute: `g#
The grouped attribute is a conventional database 'index' of all instances of a particular value. The ability of kdb+ to maintain this in real time as data is inserted allows fast searching, grouping and filtering on important fields.
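A minimal sketch of this real-time maintenance on a small in-memory table (the symbols and prices are illustrative):

//apply `g# to the sym column of an in-memory table
q)t:([]sym:`g#`ibm`msft`ibm;price:3?100.0)

//the grouped index is maintained as new rows are inserted
q)`t insert (`ibm;42.0)
q)attr t`sym
`g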

Applying the grouped attribute to a column causes the regular search algorithm to be substituted for a hash-based search. This allows developers to identify the unique values within a vector quickly, and to quickly retrieve the values required. There is a significant memory/disk cost for applying `g# to a vector.

9.6.5.1 Grouped Attribute As Table Grows
This example is concerned with observing the performance of setting `g# on the sym column of a realtime quote table, which is typically ordered by time. We will observe the performance of locating matching values throughout an unordered sym vector.


As we are not interested in the contents of the calculation results, we can combine data from multiple dates in the database and sort on time to simulate growing realtime tables. We will not set any attribute on the time column, and will restart kdb+ between queries.

//create rt quote table with `g#sym. Vary through 1-10 dates
q)rtquote:update `g#sym, `#time from `time xasc select from quote where date in 1#date
...
q)rtquote:update `g#sym, `#time from `time xasc select from quote where date in 10#date

//select all IBM data from throughout the table
q)select from rtquote where sym=`IBM

//repeat the process for data without `g#sym
q)rtquote:update `#time from `time xasc select from quote where date in 1#date
...
q)rtquote:update `#time from `time xasc select from quote where date in 10#date

              no attribute                `g# set
rows in table time (ms)  size (b)         time (ms)  size (b)
25,000,000    119        301,990,304      8          2,228,848
50,000,000    243        603,980,192      10         4,457,072
75,000,000    326        1,207,959,968    14         8,913,520
100,000,000   472        1,207,959,968    20         8,913,520
125,000,000   582        1,207,959,968    26         8,913,520
150,000,000   711        2,415,919,520    30         17,826,416
175,000,000   834        2,415,919,520    36         17,826,416
200,000,000   931        2,415,919,520    40         17,826,416
225,000,000   1,049      2,415,919,520    46         17,826,416
250,000,000   1,167      2,415,919,520    50         17,826,416

Table 8 – Results from queries on data with `g# attribute as table grows


Figure 11: Effect of `g# on execution times as table grows (time elapsed (ms) vs rows in table, with no attribute and with `g# set)

We can see from this example that even when sym values are distributed across the table, having `g# applied to the sym vector allows for a significant speedup.

Figure 12: Effect of `g# on workspace used as table grows (space used as reported by \ts (b) vs rows in table, with no attribute and with `g# set)

The chart showing space used shows a similar pattern to the timing curve. The slight increase in space used is because the result set is growing as the size of the table increases.


9.6.5.2 Grouped Attribute to Retrieve Daily Last
As discussed, setting `g# causes the regular search algorithm to be replaced with a hash-based search. We can make use of this hash directly using the group function with the column that has `g# applied.

To illustrate this, consider the following methods for calculating the last price for each symbol for a day of simulated real-time trade data. We will restart between examples:

//create simulated realtime trade table sorted on time
q)rttrade:update `#time from `time xasc select from trade where date=first date
q)\ts select last price by sym from rttrade
24 33561280j

//create simulated realtime trade table sorted on time
q)rttrade:update `g#sym, `#time from `time xasc select from trade where date=first date
q)\ts select last price by sym from rttrade
15 33559232j

//create simulated realtime trade table sorted on time
q)rttrade:update `g#sym, `#time from `time xasc select from trade where date=first date

//use group on the sym column to get the indices for each sym and find
//last value in each vector. Use these index values to directly index
//into the price vector, then create a table with the results
q)\ts {1!([]sym:key x;price:value x)} rttrade[`price] last each group[rttrade`sym]
0 15072j

As we can see in the results above, using the same form of query with the attribute applied results in a significant speedup. However, the alternative approach, in which we retrieve the individual indices for each security and then find the last one, results in a much faster calculation and uses much less data.

9.6.5.3 Grouped Attribute as Query Grows
This example will examine the behaviour of `g# as the requested universe grows. We will use more than one query format, and observe whether the format of the request impacts performance. The data used for the example is quote data drawn from our example database, and sorted on time to simulate realtime data. kdb+ will be restarted between tests.


//make dictionary containing groups of distinct syms drawn from our sym universe
q)syms:n!{(neg x)?sym}each n:1 10 100 500

//create rt quote table with `g#sym
q)rtquote:update `g#sym, `#time from `time xasc select from quote where date=first date

//comparing the performance of different query forms as we increase the number of securities requested
q)select first bid by sym from rtquote where sym in syms 1
...
q)select first bid by sym from rtquote where sym in syms 500

q)raze{select first bid by sym from rtquote where sym=x} each syms 1
...
q)raze{select first bid by sym from rtquote where sym=x} each syms 500

           no `g# attribute                              `g# set
           using in                using = and each      using in                using = and each
sym values time (ms)  size (b)     time (ms)  size (b)   time (ms)  size (b)     time (ms)  size (b)
1          132        301,990,624  133        301,991,104 1         787,408      1          787,904
10         197        402,653,920  1,319      301,993,792 29        12,583,696   13         790,592
100        309        402,653,920  13,204     302,020,608 417       201,327,824  125        817,408
500        444        536,879,984  27,484     33,965,504  3,634     536,879,984  616        935,680

Table 9 - Results from queries on data with `g# attribute as query grows


Figure 13: Comparison of query times with `g# attribute as query size grows (time elapsed (ms) vs sym values, for 'using in' with no attribute, 'using in' with `g# set and 'using each' with `g# set)

NB: omitted curve for ‘using = and each’ without attributes in order to examine more closely the faster-returning examples.

As we can see from the chart above, there are points at which it has been more performant to use = for comparison and loop over the universe of securities being requested, then join the result sets using raze. This is because select preserves the order of records in the table, so has to coalesce the indices from the `g# hash records for each security into a single ascending index when using in and a list, but this step is not necessary when using a function over a list of securities.

Note: If an aggregation on sym is being performed, using a function and operating over the list of values will result in the same data set, but if an aggregation is not being performed (e.g. select from rtquote where sym = x), then the result sets will differ – the result set for the function format will have data grouped into contiguous sym groups.


9.7 CONCLUSION

This paper has described a number of different techniques that are available to kdb+ developers when trying to optimise the performance of queries against a kdb+ database. We have looked at query structure, precalculation, map-reduce, and various effects from applying attributes to data. With each of these topics, we have attempted to provide an indication of the reasons for variations in the performance of kdb+, and to identify some limits and edge cases associated with these techniques.

It is important to point out that the use of these techniques should be married to an appropriate overall system design – taking into account the data structures they are being applied to, the type of access required, and any outside factors (hardware constraints, etc.). They should not be considered magic bullets that can be used to rescue an inefficient design – doing this will likely lead to larger problems in the longer term. It is recommended that developers experiment with the techniques outlined above - and additional performance techniques that we have not explored in this paper - to find the approach that best fits the use case being handled.

Tests performed using kdb+ version 2.8 (2012.05.29)


10 Multi-Partitioned kdb+ Databases: An Equity Options Case Study

Author:

James Hanna has helped design and develop kdb+ implementations and proof of concepts for more than 40 customers. Based in New York, James is a Technical Architect for Kx, a high performance data management, event processing and trading platform.

CONTENTS
Multi-Partitioned kdb+ Databases: An Equity Options Case Study ...... 192
10.1 INTRODUCTION ...... 194
10.1.1 Overview of the Dataset ...... 194
10.2 OPTIONS SCHEMA ...... 196
10.2.1 Loading and Saving Data ...... 197
10.2.2 Striping data over multiple partitions per date ...... 197
10.2.3 Adding links to market data ...... 198
10.3 EXAMPLE QUERIES ...... 199
10.3.1 Raw Options quote data retrieval with underlying quote ...... 199
10.3.2 Snapshot of option chain ...... 201
10.3.3 Building a minutely time series of at the money option contracts ...... 201
10.4 COMPRESSION ...... 203
10.5 CONCLUSION ...... 203


10.1 INTRODUCTION

Kdb+ is well suited to managing massive datasets, offering an unrivalled performance advantage when it comes to processing and analyzing data. This is a case study highlighting some of the key points we have found with regard to the storage and maintenance of financial equity options data in kdb+. We also provide some examples of possible ways to design and query these large databases efficiently.

10.1.1 Overview of the Dataset
The Equity options data universe is one of the largest financial datasets generated, and more often than not the most challenging dataset for trading firms to manage and extract value from in a timely manner. Hundreds of gigabytes of trade and quote records are published daily from the equity options feeds, with recent daily row count volumes for Q1 of 2012 having an average and maximum count close to 4 billion and 6 billion rows respectively. These numbers represent a relief of sorts from the peaks of 2011, where we saw a maximum daily row count of over 13 billion, as highlighted in the chart below.

The dataset we use for our examples in this paper includes level 1 trade and quote data for financial options contracts, the associated reference data and the corresponding equity level 1 trades and quotes for the same time period. All data has been sourced from tickdata.com.

The sample dataset has a maximum of approximately 200 million rows for a single date and includes data for 10 underlying securities. The full universe that our clients load is typically much larger, covering thousands of underlying securities. For some of these securities, for example AAPL, the ratio of the number of option quotes to underlying quotes can be in excess of 60:1 on some dates.

It is thus inevitable that when storing equity options data for a large universe of underlying securities, the number of rows per partition will regularly exceed 2 billion. The release of kdb+ 3.0 removed the limit on the number of rows that can be stored in a single partition. This gives us two options for storing massive tables: either storing each day of data in a single partition, or storing each day of data in multiple partitions.


In this paper we cover the use of a multi-partitioned database, as whilst kdb+ 3.0 allows the very straightforward option of having a single partition for each date, there are still potential advantages to the approach of storing data across multiple partitions for a single date.

Firstly, when saving data, multiple partitions can be written concurrently, potentially reducing the time required to load data from flat files or persist it from memory. Secondly, when running queries against the database, data from multiple partitions can be read in parallel using slave threads. In the same way as queries on a database with a single partition per date can read data for multiple days in parallel, now data for the same date can be read in parallel. A third advantage related to database maintenance is that since the size of the individual column data files is reduced, the memory required to apply a sort to the data on disk will be reduced.

Kdb+ provides a simple method to store the data in multiple partitions for each date by using the par.txt file. When attempting to read data from a database like this a large number of queries will behave exactly as they would in a database with only one partition per date. However there are some cases in which it’s necessary to rework things a little, most notably in the case of asof joins. This will be covered below.


10.2 OPTIONS SCHEMA
Sorting and indexing the options data is straightforward. If we have the data sorted by underlying security, option contract and finally timestamp, we can apply the parted attribute (`p#) to both the underlying security and option contract columns, allowing us to filter quickly on either column.

More interesting is the way in which we choose to store the underlying security market data so that we can link it to the options data efficiently. The simplest way of accomplishing this would be to store the prevailing value of each underlying alongside the option trades and quote data. These extra columns would either be provided in the source data (as was the case in our sample dataset) or could be pre-calculated with an asof join and then stored.
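A minimal sketch of that precalculation step for a single date of in-memory data, assuming the QUOTE and EQUOTE table and column names used later in this paper, might look as follows:

//prefix each option quote with the prevailing underlying quote before saving
q)QUOTE:aj[`underlyingSym`timestamp;QUOTE;select underlyingSym:sym,timestamp,lastBidPrice:bid,lastAskPrice:ask from EQUOTE]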

The advantage of this method is that we do not need to store any additional underlying market data and there is no overhead when performing lookups to find the prevailing underlying quote or last trade for an option quote or trade. However, there are two obvious downsides to using this approach. The first is that it does not offer any flexibility with regard to how the market data for an underlying security maps to the corresponding option data, e.g. if we wanted to do a window join to look at multiple underlying quotes surrounding each option quote. The second is that a significant amount of extra storage space will be required when the data is de-normalized in this way compared to storing the underlying market data in separate tables and doing joins on demand. Even with a frugal schema for the underlying data, this might add 40-50 bytes of storage for each record (depending, for example, on whether condition code and exchange fields can be stored as single characters or symbols are required). Given that there will be billions of option quotes per day, this can add hundreds of gigabytes to the daily storage requirements. It is worth noting that this may not be as large a problem as it first appears given the possibilities for data compression.
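The rough arithmetic behind the "hundreds of gigabytes" figure, assuming around 45 extra bytes per record and around 4 billion option quotes per day, can be checked directly in q:

//extra storage per day in GiB: 45 bytes x 4 billion rows / 2^30
q)45*4e9%2 xexp 30
167.6381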

A second possibility is to store underlying market data as completely separate tables and do any joins we require between the two datasets on a purely ad hoc basis. This option offers the lightest storage requirements and also gives us full flexibility in how we do the joins. It does, however, come with the cost of extra processing time when searching for the underlying market data related to the option data at query time. Examples of how this is done can be found in the queries section below.

A third option, a combination of the first two, would be to save the option and underlying data as separate tables, but to compute row indices in the underlying market data tables and store them in the option tables as link columns. This requires less space (a single integer column per link to each underlying table) than storing full underlying trade/quote information along with each option table row, avoids having to find the correct underlying trade/quote at query time and also gives flexibility by having all the underlying data available for ad hoc joins. Using the third option requires us to ensure that the underlying and option data for the same securities always reside in the same partition. We can achieve this as part of the load process outlined below.


10.2.1 Loading and Saving Data
Here we assume some familiarity with loading large data files by splitting the file and loading in chunks using .Q.fs and .Q.fsn. For more details please refer to this section on code.kx.com: http://code.kx.com/wiki/Cookbook#kdb.2B_dealing_with_big_data.

10.2.2 Striping data over multiple partitions per date
Firstly we demonstrate how to save a chunk of loaded and parsed data into a database with multiple partitions per date. In our case we will split the data alphabetically by underlying symbol into groups as follows: ABC, DEF, GHI, JKL, MNO, PQR, STU and VWXYZ.

We should have a par.txt file containing paths to directories for each of these symbol groups. The contents of our par.txt file are thus as follows:

/data/0
/data/1
/data/2
/data/3
/data/4
/data/5
/data/6
/data/7

Before demonstrating how we can stripe the data over these directories, it is worth noting that if future data volumes increase and we wish to partition the data into a greater number of stripes, we can do this by adding new directories to par.txt without the need to go back and repartition old data. We would only need to create empty tables for each existing date for each partitioned table in our db.
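A minimal sketch of that step, assuming a hypothetical new stripe directory /data/8 has been added to par.txt (DIR is the database root handle used by the loader code below), is to write an empty copy of each partitioned table, here just QUOTE, into the new stripe for every existing date:

//an empty table with the QUOTE schema (the date column is virtual, so drop it)
q)emptyQ:0#delete date from select[1] from QUOTE where date=first date

//save the empty table into the new stripe for each existing date
q){(` sv `:/data/8,(`$string x),`QUOTE,`) set .Q.en[DIR] emptyQ} each date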

With regard to saving a chunk of data, there are many ways to do this but we provide an example below where we read in a list of available partitions and create some helper functions to allow us to easily partition and save the data. In the code below the directory is assumed to be a global variable (DIR) giving the path to our par.txt file in q format (symbol with a leading colon).

// A dictionary mapping alphabetical group to the directory
// in the database we wish to save to
dirs:`ABC`DEF`GHI`JKL`MNO`PQR`STU`VWXYZ!hsym each `$read0 ` sv DIR,`par.txt

// A function which will return a list of partitions
// to which each of a list of symbols should be saved.
getpart:.Q.fu {[symlist] key[dirs]0 3 6 9 12 15 18 21 bin .Q.A?first each string symlist,()}
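For example, a call such as the following (the ticker symbols are purely illustrative) maps each symbol to its alphabetical group, whose directory can then be looked up in dirs:

//map symbols to their target stripe groups
q)getpart`AAPL`GOOG`MSFT`XOM
`ABC`GHI`MNO`VWXYZ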

saveonepart:{[dt;tablename;data;part2save] (` sv dirs[part2save],(`$string dt),tablename,`)set .Q.en[DIR] delete part from select from data where part=part2save}

We could use the following snippet to save some quote data stored in a variable CHUNK inside a function called by .Q.fs or .Q.fsn. DATE is assumed to be a global variable here representing the date for which we are loading data. It is elementary to modify this to include the possibility of data for more than one date in a particular file load. This is excluded here for the sake of simplicity.


{ ..
  CHUNK:update part:getpart underlyingSym from CHUNK;
  saveonepart[DATE;`QUOTE;CHUNK]each distinct exec part from CHUNK;
  .. }

By implementing the previously defined getpart function to generate the partition to save down to based on either the underlyingSym column for options data or the sym column for underlying market data, we can be sure that related options and underlying data will be stored in the same partitions.

Once the data has been loaded and saved we will often need to apply an attribute to one or more of the columns of the data. This is a fairly easy step as we just need to apply the attribute to a table spread over multiple directories instead of one. In our example of option quotes we would like to apply the parted attribute (`p#) to both the sym and underlyingSym columns. We can do this as follows:

// function applies p# attribute to sym and underlyingSym columns
// of the quote table for the specified date and directory
addphashes:{[dt;dir] {[dt;dir;f]@[` sv dir,(`$string dt),`QUOTE;f;`p#]}[dt;dir]each `sym`underlyingSym}

This may be called after loading all the data in our script:

addphashes[DATE]each value dirs;

10.2.3 Adding links to market data
If we wish to store links to the underlying data within the option data, a sensible time to generate and store these links would be just after we have loaded, sorted and added attributes to the data. We can do this as part of a loader script with the following code. Here we just create one link stored in the QUOTE table to the corresponding quote in the EQUOTE (underlying quote) table:

dirs:`$read0 ` sv DIR,`par.txt
addlinks:{[dt;dir]
  dir:` sv dir,`$string dt;
  // compute links as an asof join.
  inds:select ind:x from aj[`sym`timestamp;
    select sym:underlyingSym,timestamp from dir`QUOTE;
    select sym,timestamp,i from dir`EQUOTE];
  // save the links
  (` sv dir,`QUOTE`underlying)set `EQUOTE!exec ind from inds;
  // update the metadata of the QUOTE table
  u set distinct get[u:` sv dir,`QUOTE`.d],`underlying}

Again we should use this for each partition for the date we have just loaded.

addlinks[DATE]each value dirs;


10.3 EXAMPLE QUERIES
In all of the examples below variables in caps are used instead of specific values.

10.3.1 Raw options quote data retrieval with underlying quote
In the simplest case where we have the prevailing underlying quote stored alongside the option quote as columns lastBidPrice and lastAskPrice, our query is:

select sym, timestamp, bidPrice, askPrice, lastBidPrice, lastAskPrice from QUOTE where date=DATE, sym=SYM, time within (STARTTIME;ENDTIME)

For the dataset where we have links to the prevailing underlying quote stored we can use:

select sym, timestamp, bidPrice, askPrice, underlying.bid, underlying.ask from QUOTE where date=DATE, sym=SYM, time within (STARTTIME;ENDTIME)

In the case where we have the options data and underlying data stored in separate tables, we could usually use an asof join. However, since our database has multiple partitions per date, specifying the right most argument of the asof join in the usual way does not behave in the same manner as in the case of databases with a single partition per date. For example, if we look at a standard query to get the prevailing quote as of each trade we encounter some problems.

aj[`sym`time; select sym, time, price from TRADE where date=SOMEDATE, sym in SYMLIST; select sym, time, bid, ask from EQUOTE where date=SOMEDATE]

In the case of a database with a single partition per date, this asof join does not read the entire sym, time, bid and ask columns into memory before performing the lookup, rather it searches for the correct rows from a memory map of the quote table. In a database with multiple partitions per date, the following part of the previous query proves problematic:

select sym, time, bid, ask from EQUOTE where date=SOMEDATE

This now causes all of the data to be mapped and read into memory so that the asof join can be performed. Not only is this undesirable due to the extra I/O required but because multiple memory mapped partitions are being collapsed into one in memory table, it also has the detrimental side effect of removing the partitioned attribute normally found on the sym column since rows with the same symbol could occur in multiple partitions. This is not something that should actually occur in this database but since it is possible, q will defensively remove the partitioned attribute. In addition, because part of the reason we chose a database with multiple partitions for each date was to ensure no single partition was bigger than 2 billion rows, we may even hit a limit error when trying to do asof joins against large tables (this could only occur when using a version of kdb+ older than 3.0).

The result of the above is that not only would a join take longer because more data is being read from the disk, but the join itself will also be performed much more slowly. A particular technique to work around this would be to create our own function which will perform joins against each memory-mapped partition for a certain date without reading the full table into memory, then aggregate the results from each partition. In the function below the parameters are specified as follows:
• c is a list of column names upon which to perform the asof join, just as we would have with a regular aj
• t1 is the first table, similarly just as in the case of a regular aj


• t2n is the second table name. This is a symbol giving the name of a partitioned table. To do an asof join on a non-partitioned table a regular aj should be used
• t2d is the date upon which to do the join. We should do one date at a time if there are multiple dates
• t2c is a list of column expressions for the selection from the partitioned table, e.g. (`Ticker;`Timestamp;`Trade;`TradeSize;(log;`TradeSize)). This will usually just be some subset of the column names in the table
• t2cn is a list of column aliases in the parted table, e.g. `Ticker`Timestamp`price`size`lnprice

ajparted:{[c;t1;t2n;t2d;t2c;t2cn]
  if[not all c in t2cn;'`missingcols];
  / we want just one row per row in the input, so put a row id as a key
  / and just fill values in the input table with rows that are found for the second table
  / build a table with the right schema to start
  t1:`rid xkey update rid:`s#i from aj[c;t1;?[t2n;enlist(<;`date;first date);0b;t2cn!t2c]];
  / do aj's on each partition of the second table with the first table,
  / only return rows that we have values for, then upsert these to the keyed t1
  delete rid from 0!t1,/{[c;t1;t2n;t2d;t2c;t2cn;x]
    inds@:vinds:where not null inds:(c#M:?[(` sv`$string x,t2d)t2n;();0b;t2cn!t2c])bin (c#value t1);
    1!(0!t1)[vinds],'M inds}[c;t1;t2n;t2d;t2c;t2cn] peach distinct .Q.pd}

Returning to our original example query of getting the prevailing underlying quote information for a particular set of option quotes, we can use the following call to the ajparted function:

t1:select sym, underlyingSym, timestamp, bidPrice, askPrice from QUOTE where date=SOMEDATE, sym in SYMLIST, timestamp within (STARTTIME;ENDTIME)

ajparted[`underlyingSym`timestamp;t1;`EQUOTE;SOMEDATE;`sym`timestamp`bid`ask ; `underlyingSym`timestamp`bid`ask];


10.3.2 Snapshot of option chain
This is an example where we have an underlying symbol and want to get a snapshot of the quotes available for all options at a given time. Firstly, we will need to query the security master table (called mas in our case) for all option data available on the given date for this underlying and then find the last quote available for each option at, or prior to, the given time.

The available option contracts may be queried from a security master table, mas, using the following:

optsyms:select sym from mas where date=OURDATE, underlyingSym=OURSYM

Now that we have all of the options contracts we require, the question becomes how to query the QUOTE table to get the available quote for each option at the given time. One way to do this would be to write a query to extract the last quote prior to the time in question for each option:

select last bid, last ask by sym from QUOTE where date=OURDATE, sym in optsyms`sym, time<=OURTIME

However, this is another place where we would normally use an asof join since it allows us to efficiently search for the last record prior to the time in question rather than scan through all of the records for each symbol to see if the timestamp constraint is fulfilled. We can use the same function from the previous example to do an asof join here. Firstly, we use the cross function to create a table of symbol and time pairs in order to carry out the asof join:

rack:optsyms cross ([]timestamp:1#OURTIME)

Now we can use ajparted to find the correct rows from the QUOTE table

ajparted[`sym`timestamp;rack; `QUOTE;OURDATE;`sym`timestamp`bid`ask;`sym`timestamp`bid`ask]

In the previous example we were able to avoid an ad hoc asof join provided we had underlying data (or pre calculated links to it) stored alongside the options quote data. In this case however, we will need to use an asof join regardless of how the data is stored.

10.3.3 Building a minutely time series of at-the-money option contracts
In our final example, we demonstrate the creation of a minutely time series based on the idea of a generic option contract. In this case we choose a condition specifying this contract as the next expiring call contract nearest to at-the-money.

Initially, we create a time series of minutely bars for the underlying symbol so we can determine which actual option contract we should use at each point. Given we are working with relatively low volume trade data, we do this with a regular grouping and aggregation query. Note that we could also use ajparted with a rack of symbol and time pairs as in the above example.

bars:select last price by time.minute from ETRADE where date=OURDATE, sym=OURSYM


We now need to find the particular contracts that are available on our chosen date and the one that is closest to at-the-money at each point in our series.

We can wrap up the logic for choosing a contract into a function with parameters for date, underlying security, contract type (call or put) and price:

closest2atm:{[d;s;t;p]
  / A list of all the nearest expiring contracts on this date
  cands:`strike xasc select sym,strike from mas where date=d,underlyingSym=s,typ=t,expir=(min;expir)fby underlyingSym;
  / A list of all strike prices with the midpoints between them,
  / we can then use bin to find the contract with the strike price closest to each of our prices
  searchlist:1_raze{avg[x,y],x}':[cands`strike];
  inds:searchlist bin p;
  / Any odd indices in inds mean price is closer to the strike above
  / add one to these and divide everything by 2 to give the indices into cands
  inds[where 1=inds mod 2]+:1;
  inds:inds div 2;
  / return the list of at the money symbols
  cands[`sym]inds}

update sym:closest2atm[OURDATE;OURSYM;"C";price] from `bars

Finally we query the closing bid and ask for each of these bars:

ajparted[`sym`timestamp;select sym,minute,timestamp:`timespan$minute,price from bars;`QUOTE;OURDATE;`sym`timestamp`bid`ask;`sym`timestamp`bid`ask]


10.4 COMPRESSION

The use of compression for all or part of the dataset here is outside the scope of this whitepaper, however, this is undoubtedly an important feature to consider using for options data. The FAQ section on code.kx.com is a good starting point on this topic: http://code.kx.com/wiki/Cookbook/FileCompression.

10.5 CONCLUSION

Storing and querying options data can present challenges due to the volume of data involved. Whilst the introduction of kdb+ 3.0 lets clients easily handle the ever-increasing data volumes by removing the per-partition limit of 2 billion rows, there are still several potential benefits to storing the tick data in multiple partitions per date, as outlined in the introduction. As we have seen, however, in choosing this approach it will be necessary to write some queries in a different way to achieve the best performance.

Regardless of whether the data is stored in a single partition per date or in a multi-partitioned manner, the choice of schema will still impact the storage space required for the data and queries against the database. Here we have presented several different schema choices and examples of several simple queries implemented for each schema.


11 Surveillance Technologies to Effectively Monitor Algo and High Frequency Trading

Author:

Sam Stanton-Cook, Ryan Sparks, Dan O’Riordan and Rob Hodgkinson


CONTENTS
Surveillance Technologies to Effectively Monitor Algo and High Frequency Trading ...... 204
11.1 RISING TECHNOLOGIES AND THE ROLE OF SURVEILLANCE ...... 206
11.2 DEMAND FOR SURVEILLANCE ...... 207
11.3 DETECTING HIGH FREQUENCY TRADERS ...... 208
11.4 EXAMINING POTENTIALLY HARMFUL STRATEGIES ...... 212
11.4.1 Quote Stuffing ...... 212
11.4.2 Price Fade ...... 213
11.5 CONCLUSION ...... 221


11.1 RISING TECHNOLOGIES AND THE ROLE OF SURVEILLANCE

Algorithmic trading has been utilised globally since the implementation of matching engines in modern exchanges. Such technological advances have increased the capacity of markets to process orders and trades by removing human limitations. As a result, the timescale of the market shifted from seconds to milliseconds and the market surveillance moved from the trading pit to computers. The purpose of market surveillance, whether by a regulator or exchange, is to maintain market integrity and protect the participants against unethical behaviour.

Once again, there is a technology race surfacing out of the algorithm-dominated market place. Speed is king in the high frequency trading (HFT) game, where participants now operate with microsecond precision. Analytical algorithms that do not require high frequency speeds, such as pairs trading, statistical arbitrage and mean reversion, can process in 200-500 microseconds whilst remaining effective. High frequency systems require latencies as low as 20-50 microseconds to execute speed-orientated strategies. These systems rely upon sophisticated software, hardware and a large investment in exchange co-located servers to shave nanoseconds off message sending and receiving times. The practice of high frequency trading is observed globally in the capital markets, accounting for 60-70% (Knowledge@Wharton, 2009) of the liquidity in some U.S. stocks, while more recently it was predicted to account for around a third of all trades in the Australian markets (ASIC, 2013). High frequency trading in Australia is less prevalent and relatively innocuous compared to other global markets, as highlighted in the article published on Finextra (Finextra, 2014) and detailed in the findings of ASIC's investigation into HFT and dark pools (ASIC, 2013). However, the real concern is the lack of sophisticated tools to accurately monitor and profile high frequency and algo patterns, which is the topic of this paper. For regulators to be effective in evolving markets, the surveillance software must evolve with changes in market behaviour. In the case of high frequency trading, the system needs to be capable of capturing and processing data at the same speeds.

Kdb+ is a time-series relational database language that combines both a programming language and the underlying database. It is therefore an ideal tool for analysing and identifying high frequency trading, providing an easy way to analyse and compare real-time data with large amounts of historical data. The concise and interpreted language enables rapid development, which is integral for working with a dynamic environment such as the HFT world. Furthermore, an important part of any analysis is determining effective parameters for functions – having high speed access to market data and the ability to process it in real time means that back testing for calibration purposes can be achieved quickly and easily. Similarly, once an automated method for identifying HFT behaviour is implemented, all the tools are there for more “manual” verification.

All tests performed using kdb+ version 3.1 (2014.02.08)


11.2 DEMAND FOR SURVEILLANCE

An increase in high frequency related incidents has shaken investor confidence and raised global concerns about market stability and integrity. The 'Flash Crash' in 2010 showed that high frequency strategies can have rogue tendencies in the right market conditions.

High frequency trading has also been associated with unethical trading practices and, in some instances, market manipulation. In 2010 Trillium Capital was prosecuted and fined $1 million (USD) for market manipulation (Comstock, 2010). Since then other high frequency brokers have been seen to systematically manipulate the price of a stock by using techniques such as quote-stuffing and layering. The intention of these techniques is to create a false image of the market depth by placing orders that were never intended to be fully traded. By utilising the low latency of the high frequency trading system, the orders can be cancelled before the market has time to respond.

Unrivalled speed allows these traders to detect large orders entering the market on one exchange and react by purchasing or selling the stock on another, capitalising on their knowledge of a participant's intention to buy or sell. The practice is commonly referred to as front running and has always had a presence in the marketplace. The premise of front running is identifying a definite buyer (someone who shows intent to buy), buying from a seller and immediately selling to that buyer at an inflated price, all the while carrying no risk. The introduction of smart order routers and improvements to low latency trading systems have enabled high frequency traders to implement this strategy at a new level. Hedge funds and banks have created algorithms to break up their clients' larger orders in an attempt to prevent HFTs from front running them. Even so, it was estimated that $1.9 billion (AUD) was lost to high frequency traders acting as intermediaries between two brokers, taking the riskless profits (Sydney Morning Herald, 2014).

To ease the effects of rogue algorithms and to stop crash events from reoccurring, exchanges and regulators have implemented preventative measures such as price circuit breakers, where the exchange matching engine has preconfigured limits. When these limits are hit, trading in that product is suspended. Protecting fundamental investors and day traders from the effects of high frequency trading is not as simple. To restore market confidence, the behaviour of high frequency traders must be understood and seen by the public to be monitored. Identifying and categorizing these traders is the first step; however, it is difficult to separate them from the rest of the market. Detecting these traders is achievable by applying a series of simple metrics to historical or real-time data, and their identification allows further investigation of potentially unsavoury behaviour.


11.3 DETECTING HIGH FREQUENCY TRADERS

To understand the nature of high frequency trading, a surveillance solution as sophisticated as the trading systems themselves is required to monitor HFT behaviour in real time. A kdb+ system can capture and analyse live data to efficiently track these traders using a handful of common metrics. The metrics serve two purposes: firstly, to identify potential high frequency traders; secondly, to determine whether any of the identified traders are negatively impacting the market. For a description of the tables and columns referred to in the following sections, refer to Appendix B. All images in the document have been created using Dashboards for Kx, which visualises historic and real-time data by interfacing with kdb+.

• Order to trade ratio (OTR) – The order to trade ratio metric calculates the total number of order messages divided by the number of trades at a broker, client or account level. It identifies traders that are amending or cancelling orders at a far higher rate than they are trading. High frequency traders are usually identified as having a ratio greater than 15. The heat map below illustrates 200 Australian brokers with ratios ranging from 7.29 to 1434.67. The metric can be tweaked and focused by examining the maximum OTR in 10-minute buckets or by scanning for bursts in OTR throughout the trading day.

Figure 1 Heat map of the Order to Trade ratio (OTR) for 200 brokers visualised via Dashboards for Kx. Approximate colour scheme: red > 1000 > orange > 15 > blue > 0. All codes have been scrambled.

// - Realtime order to trade ratio q code. Calculation based on the last x minutes of the market
OrderToTradeRatio:{[x]
  update OrderToTrade:OrderCount%TradeCount from
    lj[select OrderCount:count i by brokerID from dxOrder where time>.z.P-"u"$x;
       (select TradeCount:count i by brokerID:buyBrokerID from dxTrade where time>.z.P-"u"$x)+
       (select TradeCount:count i by brokerID:sellBrokerID from dxTrade where time>.z.P-"u"$x)]
  }

// - Select the trade count by broker on the buy and sell side from dxTrade and add the keyed tables together
// - Count the number of orders from the dxOrder table and left join (lj) the trade count table to create a table comprised of brokerID, OrderCount and TradeCount


• Cancellation rates – This metric is designed to detect a technique known as 'fishing', whereby HF traders rapidly create and cancel orders to test the market within the spread. To calculate the cancellation rate we count the number of orders that each broker created and cancelled within a specified holding time. The example below illustrates the number of orders that each broker held for less than 1 millisecond. While the technique is not necessarily manipulative, it can intentionally coerce other participants and their algorithms to enter the market. Since these orders are unlikely to be filled by fundamental investors, they add no benefit to liquidity or the spread. Additionally, the high message turnover can be disruptive to the market and its participants. High cancellation rates within the spread can be indicative of quote-stuffing (described in section 11.4).

Figure 2 Column chart displaying the number of orders per broker that were held for less than 1 millisecond visualised via Dashboards for Kx. The selection only shows brokers that had at least 750 orders cancelled within the specified holding time. All codes have been scrambled.

// - Cancellation rate calculation

select count i by brokerID from dxOrder
  where orderType in `new`cancelled,
        00:00:00.001 > 1D^ ({x - prev x};transactTime) fby orderID

// - Use fby in the where clause to group transactTime per orderID
// - Apply the function {x-prev x} to each orderID's list of transactTime to calculate the time between each message
// - Count each message in the table by brokerID where the time between the message and the previous message (by orderID) is less than 1 millisecond

• Daily Turnover – Huge traded volumes and short holding times are a major indicator of high frequency activity. HF traders avoid holding overnight risk, so it is common for them to close most positions before the end of the day. This allows us to look back historically and search for brokers that have closed out all positions in a number of stocks. A position is considered closed out if the volume bought is equal to the volume sold. Counting the number of stocks that are completely closed out within a day will also identify day traders. The purpose of this metric is therefore to provide a list of possible HF traders that can be further investigated.

The image below illustrates a range of brokers that bought and sold equal parts of at least 14 stocks in the same day. The size of the bubble represents the number of stocks that were completely closed out and the colour indicates the turnover within the day. This value can be adjusted to look for traders that partially closed out positions. By drilling into the activities of a specific trader (i.e. the highlighted green bubble below) the total volume bought and sold within the day can be examined.
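As an illustration, a minimal q sketch of this close-out check might look like the following. It is a hypothetical example rather than the production code behind the dashboards: it assumes the dxTrade columns described in Appendix B plus an available date column, and the function name closedOutByBroker is illustrative.

// - Hypothetical sketch: per broker, count the stocks fully closed out on a given date
// - Assumes dxTrade as described in Appendix B, with a date column available
closedOutByBroker:{[dt]
  bought: select bvol:sum qty by brokerID:buyBrokerID, sym from dxTrade where date=dt;
  sold:   select svol:sum qty by brokerID:sellBrokerID, sym from dxTrade where date=dt;
  // - inner join on broker and stock, then flag stocks where bought volume equals sold volume
  select closedOutStocks:sum bvol=svol, stocksTraded:count i by brokerID from (0!bought) ij sold }

// - e.g. brokers that closed out at least 14 stocks on the day
// - select from closedOutByBroker[2013.10.08] where closedOutStocks>=14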


Figure 3 Turnover ratio to identify traders with high close out positions (typically day traders). The bubble size indicates the number of stocks that were entirely closed out within the day while the colour represents the daily Profit and Loss. Image visualised via Dashboards for Kx.

Figure 4 Volume bought and sold for a broker that had a very high turnover ratio visualised via Dashboards for Kx. In this example it is clear to see almost all of the stocks were close outs (equal bought/sold).

• Message Profiling – HF traders can be further categorized by tracking the types of messages they send and the rate at which they send them. By calculating the time between each message from a person of interest, a profile can be drawn that details the messaging capabilities of that trader. Traders whose message-rate histogram is heavily weighted towards short intervals (the majority of orders under 20ms) are likely to be high frequency traders. The order distribution of a trader or participant also assists in identifying frequent cancellations and 'intent to trade' where perhaps very few trades are actually generated.

Figure 5 Message profiling. Left chart shows proportion by orderType (New, cancelled, replaced etc.) and right chart shows the message rates as a frequency histogram by message latencies (sub-second). Images visualized via Dashboards for Kx.


// - Time profiling code. Bucket sizes and labels
timekey : `$("0";"0-2ms";"2-5ms";"5-20ms";"20-50ms";"50-200ms";"200-500ms";"0.5-1secs";">1secs");

timebuckets : 0D00:00:00 0D00:00:00.000000001 0D00:00:00.002 0D00:00:00.005 0D00:00:00.02 0D00:00:00.05 0D00:00:00.2 0D00:00:00.5 0D00:00:01;

// - Query to produce the result in the right image of Figure 5
select NumberOfMessages:count i by TimeBucket: timekey timebuckets bin transactTime-prev transactTime
  from dxOrder
  where brokerID=`PersonOfInterest, orderStatus in `new`replaced`cancelled

// - Use the bin function to group the timespans between orders into set buckets
// - Apply the timekey list to give each bucket a sensible name

By focusing on parties of interest, regulators can better understand their individual trading techniques and monitor the market more effectively.


11.4 EXAMINING POTENTIALLY HARMFUL STRATEGIES

11.4.1 Quote Stuffing

Quote-stuffing is a technique whereby a trader floods the market with orders and rapidly cancels them. The 'noise' created by the influx of orders can affect other traders' latency and create an arbitrage opportunity for the high frequency trader. If the orders are 'stuffed' within the spread, other traders can be given a false impression of the market. Quote-stuffing generally occurs in bursts and on average lasts less than 10 seconds (Credit Suisse, 2012).

// - Function to pull back the quote table on a date (dt) for any stock (s)
// - burstperiod is the length of time we expect the quote-stuffing to occur
// - pricemovements is the minimum number of movements that have to occur within the burstperiod

quoteFluctuationSym:{[dt;s;pricemovements;burstperiod]

select from dxQuote where date=dt,sym=s, pricemovements < ({[x] sum (<>':) x};bid) fby (burstperiod xbar transactTime.second)

};

// - Pull back all possible examples of quote stuffing for BHP on 2013.10.08 where the best bid
// - changed more than 60 times within a 5-second window

quoteFluctuationSym[2013.10.08;`BHP;60;5]

Figure 6 Quote stuffing example for BHP ($AUD) in 2013 visualised via Dashboards for Kx. The burst is comprised of over 60 alterations to the best bid over a period of 0.12 seconds.

Detecting instances of quote stuffing allows for market analysis around the time of the burst. Further analysis is used to determine market impact of the behaviour and could result in a warning or fine.


11.4.2 Price Fade

Price fade is when a trade occurs on a particular venue and, immediately afterwards, the volume disappears. This could potentially be due to traders layering the order book, their high speed enabling such rapid cancellations. Here we look at a function that searches for price fades using market data taken from the Australian markets. Below is the skeleton function for finding price fades (not all lines of code have been included, as indicated by '...'):

getPriceFadesData:{[dt;psym;timeThreshold;qtyThreshold;keycols]

/- get orders and trades
...
/- initial select statements have been excluded from this code block
/- create brokerID column from a combination of other columns, then add on AJKEY col
/- change side to be opposite side, to join with opposite side of aggressor indicator

orders: `AJKEY`transactTime xasc update AJKEY: .Q.fu[{`$"!" sv string value x}each;(`orderside,keycols)#orders] from orders;

Here we have added an AJKEY column that creates a unique key: a combination of orderside, sym and any other key input (e.g. marketID), separated by "!" and cast to symbol. It makes use of the function .Q.fu, an efficient tool when a function needs to be applied across a large list composed of a small set of unique values. .Q.fu determines the distinct set of elements in the large list and their indices. It then applies the function to the small set of unique elements. The results of that calculation are then applied back to the original list using the indices of each distinct element. By applying .Q.fu, the number of times the function is evaluated is reduced to the number of distinct elements in the list.
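As a small self-contained illustration of the saving .Q.fu provides (made-up data, not the paper's dataset):

q) v: 10000000 ? `GOOG`AAPL`MSFT          / large list containing only 3 distinct symbols
q) f: {`$"x",/:string x}                  / a function that operates on a whole list of symbols
q) \t f v                                 / f evaluated over all 10 million items
q) \t .Q.fu[f;v]                          / f evaluated over the 3 distinct items, results mapped back by index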

/- get window and wj on price/qty info

w: (tlist;tlist|(-1+trades[`nextTime]) & timeThreshold+tlist:trades`transactTime);

Here we define the time windows to be used (nextTime is the next trade time by AJKEY from the trades data we have pulled back). We only want to include information up until, at most, the next time a trade occurs while staying inside the timeThreshold, hence taking the minimum of the next trade time and timeThreshold + current trade time.
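A quick worked illustration of that window-end logic, using made-up timestamps:

q) tlist: 2013.10.08D10:00:00 2013.10.08D10:00:00.050       / trade times (illustrative)
q) nextTime: 2013.10.08D10:00:00.050 2013.10.08D10:00:01    / next trade time per AJKEY
q) timeThreshold: 0D00:00:00.1
q) tlist | (-1+nextTime) & timeThreshold+tlist              / window end per trade
2013.10.08D10:00:00.049999999 2013.10.08D10:00:00.150000000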

wres: wj1[w;`AJKEY`transactTime;trades;
          (update `p#AJKEY from orders;
           (::;`ordertime);(::;`seqNum);(::;`brokerID);(::;`orderprice);
           (::;`orderside);(::;`orderqty);(::;`orderLeavesQty))];

Surveillance Technologies to Effectively Monitor Algo and High Frequency Trading 213 | Page Technical Whitepaper

In this particular analysis we only want order information that occurs strictly within the time windows. To do this we use wj1, which considers orders on or after entry to the window (as opposed to wj which includes prevailing orders).
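To make the distinction concrete, here is a minimal standalone sketch with made-up quote and trade tables (nothing to do with the paper's dataset); the trailing comments describe the behaviour of each join:

q) t: ([] sym:`A`A; time:10:00:01.000 10:00:04.000)                              / two "trades"
q) qt: ([] sym:`A`A`A; time:10:00:00.000 10:00:02.000 10:00:03.500; px:100 101 102f)
q) w: (10:00:00.500 10:00:03.000; 10:00:02.500 10:00:04.000)                     / one window per trade
q) wj [w;`sym`time;t;(update `p#sym from qt;(::;`px))]
q) wj1[w;`sym`time;t;(update `p#sym from qt;(::;`px))]
/ wj returns px lists 100 101f and 101 102f - it also picks up the quote prevailing at window entry
/ wj1 returns ,101f and ,102f - only quotes on or after entry to each window are considered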

/- add on fade indicator

res: update fade: 0 ... /- remainder of the fade condition has been excluded from this code block

update fullFade:    ?[aggressorIndicator="B"; (sellLeavesQty=0) and fade; (buyLeavesQty=0) and fade],
       partialFade: ?[aggressorIndicator="B"; (sellLeavesQty>0) and fade; (buyLeavesQty>0) and fade]
  from res;

};

This enables us to look at the probability of a price fade occurring:

/- calculates the probability of full/partial price fades occurring over time (given a date range)

getPriceFadesProbability:{[sd;ed;psym;timeThreshold;qtyThreshold;keycols]

/- first run price fades on weekdays

res: getPriceFadesData[;psym;timeThreshold;qtyThreshold;keycols] peach dtlist where 1<(dtlist: sd+til 1+ed-sd) mod 7;

Above we run the price fades calculation over a list of weekday dates using peach (with the process having been started with -s set to enable slave threads).
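For example, a sketch of the difference slave threads make (illustrative only; it assumes the parameters above are already bound in the session, and the actual speed-up depends on core count and data volumes):

$ q -s 4                                                                          / start kdb+ with 4 slave threads
q) \t getPriceFadesData[;psym;timeThreshold;qtyThreshold;keycols] each  dtlist    / single-threaded
q) \t getPriceFadesData[;psym;timeThreshold;qtyThreshold;keycols] peach dtlist    / work split across slaves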

...
/- exit early if no price fades
if[not count res: raze 0!' res where 0 ...

/- calculate probabilities of full and partial fades

select probFullFade: 100*sum[fullFade]%count i, probPartialFades: 100*sum[partialFade]%count i, trades: count i by bkt xbar transactTime.minute from res

};


Here we can determine full takes (i.e. when the entire quantity has been filled):

getPriceFades:{[dt;psym;timeThreshold;qtyThreshold;keycols]

res: getPriceFadesData[dt;psym;timeThreshold;qtyThreshold;keycols];

res: 0! select by transactTime, orderSeqNum from ungroup select transactTime,orderSeqNum:seqNum, price, brokerID, orderqty, buyLeavesQty, sellLeavesQty, orderprice,orderside,aggressorIndicator from res;

/- classify full takes as having no leaves quantity remaining

update fullTake: ?[aggressorIndicator="S"; buyLeavesQty=0; sellLeavesQty=0] from select from res

};

This enables us to profile different brokers based on partial and full fades, finding the average order quantities for each.

/- profiling different brokers

getPriceFadesAggregated: {[dt;psym;timeThreshold;qtyThreshold;keycols]

/- pull out price fades

fades: getPriceFades[dt;psym;timeThreshold;qtyThreshold;keycols];

/- aggregate the partial/full fades by brokers

select partialFades: sum not fullTake, fullFades: sum fullTake, bothFades: count i, avgFullTakes: avg fullTake*orderqty, avgPartialTakes: avg orderqty*not fullTake, avgBoth: avg orderqty by brokerID from fades

};

It is also interesting to look at the price fade time profile over a range of dates:

/- for profiling over time, range of dates

getPriceFadesTimeProfileDates:{[sd;ed;psym;timeThreshold;qtyThreshold;keycols;bkt]

/- pull out price fades

fades:getPriceFades[;psym;timeThreshold;qtyThreshold;keycols] peach dtlist where 1<(dtlist: sd+til 1+ed-sd) mod 7;


...

/- time profile over bkt sized minute buckets

select partialFades: sum not fullTake, fullFades: sum fullTake, bothFades: count i by bkt xbar transactTime.minute from fades

};

Examples

Sample outputs:

Sample 1: full price fade, time threshold set to 00:00:00.1, min order qty set to 100

transactTime      | 2013.10.08D10:17:10.888910900
sym               | `SYMA
marketSegmentID   | `MKTA
price             | 1.83
seqNum            | 451421 451422
qty               | 8f
buyLeavesQty      | 0f
sellLeavesQty     | 376f
aggressorIndicator| "S"
AJKEY             | `S!SYMA!MKTA
nextTime          | 2013.10.07D23:17:11.332014400
ordertime         | 2013.10.07D23:17:10.946351200 2013.10.07D23:17:10.946492600
brokerID          | ,`BROKER1!ACCT1!SECACCT1
orderprice        | 1.85 1.855
orderside         | "SS"
orderqty          | 6972 12138f
orderLeavesQty    | 0 0f
fade              | 1b
fullFade          | 1b
partialFade       | 0b


Sample 2: partial fade, time threshold set to 00:00:00.1, min order qty set to 100

transactTime      | 2013.10.08D12:03:05.133341400
sym               | `SYMB
marketSegmentID   | `MKTB
price             | 1.085
seqNum            | 2204182 2204184
qty               | 47f
buyLeavesQty      | 0f
sellLeavesQty     | 7254f
aggressorIndicator| "B"
AJKEY             | `B!SYMB!MKTB
nextTime          | 2013.10.08D01:03:05.164701700
ordertime         | 2013.10.08D01:03:05.146682300 2013.10.08D01:03:05.148648600
brokerID          | `BROKER1!ACCT1!SECACCT1`BROKER2!ACCT2!SECACCT2
orderprice        | 1.08 1.08
orderside         | "BB"
orderqty          | 604 7342f
orderLeavesQty    | 0 0f
fade              | 1b
fullFade          | 0b
partialFade       | 1b

Note on timing:

All the examples and sample outputs shown here were run on a regular laptop using the 32-bit version of kdb+ (2.4 GHz Intel Core i5, 8 GB 1333 MHz DDR3 memory), and even then the timings are relatively fast. It should be noted that the price fade functions are expensive by nature, e.g. they implement a window join, which involves a lot of heavy processing. Sample timing below:

KDB+ 3.1 2014.03.27 Copyright (C) 1993-2014 Kx Systems
m32/ 4()core 8192MB ryansparks ryans-macbook-pro.local 192.168.2.30 NONEXPIRE

q) \t getPriceFadesData[2013.10.07;`;0D00:00:00.1;100;`sym`marketSegmentID]
6510j

This clearly highlights the speed and succinctness with which kdb+ can be used to monitor and analyse high frequency trading patterns.


Example 1: probability of price fades occurring, under 00:00:00.001 second threshold

The probability of a full fade occurring here is shown to be around 10% throughout the trading day, with partial fades much more likely to occur.

Figure 7 Probability of price fades, 0.001 second time threshold. Full price fade in red, partial in blue

Example 2: probability of price fades occurring, under 00:00:00.1 second threshold

The probability of full fades is close to 12% throughout most of the trading day, with partial fades only about 30% as likely.

Figure 8 Probability of price fades, 0.1 second time threshold. Full price fade in red, partial in blue


Example 3: probability of price fades occurring, under 00:00:01 second threshold

With a much higher time threshold, full fades are about 14% likely to occur. Interestingly, partial fades occur at around 7%, relatively higher than at the smaller time thresholds.

Figure 9 Probability of price fades, 1 second time threshold

Example 4: volume of full vs. partial fades throughout a trading day, using 00:00:00.1 time threshold

The highest volumes of fades here tend to occur around lunchtime and towards the end of the day; interestingly, both full and partial fades follow a very similar trend.

Figure 10 Volume of price fades, 0.1 second time threshold


Example 5: average size of full fades, using 1 second time threshold

We see a relatively large number of mid-sized price fades occurring.

fadeSize| fadeCount
--------| ---------
0       | 31070
50      | 12139
100     | 34775
500     | 13856
1000    | 19926
10000   | 2575
100000  | 113
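A hypothetical sketch of the kind of aggregation that could produce a table like the one above (the bucket edges, the fades table and its fullFade/orderqty columns are assumptions based on the functions described earlier, with one row per faded order):

/- illustrative sketch: bucket full fades by order quantity
buckets: 0 50 100 500 1000 10000 100000f;
select fadeCount: count i by fadeSize: buckets buckets bin orderqty from fades where fullFade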

Example studies of real time HFT analysis:

Here we examine some of the footprints an HFT exhibits.

Figure 11 HFT and Algo profiling, whereby strategies become evident through attributes and metrics such as order rates, trade rates, holding times, buy/sell impact on the book etc. This could include Order to Trade ratios.


11.5 CONCLUSION

The focus of market surveillance is to maintain integrity within the market by ensuring there is a fair and safe environment for all participants. High frequency trading has the potential to make the market unsafe, and as such it is important to be able to identify and label high frequency traders. By specifically analysing their trading activity, unsavoury and predatory techniques can be better regulated and stamped out by the regulator and exchange alike. However, legitimate high frequency trading brokers need to be reassured that they will not be persecuted if they operate ethically and within their legal bounds.

kdb+ provides an ideal platform for this work: its flexibility and high speed allow real-time and historical analysis of these strategies to be developed quickly and dynamically as the HFT world evolves. The analysis can be used to detect algorithmically induced price crashes, and therefore give exchanges the power and foresight to stop possible crashes before too much damage is done.

Some simple real-market analysis with kdb+ indicates there are instances of unsavoury HFT practices occurring within the Australian market. The analysis found that the probability of a full price fade occurring is around 7 to 15% throughout a trading day. A standard personal laptop can analyse 10 million orders and 1 million trades for full and partial price fade information in under 7 seconds using the 32-bit kdb+ version, giving an indication of how powerful kdb+ is for big data time-series analysis. Having all of this done in around 10 lines of (relatively verbose for q) code shows why it is so well suited to HFT analysis. As the effects of such HFT practices are still not completely understood, this highlights the need for more stringent regulation, testing and research.

In September 2013 the Australian market regulator ASIC upgraded its market surveillance system to a kdb+ solution. Greg Yanco, a senior executive leader at ASIC, believes that the upgrade has created a faster, more flexible environment for their market analysts (fixglobal, 2014). The potential for storing and selecting data with kdb+ has made this solution ideal for innovating techniques and developing new ideas for the surveillance industry as a whole, making ASIC a world leader in surveillance technology.

All tests performed using kdb+ version 3.1 (2014.02.08)


APPENDIX A – Kx SOLUTIONS

A detailed list of the delta products referred to and demonstrated in this paper can be viewed at http://www.kx.com.

APPENDIX B – MARKET DATA

The data analysed throughout the paper is based on scrambled Australian market data captured from the ASX, CHIX and XSFE markets for the purpose of surveillance. Below is a brief description and a simplified meta of the tables referred to in sections 11.3 and 11.4.

- dxOrder: order table, containing approximately 10 million rows per day. This table contains all order transactions at a broker, account and client level. The types of order messages contained in the table include new, amend, cancel and restated messages.

meta dxOrder
c                 | t f a
------------------| -----
transactTime      | p     transaction time from exchange
sym               | s
marketSegmentID   | s     the market it traded on
seqNum            | j     exchange sequence number
price             | f     order price
qty               | f     original order quantity
side              | c     order side
leavesQty         | f     quantity remaining on order
brokerID          | s     relevant broker
account           | C     account ID
secondaryAccountID| C     secondary account ID

- dxTrade: trade table, containing approximately 1 million rows per day. Here is a basic meta of the table:

meta dxTrade
c                 | t f a
------------------| -----
transactTime      | p     transaction time from exchange
sym               | s
marketSegmentID   | s     market traded on
price             | f     traded price
seqNum            | j     exchange sequence number
qty               | f     traded quantity
buyLeavesQty      | f     quantity remaining on buy order
sellLeavesQty     | f     quantity remaining on sell order
buyBrokerID       | s     broker identifier on the buy order
buyAccount        | C     account identifier on the buy order
sellBrokerID      | s     broker identifier on the sell order
sellAccount       | C     account identifier on the sell order
aggressorIndicator| c     aggressor indicator


Bibliography

ASIC. (2013). Report 331 - Dark liquidity and high-frequency trading. Australian Securities and Investments Commission.

Comstock, C. (2010). First High Frequency Firm is Fined for Quote Stuffing and Manipulation. Business Insider.

Credit Suisse. (2012). High Frequency Trading – Measurement, Detection and Response.

Finextra. (2014). ASIC hits out at 'Flash Boys' hysteria.

fixglobal. (2014). The Challenge Of Global Market Surveillance.

Knowledge@Wharton. (2009). The Impact of High-frequency Trading: Manipulation, Distortion or a Better-functioning Market?

Sydney Morning Herald. (2014). High-speed trading costs investors $2b, say industry super.

University of Pennsylvania. (2009). The Impact of High-frequency Trading: Manipulation, Distortion or a Better-functioning Market? USA.

