arXiv:2002.01306v2 [math.PR] 18 Apr 2020

Linear and Fisher Separability of Random Points in the d-dimensional Spherical Layer

S. V. Sidorov, N. Yu. Zolotykh
Institute of Information Technologies, Mathematics and Mechanics
Lobachevsky State University of Nizhni Novgorod, Nizhni Novgorod, Russia
Email: [email protected], [email protected]

Abstract—Stochastic separation theorems play an important role in high-dimensional data analysis and machine learning. It turns out that in high dimension any point of a random set of points can be separated from all other points by a hyperplane with high probability, even if the number of points is exponential in terms of the dimension. This and similar facts can be used for constructing correctors for artificial intelligent systems, for determining the intrinsic dimension of data and for explaining various natural intelligence phenomena. In this paper, we refine the estimations for the number of points and for the probability in stochastic separation theorems, thereby strengthening some results obtained earlier. We propose the boundaries for linear and Fisher separability when the points are drawn randomly, independently and uniformly from a d-dimensional spherical layer. These results allow us to better outline the applicability limits of the stochastic separation theorems in applications.

Index Terms—stochastic separation theorems, random points, 1-convex set, linear separability, Fisher separability, Fisher linear discriminant

The work is supported by the Ministry of Education and Science of the Russian Federation (project 14.Y26.31.0022).

I. INTRODUCTION

Recently, stochastic separation theorems [1] have begun to be widely used in machine learning for constructing correctors and ensembles of correctors of artificial intelligence systems [2], [3], for determining the intrinsic dimension of data sets [4], and for explaining various natural intelligence phenomena, such as the grandmother's neuron [5], etc. If the dimension of the data is high, then any sample of the data set can be separated from all other samples by a hyperplane (or even by the Fisher discriminant, as a special case) with a probability close to 1, even if the number of samples is exponential in terms of the dimension. So, high-dimensional datasets exhibit fairly simple geometric properties. Due to the applications mentioned above, theorems of such a kind can be considered as a manifestation of the so-called blessing of dimensionality phenomenon [1], [6].

In its usual form, a stochastic separation theorem is formulated as follows. A random n-element set in R^d is linearly separable with probability p > 1 − ϑ if n < a b^d. The exact form of the exponential function a b^d depends on the probability distribution that determines how the random set is drawn, and on the constant ϑ (0 < ϑ < 1). In particular, uniform distributions with different support are considered in [1], [7]. Wider classes of distributions (including non-i.i.d.) are considered in [3]; roughly speaking, these classes consist of distributions without sharp peaks in sets with small volume. Estimates for product distributions in the cube and for the standard normal distribution are obtained in [9].

We note that there are many algorithms for constructing a functional separating a point from all other points in a data set (Fisher linear discriminant, linear programming algorithm, support vector machine, Rosenblatt perceptron, etc.). Among these methods, the Fisher discriminant is the computationally cheapest [2]. Other advantages of the Fisher discriminant are its simplicity and robustness. The papers [1]–[3], [7] deal only with Fisher separability, whereas [8] considered the (more general) linear separability. A comparison of the estimations for linear and Fisher separability allows us to clarify the applicability boundary of these methods, namely, to answer the question for what d it suffices to use only Fisher separability and there is no need to search for a more sophisticated linear discriminant.

In [8] there were obtained estimations for the cardinality of a set of points that guarantee its linear separability when the points are drawn randomly, independently and uniformly from a d-dimensional spherical layer and from the unit cube. These results give more accurate estimates than the bounds obtained in [1], [7]; here we give even more precise estimations for the number of points in the spherical layer that guarantee their linear separability. We also report the results of computational experiments comparing the theoretical estimations for the probability of the linear and Fisher separabilities with the corresponding experimental frequencies, and discuss them.

II. DEFINITIONS

A point X ∈ R^d is linearly separable from a set M ⊂ R^d if there exists a hyperplane separating X from M, i.e., there exists A ∈ R^d such that (A, X) > (A, Y) for all Y ∈ M.

A point X ∈ R^d is Fisher separable from the set M ⊂ R^d if (X, Y) < (X, X) for all Y ∈ M [2], [3].

A set of points {X_1, ..., X_n} ⊂ R^d is called linearly separable [1] or 1-convex [9] if any point X_i is linearly separable from all the other points in the set.
In other words, the set {X_1, ..., X_n} is linearly separable if the set of vertices of the convex hull conv(X_1, ..., X_n) coincides with {X_1, ..., X_n}.

The set {X_1, ..., X_n} is called Fisher separable if (X_i, X_j) < (X_i, X_i) for all i, j such that i ≠ j [2], [3].

The authors of [1], [7] formulate their results for linearly separable sets of points, but in fact in the proofs they used only that the sets are Fisher separable.

Denote by B_d = {X ∈ R^d : |X| ≤ 1} the unit ball in R^d centered at the origin O, and by rB_d the ball of radius r, 0 ≤ r < 1. Let M_n = {X_1, ..., X_n} be a set of points chosen randomly, independently, according to the uniform distribution on the spherical layer B_d \ rB_d. Denote by P(d, r, n) the probability that M_n is linearly separable, and by P^F(d, r, n) the probability that M_n is Fisher separable.

Denote by P_1(d, r, n) the probability that a random point chosen according to the uniform distribution on B_d \ rB_d is linearly separable from M_n, and by P_1^F(d, r, n) the probability that such a random point is Fisher separable from M_n.

III. PREVIOUS WORKS

In [1] it was shown (among other results) that for all r, ϑ, n, d, where 0 < r < 1, 0 < ϑ < 1,

P^F(d, r, n) > 1 − ϑ, if n < sqrt(2ϑ(1 − r^d)) (1 − r²)^{−d/4}.   (1)

The following statements are proved in [7]. For all r, where 0 < r < 1,

P_1^F(d, r, n) ≥ (1 − r^d)(1 − (n/2)(1 − r²)^{d/2}).   (2)

For all r, ϑ, where 0 < r < 1, 0 < ϑ < 1,

P_1^F(d, r, n) > 1 − ϑ, if n < 2(ϑ − r^d)(1 − r²)^{−d/2}.   (3)

For all r, where 0 < r < 1,

P^F(d, r, n) ≥ (1 − r^d)^n (1 − (n(n − 1)/2)(1 − r²)^{d/2}).   (4)

For all r, ϑ, where 0 < r < 1, 0 < ϑ < 1,

P^F(d, r, n) > 1 − ϑ, if n < (sqrt(r^{2d} + 2ϑ(1 − r²)^{d/2}) − r^d)(1 − r²)^{−d/2}.   (5)

Note that all the estimates (1)–(5) require 0 < r < 1. In [8] a bound on n was obtained under which P(d, r, n) > 1 − ϑ. Here we improve this bound (see Corollary 2) and also give the estimates for P_1(d, r, n) and P(d, r, n) and compare them with the known estimates (2), (4) for P_1^F(d, r, n) and P^F(d, r, n).

IV. NEW RESULTS

The following theorem gives a probability bound for the linear separability of a random point from a random n-element set M_n = {X_1, ..., X_n} in B_d \ rB_d. The proof uses an approach borrowed from [10], [11].

Theorem 1. Let 0 ≤ r < 1, d ∈ N. Then

P_1(d, r, n) > 1 − n/2^d.   (6)

Proof. Denote by C the event that a random point X, chosen according to the uniform distribution on B_d \ rB_d, falls into conv(M_n); the point X is linearly separable from M_n if and only if C does not occur. Then

P(C) = E[Vol(conv(M_n) \ rB_d)] / Vol(B_d \ rB_d).

Let us estimate the numerator of this fraction. Denote by S_i the ball with the segment [O, X_i] as its diameter; its radius is |X_i|/2 ≤ 1/2. Then conv(M_n) ⊆ S_1 ∪ ... ∪ S_n (see [10], [11]), and since rS_i ⊆ S_i ∩ rB_d, we obtain

Vol(conv(M_n) \ rB_d) ≤ Σ_{i=1}^{n} Vol(S_i \ rB_d) ≤ Σ_{i=1}^{n} Vol(S_i)(1 − r^d) ≤ nγ_d(1 − r^d)/2^d,
where γ_d is the volume of the ball of radius 1.

Fig. 1. The probability (frequency) that a random point is linearly (or Fisher) separable from a set of n = 10000 random points in the layer B_d \ rB_d. The blue solid line corresponds to the theoretical bound (6) for the linear separability. The red dash-dotted line represents the theoretical bound (2) for the Fisher separability. The crosses (circles) correspond to the empirical frequencies for linear (respectively, Fisher) separability obtained in 60 trials for each dimension d.

Hence

P(C) ≤ (nγ_d(1 − r^d)/2^d) / (γ_d(1 − r^d)) = n/2^d

and

P(C̄) = 1 − P(C) ≥ 1 − n/2^d.

Note that the bound (6) obtained in Theorem 1 does not depend on r. Nevertheless, the bound is quite accurate (in the sense that it shows behaviour close to the empirical values), as is illustrated by Figure 1. The results of the experiment show that the probabilities P_1(d, r, n) and P_1^F(d, r, n) are quite close, and the theoretical bound (6), compared with (2), approximates both probabilities well.

It is clear that the probabilities must increase monotonically when d increases, but in a real experiment the frequency cannot coincide precisely with the probability and can have non-monotonic behaviour. In our experiment (with 60 trials for each d) it is non-monotonic.

The following theorem gives the probability of the linear separability of a random n-element set M_n in B_d \ rB_d.

Theorem 2. Let 0 ≤ r < 1, d ∈ N. Then

P(d, r, n) > 1 − n(n − 1)/2^d.   (8)

Proof. Denote by A_n the event that M_n is linearly separable and by C_i the event that X_i ∉ conv(M_n \ {X_i}) (i = 1, ..., n). Thus P(d, r, n) = P(A_n). Clearly, A_n = C_1 ∩ ... ∩ C_n and

P(A_n) = P(C_1 ∩ ... ∩ C_n) = 1 − P(C̄_1 ∪ ... ∪ C̄_n) ≥ 1 − Σ_{i=1}^{n} P(C̄_i).

Let us find an upper bound for the probability of the event C̄_i. This event means that the point X_i belongs to the convex hull of the remaining points, i.e., X_i ∈ conv(M_n \ {X_i}). In the proof of the previous theorem it was shown that

P(C̄_i) ≤ (n − 1)/2^d   (i = 1, ..., n).

Hence

P(A_n) ≥ 1 − Σ_{i=1}^{n} P(C̄_i) ≥ 1 − n(n − 1)/2^d.

Note that the bound (8) obtained in Theorem 2 does not depend on r, although P(d, r, n) seems to increase monotonically with increasing r (for big enough n).

The following corollary gives an improved estimate for the number of points n guaranteeing the linear separability of a random point from a random n-element set M_n in B_d \ rB_d with probability at least 1 − ϑ.

Corollary 1. Let 0 ≤ r < 1, 0 < ϑ < 1,

n < ϑ 2^d.   (7)
Then P_1(d, r, n) > 1 − ϑ.

Proof. If n satisfies the condition n < ϑ2^d, then the inequality P_1(d, r, n) > 1 − ϑ holds by Theorem 1.

The bound (8) is also quite accurate, as is illustrated by Figures 2 and 3. The results of the experiment show that the probabilities P(d, r, n) and P^F(d, r, n) are quite close, and the theoretical bound (8), compared with (4), approximates both probabilities well.

Fig. 2. The probability (frequency) that the set of n = 1000 random points in the layer B_d \ rB_d is linearly or Fisher separable. The blue solid line corresponds to the theoretical bound (8) for the linear separability obtained in Theorem 2. The dash-dotted lines represent the theoretical bound (4) for the Fisher separability. The crosses (circles) correspond to the empirical frequencies for linear (respectively, Fisher) separability obtained in 60 trials for each dimension d.

Fig. 3. The probability (frequency) that the set of n = 10000 random points in the layer B_d \ rB_d is linearly or Fisher separable. The notation is the same as in Fig. 2.
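The experiments behind Figs. 1–3 can be sketched as follows. This is our reconstruction, not the authors' code; the inverse-CDF formula for the radius follows from the fact that Vol(tB_d) is proportional to t^d. Points are drawn uniformly from B_d \ rB_d and the empirical frequency of Fisher separability of the whole set is measured; since Fisher separability implies linear separability, this frequency also lower-bounds the linear one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_layer(n, d, r):
    """n points i.i.d. uniform in the spherical layer B_d \\ rB_d."""
    u = rng.standard_normal((n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)             # uniform direction
    # radius: P(|X| <= t) = (t^d - r^d) / (1 - r^d) for r <= t <= 1
    rho = (r**d + rng.random(n) * (1.0 - r**d)) ** (1.0 / d)  # inverse CDF
    return u * rho[:, None]

def set_fisher_separable(X):
    """(X_i, X_j) < (X_i, X_i) for all i != j (redefined to stay self-contained)."""
    G = X @ X.T
    ok = G < np.diag(G)[:, None]
    np.fill_diagonal(ok, True)
    return bool(ok.all())

def fisher_frequency(d, r, n, trials=60):
    """Empirical counterpart of P^F(d, r, n)."""
    hits = sum(set_fisher_separable(sample_layer(n, d, r)) for _ in range(trials))
    return hits / trials
```

For large d (say, d = 100 with n = 20) the frequency is essentially 1, while for small d it drops sharply, reproducing the qualitative shape of the curves in Figs. 2 and 3.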

Another important conclusion from the experiment is as follows. Despite the fact that both probabilities P(d, r, n) and P^F(d, r, n) are close to 1 for sufficiently big d, the "threshold values" of such a sufficiently big d differ greatly. In other words, the blessing of dimensionality when using general linear discriminants comes noticeably earlier than if we only use Fisher discriminants. This is achieved at the cost of constructing the usual linear discriminant instead of the Fisher one.

The following corollary gives an improved estimate for the number of points n guaranteeing the linear separability of a random n-element set M_n in B_d \ rB_d with probability at least 1 − ϑ. This result strengthens the result obtained in [8].

Corollary 2. Let 0 ≤ r < 1, 0 < ϑ < 1,

n < sqrt(ϑ 2^d).   (9)

Then P(d, r, n) > 1 − ϑ.

Proof. If n satisfies the condition n < sqrt(ϑ2^d), then by the previous theorem

P(d, r, n) > 1 − n(n − 1)/2^d > 1 − n²/2^d > 1 − ϑ.

Let us compare the estimates (8) and (4). Denote by f = n(n − 1)/2^d and by g the deviations from 1 of the right-hand sides of (8) and (4), respectively. If r and n are fixed and d → ∞, then the following asymptotic estimates hold:

1) if √2/2 < r < 1, then g ∼ n r^d and g/f = (2r)^d/(n − 1) → ∞;
2) if 0 < r < √2/2, then g ∼ (n(n − 1)/2)(1 − r²)^{d/2} and g/f = (2√(1 − r²))^d / 2 → ∞;
3) if r = √2/2, then g ∼ (n(n + 1)/2) 2^{−d/2} and g/f ∼ 2^{d/2}(n + 1)/(2(n − 1)) → ∞.

A similar comparison can be made when r and ϑ are fixed, for the maximal numbers of points admitted by (9) and (5); here the asymptotic regimes are delimited by the critical values r = √2/2 and r = (√5 − 1)/2 (the latter arising from the inequality r² + r − 1 < 0), and in all regimes the bound (9) admits asymptotically more points than (5).
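As a rough numerical check (our own illustration; g is computed from the product form (1 − r^d)^n (1 − n(n − 1)(1 − r²)^{d/2}/2) used for (4) above, so its exact values are tied to that assumed form), one can tabulate f, g, and the largest n admitted by Corollary 2:

```python
import math

def f_linear(d, n):
    """Deviation from 1 of the linear bound (8): f = n(n-1)/2^d."""
    return n * (n - 1) / 2.0**d

def g_fisher(d, r, n):
    """Deviation from 1 of the Fisher bound (4), in the product form assumed above."""
    return 1.0 - (1.0 - r**d)**n * (1.0 - 0.5 * n * (n - 1) * (1.0 - r * r)**(d / 2))

def n_max_linear(d, theta):
    """Largest integer n with n < sqrt(theta * 2^d), as in Corollary 2, (9)."""
    return math.ceil(math.sqrt(theta * 2.0**d)) - 1

# for fixed r and n the quotient g/f grows without bound as d increases
r, n = 0.5, 50
ratios = [g_fisher(d, r, n) / f_linear(d, n) for d in (100, 150, 200)]
```

For instance, with ϑ = 0.01 and d = 30, Corollary 2 admits up to n = 3276 points, and for fixed r = 0.5, n = 50 the quotient g/f grows rapidly with d.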
Thus, the estimate (8) tends to 1 faster than the estimate (4) for all 0 < r < 1.

REFERENCES

[1] A. N. Gorban, I. Y. Tyukin. Stochastic separation theorems. Neural Networks 94, 255–259 (2017)
[2] A. N. Gorban, A. Golubkov, B. Grechuk, E. M. Mirkes, I. Y. Tyukin. Correction of AI systems by linear discriminants: Probabilistic foundations. Information Sciences 466, 303–322 (2018)
[3] A. N. Gorban, B. Grechuk, I. Y. Tyukin. Augmented artificial intelligence: a conceptual framework. https://arxiv.org/abs/1802.02172v3 (2018)
[4] L. Albergante, J. Bac, A. Zinovyev. Estimating the effective dimension of large biological datasets using Fisher separability analysis. 2019 International Joint Conference on Neural Networks (IJCNN) (2019)
[5] A. N. Gorban, V. A. Makarov, I. Y. Tyukin. The unreasonable effectiveness of small neural ensembles in high-dimensional brain. Physics of Life Reviews 29, 55–88 (2019)