<<

Compound Poisson Point Processes, Concentration and Oracle Inequalities

Huiming Zhanga, Xiaoxu Wub

aSchool of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, 100871, China bMathematics Department, Rutgers University, Piscataway, NJ 08854, USA

Abstract This note aims at presenting several new theoretical results for the compound Poisson point process, which follows the work of Zhang et al. [Insurance Math. Econom. 59(2014), 325-336]. The first part provides a new characterization for a discrete compound Poisson point process (proposed by Acz´el[Acta Math. Hun- gar. 3(3)(1952), 219-224]), it extends the characterization of the Poisson point process given by Copeland and Regan [Ann. Math. (1936): 357-362]. Next, we derive some concentration inequalities for discrete com- pound Poisson point process (negative binomial with unknown dispersion is a significant example). These concentration inequalities are potentially useful in count data regressions. We give an ap- plication in the weighted Lasso penalized negative binomial regression whose KKT conditions of penalized likelihood hold with high probability and then we derive non-asymptotic oracle inequalities for a weighted Lasso . Keywords: characterization of point processes, Poisson random measure, concentration inequalities, sub-Gamma random variables, high-dimensional negative binomial regressions. 2010 Mathematics Subject Classification: 60G55, 60G51, 62J12

1. Introduction

Powerful and useful concentration inequalities of empirical processes have been derived one after another for several decades. Various types of concentration inequalities are mainly based on mo- ments condition (such as sub-Gaussian, sub-exponential and sub-Gamma) and bounded difference condition; see Boucheron et al. (2013). Its central task is to evaluate the fluctuation of empirical processes from some value (for example, the and median) in probability. Attention has greatly expanded in various area of research such as high-dimensional , models, etc. For the Pois- son or negative binomial count data regression (Hilbe (2011)), it should be noted that the (the limit distribution of negative binomial is Poisson) is not sub-Gaussian, but locally sub-Gaussian distribution (Chareka et al. (2006)) or sub- (see Sect. 2.1.3 of arXiv:1709.04159v3 [math.ST] 8 Dec 2019 Wainwright (2019)). However, few relevant results are recognized about the concentration inequali- ties for a weighted sum of negative binomial (NB) random variables and its statistical applications. With applications to the segmentation of RNA-Seq data, a documental example is that Cleynen and Lebarbier (2014) obtained a concentration inequality of the sum of independent centered negative binomial random variables and its derivation hinges on the assumption that the dispersion param- eter is known. Thus, the negative binomial random variable belongs to the . But, if the dispersion parameter in negative binomial random variables X is unknown in real world problems, it does not belong to the exponential family with density  fX (x | θ) = h(x) exp η(θ)T (x) − A(θ) , (1)

where T (x), h(x), η(θ), and A(θ) are some given functions. Whereas, it is well-known that Poisson and negative binomial distributions belong to the family of infinitely divisible distributions. Houdr´e(2002) studies dimension-free concentration inequalities for non-Gaussian infinitely divisible random vectors with finite exponential moments. In particular,

Email addresses: [email protected] (Huiming Zhang), [email protected] (Xiaoxu Wu)

Preprint submitted to Journal of Inequalities and Applications December 10, 2019 the geometric case, a special case of negative binomial, has been obtained in Houdr´e(2002). Via the entropy method, Kontoyiannis and Madiman (2006) gives a simple derivation of the concentration inequalities for a Lipschitz function of discrete compound Poisson random variable. Nonetheless, when deriving oracle inequalities of Lasso or Elastic-net estimates (see Ivanoff et al. (2016), Zhang and Jia Zhang and Jia (2017)), it is strenuous to use the results of Kontoyiannis and Madiman (2006), Houdr´e(2002) to get an explicit expression of tuning parameter under the KKT conditions of the penalized negative binomial likelihood (Poisson regression is a particular case). Moreover, when the negative binomial responses are treated as sub-exponential distributions (distributions with the Cramer type condition), the upper bound of the `1-estimation error oracle inequality is determined by the tuning parameter with the rate O( log p ), which does not show rate-optimality in q n log p ∗ the minimax sense, although it is sharper than O( n ). Consider linear regressions Y = Xβ + ε 2 ∗ with noise Var ε = σ In and β being the p-dimensional true parameter. Under the `0-constraints on β∗, Raskutti et al. (2011) shows that the lower bound on the minimax prediction error for all βˆ is

1 ˆ ∗ 2 s0 log (p/s0) inf sup kXβ − Xβ k2 ≥ O( ), (X is an n × p design matrix) ˆ ∗ β kβ k0≤s0 n n

ˆ ∗ with probability at least 1/2. Thus, the optimal minimax rate for kβ − β k1 should be proportional log p to the root of the rate O( n ). Similar minimax lower bound for Poisson regression is provided in Jiang et al. (2015); see Chap. 4 of Rigollet and H¨utter(2019) for more minimax theory. This paper is prompted by the motif of high-dimensional count regression that derives useful NB concentration inequalities which can be obtained under compound Poisson frameworks. The obtained concentration inequalities are beneficial to optimal high-dimensional inferences. Our con- centration inequalities are about the sum of independent centered weighted NB random variables, which is more general than the un-weighted case in Cleynen and Lebarbier (2014), which supposes that the over-dispersion parameter of NB random variables is known. However, in our case, the over-dispersion parameter can be unknown. As a by-product, we proposed a characterization for discrete compound Poisson point processes following the work in Zhang et al. (2014) and Zhang and Li (2016). The paper will end up with some applications of high-dimensional NB regressions. St¨adleret al. (2010) studies the oracle inequalities for `1-penalization finite mixture of Gaussian regressions model but they didn’t concern the mixture of count data regression (such as mixture Poisson, see Yang et al. (2019) for example). Here the NB regression can be approximately viewed as finite mixture of Poisson regression since NB distribution is equivalent to a continuous mixture of Poisson distributions where the mixing distribution of the Poisson rate is gamma distributed. The paper will end up with some applications of high-dimensional NB regression in terms of oracle inequalities. The paper is organized as follows. Section 2 is an introduction to discrete compound Poisson point processes (DCPP). We give a theorem for characterizing DCPP, which is similar to the result that initial condition, stationary and independent increments, condition of jumps, together quadru- ply characterize a compound Poisson process, see Wang and Ji (1993) and the references therein. In Sect. 3, we derive the concentration inequality for discrete compound Poisson point process with infinite-dimensional parameters. As an application, for the optimization problem of weighted Lasso penalized negative binomial regression, we show that the true parameter version of KKT conditions hold with high probability, and the optimal data-driven weights in weighted Lasso penality are de- termined. We also show the oracle inequalities for weighted Lasso estimates in NB regressions, and the convergence rate attain the minimax rate derived in references.

2. Introductions to discrete compound Poisson point process

In this section, we present the preliminary knowledge for discrete compound Poisson point pro- cess. A new characterization of the point process with five assumptions is obtained.

2 2.1. Definition and preliminaries To begin with, we need to be familiar with the definition of the discrete compound Poisson distribution and its relation to the weighted sum of Poisson random variables. For more details, we refer the reader to Sect. 9.3 of Johnson et al. (2005), Zhang et al. (2014), Zhang and Li (2016) and the references therein. Definition 2.1. We say that Y is discrete compound Poisson (DCP) distributed if the characteristic function of Y is ( ∞ ) itY X itθ  ϕY (t) = Ee = exp αiλ e − 1 (θ ∈ R), (2) i=1 P∞ where (α1λ, α2λ, . . .) are infinite-dimensional parameters satisfying i=1 αi = 1, αi ≥ 0, λ > 0. We denote it by Y ∼ DCP(α1λ, α2λ, . . .).

If αi = 0 for all i > r and αr 6= 0, we say that Y is DCP distributed of order r. If r = +∞, then the DCP distribution has infinite-dimensional parameters and we say it is DCP distributed of order +∞. When r = 1, it is the well-known Poisson distribution for modeling equal-dispersed count data. When r = 2, we call it a Hermite distribution, which can be applied to model the over-dispersed and multi-modal count data; see Giles (2010). Equation (2) is the canonical representation of characteristic function for non-negative valued infinitely divisible random variable, i.e. a special case of the L´evy–Khinchine formula (see Chap. 2 of Sato (2013) and Sect. 1.6 of Petrov (1995)),

 1 Z  EeitY = exp ait − σ2t2 + eitx − 1 − itxI (x)ν(dx) , 2 |x|<1 R\{0} where a ∈ R, σ ≥ 0 and I{·}(x) is indicator function, the function ν(·) is called the L´evymeasure with restriction: R min(x2, 1)ν(dx) < ∞. R\{0} + P Let δx(t) := 1 (t) and be the positive integer set. If we set ν(t) = + λδx(t)αx, the ν(·) {x} N x∈N is the L´evymeasure in the L´evy–Khinchine formula. It is easy to see that Y has a weighted Poisson P∞ decomposition: Y = i=1 iNi, where the Ni are independent Poisson distributed with mean λαi. This decomposition is also called a L´evy–Itodecomposition; see Chap. 4 of Sato Sato (2013). If Ee−θY < ∞ for θ in a neighbourhood of zero, the generating function (m.g.f.) of DCP is ∞ Z ∞ −θY X −kθ −θx MY (θ) := Ee = exp{ αkλ(e − 1)} = exp{ (e − 1)ν(dx)}, (|θ| < R) k=1 0 for some R > 0. In order to define the discrete compound Poisson point process, we shall repeatedly use the concept of Poisson random measure. Let A be any measurable set on a measurable space (E, A), a good example could be E = Rd. In the following sections, we let E = Rd and denote N(A, ω) as the number of random points in set A. We introduce the Poisson point processes below, and sometimes it is called the Poisson random measure (Sato (2013), Kingman Kingman (1993)). Definition 2.2. Let (E, A, µ) be a measurable space with σ-finite measure µ, and N be the non- negative integer set. The Poisson random measure with intensity µ is a family of random variables {N(A, ω)}A∈A (defined on some probability space (Ω, F,P )) as the product map

N : A × Ω → N satisfying 1. ∀A ∈ A, the N(A, ·) is Poisson random variable on (Ω, F,P ) with mean µ(A); 2. ∀ω ∈ Ω, the N(·, ω) is a counting measure on (E, A);

3. If sets A1,A2,...,An ∈ A are disjoint, then N(A1, ·),N(A2, ·),...,N(An, ·) are mutually inde- pendent. 3 In order to define the discrete compound Poisson point process, we shall repeatedly use the concept of Poisson random measure. Let A be any measurable set on a measurable space (E, A). A good example is E = Rd. In the following section, we let E = Rd and denote N(A, ω) as the number of random points in set A. We introduce the Poisson point processes below, and sometimes it is called the Poisson random measure (Sato (2013), Kingman (1993)).

Definition 2.3. Let (E, A, µ) be a measurable space with σ-finite measure µ, and N be the non- negative integer set. The Poisson random measure with intensity µ is a family of random variables {N(A, ω)}A∈A (defined on some probability space (Ω, F,P )) as the product map

N : A × Ω → N satisfying 1. ∀A ∈ A, the N(A, ·) is Poisson random variable on (Ω, F,P ) with mean µ(A); 2. ∀ω ∈ Ω, the N(·, ω) is a counting measure on (E, A);

3. If sets A1,A2,...,An ∈ A are disjoint, then N(A1, ·),N(A2, ·), ··· ,N(An, ·) are mutually independent.

Let N(A) := N(A, ω). Based on the Poisson random measure, we define the discrete compound Poisson point processes (DCPP) {CP (A)}A∈A by the L´evy–Itodecomposition

∞ X CP (A) := kNk(A), k=1 R where the Nk(A) are independent Poisson point process with mean measure µk(A) := αk A λ(x) dx. P∞ R −kθ The m.g.f. of {CP (A)}A∈A is MCP (A)(θ) = exp{ k=1 αk A λ(x) dx(e − 1)}. We can see that λ(·) is the intensity function of the generating Poisson point processes {Nk(A)}A∈A for each k. Acz´el(1952) derives the p.m.f. and he calls it the inhomogeneous composed Poisson distribution as d = 1. If the intensity function λ(x) is the constant (i.e. the mean measure is a multiple of Lebesgue measure), the N is said to be a homogeneous Poisson point process. Define the probability Pk(A) by P (N(A) = k). If N(A) follows from a DCP point process with intensity measure λ(A), then the probability of having k (k ≥ 0) points in the set A is given by

s1 sk X α1 ··· αk  s1+···+sk −λ(A) Pk(A) = λ(A) e , (3) s1! ··· sk! R(s,k)=k

Pk where R(s, k) := t=1 tst; see Acz´el(1952) for the proof.

2.2. A new characterisation of DCP point process Based on preliminaries, we give a new characterization of DCP point process, which is an exten- sion of Copeland and Regan (1936). A similar characterization for DCP distribution and process (not for point process in terms of random measure) is derived by Wang and Ji (1993). For more characterizations of the point process, see the monograph Last and Penrose (2017) and the references therein.

Theorem 2.1 (Characterization for DCP point process). Consider the following assumptions:

1. Pk(A) > 0 if 0 < m(A) < ∞, where m(·) is the Lebesgue measure; P∞ 2. k=0 Pk(A) = 1; Pk 3. Pk(A1 ∪ A2) = i=0 Pk−i(A1)Pi(A2) if A1 ∩ A2 = ∅; P∞ 4. Let Sk(A) = i=k Pi(A), then limm(A)→0 S1(A) = 0;

4 Pk(A) P∞ 5. lim = αk, where αk = 1. m(A)→0 S1(A) k=1

If Pk(A) satisfies the assumptions 1–5, then there exists a measure ν(A) (countable additivity, ab- solutely continuous) such that Pk(A) is represented by the equation (3) for all k and A.

For A being the closed interval in R, this setting turns to the famous characterization of the Poisson process; see p. 447 of Feller (1968) (The postulates for the Poisson process). The proof of Theorem 2.1 is given in the Appendix. The proof consists of showing countably additive and absolutely continuous of the intensity measure and dealing with the matrix differential equation. As a random measure, the DCPP is a mathematical extension from discrete compound Poisson distribution and discrete compound Poisson process index by real line. It has been studied by Baraud and Birg´e(2009) that they use histogram type estimator to estimate the intensity of the Poisson random measure. A recent application, Das (2018) proposes a nonparametric Bayesian approach to RNA-seq count data by DCP process. For a set of regions {A ∈ A} of interest on the genome, such as genes, exons, or junctions, the number of reads is counted, and the count in the region Ai is viewed as a DCP process for gene expression. The Pk(A) satisfies assumptions 1–5 in Theorem 2.1 by the real-world setting (the process of generating the expression level for each gene in the experiment).

3. Concentration inequalities for discrete compound Poisson point process

This section is about the construction of concentration inequalities for DCPP. It has applications in high-dimensional `1-regularized negative binomial regression; see Sect. 4. As an appetizing example, if the DCPP has finite parameters, the existing result such as Lemma 3 of Baraud and Birg´e(2009) can be used to obtain the concentration inequality for the sum of DCP random variables.

Lemma 3.1 (Baraud and Birg´e(2009)). Let Y1,...,Yn be n independent centered random variables, n and τ, κ, {ηi}i=1 be some positive constants.

2 zYi z ηi (a) If log(Ee ) ≤ κ 2(1−zτ) for all z ∈ [0, 1/τ[, and 1 ≤ i ≤ n, then

" n n !1/2 # X X −x P Yi ≥ 2κx ηi + τx ≤ e for all x > 0. i=1 i=1

−zYi 2 (b) If for 1 ≤ i ≤ n and all z > 0 log(Ee ) ≤ κz ηi/2, then

" n n !1/2# X X −x P Yi ≤ − 2κx ηi ≤ e for all x > 0. i=1 i=1

Actually, the moment conditions in Lemma 3.1 is a direct application of the existing result based on Cramer-type conditions (Lemma 2.2 in Petrov (1995), or sub-exponential condition (4) discussed below) for the convolutions’ Poisson distribution with different rates. The centered random variable 2 zY z ηi Yi with the moment condition log(Ee ) ≤ κ 2(1−zτ) in Lemma 3.1(a), is called the sub-Gamma random variables (see Sect. 2.4 in Boucheron et al. (2013)). A random variable X with zero mean is sub-exponential (see Sect. 2.1.3 in Wainwright (2019)) if there are non-negative parameters (v, α) such that

2 2 λX v λ 1 Ee ≤ e 2 for all |λ| < . (4) α Note that the sub-exponential implies the sub-gamma condition by the following inequality:

2 2 −zYi κz ηi κz ηi log(Ee ) ≤ 2 ≤ 2(1−zτ) for all z ∈ [0, 1/τ[.

5 Proposition 3.1. Let Yi ∼ DCP(α1(i)λ(i), . . . , αr(i)λ(i)) for i = 1, 2, . . . , n, and σi := Var Yi = Pr 2 λ(i) k=1 k αk(i), then for all x > 0 v v " n u n # " n u n # X u X 2 −x X u X 2 −x P (Yi − EYi) ≥ t2x σi + rx ≤ e ,P (Yi − EYi) ≤ −t2x σi ≤ e . i=1 i=1 i=1 i=1

Moreover, we have  v  n u n X u X 2 −x P  (Yi − EYi) ≥ t2x σi + rx ≤ 2e . (5) i=1 i=1

Proof. First, in order to apply the above lemma, we need to evaluate the log-moment-generating Pr function of centered DCP random variables. Let µi =: EYi = λ(i) k=1 kαk(i), we have

Pr kz z(Yt−µi) zYi λ(i) αk(i)(e −1) log Ee = −zµi + log Ee = −zµi + log e k=1 r X kz  = λ(i) αk(i) e − kz − 1 . k=1

z z2 Using the inequality e − z − 1 ≤ 2(1−z) for 1 > z > 0, one derives

r 2 2 r 2 2 2 X k z X k z σiz log Eez(Yt−µi) ≤ λ(i) α (i) ≤ λ(i) α (i) =: (z > 0), k 2(1 − kz) k 2(1 − rz) 2(1 − rz) k=1 k=1

Pr 2 z z2 where σi = Var Yi = λ(i) k=1 k αk(i). And by applying the inequality e − z − 1 ≤ 2 for z < 0, we get r 2 2 2 X k z σiz log Eez(Yt−µi) ≤ λ(i) α (i) =: (z < 0). (6) k 2 2 k=1

2 −z(Yt−µi) σiz Thus (6) implies log Ee ≤ 2 for z > 0. 2 Therefore, by letting τ = 1, κ = 1, ηi = σi in Lemma 3.1, we have (5). In the sequel, without employing Cramer-type conditions, we will derive a Bernstein-type con- centration inequality for DCPP with infinite-dimensional parameters r = +∞. Theorem 3.1 below is a weighted-sum version that is not the same as Proposition 3.1. Theorem 3.1. Let f(x) be the measurable function such that R f 2(x)λ(x) dx < ∞, where the λ(x) R E R is the intensity function of the DCPP with parameter (α1 λ(x) dx, α2 λ(x) dx, . . .) under the P∞ E E condition k=1 kαk < ∞. Then the concentration inequality for a stochastic integral of DCPP is

" ∞ ! # ∞ ! ! Z X X  y  P f(x) CP (dx) − kα λ(x) dx ≥ kα p2yV + kfk ≤ e−y (y > 0), k k f 3 ∞ E k=1 k=1 (7)

R 2 where Vf = E f (x)λ(x) dx, kfk∞ is the supremum of f(x) on E.

EY √ y −y Let f(x) = 1A(x) in Theorem 3.1, and we have P (Y − EY ≥ λ [ 2yλ + 3 ]) ≤ e for Y ∼ DCP(α1λ, α2λ, . . .). From the guide of the proof in Corollary 3.1 below, it is not hard to get the Theorem 3.1 by a linear combination of random variables which is more important in applications in Sect. 4. The proof of Theorem 3.1 is given in the Appendix. This is essentially from the fact that the infinitely divisible distributions are closed under the convolutions and scalings. Corollary 3.1 is the version for the sum of n independent random measures.

6 n n Corollary 3.1. For a given set {Si}i=1 in E, if we have n independent DCPP {CPi(Si)}i=1 with a series of parameters n {(α1(i) ∫ Si λ(x) dx, α2(i) ∫ Si λ(x) dx, . . .)}i=1 n correspondingly, then, for a series of measurable function {fi(x)}i=1, we have

n n ! X Z  X  y  P f (x)CP (dx) − µ λ(x) dx ≥ µ p2yV + kf k ≤ e−ny (y > 0), (8) i i i i i,fi 3 i ∞ i=1 Si i=1

P∞ R 2 where µi := kαk(i) < ∞ and Vi,f := f (x)λ(x) dx < ∞. k=1 i Si i Pn p Pn µi Moreover, let c1n := i=1 µi 2Vi,fi , c2n := i=1 3 kfik∞, then s 2 n Z  !   2   X t c c1n P f (x)CP (dx) − µ λ(x) dx ≥ t ≤ exp −n + 1n − . (9) i i i c 4c2 2c i=1 Si 2n 2n 2n The difference between Proposition 3.1 and Corollary 3.1 is that the DCP random variables in Proposition 3.1 have order r < +∞ while Corollary 3.1 is related to DCP point process (a particular case of DCPP is the constant intensity λ(A) = λ) with order r = +∞. Moreover, Proposition 3.1 2 require dependent factors σi for DCP random variables. Remark 1. Significantly, Corollary 3.1 is the weighted-sum version, and it improves the results in Cleynen and Lebarbier (2014) who deals with the un-weighted sum of NB random variables (as a spe- cial case of DCP distribution). Moreover, Cleynen and Lebarbier (2014) result relies on Lemma 3.1, which depends on the sub-Gamma condition. It should be noted that we do not impose any bounded restriction for y in our concentration inequalities (7) and (8), while Theorem 1 of Houdr´e(2002) and concentration inequalities of Houdr´eand Privault (2002) require that the infinitely divisible random vector satisfies the Cramer-type conditions and the y in P (X − EX ≥ y) is bounded by a given constant. The moment conditions in Lemma 3.1 sometimes are difficult to check, being hard to apply in the Lasso KKT condition in Sect. 4.1 if the count data are other infinitely divisible discrete distributions. In our Theorem 3.1 and Corollary 3.1, we only need to check that the of the DCP processes are finite and Theorem 3.1 and Corollary 3.1 do not need the Bernstein-type condition (see Theorem 2.8 in Petrov (1995) and p. 27 of Wainwright (2019)). In addition, our concentration result does not require the higher moment condition proposed by Kontoyiannis and ∞ L P L Madiman (2006): EZ = k αk < ∞ (L > 1) about the compound distribution P (Z = i) := αi k=1 for the given discrete random variable Z. Before we give the proof of Corollary 3.1, we need some notations and a lemma. For a general point process N(A) defined on Rd, let f be any non-negative measurable function on E. The Laplace R functional is defined by LN (f) = E[exp{− E f(x)N(dx)}], it is the stochastic integral is specified by R f(x)N(dx) =: P f(x ), where the Π is the state space of the point process N(A). The E xi∈Π∩E i Laplace functional for a random measure is a crucial numerical characteristics that enables us to handle stochastic integral of the point process. Moreover, we rely on the following theorem due to Campbell. Lemma 3.2 (See Sect. 3.3 in Kingman (1993)). Let N(A) be a Poisson random measure with d P intensity λ(t) and a measurable function f : R → R, the random sum S = f(x) is absolutely R x∈Π convergent with probability one if and only if the integral d min(|f(x)|, 1)λ(x) dx < ∞. Assume R this condition holds, thus we assert that, for any complex number value θ, the equation Z  EeθS = exp eθf(x) − 1λ(x) dx d R holds if the integral on the right-hand side is convergent.

7 Proof. Now, we can easily compute the DCPP’s Laplace functional by Campbell’s theorem. Set r P R −f(x) CPr(A) =: kNk(A). And let θ = −1, that is LN (f) = exp{ E [e − 1]λ(x)dx}. By indepen- k=1 dence, it follows that

r r Z Z X Y Z LCPr (f) = E exp{− f(x)CPr(dx)} = E exp{− f(x) kNk(dx)} = Eexp{− kf(x)Nk(dx)} E E k=1 k=1 E r Z r Z Y −kf(x) X −kf(x) = exp{ [e − 1]αkλ(x)dx} = exp{ [e − 1]αkλ(x)dx} k=1 E k=1 E The above Laplace functional of compound random measure seems like L´evy-Khintchine represen- tation. Define for η > 0

( Z r Z r ) X X kf(x) Dr(η) := exp { ηf(x)[CPr(dx) − ( kαk)λ(x)]dx} − [e − kηf(x) − 1]αkλ(x)dx . E k=1 E k=1 (10) Therefore, we obtain EDr(η) = 1. It follows from (10) and Markov’s inequality, we have

Z r Z r ! X X kf(x) P ηf(x)[CPr(dx) − ( kαk)λ(x)dx] ≥ [e − kηf(x) − 1]αkλ(x)dx + y E k=1 E k=1 = P (D(η) ≥ ey) ≤ e−y. (11)

Note that

∞ i ∞ i X η i−2 X η i−2 ekηf(x) − kηf(x) − 1 ≤ kkfk k2f 2(x) = k2f 2(x) kkfk i! ∞ i(i − 1) ··· 2 ∞ i=2 i=2 ∞ i−2 k2η2f 2(x) X 1  ≤ kηkfk 2 3 ∞ i=2 k2η2f 2(x) 1 ≤ (1 − kηkfk ). (12) 2 3 ∞

Hence, using (12), (11) turns to

r r ! Z X Z X 1 k2ηf 2(x)λ(x) y  P f(x)[CP (dx) − ( kα )λ(x)dx] ≥ α dx + ≤ e−y. (13) r k k 2 1 − 1 kηkfk η E k=1 E k=1 3 ∞

R 2 Let Vf = E f (x)λ(x)dx, notice that

r  2  r  2  X 1 ηk Vf y X 1 ηk Vf y 1 ky α + = α + (1 − kηkfk ) + kfk k 2 1 − 1 kηkfk η k 2 1 − 1 kηkfk η 3 ∞ 3 ∞ k=1 3 ∞ k=1 3 ∞ r r (14) X  ky  X y ≥ α kpV y + kfk = ( kα )[pV y + kfk ]. k f 3 ∞ k f 3 ∞ k=1 k=1

Applying (14), so by minimizing η in (13), we have

r r ! Z X X y P f(x)[CP (dx) − ( kα )λ(x)dx] ≥ ( kα )[p2yV + kfk ] ≤ e−y. (15) r k k f 3 ∞ E k=1 k=1

d Letting r → ∞ in (15), then CPr(A) −→ CP (A), so we finally prove the inequality (7).

8 4. Application 4.1. Application to KKT conditions For negative binomial random variable NB, the probability mass function (p.m.f.) is Γ (n + r) P (NB = n) = (1 − q)sqn q ∈ (0, 1), n ∈ , Γ (r)n! N where s is a certain real value number. When s is positive integer, the negative reduces to Pascal distribution which is modeled as the number of failures before the s-th success in repeated mutually independent Bernoulli trials (with probability of success p = 1 − q). The m.g.f. of NB is

s ( ∞ )   1 − q   X qk ϕ (t) = exp log = exp −s log(1 − q) · eikt − 1 . (16) NB 1 − qeit −k log(1 − q) k=1 Here the parametrization of DCP is qs NB ∼ DCP(α1λ, α2λ, . . .) with λ = −s log(1 − q), αs = −s log(1−q) . n n Given the data {Yi, xi}i=1, the NB regression models suppose that the responses {Yi}i=1 is NB n T distributed random variables with mean {µi}i=1, and the vectors covariates xi = (xi1, . . . , xip) ∈ Rp are p dimensional fixed (non-random) variable for simplicity; see the so-called NB2 model in Hilbe (2011) for details. Based on the log-linear model in regression analysis, we usually suppose that log µi = f0(xi) is a linear function of xi. Two types of high-dimensional setting for count data regression are classified as follows. • Sparse approximation for nonparametric regression.

Sometimes, it is too rough and too simple to use a linear function to approximate log µi by a function of xi, denoted as f0(xi). The connection between log µi and xi is sometimes unknown, thus it is unlikely to be linear. The dimension of xi would be much larger than the sample size when the f is extremely flexible and complex. In order to capture the unknown functional p relation f0(xi), we prefer to use a given dictionary (orthogonal base functions) D = {φj(·)}j=1 ˆ Pp ˆ such that the linear combination f(xi) = j=1 βjφj(xij) is a sparse approximation to f0(xi) ∗ ∗ ∗ T which contains increasing-dimensional parameter β := (β1 , . . . , βp ) with p := pn → ∞ as n → ∞. A crucial point for orthogonal base functions D is that many f0(xi), which have no sparse representations in the non-orthogonal base, should be represented as sparse linear combinations of D in the orthogonal scenario; see Sect. 10 of Hastie et al. (2015) for details. In practice, the advantage of the dictionary approach is that the flexible f0(xi) may admit a good approximation sparse in some dictionary functions while not in others. For high-dimensional count data, Ivanoff et al. (2016) mentioned that: “The richer the dictionary, the sparser the decomposition, so that p can be larger than n and the model becomes high-dimensional”. Typical candidates of the dictionary are Fourier basis, cosine basis, Legendry base, wavelet basis (Haar basis), etc. • High-dimensional features in gene engineering. In the big data era, one challenge case we encountered in massive data sets is that the number of given covariates p is larger than the sample size n and the responses of interest are measured as counts. For example, in gene engineering, the covariates (features) are types of genes impacting on specific count phenotypes. The NB regression is a flexible framework for the analysis of over-dispersed counts RNA-seq data; see Li (2016), Mallick and Tiwari (2016) for more details.

T We assume that the expectation of Yi will be related to φ (xi)β after a log-transformation, ( p ) T X φ (xi)β µi := exp βjφj(xij) := e := h(xi), j=1 9 T where φ(xi) = (φ1(xi1), . . . , φp(xip)) with covariate {φj(xij)} being the jth component of φ(xi). p Here {φj(·)}j=1 are given transformations of bounded covariances. A trivial example is φj(x) = x. n µi Let {Yi} be a NB random variable with failure parameters {qi = } where θ > 0 is an i=1 θ+µi unknown dispersion parameter which can be estimated (see Hilbe (2011)). Then the p.m.f.s of n {Yi}i=1 are  yi  θ Γ (θ + yi) µi θ f(yi; µi, θ) = Γ (θ)yi! θ + µi θ + µi T  T φ (xi)β ∝ exp yiφ (xi)β − (θ + yi) log θ + e , (17)

2 µi where µi > 0 is the parameter with µi = EYi and Var Yi = µi + θ > EYi. n The log-likelihood of the predictor variable {Yi}i=1 is

" n # n T Y  X T φ (xi)β l(β) =: log f Yi; f0(xi), θ = Yiφ (xi)β − (θ + Yi) log θ + e . (18) i=1 i=1

T Let R(y, β, x) = yφT (x)β − (θ + y) log(θ + eφ (x)β) denote the negative average empirical risk 1 function by `(β) = − n l(β) := PnR(Y, β, x). T We assume that the expectation of Yi will be related to φ (xi)β after a log-transformation.

p T X φ (xi)β µi := exp{ βjφj(xij)} := e := h(Xi), j=1

T where φ(xi) = (φ1(xi1), ··· , φp(xip)) with covariate {φj(xij)} being the j-th component of φ(xi). p Here {φj(·)}j=1 are given transformations of bounded covariances. A trivial example is φj(x) = x. n µi n Let {Yi} be a NB random variable with parameters {qi = } where θ > 0 is an unknown i=1 θ+µi i=1 n dispersion parameter which can be estimated (see Hilbe (2011)). Then the p.m.f. of {Yi}i=1 are

Γ(θ + y ) µ θ T i i yi θ T φ (xi)β f(yi; µi, θ) = ( ) ( ) ∝ exp{yiφ (xi)β − (θ + yi) log(θ + e )} (19) Γ(θ)yi! θ + µi θ + µi

2 µi where µi > 0 is the parameter with µi = EYi and VarYi = µi + θ > EYi. n The log-likelihood of predictor variable {Yi}i=1 is

n n T Y X T φ (xi)β l(β) =: log[ f(Yi; f0(xi), θ)] = [Yiφ (xi)β − (θ + Yi) log(θ + e )] (20) i=1 i=1

T Let R(y, β, x) = yφT (x)β − (θ + y) log(θ + eφ (x)β), denote the negative average empirical risk 1 function by `(β) = − n l(β) := PnR(Y, β, x). Having obtained high-dimensional covariates, a principal task is to select important variables. Here, we prefer the weighted `1-penalized likelihood principle for sparse NB regression,

( p ) ˆ X β = argmin `(β) + wj|βj| , (21) p β∈R j=1

p where {wj}j=1 are data-dependent weights which are supposed to be specified in the following discussion. By using KKT conditions (i.e. the first order condition in convex optimization; see p. 68 of B¨uhlmannand van de Geer B¨uhlmannand van de Geer (2011)), the βˆ is a solution of (21) iff βˆ satisfies first order conditions ( ˙ ˆ ˆ ˆ −`j(β) = wj sign(βj) if βj 6= 0, ˙ ˆ ˆ (22) |`j(β)| ≤ wj if βj = 0,

10 ˙ ˆ ∂`(βˆ) where `j(β) := (j = 1, 2, . . . , p). ∂βˆj Let Φ(X) be the n × p dictionary design matrix given by   φ1(x11) ··· φp(x1p) Φ(X) =: (φ (x )) =  . .  =: (φT (x ), ··· , φT (x ))T . j ij 1≤i≤n,1≤j≤p  . .  1 n φ1(xn1) ··· φp(xnp) ˙ ˆ The KKT conditions implies |`j(β)| ≤ wj for all j. And

n T  φ (xi)β  ˙ ∂`(β) 1 X e `(β) := = − φ(xi) Yi − (θ + Yi) T ∂β n θ + eφ (xi)β i=1 n T n φ (xi)β 1 X φ(xi)(Yi − e )θ 1 X φ(xi)(Yi − EYi)θ = − T = − . n θ + eφ (xi)β n θ + EY i=1 i=1 i The Hessian matrix of the log- is

n T 2 φ (xi)β ∂ `(β) 1 X θ(θ + Yi)e `¨(β) := = φ(x )φT (x ) . (23) T i i T 2 ∂β∂β n φ (xi)β i=1 (θ + e )

∗ ∗ ∗ Let d = |{j : βj 6= 0}|; the true coefficient vector β is defined by β∗ = argmin ER(Y, X, β). (24) p β∈R

For the optimization problem of maximizing the `1-penalized empirical likelihood, our establishment is that KKT conditions are satisfied for the estimated parameter. But here we use the true parameter version of the KKT conditions

n 1 X φj(xi)(Yi − EYi)θ |`˙ (β∗)| = ≤ w (25) j n θ + EY j i=1 i to approximate estimated parameter version of KKT conditions. When θ → ∞, this approach of defining data-driven weights wj is proposed in Ivanoff et al. (2016) for Poisson regression by using concentration inequality for Poisson point process version of Theorem 3.1. Similarly, in NB regression the wj are determined such that the event (29) holds with high probability for all j by applying Corollary 3.1. Recall the weighted Poisson decomposition of the NB point processes

∞ X NB(A) := kNk(A), k=1 R where Nk(A) is an inhomogeneous Poisson point process with intensity αk A λ(x) dx and αk = qk −k log(1−q) defined in (16). R Let µ(A) := A λ(x) dx, the m.g.f. of NB(A) is written as

( ∞ ) X −kθ  MNB(A)(θ) = exp αkµ(A) e − 1 . (26) k=1

˜ φj (xi)θ For the subsequent step, let φj(xi) := ≤ φj(xi) and we want to evaluate the complemen- θ+EYi tary events of the true parameter version of the KKT conditions:

n ! n ! X φj(xi)(Yi − EYi)θ X P ≥ w = P φ˜ (x )(Y − EY ) ≥ w for j = 1, 2, . . . , p. (27) θ + EY j j i i i j i=1 i i=1 11 p The aim is to find the values {wj}j=1 by the concentration inequalities we proposed. Then it is sufficient to estimate the probability in the right-hand side of the above inequality. ∗ ∗ ∗ Let d = |{j : βj 6= 0}|, the true coefficient vector β is defined by β∗= argmin ER(Y, X, β). (28) p β∈R

For the optimization problem of maximizing the `1-penalized empirical likelihood, our establishment is that KKT conditions are satisfied for the estimated parameter. But here we use the true parameter version of KKT conditions n 1 X φj(xi)(Yi − EYi)θ |`˙ (β∗)| = | | ≤ w (29) j n θ + EY j i=1 i to approximate estimated parameter version of KKT conditions. This approach of defining data- driven weights wj is proposed in Ivanoff et al. (2016), we want to find wj such that the event (29) hold with high probability for all j0s. Recall the weighted Poisson decomposition of the NB point processes

∞ X NB(A) := kNk(A) k=1

R pk where Nk(A) is inhomogeneous Poisson point process with intensity αk A λ(x)dx and αk = −k log(1−p) . R Let µ(A) := A λ(x)dx, the m.g.f. of NB(A) is written as ∞ X −kθ MNB(A)(θ) = exp{ αkµ(A)(e − 1)}. (30) k=1

˜ φj (xi)θ For the subsequent step, let φj(xi) := ≤ φj(xi) and we want to evaluate the complemen- θ+EYi tary events of true parameter version of KKT conditions:

n ! n ! X φj(xi)(Yi − EYi)θ X P ≥ w = P | φ˜ (x )(Y − EY )| ≥ w for j = 1, 2, ··· , p. (31) θ + EY j j i i i j i=1 i i=1 p The aim is to find the value {wj}j=1 by the concentration inequalities we proposed. Then it is sufficient to estimate the probability in the right hand of the above inequality. d Let µ(A) be a Lebesgue measure for A ∈ R and Ni,k(A) be Poisson random measure with rate ∞ k αk(i)µ(A) P qi µi n , where µi = kαk(i) with αk(i) := and {qi := } . We define the µi −k log(1−qi) θ+µi i=1 k=1 independent NB point processes NBi(A) as :

∞ ∞ X X αk(i)µ(A) NBi(A) =: kNi,k(A), with E[NBi(A)] = k = µ(A) µi k=1 k=1 where {Ni,k(A)} are independent for each k, i. d n For [0, 1] := ∪i=1Si being a disjoined union, we assume that the intensity function n X h(Xi) λ(t) = I (t) µ(S ) Si i=1 i is a histogram type estimator (Baraud and Birg´e(2009)). It is just a piece-wise constant intensity function. Now we will apply Corollary 3.1, note that Z µ(Si) = λ(t)dt = h(xi) := EYi Si 12 via the definition of λ(t) that we defined before. Thus, both (NB1(S1),...,NBn(Sn)) and (Y1,...,Yn) have the same law by the weighted Poisson decomposition. n P ˜ Choosing f(x) = gj(x) = φj(xl)1Sl (x), it derives l=1 n Z n Z n n X X X ˜ X ˜ gj(x)NBi(dx) = φj(xl)ISl (x)NBi(dx) = φj(xi)Yi i=1 Si i=1 Si l=1 i=1 for each j. Assume that there exists constants C1,C2 such that

∞ X C1 = max µi = max kαk(i) = O(1) and O(1) = max h(xi) ≤ C2. (32) 1≤i≤n 1≤i≤n 1≤i≤n k=1 Then Z Z n n 2 X ˜2 X h(xi) Vi,gj = gj (x)λ(x)dx = φj (xl)ISl (x) ISi (t)dx µ(Si) Si Si l=1 i=1 ˜2 ˜2 2 = φj (xi)h(xi) ≤ C2 max φj (xi) ≤ C2 max φj (xi), 1≤i≤n 1≤i≤n

kgj(x)k∞ ≤ max |φj(xi)| . i=1,...,n From (31), we apply concentration inequality in Corollary 3.1 to the last probability in inequality below

n ! 1 r y  X ˜ 2 P φj(xi)(Yi − EYi) ≥ C1 2yC2 max φj (xi) + kgj(x)| n 1≤i≤n 3 ∞ i=1 n n ! 1 X Z  1 X q y  ≤ P g(x)NB (dx) − µ λ(x) dx ≥ µ 2yV + kg (x)k . n i i n i i,gj 3 j ∞ i=1 Si i=1

q 2 y Now we can let wj = C1[ 2yC2 max φj (xi) + max |φj(xi)|]. 1≤i≤n 3 i=1,...,n From (31), we apply concentration inequality in Corollary 3.1 to the last probability in above inequality. This implies

n ! n Z  ! 1 X φj(xi)(Yi − EYi)θ 1 X P ≥ w ≤ P g (x)[NB (dx) − µ λ(x)dx] ≥ w ≤ 2e−ny n θ + EY j n j i i j i=1 i i=1 Si

γ Take y = n log p where γ > 0, we have

n ! 1 X φj(xi)(Yi − EYi)θ 2 P ≥ w ≤ . (33) n θ + EY j pγ i=1 i The weights is proportional to r 2γ log p γ log p wj ∝ C1( max |φj(xi)|) + max |φj(xi)| . n i=1,...,n 3n i=1,...,n Here the γ can be treated as tuning parameters. In high-dimensional statistics, we often have log p γ log p dimension assumption = o(1), max |φj(xi)| < ∞, the term max |φj(xi)| is negligible n i=1,...,n 3n i=1,...,n q 2γ log p in the weights, comparing to the term n .

13 Then for large dimension p, the above upper bound of the probability inequality tends to zero. Therefore, with high probability, the event of the KKT condition holds. Based on (33) and some regular assumptions such as compatibility factor (a generalized restricted eigenvalue condition), we will derive the following `1-estimation error with high probability:  r  ˆ ∗ ∗ log p β − β ≤ Op d 1 n in the next section. Under the restricted eigenvalue condition proposed by Bickel et al. (2009), oracle inequalities for a weighted Lasso regularized Poisson binomial regression are studied in Ivanoff et al. (2016). Since the negative binomial is neither sub-gaussian nor a member of the exponential family, our extension is not straightforward. We employ the assumption of compatibility factor which renders the proof simpler than that of restricted eigenvalue condition.

4.2. Oracle inequalities for the weighted Lasso in count data regressions The ensuing subsections are described in two steps: The first step is to construct two lemmas to get a lower or upper bound for symmetric Bregman divergence in the weighted Lasso case. Adopting a symmetric Bregman divergence, the second step is to derive oracle inequalities by some inequality scaling tricks.

4.2.1. Case of NB regression The symmetric Bregman divergence (sB divergence) is defined by

Ds(β,ˆ β) = (βˆ − β)T `˙(βˆ) − `˙(β); see Nielsen and Nock Nielsen and Nock (2009). The following lemma is a crucial inequality for deriving the oracle inequality under the assump- tion of a compatibility factor which is also used in Yu Yu (2010) for the Cox model. Let β∗ be the ∗ c ∗ true estimated coefficients vector defined by (28), and H = {j : βj 6= 0}, H = {j : βj = 0}. Then we have the following. ˆ ˙ ∗ ˜ ˆ ∗ Lemma 4.1. Let β be the estimate in (21), set lm := kl(β )k∞, and β = β − β . Then, we have

˜ s ˆ ∗ ˜ ˜ (wmin − lm)kβHc k1 ≤ D β, β + (wmin − lm)kβHc k1 ≤ (wmax + lm)kβH k1, where wmax = max1≤j≤p wj, wmin = min1≤j≤p wj. ξ−1 Moreover, in the event lm ≤ ζ+1 wmin for any ξ > 1 and kΦ(X)k∞ ≤ K, with probability 2 1 − p1−O(1)r , we have ˜ ξwmax ˜ kβHC k1 ≤ kβH k1. (34) wmin Proof. The left inequality is obtained by Ds(β,ˆ β∗) = (βˆ − β∗)T [`˙(βˆ) − l˙(β∗)] ≥ 0 due to convexity of the function `(β). Note that

˜T  ˙ ∗ ˜ ˙ ∗ X ˜ ˙ ∗ ˜ X ˜ ˙ ∗ ˜ ˜T ˙ ∗ β ` β + β − ` β = βj`j β + β + βj`j β + β + β −` β j∈Hc j∈H X ˆ ˆ X ˜ ˜ ≤ −wjβj sign(βj) + wj|βj| + kβk1lm j∈Hc j∈H X ˆ X ˜ ˜ ˜ = −wj|βj| + wj|βj| + lmkβH k1 + lmkβHc k1 j∈Hc j∈H X ˆ X ˜ ˜ ˜ ≤ −wmin|βj| + wmax|βj| + lmkβH k1 + lmkβHc k1 j∈Hc j∈H ˜ ˜ = (lm − wmin)kβHc k1 + (wmax + lm)kβH k1.

14 where the second last inequality follows from KKT conditions and dual norm inequality. Thus we obtain (34). In fact, inequality (34) turns to

2wmin s ∗ 2wmin 2ξwmax kβ˜ c k ≤ D β,ˆ β + kβ˜ c k ≤ kβ˜ k (35) ξ + 1 H 1 ξ + 1 H 1 ξ + 1 H 1

2wmin 2ξ due to wmin − lm ≥ ξ+1 , wmax + lm ≤ ξ+1 wmax. ˙ ∗ ξ−1 For the event {k`(β )||∞ ≤ ξ+1 wmin}, applying the NB concentration inequality (9), we have   ˙ ∗ ξ − 1 P ` β ≥ wmin ∞ ξ + 1 n ! X φ1(xi)θ ξ − 1 ≤ pP (Y − EY ) ≥ nw (θ + EY ) i i ξ + 1 min i=1 i s 2   2   ξ − 1 nwmin c1n c1n ≤ p exp −n · + 2 − ξ + 1 c2n 4c2n 2c2n s 2   , 2   ξ − 1 nwmin ξ − 1 nwmin c1n c1n = p exp −n · · + 2 + ξ + 1 c2n ξ + 1 c2n 4c2n 2c2n  2 1−O(1)r2 ∼ p exp −nwminO(1) ∼ p . where the second last ∼ is due to c1n = O(n) = c2n by the assumption (32) and the the last ∼ is q log p by letting wmin = O(r n ). ˙ ∗ ξ−1 Here r > 0 is a suitable tuning parameter, such that the event {k`(β )||∞ ≤ ξ+1 wmin} holds with probability tending to 1 as p, n → ∞. In order to derive Theorem 4.1, we adopt a crucial inequality by using the similar approach in Lemma 5 of Zhang and Jia (2017), which is the technique of Taylor’s expansion for convex log-likelihood functions.

Lemma 4.2. Let the Hessian matrix `¨(β) be given in (23), and for δ ∈ Rp, assume the identifiability T T T condition: φ (xi)(β + δ) = φ (xi)β implies φ (xi)δ = 0, then we have

s T −2||Φ(X)|| ||δ|| D (β + δ, β) ≥ δ `¨(β)δe ∞ where ||Φ(X)||∞ = max{|φj(xij)|; 1 ≤ i ≤ n, 1 ≤ j ≤ p}. n o ˙ ∗ ξ−1 By the result in the second parts of Theorem 4.1, it says that in the event ||`(β )||∞ ≤ ξ+1 λmin , the error of estimate β˜ = βˆ − β∗ belongs to the following cone set:

p S(η, H) = {b ∈ R : ||bHc ||1 ≤ η||bH ||1} (36)

ξwmax with η = w . ∗ min∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ p Let d (β ) =: |H(β )| with H(β ) =: {j : βj 6= 0, β = (β1 , ··· , βj , ··· , βp ) ∈ R }. Sometimes, we write H(β∗) as H and d∗(β∗) as d∗ if there is no ambiguity. Define compatibility factor (see van de Geer (2007)) of a p × p nonnegative-definite matrix Σ as

1/2 (do(b))1/2(bT Σb) C(η, H, Σ) = inf > 0, (η ∈ R). 06=b∈S(η,H) kbH k1 p where S(η, H) = {b ∈ R : kbHc k1 ≤ η||bH ||1} is the cone condition. The compatibility factor is strongly linked to the restricted eigenvalue proposed by Bickel et al. (2009), 1/2 (bT Σb) RE(η, H, Σ) = inf > 0, (η ∈ R). (37) 06=b∈S(η,H) kbk2 15 Actually, by Cauchy-Schwarz inequality, it is evident that [C(η, H, Σ)]2 ≥ [RE(η, H, Σ)]2. In the proof of Theorem 4.1, the upper bounds of oracle inequalities are sharper if we use a compatibility factor rather than restricted eigenvalue. Next, let C(η, H) = C(ξ, H, ¨l(β∗)), then we have the following theorem which is analogous to Zhang and Jia (2017) who gives oracle inequalities for the Elastic-net estimate in NB regression.

∗ 2 K(ξ+1)d wmax Theorem 4.1. Suppose ||Φ(X)||∞ ≤ K with some K > 0 and let τ = 2 ≤ 2wmin[C(ξwmax/wmin,H)] q 1 −1 log p 2 e with assumption (32). Setting wmin = O(r n ) where r > 0 is a suitable tuning parameter, ˙ ∗ ξ−1 then in the event {||`(β )||∞ ≤ ξ+1 wmin} with probability tending to 1 as p, n → ∞, we have

2aτ ∗ 2 2 ˆ ∗ e (ξ + 1)d wmax ∗ wmax ||β − β ||1 ≤ 2 = Op(d ) 2wmin[C(ξwmax/wmin,H)] wmin

1 −2a where aτ ≤ 2 is smaller a solution of the equation ae = τ. And

2aτ 2 ∗ 2 s ˆ ∗ 4e ξ d wmax D (β, β ) ≤ 2 2 (ξ + 1) [C(ξwmax/wmin,H)]

˜ ˆ ∗ ˜ ˜ ∗ Proof. Let β = β − β 6= 0 and b = β/||β||1, and observe that `(β + bx) is a convex function in x since `(β) is convex. From the proof of (35) in Lemma 4.1, we have

T ∗ ∗ 2wmin 2ξwmax b [l˙(β + bx) − `˙(β )] + ||b C || ≤ ||b || (38) ξ + 1 H 1 ξ + 1 H 1 ˜ holding for x ∈ [0, ||β||1]. Since `(β) is a convex in β, then the function bT [`˙(β + bx) − `(β)] is increasing in x.) and b ∈ S(ξwmax/wmin,H). By the assumption, we have ||Φ(X)||∞||xb||1 ≤ Kx by noticing that ||b||1 = 1. And assuming ||Φ(x)||∞ ≤ K), we obtain following by the Lemma 4.2

T ∗ ∗ 2 −2||Φ(X)|| ||xb|| T 2 −2Kx T xb [`˙(β + bx) − `˙(β )] ≥ x e ∞ b `¨(β)b ≥ x e b `¨(β)b, this implies bT [`˙(β∗ + bx) − `˙(β∗)] ≥ xe−2KxbT `¨(β)b. (39) For the Hessian matrix evaluated at the true coefficient β∗, the compatibility factor is written as C(η, H) =: C(η, H, `¨(β∗)). From the definition of compatibility factor and the inequality (39), we have −2Kx 2 2 ∗ −2Kx T ¨ Kxe [C(ξwmax/wmin,H)] ||bH || /d ≤ Kxe b `(β)b (by (39)) ≤ KbT [`˙(β∗ + bx) − `˙(β∗)]

2ξwmax 2wmin (by (35)) ≤ K( ||b || − ||b c || ) ξ + 1 H 1 ξ + 1 H 1 2ξw 2w = K[ max ||b || − min (1 − ||b || )] ξ + 1 H 1 ξ + 1 H 1 2 2w ≤ K[ (ξw + w )||b || − min ] ξ + 1 max max H 1 ξ + 1 2ξw = K[2w ||b || − min ] max H 1 ξ + 1 K(ξ + 1)||b ||2w2 ≤ H 1 max 2wmin

2 2wmin (ξ+1)||bH ||1wmax where the last inequality is due to + ≥ 2wmax||bH ||1. ξ+1 2wmin Then we have ∗ 2 −2Kx K(ξ + 1)d wmax Kxe ≤ 2 =: τ (40) 2wmin[C(ξwmax/wmin,H)] 16 ˜ for any x ∈ [0, ||β||1]. ˜ So with constrain x ∈ [0, ||β||1], let aτ , bτ , (aτ < 0.5 < bτ ) be the two solutions of the equation −2z ˜ {z : ze = τ}. Let S = [0, ||β||1] ∩ {(−∞, aτ ] ∪ [bτ , +∞)} satisfy (40), see Figure 1.

Figure 1: The two solutions of the equation {z : ze−2z = τ}.

Note that bT [`˙(β + bx) − `(β)] is an increasing function of x, then S cannot be a disjoint union of the two intervals by the restriction condition in (38), thus S is a closed interval x ∈ [0, x˜]. The ˜ fact that all x ∈ [0, kβk1] implies x ∈ [0, x˜], therefore we have

Kx ≤ Kx˜ ≤ aτ .

Use (40) again, it implies Kxe˜ −2Kx˜ ≤ τ. Then, for ∀x ∈ [0, x˜], by we have

2aτ 2aτ ∗ 2 ˜ aτ e τ e (ξ + 1)d wmax ||β||1 = ||bx||1 = ||b||1x = x ≤ x˜ ≤ = = 2 K K 2wmin[C(ξwmax/wmin,H)]

It remains to show the oracle inequality of Ds(β,ˆ β∗). Let δ = bx in Lemma 4.2, Using the definition of compatibility factor, and observing that aτ ≥ Kx and Kx ≥ ||Φ(X)||∞||δ||1, it derives

−2aτ 2 ˜ 2 ∗ −2aτ ˜T ¨ ∗ ˜ −2Kx ˜T ¨ ∗ ˜ e [C(ξwmax/wmin,H)] ||βH ||1/d ≤ e β `(β )β ≤ e β l(β )β −2||Φ(X)|| ||δ|| T ∗ ≤ e ∞ 1 β˜ `¨(β )β.˜

Apply Lemma 4.2, it immediately deduces that

−2a 2 2 ∗ −2||Φ(X)|| ||δ|| T ∗ s ∗ 2ξwmax e τ [C(ξw /w ,H)] ||β˜ || /d ≤ e ∞ 1 β˜ `¨(β )β˜ ≤ D (β,ˆ β ) ≤ ||β˜ || max min H 1 ξ + 1 H 1 where the last inequality is from (38). ˜ Cancel the term ||βH ||1 on both sides of the inequality above, we get

2aτ ∗ ˜ 2e ξd wmax ||βH ||1 ≤ 2 . (ξ + 1)[C(ξwmax/wmin,H)] Consequently, 2aτ 2 ∗ 2 s ˆ ∗ 2ξwmax ˜ 4e ξ d wmax D (β, β ) ≤ ||βH ||1 ≤ 2 2 . ξ + 1 (ξ + 1) [C(ξwmax/wmin,H)]

17 ∗ Theorem 4.1 indicates that, if we presume d = O(1), the l1-estimation error bound is of the q log p order n . And this convergence rate is rate-optimal in the minimax sense; see Raskutti et al. (2011) and Li et al. (2019). Under some regularity conditions, the weighted Lasso estimates are expected to have the consistency property in the sense of the `1-norm when the dimension increases as order exp(o(n)). Unlike the MLE estimator with convergence rate √1 , in the high-dimensional √ n case we have to multiply log p to the rate of MLE as the price to pay.

4.2.2. Case of Poisson regression In this section, we use the same procedure as in Sect. 4.2.1 to prove the case of Poisson regression with weighted Lasso penalty. The derivation of this type of oracle inequalities is simpler than that in Ivanoff et al. (2016). Similar to Lemma 5 of Zhang and Jia (2017), the proof is based on the following key lemma. Lemma 4.3. The Hessian matrix and gradient of the log-likelihood function in the Poisson regres- T T ˙ 1 Pn φ (xi)β ¨ 1 Pn T φ (xi)β sion model are `(β) = − n i=1 φ(xi)[Yi − e ], `(β) = n i=1 φ(xi)φ (xi)e . Assume T T T the identifiability condition: φ (xi)(β + δ) = φ (xi)β implies φ (xi)δ = 0, then we have

s T −kΦ(X)k kδk D (β + δ, β) ≥ δ `¨(β)δe ∞ 1 , where kΦ(X)k∞ = max{|φj(xij)|; 1 ≤ i ≤ n, 1 ≤ j ≤ p}. T ˙ ¨ Proof. Without loss of generality, we assume that φ (xi)δ 6= 0. By the expression of `(β) and `(β), one deduces n 1 X T T δT `˙(β + δ) − `˙(β) = δT · φ(x )eφ (xi)(β+δ) − eφ (xi)β n i i=1 n φT (x )δ 1 X T e i − 1 = δT · φ(x )eφ (xi)β · · φT (x )δ n i φT (x )δ − 0 i i=1 i n T 1 X T φT (x )β −kΦ(X)k kδk ≥ δ φ(x )φ (x )e i δ · e ∞ 1 , n i i i=1

ex−ey −|x|∨|y| where the last inequality is from x−y ≥ e . ˆ Theorem 4.2. For weighted lasso estimator β in Poisson regression, suppose ||Φ(X)||∞ ≤ K with ∗ 2 K(ξ+1)d wmax −1 some K > 0 and let τ = 2 ≤ e with a certain K > 0. Then, in the event 2wmin[C(ξwmax/wmin,H)] ˙ ∗ ξ−1 ||`(β )||∞ ≤ ξ+1 wmin, we have

aτ ∗ 2 ˆ ∗ e (ξ + 1)d wmax ||β − β ||1 ≤ 2 2wmin[C(ξwmax/wmin,H)]

−a where aτ ≤ 1 is smaller solution of the equation ae = τ. And

aτ 2 ∗ 2 s ˆ ∗ 4e ξ d wmax D (β, β ) ≤ 2 2 (ξ + 1) [C(ξwmax/wmin,H)]

−z Proof. Note that aτ is the small solution of ze = τ in Poisson regression. By Lemma 4.1 and Lemma 4.3, the proof is almost exactly the same as proof of Theorem 4.1.

5. Simulations

This section aims to compare the performances of the un-weighted Lasso and weighted Lasso for NB regression on simulated data sets. We use the R package mpath with the function glmregNB to fit the un-weighted Lasso estimator of NB regression. The R package lbfgs is employed to do 18 Table 1: Means of `1 error and prediction error ˆ ∗ Method kβ − β k1 Prediction error p = 20, ρ = 0.5 Lasso 1.902 7.161 Weight Lasso 1.624 6.147 p = 20, ρ = 0.8 Lasso 3.001 7.155 Weight Lasso 2.917 6.979 p = 100, ρ = 0.5 Lasso 4.137 15.452 Weight Lasso 3.905 13.866 p = 100, ρ = 0.8 Lasso 4.543 10.866 Weight Lasso 4.552 10.695 p = 200, ρ = 0.5 Lasso 4.467 17.533 Weight Lasso 4.373 15.854 p = 200, ρ = 0.8 Lasso 4.467 17.533 Weight Lasso 4.373 15.854

a weighted `1-penalized optimization for NB regression. For weighted Lasso, we first run an un- weighted Lasso estimator and find the optimal tuning parameter λop by cross-validation. The actual weights we use are the standardized weights given by

wj w˜j := pPp . j=1 wj And then we solve the optimization problem

( p ) ˆ(t) X β = argmin −`(β) + λop w˜j|βj| . (41) p β∈R j=1 Here we first apply the function cv.glmregNB() to do a 10-fold cross-validation to obtain the optimal penalty parameter λ. For comparison, we use the function glmregNB to do un-weighted Lasso for NB regression and use the estimated θ as the initial value for our weighted Lasso algorithms. For simulations, we simulate 100 random data sets. By optimization with suitable λ, we obtain ˆ the model with the parameter βλopt . Then we estimate the performance of the `1 estimation error ∗ ˆ ∗ ˆ kβ − βλopt k1 and the prediction error kXtestβ − Xtestβλopt k2 on the test data (Xtest of size ntest) by the average of the 100-times errors. For each simulation ntrain = 100, ntest = 200. We mainly adopt the simulation setting: The predictor variables X are randomly drawn from Np(0,Σ), where Σ has elements ρ|j−l| (j, l = 1, 2, . . . , p). The correlation among predictor variables is controlled by ρ with ρ = 0.5 and 0.8, respectively. We assign the true vector as

β∗ = (1.25, −0.95, 0.9, −1.1, 0.6, 0,..., 0). | {z } p−5

The simulation results are displayed in Table 1. The table shows that the proposed weighted Lasso estimators are more accurate than the un-weighted Lasso estimators. With the help of the optimal weights, it reflects that the controlling of KKT conditions by a data-dependent tuning parameter is able to improve the accuracy of the estimation in aspects of the `1-estimation, square prediction errors. It can be also seen that the increasing p will hinder the `1-estimation error by the curse of dimensionality. 19 6. Appendix: Proofs

6.1. Proof of Theorem 2.1 The proof is divided into two parts. In the first part, we show that ν(A) is countably additive and absolutely continuous. Moreover, in the second part, we will show Eq. (3). Step 1. P∞ (1) Note that Pk(A) > 0 and k=0 Pk(A) = 1 by assumption 1, we have 0 < Pk(A) < 1 whenever 0 < m(A) < ∞. And from assumption 2, we get P0(A) = 1 if m(A) = 0. It follows that 0 < P0(A) ≤ 1 if 0 < m(A) < ∞. By assumptions 2 and 4, one has limm(A)→0 ν(A) = limm(A)→0 − log [1 − S1(A)] = 0. Therefore, ν(·) is absolutely continuous. (2) Let ν(A) := − log P0(A). For A1 ∩ A2 = ∅, it implies by assumption 3 that   ν(A1 ∪ A2) = − log P0(A1 ∪ A2) = − log P0(A1) · P0(A2) = ν(A1) + ν(A2), which shows that ν(·) is finitely additive. Based on the two arguments in Step 1, we have shown that ν(·) is countable additive. Step 2. For convenience, we first consider the case that A = Ia,b = (a, b] is a d-dimensional d interval on R , and then extend it to any measurable sets. Let T (x) =: ν(I0,x), so T (x) is a d dimensional coordinately monotone transformation with

T (0) = 0,T (x) ≤ T x0 iff x ≤ x0, where the componentwise order symbol (partially ordered symbol) “≤” means that each coordinate in x is less than or equal to the corresponding coordinate of x0 (for example, with d = 2, we have (1, 2) ≤ (3, 4)). This is by the fact that ν(·) is absolutely continuous and finite additive, which implies ν(I0,0) = limm(I0,x)→0 ν(I0,x) = 0. The definition of the minimum set of the componentwise ordering relation lays the base for our study; see Sect. 4.1 of Dinh The Luc Dinh The Luc (2016). Let A be a nonempty set in R+d.A point a ∈ A is called a Pareto minimal point of the set A if there is no point a0 ∈ A such that a0 ≥ a and a0 6= a. The sets of Pareto minimal points of A is denoted by PMin(A). Next, we define the generalized inverse T −1(t) (like some generalizations of the inverse matrix, it is not unique, we just choose one) by −1  +d x = (x1, . . . , xd) = T (t) =: PMin a ∈ R : t ≤ T (a) . When d = 1, the generalized inverse T −1(t) is just a similar version of the quantile function. By Theorem 4.1.4 in Dinh The Luc Dinh The Luc (2016), if A is a nonempty compact set, then −1 −1 it has a Pareto minimal point. Set x = T (t), x + ξ = T (t + τ) and define ϕk(τ, t) = Pk(Ix,x+ξ) for t = T (x), τ = T (x + ξ) − T (x). Consequently, by part one, finite additivity of ν(·), it implies that

τ = T (x + ξ) − T (x) = ν(I0,x+ξ) − ν(I0,x) = ν(Ix,x+ξ) = − log P0(Ix,x+ξ).

−τ P∞ Thus, we have ϕ0(τ, t) = e . Let Φk(τ, t) = i=k ϕi(τ, t). Due to assumption 5, it follows that

ϕk(τ, t) ϕk(τ, t) αk = lim = lim for k ≥ 1. (42) τ→0 Φ1(τ, t) τ→0 τ Pk Applying assumption 3, we get ϕk(τ + h, t) = i=0 ϕk−i(τ, t)ϕi(h, t + τ). Subtracting ϕk(τ, t) and dividing by h (h > 0) in the above equation, then

k ϕk(τ + h, t) − ϕk(τ, t) ϕ0(h, t + τ) − 1 X ϕi(h, t + τ) = ϕ (τ, t) · + ϕ (τ, t) · . (43) h k h k−i h i=1

Here the meaning of h in ϕi(h, t + τ) is that

h = T (x + ξ + η) − T (x + ξ) = ν(I0,x+ξ+η) − ν(I0,x+ξ) = ν(Ix+ξ,x+ξ+η), where x + ξ + η = T −1(t + τ + h) and x + ξ = T −1(t + τ). 20 Let h → 0 in (43). By using assumption 5 and its conclusion (42), it implies that

∂ ϕ0 (τ, t) =: ϕ (τ, t) = −ϕ (τ, t) + α ϕ (τ, t) + α ϕ (τ, t) + ··· + α ϕ (τ, t), (44) k ∂τ k k 1 k−1 2 k−2 k 0 which is a difference–differential equation. To solve (44), we need to write it in matrix form,    0  −1 α1 ··· αk   ϕ k(τ, t) ϕk(τ, t) .  . . .  . ∂Pk(τ, t)  .   .. .. .   .  :=   =     =: QPk(τ, t) ∂τ  0   α1     ϕ 1(τ, t)   ϕ1(τ, t)  0  .  ϕ 0(τ, t) .. −1 ϕ0(τ, t)

The general solution is ∞ X Qm P (τ, t) = eQτ · c =: τ m · c, k m! m=0 where c is a constant vector. To specify c, we observe that

T Qτ (0,..., 0, 1) = lim Pk(τ, t) = lim e c = c. τ→0 τ→0 Hence, the Q can be written as

2 k Q = −Ik+1 + α1N + α2N + ··· + αkN ,

0 I  where N =: k×1 k . As i ≥ k + 1, we have Ni = 0. 0 01×k We use the expansion of the power of the multinomial

k !m 1 (−I )s0 αs1 ··· αsk X i X k+1 1 k s1+2s2+···+ksk −Ik+1 + αiN = N . m! s0!s1! ··· sk! i=1 s0+s1+···+sk=m Then

∞ (−I )s0 αs1 ··· αsk X X k+1 1 k s1+2s2+···+ksk s0+s1+···+sk Pk(τ, t) = N τ c s0!s1! ··· sk! m=0 s0+s1+···+sk=m ∞ m s s s X X (−1) 0 X α 1 ··· α k = τ s0 1 k Ns1+2s2+···+ksk τ s1+···+sk c s0! s1! ··· sk! m=0 s0=0 s1+···+sk=m−s0 ∞ ∞ s s s X X (−1) 0 X α 1 ··· α k = τ s0 1 k τ s1+···+sk Ns1+2s2+···+ksk c s0! s1! ··· sk! s0=0 m−s0=0 s1+···+sk=m−s0 ∞ ∞ (−1)s0 αs1 ··· αsk X s0 X X 1 k s1+···+sk s1+2s2+···+ksk (let r = m − s0) = τ τ N c s0! s1! ··· sk! s0=0 r=0 s1+···+sk=r ∞ s s X X α 1 ··· α k = e−τ 1 k τ s1+···+sk Ns1+2s2+···+ksk c s1! ··· sk! r=0 s1+···+sk=r ∞ s s X X α 1 ··· α k = 1 k τ s1+···+sk e−τ Nlc s1! ··· sk! l=0 R(s,k)=l k s s X X α 1 ··· α k = 1 k τ s1+···+sk e−τ Nlc, (45) s1! ··· sk! l=0 R(s,k)=l

21 Pk l where R(s, k) =: t=1 tst and the last equality is obtained by the fact that N c = (0,..., 0, 1, 0,..., 0)T and Nl = 0 as l ≥ k + 1. | {z } l s αs1 ···α k P 1 k s1+···+sk −τ Choose the first element in (45). We show that ϕk(τ, t) = τ e . By R(s,k)=k s1!···sk! the relation of ϕk(τ, t) and Pk(A), we obtain

s1 sk α ··· α s +···+s X 1 k   1 k −ν(I0,ξ) Pk(I0,ξ) = ν(I0,ξ) · e . (46) s1! ··· sk! R(s,k)=k

Step 3. We need to show the countable additivity in terms of (46). Let A1, A2 be two disjoint intervals, we first show that

s1 sk α ··· α s +···+s X 1 k   1 k −ν(A1∪A2) Pk(A1 ∪ A2) = ν(A1 ∪ A2) e , (47) s1! ··· sk! R(s,k)=k which means that (46) is closed under a finite union of the Ai. Pk By assumption 3, to show (47), it is sufficient to verify i=0 Pk−i(A1)Pi(A2) equals the right- hand side of (47). From the m.g.f. of the DCP distribution with complex θ, we have

( ∞ ) −θN(Ai) X −aθ  MAi (θ) =: Ee = exp λ(Ai) αa e − 1 for i = 1, 2. a=1 And by the definition of m.g.f. and assumption 3,

+∞ +∞ k ! X −kθ X X −kθ MA1∪A2 (θ) = Pk(A1 ∪ A2)e = Pk−i(A1)Pi(A2) e k=0 k=0 i=0 +∞ ! +∞ ! X −jθ X −sθ = Pj(A1)e Ps(A2)e j=0 s=0 ( ∞ )   X −aθ  = MA1 (θ)MA2 (θ) = exp ν(A1) + ν(A1) αa e − 1 a=1 ( ∞ )   X −aθ  = exp ν(A1 + A2) αa e − 1 a=1

∞ s1 sk α ··· α s +···+s X X 1 k   1 k −ν(A1∪A2) −kθ = ν(A1 ∪ A2) e e . s1! ··· sk! k=0 R(s,k)=k

Thus, we have shown (47). Finally, let $E_n = \bigcup_{i=1}^{n}A_i$ and $E = \bigcup_{i=1}^{\infty}A_i$ for disjoint intervals $A_1, A_2, \ldots$. We use the following lemma from Copeland and Regan (1936).

Lemma 6.1. Under assumptions 1–4, let $e_n = \{E\setminus(E\cap E_n)\}\cup\{E_n\setminus(E\cap E_n)\}$. Then

\[
\lim_{n\to\infty} m(e_n)=0 \quad\text{implies}\quad \lim_{n\to\infty}P_k(E_n)=P_k(E).
\]
To see this, since $e_n = \bigcup_{i=n+1}^{\infty}A_i$, we have $m(e_n) = \sum_{i=n+1}^{\infty}m(A_i)\to 0$ as $n\to\infty$. Thus, we obtain countable additivity for (46).
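The finite additivity (47), and hence the consistency of (46) under unions, can also be illustrated numerically: the DCP probabilities on $A_1\cup A_2$ must equal the discrete convolution of those on $A_1$ and $A_2$. The sketch below, ours and not part of the proof, builds the DCP law from its standard representation $N=\sum_a a\,M_a$ with independent $M_a\sim\mathrm{Poisson}(\nu\alpha_a)$, which has the same probability generating function as (46); all parameter values are illustrative assumptions.

# Numerical illustration of (47): for disjoint A1, A2 the DCP probabilities on
# A1 ∪ A2 coincide with the convolution of those on A1 and A2. Parameters are
# illustrative; the DCP pmf is built from N = sum_a a*M_a, M_a ~ Poisson(nu*alpha_a).
import numpy as np
from scipy.stats import poisson

alpha = {1: 0.6, 2: 0.3, 3: 0.1}
K = 60                                            # truncation level for the pmf arrays

def dcp_pmf(nu, alpha, K):
    pmf = np.zeros(K + 1); pmf[0] = 1.0
    for a, w in alpha.items():
        pa = poisson.pmf(np.arange(K + 1), nu * w)   # law of M_a on 0..K
        pa_scaled = np.zeros(K + 1)                  # law of a*M_a (supported on multiples of a)
        pa_scaled[::a] = pa[: len(pa_scaled[::a])]
        pmf = np.convolve(pmf, pa_scaled)[: K + 1]
    return pmf

nu1, nu2 = 0.5, 0.3                               # nu(A1), nu(A2)
p_union = dcp_pmf(nu1 + nu2, alpha, K)
conv = np.convolve(dcp_pmf(nu1, alpha, K), dcp_pmf(nu2, alpha, K))[: K + 1]
print(np.max(np.abs(p_union[:20] - conv[:20])))   # should be ~ 0 up to truncation error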

6.2. Proof of Corollary 3.1

Let $CP_{i,r_i}(A) := \sum_{k=1}^{r_i} k\,N_{i,k}(A)$, where $\{N_{i,k}(A)\}_{i=1}^{n}$ are independent Poisson point processes with intensity $\alpha_k(i)\int_A \lambda(x)\,dx$ for each fixed $k$. By Campbell's theorem, it follows that
\[
\begin{aligned}
\mathrm{E}\exp\Bigl\{-\sum_{i=1}^{n}\int_{S_i} f_i(x)\,CP_{i,r_i}(dx)\Bigr\}
&= \mathrm{E}\exp\Bigl\{-\sum_{i=1}^{n}\int_{S_i} f_i(x)\sum_{k=1}^{r_i} k\,N_{i,k}(dx)\Bigr\} \\
&= \prod_{i=1}^{n}\prod_{k=1}^{r_i}\mathrm{E}\exp\Bigl\{-\int_{S_i} k f_i(x)\,N_{i,k}(dx)\Bigr\} \\
&= \prod_{i=1}^{n}\prod_{k=1}^{r_i}\exp\Bigl\{\int_{S_i}\bigl[e^{-k f_i(x)}-1\bigr]\alpha_k(i)\lambda(x)\,dx\Bigr\} \\
&= \exp\Bigl\{\sum_{i=1}^{n}\sum_{k=1}^{r_i}\int_{S_i}\bigl[e^{-k f_i(x)}-1\bigr]\alpha_k(i)\lambda(x)\,dx\Bigr\}.
\end{aligned} \tag{48}
\]
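Identity (48) is the Laplace-functional form of Campbell's theorem for the truncated process $CP_{i,r_i}$, and it can be verified by direct simulation. The sketch below, our illustration rather than part of the proof, takes a single $i$, a constant intensity $\lambda(x)\equiv\lambda$ on $S=[0,1]$, three marks with illustrative weights $\alpha_k$, and a simple test function $f$; all of these concrete choices are assumptions made only for the check.

# Monte Carlo check of Campbell's formula (48) for a single truncated compound
# Poisson process CP(dx) = sum_k k N_k(dx) on S = [0, 1] with constant intensity.
# All numerical choices below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                               # lambda(x) = lam on [0, 1]
alpha = [0.6, 0.3, 0.1]                 # alpha_k for k = 1, 2, 3
f = lambda x: np.sin(np.pi * x)         # bounded test function

def integral_f_dCP(rng):
    """One realization of int_S f(x) CP(dx) = sum_k k * (sum of f over the points of N_k)."""
    total = 0.0
    for k, a in enumerate(alpha, start=1):
        n_pts = rng.poisson(a * lam)            # N_k([0, 1]) ~ Poisson(alpha_k * lam)
        pts = rng.uniform(0.0, 1.0, n_pts)      # locations, density lambda(x)/lambda(S)
        total += k * f(pts).sum()
    return total

M = 100_000
lhs = np.mean([np.exp(-integral_f_dCP(rng)) for _ in range(M)])

x = np.linspace(0.0, 1.0, 10_001)               # grid for the integral on the right-hand side
rhs = np.exp(sum(((np.exp(-k * f(x)) - 1.0) * a * lam).mean()
                 for k, a in enumerate(alpha, start=1)))
print(lhs, rhs)                                  # the two numbers should be close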

For $\eta>0$, define the normalized exponential transform $D_{\{r_i\}}(\eta)$ of the stochastic integral by

\[
\log D_{\{r_i\}}(\eta) := \eta\sum_{i=1}^{n}\int_{S_i} f_i(x)\Bigl[CP_{i,r_i}(dx) - \sum_{k=1}^{r_i} k\,\alpha_k(i)\lambda(x)\,dx\Bigr]
- \sum_{i=1}^{n}\int_{S_i}\sum_{k=1}^{r_i}\bigl[e^{k\eta f_i(x)} - k\eta f_i(x) - 1\bigr]\alpha_k(i)\lambda(x)\,dx. \tag{49}
\]
Therefore, $\mathrm{E}\,D_{\{r_i\}}(\eta) = 1$ by (48). From (49) and Markov's inequality, we have

\[
P\Bigl(\eta\sum_{i=1}^{n}\int_{S_i} f_i(x)\Bigl[CP_{i,r_i}(dx)-\sum_{k=1}^{r_i}k\,\alpha_k(i)\lambda(x)\,dx\Bigr]
\ge \sum_{i=1}^{n}\int_{S_i}\sum_{k=1}^{r_i}\bigl[e^{k\eta f_i(x)}-k\eta f_i(x)-1\bigr]\alpha_k(i)\lambda(x)\,dx + ny\Bigr)
= P\bigl(D_{\{r_i\}}(\eta)\ge e^{ny}\bigr)\le e^{-ny}. \tag{50}
\]
Applying inequality (12) to (50) yields

\[
P\Bigl(\sum_{i=1}^{n}\int_{S_i} f_i(x)\Bigl[CP_{i,r_i}(dx)-\sum_{k=1}^{r_i}k\,\alpha_k(i)\lambda(x)\,dx\Bigr]
\ge \sum_{i=1}^{n}\sum_{k=1}^{r_i}\alpha_k(i)\Bigl[\frac{1}{2}\int_{S_i}\frac{k^2\eta f_i^2(x)\lambda(x)}{1-\frac{1}{3}k\eta\|f_i\|_\infty}\,dx + \frac{y}{\eta}\Bigr]\Bigr)\le e^{-ny}. \tag{51}
\]
Let $V_{i,f_i} = \int_{S_i} f_i^2(x)\lambda(x)\,dx$ for $i=1,2,\cdots,n$. By a basic inequality, we have

\[
\begin{aligned}
&\sum_{i=1}^{n}\sum_{k=1}^{r_i}\alpha_k(i)\Bigl[\frac{1}{2}\cdot\frac{\eta k^2 V_{i,f_i}}{1-\frac{1}{3}k\eta\|f_i\|_\infty} + \frac{y}{\eta}\Bigr] \\
&\quad= \sum_{i=1}^{n}\sum_{k=1}^{r_i}\alpha_k(i)\Bigl[\frac{1}{2}\cdot\frac{\eta k^2 V_{i,f_i}}{1-\frac{1}{3}k\eta\|f_i\|_\infty} + \frac{y}{\eta}\Bigl(1-\frac{1}{3}k\eta\|f_i\|_\infty\Bigr) + \frac{1}{3}ky\|f_i\|_\infty\Bigr] \\
&\quad\ge \sum_{i=1}^{n}\sum_{k=1}^{r_i}\alpha_k(i)\Bigl[k\sqrt{2V_{i,f_i}y} + \frac{ky}{3}\|f_i\|_\infty\Bigr]
= \sum_{i=1}^{n}\sum_{k=1}^{r_i}k\,\alpha_k(i)\Bigl[\sqrt{2V_{i,f_i}y} + \frac{y}{3}\|f_i\|_\infty\Bigr].
\end{aligned} \tag{52}
\]
Optimizing over $\eta$ in (51) using (52), we have

\[
P\Bigl(\sum_{i=1}^{n}\int_{S_i} f_i(x)\Bigl[CP_{i,r_i}(dx)-\sum_{k=1}^{r_i}k\,\alpha_k(i)\lambda(x)\,dx\Bigr]
\ge \sum_{i=1}^{n}\sum_{k=1}^{r_i}k\,\alpha_k(i)\Bigl(\sqrt{2yV_{i,f_i}} + \frac{y}{3}\|f_i\|_\infty\Bigr)\Bigr)\le e^{-ny}. \tag{53}
\]
For $i=1,2,\cdots,n$, let $r_i\to\infty$ in (53); then $CP_{i,r_i}(A)\xrightarrow{d} CP_i(A)$, which implies the concentration inequality (8).
Define $c_{1n} := \sum_{i=1}^{n}\mu_i\sqrt{2V_{i,f_i}}$ and $c_{2n} := \sum_{i=1}^{n}\frac{\mu_i}{3}\|f_i\|_\infty$, and let $t = c_{1n}\sqrt{y} + c_{2n}\,y$; then we obtain the expression of $y$ by solving a quadratic equation.

7. Summary and discussion

This paper makes three contributions. In the first part, we provide a characterization of discrete compound Poisson point processes (DCPP) in the same fashion as the 1936 characterization of the Poisson process by Arthur H. Copeland and Francis Regan. The second part derives concentration inequalities for DCPP, which are applied in the third part to obtain optimal oracle inequalities for the weighted Lasso in high-dimensional NB regression. Oracle inequalities for discrete distributions are statistically useful in both the non-asymptotic and the asymptotic analysis of count data regression. The motivation for deriving new concentration inequalities for DCP processes is to show that the KKT conditions of NB regression with a Lasso penalty (with tuning parameter $\lambda$) hold with high probability. The KKT conditions rest on a concentration inequality for a centered weighted sum of NB random variables, and existing concentration results are not readily applicable to the aim of an optimal inference procedure. The optimal inference procedure here is to choose an optimal tuning parameter for the Lasso estimates in high-dimensional NB regression, which leads to the minimax-optimal convergence rate for the estimator via the oracle inequalities for the $\ell_1$-estimation error. In the future, it is of interest to study concentration inequalities for other extended Poisson distributions in regression analysis, such as the Conway-Maxwell-Poisson distribution; see Li et al. (2019) for its existing probability properties. In another direction, it would be desirable to extend our concentration inequalities to establish oracle inequalities for the penalized projection estimators studied by Reynaud-Bouret (2003), when considering the intensity of some inhomogeneous compound Poisson processes.
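To make the inference procedure discussed above concrete, the following minimal sketch fits a weighted-Lasso-penalized NB regression by proximal gradient descent. Everything in it is an illustrative assumption (the dispersion $\theta$ is treated as known, the design is simulated, and the step size and tuning parameter are hand-picked rather than data-driven); it is not the estimator or the code used in the paper.

# Minimal sketch: weighted Lasso for negative binomial regression via proximal
# gradient descent. Illustrative assumptions only; not the paper's implementation.
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 200, 50, 3
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:s] = 0.5
theta = 2.0                                        # NB dispersion, treated as known here
mu_true = np.exp(X @ beta_true)
y = rng.negative_binomial(theta, theta / (theta + mu_true))   # NB counts with mean mu_true

def grad_nll(beta):
    """Gradient of the averaged NB negative log-likelihood, mu = exp(X beta)."""
    mu = np.exp(X @ beta)
    return -X.T @ (theta * (y - mu) / (theta + mu)) / n

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

weights = np.ones(p)       # weighted-Lasso weights (data-driven in the paper; all ones here)
lam_tune = 0.1             # tuning parameter of order sqrt(log(p)/n), hand-picked for the sketch
step = 0.05                # fixed step size, chosen small enough to be stable

beta = np.zeros(p)
for _ in range(3000):
    beta = soft_threshold(beta - step * grad_nll(beta), step * lam_tune * weights)

print("l1 estimation error:", np.abs(beta - beta_true).sum())
print("estimated support:", np.nonzero(beta)[0])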

8. Acknowledgments

This work is part of the doctoral thesis of the first author, who would like to express sincere gratitude to one of his advisors, Professor Jinzhu Jia, for his guidance, and to doctoral candidate Xiangyang Li for his assistance with the simulations. The authors also thank the two anonymous referees for their valuable comments, which greatly improved the quality of the manuscript. While writing an early version of this paper, Professor Patricia Reynaud-Bouret provided several helpful comments and materials on concentration inequalities, for which warm thanks are due.

References


Aczél, J. (1952). On composed Poisson distributions, III. Acta Mathematica Hungarica, 3(3), 219-224.
Baraud, Y., Birgé, L. (2009). Estimating the intensity of a random measure by histogram type estimators. Probability Theory and Related Fields, 143(1-2), 239-284.
Bickel, P. J., Ritov, Y. A., Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 1705-1732.
Bühlmann, P., van de Geer, S. A. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer.
Boucheron, S., Lugosi, G., Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
Chareka, P., Chareka, O., Kennedy, S. (2006). Locally sub-Gaussian random variables and the strong law of large numbers. Atlantic Electronic Journal of Mathematics, 1(1), 75-81.
Cleynen, A., Lebarbier, E. (2014). Segmentation of the Poisson and negative binomial rate models: a penalized estimator. ESAIM: Probability and Statistics, 18, 750-769.
Copeland, A. H., Regan, F. (1936). A postulational treatment of the Poisson law. Annals of Mathematics, 357-362.
Das, A. (2018). Design and Analysis of Statistical Learning Algorithms which Control False Discoveries (Doctoral dissertation, Universität zu Köln).
Dinh The Luc. (2016). Multiobjective Linear Programming: An Introduction. Springer.
Giles, D. E. (2010). Hermite Regression Analysis of Multi-Modal Count Data. Economics Bulletin, 30(4), 2936-2945.
Feller, W. (1968). An introduction to probability theory and its applications, Vol. I, 3rd ed. Wiley, New York.
Hastie, T., Tibshirani, R., Wainwright, M. (2015). Statistical learning with sparsity: the lasso and generalizations. CRC Press.
Hilbe, J. M. (2011). Negative binomial regression, 2nd ed. Cambridge University Press.
Houdré, C. (2002). Remarks on deviation inequalities for functions of infinitely divisible random vectors. Annals of Probability, 1223-1237.

Houdré, C., Privault, N. (2002). Concentration and deviation inequalities in infinite dimensions via covariance representations. Bernoulli, 8(6), 697-720.
Ivanoff, S., Picard, F., Rivoirard, V. (2016). Adaptive Lasso and group-Lasso for functional Poisson regression. Journal of Machine Learning Research, 17(55), 1-46.
Johnson, N. L., Kemp, A. W., Kotz, S. (2005). Univariate Discrete Distributions, 3rd ed. Wiley, New Jersey.
Kingman, J. F. C. (1993). Poisson processes. Oxford University Press.
Kontoyiannis, I., Madiman, M. (2006). Measure concentration for compound Poisson distributions. Electronic Communications in Probability, 11, 45-57.
Last, G., Penrose, M. D. (2017). Lectures on the Poisson process. Cambridge University Press.
Li, Q. (2016). Bayesian Models for High-Dimensional Count Data with Feature Selection (Doctoral dissertation, Rice University).
Li, B., Zhang, H., He, J. (2019). Some characterizations and properties of COM-Poisson random variables. Communications in Statistics - Theory and Methods, 1-19.
Jiang, X., Raskutti, G., Willett, R. (2015). Minimax optimal rates for Poisson inverse problems with physical constraints. IEEE Transactions on Information Theory, 61(8), 4458-4474.
Mallick, H., Tiwari, H. K. (2016). EM Adaptive LASSO: A Multilocus Modeling Strategy for Detecting SNPs Associated with Zero-inflated Count Phenotypes. Frontiers in Genetics, 7.
Nielsen, F., Nock, R. (2009). Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6), 2882-2904.
Petrov, V. V. (1995). Limit theorems of probability theory: sequences of independent random variables. Oxford University Press, New York.
Raskutti, G., Wainwright, M. J., Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10), 6976-6994.
Rigollet, P., Hütter, J. C. (2019). High dimensional statistics. http://www-math.mit.edu/~rigollet/PDFs/RigNotes17.pdf
Reynaud-Bouret, P. (2003). Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probability Theory and Related Fields, 126(1), 103-153.
Sato, K. (2013). Lévy processes and infinitely divisible distributions, Revised edition. Cambridge University Press.
Städler, N., Bühlmann, P., van de Geer, S. (2010). $\ell_1$-penalization for mixture regression models. Test, 19(2), 209-256.
van de Geer, S. A. (2007). The deterministic lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich.
Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge University Press.
Yu, Y. (2010). High-dimensional Variable Selection in Cox Model with Generalized Lasso-type Convex Penalty. https://people.maths.bris.ac.uk/~yy15165/index_files/Cox_generalized_convex.pdf
Wang, Y. H., Ji, S. (1993). Derivations of the compound Poisson distribution and process. Statistics & Probability Letters, 18(1), 1-7.
Yang, X., Zhang, H., Wei, H., Zhang, S. (2019). Sparse Density Estimation with Measurement Errors. arXiv preprint arXiv:1911.06215.
Zhang, H., Liu, Y., Li, B. (2014). Notes on discrete compound Poisson model with applications to risk theory. Insurance: Mathematics and Economics, 59, 325-336.
Zhang, H., Li, B. (2016). Characterizations of discrete compound Poisson distributions. Communications in Statistics - Theory and Methods, 45(22), 6789-6802.
Zhang, H., Jia, J. (2017). Elastic-net regularized high-dimensional negative binomial regression: consistency and weak signals detection. arXiv preprint arXiv:1712.03412.
