
Compound Poisson point processes, concentration and oracle inequalities

Huiming Zhang1* and Xiaoxu Wu2

Journal of Inequalities and Applications (2019) 2019:312
https://doi.org/10.1186/s13660-019-2263-8

*Correspondence: [email protected]
1 School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China. Full list of author information is available at the end of the article.

Abstract
This note aims at presenting several new theoretical results for the compound Poisson point process, which follows the work of Zhang et al. (Insur. Math. Econ. 59:325–336, 2014). The first part provides a new characterization for a discrete compound Poisson point process (proposed by Aczél (Acta Math. Hung. 3(3):219–224, 1952)); it extends the characterization of the Poisson point process given by Copeland and Regan (Ann. Math. 37:357–362, 1936). Next, we derive some concentration inequalities for the discrete compound Poisson point process (the negative binomial with unknown dispersion is a significant example). These concentration inequalities are potentially useful in count data regression. We give an application to weighted Lasso penalized negative binomial regressions, whose KKT conditions of the penalized likelihood hold with high probability, and then we derive non-asymptotic oracle inequalities for the weighted Lasso.

MSC: 60G55; 60G51; 62J12

Keywords: Characterization of point processes; Poisson random measure; Concentration inequalities; Sub-gamma random variables; High-dimensional negative binomial regressions

1 Introduction
Powerful and useful concentration inequalities for empirical processes have been derived one after another for several decades. Various types of concentration inequalities are mainly based on moment conditions (such as sub-Gaussian, sub-exponential and sub-gamma) and bounded difference conditions; see Boucheron et al. [4]. Their central task is to evaluate the fluctuation of an empirical process away from some value (for example, the mean and median) in probability. Attention has greatly expanded into various areas of research such as high-dimensional statistical models. For Poisson or negative binomial count data regression (Hilbe [13]), it should be noted that the Poisson distribution (the limit distribution of the negative binomial) is not sub-Gaussian, but is locally sub-Gaussian (Chareka et al. [6]) or sub-exponential (see Sect. 2.1.3 of Wainwright [34]). However, few relevant results are known about concentration inequalities for a weighted sum of negative binomial (NB) random variables and their statistical applications. With an application to the segmentation of RNA-Seq data, a documented example is that Cleynen and Lebarbier [7] obtained a concentration inequality for


the sum of independent centered negative binomial random variables; its derivation hinges on the assumption that the dispersion parameter is known. Under that assumption, the negative binomial random variable belongs to the exponential family. But if the dispersion parameter of a negative binomial random variable X is unknown, as in real-world problems, X does not belong to the exponential family with density
$$f_X(x\mid\theta)=h(x)\exp\{\eta(\theta)T(x)-A(\theta)\},\qquad(1)$$

where $T(x)$, $h(x)$, $\eta(\theta)$ and $A(\theta)$ are some given functions. However, it is well known that the Poisson and negative binomial distributions belong to the family of infinitely divisible distributions. Houdré [14] studied dimension-free concentration inequalities for non-Gaussian infinitely divisible random vectors with finite exponential moments. In particular, the geometric case, a special case of the negative binomial, was obtained in Houdré [14]. Via the entropy method, Kontoyiannis and Madiman [20] gave a simple derivation of concentration inequalities for a Lipschitz function of a discrete compound Poisson random variable. Nonetheless, when deriving oracle inequalities for Lasso or Elastic-net estimates (see Ivanoff et al. [16], Zhang and Jia [38]), it is strenuous to use the results of Kontoyiannis and Madiman [20] or Houdré [14] to get an explicit expression for the tuning parameter under the KKT conditions of the penalized negative binomial likelihood (Poisson regression is a particular case). Moreover, when the negative binomial responses are treated as sub-exponential distributions (distributions with

the Cramér-type condition), the upper bound of the $\ell_1$-estimation error oracle inequality is determined by the tuning parameter with the rate $O(\frac{\log p}{n})$, which does not show rate-optimality in the minimax sense, although it is sharper than $O(\sqrt{\frac{\log p}{n}})$. Consider the linear regression $Y=X\beta^*+\varepsilon$ with noise $\operatorname{Var}\varepsilon=\sigma^2I_n$ and $\beta^*$ being the $p$-dimensional true parameter. Under $\ell_0$-constraints on $\beta^*$, Raskutti et al. [28] showed that the lower bound on the minimax prediction error for all estimators $\hat\beta$ is
$$\inf_{\hat\beta}\sup_{\|\beta^*\|_0\le s_0}\frac{1}{n}\big\|X\hat\beta-X\beta^*\big\|_2^2\ge O\Big(\frac{s_0\log(p/s_0)}{n}\Big)\quad(X\text{ is an }n\times p\text{ design matrix})$$

with probability at least 1/2. Thus, the optimal minimax rate for $\|\hat\beta-\beta^*\|_1$ should be proportional to the root of the rate $O(\frac{\log p}{n})$. A similar minimax lower bound for Poisson regression is provided in Jiang et al. [17]; see Chap. 4 of Rigollet and Hütter [30] for more minimax theory.
This paper is prompted by the motif of high-dimensional count regression: we derive useful NB concentration inequalities under the compound Poisson framework. The obtained concentration inequalities are beneficial to optimal high-dimensional inference. Our concentration inequalities concern the sum of independent centered weighted NB random variables, which is more general than the un-weighted case in Cleynen and Lebarbier [7], where the over-dispersion parameter of the NB random variables is supposed known. In our case, the over-dispersion parameter can be unknown. As a by-product, we propose a characterization of discrete compound Poisson point processes following the work in Zhang et al. [40] and Zhang and Li [39]. Städler et

al. [32] studied oracle inequalities for the $\ell_1$-penalized finite mixture of Gaussian regressions model, but they did not consider mixtures of count data regressions (such as the mixture of

Poisson regressions; see Yang et al. [36] for example). Here NB regression can be approximately viewed as a finite mixture of Poisson regressions, since the NB distribution is equivalent to a continuous mixture of Poisson distributions where the mixing distribution of the Poisson rate is gamma. The paper ends with some applications to high-dimensional NB regression in terms of oracle inequalities.
The paper is organized as follows. Section 2 is an introduction to discrete compound Poisson point processes (DCPP). We give a theorem characterizing DCPP, which is similar to the result that an initial condition, stationary and independent increments, and a condition on the jumps together characterize a compound Poisson process; see Wang and Ji [35] and the references therein. In Sect. 3, we derive concentration inequalities for the discrete compound Poisson point process with infinite-dimensional parameters. As an application, for the optimization problem of weighted Lasso penalized negative binomial regression, we show that the true-parameter version of the KKT conditions holds with high probability, and the optimal data-driven weights in the weighted Lasso penalty are determined. We also show oracle inequalities for weighted Lasso estimates in NB regressions, and the convergence rate attains the minimax rate derived in the references.

2 Introduction to discrete compound Poisson point processes
In this section, we present the preliminary knowledge for the discrete compound Poisson point process. A new characterization of the point process under five assumptions is obtained.

2.1 Definition and preliminaries To begin with, we need to be familiar with the definition of the discrete compound Poisson distribution and its relation to the weighted sum of Poisson random variables. For more details, we refer the reader to Sect. 9.3 of Johnson et al. [18], Zhang et al. [40], Zhang and Li [39] and the references therein.

Definition 2.1 We say that Y is discrete compound Poisson (DCP) distributed if the characteristic function of Y is
$$\varphi_Y(t)=\mathrm{E}e^{itY}=\exp\Big\{\sum_{i=1}^\infty\alpha_i\lambda\big(e^{iti}-1\big)\Big\}\quad(t\in\mathbb{R}),\qquad(2)$$
where $(\alpha_1\lambda,\alpha_2\lambda,\dots)$ are infinite-dimensional parameters satisfying $\sum_{i=1}^\infty\alpha_i=1$, $\alpha_i\ge 0$, $\lambda>0$. We denote it by $Y\sim\mathrm{DCP}(\alpha_1\lambda,\alpha_2\lambda,\dots)$.

If $\alpha_i=0$ for all $i>r$ and $\alpha_r\neq 0$, we say that Y is DCP distributed of order r. If $r=+\infty$, the DCP distribution has infinite-dimensional parameters and we say it is DCP distributed of order $+\infty$. When $r=1$, it is the well-known Poisson distribution for modeling equi-dispersed count data. When $r=2$, we call it the Hermite distribution, which can be applied to model over-dispersed and multi-modal count data; see Giles [11].
Equation (2) is the canonical representation of the characteristic function of a non-negative integer-valued infinitely divisible random variable, i.e. a special case of the Lévy–Khinchine formula (see Chap. 2 of Sato [31] and Sect. 1.6 of Petrov [27]),
$$\mathrm{E}e^{itY}=\exp\Big\{ait-\frac{1}{2}\sigma^2t^2+\int_{\mathbb{R}\setminus\{0\}}\big(e^{itx}-1-itxI_{|x|<1}(x)\big)\,\nu(dx)\Big\},$$

where $a\in\mathbb{R}$, $\sigma\ge 0$, $I_{\{\cdot\}}(x)$ is the indicator function, and the function $\nu(\cdot)$ is called the Lévy measure, with the restriction
$$\int_{\mathbb{R}\setminus\{0\}}\min\big(x^2,1\big)\,\nu(dx)<\infty.$$
Let $\delta_x(t):=1_{\{x\}}(t)$ and let $\mathbb{N}^+$ be the set of positive integers. If we set $\nu(t)=\sum_{x\in\mathbb{N}^+}\lambda\delta_x(t)\alpha_x$, then $\nu(\cdot)$ is the Lévy measure in the Lévy–Khinchine formula. It is easy to see that Y has a weighted Poisson decomposition: $Y=\sum_{i=1}^\infty iN_i$, where the $N_i$ are independent Poisson distributed with means $\lambda\alpha_i$. This decomposition is also called a Lévy–Itô decomposition; see Chap. 4 of Sato [31]. If $\mathrm{E}e^{-\theta Y}<\infty$ for $\theta$ in a neighbourhood of zero, the moment generating function (m.g.f.) of the DCP distribution is
$$M_Y(\theta):=\mathrm{E}e^{-\theta Y}=\exp\Big\{\sum_{k=1}^\infty\alpha_k\lambda\big(e^{-k\theta}-1\big)\Big\}=\exp\Big\{\int_0^\infty\big(e^{-\theta x}-1\big)\,\nu(dx)\Big\}\quad\big(|\theta|<R\big)$$
for some $R>0$.
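The weighted Poisson decomposition above suggests a direct way to simulate a DCP random variable and to check the moment formulas $\mathrm{E}Y=\lambda\sum_k k\alpha_k$ and $\operatorname{Var}Y=\lambda\sum_k k^2\alpha_k$ numerically. The following minimal Python sketch (our illustration, not from the paper; it truncates the decomposition at a finite order r) does exactly this:

```python
import numpy as np

def sample_dcp(lam, alpha, size, seed=None):
    """Sample Y ~ DCP(alpha_1*lam, alpha_2*lam, ...) truncated at order r = len(alpha),
    via the weighted Poisson decomposition Y = sum_k k*N_k with N_k ~ Poisson(lam*alpha_k)."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    ks = np.arange(1, len(alpha) + 1)
    N = rng.poisson(lam * alpha, size=(size, len(alpha)))  # all N_k at once
    return N @ ks

# Hermite case (r = 2): EY = lam*(1*a1 + 2*a2), Var Y = lam*(1*a1 + 4*a2)
lam, alpha = 3.0, [0.7, 0.3]
y = sample_dcp(lam, alpha, size=200_000, seed=1)
print(y.mean(), lam * (0.7 + 2 * 0.3))   # both approximately 3.9
print(y.var(), lam * (0.7 + 4 * 0.3))    # both approximately 5.7
```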

In order to define the discrete compound Poisson point process, we shall repeatedly use the concept of a Poisson random measure. Let A be any measurable set in a measurable space $(E,\mathcal{A})$; a good example is $E=\mathbb{R}^d$. In the following sections, we let $E=\mathbb{R}^d$ and write $N(A,\omega)$ for the number of random points in the set A. We introduce the Poisson point process below; it is sometimes called the Poisson random measure (Sato [31], Kingman [19]).

Definition 2.2 Let $(E,\mathcal{A},\mu)$ be a measurable space with $\sigma$-finite measure $\mu$, and let $\mathbb{N}$ be the set of non-negative integers. The Poisson random measure with intensity $\mu$ is a family of random variables $\{N(A,\omega)\}_{A\in\mathcal{A}}$ (defined on some probability space $(\Omega,\mathcal{F},P)$), viewed as the product map
$$N:\mathcal{A}\times\Omega\to\mathbb{N},$$
satisfying:
1. for every $A\in\mathcal{A}$, $N(A,\cdot)$ is a Poisson random variable on $(\Omega,\mathcal{F},P)$ with mean $\mu(A)$;
2. for every $\omega\in\Omega$, $N(\cdot,\omega)$ is a counting measure on $(E,\mathcal{A})$;
3. if the sets $A_1,A_2,\dots,A_n\in\mathcal{A}$ are disjoint, then $N(A_1,\cdot),N(A_2,\cdot),\dots,N(A_n,\cdot)$ are mutually independent.

Let $N(A):=N(A,\omega)$. Based on the Poisson random measure, we define the discrete compound Poisson point process (DCPP) $\{CP(A)\}_{A\in\mathcal{A}}$ by the Lévy–Itô decomposition
$$CP(A):=\sum_{k=1}^\infty kN_k(A),$$
where the $N_k(A)$ are independent Poisson point processes with mean measures $\mu_k(A):=\alpha_k\int_A\lambda(x)\,dx$.
The m.g.f. of $\{CP(A)\}_{A\in\mathcal{A}}$ is $M_{CP(A)}(\theta)=\exp\{\sum_{k=1}^\infty\alpha_k\int_A\lambda(x)\,dx\,(e^{-k\theta}-1)\}$. We can see that $\lambda(\cdot)$ is the intensity function of the generating Poisson point processes $\{N_k(A)\}_{A\in\mathcal{A}}$ for each k. Aczél [1] derived the p.m.f. and called it the inhomogeneous composed Poisson distribution when $d=1$. If the intensity function $\lambda(x)$ is constant (i.e. the mean measure is a multiple of Lebesgue measure), then N is said to be a homogeneous Poisson point process.

Define the probability $P_k(A)$ by $P(N(A)=k)$. If $N(A)$ follows a DCP point process with intensity measure $\lambda(A)$, then the probability of having $k$ ($k\ge 0$) points in the set A is given by
$$P_k(A)=\sum_{R(s,k)=k}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}\big[\lambda(A)\big]^{s_1+\cdots+s_k}e^{-\lambda(A)},\qquad(3)$$
where $R(s,k):=\sum_{t=1}^k ts_t$; see Aczél [1] for the proof.
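Equation (3) can be evaluated directly by enumerating the solutions of $R(s,k)=k$, which are exactly the partitions of k. A short Python sketch (our illustration; dcp_pmf is a hypothetical helper) computes $P_k(A)$ this way and cross-checks it against simulation from the weighted Poisson decomposition:

```python
import math
import numpy as np

def dcp_pmf(k, lam_A, alpha):
    """P_k(A) from Eq. (3): sum over (s_1,...,s_k) with sum_t t*s_t = k."""
    def parts(rem, t):  # tuples (s_t,...,s_k) with sum of t*s_t = rem
        if t > k:
            if rem == 0:
                yield ()
            return
        for s in range(rem // t + 1):
            for tail in parts(rem - t * s, t + 1):
                yield (s,) + tail
    total = sum(
        math.prod(alpha[t] ** s[t] / math.factorial(s[t]) for t in range(k))
        * lam_A ** sum(s)
        for s in parts(k, 1)
    )
    return total * math.exp(-lam_A)

# cross-check against the decomposition Y = N_1 + 2*N_2 (order-2 DCP)
lam_A, alpha = 2.0, [0.6, 0.4]
rng = np.random.default_rng(0)
y = rng.poisson(lam_A * np.array(alpha), size=(500_000, 2)) @ np.array([1, 2])
for k in range(5):
    pad = alpha + [0.0] * max(0, k - 2)    # alpha_j = 0 for j > 2
    print(k, round(dcp_pmf(k, lam_A, pad), 5), round((y == k).mean(), 5))
```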

2.2 A new characterization of the DCP point process
Based on the preliminaries, we give a new characterization of the DCP point process, which is an extension of Copeland and Regan [8]. A similar characterization for the DCP distribution and process (not for the point process in terms of a random measure) is derived by Wang and Ji [35]. For more characterizations of point processes, see the monograph of Last and Penrose [21] and the references therein.

Theorem 2.1 (Characterization of the DCP point process) Consider the following assumptions:
1. $P_k(A)>0$ if $0<m(A)<\infty$, and $\sum_{k=0}^\infty P_k(A)=1$;
2. $P_0(A)=1$ if $m(A)=0$;
3. if $A_1,A_2$ are disjoint, the numbers of points in $A_1$ and $A_2$ are independent, so that $P_k(A_1\cup A_2)=\sum_{i=0}^k P_{k-i}(A_1)P_i(A_2)$;
4. $\lim_{m(A)\to 0}P_0(A)=1$;
5. for each $k\ge 1$ the limit $\alpha_k:=\lim_{m(A)\to 0}P_k(A)/\sum_{i=1}^\infty P_i(A)$ exists.
Then there exists a unique intensity measure $\nu(\cdot)$ (satisfying countable additivity and absolute continuity) such that $P_k(A)$ is represented by equation (3) for all k and A.

For A being a closed interval in $\mathbb{R}$, this setting turns into the famous characterization of the Poisson process; see p. 447 of Feller [10] (the postulates for the Poisson process). The proof of Theorem 2.1 is given in the Appendix. The proof consists of showing countable additivity and absolute continuity of the intensity measure and dealing with a matrix differential equation.
As a random measure, the DCPP is a mathematical extension of the discrete compound Poisson distribution and of the discrete compound Poisson process indexed by the real line. Baraud and Birgé [2] used histogram-type estimators to estimate the intensity of a Poisson random measure. In a recent application, Das [9] proposed a nonparametric Bayesian approach to RNA-seq count data via the DCP process. For a set of regions $\{A\in\mathcal{A}\}$ of interest on the genome, such as genes, exons, or junctions, the number of reads is counted, and the count in the region $A_i$ is viewed as a DCP process for gene expression. The $P_k(A)$ satisfy assumptions 1–5 of Theorem 2.1 in the real-world setting (the process generating the expression level of each gene in the experiment).

3 Concentration inequalities for discrete compound Poisson point processes
This section is about the construction of concentration inequalities for DCPP. It has applications in high-dimensional $\ell_1$-regularized negative binomial regression; see Sect. 4.

As an appetizing example, if the DCPP has finitely many parameters, an existing result such as Lemma 3 of Baraud and Birgé [2] can be used to obtain a concentration inequality for the sum of DCP random variables.

Lemma 3.1 (Baraud and Birgé [2]) Let $Y_1,\dots,Y_n$ be n independent centered random variables, and let $\tau,\kappa,\{\eta_i\}_{i=1}^n$ be some positive constants.
(a) If $\log(\mathrm{E}e^{zY_i})\le\kappa\frac{z^2\eta_i}{2(1-z\tau)}$ for all $z\in[0,1/\tau)$ and $1\le i\le n$, then
$$P\bigg(\sum_{i=1}^nY_i\ge\Big(2\kappa x\sum_{i=1}^n\eta_i\Big)^{1/2}+\tau x\bigg)\le e^{-x}\quad\text{for all }x>0.$$
(b) If $\log(\mathrm{E}e^{-zY_i})\le\kappa z^2\eta_i/2$ for $1\le i\le n$ and all $z>0$, then
$$P\bigg(\sum_{i=1}^nY_i\le-\Big(2\kappa x\sum_{i=1}^n\eta_i\Big)^{1/2}\bigg)\le e^{-x}\quad\text{for all }x>0.$$

Actually, the moment conditions in Lemma 3.1 are a direct application of existing results based on Cramér-type conditions (Lemma 2.2 in Petrov [27], or the sub-exponential condition (4) discussed below) for convolutions of Poisson distributions with different rates. A centered random variable $Y_i$ satisfying the moment condition $\log(\mathrm{E}e^{zY_i})\le\kappa\frac{z^2\eta_i}{2(1-z\tau)}$ of Lemma 3.1(a) is called a sub-gamma random variable (see Sect. 2.4 in Boucheron et al. [4]). A random variable X with zero mean is sub-exponential (see Sect. 2.1.3 in Wainwright [34]) if there are non-negative parameters $(v,\alpha)$ such that
$$\mathrm{E}e^{\lambda X}\le e^{\frac{\lambda^2v^2}{2}}\quad\text{for all }|\lambda|<\frac{1}{\alpha}.\qquad(4)$$
Note that the sub-exponential condition implies the sub-gamma condition via the inequality $\log(\mathrm{E}e^{-zY_i})\le\frac{\kappa z^2\eta_i}{2}\le\frac{\kappa z^2\eta_i}{2(1-z\tau)}$ for all $z\in[0,1/\tau)$.

Proposition 3.1 Let $Y_i\sim\mathrm{DCP}(\alpha_1(i)\lambda(i),\dots,\alpha_r(i)\lambda(i))$ for $i=1,2,\dots,n$, and let $\sigma_i^2:=\operatorname{Var}Y_i=\lambda(i)\sum_{k=1}^rk^2\alpha_k(i)$. Then for all $x>0$
$$P\bigg(\sum_{i=1}^n(Y_i-\mathrm{E}Y_i)\ge\sqrt{2x\sum_{i=1}^n\sigma_i^2}+rx\bigg)\le e^{-x},$$
$$P\bigg(\sum_{i=1}^n(Y_i-\mathrm{E}Y_i)\le-\sqrt{2x\sum_{i=1}^n\sigma_i^2}\bigg)\le e^{-x}.$$
Moreover, we have
$$P\bigg(\bigg|\sum_{i=1}^n(Y_i-\mathrm{E}Y_i)\bigg|\ge\sqrt{2x\sum_{i=1}^n\sigma_i^2}+rx\bigg)\le 2e^{-x}.\qquad(5)$$
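A quick Monte Carlo sanity check of the upper-tail bound in Proposition 3.1 (our sketch, using order r = 2 DCP variables with arbitrary rates) looks as follows; the empirical tail should fall below $e^{-x}$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, r, x = 50, 2, 3.0
lam = rng.uniform(0.5, 2.0, n)              # lambda(i)
alpha = np.array([0.6, 0.4])                # alpha_k(i), the same for all i here
ks = np.arange(1, r + 1)
mu = lam * (alpha * ks).sum()               # EY_i
sig2 = lam * (alpha * ks**2).sum()          # sigma_i^2 = Var Y_i
thr = np.sqrt(2 * x * sig2.sum()) + r * x   # threshold in Proposition 3.1

B = 200_000
N = rng.poisson(np.outer(lam, alpha), size=(B, n, r))   # all N_k(i), all replicates
S = (N * ks).sum(axis=2).sum(axis=1) - mu.sum()         # sum_i (Y_i - EY_i)
print((S >= thr).mean(), "<=", np.exp(-x))              # empirical tail vs e^{-x}
```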

Proof First, in order to apply the above lemma, we need to evaluate the log-moment-generating function of the centered DCP random variables. Let $\mu_i:=\mathrm{E}Y_i=\lambda(i)\sum_{k=1}^rk\alpha_k(i)$; we have
$$\log\mathrm{E}e^{z(Y_i-\mu_i)}=-z\mu_i+\log\mathrm{E}e^{zY_i}=-z\mu_i+\lambda(i)\sum_{k=1}^r\alpha_k(i)\big(e^{kz}-1\big)=\lambda(i)\sum_{k=1}^r\alpha_k(i)\big(e^{kz}-kz-1\big).$$
Using the inequality $e^z-z-1\le\frac{z^2}{2(1-z)}$ for $0<z<1$, one derives
$$\log\mathrm{E}e^{z(Y_i-\mu_i)}\le\lambda(i)\sum_{k=1}^r\alpha_k(i)\frac{k^2z^2}{2(1-kz)}\le\lambda(i)\sum_{k=1}^r\alpha_k(i)\frac{k^2z^2}{2(1-rz)}=:\frac{\sigma_i^2z^2}{2(1-rz)}\quad(z>0),$$
where $\sigma_i^2=\operatorname{Var}Y_i=\lambda(i)\sum_{k=1}^rk^2\alpha_k(i)$. And by applying the inequality $e^z-z-1\le\frac{z^2}{2}$ for $z<0$, we get
$$\log\mathrm{E}e^{z(Y_i-\mu_i)}\le\lambda(i)\sum_{k=1}^r\alpha_k(i)\frac{k^2z^2}{2}=:\frac{\sigma_i^2z^2}{2}\quad(z<0).\qquad(6)$$
Thus (6) implies $\log\mathrm{E}e^{-z(Y_i-\mu_i)}\le\frac{\sigma_i^2z^2}{2}$ for $z>0$. Therefore, by letting $\tau=r$, $\kappa=1$, $\eta_i=\sigma_i^2$ in Lemma 3.1, we have (5). □

In the sequel, without employing Cramér-type conditions, we derive a Bernstein-type concentration inequality for DCPP with infinite-dimensional parameters ($r=+\infty$). Theorem 3.1 below is a weighted-sum version, which is not the same as Proposition 3.1.

Theorem 3.1 Let $f(x)$ be a measurable function such that $\int_Ef^2(x)\lambda(x)\,dx<\infty$, where $\lambda(x)$ is the intensity function of the DCPP with parameters $(\alpha_1\int_E\lambda(x)\,dx,\alpha_2\int_E\lambda(x)\,dx,\dots)$, under the condition $\sum_{k=1}^\infty k\alpha_k<\infty$. Then the concentration inequality for a stochastic integral of the DCPP is
$$P\bigg(\int_Ef(x)\,CP(dx)-\sum_{k=1}^\infty k\alpha_k\int_Ef(x)\lambda(x)\,dx\ge\sum_{k=1}^\infty k\alpha_k\Big(\sqrt{2yV_f}+\frac{y}{3}\|f\|_\infty\Big)\bigg)\le e^{-y}\quad(y>0),\qquad(7)$$
where $V_f=\int_Ef^2(x)\lambda(x)\,dx$ and $\|f\|_\infty$ is the supremum of $f(x)$ on E.

Let $f(x)=1_A(x)$ in Theorem 3.1; then we have $P(Y-\mathrm{E}Y\ge\frac{\mathrm{E}Y}{\lambda}[\sqrt{2y\lambda}+\frac{y}{3}])\le e^{-y}$ for $Y\sim\mathrm{DCP}(\alpha_1\lambda,\alpha_2\lambda,\dots)$. Following the guide of the proof of Corollary 3.1 below, it is not hard to obtain Theorem 3.1 for a linear combination of random variables, which is more important for the applications in Sect. 4. The proof of Theorem 3.1 is given below. It rests essentially on the fact that the infinitely divisible distributions are closed under convolution and scaling. Corollary 3.1 is the version for the sum of n independent random measures.

Corollary 3.1 For a given collection of sets $\{S_i\}_{i=1}^n$ in E, suppose we have n independent DCPP $\{CP_i(S_i)\}_{i=1}^n$ with the corresponding series of parameters
$$\bigg\{\Big(\alpha_1(i)\int_{S_i}\lambda(x)\,dx,\ \alpha_2(i)\int_{S_i}\lambda(x)\,dx,\ \dots\Big)\bigg\}_{i=1}^n.$$
Then, for a series of measurable functions $\{f_i(x)\}_{i=1}^n$, we have
$$P\bigg(\sum_{i=1}^n\Big[\int_{S_i}f_i(x)\,CP_i(dx)-\mu_i\int_{S_i}f_i(x)\lambda(x)\,dx\Big]\ge\sum_{i=1}^n\mu_i\Big(\sqrt{2yV_{i,f_i}}+\frac{y}{3}\|f_i\|_\infty\Big)\bigg)\le e^{-ny}\quad(y>0),\qquad(8)$$
where $\mu_i:=\sum_{k=1}^\infty k\alpha_k(i)<\infty$ and $V_{i,f_i}:=\int_{S_i}f_i^2(x)\lambda(x)\,dx<\infty$.
Moreover, let $c_{1n}:=\sum_{i=1}^n\mu_i\sqrt{2V_{i,f_i}}$ and $c_{2n}:=\sum_{i=1}^n\frac{\mu_i}{3}\|f_i\|_\infty$; then
$$P\bigg(\sum_{i=1}^n\Big[\int_{S_i}f_i(x)\,CP_i(dx)-\mu_i\int_{S_i}f_i(x)\lambda(x)\,dx\Big]\ge t\bigg)\le\exp\bigg\{-n\bigg(\sqrt{\frac{t}{c_{2n}}+\frac{c_{1n}^2}{4c_{2n}^2}}-\frac{c_{1n}}{2c_{2n}}\bigg)^2\bigg\}.\qquad(9)$$

The difference between Proposition 3.1 and Corollary 3.1 is that the DCP random variables in Proposition 3.1 have order $r<+\infty$, while Corollary 3.1 concerns DCP point processes (a particular case of DCPP has constant intensity $\lambda(A)\equiv\lambda$) of order $r=+\infty$. Moreover, Proposition 3.1 requires the variance-dependent factors $\sigma_i^2$ of the DCP random variables.

Remark 1 Significantly, Corollary 3.1 is a weighted-sum version, and it improves the results of Cleynen and Lebarbier [7], who deal with the un-weighted sum of NB random variables (a special case of the DCP distribution). Moreover, Cleynen and Lebarbier's [7] result relies on Lemma 3.1, which depends on the sub-gamma condition. It should be noted that we do not impose any boundedness restriction on y in our concentration inequalities (7) and (8), while Theorem 1 of Houdré [14] and the concentration inequalities of Houdré and Privault [15] require that the infinitely divisible random vector satisfies Cramér-type conditions and that the y in $P(X-\mathrm{E}X\ge y)$ be bounded by a given constant. The moment conditions in Lemma 3.1 are sometimes difficult to check, making it hard to apply to the Lasso KKT conditions in Sect. 4.1 if the count data follow other infinitely divisible discrete distributions. In our Theorem 3.1 and Corollary 3.1, we only need to check that the means of the DCP processes are finite, and Theorem 3.1 and Corollary 3.1 do not need the Bernstein-type condition (see Theorem 2.8 in Petrov [27] and p. 27 of Wainwright [34]). In addition, our concentration result does not require the higher-moment conditions proposed by Kontoyiannis and Madiman [20]: they assume $\mathrm{E}Z^L=\sum_{k=1}^\infty k^L\alpha_k<\infty$ ($L>1$) for the compounding distribution $P(Z=i):=\alpha_i$ of a given discrete random variable Z.

Before we give the proof of Corollary 3.1, we need some notation and a lemma. For a general point process $N(A)$ defined on $\mathbb{R}^d$, let f be any non-negative measurable function on E. The Laplace functional is defined by $L_N(f)=\mathrm{E}[\exp\{-\int_Ef(x)N(dx)\}]$, where the stochastic integral is specified by $\int_Ef(x)N(dx)=:\sum_{x_i\in\Pi\cap E}f(x_i)$ and $\Pi$ is the state space of the point process $N(A)$. The Laplace functional of a random measure is a crucial numerical characteristic that enables us to handle stochastic integrals of the point process. Moreover, we rely on the following theorem due to Campbell.

Lemma 3.2 (See Sect. 3.3 in Kingman [19]) Let $N(A)$ be a Poisson random measure with intensity $\lambda(t)$ and let $f:\mathbb{R}^d\to\mathbb{R}$ be a measurable function. The random sum $S=\sum_{x\in\Pi}f(x)$ is absolutely convergent with probability one if and only if
$$\int_{\mathbb{R}^d}\min\big(|f(x)|,1\big)\lambda(x)\,dx<\infty.$$
Assume this condition holds. Then, for any complex number $\theta$, the equation
$$\mathrm{E}e^{\theta S}=\exp\Big\{\int_{\mathbb{R}^d}\big(e^{\theta f(x)}-1\big)\lambda(x)\,dx\Big\}$$
holds if the integral on the right-hand side is convergent.
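Campbell's formula is easy to verify numerically for a homogeneous Poisson process on $[0,1]$: simulate $S=\sum f(x_i)$ repeatedly and compare the empirical mean of $e^{\theta S}$ with the closed form. A minimal Python sketch (ours; any bounded f works here):

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(3)
lam, theta, B = 5.0, 0.5, 100_000
f = np.sin                                   # test function f(x) = sin(x)

vals = np.empty(B)
for b in range(B):                           # one Poisson process per replicate
    pts = rng.uniform(0.0, 1.0, rng.poisson(lam))
    vals[b] = np.exp(theta * f(pts).sum())

rhs, _ = quad(lambda x: np.exp(theta * np.sin(x)) - 1.0, 0.0, 1.0)
print(vals.mean(), np.exp(lam * rhs))        # the two sides of Campbell's formula
```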

Proof Now we can easily compute the DCPP's Laplace functional by Campbell's theorem. Set $CP_r(A)=:\sum_{k=1}^rkN_k(A)$, and let $\theta=-1$, i.e. $L_N(f)=\exp\{\int_E[e^{-f(x)}-1]\lambda(x)\,dx\}$. By independence, it follows that
$$\begin{aligned}L_{CP_r}(f)&=\mathrm{E}\exp\Big\{-\int_Ef(x)\,CP_r(dx)\Big\}=\mathrm{E}\exp\Big\{-\int_Ef(x)\sum_{k=1}^rkN_k(dx)\Big\}\\&=\prod_{k=1}^r\mathrm{E}\exp\Big\{-\int_Ekf(x)\,N_k(dx)\Big\}=\prod_{k=1}^r\exp\Big\{\int_E\big(e^{-kf(x)}-1\big)\alpha_k\lambda(x)\,dx\Big\}\\&=\exp\Big\{\sum_{k=1}^r\int_E\big(e^{-kf(x)}-1\big)\alpha_k\lambda(x)\,dx\Big\}.\end{aligned}$$
The above Laplace functional of a compound random measure resembles a Lévy–Khintchine representation. Define, for $\eta>0$,
$$D_r(\eta):=\exp\bigg\{\eta\Big[\int_Ef(x)\,CP_r(dx)-\sum_{k=1}^rk\alpha_k\int_Ef(x)\lambda(x)\,dx\Big]-\int_E\sum_{k=1}^r\big(e^{k\eta f(x)}-k\eta f(x)-1\big)\alpha_k\lambda(x)\,dx\bigg\}.\qquad(10)$$
Therefore, we obtain $\mathrm{E}D_r(\eta)=1$. It follows from (10) and Markov's inequality that
$$P\bigg(\eta\Big[\int_Ef(x)\,CP_r(dx)-\sum_{k=1}^rk\alpha_k\int_Ef(x)\lambda(x)\,dx\Big]\ge\int_E\sum_{k=1}^r\big(e^{k\eta f(x)}-k\eta f(x)-1\big)\alpha_k\lambda(x)\,dx+y\bigg)=P\big(D_r(\eta)\ge e^y\big)\le e^{-y}.\qquad(11)$$

Note that
$$e^{k\eta f(x)}-k\eta f(x)-1\le\sum_{i=2}^\infty\frac{\eta^i}{i!}k^if^i(x)=k^2\eta^2f^2(x)\sum_{i=2}^\infty\frac{(k\eta\|f\|_\infty)^{i-2}}{i(i-1)\cdots2}\le\frac{k^2\eta^2f^2(x)}{2}\sum_{i=2}^\infty\Big(\frac{k\eta\|f\|_\infty}{3}\Big)^{i-2}\le\frac{k^2\eta^2f^2(x)}{2}\Big(1-\frac{k\eta\|f\|_\infty}{3}\Big)^{-1}.\qquad(12)$$
Hence, using (12), (11) turns into
$$P\bigg(\int_Ef(x)\,CP_r(dx)-\sum_{k=1}^rk\alpha_k\int_Ef(x)\lambda(x)\,dx\ge\sum_{k=1}^r\alpha_k\int_E\frac{k^2\eta f^2(x)\lambda(x)}{2(1-\frac13k\eta\|f\|_\infty)}\,dx+\frac{y}{\eta}\bigg)\le e^{-y}.\qquad(13)$$
Let $V_f=\int_Ef^2(x)\lambda(x)\,dx$, and notice that
$$\begin{aligned}\sum_{k=1}^r\alpha_k\frac{\eta k^2V_f}{2(1-\frac13k\eta\|f\|_\infty)}+\frac{y}{\eta}&=\sum_{k=1}^r\alpha_k\bigg[\frac{\eta k^2V_f}{2(1-\frac13k\eta\|f\|_\infty)}+\frac{y}{\eta}\Big(1-\frac13k\eta\|f\|_\infty\Big)+\frac{ky}{3}\|f\|_\infty\bigg]\\&\ge\sum_{k=1}^r\alpha_k\Big(k\sqrt{2V_fy}+\frac{ky}{3}\|f\|_\infty\Big)=\sum_{k=1}^rk\alpha_k\Big(\sqrt{2V_fy}+\frac{y}{3}\|f\|_\infty\Big).\end{aligned}\qquad(14)$$

Applying (14) and minimizing over $\eta$ in (13), we have
$$P\bigg(\int_Ef(x)\,CP_r(dx)-\sum_{k=1}^rk\alpha_k\int_Ef(x)\lambda(x)\,dx\ge\sum_{k=1}^rk\alpha_k\Big(\sqrt{2yV_f}+\frac{y}{3}\|f\|_\infty\Big)\bigg)\le e^{-y}.\qquad(15)$$
Letting $r\to\infty$ in (15), $CP_r(A)\xrightarrow{d}CP(A)$, and we finally obtain the inequality (7). □

4 Applications
4.1 Application to KKT conditions
For a negative binomial random variable NB, the probability mass function (p.m.f.) is
$$P(NB=n)=\frac{\Gamma(n+s)}{\Gamma(s)n!}(1-q)^sq^n\quad\big(q\in(0,1),\ n\in\mathbb{N}\big),$$

where s is a positive real number. When s is a positive integer, the negative binomial distribution reduces to the Pascal distribution, which models the number of failures before the s-th success in repeated mutually independent Bernoulli trials (with probability of success $p=1-q$).

The characteristic function of NB is
$$\varphi_{NB}(t)=\exp\Big\{s\log\frac{1-q}{1-qe^{it}}\Big\}=\exp\bigg\{-s\log(1-q)\sum_{k=1}^\infty\frac{q^k}{-k\log(1-q)}\big(e^{ikt}-1\big)\bigg\}.\qquad(16)$$
Here the DCP parametrization is $NB\sim\mathrm{DCP}(\alpha_1\lambda,\alpha_2\lambda,\dots)$ with $\lambda=-s\log(1-q)$ and $\alpha_k=\frac{q^k}{-k\log(1-q)}$.
Given the data $\{Y_i,x_i\}_{i=1}^n$, the NB regression model supposes that the responses $\{Y_i\}_{i=1}^n$ are NB distributed random variables with means $\{\mu_i\}_{i=1}^n$, and the covariate vectors $x_i=(x_{i1},\dots,x_{ip})^T\in\mathbb{R}^p$ are p-dimensional fixed (non-random) variables for simplicity; see the so-called NB2 model in Hilbe [13] for details.
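This parametrization is easy to verify numerically: the $\alpha_k$ sum to one because $\sum_kq^k/k=-\log(1-q)$, and the DCP moment formulas recover the NB mean $sq/(1-q)$ and variance $sq/(1-q)^2$. A short Python sketch (ours; the infinite series is truncated):

```python
import numpy as np

s, q = 3.5, 0.4
lam = -s * np.log(1 - q)                     # lambda = -s*log(1-q)
k = np.arange(1, 200)                        # truncation of the infinite sum
alpha = q**k / (-k * np.log(1 - q))          # alpha_k = q^k / (-k*log(1-q))

print(alpha.sum())                           # approximately 1
print(lam * (k * alpha).sum(), s * q / (1 - q))         # EY of NB(s, q)
print(lam * (k**2 * alpha).sum(), s * q / (1 - q)**2)   # Var Y of NB(s, q)
```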

Based on the log-linear model in regression analysis, we usually suppose that $\log\mu_i=f_0(x_i)$ is a linear function of $x_i$. Two types of high-dimensional settings for count data regression are classified as follows.
• Sparse approximation for nonparametric regression. Sometimes it is too rough and too simple to use a linear function of $x_i$ to approximate $\log\mu_i$, denoted $f_0(x_i)$. The connection between $\log\mu_i$ and $x_i$ is sometimes unknown, and thus unlikely to be linear. The dimension of $x_i$ may be much larger than the sample size when $f_0$ is extremely flexible and complex. In order to capture the unknown functional relation $f_0(x_i)$, we prefer to use a given dictionary (of orthogonal base functions) $D=\{\phi_j(\cdot)\}_{j=1}^p$ such that the linear combination $\hat f(x_i)=\sum_{j=1}^p\hat\beta_j\phi_j(x_{ij})$ is a sparse approximation to $f_0(x_i)$, which contains the increasing-dimensional parameter $\beta^*:=(\beta_1^*,\dots,\beta_p^*)^T$ with $p:=p_n\to\infty$ as $n\to\infty$. A crucial point for orthogonal base functions D is that many $f_0(x_i)$ which have no sparse representation in a non-orthogonal basis can be represented as sparse linear combinations of D in the orthogonal scenario; see Sect. 10 of Hastie et al. [12] for details. In practice, the advantage of the dictionary approach is that the flexible $f_0(x_i)$ may admit a good sparse approximation in some dictionary functions and not in others. For high-dimensional count data, Ivanoff et al. [16] mentioned that "the richer the dictionary, the sparser the decomposition, so that p can be larger than n and the model becomes high-dimensional". Typical candidates for the dictionary are the Fourier basis, cosine basis, Legendre basis, wavelet basis (Haar basis), etc.
• High-dimensional features in gene engineering. In the big data era, one challenging case we encounter in massive data sets is that the number of given covariates p is larger than the sample size n while the responses of interest are measured as counts. For example, in gene engineering, the covariates (features) are types of genes impacting specific count phenotypes. NB regression is a flexible framework for the analysis of over-dispersed RNA-seq count data; see Li [23], Mallick and Tiwari [25] for more details.
We assume that the expectation of $Y_i$ is related to $\phi^T(x_i)\beta$ after a log-transformation,
$$\mu_i:=\exp\bigg\{\sum_{j=1}^p\beta_j\phi_j(x_{ij})\bigg\}:=e^{\phi^T(x_i)\beta}=:h(x_i),$$

where $\phi(x_i)=(\phi_1(x_{i1}),\dots,\phi_p(x_{ip}))^T$, the covariate $\phi_j(x_{ij})$ being the jth component of $\phi(x_i)$. Here $\{\phi_j(\cdot)\}_{j=1}^p$ are given transformations of bounded covariates. A trivial example is $\phi_j(x)=x$.

Let $\{Y_i\}_{i=1}^n$ be NB random variables with failure parameters $\{q_i=\frac{\mu_i}{\theta+\mu_i}\}$, where $\theta>0$ is an unknown dispersion parameter which can be estimated (see Hilbe [13]). Then the p.m.f.s of $\{Y_i\}_{i=1}^n$ are
$$f(y_i;\mu_i,\theta)=\frac{\Gamma(\theta+y_i)}{\Gamma(\theta)y_i!}\Big(\frac{\mu_i}{\theta+\mu_i}\Big)^{y_i}\Big(\frac{\theta}{\theta+\mu_i}\Big)^{\theta}\propto\exp\big\{y_i\phi^T(x_i)\beta-(\theta+y_i)\log\big(\theta+e^{\phi^T(x_i)\beta}\big)\big\},\qquad(17)$$

where $\mu_i>0$ is the mean parameter, $\mu_i=\mathrm{E}Y_i$ and $\operatorname{Var}Y_i=\mu_i+\frac{\mu_i^2}{\theta}>\mathrm{E}Y_i$. The log-likelihood of the responses $\{Y_i\}_{i=1}^n$ is
$$l(\beta):=\sum_{i=1}^n\log f\big(Y_i;f_0(x_i),\theta\big)=\sum_{i=1}^n\big[Y_i\phi^T(x_i)\beta-(\theta+Y_i)\log\big(\theta+e^{\phi^T(x_i)\beta}\big)\big].\qquad(18)$$

Let $R(y,\beta,x)=(\theta+y)\log(\theta+e^{\phi^T(x)\beta})-y\phi^T(x)\beta$, and denote the negative average empirical risk function by $\ell(\beta)=-\frac{1}{n}l(\beta):=\mathbb{P}_nR(Y,\beta,x)$.
Having obtained high-dimensional covariates, a principal task is to select the important

variables. Here we prefer the weighted $\ell_1$-penalized likelihood principle for sparse NB regression,
$$\hat\beta=\mathop{\mathrm{argmin}}_{\beta\in\mathbb{R}^p}\bigg\{\ell(\beta)+\sum_{j=1}^pw_j|\beta_j|\bigg\},\qquad(19)$$

where $\{w_j\}_{j=1}^p$ are data-dependent weights to be specified in the following discussion.
By the KKT conditions (i.e. the first-order conditions in convex optimization; see p. 68 of Bühlmann and van de Geer [5]), $\hat\beta$ is a solution of (19) iff $\hat\beta$ satisfies
$$\begin{cases}-\dot\ell_j(\hat\beta)=w_j\operatorname{sign}(\hat\beta_j)&\text{if }\hat\beta_j\neq0,\\|\dot\ell_j(\hat\beta)|\le w_j&\text{if }\hat\beta_j=0,\end{cases}\qquad(20)$$
where $\dot\ell_j(\hat\beta):=\frac{\partial\ell(\hat\beta)}{\partial\hat\beta_j}$ ($j=1,2,\dots,p$). Let $\Phi(X)$ be the $n\times p$ dictionary design matrix given by
$$\Phi(X)=:\big(\phi_j(x_{ij})\big)_{1\le i\le n,1\le j\le p}=\begin{pmatrix}\phi_1(x_{11})&\cdots&\phi_p(x_{1p})\\\vdots&&\vdots\\\phi_1(x_{n1})&\cdots&\phi_p(x_{np})\end{pmatrix}=:\big(\phi^T(x_1),\dots,\phi^T(x_n)\big)^T.$$
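The conditions (20) are straightforward to test numerically at a candidate solution: on the active set the negative gradient must equal $w_j\operatorname{sign}(\hat\beta_j)$, and elsewhere the gradient must lie in $[-w_j,w_j]$. A small Python sketch (ours; check_kkt is a hypothetical helper, and the toy usage plugs in a quadratic loss whose weighted Lasso solution is the soft-threshold):

```python
import numpy as np

def check_kkt(grad, beta_hat, w, tol=1e-8):
    """Verify the weighted-Lasso KKT conditions (20) at beta_hat,
    where grad(beta) returns the gradient of the smooth loss."""
    g = grad(beta_hat)
    active = np.abs(beta_hat) > tol
    ok_active = np.all(np.abs(-g[active] - w[active] * np.sign(beta_hat[active])) < tol)
    ok_zero = np.all(np.abs(g[~active]) <= w[~active] + tol)
    return bool(ok_active and ok_zero)

# toy usage: loss l(b) = 0.5*(b - z)^2, whose solution is the soft-threshold
z, w = np.array([1.3]), np.array([0.5])
beta_hat = np.sign(z) * np.maximum(np.abs(z) - w, 0.0)
print(check_kkt(lambda b: b - z, beta_hat, w))   # True
```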

The KKT conditions imply $|\dot\ell_j(\hat\beta)|\le w_j$ for all j. And
$$\dot\ell(\beta):=\frac{\partial\ell(\beta)}{\partial\beta}=-\frac{1}{n}\sum_{i=1}^n\phi(x_i)\bigg[Y_i-(\theta+Y_i)\frac{e^{\phi^T(x_i)\beta}}{\theta+e^{\phi^T(x_i)\beta}}\bigg]=-\frac{1}{n}\sum_{i=1}^n\frac{\phi(x_i)(Y_i-e^{\phi^T(x_i)\beta})\theta}{\theta+e^{\phi^T(x_i)\beta}},$$
which at the true parameter (where $e^{\phi^T(x_i)\beta^*}=\mathrm{E}Y_i$) equals $-\frac{1}{n}\sum_{i=1}^n\frac{\phi(x_i)(Y_i-\mathrm{E}Y_i)\theta}{\theta+\mathrm{E}Y_i}$.

The Hessian matrix of the negative average log-likelihood is
$$\ddot\ell(\beta):=\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T}=\frac{1}{n}\sum_{i=1}^n\phi(x_i)\phi^T(x_i)\frac{\theta(\theta+Y_i)e^{\phi^T(x_i)\beta}}{(\theta+e^{\phi^T(x_i)\beta})^2}.\qquad(21)$$

Let $d^*=|\{j:\beta_j^*\neq0\}|$; the true coefficient vector $\beta^*$ is defined by
$$\beta^*=\mathop{\mathrm{argmin}}_{\beta\in\mathbb{R}^p}\mathrm{E}R(Y,X,\beta).\qquad(22)$$

For the optimization problem of maximizing the $\ell_1$-penalized empirical likelihood, our requirement is that the KKT conditions be satisfied by the estimated parameter. Here we use the true-parameter version of the KKT conditions,
$$\big|\dot\ell_j(\beta^*)\big|=\bigg|\frac{1}{n}\sum_{i=1}^n\frac{\phi_j(x_i)(Y_i-\mathrm{E}Y_i)\theta}{\theta+\mathrm{E}Y_i}\bigg|\le w_j,\qquad(23)$$
to approximate the estimated-parameter version of the KKT conditions.

As $\theta\to\infty$, this approach to defining data-driven weights $w_j$ reduces to the one proposed by Ivanoff et al. [16] for Poisson regression, which uses the Poisson point process version of the concentration inequality in Theorem 3.1. Similarly, in NB regression the $w_j$ are determined so that the event (23) holds with high probability for each j, by applying Corollary 3.1. Recall the weighted Poisson decomposition of the NB point process
$$NB(A):=\sum_{k=1}^\infty kN_k(A),$$
where $N_k(A)$ is an inhomogeneous Poisson point process with intensity $\alpha_k\int_A\lambda(x)\,dx$ and $\alpha_k=\frac{q^k}{-k\log(1-q)}$ as defined in (16). Let $\mu(A):=\int_A\lambda(x)\,dx$; the m.g.f. of $NB(A)$ is written as
$$M_{NB(A)}(\theta)=\exp\bigg\{\sum_{k=1}^\infty\alpha_k\mu(A)\big(e^{-k\theta}-1\big)\bigg\}.\qquad(24)$$

For the subsequent step, let $\tilde\phi_j(x_i):=\frac{\phi_j(x_i)\theta}{\theta+\mathrm{E}Y_i}\le\phi_j(x_i)$; we want to evaluate the complementary events of the true-parameter version of the KKT conditions:
$$P\bigg(\bigg|\frac{1}{n}\sum_{i=1}^n\frac{\phi_j(x_i)(Y_i-\mathrm{E}Y_i)\theta}{\theta+\mathrm{E}Y_i}\bigg|\ge w_j\bigg)=P\bigg(\bigg|\frac{1}{n}\sum_{i=1}^n\tilde\phi_j(x_i)(Y_i-\mathrm{E}Y_i)\bigg|\ge w_j\bigg)\quad\text{for }j=1,2,\dots,p.\qquad(25)$$

The aim is to find the values $\{w_j\}_{j=1}^p$ by the concentration inequalities we proposed. It then suffices to estimate the probability on the right-hand side of the above equality.
Let $\mu(A)$ be Lebesgue measure for $A\subset\mathbb{R}^d$ and let $N_{i,k}(A)$ be the Poisson random measure with rate $\frac{\alpha_k(i)\mu(A)}{\mu_i}$, where $\mu_i=\sum_{k=1}^\infty k\alpha_k(i)$ with $\alpha_k(i):=\frac{q_i^k}{-k\log(1-q_i)}$ and $\{q_i:=\frac{\mu_i}{\theta+\mu_i}\}_{i=1}^n$. We define the independent NB point processes $NB_i(A)$ as
$$NB_i(A)=:\sum_{k=1}^\infty kN_{i,k}(A),\quad\text{with }\mathrm{E}\big[NB_i(A)\big]=\sum_{k=1}^\infty k\frac{\alpha_k(i)\mu(A)}{\mu_i}=\mu(A),$$
where the $\{N_{i,k}(A)\}$ are independent over k and i.
For the disjoint union $[0,1]^d:=\bigcup_{i=1}^nS_i$, we assume that the intensity function
$$\lambda(t)=\sum_{i=1}^n\frac{h(x_i)}{\mu(S_i)}I_{S_i}(t)$$

is a histogram-type estimator (Baraud and Birgé [2]); it is just a piecewise-constant intensity function. We now apply Corollary 3.1; note that
$$\mu(S_i)=\int_{S_i}\lambda(t)\,dt=h(x_i):=\mathrm{E}Y_i$$
by the definition of $\lambda(t)$ above.

Thus, $(NB_1(S_1),\dots,NB_n(S_n))$ and $(Y_1,\dots,Y_n)$ have the same law by the weighted Poisson decomposition. Choosing $f(x)=g_j(x)=\sum_{l=1}^n\tilde\phi_j(x_l)1_{S_l}(x)$, one derives
$$\sum_{i=1}^n\int_{S_i}g_j(x)\,NB_i(dx)=\sum_{i=1}^n\int_{S_i}\sum_{l=1}^n\tilde\phi_j(x_l)I_{S_l}(x)\,NB_i(dx)=\sum_{i=1}^n\tilde\phi_j(x_i)Y_i$$
for each j.

Assume that there exist constants $C_1,C_2$ such that
$$C_1=\max_{1\le i\le n}\mu_i=\max_{1\le i\le n}\sum_{k=1}^\infty k\alpha_k(i)=O(1)\quad\text{and}\quad\max_{1\le i\le n}h(x_i)\le C_2=O(1).\qquad(26)$$

Then
$$V_{i,g_j}=\int_{S_i}g_j^2(x)\lambda(x)\,dx=\int_{S_i}\bigg[\sum_{l=1}^n\tilde\phi_j^2(x_l)I_{S_l}(x)\bigg]\bigg[\sum_{i=1}^n\frac{h(x_i)}{\mu(S_i)}I_{S_i}(x)\bigg]dx=\tilde\phi_j^2(x_i)h(x_i)\le C_2\max_{1\le i\le n}\tilde\phi_j^2(x_i)\le C_2\max_{1\le i\le n}\phi_j^2(x_i).$$
Hence
$$P\bigg(\frac{1}{n}\sum_{i=1}^n\tilde\phi_j(x_i)(Y_i-\mathrm{E}Y_i)\ge C_1\Big[\sqrt{2yC_2\max_{1\le i\le n}\phi_j^2(x_i)}+\frac{y}{3}\big\|g_j(x)\big\|_\infty\Big]\bigg)\le P\bigg(\frac{1}{n}\sum_{i=1}^n\Big[\int_{S_i}g_j(x)\,NB_i(dx)-\mu_i\int_{S_i}g_j(x)\lambda(x)\,dx\Big]\ge\frac{1}{n}\sum_{i=1}^n\mu_i\Big(\sqrt{2yV_{i,g_j}}+\frac{y}{3}\big\|g_j(x)\big\|_\infty\Big)\bigg).$$
Now we let $w_j=C_1[\sqrt{2yC_2\max_{1\le i\le n}\phi_j^2(x_i)}+\frac{y}{3}\|g_j(x)\|_\infty]$.

From (25), we apply the concentration inequality of Corollary 3.1 to the last probability in the above inequality. This implies
$$P\bigg(\bigg|\frac{1}{n}\sum_{i=1}^n\frac{\phi_j(x_i)(Y_i-\mathrm{E}Y_i)\theta}{\theta+\mathrm{E}Y_i}\bigg|\ge w_j\bigg)\le P\bigg(\bigg|\frac{1}{n}\sum_{i=1}^n\Big[\int_{S_i}g_j(x)\,NB_i(dx)-\mu_i\int_{S_i}g_j(x)\lambda(x)\,dx\Big]\bigg|\ge w_j\bigg)\le 2e^{-ny}.$$
Taking $y=\frac{\gamma\log p}{n}$ with $\gamma>0$, we have
$$P\bigg(\bigg|\frac{1}{n}\sum_{i=1}^n\frac{\phi_j(x_i)(Y_i-\mathrm{E}Y_i)\theta}{\theta+\mathrm{E}Y_i}\bigg|\ge w_j\bigg)\le\frac{2}{p^\gamma}.\qquad(27)$$
The weights are then proportional to
$$w_j\propto\sqrt{\frac{2\gamma C_2\log p}{n}}\max_{i=1,\dots,n}\big|\phi_j(x_i)\big|+\frac{\gamma\log p}{3n}\max_{i=1,\dots,n}\big|\phi_j(x_i)\big|.$$

Here $\gamma$ can be treated as a tuning parameter. In high-dimensional statistics we often have the dimension assumption $\frac{\log p}{n}=o(1)$ and $\max_{i=1,\dots,n}|\phi_j(x_i)|<\infty$, so the term $\frac{\gamma\log p}{3n}\max_{i=1,\dots,n}|\phi_j(x_i)|$ is negligible in the weights compared to the term $\sqrt{\frac{2\gamma\log p}{n}}$. Then, for large dimension p, the above upper bound on the probability tends to zero. Therefore, the event of the KKT conditions holds with high probability.
Based on (27) and some regular assumptions such as a compatibility factor (a generalized restricted eigenvalue condition), we will derive the following $\ell_1$-estimation error bound with high probability:
$$\big\|\hat\beta-\beta^*\big\|_1\le O_p\bigg(d^*\sqrt{\frac{\log p}{n}}\bigg)$$
in the next section.

in the next section. Under the restricted eigenvalue condition proposed by Bickel et al. [3], oracle inequali- ties for a weighted Lasso regularized Poisson binomial regression are studied in Ivanoff et al. [16]. Since the negative binomial is neither sub-gaussian nor a member of the exponen- tial family, our extension is not straightforward. We employ the assumption of compati- bility factor which renders the proof simpler than that of restricted eigenvalue condition.

4.2 Oracle inequalities for the weighted Lasso in count data regressions
The ensuing subsections proceed in two steps. The first step is to establish two lemmas giving lower and upper bounds for the symmetric Bregman divergence in the weighted Lasso case. The second step is to derive oracle inequalities from the symmetric Bregman divergence by some inequality-scaling tricks.

4.2.1 Case of NB regression
The symmetric Bregman divergence (sB divergence) is defined by
$$D^s(\hat\beta,\beta)=(\hat\beta-\beta)^T\big(\dot\ell(\hat\beta)-\dot\ell(\beta)\big);$$
see Nielsen and Nock [26].

The following lemma is a crucial inequality for deriving the oracle inequality under the assumption of a compatibility factor, which is also used in Yu [37] for the Cox model. Let $\beta^*$ be the true coefficient vector defined by (22), and let $H=\{j:\beta_j^*\neq0\}$, $H^c=\{j:\beta_j^*=0\}$. Then we have the following.

Lemma 4.1 Let $\hat\beta$ be the estimate in (19), set $l_m:=\|\dot\ell(\beta^*)\|_\infty$, and let $\tilde\beta=\hat\beta-\beta^*$. Then we have
$$(w_{\min}-l_m)\|\tilde\beta_{H^c}\|_1\le D^s(\hat\beta,\beta^*)+(w_{\min}-l_m)\|\tilde\beta_{H^c}\|_1\le(w_{\max}+l_m)\|\tilde\beta_H\|_1,$$
where $w_{\max}=\max_{1\le j\le p}w_j$ and $w_{\min}=\min_{1\le j\le p}w_j$.
Moreover, in the event $\{l_m\le\frac{\xi-1}{\xi+1}w_{\min}\}$ for any $\xi>1$ and with $\|\Phi(X)\|_\infty\le K$, with probability $1-p^{1-O(1)r^2}$ we have
$$\|\tilde\beta_{H^c}\|_1\le\frac{\xi w_{\max}}{w_{\min}}\|\tilde\beta_H\|_1.\qquad(28)$$

Proof The left inequality is obtained from $D^s(\hat\beta,\beta^*)=(\hat\beta-\beta^*)^T[\dot\ell(\hat\beta)-\dot\ell(\beta^*)]\ge0$, due to the convexity of the function $\ell(\beta)$. Note that
$$\begin{aligned}\tilde\beta^T\big[\dot\ell(\beta^*+\tilde\beta)-\dot\ell(\beta^*)\big]&=\sum_{j\in H^c}\tilde\beta_j\dot\ell_j(\beta^*+\tilde\beta)+\sum_{j\in H}\tilde\beta_j\dot\ell_j(\beta^*+\tilde\beta)+\tilde\beta^T\big(-\dot\ell(\beta^*)\big)\\&\le\sum_{j\in H^c}\big(-w_j\hat\beta_j\operatorname{sign}(\hat\beta_j)\big)+\sum_{j\in H}w_j|\tilde\beta_j|+\|\tilde\beta\|_1l_m\\&=-\sum_{j\in H^c}w_j|\tilde\beta_j|+\sum_{j\in H}w_j|\tilde\beta_j|+l_m\|\tilde\beta_H\|_1+l_m\|\tilde\beta_{H^c}\|_1\\&\le(l_m-w_{\min})\|\tilde\beta_{H^c}\|_1+(w_{\max}+l_m)\|\tilde\beta_H\|_1,\end{aligned}$$
where the first inequality follows from the KKT conditions (20) and the dual norm inequality. Thus we obtain (28). In fact, inequality (28) turns into
$$\frac{2w_{\min}}{\xi+1}\|\tilde\beta_{H^c}\|_1\le D^s(\hat\beta,\beta^*)+\frac{2w_{\min}}{\xi+1}\|\tilde\beta_{H^c}\|_1\le\frac{2\xi w_{\max}}{\xi+1}\|\tilde\beta_H\|_1\qquad(29)$$
due to $w_{\min}-l_m\ge\frac{2w_{\min}}{\xi+1}$ and $w_{\max}+l_m\le\frac{2\xi}{\xi+1}w_{\max}$.
For the event $\{\|\dot\ell(\beta^*)\|_\infty\le\frac{\xi-1}{\xi+1}w_{\min}\}$, applying the NB concentration inequality (9) with $t=\frac{\xi-1}{\xi+1}nw_{\min}$, we have
$$P\bigg(\big\|\dot\ell(\beta^*)\big\|_\infty\ge\frac{\xi-1}{\xi+1}w_{\min}\bigg)\le pP\bigg(\bigg|\sum_{i=1}^n\frac{\phi_1(x_i)\theta}{\theta+\mathrm{E}Y_i}(Y_i-\mathrm{E}Y_i)\bigg|\ge\frac{\xi-1}{\xi+1}nw_{\min}\bigg)\le p\exp\bigg\{-n\bigg(\sqrt{\frac{t}{c_{2n}}+\frac{c_{1n}^2}{4c_{2n}^2}}-\frac{c_{1n}}{2c_{2n}}\bigg)^2\bigg\}\sim p\exp\big\{-nw_{\min}^2O(1)\big\}\sim p^{1-O(1)r^2},$$
where the second-to-last relation is due to $c_{1n}=O(n)=c_{2n}$ under assumption (26), and the last is by letting $w_{\min}=O(r\sqrt{\frac{\log p}{n}})$. Here $r>0$ is a suitable tuning parameter such that the event $\{\|\dot\ell(\beta^*)\|_\infty\le\frac{\xi-1}{\xi+1}w_{\min}\}$ holds with probability tending to 1 as $p,n\to\infty$. □

In order to prove Theorem 4.1, we adopt a crucial inequality obtained by an approach similar to Lemma 5 of Zhang and Jia [38], namely the technique of Taylor expansion for convex log-likelihood functions.

Lemma 4.2 Let the Hessian matrix $\ddot\ell(\beta)$ be given by (21), and, for $\delta\in\mathbb{R}^p$, assume the identifiability condition: $\phi^T(x_i)(\beta+\delta)=\phi^T(x_i)\beta$ implies $\phi^T(x_i)\delta=0$. Then we have
$$D^s(\beta+\delta,\beta)\ge\delta^T\ddot\ell(\beta)\delta\,e^{-2\|\Phi(X)\|_\infty\|\delta\|_1},$$
where $\|\Phi(X)\|_\infty=\max\{|\phi_j(x_{ij})|;1\le i\le n,1\le j\le p\}$.

By the result in the second part of Lemma 4.1, in the event $\{\|\dot\ell(\beta^*)\|_\infty\le\frac{\xi-1}{\xi+1}w_{\min}\}$, the error of the estimate, $\tilde\beta=\hat\beta-\beta^*$, belongs to the following cone set:
$$S(\eta,H)=\big\{b\in\mathbb{R}^p:\|b_{H^c}\|_1\le\eta\|b_H\|_1\big\}\qquad(30)$$
with $\eta=\frac{\xi w_{\max}}{w_{\min}}$.
Let $d^*(\beta^*)=:|H(\beta^*)|$ with $H(\beta^*)=:\{j:\beta_j^*\neq0,\ \beta^*=(\beta_1^*,\dots,\beta_j^*,\dots,\beta_p^*)\in\mathbb{R}^p\}$. Sometimes we write $H(\beta^*)$ as H and $d^*(\beta^*)$ as $d^*$ if there is no ambiguity. Define the compatibility factor (see van de Geer [33]) of a $p\times p$ nonnegative-definite matrix $\Sigma$ as

$$C(\eta,H,\Sigma)=\inf_{0\neq b\in S(\eta,H)}\frac{(d^*)^{1/2}(b^T\Sigma b)^{1/2}}{\|b_H\|_1}>0\quad(\eta\in\mathbb{R}),$$
where $S(\eta,H)=\{b\in\mathbb{R}^p:\|b_{H^c}\|_1\le\eta\|b_H\|_1\}$ is the cone condition. The compatibility factor is strongly linked to the restricted eigenvalue proposed by Bickel et al. [3],
$$RE(\eta,H,\Sigma)=\inf_{0\neq b\in S(\eta,H)}\frac{(b^T\Sigma b)^{1/2}}{\|b\|_2}>0\quad(\eta\in\mathbb{R}).\qquad(31)$$

Actually, by the Cauchy–Schwarz inequality, it is evident that $[C(\eta,H,\Sigma)]^2\ge[RE(\eta,H,\Sigma)]^2$. In the proof of Theorem 4.1, the upper bounds in the oracle inequalities are sharper if we use the compatibility factor rather than the restricted eigenvalue. Next, let $C(\eta,H)=C(\eta,H,\ddot\ell(\beta^*))$; then we have the following theorem, which is analogous to Zhang and Jia [38], who give oracle inequalities for the Elastic-net estimate in NB regression.

Theorem 4.1 Suppose $\|\Phi(X)\|_\infty\le K$ with some $K>0$ and let $\tau=\frac{K(\xi+1)d^*w_{\max}^2}{2w_{\min}[C(\xi w_{\max}/w_{\min},H)]^2}\le\frac{1}{2}e^{-1}$, with assumption (26). Setting $w_{\min}=O(r\sqrt{\frac{\log p}{n}})$, where $r>0$ is a suitable tuning parameter, then in the event $\{\|\dot\ell(\beta^*)\|_\infty\le\frac{\xi-1}{\xi+1}w_{\min}\}$, which holds with probability tending to 1 as $p,n\to\infty$, we have
$$\big\|\hat\beta-\beta^*\big\|_1\le\frac{e^{2a_\tau}(\xi+1)d^*w_{\max}^2}{2w_{\min}[C(\xi w_{\max}/w_{\min},H)]^2}=O_p\bigg(d^*\sqrt{\frac{\log p}{n}}\cdot\frac{w_{\max}^2}{w_{\min}^2}\bigg),$$
where $a_\tau\le\frac12$ is the smaller solution of the equation $ae^{-2a}=\tau$. And
$$D^s(\hat\beta,\beta^*)\le\frac{4e^{2a_\tau}\xi^2d^*w_{\max}^2}{(\xi+1)^2[C(\xi w_{\max}/w_{\min},H)]^2}.$$
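The constant $a_\tau$ is only defined implicitly; since $a\mapsto ae^{-2a}$ is increasing on $[0,1/2]$ with maximum $(2e)^{-1}$ at $a=1/2$, the smaller root can be found by simple root bracketing. A minimal Python sketch (ours):

```python
import numpy as np
from scipy.optimize import brentq

def a_tau(tau):
    """Smaller root a <= 1/2 of a*exp(-2a) = tau, as in Theorem 4.1;
    requires tau <= 1/(2e), the maximum of a*exp(-2a) on [0, 1/2]."""
    assert 0.0 < tau <= 1.0 / (2.0 * np.e)
    return brentq(lambda a: a * np.exp(-2.0 * a) - tau, 0.0, 0.5)

print(a_tau(0.05))   # approx 0.0559, since 0.0559*exp(-0.1118) ~ 0.05
```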

Proof Let $\tilde\beta=\hat\beta-\beta^*\neq0$ and $b=\tilde\beta/\|\tilde\beta\|_1$, and observe that $\ell(\beta^*+bx)$ is a convex function in x since $\ell(\beta)$ is convex. From the proof of (29) in Lemma 4.1, we have
$$b^T\big[\dot\ell(\beta^*+bx)-\dot\ell(\beta^*)\big]+\frac{2w_{\min}}{\xi+1}\|b_{H^c}\|_1\le\frac{2\xi w_{\max}}{\xi+1}\|b_H\|_1\qquad(32)$$
holding for $x\in[0,\|\tilde\beta\|_1]$. Since $\ell(\beta)$ is convex in $\beta$, $b^T[\dot\ell(\beta^*+bx)-\dot\ell(\beta^*)]$ is increasing in x, and the direction satisfies $b\in S(\xi w_{\max}/w_{\min},H)$. By the assumption we have $\|\Phi(X)\|_\infty\|xb\|_1\le Kx$, noticing that $\|b\|_1=1$. Assuming $\|\Phi(X)\|_\infty\le K$, we obtain from Lemma 4.2
$$xb^T\big[\dot\ell(\beta^*+bx)-\dot\ell(\beta^*)\big]\ge x^2e^{-2\|\Phi(X)\|_\infty\|xb\|_1}b^T\ddot\ell(\beta^*)b\ge x^2e^{-2Kx}b^T\ddot\ell(\beta^*)b,$$
which implies
$$b^T\big[\dot\ell(\beta^*+bx)-\dot\ell(\beta^*)\big]\ge xe^{-2Kx}b^T\ddot\ell(\beta^*)b.\qquad(33)$$
For the Hessian matrix evaluated at the true coefficient $\beta^*$, the compatibility factor is written as $C(\eta,H)=:C(\eta,H,\ddot\ell(\beta^*))$. From the definition of the compatibility factor and the inequality (33), we have
$$\begin{aligned}Kxe^{-2Kx}\big[C(\xi w_{\max}/w_{\min},H)\big]^2\|b_H\|_1^2/d^*&\le Kxe^{-2Kx}b^T\ddot\ell(\beta^*)b\\(\text{by }(33))&\le Kb^T\big[\dot\ell(\beta^*+bx)-\dot\ell(\beta^*)\big]\\(\text{by }(32))&\le K\bigg[\frac{2\xi w_{\max}}{\xi+1}\|b_H\|_1-\frac{2w_{\min}}{\xi+1}\|b_{H^c}\|_1\bigg]\\&=K\bigg[\frac{2\xi w_{\max}}{\xi+1}\|b_H\|_1-\frac{2w_{\min}}{\xi+1}\big(1-\|b_H\|_1\big)\bigg]\\&\le K\bigg[\frac{2}{\xi+1}(\xi w_{\max}+w_{\max})\|b_H\|_1-\frac{2w_{\min}}{\xi+1}\bigg]\\&=K\bigg[2w_{\max}\|b_H\|_1-\frac{2w_{\min}}{\xi+1}\bigg]\le\frac{K(\xi+1)\|b_H\|_1^2w_{\max}^2}{2w_{\min}},\end{aligned}$$
where the last inequality is due to $\frac{2w_{\min}}{\xi+1}+\frac{(\xi+1)\|b_H\|_1^2w_{\max}^2}{2w_{\min}}\ge2w_{\max}\|b_H\|_1$.

Then we have
$$Kxe^{-2Kx}\le\frac{K(\xi+1)d^*w_{\max}^2}{2w_{\min}[C(\xi w_{\max}/w_{\min},H)]^2}=:\tau\qquad(34)$$
for any $x\in[0,\|\tilde\beta\|_1]$.
So, under the constraint $x\in[0,\|\tilde\beta\|_1]$, let $a_\tau,b_\tau$ ($a_\tau<0.5<b_\tau$) be the two solutions of $ae^{-2a}=\tau$. Since $Kxe^{-2Kx}\le\tau$ for every such x and $Kx$ varies continuously from 0, we get
$$Kx\le K\tilde x\le a_\tau,$$
where $\tilde x:=\|\tilde\beta\|_1$ is the endpoint. Using (34) again, the endpoint satisfies $K\tilde xe^{-2K\tilde x}\le\tau$. Then we have
$$\|\tilde\beta\|_1=\|b\tilde x\|_1=\|b\|_1\tilde x=\tilde x\le\frac{a_\tau}{K}=\frac{e^{2a_\tau}\tau}{K}=\frac{e^{2a_\tau}(\xi+1)d^*w_{\max}^2}{2w_{\min}[C(\xi w_{\max}/w_{\min},H)]^2}.$$

It remains to show the oracle inequality for $D^s(\hat\beta,\beta^*)$. Let $\delta=bx$ in Lemma 4.2. Using the definition of the compatibility factor, and observing that $a_\tau\ge Kx$ and $Kx\ge\|\Phi(X)\|_\infty\|\delta\|_1$, one derives
$$e^{-2a_\tau}\big[C(\xi w_{\max}/w_{\min},H)\big]^2\|\tilde\beta_H\|_1^2/d^*\le e^{-2a_\tau}\tilde\beta^T\ddot\ell(\beta^*)\tilde\beta\le e^{-2Kx}\tilde\beta^T\ddot\ell(\beta^*)\tilde\beta\le e^{-2\|\Phi(X)\|_\infty\|\delta\|_1}\tilde\beta^T\ddot\ell(\beta^*)\tilde\beta.$$
Applying Lemma 4.2, one immediately deduces that
$$e^{-2a_\tau}\big[C(\xi w_{\max}/w_{\min},H)\big]^2\|\tilde\beta_H\|_1^2/d^*\le e^{-2\|\Phi(X)\|_\infty\|\delta\|_1}\tilde\beta^T\ddot\ell(\beta^*)\tilde\beta\le D^s(\hat\beta,\beta^*)\le\frac{2\xi w_{\max}}{\xi+1}\|\tilde\beta_H\|_1,$$
where the last inequality is from (32). Canceling the term $\|\tilde\beta_H\|_1$ on both sides of the inequality above, we get
$$\|\tilde\beta_H\|_1\le\frac{2e^{2a_\tau}\xi d^*w_{\max}}{(\xi+1)[C(\xi w_{\max}/w_{\min},H)]^2}.$$
Consequently,
$$D^s(\hat\beta,\beta^*)\le\frac{2\xi w_{\max}}{\xi+1}\|\tilde\beta_H\|_1\le\frac{4e^{2a_\tau}\xi^2d^*w_{\max}^2}{(\xi+1)^2[C(\xi w_{\max}/w_{\min},H)]^2}.\quad\square$$

Theorem 4.1 indicates that, if we presume $d^*=O(1)$, the $\ell_1$-estimation error bound is of the order $\sqrt{\frac{\log p}{n}}$. This convergence rate is rate-optimal in the minimax sense; see Raskutti et al. [28] and Li et al. [22]. Under some regularity conditions, the weighted Lasso estimates are expected to have the consistency property in the sense of the $\ell_1$-norm when the dimension increases at the order $\exp(o(n))$. Unlike the MLE with convergence rate $\frac{1}{\sqrt n}$, in the high-dimensional case we have to multiply the MLE rate by $\sqrt{\log p}$ as the price to pay.

4.2.2 Case of Poisson regression In this section, we use the same procedure as in Sect. 4.2.1 to prove the case of Poisson regression with weighted Lasso penalty. The derivation of this type of oracle inequalities is simpler than that in Ivanoff et al. [16]. Similar to Lemma 5 of Zhang and Jia [38], the proof is based on the following key lemma.

Lemma 4.3 The gradient and Hessian matrix of the average negative log-likelihood in the Poisson regression model are
$$\dot\ell(\beta)=-\frac{1}{n}\sum_{i=1}^n\phi(x_i)\big[Y_i-e^{\phi^T(x_i)\beta}\big],\qquad\ddot\ell(\beta)=\frac{1}{n}\sum_{i=1}^n\phi(x_i)\phi^T(x_i)e^{\phi^T(x_i)\beta}.$$
Assume the identifiability condition: $\phi^T(x_i)(\beta+\delta)=\phi^T(x_i)\beta$ implies $\phi^T(x_i)\delta=0$. Then we have
$$D^s(\beta+\delta,\beta)\ge\delta^T\ddot\ell(\beta)\delta\,e^{-\|\Phi(X)\|_\infty\|\delta\|_1},$$
where $\|\Phi(X)\|_\infty=\max\{|\phi_j(x_{ij})|;1\le i\le n,1\le j\le p\}$.

Proof Without loss of generality, we assume that $\phi^T(x_i)\delta\neq0$. From the expressions for $\dot\ell(\beta)$ and $\ddot\ell(\beta)$, one deduces
$$\begin{aligned}\delta^T\big[\dot\ell(\beta+\delta)-\dot\ell(\beta)\big]&=\delta^T\cdot\frac{1}{n}\sum_{i=1}^n\phi(x_i)\big[e^{\phi^T(x_i)(\beta+\delta)}-e^{\phi^T(x_i)\beta}\big]\\&=\frac{1}{n}\sum_{i=1}^n\delta^T\phi(x_i)e^{\phi^T(x_i)\beta}\cdot\frac{e^{\phi^T(x_i)\delta}-1}{\phi^T(x_i)\delta-0}\cdot\phi^T(x_i)\delta\\&\ge\delta^T\bigg[\frac{1}{n}\sum_{i=1}^n\phi(x_i)\phi^T(x_i)e^{\phi^T(x_i)\beta}\bigg]\delta\cdot e^{-\|\Phi(X)\|_\infty\|\delta\|_1},\end{aligned}$$
where the last inequality is from $\frac{e^x-e^y}{x-y}\ge e^{-|x|\vee|y|}$. □

Theorem 4.2 For the weighted Lasso estimator $\hat\beta$ in Poisson regression, suppose $\|\Phi(X)\|_\infty\le K$ with assumption (26), and let $\tau=\frac{K(\xi+1)d^*w_{\max}^2}{2w_{\min}[C(\xi w_{\max}/w_{\min},H)]^2}\le e^{-1}$ for a certain $K>0$. Then, in the event $\{\|\dot\ell(\beta^*)\|_\infty\le\frac{\xi-1}{\xi+1}w_{\min}\}$, we have
$$\big\|\hat\beta-\beta^*\big\|_1\le\frac{e^{a_\tau}(\xi+1)d^*w_{\max}^2}{2w_{\min}[C(\xi w_{\max}/w_{\min},H)]^2},$$
where $a_\tau\le1$ is the smaller solution of the equation $ae^{-a}=\tau$. Also,
$$D^s(\hat\beta,\beta^*)\le\frac{4e^{a_\tau}\xi^2d^*w_{\max}^2}{(\xi+1)^2[C(\xi w_{\max}/w_{\min},H)]^2}.$$

Proof Note that $a_\tau$ is here the smaller solution of $ze^{-z}=\tau$ in the Poisson regression case. By Lemma 4.1 and Lemma 4.3, the proof is almost exactly the same as the proof of Theorem 4.1. □

5 Simulations
This section compares the performance of the un-weighted Lasso and the weighted Lasso for NB regression on simulated data sets. We use the R package mpath with the function glmregNB to fit the un-weighted Lasso estimator of NB regression. The R package lbfgs is employed to solve the weighted $\ell_1$-penalized optimization for NB regression. For the weighted Lasso, we first run an un-weighted Lasso estimator and find the optimal tuning

parameter $\lambda_{op}$ by cross-validation. The actual weights we use are the standardized weights given by
$$\tilde w_j:=\frac{pw_j}{\sum_{j=1}^pw_j}.$$

We then solve the optimization problem
$$\hat\beta=\mathop{\mathrm{argmin}}_{\beta\in\mathbb{R}^p}\bigg\{\ell(\beta)+\lambda_{op}\sum_{j=1}^p\tilde w_j|\beta_j|\bigg\}.\qquad(35)$$
Here we first apply the function cv.glmregNB() to perform 10-fold cross-validation to obtain the optimal penalty parameter $\lambda$. For comparison, we use the function glmregNB to fit the un-weighted Lasso for NB regression, and we use the estimated $\theta$ as the initial value for our weighted Lasso algorithm.
For the simulations, we generate 100 random data sets. By optimization with a suitable $\lambda$, we obtain the model with parameter $\hat\beta_{\lambda_{opt}}$. Then we assess performance via the $\ell_1$-estimation error $\|\beta^*-\hat\beta_{\lambda_{opt}}\|_1$ and the prediction error $\|X_{test}\beta^*-X_{test}\hat\beta_{\lambda_{opt}}\|_2$ on the test data ($X_{test}$ of size $n_{test}$), averaged over the 100 replications.
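Since the pipeline above relies on R (mpath::glmregNB, cv.glmregNB and lbfgs), we add only a language-neutral sketch of the optimization (35): proximal gradient descent on the NB negative log-likelihood with $\theta$ held fixed. This is our stand-in illustration, not the authors' code:

```python
import numpy as np

def nb_weighted_lasso(X, y, theta, lam, w, steps=2000, lr=0.05):
    """Minimize (1/n)*sum[(theta+y_i)*log(theta+e^{x_i'b}) - y_i*x_i'b]
    + lam*sum_j w_j*|b_j| by proximal gradient (soft-thresholding)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        mu = np.exp(X @ beta)
        g = -X.T @ (y - (theta + y) * mu / (theta + mu)) / n   # smooth gradient
        b = beta - lr * g
        beta = np.sign(b) * np.maximum(np.abs(b) - lr * lam * w, 0.0)  # prox step
    return beta
```

A fixed step size is used for brevity; in practice a line search, or an L-BFGS-style solver with the penalty handled by its orthant-wise variant (as in the R package lbfgs), would be the more robust choice.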

For each simulation, $n_{train}=100$ and $n_{test}=200$. We mainly adopt the following simulation setting: the predictor variables X are randomly drawn from $N_p(0,\Sigma)$, where $\Sigma$ has elements $\rho^{|j-l|}$ ($j,l=1,2,\dots,p$). The correlation among predictor variables is controlled by $\rho$, with $\rho=0.5$ and $0.8$, respectively. We assign the true vector as
$$\beta^*=(1.25,-0.95,0.9,-1.1,0.6,\underbrace{0,\dots,0}_{p-5}).$$
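This design can be reproduced as follows (our Python sketch; the dispersion value $\theta=2$ is our arbitrary choice, and numpy's negative_binomial is parametrized by $(n,p)=(\theta,\theta/(\theta+\mu))$, which yields mean $\mu$):

```python
import numpy as np

def simulate_nb_data(n, p, rho, beta_star, theta, rng):
    """Draw X ~ N_p(0, Sigma) with Sigma_jl = rho^{|j-l|} and NB responses
    with mean mu_i = exp(x_i' beta*) and dispersion theta."""
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    mu = np.exp(X @ beta_star)
    y = rng.negative_binomial(theta, theta / (theta + mu))
    return X, y

p = 20
beta_star = np.r_[1.25, -0.95, 0.9, -1.1, 0.6, np.zeros(p - 5)]
rng = np.random.default_rng(42)
X_train, y_train = simulate_nb_data(100, p, rho=0.5, beta_star=beta_star,
                                    theta=2.0, rng=rng)
X_test, y_test = simulate_nb_data(200, p, rho=0.5, beta_star=beta_star,
                                  theta=2.0, rng=rng)
```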

The simulation results are displayed in Table 1. The table shows that the proposed weighted Lasso estimators are more accurate than the un-weighted Lasso estimators. With the help of the optimal weights, this reflects that controlling the KKT conditions through a data-dependent tuning parameter is able to improve estimation accuracy in terms of the $\ell_1$-estimation and squared prediction errors. It can also be seen that increasing p degrades the $\ell_1$-estimation error, by the curse of dimensionality.

6 Summary and discussion
The paper contributes three parts. In the first part, we provide a characterization of discrete compound Poisson point processes (DCPP) in the same fashion as was done in 1936 by Arthur H. Copeland and Francis Regan for the Poisson process. The second part is focused on deriving concentration inequalities for DCPP, which are applied in the third part to derive optimal oracle inequalities for weighted Lasso estimation in high-dimensional NB regression.

Table 1 Means of the $\ell_1$ error and prediction error

Setting             Method          $\|\hat\beta-\beta^*\|_1$   Prediction error
p = 20,  ρ = 0.5    Lasso           1.902                       7.161
                    Weighted Lasso  1.624                       6.147
p = 20,  ρ = 0.8    Lasso           3.001                       7.155
                    Weighted Lasso  2.917                       6.979
p = 100, ρ = 0.5    Lasso           4.137                       15.452
                    Weighted Lasso  3.905                       13.866
p = 100, ρ = 0.8    Lasso           4.543                       10.866
                    Weighted Lasso  4.552                       10.695
p = 200, ρ = 0.5    Lasso           4.467                       17.533
                    Weighted Lasso  4.373                       15.854
p = 200, ρ = 0.8    Lasso           4.467                       17.533
                    Weighted Lasso  4.373                       15.854

Oracle inequalities for discrete distributions are statistically useful in both non-asymptotic and asymptotic analyses of count data regression. The motivation for deriving new concentration inequalities for DCP processes is that we want to show that the KKT conditions of NB regression with the Lasso penalty (with a tuning parameter $\lambda$) hold on a high-probability event. The KKT conditions rest on a concentration inequality for a centralized weighted sum of NB random variables. However, existing concentration results are not readily applicable to the aim of an optimal inference procedure. The optimal inference procedure here is to choose an optimal tuning parameter for Lasso estimates in high-dimensional NB regression, which leads to the minimax-optimal convergence rate for the estimator, via the oracle inequalities for the $\ell_1$-estimation error.
In the future, it would be of interest to study concentration inequalities for other extended Poisson distributions in regression analysis, such as the Conway–Maxwell–Poisson distribution; see Li et al. [22] for its existing probabilistic properties. In another direction, it would be desirable to extend our proposed concentration inequalities to establish oracle inequalities for the penalized projection estimators studied by Reynaud-Bouret [29], by considering the intensity of some inhomogeneous compound Poisson processes.

Appendix
A.1 Proof of Theorem 2.1
The proof is divided into two parts. In the first part, we show that $\nu(A)$ is countably additive and absolutely continuous. In the second part, we show Eq. (3).

Proof Step 1.
(1) Note that $P_k(A)>0$ and $\sum_{k=0}^\infty P_k(A)=1$ by assumption 1, so we have $0<P_k(A)<1$ whenever $0<m(A)<\infty$. And from assumption 2 we get $P_0(A)=1$ if $m(A)=0$. It follows that $0<P_0(A)\le1$ if $0\le m(A)<\infty$. Write $S_1(A):=\sum_{k=1}^\infty P_k(A)=1-P_0(A)$, so that $\nu(A):=-\log P_0(A)=-\log[1-S_1(A)]$ and
$$\lim_{m(A)\to0}-\log\big[1-S_1(A)\big]=0.$$
Therefore, $\nu(\cdot)$ is absolutely continuous.

(2) Let $\nu(A):=-\log P_0(A)$. For $A_1\cap A_2=\emptyset$, assumption 3 implies that
$$\nu(A_1\cup A_2)=-\log P_0(A_1\cup A_2)=-\log\big[P_0(A_1)\cdot P_0(A_2)\big]=\nu(A_1)+\nu(A_2),$$
which shows that $\nu(\cdot)$ is finitely additive. Based on the two arguments in Step 1, we have shown that $\nu(\cdot)$ is countably additive.

Step 2. For convenience, we first consider the case where $A=I_{a,b}=(a,b]$ is a d-dimensional interval in $\mathbb{R}^d$, and then extend to arbitrary measurable sets. Let $T(x)=:\nu(I_{0,x})$, so T is a d-dimensional coordinatewise monotone transformation with
$$T(0)=0,\qquad T(x)\le T(x')\ \text{iff}\ x\le x',$$
where the componentwise order symbol (partial order) "$\le$" means that each coordinate of x is less than or equal to the corresponding coordinate of $x'$ (for example, with $d=2$ we have $(1,2)\le(3,4)$). This follows from the fact that $\nu(\cdot)$ is absolutely continuous and finitely additive, which implies $\nu(I_{0,0})=\lim_{m(I_{0,x})\to0}\nu(I_{0,x})=0$.
The definition of the minimal set of the componentwise ordering relation lays the basis for our study; see Sect. 4.1 of Dinh The Luc [24]. Let A be a nonempty set in $\mathbb{R}^{+d}$. A point $a\in A$ is called a Pareto minimal point of the set A if there is no point $a'\in A$ such that $a\ge a'$ and $a'\neq a$. The set of Pareto minimal points of A is denoted by $\mathrm{PMin}(A)$. Next, we define the generalized inverse $T^{-1}(t)$ (like some generalizations of the inverse matrix, it is not unique; we just choose one) by
$$x=(x_1,\dots,x_d)=T^{-1}(t)=:\mathrm{PMin}\big\{a\in\mathbb{R}^{+d}:t\le T(a)\big\}.$$

When $d=1$, the generalized inverse $T^{-1}(t)$ is just an analogue of the quantile function. By Theorem 4.1.4 in Dinh The Luc [24], if A is a nonempty compact set, then it has a Pareto minimal point. Set $x=T^{-1}(t)$, $x+\xi=T^{-1}(t+\tau)$, and define $\varphi_k(\tau,t)=P_k(I_{x,x+\xi})$ for $t=T(x)$, $\tau=T(x+\xi)-T(x)$. Consequently, by part one and the finite additivity of $\nu(\cdot)$,
$$\tau=T(x+\xi)-T(x)=\nu(I_{0,x+\xi})-\nu(I_{0,x})=\nu(I_{x,x+\xi})=-\log P_0(I_{x,x+\xi}).$$
Thus we have $\varphi_0(\tau,t)=e^{-\tau}$. Let $\Phi_k(\tau,t)=\sum_{i=k}^\infty\varphi_i(\tau,t)$. Due to assumption 5, it follows that
$$\alpha_k=\lim_{\tau\to0}\frac{\varphi_k(\tau,t)}{\Phi_1(\tau,t)}=\lim_{\tau\to0}\frac{\varphi_k(\tau,t)}{\tau}\quad\text{for }k\ge1.\qquad(36)$$
Applying assumption 3, we get $\varphi_k(\tau+h,t)=\sum_{i=0}^k\varphi_{k-i}(\tau,t)\varphi_i(h,t+\tau)$. Subtracting $\varphi_k(\tau,t)$ and dividing by h ($h>0$) in the above equation gives

$$\frac{\varphi_k(\tau+h,t)-\varphi_k(\tau,t)}{h}=\varphi_k(\tau,t)\cdot\frac{\varphi_0(h,t+\tau)-1}{h}+\sum_{i=1}^k\varphi_{k-i}(\tau,t)\cdot\frac{\varphi_i(h,t+\tau)}{h}.\qquad(37)$$
Here the meaning of h in $\varphi_i(h,t+\tau)$ is that
$$h=T(x+\xi+\eta)-T(x+\xi)=\nu(I_{0,x+\xi+\eta})-\nu(I_{0,x+\xi})=\nu(I_{x+\xi,x+\xi+\eta}),$$
where $x+\xi+\eta=T^{-1}(t+\tau+h)$ and $x+\xi=T^{-1}(t+\tau)$.

Let $h\to0$ in (37). Using assumption 5 and its consequence (36), it follows that
$$\varphi_k'(\tau,t)=:\frac{\partial}{\partial\tau}\varphi_k(\tau,t)=-\varphi_k(\tau,t)+\alpha_1\varphi_{k-1}(\tau,t)+\alpha_2\varphi_{k-2}(\tau,t)+\cdots+\alpha_k\varphi_0(\tau,t),\qquad(38)$$
which is a difference–differential equation. To solve (38), we write it in matrix form,
$$\frac{\partial P_k(\tau,t)}{\partial\tau}:=\begin{pmatrix}\varphi_k'(\tau,t)\\\vdots\\\varphi_1'(\tau,t)\\\varphi_0'(\tau,t)\end{pmatrix}=\begin{pmatrix}-1&\alpha_1&\cdots&\alpha_k\\&\ddots&\ddots&\vdots\\&&-1&\alpha_1\\&&&-1\end{pmatrix}\begin{pmatrix}\varphi_k(\tau,t)\\\vdots\\\varphi_1(\tau,t)\\\varphi_0(\tau,t)\end{pmatrix}=:QP_k(\tau,t).$$

The general solution is
$$P_k(\tau,t)=e^{Q\tau}\cdot c=:\sum_{m=0}^\infty\frac{Q^m}{m!}\tau^m\cdot c,$$
where c is a constant vector. To specify c, we observe that
$$(0,\dots,0,1)^T=\lim_{\tau\to0}P_k(\tau,t)=\lim_{\tau\to0}e^{Q\tau}c=c.$$

Hence Q can be written as
$$Q=-I_{k+1}+\alpha_1N+\alpha_2N^2+\cdots+\alpha_kN^k,\quad\text{where }N=:\begin{pmatrix}0_{k\times1}&I_k\\0&0_{1\times k}\end{pmatrix}.$$
For $i\ge k+1$ we have $N^i=0$. We use the multinomial expansion of the power
$$\frac{1}{m!}\bigg(-I_{k+1}+\sum_{i=1}^k\alpha_iN^i\bigg)^m=\sum_{s_0+s_1+\cdots+s_k=m}\frac{(-I_{k+1})^{s_0}}{s_0!}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}N^{s_1+2s_2+\cdots+ks_k}.$$

Then
$$\begin{aligned}P_k(\tau,t)&=\sum_{m=0}^\infty\tau^m\sum_{s_0+s_1+\cdots+s_k=m}\frac{(-I_{k+1})^{s_0}}{s_0!}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}N^{s_1+2s_2+\cdots+ks_k}c\\&=\sum_{s_0=0}^\infty\frac{(-\tau)^{s_0}}{s_0!}\sum_{r=0}^\infty\sum_{s_1+\cdots+s_k=r}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}\tau^{s_1+\cdots+s_k}N^{s_1+2s_2+\cdots+ks_k}c\quad(\text{letting }r=m-s_0)\\&=e^{-\tau}\sum_{r=0}^\infty\sum_{s_1+\cdots+s_k=r}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}\tau^{s_1+\cdots+s_k}N^{s_1+2s_2+\cdots+ks_k}c\\&=\sum_{l=0}^k\sum_{R(s,k)=l}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}\tau^{s_1+\cdots+s_k}e^{-\tau}N^lc,\end{aligned}\qquad(39)$$
where $R(s,k)=:\sum_{t=1}^kts_t$, and the last equality uses the facts that $N^lc=(0,\dots,0,1,0,\dots,0)^T$ and $N^l=0$ for $l\ge k+1$.
Choosing the first element in (39), we see that $\varphi_k(\tau,t)=\sum_{R(s,k)=k}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}\tau^{s_1+\cdots+s_k}e^{-\tau}$. By the relation between $\varphi_k(\tau,t)$ and $P_k(A)$, we obtain
$$P_k(I_{0,\xi})=\sum_{R(s,k)=k}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}\big[\nu(I_{0,\xi})\big]^{s_1+\cdots+s_k}\cdot e^{-\nu(I_{0,\xi})}.\qquad(40)$$

Step 3. We need to show countable additivity in terms of (40). Let $A_1,A_2$ be two disjoint intervals; we first show that
$$P_k(A_1\cup A_2)=\sum_{R(s,k)=k}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}\big[\nu(A_1\cup A_2)\big]^{s_1+\cdots+s_k}e^{-\nu(A_1\cup A_2)},\qquad(41)$$
which means that (40) is closed under finite unions of the $A_i$.
By assumption 3, to show (41) it is sufficient to verify that $\sum_{i=0}^kP_{k-i}(A_1)P_i(A_2)$ equals the right-hand side of (41). From the m.g.f. of the DCP distribution with complex $\theta$, we have
$$M_{A_i}(\theta):=\mathrm{E}e^{-\theta N(A_i)}=\exp\bigg\{\lambda(A_i)\sum_{a=1}^\infty\alpha_a\big(e^{-a\theta}-1\big)\bigg\}\quad\text{for }i=1,2.$$
And by the definition of the m.g.f. and assumption 3,
$$\begin{aligned}M_{A_1\cup A_2}(\theta)&=\sum_{k=0}^{+\infty}P_k(A_1\cup A_2)e^{-k\theta}=\sum_{k=0}^{+\infty}\bigg[\sum_{i=0}^kP_{k-i}(A_1)P_i(A_2)\bigg]e^{-k\theta}=\sum_{j=0}^{+\infty}P_j(A_1)e^{-j\theta}\sum_{s=0}^{+\infty}P_s(A_2)e^{-s\theta}\\&=M_{A_1}(\theta)M_{A_2}(\theta)=\exp\bigg\{\big[\nu(A_1)+\nu(A_2)\big]\sum_{a=1}^\infty\alpha_a\big(e^{-a\theta}-1\big)\bigg\}\\&=\sum_{k=0}^\infty\bigg[\sum_{R(s,k)=k}\frac{\alpha_1^{s_1}\cdots\alpha_k^{s_k}}{s_1!\cdots s_k!}\big[\nu(A_1\cup A_2)\big]^{s_1+\cdots+s_k}e^{-\nu(A_1\cup A_2)}\bigg]e^{-k\theta}.\end{aligned}$$
Thus we obtain (41).
Finally, let $E_n=\bigcup_{i=1}^nA_i$ and $E=\bigcup_{i=1}^\infty A_i$ for disjoint intervals $A_1,A_2,\dots$. We use the following lemma from Copeland and Regan [8].

Thus, we show (41). n ∞ Finally, let En = i=1 Ai, E = i=1 Ai for disjoint intervals A1, A2,....Weusethefollowing lemma in Copeland and Regan [8]. Zhang and Wu Journal of Inequalities and Applications (2019)2019:312 Page 26 of 28

Lemma A.1 If we have assumptions 1–4, let en = {E \ (E ∩ En)}∪{En \ (E ∩ En)}. Then

lim m(en)=0 implies lim Pk(En)=Pk(E). n→∞ n→∞ ∞ ∞ → →∞ To see this, since en = i=n+1 Ai,wehavem(en)= i=n+1 m(Ai) 0asn .Thus,we obtain the countable additivity with respect to (40). 

A.2 Proof of Corollary 3.1
Let $CP_{i,r_i}(A)=:\sum_{k=1}^{r_i}kN_{i,k}(A)$, where the $\{N_{i,k}(A)\}_{i=1}^n$ are independent Poisson point processes with intensities $\alpha_k(i)\int_A\lambda(x)\,dx$ for each fixed k. By Campbell's theorem, it follows that
$$\begin{aligned}\mathrm{E}\exp\bigg\{-\sum_{i=1}^n\int_{S_i}f_i(x)\,CP_{i,r_i}(dx)\bigg\}&=\mathrm{E}\exp\bigg\{-\sum_{i=1}^n\int_{S_i}f_i(x)\sum_{k=1}^{r_i}kN_{i,k}(dx)\bigg\}\\&=\prod_{i=1}^n\prod_{k=1}^{r_i}\mathrm{E}\exp\bigg\{-\int_{S_i}kf_i(x)\,N_{i,k}(dx)\bigg\}\\&=\prod_{i=1}^n\prod_{k=1}^{r_i}\exp\bigg\{\int_{S_i}\big(e^{-kf_i(x)}-1\big)\alpha_k(i)\lambda(x)\,dx\bigg\}\\&=\exp\bigg\{\sum_{i=1}^n\sum_{k=1}^{r_i}\int_{S_i}\big(e^{-kf_i(x)}-1\big)\alpha_k(i)\lambda(x)\,dx\bigg\}.\end{aligned}\qquad(42)$$

For $\eta>0$, define the normalized exponential transform of the stochastic integral by $D_{\{r_i\}}(\eta)$:
$$\log D_{\{r_i\}}(\eta):=\eta\sum_{i=1}^n\bigg[\int_{S_i}f_i(x)\,CP_{i,r_i}(dx)-\sum_{k=1}^{r_i}k\alpha_k(i)\int_{S_i}f_i(x)\lambda(x)\,dx\bigg]-\sum_{i=1}^n\int_{S_i}\sum_{k=1}^{r_i}\big(e^{k\eta f_i(x)}-k\eta f_i(x)-1\big)\alpha_k(i)\lambda(x)\,dx.\qquad(43)$$

Therefore, we obtain $\mathrm{E}D_{\{r_i\}}(\eta)=1$ by (42). It follows from (43) and Markov's inequality that
$$P\bigg(\eta\sum_{i=1}^n\bigg[\int_{S_i}f_i(x)\,CP_{i,r_i}(dx)-\sum_{k=1}^{r_i}k\alpha_k(i)\int_{S_i}f_i(x)\lambda(x)\,dx\bigg]\ge\sum_{i=1}^n\int_{S_i}\sum_{k=1}^{r_i}\big(e^{k\eta f_i(x)}-k\eta f_i(x)-1\big)\alpha_k(i)\lambda(x)\,dx+ny\bigg)=P\big(D_{\{r_i\}}(\eta)\ge e^{ny}\big)\le e^{-ny}.\qquad(44)$$

By inequality (12), Eq. (29) implies

$$\begin{aligned}
&\mathrm{P}\Biggl(\sum_{i=1}^{n} \Biggl(\int_{S_i} f_i(x)\, CP_{i,r_i}(dx) - \sum_{k=1}^{r_i} \int_{S_i} k \alpha_k(i) \lambda(x)\,dx\Biggr) \\
&\qquad \ge \sum_{i=1}^{n} \sum_{k=1}^{r_i} \alpha_k(i) \int_{S_i} \frac{\frac{1}{2} k^2 \eta f_i^2(x) \lambda(x)}{1 - \frac{1}{3} k\eta \|f_i\|_\infty}\,dx + \frac{ny}{\eta}\Biggr) \le e^{-ny}. 
\end{aligned} \tag{45}$$

Let $V_{i,f_i} := \int_{S_i} f_i^2(x) \lambda(x)\,dx$ for $i = 1, 2, \ldots, n$. By a basic inequality, we have
$$\begin{aligned}
&\sum_{i=1}^{n} \sum_{k=1}^{r_i} \alpha_k(i) \int_{S_i} \frac{\frac{1}{2} k^2 \eta f_i^2(x) \lambda(x)}{1 - \frac{1}{3} k\eta \|f_i\|_\infty}\,dx + \frac{ny}{\eta} = \sum_{i=1}^{n} \sum_{k=1}^{r_i} \alpha_k(i) \Biggl[\frac{\frac{1}{2}\eta k^2 V_{i,f_i}}{1 - \frac{1}{3} k\eta \|f_i\|_\infty} + \frac{y}{\eta}\Biggr] \\
&\quad = \sum_{i=1}^{n} \sum_{k=1}^{r_i} \alpha_k(i) \Biggl[\frac{\frac{1}{2}\eta k^2 V_{i,f_i}}{1 - \frac{1}{3} k\eta \|f_i\|_\infty} + \frac{y}{\eta}\Bigl(1 - \frac{1}{3} k\eta \|f_i\|_\infty\Bigr) + \frac{ky}{3}\|f_i\|_\infty\Biggr] \\
&\quad \ge \sum_{i=1}^{n} \sum_{k=1}^{r_i} \alpha_k(i) \Bigl[k\sqrt{2 V_{i,f_i} y} + \frac{ky}{3}\|f_i\|_\infty\Bigr] = \sum_{i=1}^{n} \sum_{k=1}^{r_i} k\alpha_k(i) \Bigl[\sqrt{2 V_{i,f_i} y} + \frac{y}{3}\|f_i\|_\infty\Bigr],
\end{aligned} \tag{46}$$
where the first equality uses $\sum_{k=1}^{r_i} \alpha_k(i) = 1$ for each $i$.
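The basic inequality invoked here is the arithmetic–geometric mean bound $a + b \ge 2\sqrt{ab}$, applied termwise with
$$a = \frac{\frac{1}{2}\eta k^{2} V_{i,f_i}}{1 - \frac{1}{3}k\eta\|f_i\|_\infty}, \qquad b = \frac{y}{\eta}\Bigl(1 - \frac{1}{3}k\eta\|f_i\|_\infty\Bigr),$$
so that the factor $1 - \frac{1}{3}k\eta\|f_i\|_\infty$ cancels in the product and $2\sqrt{ab} = 2\sqrt{\tfrac{1}{2}k^{2} V_{i,f_i} y} = k\sqrt{2 V_{i,f_i} y}$.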

Optimizing $\eta$ in (45) by Eq. (46), we have
$$\mathrm{P}\Biggl(\sum_{i=1}^{n} \Biggl(\int_{S_i} f_i(x)\, CP_{i,r_i}(dx) - \sum_{k=1}^{r_i} \int_{S_i} k \alpha_k(i) \lambda(x)\,dx\Biggr) \ge \sum_{i=1}^{n} \sum_{k=1}^{r_i} k\alpha_k(i) \Bigl[\sqrt{2 y V_{i,f_i}} + \frac{y}{3}\|f_i\|_\infty\Bigr]\Biggr) \le e^{-ny}. \tag{47}$$

For $i = 1, 2, \ldots$, let $r_i \to \infty$ in (47); then $CP_{i,r_i}(A) \xrightarrow{d} CP_i(A)$, which implies the concentration inequality (8). Define $c_{1n} := \sum_{i=1}^{n} \mu_i \sqrt{2 V_{i,f_i}}$ and $c_{2n} := \sum_{i=1}^{n} \frac{\mu_i}{3} \|f_i\|_\infty$, let $t = c_{1n}\sqrt{y} + c_{2n} y$, and we obtain the expression of $y$ by solving a quadratic equation.
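Carrying out this last step explicitly: with $u = \sqrt{y}$, the equation $t = c_{1n}\sqrt{y} + c_{2n}y$ becomes $c_{2n}u^{2} + c_{1n}u - t = 0$, and taking its positive root yields
$$\sqrt{y} = \frac{-c_{1n} + \sqrt{c_{1n}^{2} + 4 c_{2n} t}}{2 c_{2n}}, \qquad y = \Biggl(\frac{\sqrt{c_{1n}^{2} + 4 c_{2n} t} - c_{1n}}{2 c_{2n}}\Biggr)^{2}.$$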

Acknowledgements
This work is part of the doctoral thesis of the first author, who would like to express sincere gratitude to his advisor, Professor Jinzhu Jia, for his guidance, and to Ph.D. candidate Xiangyang Li for his assistance with the simulations. The authors also thank the two anonymous referees for their valuable comments, which greatly improved the quality of the manuscript. During the writing of an early version of this paper, Professor Patricia Reynaud-Bouret provided several helpful comments and materials on concentration inequalities, for which warm thanks are due.

Funding
No funding was used to support this paper.

Availability of data and materials
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
The authors completed the paper and approved the final manuscript.

Author details
1 School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China. 2 Mathematics Department, Rutgers University, Piscataway, USA.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 14 March 2019. Accepted: 2 December 2019.

References
1. Aczél, J.: On composed Poisson distributions, III. Acta Math. Hung. 3(3), 219–224 (1952)
2. Baraud, Y., Birgé, L.: Estimating the intensity of a random measure by histogram type estimators. Probab. Theory Relat. Fields 143(1–2), 239–284 (2009)
3. Bickel, P.J., Ritov, Y.A., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37, 1705–1732 (2009)
4. Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, London (2013)
5. Bühlmann, P., van de Geer, S.A.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)
6. Chareka, P., Chareka, O., Kennendy, S.: Locally sub-Gaussian random variables and the strong law of large numbers. Atl. Electron. J. Math. 1(1), 75–81 (2006)
7. Cleynen, A., Lebarbier, E.: Segmentation of the Poisson and negative binomial rate models: a penalized estimator. ESAIM Probab. Stat. 18, 750–769 (2014)
8. Copeland, A.H., Regan, F.: A postulational treatment of the Poisson law. Ann. Math. 37, 357–362 (1936)
9. Das, A.: Design and Analysis of Statistical Learning Algorithms which Control False Discoveries. Doctoral dissertation, Universität zu Köln (2018)
10. Feller, W.: An Introduction to Probability Theory and Its Applications, vol. I, 3rd edn. Wiley, New York (1968)
11. Giles, D.E.: Hermite regression analysis of multi-modal count data. Econ. Bull. 30(4), 2936–2945 (2010)
12. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)
13. Hilbe, J.M.: Negative Binomial Regression, 2nd edn. Cambridge University Press, Cambridge (2011)
14. Houdré, C.: Remarks on deviation inequalities for functions of infinitely divisible random vectors. Ann. Probab. 30, 1223–1237 (2002)
15. Houdré, C., Privault, N.: Concentration and deviation inequalities in infinite dimensions via covariance representations. Bernoulli 8(6), 697–720 (2002)
16. Ivanoff, S., Picard, F., Rivoirard, V.: Adaptive Lasso and group-Lasso for functional Poisson regression. J. Mach. Learn. Res. 17(55), 1–46 (2016)
17. Jiang, X., Raskutti, G., Willett, R.: Minimax optimal rates for Poisson inverse problems with physical constraints. IEEE Trans. Inf. Theory 61(8), 4458–4474 (2015)
18. Johnson, N.L., Kemp, A.W., Kotz, S.: Univariate Discrete Distributions, 3rd edn. Wiley, New York (2005)
19. Kingman, J.F.C.: Poisson Processes. Oxford University Press, London (1993)
20. Kontoyiannis, I., Madiman, M.: Measure concentration for compound Poisson distributions. Electron. Commun. Probab. 11, 45–57 (2006)
21. Last, G., Penrose, M.D.: Lectures on the Poisson Process. Cambridge University Press, Cambridge (2017)
22. Li, B., Zhang, H., He, J.: Some characterizations and properties of COM-Poisson random variables. Commun. Stat., Theory Methods (2019). https://doi.org/10.1080/03610926.2018.1563164
23. Li, Q.: Bayesian Models for High-Dimensional Count Data with Feature Selection. Doctoral dissertation, Rice University (2016)
24. Luc, D.T.: Multiobjective Linear Programming: An Introduction. Springer, Berlin (2016)
25. Mallick, H., Tiwari, H.K.: EM adaptive LASSO – a multilocus modeling strategy for detecting SNPs associated with zero-inflated count phenotypes. Front. Genet. 7, Article 32 (2016)
26. Nielsen, F., Nock, R.: Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 55(6), 2882–2904 (2009)
27. Petrov, V.V.: Limit Theorems of Probability Theory: Sequences of Independent Random Variables. Clarendon, Oxford (1995)
28. Raskutti, G., Wainwright, M.J., Yu, B.: Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Trans. Inf. Theory 57(10), 6976–6994 (2011)
29. Reynaud-Bouret, P.: Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probab. Theory Relat. Fields 126(1), 103–153 (2003)
30. Rigollet, P., Hütter, J.C.: High dimensional statistics (2019). http://www-math.mit.edu/~rigollet/PDFs/RigNotes17.pdf
31. Sato, K.: Lévy Processes and Infinitely Divisible Distributions, revised edn. Cambridge University Press, Cambridge (2013)
32. Städler, N., Bühlmann, P., van de Geer, S.A.: ℓ1-penalization for mixture regression models. Test 19, 209–256 (2010)
33. van de Geer, S.A.: The deterministic lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich (2007)
34. Wainwright, M.J.: High-Dimensional Statistics: A Non-asymptotic Viewpoint, vol. 48. Cambridge University Press, Cambridge (2019)
35. Wang, Y.H., Ji, S.: Derivations of the compound Poisson distribution and process. Stat. Probab. Lett. 18(1), 1–7 (1993)
36. Yang, X., Zhang, H., Wei, H., Zhang, S.: Sparse density estimation with measurement errors. arXiv preprint (2019). arXiv:1911.0621
37. Yu, Y.: High-dimensional variable selection in Cox model with generalized Lasso-type convex penalty (2010). https://people.maths.bris.ac.uk/~yy15165/index_files/Cox_generalized_convex.pdf
38. Zhang, H., Jia, J.: Elastic-net regularized high-dimensional negative binomial regression: consistency and weak signals detection. arXiv preprint (2017). arXiv:1712.03412
39. Zhang, H., Li, B.: Characterizations of discrete compound Poisson distributions. Commun. Stat., Theory Methods 45(22), 6789–6802 (2016)
40. Zhang, H., Liu, Y., Li, B.: Notes on discrete compound Poisson model with applications to risk theory. Insur. Math. Econ. 59, 325–336 (2014)