A Unified Approach to Sparse Tweedie Modeling of Multisource Insurance Claim Data
TECHNOMETRICS 2020, VOL. 62, NO. 3, 339–356
https://doi.org/10.1080/00401706.2019.1647881

A Unified Approach to Sparse Tweedie Modeling of Multisource Insurance Claim Data

Simon Fontaine (a), Yi Yang (b), Wei Qian (c), Yuwen Gu (d), and Bo Fan (e)

(a) Department of Statistics, University of Michigan, Ann Arbor, MI; (b) Department of Mathematics and Statistics, McGill University, Montreal, Quebec, Canada; (c) Department of Applied Economics and Statistics, University of Delaware, Newark, DE; (d) Department of Statistics, University of Connecticut, Storrs, CT; (e) Department of Statistics, University of Oxford, Oxford, UK

ABSTRACT
Actuarial practitioners now have access to multiple sources of insurance data corresponding to various situations: multiple business lines, umbrella coverage, multiple hazards, and so on. Despite the wide use and simple nature of single-target approaches, modeling these types of data may benefit from an approach performing variable selection jointly across the sources. We propose a unified algorithm to perform sparse learning of such fused insurance data under the Tweedie (compound Poisson) model. By integrating ideas from multitask sparse learning and sparse Tweedie modeling, our algorithm produces flexible regularization that balances predictor sparsity and between-sources sparsity. When applied to simulated and real data, our approach clearly outperforms single-target modeling in both prediction and selection accuracy, notably when the sources do not have exactly the same set of predictors. An efficient implementation of the proposed algorithm is provided in our R package MStweedie, which is available at https://github.com/fontaine618/MStweedie. Supplementary materials for this article are available online.

ARTICLE HISTORY
Received June 2018; Accepted June 2019

KEYWORDS
Backtracking line search; Groupwise proximal gradient descent; Multisource insurance data; Multitask learning; Regularization; Tweedie model

1. Introduction
CONTACT Yi Yang ([email protected]), Department of Mathematics and Statistics, McGill University, Montreal, Quebec H3A 0G4, Canada. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/TECH. © 2019 American Statistical Association and the American Society for Quality.

Insurance claim data are characterized by excess zeros, corresponding to insurance policies without any claims, and highly right-skewed positive values associated with nonzero claim amounts, typically in monetary value. The modeling of insurance claim data helps to predict the expected loss associated with a portfolio of policies and is widely used for premium pricing. As claim data reflect a unique mixed nature of distributions with both discrete and continuous components, there are generally two popular modeling approaches. The first type considers a frequency-severity approach where claim frequency (i.e., whether a claim exists or not) and claim amount are modeled separately (Yip and Yau 2005; Frees, Gao, and Rosenberg 2011; Shi, Feng, and Ivantsova 2015), so that the two models need to be used together for claim loss prediction. The second type uses Tweedie's compound Poisson model (or Tweedie model for short; Tweedie 1984), which considers an inherent Poisson process and models both components simultaneously.

Our study will focus on the second approach, which draws upon the Tweedie distribution's natural structure for claim data modeling (Smyth and Jørgensen 2002; Frees, Meyers, and Cummings 2011; Zhang 2013; Shi, Feng, and Boucher 2016). It is also common practice that insurers collect and maintain external information associated with insurance policies, either directly from policy holders or from third-party databases. Covariates generated from the external information can be associated with the claim loss and help improve the modeling process. Depending on the type of insurance, this information may consist of the policyholder's characteristics (demographics, employment status, experience, etc.), of the insured object's characteristics (car type and usage, property value, etc.), or of any other characteristic deemed relevant.

Traditionally, actuarial practitioners adopt a single-target approach that, for a given insurance product, assumes one population to be homogeneously characterized by some covariates and aims to build a single Tweedie model solely from the product's sample data. Despite the wide use and simple nature of this approach, practitioners now have access to multiple sources of insurance data. For instance, many insurers have multiple business lines such as auto insurance and property insurance; in umbrella coverage, claim amounts are available for multiple types of coverage and for different hazard causes of the same coverage; multiple datasets can be accumulated over a long period of time, during which the business environment may have changed significantly, so that earlier-year and later-year data sources may not be treated as one homogeneous population. As a result, modern multisource insurance data may not be characterized well by a homogeneous model. With these emerging multisource insurance data problems, much attention has been drawn to addressing their modeling issues in statistics and actuarial science. Both the frequency-severity and Tweedie model approaches have been investigated in the context of multivariate regressions to model the multiple responses simultaneously (see Frees, Lee, and Yang 2016; Shi 2016, and references therein).

Variable selection is one of the most important tasks in building transparent and interpretable models for claim loss prediction. Large-scale high-dimensional sparse modeling is commonly encountered, as hundreds of covariates are often considered as candidate variables while only a few of them are believed to be associated with the claim loss or can be used in the final model production. Under the single-population setting, efficient variable selection approaches designed for the Tweedie model have been developed via a shrinkage-type approach (see Qian, Yang, and Zou 2016 and references therein). The increasingly prevalent multisource data scenarios coupled with high dimensionality and large data scale pose new challenges to actuarial practitioners. To our knowledge, the corresponding variable selection issues for multisource Tweedie models have not been studied in the literature. On the one hand, simply treating all different data sources as if they were from one population is problematic due to severe model misspecification. On the other hand, it may not be ideal either to perform variable selection separately on each individual data source, because this often results in a loss of estimation efficiency.

In the aforementioned multiline, multitype, or multiyear scenarios, the different data sources often contain similar types of covariates, and some (or all) of them can be relevant across some (or all) data sources, even if different data come from totally different sets of customers. For example, both auto and property insurance contain geographical, credit, and experience variables that may be important in both lines of business. Therefore, a proper variable selection process should ideally take advantage of the potential connections among data sources as opposed to simply treating each data source separately.

In this article, we augment the multisource claim data analysis through an integrated shrinkage-based Tweedie modeling approach that fuses different data sources to find commonly shared relevant covariates. To ensure our method is plausible, we will assume that the different sources have some (if not all) covariates in common. At the same time, our method retains the ability to recover model structures and covariates unique to individual data sources. In particular, we impose a [...] The algorithm is theoretically guaranteed to converge to the optimization target with at least a linear rate, and is practically flexible to handle source-specific missing covariates. In addition, we implement our proposal in an efficient and user-friendly R package called MStweedie (standing for multisource Tweedie modeling), which is available at https://github.com/fontaine618/MStweedie.

The article is organized as follows. In Section 2, we introduce the sparse Tweedie model for multisource claim data and derive a general objective function. Section 3 develops a unified algorithm to efficiently optimize that objective. Section 4 provides the details of implementation and tuning parameter selection for the proposed algorithm. In Section 5, we compare the performance of our proposal to other existing methods in a series of numerical experiments on both simulated and real data. Section 6 concludes the article. The technical proofs are relegated to the appendix (supplementary materials).

2. Methodology

2.1. Tweedie's Compound Poisson Model

The Tweedie model is closely related to the exponential dispersion models (EDM; Jørgensen 1987)

    f_Y(y | θ, φ) = a(y, φ) exp{ (yθ − κ(θ)) / φ },

parameterized by the natural parameter θ and the dispersion parameter φ, where κ(·) is the cumulant function and a(·, ·) is the normalizing function. Both a(·, ·) and κ(·) are known functions. It can be shown that Y has mean μ ≡ E(Y) = κ̇(θ) and variance var(Y) = φκ̈(θ), where κ̇(θ) and κ̈(θ) denote the first and second derivatives of κ(θ), respectively. In this article, we are primarily interested in the Tweedie EDMs, a class of EDMs that have the mean-variance relationship var(Y) = φμ^ρ, where ρ is the power parameter. Such mean-variance relation gives
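As a worked illustration of the cumulant calculus (a standard textbook special case, not taken from this article), the Poisson distribution can be put in the EDM form above with φ = 1:

```latex
% Poisson(\lambda) pmf rewritten in exponential-dispersion form (\phi = 1):
\begin{aligned}
f_Y(y \mid \theta, \phi)
  &= \frac{e^{-\lambda}\lambda^{y}}{y!}
   = \frac{1}{y!}\exp\{y\theta - e^{\theta}\},
  \qquad \theta = \log\lambda,\quad \kappa(\theta) = e^{\theta},\quad \phi = 1,\\
\mu &= \dot\kappa(\theta) = e^{\theta} = \lambda,
  \qquad \operatorname{var}(Y) = \phi\,\ddot\kappa(\theta) = e^{\theta} = \mu .
\end{aligned}
```

Thus the Poisson model is the boundary case ρ = 1 of the power relation var(Y) = φμ^ρ; the compound Poisson (Tweedie) models of interest here have 1 < ρ < 2, with the Gaussian at ρ = 0 and the gamma at ρ = 2.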
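The compound Poisson construction behind the Tweedie model (a Poisson number of claims, each with a gamma-distributed amount) can also be checked empirically. The sketch below is ours, not the paper's: it uses only the Python standard library and the standard reparameterization λ = μ^(2−ρ)/(φ(2−ρ)), α = (2−ρ)/(ρ−1), γ = φ(ρ−1)μ^(ρ−1) (the names `lam`, `alpha`, and `scale` are our own). It reproduces both hallmarks of claim data noted in the introduction: a point mass of exact zeros, P(Y = 0) = e^(−λ), and the mean-variance relation var(Y) = φμ^ρ.

```python
import math
import random

def rpoisson(lam, rng):
    """Draw one Poisson(lam) variate via Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_tweedie(mu, phi, rho, n, seed=0):
    """Simulate n draws of a Tweedie(mu, phi, rho) variable, 1 < rho < 2,
    as a Poisson sum of iid gamma claim amounts."""
    assert 1.0 < rho < 2.0
    rng = random.Random(seed)
    lam = mu ** (2.0 - rho) / (phi * (2.0 - rho))   # Poisson claim frequency
    alpha = (2.0 - rho) / (rho - 1.0)               # gamma shape per claim
    scale = phi * (rho - 1.0) * mu ** (rho - 1.0)   # gamma scale per claim
    draws = []
    for _ in range(n):
        num_claims = rpoisson(lam, rng)
        # A sum of N iid Gamma(alpha, scale) variables is Gamma(N * alpha, scale).
        total = rng.gammavariate(num_claims * alpha, scale) if num_claims else 0.0
        draws.append(total)
    return draws

mu, phi, rho = 2.0, 1.0, 1.5
ys = simulate_tweedie(mu, phi, rho, n=100_000, seed=42)
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
zero_frac = sum(y == 0.0 for y in ys) / len(ys)
# Empirically: mean is close to mu, var to phi * mu**rho, zero_frac to exp(-lam).
```

Knuth's method is adequate here because λ is small; for large λ one would switch to a rejection sampler or a library routine such as `numpy.random.Generator.poisson`.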