Section on Survey Research Methods

Application of the Truncated Triangular and the Trapezoidal Distributions

Jay J. Kim ORM, NCHS, 3311 Toledo rd, Hyattsville, MD, 20782

Abstract them and investigate some of their properties in this paper. It has been suggested that the truncated be used for masking Multiplicative noise has the following form: microdata. The random variable which follows the truncated triangular distribution serves as a yxeiiii==,1,2, n, multiplicative noise factor. The natural candidate distribution is the one which is th where y is the masked variable for the i unit centered at 1, but truncated symmetrically i around 1, because it is not desirable to use such as person, household, establishment or some values which are too close to 1 due to governmental unit, xi is the corresponding un- confidentiality concerns. Instead of truncating masked variable and ei is the noise. the middle section of the distribution, we can assign low for that part of the Since noise is generated independent of the distribution. This procedure will render some original data, x and e are independent. Thus variations of the trapezoidal distribution. Two i i other approaches are to use the upside-down triangular distribution or parabloa for the Ey()= ExEe ()(). middle section of the distribution. In this paper, we will derive the density Usually E(e)= 1 is used, and thus function for each of the four distributions and investigate their properties. Ey()= Ex ().

KEY WORDS: confidentiality, masking, Also truncated triangular distribution 22 Vy()=++ VxVe () ()µµ Vx () Ve (). 1 Introduction ex

Since around 1980, the U.S. Energy Information Thus 2 Administration (EIA) has been using Vy()− µx Ve () Vx()= 2 (1) multiplicative noise for masking the number of Ve()+ µe heating and cooling days and marginal rates for utility, such as electricity and gas in their public Since the data disseminating agencies provide use micro data file from the Residential Energy the mean and of the noise distribution to Consumption Survey. Evans, et al (1998) the data users, users can estimate the variance of proposed use of multiplicative noise to mask characteristics from the masked data using the economic data. They considered noise which formula in equation (1). follows distributions such as normal and truncated normal distributions. Kim and Winkler 2. Noise Distributions. (2002, 2003) considered a noise distribution following the truncated . 2.1 Triangular Distribution Massell and Russell (2006) proposed to use a truncated triangular distribution for masking Since a truncated triangular distribution is a economic data. In this paper, we will consider modified form of a triangular distribution, we noise that follows the truncated triangular, consider a triangular distribution first. modified trapezoidal, double triangular and parabola-filled truncated triangular distributions. As the above probability density functions are not available at this , we will derive

2723 Section on Survey Research Methods

2.2 Truncated Triangular Distribution

When noise following a triangular distribution is used, using 1 for m is not desirable. This is because multiplication by 1 does not change the original value at all. That is, if the original value is multiplied by 1 for a unit, the unit does not get e any protection from disclosure. Even worse, the a m b probability of noise being 1 is the highest in this Figure 1. Triangular Distribution type of distribution. This implies that the largest number of units do not get protection. Thus Massell and Russell suggested using noise that follows the triangular distribution whose middle Triangular distribution of e, as shown above has portion, or the part near 1, is truncated. The the following form. truncated triangular distribution mentioned above is shown in Figure 2. ⎧ 2 ea− ⎪ , aem≤< ⎪bama−− fe()= ⎨ (2) ⎪ 2 be− , meb≤< ⎩⎪babm−−

Note that m is the of this distribution. If it is a symmetric distribution around m, m is also the mean and median of the variable e. If the distribution is symmetric around m, i.e., ma−=− bm then equation (2) reduces to e m cd ab ⎧ ea− ⎪ , aem≤< Figure 2. Truncated Triangular Distribution ⎪()ma− 2 fe()= ⎨ (3) ⎪ be− , meb≤< ⎩⎪()bm− 2

The of e is Suppose the distribution is truncated at b and c , cb> , as shown above, which are around m. mab++ One of the vertices, denoted by b in equation (2), Ee()= (4) is now denoted by d. In this case, the probability 3 density function (pdf) of e has the following form: Suppose the triangular distribution is symmetric around m, i.e., ma bm or ab2 m, −=− += ⎧ 2(ea− ) then the above expectation reduces to m. ⎪ , aem≤< ⎪km()()−− a d a fe()= ⎨ (7) The variance of e is ⎪ 2(de− ) , med≤< ⎩⎪kd()()−− m d a b222−++ abam − mab() + Ve()= . (5) 18 where k needs be determined.

If ma−=− bm or ab+=2 m, then the above Now variance reduces to bd2(ea−− ) 2( de ) ∫∫de+ de ()bm− 2 ackmada()()−− kdmda ()() − − Ve()= (6) 6

2724 Section on Survey Research Methods

()ba−−22 () dc symmetric triangular distribution with m=1 will = + . (8) provide an unbiased mean of the original data. kmada()()()()−− kdmda − −

Example 1. In order for the function in equation (7) to become a pdf, equation (8) needs to be 1. Thus If a = 0.5, d = 1.5, m = 1, b = 0.75 and c =

1.25, then E(e) without truncation is ()()()()dmba−−+−−22 madc k = ()()()damadm−−− 1++ 0.75 1.25 Ee()== 1. 3 Using the above k, the truncated triangular distribution of e, truncated at b and c is, The expected value of e after the truncation is

⎧ 2(dm− ) Ee()= 1, since m = 1. ⎪ 22(),ea− ⎪()()()()ba−−+−− dm dc ma ⎪ for a≤< e b As shown in the formulas, they are the same. fe()= ⎨ ⎪ 2(ma− ) (),de− The variance of e of the truncated triangular ⎪()()()()ba−−+−−22 dm dc ma ⎪ distribution is, ⎩ forced≤< (9) 1 Ve()= ⎡⎤222 If the triangular distribution is symmetric around 18⎣⎦()()()()ba−−+−− dm dc ma m and the truncation is also symmetric around m, 62 62 then equation (9) can be simplified as follows: {()()()()ba−−+−− dm dc ma 22 2 +−()()()()()ba dmdc − − ma −⎣⎡ 3 ba + ⎧ ea− ⎪ , aeb≤< 2 22 ⎪()dc− 2 ++()32cd +() a + d − 422()() ba + cd +⎦⎤} (13) fe()= ⎨ (10) ⎪ de− , ced≤< ⎩⎪ ()dc− 2 If the truncation is symmetric around m and the triangular distribution is also symmetric around The expected value of e without assuming m, then symmetry of the distribution and truncation is 22 2 16mc+−++ 8 md 2( d 2 dc 3 c ) Ve()=− m (14) ()()(2)dmba−−2 ba + 12 Ee()= 22 3[(ba−−+−− ) ( dm ) ( dc ) ( ma )] Example 2. Using the data in Example 1, V(e) without truncation is .04167, but after truncation, ()()(2)madc−−2 cd + + (11) it is .1146. The variance of e after truncation is 3[(ba−−+−− )22 ( dm ) ( dc ) ( ma )] 2.75 times that of the variable e without truncation. Suppose again that the triangular distribution is symmetric around m and the mid-section of the The variance of the truncated triangular distribution is truncated symmetrically around m. distribution is greater, since the values around Then the mean, which have the highest frequencies, are not there anymore. E(e)=m (12) 2.3 Trapezoidal Distribution Thus, the mean of the symmetrically truncated triangular distribution, where the mid-section of If we connect the top two points in the truncated the distribution is symmetrically truncated, is the part of the truncated triangular distribution, we same as that of the un-truncated triangular have a trapezoidal distribution. Instead of not distribution. Consequently, the data masked by allowing noise (e) to have any positive noise following the symmetrically truncated probability between b and c, we can allow some

2725 Section on Survey Research Methods

small (equal) probability. In that case, we have a Suppose the trapezoidal distribution is modified trapezoidal distribution, which is symmetric around m, then Ex()= m . shown in Figure 4. To derive the probability density function for the modified trapezoidal 2.4 Modified Trapezoidal Distribution distribution, we start with the trapezoidal distribution in Figure 3. When numbers around 1 are multiplied for masking records, it is better to avoid using 1 or numbers too close to 1. This is the reason why the truncated normal and triangular distributions are considered as candidates for a noise distribution, where truncation is done very near to 1. An alternative approach to the truncated distribution is using a distribution which allows a small probability, instead of a zero probability, in the mid-section of the noise distribution. The shape of the noise distribution then will have a variation of the trapezoidal distribution (see a b m c d Figure 4 below), where there is no connectivity Figure 3. Trapezoidal Distribution in the two highest points and the mid-section is dropped down.

Let the trapezoidal distribution have the following form.

⎧ 2 k ⎪ 1 (),ea−≤< aeb ⎪()()daba−− ⎪ ⎪ ⎪ 2 k fe()=≤<⎨ 2 , b e c (15) da ⎪ − e ⎪ a b m c d ⎪ Figure 4. Modified Trapezoidal Distribution ⎪ 2 k3 (),de−≤< ced ⎩⎪()()dadc−−

Depending on values of k’s, the height of top line k changes. Suppose kkk13== and k2 = . Then q da− If f (b) = f (c), kkk=== , 123dcba ⎧ 2kea− +−− ⎪ , aeb≤< ⎪daba−− then the distribution has the following form: ⎪ 2k fe()=≤<⎨ , b e c (17) ⎪ ()daq− ⎧ 2 ea− ⎪ ⎪ , aeb≤< 2kde− dcbaba+−− − ⎪ , ced≤< ⎪ ⎩dadc−− ⎪ 2 fe()=≤<⎨ , b e c (16) dcba ⎪ +−− 2k b ⎪ 2 de− ∫∫f() e de=− ( e a ) de + ⎪ , ced≤< ()()daba−−a ⎩dcbadc Ω +−− −

22kkcd The above is the same form as given by van ∫∫1()de+− d e de (18) Dorp and Kotz. qd()−−− abc ()() d a d c

2726 Section on Survey Research Methods

The equation above needs to sum up to 1 in order for the function in equation (17) to be a probability density function.

Equation (18) becomes

kqd[(−+− c b a ) + 2( b − c )] (19) ()daq− e abm cd

Hence, Figure 5. Double Triangular Distribution

()daq− k = (20) qd()2()−+− c b a + b − c The “double” triangular distribution shown above, assuming two triangles are symmetric around m, is Plugging k in the equation (17), we get

⎧ 1 ⎧ 2qea− ⎪ (),ea−≤< a eb ⎪ , ()()maba−− qd()2()−+− c b a + c − b b − a ⎪ ⎪ ⎪ 1 ⎪ for a≤< e b ⎪ (),me−≤< b e m ⎪ ⎪()()mamb−− ⎪ 2 fe()= ⎨ (22) ⎪ , ⎪ 1 fe()= ⎨ qd()2()−+− c b a + c − b ⎪ (),emm−≤< ec ⎪ ()()dmcm−− for b≤< e c ⎪ ⎪ ⎪ 1 ⎪ 2qde− ⎪ (),de−≤< ced ⎪ , ⎩()()dmdc−− qd()2()−+− c b a + c − b d − c ⎪ ⎪ ⎩ for c≤< e d (21) 1 ⎡⎤−−ab a22 m + mb Ee()=+⎢⎥ The height of the middle section of the ma− ⎣⎦66 distribution depends on the value of q. If a small probability in the mid-section is desired, then a 1 ⎡⎤−−mc m22 d + dc large q should be used. If a smaller q is used, the ++⎢⎥ (23) dm− ⎣⎦66 probability will go up.

The expected value of this distribution is the Suppose m - b = c – m, m – a = d – m. Equation same as the regular trapezoidal distribution. (23) can be simplified as follows.

2.5 Double Triangular Distribution 2(mad++++ )() bc Ee()== m (24) 6 The middle truncated section of the triangular distribution can be filled with an upside-down 2.5 Parabola-Filled Truncated Triangular triangular distribution. In that case, the middle Distribution triangle should be narrower than the outer triangle before truncation. Here we have a The truncated part of the triangular distribution double triangular distribution as shown in Figure can be filled with a parabola (see Figure 6). It 5. Note that b and c can be farther away from m even can be filled with a u-shaped parabola. The than the truncated triangular distribution, latter can be more desirable in the sense that the because we don’t want to have high probabilities probabilities of e near m would be very small in of e near m. the u-shaped distribution.

2727 Section on Survey Research Methods

Hence

⎧ 1 ⎪ (),ea−≤< a e b ⎪()()maba−− ⎪ 3(em− ) 2 ⎪ , fe()= ⎨2(dmcbcb−++−++ )[22 3 mcb ( ) 3 m 2 ] ⎪ e ⎪ for b≤< e c abm cd ⎪ 1 ⎪ (),de−≤< ced Figure 6. Parabola-Filled Truncated Triangular Distribution ⎩()()dmdc−− (26)

The expected value of e is, Let

⎧ 1 12⎛⎞bbaa22−− ⎪ (),ea−≤< a eb Ee()= ⎜⎟ ⎪()()maba−− ma− ⎝⎠6 ⎪ fe()= ⎨ (25) 22 2 12⎛⎞ddcc+− ⎪ kx(),−≤< m b e c + ⎜⎟ ⎪ 1 dm− ⎝⎠6 ⎪ (),de−≤< ced ⎩()()dmdc−− 2222 ()3()()8(cb−++−++⎣⎡ cbc b mccbb ) + To determine k in equation (25), we set the 8(dmcbcb−++−++ )[22 3 mcb ( ) 3 m 2 ] following to 1. 2 6(mcb+ )⎦⎤ bc 1 2 + ∫∫fede()=−+− ( e ade ) ∫ ke ( m ) de 8(dmcbcb−++−++ )[22 3 mcb ( ) 3 m 2 ] Ω ab()()maba−−

Assuming a symmetric distribution for the d 1 +−∫ ()dede. parabola and triangular distribution, the above c ()()dmdc−− expected value reduces to m.

The above is, 3. Concluding Remarks

ba−− dc c33 − b It has been suggested to use noise following a ++k[ truncated triangular distribution for masking 2(ma−− ) 2( dm ) 3 sensitive variables using a multiplicative noise scheme. However, the formula for its density −−+−mc()()]22 b m 2 c b function has not been available. This paper developed the formula. Furthermore, three other Assuming a symmetric distribution, i.e., m – a = candidates for the noise distribution, which are d – m, we have modifications of the truncated triangular distribution, are also developed here. If the 33 triangular distribution is symmetric and the badc−+− c − b +−−+−=kmcbmcb[()()]122 2 truncation is also symmetric around m, or if the 2(dm− ) 3 triangular distribution is symmetric and the rectangle, triangle or parabola filling the Thus truncated portion is also symmetric around m, then the expected value of e is m. If m = 1 is 3 used, the expected value of the masked data will k = 2(dmcbcb−++−++ )[22 3 mcb ( ) 3 m 2 ] be the same as that of the unmasked data. In other words, the sample mean of the masked data

2728 Section on Survey Research Methods

will be an unbiased estimate of the unmasked Data Released by the U.S. Department of data. Energy, Journal of the American Statistical Association, Vol. 81, No. 395, 690-688. The data disseminating agencies will provide the data users with the mean and variance of the Kim, J..J. and Winkler, W. E. (2001) noise being used for masking. The data users can Multiplicative Noise for Masking Continuous use the information to estimate the sample Data, Proceedings of the Survey Methods variance from the masked data. Note that users Research Section, American Statistical cannot estimate the mean and variance of the Association, CD Rom. noise from the masked data. Kim, J..J. and Winkler, W. E. (2001) In further research, the domain estimation Multiplicative Noise for Masking Continuous formula needs to be developed for this type of Data, Research Report Series, #2003- masking. These distributions need to be tried 01, Statistical Research Division, U.S. Bureau of with real data and compared to find out which the Census. one performs best. Massell, P. and Russell, N. (2006) Protecting 4. References Confidentiality of Commodity Flow Survey Tabular Data by Adding Noise to Underlying Evans, T., Zayatz, L., and Slanta, J. (1998) Using Microdata, presented at a Washington Statistical Noise for Disclosure Limitation of Establishment Society seminar, October 24, 2006. Tabular Data, Journal of Official Statistics, Vol. 14, No. 4, 537-551. van Dorp, J.R. and Kotz, S (2003) Generalized Trapezoidal Distributions, Metrika, Vol. 58, Hwang, J.T. (1986) Multiplicative Errors-in- Issue 1, 85-97. Variables Models with Applications to Recent

The findings and conclusions in this paper are those of the author and do not necessarily represent the views of the National Center for Health Statistics, Centers for Disease Control and Prevention.

2729