Home

Compound Poisson approximation: a user's guide A. D. Barbour1,2

1. I where L I S gt m ye Clearly the random variable W counts the number of sets of k x k2 rectangular subregions in which at least m events occur Thus we have P W gt 1 P Sk gt m Assuming a Bernoulli model for the number of events in each h x h subregion we observe that the problem of approximating L W by a compound Poisson distribution can be approached in very much the same way as the two dimensional consecutive k out of n F system C k n F In fact C k n F is then a particular case of the two dimensional discrete scan statistic with kj k2 k ny n2 n and m k The extra generality in the choices of dimensions k and n makes essentially no difference to the argument but allowing m lt k k is a significant change since as m becomes smaller the dependence between neighbouring k x k2 sets decreases more slowly with decreasing degree of overlap Thus in the notation of Section 3 2 C a large choice of R such as k k2 max ky ko may no longer give an adequate approximation In principle the choice R 1 would be good but the computation of a R may only be feasible using a computer If the Bernoulli model were replaced by the more general Binomial model an analogous approach could still be used though the computation of R would become still more involved 26 B Linear conditional scan statistics The next example concerns one dimensional scan statistics used when testing for cluste
2. The detailed way in which is to be chosen and in which the S z depend on the radius of convergence of the power series gt j gt 1 pa and on the nearness of u to being periodic are explicitly specified in Barbour and Utev 1999 The third term in 2 29 is a penalty incurred in modifying the straightforward Stein argument Similar estimates for Kolmogorov distance are given in Barbour and Utev 1998 under less restrictive conditions on p Note that if a declumping has been achieved as in 2 1 then CPA 2A can be applied with o b3 and 1 bj b3 as defined in 2 26 giving the following estimate CPA 2B If W is declumped as in 2 1 and if gt 2 and p are as for CPA PP then for any N gt 2 and p satisfying m Am and such that gt 5 pind lt oo for some r gt 1 and that p is aperiodic pw IZ lt 1 for alll gt 2 we have drv LW CP X 1 lt NYY Peo Sola Nel Si e PLW lt o a ry S011 2 30 whenever A m gt 2 1 w 1 where 3 4 lt u lt 1 and Si p lt 0 0 lt 1 lt 2 and where 9 b3 and e is as given in 2 28 The advantage of CPA 2A over CPA 1B is that good behaviour as A increases is ob tained under much less restrictive conditions on p than those of 2 18 and 2 20 Strong restrictions on the form of u are replaced by requiring the existence of an exponential moment and an aperiodic u and the latter conditi
3. k Spk ka kYa lt eka a A for a suitably chosen gt 0 The random vector K conditional on J 1 has the multinomial distribution MN k Ya a A and hence S ELja lt X Pi Kia kral gt eka acA k B Ya 0 thus giving a contribution to 3 33 of at most Zig X e7 1 e exp Ae 2 2 2e7t exp Ae 2 2 e 3 34 if ke gt 2 this last from the tail bounds for the binomial distribution given in Barbour Holst and Janson 1992 Proposition A 2 5 For k B splitting N as before only the 2kn pairs r s with r i lt k 1 and s j gt k and the 2km pairs with r i gt k and s j lt k 1 need to be estimated differently For the first set we have E Is lijk I lt I ee acA for each r s giving a total contribution to 3 33 of at most Ag 2kn Y exp k 1 vglog 1 p 3 35 where v 1 H o 7 log 1 p va 1 H v 7 log 1 p and H p y X yYalog te pa gt aCA 36 the second set contributes at most Ag 2km exp k 1 e v log 1 p 3 36 to 3 33 The sum of the three contributions 3 34 3 36 is then added over the m k 1 n k 1 possible pairs i j and the improved bounds for g and Ag are applied This yields alternative error estimates in place of 3 20 dic EW CP As lt GTA Aa Cr p mng As p lt 1 35 dry L W CP A H lt 5 lAl A2 2 mnY
4. nplr 2 Now take BO YD H G 9 F EL Sk S PSr kT and apply the estimate CPA PP Simple calculations show that b lt n 1 p p 2 2k 1 1 p b2 lt 2n 1 p b3 O 3 2 and hence the error estimate that results from CPA PP is of order O nky O ky EW We next turn to the direct approximations CPA 1A and 1B as exemplified in the thesis of Roos 1993 We define neighbourhoods of dependence for each y I taking into account the dependence structure of the problem ee a a E a ca aa ES o aa a Ny y Ak a E Eka 9 2k DV Ty T y US UN Under the above partioning of I we have a decomposition of the random variable W as in 2 6 with W So Ig Uy So Ig and Z So Ig BET BES BEN We observe that S4 N 2 k 1 and that Ig 8 S U y are independent of Ig 8 T for each y T so that 62 63 64 0 Furthermore we have S EL EL E U Z 4k 3 nap yer X E I Z 2 k 1 mp yer and thus by CPA 1A drv L W CP A Lt HY A H X EL ELE U Zy E 1 Z yer HY A p 6k 5 rab 18 and the canonical parameters and yp are as given in 2 7 These can be explicitly computed Aui ni ELT I Up Ip i np Pl Ik Ik Ix 1 1 1 np i t P V V i l1 where Vp and V _ are independent and Ge p truncated at k 1 distributed random variables Evaluating the la
5. 2 eA HITY A 61 lt min 1 7 2 16 Barbour Holst and Janson Lemma 1 1 1 and Remark 10 2 4 If bounds with similar dependence could also be found for general compound Poisson distributions the estimates of CPA 1A and CPA 1B would greatly improve upon the error estimates derived in CPA PP 11 The reason for this is quite simple In the constituents 1 lt l lt 4 of the bounds given in CPA 1A and CPA 1B as also in b1 b and b3 of CPA PP average measures of dependence at each location are multiplied by the mean of W or by A the mean number of clumps each of which increases in proportion to the overall size of the system Bounds of this magnitude are an intrinsic feature of total variation approximation for Poisson processes and are thus unavoidable in CPA PP but as in the Poisson case the same need not be true of total variation approximation to W In particular whenever HTV A u O A is true the elements in the estimate CPA 1A involving can be made independent of the system size since the factor AT neutralizes the growth of the multiplying factor EW in 63 and 64 this is particularly advantageous for variant ii of the estimates Unfortunately the only known analogue of 2 16 for general pz is the bound HEY A u lt min 1 Api e 1 0 1 2 17 proved in Barbour Chen and Loh 1992 This bound can be useful for small A but for larger the exponential factor rapidly makes itself felt
6. Am mi and ce b b3 Amidw Q Q 2 28 Comparing the error estimate in 2 27 to that of 2 4 note that the quantities b are larger than the corresponding b because of the factors j and l Thus CPA 1C is never better than CPA PP unless the H A u are small This is the case under Con dition 2 20 as soon as A becomes large and then CPA 1C is substantially better than CPA PP the same is typically true if Condition 2 18 is satisfied In other circumstances CPA PP is normally preferable to CPA 1C and the direct estimates CPA 1A and 1B are only competitive if a more advantageous decomposition of W as in 2 6 can be found without using the declumping 2 3 Improved estimates The weakness of the estimates CPA 1A 1C when the H A p are not small sug gests that modification of the original Stein argument is needed One such approach was exploited in Barbour and Utev 1998 1999 14 CPA 2A IfW is decomposed as in 2 6 and if gt 2 p and 6 1 lt l lt 4 areas in 2 7 2 11 then for any X gt 2 and p satisfying m Am and such that gt 5 pind lt for some r gt 1 and that pw is aperiodic wW lZ4 lt 1 for alll gt 2 we have drv L W CP Xp lt A T eoSolu 0 e1 SiH P W lt 9 u Am S2 u 2 29 whenever A m gt 2 1 u 1 where 3 4 lt p lt 1 and S p lt 0 lt 1 lt 2 and where 9 and 1 are as given in 2 14 and CPA 1A
7. As p lt 1 5 3 37 where A 2k mexp k 1 uilog 1 p nexp k 1 e v2log 1 p 3 m n p 2ky 3 38 As So e711 e exp kyae 2 2 2671 exp kya 2 2 e A is as in 3 21 and Cr p 2 fe 1 p In the circumstances illustrated in 3 23 provided that l lt v i i 1 2 a suitable s 1 maf 2 BI 3 39 This has the effect of replacing the exponents 6 and 2 in 3 29 by 1 and adding choice for is given by the element involving Ag this alternative bound can then be used whenever ke gt 2 In asymptotic terms it is now enough that l lt v i 1 2 which is a less restrictive condition than 3 30 to allow the choice of s 1 mxd JI 2 3 40 U1 V2 for in which case the overall error estimate is of order O p n O EWn for some 6 gt 0 this once again converges to zero as long as EW only grows with n at most as fast as a small power of n 3 6 Markov chains Counting both success runs and occurrences of a word can be rephrased as particular cases of the more general problem of counting the number of visits W to a rare set S 37 in n steps of a recurrent Markov chain X For success runs as in Section 3 1 the state space S can be taken to be 0 1 k by setting X j if amp 6 1 41 and amp 0 1 lt j lt k 1 and X k otherwise then apart from edge effects the number of
8. 1995 bD lt An k 1 ky 2k r 1 y y 29 and r 2 b lt lt n k 1 kb t r k 1 p n k 1 ky Ninf n a1 where vw Em 0 1 Sal and Ting Mingea 7 a The long range dependence term bg of 2 3 can be bounded using the geometric ergodicity of the Markov chain amp Schbath 1995 showing that Bo O n7p for some 0 lt p lt 1 Hence combining CPA PP with 3 12 through the triangle inequality it follows that dry L W CP A m lt Cink r 1 b Con p 2k 3 14 where the quantities Cy and C2 are of order O 1 the parameters and p are given by Au n k 1 EInu A n k 1 0 L and r can be chosen at will To illustrate the possible asymptotics let n nm A Am and k km all depend on m gt 1 in such a way that Ym 7 Am gt 0 as m oo Note that the quantities 7 and y are typically geometrically small as functions of k so that if the mean E Wm 2mm is kept bounded then km lognm stays bounded away from 0 and oo Taking r rm 3 log nm log p the error estimate 3 14 is then of order O m log w7 For the same sequence of words Am one could instead count the number of copies of Am appearing in a sequence of letters from the Markov chain of length growing faster with m for example nm w for any fixed c gt 1 In this case the error esti mate 3 14 is of order O w2 gt log w which is not even small with m if c gt 2
9. CP Ow lt as SP 3 7 b If r gt 2 then dx L W CP A p lt ew A z A eee 3 8 A a z EW red where acre EW n r 1 n r 1 r r 1 e k r SS E Fil CED r k r cream 2 We note that the above bounds are both of order O rw For more details further appli cations and the relevant literature see Chryssaphinou and Vaggelatou 1999a 3 2 Reliability systems A The Two Dimensional Consecutive k out of n F System The system C k n F consists of n independent components each with lifetime distribution F placed on a square grid of size n It fails if there is a square subgrid of side k with all its k components failed Let the set T i j 1 lt i j lt n k 1 and let A Aj denote the k x k subgrid with left lowermost component i j Az Gat j y 1 x y 1 k We fix a time T and define the indicators I all components in A are failed at time T for each y T setting W erly The random variable W counts the number of possibly overlapping k x k squares with all components failed in the system Clearly the reliability of the system is given by P W 0 and 4 EL g where q 1 F T We first apply the estimate CPA 1A as in Barbour Chryssaphinou and Roos 1996 to approximate the distribution of W by an appropriate compound Poisson distribution CP A 4 Our first step is to define the neighbourhoods of dependence S Ny and T for each y
10. Here we see the advantage of the improved compound Poisson estimates If we again take Tm 3logNm log p the estimate 2 30 can be computed to be of order Ym log Ym P W lt JEW 3 15 for some lt 1 and for c gt 1 the regenerative structure of the finite Markov chain can be used to show that the latter term is of smaller order Hence the improved estimate CPA 2A yields a much better asymptotic order for c gt 1 than does CPA PP by a factor of YS t For small enough values of L Conditions 2 18 and 2 20 are satisfied Condi tion 2 18 if L lt 1 2 allowing 2 27 to be applied using the bounds 2 19 and 2 23 and Condition 2 20 if L lt 1 5 in which case 2 27 can be applied with the bounds 2 21 30 Details and some numerical examples are given in Barbour Chryssaphinou and Vaggela tou 2000 Section 3 The Markov chain approach of Erhardsson 1999 see Section 3 6 can also be applied The error estimates obtained from his theorems are not quite so explicit as those derived from 2 27 but they are of comparable accuracy when EW is small and of similar order O Ym log w 71 in the asymptotic setting considered above Erhardsson 1997 also considers the number of appearances of words from a prescribed set of words AM 1 lt i lt l see also Chryssaphinou et al 2000 Reinert and Schbath 1998 consider the joint distribution of the numbers of copies of each of a finite set of words in a sequence
11. I Here we take S 6 Er Bq BAJ k k consisting of the k x k subgrids A 1 j Ai i j Aij 1 Ai j 1 We observe that there are n k 1 21 possible positions of the k x k subgrid A of which 4 are corners 4 n k 1 are borders and the remaining n k 1 are interior to I and that then S is equal to 2 3 and 4 respectively Next we take T yeET yNf for all 8 S U y so that 2 63 64 0 and assign the remaining k x k subgrids to N Since S lt 4 we find that S N lt 2k 1 1 Then W can be written in the form W W 2Z U 4 1 as in 2 6 where U Pes os Lag PEN I and W PeT Ty We now compute the value of 64 from 2 11 obtaining N EL n k 1 yer SCELE U Z S gt X BLEIg lt n k 1 2k 1 1 y yer vel BES UN gt BUZ D X E L I6 ye Ver perl k 1k 1 k 2 x lt n k 1 skye 4 Se rs 4S gh w s 1 r 1 s 1 where the complicated sums arise because of the differing overlaps possible between two k x k squares Next we compute the canonical parameters from 2 7 obtaining 1 Air Z E LI L Ig r YET BES 1 n k 1n k 1 3 9 r 2 Ds E ligl fig Li j 1 t 4 1 gga finn ri rTlyp 4r r 4 n k 1 ra r n k 1 r3 r forr 1 5 and P Aur where m lr P Ig r 1 1 P Bi 2 q r 1 for the corner indicators BES P Bi 3 q r 1
12. L I Xi r 41 lt lt X for EE E so that W gt gt l with E W n r 1 y and 4 1 r Then in order to achieve a declumping we define the indicator random variable J for the event a k clump occurs at the ith trial that is Lik I Xi r gt Xi r 1 Zi Nie rE GER gt Xi kl k gt 1 In this definition it is assumed that there is a doubly infinite sequence of random variables X 1 E Z at our disposal so that edge effects play no r le This simplifies the analysis but introduces an error in that we actually approximate the distribution of a new declumped random variable W Soe Dopo klik which is not quite the same as W however W differs from W only when in the infinite sequence either Xo lt X lt lt Xp or Xn r41 lt t lt Xn lt Xn41 and hence 2 2a PORAMI eat eT 2 The indicator I is dependent only on the random variables X _ Xj and is thus independent of all the indicators J for which 7 1 lt i rorjg r gt i k This observation leads us to define the neighbourhoods of dependence by Bik G 0 i lL r lt j lt itk r n 1 n x N 3 6 20 ensuring that b3 0 The quantities b and b3 given in 2 26 can then be bounded and the canonical parameters and p as given for CPA PP can be determined taking into account the conditions 2 20 and 2 18 this leads to the following bounds a If r gt 4 then A r 1 A QW dry CW
13. What is more it is shown in Barbour and Utev 1998 that there can be no general analogue of 2 16 in which the bounds decrease with A Placing some restrictions on u however better bounds can be obtained Under the condition ipi gt G mii i21 2 18 it follows that Barbour Chen and Loh 1992 HEY A min 4 1 l 2 J 0 AK lt NG Da A u 2p2 2 19 TY min og HEV A p lt 1 oS aaa te CT 2u2 b Alternatively if the condition 0 m m m lt 1 2 2 20 holds where m gt l u1 then it follows that 1 E Barbour and Xia 1999b these latter bounds being of exactly the same order in as 1 TV lt Hi u lt 2 21 those of 2 16 for the Poisson distribution Note that for the canonical parameters A and p y E X U V EX 2 22 yer yer 12 is a weighted average of the conditional expectations of the excess clump sizes U at y given the possible positive values of X if the X are indicators and the pairs X U are identically distributed then 0 E U X 1 Neither of the conditions 2 18 and 2 20 allows the approximating compound Pois son to be too far from a Poisson distribution indeed in the latter case it follows that m lt 3 2 and hence that u gt 1 2 Nonetheless there are many applications for exam ple in the area of scan statistics in which a Poisson approximation is a reasonab
14. been applied in Eichelsbacher and Roos 1999 Section 3 1 to give an error of order O ky exp axnw for some a gt 0 with an unspecified constant The term involving the exponential comes from the third term in 2 29 Finally Barbour and Xia 1999b examined the same problem in the special case where k 2 Using a much 19 less straightforward argument and a slightly different approximating compound Poisson distribution they obtained an explicit error estimate of order O y nw which is surprisingly even better than the best of the estimates above which in this case is of order Oy B Increasing sequences Consider a sequence X1 Xn of independent random variables with the same con tinuous distribution F and the events X i r 1 lt lt X of appearances of increasing sequences of length r for i r n Pittel 1981 and R v sz 1983 considered the limiting behaviour of random variables closely related to the number W of appearances of the above event in a sequence of n such trials we shall approximate its distribution by a compound Poisson distribution using the estimate CPA 1C This W could well be used as a test for local dependence in a supposedly random X sequence such as the innova tions in a supposedly GARCH 1 1 process being used to model a financial time series or in ARCH and ARMA processes as well as in quality control see Wolfowitz 1944 and Engle 1995 We first define the indicators
15. been placed on asymptotics in which m and n tend to infinity in such a way that both log m log n and A converge to finite non zero limits Using 3 20 and 3 21 we can make more precise statements about how well the distribution of W is then being approximated under less restrictive conditions provided that p is in the permitted ranges For m and n given set k kmn log mn log 1 p Emn 3 23 for any C Cmn gt 0 and define l hm n logm log mn lh l m n 1 h4 3 24 note also that then EW p Considering the elements of 3 21 we immediately have m n p m n7 p 3 25 and dirk lt FON nnp 3 26 where do log 1 y log 1 p Then defining 2 _ J log p ai Pe i ilm n eee a tS 12 3 27 we also have qi p pAb pet exp l log mn 1 6 i 1 2 which from the definition of l and because 1 6 l lt 1 in view of 3 27 and qi gt p implies that m q p lt m p n qo p lt np 3 28 34 Thus A is bounded uniformly in c gt 0 by the rather explicit formula 2 log mn log 1 p 2 log mn J ese par atm b CET rnp 8 29 which is small so long as 6 m n and 62 m n are sufficiently positive and c is not too large A is small if A is In asymptotic terms one would require that l i lI limsup l m n lt log p q limeap log 1 p eu since then i liminf 6 gt Een Se 1 gt 0 i 1 2 m n gt oo 7 log
16. constants for compound Poisson approximation Bernoulli to appear BARBOUR A D AND XIA A 1999b Poisson perturbations ESAIM P amp S 3 131 150 BARLOW R E HUNTER L C AND PROSCHAN F 1963 Optimum redundancy when components are subject to two kinds of failure J Soc Indust Appl Math 11 64 73 LE CaM L 1960 An approximation theorem for the Poisson binomial distribution Pacific J Math 10 1181 97 Cao M T Fu J C AND KOUTRAS M V 1995 A Survey of the Reliabil ity Studies of Consecutive k Out Of n F System and Related Systems IEEE Trans Reliability 44 120 127 CHEN J AND GLAZ J 1996 Two dimensional discrete scan statistics Stat Probab Letters 31 59 68 CHEN R W Hwanc F K AND Li W C W 1993 Consecutive 2 out of n F systems with node and link failures IEEE Trans Reliability 42 497 502 40 a oo oo CHRYSSAPHINOU O AND PAPASTAVRIDIS S 1988a A limit theorem for the num ber of non overlapping occurrences of a pattern in a sequence of independent trials J Appl Prob 25 428 431 CHRYSSAPHINOU O AND PAPASTAVRIDIS S 1988b A limit theorem for the number of overlapping appearances of a pattern in a sequence of independent trials Prob Theory Rel Fields 79 129 1438 CHRYSSAPHINOU O AND PAPASTAVRIDIS S 1990 Limit distribution for a consec utive k out of n F system Adv Appl Prob 22 491 49
17. define the random variable W which counts the overlapping appearances of the word A in the sequence Chryssaphinou and Papastavridis 1988b using generating functions and combinatorial arguments proved that the random variable W converges in distribution to a compound Poisson distribution under quite general conditions Barbour Holst and Janson 1992 examined the accuracy of Poisson approximation for L W taking into account the set of periods see Theorem 8 F In Example 10 4 2 of the same reference the case of small periods was examined using a compound Poisson approximation the accuracy of which is always much better Applying CPA PP together with combinatorial arguments based on the set of principal periods of A Chryssaphinou Papastavridis and Vaggelatou 1999 obtained an upper bound on the total variation distance between L W and an appropriate compound Poisson distribution but of the same accuracy as that obtained by Barbour Holst and Janson 1992 As we shall see later we can obtain sharper error estimates This model has been used to solve problems which arise in many areas In particular for a finite sequence of letters taken from the alphabet A C G T the above mentioned results have proved useful for determining critical values for test statistics in the analysis of DNA sequences For more literature see Arratia Goldstein and Gordon 1989 1990 Arratia Martin Reinert and Waterman 1996 Barbour Hols
18. of A and p is given in Erhardsson 1999 Definition 3 2 The simplest theorem is obtained when regeneration is defined in terms of visits to an atom So such that So gt 0 and So N S1 In the example of success runs the choice So 0 is appropriate for the occurrence of a word any singleton of the form a 0 could be used or one could take So Uaea a 0 for any collection A C A such that m a is the same for all a A Let Ts and Ts denote the first times that X hits So and S T the first time that the reversed chain hits S Then Erhardsson uses CPA 1A combined with a coupling argument to prove that dry W CP A 3 41 lt 2HY A p ny Enis Tso 78 7 So E Tso 2Pr its lt Tso Good bounds for HV A u are currently only known under either of Conditions 2 18 and 2 20 In the examples of word counts and success runs when Condition 2 20 holds the results obtained for total variation approximation are nonetheless of the best asymptotic order generally known Erhardsson 1999 shows that 2 18 holds whenever v ess sup P Ts lt Ts lt 3 2v ess inf P Ts lt Ts 7 res reSy 38 and that then A p04 22 gt nyp 1 Sza 4P uis Ts lt TSo enabling HTV A p to be effectively bounded using 2 19 in these circumstances More particularly if S is an atom as is the case in the examples of success runs and occur rences of a word then u 1 p
19. of further clumps occurring in the immediate neighbourhood of an index pair 7 1 conditional on 1 Finally b has a similar interpretation as EEN times a weighted average but now of the unconditional expected number of clumps with index pairs 3 1 7 1 U B y L this is a measure of the extent of the neighbourhoods and should also be small if Poisson process approximation is to be accurate Thus the choice of the sets B y 1 is critical to the success of the approach With these definitions the following compound Poisson estimate derived by way of a Poisson process approximation can be proved as in Arratia Goldstein and Gor don 1990 Section 4 2 1 with improved coefficients of b and bz from Barbour Holst and Janson 1992 Theorem 10 A CPA PP IFW gt ep iii Hy is as defined above and A Z ep i gt Elyi EN and u Xs yer EZ l gt 1 then dry L W CP A LL lt by bs bz 2 4 By choosing the sets B y l carefully very good results can be obtained as long as A is not too large There are two drawbacks to the point process approach First the identification of a unique representative for each clump declumping is rarely natural and can pose difficulties Secondly if A is large the error estimates derived in this way are frequently far from accurate because the error involved in point process approximation in total variation is often much larger than that for compound Poiss
20. of n letters Their approximations are expressed in terms of independent compound Poisson distributions and are valid only under a hy pothesis which restricts the possible overlapping of clumps of a single word with clumps of another They do so by applying CPA PP This is an example of the generality and usefulness of the point process approach as yet there is no bivariate analogue of the direct compound Poisson approach to use as an alternative 3 5 Sequence matching Let 1 m and 71 1 be two independent sequences of independently chosen letters from a finite alphabet A the chosen according to a distribution and the n according to v Fix k and set Tig NE ny Sin Nj4as itk Nj k 1l so that m k 1n k4 1 i 1 j l counts the number of times that pairs of matching strings of length k can be found in the two sequences In molecular sequence applications an observed value of W higher than that expected according to the above model would indicate evolutionary relationship between the two sequences Previous work Arratia Goldstein and Gordon 1989 Novak 1994 Neuhauser 1996 has largely concentrated on approximating P W 0 which by then varying k translates into a statement about the length of the longest matching run with this in mind the strategy is typically to replace W by a random variable which counts distinct clumps of k runs and to approximate its distribution by a Poisson random var
21. system are that it has higher reliability than a series system but is less expensive to build than a parallel system It has applications in telecommunication systems oil pipelines vacuum systems computer ring networks spacecraft relay stations and many other engineering systems The reliability of this system the probability that it has not failed has been exactly determined but the explicit formula is quite complicated especially if n and k are large For this reason upper and lower bounds for the reliability have been derived In this context the Stein Chen method for Poisson approximation has proved a power ful tool The approach is as follows For any fixed time T we associate independent Bernoulli Be p random variables Jj Jn with the components where p 1 F T J takes the value 1 0 if the ith component works fails i 1 n We then define the random variable W oo I where I Wed J takes the value 1 if all the components i i 1 7 1 fail and 0 in all other cases Clearly the C k n F system fails if and only if W gt 0 Although the components themselves work independently the indicators J are locally dependent tending to occur in clusters Nonetheless the ran dom variable W is reasonably well approximated by a Poisson distribution Po A with the Stein Chen local approach Barbour Holst and Janson 1992 8 4 2 Chryssaphi nou and Papastavridis 1990 Godbole 1991 and Roos
22. the Poisson local bounds Barbour Holst and Janson 1 28 if the X are 0 1 valued and S Two points should be 9 noted First the two weighted averages are now multiplied by the expectation yer EX of W and not by the expected number A of clumps Secondly in the term IPX 1 X E X Z EW X w X a E Z X y yer ver I gt 1 Y where the weights wy are defined by w EX EW the average is over conditional expectations of Z given the value of X and does not include any contribution from the strongly dependent U the effect of the U is already accounted for in the definition of the canonical parameters 2 7 Each of the quantities 6 2 and 63 is a measure of the effect of any long range depen dence on the joint distribution of X U and can be contrasted with b3 in 2 3 In d2 and 63 it is expressed in terms of the effect on the distribution of the distant W exer cised by the value of X U measured either in terms of total variation or Wasserstein distance In the dependence is rewritten in terms of the effect on the distribution of X U exercised by W All three can be viewed as weighted averages of measures of de pendence at a distance multiplied by EW For independent X one can take Z U 0 for which choice 6 62 63 0 and 64 reduces to yo UEX Remark The distances d L W X U gt W appearing in 2 9 and 2 10 are often bounded by constructing random var
23. 1 M nsson M 1999 On compound Poisson approximation for sequence matching Preprint MICHEL R 1988 An improved error bound for the compound Poisson approx imation of a nearly homogeneous portfolio ASTIN Bulletin 17 165 169 VON MISES R 1921 Das Problem der Iterationen Z Angew Math Mech 1 298 307 NEUHAUSER C 1994 A Poisson approximation for sequence comparisons with in sertions and deletions Ann Statist 22 1603 29 NEUHAUSER C 1996 A phase transition for the distribution of matching blocks Comb Prob Computing 5 139 59 42 Novak S 1994 Poisson approximation for the number of long match patterns in random sequences Theor Prob Appl 39 593 603 PITTEL B G 1981 Limiting behavior of a process of runs Ann Prob 9 119 129 PRESTON C J 1973 Generalized Gibbs states and Markov random fields Adv Appl Probab 5 242 261 REINERT G AND SCHBATH S 1998 Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains J Comp Biol 5 223 253 REVESZ P 1983 Three problems on the lengths of increasing runs Stoch Proc Appl 15 169 179 Roos M 1993 Stein Chen method for compound Poisson approximation Ph D Thesis Universitat Z rich Roos M 1994 Stein s method for compound Poisson approximation the local approach Ann Appl Prob 4 1177 1187 Ross S 1979 M
24. 1 p and then ensure that c was at most some suitably small multiple of logn corresponding to a growth in IEW of order at most mn for some small 5 gt 0 Previous asymptotics have mostly assumed that IEW remains fixed so that this last condition was automatically satisfied As already observed in Neuhauser 1996 the condition 3 30 is stronger than is actually necessary for the approximation to be accurate asymptotically To see why this is write J in the form Iij k 3 31 kESk where fx zal ss C e a niks I Pe kal acA aCA and where Kija Hl O lt L lt k 1 ipi Mjaa AEA Then it is possible for E Z Ii n 1 to be very large for values of k for which EJ is small with a significant contribution to E I Zi j resulting In such circumstances the element m q p n q2 p in A may be unduly pessimistic This element derives from the inequality lol Zig Wiz 90 Wig ag Vig 1 1 lt Ag 4ijf51 Uig 1 8 82 used in deriving the basic compound Poisson estimates which may give a larger result than the alternative inequality lal Zig Wij gU Wig ag Vig F 1 lt 2log IZ 2 tU 4 1 35 if Zi is likely to be large when J 1 So instead using 3 31 we can use both inequalities to arrive at the bound Tiyg W So IIU 1 Ug Wy I gt 1 3 33 lt Agll X lige Zig 2llall XD Ligne keB k B where B C zl can be chosen at will we take B
25. 1993 and the argument works equally well for more general situations such as when the failure probabilities may differ A Day r q where q 1 p provided that p is small This has been established using from component to component However the probability P W 0 is more accurately estimated by using a Poisson approximation for the distribution of the number of clusters or alternatively a compound Poisson approximation to the distribution of W Some more complicated problems which arise in the design of reliable electronic de vices and in pattern detection inspired Salvia and Lasher 1990 to consider a planar version of the above system They introduce the Two dimensional Consecutive k out of n F C k n F system which consists of n x n components placed on a square grid it fails if there is a square subgrid of size at least k x k with all its components failed The exact reliability of this system is not known and thus approximations to it are essential Salvia and Lasher 1990 obtained bounds for the reliability of the system by relating it to certain C k n F systems Koutras Papadopoulos and Papastavridis 1993 proposed bounds using results of Arratia Goldstein and Gordon 1989 1990 using the Stein Chen method for Poisson approximation With the same assumptions about the operating behaviour of the components as for the C k n F system we define the random variable W pans I where I I all 4 com
26. 2 20 If Condition 2 18 holds with fy gt 2u2 and m lt oo then in Cases a and b the bound 2 19 leads to error estimates which are asymptotically slightly larger the order being in both cases multiplied by a factor of log np n again Case c is impossible If neither of Conditions 2 18 and 2 20 are satisfied the error estimate derived from 2 24 using 2 17 becomes rapidly worse as n increases and is useless if m oo Comparison with the Poisson process estimate If a declumping has been achieved one can also use it in conjunction with CPA 1A and 1B If W D7 ep X gt Hyo as in 2 1 13 with 375 Ly 0 1 for each y decompose W as in 2 6 but with I T x N in place of T taking Xy Uy Uy 0 2 jla35 Wya SS j1g5 2 25 Sor si Bj B y 1 where B y 1 is as for CPA PP The canonical parameters and u defined using 2 7 are exactly as for CPA PP and we can take co 6 b3 and c1 64 b b3 in CPA 1A and 1B where i 5S0 JEE b X SO JIE 7 DEPXN 8 j EB 7 1 aa et iy 2 26 b gt IEJE I Ely olez 6 3 By H ETX N This gives the following estimate CPA 1C In the setting of CPA PP for any choices and u we have dr L W CP X p lt hH e H dry L W CP X p lt eh HEY e ETY 2 27 where H HE X pu and Hj HEV XN py for L 0 1 where bj b and b3 are as defined in 2 26 and where as in 2 14 Eh b3
27. 3 CHRYSSAPHINOU O PAPASTAVRIDIS S AND VAGGELATOU E 1999 On the number of appearances of a word in a sequence of i i d trials Meth Comp Appl Probab 1 329 348 CHRYSSAPHINOU O PAPASTAVRIDIS S AND VAGGELATOU E 2000 Poisson Ap proximation for the non overlapping appearances of several words in Markov chains Comb Prob Computing to appear CHRYSSAPHINOU O AND VAGGELATOU E 1999a Compound Poisson approxima tion for the numbers of long increasing sequences J Appl Probab to appear CHRYSSAPHINOU O AND VAGGELATOU E 1999b Compound Poisson approxima tion for multiple runs in a Markov chain Ann Inst Statist Math to appear EICHELSBACHER P AND Roos M 1999 Compound Poisson approximation for dissociated random variables via Stein s method Comb Prob Computing 8 335 346 ENGLE R 1995 ARCH Oxford University Press ERHARDSSON T 1997 Compound Poisson approximation for Markov chains Ph D Thesis Royal Institute of Technology Stockholm ERHARDSSON T 1999 Compound Poisson approximation for Markov chains using Stein s method Ann Appl Probab 27 565 596 Fu J C AND Koutras M V 1994 Poisson approximations for 2 dimensional patterns Ann Inst Stat Math 46 179 192 GESKE M X GODBOLE A P SCHAFFNER A A SKOLNICK A M AND WALL STROM G L 1995 Compound Poisson approximation for word patterns under Markovian hypotheses J App
28. Compound Poisson approximation a user s guide A D Barbour and O Chryssaphinou Universitat Zurich and University of Athens Abstract Compound Poisson approximation is a useful tool in a variety of applications including insurance mathematics reliability theory and molecular sequence analysis In this paper we review the ways in which Stein s method can currently be used to derive bounds on the error in such approximations The theoretical basis for the construction of error bounds is systematically discussed and a number of specific examples are used for illustration We give no numerical comparisons in this paper contenting ourselves with references to the literature where compound Poisson approximations derived using Stein s method are shown frequently to improve upon bounds obtained from problem specific ad hoc methods Keywords Compound Poisson Stein s method Total Variation Distance Kolmogorov Distance AMS 1991 Subject Classifications 60 02 60C05 60E15 60F05 60K10 62E17 92D20 1 Partially supported by Schweizer Nationalfonds Grant 20 50686 97 2 Partially supported by University of Athens Research Grant 70 4 2558 1 1 Motivation Many probability models Aldous 1989 involve rare isolated and weakly dependent clumps of interesting occurences A typical example is that of clusters of extreme events such as earthquakes of magnitude exceeding 4 0 when one event occurs several more may follow in
29. acA Swapping the r les of r and s in the first two cases this finally gives the bound E I j Zij lt 2k may ngz 2k m n p 4k py4 3 19 Hence in particular adding over the mn possible pairs i j and using the improved bounds 2 23 and 2 21 we deduce from CPA 1A i and 3 16 that dx L W CP A u lt Ai A2 p lt l 3 lt pp ee ae 3 20 dry L W CP A u lt ies Bp F Ao p lt 1 5 where Ay 2k m qi p n qo p 3 m n p 2kf Ag k m n p 3 21 In the case when p gt 1 5 as would be the case in the important application to the four letter DNA alphabet with almost uniform distributions over the letters the total variation estimate above cannot be used However it is once more possible to apply CPA 2A to get error estimates of almost the same asymptotic order The only essential 33 difficulty lies in showing that the tail probability in 2 29 is small Here one can apply the exponential lower tail bounds of Janson 1998 Theorem 10 which extend Suen s 1990 inequality to cover the event appearing in 2 29 This leads to the result that whatever the value of p lt 1 one has dry L W CP A O A e EW 3 22 uniformly in A lt 1 for some a gt 0 This makes for very good asymptotics whenever EW at all fast when this is not the case the error estimate derived from CPA PP is usually close to best possible Most emphasis has previously
30. at o 0 in CPA 14A i The canonical parameters for compound Poisson approximation are essentially the same as those for success runs but with the new definition of p and with n replaced by mn mnypt t 1 p ifi 1 k 1 Aui itmny 2pt 1 p 2k i 2 p 1 p ifi k 2k 2 2k 1 Imnyp t if i 2k 1 2k 1 ASN Aie i 1 32 As before Condition 2 18 is satisfied if p lt 1 3 and Condition 2 20 is satisfied with 1 26 1 p 1 5p if p lt 1 5 once again a P lya Aeppli PA mnz 1 p p approximation would contribute at most an extra 4ky 1 p 1 5p to the total variation error estimate given below To compute 1 of CPA 1A i it is immediate that and all that remains is to bound E J Z In order to express the result we define three further quantities q T2 Va q2 CoV max Ya where Yq Dp 10eVe 3 18 ae aCA acA noting that py gt qi gt p i 1 2 We then observe that there are at most 2kn pairs r s Nj such that r i lt k 1 and s j gt k for each of which E I Irs lt q5 that there are at most 2kn pairs r s Nj such that r i gt k and s j gt k for each of which E I I s p and that there are at most 4k pairs r s Nj such that jr i lt k 1 and s j lt k 1 for each of which E I Irs lt max i CalOaVa rloare t lt py acA
31. ce between the distribution of the aggregated claim amount and an appropriate compound Poisson distribution is no greater than the total variation distance A lt L p Bar bour and Hall 1984 between the distribution of the total number of claims N and Po A If the occurrences of claims are weakly dependent but the claim amounts are still indepen dent and identically distributed Goovaerts and Dhaene 1996 have noted that Michel s observation can still be applied and that the new value of A which will usually be larger than 57 p because of the dependence can be estimated using the Stein Chen method Barbour Holst and Janson 1992 In many insurance applications however there may be strong local dependence be tween claim occurrences For instance in storm damage insurance the rare single oc currence of a tornado in a particular town may lead to some random number of claims on the portfolio Since the distribution of this number of claims may well depend on the time of year the preceding argument even assuming independent and identically distributed individual claim amounts cannot be applied because the aggregate claim amounts result ing from occurrences of tornadoes are no longer independent and identically distributed Despite this it still seems reasonable to suppose that the distribution of the total number of claims is close to a compound Poisson distribution in total variation in which case if once again
32. ct on X4 U is negligible and N the remainder This procedure is the analogue of the local approach to Poisson approximation Barbour Holst and Janson 1992 pp 9 10 which is recovered for 0 1 valued X by taking S and hence U 0 Define the parameters and p of the canonical approximating compound Poisson distribution as follows Canonical parameters 1 Mu F Ds PU Sth shee a oA EG cera T o pU gt l gt 1 2 7 Note that if the X are 0 1 valued and S then u 1 and all other u are zero and EW all consistent with the simple Poisson approximation Then setting my Ta Me Xe j U ki Ming j 1 k gt 0 where mi EX define the four Glee quantities which appear in the error estimates and which should be small for the estimates to be good r j Uy k W Em DD Pe Ph 28 yer j gt 1k gt 0 62 2 X E X dry L W Xy Uy L W 2 9 yer 0 E X dw L W Xy U L Wy 2 10 yer a gt E X Z BX E X U Z 2 11 yer In 2 10 the distance dy is the Wasserstein L metric on probability measures over Z4 ftar fra where Lip f f r s lt r s r s Z4 w P Q sup fELip The quantities 6 1 lt l lt 4 can be interpreted as follows To start with 64 is an analogue of b b2 in 2 2 combining the effects of local dependence and neighbourhood size and it reduces to the corresponding element in
33. dependently subject to failure with probabilities q 1 lt i lt n The system has a collection of s vertex sensitive subsets often minimal cut sets in the underlying graph and the system fails if there are at least m such subsets with all vertices failed The Two Dimensional Consecutive k out of n F system is a particular example of a connected k system Let I denote the set of all sensitive sets and y kj k its typical element where ki ks are the indices of the s vertices Set I all vertices in the set y are failed and W Dover I then the reliability of the system is equal to P W lt m 1 and EW gt yer Ile ydi To define neighbourhoods of dependence we first specify the minimum number R of items in common to a and y when a S4 The possible choices of R are 1 2 s 1 Note that the choice of R s leads to the Poisson local approach 24 For 1 lt R lt s 1 we define the sets S R aeT f y lany r r R s 1 N R 8 8 Na gt 1 for some a 7 U S R I 7 U S P fs TAR T aurea O THRE With the above definitions and after some computations CPA 1A implies that dry L W CP A R w R lt HIY A R U R Q R dmaa EW 3 11 Imax where Q R 1 max S R 2 max N R yer yer and dmaz is the largest failure probability of an individual item Thus for equal q s if I is large and q small but EW I w is o
34. f order 1 where p qF compound Poisson approximation is reasonable with R 1 if Q 1 is much less than T If instead EW is large the approximation CPA 2A can be used to show that the error in compound Poisson approximation is of order WQ 1 P W lt EW for suitable lt 1 and Janson s 1990 inequality can be used to bound the latter probability However if the structure of S 1 is complicated it may be difficult to compute the canonical parameters either numerically or by using computer algebra In such cases a larger value of R and correspondingly smaller S R can be chosen which will however give an error estimate of higher order we adopted this strategy in the C k n F example taking R k k Finally we note that the Poisson estimate is typically poorer than the compound Poisson estimates but that Poisson approximation may be much easier to achieve The above general approach is illustrated in Barbour Chryssaphinou and Roos 1996 where the reliability of a model called The double pipeline is approximated by an ap propriate compound Poisson using CPA 1A 3 3 Scan statistics A Two dimensional scan statistics As an example of the application of compound Poisson approximation to scan statis tics we take the two dimensional discrete scan statistic which was applied by Glaz 1996 to the problem of detecting minefields see also Chen and Glaz 1996 Other applications are to be found in Glaz et a
35. for the border indicators T r P X Is r 1 1 BES m3 r P Ig r 1 1 P Bi 4 4 r 1 forthe interior indicators BES y 22 Hence applying the approximation CPA 1A we have dry L W CP A p k 1k 1 k 2 lt HIV A p n k 10 Ga 12k 3 p 4 gt DUEDD D s 1 r 1 s 1 3 10 In reliability applications it is usual to suppose that lt Am n k 1 is small and the reliability high in which case the bound HIY A p lt e from 2 17 is adequate In the above example the computation of A and p is not very complicated since we have only to calculate 5 terms and the resulting error estimate is uniformly of order O q 7 in A lt 1 This provides an improvement on the order O q obtainable by us ing the Poisson local approach by the factor O q Bigger improvements still could be obtained by expanding the set S but at the cost of more complicated calculations needed to determine A and u The same approach is still valid for the case of unequal failure probabilities when computer algebra can be used to evaluate the canonical pa rameters From tables see Barbour Chryssaphinou and Roos 1996 one can see that these error estimates are mostly comparable to or better than those presented by Fu and Koutras 1994 though for k 2 the Stein Chen method is not good in either the Poisson or the compound Poisson approach For larger relevant for reliability calculations
36. i able Here as also in Mansson 1999 we use compound Poisson approximation to treat the whole distribution of W and to provide rather explicit estimates for the accuracy of the approximations obtained Our approach is based on that of Mansson 1999 and also uses some refinements from Neuhauser 1996 31 In order to simplify the canonical parameters 2 7 of the approximating compound Poisson distribution we work instead with the random variable W Tijs i 1 j 1 derived from the and 7 sequences by using the torus convention 4m i Nj n j for all i j Z Since m n m n k 1 o lt w W S X I X DO In i 1 j n k 2 i m k 2 j l it is immediate that dry L W L W lt m n k 1 k 1 y 3 16 where Y p and we assume that 0 lt p acs TaVa lt 1 The random variable W has expectation EW mnp and we are typically interested in values of k less than say 2log mn log 1 p so that IEW is not extremely small In order to construct a suitable decomposition of the form W 1 Ui Zi Wi as in 2 6 we note that the indicators most strongly dependent on J are those of the form Iii j 1 with lt k 1 so we take Uy gt p liyi j 1 lt l lt k 1 and Lig E X Ti 3 Lij EI Uij r s ENij where Nij r 8 min r il s j lt 2 k 1 This yields W I U Zij Wij in such a way that W is independent of the pair Lj Uij so th
37. iables wy and Wit on the same probability space for each gt 1 and gt 0 with LW L W X Uy i 1 and LOM L W in such a way that wo and wee are close for instance so that they coincide with high probability In practice it is often easier to make a coupling of L W Xz Uy Yp i l y and L W where Y summarizes additional information for example the exact knowledge of Ig 3 S4 rather than just the value of U This causes no extra difficulty since it is always the case that E X d L W Xy U W5 lt E X d L W Xy Uy Y3 W In terms of these quantities the following estimate can be established see Roos 1994 Barbour and Utev 1999 Theorem 1 9 Barbour 1999 Equations 5 13 and 5 14 CPA 1A There exist constants HK HR mw lt Af APY Am 1 0 1 which do not involve W in any way such that if A and u are the canonical parameters then dx L W CP A u lt coHo tearHy drv L W CP A m lt coo H 2 12 10 in either bound one can take i o min 1 62 and 1 64 or ii o 0 and 1 63464 In general when evaluating 62 and 63 it is often possible to compute the distances between distributions by means of couplings Variant ii when applied with Z 0 gives the analogue of the Poisson coupling estimate variant i leads to the analogue of the Poisson local estimate Barbour Holst and Janson 1992 Theorems 1 B and 1 A
38. if the system tolerates up to a large number m of failed k x k squares before it collapses note that Alm m Soei 1 ur lt 4E Am r 1 so that 0 lt 4q Thus from 2 21 we can take n k 1 H7 A p 1 8 1 pro vided that 8q lt 1 yielding an error estimate which is still of asymptotic order O q If 8q gt 1 and is large a larger neighbourhood S4 is needed if accurate approximation is to be achieved together with the asymptotically sharper estimate of CPA 2A For the latter a bound on P W lt EW for lt 1 is also required Janson s 1990 inequality shows for instance that here P W lt JEW lt exp aEW 1 4k2q for some a a gt 0 see Eichelsbacher and Roos 1999 Section 3 3 B Multiple failure mode systems Reliability systems which are subject to more than one type of failure are also of interest see Barlow et al 1963 Ross 1979 Satoh et al 1993 and Koutras 1997 The 23 system that we discuss here is also a generalization of consecutive k out of n F system C k n F but in a different direction from that of C k n F We consider a system consisting of n linearly arranged components in each of which any one of r defects may be present The system fails if for any 1 lt i lt r at least k consecutive components have the defect of type i We denote such a system by C kj k n F With the n components we associate the sequence
39. l 1994 and in Barbour and Mansson 1999 A two dimensional rectangular region R 0 L x 0 L2 is inspected for the occurrence of certain events Fix n1 na gt 1 and divide R into n na subregions Jy jlo h 1 hi lihi x l 1 h2 leh 5 1 lt li lt Ni 25 each of size hy x ho where h L nj set T i 7 1 lt i lt n k 11l lt j lt ng ka 1 For y ET let the random variable Y count the number of events that occur in the subregion J Set By h l2 i lt lh lt ith 1 j lt l lt j k2 1 and define the random variable Dy y Ye vel BEB which counts the total number of events in the rectangular subregion B of R which consists of kjk2 adjacent smaller subregions If S exceeds a level m then we will say that m events are clustered within the region Finally the two dimensional discrete scan statistic is defined by Sk max S y Er k k k2 It is of interest to test the null hypotheses of randomness under which it is assumed that the Yg are independent distributed according to a binomial distribution with parameters N and po versus the alternative that they are distributed according to a binomial with parameters N and p with p gt po In order to deal with the testing procedure an accurate approximation for the distri bution P Sk gt m is necessary A compound Poisson approximation is used for this To derive one we associate the random variable Sk with the random variable W
40. l Prob 32 877 892 GLAZ J 1996 Discrete scan statistics with applications to minefield detection In Proc Conf SPIE 2765 Orlando FL 420 429 Al GLAZ J AND BALAKRISHNAN N 1999 Scan statistics and applications Birk hauser Boston GLAZ J Naus J Roos M AND WALLENSTEIN S 1994 Poisson approxima tions for the distribution and moments of ordered m spacings J Appl Prob 31 271 281 GODBOLE A 1991 Poisson approximations for runs and patterns of rare events Adv Appl Prob 23 851 865 GODBOLE A P SCHAFFNER A A 1993 Improved Poisson approximations for word patterns Adv Appl Prob 25 334 347 GOOVAERTS M J AND DHAENE J 1996 The compound Poisson approximation for a portfolio of dependent risks Insurance Math Econ 18 81 85 HUFFER F AND LIN C T 1997 Approximating the distribution of the scan statis tic using moments of the number of clumps J Amer Stat Soc 92 1466 1475 JANSON S 1990 Poisson approximation for large deviations Rand Struct Alg 1 221 229 JANSON S 1998 New versions of Suen s correlation inequality Rand Struct Alg 13 467 483 Koutras M V 1997 Consecutive k r out of n DFM systems Microelectron Reliab 37 597 603 Koutras M V PAPADOPOULOS G K AND PAPASTAVRIDIS S G 1993 Reli ability of 2 Dimensional Consecutive k out of n F Systems IEEE Trans Reliability 44 658 66
41. le but crude first approximation and approximation by a compound Poisson distribution which is not too far from the Poisson can be very much better In such cases the bounds 2 19 and 2 21 combined with the error estimates given in CPA 1A and CPA 1B can prove ex tremely effective see for example 3 4 3 7 and 3 20 below For Kolmogorov distance sharper bounds under Condition 2 18 are also available Barbour and Xia 1999a 2 1 1 K lt min 4 1 4 HE lt min 2 2 Ho Asu lt min 1 Am min 5 5 3 If neither 2 18 nor 2 20 holds there is as yet no simple fix though the theorems in Section 2 3 frequently make it possible to obtain approximation errors of best asymptotic order albeit with unpalatable constants Example A continued Define I i j 1 lt i lt n j gt 1 and use the decomposition 2 6 with Uj Re Xi and Zi 0 Then the pair X U is independent of Wij so that 61 9 63 E0 0 and 4 a J p EY o pi and m Di piut are as before Then the direct estimate CPA 1A gives dry L W CP A u lt HIY A u X pj IBY 2 24 i 1 If Condition 2 20 holds then in Case a the bound 2 21 implies an error estimate of 1 20 m p n of the correct asymptotic order in Case b the error estimate is less than 1 20 myp n 4 n 1 mip n gt again of the correct asymptotic order Case c is incompatible with Condition
42. milarity between parts of sequences can be used as evidence for relationship It then becomes important to be able to determine what measure of similarity can be interpreted as unusual The simplest model is to suppose that two finite sequences x and y of length n from the same finite alphabet A A 4 for DNA A 20 for amino acid sequences are to be compared which on the null hypothesis of no relation are each independently composed of letters drawn independently from A with the same probability distribution v A measure of relatedness might be the count W lt 1 Ij where t j 1 lij Hizi yjtil 1 0 the number of pairs of subsequences of length m 1 from the two sequences which exactly match where m is to be suitably chosen The indicators J and Ij are independent unless either i i lt m or j j lt m so that dependence between the J is in general weak however the conditional probability that 1 41 1 given that I 1 is typically substantial at least 1 A so that matching pairs tend to occur in local clusters Hence a compound Poisson distribution is a suitable candidate for approximating the distribution of W There are of course many generalizations of the model the most important for the purposes of practical algorithms being to allow some degree of mismatching in the pairs of interest through insertions and deletions of sequence segments and through the replacement of one amin
43. nite sequence W X liy yer l gt 1 An l clump consists of exactly l overlapping occurrences of A whose union does not overlap any preceding or subsequent occurrence of A Only the sequence is observable in practice and the definition of may involve for indices i 1 2 n so that we are actually interested in the random variable W rather than W However W 4 W only if a copy of A in the infinite sequence overlaps one of the ends of the interval 1 2 n so that dry L W L W lt P W ZA W lt 2 k 1 y 3 12 where k 1 y EL 27 A 7 a1 I II a ai41 i 1 The bound 3 12 of order ky is no larger than other terms in the later estimates Fur thermore p EL 1 LLY where L D Mei aia 3 13 pEP A i 1 so that EW n k 14 EW In order to apply CPA PP to W we begin by defining the neighbourhoods B y 1 In doing so we introduce an integer r which determines their size and which can then be chosen to minimize the estimates obtained Denote by Z y l the set which contains the positions of the letters defining the random variable Z We say that the indices 7 1 and 3 7 are not neighbours if the respective Z 7 and Z 7 are separated by at least r positions r gt 0 Then the neighbourhood B y 1 is given by Bet 8 j 4 2 k r lt 8B y lt 2 k r ar xN With the above neighbourhoods computing the quantities 2 2 one obtains Schbath
44. o acid by another similar one see Neuhauser 1994 1996 5 2 Compound Poisson approximation 2 1 The Poisson process approach A first approach to compound Poisson approximation for sums of dependent indicators is to proceed by way of Poisson point process approximation This is a very natural idea in the context of an underlying process consisting of rare isolated and weakly dependent clumps of events In such a system the locations of the clumps when suitably defined occur more or less as a Poisson process on the index set I and if the sizes of the clumps are added to I as an extra index dimension then the process of clump locations and sizes on I x IN is also almost a Poisson process The typical strategy is to assign a location to each clump by using exactly one of the indices y I as the representative of each clump and to replace W yer X by a sum way dihi ST yer l gt 1 where I now denotes the indicator of the event that y is the index of the representative of a clump of size l thus for each clump exactly one of the I takes the value 1 and no index y is representative of more than one clump Poisson process approximation in total variation to the point process yer V1 I 6 where 6 denotes the point mass at l is then accomplished by using Stein s method for Poisson process approximation and compound Poisson approximation in total variation to the random variable W er X gt Hy with exactly the
45. of random variables X X1 Xn taking values in A 0 1 r assuming it to be generated by a stationary Markov chain The state 0 denotes a correctly functioning component while i denotes a component defect of type i Such a model can arise as the equilibrium distribution of a reversible nearest neighbour birth and death process in which a correctly functioning component can become defective or a defective component return to normal working for example after repair with suitably chosen rates depending only on the states of the immediately neighbouring components see for example Preston 1973 In Chryssaphinou and Vaggelatou 1999b this model is successfully analyzed by using CPA 1C and a declumping The techniques required are more complicated than those used above because of the Markov dependence and are similar in spirit to those that we use in Section 3 4 when counting the occurrences of copies of a word in a string of letters The error bounds that they obtain are of order O log 1 w maxi lt i lt r ki where Yy X Yi and Y P X X2 Xx and where the constants implicit in the order depend on the transition matrix of the underlying Markov chain C Connected s Systems Many reliability systems can be represented as graphs G V E with V a set of ver tices machines and a set of edges connections between them see Chen Hwang and Li 1993 Here we suppose that the n vertices are in
46. on approximation Handbook of Statistics to appear BARBOUR A D CHEN L H Y AND Lon W L 1992 Compound Poisson approximation for non negative random variables via Stein s method Ann Probab 20 1843 66 39 BARBOUR A D CHRYSSAPHINOU O Roos M 1995 Compound Poisson Ap proximation in Reliability Theory IEEE Trans Reliability 44 398 402 BARBOUR A D CHRYSSAPHINOU O Roos M 1996 Compound Poisson ap proximation in systems reliability Naval Res Logistics 43 251 264 BARBOUR A D CHRYSSAPHINOU O AND VAGGELATOU E 2000 Applications of compound Poisson approximation Statistical Distribution Theory and Inference Eds N Balakrishnan M V Koutras and C Charalambides CRC Press to appear BARBOUR A D AND HALL P 1984 On the rate of Poisson convergence Math Proc Cam Phil Soc 95 473 80 BARBOUR A D HOLST L AND JANSON S 1992 Poisson approximation Ox ford University Press BARBOUR A D AND MANsson M 1999 Compound Poisson approximation and the clustering of random points Adv Appl Probab to appear BARBOUR A D AND UTEV S 1998 Solving the Stein Equation in compound Poisson approximation Adv Appl Probab 30 449 475 BARBOUR A D AND UTEV S 1999 Compound Poisson approximation in total variation Stoch Procs Applics 82 89 125 BARBOUR A D AND XIA A 1999a Estimating Stein s
47. on approximation to L W However the point process approach still provides flexible useful and explicit error estimates for compound Poisson approximation Example A Let Xi LI Y gt j 1 lt i lt n j gt 1 be a double array of indicators in which the J Be p are independent and the Y wp are independent of each other and of the J Set W e SE LY i 1 i 1 j gt 1 Declumping is easily achieved by using representatives y 1 2 n x 1 denoted for short by i and then defining Ia L I Y l for each i l Then 57 p and py 7 ATEN a pil and taking B i 1 i x IN for each i it also follows that b Xi p and by b3 0 The Poisson process estimate CPA PP thus immediately implies that drv L W CP A 4 lt X p3 2 5 i 1 To illustrate the implications of 2 5 let p n be chosen in such a way that p n 0 and np n co as n ov and consider three choices of the p ps and p pr Case a Suppose that ps p n and pw p for all i Then 2 5 gives a total variation error bound of np n for approximation by CP np n u however the true error is actually much less being at most p n Case b Suppose that the ps and yw are as in Case a for 2 lt i lt n and that p is such that u gt 0 suppose also that ps 5 and that wC 6 for all n Then 2 5 gives an error estimate of at least 1 4 for approximation b
48. on is essential For fixed aperiodic p having a finite exponential moment the coefficients of o and are of exactly the same satisfactory A order as in the Poisson case 2 16 and provided that is reasonably large the third term in 2 29 may be relatively unimportant This makes for excellent asymptotic orders in the error estimates The disadvantage is that the expressions for Si u given in Barbour and Utev 1999 while explicit are very complicated and can be expected to be quite large in practical applications thus error estimates of the correct asymptotic order may be obtained at the cost of unreasonably large constant factors In the same way CPA 2B frequently improves upon CPA PP in the asymptotic order of the error estimate at the expense of introducing unwieldy constant factors 15 Example A continued Suppose that gt j gt 1 ujz has radius of convergence greater than 1 and that u is aperiodic Take and p to be the canonical parameters In Cases a and b the S u are bounded uniformly in n and y is bounded away from 1 Hence from 2 29 error estimates of order O p n exp np n a for some a gt 0 are obtained with Bernstein s inequality being used to derive the exponential bound for the third term in 2 29 This is of the ideal order O p n except when np n oo very slowly with n At first sight it appears that the same error estimate should also follow in Case c cont
49. p for j gt 1 where p Ps tTs lt Ts and thus Condition 2 18 holds for p lt 1 2 with u 2u2 1 p 1 2p and the approximat ing distribution is then a P lya Aeppli distribution However the regenerative structure of such a Markov chain also lends itself well to proving good bounds for the quantity P W lt Am in CPA 2A so that it is to be expected that the better asymptotic order normally given by CPA 2A could be achieved here too when Condition 2 18 fails to hold Acknowledgement ADB thanks the Department of Mathematics and Statistics at Monash University for their warm hospitality while part of this paper was being writ ten References ALDous D J 1989 Probability approximations via the Poisson clumping heuristic Springer New York ARRATIA R GOLDSTEIN L AND GORDON L 1989 Two moments suffice for Poisson approximations the Chen Stein method Ann Probab 17 9 25 ARRATIA R GOLDSTEIN L AND GORDON L 1990 Poisson approximation and the Chen Stein method Stat Sci 5 403 34 ARRATIA R GORDON L AND WATERMAN M S 1990 The Erd s R nyi law in distribution for coin tossing and sequence matching Ann Statist 18 539 570 ARRATIA R MARTIN D REINERT G AND WATERMAN M S 1996 Pois son process approximation for sequence repeats and sequencing by hybridization J Comp Biol 3 425 463 BARBOUR A D 1999 Topics in Poiss
50. ponents in a k x k subgrid with left lowermost component i j are failed Then the reliability of the system is just P W 0 The indicators and Iy are independent unless i i lt k 1 and j j lt k 1 but the local dependence between the T j is now frequently relatively strong For example the conditional probability that j41 1 given that I 1 is q as compared with the unconditional probability of gt Thus the indicators I j tend to occur in clusters and the random variable W is best approximated by a compound Poisson distribution The reliability P W 0 is then approximated by eTEN where N is the number of clusters rather than by the Poisson approximation e E the two quantities differing inasmuch as EW CEN where C is the expected cluster size see Barbour Chryssaphinou and Roos 1995 1996 1 3 Sequence matching Biological systematics has been revolutionized by the discovery of the double helix and the genetic code and by the development of cheap fast and automatic sequencing machines The degree of relationship between closely related species can now be assessed in terms of the similarity of their genomes For more distant species relationship at the DNA level may well have become obscured because too many random mutations have occurred since divergence but functionally vital elements of the amino acid sequences composing their proteins are still likely to have been conserved Thus unusual si
51. pplying the Poisson process approach as in Arratia Goldstein and Gor don 1990 Defining the random variable R er Xy where Xy 1 amp 1 J with X I we observe that R _ counts the number of clumps and is approximately Poisson Po ER distributed with mean ER y n 1 p 1 For interesting results we want IER to be bounded away from 0 which is essentially the condition that k lt log p n 1 p Note that if we are interested in the distribution M of the longest of these 1 s runs then we have P M lt k P R Oj The size of each clump minus one is the length by which the associated run of 1 s exceeds k and is approximately distributed as a geometric random variable with param eter p Furthermore the clump sizes are almost independent of each other and of the total number R of clumps so that the distribution of W is approximately Poisson Po ER compounded by geometric Ge p the P lya Aeppli distribution PA ER p this is equi valently expressed as CP A p with A ER pw p 1 p 1 gt 1 3 1 Although a clump size l could take any value between 1 and n we shall simplify the calculation by considering only l L 1 2 n 2 k 1 and declump by defining Ty 1 amp 1 b b541 Ey4egi 2 1 amp 4n41 1 for 7 1 Er x L then the random 17 variable n n k We X X Unp W n X h ver lEL i l YEP n 2 k satisfies dry L W L W lt p
52. quick succession before normality returns Clusters of events can then be expected to take place almost at random according to a Poisson process in which case the number of clusters occurring in a given time interval would have a distribution close to a Poisson distribution Po with some mean A and the sizes of the individual clumps might well also be assumed to be approximately independent and identically distributed with some distribution u If these assumptions were precisely true the total number W of occurrences would then have a compound Poisson distribution CP A w the distribution of the sum of a random Po A distributed number of independent random variables each with distribution u more formally CP A u is defined by M CP A H L gt r Soi j l i gt 1 for any gt 0 and any probability distribution u on N where Y j gt 1 are independent have distribution u and are independent also of M Po A and where M i gt 1 are independent with M Po Au The former representation is that which ties in with the description above The latter is an equivalent definition which emphasizes that the number of clumps of each size i gt 1 itself has a Poisson distribution and that these numbers are independent this structure can also be useful In practice the assumptions above are usually not satisfied exactly the clumps may occur only approximately as a Poisson process and the clump sizes may not qui
53. radicting the fact that the true distance in total variation is of order O 1 The reason why this estimate is not obtained in case c is that the distribution w approaches a periodic limit u as n oo and the S u become unbounded In situations where asymptotic rates of approximation are of interest both A A and H H typically vary with n In such cases the error estimate given in CPA 2A depends on n not only in the obvious way through the quantities o ei and ny but also because the S w 1 0 1 2 and u depend on n through their dependence on pw This latter dependence is in general quite complicated However the following result which is proved in Mansson 1999 Proposition 2 3 is useful in showing that for asymptotics a single choice often suffices for the S and for Proposition Suppose that v is an aperiodic probability measure on N and that the p are such that for some r gt 1 and c gt 0 4 sup a lt 00 n21 55 ii inf ijn cv for each j gt 1 Then Sy sup Sj pi lt oo l 0 1 2 and 3 4 lt sup u lt 1 n gt 1 n gt 1 and CPA 2A holds for each n with S and in place of S u and u whenever infn gt 1 A4 gt 211 71 The proposition can then be combined with the estimate CPA 2A to give good asymptotic rates in a wide variety of problems 3 Examples 3 1 Runs A Success runs 16 The problem of success r
54. respectively If approximation by another compound Poisson distribution with parameters AX and p is preferred a similar estimate is available Barbour 1999 Compound Poisson Estimate 2 One advantage of allowing distributions other than the canonical compound Poisson dis tribution as approximations is that the canonical distribution may be very complicated whereas an approximation of the same order may be obtainable with a very simple com pound Poisson distribution CPA 1B For any choices and yp we have dx L W CP Nw SeoHo eH drv W CP X w lt enHo e1H7 Y 2 13 where H HE X u and HITY HEY X p for 0 1 and where E eo Am X mi and 1 c1 Amidw Q Q 2 14 with 9 and as for CPA 1A here m V1 lu and the probability measures Q and Q on N are such that Q i ip m and Q i ips m In particular if Xm Am then o Remark If mi Am then one can instead take e1 er SUE 1 u Meal 2 15 121 Roos 1994 Theorem 3 The formulae for the elements Ap from the canonical parameters are easy to obtain from 2 7 and the alternative parameters and y are usually chosen for their simplicity so that this quantity is easy to compute In order to exploit CPA 1A and CPA 1B it thus remains to find suitable bounds for HE A p and HEY A p 0 1 In the Poisson case p 6 it is known that HEY A 61 lt min 1
55. ring among a fixed number n of points This problem has been studied by many authors and has been applied in a variety of fields including geology medicine and nuclear physics For references see Glaz and Balakrishnan 1999 Let X1 Xn be independent and identically distributed observations from the uni form distribution on the interval 0 1 and let Y w be the number of X s contained in the scanning interval t t w The scan statistic S also called the linear conditional scan statistic is defined by Sw Maxo lt t lt 1 w Y w Because the points are on the line the events Sw gt m can simply be rewritten as Ww m gt 1 where n m 1 Wumi gt amp and G 1 XG4m 1 Xw lt ud i 1 with X denoting the 7 th order statistic Thus the problem is rephrased in terms of counting short m 1 spacings Even so exact evaluation of the tail probabilities of the scan statistic is a complicated problem and as the sample size n increases even with moderate values of m and small values of w it becomes practically impossible When nw is small Barbour Holst and Janson 1992 Corollary 7 C 2 proved a Poisson approximation to L We m s where Wum S and Aa l 1 lt i lt n m 1 i I Xa m i n Xa lt w n m 2 lt i lt n For these random variables w EJ is the same for all i being given by w P Bi n 1 w gt m 1 nw m 1 and thus P Wwm 4 Wom lt m 1 w The error tha
56. same error estimate follows as a consequence There have been many successful applications of this approach a number of which are given in Arratia Goldstein and Gordon 1989 1990 To formulate the results we introduce some notation For each y l T x N let B y l CT x N be a set containing y x IN and set n gt XO ElnEIg b X SO E Lyle 2 2 VIET XN 6 j B y MOEEXN P HERNY b XO E E I Eln o Is 6 3 B D 2 3 yJ ErxXN The set B y l can be thought of as indexing the immediate neighbourhood of y l and should be chosen so that the indicators Ig whose indices 3 7 do not belong to it collectively have little or no influence on the value of Zu The degree of influence from outside the immediate neighbourhood aggregated over all 7 1 is measured by b3 An alternative expression for it is By D I b EN X wE 7 JDErxN Pe 1 oUs 8 4 J Pll 1 where N D y lerxn Ly denotes the total number of clumps and the w are the weights EI EN This represents b3 as the product of EN and a weighted average of a measure of the dependence of the distribution of on what happens outside its immediate neighbourhood the events in o Ig G 7 B y Local dependence which should also be weak for a Poisson process approximation to be good is summarized in bg This quantity can be interpreted in a similar way as the product of EN and a weighted average of the expected number
57. st probability we get nyp 1 p ifi 1 k 1 Aui 4 i np 2p 1 1 p 2k i 2 p 71 1 p ifi k 2k 2 2k 1 tnyp if i 2k 1 2k 1 d So Ai i 1 Finally Roos 1993 showed that Condition 2 18 is satisfied if p lt 1 3 so that HIV A p can be bounded using 2 19 writing M A u 2u2 np 1 p 1 2p we derive the error estimate 1 1 P lt 41A 2 Ai dry L W CP A lt A M E log 2M ok 5 ny 3 3 The estimate 3 3 of order O kw log nw is a big improvement over the bound of order O nkw obtained by using CPA PP and 3 2 whenever EW ny is at all large Furthermore Condition 2 20 is satisfied for p lt 1 5 since 0 lt 2p 1 p and 2 21 then gives the error estimate dry W CP A 1 lt 6k 5 Y 1 p 1 5p 3 4 of even better asymptotic order O kw It is shown in Barbour Chryssaphinou and Vaggelatou 1999 2 9 that the canonical compound Poisson distribution can be re placed using 2 15 by the P lya Aeppli distribution PA ny 1 p p see 3 1 if 4ky 1 p 1 5p is added to the error estimate The Markov chain approach of Er hardsson 1999 outlined in Section 3 6 also gives an approximation of the same order when p lt 1 5 but with a better constant factor If essentially no restriction on p is to be imposed the estimate CPA 2A is still available this has
58. success runs of length k becomes the number of visits of X to S k For copies of a single word A a a2 ax of length k in the Markov model of Section 3 4 set X j if max 0 l gt EAN ET aiaz a JE 1 2 k and set X 0 otherwise the number of copies of A is then the number of visits to S k When visits to S generally occur singly a Poisson approximation to L W is appropriate Barbour Holst and Janson 1992 Section 8 5 but it is more often the case that there is a tendency for visits to occur in clumps and then compound Poisson approximation gives much sharper results This problem has been studied in some generality by Erhardsson 1999 exploiting the regenerative structure of a recurrent Markov chain to derive very pleasing approximation theorems The Markov chain X is assumed to be stationary and Harris recurrent on S having a unique stationary distribution v and W So 1 x s counts the number of visits to S S define w v S The approximating compound Poisson distribution CP A w is defined in terms of regeneration cycles is the expected number of cycles in 1 2 2 which contain at least one visit to S1 and yp is the conditional distribution of the number of visits to S in a cycle given that at least one occurs In particular if S is an atom for example a single state then p is geometric and CP A u is a P lya Aeppli distribution A formal definition
59. t and Janson 1992 and Waterman 1995 The assumption of independent is not a good one for DNA sequences In what follows we show how to derive compound Poisson approximation for sequences modelled by Markov chains as in Schbath 1995 We assume that the finite sequence 1 n of random variables taking values in the alphabet A A C G T arises from a stationary realization of an irreducible aperiodic homogeneous Markov chain on the finite state space A the more general case of an m order chain can be treated as a first order chain on A Let II a a 1 denote the first order transition probabilities of the chain and z a the invariant probability of a Define the indicator random variables I with y 1 n k 1 by I Sy a1 Ey4k 1 Qk then W yer I We say that A has period p if a aj p for all i 1 k p and let P A denote the set of periods of A We also define the set of principal periods P A of A consisting of the minimal period and of those which are 28 not a multiple of it Schbath 1995 applying CPA PP proved that if the EW are bounded as n ov then L W can be approximated by a compound Poisson distribution Her argument runs much as follows First replace W by a random variable W which is easier to analyze defined using the entire doubly infinite stationary sequence o lt i lt oo Set Ty I an I clump of A s starts at y in the infi
60. t they obtained is of order yt mI and is thus unlikely to be accurate for m gt 3 The reason for this is that the indicators I are dependent However indicators J and I are close to being independent if i j gt m 1 for m fixed and sufficiently large n Since the indicators I and J for i j lt m 1 are positively correlated we expect that the 1 s tend to occur in clusters while the number of such clusters approximately follows a Poisson distribution Thus the approximation of the distribution of Wy by a suitable CP A distribution arises in a natural way giving P S gt m P Wu m gt 1 1 exp A Glaz et al 1994 in work closely related to Roos 1993 use an approach based on CPA 1A to obtain a compound Poisson approximation with error of order O w 1 log A using 2m 1 non zero u s they also give a simpler approximation involving only m non zero uj s with the same order of error using CPA 1B The evaluation of and the uj 27 was accomplished using the clumping heuristic of Aldous 1989 Huffer and Lin 1997 suggested an alternative compound Poisson approximation 3 4 Occurrences of a word in a sequence of letters Let 7 gt 0 be independent and identically distributed random variables taking values in a set alphabet Q w1 wq q gt 2 with probabilities p P E ws s 1 q i gt 0 Let A ay a be a fixed string or word of length k We
61. te be independent If this is the case and if the CP A p distribution is being used to ap proximate the distribution of W it is important to know how accurate the approximation really is In this paper we are interested in showing how to quantify this closeness when W gt yer Xy iS a countable sum of non negative integer valued random variables X y ET We review estimates for the distance between L W and CP A m for suit ably chosen A and u with respect to the Kolmogorov distance dx and the total variation distance dry where for probability distributions P and Q on Z dx P Q sup P 0 J Q 0 jl drv P Q sup P A Q A l ACZ The X are to be thought of as being generally weakly dependent apart possibly from some strong local dependence but we retain generality and flexibility by avoiding as far as possible making any specific assumptions in this respect 2 1 1 Insurance A simple model of an insurance portfolio assumes a finite number n of insured risks each of which may lead to a claim with a small probability independently of all the others The distribution of the total number N of claims is then well approximated by the Poisson distribution Po A with Joa pj even if the claim probabilities p are not equal Fur thermore as observed by Michel 1988 see also Le Cam 1965 if all claim amounts are independent and identically distributed the difference in terms of total variation distan
62. the individual claim amounts are independent and identically distributed the distribution of the aggregated claim amount is itself at least as close in total variation to an appropriate compound Poisson distribution again by Michel s observation To exploit this idea if the possibility of substantial local dependence between the random claim num bers is also to be allowed it is necessary to have an equivalent of the Stein Chen method which quantifies the error in total variation when approximating the distribution of a sum of nonnegative random variables most of them taking the value zero with high probability by a compound Poisson distribution Once such an equivalent has been developed there is the further advantage that one can dispense with Michel s observation and the assumption of independent and identically distributed individual claim amounts and prove compound Poisson approximations to the total claim amount directly 1 2 Reliability We consider a system of n linearly arranged components having independent lifetimes with common distribution function F connected in such a way that the system fails if at 3 least k consecutive components fail This reliability system is called the Consecutive k out of n F C k n F system Over the last two decades the C k n F and related systems have been extensively studied by many authors One can find a rich literature in Chao Fu and Koutras 1995 The advantages of using such a
63. ultivalued state component systems Ann Probab 7 379 383 SALVIA A A AND LASHER W C 1990 2 Dimensional Consecutive k Out Of n F Models IEEE Trans Reliability 39 382 385 SATON N SASAKI M YUGE T AND YANASI S 1993 Reliability of three state device systems IEEE Trans Reliability 42 470 477 SCHBATH S 1995 Compound Poisson approximation of word counts in DNA se quences ESAIM P amp S 1 1 16 Suen W C S 1990 A correlation inequality and a Poisson limit theorem for non overlapping balanced subgraphs of a random graph Rand Struct Alg 1 231 242 WATERMAN M S 1995 Introduction to computational Biology Chapman and Hall London WOLFOWITZ J 1944 Asymptotic distribution of runs up and down Ann Math Stat 15 163 172 43
64. uns is very well known in the literature having already been considered by von Mises 1921 in the context of Poisson approximation It is the simplest prototype for many problems in the general area of reliability and sequence analysis Ar ratia Goldstein and Gordon 1989 1990 Arratia Gordon and Waterman 1990 and gives a good test of the effectiveness of the various compound Poisson approximations To formulate the problem consider the independent identically distributed Bernoulli random variables 1 where P 1 p 1 P 0 i 1 n We are interested in compound Poisson approximation to the number of k runs of consecutive 1 s In order to avoid trivialities and edge effects we assume that n gt 4k 3 and we identify all indices of the form i nj for j Z We define I WG and W eer Taz T 1 n and set y EL p It is clear that the random variable W counts the number of locations among the first n at which a run of 1 s of length at least k begins It is also clear that runs of 1 s occur in clumps that is if there is a run of 1 s of length k beginning at position y then with probability p there will also be a run of 1 s of length k at position y 1 with probability p a run of length k beginning at position y 2 and so forth This is an example with average clump size 1 p p p of the Poisson clumping heuristic described by Aldous 1989 We start by a
65. y CP An Hn where dn n 1p n 2 and py A n Ipinu 464 here the true error in fact tends to 0 with n being of order O p n np n Case c Suppose that everything is as in Case b except that now w 2Z 1 so that in particular u 0 In this case the error estimate of order O 1 furnished by 2 5 is of the correct order The contrast between Cases b and c indicates that improving upon the error estimates for compound Poisson approximation that are derived using the point process approach is likely to be a delicate matter 2 2 The direct approach If EW is large but finite there is advantage to be gained by taking a direct approach by way of Stein s method for the compound Poisson distribution introduced in Barbour Chen and Loh 1992 Here there is no need to rewrite W Jer form Instead for each y decompose W into non negative integer valued random variables X in declumped in the form W W Zy U X 2 6 where for the representation to be useful W should be almost independent of X4 U and U and Z should not be too large the sense in which these requirements are to be interpreted becomes clear shortly Such a decomposition is often realized by partitioning the indices 1 2 n into subsets 7 S4 N and T and setting U Y Xe and ZS gt Xp BES BENy Roos 1994 S contains those Xg 8 4 y which strongly influence X T those Xg whose cumulative effe

Compound Poisson approximation: a user's guide A. D. Barbour1,2

Contents

Download Pdf Manuals

Related Search

Related Contents