Nonparametric Multiple Comparisons in Repeated Measures Designs for Data with Ties Ullrich Munzel1; ... cularly ordinal data; the methods are also app...

0 downloads 17 Views 246KB Size

Loading...

Nonparametric Multiple Comparisons in Repeated Measures Designs for Data with Ties Ullrich Munzel 1; * and Ajit C. Tamhane 2 1

2

Department of Medical Biometrics and Data Management Merz + Co. GmbH & Co., Eckenheimer Landstr. 100–104, D-60318 Frankfurt/Main, Germany Department of Statistics, Northwestern University, 2006 Sheridan Road, Evanston, Illinois 60208, USA

Summary We consider some multiple comparison problems in repeated measures designs for data with ties, particularly ordinal data; the methods are also applicable to continuous data, with or without ties. A unified asymptotic theory of rank tests of Brunner, Puri and Sen (1995) and Akritas and Brunner (1997) is utilized to derive large sample multiple comparison procedures (MCP’s). First, we consider a single treatment and address the problem of comparing its time effects with respect to the baseline. Multiple sign tests and rank tests (and the corresponding simultaneous confidence intervals) are derived for this problem. Next, we consider two treatments and address the problem of testing for treatment time interactions by comparing their time effects with respect to the baseline. Simulation studies are conducted to study the type I familywise error rates and powers of competing procedures under different distributional models. The data from a psychiatric study are analyzed using the above MCP’s to answer the clinicians’ questions.

Key words: Rank statistics; Sign statistics; Midranks; Rank transform tests; Ordinal data; Joint ranking; Familywise error rate; Power.

1. Introduction This work was motivated by the research questions addressed in a randomized parallel group trial conducted at the Department of Psychiatry, University of Go¨ttingen to compare a new drug versus a placebo to cure panic disorder in psychiatric patients. A total of 30 patients were randomly assigned to the two treatment groups with 15 per group. Patients were assessed at the baseline and then were monitored on seven occasions over a period of 10 weeks. One of the efficacy variables was the clinical global impression (CGI) measured on a seven-point ordinal scale from 0 = best to 6 = worst. The data are shown in Table 1. Side-by-side * Corresponding author: [email protected] This work was supported in party by the Boehringer Ingelheim Fonds. The authors thank Professor Edgar Brunner for useful coments.

# 2002 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

0323-3847/02/0609-0762 $ 17.50þ.50/0

763

Biometrical Journal 44 (2002) 6

box plots of the weekly data for the drug (solid circles) and placebo (open circles) groups over the 10-week period are shown in Figure 1. We see that the trend is flat in the placebo group, but is downward sloping in the drug group indicating improvement. The following questions of interest to the clinicians will be addressed in the present paper: 1. What is the earliest time point at which the drug shows a significant improvement w.r.t. the baseline? 2. Is there a placebo effect as measured by a significant improvement w.r.t. the baseline at any time point? 3. What is the earliest time point at which the drug shows a significantly higher improvement than the placebo w.r.t. the baseline? To formally answer these questions, not only does one need to deal with ordinal data involving many ties, but also one needs to perform multiple comparisons over time points (in particular, comparisons with the baseline). Note that rank transform statistics of exisiting parametric procedures (Conover and Iman, 1981) in general are not suitable to solve these problems (e.g. see Akritas, 1991). The goal of the present paper is to derive the necessary procedures. These procedures will be of use to researchers in many disciplines, including medicine, psychometry and marketing, where repeated measures designs with ordinal data are common.

Table 1 CGI Values of Panic Disorder Patients Active Drug Group

Placebo Group

Week

Week

Patient

0

1

2

3

4

6

8

10

Patient

0

1

2

3

4

6

8

10

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

4 5 4 4 5 5 6 4 4 5 4 5 5 5 5

5 5 3 3 4 5 6 4 4 2 5 5 5 4 5

5 4 3 2 4 5 5 4 4 2 4 5 5 4 2

4 3 1 1 3 5 6 3 3 2 3 4 2 3 2

3 2 1 0 2 5 4 3 4 4 2 4 1 2 2

1 1 2 1 2 3 5 2 4 4 2 3 4 2 3

1 1 1 1 2 3 5 2 2 2 1 2 1 1 1

1 1 2 2 2 2 3 1 1 2 2 1 1 1 2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

5 5 4 5 4 5 3 4 3 4 5 4 3 5 4

5 4 4 5 4 5 3 4 3 4 4 4 3 5 4

5 5 3 5 4 4 3 4 4 4 5 4 3 5 4

4 5 3 5 4 4 2 4 4 4 4 4 2 4 4

3 5 4 4 3 4 3 4 3 5 4 4 1 4 4

5 5 4 5 3 4 1 4 1 4 4 3 4 5 4

5 6 5 4 3 4 1 4 2 4 5 4 4 4 4

5 5 5 5 2 4 1 4 2 4 5 4 5 5 3

764

U. Munzel: Nonparametric Multiple Comparisons

6 5

CGI

4 3 2 1 Active Drug Placebo

0 0

1

2

3

4

5

6

7

8

9

10

Week Fig. 1. A Plot of the Panic Disorder Data

There are many nonparametric multiple comparison procedures (MCP’s) for continuous data. In particular, the multiple sign test of Steel (1959) and the multiple signed rank test of Nemenyi (1963) discussed in Section 2.2, Chapter 9 of Hochberg and Tamhane (1987) deal with dependent data, which is the focus of the present paper. Nonparametric procedures for repeated measures designs with continuous data have been studied by Thompson (1991), Akritas and Arnold (1994) and Brunner and Denker (1994). In applications, measurement scales are essentially discrete because of limited gage accuracy; hence ties are common. In many applications, as in our example, an ordinal scale is used in which case ties are a rule rather than an exception. The above procedures are not suitable in the present context because of the prevalence of ties in ordinal data. Most rank tests handle ties by assigning them midranks based on heuristic grounds. The concept of normalized distribution functions due to Ruymgaart (1980) leads to midranks in a natural way. Using this approach, a unified asymptotic theory of rank tests for continuous as well as discrete data has been developed by Brunner, Puri and Sun (1995) and Akritas and Brunner (1997). Brunner and Langer (1999, 2000) applied the results to longitudinal data. We utilize the results of these papers to derive large sample MCP’s to answer the questions of interest. Some technical details are omitted for brevity; the interested reader may refer to the aforementioned papers. More than two treatments are not considered in this paper, so there are no multiple comparisons between treatments.

765

Biometrical Journal 44 (2002) 6

The organization of the paper is as follows. Section 2 defines the notation and basic assumptions. Section 3 considers the single treatment case (e.g., only the placebo or only the drug group) where multiple comparisons stem from the need to compare different time points with each other. We focus on the many-to-one comparisons with the baseline to answer Questions 1 and 2 above. Multiple sign and rank tests (and the associated simultaneous confidence intervals) are derived for suitably defined effects. Section 4 considers two treatment groups. Multiple rank tests are given for treatment time interactions to answer Question 3 above. Section 5 gives simulation results to study the type I error rates and powers of the proposed tests. In Section 6 we return to the example and analyze the data in Table 1 using the procedures proposed in earlier sections. Finally, a discussion of the resulting methods is given in Section 7.

2. Notation and Assumptions Consider a ¼ 1 or 2 treatments and b þ 1 2 repeated measures, which are assumed to be observations at b þ 1 successive occasions, beginning with occasion 0, called the baseline. The subjects are assumed to be drawn as a random sample from a homogeneous population. For a ¼ 2, an independent samples design is assumed with subjects assigned at random to the treatments. Let Xijk denote the observation at the jth occasion on the kth subject in the ith treatment group ði ¼ 1; 2; j ¼ 0; 1; . . . ; b; k ¼ 1; 2; . . . ; ni Þ. The Xijk are measured on at least ordinal scale. For asymptotic considerations, we assume that n1 ; n2 ! 1 at the same rate so that the ratio n1 =n2 is bounded away from zero and 1. The common distribution of X ik ¼ ðXi0k ; Xi1k ; . . . ; Xibk Þ0 is assumed to be nondegenerate, but otherwise completely arbitrary. In order to account for ties and ordinal data, we define the marginal normalized cumulative distribution function (c.d.f.) of Xijk as (Ruymgaart, 1980) Fij ðxÞ ¼

1 2

½Fijþ ðxÞ þ Fij ðxÞ ;

i ¼ 1; 2 ;

j ¼ 0; 1; . . . ; b ;

where Fijþ ðxÞ is the right-continuous and Fij ðxÞ is the left-continuous version of the original marginal c.d.f. of Xijk . In Section 3 we consider a single treatment and drop the subscript i, letting Xk ¼ ðX0k ; X1k ; . . . ; Xbk Þ0 denote independent and identically distributed (i.i.d.) observation vectors on subjects k ¼ 1; 2; . . . ; n and Fj ðxÞ the corresponding normalized c.d.f.’s. The comparisons of interest are formulated as multiple hypothesis testing problems in each case. The type I familywise error rate (FWE) (Hochberg and Tamhane, 1987) for a family of hypotheses is defined as FWE ¼ PfAt least one true null hypothesis is rejectedg:

ð2:1Þ

766

U. Munzel: Nonparametric Multiple Comparisons

A requirement for a multiple test procedure is that the FWE be strongly controlled, i.e., controlled under all possible configurations of the true hypotheses, at a specified level a.

3. A Single Treatment In this case, we focus on many-to-one comparisons between the occasions j ¼ 1; . . . ; b with occasion 0, the baseline. The results can be readily extended to other comparisons, e.g., pairwise comparisons between all occasions or comparisons between successive occasions. We define the effect of occasion j w.r.t. the baseline as qj ¼ PðX0k < Xjk Þ þ 12 PðX0k ¼ Xjk Þ 12 ;

j ¼ 1; . . . ; b:

ð3:1Þ

Depending on whether qj >; ¼ or < 0, we can say that Xjk is tendentiously larger, comparable or smaller than X0k . In Section 3.1 we derive multiple sign tests on the qj . Although qj has a very simple interpretation, it only admits sign-type of tests. To admit rank-type tests we propose an alternative definition of the effect of occasion j ¼ 1; . . . ; b w.r.t. the baseline: Ð k 6¼ ‘ : wj ¼ PðX0k < Xj‘ Þ þ 12 PðX0k ¼ Xj‘ Þ 12 ¼ F0 ðxÞ dFj ðxÞ 12 ; (3.2) Note that by comparing two independent subjects, this measure reduces the comparison between occasion j with the baseline in terms of the respective marginal distributions. In Section 3.2 we derive multiple rank tests on the wj . Remark 1: Define Gj ðxÞ ¼ 0:5 F0 ðxÞ þ Fj ðxÞ : Then wj ¼ E Gj ðXjk Þ Gj ðX0k Þ . Thus Gj may be thought of as an unknown link function or transformation which results in a linear scale on which Xjk and X0k can be compared. The above equation can also be expressed as j ¼ 1; . . . ; b : ð3:3Þ wj ¼ 12 E F0 ðXjk Þ Fj ðX0k Þ ; We use these alternative representations of wj in the sequel.

&

3.1 Multiple Sign Tests and Confidence Intervals Consider the family of two-sided hypothesis testing problems: H0j : qj ¼ 0

vs.

H1j : qj 6¼ 0 ;

j ¼ 1; . . . ; b :

ð3:4Þ

767

Biometrical Journal 44 (2002) 6

Let Yjk ¼ 1; 1=2 or 0 according as X0k <; ¼ or > Xjk. Then an unbiased estimate of qj is given by n 1 1 P 1 1 1 0 1 þ ^ qj ¼ Y j ¼ Nj þ Nj ; Yjk ¼ 2 n k¼1 2 n 2 2 where Njþ ¼ ]fkjX0k < Xjk g and Nj0 ¼ ]fkjX0k ¼ Xjk g. The standardized test statistics pﬃﬃﬃ ðY j 1=2Þ n pﬃﬃﬃ ; j ¼ 1; . . . ; b ð3:5Þ Zj ¼ ^jj s ^jj is a consistent are asymptotically standard normal under H0j of (3.4) where s estimate of sjj ¼ Var ðYjk Þ. Let sij denote Cov ðYik ; Yjk Þ (which equals Var ðYjk Þ for i ¼ j). Since the Yjk for k ¼ 1; 2; . . . ; n are i.i.d., the sij and correlations qij are consistently estimated by ^ij ¼ s

n 1 P Yik Y i ðYjk Y j Þ n 1 k¼1

and

^ij s ^ij ¼ pﬃﬃﬃﬃﬃﬃﬃﬃﬃ ﬃ : q ^ii s ^ jj s

ð3:6Þ

The testing family fðH0j ; Zj Þ ; j ¼ 1; . . . ; bg is joint (Gabriel, 1969). By using the union-intersection method (see Hochberg and Tamhane, 1987, p. 28), a multiple test procedure that strongly controls the FWE asymptotically at level a for the family of hypotheses (3.4) is given by: Reject H0j : qj ¼ 0

if

jZj j > jzjb; f^qij g; a ;

j ¼ 1; . . . ; b ;

where jzjb ;f^qij g; a is the two-sided upper a equicoordinate critical point of the bvariate standard normal distribution (with zero means and unit variances) and correlation matrix f^ qij g. The corresponding 100ð1 aÞ% simultaneous confidence intervals (SCI’s) on the qj are given by pﬃﬃﬃ ^j jzj ^jj =n ; j ¼ 1; . . . ; b : qj 2 ½q b; f^ qij g; a s Extensions to one-sided hypotheses and SCI’s are straightforward. Remark 2: The above method of deriving a multiple test procedure by first obtaining the test statistics Zj that form a joint testing family fðH0j ; Zj Þg and then using the union- intersection method to find the common critical point from the asymptotic multivariate normal (MVN) distribution of the Zj is used in all of the problems discussed in the remainder of the paper. Therefore these technical details are not always mentioned. Also, note that it is possible to derive more powerful & stepwise testing procedures if SCI’s are not required. Remark 3: The statistic (3.5) is a paired t-statistic where each paired difference, Xjk X0k , is coded as 1, 1/2 or 0 as explained above. This suggests that, especially for small n, the joint distribution of the Zj under the overall null hypothesis may be better approximated by a b-variate t-distribution. In fact, the critical point

768

U. Munzel: Nonparametric Multiple Comparisons

jzjb; f^qij g; a gives a slightly anti-conservative test. So we suggest that it be replaced by the two-sided upper a equicoordinate critical point jtjb; n1; f^qij g; a from the b-variate t-distribution with n 1 degrees of freedom (d.f.). This critical point can be calculated by using the SAS-IML program of Genz and Bretz (1999) (available at www.bioinf.uni-hannover.de). It should be noted that the multivariate t-distribution assumes a common estimate of variance for all statistics, whereas here we have different estimates that are correlated. However, we do take this into account by not pooling the d.f. The superiority of the t-approximation in accurately controlling the FWE was confirmed in the simulation studies. The same type of approxima& tion is used in all of the problems discussed in the remainder of the paper.

3.2 Multiple Rank Tests and Confidence Intervals Consider the family of two-sided hypothesis testing problems: H0j : wj ¼ 0 vs. H1j : wj 6¼ 0 ;

j ¼ 1; . . . ; b:

ð3:7Þ

A natural estimate of wj is obtained as follows. Let F^j denote the empirical normalized c.d.f.’s for j ¼ 0; 1; . . . ; b. Then from (3.3) we have ^j ¼ w

n 1 P ½F^0 ðXjk Þ F^j ðX0k Þ: 2n k¼1

ð3:8Þ

Now, 1 ð0jÞ 1 ð0jÞ ðjÞ ð0Þ F^0 ðXjk Þ ¼ ½Rjk Rjk and F^j ðX0k Þ ¼ ½R0k R0k ; n n ð0jÞ

ð0jÞ

where R0k and Rjk denote the midranks of X0k and Xjk in the joint ranking of ð0Þ ðjÞ the baseline and the occasion j samples together, and R0k and Rjk denote the midranks of X0k and Xjk in their respective internal rankings, i.e., within the n P ð0jÞ ð0jÞ R0k and baseline and the occasion j samples separately. Let R0 ¼ n1 n k¼1 P ð0jÞ ð0jÞ Rjk be the sample means of the midranks. Then it is easy to verify Rj ¼ n1 k¼1

that (3.8) reduces to ^j ¼ w

1 ð0jÞ ð0jÞ ðRj R0 Þ: 2n

ð3:9Þ

Notice that the internal ranks cancel out in the calculation of this estimate; however, they appear in the formulas for the estimates of the variances and covar^ j is asymptotically unbiased and consistent (Brunner, Puri iances. The estimate w and Sun, 1995). ^ j , define To obtain the asymptotic joint distribution of the w Yjk ¼ F0 ðXjk ÞFj ðX0k Þ

769

Biometrical Journal 44 (2002) 6

and 1 ð0jÞ ðjÞ ð0jÞ ð0Þ Y^jk ¼ F^0 ðXjk Þ F^j ðX0k Þ ¼ ½Rjk Rjk R0k þR0k : n Then from (3.8) we have n 1 1 P ^ j ¼ Y^j ¼ Y^jk : w 2 2n k¼1

ð3:10Þ

Note that Y^j is the sample mean of Y^jk , which are not independent. However, using methods similar to those in Brunner, Puri, and Sun (1995), we can show that pﬃﬃﬃ j ¼ 1; . . . ; b n jjY^j Y j jj2 ! 0 ; in the L2 -norm, where Yj is the sample mean of the Yjk , which are i.i.d. Since componentwise convergence implies multivariate convergence in finite dimensions, it follows that pﬃﬃﬃ n jjðY^1 ; . . . ; Y^b Þ ðY 1 ; . . . ; Y b Þjj2 ! 0 : By the multivariate central limit theorem (Gnedenko, 1962), ðY1 ; . . . ; Yb Þ is asymptotically MVN. Hence ðY^1 ; . . . ; Y^b Þ has the same asymptotic MVN distribution with j ¼ 1; . . . ; b; EðY^j Þ ! E Y j ¼ 2wj ; n Cov ðY^i ; Y^j Þ ! n Cov ðY i ; Y j Þ ¼ Cov ðYik ; Yjk Þ ¼ sij ;

i; j ¼ 1; . . . ; b:

The sij can be consistently estimated by the corresponding sample variances (for i ¼ j) and covariances (for i 6¼ j) among the Yik and Yjk . But since the latter are unobservable, we use the corresponding sample quantities, Y^ik and Y^jk , respectively, resulting in the following estimates: n 1 P ^ij ¼ ðY^ik Y^i Þ ðY^jk Y^j Þ s n 1 k¼1 n P 1 ð0iÞ ð0iÞ ð0iÞ ðiÞ ð0iÞ ð0Þ fRik Rik R0k þ R0k ðRi R0 Þg ¼ 2nðn 1Þ k¼1 ð0jÞ

ðjÞ

ð0jÞ

ð0Þ

ð0jÞ

ð0jÞ

fRjk Rjk R0k þ R0k ðRj R0 Þg:

ð3:11Þ

These estimates can be shown to be consistent by using the techniques of Brunner, Puri, and Sun (1995). Therefore, pﬃﬃﬃﬃﬃ ð0jÞ ð0jÞ pﬃﬃﬃ R R 2 ^ j 4n w j 0 Zj ¼ pﬃﬃﬃﬃﬃ ¼ rﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ; n s^jj 1 P ð0jÞ ð0jÞ 2 ð0jÞ ðjÞ ð0jÞ ð0Þ fR Rjk R0k þ R0k ðRj R0 Þg n 1 k¼1 jk j ¼ 1; . . . ; b ;

770

U. Munzel: Nonparametric Multiple Comparisons

are asymptotically standard normal under the respective hypotheses H0j : wj ¼ 0. qij g in the usual The correlation matrix of the Zj can be consistently estimated by f^ manner. As before, a multiple test procedure that strongly controls the FWE asymptotically at level a for the family of hypothesis testing problems (3.7) is given by: Reject H0j : wj ¼ 0 if

jZj j > jzjb; f^qij g; a ;

j ¼ 1; . . . ; b :

The corresponding 100ð1 aÞ% simultaneous two-sided confidence intervals on the wj are given by qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ^jj =4n ; wj jzjb; f^qij g; a s j ¼ 1; . . . ; b : wj 2 ½^ In analogy with Remark 2, we recommend the use of the multivariate t critical point jtjb; n1; f^qij g; a in place of jzjb; f^qij g; a . 4. Two Treatments In this case, one of the questions of interest is whether the time effects are different for the two treatment groups. This can be formulated as a test of hypothesis of no treatment time interaction. The multiple sign and rank tests of the previous section can be extended to the two treatment setting. For example, for the multiple sign test we can take the interaction effect for occasion j as q1j q2j where qij is the effect of treatment i at time j w.r.t. the baseline as defined in (3.1). Similarly, for the multiple rank test we can take the interaction effect for occasion j as w1j w2j where wij is the effect of treatment i at time j w.r.t. the baseline as defined in (3.2). The estimates of these interaction effects and their asymptotic distributions can be derived in a straightforward manner. We omit the details for brevity. The multiple rank test faces a difficulty arising because it ranks observations within each treatment separately (but jointly over the occasion j and baseline). This is a result of the fact that the measures wij are defined separately for each treatment i ¼ 1; 2. An undesirable consequence of this separate ranking is that a large quantitative interaction (i.e., the time effects in both treatments are in the same direction, but are of different magnitudes) is likely to go undetected. As a simple example, suppose we have two observations from each treatment at the baseline and occasion 1: X 11 ¼ ð1; 3Þ; X 12 ¼ ð2; 4Þ; X 21 ¼ ð5; 15Þ; X 22 ¼ ð6; 16Þ. Then X 11 X 10 ¼ 2 and X 21 X 20 ¼ 10. This is a quantitative interaction. Unfortunately, since the ranks are assigned separately for each treatment, we get identical joint ranks for the two subjects in each treatment group: ð01Þ ð01Þ ð01Þ ð01Þ ð01Þ ð01Þ ð01Þ ð01Þ R101 ¼ R201 ¼ 1, R102 ¼ R202 ¼ 2, R111 ¼ R211 ¼ 3, R112 ¼ R212 ¼ 4: Their inð0Þ ð0Þ ð0Þ ð0Þ ð1Þ ð1Þ ternal ranks are also identical: R101 ¼ R201 ¼ 1, R102 ¼ R202 ¼ 2, R111 ¼ R211 ¼ 1, ð1Þ ð1Þ ^ 11 w ^ 21 ¼ 0 (the corresponding variance estimate R112 ¼ R212 ¼ 2: As a result, w

771

Biometrical Journal 44 (2002) 6

is also zero) and the interaction will go undetected. The same difficulty occurs with the multiple sign test on q1j q2j . However, other types of interactions (e.g., quantitative interactions with small time effects in both treatments or qualitative interactions) are detectable using these two tests. To avoid the above problem, we need to assign ranks jointly over the two occasions as well as over the two treatments (called overall ranking). Rank statistics based on overall ranks result naturally if we define the effect of treatment i at occasion j w.r.t. the baseline as Ð Ð fij ¼ E½Gj ðXijk Þ Gj ðXi0k Þ ¼ Gj ðxÞ dFij ðxÞ Gj ðxÞ dFi0 ðxÞ ; i ¼ 1; 2 ;

j ¼ 1; 2; . . . ; b ;

where Gj ¼ N 1 ½n1 ðF10 þ F1j Þ þ n2 ðF20 þ F2j Þ is a weighted average of all four distributions and N ¼ 2ðn1 þ n2 Þ denotes the total number of all observations. Consider the hypothesis testing problem: H0j : f1j f2j ¼ 0 vs:

H1j : f1j f2j 6¼ 0 ;

j ¼ 1; . . . ; b :

ð4:1Þ

Note that H0j is implied by the stricter null hypothesis H0jF : F1j F10 F2j þ F20 ¼ 0. This latter null hypothesis was considered by Akritas and Brunner (1997). Their test is based on the same transformation Gj above. A natural estimate of fij is given by ð ð 1 ð0jÞ ð0jÞ ^ ^ ^ fij ¼ Gj d F ij G^j d F^i0 ¼ ðRij Ri0 Þ ; N ð0jÞ

ð0jÞ

where Ri‘k , ‘ ¼ 0; j denotes the overall rank of Xi‘k among all observations at time j and baseline in both groups. Using arguments pﬃﬃﬃﬃ similar to those in Akritas ^ ^ and Brunner p(1997) it can be shown that ﬃﬃﬃﬃ N ðf1j f2j Þ is asymptotically equivalent to N Y j; 1j Y j; 10 Y j; 2j þ Y j; 20 under the null hypothesis H0jF , ni P Yj; i‘k , ‘ ¼ 0; j: For every nonempty where Yj; i‘k ¼ Gj ðXi‘k Þ and Yj; i‘ ¼ n1 i k¼1

subset J f1; 2; . . . ; bg, the asymptotic multivariate normality of pﬃﬃﬃﬃ ^ Þ; j 2 J under T H F follows from the application of the multi^ f N ðf 1j 2j 0j j2J variate central limit theorem. pﬃﬃﬃﬃ ^ f ^ Þ and 0 If s denotes the asymptotic covariance between N ðf jj 1j 2j pﬃﬃﬃﬃ ^ ^ N ðf1j0 f2j0 Þ then the abovementioned asymptotic equivalence implies that sjj0 ¼ Nðs1; jj0 =n1 þ s2; jj0 =n2 Þ ; where si; jj0 ¼ CovðYj; ijk Yj; i0k ; Yj0; ij0 k Yj0; i0k Þ. Again, the Yj; i‘k are not observable and hence we use their empirical counterparts

1 1 ð0jÞ ^ ^ Ri‘k Y j; i‘k ¼ Gj ðXi‘k Þ ¼ N 2

772

U. Munzel: Nonparametric Multiple Comparisons

to obtain consistent estimates of the variances and covariances. Under H0jF \ H0jF 0 we obtain a consistent estimate of sjj0 by ^ jj0 ¼ Nðs ^1; jj0 =n1 þ s ^2; jj0 =n2 Þ; s where ni N P ðY^j; ijk Y^j; i0k Y^j; ij þ Y^j; i0 Þ ðY^j0 ; ij0 k Y^j0; i0k Y^j0; ij0 þ Y^j0; i0 Þ ni 1 k¼1 ni P 1 ð0jÞ ð0jÞ ð0j0 Þ ð0j0 Þ ð0jÞ ð0jÞ ð0j0 Þ ð0j0 Þ ðRijk Ri0k Rij þ Ri0 Þ ðRij0 k Ri0k Rij0 þ Ri0 Þ : ¼ Nðni 1Þ k¼1

^i; jj0 ¼ s

The test statistics Zj ¼

^ f ^ f 1j pﬃﬃﬃﬃﬃ 2j ; ^jj s

j ¼ 1; 2; . . . ; b

are asymptotically standard normal under H0jF . The correlation matrix of the Zj is consistently estimated by f^ qjj0 g in the usual manner. Multiple tests of the hypotheses (4.1) can be based on these statistics with strong FWE control as discussed before. SCI’s cannot be computed for f1j f2j because the variance and covariance estimates are consistent only under the corresponding null hypotheses H0jF \ H0jF 0 . Moreover, note that the test statistics are based on the effects f1j and not consistent against alternatives in H0j nH0jF (Brunner, Munzel and Puri, 1999). Solutions of this problem for the independent case (e.g. see Cohen et al., 2000) cannot be easily extended to repeated measures.

5. Simulations We carried out extensive simulations to study the FWE and power properties of the multiple sign and rank tests for a single treatment proposed in Section 3. The normal theory multiple paired t-test with empirically estimated covariance matrix was included in the simulation study as a benchmark for comparison. Two values of b were considered: b ¼ 3 and b ¼ 7. Four different ðb þ 1Þ-variate distributions were simulated: 1) Multivariate normal (MVN), 2) rounded multivariate normal (RMVN) as an example of a discrete distribution, 3) multivariate lognormal (MVLN) as an example of a skewed distribution, 4) multivariate Cauchy (MVCAUCHY) as an example of a heavy-tailed distribution. MVN observations with zero means and unit variances were generated with two different correlation structures: 1.) Autocorrelation structure having Corr ðXik ; Xjk Þ ¼ qjijj with autocorrelation coefficient q ¼ 0:5, 2.) compound symmetry structure with Corr ðXik ; Xjk Þ ¼ q ¼ 0:5 for all i 6¼ j. The latter structure arises from a random subject effects model. Specifically, if Xjk ¼ sk þ ejk , where

773

Biometrical Journal 44 (2002) 6

the sk are i.i.d. Nð0; s2s Þ subject effects and the ejk are i.i.d. Nð0; s2e Þ random errors then the common correlation q ¼ s2s =ðs2s þ s2e Þ. Because the simulation results were similar in both cases, only the results for the autocorrelated data are reported here. Five different sample sizes were studied: n ¼ 10; 15; 20; 30 and 50. A nominal a level of 5% was used throughout. The multivariate t critical point approximations were used for all tests. The estimated correlation matrix f^ qij g among the test statistics was used to determine the multivariate t critical points for b ¼ 3 using the Genz and Bretz (1999) algorithm. For b ¼ 7, this algorithm is too slow. ^ij by a common Therefore we used an approximation obtained by replacing the q ^, equal to their arithmetic average. This approximation is known to be correlation q slightly conservative in case of the normal distribution (Hochberg and Tamhane, 1987, p. 146). It was calculated using the SAS PROBMC procedure. The FWE simulations were conducted under the overall null hypothesis H0 : F0 ¼ F1 ¼ ¼ Fb , i.e., when all marginal distributions are identical. From Gabriel (1969, Theorem 2), it follows that the FWE is maximized under this configuration if the multiple test procedure is a UI procedure based on a joint testing family, which is the case here. The simulation results are summarized in Table 2. We see that the multiple t-test controls the FWE quite accurately for MVN and RMVN data, but is overly conservative for MVLN and MVCAUCHY data. The multiple sign test controls the FWE reasonably well for n 30, but for smaller n, its FWE values are highly variable ranging from as low as 2.6% to as high as 9.2%. The multiple rank test controls the FWE quite well in most cases for n 20, but for smaller n, its FWE is as high as 6.9%. Table 2 Empirical Type I Familywise Error Rates of Multiple t, Sign and Rank Tests (a ¼ 5%Þ Distribution

b¼3

n

b¼7

t-Test

Sign Test

Rank Test

t-Test

Sign Test

Rank Test

MVN

15 20 30

5.5 4.8 5.4

4.8 5.1 5.7

4.8 4.6 5.1

5.6 5.1 5.2

9.2 3.9 4.6

5.7 5.2 5.1

RMVN

15 20 30

5.6 5.4 5.0

5.7 5.4 5.1

5.5 5.3 5.2

5.6 5.2 5.1

6.1 6.1 5.5

6.0 5.6 5.4

MVLN

15 20 30

3.3 3.0 3.3

4.8 5.5 5.5

5.2 4.8 4.8

2.5 2.4 2.7

9.1 3.9 4.8

5.5 5.4 5.5

MVCAUCHY

15 20 30

2.0 2.0 2.1

4.7 5.8 5.9

5.7 5.5 5.5

1.2 1.1 1.3

9.3 3.7 4.9

6.2 5.3 5.4

774

U. Munzel: Nonparametric Multiple Comparisons

1,0 0,9 0,8 0,7

Power

0,6 0,5 0,4 0,3

Paired t-Test Sign Test Rank Test

0,2 0,1 0,0 0,0

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

0,9

δ Fig. 2. Power of the Multiple Sign Test, Multiple Rank Test and Multiple Paired t-Test for Multivariate Normal Data and b þ 1 ¼ 4 Time Points

The powers of the three tests were simulated for MVN and MVLN data for n ¼ 30 under location alternatives. It should be noted that for location models there is a one-to-one relationship between the effects qj and wj on the one hand and the location shift on the other hand. Location shifts were created by adding a quantity equal to dj=b to the jth time point (j ¼ 0; 1; . . . ; b) for selected values of d > 0. This creates a linear shift w.r.t. time. The resulting power curves for the three tests are shown in Figure 2 for MVN data and in Figure 3 for MVLN data. Figure 2 shows that for MVN data the multiple rank test is only slightly less powerful than the multiple t-test, while the multiple sign test is markedly less powerful. For MVLN data, however, the rank test is markedly more powerful than the t-test, which is even less powerful than the sign test. Hence we can conclude that the possible loss of power of the rank test w.r.t. the t-test is only small for MVN data, whereas the possible gain can be quite large. The sign test is less powerful than the rank test in general. For two treatments we compare the multiple interaction tests proposed in Section 4, i.e., the multiple sign test, the test based on different joint and internal ranks in each group and the test based on overall ranks as given in Section 4. A parametric multiple t-test procedure was also included in the simulations. This procedure was based on the interactions ðX 1j X 10 Þ ðX 2j X 20 Þ. The covariances of these interactions were estimated by the corresponding sample covariances. The critical points for all tests were taken from multivariate t-distributions with n1 þ n2 2 d.f. and appropriately estimated correlation matrices.

775

Biometrical Journal 44 (2002) 6 1,0 0,9 0,8 0,7

Power

0,6 0,5 0,4 0,3

Paired t-Test Sign Test Rank Test

0,2 0,1 0,0 0,0

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

0,9

δ Fig. 3. Power of the Multiple Sign Test, Multiple Rank Test and Multiple Paired t-Test for Multivariate Lognormal Data and b þ 1 ¼ 4 Time Points

Table 3 Empirical Type I Familywise Error Rates of the Multiple Interaction Tests for b ¼ 7 Time Points (a ¼ 5%Þ Distribution

n1 ¼ n2

Low Time Effect

High Time Effect

t-Test Sign Test

Joint Overall Rank Rank Test Test

t-Test Sign Test

Joint Overall Rank Rank Test Test

MVN

15 20 30

4.9 4.8 4.8

5.4 5.1 5.2

4.6 4.4 4.7

5.1 4.7 4.9

5.3 4.8 4.9

0.9 0.4 0.7

1.0 0.9 1.0

4.9 4.6 4.3

RMVN

15 20 30

4.7 4.9 5.2

4.9 5.2 5.0

4.8 5.0 4.9

4.7 5.1 5.0

4.9 4.9 4.9

0.7 0.6 1.1

1.0 1.1 1.4

4.3 4.4 4.6

MVLN

15 20 30

3.4 3.9 3.7

4.9 5.4 5.1

4.1 4.5 4.4

4.5 4.5 4.5

3.4 3.9 4.3

1.0 1.3 2.3

1.7 1.9 2.7

4.7 4.7 4.8

MVCAUCHY

15 20 30

1.2 1.3 1.2

4.8 4.9 5.0

5.1 5.4 5.5

5.5 5.5 5.6

1.4 1.2 1.2

3.5 4.2 4.1

4.2 4.2 4.9

5.1 5.0 4.9

776

U. Munzel: Nonparametric Multiple Comparisons

The same distributions as for the single treament simulations were used to generate the data. Location shifts were created by adding 0:5 þ jd=b to each observation from Group 1 and jd=b to each observation from Group 2 (so that there is a constant group effect of 0.5, and hence there is no treatmenttime interaction) where d ¼ 1 represented a low time effect and d ¼ b represented a high time effect. The simulated type I FWE’s for these two situations are displayed in Table 3. The results show that the sign test and the joint rank test maintain a quite accurately even for small sample sizes ðni 10Þ for a low time effect, but are very conservative for a high time effect. The parametric t-test is slightly conservative for MVLN and very conservative for MVCAUCHY data. Only the overall rank test is robust against different distributions and low or high time effects. Comparing the power of the overall rank test and the t-test shows that the t-test is slightly more efficient if the differences Xijk Xi0k have a symmetrical distribution whereas the rank test is much more powerful if the distribution is heavy-tailed or skew. The resulting power curves are comparable to those in the single treatment situation and therefore are not displayed.

6. Example Let us return to the questions stated in the Introduction and answer them using the proposed tests. For each test we use a ¼ :05. The sample size of 15 per group is slightly on the low side for asymptotics to work, but the rank tests used below perform satisfactorily as seen in the simulation study. (The highest FWE of the rank test in the single treatment case for b ¼ 7 and n ¼ 15 is 6.2% and the highest FWE of the overall rank test in the two treatment case for b ¼ 7 and n ¼ 15 is 5.1%. Both these are for MVCAUCHY data.) To answer Question 1 in the Introduction we computed the rank test statistics (3.2) and their two-sided p-values using the multivariate t approximation. The results are shown in Table 4. We see that the drug shows a significant effect relative to the baseline beginning with week 3. Note that when comparing week 10 Table 4 Rank Test Statistics and Associated p-Values for Answering Research Questions in Panic Disorder Psychiatric Study Question

Week

1

2

3

4

6

8

10

1

Z p

0.952 0.869

2.698 0.087

5.294 <0.001

8.880 <0.001

9.457 <0.001

10.980 <0.001

– <0.001

2

Z p

1.390 0.636

0.646 0.976

2.084 0.256

2.281 0.186

1.808 0.385

0.866 0.918

0.196 >0.999

3

Z p

0.362 0.998

2.046 0.231

3.614 0.007

3.463 0.010

5.247 <0.001

6.750 <0.001

7.499 <0.001

Biometrical Journal 44 (2002) 6

777

with baseline, the empirical distributions of the data for the two time points are disjoint. In that case the variance estimate of the rank statistic equals 0 and hence the Z-statistic cannot be computed. However, it is clear that the difference is highly significant in that case. To answer Question 2 we apply the same test used to answer Question 1 to the placebo group. The results are shown in Table 4. We see that there is no significant placebo effect at any time. To answer Question 3 we apply the multiple test for treatmenttime interaction based on overall ranks. The results are shown in Table 4. We see that the active drug has a significant effect relative to the placebo beginning week 3. Note that for the given example more powerful tests could be derived by using stepwise methods. For example a closure test could be applied to compare the time points to baseline successively beginning from the last. This method is reasonable if a monotone time effect can be assumed. However, it does not offer the possibility to compute simultaneous confidence intervals and it is rather non-robust if the assumption is not fulfilled. For example a time dependent placebo effect may result in an umbrella alternative that could not be detected.

7. Discussion In this paper we have given nonparametric MCP’s to compare time effects w.r.t. the baseline for a single treatment or two treatments. Although the approach does not offer a general solution to deal with repeated measures (for different objectives see, e.g., Reiczigel, 1999 or Keselman et al., 2001), it can easily be extended to some other MCP’s, e.g., multiple comparisons between successive time points. Missing values are not considered. However, the results could be extended to values missing completely at random by using the methods of Brunner, Munzel and Puri (1999), who among others generalized the approach of Akritas and Brunner (1997). Other nonparametric approaches to missing values (e.g., see Davis, 1991) compare the multivariate cdf’s or are based on linear forms of the dependent components. All results derived throughout the paper are asymptotic and the MCP’s cannot be used with small sample sizes. The accuracy of the approximation depends on the sample sizes and the number of time points. Moreover, the accuracy of the approximation is affected slightly by the number of ties. Exact nonparametric tests or resampling methods are known only for independent observations. Therefore, several authors use linear forms (e.g. see Davis, 1991) or summary statistics (e.g. Weinberg and Lagakos, 2001) to condense the repeated measures. As in the parametric theory, however, summary statistics as well as linear forms or multivariate tests do not offer the opportunity to distinguish between the treatment effects, time effects and interactions. This implies that they could not be used, e.g., to answer the Questions 1–3. in the Introduction. Moreover many of

778

U. Munzel: Nonparametric Multiple Comparisons

the standard summary statistics cannot be applied to ordinal data (AUC, maximum change). Thus, new methodology as proposed in this paper is required. Note that all tests are based on the asymoptotic rank transform (ART) method. The ARTs are known to have heterogenous variances and covariances even if the original observations are homogenous. Consequently all proposed methods are appropriate for homogenous as well as for heterogenous situations. References Akritas, M. G., 1991: Limitations of the rank transform procedure: A study of repeated measures designs, Part I. Journal of the American Statistical Association 86, 457–460. Akritas, M. G., and Arnold, S. F., 1994: Fully nonparametric hypotheses for factorial designs I: multivariate repeated measures designs. Journal of the American Statistical Association 89, 336–343. Akritas, M. G., and Brunner, E., 1997: A unified approach to rank tests in mixed models. Journal of Statistical Planning and Inference 61, 249–277. Brunner, E., and Denker, M., 1994: Rank statistics under dependent observations and applications to factorial designs. Journal of Statistical Planning and Inference 42, 353–378. Brunner, E., and Langer, F., 1999: Nichtparametrische Analyse ordinaler Daten. Oldenbourg, Mu¨nchen. Brunner, E., and Langer, F., 2000: Nonparametric analysis of ordered categorical data in designs with longitudinal observations and small sample sizes. Biometrical Journal 42, 663–675. Brunner, E., Munzel, U., and Puri, M. L., 1999: Rank-score tests in factorial designs with repeated measures. Journal of Multivariate Analysis 70, 286–317. Brunner, E., Puri, M. L., and Sun, S., 1995: Nonparametric methods for stratified two-sample designs with application to multi-clinic trials. Journal of the American Statistical Association 90, 1004–1014. Cohen, A., Sackrowitz, H. B., and Sackrowitz, M., 2000: Testing whether treatment is ‘better’ than control with ordered categorical data: an evaluation of new methodoloy. Statistics in Medicine 19, 2699–2712. Conover, W. J., and Iman, R. L. 1981: Rank Transformations as a Bridge between Parametric and Nonparametric Statistics. The American Statistician 35, 124–129. Davis, C. S., 1991: Semi-parametric and non-parametric methods for the analysis of repeated measurements with applications to clinical trials. Statistics in Medicine 10, 1959–1980. Gabriel, K. R., 1969: Simultaneous test procedures – Some theory of multiple comparisons. Annals of Mathematical Statistics 40, 224–250. Genz, A., and Bretz, F., 1999: Numerical computation of multivariate t-probabilities with application to power calculation of multiple contrasts. Journal of Statistical Computation and Simulation 63, 361–378. Gnedenko, B. V., 1962: The Theory of Probability, 2nd Edition. Chelsea Publishing, New York. Hochberg, Y., and Tamhane, A. C., 1987: Multiple Comparison Procedures, Wiley, New York. Keselman, H.J., Algina, L., and Kowalchuk, R.K., 2001: The analysis of repeated measures designs: a review. British Journal of Mathematical and Statistical Psychology 54, 1–20. Nemenyi, P., 1963: Distribution-free multiple comparisons. Unpublished doctoral dissertation, Princeton University, Princeton, NJ. Reiczigel, J., 1999: Analysis of experimental data with repeated measures. Biometrics 55, 1059–1063. Ruymgaart, F. H., 1980: A unified approach to the asymptotic distribution theory of certain midrank statistics. In J. P. Raoult (ed.): Statistique non Parametrique Asymptotique. Lecture Notes on Mathematics 821. Springer, Berlin, 1–18.

Biometrical Journal 44 (2002) 6

779

Steel, R. G. D., 1959: A multiple comparison sign test: Treatment vs. control. Journal of the American Statistical Association 54, 767–775. Thompson, G. L., 1991: A unified approach to rank tests for multivariate and repeated measures designs. Journal of the American Statistical Association 86, 410–419. Weinberg, J. M., and Lagakos, S. W., 2001: Efficiency comparisons of rank and permutation tests based on summary statistics computed from repeated measures data. Statistics in Medicine 15, 705–731. Received February 2001 Revised September 2001 Accepted February 2002