help for kapssi                                       (Author:  David Harrison)

Sample size calculations for kappa

Two unique raters, two ratings:

kapssi kappa, { se(#) | diff(#) [level(#)] | n(#) } p1(#) [ p2(#) round ]

Two or more (non-unique) raters, two ratings:

kapssi kappa, { se(#) | diff(#) [level(#)] | n(#) } p(#) [ m(#) round ]

Description

kapssi estimates the sample size required to estimate the kappa-statistic of inter-rater reliability for a binary outcome (with postulated value kappa) to a given standard error, or the standard error attained for a given sample size. If n() is specified, kapssi computes the standard error; otherwise, it computes the sample size. kapssi is an immediate command; all of its arguments are numbers (see help immed).

For two raters, the results are the same as those produced by sskdlg or sskapp (except for rounding; see the round option below), based on the asymptotic variance presented by Fleiss, Cohen, and Everitt (1969). Results for more than two raters are based on the asymptotic variance of the Fleiss-Cuzick estimator of kappa presented by Zou and Donner (2004) for the case of an equal number of ratings per subject.
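For the two-rater case, the calculation can be sketched in Python (an illustrative reimplementation, not kapssi's actual code; function names are my own). The 2x2 cell probabilities are reconstructed from the two marginal proportions and the postulated kappa, and the per-subject variance is the Fleiss-Cohen-Everitt large-sample formula:

```python
import math


def kappa_var1(kappa, p1, p2=None):
    """Per-subject asymptotic variance of the kappa-statistic for two
    raters and a binary outcome (Fleiss, Cohen, and Everitt 1969), so
    that Var(kappa_hat) = kappa_var1(...) / n."""
    if p2 is None:
        p2 = p1                            # kapssi's default: p2 equal to p1
    q1, q2 = 1 - p1, 1 - p2
    pe = p1 * p2 + q1 * q2                 # chance-agreement probability
    # 2x2 cell probabilities implied by the marginals and postulated kappa:
    # kappa = (po - pe) / (1 - pe) with po = p11 + p22 fixes all four cells.
    d = kappa * (1 - pe) / 2
    p11, p22 = p1 * p2 + d, q1 * q2 + d    # agreement cells
    p12, p21 = p1 * q2 - d, q1 * p2 - d    # disagreement cells
    a = (p11 * (1 - (p1 + p2) * (1 - kappa)) ** 2
         + p22 * (1 - (q1 + q2) * (1 - kappa)) ** 2)
    b = (1 - kappa) ** 2 * (p12 * (p2 + q1) ** 2 + p21 * (p1 + q2) ** 2)
    c = (kappa - pe * (1 - kappa)) ** 2
    return (a + b - c) / (1 - pe) ** 2


def kappa_sample_size(kappa, se, p1, p2=None):
    """Smallest n giving a standard error no larger than se; kapssi
    rounds up with ceil() unless the round option is specified."""
    return math.ceil(kappa_var1(kappa, p1, p2) / se ** 2)
```

As a sanity check, for p1 = p2 = 0.5 the per-subject variance reduces to 1 - kappa^2.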

Options

se(#) specifies the standard error of kappa.

diff(#) specifies the half width of the confidence interval for kappa as an alternative to the standard error.

level(#) specifies the confidence level, as a percentage, for the confidence interval; the default is obtained from set level (see help level), usually level(95).

n(#) specifies the sample size for which to calculate standard error.

p1(#) specifies the proportion of positive results reported by rater 1 (of two raters).

p2(#) specifies the proportion of positive results reported by rater 2 (of two raters); if p2() is not specified, it is assumed to be equal to p1().

p(#) specifies the overall proportion of positive results (multiple raters).

m(#) specifies the number of raters; the default is m(2).

round specifies that the sample size is to be rounded to the nearest integer; the default is to round up using the ceil() function. This option allows results for two raters to reproduce those of sskdlg and sskapp, which both round to the nearest integer.
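The diff() and level() options are simply a reparameterization of se(): the half width of a confidence interval is z times the standard error, where z is the two-sided normal quantile for the chosen level. A minimal Python sketch of that conversion (using statistics.NormalDist in place of Stata's invnormal(); the function name is illustrative):

```python
from statistics import NormalDist


def halfwidth_to_se(diff, level=95):
    """Standard error implied by a confidence-interval half width:
    diff = z * se, with z the two-sided normal quantile for level."""
    z = NormalDist().inv_cdf(1 - (1 - level / 100) / 2)
    return diff / z
```

So diff(.2) with the default level(95) corresponds to se(.2/1.96), approximately se(.102).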

Examples

Two raters. Compute sample size given standard error:

. kapssi .8, se(.1) p(.1)

Compute sample size given half width of confidence interval:

. kapssi .6, diff(.2) p1(.15) p2(.12) round

This is equivalent to:

. sskapp, p1(.15) p2(.12) diff(.2) kapp(.6)

More than two raters. Compute sample size:

. kapssi .75, se(.12) p(.05) m(3)

Compute standard error for given sample size:

. kapssi .8, n(100) p(.12) m(4)

References

Fleiss, J. L., J. Cohen, and B. S. Everitt. 1969. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 72: 323-327.

Zou, G., and A. Donner. 2004. Confidence interval estimation of the intraclass correlation coefficient for binary outcome data. Biometrics 60: 807-811.

Author

David A. Harrison
Intensive Care National Audit & Research Centre
david@icnarc.org

Also see

Online: help for kappa, sskdlg, sskapp, immed