Estimating Interrater Reliability

I have been into estimating interrater reliability lately. Here is a good summary: "A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability" by Steven E. Stemler.

This article argues that the common practice of describing interrater reliability as a single, unified concept is at best imprecise and at worst potentially misleading. Rather than representing a single concept, the various statistical methods for computing interrater reliability are more accurately classified into one of three categories based on the underlying goal of the analysis: 1) consensus estimates, 2) consistency estimates, and 3) measurement estimates. The assumptions, interpretation, advantages, and disadvantages of estimates from each category are discussed, along with several popular interrater reliability coefficients that fall under the consensus, consistency, and measurement umbrellas. Researchers and practitioners should be aware that different approaches to estimating interrater reliability carry different implications for how ratings from multiple judges should be summarized, which may affect the validity of subsequent study results.
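To make the consensus/consistency distinction concrete, here is a minimal Python sketch using made-up ratings from two hypothetical judges. It computes a consensus estimate (percent exact agreement and Cohen's kappa) and a consistency estimate (Pearson correlation); measurement estimates, which typically involve a latent-trait or variance-decomposition model, need dedicated tooling and are not shown here. The rating vectors and the 1-5 scale are assumptions for illustration, not data from the article.

```python
# Toy illustration of consensus vs. consistency estimates of interrater reliability.
# The ratings below are invented for demonstration purposes only.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores from two judges rating the same 10 essays on a 1-5 scale
judge_a = np.array([3, 4, 2, 5, 3, 4, 1, 2, 5, 4])
judge_b = np.array([3, 4, 3, 5, 2, 4, 1, 2, 4, 4])

# Consensus estimates: do the judges assign exactly the same score?
percent_agreement = np.mean(judge_a == judge_b)          # raw exact agreement
kappa = cohen_kappa_score(judge_a, judge_b)              # agreement corrected for chance

# Consistency estimate: do the judges rank the essays similarly,
# even if one judge is systematically harsher than the other?
r, _ = pearsonr(judge_a, judge_b)

print(f"Percent agreement: {percent_agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
print(f"Pearson r:         {r:.2f}")
```

The two families can diverge: if one judge scored every essay exactly one point lower than the other, the correlation would be perfect while exact agreement would be zero, which is why the choice of estimate matters for how the ratings get summarized.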