Inter-rater Reliability Ensures Consistency

In a previous article, we focused on determining content validity using the Lawshe method when gauging the quality of an assessment that’s been developed “in-house.” As a reminder, content validity pertains to how well each item measures what it’s intended to measure, and the Lawshe method determines the extent to which each item is necessary and appropriate for the intended group of test takers. In this piece, we’ll zero in on inter-rater reliability.

Internally Created Assessments Often Lack Quality Control

Many colleges and universities use a combination of assessments to measure their success. This is particularly true when it comes to accreditation and the process of continuous program improvement. Some of these assessments are proprietary, meaning that they were created externally, typically by a state department of education or an assessment development company. Other assessments are internally created, meaning that they were created by faculty and staff inside the institution. Proprietary assessments have been tested against quality indicators such as validity and reliability. However, it’s uncommon for institutional staff to confirm these elements in the assessments that are created in-house. In many cases, a department head determines that they need an additional data source, so they tap faculty members on the shoulder to quickly create something they think will suffice. After a quick review, the instrument is approved and goes “live” without piloting or additional quality control checks.

Skipping these important quality control methods can wreak havoc later on, when an institution attempts to pull data and use it for accreditation or other regulatory purposes. Just as a car will only run well when its tank is filled with the right kind of fuel, data are only as good as the assessment itself. Without reliable data that yield consistent results over multiple administrations, it’s nearly impossible to draw conclusions and make programmatic decisions with confidence.

Inter-rater Reliability

One quality indicator that’s often overlooked is inter-rater reliability. In a nutshell, this is a fancy way of saying that an assessment will yield consistent results over multiple administrations by multiple evaluators. We most often see this used in conjunction with a performance-based assessment such as a rubric, where faculty or clinical supervisors go into the field to observe and evaluate the performance of a teacher candidate, a nursing student, a counseling student, and so on. A rubric could also be used to evaluate a student’s professional dispositions at key intervals in a program, course projects, and the like.

In most instances, a program is large enough to have more than one clinical supervisor or faculty member in a given course who observes and evaluates student performance. When that happens, it’s extremely important that each evaluator rates student performance through a common lens. If, for example, one evaluator consistently rates student performance much higher or much lower than colleagues do in key areas, it can skew data dramatically. Not only is this grading inconsistency unfair to students, but it’s also highly problematic for institutions that are trying to make data-informed decisions as part of their continuous program improvement model. Thus, we must determine inter-rater reliability.


Using Percent Paired Agreement to Determine Inter-rater Reliability

One common way to determine inter-rater reliability is through the percent paired agreement method. It’s actually the simplest way to say with confidence that supervisors or faculty members who evaluate student performance based on the same instrument will rate them similarly and consistently over time. Here are the basic steps involved in determining inter-rater reliability using the percent paired agreement method:

Define the behavior or performance to be assessed: The first step is to define precisely what behavior or performance is to be assessed. For example, if the assessment is of a student’s writing ability, assessors must agree on what aspects of writing to evaluate, such as grammar, structure, and coherence, as well as any emphasis or weight that should be given to specific criteria categories. This is often already decided when the rubric is being created.

Select the raters: Next, select the clinical supervisors or faculty members who will assess the behavior or performance. It is important to choose evaluators who are trained in the assessment process and who have sufficient knowledge and experience to assess the behavior or performance accurately. Having two raters for each item is ideal—hence the name paired agreement.

Assign samples to each rater for review: Assign a sample of rubrics to each evaluator for independent evaluation. The sample should be large enough to yield meaningful results. For example, in a class of 100 students, it may be helpful to pull work samples from 10% of the group for this exercise. The samples should be either random or representative of all levels of performance (high, medium, low).

Compare results: Compare the results of each evaluator’s ratings of the same performance indicators using a simple coding system. For each item where raters agree, code it with a 1. For each item where raters disagree, code it with a 0. This is called an exact paired agreement, which I recommend over an adjacent paired agreement. In my opinion, the more precise we can be the better.

Calculate the inter-rater reliability score: Calculate the inter-rater reliability score based on the level of agreement between the raters. Divide the number of agreements between the two raters by the total number of items, then multiply the result by 100 to express it as a percentage. For example, if two raters independently score 10 items, and they agree on 8 of the items, then their inter-rater reliability would be 80%. This means that the two raters were consistent in their scoring 80% of the time. A high score indicates a high level of agreement between the raters, while a low score indicates a low level of agreement.

Interpret the results: Finally, interpret the results to determine whether the assessment is reliable within the context of paired agreement. Of course, 100% is optimal, but the goal should be to achieve a paired agreement of 80% or higher for each item. If the inter-rater reliability score is high, it indicates that the data harvested from that assessment are likely to be reliable and consistent over multiple administrations. If the score is low, it suggests that those items on the assessment need to be revised, or that additional evaluator training is necessary to ensure greater consistency.
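
For readers who keep rubric scores in a spreadsheet or script, the comparison, calculation, and interpretation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 1 to 4 rating scale, and the sample scores are hypothetical and not taken from any real instrument. The 80% target is the one suggested above.

```python
from typing import Sequence

# Target from the steps above: 80% exact paired agreement or higher.
MIN_AGREEMENT = 80.0


def percent_paired_agreement(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Exact percent paired agreement between two raters' scores."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must score the same items.")
    if not rater_a:
        raise ValueError("At least one rated item is required.")
    # Code each item 1 when the raters agree exactly, 0 when they disagree.
    codes = [1 if a == b else 0 for a, b in zip(rater_a, rater_b)]
    # Agreements divided by total items, expressed as a percentage.
    return 100 * sum(codes) / len(codes)


def per_item_agreement(samples_a, samples_b):
    """Percent agreement for each rubric item across all sampled students.

    Each row is one student's rubric; each column is one rubric item.
    """
    items_a = list(zip(*samples_a))
    items_b = list(zip(*samples_b))
    return [percent_paired_agreement(a, b) for a, b in zip(items_a, items_b)]


# Hypothetical scores on a 1-4 scale: two raters, the same ten items.
rater_a = [3, 4, 2, 3, 1, 4, 3, 2, 4, 3]
rater_b = [3, 4, 2, 2, 1, 4, 3, 3, 4, 3]

score = percent_paired_agreement(rater_a, rater_b)
meets_target = score >= MIN_AGREEMENT
print(f"Paired agreement: {score:.0f}% (meets target: {meets_target})")
```

With these made-up scores the raters agree on 8 of 10 items, which matches the 80% example above. The per_item_agreement helper applies the same calculation item by item across several students' rubrics, which is useful for spotting the specific rubric items that may need revision or additional evaluator training.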

Like determining content validity using Lawshe, the percent paired agreement method in determining inter-rater reliability is straightforward and practical. By following these steps, higher education faculty and staff can use the data from internally created assessments with confidence as part of their continuous program improvement efforts.


About the Author: A former public school teacher and college administrator, Dr. Roberta Ross-Fisher provides consultative support to colleges and universities in quality assurance, accreditation, educator preparation and competency-based education. Specialty: Council for the Accreditation of Educator Preparation (CAEP).  She can be reached at: