Degenerative findings on MRI of the cervical spine: an inter- and intra-rater reliability study

Background Knowledge about the assessment reliability of common cervical spine changes is a prerequisite for precise and consistent communication about Magnetic Resonance Imaging (MRI) findings. The purpose of this study was to determine the inter- and intra-rater reliability of degenerative findings when assessing cervical spine MRI. Methods Fifty cervical spine MRIs from subjects with neck pain were used. A radiologist, a chiropractor and a second-year resident of rheumatology independently assessed kyphosis, disc height, disc contour, vertebral endplate signal changes, spinal canal stenosis, neural foraminal stenosis, and osteoarthritis of the uncovertebral and zygapophyseal joints. An evaluation manual was composed containing classifications and illustrative examples, and ten of the MRIs were evaluated twice followed by consensus meetings to refine the classifications. Next, the three readers independently assessed the full sample. Reliability measures were reported using prevalence estimates and unweighted kappa (Κ) statistics. Results The overall inter-rater reliability was substantial (Κ ≥ 0.61) for the majority of variables and moderate only for zygapophyseal osteoarthritis (Κ = 0.56). Intra-rater reliability estimates were higher for all findings. Conclusions The present classifications for some of the most common cervical degenerative findings yielded mainly substantial inter-rater reliability estimates and substantial to almost perfect intra-rater reliability estimates. . Trial registration Regional Data Protection Agency (J.no. 1–16–02-86-16). The letter of exemption from the Regional Ethical Committee is available from the author on request (case no. 86 / 2017). Electronic supplementary material The online version of this article (10.1186/s12998-018-0210-2) contains supplementary material, which is available to authorized users.


Background
Although not recommended as routine imaging in neck pain [1,2], the number of cervical MRIs has increased by 18% compared to a 4.5% increase in neck pain prevalence over recent years in Denmark [2][3][4]. While patients believe in MRI to unveil the true cause of their pain [5], health care professionals appreciate the advantages of MRI compared with other modalities of diagnostic imaging. The non-invasiveness, absence of radiation exposure and the capacity to discriminate soft tissue changes are all highly valued in the field of musculoskeletal imaging.
When communicating MRI findings, the importance of consistency and precision remains unaltered. Both for academic and clinical purposes, a prerequisite for such consistency and precision is reliability in MRI assessments. Reliability is defined as "the extent to which scores for patients who have not changed are the same for repeated measurement under several conditions" [6]. In the case of MRI, this means that while the images do not change, reliability reflects whether the image interpretation remains the same when assessed by different raters (inter-rater reliability) or by the same rater at different times (intra-rater reliability).
To our knowledge, only one reliability study on cervical spine MRI has covered a broad range of common degenerative findings [14] for which reason, further studies are needed.

Objective
To determine the inter-and intra-rater assessment reliability of degenerative findings (kyphosis, disc height, disc contour, vertebral endplate signal changes, spinal canal stenosis, neural foraminal stenosis, uncovertebral osteoarthritis and zygapophyseal osteoarthritis) on MRI of the cervical spine.

Subjects
Fifty MRIs of the cervical spine were chosen from among subjects previously enrolled in a randomized controlled trial (RCT) [15]. Subjects for the RCT were recruited from primary health care professionals (physiotherapists, chiropractors and general practitioners (GPs)). If subjects fulfilled the inclusion criteria (age 18-60 years, part-time or full-time sick leave for 4-16 weeks owing to neck pain or shoulder pain, and fluency in Danish), their GPs referred them to The Spine Centre, Silkeborg Regional Hospital, Denmark. For the current study, the predefined inclusion criterion was the availability of a cervical spine MRI with a satisfactory signal-to-noise ratio. After assessment by the most experienced reader, 32 MRIs were excluded based on unsatisfactory signal-to-noise ratio. By choosing every second MRI among those remaining, 50 MRIs were selected for the current study. A study flow-chart is seen in Fig. 1.

Data collection -images
The MRIs were provided from five different hospitals collaborating with The Spine Centre. The majority of the images were obtained using a 1.5 T field strength. All MRIs comprised sagittal T1-weighted and T2-weighted sequences, while an axial T2 sequence was available for 94% and oblique T2 sequences were available for 82% of the images.

Data collectionreaders
The three readers (Readers A, B and C) all assessed the images independently over a time frame of 5-8 weeks.
Reader A was a second-year resident of rheumatology with no previous formal education in MRI assessment. She had 9 years of postgraduate clinical experience including assessment of spinal MRI for clinical purposes. Reader B was an experienced radiologist having worked with musculoskeletal MRI for 25 years, mostly on a daily basis. Reader C was a chiropractor who had completed a 1-year fulltime internship in spinal MRI in a radiology department. He had another 10 years of clinical and academic experience with spinal MRI. Prior to the study, Reader B taught Reader A assessment of cervical spine MRI for 2 h. Following this two-hour session, Reader A completed 50 clinical narrative reports of cervical spine MRIs from patients with neck pain with or without radiculopathy. These were not part of the current study. The reports were corrected if necessary and approved by Reader B.
For the intra-rater reliability assessment, Reader A assessed all the images twice. The second assessment took place after 6 weeks to prevent recollection of the first assessments.

Evaluation manual, piloting and workstations
Based on the literature [10][11][12][13][14][16][17][18][19][20][21][22][23][24], an evaluation manual with written and visual classifications of the findings was made by Reader A, adjusted and approved by Readers B and C. Next, 10 MRIs from the study sample were evaluated twice followed by consensus meetings. This piloting served the purpose of refining both the classifications in the evaluation manual and the practice of the readers. All images were de-identified, leaving

Variables
Classifications for common and degenerative MRI findings were developed based primarily on the existing literature [10-13, 16-19, 23-26] and on experiences from the piloting. An effort was made to create definitions that were as simple as possible [14], assuming that simplicity is essential for clinical applicability. The most common degenerative findings were chosen, including kyphosis and vertebral endplate signal changes; all are routinely considered by radiologists assessing cervical spine MRIs at Silkeborg Regional Hospital. All the classifications yielded categorical (but not ordinal) data. The complete list of variables is presented in Table 1. Except for kyphosis, these findings were assessed for each of the six cervical disc levels (level C2/C3 to C7/T1). Furthermore, the neural foramina, uncovertebral and zygapophyseal joints were assessed separately on the left and right hand side. The evaluation manual is available in Additional file 1.

Data entry and statistical analysis
All three readers independently entered and stored data using Epidata (Version 3.1., The EpiData Association, Odense, Denmark, 2003Denmark, -2004. If assessment of a certain finding was not possible due to the available sequences, the particular finding was allotted the value '9' representing 'missing'. In accordance with the recommendations for reliability studies [27], 50 MRIs were included in the current study. Prior to the kappa (Κ) calculations, all readers' prevalence assessments were calculated, one variable at a time. This tabulation of data offered the opportunity of 1) assessing the sample homogeneity and 2) identifying any possible systematic differences between the readers; as both can affect the Κ estimates [27,28]. Tabulation thus allowed for a clearer impression of agreement and possible misclassification than offered by the Κ value alone. Tabulation also provided estimates for observed agreement (OA) and agreement by chance (AC) for the pairwise analyses. For the overall three-reader analysis, OA was calculated by computing the number of observations with complete agreement and dividing this number with the total number of anatomical sites assessed. The three-reader AC was calculated by multiplication of  [27]. Reliability measures were computed using unweighted kappa statistics owing to the categorical (as opposed to ordinal) nature of the data. Given the condition of total independence among the readers, Κ is defined as where OA is observed agreement and AC agreement by chance [29]. Reliability measures were computed for the readers in pairs (A1B1, A1C1, B1C1, A1A2) and over-all (A1B1C1). Acknowledging the influence of prevalence on the Κ estimates [27,28], these were only computed whenever the readers in question agreed on prevalences ≥10%. For each disc level, the left and right hand side assessments of neural foraminal stenosis, uncovertebral and zygapophyseal osteoarthritis were pooled before computing reliability estimates. The interpretation of Κ values followed the suggestions by Landis & Koch [29]: Κ values were reported using 95% confidence intervals and additional information on OA and AC were supplied for all findings. Analyses were performed using the STATA (version 15.0; Stata Corporation, College Station, Texas, USA) software package.

Ethics
All subjects provided written informed consent. The study was approved by the Regional Data Protection Agency (J.no. 1-16-02-86-16). Approval by the regional ethical committee was not needed due to the study's methodological nature. The letter of exemption from The Central Denmark Region Committees on Health Research Ethics is available from the author on request (case no. 86 / 2017).

Results
The majority of the subjects were female (n = 31; 62%) with a mean age of 43.7 years (SD = 9.2). The prevalence of positive findings for all readers can be seen in Additional file 2. For vertebral endplate signal changes, prevalence estimates were below 10% and thus too low for Κ statistics. For the remaining degenerative findings, prevalence estimates allowed for kappa statistics including one to six anatomical sites (e.g. 2 disc levels~100 observations included in Κ analysis for spinal canal stenosis). Further scrutiny of the prevalence table revealed a slight tendency for Reader C to assign the label "reduced disc height" more frequently. Otherwise no systematic differences among the readers were identified.
As shown in Table 2, the overall inter-rater reliability (A1B1C1) ranged from moderate to almost perfect for the majority of the findings (substantial to almost perfect for kyphosis and neural foraminal stenosis; moderate to almost perfect for spinal canal stenosis; and moderate to substantial for disc height, disc contour, uncovertebral and zygapophyseal osteoarthritis). Exploratory analyses were made to assess the inter-rater reliability of neural foraminal stenosis when including only MRIs with oblique images (Additional file 3). This did not change the reliability estimates but broadened the confidence intervals slightly.
The intra-rater reliability estimates (Table 3) were slightly better than those for inter-rater reliability. Almost perfect reliability was found for kyphosis and substantial to almost perfect reliability for disc contour, uncovertebral osteoarthritis and neural foraminal stenosis. For spinal canal stenosis and zygapophyseal osteoarthritis, moderate to almost perfect intra-rater reliability was found while moderate to substantial reliability was found for disc height.

Discussion
To our knowledge, this is the first reliability study covering eight common cervical MRI findings. The overall inter-rater reliability was substantial for all variables except zygapophyseal osteoarthritis where moderate reliability was found. Intra-rater reliability was substantial for the majority of variables and almost perfect for kyphosis. These reliability estimates reflect that the observed agreement notably exceeds the agreement that can be expected by chance.
For disc degeneration, other studies [9,12] reported higher reliability estimates than the disc height estimates in the current study. Although the use of intraclass correlation coefficient in the study by Jacobs et al. [12] does not allow for direct comparison, possible explanations for the reliability differences are the use of a ubiquitously accessible reference image of a normal disc [12] and the notable experience among readers with the same educational background [9].
For disc contour, the reliability estimates were similar to those of other studies despite the fact that we used a three-category classification compared to the previously reported dichotomous classifications [8,30,31] and comparison of more experienced readers [30,31].
For spinal canal stenosis, the current study's unweighted reliability estimates exceeded those previously reported by use of weighted kappa statistics [13,32], although the use of weights are expected to yield higher estimates. A higher   number of readers (six [13] and nine [32]) could explain this difference, but even when compared to the three most experienced readers in these studies, better reliability estimates were still achieved in the current study. The most probable reason appears to be the limited introduction of their classification [13,32]. When using both written and visual descriptions, our moderate to almost perfect reliability among readers with considerable experience differences suggest good applicability of this classification of spinal canal stenosis. For zygapophyseal osteoarthritis, both the intra-and inter-rater reliability estimates were better than previously reported [11], which is most likely explained by the use of a dichotomous variable in the current study compared to a classification with four severity categories [11].
For neural foraminal stenosis, this study still achieved higher reliability estimates compared to studies with more experienced readers [30,31]. The inferior reliability estimates may be explained by unclear definitions [30] and by low prevalence estimates together with images obtained using a 0.5 T field strength [31]. Compared to the study from which we modified the classification of neural foraminal stenosis [10], the current study was unable to reach the same almost perfect reliability estimates (Κ > 0.9). Nevertheless, we consider the substantial to almost perfect reliability to be satisfactory, bearing in mind differences in reader experience and the heterogeneous image material (i.e. images with different field strengths and available sequences). The modified classification (dichotomous versus the original four categories) proved reliable and the association with clinical findings has previously been reported [33].

Methodological considerations
A limitation of the study is that it was not preceded by a power calculation. However; the confidence intervals for the Κ estimates only comprised more than two levels (e.g. from moderate to almost perfect for spinal canal stenosis) in a minority of cases. A larger sample would have narrowed the confidence intervals but would probably not have caused substantial changes in the reliability estimates.
Another limitation is the involvement of only reader A in the intra-rater analysis. Two considerations explain this: 1) previous reliability studies found higher [7-9, 12, 14, 21] or similar/higher [10,11,13] intra-rater reliability than inter-rater reliability and 2) involvement of reader A was necessary since a future prognostic study will involve MRI assessments performed by reader A. As for the inter-rater reliability, the study included three readers, only one of these being a radiologist. However, the results suggest that our method is applicable among other health care professionals (i.e. rheumatologists and chiropractors) in a controlled research setting. Involvement of other relevant healthcare professionals, e.g. spine surgeons, would have been desirable but was unfortunately not possible.
Owing to the properties of Κ, the measure does not disentangle systematic and random misclassification [28]. Therefore, we provided the prevalence tables from which we find no suspicion of systematic misclassification.
The prevalence table discloses a notable difference in the number of disc levels assessed for disc contour on levels C2/C3, C3/C4 and C7/T1: Reader A assessed fewer levels than Readers B and C owing to the lack of axial images of the selfsame disc levels. This discrepancy suggests a difference among the readers, and whether this partly explains why higher reliability estimates were not achieved for disc contour cannot be refuted.
Another potential limitation is that all MRIs were derived only from individuals with neck pain. But since cervical spine MRI is seldom performed in patients without neck pain and since the future use of the evaluation manual applies to patients with neck pain, we consider the current sample appropriate for its purpose.
Finally, a potential limitation of the study is the heterogeneous image material (MRIs were performed at five different hospitals. Different field strengths and sequences were available). Yet, as it resembles everyday clinical practice, this was an intended challenge and an attempt was made to manage this heterogeneity by using a standardized evaluation manual. The differences between OA and AC (Tables 2 and 3) reflect that both inter-and intra-rater agreement notably exceed the agreement that can be expected by chance. Furthermore, the high levels of observed agreement reflect only a minor degree of misclassification. Based on these observations of OA, our interpretation is that the evaluation manual and the standardized procedures explain the high levels of agreement rather than pure chance when assessing heterogeneous images.
Ultimately, the heterogeneous image material and the use of three different health care professionals both add to the generalizability and thus constitute strengths of the study. The blinding of the readers, the use of simple and easily comprehensible classifications along with regular encouragement to follow the evaluation manual, are other important strengths of the study.
In contrast to the controlled settings of the current study, a study comparing narrative MRI reports demonstrated considerable variability [34]. In this study [34], a patient with low back pain and right L5 radicular symptoms had lumbar spine MRI performed at 10 different MRI centers within 3 weeks. Comparison of the 10 narrative reports revealed considerable variability; none of the 49 described findings occurred in all 10 reports and only one finding occurred in nine reports. Even if this amount of variability is unusually large [34], it supports our clinical experience that variability also prevails in the interpretation of cervical spine MRIs. A possible way to overcome this is by using classifications sufficiently comprehensible to be applied 1) by different health care professionals and 2) when assessing heterogeneous images from different MRI scanners. Such classifications were presented in the current study. Confirmatory studies will be needed. If those studies were to involve experienced radiologists, provide proper training for lesser experienced MRI readers, and use an evaluation manual, better reliability might be achieved in clinical settings. So far, the results suggest that the evaluation of MRI findings can be used in controlled research settings studying individuals with neck pain. Suggestions for future research include comparison of reliability with and without the use of an evaluation manual. Also, including more than one of each health care professional could allow for comparison of experience levels both among and within different types of health care professionals.

Conclusions
In conclusion, the current study found substantial reliability for the majority of included MRI findings. This suggests that the present classifications are sufficiently comprehensible to be applied by different health care professionals when assessing images from different MRI scanners. In our view, the proposed classifications are sufficiently reliable to be used for both quality assurance and further research purposes.