This study was reported according to the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [14].
Design
Inter-rater reliability study.
Study population
The study sample consisted of MRI referrals received by the radiology department at Regional Hospital Silkeborg (RHS), Denmark, in 2016. The referrals concerned patients ≥18 years with LBP, with or without leg pain, referred for an MRI of the lumbar spine by clinicians in the primary health care sector. In a Danish setting, this includes general practitioners (GPs), consultants (e.g., in rheumatology or neurology), and chiropractors in the RHS catchment area. The referrals were received by the radiology department and checked for contraindications to MRI.
During the data collection period, the radiology department’s procedure was to accept referrals from GPs even if the clinical indication for imaging was inappropriate. Some referrals did not contain sufficient information about absolute contraindications to MRI, such as implanted materials not compatible with MRI. In such cases, the referrals were returned to the referring clinician with a request for further information before acceptance. All referrals in this study were of patients who subsequently received an MRI of the lumbar spine.
Data collection
MRI referrals were received and stored in the Kodak Carestream Radiology Information System (RIS), version 6.3.0. The narrative texts were exported from the RIS archive, de-identified, and uploaded to REDCap (Research Electronic Data Capture) electronic data capture tools hosted at Aarhus University [21, 22].
Classification of referrals
Referrals were classified as compliant or non-compliant with the 2015 version of the ACR imaging appropriateness criteria for LBP [13]. The ACR appropriateness criteria concern MRI referrals for patients with LBP, radiculopathy, or both. They describe one scenario of inappropriate MRI referral (‘Variant 1’) and five scenarios of appropriate MRI referral (‘Variants 2–6’). A flow chart based on these scenarios was created to further operationalise the criteria (Fig. 1). Referrals that included information on red flags or a clinical indication for imaging were considered appropriate (Fig. 1, green box). These appropriate referrals were subdivided into the five categories predefined by the ACR criteria as Variants 2–6: Variant 2) suspicion of fracture (e.g., trauma, osteoporosis, chronic steroid use); Variant 3) suspicion of cancer, infection, immunosuppression or spondyloarthritis; Variant 4) candidate for surgery or intervention with persistent or progressive symptoms during or following six weeks of conservative management; Variant 5) new or progressing symptoms or clinical findings with a history of prior lumbar surgery; Variant 6) suspected cauda equina syndrome or rapidly progressive neurological deficit. If a referral did not include any of these conditions, it was deemed inappropriate (Fig. 1, red box). For this study’s purpose, the ACR appropriateness criteria were slightly modified by dividing the inappropriate referrals into three subcategories: 1) no information on previous non-surgical treatment, 2) no information on the duration of non-surgical treatment, or 3) other reasons. Details on the classification process are provided in Additional file 1.
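To make the operationalised decision flow explicit, a minimal sketch in Python is given below. The classification in the study was performed manually by the raters using Fig. 1; the boolean field names here are hypothetical illustrations rather than variables used in the study, and the order of checks simply follows the flow chart.

```python
# Illustrative sketch of the Fig. 1 decision flow. The raters classified
# referrals manually; the field names below are hypothetical.

def classify_referral(referral: dict) -> str:
    """Return one of the eight subcategories for a referral."""
    # Appropriate scenarios (ACR Variants 2-6; Fig. 1, green box).
    if referral.get("suspected_cauda_equina_or_rapid_neuro_deficit"):
        return "Variant 6"
    if referral.get("suspected_fracture"):
        return "Variant 2"
    if referral.get("suspected_cancer_infection_immunosuppression_spondyloarthritis"):
        return "Variant 3"
    if referral.get("prior_lumbar_surgery_with_new_or_progressing_symptoms"):
        return "Variant 5"
    if referral.get("surgical_candidate_after_six_weeks_conservative_care"):
        return "Variant 4"
    # Inappropriate (ACR Variant 1; Fig. 1, red box), subdivided into the
    # three study-specific subcategories.
    if not referral.get("mentions_non_surgical_treatment"):
        return "Inappropriate 1: no information on previous non-surgical treatment"
    if not referral.get("mentions_treatment_duration"):
        return "Inappropriate 2: no information on treatment duration"
    return "Inappropriate 3: other reasons"
```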
Three different classifications based on the ACR appropriateness criteria were evaluated in this study. Firstly, the most important evaluation in a clinical context is whether an MRI referral is appropriate or inappropriate (Fig. 1 (A)); therefore, the inter-rater reliability of classifying referrals into these two categories was tested. Secondly, the original ACR appropriateness criteria describe six categories, of which the five appropriate categories help the radiology department decide on the most appropriate imaging modality; therefore, the reliability of these six subcategories was tested (Fig. 1 (B)). Thirdly, as we modified the criteria by dividing the inappropriate referral category into three subgroups (see ‘Raters, training and consensus’), we found it relevant to also test the reliability of the modified criteria with eight subcategories to inform future research projects (Fig. 1 (C)).
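Under the same hypothetical labelling as the sketch above, the three granularities (A), (B) and (C) are simple collapsings of the eight subcategories:

```python
# Collapsing the eight subcategories (Fig. 1 (C)) into the six original
# ACR categories (Fig. 1 (B)) and the binary classification (Fig. 1 (A)).
# Labels follow the illustrative sketch above, not study variables.

def to_six_categories(subcategory: str) -> str:
    # The three inappropriate subgroups all map back to ACR 'Variant 1'.
    return "Variant 1" if subcategory.startswith("Inappropriate") else subcategory

def to_two_categories(subcategory: str) -> str:
    return "inappropriate" if subcategory.startswith("Inappropriate") else "appropriate"
```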
Raters, training and consensus
Four inexperienced students (chiropractic master’s students) and a senior clinician (chiropractor) were chosen as independent raters in this inter-rater reliability study. The senior clinician was part of the research group and had four years of experience managing referrals and reading spinal imaging (radiographs and MRI) at the radiology department at RHS, as well as 15 years of clinical experience with LBP patients in primary and secondary care. The four inexperienced raters were all in their final (fifth) year of the chiropractic master’s program and had no experience reviewing imaging referrals. Inexperienced raters were chosen to ensure that the rating could be performed by anyone, regardless of clinical knowledge of MRI referrals.
Before the inter-rater reliability study, an introduction and two training sessions were carried out to ensure a shared understanding of the classification criteria and to identify potential practical issues. The ACR appropriateness criteria were distributed, and a flow chart based on the ACR criteria was presented to the rater group (Fig. 1). For the two training sessions, nine and 10 MRI referrals, respectively, were randomly selected from a sample of approximately 1000 referrals and independently evaluated by each rater according to Fig. 1. Each rater’s final classification of the MRI referrals was registered in an Excel worksheet developed for data collection in the present study, based on the categories in Fig. 1.
In the first training session, nine randomly selected referrals were rated, and the raters agreed on the classification of six. The subsequent discussion made clear that the disagreement (three referrals) was caused by missing information in the referrals’ narrative text. In particular, ambiguous or missing information about non-surgical treatment and its duration led to subjective assessments by the raters and therefore to disagreement. For example, if a referral described a patient who had received physiotherapy ‘several times’ or had ‘regularly’ performed training, the duration of non-surgical treatment remained unclear. The raters agreed that the non-surgical treatment modality and the exact timeframe should be explicitly stated to reduce the risk of subjective assessments. As a result of this decision, the three subcategories described under ‘Classification of referrals’ were added to the ACR appropriateness model (Fig. 1 (C)). With the modified flow chart, the second training session was conducted with another 10 randomly selected lumbar spine MRI referrals, with agreement on eight of the 10. After discussing the two remaining referrals, the five raters reached consensus on all 10, and no further training was performed.
Sample size
The final study sample for the inter-rater reliability study consisted of 50 referrals, a sample size considered adequate for this type of study [23]. The referrals were randomly selected from the same sample of approximately 1000 referrals used for the training sessions.
Data entry and statistical analyses
All five raters independently reviewed the referrals and stored their data in the data collection sheet. Raters were blinded to the results of their fellow raters and to all information other than the tentative diagnosis, the date, and the referral’s narrative text.
For each rater, the prevalence of each category was estimated and tabulated to clarify potential systematic differences between raters and to enable assessment of the sample’s homogeneity. The observed agreement and the expected agreement were calculated for pairwise comparisons of all raters and for an overall five-rater comparison. Inter-rater reliability was quantified using Cohen’s Kappa statistics for two raters and Fleiss’ Kappa statistics for more than two raters [24]. Kappa values were reported with 95% confidence intervals (CIs). All calculations were performed for the two categories (appropriate versus inappropriate MRI referral), the six subcategories (one inappropriate and five appropriate classifications), and all eight subcategories (three inappropriate and five appropriate classifications).
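Both statistics contrast the observed agreement p_o with the agreement expected by chance p_e, as κ = (p_o − p_e) / (1 − p_e). As an illustrative sketch only (the study’s analyses were run in Stata), the same quantities could be computed in Python along these lines; the rating matrix below is placeholder data, not study data:

```python
# Sketch of the kappa computations; placeholder ratings, not study data.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# 50 referrals x 5 raters, coded 0-7 for the eight subcategories.
rng = np.random.default_rng(42)
ratings = rng.integers(0, 8, size=(50, 5))

# Pairwise Cohen's kappa for every pair of raters.
pairwise = {
    (i, j): cohen_kappa_score(ratings[:, i], ratings[:, j])
    for i in range(5) for j in range(i + 1, 5)
}

# Fleiss' kappa across all five raters.
counts, _ = aggregate_raters(ratings)  # referrals x categories count table
overall = fleiss_kappa(counts)

print(pairwise, overall)
```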
Kappa statistics were interpreted according to the six levels defined by Landis and Koch [25]: < 0.00 ‘Poor’, 0.00–0.20 ‘Slight’, 0.21–0.40 ‘Fair’, 0.41–0.60 ‘Moderate’, 0.61–0.80 ‘Substantial’, and 0.81–1.00 ‘Almost perfect’.
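This interpretation is a straightforward lookup; a small helper, under the same illustrative assumptions as the sketches above:

```python
# Landis & Koch interpretation of a kappa value [25].
def landis_koch(kappa: float) -> str:
    if kappa < 0.00:
        return "Poor"
    bounds = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
              (0.80, "Substantial"), (1.00, "Almost perfect")]
    for upper, label in bounds:
        if kappa <= upper:
            return label
    raise ValueError("kappa must be <= 1.0")
```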
Statistical analyses were performed by one of the co-authors (RKJ), who did not participate in the data collection. Data management and analysis were performed using Stata version 16.0 (StataCorp LLC, College Station, TX, USA).