Among the four dozen MP studies included in an annotated review of MP reliability studies [1], Potter et al. [26] were the only investigators to have used an SSS method and an ICC analysis similar to those of the studies included in the present analysis. Since theirs was an intraexaminer study, unlike the present study, and since its examination panel included postural and movement asymmetry in addition to MP, its results cannot be directly compared with the present results. The palpators in the present study did not verbally interact with the subjects, ensuring that spinal segmental stiffness alone, unconfounded by subjective information concerning pain or tenderness, was central to the identification of dysfunctional spinal segments.
Broadly speaking, the subset analyses in Tables 3, 4 and 5 support each of the following statements: (a) increased examiner confidence was associated with increased interexaminer reliability; (b) interexaminer reliability was greater in the cervical and lumbar spines than in the thoracic spine; and (c) examiner confidence had a more variable impact on examiner agreement in the regional analyses than in the whole dataset. These trends are especially visible in the data for the thoracic spine and for the combined dataset. In the thoracic spine, MedAED was 0.7 VE when both examiners were confident, but increased to 1.3 VE when at least one examiner lacked confidence and to 2.2 VE when both lacked confidence. In the combined dataset, MedAED was 0.6 VE when both examiners were confident, increasing to 1.0 VE when one examiner lacked confidence and to 1.8 VE when neither was confident.
The subjects were relatively homogeneous in their SSSs (Fig. 2), the most frequently identified SSS being C6 in the cervical spine, T7 in the thoracic spine, and L3 in the lumbar spine. ICC is not the ideal index of interexaminer reliability when, as in these studies, the subjects are relatively homogeneous; in that circumstance, ICC becomes misleadingly low [27]. This results from the fact that ICC is the ratio of the between-subjects variance to the total variance (the sum of within-subject and between-subjects variance). When between-subjects variance is relatively low, ICC diminishes even when the examiners largely agree. To illustrate how sensitive ICC is to subject homogeneity, the previously published lumbar study [15] constructed a hypothetical dataset in which examiner differences were identical to those seen in the actual dataset, but in which the SSS findings were more evenly distributed across the lumbar spine. ICC(2,1) increased from 0.39 (“poor”) in the real dataset to 0.70 (“good”) in the hypothetical dataset, despite examiner differences being equal, subject for subject, in the two datasets.
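To make this sensitivity concrete, the following minimal Python sketch (hypothetical values, not the study data) implements ICC(2,1) from its two-way ANOVA mean squares and shows that identical examiner differences yield a much higher ICC when the subjects' SSS locations are more spread out:

```python
import numpy as np

def icc_2_1(Y):
    """ICC(2,1): two-way random effects, absolute agreement, single measure
    (Shrout & Fleiss). Y is an (n subjects) x (k raters) array."""
    n, k = Y.shape
    grand = Y.mean()
    ssr = k * ((Y.mean(axis=1) - grand) ** 2).sum()   # subjects (rows)
    ssc = n * ((Y.mean(axis=0) - grand) ** 2).sum()   # raters (columns)
    sse = ((Y - grand) ** 2).sum() - ssr - ssc
    msr, msc = ssr / (n - 1), ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

rng = np.random.default_rng(0)
diffs = rng.normal(0.0, 1.0, 30)           # examiner differences, identical in both scenarios
homogeneous = rng.normal(10.0, 0.8, 30)    # SSS locations clustered near one level
spread_out = rng.normal(10.0, 3.0, 30)     # SSS locations distributed across the region
for sites in (homogeneous, spread_out):
    Y = np.column_stack([sites, sites + diffs])
    print(round(icc_2_1(Y), 2))            # ICC rises as between-subjects variance grows
```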
To offset these misleadingly low ICC values, the authors emphasized indices of interexaminer reliability that are immune to subject homogeneity: MeanAED, MedAED, and LOA [20, 21]. MedAED calculations are especially preferred [23, 24] because they are immune to the impact of extreme values [23], which conversely do influence MeanAED and LOA. From a clinician's point of view, it ought to be intuitively obvious that occasional large differences in two examiners' determinations of the SSS ought not detract from the clinical utility of an examination protocol whose use usually results in agreement on the SSS or the motion segment including it. The MedAED calculations reinforce confidence in the protocol, and the insensitivity of median calculations to extreme values accounts for why the MedAED values were generally smaller than the MeanAED values in this study. Although either MedAED or LOA calculations might have sufficed on their own, it was deemed more convincing to deploy both and check for consistency between methods. Of the two, LOA provide the more conservative estimates of examiner agreement, as explained below.
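A small numerical sketch (hypothetical differences, not the study data) makes the robustness argument concrete: one rare, large disagreement inflates MeanAED markedly while leaving MedAED nearly unchanged.

```python
import numpy as np

aed = np.array([0.2, 0.4, 0.5, 0.5, 0.7, 0.8, 1.0])   # hypothetical |differences| in VE
aed_with_outlier = np.append(aed, 7.0)                 # add one rare, large disagreement
for d in (aed, aed_with_outlier):
    print(f"MeanAED = {np.mean(d):.2f} VE, MedAED = {np.median(d):.2f} VE")
# MeanAED jumps from 0.59 to 1.39 VE; MedAED moves only from 0.50 to 0.60 VE.
```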
Interpretation of the subsets in Tables 3 and 4, which are stratified by spinal region and examiner confidence, becomes misleading as the size of the subsets diminishes. When a sample is small, the results of an analysis can be altered considerably by shifting a very small number of data points from one clinical result to another. Walsh [28] has described a Fragility Index: “the minimum number of patients whose status would have to change from a nonevent to an event to turn a statistically significant result to a non-significant result.” As an example using the lumbar ICC values, had the two examiners exactly agreed on subject 13, rather than disagreed by 7.1 cm (the largest disagreement in this subset), the ICC(2,1) for all subjects in the lumbar subset would have increased from the reported 0.39 to 0.46, and the interpretation would have changed from “poor” to “fair.” Likewise, had the two examiners disagreed on subject 32 by 7.1 cm rather than exactly agreed, the ICC(2,1) for the N = 15 subset in which at least one examiner lacked confidence would have decreased from 0.52 to 0.43. Thus, shifting only two of 34 data points would have negated the otherwise paradoxical finding in the actual lumbar study that less confidence in the lumbar spine was associated with smaller examiner differences.
The columns labeled MAD in Table 4 represent the degree of data dispersion, i.e., how spread out the data are. MAD paints a more complete picture than the more typically reported range, the simplest measure of dispersion, defined as the difference between the maximum and minimum values. The primary problem with reporting the simple range is that it is heavily influenced by extreme minimum or maximum values. Standard deviation and variance, although very widely used to assess dispersion in normal distributions, are also affected by outliers, since a data point very distant from the others can substantially increase their computed values; moreover, because the distances from the mean are squared, large deviations are weighted more heavily. MADmedian is robust to such extreme values (i.e., it is not affected by them), since a larger extreme value has no greater impact than a smaller one. The primary strength of MADmedian is also an important weakness: the extreme values it discounts, lying well beyond the lower and upper quartiles of examiner differences, may represent an important characteristic of the examination method under investigation.
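The same contrast can be illustrated in a short sketch (again with hypothetical values): adding a single extreme disagreement inflates the range and the standard deviation but leaves MADmedian essentially untouched.

```python
import numpy as np

def mad_median(x):
    """Median absolute deviation from the median (MADmedian)."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

d = np.array([0.2, 0.4, 0.5, 0.5, 0.7, 0.8, 1.0])   # hypothetical differences (VE)
d_out = np.append(d, 7.0)                           # one extreme disagreement
for x in (d, d_out):
    print(f"range = {np.ptp(x):.1f}, SD = {np.std(x, ddof=1):.2f}, "
          f"MADmedian = {mad_median(x):.2f}")
# Range grows from 0.8 to 6.8 VE and SD from 0.27 to 2.28 VE; MADmedian stays at 0.20.
```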
To assist in interpreting the findings for MedAED (Fig. 4), consider a case in which the first examiner has judged the SSS to lie at the exact middle of a given segment. So long as the second examiner identifies an SSS that is not more than 1.5 segments away, it can be stated that the examiners at least agreed on the motion segment including the SSS, and may have agreed on the SSS itself. Given the findings of the first examiner, this agreement may have occurred on the motion segment above or below. To be clear, this does not imply that their findings spanned two motion segments; it simply means that in some cases they agreed on the motion segment above that identified by examiner 1, and in other cases on the one below. If, on the other hand, an examiner had identified the SSS at the most inferior or superior aspect of a given segment, the other examiner must not have disagreed by more than 1.0 VE for the two to have identified the same segment or the motion segment containing it; this happened 60.2% of the time. In this study, as can be seen in the box-and-whisker plot (Fig. 3), examiner differences were ≤1.5 VE 77.0% of the time in the combined dataset. It would be very difficult, if not impossible, to tease more accurate numbers from these studies, so as to know whether the frequency of missing by more than one motion segment is closer to 23.0% or 39.8%. Doing so would require untenable assumptions as to how the locations the examiners actually touched in the three different anatomical regions (the articular pillars in the cervical spine, the transverse processes in the thoracic spine, and the spinous processes in the lumbar spine) related to the actual centers of the vertebrae.
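The geometric argument above reduces to three bands of absolute examiner difference. The following classifier is purely illustrative (the study reported proportions, not such a function):

```python
def motion_segment_agreement(abs_diff_ve: float) -> str:
    """Classify an absolute examiner difference in vertebral equivalents (VE).
    <=1.0 VE guarantees agreement on the motion segment regardless of where
    within the segment examiner 1 marked; 1.0-1.5 VE permits agreement only if
    examiner 1 marked near mid-segment; >1.5 VE rules agreement out."""
    if abs_diff_ve <= 1.0:
        return "agreed on the SSS or the motion segment containing it"
    if abs_diff_ve <= 1.5:
        return "may have agreed on the motion segment, depending on intra-segment location"
    return "disagreed by more than one motion segment"
```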
When both examiners were confident, their differences were ≤1.0 VE in 55 of 61 cases (90.2%), meaning they definitely agreed on the motion segment containing the SSS (again, on the motion segment above or below, given the findings of the first examiner); in only one case (1.7%) did they differ by more than 1.5 VE, meaning they definitely disagreed on the motion segment containing the SSS. When one of the examiners lacked confidence, their differences were ≤1.0 VE in 18 of 36 cases (50.0%); when neither examiner was confident, there were no cases in which their difference was ≤1.0 VE. Outliers, defined as values lying more than 1.5 times the interquartile range beyond the quartiles, may have occurred when a subject had more than one stiff spinal segment in the range being examined, and yet the examiner was constrained to decide upon the stiffest.
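For reference, that outlier definition is Tukey's boxplot rule; a minimal sketch (hypothetical data) is:

```python
import numpy as np

def tukey_outliers(d):
    """Return values lying more than 1.5 interquartile ranges beyond the quartiles."""
    q1, q3 = np.percentile(d, [25, 75])
    fence = 1.5 * (q3 - q1)
    return d[(d < q1 - fence) | (d > q3 + fence)]

print(tukey_outliers(np.array([0.2, 0.4, 0.5, 0.5, 0.7, 0.8, 1.0, 7.0])))  # -> [7.]
```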
The 95% LOA round off to (−3.0, 3.1) VE. This may be interpreted as follows: 95% of examiner differences for the SSS were no more than approximately three vertebral heights apart. It must be emphasized that the LOA do not identify the mean examiner difference, but rather the boundaries that contain 95% of examiner differences. Increasing examiner confidence narrowed the LOA; when both examiners were confident, 95% of the time they were ≤1.8 levels apart. The LOA were smaller in the lumbar and cervical spines and relatively larger in the thoracic spine, presumably due to the latter's greater length: identifying the stiffest spinal site among nine vertebrae in the thoracic spine might be expected to produce lower agreement than identifying it among only five vertebrae in the lumbar spine. With more choices available, there is a greater risk of finding two or more levels stiff; in the forced-call method used here, in which the examiners had to choose the stiffest segment, palpators who largely agreed on two such segments might nonetheless have disagreed as to which was stiffest.
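The LOA computation itself is the standard Bland-Altman one: the mean signed difference plus or minus 1.96 standard deviations. A minimal sketch follows, with hypothetical inputs chosen only to land near the reported interval:

```python
import numpy as np

# Hypothetical signed examiner differences in VE (not the study data)
d = np.random.default_rng(1).normal(0.05, 1.55, 100)

bias = np.mean(d)                         # mean difference; near zero if unbiased
half_width = 1.96 * np.std(d, ddof=1)
print(f"95% LOA: ({bias - half_width:.1f}, {bias + half_width:.1f}) VE")
# With near-normal differences, about 95% of them fall inside these limits.
```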
Since LOA are derived from the standard deviation of examiner differences, a calculation in which deviations are squared, they generally produce wider intervals for examiner differences than the ranges established by MedAED calculations; they may therefore be said to be more conservative in their estimation of interexaminer agreement. The choice between less and more conservative measures of examiner agreement might best depend on the clinical significance of the measurements. For example, if two technologies for measuring a laboratory value obtain measures on opposite sides of a benchmark that determines whether a medication is prescribed, patient safety may be compromised depending on which technology is emphasized. In performing motion palpation for spinal stiffness, however, there is little if any evidence that examiner differences in judgment are likely to significantly compromise the health status of the patient.
Table 6, which compares the interexaminer reliability obtained using randomly created chance data with that obtained using the real data, best illustrates the extent to which the information provided by MP improved interexaminer reliability. The rightmost column provides the ratio of simulated to actual MedAED: the palpatory information improved interexaminer reliability by a factor of 1.8 to 4.7, depending on the regional subset analyzed. These data provide convincing evidence that MP for the SSS improves interexaminer agreement on the site of potential spine care, despite previously reported findings, based on level-by-level analysis, that MP infrequently achieves reliability above chance levels [1–4]. There is no obvious way to compare these heuristic calculations of the enhancement of interexaminer reliability afforded by the SSS protocol with other measures of reliability that have been deemed acceptable. What defines an acceptable level of reliability for a spinal assessment procedure depends on the consequence of a mistake: one would suppose that a mistake on the SSS would not matter nearly as much as, for example, a spine surgeon's mistake concerning the intervertebral disc level thought responsible for lumbar radiculitis.
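Since the exact procedure used to generate the chance data is not described here, the following Monte Carlo sketch rests on an assumed model: each examiner independently picks a site uniformly at random within the region. The regional span and the actual MedAED below are illustrative placeholders, not values from Table 6.

```python
import numpy as np

rng = np.random.default_rng(0)
span_ve, n_subjects, n_reps = 5.0, 34, 10_000   # e.g., a lumbar range of five vertebrae

sims = rng.uniform(0.0, span_ve, size=(n_reps, n_subjects, 2))  # two random "examiners"
chance_medaed = np.median(np.abs(sims[..., 0] - sims[..., 1]), axis=1).mean()

actual_medaed = 0.6                              # illustrative observed MedAED, in VE
print(f"chance MedAED ~ {chance_medaed:.2f} VE; "
      f"ratio to actual ~ {chance_medaed / actual_medaed:.1f}")
```

Under this assumed model the chance MedAED comes out near 1.5 VE, yielding a ratio of roughly 2.4, within the 1.8 to 4.7 range reported above.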
Examiners could agree on the SSS and yet both be incorrect in their determinations, as might be shown by comparison with a valid reference standard. Moreover, even were they accurate, the information might prove to be of little clinical utility. An innovative efficacy study [29] using a randomized trial design explored whether the information provided by MP was associated with clinically relevant pain reduction after one session of cervical manipulation, compared with non-specific cervical manipulation. Although that study found endplay assessment did not contribute to same-day clinical improvement in the cervical spine, the investigators did not rule out a possible contribution over a longer term.
Perfect segmental specificity in identifying a spinal site of care is probably not strictly required, since a spinal intervention generally addresses a motion segment consisting of two vertebrae [30, 31]. As can be ascertained from both the MedAED and LOA analyses, the pairs of examiners in the three studies herein re-analyzed tended to identify the same or adjacent vertebrae as the SSS, especially when they were confident in their findings, and especially in the cervical and lumbar spinal regions.
The better reliability seen in these studies compared with the great majority of previous MP studies [1–4] is most likely not primarily attributable to improvements in the end-feel palpatory methods that were used, and may not constitute a better method for identifying the most appropriate site of spine care. The authors are not aware of any outcome studies reporting different results for characterizing every spinal level as moving normally or not, compared with flagging the most relevant location within a patient's area of primary complaint. Therefore, these results do not call for clinicians to adopt new patient assessment methods or change their record-keeping protocols. They do suggest that researchers might consider designing study protocols and research methods to explore reliability using the “most clinically relevant spinal site” protocol that some clinicians no doubt use, as an alternative to level-by-level analysis. Indeed, these results raise the possibility that the present inventory of mostly level-by-level reliability studies (certainly for MP) may have underestimated clinically relevant examiner agreement, thereby unduly discouraging further research and clinician interest in such research. It may be possible to apply the continuous analysis approach used in the present study to other types of interexaminer reliability scenarios, including, for example, thermography and leg length inequality studies. In fact, at least one study on the reliability of thermographic assessment used continuous analysis in part [32], as did two studies on assessing leg length inequality [33, 34]. These experimental design modifications may assess examiner agreement more meaningfully than the mostly level-by-level analysis that has been used up until now.
Limitations of method
To facilitate pooling data from all three regional studies, the authors arbitrarily included only the data for examiner two vs. examiner three from the cervical study, excluding the data for one vs. two and one vs. three. The two vs. three data were chosen because their findings for interexaminer reliability appeared to lie between those of the other two examiner comparisons. Each of the prior studies included a different number of subjects; equal numbers would have been preferable, but the subjects were recruited at different times in an environment where the size and gender mix of the convenience sample fluctuated. In the thoracic study, the range examined did not include T1-2 and T12; the investigators had formed the clinical opinion, based on prior experience, that these areas were so prone to stiffness that the experimental findings of reliability could have become misleadingly inflated. Among the three original included studies, only the lumbar study included a power analysis.
Some of the subset analyses in the present study were clearly underpowered, suggesting caution in interpreting their interexaminer reliability. The recommended number of subjects for either a complete dataset or a subset in this kind of study is about 35, to provide 80% power at the 5% significance level to detect ICC ≥ 0.6 [35]. Since subsets are by definition smaller than the complete dataset, “it would be more reliable to look at the overall results of a study than the apparent effect observed within a subgroup” [36]. The subsets for one examiner lacking confidence and both examiners lacking confidence were combined in some of the analyses in this study to at least partially mitigate this effect. Although the data clearly suggested that increased examiner confidence bred reliability, among all the subset analyses in this study only one reached the threshold of 35 subjects: the both-examiners-confident subset of the combined dataset. That stated, all of the measures of reliability in the combined dataset (MeanAED, MedAED, and LOA) showed substantially increased reliability in the both-examiners-confident subset compared with the full dataset (both of which were adequate in subject size), suggesting the study's conclusions regarding the role of confidence are reasonable.
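For readers wishing to check such figures, one common closed-form approximation for ICC sample size is that of Walter, Eliasziw and Donner. The null ICC below is a hypothetical placeholder, since the null value underlying the recommendation in [35] is not stated here:

```python
from math import log
from scipy.stats import norm

def n_subjects_for_icc(rho1, rho0, k=2, alpha=0.05, power=0.80):
    """Approximate subjects needed to detect ICC = rho1 against a null ICC = rho0
    with k raters (Walter, Eliasziw & Donner approximation, one-sided alpha)."""
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    theta0, theta1 = rho0 / (1 - rho0), rho1 / (1 - rho1)
    c0 = (1 + k * theta0) / (1 + k * theta1)
    return 1 + 2 * k * z ** 2 / ((k - 1) * log(c0) ** 2)

# Detecting ICC = 0.6 against an assumed null of 0.25 with two examiners:
print(round(n_subjects_for_icc(0.6, 0.25)))   # ~33, broadly consistent with the ~35 in [35]
```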
Lacking a reference standard, it cannot be confirmed that there actually were stiff spinal levels in the included studies of asymptomatic and minimally symptomatic subjects. An examiner might not have found any segment significantly stiff to palpatory pressure, or might have found multiple segments significantly but indistinguishably stiff. The study participants were largely asymptomatic, and thus not representative of symptomatic patients seeking care, jeopardizing external validity in a manner that has been previously criticized [4, 8]; on the other hand, there is some evidence that using more symptomatic participants does not appreciably change the outcome [37]. The research assistant may have introduced some error in marking and measuring the locations of each examiner's SSS; however, the data are consistent with these putative errors having been random and thus unbiased (the bias estimates in Table 5 are all near zero). Although the examiners did not converse with the subjects, the subjects may have provided non-verbal cues, such as pain withdrawal reactions or wincing gestures, which may have influenced the examiners' findings for the SSS. To some extent this study suggested that examiner confidence breeds examiner agreement. However, since it is not known whether the examiners were accurate, nothing is implied about an individual examiner's confidence in a typical practice setting; i.e., it is not known whether confidence in the findings of MP is warranted. The present study does not suggest that high confidence, which could very well be unwarranted by skill level, improves the accuracy of MP. Since this study focused on spinal hypomobility, it did not address whether a putative “most hypermobile segment” might be identified using similar methods, a question that may arguably be quite important in clinical practice.
Although the examiners agreed on the SSS or the motion segment containing it 60.2–77.0% of the time, it is equally true that they disagreed on the SSS by more than one motion segment 23.0–39.8% of the time. Although clinician disagreement on the site of spinal intervention might in principle lead to suboptimal care or even patient harm, the authors are not aware of studies confirming or excluding that possibility. Perhaps too optimistically, but not without reason, Cooperstein and Haas wrote [38]: “Although most patients are better off after a round of chiropractic care, there are data suggesting that about half of them suffer at least one adverse consequence along the way [39, 40]. Nevertheless, these tended to be minor and transient, and we have every reason to believe that even these patients were made better off than had they received no care at all. Since most patients improve, but some more quickly and with less adverse consequences along the way, perhaps ‘wrong listings’ are not so much wrong as suboptimal. This is just what we would expect if, rather than listings being simply right or wrong, there were a listings continuum ranging from very appropriate to very inappropriate. Then listings would matter, in the sense that doing the ‘right thing’ would be better than the ‘wrong thing,’ although even the wrong thing would usually be better than literally nothing, i.e., no clinical intervention.”