Development and evaluation of the MAINTAIN instrument, selecting patients suitable for secondary or tertiary preventive manual care: the Nordic maintenance care program

Background Chiropractic maintenance care (MC) has been found to be effective for patients classified as dysfunctional by the West Haven-Yale Multidimensional Pain Inventory (MPI). Although displaying good psychometric properties, the instrument was not designed to be used in clinical practice to screen patients for stratified care pathways. The aim was to develop a brief clinical instrument with the intent of identifying dysfunctional patients with acceptable diagnostic accuracy. Methods Data from 249 patients with a complete MPI dataset from a randomized clinical trial that investigated the effect and cost-effectiveness of MC with a 12-month follow-up was used in this cross-sectional analysis. A brief screening instrument was developed to identify dysfunctional patients, with a summary measure. Different cut-offs were considered with regards to diagnostic accuracy using the original instrument’s classification of dysfunctional patients as a reference. Very good diagnostic accuracy was defined as an area under the curve (AUC) metric between 0.8 and 0.9. The instrument was then externally validated in 3 other existing datasets to assess model transportability across populations and medical settings. Results Using an explorative approach, the MAINTAIN instrument with 10 questions (0–6 Likert responses) capturing 5 dimensions (pain severity, interference, life control, affective distress, and support) was developed, generating an algorithm-based score ranging from − 12 to 48. Reporting a MAINTAIN score of 18 or higher, 146 out of the 249 patients were classified as dysfunctional with 95.8% sensitivity and 64.3% specificity. At a score of 22 or higher, 109/249 were classified as dysfunctional with 81.1% sensitivity and 79.2% specificity. AUC was estimated to 0.87 (95% CI 0.83, 0.92) and Youden’s index was highest (0.70) at a score of 20. The diagnostic accuracy was similar and high across populations with minor differences in optimal thresholds for identifying dysfunctional individuals. Conclusion The MAINTAIN instrument has very good diagnostic accuracy with regards to identifying dysfunctional patients and may be used as a decision aid in clinical practice. By using 2 thresholds, patients can be categorized into “low probability (− 12 to 17)”, “moderate probability (18 to 21)”, and “high probability (22 to 48)” of having a good outcome from maintenance care for low back pain. Trial registration Clinical trials.gov; NCT01539863; registered February 28, 2012; https://clinicaltrials.gov/ct2/show/NCT01539863. Supplementary Information The online version contains supplementary material available at 10.1186/s12998-022-00424-6.


Background
Non-specific low back pain (LBP) is a highly prevalent condition affecting a large part of the adult population with major consequences worldwide. It is considered one of the biggest economic burdens in western societies [1]. According to the Global Burden of Disease Study, LBP results in more years lived with disability than any other disease in the world [2]. LBP episodes are often short lived but recurrent throughout life and the lifetime prevalence worldwide has been estimated to be 80-85% [2]. Therefore, it is recommended that research should aim at secondary prevention (preventing recurrences in people recovered from a previous episode) and tertiary prevention (preventing progression and consequences of LBP) [3].
Chiropractic maintenance care (MC) is described as a long-term management strategy for musculoskeletal disorders, introduced when optimum treatment benefit has been reached after an initial care plan. The aim of MC is to prevent future episodes, progression, and consequences of LBP by treating the patient at regular intervals, regardless of symptoms [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18]. In an ambitious effort, researchers across the Nordic countries have systematically explored and investigated indications, content, and frequency of MC in a series of research projects [18]. Commonly, MC patients are selected based on their previous history of pain and the effectiveness of the initial care plan [10]. Selected MC patients are commonly scheduled with 1-3 months intervals and are treated with manual therapy, along with individual exercises and lifestyle advice [18].
Based on this knowledge, a pragmatic randomized clinical trial (RCT) was designed to investigate the effectiveness of MC in patients with recurrent and persistent LBP [19]. The trial exhibited that the MC-group had fewer days with bothersome LBP over a year compared to the control group [20]. Although more effective, the number of visits was higher in the MC-group with more treatments during the 52-week study period [20]. There was a large variability in the data, suggesting that there may be subgroups of patients who experienced fewer days of pain and fewer visits than others.
Psychological [21,22], behavioural [23] and social characteristics [24] of LBP patients are important in the transition from acute to recurrent and persistent pain states [25][26][27][28][29][30][31]. In line with the bio-psycho-social model, the leading theoretical framework underpinning the management of LBP [24,[32][33][34] clinicians would be expected to consider cognitive processes, psychological and behavioural dimensions of the pain experienced when managing patients with pain. Based on the cognitive-behavioural conceptualization of pain, The West Haven-Yale Multidimensional Pain Inventory (MPI) was developed to capture the perceptions and consequences of living with chronic pain [35]. The original instrument has been used to identify three clusters/ subgroups of patients [36] and has been shown to be reliable, valid, and useful in outcome-based research [37,38]. The three different subgroups are defined as adaptive copers (AC), interpersonally distressed (ID), and dysfunctional (DYS). Individuals in the AC group are characterized by low pain severity, low interference with everyday life due to pain, low life distress, a high activity level, and a high perception of life control and have the best prognosis with the lowest risk for long-term sick leave [39]. Individuals in the ID group are characterized by dysfunctional behaviours such as low levels of social support, low levels of solicitous and distracting responses from significant others, and high scores on punishing responses compared to the DYS and AC patients [39]. Individuals in the DYS subgroup are characterized by having high pain severity, marked interference with everyday life due to pain, high affective distress, low perception of life control, and low activity levels and have the worst prognosis along with the highest risk of long-term sickness absence [39].
In a secondary analysis of the data from the RCT [40,41], it was found that patients with a less favourable psychological profile (DYS subgroup) reported better outcomes from the MC approach. Surprisingly, the effect of MC within the DYS subgroup was achieved at an equal number of visits compared to the control group. On the other hand, patients within the AC group who received MC reported more days with pain while also receiving a higher number of visits compared to the control group. These results may change the way MC is delivered in clinical practice as we can now identify subgroups of patients where MC is most effective. At the same time, it becomes into "low probability (− 12 to 17)", "moderate probability (18 to 21)", and "high probability (22 to 48)" of having a good outcome from maintenance care for low back pain.
clear that for a specific group of patients, the AC group, MC should not be recommended.
Efforts have been made to tailor treatments to specific subgroups of patients with recurrent or persistent LBP [28,37,[42][43][44][45][46][47][48]. However, none of the available instruments have been able to improve outcomes among chiropractic patients by identifying candidates suitable for a stratified care pathway [49,50]. Previous research has investigated the predictive ability of the Keele STarT Back Tool when implemented in chiropractic practice and has found its prognostic value to be limited among patients with acute LBP [49,51]. To date, to our knowledge, there are no published empirical studies investigating the high risk groups (ID and DYS) classified by the MPI-S instrument or high risk groups classified by the Keele STarT Back Tool [43,52,53] or the Örebro Musculoskeletal Screening Tool (ÖMST) [27,30,54].
Although valid and reliable, the Swedish version of the MPI instrument (MPI-S) was not designed to be used as a screening instrument in daily clinical practice but rather as a comprehensive research tool. To utilize the aforementioned research findings [20,40,41] in clinical practice, a pragmatic and convenient instrument based on the MPI-S with fewer items needs to be developed.
The objective of this study was threefold: (1) To develop a new brief instrument for identifying dysfunctional patients in a clinical setting, with adequate sensitivity, specificity, and receiver operating characteristics. (2) To assess the instrument's ability to reproduce the previously published effect estimates of MC, and (3) To test the sensitivity, specificity, and receiver operating characteristics in 3 other existing datasets to assess external validity and model transportability across populations.

Method
To address the first two objectives, data from a recently conducted RCT (developmental dataset 1) was utilized, and to answer the last objective, data from 3 other clinical trials described below (external validation datasets 2, 3, and 4) were used. Karolinska Institutet is the custodian of all the datasets used in this study which have been used and are reported on in previously published papers [19,20,40,41,[55][56][57][58][59][60][61]. Only individuals with a complete MPI-S dataset were used in the analysis, thus there were no missing data of the variables used to develop and validate the instrument.

Design
Objective 1 (instrument development) was addressed using a cross-sectional analysis of the MPI-S data collected at baseline in developmental dataset 1.
Objective 2 (reproduction of effect) was answered using a longitudinal analysis of data from patients who completed the 12-month RCT comparing the intervention (MC) to the control group (developmental dataset 1) with regards to the total number of days with activity limiting LBP for patients classified as dysfunctional at each level of the MAINTAIN instrument. These effect estimates along with the test statistics estimated in objective 1 were used to recommend clinical thresholds for a positive test.
Objective 3 (external validation of the instrument) was answered using a cross-sectional analysis of the MPI-S data collected at baseline in external validation datasets 2, 3, and 4, with the same procedure used in objective 1, to assess external validity and model transportability across populations.

Populations
Dataset 1 (developmental) was drawn from a randomized controlled trial with a 12-month follow-up period [19,20,40,41] investigating the effectiveness and cost-effectiveness of MC in a population of patients with recurrent or persistent LBP from chiropractic primary care clinics in Sweden. The trial started in 2012 and was concluded in 2016 with a total of 249 patients completing the 12-month study period with a complete MPI-S dataset.
Dataset 2 (validation) was drawn from a large intervention study entitled "Work and Health in the Processing and Engineering Industries" conducted between 2000 and 2003 at four companies in Sweden [55,56]. Individuals classified as having a high risk of developing chronic disabling neck pain (NP) and/or LBP and long-term sick leave with a comprehensive risk assessment tool were selected for this study.
The third and fourth datasets (validation) came from a large intervention study entitled "Health-economic Evaluation and Rehabilitation (HUR)" conducted between 1994 and 1997 to evaluate multidisciplinary rehabilitation interventions with regards to their effect on sick leave, health-related quality of life, and cost-effectiveness. Dataset 3 was drawn from an observational outcome study nested within the larger HUR project consisting of subjects suffering from neck and low back pain with intermittent sickness absence [57,59,61]. Dataset 4 was extracted from a randomized controlled trial also nested within the larger HUR project consisting of subjects with ongoing sickness absence as a result of NP and LBP [58,60]. Data were collected as part of the baseline assessment at the initial visit to the clinics. The research projects have been described in detail elsewhere [57][58][59][60].
The second, third, and fourth datasets were chosen as they represent a broad selection of different populations with LBP with different severity and consequences with regard to the patient's condition, i.e., working, on intermittent sick leave, or ongoing sick leave. These 4 populations have been compared with regards to psychological characteristics in a previous publication [62].

Measures
We define discriminative ability as: "How well the test discriminates between two conditions of interest (health and disease; two stages of a disease etc.)". This can be quantified by the measures of diagnostic accuracy such as Sensitivity, Specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), Receiver Operating Characteristics (ROC), and Youden's index [63].
Based on exploratory factor analysis, Turk and Melzac [64] identified 8 items from the original MPI-S instrument that they suggested can be used to develop a brief version of MPI for the purpose of identifying dysfunctional patients. Based on these 8 items, different scoring algorithms were explored to find a working model. During the process, 2 additional items from the "support" dimension were added to improve discrimination between dysfunctional and interpersonally distressed individuals. A final scoring algorithm was decided by considering an optimal compromise between diagnostic accuracy and ease of use in a clinical setting.

Outcomes
Outcomes of the study were measures of diagnostic accuracy (Sensitivity, Specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), Receiver Operating Characteristics (ROC), and Youden's index) of the MAINTAIN instrument's ability to classify dysfunctional subjects according to the original classification algorithm used by the original MPI-S instrument (gold standard).

Instrument development
Based on 10 items from the original MPI-S questionnaire, 5 dimensions of the patients' pain experience were recorded: pain severity (Q1 + Q7), interference (Q4 + Q8), life control (Q15 + Q18), affective distress (Q22 + Q20) and support (Q5 + Q14). Tentatively, a dysfunctional patient would score high on pain severity, interference, affective distress, support, and low on life control. A summary score was created based on this assumption, adding together the items from the high dimensions, and subtracting the score from the expected low dimension. The following scoring algorithm was created to generate a summative MAIN-TAIN Score: Q1 + Q7 + Q4 + Q8 − Q15 − Q18 + Q22 + Q20 + Q5 + Q14, ranging from − 12 to 48. Based on this scoring algorithm, each level of the MAINTAIN Score in developmental dataset 1 was analysed as a possible threshold for a dysfunctional classification and compared to the original MPI-S subgroup classification.
A positive test would signify a dysfunctional profile and a negative test would signify either an interpersonally distressed or an adaptive coper profile. For each level of the MAINTAIN Score, the proportion of individuals for each subgroup who had a positive test was calculated. This data was graphically depicted to illustrate the relative frequency (%) of true and false positive tests within each subgroup. In addition, a ROC analysis was performed by plotting the binary outcome (AC/ID vs DYS) of the original instrument against a varying discrimination threshold of the MAINTAIN score. Based on the ROC analysis, an estimate the AUC was used to further report the diagnostic accuracy of the instrument. Diagnostic accuracy based on the AUC metrics were defined as: 0.9-1.0 excellent, 0.8-0.9 very good, 0.7-0.8 good, 0.6-0.7 sufficient, 0.5-0.6 bad, < 0.5 test not useful [63].
By analysing test statistics for each level of the instrument, a trichotomization by three identified thresholds based on iterative discussions within the research group were established. First, a lower threshold was identified where most dysfunctional individuals (> 95%) can be identified with a reasonable number of false positive tests with particular attention to AC individuals (< 20%). Second, an upper threshold where most of the AC individuals (> 95%) have been excluded at the expense of a reasonable number of false negative tests with respect to the DYS individuals (< 20%) was identified. Third, a threshold with the optimum cut-off point where equal weight is given to false positive and false negative values was identified. To illustrate how the optimum cut-off point is defined, the diagnostic accuracy of the instrument across different possible discrimination thresholds (each MS score) is reported in Table 2. The final MAINTAIN instrument is reported in Additional file 1.

Reproduction of effect estimates
The difference in the total number of days with pain (the primary outcome in the original RCT) between the control and intervention groups in development dataset 1 was estimated to understand how well the instrument could reproduce the effect size in the subgroup previously shown to respond well to MC. Using a general linear regression model, effect estimates were produced for each level of the MAINTAIN instrument (as a threshold for dysfunctional classification). By using the data on clinical outcomes for each level of the instrument along with the data on instrument performance from the development stage, two specific thresholds were identified to classify patients into: low, moderate, and high probability of MC being effective.

External validation across different medical settings
Finally, the MAINTAIN instrument was externally validated in 3 separate data sets consisting of patient populations who were experiencing pain resulting in different degrees of activity limitation collected in different medical settings during different time periods. An analysis of relatedness between the developmental dataset and the validation datasets was not performed as the datasets were substantially different in terms of medical settings, demographics, and outcome occurrence. To establish model transportability test statistics (Sensitivity, Specificity, PPV, NPV, ROC and Youden's index) were estimated and compared across populations.

Ethical considerations
The project was approved by the Swedish Ethical Review Authority (Ref.: 2019-04505). No new data was collected. Data had been handled and stored in accordance with the tenets of the World Medical Association's Declaration of Helsinki, and all the subjects had previously signed informed consent forms and agreed to participate in each of the four studies.

Results
The mean age was similar across all datasets except for the distribution of females which was higher in dataset 1 and lower in dataset 2. Compared to the dataset 1 mean MAINTAIN scores in datasets 3 and 4 were higher, and lower in dataset 2. This is in in line with the rationale described in the method, as the datasets were chosen to represent distinctively different populations with regards to medical setting, status of sick listing and distribution of MPI subgroups. Demographics of datasets 1-4 used in this study are reported in Table 1.   Table 3.

Objective 3, external validation across different medical settings
In validation dataset 2, at a score of 18     At a score of 22 or higher 108 were classified as Dysfunctional with 80.6% sensitivity and 66.3% specificity. The test showed very good to excellent diagnostic accuracy (ROC characteristics) in all datasets with high AUC estimates that were statistically significant. When considering the optimum cut-off point where equal weight is given to false positive and false negative values (highest value on the Youden´s index) the thresholds differ somewhat in development dataset 1 compared to validation datasets 2 and 3. In validation dataset 4, the highest index (0.451) can be found at a MAINTAIN score of 20 resembling the data from development dataset 1. Whereas the highest index in validation datasets 2 and 3 can be found at a MAINTAIN score of 25 (0.712) and 27 (0.590), respectively.
The test statistics for each score of the MAINTAIN instrument in validation datasets 2 to 4 are reported in Additional files 2, 3 and 4. AUCs for datasets 1 to 4 are reported in Table 4.

Discussion
To our knowledge, this is the first study to develop and psychometrically assess a brief version of MPI-S with the purpose of identifying dysfunctional subjects suitable for MC in a clinical setting, see Additional file 1. The MAIN-TAIN instrument has the potential of being an effective clinical decision aid allowing clinicians to acquire a relevant score of psychological distress quickly with a low administrative load. The instrument has displayed an acceptable trade-off between sensitivity and specificity across contexts and populations (model transportability) and can be used with different thresholds depending on the rate of false positive tests that can be accepted in the clinical encounter. The instrument has exhibited high ROC statistics with scores ranging between 0.79 and 0.90, thus, suggesting very good to excellent diagnostic accuracy [63,65]. Rather than using a single threshold with a dichotomous outcome we recommend 3 ranges to illustrate that the instrument represents a scale where the likelihood of a good outcome from MC changes in a dose-response-like relationship as reported in Table 3. We suggest clinicians use the instrument as one parameter in a broader assessment strategy where previous pain history, initial treatment effectiveness, ability to perform active strategies, and patient preferences should be considered.
In the present work, datasets were collected from different patient populations in different medical settings which has allowed for a robust evaluation and introspection of the instrument's diagnostic accuracy. Also, by verifying the stratification procedure using clinical outcomes, the clinical usefulness of the instrument is assessed. We considered the data from this trial to be robust and trustworthy, and the MAINTAIN instrument adds a valuable contribution by informing clinical decision-making by deepening the understanding of the  psychological characteristics of the patient's condition. The advantage of a brief instrument with a simple scoring algorithm is the possibility to employ it during the clinical encounter and being able to score while the patient is present. This allows the clinician to promptly get an overview of the psychological dimensions of the patient's pain experience as well as an opportunity to dwell into each of the 5 dimensions, pain severity, interference, life control, affective distress, and support, with probing follow-up questions. In addition to deepening the understanding of the patient's pain condition, this information can be used to establish treatment goals or adapt the intervention to better align with the patients' needs and preferences. Patients included in sample 1, where the effect and cost-effectiveness of MC were investigated, were included only if they reported recurrent or persistent pain, and the long-term effect over 12 months was studied. This may explain the very good to excellent discriminant ability of the MAINTAIN instrument in contrast to the research done on the Keele STarT Back Tool. Thus, psychometric properties, such as validity, may not pertain to an instrument as such; rather, they are a feature of the construal of the results generated from a contextual study [66].
In objective 2, the difference in the total number of days with pain was used to reproduce the effect sizes of the intervention in the new subgroups defined by the thresholds of the MAINTAIN instrument. This outcome has been used in several clinical trials and the measure has been shown to correlate well with pain intensity [67]. However, there is currently no study that has investigated minimally clinically relevant changes in the outcome, and it is therefore difficult to draw firm conclusions on the clinical relevance.
The cross-sectional design used to develop this instrument, naturally, does not allow us to explore the measurement over time and represents a weakness. Additional cross-sectional and longitudinal data on reliability estimates that evaluate the stability, internal consistency, interrater reliability, and responsiveness of the MAIN-TAIN instrument would further inform on applicability. Also, the datasets used to validate the instrument are quite old, therefore it is possible perceptions and attitudes towards pain in general may have changed over time in the societal context. Whether this has had any impact on how patients have scored the original MPI-s instrument is unclear. Future research should investigate the use of the MAINTAIN instrument in a clinical setting in an implementation trial by comparing a stratified MC approach using the MAINTAIN instrument and comparing it with standard care, exercise interventions, and digital consultations. Also, it would be valuable to contrast and compare the thresholds to the Keele STarT Back Tool and the Örebro Musculoskeletal Screening Pain Questionnaire (ÖMSPQ) to further explore the construct validity and usefulness of the MAINTAIN tool to populations with recurrent or persistent LBP. Such data could shed light on the possibility of using these instruments as alternative decision-making aids for identifying patients suitable for MC.

Conclusion
The MAINTAIN instrument is a brief cinical tool with a simple scoring algorithm that has a very good to excellent diagnostic accuracy with regard to selecting Table 3 Effect of maintenance care with a dysfunctional  classification for each of the scores 8-27 as possible  discrimination thresholds on the MAINTAIN instrument  (development dataset 1  dysfunctional patients in a clinical setting. By using 2 thresholds, patients can be categorized into "low probability", "moderate probability" and "high probability" of having a good outcome from maintenance care for LBP. The diagnostic accuracy is similar and high across populations with minor differences in optimal thresholds for identifying dysfunctional individuals. Implementing the MAINTAIN instrument has the potential of improving outcomes by identifying dysfunctional high-risk patients early in the clinical course and stratifying them to appropriate interventions with a higher chance of treatment success.