A literature review of clinical tests for lumbar instability in low back pain: validity and applicability in clinical practice

Background Several clinical tests have been proposed for low back pain (LBP), but their usefulness in detecting lumbar instability is not yet clear. The objective of this literature review was to investigate the clinical validity of the main clinical tests used for the diagnosis of lumbar instability in individuals with LBP and to verify their applicability in everyday clinical practice.
Methods We searched for studies of the accuracy and/or reliability of the Prone Instability Test (PIT), Passive Lumbar Extension Test (PLE), Aberrant Movements Pattern (AMP), Posterior Shear Test (PST), Active Straight Leg Raise Test (ASLR) and Prone and Supine Bridge Tests (PB and SB) in the Medline, Embase, Cinahl, PubMed, and Scopus databases. Studies were eligible only if the test they examined had been investigated for both accuracy and reliability by at least one study. The quality of the studies was evaluated with the QUADAS and QAREL scales.
Results Six papers considering 333 LBP patients were included. The PLE was the most accurate and informative clinical test, with high sensitivity (0.84, 95% CI: 0.69 - 0.91) and high specificity (0.90, 95% CI: 0.85 - 0.97). The diagnostic accuracy of the AMP depends on each individual test. The PIT and the PST demonstrated fair to moderate sensitivity and specificity [PIT sensitivity = 0.71 (95% CI: 0.51 - 0.83), PIT specificity = 0.57 (95% CI: 0.39 - 0.78); PST sensitivity = 0.50 (95% CI: 0.41 - 0.76), PST specificity = 0.48 (95% CI: 0.22 - 0.58)]. The PLE showed good reliability (k = 0.76), but this result comes from a single study. The inter-rater reliability of the PIT ranged from slight (k = 0.10 and 0.04) to good (k = 0.87). The inter-rater reliability of the AMP ranged from slight (k = −0.07) to moderate (k = 0.64), whereas the inter-rater reliability of the PST was fair (k = 0.27).
Conclusions The data from the studies provided information on the methods used and suggest that the PLE is the most appropriate test to detect lumbar instability in specific LBP. However, due to the lack of available papers on other lumbar conditions, these findings should be confirmed with studies on non-specific LBP patients.
Electronic supplementary material The online version of this article (doi:10.1186/s12998-015-0058-7) contains supplementary material, which is available to authorized users.


Background
Low back pain (LBP) is a growing health problem in the industrialized world. Despite the high medical expenses required for its management, the prevalence of LBP is increasing [1]. LBP is a heterogeneous condition, and the identification of different sub-groups could help guide management decisions [2,3]. One of these sub-groups is lumbar segmental instability [4,5].
Radiologically determined instability is characterized by a loss of passive integrity, causing excessive vertebral translation or rotation. Maximum lumbar flexion-extension radiographs in the standing position are considered to be a reference standard to assess the function of the passive stabilization system [6,7]. This imaging method is commonly used to evaluate lumbar segmental mobility in isthmic and degenerative spondylolisthesis and degenerative disc dysfunctions. The radiographic diagnosis of spondylolisthesis is considered to be one of the most efficient methods of identifying lumbar instability [8].
Some authors extend the concept of instability to the so-called "clinical" or "functional" instability, in which neither a defect of the bony architecture of the lumbar spine nor excessive detectable translation or rotation is shown. However, poor trunk muscle function and/or insufficient motor control is believed to be a factor in abnormal inter-segmental movement and LBP [9-11]. Although this type of instability has not been sufficiently demonstrated as a clinical entity and cannot be measured against any gold standard, it is one of the most frequent fields of interest for chiropractors and manual therapists.
Clinicians have used several clinical tests to detect spinal instability and/or the ability of the muscles to stabilize the lumbar spine [12]. Recently, some of these tests have been suggested in the "Clinical Practice Guidelines linked to the International Classification of Functioning, Disability and Health from the Orthopaedic Section of the American Physical Therapy Association" to assess the impairments of body functions in LBP [5]. The most commonly used tests are the Prone Instability Test (PIT), the Passive Lumbar Extension (PLE) test, the Aberrant Movements Pattern (AMP), the Posterior Shear Test (PST), the Prone Bridge Test (PBT), the Supine Bridge Test (SBT), and the Active Straight Leg Raise Test (ASLR).
Previous reviews separately investigated the diagnostic accuracy [13] or the reliability [14] of the instability tests, but a complete picture of their diagnostic validity in detecting lumbar instability is lacking. No single literature review covers both the diagnostic accuracy (sensitivity, specificity and likelihood ratios) and the inter-rater reliability of these clinical tests. More specifically, a researcher could be interested in investigating the reliability of the tests that previously demonstrated sufficient face validity.
The objective of this literature review was to assess the diagnostic properties of the clinical tests for lumbar instability in individuals with LBP (primarily their accuracy, with additional reporting of their reliability) and to investigate their applicability in daily practice.

Methods
This is a literature review of all the studies evaluating the diagnostic properties of the clinical tests for lumbar instability in individuals with LBP. The PRISMA guidelines [15] were followed during the design, search and reporting stages of this review on diagnostic test studies.

Literature search
A search of the relevant literature was performed from July 2012 to December 2013. A comprehensive search, limited to articles in English, Italian and Spanish, was conducted in the following databases: Medline, Embase, Cinahl, PubMed, Scopus. Diagnostic test studies regarding humans published between 1972 and December 2013 were included. Narrative or systematic reviews, guidelines and meta-analyses were excluded.
Two authors (SF and TM) independently performed two different and parallel searches to avoid leaving out relevant articles. The search strategies are shown in Figure 1.
The results of these seven searches were unified into a single item set. From the results of the initial search, duplicate citations were removed, and then the titles, abstracts and full texts of the retrieved articles were independently evaluated for definitive inclusion. When the two reviewers were unable to reach a consensus, a third reviewer (CV) was consulted. In addition to the Internet-assisted search, references were pulled from a textbook on the diagnostic accuracy of orthopedic clinical tests [16] and from the reference lists of included studies. Finally, an independent hand search, including scanning of the reference lists of other systematic reviews [13,14], was performed.

Study selection
Articles examining clinical tests for lumbar instability were included if they met the following criteria: 1) Diagnostic accuracy studies on an adult population with sub-acute or chronic LBP were considered if clinical instability tests were employed as index tests and dynamic radiographs were the reference test to diagnose lumbar instability. The articles had to report data allowing computation of measures of diagnostic accuracy [sensitivity, specificity, or positive and negative likelihood ratios (+LR and -LR)]. 2) Reliability studies on a healthy or LBP adult population were considered if they concerned the use of clinical tests to diagnose lumbar instability by one or more clinicians. The articles had to report statistical measures of relationship or agreement. 3) Finally, a study was eligible only if the test it examined had been investigated by at least one accuracy study and at least one reliability study.
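As an illustration of the accuracy measures named in criterion 1, the following minimal Python sketch computes sensitivity, specificity and the two likelihood ratios from a 2x2 table of index-test results against the reference standard. The class name, function name and example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Index-test results cross-tabulated against the reference standard."""
    tp: int  # test positive, instability on dynamic radiographs
    fp: int  # test positive, no radiographic instability
    fn: int  # test negative, instability on dynamic radiographs
    tn: int  # test negative, no radiographic instability


def diagnostic_accuracy(t: TwoByTwo) -> dict:
    """Return sensitivity, specificity, +LR and -LR for one 2x2 table."""
    reference_positive = t.tp + t.fn
    reference_negative = t.tn + t.fp
    if reference_positive == 0 or reference_negative == 0:
        raise ValueError("need reference-positive and reference-negative subjects")
    sensitivity = t.tp / reference_positive
    specificity = t.tn / reference_negative
    false_positive_rate = t.fp / reference_negative
    false_negative_rate = t.fn / reference_positive
    # A likelihood ratio is left as None when its denominator is zero.
    pos_lr = sensitivity / false_positive_rate if false_positive_rate > 0 else None
    neg_lr = false_negative_rate / specificity if specificity > 0 else None
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "+LR": pos_lr,
        "-LR": neg_lr,
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
example = diagnostic_accuracy(TwoByTwo(tp=40, fp=10, fn=10, tn=40))
print(example)
```

Computing the likelihood ratios from the counts, rather than from rounded sensitivity and specificity, avoids compounding rounding error when a study reports only the raw table.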

Data extraction and quality assessment
One author (TM) extracted data on the clinical tests (description and scoring), the study population (e.g. age, gender, setting, clinical characteristics), inclusion and exclusion criteria, the diagnostic reference standard, differences in operationalizing the index tests, and the study raters.
Study results on sensitivity, specificity, +LR, -LR, and reliability were collected (or calculated, if the included articles did not provide them). Two other authors (SF and FB) verified the data extraction once completed. The methodological quality of the included articles was independently assessed by 2 reviewers (TM and FB), using different tools for the 2 types of studies: the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool for diagnostic accuracy articles [17] and the Quality Appraisal of Reliability Studies (QAREL) checklist for diagnostic reliability articles [18].

Data synthesis and analysis
Kappa statistics were used to assess agreement between the 2 raters on article selection and on QUADAS and QAREL ratings [19]. The QUADAS and QAREL tools delineate essential items to be reported in diagnostic test studies (Table 1 and Table 2).
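For readers unfamiliar with the statistic, the sketch below shows one common form, Cohen's kappa for two raters, in Python. It is an assumption that this unweighted form matches the one used in the review; the function name and the include/exclude decisions are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters judging the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must judge the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same category by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions by two reviewers on ten articles.
reviewer_1 = ["in", "in", "in", "out", "out", "out", "in", "out", "in", "out"]
reviewer_2 = ["in", "in", "out", "out", "out", "out", "in", "in", "in", "out"]
print(round(cohen_kappa(reviewer_1, reviewer_2), 2))
```

In this example the reviewers agree on 8 of 10 articles and chance agreement is 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6.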
Concerning reliability, the following criteria were used to determine the strength of the coefficients: ≤ 0.25 = little or no relationship; 0.26 - 0.50 = fair degree of relationship; 0.51 - 0.75 = moderate to good relationship; 0.76 - 1.00 = good to excellent relationship [22].

Results
Figure 1 shows the process of study selection. Initial searching identified 773 citations. Following the first screening, 299 articles were excluded and 474 citations were retained for the second screening. After reviewing the titles, 446 were excluded and 28 were considered of interest; after reviewing the abstracts, 16 were retained, and 13 were retrieved in full text. Applying the inclusion and exclusion criteria, a further 7 articles were excluded. This review finally included 6 papers, considering 333 LBP patients [12,23-27].
Table 2 (QAREL checklist items; columns included Ravenna et al. [26] and Rabin et al. [12]):
- Was the test evaluated in a sample of subjects who were representative of those to whom the authors intended the results to be applied?
- Was the test performed by raters who were representative of those to whom the authors intended the results to be applied?
- Were raters blinded to the findings of the other raters during the study?
- Were raters blinded to their own prior findings of the test under evaluation?
- Were raters blinded to the results of the accepted reference standard or disease status for the target disorder (or variable) being evaluated?
- Were raters blinded to clinical information that was not intended to be provided as part of the testing procedures or study design?
- Were raters blinded to additional cues that were not part of the test?
- Was the stability (or theoretical stability) of the variable being measured taken into account when determining the suitability of the time-interval between repeated measures?

Quality scores
Two of the 6 studies (33%) were identified as having high methodological rigor according to the QUADAS tool (Table 1). Table 2 shows the distribution of studies according to the scores obtained from the assessment of their methodological quality, following the QAREL tool.

Diagnostic accuracy of the tests
The diagnostic accuracy was investigated by 2 studies only: Fritz et al. [24] and Kasai et al. [25]. Four lumbar instability tests were considered: the PLE test, the PIT, the AMP, and the PST. The main characteristics of the studies on diagnostic accuracy are shown in Table 3, whereas Table 4 shows the results.
Kasai et al. [25] found that the PLE test was the most accurate clinical test, with high sensitivity (0.84, 95% CI: 0.70 - 0.93) and specificity (0.90, 95% CI: 0.82 - 0.95), in a sample of subjects diagnosed with spinal stenosis, lumbar spondylolisthesis or lumbar degenerative scoliosis. The positive and negative LRs were informative.
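As a worked illustration, the likelihood ratios implied by the rounded point estimates above can be derived as follows. These figures are recomputed here from the two rounded values only, so they may differ slightly from the values reported in Table 4.

```latex
\[
\mathrm{+LR} = \frac{\text{sensitivity}}{1 - \text{specificity}}
             = \frac{0.84}{1 - 0.90} = 8.4,
\qquad
\mathrm{-LR} = \frac{1 - \text{sensitivity}}{\text{specificity}}
             = \frac{1 - 0.84}{0.90} \approx 0.18
\]
```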

Reliability of the tests
The reliability of the four clinical tests was studied in 5 papers [12,23,24,26,27]. The main characteristics of the studies on reliability and their results are shown in Table 5, whereas Table 6 shows the results in terms of inter-rater reliability.
The PLE test showed the best reliability, with good inter-rater agreement (k = 0.76), but this result comes from a single study [12].
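The strength criteria stated in the data synthesis section can be applied mechanically, as in the following minimal Python sketch. The function name is hypothetical, and rounding to two decimals before classifying is an assumption made here to handle values that fall between the published bands (for example, between 0.25 and 0.26).

```python
def reliability_strength(k: float) -> str:
    """Map a reliability coefficient to the descriptive bands used in this review."""
    if k > 1:
        raise ValueError("a reliability coefficient cannot exceed 1")
    # Assumption: coefficients are reported to two decimals, so round first.
    k = round(k, 2)
    if k <= 0.25:
        return "little or no relationship"
    if k <= 0.50:
        return "fair degree of relationship"
    if k <= 0.75:
        return "moderate to good relationship"
    return "good to excellent relationship"


# Coefficients quoted in the abstract of this review.
for test_name, k in [("PLE", 0.76), ("PIT", 0.87), ("AMP", 0.64), ("PST", 0.27)]:
    print(test_name, k, reliability_strength(k))
```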

Implications for clinical practice
The data from the studies provided information on the tests and methods used, the error of measurement and also the validity of the tests. However, only 5 studies (83.3%) provided information concerning the setting and the raters' years of clinical experience, whereas all studies identified the person performing the assessment and his/her professional competence.

Discussion
This literature review aimed to identify the most reliable findings on the diagnostic properties of the clinical tests for lumbar instability in LBP subjects.
Lumbar instability has traditionally been a subject of debate. Lumbar segmental instability in the absence of defects of the bony architecture of the lumbar spine has also been cited as a significant cause of chronic low back pain [5,28]. The differences between surgical instability criteria and "functional instability" criteria were defined by Panjabi [29] decades ago. Chiropractors and manual therapists are more interested in the loss of motor control than in the hypermobility detectable with flexion/extension radiological imaging, which is more useful to spine surgeons. However, the difficulty of clinically detecting abnormal or excessive inter-segmental motion often makes these tests insensitive and unreliable, which limits the clinical diagnosis of lumbar segmental instability [30,31]. The lack of studies in this field also emerges from our research, which found many studies on the reliability of tests used by clinicians but few on their accuracy. Although aware that this criterion may be too rigorous for manual therapists, we chose to use the best available reference (gold standard) for instability, namely dynamic X-rays. As a result, many other tests used in manual clinical practice to detect lumbar clinical instability (e.g. active hip abduction test or hip extension test) were not considered because no study had investigated their accuracy. These tests are not present in this review, so that, in the final analysis, our study could be considered a literature review of the accuracy of lumbar clinical tests with additional reporting of reliability information.
Test descriptions, Kasai et al. [25]:
- Passive lumbar extension test: The subject was in the prone position; both lower extremities were elevated concurrently to a height of about 30 cm from the bed while maintaining the knees extended and gently pulling the legs. The test was positive when the subject complained of strong pain in the lumbar region ("low back pain", "very heavy feeling on the low back", "feeling as if the low back was coming off") during elevation of both lower legs, and such pain disappeared when they returned to the initial position. In contrast, when the subject complained of an abnormal sensation (mild numbness or prickling sensation), the test was negative.
- Instability catch sign: The subject was asked to bend forward as much as possible and then return to the erect position; a subject who was not able to return to the erect position because of sudden low back pain was judged positive on the test.
Six high-quality studies were selected and four lumbar clinical instability tests (PLE test, PIT, AMP and PST) satisfied the inclusion criteria.

Accuracy
The characteristics of the samples of the 2 subject studies [24,25] cannot be considered accurate. Fritz et al. [24] studied a population whose majority had a prior history of LBP, and in which only 30.6% (n = 15) of people complained of symptoms distal to the knee. Kasai et al. [25], however, investigated a population with specific lumbar conditions (lumbar spinal canal stenosis, lumbar spondylolisthesis or lumbar scoliosis), most of whom had intermittent claudication, and 42.6% (n = 52) had neurological leg symptoms.
The PLE test was the most accurate and informative test, even though it was measured by only one study, in patients affected by lumbar degenerative diseases. Although the PLE test appears to be a potentially effective clinical test to detect lumbar instability, the characteristics of the investigated sample and the presence of only one study on its diagnostic accuracy suggest the need for studies on non-specific LBP patients.
The PIT demonstrated low to moderate sensitivity and specificity [24] indicating that this test has limited accuracy in diagnosing lumbar instability in patients with LBP.
The PST showed relatively poor sensitivity and specificity [24], indicating that this test is less accurate than the PLE test and the PIT in detecting lumbar instability.
The Instability Catch Sign, the Painful Catch Sign and the Apprehension Sign are three of the five signs included in the AMP investigated by Fritz et al. [24]. The relatively low sensitivity and high specificity resulting from the study of Kasai et al. [25] suggest caution in the use of these tests to diagnose lumbar instability. According to Hicks et al. [23], these 5 tests should be used together, as a complete observation of the trunk movement and the 5 signs could be considered as only one comprehensive test. However, positive results on AMP and PIT, which demonstrated moderate sensitivity and specificity, were considered predictive for a favorable response to stabilization exercises [32].

Reliability
The characteristics of the samples were not always well explained or were not reliable. The PLE test [12] and the PIT [12,23,24] demonstrated good inter-rater reliability. The reliability of the PLE test is evident in younger subjects referred to outpatient physical therapy [12]. Five studies on the PIT demonstrated very different inter-rater reliability scores. Nevertheless, the 2 studies showing fair reliability [26,27] are affected by possible bias: in the first case [27] due to a very limited sample size, and in the second case [26] due to procedural and methodological weaknesses, such as the involvement of novice raters and the use of a modified test. The main statistical problem was the small sample, which could invalidate the k score. Although the other 4 studies adopting the PIT closely followed its original description, some differences in the positivity criteria were found. Hicks et al. [23] and Schneider et al. [27] judged the test positive when the pain disappeared in the second part of the test; Fritz et al. [24] when the pain decreased, whilst for Rabin et al. [12] the pain had to be either relieved or abolished. After excluding the two studies with the main methodological weaknesses, the reliability of the PIT appeared moderate to good.
Study characteristics (table fragments):
- Exclusion criteria (E): contraindications to radiographic assessment (e.g., current pregnancy), previous lumbar fusion surgery, inability (e.g. pain or muscle spasm) to actively flex and extend the spine adequately to permit an assessment of segmental motion.
- Age: 39.2 ± 11.3 yrs. The second rater repeats the assessment 5 minutes after the first rater's assessment.
- Exclusion criteria (E): pregnancy; history suggesting a non-mechanical origin of symptoms (e.g., malignancy, inflammatory conditions), LBP due to a fracture, osteoporosis, regular use of corticosteroids, rheumatoid arthritis, presence of 2 or more signs suggesting lumbar nerve root compression.
- Age: 33.5 ± 8.0 yrs. AMP was assessed by the two raters simultaneously; PIT and PLE were assessed by the two raters separately (second assessment 5 minutes after the first one).
- Gender: 15♀, 15♂.
- Raters: one rater with a postprofessional master's degree (contributes to rating all subjects); other raters with a bachelor degree in physical therapy contribute to rating in 23, 4, and 3 subjects, respectively.
- Passive Lumbar Extension Test: positive if LBP is elicited.
The AMP reliability was investigated in three studies [12,23,24] but their results were not similar and ranged from insufficient reliability [24] to moderate reliability [12,23]. The PST was investigated by only one study and scored the lowest reliability [24], which is insufficient to recommend its use.

Implications for clinical practice
After an initial inspection of the articles, it appears that the information derived from the studies could provide a useful picture of the items that contribute to the definition of "applicability in rehabilitation practice". Sufficient information was provided on the execution of the tests, whereas little information was given on their duration and the time needed to process data. Considering that in clinical practice a standard manual therapy session normally lasts 30 minutes, some series of tests proposed in the literature may not be repeatable by clinicians due to lack of time. The attempt to identify methods for the evaluation of lumbar instability in patients with LBP allowed us to select some tests that are suitable for clinicians in everyday clinical practice. The time needed to perform the tests and process the data is compatible with clinical practice and research purposes. Starting from the same keywords used for the literature search, 4 clinical tests (PIT, PLE, AMP and PST) investigated by 2 studies [24,25] met the criteria of applicability in clinical practice.

Limits
The main limitation of this review is the small number of articles found on any single test. Only 2 studies concerned diagnostic accuracy, while for the studies investigating reliability, the results are limited by statistical or methodological weaknesses. For example, the conclusions of Ravenna et al. [26] should be interpreted cautiously, partly because of some significant modifications made to standardize the PIT, such as the different hip and knee positions and the use of a stabilization scapular belt and a stool for foot placement.
The average age and the characteristics of the spinal dysfunctions of the samples were not homogeneous in the different studies, thus reducing the external validity of the results. Another limitation of this review concerns the insufficient homogeneity regarding the execution and interpretation of the tests. As already mentioned, a lack of standardization of a test affects comparative analyses among different studies and the implementation of that test in clinical practice.