It is essential to accurately assess distress and pain in the neonatal intensive care unit (NICU); however, few instruments have had their measurement properties tested for the Brazilian population.
Objective
To analyze the intra- and inter-examiner reliability, internal consistency, and responsiveness of the Brazilian Portuguese version of the Premature Infant Pain Profile-Revised (PIPP-R) scale in Brazilian neonates.
Methods
This is a methodological study conducted in the NICU. Neonates with a gestational age of 24–42 weeks who were not under the effect of muscle block or analgesia at the time of evaluation were included. Inter-examiner reliability was assessed by two trained evaluators who independently assessed the neonates at bedside and in real time using the PIPP-R. Procedures were filmed, and the recordings were used for intra-examiner reliability assessment after 10–14 days. The Intraclass Correlation Coefficient (ICC) was used to determine intra- and inter-examiner reliability. Responsiveness was assessed by comparing the total scores before and after painful procedures using a paired t-test, followed by an effect size analysis.
Results
A total of 119 assessments were performed on 15 neonates. The PIPP-R demonstrated excellent intra- and inter-examiner reliability (ICC > 0.9), and successfully detected changes after an acute painful procedure (p = 0.003; effect size = 0.8).
Conclusion
Excellent intra- and inter-examiner reliability, and sensitivity to changes over time were observed by using the PIPP-R at bedside, indicating that this is a suitable instrument for clinical use.
In Neonatal Intensive Care Units (NICU), physical therapy has advanced significantly, evolving towards a more comprehensive, neonate-family-centered care approach.1–3 Through the accurate identification and appropriate interpretation of signs of discomfort and overstimulation, physical therapists play a critical role in promoting physiological stability and supporting developmental trajectories while minimizing stress to the neonate. Within this framework, the assessment of neonatal pain and stress indicators is a pivotal component to ensure that therapeutic interventions are both effective and consistent with the principles of individualized, developmentally supportive care.1–4
In this context, physical therapists assume a central role in the assessment and modulation of neonatal pain, particularly due to their active involvement in delivering sensory-motor stimulation and respiratory interventions commonly applied in the NICU setting.4–6 The use of validated and culturally adapted pain assessment scales is not only essential for ensuring the accuracy and reliability of clinical evaluations, but also serves to support physical therapists' clinical reasoning and therapeutic decision-making. These tools are instrumental in aligning physical therapy care with the principles of family-centered and developmentally supportive care by enabling individualized pain management strategies and promoting positive neurodevelopmental outcomes.7 The systematic review by Nunes et al. highlights that respiratory physical therapy interventions are generally safe and can lead to improvements in pulmonary mechanics and vital parameters without significantly increasing neonatal pain.4 This evidence reinforces the importance of integrating sensitive and reliable pain assessment instruments into clinical practice, thereby allowing physical therapists to monitor responses to interventions more precisely and to optimize care for this vulnerable population.
A systematic review identified 65 validated measurement tools for assessing pain and sedation during the pre-verbal stage of development. However, it was noted that few studies conducted factor analysis or similar methods to select well-defined representative items, and construct validity, internal consistency, and interobserver reliability were not evaluated for all measures.8 Menegol et al. identified that out of >40 neonatal pain measurement tools available in the literature, only four have been adequately translated and adapted to Brazilian Portuguese. The authors also noted that the instruments available for the Brazilian population demonstrated low methodological quality as evaluated by the COnsensus-based Standards for the Selection of Health Measurement INstruments (COSMIN), advising caution in interpreting assessments conducted with these scales.9 Among the scales translated and adapted to Brazilian Portuguese, the Premature Infant Pain Profile-Revised (PIPP-R) and the Neonatal Infant Pain Scale (NIPS) were developed to assess acute pain.8,9 Content validity, reliability, and internal consistency were assessed for the NIPS, whereas only content and construct validity were assessed for the PIPP-R.9 The PIPP-R is a revised version of the Premature Infant Pain Profile (PIPP), which was modified to facilitate its clinical use.10 The PIPP is a reliable and valid measure of acute pain in neonates and several validation studies demonstrate this is a robust measurement tool for neonatal pain.8,10–14 For the PIPP-R, the indicators were maintained; however, the scoring method was changed for heart rate, oxygen saturation, facial activity, initial behavioral state, and gestational age.10
After the translation, content, and construct validation of the PIPP-R into Brazilian Portuguese, we identified the need to further evaluate the measurement properties of this tool. Therefore, this study aimed to assess inter- and intra-rater reliability, internal consistency, and responsiveness of the Brazilian version of the PIPP-R.
Methods
Experimental design
This cross-sectional methodological study assessed the measurement properties of the PIPP-R15 using the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) checklist.16 The study protocol was approved by the Ethics Committee of Human Research (protocol number: 36,633,320.0.0000.0118). Data were collected in the NICU of a maternity hospital in Florianópolis, Santa Catarina, Brazil.
Participants
Neonates in the NICU were eligible for inclusion if they were clinically stable at the time of evaluation, had a chronological or corrected gestational age between 24 and 42 weeks and 6 days, and were scheduled to undergo a routine procedure deemed painful. Neonates receiving muscle blockers or analgesic medications and those on phototherapy were excluded. Parents of eligible neonates admitted to the NICU were approached by a research team member, who explained the study and obtained written consent.
The present study followed the recommended sample size of n = 50–100.17 The sample selection method was non-probabilistic and intentional.
Instruments
The PIPP-R is a multidimensional pain assessment tool comprising seven items, and it has been widely used to assess acute pain in neonates.10–12 The tool was translated into Brazilian Portuguese and underwent initial validation. The content validity index attributed by the experts for both clarity and relevance was 1.0. For construct validity, which is based on the correlation between PIPP and PIPP-R scores according to analgesic interventions, the correlation coefficients ranged from r = 0.96 to 1.00 (p < 0.001).13 The PIPP-R includes three behavioral indicators (facial actions: eyebrow bulging, eye squeezing, and nasolabial furrowing), two physiological indicators (heart rate and oxygen saturation), and two contextual factors (gestational age and behavioral state). Each item is numerically scaled and scored on a 4-point scale (0, 1, 2, and 3), indicating increasing changes in each variable relative to baseline values. The scores obtained for the seven items are summed to deliver the total pain intensity score. The maximum possible score is 21 for premature neonates (<28 weeks of gestational age) and 18 for full-term neonates.10,12
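The summation described above can be illustrated with a minimal Python sketch. It assumes only what is stated in this section (seven items, each scored 0 to 3, summed into a total); the class and field names are hypothetical, and the sketch does not encode the item-level scoring criteria, which are defined in the instrument itself.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class PippRItems:
    """Hypothetical container for the seven PIPP-R item scores (each 0-3)."""
    heart_rate: int
    oxygen_saturation: int
    brow_bulge: int
    eye_squeeze: int
    nasolabial_furrow: int
    gestational_age: int
    behavioral_state: int


def total_score(items: PippRItems) -> int:
    """Sum the seven item scores after checking each lies on the 0-3 scale."""
    scores = astuple(items)
    for score in scores:
        if score not in (0, 1, 2, 3):
            raise ValueError(f"item score {score!r} is outside the 0-3 scale")
    return sum(scores)


# Illustrative values only, not study data.
example = PippRItems(1, 0, 2, 1, 0, 3, 2)
print(total_score(example))
```

With every item at 3, the sketch returns 21, which matches the maximum stated for neonates born before 28 weeks of gestational age.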
Procedures
All procedures observed were deemed clinically required by the NICU staff. Manipulative techniques for chest physical therapy, nasal suctioning, orotracheal suctioning, insertion/removal of nasogastric or orogastric tubes, changing bandages (central line sticker, transcutaneous oxygen pressure sticker), tracheal intubation, tracheal extubation, venipuncture, arterial puncture, heel puncture, insertion of continuous positive airway pressure, insertion of a central catheter, and insertion or removal of venous access were considered painful procedures in this study.18,19
All neonates were monitored for heart rate and oxygen saturation using the unit’s equipment (Vismo PVM-4000 series - Nihon Kohden). For inter-rater reliability, two trained examiners independently assessed pain intensity using the PIPP-R in real-time at bedside (incubator or crib). Examiners recorded all PIPP-R indicators using pen and paper. All examiners were qualified physical therapists with at least 5 years of experience in pediatrics.
A 2-minute video of the neonate’s face was recorded with a camera (iPhone 11 smartphone) positioned at the top of the incubator or crib, approximately 36 to 43 cm from the neonate’s face. One of the examiners re-assessed the behavioral state and facial actions 10–14 days after the procedure. The examiner watched the video once, in real time. The same physiological indicators obtained at bedside were used. Scores observed in real time and obtained from the recordings were compared to assess intra-examiner reliability.
The responsiveness of the PIPP-R was assessed by comparing the pain intensity scores before (baseline) and after painful procedures, as rated by the same physical therapist researcher. The painful procedures described above were performed by the healthcare staff, who decided whether to use non-pharmacological pain management techniques according to local practices; this information was not recorded as part of the study.
Data analysis
Statistical analyses were performed using IBM SPSS 20.0 (IBM Corp., Armonk, NY, USA). The variables are presented using measures of central tendency, dispersion, 95 % confidence intervals (95 % CI), and absolute and/or relative frequency. The normal distribution of the data was assessed using the Kolmogorov–Smirnov test. The significance level was set at 0.05.20 Internal consistency was evaluated using Cronbach's alpha, with acceptable values defined as ≥ 0.7.21
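For readers unfamiliar with Cronbach's alpha, the sketch below shows the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), in plain Python. It is an illustration only: the analyses in this study were run in SPSS, the function name is hypothetical, and the example data are invented.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table of scores.

    rows: list of assessments, each a list with one score per item.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(rows[0])  # number of items
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented example: three assessments scored on two items.
print(cronbach_alpha([[1, 2], [2, 1], [3, 3]]))
```

Values from such a function would be compared against the ≥ 0.7 threshold stated above.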
To analyze inter-rater reliability, the Mann–Whitney test compared the assessments of the two examiners, and the Intraclass Correlation Coefficient (ICC) was applied using two-way random effects, absolute agreement, and mean measurement models.22–24 For intra-rater reliability, the Wilcoxon test was used to compare the two assessments (real-time and video), and the ICC was applied using two-way mixed effects, absolute agreement, and mean measurement models.22 Reliability values were categorized as follows: ICC > 0.9 (excellent); ICC 0.75–0.9 (good); ICC 0.5–0.75 (moderate); and ICC < 0.5 (poor reliability).22 The standard error of measurement (SEM) was calculated using the formula $\mathrm{SEM} = \sqrt{\sigma^2_{o} + \sigma^2_{\mathrm{residual}}}$, where $\sigma^2_{o}$ represents the variance owing to systematic differences between the examiners and $\sigma^2_{\mathrm{residual}}$ represents the random error variance.23,24
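The ICC model and the SEM formula described above can be sketched from a two-way ANOVA decomposition. The Python code below is a minimal illustration, not the SPSS procedure used in the study: it computes the absolute-agreement, mean-measurement ICC as (MSR − MSE) / (MSR + (MSC − MSE)/n) and estimates the two variance components of the SEM as (MSC − MSE)/n for systematic examiner differences and MSE for residual error. The function names and the example ratings are assumptions made for illustration.

```python
from math import sqrt


def two_way_mean_squares(ratings):
    """Mean squares for subjects (rows), raters (columns), and residual error.

    ratings: list of subjects, each a list with one score per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return msr, msc, mse, n


def icc_absolute_agreement_mean(ratings):
    """ICC for absolute agreement, mean of k raters."""
    msr, msc, mse, n = two_way_mean_squares(ratings)
    return (msr - mse) / (msr + (msc - mse) / n)


def sem_agreement(ratings):
    """SEM = sqrt(rater variance + residual variance), both floored at zero."""
    _, msc, mse, n = two_way_mean_squares(ratings)
    var_raters = max((msc - mse) / n, 0.0)
    return sqrt(var_raters + max(mse, 0.0))


# Invented example: the second rater scores one point higher on every subject.
ratings = [[1, 2], [2, 3], [3, 4]]
print(icc_absolute_agreement_mean(ratings), sem_agreement(ratings))
```

In the invented example the two raters rank the subjects identically, yet the ICC falls below 1 and the SEM is above 0 because the absolute-agreement form penalizes the constant one-point offset between raters.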
The correlation between the subtotal and total score variables between the intra- and inter-rater assessments was analyzed using the Spearman correlation test. The correlation values were interpreted as follows: ≤ 0.25 (weak or no correlation); 0.25–0.5 (regular correlation); 0.5–0.75 (moderate to good correlation); and > 0.75 (good to excellent correlation).20 Bland–Altman analysis was employed to verify the agreement between the intra- and inter-rater reliability assessments. This analysis was based on the mean and standard deviation of the differences, as well as the lower and upper limits of agreement. In addition, the error (defined as the dispersion of the difference points around the mean) was analyzed. To confirm agreement between the scores in the Bland–Altman analysis, both the bias (how far the mean of the differences deviates from zero) and the error must be examined, and at least 90–95 % of the measurements must fall within the limits of agreement.25–27 Lin’s Concordance Correlation Coefficient (CCC) was used for the analysis of agreement by ascertaining the magnitude of deviation from the line of perfect agreement.28,29 To assess the degree of agreement by CCC, the following classification was used: very good agreement (0.81 to 1), good agreement (0.61 to 0.80), moderate agreement (0.41 to 0.60), fair agreement (0.21 to 0.40), and poor agreement (<0.2).30,31
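The two agreement statistics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's own computation: the limits of agreement are taken as the bias ± 1.96 standard deviations of the differences (the conventional multiplier, which this section does not state explicitly), Lin's CCC uses population (1/n) moments, and the function names and example scores are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(a, b, z=1.96):
    """Return (bias, lower limit, upper limit) for paired scores a and b.

    z=1.96 is an assumed multiplier for the limits of agreement.
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = stdev(diffs)
    return bias, bias - z * spread, bias + z * spread


def lins_ccc(a, b):
    """Lin's concordance correlation coefficient with 1/n moments."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return 2 * cov / (var_a + var_b + (mean_a - mean_b) ** 2)


# Invented total scores from two hypothetical raters.
rater_1 = [1, 2, 3, 4]
rater_2 = [1, 3, 3, 5]
print(bland_altman(rater_1, rater_2))
print(lins_ccc(rater_1, rater_2))
```

The denominator of the CCC includes the squared difference between the two means, so a systematic shift between raters lowers the coefficient even when the scores are perfectly correlated.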
Responsiveness was assessed by comparing total scores before and after the painful procedure. The analysis was performed using the paired t-test, along with the mean difference (95 % CI) and an effect size analysis, in which the difference in means between moments was divided by the standard deviation at time zero (the baseline period, from the first assessment up to the time before the painful procedure). The effect size was interpreted as small at 0.20, medium at 0.50, and large at 0.80.20
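The responsiveness calculation can be illustrated as follows. The Python sketch computes the paired t statistic from the within-pair differences and the effect size as the mean change divided by the baseline standard deviation, as described above. It stops at the t statistic because the p-value requires the t distribution, which the Python standard library does not provide; the function names and the example scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """t = mean difference / (SD of differences / sqrt(n)), with n - 1 df."""
    diffs = [post - pre for pre, post in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def baseline_effect_size(before, after):
    """Mean change divided by the standard deviation of the baseline scores."""
    return (mean(after) - mean(before)) / stdev(before)


# Invented total scores before and after a procedure.
before = [1, 2, 3, 4]
after = [2, 4, 4, 6]
print(paired_t_statistic(before, after))
print(baseline_effect_size(before, after))
```

The resulting effect size would then be read against the 0.20, 0.50, and 0.80 benchmarks given above.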
Results
A total of 119 assessments were performed on 15 neonates (10 males), and the median gestational age at birth was 28 weeks, with an interquartile range of 4 weeks. The PIPP-R demonstrated an overall internal consistency of 0.753 using Cronbach's alpha.
The intra-rater reliability and SEM data are presented in Table 1. A comparative analysis between the two intra-rater reliability assessments for the PIPP-R did not reveal statistically significant differences for any item (p > 0.05), indicating no detectable differences within the limits of this analysis. Inter-rater reliability and SEM data are presented in Table 2. Similarly, no statistically significant differences were found between the two raters for any item (p > 0.05), indicating that potential differences, if any, were not captured by the current study design. Excellent reliability was observed for both intra- and inter-rater assessments (ICC = 0.98 and 0.90, respectively). Fig. 1 shows the data analysis using Lin’s Concordance Correlation Coefficient.
Table 1. Intra-examiner reliability of Premature Infant Pain Profile-Revised (PIPP-R) (n = 119).
| Item | Mean (SD) Bedside | Mean (SD) Video | pᵃ | Mean (SD) | ICC | 95 % CI LB | 95 % CI UB | pᵇ | SEM |
|---|---|---|---|---|---|---|---|---|---|
| Brow bulge | 1.12 (1.3) | 1.09 (1.3) | 0.36 | 1.10 (1.3) | 0.99 | 0.981 | 0.991 | <0.01 | 0.29 |
| Eye squeeze | 0.54 (0.9) | 0.57 (1.0) | 0.43 | 0.55 (0.3) | 0.92 | 0.886 | 0.945 | <0.01 | 0.46 |
| Nasolabial furrow | 0.18 (0.6) | 0.16 (0.6) | 0.16 | 0.17 (0.1) | 0.99 | 0.985 | 0.993 | <0.01 | 0.16 |
| Baseline BS | 1.62 (1.2) | 1.61 (1.2) | 0.79 | 1.61 (0.1) | 0.96 | 0.948 | 0.975 | <0.01 | 0.34 |
| Subtotal | 2.77 (2.6) | 2.76 (2.6) | 0.98 | 2.77 (0.1) | 0.98 | 0.975 | 0.998 | <0.01 | 0.49 |
| Total Score | 5.66 (2.8) | 5.64 (2.8) | 0.86 | 5.65 (0.2) | 0.98 | 0.971 | 0.986 | <0.01 | 0.60 |
BS, behavioural state; CI, confidence interval; HR, heart rate; ICC, intraclass correlation coefficient; LB, Lower Bound; SatO2, oxygen saturation; SD, standard deviation; SEM, standard error of measurement; UB, Upper Bound
Table 2. Inter-examiner reliability of Premature Infant Pain Profile-Revised (PIPP-R) (n = 95).
| Item | Mean (SD) Examiner 1 | Mean (SD) Examiner 2 | pᵃ | Mean (SD) | ICC | 95 % CI LB | 95 % CI UB | pᵇ | SEM |
|---|---|---|---|---|---|---|---|---|---|
| Change in HR | 0.61 (0.7) | 0.60 (0.7) | 1.00 | 0.59 (0.1) | 0.96 | 0.947 | 0.977 | <0.01 | 0.20 |
| Decrease in SatO2 | 0.38 (0.6) | 0.38 (0.6) | 0.82 | 0.38 (0.1) | 0.93 | 0.899 | 0.955 | <0.01 | 0.23 |
| Brow bulge | 0.98 (1.2) | 0.90 (1.1) | 0.80 | 0.93 (0.7) | 0.88 | 0.822 | 0.921 | <0.01 | 0.75 |
| Eye squeeze | 0.44 (0.8) | 0.43 (0.8) | 0.91 | 0.42 (0.1) | 0.80 | 0.700 | 0.867 | <0.01 | 0.47 |
| Nasolabial furrow | 0.13 (0.5) | 0.07 (0.4) | 0.36 | 0.10 (0.5) | 0.90 | 0.845 | 0.932 | <0.01 | 0.41 |
| Baseline BS | 1.74 (1.2) | 1.84 (1.2) | 0.57 | 1.80 (0.9) | 0.91 | 0.869 | 0.942 | <0.01 | 0.82 |
| Subtotal | 2.53 (2.3) | 2.38 (2.1) | 0.78 | 2.42 (0.1) | 0.90 | 0.853 | 0.935 | <0.01 | 1.39 |
| Total Score | 5.10 (2.6) | 5.04 (2.4) | 0.90 | 5.06 (0.1) | 0.90 | 0.855 | 0.936 | <0.01 | 1.11 |
BS, behavioural state; CI, confidence interval; HR, heart rate; ICC, intraclass correlation coefficient; LB, Lower Bound; SatO2, oxygen saturation; SD, standard deviation; SEM, standard error of measurement; UB, Upper Bound.
The qualitative analysis of agreement between the PIPP-R subtotal and total score data was conducted using the Bland–Altman analysis (Fig. 2). The intra-rater bias was close to zero, both for the subtotal and total score analyses of PIPP-R, with no statistically significant differences [(95 % CI: −0.11, 0.13, p = 0.89) and (95 % CI: −0.12, 0.17, p = 0.74), respectively], indicating agreement. In the linear regression analysis for the intra-rater comparison, no significant proportional bias for the subtotal and total scores was observed (p = 0.98 and 0.91, respectively). When analyzing the inter-rater bias of the subtotal and the total PIPP-R, values close to zero and no statistically significant differences were observed [(95 % CI: −1.25, 0.42, p = 0.28) and (95 % CI: −0.25, 0.35, p = 0.73), respectively]. Furthermore, in the linear regression analysis for the inter-examiner comparison of the subtotal and total scores, no significant proportional bias was observed (p = 0.25 and 0.25, respectively).
Regarding the responsiveness of the PIPP-R, scores changed significantly after an acute painful procedure compared with baseline, with a mean difference of 1.5 points (95 % CI: 0.42, 1.15; p < 0.01; effect size = 0.8).
Discussion
This study assessed inter- and intra-examiner reliability, internal consistency, and responsiveness of the Brazilian version of the PIPP-R. The results indicated excellent reliability, demonstrated by high intra- and inter-examiner agreement; and strong internal consistency, demonstrated by excellent correlation of the indicators. Furthermore, the scale was sensitive to changes in acute pain in neonates. The identification of reliable assessment tools helps strategize neonatal pain management and improve the quality of care for vulnerable newborns.32 Preterm and sick newborns are commonly exposed to a large number of painful procedures during their stay in the NICU, which increases the risk of negative neurobiological effects.33 Evidence-based practice combined with objective assessments of the value and effectiveness of rehabilitation assessment and treatment techniques should be prioritized.34 However, available tools in Brazilian Portuguese to measure neonatal pain are scarce. Of the >40 neonatal pain assessment instruments available in the literature, only four have been formally translated and adapted for this population.9
Construct and content validity of the Brazilian version of the PIPP-R were previously analyzed in a study that demonstrated high correlation between pain scores for PIPP and PIPP-R for procedures using different pain relief strategies as well as for different types of procedures in both full-term and preterm neonates.13 Strong internal consistency was also demonstrated. Measurement tools with strong internal consistency typically exhibit correlations ranging from 0.70 to 0.90 between their items. Lower correlations (< 0.70) indicate that the tool's items are possibly measuring different phenomena. Conversely, a high correlation (> 0.90) indicates redundant items, which may limit the content validity of the scale.20 Consistent with our findings, a systematic review of the validity and reliability of the instruments used to assess behavior, stress, and/or pain in preterm newborns in the NICU revealed that the internal consistency of PIPP-R, measured by the Cronbach’s alpha, ranged from 0.71 to 0.84.14
In addition, this study demonstrated that the PIPP-R has excellent intra-examiner reliability. To our knowledge, studies on the intra-examiner reliability of the PIPP-R are scarce. Among the studies evaluating the measurement properties of the PIPP, intra-examiner reliability was analyzed in the study by Ballantyne et al., who reported excellent reliability for the PIPP scale (ICC = 0.94–0.98).35
In the present study, inter-examiner reliability was excellent, corroborating the findings of previous studies.12,36–38 Taplak and Bayat assessed inter-examiner reliability using video recordings and reported excellent reliability (ICC = 0.94–1.00).36 In line with this study’s methodology, in which two independent examiners rated pain scores at bedside and in real-time, studies reporting psychometric properties of the Persian and Indonesian versions of the PIPP-R also demonstrated excellent inter-examiner reliability (ICC = 0.98–0.99 and 0.97, respectively).37,38 By demonstrating intra-examiner reliability of the tool in the ‘real world’ (e.g., pain assessment at bedside, in real-time) this study helps to advance pain measurement practices in the NICU and to improve confidence among NICU professionals in using this particular tool.
An agreement analysis is recommended to complement the reliability assessment.27 Our findings demonstrated a good agreement in CCC between the intra- and inter-examiner subtotal and total scores. In this study, the Bland–Altman analysis demonstrated that the mean differences for both intra- and inter-examiner scores, in both subtotal and total scores, remained close to zero without proportional bias, indicating strong intra- and inter-examiner agreement. While the graphs showed that most values were within the limits of agreement, the clinical interpretation of these limits should guide the assessment of agreement.26 Elias et al. aimed to analyze whether parents and healthcare professionals homogeneously assessed the presence and magnitude of pain in critically ill neonates. They reported agreement for the Bland–Altman analysis when no pain was present, while disagreement between observers was noted for moderate pain.39 In the present study, points outside the limits of agreement for inter-examiner reliability occurred in moderate to severe pain scores.
Our findings showed that the PIPP-R exhibited substantial responsiveness to changes after an acute painful procedure. The initial validation of the original version of the PIPP-R demonstrated the instrument's responsiveness between types of events and interventions, revealing a significant difference between PIPP-R scores in non-nutritive sucking with sucrose [PIPP-R 6.4 (3.4)], restraint facilitated with non-nutritive suction and sucrose [PIPP-R 7.2 (3.3)], and non-nutritive sucking alone [PIPP-R 8.6 (4.0)], suggesting greater efficacy in non-nutritive sucking with sucrose intervention.11 However, despite the considerable increase in the number of studies on neonatal pain and neonatal pain assessment, our knowledge of the responsiveness of pain tools is limited.40
In our study, pain was assessed using the PIPP-R both at bedside in real time and through analysis of video recordings viewed without pauses. The results indicate that the tool is reliable and feasible for implementation in the clinical setting. The PIPP-R may assist physical therapists in recognizing signs of stress or pain in neonates, which may reflect a loss of self-regulation and suggest that continued interaction with physical therapy may not be appropriate at that moment. Such disruptions and observable disorganization highlight the infant’s developmental challenges and indicate areas of vulnerability under certain environmental or care conditions.3,41
Additional research on neonatal pain assessment is warranted to facilitate adequate pain management in newborns, and to disseminate the results already obtained. This study contributes to this area by analyzing the measurement properties of the PIPP-R, thereby supporting its accuracy in assessing acute pain in the NICU and its use as a reliable and responsive tool for clinical and scientific purposes in Brazilian neonates. Despite these potential advantages, our study had certain limitations, including (1) the observation of only 15 neonates; (2) the exclusive use of physical therapists to apply the PIPP-R; and (3) its cross-sectional design. Future studies should consider employing the Brazilian version of the PIPP-R in research exploring neonatal pain assessment and management.
Conclusion
The Brazilian version of the PIPP-R demonstrated excellent intra- and inter-examiner reliability, strong internal consistency, and good responsiveness in assessing procedural pain in neonates. Therefore, it is feasible to use this tool for procedural pain assessment in the NICU.
The authors declare no competing interest.
Grant sponsor PROAP-AUXEP (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior); FAPESC (Fundação de Amparo à Pesquisa do Estado de Santa Catarina); PAEX-PROCEU/UDESC n° 01/2021

