Keywords listening effort - psychophysiological measure - listening demand
Assessing physiological measures in listening effort research is common. Between 2019 and 2021, Clarivate's Web of Science database lists a total of 239 articles with the term “listening effort” in the title, abstract, or keywords. Among these articles, 34% (81) employed at least one physiological measure to examine listening effort; 7% (16) employed more than one physiological measure. The variety of measures used was large, and included measures directly indexing brain activity, such as electroencephalogram (EEG) alpha oscillations,[1 ] [2 ] EEG-evoked potential components,[3 ] [4 ] and functional near-infrared spectroscopy (fNIRS),[5 ] [6 ] as well as peripheral measures, such as skin conductance,[7 ] [8 ] pupil diameter,[9 ] [10 ] heart rate variability,[11 ] [12 ] and cardiovascular preejection period.[12 ] [13 ] The choice of particular measures seemed to be driven more by the researchers' interest and the availability of measurement equipment than by a theoretical or conceptual rationale.
Given the lack of a unifying rationale, the heterogeneity in the employed measures constitutes a problem: It makes it difficult for listening effort researchers to decide which measure to use, to compare findings across studies involving different measures, and to draw straightforward conclusions from the existing literature.[14 ] Ultimately, a unifying rationale would boost theoretical progress and advance our understanding of the determinants, consequences, and mechanisms associated with listening effort. A more comprehensive approach that systematically integrates multiple physiological measures could be particularly useful when studying listening effort. However, there are a number of practical challenges to combining more than a single physiological measure in a listening effort study. The purpose of this article is to highlight some of these challenges and to provide recommendations on how to address them. We hope that this will help listening effort researchers to develop a more integrative, unified approach to using physiological measures and thereby accelerate the advancement of our understanding of listening effort. Our discussion draws heavily on the experience that we have gained in the context of the HEAR-ECO project (http://hear-eco.eu/) in which we employed several physiological measures to examine listening effort.[11 ] [13 ] [15 ] [16 ] [17 ] [18 ] The topics that we are going to discuss here are (1) the selection of appropriate physiological measures, (2) the simultaneous assessment of multiple physiological signals, (3) the aggregation and combination of simultaneously assessed physiological measures, and (4) the statistical analyses of multiple physiological measures.
Selection of Appropriate Physiological Measures
One of the most challenging aspects of a systematic, integrative approach that uses multiple physiological measures to examine listening effort is to find a good rationale for selecting the measures. For almost any common physiological measure, it is possible to find at least one publication where the authors associate the measure with listening effort or related constructs like effort, engagement, or resource allocation. Finding a published rationale that justifies the use of multiple physiological measures is, however, more difficult. Nonetheless, a unifying rationale seems to be desirable to facilitate the integration of results from different studies. Moreover, the lack of a unifying rationale increases the likelihood of a conflation of concept and measure, which is illustrated by the current discussion about the multiple dimensions of listening effort.[7 ] [19 ] [20 ] [21 ] The lack of a unifying rationale linking the concept (listening effort) to physiological measures makes it difficult to decide whether the discussion is about the dimensions of listening effort or about the dimensions of the measures employed in listening effort research.
The first step may thus be a clear and commonly accepted definition of the concept of listening effort. Without a clear definition of the concept, we will struggle to differentiate it from other phenomena[22 ] (for instance, to decide whether a listening situation is more effortful or more arousing[11 ] [16 ]), to find (psychophysiological) measures that appropriately match our concept,[22 ] [23 ] and to build a refined theory of listening effort.[24 ] Psychophysiological measures can be viewed as proxies to self-report measures of subjectively perceived listening effort (a rating or other type of assessment of the individual's perception of how effortful listening is), which in common language may be viewed as the most meaningful definition of listening effort.[25 ] Whether or not this can be regarded as the “ground truth” depends on the experimental setup and on the specific definition of listening effort. The same applies to objective behavioral measures of listening effort such as dual-task measures or delayed recall.
There are at least two approaches to developing a clear definition of a concept, and both have been used in the listening effort literature.[21 ] The first treats the empirical observation that a physiological measure responds to variations in an independent criterion variable (for instance, a listening demand-related variable like the signal-to-noise ratio of speech embedded in noise) as evidence that the measure constitutes a correlate of listening effort.[26 ] [27 ] [28 ] [29 ] [30 ] [31 ] [32 ] This implies a definition of listening effort as a state that changes in a predictable way when the level of the criterion variable (e.g., listening demand) changes. For example, it is usually assumed that a measure sensitive to listening effort should indicate relatively high effort in moderately difficult listening demand conditions, and less effort in low listening demand conditions. Using such a concept definition, any physiological measure that has been demonstrated to respond to variations in the criterion variable would constitute a valid outcome of listening effort[33 ] and could be included in listening effort studies that employ physiological measures. Listening effort researchers favoring this approach should thus specify their criterion variable(s) and then review the literature to find out which psychophysiological measures respond to changes in it/them. These measures would then constitute the set of physiological measures that could legitimately be used to examine listening effort.
The second approach to defining the concept of listening effort is to provide a verbal description of it. For instance, McGarrigle and colleagues[20 ] defined listening effort as “the mental exertion required to attend to, and understand, an auditory message,” Picou and colleagues[34 ] conceptualized it as “cognitive resources allocated for speech recognition,” and Pichora-Fuller and colleagues[35 ] defined it as “the deliberate allocation of mental resources to overcome obstacles in goal pursuit when carrying out a [listening] task.” The advantage of such a concept definition is that it avoids the risk of circularity of the criterion-variable approach:[21 ] the observed empirical relationship between a physiological measure and a listening-effort manipulation is considered to validate the measure as an indicator of listening effort and, at the same time, hypotheses about whether the manipulation changes listening effort are tested using the physiological measure. If the concept definition refers to specific self-report or objective behavioral measures of listening effort, these measures provide an efficient way to resolve the circularity problem. For instance, a definition of listening effort as the subjective feeling of investing effort in a listening task would point to a self-report measure of listening effort as criterion. However, the descriptive approach often requires additional concept definitions to justify the selection of physiological measures. For instance, it requires an additional operational definition of cognitive resource allocation as changes in pupil diameter to use Picou and colleagues' definition[34 ] to justify the use of pupil diameter in listening effort research. As far as we know, none of the current theoretical accounts of listening effort offers such a justification of specific physiological measures.
If these additional concept definitions refer to general physiological mechanisms (instead of referring to a specific measure), they offer the justification of multiple physiological measures that is needed for a unified approach to the use of physiological measures in listening effort research. For instance, using the operational definition of mental resource allocation as increased cardiac sympathetic activity in combination with Pichora-Fuller and colleagues' general definition of listening effort[35 ] would imply that all physiological measures that reflect cardiac sympathetic activity should be included in listening effort research. It is obviously not required to have two levels of concept definitions: a general one of listening effort and an operational one linking listening effort to a physiological mechanism. One could directly use an operational definition of listening effort that refers to physiological measures, for instance, a definition of listening effort as cardiac sympathetic activity in listening tasks.[36 ] However, including a broad, descriptive concept definition of listening effort probably offers a better integration of the listening effort literature that has not used physiological measures, such as those studies using only self-report or behavioral measures.
Recommendation 1
Use a clear definition of the concept of listening effort that creates an explicit link to the employed physiological measures. Make this definition salient. Other researchers will adopt your concept definition or present conflicting definitions, which will foster a discussion about the listening effort concept. This will hopefully lead over time to a commonly accepted definition of listening effort.
Simultaneous Collection of Multiple Physiological Biosignals
Once the physiological measures of interest have been selected, one needs to collect the biosignals that are required to calculate these measures. One of the most obvious challenges in the simultaneous collection of multiple biosignals is the parallel use of different measurement devices and sensors, which may interfere with one another and may result in discomfort and stress for study participants. For instance, EEG electrodes and fNIRS optodes often need to be placed at similar locations on the participant's head, which may be physically impossible if two separate sensor patches are necessary. EEG and fNIRS sensors may also interfere with the appropriate placement of the electrodes of impedance cardiograph systems (required for the determination of preejection period) that use electrodes on the forehead or behind the ears.[37 ] [38 ] Another example is the competition of eye tracking glasses and fNIRS and EEG sensors for space on the forehead. In addition to the competition for space, devices may also interfere with one another because of their electromagnetic properties. For instance, the simultaneous use of EEG and fNIRS can induce noise in the EEG signal caused by the electrical activity of the fNIRS system.[39 ] [40 ] Another example is the interference due to the magnetic field of magnetic resonance imaging (MRI) systems, which can distort the electrocardiogram (ECG) signal.[41 ] [42 ]
Many of these problems can be avoided by carefully selecting equipment. For instance, there are custom-made hybrid EEG-fNIRS systems that enable the simultaneous assessment of both signals.[43 ] [44 ] Impedance cardiography and measures that require sensors mounted on the head can be made compatible by using impedance cardiographs with an electrode configuration that does not interfere with the other devices' sensors (e.g., systems that only require electrodes on the thorax and neck[45 ]). Eye tracking is compatible with head-mounted sensors if a screen-based (remote) eye tracker is used. The problem of MRI artifacts on the ECG signal can be mitigated by using carbon fiber electrodes and leads as well as by employing statistical methods to control for the induced artifacts.[41 ] [42 ] However, careful planning, customization, and expertise are required for all involved biosignals.
In field research, the simultaneous measurement of multiple biosignals is further limited by the need for equipment to be sufficiently unobtrusive and practical so as not to interfere with daily life, while also remaining reliable, valid, and sensitive. This is especially challenging when experiments take place over many days and thus require the participants to manage the fitting and charging of equipment at home.[46 ] However, the rapid development of commercially available mobile sensors might solve some of the issues once these systems have proven to be sufficiently reliable, valid, and sensitive.[47 ] [48 ]
The second major problem related to the simultaneous assessment of multiple biosignals is the synchronization of the data. The most frequently used approach is to label the data during the data collection process with event markers and to use these recorded markers to align the different signals offline. However, given that the signals are digitized by separate devices with their own independent clocks, there will be some delays and misalignment between the signals.[39 ] [40 ] Moreover, if the signals were originally collected at different sampling rates, down-sampling of the raw signals to one and the same sampling rate may considerably distort the temporal aspects of the signals and introduce misalignment of the signals. A more sophisticated approach to data synchronization is to have one device that controls the sampling of all other devices. There are commercial solutions available, but the device (or software) would probably need to be customized to suit the needs of the specific, individual setup. Moreover, many standalone measurement devices are closed systems that do not allow a second device to control their data sampling process.
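The marker-based offline alignment described above can be sketched in a few lines. The following is a minimal illustration, not a validated pipeline: it assumes that both devices recorded timestamps for the same event markers, and it models the relation between the two clocks as a constant offset plus a linear drift. The function names (`fit_clock_mapping`, `to_reference_time`, `resample_onto`) are hypothetical, and the linear-drift model is an assumption that would need to be checked for each setup.

```python
import numpy as np

def fit_clock_mapping(markers_ref, markers_other):
    """Fit t_ref ~ slope * t_other + offset from timestamps of the same events.

    markers_ref / markers_other: times (in seconds) at which the SAME event
    markers were recorded by the reference device and by the other device.
    Assumes a constant offset plus linear clock drift (an assumption).
    """
    markers_ref = np.asarray(markers_ref, dtype=float)
    markers_other = np.asarray(markers_other, dtype=float)
    slope, offset = np.polyfit(markers_other, markers_ref, deg=1)
    return slope, offset

def to_reference_time(t_other, slope, offset):
    """Map timestamps of the other device onto the reference clock."""
    return slope * np.asarray(t_other, dtype=float) + offset

def resample_onto(t_target, t_source, x_source):
    """Linearly interpolate a CONTINUOUS signal onto target time points.

    t_source must be increasing. Linear interpolation is only meaningful for
    continuous measures; it should not be used to create values where a
    noncontinuous measure does not exist.
    """
    return np.interp(t_target, t_source, x_source)

# Illustrative values: the second device starts 0.25 s late and its clock
# runs 50 parts per million fast (both numbers are made up for the example).
markers_ref = np.array([10.0, 70.0, 130.0, 190.0])
markers_other = (markers_ref - 0.25) * 1.00005
slope, offset = fit_clock_mapping(markers_ref, markers_other)
aligned = to_reference_time(markers_other, slope, offset)
```

Markers spread over the whole recording are needed to estimate drift; with a single marker only the offset can be corrected.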
A researcher aiming to assess multiple physiological measures to examine listening effort thus needs to find a solution for the physical and electromagnetic interference of the employed devices and sensors as well as solve the problem of data synchronization. It may not always be possible to find an ideal solution, but awareness of these potential obstacles will allow for study designs to be optimized.
Recommendation 2
Determine how and whether the selected biosignals will interfere with one another and acquire appropriate specialized equipment accordingly to mitigate any problems caused by the physical and electromagnetic interference of the measurement devices and to attenuate the data synchronization issue. Consider these issues already at the planning stage of projects to ensure that the required financial, logistical, and knowledge-related resources (e.g., for the purchase of integrated measurement systems or for the recruitment of individuals with the expertise to provide custom-made solutions) are available.
Aggregation and Combination of Physiological Measures
Once one has managed to simultaneously sample the required biosignals and to synchronize them, the derived physiological measures must be aggregated and compared. One of the main challenges to this is caused by differences in the time characteristics of the physiological measures used in listening effort research. Continuous measures have a meaningful value at any given point in time and their time resolution is only limited by the quality of the measurement device. For instance, pupil diameter[10 ] or skin conductance level[29 ] has one particular value at any given moment and all such values provide meaningful information. In contrast, noncontinuous measures either do not exist at some points in time or they cannot be related to one specific point in time in a meaningful manner. For instance, peak pupil diameter refers to a specific point in time when the pupil diameter attains its maximum value in a certain time interval.[49 ] At all other points in time, peak pupil diameter does not exist. The same applies to specific components of EEG event-related potentials like the P400 amplitude[50 ] or systolic blood pressure,[12 ] the maximum blood pressure between two consecutive heart beats.
In addition to noncontinuous measures that exist only at specific points in time, there are noncontinuous measures that refer to specific time periods and can therefore not be associated with a specific point in time. For instance, preejection period[12 ] refers to the time interval between the onset of the electrical excitation of the left heart ventricle and the opening of the aortic valve. Consequently, it does not exist during other periods of the cardiac cycle[51 ] and is not associated with one single, specific point in time. Another example is heart period,[52 ] which refers to the time interval between two consecutive heart beats. There are also listening effort measures that are noncontinuous because of how they are calculated. For instance, the determination of EEG alpha[53 ] and theta power[54 ] requires the use of epochs to extract the frequency components of interest (e.g., an epoch of 1,250 ms would be required for the quantification of theta power[55 ]). Another example is heart rate variability,[29 ] which can also be determined only by quantifying variability over a certain time period (e.g., 1-minute intervals if a Fast Fourier Transform is used to quantify high-frequency heart rate variability[56 ]). [Fig. 1 ] provides an illustration of the variability in the time characteristics of the discussed measures.
Figure 1 Time characteristics of selected physiological measures. Dark gray lines indicate continuous measures; dark gray dots indicate noncontinuous measures. Dark gray dots with surrounding light gray boxes indicate noncontinuous measures that refer to time periods and not to specific points in time. The light gray boxes indicate the measurement epochs required to obtain the measure.
Associated with the various time-scales is the difference in baseline interval or nature of the baseline between various measures. For example, pupillometry measures often apply a trial-based baseline correction that is based on the mean pupil size in a relatively short period (e.g., 1,000–200 ms) prior to stimulus onset.[57 ] In some studies, this baseline is corrected for the individual dynamic range in the pupil size.[58 ] On the other hand, the reactivity of cardiovascular measures like preejection period or heart rate variability is often compared to a baseline measured before the onset of the task of interest (during rest).[12 ]
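The two kinds of baseline described above can be contrasted in a short sketch. It is illustrative only: the function names, the sampling rate, and the baseline window length are assumptions chosen for the example, not values prescribed by the cited studies.

```python
import numpy as np

def trial_baseline_correct(trace, fs, onset_idx, baseline_s=1.0):
    """Trial-based correction: subtract the mean of a short pre-stimulus window.

    trace: samples of one trial (e.g., pupil diameter)
    fs: sampling rate in Hz
    onset_idx: sample index of stimulus onset
    baseline_s: length of the pre-stimulus window in seconds (an assumption)
    """
    trace = np.asarray(trace, dtype=float)
    n_base = int(round(baseline_s * fs))
    start = onset_idx - n_base
    if start < 0:
        raise ValueError("trial does not contain the full baseline window")
    baseline = np.nanmean(trace[start:onset_idx])
    return trace - baseline

def task_reactivity(task_values, rest_values):
    """Task-level reactivity: task mean minus the mean of a resting baseline.

    Typical for cardiovascular measures such as preejection period, where the
    baseline is recorded before the task starts.
    """
    return float(np.mean(task_values) - np.mean(rest_values))

# One short trial sampled at 2 Hz with stimulus onset at sample 4.
corrected = trial_baseline_correct([3, 3, 3, 3, 4, 5], fs=2, onset_idx=4, baseline_s=2.0)
# Preejection period (ms) during task versus rest; a negative value means shortening.
pep_change = task_reactivity([98.0, 100.0], [104.0, 106.0])
```

The two functions operate on different time scales (seconds versus minutes), which is one reason why the resulting scores are not directly comparable across measures.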
It should be evident then that aggregating noncontinuous and continuous measures is complex. While it is technically possible to treat the noncontinuous measure as a continuous one by resampling to obtain one data point of the noncontinuous measure for each data point of the continuous measure,[44 ] [59 ] this leads to a bias because data points are created where the measure does not exist and a noncontinuous measure is treated as if it were continuous. The solution that probably introduces the least artificial information is to use averages across large time periods (e.g., over a block of stimulus response trials) for both continuous and noncontinuous measures. One could still argue that the continuous measure is more reliable because it depends on more measurement points and its values are not artificially introduced. However, averaging across longer time periods comes with a cost: a potential loss of sensitivity to shorter, phasic changes, so that only tonic changes in the measures are reflected. Given the popularity of paradigms in listening effort research that rely on the analysis of short stimulus-evoked phasic changes (e.g., changes in pupil response evoked by auditory stimuli[49 ] [60 ]), this constitutes a serious shortcoming.
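The block-averaging solution discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: each measure is represented by the time points at which it actually exists (every sample for a continuous measure, one value per heart beat for a measure like preejection period), and `block_mean` is a hypothetical helper name.

```python
import numpy as np

def block_mean(times, values, block_start, block_end):
    """Average all values whose time stamps fall inside [block_start, block_end).

    Works for continuous measures (many samples per block) and for
    noncontinuous measures (few values per block) without interpolating,
    so no artificial data points are created. Returns NaN for an empty block.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    mask = (times >= block_start) & (times < block_end)
    if not mask.any():
        return float("nan")
    return float(values[mask].mean())

# Continuous measure: pupil diameter sampled at 60 Hz for 10 s (made-up data).
pupil_t = np.arange(0, 10, 1 / 60)
pupil_mm = 4.0 + 0.1 * np.sin(pupil_t)
# Noncontinuous measure: one preejection period value per beat (made-up data).
beat_t = np.array([0.4, 1.3, 2.2, 3.1, 4.0, 4.9, 5.8, 6.7, 7.6, 8.5, 9.4])
pep_ms = np.full(beat_t.shape, 100.0)

# One aggregated value per measure for the same block.
pupil_block = block_mean(pupil_t, pupil_mm, 0.0, 10.0)
pep_block = block_mean(beat_t, pep_ms, 0.0, 10.0)
```

The resulting block scores only capture tonic differences between blocks; any phasic, stimulus-evoked change within the block is averaged out.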
In addition to the obstacles to the integration and comparison of multiple physiological measures created by the time characteristics of the measures themselves, differences in the time characteristics of the underlying physiological mechanisms must also be considered. Many of the physiological measures used in listening effort are driven by physiological mechanisms that operate on different time scales. For instance, it can take up to 20 seconds from the onset of nervous system activity to the maximum response of heart rate and blood pressure, and it also can take more than 10 seconds from the end of nervous system activity to the return of heart rate and blood pressure to their baseline values.[61 ] [62 ] Pupil responses seem to be driven by faster physiological mechanisms given that they appear sooner (a few seconds after stimulus onset) and also disappear within seconds.[63 ] EEG evoked potentials rely on even faster mechanisms, and can be observed after a few milliseconds.[64 ]
Given the differences in the time characteristics of the underlying physiological mechanisms, the paradigms used to optimize the assessment of the physiological measures of listening effort vary considerably. For instance, paradigms using cardiovascular measures normally present a single stimulus condition over a period of several minutes,[12 ] [13 ] [29 ] whereas paradigms using pupil-related measures tend to present different stimulus conditions in intervals of a few seconds.[30 ] [65 ] [66 ] Using multiple physiological measures that are driven by different physiological mechanisms consequently requires researchers to develop paradigms that are appropriate for the various time scales of their measures.
Recommendation 3
Take the individual time characteristics of the physiological measures and underlying physiological mechanisms into account when planning a study with multiple physiological measures. Develop paradigms that are appropriate for all involved measures.
Statistical Analysis of Multiple Physiological Measures
The final challenge to using multiple physiological measures in listening effort research is the selection of an appropriate statistical approach. The main concern here is the prevention of type-I error inflation due to the number of assessed physiological measures. One approach that is frequently adopted in listening effort research is to use an independent statistical test for each assessed measure. Unfortunately, this quickly increases type-I error. It is thus necessary to employ a type-I error control procedure. However, the big challenge is to find one that has a minimal impact on statistical power.
One option is to analyze all physiological measures in a two-step procedure where a first multivariate analysis of variance (MANOVA) is used as gatekeeper for follow-up univariate tests.[67 ] For instance, Plain and colleagues[11 ] analyzed seven different physiological listening effort measures by first conducting a MANOVA that included all measures and then using univariate tests for those measures that were significant. If such a two-stage procedure is used with appropriately adapted critical F - and t -values for the follow-up tests, it can successfully control the maximum type-I error rate. However, in designs with more than two groups, simple single-stage multiple-comparison procedures (like the Bonferroni procedure) perform as well as the more complex MANOVA-protected procedure and may thus be preferred.[67 ] Avoiding multivariate procedures also mitigates the problem of multicollinearity between the dependent variables, which can influence the interpretability of the results.[68 ] Multicollinearity—the correlation between the outcome variables in this case—is common in psychophysiological research given that the measures are often driven by the same or associated physiological mechanisms.[69 ] For instance, both pupil changes and heart rate changes are driven by sympathetic and parasympathetic nervous system activity and will highly correlate with one another if the autonomic outflow to the pupil and the heart does not differ.
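To make the size of the type-I error inflation concrete, the following sketch computes the familywise error rate for a set of uncorrected tests and the per-test alpha under a Bonferroni correction. It assumes statistically independent tests, which is a simplification: as noted above, physiological measures are often correlated, so the real inflation may be smaller than the value computed here. The function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

# Seven measures, each tested at alpha = .05 without correction:
uncorrected = familywise_error_rate(0.05, 7)   # roughly 0.30
# The same seven tests with a Bonferroni-adjusted per-test alpha:
per_test = bonferroni_alpha(0.05, 7)           # roughly 0.007
corrected = familywise_error_rate(per_test, 7) # just below 0.05
```

The stricter per-test alpha is what reduces statistical power for each individual measure, which is the trade-off described in the preceding paragraphs.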
An alternative approach is to aggregate the measures into a single index.[70 ] [71 ] For instance, preejection period and pupil diameter in the dark (when the parasympathetic contribution is minimal[72 ]) could be combined into a single index of sympathetic activity. A single aggregated index could be analyzed with a single statistical test and would thus prevent the problem of type-I error inflation discussed in the preceding paragraphs. Moreover, it would have higher statistical power because no type-I error inflation control would be needed and specific planned contrasts could be conducted.[73 ] [74 ] [75 ] Aggregating measures requires a decision on whether to standardize the individual measures before the aggregation. Standardizing the measures controls for the impact of the variability and magnitude of the responses of the individual measures. At first sight, this might seem to be a good idea because one would like each measure to have the same influence on the aggregated index. However, the standardization (for instance, a z-standardization[76 ]) is often performed using the collected data, which introduces a bias. For instance, combining a z-standardized physiological measure where participants originally showed almost no response variability (for instance, heart rate changes with a mean of 2 beats per minute [bpm] and a standard deviation of 1 bpm) with a z-standardized measure where participants showed strong response differences (for instance, systolic blood pressure responses with a mean of 20 millimeters of mercury [mm Hg] and a standard deviation of 10 mm Hg) leads to a huge bias because it treats a blood pressure change of 30 mm Hg as being equivalent to a heart rate change of 3 bpm. A blood pressure change of 30 mm Hg constitutes a much stronger physiological response than a heart rate response of 3 bpm, but this is neglected by the resulting index. This problem can be prevented by standardizing the individual physiological measures using their physiologically possible range as criterion (instead of their sample mean and variability). For fNIRS research, this approach has been taken recently by Zhang and colleagues, who used a breath-holding task to scale the fNIRS response differences between conditions by the physiologically plausible range of the fNIRS response before performing the statistical analysis.[77 ] Unfortunately, information about the absolute minimum and maximum response of many of the physiological measures employed in listening effort research is often not available. For instance, no information is available regarding the physiological maximum of a skin conductance response.
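The standardization problem can be reproduced with the numbers from the example above. The sketch below is illustrative only: the sample means and standard deviations are those given in the text, whereas the physiological ranges passed to `range_scale` are placeholder values invented for the example (as noted, such ranges are often unknown). The function names are hypothetical.

```python
def z_score(value, sample_mean, sample_sd):
    """Standardize a response using the sample mean and standard deviation."""
    return (value - sample_mean) / sample_sd

def range_scale(value, phys_min, phys_max):
    """Scale a response by a physiologically possible range (0 = min, 1 = max)."""
    return (value - phys_min) / (phys_max - phys_min)

# Sample-based z-standardization, using the values from the text:
z_hr = z_score(3.0, sample_mean=2.0, sample_sd=1.0)      # heart rate, bpm
z_sbp = z_score(30.0, sample_mean=20.0, sample_sd=10.0)  # systolic pressure, mm Hg
# Both responses receive the same z-score, so an index built from them
# treats 3 bpm and 30 mm Hg as equally strong responses.

# Range-based scaling. The ranges below are PLACEHOLDERS, not physiological
# facts; they only show how the scaling would be applied if ranges were known.
r_hr = range_scale(3.0, phys_min=0.0, phys_max=100.0)
r_sbp = range_scale(30.0, phys_min=0.0, phys_max=100.0)

# A simple aggregate index: the mean of the scaled responses.
index = (r_hr + r_sbp) / 2.0
```

With range-based scaling, the larger blood pressure response contributes more to the index than the small heart rate response, which is the behavior the z-standardized index fails to show.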
Recommendation 4
Plan your statistical analysis to account for the problems of assessing statistical significance (p -values) when running multiple tests (i.e., increased type-I error when uncorrected or reduced statistical power when corrected for multiple testing) and of analyzing measures that are potentially highly correlated. If possible, use an aggregate index that represents the physiological mechanism that you are interested in.
Summary
Moving from using single physiological measures in listening effort research to combining multiple measures that are justified by a single, unifying rationale would help the field to overcome the fragmented approach that currently exists. The explicit presentation of researchers' concept definition of listening effort and its use to justify the employed physiological measures would promote a discussion about the core concept and hopefully lead to a commonly accepted definition of listening effort. Combining multiple measures does, however, require awareness of the problems that are caused by the simultaneous use of multiple measurement devices as well as sound knowledge about the time characteristics of the measures and the underlying physiological mechanisms. Moreover, awareness of the statistical issues associated with analyzing multiple measures is also required. The solutions to many of the challenges that we have outlined are still in their infancy or are yet to be developed. However, we are convinced that we should not leave it to future generations of researchers to integrate the fragmented field that we have created. Addressing these issues now is the only way forward to a more integrated approach to the use of physiological measures in listening effort research and to a comprehensive understanding of listening effort.