Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers – Nature.com


Dataset and study design

This section contains an overview of how the dataset was collected, its characteristics and its underlying study design. More in-depth descriptions are provided in two accompanying papers: Budd and co-workers23 report a detailed description of the full dataset, whereas Pigoli et al.30 present the rationale for and full details of the statistical design of our study.

Our main sources of recruitment were the REACT study and the NHS T+T system. REACT is a prevalence survey of SARS-CoV-2 based on repeated cross-sectional samples from a representative subpopulation defined via (stratified) random sampling from England's NHS patient register31. The NHS T+T service was a key part of the UK government's COVID-19 recovery strategy for England. It ensured that anyone developing COVID-19 symptoms could be swab tested, followed by the tracing of recent close contacts of any individuals testing positive for SARS-CoV-2 (ref. 25).

Enrolment for both the REACT and NHS T+T recruitment channels was performed on an opt-in basis. Individuals participating in the REACT study were presented with the option to volunteer for this study. For the NHS T+T recruitment channel, individuals receiving a PCR test from the NHS T+T pillar 2 scheme were invited to take part in research (pillar 1 tests refer to all swab tests performed in Public Health England laboratories and NHS hospitals for those with a clinical need, and health and care workers, whereas pillar 2 comprises swab testing for the wider population25). The guidance provided to potential participants was that they should be at least 18 years old, have taken a recent swab test (initially no more than 48 h beforehand, extended to 72 h on 14 May 2021), agree to our data privacy statement and have their PCR barcode identifier available, which was then validated internally.

Participants were directed to the 'Speak up and help beat coronavirus' web page24. Here, after agreeing to the privacy statement and completing the survey questions, participants were asked to record four audio clips. The first involved the participant reading out the sentence 'I love nothing more than an afternoon cream tea', which was designed to contain a range of different vowel and nasal sounds. This was followed by three successive sharp exhalations, taking the form of a 'ha' sound. The final two recordings involved the participant performing volitional/forced coughs, once, and then three times in succession. Recordings were saved in .wav format. Smartphones, tablets, laptops and desktops were all permitted. The audio recording protocol was homogenized across platforms to reduce the risk of bias due to device types.

Existing metadata such as age, gender, ethnicity and location were transferred from linked T+T/REACT records. Participants were not asked to repeat this information to avoid survey fatigue. An additional set of attributes, hypothesized to be the most useful for evaluating the possibility of COVID-19 detection from audio, was collected in the digital survey. This was in line with General Data Protection Regulation requirements that only the personal data necessary to the task should be collected and processed. This set included the symptoms currently on display (the full set of which is detailed in Fig. 1e,f) and long-term respiratory conditions such as asthma. The participant's first language was also collected to control for different dialects/accents and to complement location and ethnicity. Finally, the test centre at which the PCR was conducted was recorded. This enabled the removal of submissions when cases were linked to faulty test centre results. A full set of the dataset attributes can be found in Budd and colleagues23.

The final dataset is downstream of a quality control filter (see Fig. 1g), in which a total of 5,157 records were removed, each with one or more of the following characteristics: (1) missing response data (missing a PCR test); (2) missing predictor data (any missing audio files or missing demographic/symptoms metadata); (3) audio submission delays exceeding ten days post test result; (4) self-inconsistent symptoms data; (5) a PCR testing laboratory under investigation for unreliable results; (6) a participant age of under 18; and (7) sensitive personal information detected in the audio signal (see Fig. 3d of ref. 23). Pigoli et al.30 present these implemented filters in full, and the rationale behind each one. The final collected dataset, after data filtration, comprised 23,514 COVID+ and 44,328 COVID− individuals recruited between March 2021 and March 2022. Please note that the sample size here differs from that in our accompanying papers: Budd et al.23 reported numbers before the data quality filter was applied, whereas our statistical study design considerations, detailed in a work by Pigoli and colleagues30, focused on data from the restricted date range spanning March to November 2021. We note that the step-like profile of the COVID− count is due to the six REACT rounds, in which a higher proportion of COVID− participants were recruited than in the T+T channel. As detailed in the geo-plots in Fig. 1a,b, the dataset achieves good coverage across England, with some areas yielding more recruited individuals than others. We are pleased to see no major correlation between geographical location and COVID-19 status (Fig. 1c), with Cornwall displaying the highest level of COVID-19 imbalance, at a 0.8% difference in the percentage proportions of COVID+ and COVID− cases.
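For concreteness, the sketch below shows how such a record-level filter might be applied with pandas. The column names (for example, 'pcr_result', 'submission_delay_days') are hypothetical placeholders rather than the study's actual field names; the implemented filters are specified in full by Pigoli et al.30.

```python
import pandas as pd

# Minimal sketch of the quality-control filter described above.
# All column names are illustrative placeholders, not the study's actual fields.
def apply_quality_filter(df: pd.DataFrame) -> pd.DataFrame:
    keep = (
        df["pcr_result"].notna()                      # (1) response present
        & df["audio_and_metadata_complete"]           # (2) all audio files and metadata present
        & (df["submission_delay_days"] <= 10)         # (3) submitted within 10 days of test result
        & ~df["symptoms_inconsistent"]                # (4) self-consistent symptoms data
        & ~df["lab_under_investigation"]              # (5) reliable testing laboratory
        & (df["age"] >= 18)                           # (6) adults only
        & ~df["audio_contains_personal_info"]         # (7) no sensitive speech content detected
    )
    return df.loc[keep].copy()
```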

In our pre-specified analysis plan, we defined three training sets and five test sets to define a range of analyses in which we investigate, characterize and control for the effects of enrolment bias in our data:

Randomized train and test sets. A participant-disjoint train and test set was randomly created from the whole dataset, similar to methods in previous works.

Standard train and test set. Designed to be a challenging, out-of-distribution evaluation procedure. Carefully selected attributes such as geographical location, ethnicity and first language are held out for the test set. The standard test set was also engineered to over-represent sparse combinations of categories, such as older COVID+ participants30. The samples included in this split exclusively consist of recordings made prior to 29 November 2021.

Matched train and test sets. The numbers of COVID− and COVID+ participants are balanced within each of several key strata. Each stratum is defined by a unique combination of measured confounders, including binned age, gender and a number of binary symptoms (for example, cough, sore throat, shortness of breath; see Methods for a full description). The samples included in this split exclusively consist of recordings made prior to 29 November 2021.

Longitudinal test set. To examine how classifiers generalized out-of-sample over time, the longitudinal test set was constructed only from participants joining the study after 29 November 2021.

Matched longitudinal test set. Within the longitudinal test set, the numbers of COVID− and COVID+ participants are balanced within each of several key strata, as in the matched test set above.

The supports for each of these splits are detailed in Fig. 1h.
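As a rough illustration of how the randomized and longitudinal splits could be constructed, the sketch below performs a participant-disjoint random split and a date-based split at 29 November 2021. The column names ('participant_id', 'recruitment_date') are assumed for illustration and are not the dataset's actual field names.

```python
import numpy as np
import pandas as pd

CUTOFF = pd.Timestamp("2021-11-29")   # date separating the longitudinal test set

def participant_disjoint_split(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    """Randomized train/test split with no participant appearing in both sets."""
    rng = np.random.default_rng(seed)
    ids = df["participant_id"].unique()
    test_ids = set(rng.choice(ids, size=int(test_frac * len(ids)), replace=False))
    is_test = df["participant_id"].isin(test_ids)
    return df[~is_test], df[is_test]

def longitudinal_test_set(df: pd.DataFrame) -> pd.DataFrame:
    """Only participants who joined the study after the cut-off date."""
    return df[df["recruitment_date"] > CUTOFF]
```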

Three separate models were implemented for the task of COVID-19 detection from audio, each representing an independent machine learning pipeline. These three models collectively span the machine learning research space, ranging from an established baseline to the current state of the art in audio classification technologies, and are visually represented in Extended Data Fig. 7. We also fitted an RF classifier to predict COVID-19 status from self-reported symptoms and demographic data. The outcome used to train and test each of the prediction models was a participant's SARS-CoV-2 PCR test result. Each model's inputs and predictors, and the details of how they are handled, can be found below. Wherever applicable, we have reported our study's findings in accordance with the TRIPOD statement guidelines32. The following measures were used to assess model performance: ROC-AUC, area under the precision-recall curve (PR-AUC) and UAR (also known as balanced accuracy). Confidence intervals for ROC-AUC, PR-AUC and UAR are based on the normal approximation method33, unless otherwise stated to be calculated by the DeLong method34.
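The sketch below illustrates one way these three metrics could be computed with scikit-learn, together with a simple binomial-style normal-approximation confidence interval; the exact interval formula of ref. 33 (and the DeLong method34) may differ in detail, so this is an illustration rather than the study's evaluation code.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             balanced_accuracy_score)

def evaluate(y_true, y_score, threshold=0.5, z=1.96):
    """ROC-AUC, PR-AUC and UAR with simple normal-approximation CIs.

    The CI uses the approximation m +/- z*sqrt(m(1-m)/n) purely as an
    illustration; it is not necessarily the formula used in ref. 33.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    n = len(y_true)
    metrics = {
        "ROC-AUC": roc_auc_score(y_true, y_score),
        "PR-AUC": average_precision_score(y_true, y_score),
        "UAR": balanced_accuracy_score(y_true, y_pred),
    }
    return {name: (m,
                   (m - z * np.sqrt(m * (1 - m) / n),
                    m + z * np.sqrt(m * (1 - m) / n)))
            for name, m in metrics.items()}
```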

We defaulted to the widely used openSMILE-SVM approach35 for our baseline model. Here, 6,373 handcrafted features (the ComParE 2016 set), including the zero-crossing rate and shimmer, which have been shown to represent human paralinguistics well, are extracted from the raw audio. These features are then concatenated to form a 6,373-dimensional vector, $f_{\mathrm{openSMILE}}(\mathbf{w}) \mapsto \mathbf{v}$, where the raw waveform $\mathbf{w} \in \mathbb{R}^{n}$ (with $n$ = clip duration in seconds × sample rate) is transformed to $\mathbf{v} \in \mathbb{R}^{6,373}$; $\mathbf{v}$ is then normalized prior to training and inference. A linear SVM is fitted to this space and tasked with binary classification. We select the optimal SVM configuration on the basis of the validation set before then retraining on the combined train-validation set.
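A minimal sketch of this pipeline, assuming the openSMILE Python bindings and scikit-learn, is shown below; the regularization constant is a placeholder, since the study selects the SVM configuration on the validation set.

```python
import numpy as np
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# ComParE 2016 functionals: 6,373 features per audio clip.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_features(wav_paths):
    # Each clip w is mapped to a 6,373-dimensional vector v.
    return np.vstack([smile.process_file(p).to_numpy().ravel() for p in wav_paths])

# Normalize the feature vectors, then fit a linear SVM for binary classification.
# C=1.0 is a placeholder; in the study the SVM configuration is tuned on validation data.
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
# model.fit(extract_features(train_paths), train_labels)
```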

Bayesian neural networks (BNNs) provide estimates of uncertainty alongside strong supervised classification performance, which is desirable for real-world use cases, especially those involving clinical use. BNNs are naturally suited to Bayesian decision theory, which benefits decision-making applications with different costs on error types (for example, assigning unequal weighting to errors in different COVID-19 outcome classifications)36,37. We thus supply a ResNet-50 (ref. 38) BNN model. The base ResNet-50 model showed initial strong promise for ABCS5, further motivating its inclusion in this comparison. We obtain estimates of uncertainty through Monte Carlo Dropout, which yields approximate Bayesian inference over the posterior, as in ref. 39. We opt to use the pre-trained model for a warm start to the weight approximations, and allow full retraining of layers.
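The sketch below shows the standard Monte Carlo Dropout recipe in PyTorch (keep dropout layers active at inference and average several stochastic forward passes); it is a generic illustration of the technique of ref. 39, not the study's exact implementation.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, n_samples: int = 30):
    """Approximate Bayesian inference via Monte Carlo Dropout.

    Dropout layers are kept active at inference time; the spread across the
    stochastic forward passes serves as an uncertainty estimate.
    """
    model.eval()
    for m in model.modules():                       # re-enable dropout only
        if isinstance(m, torch.nn.Dropout):
            m.train()
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)      # predictive mean and uncertainty
```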

The features used to create an intermediate representation, as input to the convolutional layers, are Mel filterbank features with the default configuration from the VGGish GitHub repository (ref. 40): $\mathbf{X}_{i} \in \mathbb{R}^{96 \times 64}$, that is, 64 log-mel spectrogram coefficients over 96 feature frames of 10 ms duration, computed from a signal resampled at 16 kHz. Each input signal was divided into these two-dimensional windows, such that a 2,880 ms clip would produce three training examples, with the clip label (COVID+ or COVID−) assigned to each window. Incomplete frames at the edges were discarded. As with the openSMILE-SVM, silence was not removed. For evaluation, the mean prediction over feature windows was taken per audio recording to produce a single decision per participant. To make use of the available uncertainty metrics, Supplementary Note 3 details an uncertainty analysis over all audio modalities for a range of train-test partitions.
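The following sketch, using librosa, illustrates the windowing and per-participant averaging described above; the STFT parameters are assumed values chosen to approximate the VGGish defaults and may not match the study's configuration exactly, and `model` is a hypothetical callable that accepts a 96x64 window.

```python
import numpy as np
import librosa

SR, N_MELS, HOP, WIN_FRAMES = 16000, 64, 160, 96   # 10 ms hop at 16 kHz, 96-frame windows

def log_mel_windows(path: str) -> np.ndarray:
    """Split a recording into non-overlapping 96x64 log-mel windows (VGGish-style).

    Incomplete trailing windows are discarded, as described above.
    """
    y, _ = librosa.load(path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS,
                                         hop_length=HOP, n_fft=400)
    log_mel = np.log(mel + 1e-6).T                  # shape: (frames, 64)
    n = log_mel.shape[0] // WIN_FRAMES
    return log_mel[: n * WIN_FRAMES].reshape(n, WIN_FRAMES, N_MELS)

def predict_participant(model, path: str):
    # Mean prediction over windows yields one decision per recording/participant.
    windows = log_mel_windows(path)
    return np.mean([model(w) for w in windows], axis=0)
```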

In recent years, transformers41 have started to perform well in high-dimensional settings such as audio42,43. This is particularly the case when models are first trained in a self-supervised manner on unlabelled audio data. We adopt the SSAST44, which is on a par with the current state of the art for audio event classification. Raw audio is first resampled to 16 kHz and normalized before being transformed into Mel filter banks. Strided convolutional layers project the Mel filter banks to a series of patch-level representations. During self-supervised pretraining, random patches are masked before all of the patches are passed to a transformer encoder. The model is trained to jointly reconstruct the masked audio and to classify the order in which the masked patches occur. The transformer is made up of 12 multihead attention blocks. The model is trained end to end, with gradients being passed all the way back to the convolutional feature extractors. The model is pre-trained on a combined set of AudioSet-2M (ref. 45) and Librispeech46, representing over two million audio clips, for a total of ten epochs. The model is then fine-tuned in a supervised manner on the task of COVID-19 detection from audio. Silent sections of audio recordings are removed before the signal is resampled to 16 kHz and normalized. Clips are cut/zero-padded to a fixed length of 5.12 s, which corresponds approximately to the mean audio clip length. For cases in which the signal length exceeds 5.12 s (after silence is removed), the first 5.12 s are taken. At training time, the signal is augmented by applying SpecAugment47 and adding Gaussian noise. The output representations are mean pooled before being fed through a linear projection head. No layers are frozen and, again, the model is trained end to end. The model is fine-tuned for a total of 20 epochs. The model is evaluated on the validation set at the end of each epoch and its weights are saved; at the end of training, the best-performing model over all epochs is chosen.
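As a sketch of the fine-tuning preprocessing and augmentation, the code below cuts or zero-pads a (silence-removed, 16 kHz, normalized) waveform to 5.12 s and applies SpecAugment-style masking plus additive Gaussian noise with torchaudio; the mask sizes and noise scale are assumed values, not the study's settings.

```python
import torch
import torchaudio

SR = 16000
TARGET_LEN = int(5.12 * SR)                          # 5.12 s fixed-length input

def prepare_waveform(wav: torch.Tensor) -> torch.Tensor:
    """Cut or zero-pad a waveform to exactly 5.12 s (first 5.12 s kept if longer)."""
    if wav.shape[-1] >= TARGET_LEN:
        return wav[..., :TARGET_LEN]
    return torch.nn.functional.pad(wav, (0, TARGET_LEN - wav.shape[-1]))

# Train-time augmentation: SpecAugment-style frequency/time masking on the
# Mel filter bank features, plus additive Gaussian noise.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=96)

def augment(mel_spec: torch.Tensor, noise_std: float = 0.01) -> torch.Tensor:
    mel_spec = time_mask(freq_mask(mel_spec))
    return mel_spec + noise_std * torch.randn_like(mel_spec)
```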

To predict SARS-CoV-2 infection status from self-reported symptoms and demographic data, we applied an RF classifier with default settings (having self-reported symptoms and demographic data as inputs). In our dataset, predictor variables for the symptoms RF classifier comprised: cough; sore throat; asthma; shortness of breath; runny/blocked nose; a new continuous cough; chronic obstructive pulmonary disease (COPD) or emphysema; another respiratory condition; age; gender; smoker status; and ethnicity. In Han and colleagues' dataset18, predictor variables for the symptoms RF classifier comprised: tightness of chest; dry cough; wet cough; runny/blocked nose; chills; smell/taste loss; muscle ache; headache; sore throat; short breath; dizziness; fever; age; gender; smoker status; language; and location. Prior to training, categorical attributes were one-hot encoded. No hyperparameter tuning was performed, and models were trained on the combined Standard train and validation sets. For the hybrid symptoms+audio RF classifier, the predicted COVID+ probability output by an audio-trained SSAST is appended as an additional input variable to the self-reported symptoms and demographic variables listed above.
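A minimal sketch of the symptoms RF and its hybrid symptoms+audio variant, assuming scikit-learn and hypothetical column names, is given below; as noted above, the study uses default RF settings with no hyperparameter tuning.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative column names only; the dataset's actual field names may differ.
PREDICTOR_COLS = ["cough", "sore_throat", "asthma", "shortness_of_breath",
                  "runny_blocked_nose", "new_continuous_cough", "copd_emphysema",
                  "other_respiratory_condition", "age", "gender", "smoker_status",
                  "ethnicity"]

def fit_symptoms_rf(train_df: pd.DataFrame, audio_probs=None) -> RandomForestClassifier:
    """Default-settings RF on one-hot-encoded symptoms/demographics.

    If `audio_probs` (the SSAST's predicted COVID+ probability per participant)
    is supplied, it is appended as an extra column to form the hybrid classifier.
    """
    X = pd.get_dummies(train_df[PREDICTOR_COLS])     # one-hot encode categorical attributes
    if audio_probs is not None:
        X["ssast_covid_prob"] = audio_probs
    rf = RandomForestClassifier()                    # default hyperparameters, no tuning
    rf.fit(X, train_df["pcr_result"])
    return rf
```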

The matched test set was constructed by exactly balancing the numbers of COVID+ and COVID− individuals in each stratum, where, to be in the same stratum, individuals must be matched on all of (recruitment channel) × (10-year-wide age bins) × (gender) × (all six binary symptom covariates). The six binary symptoms matched on in the matched test set were: cough; sore throat; asthma; shortness of breath; runny/blocked nose; and at least one symptom.

Our matching algorithm proceeds as follows. First, each participant is mapped to exactly one stratum. Second, the following matching procedure is applied separately in each stratum: in stratum $s$ (of a total of $S$ strata), let $n_{s,+}$ and $n_{s,-}$ denote the number of COVID+ and COVID− individuals, respectively, and let $\mathcal{A}_{s,+}$ and $\mathcal{A}_{s,-}$ be the corresponding sets of individuals. Use $\mathcal{M}_{s,+}$ and $\mathcal{M}_{s,-}$ to denote random samples, drawn without replacement, of size $\min\{n_{s,+}, n_{s,-}\}$ from $\mathcal{A}_{s,+}$ and $\mathcal{A}_{s,-}$, respectively. Finally, we combine matched individuals across all strata into the matched dataset $\mathcal{M}$, defined as:

$$\mathcal{M} := \bigcup_{s=1}^{S}\left(\mathcal{M}_{s,+} \cup \mathcal{M}_{s,-}\right).$$

The resulting matched test set comprised 907 COVID-positive and 907 COVID-negative participants. The matched training set was constructed similarly to the matched test set, though with slightly different strata so as to increase the available sample size. For the matched training set, individuals were matched on all of (10-year-wide age bins) × (gender) × (all seven binary covariates). The seven binary covariates used for the matched training set were: cough; sore throat; asthma; shortness of breath; runny/blocked nose; COPD or emphysema; and smoker status. The resulting matched training set comprised 2,599 COVID-positive and 2,599 COVID-negative participants.
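In code, this stratified matching could look like the pandas sketch below, where `stratum_cols` would hold the covariates listed above; the column names and label encoding are assumptions for illustration.

```python
import pandas as pd

def match_by_strata(df: pd.DataFrame, stratum_cols, label_col="pcr_result", seed=0):
    """Exactly balance COVID+ and COVID- counts within each stratum.

    Within each stratum, min(n+, n-) individuals are sampled without replacement
    from each class; matched strata are then concatenated into one dataset.
    """
    matched = []
    for _, stratum in df.groupby(stratum_cols):
        pos = stratum[stratum[label_col] == 1]
        neg = stratum[stratum[label_col] == 0]
        k = min(len(pos), len(neg))
        if k > 0:
            matched.append(pos.sample(n=k, random_state=seed))
            matched.append(neg.sample(n=k, random_state=seed))
    return pd.concat(matched, ignore_index=True)
```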

We consider the action of applying a particular testing protocol to an individual randomly selected from a population. The four possible outcomes $O_{\hat{y},y}$ are

$$O_{\hat{y},y} := \left[\text{Predict COVID-19 status as } \hat{y}\right] \ \text{AND} \ \left[\text{True COVID-19 status is } y\right] \tag{2}$$

for predicted COVID-19 status $\hat{y} \in \{0,1\}$ and true COVID-19 status $y \in \{0,1\}$. We denote the probability of outcome $O_{\hat{y},y}$ by

$$p_{\hat{y},y} := \mathbb{P}(O_{\hat{y},y}) \tag{3}$$

and use $u_{\hat{y},y}$ to denote the combined utility of the consequences of outcome $O_{\hat{y},y}$. For a particular population prevalence proportion, $\pi$, the $p_{\hat{y},y}$ are subject to the constraints

$$p_{0,1} + p_{1,1} = \pi \tag{4}$$

$$p_{0,0} + p_{1,0} = 1 - \pi, \tag{5}$$

leading to the following relationships, valid for $\pi \in (0,1)$, involving the sensitivity and specificity of the testing protocol:

$$\mathrm{sensitivity} \equiv \frac{p_{1,1}}{p_{1,1} + p_{0,1}} = \frac{p_{1,1}}{\pi} \tag{6}$$

$$\mathrm{specificity} \equiv \frac{p_{0,0}}{p_{0,0} + p_{1,0}} = \frac{p_{0,0}}{1 - \pi}. \tag{7}$$

The expected utility is:

$$\mathrm{EU} = \sum_{\hat{y} \in \{0,1\}} \sum_{y \in \{0,1\}} u_{\hat{y},y}\, p_{\hat{y},y} \tag{8}$$

$$= u_{1,1} p_{1,1} + u_{0,1}(\pi - p_{1,1}) + u_{0,0} p_{0,0} + u_{1,0}(1 - \pi - p_{0,0}) \tag{9}$$

$$= \pi\left[(u_{1,1} - u_{0,1}) \times \mathrm{sensitivity} + u_{0,1}\right] + (1 - \pi)\left[(u_{0,0} - u_{1,0}) \times \mathrm{specificity} + u_{1,0}\right], \tag{10}$$

where equations (4) and (5) are substituted into equation (8) to obtain equation (9), and equations (6) and (7) are substituted into equation (9) to obtain equation (10).
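For a concrete sense of equation (10), the snippet below computes the expected utility for two hypothetical screening protocols; the utility values and operating points are arbitrary illustrative numbers, not estimates from the study.

```python
def expected_utility(sensitivity: float, specificity: float, prevalence: float, u: dict) -> float:
    """Expected utility of a testing protocol, following equation (10).

    `u` maps (predicted, true) outcome pairs to utilities. The values used in
    the example below are arbitrary and purely illustrative.
    """
    return (prevalence * ((u[1, 1] - u[0, 1]) * sensitivity + u[0, 1])
            + (1 - prevalence) * ((u[0, 0] - u[1, 0]) * specificity + u[1, 0]))

# Example: compare two hypothetical protocols at 2% prevalence, with utilities
# that penalize false negatives more heavily than false positives.
utilities = {(1, 1): 1.0, (0, 1): -5.0, (0, 0): 0.5, (1, 0): -1.0}
print(expected_utility(0.70, 0.80, 0.02, utilities))   # higher-sensitivity protocol
print(expected_utility(0.62, 0.95, 0.02, utilities))   # higher-specificity protocol
```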

To provide researchers with easy access to running the code, we have created a demonstration notebook in which the participant is invited to record their own sentence, single-cough, three-cough and exhalation sounds, and to evaluate our COVID-19 detection machine learning models on them. The model outputs a COVID-19 prediction, along with some explainable AI analysis, for example, enabling the user to listen back to the parts of the signal to which the model allocated the most attention. In the demonstration, we make clear that this is not a clinical diagnostic test for COVID-19; it is instead for research purposes, does not provide any medical recommendation, and no action should be taken following its use. The demonstration file is detailed on the main repository page and can be accessed at https://colab.research.google.com/drive/1Hdy2H6lrfEocUBfz3LoC5EDJrJr2GXpu?usp=sharing.

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
