Artificial intelligence for the recognition of benign lesions of vocal folds from audio recordings
Objective. The diagnosis of benign lesions of the vocal fold (BLVF) is still challenging. The analysis of the acoustic signals through the implementation of machine learning models can be a viable solution aimed at offering support for clinical diagnosis.
Materials and methods. In this study, a support vector machine was trained and cross-validated (10-fold cross-validation) using 138 features extracted from the acoustic signals of 418 patients with polyps, nodules, oedema, and cysts. The model’s performance was presented as accuracy and average F1-score. The results were also analysed in male (M) and female (F) subgroups.
Results. The validation accuracy was 55%, 80%, and 54% in the overall cohort, in M, and in F, respectively. Better performances were observed in the detection of cysts and nodules (58% and 62%, respectively) than of polyps and oedema (47% and 53%, respectively). The results on each lesion and the different patterns of the model in M and F are in line with clinical observations, with better results obtained in F and more accurate detection of polyps in M.
Conclusions. This study showed moderately accurate detection of four types of BLVF using acoustic signals. The analysis of the diagnostic results on gender subgroups highlights different behaviours of the diagnostic model.
The development of artificial intelligence (AI), the evolution of voice technology, and progress in audio signal analysis and natural language processing/understanding methods have opened the way to numerous potential applications of voice, such as the identification of vocal biomarkers for diagnosis, classification, and remote patient monitoring, or to enhance clinical practice 1. Especially at the beginning of the application of AI, most studies used automatic systems based on machine learning (ML) on voice recordings to classify healthy subjects according to age and gender 2,3. More recently, research has focused on the role of the audio signal of the voice as a signature of the pathogenic process. Dysphonia indicates that some changes have occurred in voice production 4. The overall prevalence of dysphonia is approximately 1%, even if the actual rate may be higher depending on the population studied and the definition of the specific voice disorder 5. The impact of a voice disorder has been increasingly recognised as a public health concern because it influences the quality of physical, social, and occupational aspects of life by interfering with communication 6. Voice health may be assessed by several acoustic parameters. The relationship between voice pathology and acoustic voice features has been clinically established and confirmed both quantitatively and subjectively by speech experts 7. Automatic systems are typically designed to determine whether a sample belongs to a healthy or non-healthy subject, and the reliability of the acoustic parameters depends on the features used to estimate them for the identification of pathological speech. Current research on voice is mostly restricted to basic questions, even though its perspectives are broad. Dankovicova et al. 8 demonstrated that ML analysis can recognise pathological speech with high classification accuracy.
In the literature, the studies on vocal biomarkers have mainly been performed in the fields of neurodegenerative disorders (Parkinson’s and Alzheimer’s diseases) 6. On the contrary, the literature on vocal biomarkers of specific vocal fold diseases is anecdotal and related to functional vocal fold disorders or rare movement disorders of the larynx 9 (Tab. I).
The most common causes of dysphonia are benign lesions of the vocal fold (BLVF). The prevalence has been reported to be 2.5%-7.7% in national epidemiologic studies in the United States and South Korea 10,11. Currently, videolaryngostroboscopy is the gold standard for the diagnosis of BLVF 12. However, laryngoscopy is an invasive and expensive procedure. Moreover, it is not generally available in primary care units, which increases the risk of delayed diagnosis and treatment. BLVF can interfere with vocal fold closure, introducing asymmetry to the vocal folds and producing severely rough voices and aperiodic acoustic waveforms 13. Traditional acoustic analysis is suitable only for nearly periodic signals and allows the clinician to have a quantifiable preliminary status for treatment follow-up, but it does not provide information for differential diagnosis or inter-individual comparison. In recent years, research has been oriented towards two principal aims, namely nonlinear dynamic acoustic analysis and the identification of quantitative instrumental tools for voice assessment 14.
Novel ML algorithms have recently improved the classification accuracy of selected features in target variables compared to more conventional procedures, thanks to the ability to combine and analyse large datasets of voice features 15,16. Although the majority of studies focus on the diagnosis of a disorder, differentiating between healthy and non-healthy subjects, we believe that a more important task is frequently differential diagnosis, where one needs to choose between two or more different diseases. Even though this is a challenging task, it is of crucial importance to move decisional support to this level. To our knowledge, the differential discrimination of BLVF using automatic audio data processing methods has been poorly studied to date. Moreover, there is no dataset of off-the-shelf audio recordings from dysphonic patients affected by BLVF available online. The main aim of this work is the study, development, and validation of ML algorithms to recognise the different BLVF from digital voice recordings. As a side result, the importance of the features for the diagnostic models and their pathophysiological relevance were analysed.
Materials and methods
We collected the audio recordings of dysphonic patients affected by BLVF who were referred to the Phoniatric Unit of the Fondazione Policlinico Universitario A. Gemelli - IRCCS of Rome from June 2015 to December 2019. All voice samples were divided into the following groups based on the endoscopic diagnosis: vocal fold cysts, Reinke’s oedema, nodules, and polyps. We excluded patients younger than 18 years or older than 65 years, with previous laryngeal or thyroid surgery, speech therapy, pulmonary diseases, gastro-oesophageal reflux, laryngeal movement disorder, or recurrent laryngeal nerve paralysis. We also excluded non-native Italian speakers. The audio tracks were obtained by asking patients to pronounce the word /aiuole/ three times in a row with their usual voice intensity, pitch, and quality. Voices were acquired using a Shure model SM48 microphone (Evanston, IL) positioned at an angle of 45° at a distance of 20 cm from the patient’s mouth. The microphone saturation input was fixed at 6/9 of CH1 and the environmental noise was < 30 dB sound pressure level (SPL). The signals were recorded in “.nvi” format with a high-definition audio recorder (Computerized Speech Lab, model 4300B, Kay Elemetrics, Lincoln Park, NJ, USA) at a sampling rate of 50 kHz and converted to “.wav” format. Each audio file was anonymously labelled with gender and type of BLVF.
All the following analyses were performed using MATLAB R2019b (The MathWorks, Natick, MA, USA). The analysis pipeline included signal pre-processing, feature extraction, screening of the features, and model implementation (Fig. 1).
The unvoiced segments of the signals were removed through a threshold-based algorithm that accounts for signal loudness, computed using the MathWorks built-in function acousticLoudness. This function implements two different methods described in the ISO 532 standard, which allow estimation of loudness as perceived by persons with otologically normal hearing under specific listening conditions. More specifically, three characteristics of human hearing are modelled, i.e. the dependency on frequency, the concept of critical bands, and spectral masking (the phenomenon whereby a tone can be swamped by a louder tone at a nearby frequency). Further details on the implementation of these methods can be found in the following references 17,18. Afterwards, a 1024-point Hamming window was used to segment the signals, with half-window overlap, to reduce the spectral leakage.
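As an illustration of the windowing step, a minimal Python sketch follows (the study used MATLAB; the function and variable names here are hypothetical, not the authors' code):

```python
import math

def hamming(n_points):
    # Hamming window: w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def segment(signal, win_len=1024):
    # Half-window overlap -> hop size of win_len // 2 samples
    hop = win_len // 2
    win = hamming(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frames.append([s * w
                       for s, w in zip(signal[start:start + win_len], win)])
    return frames
```

At the 50-kHz sampling rate used in the study, a 1024-point window spans roughly 20 ms of signal, with a new frame starting every ~10 ms.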
On the segmented signal, 66 different features in the time, frequency, and cepstral domain were extracted. Next, seven statistical measures were computed on the extracted features, namely: mean, standard deviation, skewness, kurtosis, 25th, 50th, and 75th percentiles. In addition, jitter, shimmer, and tilt of the power spectrum were obtained from the whole unsegmented signal (see supplementary material).
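The exact formulas are given in the supplementary material; as one common textbook definition of local jitter and shimmer (an assumption for illustration, not necessarily the authors' implementation), given the cycle durations and peak amplitudes of the voiced signal:

```python
def jitter_percent(periods):
    # Local jitter: mean absolute difference between consecutive
    # glottal cycle durations, relative to the mean cycle duration
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def shimmer_percent(amplitudes):
    # Local shimmer: the analogous perturbation measure on peak amplitudes
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))
```

A perfectly periodic voice yields 0% for both measures; irregular vibration, as produced by vocal fold lesions, increases them.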
Feature screening was applied using biostatistical analyses on the whole dataset to reduce the extended number of features to give as input to the classifier. Two statistical tests were used to screen relevant features for the classification task: the one-way analysis of variance (ANOVA), when all the groups were normally distributed, and the Kruskal-Wallis test, otherwise. The groups’ normality was verified with the Kolmogorov-Smirnov test. For all the tests, a p-value < 0.05 was considered significant. An overview of the screened features entering the classification model is presented in Figure 2.
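For reference, the one-way ANOVA used in the screening compares between-group and within-group variance of one feature across the diagnostic groups; a minimal sketch (illustrative names, not the study's code) is:

```python
def anova_f(groups):
    # One-way ANOVA F statistic: ratio of between-group to
    # within-group mean squares for one feature split by diagnosis
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))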
A non-linear Support Vector Machine (SVM) with a Gaussian kernel was the algorithm chosen in this work. The classifier was trained on the whole dataset (Mtot) and on the male (Mm) and female (Mf) subsamples separately. On the three models, 10-fold cross-validation was used both for parameter optimisation and further feature selection. Specifically, two hyper-parameters of the model were set using a grid search in the range 1-1000: the box constraint C, which is a regularisation parameter, and the kernel parameter γ, which controls the radius of influence of the kernel. Additionally, a sequential feature selection (SFS) algorithm was implemented to find an optimal feature set. Lastly, a class balancing method was employed to overcome the unbalanced frequencies of the pathological classes. The augmentation algorithm of the Synthetic Minority Oversampling Technique 19 was utilised for this purpose.
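The core idea of SMOTE 19 is to generate synthetic minority-class samples by interpolating between a real sample and one of its k nearest neighbours. A simplified sketch (the study used an existing implementation; these names are illustrative) is:

```python
import random

def smote(minority, n_new, k=5, seed=0):
    # SMOTE: synthesise new minority-class samples by interpolating
    # between a random sample and one of its k nearest neighbours
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((s for s in minority if s is not x),
                            key=lambda s: dist2(x, s))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + lam * (ni - xi)
                          for xi, ni in zip(x, nb)])
    return synthetic
```

Because each synthetic point lies on a segment between two real minority samples, the oversampling stays within the minority class's feature region rather than simply duplicating examples.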
The classification performance was measured through the accuracy and the average F1-score. Both metrics were provided for the description of the overall classification performances and those obtained on gender subgroups.
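Both metrics can be derived from the confusion matrix; macro-averaging the per-class F1-scores is one standard way to compute an "average F1-score" (assumed here for illustration):

```python
def accuracy(conf):
    # conf[i][j] = number of class-i samples predicted as class j
    return sum(conf[i][i] for i in range(len(conf))) / sum(map(sum, conf))

def macro_f1(conf):
    # Macro-averaged F1: per-class F1-scores, averaged with equal weight
    n = len(conf)
    f1s = []
    for c in range(n):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(n)) - tp
        fn = sum(conf[c]) - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / n
```

Macro-averaging weighs the four lesion classes equally, which is informative here because the class frequencies are unbalanced.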
Results
Data were collected on 418 patients with a mean age of 48 years (range 19-65 years), 349 of whom (83.5%) were females. The classification domain included four pathological groups: among recruited patients, 69 (16.5% - 12 males and 57 females) were diagnosed with vocal fold cysts, 112 (26.8% - 12 males and 100 females) with Reinke’s oedema, 141 (33.7% - 12 males and 129 females) with nodules and 96 (23.0% - 33 males and 63 females) with polyps (Tab. II).
The seven statistical descriptors calculated for each of the 66 extracted features led to a total number of 462 features. Three additional features (jitter, shimmer, and tilt of the power spectrum) were computed on the whole signal, for a total of 466 features. Given the extensive number of features extracted (exceeding the number of patients), a first statistical screening was performed. From the statistical tests, 138 of 466 features were significantly associated with the outcome and were thus fed into the SVM model. In Figure 1, a summary of the analysis pipeline and the number of patients and features in each step is provided.
On the overall model Mtot, the optimised 4-class classifier showed an accuracy of 55.0% and an average F1-score of 0.54. When splitting the sample by gender, an accuracy and average F1-score of 80% and 0.78 were obtained for Mm, and 54% and 0.54 for Mf, respectively. These results are summarised in Table III, along with the single-pathology accuracies. In Figure 3, the confusion matrices for the developed models are presented.
Discussion
In this work, an SVM was trained and cross-validated with the aim of obtaining a differential diagnosis of BLVF types from the acoustic signals of 418 patients, achieving a global accuracy of 55.0%.
Compared to the majority of studies in the literature, our results showed reduced performance, although it is worth noting that only a few studies have focused exclusively on pathological groups for differential diagnosis 20,21. Even the comparison with the latter works should be made in light of the differences among the studies: non-pathological cases were also included (with the exception of Pham et al. 21), deep learning algorithms were applied, different pathological cases were tested, and different types of acoustic signal were processed (sustained vowels or combinations of them).
The taxonomy of the types of BLVF is still in evolution, as well as the tools and technologies available for diagnosis. For these reasons, and due to overlapping characteristics of the groups, differential diagnosis can be challenging 22. Thus, supporting the clinical experience with a tool based on objective data can be crucial, and acoustic signals are promising non-invasive solutions for such a purpose.
Going into detail about the specific types of BLVF, the model had better performance on cysts and nodules (58% and 62% accuracy, respectively) and less so on polyps and oedema (47% and 53% accuracy, respectively). While more investigation should be dedicated to these results, a possible explanation of these differences may be the diverse effects of the above-mentioned lesions on phonation. In fact, while polyps and oedema tend to be very mobile during phonation, cysts and nodules are usually less mobile 23, and this may have an effect on the features obtained from the acoustic signals.
Lower accuracy could also be due to the different ages of patients, to the fact that the pathological samples were at different stages of disease evolution, or to the fact that the morphological features varied and impacted differently on the acoustic parameters (e.g. size, the ratio of the lesion base to the width of the lesion, the implantation site, etc.). In future studies, a system with a larger dataset and a new classification model could be used to classify the stage of the disease.
Another aspect we considered in our analyses was the difference in results obtained when splitting by gender. Differences between male and female voices and acoustics are related to many factors, such as physiology, anatomy, and even sociology and psychology, in terms of identity definition and behavioural characteristics 24. As a result, differences in frequency content between genders could emphasise the effects of different pathologies on the voice.
In this work, where only features extracted from spectral analysis entered the model, it is reasonable to observe very different behaviours of the model on the two subgroups. Specifically, the results showed an accuracy of 79.7% for the model trained on males and 54.4% for the model trained on females. The overall F1-score and accuracy, together with the pathology-specific accuracies, suggest that better performance can be obtained in the male group than in the female group. In fact, particularly high accuracies of 83% and 85% were obtained for the detection of nodules and polyps, respectively (Tab. III). The promising results on polyp detection, in contrast to the poor accuracy obtained on Mf (49%), may be explained by the fact that BLVF, and especially vocal polyps, have a higher incidence in the male population 25. Nevertheless, these results, and more generally the role of gender in differential diagnosis, warrant further investigation. Indeed, when reading these findings, the small sample size of the male subgroup should be taken into account as a limitation reducing the generalisability of the results.
To optimise the overall accuracy, other improvements could involve the training of other classification algorithms and the application of nested cross-validation for better generalisability of the results 26. Additionally, a larger prospective dataset could allow for the exploration of deep learning solutions and a complete selection of the features through automatic methods, avoiding the need for biostatistical screening. In future works, introducing novel features for the classifiers or identifying the best-performing ones could provide new information. More sophisticated features capturing hidden patterns or nonlinear relationships could significantly boost prediction accuracy. We believe that the clinical usefulness of the classification accuracy achieved by our model could be understood by comparing, in further studies, the performance of our algorithm with that of human experts and non-experts.
Above all, given the promising results in diagnostic problems obtained from clinical and endoscopic high-speed videos 27, we believe that a bimodal analysis system integrating both audio recordings and video-endoscopic imaging data would be helpful to improve accuracy and to interpret the correlations learned by the SVM.
Finally, our preliminary work suggests that machine learning, in combination with telemedicine, could provide a strategy to support screening between BLVF and malignant glottic lesions. Undoubtedly, a larger dataset is needed before reaching such a target, but this could be a concrete development.
Conclusions
This work focused on the development and cross-validation of a diagnostic model for the identification of four different BLVF: polyps, cysts, nodules, and oedema. Although further efforts could be deployed on the technical implementation when larger datasets become available, the results appear promising, with an overall accuracy in the automatic differential diagnosis of 55.3%. Moreover, the behavioural patterns of the models developed on the gender subgroups were particularly interesting. Specifically, good sensitivity was found in polyp detection in males (85%) and, in general, better performance was obtained in the detection of each BLVF type in males than in females. Given these results and the analysis of the literature, future research should focus on the combined use of clinical and instrumental data for the development of diagnostic models of BLVF.
Conflict of interest statement
The authors declare no conflict of interest.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
MRM: conceptualisation, data curation, writing; FS: formal analysis, software; SC: preparation and creation of the published work, specifically data presentation, writing; MC, AN, FU, JG, MCC, GP: supervision; LD: data collection; AM: methodology, data curation.
This study was approved by the Institutional Ethics Committee of “Fondazione Policlinico Universitario A. Gemelli IRCCS” of Rome (ID 4519 - prot. n.0041543/21).
The research was conducted ethically, with all study procedures being performed in accordance with the requirements of the World Medical Association’s Declaration of Helsinki.
Written informed consent was obtained from each participant/patient for study participation and data publication.
Figures and tables
Table I. Summary of previous studies applying ML to voice analysis.

|Author and reference|Size of sample|Study aim|Method of analysis|Author’s key findings|
|---|---|---|---|---|
|Li et al. 2|Training data set: 472 speakers; development data set: 300 speakers|To present an automatic speaker age and gender identification approach which combines different methods at both acoustic and prosodic levels to improve the baseline performance|Baseline subsystems: GMM based on mel-frequency cepstral coefficient features, SVM based on GMM mean supervectors, and SVM based on 450-dimensional utterance-level features; four subsystems|Minimum 3.1% and maximum 5.6% improvement of accuracy compared to the SVM baseline system|
|Berardi et al. 3|Archived voice recordings of a single individual, spanning about 50 years, held at Brigham Young University|To investigate the progressive degeneration of a single talker’s voice quality by comparing a computer model trained on actual age with a model trained on perceived age, with the goal of determining acoustic features related to the perception of the aging voice|The acoustic features were used to train two random forest regression models: one using actual (chronological) age as the response variable, the other using estimated age|The ML model estimated the age of the talker as well as the human listeners did. The acoustic feature related to the perception of the aging voice is the fundamental frequency|
|Dankovicova et al. 8|1560 speech features extracted and used to train the classification model|To utilise ML methods to recognise dysphonia|Classifiers used: K-nearest neighbours, random forests, and SVM|91.3% classification accuracy|
|Zhan et al. 6|6148 smartphone activity assessments from 129 individuals|To assess the ability of the mPDS to detect intraday symptom fluctuations, its correlation with standard measures, and its ability to respond to dopaminergic medication|ML-based approach to generate an mPDS that objectively weighs features derived from each smartphone activity|Effective and objective Parkinson disease severity score derived from smartphone assessments|
|Suppa et al. 9|60 patients with adductor-type spasmodic dysphonia pre-BoNT-A therapy and 60 healthy subjects; 35 patients were evaluated after BoNT-A therapy|To evaluate with cepstral analysis and ML adductor spasmodic voices and to compare the results with those of healthy subjects; to investigate the effect of BoNT-A|Cepstral analysis and ML algorithm classification techniques|ML and cepstral analysis differentiate healthy subjects and adductor-type spasmodic dysphonic voices. ML measures correlated with the severity of dysphonia pre- and post-BoNT-A therapy|
|Asci et al. 15|Voice samples of 138 younger adults and 123 older adults collected at home using smartphones|To examine the age-related changes in voice in an ecological setting|ML analysis through an SVM|ML analysis demonstrates the effect of ageing on voice|

ML: Machine Learning; mPDS: mobile Parkinson disease score; BoNT-A: botulinum neurotoxin A; SVM: support vector machine; GMM: Gaussian mixture model.
Table II. Overall sample and pathology-specific samples (N).
Table III. Overall and pathology-specific accuracy for the three models. Mtot: whole dataset; Mm: male dataset; Mf: female dataset.
- Robin J, Harrison JE, Kaufman LD. Evaluation of speech-based digital biomarkers: review and recommendations. Digit Biomark. 2020; 4:99-108. DOI
- Li M, Han KJ, Narayanan S. Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Comput Speech Lang. 2013; 27:151-167. DOI
- Berardi ML, Hunter EJ, Ferguson SH. Talker age estimation using machine learning. Proc Meet Acoust. 2017; 30:040014. DOI
- Lopez-de-Ipina K, Satue-Villar A, Faundez-Zanuy M. Advances in Neural Networks. WIRN 2015. Smart Innovation, Systems and Technologies, vol 54. Springer: Cham. DOI
- Cohen SM, Dupont WD, Courey MS. Quality-of-life impact of non-neoplastic voice disorders: a meta-analysis. Ann Otol Rhinol Laryngol. 2006; 115:128-134. DOI
- Zhan A, Mohan S, Tarolli C. Using smartphones and machine learning to quantify Parkinson disease severity: the mobile Parkinson disease score. JAMA Neurol. 2018; 75:876-880. DOI
- Mekyska J, Janousova E, Gomez-Vilda P. Robust and complex approach of pathological speech signal analysis. Neurocomputing. 2015; 167:94-111. DOI
- Dankovicová Z, Sovák D, Drotár P. Machine learning approach to dysphonia detection. Appl Sci. 2018; 8:1927. DOI
- Suppa A, Asci F, Saggio G. Voice analysis in adductor spasmodic dysphonia: objective diagnosis and response to botulinum toxin. Park Relat Dis. 2020; 73:23-30. DOI
- Byeon H. Prevalence of perceived dysphonia and its correlation with prevalence of clinically diagnosed laryngeal disorders: the Korea National Health and Nutrition Examination Surveys 2010-2012. Ann Otol Rhinol Laryngol. 2015; 124:770-776. DOI
- Hah JH, Sim S, An SY. Evaluation of the prevalence of and factors associated with laryngeal diseases among the general population. Laryngoscope. 2015; 125:2536-2542. DOI
- Bohlender J. Diagnostic and therapeutic pitfalls in benign vocal fold diseases. GMS Curr Top Otorhinolaryngol Head Neck Surg. 2013; 12:Doc01. DOI
- Channon F, Stone RE. Organic voice disorders: assessment and treatment. Singular: San Diego, CA; 2000.
- Heman-Ackah YD, Sataloff RT, Laureyns G. Quantifying the cepstral peak prominence, a measure of dysphonia. J Voice. 2014; 28:783-788. DOI
- Asci F, Costantini G, Di Leo P. Machine learning analysis of voice samples recorded through smartphones: the combined effect of ageing and gender. Sensors. 2020; 20:5022. DOI
- Hegde S, Shetty S, Rai S. A survey on machine learning approaches for automatic detection of voice disorders. J Voice. 2019; 33:947. DOI
- ISO 532-1:2017(E). “Acoustics – methods for calculating loudness – Part 1: Zwicker method”. International Organization for Standardization. ISO: Geneva; 2017.
- ISO 532-2:2017(E). “Acoustics – methods for calculating loudness – Part 2: Moore-Glasberg method”. International Organization for Standardization. ISO: Geneva; 2017.
- Chawla N, Bowyer K, Hall L. Smote: synthetic minority over-sampling technique. J Artif Intell Res. 2002; 16:321-357. DOI
- Hu HC, Chang SY, Wang CH. Deep learning application for vocal fold disease prediction through voice recognition: preliminary development study. J Med Internet Res. 2021; 23:E25247. DOI
- Pham M, Lin J, Zhang Y. Diagnosing voice disorder with machine learning. IEEE International Conference on Big Data (Big Data). 2018;5263-5266. DOI
- Naunheim M, Carroll T. Benign vocal fold lesions: update on nomenclature, cause, diagnosis, and treatment. Curr Opin Otolaryngol Head Neck Surg. 2017; 25:1. DOI
- Dikkers FG, Nikkels PJ. Benign lesions of the vocal folds: histopathology and phonotrauma. Ann Otol Rhinol Laryngol. 1995; 104:698-703. DOI
- Pépiot E, Arnold A. Cross-Gender Differences in English/French Bilingual Speakers: A Multiparametric Study. Percept Mot Skills. 2021; 128:153-177. DOI
- Malik P, Yadav S, Sen RD. The clinicopathological study of benign lesions of vocal cords. Indian J Otolaryngol. 2017; 71:212-220. DOI
- Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics. 2006;8. DOI
- Unger J, Schuster M, Hecker D. A multiscale product approach for an automatic classification of voice disorders from endoscopic high-speed videos. Annu Int Conf IEEE Eng Med Biol Soc. 2013; 2013:7360-7363. DOI
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
© Società Italiana di Otorinolaringoiatria e chirurgia cervico facciale, 2023