Annals of Robotics and Automation
1Institute of Biophysics and Biomedical Engineering, Bulgarian Academy of Sciences, Acad. G. Bonchev Str., Block 105, 1113 Sofia, Bulgaria
2Center for National Security and Defence Research, Bulgarian Academy of Sciences, 1, “15 November” Str., 1040 Sofia, Bulgaria
Cite this as
Hadjitodorov S. Multi-Method System for Robust Text-Independent Speaker Verification. Ann Robot Automation. 2025;9(1):001-006. Available from: 10.17352/ara.000020
Copyright License
© 2025 Hadjitodorov S. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

A system for robust speaker verification based on four recognition approaches and methods (classifiers) is proposed, in order to use different statistical characteristics of the speech parameters. These methods are: 1) Prototype Distribution Maps (PDM); 2) AR-vector models (ARVM); 3) a two-level approach, whose first level uses several PDMs for preprocessing and whose second level employs multilayer perceptron (MLP) networks; 4) Gaussian speaker models combined with the arithmetic-harmonic sphericity measure (GMAHSM).
These classifiers generate four preliminary classification decisions. The reliability and confidence of these preliminary decisions are evaluated by means of a weighting algorithm. The weights are assigned using the relative measures to the most similar speakers (or cohorts), i.e. a Cohort Normalization technique is implemented. The final classification is then performed using simple logical and threshold rules.
The speech signals of 92 speakers were analyzed. The speaker verification accuracy was over 98%. Impostor detection was robust, because the classifiers never failed simultaneously.
There are many pattern recognition methods and approaches [1-14] used for speaker recognition (identification and verification). However, each approach has its advantages and limitations. In order to combine the advantages of several methods, a hybrid system for speaker verification, based on different pattern recognition approaches, is proposed. The recognition is performed by means of four different recognition methods (classifiers):
These classifiers provide four preliminary decisions (classifications). The reliability and confidence of these preliminary decisions are evaluated by means of a weighting algorithm. The weights are assigned using the relative measures to the most similar speakers (or cohorts), i.e. a cohort normalization technique is implemented. The final classification is then made on the basis of simple logical and threshold rules.
In order to minimize the errors during the speech parameters evaluation, the following procedure for speech analysis is proposed and used:
The quantized signal is divided into segments with a length of three pitch periods (3·To) using a Hamming window. The duration of each segment is dynamically adapted to 3·To (using the To value from the previous segment), because our experiments [15] have shown that this segment length is optimal for To evaluation. The segments overlap by two pitch periods in order to analyze the dynamics of the speech parameters. The length of the first segment is 30 ms in order to ensure that the window contains at least two To.
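The segmentation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `pitch_estimator` is a hypothetical stand-in for the robust hybrid pitch detector of [17], and it is assumed to return To in samples.

```python
import numpy as np

def pitch_synchronous_segments(x, fs, pitch_estimator, first_len_ms=30.0):
    """Split x into Hamming-windowed segments of ~3 pitch periods.

    pitch_estimator(segment, fs) -> To in samples (hypothetical stand-in
    for the robust hybrid pitch detector of [17]).
    The hop is one pitch period, so consecutive segments overlap by 2*To.
    """
    seg_len = int(first_len_ms * 1e-3 * fs)   # first segment: 30 ms
    start, segments = 0, []
    while start + seg_len <= len(x):
        seg = x[start:start + seg_len] * np.hamming(seg_len)
        segments.append(seg)
        To = max(1, int(pitch_estimator(seg, fs)))  # To from current segment
        seg_len = 3 * To                            # adapt next length to 3*To
        start += To                                 # hop one period -> 2*To overlap
    return segments
```

With a constant To of 100 samples at 8 kHz, the first segment is 240 samples (30 ms) and all subsequent segments are 300 samples (3·To) with 200-sample overlap.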
The periodicity/aperiodicity separation (PAS) is very important for correct To detection (because errors in PAS produce drastic errors in To) and for correct evaluation of the speech parameters. To minimize the number of errors in PAS, the detector proposed in [16] is implemented, because it is characterized by:
In order to minimize the influence of the noisy components the aperiodic segments are eliminated.
In order to evaluate To correctly, the robust hybrid pitch detector [17] is used, because it has the following useful properties:
For every p-th segment (containing 3 To) the mean pitch period (Tom(p)) is calculated.
Many experimental studies [5,18-20] have shown that the LP-derived cepstral coefficients (c(n)) are very informative for speaker recognition. In the proposed approach the c(n) are calculated only for voiced segments in order to minimize the influence of the noisy components. The cepstral analysis is carried out by means of the standard procedure: LPC analysis using the autocorrelation method [21], after which the first 16 LPC-derived cepstral coefficients are calculated. The number of c(n) is 16, because our experimental research [15] has shown that the first 16 coefficients are the most informative for speaker recognition on our database.
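The standard pipeline mentioned above (autocorrelation method, Levinson-Durbin LPC, cepstral recursion) can be sketched as follows. This is an illustrative implementation of the textbook algorithm, not the authors' code; parameter names are assumptions.

```python
import numpy as np

def lpc_cepstrum(frame, order=16, n_ceps=16):
    """LP-derived cepstral coefficients c(1)..c(n_ceps) for one voiced frame.

    Sketch of the standard pipeline: autocorrelation -> Levinson-Durbin
    LPC -> cepstral recursion.
    """
    frame = np.asarray(frame, dtype=float)
    # autocorrelation method
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:][:order + 1]
    # Levinson-Durbin recursion for A(z) = 1 + sum_k a[k] z^-k
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    # LPC -> cepstrum recursion
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        s = sum(k * c[k] * a[n - k] for k in range(max(1, n - order), n))
        c[n] = (-a[n] - s / n) if n <= order else (-s / n)
    return c[1:]
```

For a first-order autoregressive signal x[n] = 0.9·x[n-1] + e[n], the first cepstral coefficient recovered this way is close to 0.9, as expected from the recursion c(1) = -a(1).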
It is known [15,22] that the analysis of spectrograms and sonograms (representing the formant structure) is very useful for speaker recognition. However, the estimation of formants remains a difficult problem that is not yet fully solved. That is why the formant structure is analyzed through evaluation and analysis of the group delay function (GDF), i.e. the negative derivative of the phase spectrum. The GDF is used as an approximation of the formant structure, because it has the following useful properties [23]:
The main problems in calculation of the GDF are:
In order to solve these problems and to guarantee spectral resolution higher than in the standard wide band sonogram used for formant analysis (here 262 Hz), the following approach for GDF calculation is proposed and used:
Unfortunately, for high-pitched voices the influence of the glottal source is not suppressed. However, in most practical speaker recognition cases male voices are analyzed, which are generally characterized by low values of the pitch period.
The first S gdf(i) coefficients (for i = 1,...,S) are used as feature vectors representing the formant structure. The value of S is determined by the baseband (300–3000 Hz) of telephone lines, because one of the main applications of this system will be speaker recognition over telephone lines. To cover this spectral range, S = 12, because in our experiments the resolution of the GDF is 262 Hz.
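The group delay function can be computed without explicit phase unwrapping via the standard FFT identity. The sketch below shows only this core computation, under the assumption of a plain DFT; any smoothing used in the paper's GDF calculation is omitted.

```python
import numpy as np

def group_delay(frame, n_fft=None, eps=1e-8):
    """Group delay function: negative derivative of the phase spectrum.

    Computed without phase unwrapping via the standard identity
    tau(k) = Re(X(k)) Re(Y(k)) + Im(X(k)) Im(Y(k)) over |X(k)|^2,
    where Y is the FFT of n * x[n].  (Illustrative sketch only.)
    """
    x = np.asarray(frame, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```

A sanity check: for a pure delay (an impulse at sample d), the group delay is d samples at every frequency.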
The following input vectors are formed for every p-th (for p = 1,...,P) voiced segment: mean pitch period (Tom(p)); first 16 c(i) and the first 12 gdf(i).
Four methods for speaker recognition are used, and for all of them the measures (scores), similarities or distances, are calculated with respect to all the speakers, i.e. not only the claimed speaker. These scores are used in the final verification scheme for cohort normalization.
In order to use different statistical characteristics, the static and dynamic information of the speech parameters in the system are captured using the following methods for speaker recognition:
The PDM is used because [9,15]:
Training the PDM: For each of the M speakers in the database a PDM(m) (m = 1,...,M, where M is the number of known speakers) is formed using the training procedure of [8]. For every k-th available utterance of a given speaker, one separate PDMk(m) is formed.
Speaker verification: The following classification procedure is used:
The ARVM are used because they allow modeling speaker’s specific information even when the segments with features vectors are analyzed in a random order [3,10,13].
Training the ARVM: During the training for all the speakers their ARVM(m) are formed using the procedure [3].
Speaker verification: The classification is done by forming the ARVM of the unknown speaker, ARVM(U), and calculating the forward-backward symmetrized Itakura distances [3] between ARVM(U) and all ARVM(m). The speaker U is preliminarily accepted as the speaker l to whom he has the minimal Itakura distance. This speaker could be different from speaker c, whom he claimed to be (if the method were used separately, this would mean rejection).
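The distance used here can be illustrated for a pair of scalar AR models. This is a sketch of the classical forward-backward symmetrized Itakura distance, not the exact AR-vector formulation of [3]:

```python
import numpy as np

def itakura_fb(a1, r1, a2, r2):
    """Forward-backward symmetrized Itakura distance between two AR models.

    a1, a2 -- AR coefficient vectors with the leading 1 included;
    r1, r2 -- autocorrelation sequences of the corresponding signals.
    """
    def toeplitz(r, p):
        return np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])

    def d(a, b, r):
        # log of the prediction-error ratio: model b applied to the
        # signal whose autocorrelation is r (modeled optimally by a)
        R = toeplitz(r, len(a))
        return np.log((b @ R @ b) / (a @ R @ a))

    return 0.5 * (d(a1, a2, r1) + d(a2, a1, r2))
```

The distance is zero for identical models and positive otherwise, since the model's own coefficients minimize the prediction error on its signal.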
The Gaussian mixture speaker’s models are used because:
The AHSM is symmetric and is evaluated efficiently, without extracting the eigenvalues.
The AHSM allows speaker recognition with minimal error in comparison with other measures when noisy speech signals are analyzed [4].
Training: The covariance matrices (COV(m)) for the speakers are calculated.
Speaker verification: The covariance matrix (COV(U)) of the unknown speaker (U) is calculated. The AHSM between the unknown speaker and all the speakers is evaluated. The speaker U is preliminarily accepted as the speaker l to whom he has the minimal AHSM distance. This speaker could be different from speaker c, whom he claimed to be (if the method were used separately, this would mean rejection).
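The AHSM-based verification step can be sketched as follows. The measure shown is the standard arithmetic-harmonic sphericity between covariance models; the helper `verify` and its dictionary interface are illustrative assumptions.

```python
import numpy as np

def ahs_measure(cov_a, cov_b):
    """Arithmetic-harmonic sphericity measure between two covariance models.

    AHS = log( tr(Ca Cb^-1) * tr(Cb Ca^-1) / d^2 ).  Symmetric, zero when
    the covariances are proportional, and computed without extracting
    eigenvalues, as noted in the text.
    """
    d = cov_a.shape[0]
    t1 = np.trace(cov_a @ np.linalg.inv(cov_b))
    t2 = np.trace(cov_b @ np.linalg.inv(cov_a))
    return float(np.log(t1 * t2 / d ** 2))

def verify(cov_u, speaker_covs):
    """Return the label l of the closest speaker model (smallest AHS)."""
    return min(speaker_covs, key=lambda m: ahs_measure(cov_u, speaker_covs[m]))
```

Note that proportional covariances give a zero measure, so the AHSM is insensitive to a global gain difference between recordings.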
The two-level classifier combines the pdf-estimation, statistical modeling and compression powers of the PDM with the discriminant capabilities and classification power of the MLP. As a result, the classifier is better than either the PDM or the MLP alone, especially in the case of noise-corrupted signals.
A. The first level of the classifier consists of several (T, T > 1) PDMs. The reason for building T PDMs is that back-propagation (BP) provides "good" estimates of the Bayes a posteriori probability functions only if the MLP has enough flexibility to closely approximate the Bayes functions and there is sufficient training data.
Training: A single PDM cannot adequately train the MLP; that is why T PDMs are trained for each speaker's utterance.
Speaker verification: During this stage the PDMs of the unknown speaker U (PDMU) are obtained.
B. The second level of the classifier consists of an MLP. One MLP per speaker is trained using the above-mentioned PDMs. The MLPs are trained with supervision using the back-propagation (BP) algorithm, which minimizes the squared error between the actual outputs of the network and the desired outputs.
Training: The input feature vectors (the PDMs) of a given speaker are labeled as “one” and feature vectors of the remaining speakers as “zero”.
Speaker verification: All the test vectors of an unknown speaker (the PDMU) are applied to each MLP. The outputs of each speaker's MLP are accumulated. The speaker is preliminarily accepted as the speaker l whose MLP(l) has the maximum accumulated output. This speaker could be different from speaker c, whom he claimed to be (if the method were used separately, this would mean rejection).
The MLP networks have two hidden layers and one output node, and all nodes have sigmoid nonlinearities.
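A per-speaker network of this topology (the 34-2-1 layer sizes given in the experimental section), trained by plain back-propagation on squared error, can be sketched as follows. The input size, initialization and learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SpeakerMLP:
    """Per-speaker MLP: input -> 34 -> 2 -> 1, all sigmoid units,
    trained by back-propagation on the squared error (a sketch)."""

    def __init__(self, n_in, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [n_in, 34, 2, 1]
        self.W = [rng.normal(0, 0.5, (a, b)) for a, b in zip(sizes, sizes[1:])]
        self.b = [np.zeros(b) for b in sizes[1:]]

    def forward(self, x):
        acts = [np.asarray(x, dtype=float)]
        for W, b in zip(self.W, self.b):
            acts.append(sigmoid(acts[-1] @ W + b))
        return acts

    def train_step(self, x, target, lr=0.5):
        acts = self.forward(x)
        # output-layer delta for squared error with sigmoid units
        delta = (acts[-1] - target) * acts[-1] * (1.0 - acts[-1])
        for i in reversed(range(len(self.W))):
            grad_W = np.outer(acts[i], delta)
            prev_delta = (delta @ self.W[i].T) * acts[i] * (1.0 - acts[i])
            self.W[i] -= lr * grad_W
            self.b[i] -= lr * delta
            delta = prev_delta
        return float(((acts[-1] - target) ** 2).sum())
```

Training repeatedly on two labeled inputs drives the squared error down, mirroring the "one"/"zero" labeling scheme described above.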
The following procedure for final decision is proposed and used:
Evaluation of the functions of the cohort score(s) for the classifiers: The following procedure for assigning classifier weights is proposed and used:
Each classifier's decision is assigned a labeled weight wl(i) (for i = 1,2,3,4, corresponding to methods 1, 2, 3 and 4) according to the formulae:
wl(i) = fl(CS) / VS(i), for i = 2, 4, (1)
wl(i) = VS(i) / fl(CS), for i = 1, 3, (2)
where:
fl(CS) - function of the cohort score(s) for the i-th classifier;
VS(i) - the preliminary verification score;
l - argument (label) of Dmin in methods 2 and 4, and of Smax in methods 1 and 3.
The outputs of methods 2 and 4 are minimal distances (Dmin), while the outputs of methods 1 and 3 are maximal similarities (Smax). In order to preserve the same relative meaning of the weights, the ratios for methods 1 and 3 are inverted. The functions of the cohort scores (fl(CS)) are evaluated by means of the following algorithm:
fl(CS) = min(cohort scores), for i = 2, 4, (3)
or as the averaged sum of the N minimal cohort scores (for i = 2, 4):
fl(CS) = (1/N) Σn=1..N CSn(min), (4)
fl(CS) = max(cohort scores), for i = 1, 3, (5)
or as the averaged sum of the N maximal cohort scores (for i = 1, 3):
fl(CS) = (1/N) Σn=1..N CSn(max), (6)
Normalization of the classifier weights:
wn(i) = wl(i) / (wl(1) + wl(2) + wl(3) + wl(4)), for i = 1, 2, 3, 4,
where CSn(min) and CSn(max) denote the n-th smallest and the n-th largest cohort score, respectively.
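The weighting scheme of equations (1)-(6), followed by normalization, can be sketched as follows. The list/dictionary interface and variable names are assumptions; the averaged-cohort variants (4) and (6) are used, as in the experiments.

```python
import numpy as np

def classifier_weights(scores, kinds, n_cohort=3):
    """Cohort-normalized weights for the preliminary decisions.

    scores[i] -- classifier i's measures to all reference speakers;
    kinds[i]  -- 'dist' for methods whose best score is a minimal
                 distance (methods 2, 4), 'sim' for methods whose best
                 score is a maximal similarity (methods 1, 3).
    For distances:    w = mean of N smallest cohort scores / VS
    For similarities: w = VS / mean of N largest cohort scores
    The weights are finally normalized to sum to one.
    """
    w = []
    for s, kind in zip(scores, kinds):
        s = np.sort(np.asarray(s, dtype=float))
        if kind == 'dist':
            vs = s[0]                            # best = minimal distance
            cohort = s[1:1 + n_cohort].mean()    # N closest competitors
            w.append(cohort / vs)
        else:
            vs = s[-1]                           # best = maximal similarity
            cohort = s[-1 - n_cohort:-1].mean()  # N closest competitors
            w.append(vs / cohort)
    w = np.asarray(w)
    return w / w.sum()
```

A decision that clearly beats its cohort (e.g. a minimal distance far below the runner-up distances) receives a proportionally larger normalized weight.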
Final decision by means of specific rules: The following specific rules for final decision are proposed and applied:
Let us assume that the speaker under verification is the c-th speaker. In general, as mentioned above, the label l in wl(i) could be different from c. The final decision is then made according to the rules:
A. Successful verification (acceptance):
if at least two classification methods accept the speaker under verification,
i.e. li = c for at least two methods i, then the verified speaker is accepted.
B. Rejection:
if at most one method accepts the speaker under verification, then the verified speaker is rejected.
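The voting part of rules A and B can be sketched as follows. Note this covers only the majority criterion; the additional threshold rules on the weights mentioned in the text depend on tuned thresholds and are omitted.

```python
def final_decision(labels, claimed, min_votes=2):
    """Rules A/B: accept the claimed speaker c only if at least
    `min_votes` of the preliminary decisions point to c; otherwise
    reject.  (The paper's extra threshold rules on the weights are
    not reproduced here.)"""
    votes = sum(1 for l in labels if l == claimed)
    return 'accept' if votes >= min_votes else 'reject'
```

For example, two of four classifiers agreeing with the claim suffices for acceptance, while a single agreeing classifier leads to rejection.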
In order to compare the methods implemented in the system and their behavior under different speech conditions, experiments with clean and noisy speech data were carried out.
A proprietary dataset was used. The speech (clean and over telephone lines, in order to account for telephone channel effects and noise) of 92 speakers (48 males and 44 females) was analyzed. The signals were digitized at 21 kHz with 16 bits per sample using a standard "Sound Blaster" PC audio board and saved directly into the computer's memory in order to avoid any distortions caused by recorders. The training set consists of utterances of 3 training sentences: "My name is first name, second name, family name"; "My code is six digits"; "I am a(n) profession". The test set consists of utterances of another 3 test sentences: "I am two digits years old"; "My mother's name is first name"; "My hobby is up to three words". Note that the proposed system is for text-independent speaker verification. All these sentences were uttered once (in Bulgarian) under the two above-mentioned conditions by each speaker in six separate sessions, i.e. we have 552 utterances in clean conditions and 552 utterances over telephone in noisy conditions. For testing the system, leave-one-out cross-validation (LOOCV) was applied: the training set is formed by the utterances of the training sentences of 91 speakers, while the test set is formed by the utterances of the test sentences of all 92 speakers, i.e. the test set always includes one impostor. According to LOOCV, this scheme is repeated 92 times.
For the PDM method, the optimal values for the filter coefficient (k) and for the size (Q) of the PDMs found in the experimental research [8,15] were used: k = 0.1 and Q = 20.
For the ARVM method the order of the AR-vector models was 2, because it has been shown that for orders greater than 2 the prediction error does not decrease significantly [13].
For the two-level classifier, the optimal values for the number of PDMs (T), for k and for Q found in the experimental research [9,15] were used: k = 0.1, T = 10 and Q = 4. The architecture of the MLP for each speaker at the second level is: first hidden layer - 34 neurons, second hidden layer - 2 neurons, output layer - 1 neuron.
In our experiments we apply the functions fl(CS) in the forms (3) and (4) with N = 3, and (5) and (6), also with N = 3.
The results are given in Table 1. It is important to note that almost all the impostors are eliminated. The proposed hybrid classifier has improved verification accuracy due to:
The first step of our research was to show that a weighted combination (on the basis of cohort scores) of various methods that use different statistical characteristics of the speech parameters enhances the recognition accuracy of a system. Robust impostor detection was observed, as no case occurred in which all classifiers failed simultaneously. Future work will be devoted to applying this approach to other publicly available data sets and to exploring it under worse conditions such as significant noise, spoofing and adversarial actions. Attention will also be paid to improving computational efficiency.