This research describes techniques to improve the precision of prosodic modifications in Arabic speech synthesis using the TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) method. This approach is based on decomposing the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modification of the speech signal.
INTRODUCTION
Several speech synthesis systems have been developed, such as vocoders and LPC synthesizers (Childers, 1995; Childers and Lee, 1991), but most of them do not match the synthetic speech quality of PSOLA-based systems (Acero, 1998) such as the MBROLA synthesizers (Dutoit et al., 1996). In particular, the TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) method is the most efficient way to produce speech of satisfactory quality (Moulines and Charpentier, 1990) and is one of the most popular concatenative synthesis techniques today. LP-PSOLA (Linear Predictive PSOLA) and FD-PSOLA (Frequency Domain PSOLA), though able to produce equivalent results, require much more computational power. The first step of TD-PSOLA is to run a pitch detection algorithm and to generate pitch marks, around which overlapping windowed speech segments are extracted. To synthesize speech, these Short Time signals (ST signals) are simply overlapped and added with the desired spacing.
TD-PSOLA PRINCIPLE
To describe the TD-PSOLA principle, we first define the input signal as x[n] and a local version xa[n] of it centered at time ta, where ta is an analysis mark:
$$x_a[n] = x[n + t_a]$$
We can then define ya[n] as a short-time version of xa[n] by multiplying it by a window wa[n] (Fig. 1):
$$y_a[n] = w_a[n]\, x_a[n] \tag{1}$$
Fig. 1: A windowed speech signal using a Hanning window wa[n]
The window length is twice the local pitch period, so that the short-time spectrum Si approximates the spectral envelope of x[n]. To synthesize speech at different pitch periods, the Short Time signals (ST) are simply overlapped and added with the desired spacing. The synthesized speech is:
![]() |
(2) |
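The windowing and overlap-add steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`hann`, `extract_frame`, `overlap_add`), the constant pitch period, and the toy sine input are all hypothetical, and each frame spans two pitch periods centered on its mark.

```python
import math

def hann(length):
    """Symmetric Hann window with `length` samples (length >= 3)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (length - 1))
            for k in range(length)]

def extract_frame(x, mark, period):
    """Window x over [mark - period, mark + period] (two pitch periods)."""
    window = hann(2 * period + 1)
    frame = []
    for k, w in enumerate(window):
        idx = mark - period + k
        sample = x[idx] if 0 <= idx < len(x) else 0.0  # zero-pad at the edges
        frame.append(w * sample)
    return frame

def overlap_add(frames, marks, out_len):
    """Add each frame into the output, centered on its (synthesis) mark."""
    y = [0.0] * out_len
    for frame, mark in zip(frames, marks):
        half = (len(frame) - 1) // 2
        for k, value in enumerate(frame):
            idx = mark - half + k
            if 0 <= idx < out_len:
                y[idx] += value
    return y

# Toy signal with an assumed constant pitch period of 20 samples.
period = 20
x = [math.sin(2.0 * math.pi * n / period) for n in range(200)]
analysis_marks = list(range(period, 200 - period + 1, period))
frames = [extract_frame(x, t, period) for t in analysis_marks]

# Re-using the analysis marks as synthesis marks reproduces the input
# between the first and last mark, because Hann windows shifted by one
# period sum to one.
y_same = overlap_add(frames, analysis_marks, len(x))
```

Passing marks that are closer together or further apart than the analysis marks to `overlap_add` raises or lowers the pitch, respectively; the equally spaced case is the one in which the overlapped windows sum exactly to one.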
A good choice is to place the time marks ta at the instants of vocal fold closure, which indicate the periodicity of speech.
For unvoiced speech, these marks can be placed arbitrarily. Estimating the closure instants from the speech waveform is a very difficult problem, but it can be done accurately using EGG (electroglottograph) signals.
The use of a symmetric window makes perfect reconstruction impossible unless the time marks are equally spaced. In addition, truncation will occur if these time marks are spaced more than N/2 apart (very long pitch periods). In synthesis, re-sampling is necessary because the sequence of synthesis marks ts differs from the sequence of analysis marks ta.
SPEECH ANALYSIS AND SYNTHESIS
This section describes the procedures of pitch-synchronous analysis and synthesis using the TD-PSOLA modifier. Figure 2 shows the block diagram of these two stages.
Speech analysis: The first step in speech analysis is to filter the speech signal with an FIR filter (pre-emphasis). The next step is to produce a sequence of pitch marks and a voiced/unvoiced classification for each segment between two consecutive pitch marks. This decision is based on the zero-crossing rate and the short-time energy (Fig. 3a, b). A voicing coefficient (v/uv) can be computed in order to quantify the periodicity of the signal (Cheveigne and Ahara, 1990).
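The pre-emphasis (pre-accentuation) filter mentioned above is commonly realized as a first-order FIR filter. The sketch below assumes that form; the source does not give the filter order or coefficient, so the function name `pre_emphasis` and the default value 0.95 are illustrative assumptions only.

```python
def pre_emphasis(x, coeff=0.95):
    """First-order FIR pre-emphasis: y[n] = x[n] - coeff * x[n-1].

    `coeff` is an assumed value; the source does not specify it.
    """
    if not x:
        return []
    y = [x[0]]  # no previous sample exists for n = 0
    for n in range(1, len(x)):
        y.append(x[n] - coeff * x[n - 1])
    return y
```

A constant (DC) input is strongly attenuated after the first sample, which is the intended high-frequency boost relative to low frequencies.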
Automatic segmentation: The speech signal is segmented in order to identify the voiced and unvoiced frames. This classification is based on the zero-crossing rate and the energy value of each signal frame.
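A frame-level voiced/unvoiced decision of this kind can be sketched as below. The decision rule (high energy and low zero-crossing rate means voiced), the fixed non-overlapping frames, and both thresholds are assumptions chosen for illustration; the source does not state the values it uses.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def classify_frames(x, frame_len, energy_thresh, zcr_thresh):
    """Label each non-overlapping frame as 'voiced' or 'unvoiced'.

    Assumed rule: voiced frames have high energy and few zero crossings.
    """
    labels = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        voiced = (short_time_energy(frame) > energy_thresh
                  and zero_crossing_rate(frame) < zcr_thresh)
        labels.append("voiced" if voiced else "unvoiced")
    return labels

# Toy input: a strong low-frequency tone followed by weak, rapidly
# alternating samples standing in for noise-like unvoiced speech.
tone = [math.sin(2.0 * math.pi * 5 * n / 200) for n in range(200)]
hiss = [0.01 * (-1) ** n for n in range(200)]
labels = classify_frames(tone + hiss, 200, energy_thresh=0.01, zcr_thresh=0.3)
```

In practice the thresholds would be tuned on recorded speech, and the frame boundaries would follow the pitch marks rather than a fixed grid.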
Speech marks: Different procedures for placing the marks ta[i] are used depending on the local features of the signal. A prior segmentation of the signal into zones with identical features makes it possible to select the suitable marking method for each zone. The results of this segmentation are also needed in the synthesis stage.
Reading marks: The idea of the algorithm is to select pitch marks among the local extrema of the speech signal. Consider a set of mark candidates that are either all negative peaks or all positive peaks:
$$T_a = \{\, t_a(i) \;:\; i = 1, \dots, N \,\}$$
where:
ta(i) = sample of the ith peak
N = number of peaks extracted
Fig. 2: Block diagram of speech analysis and synthesis
Laprie and Colotte (1998) explain how these candidates are found. Pitch marks are a subset of the points of Ta that are spaced by the pitch periods given by the pitch extraction algorithm. The selection can be represented by a sequence of indices:
![]() |
with K<N. The sequence j has to preserve chronological order, which requires j to be monotonically increasing: j(k)<j(k+1). The sequence of indices together with the corresponding peaks defines the set of pitch marks:
![]() |
Determining j requires a criterion that expresses the reliability of two consecutive pitch marks with respect to the previously determined pitch values. The local criterion we chose is:
![]() |
(3) |
Fig. 3: Automatic segmentation of Arabic speech; a) babun; b) chamsun. This segmentation is used to identify the voiced and unvoiced frames
Fig. 4: Pitch marks of Arabic speech; a) babun; b) akala
We use the following algorithm for the marking. In the local criterion (3), l<i, and the criterion takes into account the time interval between two marks compared with the pitch period Pa in samples. It returns zero if the two peaks are exactly Pa(c(l)) samples apart and a positive value if the distance between them is greater or less than the pitch period. The overall criterion is:
![]() |
(4) |
where B is the bonus for selecting an extremum as a pitch mark. Initially:
![]() |
(5) |
The coefficient δ expresses the compromise between closeness to the pitch values and strength of the pitch marks. D is minimized using dynamic programming. The pitch marking results are shown in Fig. 4a, b.
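The dynamic programming search can be illustrated with the sketch below. The exact forms of the local and overall criteria are not recoverable here, so the code assumes a local cost equal to the absolute deviation of the mark spacing from the expected pitch period, a bonus equal to a per-peak strength value, and an overall cost that sums local costs minus δ times the bonuses. It also assumes that the path starts at the first candidate and ends at the last. All names (`select_pitch_marks`, `strengths`, `periods`) are hypothetical.

```python
def select_pitch_marks(times, strengths, periods, delta):
    """Pick a chronological subset of candidate peaks by dynamic programming.

    times[i]     : sample index of candidate peak i (increasing)
    strengths[i] : assumed bonus for selecting peak i (e.g. |amplitude|)
    periods[i]   : expected pitch period, in samples, around peak i
    delta        : weight trading pitch closeness against peak strength
    Returns the list of selected candidate indices.
    """
    n = len(times)
    if n == 0:
        return []
    cost = [float("inf")] * n
    prev = [-1] * n
    cost[0] = -delta * strengths[0]  # assumed: the path starts at peak 0
    for i in range(1, n):
        for p in range(i):
            # Assumed local cost: zero when the peaks are one period apart.
            local = abs(times[i] - times[p] - periods[p])
            total = cost[p] + local - delta * strengths[i]
            if total < cost[i]:
                cost[i], prev[i] = total, p
    # Backtrack from the last candidate (assumed end of the path).
    path = [n - 1]
    while prev[path[-1]] != -1:
        path.append(prev[path[-1]])
    return path[::-1]

# Six candidate peaks; the weak peaks at samples 50 and 130 do not fit
# the expected period of 100 samples.
marks = select_pitch_marks(
    times=[0, 50, 100, 130, 200, 300],
    strengths=[1.0, 0.3, 1.0, 0.4, 1.0, 1.0],
    periods=[100] * 6,
    delta=10.0,
)
```

A larger `delta` favors strong peaks even when they are irregularly spaced, while a smaller `delta` favors regular spacing.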
Fig. 5: TD-PSOLA for pitch (F0) modification
Synthesis marks: The OLA synthesis is based on the overlap-addition of elementary signals Yj(n), obtained from the Xi(n) placed at the new positions ts[j]. These positions are determined by the pitch and the duration of the synthesis signal.
In such a synthesis, the time scale can be modified by a coefficient t-scale. Assuming that the position ts(k-1) and the pitch period Pa(k) are known, ts(k) can be deduced as (Mower et al., 1991):
![]() |
(6) |
where t-scale is the coefficient of duration modification (Fig. 5). To increase the pitch, the individual pitch-synchronous frames are extracted, Hanning windowed, moved closer together and then added up. To decrease the pitch, the frames are moved further apart. Increasing the pitch results in a shorter signal, so frames must also be duplicated if the pitch is to be changed while holding the duration constant.
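The placement of synthesis marks and the duplication of frames can be sketched as follows. This is one common formulation, offered under stated assumptions rather than as the recurrence of Eq. (6): each new mark is spaced by the local analysis period divided by a pitch factor, the output spans the input duration multiplied by a time factor, and every synthesis mark re-uses the nearest analysis frame. The names `synthesis_plan`, `pitch_scale` and `time_scale` are hypothetical.

```python
def synthesis_plan(analysis_marks, pitch_scale=1.0, time_scale=1.0):
    """Return (synthesis_mark, analysis_index) pairs.

    pitch_scale > 1 raises the pitch (marks closer together);
    time_scale  > 1 lengthens the signal (frames are duplicated).
    Synthesis marks are kept as floats; they may fall between samples.
    """
    if len(analysis_marks) < 2:
        raise ValueError("need at least two analysis marks")
    # Local pitch period at each mark; the last one repeats its neighbor.
    periods = [b - a for a, b in zip(analysis_marks, analysis_marks[1:])]
    periods.append(periods[-1])
    first, last = analysis_marks[0], analysis_marks[-1]
    end = first + time_scale * (last - first)
    plan = []
    ts = float(first)
    while ts <= end + 1e-9:
        # Map the synthesis time back onto the analysis time axis.
        source_time = first + (ts - first) / time_scale
        idx = min(range(len(analysis_marks)),
                  key=lambda i: abs(analysis_marks[i] - source_time))
        plan.append((ts, idx))
        ts += periods[idx] / pitch_scale
    return plan

marks = [0, 100, 200, 300, 400]
# Doubling the pitch at constant duration: marks every 50 samples,
# with analysis frames used twice to fill the same time span.
raised = synthesis_plan(marks, pitch_scale=2.0, time_scale=1.0)
# Doubling the duration at constant pitch: same spacing, twice as long.
stretched = synthesis_plan(marks, pitch_scale=1.0, time_scale=2.0)
```

Because the synthesis marks are generally not integers, each frame still has to be shifted by a fraction of a sample before it is added to the output.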
SPEECH SYNTHESIS
Given the pitch mark and the synthesis mark of a frame, we use the fast re-sampling method described below to shift the frame precisely to where it will appear in the new signal. Let x[n] be the original frame; the re-sampled signal is given by (Oppenheim and Schafer, 1975):
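$$x(t) = \sum_{n} x[n]\, \operatorname{sinc}\!\left(\pi\, \frac{t - nT_s}{T_s}\right), \qquad \operatorname{sinc}(u) = \frac{\sin u}{u} \tag{7}$$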
where Ts is the sampling period. Computing the resulting frame y[m], which corresponds to the frame x[n] shifted by a small delay δ, amounts to evaluating x(mTs-δ).
Fig. 6: Speech synthesis of akala
Therefore, y[m] = x(mTs-δ), i.e.:
$$y[m] = \sum_{n} x[n]\, \operatorname{sinc}\!\big(\pi f_s\,[(m-n)T_s - \delta]\big) \tag{8}$$
where fs is the sampling frequency (1/Ts). Rewriting sinc(x) as sin(x)/x and using the identity:
$$\sin\!\big(\pi f_s[(m-n)T_s - \delta]\big) = \sin(\pi(m-n))\cos(\pi f_s \delta) - \cos(\pi(m-n))\sin(\pi f_s \delta)$$
together with cos π(m-n) = (-1)^(m-n) and sin π(m-n) = 0, we get (Fig. 6):
$$y[m] = \frac{\sin(\pi f_s \delta)}{\pi f_s} \sum_{n} \frac{(-1)^{m-n+1}\, x[n]}{(m-n)T_s - \delta} \tag{9}$$
Since 0<δ<Ts (resp. -Ts<δ<0), we define δ = αTs, where 0<α<1 (resp. -1<α<0). The synthesized speech is then:
$$y[m] = \frac{\sin(\pi \alpha)}{\pi} \sum_{n} \frac{(-1)^{m-n+1}\, x[n]}{m - n - \alpha} \tag{10}$$
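The simplification above can be checked numerically. The sketch below compares a direct sinc interpolation of a frame delayed by α samples with the closed form obtained once the sine factor is pulled out of the sum, which needs only one sine evaluation per frame. The function names and the test frame are hypothetical, and the sums run over the finite frame only, so this illustrates the algebra rather than the authors' Matlab code.

```python
import math

def sinc(u):
    """Normalized sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)

def shift_direct(x, alpha):
    """y[m] = sum_n x[n] * sinc(m - n - alpha): one sine per term."""
    size = len(x)
    return [sum(x[n] * sinc(m - n - alpha) for n in range(size))
            for m in range(size)]

def shift_closed_form(x, alpha):
    """Same sum with sin(pi*alpha)/pi factored out: one sine per frame."""
    if alpha == 0.0:
        return list(x)  # no shift; avoids a division by zero below
    gain = math.sin(math.pi * alpha) / math.pi
    y = []
    for m in range(len(x)):
        acc = 0.0
        for n in range(len(x)):
            sign = -1.0 if (m - n + 1) % 2 else 1.0  # (-1)^(m-n+1)
            acc += sign * x[n] / (m - n - alpha)
        y.append(gain * acc)
    return y

frame = [math.sin(0.3 * n) for n in range(32)]
delayed_a = shift_direct(frame, 0.3)
delayed_b = shift_closed_form(frame, 0.3)
max_diff = max(abs(a - b) for a, b in zip(delayed_a, delayed_b))
```

The two results agree to within floating-point rounding for any non-integer α in (-1, 1), which is the range of α considered in the text.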
CONCLUSION
In this study, a voice quality conversion algorithm with a TD-PSOLA modifier was implemented and tested in the Matlab environment. The results of the perceptual evaluation test indicate that the algorithm can effectively convert modal voice into the desired voice quality. The simulation results confirm that the quality of the signal synthesized by TD-PSOLA depends on the precision of both the analysis marking and the synthesis marking, which must be placed precisely to avoid phase errors. The higher-precision pitch marking algorithm used during the synthesis stage increases the signal quality. This gain in accuracy reduces the difference between the original and synthetic signals.
Abdelkader Chabchoub and Adnan Cherif. Implementation of the Arabic Speech Synthesis with TD-PSOLA Modifier.
DOI: https://doi.org/10.36478/ijssceapp.2010.77.80
URL: https://www.makhillpublications.co/view-article/1997-5422/ijssceapp.2010.77.80