International Journal of System Signal Control and Engineering Application

Abstract

This research describes techniques to improve the precision of prosodic modifications in Arabic speech synthesis using the TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) method. The approach decomposes the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modification of the speech signal.

INTRODUCTION

Several speech synthesis systems have been developed, such as vocoders and LPC synthesizers (Childers, 1995; Childers and Lee, 1991), but most of them do not reach the synthetic speech quality of PSOLA-based systems (Acero, 1998) such as the MBROLA synthesizers (Dutoit et al., 1996). In particular, the TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) method is the most efficient way to produce speech of satisfactory quality (Moulines and Charpentier, 1990) and is one of the most popular concatenative synthesis techniques today. LP-PSOLA (Linear Predictive PSOLA) and FD-PSOLA (Frequency Domain PSOLA), though able to produce equivalent results, require much more computational power. The first step of TD-PSOLA is to run a pitch detection algorithm and generate pitch marks, around which overlapping windowed speech frames are extracted. To synthesize speech, the Short Time signals (ST signals) are simply overlapped and added with the desired spacing.

TD-PSOLA PRINCIPLE

To describe the TD-PSOLA principle, we first define the input signal as x[n] and x_a[n] as a local version of it centered at time t_a, where t_a is an analysis mark:

We can then define y_a[n] as a short-time version of x_a[n] by multiplying it by a window w_a[n] (Fig. 1):

y_a[n] = w_a[n] x_a[n]	(1)


Fig. 1:	A windowed speech signal using a hanning window w_a[n]

The window length is twice the local pitch period, so that the spectrum S_i(n) of the short-time signal approximates the spectral envelope of x(n). To synthesize speech at different pitch periods, the Short Time (ST) signals are simply overlapped and added with the desired spacing. The synthesized speech is:

(2)

A good choice for the time marks (t_a) is to make them coincide with the closing instants of the vocal folds, which indicate the periodicity of speech.

For unvoiced speech, these marks can be placed arbitrarily. Estimating the closing instants from the speech waveform is a very difficult problem, but it can be done accurately by using electroglottograph (EGG) signals.

The use of a symmetric window makes perfect reconstruction impossible unless the time marks are equally spaced. In addition, truncation occurs if these time marks are spaced more than N/2 apart (very long pitch periods). In synthesis, re-sampling is necessary because the sequence of synthesis marks t_s differs from that of the analysis marks t_a.
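The windowing and overlap-add steps described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the authors' Matlab implementation: the function names `hann`, `short_time_signal` and `overlap_add` are hypothetical, the window is a symmetric Hann window spanning two pitch periods, and a single constant pitch period is assumed for simplicity.

```python
import math


def hann(length):
    """Symmetric Hann window with `length` samples (length >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (length - 1))
            for k in range(length)]


def short_time_signal(x, t_a, period):
    """Window x over two pitch periods centred on the analysis mark t_a.

    Samples that fall outside x are treated as zero (an assumption).
    """
    w = hann(2 * period + 1)
    frame = []
    for k, wk in enumerate(w):
        n = t_a - period + k
        frame.append(wk * x[n] if 0 <= n < len(x) else 0.0)
    return frame


def overlap_add(frames, synthesis_marks, out_len):
    """Add each short-time signal into the output, centred on its synthesis mark."""
    y = [0.0] * out_len
    for frame, t_s in zip(frames, synthesis_marks):
        half = len(frame) // 2
        for k, v in enumerate(frame):
            n = t_s - half + k
            if 0 <= n < out_len:
                y[n] += v
    return y
```

With equally spaced marks and synthesis marks equal to the analysis marks, the Hann windows sum to one between the first and last mark, so the signal is reproduced there. Placing the same frames at more closely spaced synthesis marks raises the pitch.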

SPEECH ANALYSIS AND SYNTHESIS

This study will describe the procedures of synchronous analysis and synthesis using TD-PSOLA modifier. Figure 2 shows the block diagram of these two stages.

Speech analysis: The first step in the speech analysis is to filter the speech signal with an FIR filter (pre-emphasis). The next step is to provide a sequence of pitch marks and a voiced/unvoiced classification for each segment between two consecutive pitch marks. This decision is based on the zero-crossing rate and the short-time energy (Fig. 3a, b). A voicing coefficient (v/uv) can be computed in order to quantify the periodicity of the signal (Cheveigne and Ahara, 1990).
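As an illustration of the pre-emphasis (pre-accentuation) step, a first-order FIR filter is a common choice. The article does not give the filter coefficients, so the form y[n] = x[n] - a x[n-1] and the value a = 0.95 below are assumptions, and `pre_emphasis` is a hypothetical name.

```python
def pre_emphasis(x, a=0.95):
    """First-order FIR high-pass: y[n] = x[n] - a * x[n-1].

    The coefficient a = 0.95 is an assumed, commonly used value;
    the first sample is passed through unchanged.
    """
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]
```

A constant input is strongly attenuated after the first sample, which is the expected high-pass behaviour.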

Automatic segmentation: The speech signal is segmented in order to identify the voiced and unvoiced frames. This classification is based on the zero-crossing rate and the energy of each signal frame.
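A minimal sketch of such a frame classifier is given below. The decision rule (low zero-crossing rate and high energy means voiced), the non-overlapping frames, and the thresholds `zcr_max` and `energy_min` are assumptions chosen for illustration; the article does not state its thresholds.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0.0) != (b >= 0.0))
    return crossings / max(len(frame) - 1, 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(v * v for v in frame) / max(len(frame), 1)


def classify_frames(x, frame_len, zcr_max=0.25, energy_min=0.01):
    """Label consecutive non-overlapping frames as 'voiced' or 'unvoiced'.

    zcr_max and energy_min are hypothetical thresholds that would
    need tuning on real speech data.
    """
    labels = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        voiced = (zero_crossing_rate(frame) <= zcr_max
                  and short_time_energy(frame) >= energy_min)
        labels.append("voiced" if voiced else "unvoiced")
    return labels
```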

Speech marks: Different procedures for placing the marks t_a[i] are used according to the local features of the signal. A prior segmentation of the signal into zones with identical features directs the marking toward the suitable method. The results of this segmentation are also needed for the synthesis stage.

Reading marks: The idea of the algorithm is to select pitch marks among the local extrema of the speech signal. Given a set of mark candidates which are either all negative peaks or all positive peaks:

Where:

t_a(i)	=	Sample of the ith peak
N	=	Number of peaks extracted


Fig. 2:	Block diagram of speech analysis and synthesis

Laprie and Colotte (1998) explain how these candidates are found. Pitch marks are a subset of the points of T_a, spaced by the pitch periods given by the pitch extraction algorithm. The selection can be represented by a sequence of indices:

with K<N. The mapping j has to preserve the chronological order, which requires j to be monotonic: j(k)<j(k+1). The sequence of indices, together with the corresponding peaks, defines the set of pitch marks:

Determining j requires a criterion expressing the reliability of two consecutive pitch marks with respect to the pitch values previously determined. The local criterion we chose is:

(3)


Fig. 3:	Automatic segmentation of Arabic speech; a) babun; b) chamsun. This segmentation is used in order to identify the voiced and unvoiced frames


Fig. 4:	Pitch marks of Arabic speech; a) babun; b) akala

The marking algorithm uses this criterion with l<i. It compares the time interval between two marks with the pitch period P_a, in samples. The criterion returns zero if the two peaks are exactly P_a(c(l)) samples apart and a positive value if the distance between them is greater or less than the pitch period. The overall criterion is:

(4)

where B is the bonus for selecting an extremum as a pitch mark. Initially:

(5)

The coefficient δ expresses the compromise between closeness to the pitch values and strength of the pitch marks. D is minimised by dynamic programming. The pitch marking results are shown in Fig. 4a, b.
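The dynamic programming search can be sketched as follows. Because Eq. 3-5 are not reproduced here, the sketch assumes a local criterion d(i, l) = |t_a(i) - t_a(l) - P_a(t_a(l))|, an overall cost equal to the sum of the local criteria minus δ times the sum of the bonuses, and a bonus equal to the peak amplitude. These forms, and the names `select_pitch_marks` and `period_at`, are assumptions for illustration and may differ from the article's equations.

```python
def select_pitch_marks(candidates, bonuses, period_at, delta=1.0):
    """Choose a chronological subset of candidate peaks by dynamic programming.

    candidates: increasing sample indices of the candidate peaks
    bonuses:    bonus B for each candidate (assumed: peak amplitude)
    period_at:  function returning the pitch period (samples) at a position
    delta:      weight of the bonus against the period mismatch
    """
    n = len(candidates)
    if n == 0:
        return []
    cost = [0.0] * n   # best accumulated cost of a path ending at i
    prev = [-1] * n    # predecessor of i on that path (-1: path starts at i)
    for i in range(n):
        best, best_l = 0.0, -1
        for l in range(i):
            d = abs(candidates[i] - candidates[l] - period_at(candidates[l]))
            if cost[l] + d < best:
                best, best_l = cost[l] + d, l
        cost[i] = best - delta * bonuses[i]
        prev[i] = best_l
    # Backtrack from the candidate with the lowest accumulated cost.
    i = min(range(n), key=lambda k: cost[k])
    marks = []
    while i != -1:
        marks.append(candidates[i])
        i = prev[i]
    return marks[::-1]
```

A larger δ favours strong peaks even when their spacing departs from the pitch period, while a smaller δ favours regular spacing.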


Fig. 5:	TD-PSOLA for pitch (F0) modification

Synthesis marks: OLA synthesis is based on the overlap-addition of elementary signals Y_j(n), obtained from the X_i(n) and placed at the new positions t_s[j]. These positions are determined by the pitch and the duration of the synthesis signal.

In such a synthesis, the time scale can be modified by a coefficient t-scale. If the position t_s(k-1) and the pitch period P_a(k) are known, t_s(k) can be deduced as (Mower et al., 1991):

(6)

t-scale: Coefficient of length modification (Fig. 5). To increase the pitch, the individual pitch-synchronous frames are extracted, Hanning windowed, moved closer together and then added up. To decrease the pitch, the frames are moved further apart. Increasing the pitch results in a shorter signal, so frames must also be duplicated to change the pitch while holding the duration constant.
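One common way to generate synthesis marks and decide which frames to duplicate is sketched below. It is not a transcription of Eq. 6: it assumes that consecutive synthesis marks are spaced by the local analysis period divided by a pitch factor, that the output duration is the input duration multiplied by `t_scale`, and that each synthesis mark copies the nearest analysis frame on the original time axis. The names `synthesis_marks` and `pitch_scale` are hypothetical.

```python
def synthesis_marks(analysis_marks, periods, pitch_scale=1.0, t_scale=1.0):
    """Return (t_s, mapping) for TD-PSOLA resynthesis.

    analysis_marks: increasing analysis mark positions t_a (samples)
    periods:        positive local pitch period P_a at each analysis mark
    pitch_scale:    > 1 raises the pitch, < 1 lowers it (assumed convention)
    t_scale:        > 1 lengthens the signal, < 1 shortens it
    mapping[k] is the index of the analysis frame placed at t_s[k].
    """
    if not analysis_marks:
        return [], []
    start = analysis_marks[0]
    end = start + t_scale * (analysis_marks[-1] - start)
    t_s, mapping = [], []
    t = float(start)
    while t <= end:
        # Position of this synthesis mark on the original time axis.
        src = start + (t - start) / t_scale
        i = min(range(len(analysis_marks)),
                key=lambda j: abs(analysis_marks[j] - src))
        t_s.append(int(round(t)))
        mapping.append(i)
        # Advance by the local period, shortened or lengthened by the pitch factor.
        t += periods[i] / pitch_scale
    return t_s, mapping
```

Doubling the pitch at constant duration halves the spacing of the marks, and each analysis frame is then used about twice, which corresponds to the frame duplication described above.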

SYNTHESIS SPEECH

Therefore, given the pitch mark and the synthesis mark of a given frame, we use the fast re-sampling method described below to shift the frame precisely to where it will appear in the new signal. Let x[n] be the original frame; the re-sampled signal is given by Oppenheim and Schafer (1975):

(7)

where Ts is the sampling period. Calculating the result frame y[m], corresponding to the frame x[n] shifted by a small delay δ, amounts to evaluating x(mTs-δ).


Fig. 6:	Speech synthesis akala

Therefore, y[m] = x(mTs-δ), i.e:

(8)

where fs is the sampling frequency (1/Ts). Now, rewriting sinc(x) as sin(x)/x and using the following identity:

sin(πfs[(m-n)Ts-δ]) = sin(π(m-n)) cos(πfs δ) - cos(π(m-n)) sin(πfs δ)

where cos(π(m-n)) = ±1 and sin(π(m-n)) = 0, so that only the second term remains, we get (Fig. 6):

(9)
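The intermediate steps can be written out as follows. This is a reconstruction under the assumption that Eq. 7 is the standard band-limited (sinc) interpolation formula with sinc(u) = sin(u)/u; the exact form printed in the article may differ in notation.

```latex
% Assumed starting point: band-limited interpolation of the frame x[n]
x(t) = \sum_{n} x[n]\,
       \frac{\sin\bigl(\pi f_s (t - nT_s)\bigr)}{\pi f_s (t - nT_s)}

% Evaluate at t = mT_s - \delta to obtain the shifted frame
y[m] = x(mT_s - \delta)
     = \sum_{n} x[n]\,
       \frac{\sin\bigl(\pi f_s[(m-n)T_s - \delta]\bigr)}
            {\pi f_s[(m-n)T_s - \delta]}

% Since f_s T_s = 1, \sin(\pi(m-n)) = 0 and \cos(\pi(m-n)) = (-1)^{m-n}:
\sin\bigl(\pi(m-n) - \pi f_s \delta\bigr)
     = (-1)^{m-n+1} \sin(\pi f_s \delta)

% Hence
y[m] = \sin(\pi f_s \delta) \sum_{n}
       \frac{(-1)^{m-n+1}\, x[n]}{\pi f_s[(m-n)T_s - \delta]}
```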

As 0<δ<Ts (resp. -Ts<δ<0), we define δ = αTs, where 0<α<1 (resp. -1<α<0). The synthesized speech is then:

(10)
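A small numerical sketch of this fractional shift is given below. It assumes that the frame is re-sampled by sinc interpolation over the samples of the frame only, so that y[m] is the sum over n of x[n] sinc(m - n - α) with sinc(u) = sin(πu)/(πu), and that the closed form obtained after substituting δ = αTs is y[m] = (sin(πα)/π) Σ (-1)^(m-n+1) x[n]/(m - n - α). Both expressions and the function names are reconstructions for illustration, not the article's code.

```python
import math


def sinc(u):
    """Normalised sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)


def shift_direct(x, alpha):
    """Shift frame x by alpha samples with direct sinc interpolation."""
    size = len(x)
    return [sum(x[n] * sinc(m - n - alpha) for n in range(size))
            for m in range(size)]


def shift_closed_form(x, alpha):
    """Same shift using the factored form valid for non-integer alpha.

    sin(pi*alpha) is computed once instead of one sine per term.
    """
    if alpha == 0.0:
        return list(x)
    size = len(x)
    gain = math.sin(math.pi * alpha) / math.pi
    y = []
    for m in range(size):
        acc = 0.0
        for n in range(size):
            sign = -1.0 if (m - n) % 2 == 0 else 1.0  # (-1)**(m - n + 1)
            acc += sign * x[n] / (m - n - alpha)
        y.append(gain * acc)
    return y
```

The closed form needs a single sine evaluation per frame, which is presumably what makes the re-sampling fast. Truncating the sum to the frame length introduces some error near the frame edges.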

CONCLUSION

In this study, a voice quality conversion algorithm based on the TD-PSOLA modifier was implemented and tested in the Matlab environment. The results of a perceptual evaluation test indicate that the algorithm can effectively convert modal voice into the desired voice quality. The simulation results confirm that the quality of the signal synthesized by TD-PSOLA depends on the precision of both the analysis marking and the synthesis marking, which must be placed precisely to avoid phase errors. A higher-precision pitch marking algorithm during the synthesis stage increases the signal quality. This gain in accuracy reduces the difference between the original and synthetic signals.

How to cite this article:

Abdelkader Chabchoub and Adnan Cherif. Implementation of the Arabic Speech Synthesis with TD-PSOLA Modifier.
DOI: https://doi.org/10.36478/ijssceapp.2010.77.80
URL: https://www.makhillpublications.co/view-article/1997-5422/ijssceapp.2010.77.80
