UTAU wiki

UTAU User Manual - 10


Prev: 9. Envelope and Vowel Blending

10. Setting voice banks

10-1. About voice configuration

UTAU can use sound sources (voice banks) made by registering 50 or more sounds that you record yourself as WAV files. However, if you compose in UTAU with the recorded WAV files as they are, the sound is distorted because the consonant part is stretched, and the phoneme timing is off. It is therefore necessary to configure the fixed consonant range and the preutterance for the files.


  • 原音設定の概念図 = voice configuration conceptual map
  • ピアノロール画面の音符 = notes on the piano roll screen
  • クロスフェードの状態 = crossfade status
  • 原音設定エディタの状態(あ.wav) = voice configuration editor status (a.wav)
  • 原音設定エディタの状態(さ.wav) = voice configuration editor status (sa.wav)

First, cover the silent parts at both ends of the voice file, and the end part where the volume fades out, with the left and right blanks (the purple areas), so that they are not used in the speech synthesis. Depending on the case, the left blank may also include part of the consonant when it is too long, or part of the beginning of a vowel where the volume is low.

Next, cover the consonant pronunciation area with the fixed consonant range (the pink area). With this configuration, even if you enter a note longer than the original sound, like "sa.wav" in the illustration above, the white vowel area is the only part being stretched, so consonant distortion is prevented. (Theoretically, a fixed consonant range is not necessary for a vowel like "a.wav", but there are cases where UTAU does not detect the pitch well if no fixed consonant range is configured, so set it to at least 5ms.)

For sounds with a consonant such as "sa.wav", the phoneme is heard late if the sound starts playing at the start of the note on the piano roll, so move the preutterance (the red vertical bar) to the boundary between the consonant and the vowel, so that the consonant is pronounced before the start of the note. By setting a preutterance, the section from the end of the left blank to the red vertical bar (the consonant utterance area) is voiced before the start of the note, as in the image above. (To that extent, the pronunciation time before the note is short.)

Finally, move the overlap (the green vertical bar) to blend the consonant with the sound that precedes the note. By setting an overlap, the section from the end of the left blank to the green vertical bar overlaps the end of the previous note, and crossfades for a smooth sound when the envelope is configured.
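
The timing relationship described above can be sketched in a few lines of Python. This is only an illustration, not UTAU's actual implementation: the names `VoiceTiming` and `playback_window` are hypothetical, all values are in milliseconds measured from the end of the left blank, and any automatic adjustment UTAU may apply when the preceding note is short is ignored.

```python
from dataclasses import dataclass


@dataclass
class VoiceTiming:
    """Hypothetical container for two of the voice settings (in ms)."""
    preutterance_ms: float  # end of left blank -> red vertical bar
    overlap_ms: float       # end of left blank -> green vertical bar


def playback_window(note_start_ms, timing):
    """Return (sound_start, crossfade_end) on the piano roll timeline.

    The sound begins `preutterance_ms` before the note starts, and the
    first `overlap_ms` of the sound overlaps the end of the previous note.
    """
    sound_start = note_start_ms - timing.preutterance_ms
    crossfade_end = sound_start + timing.overlap_ms
    return sound_start, crossfade_end
```

For a note starting at 1000ms with a preutterance of 120ms and an overlap of 40ms, the sound would start at 880ms and the crossfade with the previous note would end at 920ms.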

For details, please refer to -> 10-4. Using the Voice Setting Editor to configure a voice.

Caution: The voice bank configuration method shown here is intended for the traditional single-syllable (CV) sound sources (where the 50 or more sounds are separated into individual files). To configure continuous sound sources (VCV), please refer to -> 15-3. Voice configuration of continuous sound sources.

Warning: VOCALOID voice output must absolutely not be used as a UTAU sound source. Article 4 (Prohibitions) of the VOCALOID End User License Agreement explicitly prohibits "(3) Using all or part of the product as a component of your or a third party's software".

Reference: VOCALOID2 Hatsune Miku End User License Agreement ->

Advice for creating original sound sources (voice banks).

  • The audio format requirements for preparing a voice bank are PCM/44100Hz/16bit in WAV or AIFF format. However, AIFF files are only supported in UTAU Ver0.2.46 and later.
  • Stereo and Mono files both work, but since the output is mono, it is recommended to use mono format to reduce the file size.
  • Voice configuration does not have to be performed by the author of the original sound source. Because voice configuration is especially difficult for a beginner, UTAU usage rules allow publishing only the voice sounds and leaving the voice configuration to the user. However, it is recommended to at least roughly configure the consonant fixed range, especially for sound sources where the pronunciation time is short, because the notes would be stretched a lot, distorting the consonants. Also, users should not demand that the voice contributor perform voice configuration.
  • If you publish a voice bank, it is recommended that you first read the article "To sound source contributors" in the UTAU Users Mutual Support@Wiki. (Important points are given regarding the display of the conditions of use, the character setup, etc.)
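
The format requirements above can be checked before publishing with a short script. The following is a minimal sketch using Python's standard `wave` module; the function name `check_voice_wav` is hypothetical, and it handles WAV files only (not AIFF).

```python
import wave


def check_voice_wav(path):
    """Return a list of problems; an empty list means the file matches
    the recommended format (PCM, 44100Hz, 16bit, mono)."""
    problems = []
    # wave.open raises wave.Error for files it cannot read as PCM WAV.
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 44100:
            problems.append(
                f"sample rate is {wav.getframerate()}Hz, expected 44100Hz")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(
                f"sample width is {wav.getsampwidth() * 8}bit, expected 16bit")
        if wav.getnchannels() != 1:
            problems.append("file is not mono (allowed, but mono is recommended)")
    return problems
```

Calling `check_voice_wav("sa.wav")` on a conforming file returns an empty list.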

10-2. Opening the Voice Bank Settings screen

Select "Voice Bank Settings" 「原音の設定」 from the "Tools" 「ツール」 menu, or press "Ctrl" + "G", to open the "Voice Configuration" 「原音設定」 screen. It displays the voice settings data for the voice configured in "Project" 「プロジェクト」 (the voice shown in the Voice Name display field at the upper left of the main screen).

※ After selecting just one note with the sound you want to configure, press "Ctrl" + "G" to directly open the Voice Setting Editor screen.

Explanations -> 10-4. Using the Voice Setting Editor to configure a voice

10-3. Switching the desired sound source in the Voice Settings screen

1. Select the menu "File" 「ファイル」 -> "Open Another Voice Bank" 「別の音源を開く」 in the "Voice Configuration" screen.


2. Select the file containing the voice bank you want to configure, then press the "Open" 「開く」 button.


3. When the WAV files of the Voice Bank you want to configure are displayed, press the "Open" 「開く」 button again to open the Voice Configuration screen displaying configuration data for this voice bank.


10-4. Using the Voice Setting Editor to configure a voice

1. In the Voice Configuration screen, click the sound you want to configure (in our example "shi" 「し」), then press the "Launch Editor" 「エディタを起動」 button to open the Voice Setting Editor screen.

※ After selecting just one note entered with the sound you want to configure, press "Ctrl" + "G" to open the Voice Setting Editor screen directly from UTAU's main screen.

  • Each setting can also be configured by typing numerical values in milliseconds into the input fields on the right side of the Voice Setting Editor screen and pressing the "Set" 「セット」 button. However, using the Editor screen is recommended, because it lets you configure the settings while looking at the waveform. Note that the "Alias" 「エイリアス」 configuration can be done only in the Voice Setting Editor screen.

※ An alias lets you enter a note under a name other than the name of the WAV file and still play that WAV file. Even with no alias configured, entering a note with the name of the WAV file plays its sound, so it is not necessary to configure an alias if the WAV file name already matches the sound type.
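
The alias behavior described above amounts to a simple name lookup. The sketch below illustrates the idea only; the function `resolve_wav` and its data structures are hypothetical, and the assumption that an alias is consulted before the file name is ours, not something stated in this manual.

```python
def resolve_wav(note_name, aliases, wav_files):
    """Return the WAV file to play for a note name, or None if nothing matches.

    aliases:   dict mapping an alias to a WAV file name (hypothetical structure)
    wav_files: set of WAV file names present in the voice bank
    """
    # Assumption: a configured alias is checked first.
    if note_name in aliases:
        return aliases[note_name]
    # Without an alias, the note name itself must match a WAV file name.
    candidate = note_name + ".wav"
    return candidate if candidate in wav_files else None
```

For example, with `aliases = {"し": "shi.wav"}`, both the note 「し」 and the note "shi" would play "shi.wav".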


2. Press the "+" button on the upper left of the Editor screen to enlarge the sound file waveform horizontally so that it is easier to see. You can also enlarge or reduce it vertically by dragging the upper and lower edges of the Editor screen.

※ Pressing the "-" button shrinks the sound file waveform horizontally. Also, by double-clicking the "P" button next to the "-" button, you can listen to the sound with the current voice settings within the editor.



※ In UTAU 0.2.61 and later versions, you can also display the spectrum by pressing the "s" button on the upper left of the Voice Editor screen.

By right-clicking on the "s" button you can also select and change the "Color" 「カラー」, and change the "Range" 「レンジ」 to 5, 7 or 10 kHz.

Caution: In order to display the spectrum quickly the next time, "*.uspec" cache files are created in the voice folder. As these files are quite big, voice distributors should be careful not to distribute them together with the sound files by mistake. (It is also possible to avoid creating cache files by right-clicking the "s" button and selecting "Don't cache" 「キャッシュしない」.)
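
Before packaging a voice bank for distribution, leftover cache files can be listed with a short script. This is a minimal sketch; the function name `find_spectrum_caches` is hypothetical, and it only lists the files so that you can decide yourself whether to delete them.

```python
from pathlib import Path


def find_spectrum_caches(voice_folder):
    """Return all "*.uspec" files under the voice folder, including subfolders."""
    return sorted(Path(voice_folder).rglob("*.uspec"))
```

Each returned entry is a `Path`, so printing `path.stat().st_size` for every entry shows how much space the caches take.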


3. Configure the following five items by dragging them on the Editor screen.

Description of each voice setting (using Nagone Mako's "shi" sound as an example).


(1). Left blank (offset) (the purple part on the left of the Editor screen)

The left blank defines where to cut off the silent part and the sibilants at the beginning of a voice sound file.

Leaving too much of the silent part will result in no sound being heard. In addition, leaving too much of the sibilants will produce a discordant "shiii" sound, so perform this setting first.

How to configure

Place the cursor near the left edge of the Editor screen; when the cursor changes to a cross, drag it to the right to set the left blank area, shown in purple on the left of the screen. However, if the consonant fixed range (2) (the pink area on the left of the screen) is not yet configured, dragging from the left edge of the screen sets that range instead, so configure it before configuring the left blank. Also, if the green (4) and red (5) vertical lines that define the positions of the overlap and the preutterance get in the way, move them out of the way to the right.

Advisable setting values

Basically, set it to cover the whole silent part (the part where the voice waveform is flat).

However, for the sounds of the "ka" column, "ta, te, to" and the "pa" column, the consonant is pronounced only briefly, so leave about 10 milliseconds of the silent part outside the left blank to improve the articulation.

For the "sa" column, which consists mostly of sibilants (the part where the waveform is very small, just after the silent part), some of the sibilants must be cut by the left blank, but be careful: the articulation becomes very bad if the left blank covers all of them. Start with 70ms or less (the "0.1" mark on the upper part of the Editor screen represents 100ms), then adjust it while double-clicking "P" on the upper left of the screen to listen to the result.

For the vowels "a, i, u, e, o", it seems better to trim with the offset (the left blank) the portion at the head of the waveform where the amplitude increases gradually, so that the sounds connect nicely when blending vowels.

※ If the advisable setting values are not clear to you, one method is to copy by eye from a reference sound source with relatively good articulation whose voice settings are done precisely. (With the latest versions of UTAU, you can launch UTAU's main screen twice and open two instances of the Voice Configuration screen and the Voice Editor screen.) However, as voice waveforms differ somewhat from one person to another, even for the same type of sound, in the end you will have to check and readjust with your own ears.
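
As a rough starting point, the length of the flat silent part can be estimated from the samples. The sketch below is an illustration under stated assumptions, not a feature of UTAU: the function names, the amplitude threshold of 500 (for 16bit samples) and the way the 10ms margin is applied are all hypothetical, and the "sa" column and vowels still need to be adjusted by ear as described above.

```python
def leading_silence_ms(samples, sample_rate=44100, threshold=500):
    """Length in ms of the leading part whose amplitude stays within the threshold.

    samples: sequence of signed 16bit sample values (mono).
    """
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            return index * 1000.0 / sample_rate
    # The whole file is below the threshold.
    return len(samples) * 1000.0 / sample_rate


def suggest_left_blank_ms(samples, short_consonant=False, margin_ms=10.0):
    """Suggest an initial left blank value.

    For short consonants ("ka" column, "ta, te, to", "pa" column),
    about 10ms of silence is left outside the left blank.
    """
    silence = leading_silence_ms(samples)
    if short_consonant:
        return max(0.0, silence - margin_ms)
    return silence
```

With 100ms of leading silence, this suggests a left blank of 100ms for an ordinary sound and 90ms for a sound such as "ka".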

For details about Voice Configuration, please refer to the following site.

UTAUWiki's Voice Configuration advices ->

Producing insider ??? - UTAU libraries construction support site -

(2). Consonant part fixed range (the pink area on the left side of the Editor)

The consonant part fixed range defines a part of the consonant that is not stretched, in order to prevent the distortion of the consonant sounds of the voice when the latter is stretched to fit long notes.

How to configure

Pull out the pink configuration area from the left edge of the screen, like with the left blank configuration.

Advisable setting values

Select the sibilants and the consonant part (the part where the waveform amplitude gradually increases on a narrow range). If there is room enough in the length of the voice, you can safely set to include some part of the vowel. If you are unsure, please refer to the Voice settings of another sound source.

Caution: Starting from UTAU Ver0.2.46, if the consonant part fixed range is set to a very short value (about 3ms or less) or even zero, a phenomenon occurs where the pitch cannot be efficiently captured. Set at least 5ms for the consonant part fixed range, even for vowels like "a, i, u, e, o".

Reference ->UTAU Exchange Board v2.46 bug
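
The 5ms minimum mentioned in the caution above can be checked across a whole voice bank with a few lines of Python. This is a minimal sketch; the function name `too_short_fixed_ranges` and the dictionary input are hypothetical, and reading the values from the voice bank's settings file is left out.

```python
MIN_FIXED_RANGE_MS = 5.0  # minimum recommended in the caution above


def too_short_fixed_ranges(fixed_ranges_ms):
    """Return the sounds whose consonant fixed range is below the minimum.

    fixed_ranges_ms: dict mapping a sound name to its fixed range in ms.
    """
    return sorted(
        name for name, value in fixed_ranges_ms.items()
        if value < MIN_FIXED_RANGE_MS
    )
```

For example, `too_short_fixed_ranges({"a": 0.0, "sa": 120.0, "i": 3.0})` returns `["a", "i"]`.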

(3). Right blank (the purple part on the right side of the Editor)

The right blank setting cuts the end part of the voice where the volume attenuates, in order to maintain a stable volume.

How to configure

Pull out the purple configuration area from the right edge of the screen, like with the left blank configuration.

Advisable setting values

Set to the end part of the voice where the volume attenuates. Also, as the pitch at the end of the voice often becomes unstable, include that part too (the part where the orange line representing the pitch wavers a lot and falls abruptly).

(4). Overlap (the green vertical bar)

Overlap is a configuration for improving the connection of sounds, by superimposing the voice end part of the preceding note with the head part of the configured voice.

How to configure

Set it by dragging the green vertical line at the left edge of the screen to the right.

Advisable setting values

Set it to about 30ms as a rough estimate, or to about half of the preutterance. However, for sounds where the consonant is pronounced only briefly, such as the sounds of the "ka" column, "ta, te, to" and the "pa" column, setting an overlap would crush the consonant and make the articulation bad, so leave it at 0 (i.e. on the boundary between the left blank and the consonant fixed part on the Editor screen).

※ The longer the overlap, the better the sounds connect, but in return the consonant is drowned in the vowel of the preceding note and the articulation becomes bad, so this setting is a trade-off. For this reason it seems best, for the overlap only, to adjust each note individually depending on the melody. (Overlap and preutterance can also be set individually for each single note in the "Notes Properties" 「音符のプロパティ」 screen.)

Reference -> 7-2. The settings of the "Notes Properties" screen
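
The rule of thumb for the overlap can be written down as a small helper. This is a sketch under assumptions: the function name, the romanized sound names, and the choice to take the smaller of 30ms and half the preutterance are ours; the manual only gives the two values as rough estimates.

```python
# Sounds whose consonant is pronounced only briefly (romanized names assumed).
SHORT_CONSONANT_SOUNDS = {
    "ka", "ki", "ku", "ke", "ko",   # "ka" column
    "ta", "te", "to",
    "pa", "pi", "pu", "pe", "po",   # "pa" column
}


def suggest_overlap_ms(sound, preutterance_ms):
    """Suggest an initial overlap value in ms for a sound."""
    if sound in SHORT_CONSONANT_SOUNDS:
        return 0.0  # an overlap would crush the short consonant
    # Assumption: use whichever of the two rough estimates is smaller.
    return min(30.0, preutterance_ms / 2)
```

The result is only a starting value; as noted above, the overlap is best adjusted per note depending on the melody.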

(5). Preutterance (the red vertical bar)

The preutterance is a setting to automatically force the utterance ahead of time so as to match the consonant phonation timing.

※ You can also cut all the sibilants with the offset (the left blank) to match the phonation timing, but as the articulation becomes very bad, this is not recommended except as a temporary measure.

How to configure

Set it by dragging the red vertical line at the left edge of the screen to the right.

Advisable setting values

Set it to the boundary between the consonant part and the vowel part. If you are unsure, please refer to the Voice settings of another sound source.

10-5. Precisely adjusting the phonation timing (adjustment method using a metronome)

There is a Voice configuration method, invented by Ameya/Ayame sama, to precisely match the phonation timings by using the sound of a metronome.

Adjustment method using a metronome (adjusting the preutterance) ->

10-6. Automatically configuring a voice (estimating the voice settings values) (hidden feature)

Open the Voice settings screen for the voice you want to configure, and double-click the empty area on the upper right of the screen to open the "Estimate parameters" 「パラメータの推定を行います」 pop-up. Pressing "OK" automatically sets estimated values for the voice settings of all the sounds. However, as some settings may be incorrect depending on the sound, be sure to check with your own ears and readjust as needed.

Caution: As it is a hidden feature, it is unsupported by UTAU's author. Please use at your own risk.

※ The image below uses the default voice as an example, but as this default voice already comes with adequate voice settings, it is not necessary to estimate its voice settings values.


10-7. Creating/regenerating the frequency table files in one go (hidden feature)

In UTAU, frequency table files (*.frq) are generated the first time a playback is performed, and WAV files are generated based on these frequency table files. However, because generating the frequency table files takes time, it is useful to generate them all beforehand using the following method. The method to regenerate them all is also explained, as you can sometimes produce more accurate frequency table files through an updated "resampler" voice synthesis engine.

Caution: As it is a hidden feature, it is unsupported by UTAU's author. Please use at your own risk.

Caution: Please note that this feature is useless if you cannot launch "resampler.exe", e.g. because of Virus Buster.

1. Open the Voice Settings screen for the voice on which you want to generate (or regenerate) the frequency table files, then double-click the area between the "Edit frequency map" 「周波数表の編集」 button and the "Initialize frequency map" 「周波数表を初期化」 button on the lower left of the screen.


2. The "Creating all the frequency tables" 「周波数表の一括作成」 screen opens. If you want to regenerate with a new "resampler" engine, select "Regenerate and overwrite all" 「上書きで全て再作成」. If you want to generate for new sounds, or to complete partially missing ones, select "Create only missing" 「無いものだけ作成」. Choose one of them, then press the "Execute" 「実行する」 button.

Warning: In older versions of UTAU, when selecting "Delete all then generate" 「全て削除後に作成」, the frequency table files for all the voices within the Voice folder are deleted! Be warned! (Version 0.2.72 of UTAU has been fixed so that only the frequency table files of the selected voice are erased before regeneration.)


3. A command prompt window appears and the frequency table files are generated. If you want to cancel the generation, close the command prompt window.


4. When "Frequency table generation terminated" 「周波数表の生成が終了しました。」 is displayed at the bottom of the command prompt screen, press Enter or any other key to close the command prompt window.

※ If you checked "Close command prompt when terminated" 「終了したらコマンドプロンプトを閉じる」 in the "Creating all the frequency tables" 「周波数表の一括作成」 screen, it closes automatically at the end of the frequency table files generation.


5. In the Voice Settings screen, verify that all the sounds are marked with a 「○」 in the "frq" column.
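
The same check can be done outside UTAU by comparing the files in the voice folder. The sketch below assumes that the frequency table for "a.wav" is stored next to it as "a_wav.frq"; this naming is an assumption not stated in this manual, and the function name `missing_frequency_tables` is hypothetical.

```python
from pathlib import Path


def missing_frequency_tables(voice_folder):
    """Return the names of WAV files that have no frequency table file."""
    folder = Path(voice_folder)
    missing = []
    for wav in sorted(folder.glob("*.wav")):
        # Assumed naming convention: "a.wav" -> "a_wav.frq"
        frq = wav.with_name(wav.stem + "_wav.frq")
        if not frq.exists():
            missing.append(wav.name)
    return missing
```

An empty list corresponds to every sound being marked with 「○」 in the "frq" column.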


Next: 11. Adjusting the Pitch (attaching vibrato and portamento vocal expression)
