はじめに

DDSP-SVCをGoogle Colaboratoryで動かした記録です。 Ver3.0で追加された拡散機構を含んだバージョンを試してみました。

はじめに
Google Driveの準備
Colaboratoryの準備
モデルの訓練
おわりに

Google Driveの準備

まずはGoogleDrive上に必要なファイルを配置していきます。

ディレクトリの準備

Google Driveに「ddsp-svc」ディレクトリを作成します。
「ddsp-svc」の下に「pretrain」、「train_data」、「input」、「output」、「diffusion-test」を作成します。以下は作成した後の一例です。
次のpyhtonノートブックを開いてドライブにコピーします。

モデルのダウンロード

DDSP-SVCのリポジトリから訓練済みモデルをダウンロードします。

github.com

遷移先の以下の項目のContentVec、HuberSoft、NSF-HiFiGan、RMVPEをクリックしてダウンロードします。

2. Configuring the pretrained model

Feature Encoder (choose only one):
(1) Download the pre-trained ContentVec encoder and put it under pretrain/contentvec folder.

(2) Download the pre-trained HubertSoft encoder and put it under pretrain/hubert folder, and then modify the configuration file at the same time.

Vocoder or enhancer:
Download the pre-trained NSF-HiFiGAN vocoder and unzip it into pretrain/ folder.

Pitch extractor:
Download the pre-trained RMVPE extractor and unzip it into pretrain/ folder.

「pretrain」に以下の構成になるようにアップロードします。

また、以下の記載箇所から「model_0.pt」をダウンロードし、diffusion-test内にアップロードします。

(4.0 - Update) New DDSP cascade diffusion model
Installing dependencies, data preparation, configuring the pre-trained encoder (hubert or contentvec ) , pitch extractor (RMVPE) and vocoder (nsf-hifigan) are the same as training a pure DDSP model (See section below).

We provide a pre-trained model here: https://huggingface.co/datasets/ms903/DDSP-SVC-4.0/resolve/main/pre-trained-model/model_0.pt (using 'contentvec768l12' encoder)

Move the model_0.pt to the model export folder specified by the 'expdir' parameter in diffusion-new.yaml, and the program will automatically load the pre-trained model in that folder.

学習用データのアップロード

学習用データを「train_data」にアップロードします。学習用としてアップロードするデータは[mp3, wav, flac]が利用できます。

変換するデータのアップロード

変換元データとして「input」にアップロードします。変換元データは[wav]が利用できます。

Colaboratoryの準備

コピーしたノートブックを開いてドライブをマウントします。この時、ランタイムのタイプがGPUになっていることを確認してください。

マウント後にOne-time setupの範囲を実行します。 One-time setupでは必要なコードのcloneと依存ライブラリのインストール、保存が行われます。

! cd DDSP-SVC && sudo python3 -m venv ./venv --without-pip && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && . ./venv/bin/activate && python3 get-pip.py && pip install --upgrade pip && pip install -r requirements.txt
! tar -zcvf DDSP-SVC.tar.gz DDSP-SVC
! cp DDSP-SVC.tar.gz ./drive/MyDrive/ddsp-svc/ -r

モデルの訓練

モデルの訓練を実施します。

学習済みモデルのコピー

必要な学習済みモデルをGoogleDriveの「pretrain」ディレクトリごとコピーします。

! cp drive/MyDrive/ddsp-svc/pretrain DDSP-SVC/ -r

音声ファイルの準備

音声ファイルを学習用のデータに分割します。分割対象の音声ファイルはGoogleDriveの「train_data」にある全ての音声ファイルです。

import sys
sys.path.append("./DDSP-SVC/venv/lib/python3.10/site-packages/")
sys.path.append("./DDSP-SVC/")
import separate

separate.separate_audio(input="/content/drive/MyDrive/ddsp-svc/train_data/", output="/content/DDSP-SVC/data/train/audio", silence_thresh=-40, min_silence_len=750, keep_silence=750 , min=2000, max=5000, padding=True)

訓練の実施

モデルの訓練を開始します。

! cd DDSP-SVC && . ./venv/bin/activate && python draw.py && python preprocess.py -c configs/diffusion-new.yaml

! cd DDSP-SVC && . ./venv/bin/activate && python train_diff.py -c configs/diffusion-new.yaml

モデルは2000ステップごとに保存され、古いチェックポイントは削除されます。最大で100000エポックまで学習が継続するため、回転しているボタンをクリックして適当なステップ数で学習を打ち切ってください。

モデルの保存

モデルの保存を行います。以下の処理が完了するとGoogleDriveのddsp-svcにmodel.zipが作成されます。

!zip model.zip ./DDSP-SVC/exp/diffusion-test -r

!mv model.zip ./drive/MyDrive/ddsp-svc/

モデルのテスト

以下の処理で音声変換を試すことができます。 inputにあるname.wavを変換し、outputにname.wavとして出力します。

! cd DDSP-SVC && . ./venv/bin/activate && python main_diff.py -i "../drive/MyDrive/ddsp-svc/input/name.wav" -o "../drive/MyDrive/ddsp-svc/output/name.wav" -diff "./exp/diffusion-test/model_10000.pt" -d "cuda"

各引数は以下のように設定してください。

-I 変換するファイルのパス name.wavを任意のファイル名に変更してください。
-o 出力先のパス name.wavを任意のファイル名に変更してください。
-diff 変換に使うモデルのパス model_‘0000.ptの数字を任意の値に変更してください。選択可能なモデルは「./DDSP-SVC/exp/diffusion-test」配下を確認してください。

おわりに

Google ColaboratoryでDDSP-SVCのモデルを訓練しました。前回のバージョンと比較して、より自然な音声に変換できていると感じました。後日、ローカル環境での音声変換の実行方法についても更新を行います。最後に、関連するリポジトリやコードをメンテナンスしていただいている皆様に感謝いたします。

kitaroの自由帳

Google ColaboratoryでDDSP-SVC4.0を使った声の変換を行ってみた