
[Discussion] About training models myself

Open xiebruce opened this issue 3 years ago • 5 comments

Below is the content from here, which I've read:

Train model

For training your own model, you need:

  • A dataset of separated files such as musDB.
  • The dataset must be described in CSV files: one for training and one for validation, which are used for generating the training data.
  • A JSON configuration file, such as this one, that gathers all parameters needed for training and the paths to the CSV files. Once your training configuration is set up, you can run model training as follows:
spleeter train -p configs/musdb_config.json -d </path/to/musdb>

From the command above, I notice that I need to provide:

  • A musdb_config.json file;
  • A musdb dataset (the directory passed with -d).

Question 1

The musdb_config.json file looks like the one below, copied from here:

{
    "train_csv": "configs/musdb_train.csv",
    "validation_csv": "configs/musdb_validation.csv",
    "model_dir": "musdb_model",
    "mix_name": "mix",
    "instrument_list": ["vocals", "drums", "bass", "other"],
    "sample_rate":44100,
    "frame_length":4096,
    "frame_step":1024,
    "T":512,
    "F":1024,
    "n_channels":2,
    "n_chunks_per_song":40,
    "separation_exponent":2,
    "mask_extension":"zeros",
    "learning_rate": 1e-4,
    "batch_size":4,
    "training_cache":"cache/training",
    "validation_cache":"cache/validation",
    "train_max_steps": 200000,
    "throttle_secs":1800,
    "random_seed":3,
    "save_checkpoints_steps":1000,
    "save_summary_steps":5,
    "model":{
        "type":"unet.unet",
        "params":{
               "conv_activation":"ELU",
               "deconv_activation":"ELU"
        }
    }
}

But where can I find the full documentation for all of these config options? For example, what do T and F mean? And for instrument_list, can I use only ["vocals", "other"]?


Question 2

I've downloaded musdb from musdb18.zip and extracted the zip file. It is a folder containing 2 subfolders, train and test (see the screenshot below).

Inside the train and test folders, the files are all mp4 (mp4 is used instead of mp3 or aac because an mp4 container can hold more than one audio track).


I've listened to the mp4 files in both the train and test folders of musdb18.zip, and there doesn't seem to be any difference between them: they are all just songs.

So, in my understanding, there is no real difference. Assuming I have 150 song files, can I simply choose 100 of them for training and use the rest for validation?


Question 3

I used ffprobe to check the mp4 files mentioned above and found that each one has several tracks: the first track is a mix of all the audio tracks, the following audio tracks are the separated stems (vocals, drums, bass, etc.), and the last track is a video track, though it contains no actual video, just a still PNG image.

ffprobe -hide_banner -i  Young\ Griffo\ -\ Pennies.stem.mp4
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f8bfc808200] stream 0, timescale not set
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'Young Griffo - Pennies.stem.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isom
    creation_time   : 2017-12-16T17:34:20.000000Z
  Duration: 00:04:37.80, start: 0.000000, bitrate: 1288 kb/s
  Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:2(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:3(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:4(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:5: Video: png, rgba(pc), 512x512 [SAR 20157:20157 DAR 1:1], 90k tbr, 90k tbn, 90k tbc (attached pic)

Now I have vocals.m4a and bg-musics.m4a. How can I merge a vocal track, its corresponding bg-music track, and an album cover image into one mp4 file using ffmpeg?
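
I guess it would be something like the command below, but I'm not sure about the stream mapping or the cover-art handling; the file names, the amix filter, and the options are just my guesses:

# Guess: mix the two stems into one track, keep both stems as extra audio
# streams, and attach cover.png as album art (all file names are made up).
ffmpeg -i vocals.m4a -i bg-musics.m4a -i cover.png \
  -filter_complex "[0:a][1:a]amix=inputs=2:duration=longest[mix]" \
  -map "[mix]" -map 0:a -map 1:a -map 2:v \
  -c:a aac -b:a 256k -c:v png -disposition:v attached_pic \
  merged.stem.mp4
# Note: amix averages its inputs by default, so the mixed track will be
# quieter than a true sum of the two stems.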


Question 4

From Question 1 we know that we also need 2 CSV files: musdb_train.csv and musdb_validation.csv.

I notice that the musdb_train.csv file has 6 columns:

mix_path | vocals_path | drums_path | bass_path | other_path | duration

If I only need 2 stems, is it the case that I only need to provide these 4 columns in the CSV file?

mix_path | vocals_path | other_path | duration
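
For example, I imagine the rows would look something like this (the paths and durations are made up; I'm assuming the paths are relative to the directory passed with -d and that duration is in seconds):

mix_path,vocals_path,other_path,duration
train/song1/mixture.wav,train/song1/vocals.wav,train/song1/other.wav,237.2
train/song2/mixture.wav,train/song2/vocals.wav,train/song2/other.wav,184.5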

xiebruce, Mar 15, 2022

Hi @xiebruce. For your first question, there is some information about the parameters set in the config files in the wiki, though it is possibly a bit incomplete. Regarding the instrument list, it should be set according to the dataset you want to train on. For instance, with musdb you can use ["vocals", "instrumentals"], since you have both an instrumentals.wav and a vocals.wav file for every track, and they sum up to mixture.wav.
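
To make that concrete, as a sketch only: relative to the musdb_config.json you pasted above, roughly these fields would change (the CSV paths and model_dir are placeholders, and the remaining parameters can stay as they are):

{
    "train_csv": "configs/my_2stems_train.csv",
    "validation_csv": "configs/my_2stems_validation.csv",
    "model_dir": "my_2stems_model",
    "mix_name": "mix",
    "instrument_list": ["vocals", "instrumentals"]
}

The CSV files would then need matching vocals_path and instrumentals_path columns, following the same naming pattern as in musdb_train.csv.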

For your second question, spleeter was not made for dealing with the multi-stem *.mp4 format, so you should use the multiple-waveform version of musdb. You can also use a different split than the one originally proposed for musdb, which is only provided for algorithm-comparison purposes. So if you don't plan to compare your model with other models on the test set, you can use songs from the test set in your training.

The third question concerns musdb and not spleeter, so this is not the right place for asking/answering it.

For your 4th question, indeed you can keep only the columns for the stems you use (as in your 4-column example) if you'd like to perform 2-stem separation. As mentioned in the answer to question 2, you need to ensure that the provided stems sum up to the mix (i.e. the sum of the stems is equal to the mix). With musdb, you can do that with the instrumentals stem and the vocals stem.

romi1502, Mar 18, 2022

Hello! I am doing the same but with Beethoven Cello Sonatas.

How many hours of samples/data are you using to train spleeter?

Thanks!

isolepinas, Mar 26, 2022

@isolepinas Sorry, I still don't know how to do it yet. But I think it depends on your computer's performance and the size of the sample data. Can you share your whole process, the steps you are taking? I'd prefer examples and screenshots rather than just a description. Thank you in advance.

xiebruce, Mar 26, 2022

Dear Bruce,

Just like you, I am in the first steps of the process. My idea is to feed spleeter 3 versions of a performance: (piano solo), (cello solo), and (piano and cello together). I aim to train spleeter to understand what a piano is and what a cello is, so that when they play together it is able to extract only the cello without losing vibrato, portamento, and other characteristics.

Some studies have used 44 samples, or about 1 hour and 14 minutes of audio, such as this one: https://veleslavia.github.io/conditioned-u-net/

Maybe you will find it interesting!

Let's keep in touch so we can share our findings and processes!


isolepinas, Mar 26, 2022

@isolepinas OK, thank you.

xiebruce, Mar 26, 2022