
Add Descript-Audio-Codec model

Open kamilakesbi opened this issue 1 year ago • 5 comments

What does this PR do?

This PR aims to add the Descript-Audio-Codec (DAC) model, a high-fidelity, general-purpose neural audio codec, to the Transformers library.

This model is composed of three components:

  • An Encoder model.
  • A ResidualVectorQuantizer model, used together with the encoder to obtain the quantized latent audio codes.
  • A Decoder model, used to reconstruct the audio after compression.
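To make the quantizer component concrete, here is a minimal, self-contained NumPy sketch of residual vector quantization (illustrative only: `rvq_encode`, the codebook shapes, and the stage count are assumptions, not the actual DAC implementation). Each stage snaps the current residual to its nearest codebook entry, and later stages refine what earlier stages missed:

```python
import numpy as np

def rvq_encode(latent, codebooks):
    """Return per-stage code indices and the quantized reconstruction."""
    residual = latent.copy()
    quantized = np.zeros_like(latent)
    codes = []
    for codebook in codebooks:  # one codebook per quantizer stage
        # pick the codebook entry closest to the current residual
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        quantized += codebook[idx]
        residual = latent - quantized
    return codes, quantized

rng = np.random.default_rng(0)
latent = rng.normal(size=4)                          # toy latent vector
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3 stages, 8 entries each
codes, quantized = rvq_encode(latent, codebooks)
```

The decoder side then only needs the integer `codes` (plus the codebooks) to rebuild `quantized`, which is what makes this a compression scheme.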

This is still a draft PR. Here's what I've done for now:

  1. Adapted the model to the Transformers format in modeling_dac.py.
  2. Added the checkpoint conversion scripts, and pushed the three models (16, 24, and 44 kHz) to the Hub here.
  3. Made sure the forward pass gives the same output as the original model.
  4. Added a feature extractor (very similar to the Encodec FeatureExtractor).
  5. Started iterating on tests.
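As a rough illustration of step 4, a codec feature extractor of this kind mainly right-pads the raw waveform to a multiple of the codec's hop length and shapes it into a batch. The sketch below is a hypothetical stand-in (the function name, the `hop_length` value, and the output layout are assumptions, not the actual Encodec/DAC feature extractor):

```python
import numpy as np

def extract_features(raw_audio, hop_length=512):
    """Pad mono audio so its length is a multiple of the codec hop length,
    then add batch and channel axes: (batch, channels, time)."""
    audio = np.asarray(raw_audio, dtype=np.float32)
    pad = (-len(audio)) % hop_length  # 0 if already a multiple
    audio = np.pad(audio, (0, pad))
    return audio[np.newaxis, np.newaxis, :]

batch = extract_features(np.zeros(1000), hop_length=512)
# 1000 samples padded up to 1024, shaped (1, 1, 1024)
```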

Who can review?

cc @sanchit-gandhi and @ArthurZucker cc @ylacombe for visibility

kamilakesbi avatar Jun 19 '24 13:06 kamilakesbi

They indeed apply weights to the different losses during training (see the original codebase). I'll add the corresponding weight attributes to the config file.

Note that in the current code, we only return the commitment_loss and codebook_loss, but there are other losses used to train the model (mel_loss and the GAN losses).
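A minimal sketch of what such config-level loss weights could look like, assuming a small dataclass with hypothetical attribute names and default values (the real weights live in the original DAC training recipe, not here):

```python
from dataclasses import dataclass

@dataclass
class DacLossConfig:
    """Hypothetical config fragment exposing training loss weights."""
    commitment_loss_weight: float = 0.25
    codebook_loss_weight: float = 1.0

    def weighted_loss(self, commitment_loss: float, codebook_loss: float) -> float:
        # total quantizer loss as a weighted sum of its two terms
        return (self.commitment_loss_weight * commitment_loss
                + self.codebook_loss_weight * codebook_loss)

cfg = DacLossConfig()
total = cfg.weighted_loss(0.4, 0.2)  # 0.25 * 0.4 + 1.0 * 0.2 = 0.3
```

Keeping the weights in the config (rather than hard-coded in the modeling file) lets users reproduce the original training setup or experiment with their own weighting.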

kamilakesbi avatar Jun 26 '24 10:06 kamilakesbi

I've addressed all of Sanchit's review comments and added integration tests. @ylacombe this should be ready for review!

  • Note that there's still one failing test which indicates:

1 failed because AssertionError -> <class 'transformers.models.dac.modeling_dac.DacModel'> is too big for the common tests (74175906)! It should have 1M max.

Should we override this common test in the DAC test file?

cc @sanchit-gandhi

kamilakesbi avatar Jun 27 '24 12:06 kamilakesbi

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Thank you for your review @ylacombe!

I have taken your feedback into account and updated the code. I've also added the ability to decode from audio codebooks.

Regarding the loss, I agree that we should probably not return the encoder loss by default. Normally, the loss is returned when a labels argument is passed to the model's forward pass, but here I'm not sure a labels argument makes sense, as the loss is computed in an unsupervised manner. Should I add a return_loss argument instead?
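For illustration, a return_loss flag could gate the loss computation as in the toy sketch below (the function body is a placeholder, not the actual DacModel forward; names are assumptions):

```python
def forward(input_values, return_loss=False):
    """Toy forward pass: loss is only computed when explicitly requested,
    since the codec losses need no supervision targets (labels)."""
    audio_values = input_values  # placeholder for encode -> quantize -> decode
    output = {"audio_values": audio_values}
    if return_loss:
        # commitment/codebook losses would be computed here
        output["loss"] = 0.0
    return output

out = forward([0.1, 0.2])                    # no "loss" key by default
out_with_loss = forward([0.1, 0.2], return_loss=True)
```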

Otherwise, I think this is ready for a final review @amyeroberts :) The failing tests are unrelated to this PR, I think.

kamilakesbi avatar Jul 03 '24 11:07 kamilakesbi

Thanks for the reviews @amyeroberts and @ylacombe!

We should be close to merging this model!

The last change is to transfer the weights from my personal Hugging Face page to the Descript organisation. I'm waiting for the organisation's members to add me.

kamilakesbi avatar Jul 10 '24 13:07 kamilakesbi

@amyeroberts the checkpoints have been transferred to the Descript organisation.

We can merge this PR if everything is OK on your side :)

kamilakesbi avatar Jul 11 '24 14:07 kamilakesbi

Gentle ping @amyeroberts

kamilakesbi avatar Jul 15 '24 08:07 kamilakesbi

@kamilakesbi There's still failing tests on the CI - these should be resolved before final review and merge. You may need to rebase on main to include upstream changes or trigger a re-run of the CI if the issues are relating to the environment or other libraries.

amyeroberts avatar Jul 15 '24 09:07 amyeroberts

@amyeroberts I've rebased but there are still failing tests which I think are unrelated to this PR. The failing tests indicate the following message:

1 failed because huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID -> Root=1-6694ef78-6ec395936d885db72ebacd65;2c9bbc0d-6ecd-49c4-b28e-df134be7bd4a)

kamilakesbi avatar Jul 15 '24 10:07 kamilakesbi

@amyeroberts after rebasing all tests pass on the CI :)

If I get your approval I can merge!

kamilakesbi avatar Jul 18 '24 11:07 kamilakesbi

The slow tests are passing locally but not on the CI... any idea what could be the reason here @sanchit-gandhi @amyeroberts ?

kamilakesbi avatar Jul 23 '24 14:07 kamilakesbi

@kamilakesbi From an initial glance, it looks like the tests are failing on numerical checks. The model's logits are sensitive to things like the running environment and hardware, so we expect small changes. It's odd that they're changing enough to fall outside the tolerance, but depending on what the values are, it could be OK.

The test values should be the ones obtained when running on the CI, rather than on one's own setup. You'll need to ssh into the CI runners to get these and make sure they're not too different from the current test values.
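For illustration, integration tests of this kind typically pin a few reference logits and compare them against the model's output with a tolerance, so small hardware- and environment-dependent drift does not break the test. The values below are made up:

```python
import numpy as np

# Reference values captured from the canonical environment (here: invented
# numbers purely for illustration).
EXPECTED = np.array([0.1012, -0.2031, 0.3055])

# Values observed on another machine, e.g. the CI runner: slightly different,
# but within an absolute tolerance of 1e-3, so the check passes.
observed = np.array([0.1013, -0.2030, 0.3056])
assert np.allclose(observed, EXPECTED, atol=1e-3)
```

Picking the tolerance is the judgment call being discussed above: too tight and the test flakes across hardware, too loose and it stops catching real regressions.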

amyeroberts avatar Jul 23 '24 14:07 amyeroberts

@kamilakesbi Why was this merged when there were failing slow tests?

amyeroberts avatar Aug 19 '24 10:08 amyeroberts