[Core] refactor `embeddings`
What does this PR do?
Extension of https://github.com/huggingface/diffusers/pull/7995.
Some comments are in line.
Todos
- [ ] Add licensing
@DN6 this is a point to be worked out: what criterion should decide whether a class goes in combined versus image_text? At the moment, embedding classes dealing with both images and texts are placed in image_text, while combined is for classes that take timesteps or other forms of conditioning along with {image,text} inputs.
Can't we split these into image and text?
I was trying to follow this:
https://github.com/huggingface/diffusers/pull/7995#issuecomment-2187356771
IMO, it's perhaps better to have all the embedding classes that deal with timesteps or mix modalities (image and text, for example) in combined.py, regardless of whether they have Combined in their class names. Then split image_text into image and text separately w.r.t. their types. So, if a class deals ONLY with text, we put it in embeddings/text.py.
WDYT @DN6 @yiyixuxu ?
Sorry about the back and forth here.
> IMO, it's perhaps better to have all the embedding classes in combined.py that deal with timestep or mix modalities (image and text, for example) regardless of whether they have Combined in their class names or not. And then split image_text into image and text separately w.r.t to their types. So, if a class is ONLY dealing with text, we put it in embeddings/text.py.
This makes sense to me.
Alright. Will wait for @yiyixuxu to comment as well before making changes.
combined does not need "combined" in names, of course
> regardless of whether they have Combined in their class names or not.
For this, I am not sure what you mean. Do you want to put all the timestep embeddings into combined.py? Timestep embeddings should have their own file. But a "combined" embedding that takes various inputs, including timestep, should of course go into combined.
> it's perhaps better to have all the embedding classes in combined.py that deal with timestep or mix modalities
ok with this
> And then split image_text into image and text separately w.r.t to their types. So, if a class is ONLY dealing with text, we put it in embeddings/text.py.
And in addition, I made a comment here: https://github.com/huggingface/diffusers/pull/8722#discussion_r1664447256 — we can put these attention pool layers into text.
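The placement rule the thread converges on could be sketched as a small helper (purely illustrative — this function and the modality labels are not part of the PR; file names follow the layout discussed above):

```python
def target_module(modalities: list[str]) -> str:
    """Sketch of the proposed file-placement rule for embedding classes.

    - A class mixing modalities (e.g. image + text, or timestep + anything)
      goes into embeddings/combined.py, regardless of its class name.
    - A class handling a single modality goes into its own file,
      e.g. embeddings/text.py, embeddings/image.py, embeddings/timestep.py.
    """
    unique = set(modalities)
    if len(unique) > 1:
        # Mixed inputs (including timestep combined with another modality).
        return "embeddings/combined.py"
    # Single-modality classes get their own file; pure timestep
    # embeddings keep a dedicated file per the discussion above.
    return f"embeddings/{modalities[0]}.py"
```

For example, a text-only embedding lands in embeddings/text.py, while a class taking both timestep and text conditioning lands in embeddings/combined.py.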