Enshen Zhou
## Description

Mainly implements four multi-modal parts:

1. Building an agent to support multiple image inputs (see examples/vision/object_recognition.py)
2. Building an agent to call DALL-E to generate images (see examples/vision/image_crafting.py)...
From Appendix D.2 of the paper (the Prior Training section), I understood how Steve-1 collected text-video pairs for training the Prior. I am particularly interested in two points...
Thank you very much for providing such an excellent open-source code repository! I have a question regarding the resume functionality. Suppose I am training for just one epoch, and I...