Bulk Create without returning objects
The main problem here is that `bulk_create` converts the objects to a list before insertion, which loads all the models into memory at once. The only reason for this is to return the inserted objects after the query is done. However, those objects are not updated (no `pk` is populated, for example), i.e. the method returns exactly what you passed in. I think we can improve this and let `bulk_create` benefit from #1613.
Describe the solution you'd like
There are several approaches for this:
- Do not return anything from `bulk_create`, or return the number of inserted rows. This may be a breaking change.
- Add a keyword argument called `return_objects` or something like that to control the behaviour.
- Do not return anything when `batch_size` is set.
If none of the above seems reasonable, I think we should at least mention in the docs what goes on behind the scenes, so users can handle chunking on their own. And maybe use `tuple(objects)` instead of `list(objects)`.
Describe alternatives you've considered
The most natural alternative is to chunk the data on your side and leave `batch_size` as `None`. But this makes `batch_size` useless.
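For reference, manual chunking can be done lazily, so the full object list is never materialized in memory. A minimal sketch (the `chunked` helper and the batch size are illustrative, not part of Tortoise's API):

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(objects: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of at most batch_size items, consuming the iterable lazily."""
    it = iter(objects)
    # islice pulls at most batch_size items per pass; an empty batch ends the loop
    while batch := list(islice(it, batch_size)):
        yield batch
```

With a helper like this one could write `for batch in chunked(movie_generator(), 1000): await Movie.bulk_create(batch)` and never hold more than one batch of models in memory.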
Hi!
I think ideally we would support id population for backends that allow it in the same query, as with `RETURNING` in Postgres.
But I'm not sure I will be able to do it any time soon.
Right now returning objects doesn't bring any benefit, as we return exactly what was passed in, so I don't see much value in this return value and we can change it.
I think the most explicit way to do that for now would be to just remove that return value, allowing iteration through models without loading them all into memory.
The next step could be adding a `return_objects` param, allowed only for supported databases, where we would return the updated objects from the db in the same query.
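For anyone unfamiliar with the idea: `INSERT ... RETURNING` lets the database hand back generated keys in the same round trip as the insert, which is what would make a `return_objects` param actually useful. A minimal self-contained sketch using the stdlib `sqlite3` module (SQLite >= 3.35 also supports `RETURNING`; the table and column names here are made up for illustration, and Postgres would be the real target backend):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT)")

# One multi-row INSERT; RETURNING yields one row per inserted record,
# so the auto-generated primary keys come back without a second query.
rows = conn.execute(
    "INSERT INTO movie (title) VALUES (?), (?) RETURNING id",
    ("Alien", "Blade Runner"),
).fetchall()
ids = [r[0] for r in rows]
conn.close()
```

An ORM-level `return_objects=True` could use the same mechanism to populate `pk` on the instances it hands back, instead of echoing the input.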
Hi @abondar . I can make a PR on that, removing the returned object.
If you could - I would gladly help you with reviewing and merging it
I think returning objects may be useful in situations like the following, or is there a better way to do it?
```python
class Movie(Model):
    mv_id = fields.CharField(max_length=20, pk=True)
    search_logs: fields.ManyToManyRelation["SearchLog"]


class SearchLog(Model):
    search_date = fields.DateField()
    movies: fields.ManyToManyRelation["Movie"] = fields.ManyToManyField(
        "app.Movie", related_name="search_logs"
    )


async def run():
    await init_db("test")
    # if we got many movies and want to insert them at once
    movies = [Movie(mv_id=f"{i}") for i in range(100)]
    _movies = await Movie.bulk_create(movies, ignore_conflicts=True)
    search_log, _ = await SearchLog.get_or_create(search_date=date.today())
    await search_log.movies.add(*_movies)
    await close_db()
```
Hi @rgbygv
The method `bulk_create` was returning the same objects you passed to it, which means that if you don't provide an `id` to your model, the returned one will not have the `id` populated. That's why it was removed. In your example, `movies` and `_movies` would contain the same objects.
Thanks, I got it. I may need something like `bulk_save()`. Inserting movies one by one would be slow, so I chose to insert them all at once. My current solution is to query again based on `movies` after the batch insert, although it seems a bit silly.
```python
async def run():
    await init_db("test")
    movies = [Movie(mv_id=f"{i}") for i in range(10)]
    await Movie.bulk_create(movies, ignore_conflicts=True)
    movies = await Movie.filter(mv_id__in=[movie.mv_id for movie in movies]).all()
    search_log, _ = await SearchLog.get_or_create(search_date=date.today())
    await search_log.movies.add(*movies)
    await close_db()
```
@rgbygv why do you need to fetch the movies if you have the IDs?
Because I want to call `await search_log.movies.add(*movies)`, and the movies need to be `Movie` instances, which can be obtained via `.create()` or `.save()`. But I don't want to use either of those methods. The other way to get `Movie` instances is to fetch them from the database.
I see now. You can bypass this by adding the following before you call the `add` method:
```python
for movie in movies:
    movie._saved_in_db = True
```
However, I think we should have a way to access the through table like Django does.
That class can be used to query associated records for a given model instance like a normal model:
```python
Model.m2mfield.through.objects.all()
```
What do you think about this, @abondar?