protobuf: Repeatedly assigning a field does not free memory used by previous assignments

What version of protobuf and what language are you using? Version: 4.21.2 Language: Python

What operating system (Linux, Windows, ...) and version? Reproduced on Linux and Windows

What runtime / compiler are you using (e.g., python version or gcc version) Python 3.9 (Anaconda)

What did you do? Steps to reproduce the behavior:

Create example.proto:

syntax = "proto3";

message Example {
    string foo = 1;
}

Compile example.proto:

C:\gdavey\bin\protoc-21.3-win64\bin\protoc.exe --python_out=. example.proto

Create memleak.py:

from example_pb2 import Example

example = Example()
while True:
    example.foo = 'Wake Me Up Before You Go-Go'

Run memleak.py:

python .\memleak.py

What did you expect to see

Near constant low memory usage.

What did you see instead?

Memory usage increases with no upper limit: for example, after 10 s this script consumes 5 GB, after 20 s it consumes 10 GB, and so on.

Anything else we should know about your project / environment

We noticed this when upgrading from protobuf<4 to protobuf>=4.

Our code was creating a single instance of a protobuf, and assigning its fields many times.

With protobuf<4 using the CPP implementation, we saw near constant memory usage.

After upgrading to protobuf>=4 we observed a memory leak in our application. We traced the cause to this issue.

Jul 21 '22 14:07 geoffdavey

This is a side effect of the arena allocation strategy used by upb. There are also configurations of the existing Python/C++ implementation that would do the same thing (though they take more effort to set up). Ideally, we would notice this and have some mitigation inside the arena, but I do not think that we will be able to address this in the near term. The simplest workaround is to not have long-lived objects like this.

We could expose a clone method or similar to create a packed detached arena, which would likely be a good affordance to help.
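To make the arena behaviour concrete, here is a toy model in pure Python. It is only a sketch of the general idea (a bump allocator that never frees individual allocations), not upb's actual implementation; `ToyArena`, `ToyMessage`, and `clone` are hypothetical names invented for this illustration.

```python
class ToyArena:
    """Toy bump allocator: appends bytes and never frees a single allocation."""

    def __init__(self):
        self.buf = bytearray()

    def alloc(self, data):
        # Hand out (offset, size); space from earlier allocations is never reused.
        start = len(self.buf)
        self.buf += data
        return (start, len(data))

    def read(self, ref):
        start, size = ref
        return bytes(self.buf[start:start + size])


class ToyMessage:
    """Toy message with one string field stored in its own arena."""

    def __init__(self):
        self.arena = ToyArena()
        self._foo = self.arena.alloc(b"")

    @property
    def foo(self):
        return self.arena.read(self._foo).decode()

    @foo.setter
    def foo(self, value):
        # The old value stays in the arena; only the reference moves.
        self._foo = self.arena.alloc(value.encode())

    def clone(self):
        # Hypothetical "clone": copy only the live value into a fresh arena.
        fresh = ToyMessage()
        fresh.foo = self.foo
        return fresh


msg = ToyMessage()
for _ in range(1000):
    msg.foo = "Wake Me Up Before You Go-Go"

print("arena size after 1000 assignments:", len(msg.arena.buf))
print("arena size of a clone:", len(msg.clone().arena.buf))
```

In this model the arena grows with every assignment even though only the last value is reachable, while the clone holds just the live value.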

Jul 21 '22 15:07 fowles

@ericsalo or @zhangskz, does either of you know offhand whether the moral equivalent of clone already exists in the Python API?

Jul 21 '22 15:07 fowles

@fowles Many thanks for picking this up.

> The simplest workaround is to not have long lived objects like this.

Indeed, this was our workaround. The default behaviour is, however, not intuitive.
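For anyone who cannot easily avoid a long-lived message, one way to apply the same workaround is to rebuild the message periodically. The sketch below is a hypothetical helper (`RecycledMessage` is not part of protobuf); it relies only on the message class being callable with no arguments and on the standard `CopyFrom` method. It assumes that copying into a newly constructed message keeps only the current field values, and that dropping the last reference to the old message releases its memory; neither assumption has been verified against upb here.

```python
class RecycledMessage:
    """Hypothetical wrapper that replaces the wrapped message every `max_writes` assignments."""

    def __init__(self, factory, max_writes=10_000):
        self._factory = factory        # e.g. the generated Example class
        self._max_writes = max_writes
        self._writes = 0
        self.msg = factory()

    def set(self, field, value):
        setattr(self.msg, field, value)
        self._writes += 1
        if self._writes >= self._max_writes:
            # Copy the current field values into a brand-new message and
            # drop the old one so that its memory can be reclaimed.
            fresh = self._factory()
            fresh.CopyFrom(self.msg)
            self.msg = fresh
            self._writes = 0


# Usage with the generated class from example.proto (not run here):
#   from example_pb2 import Example
#   holder = RecycledMessage(Example)
#   while True:
#       holder.set("foo", "Wake Me Up Before You Go-Go")
```

The threshold trades a small amount of copying for a bound on how much stale data a single message can accumulate.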

Jul 21 '22 16:07 geoffdavey

You are right that it is unintuitive. It also is (unfortunately) a side effect of one of the major sources of performance. So it is very hard to have it both ways here.

Jul 21 '22 16:07 fowles

This is about 99% Working-As-Intended. The thought occurs that maybe in a highly artificial loop like this one, we can recognize that the most recent arena allocation is being replaced by the next arena allocation and unroll it. But aside from trivial benchmarks I doubt that this technique would buy us much in practice because even a single hole would break the optimization. And if we try to extend the allocation history then suddenly we are running a block allocator instead of an arena allocator. So imma just close this and declare unintuitive victory.
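The "single hole" point can be illustrated with another toy model. This is a sketch under stated assumptions, not upb's code: `RewindingArena` and `simulate` are hypothetical names, and the arena reclaims space only when the value being replaced is the most recent allocation.

```python
class RewindingArena:
    """Toy bump allocator that can only reclaim its most recent allocation."""

    def __init__(self):
        self.buf = bytearray()

    def alloc(self, data):
        start = len(self.buf)
        self.buf += data
        return (start, len(data))

    def replace(self, old, data):
        # Rewind only if `old` sits at the very end of the arena.
        if old is not None:
            start, size = old
            if start + size == len(self.buf):
                del self.buf[start:]
        return self.alloc(data)


def simulate(fields, writes, value=b"x" * 27):
    """Assign `value` to the given fields in round-robin order; return arena size."""
    arena = RewindingArena()
    refs = {name: None for name in fields}
    for i in range(writes):
        name = fields[i % len(fields)]
        refs[name] = arena.replace(refs[name], value)
    return len(arena.buf)


single = simulate(["foo"], 1000)              # same field every time
alternating = simulate(["foo", "bar"], 1000)  # two fields interleaved
print("single field:", single)
print("alternating fields:", alternating)
```

With one field the arena stays at the size of a single value, but as soon as a second field is interleaved the replaced value is never the last allocation, so the arena grows with every write again.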

Sep 01 '22 18:09 ericsalo