C++ Bytes field representation as std::vector instead of std::string
What language does this apply to?
C++ (maybe others). Affects generated code and general APIs (reflection etc.).
Describe the problem you are trying to solve.
When working with raw data buffers in C/C++, operating system interfaces or other low level interfaces typically represent those buffers as void*, uint8_t* or char*.
In protocol buffers generated C++ code, Bytes fields are represented as std::string. This leads to type conversion difficulties:
Unfortunately, std::string does not provide mutable access to its data in char* representation. Only non-mutable access is allowed via const char* data() const noexcept; (link)
This implies that all low-level code that creates or modifies buffer contents has to be wrapped with memory copies, for example like this:
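char* buffer = static_cast<char*>(malloc(size)); // malloc returns void*, C++ needs the cast
lowLevelBufferModifyingFuncationCall(buffer, size);
std::string myString(buffer, size); // copy happens here
free(buffer); // memory from malloc must be released with free, not delete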
Describe the solution you'd like
In contrast to std::string, std::vector provides read and write access to the underlying data structure, as it guarantees that the stored data is contiguous in memory (&v[0] + n == &v[n]):
value_type* data() noexcept;
const value_type* data() const noexcept;
see link
If protocol buffers exposed an interface to access Bytes fields as std::vector, the above example could be implemented without the seemingly unnecessary copy operation:
// initialize the vector to the required size:
// zero initialization is assumed to be faster than copy (see comment below)
std::vector<char> buffer(size); // might also use buffer.resize()
lowLevelBufferModifyingFuncationCall(buffer.data(), size);
// now buffer could directly be used with protocol buffers, no copy necessary
Unfortunately, std::vector does not provide a way to change its size without initializing its contents.
However, zero initialization should be faster than a copy, so this would still perform better than the workaround with a temporary buffer.
Describe alternatives you've considered
The only alternative I currently see is the one described in the problem section, or variations of it.
Additional context
As far as I know, the main difference between std::string and std::vector is that std::vector guarantees that the data will be allocated as a contiguous block of memory. I would be interested in the reasons why the protocol buffer designers chose std::string over std::vector.
Actually if you want to get mutable access to the raw char*, I believe you can do that by just taking the address of the first element, e.g. &s[0]. According to this StackOverflow post, the data was not guaranteed to be contiguous in earlier versions of C++, but starting from C++11 it is guaranteed.
I was reading recently that some people consider it a bad practice to use std::string for non-textual binary data. I was surprised by that because we do this quite a bit in the protobuf codebase and haven't had any problems with it. For me this is just the way I am used to doing things. I believe proto1 (the predecessor to proto2 and proto3 which was never open sourced) didn't even make a distinction between string and bytes fields, so fields were treated the exact same way whether they included UTF-8 text or arbitrary binary data. That may explain part of the reason we ended up following this pattern.
Thanks for your quick answer :+1:
I did some research. Unfortunately I could not get my hands on the official C++ specification. The closest thing I found was some draft from 2011: link In section 21.4.1.5, indeed the following is stated:
The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
However I still can see two problems:
- This does not give the user permission to modify data in this memory region, but merely describes an implementation detail. The API does not expose any method to get a non-const accessor to this data region.
- Your proposal about using &s[0] does not respect access control of the string class and uses internal knowledge of the std::string class. This sounds to me like a workaround for a const cast. I am certain it will work, but it just looks dirty :hankey:
Maybe this is a bug/missing feature in the C++ API, as it is not (yet) exploiting all aspects of the specification. Note that the non-const .data() method for std::vector was also only added recently with C++11.
So with std::string I only have two options:
- Respect type-safety and make useless copies
- Drop type-safety by doing const casts or relative addressing starting from the first element
(not sure if type-safety is the correct word here)
Actually I did a little more research just now and found that C++17 adds in a non-const data() method for std::string, like the one that was added for std::vector. So at least starting from C++17 the problem goes away entirely. Still I think the &s[0] trick is not really so bad, since it relies only on behavior documented by the standard starting from C++11.
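For illustration, here is a minimal sketch of both variants mentioned above. It assumes a hypothetical C-style function `fill_buffer` standing in for the low-level call; `make_payload` is likewise just a name chosen for this example. The string is sized first and then written in place, so no temporary buffer and no second copy are involved (only the zero fill done by `resize`).

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical low-level function (an assumption for this sketch, not a
// real API): writes `size` bytes into `out`.
void fill_buffer(char* out, std::size_t size) {
    std::memset(out, 0x2A, size);
}

// Builds a std::string by letting the low-level function write directly
// into the string's own storage.
std::string make_payload(std::size_t size) {
    std::string payload;
    payload.resize(size);            // zero-fills `size` bytes
    if (size == 0) return payload;   // nothing to write
#if __cplusplus >= 201703L
    fill_buffer(payload.data(), size);  // C++17: non-const data()
#else
    fill_buffer(&payload[0], size);     // C++11: storage is guaranteed contiguous
#endif
    return payload;
}
```

With generated protobuf code, the same pattern should be applicable to the std::string* returned by the `mutable_<field>()` accessor of a bytes field.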
I agree with you. It seems like C++ std::string + std::vector were primarily designed to be usable within the std:: ecosystem.
Type-safe interoperability with low-level data-structures only recently seems to get attention. We were just unlucky, that std::vector got this attention first :smile:
I am beginning to wonder, what the difference between std::string and std::vector actually is... Maybe in the future we might get an O(1) conversion between std::string and std::vector :yum:
-> closing, as problem seems to be caused by limitations in legacy C++ APIs and does not exist in more recent C++ versions.
May I just chime in with this: This looks to me like the google engineers that first designed this didn't really care about type safety or using the proper types for this. It smacks to me of C programmers doing C++.
typedef char Byte;
typedef std::vector<Byte> Bytes;
Would have been far less nasty.
Telling people std::string is the way-to-go for binary data because it just "works" contradicts why we're using a programming "language" in essence!
Any C++ developer reading the code would expect std::string to hold a null-terminated character sequence; anything else causes a lot of confusion!
We read code far more often than we write it, all the more so once it is spread worldwide!
For more reasons, see answers here: https://stackoverflow.com/questions/9626914/stdstring-or-stdvectorchar-to-hold-raw-data
Re-opening due to public interest (see comments above).
I recently stumbled on this issue again when inter-operating with other libraries, which expect std::vector<uint8_t> as input for work on data buffers.
Modern C++ APIs rarely use raw pointers, favoring containers passed by move or by reference instead.
I still could not find a way to cleanly convert std::string to std::vector<uint8_t> without a nasty copy.
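To make the trade-off concrete, here is a small sketch of the two options that exist today. The names `to_byte_vector`, `ByteView` and `as_bytes` are made up for this example. The owning conversion always copies; the only copy-free route is a non-owning view, which does not help if the other library insists on a std::vector<uint8_t>.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Owning conversion: allocates a new buffer and copies every byte, O(n).
std::vector<std::uint8_t> to_byte_vector(const std::string& s) {
    return std::vector<std::uint8_t>(s.begin(), s.end());
}

// Hypothetical non-owning view (pointer + length). It is only valid while
// the source string is alive and unmodified.
struct ByteView {
    const std::uint8_t* data;
    std::size_t size;
};

// No copy: reinterprets the string's storage as unsigned bytes.
ByteView as_bytes(const std::string& s) {
    return ByteView{reinterpret_cast<const std::uint8_t*>(s.data()), s.size()};
}
```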
While I support the idea that std::string is the wrong type to use here, I do want to chime in that likely the reason the original designers of the codebase went with std::string may be due to the one small advantage it has: the short-string optimization, which can mean no allocations for short strings (<16 or <20 bytes, depending on implementation).
That being said, using std::string in this way to hold binary data blobs is not new to Google codebases -- leveldb also does it. I think it's a design error because it leads to a certain lack of type safety, but it might not be worth breaking compatibility to redo this anytime soon.
I think it is worth adding accessors that return std::string_view and std::span<std::uint8_t> alongside std::string and std::vector<std::uint8_t>, as these would point to the decoded data directly without copying, and the user would know exactly how to handle them.
I'm not sure what would be needed to add this feature to the library: perhaps decoding without copying the data, with absent data created on the heap and deleted dynamically somehow, or even providing those return values only when the field is present (conditional on if (XX.has_value())).
There are a great many competing tensions going on here:
- std::string is not a great type for binary data (as noted).
- exposing any type as a const & constrains the implementation a great deal (as the implementation must then have such a type)
- std::vector comes with some significant complexity around how it would interact with protobuf arenas (actually std::string already has some of that, but we do some carefully managed encapsulation breaks to work around it)
- Any change to the signature of generated APIs must be embarked on carefully.
The entire motivation behind the new editions effort is to provide primitives for doing this.
One of the first targets we will use this for is to switch string accessors from const std::string& to absl::string_view parameters. We can review bytes fields at the same time, but there is a lot of value in terms of smooth evolution of user code bases when one can go back and forth between std::string. All this to say, we are thinking about these things, but don't expect any fast motion here.
I think it's better to stick with std::string_view for C++17 upwards compilers, as abseil might remove its support.
- Our goal is for Abseil types like string_view to generally disappear when the relevant standard is available. string_view is a C++17 type - by the time there is good compiler support for C++23 we may well be removing Abseil support for string_view.
I understand that protobuf is a very large project, which is loaded with years of backward-compatibility requirements, which naturally slows the pace of introducing new features. Given this, can we expect the no-copy decoding feature to be introduced soon (perhaps within two years)?
For the specific question std::string_view vs absl::string_view, protobuf is committed to using absl::string_view right now. At the end of 2024, both abseil and protobuf will update their minimum C++ to 2017. Until then we are definitely on absl::string_view. When that happens, the type will dissolve into an alias for std::string_view. After that, we will likely clean our internal spellings to std::string_view. Abseil will probably remove the alias eventually as well, but that is for them to decide.
For the specific case of no-copy decoding, we added support for this via absl::Cord in 23.x.
Thanks @fowles. I did some research before replying here, and my conclusion is that the standard C++ protobuf library does not seem to be a good fit for embedded systems in terms of binary size, given that adding abseil in order to use Cord would increase the size further. Other options such as nanopb appear to be more suitable.
I suppose we all agree that std::string is a bad abstraction to the bytes buffer data. But what is the most appropriate abstraction so that we can avoid data copy whenever possible?
1. All the bytes buffer can be represented as a pair of void* data and size_t len.
2. The underlying data can be discontinuous. I believe we should consider such scenario, otherwise the Linux's iovec and readv, writev would be unnecessary.
3. protobuf message should take the ownership of the buffer.
If we ignore 3, then there is no doubt that a sequence of std::string_view or std::span is the best, e.g. std::span<std::span<char>>. But just forget it. There is no way that we can abandon the ownership.
Let's assume we now have a pair of void* data and size_t len. We know how it was allocated and how to release it. Does the STL give us a facility to take ownership of this data without copying memory? I believe the answer is NO.
But we still have room for refinement. We can let each of protobuf's bytes fields hold an instance of std::pmr::memory_resource, and replace std::string with std::pmr::string. Then at least we can construct our buffer on the stack, or somewhere else where we can customize the behaviour of allocation and deallocation.
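As a rough sketch of what I mean (plain C++17, nothing protobuf-specific; `stored_in_buffer` and `example` are names made up for this illustration): a std::pmr::string can take its storage from a std::pmr::monotonic_buffer_resource that sits on top of a stack array. Note that std::pmr::string is a different type from std::string, so this would still be an API change for generated code.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <memory_resource>
#include <string>

// Returns true if the string's character storage lies inside
// [buf, buf + len). std::less gives a well-defined pointer ordering.
bool stored_in_buffer(const std::pmr::string& s, const std::byte* buf,
                      std::size_t len) {
    const auto* p = reinterpret_cast<const std::byte*>(s.data());
    return !std::less<>{}(p, buf) && std::less<>{}(p, buf + len);
}

// Builds a 64-byte pmr string backed by a stack array and reports whether
// the bytes really ended up in that array.
bool example() {
    std::array<std::byte, 256> stack_storage{};
    // null_memory_resource() as upstream: running out of stack storage
    // throws std::bad_alloc instead of silently falling back to the heap.
    std::pmr::monotonic_buffer_resource arena(
        stack_storage.data(), stack_storage.size(),
        std::pmr::null_memory_resource());
    // 64 bytes is longer than typical small-string buffers, so the
    // characters are allocated from `arena`.
    std::pmr::string bytes(64, '\x01', &arena);
    return bytes.size() == 64 &&
           stored_in_buffer(bytes, stack_storage.data(), stack_storage.size());
}
```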
FYI, I use brpc (https://github.com/apache/brpc) in my corporate project. brpc has a type called IOBuf to represent a bytes buffer. If we use brpc to receive an http request, the http body is wrapped in an IOBuf object. This type lets the user append user data to the buffer without copying it. That is, the buffer can be discontinuous. We also pass a deleter object when appending the data, so that when the IOBuf is destroyed, it can release all the data we have appended to it.
Here are some references on IOBuf; I hope they can help the evolution of protobuf.
https://brpc.incubator.apache.org/docs/c++-base/iobuf/
https://github.com/apache/brpc/pull/2431 (A PR submitted by myself, change the deleter type from void(*)(void*) to std::function<void(void*)>.)
https://github.com/apache/brpc/blob/master/src/butil/iobuf.cpp#L1220 (The source codes of IOBuf.)
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.
This issue is labeled inactive because the last activity was over 90 days ago.
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please reopen it.
This issue was closed and archived because there has been no new activity in the 14 days since the inactive label was added.