Use MemoryIO.
Can you benchmark the difference? Creating a MemoryIO allocates memory so this should result in a slowdown. The idea is good, but I wouldn't merge this if it makes things slower (speed is more important)
OK, let me think a bit. This is not the final version, just the first PR of something like this:
- [x] remove manual swaps in favor of IO-provided functions (also makes the code correct when compiled on big-endian machines)
- [ ] add support for 1D arrays
- [ ] update decoding to one of...
  - [ ] maybe move decoding into `#from_io` (probably not, it feels like too much unnecessary OO), or...
  - [ ] simply individual functions on a single `Decoder` class (probably yes), i.e. `Decoder#decode_jsonb` etc. It's easy to reuse decoding logic this way and the code is very compact. Extending the decoder with custom types would probably have to go through a map of oid -> proc (rough sketch after this list).
- [ ] read the whole frame into a MemoryIO and use that for the decoder
- [ ] possibly use a pool for MemoryIO buffers
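For the `Decoder` item, something like this hypothetical sketch is roughly what I have in mind (names like `Decoder.register`, the oid 99999, and the trimmed-down value union are made up for illustration, and it assumes the `IO::ByteFormat` slice overloads):

```crystal
# Hypothetical sketch only, not code from this PR: individual decode
# functions on one class plus an oid -> proc registry for custom types.
alias Value = Int32 | String | Nil # trimmed-down union, just for the example

class Decoder
  # registry for custom/extension types, keyed by Postgres type oid
  @@custom = {} of Int32 => Proc(Slice(UInt8), Value)

  def self.register(oid : Int32, &block : Slice(UInt8) -> Value)
    @@custom[oid] = block
  end

  def decode(oid : Int32, bytes : Slice(UInt8)) : Value
    if handler = @@custom[oid]?
      handler.call(bytes)
    else
      case oid
      when 23 then decode_int4(bytes) # int4
      when 25 then decode_text(bytes) # text
      else         nil
      end
    end
  end

  def decode_int4(bytes : Slice(UInt8)) : Int32
    IO::ByteFormat::NetworkEndian.decode(Int32, bytes)
  end

  def decode_text(bytes : Slice(UInt8)) : String
    String.new(bytes)
  end
end

# extension point: register a decoder for some custom oid (99999 is made up)
Decoder.register(99999) { |bytes| String.new(bytes) }
```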
...what do you think?
...and I thought to do each step as a separate PR to keep each PR a comprehensible size, otherwise it's going to be a bit big (but I can do it in a single one as well).
Just two quick thoughts:
- this introduction of `MemoryIO` doesn't overwrite anything, it's just a fallback. For example, `String` is decoded using `Slice(UInt8)`. The default implementation in `Decoder` just falls back to the `IO`-based one, so there's no performance penalty if the type decoding wants to work on a Slice directly (I've messed up UUID decoding performance-wise, that's true; I'll change it back to the previous implementation, which is much faster).
- the only reason to use `MemoryIO` is to have a nice advancing interface for reading. Ideally Crystal would provide in the standard library something like a `Bytes` (or `Buffer`) struct: lightweight, similar to Slice but not generic (there's no point, because it works on bytes), with reader/writer offsets in its internal state and methods for encoding/decoding in a specified endianness (endianness could be specialised at the type level, i.e. `NetworkEndian::Buffer`). A rough sketch of what I mean follows.
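This is purely hypothetical, nothing like it exists in the stdlib; `NetworkBuffer` and its methods are invented names, and it leans on the `IO::ByteFormat` slice overloads just to keep the sketch short:

```crystal
# Purely hypothetical sketch of the Buffer idea: a struct wrapping a
# Slice(UInt8) with an internal read offset and network-order reads.
struct NetworkBuffer
  def initialize(@bytes : Slice(UInt8))
    @pos = 0
  end

  def read_int16 : Int16
    value = IO::ByteFormat::NetworkEndian.decode(Int16, @bytes[@pos, 2])
    @pos += 2
    value
  end

  def read_int32 : Int32
    value = IO::ByteFormat::NetworkEndian.decode(Int32, @bytes[@pos, 4])
    @pos += 4
    value
  end

  def read_slice(count : Int32) : Slice(UInt8)
    value = @bytes[@pos, count]
    @pos += count
    value
  end
end

buf = NetworkBuffer.new(Slice[0x00_u8, 0x2a_u8, 0x00_u8, 0x00_u8, 0x00_u8, 0x07_u8])
buf.read_int16 # => 42
buf.read_int32 # => 7
```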
By the way, do you know if it's feasible to allow struct inheritance with the restriction that it can't change width? I.e. just adding methods, or is that not possible?
@asterite btw. this code is using the `Slice(UInt8)` constructor variant of `MemoryIO`, which doesn't malloc, so it shouldn't be slower; there are no extra allocations.
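For reference, a minimal sketch of that constructor (`MemoryIO` is called `IO::Memory` in current Crystal, same idea): the IO wraps the existing slice instead of allocating a new buffer.

```crystal
# Sketch of the Slice-backed constructor: the IO reads from the existing
# buffer, so no new byte buffer is allocated (only the IO wrapper itself).
# Note: MemoryIO was later renamed IO::Memory.
frame = Slice[0x00_u8, 0x00_u8, 0x00_u8, 0x2a_u8]

io = IO::Memory.new(frame)
io.read_bytes(Int32, IO::ByteFormat::NetworkEndian) # => 42
```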
I mean, creating the MemoryIO itself is what allocates memory. Maybe it's insignificant, but I don't know.
@asterite ok, you're right, it is slower.
```crystal
# ./bench/decoding.cr
# crystal run --release bench/decoding.cr
require "benchmark"
require "../src/pg"

DB_URL = ENV["DATABASE_URL"]? || "postgres:///"
db = PG.connect(DB_URL)

Benchmark.ips do |x|
  x.report("int series") { db.exec("select * from generate_series(1, 1000);").rows }
end

# int series 1.78k (± 2.53%) will:master
# int series 1.54k (± 2.23%) mirek:memory-io
```
Let me fix that.
Thanks for the patch. I'm on a couple of cross-country flights tomorrow (actually to go speak about Crystal), so I should have enough downtime to check this out in detail then.
However, in response to the endianness point, I believe the current implementation does work regardless of system endianness: https://commandcenter.blogspot.se/2012/04/byte-order-fallacy.html
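The short version of that post, as a sketch: read each byte by position and shift it into place, and the host's byte order never comes into it.

```crystal
# Minimal sketch of the approach from the linked post: the value is assembled
# from byte positions with shifts, so host endianness never matters.
def read_be_int32(bytes : Slice(UInt8)) : Int32
  (bytes[0].to_i32 << 24) | (bytes[1].to_i32 << 16) |
    (bytes[2].to_i32 << 8) | bytes[3].to_i32
end

read_be_int32(Slice[0x01_u8, 0x02_u8, 0x03_u8, 0x04_u8]) # => 16909060
```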
I updated my previous benchmark of various byte swapping methods https://gist.github.com/will/c348bf90a7cbd7e93fd0 to include going through the NetworkEndian interface, and it's quite a bit slower.
```
$ crystal run --release byte_swap.cr
32bit
==========
RES = 67305985
true
      ntohl  42.64M (± 6.40%)  1.09× slower
     intrin  43.25M (± 4.54%)  1.08× slower
  handshift  46.57M (± 3.72%)       fastest
   handswap  42.42M (± 4.45%)  1.10× slower
   memoryIo    9.8M (± 9.81%)  4.75× slower

64 bit
==========
RES64 = 578437695752307201
true
      ntohl64  46.17M (±20.54%)  1.21× slower
     intrin64  54.08M (± 9.65%)  1.03× slower
  handshift64  53.88M (± 4.56%)  1.04× slower
   handswap64  55.81M (± 5.68%)       fastest
   memoryIo64  10.83M (±21.91%)  5.15× slower
```
I think part of that is probably because this creates a MemoryIO, with the memory and GC pressure that entails, but the NetworkEndian code https://github.com/crystal-lang/crystal/blob/master/src/io/byte_format.cr#L57-L68 also seems to be making some objects, and even if those are all structs, it doesn't look as LLVM-optimizable as the other methods. I haven't looked at the generated assembly for the NetworkEndian code to verify, though.
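If you want to poke at just that part without the full gist, here's a stripped-down comparison along the same lines (report names are mine; the numbers above are from the gist, not from this sketch):

```crystal
# Stripped-down sketch: raw shifts vs. ByteFormat on a slice vs. a fresh
# IO::Memory (MemoryIO) per read. In a real run, keep/compare the results
# (as the gist does with RES) so --release can't optimize the work away.
require "benchmark"

BYTES = Slice[0x01_u8, 0x02_u8, 0x03_u8, 0x04_u8]

Benchmark.ips do |x|
  # raw shifts, no objects at all
  x.report("handshift") do
    (BYTES[0].to_i32 << 24) | (BYTES[1].to_i32 << 16) |
      (BYTES[2].to_i32 << 8) | BYTES[3].to_i32
  end

  # ByteFormat on the slice directly, no MemoryIO in the way
  x.report("byteformat") do
    IO::ByteFormat::NetworkEndian.decode(Int32, BYTES)
  end

  # a fresh MemoryIO (IO::Memory) per read
  x.report("memory io") do
    IO::Memory.new(BYTES).read_bytes(Int32, IO::ByteFormat::NetworkEndian)
  end
end
```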
Other things on your list:
- we do need support for the array types
- individual functions could work, maybe captured as procs, but we'd need to benchmark that
On the protocol side of the driver, I have been considering taking manual control over the buffering and byte swapping to get more speed out, but I started with the stdlib just to get things going. I think, though, that MemoryIO may be one abstraction too high for this project. The safety is good for general programs, but by being careful and deliberate with memory access, things are a lot faster, avoiding the extra objects and bookkeeping.
Again, I appreciate you looking into this, thank you.