Use MemoryIO.
Can you benchmark the difference? Creating a MemoryIO allocates memory so this should result in a slowdown. The idea is good, but I wouldn't merge this if it makes things slower (speed is more important)
OK, let me think a bit. This is not the final version, just the first PR of something like this:
- [x] remove manual swaps in favor of IO-provided functions (also makes the code correct when compiled on big-endian machines)
- [ ] add support for 1D arrays
- [ ] update decoding to one of...
  - [ ] maybe move decoding into `#from_io` (probably not, it feels like too much unnecessary OO), or...
  - [ ] simply individual functions on a single `Decoder` class (probably yes), i.e. `Decoder#decode_jsonb` etc. It's easy to reuse decoding logic this way and the code is very compact. Extending the decoder with custom types would probably have to go through a map of oid -> proc (rough sketch after this list).
- [ ] read the whole frame into a MemoryIO and use that for the decoder
- [ ] possibly use a pool for MemoryIO buffers
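For the `Decoder` item, something like this hypothetical sketch is roughly what I have in mind (names like `Decoder.register`, the oid 99999, and the trimmed-down value union are made up for illustration, and it assumes the `IO::ByteFormat` slice overloads):

```crystal
# Hypothetical sketch only, not code from this PR: individual decode
# functions on one class plus an oid -> proc registry for custom types.
alias Value = Int32 | String | Nil # trimmed-down union, just for the example

class Decoder
  # registry for custom/extension types, keyed by Postgres type oid
  @@custom = {} of Int32 => Proc(Slice(UInt8), Value)

  def self.register(oid : Int32, &block : Slice(UInt8) -> Value)
    @@custom[oid] = block
  end

  def decode(oid : Int32, bytes : Slice(UInt8)) : Value
    if handler = @@custom[oid]?
      handler.call(bytes)
    else
      case oid
      when 23 then decode_int4(bytes) # int4
      when 25 then decode_text(bytes) # text
      else         nil
      end
    end
  end

  def decode_int4(bytes : Slice(UInt8)) : Int32
    IO::ByteFormat::NetworkEndian.decode(Int32, bytes)
  end

  def decode_text(bytes : Slice(UInt8)) : String
    String.new(bytes)
  end
end

# extension point: register a decoder for some custom oid (99999 is made up)
Decoder.register(99999) { |bytes| String.new(bytes) }
```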
...what do you think?
...and I thought to do each step as a separate PR to keep each PR a comprehensible size, otherwise it's going to be a bit big (but I can do it in a single one as well).
Just two quick thoughts:
- this introduction of `MemoryIO` doesn't overwrite anything, it's just a fallback. For example, `String` is decoded using `Slice(UInt8)`. The default implementation in `Decoder` just falls back to the `IO`-based one, so there's no performance penalty if the type decoding wants to work on a Slice directly (I've messed up UUID decoding performance-wise, that's true; I'll change it back to the previous implementation, which is much faster).
- the only reason to use `MemoryIO` is to have a nice advancing interface for reading. Ideally Crystal would provide in the standard library something like a `Bytes` (or `Buffer`) struct: lightweight, similar to Slice but not generic (there's no point, because it works on bytes), with reader/writer offsets in its internal state and methods for encoding/decoding in a specified endianness (endianness could be specialised at the type level, i.e. `NetworkEndian::Buffer`). A rough sketch of what I mean follows.
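This is purely hypothetical, nothing like it exists in the stdlib; `NetworkBuffer` and its methods are invented names, and it leans on the `IO::ByteFormat` slice overloads just to keep the sketch short:

```crystal
# Purely hypothetical sketch of the Buffer idea: a struct wrapping a
# Slice(UInt8) with an internal read offset and network-order reads.
struct NetworkBuffer
  def initialize(@bytes : Slice(UInt8))
    @pos = 0
  end

  def read_int16 : Int16
    value = IO::ByteFormat::NetworkEndian.decode(Int16, @bytes[@pos, 2])
    @pos += 2
    value
  end

  def read_int32 : Int32
    value = IO::ByteFormat::NetworkEndian.decode(Int32, @bytes[@pos, 4])
    @pos += 4
    value
  end

  def read_slice(count : Int32) : Slice(UInt8)
    value = @bytes[@pos, count]
    @pos += count
    value
  end
end

buf = NetworkBuffer.new(Slice[0x00_u8, 0x2a_u8, 0x00_u8, 0x00_u8, 0x00_u8, 0x07_u8])
buf.read_int16 # => 42
buf.read_int32 # => 7
```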
By the way, do you know if it's feasible to allow struct inheritance with the restriction that it can't change width? I.e. just adding methods, or is that not possible?
@asterite btw. this code is using the `Slice(UInt8)` constructor variant of `MemoryIO`, which doesn't malloc, so it shouldn't be slower; there are no extra allocations.
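For reference, a minimal sketch of that constructor (`MemoryIO` is called `IO::Memory` in current Crystal, same idea): the IO wraps the existing slice instead of allocating a new buffer.

```crystal
# Sketch of the Slice-backed constructor: the IO reads from the existing
# buffer, so no new byte buffer is allocated (only the IO wrapper itself).
# Note: MemoryIO was later renamed IO::Memory.
frame = Slice[0x00_u8, 0x00_u8, 0x00_u8, 0x2a_u8]

io = IO::Memory.new(frame)
io.read_bytes(Int32, IO::ByteFormat::NetworkEndian) # => 42
```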
I mean, creating the MemoryIO itself is what allocates memory. Maybe it's insignificant, but I don't know.
@asterite ok, you're right, it is slower.
```crystal
# ./bench/decoding.cr
# crystal run --release bench/decoding.cr
require "benchmark"
require "../src/pg"

DB_URL = ENV["DATABASE_URL"]? || "postgres:///"
db = PG.connect(DB_URL)

Benchmark.ips do |x|
  x.report("int series") { db.exec("select * from generate_series(1, 1000);").rows }
end

# int series 1.78k (± 2.53%) will:master
# int series 1.54k (± 2.23%) mirek:memory-io
```
Let me fix that.
Thanks for the patch. I'm on a couple of cross-country flights tomorrow (actually to go speak about Crystal), so I should have enough downtime to check this out in detail then.
However, in response to the endianness point, I believe the current implementation does work regardless of system endianness: https://commandcenter.blogspot.se/2012/04/byte-order-fallacy.html
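The short version of that post, as a sketch: read each byte by position and shift it into place, and the host's byte order never comes into it.

```crystal
# Minimal sketch of the approach from the linked post: the value is assembled
# from byte positions with shifts, so host endianness never matters.
def read_be_int32(bytes : Slice(UInt8)) : Int32
  (bytes[0].to_i32 << 24) | (bytes[1].to_i32 << 16) |
    (bytes[2].to_i32 << 8) | bytes[3].to_i32
end

read_be_int32(Slice[0x01_u8, 0x02_u8, 0x03_u8, 0x04_u8]) # => 16909060
```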
I updated my previous benchmark of various byte swapping methods https://gist.github.com/will/c348bf90a7cbd7e93fd0 to include going through the NetworkEndian interface, and it's quite a bit slower.
```
$ crystal run --release byte_swap.cr
32bit
==========
RES = 67305985
true
      ntohl  42.64M (± 6.40%)  1.09× slower
     intrin  43.25M (± 4.54%)  1.08× slower
  handshift  46.57M (± 3.72%)       fastest
   handswap  42.42M (± 4.45%)  1.10× slower
   memoryIo    9.8M (± 9.81%)  4.75× slower

64 bit
==========
RES64 = 578437695752307201
true
      ntohl64  46.17M (±20.54%)  1.21× slower
     intrin64  54.08M (± 9.65%)  1.03× slower
  handshift64  53.88M (± 4.56%)  1.04× slower
   handswap64  55.81M (± 5.68%)       fastest
   memoryIo64  10.83M (±21.91%)  5.15× slower
```
I think part of that is probably because this creates a MemoryIO, with the memory and GC pressure that entails, but the NetworkEndian code https://github.com/crystal-lang/crystal/blob/master/src/io/byte_format.cr#L57-L68 also seems to be making some objects, and even if those are all structs, it doesn't look as LLVM-optimizable as the other methods. I haven't looked at the generated assembly for the NetworkEndian code to verify, though.
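If you want to poke at just that part without the full gist, here's a stripped-down comparison along the same lines (report names are mine; the numbers above are from the gist, not from this sketch):

```crystal
# Stripped-down sketch: raw shifts vs. ByteFormat on a slice vs. a fresh
# IO::Memory (MemoryIO) per read. In a real run, keep/compare the results
# (as the gist does with RES) so --release can't optimize the work away.
require "benchmark"

BYTES = Slice[0x01_u8, 0x02_u8, 0x03_u8, 0x04_u8]

Benchmark.ips do |x|
  # raw shifts, no objects at all
  x.report("handshift") do
    (BYTES[0].to_i32 << 24) | (BYTES[1].to_i32 << 16) |
      (BYTES[2].to_i32 << 8) | BYTES[3].to_i32
  end

  # ByteFormat on the slice directly, no MemoryIO in the way
  x.report("byteformat") do
    IO::ByteFormat::NetworkEndian.decode(Int32, BYTES)
  end

  # a fresh MemoryIO (IO::Memory) per read
  x.report("memory io") do
    IO::Memory.new(BYTES).read_bytes(Int32, IO::ByteFormat::NetworkEndian)
  end
end
```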
Other things on your list:
- we do need support for the array types
- individual functions could work, maybe captured as procs, but we'd need to benchmark that
On the protocol side of the driver, I have been considering taking manual control over the buffering and byte swapping to get more speed out, but I started with the stdlib just to get things going. I think, though, that MemoryIO may be one abstraction too high for this project. The safety is good for general programs, but by being careful and deliberate with memory access, things are a lot faster, avoiding the extra objects and bookkeeping.
Again, I appreciate you looking into this, thank you.