SET throughput limitations
I have been testing Tile38 performance, and the results for simple queries have been really impressive: the server fully utilizes 16 CPU cores when I send queries from 50 client connections, with the best throughput exceeding 100k queries/sec.
However, SET performance scaling seems to be more limited: only about 3-4 cores are utilized on the server during simple SETs. Looking at the code, it seems the server locks the whole database during update commands? I suppose this makes it easier to handle the appendonly log and other shared resources. The plain SET throughput is still quite good; I get over 30000 SETs/sec with ~1ms latency.
Unfortunately, adding fences seems to reduce SET performance significantly, even when there are no listeners for the fence events.
I am setting up 23k fences like this: `SETCHAN $1 NEARBY fleet DETECT enter,exit FENCE POINT $2 $3 5000`, where $1 is the place ID and $2 and $3 are the coordinates of the place. The places are selected retail locations in the USA, which are naturally clustered around urban areas.
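For reference, the fence setup from Go looks roughly like the sketch below. This is a simplified stand-in for my actual loader, assuming the gomodule/redigo client and Tile38 on its default port 9851; the place list and channel names here are placeholders for the real retail locations.

```go
package main

import (
	"fmt"
	"log"

	"github.com/gomodule/redigo/redis"
)

func main() {
	conn, err := redis.Dial("tcp", "localhost:9851")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Illustrative place coordinates; the real run uses ~23k retail locations.
	places := [][2]float64{
		{33.46250, -112.26800},
		{40.71280, -74.00600},
	}

	// One SETCHAN per place: a 5000 m NEARBY fence against the "fleet" key.
	for i, p := range places {
		if _, err := conn.Do("SETCHAN", fmt.Sprintf("place:%d", i),
			"NEARBY", "fleet", "DETECT", "enter,exit", "FENCE",
			"POINT", p[0], p[1], 5000); err != nil {
			log.Fatal(err)
		}
	}
}
```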
If I now add the same 23k points to the "fleet" key using `SET fleet $1 POINT $2 $3` from 20 clients, the SET throughput is only about 6000 SETs/sec. Setting the same coordinates again gets almost normal throughput, probably because the enter/exit state does not change, but setting them reversed (swapping $2 and $3) again gets only 6000 SETs/sec since it causes exit events.
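The SET side of the test is essentially the sketch below (same assumptions as above: gomodule/redigo and Tile38 on localhost:9851; the point generation is a placeholder). Twenty goroutines, each with its own connection, split the point list between them.

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"sync"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Placeholder points; the real run reuses the same coordinates as the fences.
	points := make([][2]float64, 23000)
	for i := range points {
		points[i] = [2]float64{33.0 + rand.Float64(), -112.0 - rand.Float64()}
	}

	const clients = 20
	var wg sync.WaitGroup
	for c := 0; c < clients; c++ {
		wg.Add(1)
		go func(c int) {
			defer wg.Done()
			conn, err := redis.Dial("tcp", "localhost:9851")
			if err != nil {
				log.Fatal(err)
			}
			defer conn.Close()
			// Each client sets its share of the points into the "fleet" key.
			for i := c; i < len(points); i += clients {
				if _, err := conn.Do("SET", "fleet", fmt.Sprintf("id:%d", i),
					"POINT", points[i][0], points[i][1]); err != nil {
					log.Fatal(err)
				}
			}
		}(c)
	}
	wg.Wait()
}
```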
If I use 50k point fences and the same point set in fleet, the SET fleet throughput drops to ~2500 SETs/sec (still with 20 clients). Setting the same points again gets only 7000 SETs/sec. Reversing the points drops the throughput even further, to ~1700 SETs/sec.
The Tile38 server is only using a couple of cores during the slow SETs. It seems like the server processes the fences while still holding some kind of global lock? Is there any way to work around this limitation?
Hi @jkarjala,
I've been investigating this issue on my side. I'm not seeing the severe slowdown that you describe.
I wrote up a little tool that attempts to simulate your use case.
https://gist.github.com/tidwall/8a1468111f1f4524e8273e379e949486
> If I now add the same 23k points to the "fleet" key using `SET fleet $1 POINT $2 $3` from 20 clients, the SET throughput is only about 6000 SETs/sec. Setting the same coordinates again gets almost normal throughput, probably because enter/exit state does not change, but setting them reversed (switching $2 and $3) again gets only 6000 SETs/sec since it causes exit events.
For this case I'm seeing about 30k-70k/sec depending on pipelining.
> If I use 50k point fences and the same point set in fleet, the SET fleet throughput drops to ~2500 SETs/sec (all the time using 20 clients). Setting the same points again gets only 7000 SETs/sec. Reversing the points drops the throughput to even lower ~1700 SETs/sec.
About the same results as 23k.
I'm still trying to reproduce the problem as you describe it. What hardware, OS, and Tile38 version are you using?
I'm guessing there's some detail that I'm missing.
Thanks
Hi @tidwall, thanks for the quick reply.
I tried your tester gist, and with default arguments it indeed performs quite well on the test server:
```
>> numFences: 23000, radius: 5000m, pipeline 1, clients: 20 <<
SETCHAN      23,000 ops over 20 threads in 1125ms, 20,436/sec, 48932 ns/op
SET-POINTS   23,000 ops over 20 threads in 1538ms, 14,952/sec, 66877 ns/op
SET-SAME     23,000 ops over 20 threads in 788ms, 29,170/sec, 34281 ns/op
SET-REVERSE  23,000 ops over 20 threads in 1422ms, 16,179/sec, 61806 ns/op
```
However, once I patched the tester (see https://gist.github.com/jkarjala/33c8665a44df076af1093386e0fe0a6e) to cluster the points closer to each other (-A specifies the random area width and height in degrees; a simplified sketch of the change follows the results below), I start seeing the reduction in throughput:
```
tile-tester$ go run tester2.go -n 50000 -A 20
>> numFences: 50000, radius: 5000m, pipeline 1, clients: 20, area width/height: 20.000000 <<
SETCHAN      50,000 ops over 20 threads in 3058ms, 16,350/sec, 61159 ns/op
SET-POINTS   50,000 ops over 20 threads in 5000ms, 10,000/sec, 99990 ns/op
SET-SAME     50,000 ops over 20 threads in 1995ms, 25,062/sec, 39900 ns/op
SET-REVERSE  50,000 ops over 20 threads in 4851ms, 10,306/sec, 97026 ns/op

tile-tester$ go run tester2.go -n 50000 -A 10
>> numFences: 50000, radius: 5000m, pipeline 1, clients: 20, area width/height: 10.000000 <<
SETCHAN      50,000 ops over 20 threads in 3356ms, 14,900/sec, 67112 ns/op
SET-POINTS   50,000 ops over 20 threads in 8762ms, 5,706/sec, 175233 ns/op
SET-SAME     50,000 ops over 20 threads in 3086ms, 16,202/sec, 61718 ns/op
SET-REVERSE  50,000 ops over 20 threads in 9872ms, 5,064/sec, 197448 ns/op
```
A 10 degree by 10 degree area around the equator is about 1000 km by 1000 km (one degree is roughly 111 km), which is still a quite realistic area for 50000 fences and better simulates my use case with locations clustered around urban areas.
If I use -A 5, the SET throughput drops to about 2000/sec. Pipelining with -P 10 does not help much.
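In essence, the clustering patch just draws both the fence centers and the fleet points from a single A-degree window instead of spreading them over the whole globe. A simplified sketch of the idea (the anchor at the equator/prime meridian and the uniform distribution are simplifications of what the gist actually does):

```go
package main

import (
	"fmt"
	"math/rand"
)

// randPoint returns a lat/lon inside an area x area degree box anchored at the
// equator/prime meridian, so all generated points land close together.
func randPoint(area float64) (lat, lon float64) {
	return rand.Float64() * area, rand.Float64() * area
}

func main() {
	// With -A 10 the box is roughly 1000 km on a side near the equator.
	for i := 0; i < 5; i++ {
		lat, lon := randPoint(10)
		fmt.Printf("POINT %.5f %.5f\n", lat, lon)
	}
}
```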
My test server is a bare-metal Ubuntu 16 box with two Intel X5660 CPUs hyper-threaded to 24 Linux cores and 48 GB RAM. According to `top`, the Tile38 server only uses 100% to 200% CPU, i.e. at most 2 cores, while the tester is running. Something seems to be blocking the threads?
Thanks for the updated tester and the additional hardware details. I haven't had a chance to test it on my side yet, but I plan to as soon as possible. I'll keep you posted. Thanks.