Rewrite Data.IntMap to be faster and use less memory
I still sort of think of this as a work in progress, but I've been sitting on this code for over 2 years now, and all the tests pass, with significant improvements on many benchmarks. The general approach is detailed at gereeter/bounded-intmap - the README is a bit outdated and I really should incorporate it into comments, but it is the best reference for how the structure works. As for compatibility, I think I'm missing some instances, some Safe Haskell stuff, and anything required to get things working on compilers that aren't the latest GHC.
If this is a large enough change, I wouldn't mind sending mail to the libraries mailing list, but it should be entirely backwards compatible.
cc @ekmett - This is a (heavily modified) version of an idea you came up with.
Memory usage
The old implementation had 6n-5 words of overhead (i.e. discounting storage of the keys and the pointers to values). The new implementation has only 3n-1 words of overhead.
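To spell out the arithmetic (this accounting is my reading of the two layouts, assuming the usual one-header-word GHC heap layout, not text from the patch): the old structure has n-1 `Bin` nodes, each costing a header word, a prefix, a mask, and two child pointers (5 words), plus n `Tip` nodes costing a header word each, for 5(n-1) + n = 6n-5 words. The new structure has n-1 `Bin` nodes costing only a header word and two child pointers (3 words), plus a 2-word wrapper for a non-empty map, for 3(n-1) + 2 = 3n-1 words.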
Large runtime regressions
- ~~`fromAscList` and `fromDistinctAscList` are currently just aliases for `fromList`~~
- I have no clue about `foldlWithKey'`
- `keysSet`, `fromSet`, `restrictKeys`, and `withoutKeys` are currently implemented naively and are probably much slower
Benchmarks
| Benchmark | Runtime change | Original runtime (s) |
| --- | --- | --- |
| lookup | -19.25% | 4.38e-04 |
| insert | -22.57% | 9.17e-04 |
| "insertWith empty" | -10.20% | 8.85e-04 |
| "insertWith update" | -7.32% | 1.96e-03 |
| "insertWith' empty" | -9.92% | 8.84e-04 |
| "insertWith' update" | -11.00% | 1.67e-03 |
| "insertWithKey empty" | -10.23% | 8.76e-04 |
| "insertWithKey update" | +6.64% | 1.96e-03 |
| "insertWithKey' empty" | -10.47% | 8.84e-04 |
| "insertWithKey' update" | -11.29% | 1.68e-03 |
| "insertLookupWithKey empty" | -46.46% | 1.74e-03 |
| "insertLookupWithKey update" | -41.50% | 3.94e-03 |
| map | -17.13% | 1.81e-04 |
| mapWithKey | -24.97% | 2.19e-04 |
| foldlWithKey | -19.19% | 1.50e-03 |
| foldlWithKey' | +112.26% | 4.29e-05 |
| foldrWithKey | -79.82% | 1.02e-07 |
| delete | +9.36% | 3.42e-04 |
| update | -12.59% | 1.50e-03 |
| updateLookupWithKey | +1.39% | 2.44e-03 |
| alter | +5.04% | 1.53e-03 |
| mapMaybe | -15.58% | 2.25e-04 |
| mapMaybeWithKey | -15.88% | 2.24e-04 |
| fromList | -15.42% | 8.96e-04 |
| fromAscList | -81.50% | 6.83e-04 |
| fromDistinctAscList | -43.34% | 2.24e-04 |
Merges
| Benchmark | Runtime change | Original runtime (s) |
| --- | --- | --- |
| union-disj_nn | +8.53% | 2.08e-07 |
| union-disj_nn | +9.74% | 2.09e-07 |
| union-disj_ns | +11.66% | 1.85e-07 |
| union-disj_ns | +7.15% | 1.85e-07 |
| union-disj_sn | +7.00% | 1.94e-07 |
| union-disj_sn | +5.67% | 1.94e-07 |
| union-disj_nt | +7.22% | 1.55e-07 |
| union-disj_nt | +5.51% | 1.55e-07 |
| union-disj_tn | +5.88% | 1.61e-07 |
| union-disj_tn | +5.91% | 1.61e-07 |
| union-common_nn | -15.58% | 1.12e-02 |
| union-common_nn | -15.30% | 1.10e-02 |
| union-common_ns | -11.65% | 4.69e-03 |
| union-common_ns | -11.57% | 4.67e-03 |
| union-common_sn | -10.49% | 4.67e-03 |
| union-common_sn | -10.23% | 4.67e-03 |
| union-common_nt | -26.05% | 9.11e-05 |
| union-common_nt | -28.51% | 9.43e-05 |
| union-common_tn | -21.01% | 8.89e-05 |
| union-common_tn | -19.67% | 8.80e-05 |
| union-mix_nn | +12.20% | 2.06e-02 |
| union-mix_nn | +11.70% | 2.06e-02 |
| union-mix_ns | +5.05% | 5.10e-03 |
| union-mix_ns | +5.84% | 5.10e-03 |
| union-mix_sn | +6.15% | 5.11e-03 |
| union-mix_sn | +7.16% | 5.08e-03 |
| union-mix_nt | -4.19% | 8.05e-05 |
| union-mix_nt | -4.29% | 8.03e-05 |
| union-mix_tn | -6.17% | 8.58e-05 |
| union-mix_tn | -6.81% | 8.61e-05 |
| union-block_nn | +38.23% | 1.03e-04 |
| union-block_nn | +39.21% | 1.03e-04 |
| union-block_ns | +30.53% | 6.79e-06 |
| union-block_ns | +31.57% | 6.78e-06 |
| union-block_sn | +29.41% | 6.92e-06 |
| union-block_sn | +30.39% | 6.88e-06 |
| difference-disj_nn | -78.96% | 1.72e-07 |
| difference-disj_nn | -78.95% | 1.72e-07 |
| difference-disj_ns | -73.77% | 1.60e-07 |
| difference-disj_ns | -77.32% | 1.60e-07 |
| difference-disj_sn | -82.64% | 1.91e-07 |
| difference-disj_sn | -82.02% | 1.91e-07 |
| difference-disj_nt | -74.05% | 1.40e-07 |
| difference-disj_nt | -74.09% | 1.40e-07 |
| difference-disj_tn | -77.35% | 1.47e-07 |
| difference-disj_tn | -77.28% | 1.47e-07 |
| difference-common_nn | +4.10% | 7.50e-03 |
| difference-common_nn | +4.25% | 7.50e-03 |
| difference-common_ns | -0.97% | 4.27e-03 |
| difference-common_ns | -1.31% | 4.28e-03 |
| difference-common_sn | -30.13% | 8.64e-04 |
| difference-common_sn | -34.10% | 8.55e-04 |
| difference-common_nt | -20.80% | 1.00e-04 |
| difference-common_nt | -18.82% | 1.01e-04 |
| difference-common_tn | -37.75% | 5.91e-05 |
| difference-common_tn | -32.56% | 5.92e-05 |
| difference-mix_nn | -18.42% | 1.78e-02 |
| difference-mix_nn | -18.95% | 1.78e-02 |
| difference-mix_ns | -21.13% | 4.36e-03 |
| difference-mix_ns | -21.18% | 4.35e-03 |
| difference-mix_sn | +19.57% | 1.49e-03 |
| difference-mix_sn | +19.49% | 1.49e-03 |
| difference-mix_nt | -29.75% | 8.57e-05 |
| difference-mix_nt | -29.13% | 8.52e-05 |
| difference-mix_tn | -27.97% | 5.37e-05 |
| difference-mix_tn | -29.38% | 5.45e-05 |
| difference-block_nn | -69.85% | 7.37e-05 |
| difference-block_nn | -70.66% | 7.57e-05 |
| difference-block_ns | -64.67% | 6.55e-06 |
| difference-block_ns | -64.53% | 6.54e-06 |
| difference-block_sn | -40.12% | 5.43e-06 |
| difference-block_sn | -40.19% | 5.44e-06 |
| intersection-disj_nn | -70.09% | 1.24e-07 |
| intersection-disj_nn | -69.98% | 1.24e-07 |
| intersection-disj_ns | -67.57% | 1.12e-07 |
| intersection-disj_ns | -66.18% | 1.12e-07 |
| intersection-disj_sn | -66.65% | 1.12e-07 |
| intersection-disj_sn | -66.92% | 1.12e-07 |
| intersection-disj_nt | -60.91% | 9.38e-08 |
| intersection-disj_nt | -61.18% | 9.38e-08 |
| intersection-disj_tn | -60.17% | 9.28e-08 |
| intersection-disj_tn | -60.32% | 9.27e-08 |
| intersection-common_nn | +6.06% | 7.08e-03 |
| intersection-common_nn | +5.82% | 7.07e-03 |
| intersection-common_ns | -25.00% | 1.04e-03 |
| intersection-common_ns | -25.84% | 1.05e-03 |
| intersection-common_sn | -20.87% | 1.07e-03 |
| intersection-common_sn | -15.66% | 1.05e-03 |
| intersection-common_nt | -31.67% | 5.48e-05 |
| intersection-common_nt | -31.14% | 5.45e-05 |
| intersection-common_tn | -28.09% | 5.97e-05 |
| intersection-common_tn | -28.55% | 5.98e-05 |
| intersection-mix_nn | +0.71% | 3.01e-03 |
| intersection-mix_nn | -6.14% | 3.14e-03 |
| intersection-mix_ns | -29.39% | 6.40e-04 |
| intersection-mix_ns | -30.65% | 6.39e-04 |
| intersection-mix_sn | -23.04% | 6.64e-04 |
| intersection-mix_sn | -22.10% | 6.58e-04 |
| intersection-mix_nt | -17.51% | 3.90e-05 |
| intersection-mix_nt | -19.35% | 3.91e-05 |
| intersection-mix_tn | -21.05% | 4.63e-05 |
| intersection-mix_tn | -17.28% | 4.59e-05 |
| intersection-block_nn | -62.01% | 4.88e-05 |
| intersection-block_nn | -62.37% | 4.88e-05 |
| intersection-block_ns | -45.10% | 3.52e-06 |
| intersection-block_ns | -45.39% | 3.54e-06 |
| intersection-block_sn | -27.86% | 3.43e-06 |
| intersection-block_sn | -27.83% | 3.43e-06 |
This also partially fixes #327 - it doesn't have all the tactics and instances or mergeA yet though.
This will be primarily up to @wrengr to decide. Are there operations or contexts where this new implementation is known or believed to be inherently worse?
> Are there operations or contexts where this new implementation is known or believed to be inherently worse?
No. Everything has the same asymptotics (except findMin and findMax, which are faster), and the tree structure is identical to that of a PATRICIA tree. Strictly fewer pointers need to be dereferenced in order to reach any given key.
Also, I think that most of the existing regressions can be fixed, especially the large ones - I haven't done the work to investigate foldlWithKey', and the rest basically just want specialized implementations.
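To illustrate the findMin point: the minimum lives in the outermost constructor, so no descent into the tree is needed. A minimal sketch (the phantom direction parameter on Node is dropped here, so this is illustrative rather than the actual patch):

```haskell
type Key = Int

-- Simplified shapes: the non-empty wrapper carries the minimum key and
-- its value directly, so findMin never has to walk to a leaf.
data Node a = Bin !Key a !(Node a) !(Node a) | Tip
data IntMap a = Empty | NonEmpty !Key a !(Node a)

findMin :: IntMap a -> (Key, a)
findMin Empty            = error "findMin: empty map"
findMin (NonEmpty k v _) = (k, v)
```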
Can you get this to build with older GHC by conditionally using your own copy of Data.Functor.Identity? How important is ScopedTypeVariables here? We use it elsewhere, conditionally, but unless and until it's standardized I'd prefer to avoid it. That said, if you use it to revolutionize map performance, I'm personally willing to go there.
I removed the requirements for EmptyDataDecls and ScopedTypeVariables, and I fixed a bunch of compatibility problems, including the use of Data.Functor.Identity.
The bounded-intmap README indicates regressions in intersection and difference. I don't see those in the benchmarks above. Were you able to work around the problem? If not, can you give us a sense of how bad it is? There are some set-operations benchmarks in a subdirectory of containers benchmarks that may help.
So, I ran the set-operations-intmap benchmark, and the results are quite interesting. Since this benchmark is far more detailed and thorough than what I threw together in the bounded-intmap repository, the conclusions look very different:
| Benchmark | Runtime change | Original runtime (s) |
| --- | --- | --- |
| union-disj_nn | +20.14% | 2.08e-07 |
| union-disj_nn | +20.32% | 2.09e-07 |
| union-disj_ns | +19.76% | 1.85e-07 |
| union-disj_ns | +19.57% | 1.85e-07 |
| union-disj_sn | +13.50% | 1.94e-07 |
| union-disj_sn | +13.65% | 1.94e-07 |
| union-disj_nt | +16.34% | 1.55e-07 |
| union-disj_nt | +16.23% | 1.55e-07 |
| union-disj_tn | +10.41% | 1.61e-07 |
| union-disj_tn | +10.25% | 1.61e-07 |
| union-common_nn | -13.83% | 1.12e-02 |
| union-common_nn | -13.22% | 1.10e-02 |
| union-common_ns | -10.26% | 4.69e-03 |
| union-common_ns | -10.07% | 4.67e-03 |
| union-common_sn | -11.13% | 4.67e-03 |
| union-common_sn | -11.21% | 4.67e-03 |
| union-common_nt | -21.66% | 9.11e-05 |
| union-common_nt | -24.34% | 9.43e-05 |
| union-common_tn | -22.92% | 8.89e-05 |
| union-common_tn | -22.40% | 8.80e-05 |
| union-mix_nn | +13.34% | 2.06e-02 |
| union-mix_nn | +13.49% | 2.06e-02 |
| union-mix_ns | +6.80% | 5.10e-03 |
| union-mix_ns | +6.80% | 5.10e-03 |
| union-mix_sn | +5.54% | 5.11e-03 |
| union-mix_sn | +6.38% | 5.08e-03 |
| union-mix_nt | +1.41% | 8.05e-05 |
| union-mix_nt | +1.75% | 8.03e-05 |
| union-mix_tn | -7.61% | 8.58e-05 |
| union-mix_tn | -8.24% | 8.61e-05 |
| union-block_nn | +44.40% | 1.03e-04 |
| union-block_nn | +44.11% | 1.03e-04 |
| union-block_ns | +40.25% | 6.79e-06 |
| union-block_ns | +40.85% | 6.78e-06 |
| union-block_sn | +32.62% | 6.92e-06 |
| union-block_sn | +33.01% | 6.88e-06 |
| difference-disj_nn | -78.23% | 1.72e-07 |
| difference-disj_nn | -78.25% | 1.72e-07 |
| difference-disj_ns | -76.55% | 1.60e-07 |
| difference-disj_ns | -76.54% | 1.60e-07 |
| difference-disj_sn | -81.40% | 1.91e-07 |
| difference-disj_sn | -81.47% | 1.91e-07 |
| difference-disj_nt | -73.18% | 1.40e-07 |
| difference-disj_nt | -73.24% | 1.40e-07 |
| difference-disj_tn | -75.89% | 1.47e-07 |
| difference-disj_tn | -75.83% | 1.47e-07 |
| difference-common_nn | +7.85% | 7.50e-03 |
| difference-common_nn | +7.86% | 7.50e-03 |
| difference-common_ns | +1.65% | 4.27e-03 |
| difference-common_ns | +1.73% | 4.28e-03 |
| difference-common_sn | -27.05% | 8.64e-04 |
| difference-common_sn | -27.24% | 8.55e-04 |
| difference-common_nt | -20.93% | 1.00e-04 |
| difference-common_nt | -21.43% | 1.01e-04 |
| difference-common_tn | -37.30% | 5.91e-05 |
| difference-common_tn | -37.35% | 5.92e-05 |
| difference-mix_nn | -16.25% | 1.78e-02 |
| difference-mix_nn | -16.50% | 1.78e-02 |
| difference-mix_ns | -19.62% | 4.36e-03 |
| difference-mix_ns | -19.37% | 4.35e-03 |
| difference-mix_sn | +24.33% | 1.49e-03 |
| difference-mix_sn | +24.91% | 1.49e-03 |
| difference-mix_nt | -28.88% | 8.57e-05 |
| difference-mix_nt | -28.41% | 8.52e-05 |
| difference-mix_tn | -22.30% | 5.37e-05 |
| difference-mix_tn | -23.62% | 5.45e-05 |
| difference-block_nn | -64.96% | 7.37e-05 |
| difference-block_nn | -65.44% | 7.57e-05 |
| difference-block_ns | -52.32% | 6.55e-06 |
| difference-block_ns | -53.00% | 6.54e-06 |
| difference-block_sn | -32.62% | 5.43e-06 |
| difference-block_sn | -32.50% | 5.44e-06 |
| intersection-disj_nn | -73.10% | 1.24e-07 |
| intersection-disj_nn | -73.09% | 1.24e-07 |
| intersection-disj_ns | -70.37% | 1.12e-07 |
| intersection-disj_ns | -70.24% | 1.12e-07 |
| intersection-disj_sn | -70.05% | 1.12e-07 |
| intersection-disj_sn | -70.02% | 1.12e-07 |
| intersection-disj_nt | -64.38% | 9.38e-08 |
| intersection-disj_nt | -64.49% | 9.38e-08 |
| intersection-disj_tn | -63.88% | 9.28e-08 |
| intersection-disj_tn | -63.85% | 9.27e-08 |
| intersection-common_nn | +15.38% | 7.08e-03 |
| intersection-common_nn | +15.37% | 7.07e-03 |
| intersection-common_ns | +73.79% | 1.04e-03 |
| intersection-common_ns | +71.16% | 1.05e-03 |
| intersection-common_sn | +72.89% | 1.07e-03 |
| intersection-common_sn | +76.42% | 1.05e-03 |
| intersection-common_nt | -14.21% | 5.48e-05 |
| intersection-common_nt | -13.98% | 5.45e-05 |
| intersection-common_tn | -25.13% | 5.97e-05 |
| intersection-common_tn | -25.16% | 5.98e-05 |
| intersection-mix_nn | -4.39% | 3.01e-03 |
| intersection-mix_nn | -11.54% | 3.14e-03 |
| intersection-mix_ns | -31.01% | 6.40e-04 |
| intersection-mix_ns | -30.98% | 6.39e-04 |
| intersection-mix_sn | -20.84% | 6.64e-04 |
| intersection-mix_sn | -24.33% | 6.58e-04 |
| intersection-mix_nt | -6.07% | 3.90e-05 |
| intersection-mix_nt | -4.52% | 3.91e-05 |
| intersection-mix_tn | -26.01% | 4.63e-05 |
| intersection-mix_tn | -24.39% | 4.59e-05 |
| intersection-block_nn | -65.50% | 4.88e-05 |
| intersection-block_nn | -65.43% | 4.88e-05 |
| intersection-block_ns | -41.59% | 3.52e-06 |
| intersection-block_ns | -41.95% | 3.54e-06 |
| intersection-block_sn | -24.71% | 3.43e-06 |
| intersection-block_sn | -24.46% | 3.43e-06 |
In particular, union-disj, union-mix, union-block, and intersection-common are the major regressions. I definitely need to look more at intersection and see what I can do to solve the 75% spikes.
So it seems that the worst of the intersection regression was another misguided constant argument capture - I really need to just go through the codebase for every instance of that and benchmark each one. New intersection benchmarks:
| Benchmark | Runtime change | Original runtime (s) |
| --- | --- | --- |
| intersection-disj_nn | -74.95% | 1.24e-07 |
| intersection-disj_nn | -74.86% | 1.24e-07 |
| intersection-disj_ns | -72.28% | 1.12e-07 |
| intersection-disj_ns | -72.15% | 1.12e-07 |
| intersection-disj_sn | -70.07% | 1.12e-07 |
| intersection-disj_sn | -70.00% | 1.12e-07 |
| intersection-disj_nt | -66.86% | 9.38e-08 |
| intersection-disj_nt | -66.80% | 9.38e-08 |
| intersection-disj_tn | -63.92% | 9.28e-08 |
| intersection-disj_tn | -63.90% | 9.27e-08 |
| intersection-common_nn | +10.31% | 7.08e-03 |
| intersection-common_nn | +9.39% | 7.07e-03 |
| intersection-common_ns | +42.33% | 1.04e-03 |
| intersection-common_ns | +40.95% | 1.05e-03 |
| intersection-common_sn | +42.86% | 1.07e-03 |
| intersection-common_sn | +45.64% | 1.05e-03 |
| intersection-common_nt | -20.33% | 5.48e-05 |
| intersection-common_nt | -20.35% | 5.45e-05 |
| intersection-common_tn | -28.17% | 5.97e-05 |
| intersection-common_tn | -29.02% | 5.98e-05 |
| intersection-mix_nn | -16.26% | 3.01e-03 |
| intersection-mix_nn | -20.03% | 3.14e-03 |
| intersection-mix_ns | -40.50% | 6.40e-04 |
| intersection-mix_ns | -39.25% | 6.39e-04 |
| intersection-mix_sn | -34.05% | 6.64e-04 |
| intersection-mix_sn | -32.00% | 6.58e-04 |
| intersection-mix_nt | -16.56% | 3.90e-05 |
| intersection-mix_nt | -16.80% | 3.91e-05 |
| intersection-mix_tn | -29.41% | 4.63e-05 |
| intersection-mix_tn | -29.14% | 4.59e-05 |
| intersection-block_nn | -69.29% | 4.88e-05 |
| intersection-block_nn | -69.22% | 4.88e-05 |
| intersection-block_ns | -51.56% | 3.52e-06 |
| intersection-block_ns | -51.88% | 3.54e-06 |
| intersection-block_sn | -39.75% | 3.43e-06 |
| intersection-block_sn | -39.76% | 3.43e-06 |
Now the worst regression is 45%.
If the worst regressions remain that bad, it may make sense to add your version of IntMap in a separate module, rather than replacing the current one. I believe both Wren and I are generally in favor of adding more data structures to the package if there's a good case for them.
I just pushed a hacky change to intersection that almost completely removes the regression. I'm not happy with it, since it makes every other intersection benchmark (those besides intersection-common) slower, but it may be a starting point for a better solution.
The vast number of changes to Data.IntMap.Internal makes it a bit hard to review the changes on GitHub. Do you know if GitHub would be happier if you were to temporarily restore the original (unused) Data.IntMap.Internal and create a brand-new Data.IntMap.Internal2? I don't know nearly enough about git or GitHub to say, but it might be worth a shot.
I can copy some files around if you wish, but since this really is a rewrite, not an edit, I'd just recommend looking at the new code not in diff format.
(Additionally, I don't think a separate file would work, since GitHub limits diffs to 1500 lines.)
I'm not sure if there's a way to make it happy. The annoying thing is that I can't seem to make inline comments on that most-important file.
> to use the current constructor names
I actually chose to use Bin and Tip as opposed to Bin and Nil to more accurately correspond to the existing code - for the same set of keys, my map will have a Bin exactly where the old code would have a Bin, and my map will have a Tip exactly where the old code would have a Tip. Empty is the closest analogue to Nil.
> There is a non-trivial cost to these empty `Tip`s. In particular, every leaf of the tree ends up looking like (to use the current constructor names) `Bin k a Nil Nil`, which is two words larger than `Tip k a`. I'm not saying you're wrong, but it may be something to consider. One alternative might be `data Node t a = Bin !Key a !(Node L a) !(Node R a) | LOnly !Key a !(Node L a) | ROnly !(Node R a)`.
This could work, and certainly would save more memory (only 2n-1 words of overhead). However,
- Without pattern synonyms, it would make the code much larger and much uglier. Implementing the already unwieldy merge functions for the more complicated structure sounds like a nightmare.
- I suspect that the overhead of branching based on all 4 possible node types would cause serious regressions.
- I can't see any way that the more complicated structure could improve performance to compensate, except possibly better cache behavior due to being smaller. It doesn't save any pointer-chasing, since `Tip`s are already empty and therefore don't have to be allocated specially.
I'd be willing to try the more compact variation if you think the space savings are important enough, but I'd prefer to do that as a separate PR if at all.
> Apropos of nothing, I do not like your definitions of `R` and `L`. Would it hurt anyone to use `data L = L` and `data R = R`?
`newtype L = L L` has a slight safety advantage in that it is impossible to construct, so it can't accidentally be used as anything but a marker. It doesn't really matter though.
> I actually chose to use Bin and Tip as opposed to Bin and Nil to more accurately correspond to the existing code - for the same set of keys, my map will have a Bin exactly where the old code would have a Bin, and my map will have a Tip exactly where the old code would have a Tip. Empty is the closest analogue to Nil.
I'll have to read more deeply into this, because that doesn't make sense to me yet. The current code stores a key and a value in its Tip; how can that correspond to yours?
> This could work, and certainly would save more memory (only 2n-1 words of overhead). However,
>
> Without pattern synonyms, it would make the code much larger and much uglier. Implementing the already unwieldy merge functions for the more complicated structure sounds like a nightmare.
I fear you may be right about this. How hard would it be for you to use pattern synonyms temporarily, just to see how performance compares? I think these should all be "simple bidirectional" synonyms, which are generally rather straightforward.
> I suspect that the overhead of branching based on all 4 possible node types would cause serious regressions. I can't see any way that the more complicated structure could improve performance to compensate, except possibly better cache behavior due to being smaller. It doesn't save any pointer-chasing, since Tips are already empty and therefore don't have to be allocated specially.
Being smaller is tremendously helpful all by itself. But wait! There's more! A pointer to an already-evaluated node will generally be tagged (in low bits) to indicate which constructor that node was formed from. I don't know all the details of how this helps, but it helps quite tremendously.
> Apropos of nothing, I do not like your definitions of R and L. Would it hurt anyone to use `data L = L` and `data R = R`?
>
> `newtype L = L L` has a slight safety advantage in that it is impossible to construct, so it can't accidentally be used as anything but a marker. It doesn't really matter though.
How about something like this?
```haskell
#if __GLASGOW_HASKELL__ >= whatever
data Hand = L' | R'
type L = 'L'
type R = 'R'
#else
data L = L
data R = R
#endif
```
Actually, I guess the pattern synonyms may not be quite as simple as I thought. But they shouldn't be terrible, I don't think. If the benchmarks using them turn out fantastic, then we'll have proof positive that your structure can beat the current IntMap.
Another place to look for potential performance improvements is at the top of the tree. The way you do it now is good if GHC inlines the outermost constructor away. If not, you have an extra indirection. I haven't thought of a non-horrible way around this yet.
On Sep 14, 2016 6:15 PM, "Jonathan S" [email protected] wrote:
to use the current constructor names
I actually chose to use Bin and Tip as opposed to Bin and Nil to more accurately correspond to the existing code - for the same set of keys, my map will have a Bin exactly where the old code would have a Bin, and my map will have a Tip exactly where the old code would have a Tip. Empty is the closest analogue to Nil.
There is a non-trivial cost to these empty Tips. In particular, every leaf of the tree ends up looking like (to use the current constructor names) Bin k a Nil Nil, which is two words larger than Tip k a. I'm not saying you're wrong, but it may be something to consider. One alternative might be data Node t a = Bin !Key a !(Node L a) !(Node R a) | LOnly !Key a !(Node L a) | ROnly !(Node R a).
This could work, and certainly would save more memory (only 2n-1 words of overhead). However,
- Without pattern synonyms, it would make the code much larger and much uglier. Implementing the already unweildy merge functions for the more complicated structure sounds like a nightmare.
- I suspect that the overhead of branching based on all 4 possible node types would cause serious regressions.
- I can't see any way that the more complicated structure could improve performance to compensate, except possibly better cache behavior due to being smaller. It doesn't save any pointer-chasing, since Tips are already empty and therefore don't have to be allocated specially.
I'd be willing to try the more compact variation if you think the space savings are important enough, but I'd prefer to do that as a separate PR if at all.
Apropos of nothing, I do not like your definitions of R and L. Would it hurt anyone to use data L = L and data R = R?
newtype L = L L has a slight safety advantage in that it is impossible to construct, so it can't accidentally be used as anything but a marker. It doesn't really matter though.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/haskell/containers/pull/340#issuecomment-247173662, or mute the thread https://github.com/notifications/unsubscribe-auth/ABzi_d9zAnYrLQ7GIaGgnfCr8i6e3UaPks5qqHIQgaJpZM4J7RJU .
Argh... I started looking into this some more, and I don't think pattern synonyms can help. However, I noticed that `lookup`, `member`, `notMember`, and `findWithDefault` can all be implemented (probably) efficiently using a single function `lookupGen :: Key -> r -> (a -> r) -> IntMap a -> r`.
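Roughly, I'm imagining something like the following sketch, where a toy binary search tree stands in for the real IntMap internals - only the miss/hit continuation shape matters here:

```haskell
{-# LANGUAGE BangPatterns #-}

type Key = Int

-- Stand-in structure; the real lookupGen would walk the actual IntMap.
data Toy a = ToyBin !Key a !(Toy a) !(Toy a) | ToyTip

-- One traversal, parameterized by what to produce on a miss or a hit.
{-# INLINE lookupGen #-}
lookupGen :: Key -> r -> (a -> r) -> Toy a -> r
lookupGen !k miss hit = go
  where
    go ToyTip = miss
    go (ToyBin k' v l r)
      | k < k'    = go l
      | k > k'    = go r
      | otherwise = hit v

-- The four lookup-like functions collapse to one-liners:
lookup' :: Key -> Toy a -> Maybe a
lookup' k = lookupGen k Nothing Just

member, notMember :: Key -> Toy a -> Bool
member    k = lookupGen k False (const True)
notMember k = lookupGen k True  (const False)

findWithDefault :: a -> Key -> Toy a -> a
findWithDefault def k = lookupGen k def id
```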
Another question: is there an efficiency benefit to writing `size` by hand, or should we just use `getSum . foldMap (const (Sum 1))`?
Don't worry about using throwaway types like DeleteResult. I've sprayed similar things all over Data.Map and Data.Sequence. Such things seem to help a lot with strictness analysis, unboxing, etc. I wish we didn't need to use them, but don't think twice to do so and don't waste your time trying to avoid it.
By the way: I should have already said that I greatly appreciate the work you've done on this tremendous rewrite already. Aside from the fact that it speeds up a lot of critical operations, I'm seeing now that its code is much easier to understand than that of the current IntMap. I'm licking my chops at the thought of deleting great gobs of impenetrable bit fiddling.
> Could you give me a sense of why some fold variants might be challenging?
No. I really have no clue why certain folds are slower with this implementation. The only thing I can think of is the fact that we need to use 3x the stack of the previous implementation, since we need to store a node, a key, and a value instead of just a node at certain recursive calls.
> If the worst regressions remain that bad, it may make sense to add your version of IntMap in a separate module, rather than replacing the current one.
Even if I fail to reduce the union regressions (now at worst 38%), I don't think this makes sense - the criteria for when to use which variant would be really odd, especially since the regression is so localized. For the old IntMap to be favorable (currently), you need to have a very union-heavy workload which generally unions maps that are disjoint but have coarsely interleaved keys. This seems very odd and certainly niche to me.
> I'll have to read more deeply into this, because that doesn't make sense to me yet. The current code stores a key and a value in its Tip; how can that correspond to yours?
Although this structure stores the keys and values in different locations, the spine of the tree is the same. If the existing implementation looks like
```
Bin _ _ (Bin _ _ (Tip _ _) (Bin _ _ (Tip _ _) (Tip _ _))) (Tip _ _)
```

then my implementation will look like

```
NonEmpty _ _ (Bin _ _ (Bin _ _ Tip (Bin _ _ Tip Tip)) Tip)
```
See also the bounded-intmap README section about how this data structure was derived.
> Being smaller is tremendously helpful all by itself.
This is true. However,
- Since this PR already slims down the structure, we're starting to hit diminishing returns.
- In the context of random accesses, what matters is having top levels of the tree in cache, since those help every access. However, the number of nodes in the top levels grows exponentially with depth, so a small constant factor decrease in size would probably only save 1 cache miss if it saves cache misses at all.
I'm not saying that space reduction is useless or unfavorable. I just think that it won't produce enough time benefit to outweigh the additional complexity.
In general, I think this may be worth it, solely because it saves memory even if it slows things down. I just want to put that discussion off until later, after the basic implementation is in.
> But wait! There's more! A pointer to an already-evaluated node will generally be tagged (in low bits) to indicate which constructor that node was formed from. I don't know all the details of how this helps, but it helps quite tremendously.
I don't think this would actually help, though. The pointer tags already allow you to distinguish between Bin and Tip - the only new information would be how many Tips a Bin has as children, which would save a dereference except for the fact that you need to read the data stored in the Bin anyway. In fact, using only two constructors (just Bin and Tip) actually helps, as it makes it possible to check whether one child is a Tip independent of the other child, which is useful when you only actually care about a single child (as is the case in any point query).
> I fear you may be right about this. How hard would it be for you to use pattern synonyms temporarily, just to see how performance compares? I think these should all be "simple bidirectional" synonyms, which are generally rather straightforward.
>
> Actually, I guess the pattern synonyms may not be quite as simple as I thought. But they shouldn't be terrible, I don't think. If the benchmarks using them turn out fantastic, then we'll have proof positive that your structure can beat the current IntMap.
>
> Argh... I started looking into this some more, and I don't think pattern synonyms can help.
Yeah, upon further thought, I think it would be a significant, ugly project even with pattern synonyms.
> How about something like this?
>
> ```haskell
> #if __GLASGOW_HASKELL__ >= whatever
> data Hand = L' | R'
> type L = 'L'
> type R = 'R'
> #else
> data L = L
> data R = R
> #endif
> ```
I could... but that seems like a ridiculous amount of complexity for basically no gain. If you truly feel passionate about this, I can just use data L = L, and otherwise I can keep it like it is. The newtype solution is simple at only a single line, doesn't require any extensions and is thus trivially portable, and prevents misuse. Using conditional compilation to get a slightly prettier definition when you still have to keep around the older, uglier definition that acts nearly identically doesn't make sense to me.
> Another place to look for potential performance improvements is at the top of the tree. The way you do it now is good if GHC inlines the outermost constructor away. If not, you have an extra indirection. I haven't thought of a non-horrible way around this yet.
It's not quite an extra indirection, since the wrapper is not useless - NonEmpty stores the minimum key/value pair. However, you are right that the top level can cause some performance issues - the regression in intersection basically boiled down to the fact that GHC can't return sum types on the stack. This makes it very painful to return a sum type that is immediately going to be inspected and deconstructed, as that causes a lot of garbage. My super hacky solution for intersection was to create a product type isomorphic (roughly) to IntMap_ and convert to/from that on recursion boundaries. This allows GHC to return the value on the stack, fixing the regression, but it also causes a bunch of annoying conversions and stack traffic that slows everything else down.
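To give the flavor of that hack, here is a hedged sketch with an invented field layout (the actual type in the patch may differ):

```haskell
type Key = Int
data Node a = Bin !Key a !(Node a) !(Node a) | Tip

-- The natural result type is a sum, which GHC heap-allocates at every
-- recursive return:
data IntMap_ a = Empty | NonEmpty !Key a !(Node a)

-- A roughly isomorphic product: the Bool plays the role of the
-- Empty/NonEmpty tag, and the other fields are dummies when it is False.
-- A single-constructor result like this can be returned unboxed via
-- GHC's constructed-product-result analysis.
data IntMapP a = IntMapP !Bool !Key a !(Node a)

fromSum :: IntMap_ a -> IntMapP a
fromSum Empty            = IntMapP False 0 (error "IntMapP: no value") Tip
fromSum (NonEmpty k v n) = IntMapP True k v n

toSum :: IntMapP a -> IntMap_ a
toSum (IntMapP False _ _ _) = Empty
toSum (IntMapP True  k v n) = NonEmpty k v n
```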
I did have one idea here that I didn't try yet, which was to inline one level of Node into IntMap_, making
```haskell
data IntMap_ a = Empty | Singleton Key a | Branched Key a Key a (Node L a) (Node R a)
```
This would decrease indirection by one level and make things more symmetrical. I didn't really implement this because I believe that the top level is rarely the bottleneck - the only reason it showed up in intersection was because it was being returned and inspected in a loop. Most functions only inspect the top level once, making it largely irrelevant. Improvements would be nice of course, but they aren't critical.
> Don't worry about using throwaway types like `DeleteResult`. I've sprayed similar things all over `Data.Map` and `Data.Sequence`. Such things seem to help a lot with strictness analysis, unboxing, etc. I wish we didn't need to use them, but don't think twice to do so and don't waste your time trying to avoid it.
Yeah, I'm not too worried about that. The one type I do want to get rid of is SP, which is just another strict pair, a remnant from when I was writing this code out of tree.
> By the way: I should have already said that I greatly appreciate the work you've done on this tremendous rewrite already. Aside from the fact that it speeds up a lot of critical operations, I'm seeing now that its code is much easier to understand than that of the current IntMap. I'm licking my chops at the thought of deleting great gobs of impenetrable bit fiddling.
Thank you. One of the things I really like about this design is the fact that everything is just simple xors and comparisons, requiring no exotic instructions or bit twiddling. However, your opinion might change once you look at the merging functions, especially since I haven't commented them yet.
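Concretely, the navigation decision comes down to something like this sketch (my summary of the README's idea, with plain Int keys for brevity, whereas the real code has to treat keys as unsigned):

```haskell
import Data.Bits (xor)

-- A smaller xor means a longer shared bit prefix, i.e. "closer".
-- A key descends toward whichever bound it is closer to.
closerToMin :: Int -> Int -> Int -> Bool
closerToMin k minKey maxKey = (k `xor` minKey) < (k `xor` maxKey)
```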
> Could you give me a sense of why some fold variants might be challenging?
>
> No. I really have no clue why certain folds are slower with this implementation. The only thing I can think of is the fact that we need to use 3x the stack of the previous implementation, since we need to store a node, a key, and a value instead of just a node at certain recursive calls.
Hrmm... stack space is usually pretty cheap, so I'm a bit surprised. I'll try to see if I can think of anything. How much slower are they again?
> If the worst regressions remain that bad, it may make sense to add your version of IntMap in a separate module, rather than replacing the current one.
>
> Even if I fail to reduce the union regressions (now at worst 38%), I don't think this makes sense - the criteria for when to use which variant would be really odd, especially since the regression is so localized. For the old IntMap to be favorable (currently), you need to have a very union-heavy workload which generally unions maps that are disjoint but have coarsely interleaved keys. This seems very odd and certainly niche to me.
One thing that really helped out in Data.Set merge operations (and some Data.Map ones--there are others I haven't explored yet) is trying to reuse as much structure as possible. For the most part, I did this using a thin wrapper around reallyUnsafePtrEquality# (Utils.Containers.Internal.PtrEquality.ptrEq). Maps with coarsely interleaved keys could benefit quite a bit from this sort of thing.
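For a toy illustration of the trick (not the actual containers code):

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)

-- Thin wrapper, as in Utils.Containers.Internal.PtrEquality.
ptrEq :: a -> a -> Bool
ptrEq x y = isTrue# (reallyUnsafePtrEquality# x y)

data Tree a = Leaf | Node (Tree a) !Int a (Tree a)

-- If the recursive result is pointer-identical to the child we started
-- with (e.g. the key was absent), return the original node unchanged,
-- preserving sharing all the way up the spine.
adjust :: (a -> a) -> Int -> Tree a -> Tree a
adjust _ _ Leaf = Leaf
adjust f k t@(Node l k' v r)
  | k < k'    = let l' = adjust f k l in if ptrEq l' l then t else Node l' k' v r
  | k > k'    = let r' = adjust f k r in if ptrEq r' r then t else Node l k' v r'
  | otherwise = Node l k' (f v) r
```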
> I'll have to read more deeply into this, because that doesn't make sense to me yet. The current code stores a key and a value in its Tip; how can that correspond to yours?
>
> Although this structure stores the keys and values in different locations, the spine of the tree is the same. If the existing implementation looks like
>
> ```
> Bin _ _ (Bin _ _ (Tip _ _) (Bin _ _ (Tip _ _) (Tip _ _))) (Tip _ _)
> ```
>
> then my implementation will look like
>
> ```
> NonEmpty _ _ (Bin _ _ (Bin _ _ Tip (Bin _ _ Tip Tip)) Tip)
> ```
>
> See also the bounded-intmap README section https://github.com/gereeter/bounded-intmap#moving-the-values-upward about how this data structure was derived.
Yeah, I did read that. It largely makes sense. One minor concern: if someone uses IntMap in a heavily persistent way, will your "pull things up" approach reduce sharing? I suspect that's not the most common scenario, but we have to think about it.
> Being smaller is tremendously helpful all by itself.
>
> This is true. However,
>
> - Since this PR already slims down the structure, we're starting to hit diminishing returns.
> - In the context of random accesses, what matters is having top levels of the tree in cache, since those help every access. However, the number of nodes in the top levels grows exponentially with depth, so a small constant factor decrease in size would probably only save 1 cache miss if it saves cache misses at all.
>
> I'm not saying that space reduction is useless or unfavorable. I just think that it won't produce enough time benefit to outweigh the additional complexity.
>
> In general, I think this may be worth it, solely because it saves memory even if it slows things down. I just want to put that discussion off until later, after the basic implementation is in.
That's fine. When I thought it could be benchmarked quickly using a couple pattern synonyms, it looked like low-hanging fruit. Now that I realize I have no idea how to get there quickly, it looks like a better back-burner candidate.
> But wait! There's more! A pointer to an already-evaluated node will generally be tagged (in low bits) to indicate which constructor that node was formed from. I don't know all the details of how this helps, but it helps quite tremendously.
>
> I don't think this would actually help, though. The pointer tags already allow you to distinguish between Bin and Tip - the only new information would be how many Tips a Bin has as children, which would save a dereference except for the fact that you need to read the data stored in the Bin anyway. In fact, using only two constructors (just Bin and Tip) actually helps, as it makes it possible to check whether one child is a Tip independent of the other child, which is useful when you only actually care about a single child (as is the case in any point query).
Sounds plausible.
> How about something like this?
>
> ```haskell
> #if __GLASGOW_HASKELL__ >= whatever
> data Hand = L' | R'
> type L = 'L'
> type R = 'R'
> #else
> data L = L
> data R = R
> #endif
> ```
>
> I could... but that seems like a ridiculous amount of complexity for basically no gain. If you truly feel passionate about this, I can just use data L = L, and otherwise I can keep it like it is. The newtype solution is simple at only a single line, doesn't require any extensions and is thus trivially portable, and prevents misuse. Using conditional compilation to get a slightly prettier definition when you still have to keep around the older, uglier definition that acts nearly identically doesn't make sense to me.
I personally strongly prefer the nullary-constructor datatypes over the newtypes. When I see an uninhabited type, it always sends the message that something interesting is going on; in fact, less than nothing is going on because the type is only used as a phantom. You're probably right that conditional DataKinds is overkill.
> Another place to look for potential performance improvements is at the top of the tree. The way you do it now is good if GHC inlines the outermost constructor away. If not, you have an extra indirection. I haven't thought of a non-horrible way around this yet.
>
> It's not quite an extra indirection, since the wrapper is not useless - NonEmpty stores the minimum key/value pair. However, you are right that the top level can cause some performance issues - the regression in intersection basically boiled down to the fact that GHC can't return sum types on the stack. This makes it very painful to return a sum type that is immediately going to be inspected and deconstructed, as that causes a lot of garbage. My super hacky solution for intersection was to create a product type isomorphic (roughly) to IntMap_ and convert to/from that on recursion boundaries. This allows GHC to return the value on the stack, fixing the regression, but it also causes a bunch of annoying conversions and stack traffic that slows everything else down.
Blegh. If we can avoid nastiness, let's avoid it. If we can't, well, we can't.
> I did have one idea here that I didn't try yet, which was to inline one level of Node into IntMap_, making
>
> ```haskell
> data IntMap_ a = Empty | Singleton Key a | Branched Key a Key a (Node L a) (Node R a)
> ```
>
> This would decrease indirection by one level and make things more symmetrical. I didn't really implement this because I believe that the top level is rarely the bottleneck - the only reason it showed up in intersection was because it was being returned and inspected in a loop. Most functions only inspect the top level once, making it largely irrelevant. Improvements would be nice of course, but they aren't critical.
I was considering something like this. I suspect it will be helpful for situations where someone stuffs a data structure full of IntMaps. How much do you think it would blow up the code?
Apropos of not much: do you think there's some recursion-schemes-style trickery that could make it easier to work with your partially-decorated trees as though they were fully-decorated?
> Thank you. One of the things I really like about this design is the fact that everything is just simple xors and comparisons, requiring no exotic instructions or bit twiddling. However, your opinion might change once you look at the merging functions, especially since I haven't commented them yet.
I'll get there eventually.
How did you like my lookupGen idea, by the way? If it's INLINEd, I think it can probably produce good object code, and it cuts the source down a non-trivial (although non-large) amount.
> Another question: is there an efficiency benefit to writing `size` by hand, or should we just use `getSum . foldMap (const (Sum 1))`?
In terms of efficiency, I doubt there is any difference at the moment (I haven't actually benchmarked), though the current code counts Tips instead of key/value pairs.
I wrote it manually for a few reasons:
- Clarity. `getSum . foldMap (const (Sum 1))` isn't necessarily hard to read, but it involves a bunch of high-powered concepts and immediately raises a bunch of flags in my head: will everything get inlined correctly to produce a specialized function? Will that build up a large thunk instead of just evaluating things in one go, given that `foldMap` isn't automatically strict? Even if `foldMap` were strict, does `Sum` box the contained value and reintroduce laziness? Even if everything does get inlined and optimized properly, will this take several optimization passes to finish, thereby preventing further automatic fusion or improvement? These questions aren't necessarily hard to answer, but they just go away with a direct implementation.
- Consistent performance. Even if using `foldMap` optimizes now, changes to it could prevent that in the future. For example, binding a constant argument (such as the function passed to `foldMap`) can cause slowdowns, but as seen in #302, it is also necessary to allow the function to be properly inlined, since GHC won't inline recursive functions. A manual implementation has no such forward-compatibility concerns.
- No real reason to switch. Using a `foldMap`-based one-liner only saves 4 lines of code. I don't see the point.
> However, I noticed that `lookup`, `member`, `notMember`, and `findWithDefault` can all be implemented (probably) efficiently using a single function `lookupGen :: Key -> r -> (a -> r) -> IntMap a -> r`.
>
> How did you like my `lookupGen` idea, by the way? If it's `INLINE`d, I think it can probably produce good object code, and it cuts the source down a non-trivial (although non-large) amount.
This is somewhat related to the case of size, but in this case I think that lookupGen or some variation is a good idea, mostly because lookup and friends are fairly large. There are a few unanswered questions about the best way to do this, though: is there a way to make it return the individual go functions so that it can replace the redundant logic in, e.g., intersection? Is this desired? Should the constant key be bound in a closure? Is the decision about that dependent on which function is being implemented?
Basically, yes, I like your idea, but I may not implement it immediately.
I just pushed optimized versions of fromAscList and friends. However, in doing so, I realized that their strictness (in Data.IntMap.Strict) seems somewhat bogus:
```
Prelude Data.IntMap.Strict> fromList [(0, undefined), (0, 0)]
fromList *** Exception: Prelude.undefined
CallStack (from HasCallStack):
  error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
  undefined, called at <interactive>:2:15 in interactive:Ghci1
Prelude Data.IntMap.Strict> fromAscList [(0, undefined), (0, 0)]
fromList [(0,0)]
Prelude Data.IntMap.Strict> fromList [(0, 0), (0, undefined), (0, 0)]
fromList *** Exception: Prelude.undefined
CallStack (from HasCallStack):
  error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
  undefined, called at <interactive>:4:23 in interactive:Ghci2
Prelude Data.IntMap.Strict> fromAscList [(0, 0), (0, undefined), (0, 0)]
fromList *** Exception: Prelude.undefined
CallStack (from HasCallStack):
  error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
  undefined, called at <interactive>:5:26 in interactive:Ghci2
```
I believe I matched this behavior, but it possibly should be changed.
This indeed looks wrong. Please fix it and (at some point) mention the change and any others in the change log. It also looks like the Show instance is peculiarly lazy. I believe we want `show undefined = undefined`. Please feel free to fix that too.
I should have been more specific about how it should be changed. For fromAscList, the value that's actually installed should be forced. For fromList, that's not efficient, so we should force all the values in the list.
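Concretely, the two policies I have in mind look something like this sketch (Data.Map is used as a stand-in container just to keep the example runnable; the real change belongs inside the IntMap construction code):

```haskell
import qualified Data.Map as M

-- fromAscList-style: duplicate keys are adjacent, so keep only the last
-- occurrence of each key and force just the value actually installed.
fromAscListStrict :: [(Int, a)] -> M.Map Int a
fromAscListStrict kvs =
    M.fromDistinctAscList [ v `seq` (k, v) | (k, v) <- dedupLast kvs ]
  where
    dedupLast ((k1, _) : rest@((k2, _) : _)) | k1 == k2 = dedupLast rest
    dedupLast (kv : rest) = kv : dedupLast rest
    dedupLast []          = []

-- fromList-style: duplicates can be anywhere, so just force every value
-- in the input rather than tracking which occurrence wins.
fromListStrict :: [(Int, a)] -> M.Map Int a
fromListStrict kvs = M.fromList [ v `seq` (k, v) | (k, v) <- kvs ]
```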
I'm not a big fan of SketchyIntMap. It looks like a bad compromise and complicates things. I think we should strip it out and either eat the regression or (my preference) try something else.
I decided to try a little crowdsourcing: http://stackoverflow.com/questions/39520359/is-there-some-way-to-reduce-the-pain-of-range-tracking
I'd really love to find a way to make the range management for merges less horribly complicated.
I've had a few thoughts:
1. We could (at least temporarily, while trying to develop this thing) enable type families so we can use
```haskell
type family Exchange x where
  Exchange L = R
  Exchange R = L
```
This will allow us to tag the min and max keys and associated values to show how they fit with left and right children.
```haskell
newtype Bound t = Bound Key deriving (Eq, Ord, Num, Show, Real, Integral, Enum, Data.Bits.Bits)
newtype Carry t a = Carry a deriving (Eq, Ord, Show)
data IntMap_ t a = NonEmpty !(Bound t) (Carry t a) !(Node t a) | Empty deriving (Eq)
data Node t a = Bin !(Bound (Exchange t)) (Carry (Exchange t) a) !(Node L a) !(Node R a) | Tip deriving (Eq, Show)
```
I suspect this will make it considerably easier to avoid getting mixed up. I've certainly found it helpful in my initial experiments.
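For instance, here is a cut-down, self-contained illustration (with Carry omitted) of the kind of mix-up the tags rule out:

```haskell
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE EmptyDataDecls #-}

type Key = Int
data L
data R

type family Exchange x where
  Exchange L = R
  Exchange R = L

newtype Bound t = Bound Key

-- As above, minus Carry: a node tagged t stores the bound for the
-- opposite side.
data Node t a = Bin !(Bound (Exchange t)) a !(Node L a) !(Node R a) | Tip

-- A left subtree carries its max bound, and the types now say so;
-- giving this the type Node L a -> Maybe (Bound L) is a compile error.
maxBoundOf :: Node L a -> Maybe (Bound R)
maxBoundOf (Bin b _ _ _) = Just b
maxBoundOf Tip           = Nothing
```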
2. I'm beginning to suspect rather strongly that things are likely to get easier, as well as potentially faster, if we inline one layer of Node into IntMap_ as you describe above, which will also allow us to rename IntMap_ to IntMap. This gives us a very friendly view from the top.