
Performance degradation on Bitonic Sort output since release

Open VictorTaelin opened this issue 1 year ago • 7 comments

Reproducing the behavior

On release, the Bitonic Sort example:

def gen(d, x):
  switch d:
    case 0:
      return x
    case _:
      return (gen(d-1, x * 2 + 1), gen(d-1, x * 2))

def sum(d, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return sum(d-1, t.a) + sum(d-1, t.b)

def swap(s, a, b):
  switch s:
    case 0:
      return (a,b)
    case _:
      return (b,a)

def warp(d, s, a, b):
  switch d:
    case 0:
      return swap(s ^ (a > b), a, b)
    case _:
      (a.a,a.b) = a
      (b.a,b.b) = b
      (A.a,A.b) = warp(d-1, s, a.a, b.a)
      (B.a,B.b) = warp(d-1, s, a.b, b.b)
      return ((A.a,B.a),(A.b,B.b))

def flow(d, s, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return down(d, s, warp(d-1, s, t.a, t.b))

def down(d,s,t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return (flow(d-1, s, t.a), flow(d-1, s, t.b))

def sort(d, s, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return flow(d, s, (sort(d-1, 0, t.a), sort(d-1, 1, t.b)))

def main:
  return sum(20, sort(20, 0, gen(20, 0)))
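As a sanity check on what this example computes, here is an illustrative Python transcription (tuple trees stand in for Bend pairs; `sum` is renamed `tree_sum` to avoid shadowing the builtin; `leaves` is a hypothetical helper for inspection). This is a sketch of the algorithm's semantics only, not of how HVM evaluates it in parallel:

```python
# Direct Python transcription of the Bend bitonic-sort example above.
# Nested 2-tuples model Bend's pairs.

def gen(d, x):
    # Perfect binary tree of depth d; leaves are a permutation of 0..2^d - 1.
    if d == 0:
        return x
    return (gen(d - 1, x * 2 + 1), gen(d - 1, x * 2))

def tree_sum(d, t):
    # Sum of all leaves (Bend's `sum`, renamed to avoid the builtin).
    if d == 0:
        return t
    a, b = t
    return tree_sum(d - 1, a) + tree_sum(d - 1, b)

def swap(s, a, b):
    # s == 0 keeps the pair in order; anything else reverses it.
    return (a, b) if s == 0 else (b, a)

def warp(d, s, a, b):
    # Element-wise compare-and-swap between two trees of depth d.
    if d == 0:
        return swap(s ^ (1 if a > b else 0), a, b)
    (aa, ab), (ba, bb) = a, b
    Aa, Ab = warp(d - 1, s, aa, ba)
    Ba, Bb = warp(d - 1, s, ab, bb)
    return ((Aa, Ba), (Ab, Bb))

def flow(d, s, t):
    # Bitonic merge: compare-and-swap the halves, then recurse into each.
    if d == 0:
        return t
    a, b = t
    return down(d, s, warp(d - 1, s, a, b))

def down(d, s, t):
    if d == 0:
        return t
    a, b = t
    return (flow(d - 1, s, a), flow(d - 1, s, b))

def sort(d, s, t):
    # Sort one half ascending, the other descending, then bitonic-merge.
    if d == 0:
        return t
    a, b = t
    return flow(d, s, (sort(d - 1, 0, a), sort(d - 1, 1, b)))

def leaves(d, t):
    # Helper for inspecting results: left-to-right list of leaf values.
    if d == 0:
        return [t]
    a, b = t
    return leaves(d - 1, a) + leaves(d - 1, b)
```

With `d = 10`, `leaves(10, sort(10, 0, gen(10, 0)))` comes out as `list(range(1024))`; with `d = 20`, the sum the Bend program prints is `2^20 * (2^20 - 1) / 2 = 549755289600`.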

This generated the following HVM output:

@down = (?(((a (* a)) @down__C0) (b (c d))) (c (b d)))

@down__C0 = ({a e} ((c g) ({b f} (d h))))
  &! @flow ~ (a (b (c d)))
  &! @flow ~ (e (f (g h)))

@flow = (?(((a (* a)) @flow__C0) (b (c d))) (c (b d)))

@flow__C0 = ({$([+1] a) c} ((e f) ({b d} h)))
  & @down ~ (a (b (g h)))
  & @warp ~ (c (d (e (f g))))

@gen = (?(((a a) @gen__C0) b) b)

@gen__C0 = ({a d} ({$([*2] $([+1] b)) $([*2] e)} (c f)))
  &! @gen ~ (a (b c))
  &! @gen ~ (d (e f))

@main = a
  & @sum ~ (20 (@main__C1 a))

@main__C0 = a
  & @gen ~ (20 (0 a))

@main__C1 = a
  & @sort ~ (20 (0 (@main__C0 a)))

@sort = (?(((a (* a)) @sort__C0) (b (c d))) (c (b d)))

@sort__C0 = ({$([+1] a) {c f}} ((d g) (b i)))
  & @flow ~ (a (b ((e h) i)))
  &! @sort ~ (c (0 (d e)))
  &! @sort ~ (f (1 (g h)))

@sum = (?(((a a) @sum__C0) b) b)

@sum__C0 = ({a c} ((b d) f))
  &! @sum ~ (a (b $([+] $(e f))))
  &! @sum ~ (c (d e))

@swap = (?((@swap__C0 @swap__C1) (a (b c))) (b (a c)))

@swap__C0 = (b (a (a b)))

@swap__C1 = (* (a (b (a b))))

@warp = (?((@warp__C0 @warp__C1) (a (b (c d)))) (c (b (a d))))

@warp__C0 = ({a e} ({$([>] $(a b)) d} ($([^] $(b c)) f)))
  & @swap ~ (c (d (e f)))

@warp__C1 = ({a f} ((d i) ((c h) ({b g} ((e j) (k l))))))
  &! @warp ~ (f (g (h (i (j l)))))
  &! @warp ~ (a (b (c (d (e k)))))

Which performs at about 12000 MIPS on an RTX 4090.

Currently, Bend generates the following output:

@down = (?(((* (a a)) @down__C0) b) b)

@down__C0 = ({a e} ({b f} ((c g) (d h))))
  &!@flow ~ (a (b (c d)))
  &!@flow ~ (e (f (g h)))

@flow = (?(((* (a a)) @flow__C0) b) b)

@flow__C0 = ({$([+0x0000001] a) c} ({b d} ((e f) h)))
  & @down ~ (a (b (g h)))
  & @warp ~ (c (d (e (f g))))

@gen = (?(((a a) @gen__C0) b) b)

@gen__C0 = ({a d} ({$([*0x0000002] $([+0x0000001] b)) $([*0x0000002] e)} (c f)))
  &!@gen ~ (a (b c))
  &!@gen ~ (d (e f))

@main = c
  & @sum ~ (20 (b c))
  & @sort ~ (20 (0 (a b)))
  & @gen ~ (20 (0 a))

@sort = (?(((* (a a)) @sort__C0) b) b)

@sort__C0 = ({$([+0x0000001] a) {c f}} (b ((d g) i)))
  & @flow ~ (a (b ((e h) i)))
  &!@sort ~ (c (0 (d e)))
  &!@sort ~ (f (1 (g h)))

@sum = (?(((a a) @sum__C0) b) b)

@sum__C0 = ({a c} ((b d) f))
  &!@sum ~ (a (b $([+] $(e f))))
  &!@sum ~ (c (d e))

@swap = (?((@swap__C0 @swap__C1) a) a)

@swap__C0 = (a (b (a b)))

@swap__C1 = (* (b (a (a b))))

@warp = (?((@warp__C0 @warp__C1) a) a)

@warp__C0 = ($([^] $(b c)) ({$([>] $(a b)) d} ({a e} f)))
  & @swap ~ (c (d (e f)))

@warp__C1 = ({a f} ({b g} ((c h) ((d i) ((e j) (k l))))))
  &!@warp ~ (f (g (h (i (j l)))))
  &!@warp ~ (a (b (c (d (e k)))))

Which performs about 6000 MIPS, or 50% slower.

Tested using the latest version of HVM.

Can you please investigate?

@developedby @kings177

VictorTaelin avatar Jul 17 '24 03:07 VictorTaelin

Skimming over the output, I notice two changes:

=============

Some differences seem to come from changes in either the linearization or the eta-reduction. The new version generates smaller definitions for some functions, probably by reordering some lambdas, which allows them to be eta-reduced (essentially producing a better linearization).

We can see this in the swap function:

# old
swap = λa λb λc ((switch a {0: swap_c0; _: swap_c1}) c b)
swap_c0 = λa λb (b, a)
swap_c1 = λa λb (a, b)

# new
swap = λa switch a {0: swap_c0; _: swap_c1}
swap_c0 = λa λb (a, b)
swap_c1 = λa λb (b, a)

The same thing happens in down, flow and warp.
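For readers unfamiliar with eta-reduction: a term λc (f c) where c does not occur free in f is equivalent to f itself, so the compiler can drop the wrapper lambdas entirely. A hypothetical Python analogue of the swap case above (names are illustrative only, not compiler code):

```python
# Illustrative sketch of eta-reduction (hypothetical names, not Bend
# compiler code): a wrapper that only forwards its arguments is
# equivalent to the function it wraps.

def case_zero(b, c):
    return (c, b)

def case_succ(b, c):
    return (b, c)

def swap_expanded(a, b, c):
    # Eta-expanded shape: extra parameters that are merely forwarded along.
    return (case_zero if a == 0 else case_succ)(b, c)

def swap_reduced(a):
    # Eta-reduced shape: return the branch directly; callers apply it later.
    return case_zero if a == 0 else case_succ
```

In the new Bend output, reordering the case lambdas turns the trailing application into exactly this kind of forwarding wrapper, which eta-reduction then eliminates.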

If this were the source of the slowdown, it would be problematic, because the new version is strictly smaller, with fewer rewrites. I don't think I can predict exactly what effect the argument ordering has on performance, so it would be hard to always generate the fastest code.

================

In the old version, we lifted the calls to sort and gen in main to their own separate functions. This lets sum start one iteration earlier than sort and both start earlier than gen.

We have to do this because otherwise, in a lot of cases, a program would just return a lazy reference without calling it. For example, the three calls in main could be lifted together into their own definition and replaced by a reference. If we didn't inline the reference, the program would do nothing.

Something like this:

main = do_main
do_main = sum(20, sort(20, 0, gen(20, 0)))

Here it's quite stupid, but in a more general setting this was happening automatically a lot. It's not easy to decide when we need to do this and when we don't, so it's always done.

This is quite easy to test; we just have to manually write the old version.

If this is the source of the slowdown, then we need to figure out a better heuristic for when to lift the functions in main that maximizes the speed of execution. Although in this case I don't think that's the problem since both sum and sort depend on the output generated by gen to do anything, so they should become somewhat synchronized automatically.

developedby avatar Jul 17 '24 08:07 developedby

@kings177 can you test the second thing I said and see if it makes any difference? I mean writing main like this

main = sum(20, sort(20, 0, gen(20, 0)))

vs like this

main = sum(20, main_1)
main_1 = sort(20, 0, main_2)
main_2 = gen(20, 0)

developedby avatar Jul 17 '24 08:07 developedby

@developedby tried with:

def main:
  return sum(20, sort(20, 0, gen(20, 0)))
def main:
  main = sum(20, sort(20, 0, gen(20, 0)))
  return main
def main:
  main_2 = gen(20, 0)
  main_1 = sort(20, 0, main_2)
  main = sum(20, main_1)
  return main

no difference at all when running.

also, they all generate the same exact hvm file.

kings177 avatar Jul 17 '24 14:07 kings177

no difference at all when running.

also, they all generate the same exact hvm file.

What I meant was this

def main():
  return sum(20, main_1)
def main_1():
  return sort(20, 0, main_2)
def main_2():
  return gen(20, 0)

developedby avatar Jul 18 '24 08:07 developedby

tried doing it that way too, same result, ~6200 MIPS

kings177 avatar Jul 18 '24 17:07 kings177

Looking into this again, I found that this performance degradation started exactly at commit 3073285 (Call desugar_use before linearize matches), which changed desugar_use to run before linearize_matches, check_unbound_vars, and make_var_names_unique and was expected to increase performance. All commits before it get around 12000 MIPS on this example, while it and all commits after it get around 6000 MIPS.

It's still unclear if the degradation in performance is general (happens for every or most programs) or something that only happens in this case.

edusporto avatar Aug 09 '24 17:08 edusporto

Looking into this again, I found that this performance degradation started exactly at commit 3073285 (Call desugar_use before linearize matches), which changed desugar_use to run before linearize_matches, check_unbound_vars, and make_var_names_unique and was expected to increase performance.

It's still unclear if the degradation in performance is general (happens for every or most programs) or something that only happens in this case.

The intended effect of this change was to generate the linearization lambdas in a more sensible order (the order in which they appear), as well as to generate eta-reduced structures, like we can see in the changes to the snapshots:

// Before and after: swapping arguments allows for eta-reduction
@List.fold = ((@List.fold__C1 (a (b c))) (b (a c)))
@List.fold = ((@List.fold__C1 a) a)

// Before and after: just swapping the order of the linearized arguments, not eta-reducible.
@List.fold__C0 = (* (a (b (d ({(a (e f)) c} f)))))
  & @List.fold ~ (b (c (d e)))
@List.fold__C1 = (?(((a (* a)) @List.fold__C0) b) b)

@List.fold__C0 = (* (a (b ({(a (e f)) c} (d f)))))
  & @List.fold ~ (b (c (d e)))
@List.fold__C1 = (?(((* (a a)) @List.fold__C0) b) b)

I'd expect the first kind of change to have practically no effect on performance and the latter to increase performance slightly. Halving performance is definitely unexpected from this; it's probably something to do with how the threads fill their work space?

It could be interesting to see if we can isolate the two changes in the generated hvm program for the bitonic sort example: one version that only applies the eta-reduction and another that only reorders the arguments, and then see which one is causing the degradation.

developedby avatar Aug 09 '24 17:08 developedby

Actually, just running without eta-reduction should already show whether the problem is the reordering or the eta-reduction

developedby avatar Aug 21 '24 17:08 developedby

this should be a priority. Can we keep investigating?

VictorTaelin avatar Aug 30 '24 12:08 VictorTaelin

Yes, I can keep investigating this with high priority

developedby avatar Sep 03 '24 18:09 developedby

The performance reduction for the CUDA runtime does come 100% from eta-reduction (which is good, since it's easy to disable).

I tried doing eta-reduction on each function separately to see the impact on overall performance. Eta-reducing any one of warp, down, and flow causes a degradation of similar degree, down to ~7700 MIPS, and combining the eta-reduction of all three brings it down to the ~6500 MIPS reported.

This is the opposite of the effect I observe on the CPU, where eta-reducing this program instead increases performance (by a very small margin).

It's also not clear to me when we should eta-reduce and when we shouldn't. My guess is that it's a problem with binary recursive functions, but I don't know why it messes with the proper scheduling of work.

As a quick stopgap, we can temporarily disable eta-reduction in Bend until we get better insights.

developedby avatar Sep 04 '24 12:09 developedby

The eta-reduction doesn't happen in the binary recursive nets, but rather in the ones that call them, so the idea in the previous message doesn't work (and in this example doesn't actually do anything).

developedby avatar Sep 04 '24 17:09 developedby

I closed this since the specific problem is solved by the PR, but I don't really understand exactly why it happened in the first place, so I'm not super happy about it.

developedby avatar Sep 04 '24 21:09 developedby