fsharp Faster equality in generic contexts

What is this?

This is the first part of this effort to resurrect the awesome work on compiler performance by @manofstick.

How is this different?

Many things have changed, among other things we now have a bigger team & community to review things and a different release cadence allowing us to dogfood things a bit.

More importantly, the original PRs often contained multiple ideas (fixes, breaking, optimizations, experiments, ...), whereas I want to narrow PRs as much as possible to make things easier to review and control.

Is this PR breaking?

A tiny bit. This is mostly an optimization.

Here is an example of code where the behavior does change:

[<Struct; CustomEquality; NoComparison>]
type Blah =
    interface IEquatable<Blah> with
        member lhs.Equals rhs = failwith "bang"

let eq x y = 
    printfn "hello" // long code insuring no inlining
    ...
    x = y // generic equality context

let result = eq (Blah()) (Blah())  // currently false, will be exception

We agreed that it's OK here.

What's the source of inspiration here?

This is essentially part of the #5112, with the following changes:

Optimizer is not touched (there wasn't a consensus on that)
tail calls are not touched (same, also partially was there to work around JIT issues that were resolved later)
=> hence IL is not changed
some comparison optimizations are removed (I wasn't convinced by them)
refactoring is removed (might be done in a followup)
=> hence diff is minimized
benchmarks are added

So where does this improve things?

More theory and motivation is in this document - link will be updated once the PR merged.

TL;DR: this improves things when HashIdentity.Structural<'T> comparison is used in non-inlined code.

Benchmarks

Main targets: structs, enums, floats, and special generic types:

Structs and enums

Before:

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
Struct	301.8 us	14.77 us	43.56 us	68.3594	-	-	420.61 KB
Enum	299.4 us	19.55 us	57.63 us	22.9492	-	-	140.65 KB

After:

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
Struct	143.72 us	7.325 us	21.598 us	0.1221	-	-	792 B
Enum	25.93 us	1.803 us	5.289 us	0.0458	-	-	336 B

2x and 12x speed improvements for structs and enums respectively and nearly zero allocations now in both cases.

Value tuples

Before:

Method	Mean	Ratio	Gen 0	Gen 1	Gen 2	Allocated
ValueTuple3	673.4 us	1.00	61.5234	15.8691	-	378.13 KB
ValueTuple4	812.2 us	1.22	69.0918	19.7754	-	424.98 KB
ValueTuple5	1,004.2 us	1.50	84.9609	24.4141	-	523.63 KB
ValueTuple6	1,100.7 us	1.65	92.7734	23.4375	-	570.48 KB
ValueTuple7	1,324.9 us	1.97	117.1875	57.6172	29.2969	669.14 KB
ValueTuple8	1,461.9 us	2.20	117.1875	58.1055	29.2969	762.85 KB

After:

Method	Mean	Ratio	Gen 0	Gen 1	Gen 2	Allocated
ValueTuple3	173.0 us	1.00	28.5645	9.3994	-	175.11 KB
ValueTuple4	174.9 us	1.03	28.5645	9.4604	-	175.11 KB
ValueTuple5	208.9 us	1.22	34.4238	11.3525	-	211.29 KB
ValueTuple6	217.0 us	1.26	34.4238	11.3525	-	211.29 KB
ValueTuple7	293.7 us	1.73	29.2969	29.2969	29.2969	247.48 KB
ValueTuple8	293.8 us	1.73	29.2969	29.2969	29.2969	247.48 KB

~80% in speed and ~50% in memory reduction, also much steeper ratios' increase for both.

Options and co

Before:

Method	Mean	Gen 0	Gen 1	Gen 2	Allocated
Option	165.0 us	16.3574	3.1738	-	101.74 KB
ValueOption	157.1 us	28.8086	4.3945	-	177.02 KB
Result	186.3 us	40.4053	10.0098	-	248.25 KB

After:

Method	Mean	Gen 0	Gen 1	Gen 2	Allocated
Option	82.13 us	12.6953	3.0518	-	78.33 KB
ValueOption	55.09 us	9.7656	1.5869	-	59.98 KB
Result	75.92 us	22.5830	5.6152	-	138.93 KB

50-75% speed and 25-75% memory improvements.

Nullable

Before:

Method	Mean	Gen 0	Gen 1	Gen 2	Allocated
Nullable	443.7 us	24.9023	3.9063	-	153.66 KB

After:

Method	Mean	Gen 0	Gen 1	Gen 2	Allocated
Nullable	60.16 us	9.7656	1.5869	-	59.94 KB

About 7x speed and 3x memory improvements.

Floats

Before:

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
FloatER	51.47 us	2.228 us	6.570 us	7.0801	0.3662	-	43.68 KB
Float32ER	56.39 us	2.566 us	7.525 us	7.1106	0.3662	-	43.68 KB
FloatPER	149.49 us	2.952 us	7.513 us	38.3301	2.6855	-	231.34 KB
Float32PER	136.09 us	5.584 us	16.201 us	37.5977	2.4414	-	227.01 KB

After:

Method	Mean	Error	StdDev	Median	Gen 0	Gen 1	Gen 2	Allocated
FloatER	15.26 us	0.487 us	1.397 us	15.38 us	3.2806	0.1678	-	20.1 KB
Float32ER	15.80 us	0.316 us	0.666 us	15.82 us	3.2806	0.1678	-	20.1 KB
FloatPER	82.88 us	5.580 us	16.452 us	93.36 us	14.9536	1.2817	-	90.46 KB
Float32PER	94.22 us	1.851 us	3.904 us	93.97 us	14.1602	0.9766	-	86.43 KB

PER comparison still takes more time and memory but still the improvements are 2-3 times in all cases.

Also (positively) affected: basic types, arrays, reference types - due to shorter call chains and less casting:

Arrays

Before:

Method	Mean	Error	StdDev
Int32	974.3 us	73.48 us	214.3 us
Int64	1,090.7 us	58.65 us	172.9 us
Byte	1,075.3 us	41.56 us	121.9 us
Obj	1,451.8 us	43.91 us	128.8 us

After:

Method	Mean	Error	StdDev
Int32	253.3 us	18.09 us	52.78 us
Int64	312.3 us	14.06 us	41.45 us
Byte	246.5 us	6.11 us	17.82 us
Obj	489.3 us	17.69 us	51.61 us

About 3x faster.

F# basic types

Before (countBy):

Method	Mean	Error	StdDev
Bool	39.06 us	1.936 us	5.709 us
SByte	55.23 us	2.032 us	5.992 us
Byte	50.62 us	1.617 us	4.766 us
Int16	85.46 us	4.668 us	13.764 us
UInt16	86.66 us	3.189 us	9.351 us
Int32	86.21 us	4.690 us	13.827 us
UInt32	87.69 us	4.911 us	14.480 us
Int64	112.81 us	3.962 us	11.681 us
UInt64	112.66 us	4.003 us	11.550 us
IntPtr	114.61 us	3.430 us	10.114 us
UIntPtr	109.40 us	3.322 us	9.796 us
Char	98.99 us	2.825 us	8.330 us
String	214.52 us	5.968 us	17.503 us
Decimal	315.84 us	12.545 us	36.988 us

After (countBy):

Method	Mean	Error	StdDev
Bool	29.19 us	1.608 us	4.740 us
SByte	50.72 us	2.167 us	6.390 us
Byte	47.40 us	1.719 us	5.069 us
Int16	83.50 us	3.508 us	10.342 us
UInt16	84.48 us	2.949 us	8.649 us
Int32	87.24 us	2.670 us	7.832 us
UInt32	86.17 us	3.630 us	10.703 us
Int64	95.78 us	4.763 us	14.044 us
UInt64	113.86 us	4.278 us	12.479 us
IntPtr	110.92 us	3.192 us	9.412 us
UIntPtr	105.11 us	3.219 us	9.440 us
Char	91.03 us	4.164 us	12.211 us
String	214.23 us	5.859 us	17.276 us
Decimal	150.39 us	3.703 us	10.683 us

Before (distinct):

Method	Mean	Error	StdDev
Bool	9.489 us	0.7345 us	2.142 us
SByte	12.568 us	0.4049 us	1.168 us
Byte	12.393 us	0.4176 us	1.231 us
Int16	23.539 us	0.8906 us	2.626 us
UInt16	22.311 us	0.8351 us	2.449 us
Int32	22.302 us	0.4448 us	1.180 us
UInt32	22.092 us	0.4760 us	1.319 us
Int64	25.567 us	0.8679 us	2.518 us
UInt64	26.200 us	1.4150 us	4.172 us
IntPtr	25.528 us	1.4250 us	4.202 us
UIntPtr	25.000 us	0.8785 us	2.590 us
Char	22.883 us	0.8124 us	2.383 us
String	47.659 us	1.6451 us	4.799 us
Decimal	58.086 us	4.7662 us	13.828 us

After:

Method	Mean	Error	StdDev
Bool	9.408 us	0.9121 us	2.689 us
SByte	11.795 us	0.3516 us	1.020 us
Byte	11.225 us	0.4078 us	1.170 us
Int16	23.211 us	0.7982 us	2.354 us
UInt16	21.682 us	1.0798 us	3.184 us
Int32	20.035 us	0.6128 us	1.787 us
UInt32	20.123 us	0.8883 us	2.591 us
Int64	22.967 us	0.7908 us	2.319 us
UInt64	24.757 us	1.3208 us	3.832 us
IntPtr	26.448 us	1.3528 us	3.989 us
UIntPtr	24.596 us	1.1292 us	3.330 us
Char	22.670 us	0.8961 us	2.642 us
String	40.480 us	1.3767 us	3.994 us
Decimal	35.401 us	1.4948 us	4.289 us

Note that decimals also show improvement in memory allocations:

Before (countBy)

Method	Mean	Error	StdDev	Median	Gen 0	Gen 1	Gen 2	Allocated
Decimal	149.9 us	13.45 us	39.01 us	136.1 us	38.4521	12.8174	-	237.55 KB

After (countBy)

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
Decimal	65.17 us	5.013 us	13.807 us	28.5645	9.4604	-	175.09 KB

Before (distinct)

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
Decimal	57.83 us	6.097 us	17.977 us	26.3062	6.5308	-	162.35 KB

After (distinct)

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
Decimal	34.79 us	1.572 us	4.585 us	21.2708	5.3101	-	131.1 KB

These mostly stay the same (as expected), apart from decimal, which show ~50% speed and ~25% alloc improvements.

Records

Before:

Method	Mean	Error	StdDev
Record	157.8 us	8.98 us	25.61 us

After:

Method	Mean	Error	StdDev
Record	143.1 us	7.13 us	20.00 us

Which is about 10% improvement.

Generic unions

Before:

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
GenericUnion	439.8 us	17.85 us	51.78 us	41.9922	12.2070	-	260 KB

After:

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
GenericUnion	167.1 us	3.32 us	7.22 us	26.9775	8.9722	-	166.3 KB

About 60% and 30% improvements in speed and allocs.

Some (positive) implications for F#.Core.

F# core functions in question

Before:

Method	Mean	Error	StdDev	Median
ArrayCountBy	230.20 us	7.098 us	20.929 us	230.07 us
ArrayGroupBy	125.80 us	5.344 us	15.674 us	127.24 us
ArrayDistinct	121.08 us	8.648 us	25.500 us	127.72 us
ArrayDistinctBy	113.45 us	2.636 us	7.732 us	114.15 us
ArrayExcept	92.28 us	1.835 us	5.354 us	92.66 us
ListCountBy	225.45 us	11.730 us	34.585 us	230.79 us
ListGroupBy	167.40 us	12.814 us	37.783 us	151.49 us
ListDistinct	125.20 us	4.192 us	12.229 us	127.01 us
ListDistinctBy	109.52 us	3.480 us	10.097 us	110.94 us
ListExcept	194.14 us	20.467 us	60.347 us	164.68 us
SeqCountBy	460.13 us	13.224 us	38.575 us	459.04 us
SeqGroupBy	354.81 us	15.000 us	43.992 us	348.92 us
SeqDistinct	359.59 us	12.339 us	36.381 us	361.93 us
SeqDistinctBy	128.8 us	5.70 us	15.97 us	127.2 us
SeqExcept	127.8 us	5.03 us	14.74 us	128.1 us

After:

Method	Mean	Error	StdDev	Median
ArrayCountBy	137.30 us	5.008 us	14.767 us	137.01 us
ArrayGroupBy	78.80 us	4.673 us	13.778 us	80.87 us
ArrayDistinct	71.63 us	3.698 us	10.904 us	73.49 us
ArrayDistinctBy	69.93 us	2.148 us	6.299 us	70.02 us
ArrayExcept	68.39 us	3.071 us	9.008 us	69.59 us
ListCountBy	138.17 us	6.708 us	19.566 us	142.05 us
ListGroupBy	129.57 us	12.765 us	37.639 us	110.78 us
ListDistinct	85.99 us	1.662 us	3.683 us	85.73 us
ListDistinctBy	67.36 us	3.001 us	8.707 us	67.95 us
ListExcept	173.01 us	21.536 us	63.499 us	138.14 us
SeqCountBy	289.70 us	12.109 us	35.514 us	295.43 us
SeqGroupBy	245.97 us	6.969 us	20.329 us	246.79 us
SeqDistinct	254.17 us	14.776 us	43.336 us	257.62 us
SeqDistinctBy	101.7 us	5.35 us	14.91 us	102.4 us
SeqExcept	105.9 us	5.20 us	15.35 us	102.1 us

The improvement varies 20-40% in speed here.

Other considerations.

>64 bit value types

Before:

Method	Mean	Error	StdDev
BigStruct	47.43 ms	3.136 ms	9.098 ms

After:

Method	Mean	Error	StdDev
BigStruct	44.57 ms	3.243 ms	9.562 ms

This became marginally faster - but more importantly, it's here to address concerns about 64 bit JIT. So likely the underlying JIT problem got fixed meanwhile.

TODO

[ ] AOT tests to see if startup performance is not too affected
[ ] Finish the design doc on equality
[ ] Describe the changed code flow

Followups

Revive equality devirtualization
Add optimizations for recently added primitive types, as suggested here

Jan 30 '24 16:01 psfinaki

:heavy_exclamation_mark: Release notes required

:white_check_mark: Found changes and release notes in following paths:

Change path Release notes path Description

src/FSharp.Core docs/release-notes/.FSharp.Core/8.0.300.md

Change path	Release notes path	Description
`src/FSharp.Core`	docs/release-notes/.FSharp.Core/8.0.300.md

Jan 30 '24 16:01 github-actions[bot]

Benchmark results are weird, numbers before and after are within the statistical error. What about allocations? We shouldn't be boxing as much now?

Feb 08 '24 17:02 vzarytovskii

@vzarytovskii yeah those are not the right benchmarks for this PR. We had a session with Don today to figure out the right ones and I am already getting some 25-30% improvements there, will post soon - stay tuned.

Feb 08 '24 17:02 psfinaki

/azp run

Feb 12 '24 16:02 psfinaki

Azure Pipelines successfully started running 2 pipeline(s).

Feb 12 '24 16:02 azure-pipelines[bot]

Is this PR breaking? No. This is (supposed to be just) an optimization.

Well... It'll be calling IEquatable.Equals<>, not object.Equals, which is hence a breaking change - albeit I feel in a fashion that should be part of an evolution...

...and really should be paired with the optimizer change (@TIHan I believe worked on one at some stage?) Otherwise you have the somewhat bizarre situation where using an comparison of an external struct type such as a NodaTime.Instant in a generic context doesn't box and is fast, vs in a non-generic context where it does box is an is slower... Unless that has been resolved already?

Anyway, glad to see this moving forward. My beard is somewhat grayer now that when I first started trying to improve comparison/equality in F#!!

Feb 13 '24 13:02 manofstick

/azp run

Feb 13 '24 14:02 psfinaki

Azure Pipelines successfully started running 2 pipeline(s).

Feb 13 '24 14:02 azure-pipelines[bot]

Marking as ready to review since the CI is green finally :D

Feb 13 '24 15:02 psfinaki

And glad to see you here, @manofstick :) BTW thanks for the clear commits and comments to the commits in your original PRs, that really helped me a lot.

As for this one, well, it's not breaking in terms of NaN behavior and in terms of IL. So far. That's probably what we care about the most.

Feb 13 '24 17:02 psfinaki

Since it touches an important part of the language, I would like to have three things:

What changed, in the form of Given (generic) code: ... How it worked before: ... How does it work now: ...

Unless it exists somewhere already. It should help everyone in understanding this change in future in case of any issues.
Some AOT compilation tests involving changed equality. It can be a new project, like the trimming one we have. We should start having more AOT testing, and this is a good place to start, since we know AOT was working with equality before, it's crucial it's still working after the change.
Can we have some compiler perf comparison/profiling of using compiler+fslib before and after changes on something like FCS? It would be nice to have it as reference.

Feb 14 '24 08:02 vzarytovskii

Thanks all for the reviews and the feedback.

This needs a thorough rereview with @dsyme, meanwhile I will be improving the description, clarifying the code and adding tests.

Feb 14 '24 12:02 psfinaki

...and really should be paired with the optimizer change (@TIHan I believe worked on one at some stage?) Otherwise you have the somewhat bizarre situation where using an comparison of an external struct type such as a NodaTime.Instant in a generic context doesn't box and is fast, vs in a non-generic context where it does box is an is slower... Unless that has been resolved already?

Agreed that in principle these should go together, though can be done in a separate PR.

Well... It'll be calling IEquatable.Equals<>, not object.Equals, which is hence a breaking change - albeit I feel in a fashion that should be part of an evolution...

I think this specific change is OK - use IEquatable.Equals if it exists.

Feb 14 '24 13:02 dsyme

Well... It'll be calling IEquatable.Equals<>, not object.Equals, which is hence a breaking change - albeit I feel in a fashion that should be part of an evolution...

@manofstick Could you remind me of the specific thing that causes the call to IEquatable.Equals<, and which types it applies to. I assume it comes from EqualityComparer<'T>.Default and thus only applies to types that pass canUseDefaultEqualityComparer?

I think that is OK as a change. It is vanishingly rare to explicitly implement IEquatable.Equals<> on struct types in F# code, (except the automatic implementations provided by the compiler) - and even then it would always be the desired behaviour to have that implementation invoked

Feb 14 '24 13:02 dsyme

Also, does this need an addition to spec (rfc), since it changes how we do equality? Or rather shall the doc from the other PR live in the design repo?

Feb 14 '24 13:02 vzarytovskii

@vzarytovskii I think it's a good idea to have that doc (from the other PR) in the design repo. This PR is meant to be an optimization and hence focuses on the implementation. I am looking at the spec while working on this and my impression so far is that the spec doesn't go that deep in the implementation to make this change anyhow misalign with it. If we discover anything of that nature, we can make the RFC or further discuss it.

Feb 14 '24 14:02 psfinaki

I added benchmarks for the value tuples, options and that stuff - the results are very convincing so far. More to come soon!

Feb 14 '24 18:02 psfinaki

@manofstick Could you remind me of the specific thing that causes the call to IEquatable.Equals<, and which types it applies to. I assume it comes from EqualityComparer<'T>.Default and thus only applies to types that pass canUseDefaultEqualityComparer?

Yep, simple as that.

I think that is OK as a change. It is vanishingly rare to explicitly implement IEquatable.Equals<> on struct types in F# code, (except the automatic implementations provided by the compiler) - and even then it would always be the desired behaviour to have that implementation invoked

Oh, completely agree. It was just that this was (as far as I understand it) the main reason why this change never went through in the past.

And just confirming there, given your comment, this change, as implemented, would be all types (that pass the check) using IEquatable<>.Equals in preference to object.Equals not just value types (obviously struct types get the most benefit, as they avoid the boxing step, but ref types can also get a slight improvement as they, don't need to cast input). But the check includes IsSealed, hence negating any potential inheritance issues. (Once again, from memory, don't have a compiler here to test, but F# records are created as Sealed always)

Feb 15 '24 02:02 manofstick

@manofstick just to be sure we're on the same page - which input cast do you mean? :)

Feb 15 '24 17:02 psfinaki

@psfinaki

...forgive my github comment coding, I think memory serves, but it's more than enough for the gist of things if it's incorrect...

[<Sealed>] // only for sealed
type Blah(...) =
    ...

    override lhs.Equals (rhsObj:obj) = // (1)
        if obj.ReferenceEquals (lhs, rhsObj) then
            true
        else 
            match rhsObj with
            | :? Blah as rhs -> // (2)
                (lhs:>IEquatable<Blah>).Equals rhs // (4)
            | _ -> false

    interface IEquatable<Blah> with
        member lhs.Equals rhs = // (3)
            // actually do the check

Well (1) would of been the entry point, where at (2) would of been the cast (well type check), but EqualityComparer<Blah>.Default will call (3) directly.

Most likely the actually equality check in these cases will be the significant component of the cost, so savings is minimal, but as mentioned, this is just for complete understanding of what's going on here...

(4) ... just because I'm showing off, the IL that F# used to generate actually caused this operation to have a cost, but way back at the dawn of history I fixed that... ;-)

Feb 15 '24 20:02 manofstick

@psfinaki I believe it should be possible to write a test case that detects the change, e.g.

[<Sealed, Struct>]
type Blah() =
    override lhs.Equals (rhsObj:obj) = true
    interface IEquatable<Blah> with
        member lhs.Equals rhs = failwith "bang"

let eq x y = 
    printfn "hello" // do this enough times to make sure no inlining
    printfn "hello" // do this enough times to make sure no inlining
    printfn "hello" // do this enough times to make sure no inlining
    printfn "hello" // do this enough times to make sure no inlining
    (x = y) // generic equality context

eq (Blah()) (Blah())  // true prior to this, exception now

I don't really think this needs a runtime library switch to disable, and we don't currently have such a mechanism to do those kinds of flags anyway

The equality spec should probably be updated to mention this case

Feb 15 '24 22:02 dsyme

Yes, correct. For reference, I guess this is the minimal code to demo the behavior difference:

[<Struct; CustomEquality; NoComparison>]
type Blah =
    interface IEquatable<Blah> with
        member lhs.Equals rhs = failwith "bang"

let eq x y = 
    printfn "hello" // do this enough times to make sure no inlining
    ...
    x = y // generic equality context

let result = eq (Blah()) (Blah())  // false prior to this, exception now

Alright, so we've identified what breaks here and blessed it at the same time. Indeed, this is quite an esoteric scenario.

Good stuff, I will update the spec and the description to reflect this.

Feb 16 '24 16:02 psfinaki

Non-inlined stacktrace examples before and after could be helpful for reviewers as well, could you add it pelase, if it's not too much work.

Feb 17 '24 18:02 vzarytovskii

Yep I am going to add it to the PR description.

Feb 19 '24 12:02 psfinaki

Okay so hopefully getting to the finish line here, I think I have added enough micro benchmarks and grouped them in the PR description.

Next week will add AOT testing, refresh the equality design doc, finalize the PR description and give this a final review with Don :)

Feb 23 '24 16:02 psfinaki

/azp run

Mar 04 '24 17:03 psfinaki

Azure Pipelines successfully started running 2 pipeline(s).

Mar 04 '24 17:03 azure-pipelines[bot]

Nice

Mar 04 '24 18:03 KevinRansom

64 bit value types This became marginally faster

I'm just relying on your words here, as I have not built this, but they should be significantly faster, and non boxing (I'm not talking about my concern re tail calls here). I'm guessing that because they implement 'IStructuralEquality' they are using that path, and hence not being sped up. I thought in my original PR I handled these too, maybe I didn't, or maybe you didn't carry that code across, I don't know.

Anyway, what I would suggest is that the code for 'canUseDefault...' is used at compile time, and if default can be used then the 'IStructuralEquals' interface isn't implemented. This, once again, is a breaking change, but I think it's even lesser than the one introduced by this PR.

Another follow up, if it hasn't been started yet, would be used user Compare<>.Default for inequalities....

Mar 05 '24 07:03 manofstick

Yes, I think we will get to those as well. Next bigger one will be devirt equality, then we'll probably look into comparison. There is a lot of useful stuff to pull out of your contributions :)

Mar 05 '24 09:03 psfinaki