fsharp icon indicating copy to clipboard operation
fsharp copied to clipboard

Race in GetProjectOptionsFromScript prevents getting referenced packages

Open auduchinok opened this issue 4 years ago • 7 comments

Repro steps:

  • call GetProjectOptionsFromScript
  • start the resulting async as task and pass a cancellation token
  • request cancellation on the token source
  • call GetProjectOptionsFromScript again (probably from a different thread)
  • start the task with a new cancellation token

The GetProjectOptionsFromScript calls are made using lock application to prevent concurrent access.

Calling and cancelling it like this may lead to situation where two scrips closures are calculated concurrently, despite one of them has already been cancelled by the callee. Here're two screenshots for different threads, both made while staying on the same breakpoint:

Screenshot 2022-02-02 at 14 17 37 Screenshot 2022-02-02 at 14 17 16

This leads to a situation where both threads try to write a generated file during the package restore. If the file is locked then the restore fails and results with missing package references are added to cache inside DependencyProvider, so all later GetProjectOptionsFromScript invocations return failed results for the same compiler directive collection.

The build failure details

The cached script closure:
Screenshot 2022-02-02 at 09 44 52

Exception stacktrace: Screenshot 2022-02-02 at 09 45 19

auduchinok avatar Feb 02 '22 12:02 auduchinok

I don't quite understand the likely nature of the bug from the above

Calling and cancelling it like this may lead to situation where two scrips closures are calculated concurrently, despite one of them has already been cancelled by the callee.

When you say "calculated concurrently" could you indicate which routine or code you think is being executed concurrently for the same script?

Also when you say "has been cancelled" do you mean that the cancellation hasn't been effective?

Or is it more that a bad failing result (related to cancellation) has been written to the cache?

dsyme avatar Feb 02 '22 13:02 dsyme

When you say "calculated concurrently" could you indicate which routine or code you think is being executed concurrently for the same script?

@dsyme Yes, please check the stacks on the attached screenshots. 🙂

Also when you say "has been cancelled" do you mean that the cancellation hasn't been effective?

I think this is what is happening indeed, the cancellation tokens are also captured on the screenshots.

Or is it more that a bad failing result (related to cancellation) has been written to the cache?

The very last screenshot shows a cached failure to write to a locked file inside an MSBuild task.

auduchinok avatar Feb 02 '22 13:02 auduchinok

Looks like FSharpDependencyManager is allowing multiple concurrent callers to attempt resolutions within its working directory at the same time

dsyme avatar Feb 02 '22 14:02 dsyme

@KevinRansom Could you take a look at this? I think FSharpDependencyManager needs to be made concurrency safe - not its data structures, but rather its use of the working directory as a dedicated non-concurrent resource.

dsyme avatar Feb 02 '22 14:02 dsyme

I shall take a look. For sure there is no real reason beyond debuggability that we reuse the working directory. Although, when trying to figure out what is going on, that is my go to.

KevinRansom avatar Feb 02 '22 16:02 KevinRansom

I suppose a risk of using separate directories (ie fixing this bug) is that we will end up spawning hundreds of resolutions as the programmer is editing the references of a script, each typing of a character doing a new spawn. Right now the bug is probably gating the process of doing resolutions in some way.

Equally we've gradually moved away from have FSharp.Compiler.Service manage resources and act as a mini-operating-system (e.g. by removing the reactor thread), but it's something to be aware of and watch for.

dsyme avatar Feb 02 '22 18:02 dsyme

I suppose a risk of using separate directories (ie fixing this bug) is that we will end up spawning hundreds of resolutions as the programmer is editing the references of a script, each typing of a character doing a new spawn.

It should only create new directories/files for unique dependency provider keys, so it won't be happening for each keystroke, unless changing the reference directives: https://github.com/dotnet/fsharp/blob/1490dd5d8abfc5fe6609e56a174914cf72306179/src/fsharp/DependencyManager/DependencyProvider.fs#L425

And when changing the directives, it currently may reuse the files, leading to hard-to-diagnose race issues.

auduchinok avatar Jul 07 '22 10:07 auduchinok