
Cache restoration is orders of magnitude slower on Windows compared to Linux.

Open Adnn opened this issue 3 years ago • 31 comments

We are using this action to cache Conan packages.

Our current cache is ~300 MB on both Linux (ubuntu-20.04) and Windows (windows-2019). Sadly, whereas the cache step routinely takes ~10 seconds on Linux, it oftentimes takes ~5 minutes on Windows.

This makes iteration frustrating: since the rest of the workflow takes about 2 minutes to complete, we end up with a roughly 3x overall time penalty because of this.

As far as I can tell, the archive retrieval time is comparable; it is really the cache un-archiving that seems to take very long on Windows.
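
For reference, our cache step looks roughly like the following (the path and key here are illustrative, not our exact configuration):

- name: Cache Conan packages
  uses: actions/cache@v2
  with:
    path: ~/.conan/data
    key: conan-${{ runner.os }}-${{ hashFiles('**/conanfile.py') }}
    restore-keys: |
      conan-${{ runner.os }}-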


Our latest runs are shown below for illustration purposes.

Linux: [screenshot of the cache restore step]

Windows: [screenshot of the cache restore step]

Adnn avatar Mar 02 '22 13:03 Adnn

really is the cache un-archiving which seems to take very long on Windows.

That's because tar is extremely slow on Windows. Some related links:

  • https://github.com/microsoft/Windows-Dev-Performance/issues/27
  • https://superuser.com/questions/1124472
  • https://github.com/Microsoft/WSL/issues/507
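
If you want to reproduce this locally on a Windows machine, a rough PowerShell sketch (assuming you have a cache.tgz archive in the current directory) is:

# Compare the built-in bsdtar with the GNU tar that ships with Git for Windows
New-Item -ItemType Directory -Force extracted | Out-Null
Measure-Command { tar -xzf .\cache.tgz -C .\extracted }
Measure-Command { & 'C:\Program Files\Git\usr\bin\tar.exe' -xzf .\cache.tgz -C .\extracted }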

rikhuijzer avatar Mar 06 '22 11:03 rikhuijzer

Thank you for your response!

(If the issue is with the specific un-archiving utility, maybe there is an alternative that the action could use on Windows to get better performance?)

Adnn avatar Mar 07 '22 10:03 Adnn

I'm experiencing the same issue. Here is one of our latest runs after implementing dependency caching.

https://github.com/getsentry/sentry-dotnet/runs/6327391846?check_suite_focus=true

Time to restore dependencies from cache:

Ubuntu: 18s
macOS: 29s
Windows: 4m 5s

With timings visible, it's clearly the tar operation that is the culprit:

[screenshot of the workflow step timings, with the tar extraction dominating]

mattjohnsonpint avatar May 06 '22 19:05 mattjohnsonpint

Looks like this has been going on for a while... See also #442 and #529. Hopefully @bishal-pdMSFT can make some improvements here? Maybe just provide an optional parameter to the action that would tell it to use .zip (or another format) instead of .tgz on Windows? 7-Zip is pre-installed on the virtual environments.

mattjohnsonpint avatar May 09 '22 17:05 mattjohnsonpint

@vsvipul To restore the cache, the Ubuntu server takes 16 seconds and the Windows server takes 2 minutes and 29 seconds. https://github.com/space-wizards/space-station-14/runs/7017334343?check_suite_focus=true

wrexbe avatar Jun 23 '22 06:06 wrexbe

Hi @Adnn @wrexbe @mattjohnsonpint, we do understand. zstd is actually disabled on Windows due to issues with BSD tar, hence the slowness of the restore. While we look into it, you could use the workaround suggested here to improve your hosted Windows runner performance. This would enable zstd while using GNU tar.

TL;DR: Add the following step to your workflow before the cache step. That's it.

- if: ${{ runner.os == 'Windows' }}
  name: Use GNU tar
  shell: cmd
  run: |
    echo "Adding GNU tar to PATH"
    echo C:\Program Files\Git\usr\bin>>"%GITHUB_PATH%"

Hope this helps.

lvpx avatar Aug 22 '22 14:08 lvpx

@pdotl Thank you for your response. It is encouraging to read that this issue will be looked into!

On the other hand, I tried adding the Use GNU tar step you provided just before the cache action in our pipeline. The step ran and correctly output "Adding GNU tar to PATH", yet I did not observe any noticeable speed-up (but as I am not used to testing at the workflow level, I may be proven wrong).

Adnn avatar Aug 23 '22 11:08 Adnn

@pdotl - Thanks. But like @Adnn, I have to report that changing to gnu tar gave no significant performance gain.

https://github.com/getsentry/sentry-dotnet/runs/7979994529?check_suite_focus=true#step:3:58

Tue, 23 Aug 2022 18:07:43 GMT Cache Size: ~1133 MB (1187773080 B)
Tue, 23 Aug 2022 18:07:43 GMT "C:\Program Files\Git\usr\bin\tar.exe" --use-compress-program "zstd -d" -xf D:/a/_temp/6caa1415-0002-48ab-a4c2-b1ac4df21961/cache.tzst -P -C D:/a/sentry-dotnet/sentry-dotnet --force-local
Tue, 23 Aug 2022 18:13:01 GMT Cache restored successfully

As you can see from the timestamps, it took over 5 minutes to decompress. The same build on Linux and macOS took less than 30 seconds.

I realize that it would require some significant changes to actions/toolkit, but I think to really get comparable perf we're going to need the ability to use a format other than tar.gz. Perhaps .7z, since 7-Zip is pre-installed on the runners?
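
For illustration, here is a rough sketch of what a manual 7-Zip-based workaround could look like today: cache a single pre-built .7z archive instead of letting the action tar the whole directory tree. The step names, paths, and key are just examples, and the pack step would have to run near the end of the job so the archive exists before the cache is saved in the post step.

- name: Restore packages archive
  uses: actions/cache@v3
  with:
    path: packages.7z
    key: packages-${{ runner.os }}-${{ hashFiles('**/packages.lock.json') }}

- name: Extract packages archive
  if: runner.os == 'Windows'
  shell: pwsh
  run: |
    # 7-Zip is pre-installed on the hosted Windows images
    if (Test-Path packages.7z) { 7z x packages.7z "-o$env:USERPROFILE\.nuget\packages" -y }

# ... build and test steps ...

- name: Pack packages archive
  if: runner.os == 'Windows'
  shell: pwsh
  run: 7z a -mx=1 packages.7z "$env:USERPROFILE\.nuget\packages\*"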

mattjohnsonpint avatar Aug 23 '22 18:08 mattjohnsonpint

Hi @Adnn @mattjohnsonpint we are tracking this issue in #984. Please check the current proposal there and provide comments/feedback if any. Closing this as duplicate.

lvpx avatar Nov 16 '22 08:11 lvpx

@pdotl since #984 was narrowed down to cross-OS caching (ref), should this issue be reopened?

AMoo-Miki avatar Jan 19 '23 21:01 AMoo-Miki

In my case, the things we're caching take ~10-15 minutes to run, and restoring the cache takes even longer.

hipstersmoothie avatar Mar 18 '23 04:03 hipstersmoothie

Has there been any update on this?

ofek avatar Apr 01 '23 16:04 ofek

https://learn.microsoft.com/en-us/virtualization/community/team-blog/2017/20171219-tar-and-curl-come-to-windows
https://github.com/libarchive/libarchive

libarchive seems to have good enough performance on Windows for tar; maybe it could be used here?

LabhanshAgrawal avatar Jun 14 '23 11:06 LabhanshAgrawal

Could this be a regression of https://github.com/microsoft/Windows-Dev-Performance/issues/27#issuecomment-677955496? (Updating the tar binary might have caused the Defender signatures to no longer match, disabling the optimization)

mschfh avatar Sep 19 '23 19:09 mschfh

@lvpx @Phantsure Can we expect this to be on the team's radar anytime soon or should we just accept it?

Safihre avatar Oct 27 '23 10:10 Safihre

@Safihre we are not part of GitHub anymore, unfortunately. Hopefully someone will pick up these pending issues and respond.

lvpx avatar Oct 30 '23 13:10 lvpx

Does anyone know who the product owner is who might shed light on the progress of this?

jezdez avatar Dec 06 '23 08:12 jezdez

@bethanyj28 seems to be releasing the latest versions. They might help get this on someone's radar.

Phantsure avatar Dec 06 '23 09:12 Phantsure

We are facing this issue. Because of it, caching takes longer than the actual run, so we have disabled caching for now.

Kindly provide an update on when it will be fixed.

gaurish avatar Dec 06 '23 16:12 gaurish

I just spent 3 days trying to find a workaround for this, so I hope this helps someone...

Background:

We use Bazel, and although the local Bazel cache is not necessarily big (~200 MB in our case), it contains a ton of small files (including symlinks). For us, it took ~19 minutes to untar those 200 MB on Windows 😖

Failed attempts:

Based on this Windows issue (also linked above), my initial attempts were related to the Windows Defender:

  • Disable all features of Windows Defender / exclude the directory where the cache action extracts the archive (based on this gist and these docs; see the sketch after this list)
    ➜ However, I found out that all relevant features are already disabled and the entire D:/ drive (where the cache is extracted) is already excluded by default on the GitHub runners.
  • Completely uninstalling Windows Defender (see this example)
    ➜ However, that requires a reboot. So, that's only viable on a self-hosted runner.
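
For reference, this is roughly what that first attempt looked like (a sketch only; as noted above, these settings are already in place on the hosted runners, so it made no difference for us):

- name: Exclude workspace from Windows Defender
  if: runner.os == 'Windows'
  shell: pwsh
  run: |
    # Turn off real-time scanning and exclude the workspace drive.
    # On GitHub-hosted runners this is already the default, so it has no effect there.
    Set-MpPreference -DisableRealtimeMonitoring $true
    Add-MpPreference -ExclusionPath 'D:\a'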

BTW: The cache action already uses the tar.exe from Git Bash (C:\Program Files\Git\usr\bin\tar.exe), so this workaround (suggested by @lvpx >1 year ago) makes no difference anymore.

Our current workaround:

The underlying issue is the large number of files that need to be extracted, so let's reduce the cache to a single file: ➜ Let's put the cache in a Virtual Hard Disk!

So this is our solution:

runs-on: windows-2022
steps:
  ...

  - name: Cache Bazel (VHDX)
    uses: actions/cache@v3
    with:
      path: C:/bazel_cache.vhdx
      key: cache-windows

  - name: Create, mount and format VHDX
    run: |
      $Volume = `
        If (Test-Path C:/bazel_cache.vhdx) { `
            Mount-VHD -Path C:/bazel_cache.vhdx -PassThru | `
            Get-Disk | `
            Get-Partition | `
            Get-Volume `
        } else { `
            New-VHD -Path C:/bazel_cache.vhdx -SizeBytes 10GB | `
            Mount-VHD -Passthru | `
            Initialize-Disk -Passthru | `
            New-Partition -AssignDriveLetter -UseMaximumSize | `
            Format-Volume -FileSystem NTFS -Confirm:$false -Force `
        }; `
      Write-Output $Volume; `
      Write-Output "CACHE_DRIVE=$($Volume.DriveLetter)`:/" >> $env:GITHUB_ENV

  - name: Build and test
    run: bazelisk --output_base=$env:CACHE_DRIVE test --config=windows //...

  - name: Dismount VHDX
    run: Dismount-VHD -Path C:/bazel_cache.vhdx

I know... it's long and ugly, but it works: Extracting the cache only takes 7 seconds and mounting the VHDX only takes 19 seconds! 🎉 This means that we reduced the cache restoration time by a factor of 44 🤓

This is based on Example 3 of the Mount-VHD docs and Example 5 of the New-VHD docs. I'm by no means proficient in PowerShell scripting, so there might be room for improvement...

A few details about the solution:

  • We reserve 10 GB for the VHDX, but that doesn't mean that's the actual size of the file. The size of the VHDX is only slightly bigger than the size of its contents. But with 10 GB, we give Bazel enough space to work :)
  • The VHDX is mounted as E:/ on the GitHub runners. However, this is not necessarily deterministic. I tried assigning a specific drive letter, but there are two issues with that: 1. the drive could already be occupied, and 2. the Mount-VHD command doesn't support it (only the New-Partition command does).
    So, we store the path of the drive in a new CACHE_DRIVE environment variable that we can use in later steps.
  • We don't use always() or !cancelled() in the Dismount VHDX step, because if something fails, the cache will be disregarded anyways. So, we don't care if the volume gets dismounted or not 🤷

paco-sevilla avatar Dec 08 '23 11:12 paco-sevilla

Thanks @paco-sevilla. It's just a bit crazy that we have to resort to these kinds of solutions instead of getting a proper fix from GitHub. This has been going on for ages. And it can't just be the free users (like me) who experience this; the corporate customers that actually pay for each Actions minute must experience this too and want it reduced.

Safihre avatar Dec 08 '23 12:12 Safihre

I totally agree! I actually work for one of those enterprise customers that pay for a certain amount of minutes. And it's not only about the 💸... The developer experience is (to put it nicely) poor if something that should take 2-3 minutes suddenly takes 10 times longer 😕

Anyway, I just wanted to share my findings. Maybe they inspire a proper solution. BTW: This is a public repo, so I guess anyone could contribute to a proper solution, without having to wait on Github staff.

paco-sevilla avatar Dec 08 '23 12:12 paco-sevilla

@paco-sevilla - Thank you so much!!! I've just been experiencing this issue, where it took 25 minutes to decompress the Bazel cache (granted, it's probably caching more than it needs to). Thank you! Thank you! Thank you!

malkia avatar Feb 26 '24 19:02 malkia

Any plan to solve this? Even the official documentation doesn't mention anything about performance issues under Windows:
https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows
I lost some time implementing the cache on Windows, and in the end it turns out to be much worse than a regular online update (saving or restoring the cache takes 6 min, while the online update takes 1 min).

ArkadiuszMichalski avatar Apr 11 '24 09:04 ArkadiuszMichalski

+1 Attempting to cache on Windows proves to be a significant waste of time; I spent hours on it. At the very least, the official documentation should be updated to document this "known issue."

gaurish avatar Apr 11 '24 18:04 gaurish

@paco-sevilla This is awesome. I found myself recently needing to do this across multiple jobs and was surprised there were no GitHub Actions that could do this. I ended up creating one, samypr100/setup-dev-drive, since I thought others might benefit.

samypr100 avatar May 12 '24 16:05 samypr100

@samypr100 Your Github Action is awesome! And the examples are super useful! Thanks for creating it and also for the acknowledgements in the README 🙂

I think that if something similar were implemented in this action and the documentation were updated, this issue could finally be closed (after more than 2 years)...

paco-sevilla avatar May 16 '24 09:05 paco-sevilla

Awesome job @samypr100 - I'll have to try it out! I was experimenting here with NTFS and tried ReFS - https://github.com/malkia/opentelemetry-cpp/blob/main/.github/workflows/otel_sdk.yml#L28 - though the VHDX gets to be much bigger (in size, granted mostly zeroes). I wonder what else I can do there.

malkia avatar May 16 '24 17:05 malkia

I also experimented with ReFS on my local Windows machine and the performance (even when declared as a DevDrive with AntiVirus turned off) is significantly worse (slower and, as @malkia says, requires more space) than NTFS for my use-case (Bazel cache).

paco-sevilla avatar May 17 '24 08:05 paco-sevilla

That's unfortunate to hear. The action does give you the flexibility to change the format to NTFS if desired.

samypr100 avatar May 17 '24 16:05 samypr100