Zia Khan
@HenrikBengtsson Curious whether this issue has anything to do with `scheduler.latency` in `makeClusterFunctionsSlurm()`.
I tried the `TORCH_HOME` environment variable, but now it segfaults. I wonder if it's an ABI incompatibility with `liblantern.so`. It's not a huge deal; I'll explore some alternatives.
That's great! Thank you! If there's a link where I can download it, I can give it a try on our Slurm cluster.
I got the artifact from https://storage.googleapis.com/torch-lantern-builds/refs/heads/non-abi/latest/LinuxNonABI-cpu.zip and set the `TORCH_HOME` environment variable. No segfault this time. Still need to test a bit more. Any chance you can build the cu101...
I still need to create a minimal example, but I've noticed that snakemake runs the Python interpreter from the enclosing environment in which snakemake is called, using the full path, and...
Here is a minimal example: `enclosing_smk.yml`
```
name: enclosing_smk
channels:
  - conda-forge
  - bioconda
  - nodefaults
dependencies:
  - snakemake
```
Here is a named environment `named_smk.yml`:
```
name: named_smk
channels:...
```
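For reference, a minimal Snakefile sketch of how one could check which interpreter a rule actually gets (the rule name and the `python --version` command are my illustrative assumptions, not from the original report):
```
# Snakefile -- run with: snakemake --use-conda -c1
# Hypothetical minimal rule: under the hypothesis above, the python
# resolved here would come from the enclosing enclosing_smk environment
# (by full path) rather than from the named_smk environment declared below.
rule which_python:
    output:
        "which_python.txt"
    conda:
        "named_smk.yml"  # the named environment from the example above
    shell:
        "command -v python > {output}; python --version >> {output}"
```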
I noticed that this `grad_acc > 0` error occurs if `args.micro_train_batch_size != args.actor_num_gpus_per_node * args.train_batch_size`. I think it has to do with the fact that DeepSpeed has some logic around...
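For intuition, here is a sketch of the batch-size bookkeeping I believe DeepSpeed performs (variable names are illustrative; the real check lives in DeepSpeed's config validation, which enforces `train_batch_size == micro_batch_per_gpu * grad_accum_steps * world_size`):
```
# Hypothetical reconstruction of the consistency check, not DeepSpeed source.
train_batch_size = 128    # e.g. args.train_batch_size
micro_batch_per_gpu = 4   # e.g. args.micro_train_batch_size
world_size = 8            # e.g. args.actor_num_gpus_per_node

# When grad_accum_steps is left unset, it is derived by integer division;
# if micro_batch_per_gpu * world_size exceeds train_batch_size, this
# floors to 0 and trips the "grad_acc > 0" assertion.
grad_accum_steps = train_batch_size // (micro_batch_per_gpu * world_size)
assert grad_accum_steps > 0, "gradient accumulation steps must be > 0"
```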
@kpoeppel I think I figured out the issue. If you look at the nvcc command, it includes `-gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_90,code=sm_90`. I think some of these architectures...
@kpoeppel I figured it out. It looks like if you set the environment variable `TORCH_CUDA_ARCH_LIST`, the nvcc `-gencode` flags are handled correctly. This should fix all the issues posted here. ```...
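For anyone hitting the same build failure, a sketch of setting it before a JIT extension build (the architecture values and the source file name are examples; pick values matching your GPUs):
```
import os

# TORCH_CUDA_ARCH_LIST restricts the compute capabilities nvcc targets,
# replacing the default -gencode list. Example values: 8.0 = A100,
# 8.6 = RTX 30xx; "+PTX" also embeds forward-compatible PTX.
# Must be set before the extension is compiled.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;8.6+PTX"

from torch.utils.cpp_extension import load

ext = load(name="my_ext", sources=["my_ext.cu"])  # hypothetical source file
```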
I'd like to drive Playwright using custom tools in my LLM agent. Being able to use `snapshotForAI` directly, instead of going through playwright-mcp, would be awesome.