BioFSharp icon indicating copy to clipboard operation
BioFSharp copied to clipboard

[BUG] ClustalO wrapper example from the docs is not working correctly

Open kMutagene opened this issue 4 years ago • 8 comments

Describe the bug The example throws an error when it wants to read some temporary files

System.IO.IOException: The process cannot access the file 'C:\Users\schne\AppData\Local\Temp\ac7bce1d-efe2-45ba-8ffa-90d0a60e0f49.fasta' because it is being used by another process.   at System.IO.FileSystem.DeleteFile(String fullPath)
   at System.IO.File.Delete(String path)
   at BioFSharp.IO.ClustalOWrapper.ClustalOWrapper.AlignSequences(IEnumerable`1 input, IEnumerable`1 parameters, FSharpOption`1 name) in C:\Users\schne\source\repos\CSBiology\BioFSharp\src\BioFSharp.IO\ClustalOWrapper.fs:line 347
   at <StartupCode$FSI_0010>.$FSI_0010.main@()
Stopped due to error

To Reproduce

#r "nuget: BioFSharp, 2.0.0-beta5"
#r "nuget: BioFSharp.IO, 2.0.0-beta5"

open BioFSharp
open BioFSharp.IO

open ClustalOWrapper
let cw = ClustalOWrapper()

let sequences = 
    [
    TaggedSequence.createTaggedSequence "pep1" ("AAGECGEK")
    TaggedSequence.createTaggedSequence "pep2" ("AAGEGEK")
    TaggedSequence.createTaggedSequence "pep3" ("AAAGECGEK")
    TaggedSequence.createTaggedSequence "pep4" ("AAGECGEL")
    ]

let alignedSequences = 
    cw.AlignSequences(sequences,Seq.empty)

Additional context

Also, conversion functions or overloads for AlignSequences that work on all BioSequences (BioArray, BioList, BioSeq), so something like TaggedSequence<string,seq<IBioItem>> would be awesome. Tagging @HLWeil as you're the creator ;)

kMutagene avatar May 07 '21 15:05 kMutagene

Maybe related: #61

kMutagene avatar May 07 '21 15:05 kMutagene

If I remember correctly, the ClustalO tool is executed and the output is directly read back into the F# runtime afterwards. It seems like the reader is not waiting until the ClustalO is finished. I'll look into this..

The issue #61 is unrelated but definitely needs some attention 😅 (Edit: Whoops definitely related to your additional context)

HLWeil avatar May 07 '21 21:05 HLWeil

I could reproduce the problem but: The code you wrote above shouldn't even get that far because using the ClustalOWrapper() constructor with no parameters will be looking for a default clustalo.exe in a nonsense path (mea culpa).

Downloading the tool and giving it's exe path to the constructor fixes this first hurdle:

let clustalToolPath = System.IO.Path.Combine(__SOURCE_DIRECTORY__,"clustal-omega-1.2.2-win64\clustalo.exe")
let cw = ClustalOWrapper(clustalToolPath)

But calling the AlignSequences methods fails with the error you described. The reason for this seems to be the clustalo tool failing:

> Starting -i C:\Users\HLWei\AppData\Local\Temp\5a1b3b45-1685-4741-a450-28f071ce30cf.fasta -o C:\Users\HLWei\AppData\Local\Temp\0508777c-193d-4bb7-bd27-d903cffb787a.aln --infmt=fa --outfmt=clu ...

FATAL: File C:\Users\HLWei\AppData\Local\Temp\5a1b3b45-1685-4741-a450-28f071ce30cf.fasta does not appear to be in FASTA format at line 1. You may want to specify the file format on the command line. Usually this is done with an option --infmt <fmt>.

I can reproduce this error when running the tool by hand in the powershell, which is weird as the input file is definitely in fasta format:

>pep1 AAGECGEK >pep2 AAGEGEK >pep3 AAAGECGEK >pep4 AAGECGEL

Maybe some line ending bonanza or the fasta format was changed recently :shrug:

Edit: Whoops misclicked on close issue. Will investigate further

HLWeil avatar May 07 '21 22:05 HLWeil

The line FastA.write id inPath (sequences |> Seq.map tsToFasta) does write in UTF8-BOM encoding which does not seem to be supported by ClustalO. Copying the same fastA file into a UTF8 encoded file does work properly.

Can you check whether your temporary files are encoded in UTF8-BOM or UTF8 not?

Also the FastA.write does not seem to close the writer when finished, blocking the file. If this is also the case for you I think we should open an additional issue.

AdditionalInfo:

let tsToFasta (ts:TaggedSequence.TaggedSequence<string,char>) =
    {FastA.Header = ts.Tag;
    FastA.Sequence = ts.Sequence}

HLWeil avatar May 07 '21 22:05 HLWeil

Good catch, UTF8BOM was already an issue when is tried to use fasta files for TargetP2, could have thought about that one. Ill fix that one

kMutagene avatar May 08 '21 08:05 kMutagene

Should be fixed with https://github.com/CSBiology/BioFSharp/commit/beb4158e8e8d71e7c956347db1df2f2e99f8a5e4

Adaptions to work on biosequences would still be highly appreciated

kMutagene avatar May 08 '21 09:05 kMutagene

Also, the rootpath works just fine for me as long is i have the BioFSharp repo on my machine, was that not the intention of the design?

kMutagene avatar May 08 '21 09:05 kMutagene

well yes, but actually no

This is the path it tries for me: C:\Users\schne\source\repos\CSBiology\BioFSharp\lib\clustal-omega\clustalo.exe 💯

I think either we change it to look for a global clustal installation or we make the parameter mandatory?

(Edit: Yes that was the intention. Not sure if building the repo yourself should be regarded as a a case in the code though)

HLWeil avatar May 08 '21 19:05 HLWeil