
Read Nexus tree or tree/matrix files

Open curtislisle opened this issue 11 years ago • 7 comments

Nexus files are used often in phylogenetics. Instead of maintaining our own parsers, we should adopt mature parsers where they exist. The package below reads Nexus and Newick files into R more reliably than ape, and uses NCL (the Nexus Class Library).

http://francoismichonneau.net/2014/12/rncl/

curtislisle avatar Jan 14 '15 04:01 curtislisle

The attached ZIP contains a simple tree in Nexus format and a corresponding character matrix. Arbor needs to be able to read files of this type; many existing packages write this format.

nexus_example_data.zip

curtislisle avatar Aug 15 '16 22:08 curtislisle

I know we have simple Nexus tree reading, but this format is complex. There is a very complete C++ implementation of the NEXUS spec available at the link below. Maybe we can use it to parse into our intermediate tree representation:

https://github.com/mtholder/ncl

curtislisle avatar Aug 15 '16 22:08 curtislisle

Flow currently assumes that any file with a Nexus extension contains trees. This is not correct. Nexus is a file format that can (and often does) contain trees, matrices, or both in a single file, and a single file can hold multiple trees and multiple matrices. Reading Nexus successfully is fairly critical for widespread adoption of Arbor.
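
To illustrate the point, here is a minimal Python sketch (not Flow's actual code) that counts the tree and matrix blocks in a Nexus string by scanning for `BEGIN <name>;` statements. The function names `nexus_block_names` and `summarize_nexus` are hypothetical, and the scan is deliberately naive: it strips only non-nested `[...]` comments and does not handle quoted tokens, so a real reader should still rely on a full parser such as NCL.

```python
import re

# Nexus keywords are case-insensitive; a block opens with "BEGIN <name>;".
_BLOCK_RE = re.compile(r"\bbegin\s+(\w+)\s*;", re.IGNORECASE)
# Simplification: removes only non-nested [comments].
_COMMENT_RE = re.compile(r"\[[^\]]*\]")


def nexus_block_names(text):
    """Return the upper-cased block names found in a Nexus string, in file order."""
    stripped = _COMMENT_RE.sub("", text)
    return [m.group(1).upper() for m in _BLOCK_RE.finditer(stripped)]


def summarize_nexus(text):
    """Count tree blocks and matrix-bearing blocks (DATA or CHARACTERS)."""
    names = nexus_block_names(text)
    return {
        "tree_blocks": names.count("TREES"),
        "matrix_blocks": names.count("DATA") + names.count("CHARACTERS"),
    }


example = """#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=3;
  TAXLABELS A B C;
END;
BEGIN CHARACTERS;
  DIMENSIONS NCHAR=2;
  FORMAT DATATYPE=STANDARD;
  MATRIX
    A 01
    B 11
    C 00
  ;
END;
BEGIN TREES;
  TREE t1 = ((A,B),C);
END;
"""

print(summarize_nexus(example))
```

A check like this would let Flow decide what a Nexus upload contains instead of inferring it from the file extension.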

curtislisle avatar Aug 16 '16 00:08 curtislisle

I can take a look. It seems that a new "trees_tables" type is appropriate for Nexus files.

jeffbaumes avatar Aug 16 '16 12:08 jeffbaumes

Thanks. This isn’t urgent, but I’d like to work on this over the next few weeks/months.


curtislisle avatar Aug 16 '16 15:08 curtislisle

It is clear that a Nexus file can contain zero, one, or more trees, and that it can contain zero or one matrices. What is not clear is whether there can be more than one matrix (or whether this ever happens in practice). This paper seems to document the Nexus format better than anything else I've seen: http://sysbio.oxfordjournals.org/content/46/4/590.full.pdf (note that the URL ends at ".pdf"). To do this right, we should collect Nexus files of all shapes and sizes and test against all of them to ensure they are all supported.

If there can be any number of trees or tables, a few workflows might make sense. I prefer a trees_tables type which has nexus format and one or more in-memory formats based on the library used to read it (such as ape or ncl). There could then be standard "Select Nexus Tree" and "Select Nexus Matrix" analyses in Flow which input a trees_tables (nexus file) and a selector (tree/table index or name) and output a single tree/table in one of the supported formats. So a workflow may look something like this:

(attached image: img_0897)
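
The selector idea could be sketched in Python roughly as follows. This is only an illustration under assumptions: the `TreesTables` class, its fields, and the two `select_nexus_*` functions are hypothetical names, trees are held as Newick strings, and matrices as plain dictionaries. None of this is existing Flow API.

```python
from dataclasses import dataclass, field


@dataclass
class TreesTables:
    """Hypothetical in-memory form of a parsed Nexus file."""
    trees: dict = field(default_factory=dict)   # name -> Newick string
    tables: dict = field(default_factory=dict)  # name -> {taxon: character states}


def _select(items, selector):
    """Pick one item by name (str) or by zero-based position (int)."""
    if isinstance(selector, int):
        names = list(items)  # dicts preserve insertion (file) order
        if not 0 <= selector < len(names):
            raise IndexError(f"index {selector} out of range for {len(names)} item(s)")
        return items[names[selector]]
    if selector not in items:
        raise KeyError(f"no item named {selector!r}")
    return items[selector]


def select_nexus_tree(data, selector=0):
    """'Select Nexus Tree': trees_tables plus selector in, a single tree out."""
    return _select(data.trees, selector)


def select_nexus_matrix(data, selector=0):
    """'Select Nexus Matrix': trees_tables plus selector in, a single table out."""
    return _select(data.tables, selector)


data = TreesTables(
    trees={"t1": "((A,B),C);", "t2": "(A,(B,C));"},
    tables={"chars": {"A": "01", "B": "11", "C": "00"}},
)
print(select_nexus_tree(data, "t2"))
print(select_nexus_matrix(data))
```

Accepting either an index or a name keeps the selector usable both for files with unnamed trees and for files where the user knows the tree or matrix by its label.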

jeffbaumes avatar Aug 16 '16 17:08 jeffbaumes

I agree with this approach of having a combined format plus selector steps in a workflow. I am working with David Maddison this week. I'll ask him for sample files and for how many trees and matrices are allowed per file.


curtislisle avatar Aug 17 '16 04:08 curtislisle