nbodykit

abundance matching is very confusing.

Open rainwoodman opened this issue 9 years ago • 8 comments

I want to generate a selection column for a given abundance, ranked by another column. The current ad-hoc way of doing so has confused Elena, and I think it will confuse others down the road as well.

This can be accomplished with two steps:

  1. Generate an Abundance column from another column, using the ranking (collective argsort), and maybe divide by the volume to convert to number density.

  2. Generate a selection column from an Abundance column.
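The two steps above can be sketched in plain NumPy on a single process. This is only an illustration: the helper names `abundance_from_column` and `selection_from_abundance` are hypothetical and not part of nbodykit, and a parallel version would need a collective argsort in place of `np.argsort`.

```python
import numpy as np

def abundance_from_column(values, volume):
    """Cumulative number density n(>= value) per object (hypothetical helper)."""
    values = np.asarray(values)
    # Indices that order the objects from largest to smallest value.
    order = np.argsort(-values, kind="stable")
    # Scatter positions back so rank[i] is the rank of object i (0 = largest).
    rank = np.empty(len(values), dtype="i8")
    rank[order] = np.arange(len(values))
    # Dividing the 1-based rank by the volume gives a number density.
    return (rank + 1) / volume

def selection_from_abundance(abundance, target):
    """Boolean column selecting objects up to the target number density."""
    return abundance <= target

mass = np.array([1e12, 5e13, 3e11, 8e12, 2e14])
abundance = abundance_from_column(mass, volume=1000.0)
selected = selection_from_abundance(abundance, target=3 / 1000.0)
```

Here `selected` marks the three most massive objects, since the target density corresponds to three objects in the assumed volume.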

How does this proposal interfere with halotools and the current Halo source? @nickhand

rainwoodman avatar Jan 28 '17 19:01 rainwoodman

Yes, I think that is fine. The to_halotools function of the HaloCatalog source accepts a selection keyword that specifies which halos go into the catalog. So you just need to add the Selection column based on the Abundance to the HaloCatalog, and then only those halos will be populated.

nickhand avatar Jan 29 '17 02:01 nickhand

@rainwoodman where are we on this? Do we want to add a sort() / argsort() to Catalog objects? Feels like that would be nice now that selecting subsets of catalogs is a bit easier.

We should also think through whether implementing actual slices or integer lists is useful and whether that should be collective or non-collective. I can imagine sorting by mass and then saying give me the top X objects, but that could be difficult in parallel....

nickhand avatar Apr 07 '17 04:04 nickhand

Computing a sorting rank and then filtering is easier than sorting the actual data.

rainwoodman avatar Apr 07 '17 19:04 rainwoodman

A sorting rank can be computed with two mpsort calls. The problem is that MP-sort only takes integer keys (it is a radix sort, and the dynamic range of a double is too big).
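As a single-process analogue of the two-sort idea, the rank can be obtained by sorting twice: once by key, then once more to return to the original layout. The sketch below uses `np.argsort` as a stand-in and does not use the mpsort API; the key values are made up for illustration.

```python
import numpy as np

# Integer keys, as a radix sort would require (illustrative values).
keys = np.array([30, 10, 20, 10], dtype="u8")

# First sort: original indices ordered by key.
order = np.argsort(keys, kind="stable")

# Second sort: positions ordered by original index, so rank[i] is the
# position of object i in the sorted sequence.
rank = np.argsort(order, kind="stable")
```

A filter such as `rank < N` then picks the N smallest keys without ever moving the actual data.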

rainwoodman avatar Apr 07 '17 19:04 rainwoodman

@rainwoodman so if we want to sort by something like mass, how would we do this exactly with mpsort? We would need to make a 'u8' sort rank column first?

nickhand avatar May 03 '17 03:05 nickhand

Yes. I'd first suppress the dynamic range with a log, then scale it up to integers. The result won't always be exactly sorted, because several floating-point numbers may map to the same integer (hence I did not think it was a good idea to let mpsort do this). It should be good enough for abundance matching.
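A minimal sketch of the log-then-scale idea, assuming strictly positive input. The function name `log_integer_keys` and the `bits` parameter are hypothetical, and in parallel the minimum and maximum would have to be global values shared by all ranks.

```python
import numpy as np

def log_integer_keys(values, bits=32):
    """Map positive floats to unsigned integer keys (hypothetical helper)."""
    values = np.asarray(values, dtype="f8")
    logv = np.log(values)  # compress the dynamic range
    lo, hi = logv.min(), logv.max()
    span = hi - lo if hi > lo else 1.0  # avoid dividing by zero
    scale = (2 ** bits - 1) / span
    # Monotonic but lossy: nearby values can share a key.
    return np.floor((logv - lo) * scale).astype("u8")

mass = np.array([1e10, 1.0001e10, 1e15])
keys = log_integer_keys(mass, bits=8)
```

With only 8 bits the first two masses collapse to the same key, which is the loss of exact ordering described above; more bits reduce but do not eliminate such ties.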

rainwoodman avatar May 04 '17 19:05 rainwoodman

Okay, I think I understand what's going on here. Some simple tests seem to indicate that we can do something like:

precision = '4'
# reinterpret the float bytes as unsigned integers of the same width
sorting_keys = np.frombuffer(data.astype('f' + precision).tobytes(), dtype='u' + precision)

This should re-interpret the floating-point binary representation as integers, which also preserves the rank ordering for positive input. And we can take the log of data first if we think that is necessary.
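The ordering claim can be checked with a short test on random positive data. This is a sketch that uses `ndarray.view` for the reinterpretation and assumed sample data; it only verifies rank ordering and says nothing about whether such keys are suitable for mpsort.

```python
import numpy as np

rng = np.random.default_rng(0)
# Strictly positive single-precision values spanning many decades.
data = rng.lognormal(mean=0.0, sigma=5.0, size=1000).astype("f4")

# Reinterpret the same 4 bytes per element as unsigned integers.
keys = data.view("u4")

# For positive IEEE 754 floats the bit patterns sort like the values.
order_float = np.argsort(data, kind="stable")
order_int = np.argsort(keys, kind="stable")
same_order = np.array_equal(order_float, order_int)
```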

nickhand avatar May 04 '17 21:05 nickhand

No, this will not work. mpsort needs more than the rank order: it needs the radix to be numerical, because it does a binary search for histogramming. This will mess up the exponents.

rainwoodman avatar May 04 '17 21:05 rainwoodman