geometry icon indicating copy to clipboard operation
geometry copied to clipboard

Feature/uniform point distribution

Open tinko92 opened this issue 6 years ago • 10 comments

This is the second of three branches that I created during the GSoC. It does not depend on any of the other two branches. It contains uniform_point_distribution (for now in the extension directory), a class modeled after the random number distribution found in Boost.Random and the STL and designed to generate random geometries; in this case, points uniformly distributed on some domain.

Strategies for domains that are pointlike (CS-agnostic), box (n-d cartesian, 2d/3d spherical), segment (everything supported by line_interpolate), linear (everything supported by line_interpolate) or areal (everything that is cartesian or spherical and supported by envelope) are provided. Tests and examples are included, pictures of how the output may look like are found in this blog post: https://tinko92.github.io/gsoc2019/2019/09/24/work-product.html (second and fourth picture).

@vissarion @awulkiew This branch slightly differs from the one I submitted to you at the end of GSoC because I replaced some of my implementations for uniform sampling on segments with the usage of line_interpolate. I only noticed the algorithm yesterday which is the reason for this last-minute change.

tinko92 avatar Sep 05 '19 07:09 tinko92

@tinko92 one last comment, in gsoc project you also included the documentation of the code. Is it going to be in a different PR, or will be left as future work?

vissarion avatar Oct 18 '19 11:10 vissarion

Thanks for all the comments, I already addressed a couple of them in comments and some with a commit. I think the main question, though, is the matter of strategies and algorithms.

I am not yet sure how to best change my code to address the problem that it dispatches strategies w.r.t. geometry tag. Here is what makes it difficult:

  • the uniform_point_distribution is a little different from algorithms. It is a class meant to be used in such a way that the user obtains an instance that he may use multiple times (I think it is not always clear how many random points one might need before starting to process them). It makes sense to keep the instance because there is some work that only needs to be done once per domain before generating samples.
  • for some geometry types such as polygons or rings, multiple implementations are sensible. Rejection sampling via envelope has a lower one-time cost, rejection sampling via convex hull has lower cost per sample if more than a couple of samples are needed.
  • The class has some methods that are the same for all geometries and coordinate systems, but operator() and operator== depend on the implementations.

So what would be the preferable way to make parts of the class implementation interchangeable and extensible? It seems to be a situation where one could use the strategy pattern, but I understand that strategies should only be used for coordinate_system dependent differences.

tinko92 avatar Oct 20 '19 23:10 tinko92

@tinko92 I think we should have another level of abstraction between a CS-specific part of what is now a strategy and algorithmic part. Note that currently the CS-specific part is not really exposed, because the algorithms are called with default strategies in random strategies. E.g. in case of uniform_linear_single CS-specific part would be pt-pt distance strategy passed into length and line_interpolate strategy.

I think we need additional layer somewhere between the standard random-generator and uniform-point-distribution which would be at the algorithms/distribution side and then on the strategies side there would be CS-specific part. But I'm not sure what the interface should look like. The easiest thing would be to implement it as various point distributions and then have one default distribution able to take any geometry as a Domain. Another solution would be to take Policy in distribution, or a static tag defining the algorithm used internally. But I don't really like these ideas. I feel that it would be better to divide distribution, distributing/randomizing algorithm, CS-specific-strategy and generator somehow and combine them by either passing in constructors or operator(). I'm not sure how exactly though.

awulkiew avatar Oct 21 '19 17:10 awulkiew

There are 4 parts of the randomization:

  • distribution
  • randomizing algorithm
  • CS-specific strategy
  • generator

Randomizing algorithm has to know the domain/geometry and may need to preprocess data once and store the results for future use. This means it needs geometry and CS-specific strategy during construction (passed into the constructor). Then it has to be called while randomizing and for that it again needs CS-specific strategy and generator.

Let's start from the obvious. Generator is std or Boost.Random engine. So this is known and has to be passed into the operator() of the distribution.

Then the strategy. I propose to contain the CS-specific parts needed by all randomizing algorithms in one umbrella strategy. So to have 3 strategies, one for each CS, e.g.:

bg::strategy::random::cartesian<CalculationType> sc;
bg::strategy::random::spherical<RadiusOrSphere, CalculationType> ss(r); // Radius if needed.
bg::strategy::random::geographic<Formula, Order, Spheroid, CalculationType> sg(spheroid); // Order if needed

And internally define strategies used by whatever is needed by all of the algorithms: distance, interpolation, convex_hull, envelope, expand, etc. Maybe add a new strategy if there is a specific method of randomization for some CS like uniform_inverse_transform_sampling and also define it in the umbrella strategy.

Then there is distribution and algorithm. I can think of several possible solutions:

  1. Integrate the algorithm and distribution:
namespace bgr = boost::geometry::random;
// naming is only an example, cartesian strategy used in all cases
bgr::uniform_point_distribution_linear_xxx<Linear, Point, Strategy> d1(linear, sc);
bgr::uniform_point_distribution_linear_yyy<Linear, Point, Strategy> d2(linear, sc);
bgr::uniform_point_distribution_box_zzz<Box, Point, Strategy> d3(box, sc);
// default algorithm based on Geometry, user-defined strategy
bgr::uniform_point_distribution<Geometry, Point, Strategy> dd(g, sc);
// default algorithm and strategy
bgr::uniform_point_distribution<Geometry> dd(g);

std::mt19937 gen;
Point p1 = d1(gen);
Point pd = dd(gen);
  • Pros:
    • the relation between geometries, strategies and algorithms is clear.
  • Cons:
    • long names.
  1. Pass algorithm as a policy or tag.
// arguments may need some reordering
bgr::uniform_point_distribution<Linear, bgr::linear_xxx, Point, Strategy> d1(linear, sc);
bgr::uniform_point_distribution<Linear, bgr::linear_yyy, Point, Strategy> d2(linear, sc);
bgr::uniform_point_distribution<Box, bgr::box_zzz, Point, Strategy> d3(box, sc);
// default algorithm
bgr::uniform_point_distribution<Geometry, bgr::default_policy, Point, Strategy> ds(g, sc);
// default algorithm and strategy
bgr::uniform_point_distribution<Geometry> d(g);
  • Pros:
    • more expressive than integrated solution above
    • actually needed if there are different distributions and the same algorithms can be used in several of them
  • Cons:
    • the user has to care more about the geometry-policy pairing
    • if implemented as tags then it requires to add a more templates
    • if implemented as policies with uniform_point_distribution taking them as template template parameter then all policies has to have the same number of template parameters in order to uniform_point_distribution define the policy type for all of them the same way.
  1. Pass algorithm as optional part of the Domain
bgr::uniform_point_distribution<bgr::linear_xxx<Linear>, Point, Strategy> d1(linear, sc);
bgr::uniform_point_distribution<bgr::linear_yyy<Linear>, Point, Strategy> d2(linear, sc);
bgr::uniform_point_distribution<bgr::box_zzz<Box>, Point, Strategy> d3(box, sc);
// default algorithm
bgr::uniform_point_distribution<Geometry, Point, Strategy> ds(g, sc);
// default algorithm and strategy
bgr::uniform_point_distribution<Geometry> d(g);
  • Pros:
    • more expressive than integrated solution above
    • actually needed if there are different distributions and the same algorithms can be used in several of them
    • the algorithm and geometry relation is clear
  • Cons:
    • the semantics of the Domain is not clear since it may be a geometry or optionally an algorithm with geometry
    • the difference between this interface and std random interface is bigger

There is also an issue mentioned in one of the points above. That is, what different distributions there might be aside from uniform_point_distribution and could the same algorithms be used in them as well? What would be the common part? Rephrasing the questions: could other distributions be built on top of uniform distribution or would they require some different input? It's like with std random distributions and engines. Engines always return a random integer in some range and this is used by all distributions to generate e.g. floating point numbers. Would something like this (well defined interfacing based on a very simple operation) be possible with various geometrical distributions whatever they might be? Probably not because even now there is a specific algorithm randomizing in convex polygon so the basic operation in this case is quite complex.

If only certain algorithms can be used with certain distributions then probably dividing them into 2 parts doesn't make sense.

Btw, it's not clear to me. Is uniform_convex_polygon_sampler cartesian-only or CS-agnostic?

What do you think? Do you have different ideas? Is there something I didn't take into account?

awulkiew avatar Oct 22 '19 16:10 awulkiew

@awulkiew thank you for your elaborate comment. I'll address the easy questions first:

  • Is uniform_convex_polygon_sampler cartesian-only or CS-agnostic? I think uniform_convex_polygon_sampler in itself is CS-agnostic, however, it requires a CS-specific strategy for sampling on a spherical triangle. I did not implement that because I only created uniform_convex_polygon_sampler close to the deadline back then and the spherical geometry of directly sampling uniformly on spherical triangles seemed a little too involved for me to implement it before the end of GSoC back then.

  • What different distributions there might be aside from uniform_point_distribution and could the same algorithms be used in them as well? I have considered the question and I could think of random walks, Gaussian distributions, and stratified distributions. Out of those three, the last one might reuse algorithms implemented for the uniform distribution.

I will think more about the designs you proposed. Some tentative comments though:

  • I think the policy solution is strictly superior to the tag solution because with tags the user has to provide the same information but the whole thing is less extensible and the single parts less composable.
  • I feel like the policy solution is the most intuitive. We are always doing the same thing (obtaining uniformly distributed points in some geometry), so we always use the same interface but we have an optional parameter to optionally switch between implementations.

I have a question regarding the policy solution: What is the difference between a strategy and a policy (conceptually)? Some sources suggest, it is the same thing.

tinko92 avatar Oct 23 '19 22:10 tinko92

I also lean towards the policy solution, since it seems possible that some algorithms could be used to sample from other distributions as well. In particular, rejection sampling can be used to sample from Gaussian (or any other distribution by first sampling in a bounding box). Random walks (that referred above) can also be used to sample from various distributions (depends on the particular random walk though). All in all, I think it is a good idea to have that interface and not integrate the algorithm with the distribution.

Regarding policies vs strategies in Boost Geometry strategies are used (or supposed to be used, since there is no official documentation for this) only to separate the CS specific part of the algorithm. We use policies for the other "CS-non-specific" uses.

vissarion avatar Oct 24 '19 09:10 vissarion

I think the policy solution is strictly superior to the tag solution

This is my suggestion number 2, correct? You differentiate between policies and tags while I considered policies and tags as the same interface, so interface 2. At least from the user's perspective they would look the same. So it seems I was not precise enough and I'd like to make sure we're on the same page. So the difference I was originally thinking about would be something like this:

  • 2.1 Policies
template <typename Domain, typename Point>
struct policy
{
    // static_assert(is_compatible_with_Domain);

    policy(Domain const&, Strategy const&) {}

    template <typename Generator>
    Point apply(Generator const&, Strategy const&) { return Point(); }
};

// if void is used below then this is not needed
template <typename Domain, typename Point>
struct default_policy {};

template
<
    typename Domain,
    template Strategy = default_strategy, /*or void, though we already have this type*/
    template <typename, typename> typename Policy = default_policy, /*or void*/
    typename Point = typename point_type<Domain>::type, /*or void*/
>
struct distribution
{
public:
    distribution(Domain const& d, Strategy const& s)
        : m_strategy(s)
        , m_policy(d, m_strategy)
    {}

    template <typename Generator>
    Point operator()(Generator const& g)
    {
        // do something here

        // or instead of Generator pass something else here,
        // something which has the characteristic of the distribution
        Point p = m_policy.apply(g, m_strategy);

        // do something here

        return p;
    }

private:
    Strategy m_strategy;
    Policy<Domain, Strategy, Point> m_policy;
};

Note that in this case all policies has to have the same interface:

template <typename Domain, typename Point> struct policy2 {};
  • 2.2 Tags

The only difference is that Tags are used to indirectly define the Policy.

namespace detail {
template <typename Tag, typename Domain, typename Point = typename point_type<Domain>::type>
struct policy_type : not implemented<Tag> {};
}

template <typename Domain, typename Point> struct policy {};

struct policy_tag {};

namespace detail {
template <typename Domain, typename Point>
struct policy_type
{
    typedef policy<Domain, Point> type;
};
}

struct default_policy_tag {};

template <typename Domain, typename Tag = default_policy_tag /*, ...*/>
struct distribution
{
    /*...*/
    typename detail::policy_type<Tag, Domain /*, ...*/>::type m_policy;
};

So then in the future if we e.g. need to add a policy taking more types than the other policies we're able to modify the internals like policy_type and everything works. While in the previous case (2.1) It wouldn't. I'm not sure if this is needed though because if we faced the need to change the Policy concept then it would also interfere with the distributions, i.e. we'd have to pass some additional information to them in the first place in order to pass it further into the Policy.

So this is the different between Policies and Tags I had in mind. Are we talking about the same thing?

Of course instead of template template parameter we could take the Policy as a type but it'd mean we'd force the user to duplicate the types like that:

distribution<Domain, policy<Domain, Point>, Strategy, Point> d;

awulkiew avatar Oct 24 '19 16:10 awulkiew

I am sorry for the long delay. I will resume work on this PR and I will attempt to implement the policy solution in the near future and commit it to this PR.

tinko92 avatar Mar 17 '20 22:03 tinko92

Hello @awulkiew , I have reconsidered your solutions 2.1 und 2.2 and I genereally lean towards solution 2.1.

I have tried to consider the sensible design space for uniform random point generation and how to properly separate the non-cs/cs concerns into policies and strategies. I came up with the following proposal:

For pointlike domains, everything is trivial, no strategy/policy questions. For linear, single domains (segments), the policy is trivial, the strategy is a matter of interpolation and calculating length.

For linestrings and multilinestrings, one needs to obtain the total length L and interpolate. policies:

  1. walk over all subsegments, store cumulative lengths for each segment and total length L, sample uniform number in [0, L], find appropriate subsegment via a) linear or b) binary search, interpolate output point on selected subsegment
  2. obtain total length L, sample number in [0, L], linear search for appropriate subsegment and interpolate (constant memory) strategies required: length, uniform map to segment (interpolate)

For areal domains there are more sensible solutions, depending on the situation policies:

  1. Obtain envelope, sample uniformly from envelope until sample point is determined to be within domain.
  2. Obtain convex hull, implicitly use the fan triangulation, store cumulative triangle areas and total area A, sample uniformly in [0,A], find appropriate triangle via a) linear or b) binary search, uniformly sample on triangle, repeat until sample point is determined to be within domain.
  3. Obtain triangulation, then proceed as in 2. but no within-test is required. strategies: for 1.: envelope strategy, uniform map to envelope (box) type [new], within strategy for 2.: convex hull strategy, triangle area strategy (side strategy), uniform map to triangle [new], within strategy for 3.: triangulation strategy, triangle area strategy (side strategy), uniform map to triangle [new]

The ideas for areal domains should work similarly for volumes as well.

My GSoC 2019 does not include implementations for all of this (e.g. I do not have an algorithm to triangulate arbitrary polygons) but a lot of this and I should be able to quickly refactor my code to match this proposed layout.

Consequences of these considerations:

  • Strategies will only be used to uniformly map numbers to geometric primitives (such as lines, boxes, triangles) depending on the coordinate system. They do not need to have a state, i.e. their apply-method can be static. The new strategies are not necessarily specific to the application of random point generation, like it is the case for interpolate.
  • Policies are templated types, they take different kinds of strategies. Maybe they can be template template, such that the user does not need to respecify domain and point type which are already specified in the uniform_distribution interface.

I hope these remarks present my proposed refactoring clearly, if not it should become more clear with the next commit.

tinko92 avatar Apr 04 '20 12:04 tinko92

I have committed and pushed changes that move non-specific code into policies.

Please consider this commit WIP. I opted to push this early, so that it can discussed whether this general approach is appropriate with regard to the paradigms of Boost.Geometry and is a sound design, generally. If this approach is deemed correct, I will do one more check of the formatting, unnecessary headers, and so on, and rebase everything, so that it is a single commit that can be applied to the current develop branch.

Having said that, I will give a summary of the changes: The new interface for uniform_point_distribution is: template<typename DomainGeometry, typename Point, typename Policy> uniform_point_distribution The parameter Policy has a default value that depends on DomainGeometry and Point.

There are the following policies: single_point, multi_point (trivial, take no strategies) linear_single, linear_multi (for segments and line strings, take two strategies for distance and line_interpolate) box (for box domains, takes a box-strategy, see below) envelope_rejection (for areal and volumetric domains, takes three strategies: envelope, within, box, see below) convex_polygon (for convex rings and polygons, takes a triangle-strategy, see below, and an area strategy) convex_hull_rejection (for areal domains, takes all strategies for convex_polygon and additionally, a within-strategy)

There are the following strategies: uniform_point_distribution::cartesian_box uniform_point_distribution::spherical_box uniform_point_distribution::cartesian_triangle

These are basically for boxes and triangles, what line_interpolate is for segments. They map numbers from [0,1] to the primitive geometries (boxes or triangles) in a uniform way, which is the CS-specific aspect of the uniform point distributions. Here is an illustration of what the strategies do: https://tinko92.github.io/images/triangle.svg https://tinko92.github.io/images/box.svg https://tinko92.github.io/images/spherical_box.svg (this is from a large square on the northern hemisphere, mapped back to cartesian coordinates)

The implementation that I managed to get working in the end is closer to 2.1, but I take the policy as a type. I am aware that this causes the problem of having to duplicate template parameters that you mentioned in the end, but I don't think the policies can all have the same interface because the number of strategies that they take depends on the policy.

tinko92 avatar Apr 06 '20 13:04 tinko92