Results 2 issues of ShvetsKS

This is part of #7192. The current implementation of partition allows bin matrix data to be read with less random access.

full train | airline, ~100m | higgs, ~10m
-- | -- | --
...

Here are several optimizations of the CPU hist method, mostly applicable to large datasets such as Airline (100M), Higgs (10M), and Epsilon (2k x 400k), to increase hardware utilization (as for such large data it's better...