TBB parallel_sort is slow for huge std::vector

Question

Given an std::vector containing 5k CGAL::Line_2, I need to compute all the intersection points and sort them in increasing lexicographic order. I am using exact arithmetic, in particular Exact_predicates_exact_constructions_kernel.

I tried the following code:

using Point2 = CGAL::Exact_predicates_exact_constructions_kernel::Point_2;
std::vector<Point2> vertices;
vertices.resize(lines.size() * (lines.size() - 1));
std::atomic<size_t> counter{ 0 };

{
    std::cout << "START vertices" << std::endl;
    static tbb::affinity_partitioner ap;
    tbb::parallel_for(tbb::blocked_range2d<int, int>(0, lines.size(), 0, lines.size()),
        [&](tbb::blocked_range2d<int, int>& r) {
            if (r.rows().end() > r.cols().begin())
            {
                for (int i = r.rows().begin(); i < r.rows().end(); ++i)
                {
                    for (int k = r.cols().begin(); k < r.cols().end(); ++k)
                    {
                        if (k == i) continue;
                        const auto intersection = CGAL::intersection(lines[k], lines[i]);
                        if (intersection)
                        {
                            const Point2* point = boost::get<Point2>(&*intersection);
                                    CGAL_precondition(point != nullptr);
                            vertices[counter.fetch_add(1, std::memory_order_relaxed)] = *point;
                        }
                    }
                }
            }

        });
}

vertices.resize(counter.load());

std::cout << "START sorting of " << vertices.size() << " vertices" << std::endl;

{
    static tbb::affinity_partitioner ap;
    tbb::parallel_sort(vertices.begin(), vertices.end(), CGAL::Less<Point2, Point2>());
}

std::cout << "DONE sorting" << std::endl;

The code takes around 2.2 minutes with 4 cores and 4.8 minutes with 8 cores.
In particular, parallel_sort takes most of the execution time to finish.
I believe memory transfers are the problem here, as the profiler shows that most of the time is spent for OS routines calls.

Still, I am not sure about how things could be improved here, do you have any suggestion? thanks

Leave a Comment Cancel reply