TBB parallel_sort is slow for huge std::vector

Given an std::vector containing 5k CGAL::Line_2, I need to compute all the intersection points and sort them in increasing lexicographic order. I am using exact arithmetic, in particular Exact_predicates_exact_constructions_kernel.

I tried the following code:

using Point2 = CGAL::Exact_predicates_exact_constructions_kernel::Point_2;
std::vector<Point2> vertices;
vertices.resize(lines.size() * (lines.size() - 1));
std::atomic<size_t> counter{ 0 };

{
    std::cout << "START vertices" << std::endl;
    static tbb::affinity_partitioner ap;
    tbb::parallel_for(tbb::blocked_range2d<int, int>(0, lines.size(), 0, lines.size()),
        [&](tbb::blocked_range2d<int, int>& r) {
            if (r.rows().end() > r.cols().begin())
            {
                for (int i = r.rows().begin(); i < r.rows().end(); ++i)
                {
                    for (int k = r.cols().begin(); k < r.cols().end(); ++k)
                    {
                        if (k == i) continue;
                        const auto intersection = CGAL::intersection(lines[k], lines[i]);
                        if (intersection)
                        {
                            const Point2* point = boost::get<Point2>(&*intersection);
                                    CGAL_precondition(point != nullptr);
                            vertices[counter.fetch_add(1, std::memory_order_relaxed)] = *point;
                        }
                    }
                }
            }

        });
}

vertices.resize(counter.load());

std::cout << "START sorting of " << vertices.size() << " vertices" << std::endl;

{
    static tbb::affinity_partitioner ap;
    tbb::parallel_sort(vertices.begin(), vertices.end(), CGAL::Less<Point2, Point2>());
}

std::cout << "DONE sorting" << std::endl;

The code takes around 2.2 minutes with 4 cores and 4.8 minutes with 8 cores.
In particular, parallel_sort takes most of the execution time to finish.
I believe memory transfers are the problem here, as the profiler shows that most of the time is spent for OS routines calls.
Profiling
Still, I am not sure about how things could be improved here, do you have any suggestion? thanks

  • Look up “false sharing”, e.g. stackoverflow.com/questions/22766191/…

    – 




  • IDK if counter.fetch_add is often executed (it depends of the actual values we do not know), but this portion of the code will be serialised. Atomic operation are serialized on x86-64 CPUs and they are pretty slow too (AFAIK this is mostly the cause on ARM too and many other architectures). You should partition data and use thread-local data structure to avoid performance issues.

    – 

  • Btw, it is not clear whether tbb::parallel_sort is the issue or not on the profiler. It looks like this is the code above. AFAIK, tbb::parallel_sort should already care about false-sharing, locks/atomics issues and avoid that. To confirm the memory issue hypothesis, you can use hardware performance counters. IMC ones can give you the DRAM throughput. VTune has a user-friendly benchmark for that in the list (though it tends to make my Windows crash on my machine :/ ).

    – 

Leave a Comment