Given an std::vector containing 5k CGAL::Line_2, I need to compute all the intersection points and sort them in increasing lexicographic order. I am using exact arithmetic, in particular Exact_predicates_exact_constructions_kernel.
I tried the following code:
using Point2 = CGAL::Exact_predicates_exact_constructions_kernel::Point_2;
std::vector<Point2> vertices;
vertices.resize(lines.size() * (lines.size() - 1));
std::atomic<size_t> counter{ 0 };
{
std::cout << "START vertices" << std::endl;
static tbb::affinity_partitioner ap;
tbb::parallel_for(tbb::blocked_range2d<int, int>(0, lines.size(), 0, lines.size()),
[&](tbb::blocked_range2d<int, int>& r) {
if (r.rows().end() > r.cols().begin())
{
for (int i = r.rows().begin(); i < r.rows().end(); ++i)
{
for (int k = r.cols().begin(); k < r.cols().end(); ++k)
{
if (k == i) continue;
const auto intersection = CGAL::intersection(lines[k], lines[i]);
if (intersection)
{
const Point2* point = boost::get<Point2>(&*intersection);
CGAL_precondition(point != nullptr);
vertices[counter.fetch_add(1, std::memory_order_relaxed)] = *point;
}
}
}
}
});
}
vertices.resize(counter.load());
std::cout << "START sorting of " << vertices.size() << " vertices" << std::endl;
{
static tbb::affinity_partitioner ap;
tbb::parallel_sort(vertices.begin(), vertices.end(), CGAL::Less<Point2, Point2>());
}
std::cout << "DONE sorting" << std::endl;
The code takes around 2.2 minutes with 4 cores and 4.8 minutes with 8 cores.
In particular, parallel_sort takes most of the execution time to finish.
I believe memory transfers are the problem here, as the profiler shows that most of the time is spent for OS routines calls.
Still, I am not sure about how things could be improved here, do you have any suggestion? thanks
Look up “false sharing”, e.g. stackoverflow.com/questions/22766191/…
IDK if
counter.fetch_add
is often executed (it depends of the actual values we do not know), but this portion of the code will be serialised. Atomic operation are serialized on x86-64 CPUs and they are pretty slow too (AFAIK this is mostly the cause on ARM too and many other architectures). You should partition data and use thread-local data structure to avoid performance issues.Btw, it is not clear whether
tbb::parallel_sort
is the issue or not on the profiler. It looks like this is the code above. AFAIK,tbb::parallel_sort
should already care about false-sharing, locks/atomics issues and avoid that. To confirm the memory issue hypothesis, you can use hardware performance counters. IMC ones can give you the DRAM throughput. VTune has a user-friendly benchmark for that in the list (though it tends to make my Windows crash on my machine :/ ).