trinity

A library for anisotropic mesh adaptation on manycore machines.

Project maintained by hobywan.

trinity is a C++ library and command-line tool for anisotropic mesh adaptation.
It targets non-uniform memory access (NUMA) multicore and manycore processors.
It was primarily designed for performance, and hence for HPC applications.
It is intended to be used within a numerical simulation loop.

[Figure: adaptive loop]

Build and use
Features
Profile and deploy
How to contribute

Build and use

Building the library

trinity is completely standalone.
It can be built on Linux or macOS using CMake.
It only requires a C++14 compiler with OpenMP support.
It can optionally build medit to render meshes.
It supports hwloc to retrieve and print more information on the host machine.

mkdir build                                        # out-of-source build
cd build                                           #
cmake ..                                           # see build options
make -j4                                           # use multiple jobs
make install                                       # optional

Option	Description	Default
Build_Medit	Build medit mesh renderer	ON
Build_GTest	Build googletest for future unit tests	OFF
Build_Main	Build the command-line tool	ON
Build_Examples	Build built-in examples	ON
Use_Deferred	Use deferred topology updates scheme in pragmatic	OFF

Linking to your project

trinity is exported as a package.
To use it in your project, update your CMakeLists.txt with:

find_package(trinity)                           # for build|install trees
target_link_libraries(target PRIVATE trinity)   # replace 'target'

And then include trinity.h in your application.
Please take a look at the examples folder for basic usage.

Use the tool

The list of command arguments is given by the -h option.

host:~$ bin/trinity -h
Usage: trinity [options]

Options:
  -h, --help            show this help message and exit
  -m CHOICE             select mode [release|benchmark|debug]
  -a CHOICE             cpu architecture [skl|knl|kbl]
  -i STRING             initial mesh file
  -o STRING             result mesh file
  -s STRING             solution field .bb file
  -c INT                number of threads
  -b INT                vertex bucket capacity [64-256]
  -t FLOAT              target resolution factor [0.5-1.0]
  -p INT                metric field L^p norm [0-4]
  -r INT                remeshing rounds [1-5]
  -d INT                max refinement/smoothing depth [1-3]
  -v INT                verbosity level [0-2]
  -P CHOICE             enable papi [cache|cycles|tlb|branch]

For now, only .mesh files used in medit are supported.
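
If you are not familiar with the format, here is a minimal sketch of a reader for a planar ASCII .mesh file. It is not trinity's own parser: PlanarMesh and read_medit are hypothetical names, only the Vertices and Triangles sections are handled, and each entry is assumed to end with an integer reference as in medit files.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for a planar triangular mesh (not trinity's type).
struct PlanarMesh {
  std::vector<std::array<double, 2>> points;
  std::vector<std::array<int, 3>> triangles;  // zero-based vertex indices
};

// Read the Vertices and Triangles sections of an ASCII medit file.
// Any other token (header, dimension, unsupported sections) is skipped.
PlanarMesh read_medit(std::istream& in) {
  PlanarMesh mesh;
  std::string keyword;
  while (in >> keyword) {
    if (keyword == "Vertices") {
      std::size_t count = 0;
      in >> count;
      mesh.points.resize(count);
      int ref = 0;  // reference label, ignored here
      for (auto& p : mesh.points) in >> p[0] >> p[1] >> ref;
    } else if (keyword == "Triangles") {
      std::size_t count = 0;
      in >> count;
      mesh.triangles.resize(count);
      int ref = 0;
      for (auto& t : mesh.triangles) {
        in >> t[0] >> t[1] >> t[2] >> ref;
        for (int& v : t) --v;  // medit indices start at 1
      }
    } else if (keyword == "End") {
      break;
    }
  }
  return mesh;
}
```

The function takes any std::istream, so it can be fed a std::ifstream opened on the mesh file or a std::istringstream for quick checks.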

Setting thread-core affinity

For performance reasons, I recommend explicitly setting thread-core affinity before any run.
Indeed, threads should be statically bound to cores to prevent the OS from migrating them.
Besides, simultaneous multithreading (or hyper-threading on Intel) should be:

enabled to hide memory latency penalties, especially on Intel KNL.
disabled to reduce shared cache saturation on faster nodes.

It can be done by setting some environment variables:

export OMP_PLACES=[cores|threads] OMP_PROC_BIND=close  # with GNU or clang/LLVM
export KMP_AFFINITY=granularity=[core|fine],compact    # with Intel compiler  
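
To check that the setting was picked up, you can query the OpenMP runtime from a small program. The sketch below is not part of trinity, and binding_policy is a hypothetical helper. It assumes an OpenMP 4.0 or later runtime and falls back to a fixed string when compiled without OpenMP.

```cpp
#include <cassert>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Return the binding policy that the next parallel region would use.
// The integer values 0..4 follow the omp_proc_bind_t enumeration.
std::string binding_policy() {
#ifdef _OPENMP
  static const char* names[] = {"false", "true", "master", "close", "spread"};
  const int policy = static_cast<int>(omp_get_proc_bind());
  if (policy < 0 || policy > 4) return "unknown";
  return names[policy];
#else
  return "no-openmp";  // compiled without OpenMP support
#endif
}
```

With OMP_PROC_BIND=close exported and the program built with OpenMP enabled, the helper should report close.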

Features

Overview

trinity aims to reduce and equidistribute the interpolation error of a computed physical field u on a triangulated
planar domain M by adapting its discretization with respect to a target number of points n.
Basically, it takes (u, M, n) and outputs a mesh adapted to the variation of the gradient of u on M using n points.
It uses metric tensors to encode the desired point distribution with respect to the estimated error.

[Figure: principle]

It resamples and regularizes a planar triangular mesh M.
It aims to reduce and equidistribute the error of a solution field u on M using n points.
For that, it uses five kernels:

metric recovery: compute a tensor field which encodes the desired point density.
refinement: add points in areas where the error of u is large.
coarsening: remove points in areas where the error of u is small.
swapping: flip edges to locally improve cell quality.
smoothing: relocate points to locally improve cell quality.

Error estimate

trinity uses metric tensors to link the error of u with the distribution of mesh points.
A tensor encodes the desired length of the edges incident to a point, which may be direction-dependent.
trinity lets you tune the sensitivity of the error estimate according to the simulation needs.
For that, it provides a multi-scale estimate in L^p norm:

a small p will distribute points to capture all scales of the error of u.
a large p will distribute points mainly on large variation areas (shocks).
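
To make the role of a metric tensor concrete, the sketch below measures an edge e in a 2x2 symmetric metric M with the usual formula sqrt(e^T M e). An edge has the desired size when this length equals 1, so an eigenvalue 1/h^2 encodes a desired size h along the matching direction. Metric2 and metric_length are illustrative names rather than trinity's API, and the metric is assumed constant along the edge.

```cpp
#include <cassert>
#include <cmath>

// Symmetric 2x2 tensor [[xx, xy], [xy, yy]] (illustrative type).
struct Metric2 { double xx, xy, yy; };

// Length of the edge vector (ex, ey) measured in the metric m.
double metric_length(const Metric2& m, double ex, double ey) {
  const double q = m.xx * ex * ex + 2.0 * m.xy * ex * ey + m.yy * ey * ey;
  assert(q >= 0.0);  // holds when m is positive semi-definite
  return std::sqrt(q);
}
```

For instance, the metric diag(4, 1) asks for edges of size 0.5 along x and 1 along y: both (0.5, 0) and (0, 1) have unit length in that metric.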

It actually implements the continuous metric defined in:

📄 Frédéric Alauzet, Adrien Loseille, Alain Dervieux and Pascal Frey (2006).
“Multi-Dimensional Continuous Metric for Mesh Adaptation”.
In proceedings of the 15th International Meshing Roundtable, pp. 191-214, Springer Berlin.

To obtain a good mesh, it needs an accurate metric tensor field.
The latter relies on the computation of the variations of the gradient of u.
These variations are given by the local Hessian matrices of u.
They are computed in trinity through an L^2 projection.
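
A Hessian may have negative eigenvalues, whereas a metric must be positive. Hessian-based metrics therefore usually start from the absolute value |H|, obtained by taking the absolute value of the eigenvalues of H. The sketch below does this for a 2x2 symmetric matrix. It covers neither the L^2 projection nor the L^p scaling, and Sym2 and absolute_value are hypothetical names, not trinity's.

```cpp
#include <cassert>
#include <cmath>

// Symmetric 2x2 matrix [[xx, xy], [xy, yy]] (illustrative type).
struct Sym2 { double xx, xy, yy; };

// Return |H| = R |L| R^T where H = R L R^T is the eigendecomposition of h.
Sym2 absolute_value(const Sym2& h) {
  // Already diagonal: eigenvalues are the diagonal entries.
  if (h.xy == 0.0) return {std::fabs(h.xx), 0.0, std::fabs(h.yy)};

  const double mean = 0.5 * (h.xx + h.yy);
  const double diff = 0.5 * (h.xx - h.yy);
  const double radius = std::sqrt(diff * diff + h.xy * h.xy);
  const double l1 = mean + radius;  // largest eigenvalue
  const double l2 = mean - radius;  // smallest eigenvalue

  // Unit eigenvector of l1; the one of l2 is its rotation (-vy, vx).
  double vx = l1 - h.yy;
  double vy = h.xy;
  const double norm = std::hypot(vx, vy);  // non-zero since h.xy != 0
  vx /= norm;
  vy /= norm;

  const double a1 = std::fabs(l1);
  const double a2 = std::fabs(l2);
  return {a1 * vx * vx + a2 * vy * vy,
          (a1 - a2) * vx * vy,
          a1 * vy * vy + a2 * vx * vx};
}
```

As a sanity check, the matrix [[1, 2], [2, 1]] has eigenvalues 3 and -1, and its absolute value is [[2, 1], [1, 2]].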

Fine-grained parallelism

trinity enables intra-node parallelism by multithreading.
It relies on a fork-join model through OpenMP.
All kernels are structured into synchronous stages.
A stage consists of local computation, a reduction in shared-memory, and a barrier.
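
The shape of such a stage can be sketched with plain OpenMP pragmas. The example below is only an illustration and is not taken from trinity's sources: count_above is a hypothetical function that counts the values above a threshold, with a per-thread local computation, a reduction on a shared counter, and a barrier. It still gives the right answer when compiled without OpenMP, since the pragmas are then ignored.

```cpp
#include <cassert>
#include <vector>

// Count the entries of 'values' greater than 'threshold' in one stage.
int count_above(const std::vector<double>& values, double threshold) {
  const int n = static_cast<int>(values.size());
  int total = 0;  // shared result

  #pragma omp parallel
  {
    // 1. local computation: each thread scans its own chunk
    int local = 0;
    #pragma omp for nowait
    for (int i = 0; i < n; ++i) {
      if (values[i] > threshold) ++local;
    }

    // 2. reduction in shared memory
    #pragma omp atomic
    total += local;

    // 3. barrier: 'total' is complete for all threads past this point,
    //    which matters when another stage follows in the same region
    #pragma omp barrier
  }
  return total;
}
```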

It does not rely on domain partitioning, unlike coarse-grained parallel remeshers.
Nor does it rely on task parallelism or runtime systems such as Cilk, TBB or StarPU.

In fact, manycore machines have plenty of slow cores with small caches.
To scale up, one needs plenty of very fine-grained and local tasks to keep them busy.
In trinity, remeshing kernels are expressed as graphs, except for refinement.
Runnable tasks are then extracted using multithreaded heuristics:

maximal stable set for coarsening
maximal matching for swapping
maximal coloring for smoothing
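
To illustrate what extracting tasks from a graph means, here is a sequential greedy sketch of a maximal stable set. It is a textbook version, not the multithreaded heuristic used in trinity, and maximal_stable_set is a hypothetical name. Vertices of the result are pairwise non-adjacent, so the operations attached to them can run concurrently without conflicts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy maximal stable set of an undirected graph given as adjacency lists.
// The adjacency lists are assumed symmetric: w in adj[v] implies v in adj[w].
std::vector<int> maximal_stable_set(const std::vector<std::vector<int>>& adj) {
  std::vector<char> blocked(adj.size(), 0);  // 1 if a neighbor was chosen
  std::vector<int> chosen;
  for (std::size_t v = 0; v < adj.size(); ++v) {
    if (blocked[v]) continue;
    chosen.push_back(static_cast<int>(v));
    for (int w : adj[v]) blocked[w] = 1;  // neighbors can no longer be picked
  }
  return chosen;
}
```

On the path graph 0-1-2-3, the scan picks vertices 0 and 2.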

trinity fixes incidence data only at the end of each round of a kernel.
It uses an explicit synchronization scheme to fix them.
It relies on the use of low-level atomic primitives.
It was designed to minimize data movement penalties, especially on NUMA machines.
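
As a rough idea of how atomic primitives can help here, the sketch below lets several threads append cell indices to fixed-capacity per-vertex buckets by reserving slots with an atomic counter. This is an assumption-laden illustration and not trinity's actual data structure: IncidenceBuckets and its methods are hypothetical, and the fixed capacity only echoes the vertex bucket capacity option of the tool.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-vertex lists of incident cells with a fixed capacity.
class IncidenceBuckets {
public:
  IncidenceBuckets(std::size_t nb_vertices, int capacity)
    : capacity_(capacity),
      cells_(nb_vertices * static_cast<std::size_t>(capacity), -1),
      sizes_(nb_vertices) {
    assert(capacity > 0);
    for (auto& size : sizes_) size.store(0);
  }

  // Reserve a slot atomically, then write the cell index in it.
  // Returns false when the bucket of 'vertex' is already full.
  bool append(std::size_t vertex, int cell) {
    const int slot = sizes_[vertex].fetch_add(1);
    if (slot >= capacity_) return false;
    cells_[vertex * capacity_ + slot] = cell;
    return true;
  }

  // Number of stored cells (the counter may exceed capacity after overflow).
  int size(std::size_t vertex) const {
    return std::min(sizes_[vertex].load(), capacity_);
  }

  int cell(std::size_t vertex, int slot) const {
    return cells_[vertex * capacity_ + slot];
  }

private:
  int capacity_;
  std::vector<int> cells_;
  std::vector<std::atomic<int>> sizes_;
};
```

Each successful append writes to a distinct slot, so no lock is needed. Readers should only access the buckets after a barrier, once all writes of the round are done.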

For further details, please take a look at:

📄 Hoby Rakotoarivelo, Franck Ledoux, Franck Pommereau and Nicolas Le-Goff (2017).
“Scalable fine-grained metric-based remeshing algorithm for manycore/NUMA architectures”.
In proceedings of the 23rd International European Conference on Parallel and Distributed Computing, Springer.

Benchmark

Profiling

trinity is natively instrumented.
It prints runtime stats with three verbosity levels.
Here is an output example with the medium level.

[Screenshot: runtime stats at the medium verbosity level]

Stats are exported as tab-separated values and can be easily plotted with gnuplot or matplotlib.
You can use wrappi to profile on-core events such as cycles, cache misses and branch predictions.

Deployment on a cluster

Preparing a benchmark campaign can be tedious 😩.
I included some Python scripts to help set it up on a node. They let you:

compute a synthetic solution field.
rebuild sources and set thread-core affinity.
set memory affinity through numactl, which is useful on an Intel KNL node.
compact profiling data and generate a gnuplot script for plots.
profile the memory bandwidth of the host machine using STREAM.
plot the sparsity pattern of the mesh incidence graph.

They are somewhat outdated, so adapt them to your needs.

How to contribute

trinity is free and intended for research purposes.
It was written during my doctorate, so improvements are welcome.
To get involved, you can:

report bugs or request features by submitting an issue.
submit code contributions using feature branches and pull requests.

Enjoy! 😊

Copyright 2016, Hoby Rakotoarivelo
