A library for anisotropic mesh adaptation on manycore machines.
trinity is a C++ library and command-line tool for anisotropic mesh adaptation.
It targets non-uniform memory access (NUMA) multicore and manycore processors.
It was primarily designed for performance, and hence for HPC applications.
It is intended to be used within a numerical simulation loop.

trinity is completely standalone.
It can be built on Linux or macOS using CMake.
It only requires a C++14 compiler with OpenMP support.
It can optionally build medit to render meshes.
It supports hwloc to retrieve and print more information on the host machine.
```bash
mkdir build    # out-of-source build
cd build
cmake ..       # see build options
make -j4       # use multiple jobs
make install   # optional
```
| Option | Description | Default |
|---|---|---|
| Build_Medit | Build medit mesh renderer | ON |
| Build_GTest | Build googletest for future unit tests | OFF |
| Build_Main | Build the command-line tool | ON |
| Build_Examples | Build built-in examples | ON |
| Use_Deferred | Use deferred topology updates scheme in pragmatic | OFF |
trinity is exported as a package.
To use it in your project, update your CMakeLists.txt with:
```cmake
find_package(trinity)                            # for build|install trees
target_link_libraries(target PRIVATE trinity)    # replace 'target'
```
Then include trinity.h in your application.
Please take a look at the examples folder for basic usage.
The list of command-line arguments is printed by the -h option.
```
host:~$ bin/trinity -h
Usage: trinity [options]
Options:
  -h, --help   show this help message and exit
  -m CHOICE    select mode [release|benchmark|debug]
  -a CHOICE    cpu architecture [skl|knl|kbl]
  -i STRING    initial mesh file
  -o STRING    result mesh file
  -s STRING    solution field .bb file
  -c INT       number of threads
  -b INT       vertex bucket capacity [64-256]
  -t FLOAT     target resolution factor [0.5-1.0]
  -p INT       metric field L^p norm [0-4]
  -r INT       remeshing rounds [1-5]
  -d INT       max refinement/smoothing depth [1-3]
  -v INT       verbosity level [0-2]
  -P CHOICE    enable papi [cache|cycles|tlb|branch]
```
For now, only the .mesh files used by medit are supported.
For performance reasons, I recommend explicitly setting thread-core affinity before any run.
Indeed, threads should be statically bound to cores to prevent the OS from migrating them.
Besides, simultaneous multithreading (or hyperthreading on Intel) should be:
This can be done by setting the following environment variables (pick one of the bracketed values):
```bash
export OMP_PLACES=[cores|threads] OMP_PROC_BIND=close     # with GNU or clang/LLVM
export KMP_AFFINITY=granularity=[core|fine],compact       # with Intel compiler
```
trinity aims to reduce and equidistribute the interpolation error of a computed physical field u on a triangulated
planar domain M by adapting its discretization with respect to a target number of points n.
Basically, it takes (u, M, n) and outputs a mesh adapted to the variation of the gradient of u on M using n points.
It uses metric tensors to encode the desired point distribution with respect to the estimated error.

It resamples and regularizes a planar triangular mesh M.
It aims to reduce and equidistribute the error of a solution field u on M using n points.
For that, it uses five kernels:
trinity uses metric tensors to link the error of u with the distribution of mesh points.
A tensor encodes the desired edge length incident to a point, which may be direction-dependent.
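
To illustrate how a tensor encodes a direction-dependent length, here is a minimal sketch that is not taken from trinity: it assumes a 2x2 symmetric metric stored as three doubles (the `Metric2D` type and both functions are hypothetical). The length of an edge e measured in a metric M is sqrt(e^T M e); choosing eigenvalues 1/h^2 makes an edge of length h along the corresponding axis exactly one unit long, which is what an adapted mesh aims for.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical 2x2 symmetric metric tensor M = [[a, b], [b, c]].
// Illustration only: this is not trinity's internal data layout.
struct Metric2D {
  double a, b, c;
};

// Anisotropic metric whose principal axes are x and y, prescribing a
// desired edge length hx along x and hy along y (eigenvalues 1/h^2).
Metric2D axis_aligned_metric(double hx, double hy) {
  assert(hx > 0.0 && hy > 0.0);
  return {1.0 / (hx * hx), 0.0, 1.0 / (hy * hy)};
}

// Length of the edge p -> q measured in metric M: sqrt(e^T M e).
double metric_length(const Metric2D& M,
                     double px, double py, double qx, double qy) {
  const double ex = qx - px;
  const double ey = qy - py;
  const double s = M.a * ex * ex + 2.0 * M.b * ex * ey + M.c * ey * ey;
  return std::sqrt(s);
}
```

With `axis_aligned_metric(0.1, 1.0)`, an edge of Euclidean length 0.1 along x and an edge of length 1 along y both have unit metric length, while an edge of length 0.1 along y measures only 0.1 and would be considered too short.
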
trinity lets you tune the sensitivity of the error estimate to the needs of the simulation.
For that, it provides a multi-scale estimate in L^p norm.
It actually implements the continuous metric defined in:
📄 Frédéric Alauzet, Adrien Loseille, Alain Dervieux and Pascal Frey (2006).
“Multi-Dimensional Continuous Metric for Mesh Adaptation”.
In Proceedings of the 15th International Meshing Roundtable, pp. 191-214, Springer, Berlin.
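
For reference, the optimal L^p metric of that framework is commonly stated as follows for a planar domain Ω and a target number of points n. Here H_u is the Hessian of u, |H_u| is obtained by taking the absolute values of its eigenvalues, and the constant D_{L^p} normalizes the metric so that its complexity, the integral of sqrt(det M) over Ω, equals n. This is a summary for convenience rather than a transcription of trinity's code, so please check the paper for the exact assumptions.

```math
\mathcal{M}_{L^p}(x) = D_{L^p}\,\det\big(|H_u(x)|\big)^{-\frac{1}{2p+2}}\,|H_u(x)|,
\qquad
D_{L^p} = n\left(\int_{\Omega}\det\big(|H_u(x)|\big)^{\frac{p}{2p+2}}\,dx\right)^{-1}
```
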
To obtain a good mesh, it needs an accurate metric tensor field.
The latter relies on the variations of the gradient of u, which are given by its local Hessian matrices.
In trinity, these are computed through an L^2 projection.
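
A minimal sketch of one common variant of this recovery, assuming a lumped-mass (area-weighted) L^2 projection of the piecewise-constant P1 gradient onto mesh vertices; applying it a second time to each gradient component yields a nodal Hessian. The `Mesh` layout and both function names are hypothetical, and trinity's own projection may differ.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal triangle mesh: vertex coordinates and triangles.
struct Mesh {
  std::vector<std::array<double, 2>> xy;  // vertex coordinates
  std::vector<std::array<int, 3>> tris;   // vertex indices per triangle
};

// Nodal gradient of the P1 field f by lumped L^2 projection:
// grad(v) = sum_K |K| grad_K(f) / sum_K |K| over triangles K around v.
std::vector<std::array<double, 2>> recover_gradient(
    const Mesh& m, const std::vector<double>& f) {
  const std::size_t nv = m.xy.size();
  std::vector<std::array<double, 2>> g(nv, {0.0, 0.0});
  std::vector<double> w(nv, 0.0);
  for (const auto& t : m.tris) {
    const auto& p0 = m.xy[t[0]];
    const auto& p1 = m.xy[t[1]];
    const auto& p2 = m.xy[t[2]];
    // Twice the signed area of the triangle.
    const double d = (p1[0] - p0[0]) * (p2[1] - p0[1]) -
                     (p2[0] - p0[0]) * (p1[1] - p0[1]);
    assert(d != 0.0);
    // Constant gradient of the linear interpolant on this triangle.
    const double df1 = f[t[1]] - f[t[0]];
    const double df2 = f[t[2]] - f[t[0]];
    const double gx = (df1 * (p2[1] - p0[1]) - df2 * (p1[1] - p0[1])) / d;
    const double gy = (df2 * (p1[0] - p0[0]) - df1 * (p2[0] - p0[0])) / d;
    const double area = 0.5 * std::fabs(d);
    for (int v : t) {  // accumulate area-weighted contributions
      g[v][0] += area * gx;
      g[v][1] += area * gy;
      w[v] += area;
    }
  }
  for (std::size_t v = 0; v < nv; ++v)
    if (w[v] > 0.0) { g[v][0] /= w[v]; g[v][1] /= w[v]; }
  return g;
}

// Second projection on each gradient component, symmetrized as
// H = (J + J^T) / 2 and stored per vertex as {hxx, hxy, hyy}.
std::vector<std::array<double, 3>> recover_hessian(
    const Mesh& m, const std::vector<double>& f) {
  const auto g = recover_gradient(m, f);
  std::vector<double> gx(g.size()), gy(g.size());
  for (std::size_t v = 0; v < g.size(); ++v) { gx[v] = g[v][0]; gy[v] = g[v][1]; }
  const auto dgx = recover_gradient(m, gx);
  const auto dgy = recover_gradient(m, gy);
  std::vector<std::array<double, 3>> h(g.size());
  for (std::size_t v = 0; v < g.size(); ++v)
    h[v] = {dgx[v][0], 0.5 * (dgx[v][1] + dgy[v][0]), dgy[v][1]};
  return h;
}
```

For a linear field the recovered gradient is exact and the recovered Hessian vanishes, which is a quick sanity check for any such projection.
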

trinity enables intra-node parallelism by multithreading.
It relies on a fork-join model through OpenMP.
All kernels are structured into synchronous stages.
A stage consists of local computation, a reduction in shared memory, and a barrier.
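
The stage structure above can be sketched with plain OpenMP. In this hypothetical example (`normalize_weights` is not a trinity function), threads compute partial sums locally, a `reduction` clause combines them in shared memory, and the implicit barrier at the end of the first work-shared loop guarantees every thread sees the reduced value before the second stage starts. Without OpenMP, the pragmas are ignored and the code runs serially.

```cpp
#include <cassert>
#include <vector>

// Rescale 'w' in place so that its entries sum to 1.
// Assumes non-negative entries with a positive sum.
void normalize_weights(std::vector<double>& w) {
  const long n = static_cast<long>(w.size());
  double total = 0.0;

  #pragma omp parallel  // fork
  {
    // Stage 1: local partial sums, combined into 'total' by the reduction.
    #pragma omp for reduction(+ : total)
    for (long i = 0; i < n; ++i) total += w[i];
    // Implicit barrier: the reduced 'total' is now visible to all threads.
    assert(total > 0.0);

    // Stage 2: each thread rescales its share using the stage-1 result.
    #pragma omp for
    for (long i = 0; i < n; ++i) w[i] /= total;
  }  // join
}
```
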
Unlike coarse-grained parallel remeshers, it does not rely on domain partitioning.
Nor does it rely on task parallelism and runtime systems such as Cilk, TBB or StarPU.
In fact, manycore machines have many slow cores with small caches.
To scale, one needs a large number of very fine-grained, local tasks to keep them busy.
In trinity, all remeshing kernels except refinement are expressed as graphs.
Runnable tasks are then extracted from these graphs using multithreaded heuristics.
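
Graph coloring is one classic way to extract runnable tasks: tasks sharing a color do not conflict, so each color class can be processed in parallel. The sketch below is a sequential greedy coloring on a hypothetical adjacency-list conflict graph; trinity's heuristics are multithreaded, so this only illustrates the idea, not its implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy coloring of a conflict graph given as adjacency lists.
// Tasks joined by an edge (e.g. sharing a mesh vertex) get different
// colors, so all tasks of one color can run concurrently.
std::vector<int> greedy_coloring(const std::vector<std::vector<int>>& adj) {
  const std::size_t n = adj.size();
  std::vector<int> color(n, -1);
  std::vector<char> used;  // scratch: colors already taken by neighbors
  for (std::size_t v = 0; v < n; ++v) {
    // The smallest free color is at most deg(v), so deg(v)+1 slots suffice.
    used.assign(adj[v].size() + 1, 0);
    for (int u : adj[v]) {
      assert(u >= 0 && static_cast<std::size_t>(u) < n);
      const int c = color[u];
      if (c >= 0 && c < static_cast<int>(used.size())) used[c] = 1;
    }
    int c = 0;
    while (used[c]) ++c;  // smallest color not used by a neighbor
    color[v] = c;
  }
  return color;
}

// True if no edge joins two tasks of the same color.
bool is_valid_coloring(const std::vector<std::vector<int>>& adj,
                       const std::vector<int>& color) {
  for (std::size_t v = 0; v < adj.size(); ++v)
    for (int u : adj[v])
      if (color[v] == color[u]) return false;
  return true;
}
```
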

trinity repairs incidence data only at the end of each kernel round.
It uses an explicit synchronization scheme to fix them.
It relies on the use of low-level atomic primitives.
It was designed to minimize data movement penalties, especially on NUMA architectures.
For further details, please take a look at:
📄 Hoby Rakotoarivelo, Franck Ledoux, Franck Pommereau and Nicolas Le-Goff (2017).
“Scalable fine-grained metric-based remeshing algorithm for manycore/NUMA architectures”.
In Proceedings of the 23rd International European Conference on Parallel and Distributed Computing, Springer.
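
A minimal sketch of how atomics can rebuild vertex-to-triangle incidence without locks, assuming a CSR-like layout (`Incidence` and `rebuild_incidence` are hypothetical names). A first pass counts incident triangles with `fetch_add`, a prefix sum turns counts into offsets, and a second pass lets each thread claim a distinct slot with `fetch_add`. trinity's actual scheme may differ, notably in how it places data on NUMA nodes.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR vertex -> triangle incidence.
struct Incidence {
  std::vector<int> offset;  // size nv + 1
  std::vector<int> tri;     // incident triangle ids, grouped by vertex
};

Incidence rebuild_incidence(const std::vector<std::array<int, 3>>& tris,
                            int nv) {
  assert(nv >= 0);
  const long nt = static_cast<long>(tris.size());
  std::vector<std::atomic<int>> deg(static_cast<std::size_t>(nv));
  for (auto& d : deg) d.store(0, std::memory_order_relaxed);

  // Pass 1: count incident triangles per vertex with atomic increments.
  #pragma omp parallel for
  for (long t = 0; t < nt; ++t)
    for (int v : tris[t]) deg[v].fetch_add(1, std::memory_order_relaxed);

  // Exclusive prefix sum (sequential here for simplicity).
  Incidence inc;
  inc.offset.assign(static_cast<std::size_t>(nv) + 1, 0);
  for (int v = 0; v < nv; ++v)
    inc.offset[v + 1] = inc.offset[v] + deg[v].load(std::memory_order_relaxed);
  inc.tri.assign(static_cast<std::size_t>(inc.offset[nv]), -1);

  // Pass 2: each insertion claims a distinct slot in its vertex's range.
  std::vector<std::atomic<int>> cursor(static_cast<std::size_t>(nv));
  for (int v = 0; v < nv; ++v)
    cursor[v].store(inc.offset[v], std::memory_order_relaxed);

  #pragma omp parallel for
  for (long t = 0; t < nt; ++t)
    for (int v : tris[t]) {
      const int slot = cursor[v].fetch_add(1, std::memory_order_relaxed);
      inc.tri[slot] = static_cast<int>(t);
    }
  return inc;
}
```

When run in parallel, the order of triangles within one vertex's range is not deterministic, only the set is.
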
trinity is natively instrumented.
It prints runtime statistics at three verbosity levels.
Here is an example output at the medium level.

Stats are exported as tab-separated values and can be easily plotted with gnuplot or matplotlib.
You can use wrappi to profile on-core events such as cycles, cache misses and branch predictions.
Preparing a benchmark campaign can be tedious 😩.
I included some Python scripts to help set it up on a node, enabling you to:
numactl, which is useful on an Intel KNL node.
They are somewhat outdated, so adapt them to your needs.
trinity is free and intended for research purposes.
It was written during my doctorate, so improvements are welcome.
To get involved, you can:
Enjoy! 😊