Debugging memory issues in compiled code in R packages

This is the story of a long lasting issue we had in the mmrm package, and how we solved it together with the open source community. Along the way, we learned a lot about debugging memory issues in compiled code in R packages, and we want to share our learnings with you.

Irreproducible reproducibility problems

mmrm is using the TMB package to automatically differentiate the likelihood function of an MMRM which is coded in C++. Therefore, like other R packages using compiled code, also for mmrm we sometimes get additional challenges depending on compiler settings, different operating systems, etc.

Already in December 2022 we discovered that unit tests ran on CRAN machines did sometimes fail, because the results were slightly different to the snapshots we had taken (GitHub). Unfortunately, we were not able to reproduce these problems on our local machines, despite trying to create Docker images matching the CRAN machines’ configuration regarding compiler settings and operating system. In March 2023 we had another report from a developer that different MMRM fit results were obtained when knitting an Rmd file (GitHub). Other users commented, e.g., in September 2023 as well and confirmed the problem. Finally, again in August 2024 we got a similar report from Novartis colleagues Tobias Muetze, Lukas Widmer and Mark Baillie (GitHub), which confirmed the problem across various operating systems.

Lukas discovers the `valgrind` key

valgrind is a is a programming tool for memory debugging, memory leak detection, and profiling. It is widely used in the C++ community, but also in R packages that use compiled code. Lukas discovered that using R -d valgrind makes the problems reproducible (GitHub). This was the first important step. Just one day later, Lukas made the second important discovery, by finding out that using the Docker image provided by Winston Chang for debugging R memory problems (GitHub) with a particular way of specifying the options for valgrind allowed to actually find a memory issue:

RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite" --vanilla

The error then reported by valgrind for the test program which uses the mmrm::mmrm() fit function was:

==3954== Conditional jump or move depends on uninitialised value(s)
[...]
==3954==    by 0x20869038: objective_function<TMBad::global::ad_aug>::operator()() (mmrm.cpp:104)
[...]
==3954==  Uninitialised value was created by a stack allocation
==3954==    at 0x206FC590: ??? (in /usr/local/RDvalgrind/lib/R/site-library/mmrm/libs/mmrm.so)

So from this we understood that in line 104 of mmrm.cpp (GitHub) there is somewhere in the code stack an uninitialized value, which is then causing different results, just based on the more or less random content of the memory stack at the time of execution.

But where does this uninitialized value come from?

Hunting the uninitialized value

Just looking at the code stack, I suspected that the problem must be somewhere in the line 595 of the Eigen library routine for the LDLT decomposition (doc):

if(abs(vecD(i)) > tolerance)

because this is mentioned in the error message from valgrind and because it is a conditional statement. But was it the left or the right hand side of the comparison that was uninitialized?

To find out, I had to also install Docker on my laptop, use the special image, and then use my rusty gdb skills to set a breakpoint in the Eigen library routine and inspect the values of vecD(i) and tolerance. But more on that later.

How to run R with `valgrind` in Docker

First we want to share a few tips how to run R with valgrind in the Docker image mentioned above, because it took us some time to figure this out.

To start the Docker container based on the original image, in the below example we mount the local directory C:/Users/Daniel to the /Daniel directory in the Docker container. This is useful to share files between the host and the container.

docker run --rm -ti --security-opt seccomp=unconfined -v C:/Users/Daniel:/Daniel wch1/r-debug

Then in the Docker container, first just run RDvalgrind without additional options to install required packages. In particular, we install here already the mmrm package from the local git repository (assuming this was cloned using e.g., git clone git@github.com:openpharma/mmrm.git), with debug compilation flags:

install.packages("devtools")
library("devtools")
with_debug(install("/Daniel/git/mmrm", upgrade = "never"))

Then we exit R again.

It is useful to save the container configuration, now including the package installation, in a new image. This makes it easier to reuse the same configuration later.

docker ps # to get the container ID
docker commit <container ID> # to create a new image based on the container 
docker images -a # to get the image ID
docker tag <image ID> <new image name> # to give the new image a name that we can remember
docker images -a # to check that the new image is there

Say we named the image r-debug-mmrm, then we can start the container with the new image and the valgrind options as follows:

docker run --rm -ti --security-opt seccomp=unconfined -v C:/Users/Daniel:/Daniel r-debug-mmrm

Then in the Docker container, run R with the valgrind options as follows:

RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite"

Now we can run the test program that causes the error, e.g., the mmrm::mmrm() fit function, and see the error message from valgrind.

Using `gdb` and `valgrind` together

Now that we can use R with valgrind, we still need to add gdb into the mix, because that allows us to interactively step through the C++ code, similarly as in R we can use browser() to step through the R code. A great resource to start with this topic is the tutorial by Alexandra Petlanova Hajkova (link). We will need to set up two processes, on is R with valgrind, and the other one will be the gdb one. Then we connect them with each other and can debug.

Here is a quick summary of the practical steps:

In one terminal, start the Docker container and inside it start R with valgrind with an additional vgdb-error option:

RDvalgrind -d "valgrind --vgdb-error=0 --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=all"

Now in a new terminal, we need to connect to the same Docker container that has the first R process running. We can do this by using the docker exec command:

docker ps # to get the container ID
docker exec -it <container ID> bash

Now we can start gdb in the Docker container:

gdb /usr/local/RDvalgrind/lib/R/bin/exec/R # afterwards type c, enter
target remote | /usr/bin/vgdb --pid=<PID> # where PID is the process ID of the R process with valgrind
continue

Note that the --pid is optional if only one valgrind process is running.

Now we can go back to the first terminal and see that R is indeed starting up with the R terminal. When we now execute the code example again in R, we will see that the execution will stop at the point where a valgrind problem is detected. This is because valgrind emulates a “trapping instruction” by generating a SIGTRAP signal. This SIGTRAP is sent by valgrind only when attached to gdb, so that gdb can intercept the signal.

Now we go to the gdb terminal and see that it prints:

Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x0000000020490715 in Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl_transposed<true, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > (this=0x1ffeffb910, rhs=..., dst=...) at /usr/local/RDvalgrind/lib/R/site-library/RcppEigen/include/Eigen/src/Cholesky/LDLT.h:598
598         if(abs(vecD(i)) > tolerance)

And indeed, this confirmed my initial suspicion about the problematic line of code!

Finding the root cause

There we are: inside the execution of the C++ code, inside R, ready with our gdb surgery kit to find the root cause of the problem. We need to find out now what is unititialized, the left or the right hand side of the comparison. With the print command in gdb, we can inspect the values, e.g.:

(gdb) print i
$3 = 0
(gdb) print tolerance
$2 = {taped_value = {index = 4294967295, static NA = 4294967295}, data = {
value = 2.6755265079869555e-315,
glob = 0x204721b2 <Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::vectorD() const+54>}}

This still looks quite cryptic, but the value of the tolerance object looks already suspiciously small. We can further inspect the memory at the address of tolerance:

(gdb) print &tolerance
$20 = (Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::RealScalar *) 0x1ffeffb240  
(gdb) mo xb 0x1ffeffb240
                  00
0x1FFEFFB240:   0xff
(gdb) mo xb 0x1ffeffb240 10
                  00      00      00      00      ff      ff      ff      ff
0x1FFEFFB240:   0xff    0xff    0xff    0xff    0x00    0x00    0x00    0x00
                  ff      ff
0x1FFEFFB248:   0xb2    0x21

Here the mo xb is a Memcheck monitor command (see e.g., here for more details). The xb tool shows the definedness bits (first line) and values (second line) starting at the memory address provided, for each byte. The values are given in hexadecimal format, and the definedness bits are ff for defined and 00 for undefined. So here we can see that the first 4 bytes are not defined, which confirms that indeed the tolerance object is uninitialized.

Finding a solution

We can see in the code in the RcppEigen package that the tolerance object is created by a call to the std::numeric_limits<RealScalar>::min function (GitHub). So somehow this fails here and leaves tolerance uninitialized. I confirmed this by replacing this line with a hard-coded value, and then the error indeed disappeared. But that was not a real solution yet.

I suspected that the problem is the RealScalar type used by TMB (which is TMBad::global::ad_aug). Fortunately at this point Kasper Kristensen responded quickly to our comment, and prepared a reproducible example (GitHub). He realized that std::numeric_limits<Type> class template functions (see here for details) don’t work for the “TMBad” framework in TMB, because the required specialization is missing for the Type being RealScalar.

Then, by defining the specialized class as inheriting from the double numeric limits class, the problem was solved (GitHub)!

An additional problem was solved quickly

After solving this problem, still some unreproducible results were obtained with mmrm. Also this could be traced back to the TMB package, and Kaspar Kristensen explained us how we could make the TMB algorithm completely deterministic (GitHub).

Conclusion

This was one of the hardest debugging sessions I ever had, but also one of the most rewarding ones. I learned a lot about how to use valgrind and gdb together, and how to find the root cause of a memory issue in compiled code. It was great to see how the open source community worked together so efficiently to solve the problem: Although it took us a long time to make progress, once we had found the key with the right valgrind options, all of this was solved in less than a month, making the mmrm package finally completely stable and reproducible. And solving this problem in the underlying TMB package also benefits all other packages building on top of TMB.

Big thanks to: