This is the story of a long lasting issue we had in the mmrm
package, and how we solved it together with the open source community. Along the way, we learned a lot about debugging memory issues in compiled code in R packages, and we want to share our learnings with you.
Irreproducible reproducibility problems
mmrm
is using the TMB
package to automatically differentiate the likelihood function of an MMRM which is coded in C++
. Therefore, like other R packages using compiled code, also for mmrm
we sometimes get additional challenges depending on compiler settings, different operating systems, etc.
Already in December 2022 we discovered that unit tests ran on CRAN machines did sometimes fail, because the results were slightly different to the snapshots we had taken (GitHub). Unfortunately, we were not able to reproduce these problems on our local machines, despite trying to create Docker images matching the CRAN machines’ configuration regarding compiler settings and operating system. In March 2023 we had another report from a developer that different MMRM fit results were obtained when knitting an Rmd file (GitHub). Other users commented, e.g., in September 2023 as well and confirmed the problem. Finally, again in August 2024 we got a similar report from Novartis colleagues Tobias Muetze, Lukas Widmer and Mark Baillie (GitHub), which confirmed the problem across various operating systems.
Lukas discovers the valgrind
key
valgrind
is a is a programming tool for memory debugging, memory leak detection, and profiling. It is widely used in the C++ community, but also in R packages that use compiled code. Lukas discovered that using R -d valgrind
makes the problems reproducible (GitHub). This was the first important step. Just one day later, Lukas made the second important discovery, by finding out that using the Docker image provided by Winston Chang for debugging R memory problems (GitHub) with a particular way of specifying the options for valgrind
allowed to actually find a memory issue:
RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite" --vanilla
The error then reported by valgrind
for the test program which uses the mmrm::mmrm()
fit function was:
==3954== Conditional jump or move depends on uninitialised value(s)
[...]
==3954== by 0x20869038: objective_function<TMBad::global::ad_aug>::operator()() (mmrm.cpp:104)
[...]
==3954== Uninitialised value was created by a stack allocation
==3954== at 0x206FC590: ??? (in /usr/local/RDvalgrind/lib/R/site-library/mmrm/libs/mmrm.so)
So from this we understood that in line 104 of mmrm.cpp
(GitHub) there is somewhere in the code stack an uninitialized value, which is then causing different results, just based on the more or less random content of the memory stack at the time of execution.
But where does this uninitialized value come from?
Hunting the uninitialized value
Just looking at the code stack, I suspected that the problem must be somewhere in the line 595 of the Eigen library routine for the LDLT decomposition (doc):
if(abs(vecD(i)) > tolerance)
because this is mentioned in the error message from valgrind
and because it is a conditional statement. But was it the left or the right hand side of the comparison that was uninitialized?
To find out, I had to also install Docker on my laptop, use the special image, and then use my rusty gdb
skills to set a breakpoint in the Eigen library routine and inspect the values of vecD(i)
and tolerance
. But more on that later.
How to run R with valgrind
in Docker
First we want to share a few tips how to run R with valgrind
in the Docker image mentioned above, because it took us some time to figure this out.
To start the Docker container based on the original image, in the below example we mount the local directory C:/Users/Daniel
to the /Daniel
directory in the Docker container. This is useful to share files between the host and the container.
docker run --rm -ti --security-opt seccomp=unconfined -v C:/Users/Daniel:/Daniel wch1/r-debug
Then in the Docker container, first just run RDvalgrind
without additional options to install required packages. In particular, we install here already the mmrm
package from the local git repository (assuming this was cloned using e.g., git clone git@github.com:openpharma/mmrm.git
), with debug compilation flags:
install.packages("devtools")
library("devtools")
with_debug(install("/Daniel/git/mmrm", upgrade = "never"))
Then we exit R again.
It is useful to save the container configuration, now including the package installation, in a new image. This makes it easier to reuse the same configuration later.
docker ps # to get the container ID
docker commit <container ID> # to create a new image based on the container
docker images -a # to get the image ID
docker tag <image ID> <new image name> # to give the new image a name that we can remember
docker images -a # to check that the new image is there
Say we named the image r-debug-mmrm
, then we can start the container with the new image and the valgrind
options as follows:
docker run --rm -ti --security-opt seccomp=unconfined -v C:/Users/Daniel:/Daniel r-debug-mmrm
Then in the Docker container, run R with the valgrind
options as follows:
RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite"
Now we can run the test program that causes the error, e.g., the mmrm::mmrm()
fit function, and see the error message from valgrind
.
Using gdb
and valgrind
together
Now that we can use R with valgrind
, we still need to add gdb
into the mix, because that allows us to interactively step through the C++
code, similarly as in R we can use browser()
to step through the R code. A great resource to start with this topic is the tutorial by Alexandra Petlanova Hajkova (link). We will need to set up two processes, on is R with valgrind
, and the other one will be the gdb
one. Then we connect them with each other and can debug.
Here is a quick summary of the practical steps:
In one terminal, start the Docker container and inside it start R with valgrind
with an additional vgdb-error
option:
RDvalgrind -d "valgrind --vgdb-error=0 --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=all"
Now in a new terminal, we need to connect to the same Docker container that has the first R process running. We can do this by using the docker exec
command:
docker ps # to get the container ID
docker exec -it <container ID> bash
Now we can start gdb
in the Docker container:
gdb /usr/local/RDvalgrind/lib/R/bin/exec/R # afterwards type c, enter
target remote | /usr/bin/vgdb --pid=<PID> # where PID is the process ID of the R process with valgrind
continue
Note that the --pid
is optional if only one valgrind
process is running.
Now we can go back to the first terminal and see that R is indeed starting up with the R terminal. When we now execute the code example again in R, we will see that the execution will stop at the point where a valgrind
problem is detected. This is because valgrind
emulates a “trapping instruction” by generating a SIGTRAP
signal. This SIGTRAP
is sent by valgrind
only when attached to gdb
, so that gdb
can intercept the signal.
Now we go to the gdb
terminal and see that it prints:
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x0000000020490715 in Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl_transposed<true, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > (this=0x1ffeffb910, rhs=..., dst=...) at /usr/local/RDvalgrind/lib/R/site-library/RcppEigen/include/Eigen/src/Cholesky/LDLT.h:598
598 if(abs(vecD(i)) > tolerance)
And indeed, this confirmed my initial suspicion about the problematic line of code!
Finding the root cause
There we are: inside the execution of the C++
code, inside R, ready with our gdb
surgery kit to find the root cause of the problem. We need to find out now what is unititialized, the left or the right hand side of the comparison. With the print
command in gdb
, we can inspect the values, e.g.:
(gdb) print i
$3 = 0
(gdb) print tolerance
$2 = {taped_value = {index = 4294967295, static NA = 4294967295}, data = {
value = 2.6755265079869555e-315,
glob = 0x204721b2 <Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::vectorD() const+54>}}
This still looks quite cryptic, but the value
of the tolerance
object looks already suspiciously small. We can further inspect the memory at the address of tolerance
:
(gdb) print &tolerance
$20 = (Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::RealScalar *) 0x1ffeffb240
(gdb) mo xb 0x1ffeffb240
00
0x1FFEFFB240: 0xff
(gdb) mo xb 0x1ffeffb240 10
00 00 00 00 ff ff ff ff
0x1FFEFFB240: 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00
ff ff
0x1FFEFFB248: 0xb2 0x21
Here the mo xb
is a Memcheck monitor command (see e.g., here for more details). The xb
tool shows the definedness bits (first line) and values (second line) starting at the memory address provided, for each byte. The values are given in hexadecimal format, and the definedness bits are ff
for defined and 00
for undefined. So here we can see that the first 4 bytes are not defined, which confirms that indeed the tolerance
object is uninitialized.
Finding a solution
We can see in the code in the RcppEigen
package that the tolerance
object is created by a call to the std::numeric_limits<RealScalar>::min
function (GitHub). So somehow this fails here and leaves tolerance
uninitialized. I confirmed this by replacing this line with a hard-coded value, and then the error indeed disappeared. But that was not a real solution yet.
I suspected that the problem is the RealScalar
type used by TMB
(which is TMBad::global::ad_aug
). Fortunately at this point Kasper Kristensen responded quickly to our comment, and prepared a reproducible example (GitHub). He realized that std::numeric_limits<Type>
class template functions (see here for details) don’t work for the “TMBad” framework in TMB
, because the required specialization is missing for the Type
being RealScalar
.
Then, by defining the specialized class as inheriting from the double
numeric limits class, the problem was solved (GitHub)!
An additional problem was solved quickly
After solving this problem, still some unreproducible results were obtained with mmrm
. Also this could be traced back to the TMB
package, and Kaspar Kristensen explained us how we could make the TMB
algorithm completely deterministic (GitHub).
Conclusion
This was one of the hardest debugging sessions I ever had, but also one of the most rewarding ones. I learned a lot about how to use valgrind
and gdb
together, and how to find the root cause of a memory issue in compiled code. It was great to see how the open source community worked together so efficiently to solve the problem: Although it took us a long time to make progress, once we had found the key with the right valgrind
options, all of this was solved in less than a month, making the mmrm
package finally completely stable and reproducible. And solving this problem in the underlying TMB
package also benefits all other packages building on top of TMB
.
Big thanks to: