author: Nicolas Patry <patry.nicolas@protonmail.com> 2023-08-02 18:16:50 +0200
committer: Nicolas Patry <patry.nicolas@protonmail.com> 2023-08-02 18:40:24 +0200
commit: ae68635af9dfcae359f621dd3e1df3b3c3d97042
tree: df1ea669007deaf1b0414807e056b729c6eae864 /candle-book
parent: c11e78b33454b976ad97b1534cc06eb027356865
Add small error management.
Diffstat (limited to 'candle-book')
-rw-r--r-- candle-book/src/error_manage.md 12
1 files changed, 12 insertions, 0 deletions
diff --git a/candle-book/src/error_manage.md b/candle-book/src/error_manage.md
index af7593d6..c1a16bd9 100644
--- a/candle-book/src/error_manage.md
+++ b/candle-book/src/error_manage.md
@@ -36,4 +36,16 @@ Another thing to note, is that since Rust is compiled it is not necessarily as e
especially in release builds. We're using [`anyhow`](https://docs.rs/anyhow/latest/anyhow/) for that.
The library is still young, please [report](https://github.com/LaurentMazare/candle/issues) any issues detecting where an error is coming from.
+## CUDA error management
+
+When running a model on CUDA, the stack trace you get might not point at the operation that actually failed.
+The reason is that CUDA kernel launches are asynchronous, so an error may only be reported later, while entirely different kernels are being launched.
+
+One way to avoid this is to set the environment variable `CUDA_LAUNCH_BLOCKING=1`. This forces each kernel launch to block until the kernel has completed, so kernels run sequentially.
+You might still see the error surface in a different kernel, however: the faulty kernel can exit without reporting an error while corrupting the memory behind some pointer, and the error then only shows up when the corresponding `CudaSlice` is dropped.
+
+
+If this occurs, you can use [`compute-sanitizer`](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html).
+This tool is like `valgrind` but for CUDA. It helps locate errors in the kernels.
+
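Reviewer sketch, not part of the patch: the `CUDA_LAUNCH_BLOCKING=1` advice above can also be applied from a small Rust launcher, so the variable is set only for the process being debugged. The helper names `blocking_command` and `has_launch_blocking` are hypothetical and use only the standard library; the program name passed in is a placeholder for your own binary.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Builds a command that runs `program` with `CUDA_LAUNCH_BLOCKING=1`
/// set for the child process only. `program` and `args` are placeholders
/// for your own binary and its arguments (hypothetical, not from candle).
pub fn blocking_command(program: &str, args: &[&str]) -> Command {
    let mut cmd = Command::new(program);
    cmd.args(args);
    // Set on the child so the variable is already in place when the
    // child process initializes CUDA.
    cmd.env("CUDA_LAUNCH_BLOCKING", "1");
    cmd
}

/// Returns true if `cmd` will start its child with `CUDA_LAUNCH_BLOCKING=1`.
pub fn has_launch_blocking(cmd: &Command) -> bool {
    cmd.get_envs().any(|(key, value)| {
        key == OsStr::new("CUDA_LAUNCH_BLOCKING") && value == Some(OsStr::new("1"))
    })
}
```

Calling `blocking_command("my-candle-app", &[]).status()` from `main` would then run the (hypothetical) `my-candle-app` binary with blocking kernel launches, leaving the parent environment untouched.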