author | Nicolas Patry <patry.nicolas@protonmail.com> | 2023-08-02 18:16:50 +0200 |
---|---|---|
committer | Nicolas Patry <patry.nicolas@protonmail.com> | 2023-08-02 18:40:24 +0200 |
commit | ae68635af9dfcae359f621dd3e1df3b3c3d97042 (patch) | |
tree | df1ea669007deaf1b0414807e056b729c6eae864 /candle-book | |
parent | c11e78b33454b976ad97b1534cc06eb027356865 (diff) | |
Add small error management.
Diffstat (limited to 'candle-book')
-rw-r--r-- | candle-book/src/error_manage.md | 12 |
1 files changed, 12 insertions, 0 deletions
diff --git a/candle-book/src/error_manage.md b/candle-book/src/error_manage.md
index af7593d6..c1a16bd9 100644
--- a/candle-book/src/error_manage.md
+++ b/candle-book/src/error_manage.md
@@ -36,4 +36,16 @@ Another thing to note, is that since Rust is compiled it is not necessarily as e
 especially in release builds. We're using [`anyhow`](https://docs.rs/anyhow/latest/anyhow/) for that.
 The library is still young, please [report](https://github.com/LaurentMazare/candle/issues) any issues detecting where an error is coming from.
+## Cuda error management
+
+When running a model on Cuda, you might get a stacktrace not really representing the error.
+The reason is that CUDA is async by nature, and therefore the error might be caught while you were sending totally different kernels.
+
+One way to avoid this is to use `CUDA_LAUNCH_BLOCKING=1` as an environment variable. This will force every kernel to be launched sequentially.
+You might still however see the error happening on other kernels as the faulty kernel might exit without an error but spoiling some pointer for which the error will happen when dropping the `CudaSlice` only.
+
+
+If this occurs, you can use [`compute-sanitizer`](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html)
+This tool is like `valgrind` but for cuda. It will help locate the errors in the kernels.
+
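
Note on the added section: the two debugging steps it describes (setting `CUDA_LAUNCH_BLOCKING=1`, then falling back to `compute-sanitizer`) could be sketched in Rust roughly as follows. This is only an illustration under stated assumptions: the helper names `blocking_command` and `sanitizer_command` and the binary path `./target/debug/my-model` are hypothetical and not part of candle, and `compute-sanitizer` is assumed to be on `PATH`. The sketch only builds the commands and prints them; it does not launch anything.

```rust
use std::process::Command;

/// Hypothetical helper (not part of candle): prepares `program` so that it
/// would run with `CUDA_LAUNCH_BLOCKING=1`, forcing synchronous kernel launches.
fn blocking_command(program: &str, args: &[&str]) -> Command {
    let mut cmd = Command::new(program);
    cmd.args(args).env("CUDA_LAUNCH_BLOCKING", "1");
    cmd
}

/// Hypothetical helper: wraps the same invocation in `compute-sanitizer`,
/// which is assumed to be installed and reachable through PATH.
fn sanitizer_command(program: &str, args: &[&str]) -> Command {
    let mut cmd = Command::new("compute-sanitizer");
    cmd.arg(program).args(args);
    cmd
}

fn main() {
    // Placeholder binary path and arguments, for illustration only.
    let program = "./target/debug/my-model";
    let args = ["--prompt", "hello"];

    let blocking = blocking_command(program, &args);
    let sanitized = sanitizer_command(program, &args);

    // Only inspect the prepared commands; call `.status()` to actually run them.
    for cmd in [&blocking, &sanitized] {
        println!("program: {:?}", cmd.get_program());
        println!("args:    {:?}", cmd.get_args().collect::<Vec<_>>());
        println!("env:     {:?}", cmd.get_envs().collect::<Vec<_>>());
    }
}
```

Calling `.status()` on either command would actually run it. The blocking variant is the cheaper first step, and the sanitizer variant is the fallback when the reported error still points at the wrong kernel.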