path: root/candle-kernels
Commit message | Author | Date | Files | Lines (-/+)
* Import the ggml_cuda_dp4a function. (#2628) | Laurent Mazare | 2024-11-19 | 1 file | -33/+44
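
dp4a is the instruction that makes the quantized vec-dot kernels fast: a 4-way int8 dot product accumulated in one operation. Below is a minimal sketch of what such a helper can look like, in the spirit of the ggml/llama.cpp helper the commit imports; the name and the exact architecture guard are illustrative, not the actual imported code.

    // Minimal sketch of a dp4a wrapper: hardware __dp4a on sm_61+, scalar fallback elsewhere.
    static __device__ __forceinline__ int dp4a_sketch(const int a, const int b, int c) {
    #if __CUDA_ARCH__ >= 610
        return __dp4a(a, b, c);  // 4x int8 multiply-accumulate into c
    #else
        // Reinterpret each 32-bit word as four signed 8-bit lanes and accumulate manually.
        const char4 va = *reinterpret_cast<const char4 *>(&a);
        const char4 vb = *reinterpret_cast<const char4 *>(&b);
        return c + va.x * vb.x + va.y * vb.y + va.z * vb.z + va.w * vb.w;
    #endif
    }

The fallback path keeps the kernels correct on architectures without the instruction, just slower.
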
* Bump the crate version to 0.8.0. (#2612) | Laurent Mazare | 2024-11-12 | 1 file | -1/+1
* Improved launch config for layer-norm/rms-norm. (#2591) | Laurent Mazare | 2024-11-04 | 1 file | -8/+6
  * Improved launch config for layer-norm/rms-norm.
  * Add more testing for the fused layer/rms norm kernels.
* Bump the crate version to 0.7.2. (#2517) | Laurent Mazare | 2024-09-29 | 1 file | -1/+1
* Move the candle version to 0.7.1. (#2495) | Laurent Mazare | 2024-09-22 | 1 file | -1/+1
* Bump the crate version. (#2491) | Laurent Mazare | 2024-09-21 | 1 file | -1/+1
* Bump the version to 0.6.1. (#2438) | Laurent Mazare | 2024-08-22 | 1 file | -1/+1
* Bump the crate version. (#2248) | Laurent Mazare | 2024-06-05 | 1 file | -1/+1
* Add the layernorm specialized op. (#2212) | Laurent Mazare | 2024-05-24 | 1 file | -0/+84
  * Add the layernorm cuda kernels.
  * Dedicated layer norm op.
  * Add the slower variant.
  * Plug the cuda implementation.
  * Add the metal variant.
  * Add a dedicated test.
  * Bugfix.
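
A fused layer-norm kernel computes the row mean and variance in one pass and then normalizes and scales. The following is a rough single-precision sketch of the idea, assuming one block per row, a power-of-two block size, and blockDim.x * sizeof(float) of dynamic shared memory at launch; the candle kernels are templated over dtypes and use a different launch configuration.

    // One block per row: reduce sum and sum of squares, then normalize.
    extern "C" __global__ void layer_norm_f32_sketch(
        const float *x, float *y, const float *gamma, const float *beta,
        const int cols, const float eps) {
        extern __shared__ float buf[];                  // blockDim.x floats
        const float *row = x + (size_t)blockIdx.x * cols;
        float *out = y + (size_t)blockIdx.x * cols;

        // Per-thread partial sums.
        float sum = 0.f, sum_sq = 0.f;
        for (int i = threadIdx.x; i < cols; i += blockDim.x) {
            const float v = row[i];
            sum += v;
            sum_sq += v * v;
        }

        // Block reduction for the sum (requires power-of-two blockDim.x).
        buf[threadIdx.x] = sum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        const float mean = buf[0] / cols;
        __syncthreads();

        // Same reduction for the sum of squares.
        buf[threadIdx.x] = sum_sq;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        const float var = buf[0] / cols - mean * mean;
        const float inv_std = rsqrtf(var + eps);

        for (int i = threadIdx.x; i < cols; i += blockDim.x) {
            out[i] = (row[i] - mean) * inv_std * gamma[i] + beta[i];
        }
    }
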
* More efficient cuda implementation for ConvTranspose1d. (#2211) | Laurent Mazare | 2024-05-24 | 1 file | -0/+65
  * More efficient cuda implementation for ConvTranspose1d.
  * Small tweak.
* Bump the version number to 0.5.1. (#2155) | Laurent Mazare | 2024-05-03 | 1 file | -1/+1
  * Bump the version number to 0.5.1.
  * Fix clippy lints for 1.78.
  * More clippy fixes.
* Fix sigmoid gradient calculation and move sigmoid into a specialized op (#2114) | MilkFather | 2024-04-29 | 1 file | -0/+9
  * add sigmoid op
  * add as a method on `Tensor`
  * implement gradient calculation for sigmoid
  * add sigmoid tests
  * we should have a specialized op for this
  * Revert all previous commits in favor of a `CustomOp` based solution
  * use `CustomOp1` implementation
  * experimental metal impl
  * add cuda kernel impl
  * Add a test + reduce some cuda duplication.
  Co-authored-by: laurent <laurent.mazare@gmail.com>
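
On the CUDA side the specialized sigmoid op only needs a tiny elementwise kernel; the interesting part of the change is the gradient fix, since sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)). A sketch of the contiguous f32 case with a hypothetical name (the candle kernels also cover strided layouts and other dtypes):

    // Elementwise sigmoid over a contiguous f32 buffer; illustrative only.
    extern "C" __global__ void sigmoid_f32_sketch(const float *x, float *y, const size_t n) {
        const size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = 1.0f / (1.0f + expf(-x[i]));
        }
    }
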
* Add the cuda dequantize f16 kernels. (#2137) | Laurent Mazare | 2024-04-28 | 1 file | -37/+75
  * Add the cuda dequantize f16 kernels.
  * Expose the cuda kernels.
  * Add some testing + fix.
  * Test the other cases too.
  * A few more tests.
  * Add an environment variable to enable the dequantize f16 + matmul behavior.
* Add argsort. (#2132) | Laurent Mazare | 2024-04-27 | 2 files | -0/+89
  * Add the argsort cuda kernels.
  * CPU version of arg-sort.
  * Hook the cuda kernel + rework the cpu bits.
  * Add some dedicated test.
  * Working cuda kernel.
  * Metal kernel.
  * Metal adjustments.
  * Bugfix.
  * Use the fast rope in qwen.
  * Rework the expert selection in qwen.
* Add more QMMV cuda kernels. (#2077) | Laurent Mazare | 2024-04-18 | 1 file | -0/+324
  * Add more QMMV cuda kernels.
  * Enable the new kernels.
  * Adapt the testing.
* Add the mmv kernels for small batch sizes. (#2075) | Laurent Mazare | 2024-04-16 | 1 file | -10/+254
  * Add the mmv kernels for smaller sizes.
  * Support more mmv kernels.
  * Use the new kernels.
  * Fix the call.
  * Improve the testing.
  * Fix for dmmv.
  * Add another dedicated test for the batching mmv.
* Faster kernels for quantized matmul on cuda (#2060) | Laurent Mazare | 2024-04-15 | 1 file | -11/+118
  * Hook the quantized matmul cuda kernels.
  * Add a (currently broken) test.
  * Kernel fixes.
  * Fix by transposing the rhs matrix.
  * Add the q4-1 kernels.
  * Proper block sizes.
  * More details in the tests.
* Add the full quantized matmul kernels for cuda. (#2057) | Laurent Mazare | 2024-04-14 | 1 file | -0/+1071
* Add the rope THD kernel. (#2014) | Laurent Mazare | 2024-04-05 | 1 file | -5/+43
  * Add the rope THD kernel.
  * Cuda kernel for rope-thd.
  * Add the metal kernels.
  * Add a dedicated test.
* Add support for "sign" on tensors (#2012) | Thomas Santerre | 2024-04-04 | 1 file | -0/+9
  * add the sign unary operator
  * remove unneeded imports
  * remove unnecessary redefinition
  * allow gradient to flow through for sign and round
  * fix cpu ops to ensure that negative zero and positive zero are handled properly
  * Properly avoid gradient tracking.
  * Use a branchless version.
  Co-authored-by: laurent <laurent.mazare@gmail.com>
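
The "branchless version" mentioned above is a standard trick: comparing against zero in both directions yields -1, 0, or 1 without any divergent branch, and it sends both +0.0 and -0.0 (and NaN) to 0, which matches the zero-handling note in the commit. A hedged sketch for contiguous f32 data:

    // Branchless sign: (v > 0) - (v < 0) is -1, 0, or 1; +-0.0 and NaN map to 0.
    extern "C" __global__ void usign_f32_sketch(const float *x, float *y, const size_t n) {
        const size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) {
            const float v = x[i];
            y[i] = (float)((v > 0.0f) - (v < 0.0f));
        }
    }
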
* Bumping the version number to 0.5.0. (#2009) | Laurent Mazare | 2024-04-04 | 1 file | -1/+1
* Relax the contiguous check for cuda kernels. (#2000) | Laurent Mazare | 2024-04-03 | 1 file | -1/+1
  * Relax the contiguous check for cuda kernels.
  * Ensure contiguity for RNNs.
  * Unrelated fix for segment anything.
  * Better error message + allow concatenating empty slices.
* More ggml cuda kernels (#1977) | Laurent Mazare | 2024-04-01 | 1 file | -75/+1014
  * Add more cuda kernels for quantized matmul.
  * Add the vec-dot bits.
  * Expose the quantized matmul-vec kernels.
  * Also include the quantize-q8-1 kernel.
  * Glue code for the q8-1 quantization.
  * mm-vec product via q8-1 quantization.
  * Add a test.
  * Add a mm test.
  * Get the test to return some sensible results.
  * Also test dmmv.
  * Fix the launch params.
  * Allow for tweaking the force_dmmv parameter while it's experimental.
* Ensure that the kernels get rebuilt on cuh changes. (#1954) | Laurent Mazare | 2024-03-28 | 1 file | -0/+3
* Use the new rope kernel in mistral. (#1937) | Laurent Mazare | 2024-03-25 | 1 file | -2/+2
  * Use the new rope kernel in mistral.
  * Compute the cos and sin with full precision.
  * Bugfix.
* Contiguous variant of the rope kernel. (#1929) | Laurent Mazare | 2024-03-25 | 1 file | -6/+34
  * Contiguous variant of the rope kernel.
  * Add the cuda kernel.
  * Metal kernel.
* Fast kernels for rotary embeddings. (#1928) | Laurent Mazare | 2024-03-24 | 1 file | -0/+29
  * Fast kernels for rotary embeddings.
  * Add a test for the fast CPU kernel.
  * Rope cuda bindings.
  * Cuda kernel.
  * Metal kernel (part 1).
  * Cuda kernels.
  * Finish the metal kernel.
  * Use the new kernels in the quantized example.
  * Fix warning.
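
The core of a fused rotary-embedding kernel is one rotation per (even, odd) pair of values, with the cosine and sine looked up from precomputed tables. A sketch of the interleaved f32 case, assuming rows of td = seq_len * head_dim contiguous values and tables of length td / 2; the layout and argument names are illustrative, and the actual candle kernels also cover f16/bf16 and other layouts.

    #include <cstdint>

    // One thread rotates one (x0, x1) pair.
    extern "C" __global__ void rope_i_f32_sketch(
        const float *src, const float *cos_t, const float *sin_t, float *dst,
        const uint32_t bh, const uint32_t td) {
        const uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (2 * idx >= bh * td) return;
        const uint32_t i = 2 * idx;                      // even element of the pair
        const uint32_t rope_idx = idx % (td / 2);        // pair position within its row
        const float c = cos_t[rope_idx];
        const float s = sin_t[rope_idx];
        dst[i]     = src[i] * c - src[i + 1] * s;
        dst[i + 1] = src[i] * s + src[i + 1] * c;
    }
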
* Add cast_bf16_x/cast_x_bf16 when CUDA_ARCH<800 but CUDA_VERSION >= 11000 (#1919) | yinqiwen | 2024-03-23 | 1 file | -0/+12
  - This makes it possible to load bf16 models on T4 (sm75) cards.
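
Pre-Ampere GPUs (compute capability below 8.0) lack native bf16 arithmetic, but with CUDA 11+ the conversion intrinsics are still usable, which is all a cast kernel needs. A sketch of the bf16 to f32 direction under that assumption:

    #include <cuda_bf16.h>

    // bf16 -> f32 cast; only __bfloat162float is required, so this also works
    // on sm_75 cards such as the T4 when built with CUDA >= 11.0.
    extern "C" __global__ void cast_bf16_f32_sketch(const __nv_bfloat16 *x, float *y, const size_t n) {
        const size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = __bfloat162float(x[i]);
        }
    }
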
* Support scatter/index_add with i64 indices for f16 (#1915) | Daniël de Kok | 2024-03-22 | 1 file | -0/+2
* Custom op for RmsNorm (#1890) | Laurent Mazare | 2024-03-21 | 1 file | -0/+65
  * Trying out a custom RmsNorm cuda kernel.
  * CPU implementation for rms-norm.
  * Cuda wrappers.
  * Add some validation.
  * Add some testing.
  * More testing.
* Cuda backend optimization (#1886) | Laurent Mazare | 2024-03-20 | 4 files | -7/+7
  * Attempt at making the kernel faster.
  * Also adapt the cast kernels.
  * Also apply to binary ops.
* Optimize the cat operation on contiguous tensors (#1855) | Laurent Mazare | 2024-03-17 | 1 file | -1/+29
  * Add a specialized kernel for copy2d.
  * Move the cat operations.
  * Avoid transpositions in cat.
  * Bugfix.
  * Bugfix for the cuda kernel.
  * Add a benchmark.
  * Add more testing.
  * Test fix.
  * Faster kernel.
  * Add the missing kernel.
  * Tweak the test.
  * Add a metal kernel.
  * Fix for the metal kernel.
  * Get the tests to pass on metal.
  * Also use this opportunity to fix the metal kernel for ELU.
  * Add some bf16 kernels.
  * Clippy fixes.
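
The copy2d kernel mentioned above is presumably what lets cat on contiguous tensors avoid transpositions: each input is copied as a 2D block whose rows land at a different stride in the destination. A rough f32 sketch under that assumption:

    // Copy a (rows x cols) block between buffers with different row strides.
    extern "C" __global__ void copy2d_f32_sketch(
        const float *src, float *dst,
        const size_t rows, const size_t cols,
        const size_t src_stride, const size_t dst_stride) {
        const size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i >= rows * cols) return;
        const size_t r = i / cols;
        const size_t c = i % cols;
        dst[r * dst_stride + c] = src[r * src_stride + c];
    }

Concatenation along a non-final dimension then becomes one such copy per input, with the destination pointer offset by the rows already written.
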
* Bump the crate versions to 0.4.2. (#1821) | Laurent Mazare | 2024-03-08 | 1 file | -1/+1
* Add a cuda kernel for dequantizing q8_0. (#1804) | Laurent Mazare | 2024-03-05 | 1 file | -0/+24
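
In the ggml convention that candle's quantized types follow, a q8_0 block is 32 signed 8-bit weights sharing one f16 scale, so dequantization is a single multiply per weight. A sketch under that layout assumption; the kernel name and signature are illustrative, not candle's actual interface.

    #include <cstdint>
    #include <cuda_fp16.h>

    #define QK8_0 32
    typedef struct {
        half d;              // per-block scale
        int8_t qs[QK8_0];    // quantized weights
    } block_q8_0;

    // One thread dequantizes one 32-element block; nb is the number of blocks.
    extern "C" __global__ void dequantize_q8_0_sketch(const block_q8_0 *x, float *y, const int nb) {
        const int b = blockIdx.x * blockDim.x + threadIdx.x;
        if (b >= nb) return;
        const float d = __half2float(x[b].d);
        for (int j = 0; j < QK8_0; ++j) {
            y[b * QK8_0 + j] = d * (float)x[b].qs[j];
        }
    }
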
* Handle Q5_0 and Q5_1 quants in cuda. | laurent | 2024-02-29 | 1 file | -7/+9
* Bump the version number to 0.4.1. (#1768) | Laurent Mazare | 2024-02-27 | 1 file | -1/+1
  * Fix the block size for some cuda kernels.
  * Bump the version number to 0.4.1.
* Cuda kernel for dequantizing q8k. (#1760) | Laurent Mazare | 2024-02-26 | 1 file | -0/+35
  * Cuda kernel for dequantizing q8k.
  * Clippy lints.
* Cuda acceleration for quantized model. (#1754) | Laurent Mazare | 2024-02-25 | 2 files | -0/+1537
  * Boilerplate for the quantized cuda support.
  * More basic cuda support.
  * More cuda quantization (quantize on cpu for now).
  * Add the dequantization bit.
  * Start adding some dedicated cuda kernels from llama.cpp.
  * Move the kernel code.
  * Start interfacing with the kernel.
  * Tweak the kernel launch params.
  * Bugfix for quantized metal.
  * Fix some clippy lints.
  * Tweak the launch parameters.
  * Tweak cuda basics to perform a quantized matmul.
  * Perform the dequantization on the cpu + use cublas for matmul.
  * Add the dequantization kernel.
  * Test the qmatmul.
  * More kernels.
  * Matmul-vec kernel.
  * Add a couple kernels.
  * More dequantization kernels.
* Fix the silu cuda kernel. (#1710) | Laurent Mazare | 2024-02-14 | 1 file | -1/+1
* feat: add silu activation function (#1706) | OlivierDehaene | 2024-02-14 | 1 file | -0/+9
  * feat: add silu activation function
  * use silu/arg in grad
  * update candle-nn
  * use node
* ConvTranspose1d cuda support. (#1697) | Laurent Mazare | 2024-02-12 | 1 file | -2/+77
  * ConvTranspose1d cuda support.
  * Add the conv-transpose1d kernel.
  * Remove some unused variables.
* Bump the crate version to 0.4.0. (#1658) | Laurent Mazare | 2024-02-04 | 1 file | -1/+1
* Moving to a proper build crate `bindgen_cuda`. (#1531) | Nicolas Patry | 2024-01-07 | 2 files | -242/+5
  * Moving to a proper build crate `bindgen_cuda`.
  * Fmt.
* Bump the crate version to 0.3.3. (#1490) | Laurent Mazare | 2023-12-28 | 1 file | -1/+1
* Bump the crate version to 0.3.2. (#1452) | Laurent Mazare | 2023-12-17 | 1 file | -1/+1
* Update for 0.3.1. (#1324) | Laurent Mazare | 2023-11-11 | 1 file | -2/+2
* Rework the cuda casting bits. (#1112) | Laurent Mazare | 2023-10-17 | 1 file | -31/+54
* feat: parse Cuda compute cap from env (#1066) | OlivierDehaene | 2023-10-16 | 2 files | -89/+110
  * feat: add support for multiple compute caps
  * Revert to one compute cap
  * fmt
  * fix
* fix: fix index_select cuda kernel for src target dim different than ids dim when selecting dim > 0 (#1037) | Gonzalo | 2023-10-05 | 1 file | -6/+8
  * fix: fix index_select cuda kernel for src target dim different than ids dim when selecting dim > 0
  * cargo fmt
* Add the rounding operators. (#1030) | Laurent Mazare | 2023-10-04 | 2 files | -0/+24
  * Add the rounding operators.
  * Avoid tracking gradients for the rounding operations.
  * Add some rounding tests.