path: root/candle-core/src/quantized/cuda.rs
Commit message | Author | Age | Files | Lines
* Clippy fixes for the cuda feature. (#2650) | Laurent Mazare | 2024-11-29 | 1 | -1/+1
* Cuda quantized mmv bugfix. (#2526) | Laurent Mazare | 2024-10-01 | 1 | -1/+25
* Yet another cuda qmm padding fix. (#2509) | Laurent Mazare | 2024-09-30 | 1 | -25/+55
* Add the cuda dequantize f16 kernels. (#2137) | Laurent Mazare | 2024-04-28 | 1 | -13/+75
    - Expose the cuda kernels.
    - Add some testing + fix.
    - Test the other cases too.
    - A few more tests.
    - Add an environment variable to enable the dequantize f16 + matmul behavior.
* Add more QMMV cuda kernels. (#2077) | Laurent Mazare | 2024-04-18 | 1 | -8/+10
    - Enable the new kernels.
    - Adapt the testing.
* Add the mmv kernels for small batch sizes. (#2075) | Laurent Mazare | 2024-04-16 | 1 | -18/+46
    - Add the mmv kernels for smaller sizes.
    - Support more mmv kernels.
    - Use the new kernels.
    - Fix the call.
    - Silly fix.
    - Improve the testing.
    - Fix for dmmv.
    - Add another dedicated test for the batching mmv.
* Fix for the batch dim in the quantized matmul example. (#2073) | Laurent Mazare | 2024-04-15 | 1 | -1/+1
    - Enable more tests on cuda.
    - Add a test for qmm with a batch.
    - Fix the zeros-dim test on metal.
* Add a function to clear the KV cache in falcon. (#2066) | Laurent Mazare | 2024-04-15 | 1 | -0/+1
    - Clippy.
* Faster kernels for quantized matmul on cuda (#2060) | Laurent Mazare | 2024-04-15 | 1 | -6/+137
    - Hook the quantized matmul cuda kernels.
    - Add a (currently broken) test.
    - Kernel fixes.
    - Fix by transposing the rhs matrix.
    - Add the q4-1 kernels.
    - Proper block sizes.
    - More details in the tests.
* Quantized cuda tweaks. (#1981) | Laurent Mazare | 2024-04-01 | 1 | -89/+62
    - Add some safety checks.
    - Factorize the dequantization bits.
* Switch the default to using the faster kernels. (#1978) | Laurent Mazare | 2024-04-01 | 1 | -1/+1
    - Add the force-dmmv flag.
* More ggml cuda kernels (#1977) | Laurent Mazare | 2024-04-01 | 1 | -7/+147
    - Add more cuda kernels for quantized matmul.
    - Add the vec-dot bits.
    - Expose the quantized matmul-vec kernels.
    - Also include the quantize-q8-1 kernel.
    - Glue code for the q8-1 quantization.
    - mm-vec product via q8-1 quantization.
    - Add a test.
    - Add a mm test.
    - Get the test to return some sensible results.
    - Also test dmmv.
    - Fix the launch params.
    - Allow for tweaking the force_dmmv parameter while it's experimental.
* Properly handle the batch dimension in cuda quantized matmul. (#1832) | Laurent Mazare | 2024-03-10 | 1 | -1/+1
* Handle Q5_0 and Q5_1 quants in cuda. | laurent | 2024-02-29 | 1 | -16/+38
* Fix the block size for some cuda kernels. (#1767) | Laurent Mazare | 2024-02-27 | 1 | -13/+15
* Cuda kernel for dequantizing q8k. (#1760) | Laurent Mazare | 2024-02-26 | 1 | -18/+16
    - Clippy lints.
* Cuda acceleration for quantized model. (#1754) | Laurent Mazare | 2024-02-25 | 1 | -0/+321
    - Boilerplate for the quantized cuda support.
    - More basic cuda support.
    - More cuda quantization (quantize on cpu for now).
    - Add the dequantization bit.
    - Start adding some dedicated cuda kernels from llama.cpp.
    - Move the kernel code.
    - Start interfacing with the kernel.
    - Tweak the kernel launch params.
    - Bugfix for quantized metal.
    - Fix some clippy lints.
    - Tweak the launch parameters.
    - Tweak cuda basics to perform a quantized matmul.
    - Perform the dequantization on the cpu + use cublas for matmul.
    - Add the dequantization kernel.
    - Test the qmatmul.
    - More kernels.
    - Matmul-vec kernel.
    - Add a couple kernels.
    - More dequantization kernels.
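The entry above bootstrapped the file: quantize on the CPU, dequantize with dedicated kernels ported from llama.cpp, and initially fall back to cublas for the matmul. As background on the data these kernels consume, here is a minimal Python sketch of the ggml Q8_0 block format they target (32 elements per block, one scale each, quants in [-127, 127]); this is an illustration of the scheme, not candle's actual code:

```python
# Sketch of ggml-style Q8_0 block quantization: the input is split into
# blocks of 32 values, each stored as a single scale plus 32 signed 8-bit
# integers, so x ~= scale * q. Pure Python for clarity; the real format
# packs an f16 scale and int8 quants into a C struct.

QK8_0 = 32  # elements per block, as in ggml

def quantize_q8_0(xs):
    """Quantize a flat list (length a multiple of 32) into (scale, quants) blocks."""
    blocks = []
    for i in range(0, len(xs), QK8_0):
        block = xs[i:i + QK8_0]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0 if amax > 0 else 0.0
        inv = 1.0 / scale if scale > 0 else 0.0
        qs = [max(-127, min(127, round(v * inv))) for v in block]
        blocks.append((scale, qs))
    return blocks

def dequantize_q8_0(blocks):
    """Recover approximate f32 values from the (scale, quants) blocks."""
    return [scale * q for scale, qs in blocks for q in qs]

vals = [0.5 * i - 8.0 for i in range(32)]       # one block of test data
deq = dequantize_q8_0(quantize_q8_0(vals))
err = max(abs(a - b) for a, b in zip(vals, deq))
```

The round-trip error is bounded by half the block scale (here amax/254, about 0.032), which is why the dequantize-then-matmul path gives results close to the f32 reference in the tests mentioned above.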