* Use flash-attn in gemma.
* Fix for the fast bf16 cublas gemm.
* Fix some clippy lints.
* Fix another lint.
* Proper clippy fix.
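
The flash-attn switch follows the cfg-gated pattern used by candle's other transformer examples: a real binding when the feature is compiled in, a loud stub otherwise. A minimal sketch, assuming the candle_flash_attn crate and its (q, k, v, softmax_scale, causal) entry point:

```rust
use candle_core::{Result, Tensor};

// Compiled in with `--features flash-attn`: dispatch to the fused kernel.
#[cfg(feature = "flash-attn")]
fn flash_attn(q: &Tensor, k: &Tensor, v: &Tensor, softmax_scale: f32, causal: bool) -> Result<Tensor> {
    candle_flash_attn::flash_attn(q, k, v, softmax_scale, causal)
}

// Without the feature, fail loudly rather than silently falling back.
#[cfg(not(feature = "flash-attn"))]
fn flash_attn(_: &Tensor, _: &Tensor, _: &Tensor, _: f32, _: bool) -> Result<Tensor> {
    unimplemented!("compile with '--features flash-attn'")
}
```
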
* Allow the use of tf32 accumulation in matmul.
* Better timings.
* Dummy versions for use when cuda is not enabled.
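
TF32 trades a few f32 mantissa bits for much faster matmuls on Ampere-class GPUs, so it has to be opt-in. A minimal sketch of such a toggle, with a dummy version for non-cuda builds as the last commit describes; the function names here are illustrative, not candle's actual API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical global flag consulted by the cublas gemm path.
static TF32_MATMUL: AtomicBool = AtomicBool::new(false);

#[cfg(feature = "cuda")]
pub fn set_tf32_matmul(enabled: bool) {
    // The cuda backend would read this flag and select
    // CUBLAS_COMPUTE_32F_FAST_TF32 instead of CUBLAS_COMPUTE_32F.
    TF32_MATMUL.store(enabled, Ordering::Relaxed);
}

// Dummy version for builds without cuda, so callers compile unchanged.
#[cfg(not(feature = "cuda"))]
pub fn set_tf32_matmul(_enabled: bool) {}

pub fn tf32_matmul_enabled() -> bool {
    TF32_MATMUL.load(Ordering::Relaxed)
}
```
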
* Cuda kernel for dequantizing q8k.
* Clippy lints.
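
For reference, Q8_K in the llama.cpp scheme stores one f32 scale per 256-value block alongside 256 signed 8-bit quants (plus per-16 partial sums used elsewhere), so dequantization is just a scale-and-widen. A CPU sketch of what the cuda kernel computes:

```rust
// CPU reference for the q8k dequantization, assuming the usual llama.cpp
// Q8_K layout: one f32 scale `d` per 256-value block plus 256 signed 8-bit
// quants (the block's i16 partial sums are not needed here).
const QK_K: usize = 256;

pub struct BlockQ8K {
    pub d: f32,
    pub qs: [i8; QK_K],
}

pub fn dequantize_q8k(blocks: &[BlockQ8K], out: &mut Vec<f32>) {
    for block in blocks {
        for &q in block.qs.iter() {
            out.push(block.d * q as f32);
        }
    }
}
```
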
* Boilerplate for the quantized cuda support.
* More basic cuda support.
* More cuda quantization (quantize on cpu for now).
* Add the dequantization bit.
* Start adding some dedicated cuda kernels from llama.cpp.
* Move the kernel code.
* Start interfacing with the kernel.
* Tweak the kernel launch params.
* Bugfix for quantized metal.
* Fix some clippy lints.
* Tweak the launch parameters.
* Tweak cuda basics to perform a quantized matmul.
* Perform the dequantization on the cpu + use cublas for matmul.
* Add the dequantization kernel.
* Test the qmatmul.
* More kernels.
* Matmul-vec kernel.
* Add a couple kernels.
* More dequantization kernels.
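
Before the dedicated kernels land, the log describes a stop-gap path: dequantize the weights on the cpu, upload the f32 copy, and let the regular cublas-backed matmul run. A rough sketch of that interim shape, reusing the BlockQ8K helper sketched above (names and layout are illustrative):

```rust
use candle_core::{Device, Result, Tensor};

// Stop-gap quantized matmul: dequantize host-side, then use the regular
// (cublas-backed) matmul. A fused cuda kernel replaces this later.
pub fn qmatmul_via_dequant(
    xs: &Tensor,             // activations, already on the target device
    w_blocks: &[BlockQ8K],   // quantized weights, host side
    w_shape: (usize, usize), // (out_features, in_features)
    dev: &Device,
) -> Result<Tensor> {
    let mut w_f32 = Vec::with_capacity(w_shape.0 * w_shape.1);
    dequantize_q8k(w_blocks, &mut w_f32);
    let w = Tensor::from_vec(w_f32, w_shape, dev)?;
    xs.matmul(&w.t()?)
}
```
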
* Add the dilation parameter.
* Restore the basic optimizer example.
* Dilation support in cudnn.
* Use the dilation parameter in the cpu backend.
* More dilation support.
* No support for dilation in transposed convolutions.
* Add dilation to a test.
* Remove a print.
* Helper function.
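
The helper-function commit presumably centralizes the output-size arithmetic: with dilation d, a kernel of size k spans an effective window of d*(k-1)+1 input positions. A sketch of that standard formula (candle's actual helper may be shaped differently):

```rust
// Output length of a (possibly dilated) convolution along one dimension.
pub fn conv_out_len(in_len: usize, k: usize, padding: usize, stride: usize, dilation: usize) -> usize {
    let effective_k = dilation * (k - 1) + 1;
    (in_len + 2 * padding - effective_k) / stride + 1
}

#[test]
fn dilated_window() {
    // k = 3 with dilation = 2 behaves like a 5-wide window.
    assert_eq!(conv_out_len(10, 3, 0, 1, 2), 6);
    assert_eq!(conv_out_len(10, 5, 0, 1, 1), 6);
}
```
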
* Add to the cuda example a reproduction of the issue.
* Tweak.
* Add a test using non-square matrices.
* Fix the conv2d kernel.
* Display the error.
* And tweak the comment.
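
Square inputs can mask transposed height/width indexing, which is why the test above uses non-square matrices: any h/w mix-up then changes the output shape. A sketch of such a check, assuming candle's (padding, stride, dilation, groups) conv2d signature:

```rust
use candle_core::{Device, Result, Tensor};

// Regression-style check with distinct height and width.
fn non_square_conv2d() -> Result<()> {
    let dev = Device::Cpu;
    let xs = Tensor::randn(0f32, 1., (1, 2, 7, 5), &dev)?; // (b, c_in, h, w), h != w
    let ws = Tensor::randn(0f32, 1., (3, 2, 3, 3), &dev)?; // (c_out, c_in, kh, kw)
    let out = xs.conv2d(&ws, 0, 1, 1, 1)?;
    assert_eq!(out.dims(), &[1, 3, 5, 3]); // h: 7-3+1, w: 5-3+1
    Ok(())
}
```
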
cuda. (#578)
* Add a test for conv2d with padding.
* Cosmetic changes.
* Bugfix the rand function on the cuda backend.
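
A broken rand setup typically shows up as out-of-range or suspiciously constant samples, so a cheap property test covers this kind of bugfix. A sketch (Device::Cpu serves as the reference backend, a cuda device exercises the fixed path):

```rust
use candle_core::{Device, Result, Tensor};

// Sanity check for the uniform rand path: samples must stay in [0, 1)
// and must not collapse to a constant.
fn check_rand(dev: &Device) -> Result<()> {
    let t = Tensor::rand(0f32, 1f32, 1024, dev)?;
    let v = t.to_vec1::<f32>()?;
    assert!(v.iter().all(|&x| (0f32..1f32).contains(&x)));
    let (min, max) = v
        .iter()
        .fold((f32::MAX, f32::MIN), |(lo, hi), &x| (lo.min(x), hi.max(x)));
    assert!(max - min > 1e-3, "suspiciously constant samples");
    Ok(())
}
```
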
* Add a groups parameter to convolutions.
* Avoid some unnecessary groups checks.
* Move the tensor convolution bits.
* Proper handling of groups.
* Bump the crate version.
* And add a changelog.
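
Grouped convolutions split the input channels into `groups` independent slices, each convolved with c_out / groups filters, so the kernel carries only c_in / groups input channels and both channel counts must divide evenly. A sketch of that shape bookkeeping (not candle's actual error handling):

```rust
// Validate the channel/groups relationship for a grouped convolution.
fn check_conv_groups(c_in: usize, c_out: usize, kernel_c_in: usize, groups: usize) -> Result<(), String> {
    if c_in % groups != 0 || c_out % groups != 0 {
        return Err(format!("c_in {c_in} and c_out {c_out} must be divisible by groups {groups}"));
    }
    if kernel_c_in != c_in / groups {
        return Err(format!("kernel has {kernel_c_in} input channels, want {}", c_in / groups));
    }
    Ok(())
}

// groups == c_in gives a depthwise convolution: one filter per channel.
```
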
* Add a cudnn feature to be used for conv2d.
* Allocate the proper workspace.
* Only create a single cudnn handle per cuda device.
* Proper cudnn usage.
* Bugfix.
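
Creating a cudnn handle is expensive, hence one handle per cuda device. The usual Rust shape for that is a lazily initialized, mutex-guarded map keyed by device ordinal; a sketch with a stand-in handle type:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Stand-in for the real cudnn handle type.
struct CudnnHandle {}

fn create_handle(_ordinal: usize) -> Arc<CudnnHandle> {
    Arc::new(CudnnHandle {})
}

// One lazily created handle per device ordinal, shared between conv2d calls.
fn cudnn_handle(ordinal: usize) -> Arc<CudnnHandle> {
    static HANDLES: OnceLock<Mutex<HashMap<usize, Arc<CudnnHandle>>>> = OnceLock::new();
    let handles = HANDLES.get_or_init(|| Mutex::new(HashMap::new()));
    let mut handles = handles.lock().unwrap();
    handles.entry(ordinal).or_insert_with(|| create_handle(ordinal)).clone()
}
```
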
* Add more tracing to the whisper example.
* Support accelerate in more examples.
* Use accelerate for pointwise functions.
* Use accelerate for binary operations too.
* Bugfix for binary operation: use the rhs before the lhs.
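
The "rhs before the lhs" bugfix matches a well-known Accelerate quirk: in vDSP_vsub the first vector argument is the subtrahend, so computing lhs - rhs means passing rhs first. A sketch of a wrapper that encodes the order once (macOS only; the extern declaration mirrors the vDSP signature):

```rust
#[allow(non_snake_case)]
#[link(name = "Accelerate", kind = "framework")]
extern "C" {
    // Apple's argument order: vDSP_vsub(b, 1, a, 1, c, 1, n) computes c = a - b.
    fn vDSP_vsub(b: *const f32, ib: isize, a: *const f32, ia: isize,
                 c: *mut f32, ic: isize, n: usize);
}

// Encode the reversed operand order in one place: out = lhs - rhs.
pub fn vs_sub(lhs: &[f32], rhs: &[f32], out: &mut [f32]) {
    assert!(lhs.len() == rhs.len() && lhs.len() == out.len());
    unsafe { vDSP_vsub(rhs.as_ptr(), 1, lhs.as_ptr(), 1, out.as_mut_ptr(), 1, out.len()) }
}
```
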
* Rename to candle-core.
* More candle-core renaming.

* Sketch a fast cuda kernel for reduce-sum.
* Sketch the rust support code for the fast sum kernel.
* More work on the fast kernel.
* Add some testing ground.
* A couple fixes for the fast sum kernel.
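
The usual fast reduce-sum runs in two phases: each thread first accumulates a strided slice of the input, then the per-thread partials are combined by a tree reduction that halves the active threads each step. A CPU model of that shape (power-of-two thread count assumed), not the kernel itself:

```rust
// CPU model of the two-phase reduce-sum the cuda kernel implements.
fn block_sum(xs: &[f32], n_threads: usize) -> f32 {
    // Phase 1: each "thread" sums a strided slice of the input.
    let mut partial: Vec<f32> = (0..n_threads)
        .map(|t| xs.iter().skip(t).step_by(n_threads).sum())
        .collect();
    // Phase 2: tree reduction over the partials, halving the stride each
    // step; in cuda a __syncthreads() sits between the steps.
    let mut s = n_threads / 2;
    while s > 0 {
        for t in 0..s {
            partial[t] += partial[t + s];
        }
        s /= 2;
    }
    partial[0]
}
```
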
* Add some very simple sum benchmark.
* Rename the file.
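
A sketch of what such a benchmark can look like; note that with a cuda device the result has to be copied back to the host (or the device synchronized) so the timing covers the kernel rather than just the launch:

```rust
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

// Minimal wall-clock benchmark: time repeated sums over a large tensor.
fn bench_sum(dev: &Device) -> Result<()> {
    let xs = Tensor::rand(0f32, 1f32, (1024, 1024), dev)?;
    let start = Instant::now();
    let iters = 100u32;
    for _ in 0..iters {
        // to_scalar forces the (possibly async) computation to finish.
        let _sum = xs.sum_all()?.to_scalar::<f32>()?;
    }
    println!("{:?} per sum", start.elapsed() / iters);
    Ok(())
}
```
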
* Fix some rebase issues.
* Use mkl instead.
* Use mkl in bert.
* Add the optional mkl feature.
* Conditional compilation based on the mkl feature.
* Add more mkl support.
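
The optional mkl feature follows the standard Cargo conditional-compilation pattern: the mkl-backed path only compiles when the feature is enabled (`--features mkl`), with a plain Rust fallback otherwise. A sketch using mkl's vsExp vector-math entry point (VML functions allow in-place use when input and output coincide); linkage details are elided:

```rust
#[allow(non_snake_case)]
#[cfg(feature = "mkl")]
extern "C" {
    // mkl vml entry point; linking is set up by the mkl feature's build config.
    fn vsExp(n: i32, a: *const f32, y: *mut f32);
}

// Fast path: compiled in only when the mkl feature is on.
#[cfg(feature = "mkl")]
fn exp_inplace(xs: &mut [f32]) {
    unsafe { vsExp(xs.len() as i32, xs.as_ptr(), xs.as_mut_ptr()) }
}

// Portable fallback for builds without mkl.
#[cfg(not(feature = "mkl"))]
fn exp_inplace(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = x.exp();
    }
}
```
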