path: root/candle-core
Commit message | Author | Date | Files | Lines
...
* Automatically upcast for to_u64 (#2244) | Eric Buehler | 2024-06-04 | 1 | -1/+7
* add where_cond f32 for metal (#2236) | Lionel Touati | 2024-06-02 | 1 | -0/+1
* Add a metal kernel for col2im1d. (#2214) | Laurent Mazare | 2024-05-25 | 1 | -34/+92
  * Add a metal kernel for col2im1d.
  * Enable the col2im variant.
  * Bugfix.
  * Revert the quantized tweak.
* Add the layernorm specialized op. (#2212) | Laurent Mazare | 2024-05-24 | 2 | -1/+39
  * Add the layernorm cuda kernels.
  * Dedicated layer norm op.
  * Add the slower variant.
  * Plug the cuda implementation.
  * Add the metal variant.
  * Add a dedicated test.
  * Bugfix.
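A minimal sketch of calling the fused layer-norm op added by #2212; the exact export path and the f32 eps parameter are assumptions based on the existing candle_nn::ops helpers:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let xs = Tensor::randn(0f32, 1f32, (2, 4, 8), &dev)?;
    let weight = Tensor::ones(8, DType::F32, &dev)?;
    let bias = Tensor::zeros(8, DType::F32, &dev)?;
    // Fused op: normalize over the last dim, then scale and shift.
    let ys = candle_nn::ops::layer_norm(&xs, &weight, &bias, 1e-5)?;
    println!("{:?}", ys.dims());
    Ok(())
}
```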
* More efficient cuda implementation for ConvTranspose1d. (#2211) | Laurent Mazare | 2024-05-24 | 2 | -4/+75
  * More efficient cuda implementation for ConvTranspose1d.
  * Small tweak.
* Add a slice_set op. (#2193) | Laurent Mazare | 2024-05-18 | 2 | -0/+87
  * Add a slice_set op.
  * Add some testing.
  * Add the dedicated kv-cache module.
  * Derive debug and clone.
  * Expose more kv-cache functions.
  * Return the current data when appending.
  * Use the new cache in the quantized phi3 model.
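A sketch of slice_set driving a kv-cache-style buffer, assuming the Tensor::slice_set(src, dim, offset) signature that writes src into self at the given offset along dim:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Pre-allocated cache: (batch, seq_capacity, head_dim).
    let cache = Tensor::zeros((1, 16, 4), DType::F32, &dev)?;
    // New chunk of 2 tokens, appended at sequence position 5.
    let chunk = Tensor::ones((1, 2, 4), DType::F32, &dev)?;
    // Writes into the cache's storage in place; rows 5..7 along dim 1
    // now hold the chunk.
    cache.slice_set(&chunk, 1, 5)?;
    Ok(())
}
```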
* Add SliceSafetensors. (#2179) | Laurent Mazare | 2024-05-11 | 2 | -0/+71
  * Add SliceSafetensors.
  * And add some testing.
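A sketch of loading tensors from an in-memory safetensors buffer; the SliceSafetensors::new and .load names are assumptions, by analogy with the existing MmapedSafetensors API:

```rust
use candle_core::safetensors::SliceSafetensors;
use candle_core::{Device, Result};

fn load_from_bytes(bytes: &[u8]) -> Result<()> {
    // Parse the safetensors header over a borrowed byte slice, no mmap needed.
    let st = SliceSafetensors::new(bytes)?;
    // Materialize a named tensor on the target device.
    let w = st.load("model.embed_tokens.weight", &Device::Cpu)?;
    println!("{:?}", w.dims());
    Ok(())
}
```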
* Make it possible to use TF32 accumulation in F32 matmuls. (#2178) | Laurent Mazare | 2024-05-11 | 3 | -30/+89
  * Allow the use of tf32 accumulation in matmul.
  * Better timings.
  * Dummy versions for use when cuda is not enabled.
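A sketch of opting into TF32 accumulation; the candle_core::cuda::set_gemm_reduced_precision_f32 name is an assumption, and per the commit a no-op dummy exists when the cuda feature is off:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    // TF32 keeps the f32 range but a ~10-bit mantissa: faster gemms on
    // Ampere+ GPUs at the cost of some accumulation precision.
    candle_core::cuda::set_gemm_reduced_precision_f32(true);

    let dev = Device::Cpu; // or Device::new_cuda(0)?
    let a = Tensor::randn(0f32, 1f32, (128, 256), &dev)?;
    let b = Tensor::randn(0f32, 1f32, (256, 64), &dev)?;
    let _c = a.matmul(&b)?;
    Ok(())
}
```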
* Use write rather than try-write on the metal rw-locks. (#2162) | Laurent Mazare | 2024-05-05 | 2 | -7/+13
* Separate quantized phi-3 implementation. (#2157) | Laurent Mazare | 2024-05-04 | 2 | -4/+1
  * Separate quantized phi-3 implementation.
  * Integrate the quantized phi3 model.
  * Small fixes, get the generation to work properly.
  * Keep the old llama implementation around.
  * Change the default.
* Bump the version number to 0.5.1. (#2155) | Laurent Mazare | 2024-05-03 | 3 | -39/+2
  * Bump the version number to 0.5.1.
  * Fix clippy lints for 1.78.
  * More clippy fixes.
* F16/BF16 bugfix (bis). (#2143) | Laurent Mazare | 2024-04-29 | 1 | -14/+36
  * F16/BF16 bugfix (bis).
  * Another fix.
  * Yet another fix.
* Bugfix the recent f16/bf16 changes. (#2142) | Laurent Mazare | 2024-04-29 | 1 | -8/+8
* Bug Fix: When converting a tensor to a variable, clone if the tensor is already a variable. (#2124) | Jeffrey Dallatezza | 2024-04-29 | 1 | -2/+7
  * When converting a tensor to a variable, clone if the tensor is already a variable.
  * Add a test to ensure training a batch norm works with VarMaps
  Co-authored-by: Jeffrey Dallatezza <jeffreydallatezza@Jeffreys-Laptop.local>
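A sketch of the fixed behavior, assuming Var::from_tensor is the conversion path in question: the new variable must own a deep copy so the two never share storage:

```rust
use candle_core::{DType, Device, Result, Tensor, Var};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let v1 = Var::zeros((2, 2), DType::F32, &dev)?;
    // Converting an existing variable's tensor now clones the data.
    let v2 = Var::from_tensor(v1.as_tensor())?;
    // Mutating v2 leaves v1 untouched.
    v2.set(&Tensor::ones((2, 2), DType::F32, &dev)?)?;
    Ok(())
}
```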
* Fix sigmoid gradient calculation and move sigmoid into a specialized op (#2114) | MilkFather | 2024-04-29 | 1 | -2/+2
  * add sigmoid op
  * small fix
  * add as a method on `Tensor`
  * implement gradient calculation for sigmoid
  * add sigmoid tests
  * we should have a specialized op for this
  * fix clippy
  * fix clippy 2
  * Revert all previous commits in favor of a `CustomOp` based solution
  * use `CustomOp1` implementation
  * fix rustfmt
  * experimental add metal impl
  * add cuda kernel impl
  * fix fmt
  * Add a test + reduce some cuda duplication.
  Co-authored-by: laurent <laurent.mazare@gmail.com>
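A quick sketch of the specialized sigmoid op, assuming it is exposed through candle_nn::ops as the other fused ops are:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let xs = Tensor::new(&[-2f32, 0.0, 2.0], &Device::Cpu)?;
    // sigmoid(x) = 1 / (1 + exp(-x)); its gradient is s * (1 - s),
    // which is what the CustomOp1 backward pass computes.
    let ys = candle_nn::ops::sigmoid(&xs)?;
    println!("{:?}", ys.to_vec1::<f32>()?);
    Ok(())
}
```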
* Add a toggle for F16/BF16 accumulation in gemm. (#2141) | Laurent Mazare | 2024-04-29 | 3 | -15/+150
  * Add a toggle to control f16/bf16 gemm precision.
  * Use the faster variant in the quantized example.
  * Bugfix.
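A sketch of the half-precision accumulation toggles, assuming names analogous to the f32 variant (set_gemm_reduced_precision_f16 / _bf16 are assumptions):

```rust
fn enable_fast_half_gemm() {
    // Accumulate f16/bf16 matmuls in reduced precision for speed; the
    // default keeps f32 accumulation for accuracy.
    candle_core::cuda::set_gemm_reduced_precision_f16(true);
    candle_core::cuda::set_gemm_reduced_precision_bf16(true);
}
```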
* Add a forward_via_f16 method to the qmatmul op. (#2138) | Laurent Mazare | 2024-04-28 | 1 | -0/+19
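A sketch of the f16 path, assuming forward_via_f16 mirrors QMatMul::forward in taking the activations and returning the product:

```rust
use candle_core::quantized::QMatMul;
use candle_core::{Result, Tensor};

fn qmm_f16(w: &QMatMul, xs: &Tensor) -> Result<Tensor> {
    // Dequantize the weights to f16 and run a regular gemm, which can beat
    // the direct quantized kernels when the batch dimension is large.
    w.forward_via_f16(xs)
}
```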
* Add the cuda dequantize f16 kernels. (#2137) | Laurent Mazare | 2024-04-28 | 4 | -18/+242
  * Add the cuda dequantize f16 kernels.
  * Expose the cuda kernels.
  * Add some testing + fix.
  * Test the other cases too.
  * A few more tests.
  * Add an environment variable to enable the dequantize f16 + matmul behavior.
* Add a sort function. (#2134) | Laurent Mazare | 2024-04-28 | 2 | -0/+35
* Add argsort. (#2132) | Laurent Mazare | 2024-04-27 | 4 | -1/+241
  * Add the argsort cuda kernels.
  * CPU version of arg-sort.
  * Hook the cuda kernel + rework the cpu bits.
  * Add some dedicated test.
  * Working cuda kernel.
  * Metal kernel.
  * Metal adjustments.
  * Bugfix.
  * Use the fast rope in qwen.
  * Rework the expert selection in qwen.
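A sketch covering both #2134 and #2132, assuming the Tensor::sort_last_dim and Tensor::arg_sort_last_dim(ascending) entry points these changes introduce:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let xs = Tensor::new(&[[3u32, 1, 2], [9, 7, 8]], &Device::Cpu)?;
    // Indices that would sort each row in ascending order.
    let idx = xs.arg_sort_last_dim(true)?;
    // Sorted values together with the permutation indices.
    let (sorted, _idx) = xs.sort_last_dim(true)?;
    println!("{:?}", idx.to_vec2::<u32>()?);
    println!("{:?}", sorted.to_vec2::<u32>()?);
    Ok(())
}
```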
* Add StorageRef. (#2113) | Laurent Mazare | 2024-04-23 | 10 | -5/+108
  * Add the storage-ref bits.
  * Add the metal implementation.
* Update zip requirement from 0.6.6 to 1.1.1 (#2103) | dependabot[bot] | 2024-04-22 | 1 | -1/+1
  * Update zip requirement from 0.6.6 to 1.1.1
    updated-dependencies:
    - dependency-name: zip
      dependency-type: direct:production
  * Fix for the zip crate update.
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
  Co-authored-by: laurent <laurent.mazare@gmail.com>
* Metal Unary: Add benchmarks and process kernels in a tile based fashion (#2056) | Thomas Santerre | 2024-04-21 | 4 | -147/+283
  * add basic unary bench for sqrt
  * process unary commands in tiles of 4
  * re-enable all benchmarks
  * rename helper to unary
  * modify approach to split up tiled and non-tiled operations
  * undo bench ignore for other tests
  * update tile size to 2
  * only perform the optimization on the contiguous even numbered element case
* Small cleanups to the llama multi-process example. (#2098) | Laurent Mazare | 2024-04-20 | 1 | -1/+5
* Handle multiple dimensions in metal QMM + two fixes. (#2097) | Laurent Mazare | 2024-04-20 | 1 | -15/+20
* Fix the silu gradient issue on 0. (#2083) | Laurent Mazare | 2024-04-18 | 1 | -1/+1
* Add more QMMV cuda kernels. (#2077) | Laurent Mazare | 2024-04-18 | 2 | -15/+25
  * Add more QMMV cuda kernels.
  * Enable the new kernels.
  * Adapt the testing.
* Add the mmv kernels for small batch sizes. (#2075) | Laurent Mazare | 2024-04-16 | 2 | -19/+81
  * Add the mmv kernels for smaller sizes.
  * Support more mmv kernels.
  * Use the new kernels.
  * Fix the call.
  * Silly fix.
  * Improve the testing.
  * Fix for dmmv.
  * Add another dedicated test for the batching mmv.
* Fix for the batch dim in the quantized matmul example. (#2073) | Laurent Mazare | 2024-04-15 | 3 | -38/+38
  * Fix for the batch dim in the quantized matmul example.
  * Enable more tests on cuda.
  * Add a test for qmm with a batch.
  * Fix the zeros-dim test on metal.
* Add a function to clear the KV cache in falcon. (#2066) | Laurent Mazare | 2024-04-15 | 1 | -0/+1
  * Add a function to clear the KV cache in falcon.
  * Clippy.
* Handle zero dims in some simple operations. (#2064) | Laurent Mazare | 2024-04-15 | 2 | -0/+43
  * Handle zero dims in some simple operations.
  * Handle zero-dims in matmul.
  * More testing.
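A sketch of what the zero-dim handling allows: shapes with a zero dimension propagate through matmul instead of erroring, and an empty contraction sums to zero:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let a = Tensor::zeros((2, 0), DType::F32, &dev)?;
    let b = Tensor::zeros((0, 3), DType::F32, &dev)?;
    // (2, 0) x (0, 3) -> (2, 3), filled with zeros since the inner
    // dimension is empty.
    let c = a.matmul(&b)?;
    assert_eq!(c.dims(), &[2, 3]);
    Ok(())
}
```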
* Faster kernels for quantized matmul on cuda (#2060) | Laurent Mazare | 2024-04-15 | 1 | -6/+137
  * Hook the quantized matmul cuda kernels.
  * Add a (currently broken) test.
  * Kernel fixes.
  * Fix by transposing the rhs matrix.
  * Add the q4-1 kernels.
  * Proper block sizes.
  * More details in the tests.
* Expose the synchronize function on the generic device. (#2062) | Laurent Mazare | 2024-04-14 | 1 | -0/+8
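A sketch of the generic Device::synchronize in a timing loop; since GPU kernel launches are asynchronous, reading the clock before synchronizing would only measure launch overhead:

```rust
use candle_core::{Device, Result, Tensor};

fn time_matmul(dev: &Device) -> Result<std::time::Duration> {
    let a = Tensor::randn(0f32, 1f32, (1024, 1024), dev)?;
    let start = std::time::Instant::now();
    let _c = a.matmul(&a)?;
    // Block until all queued device work has completed (a no-op on CPU).
    dev.synchronize()?;
    Ok(start.elapsed())
}
```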
* Add missing bfloat unary strided kernels and fix typo (#2058) | ivarflakstad | 2024-04-14 | 1 | -0/+20
* Add a synchronize method to devices. (#2055) | Laurent Mazare | 2024-04-14 | 6 | -0/+24
  * Add a synchronize method to devices.
  * Metal version.
* Add benchmarks for qmatmul operations (#2048) | Thomas Santerre | 2024-04-13 | 3 | -0/+74
  * Add qmatmul bench
  * add all dtypes
* Support gather on bf16 for metal. (#2035) | Laurent Mazare | 2024-04-10 | 1 | -0/+1
* Use BufferOffset in metal backend ops. (#2029) | Laurent Mazare | 2024-04-08 | 1 | -50/+39
  * Use BufferOffset in the metal backend.
  * More BufferOffset usage.
  * Use in where-cond.
* Rework the buffer offset logic for metal kernels (#2028) | Laurent Mazare | 2024-04-07 | 1 | -39/+43
  * Move the metal kernels utils in a separate module.
  * Use the BufferOffset for unary ops.
  * Fix clippy lints.
  * Use the new BufferOffset.
  * Adapt the binary ops.
  * Affine.
  * More ops (powf, elu, cast).
* Handle the batch dimension in quantized MMV on metal. (#2022) | Laurent Mazare | 2024-04-06 | 1 | -1/+4
* first commit (#2018) | Jorge António | 2024-04-05 | 1 | -1/+1
* Add support for "sign" on tensors (#2012) | Thomas Santerre | 2024-04-04 | 5 | -10/+57
  * add the sign unary operator
  * remove unneeded import
  * remove unneeded import
  * undo formatting
  * undo formatting
  * remove unnecessary redefinition
  * allow gradient to flow through for sign and round
  * fix cpu ops to ensure that negzero and positive zero are handled properly
  * clippy fixes
  * Properly avoid gradient tracking.
  * Use a branchless version.
  Co-authored-by: laurent <laurent.mazare@gmail.com>
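A sketch of the sign op: negatives map to -1, positives to 1, and both signed zeros to 0, per the negzero handling mentioned above:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let xs = Tensor::new(&[-3.5f32, -0.0, 0.0, 2.0], &Device::Cpu)?;
    let signs = xs.sign()?;
    // Both -0.0 and 0.0 yield 0.0 with the branchless implementation.
    assert_eq!(signs.to_vec1::<f32>()?, [-1.0, 0.0, 0.0, 1.0]);
    Ok(())
}
```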
* Fix the matmul layout for accelerate & mkl. (#2011) | Laurent Mazare | 2024-04-04 | 3 | -26/+8
  * Fix the matmul layout for accelerate & mkl.
  * Reduce the required precision for pow (because of accelerate).
  * And fix the gelu f16 test.
* update dtypes checks for several metal operations (#2010) | Thomas Santerre | 2024-04-04 | 1 | -27/+45
* Optimize the gelu f16 opt. (#2008) | Laurent Mazare | 2024-04-04 | 2 | -8/+19
  * Optimize the gelu f16 opt.
  * And add a test.
* Split the cuda error file. (#2003) | Laurent Mazare | 2024-04-04 | 2 | -65/+67
* Relax the contiguous check for cuda kernels. (#2000) | Laurent Mazare | 2024-04-03 | 1 | -1/+6
  * Relax the contiguous check for cuda kernels.
  * Ensure contiguity for RNNs.
  * Unrelated fix for segment anything.
  * Better error message + allow concatenating empty slices.
* Improve the handling of matmul with squeezed layouts. (#1998) | Laurent Mazare | 2024-04-02 | 4 | -138/+150
  * Improve the handling of matmul with squeezed layouts.
  * Fix for the cuda backend.
  * Revert the temporary fix.
* modify access for conv and op to be pub to allow external packages to have custom backends (#1986) | Thomas Santerre | 2024-04-01 | 1 | -2/+2
* Quantized cuda tweaks. (#1981) | Laurent Mazare | 2024-04-01 | 1 | -89/+62
  * Quantized cuda tweaks.
  * Add some safety checks.
  * Factorize the dequantization bits.