| Commit message | Author | Age | Files | Lines |
|
* Sync upstream mlx sdpa vector kernels with mask
* Dispatch to the 2pass kernel
* Format
|
* Fixes for lint errors introduced with Rust 1.83
* rustfmt
* Fix more lints.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
|
* Add some missing index-select metal kernels.
* Make some matrices contiguous pre-matmul.
|
* Add some fast Metal MLX SDPA kernels (#32)
* Sketch the sdpa kernel
* Add full sdpa kernel
* Add test
* Add vectorized kernel for decoding
* Update tests
* Add some docs
* Fix sdpa_vector names
* Add softcapping for vectorized sdpa
* Add softcapping for full sdpa
* Add support for head dim 32, 96, 256
* Add support for head dim 32, 96, 256
* Update docs
* Add update notice
* Clippy and format
* Conditional compilation for bf16
* Use it in quantized llama
* Some review comments
* Use set_params!
* Remove unused
* Remove feature
* Fix metal sdpa for v stride
* Remove comma
* Add the dim method to layout and shape.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
|
* WIP: hopefully better const impl
* with GPU
* More tests on
* Reverting primitive for
* Incorporating review changes - added elem count check in kernel, using for call strategy
* rustfmt ran
|
* Bugfix for the metal elu kernel.
* Add a test.
|
* Split out the commands part of the metal device.
* Make most fields private.
* Move the allocator back.
* Rework the encoder provider type.
|
* Add some metal gemm benchmark.
* More benchmarks.
|
* Include the MLX gemm kernels.
* Clippy lints.
* Export the gemm_f32 kernel.
* Add the f16/bf16 variants.
* Add the initial dispatch code.
* More plugging of the mlx kernels.
* Add a currently broken test.
* Tweaks.
* Bugfix + get the tests to pass.
* Enable the gemm bf16 tests.
* Add some randomized tests.
* Update candle-metal-kernels/src/lib.rs
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* More fixes.
* More clippy fixes.
---------
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
|
* Add updated mfa metallib
* Add bgemm and tests
|
reused). (#2352)
|
* Add a metal kernel for col2im1d.
* Enable the col2im variant.
* Bugfix.
* Revert the quantized tweak.
|
* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
|
* Separate quantized phi-3 implementation.
* Integrate the quantized phi3 model.
* Small fixes, get the generation to work properly.
* Keep the old llama implementation around.
* Change the default.
|
* Bump the version number to 0.5.1.
* Fix clippy lints for 1.78.
* More clippy fixes.
|
* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* experimental add metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
|
* Add the argsort cuda kernels.
* CPU version of arg-sort.
* Hook the cuda kernel + rework the cpu bits.
* Add some dedicated test.
* Working cuda kernel.
* Metal kernel.
* Metal adjustments.
* Bugfix.
* Use the fast rope in qwen.
* Rework the expert selection in qwen.
|
* add basic unary bench for sqrt
* process unary commands in tiles of 4
* re-enable all benchmarks
* rename helper to unary
* modify approach to split up tiled and non-tiled operations
* undo bench ignore for other tests
* update tile size to 2
* only perform the optimization on the contiguous even numbered element case
|
* Use BufferOffset in the metal backend.
* More BufferOffset usage.
* Use in where-cond.
|
* Move the metal kernels utils in a separate module.
* Use the BufferOffset for unary ops.
* Fix clippy lints.
* Use the new BufferOffset.
* Adapt the binary ops.
* Affine.
* More ops (powf, elu, cast).
|
* Optimize copy-2d for metal.
* Add a hacky stopping rule for moondream.
|
* Add the rope THD kernel.
* Cuda kernel for rope-thd.
* Add the metal kernels.
* Add a dedicated test.
|
* add the sign unary operator
* remove unneeded import
* remove unneeded import
* undo formatting
* undo formatting
* remove unnecessary redefinition
* allow gradient to flow through for sign and round
* fix cpu ops to ensure that negative zero and positive zero are handled properly
* clippy fixes
* Properly avoid gradient tracking.
* Use a branchless version.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
|
* Fix for the RWKV models.
* More general fix + revert the rwkv hack.
* Remove the old hack.
|
* Support i64 in index-select on metal.
* Add some testing of index-select for all dtypes.
|