| Commit message | Author | Age | Files | Lines |
* module docs
* varbuilder gguf docs
* add a link to gguf files
* small additional mod doc titles
* safetensor docs
* more core docs
* more module docs in candle_core
* 2 more link fixes
* Add some missing index-select metal kernels.
* Make some matrix contiguous pre-matmul.
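The pre-matmul contiguity fix can be illustrated with a small check: matmul kernels typically assume row-major inputs, so strided views (e.g. transposes) must be copied to a contiguous layout first. A minimal sketch of that check, assuming element strides; the function name is illustrative, not candle's API:

```rust
// Row-major contiguity check from shape and strides (in elements).
// Matmul kernels usually require contiguous operands, so anything
// failing this check gets copied before the kernel launch.
fn is_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    let mut expected = 1;
    for (&dim, &stride) in shape.iter().zip(strides).rev() {
        // dims of size 1 can carry any stride without affecting layout
        if dim > 1 && stride != expected {
            return false;
        }
        expected *= dim;
    }
    true
}
```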
* WIP: hopefully better const impl
* with GPU
* More tests on
* Reverting primitive for
* Incorporating review changes - added elem count check in kernel, using for call strategy
* rustfmt ran
* Split out the commands part of the metal device.
* Make most fields private.
* Move the allocator back.
* Rework the encoder provider type.
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
* Add some tracing.
* Get the trace to work.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
* Add a metal kernel for col2im1d.
* Enable the col2im variant.
* Bugfix.
* Revert the quantized tweak.
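For reference, col2im1d is the scatter-add inverse of im2col: each column element is accumulated back into its position in the 1d output, which is what makes it useful for transposed convolutions. A single-channel CPU sketch, assuming a `[l_in][k]` layout; the real kernel also handles batch and channel dims:

```rust
// col2im1d: scatter-add columns back into a 1d signal.
// cols is laid out as [l_in][k]; the output length is
// (l_in - 1) * stride + k, matching conv-transpose1d.
fn col2im1d(cols: &[f32], l_in: usize, k: usize, stride: usize) -> Vec<f32> {
    let l_out = (l_in - 1) * stride + k;
    let mut out = vec![0f32; l_out];
    for i in 0..l_in {
        for j in 0..k {
            // overlapping windows accumulate rather than overwrite
            out[i * stride + j] += cols[i * k + j];
        }
    }
    out
}
```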
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
* Separate quantized phi-3 implementation.
* Integrate the quantized phi3 model.
* Small fixes, get the generation to work properly.
* Keep the old llama implementation around.
* Change the default.
* Add the argsort cuda kernels.
* CPU version of arg-sort.
* Hook the cuda kernel + rework the cpu bits.
* Add some dedicated test.
* Working cuda kernel.
* Metal kernel.
* Metal adjustments.
* Bugfix.
* Use the fast rope in qwen.
* Rework the expert selection in qwen.
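The CPU version of arg-sort reduces to sorting an index vector by the values it points at, which then doubles as the reference the cuda and metal kernels are tested against. A minimal sketch; the function name is an assumption:

```rust
// Return the permutation that sorts `values` ascending.
fn arg_sort_ascending(values: &[f32]) -> Vec<u32> {
    let mut idx: Vec<u32> = (0..values.len() as u32).collect();
    // total_cmp gives a deterministic total order, including NaNs
    idx.sort_by(|&a, &b| values[a as usize].total_cmp(&values[b as usize]));
    idx
}
```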
* Add the storage-ref bits.
* Add the metal implementation.
* add basic unary bench for sqrt
* process unary commands in tiles of 4
* re-enable all benchmarks
* rename helper to unary
* modify approach to split up tiled and non-tiled operations
* undo bench ignore for other tests
* update tile size to 2
* only perform the optimization on the contiguous even numbered element case
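The tiling approach above can be sketched on the CPU side: full tiles of 2 are unrolled, and a scalar tail covers the leftover element, which is why the optimized path only kicks in for the contiguous, even-element case. This is an illustrative stand-in for the metal kernel, not its actual code:

```rust
// Apply sqrt over a contiguous buffer in tiles of 2, with a scalar
// tail for odd lengths. The tiled body lets both lanes be in flight.
fn unary_sqrt_tiled(src: &[f32], dst: &mut [f32]) {
    const TILE: usize = 2;
    let n = src.len();
    let tiled = n - n % TILE;
    for i in (0..tiled).step_by(TILE) {
        dst[i] = src[i].sqrt();
        dst[i + 1] = src[i + 1].sqrt();
    }
    // leftover element when n is odd
    for i in tiled..n {
        dst[i] = src[i].sqrt();
    }
}
```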
* Fix for the batch dim in the quantized matmul example.
* Enable more tests on cuda.
* Add a test for qmm with a batch.
* Fix the zeros-dim test on metal.
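The batch-dim contract being fixed and tested here can be pinned down with a naive reference: a lhs of shape (b, m, k) against a weight of shape (k, n), with the batch dim folded into extra rows. A float stand-in for the quantized path; names and layout are assumptions:

```rust
// Reference batched matmul: lhs (b, m, k) x rhs (k, n) -> (b, m, n),
// all row-major. The batch dim of lhs just extends the row count.
fn batched_matmul(lhs: &[f32], rhs: &[f32], b: usize, m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0f32; b * m * n];
    for bi in 0..b {
        for mi in 0..m {
            for ni in 0..n {
                let mut acc = 0f32;
                for ki in 0..k {
                    acc += lhs[(bi * m + mi) * k + ki] * rhs[ki * n + ni];
                }
                out[(bi * m + mi) * n + ni] = acc;
            }
        }
    }
    out
}
```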
|
| |
|
|
|
|
|
| |
* Add a synchronize method to devices.
* Metal version.
|
| |
|
|
|
|
|
|
|
| |
* Use BufferOffset in the metal backend.
* More BufferOffset usage.
* Use in where-cond.
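The idea behind a BufferOffset is a (buffer, byte offset) pair, so kernels can address a sub-range of an allocation without copying it out first. A stdlib sketch of the shape of such a type; the field and method names are assumptions, and the real type would wrap a Metal buffer rather than a byte slice:

```rust
// A view into a buffer starting at a byte offset. Kernels receive the
// base buffer plus the offset instead of a freshly copied sub-buffer.
struct BufferOffset<'a> {
    buffer: &'a [u8],
    offset_in_bytes: usize,
}

impl<'a> BufferOffset<'a> {
    // The addressable region from the offset to the end of the buffer.
    fn slice(&self) -> &'a [u8] {
        &self.buffer[self.offset_in_bytes..]
    }
}
```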
* Move the metal kernels utils in a separate module.
* Use the BufferOffset for unary ops.
* Fix clippy lints.
* Use the new BufferOffset.
* Adapt the binary ops.
* Affine.
* More ops (powf, elu, cast).
* add the sign unary operator
* remove unneeded import
* remove unneeded import
* undo formatting
* undo formatting
* remove unnecessary redefinition
* allow gradient to flow through for sign and round
* fix cpu ops to ensure that negzero and positive zero are handled properly
* clippy fixes
* Properly avoid gradient tracking.
* Use a branchless version.
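The negzero fix and the branchless version fit in one expression: a comparison-based sign maps both +0.0 and -0.0 (and NaN, where both comparisons are false) to 0.0 without branching, unlike `f64::signum`, which returns ±1.0 for signed zeros. A minimal sketch:

```rust
// Branchless sign: -1.0, 0.0, or 1.0.
// Both +0.0 and -0.0 compare equal to zero, so they yield 0.0.
fn sign(x: f64) -> f64 {
    ((x > 0.) as i8 - (x < 0.) as i8) as f64
}
```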
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Backend refactoring.
* Metal tweaks.
* Move the cudnn module.