* Add a metal kernel for col2im1d.
* Enable the col2im variant.
* Bugfix.
* Revert the quantized tweak.
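
A minimal sketch of the col2im1d scatter the new kernel implements, assuming unit dilation and no padding (the real kernel handles both); `col` holds the (l_in, k) column matrix and overlapping windows accumulate:

```rust
// Simplified col2im1d: scatter the (l_in, k) column matrix back into a 1d
// signal, accumulating overlapping contributions. This is the adjoint of
// im2col1d, which is what lets ConvTranspose1d run as matmul + col2im.
fn col2im1d(col: &[f32], l_in: usize, k: usize, stride: usize) -> Vec<f32> {
    let l_out = (l_in - 1) * stride + k;
    let mut im = vec![0f32; l_out];
    for i in 0..l_in {
        for j in 0..k {
            im[i * stride + j] += col[i * k + j];
        }
    }
    im
}
```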
* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
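
For reference, a scalar sketch of the computation the dedicated op fuses into a single kernel (the helper here is illustrative, not the crate's API):

```rust
// Layer norm over the last dimension: normalize to zero mean / unit
// variance, then apply the learned scale and shift.
fn layer_norm(x: &[f32], weight: &[f32], bias: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1f32 / (var + eps).sqrt();
    x.iter()
        .zip(weight.iter())
        .zip(bias.iter())
        .map(|((v, w), b)| (v - mean) * inv_std * w + b)
        .collect()
}
```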
* More efficient cuda implementation for ConvTranspose1d.
* Small tweak.
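
The output length the kernel has to produce follows the usual transposed-convolution arithmetic; a hypothetical helper to make it concrete:

```rust
// Standard ConvTranspose1d output length (illustrative helper, not crate API):
// l_out = (l_in - 1) * stride - 2 * padding + dilation * (k - 1) + output_padding + 1
fn conv_transpose1d_out_len(
    l_in: usize,
    k: usize,
    stride: usize,
    padding: usize,
    output_padding: usize,
    dilation: usize,
) -> usize {
    (l_in - 1) * stride + dilation * (k - 1) + output_padding + 1 - 2 * padding
}
```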
* Add a slice_set op.
* Add some testing.
* Add the dedicated kv-cache module.
* Derive debug and clone.
* Expose more kv-cache functions.
* Return the current data when appending.
* Use the new cache in the quantized phi3 model.
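
A sketch of how `slice_set` underpins the kv-cache: each new step is written into a pre-allocated buffer in place rather than concatenating tensors on every token. The `slice_set(&src, dim, offset)` call shape is inferred from this changelog and should be treated as an assumption:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Pre-allocated cache of shape (batch, max_seq_len, dim).
    let cache = Tensor::zeros((1, 8, 4), DType::F32, &dev)?;
    let step = Tensor::ones((1, 1, 4), DType::F32, &dev)?;
    // Write the new step at sequence offset 3 along dim 1, in place,
    // avoiding the repeated concatenations of a naive kv-cache.
    cache.slice_set(&step, 1, 3)?;
    Ok(())
}
```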
* Add SlicedSafetensors.
* And add some testing.
* Allow the use of tf32 accumulation in matmul.
* Better timings.
* Dummy versions for use when cuda is not enabled.
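
TF32 keeps f32's full exponent range but only 10 mantissa bits, which is why it speeds up matmul at a small precision cost; a rough illustration of the operand rounding (truncation here, where the hardware actually rounds):

```rust
// Approximate an f32 operand as TF32 by dropping the low 13 mantissa bits
// (23 -> 10 bits). Accumulation still happens in full f32 precision.
fn to_tf32(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & 0xffff_e000)
}

fn main() {
    let x = 1.000_123_4_f32;
    println!("{x} -> {}", to_tf32(x));
}
```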
* Separate quantized phi-3 implementation.
* Integrate the quantized phi3 model.
* Small fixes, get the generation to work properly.
* Keep the old llama implementation around.
* Change the default.
* Bump the version number to 0.5.1.
* Fix clippy lints for 1.78.
* More clippy fixes.
* F16/BF16 bugfix (bis).
* Another fix.
* Yet another fix.
When converting a tensor to a variable, clone if the tensor is already a variable. (#2124)
* When converting a tensor to a variable, clone if the tensor is already a variable.
* Add a test to ensure training a batch norm works with VarMaps
---------
Co-authored-by: Jeffrey Dallatezza <jeffreydallatezza@Jeffreys-Laptop.local>
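
A minimal reproduction of the fixed behavior, assuming the public `Var::from_tensor` API; after the fix the new variable gets its own storage instead of aliasing the original:

```rust
use candle_core::{DType, Device, Result, Tensor, Var};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let v = Var::zeros((2, 2), DType::F32, &dev)?;
    let t: Tensor = v.as_tensor().clone();
    // t is already backed by a variable; from_tensor now clones the data so
    // that updating v2 cannot silently mutate v.
    let v2 = Var::from_tensor(&t)?;
    assert_eq!(v.dims(), v2.dims());
    Ok(())
}
```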
* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* experimental: add metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
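
A trimmed-down sketch of the `CustomOp1` approach for the CPU path (f32-only, contiguous inputs only; the real op also provides CUDA/Metal forwards and a backward pass), assuming the trait's `cpu_fwd(storage, layout) -> (CpuStorage, Shape)` shape:

```rust
use candle_core::{CpuStorage, CustomOp1, Error, Layout, Result, Shape};

struct Sigmoid;

impl CustomOp1 for Sigmoid {
    fn name(&self) -> &'static str {
        "sigmoid"
    }

    // CPU forward only; cuda_fwd/metal_fwd keep their default
    // "unimplemented" behavior in this sketch.
    fn cpu_fwd(&self, storage: &CpuStorage, layout: &Layout) -> Result<(CpuStorage, Shape)> {
        let (start, end) = layout
            .contiguous_offsets()
            .ok_or_else(|| Error::Msg("sigmoid: input must be contiguous".to_string()))?;
        let xs = storage.as_slice::<f32>()?;
        let ys: Vec<f32> = xs[start..end].iter().map(|&x| 1. / (1. + (-x).exp())).collect();
        Ok((CpuStorage::F32(ys), layout.shape().clone()))
    }
}
```

Applying it would then look like `let y = x.apply_op1(Sigmoid)?;`, per the `CustomOp1`-based solution the commits above settle on.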
* Add a toggle to control f16/bf16 gemm precision.
* Use the faster variant in the quantized example.
* Bugfix.
* Add the cuda dequantize f16 kernels.
* Expose the cuda kernels.
* Add some testing + fix.
* Test the other cases too.
* A few more tests.
* Add an environment variable to enable the dequantize f16 + matmul behavior.
* Add the argsort cuda kernels.
* CPU version of arg-sort.
* Hook the cuda kernel + rework the cpu bits.
* Add some dedicated test.
* Working cuda kernel.
* Metal kernel.
* Metal adjustments.
* Bugfix.
* Use the fast rope in qwen.
* Rework the expert selection in qwen.
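
Usage ends up as a one-liner on `Tensor`; the `arg_sort_last_dim` method name is what this series appears to add, so treat it as an assumption:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let t = Tensor::new(&[3f32, 1., 4., 1., 5., 9., 2., 6.], &Device::Cpu)?;
    // Indices that would sort the last dimension in ascending order.
    let idx = t.arg_sort_last_dim(true)?;
    println!("{idx}");
    Ok(())
}
```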
* Add the storage-ref bits.
* Add the metal implementation.
* Update zip requirement from 0.6.6 to 1.1.1
---
updated-dependencies:
- dependency-name: zip
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
* Fix for the zip crate update.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: laurent <laurent.mazare@gmail.com>
* add basic unary bench for sqrt
* process unary commands in tiles of 4
* re-enable all benchmarks
* rename helper to unary
* modify approach to split up tiled and non-tiled operations
* undo bench ignore for other tests
* update tile size to 2
* only perform the optimization for the contiguous case with an even element count
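
The tiling idea, sketched in plain Rust rather than Metal: each thread handles a fixed tile of consecutive elements, which is only valid for contiguous buffers whose length is a multiple of the tile size (2, per the last commit):

```rust
// Each "thread" processes TILE consecutive elements; the dispatch falls back
// to the untiled kernel when the buffer is non-contiguous or has an odd
// element count.
const TILE: usize = 2;

fn sqrt_tiled(xs: &mut [f32]) {
    assert_eq!(xs.len() % TILE, 0, "tiled path requires a multiple of TILE");
    for chunk in xs.chunks_exact_mut(TILE) {
        for x in chunk.iter_mut() {
            *x = x.sqrt();
        }
    }
}
```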
* Add more QMMV cuda kernels.
* Enable the new kernels.
* Adapt the testing.
* Add the mmv kernels for smaller sizes.
* Support more mmv kernels.
* Use the new kernels.
* Fix the call.
* Silly fix.
* Improve the testing.
* Fix for dmmv.
* Add another dedicated test for the batching mmv.
* Fix for the batch dim in the quantized matmul example.
* Enable more tests on cuda.
* Add a test for qmm with a batch.
* Fix the zeros-dim test on metal.
* Add a function to clear the KV cache in falcon.
* Clippy.
* Handle zero dims in some simple operations.
* Handle zero-dims in matmul.
* More testing.
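
What the change enables, as a usage example: shapes with a zero dimension now flow through matmul instead of erroring:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // An empty batch: (0, 4) x (4, 2) -> (0, 2) rather than an error.
    let a = Tensor::zeros((0, 4), DType::F32, &dev)?;
    let b = Tensor::zeros((4, 2), DType::F32, &dev)?;
    let c = a.matmul(&b)?;
    assert_eq!(c.dims(), &[0, 2]);
    Ok(())
}
```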
* Hook the quantized matmul cuda kernels.
* Add a (currently broken) test.
* Kernel fixes.
* Fix by transposing the rhs matrix.
* Add the q4-1 kernels.
* Proper block sizes.
* More details in the tests.
* Add a synchronize method to devices.
* Metal version.
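
Typical use when timing asynchronous backends, assuming `Device::synchronize` is the method this entry adds:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::cuda_if_available(0)?;
    let a = Tensor::randn(0f32, 1f32, (1024, 1024), &dev)?;
    let start = std::time::Instant::now();
    let _b = a.matmul(&a)?;
    // CUDA/Metal launches are asynchronous; block until the queued work is
    // done so the elapsed time measures the matmul, not just the launch.
    dev.synchronize()?;
    println!("matmul took {:?}", start.elapsed());
    Ok(())
}
```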
* Add qmatmul bench
* add all dtypes
* Use BufferOffset in the metal backend.
* More BufferOffset usage.
* Use in where-cond.
* Move the metal kernels utils in a separate module.
* Use the BufferOffset for unary ops.
* Fix clippy lints.
* Use the new BufferOffset.
* Adapt the binary ops.
* Affine.
* More ops (powf, elu, cast).
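
The shape of the abstraction, sketched generically since the backend's exact field names aren't shown here:

```rust
// A buffer paired with a byte offset: kernels can read a view into a larger
// allocation without copying, which is what lets unary/binary/affine ops run
// directly on non-zero-offset slices. Field names are illustrative.
struct BufferOffset<'a, B> {
    buffer: &'a B,
    offset_in_bytes: usize,
}

impl<'a, B> BufferOffset<'a, B> {
    fn new(buffer: &'a B, offset_in_bytes: usize) -> Self {
        Self { buffer, offset_in_bytes }
    }
}
```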
* add the sign unary operator
* remove unneeded import
* remove unneeded import
* undo formatting
* undo formatting
* remove unnecessary redefinition
* allow gradient to flow through for sign and round
* fix cpu ops to ensure that negative zero and positive zero are handled properly
* clippy fixes
* Properly avoid gradient tracking.
* Use a branchless version.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
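
The branchless formulation from the last commit, which also gets -0.0, +0.0, and NaN right (all map to 0):

```rust
// Branchless sign: each comparison is 0 or 1, so the result is -1, 0, or 1
// with no data-dependent branch. Both signed zeros and NaN yield 0.
fn sign(x: f32) -> f32 {
    ((x > 0.) as i8 - (x < 0.) as i8) as f32
}

fn main() {
    assert_eq!(sign(-3.5), -1.);
    assert_eq!(sign(-0.0), 0.);
    assert_eq!(sign(f32::NAN), 0.);
}
```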
* Fix the matmul layout for accelerate & mkl.
* Reduce the required precision for pow (because of accelerate).
* And a fix for the gelu f16 test.
* Optimize the gelu f16 op.
* And add a test.
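
For context, the tanh-based gelu the f16 path computes, in scalar form (the optimization itself is in how the kernel evaluates this in half precision):

```rust
// Tanh approximation of GELU:
// gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu(x: f32) -> f32 {
    let c = (2f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1. + (c * (x + 0.044715 * x * x * x)).tanh())
}
```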
* Relax the contiguous check for cuda kernels.
* Ensure contiguity for RNNs.
* Unrelated fix for segment anything.
* Better error message + allow concatenating empty slices.
* Improve the handling of matmul with squeezed layouts.
* Fix for the cuda backend.
* Revert the temporary fix.
custom backends (#1986)
* Quantized cuda tweaks.
* Add some safety checks.
* Factorize the dequantization bits.