| Commit message | Author | Age | Files | Lines |
|
* Sync upstream mlx sdpa vector kernels with mask
* Dispatch to the 2pass kernel
* Format
|
* layer_norm_no_bias
* Modernbert model.
* Format + cleanup error.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
|
* Fixes for lint errors introduced with Rust 1.83
* rustfmt
* Fix more lints.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
|
* Provide a method to allow PTH files with state maps to be loaded.
* add a line to the doc
* String -> &str
|
* add module docs for candle-core
* doc each of the candle-nn modules and add the links to the doc page
|
* Add some fast Metal MLX SDPA kernels (#32)
* Sketch the sdpa kernel
* Add full sdpa kernel
* Add test
* Add vectorized kernel for decoding
* Update tests
* Add some docs
* Fix sdpa_vector names
* Add softcapping for vectorized sdpa
* Add softcapping for full sdpa
* Add support for head dim 32, 96, 256
* Update docs
* Add update notice
* Clippy and format
* Conditional compilation for bf16
* Use it in quantized llama
* Some review comments
* Use set_params!
* Remove unused
* Remove feature
* Fix metal sdpa for v stride
* Remove comma
* Add the dim method to layout and shape.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
|
* Improved launch config for layer-norm/rms-norm.
* Add more testing for the fused layer/rms norm kernels.
|
* add: direction for lstm layer
* lint: remove unused Error import
* refactor: remove unnecessary int assignment to Direction enum
* refactor: use &'static str type instead of String for direction_str
* Run cargofmt.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
|
* Add Pixtral.
* More pixtral vision encoder.
* Sketch a pixtral example.
* Better image loading.
* Support loading images embedded in safetensor files.
* Clippy fixes.
* Add the llava multimodal adapter.
* Add more of the llava bits.
* Add the pixtral config.
* More pixtral inference.
* Add the text generation bits.
* Get the example to work.
* Bugfix.
* Run some bits of the model in f32.
* Blessed version :)
* Better rope frequency computations.
* README update.
|
* Add a RotatingKVCache.
* Add some KvCache tests.
* Test the reset too.
* More kv-cache testing.
* More tests for the rotating kv-cache.
* Improve the api for the rotating cache so that the whole src tensor gets returned when it's overlarge.
* Handle contiguity + bugfix + use in mimi.
* Add a way to test the mimi streaming mode.
* Mimi streaming fixes.
* More rotating kv-cache.
* Fix the attn mask generation.
* Handle the abs case.
* Add some tests for the generated mask.
|
use candle-nn LSTM
|
* Add Llama 3.1 rope
* Clippy
* Format
* Clippy
* Add support for multiple eos tokens
* Untagged either
* Remove either dep and fix settings.json
* Make the max positional embeddings configurable
|
* define structs
* construct ResidualConvUnit
* forward() for ResidualConvUnit
* implement FeatureFusionBlock
* implement Scratch
* implement DPTHead
* add identity module
* implement forward for DPTHead
* add get_intermediate_layers to DinoVisionTransformer
* implement DepthAnythingV2
* some minor tweaks
* fix compile errors
* fix var builder prefixes
* setup initial example
* use fixed patch size of 37 (518 / 14)
* debugged until output
* print min and max values
* add some dynamism to the output location
* scale input image
* extract prep function
* extract output path function
* normalize image with magic mean and std
* add spectral coloring
* squeeze in the right place
* make interpolation optional
* use bail instead of panic
* omit unnecessary Shape call
* remove empty curly braces
* use bail instead of assert
* use vb and pp
* remove closures
* extract config object
* Apply rustfmt.
* Fix some clippy lints.
* More lints.
* Use the array methods.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
|
* Enable the new layer-norm.
* Shape fixes.
|
* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
|
* Add a slice_set op.
* Add some testing.
* Add the dedicated kv-cache module.
* Derive debug and clone.
* Expose more kv-cache functions.
* Return the current data when appending.
* Use the new cache in the quantized phi3 model.
|
Also implement SimpleBackend for SliceSafetensors
Signed-off-by: Harry Stern <harry@harrystern.net>
|
* Add SlicedSafetensors.
* And add some testing.
|
* Bump the version number to 0.5.1.
* Fix clippy lints for 1.78.
* More clippy fixes.
|
When converting a tensor to a variable, clone if the tensor is already a variable. (#2124)
* When converting a tensor to a variable, clone if the tensor is already a variable.
* Add a test to ensure training a batch norm works with VarMaps
---------
Co-authored-by: Jeffrey Dallatezza <jeffreydallatezza@Jeffreys-Laptop.local>
|
* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* experimental add metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
|
* Use the faster rms-norm kernel for llama.
* Use the fast variant by default.
|
* Add the rope THD kernel.
* Cuda kernel for rope-thd.
* Add the metal kernels.
* Add a dedicated test.
|
* Relax the contiguous check for cuda kernels.
* Ensure contiguity for RNNs.
* Unrelated fix for segment anything.
* Better error message + allow concatenating empty slices.
|
* add benchmarks for the candle-nn package
* uncomment test
* format
|
* Quantized models (awq/squeezellm/...) have tensors of multiple data types; use 'get_with_hints_dtype' to load tensors with a given dtype
|
* Contiguous variant of the rope kernel.
* Add the cuda kernel.
* Metal kernel.
|
* Fast kernels for rotary embeddings.
* Add a test for the fast CPU kernel.
* Rope cuda bindings.
* Cuda kernel.
* Metal kernel (part 1).
* Cuda kernels.
* Finish the metal kernel.
* Use the new kernels in the quantized example.
* Fix warning.
|
* RmsNorm kernel for metal.
* Wrapper for the metal kernel.
* Get the ops to actually work.
* Fix, get the tests to pass.
|
* Trying out a custom RmsNorm cuda kernel.
* CPU implementation for rms-norm.
* Cuda wrappers.
* Add some validation.
* Add some testing.
* More testing.
|
* Add a specialized kernel for copy2d.
* Move the cat operations.
* Avoid transpositions in cat.
* Bugfix.
* Bugfix for the cuda kernel.
* Add a benchmark.
* Add more testing.
* Test fix.
* Faster kernel.
* Add the missing kernel.
* Tweak the test.
* Add a metal kernel.
* Fix for the metal kernel.
* Get the tests to pass on metal.
* Also use this opportunity to fix the metal kernel for ELU.
* Add some bf16 kernels.
* Clippy fixes.
|
* Improve metal buffer usage
* Clone cpu storage when loading to reduce wait_until_complete calls
* Use powers of two for buffer sizes so reuse is more likely.
* Select best available buffer by size.
* Add count to MetalStorage -> can use buffer with different size
Co-authored-by: Chris Fleetwood <christopher.fleetwood@huggingface.co>
* Simplify new buffer creation without blit copy. Revert &[] -> Vec
* Add documentation on newBufferWithBytes safety / synchronization
* Drop unused buffers after command buffer is done syncing.
---------
Co-authored-by: Chris Fleetwood <christopher.fleetwood@huggingface.co>
|
* Add the StarCoder2 model.
* Add the example code and get things to work.
* And also tweak the readme.
|
* Encodec model.
* Fixes.
* Add the padding functions.
* Get the LSTM bit to work.
* Get the encodec model to generate some tokens (decoder only for now).
* Minor tweak.
|
* Support for attention bias in gemma + refactor things a bit.
* Fix the cuda tests.
|
* Groups support in conv-transpose-1d.
* Remove dangling file.
|