Commit message log
* Add the SmolLM2 models.
* More SmolLM2 support.

* Fix the repo name for llama 3.1.
* Fix the book.

* Add some llama-3.2 examples.
* Support tie-word-embeddings for llama.

* Add Llama 3.1 rope
* Clippy
* Format
* Clippy
* Add support for multiple eos tokens:
* Untagged either
* Remove either dep and fix settings.json
* Make the max positional embeddings configurable
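
The Llama 3.1 rope change refers to rescaling the rotary-embedding frequencies for long contexts. A minimal sketch of that rescaling, assuming the commonly published Llama 3.1 parameters (scaling factor 8, low/high frequency factors 1 and 4, original context of 8192) and plain `f32` slices rather than candle tensors:

```rust
use std::f32::consts::PI;

// Rescale rotary-embedding inverse frequencies the Llama 3.1 way:
// high-frequency dims stay as-is, low-frequency dims are divided by the
// factor, and the band in between is smoothly interpolated.
fn llama31_scale_inv_freqs(inv_freqs: &[f32]) -> Vec<f32> {
    let factor = 8.0f32;
    let low_freq_factor = 1.0f32;
    let high_freq_factor = 4.0f32;
    let original_max_pos = 8192.0f32;

    let low_freq_wavelen = original_max_pos / low_freq_factor;
    let high_freq_wavelen = original_max_pos / high_freq_factor;

    inv_freqs
        .iter()
        .map(|&inv_freq| {
            let wavelen = 2.0 * PI / inv_freq;
            if wavelen < high_freq_wavelen {
                inv_freq
            } else if wavelen > low_freq_wavelen {
                inv_freq / factor
            } else {
                let smooth = (original_max_pos / wavelen - low_freq_factor)
                    / (high_freq_factor - low_freq_factor);
                (1.0 - smooth) * inv_freq / factor + smooth * inv_freq
            }
        })
        .collect()
}

fn main() {
    let head_dim = 128usize;
    let theta = 500_000f32; // rope theta used by the Llama 3 family
    let inv_freqs: Vec<f32> = (0..head_dim / 2)
        .map(|i| 1.0 / theta.powf(2.0 * i as f32 / head_dim as f32))
        .collect();
    let scaled = llama31_scale_inv_freqs(&inv_freqs);
    println!("first scaled inv_freq: {}", scaled[0]);
}
```
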
* Llama v3.
* Tweak the default params + handle special tokens.
* Small tweak.

* Use the tokenizer-output-stream in the llama example.
* Also use tokenizer-output-stream for llama2-c.

* fix index_pos bug when kv cache is disabled
* Tweak the fix.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
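
For context on what `index_pos` tracks: with a warm kv cache only the newest token is fed and the position offset equals the number of cached tokens, while with the cache disabled the whole context is re-fed from position 0. A hypothetical sketch of that bookkeeping (the function and variable names are illustrative, not the example's actual code):

```rust
// Pick which tokens to feed and at which position offset for one step of
// the generation loop, depending on whether the kv cache is enabled.
fn context_for_step(tokens: &[u32], step: usize, use_kv_cache: bool) -> (&[u32], usize) {
    if use_kv_cache && step > 0 {
        // Warm cache: feed only the newest token, offset by the cached length.
        (&tokens[tokens.len() - 1..], tokens.len() - 1)
    } else {
        // No cache (or first step): re-feed everything from position 0.
        (tokens, 0)
    }
}

fn main() {
    let tokens = vec![1u32, 15043, 29892];
    println!("{:?}", context_for_step(&tokens, 2, true));
    println!("{:?}", context_for_step(&tokens, 2, false));
}
```
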
* Simplify the safetensor usage.
* Convert more examples.
* Move more examples.
* Adapt stable-diffusion.

* Implement top_p / nucleus sampling
* Update changelog
* rustfmt
* Add tests
* Fix clippy warning
* Fix another clippy error
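
Nucleus (top-p) sampling keeps only the smallest set of tokens whose cumulative probability reaches `top_p`, renormalizes, and samples from that set. A self-contained sketch over a plain probability vector (candle's sampling code works on logits and draws its own random numbers; the uniform draw is passed in here to keep the example dependency-free):

```rust
// Sample a token index from `probs` using nucleus (top-p) sampling.
// `rand_uniform` is a uniform draw in [0, 1).
fn sample_top_p(probs: &[f32], top_p: f32, rand_uniform: f32) -> usize {
    // Sort indices by probability, highest first.
    let mut indices: Vec<usize> = (0..probs.len()).collect();
    indices.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    // Keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cumulative = 0.0;
    let mut nucleus = Vec::new();
    for &i in &indices {
        nucleus.push(i);
        cumulative += probs[i];
        if cumulative >= top_p {
            break;
        }
    }

    // Sample proportionally from the renormalized nucleus.
    let mut r = rand_uniform * cumulative;
    for &i in &nucleus {
        r -= probs[i];
        if r <= 0.0 {
            return i;
        }
    }
    *nucleus.last().unwrap()
}

fn main() {
    let probs = [0.5, 0.3, 0.15, 0.05];
    println!("sampled index: {}", sample_top_p(&probs, 0.9, 0.42));
}
```
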
* Move some models to candle-transformers so that they can be shared.
* Also move falcon.
* Move Llama.
* Move whisper (partial).

* Add some optional repeat penalty.
* Add the missing files.
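
The repeat penalty here is the llama.cpp-style one: logits of tokens that already appear in the recent context are pushed toward zero so they are less likely to be sampled again. A hedged sketch of that logic on plain slices (candle ships a small utility for this; the exact signature below is illustrative):

```rust
// Penalize tokens that already occurred in `context` by scaling their logits
// toward zero. A penalty > 1.0 makes repetition less likely.
fn apply_repeat_penalty(logits: &mut [f32], penalty: f32, context: &[u32]) {
    for &token in context {
        if let Some(logit) = logits.get_mut(token as usize) {
            if *logit >= 0.0 {
                *logit /= penalty;
            } else {
                *logit *= penalty;
            }
        }
    }
}

fn main() {
    let mut logits = vec![2.0, -1.0, 0.5];
    apply_repeat_penalty(&mut logits, 1.1, &[0, 1]);
    println!("{logits:?}");
}
```

A penalty of 1.0 leaves the logits unchanged; values slightly above 1 are the usual setting.
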
Codellama requires bf16 for now (error to convert from bf16 to f16).
Multiprocess demo not functional for it because flash-attn only supports
f16 for now.

* GQA support in the quantized model.
* Fix the reshaping.
* Fix the main llama model.
* Infer the proper gqa from the model kind.
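
Grouped-query attention (GQA) lets several query heads share one key/value head, which shrinks the kv cache; the reshaping fix above is about repeating the kv heads so they line up with the query heads. A small sketch of the index bookkeeping only (real code repeats the kv tensor along the head dimension):

```rust
// Map a query head index to the kv head it shares under GQA.
fn kv_head_for_query_head(query_head: usize, n_head: usize, n_kv_head: usize) -> usize {
    assert!(n_head % n_kv_head == 0, "query heads must be a multiple of kv heads");
    let group_size = n_head / n_kv_head;
    query_head / group_size
}

fn main() {
    // Llama-2 70B style setup: 64 query heads sharing 8 kv heads.
    let (n_head, n_kv_head) = (64, 8);
    for q in [0, 7, 8, 63] {
        println!("query head {q} -> kv head {}", kv_head_for_query_head(q, n_head, n_kv_head));
    }
}
```

With `n_kv_head = 1` this degenerates to multi-query attention, the MQA case mentioned further down the log.
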
* Start adding the module trait.
* Use the module trait.
* Implement module for qmatmul.
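
A module trait boils down to a single `forward` method that every layer implements, so layers can be composed and called uniformly. A toy sketch with a placeholder `Tensor` type (candle's real trait operates on its own tensor type and returns a `Result`):

```rust
// Stand-in tensor type for illustration only.
#[derive(Debug)]
struct Tensor(Vec<f32>);

// The trait: any layer exposes a forward pass from tensor to tensor.
trait Module {
    fn forward(&self, xs: &Tensor) -> Tensor;
}

// A trivial layer implementing the trait.
struct Scale(f32);

impl Module for Scale {
    fn forward(&self, xs: &Tensor) -> Tensor {
        Tensor(xs.0.iter().map(|v| v * self.0).collect())
    }
}

fn main() {
    let layer = Scale(2.0);
    let y = layer.forward(&Tensor(vec![1.0, 2.0]));
    println!("{:?}", y);
}
```
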
* Add some options to make layer-norm more configurable.
* Add the rms-norm variant.
* Replace the RmsNorm with the shared bits.
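
RmsNorm differs from LayerNorm in that it skips the mean subtraction and bias, normalizing by the root-mean-square of the activations before applying a learned scale; sharing both behind one configurable layer-norm is what the commit describes. A minimal sketch on plain slices:

```rust
// RMSNorm: x * w / sqrt(mean(x^2) + eps), no mean subtraction, no bias.
fn rms_norm(xs: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = xs.iter().map(|v| v * v).sum::<f32>() / xs.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    xs.iter()
        .zip(weight)
        .map(|(x, w)| x * inv_rms * w)
        .collect()
}

fn main() {
    let xs = [1.0, 2.0, 3.0, 4.0];
    let weight = [1.0; 4];
    println!("{:?}", rms_norm(&xs, &weight, 1e-5));
}
```
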
* Add more stats to the ggml example.
* Build a quantized model from the file content.
* Move the tensor retrieval in the main crate.
* Start adding the forward pass.
* Add more to the forward pass of the quantized llama.
* Apply the attention layers.
* Add the sampling loop.
* Get the sampling loop to work.
* Minor tweak.
* Add a quantize/dequantize test.
* Bugfix.
* Add a comment + swap the order.
* Bugfixes.
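
For the quantize/dequantize test, the idea is a round trip: split a tensor into fixed-size blocks, store one scale per block plus small integers, then reconstruct and check the error stays small. A simplified 8-bit variant in the spirit of the ggml formats (the real q4/q8 block layouts and sizes differ):

```rust
const BLOCK: usize = 32;

// Quantize into (scale, int8 values) blocks, one scale per block.
fn quantize(xs: &[f32]) -> Vec<(f32, Vec<i8>)> {
    xs.chunks(BLOCK)
        .map(|block| {
            let amax = block.iter().fold(0f32, |m, &v| m.max(v.abs()));
            let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
            let qs = block.iter().map(|&v| (v / scale).round() as i8).collect();
            (scale, qs)
        })
        .collect()
}

// Reconstruct the f32 values from the quantized blocks.
fn dequantize(blocks: &[(f32, Vec<i8>)]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|(scale, qs)| qs.iter().map(move |&q| q as f32 * scale))
        .collect()
}

fn main() {
    let xs: Vec<f32> = (0..64).map(|i| (i as f32 / 7.0).sin()).collect();
    let ys = dequantize(&quantize(&xs));
    let max_err = xs.iter().zip(&ys).map(|(a, b)| (a - b).abs()).fold(0f32, f32::max);
    println!("max round-trip error: {max_err}");
}
```
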
* Support local weights & dynamic outputs
* Revise as suggested
* Cargo code format

* Add a cuda kernel for upsampling.
* Update for the latest tokenizers version.

* Remove the checkpoint conversion script.
* Remove references to the script.

* Add the accelerate feature.
* Ffi tweaks.

* Line-up the llama implementation with the python-transformers one.
* Also lineup the multiprocess version.

* Softmax numerical stability.
* Fix the flash-attn test.
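
Softmax numerical stability is the classic trick of subtracting the row maximum before exponentiating, which avoids overflow for large logits without changing the result. A minimal sketch:

```rust
// Numerically stable softmax: shift by the max before exp.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Without the max subtraction, exp(1000.0) would overflow to infinity.
    println!("{:?}", softmax(&[1000.0, 999.0, 998.0]));
}
```
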
all the time)

* Move the flash-attn function in the proper crate.
* Causality tweak.

* Again set a few extra params.
* Use the appropriate kernel sizes.
* Add all the kernel sizes.
* Parallel compiling.
* Reduce the amount of parallelism.
* Add the missing kernel.
* Fix a typo.
* Remove bf16 support for now.
* Proper flash-attn parameters.
* Set the flash attention parameters.
* Add more validations.
* Setup the o_ flash attn parameters.
* More flash-attn support.
* Set more flash attn parameters.

* Add some flash-attn kernel, import the code for flash-attn v2 from Dao-AILab.
* More flash attn.
* Set up the flash attn parameters.
* Get things to compile locally.
* Move the flash attention files in a different directory.
* Build the static C library with nvcc.
* Add more flash attention.
* Update the build part.
* Better caching.
* Exclude flash attention from the default workspace.
* Put flash-attn behind a feature gate.
* Get the flash attn kernel to run.
* Move the flags to a more appropriate place.
* Enable flash attention in llama.
* Use flash attention in llama.
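
Putting flash-attn behind a feature gate means the CUDA kernels are only compiled and linked when the Cargo feature is enabled, with a fallback path otherwise. A generic sketch of the pattern (the feature name matches the commit, but the functions are placeholders, not candle's code):

```rust
// Only compiled when building with `--features flash-attn`.
#[cfg(feature = "flash-attn")]
fn attention_backend() -> &'static str {
    "fused flash-attention kernels"
}

// Fallback used in the default build.
#[cfg(not(feature = "flash-attn"))]
fn attention_backend() -> &'static str {
    "plain matmul + softmax attention"
}

fn main() {
    println!("attention backend: {}", attention_backend());
}
```
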
* Support for MQA for llama v2.
* More llama-v2.
* Move the rotary embedding precomputation in the cache.
* Add a v2 flag.
* Use the hf model.