| Commit message | Author | Age | Files | Lines |

* Add the SmolLM2 models.
* More SmolLM2 support.

* Add a toggle to control f16/bf16 gemm precision.
* Use the faster variant in the quantized example.
* Bugfix.

* Add the phi-v3 quantized model.
* Also include phi-3 in the main phi example.

* add support for l3b, new tokenizer
* add todo
* Add todo and use k_s model
* Use the official tokenizers.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>

* Include topk sampling in the quantized example.
* Also sample with top-k on the mistral side.
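The two items above only note that top-k was wired into the quantized and mistral examples; the mechanism is the standard one: keep the k most likely tokens, renormalize, and sample among them. A minimal sketch in plain Rust (not the crate's sampler; the caller supplies the uniform random draw so no RNG crate is needed):

```rust
/// Illustrative top-k sampling over raw, non-empty logits.
/// `uniform` is a random draw in [0, 1) supplied by the caller.
fn sample_top_k(logits: &[f32], k: usize, uniform: f32) -> usize {
    // Rank token ids by logit, highest first, and keep only the top k.
    let mut indices: Vec<usize> = (0..logits.len()).collect();
    indices.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    indices.truncate(k.max(1));

    // Softmax restricted to the surviving candidates.
    let max_logit = logits[indices[0]];
    let weights: Vec<f32> = indices.iter().map(|&i| (logits[i] - max_logit).exp()).collect();
    let total: f32 = weights.iter().sum();

    // Walk the cumulative distribution until the draw is covered.
    let mut acc = 0.0;
    for (&idx, &w) in indices.iter().zip(&weights) {
        acc += w / total;
        if uniform < acc {
            return idx;
        }
    }
    indices[indices.len() - 1]
}
```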
* Switch the default to using the faster kernels.
* Add the force-dmmv flag.

* Add more cuda kernels for quantized matmul.
* Add the vec-dot bits.
* Expose the quantized matmul-vec kernels.
* Also include the quantize-q8-1 kernel.
* Glue code for the q8-1 quantization.
* mm-vec product via q8-1 quantization.
* Add a test.
* Add a mm test.
* Get the test to return some sensible results.
* Also test dmmv.
* Fix the launch params.
* Allow for tweaking the force_dmmv parameter while it's experimental.
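For context on the q8-1 pieces above: in the GGML-style scheme, as I understand it, the activation vector is quantized on the fly into blocks of 32 signed bytes, each block carrying a scale and the precomputed sum of its quantized values, which the matmul-vec kernel later combines with the weight blocks. A scalar sketch of that quantization step (block size and field names are assumptions taken from the GGML layout, not from the CUDA code):

```rust
/// Assumed q8_1-style block: 32 values, a scale `d`, and `s = d * sum(qs)`.
/// This follows my reading of the GGML layout; the real kernels run on the GPU.
#[allow(non_camel_case_types)]
struct BlockQ8_1 {
    d: f32,
    s: f32,
    qs: [i8; 32],
}

fn quantize_q8_1(xs: &[f32; 32]) -> BlockQ8_1 {
    // Scale so that the largest magnitude maps to 127.
    let amax = xs.iter().fold(0f32, |m, &x| m.max(x.abs()));
    let d = if amax == 0.0 { 0.0 } else { amax / 127.0 };
    let inv_d = if d == 0.0 { 0.0 } else { 1.0 / d };

    let mut qs = [0i8; 32];
    let mut sum = 0i32;
    for (q, &x) in qs.iter_mut().zip(xs.iter()) {
        let v = (x * inv_d).round() as i32;
        *q = v.clamp(-127, 127) as i8;
        sum += *q as i32;
    }
    BlockQ8_1 { d, s: d * sum as f32, qs }
}
```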
* Add a flag to force running the quantized model on CPUs.
* Add encodec to the readme.

* Metal quantized modifications proposal.
- Add a device param, wherever needed.
- Create new QMetal storage thing that implements QuantizedType.
- Update everywhere needed.
Fix Python.
Fixing examples.
Fix: fmt + clippy + stub.
Moving everything around.
Only missing the actual implems.
Fixing everything + adding dequantized kernels.
More work.
Fixing matmul.
Fmt + Clippy
Some clippy fixes.
Working state.
Q2K Metal -> Bugged (also present in GGML).
Q4K CPU -> Bugged (present previously, a new test catches it).
Q5K CPU -> Bugged (present previously).
Q8_1 Both -> Never really implemented, it seems.
Q8K Metal -> Never implemented in Metal.
Fixing Q2K bug (present in ggml).
* Cleanup.
* Fix the rebase.
* Removing the fences speeds everything up and *is* correct this time...
* Cleanup the fence.
* After rebase.
* Bad code removal.
* Rebase after phi2 merge + fix replit default to CPU.
* Making the CI happy.
* More happy tests.
---------
Co-authored-by: Nicolas Patry <nicolas@Nicolass-MacBook-Pro.local>

* Support mistral instruct v0.2.
* Use the safetensors model now that they are available.

* Add the Mixtral model.
* Add more of the mixtral layers.
* Add the final layers for mixtral.
* Sketch the expert selection.
* Add some expert routing logic.
* Hopefully finish the routing logic for mixtral.
* Add the mixtral example.
* Fix the weight filenames.
* Bugfix.
* Another fix.
* Yet another fix + remove the unused pragma.
* Shape fix.
* Support for quantized mixtral.
* Support mixtral in the quantized example.
* Mlp or moe type.
* Fix the expert field namings.
* Refactor the mlp bit.
* More MoE logic.
* Add the MoE quantized logic.
* Fix the experts length.
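The routing items above reduce to a small computation per token: the router emits one logit per expert, the strongest experts are kept (two per token in Mixtral), their softmax weights are renormalized, and the token's output is the weighted sum of those experts' outputs. An illustrative sketch of that logic in plain Rust (vector types and the top-2 choice are assumptions based on the Mixtral setup, not the crate's tensor code):

```rust
/// Pick the `top_k` experts for one token from the router logits, returning
/// (expert index, renormalized weight) pairs. Illustrative only.
fn route_experts(router_logits: &[f32], top_k: usize) -> Vec<(usize, f32)> {
    // Softmax over all experts.
    let max = router_logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = router_logits.iter().map(|&l| (l - max).exp()).collect();
    let total: f32 = exps.iter().sum();

    // Keep the `top_k` most likely experts.
    let mut scored: Vec<(usize, f32)> = exps.iter().map(|&e| e / total).enumerate().collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(top_k);

    // Renormalize the surviving weights so they sum to one again.
    let kept: f32 = scored.iter().map(|(_, w)| *w).sum();
    scored.into_iter().map(|(i, w)| (i, w / kept)).collect()
}

/// The token output is the weighted sum of the selected experts' outputs.
fn moe_output(expert_outputs: &[Vec<f32>], routes: &[(usize, f32)]) -> Vec<f32> {
    let mut out = vec![0.0f32; expert_outputs[0].len()];
    for &(idx, w) in routes {
        for (o, x) in out.iter_mut().zip(&expert_outputs[idx]) {
            *o += w * *x;
        }
    }
    out
}
```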
* Add quantized Starling, fix open-chat prompt
* Fix open-chat and starling prompts

* Add OpenChat to quantized examples
* Add chat prompt
* Make the openchat example more in line with the other models.
* Fix a typo.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>

* Fix quantized zephyr chat prompt (#1314)
* Avoid using a mutable variable.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>

* Support the shape op in ONNX.
* Share the axis normalization bits.
* Add some limited support for gather.
* Unsqueeze.
* Comparison with broadcasting.
* Add Not + handle i32.
* Tweaks for the quantized model.
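On the "share the axis normalization bits" item: ONNX allows negative axes that count from the end of the shape, so gather, unsqueeze and the other ops can all funnel through one small helper. A sketch of what such a helper typically looks like (the name and error type are mine, not the crate's):

```rust
/// Map an ONNX-style axis (possibly negative) onto a concrete dimension index.
/// For a rank-4 tensor, axis -1 becomes 3, axis -4 becomes 0, and anything
/// outside [-rank, rank) is rejected.
fn normalize_axis(axis: i64, rank: usize) -> Result<usize, String> {
    let rank_i = rank as i64;
    let adjusted = if axis < 0 { axis + rank_i } else { axis };
    if adjusted < 0 || adjusted >= rank_i {
        Err(format!("axis {axis} is out of range for rank {rank}"))
    } else {
        Ok(adjusted as usize)
    }
}
```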
* Adds check for 7b-zephyr and uses correct template
* Handle zephyr as mistral.
* Disable the protoc bits of the CI.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>

* Add a gif to the quantized readme.
* gif update.

* Add more readmes.
* Add a readme for dinov2.
* Add some skeleton files for a couple more examples.
* More whisper details.

* Implement top_p / nucleus sampling
* Update changelog
* rustfmt
* Add tests
* Fix clippy warning
* Fix another clippy error
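Nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability reaches `p` and samples only inside that set, trimming the unreliable tail without fixing the candidate count the way top-k does. The commit body above does not show the algorithm, so here is an illustrative sketch (plain Rust; the caller supplies the uniform draw):

```rust
/// Illustrative top-p / nucleus sampling over a non-empty probability vector.
/// `probs` should sum to roughly 1; `uniform` is a random draw in [0, 1).
fn sample_top_p(probs: &[f32], top_p: f32, uniform: f32) -> usize {
    // Rank token ids by probability, highest first.
    let mut indices: Vec<usize> = (0..probs.len()).collect();
    indices.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    // Keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cumulative = 0.0;
    let mut cutoff = indices.len();
    for (rank, &idx) in indices.iter().enumerate() {
        cumulative += probs[idx];
        if cumulative >= top_p {
            cutoff = rank + 1;
            break;
        }
    }
    let nucleus = &indices[..cutoff];

    // Renormalize within the nucleus and sample with the uniform draw.
    let mass: f32 = nucleus.iter().map(|&i| probs[i]).sum();
    let mut acc = 0.0;
    for &idx in nucleus {
        acc += probs[idx] / mass;
        if uniform < acc {
            return idx;
        }
    }
    nucleus[nucleus.len() - 1]
}
```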
* Move dinov2.
* Move efficientnet.
* Move the quantized llama model.
* Move segment-anything.

* Print the args + change the default temp/repeat penalty.
* Minor formatting tweak.

* Q5k vecdot.
* Add the q3k vecdot.
* Q2k vecdot.
* Move the quantized model to its own file.

* Remove some dead-code annotations.
* More dead code removal.
* One more.
* CI fix.

* Add some optional repeat penalty.
* Add the missing files.
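A repeat penalty lowers the probability of tokens that already appeared in the recent context before sampling. The sketch below uses the common llama.cpp-style rule (positive logits are divided by the penalty, negative ones multiplied); it illustrates the idea rather than reproducing the crate's helper:

```rust
use std::collections::HashSet;

/// Penalize every distinct token from `context` in the logits, with `penalty` > 1.
/// Dividing positive logits and multiplying negative ones lowers the probability
/// in both cases. Illustrative sketch only.
fn apply_repeat_penalty(logits: &mut [f32], penalty: f32, context: &[u32]) {
    let seen: HashSet<u32> = context.iter().copied().collect();
    for token in seen {
        if let Some(logit) = logits.get_mut(token as usize) {
            *logit = if *logit >= 0.0 { *logit / penalty } else { *logit * penalty };
        }
    }
}
```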
* Generic implementation of vecdot for q80.
* Add support for code-llama 7b.
* Support more code-llama.
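On the first item above: a generic vecdot for q8_0 boils down to iterating over pairs of blocks (32 signed bytes plus a scale each), doing the integer dot product inside each block and rescaling it by the product of the two scales. A scalar sketch of that idea, with the block layout assumed from the GGML q8_0 format rather than copied from the crate (the real code is specialised and vectorised):

```rust
/// Assumed q8_0-style block: one f32 scale and 32 signed 8-bit quants.
struct BlockQ80 {
    d: f32,
    qs: [i8; 32],
}

/// Scalar dot product between two quantized vectors stored as q8_0 blocks.
fn vec_dot_q8_0(lhs: &[BlockQ80], rhs: &[BlockQ80]) -> f32 {
    assert_eq!(lhs.len(), rhs.len(), "both sides need the same number of blocks");
    let mut acc = 0.0f32;
    for (a, b) in lhs.iter().zip(rhs.iter()) {
        // Integer dot product inside the block, rescaled once per block.
        let mut isum = 0i32;
        for (&qa, &qb) in a.qs.iter().zip(b.qs.iter()) {
            isum += qa as i32 * qb as i32;
        }
        acc += a.d * b.d * isum as f32;
    }
    acc
}
```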
* add chat models in quantized example
* cargo fmt

* GGUF support in the quantized model.
* Get the GGUF support to work on llama.
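GGUF is the container format that superseded the older GGML files: a small header (magic bytes, a version, then tensor and metadata-entry counts) followed by metadata key/values and tensor descriptors. A minimal header reader based on my reading of the GGUF spec (64-bit counts as in version 2 and later are assumed); the crate's actual loader also handles the metadata and tensor tables:

```rust
use std::io::Read;

/// The fixed-size part of a GGUF file as I understand the spec (v2+ layout).
#[derive(Debug)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

fn read_gguf_header<R: Read>(reader: &mut R) -> std::io::Result<GgufHeader> {
    let mut magic = [0u8; 4];
    reader.read_exact(&mut magic)?;
    if &magic != b"GGUF" {
        return Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "not a GGUF file (bad magic)",
        ));
    }
    // All integers in GGUF are little-endian.
    let mut buf4 = [0u8; 4];
    let mut buf8 = [0u8; 8];
    reader.read_exact(&mut buf4)?;
    let version = u32::from_le_bytes(buf4);
    reader.read_exact(&mut buf8)?;
    let tensor_count = u64::from_le_bytes(buf8);
    reader.read_exact(&mut buf8)?;
    let metadata_kv_count = u64::from_le_bytes(buf8);
    Ok(GgufHeader { version, tensor_count, metadata_kv_count })
}
```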
* GQA support in the quantized model.
* Fix the reshaping.
* Fix the main llama model.
* Infer the proper gqa from the model kind.
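Grouped-query attention is what the gqa factor refers to: several query heads share a single key/value head, so the attention code needs the number of KV heads (or the group size) in addition to the number of query heads. The mapping itself is tiny; an illustrative version (not the crate's tensor code):

```rust
/// With grouped-query attention, `n_head` query heads share `n_kv_head`
/// key/value heads; consecutive query heads fall into the same group.
/// E.g. n_head = 32 and n_kv_head = 8 gives 4 query heads per KV head.
fn kv_head_for_query_head(q_head: usize, n_head: usize, n_kv_head: usize) -> usize {
    assert!(n_head % n_kv_head == 0, "query heads must split evenly across KV heads");
    q_head / (n_head / n_kv_head)
}
```

Inferring "the proper gqa from the model kind" then amounts to picking this head grouping per model family instead of asking for it explicitly.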
* Add a couple functions required for yolo.
* Add the yolo-v3 example.
* Add minimum and maximum.
* Use the newly introduced maximum.
* Cuda support for min/max + add some testing.
* Allow for more tests to work with accelerate.
* Fix a typo.

* Separate the prompt stats from the post-prompt ones in the quantized example.
* Slightly nicer output printing.
* Line up with the llama.cpp implementation.

* Start adding the module trait.
* Use the module trait.
* Implement module for qmatmul.
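The module trait referred to above is the usual "maps a tensor to a tensor" abstraction, so plain layers, QMatMul and whole models can be driven through one interface. A schematic version with a stand-in tensor type (the real trait is defined over the crate's own tensor and error types):

```rust
/// Stand-in tensor type so the sketch is self-contained; the real trait works
/// on the crate's own Tensor.
#[derive(Clone, Debug)]
struct Tensor(Vec<f32>);

type Result<T> = std::result::Result<T, String>;

/// A module is anything that can run a forward pass on a tensor.
trait Module {
    fn forward(&self, xs: &Tensor) -> Result<Tensor>;
}

/// Example implementor: an elementwise scale, standing in for QMatMul or a layer.
struct Scale(f32);

impl Module for Scale {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        Ok(Tensor(xs.0.iter().map(|v| v * self.0).collect()))
    }
}

/// Because everything shares the trait, a stack of layers is just a loop.
fn run_all(modules: &[Box<dyn Module>], input: Tensor) -> Result<Tensor> {
    modules.iter().try_fold(input, |xs, m| m.forward(&xs))
}
```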
* Print the detected arch options.
* Add the q6k quantization.
* Add a currently broken test.
* Bugfix.
* Bugfix.
* Another bugfix.
* Another bugfix + get the test to work.

* Add some options to make layer-norm more configurable.
* Add the rms-norm variant.
* Replace the RmsNorm with the shared bits.
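RmsNorm differs from classic layer-norm in that it skips the mean subtraction (and usually the bias) and only rescales by the root mean square of the activations, which is why it can reuse most of the layer-norm code once those parts are made optional. A small numerical sketch (the eps value is a typical default, not taken from the crate):

```rust
/// y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
/// Unlike layer-norm there is no mean subtraction and, typically, no bias.
fn rms_norm(xs: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(xs.len(), weight.len());
    let mean_sq = xs.iter().map(|&x| x * x).sum::<f32>() / xs.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    xs.iter().zip(weight).map(|(&x, &w)| x * scale * w).collect()
}

fn main() {
    let xs = [1.0f32, -2.0, 3.0, -4.0];
    let weight = [1.0f32; 4];
    // With eps = 1e-5 the scale is 1 / sqrt((1 + 4 + 9 + 16) / 4), roughly 0.365.
    println!("{:?}", rms_norm(&xs, &weight, 1e-5));
}
```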