path: root/candle-examples/examples/quantized
Commit message | Author | Age | Files | Lines
* Add the SmolLM2 models. (#2595) | Laurent Mazare | 2024-11-03 | 1 | -1/+24
  * Add the SmolLM2 models.
  * More SmolLM2 support.
* Force the revision for the phi3-llama quantized models. (#2159) | Laurent Mazare | 2024-05-04 | 1 | -2/+11
* Add a toggle for F16/BF16 accumulation in gemm. (#2141) | Laurent Mazare | 2024-04-29 | 1 | -0/+3
  * Add a toggle to control f16/bf16 gemm precision.
  * Use the faster variant in the quantized example.
  * Bugfix.
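The toggle above trades accuracy for speed by letting gemm accumulate partial products in half precision instead of f32. A minimal sketch of the difference, using the `half` crate as a stand-in for the f16 type; the flag name and candle's actual gemm plumbing are not shown here:

```rust
use half::f16;

/// Dot product accumulated in f16: every partial sum is rounded back to
/// half precision, which is faster on some hardware but loses accuracy.
fn dot_f16_accum(a: &[f16], b: &[f16]) -> f32 {
    let mut acc = f16::from_f32(0.0);
    for (x, y) in a.iter().zip(b) {
        acc = acc + *x * *y;
    }
    acc.to_f32()
}

/// Dot product accumulated in f32: inputs stay in f16 but the running
/// sum is kept in full precision.
fn dot_f32_accum(a: &[f16], b: &[f16]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x.to_f32() * y.to_f32()).sum()
}

fn main() {
    let a: Vec<f16> = (0..1024).map(|i| f16::from_f32(0.01 * i as f32)).collect();
    let b: Vec<f16> = (0..1024).map(|_| f16::from_f32(0.1)).collect();
    println!("f16 accumulation: {}", dot_f16_accum(&a, &b));
    println!("f32 accumulation: {}", dot_f32_accum(&a, &b));
}
```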
* Add the phi-v3 quantized model. (#2118) | Laurent Mazare | 2024-04-24 | 1 | -24/+35
  * Add the phi-v3 quantized model.
  * Also include phi-3 in the main phi example.
* Add support for llama3 on the quantized example (#2086) | Thomas Santerre | 2024-04-18 | 1 | -8/+23
  * add support for l3b, new tokenizer
  * add todo
  * Add todo and use k_s model
  * Use the official tokenizers.
  Co-authored-by: laurent <laurent.mazare@gmail.com>
* Include topk sampling in the quantized example. (#2005) | Laurent Mazare | 2024-04-04 | 1 | -7/+19
  * Include topk sampling in the quantized example.
  * Also sample with top-k on the mistral side.
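Top-k sampling restricts the draw to the k most probable tokens and renormalizes over them. A self-contained sketch over a plain probability vector; it mirrors the idea rather than the example's actual sampling code, and `rand_val` stands in for a uniform draw in [0, 1):

```rust
// Minimal top-k sampling sketch: keep the k largest probabilities,
// renormalize them, and sample an index proportionally.
fn sample_top_k(probs: &[f32], k: usize, rand_val: f32) -> usize {
    // Sort token indices by descending probability and keep the top k.
    let mut indices: Vec<usize> = (0..probs.len()).collect();
    indices.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    indices.truncate(k.max(1));

    // Renormalize the kept mass and sample from it.
    let total: f32 = indices.iter().map(|&i| probs[i]).sum();
    let mut acc = 0.0;
    for &i in &indices {
        acc += probs[i] / total;
        if rand_val < acc {
            return i;
        }
    }
    *indices.last().unwrap()
}

fn main() {
    let probs = vec![0.05, 0.4, 0.1, 0.3, 0.15];
    // With k = 2 only tokens 1 and 3 can be produced.
    println!("sampled token: {}", sample_top_k(&probs, 2, 0.7));
}
```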
* Switch the default to using the faster kernels. (#1978) | Laurent Mazare | 2024-04-01 | 1 | -3/+3
  * Switch the default to using the faster kernels.
  * Add the force-dmmv flag.
* More ggml cuda kernels (#1977) | Laurent Mazare | 2024-04-01 | 1 | -0/+8
  * Add more cuda kernels for quantized matmul.
  * Add the vec-dot bits.
  * Expose the quantized matmul-vec kernels.
  * Also include the quantize-q8-1 kernel.
  * Glue code for the q8-1 quantization.
  * mm-vec product via q8-1 quantization.
  * Add a test.
  * Add a mm test.
  * Get the test to return some sensible results.
  * Also test dmmv.
  * Fix the launch params.
  * Allow for tweaking the force_dmmv parameter while it's experimental.
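The q8-1 path quantizes activations in blocks of 32 values so the matmul-vec kernels can operate on 8-bit data. A simplified CPU-side sketch of that blocking scheme, written in the spirit of ggml's q8_1 layout rather than as the CUDA kernel itself:

```rust
/// One quantized block in the style of ggml's q8_1: 32 signed 8-bit
/// values plus a scale `d` and the precomputed sum `s = d * sum(q)`.
struct BlockQ8_1 {
    d: f32,
    s: f32,
    qs: [i8; 32],
}

fn quantize_q8_1(xs: &[f32; 32]) -> BlockQ8_1 {
    // Scale so that the largest magnitude maps to 127.
    let amax = xs.iter().fold(0f32, |m, x| m.max(x.abs()));
    let d = if amax > 0.0 { amax / 127.0 } else { 1.0 };
    let mut qs = [0i8; 32];
    let mut sum = 0i32;
    for (q, x) in qs.iter_mut().zip(xs) {
        let v = (x / d).round() as i32;
        *q = v.clamp(-127, 127) as i8;
        sum += *q as i32;
    }
    BlockQ8_1 { d, s: d * sum as f32, qs }
}

fn main() {
    let xs: [f32; 32] = core::array::from_fn(|i| i as f32 - 16.0);
    let block = quantize_q8_1(&xs);
    println!("d = {}, s = {}, first quants = {:?}", block.d, block.s, &block.qs[..4]);
}
```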
* Add a flag to force running the quantized model on CPUs. (#1778) | Laurent Mazare | 2024-02-28 | 1 | -1/+5
  * Add a flag to force running the quantized model on CPUs.
  * Add encodec to the readme.
* Add an option to split the prompt. (#1766) | Laurent Mazare | 2024-02-27 | 1 | -1/+14
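Splitting the prompt lets long prompts be fed through the model in fixed-size chunks rather than a single forward pass. A sketch of the chunking logic; the chunk size and the forward call are placeholders, not the example's actual flag or API:

```rust
// Feed prompt tokens to a model in fixed-size chunks, tracking the
// absolute position so attention offsets stay correct across chunks.
fn process_prompt_in_chunks(tokens: &[u32], chunk_size: usize) {
    let mut index_pos = 0;
    for chunk in tokens.chunks(chunk_size.max(1)) {
        // In the real example the model forward pass would run here on
        // `chunk` at offset `index_pos`; we only report what would happen.
        println!("forward pass on {} tokens at offset {}", chunk.len(), index_pos);
        index_pos += chunk.len();
    }
}

fn main() {
    let prompt_tokens: Vec<u32> = (0..10).collect();
    process_prompt_in_chunks(&prompt_tokens, 4);
}
```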
* Quantized GGUF style (#1523) | Nicolas Patry | 2024-01-17 | 1 | -7/+9
  * Metal quantized modifications proposal.
    - Add a device param, wherever needed.
    - Create new QMetal storage thing that implements QuantizedType.
    - Update everywhere needed.
    Fix Python. Fixing examples. Fix: fmt + clippy + stub. Moving everything around.
    Only missing the actual implems. Fixing everything + adding dequantized kernels.
    More work. Fixing matmul. Fmt + Clippy. Some clippy fixes. Working state.
    Q2K Metal -> Bugged (also present in GGML).
    Q4K CPU -> Bugged (present previously, new test catch it).
    Q5K CPU -> Bugged (present previously).
    Q8_1 Both -> Never really implemented it seems.
    Q8K metal -> Never implemented in metal.
    Fixing Q2K bug (present in ggml).
  * Cleanup.
  * Fix the rebase.
  * Removing the fences speeds everything up and *is* correct this time...
  * Cleanup the fence.
  * After rebase.
  * Bad code removal.
  * Rebase after phi2 merge + fix replit default to CPU.
  * Making the CI happy.
  * More happy tests.
  Co-authored-by: Nicolas Patry <nicolas@Nicolass-MacBook-Pro.local>
* Support mistral instruct v0.2. (#1475) | Laurent Mazare | 2023-12-23 | 1 | -4/+15
  * Support mistral instruct v0.2.
  * Use the safetensors model now that they are available.
* Mixtral quantized instruct. (#1447) | Laurent Mazare | 2023-12-16 | 1 | -0/+11
* Update the readme to mention mixtral. (#1443) | Laurent Mazare | 2023-12-15 | 1 | -0/+13
* Quantized mixtral model (#1442) | Laurent Mazare | 2023-12-15 | 1 | -1/+12
  * Add the Mixtral model.
  * Add more of the mixtral layers.
  * Add the final layers for mixtral.
  * Sketch the expert selection.
  * Add some expert routing logic.
  * Hopefully finish the routing logic for mixtral.
  * Add the mixtral example.
  * Fix the weight filenames.
  * Bugfix.
  * Another fix.
  * Yet another fix + remove the unused pragma.
  * Shape fix.
  * Support for quantized mixtral.
  * Support mixtral in the quantized example.
  * Mlp or moe type.
  * Fix the expert field namings.
  * Refactor the mlp bit.
  * More MoE logic.
  * Add the MoE quantized logic.
  * Fix the experts length.
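The expert-selection step above picks, for every token, the highest-scoring experts from the gate logits and mixes their outputs with softmax weights. A self-contained sketch of Mixtral-style top-2 routing over plain vectors, independent of the actual model code:

```rust
/// Softmax over a slice of logits.
fn softmax(xs: &[f32]) -> Vec<f32> {
    let max = xs.iter().fold(f32::NEG_INFINITY, |m, &x| m.max(x));
    let exps: Vec<f32> = xs.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Return the indices of the top-k experts and their mixing weights.
fn route_top_k(gate_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut order: Vec<usize> = (0..gate_logits.len()).collect();
    order.sort_by(|&a, &b| gate_logits[b].total_cmp(&gate_logits[a]));
    order.truncate(k);
    // Softmax over the selected logits gives weights that sum to one.
    let selected: Vec<f32> = order.iter().map(|&i| gate_logits[i]).collect();
    let weights = softmax(&selected);
    order.into_iter().zip(weights).collect()
}

fn main() {
    // 8 experts, pick 2 per token as in Mixtral-style routing.
    let gate_logits = vec![0.1, 2.3, -0.5, 1.9, 0.0, -1.2, 0.4, 0.7];
    for (expert, weight) in route_top_k(&gate_logits, 2) {
        println!("expert {expert} with weight {weight:.3}");
    }
}
```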
* Add the leo models to the quantized examples. (#1398) | Laurent Mazare | 2023-12-03 | 1 | -31/+46
* Add quantized Starling, fix open-chat prompt (#1393) | Lucas de Ávila Martins | 2023-12-02 | 1 | -6/+36
  * Add quantized Starling, fix open-chat prompt
  * Fix open-chat and starling prompts
* Fix OpenChat 3.5 tokenizer (#1347) | Lucas de Ávila Martins | 2023-11-19 | 1 | -1/+3
* Add OpenChat 3.5 to quantized examples (#1346) | Lucas de Ávila Martins | 2023-11-19 | 1 | -7/+39
  * Add OpenChat to quantized examples
  * Add chat prompt
  * Make the openchat example more in line with the other models.
  * Fix a typo.
  Co-authored-by: laurent <laurent.mazare@gmail.com>
* Fix quantized zephyr chat prompt (#1314) (#1317) | Michael Leandersson | 2023-11-11 | 1 | -2/+7
  * Fix quantized zephyr chat prompt (#1314)
  * Avoid using a mutable variable.
  Co-authored-by: Laurent <laurent.mazare@gmail.com>
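Zephyr fine-tunes expect a specific chat template rather than a raw prompt, which is what the fix above adjusts. A sketch of the kind of formatting involved, assuming the commonly published `<|system|>` / `<|user|>` / `<|assistant|>` tags; check the model card for the exact template used by a given checkpoint:

```rust
// Build a single-turn Zephyr-style chat prompt. The tag layout here
// follows the template published for the Zephyr-7b fine-tunes; other
// models use different markers, so this is illustrative, not canonical.
fn zephyr_prompt(system: &str, user: &str) -> String {
    format!("<|system|>\n{system}</s>\n<|user|>\n{user}</s>\n<|assistant|>\n")
}

fn main() {
    let prompt = zephyr_prompt("You are a helpful assistant.", "What is GGUF?");
    print!("{prompt}");
}
```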
* Quantized model small tweaks (#1290) | Laurent Mazare | 2023-11-07 | 1 | -39/+54
  * Support the shape op in ONNX.
  * Share the axis normalization bits.
  * Add some limited support for gather.
  * Unsqueeze.
  * Comparison with broadcasting.
  * Add Not + handle i32.
  * Tweaks for the quantized model.
* Adds check for 7b-zephyr and uses correct template (#1283) | DTJ11235 | 2023-11-06 | 1 | -3/+6
  * Adds check for 7b-zephyr and uses correct template
  * Handle zephyr as mistral.
  * Disable the protoc bits of the CI.
  Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add support for Zephyr-7b in the quantized model. (#1124) | Laurent Mazare | 2023-10-18 | 1 | -2/+12
* Fix the prompt for mistral when using instruct/interactive mode. (#1013) | Laurent Mazare | 2023-10-01 | 1 | -12/+31
* Integrate TheBloke quantized mistral weights. (#1012) | Laurent Mazare | 2023-09-30 | 1 | -2/+26
* Add a gif to the quantized readme. (#833) | Laurent Mazare | 2023-09-13 | 2 | -0/+2
  * Add a gif to the quantized readme.
  * gif update.
* Add more example readmes. (#828) | Laurent Mazare | 2023-09-12 | 1 | -1/+1
  * Add more readmes.
  * Add a readme for dinov2.
  * Add some skeleton files for a couple more examples.
  * More whisper details.
* Implement top_p / nucleus sampling (#819) | Juarez Bochi | 2023-09-12 | 1 | -1/+5
  * Implement top_p / nucleus sampling
  * Update changelog
  * rustfmt
  * Add tests
  * Fix clippy warning
  * Fix another clippy error
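Nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability reaches p and samples within that set. A standalone sketch over a probability vector, mirroring the idea rather than the exact code added in this PR; `rand_val` stands in for a uniform draw in [0, 1):

```rust
// Minimal top-p (nucleus) sampling sketch: sort tokens by probability,
// keep the smallest prefix whose mass exceeds `p`, renormalize, sample.
fn sample_top_p(probs: &[f32], p: f32, rand_val: f32) -> usize {
    let mut indices: Vec<usize> = (0..probs.len()).collect();
    indices.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    // Find the nucleus: the shortest prefix with cumulative mass >= p.
    let mut cumulative = 0.0;
    let mut cutoff = indices.len();
    for (n, &i) in indices.iter().enumerate() {
        cumulative += probs[i];
        if cumulative >= p {
            cutoff = n + 1;
            break;
        }
    }
    indices.truncate(cutoff);

    // Sample proportionally within the nucleus.
    let total: f32 = indices.iter().map(|&i| probs[i]).sum();
    let mut acc = 0.0;
    for &i in &indices {
        acc += probs[i] / total;
        if rand_val < acc {
            return i;
        }
    }
    *indices.last().unwrap()
}

fn main() {
    let probs = vec![0.05, 0.4, 0.1, 0.3, 0.15];
    // With p = 0.7 only tokens 1 and 3 form the nucleus.
    println!("sampled token: {}", sample_top_p(&probs, 0.7, 0.5));
}
```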
* Add a small readme for the quantized example. (#823) | Laurent Mazare | 2023-09-12 | 1 | -0/+35
* Move more models to candle-transformers (#796) | Laurent Mazare | 2023-09-10 | 2 | -372/+1
  * Move dinov2.
  * Move efficientnet.
  * Move the quantized llama model.
  * Move segment-anything.
* Tweak some quantized args (#692) | Laurent Mazare | 2023-08-31 | 1 | -5/+14
  * Print the args + change the default temp/repeat penalty.
  * Minor formatting tweak.
* Interactive mode for the quantized model. (#690) | Laurent Mazare | 2023-08-31 | 2 | -55/+109
* Neon optimized vecdot (#666) | Laurent Mazare | 2023-08-29 | 2 | -364/+371
  * Q5k vecdot.
  * Add the q3k vecdot.
  * Q2k vecdot.
  * Move the quantized model to its own file.
* Remove some dead-code annotations. (#629) | Laurent Mazare | 2023-08-27 | 1 | -11/+0
  * Remove some dead-code annotations.
  * More dead code removal.
  * One more.
  * CI fix.
* Add some optional repeat penalty. (#623) | Laurent Mazare | 2023-08-27 | 1 | -17/+5
  * Add some optional repeat penalty.
  * Add the missing files.
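A repeat penalty pushes down the logits of recently generated tokens to discourage loops. A sketch of the usual rule, applied over the last n tokens of the context; the helper name and in-place slice form are illustrative rather than the example's actual code:

```rust
// Penalize tokens that already appeared in the recent context.
// Positive logits are divided by `penalty`, negative ones multiplied,
// so the adjustment always pushes the token's probability down.
fn apply_repeat_penalty(logits: &mut [f32], penalty: f32, context: &[u32]) {
    for &token in context {
        if let Some(logit) = logits.get_mut(token as usize) {
            if *logit >= 0.0 {
                *logit /= penalty;
            } else {
                *logit *= penalty;
            }
        }
    }
}

fn main() {
    let mut logits = vec![2.0, -1.0, 0.5, 3.0];
    // Tokens 0 and 3 were generated recently; penalize them.
    apply_repeat_penalty(&mut logits, 1.1, &[0, 3]);
    println!("{logits:?}");
}
```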
* Generic implementation of vecdot for q80. (#596) | Laurent Mazare | 2023-08-25 | 1 | -5/+23
  * Generic implementation of vecdot for q80.
  * Add support for code-llama 7b.
  * Support more code-llama.
* Get the rms epsilon from GGUF. (#565) | Laurent Mazare | 2023-08-23 | 1 | -8/+10
* Fix the quantized example. (#564) | Laurent Mazare | 2023-08-23 | 1 | -2/+2
* add chat models in quantized example (#551) | cksac | 2023-08-23 | 1 | -0/+18
  * add chat models in quantized example
  * cargo fmt
* GGUF support in the quantized model. (#559) | Laurent Mazare | 2023-08-23 | 1 | -45/+143
  * GGUF support in the quantized model.
  * Get the GGUF support to work on llama.
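GGUF files begin with a small fixed header (magic bytes, version, tensor count, metadata count) followed by key/value metadata and tensor descriptors, which is what this support reads before loading the quantized weights. A sketch that parses just the fixed header, assuming the v2/v3 layout with 64-bit counts; the `model.gguf` path is a placeholder:

```rust
use std::fs::File;
use std::io::{self, Read};

/// Fixed-size fields at the start of a GGUF file (v2/v3 layout).
#[derive(Debug)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

fn read_u32(r: &mut impl Read) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}

fn read_u64(r: &mut impl Read) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    r.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

fn read_gguf_header(r: &mut impl Read) -> io::Result<GgufHeader> {
    // The file literally starts with the ASCII bytes "GGUF".
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if &magic != b"GGUF" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a GGUF file"));
    }
    Ok(GgufHeader {
        version: read_u32(r)?,
        tensor_count: read_u64(r)?,
        metadata_kv_count: read_u64(r)?,
    })
}

fn main() -> io::Result<()> {
    let mut file = File::open("model.gguf")?;
    println!("{:?}", read_gguf_header(&mut file)?);
    Ok(())
}
```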
* GQA support in the quantized model. (#555) | Laurent Mazare | 2023-08-22 | 1 | -5/+31
  * GQA support in the quantized model.
  * Fix the reshaping.
  * Fix the main llama model.
  * Infer the proper gqa from the model kind.
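With grouped-query attention the model has fewer key/value heads than query heads, so each KV head is shared by a group of query heads; in the quantized llama this shows up as reshaping or repeating the KV tensors. A tiny sketch of the head mapping, independent of the tensor code:

```rust
// Map a query head to the key/value head it shares under GQA.
// With 32 query heads and 8 kv heads, query heads 0..3 use kv head 0, etc.
fn kv_head_for_query_head(query_head: usize, n_head: usize, n_kv_head: usize) -> usize {
    let group_size = n_head / n_kv_head;
    query_head / group_size
}

fn main() {
    let (n_head, n_kv_head) = (32, 8);
    for q in [0, 3, 4, 31] {
        println!("query head {q} -> kv head {}", kv_head_for_query_head(q, n_head, n_kv_head));
    }
}
```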
* Add some llama-v2 variants. (#545) | Laurent Mazare | 2023-08-22 | 1 | -3/+22
* Add some optional repeat penalty. (#535) | Laurent Mazare | 2023-08-21 | 1 | -0/+33
* Add a yolo-v3 example. (#528) | Laurent Mazare | 2023-08-20 | 1 | -0/+6
  * Add a couple functions required for yolo.
  * Add the yolo-v3 example.
  * Add minimum and maximum.
  * Use the newly introduced maximum.
  * Cuda support for min/max + add some testing.
  * Allow for more tests to work with accelerate.
  * Fix a typo.
* Line up the llama.cpp implementation with the candle one. (#518) | Laurent Mazare | 2023-08-19 | 1 | -40/+78
  * Separate the prompt stats from the post-prompt ones in the quantized example.
  * Slightly nicer output printing.
  * Line up with the llama.cpp implementation.
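Separating the two phases means prompt processing (one batched forward pass) and per-token generation each report their own token/s figure, as llama.cpp does. A sketch of that bookkeeping, with sleeps standing in for the actual model work and the token counts as placeholders:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() {
    // Prompt phase: one batched forward pass over all prompt tokens.
    let prompt_tokens = 128usize;
    let prompt_start = Instant::now();
    sleep(Duration::from_millis(50)); // stand-in for the prompt forward pass
    let prompt_dt = prompt_start.elapsed();

    // Generation phase: one forward pass per sampled token.
    let generated_tokens = 64usize;
    let gen_start = Instant::now();
    sleep(Duration::from_millis(200)); // stand-in for the sampling loop
    let gen_dt = gen_start.elapsed();

    println!(
        "{prompt_tokens} prompt tokens processed: {:.2} token/s",
        prompt_tokens as f64 / prompt_dt.as_secs_f64()
    );
    println!(
        "{generated_tokens} tokens generated: {:.2} token/s",
        generated_tokens as f64 / gen_dt.as_secs_f64()
    );
}
```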
* Add a simple Module trait and implement it for the various nn layers (#500) | Laurent Mazare | 2023-08-18 | 1 | -1/+1
  * Start adding the module trait.
  * Use the module trait.
  * Implement module for qmatmul.
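The Module trait gives every layer a single `forward` entry point so layers and whole models compose uniformly. A condensed sketch of the idea with a stand-in tensor type; candle's real trait operates on its own `Tensor` type and returns a `Result`:

```rust
// Stand-in for a real tensor type, just enough to show the trait shape.
type Tensor = Vec<f32>;

trait Module {
    fn forward(&self, xs: &Tensor) -> Tensor;
}

/// A toy layer that multiplies every element by a constant.
struct Scale(f32);

impl Module for Scale {
    fn forward(&self, xs: &Tensor) -> Tensor {
        xs.iter().map(|x| x * self.0).collect()
    }
}

/// Layers can now be chained generically through the shared trait.
fn run_all(layers: &[&dyn Module], xs: Tensor) -> Tensor {
    layers.iter().fold(xs, |acc, layer| layer.forward(&acc))
}

fn main() {
    let (a, b) = (Scale(2.0), Scale(0.5));
    println!("{:?}", run_all(&[&a as &dyn Module, &b as &dyn Module], vec![1.0, 2.0, 3.0]));
}
```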
* Q6K quantization (#495) | Laurent Mazare | 2023-08-17 | 1 | -0/+8
  * Print the detected arch options.
  * Add the q6k quantization.
  * Add a currently broken test.
  * Bugfix.
  * Bugfix.
  * Another bugfix.
  * Another bugfix + get the test to work.
* Add the whisper small model. (#490) | Laurent Mazare | 2023-08-17 | 1 | -1/+1
* Add a verbose-prompt mode, similar to llama.cpp. (#489) | Laurent Mazare | 2023-08-17 | 1 | -5/+13
* Layer norm tweaks (#482) | Laurent Mazare | 2023-08-17 | 1 | -18/+4
  * Add some options to make layer-norm more configurable.
  * Add the rms-norm variant.
  * Replace the RmsNorm with the shared bits.
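RmsNorm differs from LayerNorm in that it only rescales activations by their root-mean-square (no mean subtraction, no bias), using the epsilon taken from the model config. A plain-Rust sketch of the formula, separate from the shared layer-norm code mentioned above:

```rust
/// RmsNorm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
/// Unlike LayerNorm there is no mean subtraction and no bias term.
fn rms_norm(xs: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = xs.iter().map(|x| x * x).sum::<f32>() / xs.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    xs.iter().zip(weight).map(|(x, w)| x * scale * w).collect()
}

fn main() {
    let xs = vec![1.0, -2.0, 3.0, -4.0];
    let weight = vec![1.0; 4];
    println!("{:?}", rms_norm(&xs, &weight, 1e-5));
}
```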