* Add a Context trait similar to anyhow::Context (sketch below).
* Switch two unwraps to context.
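
A minimal sketch of such a Context trait, assuming a plain String error type for illustration (the crate's actual trait and error type may differ):

```rust
// Attach a message when converting an Option or Result into an error,
// so `unwrap()` call sites become `.context("...")?` instead.
trait Context<T> {
    fn context(self, msg: &str) -> Result<T, String>;
}

impl<T> Context<T> for Option<T> {
    fn context(self, msg: &str) -> Result<T, String> {
        // a missing value becomes an error carrying the message
        self.ok_or_else(|| msg.to_string())
    }
}

impl<T, E: std::fmt::Display> Context<T> for Result<T, E> {
    fn context(self, msg: &str) -> Result<T, String> {
        // keep the original error text, prefixed with the added context
        self.map_err(|e| format!("{msg}: {e}"))
    }
}
```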
|
* module docs
* varbuilder gguf docs
* add a link to gguf files
* small additional mod doc titles
* safetensor docs
* more core docs
* more module docs in candle_core
* 2 more link fixes
|
* Add the cuda dequantize f16 kernels.
* Expose the cuda kernels.
* Add some testing + fix.
* Test the other cases too.
* A few more tests.
* Add an environment variable to enable the dequantize f16 + matmul behavior.
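
A sketch of how such an environment-variable gate can be read once and cached; the variable name below is a placeholder, not necessarily the one the commit introduced:

```rust
use std::sync::OnceLock;

fn dequantize_f16_enabled() -> bool {
    // read the flag on first use and cache the result
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| {
        std::env::var("CANDLE_DEQUANTIZE_F16") // placeholder name
            .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
            .unwrap_or(false)
    })
}
```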
|
* Boilerplate for the quantized cuda support.
* More basic cuda support.
* More cuda quantization (quantize on cpu for now).
* Add the dequantization bit.
* Start adding some dedicated cuda kernels from llama.cpp.
* Move the kernel code.
* Start interfacing with the kernel.
* Tweak the kernel launch params.
* Bugfix for quantized metal.
* Fix some clippy lints.
* Tweak the launch parameters.
* Tweak cuda basics to perform a quantized matmul.
* Perform the dequantization on the cpu + use cublas for matmul (sketch below).
* Add the dequantization kernel.
* Test the qmatmul.
* More kernels.
* Matmul-vec kernel.
* Add a couple kernels.
* More dequantization kernels.
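
A minimal sketch of the interim strategy from this commit ("perform the dequantization on the cpu + use cublas for matmul"): until dedicated quantized kernels cover every case, widen the weights to f32 and fall back to a regular matmul. All types and methods are local stand-ins, not candle's exact API:

```rust
type Result<T> = std::result::Result<T, String>;

struct Tensor;
struct QTensor;

impl QTensor {
    fn dequantize_to_f32(&self) -> Result<Tensor> {
        Ok(Tensor) // host-side, block-by-block dequantization would go here
    }
}

impl Tensor {
    fn matmul(&self, _rhs: &Tensor) -> Result<Tensor> {
        Ok(Tensor) // on cuda this is where cublas takes over
    }
}

fn qmatmul_fallback(xs: &Tensor, w: &QTensor) -> Result<Tensor> {
    let w_f32 = w.dequantize_to_f32()?; // slow path, but always correct
    xs.matmul(&w_f32)
}
```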
|
* Add the dummy qmetal backend.
* Fix the metal compilation.
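
The "dummy" backend pattern in miniature: when the metal feature is disabled, a stub with the same surface keeps the crate compiling and fails only if actually used (names assumed for the sketch):

```rust
#[cfg(not(feature = "metal"))]
pub struct QMetalStorage;

#[cfg(not(feature = "metal"))]
impl QMetalStorage {
    pub fn new() -> Result<Self, String> {
        // same signature as the real backend, but it can only report
        // that metal support is unavailable in this build
        Err("candle was compiled without metal support".to_string())
    }
}
```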
|
* Metal quantized modifications proposal (storage sketch below).
  - Add a device param wherever needed.
  - Create a new QMetal storage type that implements QuantizedType.
  - Update everywhere needed.
  - Fix Python, fix the examples.
  - Fix: fmt + clippy + stub.
  - Move everything around; only the actual implementations are missing.
  - Fix everything + add the dequantization kernels.
  - More work: fix matmul, fmt + clippy.
  - Working state, with known issues at this point:
    Q2K Metal -> bugged (also present in GGML).
    Q4K CPU -> bugged (present previously; the new test catches it).
    Q5K CPU -> bugged (present previously).
    Q8_1 both -> never really implemented, it seems.
    Q8K Metal -> never implemented in metal.
  - Fix the Q2K bug (present in ggml).
* Cleanup.
* Fix the rebase.
* Removing the fences speeds everything up and *is* correct this time...
* Cleanup the fence.
* After rebase.
* Bad code removal.
* Rebase after the phi2 merge + fix the replit default to CPU.
* Make the CI happy.
* More happy tests.
---------
Co-authored-by: Nicolas Patry <nicolas@Nicolass-MacBook-Pro.local>
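
A rough, self-contained sketch of the storage split this change describes: quantized storage becomes device-aware, with a metal variant next to the cpu one, and operations dispatch on the variant. All names here are assumptions, not candle's exact API:

```rust
type Result<T> = std::result::Result<T, String>;

struct QMetalStorage; // would hold a metal buffer plus the quant dtype

enum QStorage {
    Cpu(Vec<u8>),         // raw quantized blocks on the host
    Metal(QMetalStorage), // device-resident quantized buffer
}

impl QStorage {
    fn dequantize(&self) -> Result<Vec<f32>> {
        match self {
            QStorage::Cpu(_) => Ok(vec![]),   // cpu block decoding goes here
            QStorage::Metal(_) => Ok(vec![]), // dispatch to the metal kernels
        }
    }
}
```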
|
* Cosmetic change to the quantized whisper model.
* Fix the dequantization.
* Add the dequantize-all variable.
|
* Improve the quantized whisper setup.
* Fix the config file paths.
* Use the standard matmul where possible.
|
* wasm/simd128 version of the quantized q8_0 vecdot.
* Add the missing conversion.
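
For reference, the scalar q8_0 vector dot product that the simd128 version parallelizes: each block stores 32 signed 8-bit quants plus one scale, the integer dot product runs per block, and the two scales are applied once per block. The struct is a local stand-in (on disk the scale is an f16, widened to f32 here):

```rust
const QK8_0: usize = 32;

struct BlockQ8_0 {
    d: f32,          // per-block scale
    qs: [i8; QK8_0], // quantized values
}

fn vec_dot_q8_0(xs: &[BlockQ8_0], ys: &[BlockQ8_0]) -> f32 {
    xs.iter()
        .zip(ys)
        .map(|(x, y)| {
            // integer multiply-accumulate inside the block...
            let isum: i32 = x
                .qs
                .iter()
                .zip(y.qs.iter())
                .map(|(&a, &b)| a as i32 * b as i32)
                .sum();
            // ...then a single rescale per block
            isum as f32 * x.d * y.d
        })
        .sum()
}
```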
|
* Add more pyo3 support.
* Add some support for quantized tensors in pyo3.
* Add an Arc layer on QMatMul.
* Add the quantized matmul.
* Quantization support.
* More quantization support.
* Test the python quantization.
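
A minimal sketch of what the Arc layer buys on the Python side: the pyo3 wrapper can be cloned cheaply and clones share the underlying quantized data. `QTensor` is a stand-in for candle's type; the pyo3 attributes are the crate's real ones, the method body is illustrative:

```rust
use pyo3::prelude::*;
use std::sync::Arc;

struct QTensor; // stand-in for the quantized tensor type

#[pyclass]
#[derive(Clone)]
struct PyQTensor(Arc<QTensor>); // clones share the same tensor

#[pymethods]
impl PyQTensor {
    fn ggml_dtype(&self) -> String {
        "q4_0".to_string() // illustrative only
    }
}
```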
|
* Add a function to write gguf files.
* More GGUF file writing.
* Write the tensor data in GGUF files.
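
GGUF writing starts with a small fixed header, all little-endian: the magic bytes, the format version, the tensor count, and the metadata key/value count; the tensor-info entries and tensor data follow. A sketch using only std (v2 header layout assumed):

```rust
use std::io::Write;

fn write_gguf_header<W: Write>(
    w: &mut W,
    tensor_count: u64,
    metadata_kv_count: u64,
) -> std::io::Result<()> {
    w.write_all(b"GGUF")?;                          // magic bytes
    w.write_all(&2u32.to_le_bytes())?;              // format version
    w.write_all(&tensor_count.to_le_bytes())?;      // tensor-info entries
    w.write_all(&metadata_kv_count.to_le_bytes())?; // metadata pairs
    Ok(())
}
```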
|
* Preliminary GGUF support.
* Tensor reading.
|
* First q2k implementation
* First Q4K and Q5K implementations
* fix `q2k` and `q5k`
* Some first cleanups
* run `clippy` on tests
* finally implement `q3k`
* deactivate `q3k` test on macos
* also disable the test on linux
* Fix floating bits in `q3k` dequantization
* Refactoring pass + reorder quants in file
* `fmt`
* Re-add `src` asserts and redefine `dst`
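
Common ground for these block formats is that several sub-byte values are packed per byte; in the 4-bit case the dequantization loops start from nibble unpacking like this (illustrative only: the real k-quant blocks also carry per-sub-block scales and mins that are applied afterwards):

```rust
fn unpack_nibbles(qs: &[u8]) -> Vec<i32> {
    let mut out = Vec::with_capacity(qs.len() * 2);
    for &b in qs {
        out.push((b & 0x0f) as i32); // low nibble first
        out.push((b >> 4) as i32);   // then the high nibble
    }
    out
}
```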
|
* Skeleton files for neon support of quantization.
* SIMD version for q4 vecdot.
* Also simdify the q6k multiplication.
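
The skeleton files typically set up a compile-time switch: the neon path on aarch64, the portable scalar loop everywhere else. A pattern sketch only; the real kernels operate on quantized blocks rather than f32 slices:

```rust
#[cfg(target_arch = "aarch64")]
mod neon {
    pub fn vec_dot(xs: &[f32], ys: &[f32]) -> f32 {
        // neon intrinsics would live here; a scalar stand-in keeps
        // this sketch runnable
        xs.iter().zip(ys).map(|(x, y)| x * y).sum()
    }
}

pub fn vec_dot(xs: &[f32], ys: &[f32]) -> f32 {
    #[cfg(target_arch = "aarch64")]
    return neon::vec_dot(xs, ys);
    #[cfg(not(target_arch = "aarch64"))]
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}
```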
|
* Start adding the module trait.
* Use the module trait.
* Implement module for qmatmul.
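
The module trait, approximately: a single `forward` entry point that plain layers and the quantized matmul can both implement, so models can treat them uniformly. Types here are local stand-ins for candle's:

```rust
struct Tensor;
type Result<T> = std::result::Result<T, String>;

trait Module {
    fn forward(&self, xs: &Tensor) -> Result<Tensor>;
}

struct QMatMul; // stand-in for the quantized matmul layer

impl Module for QMatMul {
    fn forward(&self, _xs: &Tensor) -> Result<Tensor> {
        Ok(Tensor) // the real impl runs the quantized matrix product
    }
}
```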
|
* Sketch some qmatmul test.
* Add the quantization function.
* More testing.
* Make the test smaller and faster.
* Add some shape checking.
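
The shape of such a quantize/dequantize check: round-trip the data and require the reconstruction error to stay under a tolerance. The helper signatures are assumptions, not candle's test code:

```rust
fn assert_roundtrip_close(
    data: &[f32],
    quantize: impl Fn(&[f32]) -> Vec<u8>,
    dequantize: impl Fn(&[u8]) -> Vec<f32>,
    tol: f32,
) {
    let deq = dequantize(&quantize(data));
    assert_eq!(deq.len(), data.len(), "shape must survive the round trip");
    for (a, b) in data.iter().zip(&deq) {
        assert!((a - b).abs() <= tol, "round-trip error too large: {a} vs {b}");
    }
}
```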
|
* Relax the requirements on CustomOp.
* Simplify the custom-ops when no backward is required.
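
One way to read "simplify the custom-ops when no backward is required": the trait gains a default `bwd` that errors, so inference-only ops implement `fwd` alone. A sketch with stand-in types; candle's real custom-op traits are more detailed:

```rust
struct Tensor;
type Result<T> = std::result::Result<T, String>;

trait CustomOp {
    fn fwd(&self, arg: &Tensor) -> Result<Tensor>;

    // inference-only ops never have to override this
    fn bwd(&self, _arg: &Tensor, _grad: &Tensor) -> Result<Tensor> {
        Err("no backward pass implemented for this op".to_string())
    }
}
```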
|
* Add more stats to the ggml example.
* Build a quantized model from the file content.
* Move the tensor retrieval in the main crate.
* Start adding the forward pass.
* Add more to the forward pass of the quantized llama.
* Apply the attention layers.
* Add the sampling loop (sketch below).
* Get the sampling loop to work.
* Minor tweak.
* Add a quantize/dequantize test.
* Bugfix.
* Add a comment + swap the order.
* Bugfixes.
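
The sampling loop in miniature: run the model on the latest token, pick the next one (greedy argmax here; the example may well sample with a temperature instead), and feed it back. `forward` stands in for the quantized llama's forward pass:

```rust
fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.total_cmp(b))
        .map(|(i, _)| i as u32)
        .unwrap_or(0)
}

fn generate(last_prompt_token: u32, steps: usize, mut forward: impl FnMut(u32) -> Vec<f32>) -> Vec<u32> {
    let mut tokens = vec![last_prompt_token];
    for _ in 0..steps {
        let logits = forward(*tokens.last().unwrap());
        tokens.push(argmax(&logits)); // greedy pick, then feed it back
    }
    tokens
}
```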
|
* Add quantized tensors.
* Implement the debug trait for QTensor.
* Add the QMatMul custom op.
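
A hand-written Debug impl of the kind this commit adds: summarize the shape and quantization dtype instead of dumping raw block bytes. The field names are assumptions for the sketch:

```rust
use std::fmt;

struct QTensor {
    shape: Vec<usize>,
    dtype: &'static str, // e.g. "q4_0"
}

impl fmt::Debug for QTensor {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "QTensor[{:?}; {}]", self.shape, self.dtype)
    }
}
```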