| Commit message | Author | Age | Files | Lines |
* Add some tracing.
* Get the trace to work.
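A minimal sketch of the kind of tracing setup these commits refer to, assuming the `tracing`, `tracing-chrome`, and `tracing-subscriber` crates; the span name and set-up below are illustrative, not taken from the repository.

```rust
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;

fn main() {
    // Writes a trace-*.json file that can be inspected in chrome://tracing or Perfetto.
    let (chrome_layer, _guard) = ChromeLayerBuilder::new().build();
    tracing_subscriber::registry().with(chrome_layer).init();

    // Anything executed while a span is entered shows up as a slice in the trace.
    let _span = tracing::span!(tracing::Level::TRACE, "forward-pass").entered();
    // ... run the model here ...
}
```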
* Use flash-attn in gemma.
* Fix for the fast bf16 cublas gemm.
* Fix some clippy lints.
* Fix another lint.
* Proper clippy fix.
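A hedged sketch of what routing attention through flash-attn typically looks like in candle-based code; the `candle_flash_attn::flash_attn(q, k, v, scale, causal)` signature and the (batch, seq, heads, head_dim) layout it expects are assumptions to double-check against the crate, and the fallback path is just the naive formulation.

```rust
use candle_core::{Result, Tensor};

// q, k, v are assumed to be (batch, heads, seq, head_dim) on entry.
fn attention(q: &Tensor, k: &Tensor, v: &Tensor, scale: f32, use_flash_attn: bool) -> Result<Tensor> {
    if use_flash_attn {
        // The fused kernel is assumed to want (batch, seq, heads, head_dim) tensors.
        let q = q.transpose(1, 2)?;
        let k = k.transpose(1, 2)?;
        let v = v.transpose(1, 2)?;
        candle_flash_attn::flash_attn(&q, &k, &v, scale, /* causal */ true)?.transpose(1, 2)
    } else {
        // Naive fallback: softmax(q k^T * scale) v.
        let att = (q.matmul(&k.t()?)? * scale as f64)?;
        let att = candle_nn::ops::softmax_last_dim(&att)?;
        att.matmul(v)
    }
}
```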
* Allow the use of tf32 accumulation in matmul.
* Better timings.
* Dummy versions for use when cuda is not enabled.
* Add a print command to tensor-tools.
* Add some flags to tweak the formatting.
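Roughly what a print command over a safetensors file boils down to, sketched with `candle_core::safetensors::load`; this is an illustration of the idea, not the actual tensor-tools code, and the formatting flags mentioned above are not reproduced here.

```rust
use candle_core::{Device, Result};

fn print_safetensors(path: &str) -> Result<()> {
    let tensors = candle_core::safetensors::load(path, &Device::Cpu)?;
    for (name, tensor) in tensors.iter() {
        // Tensor implements Display, so this prints a (summarized) view of the values.
        println!("{name}: {:?}", tensor.shape());
        println!("{tensor}");
    }
    Ok(())
}
```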
* Cuda kernel for dequantizing q8k.
* Clippy lints.
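For reference, the arithmetic that a q8k dequantize kernel has to reproduce, written as plain CPU Rust; the block layout assumed here (one f32 scale plus 256 signed byte quants per block) follows ggml's q8_K format, and the CUDA version simply parallelizes this over blocks and threads.

```rust
const QK_K: usize = 256;

// Assumed q8_K block layout: a per-block scale and 256 quantized values
// (the bsums field used by the dot products is omitted here).
struct BlockQ8K {
    d: f32,
    qs: [i8; QK_K],
}

fn dequantize_q8k(blocks: &[BlockQ8K], out: &mut Vec<f32>) {
    for block in blocks {
        for &q in block.qs.iter() {
            out.push(block.d * q as f32);
        }
    }
}
```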
* Boilerplate for the quantized cuda support.
* More basic cuda support.
* More cuda quantization (quantize on cpu for now).
* Add the dequantization bit.
* Start adding some dedicated cuda kernels from llama.cpp.
* Move the kernel code.
* Start interfacing with the kernel.
* Tweak the kernel launch params.
* Bugfix for quantized metal.
* Fix some clippy lints.
* Tweak the launch parameters.
* Tweak cuda basics to perform a quantized matmul.
* Perform the dequantization on the cpu + use cublas for matmul.
* Add the dequantization kernel.
* Test the qmatmul.
* More kernels.
* Matmul-vec kernel.
* Add a couple kernels.
* More dequantization kernels.
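A small end-to-end sketch of the quantized matmul path these commits build up, assuming the `QTensor`/`QMatMul` API in `candle_core::quantized` (the exact constructor names are worth double-checking): quantize a weight, wrap it, and run activations through it.

```rust
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Device, Result, Tensor};

fn quantized_matmul_demo(device: &Device) -> Result<Tensor> {
    // Weight of shape (out, in); the inner dimension must be a multiple of the
    // quantization block size (32 for q4_0, 256 for the k-quants).
    let weight = Tensor::randn(0f32, 1f32, (64, 256), device)?;
    let qweight = QTensor::quantize(&weight, GgmlDType::Q4_0)?;
    let qmatmul = QMatMul::from_qtensor(qweight)?;
    // Activations stay in f32; depending on the backend the weights are either
    // dequantized first or consumed directly by a dedicated kernel.
    let xs = Tensor::randn(0f32, 1f32, (1, 256), device)?;
    qmatmul.forward(&xs)
}
```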
* Metal quantized modifications proposal.
- Add a device param wherever needed.
- Create a new QMetal storage type that implements QuantizedType.
- Update everywhere needed.
Fix Python.
Fixing examples.
Fix: fmt + clippy + stub.
Moving everything around.
Only the actual implementations are missing.
Fixing everything + adding dequantized kernels.
More work.
Fixing matmul.
Fmt + clippy.
Some clippy fixes.
Working state.
Q2K Metal -> Bugged (also present in GGML).
Q4K CPU -> Bugged (present previously, the new test catches it).
Q5K CPU -> Bugged (present previously).
Q8_1 (both) -> Never really implemented, it seems.
Q8K Metal -> Never implemented in Metal.
Fixing the Q2K bug (present in ggml).
* Cleanup.
* Fix the rebase.
* Removing the fences speeds everything up and *is* correct this time...
* Cleanup the fence.
* After rebase.
* Bad code removal.
* Rebase after the phi2 merge + fix the replit default to CPU.
* Making the CI happy.
* More happy tests.
---------
Co-authored-by: Nicolas Patry <nicolas@Nicolass-MacBook-Pro.local>
* Add a dequantize command to tensor-tools.
* Clippy fixes.
* Quantized version of mistral.
* Integrate the quantized mistral variant.
* Use the quantized weight files.
* Tweak the quantization command.
* Fix the dtype when computing the rotary embeddings.
* Update the readme with the quantized version.
* Fix the decoding of the remaining tokens.
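Loading the quantized weight files mentioned above starts with parsing the gguf container; a sketch using `candle_core::quantized::gguf_file` that only lists the tensors rather than building the model.

```rust
use candle_core::quantized::gguf_file;
use candle_core::Result;

fn list_gguf_tensors(path: &str) -> Result<()> {
    let mut file = std::fs::File::open(path)?;
    let content = gguf_file::Content::read(&mut file)?;
    for (name, info) in content.tensor_infos.iter() {
        println!("{name}: {:?} {:?}", info.shape, info.ggml_dtype);
    }
    Ok(())
}
```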
* Use yoke to provide a self-referential container for mmapped safetensor files.
* Add the new self-owned type for safetensor files without removing the previous version.
* Add routing.
* Add an initializer for the case of multiple files.
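A minimal sketch of the yoke pattern this commit describes, not candle's actual code: a derived `Yokeable` wrapper keeps the borrowed `SafeTensors` view and the memory map it points into together in one owned value. It assumes the `yoke` (with its derive feature), `memmap2`, `safetensors`, and `anyhow` crates.

```rust
use std::sync::Arc;
use yoke::{Yoke, Yokeable};

// Hypothetical wrapper so the borrowed view can be yoked to its backing buffer.
#[derive(Yokeable)]
struct SafeTensorsView<'a>(safetensors::SafeTensors<'a>);

struct MmappedSafetensors {
    inner: Yoke<SafeTensorsView<'static>, Arc<memmap2::Mmap>>,
}

impl MmappedSafetensors {
    fn new(path: &std::path::Path) -> anyhow::Result<Self> {
        let file = std::fs::File::open(path)?;
        // Safety: we assume the file is not truncated or modified while mapped.
        let mmap = Arc::new(unsafe { memmap2::Mmap::map(&file)? });
        let inner = Yoke::try_attach_to_cart(mmap, |data| {
            safetensors::SafeTensors::deserialize(&data[..]).map(SafeTensorsView)
        })
        .map_err(|e| anyhow::anyhow!("failed to parse safetensors: {e}"))?;
        Ok(Self { inner })
    }

    fn tensor_names(&self) -> Vec<String> {
        self.inner.get().0.names().iter().map(|s| s.to_string()).collect()
    }
}
```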
* Use the proper block size for quantizing models.
* Use the proper dimension.
* Load gguf files for the quantized t5.
* Add the quantized t5 example.
* Allow for loading local files.
* Add some support for quantizing safetensor files.
* Transpose before quantizing.
* Quantized t5.
* Retrieve the weights from the hub.
* Add a custom softmax implementation.
* Add softmaxlastdim to the benchmarks.
* And add a test.
* Support more dtypes.
* Polish the code.
* Use the slow implementation on cuda.
* Add a todo for the cuda kernel.
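What the custom softmax computes, written as a reference with plain tensor ops over the last dimension; this is a sketch of the math, not the dedicated CPU/CUDA kernels the commits add.

```rust
use candle_core::{D, Result, Tensor};

fn softmax_last_dim_reference(xs: &Tensor) -> Result<Tensor> {
    // Subtract the per-row max first so that exp() cannot overflow.
    let max = xs.max_keepdim(D::Minus1)?;
    let exps = xs.broadcast_sub(&max)?.exp()?;
    let sums = exps.sum_keepdim(D::Minus1)?;
    exps.broadcast_div(&sums)
}
```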
* Add the dilation parameter.
* Restore the basic optimizer example.
* Dilation support in cudnn.
* Use the dilation parameter in the cpu backend.
* More dilation support.
* No support for dilation in transposed convolutions.
* Add dilation to a test.
* Remove a print.
* Helper function.
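The part of the convolution arithmetic that dilation changes, as a small sketch with hypothetical names: the effective kernel size becomes `dilation * (k - 1) + 1`, which feeds into the usual output-length formula.

```rust
fn conv_out_len(in_len: usize, k: usize, padding: usize, stride: usize, dilation: usize) -> usize {
    let effective_k = dilation * (k - 1) + 1;
    (in_len + 2 * padding - effective_k) / stride + 1
}

fn main() {
    // A 3-wide kernel with dilation 2 covers 5 input positions.
    assert_eq!(conv_out_len(10, 3, 0, 1, 1), 8);
    assert_eq!(conv_out_len(10, 3, 0, 1, 2), 6);
}
```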
* Add the quantize command.
* Bugfix for writing gguf files.
* And add a comment.
* More pickle support.
* Be more verbose.
* Add to the cuda example a reproduction of the issue.
* Tweak.
* Add a test using non-square matrices.
* Fix the conv2d kernel.
* Display the error.
* And tweak the comment.
cuda. (#578)
* Add a test for conv2d with padding.
* Cosmetic changes.
* Bugfix the rand function on the cuda backend.
* Add some group parameter to convolutions.
* Avoid some unnecessary groups checks.
* Move the tensor convolution bits.
* Proper handling of groups.
* Bump the crate version.
* And add a changelog.
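The shape contract that the group parameter adds, sketched with hypothetical helper names: with `groups` groups the weight is laid out as (c_out, c_in / groups, k_h, k_w), so both channel counts have to divide evenly.

```rust
fn check_conv2d_groups(c_in: usize, c_out: usize, groups: usize) -> Result<(), String> {
    if groups == 0 || c_in % groups != 0 {
        return Err(format!("c_in ({c_in}) is not divisible by groups ({groups})"));
    }
    if c_out % groups != 0 {
        return Err(format!("c_out ({c_out}) is not divisible by groups ({groups})"));
    }
    Ok(())
}
```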
* Retrieve more information from PyTorch checkpoints.
* Add enough support to load dino-v2 backbone weights.
* Pickle work-in-progress.
* More unpickling.
* More pickling.
* Proper handling of setitems.
* Clippy.
* Again more pickling.
* Restore the example.
* Add enough pickle support to get the list of tensors.
* Read the data from zip files.
* Retrieve the tensor shape.
* Extract the size and dtype.
* More storage types.
* Improve the destructuring.
* Also support ggml files.
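Background for the zip-related items: PyTorch .pt/.pth checkpoints are zip archives containing a pickled object graph (data.pkl) plus one file per tensor storage, so listing the archive is a natural first step. A sketch using the `zip` crate, separate from the unpickling itself.

```rust
use std::fs::File;

fn list_checkpoint_members(path: &str) -> anyhow::Result<()> {
    let mut archive = zip::ZipArchive::new(File::open(path)?)?;
    for i in 0..archive.len() {
        let member = archive.by_index(i)?;
        println!("{} ({} bytes)", member.name(), member.size());
    }
    Ok(())
}
```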
* Pickle work-in-progress.
* More unpickling.
* More pickling.
* Proper handling of setitems.
* Clippy.
* Again more pickling.
* Restore the example.
* Add enough pickle support to get the list of tensors.
* Read the data from zip files.
* Retrieve the tensor shape.
* Extract the size and dtype.
* More storage types.
* Improve the destructuring.
* Sketch some qmatmul test.
* Add the quantization function.
* More testing.
* Make the test smaller and faster.
* Add some shape checking.
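The shape of the test being sketched in these commits, under the same `QTensor`/`QMatMul` API assumptions as earlier: quantize a weight, run the quantized matmul, and check the result stays close to the plain f32 matmul.

```rust
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Device, Result, Tensor};

fn qmatmul_mean_abs_error() -> Result<f32> {
    let dev = Device::Cpu;
    let w = Tensor::randn(0f32, 1f32, (32, 256), &dev)?;
    let xs = Tensor::randn(0f32, 1f32, (4, 256), &dev)?;
    let reference = xs.matmul(&w.t()?)?;
    let qw = QTensor::quantize(&w, GgmlDType::Q4_0)?;
    let approx = QMatMul::from_qtensor(qw)?.forward(&xs)?;
    // Mean absolute error between the quantized and the f32 results; a test would
    // assert this stays below some tolerance.
    (&reference - &approx)?.abs()?.mean_all()?.to_scalar::<f32>()
}
```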
* AVX version of the vecdot for q4_0.
* Tweak the avx bits.
* Add a qmatmul benchmark.
* Fix the quantized test.
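For orientation, the scalar arithmetic that the AVX vecdot vectorizes, assuming ggml's block layouts: a q4_0 block is an f16 scale plus 32 values packed as 4-bit nibbles stored with an offset of 8, and a q8_0 block is an f16 scale plus 32 signed bytes (uses the `half` crate).

```rust
struct BlockQ40 {
    d: half::f16,
    qs: [u8; 16], // 32 4-bit values: low nibbles are elements 0..16, high nibbles 16..32
}

struct BlockQ80 {
    d: half::f16,
    qs: [i8; 32],
}

fn vec_dot_q4_0_q8_0(a: &[BlockQ40], b: &[BlockQ80]) -> f32 {
    let mut acc = 0f32;
    for (x, y) in a.iter().zip(b.iter()) {
        let mut isum = 0i32;
        for j in 0..16 {
            let lo = (x.qs[j] & 0x0f) as i32 - 8;
            let hi = (x.qs[j] >> 4) as i32 - 8;
            isum += lo * y.qs[j] as i32 + hi * y.qs[j + 16] as i32;
        }
        acc += x.d.to_f32() * y.d.to_f32() * isum as f32;
    }
    acc
}
```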
* Add a cudnn feature to be used for conv2d.
* Allocate the proper workspace.
* Only create a single cudnn handle per cuda device.
* Proper cudnn usage.
* Bugfix.
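Only the cargo feature gating is worth illustrating here; the function bodies and names below are placeholders, not the actual conv2d plumbing.

```rust
#[cfg(feature = "cudnn")]
fn conv2d_backend() -> &'static str {
    "cudnn (fused kernels with a pre-allocated workspace)"
}

#[cfg(not(feature = "cudnn"))]
fn conv2d_backend() -> &'static str {
    "the plain cuda/cpu kernels"
}

fn main() {
    println!("conv2d goes through {}", conv2d_backend());
}
```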
* Add a softmax bench.
* Add the vectorized sum reduce.
* Add more tracing to the whisper example.
* Support accelerate in more examples.
* Use accelerate for pointwise functions.
* Use accelerate for binary operations too.
* Bugfix for binary operation: use the rhs before the lhs.
* Refactor the benchmark example.
* Rename the example.
* Add some comments.
* Add a conv1d benchmark based on the whisper sizes.
* Enforce the batch-dim in conv1d.
* Add the accelerate feature.
* FFI tweaks.
* Rename to candle-core.
* More candle-core renaming.
* Simplify Tensor::randn.
* Also switch Tensor::rand to use a generic dtype.
* Support sampling for f16.
* Cleanup.
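A small usage sketch of the generic-dtype sampling these commits land on: the dtype of the result follows the type of the mean/std (or lo/up) arguments, with f16 handled as well per the commit above (not shown here).

```rust
use candle_core::{DType, Device, Result, Tensor};

fn sample(device: &Device) -> Result<()> {
    let a = Tensor::randn(0f32, 1f32, (2, 3), device)?; // normal, f32
    let b = Tensor::randn(0f64, 1f64, (2, 3), device)?; // normal, f64
    let c = Tensor::rand(-1f32, 1f32, (2, 3), device)?; // uniform in [-1, 1), f32
    assert_eq!(a.dtype(), DType::F32);
    assert_eq!(b.dtype(), DType::F64);
    assert_eq!(c.dtype(), DType::F32);
    Ok(())
}
```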
* Sketch a fast cuda kernel for reduce-sum.
* Sketch the rust support code for the fast sum kernel.
* More work on the fast kernel.
* Add some testing ground.
* A couple fixes for the fast sum kernel.
* Add some very simple sum benchmark.
* Rename the file.
* Fix some rebase issues.
* Use mkl instead.
* Use mkl in bert.
* Add the optional mkl feature.
* Conditional compilation based on the mkl feature.
* Add more mkl support.