Commit message log

* Improved launch config for layer-norm/rms-norm.
* Add more testing for the fused layer/rms norm kernels.
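
The fused path is exposed through `candle_nn::ops::rms_norm`; a minimal usage sketch (CPU device here, swapping in `Device::new_cuda(0)?` would exercise the improved launch config):

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu; // Device::new_cuda(0)? hits the fused cuda kernel
    let x = Tensor::randn(0f32, 1f32, (2, 8), &dev)?;
    let alpha = Tensor::ones(8, DType::F32, &dev)?; // per-channel scale
    // Fused rms-norm over the last dim: x / sqrt(mean(x^2) + eps) * alpha.
    let y = candle_nn::ops::rms_norm(&x, &alpha, 1e-5)?;
    println!("{:?}", y.dims()); // [2, 8]
    Ok(())
}
```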

* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
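
A small sketch of reaching the dedicated op through the `candle_nn::LayerNorm` module:

```rust
use candle_core::{DType, Device, Module, Result, Tensor};
use candle_nn::LayerNorm;

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::randn(0f32, 1f32, (2, 8), &dev)?;
    let weight = Tensor::ones(8, DType::F32, &dev)?;
    let bias = Tensor::zeros(8, DType::F32, &dev)?;
    // Normalizes over the last dimension, then applies weight and bias.
    let ln = LayerNorm::new(weight, bias, 1e-5);
    let y = ln.forward(&x)?;
    println!("{:?}", y.dims()); // [2, 8]
    Ok(())
}
```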

* More efficient cuda implementation for ConvTranspose1d.
* Small tweak.
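
For context, the op is exposed as `Tensor::conv_transpose1d`; a minimal sketch, assuming the argument order `padding, output_padding, stride, dilation, groups` of recent candle versions:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu; // Device::new_cuda(0)? for the cuda implementation
    let x = Tensor::randn(0f32, 1f32, (1, 4, 10), &dev)?; // (batch, c_in, length)
    let k = Tensor::randn(0f32, 1f32, (4, 8, 3), &dev)?; // (c_in, c_out, kernel_size)
    // padding = 0, output_padding = 0, stride = 1, dilation = 1, groups = 1
    let y = x.conv_transpose1d(&k, 0, 0, 1, 1, 1)?;
    println!("{:?}", y.dims()); // [1, 8, 12]
    Ok(())
}
```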

* Bump the version number to 0.5.1.
* Fix clippy lints for 1.78.
* More clippy fixes.

* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* experimental add metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
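
The resulting `CustomOp1`-based sigmoid is reachable through `candle_nn::ops::sigmoid`; a minimal sketch:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::new(&[-2f32, 0.0, 2.0], &dev)?;
    // Specialized sigmoid op: 1 / (1 + exp(-x)).
    let y = candle_nn::ops::sigmoid(&x)?;
    println!("{}", y);
    Ok(())
}
```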

* Add the cuda dequantize f16 kernels.
* Expose the cuda kernels.
* Add some testing + fix.
* Test the other cases too.
* A few more tests.
* Add an environment variable to enable the dequantize f16 + matmul behavior.

* Add the argsort cuda kernels.
* CPU version of arg-sort.
* Hook the cuda kernel + rework the cpu bits.
* Add some dedicated test.
* Working cuda kernel.
* Metal kernel.
* Metal adjustments.
* Bugfix.
* Use the fast rope in qwen.
* Rework the expert selection in qwen.
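
A minimal sketch of the resulting API via `Tensor::arg_sort_last_dim(asc)`:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::new(&[3f32, 1.0, 2.0], &dev)?;
    // Ascending arg-sort over the last dimension; returns u32 indices.
    let idx = x.arg_sort_last_dim(true)?;
    println!("{}", idx); // [1, 2, 0]
    Ok(())
}
```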

* Add more QMMV cuda kernels.
* Enable the new kernels.
* Adapt the testing.

* Add the mmv kernels for smaller sizes.
* Support more mmv kernels.
* Use the new kernels.
* Fix the call.
* Silly fix.
* Improve the testing.
* Fix for dmmv.
* Add another dedicated test for the batching mmv.

* Hook the quantized matmul cuda kernels.
* Add a (currently broken) test.
* Kernel fixes.
* Fix by transposing the rhs matrix.
* Add the q4-1 kernels.
* Proper block sizes.
* More details in the tests.
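
A sketch of how these kernels are reached through `QMatMul`, assuming the current `QTensor::quantize(&tensor, GgmlDType)` entry point; per the rhs-transposition fix above, `forward` computes `x @ w^T`:

```rust
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu; // Device::new_cuda(0)? routes through the cuda kernels
    // Quantize an (n, k) weight matrix; k must be a multiple of the block size.
    let w = Tensor::randn(0f32, 1f32, (64, 256), &dev)?;
    let qw = QTensor::quantize(&w, GgmlDType::Q4_0)?;
    let mm = QMatMul::from_qtensor(qw)?;
    let x = Tensor::randn(0f32, 1f32, (1, 256), &dev)?;
    let y = mm.forward(&x)?; // x @ w^T
    println!("{:?}", y.dims()); // [1, 64]
    Ok(())
}
```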

* Add the rope THD kernel.
* Cuda kernel for rope-thd.
* Add the metal kernels.
* Add a dedicated test.
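
A minimal sketch of the thd entry point, assuming `candle_nn::rotary_emb::rope_thd(&x, &cos, &sin)` with `x` in `(batch, time, heads, head_dim)` layout and `(time, head_dim / 2)` cos/sin tables:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let (b, t, h, d) = (1, 5, 2, 8);
    // "thd" layout: (batch, time, heads, head_dim), no transpose needed.
    let x = Tensor::randn(0f32, 1f32, (b, t, h, d), &dev)?;
    // Standard rotary frequencies; the cos/sin tables are (t, d / 2) in f32.
    let inv_freq: Vec<f32> = (0..d / 2)
        .map(|i| 1f32 / 10_000f32.powf(2f32 * i as f32 / d as f32))
        .collect();
    let inv_freq = Tensor::from_vec(inv_freq, (1, d / 2), &dev)?;
    let pos = Tensor::arange(0u32, t as u32, &dev)?
        .to_dtype(DType::F32)?
        .reshape((t, 1))?;
    let freqs = pos.broadcast_mul(&inv_freq)?; // (t, d/2)
    let (cos, sin) = (freqs.cos()?, freqs.sin()?);
    let x_rot = candle_nn::rotary_emb::rope_thd(&x, &cos, &sin)?;
    println!("{:?}", x_rot.dims()); // [1, 5, 2, 8]
    Ok(())
}
```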

* add the sign unary operator
* remove unneeded import
* remove unneeded import
* undo formatting
* undo formatting
* remove unnecessary redefinition
* allow gradient to flow through for sign and round
* fix cpu ops to ensure that negative zero and positive zero are handled properly
* clippy fixes
* Properly avoid gradient tracking.
* Use a branchless version.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
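
A minimal sketch of the resulting operator, illustrating the zero handling fixed above (both -0.0 and +0.0 map to 0):

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::new(&[-2.0f32, -0.0, 0.0, 3.0], &dev)?;
    // sign: -1 for negatives, 0 for either zero, 1 for positives.
    let s = x.sign()?;
    println!("{}", s); // [-1, 0, 0, 1]
    Ok(())
}
```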

* Relax the contiguous check for cuda kernels.
* Ensure contiguity for RNNs.
* Unrelated fix for segment anything.
* Better error message + allow concatenating empty slices.

* Add more cuda kernels for quantized matmul.
* Add the vec-dot bits.
* Expose the quantized matmul-vec kernels.
* Also include the quantize-q8-1 kernel.
* Glue code for the q8-1 quantization.
* mm-vec product via q8-1 quantization.
* Add a test.
* Add a mm test.
* Get the test to return some sensible results.
* Also test dmmv.
* Fix the launch params.
* Allow for tweaking the force_dmmv parameter while it's experimental.
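
The toggle in the last bullet appears to be a process-wide flag; a sketch assuming it is exposed as `candle_core::quantized::cuda::set_force_dmmv` and only compiled with the cuda feature (treat the module path as an assumption and check the source):

```rust
fn main() {
    // Assumed location of the experimental toggle; verify against the candle
    // source. Forces the dequantize + matmul-vec (dmmv) path instead of the
    // q8-1 based mm-vec kernels.
    #[cfg(feature = "cuda")]
    candle_core::quantized::cuda::set_force_dmmv(true);
}
```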

* Use the new rope kernel in mistral.
* Compute the cos and sin with full precision.
* Bugfix.

* Contiguous variant of the rope kernel.
* Add the cuda kernel.
* Metal kernel.

* Fast kernels for rotary embeddings.
* Add a test for the fast CPU kernel.
* Rope cuda bindings.
* Cuda kernel.
* Metal kernel (part 1).
* Cuda kernels.
* Finish the metal kernel.
* Use the new kernels in the quantized example.
* Fix warning.
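
A minimal sketch of the fast path, assuming `candle_nn::rotary_emb::rope` over a contiguous `(batch, heads, time, head_dim)` input; the identity tables here (cos = 1, sin = 0) only keep the example short, real code derives them from the rotary frequencies as in the rope-thd sketch above:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let (b, h, t, d) = (1, 2, 5, 8);
    let q = Tensor::randn(0f32, 1f32, (b, h, t, d), &dev)?;
    // (t, d/2) cos/sin tables; identity rotation for brevity.
    let cos = Tensor::ones((t, d / 2), DType::F32, &dev)?;
    let sin = Tensor::zeros((t, d / 2), DType::F32, &dev)?;
    let q_rot = candle_nn::rotary_emb::rope(&q, &cos, &sin)?;
    println!("{:?}", q_rot.dims()); // [1, 2, 5, 8]
    Ok(())
}
```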

- it makes it possible to load bf16 models on the T4 (sm75)

* Trying out a custom RmsNorm cuda kernel.
* CPU implementation for rms-norm.
* Cuda wrappers.
* Add some validation.
* Add some testing.
* More testing.

* Attempt at making the kernel faster.
* Also adapt the cast kernels.
* Also apply to binary ops.

* Add a specialized kernel for copy2d.
* Move the cat operations.
* Avoid transpositions in cat.
* Bugfix.
* Bugfix for the cuda kernel.
* Add a benchmark.
* Add more testing.
* Test fix.
* Faster kernel.
* Add the missing kernel.
* Tweak the test.
* Add a metal kernel.
* Fix for the metal kernel.
* Get the tests to pass on metal.
* Also use this opportunity to fix the metal kernel for ELU.
* Add some bf16 kernels.
* Clippy fixes.
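
The user-visible effect is on `Tensor::cat`; a sketch of the case the copy2d kernel accelerates, concatenation along a non-zero dim without going through transpositions:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let a = Tensor::randn(0f32, 1f32, (4, 3), &dev)?;
    let b = Tensor::randn(0f32, 1f32, (4, 5), &dev)?;
    // Concatenating on dim 1 copies strided 2d blocks, which is what the
    // specialized copy2d kernel handles directly.
    let ab = Tensor::cat(&[&a, &b], 1)?;
    println!("{:?}", ab.dims()); // [4, 8]
    Ok(())
}
```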

* Fix the block size for some cuda kernels.
* Bump the version number to 0.4.1.

* Cuda kernel for dequantizing q8k.
* Clippy lints.

* Boilerplate for the quantized cuda support.
* More basic cuda support.
* More cuda quantization (quantize on cpu for now).
* Add the dequantization bit.
* Start adding some dedicated cuda kernels from llama.cpp.
* Move the kernel code.
* Start interfacing with the kernel.
* Tweak the kernel launch params.
* Bugfix for quantized metal.
* Fix some clippy lints.
* Tweak the launch parameters.
* Tweak cuda basics to perform a quantized matmul.
* Perform the dequantization on the cpu + use cublas for matmul.
* Add the dequantization kernel.
* Test the qmatmul.
* More kernels.
* Matmul-vec kernel.
* Add a couple kernels.
* More dequantization kernels.
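
A sketch of the quantize/dequantize round trip these kernels back, assuming the `QTensor::quantize` / `QTensor::dequantize` entry points (quantization ran on the cpu at this stage of the port):

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let w = Tensor::randn(0f32, 1f32, (32, 256), &dev)?;
    // Quantize to q8_0 blocks ...
    let qw = QTensor::quantize(&w, GgmlDType::Q8_0)?;
    // ... and run the dequantization back to f32.
    let w2 = qw.dequantize(&dev)?;
    let err = (w - w2)?.abs()?.mean_all()?.to_scalar::<f32>()?;
    println!("mean abs quantization error: {err}");
    Ok(())
}
```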

* feat: add silu activation function
* use silu/arg in grad
* update candle-nn
* use node
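
A minimal sketch via `candle_nn::ops::silu`:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::new(&[-2f32, 0.0, 2.0], &dev)?;
    // silu(x) = x * sigmoid(x)
    let y = candle_nn::ops::silu(&x)?;
    println!("{}", y);
    Ok(())
}
```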

* ConvTranspose1d cuda support.
* Add the conv-transpose1d kernel.
* Remove some unused variables.

* Moving to a proper build crate `bindgen_cuda`.
* Fmt.
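
For reference, a minimal `build.rs` in the shape `bindgen_cuda` expects; the builder options and output path are assumptions based on its documented default usage:

```rust
// build.rs: compile the .cu files under src/ to PTX and generate the Rust
// constants that embed them.
fn main() {
    println!("cargo:rerun-if-changed=src/");
    let builder = bindgen_cuda::Builder::default();
    let bindings = builder.build_ptx().unwrap();
    bindings.write("src/lib.rs").unwrap();
}
```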

* feat: add support for multiple compute caps
* Revert to one compute cap
* fmt
* fix

* fix: fix index_select cuda kernel for src target dim different than ids dim when selecting dim > 0 (#1037)
* cargo fmt
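
A minimal sketch of the fixed case: selecting along dim 1 with an index tensor whose length differs from the source size on that dim:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::new(&[[1f32, 2.0, 3.0], [4.0, 5.0, 6.0]], &dev)?;
    let ids = Tensor::new(&[2u32, 0], &dev)?;
    // dim = 1 with 2 ids against a source dim of size 3.
    let y = x.index_select(&ids, 1)?;
    println!("{}", y); // [[3, 1], [6, 4]]
    Ok(())
}
```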

* Add the rounding operators.
* Avoid tracking gradients for the rounding operations.
* Add some rounding tests.
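
A minimal sketch of the rounding ops; as noted above, no gradient is tracked through them:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::new(&[-1.6f32, -0.4, 0.4, 2.6], &dev)?;
    // Elementwise rounding to integer-valued floats.
    println!("{}", x.round()?);
    println!("{}", x.floor()?);
    println!("{}", x.ceil()?);
    Ok(())
}
```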