path: root/candle-flash-attn/kernels
Commit message | Author | Age | Files | Lines
* Flash-Attn upgrade / SoftCap Candle-FlashAttn [3/n] (#2690)
  Michael Feil | 2024-12-31 | 1 file | -0/+2
    - update flash-attn v1
    - restore: hdim224
    - add 224 flash_fwd_template
    - remove whitespace
    - softcap is working, including test and api
    - make softcap test case better
    - unpadded lse added
* Flash-Attn upgrade / SoftCap Candle-FlashAttn [2/n] (#2689)
  Michael Feil | 2024-12-31 | 1 file | -3/+13
    - update flash-attn v1
    - restore: hdim224
    - add 224 flash_fwd_template
    - remove whitespace
    - softcap is working, including test and api
    - make softcap test case better
    Co-authored-by: laurent <laurent.mazare@gmail.com>
* Flash-Attn upgrade / SoftCap Candle-FlashAttn [1/n] (#2688)
  Michael Feil | 2024-12-31 | 39 files | -82/+138
    - update flash-attn v1
    - restore: hdim224
    - add 224 flash_fwd_template
    - remove whitespace
* Update the flash attn kernels. (#2333)
  Laurent Mazare | 2024-07-15 | 49 files | -898/+2257
* Use flash-attn in gemma. (#2195)
  Laurent Mazare | 2024-05-18 | 1 file | -0/+4
    - Use flash-attn in gemma.
    - Fix flash-attn for head dim 256.
* chore: update flash attention kernels (#1518)
  OlivierDehaene | 2024-01-05 | 26 files | -451/+658
    - chore: update flash attention kernels
    - fmt
    - remove unused kernels
    - force f32
    - correct stride
* Add back the bf16 flash-attn kernels. (#730)
  Laurent Mazare | 2023-09-04 | 1 file | -13/+13
* Flash attention without padding (varlen). (#281)
  Laurent Mazare | 2023-07-31 | 1 file | -3/+6
    - Expose the seqlen variable for flash-attn without padding.
    - Fix the batched call.
    - Adapt for the varlen variant.
    - No need to set the batch strides when in varlen mode.
    - Add a test (disabled at the moment).
    - Get the test to work properly.
* Again set a few extra params in flash-attn. (#245)
  Laurent Mazare | 2023-07-26 | 17 files | -91/+379
    - Again set a few extra params.
    - Use the appropriate kernel sizes.
    - Add all the kernel sizes.
    - Parallel compiling.
    - Reduce the amount of parallelism.
    - Add the missing kernel.
    - Fix a typo.
    - Remove bf16 support for now.
* Proper flash-attn parameters. (#244)
  Laurent Mazare | 2023-07-26 | 1 file | -1/+22
    - Proper flash-attn parameters.
    - Set the flash attention parameters.
    - Add more validations.
    - Setup the o_ flash attn parameters.
    - More flash-attn support.
    - Set more flash attn parameters.
* Add flash attention (#241)
  Laurent Mazare | 2023-07-26 | 10 files | -0/+2361
    - Add some flash-attn kernel, import the code for flash-attn v2 from Dao-AILab.
    - More flash attn.
    - Set up the flash attn parameters.
    - Get things to compile locally.
    - Move the flash attention files in a different directory.
    - Build the static C library with nvcc.
    - Add more flash attention.
    - Update the build part.
    - Better caching.
    - Exclude flash attention from the default workspace.
    - Put flash-attn behind a feature gate.
    - Get the flash attn kernel to run.
    - Move the flags to a more appropriate place.
    - Enable flash attention in llama.
    - Use flash attention in llama.