| Commit message | Author | Age | Files | Lines |
... | |
|
|
|
|
| |
* More flexible matmul contiguity checks.
* Also relax the checks on the metal side.
|
|
|
|
|
| |
* update im2col dtype implementations
* update dtypes for upsample
|
|
|
|
|
|
|
| |
* Contiguous variant of the rope kernel.
* Add the cuda kernel.
* Metal kernel.
|
| |
* Fast kernels for rotary embeddings.
* Add a test for the fast CPU kernel.
* Rope cuda bindings.
* Cuda kernel.
* Metal kernel (part 1).
* Cuda kernels.
* Finish the metal kernel.
* Use the new kernels in the quantized example.
* Fix warning.
|
|
|
|
|
|
|
| |
* initial implementation
* use correct index, but still not breaking like it should have...
* fix test
|
|
|
|
|
|
|
| |
* add support for conv transpose 2d and add benchmark for float types
* update bench calculation
* enable testing all conv operations on metal
|
|
|
|
|
|
|
|
|
| |
* RmsNorm kernel for metal.
* Wrapper for the metal kernel.
* Get the ops to actually work.
* Fix, get the tests to pass.
|
| |
* first attempt
* progress
* integrate into metal backend
* finish and get test passing
* add other dtype support
* update transpose1d dtypes supported
|
|
|
|
|
|
|
| |
* implement metal avg pool 2d
* fix
* add suggested precision workaround for the accumulator
|
| |
* first pass at implementation of maxpool2d
* Add definitions for other dtypes
* add tests for other dtypes
* Cosmetic tweaks + re-enable maxpool2d tests for metal.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
|
| |
| |
* Add a specialized kernel for copy2d.
* Move the cat operations.
* Avoid transpositions in cat.
* Bugfix.
* Bugfix for the cuda kernel.
* Add a benchmark.
* Add more testing.
* Test fix.
* Faster kernel.
* Add the missing kernel.
* Tweak the test.
* Add a metal kernel.
* Fix for the metal kernel.
* Get the tests to pass on metal.
* Also use this opportunity to fix the metal kernel for ELU.
* Add some bf16 kernels.
* Clippy fixes.
|
|
|
|
|
| |
* add support and tests for scatter add on metal
* add support for all datatypes
|
| |
| |
* use_resource API misunderstood. It is not additive. Several usages must be bit-ORed together.
* The seeding was incorrect and used the address instead of the value of the passed in seed.
* Add a check that likely exhibits failure to update the seed between generation of random tensors.
* Buffer overrun, the length given to the std::ptr::copy call was in bytes, and not 32-bit units.
* By default seed the RNG with a time-based value, so that different runs may produce different output, just like the CPU engine.
Use device.set_seed if determinism is warranted.
* Revert "By default seed the RNG with a time-based value, so that different runs may produce different output, just like the CPU engine. Use device.set_seed if determinism is warranted."
This reverts commit d7302de9
Discussion in https://github.com/huggingface/candle/pull/1811#issuecomment-1983079119
* The Metal random kernel failed to set element N/2 of tensors with N elements, N being even. The reason was that every thread except thread 0 created 2 random samples, while thread 0 created only one, giving an odd total. To produce an even number of samples, the early termination of thread 0 should only ever occur for odd-sized tensors.
* Add a test catching any deterministic tensor element in rand and randn output.
---------
Co-authored-by: niklas <niklas@appli.se>
Co-authored-by: Ivar Flakstad <69173633+ivarflakstad@users.noreply.github.com>
|
|
|
|
|
| |
* Fix the block size for some cuda kernels.
* Bump the version number to 0.4.1.
|
|
|
|
|
|
|
|
|
| |
* feat: add silu activation function
* use silu/arg in grad
* update candle-nn
* use node
|
| |
|
|\
| |
| | |
fix: larger batches
|
| | |
|
| | |
|
| | |
|
| | |
|
|\ \
| |/
|/| |
|
| | |
|
| |\ |
|
| | |
| | |
| | |
| | |
| | |
| | | |
* set_seed via buffer content pointer copy + did_modify_range
* ensure random.metal kernel does not write outside of buffer range when tid==0
|
| | | |
|
| |\ \ |
|
| | | | |
|
| |\ \ \ |
|
| |\ \ \ \ |
|
| | | | | | |
|
| | | | | | |
|
|\ \ \ \ \ \
| |_|_|_|_|/
|/| | | | | |
Metal: Use uint8_t as output type in int64_t binary op kernel
|
| | |_|_|/
| |/| | | |
|
|/ / / /
| | | | |
* Metal quantized modifications proposal.
- Add a device param, wherever needed.
- Create new QMetal storage thing that implements QuantizedType.
- Update everywhere needed.
Fix Python.
Fixing examples.
Fix: fmt + clippy + stub.
Moving everything around.
Only missing the actual implems.
Fixing everything + adding dequantized kernels.
More work.
Fixing matmul.
Fmt + Clippy
Some clippy fixes.
Working state.
Q2K Metal -> Bugged (also present in GGML).
Q4K CPU -> Bugged (present previously, new test catch it).
Q5K CPU -> Bugged (present previously).
Q8_1 Both -> Never really implemented it seems
Q8K metal -> Never implemented in metal
Fixing Q2K bug (present in ggml).
* Cleanup.
* Fix the rebase.
* Removing the fences speeds everything up and *is* correct this time...
* Cleanup the fence.
* After rebase.
* Bad code removal.
* Rebase after phi2 merge + fix replit default to CPU.
* Making the CI happy.
* More happy tests.
---------
Co-authored-by: Nicolas Patry <nicolas@Nicolass-MacBook-Pro.local>
|
| | | | |
* Use cfg to separate benchmark results based on features
* Add bfloat affine and benchmarks
* Fix flops calculation
* Remove allow pragma
* Avoid some unnecessary returns.
* Improve benchmarks layout
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
|
| |_|/
|/| |
| | | |
* Use cfg to separate benchmark results based on features
* Add metal where_cond for f16 and bf16. Add benchmark
* Remove allow pragma
* Avoid some unnecessary returns.
* Improve benchmarks layout
* Updated feature separated benchmarks
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
|
| | | |
|
| | | |
|
| | | |
|
| | |
| | |
| | |
| | | |
check (#1540)
|
| |/
|/|
| | |
* Add relu kernel for metal
* Copy error messages proposed in #1491
* Revert non relu changes
* Fix name changes
* Fix the last of us (:
* Fix copy and paste mistakes
* Fix typo
* Revert order changes
* Revert order change
* Add deleted functions back
* Run rustfmt
|
| | |
|
| |
| |
| |
| |
| | |
* Metal: support unary abs
* cargo fmt
|
| |
| |
| |
| |
| | |
* Adds more metal u8
* Metal: more u32
|
| |
| |
| |
| |
| | |
* Adds basic metal i64 support
* metal copy i64
|
| | |
|