-rw-r--r-- | README.md                                | 46 |
-rw-r--r-- | candle-core/Cargo.toml                   |  1 |
-rw-r--r-- | candle-core/examples/conv1d_benchmark.rs | 24 |
-rw-r--r-- | candle-core/examples/cpu_benchmarks.rs   | 95 |
4 files changed, 119 insertions, 47 deletions
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -3,8 +3,8 @@
 [](https://docs.rs/candle-core)
 
-Candle is a minimalist ML framework for Rust with a focus on easiness of use and
-on performance (including GPU support). Try our online demos:
+Candle is a minimalist ML framework for Rust with a focus on performance (including GPU support)
+and ease of use. Try our online demos:
 [whisper](https://huggingface.co/spaces/lmz/candle-whisper),
 [llama2](https://huggingface.co/spaces/lmz/candle-llama2).
@@ -52,7 +52,7 @@ wget https://huggingface.co/spaces/lmz/candle-llama2/resolve/main/model.bin
 wget https://huggingface.co/spaces/lmz/candle-llama2/resolve/main/tokenizer.json
 trunk serve --release --public-url /candle-llama2/ --port 8081
 ```
-And then browse to
+And then head over to
 [http://localhost:8081/candle-llama2](http://localhost:8081/candle-llama2).
 
 <!--- ANCHOR: features --->
@@ -61,17 +61,17 @@ And then browse to
 
 - Simple syntax, looks and feels like PyTorch.
 - CPU and Cuda backends, m1, f16, bf16.
-- Enable serverless (CPU), small and fast deployments
+- Serverless (on CPU), small and fast deployments
 - WASM support, run your models in a browser.
 - Model training.
 - Distributed computing using NCCL.
-- Models out of the box: Llama, Whisper, Falcon, StarCoder...
+- Model support out of the box: Llama, Whisper, Falcon, StarCoder...
 - Embed user-defined ops/kernels, such as [flash-attention
   v2](https://github.com/huggingface/candle/blob/89ba005962495f2bfbda286e185e9c3c7f5300a3/candle-flash-attn/src/lib.rs#L152).
 
 <!--- ANCHOR_END: features --->
 
-## How to use ?
+## How to use
 
 <!--- ANCHOR: cheatsheet --->
 Cheatsheet:
@@ -95,41 +95,41 @@ Cheatsheet:
 ## Structure
 
 - [candle-core](./candle-core): Core ops, devices, and `Tensor` struct definition
-- [candle-nn](./candle-nn/): Facilities to build real models
-- [candle-examples](./candle-examples/): Real-world like examples on how to use the library in real settings
+- [candle-nn](./candle-nn/): Tools to build real models
+- [candle-examples](./candle-examples/): Examples of using the library in realistic settings
 - [candle-kernels](./candle-kernels/): CUDA custom kernels
 - [candle-datasets](./candle-datasets/): Datasets and data loaders.
-- [candle-transformers](./candle-transformers): Transformer related utilities.
+- [candle-transformers](./candle-transformers): transformers-related utilities.
 - [candle-flash-attn](./candle-flash-attn): Flash attention v2 layer.
 
 ## FAQ
 
-### Why Candle?
+### Why should I use Candle?
 
-Candle stems from the need to reduce binary size in order to *enable serverless*
-possible by making the whole engine smaller than PyTorch very large library volume.
-This enables creating runtimes on a cluster much faster.
+Candle's core goal is to *make serverless inference possible*. Full machine learning frameworks like PyTorch
+are very large, which makes creating instances on a cluster slow. Candle allows deployment of lightweight
+binaries.
 
-And simply *removing Python* from production workloads.
-Python can really add overhead in more complex workflows and the [GIL](https://www.backblaze.com/blog/the-python-gil-past-present-and-future/) is a notorious source of headaches.
+Secondly, Candle lets you *remove Python* from production workloads. Python overhead can seriously hurt performance,
+and the [GIL](https://www.backblaze.com/blog/the-python-gil-past-present-and-future/) is a notorious source of headaches.
-Rust is cool, and a lot of the HF ecosystem already has Rust crates [safetensors](https://github.com/huggingface/safetensors) and [tokenizers](https://github.com/huggingface/tokenizers).
+Finally, Rust is cool! A lot of the HF ecosystem already has Rust crates, like [safetensors](https://github.com/huggingface/safetensors) and [tokenizers](https://github.com/huggingface/tokenizers).
 
 ### Other ML frameworks
 
 - [dfdx](https://github.com/coreylowman/dfdx) is a formidable crate, with shapes being included
-  in types preventing a lot of headaches by getting compiler to complain about shape mismatch right off the bat
-  However we found that some features still require nightly and writing code can be a bit daunting for non rust experts.
+  in types. This prevents a lot of headaches by getting the compiler to complain about shape mismatches right off the bat.
+  However, we found that some features still require nightly, and writing code can be a bit daunting for non rust experts.
   We're leveraging and contributing to other core crates for the runtime so hopefully both crates can benefit from each
-  other
+  other.
 
 - [burn](https://github.com/burn-rs/burn) is a general crate that can leverage multiple backends so you can choose the best
-  engine for your workload
+  engine for your workload.
 
 - [tch-rs](https://github.com/LaurentMazare/tch-rs.git) Bindings to the torch library in Rust. Extremely versatile, but they
-  do bring in the entire torch library into the runtime. The main contributor of `tch-rs` is also involved in the development
+  bring in the entire torch library into the runtime. The main contributor of `tch-rs` is also involved in the development
   of `candle`.
 
 ### Missing symbols when compiling with the mkl feature.
@@ -145,13 +145,13 @@ features, e.g.:
 = note: use the `cargo:rustc-link-lib` directive to specify the native libraries to link with Cargo (see https://doc.rust-lang.org/cargo/reference/build-scripts.html#cargorustc-link-libkindname)
 ```
 
-This is likely due to some missing linker flag that enable the mkl library. You
+This is likely due to a missing linker flag that was needed to enable the mkl library. You
 can try adding the following at the top of your binary:
 ```
 extern crate intel_mkl_src;
 ```
 
-### How to know where an error comes from.
+### Tracking down errors
 
 You can set `RUST_BACKTRACE=1` to be provided with backtraces when a candle
 error is generated.
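As a side note to the two FAQ entries above (not part of this patch): a minimal, hypothetical binary combining both tips, the `extern crate intel_mkl_src;` workaround for the `mkl` feature and `RUST_BACKTRACE=1` for locating errors, might look like the sketch below. The deliberate shape mismatch is only there to trigger a candle error.

```
// Hypothetical example, not taken from the candle repository.
#[cfg(feature = "mkl")]
extern crate intel_mkl_src; // works around the missing mkl link flags

use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let a = Tensor::randn(0f32, 1., (2, 3), &Device::Cpu)?;
    let b = Tensor::randn(0f32, 1., (4, 5), &Device::Cpu)?;
    // (2, 3) x (4, 5) cannot be multiplied; run with RUST_BACKTRACE=1 to get
    // a backtrace pointing at this call.
    let c = a.matmul(&b)?;
    println!("{c:?}");
    Ok(())
}
```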
diff --git a/candle-core/Cargo.toml b/candle-core/Cargo.toml
index bf57a91c..b5d74e12 100644
--- a/candle-core/Cargo.toml
+++ b/candle-core/Cargo.toml
@@ -30,6 +30,7 @@ zip = { workspace = true }
 
 [dev-dependencies]
 anyhow = { workspace = true }
+clap = { workspace = true }
 
 [features]
 default = []
diff --git a/candle-core/examples/conv1d_benchmark.rs b/candle-core/examples/conv1d_benchmark.rs
deleted file mode 100644
index 52fae5e8..00000000
--- a/candle-core/examples/conv1d_benchmark.rs
+++ /dev/null
@@ -1,24 +0,0 @@
-#[cfg(feature = "mkl")]
-extern crate intel_mkl_src;
-
-#[cfg(feature = "accelerate")]
-extern crate accelerate_src;
-
-use anyhow::Result;
-use candle_core::{Device, Tensor};
-
-pub const N_ITERS: usize = 5;
-
-fn main() -> Result<()> {
-    let inp = Tensor::randn(0f32, 1., (1, 384, 3000), &Device::Cpu)?;
-    let w = Tensor::randn(0f32, 1., (384, 384, 3), &Device::Cpu)?;
-    let res = inp.conv1d(&w, 0, 1);
-    println!("{res:?}");
-    let start = std::time::Instant::now();
-    for i in 0..N_ITERS {
-        let res = inp.conv1d(&w, 0, 1);
-        println!("{i} {res:?}");
-    }
-    println!("{:?}", start.elapsed() / N_ITERS as u32);
-    Ok(())
-}
diff --git a/candle-core/examples/cpu_benchmarks.rs b/candle-core/examples/cpu_benchmarks.rs
new file mode 100644
index 00000000..4cc710fb
--- /dev/null
+++ b/candle-core/examples/cpu_benchmarks.rs
@@ -0,0 +1,95 @@
+/// This example contains some simple benchmarks so that it's easy to run them in perf etc.
+#[cfg(feature = "mkl")]
+extern crate intel_mkl_src;
+
+#[cfg(feature = "accelerate")]
+extern crate accelerate_src;
+
+use candle_core::{Device, Result, Tensor};
+use clap::{Parser, Subcommand};
+
+trait Benchmark {
+    type PreProcessData;
+    type RunResult;
+
+    fn preprocess() -> Result<Self::PreProcessData>;
+    fn run_one(_: &Self::PreProcessData) -> Result<Self::RunResult>;
+
+    const ITERS: usize;
+}
+
+// Conv1d example as used in whisper.
+struct Conv1d;
+impl Benchmark for Conv1d {
+    type PreProcessData = (Tensor, Tensor);
+    type RunResult = Tensor;
+    fn preprocess() -> Result<Self::PreProcessData> {
+        let inp = Tensor::randn(0f32, 1., (1, 384, 3000), &Device::Cpu)?;
+        let w = Tensor::randn(0f32, 1., (384, 384, 3), &Device::Cpu)?;
+        Ok((inp, w))
+    }
+
+    fn run_one(d: &Self::PreProcessData) -> Result<Self::RunResult> {
+        d.0.conv1d(&d.1, 0, 1)
+    }
+
+    const ITERS: usize = 5;
+}
+
+// Conv2d example as used in stable-diffusion.
+struct Conv2d;
+impl Benchmark for Conv2d {
+    type PreProcessData = (Tensor, Tensor);
+    type RunResult = Tensor;
+
+    fn preprocess() -> Result<Self::PreProcessData> {
+        let inp = Tensor::randn(0f32, 1., (2, 320, 96, 96), &Device::Cpu)?;
+        let w = Tensor::randn(0f32, 1., (320, 320, 3, 3), &Device::Cpu)?;
+        Ok((inp, w))
+    }
+
+    fn run_one(d: &Self::PreProcessData) -> Result<Self::RunResult> {
+        d.0.conv2d(&d.1, 0, 1)
+    }
+
+    const ITERS: usize = 1;
+}
+
+fn run<B: Benchmark>(iters: Option<usize>) -> Result<()> {
+    use std::hint::black_box;
+
+    let iters = iters.unwrap_or(B::ITERS);
+    let d = B::preprocess()?;
+    let start = std::time::Instant::now();
+    for _iter in 0..iters {
+        let _res = black_box(B::run_one(black_box(&d))?);
+    }
+    println!("{:?}", start.elapsed() / iters as u32);
+    Ok(())
+}
+
+#[derive(Subcommand, Debug, Clone)]
+enum Task {
+    Conv1d,
+    Conv2d,
+}
+
+#[derive(Parser, Debug)]
+#[command(author, version, about, long_about = None)]
+pub struct Args {
+    /// The benchmark to be run.
+    #[command(subcommand)]
+    task: Task,
+
+    #[arg(long)]
+    iters: Option<usize>,
+}
+
+fn main() -> Result<()> {
+    let args = Args::parse();
+    match args.task {
+        Task::Conv1d => run::<Conv1d>(args.iters)?,
+        Task::Conv2d => run::<Conv2d>(args.iters)?,
+    }
+    Ok(())
+}
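For reference (not shown in the diff): with clap's derive defaults the `Task` variants become the lower-case subcommands `conv1d` and `conv2d`, so the new example should be runnable with something like `cargo run --release --example cpu_benchmarks -- conv1d` or `cargo run --release --example cpu_benchmarks -- conv2d --iters 10`, which prints the average time per iteration.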