-rw-r--r-- | README.md                                | 46 |
-rw-r--r-- | candle-core/Cargo.toml                   |  1 |
-rw-r--r-- | candle-core/examples/conv1d_benchmark.rs | 24 |
-rw-r--r-- | candle-core/examples/cpu_benchmarks.rs   | 95 |
4 files changed, 119 insertions, 47 deletions
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -3,8 +3,8 @@
 [](https://docs.rs/candle-core)
 
-Candle is a minimalist ML framework for Rust with a focus on easiness of use and
-on performance (including GPU support). Try our online demos:
+Candle is a minimalist ML framework for Rust with a focus on performance (including GPU support)
+and ease of use. Try our online demos:
 [whisper](https://huggingface.co/spaces/lmz/candle-whisper),
 [llama2](https://huggingface.co/spaces/lmz/candle-llama2).
@@ -52,7 +52,7 @@ wget https://huggingface.co/spaces/lmz/candle-llama2/resolve/main/model.bin
 wget https://huggingface.co/spaces/lmz/candle-llama2/resolve/main/tokenizer.json
 trunk serve --release --public-url /candle-llama2/ --port 8081
 ```
-And then browse to
+And then head over to
 [http://localhost:8081/candle-llama2](http://localhost:8081/candle-llama2).
 
 <!--- ANCHOR: features --->
@@ -61,17 +61,17 @@ And then browse to
 
 - Simple syntax, looks and feels like PyTorch.
 - CPU and Cuda backends, m1, f16, bf16.
-- Enable serverless (CPU), small and fast deployments
+- Serverless (on CPU), small and fast deployments
 - WASM support, run your models in a browser.
 - Model training.
 - Distributed computing using NCCL.
-- Models out of the box: Llama, Whisper, Falcon, StarCoder...
+- Model support out of the box: Llama, Whisper, Falcon, StarCoder...
 - Embed user-defined ops/kernels, such as [flash-attention
   v2](https://github.com/huggingface/candle/blob/89ba005962495f2bfbda286e185e9c3c7f5300a3/candle-flash-attn/src/lib.rs#L152).
 
 <!--- ANCHOR_END: features --->
 
-## How to use ?
+## How to use
 
 <!--- ANCHOR: cheatsheet --->
 Cheatsheet:
@@ -95,41 +95,41 @@ Cheatsheet:
 ## Structure
 
 - [candle-core](./candle-core): Core ops, devices, and `Tensor` struct definition
-- [candle-nn](./candle-nn/): Facilities to build real models
-- [candle-examples](./candle-examples/): Real-world like examples on how to use the library in real settings
+- [candle-nn](./candle-nn/): Tools to build real models
+- [candle-examples](./candle-examples/): Examples of using the library in realistic settings
 - [candle-kernels](./candle-kernels/): CUDA custom kernels
 - [candle-datasets](./candle-datasets/): Datasets and data loaders.
-- [candle-transformers](./candle-transformers): Transformer related utilities.
+- [candle-transformers](./candle-transformers): transformers-related utilities.
 - [candle-flash-attn](./candle-flash-attn): Flash attention v2 layer.
 
 ## FAQ
 
-### Why Candle?
+### Why should I use Candle?
 
-Candle stems from the need to reduce binary size in order to *enable serverless*
-possible by making the whole engine smaller than PyTorch very large library volume.
-This enables creating runtimes on a cluster much faster.
+Candle's core goal is to *make serverless inference possible*. Full machine learning frameworks like PyTorch
+are very large, which makes creating instances on a cluster slow. Candle allows deployment of lightweight
+binaries.
 
-And simply *removing Python* from production workloads.
-Python can really add overhead in more complex workflows and the [GIL](https://www.backblaze.com/blog/the-python-gil-past-present-and-future/) is a notorious source of headaches.
+Secondly, Candle lets you *remove Python* from production workloads. Python overhead can seriously hurt performance,
+and the [GIL](https://www.backblaze.com/blog/the-python-gil-past-present-and-future/) is a notorious source of headaches.
-Rust is cool, and a lot of the HF ecosystem already has Rust crates [safetensors](https://github.com/huggingface/safetensors) and [tokenizers](https://github.com/huggingface/tokenizers).
+Finally, Rust is cool! A lot of the HF ecosystem already has Rust crates, like [safetensors](https://github.com/huggingface/safetensors) and [tokenizers](https://github.com/huggingface/tokenizers).
 
 ### Other ML frameworks
 
 - [dfdx](https://github.com/coreylowman/dfdx) is a formidable crate, with shapes being included
-  in types preventing a lot of headaches by getting compiler to complain about shape mismatch right off the bat
-  However we found that some features still require nightly and writing code can be a bit daunting for non rust experts.
+  in types. This prevents a lot of headaches by getting the compiler to complain about shape mismatches right off the bat.
+  However, we found that some features still require nightly, and writing code can be a bit daunting for non rust experts.
   We're leveraging and contributing to other core crates for the runtime so hopefully both crates can benefit from each
-  other
+  other.
 
 - [burn](https://github.com/burn-rs/burn) is a general crate that can leverage multiple backends so you can choose the best
-  engine for your workload
+  engine for your workload.
 
 - [tch-rs](https://github.com/LaurentMazare/tch-rs.git) Bindings to the torch library in Rust. Extremely versatile, but they
-  do bring in the entire torch library into the runtime. The main contributor of `tch-rs` is also involved in the development
+  bring in the entire torch library into the runtime. The main contributor of `tch-rs` is also involved in the development
   of `candle`.
 
 ### Missing symbols when compiling with the mkl feature.
@@ -145,13 +145,13 @@ features, e.g.:
 = note: use the `cargo:rustc-link-lib` directive to specify the native libraries to link with Cargo (see https://doc.rust-lang.org/cargo/reference/build-scripts.html#cargorustc-link-libkindname)
 ```
 
-This is likely due to some missing linker flag that enable the mkl library. You
+This is likely due to a missing linker flag that was needed to enable the mkl library. You
 can try adding the following at the top of your binary:
 ```
 extern crate intel_mkl_src;
 ```
 
-### How to know where an error comes from.
+### Tracking down errors
 
 You can set `RUST_BACKTRACE=1` to be provided with backtraces when a candle
 error is generated.
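As a side note to the two FAQ entries above (not part of this patch): a minimal, hypothetical binary combining both tips, the `extern crate intel_mkl_src;` workaround for the `mkl` feature and `RUST_BACKTRACE=1` for locating errors, might look like the sketch below. The deliberate shape mismatch is only there to trigger a candle error.

```
// Hypothetical example, not taken from the candle repository.
#[cfg(feature = "mkl")]
extern crate intel_mkl_src; // works around the missing mkl link flags

use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let a = Tensor::randn(0f32, 1., (2, 3), &Device::Cpu)?;
    let b = Tensor::randn(0f32, 1., (4, 5), &Device::Cpu)?;
    // (2, 3) x (4, 5) cannot be multiplied; run with RUST_BACKTRACE=1 to get
    // a backtrace pointing at this call.
    let c = a.matmul(&b)?;
    println!("{c:?}");
    Ok(())
}
```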
diff --git a/candle-core/Cargo.toml b/candle-core/Cargo.toml
index bf57a91c..b5d74e12 100644
--- a/candle-core/Cargo.toml
+++ b/candle-core/Cargo.toml
@@ -30,6 +30,7 @@ zip = { workspace = true }
 
 [dev-dependencies]
 anyhow = { workspace = true }
+clap = { workspace = true }
 
 [features]
 default = []
diff --git a/candle-core/examples/conv1d_benchmark.rs b/candle-core/examples/conv1d_benchmark.rs
deleted file mode 100644
index 52fae5e8..00000000
--- a/candle-core/examples/conv1d_benchmark.rs
+++ /dev/null
@@ -1,24 +0,0 @@
-#[cfg(feature = "mkl")]
-extern crate intel_mkl_src;
-
-#[cfg(feature = "accelerate")]
-extern crate accelerate_src;
-
-use anyhow::Result;
-use candle_core::{Device, Tensor};
-
-pub const N_ITERS: usize = 5;
-
-fn main() -> Result<()> {
-    let inp = Tensor::randn(0f32, 1., (1, 384, 3000), &Device::Cpu)?;
-    let w = Tensor::randn(0f32, 1., (384, 384, 3), &Device::Cpu)?;
-    let res = inp.conv1d(&w, 0, 1);
-    println!("{res:?}");
-    let start = std::time::Instant::now();
-    for i in 0..N_ITERS {
-        let res = inp.conv1d(&w, 0, 1);
-        println!("{i} {res:?}");
-    }
-    println!("{:?}", start.elapsed() / N_ITERS as u32);
-    Ok(())
-}
diff --git a/candle-core/examples/cpu_benchmarks.rs b/candle-core/examples/cpu_benchmarks.rs
new file mode 100644
index 00000000..4cc710fb
--- /dev/null
+++ b/candle-core/examples/cpu_benchmarks.rs
@@ -0,0 +1,95 @@
+/// This example contains some simple benchmarks so that it's easy to run them in perf etc.
+#[cfg(feature = "mkl")]
+extern crate intel_mkl_src;
+
+#[cfg(feature = "accelerate")]
+extern crate accelerate_src;
+
+use candle_core::{Device, Result, Tensor};
+use clap::{Parser, Subcommand};
+
+trait Benchmark {
+    type PreProcessData;
+    type RunResult;
+
+    fn preprocess() -> Result<Self::PreProcessData>;
+    fn run_one(_: &Self::PreProcessData) -> Result<Self::RunResult>;
+
+    const ITERS: usize;
+}
+
+// Conv1d example as used in whisper.
+struct Conv1d;
+impl Benchmark for Conv1d {
+    type PreProcessData = (Tensor, Tensor);
+    type RunResult = Tensor;
+    fn preprocess() -> Result<Self::PreProcessData> {
+        let inp = Tensor::randn(0f32, 1., (1, 384, 3000), &Device::Cpu)?;
+        let w = Tensor::randn(0f32, 1., (384, 384, 3), &Device::Cpu)?;
+        Ok((inp, w))
+    }
+
+    fn run_one(d: &Self::PreProcessData) -> Result<Self::RunResult> {
+        d.0.conv1d(&d.1, 0, 1)
+    }
+
+    const ITERS: usize = 5;
+}
+
+// Conv2d example as used in stable-diffusion.
+struct Conv2d;
+impl Benchmark for Conv2d {
+    type PreProcessData = (Tensor, Tensor);
+    type RunResult = Tensor;
+
+    fn preprocess() -> Result<Self::PreProcessData> {
+        let inp = Tensor::randn(0f32, 1., (2, 320, 96, 96), &Device::Cpu)?;
+        let w = Tensor::randn(0f32, 1., (320, 320, 3, 3), &Device::Cpu)?;
+        Ok((inp, w))
+    }
+
+    fn run_one(d: &Self::PreProcessData) -> Result<Self::RunResult> {
+        d.0.conv2d(&d.1, 0, 1)
+    }
+
+    const ITERS: usize = 1;
+}
+
+fn run<B: Benchmark>(iters: Option<usize>) -> Result<()> {
+    use std::hint::black_box;
+
+    let iters = iters.unwrap_or(B::ITERS);
+    let d = B::preprocess()?;
+    let start = std::time::Instant::now();
+    for _iter in 0..iters {
+        let _res = black_box(B::run_one(black_box(&d))?);
+    }
+    println!("{:?}", start.elapsed() / iters as u32);
+    Ok(())
+}
+
+#[derive(Subcommand, Debug, Clone)]
+enum Task {
+    Conv1d,
+    Conv2d,
+}
+
+#[derive(Parser, Debug)]
+#[command(author, version, about, long_about = None)]
+pub struct Args {
+    /// The benchmark to be run.
+    #[command(subcommand)]
+    task: Task,
+
+    #[arg(long)]
+    iters: Option<usize>,
+}
+
+fn main() -> Result<()> {
+    let args = Args::parse();
+    match args.task {
+        Task::Conv1d => run::<Conv1d>(args.iters)?,
+        Task::Conv2d => run::<Conv2d>(args.iters)?,
+    }
+    Ok(())
+}
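For reference (not shown in the diff): with clap's derive defaults the `Task` variants become the lower-case subcommands `conv1d` and `conv2d`, so the new example should be runnable with something like `cargo run --release --example cpu_benchmarks -- conv1d` or `cargo run --release --example cpu_benchmarks -- conv2d --iters 10`, which prints the average time per iteration.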