Diffstat (limited to 'candle-book/src/training/training.md')
-rw-r--r-- | candle-book/src/training/training.md | 39 |
1 files changed, 39 insertions, 0 deletions
diff --git a/candle-book/src/training/training.md b/candle-book/src/training/training.md
new file mode 100644
index 00000000..d68a917e
--- /dev/null
+++ b/candle-book/src/training/training.md
@@ -0,0 +1,39 @@
+# Training
+
+
+Training starts with data. We're going to use the huggingface hub and
+start with the Hello world dataset of machine learning, MNIST.
+
+Let's start with downloading `MNIST` from [huggingface](https://huggingface.co/datasets/mnist).
+
+This requires [`hf-hub`](https://github.com/huggingface/hf-hub).
+```bash
+cargo add hf-hub
+```
+
+This is going to be very hands-on for now.
+
+```rust,ignore
+{{#include ../../../candle-examples/src/lib.rs:book_training_1}}
+```
+
+This uses the standardized `parquet` files from the `refs/convert/parquet` branch on every dataset.
+Our handles are now [`parquet::file::serialized_reader::SerializedFileReader`].
+
+We can inspect the content of the files with:
+
+```rust,ignore
+{{#include ../../../candle-examples/src/lib.rs:book_training_2}}
+```
+
+You should see something like:
+
+```bash
+Column id 1, name label, value 6
+Column id 0, name image, value {bytes: [137, ....]
+Column id 1, name label, value 8
+Column id 0, name image, value {bytes: [137, ....]
+```
+
+So each row contains 2 columns (image, label) with image being saved as bytes.
+Let's put them into a useful struct.
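The added chapter ends by suggesting that each row (image bytes plus label) be put into a struct, but the diff does not show one. The following is only a minimal sketch of what such a struct could look like, using the standard library alone. `MnistSample`, its fields, `new`, and `looks_like_png` are hypothetical names that do not come from the commit. Two assumptions are baked in: that the label arrives as a signed 64-bit integer, and that the image bytes are PNG-encoded. The second is consistent with the leading `137` in the sample output, since 137 is the first byte of the PNG signature, but the diff does not confirm it.

```rust
use std::convert::TryFrom;

/// Hypothetical container for one MNIST row: the encoded image bytes
/// and the digit label. Names and field types are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
struct MnistSample {
    image: Vec<u8>,
    label: u8,
}

/// The eight-byte signature that starts every PNG file.
const PNG_SIGNATURE: [u8; 8] = [137, 80, 78, 71, 13, 10, 26, 10];

impl MnistSample {
    /// Builds a sample, rejecting labels outside the ten digit classes.
    /// Assumption: the label column is read as a signed 64-bit integer.
    fn new(image: Vec<u8>, label: i64) -> Result<Self, String> {
        let label = u8::try_from(label)
            .ok()
            .filter(|l| *l <= 9)
            .ok_or_else(|| format!("label {} is not a digit in 0..=9", label))?;
        Ok(Self { image, label })
    }

    /// Returns true when the image bytes begin with the PNG signature.
    fn looks_like_png(&self) -> bool {
        self.image.starts_with(&PNG_SIGNATURE)
    }
}

fn main() {
    // Placeholder bytes: a PNG signature followed by filler, not a real image.
    let mut bytes = PNG_SIGNATURE.to_vec();
    bytes.extend_from_slice(&[0, 0, 0, 13]);

    let sample = MnistSample::new(bytes, 6).expect("6 is a valid digit label");
    println!(
        "label {}, {} image bytes, png: {}",
        sample.label,
        sample.image.len(),
        sample.looks_like_png()
    );
}
```

Validating the label at construction keeps out-of-range values from reaching later training code. The signature check is a cheap sanity test before handing the bytes to an image decoder.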