summaryrefslogtreecommitdiff
path: root/candle-book/src/training/README.md
blob: d68a917efa213ea72a9de26e91b929231bfdec39 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# Training


Training starts with data. We're going to use the huggingface hub and 
start with the Hello world dataset of machine learning, MNIST.

Let's start with downloading `MNIST` from [huggingface](https://huggingface.co/datasets/mnist).

This requires [`hf-hub`](https://github.com/huggingface/hf-hub).
```bash
cargo add hf-hub
```

This is going to be very hands-on for now.

```rust,ignore
{{#include ../../../candle-examples/src/lib.rs:book_training_1}}
```

This uses the standardized `parquet` files from the `refs/convert/parquet` branch on every dataset.
Our handles are now [`parquet::file::serialized_reader::SerializedFileReader`].

We can inspect the content of the files with:

```rust,ignore
{{#include ../../../candle-examples/src/lib.rs:book_training_2}}
```

You should see something like:

```bash
Column id 1, name label, value 6
Column id 0, name image, value {bytes: [137, ....]
Column id 1, name label, value 8
Column id 0, name image, value {bytes: [137, ....]
```

So each row contains 2 columns (image, label) with image being saved as bytes.
Let's put them into a useful struct.