# Training


Training starts with data. We're going to use the Hugging Face hub and
start with MNIST, the "hello world" dataset of machine learning.

Let's start by downloading `MNIST` from [huggingface](https://huggingface.co/datasets/mnist).

This requires `candle-datasets` with the `hub` feature.
```bash
cargo add candle-datasets --features hub
cargo add hf-hub
```


```rust,ignore
{{#include ../../../candle-examples/src/lib.rs:book_training_1}}
```

This uses the standardized `parquet` files from the `refs/convert/parquet` branch, which the hub maintains automatically for every dataset.
`files` is now a `Vec` of [`parquet::file::serialized_reader::SerializedFileReader`].
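
The included snippet is not shown inline here, so as a rough, hedged sketch of what such a download step can look like with the `hf-hub` sync API (the parquet file names inside the branch are assumptions; check the repo's file listing on the hub before relying on them):

```rust
use std::fs::File;

use hf_hub::{api::sync::Api, Repo, RepoType};
use parquet::file::serialized_reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Talk to the hub and select the auto-generated parquet branch
    // of the `mnist` dataset repository.
    let api = Api::new()?;
    let repo = api.repo(Repo::with_revision(
        "mnist".to_string(),
        RepoType::Dataset,
        "refs/convert/parquet".to_string(),
    ));

    // Assumed file names: the actual layout of the parquet branch may
    // differ, so list the branch contents to confirm them.
    let train = repo.get("mnist/train/0000.parquet")?;
    let test = repo.get("mnist/test/0000.parquet")?;

    // Wrap each downloaded file in a parquet reader.
    let files = vec![
        SerializedFileReader::new(File::open(train)?)?,
        SerializedFileReader::new(File::open(test)?)?,
    ];
    println!("opened {} parquet file(s)", files.len());
    Ok(())
}
```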

We can inspect the content of the files with:

```rust,ignore
{{#include ../../../candle-examples/src/lib.rs:book_training_2}}
```
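
A minimal sketch of such an inspection loop, using the `parquet` crate's record API on the `files` from the previous step (note that recent `parquet` versions yield `Result<Row>` from the iterator, in which case each row needs unwrapping first):

```rust
use std::fs::File;

use parquet::file::serialized_reader::SerializedFileReader;

// Print the first few rows of each reader. The record API is slow but
// convenient for a quick look at the data.
fn inspect(files: Vec<SerializedFileReader<File>>) {
    for file in files {
        for row in file.into_iter().take(2) {
            for (idx, (name, field)) in row.get_column_iter().enumerate() {
                println!("Column id {idx}, name {name}, value {field}");
            }
        }
    }
}
```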

You should see something like:

```bash
Column id 1, name label, value 6
Column id 0, name image, value {bytes: [137, ....]
Column id 1, name label, value 8
Column id 0, name image, value {bytes: [137, ....]
```

So each row contains two columns, `image` and `label`, with the image stored as raw encoded bytes (the leading `137` is `0x89`, the first byte of the PNG signature).
Let's put them into a useful struct.
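
One possible shape for that struct, as an illustrative sketch (the names and types here are our assumptions, not candle's actual API): collect the raw image bytes and labels side by side, ready to be decoded into tensors later.

```rust
/// Illustrative container for the decoded rows; field and type choices
/// are assumptions, not candle's actual API.
#[derive(Default)]
struct MnistRows {
    /// Encoded image bytes (PNG), one entry per row.
    images: Vec<Vec<u8>>,
    /// Class labels in 0..=9, one per row.
    labels: Vec<u8>,
}

impl MnistRows {
    fn push(&mut self, image: Vec<u8>, label: u8) {
        self.images.push(image);
        self.labels.push(label);
    }

    fn len(&self) -> usize {
        self.labels.len()
    }
}

fn main() {
    let mut rows = MnistRows::default();
    // 137, 80, 78, 71 are the first bytes of a PNG signature,
    // matching the inspection output shown above.
    rows.push(vec![137, 80, 78, 71], 6);
    assert_eq!(rows.len(), 1);
    assert_eq!(rows.labels[0], 6);
    assert_eq!(rows.images[0][0], 137);
}
```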