## Preparation for pre-training
1. prepare a python environment, e.g.:

```shell script
$ conda create -n spark python=3.8 -y
$ conda activate spark
```

2. install `PyTorch` and `timm` (preferably `torch~=1.10`, `torchvision~=0.11`, and `timm==0.5.4`), then the other python packages, e.g.:

```shell script
$ pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install timm==0.5.4
$ pip install -r requirements.txt
```

It is highly recommended to follow these instructions to ensure a consistent environment for reproduction.
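As a quick sanity check that the pinned versions above are actually in the active environment, here is a stdlib-only sketch (not part of SparK; the version prefixes are taken from the pins above):

```python
# Sanity-check sketch: confirm the recommended packages are installed and
# match the pinned versions. Standard library only (Python 3.8+).
from importlib import metadata

def installed_version(pkg):
    """Return the installed version string of `pkg`, or None if absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

if __name__ == "__main__":
    for pkg, want in [("torch", "1.10"), ("torchvision", "0.11"), ("timm", "0.5.4")]:
        got = installed_version(pkg)
        status = "missing" if got is None else ("ok" if got.startswith(want) else f"unexpected ({got})")
        print(f"{pkg:<12} expect ~{want}: {status}")
```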
3. prepare [ImageNet-1k](http://image-net.org/) dataset

   - download the dataset to a folder `/path/to/imagenet`
   - the file structure should look like:

```
/path/to/imagenet/:
    train/:
        class1:
            a_lot_images.jpeg
        class2:
            a_lot_images.jpeg
    val/:
        class3:
            a_lot_images.jpeg
        class4:
            a_lot_images.jpeg
```

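Before launching a run, it can help to verify this layout programmatically. Below is a hypothetical helper (not part of SparK) that counts the class folders under each split, so a wrong path fails fast instead of mid-training:

```python
# Hypothetical helper (not part of SparK): count the class subfolders under
# an ImageNet-style root laid out as shown above.
from pathlib import Path

def count_classes(root):
    """Return {'train': n, 'val': m}, the number of class dirs per split."""
    root = Path(root)
    counts = {}
    for split in ("train", "val"):
        split_dir = root / split
        counts[split] = (
            sum(1 for d in split_dir.iterdir() if d.is_dir())
            if split_dir.is_dir() else 0
        )
    return counts
```

For a full ImageNet-1k copy, both counts should be 1000.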
4. (optional) if you want to use sparse convolution rather than masked convolution, install this [library](https://github.com/facebookresearch/SparseConvNet) and set `--sparse_conv=1` later:

```shell script
$ git clone https://github.com/facebookresearch/SparseConvNet.git && cd SparseConvNet
$ rm -rf build/ dist/ sparseconvnet.egg-info sparseconvnet_SCN*.so
$ python3 setup.py develop --user
```

## Pre-training from scratch
1. since `torch.nn.parallel.DistributedDataParallel` is used for distributed training, you are expected to specify some distributed arguments on each node, including:

   - `--num_nodes`
   - `--ngpu_per_node`
   - `--node_rank`
   - `--master_address`
   - `--master_port`

2. besides, you also need to specify the experiment name and the ImageNet path as the first two arguments, and you may add arbitrary hyperparameter keywords (like `--ep=400 --bs=2048`) for other configurations. The final command should look like this:

```shell script
$ cd /path/to/SparK
$ bash ./scripts/pt.sh \
    experiment_name /path/to/imagenet \
    --num_nodes=1 --ngpu_per_node=8 --node_rank=0 \
    --master_address=128.0.0.0 --master_port=30000 \
    --model=res50 --ep=400 --bs=2048
```

## Resume

When an experiment starts running, the folder `SparK/<experiment_name>` will be created to record per-epoch checkpoints (e.g., `ckpt-last.pth`) and log files (`log.txt`).

To resume from a checkpoint, specify `--resume=/path/to/checkpoint.pth`.
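For example, the path to the most recent checkpoint of an experiment can be built from the experiment folder. This is a hypothetical convenience (not SparK code; only the `--resume` flag itself comes from the instructions above):

```python
# Hypothetical convenience (not part of SparK): locate ckpt-last.pth inside
# an experiment folder so it can be passed to --resume.
from pathlib import Path

def last_checkpoint(exp_dir):
    """Return the path to ckpt-last.pth inside exp_dir, or None if missing."""
    ckpt = Path(exp_dir) / "ckpt-last.pth"
    return str(ckpt) if ckpt.is_file() else None
```

The returned path would then be passed as `--resume=<returned path>`.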
## Read logs
The `stdout` and `stderr` are also saved in `SparK/<experiment_name>/stdout.txt` and `SparK/<experiment_name>/stderr.txt`.

Note that `SparK/<experiment_name>/log.txt` records the most important information, such as the current loss values and the remaining training time.
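To peek at the latest entries without opening the whole file, here is a stdlib-only sketch (it assumes nothing about the log format, only that it is line-oriented):

```python
# Stdlib-only sketch: return the last n lines of a log file, e.g.
# SparK/<experiment_name>/log.txt, without keeping the whole file in memory.
from collections import deque
from pathlib import Path

def tail(path, n=5):
    """Return the last n lines of a text file; [] if the file does not exist."""
    p = Path(path)
    if not p.is_file():
        return []
    with p.open() as f:
        return list(deque(f, maxlen=n))
```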