## Pre-training from scratch
The script for pre-training is [scripts/pt.sh](https://github.com/keyu-tian/SparK/blob/main/scripts/pt.sh).
Since `torch.nn.parallel.DistributedDataParallel` is used for distributed training, you need to specify several distributed arguments on each node, including:
- `--num_nodes=<INTEGER>`
- `--ngpu_per_node=<INTEGER>`
- `--node_rank=<INTEGER>`
- `--master_address=<ADDRESS>`
- `--master_port=<INTEGER>`
Set `--num_nodes=0` if your task runs on a single GPU, as in the sketch below.
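Here is a minimal single-GPU sketch (the data path is a placeholder, and whether the remaining distributed arguments can be omitted depends on the script's defaults):

```shell script
$ cd /path/to/SparK

# single-GPU pre-training: set --num_nodes=0 as described above
$ bash ./scripts/pt.sh <experiment_name> \
  --num_nodes=0 \
  --data_path=/path/to/imagenet \
  --model=res50
```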
You can add arbitrary keyword arguments (like `--ep=400 --bs=2048`) to specify pre-training hyperparameters (see [utils/meta.py](https://github.com/keyu-tian/SparK/blob/main/utils/meta.py) for the full list).
Here is an example command:
```shell script
$ cd /path/to/SparK
$ bash ./scripts/pt.sh <experiment_name> \
--num_nodes=1 --ngpu_per_node=8 --node_rank=0 \
--master_address=128.0.0.0 --master_port=30000 \
--data_path=/path/to/imagenet \
--model=res50 --ep=1600 --bs=4096
```
Note that the first argument is the name of the experiment; it is used to create the output directory named `output_<experiment_name>`.
## Logging
Once an experiment starts running, the following files will be automatically created and updated in `SparK/output_<experiment_name>`:
- `ckpt-last.pth`: includes model states, optimizer states, current epoch, current reconstruction loss, etc.
- `log.txt`: records the loss and remaining training time at each epoch, along with important meta information such as:
  - the git version (commit ID) at the start of the experiment
  - all arguments passed to the script
- `stdout_backup.txt` and `stderr_backup.txt`: capture everything printed to stdout/stderr
We believe these files make it easy to trace an experiment; for instance, you can monitor a run from the shell as sketched below.
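A minimal monitoring sketch, assuming the experiment name below is replaced with your own:

```shell script
$ cd /path/to/SparK

# follow the per-epoch loss and remaining-time reports as training runs
$ tail -f output_<experiment_name>/log.txt

# check the latest checkpoint and the backed-up console output
$ ls -lh output_<experiment_name>/ckpt-last.pth output_<experiment_name>/std*_backup.txt
```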
## Resuming
To resume from a saved checkpoint, run `pt.sh` with `--resume=/path/to/checkpoint.pth`.
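A hedged sketch (the distributed arguments and hyperparameters are simply those from the example above; reuse whatever settings your original run used):

```shell script
$ cd /path/to/SparK

# same command as before, plus --resume pointing at the saved checkpoint
$ bash ./scripts/pt.sh <experiment_name> \
  --num_nodes=1 --ngpu_per_node=8 --node_rank=0 \
  --master_address=128.0.0.0 --master_port=30000 \
  --data_path=/path/to/imagenet \
  --model=res50 --ep=1600 --bs=4096 \
  --resume=output_<experiment_name>/ckpt-last.pth
```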
## Regarding sparse convolution
For generality, we use the masked convolution implemented in [encoder.py](https://github.com/keyu-tian/SparK/blob/main/encoder.py) to simulate submanifold sparse convolution by default.
If `--sparse_conv=1` is not specified, this masked convolution is used in pre-training.
**For anyone who wants to run SparK on other architectures**:
we recommend using the default masked convolution,
given the limited hardware optimization of sparse convolution and, in particular, the lack of efficient implementations of many modern operators like grouped conv and dilated conv.
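If you still want to try true sparse convolution (this assumes your environment provides the sparse-convolution backend the script expects), the flag can simply be appended to a normal pre-training command, e.g.:

```shell script
$ cd /path/to/SparK

# identical to a normal pre-training command, with --sparse_conv=1 appended
$ bash ./scripts/pt.sh <experiment_name> \
  --num_nodes=1 --ngpu_per_node=8 --node_rank=0 \
  --master_address=128.0.0.0 --master_port=30000 \
  --data_path=/path/to/imagenet \
  --model=res50 --ep=1600 --bs=4096 \
  --sparse_conv=1
```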