Preparation for pre-training
- Prepare a Python environment, e.g.:
$ conda create -n spark python=3.8 -y
$ conda activate spark
- Install PyTorch and timm (better to use torch~=1.10, torchvision~=0.11, and timm==0.5.4), then other Python packages:
$ pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install timm==0.5.4
$ pip install -r requirements.txt
It is highly recommended to install these exact versions to ensure a consistent environment for reproduction.
- Prepare the ImageNet-1k dataset
  - assume the dataset is in /path/to/imagenet
  - check the file path; it should look like this:
    /path/to/imagenet/:
      train/:
        class1: a_lot_images.jpeg
        class2: a_lot_images.jpeg
      val/:
        class1: a_lot_images.jpeg
        class2: a_lot_images.jpeg
  - the argument --data_path=/path/to/imagenet should be passed to the training script introduced later
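  - (optional) a quick sanity check of that layout, using torchvision's ImageFolder (an illustrative snippet, not part of our scripts):
    from torchvision.datasets import ImageFolder
    # for ImageNet-1k, both splits should load and report 1000 classes each
    for split in ('train', 'val'):
        ds = ImageFolder(f'/path/to/imagenet/{split}')
        print(split, len(ds.classes), 'classes,', len(ds), 'images')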
- (Optional) Install this sparse convolution library:
$ git clone https://github.com/facebookresearch/SparseConvNet.git && cd SparseConvNet
$ rm -rf build/ dist/ sparseconvnet.egg-info sparseconvnet_SCN*.so
$ python3 setup.py develop --user
Tips:
In our default implementation, masked convolution (defined in encoder.py) is used to simulate submanifold sparse convolution for speed. It gives computational results equivalent to sparse convolution (a sketch of this simulation follows below). If you would like to use the true sparse convolution installed above, please pass --sparse_conv=1 to the training script.
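The idea behind that simulation, in a minimal and purely illustrative form (this is not the code in encoder.py, and the class name below is made up for the example):

import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """A dense conv whose output is re-masked, mimicking submanifold sparse conv."""
    def forward(self, x, mask):
        # x:    (B, C, H, W) feature map with masked positions already zeroed
        # mask: (B, 1, H, W) binary mask, 1 = visible patch, 0 = masked patch
        out = super().forward(x)
        return out * mask  # zero out the masked positions again

x = torch.randn(2, 3, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.6).float()
conv = MaskedConv2d(3, 64, kernel_size=3, padding=1)
y = conv(x * mask, mask)

In words: positions belonging to masked patches are kept at zero before and after every convolution, so only the visible patches contribute to the output, which is what a submanifold sparse convolution computes on the active sites.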
Pre-training from scratch
The script for pre-training is exp/pt.sh.
Since torch.nn.parallel.DistributedDataParallel is used for distributed training, you are expected to specify some distributed arguments on each node, including:
--num_nodes=<INTEGER>
--ngpu_per_node=<INTEGER>
--node_rank=<INTEGER>
--master_address=<ADDRESS>
--master_port=<INTEGER>
Set --num_nodes=0 if your task is running on a single GPU.
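Roughly speaking, these flags correspond to the standard torch.distributed quantities as follows (an illustrative sketch of the relationship only, not the actual wiring in our scripts):

import os
import torch

# assumed correspondence of the launcher flags to torch.distributed concepts
num_nodes = 1          # --num_nodes
ngpu_per_node = 8      # --ngpu_per_node
node_rank = 0          # --node_rank
local_rank = int(os.environ.get('LOCAL_RANK', 0))     # set per process by the launcher

world_size = num_nodes * ngpu_per_node                # total number of processes (GPUs)
global_rank = node_rank * ngpu_per_node + local_rank  # this process's rank in the world

torch.distributed.init_process_group(
    backend='nccl',
    init_method='tcp://<ADDRESS>:<PORT>',             # --master_address / --master_port
    world_size=world_size, rank=global_rank,
)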
You can add arbitrary keyword arguments (like --ep=400 --bs=2048) to specify pre-training hyperparameters (see utils/meta.py for all of them).
Here is an example command:
$ cd /path/to/SparK
$ bash ./scripts/pt.sh <experiment_name> \
--num_nodes=1 --ngpu_per_node=8 --node_rank=0 \
--master_address=128.0.0.0 --master_port=30000 \
--data_path=/path/to/imagenet \
--model=res50 --ep=1600 --bs=4096
Note that the first argument is the name of the experiment. It will be used to create the output directory named output_<experiment_name>.
Logging
Once an experiment starts running, the following files will be automatically created and updated in SparK/output_<experiment_name>:
- ckpt-last.pth: includes model states, optimizer states, current epoch, current reconstruction loss, etc.
- log.txt: records important meta information, such as:
  - the git version (commit_id) at the start of the experiment
  - all arguments passed to the script
  It also reports the loss and the remaining training time at each epoch.
- stdout_backup.txt and stderr_backup.txt: save everything written to stdout/stderr
We believe these files can help you trace the experiment conveniently.
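If you want to peek inside ckpt-last.pth manually, something like this works (a minimal sketch; we make no assumption about the exact key names, so print them first):

import torch

# load the latest checkpoint on CPU and list what it contains
# (inspect ckpt.keys() for the actual layout, e.g. model/optimizer states and the current epoch)
ckpt = torch.load('output_<experiment_name>/ckpt-last.pth', map_location='cpu')
print(list(ckpt.keys()))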
Resuming
To resume from a saved checkpoint, run pt.sh with --resume=/path/to/checkpoint.pth.
Regarding sparse convolution
For speed, we use the masked convolution implemented in encoder.py to simulate submanifold sparse convolution by default.
If --sparse_conv=1 is not specified, this masked convolution will be used in pre-training.
For anyone who might want to run SparK on other architectures: we still recommend using the default masked convolution, given the limited hardware optimization of sparse convolution and, in particular, the lack of efficient implementations of many modern operators like grouped conv and dilated conv (see the sketch below).
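For example, the same masking trick applies unchanged to operators that sparse-conv libraries often lack, such as grouped or dilated convolutions (again an illustrative sketch, not the repository's code):

import torch
import torch.nn as nn

# run the dense (grouped, dilated) conv, then zero the masked positions again
x = torch.randn(2, 64, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.6).float()
conv = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2, groups=8)
y = conv(x * mask) * mask  # masked positions stay zero, as with sparse conv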