## Preparation for pre-training
- prepare a python environment, e.g.:

```shell
$ conda create -n spark python=3.8 -y
$ conda activate spark
```
- install `PyTorch` and `timm` (better to use `torch~=1.10`, `torchvision~=0.11`, and `timm==0.5.4`), then other python packages, e.g.:

```shell
$ pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install timm==0.5.4
$ pip install -r requirements.txt
```

It is highly recommended to follow these instructions to ensure a consistent environment for re-implementation. (A combined sanity check for the environment, the dataset, and the optional sparse-convolution install is sketched after this list.)
- prepare ImageNet-1k dataset
  - download the dataset to a folder `/path/to/imagenet`
  - the file structure should look like:

```
/path/to/imagenet/:
    train/:
        class1:
            a_lot_images.jpeg
        class2:
            a_lot_images.jpeg
    val/:
        class3:
            a_lot_images.jpeg
        class4:
            a_lot_images.jpeg
```
- (optional) if you want to use sparse convolution rather than masked convolution, please install this library and set `--sparse_conv=1` later:

```shell
$ git clone https://github.com/facebookresearch/SparseConvNet.git && cd SparseConvNet
$ rm -rf build/ dist/ sparseconvnet.egg-info sparseconvnet_SCN*.so
$ python3 setup.py develop --user
```
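Before launching, you can confirm that the environment, the dataset layout, and the optional sparse-convolution backend are all in place. The following is a hypothetical helper, not part of the SparK repo; the dataset path is a placeholder, and the last check only matters if you plan to pass `--sparse_conv=1`:

```python
# sanity_check.py -- hypothetical helper script, not part of the SparK repo
import torch
import torchvision
import timm
from torchvision.datasets import ImageFolder

# 1. versions: the instructions above recommend torch~=1.10, torchvision~=0.11, timm==0.5.4
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("timm:", timm.__version__)
print("CUDA available:", torch.cuda.is_available())

# 2. dataset: the layout above is the standard ImageFolder format,
#    so both splits should load without errors ("/path/to/imagenet" is a placeholder)
for split in ("train", "val"):
    ds = ImageFolder(f"/path/to/imagenet/{split}")
    print(f"{split}: {len(ds)} images across {len(ds.classes)} classes")

# 3. optional: the sparse convolution backend used with --sparse_conv=1
try:
    import sparseconvnet  # noqa: F401
    print("sparseconvnet import OK")
except ImportError:
    print("sparseconvnet not installed (fine if you use masked convolution)")
```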
## Pre-training from scratch
- since `torch.nn.parallel.DistributedDataParallel` is used for distributed training, you are expected to specify some distributed arguments on each node, including `--num_nodes`, `--ngpu_per_node`, `--node_rank`, `--master_address`, and `--master_port` (a sketch of how these combine follows this list)
- besides, you also need to specify the name of the experiment and the ImageNet path in the first two arguments, and you may add arbitrary hyperparameter key words (like `--ep=400 --bs=2048`) for other configurations, so the final command should look like this:

```shell
$ cd /path/to/SparK
$ bash ./scripts/pt.sh \
  experiment_name /path/to/imagenet \
  --num_nodes=1 --ngpu_per_node=8 --node_rank=0 \
  --master_address=128.0.0.0 --master_port=30000 \
  --model=res50 --ep=400 --bs=2048
```
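For orientation, these distributed flags follow the usual `torch.distributed` convention; here is a minimal sketch of how they would combine (illustrative only, not SparK's actual launcher code):

```python
# illustrative sketch of the usual torch.distributed convention -- NOT SparK's launcher code
import torch.distributed as dist

def init_distributed(num_nodes, ngpu_per_node, node_rank,
                     master_address, master_port, local_rank):
    world_size = num_nodes * ngpu_per_node          # total number of processes
    rank = node_rank * ngpu_per_node + local_rank   # global rank of this process
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_address}:{master_port}",
        world_size=world_size,
        rank=rank,
    )
```

In practice this means you repeat the same `pt.sh` command on every node, changing only `--node_rank` (0 on the node reachable at `--master_address`, 1 on the next, and so on).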
## Resume

When an experiment starts running, the folder `SparK/<experiment_name>` will be created to record per-epoch checkpoints (e.g., `ckpt-last.pth`) and log files (`log.txt`).
To resume from a checkpoint, specify `--resume=/path/to/checkpoint.pth`.
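If you want to inspect a checkpoint before resuming, a generic sketch follows (the exact keys are SparK-specific, so this just lists what the file contains):

```python
import torch

# load on CPU so no GPU is needed just to inspect the file
ckpt = torch.load("/path/to/checkpoint.pth", map_location="cpu")

# a per-epoch checkpoint is typically a dict of state_dicts and counters
for key, value in ckpt.items():
    print(key, "->", type(value).__name__)
```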
## Read logs

The `stdout` and `stderr` are also saved in `SparK/<experiment_name>/stdout.txt` and `SparK/<experiment_name>/stderr.txt`.
Note that `SparK/<experiment_name>/log.txt` records the most important information, such as current loss values and the remaining time.
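To watch progress without opening the full file, printing the tail of `log.txt` is enough; a minimal sketch (the experiment name is a placeholder, and the log format is SparK-specific, so this just shows the raw last lines):

```python
from pathlib import Path

# placeholder experiment name; the real path is SparK/<experiment_name>/log.txt
log = Path("SparK/experiment_name/log.txt")
for line in log.read_text().splitlines()[-10:]:
    print(line)
```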