## Preparation for pre-training
- prepare a python environment, e.g.:

```shell
$ conda create -n spark python=3.8 -y
$ conda activate spark
```
- install `PyTorch` and `timm` (better to use `torch~=1.10`, `torchvision~=0.11`, and `timm==0.5.4`), then other python packages, e.g.:

```shell
$ pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install timm==0.5.4
$ pip install -r requirements.txt
```

It is highly recommended to follow these instructions to ensure a consistent environment for re-implementation. (A combined sanity check for the environment, the dataset, and the optional sparse-convolution install is sketched after this list.)
- prepare ImageNet-1k dataset
  - download the dataset to a folder `/path/to/imagenet`
  - the file structure should look like:

```
/path/to/imagenet/:
    train/:
        class1:
            a_lot_images.jpeg
        class2:
            a_lot_images.jpeg
    val/:
        class3:
            a_lot_images.jpeg
        class4:
            a_lot_images.jpeg
```
- (optional) if you want to use sparse convolution rather than masked convolution, please install this library and set `--sparse_conv=1` later:

```shell
$ git clone https://github.com/facebookresearch/SparseConvNet.git && cd SparseConvNet
$ rm -rf build/ dist/ sparseconvnet.egg-info sparseconvnet_SCN*.so
$ python3 setup.py develop --user
```
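Before launching, you can confirm that the environment, the dataset layout, and the optional sparse-convolution backend are all in place. The following is a hypothetical helper, not part of the SparK repo; the dataset path is a placeholder, and the last check only matters if you plan to pass `--sparse_conv=1`:

```python
# sanity_check.py -- hypothetical helper script, not part of the SparK repo
import torch
import torchvision
import timm
from torchvision.datasets import ImageFolder

# 1. versions: the instructions above recommend torch~=1.10, torchvision~=0.11, timm==0.5.4
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("timm:", timm.__version__)
print("CUDA available:", torch.cuda.is_available())

# 2. dataset: the layout above is the standard ImageFolder format,
#    so both splits should load without errors ("/path/to/imagenet" is a placeholder)
for split in ("train", "val"):
    ds = ImageFolder(f"/path/to/imagenet/{split}")
    print(f"{split}: {len(ds)} images across {len(ds.classes)} classes")

# 3. optional: the sparse convolution backend used with --sparse_conv=1
try:
    import sparseconvnet  # noqa: F401
    print("sparseconvnet import OK")
except ImportError:
    print("sparseconvnet not installed (fine if you use masked convolution)")
```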
## Pre-training from scratch
- since `torch.nn.parallel.DistributedDataParallel` is used for distributed training, you are expected to specify some distributed arguments on each node, including `--num_nodes`, `--ngpu_per_node`, `--node_rank`, `--master_address`, and `--master_port` (a sketch of how these combine follows this list)
- besides, you also need to specify the name of the experiment and the ImageNet path in the first two arguments, and you may add arbitrary hyperparameter key words (like `--ep=400 --bs=2048`) for other configurations, so the final command should look like this:

```shell
$ cd /path/to/SparK
$ bash ./scripts/pt.sh \
  experiment_name /path/to/imagenet \
  --num_nodes=1 --ngpu_per_node=8 --node_rank=0 \
  --master_address=128.0.0.0 --master_port=30000 \
  --model=res50 --ep=400 --bs=2048
```
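For orientation, these distributed flags follow the usual `torch.distributed` convention; here is a minimal sketch of how they would combine (illustrative only, not SparK's actual launcher code):

```python
# illustrative sketch of the usual torch.distributed convention -- NOT SparK's launcher code
import torch.distributed as dist

def init_distributed(num_nodes, ngpu_per_node, node_rank,
                     master_address, master_port, local_rank):
    world_size = num_nodes * ngpu_per_node          # total number of processes
    rank = node_rank * ngpu_per_node + local_rank   # global rank of this process
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_address}:{master_port}",
        world_size=world_size,
        rank=rank,
    )
```

In practice this means you repeat the same `pt.sh` command on every node, changing only `--node_rank` (0 on the node reachable at `--master_address`, 1 on the next, and so on).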
## Resume

When an experiment starts running, the folder `SparK/<experiment_name>` will be created to record per-epoch checkpoints (e.g., `ckpt-last.pth`) and log files (`log.txt`).
To resume from a checkpoint, specify `--resume=/path/to/checkpoint.pth`.
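If you want to inspect a checkpoint before resuming, a generic sketch follows (the exact keys are SparK-specific, so this just lists what the file contains):

```python
import torch

# load on CPU so no GPU is needed just to inspect the file
ckpt = torch.load("/path/to/checkpoint.pth", map_location="cpu")

# a per-epoch checkpoint is typically a dict of state_dicts and counters
for key, value in ckpt.items():
    print(key, "->", type(value).__name__)
```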
## Read logs

The `stdout` and `stderr` are also saved in `SparK/<experiment_name>/stdout.txt` and `SparK/<experiment_name>/stderr.txt`.
Note that `SparK/<experiment_name>/log.txt` records the most important information, such as current loss values and the remaining time.
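To watch progress without opening the full file, printing the tail of `log.txt` is enough; a minimal sketch (the experiment name is a placeholder, and the log format is SparK-specific, so this just shows the raw last lines):

```python
from pathlib import Path

# placeholder experiment name; the real path is SparK/<experiment_name>/log.txt
log = Path("SparK/experiment_name/log.txt")
for line in log.read_text().splitlines()[-10:]:
    print(line)
```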