Update troubleshooting page to resolve segmentation fault (#4055)

* update troubleshooting page to resolve segmentation fault

* resolve comments
pull/4088/head
Wenwei Zhang 4 years ago committed by GitHub
parent f1f89801bc
commit 7ad44cd6f9
      docs/faq.md

@@ -36,3 +36,38 @@ We list some common troubles faced by many users and their corresponding solutions
1. If you are using miniconda rather than anaconda, check whether Cython is installed as indicated in [#3379](https://github.com/open-mmlab/mmdetection/issues/3379).
You need to manually install Cython first and then run `pip install -r requirements.txt`.
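For example, the two steps can be run from the repository root:
```shell
pip install cython
pip install -r requirements.txt
```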
2. You may also need to check the compatibility between `setuptools`, `Cython`, and `PyTorch` in your environment.
- "Segmentation fault".
1. Check you GCC version and use GCC 5.4. This usually caused by the incompatibility between PyTorch and the environment (e.g., GCC < 4.9 for PyTorch). We also recommand the users to avoid using GCC 5.5 because many feedbacks report that GCC 5.5 will cause "segmentation fault" and simply changing it to GCC 5.4 could solve the problem.
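You can check the compiler version in the current environment with:
```shell
gcc --version
```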
2. Check whether PyTorch is correctly installed and can use CUDA ops, e.g., type the following command in your terminal.
```shell
python -c 'import torch; print(torch.cuda.is_available())'
```
and see whether it correctly outputs `True`.
3. If PyTorch is correctly installed, check whether MMCV is correctly installed.
```shell
python -c 'import mmcv; import mmcv.ops'
```
If MMCV is correctly installed, the above two commands will run without any error.
4. If MMCV and PyTorch are correctly installed, you may use `ipdb` or `pdb` to set breakpoints, or directly add `print` statements in the mmdetection code, to find which part leads to the segmentation fault.
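A minimal sketch of the breakpoint approach: place the line right before the call you suspect (e.g., the first CUDA op), then run the training or test script as usual; the exact location is only illustrative.
```python
# drop into an interactive debugger just before the suspected call
import ipdb; ipdb.set_trace()
```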
## Training
- "Loss goes Nan"
1. Check if the dataset annotations are valid: zero-size bounding boxes will cause the regression loss to be Nan due to the commonly used transformation for box regression. Some small size (width or height are smaller than 1) boxes will also cause this problem after data augmentation (e.g., instaboost). So check the data and try to filter out those zero-size boxes and skip some risky augmentations on the small-size boxes when you face the problem.
2. Reduce the learning rate: the learning rate might be too large due to some reasons, e.g., change of batch size. You can rescale them to the value that could stably train the model.
3. Extend the warmup iterations: some models are sensitive to the learning rate at the start of the training. You can extend the warmup iterations, e.g., change the `warmup_iters` from 500 to 1000 or 2000.
4. Add gradient clipping: some models requires gradient clipping to stablize the training process. You can add gradient clippint to avoid gradients that are too large.
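A minimal config sketch for tips 3 and 4, assuming the common step learning-rate schedule used in many mmdetection configs; the concrete numbers are only illustrative and should be tuned per model:
```python
# extend the linear warmup from the default 500 to 1000 iterations
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=1000,
    warmup_ratio=0.001,
    step=[8, 11])
# clip gradients with an L2 norm threshold to stabilize training
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```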
- ’GPU out of memory"
1. There are some scenarios when there are large amount of ground truth boxes, which may cause OOM during target assignment.
You can set `gpu_assign_thr=N` in the config of assigner thus the assigner will calculate box overlaps through CPU when there are more than N GT boxes.
2. Set `with_cp=True` in the backbone. This uses the sublinear strategy in PyTorch to reduce GPU memory cost in the backbone.
3. Try mixed precision training following the examples in `config/fp16`. The `loss_scale` might need further tuning for different models.
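A config sketch combining the three memory-saving options; the placement of the fields follows typical mmdetection 2.x configs that inherit from a base config, and the concrete values (the threshold of 1000 and the loss scale of 512) are only illustrative:
```python
# compute box overlaps on CPU when an image contains more than 1000 GT boxes (tip 1)
train_cfg = dict(
    rcnn=dict(
        assigner=dict(gpu_assign_thr=1000)))
# enable checkpointing in the backbone to trade compute for memory (tip 2)
model = dict(backbone=dict(with_cp=True))
# mixed precision training with a static loss scale (tip 3)
fp16 = dict(loss_scale=512.)
```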
