<!--
Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

Welcome to software-delivered AI.

This guide explains how to deploy YOLOv5 with Neural Magic's DeepSparse.

DeepSparse is an inference runtime with exceptional performance on CPUs. For instance, compared to the ONNX Runtime baseline, DeepSparse offers a 5.8x speed-up for YOLOv5s, running on the same machine!

<p align="center">
  <img width="60%" src="https://github.com/neuralmagic/deepsparse/raw/main/examples/ultralytics-yolo/ultralytics-readmes/performance-chart-5.8x.png">
</p>

For the first time, your deep learning workloads can meet the performance demands of production without the complexity and costs of hardware accelerators.
Put simply, DeepSparse gives you the performance of GPUs and the simplicity of software:

- **Flexible Deployments**: Run consistently across cloud, data center, and edge with any hardware provider from Intel to AMD to ARM
- **Infinite Scalability**: Scale vertically to 100s of cores, scale out with standard Kubernetes, or go fully abstracted with Serverless
- **Easy Integration**: Clean APIs for integrating your model into an application and monitoring it in production

**[Start your 90-day Free Trial](https://neuralmagic.com/deepsparse-free-trial/?utm_campaign=free_trial&utm_source=ultralytics_github).**

### How Does DeepSparse Achieve GPU-Class Performance?

DeepSparse takes advantage of model sparsity to gain its performance speedup.

Sparsification through pruning and quantization is a broadly studied technique, allowing order-of-magnitude reductions in the size and compute needed to
execute a network, while maintaining high accuracy. DeepSparse is sparsity-aware, meaning it skips the zeroed-out parameters, shrinking the amount of compute
in a forward pass. Since the sparse computation is now memory bound, DeepSparse executes the network depth-wise, breaking the problem into Tensor Columns,
vertical stripes of computation that fit in cache.

<p align="center">
  <img width="60%" src="https://github.com/neuralmagic/deepsparse/raw/main/examples/ultralytics-yolo/ultralytics-readmes/tensor-columns.png">
</p>

Sparse networks with compressed computation, executed depth-wise in cache, allow DeepSparse to deliver GPU-class performance on CPUs!

### How Do I Create a Sparse Version of YOLOv5 Trained on My Data?

Neural Magic's open-source model repository, SparseZoo, contains pre-sparsified checkpoints of each YOLOv5 model. Using SparseML, which is integrated with Ultralytics, you can fine-tune a sparse checkpoint onto your data with a single CLI command, as sketched below.

[Check out Neural Magic's YOLOv5 documentation for more details](https://docs.neuralmagic.com/use-cases/object-detection/sparsifying).
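
As a rough sketch of what that command can look like (hedged: the exact recipe stubs and flags for each model are listed in the linked docs, and `your_dataset.yaml` is a placeholder for your own data config):

```bash
# fine-tune the pre-sparsified YOLOv5s checkpoint onto your data while
# preserving its sparsity; the recipe stub and flags here are illustrative
sparseml.yolov5.train \
  --weights zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none \
  --recipe zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none?recipe_type=transfer_learn \
  --data your_dataset.yaml
```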

## DeepSparse Usage

We will walk through an example benchmarking and deploying a sparse version of YOLOv5s with DeepSparse.

### Install DeepSparse

Run the following to install DeepSparse. We recommend you use a virtual environment with Python.

```bash
pip install deepsparse[server,yolo,onnxruntime]
```
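
To sanity-check the install, you can print the runtime's version (a quick smoke test; this assumes the package exposes the conventional `__version__` attribute):

```bash
# import the package and print its version
python -c "import deepsparse; print(deepsparse.__version__)"
```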

### Collect an ONNX File

DeepSparse accepts a model in the ONNX format, passed either as:

- A SparseZoo stub, which identifies an ONNX file in the SparseZoo
- A local path to an ONNX model on the filesystem

The examples below use the standard dense and pruned-quantized YOLOv5s checkpoints, identified by the following SparseZoo stubs:

```bash
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
```
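
If you want to deploy a model you trained yourself, export the checkpoint to ONNX first. A sketch using the YOLOv5 repository's export script (run from a clone of `ultralytics/yolov5`; `yolov5s.pt` is a placeholder for your weights file):

```bash
# produce yolov5s.onnx, which can be passed to DeepSparse as a local path
python export.py --weights yolov5s.pt --include onnx
```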

### Deploy a Model

DeepSparse offers convenient APIs for integrating your model into an application.

To try the deployment examples below, pull down a sample image and save it as `basilica.jpg` with the following:

```bash
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
```

#### Python API

`Pipelines` wrap pre-processing and output post-processing around the runtime, providing a clean interface for adding DeepSparse to an application.
The DeepSparse-Ultralytics integration includes an out-of-the-box `Pipeline` that accepts raw images and outputs the bounding boxes.

Create a `Pipeline` and run inference:

```python
from deepsparse import Pipeline

# list of images in local filesystem
images = ["basilica.jpg"]

# create Pipeline
model_stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none"
yolo_pipeline = Pipeline.create(
    task="yolo",
    model_path=model_stub,
)

# run inference on images, receive bounding boxes + classes
pipeline_outputs = yolo_pipeline(images=images, iou_thres=0.6, conf_thres=0.001)
print(pipeline_outputs)
```
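
The returned object bundles the detections for each input image. A minimal sketch of unpacking them (the `boxes`, `scores`, and `labels` field names are assumptions about the YOLO output schema; print the output above to confirm them):

```python
# index 0 selects the first (and here, only) image in the batch
boxes = pipeline_outputs.boxes[0]    # one [x1, y1, x2, y2] box per detection
scores = pipeline_outputs.scores[0]  # one confidence score per detection
labels = pipeline_outputs.labels[0]  # one class label per detection

for box, score, label in zip(boxes, scores, labels):
    print(f"{label}: {score:.2f} at {box}")
```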

If you are running in the cloud, you may get an error that OpenCV cannot find `libGL.so.1`. Running the following on Ubuntu installs it:

```bash
apt-get install libgl1-mesa-glx
```
#### HTTP Server

DeepSparse Server runs on top of the popular FastAPI web framework and Uvicorn web server. With just a single CLI command, you can easily set up a model
service endpoint with DeepSparse. The Server supports any Pipeline from DeepSparse, including object detection with YOLOv5, enabling you to send raw
images to the endpoint and receive the bounding boxes.

Spin up the Server with the pruned-quantized YOLOv5s:

```bash
deepsparse.server \
    --task yolo \
    --model_path zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
```

An example request, using Python's `requests` package:

```python
import requests, json

# list of images for inference (local files on client side)
path = ['basilica.jpg']
files = [('request', open(img, 'rb')) for img in path]

# send request over HTTP to /predict/from_files endpoint
url = 'http://0.0.0.0:5543/predict/from_files'
resp = requests.post(url=url, files=files)

# response is returned in JSON
annotations = json.loads(resp.text)  # dictionary of annotation results
bounding_boxes = annotations["boxes"]
labels = annotations["labels"]
```
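
To visualize the response, you can draw the returned boxes back onto the image. A sketch using Pillow, continuing from the snippet above and assuming each box arrives as `[x1, y1, x2, y2]` in pixel coordinates of the original image:

```python
from PIL import Image, ImageDraw

# draw the first image's detections; the pixel-coordinate box format is an
# assumption -- print a response to confirm it before relying on this
image = Image.open("basilica.jpg")
draw = ImageDraw.Draw(image)
for box, label in zip(bounding_boxes[0], labels[0]):
    x1, y1, x2, y2 = box
    draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
    draw.text((x1, y1), str(label), fill="red")
image.save("basilica-boxes.jpg")
```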

#### Annotate CLI

You can also use the `annotate` command to have the engine save an annotated photo on disk. Try `--source 0` to annotate your live webcam feed!

```bash
deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none --source basilica.jpg
```

Running the above command will create an `annotation-results` folder and save the annotated image inside.

<p align="center">
  <img src="https://github.com/neuralmagic/deepsparse/raw/d31f02596ebff2ec62761d0bc9ca14c4663e8858/src/deepsparse/yolo/sample_images/basilica-annotated.jpg" alt="annotated" width="60%"/>
</p>

## Benchmarking Performance

We will compare DeepSparse's throughput to ONNX Runtime's on YOLOv5s, using DeepSparse's benchmarking script.

The benchmarks were run on an AWS `c6i.8xlarge` instance (16 cores).

### Batch 32 Performance Comparison

#### ONNX Runtime Baseline

At batch 32, ONNX Runtime achieves 42 images/sec with the standard dense YOLOv5s:

```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1 -e onnxruntime

> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
> Batch Size: 32
> Scenario: sync
> Throughput (items/sec): 41.9025
```

#### DeepSparse Dense Performance

While DeepSparse offers its best performance with optimized sparse models, it also performs well with the standard dense YOLOv5s.

At batch 32, DeepSparse achieves 70 images/sec with the standard dense YOLOv5s, a **1.7x performance improvement over ONNX Runtime**!

```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1

> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
> Batch Size: 32
> Scenario: sync
> Throughput (items/sec): 69.5546
```

#### DeepSparse Sparse Performance

When sparsity is applied to the model, DeepSparse's performance gains over ONNX Runtime are even stronger.

At batch 32, DeepSparse achieves 241 images/sec with the pruned-quantized YOLOv5s, a **5.8x performance improvement over ONNX Runtime**!

```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 32 -nstreams 1

> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
> Batch Size: 32
> Scenario: sync
> Throughput (items/sec): 241.2452
```
### Batch 1 Performance Comparison

DeepSparse also gains a speed-up over ONNX Runtime in the latency-sensitive batch 1 scenario.

#### ONNX Runtime Baseline

At batch 1, ONNX Runtime achieves 48 images/sec with the standard dense YOLOv5s:

```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 1 -nstreams 1 -e onnxruntime

> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
> Batch Size: 1
> Scenario: sync
> Throughput (items/sec): 48.0921
```

#### DeepSparse Sparse Performance

At batch 1, DeepSparse achieves 135 items/sec with a pruned-quantized YOLOv5s, a **2.8x performance gain over ONNX Runtime**!

```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 1 -nstreams 1

> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
> Batch Size: 1
> Scenario: sync
> Throughput (items/sec): 134.9468
```

Since `c6i.8xlarge` instances have VNNI instructions, DeepSparse's throughput can be pushed further if weights are pruned in blocks of 4.

At batch 1, DeepSparse achieves 180 items/sec with a 4-block pruned-quantized YOLOv5s, a **3.7x performance gain over ONNX Runtime**!

```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni -s sync -b 1 -nstreams 1

> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni
> Batch Size: 1
> Scenario: sync
> Throughput (items/sec): 179.7375
```

## Get Started With DeepSparse

**Research or Testing?** DeepSparse Community is free for research and testing. Get started with our [Documentation](https://docs.neuralmagic.com/).

**Want to Try DeepSparse Enterprise?** [Start your 90-day free trial](https://neuralmagic.com/deepsparse-free-trial/?utm_campaign=free_trial&utm_source=ultralytics_github).