`ultralytics 8.0.195` NVIDIA Triton Inference Server support (#5257)
Co-authored-by: TheConstant3 <46416203+TheConstant3@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
parent 40e3923cfc
commit c7aa83da31
21 changed files with 351 additions and 100 deletions
@@ -0,0 +1,137 @@
---
comments: true
description: A step-by-step guide on integrating Ultralytics YOLOv8 with Triton Inference Server for scalable and high-performance deep learning inference deployments.
keywords: YOLOv8, Triton Inference Server, ONNX, Deep Learning Deployment, Scalable Inference, Ultralytics, NVIDIA, Object Detection, Cloud Inferencing
---

# Triton Inference Server with Ultralytics YOLOv8

The [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inferencing solution optimized for NVIDIA GPUs. Triton simplifies the deployment of AI models at scale in production. Integrating Ultralytics YOLOv8 with Triton Inference Server allows you to deploy scalable, high-performance deep learning inference workloads. This guide provides steps to set up and test the integration.

<p align="center">
  <br>
  <iframe width="720" height="405" src="https://www.youtube.com/embed/NQDtfSi5QF4"
    title="Getting Started with NVIDIA Triton Inference Server" frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    allowfullscreen>
  </iframe>
  <br>
  <strong>Watch:</strong> Getting Started with NVIDIA Triton Inference Server.
</p>

## What is Triton Inference Server?

Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and many others. Its primary use cases are:

- Serving multiple models from a single server instance.
- Dynamic model loading and unloading without restarting the server.
- Ensemble inferencing, allowing multiple models to be chained together to produce a result.
- Model versioning for A/B testing and rolling updates.

## Prerequisites

Ensure you have the following prerequisites before proceeding:

- Docker installed on your machine.
- Install `tritonclient`:

  ```bash
  pip install tritonclient[all]
  ```

## Exporting YOLOv8 to ONNX Format

Before deploying the model on Triton, it must be exported to the ONNX format. ONNX (Open Neural Network Exchange) is a format that allows models to be transferred between different deep learning frameworks. Use the `export` function from the `YOLO` class:

```python
from ultralytics import YOLO

# Load an official pretrained model
model = YOLO('yolov8n.pt')

# Export the model to ONNX with dynamic input shapes
onnx_file = model.export(format='onnx', dynamic=True)
```
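
As a quick sanity check, you can load the exported file with the `onnx` package and run its checker before copying it into the model repository. This is a minimal sketch, assuming the `onnx` package is installed and `onnx_file` is the path returned by `export` above:

```python
import onnx

# Load the exported model and verify that the graph is structurally valid
onnx_model = onnx.load(onnx_file)
onnx.checker.check_model(onnx_model)

# Print the input/output names, which Triton will expose in the model metadata
print([i.name for i in onnx_model.graph.input], [o.name for o in onnx_model.graph.output])
```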

## Setting Up Triton Model Repository

The Triton Model Repository is a storage location where Triton can access and load models.

1. Create the necessary directory structure:

    ```python
    from pathlib import Path

    # Define paths
    triton_repo_path = Path('tmp') / 'triton_repo'
    triton_model_path = triton_repo_path / 'yolo'

    # Create directories
    (triton_model_path / '1').mkdir(parents=True, exist_ok=True)
    ```

2. Move the exported ONNX model to the Triton repository:

    ```python
    from pathlib import Path

    # Move ONNX model to Triton Model path
    Path(onnx_file).rename(triton_model_path / '1' / 'model.onnx')

    # Create an (empty) config file
    (triton_model_path / 'config.pbtxt').touch()
    ```
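
The empty `config.pbtxt` above lets Triton complete the model configuration from the ONNX file itself. If you prefer an explicit configuration, a minimal one can be written instead; the sketch below is only illustrative and reuses `triton_model_path` from step 1 (the `name` field must match the model directory, and `onnxruntime_onnx` is the Triton platform for ONNX Runtime models):

```python
# Optional: write an explicit Triton config instead of an empty file.
# Field values below are illustrative; if you add input/output sections,
# their names and dims must match the exported ONNX model.
config_text = """
name: "yolo"
platform: "onnxruntime_onnx"
max_batch_size: 0
"""
(triton_model_path / 'config.pbtxt').write_text(config_text.strip() + '\n')
```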

## Running Triton Inference Server

Run the Triton Inference Server using Docker:

```python
import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define the Triton server image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = 'nvcr.io/nvidia/tritonserver:23.09-py3'  # 6.4 GB

# Pull the image
subprocess.call(f'docker pull {tag}', shell=True)

# Run the Triton server and capture the container ID
container_id = subprocess.check_output(
    f'docker run -d --rm -v {triton_repo_path}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models',
    shell=True).decode('utf-8').strip()

# Create an HTTP client for the Triton server
triton_client = InferenceServerClient(url='localhost:8000', verbose=False, ssl=False)

# Wait until the model is ready (up to ~10 seconds)
model_name = 'yolo'
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
```
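
If the readiness loop above keeps timing out, it can help to check whether the server itself is up and what it found in the model repository. A small diagnostic sketch reusing the same `triton_client` (these are standard `tritonclient.http` client methods):

```python
# Basic diagnostics against the running server
print('Server live:', triton_client.is_server_live())
print('Server ready:', triton_client.is_server_ready())

# List the models Triton discovered in /models and their current state
for model in triton_client.get_model_repository_index():
    print(model)
```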

Then run inference using the Triton Server model:

```python
from ultralytics import YOLO

# Load the Triton Server model
model = YOLO('http://localhost:8000/yolo', task='detect')

# Run inference on the server
results = model('path/to/image.jpg')
```
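
The `results` object is the same `Results` list the local YOLO API returns, so the usual post-processing applies. A brief sketch, assuming a detection model and a valid image path:

```python
# Inspect detections from the first (and only) image
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)  # class id, confidence, xyxy coordinates

# Render the annotated image as a numpy array (BGR) for saving or display
annotated = results[0].plot()
```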

Clean up the container:

```python
# Kill and remove the container at the end of the test
subprocess.call(f'docker kill {container_id}', shell=True)
```

---

By following the above steps, you can deploy and run Ultralytics YOLOv8 models efficiently on Triton Inference Server, providing a scalable and high-performance solution for deep learning inference tasks. If you face any issues or have further queries, refer to the [official Triton documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html) or reach out to the Ultralytics community for support.
@@ -0,0 +1,86 @@
# Ultralytics YOLO 🚀, AGPL-3.0 license

from typing import List
from urllib.parse import urlsplit

import numpy as np


class TritonRemoteModel:
    """Client for interacting with a remote Triton Inference Server model.

    Attributes:
        endpoint (str): The name of the model on the Triton server.
        url (str): The URL of the Triton server.
        triton_client: The Triton client (either HTTP or gRPC).
        InferInput: The input class for the Triton client.
        InferRequestedOutput: The output request class for the Triton client.
        input_formats (List[str]): The data types of the model inputs.
        np_input_formats (List[type]): The numpy data types of the model inputs.
        input_names (List[str]): The names of the model inputs.
        output_names (List[str]): The names of the model outputs.
    """

    def __init__(self, url: str, endpoint: str = '', scheme: str = ''):
        """
        Initialize the TritonRemoteModel.

        Arguments may be provided individually or parsed from a collective 'url' argument of the form
        <scheme>://<netloc>/<endpoint>/<task_name>

        Args:
            url (str): The URL of the Triton server.
            endpoint (str): The name of the model on the Triton server.
            scheme (str): The communication scheme ('http' or 'grpc').
        """
        if not endpoint and not scheme:  # Parse all args from URL string
            splits = urlsplit(url)
            endpoint = splits.path.strip('/').split('/')[0]
            scheme = splits.scheme
            url = splits.netloc

        self.endpoint = endpoint
        self.url = url

        # Choose the Triton client based on the communication scheme
        if scheme == 'http':
            import tritonclient.http as client  # noqa
            self.triton_client = client.InferenceServerClient(url=self.url, verbose=False, ssl=False)
            config = self.triton_client.get_model_config(endpoint)
        else:
            import tritonclient.grpc as client  # noqa
            self.triton_client = client.InferenceServerClient(url=self.url, verbose=False, ssl=False)
            config = self.triton_client.get_model_config(endpoint, as_json=True)['config']

        self.InferRequestedOutput = client.InferRequestedOutput
        self.InferInput = client.InferInput

        # Map Triton type strings to numpy dtypes and cache input/output metadata from the model config
        type_map = {'TYPE_FP32': np.float32, 'TYPE_FP16': np.float16, 'TYPE_UINT8': np.uint8}
        self.input_formats = [x['data_type'] for x in config['input']]
        self.np_input_formats = [type_map[x] for x in self.input_formats]
        self.input_names = [x['name'] for x in config['input']]
        self.output_names = [x['name'] for x in config['output']]

    def __call__(self, *inputs: np.ndarray) -> List[np.ndarray]:
        """
        Call the model with the given inputs.

        Args:
            *inputs (List[np.ndarray]): Input data to the model.

        Returns:
            List[np.ndarray]: Model outputs.
        """
        infer_inputs = []
        input_format = inputs[0].dtype  # Remember the caller's dtype so outputs can be cast back to it
        for i, x in enumerate(inputs):
            # Cast each input to the dtype the model expects before building the request
            if x.dtype != self.np_input_formats[i]:
                x = x.astype(self.np_input_formats[i])
            infer_input = self.InferInput(self.input_names[i], [*x.shape], self.input_formats[i].replace('TYPE_', ''))
            infer_input.set_data_from_numpy(x)
            infer_inputs.append(infer_input)

        # Request every declared output and run the inference call
        infer_outputs = [self.InferRequestedOutput(output_name) for output_name in self.output_names]
        outputs = self.triton_client.infer(model_name=self.endpoint, inputs=infer_inputs, outputs=infer_outputs)

        # Return outputs as numpy arrays, cast back to the caller's original dtype
        return [outputs.as_numpy(output_name).astype(input_format) for output_name in self.output_names]
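

if __name__ == '__main__':
    # Illustrative usage sketch, not part of the library API: it assumes the
    # HTTP Triton server from the guide above is running and serving the
    # exported 'yolo' model, whose single input is an NCHW float image batch.
    # The input shape below is an assumption for demonstration only.
    model = TritonRemoteModel('http://localhost:8000/yolo')

    # Build a dummy image batch in the assumed layout, values in [0, 1)
    dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)

    # Outputs come back as a list of numpy arrays, one per declared model output
    outputs = model(dummy)
    print([o.shape for o in outputs])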