
# Tensor to TensorRT

## Chapter 0-Environment

CPU=x86; OS=Ubuntu 16.04; TensorFlow=12; CUDA=10.0; cuDNN=;

## Chapter 1-Introduction

### Benchmarks for Evaluating a Neural Network

1. Throughput
2. Efficiency
3. Latency
4. Accuracy
5. Memory Usage
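
Of the metrics above, latency and throughput are the easiest to measure directly. A minimal sketch, assuming a hypothetical `infer` callable that processes one batch per call (the helper name `measure` and its parameters are illustrative and not part of TensorRT):

```python
import time

def measure(infer, batch, batch_size, runs=50, warmup=5):
    """Return mean latency (seconds per batch) and throughput (items per second)."""
    for _ in range(warmup):          # warm-up runs are excluded from the timing
        infer(batch)
    start = time.perf_counter()
    for _ in range(runs):
        infer(batch)
    elapsed = time.perf_counter() - start
    latency = elapsed / runs
    throughput = (runs * batch_size) / elapsed
    return {"latency_s": latency, "throughput_per_s": throughput}

# Stand-in workload; replace the lambda with a real inference call.
stats = measure(lambda b: sum(b), batch=list(range(1000)), batch_size=1000)
print(stats)
```

Larger batches usually raise throughput at the cost of latency, so it helps to report both numbers together with the batch size used.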

### TensorRT Optimization Workflow

![](/files/-LeGwn3RVJsOr-SJgNs1)

### TensorRT Core Library

#### Network Definition

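Provides the ability to describe each layer of the NN, including the input and output layers, layers that TRT supports, and layers it does not. Unsupported layers can still be added through a Plugin.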

#### Builder

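The Builder creates an optimized Engine from a Network Definition. It also lets you set the maximum batch size, the workspace size, the minimum acceptable precision level, the timing iteration count used for autotuning, and the quantization of the network to 8-bit precision.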

#### Engine

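The Engine interface lets the application run inference. It supports synchronous and asynchronous execution, profiling, and enumerating and querying the bindings for the engine's inputs and outputs. A single engine can have multiple execution contexts, which allows one set of trained parameters to be used for executing multiple batches simultaneously.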

#### Parser (e.g. Caffe, UFF, ONNX)

The UFF parser, for instance, is used to parse a network in UFF format. It also provides the ability to register a plugin factory and pass field attributes for custom layers.

## Chapter 2-Using the TRT C++ API (omitted)

## Chapter 3-Using the TRT Python API

### 3.1 Import TRT

```python
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
```

### 3.2 Creating A Network Definition in Python

You can create the network definition in one of two ways:

1. **Create the network directly with the TRT API**
2. **Import a model using a parser (Caffe, TensorFlow, ONNX)**

#### 3.2.1 (pass) Creating A Network Definition From Scratch Using The Python API

When creating a network, the first step is to create the Builder object, which is used to define the Engine for inference. In other words, you hand-build the NN layer by layer.

```python
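# Create the builder and network.
# INPUT_NAME, INPUT_SHAPE, OUTPUT_NAME, OUTPUT_SIZE and weights are assumed to be defined by the caller.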
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
	# Configure the network layers based on the weights provided. In this case, the weights are imported from a pytorch model. 
	# Add an input layer. The name is a string, dtype is a TensorRT dtype, and the shape can be provided as either a list or tuple.
	input_tensor = network.add_input(name=INPUT_NAME, dtype=trt.float32, shape=INPUT_SHAPE)

	# Add a convolution layer
	conv1_w = weights['conv1.weight'].numpy()
	conv1_b = weights['conv1.bias'].numpy()
	conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
	conv1.stride = (1, 1)

	pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
	pool1.stride = (2, 2)
	conv2_w = weights['conv2.weight'].numpy()
	conv2_b = weights['conv2.bias'].numpy()
	conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
	conv2.stride = (1, 1)

	pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
	pool2.stride = (2, 2)

	fc1_w = weights['fc1.weight'].numpy()
	fc1_b = weights['fc1.bias'].numpy()
	fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)

	relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)

	fc2_w = weights['fc2.weight'].numpy()
	fc2_b = weights['fc2.bias'].numpy()
	fc2 = network.add_fully_connected(relu1.get_output(0), OUTPUT_SIZE, fc2_w, fc2_b)

	fc2.get_output(0).name = OUTPUT_NAME
	network.mark_output(fc2.get_output(0))
```
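
To see which tensor shapes flow through the network above, the per-layer output sizes can be worked out by hand. A small sketch, assuming `INPUT_SHAPE` is `(1, 28, 28)` (the MNIST shape registered with the parser later in this note) and that no padding is applied, since the layer calls above set none. The helper name `conv_out` is illustrative only:

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution or pooling window along one axis."""
    return (size - kernel) // stride + 1

# Assumption: INPUT_SHAPE is (1, 28, 28), i.e. one-channel 28x28 images.
channels, h, w = 1, 28, 28

steps = [
    ("conv1", 20, 5, 1),    # (name, output channels, window, stride)
    ("pool1", None, 2, 2),  # pooling keeps the channel count
    ("conv2", 50, 5, 1),
    ("pool2", None, 2, 2),
]
for name, out_channels, window, stride in steps:
    channels = out_channels or channels
    h, w = conv_out(h, window, stride), conv_out(w, window, stride)
    print(name, (channels, h, w))

fc1_inputs = channels * h * w
print("fc1 input features:", fc1_inputs)  # 800
```

Under these assumptions `pool2` produces a `(50, 4, 4)` tensor, so `fc1` flattens 800 values into its 500 outputs.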

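#### **3.2.2** Importing A Model Using A Parser In Python

**Main steps:**

1. **Create the TRT Builder and Network**
2. **Create the TRT Parser for the specific format**
3. **Use the Parser to read the model and populate the Network**

> The Builder must be created before the Network because it acts as the factory for the Network. Different parsers use different mechanisms to mark the network outputs. For more information, see each parser's API documentation: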

* Caffe Parser:
* UFF Parser:  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Uff/pyUff.html>
* ONNX Parser:

**3.2.3 Caffe Parser (pass)**

**3.2.4 Tensorflow Parser**

The following steps show how to import a TensorFlow model using the UFF Parser and the Python API. The sample code is located at `<site-packages>/tensorrt/samples/python/end_to_end_tensorflow_mnist`.

For more details on the sample code, see: <https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#end_to_end_tensorflow_mnist>

**Step 1. Import TRT**

```python
import tensorrt as trt
```

**Step 2. Create a frozen TensorFlow model**

To export a UFF file, the TensorFlow model must first be frozen into a `.pb` file.

References:

1. <https://www.tensorflow.org/guide/extend/model_files#freezing>
2. <https://medium.com/@hamedmp/exporting-trained-tensorflow-models-to-c-the-right-way-cf24b609d183>

**Step 3. Use the UFF converter to convert the TensorFlow model to a UFF file**

```shell
convert-to-uff frozen_inference_graph.pb
```

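If `convert-to-uff` is not available as a global command, you can call the script in the UFF package's bin directory instead:

```shell
~/.local/lib/python2.7/site-packages/uff/bin/convert_to_uff.py
```

To find the directory where the UFF module is installed, run:

```shell
python -c "import uff; print(uff.__path__)"
```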
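
The same lookup can be done from a script without importing the package. A minimal sketch using the standard library's `importlib.util.find_spec` (Python 3 only; the helper name `module_dir` is illustrative):

```python
import importlib.util
import os

def module_dir(name):
    """Return the install directory of a top-level module, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.submodule_search_locations:             # a package: it has a directory
        return list(spec.submodule_search_locations)[0]
    if spec.origin and os.path.isabs(spec.origin):  # a single-file module
        return os.path.dirname(spec.origin)
    return None  # built-in or frozen module with no directory on disk

print(module_dir("uff"))  # a path if the UFF package is installed, otherwise None
```

Returning `None` instead of raising makes it easy to print a clear "UFF is not installed" message before attempting the conversion.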

Alternatively, use the UFF Parser API to convert a TensorFlow GraphDef:

[UFF Parser API](https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Uff/pyUff.html)

**Step 4. Define the relevant paths**

Change the following path to point to where your UFF model is stored:

```python
model_file = '/data/mnist/mnist.uff'
```

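**Step 5. Create the Builder, Network, and Parser:**

```python
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    # Tell the parser which graph nodes are the input (with its CHW shape) and the output.
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    # Parse inside the with block, while builder, network and parser are still alive.
    parser.parse(model_file, network)
```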

**3.2.5 ONNX Parser (pass)**

### 3.3 Building An Engine In Python

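> One of the Builder's functions is to search its catalog of CUDA kernels for the fastest implementation available, so the Engine must be built on the same GPU that it will later run on.
>
> The Builder has many adjustable properties. For example, you can set the precision at which the network runs, or autotuning parameters such as how many times each kernel is timed when choosing the fastest one. You can also query the Builder to find out which reduced-precision types the hardware natively supports.

The two most important properties are the Maximum Batch Size and the Maximum Workspace Size.

* The Maximum Batch Size specifies the batch size that TRT optimizes for. At runtime, a smaller batch size may be chosen.
* Layer computations often need temporary workspace. The Workspace Size limits the maximum workspace that any single layer can use.
* If insufficient scratch space is provided, TensorRT may not be able to find an implementation for a given layer.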
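
Both properties are plain integers, so it helps to make the units explicit. A minimal sketch, assuming hypothetical helpers `mib`, `gib` and `batches` (none of them are TensorRT functions), that expresses a workspace size in bytes and splits inputs so that no runtime batch exceeds the maximum batch size:

```python
def mib(n):
    """n mebibytes expressed in bytes."""
    return n << 20

def gib(n):
    """n gibibytes expressed in bytes."""
    return n << 30

def batches(items, max_batch_size):
    """Yield consecutive slices of items, none longer than max_batch_size."""
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]

print(mib(1))  # 1048576 bytes, the same value as 1 << 20
print(gib(1))  # 1073741824 bytes
print([len(b) for b in batches(list(range(10)), max_batch_size=4)])  # [4, 4, 2]
```

A workspace of `1 << 20` is only 1 MiB, which can be too small for some layers; a value such as `gib(1)` is a common starting point when GPU memory allows.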

For more examples of building an Engine, see <https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#introductory_parser_samples_resnet50>

1. Build the Engine using the Builder object:

```python
# Assumes builder and network are still open from the with block in section 3.2,
# and that max_batch_size is chosen by the caller.
builder.max_batch_size = max_batch_size
# Memory available to the builder when building an optimized engine; should generally be set as high as possible.
builder.max_workspace_size = 1 << 20
with builder.build_cuda_engine(network) as engine:
    pass  # Do inference here.
```

2. Perform inference:

> See <https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#perform_inference_python>

### 3.4 Serializing A Model In Python

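Serializing means converting the Engine into a format that can be stored and used later for inference.

To run inference, you simply deserialize the Engine. Serializing and deserializing are optional, but building an Engine takes time, so to avoid rebuilding it on every run you can serialize it once after it is built and deserialize it whenever you need to run inference. From that point on, you can use the Engine directly for inference.

Note: A serialized Engine is not portable across GPUs or TRT versions. An Engine is specific to the exact GPU model it was built on.

1. Serialize the model to a modelStream: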

```python
serialized_engine = engine.serialize()
```

2. Deserialize the modelStream to run inference. Deserialization requires a runtime object:

```python
with trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(serialized_engine)
```

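`deserialize_cuda_engine` also takes an optional final argument (omitted above) that is used to specify custom layers. For more information, see <https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#extending>

You can also save a serialized Engine to a file and read it back:

1. Serialize the Engine and write it to a file: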

```python
with open("sample.engine", "wb") as f:
    f.write(engine.serialize())
```

2. Read the Engine file and deserialize it:

```python
with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
```
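
The write and read steps are often combined into a "load if cached, otherwise build" pattern. A minimal sketch of that idea in plain Python, assuming two caller-supplied callables: `build_fn` and `deserialize_fn` are hypothetical names, and with TensorRT they would wrap the build step from section 3.3 and `runtime.deserialize_cuda_engine` respectively:

```python
import os

def load_or_build(path, build_fn, deserialize_fn):
    """Return an engine, reusing the serialized copy at `path` when one exists.

    build_fn() must return (engine, serialized_bytes).
    deserialize_fn(data) must turn serialized bytes back into an engine.
    Both are hypothetical callables supplied by the caller.
    """
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return deserialize_fn(f.read())
    engine, data = build_fn()
    with open(path, "wb") as f:
        f.write(data)
    return engine
```

Because a serialized Engine is tied to the GPU and TRT version it was built with, the cached file should be deleted or renamed whenever either of them changes.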

### 3.5 Performing Inference In Python

The following steps show how to run inference with an Engine.

1. Allocate host and device buffers for the inputs and outputs:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 (initializes a CUDA context on import)
import pycuda.driver as cuda

# Assumes trt and engine come from the previous sections.
# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=np.float32)
h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()
```
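
The size of each buffer follows from the binding shape: the element count is the product of the dimensions, and the byte count is that product times the element size. A small sketch, assuming the `(1, 28, 28)` input shape used with the UFF parser earlier and 4-byte `float32` elements (the helper name `buffer_nbytes` is illustrative; `math.prod` needs Python 3.8 or later):

```python
from math import prod

def buffer_nbytes(shape, itemsize=4):
    """Bytes needed for a dense buffer: element count times bytes per element."""
    return prod(shape) * itemsize

input_shape = (1, 28, 28)  # assumption: the shape registered with the UFF parser
print(prod(input_shape))           # 784 elements
print(buffer_nbytes(input_shape))  # 3136 bytes for float32
```

When the batch size is larger than one, multiply these sizes by the batch size before allocating.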

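2. Create some space to store the intermediate activation values. The Engine holds the network definition and the trained parameters, so this additional space is needed. It is held in an execution context: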

```python
with engine.create_execution_context() as context:
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream so the copy has finished before h_output is read.
    stream.synchronize()
# h_output now holds the host-side predictions; return it if this code is wrapped in a function.
```

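An Engine can have multiple execution contexts, which allows one set of weights to be used for multiple overlapping inference tasks.

For example, you can process images in parallel CUDA streams using one Engine and one Context per stream. Each Context is created on the same GPU as the Engine.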
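
The ownership pattern described above (shared weights, private per-context state) can be illustrated without a GPU. The sketch below uses stand-in classes, `StubEngine` and `StubContext`, which are hypothetical and only mimic the shape of the TensorRT objects; Python threads play the role of the parallel CUDA streams:

```python
import threading

class StubEngine:
    """Stand-in for a built engine; it holds the weights shared by every context."""
    def __init__(self, weight):
        self.weight = weight

    def create_execution_context(self):
        return StubContext(self)

class StubContext:
    """Stand-in for an execution context; it owns per-inference scratch state."""
    def __init__(self, engine):
        self.engine = engine
        self.scratch = None

    def execute(self, x):
        self.scratch = x * self.engine.weight  # private intermediate value
        return self.scratch + 1

engine = StubEngine(weight=2)
results = {}

def worker(worker_id, inputs):
    context = engine.create_execution_context()  # one context per worker
    results[worker_id] = [context.execute(x) for x in inputs]

threads = [threading.Thread(target=worker, args=(i, range(i, i + 3))) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0])  # [1, 3, 5]
```

Because each worker writes only to its own context, the workers never overwrite each other's intermediate values even though they all read the same weights.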

## Chapter 4-Extending TensorRT with Custom Layers

## Chapter 5-Working With Mixed Precision

## Chapter 6-Working With DLA

## Chapter 7-Deploying A TensorRT Optimized Model

## Chapter 8-Working with Deep Learning Frameworks

### 8.1 Working With Tensorflow

### 8.2 Working With PyTorch and Other Frameworks (pass)

## Chapter 9-Troubleshooting

## Appendix

