# Tensor to TensorRT

## Chapter 0-Environment

CPU: x86; OS: Ubuntu 16.04; TensorFlow: 1.12; CUDA: 10.0; cuDNN: (unspecified)
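
As a quick sanity check of this stack (assuming both packages import cleanly in the active environment):

```python
# Print the installed TensorRT and TensorFlow versions to confirm the environment.
import tensorrt as trt
import tensorflow as tf

print("TensorRT:", trt.__version__)
print("TensorFlow:", tf.__version__)
```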

## Chapter 1-Introduction

### Benchmarks for Evaluating a Neural Network

1. Throughput
2. Efficiency
3. Latency
4. Accuracy
5. Memory Usage

### TensorRT Optimization Workflow

![](https://2880279229-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LWxqLqSkI9LHgmHkLeg%2F-LeGuY2HpW1kGW0KpucE%2F-LeGwn3RVJsOr-SJgNs1%2Ffit.png?alt=media\&token=cfe76f46-f762-4b12-a836-cd5edcc8ba9b)

### TensorRT Core Library

#### Network Definition

Provides the ability to define each layer of the network, including the input/output tensors, layers TensorRT supports natively, and unsupported layers, which can be implemented through the Plugin interface.

#### Builder

The Builder creates an optimized engine from the Network Definition. It also lets you set the maximum batch size, the maximum workspace size, the minimum acceptable precision level, the number of timing iterations for autotuning, and an interface for quantizing to 8-bit precision.

#### Engine

The Engine interface lets the application execute inference. It supports synchronous and asynchronous execution, profiling, and enumerating and querying the bindings for the engine's inputs and outputs. A single engine can have multiple execution contexts, allowing one set of trained parameters to be used for executing multiple batches simultaneously.

#### Parser (e.g. Caffe, UFF, ONNX)

Parsers import a trained network from a given framework format. The UFF parser, for example, can parse a network in UFF format and also provides the ability to register a plugin factory and pass field attributes for custom layers.
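
As a rough preview of how these four components fit together (anticipating the Python API walkthrough in Chapter 3), here is a minimal sketch of the UFF path; the input/output names and the `model.uff` path are placeholders:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:                             # Network Definition + Parser
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    parser.parse("model.uff", network)                      # populate the network
    with builder.build_cuda_engine(network) as engine:      # Builder -> Engine
        with engine.create_execution_context() as context:  # run inference here
            pass
```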

## Chapter 2-Using the TRT C++ API (skipped)

## Chapter 3-Using the TRT Python API

### 3.1 Import TRT

```python
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
```

### 3.2 Creating A Network Definition in Python

You can choose one of the following approaches:

1. **Create a network definition from scratch with the TensorRT API**
2. **Import a model using a parser (Caffe, TensorFlow/UFF, ONNX)**

#### 3.2.1 (pass) Creating A Network Definition From Scratch Using The Python API

When creating a network from scratch, the first step is to create the Builder object, which is used to define the engine and to create the inference layers (in other words, you assemble the network by hand).

```python
# Create the builder and network. INPUT_NAME, INPUT_SHAPE, OUTPUT_SIZE, OUTPUT_NAME,
# and the pretrained `weights` dict are assumed to be defined elsewhere.
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
	# Configure the network layers based on the weights provided. In this case, the weights are imported from a PyTorch model.
	# Add an input layer. The name is a string, dtype is a TensorRT dtype, and the shape can be provided as either a list or tuple.
	input_tensor = network.add_input(name=INPUT_NAME, dtype=trt.float32, shape=INPUT_SHAPE)

	# Add a convolution layer
	conv1_w = weights['conv1.weight'].numpy()
	conv1_b = weights['conv1.bias'].numpy()
	conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
	conv1.stride = (1, 1)

	pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
	pool1.stride = (2, 2)
	conv2_w = weights['conv2.weight'].numpy()
	conv2_b = weights['conv2.bias'].numpy()
	conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
	conv2.stride = (1, 1)

	pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
	pool2.stride = (2, 2)

	fc1_w = weights['fc1.weight'].numpy()
	fc1_b = weights['fc1.bias'].numpy()
	fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)

	relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)

	fc2_w = weights['fc2.weight'].numpy()
	fc2_b = weights['fc2.bias'].numpy()
	fc2 = network.add_fully_connected(relu1.get_output(0), OUTPUT_SIZE, fc2_w, fc2_b)

	fc2.get_output(0).name = OUTPUT_NAME
	network.mark_output(fc2.get_output(0))
```

#### **3.2.2** Importing A Model Using A Parser In Python

**Main steps:**

1. **Create the TRT Builder and Network**
2. **Create a TRT Parser for the model's format**
3. **Use the Parser to read the model and populate the Network**

> To build a Network you must first create a Builder, because the Builder acts as the factory that produces the Network. Different parsers also have different mechanisms for marking the network outputs. For more information, see each parser's API documentation (a minimal sketch of the three steps follows the list below):

* Caffe Parser:
* UFF Parser: <https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Uff/pyUff.html>
* ONNX Parser:
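
As a minimal sketch of these three steps, the snippet below uses the ONNX parser as the example format; the `model.onnx` path is a placeholder, not part of the original sample.

```python
# Step 1: builder + network; Step 2: format-specific parser; Step 3: parse and populate.
with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open('model.onnx', 'rb') as model:
        if not parser.parse(model.read()):  # parse() fills `network` in place
            for i in range(parser.num_errors):
                print(parser.get_error(i))
```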

**3.2.3 Caffe Parser (pass)**

**3.2.4 TensorFlow Parser**

The following steps show how to import a TensorFlow model using the UFF parser and the Python API. The sample code can be found at `<site-packages>/tensorrt/samples/python/end_to_end_tensorflow_mnist`,

or see the following page for more details about the sample: <https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#end_to_end_tensorflow_mnist>

**Step 1. Import TRT**

```python
import tensorrt as trt
```

**Step 2. Create a frozen TensorFlow model**

Before it can be exported to UFF, the TensorFlow model must be frozen into a `.pb` file (a minimal sketch follows the reference links below).

References:

1. <https://www.tensorflow.org/guide/extend/model_files#freezing>
2. <https://medium.com/@hamedmp/exporting-trained-tensorflow-models-to-c-the-right-way-cf24b609d183>
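
A minimal freezing sketch for TensorFlow 1.x, assuming an active `tf.Session` named `sess`; the output node name `fc2/Relu` is borrowed from the parser example later in this section:

```python
import tensorflow as tf

# Fold the trained variables into constants so the graph is self-contained.
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph_def, output_node_names=["fc2/Relu"])

# Write the frozen graph to the .pb file that convert-to-uff expects.
with open("frozen_inference_graph.pb", "wb") as f:
    f.write(frozen_graph_def.SerializeToString())
```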

**Step 3. Convert the frozen TensorFlow model to a UFF file with the UFF converter**

```bash
convert-to-uff frozen_inference_graph.pb
```

If `convert-to-uff` was not installed as a global command, you can invoke the script under the bin directory directly:

```bash
~/.local/lib/python2.7/site-packages/uff/bin/convert_to_uff.py
```

To find the directory where the UFF module is installed, run:

```bash
python -c "import uff; print(uff.__path__)"
```

Alternatively, you can convert the TensorFlow GraphDef programmatically with the UFF Parser API:

UFF Parser API: <https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Uff/pyUff.html>
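
A sketch of that programmatic route, reusing the frozen graph and output node from the previous steps; `model.uff` is a placeholder output path:

```python
import uff

# Convert the frozen .pb graph directly, writing the serialized UFF model to disk.
uff_model = uff.from_tensorflow_frozen_model(
    "frozen_inference_graph.pb",
    output_nodes=["fc2/Relu"],
    output_filename="model.uff")
```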

**Step 4. Define the relevant paths**

Change the following path so that it points to where the UFF model is stored:

```python
model_file = '/data/mnist/mnist.uff'
```

**Step 5. Create the Builder, Network, and Parser:**

```python
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    parser.parse(model_file, network)
```

**3.2.5 ONNX Parser (pass)**

### 3.3 Building An Engine In Python

> One of the Builder's jobs is to search the host for CUDA kernels to use for acceleration, so the engine must be built with, and optimized for, the same GPU it will later run on.
>
> The Builder exposes many tunable properties: you can set the precision of particular layers, control the kernel autotuning used to maximize efficiency, and query the Builder for the mixed-precision types the hardware natively supports.

Its two most important properties are the maximum batch size and the maximum workspace size:

* The maximum batch size specifies the batch size that TensorRT will optimize for; at runtime, a smaller batch size may be chosen.
* Layers often need temporary workspace while computing; the maximum workspace size caps how much scratch memory any layer in the network may use.
* If insufficient scratch is provided, TensorRT may not be able to find an implementation for a given layer.

For more examples of building an engine, see <https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#introductory_parser_samples_resnet50>

1. Build the engine using the Builder object:

```python
with trt.Builder(TRT_LOGGER) as builder:
    builder.max_batch_size = max_batch_size
    # max_workspace_size determines the amount of memory available to the builder when
    # building an optimized engine and should generally be set as high as possible.
    builder.max_workspace_size = 1 << 20
    with builder.build_cuda_engine(network) as engine:
        pass  # Do inference here.
```

2\. Perform inference:

> See <https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#perform_inference_python>

### 3.4 Serializing A Model In Python

Serialization means converting the engine into a format that can be stored, so that it can be used later at the inference stage.

To use the engine for inference, you simply deserialize it. Serializing and deserializing are optional, but because rebuilding the engine from the network definition every time is costly, you can serialize the engine once and then deserialize it to restore the engine whenever it is needed. From this point on, you can use the restored engine directly for inference.

Note: a serialized engine is not portable across GPUs or TensorRT versions. An engine only works on the exact GPU model of the machine it was built on.

1. Serialize the model to a modelStream:

```python
serialized_engine = engine.serialize()
```

2\. Deserialize the modelStream to perform inference. Deserializing requires a Runtime object:

```python
with trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(serialized_engine)
```

The last parameter is used to specify custom layers; for more information, see <https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#extending>

You can also save the serialized engine to a file and read it back later:

1. Serialize the engine and write it to a file:

```python
with open("sample.engine", "wb") as f:
    f.write(engine.serialize())
```

2\. Read the engine file and deserialize it:

```python
with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
```

### 3.5 Performing Inference In Python

The following steps show how to perform inference with an engine.

1. Allocate some host and device buffers for the inputs and outputs:

```python
import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda

# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(engine.get_binding_shape(0).volume(), dtype=np.float32)
h_output = cuda.pagelocked_empty(engine.get_binding_shape(1).volume(), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()
```

2\. Create some space to store intermediate activations. Because the engine holds the network definition and trained parameters, additional space is needed; it is held in an execution context:

```python
with engine.create_execution_context() as context:
	# Transfer input data to the GPU.
	cuda.memcpy_htod_async(d_input, h_input, stream)
	# Run inference.
	context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
	# Transfer predictions back from the GPU.
	cuda.memcpy_dtoh_async(h_output, d_output, stream)
	# Synchronize the stream.
	stream.synchronize()
	# h_output now holds the result; return it here if this code runs inside a function.
```

An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks.

For example, you can process images in parallel CUDA streams using one engine and one context per stream; each context is created on the same GPU as its engine (a sketch follows).
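
A minimal sketch of this pattern, assuming two sets of device buffers (`d_in_a`/`d_out_a` and `d_in_b`/`d_out_b`) were allocated as in step 1 above; those buffer names are placeholders:

```python
# Two contexts share one engine (one copy of the weights), each driving its own CUDA stream.
stream_a, stream_b = cuda.Stream(), cuda.Stream()
with engine.create_execution_context() as ctx_a, \
     engine.create_execution_context() as ctx_b:
    ctx_a.execute_async(bindings=[int(d_in_a), int(d_out_a)], stream_handle=stream_a.handle)
    ctx_b.execute_async(bindings=[int(d_in_b), int(d_out_b)], stream_handle=stream_b.handle)
    stream_a.synchronize()
    stream_b.synchronize()
```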

## Chapter 4-Extending TensorRT with Custom Layers

## Chapter 5-Working With Mixed Precision

## Chapter 6-Working With DLA

## Chapter 7-Deploying A TensorRT Optimized Model

## Chapter 8-Working with Deep Learning Frameworks

### 8.1 Working With TensorFlow

### 8.2 Working With PyTorch and Other Frameworks (pass)

## Chapter 9-Troubleshooting

## Appendix
