# Tensor to TensorRT

## Chapter 0-Environment

CPU: x86; OS: Ubuntu 16.04; TensorFlow: 1.12; CUDA: 10.0; cuDNN: (unspecified)
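
As a quick sanity check of this stack (assuming both packages import cleanly in the active environment):

```python
# Print the installed TensorRT and TensorFlow versions to confirm the environment.
import tensorrt as trt
import tensorflow as tf

print("TensorRT:", trt.__version__)
print("TensorFlow:", tf.__version__)
```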

## Chapter 1-Introduction

### Benchmarks for Evaluating a Neural Network

1. Throughput
2. Efficiency
3. Latency
4. Accuracy
5. Memory Usage

### TensorRT Optimization Workflow

![](https://2880279229-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LWxqLqSkI9LHgmHkLeg%2F-LeGuY2HpW1kGW0KpucE%2F-LeGwn3RVJsOr-SJgNs1%2Ffit.png?alt=media\&token=cfe76f46-f762-4b12-a836-cd5edcc8ba9b)

### TensorRT Core Library

#### Network Definition

Provides the ability to define each layer of the network, including the input/output tensors, layers TensorRT supports natively, and unsupported layers, which can be implemented through the Plugin interface.

#### Builder

The Builder creates an optimized engine from the Network Definition. It also lets you set the maximum batch size, the maximum workspace size, the minimum acceptable precision level, the number of timing iterations for autotuning, and an interface for quantizing to 8-bit precision.

#### Engine

The Engine interface lets the application execute inference. It supports synchronous and asynchronous execution, profiling, and enumerating and querying the bindings for the engine's inputs and outputs. A single engine can have multiple execution contexts, allowing one set of trained parameters to be used for executing multiple batches simultaneously.

#### Parser (e.g. Caffe, UFF, ONNX)

Parsers import a trained network from a given framework format. The UFF parser, for example, can parse a network in UFF format and also provides the ability to register a plugin factory and pass field attributes for custom layers.
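
As a rough preview of how these four components fit together (anticipating the Python API walkthrough in Chapter 3), here is a minimal sketch of the UFF path; the input/output names and the `model.uff` path are placeholders:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:                             # Network Definition + Parser
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    parser.parse("model.uff", network)                      # populate the network
    with builder.build_cuda_engine(network) as engine:      # Builder -> Engine
        with engine.create_execution_context() as context:  # run inference here
            pass
```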

## Chapter 2-Using the TRT C++ API (skipped)

## Chapter 3-Using the TRT Python API

### 3.1 Import TRT

```python
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
```

### 3.2 Creating A Network Definition in Python

You can choose one of the following approaches:

1. **Create a network definition from scratch with the TensorRT API**
2. **Import a model using a parser (Caffe, TensorFlow/UFF, ONNX)**

#### 3.2.1 (pass) Creating A Network Definition From Scratch Using The Python API

When creating a network from scratch, the first step is to create the Builder object, which is used to define the engine and to create the inference layers (in other words, you assemble the network by hand).

```python
# Create the builder and network. INPUT_NAME, INPUT_SHAPE, OUTPUT_SIZE, OUTPUT_NAME,
# and the pretrained `weights` dict are assumed to be defined elsewhere.
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
	# Configure the network layers based on the weights provided. In this case, the weights are imported from a PyTorch model.
	# Add an input layer. The name is a string, dtype is a TensorRT dtype, and the shape can be provided as either a list or tuple.
	input_tensor = network.add_input(name=INPUT_NAME, dtype=trt.float32, shape=INPUT_SHAPE)

	# Add a convolution layer
	conv1_w = weights['conv1.weight'].numpy()
	conv1_b = weights['conv1.bias'].numpy()
	conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
	conv1.stride = (1, 1)

	pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
	pool1.stride = (2, 2)
	conv2_w = weights['conv2.weight'].numpy()
	conv2_b = weights['conv2.bias'].numpy()
	conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
	conv2.stride = (1, 1)

	pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
	pool2.stride = (2, 2)

	fc1_w = weights['fc1.weight'].numpy()
	fc1_b = weights['fc1.bias'].numpy()
	fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)

	relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)

	fc2_w = weights['fc2.weight'].numpy()
	fc2_b = weights['fc2.bias'].numpy()
	fc2 = network.add_fully_connected(relu1.get_output(0), OUTPUT_SIZE, fc2_w, fc2_b)

	fc2.get_output(0).name = OUTPUT_NAME
	network.mark_output(fc2.get_output(0))
```

#### **3.2.2** Importing A Model Using A Parser In Python

**Main steps:**

1. **Create the TRT Builder and Network**
2. **Create a TRT Parser for the model's format**
3. **Use the Parser to read the model and populate the Network**

> To build a Network you must first create a Builder, because the Builder acts as the factory that produces the Network. Different parsers also have different mechanisms for marking the network outputs. For more information, see each parser's API documentation (a minimal sketch of the three steps follows the list below):

* Caffe Parser:
* UFF Parser: <https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Uff/pyUff.html>
* ONNX Parser:
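
As a minimal sketch of these three steps, the snippet below uses the ONNX parser as the example format; the `model.onnx` path is a placeholder, not part of the original sample.

```python
# Step 1: builder + network; Step 2: format-specific parser; Step 3: parse and populate.
with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open('model.onnx', 'rb') as model:
        if not parser.parse(model.read()):  # parse() fills `network` in place
            for i in range(parser.num_errors):
                print(parser.get_error(i))
```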

**3.2.3 Caffe Parser (pass)**

**3.2.4 TensorFlow Parser**

The following steps show how to import a TensorFlow model using the UFF parser and the Python API. The sample code can be found at `<site-packages>/tensorrt/samples/python/end_to_end_tensorflow_mnist`,

or see the following page for more details about the sample: <https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#end_to_end_tensorflow_mnist>

**Step 1. Import TRT**

```python
import tensorrt as trt
```

**Step 2. Create a frozen TensorFlow model**

Before it can be exported to UFF, the TensorFlow model must be frozen into a `.pb` file (a minimal sketch follows the reference links below).

References:

1. <https://www.tensorflow.org/guide/extend/model_files#freezing>
2. <https://medium.com/@hamedmp/exporting-trained-tensorflow-models-to-c-the-right-way-cf24b609d183>
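
A minimal freezing sketch for TensorFlow 1.x, assuming an active `tf.Session` named `sess`; the output node name `fc2/Relu` is borrowed from the parser example later in this section:

```python
import tensorflow as tf

# Fold the trained variables into constants so the graph is self-contained.
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph_def, output_node_names=["fc2/Relu"])

# Write the frozen graph to the .pb file that convert-to-uff expects.
with open("frozen_inference_graph.pb", "wb") as f:
    f.write(frozen_graph_def.SerializeToString())
```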

**Step 3. Convert the frozen TensorFlow model to a UFF file with the UFF converter**

```bash
convert-to-uff frozen_inference_graph.pb
```

If `convert-to-uff` was not installed as a global command, you can invoke the script under the bin directory directly:

```bash
~/.local/lib/python2.7/site-packages/uff/bin/convert_to_uff.py
```

To find the directory where the UFF module is installed, run:

```bash
python -c "import uff; print(uff.__path__)"
```

Alternatively, you can convert the TensorFlow GraphDef programmatically with the UFF Parser API:

UFF Parser API: <https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Uff/pyUff.html>
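
A sketch of that programmatic route, reusing the frozen graph and output node from the previous steps; `model.uff` is a placeholder output path:

```python
import uff

# Convert the frozen .pb graph directly, writing the serialized UFF model to disk.
uff_model = uff.from_tensorflow_frozen_model(
    "frozen_inference_graph.pb",
    output_nodes=["fc2/Relu"],
    output_filename="model.uff")
```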

**Step 4. Define the relevant paths**

Change the following path so that it points to where the UFF model is stored:

```python
model_file = '/data/mnist/mnist.uff'
```

**Step 5. Create the Builder, Network, and Parser:**

```python
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    parser.parse(model_file, network)
```

**3.2.5 ONNX Parser (pass)**

### 3.3 Building An Engine In Python

> One of the Builder's jobs is to search the host for CUDA kernels to use for acceleration, so the engine must be built with, and optimized for, the same GPU it will later run on.
>
> The Builder exposes many tunable properties: you can set the precision of particular layers, control the kernel autotuning used to maximize efficiency, and query the Builder for the mixed-precision types the hardware natively supports.

Its two most important properties are the maximum batch size and the maximum workspace size:

* The maximum batch size specifies the batch size that TensorRT will optimize for; at runtime, a smaller batch size may be chosen.
* Layers often need temporary workspace while computing; the maximum workspace size caps how much scratch memory any layer in the network may use.
* If insufficient scratch is provided, TensorRT may not be able to find an implementation for a given layer.

For more examples of building an engine, see <https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#introductory_parser_samples_resnet50>

1. Build the engine using the Builder object:

```python
with trt.Builder(TRT_LOGGER) as builder:
    builder.max_batch_size = max_batch_size
    # max_workspace_size determines the amount of memory available to the builder when
    # building an optimized engine and should generally be set as high as possible.
    builder.max_workspace_size = 1 << 20
    with builder.build_cuda_engine(network) as engine:
        pass  # Do inference here.
```

2\. Perform inference:

> See <https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#perform_inference_python>

### 3.4 Serializing A Model In Python

Serialization means converting the engine into a format that can be stored, so that it can be used later at the inference stage.

To use the engine for inference, you simply deserialize it. Serializing and deserializing are optional, but because rebuilding the engine from the network definition every time is costly, you can serialize the engine once and then deserialize it to restore the engine whenever it is needed. From this point on, you can use the restored engine directly for inference.

Note: a serialized engine is not portable across GPUs or TensorRT versions. An engine only works on the exact GPU model of the machine it was built on.

1. Serialize the model to a modelStream:

```python
serialized_engine = engine.serialize()
```

2\. Deserialize the modelStream to perform inference. Deserializing requires a Runtime object:

```python
with trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(serialized_engine)
```

The last parameter is used to specify custom layers; for more information, see <https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#extending>

You can also save the serialized engine to a file and read it back later:

1. Serialize the engine and write it to a file:

```python
with open("sample.engine", "wb") as f:
    f.write(engine.serialize())
```

2\. Read the engine file and deserialize it:

```python
with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
```

### 3.5 Performing Inference In Python

The following steps show how to perform inference with an engine.

1. Allocate some host and device buffers for the inputs and outputs:

```python
import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda

# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(engine.get_binding_shape(0).volume(), dtype=np.float32)
h_output = cuda.pagelocked_empty(engine.get_binding_shape(1).volume(), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()
```

2\. Create some space to store intermediate activations. Because the engine holds the network definition and trained parameters, additional space is needed; it is held in an execution context:

```python
with engine.create_execution_context() as context:
	# Transfer input data to the GPU.
	cuda.memcpy_htod_async(d_input, h_input, stream)
	# Run inference.
	context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
	# Transfer predictions back from the GPU.
	cuda.memcpy_dtoh_async(h_output, d_output, stream)
	# Synchronize the stream.
	stream.synchronize()
	# h_output now holds the result; return it here if this code runs inside a function.
```

An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks.

For example, you can process images in parallel CUDA streams using one engine and one context per stream; each context is created on the same GPU as its engine (a sketch follows).
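
A minimal sketch of this pattern, assuming two sets of device buffers (`d_in_a`/`d_out_a` and `d_in_b`/`d_out_b`) were allocated as in step 1 above; those buffer names are placeholders:

```python
# Two contexts share one engine (one copy of the weights), each driving its own CUDA stream.
stream_a, stream_b = cuda.Stream(), cuda.Stream()
with engine.create_execution_context() as ctx_a, \
     engine.create_execution_context() as ctx_b:
    ctx_a.execute_async(bindings=[int(d_in_a), int(d_out_a)], stream_handle=stream_a.handle)
    ctx_b.execute_async(bindings=[int(d_in_b), int(d_out_b)], stream_handle=stream_b.handle)
    stream_a.synchronize()
    stream_b.synchronize()
```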

## Chapter 4-Extending TensorRT with Custom Layers

## Chapter 5-Working With Mixed Precision

## Chapter 6-Working With DLA

## Chapter 7-Deploying A TensorRT Optimized Model

## Chapter 8-Working with Deep Learning Frameworks

### 8.1 Working With TensorFlow

### 8.2 Working With PyTorch and Other Frameworks (pass)

## Chapter 9-Troubleshooting

## Appendix
