Tensor to TensorRT

Document from the [NVIDIA TensorRT Developer Guide](https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/)


Chapter 0-Environment

CPU: x86; OS: Ubuntu 16.04; TensorFlow: 1.12; CUDA: 10.0; cuDNN: (not recorded)

Chapter 1-Introduction

Benchmarks for evaluating a neural network

  1. Throughput

  2. Efficiency

  3. Latency

  4. Accuracy

  5. Memory Usage
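As a rough illustration of the throughput and latency metrics (not from the guide), here is a minimal timing sketch; the infer callable, batch_size, and run counts are hypothetical placeholders for your own inference function:

import time

def benchmark(infer, batch_size, n_runs=100):
    # `infer()` is assumed to run one forward pass on a batch of `batch_size` inputs.
    # Warm-up runs so one-time initialization does not skew the timing.
    for _ in range(10):
        infer()

    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start

    latency_ms = elapsed / n_runs * 1000.0      # average time per batch
    throughput = n_runs * batch_size / elapsed  # samples per second
    return latency_ms, throughput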

TensorRT Optimization Workflow

TensorRT Core Library

Network Definition

The Network Definition interface provides the ability to specify each layer of the network, including the input and output tensors. It distinguishes layers natively supported by TensorRT from unsupported ones; unsupported layers can still be added through the Plugin interface.

Builder

The Builder creates an optimized engine from a Network Definition. It lets you set the maximum batch size, the maximum workspace size, the minimum acceptable precision level, the number of iterations used when auto-tuning kernels, and the calibration needed to quantize to 8-bit precision.
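A minimal sketch of typical Builder settings, assuming the pre-7.x TensorRT Python API used elsewhere in these notes (treat the exact attribute names as an assumption for your installed version):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder:
    builder.max_batch_size = 8            # largest batch size the engine is optimized for
    builder.max_workspace_size = 1 << 30  # scratch memory available while building the engine

    # Query what the hardware natively supports before requesting mixed precision.
    if builder.platform_has_fast_fp16:
        builder.fp16_mode = True          # allow FP16 kernels where they are faster
    # INT8 mode additionally requires a calibrator and is omitted here.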

Engine

The Engine interface lets the application run inference. It supports synchronous and asynchronous execution, profiling, and enumerating and querying the bindings for the engine's inputs and outputs. A single engine can have multiple execution contexts, allowing one set of trained weights to be used by several batches executing concurrently.
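A small sketch of enumerating an engine's input/output bindings with the same API version; engine is assumed to be an already-built engine from one of the later snippets:

# Inspect every binding (input or output tensor) of a built engine.
for i in range(engine.num_bindings):
    name = engine.get_binding_name(i)
    shape = engine.get_binding_shape(i)
    kind = "input" if engine.binding_is_input(i) else "output"
    print("binding %d: %s, shape=%s, %s" % (i, name, shape, kind))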

Parser (e.g. Caffe, UFF, ONNX)

A parser reads a trained model in a given format and populates a TensorRT network definition. The UFF parser, for example, parses a network in UFF format and also provides the ability to register a plugin factory and pass field attributes for custom layers.

Chapter 2-Using the TRT C++ API (skipped)

Chapter 3-Using the TRT Python API

3.1 Import TRT

import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

3.2 Creating A Network Definition in Python

You can create the network definition in one of two ways:

  1. Create the network definition from scratch with the TRT Python API

  2. Import a model using a parser (Caffe, TensorFlow/UFF, ONNX)

3.2.1 (pass) Creating A Network Definition From Scratch Using The Python API

When creating a network from scratch, the first step is to define the engine and create the Builder object used for inference (essentially assembling the NN by hand, layer by layer).

# Create the builder and network
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
	# Configure the network layers based on the weights provided. In this case, the weights are imported from a pytorch model. 
	# Add an input layer. The name is a string, dtype is a TensorRT dtype, and the shape can be provided as either a list or tuple.
	input_tensor = network.add_input(name=INPUT_NAME, dtype=trt.float32, shape=INPUT_SHAPE)

	# Add a convolution layer
	conv1_w = weights['conv1.weight'].numpy()
	conv1_b = weights['conv1.bias'].numpy()
	conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
	conv1.stride = (1, 1)

	pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
	pool1.stride = (2, 2)
	conv2_w = weights['conv2.weight'].numpy()
	conv2_b = weights['conv2.bias'].numpy()
	conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
	conv2.stride = (1, 1)

	pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
	pool2.stride = (2, 2)

	fc1_w = weights['fc1.weight'].numpy()
	fc1_b = weights['fc1.bias'].numpy()
	fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)

	relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)

	fc2_w = weights['fc2.weight'].numpy()
	fc2_b = weights['fc2.bias'].numpy()
	fc2 = network.add_fully_connected(relu1.get_output(0), OUTPUT_SIZE, fc2_w, fc2_b)

	fc2.get_output(0).name = OUTPUT_NAME
	network.mark_output(fc2.get_output(0))

3.2.2 Importing A Model Using A Parser In Python

Main steps:

  1. Create the TRT Builder and Network

  2. Create a TRT Parser for the specific model format

  3. Use the Parser to read the model and populate the Network

The Builder must be created before the Network, because the Builder acts as the factory for the Network. Different parsers have different mechanisms for marking the network outputs; see each parser's API documentation for details:

  • Caffe Parser:

  • ONNX Parser:
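These notes only walk through the UFF path below; as a rough companion, here is a minimal ONNX sketch assuming the pre-7.x trt.OnnxParser API and a hypothetical model.onnx file:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Parse an ONNX model into a TensorRT network ("model.onnx" is a hypothetical path).
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            # Print any parser errors before giving up.
            for i in range(parser.num_errors):
                print(parser.get_error(i))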

3.2.3 Caffe Parser (pass)

3.2.4 TensorFlow Parser

The following steps show how to import a TensorFlow model with the UFF parser and the Python API. Sample code lives at <site-packages>/tensorrt/samples/python/end_to_end_tensorflow_mnist.

Step 1. Import TRT

import tensorrt as trt

Step 2. Create a frozen TensorFlow model

To export a UFF file, the TensorFlow model must first be frozen into a .pb file.

Reference links: see the Appendix.
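As a rough illustration of the freezing step (not part of the guide), here is a minimal TF 1.x sketch using tf.graph_util.convert_variables_to_constants; the session, output node name, and file path are placeholders for your own model:

import tensorflow as tf  # TF 1.x API

# `sess` is an existing tf.Session holding the trained graph; "fc2/Relu" and the
# output path are placeholders matching the MNIST example used later in these notes.
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
    sess,
    sess.graph.as_graph_def(),
    output_node_names=["fc2/Relu"],  # variables feeding these nodes become constants
)

with tf.gfile.GFile("frozen_inference_graph.pb", "wb") as f:
    f.write(frozen_graph_def.SerializeToString())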

Step 3. Convert the TensorFlow model to a UFF file with the UFF converter

convert-to-uff frozen_inference_graph.pb

If convert-to-uff is not installed as a global command, you can call the script in the bin directory directly:

~/.local/lib/python2.7/site-packages/uff/bin/convert_to_uff.py

To locate the directory of the UFF module, run:

python -c "import uff; print(uff.__path__)"

Alternatively, you can use the UFF parser API to convert a TensorFlow GraphDef directly:
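These notes do not spell that path out; the following is a minimal sketch assuming the uff.from_tensorflow_frozen_model helper shipped with the UFF converter:

import uff

# Convert a frozen .pb from Python instead of using the convert-to-uff CLI.
# The file and output node names mirror the examples elsewhere in these notes.
uff_buffer = uff.from_tensorflow_frozen_model(
    "frozen_inference_graph.pb",
    output_nodes=["fc2/Relu"],
)

with open("mnist.uff", "wb") as f:
    f.write(uff_buffer)  # save the serialized UFF graph for the UFF parser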

Step 4. Define the relevant paths

Change the following path to point to where your UFF model is stored:

model_file = '/data/mnist/mnist.uff'

Step 5. Create the Builder, Network, and Parser:

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    parser.parse(model_file, network)

3.2.5 ONNX Parser (pass)

3.3 Building An Engine In Python

One of the Builder's jobs is to search for the fastest CUDA kernels available on the host, so the engine should be built and optimized on the same GPU it will run on.

The Builder has many configurable properties, such as the precision of particular layers, and it auto-tunes kernels for maximum efficiency. You can also query the Builder to find out which mixed-precision types the hardware natively supports.

The two most important properties are 1. Maximum Batch Size and 2. Maximum Workspace Size:

  • Maximum Batch Size determines the batch size TensorRT optimizes for; at runtime a smaller batch size may be chosen.

  • Layers often need temporary workspace while they execute; Maximum Workspace Size limits how much scratch memory each layer may use.

  • If insufficient scratch is provided, it is possible that TensorRT may not be able to find an implementation for a given layer.

  1. Build the engine with the Builder object:

with trt.Builder(TRT_LOGGER) as builder:
    builder.max_batch_size = max_batch_size
    # This determines the amount of memory available to the builder when building an optimized engine and should generally be set as high as possible.
    builder.max_workspace_size = 1 << 20
    with builder.build_cuda_engine(network) as engine:
        # Do inference here.
        pass

2. Perform inference (see section 3.5).

3.4 Serializing A Model In Python

Serializing means converting the engine into a format that can be stored and later reused for inference.

To use the model for inference, you simply deserialize the engine back. Serializing and deserializing are optional, but because rebuilding the engine from the network definition on every run is time-consuming, you typically serialize the engine once and then deserialize it whenever it is needed. From that point on, the engine can be used for inference directly.

Note: a serialized engine is not portable across GPUs or TensorRT versions. The engine is specific to the exact GPU model of the machine it was built on.

  1. Serialize the model to a modelStream:

serialized_engine = engine.serialize()

2. Deserialize the modelStream to run inference. Deserializing requires a Runtime object:

with trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(serialized_engine)

You can also save the serialized engine to a file and read it back later:

  1. Serialize the engine and write it to a file:

with open("sample.engine", "wb") as f:
    f.write(engine.serialize())

2. Read the engine file back and deserialize it:

with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

3.5 Performing Inference In Python

The following steps show how to run inference with the engine.

  1. Allocate host and device buffers for the inputs and outputs:

import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA driver and creates a context
import numpy as np

# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(engine.get_binding_shape(0).volume(), dtype=np.float32)
h_output = cuda.pagelocked_empty(engine.get_binding_shape(1).volume(), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()

2. Create some space to store intermediate activation values. Because the engine holds the network definition and the trained parameters, additional space is needed; it is held in an execution context:

# (These steps run inside a do_inference-style helper function, hence the return at the end.)
with engine.create_execution_context() as context:
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream.
    stream.synchronize()
    # Return the host output.
    return h_output

An engine can have multiple execution contexts, allowing one set of weights to be used for several overlapping inference tasks.

For example, you can process images in parallel CUDA streams using one engine and one context per stream. Each context is created on the same GPU as the engine.
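A rough sketch of that pattern (not from the guide), assuming two sets of host/device buffers allocated exactly as in step 1 of section 3.5:

# One engine, one execution context per CUDA stream, each context with its own buffers.
# h_input0/d_input0/... and h_input1/d_input1/... are assumed to be allocated as in step 1.
stream0, stream1 = cuda.Stream(), cuda.Stream()

with engine.create_execution_context() as ctx0, engine.create_execution_context() as ctx1:
    # Enqueue work on both streams; the two batches can overlap on the GPU.
    cuda.memcpy_htod_async(d_input0, h_input0, stream0)
    ctx0.execute_async(bindings=[int(d_input0), int(d_output0)], stream_handle=stream0.handle)
    cuda.memcpy_dtoh_async(h_output0, d_output0, stream0)

    cuda.memcpy_htod_async(d_input1, h_input1, stream1)
    ctx1.execute_async(bindings=[int(d_input1), int(d_output1)], stream_handle=stream1.handle)
    cuda.memcpy_dtoh_async(h_output1, d_output1, stream1)

    # Wait for both streams to finish before reading the host outputs.
    stream0.synchronize()
    stream1.synchronize()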

Chapter 4-Extending TensorRT with Custom Layers

Chapter 5-Working With Mixed Precision

Chapter 6-Working With DLA

Chapter 7-Deploying A TensorRT Optimized Model

Chapter 8-Working with Deep Learning Frameworks

8.1 Working With Tensorflow

8.2 Working With PyTorch and Other Frameworks (pass)

Chapter 9-Troubleshooting

Appendix

Reference links collected from the original page:

  • UFF Parser API (pyUff): https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Uff/pyUff.html
  • Sample end_to_end_tensorflow_mnist: https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#end_to_end_tensorflow_mnist
  • TensorFlow guide on freezing model files: https://www.tensorflow.org/guide/extend/model_files#freezing
  • Exporting trained TensorFlow models (Medium): https://medium.com/@hamedmp/exporting-trained-tensorflow-models-to-c-the-right-way-cf24b609d183
  • Sample introductory_parser_samples (ResNet-50): https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#introductory_parser_samples_resnet50
  • Developer guide: performing inference in Python: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#perform_inference_python
  • Developer guide: extending TensorRT with custom layers: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#extending