This parser can be used to parse a network in UFF format. It also provides the ability to register a plugin factory and pass field attributes for custom layers.
Chapter 2 - Using the TRT C++ API (omitted)
Chapter 3 - Using the TRT Python API
3.1 Import TRT
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
3.2 Creating A Network Definition in Python
You can choose a tool from these options:
Create the network definition directly with the TRT Python API
Import a model using a parser (Caffe, TensorFlow, ONNX)
3.2.1 Creating A Network Definition From Scratch Using The Python API
# Create the builder and network.
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
    # Configure the network layers based on the weights provided. In this case,
    # the weights are imported from a PyTorch model.
    # Add an input layer. The name is a string, dtype is a TensorRT dtype, and
    # the shape can be provided as either a list or a tuple.
    input_tensor = network.add_input(name=INPUT_NAME, dtype=trt.float32, shape=INPUT_SHAPE)
    # Add a convolution layer followed by max pooling.
    conv1_w = weights['conv1.weight'].numpy()
    conv1_b = weights['conv1.bias'].numpy()
    conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
    conv1.stride = (1, 1)
    pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    pool1.stride = (2, 2)
    # Add a second convolution and pooling stage.
    conv2_w = weights['conv2.weight'].numpy()
    conv2_b = weights['conv2.bias'].numpy()
    conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
    conv2.stride = (1, 1)
    pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
    pool2.stride = (2, 2)
    # Add two fully connected layers with a ReLU activation in between.
    fc1_w = weights['fc1.weight'].numpy()
    fc1_b = weights['fc1.bias'].numpy()
    fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)
    relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)
    fc2_w = weights['fc2.weight'].numpy()
    fc2_b = weights['fc2.bias'].numpy()
    fc2 = network.add_fully_connected(relu1.get_output(0), OUTPUT_SIZE, fc2_w, fc2_b)
    # Name the output tensor and mark it as the network output.
    fc2.get_output(0).name = OUTPUT_NAME
    network.mark_output(fc2.get_output(0))
3.2.2 Importing A Model Using A Parser In Python
Main steps:
Create the TRT Builder and Network
Create a TRT Parser for the specific model format
Use the Parser to read the model and populate the Network
The Builder must be created before the Network, because the Builder acts as the factory that produces the Network. Note that different parsers have different mechanisms for marking the network outputs. For more information, refer to each parser's API documentation; a short sketch contrasting the two follows the links below:
Caffe Parser:
ONNX Parser:
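As a rough illustration of that difference, here is a minimal sketch using the TensorRT Python API of this generation (deploy_file, model_file, onnx_path, and the output blob name "prob" are placeholder assumptions): the Caffe parser returns a name-to-tensor table and you mark the outputs yourself, while the ONNX parser populates the network and marks its outputs for you.
# Caffe: parse, then look up and mark the output blob explicitly.
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.CaffeParser() as parser:
    model_tensors = parser.parse(deploy=deploy_file, model=model_file, network=network, dtype=trt.float32)
    network.mark_output(model_tensors.find("prob"))

# ONNX: the parser marks the graph outputs automatically during parse().
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open(onnx_path, 'rb') as model:
        parser.parse(model.read())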
3.2.3 Caffe Parser (pass)
3.2.4 TensorFlow Parser
The following steps show how to load a TensorFlow model with the UFF Parser and the Python API. The sample code can be found at <site-packages>/tensorrt/samples/python/end_to_end_tensorflow_mnist.
Step 1. Import TRT
import tensorrt as trt
Step 2. Create a TensorFlow frozen model
To export a UFF file, the TensorFlow model must first be frozen into a .pb file (a short sketch follows the reference link below).
Reference link:
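As a rough sketch of what freezing looks like in TensorFlow 1.x (here sess is an assumed active tf.Session holding the trained graph, and "fc2/Relu" is the assumed output node of this MNIST model):
import tensorflow as tf

# Fold the trained variables into constants so the graph can be written as a single .pb file.
frozen_graph = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph_def, ["fc2/Relu"])
with tf.gfile.GFile("frozen_model.pb", "wb") as f:
    f.write(frozen_graph.SerializeToString())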
Step 3. Use the UFF Converter to convert the TensorFlow model into a UFF file
Alternatively, use the UFF converter Python API to convert the TensorFlow GraphDef in memory, as sketched below:
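A minimal sketch of that in-memory conversion, assuming the frozen_graph produced in step 2 and the "fc2/Relu" output node (the uff Python module ships with the TensorRT Python packages; for a frozen .pb file already on disk, the convert-to-uff command-line utility from the same package performs the equivalent conversion):
import uff

# Convert the in-memory GraphDef to UFF and also write it to disk.
uff_model = uff.from_tensorflow(frozen_graph,
                                output_nodes=["fc2/Relu"],
                                output_filename="mnist.uff")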
Step 4. Define the relevant paths
Change the following path so that it points to where the UFF model is stored:
model_file = '/data/mnist/mnist.uff'
Step 5. Create the Builder, Network, and Parser:
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    parser.parse(model_file, network)
3.2.5 ONNX Parser (pass)
3.3 Building An Engine In Python
One of the Builder's tasks is to search the CUDA kernels available on the host machine for the fastest implementations, so the engine should be built and optimized on the same GPU it will later run on.
If insufficient scratch is provided, it is possible that TensorRT may not be able to find an implementation for a given layer.
Build the engine using the Builder object:
with trt.Builder(TRT_LOGGER) as builder:
    builder.max_batch_size = max_batch_size
    # This determines the amount of memory available to the builder when building
    # an optimized engine and should generally be set as high as possible.
    builder.max_workspace_size = 1 << 20
    with builder.build_cuda_engine(network) as engine:
        pass  # Do inference here.

If you already have a serialized engine, create a Runtime object and deserialize it instead:

with trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(serialized_engine)
3.4 Serializing A Model In Python
You can also serialize the engine to a file and read it back later.
1. Serialize the engine and write it to a file:
with open("sample.engine", "wb") as f:
    f.write(engine.serialize())
2. Read the engine file back and deserialize it:
with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
3.5 Performing Inference In Python
The following steps show how to run inference with a built engine.
First, allocate some host and device buffers for the inputs and outputs:
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # Initializes the CUDA driver and creates a device context.

# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(engine.get_binding_shape(0).volume(), dtype=np.float32)
h_output = cuda.pagelocked_empty(engine.get_binding_shape(1).volume(), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()
with engine.create_execution_context() as context:
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream.
    stream.synchronize()
    # Return the host output.
    return h_output
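The snippet above ends with a return statement, so it is meant to live inside a helper function. A minimal, hypothetical sketch of how it could be driven end to end (do_inference is an assumed wrapper name and test_image an assumed, already preprocessed float32 array matching the input binding; neither is part of the original sample):
def do_inference(context, h_input, d_input, h_output, d_output, stream):
    # Hypothetical wrapper around the inference snippet above.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()
    return h_output

# Copy a preprocessed sample into the page-locked input buffer, run inference,
# and interpret the result (argmax over the class scores for MNIST).
np.copyto(h_input, test_image.ravel())
with engine.create_execution_context() as context:
    output = do_inference(context, h_input, d_input, h_output, d_output, stream)
    print("Predicted digit:", int(np.argmax(output)))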