This parser can be used to parse a network in UFF format. It also provides the ability to register a plugin factory and pass field attributes for custom layers.
Chapter 2 - Using the TRT C++ API (omitted)
Chapter 3 - Using the TRT Python API
3.1 Import TRT
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
3.2 Creating A Network Definition in Python
You can create a network definition in either of two ways:
Build the network directly with the TRT API (Creating A Network Definition From Scratch Using The Python API)
Build the network from an existing model with a parser (Importing A Model Using A Parser In Python: Caffe, TensorFlow, ONNX)
3.2.1 Creating A Network Definition From Scratch Using The Python API
# Create the builder and network
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
    # Configure the network layers based on the weights provided. In this case, the weights are imported from a pytorch model.
    # Add an input layer. The name is a string, dtype is a TensorRT dtype, and the shape can be provided as either a list or tuple.
    input_tensor = network.add_input(name=INPUT_NAME, dtype=trt.float32, shape=INPUT_SHAPE)
    # Add a convolution layer
    conv1_w = weights['conv1.weight'].numpy()
    conv1_b = weights['conv1.bias'].numpy()
    conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
    conv1.stride = (1, 1)
    pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    pool1.stride = (2, 2)
    conv2_w = weights['conv2.weight'].numpy()
    conv2_b = weights['conv2.bias'].numpy()
    conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
    conv2.stride = (1, 1)
    pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
    pool2.stride = (2, 2)
    fc1_w = weights['fc1.weight'].numpy()
    fc1_b = weights['fc1.bias'].numpy()
    fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)
    relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)
    fc2_w = weights['fc2.weight'].numpy()
    fc2_b = weights['fc2.bias'].numpy()
    fc2 = network.add_fully_connected(relu1.get_output(0), OUTPUT_SIZE, fc2_w, fc2_b)
    fc2.get_output(0).name = OUTPUT_NAME
    network.mark_output(fc2.get_output(0))
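The weights dictionary used above is assumed to come from a trained PyTorch model. A minimal sketch of how it could be obtained (the LeNet5 class below is illustrative and not part of the original sample):

import torch
import torch.nn as nn

# Hypothetical LeNet-style model whose parameter names match the keys used above
# (conv1.weight, conv1.bias, conv2.*, fc1.*, fc2.*).
class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5)
        self.fc1 = nn.Linear(50 * 4 * 4, 500)
        self.fc2 = nn.Linear(500, 10)

# state_dict() returns an ordered mapping from parameter name to tensor, which is
# exactly the dictionary the network-building code indexes into (after training,
# the weights would normally be restored from a checkpoint via load_state_dict).
weights = LeNet5().state_dict()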
3.2.2 Importing A Model Using A Parser In Python
Main steps:
Create the TRT Builder and Network
Create a TRT Parser for the specific model format
Use the parser to read the model file and populate the Network
The Builder must be created before the Network, because the Builder acts as the factory for the Network. Different parsers use different mechanisms to mark the network outputs; see each parser's API documentation for more information:
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    parser.parse(model_file, network)
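Here model_file is the path to a .uff file, typically produced from a frozen TensorFlow graph with the convert-to-uff utility, and the names passed to register_input/register_output must match the corresponding TensorFlow node names.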
3.2.5 ONNX Parser (pass)
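The ONNX path is not covered here; as a rough, hedged sketch (the model.onnx filename is illustrative), the ONNX parser is constructed with the network and a logger and consumes the serialized model bytes:

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open("model.onnx", "rb") as f:
        parser.parse(f.read())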
3.3 Building An Engine In Python
One of the builder's functions is to search the host for suitable CUDA kernels to use for acceleration, so the engine should be built and optimized on the same GPU that it will run on.
with trt.Builder(TRT_LOGGER) as builder:
    builder.max_batch_size = max_batch_size
    # This determines the amount of memory available to the builder when building an optimized engine and should generally be set as high as possible.
    builder.max_workspace_size = 1 << 20
    with builder.build_cuda_engine(network) as engine:
        # Do inference here.
3.4 Serializing A Model In Python
1. Serialize the engine and write it to a file:
with open("sample.engine", "wb") as f:
    f.write(engine.serialize())
2. Read the engine file and deserialize it:
with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
3.5 Performing Inference In Python
The following steps show how to perform inference with the engine.
Allocate some host and device buffers for the inputs and outputs:
# PyCUDA is assumed for the host/device buffers and the stream (importing pycuda.autoinit creates the CUDA context).
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(engine.get_binding_shape(0).volume(), dtype=np.float32)
h_output = cuda.pagelocked_empty(engine.get_binding_shape(1).volume(), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()
with engine.create_execution_context() as context:
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()
    # Return the host output.
    return h_output
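The snippet above assumes h_input has already been filled with a preprocessed input before the host-to-device copy. A minimal sketch, assuming a 28x28 grayscale MNIST-style input (the digit.png filename and normalization are illustrative):

import numpy as np
from PIL import Image

# Hypothetical preprocessing: resize a grayscale image to the 1x28x28 input used
# earlier, scale pixel values to [0, 1], and copy it into the page-locked buffer.
img = Image.open("digit.png").convert("L").resize((28, 28))
np.copyto(h_input, (np.asarray(img, dtype=np.float32) / 255.0).ravel())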