API Study

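Create the Runtime -> create the Engine -> create the Context -> get the input/output binding indices -> create buffers -> allocate GPU memory for the inputs and outputs -> create a CUDA stream -> copy input data from CPU to GPU -> run inference asynchronously -> copy output data from GPU to CPU -> synchronize the CUDA stream -> release resources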

Usage workflow:

Tensor RT C++ 使用流程 (TensorRT C++ usage workflow), 我为什么这么菜.的博客, CSDN博客

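TensorRT provides a C++ implementation on all supported platforms and a Python implementation on Linux. Python is currently not supported on Windows or QNX (as of the TensorRT release these notes are based on).

The key interfaces of TensorRT are the following. Network Definition: gives the application a way to define a network. Input and output tensors can be specified, and layers such as convolution and recurrent layers can be added and configured. A plugin layer type additionally lets the application implement functionality that TensorRT does not support natively. (Every operation of the neural network can be expressed through the Network.)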

Optimization Profile: an optimization profile specifies the constraints on dynamic dimensions.
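
To make the idea of "constraints on dynamic dimensions" concrete, here is a minimal sketch that compiles without TensorRT. In the real API the three shapes are set per input through IOptimizationProfile::setDimensions with OptProfileSelector::kMIN, kOPT and kMAX. The struct ShapeProfile and the functions isValidProfile and shapeFits below are hypothetical names introduced only for illustration; they model the constraint and are not part of TensorRT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the min/opt/max shapes of one network input.
// Not a TensorRT type; it only models the constraint a profile expresses.
struct ShapeProfile {
    std::vector<int> min;
    std::vector<int> opt;
    std::vector<int> max;
};

// A profile is well-formed when all three shapes have the same rank and
// min <= opt <= max holds in every dimension.
bool isValidProfile(const ShapeProfile& p) {
    if (p.min.size() != p.opt.size() || p.opt.size() != p.max.size()) return false;
    for (std::size_t i = 0; i < p.min.size(); ++i) {
        if (p.min[i] > p.opt[i] || p.opt[i] > p.max[i]) return false;
    }
    return true;
}

// A concrete runtime shape is acceptable when it has the profile's rank
// and every dimension lies inside [min, max].
bool shapeFits(const ShapeProfile& p, const std::vector<int>& shape) {
    if (shape.size() != p.min.size() || shape.size() != p.max.size()) return false;
    for (std::size_t i = 0; i < shape.size(); ++i) {
        if (shape[i] < p.min[i] || shape[i] > p.max[i]) return false;
    }
    return true;
}
```

For example, with a profile of min {1, 3, 224, 224}, opt {8, 3, 224, 224} and max {32, 3, 224, 224}, a batch of 16 fits while a batch of 64 does not.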

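Builder Configuration (Config): the builder configuration interface specifies the details for creating an engine. It lets the application specify optimization profiles, the maximum workspace size, the minimum acceptable precision level, the timing iteration count for autotuning, and an interface for quantizing the network to run in 8-bit precision.

Builder: the Builder interface creates an optimized engine from a network definition and a builder configuration.

Engine: the Engine lets the application execute inference. It supports synchronous and asynchronous execution, profiling, and enumerating and querying the engine's input and output bindings. A single engine can have multiple execution contexts, so one set of trained parameters can serve several inferences at the same time. (Think of it as the blueprint from which everything else follows; the other objects are framework containers.)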

What is an execution context? Can it be understood as multithreaded execution?

ExecutionContext(执行上下文)综述 (Overview of ExecutionContext) - 大师兄石头 - 博客园

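ONNX parser: parses ONNX models.

C++ API vs. Python API: in theory, the C++ API and the Python API should be close to identical in supporting your needs. The C++ API should be used in any performance-critical scenario and in situations where safety is important, for example in automotive applications. The main benefit of the Python API is that data preprocessing and postprocessing are easy, because you can use libraries such as NumPy and SciPy.

(For practical deployment, C++ is still the better choice.)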

Developer Guide :: NVIDIA Deep Learning TensorRT Documentation

CUDA Runtime API :: CUDA Toolkit Documentation
CUDA Stream: CUDA Runtime API :: CUDA Toolkit Documentation

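What is the Logger?

The logging component that manages the log messages of the builder, the engine and the runtime.

The Logger class used in the samples gives TensorRT tools and samples a common interface for logging information to the console. It supports two types of messages:
- debugging messages with an associated severity (info, warning, error, or internal error/fatal)
- test pass/fail messages
Compared with writing directly to stdout/stderr, having every sample log through this class keeps the logic that controls the verbosity and formatting of sample output in one place. In the future the class could be extended to dump test results to a file in some standard format (for example, JUnit XML) and to provide additional metadata (for example, timing the duration of a test run).

The logger is a required argument of the functions that instantiate the builder, the runtime and the parser: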

IBuilder* builder = createInferBuilder(gLogger);
IRuntime* runtime = createInferRuntime(gLogger);
auto parser = nvonnxparser::createParser(*network, gLogger);

The Logger is treated as a singleton internally, so multiple instances of IRuntime and/or IBuilder must all use the same Logger.
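
The severity filtering described above can be sketched as follows. This is a stand-in that compiles without the TensorRT headers: the local Severity enum mirrors the ordering of nvinfer1::ILogger::Severity (a lower value is more severe), while the class ThresholdLogger and its members are hypothetical names for illustration. A real logger would instead derive from nvinfer1::ILogger and override its log() method.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Local stand-in that mirrors the ordering of nvinfer1::ILogger::Severity
// (lower value = more severe), so the sketch builds without TensorRT.
enum class Severity : int {
    kINTERNAL_ERROR = 0,
    kERROR = 1,
    kWARNING = 2,
    kINFO = 3,
    kVERBOSE = 4
};

// Hypothetical logger: prints only messages at least as severe as the
// configured threshold, keeping verbosity control in one place.
class ThresholdLogger {
public:
    explicit ThresholdLogger(Severity reportable = Severity::kWARNING)
        : reportable_(reportable) {}

    // Returns true when the message passed the filter and was printed.
    bool log(Severity severity, const std::string& msg) {
        if (severity > reportable_) return false;  // less severe than threshold
        // Errors go to stderr, everything else to stdout.
        std::ostream& out = (severity <= Severity::kERROR) ? std::cerr : std::cout;
        out << "[TRT] " << msg << '\n';
        ++printed_;
        return true;
    }

    int printedCount() const { return printed_; }

private:
    Severity reportable_;
    int printed_ = 0;
};
```

Because TensorRT expects every builder and runtime to share one logger, such an object would typically be created once (like gLogger in the calls above) and passed to each factory function.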

TensorRT: nvinfer1::ILogger Class Reference

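Execution Context

Definition: TensorRT: NvInferRuntime.h Source File

A context for executing inference using an engine, with functionally unsafe features.

Multiple execution contexts may exist for one ICudaEngine instance, which allows the same engine to execute several batches at the same time. If the engine supports dynamic shapes, each execution context in concurrent use must use a separate optimization profile.

Warning: do not inherit from this class, as doing so will break forward compatibility of the API and ABI.

Usage: to use the IExecutionContext interface, first create an object of type ICudaEngine. The builder or runtime is created with the GPU context associated with the creating thread, so it is recommended to create and configure the CUDA context before creating a runtime or builder object.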

// engine is an ICudaEngine* that was built or deserialized beforehand
IExecutionContext* context = engine->createExecutionContext();
// The owning engine can be retrieved from the context when needed
const ICudaEngine& sameEngine = context->getEngine();
context->enqueue(batchSize, buffers, stream, nullptr);
//TensorRT execution is typically asynchronous, so enqueue the kernels on a CUDA stream.
//It is common to enqueue asynchronous memcpy() before and after the kernels to move data to and from the GPU if it is not already there.
//The final argument to enqueueV2() is an optional CUDA event which will be signaled when the input buffers have been consumed and their memory may be safely reused.
//For more information, refer to enqueue() for implicit batch networks and enqueueV2() for explicit batch networks.
//In the event that asynchronous execution is not wanted, see execute() and executeV2().
//The IExecutionContext contains shared resources, therefore calling enqueue() or enqueueV2() on the same IExecutionContext object with different CUDA streams concurrently results in undefined behavior.
//To perform inference concurrently in multiple CUDA streams, use one IExecutionContext per CUDA stream.
// Destroy the context only after all enqueued work has finished
context->destroy();

TensorRT: nvinfer1::IExecutionContext Class Reference

Engine

Class: ICudaEngine, defined in NvInferRuntime.h

IBuilderConfig* config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(1 << 20); // 1 MiB of scratch space for layer algorithms
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

The complete network must be defined before this step.

TensorRT: nvinfer1::ICudaEngine Class Reference

Network

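The network definition that serves as input to the builder.

The network definition describes the structure of the network and, combined with an IBuilderConfig, is built into an engine by the IBuilder. An INetworkDefinition can either have an implicit batch dimension that is specified at runtime, or have all dimensions explicit (full dims mode). A network created with createNetwork() supports only implicit batch size mode. The function hasImplicitBatchDimension() queries the mode of a network.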
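
The two modes are selected through a bitmask passed to createNetworkV2(): each creation flag occupies the bit at the position of its enum value. The sketch below shows only that arithmetic and compiles without TensorRT. The local enum CreationFlag is a stand-in for nvinfer1::NetworkDefinitionCreationFlag (whose kEXPLICIT_BATCH value is 0), and flagBit and isImplicitBatch are hypothetical helper names, not TensorRT functions.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for nvinfer1::NetworkDefinitionCreationFlag so the sketch
// builds without TensorRT. kEXPLICIT_BATCH has the value 0 there as well.
enum class CreationFlag : std::int32_t { kEXPLICIT_BATCH = 0 };

// Each flag is encoded as the bit at the position of its enum value.
constexpr std::uint32_t flagBit(CreationFlag f) {
    return 1U << static_cast<std::uint32_t>(f);
}

// Hypothetical helper: a flag word without the explicit-batch bit set
// describes a network with an implicit batch dimension.
constexpr bool isImplicitBatch(std::uint32_t flags) {
    return (flags & flagBit(CreationFlag::kEXPLICIT_BATCH)) == 0U;
}
```

So passing 0U requests an implicit-batch network, while 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH) evaluates to 1U and requests an explicit-batch one.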

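IBuilder* builder = createInferBuilder(gLogger);
// Option 1: implicit batch dimension (no creation flags set)
INetworkDefinition* network = builder->createNetworkV2(0U);
// Option 2 (use instead of option 1): explicit batch dimension
INetworkDefinition* network = builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));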

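// Add the input layer to the network with its input dimensions (two alternative examples; use one)
// Implicit batch: dimensions are CHW, the batch size is supplied at runtime
ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
// Explicit batch: dimensions are NCHW, -1 marks the batch dimension as dynamic
auto data = network->addInput(INPUT_BLOB_NAME, dt, Dims4{-1, 1, INPUT_H, INPUT_W});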

// Add a convolution layer (two alternative examples; use one)
IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
conv1->setStrideNd(DimsHW{1, 1});
// Older non-Nd API (deprecated in recent TensorRT releases); data is an ITensor*, so it is dereferenced directly
auto conv1 = network->addConvolution(*data, 20, DimsHW{5, 5}, weightMap["conv1filter"], weightMap["conv1bias"]);
conv1->setStride(DimsHW{1, 1});

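// Add a pooling layer (two alternative examples; use one)
// First example: average pooling applied to the output of a ReLU layer relu1
IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
pool1->setStrideNd(DimsHW{2, 2});
// Second example: max pooling applied directly to conv1, using the older non-Nd API
auto pool1 = network->addPooling(*conv1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
pool1->setStride(DimsHW{2, 2});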

// Add an activation layer that uses ReLU (two alternative examples; use one)
IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
auto relu1 = network->addActivation(*ip1->getOutput(0), ActivationType::kRELU);

// Add a fully connected layer (two alternative examples; use one)
IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
auto ip1 = network->addFullyConnected(*pool1->getOutput(0), 500, weightMap["ip1filter"], weightMap["ip1bias"]);

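// Add a SoftMax layer to compute the final probabilities and mark it as the network output
// First example: softmax over the last fully connected layer fc3
ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*prob->getOutput(0));
// Second example (use instead of the first): softmax over relu1
auto prob = network->addSoftMax(*relu1->getOutput(0));
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*prob->getOutput(0));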

TensorRT: nvinfer1::INetworkDefinition Class Reference

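Parser

The parser is mainly used to parse an ONNX model and convert it into a TensorRT network. Class: nvonnxparser::IParser

Create an ONNX parser with the INetworkDefinition as input: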

auto parser = nvonnxparser::createParser(*network, gLogger);

TensorRT: nvonnxparser::IParser Class Reference

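Basic workflow

// 1. Read the engine file (engineFilename is an assumed variable holding the path to the serialized engine)
std::ifstream engineFile(engineFilename, std::ios::binary);
engineFile.seekg(0, std::ifstream::end);
auto fsize = engineFile.tellg();
engineFile.seekg(0, std::ifstream::beg);
std::vector<char> engineData(fsize);
engineFile.read(engineData.data(), fsize);
util::UniquePtr<nvinfer1::IRuntime> runtime{nvinfer1::createInferRuntime(sample::gLogger.getTRTLogger())};
util::UniquePtr<nvinfer1::ICudaEngine> mEngine(runtime->deserializeCudaEngine(engineData.data(), fsize, nullptr));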

// 2. Initialize the engine inputs and outputs (this can also be seen as initializing the execution context)
auto context = util::UniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
// The engine input is named "input", has data type float and shape (1, 3, height, width)
auto input_idx = mEngine->getBindingIndex("input");
assert(mEngine->getBindingDataType(input_idx) == nvinfer1::DataType::kFLOAT);
auto input_dims = nvinfer1::Dims4{1, 3 /* channels */, height, width};
context->setBindingDimensions(input_idx, input_dims);
auto input_size = util::getMemorySize(input_dims, sizeof(float));
// The engine output is named "output" and has data type int32; its shape is derived automatically
auto output_idx = mEngine->getBindingIndex("output");
assert(mEngine->getBindingDataType(output_idx) == nvinfer1::DataType::kINT32);
auto output_dims = context->getBindingDimensions(output_idx);
auto output_size = util::getMemorySize(output_dims, sizeof(int32_t));

// 3. Prepare for inference
// Create the CUDA stream on which the copies and the inference are enqueued
cudaStream_t stream;
cudaStreamCreate(&stream);
// Allocate GPU memory for the input and the output
void* input_mem{nullptr};
cudaMalloc(&input_mem, input_size);
void* output_mem{nullptr};
cudaMalloc(&output_mem, output_size);
// Define the image normalization
const std::vector<float> mean{0.485f, 0.456f, 0.406f};
const std::vector<float> stddev{0.229f, 0.224f, 0.225f};
auto input_image{util::RGBImageReader(input_filename, input_dims, mean, stddev)};
input_image.read();
auto input_buffer = input_image.process();
// Copy the preprocessed data to GPU memory
cudaMemcpyAsync(input_mem, input_buffer.get(), input_size, cudaMemcpyHostToDevice, stream);

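// 4. Run inference
// executeV2 (synchronous) or enqueueV2 (asynchronous) launches the actual inference
// The bindings array is indexed by binding index; this order assumes input_idx == 0 and output_idx == 1
void* bindings[] = {input_mem, output_mem};
bool status = context->enqueueV2(bindings, stream, nullptr);
// Fetch the prediction; output_size is in bytes, so convert it to an element count for the host buffer
auto output_buffer = std::unique_ptr<int32_t[]>{new int32_t[output_size / sizeof(int32_t)]};
cudaMemcpyAsync(output_buffer.get(), output_mem, output_size, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
// Release resources
cudaFree(input_mem);
cudaFree(output_mem);
cudaStreamDestroy(stream);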

// 5. Write out the prediction
const int num_classes{21};
const std::vector<int> palette{
	(0x1 << 25) - 1, (0x1 << 15) - 1, (0x1 << 21) - 1};
auto output_image{util::ArgmaxImageWriter(output_filename, output_dims, palette, num_classes)};
output_image.process(output_buffer.get());
output_image.write();
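
The buffer sizes in step 2 come from util::getMemorySize, a helper of the sample code whose implementation is not shown here. A minimal sketch of what such a helper presumably computes (the product of all dimensions times the element size) is given below. The names elementCount and memorySizeBytes are hypothetical, and plain std::vector is used in place of nvinfer1::Dims so that the block compiles without TensorRT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: number of elements in a tensor of the given shape.
// A dynamic dimension (-1) has no fixed size, so it is rejected.
std::size_t elementCount(const std::vector<std::int64_t>& dims) {
    std::size_t count = 1;
    for (std::int64_t d : dims) {
        if (d < 0) throw std::invalid_argument("dynamic dimension is unresolved");
        count *= static_cast<std::size_t>(d);
    }
    return count;
}

// Hypothetical counterpart of util::getMemorySize: size in bytes of a
// tensor with the given shape and per-element size.
std::size_t memorySizeBytes(const std::vector<std::int64_t>& dims,
                            std::size_t elemSize) {
    return elementCount(dims) * elemSize;
}
```

For a float input of shape (1, 3, 512, 512) this gives 3145728 bytes. Note the distinction between bytes and elements: cudaMalloc and cudaMemcpyAsync take the byte size, whereas a host array allocated with new needs the element count.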

视觉菜鸟Leonardo

深度学习最终BOSS——TensorRT
