Use TensorRT with RunInference

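NVIDIA TensorRT is an SDK for high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet, and it focuses on optimizing and running trained neural networks so that inference runs efficiently on NVIDIA GPUs. TensorRT maximizes inference throughput while preserving model accuracy through several optimizations: model quantization, layer and tensor fusion, kernel auto-tuning, multi-stream execution, and efficient use of tensor memory.
Apache Beam 2.43.0 introduced the TensorRTEngineHandler, which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies building ML inference pipelines: developers can use Sklearn, PyTorch, TensorFlow, and now TensorRT models in production pipelines without writing large amounts of boilerplate code.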

The following example demonstrates how to use TensorRT with the RunInference API in a Beam pipeline, using a BERT-based text classification model.

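Build a TensorRT engine for inference

To use TensorRT with Apache Beam, you need a TensorRT engine file converted from a trained model. We take a trained BERT-based text classification model that performs sentiment analysis and classifies any text into one of two classes: positive or negative. The trained model is available from HuggingFace. To convert the PyTorch model to a TensorRT engine, first convert the model to ONNX, and then convert from ONNX to TensorRT.

Conversion to ONNX

You can use HuggingFace's transformers library to convert a PyTorch model to ONNX. For details, see the blog post Convert Transformers to ONNX with Hugging Face Optimum, which also explains which packages you need to install. The following code performs the conversion.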

from pathlib import Path
import transformers
from transformers.onnx import FeaturesManager
from transformers import AutoTokenizer, AutoModelForSequenceClassification


# load model and tokenizer
model_id = "textattack/bert-base-uncased-SST-2"
feature = "sequence-classification"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load config
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
onnx_config = model_onnx_config(model.config)

# export
onnx_inputs, onnx_outputs = transformers.onnx.export(
        preprocessor=tokenizer,
        model=model,
        config=onnx_config,
        opset=12,
        output=Path("bert-sst2-model.onnx")
)

From ONNX to TensorRT engine

To convert the ONNX model to a TensorRT engine, run the following command from the CLI:

trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT engine> --useCudaGraph --verbose
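If you prefer to script the conversion, the command can be assembled in Python. This is a minimal sketch, not part of the original example: the helper name `build_trtexec_command` and the file names are hypothetical, and only the flags shown in the command above are used. The sketch builds and prints the command without running it.

```python
import shlex
from typing import List


def build_trtexec_command(onnx_path: str, engine_path: str) -> List[str]:
    """Assemble the trtexec argument list shown above (hypothetical helper)."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--useCudaGraph",
        "--verbose",
    ]


# The file names below are assumptions chosen for illustration.
command = build_trtexec_command(
    "bert-sst2-model.onnx", "sst2-text-classification.trt"
)
print(shlex.join(command))
```

In an environment where trtexec is on the PATH, such as the Docker image described in this section, the list could be passed to `subprocess.run(command, check=True)`.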

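To use trtexec, follow the steps in the blog post Simplifying and Accelerating Machine Learning Predictions in Apache Beam with NVIDIA TensorRT. The post explains how to build a Docker image from a Dockerfile that can be used for the conversion. We use the following Dockerfile, which is similar to the one used in the blog post: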

ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3

FROM ${BUILD_IMAGE}

ENV PATH="/usr/src/tensorrt/bin:${PATH}"

WORKDIR /workspace

RUN apt-get update -y && apt-get install -y python3-venv
RUN pip install --no-cache-dir apache-beam[gcp]==2.44.0
COPY --from=apache/beam_python3.8_sdk:2.44.0 /opt/apache/beam /opt/apache/beam

# Quote version specifiers so the shell does not treat ">" as a redirect.
RUN pip install --upgrade pip \
    && pip install torch==1.13.1 \
    && pip install "torchvision>=0.8.2" \
    && pip install "pillow>=8.0.0" \
    && pip install "transformers>=4.18.0" \
    && pip install cuda-python

ENTRYPOINT [ "/opt/apache/beam/boot" ]

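The blog post also includes instructions for testing the TensorRT engine locally.

Run the TensorRT engine with RunInference in a Beam pipeline

Now that you have the TensorRT engine, you can use it with RunInference in a Beam pipeline that runs both locally and on Google Cloud.

The following code example is part of the pipeline. You use TensorRTEngineHandlerNumPy to load the TensorRT engine and to set other inference parameters.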

  model_handler = TensorRTEngineHandlerNumPy(
      min_batch_size=1,
      max_batch_size=1,
      engine_path=known_args.trt_model_path,
  )

  tokenizer = AutoTokenizer.from_pretrained(known_args.model_id)

  with beam.Pipeline(options=pipeline_options) as pipeline:
    _ = (
        pipeline
        | "ReadSentences" >> beam.io.ReadFromText(known_args.input)
        | "Preprocess" >> beam.ParDo(Preprocess(tokenizer=tokenizer))
        | "RunInference" >> RunInference(model_handler=model_handler)
        | "PostProcess" >> beam.ParDo(Postprocess(tokenizer=tokenizer)))

The full code can be found on GitHub.
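The snippet above relies on Preprocess and Postprocess classes that are not shown on this page. As a rough sketch of the classification step a postprocessing stage might perform, the following turns the two raw logits of a binary sentiment model into a label and a probability. It is not the code from the GitHub example: the function names are hypothetical, and the label order (index 0 is negative, index 1 is positive) is an assumption to verify against the model's configuration.

```python
import math
from typing import List, Tuple

# Assumed label order; check the model's config before relying on it.
LABELS = ("negative", "positive")


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: List[float]) -> Tuple[str, float]:
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Example logits are made up for illustration.
label, confidence = classify([-1.2, 2.3])
print(f"{label} ({confidence:.3f})")
```

In a Beam pipeline, logic like this would typically live inside the `process` method of a `beam.DoFn` that receives the inference results.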

To run this job on Dataflow, run the following command locally:

python tensorrt_text_classification.py \
--input gs://{GCP_PROJECT}/sentences.txt \
--trt_model_path gs://{GCP_PROJECT}/sst2-text-classification.trt \
--runner DataflowRunner \
--experiment=use_runner_v2 \
--machine_type=n1-standard-4 \
--experiment="worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver" \
--disk_size_gb=75 \
--project {GCP_PROJECT} \
--region us-central1 \
--temp_location gs://{GCP_PROJECT}/tmp/ \
--job_name tensorrt-text-classification \
--sdk_container_image="us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt"

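Dataflow benchmarking

We ran experiments in Dataflow using a TensorRT engine and the following configuration: an n1-standard-4 machine with a disk size of 75GB. To mimic data streaming into Dataflow through PubSub, we set the batch size to 1 by setting both the minimum and maximum batch sizes of the ModelHandlers to 1.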
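To illustrate why pinning both limits to 1 gives one element per inference call, here is a simplified batching sketch. It is not Beam's BatchElements implementation, which picks batch sizes dynamically between the minimum and maximum; the function name `fixed_size_batches` and the sample sentences are hypothetical.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def fixed_size_batches(
    elements: Iterable[T], max_batch_size: int
) -> Iterator[List[T]]:
    """Group elements into lists of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[T] = []
    for element in elements:
        batch.append(element)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


sentences = ["great movie", "terrible plot", "loved it"]
batches = list(fixed_size_batches(sentences, max_batch_size=1))
print(batches)
```

With a maximum batch size of 1, every sentence becomes its own batch, which resembles elements arriving one at a time from a stream.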

	Stage with RunInference	Mean inference_batch_latency_micro_secs
TensorFlow with T4 GPU	3 min 1 sec	15,176
TensorRT with T4 GPU	45 sec	3,685

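The Dataflow runner decomposes a pipeline into multiple stages. You get a better picture of RunInference performance by looking at the stage that contains the inference call, rather than the other stages that read and write data. This is the value in the Stage with RunInference column.

The inference_batch_latency_micro_secs metric is the time, in microseconds, that it takes to perform inference on a batch of examples, that is, the time it takes to call model_handler.run_inference. This value varies over time depending on the dynamic batching decisions of BatchElements and on the particular values or dtype of the elements. For this metric, TensorRT is about 4.1 times faster than TensorFlow.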
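The speedup figure can be checked directly from the benchmark table. The short calculation below uses only the values reported in the table; the variable names are chosen for illustration.

```python
# Values copied from the benchmark table above.
tensorflow_latency_us = 15_176
tensorrt_latency_us = 3_685
tensorflow_stage_seconds = 3 * 60 + 1  # 3 min 1 sec
tensorrt_stage_seconds = 45

latency_speedup = tensorflow_latency_us / tensorrt_latency_us
stage_speedup = tensorflow_stage_seconds / tensorrt_stage_seconds

print(f"Batch latency speedup: {latency_speedup:.1f}x")
print(f"Stage time speedup: {stage_speedup:.1f}x")
```

The batch latency ratio rounds to 4.1, and the ratio of the stage times is close to it at roughly 4.0.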

Last updated on 2024/10/31
