[Hands-On Tutorial] Running an AI Vision Language Model on a Local Computer: Object Detection from Text Prompts [Source Code Included]


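Introduction

The ecosystem of small language models has great potential for applications on edge devices. The use cases span many industries, including medicine, construction, commerce, and surveillance.

This article shows how to run Moondream, a small vision language model (VLM), on a PC and use it for a few object detection experiments.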

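Implementation Steps

Running the Model

Let's start with how to run the model, which is very simple. Just make sure the dependencies are installed and download the model (it is under 2 GB: small but capable). The model used here is Moondream, an open-source, lightweight AI vision language model.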

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-07-23"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image_path = "images/test.jpeg"
image = Image.open(image_path)
enc_image = model.encode_image(image)

Test Sample

To test the model, we will use the picture below, which shows a cute dog and a cat.

We will ask the model how many animals are in the image and let it answer:

print(model.answer_question(enc_image, "how many animals are in the picture?", tokenizer))
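# Answer:
# 'There are two animals in the picture: a dog and a cat.'

And the answer is correct! But this task is easy, so let's make it a bit harder.

Object Detection Example

The model's object detection ability has been improving with each recent release, and further improvements seem likely. So let's try to locate the cat in the image!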

print(model.answer_question(enc_image, "Detect the cat and return the bounding box", tokenizer))
# Output:
# [0.00, 0.23, 0.50, 0.99]

We got an answer, but let's not jump to conclusions before checking it. Let's also ask where the dog is:

print(model.answer_question(enc_image, "Detect the dog and return the bounding box", tokenizer))
# Output:
# [0.32, 0.09, 0.99, 0.98]
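
The drawing step that follows needs the coordinates as numbers, while the first answer above was printed as a text string. The sketch below assumes the bounding-box answer also arrives as a string such as "[0.00, 0.23, 0.50, 0.99]" and turns it into a list of floats. The helper name parse_bbox and its validation rules are illustrative assumptions, not part of Moondream's API.

```python
import re


def parse_bbox(answer):
    """Extract [x_min, y_min, x_max, y_max] from a model answer string.

    Hypothetical helper: returns a list of four floats, or None when the
    text does not contain a plausible normalized bounding box.
    """
    # Collect every number in the text, e.g. "0.00", "0.23", "1".
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer)
    if len(numbers) < 4:
        return None

    bbox = [float(n) for n in numbers[:4]]

    # Normalized coordinates are expected to lie between 0 and 1.
    if not all(0.0 <= v <= 1.0 for v in bbox):
        return None

    # The box must have positive width and height.
    if bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
        return None

    return bbox


print(parse_bbox("[0.00, 0.23, 0.50, 0.99]"))
print(parse_bbox("There are two animals in the picture."))
```

Returning None instead of raising keeps the calling code simple when the model replies with a sentence rather than coordinates.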

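Now let's create a helper function to draw the bounding box. The code is simple enough to copy and paste!

from PIL import Image, ImageDraw


def draw_bounding_box(image_path, bbox):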
    """
    Draws a bounding box on an image and displays it.

    Parameters:
    - image_path (str): Path to the image file.
    - bbox (list): Normalized bounding box coordinates [x_min, y_min, x_max, y_max], where
                   each value is between 0 and 1.
    """
    # Open the image
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)

    # Get image dimensions
    img_width, img_height = image.size

    # Convert normalized bounding box coordinates to absolute pixel values
    x_min = int(bbox[0] * img_width)
    y_min = int(bbox[1] * img_height)
    x_max = int(bbox[2] * img_width)
    y_max = int(bbox[3] * img_height)

    # Draw the bounding box
    draw.rectangle([x_min, y_min, x_max, y_max], outline="red", width=3)

    # Display the image
    image.show()

Let's use the same image and pass in the bounding boxes:

image_path = "images/test.jpeg"
bbox = [0.00, 0.23, 0.50, 0.99] # Bounding box coordinates for the cat
draw_bounding_box(image_path, bbox)

bbox = [0.32, 0.09, 0.99, 0.98] # Bounding box coordinates for the dog
draw_bounding_box(image_path, bbox)
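
The conversion from normalized coordinates to pixels can also be checked on its own, without opening an image. The sketch below is a hypothetical variant of that step (the name to_pixel_box is an assumption): it clamps each value to the 0 to 1 range first, so a slightly out-of-range prediction cannot produce a rectangle outside the image, and it rounds to the nearest pixel where draw_bounding_box truncates with int().

```python
def to_pixel_box(bbox, img_width, img_height):
    """Convert a normalized [x_min, y_min, x_max, y_max] box to pixels.

    Hypothetical helper: values are clamped to [0, 1] before scaling and
    rounded to the nearest pixel.
    """
    # Clamp each coordinate so the box stays inside the image.
    x_min, y_min, x_max, y_max = (min(max(v, 0.0), 1.0) for v in bbox)

    return (
        round(x_min * img_width),
        round(y_min * img_height),
        round(x_max * img_width),
        round(y_max * img_height),
    )


# Example with an assumed 1000 x 500 pixel image and the cat's box.
print(to_pixel_box([0.00, 0.23, 0.50, 0.99], 1000, 500))
```

The returned tuple can be passed directly to draw.rectangle in place of the four separate values.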

As the picture below shows, the model was able to identify the bounding boxes correctly.


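Summary

These tiny models have great application potential. You can even fine-tune them to better fit your own custom applications. Given that they can run even on a Raspberry Pi, the range of possible uses is very wide.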


阿_旭