How to Implement and Test Phi3: Microsoft's Powerful New Large Language Model

Original article: towardsdatascience.com/how-to-implement-and-test-phi3-microsofts-powerful-new-large-language-model-b2003b2aa155

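This article discusses Microsoft's newly released Phi3 large language model, an LLM with an unusually large context window for its size that is capable of a variety of tasks. I will cover how to run Phi3 locally and test it to see how it performs on tasks such as concise answering, JSON formatting, and information extraction. Finally, I will share my thoughts on the model and its performance.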

<…/Images/020bb43d5c6c70fe73654239806dd078.png>

ChatGPT's visualization of a small language model hard at work. Image by ChatGPT. OpenAI. (2024). ChatGPT (4) [Large language model]. chat.openai.com

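Motivation

My motivation for writing this article is that Phi3 is one of Microsoft's most recently released large language models, which makes it an interesting model to test. Smaller language models are also particularly interesting because they have to get more out of fewer parameters than larger models do. They can also run on smaller devices, and a language model that runs locally on your phone could be a major step forward for AI.

This article is also part of a series in which I test the latest models released in the machine learning space. I have previously written about testing two other language models: TinyLlama, which like Phi3 is a smaller language model, and Llama3, Meta's latest large language model:

Unleash Llama3: How to Use the Latest Big-Tech Open-Source LLM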

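Running the model locally

One of the easiest ways to run Phi3 locally on your computer is to use Ollama. First, download Ollama from the Ollama website. Make sure the Ollama application is installed and running whenever you run the Python code, since Python needs it to communicate with Ollama. You can then install the Ollama pip package with the following command: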

pip install ollama

Now you can download any Ollama model with the following code:

import ollama

ollama.pull("<model name>:<model tag>")

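You can find many different models in the Ollama model library, though this article only discusses the Phi3 models. As of writing, there are six different Phi3 models on Ollama, identified by the following tags:

1. latest
2. 3.8B
3. instruct
4. mini
5. 3.8b-mini-instruct-4k-q4_k_M
6. 3.8b-mini-instruct-4k-fp16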

So to install any of the models, you can run:

ollama.pull("phi3:<model tag>")
# for example
ollama.pull("phi3:instruct")

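Instruct means the model was trained to follow specific instructions, and it is the version I recommend using. Tags 5 and 6 (the last two in the list) are models quantized to q4 and fp16, where quantization is the process of lowering the precision of the model's parameters to reduce the compute and storage the language model requires. Unfortunately, as of writing, only the 4K context window version is available on Ollama. Later in this article, in the section on testing information extraction, I will still show you how to run the 128K context length version, which is a bit more involved. Running the ollama.pull command downloads the model, after which you can prompt it with the following code (in this case using the instruct version of Phi3):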

import ollama

def prompt_phi3(prompt, system_prompt="You are a great question answering machine that answers in a concise manner."):
    # Include the system message only when a system prompt is given
    if len(system_prompt) > 0:
        response = ollama.chat(model="phi3:instruct", messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': prompt},
        ])
    else:
        response = ollama.chat(model="phi3:instruct", messages=[
            {'role': 'user', 'content': prompt},
        ])
    # Return only the text of the model's reply
    return response["message"]["content"]
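To make the effect of the quantized tags mentioned above more concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone need at different precisions. It assumes 3.8 billion parameters (from the "3.8b" in the tag names) and exactly 4 bits per weight for the q4 variant; real quantized files carry some overhead, so treat the numbers as estimates. The function name is my own and is not part of Ollama.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to store the weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 3.8e9  # assumption: taken from the "3.8b" in the tag names

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
q4_gb = weight_memory_gb(NUM_PARAMS, 4)  # assumption: exactly 4 bits per weight

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

Under these assumptions the fp16 weights take roughly four times the space of the 4-bit weights, which is the kind of saving that makes a model practical on smaller devices.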

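Testing the model

I will run several tests to check Phi3's performance. I reuse the tests from my articles on TinyLlama and Llama3, and I also use an improved information extraction test to really put Phi3's 128K context length to work. I designed these tests by thinking about which tasks I typically ask LLMs to perform. That is the essence of what a test should show: how well a machine learning model performs the tasks you ask of it. I describe this further in the following article on testing graph quality:

How to Test Graph Quality to Improve Graph Machine Learning Performance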

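Testing concise answers with simple prompts

Prompt: What is the capital of Norway? Only output the answer, and nothing else.

Phi3 response: Oslo

Prompt: What are the first ten digits of pi? Only output the answer, and nothing else.

Phi3 response: 3141592653

(This is the correct answer)

Prompt: How many stars are on the American flag? Only output the answer, and nothing else.

Phi3 response: 50

Prompt: Who is the CEO of Microsoft?

Phi3 response: The current CEO of Microsoft is Satya Nadella, who took office on February 4, 2014.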

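My thoughts:

Phi3 performs very well here, giving clear and concise answers. It answers all four questions correctly, and for the first three, where it was asked to output only the answer, it does exactly that. Note that when Phi3 is not specifically prompted to output only the answer, it responds with a full sentence, which I consider a natural way to answer. Overall, Phi3 passes this test with flying colors.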
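I judged the answers above by eye, but the same check can be automated. The following is a minimal sketch, not the method used for this article: it normalizes a response and compares it with the expected answer, so a reply counts as concise only if it contains the answer and nothing else. The helper names are hypothetical.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, trim whitespace, and drop punctuation before comparing."""
    stripped = answer.strip().lower()
    return stripped.translate(str.maketrans("", "", string.punctuation))


def is_concise_match(response: str, expected: str) -> bool:
    """True when the response is exactly the expected answer and nothing else."""
    return normalize(response) == normalize(expected)


# Responses copied from the tests above, paired with the expected answers.
checks = [
    ("Oslo", "oslo"),
    ("3141592653", "3.141592653"),
    ("50", "50"),
]
results = [is_concise_match(response, expected) for response, expected in checks]
print(results)
```

A full-sentence reply such as the one about Satya Nadella would not pass this strict check, which is the intended behavior when the prompt asks for the answer only.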

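Testing object formatting ability

As when testing Llama3, I will also ask Phi3 to summarize a text and respond in JSON format. The text was generated by Llama3 and looks like this: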

The majestic edifice stands tall and proud, its grandeur commanding attention from all who lay eyes on it. The Building of Elegance, as it is affectionately known, is an architectural masterpiece that has been a beloved landmark in the city for generations.

As you approach the structure, you can't help but be struck by its sheer scale and majesty. Rising high into the sky, the building's façade gleams with a subtle sheen, its cream-colored stones polished to perfection by years of gentle wear. The entrance, flanked by grandiose columns, is a symphony of ornate carvings and intricate moldings that seem to dance across the surface.

Stepping inside, you're enveloped in a warm, welcoming atmosphere. The lobby's high ceiling soars above you like a vaulted sky, adorned with delicate frescoes depicting scenes of mythological grandeur. A gleaming marble floor stretches out before you, reflecting the soft glow of overhead lights and casting a gentle sheen across the room.

To either side of the entrance, sweeping staircases curve upward like great wings, leading visitors to the various levels above. The air is filled with the soft hum of activity - whispers, footsteps, and the occasional rustle of papers - as people go about their daily lives within these hallowed halls.

As you look around, your gaze falls upon an array of stunning features. To one side stands a magnificent chandelier, its delicate crystals refracting light into a kaleidoscope of colors that seem to shift and shimmer with every movement. Nearby, a grand fountain bubbles and splashes, its gentle song providing a soothing accompaniment to the bustle below.

The building's interior is just as impressive as its exterior. Halls lined with gleaming wooden paneling stretch out in all directions, punctuated by occasional doorways leading to various offices, meeting rooms, or quiet nooks for contemplation. Every now and then, a stunning stained-glass window bursts forth from the stone walls, casting vibrant hues across the floor.

The Building of Elegance has been home to countless stories over the years - whispers of romance, triumphs of industry, and moments of quiet reflection. It is a place where memories are made, where dreams take shape, and where the passing of time is measured by the soft ticking of clockwork mechanisms hidden away in the walls.

As you wander through these hallowed halls, you can't help but feel a sense of reverence for this magnificent structure - a true gem of architecture that has stood the test of time. And as you look up at its soaring heights, you can't help but wonder what secrets it holds within its ancient stones, waiting to be discovered by those who come after.

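The point here, however, is not the content of the text but whether Phi3 can take this text, provide some useful information about it, and return the response as valid JSON. This is a critical task for LLMs, since people often want some data summarized, and the response has to be a valid object because turning invalid JSON into valid JSON programmatically can be hard. Responding in perfect JSON also requires the LLM's output to be close to flawless, since any mistake in forming the JSON object makes it invalid.

I prompt Phi3 with the following prompt, where building_text is the text shown above: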

prompt_phi3(f"Sumarize the following text in a JSON format. Only respond with a JSON object. Text:{building_text}.")

And the first response Phi3 gave was:

{'title':'The Building of Elegance','description':'A majestic edifice known as The Building of Elegance stands tall and proud in the city. Its grandeur commands attention with a sheer scale, cream-colored stones polished by years, grandiose columns at its entrance, delicate frescoes adorning its lobby ceiling, sweeping staircases leading to various levels above, magnificent chandelier and bubbling fountain, and halls lined with gleaming wooden paneling. This architectural masterpiece has been a beloved landmark for generations.','features':[{'description':'Sheer scale','location':'entire building'},{'description':'Cream-colored stones polished by years','location':'façade'},{'description':'Grandiose columns at entrance','location':'entrance'},{'description':'Delicate frescoes adorning lobby ceiling','location':'lobby'},{'description':'Sweeping staircases leading to various levels above','location':'entrance'},{'description':'Magnificent chandelier refracting light into kaleidoscope of colors','location':'side of entrance'},{'description':'Grand fountain bubbling and splashing softly','location':'nearby chandelier'},{'description':'Halls lined with gleaming wooden paneling','location':'various areas inside the building'},{'description':'Stunning stained-glass windows casting vibrant hues across floor','location':'walls of various rooms'}]}

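This response looks good. Now it is time to see whether Phi3 can create valid JSON objects consistently. I will run the same prompt as above 100 times to see how often Phi3 fails to produce a valid JSON object. To run this test, you first have to install tqdm with the following command: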

pip install tqdm

You can then run the experiment:

# test ability to make valid JSON object
import json
from tqdm.auto import tqdm

failed_count = 0
num_runs = 100
for _ in tqdm(range(num_runs)):
    try:
        json_string = prompt_phi3(f"Sumarize the following text in a JSON format. Only respond with a JSON object. Text:{building_text}.")
        # json.loads raises an exception if the response is not valid JSON
        obj = json.loads(json_string)
        print("Success")
    except Exception as e:
        failed_count += 1
        print("Failed: ", e)
print(f"Failed {failed_count/num_runs*100}% of the time to make a valid JSON object")

Running this code showed that Phi3 failed to provide a valid JSON object in 91/100 runs.
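The experiment above uses strict parsing: the whole response has to be valid JSON. I did not record why the individual runs failed, so the following is only a sketch of a more lenient parser that one could try, on the assumption that some failures come from extra text around an otherwise valid object. The function name is hypothetical, and this was not part of the test reported here.

```python
import json
import re


def parse_json_response(text: str):
    """Try strict parsing first, then fall back to the outermost {...} span."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Greedy match from the first "{" to the last "}" in the response
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))


# Made-up response for illustration, with prose around the JSON object.
wrapped = 'Sure! {"title": "The Building of Elegance"} Hope this helps.'
print(parse_json_response(wrapped))
```

If the object itself is malformed, the fallback still raises an error, so this only helps with surrounding text and not with broken JSON.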

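My thoughts

Phi3 can produce valid JSON objects, as shown by the first response it gave and by the fact that it succeeded in 9% of the runs. However, I find a 9% success rate disappointing. If this were an automated system with no human in the loop, you would need to run the code about 11 times on average (1/0.09 ≈ 11.1) before Phi3 produced a valid JSON summary. That is not feasible in practice, so based on this test I do not think Phi3 is particularly good at returning formatted objects.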
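The figure of roughly 11 runs follows from treating each run as an independent try with a 9% chance of success, which is an assumption on my part. A minimal sketch of that arithmetic, together with a retry loop of the kind an automated system would need, could look like this. Both function names are hypothetical, and generate stands in for any zero-argument function that returns the model's text.

```python
import json


def expected_attempts(success_rate: float) -> float:
    """Mean number of independent tries until the first success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


def generate_valid_json(generate, max_attempts: int = 30):
    """Call generate() until it returns parseable JSON; return (object, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return json.loads(generate()), attempt
        except json.JSONDecodeError:
            continue
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts")


print(round(expected_attempts(0.09), 1))
```

With the helper from earlier in the article, the loop could be called as generate_valid_json(lambda: prompt_phi3(...)), at the cost of many wasted model calls when the success rate is this low.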

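Testing information extraction / context length utilization

Another important aspect of LLMs is their ability to perform information extraction. In this context, information extraction means giving the LLM a large amount of text and then asking specific questions about that text. The model's context length naturally matters here, since a longer context length allows the model to be prompted with longer texts to extract information from.

For this test, I also switched to the 128K context length version of Phi3. To use this model, you should follow the instructions on this GitHub page for running the Phi3 model with ONNX files. Also note that the minimum and maximum number of tokens you set when following that GitHub page count the input tokens as well.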
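Because the token limits count the prompt as well, the maximum has to leave room for both the input and the answer. A minimal sketch of that arithmetic follows. The function and parameter names are hypothetical and do not correspond to actual flags of the script on the GitHub page, and the numbers are only illustrative.

```python
def required_max_length(num_input_tokens: int, num_new_tokens: int) -> int:
    """Total token limit needed when the limit counts prompt and generated tokens together."""
    if num_input_tokens < 0 or num_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return num_input_tokens + num_new_tokens


# Illustrative numbers: a 16,000-token prompt and room for a 200-token answer.
print(required_max_length(16_000, 200))
```

Setting the maximum to the desired answer length alone would leave no room for generation once a long prompt has been counted against it.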

<…/Images/3c325ccb22b2213fa19a0f7934870d1f.png>

ChatGPT's imagination of an LLM performing information extraction. Image by ChatGPT. OpenAI. (2024). ChatGPT (4) [Large language model]. chat.openai.com

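Phi3 comes in two variants, one with a 4K context length and one with a 128K context length. A 128K context length is impressive, but it is important to check that the model actually uses the full 128K tokens. To test this, I generated a long text of around 100K tokens. I then have one specific sentence that I want the model to extract:

The company is on floor number 27

So I surround the sentence about which floor the company is on with 100K tokens (around 75K words) of text, and then prompt Phi3 to extract the floor. To test whether the model can use its full context length, I try this ten times, placing the sentence at 10 different positions in the text. My code for this is as follows. I first load the generated text, which I have made sure does not mention the floor the company is on: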

with open(r"random_text.txt", encoding="utf-8") as f:
    random_text = f.read()
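A quick sanity check on the size of the loaded text can be useful before running long prompts. The sketch below is a rough estimate only: it uses the ratio implied by "100K tokens, around 75K words" and a whitespace word count instead of the model's real tokenizer. The function and constant names are my own.

```python
WORDS_PER_TOKEN = 0.75  # assumption: ratio implied by "100K tokens (around 75K words)"


def estimate_tokens(text: str, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Rough token estimate from a whitespace word count; not a real tokenizer."""
    return round(len(text.split()) / words_per_token)


# Toy stand-in for the noise text: 75,000 identical words.
sample = "word " * 75_000
print(estimate_tokens(sample))
```

Calling estimate_tokens(random_text) on the loaded file gives a ballpark figure; the exact count depends on the tokenizer and the language of the text.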

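I then have a function to insert the important information:

important_information = "The company is on floor number 27. "

def insert_text(full_text, text_to_insert, index):
    # The index must lie within the text, inclusive of both ends
    assert index >= 0 and index <= len(full_text)
    return f"{full_text[:index]}{text_to_insert}{full_text[index:]}"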

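Finally, I use np.linspace to get ten evenly spaced indices across the length of the text. I insert the important information at one of these indices at a time and prompt the model. Note that random_text here is text I generated with Llama3 to serve as noise; it can be any other text you want to use.

import numpy as np
from tqdm.auto import tqdm

# get 10 indices evenly split over length of random text
indices = np.linspace(0, len(random_text), 10, dtype=int)
responses = []
for idx in tqdm(indices):
    random_text_with_info = insert_text(random_text, important_information, idx)
    assert important_information in random_text_with_info
    prompt = f"In the following text:{random_text_with_info}, what is the floor number of the company?"
    print("PROMPT:", prompt)
    # main and args are assumed to come from the ONNX example script followed above
    response = main(args, prompt)
    responses.append(response)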

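I wanted to test the model with the full 128K tokens, but unfortunately I could not because of compute limitations. I did, however, run the test with 16K tokens. Initially I had trouble getting the model to find the correct answer, but after trying some different prompts I got it to find the correct answer in 6/10 cases with 16K tokens.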
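A score like the 6/10 above can be computed from the collected responses with a small helper. This is a minimal sketch under the assumption that a response counts as correct when it contains the number 27 as a standalone number; the function names are hypothetical, and the example responses are made up for illustration and are not Phi3's actual outputs.

```python
import re


def mentions_floor(response: str, floor: int = 27) -> bool:
    """True if the response contains the floor number as a standalone number."""
    return re.search(rf"\b{floor}\b", response) is not None


def score_responses(responses, floor: int = 27):
    """Return (number of hits, per-position hit list) for the insertion positions."""
    hits = [mentions_floor(response, floor) for response in responses]
    return sum(hits), hits


# Made-up responses for illustration only.
example_responses = [
    "The company is on floor 27.",
    "The text does not mention a floor.",
    "Floor number 270",
]
correct, per_position = score_responses(example_responses)
print(f"{correct}/{len(example_responses)} correct")
```

Keeping the per-position list also shows where in the text the model misses the sentence, which a single overall score hides.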

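My thoughts

Although I could not test the full 128K context length of Phi3, locating the important information in only 6/10 instances indicates that the model cannot make good use of its context. In addition, like many other LLMs, Phi3 turned out to be very sensitive to the prompt in this test: I had to adjust the prompt several times before Phi3 found the correct answer. Sensitivity to prompt wording is a challenge many LLMs share, but it can unfortunately have a strong negative effect on how useful the model is for information extraction. For example, if I have to scan a text where the information to extract is stated less explicitly, or where I cannot be sure the information is present at all (which makes it impossible to tune the prompt until the information is found), the weakness described in this section becomes a serious drawback.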

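My overall thoughts on Phi3

Overall, Phi3 is an interesting model that is good at answering questions concisely. Phi3 is a smaller model, and inference with Phi3 is much faster than with Llama3. The faster inference time can be a reason to use Phi3 for typical LLM tasks such as question answering or information extraction.

Phi3 also has some drawbacks, in particular its unreliability at returning formatted responses such as JSON objects and its sensitivity to the prompt in information extraction. Having to tune the prompt to find a specific piece of information, such as the floor number, is a serious drawback and a weakness to keep in mind when using Phi3 for information extraction tasks.

Even allowing for the fact that Phi3 is a small model, I am not too impressed with its capabilities, mainly because of its low performance on formatted responses and information extraction. It is cool that a model with so few parameters offers a 128K context length option. However, when the model retrieves the correct information in only 6/10 instances with a 16K context, that indicates it cannot fully utilize its context length.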

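Conclusion

In this article, I discussed Microsoft's new language model Phi3. I also discussed my motivation for testing new models: keeping up with the latest innovations in machine learning. I then showed you how to run Phi3 locally on your computer and how to run several tests to check its performance. Finally, I shared my thoughts on the model: a small model that performs well at concise question answering but is mediocre at returning formatted responses and at information extraction.

You can also read my articles on WordPress.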