Original:
towardsdatascience.com/how-to-use-and-test-wizardlm2-microsofts-new-llm-2786a1a85874
This article discusses how to use WizardLM2, Microsoft's new language model. It also covers how to test the model (and language models in general) to get an initial sense of its performance. In addition, I will discuss the model's strengths and weaknesses and share my thoughts on how it performs.
https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/d22af8c90321c0fb0d5af2b36ef4761a.png
ChatGPT's visualization of running the WizardLM2 language model. Image by ChatGPT. Prompt: "Make an image of running the wizardlm2 language model". ChatGPT, 4, OpenAI, May 5, 2024. chat.openai.com.
Table of Contents
· Motivation · Running WizardLM2 Locally · Testing WizardLM2 ∘ Testing Concise Question Answering ∘ Testing Formatted Responses ∘ Testing Information Extraction from Context · My Overall Thoughts · Conclusion
Motivation
My motivation for writing this article is to test the latest models within machine learning. To keep up with all the progress, I follow sites like PapersWithCode, GitHub Trending, and HuggingFace. I found this model through a HuggingFace announcement and then got it running and tested it as quickly as possible. I highly recommend regularly reading sites like PapersWithCode and HuggingFace to stay on top of all the advances in AI. Knowing the latest available (open-source) models and their capabilities can be a huge advantage when tackling a variety of data science problems.
I have previously written about several new machine learning models, including Phi3, Llama3, TinyLlama, and the DocOwl visual language model in the link below, among others.
Query Your Documents with a Multimodal Document Machine Learning Model
Running WizardLM2 Locally
One of the simplest ways to run WizardLM2 locally in Python is with the Ollama application. To do this, first download the application from the Ollama website. Then install and run it; the application must be running whenever you use Ollama from Python.
You can then visit the WizardLM2 page in the Ollama model library to browse the different versions of WizardLM2 available for download. At the time of writing, there are 22 versions of WizardLM2 on Ollama. WizardLM2 comes in two main models: a 7B-parameter model and a 22B-parameter model. The remaining models are quantized versions of these two. A quantized model means the model weights are stored at lower precision, which makes the model take up less storage and makes inference with the model faster. If you are running on a less powerful machine, I recommend a quantized version of the WizardLM2 7B-parameter model. In Python, you can download the smallest WizardLM2 model with the following command:
```python
import ollama

ollama.pull("wizardlm2:7b-q2_K")
```

You can also find all the other models available for download on the WizardLM2 model page on Ollama.
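To get a feel for what quantization buys you, here is a back-of-envelope sketch (my own illustration, not from the article or the Ollama docs) of the approximate weight storage for a 7B-parameter model at different bit widths:

```python
# Back-of-envelope estimate of weight storage for a 7B-parameter model
# at different quantization bit widths (weights only, ignoring runtime
# overhead such as the KV cache).
def approx_model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

params = 7e9
for label, bits in [("fp16", 16), ("q8", 8), ("q4", 4), ("q2", 2)]:
    print(f"{label}: ~{approx_model_size_gb(params, bits):.2f} GB")
# fp16: ~14.00 GB, q8: ~7.00 GB, q4: ~3.50 GB, q2: ~1.75 GB
```

By this rough estimate, a 2-bit quantization like the q2_K variant used above needs only a couple of gigabytes for its weights, which is why it is a good fit for less powerful machines.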
To prompt the model, you can use the following function:
```python
def prompt_wizard(prompt, system_prompt="You are a great question answering machine that answers in a concise manner."):
    if len(system_prompt) > 0:
        response = ollama.chat(
            model="wizardlm2:7b-q2_K",
            messages=[
                {'role': 'system', 'content': system_prompt},
                {'role': 'user', 'content': prompt},
            ],
        )
    else:
        response = ollama.chat(
            model="wizardlm2:7b-q2_K",
            messages=[
                {'role': 'user', 'content': prompt},
            ],
        )
    return response["message"]["content"]
```

The function calls Ollama with or without a system prompt. The if-else statement is necessary because you get an error if you give the ollama.chat function an empty system prompt. Also, if you use a model tag other than the smallest model (7b-q2_K), you must change the model tag in the code above.
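The branching in this function can also be isolated into a small pure helper, which makes the logic easy to check without a running Ollama server. This is my own sketch (`build_messages` is a hypothetical name, not part of the article's code):

```python
def build_messages(prompt: str, system_prompt: str = "") -> list[dict]:
    """Build the messages list for ollama.chat, omitting the system
    message when the system prompt is empty (passing an empty system
    prompt to ollama.chat raises an error)."""
    messages = []
    if len(system_prompt) > 0:
        messages.append({'role': 'system', 'content': system_prompt})
    messages.append({'role': 'user', 'content': prompt})
    return messages

print(build_messages("What is 2+2?"))
# [{'role': 'user', 'content': 'What is 2+2?'}]
```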
Testing WizardLM2
https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/beb787b24594b8e9009fcb06adae3936.png
ChatGPT's imagining of testing a large language model. Image by ChatGPT. Prompt: "Make an image of testing a large language model". ChatGPT, 4, OpenAI, May 5, 2024. chat.openai.com.
Testing Concise Question Answering
The first test of the model is similar to the tests I ran on other language models such as Llama3, Phi3, and TinyLlama: testing the model's question-answering ability. Since most advanced large language models today answer questions correctly, an extra challenge is added: answer only the question, without any text surrounding the answer. The ability of an LLM to respond concisely with just the answer is important, especially if the LLM's output is to be used directly in an automated system. A human can naturally filter out the exact answer from an LLM's output, but if the output is part of a larger automated system, it is critical that the LLM can output the exact answer directly without any additional text.
I gave the model the following prompts:
```python
prompts = [
    "What is the capital of the USA? Only output the answer, nothing else",
    "What are the 10 first digits of Eulers number. Only output the answer, nothing else",
    "Mention 3 well-known airline companies. Only output the answer, nothing else",
    "How many countries are in the EU? Only output the answer, nothing else",
]
```

The model was then prompted like this:
```python
for prompt in prompts:
    print(prompt_wizard(prompt))
```

The model responded:
```
["The capital of the USA is Washington, D.C. (Washington, District of Columbia). However, the federal government's administrative 'headquarters' are in various locations due to the nature of a government spread across different cities as per the U.S. Constitution and its amendments. The most prominent among them is the Capitol in Washington, D.C. where both the Senate and House of Representatives meet.",
"The first ten digits of Euler's number (e) are approximately 28415926535.\n\n(Note: Euler's number is an irrational and transcendental number with an infinite sequence of non-repeating digits, so the above figure is an approximation of the first ten digits after 284.)\n\nTo see the next part of the sequence, one would typically add a few more digits (e.g., 10106) to reach the next integer value (28415926535 + 10106 = 2841592736) and then continue with the actual base-2 calculations to find the following digits.",
"1. Delta Air Lines\n\n2. American Airlines\n\n3. United Airlines (formerly known as United Airlines and also as UAL)",
"27 countries plus Portugal, as of my knowledge cutoff in 2023, are members of the European Union (EU), following the departure of the United Kingdom. The EU itself is an supranational organization of those member states. However, there are also a few countries that are part of the EU's single market but not part of its governance structures, often referred to as 'EU27' plus the UK."]
```

These results are disappointing. The model answered questions 1, 3, and 4 correctly, but its answer to question 2 is wrong. Worse, the model could not provide the kind of concise answers I have seen from Phi3 and Llama3. One possible reason is that I am using a quantized model, although Phi3 is similar in size to the quantized version of WizardLM2 I used, so I expected the model to do better.
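If you want to score this kind of conciseness test automatically rather than by eye, a simple sketch could look like the following (`is_concise_answer` is my own hypothetical helper with light normalization, not from the article):

```python
def is_concise_answer(response: str, expected: str) -> bool:
    """True if the response is exactly the expected answer, ignoring
    case, surrounding whitespace, and a trailing period."""
    def normalize(text: str) -> str:
        return text.strip().rstrip(".").lower()
    return normalize(response) == normalize(expected)

# A bare answer passes; a wordy response like the first one above fails.
print(is_concise_answer("Washington, D.C.", "washington, d.c"))  # True
print(is_concise_answer(
    "The capital of the USA is Washington, D.C.", "Washington, D.C."))  # False
```

A stricter or looser normalization (punctuation, articles) can be swapped in depending on how the answers are phrased.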
Testing Formatted Responses
Another interesting test I have run on several other language models is testing the LLM's ability to output a formatted response. One of the most common output formats to request from an LLM is a JSON object, and this test asks the model to do exactly that. Given a 435-word text describing a building (randomly generated with Llama3), the WizardLM2 model is asked to summarize the described building in JSON format. I then convert the response into a JSON object. If the model's output cannot be converted directly into a JSON object, the model fails; if the output can be loaded directly into a JSON object, the model succeeds. The experiment is repeated 100 times to make the results as reliable as possible. The experiment is run with the following code:
First, load the text describing the building:
```python
building_text = """The majestic edifice stands tall and proud, its grandeur commanding attention from all who lay eyes on it. The Building of Elegance, as it is affectionately known, is an architectural masterpiece that has been a beloved landmark in the city for generations. As you approach the structure, you can't help but be struck by its sheer scale and majesty. Rising high into the sky, the building's façade gleams with a subtle sheen, its cream-colored stones polished to perfection by years of gentle wear. The entrance, flanked by grandiose columns, is a symphony of ornate carvings and intricate moldings that seem to dance across the surface. Stepping inside, you're enveloped in a warm, welcoming atmosphere. The lobby's high ceiling soars above you like a vaulted sky, adorned with delicate frescoes depicting scenes of mythological grandeur. A gleaming marble floor stretches out before you, reflecting the soft glow of overhead lights and casting a gentle sheen across the room. To either side of the entrance, sweeping staircases curve upward like great wings, leading visitors to the various levels above. The air is filled with the soft hum of activity – whispers, footsteps, and the occasional rustle of papers – as people go about their daily lives within these hallowed halls. As you look around, your gaze falls upon an array of stunning features. To one side stands a magnificent chandelier, its delicate crystals refracting light into a kaleidoscope of colors that seem to shift and shimmer with every movement. Nearby, a grand fountain bubbles and splashes, its gentle song providing a soothing accompaniment to the bustle below. The building's interior is just as impressive as its exterior. Halls lined with gleaming wooden paneling stretch out in all directions, punctuated by occasional doorways leading to various offices, meeting rooms, or quiet nooks for contemplation. Every now and then, a stunning stained-glass window bursts forth from the stone walls, casting vibrant hues across the floor. The Building of Elegance has been home to countless stories over the years – whispers of romance, triumphs of industry, and moments of quiet reflection. It is a place where memories are made, where dreams take shape, and where the passing of time is measured by the soft ticking of clockwork mechanisms hidden away in the walls. As you wander through these hallowed halls, you can't help but feel a sense of reverence for this magnificent structure – a true gem of architecture that has stood the test of time. And as you look up at its soaring heights, you can't help but wonder what secrets it holds within its ancient stones, waiting to be discovered by those who come after."""
```

Then the WizardLM2 model is prompted, and I attempt to convert its output into a JSON object 100 times, printing the number of times the model could directly output a correctly formatted JSON object.
```python
# test ability to make valid JSON object
import json
from tqdm.auto import tqdm

failed_count = 0
num_runs = 100
for _ in tqdm(range(num_runs)):
    try:
        json_string = prompt_wizard(f"Sumarize the following text in a JSON format. Only respond with a JSON object. Text: {building_text}.")
        obj = json.loads(json_string)
        print("Success")
    except Exception as e:
        failed_count += 1
        print("Failed: ", e)

print(f"Failed {failed_count/num_runs*100}% of the time to make a valid JSON object")
```

In the cases where the model failed to create a correctly formatted object directly, there were only minor errors in the JSON object that made it invalid. Still, this is a good test of how well the model can follow instructions and respond in the desired format.
The model created a correctly formatted JSON object in 16 out of 100 cases, giving a rate of 16%. This is higher than the Phi3 model's, although I would still consider this quite poor performance. Receiving a valid JSON object only 16% of the time means you would on average need to prompt the model 6.25 times to get one correctly formatted response.
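That 6.25 figure is just 1 / 0.16. In practice you would wrap the call in a retry loop; here is a minimal sketch (`generate_valid_json` is a hypothetical wrapper of my own, with the generation function passed in so it can be stubbed without a model):

```python
import json

def generate_valid_json(generate, max_attempts=10):
    """Call `generate()` until the returned string parses as JSON, or
    give up after `max_attempts`. Returns (parsed_object, attempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return json.loads(generate()), attempt
        except json.JSONDecodeError:
            continue
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts")

# With a 16% per-call success rate, the expected number of calls per
# valid object is 1 / 0.16 = 6.25.
```

In the real loop, `generate` would be something like `lambda: prompt_wizard(...)`; injecting it keeps the retry logic testable on its own.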
Testing Information Extraction from Context
Another important aspect of an LLM is its ability to extract information from context. This test therefore gives the WizardLM2 model a large amount of text in which a single sentence contains the relevant information we want. The model is then asked to retrieve that relevant information. I used the random Latin text generator on this website to generate the random text. One unfortunate aspect of creating text this way is that the text to be extracted is in a different language from the noise text, although it is a quick and easy way to generate large amounts of text.
One scenario where information extraction is critical is in retrieval-augmented generation (RAG) systems, where the model is given context retrieved via vector similarity from encoded text strings and asked to answer a question given that context. I cover this in more detail in the article below:
How to Build a RAG System to Get Powerful Access to Your Data
For this experiment, I followed the same format as in my Phi3 test, though with different text surrounding the important information. I first loaded 990 random Latin words:
# load random text
random_text = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin tincidunt dapibus leo, in dictum nunc volutpat et. In id suscipit velit, vel rutrum velit. Donec vel vehicula nulla, id accumsan felis. Integer consectetur felis purus, a rhoncus felis fermentum sit amet. Donec tempor, quam eu feugiat pulvinar, turpis elit placerat erat, vel commodo ante ex eu nisi. In hac habitasse platea dictumst. Integer vel dapibus ex, id cursus ligula. Sed posuere orci nec ex consequat, a sollicitudin dolor posuere. Proin blandit, nulla at auctor iaculis, neque odio efficitur orci, vitae malesuada urna est eget leo. Fusce aliquet tempor rutrum. Ut eget neque dolor. In hac habitasse platea dictumst. Ut rutrum lectus eget facilisis tristique. Donec nec ante tortor. Nam vitae mi eu dui pulvinar posuere. Mauris sagittis viverra lorem, vitae interdum neque ultricies quis. Donec at arcu non ante lobortis malesuada. Cras molestie convallis lorem, non laoreet leo tristique et. Donec et nibh nisi. Nam interdum libero ac ante aliquam consequat. In tristique a nisi et euismod. Proin consectetur ipsum a leo feugiat euismod. Nam sed libero sollicitudin, aliquet mauris eu, sodales sem. Phasellus vulputate leo eu odio accumsan, ut lacinia est lobortis. In hac habitasse platea dictumst. Nunc id nulla molestie, sodales leo a, imperdiet libero. Phasellus dapibus tellus quis magna ultricies, ac cursus odio faucibus. Pellentesque ullamcorper sodales turpis, eget rhoncus massa imperdiet eget. Proin vel feugiat ligula, in ornare turpis. Morbi eget nunc vitae arcu sagittis posuere non eget nisi. Pellentesque laoreet ut ex et tempus. Sed sit amet erat libero. Donec a neque metus. Phasellus risus sapien, lobortis id tristique nec, tincidunt non sem. Cras et urna sed erat congue mollis non non urna. Quisque pretium consectetur suscipit. Praesent iaculis, tortor ut congue finibus, magna augue tempus ligula, et pharetra eros mauris a nisi. Nulla venenatis gravida blandit.
Aenean ultricies orci ipsum, eget blandit elit vulputate vel. Etiam sit amet dapibus eros, eu semper orci. Donec gravida lacinia convallis. In nec est eget nisi lobortis mollis sit amet sit amet risus. Proin sit amet enim mauris. Nunc imperdiet elit commodo dolor posuere pharetra. Suspendisse nec dapibus mi, at tincidunt neque. Duis blandit eget urna ac porta. Morbi et ipsum nec mi sagittis eleifend vitae at tellus. Maecenas vel hendrerit arcu. Fusce ac sem justo. Nullam mauris magna, pharetra sit amet quam ac, ullamcorper venenatis magna. Donec eget nisi enim. Curabitur nec congue dolor. Suspendisse eget efficitur dolor. In varius libero sit amet lectus porta, ut vulputate dolor rutrum. In tincidunt dui sem, non pulvinar odio tincidunt a. Aliquam at ante eu tellus consequat tincidunt vel sit amet velit. Aenean pharetra ullamcorper urna, luctus egestas nisl ultrices quis. Vestibulum sodales id tellus sed rutrum. Ut at accumsan tortor. Curabitur pretium magna viverra, viverra nulla et, tempor velit. Quisque massa nisl, rhoncus et volutpat a, tincidunt vel elit. Sed quis leo turpis. Pellentesque et est dictum, gravida magna id, aliquet leo. In odio arcu, dictum nec suscipit id, scelerisque id risus. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Aliquam consectetur luctus posuere. Aliquam nec ultricies ipsum. Nam vulputate varius congue. Nullam quis leo sed enim aliquet feugiat nec nec felis. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus viverra, quam non ornare dignissim, dui nulla fermentum est, nec finibus orci mi nec ligula. Nam lobortis mauris lacus, eu placerat nulla vulputate ut. Nulla facilisi. Nunc molestie neque lorem, in facilisis nibh ornare in. Nunc nec metus nec nunc suscipit blandit. Duis cursus sollicitudin ipsum, tincidunt commodo sem. Aenean in feugiat purus. Proin suscipit pellentesque dolor ut pharetra. Sed sollicitudin nisl egestas sodales rutrum. 
Vivamus laoreet enim eleifend eros pellentesque laoreet. Nullam tincidunt massa ut felis imperdiet, ac aliquet libero mattis. Nulla tempor sem quis sapien malesuada aliquam. Sed rhoncus tortor a nisi scelerisque fermentum. Suspendisse elit nulla, tempor eu posuere quis, aliquam in tortor. Donec vitae dictum lectus. Duis tempor vestibulum velit, ut convallis enim aliquet vitae. Sed at nibh et risus vulputate fermentum et vitae urna. Sed accumsan felis vitae est ornare semper. Aliquam rhoncus, massa ut tempor vulputate, augue tellus lobortis nunc, eget sollicitudin lorem dolor non nibh. Cras sagittis augue suscipit congue imperdiet. Nunc pharetra, neque non porta congue, magna tellus iaculis quam, vel pellentesque magna quam pulvinar nisi. Etiam quis condimentum ex. Nullam gravida velit eu justo accumsan tincidunt. Fusce ornare justo et mi egestas, vitae blandit est lobortis. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Suspendisse potenti. Aenean vehicula pellentesque porttitor. Duis ut nulla in quam volutpat euismod vitae id libero. Etiam id scelerisque mauris, condimentum commodo libero. Nam eget ligula ac ipsum consequat iaculis. Donec mollis est a neque placerat porta. Vestibulum sed bibendum diam, nec pretium elit. Sed semper diam metus, ac condimentum neque fringilla vel. In eu felis id diam blandit eleifend. Duis eget sodales lorem. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tincidunt lacus vel vestibulum tincidunt. Maecenas auctor, mi eu cursus feugiat, ligula metus posuere turpis, vel ultrices purus dui a justo. Aliquam dolor elit, dapibus eu nunc a, luctus gravida magna. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Nunc mattis mauris sed arcu malesuada, sit amet tempus libero eleifend. Sed pretium lobortis auctor. Curabitur in magna dolor. Proin ac libero ac justo volutpat rhoncus a sit amet mi. 
Donec vitae libero cursus, vehicula massa in, efficitur ipsum. Cras suscipit accumsan dolor egestas interdum. Mauris rutrum justo non egestas aliquet. Ut eleifend odio at turpis mollis vulputate. Nunc id leo suscipit, facilisis ipsum nec, egestas dui. Aliquam est leo, volutpat eget tellus vitae, luctus finibus metus. Vivamus nec porta sapien. Etiam a blandit est, sollicitudin feugiat est. Interdum et malesuada fames ac ante ipsum primis in faucibus. Sed venenatis magna rhoncus nulla pretium convallis. Sed quis lectus pretium, vehicula sem elementum, iaculis ipsum. Praesent semper vestibulum tellus nec consectetur. Morbi vel eleifend erat. Nullam elit lorem, laoreet a ex nec, malesuada rutrum tellus. Mauris porta tempus bibendum. Nulla."""

Next, you need a function to insert the important information. Note that the important information can be any text; you just have to make sure it states a concrete fact. I will use a text saying the company is on floor 27, and then ask the model which floor the company is on.
```python
important_information = "The company is on floor number 27. "

def insert_text(full_text, text_to_insert, index):
    assert index >= 0 and index <= len(full_text)
    return f"{full_text[:index]}{text_to_insert}{full_text[index:]}"
```

I then insert the important information at different positions in the random Latin text and test whether the model can extract the important information from every region of the context. This ensures the model can use its entire context length, not just part of it. The code that inserts the text and prompts the model is:
```python
import numpy as np

# get 10 indices evenly split over length of random text
indices = np.linspace(0, len(random_text), 10, dtype=int)

responses = []
for idx in tqdm(indices):
    random_text_with_info = insert_text(random_text, important_information, idx)
    assert important_information in random_text_with_info
    prompt = f"In the following text: {random_text_with_info}, what is the floor number of the company?"
    print("PROMPT:", prompt)
    response = prompt_wizard(prompt)
    responses.append(response)
```

You can manually inspect the results to see how often the model answered correctly.
When the information extraction was run ten times, the model correctly extracted the building's floor number six times. This is decent performance, since the task is presumably hard due to the multiple languages used in the text. The model was also correct when the important information appeared late in the context. Across the ten insertions, the important information was placed at 1/10 intervals through the text (meaning the first run placed the important information at the beginning of the text, the second run placed it at roughly 1/10 of the full text length, and the tenth run placed it at the end of the text). This suggests that WizardLM2 makes good use of the later parts of its context but struggles to use the beginning of its context.
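Rather than inspecting the ten responses manually, a small helper can score them automatically. This is a hypothetical sketch of my own (`contains_floor_number` is not from the article), and a pattern check like this can of course miss answers that phrase the floor without the literal number:

```python
import re

def contains_floor_number(response: str, floor: int = 27) -> bool:
    """True if the floor number appears as a standalone number,
    optionally with an ordinal suffix (e.g. '27' or '27th')."""
    return bool(re.search(rf"\b{floor}(?:st|nd|rd|th)?\b", response))

responses = [
    "The company is on floor number 27.",
    "Based on the text, the company is on the 27th floor.",
    "The text does not mention a floor number.",
]
accuracy = sum(contains_floor_number(r) for r in responses) / len(responses)
print(f"Extraction accuracy: {accuracy:.0%}")  # Extraction accuracy: 67%
```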
My Overall Thoughts
My overall thoughts on the WizardLM2 model are mixed. I was disappointed with the model's performance on the concise question-answering task, where it first gave an incorrect response about the first digits of Euler's number. The model also provided more than just the answer in its responses, adding a lot of unnecessary text. This contrasts with Microsoft's Phi3 model, which performed well, as I showed in my Phi3 review article.
In the object-formatting task, however, WizardLM2 could create a valid JSON object around 16% of the time, considerably better than the 9% achieved by the Phi3 model. Unfortunately, creating a valid object only 16% of the time is still a serious limitation of the WizardLM2 model, although this might work better with less quantized versions of WizardLM2 or the larger 22B model version.
Finally, the information extraction task was run, where WizardLM2 correctly extracted the important information in 6 out of 10 instances. That is not bad, but the most notable finding in this experiment is that WizardLM2 performed better when using the later parts of its context while struggling with information placed early in the context. Considering that the important information could be placed anywhere in the context, for example in a RAG system, this could be quite problematic. Furthermore, the information extraction experiment only used around 1,000 words, roughly 1,333 tokens, while WizardLM2's context length is 8K. It is concerning that the model struggles with a context of only around 1,333 tokens when its maximum context length is 8K.
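The token arithmetic here can be made explicit; the ~4/3 tokens-per-word ratio used below is a rough rule of thumb, not an exact tokenizer count:

```python
def approx_tokens_from_words(num_words: int, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate using the common ~4/3 tokens-per-word rule."""
    return round(num_words * tokens_per_word)

context_tokens = approx_tokens_from_words(1000)
print(context_tokens)                  # 1333
print(f"{context_tokens / 8192:.0%}")  # only ~16% of an 8K context
```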
Conclusion
This article began by discussing why you should keep up with the latest models within machine learning, and how you can use sites like PapersWithCode, GitHub Trending, and HuggingFace to stay updated. It then covered implementing and running WizardLM2 locally using the Ollama application and Python package. Moving on, I ran three different tests:
Concise question answering
Correctly formatted output
Information extraction
These tests were used to better understand how the WizardLM2 model performs. Finally, I also shared some of my own thoughts on the WizardLM2 model. Overall, the model's performance does not live up to what we require of language models today.