Is OpenAI's built-in retrieval really that good? A quantitative evaluation gives a clearer picture.

Has a powerful rival to vector databases arrived? Is another batch of startups in this space about to go under?

…

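These are some of the reactions heard across the tech community after OpenAI launched retrieval for its Assistants. The reason is that the feature gives users RAG (retrieval-augmented generation) for question answering over a knowledge base. Until then, vector databases were typically a key component of RAG solutions, used to reduce the "hallucinations" of large language models.

So the question is: which is better, OpenAI's built-in Assistants retrieval or an open-source RAG solution built on a vector database?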

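In the spirit of rigorous verification, we ran a quantitative evaluation of this question, and the results were quite interesting. OpenAI really is strong!

However, it falls somewhat short of an open-source RAG solution built on a vector database.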

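Below we walk through the whole evaluation. It is worth stressing that this kind of evaluation is not easy: a handful of test samples cannot effectively measure every aspect of a RAG application.

That is why we need a fair, objective RAG evaluation tool, a suitable dataset, quantitative scoring and analysis, and reproducible results.

Without further ado, let's get started.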

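I. Evaluation tool

Ragas (https://docs.ragas.io/en/latest/) is an open-source framework for evaluating how well a RAG application works. You only need to supply the questions, contexts, and answers from the RAG process, and Ragas uses that information to score several metrics quantitatively. After installing Ragas with pip, an evaluation takes just a few lines of code: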

```python
from ragas import evaluate
from datasets import Dataset

# prepare your huggingface dataset in the format
# dataset = Dataset({
#     features: ['question', 'contexts', 'answer', 'ground_truths'],
#     num_rows: 25
# })

results = evaluate(dataset)
# {'ragas_score': 0.860, 'context_precision': 0.817,
#  'faithfulness': 0.892, 'answer_relevancy': 0.874}
```

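Ragas provides many metric subcategories for scoring, for example:

• On the generation side, faithfulness and answer relevancy.

• On the retrieval side, context precision, which measures how precise the retrieved knowledge is; context recall, which measures how much of the needed knowledge was retrieved; and context relevancy, which measures how relevant the retrieved content is.

• For comparing the answer with the ground truth, answer semantic similarity, which describes how close the answer is to the ground truth in meaning, and answer correctness, which describes how accurate the answer is.

• For the answer taken on its own, the various aspect critiques.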

[Figure: Ragas metrics]

Image source: https://docs.ragas.io/en/latest/concepts/metrics/index.html
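To make the retrieval-side metrics more concrete, here is a minimal sketch of precision and recall computed by exact match between retrieved passages and labeled ground-truth passages. It is a simplification for illustration only: Ragas judges these metrics with an LLM rather than by string matching, and the function names and sample passages below are hypothetical.

```python
def context_precision(retrieved: list[str], ground_truths: list[str]) -> float:
    """Fraction of retrieved passages that are labeled as relevant."""
    if not retrieved:
        return 0.0
    relevant = set(ground_truths)
    hits = sum(1 for passage in retrieved if passage in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], ground_truths: list[str]) -> float:
    """Fraction of labeled relevant passages that were retrieved."""
    if not ground_truths:
        return 0.0
    found = set(retrieved)
    hits = sum(1 for passage in ground_truths if passage in found)
    return hits / len(ground_truths)


# Made-up passages: 2 of the 4 retrieved are relevant, 2 of the 3 relevant were retrieved.
retrieved = ["passage A", "passage B", "passage X", "passage Y"]
ground_truths = ["passage A", "passage B", "passage C"]

precision = context_precision(retrieved, ground_truths)
recall = context_recall(retrieved, ground_truths)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A retriever can trade one for the other: returning more passages tends to raise recall and lower precision, which is why the two are reported separately.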

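These metrics measure different things. Answer correctness, for example, is outcome-oriented: it directly measures how accurate the answer from the RAG application is. Here is an example contrasting a high and a low answer correctness score: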

```text
Ground truth: Einstein was born in 1879 at Germany.
High answer correctness: In 1879, in Germany, Einstein was born.
Low answer correctness: In Spain, Einstein was born in 1879.
```

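For details on the other metrics, see the official documentation (https://docs.ragas.io/en/latest/concepts/metrics/index.html).

The key point is that each metric measures a different angle, so together they let you assess the quality of a RAG application from all sides.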

II. Evaluation dataset

We use the Financial Opinion Mining and Question Answering (fiqa) dataset (https://sites.google.com/view/fiqa/) as the test dataset, mainly for these reasons:

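• The dataset is from the financial domain. Its corpus comes from highly varied sources and includes human-written answers. It covers niche financial expertise that is unlikely to appear in GPT's training data, so the LLM has probably never seen this knowledge, which makes the dataset well suited to serve as an external knowledge base.

• The dataset was originally built to evaluate information retrieval (IR), so it comes with labeled knowledge passages that can be used directly as the ground truth for retrieval.

• Ragas officially treats it as a standard introductory test dataset (https://docs.ragas.io/en/latest/getstarted/evaluation.html#the-data) and provides a script for building it (https://github.com/explodinggradients/ragas/blob/main/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb). It therefore has a community following and broad acceptance, which makes it a good fit as a baseline.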

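First, a conversion script turns the original fiqa dataset into a format that Ragas can process easily. Here is what the evaluation dataset contains: 647 finance-related questions, each paired with a list of source knowledge passages, ground_truths, which typically holds 1 to 4 passages.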

[Figure: example from the fiqa dataset]

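At this point the test data is ready. All that remains is to ask the RAG application each question in the question column, merge its answers and retrieved contexts with the ground truth, and have Ragas evaluate and score all of this information.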
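The merge step can be sketched roughly as follows. The column names come from the Ragas snippet earlier in this article; everything else is an assumption made for illustration, including the `RagApp` signature, the `build_eval_records` helper, and the stub application, none of which is taken from the actual evaluation code.

```python
from typing import Callable

# Hypothetical signature: a RAG application takes a question and returns
# (answer, retrieved_contexts). Either system under test can be wrapped this way.
RagApp = Callable[[str], tuple[str, list[str]]]


def build_eval_records(questions, ground_truths, rag_app: RagApp) -> dict:
    """Query the RAG app once per question and merge its output with the labels."""
    records = {"question": [], "contexts": [], "answer": [], "ground_truths": []}
    for question, truths in zip(questions, ground_truths):
        answer, contexts = rag_app(question)
        records["question"].append(question)
        records["contexts"].append(contexts)
        records["answer"].append(answer)
        records["ground_truths"].append(truths)
    return records


def fake_rag_app(question: str) -> tuple[str, list[str]]:
    """Stand-in for a real RAG application, used only to exercise the code."""
    return f"stub answer to: {question}", ["stub context"]


records = build_eval_records(
    questions=["Are personal finance classes taught in high school?"],
    ground_truths=[["labeled passage 1", "labeled passage 2"]],
    rag_app=fake_rag_app,
)
print(sorted(records))
```

A dict in this shape can be turned into a Hugging Face dataset with `Dataset.from_dict(records)` before it is passed to `evaluate`.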

III. RAG comparison setup

The next step is to build the two RAG applications to be compared and score them both: one on the OpenAI assistant, the other a custom RAG pipeline based on a vector database.

1. OpenAI assistant

We follow OpenAI's official guide to Assistants retrieval (https://platform.openai.com/docs/assistants/tools/knowledge-retrieval): build an assistant, upload the knowledge, and use the official method (https://platform.openai.com/docs/assistants/how-it-works/message-annotations) to obtain the answer and the retrieved contexts. All other settings are left at their defaults.

2. RAG pipeline based on a vector database

Next, we built a RAG pipeline based on vector retrieval. It stores the knowledge in the Milvus (https://milvus.io/) vector database, builds embeddings with the BAAI/bge-base-en model through HuggingFaceEmbeddings (https://python.langchain.com/docs/integrations/platforms/huggingface), and uses LangChain (https://python.langchain.com/docs/get_started/introduction) components for document import and agent construction.
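The retrieval half of such a pipeline (embed the chunks, store the vectors, return the top_k nearest chunks for a query) can be sketched without any external services. Note the substitutions: a toy bag-of-words vector stands in for the BAAI/bge-base-en embedding, and a hypothetical in-memory class stands in for Milvus. This is not the code used in the evaluation, and the sample chunks are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical in-memory stand-in for a vector database such as Milvus."""

    def __init__(self) -> None:
        self._rows: list[tuple[Counter, str]] = []

    def insert(self, chunks: list[str]) -> None:
        self._rows.extend((embed(chunk), chunk) for chunk in chunks)

    def search(self, query: str, top_k: int = 2) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vec, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]


store = InMemoryVectorStore()
store.insert([
    "withdrawing from an IRA early is generally not a good idea",
    "municipal bonds can provide tax free income",
    "a civics class covered how to balance a checkbook",
])
contexts = store.search("pros and cons of IRA investments", top_k=1)
print(contexts)
```

In a full pipeline, the returned contexts would be placed in the LLM prompt together with the question to produce the final answer.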

The two solutions compare as follows:

[Figure: comparison of the two solutions]

Note that the LLM used here is gpt-4-1106-preview. OpenAI's other strategies are closed source, so they probably differ from ours in many ways. For reasons of space we do not cover the implementation details here; see the implementation code (https://github.com/milvus-io/bootcamp/tree/master/evaluation).

IV. Results and analysis

1. Experimental results

We scored both applications with Ragas on several metrics and obtained the following comparison for each metric:

[Figure: comparison across metrics]

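Of the five metrics we measured, the OpenAI assistant beats the custom RAG pipeline only on answer_similarity; on the other four it scores slightly lower.

In addition, Ragas can take the harmonic mean of the individual metrics to produce an overall score, called the Ragas score. The harmonic mean penalizes low scores. Overall, the OpenAI assistant's Ragas score is also lower than the custom RAG pipeline's.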
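For n scores x_1 ... x_n, the harmonic mean is n / (1/x_1 + ... + 1/x_n). The short sketch below checks this against the sample output in the Ragas snippet earlier in this article, assuming that the ragas_score shown there is the harmonic mean of the three metrics listed next to it. The second list of scores is made up to show the penalty effect.

```python
from statistics import harmonic_mean, mean

# Metric values copied from the sample output in the Ragas snippet in this article.
scores = {"context_precision": 0.817, "faithfulness": 0.892, "answer_relevancy": 0.874}
ragas_score = harmonic_mean(list(scores.values()))
print(round(ragas_score, 3))

# Made-up scores: one weak metric drags the harmonic mean down
# far more than it drags the arithmetic mean.
uneven = [0.9, 0.9, 0.1]
print(round(mean(uneven), 3), round(harmonic_mean(uneven), 3))
```

The first value rounds to the 0.860 shown in that sample output. For the uneven scores, the arithmetic mean stays above 0.6 while the harmonic mean falls below 0.25, so a system cannot hide one poor metric behind several good ones.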

[Figure: Ragas score comparison]

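So why did the results turn out this way? After comparing a large number of individual results, we found that the RAG pipeline built on a vector database has the following advantages over the OpenAI assistant:

1) The OpenAI assistant tends to answer from its own knowledge and is less likely to use the retrieved knowledge.

This may come from how the agent is set up internally. When it meets a fact that GPT may already know, it is more confident and chooses the knowledge learned in training over the retrieved knowledge. Yet the knowledge supplied at that moment may be the exact opposite of what it was trained on. For example:

Question: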

```text
Are personal finance / money management classes taught in high school, anywhere?
```

OpenAI assistant's answer:

```text
Personal finance and money management classes are indeed taught in some high schools, although the availability and requirement of such classes can vary greatly depending on the country and even within regions of a country. In the United States, for example, the Council for Economic Education (CEE) conducts surveys and reports on the state of financial education in K-12 schools. Their findings suggest that as of my last update, a growing number of states have integrated personal finance into their educational standards, and some have mandated that students take a personal finance course in order to graduate. Although I don't have access to a live database or the most recent studies, you can often find information about financial education in schools by:
1. Checking government education websites.
2. Looking at reports from financial education nonprofits and councils.
3. Contacting local school boards or departments of education.
If you are interested in specific locations or have a particular focus in mind, you could specify that so I could provide more directed information, if available.
```

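Is this answer correct? On its face it looks fine, but the sentence beginning "Although I don't have access to a live database or the most recent studies" stresses that the assistant cannot go online for up-to-date knowledge, which shows that it did not use the content of the uploaded knowledge documents.

Now look at the ground truth, the correct answers labeled in the dataset: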

```text
["In Houston, Texas USA where I went to a private high school they had a half-semester class in personal finance, but it was optional and didn't give you any credits towards graduation. You are right though, it should be a standard class. After all, who doesn't need that information in their adult lives, and not everyone goes to college."
...
'"We had a ""civics"" class when I was a freshman in high school. This was in the Ann Arbor, MI public schools. It covered the very basics (how to balance your checkbook, what are stocks, how do income taxes work, what is interest, etc.) of money management along with an overview of politics and the legal system. It was a really light class, though, and didn\'t go deeply into personal finance and money management. I agree that such a class would be very valuable, as would cooking, nutrition, and basic home and car repair."']
```

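As you can see, the documents describe many concrete scenarios that answer this question, but OpenAI's answer uses none of them. It assumes the question is simple enough to answer directly and ignores the information it may need from the knowledge base.

Next, the answer from the RAG pipeline based on the vector database: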

```text
Personal finance and money management classes are indeed taught in some high schools, but the availability and requirement of such classes can vary greatly. For example, in Houston, Texas, a private high school offered an optional half-semester class in personal finance that did not provide credits towards graduation. Staples High School in Westport, Connecticut, introduced a personal financial management class focused on using mathematics to develop financial literacy skills, covering topics such as earnings, banking, credit cards, loans, taxes, insurance, investing, budgeting, and buying personal property. In Ann Arbor, Michigan, public schools offered a 'civics' class that included basic money management topics. Additionally, in the UK, the 'Every Child Matters' framework includes 'achieving economic well-being' as an important element, and some schools use a School Rewards System to teach kids about finance and economy. While these examples show that such classes exist, they are not universally required or standardized, and the depth and focus of the curriculum can vary.
```

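The RAG answer draws directly on the ground truth, for example the Houston, Texas and Ann Arbor, Michigan classes, and integrates that knowledge into the answer very well, which is what users need.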
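A crude way to spot this difference automatically is to check whether terms that appear only in the knowledge base show up in the answer. The sketch below does that with two place names from the ground-truth passages quoted above and shortened excerpts of the two answers. It is a rough heuristic with a hypothetical helper name, not how Ragas computes faithfulness.

```python
def grounded_terms(answer: str, source_terms: list[str]) -> list[str]:
    """Return the source-specific terms that also appear in the answer."""
    lowered = answer.lower()
    return [term for term in source_terms if term.lower() in lowered]


# Place names that appear in the ground-truth passages quoted above.
source_terms = ["Houston", "Ann Arbor"]

# Shortened excerpts of the two answers quoted above.
assistant_answer = (
    "In the United States, for example, the Council for Economic Education (CEE) "
    "conducts surveys and reports on the state of financial education in K-12 schools."
)
pipeline_answer = (
    "For example, in Houston, Texas, a private high school offered an optional "
    "half-semester class in personal finance. In Ann Arbor, Michigan, public schools "
    "offered a 'civics' class that included basic money management topics."
)

print(grounded_terms(assistant_answer, source_terms))
print(grounded_terms(pipeline_answer, source_terms))
```

An empty result for the assistant's answer is a hint, not proof, that the retrieved documents were ignored; an LLM-judged metric such as faithfulness is the more reliable signal.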

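2) OpenAI's chunking and retrieval of the knowledge need optimization; the open-source custom solution does better.

By inspecting the assistant's intermediate retrieved knowledge, we can analyze its chunking strategy for the knowledge documents and infer how well its embedding model works. For example:

Question: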

```text
Pros / cons of being more involved with IRA investments [duplicate]
```

Intermediate retrieved snippet from the OpenAI assistant:

```text
['PROS: CONS']
```

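This is clearly a bad retrieval, and it was the only snippet retrieved. First, the chunking is not very sensible: the content that follows was cut off. Second, the embedding model did not retrieve the more important passages that could answer the question; it only retrieved a snippet whose words resemble those of the question.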
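OpenAI's actual chunking strategy is not public, so the following is only an illustration of how a splitter that ignores document structure can leave a heading stranded without its content. The `split_fixed` helper and the document text are made up for this sketch.

```python
def split_fixed(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into character windows of chunk_size, stepping by chunk_size - overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Made-up document text, for illustration only.
doc = "PROS: CONS\nMore control over fees, but more time spent on research."

tiny_chunks = split_fixed(doc, chunk_size=10)
print(tiny_chunks[0])    # the heading alone, with no content attached

wider_chunks = split_fixed(doc, chunk_size=40, overlap=10)
print(wider_chunks[0])   # the heading together with the text that follows it
```

With a custom pipeline, chunk size and overlap are ordinary parameters, so a stranded heading like this can be fixed by adjusting them or by splitting on document structure instead.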

Retrieved snippets from the custom RAG pipeline:

```text
['in the tax rate, there\'s also a significant difference in the amount being taxed. Thus, withdrawing from IRA is generally not a good idea, and you will never be better off with withdrawing from IRA than with cashing out taxable investments (from tax perspective). That\'s by design."'
"Sounds like a bad idea. The IRA is built on the power of compounding. Removing contributions will hurt your retirement savings, and you will never be able to make that up. Instead, consider tax-free investments. State bonds, Federal bonds, municipal bonds, etc. For example, I invest in California muni bonds fund which gives me ~3-4% annual dividend income - completely tax free. In addition - there's capital appreciation of your fund holdings. There are risks, of course, for example rate changes will affect yields and capital appreciation, so consult with someone knowledgeable in this area (or ask another question here, for the basics). This will give you the same result as you're expecting from your Roth IRA trick, without damaging your retirement savings potential."
"In addition to George Marian's excellent advice, I'll add that if you're hitting the limits on IRA contributions, then you'd go back to your 401(k). So, put enough into your 401(k) to get the match, then max out IRA contributions to give you access to more and better investment options, then go back to your 401(k) until you top that out as well, assuming you have that much available to invest for retirement."
"While tax deferral is a nice feature, the 401k is not the Holy Grail. I've seen plenty of 401k's where the investment options are horrible: sub-par performance, high fees, limited options. That's great that you've maxed out your Roth IRA. I commend you for that. As long as the investment options in your 401k are good, then I would stick with it."
"retirement plans which offer them good cheap index funds. These people probably don't need to worry quite as much. 
Finally, having two accounts is more complicated. Please contact someone who knows more about taxes than I am to figure out what limitations apply for contributing to both IRAs and 401(k)s in the same year."]
```

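The RAG pipeline we built retrieved many passages about IRA investing, and this content was also integrated effectively into the final LLM answer.

We can also see that vector retrieval achieves an effect similar to term-based retrieval such as BM25: the retrieved passages do contain the keyword we need, "IRA". So vector retrieval works not only on overall meaning but also at the level of individual terms, and here it does no worse than term-frequency retrieval.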

2. Other aspects

Beyond the experimental results, the OpenAI assistant has some clear drawbacks compared with a more flexible custom open-source RAG solution:

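• The OpenAI assistant does not let you tune parameters inside the RAG process; it is a black box internally, so it cannot be optimized. A custom RAG solution lets you adjust components and parameters such as top_k, chunk size, and the embedding model, so it can be optimized for your specific data.

• OpenAI limits how much you can store in files, whereas a vector database can store a vast amount of knowledge. A single file uploaded to OpenAI is capped at 512 MB and cannot exceed 2,000,000 tokens.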

As a result, OpenAI cannot serve RAG workloads with more complex business logic, larger data volumes, or more customization.
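Both points can be made concrete with a small sketch. The `RagConfig` class and `fits_single_file_limits` function are hypothetical names, the default chunk_size and top_k values are illustrative rather than the ones used in this evaluation, and treating 512 MB as 512 × 1024 × 1024 bytes is an assumption.

```python
from dataclasses import dataclass

# Per-file limits quoted in this article for uploads to OpenAI.
# Assumption: "512 MB" is interpreted as 512 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 512 * 1024 * 1024
MAX_FILE_TOKENS = 2_000_000


@dataclass
class RagConfig:
    """Knobs a custom pipeline exposes; the default values are illustrative only."""
    embedding_model: str = "BAAI/bge-base-en"
    chunk_size: int = 512
    top_k: int = 4


def fits_single_file_limits(size_bytes: int, token_count: int) -> bool:
    """Check one file against the size and token limits quoted above."""
    return size_bytes <= MAX_FILE_BYTES and token_count <= MAX_FILE_TOKENS


# A custom pipeline can change any of these per dataset and re-run the evaluation.
config = RagConfig(top_k=8)
print(config)

# A 100 MB file with 3,000,000 tokens passes the size limit but not the token limit.
print(fits_single_file_limits(100 * 1024 * 1024, 3_000_000))
```

Each field in the config is a parameter that can be swept and re-scored with Ragas, which is the kind of optimization loop a black-box service does not allow.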

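V. Conclusion

Using the Ragas evaluation tool, we carried out a detailed comparison and analysis of the OpenAI assistant and an open-source RAG solution based on a vector database. Although the OpenAI assistant performs respectably at retrieval, it trails the vector-database RAG solution in answer quality and in retrieval recall. The Ragas metrics reflect this conclusion quantitatively.

So, to build stronger and more effective RAG applications, developers can consider building custom retrieval on a vector database such as Milvus (https://zilliz.com/what-is-milvus) or Zilliz Cloud (https://cloud.zilliz.com.cn/signup), which brings better results and more flexible choices.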
