Building a Custom Q&A Application with LangChain and the Pinecone Vector Database

Build a custom chatbot with LangChain, OpenAI, and Pinecone that answers questions over any data source.

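Introduction

The emergence of large language models (LLMs) is one of the most exciting technological developments of our time. It has opened up many new possibilities in artificial intelligence and offers solutions to real-world problems across a range of industries. One of the most interesting applications of these models is building custom question-answering systems or chatbots on top of personal or organizational data sources. However, because LLMs are trained on general, publicly available data, their answers are not always specific or useful to the end user. To address this, we can use a framework such as LangChain to build a custom chatbot that gives specific answers grounded in our own data. In this article, we will learn how to build a custom Q&A application and deploy it on Streamlit Cloud. Let's get started!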

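Learning Objectives

Learn why a custom question-answering application can be a better choice than a fine-tuned language model
Learn how to develop a semantic search pipeline with OpenAI and Pinecone
Develop a custom Q&A app and deploy it on Streamlit Cloud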

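Overview of Q&A Applications

Question answering, or "chat over your data", is a common use case for LLMs and LangChain. LangChain provides a set of components for loading whatever data sources your use case calls for. It supports a large number of data sources and transformers, which convert the data into a series of strings to be stored in a vector database. Once the data is in the database, it can be queried with a component called a retriever. An LLM can then use the retrieved text to give accurate, chatbot-style answers, so users do not have to read through large numbers of documents themselves.

LangChain supports the data sources shown in the figure. With more than 120 integrations, it can connect to a wide variety of data sources.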

[Image]

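Q&A Application Workflow

Now that we know which data sources LangChain supports, we can build a question-answering pipeline from the components LangChain provides. The components used to load, store, and retrieve documents and to generate output with an LLM are as follows.

Document loader: loads the user's documents so that they can be vectorized and stored
Text splitter: a document transformer that breaks documents into chunks of bounded length so they can be stored efficiently
Vector store: a vector database integration that stores the vector embeddings of the input text
Document retrieval: retrieves text from the database based on the user's query, using similarity search to find the most similar content
Model output: the final answer to the user's query, generated by the model from a prompt that combines the query with the retrieved text

This is the high-level workflow of a question-answering pipeline that can be applied to many kinds of real-world problems. I have not gone into the details of each LangChain component here.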
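The five stages above can be sketched end to end without any external services. The following is a minimal, self-contained illustration and not LangChain's implementation: every function name is hypothetical, shared-word counting stands in for embedding similarity, and the final LLM call is replaced by formatting the text that would be sent to the model.

```python
import re


def load_documents(raw_texts):
    """Document loader: wrap raw strings as simple document dicts."""
    return [{"id": i, "text": t} for i, t in enumerate(raw_texts)]


def split_documents(documents, chunk_size=60):
    """Text splitter: cut each document into chunks of at most chunk_size characters."""
    chunks = []
    for doc in documents:
        text = doc["text"]
        for start in range(0, len(text), chunk_size):
            chunks.append({"source": doc["id"], "text": text[start:start + chunk_size]})
    return chunks


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_store(chunks):
    """Vector store stand-in: keep each chunk next to its token set."""
    return [(tokenize(c["text"]), c) for c in chunks]


def retrieve(store, query, k=2):
    """Retriever: rank chunks by how many tokens they share with the query."""
    q = tokenize(query)
    ranked = sorted(store, key=lambda item: len(q & item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def format_model_input(chunks, query):
    """Model output stage (stubbed): build the text an LLM would receive."""
    context = "\n".join(c["text"] for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = load_documents([
    "Dogs and cats are the most common pets.",
    "Pinecone is a managed vector database.",
])
store = build_store(split_documents(docs))
question = "Which pets are most common?"
top = retrieve(store, question, k=1)
model_input = format_model_input(top, question)
print(model_input)
```

In the real pipeline, the token sets are replaced by embedding vectors, the list by a vector database, and the formatted text is sent to an LLM.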

[Image]

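Advantages of Custom Q&A over Model Fine-Tuning

Answers that are specific to the context
Adapts to new input documents
No need to fine-tune the model, which saves training costs
More accurate and specific answers than generic responses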

What Is the Pinecone Vector Database?

[Image: Pinecone]

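Pinecone is a popular vector database used for building LLM-powered applications. It is versatile and scalable enough for high-performance AI applications. It is a fully managed, cloud-native vector database, so users do not have to deal with infrastructure.

LLM-based applications involve large amounts of unstructured data and need a sophisticated long-term memory to retrieve information as accurately as possible. Generative AI applications rely on semantic search over vector embeddings to return suitable context based on the user's input.

Pinecone is well suited to such applications: it is optimized to store and query large numbers of vectors with low latency, which helps in building responsive, user-friendly applications. Let's see how to set up a Pinecone vector database for a question-answering application.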

    # Install the client first (this demo uses the pinecone-client 2.x API):
    #   pip install pinecone-client

    # Import pinecone and initialize it with your API key and environment name
    import pinecone

    pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

    # Create your first index to start storing vectors
    pinecone.create_index("first_index", dimension=8, metric="cosine")

    # Connect to the index so that vectors can be upserted into it
    index = pinecone.Index("first_index")

    # Upsert sample data (five 8-dimensional vectors)
    index.upsert([
        ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),
        ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]),
        ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]),
        ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]),
        ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]),
    ])

    # List the indexes available in the database
    pinecone.list_indexes()

    [Output]>>> ['first_index']

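In the demo above, we installed the Pinecone client and initialized the vector database in our project environment. Once it is initialized, we can create an index with the required dimension and metric and then insert vector embeddings into it. In the next section, we will use Pinecone and LangChain to develop the semantic search pipeline for our application.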
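To make the roles of the dimension and the metric concrete, here is a small in-memory stand-in for a vector index. It is an illustrative sketch and not Pinecone's client API: the class ToyIndex and its methods are hypothetical, and it only mimics the idea of rejecting vectors of the wrong dimension and ranking stored vectors by cosine similarity.

```python
import math


class ToyIndex:
    """Hypothetical in-memory index: fixed dimension, cosine metric."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = {}

    def upsert(self, items):
        """Insert or overwrite (id, values) pairs, enforcing the index dimension."""
        for vec_id, values in items:
            if len(values) != self.dimension:
                raise ValueError(f"expected {self.dimension} values, got {len(values)}")
            self.vectors[vec_id] = list(values)

    @staticmethod
    def _cosine(a, b):
        """Cosine similarity; returns 0.0 if either vector has zero length."""
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, vector, top_k=1):
        """Return the top_k (id, score) pairs, most similar first."""
        scored = [(self._cosine(vector, v), vec_id) for vec_id, v in self.vectors.items()]
        scored.sort(reverse=True)
        return [(vec_id, score) for score, vec_id in scored[:top_k]]


toy_index = ToyIndex(dimension=3)
toy_index.upsert([("x", [1.0, 0.0, 0.0]), ("y", [0.0, 1.0, 0.0])])
matches = toy_index.query([0.9, 0.1, 0.0], top_k=1)
print(matches)
```

One consequence of the cosine metric is that only direction matters, not length. The sample vectors A to E in the demo above all point in the same direction, so a cosine-based query cannot meaningfully tell them apart; real embeddings differ in direction.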

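Building a Semantic Search Pipeline with OpenAI and Pinecone

We saw that the Q&A application workflow has five steps. In this section, we carry out the first four: document loading, text splitting, vector storage, and document retrieval.

To run these steps in a local environment or in a cloud notebook environment such as Google Colab, you need to install a few libraries and create accounts with OpenAI and Pinecone to obtain an API key from each. Let's start with the environment setup:

Install the required libraries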

    # Install langchain and openai along with the other dependencies
    !pip install --upgrade langchain openai -q
    !pip install pillow==6.2.2
    !pip install unstructured -q
    !pip install unstructured[local-inference] -q
    # "[email protected]" below is a placeholder; substitute the actual detectron2 repository reference
    !pip install detectron2@git+https://github.com/facebookresearch/[email protected]#egg=detectron2 -q
    !apt-get install poppler-utils
    !pip install pinecone-client -q
    !pip install tiktoken -q

    # Set up the OpenAI environment
    import os
    os.environ["OPENAI_API_KEY"] = "YOUR-API-KEY"

    # Import the libraries
    import openai
    import pinecone
    from langchain.document_loaders import DirectoryLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import Pinecone
    from langchain.llms import OpenAI
    from langchain.chains.question_answering import load_qa_chain

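Once the installation is complete, import all the libraries listed in the snippet above. Then follow the steps below.

Loading the documents

In this step, we load documents from a directory as the starting point of the AI project pipeline. The directory contains two files that we load into the project environment.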

    # Load the documents from the content/data directory
    directory = '/content/data'

    # Load every document in the directory with LangChain's DirectoryLoader
    def load_docs(directory):
        loader = DirectoryLoader(directory)
        documents = loader.load()
        return documents

    documents = load_docs(directory)
    len(documents)

    [Output]>>> 5

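Splitting the text data

Text embeddings and LLMs work better when each piece of text has a bounded length. For any LLM use case, we therefore need to split the text into chunks of similar size. We use "RecursiveCharacterTextSplitter" to split the documents into chunks of at most a fixed number of characters.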

    # Split the documents with the recursive character text splitter
    def split_docs(documents, chunk_size=200, chunk_overlap=20):
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        docs = text_splitter.split_documents(documents)
        return docs

    # Split the loaded documents into chunks
    docs = split_docs(documents)
    print(len(docs))

    [Output]>>> 12
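To see what chunk_size and chunk_overlap control, the sketch below implements a plain sliding-window splitter. This is a simplification for illustration only: RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line boundaries, which this hypothetical sliding_window_chunks function does not attempt.

```python
def sliding_window_chunks(text, chunk_size=200, chunk_overlap=20):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content that
    straddles a boundary still appears intact in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. A larger overlap preserves more context across boundaries at the cost of storing and embedding more redundant text.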

Storing the data in the vector store

Once the documents are split, we use OpenAI embeddings to store their vector embeddings in the vector database.

    # Create the OpenAI embedding model
    embeddings = OpenAIEmbeddings()

    # Initialize Pinecone
    pinecone.init(
        api_key="YOUR-API-KEY",
        environment="YOUR-ENV"
    )

    # Define the index name
    index_name = "langchain-project"

    # Store the documents and their embeddings in the Pinecone index
    index = Pinecone.from_documents(docs, embeddings, index_name=index_name)

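Retrieving data from the vector database

At this stage, we retrieve documents from the vector database using semantic search. The vectors are stored in an index named "langchain-project", and running a query like the one below returns the most similar documents from the database.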

    # An example query against our database
    query = "What are the different types of pet animals are there?"

    # Run a similarity search and store the matching documents in `result`
    result = index.similarity_search(
        query,  # our search query
        k=3     # return the 3 most relevant docs
    )
    result

    ---------------------------------[Output]--------------------------------------
    [Document(page_content='Small mammals like hamsters, guinea pigs, and rabbits are often chosen for their low maintenance needs. Birds offer beauty and song, and reptiles like turtles and lizards can make intriguing pets.', metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
     Document(page_content='Pet animals come in all shapes and sizes, each suited to different lifestyles and home environments. Dogs and cats are the most common, known for their companionship and unique personalities. Small', metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
     Document(page_content='intriguing pets. Even fish, with their calming presence , can be wonderful pets.', metadata={'source': '/content/data/Different Types of Pet Animals.txt'})]

In this way, we can retrieve documents from the vector store based on similarity search.
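Retrieved documents are usually concatenated into a single context string before being passed to the model. The sketch below shows one way to do that. The Document dataclass here is a local stand-in with the page_content and metadata fields that LangChain documents carry, and build_context is a hypothetical helper, not part of LangChain; the character budget is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Local stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(documents, max_chars=1000):
    """Join unique, non-empty chunks in ranked order until the budget is used up."""
    parts = []
    seen = set()
    total = 0
    for doc in documents:
        text = doc.page_content.strip()
        if not text or text in seen:
            continue  # skip empty chunks and exact duplicates
        if total + len(text) > max_chars:
            break  # stop before exceeding the character budget
        seen.add(text)
        parts.append(text)
        total += len(text)
    return "\n\n".join(parts)


retrieved = [
    Document("Birds offer beauty and song.", {"source": "pets.txt"}),
    Document("Birds offer beauty and song.", {"source": "pets.txt"}),
    Document("Even fish can be wonderful pets.", {"source": "pets.txt"}),
]
context = build_context(retrieved)
print(context)
```

Dropping duplicates matters because overlapping chunks can return nearly the same passage more than once, and a length budget keeps the final prompt within the model's input limit.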

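Custom Q&A App with Streamlit

In the final stage of the Q&A application, we integrate each component of the workflow to build a custom Q&A app in which users can supply different data sources (web articles, PDFs, CSVs, and so on) and chat with them, which helps make their day-to-day work more productive. We need to create a GitHub repository and add the following files.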

[Image]

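GitHub repository structure

Project files to add:

main.py - Python file containing the Streamlit front-end code
qanda.py - prompt design and model output functions that return an answer to the user's query
utils.py - utility functions for loading and splitting the input documents
vector_search.py - text embedding and vector storage functions
requirements.txt - project dependencies needed to run the app on Streamlit's public cloud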
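The utility code in utils.py is not reproduced in this article. When the input is a web page, a loading helper has to reduce the page's HTML to plain text before it can be split and embedded. The following is only a sketch of that idea using the standard library: the class and function names are hypothetical, and a real helper would also need to download the page and handle network errors.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Pets</h1><p>Dogs and cats.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(page)
print(text)
```

The resulting string can then be passed to a text splitter in the same way as the documents loaded from a directory.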

This project demo supports the following two types of data sources:

Text data from web URLs
Online PDF documents

These two types contain a wide range of text data and are the most common across use cases. To understand the application's user interface, see the main.py code below.

    # Import the necessary libraries
    import streamlit as st
    import openai
    import qanda
    from vector_search import *
    from utils import *

    # Read the OpenAI API key from the sidebar
    api_key = st.sidebar.text_input("Enter your OpenAI API key:", type='password')
    openai.api_key = str(api_key)

    # Header of the app
    _, col2, _ = st.columns([1, 7, 1])
    with col2:
        st.header("Simplchat: Chat with your data")

    url = False
    query = False
    pdf = False
    data = False

    # Select the type of data source based on the user's need
    options = st.selectbox("Select the type of data source",
                           options=['Web URL', 'PDF', 'Existing data source'])

    # Ask for a query based on the selected data source
    if options == 'Web URL':
        url = st.text_input("Enter the URL of the data source")
        query = st.text_input("Enter your query")
        button = st.button("Submit")
    elif options == 'PDF':
        pdf = st.text_input("Enter your PDF link here")
        query = st.text_input("Enter your query")
        button = st.button("Submit")
    elif options == 'Existing data source':
        data = True
        query = st.text_input("Enter your query")
        button = st.button("Submit")

    # Retrieve the best-matching context and display the model's answer
    def show_answer(query):
        with st.spinner("Finding an answer..."):
            title, res = find_k_best_match(query, 2)
            context = "\n\n".join(res)
            st.expander("Context").write(context)
            prompt = qanda.prompt(context, query)
            answer = qanda.get_answer(prompt)
            st.success("Answer: " + answer)

    # Web URL: scrape the page, add it to the database, then answer the query
    if button and url:
        with st.spinner("Updating the database..."):
            corpusData = scrape_text(url)
            encodeaddData(corpusData, url=url, pdf=False)
            st.success("Database Updated")
        show_answer(query)

    # PDF: extract the text, add it to the database, then answer the query
    if button and pdf:
        with st.spinner("Updating the database..."):
            corpusData = pdf_text(pdf=pdf)
            encodeaddData(corpusData, pdf=pdf, url=False)
            st.success("Database Updated")
        show_answer(query)

    # Existing data source: answer directly from what is already stored
    if button and data:
        show_answer(query)

    # Delete the vectors from the database
    with st.expander("Delete the indexes from the database"):
        button1 = st.button("Delete the current vectors")
        if button1:
            index.delete(deleteAll='true')
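The main.py code relies on qanda.prompt(context, query) to combine the retrieved context with the user's question, but qanda.py itself is not reproduced here. As a rough sketch of what such a function might look like, the example below fills a fixed template. The template wording, the name build_prompt, and the context length limit are all assumptions for illustration, not the author's code.

```python
# Hypothetical template: instructs the model to stay within the given context
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say that you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {query}\n"
    "Answer:"
)


def build_prompt(context, query, max_context_chars=3000):
    """Fill the template, trimming the context to a fixed character budget."""
    trimmed = context.strip()[:max_context_chars]
    return PROMPT_TEMPLATE.format(context=trimmed, query=query.strip())


example = build_prompt(
    "Dogs and cats are the most common pets.",
    "Which pets are most common?",
)
print(example)
```

Telling the model to rely only on the supplied context is what keeps the answers tied to the user's data rather than to the model's general training data.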

Deploying the Q&A Application on Streamlit Cloud

[Image]

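Application user interface

Streamlit provides a Community Cloud for hosting apps for free. Streamlit is also easy to use thanks to its automated CI/CD pipeline features.

Conclusion

In summary, we explored how to build a custom question-answering application with LangChain and the Pinecone vector database. This blog started with an overview of question-answering applications and introduced the basic concepts while exploring the capabilities of the Pinecone vector database. By combining a semantic search pipeline built on OpenAI embeddings with Pinecone's efficient indexing and retrieval, we were able to build a capable and accurate question-answering solution with Streamlit.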