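Link Prediction Code Example Using PyTorch Geometric

PyTorch Geometric (PyG) is one of the main tools for building graph neural network models and experimenting with different graph convolutions. This article introduces it through a link prediction task.

Link prediction answers the question of which two nodes should be linked to each other. We will perform a "transductive split" to prepare the data for modeling, set up a graph data loader designed for batch processing, build a model in Torch Geometric, train it with PyTorch Lightning, and examine the model's performance.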

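Preparing the libraries

Torch needs no introduction.
Torch Geometric is the main library for graph neural networks and the focus of this article.
PyTorch Lightning is used for training, tuning, and validating the model. It simplifies the training workflow.
Sklearn Metrics and Torchmetrics are used to check the model's performance.
PyTorch Geometric has specific dependencies, so if you run into problems during installation, refer to the official documentation.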

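Data preparation

We will use the Cora ML citation dataset, which can be accessed through Torch Geometric. The snippets in this article are assumed to use the following aliases: th for torch, nn for torch.nn, tg for torch_geometric, tgnn for torch_geometric.nn, pl for pytorch_lightning, and tm for torchmetrics.

    data = tg.datasets.CitationFull(root="data", name="Cora_ML")

By default, Torch Geometric datasets can return multiple graphs. Let's see what a single graph looks like.

    data[0]
    > Data(x=[2995, 2879], edge_index=[2, 16316], y=[2995])

Here, x holds the node features. edge_index is a 2 x (number of edges) matrix: the first dimension is 2, where row 0 holds the source nodes ("senders") and row 1 holds the destination nodes ("receivers").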
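To make the edge_index layout concrete, here is a minimal sketch in plain Python lists rather than tensors. The toy graph is hypothetical and is not taken from Cora ML.

```python
# Hypothetical toy graph: three directed edges given as (source, target) pairs.
edges = [(0, 1), (1, 2), (2, 0)]

# Transpose the pair list into the 2 x num_edges layout used by edge_index.
edge_index = [list(row) for row in zip(*edges)]

senders = edge_index[0]    # row 0: source nodes
receivers = edge_index[1]  # row 1: destination nodes

print(edge_index)  # [[0, 1, 2], [1, 2, 0]]
```

Column i of the matrix describes one edge, from senders[i] to receivers[i].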

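Link splitting

First, we split the links in the dataset. We use 20% of the graph's links as a validation set and 10% as a test set. No negative examples are added to the training set at this point, because negative links are created on the fly by the batch data loader.

In general, negative sampling creates "fake" samples (in this case, links between nodes), and the model learns to distinguish real links from fake ones. Negative sampling is grounded in sampling theory and has good statistical properties.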
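The idea behind negative sampling can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (uniform random pairs, no self-loops). The helper name sample_negative_edges and the toy graph are hypothetical, and this is not how PyG implements it internally.

```python
import random


def sample_negative_edges(num_nodes, positive_edges, num_samples, seed=0):
    """Draw distinct node pairs that are not real links (illustrative helper)."""
    rng = random.Random(seed)
    positives = set(positive_edges)
    negatives = set()
    while len(negatives) < num_samples:
        src = rng.randrange(num_nodes)
        dst = rng.randrange(num_nodes)
        # Skip self-loops and pairs that are real links.
        if src != dst and (src, dst) not in positives:
            negatives.add((src, dst))
    return sorted(negatives)


positive_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
negative_edges = sample_negative_edges(
    num_nodes=4, positive_edges=positive_edges, num_samples=2
)

# Label 1.0 marks real links, 0.0 marks sampled "fake" links.
labeled = [(e, 1.0) for e in positive_edges] + [(e, 0.0) for e in negative_edges]
```

The classifier is then trained on the labeled pairs, which is the role edge_label plays later in this article.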

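First, let's create the link splitter object.

    link_splitter = tg.transforms.RandomLinkSplit(
        num_val=0.2,
        num_test=0.1,
        add_negative_train_samples=False,
        disjoint_train_ratio=0.8)

disjoint_train_ratio controls what share of the training edges is used as training signal during the "supervision" phase. The remaining edges are used for message passing (the phase in which information is propagated through the network).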

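There are at least two ways to split edges for graph neural networks: the inductive split and the transductive split. The transductive approach assumes that the GNN needs to learn structural patterns from the graph structure. In the inductive setting, node/edge labels can be used for learning. Two of the papers listed at the end of this article discuss these concepts in detail and formalize them further ([1], [3]).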

    train_g, val_g, test_g = link_splitter(data[0])
    train_g
    > Data(x=[2995, 2879], edge_index=[2, 2285], y=[2995], edge_label=[9137], edge_label_index=[2, 9137])

After this operation, several new attributes are available:

edge_label: indicates whether an edge is true or false. This is what we want to predict.

edge_label_index: a 2 x NUM EDGES matrix that stores the node links.
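The sizes reported in this article can be cross-checked with simple arithmetic. The sketch below assumes that the splitter truncates fractional counts and treats the 16316 edges as directed. Both are assumptions about its internals, not something stated in the article.

```python
num_edges = 16316  # edges in data[0], as printed earlier

num_val = int(0.2 * num_edges)    # num_val=0.2
num_test = int(0.1 * num_edges)   # num_test=0.1
num_train = num_edges - num_val - num_test

# disjoint_train_ratio=0.8: this share of the training edges becomes
# supervision edges (edge_label_index); the rest stay in edge_index
# and are used for message passing.
num_supervision = int(0.8 * num_train)
num_message_passing = num_train - num_supervision

print(num_val, num_test, num_train, num_supervision, num_message_passing)
# 3263 1631 11422 9137 2285
```

Under these assumptions, 9137 and 2285 agree with edge_label and edge_index of train_g, and 3263 and 1631 agree with the number of positive links in the validation and test sets.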

Let's look at the distribution of the samples:

    th.unique(train_g.edge_label, return_counts=True)
    > (tensor([1.]), tensor([9137]))

    th.unique(val_g.edge_label, return_counts=True)
    > (tensor([0., 1.]), tensor([3263, 3263]))

The training data contains no negative edges (they are created during training). The validation and test sets already contain "fake" links at a 50:50 ratio.

Model

We can now build a model using a GNN.

    class GNN(nn.Module):

        def __init__(
                self,
                dim_in: int,
                conv_sizes: Tuple[int, ...],
                act_f: nn.Module = th.relu,
                dropout: float = 0.1,
                *args,
                **kwargs):
            super().__init__()
            self.dim_in = dim_in
            self.dim_out = conv_sizes[-1]
            self.dropout = dropout
            self.act_f = act_f
            last_in = dim_in
            layers = []

            # Here we build subsequent graph convolutions.
            for conv_sz in conv_sizes:
                # Single graph convolution layer
                conv = tgnn.SAGEConv(in_channels=last_in, out_channels=conv_sz, *args, **kwargs)
                last_in = conv_sz
                layers.append(conv)
            self.layers = nn.ModuleList(layers)

        def forward(self, x: th.Tensor, edge_index: th.Tensor) -> th.Tensor:
            h = x
            # For every graph convolution in the network...
            for conv in self.layers:
                # ... perform node embedding via message passing
                h = conv(h, edge_index)
                h = self.act_f(h)
                if self.dropout:
                    h = nn.functional.dropout(h, p=self.dropout, training=self.training)
            return h

The notable part of this model is the stack of graph convolutions (SAGEConv in this case). The formal definition of the SAGE convolution is:

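$$h_v' = W_1 h_v + W_2 \cdot \operatorname{mean}_{u \in N(v)} h_u$$

(the mean-aggregation form that SAGEConv uses by default)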

Here v is the current node and N(v) is the set of neighbors of node v. To learn more about this type of convolution, see the original GraphSAGE paper [1].
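As a rough illustration of what one SAGE-style layer does, the sketch below updates each node from its own features plus the mean of its neighbors' features, using plain Python lists. Bias and activation are left out, the weight matrices are assumed to be square, and the helper names and toy graph are hypothetical. It is a simplified sketch, not the SAGEConv implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sage_mean_update(features, neighbors, w_self, w_neigh):
    """One update: new h_v = w_self @ h_v + w_neigh @ mean of neighbor features."""
    updated = {}
    for v, h_v in features.items():
        neigh_feats = [features[u] for u in neighbors[v]]
        # A node without neighbors aggregates to a zero vector.
        agg = mean_vectors(neigh_feats) if neigh_feats else [0.0] * len(h_v)
        self_part = matvec(w_self, h_v)
        neigh_part = matvec(w_neigh, agg)
        updated[v] = [a + b for a, b in zip(self_part, neigh_part)]
    return updated


features = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 4.0]}
neighbors = {0: [1, 2], 1: [0], 2: []}
identity = [[1.0, 0.0], [0.0, 1.0]]

out = sage_mean_update(features, neighbors, identity, identity)
print(out[0])  # [3.0, 3.0]
```

With identity weights, node 0 ends up with its own features plus the average of nodes 1 and 2. Stacking several such updates lets information travel several hops.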

Let's check that the model can make predictions with the data we prepared. The inputs to the PyG model are the node feature matrix x and the link matrix edge_index.

    gnn = GNN(train_g.x.size()[1], conv_sizes=[512, 256, 128])
    with th.no_grad():
        out = gnn(train_g.x, train_g.edge_index)

    out
    > tensor([[0.0000, 0.0000, 0.0051, ..., 0.0997, 0.0000, 0.0000],
              [0.0107, 0.0000, 0.0576, ..., 0.0651, 0.0000, 0.0000],
              [0.0000, 0.0000, 0.0102, ..., 0.0973, 0.0000, 0.0000],
              ...,
              [0.0000, 0.0000, 0.0549, ..., 0.0671, 0.0000, 0.0000],
              [0.0000, 0.0000, 0.0166, ..., 0.0000, 0.0000, 0.0000],
              [0.0000, 0.0000, 0.0034, ..., 0.1111, 0.0000, 0.0000]])

The output of our model is a node embedding matrix of size N nodes x embedding size.

PyTorch Lightning

We use PyTorch Lightning mainly for training. In this module we also add a Linear layer after the GNN output as an output head that predicts whether two nodes are linked.

    class LinkPredModel(pl.LightningModule):

        def __init__(
                self,
                dim_in: int,
                conv_sizes: Tuple[int, ...],
                act_f: nn.Module = th.relu,
                dropout: float = 0.1,
                lr: float = 0.01,
                *args,
                **kwargs):
            super().__init__()

            # Our inner GNN model
            self.gnn = GNN(dim_in, conv_sizes=conv_sizes, act_f=act_f, dropout=dropout)

            # Final prediction model on links.
            self.lin_pred = nn.Linear(self.gnn.dim_out, 1)
            self.lr = lr

        def forward(self, x: th.Tensor, edge_index: th.Tensor) -> th.Tensor:
            # Step 1: make node embeddings using GNN.
            # Note: the same edge_index is used for message passing
            # and for selecting the node pairs to score.
            h = self.gnn(x, edge_index)

            # Take source node embeddings - senders
            h_src = h[edge_index[0, :]]
            # Take target node embeddings - receivers
            h_dst = h[edge_index[1, :]]

            # Calculate the element-wise product between them
            src_dst_mult = h_src * h_dst
            # Apply the linear prediction head (one logit per link)
            out = self.lin_pred(src_dst_mult)
            return out

        def _step(self, batch: th.Tensor, phase: str = 'train') -> th.Tensor:
            yhat_edge = self(batch.x, batch.edge_label_index).squeeze()
            y = batch.edge_label
            loss = nn.functional.binary_cross_entropy_with_logits(input=yhat_edge, target=y)
            f1 = tm.functional.f1_score(preds=yhat_edge, target=y, task='binary')
            prec = tm.functional.precision(preds=yhat_edge, target=y, task='binary')
            recall = tm.functional.recall(preds=yhat_edge, target=y, task='binary')

            # Watch for logging here - we need to provide batch_size, as (at the time
            # of this implementation) PL cannot understand the batch size.
            self.log(f"{phase}_f1", f1, batch_size=batch.edge_label_index.shape[1])
            self.log(f"{phase}_loss", loss, batch_size=batch.edge_label_index.shape[1])
            self.log(f"{phase}_precision", prec, batch_size=batch.edge_label_index.shape[1])
            self.log(f"{phase}_recall", recall, batch_size=batch.edge_label_index.shape[1])
            return loss

        def training_step(self, batch, batch_idx):
            return self._step(batch)

        def validation_step(self, batch, batch_idx):
            return self._step(batch, "val")

        def test_step(self, batch, batch_idx):
            return self._step(batch, "test")

        def predict_step(self, batch):
            x, edge_index = batch
            return self(x, edge_index)

        def configure_optimizers(self):
            return th.optim.Adam(self.parameters(), lr=self.lr)
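The scoring step in forward can be followed by hand on a single pair of nodes. The sketch below uses plain Python lists with made-up embeddings, weights, and bias (all hypothetical values chosen so the arithmetic is easy to check). It mirrors the element-wise product followed by a linear layer, then applies a sigmoid to turn the logit into a probability.

```python
import math


def link_logit(h_src, h_dst, weights, bias):
    """Element-wise product of two embeddings followed by a linear layer."""
    product = [a * b for a, b in zip(h_src, h_dst)]
    return sum(w * p for w, p in zip(weights, product)) + bias


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical embeddings for a sender and a receiver node.
h_src = [1.0, 2.0, 0.0]
h_dst = [0.5, 1.0, 3.0]
# Hypothetical parameters of the linear head.
weights = [1.0, 1.0, 1.0]
bias = -2.5

logit = link_logit(h_src, h_dst, weights, bias)  # 0.5 + 2.0 + 0.0 - 2.5
prob = sigmoid(logit)
print(logit, prob)  # 0.0 0.5
```

The model returns the logit and leaves the sigmoid to the loss function (binary_cross_entropy_with_logits) during training.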

PyTorch Lightning's role is to simplify the training procedure: we only need to define a few methods. We can check that the model works with the following code:

    model = LinkPredModel(val_g.x.size()[1], conv_sizes=[512, 256, 128])
    with th.no_grad():
        out = model.predict_step((val_g.x, val_g.edge_label_index))

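Training

Graph data requires special handling in the data loader during training, especially for tasks such as link prediction. PyG has several specialized data loader classes that generate batches correctly. We will use tg.loader.LinkNeighborLoader, which accepts the following inputs: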

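The data (graph) to load in batches.

num_neighbors: the maximum number of neighbors to load for each node during one "hop". This is a list giving the number of neighbors for hops 1 - 2 - 3 - ... - K. It is especially useful for very large graphs.

edge_label_index: the attribute that holds the true/false links.

neg_sampling_ratio: the ratio of negative samples to true samples.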

    train_loader = tg.loader.LinkNeighborLoader(
        train_g,
        num_neighbors=[-1, 10, 5],
        batch_size=128,
        edge_label_index=train_g.edge_label_index,
        # "on the fly" negative sampling creation for batch
        neg_sampling_ratio=0.5
    )

    val_loader = tg.loader.LinkNeighborLoader(
        val_g,
        num_neighbors=[-1, 10, 5],
        batch_size=128,
        edge_label_index=val_g.edge_label_index,
        edge_label=val_g.edge_label,
        # negative samples for val set are done already as ground-truth
        neg_sampling_ratio=0.0
    )

    test_loader = tg.loader.LinkNeighborLoader(
        test_g,
        num_neighbors=[-1, 10, 5],
        batch_size=128,
        edge_label_index=test_g.edge_label_index,
        edge_label=test_g.edge_label,
        # negative samples for test set are done already as ground-truth
        neg_sampling_ratio=0.0
    )
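To give a feel for what a per-hop neighbor limit such as num_neighbors=[-1, 10, 5] means, here is a simplified sketch in plain Python. The helper sample_neighborhood and the toy adjacency list are hypothetical, and the code does not claim to reproduce PyG's sampler. It only shows the idea that -1 keeps every neighbor while a positive number caps how many are drawn per node at that hop.

```python
import random


def sample_neighborhood(adjacency, seeds, fanouts, seed=0):
    """Collect nodes reachable from the seeds, limiting neighbors per hop."""
    rng = random.Random(seed)
    visited = set(seeds)
    frontier = list(seeds)
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            neigh = adjacency.get(node, [])
            # -1 keeps every neighbor; otherwise cap the number sampled.
            if fanout != -1 and len(neigh) > fanout:
                neigh = rng.sample(neigh, fanout)
            for n in neigh:
                if n not in visited:
                    visited.add(n)
                    next_frontier.append(n)
        frontier = next_frontier
    return visited


adjacency = {0: [1, 2, 3], 1: [4, 5], 2: [6], 3: [], 4: [], 5: [], 6: [7], 7: []}

nodes_all = sample_neighborhood(adjacency, seeds=[0], fanouts=[-1, -1])
nodes_capped = sample_neighborhood(adjacency, seeds=[0], fanouts=[-1, 1])
print(len(nodes_all), len(nodes_capped))  # 7 6
```

In a link loader the seeds would be the endpoints of the links in the current batch, so each batch only carries the part of the graph needed to embed those endpoints.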

The following code trains the model:

    model = LinkPredModel(val_g.x.size()[1], conv_sizes=[512, 256, 128])
    trainer = pl.Trainer(max_epochs=20, log_every_n_steps=5)

    # Validate before training - we will see results of untrained model.
    trainer.validate(model, val_loader)

    # Train the model
    trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=val_loader)

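Next, we check the model on the test data and display the classification report and the ROC curve.

    from sklearn.metrics import classification_report

    model.eval()  # make sure dropout is disabled for inference
    with th.no_grad():
        yhat_test_proba = th.sigmoid(model(test_g.x, test_g.edge_label_index)).squeeze()
        yhat_test_cls = yhat_test_proba >= 0.5

    print(classification_report(y_true=test_g.edge_label, y_pred=yhat_test_cls))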

The results look reasonably good:

                  precision    recall  f1-score   support

             0.0       0.68      0.70      0.69      1631
             1.0       0.69      0.66      0.68      1631

        accuracy                           0.68      3262
       macro avg       0.68      0.68      0.68      3262
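For readers less familiar with the columns in the report, the sketch below shows how precision, recall, and F1 for the positive class follow from thresholded probabilities. The labels and scores are made up for illustration and are unrelated to the Cora ML results. The function name binary_report is hypothetical.

```python
def binary_report(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 for the positive class at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels and predicted link probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]

precision, recall, f1 = binary_report(y_true, y_prob)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Here two of the three predicted links are real (precision 2/3) and two of the three real links are found (recall 2/3).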

The ROC curve also looks good.
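A ROC curve is often summarized by the area under it (AUC), which equals the probability that a randomly chosen real link receives a higher score than a randomly chosen fake one. A minimal sketch of that pairwise definition, using made-up scores and a hypothetical helper name, could look like this:

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted link probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]

auc = roc_auc(y_true, y_score)
print(round(auc, 3))  # 0.778
```

This quadratic loop is fine for small examples. For real evaluation a library routine is the more practical choice.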

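The model we trained was neither particularly sophisticated nor well tuned, but it served its purpose. Of course, this is only a small dataset used for demonstration.

Summary

Graph neural networks may seem complicated, but PyTorch Geometric offers a good solution. Its built-in model implementations can be used directly, which makes the library easy to use and lowers the barrier to entry.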

Code for this article: https://github.com/maddataanalyst/blogposts_code/blob/main/graph_nns_series/pyg_pyl_perfect_match/pytorch-geometric-lightning-perfect-match.ipynb
