I’ve recently started working with graph structures in the context of machine learning, and have found that I’ve opened what seems to be a reverse Pandora’s box, full of neat algorithms that can pull out a lot of insights from graph structures. As a way of cementing my knowledge and hopefully also giving a different perspective, I’ll do a series of blog posts of various graph algorithms that I find interesting and/or useful. I’ll aim to cover both the theoretical foundation of the algorithms as well as concrete implementations and examples of them.

The algorithm I’d like to start with today is somewhat of a classic by now: Google’s PageRank algorithm, developed in 1996 and originally designed to order search results, but which can be applied to any graph structure to get an idea of the most important nodes in the graph (where important here means most connected). There are two different versions of the algorithm: a global and a local one. They are very similar, but are used in completely different contexts. Let’s get started.

This post is part of my series on graph algorithms:

  1. PageRank
  2. DeepWalk
  3. Graph Convolutional Neural Networks

Some Intuition: The Random Web Surfer

Before we dive into the Mathematics of the algorithm, I’d like to start with an intuitive idea that will guide us along the way. We imagine a person who surfs around the web, clicking on a random link on every page. Every once in while the surfer ignores the current website however, and instead goes to a completely random site.

We then ask ourselves: how much of the web surfer’s time will be spent at each individual website? The idea is that the more time is spent, the more important the website is. The algorithm in its essence is quite simple: we simply let the person surf around and record how often they visit each node. The fact that this procedure will eventually terminate is then the crucial result that makes the algorithm useful.

[source]

Of course, this simple procedure generalises beyond websites, and we can apply the algorithm whenever we’re dealing with any kind of graph structure to get an idea of how central the nodes are.

Markov Chains

The Mathematics around the PageRank algorithm mostly concerns Markov chains. Let’s start with a formal definition and then dig into some intuition and examples.

Definition. For a state space $S\subseteq\mathbb N_0$ we define a Markov chain (on $S$) as a sequence $(X_t)_{t\in 0}^\infty$ of random variables $X_t$, such that

  1. (Markov property) $P(X_{t+1}\mid X_0,\dots,X_t) = P(X_{t+1}\mid X_t)$; and
  2. (Time homogeneity) $P(X_{t+1} = s’\mid X_t = s)$ is independent of $t$, for every $s, s’\in S$.

Here the intuition is that we imagine the Markov chain to be a random variable evolving through time. For instance, $X_t$ could be coffee consumption per capita in year $t$ (in some suitable units). For this example, the Markov property would translate to postulating that the amount of coffee drunk next year only depends upon the current year and is completely independent of how much coffee was enjoyed last year. Time homogeneity would say that if the same amount of coffee was had in two different years, then the coffee consumption for the following year in both cases follow the same distribution.

The reason why we care about the time homogeneity property is that it allows us to define for every pair of states $s,s’\in S$ the transition probability $p_{s,s’} := P(X_{t+1} = s’ \mid X_t = s)$ for some, or equivalently any, $t\in\mathbb N_0$. These transition probabilities thus shows us that, given that the coffee consumption is $s$ in a given year, what the probability distribution is for the coffee consumption in the following year. This comes with a neat graph structure:

A graph containing seven nodes, with various edges between them, usually going from a lower node to a higher one. The edges are labelled with numbers between 0 and 1, with the numbers on the outgoing edges of a node sum to one

Here every state has a node (here we’ve just shown a few of them), and an arrow from state $s$ to $s’$ is labelled with the transition probability $p_{s,s’}$. To enable ease of notation we define the transition matrix $T$ as the $N\times N$ matrix, where $N$ is the number of nodes, with entries $T_{s,s’} := p_{s,s’}$.

Now, with all this formalism we can now start to describe what we’re trying to find. Namely, the stationary distribution associated to a Markov chain is a discrete distribution $\pi:S\to\mathbb R$ such that $\pi T = \pi$, where $T$ is the transition matrix and we abuse notation and view $\pi$ as a row vector of length $N$. Said in another way, $\pi(s)$ is equal to the sum of all the $\pi(s’)$ which has an arrow going into $s$, weighted by their respective transition probabilities. This means that, roughly, the more in-links a state $s$ has, the higher $\pi(s)$ will be.

But how do we find the stationary distribution? This is where the following crucial theorem enters the picture. Intuitively, it says that by starting with a random choice of $\pi$ and updating $\pi$’s values by simply traversing the graph according to the transition probabilities, it will converge to the stationary distribution.

This result requires a certain technical assumption on the chain. Define for a state $x$ in the Markov chain the period of $x$ to be the largest integer dividing all $n$ for which there is a positive probability to return to $x$ in $n$ steps. We then further call $x$ aperiodic if its period is 1. A simple sufficient criterion to check for aperiodicity is if there’s a positive probability for a state to stay put (i.e. for the graph to have a loop).

Theorem. If a Markov chain has an aperiodic state and every state can be reached by every other state, then it has a unique stationary distribution. Further, if we let $\pi_0$ be the uniform distribution on the state space and define $\pi_{n+1} := \pi_n T$ with $T$ being the transition matrix, then the $\pi_n$’s converge to the stationary distribution.

The Algorithm

The definition of the PageRank algorithm is now quite simple. Recall that the setup is that we have a graph of interest $G$, and we would like to find the most central nodes in it. We will start with a special case of the algorithm and then generalise to the actual algorithm.

For the simple special case, we convert our graph $G$ into a Markov chain by assigning uniform transition probabilities, by which I mean that the ingoing links from a given state follow a uniform distribution. So if a state has 3 links into it, each of them has an associated transition probability $\tfrac{1}{3}$. We can then use the above theorem to get the stationary distribution, and these values are simply the PageRank values.

In the more general case, we need to tweak our graph a bit more. Here we start with a damping factor $\alpha\in [0, 1]$. Going back to the initial random surfer intuition, the damping factor is the probability that the surfer keeps on pressing links on websites, and the teleportation factor $1-\alpha$ is the chance that the web surfer gets tired of clicking and instead moves on to a random website. The special case was when $\alpha = 1$, as the web surfer never got tired and just kept on pressing links.

We start by making the graph complete, meaning that we add edges between every pair of nodes. Let’s call this graph the completion of $G$, denoted by $\overline G$.

A graph containing four nodes, which is completed by drawing new edges between every pair of nodes that did not already have an edge

We’re going to add two kinds of transition probabilities to the edges in $\overline G$, depending on whether the edge is an “old” edge, i.e. was part of $G$, or a new one.

Every old edge is going to get the transition probability $\tfrac{\alpha}{n_\text{in}} + \tfrac{1-\alpha}{N}$, where $n_\text{in}$ is the amount of ingoing links and $N$ is the total number of nodes in the graph, and new edges are going to get a transition probability of $\tfrac{1-\alpha}{N}$.

We see that when $\alpha = 1$ then the new edges all have transition probability zero, so we’re effectively back to the old graph and the special case above. The more general case reflects the teleporting behaviour of the surfer.

Note also that this trick ensures that the two conditions in the theorem are satisfied, as every node can, by construction, reach every other node in $\overline G$, and every node has a loop and is thus aperiodic.

Just as in the special case, the PageRank score is then the stationary distribution of $\overline G$, seen as a Markov chain.

Personalised PageRank

There is an important local variant of the algorithm, which is usually called Personalised PageRank. The only difference is in the construction of $\overline G$: instead of creating new links between all node pairs, we start with a specified set $S$ of nodes, and only add edges from all nodes into nodes from $S$.

This also means that to be able to use the theorem above, we have to ensure that we can reach all nodes from nodes belonging to $S$. But even if not then no harm is done: then the unreachable nodes will just get a PageRank score of 0.

This variant is very useful when we have isolated a group of interesting nodes in our graph, and we would like to see which nodes are connected to these nodes.

Python Implementation

The theory regarding the PageRank algorithm might be slightly complicated, but I hope you agree that the algorithm itself is not. We simply form $\overline G$, initialise the stationary distribution $\pi$ as the uniform distribution across all nodes, and then repeatedly multiplying the transition matrix with $\pi$ until convergence.

Here is a Python implementation of the algorithm, taken from https://en.wikipedia.org/wiki/PageRank#Python:

import numpy as np

def pagerank(M, num_iterations: int = 100, d: float = 0.85):
    """PageRank: The trillion dollar algorithm.

    Parameters
    ----------
    M : numpy array
        adjacency matrix where M_i,j represents the link from 'j' to
        'i', such that for all 'j', sum(i, M_i,j) = 1
    num_iterations : int, optional
        number of iterations, by default 100
    d : float, optional
        damping factor, by default 0.85

    Returns
    -------
    numpy array
        a vector of ranks such that v_i is the i-th rank from [0, 1],
        v sums to 1

    """
    N = M.shape[1]
    v = np.random.rand(N, 1)
    v = v / np.linalg.norm(v, 1)
    M_hat = (d * M + (1 - d) / N)
    for i in range(num_iterations):
        v = M_hat @ v
    return v

Note that the above implementation does not check for convergence, but we could easily implement that with some specified threshold parameter.

It’s also implemented in several graph libraries. Here’s the networkx implementation, given that you have a networkx.Graph object G:

pr = networkx.pagerank(G, alpha = 0.85)

You can also make the PageRank personalised by specifying the personalization argument in the function call.

If you’re working with a graph database in Neo4j then you can run PageRank directly in your Cypher query. You have to create a graph in the graph catalog, and then call the following:

CALL gds.pageRank.stream('myGraph', {
  maxIterations: 20,
  dampingFactor: 0.85,
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId) AS node, score
ORDER BY score DESC

Neo4j also supports personalised PageRank, by using the sourceNodes argument.