Basics of network science with Networkx - 2

Second instalment in the exploratory data analysis of graphs

By Vamshi Jandhyala in data science python

July 25, 2020

Hubs

A key feature of many networks is heterogeneity. Heterogeneous networks present a wide variability in the properties and roles of their elements — nodes and/or links. This reflects the diversity present in the complex systems described by networks. In air transportation networks, social networks, the Web, and many other networks, a clear source of heterogeneity is the degree of the nodes: a few nodes have many connections while most nodes have few.

The importance of a node or a link is estimated by computing its centrality. The degree is an important measure of centrality. High-degree nodes are called hubs.

The average degree of a network indicates how connected the nodes are on average. The average degree may not be representative of the actual distribution of degree values. This is the case when the nodes have heterogeneous degrees, as in many real-world networks.
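As a quick sketch (the Barabási–Albert test graph and the hub threshold below are arbitrary choices for illustration), NetworkX makes it easy to compute degrees and flag hub candidates:

```python
import networkx as nx

# Scale-free test graph; chosen only for illustration.
G = nx.barabasi_albert_graph(200, 2, seed=42)

degrees = dict(G.degree())
avg_degree = sum(degrees.values()) / G.number_of_nodes()

# Call a node a "hub" if its degree is far above the average.
# The factor 5 is an arbitrary threshold for this example.
hubs = [n for n, k in degrees.items() if k > 5 * avg_degree]
print(f"average degree: {avg_degree:.2f}, number of hubs: {len(hubs)}")
```

In a graph like this, only a handful of nodes clear the threshold while the vast majority sit near or below the average, which is exactly the heterogeneity described above.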

Closeness

Another way to measure the centrality of a node is by determining how “close” it is to the other nodes. This can be done by summing the distances from the node to all others. If the distances are short on average, their sum is a small number and we say that the node has high centrality. This leads to the definition of closeness centrality, which is simply the inverse of the sum of distances of a node from all others.

The closeness centrality of a node $i$ is defined as

$$ g_i = \frac{1}{\sum_{j \neq i}l_{ij}} $$

where $l_{ij}$ is the distance from $i$ to $j$ and the sum runs over all the nodes of the network, except $i$ itself.

An alternative formulation is obtained by multiplying $g_i$ by the constant $N-1$, which is just the number of terms in the sum at the denominator:

$$ \tilde{g_i} = (N-1)g_i = \frac{N-1}{\sum_{j \neq i} l_{ij}} = \frac{1}{\sum_{j \neq i} l_{ij}/(N-1)} $$

This way we discount the graph size and make the measure comparable across different networks. Since what matters is not the actual value of $g_i$ but its ranking relative to the closeness centrality of the other nodes, the relative centrality of the nodes is the same as with $g_i$: multiplying all values by a constant does not alter the ranking. The expression $\sum_{j \neq i} l_{ij}/(N-1)$ is the average distance from the focal node $i$ to the rest of the network, so closeness can be expressed equivalently as the inverse of the average distance.
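A minimal sketch of both formulas on a small path graph (an arbitrary example), checked against NetworkX's `closeness_centrality`, which returns the normalized variant $\tilde{g_i}$ on connected graphs:

```python
import networkx as nx

# Path graph 0-1-2-3-4; node 2 is the most central.
G = nx.path_graph(5)
N = G.number_of_nodes()

i = 2
dist = nx.single_source_shortest_path_length(G, i)  # l_ij for all j
total = sum(d for j, d in dist.items() if j != i)   # sum_{j != i} l_ij

g_i = 1 / total           # closeness as defined above
g_norm = (N - 1) / total  # size-independent variant

# NetworkX returns the normalized form for connected graphs.
assert abs(g_norm - nx.closeness_centrality(G, 2)) < 1e-12
print(g_i, g_norm)  # 1/6 and 2/3
```

For the middle node the distances are 2, 1, 1, 2, so the sum is 6 and the average distance is 6/4 = 1.5; the normalized closeness is its inverse, 2/3.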

Betweenness

Assume that the number of shortest paths traversing a node is a good approximation for how frequently the node is used. The centrality of a node is then estimated by counting how many times it is crossed by those paths. The higher the count, the more traffic is controlled by the node, which is therefore more influential in the network.

Given two nodes, there may be more than one shortest path between them in the network, all having the same length. For instance, if nodes $X$ and $Y$ are not connected to each other but have two common neighbors $S$ and $T$, there are two distinct shortest paths of length two running from $X$ to $Y$: $X-S-Y$ and $X-T-Y$. Let $\sigma_{hj}$ be the total number of shortest paths from $h$ to $j$ and $\sigma_{hj}(i)$ the number of these shortest paths that pass through node $i$. The betweenness of $i$ is defined as

$$ b_i = \sum_{h \neq j \neq i} \frac{\sigma_{hj}(i)}{\sigma_{hj}} $$

The sum runs over all pairs of vertices $h$ and $j$, distinct from $i$ and from each other. If no shortest path between $h$ and $j$ crosses $i$, the contribution of the pair $(h,j)$ to the betweenness of $i$ is $0$. If all shortest paths between $h$ and $j$ cross $i$, the contribution is $1$. If a node is a leaf, it cannot be crossed by any path. Therefore its betweenness is zero. Since the potential contributions come from all pairs of nodes, the betweenness grows with the network size.

A node has high betweenness if it occupies a special position in the network, such that it is an important station for the communication patterns running through the network. For that to happen, it is not necessary to have many neighbors. Generally we observe a correlation between the degree of a node and its betweenness, so that well-connected nodes have high betweenness and vice versa. However, there are many exceptions. Nodes bridging different regions of a network typically have high betweenness, even if their degree is low.
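A barbell graph makes this concrete (the graph choice is just for illustration): a single low-degree node bridges two cliques and ends up with the highest betweenness.

```python
import networkx as nx

# Two 5-cliques (nodes 0-4 and 6-10) joined through node 5.
G = nx.barbell_graph(5, 1)

bc = nx.betweenness_centrality(G)  # normalized by default
top = max(bc, key=bc.get)

# The bridge node has only two neighbors, yet the highest betweenness:
# every shortest path between the two cliques passes through it.
print(top, G.degree(top))
```

The clique members have degree 4 or 5, but all 25 inter-clique node pairs route through node 5, so it dominates the ranking despite having degree 2.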

The betweenness centrality of a link is the fraction of shortest paths, among all possible node pairs, that pass through that link. Links with very high betweenness centrality often join cohesive regions of the network, called communities. Therefore betweenness can be used to locate and remove those links, allowing for the separation and consequent identification of communities.

The betweenness centrality depends on the size of the network. If we wish to compare the centrality of nodes or links in different networks, the betweenness values should be normalized. For node betweenness, the maximum number of paths that could go through node $i$ is the number of pairs of nodes excluding $i$ itself, which is ${N-1 \choose 2}$. The normalized betweenness of node $i$ is therefore obtained by dividing $b_i$ by this factor.

Centrality Distributions

The statistical distribution of a centrality measure tells us how many elements — nodes or links — have a certain value of centrality, for all possible values. In large networks, this is a useful tool to identify the classes of elements: by examining the distribution, we can see if there are notable values or groups of values and classify the elements accordingly. The range of the distribution also reveals the heterogeneity of the network elements with respect to a specific centrality measure of interest: for example, if node degrees span many orders of magnitude, from units to millions, then the network is very heterogeneous with respect to degree.

Heavy-tailed degree distributions display a large heterogeneity in the degree values: while many of the nodes have just a few neighbors, some others have many neighbors, which gives them a prominent role in the network. These nodes are called hubs. One way to measure the breadth of the degree distribution is to compute the heterogeneity parameter, which compares the variability of the degree across nodes to the average degree.
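A sketch of tabulating an empirical degree distribution; the scale-free test graph is an arbitrary choice for demonstration:

```python
import networkx as nx
from collections import Counter

# Scale-free test graph, chosen for illustration only.
G = nx.barabasi_albert_graph(10000, 2, seed=0)
N = G.number_of_nodes()

# p(k): fraction of nodes with degree k
counts = Counter(k for _, k in G.degree())
pk = {k: c / N for k, c in sorted(counts.items())}

# Heavy tail: low degrees are common, very high degrees rare but present.
print(min(pk), max(pk))  # smallest and largest observed degree
print(pk[2])             # a large fraction of nodes sit at low degree
```

The spread between the smallest and largest observed degree, often two orders of magnitude or more here, is the heterogeneity that the distribution makes visible.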

The heterogeneity parameter $\kappa$ (Greek letter “kappa”) of a network’s degree distribution is the ratio between the average of the squared degrees and the square of the average degree:

$$ \kappa = \frac{\overline{k^2}}{\bar{k}^2}, \qquad \text{where} \quad \overline{k^2} = \frac{\sum_i k_i^2}{N} \quad \text{and} \quad \bar{k} = \frac{\sum_i k_i}{N} $$

For a normal or narrow distribution with a sharp peak at some value, the degrees are concentrated around the average, therefore $\kappa \approx 1$. For a heavy-tailed distribution with the same average degree, $\kappa \gg 1$.
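A sketch comparing $\kappa$ for a narrow (Poisson-like) and a heavy-tailed degree distribution; both test graphs are arbitrary choices with a similar average degree:

```python
import networkx as nx

def kappa(G):
    """Heterogeneity parameter: second moment of the degree
    distribution divided by the square of the first moment."""
    degrees = [k for _, k in G.degree()]
    N = len(degrees)
    k_mean = sum(degrees) / N
    k2_mean = sum(k * k for k in degrees) / N
    return k2_mean / k_mean ** 2

G_narrow = nx.gnp_random_graph(2000, 0.005, seed=1)  # narrow, <k> ~ 10
G_heavy = nx.barabasi_albert_graph(2000, 5, seed=1)  # heavy tail, <k> ~ 10

print(kappa(G_narrow))  # close to 1
print(kappa(G_heavy))   # noticeably larger
```

Even at the same average degree, the heavy-tailed graph's second moment is inflated by its hubs, which is what pushes $\kappa$ above 1.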

If a network is directed, we have to consider two distributions, the in-degree and out-degree distributions, defined as the probability that a randomly chosen vertex has a given in- or out-degree, respectively. In this case, the definition of hub may refer to either the in-degree or the out-degree.

It turns out that the degree is usually correlated with other centrality measures. So, hubs typically rank among the most central nodes with respect to diverse criteria. There are exceptions as well, a node may have a large betweenness centrality if it connects different areas of the network, whether or not it has high degree.

Implications of Hubs

Friendship Paradox

The chances of hitting a hub increase if you move from the circle of neighbors to that of the neighbors of neighbors, and so on. This is because the number of links to follow increases at each step, so it becomes more likely that one of them is attached to a hub. On average, the degree of a node's neighbors is larger than the degree of the node itself. In other words, our friends have more friends than we do, on average. This is known as the Friendship Paradox. The broader the degree distribution, the more prominent the hubs and the stronger the effect.
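A numerical check of the Friendship Paradox; the scale-free test graph is an arbitrary choice:

```python
import networkx as nx

# Scale-free test graph, chosen for illustration only.
G = nx.barabasi_albert_graph(500, 3, seed=7)

avg_degree = sum(k for _, k in G.degree()) / G.number_of_nodes()

# For each node, the mean degree of its neighbors; then average over nodes.
neighbor_means = [
    sum(G.degree(v) for v in G[n]) / len(G[n]) for n in G if len(G[n]) > 0
]
avg_neighbor_degree = sum(neighbor_means) / len(neighbor_means)

print(avg_degree, avg_neighbor_degree)  # the second is larger
```

The gap arises because high-degree nodes appear in many neighbor lists, so they are over-represented when averaging over neighborhoods.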

Ultra small worlds

In a network with hubs, we expect that the average distance between any two nodes is shorter compared to a network with the same number of nodes and links, but no hubs. In fact, networks with broad degree distributions often have the so-called ultra-small world property, indicating that the distances between nodes are very short.
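A sketch of this comparison: average distances in a hub-rich (scale-free) graph versus a random graph with the same number of nodes and links. The graph sizes are arbitrary choices for illustration.

```python
import networkx as nx

n = 1000
G_sf = nx.barabasi_albert_graph(n, 3, seed=3)
G_rand = nx.gnm_random_graph(n, G_sf.number_of_edges(), seed=3)

# The random graph may be disconnected; measure its giant component.
if not nx.is_connected(G_rand):
    giant = max(nx.connected_components(G_rand), key=len)
    G_rand = G_rand.subgraph(giant)

l_sf = nx.average_shortest_path_length(G_sf)
l_rand = nx.average_shortest_path_length(G_rand)
print(l_sf, l_rand)  # both small relative to n
```

In both graphs the typical distance is only a few hops for a thousand nodes; the hubs in the scale-free graph act as shortcuts that keep distances especially short as the network grows.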

Hubs make the network more robust

A system is robust if the failure of some of its components does not affect its function. The standard robustness test for networks consists of checking how the connectedness is affected as more and more nodes are removed, along with all of their adjacent links. To estimate the amount of disruption following node removal, scholars compute the relative size of the giant component (i.e. the ratio of the number of nodes in the giant component to the number of nodes initially present in the network). Let us suppose that the initial network is connected. In this case the giant component coincides with the whole network, so its relative size is one. If the removal of a subset of nodes does not break it into disconnected pieces, the proportion of nodes in the giant component just decreases by the fraction of removed nodes. If, however, the node removal breaks the network into two or more connected components, the size of the giant component may drop substantially. As the fraction of removed nodes approaches one, the few remaining nodes are likely distributed among tiny components, so the proportion of nodes in the giant component is close to zero.

As long as a sufficient number of hubs survive, the system remains largely connected. Since we are removing nodes at random, the probability of hub failure is low, because hubs are statistically rare compared to other nodes.
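The random-failure test described above can be sketched as follows; the scale-free test graph and the removal fractions are arbitrary choices:

```python
import random
import networkx as nx

# Scale-free test graph, chosen for illustration only.
G0 = nx.barabasi_albert_graph(1000, 2, seed=5)
N = G0.number_of_nodes()

random.seed(5)
order = list(G0.nodes())
random.shuffle(order)  # random failure order

sizes = {}
for frac in (0.1, 0.3, 0.5):
    G = G0.copy()
    G.remove_nodes_from(order[: int(frac * N)])
    giant = max(nx.connected_components(G), key=len)
    sizes[frac] = len(giant) / N
    print(f"removed {frac:.0%}: giant component holds {sizes[frac]:.2f} of the nodes")
```

Because random removal rarely hits a hub, the giant component shrinks roughly in proportion to the removed fraction rather than collapsing; a targeted attack on the hubs would fragment the network much faster.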

Core Decomposition

When analyzing or visualizing a large network, it is often useful to focus on its denser portion (core). The degree of each node can be used to separate a network into distinct portions, called shells, based on their position in the core–periphery structure of the network. Low-degree outer shells correspond to the periphery. As they are removed, or peeled away, what remains is a denser and denser inner subnetwork, the core.

Formally, the $k$-core decomposition algorithm starts by setting $k=0$. Then it proceeds iteratively. Each iteration corresponds to a value of $k$ and consists of a few simple steps:

  1. Recursively remove all nodes of degree $k$ or less, until none are left (removing a node can lower the degree of its neighbors, hence the recursion).

  2. The removed nodes make up the $k$-shell and the remaining nodes make up the $(k+1)$-core, because within the remaining subnetwork every node has degree $k+1$ or larger.

  3. If there are no nodes left in the core, terminate; else, increment k for the next iteration.

Python code for centrality measures and core decomposition

# Closeness centrality
nx.closeness_centrality(G, node)  # closeness centrality of node

# Betweenness centrality - node
nx.betweenness_centrality(G)      # dict: node -> betweenness centrality

# Betweenness centrality - edge
nx.edge_betweenness_centrality(G) # dict: link -> betweenness centrality

# Core decomposition
nx.core_number(G)  # dict: node -> core number
nx.k_shell(G, k)   # subnetwork induced by the nodes in the k-shell
nx.k_core(G, k)    # subnetwork induced by the nodes in the k-core
nx.k_core(G)       # innermost core, i.e. the core with the largest core number