# Basics of network science with Networkx - 1

#### First instalment in the exploratory data analysis of graphs

By Vamshi Jandhyala in data science python

July 23, 2020

## Definition of a network

A network $G$ has two parts, a set of $N$ elements, called nodes or vertices, and a set of $L$ pairs of nodes, called links or edges. The link $(i,j)$ joins the nodes $i$ and $j$. A network can be undirected or directed. A directed network is also called a digraph. In directed networks, links are called directed links and the order of the nodes in a link represents the direction: the link $(i,j)$ goes from the source node $i$ to the target node $j$. In undirected networks, all links are bi-directional and the order of the two nodes in a link does not matter. A network can be unweighted or weighted. In a weighted network, links have associated weights: the weighted link $(i,j,w)$ between nodes $i$ and $j$ has weight $w$. A network can be both directed and weighted, in which case it has directed weighted links.

Each network is characterized by the total number of nodes $N$ and the total number of links $L$. We call $N$ the size of the network because it identifies the number of distinct elements composing the system.

There are several other classes of networks. A network might have more than one type of node. In a bipartite network, there are two groups of nodes such that links only connect nodes from different groups and not nodes from the same group. A network might have multiple types of links, in which case it is called a multiplex network.

## Subnetworks

A subnetwork (or subgraph) is obtained by selecting a subset of the nodes and all of the links among these nodes. The abundance of certain types of subnetworks and their properties is important in the characterization of real networks. As an example, a *clique* is a complete subnetwork: a subset of nodes all linked to each other. A special type of subnetwork is the *ego network* of a node, which is the subnetwork consisting of the chosen node — called the ego — and its neighbors.

## Network Representations

### Adjacency Matrix

The adjacency matrix is an $N\times N$ matrix in which each element represents the link between the nodes indexed by the corresponding row and column. Element $a_{ij}$ of the adjacency matrix represents the link between nodes $i$ and $j$. $a_{ij}=1$ if $i$ and $j$ are adjacent, $a_{ij}=0$ otherwise.

### Adjacency List

Adjacency matrix representation is not efficient for storing real networks, which are typically large and sparse. The required storage space grows like the square of the network size, but if the network is sparse, most of this space is wasted storing zeros (non-existing links). With large sparse networks, a more compact network representation is the adjacency list, a data structure that stores the list of neighbors for each node. Adjacency lists represent sparse networks efficiently because the non-existing links are ignored; only the existing links (non-zero values of the adjacency matrix) are considered.

### Edge list

We only store the list of edges in the graph.

## Density and Sparsity

The maximum number of links in a complete undirected network = ${N \choose 2}$ = $N(N-1)/2$. The maximum number of links in a complete directed network = $N(N-1)$. The maximum number of links in a complete bi-partite network = $N_1N_2$ where there are $N_1$ nodes of one colour and $N_2$ nodes of the second colour.

The fraction of possible links that actually exist, is called the *density* of the network. A complete network has maximal density: one. However, the actual number of links is typically much smaller than the maximum, as most pairs of nodes are not directly connected to each other. Therefore the density is often much, much smaller than one — by orders of magnitude in most real-world, large networks. This is called *sparsity*.

The density of a undirected network is given by $d = 2L/L_{max} = 2L/N(N-1)$. The density of a directed network is given by $d = L/L_{max} = L/N(N-1)$.

We say that the network is sparse if the number of links grows proportionally to the number of nodes ($L \sim N$), or even slower. If instead the number of links grows faster, e.g. quadratically with network size ($L \sim N^2$), then we say that the network is dense.

## Degree

### Undirected Networks

The degree of a node is its number of links, or neighbors. A node with no neighbours is called a *singleton*. The average degree of a network is denoted by $\overline{k}=\frac{\sum_ik_i}{N}$ where $k_i$ is the degree of node $i$. Since each link contributes to the degree of two nodes in an undirected network, we have $\sum_ik_i=2L$. From the definition of density for an undirected network we have

$$ \overline{k}=\frac{\sum_ik_i}{N} = \frac{2L}{N} = d(N-1) $$

The maximum possible degree of a node is $k_{max}=N-1$.

### Directed Networks

When we consider the degree of a node in a directed network, we have to think of incoming and outgoing links separately. The number of incoming links, or predecessors, of node $i$ is called the *in-degree* and denoted by $k_i^{in}$. The number of outgoing links, or successors, of node $i$ is called the *out-degree* and denoted by $k_i^{out}$.

The average in-degree for a directed network is given by

$$ \overline{k^{in}}=\frac{\sum_ik^{in}_i}{N} $$

The average out-degress for a directed network can be similarly defined.

### Weighted Networks

The weighted degree, or strength of a node in an undirected network, is the sum of the weights of its links. Similarly, we can define in-strength and out-strength in the case of a directed weighted network.

The weighted degree, or strength, of node $i$ in an undirected weighted network is denoted by

$$ s_i = \sum_j w_{ij} $$

where $w_{ij}$ is the weight of the link between nodes $i$ and $j$. We assume $w_{ij}=0$ if there is no link between $i$ and $j$. We can analogously generalize in-degree and out-degree to in-strength and out-strength in a directed weighted network:

$$
s^{in}*i = \sum_j w*{ij} \\

s^{out}*i = \sum_j w*{ij}
$$

where $w_{ij}$ is the weight of the directed link from $i$ to $j$.

## Homophily

Often, nodes that are connected to each other in a social network tend to be similar in their features: for example, relatives may live near each other, and friends may have similar interests. The technical name of this property is assortativity. Because of assortativity, we are able to make predictions about a person’s qualities by inspecting their neighbors.

Multiple factors may be responsible for the presence of assortativity in social networks. One possibility is that if people are similar in some way, they are more likely to select each other and become connected. This property is captured by a popular proverb: “birds of a feather flock together.” Its technical name is *homophily*.

The converse mechanism is when people who are friends become more similar over time, through the process of social influence. It is difficult to separate the causes of assortativity — does similarity induce links, or do links induce similarity?

## Assortativity

Assortativity based on degree is called degree assortativity or degree correlation: this occurs when high-degree nodes tend to be connected to other high-degree nodes, while low-degree nodes tend to have other low-degree nodes as neighbors. Networks with this property are called assortative.

There are two ways to measure the degree assortativity of a network, both based on measuring the correlation between degrees of neighbor nodes. We say that two variables are positively (negatively) correlated if larger values of one variable tend to correspond to larger (smaller) values of the other. Pearson’s correlation coefficient is a common way to measure correlation; it takes values in $[-1,1]$, with $0$ meaning no correlation and $\pm 1$ meaning perfect positive/negative correlation.

One measure of network assortativity is the assortativity coefficient, defined as the Pearson correlation between the degrees of pairs of linked nodes. When the assortativity coefficient is positive, the network is assortative, and when it is negative, the network is disassortative.

The second method is based on measuring the average degree of the neighbors of node $i$:

$$ k_nn(i) = \frac{1}{k_j} \sum_j a_{ij} $$

where $a_{ij}=1$ if $i$ and $j$ are neighbors, and $0$ otherwise. We then define the k-nearest-neighbors function $\overline{k_{nn}}(k)$ for nodes of a given degree $k$ as the average of $k_{nn}(i)$ across all nodes with degree $k$. If $\overline{k_{nn}}(k)$ is an increasing function of $k$, then high-degree nodes tend to be connected to high-degree nodes, therefore the network is assortative; if $\overline{k_{nn}}(k)$ decreases with $k$, the network is disassortative.

### Applications

The properties of assortativity are useful in the field of epidemiology, since they can help understand the spread of disease or cures. For instance, the removal of a portion of a network’s vertices may correspond to curing, vaccinating, or quarantining individuals or cells. Since social networks demonstrate assortative mixing, diseases targeting high degree individuals are likely to spread to other high degree nodes. Alternatively, within the cellular network—which, as a biological network is likely dissortative—vaccination strategies that specifically target the high degree vertices may quickly destroy the epidemic network.

## Paths and Distances

The path is the sequence of links traversed. The number of links in a path is called the path length. There may be multiple paths between the same two nodes. These paths may have different lengths, and may or may not share some common links. In directed paths, we must comply with link directions. A cycle is a special path that can be traversed to go from one node back to itself. A simple path never goes through the same link more than once.

The natural distance measure between two nodes is defined as the minimum number of links that must be traversed in a path connecting the two nodes. Such a path is called the shortest path, and its length is called the shortest-path length. There may be multiple shortest paths between two nodes; obviously they all must have the same length.

In many networks, link weights express a measure of similarity or intensity of interaction between two connected nodes. We may then be interested in finding paths with large weights. A common approach is to transform the weights into distances by taking the inverse (one divided by the weight), so that a large weight corresponds to a short distance. Then the problem becomes equivalent to finding short-distance paths.

By using the shortest-path length as a measure of distance among nodes, it is possible to define aggregate distance measures for an entire network: the average shortest-path length (or simply average path length) is obtained by averaging the shortest-path lengths across all pairs of nodes. The diameter of the network is instead the maximum shortest-path length across all pairs of nodes (i.e. the length of the longest shortest path in the network).

The average path length of an undirected, unweighted network as

$$ \overline{l} = \frac{\sum_{i,j} l_{ij}}{N \choose 2} $$

where $l_{ij}$ is the shortest-path length between nodes $i$ and $j$, and $N$ is the number of nodes. The sum is over all pairs of nodes, and we divide by the number of pairs to compute the average.

In the directed network case the definition is analogous, but the distance $l_{ij}$ is based on the shortest directed path between $i$ and $j$, and each pair of nodes is considered twice for paths in both directions:

$$ \overline{l} = \frac{\sum_{i,j}l_{ij}}{N(N-1)} $$

The diameter of a network is defined as

$$ l_{max} = \max_{i,j} l_{ij} $$

## Connectedness and Components

A network is connected if you can reach any node from any other node by following a path along links and intermediate nodes). If a network is not connected, we say that it is disconnected; it is composed of more than one connected components or simply components. A component is a subnetwork containing one or more nodes, such that there is a path connecting any pair of these nodes, but there is no path connecting them to other components. The largest connected component in many real networks includes a substantial portion of the network and is called the *giant* component. In a connected network, the giant component coincides with the entire graph.

To measure network distance when a network is disconnected. One way is to consider only the nodes in the giant component. Another approach is to average the distance only across pairs of nodes in the same component, but considering all components. To calculate the diameter of a disconnected network, one can calculate the diameter of each component and then take the maximum.

In the directed case, we have to pay attention to link directions when determining whether a node can be reached from another. We can of course ignore the link directions and treat the links as if they were undirected. In this case we refer to the components as *weakly* connected.

We say that a directed network is *strongly* connected if it is a single strongly connected component. A directed network is weakly connected if it is a single weakly connected component.

## Trees

Trees are a special class of undirected, connected networks such that the deletion of any one link will disconnect the network into two components. The number of links in a tree is $N-1$. They have no cycles. Trees are hierarchical. You can pick any node in a tree and call it a root. Each node in a tree is connected to a parent node (toward the root) and to one or more children nodes (away from the root). The exceptions are the root, which has no parent, and the so-called leaves of the tree, which have no children.

## Social Distance

The average path length, characterizes how close or far we expect nodes to be in a network. Intuitively, in a grid-like network like road networks and power grids, paths can be long. Is this typical of many real-world networks? Let us start by considering a few social networks, in which this question has been explored extensively.

It turns out that pretty much all social networks have very short paths among nodes - which gives rise to the popular notion of “small world” and “six degrees of separation”. Consequently, the number of friends of friends out there is much, much larger than we think.

We say that the average path length is short when it grows very slowly with the size of the network. We can express slow growth mathematically by saying that the average path length scales logarithmically with the size of the network.

Short paths that obey this kind of relationship are found across social networks, including academic collaborations, actor networks, networks of high-school friends, and online social networks such as Facebook. As it turns out, short paths are a ubiquitous feature in almost all real-world networks; grid-like networks are among the few exceptions.

## Clustering

In a social network, if Alice and Bob are both friends of Charlie’s, they are also likely to be friends of each other. In other words, there is a good chance that a friend of my friend is also my friend. This translates into the presence of many triangles in the network. The connectivity among the neighbors of the nodes is an important feature of the local structure of the network because it captures how tightly knit, or clustered, the nodes are.

The clustering coefficient of a node is the fraction of pairs of the node’s neighbors that are connected to each other. This is the same as the ratio between the number of triangles that include the node, and the maximum number of triangles in which the node could participate.

The clustering coefficient of node $i$ is formally defined as

$$ C(i) = \frac{\tau(i)}{\tau_{max}(i)} = \frac{\tau(i)}{k_i \choose 2} = \frac{2\tau(i)}{k_i(k_i-1)} $$

where $\tau(i)$ is the number of triangles involving $i$. The maximum possible number of triangles for $i$ is the number of pairs formed by its $k_i$ neighbors. Note that $C(i)$ is only defined if the degree $k_i>1$ due to the terms $k_i$ and $k_{i-1}$ in the denominator: a node must have at least two neighbors for any triangle to be possible.

The clustering coefficient of the entire network is the average of the clustering coefficients of its nodes:

$$ C = \frac{\sum_{i;k_i>1}C(i)}{N_{k>1}} $$

Nodes with degree $k<2$ are excluded when calculating the average clustering coefficient.

A low clustering coefficient (near zero) means that the network has few triangles, while a high clustering coefficient (near one) means that the network has many triangles. Social networks have a large clustering coefficient; a significant portion of all possible triangles are present. A simple mechanism explains the abundance of triangles in social networks: we meet people through shared contacts, thus closing triangles. This mechanism, is called triadic closure. Online social networks make suggestions based on triadic closure. For example, Facebook recommends “people you may know” based on common friends, and Twitter recommends accounts followed by your friends (whose accounts you follow). These recommendations result in high clustering.

### Applications

Triangle count and clustering coefficient have been shown to be useful as features for classifying a given website as spam, or non-spam, content. Clustering coefficient has been used to investigate the community structure of Facebook’s social graph, where they found dense neighbourhoods of users in an otherwise sparse global graph. Clustering coefficient has been proposed to help explore thematic structure of the web, and detect communities of pages with a common topic based on the reciprocal links between them.

## Python code using networkx

```
# Create Graphs
G = nx.Graph()
G.add_node(1)
G.add_node(2)
G.add_edge(1,2)
G.add_nodes_from([3,4,5])
G.add_edges_from([(2,3), (3,4),(3,5)])
nx.draw(G,with_labels=True)
# List nodes, links and neighbors of a given node
print(G.number_of_nodes())
print(G.number_of_edges())
print(G.nodes())
print(G.edges())
print(list(G.neighbors(3)))
# Subgraphs
nx.draw(nx.subgraph(G, (2,3,4,5)), with_labels=True)
# Density
print(nx.density(G))
# Degree
print(G.degree(2)) # returns the degree of node 2
print(G.degree()) # returns the degree of all nodes of G
# Network representation
## Adjacency Matrix
print("graph\n",nx.adjacency_matrix(G)) # graph
## Adjacency List
for n,neighbors in G.adjacency():
for number,link_attributes in neighbors.items():
print('(%d, %d)' % (n,number))
## Edge List
nx.write_edgelist(G, "file.edges")
G2 = nx.read_edgelist("file.edges") # G2 same as G
# Assortativity
assort_a = nx.attribute_assortativity_coefficient(G, category)
assort_n = nx.numeric_assortativity_coefficient(G, quantity)
import scipy.stats
knn_dict = nx.k_nearest_neighbors(G)
k, knn = list(knn_dict.keys()), list(knn_dict.values())
r, p_value = scipy.stats.pearsonr(k, knn)
r,p_value
# Paths and distances
print("Path between 1 and 2 exists? ", nx.has_path(G, 1, 2))
print("Shortest path between 1 and 5 ", nx.shortest_path(G, 1, 5))
print("Shortest path length between 1 and 5 ", nx.shortest_path_length(G,1,5))
print("Shortest path from 1 to other nodes ",nx.shortest_path(G, 1))
print("Shortest path lengths from 1 to other nodes ",nx.shortest_path_length(G, 1))
print("Shortest paths for all pairs of nodes ",nx.shortest_path(G)) # all pairs
print("Shortest path lengths for all pairs of nodes ",list(nx.shortest_path_length(G))) # all pairs
print("Average shortest path length", nx.average_shortest_path_length(G))
# Connectedness and components
nx.is_connected(G)
comps = sorted(nx.connected_components(G), key=len, reverse=True)
nodes_in_giant_comp = comps[0]
GC = nx.subgraph(G, nodes_in_giant_comp)
nx.is_connected(GC)
nx.is_strongly_connected(D)
nx.is_weakly_connected(D)
list(nx.weakly_connected_components(D))
list(nx.strongly_connected_components(D))
# Trees
nx.is_tree(S)
# Clustering
nx.triangles(G) # dict node -> no. triangles
nx.clustering(G, 2) # clustering coefficient of node
nx.clustering(G) # dict node -> clustering coeff.
nx.average_clustering(G) # network's clustering coeff.
```