Epidemic Spreading in Social Networks

Sergio Ballesteros Solanas

sergio@ballesteros.me
  • PART 1: EPIDEMIC PROCESS ON A SOCIAL NETWORK

    1.1 - Consider either a synthetic Barabasi-Albert graph or a real social network (e.g., from the SNAP http://snap.stanford.edu/ or Konect http://konect.uni-koblenz.de/ repositories). If the chosen graph has multiple disconnected components, select the largest connected component. Make sure the graph has at least a few thousand nodes. Compute and plot the degree distribution.

    1.2 - Set up a simulation of an SIR epidemic process on the graph. Start the epidemic from a randomly chosen node. Choose values of model parameters \beta and \mu that allow the epidemic to take off with high probability, reaching most of the nodes.

    1.3 - Plot epidemic curves for multiple stochastic realizations of the epidemic. Compute the probability distribution of the overall attack rate (number of recovered nodes at the end of the simulation / total number of nodes) and the probability distribution of peak times for the epidemic. Display these distributions using boxplots.

  • PART 2: VACCINATION AND HERD IMMUNITY

    2.1 - Modify the simulation above so that it supports a given initial fraction r of randomly chosen "immunized" nodes, i.e., nodes that cannot be infected. Can you provide an upper bound for the overall attack rate without having to simulate the epidemic?

    2.2 - For a range of values of the initial fraction of immunized nodes (e.g., r = 0.01, 0.1, 0.5, 0.8, 0.9, ...) simulate the SIR epidemic above (multiple realizations for each value of r) and plot the overall attack rate as a function of the fraction r of immunized nodes.

    2.3 - Generate a random Erdős–Rényi graph with the same size and density as the original social network. Repeat the experiment 2.2 above. Compare the results you obtain in this case and in the previous case, and explain the differences you observe.

  • PART 3: TARGETED VACCINATION STRATEGIES

    3.1 - Imagine that you have a "budget" of M vaccination doses, with M < N, where N is the size of your network. That is, you can immunize a fraction r = M/N of nodes. You have studied above the effect of randomly immunizing the network nodes. Can you improve the performance of immunization, in terms of reduced overall attack rate, by means of "targeted" immunization? That is, by choosing the nodes to be immunized according to some specific strategy rather than choosing them at random. Provide an example of such a strategy, and test it in simulation, comparing the results you obtain with those of Part 2.2.

    3.2 - Now imagine that you still have a limited budget of M vaccination doses, but you cannot use information about the graph to decide how to use it. You can simulate a certain number of epidemics, without immunization, and use "historical" information on which nodes are infected (and when, and how often) to define your targeted immunization strategy. Design such a strategy and show its performance in simulation, comparing it to the random immunization of Part 2.2 and the targeted strategy you devised in Part 3.1 above.

    3.3 - Finally, imagine that you have limited information about the social network: you are given a set of K nodes (K << N, say K ~ 10% of the network) and for those K nodes you are given a list of their neighbors. Design a targeted immunization strategy that makes use of this information, and test its performance in simulation.

In [87]:
# Sergio Ballesteros

import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
import operator
import itertools


# Quality parameters for the simulation
# Choose from 7 to 20 (more is better but slower)
precision = 11
no_simulations = 15

# Number of historical epidemics simulated for Part 3.2
# Choose something from 30 to 150 (more is better but slower)
part_32 = 30

# Number of nodes
number_nodes = 2000

We create a synthetic social network from a Barabási–Albert graph

In [88]:
g = nx.barabasi_albert_graph(number_nodes,4)
g = nx.convert_node_labels_to_integers(g)

Part 1.1

We check that it indeed consists of a single connected component

In [89]:
nx.number_connected_components(g)
Out[89]:
1

Now we can check how many nodes the graph has

In [90]:
no_nodes = nx.number_of_nodes(g)
no_nodes
Out[90]:
2000

Finally, we plot the degree distribution of the graph (with the x axis in log scale):

In [91]:
gdegrees = nx.degree(g)
degrees = [gdegrees[v] for v in g.nodes()]
plt.hist(np.log(degrees), density=True)
plt.xlabel("log(degree)")
plt.ylabel("Density")
plt.show()
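Since the Barabási–Albert model produces a heavy-tailed degree distribution, it is also instructive to look at the empirical distribution on log-log axes, where a power law shows up as a roughly straight line. A minimal sketch, reusing the `degrees` list computed above:

In [ ]:
# Empirical degree distribution on log-log axes; for a Barabasi-Albert graph
# we expect an approximate power law P(k) ~ k^-3, i.e. a roughly straight line
values, counts = np.unique(degrees, return_counts=True)
plt.loglog(values, counts / counts.sum(), 'o')
plt.xlabel("Degree k")
plt.ylabel("Empirical P(k)")
plt.show()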

Part 1.2

Here we set up the SIR simulation

In [93]:
def SIR(g, beta=0.1, mu=0.1, seed = [], verbose = False, max_iters = 1000):
    '''
    Runs a SIR simulation
    :param g: networkx graph
    :param beta: (optional) probability of infecting a susceptible neighbor per time step
    :param mu: (optional) probability of recovering per time step
    :param seed: (optional, list) list of initially infected nodes
    :param verbose: (optional, boolean) show progress
    :param max_iters: (optional, int) kill the simulation after this many iterations
    :return: a dictionary (data) with the number of nodes in each category over time
    '''
    # Save the steps as a dict
    data = {"S": [], "R": [], "I": [], "I_R": [], "S_I": [], "iter": []}
    
    # All nodes are susceptible at the beginning
    S = set(np.arange(nx.number_of_nodes(g)))
    
    # If a seed is given, use it as the set of initially infected nodes;
    # otherwise, select one node at random
    
    seed = set(seed)
    
    if len(seed) > 0:
        I = seed
    else:
        I = set([np.random.choice(list(S))])
        
    # Remove the infected from the susceptibles
    for infected in I:
        S.remove(infected)
        
    # Now the recovered set
    R = set()
    
    S_I = set(I)
    I_R = set()
    
    ite = 0
    while ite < max_iters and len(I) > 0:
        ite = ite + 1
        if verbose:
            print("Progress: " + str(round(ite/max_iters*100, 1))
                  + "%" + " No I: " + str(len(I)) + ", S: " + str(len(S)) + ", R: " + str(len(R)))
        # Run over all the infected individuals
        for i in I:
            # Get the susceptible neighbors of the infected node
            neighbors = set(g[i]).intersection(S)

            # Infect each susceptible neighbor with probability beta
            for n in neighbors:
                if np.random.uniform() < beta:
                    S.remove(n)
                    S_I.add(n)

            # Recover the infected node with probability mu
            if np.random.uniform() < mu:
                R.add(i)
                I_R.add(i)
                    
        # Update the infected set: everyone ever infected (S_I)
        # minus everyone who has recovered (I_R)
        I = I.union(S_I)
        I = I.difference(I_R)
        
        # Save the number of nodes on each group
        data["I"].append(len(I))
        data["R"].append(len(R))
        data["S"].append(len(S))
        data["I_R"].append(len(I_R))
        data["S_I"].append(len(S_I))
        data["iter"].append(ite)
        
    return data


# And we can plot the S, I and R over time
def plot_pandemia(data):
    plt.plot(data["iter"], data["S"], data["iter"], data["I"], data["iter"], data["R"])
    plt.legend(["Susceptible", "Infected", "Recovered"])
    plt.xlabel("Time (steps)")
    plt.ylabel("Number of nodes")
    plt.show()

We can run the simulation for some different parameters, for example for $\beta = \mu = 0.1$

In [94]:
data = SIR(g, beta=0.1, mu=0.1, verbose=False)

# Show a plot for the first simulation
plot_pandemia(data)

In the plot above the epidemic took off, but if we change the parameters, increasing $\beta$ to 0.3 while keeping $\mu = 0.1$, the epidemic grows much faster:

In [95]:
data = SIR(g, beta=0.3, mu=0.1, verbose=False)
plot_pandemia(data)

Conversely, if $\mu$ is larger than $\beta$ (0.3 and 0.1 respectively), the epidemic is less likely to take off, and when it does, the peak number of infected nodes is lower:

In [148]:
data = SIR(g, beta=0.1, mu=0.3, verbose=False)
plot_pandemia(data)

Part 1.3

In order to make the box plots, I will run the simulation several times and compute the attack rates and peak times

In [97]:
def find_peak_time(data):
    '''
    Returns the time step at which the epidemic peaks (first maximum of I)
    :param data: the data output of my function SIR
    :return: the time as int
    '''
    peak_idx = int(np.argmax(data["I"]))
    return data["iter"][peak_idx]

# Here I run the simulation several times for each combination of parameters

ratios = []
peaks_times = []
counter = 0
params_list = [(0.3, 0.1), (0.1, 0.3), (0.1, 0.1)]
for params in params_list:
    ratio = []
    peak_times = []
    for i in range(no_simulations):
        counter += 1
        #print("Progress: " + str(round(100*counter/(len(params_list)*no_simulations))) + "%")
        data = SIR(g, beta=params[0], mu=params[1], verbose=False)
        ratio.append(data["R"][-1]/no_nodes)
        peak_times.append(find_peak_time(data))
    ratios.append(ratio)
    peaks_times.append(peak_times)
    
print("Done")
Done

Now we can look at the boxplots of the attack rate for the different parameter values

In [98]:
labels = [ "beta = " + str(p[0]) + "\n mu = " + str(p[1]) for p in params_list]
plt.boxplot(ratios, showfliers=False, labels=labels)
plt.ylabel("Overall attack rate")
plt.show()

As we can see, when $\beta = 0.3$ and $\mu = 0.1$ the epidemic reaches virtually the entire population, while with $\beta = 0.1$ and $\mu = 0.3$ the attack rate is much lower.

Also we can compute the peak times of the epidemic:

In [99]:
plt.boxplot(peaks_times, showfliers=False, labels=labels)
plt.ylabel("Peak times")
plt.show()

As we can see, $\beta = 0.3$ and $\mu = 0.1$ gives an early peak. $\beta = 0.1$ and $\mu = 0.3$ may also look early, but that is actually because the infection quickly dies out on many occasions.
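This die-out effect can be made quantitative by estimating the take-off probability from the runs above; a small sketch, using an (arbitrary) 10% attack-rate cutoff to decide whether a run took off:

In [ ]:
# Fraction of runs whose final attack rate exceeds 10% (arbitrary cutoff),
# as a rough estimate of the take-off probability for each parameter pair
for (beta, mu), ratio in zip(params_list, ratios):
    takeoff = np.mean([x > 0.1 for x in ratio])
    print("beta = %.1f, mu = %.1f -> take-off probability ~ %.2f" % (beta, mu, takeoff))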

So for the following simulations, where I need a combination of parameters that gives a high probability of an outbreak, I will use $\beta = 0.1$ and $\mu = 0.1$. I could choose $\beta = 0.9$ and $\mu = 0.01$, but I want simulations in which the population is sometimes not fully infected.
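A rough sanity check for this choice is the branching heuristic: each infected node infects each susceptible neighbor with probability $\beta$ per step and stays infectious for about $1/\mu$ steps, so $R_0 \approx \beta \langle k \rangle / \mu$ (a crude estimate that ignores degree heterogeneity and the depletion of susceptibles). A sketch:

In [ ]:
# Crude reproduction-number estimate R0 ~ beta * <k> / mu
# (ignores degree heterogeneity and depletion of susceptibles)
mean_k = 2 * g.number_of_edges() / g.number_of_nodes()
for beta, mu in params_list:
    print("beta = %.1f, mu = %.1f -> R0 estimate ~ %.1f" % (beta, mu, beta * mean_k / mu))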

Part 2.1

Here I extend the code from before to support nodes that are immune

In [102]:
def SIRI(g, beta=0.1, mu=0.1, r = 0, seed = [],
         seed_immune = [], verbose = False, max_iters = 1000, detailed_return = False):
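    '''
    Runs a SIR simulation with a fraction r of initially immunized nodes
    :param r: (optional) fraction of nodes that are immune
    :param seed_immune: (optional, list) nodes to immunize, in priority order;
        if empty, the immune nodes are chosen uniformly at random
    :param detailed_return: (optional, boolean) if True, store the node sets
        themselves at each step instead of just their sizes
    (the remaining parameters are as in SIR above)
    '''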
    # Save the steps as a dict
    data = {"S": [], "R": [], "I": [], "I_R": [], "S_I": [], "iter": [], "IN": []}

    # All nodes are susceptible at the beginning
    S = set(np.arange(nx.number_of_nodes(g)))
    
    # If a seed is given, use it as the set of initially infected nodes;
    # otherwise, select one node at random
    
    seed = set(seed)
    
    if len(seed) > 0:
        I = seed
    else:
        I = set([np.random.choice(list(S))])
        
    # Remove the infected from the susceptibles
    for infected in I:
        S.remove(infected)

    # Select the nodes that are immune
    no_immune = int(round(nx.number_of_nodes(g)*r))
    if len(seed_immune) == 0:
        IN = set(np.random.choice(list(S), size=no_immune, replace=False))
        if verbose: print("Random immunization")
    else:
        IN = set(seed_immune[0:no_immune]).intersection(S)
        if verbose: print("Deterministic immunization")

    # Remove the immune nodes from S
    for immune in IN:
        S.remove(immune)


    # Now the recovered set
    R = set()
    
    S_I = set(I)
    I_R = set()
    
    ite = 0
    while ite < max_iters and len(I) > 0:
        ite = ite + 1
        if verbose:
            print("Progress: " + str(round(ite/max_iters*100, 1))
                  + "%" + " No I: " + str(len(I)) + ", S: " + str(len(S)) + ", R: " + str(len(R)))
        # Run over all the infected individuals
        for i in I:
            # Get the susceptible neighbors of the infected node
            neighbors = set(g[i]).intersection(S)

            # Infect each susceptible neighbor with probability beta
            for n in neighbors:
                if np.random.uniform() < beta:
                    S.remove(n)
                    S_I.add(n)

            # Recover the infected node with probability mu
            if np.random.uniform() < mu:
                R.add(i)
                I_R.add(i)
                    
        # Update the infected set: everyone ever infected (S_I)
        # minus everyone who has recovered (I_R)
        I = I.union(S_I)
        I = I.difference(I_R)
        
        # Save the data (set sizes by default; the sets themselves if detailed)
        if not detailed_return:
            data["I"].append(len(I))
            data["R"].append(len(R))
            data["S"].append(len(S))
            data["I_R"].append(len(I_R))
            data["S_I"].append(len(S_I))
            data["iter"].append(ite)
            data["IN"].append(len(IN))
        else:
            # Append copies: S, R, I_R and S_I are mutated in place, so appending
            # them directly would make every entry alias the same final set
            data["I"].append(set(I))
            data["R"].append(set(R))
            data["S"].append(set(S))
            data["I_R"].append(set(I_R))
            data["S_I"].append(set(S_I))
            data["iter"].append(ite)
            data["IN"].append(set(IN))
        
    return data

# And we can plot the S, I, IN and R over time

def plot_pandemia_I(data):
    plt.plot(data["iter"], data["S"], data["iter"], data["I"],
             data["iter"], data["R"], data["iter"], data["IN"])
    plt.legend(["Susceptible", "Infected", "Recovered", "Immune"])
    plt.xlabel("Time (steps)")
    plt.ylabel("Number of nodes")
    plt.show()

An example of this immunization, with 10% of the population immunized:

In [103]:
data = SIRI(g, r = 0.1, verbose=False)
plot_pandemia_I(data)

Clearly, the immune nodes can never become part of the recovered set. This gives an upper bound on the overall attack rate (OAR):

$OAR = \frac{R}{N} \leq \frac{N - N_{immune}}{N} = 1 - r$
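We can sanity-check this bound against the simulation; a minimal sketch:

In [ ]:
# Sanity check: the final attack rate should never exceed 1 - r
r = 0.3
for _ in range(5):
    data = SIRI(g, r=r)
    oar = data["R"][-1] / no_nodes
    print("OAR = %.3f <= 1 - r = %.2f : %s" % (oar, 1 - r, oar <= 1 - r))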

Part 2.2

The function below plots the overall attack rate as a function of r

In [104]:
def rate_inmunization_plot(g, precision = 11, mu = 0.1,
                           beta = 0.1, seed_immune = [],
                           no_simulations=15, verbose = False, return_more = False):    
    # Save here the simulations
    ratios = []
    
    # Use these values of r
    r_values = np.linspace(0.05,0.95,precision)
    
    # Go for it
    counter = 0
    for r in r_values:
        counter += 1
        ratio = []
        if verbose:
            print("Progress: " + str(round(counter/len(r_values)*100)) + "%")
        for i in range(no_simulations):
            data = SIRI(g, beta=beta, mu=mu, r=r, seed_immune=seed_immune, verbose=False)
            ratio.append(data["R"][-1]/nx.number_of_nodes(g))
        ratios.append(ratio)
        
    # Since each r has several simulations, we take the mean of the ratios;
    # the std is then used to plot error bars
    
    ratios_mean = [np.mean(ratio) for ratio in ratios]
    ratios_std = [np.std(ratio) for ratio in ratios]
    
    if not return_more:
        # Make a plot
        plt.errorbar(r_values, ratios_mean, yerr=ratios_std, fmt='o')
        plt.xlabel("Immunization ratio r")
        plt.ylabel("Overall attack rate")
        plt.show()
        
        return ratios_mean
    
    else:
        return ratios

For our original graph we get the following figure, where the error bars represent one standard deviation:

In [105]:
ratios_g = rate_inmunization_plot(g, precision=precision, no_simulations=no_simulations)

Part 2.3

Now we create an Erdős–Rényi graph with the same size and density as the original; to match the density I use the mean degree of the original graph

In [144]:
mean_degree = 2 * nx.number_of_edges(g) / no_nodes
e = nx.erdos_renyi_graph(n=no_nodes, p=mean_degree/(no_nodes - 1))
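As a quick check, the two graphs should now have nearly the same density:

In [ ]:
# The ER graph should match the original graph's density up to sampling noise
print("Density of g:", nx.density(g))
print("Density of e:", nx.density(e))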

Run the overall-attack-rate function for the ER graph:

In [145]:
ratios_e = rate_inmunization_plot(e, precision=precision, no_simulations=no_simulations)

Now we can drop the error bars and put together the two plots

In [146]:
r_values = np.linspace(0.05, 0.95, precision)
plt.plot(r_values, ratios_g)
plt.plot(r_values, ratios_e)
plt.legend(['Studied graph', 'Erdős–Rényi graph'])
plt.xlabel("Immunization ratio r")
plt.ylabel("Overall attack rate")
plt.show()

Clearly, the Erdős–Rényi graph has a higher overall attack rate. This is because its edges are distributed homogeneously across all the nodes, so the infection can quickly spread to any part of the graph.
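The structural difference behind this is visible in the degree statistics: both graphs have the same mean degree, but the Barabási–Albert graph is far more heterogeneous, with a few large hubs. A quick comparison:

In [ ]:
# Same mean degree, very different heterogeneity:
# the BA graph has hubs, the ER graph does not
for name, graph in [("BA", g), ("ER", e)]:
    degs = [graph.degree(v) for v in graph.nodes()]
    print("%s: mean degree %.1f, std %.1f, max %d" % (name, np.mean(degs), np.std(degs), max(degs)))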

Part 3.1

We can write a function that returns the nodes with the highest betweenness centrality, i.e. the ones that act as bridges between communities; immunizing them should be more effective at blocking the spread of the infection.

In [110]:
def best_betweenness_centrality(g, r=0):
    '''
    Ranks the nodes of a graph by betweenness centrality (highest first)
    :param g: a networkx graph
    :param r: [optional] ratio of immunization in 0-1
    :return: the full ranking if r == 0, otherwise the top r fraction of nodes
    '''
    d = nx.betweenness_centrality(g)
    ranking = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
    nodes = [couple[0] for couple in ranking]
    if r == 0:
        return nodes
    else:
        n_nodes = len(nodes)
        return nodes[0:round(n_nodes*r)]

In [111]:
# Get all nodes of g sorted by betweenness centrality (most central first)
best_betweenness = best_betweenness_centrality(g)

Now we can run the simulation. For each value of r, the function selects the top r·N nodes from the list computed in the previous cell.

In [112]:
ratios_g_central = rate_inmunization_plot(g,
                                          precision=precision,
                                          seed_immune=best_betweenness,
                                          no_simulations=no_simulations)

Now we can drop the error bars and put together the two plots

In [113]:
r_values = np.linspace(0.05,0.95,precision)
plt.plot(r_values, ratios_g)
plt.plot(r_values, ratios_g_central)
plt.legend(['Random immunization', 'Betweenness centrality immunization'])
plt.xlabel("Immunization ratio r")
plt.ylabel("Overall attack rate")
plt.show()

Fantastic: betweenness-centrality immunization is much better than the random one, as the OAR decays much faster with r.
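As a side note, a much cheaper targeted strategy would be to rank nodes by degree instead (betweenness centrality is expensive to compute on large graphs, while degrees come for free). A sketch of such a ranking, not run here, which could be passed to `rate_inmunization_plot` through `seed_immune` exactly like `best_betweenness` above:

In [ ]:
# Alternative, cheaper targeting: immunize the highest-degree nodes first.
# This is just a sketch of how another strategy would plug into the same pipeline.
best_degree = sorted(g.nodes(), key=lambda v: g.degree(v), reverse=True)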

Part 3.2

Let's first run some simulations without any immunization to collect historical data. From these we can estimate how likely each node is to be infected at time t; in other words, for each point in time, a histogram of the number of times each node was infected. To do that, I will create a matrix whose rows represent time and whose columns represent nodes, so its dimensions are t × n:

In [116]:
infected_nodes = []
peak_times = []

for _ in range(part_32):
    data_detailed = SIRI(g, beta=0.1, mu=0.1, r=0, verbose=False, detailed_return=True)
    # Compute the peak time from the sizes of the infected sets
    sizes = {"I": [len(s) for s in data_detailed["I"]], "iter": data_detailed["iter"]}
    peak_times.append(find_peak_time(sizes))
    infected_nodes.append(data_detailed["I"])

# Pad all the runs with empty sets so they have the same length
max_len = max(len(run) for run in infected_nodes)
for run in infected_nodes:
    while len(run) < max_len:
        run.append(set())

# For each time step, concatenate the infected nodes over all the runs
new_infected_nodes = []
for i in range(max_len):
    l = [list(run[i]) for run in infected_nodes]
    new_infected_nodes.append(list(itertools.chain.from_iterable(l)))

# Matrix of infection counts: matrix[t][i] is the number of runs
# in which node i was infected at time t
matrix = [[0 for __ in range(nx.number_of_nodes(g))] for _ in range(max_len)]

for t in range(len(new_infected_nodes)):
    for i in new_infected_nodes[t]:
        matrix[t][i] += 1

Now we can draw a heat map showing which nodes get infected most often. Each cell represents a matrix entry; the higher the value, the hotter it looks in the image.

In [117]:
plt.imshow(matrix, cmap='hot', interpolation='nearest', aspect='auto')
plt.colorbar()
plt.ylabel("Time")
plt.xlabel("Node ID")
plt.show()

Clearly some nodes are infected earlier on average and others later. By the nature of the Barabási–Albert graph, nodes with lower IDs are more likely to get infected early (see the check below). Since there is randomness in all this, I will count the number of times each node was infected before the average peak time of the epidemic, and then build a list sorted so that the nodes infected most often before the peak come first.
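The intuition about low IDs comes from the Barabási–Albert growth process: nodes added early have had more time to accumulate edges, so low-ID nodes tend to be hubs. A quick check of this (negative) relationship:

In [ ]:
# In a Barabasi-Albert graph, early (low-ID) nodes tend to be hubs,
# so the correlation between node ID and degree should be clearly negative
ids = list(g.nodes())
degs = [g.degree(v) for v in ids]
print("corr(node ID, degree) =", np.corrcoef(ids, degs)[0, 1])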

In [118]:
peak_time_t = int(round(np.mean(peak_times)))

# Total number of times each node was infected before the average peak time
infection_counts = np.sum(matrix[0:peak_time_t], axis=0)

plt.plot(infection_counts)
plt.ylabel("Times infected before peak time")
plt.xlabel("Node ID")
plt.show()

Now we can rank the nodes by the number of times they were infected before the peak time
In [119]:
# Order the nodes by the number of times they were infected before the peak
sorted_infected_nodes = sorted(range(len(infection_counts)),
                               key=lambda k: infection_counts[k], reverse=True)

And finally we run the simulation to see whether this strategy reduces the epidemic:

In [120]:
ratios_g_new_immunization = rate_inmunization_plot(g,precision=precision,
                       seed_immune=sorted_infected_nodes,
                       no_simulations=no_simulations)

That looks like an improvement with respect to pure randomness, so now let's compare it with the other immunization strategies.

Now we can drop the error bars and put together the three plots

In [121]:
r_values = np.linspace(0.05,0.95,precision)
plt.plot(r_values, ratios_g)
plt.plot(r_values, ratios_g_central)
plt.plot(r_values, ratios_g_new_immunization)
plt.legend(['Random immunization',
            'Betweenness centrality immunization',
            'Historical data immunization'])
plt.xlabel("Immunization ratio r")
plt.ylabel("Overall attack rate")
plt.show()

Although it performs slightly worse than betweenness immunization, it does much better than random immunization, without using any graph information at all.

Part 3.3

Let's select 10% of the nodes with a function. The function builds a subgraph under the conditions of 3.3 (the sampled nodes plus their neighbor lists) and computes betweenness centrality inside that subgraph. It then orders the nodes of the subgraph from most to least central. The nodes of the original graph that are not in the subgraph are appended at the end of the list in random order.

In [123]:
def information_small_population(g, p = 0.1):
    '''
    Given a graph g, samples a fraction p of its nodes, builds the subgraph
    formed by those nodes and their neighbor lists, and ranks the subgraph's
    nodes by betweenness centrality (the nodes not in the subgraph are
    appended at the end of the list in random order)
    :param g: networkx graph
    :param p: fraction of nodes we know about
    :return: ordered list of nodes by centrality
    '''

    l = list(g.nodes())
    random_nodes = np.random.choice(l, size=int(round(len(l)*p)), replace=False)

    # Find the neighbors of those random nodes

    h_dict = {}
    for i in random_nodes:
        h_dict[i] = list(g.neighbors(i))

    # Create a new subgraph from the known neighbor lists
    h = nx.Graph(h_dict)

    # Compute betweenness centrality in the subgraph
    best_nodes = best_betweenness_centrality(h)

    # Shuffle the nodes that are in the original graph but not in the subgraph
    not_present_nodes = list(set(g.nodes()).difference(set(h.nodes())))
    np.random.shuffle(not_present_nodes)

    final_list = best_nodes + not_present_nodes

    return final_list

Now I run the simulation many times with different 10% samples of the population (that is, for each sample I run the simulation several times, and I draw several different samples). For each r, the function picks the top r·N nodes from the list produced by the previous function.

In [124]:
ratios_g_10_immunization_detailed = []
for _ in range(no_simulations):
    final_list = information_small_population(g)
    ratios_g_10_immunization_detailed.append(
                           rate_inmunization_plot(g, precision=precision,
                           seed_immune=final_list,
                           no_simulations=10, return_more=True)
    )

# Now I aggregate, for each value of r, the ratios obtained with the different samples

ratios_mean = []
ratios_std = []

len_1 = len(ratios_g_10_immunization_detailed)
len_2 = len(ratios_g_10_immunization_detailed[0])

for i2 in range(len_2):
    # Collect the ratios for this value of r across all the 10% samples
    tmp_list = [ratios_g_10_immunization_detailed[i1][i2] for i1 in range(len_1)]
    ratios_mean.append(np.mean(tmp_list))
    ratios_std.append(np.std(tmp_list))

# Finally we can plot it

plt.errorbar(r_values, ratios_mean, yerr=ratios_std, fmt='o')
plt.xlabel("inmunization ratio")
plt.ylabel("Overall attack rate ")
plt.show()

The plot with all the strategies follows:

In [125]:
plt.plot(r_values, ratios_g)
plt.plot(r_values, ratios_g_central)
plt.plot(r_values, ratios_g_new_immunization)
plt.plot(r_values, ratios_mean)
plt.legend(['Random immunization',
            'Betweenness centrality immunization',
            'Historical data immunization',
            '10% limited-knowledge immunization'])
plt.xlabel("Immunization ratio r")
plt.ylabel("Overall attack rate")
plt.show()

This method performs worse than the ones in points 3.1 and 3.2, but is still better than pure randomness.

In conclusion, if we want to lower the OAR it is best to know the full structure of the graph. If we do not, historical infection data can still be used to lower the OAR, and even knowing just 10% of the nodes and their neighbors provides an improvement over immunizing the population at random.