Preface

The final semester of my Master's degree had me taking a course called Web Mining. Admittedly, I was fatigued at the time of registering for this course, which motivated my registration as the course's description reminded me of one that I took as an undergrad student - Information Storage and Retrieval (ISR).

ISR covered document representation and how to provide query mechanisms for groupings of texts. Typical search engine fodder involving scoring and term weighting which would then culminate into discussions of the vector space model. The course project was to build a crawler to collect and represent texts taken from MIT's collection of Shakespeare writings.

Natural language processing; analysis of textual material by statistical, syntactic, and logical methods; retrieval systems models, dictionary construction, query processing, file structures, content analysis; automatic retrieval systems and question-answering systems; and evaluation of retrieval effectiveness. - Verbatim course description from the University of Northern Iowa

ISR was fascinating as it gave insight to how complex data search can work. One invaluable resource to helping build an intuition for the subject was supplementary lecture provided by Victor Lavrenko through at his Youtube Channel. Specifically, his lecture series labeled ISR3 Vector Space Model. This supplement also complemented thorough reading from "Introduction to Information Retrieval" authored by Manning, Raghavan, and Schütze. The textbook would also be a required reading within Web Mining, where a good chunk of time was also spent discussing language models and how to represent texts.

Web Mining would turn out to be a deceptive course title. It differed from ISR by the exclusion of a project in which one would build a web crawler to harvest information. The course instead revolved around reading various research papers involving the processing of big data while interpreting the results said data provided. These papers asked questions such as "How has happiness shifted since a given event?" or "How does misinformation spread throughout social media communities?"

Core methods underlying development of applications on the Web; examples of relevant applications, including those pertaining to information retrieval, summarization of Web documents, and identifying social networks. - Verbatim course description from the University of Iowa

To answer these types of questions, one needs a basic understanding of network science. Network science is the study of complex networks; In the context of this course, it was an application of social computing to understand how networks of humans interact. It provides the understanding of scale-free networks and how they have a power-law distribution. These were concepts unknown to me prior to taking the course, which provided a pleasant surprise in terms of giving something new and interesting to study.

To provide a basic understanding, "Network Science" authored by Albert-László Barabási was used. The chapters on graph theory, random networks, and the scale-free property provide a great resource in terms of understanding complex social networks. Additionally, Leonid Zhukov's lecture series on YouTube discussing Network Science was another invaluable resource.

An assigned project that I found interesting involved assigning each student a web scrape from a social network. Each student was ambiguously tasked to analyze the data. Below is the result of data analysis that I personally drew from the dataset. I feel this is worth sharing to help those who have a genuine interest in the subject understand the processes involved.*

Update - Data Visualization

Additional content has been added to this page since initially publishing this project. The initial publication, whose state is encapsulated by git commit hash a1d800f, went far beyond the scope of the original problem statement of analysis. The initial analysis included a set of static figures containing images that show scatter-plot distributions and network graphs. The usage of these static figures aligns with what is typical of academic publications.

This publication has been further updated with the inclusion of more intuitive data presentation. This was accomplished by using the d3 data visualization library. The implementation of these new figures allow users to interact with the data being discussed more closely. These new additions are described as follows:

A force directed graph which represents a network of agents as nodes where a user can inspect the physical properties of said network with respect to its connectivity. A user can mouse over each node to reveal the agent's id in addition to revealing a report of how many other users have referred to it. The edges connecting these nodes represent one of these references.
- The Six degrees of separation subsection was added to provide an intuitive approach to correlating the properties that can be explored via the force directed graph to the statistical analysis discussed within this project page.
A set of interactive scatter-plot graphs which allow a user to pan and zoom in on the data that each scatter-plot represents. Each set of static figures will present an option to switch to a view where these interactive graphs can be looked at within the following subsections:

Data Science: Social Computing

Consider a dataset which describes interactions between Reddit users for two different subreddits during the span of a specific month.

The given dataset is given as a file which is formatted as an adjacency list. Each line of the file is represented as such: User1, User2, User3, ... , Userx\n where User1 replied to a comment made by User2 and another one by User3, and every user up to and including Userx.

Many different software solutions can be chosen for analysis. Gephi is one that is often recommended. Personal experience has found that Gephi is limited; It is handy for network visualization and generating a set of metrics which are indeed indicative of the properties of the network. Unfortunately, there seems to be no mechanism to display the calculated metrics using different scaling factors for any given scatterplot graph. For example, displaying a degree distribution graph in log-log scale doesn't seem to be an available option. To account for this, Python was leveraged, taking advantage of the matplotlib, numpy, and powerlaw libraries.

numpy arrays are used to interface with matplotlib.pyplot and powerlaw.
powerlaw allows the generation of a fitting function for a bin of data. It can also generate relevant alpha values, (the gamma value with respect to the textbook used in this class), for a power-law distribution function. powerlaw also interfaces with matplotlib to allow visual representation of these functions.
- Relevant class methods are powerlaw.Fit and powerlaw.plot_pdf. Relevant instance methods are powerlaw.Fit(<args>).plot_pdf. Relevant instance variables are powerlaw.Fit(<args>).alpha.
matplotlib.pyplot allows plotting of arrays as scatterplot. It has scaling methods to allow for the display of some graph in log-log scale.

It's worth elaborating on what is happening within the represention of the dataset. The existence of some edge in the adjacency list is a means of communication. Communication can be interpreted ambiguously. So it should also be noted that communication here is directed. The node that is representative of a list entry is replying to a comment made by a user within that list entry:

User1 replied to a comment by User2 and another comment by User3;
Can be interpreted as (user1 -> user2) and (user1 -> user3)

Thus, an entry of the adjacency list is indicative of the out-degree of a given node, (a node representing a user). In order to calculate the in-degree of the same node requires parsing through the adjacency list and taking note of how many times communication is directed towards it.

This was considered whilst initially examining the data. The initial dataset contains 17406 edges which connect 8129 nodes. While parsing through the data, it can be determined that the minimum out-degree is 1 and the minimum in-degree is 0. Likewise, the maximum out-degree is 203 while the maximum in-degree is 144. While examining the network as an undirected graph, the minimum degree is 1 and the maximum degree is 347.

Parsing and presenting the adjacency list

Skip to analysis

Transforming the extracted data

Parsing through the adjacency list is necessary to run meaningful analysis of the data. To calculate the above facets of data, one only needs to run through the adjacency list and maintain a running tally of unique node identifiers and edge combinations. When discovering the node with the smallest out-degree and the node with the largest out-degree, one only needs to maintain a maximum and minimum value while evaluating the amount of nodes contained in each adjacency list entry. That is, for every line in the adjacency list, the overall minimum is min(current min, line count) and the overall maximum is max(current max, line count).

To determine which nodes have the minimum and maximum in-bound degrees, more effort than a simple pass of the adjacency list is required. Accomplishing this task takes advantage of the fact that different tools require different data structures. To run analysis through Gephi, a set of edges are required. Here, the format of said edge list is simply a csv file where each line represents a single edge. Each line contains one comma where the set of characters on the left-hand side of the comma represents the origin of the edge and the characters on the right-hand side of the comma represents the target of said edge. For example, userone,usertwo is a case where userone is tagging or responding to usertwo in a message.

Creating an edge list forms a data structure that can be used by Gephi and can also be parsed through with linear pass in which a dictionary can be set up to create a reverse-index of nodes. After creating this reverse-index, another linear pass can be made where the nodes which have the minimum and maximum quantity of in-bound edges can be discovered.

These linear passes can be then altered to maintain a running total of nodes that have x in-bound edges, out-bound edges, and a total of both in-bound and out-bound edges (used for creating an undirected graph). Creating these buckets is helpful in visualizing the distribution of values.

The parsing routine for creating a bucket of values for out-bound degrees, (while creating an edge list), is as follows:

Take note that kdictionary contains the mapping of degree counts to the amount of nodes that have the degree count. From the edge list, a reverse-index can be formed:

Using this reverse-index, the bucket of values for in-bound degrees can be calculated:

Finally, the same process can be applied to form the bucket of values for an undirected graph. Here, we can confirm that the minimum and maximum of this undirected graph is equivalent to k_out_min + k_in_min and k_out_min + k_out_max, respectively:

Using the transformed data

Three dictionaries have been established. They are currently labeled kdictionary, k_out_dictionary, and k_in_dictionary. Each contains a mapping of degree counts to the quantity of nodes with the given degree count. To calculate the amount of nodes that exists within this network, one could sum up all the values associated with each key in any of these dictionaries. This facet can be used to help form a scatter plot of the distribution of degree counts within the network:

Instead of using list comprehension to calculate the total, the lists which contain raw degree counts can instead be used. To do so, the distribution_graphing function definition should expect an extra argument for one of these lists where total is instead calculated as the length of said list.

In this context, calling distributon_graphing(k_out_dictionary,out_raw,"k_out"), distribution_graphing(k_in_dictionary,in_raw,"k_in"), and distribution_graphing(kdictionary,raw,"k") create the first three figures for analysis.

Analysis of the network

How are the various degrees distributed? The following figures are indicative of distribution:

A graph showing the distribution plotting of nodes in terms of out-degrees. — Figure 1: Distribution plotting of node out-degrees.

A graph showing the distribution plotting of nodes in terms of in-degrees. — Figure 2: Distribution plotting of node in-degrees.

A graph showing the distribution plotting of nodes in terms of both degrees. — Figure 3: Distribution plotting of node degrees (inbound and outbound)

The associated distribution function seems to be exponential in shape. What is this function? The proportion of some node having degree k must be k raised to a negative power: k^-γ, with some constant factor, c. Discovering the value of this power can be made on the observation that kmax ≈ kmin * N^(1/γ-1), where N is the total number of nodes in the network.

Algebraic manipulation can isolate gamma here with application of the logarithm manipulation rules. This generates an approximation of the exponent. The constant factor, c, can be discovered once gamma is found; c = (γ-1)*kmin^-γ+1

The powerlaw python library makes use of statistical binning to generate these values with respect to a continuous power-law distribution function:

γ-out: 2.931904108581441; c-in: 1.931904108581441
γ-in: 2.586342073080467; c-in: 1.5863420730804672
γ-total: 2.0875259166393976; c-total: 1.0875259166393976

With gamma values in hand, average expected degrees can be calculated, dependent on statistical moment. This occurs when gamma is in [2,3]. The formula used here is <k> = (γ-1)/(γ-2)*kmin

<kout> ≈ 2.0730717793724676 ≈ <kin> ≈ <k>

Average expected distance can also be computed:

<d> ≈ lnlnN ≈ lnln(8129) ≈ 2.1975793137150044

The actual values computed by Gephi are as such:

<k>: 2.141
<d>: 7.07

Curiously, there is a significant difference between the expected distance and the actual distance. This likely is due to the fact the sample set fits in the ultra-small-world regime and the slice taken from reddit doesn't represent the full expected picture.

To confirm a power-law distribution, these distributions can be plotted in a log-log scale. The following figures show that a power-law distribution is indeed in play. The light blue points represent the distribution plot. The dotted blue line overlayed by the red line is a plotting of the power-law distribution function (C*k^-γ).

A graph showing the log-log scale distribution plotting of nodes in terms of out-degrees. — Figure 4: Log-log scale distribution plotting of node-out degrees

A graph showing the log-log scale distribution plotting of nodes in terms of in-degrees. — Figure 5: Log-log scale distribution plotting of node-in degrees

A graph showing the log-log scale distribution plotting of nodes in terms of both degrees. — Figure 6: Log-log scale distribution plotting of node degrees (inbound and outbound)

These log-scale graphs, and the calculation of the aformentioned γ and c values, can be created by calling the logic of the following function definition:

Synthetically generated networks

Random networks were generated to contrast this data. The algorithm that created these networks ensured the same node count and edge count. It also ensured there exists no node that does not have an edge – as is the case for the reddit data set. This synthetically generated network is created as follows:

The distribution of this network differs from the social network. Consider the following figures:

The distribution figures are similar for the other four randomized networks. This similarity holds true for the log-log scale plotting of the same data:

The following table of figures are the log-log scale plotting of four other randomized networks, with respect to evaluating out-bound degree:

These distributions are Poisson/binomial. They do not allow for the reasonable probability of having nodes with large degrees, (degrees that approach kmax). This is emphasized by the values given in the x-axis. The maximum node degree here is anywhere from 7 to 9; much smaller than the maximum node degrees of the Reddit dataset. There seems to be a higher occurrence nodes with degree quantities close to the maximum as well. This is shown in the network representation of the involved data, shown in the following figures:

A Gephi visualized graph which shows the clustering of the agents within a randomized network. The clustering shows consistency all around. — Figure 13: Full visualization of randomized network indicates a lack of any hubs. The network seems to have reached a transition where there exists only one component.

A Gephi visualized graph which shows the clustering of the agents within a social network. Clustering is not consistent here, where clustering seems to revolve around a select few agents. — Figure 14: Partial visualization of network representative of the Reddit dataset. Notice the significant hubs which are indicative of a higher degree.

A gephi visualized grpah which shows outliers of the social network; disconnected interactions between a subset of agents involved. — Figure 15: Partial visualization of the network representative of the Reddit dataset; structures like these exist in the orbit of the larger connected component of the prior figure. This is a property not observed in the generated randomized networks.

The connectivity of Figure 13 tracks once consideration of average node degree is taken. The average degree measured by Gephi is 2.141. This tracks considering the expected node degree of |E|/|V| which is 17406/8129 = 2.141. Once the average node degree surpasses 1, a randomized graph is in the super critical regime where there exists some gigantic component. This component is not fully connected, though; the average node degree has not reached a point to exceed ln(|V|).

The observation of the paragraph above helps us see the property of the complex network given by the Reddit dataset is a scale-free network; a means to visually support this assertion.

Six degrees of separation

Six degrees of separation is the term that encapsulates the idea that all individuals are six or fewer social connections away from each other. The network science discussed on this page acts as a means to validate this. The contrast of the Reddit social network graph and the graph of randomly generated connections also acts as evidence. The average distance between nodes will be higher in a randomized graph, leading to the assertion that there are more than six degrees of separation in this context.

This can easily be illustrated in the following figure which displays an dynamic graph taken from the datasets discussed thus far. The figure contains an option to switch between these categories of graphs and a slider which a threshold can be set to determine how many nodes are displayed on the graph. This threshold indicates the minimum amount of inbound connections to a node - the higher the value set by the slider, a lesser amount of nodes will be displayed.

While using these graphs, one can mouse over a given node to report the user name and inbound connections. A button is also given to allow a user to expand the distance between nodes. This can be useful to unclutter the space. The user also has the ability to drag each node within the graph.

The initial threshold set for mobile users will be set to 16 inbound connections. The initial threshold for desktop users will be 6. Setting the threshold below these values will prompt the user for confirmation. These graphs can be CPU and memory intensive as the node and edge count increases which may impact performance.

Please enable JavaScript to view dynamic representation of the small-world networks.

Observable properties

An immediate observation is the fact that the range slider in the above figure restricts access to the graphs containing all nodes. This was primarily done as a measure to save system resources; the average browser environment is not optimized to perform the calculations needed to render all 8029 nodes along with 17406 edges.

The properties of these universal graphs can be surmised, though. Taking the difference of node count presented with each threshold can be correlated to the set of distribution graphs discussed prior.

Knowing this, consider the following node counts sorted by their inbound edge quantities for the social network:

Exactly 1 inbound edge: 3029 nodes
Exactly 2 inbound edges: 1225 nodes
Exactly 3 inbound edges: 632 nodes
Exactly 4 inbound edges: 368 nodes
etc

Here, the distribution pattern shown in figures 2 and 5 are confirmed such that the edge count is continually decreasing at a rate that conforms to the power-law. This can be contrasted to the edge quantities for the nodes contained in the randomly generated network:

Exactly 1 inbound edge: 1978 nodes
Exactly 2 inbound edges: 2242 nodes
Exactly 3 inbound edges: 1537 nodes
Exactly 4 inbound edges: 1377 nodes
etc

These values also conform to the distribution graphs discussed prior. The value associated with each tier of distribution initially increases and then starts decreasing at a lesser rate than the graphs governed by the power-law. Node counts are more evenly spread out among these buckets until a steep fallout at the tail.

Contrasting the tails between both networks also highlights a difference in distribution. The randomly generated network contains only 188 nodes with 6 or more inbound edges. This is roughly 2% of the distribution. The remaining 98% is distributed among nodes with 5 or less inbound edges. An equivalent tail occurs within the social network while looking at nodes with 12 or more inbound edges. Going beyond 12, the rate at which nodes are filtered decreases much more slowly. To evaluate the random network's tail from 2% to 0% requires walking the range of edges from 6 inbound edges to 11 - a quantity smaller than the start of the social network's tail. Doing the equivalent requires walking the social network through the range of 12 inbound edges to 144 inbound edges. This gives the social network the long tail property power-law distributions are famous for.

To further illustrate six degrees of separation, grabbing a node within the dynamic graph and peeling it from the node cluster it belongs will exhibit two different behaviors between the graph types:

Within the social network, displacing a node will cause the cluster to shift with it in a manner that maintains a general coherency.
Within the randomized network, displacing a node will only cause a micro cluster to shift. This micro cluster unfurls more easily as each node is more likely to be connected to a smaller quantity of nodes. This implies an unfurling action where a singular strand of nodes is peeled from the global group.

These behaviors imply that it takes less node hops along edges to discover another node within the social network than it does within the randomized network. This intuitively describes how the properties of the power-law provides six degrees of separation.

Concluding notes

* - My writing on pedagogy makes the following claims:

In the domain of computer science, there are three different types of students:

There are those who are studying a different discipline who are required to take a CS course as a prerequisite.

There are those who have heard jobs related to the field pay well, and thus are studying on the prospect of future paycheck.

Finally, there are the individuals who are genuinely curious of the subject.

Students who are genuinely curious of the subject will succeed. The definition of success is that they will get a degree and they will have a solid and flexible intuition of the machinations of the discipline.

The students in the other categories will get just a degree.

I feel the need to emphasize on the fact this is for an individual who is genuinely curious. The description of the project as described is ambiguous, but there are metrics listed here that can turn a learning experience into an easy grade. Should this be the case, you are doing yourself as much of a disservice as an instructor who chooses to not provide a new/different dataset.

> Social Computing

> Projects

Home