Here's a list of my publications. If you want some extra information or
have some questions, please feel free to email me.
My email addresses are sesh[at]ucsc[dot]edu and csesha[at]gmail[dot]com.

Manuscripts 
Refereed Conference and Journal Papers
We propose a new distributionfree model of social networks. Our definitions are motivated by one of the most universal signatures of social networks, triadic closurethe property that pairs of vertices with common neighbors tend to be adjacent. Our most basic definition is that of a "$c$closed" graph, where for every pair of vertices $u,v$ with at least $c$ common neighbors, $u$ and $v$ are adjacent. We study the classic problem of enumerating all maximal cliques, an important task in social network analysis. We prove that this problem is fixedparameter tractable with respect to $c$ on $c$closed graphs. Our results carry over to "weakly $c$closed graphs", which only require a vertex deletion ordering that avoids pairs of nonadjacent vertices with $c$ common neighbors. Numerical experiments show that wellstudied social networks tend to be weakly $c$closed for modest values of $c$. We study the problem of approximating the number of $k$cliques in a graph when given query access to the graph. We consider the standard query model for general graphs via (1) degree queries, (2) neighbor queries and (3) pair queries. Let $n$ denote the number of vertices in the graph, $m$ the number of edges, and $\clk$ the number of $k$cliques. We design an algorithm that outputs a $(1+\eps)$approximation (with high probability) for $\clk$, whose expected query complexity and running time are $O\left(\frac{n}{\clk^{1/k}}+\frac{m^{k/2}}{\clk} \right)\poly(\log n, 1/\eps,k)$. Hence, the complexity of the algorithm is sublinear in the size of the graph for $\clk = \omega(m^{k/21})$. Furthermore, the query complexity of our algorithm is essentially optimal (up to the dependence on $\log n$, $1/\eps$ and $k$). The previous results in this vein are by Feige (SICOMP 06) and by Goldreich \dcom{and} Ron (RSA 08) for edge counting ($k=2$) and by Eden et al (FOCS 2015) for triangle counting ($k=3$). Our result matches the complexities of these results. The previous result by Eden et al. hinges on a certain amortization technique that works for triangle counting, and does not generalize %for larger cliques to all $k$. We obtain a general algorithm that works for any $k\geq 3$ by designing a procedure that} samples each $k$clique incident to a given set $S$ of vertices with approximately equal probability. The primary difficulty is in finding cliques incident to purely highdegree vertices, since random sampling within neighbors has a low success probability. This is achieved by an algorithm that samples uniform random high degree vertices and a careful tradeoff between estimating cliques incident purely to highdegree vertices and those that include a lowdegree vertex. The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on \emph{graph sampling}, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. The degree distribution estimation poses a significant challenge, due to its heavytailed nature and the large variance in degrees. We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of \emph{sublinear algorithms}. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the \emph{hindex} and the \emph{zindex}. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law. We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most 1% of the vertices. This is a major improvement over the stateoftheart sampling algorithms, which typically sample more than 10% of the vertices to give comparable results. We also observe that the h and zindices of real graphs are large, validating our theoretical analysis. We study monotonicity testing of Boolean functions over the hypergrid [n]^d and design a nonadaptive tester with 1sided error whose query complexity is O(d^5/6)⋅poly(logn,1/ϵ). Previous to our work, the best known testers had query complexity linear in d but independent of n. We improve upon these testers as long as n=2^d^o(1). To obtain our results, we work with what we call the augmented hypergrid, which adds extra edges to the hypergrid. Our main technical contribution is a Margulisstyle isoperimetric result for the augmented hypergrid, and our tester, like previous testers for the hypercube domain, performs directed random walks on this structure. We revisit the classic problem of estimating the degree distribution moments of an undirected graph. Consider an undirected graph $G=(V,E)$ with $n$ (nonisolated) vertices, and define (for $s > 0$) $\mu_s = \frac{1}{n}\cdot\sum_{v \in V} d^s_v$. Our aim is to estimate $\mu_s$ within a multiplicative error of $(1+\varepsilon)$ (for a given approximation parameter $\varepsilon>0$) in sublinear time. We consider the sparse graph model that allows access to: uniform random vertices, queries for the degree of any vertex, and queries for a neighbor of any vertex. For the case of $s=1$ (the average degree), $\widetilde{O}(\sqrt{n})$ queries suffice for any constant $\varepsilon$ (Feige, SICOMP 06 and GoldreichRon, RSA 08). GonenRonShavitt (SIDMA 11) extended this result to all integral $s > 0$, by designing an algorithms that performs $\widetilde{O}(n^{11/(s+1)})$ queries. (Strictly speaking, their algorithm approximates the number of starsubgraphs of a given size, but a slight modification gives an algorithm for moments.) We design a new, significantly simpler algorithm for this problem. In the worstcase, it exactly matches the bounds of GonenRonShavitt, and has a much simpler proof. More importantly, the running time of this algorithm is connected to the \emph{degeneracy} of $G$. This is (essentially) the maximum density of an induced subgraph. We study the problem of testing unateness of functions f:{0,1}^d > R. We give a O((d/\eps)log(d/\eps))query nonadaptive tester and a O(d/\eps)query adaptive tester and show that both testers are optimal for a fixed distance parameter \eps. Previously known unateness testers worked only for Boolean functions, and their query complexity had worse dependence on the dimension both for the adaptive and the nonadaptive case. Moreover, no lower bounds for testing unateness were known. We also generalize our results to obtain optimal unateness testers for functions f:[n]^d > R. Our results establish that adaptivity helps with testing unateness of realvalued functions on domains of the form {0,1}^d and, more generally, [n]^d. This stands in contrast to the situation for monotonicity testing where there is no adaptivity gap for functions f:[n]^d > R. Clique counts reveal important properties about the structure of massive graphs, especially social networks. The simple setting of just 3cliques (triangles) has received much attention from the research community. For larger cliques (even, say $6$cliques) the problem quickly becomes intractable because of combinatorial explosion. Most methods used for triangle counting do not scale for large cliques, and existing algorithms require massive parallelism to be feasible. We present a new randomized algorithm that provably approximates the number of $k$cliques, for any constant $k$. The key insight is the use of (strengthenings of) the classic Tur\'an's theorem: this claims that if the edge density of a graph is sufficiently high, the $k$clique density must be nontrivial. We define a combinatorial structure called a \emph{Tur\'an shadow}, the construction of which leads to fast algorithms for clique counting. We design a practical heuristic, called TURANSHADOW, based on this theoretical algorithm, and test it on a large class of test graphs. In all cases, TURANSHADOW has less than 2\% error, in a fraction of the time used by welltuned exact algorithms. We do detailed comparisons with a range of other sampling algorithms, and find that TURANSHADOW is generally much faster and more accurate. For example, TURANSHADOW estimates all cliques numbers up to size $10$ in social network with over a hundred million edges. This is done in less than three hours on a single commodity machine. Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4vertex or 5vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for \emph{all} 5vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have less than 5 vertices) to exactly count all 5vertex patterns. We perform extensive empirical experiments on a variety of realworld graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for $5$vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all $4$vertex patterns. This algorithm is typically ten times faster than the state of the art $4$vertex counters. Finding similar user pairs is a fundamental task in social networks, with applications in link prediction, tie strength detection, and a number of ranking and personalization tasks. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, and cosine similarity among these vectors quantifies user similarity. The predominant task for user similarity applications is to discover all user pairs with ``high'' similarity, which is computationally challenging on networks with billions of edges. To the best of our knowledge, there is no practical solution for computing all user pairs with high similarity on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm  WHIMP  that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the ``wedgesampling" approach of CohenLewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges. The \emph{Ulam distance} between two permutations of length $n$ is the minimum number of insertions and deletions needed to transform one sequence into the other. Equivalently, the Ulam distance $d$ is $n$ minus the length of the longest common subsequence (LCS) between the permutations. Our main result is an algorithm, that for any fixed $\varepsilon>0$, provides a $(1+\varepsilon)$multiplicative approximation for $d$ in $\tilde{O}_{\varepsilon}(n/d + \sqrt{n})$ time, which has been shown to be optimal up to polylogarithmic factors. This is the first sublinear time algorithm (provided that $d = (\log n)^{\omega(1)}$) that obtains arbitrarily good multiplicative approximations to the Ulam distance. The previous best bound is an $O(1)$approximation (with a large constant) by Andoni and Nguyen (2010) with the same running time bound (ignoring polylogarithmic factors). The improvement in the approximation factor from $O(1)$ to $(1+\varepsilon)$ allows for significantly more powerful sublinear algorithms. For example, for any fixed $\delta > 0$, we can get additive $\delta n$ approximations for the LCS between permutations in $\tilde{O}_\delta(\sqrt{n})$ time. Previous sublinear algorithms require $\delta$ to be at least $11/C$, where $C$ is the approximation factor, which is close to $1$ when $C$ is large. Our algorithm is obtained by abstracting the basic algorithmic framework of Andoni and Nguyen, and combining it with the sublinear approximations for the longest increasing subsequence by Saks and Seshadhri (2010). The computation and study of triangles in graphs is a standard tool in the analysis of realworld networks. Yet most of this work focuses on undirected graphs. Realworld networks are often directed and have a significant fraction of reciprocal edges. While there is much focus on directed triadic patterns in the social sciences community, most data mining and graph analysis studies ignore direction. But how do we make sense of this complex directed structure? We propose a collection of directed closure values that are analogues of the classic transitivity measure (the fraction of wedges that participate in triangles). We perform an extensive set of triadic measurements on a variety of massive realworld networks. Our study of these values reveal a wealth of information of the nature of direction. For instance, we immediately see the importance of reciprocal edges in forming triangles and can measure the power of transitivity. Surprisingly, the chance that a wedge is closed depends heavily on its directed structure. We also observe striking similarities between the triadic closure patterns of different web and social networks. Together with these observations, we also present the first sampling based algorithm for fast estimation of directed triangles. Previous estimation methods were targeted towards undirected triangles and could not be extended to directed graphs. Our method, based on wedge sampling, gives orders of magnitude speedup over state of the art enumeration. Increasing complexity of scientific simulations and HPC architectures are driving the need for adaptive workflows, where the composition and execution of computational and data manipulation steps dynamically depend on the evolutionary state of the simulation itself. Consider for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and with high fidelity for postanalysis, however we cannot afford to retain the same frequency for the full simulation due to the high cost of data movement. We can instead look for triggers, indicators that the simulation will be entering a critical phase and adapt the workflow accordingly. We present a method for detecting triggers and demonstrate its use in direct numerical simulations of turbulent combustion using S3D. We show that chemical explosive mode analysis (CEMA) can be used to devise a noisetolerant indicator for rapid increase in heat release. However, exhaustive computation of CEMA values dominates the total simulation, thus is prohibitively expensive. To overcome this bottleneck, we propose a quantilesampling approach. Our algorithm comes with provable error/confidence bounds, as a function of the number of samples. Most importantly, the number of samples is independent of the problem size, thus our proposed algorithm offers perfect scalability. Our experiments on homogeneous charge compression ignition (HCCI) and reactivity controlled compression ignition (RCCI) simulations show that the proposed method can detect rapid increases in heat release, and its computational overhead is negligible. Our results will be used for dynamic workflow decisions about data storage and mesh resolution in future combustion simulations. Proposed framework is generalizable and we detail how it could be applied to a broad class of scientific simulation workflows. Consider a scalar field $f:\mathbb{M} \mapsto \mathbb{R}$, where $\mathbb{M}$ is a triangulated simplicial mesh in $\mathbb{R}^d$. A level set, or contour, at value $v$ is a connected component of $f^{1}(v)$. As $v$ is changed, these contours change topology, merge into each other, or split. \emph{Contour trees} are concise representations of $f$ that track this contour behavior. The vertices of these trees are the critical points of $f$, where the gradient is zero. The edges represent changes in the topology of contours. It is a fundamental data structure in data analysis and visualization, and there is significant previous work (both theoretical and practical) on algorithms for constructing contour trees. Suppose $\mathbb{M}$ has $n$ vertices, $N$ facets, and $t$ critical points. A classic result of Carr, Snoeyink, and Axen (2000) gives an algorithm that takes $O(n\log n + N\alpha(N))$ time (where $\alpha(\cdot)$ is the inverse Ackermann function). A further improvement to $O(t\log t + N)$ time was given by Chiang et al. All these algorithms involve a global sort of the critical points, a significant computational bottleneck. Unfortunately, lower bounds of $\Omega(t\log t)$ also exist. We present the first algorithm that can avoid the global sort and has a refined time complexity that depends on the contour tree structure. Intuitively, if the tree is short and fat, we get significant improvements in running time. For a partition of the contour tree into a set of descending paths, $P$, our algorithm runs in $O(\sum_{p\in P} p\logp + t\alpha(t)+N)$ time. This is at most $O(t\log D + N)$, where $D$ is the diameter of the contour tree. Moreover, it is $O(t\alpha(t)+N)$ for balanced trees, a significant improvement over the previous complexity. We also prove lower bounds showing that the $\sum_{p\in P} p\logp$ complexity is inherent to computing contour trees. We present a characterization of shortterm stability of Kauffman's NK (random) Boolean networks under arbitrary distributions of transfer functions. Given such a Boolean network where each transfer function is drawn from the same distribution, we present a formula that determines whether shortterm chaos (damage spreading) will happen. Our main technical tool which enables the formal proof of this formula is the Fourier analysis of Boolean functions, which describes such functions as multilinear polynomials over the inputs. Numerical simulations on mixtures of threshold functions and nested canalyzing functions demonstrate the formula's correctness. We consider the problem of estimating the number of triangles in a graph. This problem has been extensively studied in both theory and practice, but all existing algorithms read the entire graph. In this work we design a sublineartime algorithm for approximating the number of triangles in a graph, where the algorithm is given query access to the graph. The allowed queries are degree queries, vertexpair queries and neighbor queries. We show that for any given approximation parameter 0\textless epsilon\textless1, the algorithm provides an estimate hat\{t\} such that with high constant probability, (1epsilon) t\textless hat\{t\}\textless(1+epsilon)t, where t is the number of triangles in the graph G. The expected query complexity of the algorithm is O(n/t\textasciicircum \{1/3\} + min \{m, m\textasciicircum\{3/2\}/t\}) poly(log n, 1/epsilon), where n is the number of vertices in the graph and m is the number of edges, and the expected running time is (n/t\textasciicircum \{1/3\} + m\textasciicircum\{3/2\}/t) poly(log n, 1/epsilon). We also prove that Omega(n/t\textasciicircum \{1/3\} + min \{m, m\textasciicircum\{3/2\}/t\}) queries are necessary, thus establishing that the query complexity of this algorithm is optimal up to polylogarithmic factors in n (and the dependence on 1/epsilon). Given two sets of vectors, $A = \set{\vec{a_1}, \dots, \vec{a_m}}$ and $B=\set{\vec{b_1},\dots,\vec{b_n}}$, our problem is to find the top$t$ dot products, i.e., the largest $\vec{a_i}\cdot\vec{b_j}$ among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a samplingbased approach that avoids direct computation of all $mn$ dot products. We select diamonds (i.e., fourcycles) from the weighted tripartite representation of $A$ and $B$. The probability of selecting a diamond corresponding to pair $(i,j)$ is proportional to $(\vec{a_i}\cdot\vec{b_j})^2$, amplifying the focus on the largestmagnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the stateoftheart hashing methods. The degree distribution is one of the most fundamental graph properties of interest for realworld graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scalefree degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph, with small storage. We design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams will millions of edges with storage less than 1% and get extremely accurate approximations for all scales in the degree distribution. We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail. Estimating the number of triangles in a graph given as a stream of edges is a fundamental problem in data mining. The goal is to design a single pass spaceefficient streaming algorithm for estimating triangle counts. While there are numerous algorithms for this problem, they all (implicitly or explicitly) assume that the stream does not contain duplicate edges. However, data sets obtained from realworld graph streams are rife with duplicate edges. The work around is typically an extra unaccounted pass (storing all the edges!) just to "clean up" the data. Can we estimate triangle counts accurately in a single pass even when the stream contains repeated edges? In this work, we give the first algorithm for estimating the triangle count of a multigraph stream of edges. @techreport{JhSePi13, author = {Madhav Jha and C. Seshadhri and Ali Pinar}, title = {When a Graph is not so Simple: Counting Triangles in Multigraph Streams}, institution = {Arxiv}, year = {2013}, number = {1310.7665}, } Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs for truly massive graphs employ clusters and massive parallelization. We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4vertex pattern subgraphs. Our algorithm is based on a novel technique of 3path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges. Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, kdensest subgraph) are NPhard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which allows for discovery of overlapping dense subgraphs. With the right parameters, the nuclear decomposition generalizes the classic notions of kcores and ktrusses. We give provable efficient algorithms for nuclear decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other stateoftheart solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour. The primary problem in property testing is to decide whether a given function satisfies a certain property, or is far from any function satisfying it. This crucially requires a notion of distance between functions. The most prevalent notion is the Hamming distance over the {\em uniform} distribution on the domain. This restriction to uniformity is rather limiting, and it is important to investigate distances induced by more general distributions. In this paper, we give simple and optimal testers for {\em bounded derivative properties} over {\em arbitrary product distributions}. Bounded derivative properties include fundamental properties such as monotonicity and Lipschitz continuity. Our results subsume almost all known results (upper and lower bounds) on monotonicity and Lipschitz testing. We prove an intimate connection between bounded derivative property testing and binary search trees (BSTs). We exhibit a tester whose query complexity is the sum of expected depths of optimal BSTs for each marginal. Furthermore, we show this sumofdepths is also a lower bound. A technical contribution of our work is an {\em optimal dimension reduction theorem} for all bounded derivative properties, which relates the distance of a function from the property to the distance of restrictions of the function to random lines. Such a theorem has been elusive even for monotonicity, and our theorem is an exponential improvement to the previous best known result. @techreport{ChDiJh+14, author = {Deeparnab Chakrabarty and Kashyap Dixit and Madhav Jha and C. Seshadhri}, title = {Property Testing on Product Distributions: Optimal Testers for Bounded Derivative Properties}, institution = {ECCC}, year = {2014}, number = {1404.0718}, } We propose a new algorithm, FASTPPR, for estimating personalized PageRank: given input nodes $s$, $t$ in a directed graph and threshold $\delta$, it computes the Personalized PageRank $\PR_s(t)$ from $s$ to $t$, guaranteeing that the relative error is small as long $\PR_s(t) > \delta$. Existing algorithms for this problem have a runningtime of $\Omega(1/\delta)$. We design a new algorithm, FASTPPR, that has a provable average runningtime guarantee of $\tilde{O}(\sqrt{d/\delta})$ (where $d$ is the average indegree of the graph). This is a significant improvement, since $\delta$ is often $O(1/n)$ (where $n$ is the number of nodes) for applications. We also complement the algorithm with an $\Omega(1/\sqrt{\delta})$ lower bound for SignificantPageRank, showing that the dependence on $\delta$ cannot be improved. We perform a detailed empirical study on numerous massive graphs showing that FASTPPR dramatically outperforms existing algorithms. For example, with target nodes sampled according to popularity, on the 2010 Twitter graph with 1.5 billion edges, FASTPPR has a 20 factor speedup over the state of the art. Furthermore, an enhanced version of FASTPPR has a 160 factor speedup on the Twitter graph, and is at least 20 times faster on all our candidate graphs. @techreport{LoBaGo+14, author = {Peter Lofgren and Siddhartha Banerjee and Ashish Goel and C. Seshadhri}, title = {FASTPPR: Scaling Personalized PageRank Estimation for Large Graphs}, institution = {Arxiv}, year = {2014}, number = {1404.3181}, } Network data is ubiquitous and growing, yet we lack realistic generative models that can be calibrated to match realworld data. The recently proposed Block TwoLevel ErdosRenyi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop implementation for a modeling a realworld web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications. @techreport{KoPiPl+13, author = {Tamara G. Kolda and Ali Pinar and Todd Plantenga and C. Seshadhri}, title = {A Scalable Generative Graph Model with Community Structure}, institution = {Arxiv}, year = {2013}, number = {1302.6636}, }
Graphs and networks are used to model interactions in a variety of contexts.
There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are trianglebased and give a measure of the connectedness of mutual friends.
This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for largescale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with extremely massive graphs. We show results on publiclyavailable networks, the largest of which is 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 sec. per million edges plus overhead (approximately 225 sec.
total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and nonsocial networks. To the best of our knowledge, these are the largest trianglebased graph computations published to date.
@techreport{KoPiPl+13, author = {Tamara G. Kolda and Ali Pinar and Todd Plantenga and C. Seshadhri and Christine Task}, title = {Counting Triangles in Massive Graphs with MapReduce}, institution = {Arxiv}, year = {2013}, number = {1301.5887}, } Listing all triangles is a fundamental graph operation. Triangles can have important interpretations in realworld graphs, especially social and other interaction networks. Despite the lack of provably efficient (linear, or slightly superlinear) worstcase algorithms for this problem, practitioners run simple, efficient heuristics to find all triangles in graphs with millions of vertices. How are these heuristics exploiting the structure of these special graphs to provide major speedups in running time? We study one of the most prevalent algorithms used by practitioners. A trivial algorithm enumerates all paths of length $2$, and checks if each such path is incident to a triangle. A good heuristic is to enumerate only those paths of length $2$ where the middle vertex has the lowest degree. It is easily implemented and is empirically known to give remarkable speedups over the trivial algorithm. We study the behavior of this algorithm over graphs with heavytailed degree distributions, a defining feature of realworld graphs. The erased configuration model (ECM) efficiently generates a graph with asymptotically (almost) any desired degree sequence. We show that the expected running time of this algorithm over the distribution of graphs created by the ECM is controlled by the $\ell_{4/3}$norm of the degree sequence. Norms of the degree sequence are a measure of the heaviness of the tail, and it is precisely this feature that allows nontrivial speedups of simple triangle enumeration algorithms. As a corollary of our main theorem, we prove expected lineartime performance for degree sequences following a power law with exponent $\alpha \geq 7/3$, and nontrivial speedup whenever $\alpha \in (2,3)$. @INPROCEEDINGS{BeFoNo+14, author = {J. Berry and L. Fostvedt and D. Nordman and C. Phillips and C. Seshadhri and A. Wilson}, title = {Why do Simple Algorithms for Triangle Enumeration work in the Real World?}, booktitle = {Proceedings of the Innovations in Theoretical Computer Science (ITCS)}, year = {2014} } High triangle density  the graph property stating that most twohop paths belong to a triangle  is a common signature of social networks. This paper studies triangledense graphs from a structural perspective. We prove constructively that most of the content of a triangledense graph is contained in a disjoint union of radius 2 dense subgraphs. This result quantifies the extent to which triangledense graphs resemble unions of cliques. We also show that our algorithm recovers planted clusterings in stable kmedian instances. @INPROCEEDINGS{GuRoSe14, author = {R. Gupta and T. Roughgarden and C. Seshadhri}, title = {Decompositions of TriangleDense Graphs}, booktitle = {Proceedings of the Innovations in Theoretical Computer Science (ITCS)}, year = {2014} } First impressions from initial renderings of data are crucial for directing further exploration and analysis. In most visualization systems, default colormaps are generated by simply linearly interpolating color in some space based on a value's placement between the minimum and maximum taken on by the dataset. We design a simple samplingbased method for generating colormaps that highlights important features. We use random sampling to determine the distribution of values observed in the data. The sample size required is independent of the dataset size and only depends on certain accuracy parameters. This leads to a computationally cheap and robust algorithm for colormap generation. Our approach (1) uses perceptual color distance to produce palettes from color curves, (2) allows the user to either emphasize or deemphasize prominent values in the data, (3) uses quantiles to map distinct colors to values based on their frequency in the dataset, and (4) supports the highlighting of either inter or intramode variations in the data. @INPROCEEDINGS{ThBe+13, author = {D. Thompson and J. Bennett and C. Seshadhri and A. Pinar}, title = {A ProvablyRobust Sampling Method for Generating Colormaps of Large Data}, booktitle = {{Proceedings of the IEEE Symposium on LargeScale Data and Visualization (LDAV)}}, year = {2013} } For positive integers $n, d$, consider the hypergrid $[n]^d$ with the coordinatewise product partial ordering denoted by $\prec$. A function $f: [n]^d \mapsto \mathbb{N}$ is monotone if $\forall x \prec y$, $f(x) \leq f(y)$. A function $f$ is $\eps$far from monotone if at least an $\eps$fraction of values must be changed to make $f$ monotone. Given a parameter $\eps$, a \emph{monotonicity tester} must distinguish with high probability a monotone function from one that is $\eps$far. We prove that any (adaptive, twosided) monotonicity tester for functions $f:[n]^d \mapsto \mathbb{N}$ must make $\Omega(\eps^{1}d\log n  \eps^{1}\log \eps^{1})$ queries. Recent upper bounds show the existence of $O(\eps^{1}d \log n)$ query monotonicity testers for hypergrids. This closes the question of monotonicity testing for hypergrids over arbitrary ranges. The previous best lower bound for general hypergrids was a nonadaptive bound of $\Omega(d \log n)$. @INPROCEEDINGS{ChSe132, author = {D. Chakrabarty and C. Seshadhri}, title = {An optimal lower bound for monotonicity testing over hypergrids}, booktitle = {{Proceedings of the International Workshop on Randomization and Computation (RANDOM)}}, year = {2013}, owner = {scomand}, timestamp = {2013.03.14} } Modern massive graphs such as the web and communication networks are truly \emph{dynamic graphs}. The graph is not a static object, but is most accurately characterized as a \emph{stream of edges}. How does one store a small sample of the past and still answer meaningful questions about the entire graph? This is the aim of a sublinear space streaming algorithm. We study the problem of counting the number of triangles (or estimating the clustering coefficient) in a massive streamed graph. We provide a $O(\sqrt{n})$space algorithm that provably approximates the clustering coefficient with only a single pass through a graph of $n$ vertices. Our procedure is based on the classic probabilistic result, \emph{the birthday paradox}. While streaming algorithms for triangle counting have been designed before, none of these provide comparable theoretical (or empirical) results. Using some probabilistic heuristics, we design a practical algorithm based on these ideas. Our algorithm can compute accurate estimates for the clustering coefficient and number of triangles for a graph using only a single pass over the edges. The storage of the algorithm is a tiny fraction of the graph. For example, even for a graph with 200 million edges, our algorithm stores just 50,000 edges to give accurate results. Being a single pass streaming algorithm, our procedure also maintains \emph{a realtime estimate} of the clustering coefficient/number of triangles of a graph, by storing a miniscule fraction of edges. @INPROCEEDINGS{JhSePi13, author = {Jha, Madhav and Seshadhri, C. and Pinar, Ali}, title = {A space efficient streaming algorithm for triangle counting using the birthday paradox}, booktitle = {Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining}, year = {2013}, series = {KDD '13}, pages = {589597} } Degree distributions are arguably the most important property of real world networks. The classic edge configuration model or ChungLu model can generate an undirected graph with any desired degree distribution. This serves as an excellent null model to compare algorithms or perform experimental studies that distinguish real networks from this baseline. Furthermore, there are scalable algorithms that implement these models and they are invaluable in the study of graphs. However, networks in the realworld are often directed, and have a significant proportion of reciprocal edges. A stronger relation exists between two nodes when they each point to one another (reciprocal edge) as compared to when only one points to the other (oneway edge). Despite their importance, reciprocal edges have been disregarded by most directed graph models. We propose a null model for directed graphs inspired by the ChungLu model that matches the in, out, and reciprocaldegree distributions of the real graphs. Our algorithm is scalable and requires O(m) random numbers to generate a graph with m edges. We perform a series of experiments on real datasets and compare with existing graph models. @inproceedings{DuKoPiSe12, author = {Nurcan Durak and Tamara G. Kolda and Ali Pinar and C. Seshadhri}, title = {A scalable directed graph model with reciprocal edges}, booktitle = {IEEE Workshop on Network Science}, year = {2013} } Given oracle access to a Boolean function $f:\{0,1\}^n \mapsto \{0,1\}$, we design a randomized tester that takes as input a parameter $\eps>0$, and outputs {\sf Yes} if the function is monotone, and outputs {\sf No} with probability $>2/3$, if the function is $\eps$far from monotone. That is, $f$ needs to be modified at $\eps$fraction of the points to make it monotone. Our nonadaptive, onesided tester makes $\widetilde{O}(n^{5/6}\eps^{5/3})$ queries to the oracle. @inproceedings{CS12, author = {D. Chakrabarty and C. Seshadhri}, title = {An o(n) monotonicity tester for Boolean functions over the hypercube}, booktitle = {ACM Symposium on Theory of Computing (STOC)}, year = {2013} } The problem of monotonicity testing over the hypergrid and its special case, the hypercube, is a classic, wellstudied, yet unsolved question in property testing. We are given query access to $f:[k]^n \mapsto \R$ (for some ordered range $\R$). The hypergrid/cube has a natural partial order given by coordinatewise ordering, denoted by $\prec$. A function is \emph{monotone} if for all pairs $x \prec y$, $f(x) \leq f(y)$. The distance to monotonicity, $\eps_f$, is the minimum fraction of values of $f$ that need to be changed to make $f$ monotone. For $k=2$ (the boolean hypercube), the usual tester is the \emph{edge tester}, which checks monotonicity on adjacent pairs of domain points. It is known that the edge tester using $O(\eps^{1}n\log\R)$ samples can distinguish a monotone function from one where $\eps_f > \eps$. On the other hand, the best lower bound for monotonicity testing over the hypercube is $\min(\R^2,n)$. This leaves a quadratic gap in our knowledge, since $\R$ can be $2^n$. We resolve this long standing open problem and prove that $O(n/\eps)$ samples suffice for the edge tester. For hypergrids, known testers require $O(\eps^{1}n\log k\log \R)$ samples, while the best known (nonadaptive) lower bound is $\Omega(\eps^{1} n\log k)$. We give a (nonadaptive) monotonicity tester for hypergrids running in $O(\eps^{1} n\log k)$ time. Our techniques lead to optimal property testers (with the same running time) for the natural \emph{Lipschitz property} on hypercubes and hypergrids. (A $c$Lipschitz function is one where $f(x)  f(y) \leq c\xy\_1$.) In fact, we give a general unified proof for $O(\eps^{1}n\log k)$query testers for a class of ``boundedderivative" properties, a class containing both monotonicity and Lipschitz. @inproceedings{CS12, author = {D. Chakrabarty and C. Seshadhri}, title = {Optimal Bounds for Monotonicity and Lipschitz Testing over Hypercubes and Hypergrids}, booktitle = {ACM Symposium on Theory of Computing (STOC)}, year = {2013} } there is a growing need to quickly assess the structure of a graph. Some of the most useful graph metrics, especially those measuring social cohesion, are based on triangles. Despite the importance of these triadic measures, associated algorithms can be extremely expensive. We discuss the method of wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and enables rapid uniform sampling of the triangles of a graph. Our methods come with provable and practical timeapproximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the stateoftheart, while providing nearly the accuracy of full enumeration. Our results will enable more widescale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several realworld examples. @inproceedings{SePiKo13, author = {C. Seshadhri and Ali Pinar and Tamara G. Kolda}, title = {Triadic Measures on Graphs: The Power of Wedge Sampling}, booktitle = {SIAM Conference on Data Mining (SDM)}, year = {2013} } Approximating the length of the longest increasing sequence (LIS) of a data stream is a wellstudied problem. There are many algorithms that estimate the size of the complement of the LIS, referred to as the \emph{distance to monotonicity}, both in the streaming and property testing setting. Let $n$ denote the size of an input array. Our aim is to develop a onepass streaming algorithm that accurately approximates the distance to monotonicity, and only uses polylogarithmic storage. For any $\delta > 0$, our algorithm provides a $(1+\delta)$multiplicative approximation for the distance, and uses only $O((\log^2 n)/\delta)$ space. The previous best known approximation using polylogarithmic space was a multiplicative 2factor. Our algorithm is simple and natural, being just 3 lines of pseudocode. It is essentially a polylogarithmic space implementation of a classic dynamic program that computes the LIS. Our technique is more general and is applicable to other problems that are exactly solvable by dynamic programs. We are able to get a streaming algorithm for the longest common subsequence problem (in the asymmetric setting introduced in \cite{AKO10}) whose space is small on instances where no symbol appears very many times. Consider two strings (of length $n$) $x$ and $y$. The string $y$ is known to us, and we only have streaming access to $x$. The size of the complement of the LCS is the edit distance between $x$ and $y$ with only insertions and deletions. If no symbol occurs more than $k$ times in $y$, we get a $O(k(\log^2 n)/\delta)$space streaming algorithm that provides a $(1+\delta)$multiplicative approximation for the LCS complement. In general, we also provide a deterministic 1pass streaming algorithm that outputs a $(1+\delta)$multiplicative approximation for the LCS complement and uses $O(\sqrt{(n\log n)/\delta})$ space. @inproceedings{SaSe13, author = {M. Saks and C. Seshadhri}, title = {Space Efficient Streaming Algorithms for the Distance to Monotonicity and Asymmetric Edit Distance}, booktitle = {Symposium on Discrete Algorithms (SODA)}, year = {2013} } We present sublineartime (randomized) algorithms for finding simple cycles of length at least $k\geq 3$ and treeminors in boundeddegree graphs. The complexity of these algorithms is related to the distance of the graph from being $C_k$minorfree (resp., free from having the corresponding treeminor). In particular, if the graph is far (i.e., $\Omega(1)$far) {from} being cyclefree, i.e. if one has to delete a constant fraction of edges to make it cyclefree, then the algorithm finds a cycle of polylogarithmic length in time $\tildeO(\sqrt{N})$, where $N$ denotes the number of vertices. This time complexity is optimal up to polylogarithmic factors. The foregoing results are the outcome of our study of the complexity of {\em onesided error} property testing algorithms in the boundeddegree graphs model. For example, we show that cyclefreeness of $N$vertex graphs can be tested with onesided error within time complexity $\tildeO(\poly(1/\e)\cdot\sqrt{N})$. This matches the known $\Omega(\sqrt{N})$ query lower bound, and contrasts with the fact that any minorfree property admits a {\em twosided error} tester of query complexity that only depends on the proximity parameter $\e$. For any constant $k\geq3$, we extend this result to testing whether the input graph has a simple cycle of length at least~$k$. On the other hand, for any fixed tree $T$, we show that $T$minorfreeness has a onesided error tester of query complexity that only depends on the proximity parameter $\e$. Our algorithm for finding cycles in boundeddegree graphs extends to general graphs, where distances are measured with respect to the actual number of edges. Such an extension is not possible with respect to finding treeminors in $o(\sqrt{N})$ complexity. @journal{CGR+10, author = {Artur Czumaj and Oded Goldreich and Dana Ron and C. Seshadhri and Asaf Shapira and Christian Sohler}, title = {Finding Cycles and Trees in Sublinear Time}, journal = {Random Structures and Algorithms}, year = {2012}, notes = {DOI: 10.1002/rsa.20462} } Triangles are an important building block and distinguishing feature of realworld networks, but their structure is still poorly understood. Despite numerous reports on the abundance of triangles, there is very little information on what these triangles look like. We initiate the study of \emph{degreelabeled triangles}  specifically, degree homogeneity versus heterogeneity in triangles. This yields new insight into the structure of realworld graphs. We observe that networks coming from social and collaborative situations are dominated by homogeneous triangles, i.e., degrees of vertices in a triangle are quite similar to each other. On the other hand, information networks (e.g., web graphs) are dominated by heterogeneous triangles, i.e., the degrees in triangles are quite disparate. Surprisingly, nodes within the top 1\% of degrees participate in the \emph{vast majority} of triangles in heterogeneous graphs. We investigate whether current graph models reproduce the types of triangles that are observed in real data and observe that most models fail to accurately capture these salient features. @inproceedings{DPKS12, author = {N. Durak and A. Pinar and T. G. Kolda and C. Seshadhri}, title = {Degree Relations of Triangles in Realworld Networks and Graph Models}, booktitle = {ACM International Conference on Information and Knowledge Management}, year = {2012} } The communities of a social network are sets of vertices with more connections inside the set than outside. We theoretically demonstrate that two commonly observed properties of social networks, heavytailed degree distributions and large clustering coefficients, imply the existence of vertex neighborhoods (also known as egonets) that are themselves good communities. We evaluate these neighborhood communities on a range of graphs. What we find is that the neighborhood communities can exhibit conductance scores that are as good as the Fiedler cut. Also, the conductance of neighborhood communities shows similar behavior as the network community profile computed with a personalized PageRank community detection method. Neighborhood communities give us a simple and powerful heuristic for speeding up local partitioning methods. Since finding good seeds for the PageRank clustering method is difficult, most approaches involve an expensive sweep over a great many starting vertices. We show how to use neighborhood communities to quickly generate a small set of seeds. @inproceedings{GS12, author = {David F. Gleich and C. Seshadhri}, title = {Vertex Neighborhoods, Low Conductance Cuts, and Good Seeds for Local Community Methods }, booktitle = {SIGKDD Conference on Knowledge Discovery and Data Mining}, year = {2012} } Computing the coordinatewise maxima of a planar point set is a classic and wellstudied problem in computational geometry. We give an algorithm for this problem in the \emph{selfimproving setting}. We have $n$ (unknown) independent distributions $\cD_1, \cD_2, \ldots, \cD_n$ of planar points. An input pointset $(p_1, p_2, \ldots, p_n)$ is generated by taking an independent sample $p_i$ from each $\cD_i$, so the input distribution $\cD$ is the product $\prod_i \cD_i$. A selfimproving algorithm repeatedly gets input sets from the distribution $\cD$ (which is \emph{a priori} unknown) and tries to optimize its running time for $\cD$. Our algorithm uses the first few inputs to learn salient features of the distribution, and then becomes an optimal algorithm for distribution $\cD$. Let $\OPT_\cD$ denote the expected depth of an \emph{optimal} linear comparison tree computing the maxima for distribution $\cD$. Our algorithm eventually has an expected running time of $O(\text{OPT}_\cD + n)$, even though it did not know $\cD$ to begin with. Our result requires new tools to understand linear comparison trees for computing maxima. We show how to convert general linear comparison trees to very restricted versions, which can then be related to the running time of our algorithm. An interesting feature of our algorithm is an interleaved search, where the algorithm tries to determine the likeliest point to be maximal with minimal computation. This allows the running time to be truly optimal for the distribution $\cD$. @inproceedings{CMS12, author = {Ken Clarkson and Wolfgang Mulzer and C. Seshadhri}, title = {Selfimproving Algorithms for Coordinatewise Maxima}, booktitle = {28th Symposium on Computational Geometry}, year = {2012} } Community structure plays a significant role in the analysis of social networks and similar graphs, yet this structure is little understood and not well captured by most models. We formally define a community to be a subgraph that is internally highly connected and has no deeper substructure. We use tools of combinatorics to show that any such community must contain a dense Erd\H{o}sR\'enyi~(ER) subgraph. Based on mathematical arguments, we hypothesize that any graph with a heavytailed degree distribution and community structure must contain a scale free collection of dense ER subgraphs. These theoretical observations corroborate well with empirical evidence. From this, we propose the Block TwoLevel Erd\H{o}sR\'enyi (BTER) model, and demonstrate that it accurately captures the observable properties of many realworld social networks. @article{SKP11, author = {C. Seshadhri and Tamara G. Kolda and Ali Pinar}, title = {Community structure and scalefree collections of Erd\H{o}sR\'enyi graphs}, journal = {Physical Review E}, year = {2012}, volume = {85}, issue = {5}, number = {056109}, url = {http://pre.aps.org/abstract/PRE/v85/i5/e056109} } Markov chains are a convenient means of generating realizations of networks, since they require little more than a procedure for rewiring edges. If a rewiring procedure exists for generating new graphs with specified statistical properties, then a Markov chain sampler can generate an ensemble of graphs with prescribed characteristics. However, successive graphs in a Markov chain cannot be used when one desires independent draws from the distribution of graphs; the realizations are correlated. Consequently, one runs a Markov chain for N iterations before accepting the realization as an independent sample. In this work, we devise two methods for calculating N. They are both based on the binary "timeseries" denoting the occurrence/nonoccurrence of edge (u, v) between vertices u and v in the Markov chain of graphs generated by the sampler. They differ in their underlying assumptions. We test them on the generation of graphs with a prescribed joint degree distribution. We find the N proportional E, where E is the number of edges in the graph. The two methods are compared by sampling on real, sparse graphs with 10^3  10^4 vertices. B @inproceedings{RPS12, author = {Jaideep Ray and Ali Pinar and C. Seshadhri}, title = {Are we there yet? When to stop a Markov chain while generating random graphs}, booktitle = {9th Workshop on Algorithms and Models for the Web Graph}, pages = {153164}, year = {2012} } The analysis of massive graphs is now becoming a very important part of science and industrial research. This has led to the construction of a large variety of graph models, each with their own advantages. The Stochastic Kronecker Graph (SKG) model has been chosen by the Graph500 steering committee to create supercomputer benchmarks for graph algorithms. The major reasons for this are its easy parallelization and ability to mirror real data. Although SKG is easy to implement, there is little understanding of the properties and behavior of this model. We show that the parallel variant of the edgeconfiguration model given by Chung and Lu (referred to as CL) is notably similar to the SKG model. The graph properties of an SKG are extremely close to those of a CL graph generated with the appropriate parameters. Indeed, the final probability matrix used by SKG is almost identical to that of a CL model. This implies that the graph distribution represented by SKG is almost the same as that given by a CL model. We also show that when it comes to fitting real data, CL performs as well as SKG based on empirical studies of graph properties. CL has the added benefit of a trivially simple fitting procedure and exactly matching the degree distribution. Our results suggest that users of the SKG model should consider the CL model because of its similar properties, simpler structure, and ability to fit a wider range of degree distributions. At the very least, CL is a good control model to compare against. @inproceedings{PiSeKo11, author = {Ali Pinar and C. Seshadhri and Tamara G. Kolda}, title = {The Similarity between Stochastic Kronecker and ChungLu Graph Models}, booktitle = {Proceedings of 12th SIAM International Conference on Data Mining}, pages = {10711082}, year = {2012} } Graph analysis is playing an increasingly important role in science and industry. Due to numerous limitations in sharing realworld graphs, models for generating massive graphs are critical for developing better algorithms. In this paper, we analyze the stochastic Kronecker graph model (SKG), which is the foundation of the Graph500 supercomputer benchmark due to its favorable properties and easy parallelization. Our goal is to provide a deeper understanding of the parameters and properties of this model so that its functionality as a benchmark is increased. We develop a rigorous mathematical analysis that shows this model cannot generate a powerlaw distribution or even a lognormal distribution. However, we formalize an enhanced version of the SKG model that uses random noise for smoothing. We prove both in theory and in practice that this enhancement leads to a lognormal distribution. Additionally, we provide a precise analysis of isolated vertices, showing that the graphs that are produced by SKG might be quite different than intended. For example, between 50% and 75% of the vertices in the Graph500 benchmarks will be isolated. Finally, we show that this model tends to produce extremely small core numbers (compared to most social networks and other real graphs) for common parameter choices. These results have been communicated to the Graph500 steering committee, and they are currently using our theorems to set the benchmark parameters. Our noisy version of SKG is likely to become the Graph500 standard for next year's benchmark. @techreport{SPK11, author = {C. Seshadhri and Ali Pinar and Tamara G. Kolda}, title = {A Hitchhiker's Guide to Choosing Parameters of Stochastic Kronecker Graphs}, institution = {Arxiv}, year = {2011}, number = {1102.5046}, } We present a rigorous mathematical framework for analyzing dynamics of a broad class of Boolean network models. We use this framework to provide the first formal proof of many of the standard critical transition results in Boolean network analysis, and offer analogous characterizations for novel classes of random Boolean networks. We show that some of the assumptions traditionally made in the more common meanfield analysis of Boolean networks do not hold in general. For example, we offer evidence that imbalance (internal inhomogeneity) of transfer functions is a crucial feature that tends to drive quiescent behavior far more strongly than previously observed. @article{SVM+11, author = {C. Seshadhri and Y. Vorobeychik and J. Mayo and R. Armstrong and J. Ruthruff}, title = {Influence and Dynamic Behavior in Random Boolean Networks}, journal = {Physical Review Letters}, volume = {107}, number = {10}, year = {2011}, pages = {108701} } Let C be a depth3 circuit with n variables, degree d and top fanin k (called sps(k,d,n) circuits) over base field F. It is a major open problem to design a deterministic polynomial time blackbox algorithm that tests if C is identically zero. Klivans & Spielman (STOC 2001) observed that the problem is open even when k is a constant. This case has been subjected to a serious study over the past few years, starting from the work of Dvir & Shpilka (STOC 2005). We give the first polynomial time blackbox algorithm for this problem. Our algorithm runs in time poly(nd^k), regardless of the base field. The *only* field for which polynomial time algorithms were previously known is F=Q (Kayal & Saraf, FOCS 2009, and Saxena & Seshadhri, FOCS 2010). This is the first blackbox algorithm for depth3 circuits that does not use the rank based approaches of Karnin & Shpilka (CCC 2009). We prove an important tool for the study of depth3 identities. We design a blackbox polynomial time transformation that reduces the number of variables in a sps(k,d,n) circuit to k variables, but preserves the identity structure. @inproceedings{SaxS11, author = {N. Saxena and C. Seshadhri}, title = {Blackbox identity testing for bounded top fanin depth3 circuits: the field doesn't matter}, booktitle = {Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC)}, year = {2011} } We give the first combinatorial approximation algorithm for Maxcut that beats the trivial 0.5 factor by a constant. The main partitioning procedure is very intuitive, natural, and easily described. It essentially performs a number of random walks and aggregates the information to provide the partition. We can control the running time to get an approximation factorrunning time tradeoff. We show that for any constant b > 1.5, there is an O(n^{b}) algorithm that outputs a (0.5+delta)approximation for Maxcut, where delta = delta(b) is some positive constant. One of the components of our algorithm is a weak local graph partitioning procedure that may be of independent interest. Given a starting vertex $i$ and a conductance parameter phi, unless a random walk of length ell = O(log n) starting from i mixes rapidly (in terms of phi and ell), we can find a cut of conductance at most phi close to the vertex. The work done per vertex found in the cut is sublinear in n. @inproceedings{KalS10, author = {Satyen Kale and C. Seshadhri}, title = {Combinatorial Approximation Algorithms for MaxCut using Random Walks}, booktitle = {Proceedings of the Innovations in Computer Science (ICS)}, year = {2011}, pages = {367388} } We initiate the study of property testing of submodularity on the boolean hypercube. Submodular functions come up in a variety of applications in combinatorial optimization. For a vast range of algorithms, the existence of an oracle to a submodular function is assumed. But how does one check if this oracle indeed represents a submodular function? Consider a function f:{0,1}^n \rightarrow R. The distance to submodularity is the minimum fraction of values of $f$ that need to be modified to make f submodular. If this distance is more than epsilon > 0, then we say that f is epsilonfar from being submodular. The aim is to have an efficient procedure that, given input f that is epsilonfar from being submodular, certifies that f is not submodular. We analyze a very natural tester for this problem, and prove that it runs in subexponential time. This gives the first nontrivial tester for submodularity. On the other hand, we prove an interesting lower bound (that is, unfortunately, quite far from the upper bound) suggesting that this tester cannot be very efficient in terms of epsilon. This involves nontrivial examples of functions which are far from submodular and yet do not exhibit too many local violations. We also provide some constructions indicating the difficulty in designing a tester for submodularity. We construct a partial function defined on exponentially many points that cannot be extended to a submodular function, but any strict subset of these values can be extended to a submodular function. @inproceedings{SesV10, author = {J. Vondrak and C. Seshadhri}, title = {Is submodularity testable?}, booktitle = {Proceedings of the Innovations in Computer Science (ICS)}, year = {2011}, pages = {195210} } Finding the length of the longest increasing subsequence (LIS) is a classic algorithmic problem. Let $n$ denote the size of the array. Simple $O(n\log n)$ algorithms are known for this problem. What can a sublinear time algorithm achieve? We develop a polylogarithmic time randomized algorithm that for any constant $\delta > 0$, estimates the length of the LIS of an array upto an additive error of $\delta n$. More precisely, the running time of the algorithm is $(\log n)^c (1/\delta)^{O(1/\delta)}$ where the exponent $c$ is independent of $\delta$. Previously, the best known polylogarithmic time algorithms could only achieve an additive $n/2$ approximation. The \emph{distance to monotonicity}, which is the fractional size of the complement of the LIS, has been studied in depth by the streaming and property testing communities. Our polylogarithmic algorithm actually has a stronger guarantee, and gives (for any constant $\delta > 0$) a multiplicative $(1+\delta)$approximation to the distance to monotonicity. This is a significant improvement over the previously known $2$factor approximations. @inproceedings{SakS10, author = {M. E. Saks and C. Seshadhri}, title = {Estimating the longest increasing sequence in polylogarithmic time}, booktitle = {Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)}, year = {2010}, pages = {458467}, } We study the problem of identity testing for depth3 circuits of top fanin k and degree d. We give a new structure theorem for such identities. A direct application of our theorem improves the known deterministic d^{k^k}time blackbox identity test over rationals (KayalSaraf, FOCS 2009) to one that takes d^{k^2}time. Our structure theorem essentially says that the number of independent variables in a real depth3 identity is very small. This theorem settles affirmatively the stronger rank conjectures posed by DvirShpilka (STOC 2005) and KayalSaraf (FOCS 2009). Our techniques provide a unified framework that actually beats all known rank bounds and hence gives the best running time (for *every* field) for blackbox identity tests. Our main theorem (almost optimally) pins down the relation between higher dimensional SylvesterGallai theorems and the rank of depth3 identities in a very transparent manner. The existence of this was hinted at by DvirShpilka (STOC 2005), but first proven, for reals, by KayalSaraf (FOCS 2009). We introduce the concept of SylvesterGallai rank bounds for any field, and show the intimate connection between this and depth3 identity rank bounds. We also prove the first ever theorem about high dimensional SylvesterGallai configurations over *any* field. Our proofs and techniques are very different from previous results and devise a very interesting ensemble of combinatorics and algebra. The latter concepts are ideal theoretic and involve a new Chinese remainder theorem. Our proof methods explain the structure of *any* depth3 identity C: there is a nucleus of C that forms a low rank identity, while the remainder is a high dimensional SylvesterGallai configuration. @inproceedings{SaxS10, author = {N. Saxena and C. Seshadhri}, title = {From SylvesterGallai Configurations to Rank Bounds: Improved Blackbox Identity Test for Depth3 Circuits}, booktitle = {Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)}, year = {2010}, pages = {2129}, } We describe an algorithm for computing planar convex hulls in the selfimproving model: given a sequence $I_1, I_2, \ldots$ of planar $n$point sets, the upper convex hull $conv(I)$ of each set $I$ is desired. We assume that there exists a probability distribution $D$ on $n$point sets, such that the inputs $I_j$ are drawn independently according to $D$. Furthermore, $D$ is such that the individual points are distributed independently of each other. In other words, the $i$'th point is distributed according to $D_i$. The $D_i$'s can be arbitrary but are independent of each other. The distribution $D$ is not known to the algorithm in advance. After a learning phase of $n^\eps$ rounds, the expected time to compute $conv(I)$ is $O(n + H(conv(I)))$. Here, $H(conv(I))$ is the entropy of the output, which is a lower bound for the expected running time of \emph{any} algebraic computation tree that computes the convex hull. (More precisely, $H(conv(I))$ is the minimum entropy of any random variable that maps $I$ to a description of $conv(I)$ and to a labeling scheme that proves nonextremality for every point in $I$ not on the hull.) Our algorithm is thus asymptotically optimal for $D$. @inproceedings{ClaMS10, author = {K. Clarkson and W. Mulzer and C. Seshadhri}, title = {Selfimproving Algorithms for Convex Hulls}, booktitle = {Proceedings of the 21st ACMSIAM Symposium on Discrete Algorithms (SODA)}, year = {2010}, pages = {} } We study online learning in an oblivious changing environment. The standard measure of regret bounds the difference between the cost of the online learner and the best decision in hindsight. Hence, regret minimizing algorithms tend to converge to the static best optimum, clearly a suboptimal behavior in changing environments. On the other hand, various metrics proposed to strengthen regret and allow for more dynamic algorithms produce inefficient algorithms. We propose a different performance metric which strengthens the standard metric of regret and measures performance with respect to a changing comparator. We then describe a series of datastreamingbased reductions which transform algorithms for minimizing (standard) regret into adaptive algorithms albeit incurring only polylogarithmic computational overhead. Using this reduction, we obtain {\it efficient} low adaptiveregret algorithms for the problem of online convex optimization. This can be applied to various learning scenarios, i.e. online portfolio selection, for which we describe experimental results showing the advantage of adaptivity. @inproceedings{HazS09, author = {E. Hazan and C. Seshadhri}, title = {Efficient learning algorithms for changing environments}, booktitle = {Proceedings of the 26th Annual International Conference on Machine Learning (ICML)}, year = {2009}, pages = {50}, } We show that the rank of a depth$3$ circuit (over any field) that is simple, minimal and zero is at most $O(k^3\log d)$. The previous best rank bound known was $2^{O(k^2)}(\log d)^{k2}$ by Dvir and Shpilka (STOC 2005). This almost resolves the rank question first posed by Dvir and Shpilka (as we also provide a simple and minimal identity of rank $\Omega(k\log d)$). Our rank bound significantly improves (dependence on $k$ exponentially reduced) the best known deterministic blackbox identity tests for depth$3$ circuits by Karnin and Shpilka (CCC 2008). Our techniques also shed light on the factorization pattern of nonzero depth$3$ circuits, most strikingly: the rank of linear factors of a simple, minimal and nonzero depth$3$ circuit (over any field) is at most $O(k^3\log d)$. The novel feature of this work is a new notion of maps between sets of linear forms, called \emph{ideal matchings}, used to study depth$3$ circuits. We prove interesting structural results about depth$3$ identities using these techniques. We believe that these can lead to the goal of a deterministic polynomial time identity test for these circuits. @inproceedings{SaxS09, author = {N. Saxena and C. Seshadhri}, title = {An Almost Optimal Rank Bound for Depth3 Identities}, booktitle = {Proceedings of the 24th Annual Conference on Computational Complexity (CCC)}, year = {2009}, pages = {137148} } @article{SaxS11, author = {N. Saxena and C. Seshadhri}, title = {An Almost Optimal Rank Bound for Depth3 Identities}, journal = {SIAM Journal on Computing}, volume = {40}, number = {1}, year = {2011}, pages = {200224} } We consider the problem of \emph{online sublinear expander reconstruction} and its relation to random walks in ``noisy" expanders. Given access to an adjacency list representation of a boundeddegree graph $G$, we want to convert this graph into a boundeddegree expander $G'$ changing $G$ as little as possible. The graph $G'$ will be output by a \emph{distributed filter}: this is a sublinear time procedure that given a query vertex, outputs all its neighbors in $G'$, and can do so even in a distributed manner, ensuring consistency in all the answers. One of the main tools in our analysis is a result on the behavior of random walks in graph that are almost expanders: graphs that are formed by arbitrarily connecting a small unknown graph (the noise) to a large expander. We show that a random walk from almost any vertex in the expander part will have fast mixing properties, in the general setting of irreducible finite Markov chains. We also design sublinear time procedures to distinguish vertices of the expander part from those in the noise part, and use this procedure in the reconstruction algorithm. @inproceedings{KalPS08, author = {S. Kale and Y. Peres and C. Seshadhri}, title = {Noise Tolerance of Expanders and Sublinear Expander Reconstruction}, booktitle = {Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS)}, year = {2008}, pages = {719728}, } We consider the problem of testing graph expansion (either vertex or edge) in the bounded degree model of Goldreich and Ron. We give a property tester that given a graph with degree bound $d$, an expansion bound $\alpha$, and a parameter $\epsilon > 0$, accepts the graph with high probability if its expansion is more than $\alpha$, and rejects it with high probability if it is $\epsilon$far from any graph with expansion $\alpha'$ with degree bound $d$, where $\alpha' < \alpha$ is a function of $\alpha$. For edge expansion, we obtain $\alpha' = \Omega(\frac{\alpha^2}{d})$, and for vertex expansion, we obtain $\alpha' = \Omega(\frac{\alpha^2}{d^2})$. In either case, the algorithm runs in time $\tilde{O}(\frac{n^{(1+\mu)/2}d^2}{\epsilon\alpha^2})$ for any given constant $\mu > 0$. @inproceedings{KalS08, author = {S. Kale and C. Seshadhri}, title = {Testing Expansion in Bounded Degree Graphs}, booktitle = {In proceedings of the 35th International Colloquium on Automata, Languages and Programming (ICALP(I))}, year = {2008} pages = {527538} } @article{KalS11, author = {S. Kale and C. Seshadhri}, title = {Testing Expansion in Bounded Degree Graphs}, journal = {SIAM Journal on Computing}, volume = {40}, number = {3}, year = {2011}, pages = {709720} } We study the problem of twodimensional Delaunay triangulation in the selfimproving algorithms model. We assume that the $n$ points of the input each come from an independent, unknown, and arbitrary distribution. The first phase of our algorithm builds data structures that store relevant information about the input distribution. The second phase uses these data structures to efficiently compute the Delaunay triangulation of the input. The running time of our algorithm matches the informationtheoretic lower bound for the given input distribution, implying that if the input distribution has low entropy, then our algorithm beats the standard $\Omega(n\log n)$ bound for computing Delaunay triangulations. Our algorithm uses a host of techniques: $\epsilon$net construction for disks, optimal expectedcase pointlocation structures, ideas from randomized incremental construction, lineartime Delaunay splitting algorithms, and informationtheoretic arguments. @inproceedings{ClaS08, author = {K. L. Clarkson and C. Seshadhri}, title = {Selfimproving algorithms for delaunay triangulations}, booktitle = {Proceedings of the 24th Annual Symposium on Computational Geometry (SOCG)}, year = {2008}, pages = {148155}, } We investigate the problem of monotonicity reconstruction, as defined by Ailon et al (ISAAC 2004), in a distributed setting. We have oracle access to a nonnegative realvalued function $f$ defined on domain $[n]^d=\{1,\ldots,n\}^d$ ($d$ is considered to be a constant). We would like to closely approximate $f$ by a monotone function $g$. This should be done by a procedure (a {\em filter}) that given as input a point $x \in [n]^d$ outputs the value of $g(x)$, and runs in time that is highly sublinear in $n$. The procedure can (indeed must) be randomized, but we require that all of the randomness be specified in advance by a single short random seed. We construct such an implementation where the time and space per query is $(\log n)^{O(1)}$ and the size of the seed is polynomial in $\log n$ and $d$. Furthermore the distance of the approximating function $g$ from $f$ is at most a constant (depending on $d$) multiple of the minimum distance of any monotone function from $f$. This allows for a distributed implementation: one can initialize many copies of the filter with the same short random seed, and they can autonomously handle queries, while producing outputs that are consistent with the same approximating function $g$. @inproceedings{SakS08, author = {M. E. Saks and C. Seshadhri}, title = {Parallel monotonicity reconstruction}, booktitle = {Proceedings of the 19th Annual Symposium on Discrete Algorithms (SODA)}, year = {2008}, pages = {962971}, } @article{SakS08, author = {M. E. Saks and C. Seshadhri}, title = {Local monotonicity reconstruction}, journal = {SIAM Journal on Computing}, volume = {39}, number = {7} year = {2010}, pages = {28972926} } We investigate a new class of geometric problems based on the idea of online error correction. Suppose one is given access to a large geometric dataset though a query mechanism; for example, the dataset could be a terrain and a query might ask for the coordinates of a particular vertex or for the edges incident to it. Suppose, in addition, that the dataset satisfies some known structural property ${\cal P}$ (eg, monotonicity or convexity) but that, because of errors and noise, the queries occasionally provide answers that violate ${\cal P}$. Can one design a filter that modifies the query's answers so that (i) the output satisfies ${\cal P}$; (ii) the amount of data modification is minimized? We provide upper and lower bounds on the complexity of online reconstruction for convexity in 2D and 3D. @inproceedings{ChaS06, author = {B. Chazelle and C. Seshadhri}, title = {Online geometric reconstruction}, booktitle = {Proceedings of the 22nd ACM Symposium on Computational Geometry (SOCG)}, year = {2006}, pages = {386394}, } @article{ChaS11, author = {B. Chazelle and C. Seshadhri}, title = {Online geometric reconstruction}, journal = {Journal of the ACM}, volume = {39}, number = {7} year = {2010}, pages = {28972926} } We show in this paper that the BGS model of abstract state machines can be simulated by random access machines with at most a polynomial time overhead. This result is already stated in [BGS99] with a very brief proof sketch. The present paper gives a detailed proof of the result. We represent hereditarily finite sets, which are the typical BGS ASM objects, by membership graphs of the transitive closure of the sets. Testing for equality between BGS objects can be done in linear time in our representation. @inproceedings{ComSB05, author = {S. Comandur and A. Seth and S. Biswas}, title = {RAM Simulation of BGS Model of abstract State Machines}, booktitle = {Proceedings of the 12th International Workshop on abstract State Machines (ASM)}, year = {2005}, pages = {377386}, } We initiate a new line of investigation into {\em online propertypreserving data reconstruction.} Consider a dataset which is assumed to satisfy various (known) structural properties; eg, it may consist of sorted numbers, or points on a manifold, or vectors in a polyhedral cone, or codewords from an errorcorrecting code. Because of noise and errors, however, an (unknown) fraction of the data is deemed {\em unsound}, ie, in violation with the expected structural properties. Can one still query into the dataset in an online fashion and be provided data that is {\em always} sound? In other words, can one design a filter which, when given a query to any item $I$ in the dataset, returns a {\em sound} item $J$ that, although not necessarily in the dataset, differs from $I$ as infrequently as possible. No preprocessing should be allowed and queries should be answered online. We consider the case of a monotone function. Specifically, the dataset encodes a function $f:\{1,\ldots, n\}\,\mapsto {\bf R}$ that is at (unknown) distance $\eps$ from monotone, meaning that $f$ canand mustbe modified at $\eps n$ places to become monotone. Our main result is a randomized filter that can answer any query in $O(\log^2 n\log\log n)$ time while modifying the function $f$ at only $O(\eps n)$ places. The amortized time over $n$ function evaluations is $O(\log n)$. %With probability arbitrarily close to 1, %the filter never errs at any query. The filter works as stated with probability arbitrarily close to $1$. We provide an alternative filter with $O(\log n)$ worst case query time and $O(\eps n \log n)$ function modifications. For reconstructing $d$dimensional monotone functions of the form $f:\{1,\ldots, n\}^d\,\mapsto {\bf R}$, we present a filter that takes $(2^{O(d)}(\log n)^{4d2}\log\log n)$ time per query and modifies at most $O(\eps n^d)$ function values (for constant $d$). @inproceedings{AilCCL042, author = {N. Ailon and B. Chazelle and S. Comandur and D. Liu}, title = {PropertyPreserving Data Reconstruction}, booktitle = {Proceedings of the 15th International Symposium on Algorithms and Computation (ISAAC)}, year = {2004}, pages = {1627}, } @article{AilCCL042, author = {N. Ailon and B. Chazelle and S. Comandur and D. Liu}, title = {PropertyPreserving Data Reconstruction}, journal = {Algorithmica}, volume = {51}, number = {2}, year = {2008}, pages = {160182}, } In standard property testing, the task is to distinguish between objects that have a property $\mathcal{P}$ and those that are $\eps$far from $\mathcal{P}$, for some $\eps > 0$. In this setting, it is perfectly acceptable for the tester to provide a negative answer for every input object that does not satisfy $\mathcal{P}$. This implies that property testing in and of itself cannot be expected to yield any information whatsoever about the distance from the object to the property. We address this problem in this paper, restricting our attention to monotonicity testing. A function $f:\{1,\ldots, n\}\,\mapsto {\bf R}$ is at distance $\eps_f$ from being monotone if it can (and must) be modified at $\eps_f n$ places to become monotone. For any fixed $\delta>0$, we compute, with probability at least $2/3$, an interval $[(1/2 \delta)\eps, \eps]$ that encloses $\eps_f$. The running time of our algorithm is $O( \eps_f^{1}\log\log \eps_f^{1} \log n )$, which is optimal within a factor of $\log\log \eps_f^{1}$ and represents a substantial improvement over previous work. We give a second algorithm with an expected running time of $O( \eps_f^{1}\log n \log\log\log n)$. Finally, we extend our results to multivariate functions. @inproceedings{AilCCL041, author = {N. Ailon and B. Chazelle and S. Comandur and D. Liu}, title = {Estimating the Distance to a Monotone Function}, booktitle = {APPROXRANDOM}, year = {2004}, pages = {229236}, } @article{AilCCL041, author = {N. Ailon and B. Chazelle and S. Comandur and D. Liu}, title = {Estimating the distance to a monotone function}, journal = {Random Struct. Algorithms}, volume = {31}, number = {3}, year = {2007}, pages = {371383}, } 