Seshadhri's webpage

Graph representation learning is a fundamental technique for machine learning (ML) on complex networks. Given an input network, these methods represent the vertices by low-dimensional real-valued vectors. These vectors can be used for a multitude of downstream ML tasks. We study one of the most important such task, link prediction. Much of the recent literature on graph representation learning has shown remarkable success in link prediction. On closer investigation, we observe that the performance is measured by the AUC (area under the curve), which suffers biases. Since the ground truth in link prediction is sparse, we design a vertex-centric measure of performance, called the VCMPR@k plots. Under this measure, we show that link predictors using graph representations show poor scores. Despite having extremely high AUC scores, the predictors miss much of the ground truth. We identify a mathematical connection between this performance, the sparsity of the ground truth, and the low-dimensional geometry of the node embeddings. Under a formal theoretical framework, we prove that low-dimensional vectors cannot capture sparse ground truth using dot product similarities (the standard practice in the literature). Our results call into question existing results on link prediction and pose a significant scientific challenge for graph representation learning. The VCMPR plots identify specific scientific challenges for link prediction using low-dimensional node embeddings.

Monotonicity testing of Boolean functions on the hypergrid, $f:[n]^d \to \{0,1\}$, is a classic topic in property testing. Determining the non-adaptive complexity of this problem is an important open question. For arbitrary $n$, [Black-Chakrabarty-Seshadhri, SODA 2020] describe a tester with query complexity $\widetilde{O}(\varepsilon^{-4/3}d^{5/6})$. This complexity is independent of $n$, but has a suboptimal dependence on $d$. Recently, [Braverman-Khot-Kindler-Minzer, ITCS 2023] and [Black-Chakrabarty-Seshadhri, STOC 2023] describe $\widetilde{O}(\varepsilon^{-2} n^3\sqrt{d})$ and $\widetilde{O}(\varepsilon^{-2} n\sqrt{d})$-query testers, respectively. These testers have an almost optimal dependence on $d$, but a suboptimal polynomial dependence on $n$. \smallskip In this paper, we describe a non-adaptive, one-sided monotonicity tester with query complexity $O(\varepsilon^{-2} d^{1/2 + o(1)})$, \emph{independent} of $n$. Up to the $d^{o(1)}$-factors, our result resolves the non-adaptive complexity of monotonicity testing for Boolean functions on hypergrids. The independence of $n$ yields a non-adaptive, one-sided $O(\varepsilon^{-2} d^{1/2 + o(1)})$-query monotonicity tester for Boolean functions $f:\mathbb{R}^d \to \{0,1\}$ associated with an arbitrary product measure.

We study a new connection between a technical measure called μ-conductance that arises in the study of Markov chains for sampling convex bodies and the network community profile that characterizes size-resolved properties of clusters and communities in social and information networks. The idea of μ-conductance is similar to the traditional graph conductance, but disregards sets with small volume. We derive a sequence of optimization problems including a low-rank semi-definite program from which we can derive a lower bound on the optimal μ-conductance value. These ideas give the first theoretically sound bound on the behavior of the network community profile for a wide range of cluster sizes. The algorithm scales up to graphs with hundreds of thousands of nodes and we demonstrate how our framework validates the predicted structures of real-world graphs.

The problem of testing monotonicity for Boolean functions on the hypergrid, $f:[n]^d \to \{0,1\}$ is a classic topic in property testing. When $n=2$, the domain is the hypercube. For the hypercube case, a breakthrough result of Khot-Minzer-Safra (FOCS 2015) gave a non-adaptive, one-sided tester making $\otilde(\eps^{-2}\sqrt{d})$ queries. Up to polylog $d$ and $\eps$ factors, this bound matches the $\widetilde{\Omega}(\sqrt{d})$-query non-adaptive lower bound (Chen-De-Servedio-Tan (STOC 2015), Chen-Waingarten-Xie (STOC 2017)). For any $n > 2$, the optimal non-adaptive complexity was unknown. A previous result of the authors achieves a $\otilde(d^{5/6})$-query upper bound (SODA 2020), quite far from the $\sqrt{d}$ bound for the hypercube. In this paper, we resolve the non-adaptive complexity of monotonicity testing for all constant $n$, up to $\poly(\eps^{-1}\log d)$ factors. Specifically, we give a non-adaptive, one-sided monotonicity tester making $\otilde(\eps^{-2}n\sqrt{d})$ queries. From a technical standpoint, we prove new directed isoperimetric theorems over the hypergrid $[n]^d$. These results generalize the celebrated directed Talagrand inequalities that were only known for the hypercube.

Graph representation learning (also called graph embeddings) is a popular technique for incorporating network structure into machine learning models. Unsupervised graph embedding methods aim to capture graph structure by learning a low-dimensional vector representation (the embedding) for each node. Despite the widespread use of these embeddings for a variety of downstream transductive machine learning tasks, there is little principled analysis of the effectiveness of this approach for common tasks. In this work, we provide an empirical and theoretical analysis for the performance of a class of embeddings on the common task of pairwise community labeling. This is a binary variant of the classic community detection problem, which seeks to build a classifier to determine whether a pair of vertices participate in a community. In line with our goal of foundational understanding, we focus on a popular class of unsupervised embedding techniques that learn low rank factorizations of a vertex proximity matrix (this class includes methods like GraRep, DeepWalk, node2vec, NetMF). We perform detailed empirical analysis for community labeling over a variety of real and synthetic graphs with ground truth. In all cases we studied, the models trained from embedding features perform poorly on community labeling. In constrast, a simple logistic model with classic graph structural features handily outperforms the embedding models. For a more principled understanding, we provide a theoretical analysis for the (in)effectiveness of these embeddings in capturing the community structure. We formally prove that popular low-dimensional factorization methods either cannot produce community structure, or can only produce ``unstable" communities. These communities are inherently unstable under small perturbations.

Finding large cliques or cliques missing a few edges is a fundamental algorithmic task in the study of real-world graphs, with applications in community detection, pattern recognition, and clustering. A number of effective backtracking-based heuristics for these problems have emerged from recent empirical work in social network analysis. Given the $\mathbb{NP}$-hardness of variants of clique counting, these results raise a challenge for \emph{beyond worst-case analysis} of these problems. Inspired by the triadic closure of real-world graphs, Fox et al. (SICOMP 2020) introduced the notion of $c$-closed graphs and proved that maximal clique enumeration is fixed-parameter tractable with respect to $c$. In practice, due to noise in data, one wishes to actually discover "near-cliques", which can be characterized as cliques with a sparse subgraph removed. In this work, we prove that many different kinds of maximal near-cliques can be enumerated in polynomial time (and FPT in $c$) for $c$-closed graphs. We study various established notions of such substructures, including $k$-plexes, complements of bounded-degeneracy and bounded-treewidth graphs. Interestingly, our algorithms follow relatively simple backtracking procedures, analogous to what is done in practice. Our results underscore the significance of the $c$-closed graph class for theoretical understanding of social network analysis.

Consider property testing on bounded degree graphs and let $\varepsilon > 0$ denote the proximity parameter. A remarkable theorem of Newman-Sohler (SICOMP 2013) asserts that all properties of planar graphs (more generally hyperfinite) are testable with query complexity only depending on $\varepsilon$. Recent advances in testing minor-freeness have proven that all additive and monotone properties of planar graphs can be tested in $\poly(\varepsilon^{-1})$ queries. Some properties falling outside this class, such as Hamiltonicity, also have a similar complexity for planar graphs. Motivated by these results, we ask: can all properties of planar graphs can be tested in $\poly(\varepsilon^{-1})$ queries? Is there a uniform query complexity upper bound for all planar properties, and what is the ``hardest" such property to test? We discover a surprisingly clean and optimal answer. Any property of bounded degree planar graphs can be tested in $\exp(O(\varepsilon^{-2}))$ queries. Moreover, there is a matching lower bound, up to constant factors in the exponent. The natural property of testing isomorphism to a fixed graph requires $\exp(\Omega(\varepsilon^{-2}))$ queries, thereby showing that (up to polynomial dependencies) isomorphism to an explicit fixed graph is the hardest property of planar graphs. The upper bound is a straightforward adapation of the Newman-Sohler analysis that tracks dependencies on $\varepsilon$ more carefully. The main technical contribution is the lower bound construction, which is achieved by a special family of planar graphs that are all mutually far from each other. We can also apply our techniques to get analogous results for bounded treewidth graphs. We prove that all properties of bounded treewidth graphs can be tested in $\exp(O(\varepsilon^{-1}\log \varepsilon^{-1}))$ queries. Moreover, testing isomorphism to a fixed forest requires $\exp(\Omega(\varepsilon^{-1}))$ queries.

Consider the family of bounded degree graphs in any minor-closed family (such as planar graphs). Let d be the degree bound and n be the number of vertices of such a graph. Graphs in these classes have hyperfinite decompositions, where, for a sufficiently small {\epsilon} > 0, one removes {\epsilon}dn edges to get connected components of size independent of n. An important tool for sublinear algorithms and property testing for such classes is the partition oracle, introduced by the seminal work of Hassidim-Kelner-Nguyen-Onak (FOCS 2009). A partition oracle is a local procedure that gives consistent access to a hyperfinite decomposition, without any preprocessing. Given a query vertex v, the partition oracle outputs the component containing v in time independent of n. All the answers are consistent with a single hyperfinite decomposition. The partition oracle of Hassidim et al. runs in time d^poly(d/{\epsilon}) per query. They pose the open problem of whether poly(d/{\epsilon})-time partition oracles exist. Levi-Ron (ICALP 2013) give a refinement of the previous approach, to get a partition oracle that runs in time d^{\log(d/{\epsilon})-per query. In this paper, we resolve this open problem and give \poly(d/{\epsilon})-time partition oracles for bounded degree graphs in any minor-closed family. Unlike the previous line of work based on combinatorial methods, we employ techniques from spectral graph theory. We build on a recent spectral graph theoretical toolkit for minor-closed graph families, introduced by the authors to develop efficient property testers. A consequence of our result is a poly(d/{\epsilon})-query tester for any monotone and additive property of minor-closed families (such as bipartite planar graphs). Our result also gives poly(d/{\epsilon})-query algorithms for additive {\epsilon} n-approximations for problems such as maximum matching, minimum vertex cover, maximum independent set, and minimum dominating set for these graph families.

Triangle counting is a fundamental technique in network analysis, that has received much attention in various input models. The vast majority of triangle counting algorithms are targeted to static graphs. Yet, many real-world graphs are directed and temporal, where edges come with timestamps. Temporal triangles yield much more information, since they account for both the graph topology and the timestamps. Temporal triangle counting has seen a few recent results, but there are varying definitions of temporal triangles. In all cases, temporal triangle patterns enforce constraints on the time interval between edges (in the triangle). We define a general notion (\delta_{1,3}, \delta_{1,2}, \delta_{2,3})-temporal triangles that allows for separate time constraints for all pairs of edges. Our main result is a new algorithm, DOTTT (Degeneracy Oriented Temporal Triangle Totaler), that exactly counts all directed variants of (\delta_{1,3}, \delta_{1,2}, \delta_{2,3})-temporal triangles. Using the classic idea of degeneracy ordering with careful combinatorial arguments, we can prove that DOTTT runs in O(m\kappa \log m) time, where m is the number of (temporal) edges and \kappa is the graph degeneracy (max core number). Up to log factors, this matches the running time of the best static triangle counters. Moreover, this running time is better than existing. DOTTT has excellent practical behavior and runs twice as fast as existing state-of-the-art temporal triangle counters (and is also more general). For example, DOTTT computes all types of temporal queries in Bitcoin temporal network with half a billion edges in less than an hour on a commodity machine.

Counting homomorphisms of a constant sized pattern graph H in an input graph G is a fundamental computational problem. There is a rich history of studying the complexity of this problem, under various constraints on the input G and the pattern H. Given the significance of this problem and the large sizes of modern inputs, we investigate when near-linear time algorithms are possible. We focus on the case when the input graph has bounded degeneracy, a commonly studied and practically relevant class for homomorphism counting. It is known from previous work that for certain classes of H, H-homomorphisms can be counted exactly in near-linear time in bounded degeneracy graphs. Can we precisely characterize the patterns H for which near-linear time algorithms are possible? We completely resolve this problem, discovering a clean dichotomy using fine-grained complexity. Let m denote the number of edges in G. We prove the following: if the largest induced cycle in H has length at most 5, then there is an O(mlogm) algorithm for counting H-homomorphisms in bounded degeneracy graphs. If the largest induced cycle in H has length at least 6, then (assuming standard fine-grained complexity conjectures) there is a constant γ>0, such that there is no o(m1+γ) time algorithm for counting H-homomorphisms.

Triangle counting is a fundamental problem in the analysis of large graphs. There is a rich body of work on this problem, in varying streaming and distributed models, yet all these algorithms require reading the whole input graph. In many scenarios, we do not have access to the whole graph, and can only sample a small portion of the graph (typically through crawling). In such a setting, how can we accurately estimate the triangle count of the graph? We formally study triangle counting in the {\em random walk} access model introduced by Dasgupta et al (WWW '14) and Chierichetti et al (WWW '16). We have access to an arbitrary seed vertex of the graph, and can only perform random walks. This model is restrictive in access and captures the challenges of collecting real-world graphs. Even sampling a uniform random vertex is a hard task in this model. Despite these challenges, we design a provable and practical algorithm, TETRIS, for triangle counting in this model. TETRIS is the first provably sublinear algorithm (for most natural parameter settings) that approximates the triangle count in the random walk model, for graphs with low mixing time. Our result builds on recent advances in the theory of sublinear algorithms. The final sample built by TETRIS is a careful mix of random walks and degree-biased sampling of neighborhoods. Empirically, TETRIS accurately counts triangles on a variety of large graphs, getting estimates within 5\% relative error by looking at 3\% of the number of edges.

We revisit the well-studied problem of triangle count estimation in graph streams. Given a graph represented as a stream of $m$ edges, our aim is to compute a $(1\pm\varepsilon)$-approximation to the triangle count $T$, using a small space algorithm. For arbitrary order and a constant number of passes, the space complexity is known to be essentially $\Theta(\min(m^{3/2}/T, m/\sqrt{T}))$ (McGregor et al., PODS 2016, Bera et al., STACS 2017). We give a (constant pass, arbitrary order) streaming algorithm that can circumvent this lower bound for \emph{low degeneracy graphs}. The degeneracy, $\kappa$, is a nuanced measure of density, and the class of constant degeneracy graphs is immensely rich (containing planar graphs, minor-closed families, and preferential attachment graphs). We design a streaming algorithm with space complexity $\widetilde{O}(m\kappa/T)$. For constant degeneracy graphs, this bound is $\widetilde{O}(m/T)$, which is significantly smaller than both $m^{3/2}/T$ and $m/\sqrt{T}$. We complement our algorithmic result with a nearly matching lower bound of $\Omega(m\kappa/T)$.

Clique and near-clique counts are important graph properties that help to characterize graphs and have applications in graph generation, graph modelling, graph analytics etc. Recently, Jain et. al proposed an algorithm called Turan-Shadow for estimating k-clique counts that was a significant improvement over the state-of-the-art. Essentially, given a graph $G$, it constructs an object called the Tur\'an Shadow - a collection of (possibly overlapping) dense subgraphs such that there is a bijection from the cliques in each subgraph to the set of k-cliques in $G$. It then estimates the count of k-cliques in $G$ by estimating the number of cliques in the subgraphs using uniform random sampling. As a side effect, it provides an efficient method to sample a uniform random $k$-clique. We use this clique sampler to discover and estimate counts of near-cliques i.e. k-cliques that are missing 1 or 2 edges. Every near-clique contains cliques of smaller sizes. By obtaining the Tur\'an Shadow for the smaller clique size, sampling a uniform random clique of the smaller size and obtaining an average of the number of near-cliques the sampled clique participates in, we give an estimate for the total number of near-cliques in $G$. There was no known efficient method to obtain near-clique counts for $k>5$ before this work. We obtain a $10x-100x$ speedup over existing algorithms for counting near-cliques. Additionally, we enhance Turan-Shadow by designing a new estimator that requires us to construct only parts of the shadow, and uses the shadow in an online manner, obviating the need to store the entire shadow. This helps achieve not only a $2x-10x$ speedup over Turan-Shadow in some cases but drastically reduces the space utilized.

The study of complex networks is a significant development in modern science, and has enriched the social sciences, biology, physics, and computer science. Models and algorithms for such networks are pervasive in our society, and impact human behavior via social networks, search engines, and recommender systems, to name a few. A widely used algorithmic technique for modeling such complex networks is to construct a low-dimensional Euclidean embedding of the vertices of the network, where proximity of vertices is interpreted as the likelihood of an edge. Contrary to the common view, we argue that such graph embeddings do not capture salient properties of complex networks. The two properties we focus on are low degree and large clustering coefficients, which have been widely established to be empirically true for real-world networks. We mathematically prove that any embedding (that uses dot products to measure similarity) that can successfully create these two properties must have a rank that is nearly linear in the number of vertices. Among other implications, this establishes that popular embedding techniques such as singular value decomposition and node2vec fail to capture significant structural aspects of real-world complex networks. Furthermore, we empirically study a number of different embedding techniques based on dot product, and show that they all fail to capture the triangle structure.

We consider the problem of counting all k-vertex subgraphs in an input graph, for any constant k. This problem (denoted sub-cntk) has been studied extensively in both theory and practice. In a classic result, Chiba and Nishizeki (SICOMP 85) gave linear time algorithms for clique and 4-cycle counting for bounded degeneracy graphs. This is a rich class of sparse graphs that contains, for example, all minor-free families and preferential attachment graphs. The techniques from this result have inspired a number of recent practical algorithms for sub-cntk. Towards a better understanding of the limits of these techniques, we ask: for what values of k can sub-cntk be solved in linear time? We discover a chasm at k=6. Specifically, we prove that for k < 6, sub-cntk can be solved in linear time. Assuming a standard conjecture in fine-grained complexity, we prove that for all k≥6, sub-cntk cannot be solved even in near-linear time.

Subgraph counting is a fundamental task in network analysis. Typically, algorithmic work is on total counting, where we wish to count the total frequency of a (small) pattern subgraph in a large input data set. But many applications require local counts (also called vertex orbit counts) wherein, for every vertex v of the input graph, one needs the count of the pattern subgraph involving v. This provides a rich set of vertex features that can be used in machine learning tasks, especially classification and clustering. But getting local counts is extremely challenging. Even the easier problem of getting total counts has received much research attention. Local counts require algorithms that get much finer grained information, and the sheer output size makes it difficult to design scalable algorithms. We present EVOKE, a scalable algorithm that can determine vertex orbits counts for all 5-vertex pattern subgraphs. In other words, EVOKE exactly determines, for every vertex v of the input graph and every 5-vertex subgraph H, the number of copies of H that v participates in. EVOKE can process graphs with tens of millions of edges, within an hour on a commodity machine. EVOKE is typically hundreds of times faster than previous state of the art algorithms, and gets results on datasets beyond the reach of previous methods. Theoretically, we generalize a recent "graph cutting" framework to get vertex orbit counts. This framework generate a collection of polynomial equations relating vertex orbit counts of larger subgraphs to those of smaller subgraphs. EVOKE carefully exploits the structure among these equations to rapidly count. We prove and empirically validate that EVOKE only has a small constant factor overhead over the best (total) 5-vertex subgraph counter.

Clique counting is a fundamental task in network analysis, and even the simplest setting of 3-cliques (triangles) has been the center of much recent research. Getting the count of k-cliques for larger k is algorithmically challenging, due to the exponential blowup in the search space of large cliques. But a number of recent applications (especially for community detection or clustering) use larger clique counts. Moreover, one often desires \textit{local} counts, the number of k-cliques per vertex/edge. Our main result is Pivoter, an algorithm that exactly counts the number of k-cliques, \textit{for all values of k}. It is surprisingly effective in practice, and is able to get clique counts of graphs that were beyond the reach of previous work. For example, Pivoter gets all clique counts in a social network with a 100M edges within two hours on a commodity machine. Previous parallel algorithms do not terminate in days. Pivoter can also feasibly get local per-vertex and per-edge k-clique counts (for all k) for many public data sets with tens of millions of edges. To the best of our knowledge, this is the first algorithm that achieves such results. The main insight is the construction of a Succinct Clique Tree (SCT) that stores a compressed unique representation of all cliques in an input graph. It is built using a technique called \textit{pivoting}, a classic approach by Bron-Kerbosch to reduce the recursion tree of backtracking algorithms for maximal cliques. Remarkably, the SCT can be built without actually enumerating all cliques, and provides a succinct data structure from which exact clique statistics (k-clique counts, local counts) can be read off efficiently.

We describe a Õ (d5/6)-query monotonicity tester for Boolean functions f:[n]d→{0,1} on the n-hypergrid. This is the first o(d) monotonicity tester with query complexity independent of n. Motivated by this independence of n, we initiate the study of monotonicity testing of measurable Boolean functions f:ℝd→{0,1} over the continuous domain, where the distance is measured with respect to a product distribution over ℝd. We give a Õ (d5/6)-query monotonicity tester for such functions. Our main technical result is a domain reduction theorem for monotonicity. For any function f:[n]d→{0,1}, let ϵf be its distance to monotonicity. Consider the restriction f̂ of the function on a random [k]d sub-hypergrid of the original domain. We show that for k=poly(d/ϵ), the expected distance of the restriction is 𝔼[ϵf̂ ]=Ω(ϵf). Previously, such a result was only known for d=1 (Berman-Raskhodnikova-Yaroslavtsev, STOC 2014). Our result for testing Boolean functions over [n]d then follows by applying the d5/6⋅poly(1/ϵ,logn,logd)-query hypergrid tester of Black-Chakrabarty-Seshadhri (SODA 2018). To obtain the result for testing Boolean functions over ℝd, we use standard measure theoretic tools to reduce monotonicity testing of a measurable function f to monotonicity testing of a discretized version of f over a hypergrid domain [N]d for large, but finite, N (that may depend on f). The independence of N in the hypergrid tester is crucial to getting the final tester over ℝd.

Given query access to an undirected graph $G$, we consider the problem of computing a $(1\pm\eps)$-approximation of the number of $k$-cliques in $G$. The standard query model for general graphs allows for degree queries, neighbor queries, and pair queries. Let $n$ be the number of vertices, $m$ be the number of edges, and $n_k$ be the number of $k$-cliques. Previous work by Eden, Ron and Seshadhri (STOC 2018) gives an $O^*(\frac{n}{n^{1/k}_k} + \frac{m^{k/2}}{n_k})$-time algorithm for this problem (we use $O^*(\cdot)$ to suppress $\poly(\log n, 1/\eps, k^k)$ dependencies). Moreover, this bound is nearly optimal when the expression is sublinear in the size of the graph. Our motivation is to circumvent this lower bound, by parameterizing the complexity in terms of \emph{graph arboricity}. The arboricity of $G$ is a measure for the graph density ``everywhere''. There is a very rich family of graphs with bounded arboricity, including all minor-closed graph classes (such as planar graphs and graphs with bounded treewidth), bounded degree graphs, preferential attachment graphs and more. We design an algorithm for the class of graphs with arboricity at most $\alpha$, whose running time is $O^*(\min\{\frac{n\alpha^{k-1}}{n_k},\; \frac{n}{n_k^{1/k}}+\frac{m \alpha^{k-2}}{n_k} \} )$. We also prove a nearly matching lower bound. For all graphs, the arboricity is $O(\sqrt m)$, so this bound subsumes all previous results on sublinear clique approximation. As a special case of interest, consider minor-closed families of graphs, which have constant arboricity. Our result implies that for any minor-closed family of graphs, there is a $(1\pm\eps)$-approximation algorithm for $n_k$ that has running time $O^*(\frac{n}{n_k})$. Such a bound was not known even for the special (classic) case of triangle counting in planar graphs.

Let $G$ be a graph with $n$ vertices and maximum degree $d$. Fix some minor-closed property $\mathcal{P}$ (such as planarity). We say that $G$ is $\varepsilon$-far from $\mathcal{P}$ if one has to remove $\varepsilon dn$ edges to make it have $\mathcal{P}$. The problem of property testing $\mathcal{P}$ was introduced in the seminal work of Benjamini-Schramm-Shapira (STOC 2008) that gave a tester with query complexity triply exponential in $\varepsilon^{-1}$. Levi-Ron (TALG 2015) have given the best tester to date, with a quasipolynomial (in $\varepsilon^{-1}$) query complexity. It is an open problem to get property testers whose query complexity is $\poly(d\varepsilon^{-1})$, even for planarity. In this paper, we resolve this open question. For any minor-closed property, we give a tester with query complexity $d\cdot \poly(\varepsilon^{-1})$. The previous line of work on (independent of $n$, two-sided) testers is primarily combinatorial. Our work, on the other hand, employs techniques from spectral graph theory. This paper is a continuation of recent work of the authors (FOCS 2018) analyzing random walk algorithms that find forbidden minors.

Testing monotonicity of a Boolean function $f:\{0,1\}^n \to \{0,1\}$ is an important problem in the field of property testing. It has led to connections with many interesting combinatorial questions on the directed hypercube: routing, random walks, and new isoperimetric theorems. Denoting the proximity parameter by $\eps$, the best tester is the non-adaptive $\otilde(\eps^{-2}\sqrt{n})$ tester of Khot-Minzer-Safra (FOCS 2015). A series of recent results by Belovs-Blais (STOC 2016) and Chen-Waingarten-Xie (STOC 2017) have led to $\widetilde{\Omega}(n^{1/3})$ lower bounds for \emph{adaptive} testers. Reducing this gap is a significant question, that touches on the role of adaptivity in monotonicity testing of Boolean functions. We approach this question from the perspective of \emph{parametrized property testing}, a concept recently introduced by Pallavoor-Raskhodnikova-Varma (ACM TOCT 2017), where one seeks to understand performance of testers with respect to parameters other than just the size. Our result is an adaptive monotonicity tester with one-sided error whose query complexity is $O(\eps^{-2}\bI(f)\log^5 n)$, where $\bI(f)$ is the total influence of the function. Therefore, adaptivity provably helps monotonicity testing for low influence functions.

Finding the dense regions of a graph and relations among them is a fundamental problem in network analysis. Core and truss decompositions reveal dense subgraphs with hierarchical relations. The incremental nature of algorithms for computing these decompositions and the need for global information at each step of the algorithm hinders scalable parallelization and approximations since the densest regions are not revealed until the end. In a previous work, Lu et al. proposed to iteratively compute the h-indices of neighbor vertex degrees to obtain the core numbers and prove that the convergence is obtained after a finite number of iterations. This work generalizes the iterative h-index computation for truss decomposition as well as nucleus decomposition which leverages higher-order structures to generalize core and truss decompositions. In addition, we prove convergence bounds on the number of iterations. We present a framework of local algorithms to obtain the core, truss, and nucleus decompositions. Our algorithms are local, parallel, offer high scalability, and enable approximations to explore time and quality trade-offs. Our shared-memory implementation verifies the efficiency, scalability, and effectiveness of our local algorithms on real-world networks.

Let $G$ be an undirected, bounded degree graph with $n$ vertices. Fix a finite graph $H$, and suppose one must remove $\varepsilon n$ edges from $G$ to make it $H$-minor free (for some small constant $\varepsilon > 0$). We give an $n^{1/2+o(1)}$-time randomized procedure that, with high probability, finds an $H$-minor in such a graph. For an example application, suppose one must remove $\varepsilon n$ edges from a bounded degree graph $G$ to make it planar. This result implies an algorithm, with the same running time, that produces a $K_{3,3}$ or $K_5$ minor in $G$. No sublinear time bound was known for this problem, prior to this result. By the graph minor theorem, we get an analogous result for any minor-closed property. Up to $n^{o(1)}$ factors, this resolves a conjecture of Benjamini-Schramm-Shapira (STOC 2008) on the existence of one-sided property testers for minor-closed properties. Furthermore, our algorithm is nearly optimal, by an $\Omega(\sqrt{n})$ lower bound of Czumaj et al (RSA 2014). Prior to this work, the only graphs $H$ for which non-trivial property testers were known for $H$-minor freeness are the following: $H$ being a forest or a cycle (Czumaj et al, RSA 2014), $K_{2,k}$, $(k\times 2)$-grid, and the $k$-circus (Fichtenberger et al, Arxiv 2017).

We propose a new distribution-free model of social networks. Our definitions are motivated by one of the most universal signatures of social networks, triadic closure---the property that pairs of vertices with common neighbors tend to be adjacent. Our most basic definition is that of a "$c$-closed" graph, where for every pair of vertices $u,v$ with at least $c$ common neighbors, $u$ and $v$ are adjacent. We study the classic problem of enumerating all maximal cliques, an important task in social network analysis. We prove that this problem is fixed-parameter tractable with respect to $c$ on $c$-closed graphs. Our results carry over to "weakly $c$-closed graphs", which only require a vertex deletion ordering that avoids pairs of non-adjacent vertices with $c$ common neighbors. Numerical experiments show that well-studied social networks tend to be weakly $c$-closed for modest values of $c$.

We study the problem of approximating the number of $k$-cliques in a graph when given query access to the graph. We consider the standard query model for general graphs via (1) degree queries, (2) neighbor queries and (3) pair queries. Let $n$ denote the number of vertices in the graph, $m$ the number of edges, and $\clk$ the number of $k$-cliques. We design an algorithm that outputs a $(1+\eps)$-approximation (with high probability) for $\clk$, whose expected query complexity and running time are $O\left(\frac{n}{\clk^{1/k}}+\frac{m^{k/2}}{\clk} \right)\poly(\log n, 1/\eps,k)$. Hence, the complexity of the algorithm is sublinear in the size of the graph for $\clk = \omega(m^{k/2-1})$. Furthermore, the query complexity of our algorithm is essentially optimal (up to the dependence on $\log n$, $1/\eps$ and $k$). The previous results in this vein are by Feige (SICOMP 06) and by Goldreich \dcom{and} Ron (RSA 08) for edge counting ($k=2$) and by Eden et al (FOCS 2015) for triangle counting ($k=3$). Our result matches the complexities of these results. The previous result by Eden et al. hinges on a certain amortization technique that works for triangle counting, and does not generalize %for larger cliques to all $k$. We obtain a general algorithm that works for any $k\geq 3$ by designing a procedure that} samples each $k$-clique incident to a given set $S$ of vertices with approximately equal probability. The primary difficulty is in finding cliques incident to purely high-degree vertices, since random sampling within neighbors has a low success probability. This is achieved by an algorithm that samples uniform random high degree vertices and a careful tradeoff between estimating cliques incident purely to high-degree vertices and those that include a low-degree vertex.

The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on \emph{graph sampling}, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. The degree distribution estimation poses a significant challenge, due to its heavy-tailed nature and the large variance in degrees. We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of \emph{sublinear algorithms}. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the \emph{h-index} and the \emph{z-index}. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law. We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most 1% of the vertices. This is a major improvement over the state-of-the-art sampling algorithms, which typically sample more than 10% of the vertices to give comparable results. We also observe that the h and z-indices of real graphs are large, validating our theoretical analysis.

We study monotonicity testing of Boolean functions over the hypergrid [n]^d and design a non-adaptive tester with 1-sided error whose query complexity is O(d^5/6)⋅poly(logn,1/ϵ). Previous to our work, the best known testers had query complexity linear in d but independent of n. We improve upon these testers as long as n=2^d^o(1). To obtain our results, we work with what we call the augmented hypergrid, which adds extra edges to the hypergrid. Our main technical contribution is a Margulis-style isoperimetric result for the augmented hypergrid, and our tester, like previous testers for the hypercube domain, performs directed random walks on this structure.

We revisit the classic problem of estimating the degree distribution moments of an undirected graph. Consider an undirected graph $G=(V,E)$ with $n$ (non-isolated) vertices, and define (for $s > 0$) $\mu_s = \frac{1}{n}\cdot\sum_{v \in V} d^s_v$. Our aim is to estimate $\mu_s$ within a multiplicative error of $(1+\varepsilon)$ (for a given approximation parameter $\varepsilon>0$) in sublinear time. We consider the sparse graph model that allows access to: uniform random vertices, queries for the degree of any vertex, and queries for a neighbor of any vertex. For the case of $s=1$ (the average degree), $\widetilde{O}(\sqrt{n})$ queries suffice for any constant $\varepsilon$ (Feige, SICOMP 06 and Goldreich-Ron, RSA 08). Gonen-Ron-Shavitt (SIDMA 11) extended this result to all integral $s > 0$, by designing an algorithms that performs $\widetilde{O}(n^{1-1/(s+1)})$ queries. (Strictly speaking, their algorithm approximates the number of star-subgraphs of a given size, but a slight modification gives an algorithm for moments.) We design a new, significantly simpler algorithm for this problem. In the worst-case, it exactly matches the bounds of Gonen-Ron-Shavitt, and has a much simpler proof. More importantly, the running time of this algorithm is connected to the \emph{degeneracy} of $G$. This is (essentially) the maximum density of an induced subgraph.

We study the problem of testing unateness of functions f:{0,1}^d -> R. We give a O((d/\eps)log(d/\eps))-query nonadaptive tester and a O(d/\eps)-query adaptive tester and show that both testers are optimal for a fixed distance parameter \eps. Previously known unateness testers worked only for Boolean functions, and their query complexity had worse dependence on the dimension both for the adaptive and the nonadaptive case. Moreover, no lower bounds for testing unateness were known. We also generalize our results to obtain optimal unateness testers for functions f:[n]^d -> R. Our results establish that adaptivity helps with testing unateness of real-valued functions on domains of the form {0,1}^d and, more generally, [n]^d. This stands in contrast to the situation for monotonicity testing where there is no adaptivity gap for functions f:[n]^d -> R.

Clique counts reveal important properties about the structure of massive graphs, especially social networks. The simple setting of just 3-cliques (triangles) has received much attention from the research community. For larger cliques (even, say $6$-cliques) the problem quickly becomes intractable because of combinatorial explosion. Most methods used for triangle counting do not scale for large cliques, and existing algorithms require massive parallelism to be feasible. We present a new randomized algorithm that provably approximates the number of $k$-cliques, for any constant $k$. The key insight is the use of (strengthenings of) the classic Tur\'an's theorem: this claims that if the edge density of a graph is sufficiently high, the $k$-clique density must be non-trivial. We define a combinatorial structure called a \emph{Tur\'an shadow}, the construction of which leads to fast algorithms for clique counting. We design a practical heuristic, called TURAN-SHADOW, based on this theoretical algorithm, and test it on a large class of test graphs. In all cases, TURAN-SHADOW has less than 2\% error, in a fraction of the time used by well-tuned exact algorithms. We do detailed comparisons with a range of other sampling algorithms, and find that TURAN-SHADOW is generally much faster and more accurate. For example, TURAN-SHADOW estimates all cliques numbers up to size $10$ in social network with over a hundred million edges. This is done in less than three hours on a single commodity machine.

Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for \emph{all} 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have less than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for $5$-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all $4$-vertex patterns. This algorithm is typically ten times faster than the state of the art $4$-vertex counters.

Finding similar user pairs is a fundamental task in social networks, with applications in link prediction, tie strength detection, and a number of ranking and personalization tasks. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, and cosine similarity among these vectors quantifies user similarity. The predominant task for user similarity applications is to discover all user pairs with ``high'' similarity, which is computationally challenging on networks with billions of edges. To the best of our knowledge, there is no practical solution for computing all user pairs with high similarity on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm --- WHIMP --- that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the ``wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.

The \emph{Ulam distance} between two permutations of length $n$ is the minimum number of insertions and deletions needed to transform one sequence into the other. Equivalently, the Ulam distance $d$ is $n$ minus the length of the longest common subsequence (LCS) between the permutations. Our main result is an algorithm, that for any fixed $\varepsilon>0$, provides a $(1+\varepsilon)$-multiplicative approximation for $d$ in $\tilde{O}_{\varepsilon}(n/d + \sqrt{n})$ time, which has been shown to be optimal up to polylogarithmic factors. This is the first sublinear time algorithm (provided that $d = (\log n)^{\omega(1)}$) that obtains arbitrarily good multiplicative approximations to the Ulam distance. The previous best bound is an $O(1)$-approximation (with a large constant) by Andoni and Nguyen (2010) with the same running time bound (ignoring polylogarithmic factors). The improvement in the approximation factor from $O(1)$ to $(1+\varepsilon)$ allows for significantly more powerful sublinear algorithms. For example, for any fixed $\delta > 0$, we can get additive $\delta n$ approximations for the LCS between permutations in $\tilde{O}_\delta(\sqrt{n})$ time. Previous sublinear algorithms require $\delta$ to be at least $1-1/C$, where $C$ is the approximation factor, which is close to $1$ when $C$ is large. Our algorithm is obtained by abstracting the basic algorithmic framework of Andoni and Nguyen, and combining it with the sublinear approximations for the longest increasing subsequence by Saks and Seshadhri (2010).

The computation and study of triangles in graphs is a standard tool in the analysis of real-world networks. Yet most of this work focuses on undirected graphs. Real-world networks are often directed and have a significant fraction of reciprocal edges. While there is much focus on directed triadic patterns in the social sciences community, most data mining and graph analysis studies ignore direction. But how do we make sense of this complex directed structure? We propose a collection of directed closure values that are analogues of the classic transitivity measure (the fraction of wedges that participate in triangles). We perform an extensive set of triadic measurements on a variety of massive real-world networks. Our study of these values reveal a wealth of information of the nature of direction. For instance, we immediately see the importance of reciprocal edges in forming triangles and can measure the power of transitivity. Surprisingly, the chance that a wedge is closed depends heavily on its directed structure. We also observe striking similarities between the triadic closure patterns of different web and social networks. Together with these observations, we also present the first sampling based algorithm for fast estimation of directed triangles. Previous estimation methods were targeted towards undirected triangles and could not be extended to directed graphs. Our method, based on wedge sampling, gives orders of magnitude speedup over state of the art enumeration.

Increasing complexity of scientific simulations and HPC architectures are driving the need for adaptive workflows, where the composition and execution of computational and data manipulation steps dynamically depend on the evolutionary state of the simulation itself. Consider for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and with high fidelity for post-analysis, however we cannot afford to retain the same frequency for the full simulation due to the high cost of data movement. We can instead look for triggers, indicators that the simulation will be entering a critical phase and adapt the workflow accordingly. We present a method for detecting triggers and demonstrate its use in direct numerical simulations of turbulent combustion using S3D. We show that chemical explosive mode analysis (CEMA) can be used to devise a noise-tolerant indicator for rapid increase in heat release. However, exhaustive computation of CEMA values dominates the total simulation, thus is prohibitively expensive. To overcome this bottleneck, we propose a quantile-sampling approach. Our algorithm comes with provable error/confidence bounds, as a function of the number of samples. Most importantly, the number of samples is independent of the problem size, thus our proposed algorithm offers perfect scalability. Our experiments on homogeneous charge compression ignition (HCCI) and reactivity controlled compression ignition (RCCI) simulations show that the proposed method can detect rapid increases in heat release, and its computational overhead is negligible. Our results will be used for dynamic workflow decisions about data storage and mesh resolution in future combustion simulations. Proposed framework is generalizable and we detail how it could be applied to a broad class of scientific simulation workflows.

Consider a scalar field $f:\mathbb{M} \mapsto \mathbb{R}$, where $\mathbb{M}$ is a triangulated simplicial mesh in $\mathbb{R}^d$. A level set, or contour, at value $v$ is a connected component of $f^{-1}(v)$. As $v$ is changed, these contours change topology, merge into each other, or split. \emph{Contour trees} are concise representations of $f$ that track this contour behavior. The vertices of these trees are the critical points of $f$, where the gradient is zero. The edges represent changes in the topology of contours. It is a fundamental data structure in data analysis and visualization, and there is significant previous work (both theoretical and practical) on algorithms for constructing contour trees. Suppose $\mathbb{M}$ has $n$ vertices, $N$ facets, and $t$ critical points. A classic result of Carr, Snoeyink, and Axen (2000) gives an algorithm that takes $O(n\log n + N\alpha(N))$ time (where $\alpha(\cdot)$ is the inverse Ackermann function). A further improvement to $O(t\log t + N)$ time was given by Chiang et al. All these algorithms involve a global sort of the critical points, a significant computational bottleneck. Unfortunately, lower bounds of $\Omega(t\log t)$ also exist. We present the first algorithm that can avoid the global sort and has a refined time complexity that depends on the contour tree structure. Intuitively, if the tree is short and fat, we get significant improvements in running time. For a partition of the contour tree into a set of descending paths, $P$, our algorithm runs in $O(\sum_{p\in P} |p|\log|p| + t\alpha(t)+N)$ time. This is at most $O(t\log D + N)$, where $D$ is the diameter of the contour tree. Moreover, it is $O(t\alpha(t)+N)$ for balanced trees, a significant improvement over the previous complexity. We also prove lower bounds showing that the $\sum_{p\in P} |p|\log|p|$ complexity is inherent to computing contour trees.

We present a characterization of short-term stability of Kauffman's N-K (random) Boolean networks under arbitrary distributions of transfer functions. Given such a Boolean network where each transfer function is drawn from the same distribution, we present a formula that determines whether short-term chaos (damage spreading) will happen. Our main technical tool which enables the formal proof of this formula is the Fourier analysis of Boolean functions, which describes such functions as multilinear polynomials over the inputs. Numerical simulations on mixtures of threshold functions and nested canalyzing functions demonstrate the formula's correctness.

We consider the problem of estimating the number of triangles in a graph. This problem has been extensively studied in both theory and practice, but all existing algorithms read the entire graph. In this work we design a sublinear-time algorithm for approximating the number of triangles in a graph, where the algorithm is given query access to the graph. The allowed queries are degree queries, vertex-pair queries and neighbor queries. We show that for any given approximation parameter 0\textless epsilon\textless1, the algorithm provides an estimate hat\{t\} such that with high constant probability, (1-epsilon) t\textless hat\{t\}\textless(1+epsilon)t, where t is the number of triangles in the graph G. The expected query complexity of the algorithm is O(n/t\textasciicircum \{1/3\} + min \{m, m\textasciicircum\{3/2\}/t\}) poly(log n, 1/epsilon), where n is the number of vertices in the graph and m is the number of edges, and the expected running time is (n/t\textasciicircum \{1/3\} + m\textasciicircum\{3/2\}/t) poly(log n, 1/epsilon). We also prove that Omega(n/t\textasciicircum \{1/3\} + min \{m, m\textasciicircum\{3/2\}/t\}) queries are necessary, thus establishing that the query complexity of this algorithm is optimal up to polylogarithmic factors in n (and the dependence on 1/epsilon).

Given two sets of vectors, $A = \set{\vec{a_1}, \dots, \vec{a_m}}$ and $B=\set{\vec{b_1},\dots,\vec{b_n}}$, our problem is to find the top-$t$ dot products, i.e., the largest $|\vec{a_i}\cdot\vec{b_j}|$ among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all $mn$ dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of $A$ and $B$. The probability of selecting a diamond corresponding to pair $(i,j)$ is proportional to $(\vec{a_i}\cdot\vec{b_j})^2$, amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods.

The degree distribution is one of the most fundamental graph properties of interest for real-world graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scale-free degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph, with small storage. We design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams will millions of edges with storage less than 1% and get extremely accurate approximations for all scales in the degree distribution.
We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail.

Estimating the number of triangles in a graph given as a stream of edges is a fundamental problem in data mining. The goal is to design a single pass space-efficient streaming algorithm for estimating triangle counts. While there are numerous algorithms for this problem, they all (implicitly or explicitly) assume that the stream does not contain duplicate edges. However, data sets obtained from real-world graph streams are rife with duplicate edges. The work around is typically an extra unaccounted pass (storing all the edges!) just to "clean up" the data. Can we estimate triangle counts accurately in a single pass even when the stream contains repeated edges? In this work, we give the first algorithm for estimating the triangle count of a multigraph stream of edges.

Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs for truly massive graphs employ clusters and massive parallelization. We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.

Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which allows for discovery of overlapping dense subgraphs. With the right parameters, the nuclear decomposition generalizes the classic notions of k-cores and k-trusses. We give provable efficient algorithms for nuclear decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.

The primary problem in property testing is to decide whether a given function satisfies a certain property, or is far from any function satisfying it. This crucially requires a notion of distance between functions. The most prevalent notion is the Hamming distance over the {\em uniform} distribution on the domain. This restriction to uniformity is rather limiting, and it is important to investigate distances induced by more general distributions. In this paper, we give simple and optimal testers for {\em bounded derivative properties} over {\em arbitrary product distributions}. Bounded derivative properties include fundamental properties such as monotonicity and Lipschitz continuity. Our results subsume almost all known results (upper and lower bounds) on monotonicity and Lipschitz testing. We prove an intimate connection between bounded derivative property testing and binary search trees (BSTs). We exhibit a tester whose query complexity is the sum of expected depths of optimal BSTs for each marginal. Furthermore, we show this sum-of-depths is also a lower bound. A technical contribution of our work is an {\em optimal dimension reduction theorem} for all bounded derivative properties, which relates the distance of a function from the property to the distance of restrictions of the function to random lines. Such a theorem has been elusive even for monotonicity, and our theorem is an exponential improvement to the previous best known result.

We propose a new algorithm, FAST-PPR, for estimating personalized PageRank: given input nodes $s$, $t$ in a directed graph and threshold $\delta$, it computes the Personalized PageRank $\PR_s(t)$ from $s$ to $t$, guaranteeing that the relative error is small as long $\PR_s(t) > \delta$. Existing algorithms for this problem have a running-time of $\Omega(1/\delta)$. We design a new algorithm, FAST-PPR, that has a provable average running-time guarantee of $\tilde{O}(\sqrt{d/\delta})$ (where $d$ is the average in-degree of the graph). This is a significant improvement, since $\delta$ is often $O(1/n)$ (where $n$ is the number of nodes) for applications. We also complement the algorithm with an $\Omega(1/\sqrt{\delta})$ lower bound for Significant-PageRank, showing that the dependence on $\delta$ cannot be improved. We perform a detailed empirical study on numerous massive graphs showing that FAST-PPR dramatically outperforms existing algorithms. For example, with target nodes sampled according to popularity, on the 2010 Twitter graph with 1.5 billion edges, FAST-PPR has a 20 factor speedup over the state of the art. Furthermore, an enhanced version of FAST-PPR has a 160 factor speedup on the Twitter graph, and is at least 20 times faster on all our candidate graphs.

Network data is ubiquitous and growing, yet we lack realistic generative models that can be calibrated to match real-world data. The recently proposed Block Two-Level Erdos-Renyi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop implementation for a modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.

Graphs and networks are used to model interactions in a variety of contexts. There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual friends. This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for large-scale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with extremely massive graphs. We show results on publicly-available networks, the largest of which is 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 sec. per million edges plus overhead (approximately 225 sec. total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and non-social networks. To the best of our knowledge, these are the largest triangle-based graph computations published to date.

Listing all triangles is a fundamental graph operation. Triangles can have important interpretations in real-world graphs, especially social and other interaction networks. Despite the lack of provably efficient (linear, or slightly super-linear) worst-case algorithms for this problem, practitioners run simple, efficient heuristics to find all triangles in graphs with millions of vertices. How are these heuristics exploiting the structure of these special graphs to provide major speedups in running time? We study one of the most prevalent algorithms used by practitioners. A trivial algorithm enumerates all paths of length $2$, and checks if each such path is incident to a triangle. A good heuristic is to enumerate only those paths of length $2$ where the middle vertex has the lowest degree. It is easily implemented and is empirically known to give remarkable speedups over the trivial algorithm. We study the behavior of this algorithm over graphs with heavy-tailed degree distributions, a defining feature of real-world graphs. The erased configuration model (ECM) efficiently generates a graph with asymptotically (almost) any desired degree sequence. We show that the expected running time of this algorithm over the distribution of graphs created by the ECM is controlled by the $\ell_{4/3}$-norm of the degree sequence. Norms of the degree sequence are a measure of the heaviness of the tail, and it is precisely this feature that allows non-trivial speedups of simple triangle enumeration algorithms. As a corollary of our main theorem, we prove expected linear-time performance for degree sequences following a power law with exponent $\alpha \geq 7/3$, and non-trivial speedup whenever $\alpha \in (2,3)$.

@INPROCEEDINGS{BeFoNo+14,
  author = {J. Berry and L. Fostvedt and D. Nordman and C. Phillips and C. Seshadhri and A. Wilson},
  title = {Why do Simple Algorithms for Triangle Enumeration work in the Real World?},
  booktitle = {Proceedings of the Innovations in Theoretical Computer Science (ITCS)},
  year = {2014}
}

High triangle density -- the graph property stating that most two-hop paths belong to a triangle -- is a common signature of social networks. This paper studies triangle-dense graphs from a structural perspective. We prove constructively that most of the content of a triangle-dense graph is contained in a disjoint union of radius 2 dense subgraphs. This result quantifies the extent to which triangle-dense graphs resemble unions of cliques. We also show that our algorithm recovers planted clusterings in stable k-median instances.

First impressions from initial renderings of data are crucial for directing further exploration and analysis. In most visualization systems, default colormaps are generated by simply linearly interpolating color in some space based on a value's placement between the minimum and maximum taken on by the dataset. We design a simple sampling-based method for generating colormaps that highlights important features. We use random sampling to determine the distribution of values observed in the data. The sample size required is independent of the dataset size and only depends on certain accuracy parameters. This leads to a computationally cheap and robust algorithm for colormap generation. Our approach (1) uses perceptual color distance to produce palettes from color curves, (2) allows the user to either emphasize or de-emphasize prominent values in the data, (3) uses quantiles to map distinct colors to values based on their frequency in the dataset, and (4) supports the highlighting of either inter- or intra-mode variations in the data.

For positive integers $n, d$, consider the hypergrid $[n]^d$ with the coordinate-wise product partial ordering denoted by $\prec$. A function $f: [n]^d \mapsto \mathbb{N}$ is monotone if $\forall x \prec y$, $f(x) \leq f(y)$. A function $f$ is $\eps$-far from monotone if at least an $\eps$-fraction of values must be changed to make $f$ monotone. Given a parameter $\eps$, a \emph{monotonicity tester} must distinguish with high probability a monotone function from one that is $\eps$-far. We prove that any (adaptive, two-sided) monotonicity tester for functions $f:[n]^d \mapsto \mathbb{N}$ must make $\Omega(\eps^{-1}d\log n - \eps^{-1}\log \eps^{-1})$ queries. Recent upper bounds show the existence of $O(\eps^{-1}d \log n)$ query monotonicity testers for hypergrids. This closes the question of monotonicity testing for hypergrids over arbitrary ranges. The previous best lower bound for general hypergrids was a non-adaptive bound of $\Omega(d \log n)$.

@INPROCEEDINGS{ChSe13-2,
  author = {D. Chakrabarty and C. Seshadhri},
  title = {An optimal lower bound for monotonicity testing over hypergrids},
  booktitle = {{Proceedings of the International Workshop on Randomization and Computation (RANDOM)}},
  year = {2013},
  owner = {scomand},
  timestamp = {2013.03.14}
}

Modern massive graphs such as the web and communication networks are truly \emph{dynamic graphs}. The graph is not a static object, but is most accurately characterized as a \emph{stream of edges}. How does one store a small sample of the past and still answer meaningful questions about the entire graph? This is the aim of a sublinear space streaming algorithm. We study the problem of counting the number of triangles (or estimating the clustering coefficient) in a massive streamed graph. We provide a $O(\sqrt{n})$-space algorithm that provably approximates the clustering coefficient with only a single pass through a graph of $n$ vertices. Our procedure is based on the classic probabilistic result, \emph{the birthday paradox}. While streaming algorithms for triangle counting have been designed before, none of these provide comparable theoretical (or empirical) results. Using some probabilistic heuristics, we design a practical algorithm based on these ideas. Our algorithm can compute accurate estimates for the clustering coefficient and number of triangles for a graph using only a single pass over the edges. The storage of the algorithm is a tiny fraction of the graph. For example, even for a graph with 200 million edges, our algorithm stores just 50,000 edges to give accurate results. Being a single pass streaming algorithm, our procedure also maintains \emph{a real-time estimate} of the clustering coefficient/number of triangles of a graph, by storing a miniscule fraction of edges.

@INPROCEEDINGS{JhSePi13,
  author = {Jha, Madhav and Seshadhri, C. and Pinar, Ali},
  title = {A space efficient streaming algorithm for triangle counting using
	the birthday paradox},
  booktitle = {Proceedings of the 19th ACM SIGKDD international conference on Knowledge
	discovery and data mining},
  year = {2013},
  series = {KDD '13},
  pages = {589--597}
}

Degree distributions are arguably the most important property of real world networks. The classic edge configuration model or Chung-Lu model can generate an undirected graph with any desired degree distribution. This serves as an excellent null model to compare algorithms or perform experimental studies that distinguish real networks from this baseline. Furthermore, there are scalable algorithms that implement these models and they are invaluable in the study of graphs. However, networks in the real-world are often directed, and have a significant proportion of reciprocal edges. A stronger relation exists between two nodes when they each point to one another (reciprocal edge) as compared to when only one points to the other (one-way edge). Despite their importance, reciprocal edges have been disregarded by most directed graph models. We propose a null model for directed graphs inspired by the Chung-Lu model that matches the in-, out-, and reciprocal-degree distributions of the real graphs. Our algorithm is scalable and requires O(m) random numbers to generate a graph with m edges. We perform a series of experiments on real datasets and compare with existing graph models.

Given oracle access to a Boolean function $f:\{0,1\}^n \mapsto \{0,1\}$, we design a randomized tester that takes as input a parameter $\eps>0$, and outputs {\sf Yes} if the function is monotone, and outputs {\sf No} with probability $>2/3$, if the function is $\eps$-far from monotone. That is, $f$ needs to be modified at $\eps$-fraction of the points to make it monotone. Our non-adaptive, one-sided tester makes $\widetilde{O}(n^{5/6}\eps^{-5/3})$ queries to the oracle.

The problem of monotonicity testing over the hypergrid and its special case, the hypercube, is a classic, well-studied, yet unsolved question in property testing. We are given query access to $f:[k]^n \mapsto \R$ (for some ordered range $\R$). The hypergrid/cube has a natural partial order given by coordinate-wise ordering, denoted by $\prec$. A function is \emph{monotone} if for all pairs $x \prec y$, $f(x) \leq f(y)$. The distance to monotonicity, $\eps_f$, is the minimum fraction of values of $f$ that need to be changed to make $f$ monotone. For $k=2$ (the boolean hypercube), the usual tester is the \emph{edge tester}, which checks monotonicity on adjacent pairs of domain points. It is known that the edge tester using $O(\eps^{-1}n\log|\R|)$ samples can distinguish a monotone function from one where $\eps_f > \eps$. On the other hand, the best lower bound for monotonicity testing over the hypercube is $\min(|\R|^2,n)$. This leaves a quadratic gap in our knowledge, since $|\R|$ can be $2^n$. We resolve this long standing open problem and prove that $O(n/\eps)$ samples suffice for the edge tester. For hypergrids, known testers require $O(\eps^{-1}n\log k\log |\R|)$ samples, while the best known (non-adaptive) lower bound is $\Omega(\eps^{-1} n\log k)$. We give a (non-adaptive) monotonicity tester for hypergrids running in $O(\eps^{-1} n\log k)$ time. Our techniques lead to optimal property testers (with the same running time) for the natural \emph{Lipschitz property} on hypercubes and hypergrids. (A $c$-Lipschitz function is one where $|f(x) - f(y)| \leq c\|x-y\|_1$.) In fact, we give a general unified proof for $O(\eps^{-1}n\log k)$-query testers for a class of ``bounded-derivative" properties, a class containing both monotonicity and Lipschitz.

there is a growing need to quickly assess the structure of a graph. Some of the most useful graph metrics, especially those measuring social cohesion, are based on triangles. Despite the importance of these triadic measures, associated algorithms can be extremely expensive. We discuss the method of wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and enables rapid uniform sampling of the triangles of a graph. Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the state-of-the-art, while providing nearly the accuracy of full enumeration. Our results will enable more wide-scale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several real-world examples.

Approximating the length of the longest increasing sequence (LIS) of a data stream is a well-studied problem. There are many algorithms that estimate the size of the complement of the LIS, referred to as the \emph{distance to monotonicity}, both in the streaming and property testing setting. Let $n$ denote the size of an input array. Our aim is to develop a one-pass streaming algorithm that accurately approximates the distance to monotonicity, and only uses polylogarithmic storage. For any $\delta > 0$, our algorithm provides a $(1+\delta)$-multiplicative approximation for the distance, and uses only $O((\log^2 n)/\delta)$ space. The previous best known approximation using poly-logarithmic space was a multiplicative 2-factor. Our algorithm is simple and natural, being just 3 lines of pseudocode. It is essentially a polylogarithmic space implementation of a classic dynamic program that computes the LIS. Our technique is more general and is applicable to other problems that are exactly solvable by dynamic programs. We are able to get a streaming algorithm for the longest common subsequence problem (in the asymmetric setting introduced in \cite{AKO10}) whose space is small on instances where no symbol appears very many times. Consider two strings (of length $n$) $x$ and $y$. The string $y$ is known to us, and we only have streaming access to $x$. The size of the complement of the LCS is the edit distance between $x$ and $y$ with only insertions and deletions. If no symbol occurs more than $k$ times in $y$, we get a $O(k(\log^2 n)/\delta)$-space streaming algorithm that provides a $(1+\delta)$-multiplicative approximation for the LCS complement. In general, we also provide a deterministic 1-pass streaming algorithm that outputs a $(1+\delta)$-multiplicative approximation for the LCS complement and uses $O(\sqrt{(n\log n)/\delta})$ space.

We present sublinear-time (randomized) algorithms for finding simple cycles of length at least $k\geq 3$ and tree-minors in bounded-degree graphs. The complexity of these algorithms is related to the distance of the graph from being $C_k$-minor-free (resp., free from having the corresponding tree-minor). In particular, if the graph is far (i.e., $\Omega(1)$-far) {from} being cycle-free, i.e. if one has to delete a constant fraction of edges to make it cycle-free, then the algorithm finds a cycle of polylogarithmic length in time $\tildeO(\sqrt{N})$, where $N$ denotes the number of vertices. This time complexity is optimal up to polylogarithmic factors. The foregoing results are the outcome of our study of the complexity of {\em one-sided error} property testing algorithms in the bounded-degree graphs model. For example, we show that cycle-freeness of $N$-vertex graphs can be tested with one-sided error within time complexity $\tildeO(\poly(1/\e)\cdot\sqrt{N})$. This matches the known $\Omega(\sqrt{N})$ query lower bound, and contrasts with the fact that any minor-free property admits a {\em two-sided error} tester of query complexity that only depends on the proximity parameter $\e$. For any constant $k\geq3$, we extend this result to testing whether the input graph has a simple cycle of length at least~$k$. On the other hand, for any fixed tree $T$, we show that $T$-minor-freeness has a one-sided error tester of query complexity that only depends on the proximity parameter $\e$. Our algorithm for finding cycles in bounded-degree graphs extends to general graphs, where distances are measured with respect to the actual number of edges. Such an extension is not possible with respect to finding tree-minors in $o(\sqrt{N})$ complexity.

Triangles are an important building block and distinguishing feature of real-world networks, but their structure is still poorly understood. Despite numerous reports on the abundance of triangles, there is very little information on what these triangles look like. We initiate the study of \emph{degree-labeled triangles} --- specifically, degree homogeneity versus heterogeneity in triangles. This yields new insight into the structure of real-world graphs. We observe that networks coming from social and collaborative situations are dominated by homogeneous triangles, i.e., degrees of vertices in a triangle are quite similar to each other. On the other hand, information networks (e.g., web graphs) are dominated by heterogeneous triangles, i.e., the degrees in triangles are quite disparate. Surprisingly, nodes within the top 1\% of degrees participate in the \emph{vast majority} of triangles in heterogeneous graphs. We investigate whether current graph models reproduce the types of triangles that are observed in real data and observe that most models fail to accurately capture these salient features.

The communities of a social network are sets of vertices with more connections inside the set than outside. We theoretically demonstrate that two commonly observed properties of social networks, heavy-tailed degree distributions and large clustering coefficients, imply the existence of vertex neighborhoods (also known as egonets) that are themselves good communities. We evaluate these neighborhood communities on a range of graphs. What we find is that the neighborhood communities can exhibit conductance scores that are as good as the Fiedler cut. Also, the conductance of neighborhood communities shows similar behavior as the network community profile computed with a personalized PageRank community detection method. Neighborhood communities give us a simple and powerful heuristic for speeding up local partitioning methods. Since finding good seeds for the PageRank clustering method is difficult, most approaches involve an expensive sweep over a great many starting vertices. We show how to use neighborhood communities to quickly generate a small set of seeds.

Computing the coordinate-wise maxima of a planar point set is a classic and well-studied problem in computational geometry. We give an algorithm for this problem in the \emph{self-improving setting}. We have $n$ (unknown) independent distributions $\cD_1, \cD_2, \ldots, \cD_n$ of planar points. An input pointset $(p_1, p_2, \ldots, p_n)$ is generated by taking an independent sample $p_i$ from each $\cD_i$, so the input distribution $\cD$ is the product $\prod_i \cD_i$. A self-improving algorithm repeatedly gets input sets from the distribution $\cD$ (which is \emph{a priori} unknown) and tries to optimize its running time for $\cD$. Our algorithm uses the first few inputs to learn salient features of the distribution, and then becomes an optimal algorithm for distribution $\cD$. Let $\OPT_\cD$ denote the expected depth of an \emph{optimal} linear comparison tree computing the maxima for distribution $\cD$. Our algorithm eventually has an expected running time of $O(\text{OPT}_\cD + n)$, even though it did not know $\cD$ to begin with. Our result requires new tools to understand linear comparison trees for computing maxima. We show how to convert general linear comparison trees to very restricted versions, which can then be related to the running time of our algorithm. An interesting feature of our algorithm is an interleaved search, where the algorithm tries to determine the likeliest point to be maximal with minimal computation. This allows the running time to be truly optimal for the distribution $\cD$.

Community structure plays a significant role in the analysis of social networks and similar graphs, yet this structure is little understood and not well captured by most models. We formally define a community to be a subgraph that is internally highly connected and has no deeper substructure. We use tools of combinatorics to show that any such community must contain a dense Erd\H{o}s--R\'enyi~(ER) subgraph. Based on mathematical arguments, we hypothesize that any graph with a heavy-tailed degree distribution and community structure must contain a scale free collection of dense ER subgraphs. These theoretical observations corroborate well with empirical evidence. From this, we propose the Block Two-Level Erd\H{o}s--R\'enyi (BTER) model, and demonstrate that it accurately captures the observable properties of many real-world social networks.

@article{SKP11,
  author    = {C. Seshadhri and Tamara G. Kolda and Ali Pinar},
  title     = {Community structure and scale-free collections of Erd\H{o}s-R\'enyi graphs},
  journal   = {Physical Review E},
  year      = {2012},
  volume    = {85},
  issue     = {5},
  number    = {056109},
  url       = {http://pre.aps.org/abstract/PRE/v85/i5/e056109}
}

Markov chains are a convenient means of generating realizations of networks, since they require little more than a procedure for rewiring edges. If a rewiring procedure exists for generating new graphs with specified statistical properties, then a Markov chain sampler can generate an ensemble of graphs with prescribed characteristics. However, successive graphs in a Markov chain cannot be used when one desires independent draws from the distribution of graphs; the realizations are correlated. Consequently, one runs a Markov chain for N iterations before accepting the realization as an independent sample. In this work, we devise two methods for calculating N. They are both based on the binary "time-series" denoting the occurrence/non-occurrence of edge (u, v) between vertices u and v in the Markov chain of graphs generated by the sampler. They differ in their underlying assumptions. We test them on the generation of graphs with a prescribed joint degree distribution. We find the N proportional |E|, where |E| is the number of edges in the graph. The two methods are compared by sampling on real, sparse graphs with 10^3 - 10^4 vertices. B

The analysis of massive graphs is now becoming a very important part of science and industrial research. This has led to the construction of a large variety of graph models, each with their own advantages. The Stochastic Kronecker Graph (SKG) model has been chosen by the Graph500 steering committee to create supercomputer benchmarks for graph algorithms. The major reasons for this are its easy parallelization and ability to mirror real data. Although SKG is easy to implement, there is little understanding of the properties and behavior of this model.
We show that the parallel variant of the edge-configuration model given by Chung and Lu (referred to as CL) is notably similar to the SKG model. The graph properties of an SKG are extremely close to those of a CL graph generated with the appropriate parameters. Indeed, the final probability matrix used by SKG is almost identical to that of a CL model. This implies that the graph distribution represented by SKG is almost the same as that given by a CL model. We also show that when it comes to fitting real data, CL performs as well as SKG based on empirical studies of graph properties. CL has the added benefit of a trivially simple fitting procedure and exactly matching the degree distribution. Our results suggest that users of the SKG model should consider the CL model because of its similar properties, simpler structure, and ability to fit a wider range of degree distributions. At the very least, CL is a good control model to compare against.

Graph analysis is playing an increasingly important role in science and industry. Due to numerous limitations in sharing real-world graphs, models for generating massive graphs are critical for developing better algorithms. In this paper, we analyze the stochastic Kronecker graph model (SKG), which is the foundation of the Graph500 supercomputer benchmark due to its favorable properties and easy parallelization. Our goal is to provide a deeper understanding of the parameters and properties of this model so that its functionality as a benchmark is increased. We develop a rigorous mathematical analysis that shows this model cannot generate a power-law distribution or even a lognormal distribution. However, we formalize an enhanced version of the SKG model that uses random noise for smoothing. We prove both in theory and in practice that this enhancement leads to a lognormal distribution. Additionally, we provide a precise analysis of isolated vertices, showing that the graphs that are produced by SKG might be quite different than intended. For example, between 50% and 75% of the vertices in the Graph500 benchmarks will be isolated. Finally, we show that this model tends to produce extremely small core numbers (compared to most social networks and other real graphs) for common parameter choices. These results have been communicated to the Graph500 steering committee, and they are currently using our theorems to set the benchmark parameters. Our noisy version of SKG is likely to become the Graph500 standard for next year's benchmark.

We present a rigorous mathematical framework for analyzing dynamics of a broad class of Boolean network models. We use this framework to provide the first formal proof of many of the standard critical transition results in Boolean network analysis, and offer analogous characterizations for novel classes of random Boolean networks. We show that some of the assumptions traditionally made in the more common mean-field analysis of Boolean networks do not hold in general. For example, we offer evidence that imbalance (internal inhomogeneity) of transfer functions is a crucial feature that tends to drive quiescent behavior far more strongly than previously observed.

Let C be a depth-3 circuit with n variables, degree d and top fanin k (called sps(k,d,n) circuits) over base field F. It is a major open problem to design a deterministic polynomial time blackbox algorithm that tests if C is identically zero. Klivans & Spielman (STOC 2001) observed that the problem is open even when k is a constant. This case has been subjected to a serious study over the past few years, starting from the work of Dvir & Shpilka (STOC 2005). We give the first polynomial time blackbox algorithm for this problem. Our algorithm runs in time poly(nd^k), regardless of the base field. The *only* field for which polynomial time algorithms were previously known is F=Q (Kayal & Saraf, FOCS 2009, and Saxena & Seshadhri, FOCS 2010). This is the first blackbox algorithm for depth-3 circuits that does not use the rank based approaches of Karnin & Shpilka (CCC 2009). We prove an important tool for the study of depth-3 identities. We design a blackbox polynomial time transformation that reduces the number of variables in a sps(k,d,n) circuit to k variables, but preserves the identity structure.

We give the first combinatorial approximation algorithm for Maxcut that beats the trivial 0.5 factor by a constant. The main partitioning procedure is very intuitive, natural, and easily described. It essentially performs a number of random walks and aggregates the information to provide the partition. We can control the running time to get an approximation factor-running time tradeoff. We show that for any constant b > 1.5, there is an O(n^{b}) algorithm that outputs a (0.5+delta)-approximation for Maxcut, where delta = delta(b) is some positive constant. One of the components of our algorithm is a weak local graph partitioning procedure that may be of independent interest. Given a starting vertex $i$ and a conductance parameter phi, unless a random walk of length ell = O(log n) starting from i mixes rapidly (in terms of phi and ell), we can find a cut of conductance at most phi close to the vertex. The work done per vertex found in the cut is sublinear in n.

We initiate the study of property testing of submodularity on the boolean hypercube. Submodular functions come up in a variety of applications in combinatorial optimization. For a vast range of algorithms, the existence of an oracle to a submodular function is assumed. But how does one check if this oracle indeed represents a submodular function? Consider a function f:{0,1}^n \rightarrow R. The distance to submodularity is the minimum fraction of values of $f$ that need to be modified to make f submodular. If this distance is more than epsilon > 0, then we say that f is epsilon-far from being submodular. The aim is to have an efficient procedure that, given input f that is epsilon-far from being submodular, certifies that f is not submodular. We analyze a very natural tester for this problem, and prove that it runs in subexponential time. This gives the first non-trivial tester for submodularity. On the other hand, we prove an interesting lower bound (that is, unfortunately, quite far from the upper bound) suggesting that this tester cannot be very efficient in terms of epsilon. This involves non-trivial examples of functions which are far from submodular and yet do not exhibit too many local violations. We also provide some constructions indicating the difficulty in designing a tester for submodularity. We construct a partial function defined on exponentially many points that cannot be extended to a submodular function, but any strict subset of these values can be extended to a submodular function.

Finding the length of the longest increasing subsequence (LIS) is a classic algorithmic problem. Let $n$ denote the size of the array. Simple $O(n\log n)$ algorithms are known for this problem. What can a sublinear time algorithm achieve? We develop a polylogarithmic time randomized algorithm that for any constant $\delta > 0$, estimates the length of the LIS of an array upto an additive error of $\delta n$. More precisely, the running time of the algorithm is $(\log n)^c (1/\delta)^{O(1/\delta)}$ where the exponent $c$ is independent of $\delta$. Previously, the best known polylogarithmic time algorithms could only achieve an additive $n/2$ approximation. The \emph{distance to monotonicity}, which is the fractional size of the complement of the LIS, has been studied in depth by the streaming and property testing communities. Our polylogarithmic algorithm actually has a stronger guarantee, and gives (for any constant $\delta > 0$) a multiplicative $(1+\delta)$-approximation to the distance to monotonicity. This is a significant improvement over the previously known $2$-factor approximations.

We study the problem of identity testing for depth-3 circuits of top fanin k and degree d. We give a new structure theorem for such identities. A direct application of our theorem improves the known deterministic d^{k^k}-time black-box identity test over rationals (Kayal-Saraf, FOCS 2009) to one that takes d^{k^2}-time. Our structure theorem essentially says that the number of independent variables in a real depth-3 identity is very small. This theorem settles affirmatively the stronger rank conjectures posed by Dvir-Shpilka (STOC 2005) and Kayal-Saraf (FOCS 2009). Our techniques provide a unified framework that actually beats all known rank bounds and hence gives the best running time (for *every* field) for black-box identity tests. Our main theorem (almost optimally) pins down the relation between higher dimensional Sylvester-Gallai theorems and the rank of depth-3 identities in a very transparent manner. The existence of this was hinted at by Dvir-Shpilka (STOC 2005), but first proven, for reals, by Kayal-Saraf (FOCS 2009). We introduce the concept of Sylvester-Gallai rank bounds for any field, and show the intimate connection between this and depth-3 identity rank bounds. We also prove the first ever theorem about high dimensional Sylvester-Gallai configurations over *any* field. Our proofs and techniques are very different from previous results and devise a very interesting ensemble of combinatorics and algebra. The latter concepts are ideal theoretic and involve a new Chinese remainder theorem. Our proof methods explain the structure of *any* depth-3 identity C: there is a nucleus of C that forms a low rank identity, while the remainder is a high dimensional Sylvester-Gallai configuration.

@inproceedings{SaxS10,
  author    = {N. Saxena and C. Seshadhri},
  title     = {From Sylvester-Gallai Configurations to Rank Bounds: Improved Black-box Identity Test for Depth-3 Circuits},
  booktitle = {Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)},
  year      = {2010},
  pages     = {21--29},
}

We describe an algorithm for computing planar convex hulls in the self-improving model: given a sequence $I_1, I_2, \ldots$ of planar $n$-point sets, the upper convex hull $conv(I)$ of each set $I$ is desired. We assume that there exists a probability distribution $D$ on $n$-point sets, such that the inputs $I_j$ are drawn independently according to $D$. Furthermore, $D$ is such that the individual points are distributed independently of each other. In other words, the $i$'th point is distributed according to $D_i$. The $D_i$'s can be arbitrary but are independent of each other. The distribution $D$ is not known to the algorithm in advance. After a learning phase of $n^\eps$ rounds, the expected time to compute $conv(I)$ is $O(n + H(conv(I)))$. Here, $H(conv(I))$ is the entropy of the output, which is a lower bound for the expected running time of \emph{any} algebraic computation tree that computes the convex hull. (More precisely, $H(conv(I))$ is the minimum entropy of any random variable that maps $I$ to a description of $conv(I)$ and to a labeling scheme that proves nonextremality for every point in $I$ not on the hull.) Our algorithm is thus asymptotically optimal for $D$.

We study online learning in an oblivious changing environment. The standard measure of regret bounds the difference between the cost of the online learner and the best decision in hindsight. Hence, regret minimizing algorithms tend to converge to the static best optimum, clearly a suboptimal behavior in changing environments. On the other hand, various metrics proposed to strengthen regret and allow for more dynamic algorithms produce inefficient algorithms. We propose a different performance metric which strengthens the standard metric of regret and measures performance with respect to a changing comparator. We then describe a series of data-streaming-based reductions which transform algorithms for minimizing (standard) regret into adaptive algorithms albeit incurring only poly-logarithmic computational overhead. Using this reduction, we obtain {\it efficient} low adaptive-regret algorithms for the problem of online convex optimization. This can be applied to various learning scenarios, i.e. online portfolio selection, for which we describe experimental results showing the advantage of adaptivity.

We show that the rank of a depth-$3$ circuit (over any field) that is simple, minimal and zero is at most $O(k^3\log d)$. The previous best rank bound known was $2^{O(k^2)}(\log d)^{k-2}$ by Dvir and Shpilka (STOC 2005). This almost resolves the rank question first posed by Dvir and Shpilka (as we also provide a simple and minimal identity of rank $\Omega(k\log d)$). Our rank bound significantly improves (dependence on $k$ exponentially reduced) the best known deterministic black-box identity tests for depth-$3$ circuits by Karnin and Shpilka (CCC 2008). Our techniques also shed light on the factorization pattern of nonzero depth-$3$ circuits, most strikingly: the rank of linear factors of a simple, minimal and nonzero depth-$3$ circuit (over any field) is at most $O(k^3\log d)$. The novel feature of this work is a new notion of maps between sets of linear forms, called \emph{ideal matchings}, used to study depth-$3$ circuits. We prove interesting structural results about depth-$3$ identities using these techniques. We believe that these can lead to the goal of a deterministic polynomial time identity test for these circuits.

We consider the problem of \emph{online sublinear expander reconstruction} and its relation to random walks in ``noisy" expanders. Given access to an adjacency list representation of a bounded-degree graph $G$, we want to convert this graph into a bounded-degree expander $G'$ changing $G$ as little as possible. The graph $G'$ will be output by a \emph{distributed filter}: this is a sublinear time procedure that given a query vertex, outputs all its neighbors in $G'$, and can do so even in a distributed manner, ensuring consistency in all the answers. One of the main tools in our analysis is a result on the behavior of random walks in graph that are almost expanders: graphs that are formed by arbitrarily connecting a small unknown graph (the noise) to a large expander. We show that a random walk from almost any vertex in the expander part will have fast mixing properties, in the general setting of irreducible finite Markov chains. We also design sublinear time procedures to distinguish vertices of the expander part from those in the noise part, and use this procedure in the reconstruction algorithm.

We consider the problem of testing graph expansion (either vertex or edge) in the bounded degree model of Goldreich and Ron. We give a property tester that given a graph with degree bound $d$, an expansion bound $\alpha$, and a parameter $\epsilon > 0$, accepts the graph with high probability if its expansion is more than $\alpha$, and rejects it with high probability if it is $\epsilon$-far from any graph with expansion $\alpha'$ with degree bound $d$, where $\alpha' < \alpha$ is a function of $\alpha$. For edge expansion, we obtain $\alpha' = \Omega(\frac{\alpha^2}{d})$, and for vertex expansion, we obtain $\alpha' = \Omega(\frac{\alpha^2}{d^2})$. In either case, the algorithm runs in time $\tilde{O}(\frac{n^{(1+\mu)/2}d^2}{\epsilon\alpha^2})$ for any given constant $\mu > 0$.

We study the problem of two-dimensional Delaunay triangulation in the self-improving algorithms model. We assume that the $n$ points of the input each come from an independent, unknown, and arbitrary distribution. The first phase of our algorithm builds data structures that store relevant information about the input distribution. The second phase uses these data structures to efficiently compute the Delaunay triangulation of the input. The running time of our algorithm matches the information-theoretic lower bound for the given input distribution, implying that if the input distribution has low entropy, then our algorithm beats the standard $\Omega(n\log n)$ bound for computing Delaunay triangulations. Our algorithm uses a host of techniques: $\epsilon$-net construction for disks, optimal expected-case point-location structures, ideas from randomized incremental construction, linear-time Delaunay splitting algorithms, and information-theoretic arguments.

We investigate the problem of monotonicity reconstruction, as defined by Ailon et al (ISAAC 2004), in a distributed setting. We have oracle access to a nonnegative real-valued function $f$ defined on domain $[n]^d=\{1,\ldots,n\}^d$ ($d$ is considered to be a constant). We would like to closely approximate $f$ by a monotone function $g$. This should be done by a procedure (a {\em filter}) that given as input a point $x \in [n]^d$ outputs the value of $g(x)$, and runs in time that is highly sublinear in $n$. The procedure can (indeed must) be randomized, but we require that all of the randomness be specified in advance by a single short random seed. We construct such an implementation where the time and space per query is $(\log n)^{O(1)}$ and the size of the seed is polynomial in $\log n$ and $d$. Furthermore the distance of the approximating function $g$ from $f$ is at most a constant (depending on $d$) multiple of the minimum distance of any monotone function from $f$. This allows for a distributed implementation: one can initialize many copies of the filter with the same short random seed, and they can autonomously handle queries, while producing outputs that are consistent with the same approximating function $g$.

We investigate a new class of geometric problems based on the idea of online error correction. Suppose one is given access to a large geometric dataset though a query mechanism; for example, the dataset could be a terrain and a query might ask for the coordinates of a particular vertex or for the edges incident to it. Suppose, in addition, that the dataset satisfies some known structural property ${\cal P}$ (eg, monotonicity or convexity) but that, because of errors and noise, the queries occasionally provide answers that violate ${\cal P}$. Can one design a filter that modifies the query's answers so that (i) the output satisfies ${\cal P}$; (ii) the amount of data modification is minimized? We provide upper and lower bounds on the complexity of online reconstruction for convexity in 2D and 3D.

We show in this paper that the BGS model of abstract state machines can be simulated by random access machines with at most a polynomial time overhead. This result is already stated in [BGS99] with a very brief proof sketch. The present paper gives a detailed proof of the result. We represent hereditarily finite sets, which are the typical BGS ASM objects, by membership graphs of the transitive closure of the sets. Testing for equality between BGS objects can be done in linear time in our representation.

We initiate a new line of investigation into {\em online property-preserving data reconstruction.} Consider a dataset which is assumed to satisfy various (known) structural properties; eg, it may consist of sorted numbers, or points on a manifold, or vectors in a polyhedral cone, or codewords from an error-correcting code. Because of noise and errors, however, an (unknown) fraction of the data is deemed {\em unsound}, ie, in violation with the expected structural properties. Can one still query into the dataset in an online fashion and be provided data that is {\em always} sound? In other words, can one design a filter which, when given a query to any item $I$ in the dataset, returns a {\em sound} item $J$ that, although not necessarily in the dataset, differs from $I$ as infrequently as possible. No preprocessing should be allowed and queries should be answered online. We consider the case of a monotone function. Specifically, the dataset encodes a function $f:\{1,\ldots, n\}\,\mapsto {\bf R}$ that is at (unknown) distance $\eps$ from monotone, meaning that $f$ can---and must---be modified at $\eps n$ places to become monotone. Our main result is a randomized filter that can answer any query in $O(\log^2 n\log\log n)$ time while modifying the function $f$ at only $O(\eps n)$ places. The amortized time over $n$ function evaluations is $O(\log n)$. %With probability arbitrarily close to 1, %the filter never errs at any query. The filter works as stated with probability arbitrarily close to $1$. We provide an alternative filter with $O(\log n)$ worst case query time and $O(\eps n \log n)$ function modifications. For reconstructing $d$-dimensional monotone functions of the form $f:\{1,\ldots, n\}^d\,\mapsto {\bf R}$, we present a filter that takes $(2^{O(d)}(\log n)^{4d-2}\log\log n)$ time per query and modifies at most $O(\eps n^d)$ function values (for constant $d$).

In standard property testing, the task is to distinguish between objects that have a property $\mathcal{P}$ and those that are $\eps$-far from $\mathcal{P}$, for some $\eps > 0$. In this setting, it is perfectly acceptable for the tester to provide a negative answer for every input object that does not satisfy $\mathcal{P}$. This implies that property testing in and of itself cannot be expected to yield any information whatsoever about the distance from the object to the property. We address this problem in this paper, restricting our attention to monotonicity testing. A function $f:\{1,\ldots, n\}\,\mapsto {\bf R}$ is at distance $\eps_f$ from being monotone if it can (and must) be modified at $\eps_f n$ places to become monotone. For any fixed $\delta>0$, we compute, with probability at least $2/3$, an interval $[(1/2 -\delta)\eps, \eps]$ that encloses $\eps_f$. The running time of our algorithm is $O( \eps_f^{-1}\log\log \eps_f^{-1} \log n )$, which is optimal within a factor of $\log\log \eps_f^{-1}$ and represents a substantial improvement over previous work. We give a second algorithm with an expected running time of $O( \eps_f^{-1}\log n \log\log\log n)$. Finally, we extend our results to multivariate functions.

Tutorials/notes

Refereed Conference and Journal Papers