Tutorials/notes
• Some vignettes on subgraph counting using graph orientations
C. Seshadhri
International Conference on Database Theory (ICDT) 2023, Invited Tutorial
abstract
    Paper
    Slides
Subgraph counting is a fundamental problem that spans many areas in computer science: database theory, logic, network science, data mining, and complexity theory. Given a large input graph G and a small pattern graph H, we wish to count the number of occurrences of H in G. In recent times, there has been a resurgence in the use of an old (and perhaps overlooked) technique: orienting the edges of G and H, and then using a combination of brute-force enumeration and indexing. These orientation techniques appear to give the best of both worlds. There is a rigorous theoretical explanation behind them, and they also have excellent empirical behavior (on large real-world graphs). Time and again, graph orientations help solve subgraph counting problems in various computational models, be it sampling, streaming, distributed, etc. In this paper, we give some short vignettes on how the orientation technique solves a variety of algorithmic problems.
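As a concrete illustration of the orientation technique (a minimal sketch in the spirit of the paper, not code from it; all names are ours), consider triangle counting: orient every edge from lower to higher degree, so that each triangle is enumerated exactly once from its lowest-ranked vertex.

    # Sketch: triangle counting via a degree-based acyclic orientation.
    # On sparse real-world graphs the out-degrees stay small, which is the
    # source of the technique's efficiency.
    from itertools import combinations

    def count_triangles(edges):
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        rank = lambda v: (len(adj[v]), v)  # order by degree, break ties by id
        out = {v: {w for w in nbrs if rank(w) > rank(v)} for v, nbrs in adj.items()}
        count = 0
        for v in out:  # each triangle is found once, at its lowest-ranked vertex
            for u, w in combinations(out[v], 2):
                if w in out[u] or u in out[w]:
                    count += 1
        return count

    print(count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)]))  # -> 1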
• Tutorial on subgraph counting
C. Seshadhri and Srikanta Tirthapura
Conference on the World Wide Web (WWW) 2019
Tutorial website
Refereed Conference and Journal Papers
- • Theoretical bounds on the network community profile from low-rank semi-definite programming
- Yufan Huang, C. Seshadhri, and David Gleich
- International Conference on Machine Learning (ICML) 2023
- abstract
    arxiv:2303.14550
We study a new connection between a technical measure called μ-conductance, which arises in the study of Markov chains for sampling convex bodies, and the network community profile that characterizes size-resolved properties of clusters and communities in social and information networks. The idea of μ-conductance is similar to the traditional graph conductance, but disregards sets with small volume. We formulate a sequence of optimization problems, including a low-rank semi-definite program, from which we can derive a lower bound on the optimal μ-conductance value. These ideas give the first theoretically sound bound on the behavior of the network community profile for a wide range of cluster sizes. The algorithm scales up to graphs with hundreds of thousands of nodes, and we demonstrate how our framework validates the predicted structures of real-world graphs.
- • Directed Isoperimetric Theorems for Boolean Functions on the Hypergrid and an $\widetilde{O}(n\sqrt{d})$ Monotonicity Tester
- Hadley Black, Deeparnab Chakrabarty, and C. Seshadhri
- Symposium on Theory of Computing (STOC) 2023
- abstract
    arxiv:2211.05281
    ECCC
The problem of testing monotonicity for Boolean functions on the hypergrid, $f:[n]^d \to \{0,1\}$, is a classic topic in property testing. When $n=2$, the domain is the hypercube. For the hypercube case, a breakthrough result of Khot-Minzer-Safra (FOCS 2015) gave a non-adaptive, one-sided tester making $\widetilde{O}(\varepsilon^{-2}\sqrt{d})$ queries. Up to polylog $d$ and $\varepsilon$ factors, this bound matches the $\widetilde{\Omega}(\sqrt{d})$-query non-adaptive lower bound (Chen-De-Servedio-Tan (STOC 2015), Chen-Waingarten-Xie (STOC 2017)). For any $n > 2$, the optimal non-adaptive complexity was unknown. A previous result of the authors achieves a $\widetilde{O}(d^{5/6})$-query upper bound (SODA 2020), quite far from the $\sqrt{d}$ bound for the hypercube.
In this paper, we resolve the non-adaptive complexity of monotonicity testing for all constant $n$, up to $\mathrm{poly}(\varepsilon^{-1}\log d)$ factors. Specifically, we give a non-adaptive, one-sided monotonicity tester making $\widetilde{O}(\varepsilon^{-2}n\sqrt{d})$ queries. From a technical standpoint, we prove new directed isoperimetric theorems over the hypergrid $[n]^d$. These results generalize the celebrated directed Talagrand inequalities that were previously known only for the hypercube.
- • Classic Graph Structural Features Outperform Factorization-Based Graph Embedding Methods on Community Labeling
- Andrew Stolman, Caleb Levy, C. Seshadhri, and Aneesh Sharma
- SIAM Conference on Data Mining (SDM) 2022
- abstract
    arxiv:2201.08481
Graph representation learning (also called graph embeddings) is a popular technique for incorporating network structure into machine learning models. Unsupervised graph embedding methods aim to capture graph structure by learning a low-dimensional vector representation (the embedding) for each node. Despite the widespread use of these embeddings for a variety of downstream transductive machine learning tasks, there is little principled analysis of the effectiveness of this approach for common tasks. In this work, we provide an empirical and theoretical analysis for the performance of a class of embeddings on the common task of pairwise community labeling. This is a binary variant of the classic community detection problem, which seeks to build a classifier to determine whether a pair of vertices participate in a community. In line with our goal of foundational understanding, we focus on a popular class of unsupervised embedding techniques that learn low rank factorizations of a vertex proximity matrix (this class includes methods like GraRep, DeepWalk, node2vec, NetMF). We perform detailed empirical analysis for community labeling over a variety of real and synthetic graphs with ground truth. In all cases we studied, the models trained from embedding features perform poorly on community labeling. In contrast, a simple logistic model with classic graph structural features handily outperforms the embedding models. For a more principled understanding, we provide a theoretical analysis for the (in)effectiveness of these embeddings in capturing the community structure. We formally prove that popular low-dimensional factorization methods either cannot produce community structure, or can only produce ``unstable'' communities. These communities are inherently unstable under small perturbations.
- • FPT Algorithms for Finding Near-Cliques in c-Closed Graphs
- Balaram Behera, Edin Husić, Shweta Jain, Tim Roughgarden, and C. Seshadhri
- Innovations in Theoretical Computer Science (ITCS) 2022
- abstract
    arxiv:2007.09786
Finding large cliques or cliques missing a few edges is a fundamental algorithmic task in the
study of real-world graphs, with applications in community detection, pattern recognition, and clustering. A number of effective backtracking-based heuristics for these problems have emerged from recent empirical work in social network analysis.
Given the NP-hardness of variants of clique counting,
these results raise a challenge for \emph{beyond worst-case analysis} of these problems.
Inspired by the triadic closure of real-world graphs,
Fox et al. (SICOMP 2020) introduced the notion of $c$-closed graphs and proved
that maximal clique enumeration is fixed-parameter tractable with respect to $c$.
In practice, due to noise in data, one wishes to actually discover "near-cliques",
which can be characterized as cliques with a sparse subgraph removed.
In this work, we prove that many different kinds of maximal near-cliques
can be enumerated in polynomial time (and FPT in $c$) for $c$-closed graphs.
We study various established notions of such substructures, including $k$-plexes and complements of bounded-degeneracy and bounded-treewidth graphs.
Interestingly, our algorithms follow relatively simple backtracking procedures, analogous to what is done in practice. Our results underscore the significance of the $c$-closed graph class for theoretical understanding of social network analysis.
- • The complexity of testing all properties of planar graphs, and the role of isomorphism
- Sabyasachi Basu, Akash Kumar, C. Seshadhri
- Symposium on Discrete Algorithms (SODA) 2022
- abstract
    arxiv:2108.10547
    ECCC
Consider property testing on bounded degree graphs and let $\varepsilon > 0$ denote the proximity parameter. A remarkable theorem of Newman-Sohler (SICOMP 2013) asserts that all properties of planar graphs (more generally, hyperfinite graphs) are testable with query complexity depending only on $\varepsilon$. Recent advances in testing minor-freeness have proven that all additive and monotone properties of planar graphs can be tested in $\mathrm{poly}(\varepsilon^{-1})$ queries. Some properties falling outside this class, such as Hamiltonicity, also have a similar complexity for planar graphs. Motivated by these results, we ask: can all properties of planar graphs be tested in $\mathrm{poly}(\varepsilon^{-1})$ queries? Is there a uniform query complexity upper bound for all planar properties, and what is the ``hardest'' such property to test?
We discover a surprisingly clean and optimal answer. Any property of bounded degree planar graphs can be tested in $\exp(O(\varepsilon^{-2}))$ queries. Moreover, there is a matching lower bound, up to constant factors in the exponent. The natural property of testing isomorphism to a fixed graph requires $\exp(\Omega(\varepsilon^{-2}))$ queries, thereby showing that (up to polynomial dependencies) isomorphism to an explicit fixed graph is the hardest property of planar graphs. The upper bound is a straightforward adaptation of the Newman-Sohler analysis that tracks dependencies on $\varepsilon$ more carefully. The main technical contribution is the lower bound construction, which is achieved by a special family of planar graphs that are all mutually far from each other.
We can also apply our techniques to get analogous results for bounded treewidth graphs. We prove that all properties of bounded treewidth graphs can be tested in $\exp(O(\varepsilon^{-1}\log \varepsilon^{-1}))$ queries. Moreover, testing isomorphism to a fixed forest requires $\exp(\Omega(\varepsilon^{-1}))$ queries.
- • Random walks and forbidden minors III: $\mathrm{poly}(d/\varepsilon)$-time partition oracles for minor-free graph classes
- Akash Kumar, C. Seshadhri, and Andrew Stolman
- Foundations of Computer Science (FOCS) 2021
- abstract
    arxiv:2102.00556
    ECCC
Consider the family of bounded degree graphs in any minor-closed family (such as planar graphs). Let $d$ be the degree bound and $n$ be the number of vertices of such a graph. Graphs in these classes have hyperfinite decompositions, where, for a sufficiently small $\varepsilon > 0$, one removes $\varepsilon dn$ edges to get connected components of size independent of $n$. An important tool for sublinear algorithms and property testing for such classes is the partition oracle, introduced in the seminal work of Hassidim-Kelner-Nguyen-Onak (FOCS 2009).
A partition oracle is a local procedure that gives consistent access to a hyperfinite decomposition, without any preprocessing. Given a query vertex $v$, the partition oracle outputs the component containing $v$ in time independent of $n$. All the answers are consistent with a single hyperfinite decomposition. The partition oracle of Hassidim et al. runs in time $d^{\mathrm{poly}(d/\varepsilon)}$ per query. They pose the open problem of whether $\mathrm{poly}(d/\varepsilon)$-time partition oracles exist. Levi-Ron (ICALP 2013) give a refinement of the previous approach, obtaining a partition oracle that runs in time $d^{\log(d/\varepsilon)}$ per query.
In this paper, we resolve this open problem and give $\mathrm{poly}(d/\varepsilon)$-time partition oracles for bounded degree graphs in any minor-closed family. Unlike the previous line of work based on combinatorial methods, we employ techniques from spectral graph theory. We build on a recent spectral graph theoretic toolkit for minor-closed graph families, introduced by the authors to develop efficient property testers.
A consequence of our result is a $\mathrm{poly}(d/\varepsilon)$-query tester for any monotone and additive property of minor-closed families (such as bipartite planar graphs). Our result also gives $\mathrm{poly}(d/\varepsilon)$-query algorithms for additive $\varepsilon n$-approximations for problems such as maximum matching, minimum vertex cover, maximum independent set, and minimum dominating set for these graph families.
- • Faster and Generalized Temporal Triangle Counting, via Degeneracy Ordering
- Noujan Pashanasangi and C. Seshadhri
- SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2021
- abstract
    arxiv:2106.02762
Triangle counting is a fundamental technique in network analysis, that has received much attention in various input models. The vast majority of triangle counting algorithms are targeted to static graphs. Yet, many real-world graphs are directed and temporal, where edges come with timestamps. Temporal triangles yield much more information, since they account for both the graph topology and the timestamps.
Temporal triangle counting has seen a few recent results, but there are varying definitions of temporal triangles. In all cases, temporal triangle patterns enforce constraints on the time interval between edges (in the triangle). We define a general notion of $(\delta_{1,3}, \delta_{1,2}, \delta_{2,3})$-temporal triangles that allows for separate time constraints for all pairs of edges.
Our main result is a new algorithm, DOTTT (Degeneracy Oriented Temporal Triangle Totaler), that exactly counts all directed variants of $(\delta_{1,3}, \delta_{1,2}, \delta_{2,3})$-temporal triangles. Using the classic idea of degeneracy ordering with careful combinatorial arguments, we prove that DOTTT runs in $O(m\kappa \log m)$ time, where $m$ is the number of (temporal) edges and $\kappa$ is the graph degeneracy (max core number). Up to log factors, this matches the running time of the best static triangle counters. Moreover, this running time is better than that of existing temporal triangle counting algorithms.
DOTTT has excellent practical behavior and runs twice as fast as existing state-of-the-art temporal triangle counters (and is also more general). For example, DOTTT computes all types of temporal queries in Bitcoin temporal network with half a billion edges in less than an hour on a commodity machine.
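For intuition, here is a brute-force reference implementation of the counting problem itself (ours, and nothing like DOTTT's algorithm; the constraint semantics below are one natural reading of the definition, with the three edge timestamps sorted as $t_1 \le t_2 \le t_3$):

    # Brute force: count triangles of temporal edges whose sorted timestamps
    # t1 <= t2 <= t3 satisfy the three pairwise gap constraints.
    from itertools import combinations

    def count_temporal_triangles(temporal_edges, d13, d12, d23):
        count = 0
        for e1, e2, e3 in combinations(temporal_edges, 3):
            pairs = {frozenset((e[0], e[1])) for e in (e1, e2, e3)}
            if len(pairs) != 3 or len(set().union(*pairs)) != 3:
                continue  # the three edges must form a triangle
            t1, t2, t3 = sorted((e1[2], e2[2], e3[2]))
            if t2 - t1 <= d12 and t3 - t2 <= d23 and t3 - t1 <= d13:
                count += 1
        return count

    # edges are (u, v, timestamp); exactly one triangle meets the constraints
    print(count_temporal_triangles([(0, 1, 1), (1, 2, 2), (0, 2, 3), (0, 2, 50)], 10, 5, 5))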
- • Near-Linear Time Homomorphism Counting in Bounded Degeneracy Graphs: The Barrier of Long Induced Cycles
- Suman K. Bera, Noujan Pashanasangi, and C. Seshadhri
- Symposium on Discrete Algorithms (SODA) 2021
- abstract
    arxiv:2010.08083
Counting homomorphisms of a constant sized pattern graph H in an input graph G is a fundamental computational problem. There is a rich history of studying the complexity of this problem, under various constraints on the input G and the pattern H. Given the significance of this problem and the large sizes of modern inputs, we investigate when near-linear time algorithms are possible. We focus on the case when the input graph has bounded degeneracy, a commonly studied and practically relevant class for homomorphism counting. It is known from previous work that for certain classes of H, H-homomorphisms can be counted exactly in near-linear time in bounded degeneracy graphs. Can we precisely characterize the patterns H for which near-linear time algorithms are possible?
We completely resolve this problem, discovering a clean dichotomy using fine-grained complexity. Let $m$ denote the number of edges in G. We prove the following: if the largest induced cycle in H has length at most 5, then there is an $O(m\log m)$ algorithm for counting H-homomorphisms in bounded degeneracy graphs. If the largest induced cycle in H has length at least 6, then (assuming standard fine-grained complexity conjectures) there is a constant $\gamma > 0$ such that there is no $o(m^{1+\gamma})$ time algorithm for counting H-homomorphisms.
- • How to Count Triangles, without Seeing the Whole Graph
- Suman K. Bera and C. Seshadhri
- SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2020
- abstract
    arxiv:2006.11947
Triangle counting is a fundamental problem in the analysis of large graphs. There is a rich body of work on this problem, in varying streaming and distributed models, yet all these algorithms require reading the whole input graph. In many scenarios, we do not have access to the whole graph, and can only sample a small portion of the graph (typically through crawling). In such a setting, how can we accurately estimate the triangle count of the graph?
We formally study triangle counting in the {\em random walk} access model introduced by Dasgupta et al (WWW '14) and Chierichetti et al (WWW '16). We have access to an arbitrary seed vertex of the graph, and can only perform random walks. This model is restrictive in access and captures the challenges of collecting real-world graphs. Even sampling a uniform random vertex is a hard task in this model.
Despite these challenges, we design a provable and practical algorithm, TETRIS, for triangle counting in this model. TETRIS is the first provably sublinear algorithm (for most natural parameter settings) that approximates the triangle count in the random walk model, for graphs with low mixing time. Our result builds on recent advances in the theory of sublinear algorithms. The final sample built by TETRIS is a careful mix of random walks and degree-biased sampling of neighborhoods. Empirically, TETRIS accurately counts triangles on a variety of large graphs, getting estimates within 5\% relative error by looking at 3\% of the number of edges.
- • How the Degeneracy Helps for Triangle Counting in Graph Streams
- Suman K. Bera and C. Seshadhri
- Principles of Database Systems (PODS) 2020
- abstract
    arxiv:2003.13151
We revisit the well-studied problem of triangle count estimation in graph streams. Given a graph represented as a stream of $m$ edges, our aim is to compute a $(1\pm\varepsilon)$-approximation to the triangle count $T$, using a small space algorithm. For arbitrary order and a constant number of passes, the space complexity is known to be essentially $\Theta(\min(m^{3/2}/T, m/\sqrt{T}))$ (McGregor et al., PODS 2016, Bera et al., STACS 2017). We give a (constant pass, arbitrary order) streaming algorithm that can circumvent this lower bound for \emph{low degeneracy graphs}. The degeneracy, $\kappa$, is a nuanced measure of density, and the class of constant degeneracy graphs is immensely rich (containing planar graphs, minor-closed families, and preferential attachment graphs). We design a streaming algorithm with space complexity $\widetilde{O}(m\kappa/T)$. For constant degeneracy graphs, this bound is $\widetilde{O}(m/T)$, which is significantly smaller than both $m^{3/2}/T$ and $m/\sqrt{T}$. We complement our algorithmic result with a nearly matching lower bound of $\Omega(m\kappa/T)$.
- • Provably and Efficiently Approximating Near-cliques using Turán Shadow: PEANUTS
- Shweta Jain and C. Seshadhri
- The Web Conference (WWW) 2020
- abstract
    arxiv:2006.13483
Clique and near-clique counts are important graph
properties that help to characterize graphs and have applications in
graph generation, graph modeling, graph analytics, etc. Recently, Jain
et al. proposed an algorithm called Turan-Shadow for
estimating k-clique counts that was a significant improvement over the state of the art. Essentially, given a graph $G$, it constructs an object called the Turán shadow - a collection of (possibly overlapping) dense subgraphs such that there is a bijection from the cliques in each subgraph to the set of k-cliques in $G$. It then estimates the count of k-cliques in $G$ by estimating the number of cliques in the subgraphs using uniform random sampling. As a side effect, it provides an efficient method to sample a uniform random $k$-clique.
We use this clique sampler to discover and estimate counts of near-cliques, i.e., k-cliques that are missing 1 or 2 edges. Every near-clique contains cliques of smaller sizes. By obtaining the Turán shadow for the smaller clique size, sampling a uniform random clique of the smaller size, and averaging the number of near-cliques the sampled clique participates in, we give an estimate for the total number of near-cliques in $G$. Before this work, there was no known efficient method to obtain near-clique counts for $k>5$. We obtain a 10x-100x speedup over existing algorithms for counting near-cliques.
Additionally, we enhance Turan-Shadow by designing a new estimator that requires us to construct only parts of the shadow, and uses the shadow in an online manner, obviating the need to store the entire shadow. This not only achieves a 2x-10x speedup over Turan-Shadow in some cases but also drastically reduces the space used.
- • The impossibility of low-rank representations for triangle-rich complex networks
- C. Seshadhri, Aneesh Sharma, Andrew Stolman, and Ashish Goel
- Proceedings of the National Academy of Sciences (PNAS) 2020
- abstract
    PNAS:1911030117
UCSC news link
Youtube, part 1
Youtube, part 2
slides
The study of complex networks is a significant development in modern science, and has enriched the social sciences, biology, physics, and computer science. Models and algorithms for such networks are pervasive in our society, and impact human behavior via social networks, search engines, and recommender systems, to name a few. A widely used algorithmic technique for modeling such complex networks is to construct a low-dimensional Euclidean embedding of the vertices of the network, where proximity of vertices is interpreted as the likelihood of an edge. Contrary to the common view, we argue that such graph embeddings do not capture salient properties of complex networks. The two properties we focus on are low degree and large clustering coefficients, which have been widely established to be empirically true for real-world networks. We mathematically prove that any embedding (that uses dot products to measure similarity) that can successfully create these two properties must have a rank that is nearly linear in the number of vertices. Among other implications, this establishes that popular embedding techniques such as singular value decomposition and node2vec fail to capture significant structural aspects of real-world complex networks. Furthermore, we empirically study a number of different embedding techniques based on dot product, and show that they all fail to capture the triangle structure.
- • Linear Time Subgraph Counting, Graph Degeneracy, and the Chasm at Size Six
- Suman K. Bera, Noujan Pashanasangi, and C. Seshadhri
- Innovations in Theoretical Computer Science (ITCS) 2020
- abstract
    arxiv:1911.05896
We consider the problem of counting all k-vertex subgraphs in an input graph, for any constant k. This problem (denoted sub-cntk) has been studied extensively in both theory and practice. In a classic result, Chiba and Nishizeki (SICOMP 85) gave linear time algorithms for clique and 4-cycle counting for bounded degeneracy graphs. This is a rich class of sparse graphs that contains, for example, all minor-free families and preferential attachment graphs. The techniques from this result have inspired a number of recent practical algorithms for sub-cntk. Towards a better understanding of the limits of these techniques, we ask: for what values of k can sub-cntk be solved in linear time?
We discover a chasm at k=6. Specifically, we prove that for k < 6, sub-cntk can be solved in linear time. Assuming a standard conjecture in fine-grained complexity, we prove that for all k≥6, sub-cntk cannot be solved even in near-linear time.
- • Efficiently Counting Vertex Orbits of All 5-vertex Subgraphs, by EVOKE
- Noujan Pashanasangi and C. Seshadhri
- Web Science and Data Mining (WSDM) 2020
- abstract
    arxiv:1911.10616
Subgraph counting is a fundamental task in network analysis. Typically, algorithmic work is on total counting, where we wish to count the total frequency of a (small) pattern subgraph in a large input data set. But many applications require local counts (also called vertex orbit counts) wherein, for every vertex v of the input graph, one needs the count of the pattern subgraph involving v. This provides a rich set of vertex features that can be used in machine learning tasks, especially classification and clustering. But getting local counts is extremely challenging. Even the easier problem of getting total counts has received much research attention. Local counts require algorithms that get much finer grained information, and the sheer output size makes it difficult to design scalable algorithms.
We present EVOKE, a scalable algorithm that can determine vertex orbits counts for all 5-vertex pattern subgraphs. In other words, EVOKE exactly determines, for every vertex v of the input graph and every 5-vertex subgraph H, the number of copies of H that v participates in. EVOKE can process graphs with tens of millions of edges, within an hour on a commodity machine. EVOKE is typically hundreds of times faster than previous state of the art algorithms, and gets results on datasets beyond the reach of previous methods.
Theoretically, we generalize a recent "graph cutting" framework to get vertex orbit counts. This framework generates a collection of polynomial equations relating vertex orbit counts of larger subgraphs to those of smaller subgraphs. EVOKE carefully exploits the structure among these equations to rapidly count. We prove and empirically validate that EVOKE only has a small constant factor overhead over the best (total) 5-vertex subgraph counter.
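As a toy example of the flavor of such equations (a standard identity chosen by us for illustration, not one quoted from the paper): pairs of neighbors of a vertex $v$ split into closed and open wedges,
\[
  \binom{d_v}{2} \;=\; t(v) \;+\; w(v),
\]
where $d_v$ is the degree of $v$, $t(v)$ is the number of triangles containing $v$, and $w(v)$ is the number of induced 2-paths centered at $v$. Knowing the left side and one of the right-hand orbit counts determines the other.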
- • The Power of Pivoting for Exact Clique Counting
- Shweta Jain and C. Seshadhri
- Web Science and Data Mining (WSDM) 2020 (Best paper)
- abstract
    arxiv:2001.06784
Clique counting is a fundamental task in network analysis, and even the simplest setting of 3-cliques (triangles) has been the center of much recent research. Getting the count of k-cliques for larger k is algorithmically challenging, due to the exponential blowup in the search space of large cliques. But a number of recent applications (especially for community detection or clustering) use larger clique counts. Moreover, one often desires \textit{local} counts, the number of k-cliques per vertex/edge.
Our main result is Pivoter, an algorithm that exactly counts the number of k-cliques, \textit{for all values of k}. It is surprisingly effective in practice, and is able to get clique counts of graphs that were beyond the reach of previous work. For example, Pivoter gets all clique counts in a social network with 100M edges within two hours on a commodity machine. Previous parallel algorithms do not terminate in days. Pivoter can also feasibly get local per-vertex and per-edge k-clique counts (for all k) for many public data sets with tens of millions of edges. To the best of our knowledge, this is the first algorithm that achieves such results.
The main insight is the construction of a Succinct Clique Tree (SCT) that stores a compressed unique representation of all cliques in an input graph. It is built using a technique called \textit{pivoting}, a classic approach by Bron-Kerbosch to reduce the recursion tree of backtracking algorithms for maximal cliques. Remarkably, the SCT can be built without actually enumerating all cliques, and provides a succinct data structure from which exact clique statistics (k-clique counts, local counts) can be read off efficiently.
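For context, here is the pivoting step in its classic Bron-Kerbosch form, which Pivoter builds on (an illustrative sketch, not the paper's implementation):

    # Bron-Kerbosch with pivoting: enumerate maximal cliques. Picking a pivot u
    # and recursing only on P - N(u) prunes the backtracking tree; Pivoter's
    # SCT extends this pruning to count all cliques without enumerating them.
    def bron_kerbosch_pivot(R, P, X, adj, out):
        if not P and not X:
            out.append(set(R))  # R is a maximal clique
            return
        u = max(P | X, key=lambda w: len(adj[w] & P))  # pivot covering most of P
        for v in list(P - adj[u]):
            bron_kerbosch_pivot(R | {v}, P & adj[v], X & adj[v], adj, out)
            P = P - {v}
            X = X | {v}

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    cliques = []
    bron_kerbosch_pivot(set(), set(adj), set(), adj, cliques)
    print(cliques)  # -> [{0, 1, 2}, {2, 3}]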
- • Domain Reduction for Monotonicity Testing: A $o(d)$ Tester for Boolean Functions in $d$-Dimensions
- Hadley Black, Deeparnab Chakrabarty, and C. Seshadhri
- Symposium on Discrete Algorithms (SODA) 2020
- abstract
    arxiv:1811.01427
We describe a $\widetilde{O}(d^{5/6})$-query monotonicity tester for Boolean functions $f:[n]^d \to \{0,1\}$ on the $n$-hypergrid. This is the first $o(d)$ monotonicity tester with query complexity independent of $n$. Motivated by this independence of $n$, we initiate the study of monotonicity testing of measurable Boolean functions $f:\mathbb{R}^d \to \{0,1\}$ over the continuous domain, where the distance is measured with respect to a product distribution over $\mathbb{R}^d$. We give a $\widetilde{O}(d^{5/6})$-query monotonicity tester for such functions.
Our main technical result is a domain reduction theorem for monotonicity. For any function $f:[n]^d \to \{0,1\}$, let $\varepsilon_f$ be its distance to monotonicity. Consider the restriction $\hat{f}$ of the function to a random $[k]^d$ sub-hypergrid of the original domain. We show that for $k = \mathrm{poly}(d/\varepsilon)$, the expected distance of the restriction is $\mathbb{E}[\varepsilon_{\hat{f}}] = \Omega(\varepsilon_f)$. Previously, such a result was only known for $d=1$ (Berman-Raskhodnikova-Yaroslavtsev, STOC 2014). Our result for testing Boolean functions over $[n]^d$ then follows by applying the $d^{5/6}\cdot \mathrm{poly}(1/\varepsilon, \log n, \log d)$-query hypergrid tester of Black-Chakrabarty-Seshadhri (SODA 2018).
To obtain the result for testing Boolean functions over $\mathbb{R}^d$, we use standard measure-theoretic tools to reduce monotonicity testing of a measurable function $f$ to monotonicity testing of a discretized version of $f$ over a hypergrid domain $[N]^d$ for large, but finite, $N$ (that may depend on $f$). The independence of $N$ in the hypergrid tester is crucial to getting the final tester over $\mathbb{R}^d$.
- • Faster sublinear approximations of k-cliques for low arboricity graphs
- Talya Eden, Dana Ron, and C. Seshadhri
- Symposium on Discrete Algorithms (SODA) 2020
- abstract
    arxiv:1811.04425
Given query access to an undirected graph $G$, we consider the problem of
computing a $(1\pm\varepsilon)$-approximation of the number of $k$-cliques in $G$. The standard query
model for general graphs allows for degree queries, neighbor queries, and pair queries.
Let $n$ be the number of vertices, $m$ be the number of edges, and $n_k$
be the number of $k$-cliques.
Previous work by Eden, Ron and Seshadhri (STOC 2018) gives an $O^*(\frac{n}{n^{1/k}_k} + \frac{m^{k/2}}{n_k})$-time algorithm for this problem (we use $O^*(\cdot)$ to suppress $\mathrm{poly}(\log n, 1/\varepsilon, k^k)$ dependencies). Moreover, this bound is nearly optimal when the expression is sublinear in the size of the graph.
Our motivation is to circumvent this lower bound, by parameterizing the complexity in terms of \emph{graph arboricity}. The arboricity of $G$ is a measure for the graph density ``everywhere''.
There is a very rich family of graphs with bounded arboricity, including all minor-closed graph classes (such as planar graphs and graphs with bounded treewidth), bounded degree graphs, preferential attachment graphs and more.
We design an algorithm for the class of graphs with arboricity at most $\alpha$, whose
running time is $O^*(\min\{\frac{n\alpha^{k-1}}{n_k},\; \frac{n}{n_k^{1/k}}+\frac{m \alpha^{k-2}}{n_k} \} )$.
We also prove a nearly matching lower bound.
For all graphs, the arboricity is $O(\sqrt m)$, so this bound subsumes all previous results on sublinear clique approximation.
As a special case of interest, consider minor-closed families of graphs, which have constant arboricity. Our result implies that for any minor-closed family of graphs, there is a $(1\pm\varepsilon)$-approximation algorithm for $n_k$ that has running time $O^*(\frac{n}{n_k})$.
Such a bound was not known even for the special (classic) case of triangle counting in planar graphs.
- • Random walks and forbidden minors II: A $\mathrm{poly}(d\varepsilon^{-1})$-query tester for minor-closed properties of bounded-degree graphs
- Akash Kumar, C. Seshadhri, and Andrew Stolman
- Symposium on Theory of Computing (STOC) 2019 (Invited to special issue)
- abstract
    stoc
    arxiv:1904.01055
    ECCC
Let $G$ be a graph with $n$ vertices and maximum degree $d$. Fix some minor-closed property $\mathcal{P}$ (such as planarity).
We say that $G$ is $\varepsilon$-far from $\mathcal{P}$ if one has to remove $\varepsilon dn$ edges to make it have $\mathcal{P}$.
The problem of property testing $\mathcal{P}$ was introduced in the seminal work of Benjamini-Schramm-Shapira (STOC 2008)
that gave a tester with query complexity triply exponential in $\varepsilon^{-1}$.
Levi-Ron (TALG 2015) have given the best tester to date, with a quasipolynomial (in $\varepsilon^{-1}$) query complexity.
It is an open problem to get property testers whose query complexity is $\mathrm{poly}(d\varepsilon^{-1})$,
even for planarity.
In this paper, we resolve this open question. For any minor-closed property,
we give a tester with query complexity $d\cdot \mathrm{poly}(\varepsilon^{-1})$.
The previous line of work on (independent of $n$, two-sided) testers is primarily combinatorial.
Our work, on the other hand, employs techniques from spectral graph theory. This paper
is a continuation of recent work of the authors (FOCS 2018) analyzing random walk algorithms
that find forbidden minors.
- • Adaptive Boolean Monotonicity Testing in Total Influence Time
- Deeparnab Chakrabarty and C. Seshadhri
- Innovations in Theoretical Computer Science (ITCS) 2019
- abstract
    arxiv:1801.02816
    ECCC
    ITCS
Testing monotonicity
of a Boolean function $f:\{0,1\}^n \to \{0,1\}$ is an important problem in the field of property testing.
It has led to connections with many interesting combinatorial questions on the directed hypercube: routing, random walks,
and new isoperimetric theorems.
Denoting the proximity parameter by $\varepsilon$, the best tester is the non-adaptive $\widetilde{O}(\varepsilon^{-2}\sqrt{n})$-query tester
of Khot-Minzer-Safra (FOCS 2015). A series of recent results by Belovs-Blais (STOC 2016) and Chen-Waingarten-Xie (STOC 2017)
have led to $\widetilde{\Omega}(n^{1/3})$ lower bounds for \emph{adaptive} testers. Reducing this gap is a significant
question, that touches on the role of adaptivity in monotonicity testing of Boolean functions.
We approach this question from the perspective of \emph{parametrized property testing}, a concept recently introduced
by Pallavoor-Raskhodnikova-Varma (ACM TOCT 2017), where one seeks to understand performance of testers with respect to parameters other than just the size.
Our result is an adaptive monotonicity tester with one-sided error whose query complexity is $O(\varepsilon^{-2} I(f)\log^5 n)$, where $I(f)$ is the total influence of the function. Therefore, adaptivity provably helps monotonicity testing for low influence functions.
- • Local Algorithms for Hierarchical Dense Subgraph Discovery
- A. Erdem Sariyuce, C. Seshadhri, and Ali Pinar
- Proceedings of Very Large Databases (PVLDB) 2018
- abstract
    arxiv:1704.00386
    VLDB
Finding the dense regions of a graph and relations among them is a fundamental problem in network analysis. Core and truss decompositions reveal dense subgraphs with hierarchical relations. The incremental nature of algorithms for computing these decompositions, and the need for global information at each step of the algorithm, hinder scalable parallelization and approximation, since the densest regions are not revealed until the end. In a previous work, Lu et al. proposed to iteratively compute the h-indices of neighbor vertex degrees to obtain the core numbers, and proved that convergence is reached after a finite number of iterations. This work generalizes the iterative h-index computation to truss decomposition, as well as to nucleus decomposition, which leverages higher-order structures to generalize core and truss decompositions. In addition, we prove convergence bounds on the number of iterations. We present a framework of local algorithms to obtain the core, truss, and nucleus decompositions. Our algorithms are local, parallel, offer high scalability, and enable approximations to explore time and quality trade-offs. Our shared-memory implementation verifies the efficiency, scalability, and effectiveness of our local algorithms on real-world networks.
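A minimal sketch of the iterative h-index idea for core numbers (our illustrative code): initialize every vertex to its degree, then repeatedly replace each value by the h-index of its neighbors' values.

    # Iterated h-index converges to the core numbers (as proven by Lu et al.).
    def h_index(values):
        values = sorted(values, reverse=True)
        h = 0
        while h < len(values) and values[h] >= h + 1:
            h += 1
        return h

    def core_numbers(adj):
        core = {v: len(nbrs) for v, nbrs in adj.items()}  # start from degrees
        changed = True
        while changed:
            changed = False
            for v in adj:
                new = h_index([core[u] for u in adj[v]])
                if new != core[v]:
                    core[v], changed = new, True
        return core

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(core_numbers(adj))  # -> {0: 2, 1: 2, 2: 2, 3: 1}

The update is purely local (each vertex only reads its neighbors), which is what makes this style of computation easy to parallelize.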
- • Finding forbidden minors in sublinear time: a $O(n^{1/2+o(1)})$-query one-sided tester for minor closed properties on bounded degree graphs
- Akash Kumar, C. Seshadhri, and Andrew Stolman
- Foundations of Computer Science (FOCS) 2018 (Invited to special issue)
- abstract
    arxiv:1805.08187
    ECCC
Let $G$ be an undirected, bounded degree graph
with $n$ vertices. Fix a finite graph $H$, and suppose
one must remove $\varepsilon n$ edges from $G$ to make it $H$-minor free (for some small constant $\varepsilon > 0$).
We give an $n^{1/2+o(1)}$-time randomized procedure that,
with high probability, finds an $H$-minor in such a graph.
For an example application, suppose one must remove $\varepsilon n$ edges from a bounded degree graph $G$ to make it planar.
This result implies an algorithm, with the same running time, that produces
a $K_{3,3}$ or $K_5$ minor in $G$.
No sublinear time bound was known for this problem, prior to this result.
By the graph minor theorem, we get an analogous result for any minor-closed property.
Up to $n^{o(1)}$ factors, this resolves a conjecture of Benjamini-Schramm-Shapira (STOC 2008)
on the existence of one-sided property testers for minor-closed properties. Furthermore, our algorithm is nearly
optimal, by an $\Omega(\sqrt{n})$ lower bound of Czumaj et al (RSA 2014).
Prior to this work,
the only graphs $H$ for which non-trivial property testers were known for $H$-minor freeness
are the following: $H$ being a forest or a cycle (Czumaj et al, RSA 2014), $K_{2,k}$, $(k\times 2)$-grid, and the $k$-circus (Fichtenberger et al, Arxiv 2017).
- • Finding Cliques in Social Networks: A New Distribution-Free Model
- Jacob Fox, Tim Roughgarden, C. Seshadhri, Fan Wei, Nicole Wein
- International Colloquium on Automata, Languages, and Programming (ICALP) 2018
- abstract
    arxiv:1804.07431
We propose a new distribution-free model of social networks. Our definitions
are motivated by one of the most universal signatures of social networks,
triadic closure---the property that pairs of vertices with common neighbors
tend to be adjacent. Our most basic definition is that of a "$c$-closed" graph,
where for every pair of vertices $u,v$ with at least $c$ common neighbors, $u$
and $v$ are adjacent. We study the classic problem of enumerating all maximal
cliques, an important task in social network analysis. We prove that this
problem is fixed-parameter tractable with respect to $c$ on $c$-closed graphs.
Our results carry over to "weakly $c$-closed graphs", which only require a
vertex deletion ordering that avoids pairs of non-adjacent vertices with $c$
common neighbors. Numerical experiments show that well-studied social networks
tend to be weakly $c$-closed for modest values of $c$.
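For concreteness, the parameter is easy to compute on small graphs (an illustrative sketch; names ours): the $c$-closure is one more than the largest number of common neighbors over non-adjacent pairs.

    # Smallest c such that the graph is c-closed: every non-adjacent pair must
    # have fewer than c common neighbors.
    from itertools import combinations

    def c_closure(adj):
        worst = 0  # maximum codegree over non-adjacent pairs
        for u, v in combinations(adj, 2):
            if v not in adj[u]:
                worst = max(worst, len(adj[u] & adj[v]))
        return worst + 1

    # A 4-cycle: both diagonal pairs are non-adjacent with 2 common neighbors,
    # so the graph is 3-closed but not 2-closed.
    adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
    print(c_closure(adj))  # -> 3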
- • On Approximating the Number of k-Cliques in Sublinear Time
- Talya Eden, Dana Ron, and C. Seshadhri
- Symposium on Theory of Computing (STOC) 2018
- abstract
    arxiv:1707.04858
We study the problem of approximating the number of $k$-cliques in a graph when given query access to the graph.
We consider the standard query model for general graphs via
(1) degree queries, (2) neighbor queries and (3) pair queries.
Let $n$ denote the number of vertices in the graph, $m$ the number of edges, and $n_k$ the number of $k$-cliques.
We design an algorithm that outputs a $(1\pm\varepsilon)$-approximation (with high probability) for $n_k$, whose expected query complexity and running time are
$O\left(\frac{n}{n_k^{1/k}}+\frac{m^{k/2}}{n_k} \right)\cdot\mathrm{poly}(\log n, 1/\varepsilon, k)$.
Hence, the complexity of the algorithm is sublinear in the size of the graph for $n_k = \omega(m^{k/2-1})$.
Furthermore,
the query complexity of our algorithm is essentially optimal
(up to the dependence on $\log n$, $1/\varepsilon$ and $k$).
The previous results in this vein are by Feige (SICOMP 06) and by Goldreich and Ron (RSA 08)
for edge counting ($k=2$) and by Eden et al (FOCS 2015)
for triangle counting ($k=3$). Our result matches the complexities of
these results.
The previous result by Eden et al. hinges on a certain amortization technique that works
for triangle counting, and does not generalize to all $k$.
We obtain a general algorithm that works for any $k\geq 3$ by designing a procedure that samples each $k$-clique
incident to a given set $S$ of vertices with approximately equal probability.
The primary difficulty is in finding cliques incident to purely high-degree vertices,
since random sampling within neighbors has a low success probability.
This is achieved by an algorithm that samples uniformly random
high-degree vertices and
a careful tradeoff between estimating cliques incident purely to high-degree vertices and those
that include a low-degree vertex.
- • Provable and practical approximations for the degree distribution using sublinear graph samples
- Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C. Seshadhri
- The Web Conference (WWW) 2018
- abstract
    arxiv:1710.08607
The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on \emph{graph sampling}, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. The degree distribution estimation poses a significant challenge, due to its heavy-tailed nature and the large variance in degrees.
We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of \emph{sublinear algorithms}. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the \emph{h-index} and the \emph{z-index}. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law.
We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most 1% of the vertices. This is a major improvement over the state-of-the-art sampling algorithms, which typically sample more than 10% of the vertices to give comparable results. We also observe that the h and z-indices of real graphs are large, validating our theoretical analysis.
- • A $o(d)\,\mathrm{polylog}\,n$ Monotonicity Tester for Boolean Functions over the Hypergrid $[n]^d$
- Hadley Black, Deeparnab Chakrabarty, C. Seshadhri
- Symposium on Discrete Algorithms (SODA) 2018
- abstract
    arxiv:1710.10545
We study monotonicity testing of Boolean functions over the hypergrid $[n]^d$ and design a non-adaptive tester with one-sided error whose query complexity is $O(d^{5/6})\cdot \mathrm{poly}(\log n, 1/\varepsilon)$. Prior to our work, the best known testers had query complexity linear in $d$ but independent of $n$. We improve upon these testers as long as $n = 2^{d^{o(1)}}$.
To obtain our results, we work with what we call the augmented hypergrid, which adds extra edges to the hypergrid. Our main technical contribution is a Margulis-style isoperimetric result for the augmented hypergrid, and our tester, like previous testers for the hypercube domain, performs directed random walks on this structure.
- • Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection
- Talya Eden, Dana Ron, and C. Seshadhri
- International Colloquium on Automata, Languages, and Programming (ICALP) 2017
- abstract
    arxiv:1604.03661
We revisit the classic problem of estimating the degree distribution moments of an undirected graph. Consider an undirected graph $G=(V,E)$ with $n$ (non-isolated) vertices, and define (for $s > 0$) $\mu_s = \frac{1}{n}\cdot\sum_{v \in V} d^s_v$. Our aim is to estimate $\mu_s$ within a multiplicative error of $(1+\varepsilon)$ (for a given approximation parameter $\varepsilon>0$) in sublinear time. We consider the sparse graph model that allows access to: uniform random vertices, queries for the degree of any vertex, and queries for a neighbor of any vertex.
For the case of $s=1$ (the average degree), $\widetilde{O}(\sqrt{n})$ queries suffice for any constant
$\varepsilon$ (Feige, SICOMP 06 and Goldreich-Ron, RSA 08).
Gonen-Ron-Shavitt (SIDMA 11) extended this result to all integral $s > 0$, by designing an algorithm that performs $\widetilde{O}(n^{1-1/(s+1)})$ queries. (Strictly speaking, their algorithm approximates the number of star-subgraphs of a given size, but a slight modification gives an algorithm for moments.)
We design a new, significantly simpler algorithm for this problem. In the worst-case, it exactly matches the bounds of Gonen-Ron-Shavitt, and has a much simpler proof. More importantly, the running time of this algorithm is connected to the \emph{degeneracy} of $G$. This
is (essentially) the maximum density of an induced subgraph.
- • Optimal Unateness Testers for Real-Valued Functions: Adaptivity Helps
- Roksana Baleshzar, Deeparnab Chakrabarty, Ramesh Krishnan S. Pallavoor, Sofya Raskhodnikova, and C. Seshadhri
- International Colloquium on Automata, Languages, and Programming (ICALP) 2017
- abstract
    arxiv:1703.05199
We study the problem of testing unateness of functions $f:\{0,1\}^d \to \mathbb{R}$. We give an $O((d/\varepsilon)\log(d/\varepsilon))$-query nonadaptive tester and an $O(d/\varepsilon)$-query adaptive tester and show that both testers are optimal for a fixed distance parameter $\varepsilon$. Previously known unateness testers worked only for Boolean functions, and their query complexity had worse dependence on the dimension both for the adaptive and the nonadaptive case. Moreover, no lower bounds for testing unateness were known. We also generalize our results to obtain optimal unateness testers for functions $f:[n]^d \to \mathbb{R}$.
Our results establish that adaptivity helps with testing unateness of real-valued functions on domains of the form $\{0,1\}^d$ and, more generally, $[n]^d$. This stands in contrast to the situation for monotonicity testing, where there is no adaptivity gap for functions $f:[n]^d \to \mathbb{R}$.
- • A Fast and Provable Method for Estimating Clique Counts Using Turán's Theorem
- Shweta Jain and C. Seshadhri
- World Wide Web (WWW) 2017 (Best paper)
- abstract
    arxiv:1611.05561
Clique counts reveal important properties about the structure of massive graphs,
especially social networks.
The simple setting of just 3-cliques (triangles) has received much attention from
the research community. For larger cliques (even, say $6$-cliques)
the problem quickly becomes intractable because of combinatorial explosion. Most methods used for triangle counting do not scale for large cliques, and existing algorithms require massive parallelism to be feasible.
We present a new randomized algorithm that provably approximates the number of $k$-cliques,
for any constant $k$. The key insight is the use of (strengthenings of) the classic Turán theorem,
which states that if the edge density of a graph is sufficiently high, the $k$-clique
density must be non-trivial. We define a combinatorial structure called a \emph{Turán shadow},
the construction of which leads to fast algorithms for clique counting.
We design a practical heuristic, called TURAN-SHADOW, based on this theoretical algorithm, and test
it on a large class of test graphs. In all cases, TURAN-SHADOW has less than 2\% error,
in a fraction of the time used by well-tuned exact algorithms. We do detailed comparisons
with a range of other sampling algorithms, and find that TURAN-SHADOW is generally much faster
and more accurate. For example, TURAN-SHADOW estimates all clique counts up to size $10$ in a social
network with over a hundred million edges. This is done in less than three hours on a single commodity
machine.
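For reference, the theorem in question (standard statement, not quoted from the paper): any $n$-vertex graph $G$ with
\[
  e(G) \;>\; \left(1 - \frac{1}{k-1}\right)\frac{n^2}{2}
\]
edges contains a $k$-clique. The strengthenings used here lower bound the number of $k$-cliques in terms of the edge density, not just their existence.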
- • ESCAPE: Efficiently Counting All 5-Vertex Subgraphs
- Ali Pinar, C. Seshadhri, and V. Vishal
- World Wide Web (WWW) 2017
- abstract
    arxiv:1601.09411
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes.
We introduce an algorithmic framework that can be adapted to count any small pattern in a graph, and apply this framework to compute exact counts for \emph{all} 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have fewer than 5 vertices) to exactly count all 5-vertex patterns.
We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for $5$-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all $4$-vertex
patterns. This algorithm is typically ten times faster than the state of the art $4$-vertex counters.
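As a small worked example of cutting (a classical identity we include for illustration; it is not quoted from the paper): a 4-cycle is two wedges glued along a diagonal pair, which yields
\[
  \#C_4 \;=\; \frac{1}{2} \sum_{\{u,w\}} \binom{|N(u)\cap N(w)|}{2},
\]
where the sum is over unordered vertex pairs and $N(\cdot)$ denotes a neighborhood. Each 4-cycle has two diagonal pairs, hence the factor of $1/2$.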
- • When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors
- Aneesh Sharma, C. Seshadhri, and Ashish Goel
- World Wide Web (WWW) 2017
- abstract
    arxiv:1703.01054
Finding similar user pairs is a fundamental task in social networks,
with applications in link prediction, tie strength detection, and a
number of ranking and personalization tasks. A common manifestation
of user similarity is based upon network structure: each user is
represented by a vector that represents the user's network
connections, and cosine similarity among these vectors quantifies
user similarity. The predominant task for user similarity
applications is to discover all user pairs with ``high'' similarity,
which is computationally challenging on networks with billions of
edges. To the best of our knowledge, there is no practical solution
for computing all user pairs with high similarity on large social
networks, even using the power of distributed algorithms.
Our work directly addresses this challenge by introducing a new
algorithm --- WHIMP --- that solves this problem efficiently in
the MapReduce model. The key insight in WHIMP is to combine the
``wedge-sampling'' approach of Cohen-Lewis for approximate matrix
multiplication with the SimHash random projection techniques of
Charikar. We provide a theoretical analysis of WHIMP, proving that
it has near optimal communication costs while maintaining
computation cost comparable with the state of the art. We also
empirically demonstrate WHIMP's scalability by computing all highly
similar pairs on four massive data sets, and show that it accurately
finds high similarity pairs. In particular, we note that WHIMP
successfully processes the entire Twitter network, which has tens of
billions of edges.
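To make the SimHash ingredient concrete (an illustrative sketch of Charikar's scheme, not WHIMP itself; names ours): random hyperplane signatures turn cosine similarity into Hamming similarity of short bit vectors.

    # SimHash: sign patterns of random projections. Two vectors agree on a bit
    # with probability 1 - theta/pi, where theta is the angle between them, so
    # Hamming similarity of signatures estimates cosine similarity.
    import math, random

    def simhash(vec, planes):
        # vec is sparse: index -> weight
        return [sum(p[i] * x for i, x in vec.items()) >= 0 for p in planes]

    def estimated_cosine(sig_a, sig_b):
        agree = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
        return math.cos(math.pi * (1 - agree))  # invert the collision probability

    dim, bits = 100, 256
    planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
    u = {0: 1.0, 1: 1.0}
    v = {0: 1.0, 2: 1.0}  # true cosine similarity is 0.5
    print(estimated_cosine(simhash(u, planes), simhash(v, planes)))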
- • Accurate and Nearly Optimal Sublinear Approximations to Ulam Distance
- Timothy Naumovitz, Michael Saks, and C. Seshadhri
- Symposium on Discrete Algorithms (SODA) 2017
- abstract
    conf
The \emph{Ulam distance} between two permutations of length $n$ is the minimum number of insertions and deletions needed to transform one sequence into the other.
Equivalently, the Ulam distance $d$ is $n$ minus the length of the longest common subsequence (LCS) between the permutations.
Our main result is an algorithm that, for any fixed $\varepsilon>0$, provides a $(1+\varepsilon)$-multiplicative approximation
for $d$ in $\tilde{O}_{\varepsilon}(n/d + \sqrt{n})$ time, which has been shown to be optimal up to polylogarithmic factors. This is the first sublinear time algorithm (provided that $d = (\log n)^{\omega(1)}$)
that obtains arbitrarily good multiplicative approximations to the Ulam distance.
The previous best bound is an $O(1)$-approximation (with a large constant) by Andoni and Nguyen (2010) with
the same running time bound (ignoring polylogarithmic factors).
The improvement in the approximation factor from $O(1)$ to $(1+\varepsilon)$ allows for significantly more powerful sublinear
algorithms. For example, for any fixed $\delta > 0$, we can get additive $\delta n$ approximations for the LCS
between permutations in $\tilde{O}_\delta(\sqrt{n})$ time. Previous sublinear algorithms require $\delta$ to be
at least $1-1/C$, where $C$ is the approximation factor, which is close to $1$ when $C$ is large.
Our algorithm is obtained by abstracting the basic algorithmic framework
of Andoni and Nguyen, and combining it with the sublinear approximations
for the longest increasing subsequence by Saks and Seshadhri (2010).
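A compact illustration of the definition (exact computation, not the paper's sublinear algorithm): for permutations, the LCS reduces to a longest increasing subsequence after relabeling, so the Ulam distance is computable in $O(n\log n)$ time by patience sorting.

    # Exact Ulam distance between two permutations of the same set:
    # d = n - LCS, and LCS of permutations = LIS after relabeling by the
    # positions in the second permutation (patience sorting, O(n log n)).
    from bisect import bisect_left

    def ulam_distance(p, q):
        pos = {x: i for i, x in enumerate(q)}
        seq = [pos[x] for x in p]  # LCS(p, q) = LIS(seq)
        tails = []                 # tails[k] = smallest tail of an LIS of length k+1
        for x in seq:
            k = bisect_left(tails, x)
            if k == len(tails):
                tails.append(x)
            else:
                tails[k] = x
        return len(p) - len(tails)

    print(ulam_distance([1, 2, 3, 4], [2, 1, 3, 4]))  # -> 1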
- • Directed Closure Measures for Networks with Reciprocity
- C. Seshadhri, Ali Pinar, Nurcan Durak, and Tamara G. Kolda
- Journal of Complex Networks 2017
- abstract
    arxiv:1302.6220
The computation and study of triangles in graphs is a standard tool in the analysis of real-world networks. Yet most of this work focuses on undirected graphs. Real-world networks are often directed and have a significant fraction of reciprocal edges. While there is much focus on directed triadic patterns in the social sciences community, most data mining and graph analysis studies ignore direction.
But how do we make sense of this complex directed structure? We propose a collection of directed closure values that are analogues of the classic transitivity measure (the fraction of wedges that participate in triangles). We perform an extensive set of triadic measurements on a variety of massive real-world networks. Our study of these values reveals a wealth of information about the nature of direction. For instance, we immediately see the importance of reciprocal edges in forming triangles and can measure the power of transitivity. Surprisingly, the chance that a wedge is closed depends heavily on its directed structure. We also observe striking similarities between the triadic closure patterns of different web and social networks.
Together with these observations, we also present the first sampling based algorithm for fast estimation of directed triangles. Previous estimation methods were targeted towards undirected triangles and could not be extended to directed graphs. Our method, based on wedge sampling, gives orders of magnitude speedup over state of the art enumeration.
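To illustrate the wedge sampling paradigm in its simplest undirected form (our sketch; the paper's method handles the directed wedge types): sample uniform random wedges and check closure to estimate the transitivity.

    # Wedge sampling: estimate transitivity (fraction of closed wedges).
    # Pick a center v proportional to its wedge count binom(d_v, 2), then a
    # uniform pair of its neighbors, and check whether the wedge is closed.
    import random
    from math import comb

    def estimate_transitivity(adj, samples=10000):
        verts = list(adj)
        weights = [comb(len(adj[v]), 2) for v in verts]
        closed = 0
        for _ in range(samples):
            v = random.choices(verts, weights=weights)[0]
            u, w = random.sample(sorted(adj[v]), 2)
            closed += w in adj[u]
        return closed / samples

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(estimate_transitivity(adj))  # true transitivity: 3 closed / 5 wedges = 0.6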
- • Trigger Detection for Adaptive Scientific Workflows Using Percentile Sampling
- Janine C. Bennett, Ankit Bhagatwala, Jacqueline H. Chen, Ali Pinar, Maher Salloum, and C. Seshadhri
- SIAM Journal on Scientific Computing 2016
- abstract
    arxiv:1411.2689
Increasing complexity of scientific simulations and HPC architectures are driving the need for adaptive workflows, where the composition and execution of computational and data manipulation steps dynamically depend on the evolutionary state of the simulation itself. Consider, for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and high fidelity for post-analysis; however, we cannot afford to retain the same frequency for the full simulation, due to the high cost of data movement. We can instead look for triggers, indicators that the simulation will be entering a critical phase, and adapt the workflow accordingly.
We present a method for detecting triggers and demonstrate its use in direct numerical simulations of turbulent combustion using S3D. We show that chemical explosive mode analysis (CEMA) can be used to devise a noise-tolerant indicator for rapid increase in heat release. However, exhaustive computation of CEMA values dominates the total simulation cost, and is thus prohibitively expensive. To overcome this bottleneck, we propose a quantile-sampling approach. Our algorithm comes with provable error/confidence bounds, as a function of the number of samples. Most importantly, the number of samples is independent of the problem size, thus our proposed algorithm offers perfect scalability. Our experiments on homogeneous charge compression ignition (HCCI) and reactivity controlled compression ignition (RCCI) simulations show that the proposed method can detect rapid increases in heat release, and that its computational overhead is negligible. Our results will be used for dynamic workflow decisions about data storage and mesh resolution in future combustion simulations. The proposed framework is generalizable, and we detail how it could be applied to a broad class of scientific simulation workflows.
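A minimal sketch of the sampling idea (illustrative; parameter names ours): estimate a percentile from a uniform random sample whose size depends only on the desired accuracy, not on the mesh size.

    # Percentile estimation by uniform sampling. By Hoeffding's inequality,
    # O(eps^-2 * log(1/delta)) samples estimate any quantile to within eps in
    # rank, independent of the total number of grid points.
    import math, random

    def sample_percentile(values, q, eps=0.01, delta=1e-3):
        n_samples = math.ceil(math.log(2 / delta) / (2 * eps**2))
        sample = sorted(random.choices(values, k=n_samples))
        return sample[min(int(q * n_samples), n_samples - 1)]

    data = [random.random() for _ in range(10**6)]
    print(sample_percentile(data, 0.99))  # close to the true 99th percentile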
- • A Mountaintop View Requires Minimal Sorting: A Faster Contour Tree Algorithm
- Benjamin Raichel and C. Seshadhri
- Symposium on Computational Geometry (SOCG) 2016 (Invited to special issue)
- Discrete and Computational Geometry 2016
- abstract
    arxiv:1411.2689
Consider a scalar field $f:\mathbb{M} \mapsto \mathbb{R}$, where $\mathbb{M}$
is a triangulated simplicial mesh in $\mathbb{R}^d$. A level set, or contour,
at value $v$ is a connected component of $f^{-1}(v)$. As $v$ is changed, these
contours change topology, merge into each other, or split. \emph{Contour trees}
are concise representations of $f$ that track this contour behavior. The
vertices of these trees are the critical points of $f$, where the gradient is
zero. The edges represent changes in the topology of contours. It is a
fundamental data structure in data analysis and visualization, and there is
significant previous work (both theoretical and practical) on algorithms for
constructing contour trees.
Suppose $\mathbb{M}$ has $n$ vertices, $N$ facets, and $t$ critical points. A
classic result of Carr, Snoeyink, and Axen (2000) gives an algorithm that takes
$O(n\log n + N\alpha(N))$ time (where $\alpha(\cdot)$ is the inverse Ackermann
function). A further improvement to $O(t\log t + N)$ time was given by Chiang
et al. All these algorithms involve a global sort of the critical points, a
significant computational bottleneck. Unfortunately, lower bounds of
$\Omega(t\log t)$ also exist.
We present the first algorithm that can avoid the global sort and has a
refined time complexity that depends on the contour tree structure.
Intuitively, if the tree is short and fat, we get significant improvements in
running time. For a partition of the contour tree into a set of descending
paths, $P$, our algorithm runs in $O(\sum_{p\in P} |p|\log|p| + t\alpha(t)+N)$
time. This is at most $O(t\log D + N)$, where $D$ is the diameter of the
contour tree. Moreover, it is $O(t\alpha(t)+N)$ for balanced trees, a
significant improvement over the previous complexity. We also prove lower
bounds showing that the $\sum_{p\in P} |p|\log|p|$ complexity is inherent to
computing contour trees.
- • Characterizing short-term stability for Boolean networks over any distribution of transfer functions
- C. Seshadhri, Andrew M. Smith, Yevgeniy Vorobeychik, Jackson Mayo, and Robert C. Armstrong
- Physical Review E 2016
- abstract
    APS     arxiv:1409.4360
We present a characterization of short-term stability of Kauffman's N-K
(random) Boolean networks under arbitrary distributions of transfer functions. Given such a Boolean network where each transfer function is drawn from the same distribution, we present a formula that determines whether short-term chaos (damage spreading) will happen. Our main technical tool which enables the formal proof of this formula is the Fourier analysis of Boolean functions, which describes such functions as multilinear polynomials over the inputs. Numerical simulations on mixtures of threshold functions and nested canalyzing functions demonstrate the formula's correctness.
- • Approximately Counting Triangles in Sublinear Time
- Talya Eden, Amit Levi, Dana Ron, and C. Seshadhri
- Foundations of Computer Science (FOCS) 2015
- abstract
    arxiv:1504.00954
We consider the problem of estimating the number of triangles in a graph. This problem has been extensively studied in both theory
and practice, but all existing algorithms read the entire graph. In this work we design a sublinear-time algorithm for approximating the number of triangles in a graph, where the algorithm is given query access to the graph. The allowed queries are degree queries, vertex-pair queries and neighbor queries.
We show that for any given approximation parameter $0 < \eps < 1$, the algorithm provides an estimate $\hat{t}$ such that with high constant probability, $(1-\eps)t < \hat{t} < (1+\eps)t$, where $t$ is the number of triangles in the graph $G$. The expected query complexity of the algorithm is $O\left(n/t^{1/3} + \min\{m, m^{3/2}/t\}\right) \cdot \poly(\log n, 1/\eps)$, where $n$ is the number of vertices in the graph and $m$ is the number of edges, and the expected running time is $O\left(n/t^{1/3} + m^{3/2}/t\right) \cdot \poly(\log n, 1/\eps)$. We also prove that $\Omega\left(n/t^{1/3} + \min\{m, m^{3/2}/t\}\right)$ queries are necessary, thus establishing that the query complexity of this algorithm is optimal up to polylogarithmic factors in $n$ (and the dependence on $1/\eps$).
- • Diamond Sampling for Approximate Maximum All-pairs Dot-product (MAD) Search
- Grey Ballard, Tamara G. Kolda, Ali Pinar, and C. Seshadhri
- International Conference on Data Mining (ICDM) 2015 (Best paper)
- abstract
    arxiv:1506.03872
Given two sets of vectors, $A = \{\vec{a_1}, \dots, \vec{a_m}\}$ and
$B=\{\vec{b_1},\dots,\vec{b_n}\}$, our problem is to find the top-$t$ dot
products, i.e., the largest $|\vec{a_i}\cdot\vec{b_j}|$ among all possible
pairs. This is a fundamental mathematical problem that appears in numerous data
applications involving similarity search, link prediction, and collaborative filtering.
We propose a sampling-based approach that avoids direct computation of all $mn$ dot products.
We select diamonds (i.e.,
four-cycles) from the weighted tripartite representation of $A$ and
$B$. The probability of selecting a diamond corresponding to
pair $(i,j)$ is proportional to $(\vec{a_i}\cdot\vec{b_j})^2$, amplifying the focus on the
largest-magnitude entries.
Experimental results indicate that diamond sampling is orders of magnitude faster than
direct computation and
requires far fewer samples than any competing approach.
We also apply diamond sampling to the special case of maximum inner product search, and get
significantly better results than the state-of-the-art hashing methods.
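A rough Python/NumPy sketch of the sampling step as we read it (scaling constants omitted, and the paper's top-$t$ postprocessing is not shown; `A` and `B` hold the vectors as rows, with no all-zero rows of A or columns of B):

import numpy as np

def diamond_sample_scores(A, B, num_samples, seed=None):
    # Accumulate scores X[i, j] whose expectation is proportional to
    # (a_i . b_j)^2 by sampling diamonds (i, k, j, k').
    rng = np.random.default_rng(seed)
    m, d = A.shape
    n, _ = B.shape
    absA, absB = np.abs(A), np.abs(B)
    rowA = absA.sum(axis=1)                # ||a_i||_1 for each row i
    colB = absB.sum(axis=0)                # ||B[:, k]||_1 per dimension k
    # center entry (i, k) drawn with weight |A[i,k]| * ||a_i||_1 * ||B[:,k]||_1
    W = absA * rowA[:, None] * colB[None, :]
    p = (W / W.sum()).ravel()
    X = np.zeros((m, n))
    for c in rng.choice(m * d, size=num_samples, p=p):
        i, k = divmod(int(c), d)
        j = rng.choice(n, p=absB[:, k] / colB[k])    # j with weight |B[j,k]|
        kp = rng.choice(d, p=absA[i] / rowA[i])      # k' with weight |A[i,k']|
        X[i, j] += np.sign(A[i, k] * B[j, k] * A[i, kp]) * B[j, kp]
    return X

In expectation each X[i, j] is proportional to $(\vec{a_i}\cdot\vec{b_j})^2$, so the heaviest entries flag candidate top-$t$ pairs, which can then be verified exactly.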
- • Catching the head, the tail, and everything in between: a streaming algorithm for the degree distribution
- Olivia Simpson, C. Seshadhri, and Andrew McGregor
- International Conference on Data Mining (ICDM) 2015
- abstract     github code
    arxiv:1506.02574
The degree distribution is one of the most fundamental graph properties of interest for real-world graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scale-free degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph, with small storage. We design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams with millions of edges using less than 1% storage, and get extremely accurate approximations for all scales in the degree distribution.
We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail.
- • When a Graph is not so Simple: Counting Triangles in Multigraph Streams
- Madhav Jha, C. Seshadhri, and Ali Pinar
- Asilomar Conference on Signals, Systems and Computers, ACSSC 2015
- abstract
    arxiv:1310.7665
Estimating the number of triangles in a graph given as a stream of edges is a
fundamental problem in data mining. The goal is to design a single pass
space-efficient streaming algorithm for estimating triangle counts. While there
are numerous algorithms for this problem, they all (implicitly or explicitly)
assume that the stream does not contain duplicate edges. However, data sets
obtained from real-world graph streams are rife with duplicate edges. The
workaround is typically an extra unaccounted pass (storing all the edges!) just to
"clean up" the data. Can we estimate triangle counts accurately in a single
pass even when the stream contains repeated edges? In this work, we give the
first algorithm for estimating the triangle count of a multigraph stream of
edges.
@techreport{JhSePi13b,
author = {Madhav Jha and C. Seshadhri and Ali Pinar},
title = {When a Graph is not so Simple: Counting Triangles in Multigraph Streams},
institution = {Arxiv},
year = {2013},
number = {1310.7665},
}
- • Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts
- Madhav Jha, C. Seshadhri, and Ali Pinar
- World Wide Web (WWW) 2015
- abstract
    arxiv:1411.4942
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs on truly massive graphs employs clusters and massive parallelization.
We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.
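A hedged Python sketch of the 3-path sampling primitive (simplified: we skip the paper's pruning scheme, discard non-simple samples, and assume comparable vertex ids; `adj` maps each vertex to its neighbor set):

import random

def sample_three_paths(adj, num_samples):
    # Choose a center edge (u, v) with probability proportional to
    # (d_u - 1)(d_v - 1), then uniform neighbors u' != v and v' != u;
    # (u', u, v, v') is then a uniform random 3-path (with multiplicity).
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    weights = [(len(adj[u]) - 1) * (len(adj[v]) - 1) for u, v in edges]
    total = sum(weights)          # total 3-path count being sampled from
    paths = []
    for u, v in random.choices(edges, weights=weights, k=num_samples):
        up = random.choice([w for w in adj[u] if w != v])
        vp = random.choice([w for w in adj[v] if w != u])
        if up != vp:              # discard non-simple samples (triangles)
            paths.append((up, u, v, vp))
    return paths, total

Each 4-vertex pattern H contains a fixed number c_H of 3-paths (a 4-cycle, for instance, contains four), so its count can be estimated as (fraction of samples inducing H) x total / c_H.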
- • Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions
- Ahmet Erdem Sariyuce, C. Seshadhri, Ali Pinar, and Umit V. Catalyurek
- World Wide Web (WWW) 2015 (Best paper finalist)
- abstract
    arxiv:1411.3312
Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which allows for discovery of overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-trusses. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
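For intuition, here is a Python sketch of the special case the decomposition generalizes: k-core peeling (the (1,2)-nucleus), with `adj` a dict from vertex to neighbor set:

import heapq

def core_numbers(adj):
    # Repeatedly peel a minimum-degree vertex; the core number of v is
    # the maximum "peel degree" seen up to v's removal.
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, core, k = set(), {}, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                  # stale entry from a lazy update
        k = max(k, d)
        core[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
    return core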
- • Property Testing on Product Distributions: Optimal Testers for Bounded Derivative Properties
- Deeparnab Chakrabarty, Kashyap Dixit, Madhav Jha, and C. Seshadhri
- Symposium on Discrete Algorithms (SODA) 2015 (Invited to special issue)
- abstract
    arxiv:1404.0718
    ECCC-TR14-042
The primary problem in property testing is to decide whether a given function satisfies a certain property, or is far
from any function satisfying it. This crucially requires a notion of distance between functions. The most prevalent notion
is the Hamming distance over the {\em uniform} distribution on the domain.
This restriction to uniformity is rather limiting, and it is important to investigate distances induced by more general distributions.
In this paper, we give simple and optimal testers for {\em bounded derivative properties} over {\em arbitrary product distributions}.
Bounded derivative properties include fundamental properties such as monotonicity and Lipschitz continuity.
Our results subsume almost all known results (upper and lower bounds) on monotonicity and Lipschitz testing.
We prove an intimate connection between bounded derivative property testing and binary search trees (BSTs). We exhibit a tester whose query complexity is the sum of
expected depths of optimal BSTs for each marginal.
Furthermore, we show this sum-of-depths is also a lower bound.
A technical contribution of our work is an {\em optimal dimension reduction theorem} for all bounded derivative properties, which relates the distance of a function from the property to the distance of restrictions of the function to random lines. Such a theorem has been elusive even for monotonicity, and our theorem is an exponential improvement to the previous best known result.
@techreport{ChDiJh+14,
author = {Deeparnab Chakrabarty and Kashyap Dixit and Madhav Jha and C. Seshadhri},
title = {Property Testing on Product Distributions: Optimal Testers for Bounded Derivative Properties},
institution = {ECCC},
year = {2014},
number = {TR14-042},
}
- • FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs
- Peter Lofgren, Siddhartha Banerjee, Ashish Goel, and C. Seshadhri
- SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2014
- abstract
    arxiv:1404.3181
We propose a new algorithm, FAST-PPR, for estimating personalized PageRank: given input nodes $s$, $t$ in a directed graph and threshold $\delta$, it computes the Personalized PageRank $\PR_s(t)$ from $s$ to $t$, guaranteeing that the relative error is small as long as $\PR_s(t) > \delta$.
Existing algorithms for this problem have a running time of $\Omega(1/\delta)$, whereas FAST-PPR has a provable average running-time guarantee of $\tilde{O}(\sqrt{d/\delta})$ (where $d$ is the average in-degree of the graph). This is a significant improvement, since $\delta$ is often $O(1/n)$ (where $n$ is the number of nodes) for applications. We also complement the algorithm with an $\Omega(1/\sqrt{\delta})$ lower bound for Significant-PageRank, showing that the dependence on $\delta$ cannot be improved.
We perform a detailed empirical study on numerous massive graphs showing that FAST-PPR dramatically outperforms existing algorithms. For example, with target nodes sampled according to popularity, on the 2010 Twitter graph with 1.5 billion edges, FAST-PPR is 20 times faster than the state of the art. Furthermore, an enhanced version of FAST-PPR is 160 times faster on the Twitter graph, and is at least 20 times faster on all our candidate graphs.
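For context, a Python sketch of the Monte Carlo baseline that FAST-PPR speeds up (forward random walks only; the actual algorithm also builds a reverse frontier around the target, which is where the $\sqrt{d/\delta}$ savings come from):

import random

def monte_carlo_ppr(adj_out, s, t, alpha=0.2, num_walks=100000):
    # A walk from s that terminates with probability alpha at each step
    # ends at a node distributed according to PPR_s; the fraction of
    # walks ending at t estimates PPR_s(t).  Detecting values above
    # delta needs on the order of 1/delta walks -- the baseline cost.
    hits = 0
    for _ in range(num_walks):
        v = s
        while random.random() > alpha:
            if not adj_out[v]:
                break                 # dangling node: end the walk here
            v = random.choice(adj_out[v])
        hits += (v == t)
    return hits / num_walks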
@techreport{LoBaGo+14,
author = {Peter Lofgren and Siddhartha Banerjee and Ashish Goel and C. Seshadhri},
title = {FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs},
institution = {Arxiv},
year = {2014},
number = {1404.3181},
}
- • A Scalable Generative Graph Model with Community Structure
- Tamara G. Kolda, Ali Pinar, Todd Plantenga, and C. Seshadhri
- SIAM Journal on Scientific Computing (SISC) 2014
- abstract
    arxiv:1302.6636
Network data is ubiquitous and growing, yet we lack realistic generative models that can be calibrated to match real-world data. The recently proposed Block Two-Level Erdos-Renyi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop implementation for modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.
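A much-simplified Python sketch of BTER's two phases (the real generator derives per-block connection probabilities from target clustering coefficients; here a single constant `rho` stands in):

import random
from collections import defaultdict

def bter_sketch(degrees, rho=0.9):
    # Phase 1: pack vertices of similar degree into "affinity blocks"
    # of size d+1 and wire each block as a dense Erdos-Renyi graph.
    order = sorted(range(len(degrees)), key=lambda v: degrees[v])
    edges, cnt = set(), defaultdict(int)

    def add(a, b):
        e = (min(a, b), max(a, b))
        if a != b and e not in edges:
            edges.add(e); cnt[a] += 1; cnt[b] += 1

    i = 0
    while i < len(order):
        block = order[i:i + degrees[order[i]] + 1]
        for x in range(len(block)):
            for y in range(x + 1, len(block)):
                if random.random() < rho:
                    add(block[x], block[y])
        i += len(block)
    # Phase 2: spend leftover ("excess") degree Chung-Lu style,
    # picking both endpoints proportionally to excess degree.
    pool = [v for v in range(len(degrees))
            for _ in range(max(degrees[v] - cnt[v], 0))]
    for _ in range(len(pool) // 2):
        add(random.choice(pool), random.choice(pool))
    return edges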
@techreport{KoPiPl+13,
author = {Tamara G. Kolda and Ali Pinar and Todd Plantenga and C. Seshadhri},
title = {A Scalable Generative Graph Model with Community Structure},
institution = {Arxiv},
year = {2013},
number = {1302.6636},
}
- • Counting Triangles in Massive Graphs with MapReduce
- Tamara G. Kolda, Ali Pinar, Todd Plantenga, C. Seshadhri, and Christine Task
- SIAM Journal on Scientific Computing (SISC) 2014
- abstract
    arxiv:1301.5887
Graphs and networks are used to model interactions in a variety of contexts.
There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual friends.
This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for large-scale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with extremely massive graphs. We show results on publicly-available networks, the largest of which is 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 sec. per million edges plus overhead (approximately 225 sec.
total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and non-social networks. To the best of our knowledge, these are the largest triangle-based graph computations published to date.
@techreport{KoPiPl+13b,
author = {Tamara G. Kolda and Ali Pinar and Todd Plantenga and C. Seshadhri and Christine Task},
title = {Counting Triangles in Massive Graphs with MapReduce},
institution = {Arxiv},
year = {2013},
number = {1301.5887},
}
- • Why do Simple Algorithms for Triangle Enumeration work in the Real World?
- Jonathan Berry, Luke Fostvedt, Daniel Nordman, Cynthia Phillips, C. Seshadhri, and Alyson Wilson
- Innovations in Theoretical Computer Science (ITCS) 2014
- abstract
    conf
Listing all triangles is a fundamental graph operation. Triangles can have important interpretations
in real-world graphs, especially social and other interaction networks.
Despite the lack of provably efficient (linear, or slightly super-linear) worst-case algorithms for this problem, practitioners
run simple, efficient heuristics to find all triangles in graphs with millions of vertices.
How are these heuristics exploiting
the structure of these special graphs to provide major speedups in running time?
We study one of the most prevalent algorithms used by practitioners. A trivial algorithm
enumerates all paths of length $2$ and checks whether each such path closes into a triangle.
A good heuristic is to enumerate only those paths of length $2$
where the middle vertex has the lowest degree.
It is easily implemented and is empirically known to give remarkable speedups over the trivial algorithm.
We study the behavior of this algorithm over graphs with heavy-tailed
degree distributions, a defining feature of real-world graphs. The
erased configuration model (ECM) efficiently generates a graph
with asymptotically (almost) any desired degree sequence.
We show that the expected running time of this algorithm
over the distribution of graphs created by the ECM is controlled by the $\ell_{4/3}$-norm of the degree
sequence.
Norms of the degree sequence are a measure of the heaviness of the tail, and it is
precisely this feature that allows non-trivial speedups of simple
triangle enumeration algorithms. As a corollary of our main theorem,
we prove expected linear-time performance for degree sequences
following a power law with exponent $\alpha \geq 7/3$, and non-trivial
speedup whenever $\alpha \in (2,3)$.
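A Python sketch of the heuristic in question: enumerate each wedge only from its lowest-degree middle vertex (ties broken by id), so each triangle is checked exactly once; `adj` maps a vertex to its neighbor set:

def triangles_low_degree(adj):
    rank = lambda v: (len(adj[v]), v)     # degree order with tie-breaking
    tris = []
    for u in adj:
        # only enumerate wedges whose middle vertex u has the lowest rank
        higher = [v for v in adj[u] if rank(v) > rank(u)]
        for i, v in enumerate(higher):
            for w in higher[i + 1:]:
                if w in adj[v]:           # the wedge v-u-w closes
                    tris.append((u, v, w))
    return tris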
@INPROCEEDINGS{BeFoNo+14,
author = {J. Berry and L. Fostvedt and D. Nordman and C. Phillips and C. Seshadhri and A. Wilson},
title = {Why do Simple Algorithms for Triangle Enumeration work in the Real World?},
booktitle = {Proceedings of the Innovations in Theoretical Computer Science (ITCS)},
year = {2014}
}
- • Decompositions of Triangle-Dense Graphs
- Rishi Gupta and Tim Roughgarden and C. Seshadhri
- Innovations in Theoretical Computer Science (ITCS) 2014
- abstract
    conf
    arxiv:1309.7440
High triangle density -- the graph property stating that most two-hop paths belong to a triangle -- is a common signature of social networks. This paper studies triangle-dense graphs from a structural perspective. We prove constructively that most of the content of a triangle-dense graph is contained in a disjoint union of radius-2 dense subgraphs. This result quantifies the extent to which triangle-dense graphs resemble unions of cliques. We also show that our algorithm recovers planted clusterings in stable k-median instances.
@INPROCEEDINGS{GuRoSe14,
author = {R. Gupta and T. Roughgarden and C. Seshadhri},
title = {Decompositions of Triangle-Dense Graphs},
booktitle = {Proceedings of the Innovations in Theoretical Computer Science (ITCS)},
year = {2014}
}
- • A Provably-Robust Sampling Method for Generating Colormaps of Large Data
- David Thompson and Janine Bennett and C. Seshadhri and Ali Pinar
- IEEE Symposium on Large-Scale Data and Visualization (LDAV) 2013
- abstract
    conf
First impressions from initial renderings of data are crucial for directing
further exploration and analysis. In most visualization systems, default colormaps
are generated by simply linearly interpolating color in some space based on a
value's placement between the minimum and maximum taken on by the dataset. We
design a simple sampling-based method for generating colormaps that highlights important
features. We use random sampling to determine
the distribution of values observed in the data. The sample size required is
independent of the dataset size and only depends on certain accuracy parameters.
This leads to a computationally cheap and robust algorithm for colormap
generation. Our approach (1) uses perceptual color distance to produce palettes
from color curves, (2) allows the user to either emphasize or de-emphasize prominent
values in the data, (3) uses quantiles to map distinct colors to values based
on their frequency in the dataset, and (4) supports the highlighting of either
inter- or intra-mode variations in the data.
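A small Python sketch of the quantile-based color assignment (uniform sampling with a sample size independent of the dataset; helper names are ours):

import bisect
import random

def quantile_color_stops(values, palette, num_samples=2000):
    # Estimate k-1 quantile thresholds from a small uniform sample, so
    # each of the k palette colors covers an equal fraction of the
    # observed values (rather than an equal slice of [min, max]).
    k = len(palette)
    sample = sorted(random.choice(values) for _ in range(num_samples))
    return [sample[min(int(i * num_samples / k), num_samples - 1)]
            for i in range(1, k)]

def color_of(v, stops, palette):
    return palette[bisect.bisect_right(stops, v)]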
@INPROCEEDINGS{ThBe+13,
author = {D. Thompson and J. Bennett and C. Seshadhri and A. Pinar},
title = {A Provably-Robust Sampling Method for Generating Colormaps of Large Data},
booktitle = {{Proceedings of the IEEE Symposium on Large-Scale Data and Visualization (LDAV)}},
year = {2013}
}
- • An Optimal Lower Bound for Monotonicity Testing over Hypergrids
- Deeparnab Chakrabarty and C. Seshadhri
- International Workshop on Randomization and Computation (RANDOM) 2013
- abstract
    conf
    arxiv:1304.5264     ECCC-TR13-062
For positive integers $n, d$, consider the hypergrid $[n]^d$ with the coordinate-wise product partial ordering denoted by $\prec$. A function $f: [n]^d \mapsto \mathbb{N}$ is monotone if $\forall x \prec y$, $f(x) \leq f(y)$. A function $f$ is $\eps$-far from monotone if at least an $\eps$-fraction of values must be changed to make $f$ monotone. Given a parameter $\eps$, a \emph{monotonicity tester} must distinguish with high probability a monotone function from one that is $\eps$-far.
We prove that any (adaptive, two-sided) monotonicity tester for functions $f:[n]^d \mapsto \mathbb{N}$ must make $\Omega(\eps^{-1}d\log n - \eps^{-1}\log \eps^{-1})$ queries. Recent upper bounds show the existence of $O(\eps^{-1}d \log n)$ query monotonicity testers for hypergrids. This closes the question of monotonicity testing for hypergrids over arbitrary ranges. The previous best lower bound for general hypergrids was a non-adaptive bound of $\Omega(d \log n)$.
@INPROCEEDINGS{ChSe13-2,
author = {D. Chakrabarty and C. Seshadhri},
title = {An optimal lower bound for monotonicity testing over hypergrids},
booktitle = {{Proceedings of the International Workshop on Randomization and Computation (RANDOM)}},
year = {2013}
}
}
- • A Space Efficient Streaming Algorithm for Triangle Counting using the Birthday Paradox
- Madhav Jha, C. Seshadhri, and Ali Pinar
- SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2013 (Best student paper)
- abstract
    conf
    arxiv:1212.2264
Modern massive graphs such as the web and communication networks are truly \emph{dynamic graphs}. The graph is not a static object, but is most accurately characterized as a \emph{stream of edges}. How does one store a small sample of the past and still answer meaningful questions about the entire graph? This is the aim of a sublinear space streaming algorithm. We study the problem of counting the number of triangles (or estimating the clustering coefficient) in a massive streamed graph.
We provide an $O(\sqrt{n})$-space algorithm that provably approximates the clustering coefficient with only a single pass through a graph of $n$ vertices. Our procedure is based on a classic probabilistic result, \emph{the birthday paradox}. While streaming algorithms for triangle counting have been designed before, none of these provide comparable theoretical (or empirical) results.
Using some probabilistic heuristics, we design a practical algorithm based on these ideas. Our algorithm can compute accurate estimates for the clustering coefficient and number of triangles for a graph using only a single pass over the edges. The storage of the algorithm is a tiny fraction of the graph. For example, even for a graph with 200 million edges, our algorithm stores just 50,000 edges to give accurate results. Being a single pass streaming algorithm, our procedure also maintains \emph{a real-time estimate} of the clustering coefficient/number of triangles of a graph, by storing a minuscule fraction of edges.
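A single-pass Python sketch in the spirit of the algorithm (the paper's version tracks extra counters for unbiased scaling, and a real implementation would index the closure check; `edge_stream` is any iterable of edges):

import random

def streaming_transitivity(edge_stream, edge_cap=10000, wedge_cap=10000):
    edge_res, wedges, closed, t, total_wedges = [], [], [], 0, 0
    for u, v in edge_stream:
        t += 1
        for idx, (a, b, c) in enumerate(wedges):
            if {u, v} == {a, c}:          # this edge closes stored wedge a-b-c
                closed[idx] = True
        # reservoir-sample the edge
        if len(edge_res) < edge_cap:
            edge_res.append((u, v))
        elif random.random() < edge_cap / t:
            edge_res[random.randrange(edge_cap)] = (u, v)
        else:
            continue
        # reservoir-sample the new wedges the edge forms with stored edges
        for x, y in edge_res:
            common = {u, v} & {x, y}
            if len(common) == 1 and {u, v} != {x, y}:
                b = common.pop()
                a, c = ({u, v} - {b}).pop(), ({x, y} - {b}).pop()
                total_wedges += 1
                if len(wedges) < wedge_cap:
                    wedges.append((a, b, c)); closed.append(False)
                elif random.random() < wedge_cap / total_wedges:
                    i = random.randrange(wedge_cap)
                    wedges[i], closed[i] = (a, b, c), False
    # only the wedge whose two edges arrive first can be closed later,
    # so roughly a (transitivity/3) fraction of stored wedges is marked
    return 3 * sum(closed) / max(len(closed), 1)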
@INPROCEEDINGS{JhSePi13,
author = {Jha, Madhav and Seshadhri, C. and Pinar, Ali},
title = {A space efficient streaming algorithm for triangle counting using
the birthday paradox},
booktitle = {Proceedings of the 19th ACM SIGKDD international conference on Knowledge
discovery and data mining},
year = {2013},
series = {KDD '13},
pages = {589--597}
}
- • A Scalable Null Model for Directed Graphs Matching All Degree Distributions: In, Out, and Reciprocal
- Nurcan Durak, Tamara G. Kolda, Ali Pinar, C. Seshadhri
- IEEE Workshop on Network Science (NSW) 2013
- abstract
    conf
    arxiv:1210.5288
Degree distributions are arguably the most important property of real-world networks. The classic edge configuration model or Chung-Lu model can generate an undirected graph with any desired degree distribution. This serves as an excellent null model to compare algorithms or perform experimental studies that distinguish real networks from this baseline. Furthermore, there are scalable algorithms that implement these models and they are invaluable in the study of graphs. However, networks in the real-world are often directed, and have a significant proportion of reciprocal edges. A stronger relation exists between two nodes when they each point to one another (reciprocal edge) as compared to when only one points to the other (one-way edge). Despite their importance, reciprocal edges have been disregarded by most directed graph models. We propose a null model for directed graphs inspired by the Chung-Lu model that matches the in-, out-, and reciprocal-degree distributions of real graphs.
Our algorithm is scalable and requires O(m) random numbers to generate a graph with m edges. We perform a series of experiments on real datasets and compare with existing graph models.
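A Python sketch of the generator's structure (simplified; the paper is careful about degree-distribution fidelity and collision handling):

import random

def directed_cl_reciprocal(d_in, d_out, d_rec):
    # Spend reciprocal degree first: both endpoints drawn proportionally
    # to reciprocal degree, each success adding both (u,v) and (v,u).
    n = len(d_in)
    edges = set()
    rec_pool = [v for v in range(n) for _ in range(d_rec[v])]
    for _ in range(sum(d_rec) // 2):
        u, v = random.choice(rec_pool), random.choice(rec_pool)
        if u != v:
            edges.add((u, v)); edges.add((v, u))
    # One-way edges: source drawn by out-degree, target by in-degree.
    src = [v for v in range(n) for _ in range(d_out[v])]
    dst = [v for v in range(n) for _ in range(d_in[v])]
    for _ in range(len(src)):
        u, v = random.choice(src), random.choice(dst)
        if u != v:
            edges.add((u, v))
    return edges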
@inproceedings{DuKoPiSe12,
author = {Nurcan Durak and Tamara G. Kolda and Ali Pinar and C. Seshadhri},
title = {A scalable directed graph model with reciprocal edges},
booktitle = {IEEE Workshop on Network Science},
year = {2013}
}
- • An o(n) Monotonicity Tester for Boolean Functions over the Hypercube
- Deeparnab Chakrabarty and C. Seshadhri
- Symposium on Theory of Computing (STOC) 2013 (Invited to special issue)
- abstract
    conf
    ECCC-TR13-029
Given oracle access to a Boolean function $f:\{0,1\}^n \mapsto \{0,1\}$, we design a randomized tester that takes as input a parameter $\eps>0$, and outputs {\sf Yes} if the function is monotone, and outputs {\sf No} with probability $>2/3$ if the function is $\eps$-far from monotone. That is, $f$ needs to be modified at an $\eps$-fraction of the points to make it monotone. Our non-adaptive, one-sided tester makes $\widetilde{O}(n^{5/6}\eps^{-5/3})$ queries to the oracle.
@inproceedings{ChSe13a,
author = {D. Chakrabarty and C. Seshadhri},
title = {An o(n) monotonicity tester for Boolean functions over the hypercube},
booktitle = {ACM Symposium on Theory of Computing (STOC)},
year = {2013}
}
- • Optimal Bounds for Monotonicity and Lipschitz Testing over Hypercubes and Hypergrids
- Deeparnab Chakrabarty and C. Seshadhri
- Symposium on Theory of Computing (STOC) 2013
- abstract
    conf
    ECCC-TR12-030
The problem of monotonicity testing over the hypergrid and its special case, the hypercube, is a classic, well-studied, yet unsolved question in property testing. We are given query access to $f:[k]^n \mapsto \R$
(for some ordered range $\R$).
The hypergrid/cube has a natural partial order given by coordinate-wise ordering, denoted by $\prec$.
A function is \emph{monotone} if for all pairs $x \prec y$,
$f(x) \leq f(y)$. The distance to monotonicity, $\eps_f$, is the minimum fraction of values of $f$
that need to be changed to make $f$ monotone.
For $k=2$ (the boolean hypercube), the usual tester is the \emph{edge tester}, which
checks monotonicity on adjacent pairs of domain points.
It is known that the edge tester using $O(\eps^{-1}n\log|\R|)$
samples can distinguish a monotone function from one where $\eps_f > \eps$.
On the other hand, the best lower bound for monotonicity testing over the hypercube is $\min(|\R|^2,n)$.
This leaves
a quadratic gap in our knowledge, since $|\R|$ can be $2^n$.
We resolve this long-standing open problem and prove that $O(n/\eps)$ samples suffice
for the edge tester. For hypergrids, known testers require $O(\eps^{-1}n\log k\log |\R|)$ samples, while
the best known (non-adaptive) lower bound is $\Omega(\eps^{-1} n\log k)$. We give
a (non-adaptive) monotonicity tester for hypergrids running in $O(\eps^{-1} n\log k)$ time.
Our techniques lead to optimal property testers (with the same running time) for the natural \emph{Lipschitz property}
on hypercubes and hypergrids. (A $c$-Lipschitz function
is one where $|f(x) - f(y)| \leq c\|x-y\|_1$.)
In fact, we give a general unified proof for $O(\eps^{-1}n\log k)$-query testers for a class of ``bounded-derivative" properties,
a class containing both monotonicity and Lipschitz.
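The boolean-hypercube edge tester discussed above is only a few lines; a Python sketch (`f` takes a 0/1 tuple):

import random

def edge_tester(f, n, num_queries):
    # Sample a random hypercube edge: a point x and a coordinate i,
    # compared with x_i set to 0 and to 1.  Reject on any violation;
    # monotone functions are never rejected (one-sided error).
    for _ in range(num_queries):
        x = [random.randint(0, 1) for _ in range(n)]
        i = random.randrange(n)
        lo, hi = list(x), list(x)
        lo[i], hi[i] = 0, 1
        if f(tuple(lo)) > f(tuple(hi)):
            return "No"
    return "Yes"

Per the theorem above, num_queries = O(n/eps) suffices to catch functions that are eps-far from monotone.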
@inproceedings{ChSe13b,
author = {D. Chakrabarty and C. Seshadhri},
title = {Optimal Bounds for Monotonicity and Lipschitz Testing over Hypercubes and Hypergrids},
booktitle = {ACM Symposium on Theory of Computing (STOC)},
year = {2013}
}
- • Triadic Measures on Graphs: The Power of Wedge Sampling
- C. Seshadhri, Ali Pinar, and Tamara G. Kolda
- SIAM Conference on Data Mining (SDM) 2013 (Best research paper)
- abstract
    conf
    arxiv:1202.5230
There is a growing need to quickly assess the structure of a graph.
Some of the most useful graph metrics, especially those measuring social cohesion,
are based on triangles.
Despite the importance of these triadic measures, associated algorithms can be extremely expensive.
We discuss the method of wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and
enables rapid uniform sampling of the triangles of a graph.
Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the state-of-the-art, while providing nearly the accuracy of full enumeration.
Our results will enable more wide-scale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several real-world examples.
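The wedge-sampling estimator itself is very short; a Python sketch with `adj` a dict from vertex to neighbor set:

import random

def wedge_sampling_transitivity(adj, num_samples):
    # A uniform random wedge: choose the center v with probability
    # proportional to (d_v choose 2), then two distinct random
    # neighbors.  The closed fraction estimates the transitivity.
    centers = list(adj)
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for v in random.choices(centers, weights=weights, k=num_samples):
        a, b = random.sample(list(adj[v]), 2)
        closed += (b in adj[a])
    return closed / num_samples

Multiplying the estimate by the total wedge count (the sum of the weights) and dividing by 3 converts it into a triangle-count estimate, and the closed sampled wedges are themselves (nearly) uniform random triangles.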
@inproceedings{SePiKo13,
author = {C. Seshadhri and Ali Pinar and Tamara G. Kolda},
title = {Triadic Measures on Graphs: The Power of Wedge Sampling},
booktitle = {SIAM Conference on Data Mining (SDM)},
year = {2013}
}
- • Space Efficient Streaming Algorithms for the Distance to Monotonicity and Asymmetric Edit Distance
- Michael Saks and C. Seshadhri
- Symposium on Discrete Algorithms (SODA) 2013
- abstract
    conf
    arxiv:1204.1098
Approximating the length of the longest increasing sequence (LIS) of a data stream is a well-studied problem. There are many algorithms that estimate the size of the complement of the LIS, referred to as the \emph{distance to monotonicity}, both in the streaming and property testing setting. Let $n$ denote the size of an input array. Our aim is to develop a one-pass streaming algorithm that accurately approximates the distance to monotonicity, and only uses polylogarithmic storage. For any $\delta > 0$, our algorithm provides a $(1+\delta)$-multiplicative approximation for the distance, and uses only $O((\log^2 n)/\delta)$ space. The previous best known approximation using poly-logarithmic space was a multiplicative 2-factor. Our algorithm is simple and natural, being just 3 lines of pseudocode. It is essentially a polylogarithmic space implementation of a classic dynamic program that computes the LIS.
Our technique is more general and is applicable to other problems that are exactly solvable by dynamic programs. We are able to get a streaming algorithm for the longest common subsequence problem (in the asymmetric setting introduced in \cite{AKO10}) whose space is small on instances where no symbol appears very many times. Consider two strings (of length $n$) $x$ and $y$. The string $y$ is known to us, and we only have streaming access to $x$. The size of the complement of the LCS is the edit distance between $x$ and $y$ with only insertions and deletions. If no symbol occurs more than $k$ times in $y$, we get a $O(k(\log^2 n)/\delta)$-space streaming algorithm that provides a $(1+\delta)$-multiplicative approximation for the LCS complement. In general, we also provide a deterministic 1-pass streaming algorithm that outputs a $(1+\delta)$-multiplicative approximation for the LCS complement and uses $O(\sqrt{(n\log n)/\delta})$ space.
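For reference, the exact offline quantity being approximated is $n$ minus the length of the longest non-decreasing subsequence; a Python sketch of the classic $O(n\log n)$ computation (the paper's contribution is approximating this in polylogarithmic space, in one pass):

import bisect

def distance_to_monotonicity(arr):
    # tails[k] = smallest possible tail of a non-decreasing subsequence
    # of length k+1; bisect_right allows repeated values.
    tails = []
    for x in arr:
        k = bisect.bisect_right(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(arr) - len(tails)      # minimum number of deletions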
@inproceedings{SaSe13,
author = {M. Saks and C. Seshadhri},
title = {Space Efficient Streaming Algorithms for the Distance to Monotonicity and Asymmetric Edit Distance},
booktitle = {Symposium on Discrete Algorithms (SODA)},
year = {2013}
}
- • Finding Cycles and Trees in Sublinear Time
- Artur Czumaj, Oded Goldreich, Dana Ron, C. Seshadhri, Asaf Shapira, and Christian Sohler
- Random Structures and Algorithms 2012
- abstract
    arxiv:1007.4230
We present sublinear-time (randomized) algorithms for finding simple cycles of length at least $k\geq 3$ and tree-minors in bounded-degree graphs. The complexity of these algorithms is related to the distance of the graph from being $C_k$-minor-free (resp., free from having the corresponding tree-minor). In particular, if the graph is far (i.e., $\Omega(1)$-far) from being cycle-free, i.e., if one has to delete a constant fraction of edges to make it cycle-free, then the algorithm finds a cycle of polylogarithmic length in time $\tildeO(\sqrt{N})$, where $N$ denotes the number of vertices. This time complexity is optimal up to polylogarithmic factors.
The foregoing results are the outcome of our study of the complexity of {\em one-sided error} property testing algorithms in the bounded-degree graphs model. For example, we show that cycle-freeness of $N$-vertex graphs can be tested with one-sided error within time complexity $\tildeO(\poly(1/\e)\cdot\sqrt{N})$. This matches the known $\Omega(\sqrt{N})$ query lower bound, and contrasts with the fact that any minor-free property admits a {\em two-sided error} tester of query complexity that only depends on the proximity parameter $\e$. For any constant $k\geq3$, we extend this result to testing whether the input graph has a simple cycle of length at least~$k$. On the other hand, for any fixed tree $T$, we show that $T$-minor-freeness has a one-sided error tester of query complexity that only depends on the proximity parameter $\e$.
Our algorithm for finding cycles in bounded-degree graphs extends to general graphs, where distances are measured with respect to the actual number of edges. Such an extension is not possible with respect to finding tree-minors in $o(\sqrt{N})$ complexity.
@article{CGR+10,
author = {Artur Czumaj and Oded Goldreich and Dana Ron and C. Seshadhri and Asaf Shapira and Christian Sohler},
title = {Finding Cycles and Trees in Sublinear Time},
journal = {Random Structures and Algorithms},
year = {2012},
note = {DOI: 10.1002/rsa.20462}
}
- • Degree Relations of Triangles in Real-world Networks and Graph Models
- Nurcan Durak, Ali Pinar, Tamara G. Kolda, and C. Seshadhri
- Conference on Information and Knowledge Management (CIKM) 2012
- abstract
    conf
    arxiv:1207.7125
Triangles are an important building block and distinguishing feature
of real-world networks, but their structure is still poorly
understood. Despite numerous reports on the abundance of triangles,
there is very little information on what these triangles look like.
We initiate the study of \emph{degree-labeled triangles} ---
specifically, degree homogeneity versus heterogeneity in triangles.
This yields new insight into the structure of real-world graphs.
We observe that networks coming from social and collaborative situations are dominated
by homogeneous triangles, i.e., degrees of vertices in a triangle are quite similar to each other.
On the other hand, information networks (e.g., web graphs) are dominated by heterogeneous triangles,
i.e., the degrees in triangles are quite disparate. Surprisingly,
nodes within the top 1\% of degrees participate in the \emph{vast majority}
of triangles in heterogeneous graphs.
We investigate whether current graph models
reproduce the types of triangles that are observed in real data and
observe that most models fail to accurately capture these salient features.
@inproceedings{DPKS12,
author = {N. Durak and A. Pinar and T. G. Kolda and C. Seshadhri},
title = {Degree Relations of Triangles in Real-world Networks and Graph Models},
booktitle = {ACM International Conference on Information and Knowledge Management},
year = {2012}
}
- • Vertex Neighborhoods, Low Conductance Cuts, and Good Seeds for Local Community Methods
- David F. Gleich and C. Seshadhri
- SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2012
- abstract
    conf
    arxiv:1112.0031 (old version)     Code and data
The communities of a social network are sets of vertices
with more connections inside the set than outside.
We theoretically demonstrate that two commonly
observed properties of social networks, heavy-tailed
degree distributions and large clustering coefficients,
imply the existence of vertex neighborhoods (also known
as egonets) that are themselves good communities. We
evaluate these neighborhood communities on a range of graphs.
What we find is that the neighborhood communities can
exhibit conductance scores that are as good as the Fiedler
cut. Also, the conductance of neighborhood communities
shows similar behavior as the network community profile
computed with a personalized PageRank community detection
method. Neighborhood communities give us a simple and powerful
heuristic for speeding up local partitioning methods.
Since finding good seeds for the PageRank clustering
method is difficult, most approaches involve an expensive sweep over a great many
starting vertices. We show how to use
neighborhood communities to quickly generate a small set of
seeds.
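A Python sketch of the egonet-seed heuristic (conductance of S = {v} plus its neighbors; `adj` maps each vertex to its neighbor set, and we assume the egonet is a strict subset of the graph):

def neighborhood_conductance(adj, v, total_volume):
    S = {v} | set(adj[v])
    vol_S = sum(len(adj[u]) for u in S)
    cut = sum(1 for u in S for w in adj[u] if w not in S)
    return cut / min(vol_S, total_volume - vol_S)

def best_egonet_seed(adj):
    # Scan all egonets and return the vertex whose neighborhood has the
    # lowest conductance -- a cheap seed for local clustering methods.
    total_volume = sum(len(ns) for ns in adj.values())
    return min(adj, key=lambda v: neighborhood_conductance(adj, v, total_volume))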
@inproceedings{GS12,
author = {David F. Gleich and C. Seshadhri},
title = {Vertex Neighborhoods, Low Conductance Cuts, and Good Seeds for Local Community Methods },
booktitle = {SIGKDD Conference on Knowledge Discovery and Data Mining},
year = {2012}
}
- • Self-improving Algorithms for Coordinate-wise Maxima
- Ken Clarkson, Wolfgang Mulzer, and C. Seshadhri
- Symposium on Computational Geometry (SoCG) 2012
- SIAM Journal on Computing 2014
- abstract
    conf
    journal(arxiv:1211.0952)
Computing the coordinate-wise maxima of a planar point set
is a classic and well-studied problem in computational geometry.
We give an algorithm for this problem in the \emph{self-improving setting}.
We have $n$ (unknown) independent distributions
$\cD_1, \cD_2, \ldots, \cD_n$ of planar points. An input pointset
$(p_1, p_2, \ldots, p_n)$
is generated by taking an independent sample $p_i$ from each $\cD_i$, so
the input distribution $\cD$ is the product $\prod_i \cD_i$. A self-improving
algorithm repeatedly gets input sets from the distribution $\cD$ (which is
\emph{a priori} unknown) and tries to
optimize its running time for $\cD$. Our algorithm uses the first few
inputs to learn salient features of the distribution, and then becomes an
optimal algorithm
for distribution $\cD$. Let $\OPT_\cD$ denote the expected depth of
an \emph{optimal} linear comparison tree computing the maxima for
distribution $\cD$. Our algorithm
eventually has an expected running time of $O(\OPT_\cD + n)$,
even though it did not know $\cD$ to begin with.
Our result requires new tools to understand linear comparison trees
for computing maxima.
We show how to convert general linear comparison trees to very restricted
versions, which can then be
related to the running time of our algorithm. An interesting feature of our
algorithm
is an interleaved search, where the algorithm tries to determine the likeliest
point to be maximal with minimal computation. This allows the running time to be
truly optimal for the distribution $\cD$.
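For concreteness, the textbook (distribution-oblivious) computation that the self-improving algorithm competes with, as a Python sketch (assuming distinct coordinates):

def maxima(points):
    # Sort by x descending and sweep: a point is maximal iff its y
    # beats every y seen among points with larger x.
    out, best_y = [], float("-inf")
    for x, y in sorted(points, reverse=True):
        if y > best_y:
            out.append((x, y))
            best_y = y
    return out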
@inproceedings{CMS12,
author = {Ken Clarkson and Wolfgang Mulzer and C. Seshadhri},
title = {Self-improving Algorithms for Coordinate-wise Maxima},
booktitle = {28th Symposium on Computational Geometry},
year = {2012}
}
- • Community Structure and Scale-free Collections of Erdos-Renyi Graphs
- C. Seshadhri, Tamara G. Kolda, and Ali Pinar
- Physical Review E 2012
- abstract
    arxiv:1112.3644
    journal
    Code and data
Community structure plays a significant role in the analysis of social
networks and similar graphs, yet this structure is little understood
and not well captured by most models.
We formally define a community to be a subgraph that is
internally highly connected and has no deeper substructure.
We use tools of combinatorics to show that any such community must
contain a dense Erd\H{o}s--R\'enyi~(ER) subgraph.
Based on mathematical arguments, we hypothesize that any graph
with a heavy-tailed degree distribution and community structure
must contain a scale-free collection of dense ER subgraphs.
These theoretical observations corroborate well with
empirical evidence.
From this, we propose the Block Two-Level Erd\H{o}s--R\'enyi (BTER) model,
and demonstrate that it accurately captures the observable
properties of many real-world social networks.
@article{SKP11,
author = {C. Seshadhri and Tamara G. Kolda and Ali Pinar},
title = {Community structure and scale-free collections of Erd\H{o}s-R\'enyi graphs},
journal = {Physical Review E},
year = {2012},
volume = {85},
issue = {5},
number = {056109},
url = {http://pre.aps.org/abstract/PRE/v85/i5/e056109}
}
- • Are we there yet? When to stop a Markov Chain while Generating Random Graphs
- Jaideep Ray, Ali Pinar, C. Seshadhri
- Workshop on Algorithms and Models for the Web Graph (WAW) 2012
- abstract
    conf
    arxiv:1202.3473
Markov chains are a convenient means of generating realizations of networks, since they require little more than a procedure for rewiring edges. If a rewiring procedure exists for
generating new graphs with specified statistical properties, then a Markov chain sampler can generate an ensemble of graphs with prescribed characteristics. However, successive
graphs in a Markov chain cannot be used when one desires independent draws from the distribution of graphs; the realizations are correlated. Consequently, one runs a Markov chain for
N iterations before accepting the realization as an independent sample. In this work, we devise two methods for calculating N. They are both based on the binary "time-series"
denoting the occurrence/non-occurrence of edge (u, v) between vertices u and v in the Markov chain of graphs generated by the sampler. They differ in their underlying assumptions. We
test them on the generation of graphs with a prescribed joint degree distribution. We find that N is proportional to |E|, where |E| is the number of edges in the graph. The two methods are
compared by sampling on real, sparse graphs with 10^3 - 10^4 vertices.
@inproceedings{RPS12,
author = {Jaideep Ray and Ali Pinar and C. Seshadhri},
title = {Are we there yet? When to stop a Markov chain while generating random graphs},
booktitle = {9th Workshop on Algorithms and Models for the Web Graph},
pages = {153-164},
year = {2012}
}
- • The Similarity between Stochastic Kronecker and Chung-Lu Graph Models
- Ali Pinar, C. Seshadhri, and Tamara G. Kolda
- SIAM Conference on Data Mining (SDM) 2012
- abstract
    conf
    arxiv:1110.4925
The analysis of massive graphs is now becoming a very important part of science and industrial research. This has led to the construction of a large variety of graph models, each with their own advantages. The Stochastic Kronecker Graph (SKG) model
has been chosen by the Graph500 steering committee to create supercomputer benchmarks for graph algorithms. The major reasons for this are its easy parallelization and ability to mirror real data. Although SKG is easy to implement, there is little
understanding of the properties and behavior of this model.
We show that the parallel variant of the edge-configuration model given by Chung and Lu (referred to as CL) is notably similar to the SKG model. The graph properties of an SKG are extremely close to those of a CL graph generated with the appropriate
parameters. Indeed, the final probability matrix used by SKG is almost identical to that of a CL model. This implies that the graph distribution represented by SKG is almost the same as that given by a CL model. We also show that when it comes to
fitting real data, CL performs as well as SKG based on empirical studies of graph properties. CL has the added benefit of a trivially simple fitting procedure and exactly matching the degree distribution. Our results suggest that users of the SKG
model should consider the CL model because of its similar properties, simpler structure, and ability to fit a wider range of degree distributions. At the very least, CL is a good control model to compare against.
@inproceedings{PiSeKo11,
author = {Ali Pinar and C. Seshadhri and Tamara G. Kolda},
title = {The Similarity between Stochastic Kronecker and Chung-Lu Graph Models},
booktitle = {Proceedings of 12th SIAM International Conference on Data Mining},
pages = {1071--1082},
year = {2012}
}
- • An In-Depth Analysis of Stochastic Kronecker Graphs
- C. Seshadhri, Ali Pinar, and Tamara G. Kolda
- International Conference on Data Mining (ICDM) 2011
- Journal of the ACM 2013
- abstract
    conf
    journal (arxiv:1102.5046)
Graph analysis is playing an increasingly important role in science and industry. Due to numerous limitations in sharing real-world graphs, models for generating massive graphs are critical for developing better algorithms. In this paper, we analyze the stochastic Kronecker graph model (SKG), which is the foundation of the Graph500 supercomputer benchmark due to its favorable properties and easy parallelization. Our goal is to provide a deeper understanding of the parameters and properties of this model so that its functionality as a benchmark is increased. We develop a rigorous mathematical analysis that shows this model cannot generate a power-law distribution or even a lognormal distribution. However, we formalize an enhanced version of the SKG model that uses random noise for smoothing. We prove both in theory and in practice that this enhancement leads to a lognormal distribution. Additionally, we provide a precise analysis of isolated vertices, showing that the graphs that are produced by SKG might be quite different than intended. For example, between 50% and 75% of the vertices in the Graph500 benchmarks will be isolated. Finally, we show that this model tends to produce extremely small core numbers (compared to most social networks and other real graphs) for common parameter choices.
These results have been communicated to the Graph500 steering committee, and they are currently using our theorems to set the benchmark parameters. Our noisy version of SKG is likely to become the Graph500 standard for next year's benchmark.
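A Python sketch of SKG edge generation plus a simple noise scheme in the spirit of the paper's smoothed variant (the paper's exact perturbation differs; (0.57, 0.19, 0.19) are Graph500-style parameters):

import random

def skg_edge(scale, a=0.57, b=0.19, c=0.19, noise=0.0):
    # Descend `scale` levels of the 2^scale x 2^scale adjacency matrix,
    # picking a quadrant with probabilities (a, b, c, d) at each level.
    u = v = 0
    for _ in range(scale):
        a_l, d_l = a, 1.0 - a - b - c
        if noise:                     # shift mass between a and d
            delta = random.uniform(-noise, noise)
            a_l, d_l = a_l + delta, d_l - delta
        r = random.random()
        u, v = u << 1, v << 1
        if r < a_l:
            pass                      # top-left quadrant
        elif r < a_l + b:
            v |= 1                    # top-right
        elif r < a_l + b + c:
            u |= 1                    # bottom-left
        else:
            u |= 1; v |= 1            # bottom-right
    return u, v

edges = [skg_edge(20, noise=0.05) for _ in range(1000)]   # example usage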
@techreport{SPK11,
author = {C. Seshadhri and Ali Pinar and Tamara G. Kolda},
title = {A Hitchhiker's Guide to Choosing Parameters of Stochastic Kronecker Graphs},
institution = {Arxiv},
year = {2011},
number = {1102.5046},
}
- • Influence and Dynamic Behavior in Random Boolean Networks
- C. Seshadhri, Yevgeniy Vorobeychik, Jackson R. Mayo, Robert C. Armstrong, and Joseph R. Ruthruff
- Physical Review Letters 2011
- abstract
    journal (arxiv:1107.3792)
We present a rigorous mathematical framework for analyzing dynamics of a broad class of Boolean network
models. We use this framework to provide the first formal proof of many of the standard critical transition
results in Boolean network analysis, and offer analogous characterizations for novel classes of random Boolean
networks. We show that some of the assumptions traditionally made in the more common mean-field analysis
of Boolean networks do not hold in general. For example, we offer evidence that imbalance (internal inhomogeneity)
of transfer functions is a crucial feature that tends to drive quiescent behavior far more strongly than
previously observed.
@article{SVM+11,
author = {C. Seshadhri and
Y. Vorobeychik and J. Mayo and R. Armstrong and J. Ruthruff},
title = {Influence and Dynamic Behavior in Random Boolean Networks},
journal = {Physical Review Letters},
volume = {107},
number = {10},
year = {2011},
pages = {108701}
}
- • Blackbox Identity Testing for Bounded Top Fanin Depth-3 Circuits: the Field doesn't matter
- Nitin Saxena and C. Seshadhri
- Symposium on Theory of Computing (STOC) 2011 (Invited to special issue)
- SIAM Journal on Computing 2012
- abstract
    conf
    ECCC TR10-167
    journal
Let C be a depth-3 circuit with n variables, degree d and top fanin k (called sps(k,d,n) circuits) over base field F.
It is a major open problem to design a deterministic polynomial time blackbox algorithm
that tests if C is identically zero.
Klivans & Spielman (STOC 2001) observed that the problem
is open even when k is a constant.
This case has been subjected to a serious study over the past few years, starting
from the work of Dvir & Shpilka (STOC 2005).
We give the first polynomial time blackbox algorithm for this problem. Our algorithm
runs in time poly(nd^k), regardless of the base field. The *only* field
for which polynomial time algorithms were previously known
is F=Q (Kayal & Saraf, FOCS 2009, and Saxena & Seshadhri, FOCS 2010).
This is the first blackbox algorithm for depth-3 circuits that does not use
the rank-based approaches of Karnin & Shpilka (CCC 2009).
We also develop an important tool for the study of depth-3 identities. We design
a blackbox polynomial time transformation that reduces the number of variables
in a sps(k,d,n) circuit to k variables, but preserves the identity structure.
@inproceedings{SaxS11,
author = {N. Saxena and C. Seshadhri},
title = {Blackbox identity testing for bounded top fanin depth-3 circuits: the field doesn't matter},
booktitle = {Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC)},
year = {2011}
}
- • Combinatorial Approximation Algorithms for MaxCut using Random Walks
- Satyen Kale and C. Seshadhri
- Innovations in Computer Science (ICS) 2011
- abstract
    conf
    arxiv:1008.3938
We give the first combinatorial approximation algorithm for Maxcut that beats
the trivial 0.5 factor by a constant. The main partitioning procedure is very
intuitive, natural, and easily described. It essentially performs a number of
random walks and aggregates the information to provide the partition. We can
trade off the running time against the approximation factor.
We show that for any constant b > 1.5, there is an O(n^{b}) algorithm that
outputs a (0.5+delta)-approximation for Maxcut, where delta = delta(b) is some
positive constant.
One of the components of our algorithm is a weak local graph partitioning
procedure that may be of independent interest. Given a starting vertex $i$ and
a conductance parameter phi, unless a random walk of length ell = O(log n)
starting from i mixes rapidly (in terms of phi and ell), we can find a cut of
conductance at most phi close to the vertex. The work done per vertex found in
the cut is sublinear in n.
@inproceedings{KalS10,
author = {Satyen Kale and C. Seshadhri},
title = {Combinatorial Approximation Algorithms for MaxCut using Random Walks},
booktitle = {Proceedings of the Innovations in Computer Science (ICS)},
year = {2011},
pages = {367--388}
}
- • Is Submodularity Testable?
- C. Seshadhri and Jan Vondrak
- Innovations in Computer Science (ICS) 2011
- Algorithmica 2012
- abstract
    conf
    arxiv:1008.0831
    journal
We initiate the study of property testing of submodularity on the boolean hypercube. Submodular functions come up in a variety of applications in combinatorial optimization. For a vast range of algorithms, the existence of an oracle to a submodular function is assumed. But how does one check if this oracle indeed represents a submodular function?
Consider a function f:{0,1}^n \rightarrow R. The distance to submodularity is the minimum fraction of values of $f$ that need to be modified to make f submodular. If this distance is more than epsilon > 0, then we say that f is epsilon-far from being submodular. The aim is to have an efficient procedure that, given input f that is epsilon-far from being submodular, certifies that f is not submodular. We analyze a very natural tester for this problem, and prove that it runs in subexponential time. This gives the first non-trivial tester for submodularity. On the other hand, we prove an interesting lower bound (that is, unfortunately, quite far from the upper bound) suggesting that this tester cannot be very efficient in terms of epsilon. This involves non-trivial examples of functions which are far from submodular and yet do not exhibit too many local violations.
We also provide some constructions indicating the difficulty in designing a tester for submodularity. We construct a partial function defined on exponentially many points that cannot be extended to a submodular function, but any strict subset of these values can be extended to a submodular function.
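A Python sketch of the natural tester analyzed in the paper, using the standard square characterization of submodularity (`f` takes a 0/1 tuple):

import random

def submodularity_tester(f, n, num_queries):
    # f is submodular iff for every x with x_i = x_j = 0 (i != j):
    # f(x + e_i) + f(x + e_j) >= f(x) + f(x + e_i + e_j).
    # Sample random such "squares" and reject on any violation.
    for _ in range(num_queries):
        x = [random.randint(0, 1) for _ in range(n)]
        i, j = random.sample(range(n), 2)
        x[i] = x[j] = 0
        def with_ones(*coords):
            y = list(x)
            for cc in coords:
                y[cc] = 1
            return tuple(y)
        if f(with_ones(i)) + f(with_ones(j)) < f(tuple(x)) + f(with_ones(i, j)):
            return "No"
    return "Yes"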
@inproceedings{SesV10,
author = {C. Seshadhri and J. Vondrak},
title = {Is submodularity testable?},
booktitle = {Proceedings of the Innovations in Computer Science (ICS)},
year = {2011},
pages = {195--210}
}
- • Estimating the Longest Increasing Sequence in Polylogarithmic Time
- Michael Saks and C. Seshadhri
- Foundations of Computer Science (FOCS) 2010
- abstract
    conf
    arXiv:1308.0626
Finding the length of the longest increasing subsequence (LIS) is
a classic algorithmic problem. Let $n$ denote the size
of the array. Simple $O(n\log n)$ algorithms
are known for this problem. What can a sublinear time algorithm
achieve? We develop a polylogarithmic time
randomized algorithm that for any constant $\delta > 0$,
estimates the length of the LIS of an array up to an additive error of $\delta n$.
More precisely, the running time of the algorithm
is $(\log n)^c (1/\delta)^{O(1/\delta)}$ where the exponent $c$
is independent of $\delta$.
Previously, the
best known polylogarithmic time algorithms could only
achieve an additive $n/2$ approximation.
The \emph{distance to monotonicity}, which is the fractional size of the complement
of the LIS, has been studied in depth by the streaming and property testing
communities.
Our polylogarithmic algorithm actually has a stronger guarantee, and gives (for any constant $\delta > 0$) a multiplicative
$(1+\delta)$-approximation to the distance to monotonicity. This is a significant improvement
over the previously known $2$-factor approximations.
@inproceedings{SakS10,
author = {M. E. Saks and
C. Seshadhri},
title = {Estimating the longest increasing sequence in polylogarithmic time},
booktitle = {Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)},
year = {2010},
pages = {458--467},
}
- • From Sylvester-Gallai Configurations to Rank Bounds: Improved Black-box Identity Test for Depth-3 Circuits
- Nitin Saxena and C. Seshadhri
- Foundations of Computer Science (FOCS) 2010
- Journal of the ACM 2013
- abstract
    conf
    ECCC TR10-013
    journal
We study the problem of identity testing for depth-3 circuits of top fanin k and degree d. We give a new structure theorem for such identities. A direct
application of our theorem improves the known deterministic d^{k^k}-time black-box identity test over rationals (Kayal-Saraf, FOCS 2009) to one that takes
d^{k^2}-time. Our structure theorem essentially says that the number of independent variables in a real depth-3 identity is very small. This theorem settles
affirmatively the stronger rank conjectures posed by Dvir-Shpilka (STOC 2005) and Kayal-Saraf (FOCS 2009). Our techniques provide a unified framework that
actually beats all known rank bounds and hence gives the best running time (for *every* field) for black-box identity tests. Our main theorem (almost
optimally) pins down the relation between higher dimensional Sylvester-Gallai theorems and the rank of depth-3 identities in a very transparent manner. The
existence of this was hinted at by Dvir-Shpilka (STOC 2005), but first proven, for reals, by Kayal-Saraf (FOCS 2009). We introduce the concept of
Sylvester-Gallai rank bounds for any field, and show the intimate connection between this and depth-3 identity rank bounds. We also prove the first ever
theorem about high dimensional Sylvester-Gallai configurations over *any* field. Our proofs and techniques are very different from previous results and draw on
a very interesting ensemble of combinatorics and algebra. The latter concepts are ideal theoretic and involve a new Chinese remainder theorem. Our proof
methods explain the structure of *any* depth-3 identity C: there is a nucleus of C that forms a low rank identity, while the remainder is a high dimensional
Sylvester-Gallai configuration.
@inproceedings{SaxS10,
author = {N. Saxena and C. Seshadhri},
title = {From Sylvester-Gallai Configurations to Rank Bounds: Improved Black-box Identity Test for Depth-3 Circuits},
booktitle = {Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)},
year = {2010},
pages = {21--29},
}
- • Self-Improving Algorithms for Convex Hulls
- Ken Clarkson, Wolfgang Mulzer, and C. Seshadhri
- Symposium on Discrete Algorithms (SODA) 2010