Tutorials/notes
• Some vignettes on subgraph counting using graph orientations
C. Seshadhri
International Conference on Database Theory (ICDT) 2023, Invited Tutorial
abstract
Paper
Slides
Subgraph counting is a fundamental problem that spans many areas in computer science: database theory, logic, network science, data mining, and complexity theory. Given a large input graph G and a small pattern graph H, we wish to count the number of occurrences of H in G. In recent times, there has been a resurgence on using an old (maybe overlooked?) technique of orienting the edges of G and H, and then using a combination of bruteforce enumeration and indexing. These orientation techniques appear to give the best of both worlds. There is a rigorous theoretical explanation behind these techniques, and they also have excellent empirical behavior (on large realworld graphs). Time and again, graph orientations help solve subgraph counting problems in various computational models, be it sampling, streaming, distributed, etc. In this paper, we give some short vignettes on how the orientation technique solves a variety of algorithmic problems.
• Tutorial on subgraph counting
C. Seshadhri and Srikanta Tirthapura
Conference on the World Wide Web (WWW) 2019
Tutorial website
Refereed Conference and Journal Papers
 • Spectral Triadic Decompositions of RealWorld Networks
 Sabyasachi Basu. Suman K. Bera, and C. Seshadhri
 SIAM Journal on Mathematics of Data Science (SIMODS) 2024
 abstract
arXiv
A fundamental problem in mathematics and network analysis is to find conditions under which a graph can be partitioned into smaller pieces. The most important tool for this partitioning is the Fiedler vector or discrete Cheeger inequality. These results relate the graph spectrum (eigenvalues of the normalized adjacency matrix) to the ability to break a graph into two pieces, with few edge deletions. An entire subfield of mathematics, called spectral graph theory, has emerged from these results. Yet these results do not say anything about the rich community structure exhibited by realworld networks, which typically have a significant fraction of edges contained in numerous densely clustered blocks. Inspired by the properties of realworld networks, we discover a new spectral condition that relates eigenvalue powers to a network decomposition into densely clustered blocks. We call this the \emph{spectral triadic decomposition}. Our relationship exactly predicts the existence of community structure, as commonly seen in real networked data. Our proof provides an efficient algorithm to produce the spectral triadic decomposition. We observe on numerous social, coauthorship, and citation network datasets that these decompositions have significant correlation with semantically meaningful communities.
 • Improved Massively Parallel Triangle Counting in O(1) Rounds
 Quanquan Liu and C. Seshadhri
 Principles of Distributed Computing (PODC) 2024 (Brief Announcement)
 abstract
arXiv
In this short note, we give a novel algorithm for O(1) round triangle counting in bounded arboricity graphs. Counting triangles in O(1) rounds (exactly) is listed as one of the interesting remaining open problems in the recent survey of Im et al. [IKLMV23]. The previous paper of Biswas et al. [BELMR20], which achieved the best bounds under this setting, used O(loglogn) rounds in sublinear space per machine and O(mα) total space where α is the arboricity of the graph and n and m are the number of vertices and edges in the graph, respectively. Our new algorithm is very simple, achieves the optimal O(1) rounds without increasing the space per machine and the total space, and has the potential of being easily implementable in practice.
 • Link prediction using lowdimensional node embeddings: The measurement problem
 Nicolas Menand and C. Seshadhri
 Proceedings of the National Academy of Sciences (PNAS) 2024
 abstract
PNAS
UCSC Press Release
Graph representation learning is a fundamental technique for machine learning (ML) on complex networks. Given an input network, these methods represent the vertices by lowdimensional realvalued vectors. These vectors can be used for a multitude of downstream ML tasks. We study one of the most important such task, link prediction. Much of the recent literature on graph representation learning has shown remarkable success in link prediction. On closer investigation, we observe that the performance is measured by the AUC (area under the curve), which suffers biases. Since the ground truth in link prediction is sparse, we design a vertexcentric measure of performance, called the VCMPR@k plots. Under this measure, we show that link predictors using graph representations show poor scores. Despite having extremely high AUC scores, the predictors miss much of the ground truth. We identify a mathematical connection between this performance, the sparsity of the ground truth, and the lowdimensional geometry of the node embeddings. Under a formal theoretical framework, we prove that lowdimensional vectors cannot capture sparse ground truth using dot product similarities (the standard practice in the literature). Our results call into question existing results on link prediction and pose a significant scientific challenge for graph representation learning. The VCMPR plots identify specific scientific challenges for link prediction using lowdimensional node embeddings.
 • A $d^{1/2+o(1)}$ Monotonicity Tester for Boolean Functions on $d$Dimensional Hypergrids
 Hadley Black, Deeparnab Chakrabarty, and C. Seshadhri
 Foundations of Computer Science (FOCS) 2023 (Invited to special issue)
 abstract
arxiv:2304.01416
ECCC
Monotonicity testing of Boolean functions on the hypergrid, $f:[n]^d \to \{0,1\}$, is a classic topic in property testing. Determining the nonadaptive complexity of this problem is an important open question. For arbitrary $n$, [BlackChakrabartySeshadhri, SODA 2020] describe a tester with query complexity $\widetilde{O}(\varepsilon^{4/3}d^{5/6})$. This complexity is independent of $n$, but has a suboptimal dependence on $d$. Recently, [BravermanKhotKindlerMinzer, ITCS 2023] and [BlackChakrabartySeshadhri, STOC 2023] describe $\widetilde{O}(\varepsilon^{2} n^3\sqrt{d})$ and $\widetilde{O}(\varepsilon^{2} n\sqrt{d})$query testers, respectively. These testers have an almost optimal dependence on $d$, but a suboptimal polynomial dependence on $n$. \smallskip
In this paper, we describe a nonadaptive, onesided monotonicity tester with query complexity $O(\varepsilon^{2} d^{1/2 + o(1)})$, \emph{independent} of $n$. Up to the $d^{o(1)}$factors, our result resolves the nonadaptive complexity of monotonicity testing for Boolean functions on hypergrids. The independence of $n$ yields a nonadaptive, onesided $O(\varepsilon^{2} d^{1/2 + o(1)})$query monotonicity tester for Boolean functions $f:\mathbb{R}^d \to \{0,1\}$ associated with an arbitrary product measure.
 • Theoretical bounds on the network community profile from lowrank semidefinite programming
 Yufan Huang, C. Seshadhri, and David Gleich
 International Conference on Machine Learning (ICML) 2023
 abstract
arxiv:2303.14550
We study a new connection between a technical measure called μconductance that arises in the study of Markov chains for sampling convex bodies and the network community profile that characterizes sizeresolved properties of clusters and communities in social and information networks. The idea of μconductance is similar to the traditional graph conductance, but disregards sets with small volume. We derive a sequence of optimization problems including a lowrank semidefinite program from which we can derive a lower bound on the optimal μconductance value. These ideas give the first theoretically sound bound on the behavior of the network community profile for a wide range of cluster sizes. The algorithm scales up to graphs with hundreds of thousands of nodes and we demonstrate how our framework validates the predicted structures of realworld graphs.
 • Directed Isoperimetric Theorems for Boolean Functions on the Hypergrid and an $~O(n\sqrt{d})$ Monotonicity Tester
 Hadley Black, Deeparnab Chakrabarty, and C. Seshadhri
 Symposium on Theory of Computing (STOC) 2023
 abstract
arxiv:2211.05281
ECCC
The problem of testing monotonicity for Boolean functions on the hypergrid, $f:[n]^d \to \{0,1\}$ is a classic topic in property testing. When $n=2$, the domain is the hypercube. For the hypercube case, a breakthrough result of KhotMinzerSafra (FOCS 2015) gave a nonadaptive, onesided tester making $\otilde(\eps^{2}\sqrt{d})$ queries. Up to polylog $d$ and $\eps$ factors, this bound matches the $\widetilde{\Omega}(\sqrt{d})$query nonadaptive lower bound (ChenDeServedioTan (STOC 2015), ChenWaingartenXie (STOC 2017)). For any $n > 2$, the optimal nonadaptive complexity was unknown. A previous result of the authors achieves a $\otilde(d^{5/6})$query upper bound (SODA 2020), quite far from the $\sqrt{d}$ bound for the hypercube.
In this paper, we resolve the nonadaptive complexity of monotonicity testing for all constant $n$, up to $\poly(\eps^{1}\log d)$ factors. Specifically, we give a nonadaptive, onesided monotonicity tester making $\otilde(\eps^{2}n\sqrt{d})$ queries. From a technical standpoint, we prove new directed isoperimetric theorems over the hypergrid $[n]^d$. These results generalize the celebrated directed Talagrand inequalities that were only known for the hypercube.
 • Classic Graph Structural Features Outperform FactorizationBased Graph Embedding Methods on Community Labeling
 Andrew Stolman, Caleb Levy, C. Seshadhri, and Aneesh Sharma
 SIAM Conference on Data Mining (SDM) 2022
 abstract
arxiv:2201.08481
Graph representation learning (also called graph embeddings) is a popular technique for incorporating network structure into machine learning models. Unsupervised graph embedding methods aim to capture graph structure by learning a lowdimensional vector representation (the embedding) for each node. Despite the widespread use of these embeddings for a variety of downstream transductive machine learning tasks, there is little principled analysis of the effectiveness of this approach for common tasks. In this work, we provide an empirical and theoretical analysis for the performance of a class of embeddings on the common task of pairwise community labeling. This is a binary variant of the classic community detection problem, which seeks to build a classifier to determine whether a pair of vertices participate in a community. In line with our goal of foundational understanding, we focus on a popular class of unsupervised embedding techniques that learn low rank factorizations of a vertex proximity matrix (this class includes methods like GraRep, DeepWalk, node2vec, NetMF). We perform detailed empirical analysis for community labeling over a variety of real and synthetic graphs with ground truth. In all cases we studied, the models trained from embedding features perform poorly on community labeling. In constrast, a simple logistic model with classic graph structural features handily outperforms the embedding models. For a more principled understanding, we provide a theoretical analysis for the (in)effectiveness of these embeddings in capturing the community structure. We formally prove that popular lowdimensional factorization methods either cannot produce community structure, or can only produce ``unstable" communities. These communities are inherently unstable under small perturbations.
 • FPT Algorithms for Finding NearCliques in cClosed Graphs
 Balaram Behera, Edin Husić, Shweta Jain, Tim Roughgarden, and C. Seshadhri
 Innovations in Theoretical Computer Science (ITCS) 2022
 abstract
arxiv:2007.09786
Finding large cliques or cliques missing a few edges is a fundamental algorithmic task in the
study of realworld graphs, with applications in community detection, pattern recognition, and clustering. A number of effective backtrackingbased heuristics for these problems have emerged from recent empirical work in social network analysis.
Given the $\mathbb{NP}$hardness of variants of clique counting,
these results raise a challenge for \emph{beyond worstcase analysis} of these problems.
Inspired by the triadic closure of realworld graphs,
Fox et al. (SICOMP 2020) introduced the notion of $c$closed graphs and proved
that maximal clique enumeration is fixedparameter tractable with respect to $c$.
In practice, due to noise in data, one wishes to actually discover "nearcliques",
which can be characterized as cliques with a sparse subgraph removed.
In this work, we prove that many different kinds of maximal nearcliques
can be enumerated in polynomial time (and FPT in $c$) for $c$closed graphs.
We study various established notions of such substructures, including $k$plexes, complements of boundeddegeneracy and boundedtreewidth graphs.
Interestingly, our algorithms follow relatively simple backtracking procedures, analogous to what is done in practice. Our results underscore the significance of the $c$closed graph class for theoretical understanding of social network analysis.
 • The complexity of testing all properties of planar graphs, and the role of isomorphism
 Sabyasachi Basu, Akash Kumar, C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2022
 abstract
arxiv:2108.10547
ECCC
Consider property testing on bounded degree graphs and let $\varepsilon > 0$ denote the proximity parameter. A remarkable theorem of NewmanSohler (SICOMP 2013) asserts that all properties of planar graphs (more generally hyperfinite) are testable with query complexity only depending on $\varepsilon$. Recent advances in testing minorfreeness have proven that all additive and monotone properties of planar graphs can be tested in $\poly(\varepsilon^{1})$ queries. Some properties falling outside this class, such as Hamiltonicity, also have a similar complexity for planar graphs. Motivated by these results, we ask: can all properties of planar graphs can be tested in $\poly(\varepsilon^{1})$ queries? Is there a uniform query complexity upper bound for all planar properties, and what is the ``hardest" such property to test?
We discover a surprisingly clean and optimal answer. Any property of bounded degree planar graphs can be tested in $\exp(O(\varepsilon^{2}))$ queries. Moreover, there is a matching lower bound, up to constant factors in the exponent. The natural property of testing isomorphism to a fixed graph requires $\exp(\Omega(\varepsilon^{2}))$ queries, thereby showing that (up to polynomial dependencies) isomorphism to an explicit fixed graph is the hardest property of planar graphs. The upper bound is a straightforward adapation of the NewmanSohler analysis that tracks dependencies on $\varepsilon$ more carefully. The main technical contribution is the lower bound construction, which is achieved by a special family of planar graphs that are all mutually far from each other.
We can also apply our techniques to get analogous results for bounded treewidth graphs. We prove that all properties of bounded treewidth graphs can be tested in $\exp(O(\varepsilon^{1}\log \varepsilon^{1}))$ queries. Moreover, testing isomorphism to a fixed forest requires $\exp(\Omega(\varepsilon^{1}))$ queries.
 • Random walks and forbidden minors III: poly(d/\epsilon)time partition oracles for minorfree graph classes
 Akash Kumar, C. Seshadhri, and Andrew Stolman
 Foundations of Computer Science (FOCS) 2021
 abstract
arxiv:2102.00556
ECCC
Consider the family of bounded degree graphs in any minorclosed family (such as planar graphs). Let d be the degree bound and n be the number of vertices of such a graph. Graphs in these classes have hyperfinite decompositions, where, for a sufficiently small {\epsilon} > 0, one removes {\epsilon}dn edges to get connected components of size independent of n. An important tool for sublinear algorithms and property testing for such classes is the partition oracle, introduced by the seminal work of HassidimKelnerNguyenOnak (FOCS 2009).
A partition oracle is a local procedure that gives consistent access to a hyperfinite decomposition, without any preprocessing. Given a query vertex v, the partition oracle outputs the component containing v in time independent of n. All the answers are consistent with a single hyperfinite decomposition. The partition oracle of Hassidim et al. runs in time d^poly(d/{\epsilon}) per query. They pose the open problem of whether poly(d/{\epsilon})time partition oracles exist. LeviRon (ICALP 2013) give a refinement of the previous approach, to get a partition oracle that runs in time d^{\log(d/{\epsilon})per query.
In this paper, we resolve this open problem and give \poly(d/{\epsilon})time partition oracles for bounded degree graphs in any minorclosed family. Unlike the previous line of work based on combinatorial methods, we employ techniques from spectral graph theory. We build on a recent spectral graph theoretical toolkit for minorclosed graph families, introduced by the authors to develop efficient property testers.
A consequence of our result is a poly(d/{\epsilon})query tester for any monotone and additive property of minorclosed families (such as bipartite planar graphs). Our result also gives poly(d/{\epsilon})query algorithms for additive {\epsilon} napproximations for problems such as maximum matching, minimum vertex cover, maximum independent set, and minimum dominating set for these graph families.
 • Faster and Generalized Temporal Triangle Counting, via Degeneracy Ordering
 Noujan Pashanasangi and C. Seshadhri
 SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2021
 abstract
arxiv:2106.02762
Triangle counting is a fundamental technique in network analysis, that has received much attention in various input models. The vast majority of triangle counting algorithms are targeted to static graphs. Yet, many realworld graphs are directed and temporal, where edges come with timestamps. Temporal triangles yield much more information, since they account for both the graph topology and the timestamps.
Temporal triangle counting has seen a few recent results, but there are varying definitions of temporal triangles. In all cases, temporal triangle patterns enforce constraints on the time interval between edges (in the triangle). We define a general notion (\delta_{1,3}, \delta_{1,2}, \delta_{2,3})temporal triangles that allows for separate time constraints for all pairs of edges.
Our main result is a new algorithm, DOTTT (Degeneracy Oriented Temporal Triangle Totaler), that exactly counts all directed variants of (\delta_{1,3}, \delta_{1,2}, \delta_{2,3})temporal triangles. Using the classic idea of degeneracy ordering with careful combinatorial arguments, we can prove that DOTTT runs in O(m\kappa \log m) time, where m is the number of (temporal) edges and \kappa is the graph degeneracy (max core number). Up to log factors, this matches the running time of the best static triangle counters. Moreover, this running time is better than existing.
DOTTT has excellent practical behavior and runs twice as fast as existing stateoftheart temporal triangle counters (and is also more general). For example, DOTTT computes all types of temporal queries in Bitcoin temporal network with half a billion edges in less than an hour on a commodity machine.
 • NearLinear Time Homomorphism Counting in Bounded Degeneracy Graphs: The Barrier of Long Induced Cycles
 Suman K. Bera, Noujan Pashanasangi, and C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2021
 abstract
arxiv:2010.08083
Counting homomorphisms of a constant sized pattern graph H in an input graph G is a fundamental computational problem. There is a rich history of studying the complexity of this problem, under various constraints on the input G and the pattern H. Given the significance of this problem and the large sizes of modern inputs, we investigate when nearlinear time algorithms are possible. We focus on the case when the input graph has bounded degeneracy, a commonly studied and practically relevant class for homomorphism counting. It is known from previous work that for certain classes of H, Hhomomorphisms can be counted exactly in nearlinear time in bounded degeneracy graphs. Can we precisely characterize the patterns H for which nearlinear time algorithms are possible?
We completely resolve this problem, discovering a clean dichotomy using finegrained complexity. Let m denote the number of edges in G. We prove the following: if the largest induced cycle in H has length at most 5, then there is an O(mlogm) algorithm for counting Hhomomorphisms in bounded degeneracy graphs. If the largest induced cycle in H has length at least 6, then (assuming standard finegrained complexity conjectures) there is a constant γ>0, such that there is no o(m1+γ) time algorithm for counting Hhomomorphisms.
 • How to Count Triangles, without Seeing the Whole Graph
 Suman K. Bera and C. Seshadhri
 SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2020
 abstract
arxiv:2006.11947
Triangle counting is a fundamental problem in the analysis of large graphs. There is a rich body of work on this problem, in varying streaming and distributed models, yet all these algorithms require reading the whole input graph. In many scenarios, we do not have access to the whole graph, and can only sample a small portion of the graph (typically through crawling). In such a setting, how can we accurately estimate the triangle count of the graph?
We formally study triangle counting in the {\em random walk} access model introduced by Dasgupta et al (WWW '14) and Chierichetti et al (WWW '16). We have access to an arbitrary seed vertex of the graph, and can only perform random walks. This model is restrictive in access and captures the challenges of collecting realworld graphs. Even sampling a uniform random vertex is a hard task in this model.
Despite these challenges, we design a provable and practical algorithm, TETRIS, for triangle counting in this model. TETRIS is the first provably sublinear algorithm (for most natural parameter settings) that approximates the triangle count in the random walk model, for graphs with low mixing time. Our result builds on recent advances in the theory of sublinear algorithms. The final sample built by TETRIS is a careful mix of random walks and degreebiased sampling of neighborhoods. Empirically, TETRIS accurately counts triangles on a variety of large graphs, getting estimates within 5\% relative error by looking at 3\% of the number of edges.
 • How the Degeneracy Helps for Triangle Counting in Graph Streams
 Suman K. Bera and C. Seshadhri
 Principles of Database Systems (PODS) 2020
 abstract
arxiv:2003.13151
We revisit the wellstudied problem of triangle count estimation in graph streams. Given a graph represented as a stream of $m$ edges, our aim is to compute a $(1\pm\varepsilon)$approximation to the triangle count $T$, using a small space algorithm. For arbitrary order and a constant number of passes, the space complexity is known to be essentially $\Theta(\min(m^{3/2}/T, m/\sqrt{T}))$ (McGregor et al., PODS 2016, Bera et al., STACS 2017). We give a (constant pass, arbitrary order) streaming algorithm that can circumvent this lower bound for \emph{low degeneracy graphs}. The degeneracy, $\kappa$, is a nuanced measure of density, and the class of constant degeneracy graphs is immensely rich (containing planar graphs, minorclosed families, and preferential attachment graphs). We design a streaming algorithm with space complexity $\widetilde{O}(m\kappa/T)$. For constant degeneracy graphs, this bound is $\widetilde{O}(m/T)$, which is significantly smaller than both $m^{3/2}/T$ and $m/\sqrt{T}$. We complement our algorithmic result with a nearly matching lower bound of $\Omega(m\kappa/T)$.
 • Provably and Efficiently Approximating Nearcliques using Tur\'an Shadow: PEANUTS
 Shweta Jain and C. Seshadhri
 The Web Conference (WWW) 2020
 abstract
arxiv:2006.13483
Clique and nearclique counts are important graph
properties that help to characterize graphs and have applications in
graph generation, graph modelling, graph analytics etc. Recently, Jain
et. al proposed an algorithm called TuranShadow for
estimating kclique counts that was a significant improvement over the stateoftheart. Essentially, given a graph $G$, it constructs an object called the Tur\'an Shadow  a collection of (possibly overlapping) dense subgraphs such that there is a bijection from the cliques in each subgraph to the set of kcliques in $G$. It then estimates the count of kcliques in $G$ by estimating the number of cliques in the subgraphs using uniform random sampling. As a side effect, it provides an efficient method to sample a uniform random $k$clique.
We use this clique sampler to discover and estimate counts of nearcliques i.e. kcliques that are missing 1 or 2 edges. Every nearclique contains cliques of smaller sizes. By obtaining the Tur\'an Shadow for the smaller clique size, sampling a uniform random clique of the smaller size and obtaining an average of the number of nearcliques the sampled clique participates in, we give an estimate for the total number of nearcliques in $G$. There was no known efficient method to obtain nearclique counts for $k>5$ before this work. We obtain a $10x100x$ speedup over existing algorithms for counting nearcliques.
Additionally, we enhance TuranShadow by designing a new estimator that requires us to construct only parts of the shadow, and uses the shadow in an online manner, obviating the need to store the entire shadow. This helps achieve not only a $2x10x$ speedup over TuranShadow in some cases but drastically reduces the space utilized.
 • The impossibility of lowrank representations for trianglerich complex networks
 C. Seshadhri, Aneesh Sharma, Andrew Stolman, and Ashish Goel
 Proceedings of the National Academy of Sciences (PNAS) 2020
 abstract
PNAS:1911030117
UCSC news link
Youtube, part 1
Youtube, part 2
slides
The study of complex networks is a significant development in modern science, and has enriched the social sciences, biology, physics, and computer science. Models and algorithms for such networks are pervasive in our society, and impact human behavior via social networks, search engines, and recommender systems, to name a few. A widely used algorithmic technique for modeling such complex networks is to construct a lowdimensional Euclidean embedding of the vertices of the network, where proximity of vertices is interpreted as the likelihood of an edge. Contrary to the common view, we argue that such graph embeddings do not capture salient properties of complex networks. The two properties we focus on are low degree and large clustering coefficients, which have been widely established to be empirically true for realworld networks. We mathematically prove that any embedding (that uses dot products to measure similarity) that can successfully create these two properties must have a rank that is nearly linear in the number of vertices. Among other implications, this establishes that popular embedding techniques such as singular value decomposition and node2vec fail to capture significant structural aspects of realworld complex networks. Furthermore, we empirically study a number of different embedding techniques based on dot product, and show that they all fail to capture the triangle structure.
 • Linear Time Subgraph Counting, Graph Degeneracy, and the Chasm at Size Six
 Suman K. Bera, Noujan Pashanasangi, and C. Seshadhri
 Innovations in Theoretical Computer Science (ITCS) 2020
 abstract
arxiv:1911.05896
We consider the problem of counting all kvertex subgraphs in an input graph, for any constant k. This problem (denoted subcntk) has been studied extensively in both theory and practice. In a classic result, Chiba and Nishizeki (SICOMP 85) gave linear time algorithms for clique and 4cycle counting for bounded degeneracy graphs. This is a rich class of sparse graphs that contains, for example, all minorfree families and preferential attachment graphs. The techniques from this result have inspired a number of recent practical algorithms for subcntk. Towards a better understanding of the limits of these techniques, we ask: for what values of k can subcntk be solved in linear time?
We discover a chasm at k=6. Specifically, we prove that for k < 6, subcntk can be solved in linear time. Assuming a standard conjecture in finegrained complexity, we prove that for all k≥6, subcntk cannot be solved even in nearlinear time.
 • Efficiently Counting Vertex Orbits of All 5vertex Subgraphs, by EVOKE
 Noujan Pashanasangi and C. Seshadhri
 Web Science and Data Mining (WSDM) 2020
 abstract
arxiv:1911.10616
Subgraph counting is a fundamental task in network analysis. Typically, algorithmic work is on total counting, where we wish to count the total frequency of a (small) pattern subgraph in a large input data set. But many applications require local counts (also called vertex orbit counts) wherein, for every vertex v of the input graph, one needs the count of the pattern subgraph involving v. This provides a rich set of vertex features that can be used in machine learning tasks, especially classification and clustering. But getting local counts is extremely challenging. Even the easier problem of getting total counts has received much research attention. Local counts require algorithms that get much finer grained information, and the sheer output size makes it difficult to design scalable algorithms.
We present EVOKE, a scalable algorithm that can determine vertex orbits counts for all 5vertex pattern subgraphs. In other words, EVOKE exactly determines, for every vertex v of the input graph and every 5vertex subgraph H, the number of copies of H that v participates in. EVOKE can process graphs with tens of millions of edges, within an hour on a commodity machine. EVOKE is typically hundreds of times faster than previous state of the art algorithms, and gets results on datasets beyond the reach of previous methods.
Theoretically, we generalize a recent "graph cutting" framework to get vertex orbit counts. This framework generate a collection of polynomial equations relating vertex orbit counts of larger subgraphs to those of smaller subgraphs. EVOKE carefully exploits the structure among these equations to rapidly count. We prove and empirically validate that EVOKE only has a small constant factor overhead over the best (total) 5vertex subgraph counter.
 • The Power of Pivoting for Exact Clique Counting
 Shweta Jain and C. Seshadhri
 Web Science and Data Mining (WSDM) 2020 (Best paper)
 abstract
arxiv:2001.06784
Clique counting is a fundamental task in network analysis, and even the simplest setting of 3cliques (triangles) has been the center of much recent research. Getting the count of kcliques for larger k is algorithmically challenging, due to the exponential blowup in the search space of large cliques. But a number of recent applications (especially for community detection or clustering) use larger clique counts. Moreover, one often desires \textit{local} counts, the number of kcliques per vertex/edge.
Our main result is Pivoter, an algorithm that exactly counts the number of kcliques, \textit{for all values of k}. It is surprisingly effective in practice, and is able to get clique counts of graphs that were beyond the reach of previous work. For example, Pivoter gets all clique counts in a social network with a 100M edges within two hours on a commodity machine. Previous parallel algorithms do not terminate in days. Pivoter can also feasibly get local pervertex and peredge kclique counts (for all k) for many public data sets with tens of millions of edges. To the best of our knowledge, this is the first algorithm that achieves such results.
The main insight is the construction of a Succinct Clique Tree (SCT) that stores a compressed unique representation of all cliques in an input graph. It is built using a technique called \textit{pivoting}, a classic approach by BronKerbosch to reduce the recursion tree of backtracking algorithms for maximal cliques. Remarkably, the SCT can be built without actually enumerating all cliques, and provides a succinct data structure from which exact clique statistics (kclique counts, local counts) can be read off efficiently.
 • Domain Reduction for Monotonicity Testing: A o(d) Tester for Boolean Functions in dDimensions
 Hadley Black, Deeparnab Chakrabarty, and C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2020
 abstract
arxiv:1811.01427
We describe a Õ (d5/6)query monotonicity tester for Boolean functions f:[n]d→{0,1} on the nhypergrid. This is the first o(d) monotonicity tester with query complexity independent of n. Motivated by this independence of n, we initiate the study of monotonicity testing of measurable Boolean functions f:ℝd→{0,1} over the continuous domain, where the distance is measured with respect to a product distribution over ℝd. We give a Õ (d5/6)query monotonicity tester for such functions.
Our main technical result is a domain reduction theorem for monotonicity. For any function f:[n]d→{0,1}, let ϵf be its distance to monotonicity. Consider the restriction f̂ of the function on a random [k]d subhypergrid of the original domain. We show that for k=poly(d/ϵ), the expected distance of the restriction is 𝔼[ϵf̂ ]=Ω(ϵf). Previously, such a result was only known for d=1 (BermanRaskhodnikovaYaroslavtsev, STOC 2014). Our result for testing Boolean functions over [n]d then follows by applying the d5/6⋅poly(1/ϵ,logn,logd)query hypergrid tester of BlackChakrabartySeshadhri (SODA 2018).
To obtain the result for testing Boolean functions over ℝd, we use standard measure theoretic tools to reduce monotonicity testing of a measurable function f to monotonicity testing of a discretized version of f over a hypergrid domain [N]d for large, but finite, N (that may depend on f). The independence of N in the hypergrid tester is crucial to getting the final tester over ℝd.
 • Faster sublinear approximations of kcliques for low arboricity graphs
 Talya Eden, Dana Ron, and C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2020
 abstract
arxiv:1811.04425
Given query access to an undirected graph $G$, we consider the problem of
computing a $(1\pm\eps)$approximation of the number of $k$cliques in $G$. The standard query
model for general graphs allows for degree queries, neighbor queries, and pair queries.
Let $n$ be the number of vertices, $m$ be the number of edges, and $n_k$
be the number of $k$cliques.
Previous work by Eden, Ron and Seshadhri (STOC 2018) gives an $O^*(\frac{n}{n^{1/k}_k} + \frac{m^{k/2}}{n_k})$time algorithm for this problem (we use $O^*(\cdot)$ to suppress $\poly(\log n, 1/\eps, k^k)$ dependencies). Moreover, this bound is nearly optimal when the expression is sublinear in the size of the graph.
Our motivation is to circumvent this lower bound, by parameterizing the complexity in terms of \emph{graph arboricity}. The arboricity of $G$ is a measure for the graph density ``everywhere''.
There is a very rich family of graphs with bounded arboricity, including all minorclosed graph classes (such as planar graphs and graphs with bounded treewidth), bounded degree graphs, preferential attachment graphs and more.
We design an algorithm for the class of graphs with arboricity at most $\alpha$, whose
running time is $O^*(\min\{\frac{n\alpha^{k1}}{n_k},\; \frac{n}{n_k^{1/k}}+\frac{m \alpha^{k2}}{n_k} \} )$.
We also prove a nearly matching lower bound.
For all graphs, the arboricity is $O(\sqrt m)$, so this bound subsumes all previous results on sublinear clique approximation.
As a special case of interest, consider minorclosed families of graphs, which have constant arboricity. Our result implies that for any minorclosed family of graphs, there is a $(1\pm\eps)$approximation algorithm for $n_k$ that has running time $O^*(\frac{n}{n_k})$.
Such a bound was not known even for the special (classic) case of triangle counting in planar graphs.
 • Random walks and forbidden minors II: A $poly(d\varepsilon^{1})$query tester for minorclosed properties of boundeddegree graphs
 Akash Kumar, C. Seshadhri, and Andrew Stolman
 Symposium on Theory of Computing (STOC) 2019 (Invited to special issue)
 abstract
stoc
arxiv:1904.01055
ECCC
Let $G$ be a graph with $n$ vertices and maximum degree $d$. Fix some minorclosed property $\mathcal{P}$ (such as planarity).
We say that $G$ is $\varepsilon$far from $\mathcal{P}$ if one has to remove $\varepsilon dn$ edges to make it have $\mathcal{P}$.
The problem of property testing $\mathcal{P}$ was introduced in the seminal work of BenjaminiSchrammShapira (STOC 2008)
that gave a tester with query complexity triply exponential in $\varepsilon^{1}$.
LeviRon (TALG 2015) have given the best tester to date, with a quasipolynomial (in $\varepsilon^{1}$) query complexity.
It is an open problem to get property testers whose query complexity is $\poly(d\varepsilon^{1})$,
even for planarity.
In this paper, we resolve this open question. For any minorclosed property,
we give a tester with query complexity $d\cdot \poly(\varepsilon^{1})$.
The previous line of work on (independent of $n$, twosided) testers is primarily combinatorial.
Our work, on the other hand, employs techniques from spectral graph theory. This paper
is a continuation of recent work of the authors (FOCS 2018) analyzing random walk algorithms
that find forbidden minors.
 • Adaptive Boolean Monotonicity Testing in Total Influence Time
 Deeparnab Chakrabarty and C. Seshadhri
 Innovations in Theoretical Computer Science (ITCS) 2019
 abstract
arxiv:1801.02816
ECCC
ITCS
Testing monotonicity
of a Boolean function $f:\{0,1\}^n \to \{0,1\}$ is an important problem in the field of property testing.
It has led to connections with many interesting combinatorial questions on the directed hypercube: routing, random walks,
and new isoperimetric theorems.
Denoting the proximity parameter by $\eps$, the best tester is the nonadaptive $\otilde(\eps^{2}\sqrt{n})$ tester
of KhotMinzerSafra (FOCS 2015). A series of recent results by BelovsBlais (STOC 2016) and ChenWaingartenXie (STOC 2017)
have led to $\widetilde{\Omega}(n^{1/3})$ lower bounds for \emph{adaptive} testers. Reducing this gap is a significant
question, that touches on the role of adaptivity in monotonicity testing of Boolean functions.
We approach this question from the perspective of \emph{parametrized property testing}, a concept recently introduced
by PallavoorRaskhodnikovaVarma (ACM TOCT 2017), where one seeks to understand performance of testers with respect to parameters other than just the size.
Our result is an adaptive monotonicity tester with onesided error whose query complexity is $O(\eps^{2}\bI(f)\log^5 n)$, where $\bI(f)$ is the total influence of the function. Therefore, adaptivity provably helps monotonicity testing for low influence functions.
 • Local Algorithms for Hierarchical Dense Subgraph Discovery
 A. Erdem Sariyuce, C. Seshadhri, and Ali Pinar
 Proceedings of Very Large Databases (PVLDB) 2018
 abstract
arxiv:1704.00386
VLDB
Finding the dense regions of a graph and relations among them is a fundamental problem in network analysis. Core and truss decompositions reveal dense subgraphs with hierarchical relations. The incremental nature of algorithms for computing these decompositions and the need for global information at each step of the algorithm hinders scalable parallelization and approximations since the densest regions are not revealed until the end. In a previous work, Lu et al. proposed to iteratively compute the hindices of neighbor vertex degrees to obtain the core numbers and prove that the convergence is obtained after a finite number of iterations. This work generalizes the iterative hindex computation for truss decomposition as well as nucleus decomposition which leverages higherorder structures to generalize core and truss decompositions. In addition, we prove convergence bounds on the number of iterations. We present a framework of local algorithms to obtain the core, truss, and nucleus decompositions. Our algorithms are local, parallel, offer high scalability, and enable approximations to explore time and quality tradeoffs. Our sharedmemory implementation verifies the efficiency, scalability, and effectiveness of our local algorithms on realworld networks.
 • Finding forbidden minors in sublinear time: a $O(n^{1/2+o(1)})$query onesided tester for minor closed properties on bounded degree graphs
 Akash Kumar, C. Seshadhri, and Andrew Stolman
 Foundations of Computer Science (FOCS) 2018 (Invited to special issue)
 abstract
arxiv:1805.08187
ECCC
Let $G$ be an undirected, bounded degree graph
with $n$ vertices. Fix a finite graph $H$, and suppose
one must remove $\varepsilon n$ edges from $G$ to make it $H$minor free (for some small constant $\varepsilon > 0$).
We give an $n^{1/2+o(1)}$time randomized procedure that,
with high probability, finds an $H$minor in such a graph.
For an example application, suppose one must remove $\varepsilon n$ edges from a bounded degree graph $G$ to make it planar.
This result implies an algorithm, with the same running time, that produces
a $K_{3,3}$ or $K_5$ minor in $G$.
No sublinear time bound was known for this problem, prior to this result.
By the graph minor theorem, we get an analogous result for any minorclosed property.
Up to $n^{o(1)}$ factors, this resolves a conjecture of BenjaminiSchrammShapira (STOC 2008)
on the existence of onesided property testers for minorclosed properties. Furthermore, our algorithm is nearly
optimal, by an $\Omega(\sqrt{n})$ lower bound of Czumaj et al (RSA 2014).
Prior to this work,
the only graphs $H$ for which nontrivial property testers were known for $H$minor freeness
are the following: $H$ being a forest or a cycle (Czumaj et al, RSA 2014), $K_{2,k}$, $(k\times 2)$grid, and the $k$circus (Fichtenberger et al, Arxiv 2017).
 • Finding Cliques in Social Networks: A New DistributionFree Model
 Jacob Fox, Tim Roughgarden, C. Seshadhri, Fan Wei, Nicole Wein
 International Colloquium on Automata, Languages, and Programming (ICALP) 2018
 abstract
arxiv:1804.07431
We propose a new distributionfree model of social networks. Our definitions
are motivated by one of the most universal signatures of social networks,
triadic closurethe property that pairs of vertices with common neighbors
tend to be adjacent. Our most basic definition is that of a "$c$closed" graph,
where for every pair of vertices $u,v$ with at least $c$ common neighbors, $u$
and $v$ are adjacent. We study the classic problem of enumerating all maximal
cliques, an important task in social network analysis. We prove that this
problem is fixedparameter tractable with respect to $c$ on $c$closed graphs.
Our results carry over to "weakly $c$closed graphs", which only require a
vertex deletion ordering that avoids pairs of nonadjacent vertices with $c$
common neighbors. Numerical experiments show that wellstudied social networks
tend to be weakly $c$closed for modest values of $c$.
 • On Approximating the Number of kCliques in Sublinear Time
 Talya Eden, Dana Ron, and C. Seshadhri
 to appear, Symposium on the Theory of Computing (STOC) 2018
 abstract
arxiv:1707.04858
We study the problem of approximating the number of $k$cliques in a graph when given query access to the graph.
We consider the standard query model for general graphs via
(1) degree queries, (2) neighbor queries and (3) pair queries.
Let $n$ denote the number of vertices in the graph, $m$ the number of edges, and $\clk$ the number of $k$cliques.
We design an algorithm that outputs a $(1+\eps)$approximation (with high probability) for $\clk$, whose expected query complexity and running time are
$O\left(\frac{n}{\clk^{1/k}}+\frac{m^{k/2}}{\clk} \right)\poly(\log n, 1/\eps,k)$.
Hence, the complexity of the algorithm is sublinear in the size of the graph for $\clk = \omega(m^{k/21})$.
Furthermore,
the query complexity of our algorithm is essentially optimal
(up to the dependence on $\log n$, $1/\eps$ and $k$).
The previous results in this vein are by Feige (SICOMP 06) and by Goldreich \dcom{and} Ron (RSA 08)
for edge counting ($k=2$) and by Eden et al (FOCS 2015)
for triangle counting ($k=3$). Our result matches the complexities of
these results.
The previous result by Eden et al. hinges on a certain amortization technique that works
for triangle counting, and does not generalize %for larger cliques to all $k$.
We obtain a general algorithm that works for any $k\geq 3$ by designing a procedure that} samples each $k$clique
incident to a given set $S$ of vertices with approximately equal probability.
The primary difficulty is in finding cliques incident to purely highdegree vertices,
since random sampling within neighbors has a low success probability.
This is achieved by an algorithm that samples uniform random
high degree vertices and
a careful tradeoff between estimating cliques incident purely to highdegree vertices and those
that include a lowdegree vertex.
 • Provable and practical approximations for the degree distribution using sublinear graph samples
 Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C. Seshadhri
 The Web Conference (WWW) 2018
 abstract
arxiv:1710.08607
The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on \emph{graph sampling}, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. The degree distribution estimation poses a significant challenge, due to its heavytailed nature and the large variance in degrees.
We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of \emph{sublinear algorithms}. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the \emph{hindex} and the \emph{zindex}. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law.
We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most 1% of the vertices. This is a major improvement over the stateoftheart sampling algorithms, which typically sample more than 10% of the vertices to give comparable results. We also observe that the h and zindices of real graphs are large, validating our theoretical analysis.
 • A o(d)polylog n Monotonicity Tester for Boolean Functions over the Hypergrid [n]^d
 Hadley Black, Deeparnab Chakrabarty, C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2018
 abstract
arxiv:1710.10545
We study monotonicity testing of Boolean functions over the hypergrid [n]^d and design a nonadaptive tester with 1sided error whose query complexity is O(d^5/6)⋅poly(logn,1/ϵ). Previous to our work, the best known testers had query complexity linear in d but independent of n. We improve upon these testers as long as n=2^d^o(1).
To obtain our results, we work with what we call the augmented hypergrid, which adds extra edges to the hypergrid. Our main technical contribution is a Margulisstyle isoperimetric result for the augmented hypergrid, and our tester, like previous testers for the hypercube domain, performs directed random walks on this structure.
 • Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection
 Talya Eden, Dana Ron, and C. Seshadhri
 International Colloquium on Automata, Languages, and Programming (ICALP) 2017
 abstract
arxiv:1604.03661
We revisit the classic problem of estimating the degree distribution moments of an undirected graph. Consider an undirected graph $G=(V,E)$ with $n$ (nonisolated) vertices, and define (for $s > 0$) $\mu_s = \frac{1}{n}\cdot\sum_{v \in V} d^s_v$. Our aim is to estimate $\mu_s$ within a multiplicative error of $(1+\varepsilon)$ (for a given approximation parameter $\varepsilon>0$) in sublinear time. We consider the sparse graph model that allows access to: uniform random vertices, queries for the degree of any vertex, and queries for a neighbor of any vertex.
For the case of $s=1$ (the average degree), $\widetilde{O}(\sqrt{n})$ queries suffice for any constant
$\varepsilon$ (Feige, SICOMP 06 and GoldreichRon, RSA 08).
GonenRonShavitt (SIDMA 11) extended this result to all integral $s > 0$, by designing an algorithms that performs $\widetilde{O}(n^{11/(s+1)})$ queries. (Strictly speaking, their algorithm approximates the number of starsubgraphs of a given size, but a slight modification gives an algorithm for moments.)
We design a new, significantly simpler algorithm for this problem. In the worstcase, it exactly matches the bounds of GonenRonShavitt, and has a much simpler proof. More importantly, the running time of this algorithm is connected to the \emph{degeneracy} of $G$. This
is (essentially) the maximum density of an induced subgraph.
 • Optimal Unateness Testers for RealValued Functions: Adaptivity Helps
 Roksana Baleshzar, Deeparnab Chakrabarty, Ramesh Krishnan S. Pallavoor, Sofya Raskhodnikova, and C. Seshadhri
 International Colloquium on Automata, Languages, and Programming (ICALP) 2017
 abstract
arxiv:1703.05199
We study the problem of testing unateness of functions f:{0,1}^d > R. We give a O((d/\eps)log(d/\eps))query nonadaptive tester and a O(d/\eps)query adaptive tester and show that both testers are optimal for a fixed distance parameter \eps. Previously known unateness testers worked only for Boolean functions, and their query complexity had worse dependence on the dimension both for the adaptive and the nonadaptive case. Moreover, no lower bounds for testing unateness were known. We also generalize our results to obtain optimal unateness testers for functions f:[n]^d > R.
Our results establish that adaptivity helps with testing unateness of realvalued functions on domains of the form {0,1}^d and, more generally, [n]^d. This stands in contrast to the situation for monotonicity testing where there is no adaptivity gap for functions f:[n]^d > R.
 • A Fast and Provable Method for Estimating Clique Counts Using Tur\'{a}n's Theorem
 Shweta Jain and C. Seshadhri
 World Wide Web (WWW) 2017 (Best paper)
 abstract
arxiv:1611.05561
Clique counts reveal important properties about the structure of massive graphs,
especially social networks.
The simple setting of just 3cliques (triangles) has received much attention from
the research community. For larger cliques (even, say $6$cliques)
the problem quickly becomes intractable because of combinatorial explosion. Most methods used for triangle counting do not scale for large cliques, and existing algorithms require massive parallelism to be feasible.
We present a new randomized algorithm that provably approximates the number of $k$cliques,
for any constant $k$. The key insight is the use of (strengthenings of) the classic Tur\'an's theorem:
this claims that if the edge density of a graph is sufficiently high, the $k$clique
density must be nontrivial. We define a combinatorial structure called a \emph{Tur\'an shadow},
the construction of which leads to fast algorithms for clique counting.
We design a practical heuristic, called TURANSHADOW, based on this theoretical algorithm, and test
it on a large class of test graphs. In all cases, TURANSHADOW has less than 2\% error,
in a fraction of the time used by welltuned exact algorithms. We do detailed comparisons
with a range of other sampling algorithms, and find that TURANSHADOW is generally much faster
and more accurate. For example, TURANSHADOW estimates all cliques numbers up to size $10$ in social
network with over a hundred million edges. This is done in less than three hours on a single commodity
machine.
 • ESCAPE: Efficiently Counting All 5Vertex Subgraphs
 Ali Pinar, C. Seshadhri, and V. Vishal
 World Wide Web (WWW) 2017
 abstract
arxiv:1601.09411
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4vertex or 5vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes.
We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for \emph{all} 5vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have less than 5 vertices) to exactly count all 5vertex patterns.
We perform extensive empirical experiments on a variety of realworld graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for $5$vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all $4$vertex
patterns. This algorithm is typically ten times faster than the state of the art $4$vertex counters.
 • When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors
 Aneesh Sharma, C. Seshadhri, and Ashish Goel
 World Wide Web (WWW) 2017
 abstract
arxiv:1703.01054
Finding similar user pairs is a fundamental task in social networks,
with applications in link prediction, tie strength detection, and a
number of ranking and personalization tasks. A common manifestation
of user similarity is based upon network structure: each user is
represented by a vector that represents the user's network
connections, and cosine similarity among these vectors quantifies
user similarity. The predominant task for user similarity
applications is to discover all user pairs with ``high'' similarity,
which is computationally challenging on networks with billions of
edges. To the best of our knowledge, there is no practical solution
for computing all user pairs with high similarity on large social
networks, even using the power of distributed algorithms.
Our work directly addresses this challenge by introducing a new
algorithm  WHIMP  that solves this problem efficiently in
the MapReduce model. The key insight in WHIMP is to combine the
``wedgesampling" approach of CohenLewis for approximate matrix
multiplication with the SimHash random projection techniques of
Charikar. We provide a theoretical analysis of WHIMP, proving that
it has near optimal communication costs while maintaining
computation cost comparable with the state of the art. We also
empirically demonstrate WHIMP's scalability by computing all highly
similar pairs on four massive data sets, and show that it accurately
finds high similarity pairs. In particular, we note that WHIMP
successfully processes the entire Twitter network, which has tens of
billions of edges.
 • Accurate and Nearly Optimal Sublinear Approximations to Ulam Distance
 Timothy Naumovitz, Michael Saks, and C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2017
 abstract
conf
The \emph{Ulam distance} between two permutations of length $n$ is the minimum number of insertions and deletions needed to transform one sequence into the other.
Equivalently, the Ulam distance $d$ is $n$ minus the length of the longest common subsequence (LCS) between the permutations.
Our main result is an algorithm, that for any fixed $\varepsilon>0$, provides a $(1+\varepsilon)$multiplicative approximation
for $d$ in $\tilde{O}_{\varepsilon}(n/d + \sqrt{n})$ time, which has been shown to be optimal up to polylogarithmic factors. This is the first sublinear time algorithm (provided that $d = (\log n)^{\omega(1)}$)
that obtains arbitrarily good multiplicative approximations to the Ulam distance.
The previous best bound is an $O(1)$approximation (with a large constant) by Andoni and Nguyen (2010) with
the same running time bound (ignoring polylogarithmic factors).
The improvement in the approximation factor from $O(1)$ to $(1+\varepsilon)$ allows for significantly more powerful sublinear
algorithms. For example, for any fixed $\delta > 0$, we can get additive $\delta n$ approximations for the LCS
between permutations in $\tilde{O}_\delta(\sqrt{n})$ time. Previous sublinear algorithms require $\delta$ to be
at least $11/C$, where $C$ is the approximation factor, which is close to $1$ when $C$ is large.
Our algorithm is obtained by abstracting the basic algorithmic framework
of Andoni and Nguyen, and combining it with the sublinear approximations
for the longest increasing subsequence by Saks and Seshadhri (2010).
 • Directed Closure Measures for Networks with Reciprocity
 C. Seshadhri, Ali Pinar, Nurcan Durak, and Tamara G. Kolda
 Journal of Complex Networks 2017
 abstract
arxiv:1302.6220
The computation and study of triangles in graphs is a standard tool in the analysis of realworld networks. Yet most of this work focuses on undirected graphs. Realworld networks are often directed and have a significant fraction of reciprocal edges. While there is much focus on directed triadic patterns in the social sciences community, most data mining and graph analysis studies ignore direction.
But how do we make sense of this complex directed structure? We propose a collection of directed closure values that are analogues of the classic transitivity measure (the fraction of wedges that participate in triangles). We perform an extensive set of triadic measurements on a variety of massive realworld networks. Our study of these values reveal a wealth of information of the nature of direction. For instance, we immediately see the importance of reciprocal edges in forming triangles and can measure the power of transitivity. Surprisingly, the chance that a wedge is closed depends heavily on its directed structure. We also observe striking similarities between the triadic closure patterns of different web and social networks.
Together with these observations, we also present the first sampling based algorithm for fast estimation of directed triangles. Previous estimation methods were targeted towards undirected triangles and could not be extended to directed graphs. Our method, based on wedge sampling, gives orders of magnitude speedup over state of the art enumeration.
 • Trigger Detection for Adaptive Scientific Workflows Using Percentile Sampling
 Janine C. Bennett, Ankit Bhagatwala, Jacqueline H. Chen, Ali Pinar, Maher Salloum, and C. Seshadhri
 SIAM Journal on Scientific Computing 2016
 abstract
arxiv:1411.2689
Increasing complexity of scientific simulations and HPC architectures are driving the need for adaptive workflows, where the composition and execution of computational and data manipulation steps dynamically depend on the evolutionary state of the simulation itself. Consider for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and with high fidelity for postanalysis, however we cannot afford to retain the same frequency for the full simulation due to the high cost of data movement. We can instead look for triggers, indicators that the simulation will be entering a critical phase and adapt the workflow accordingly.
We present a method for detecting triggers and demonstrate its use in direct numerical simulations of turbulent combustion using S3D. We show that chemical explosive mode analysis (CEMA) can be used to devise a noisetolerant indicator for rapid increase in heat release. However, exhaustive computation of CEMA values dominates the total simulation, thus is prohibitively expensive. To overcome this bottleneck, we propose a quantilesampling approach. Our algorithm comes with provable error/confidence bounds, as a function of the number of samples. Most importantly, the number of samples is independent of the problem size, thus our proposed algorithm offers perfect scalability. Our experiments on homogeneous charge compression ignition (HCCI) and reactivity controlled compression ignition (RCCI) simulations show that the proposed method can detect rapid increases in heat release, and its computational overhead is negligible. Our results will be used for dynamic workflow decisions about data storage and mesh resolution in future combustion simulations. Proposed framework is generalizable and we detail how it could be applied to a broad class of scientific simulation workflows.
 • A Mountaintop View Requires Minimal Sorting: A Faster Contour Tree Algorithm
 Benjamin Raichel and C. Seshadhri
 Symposium on Computational Geometry (SOCG) 2016 (Invited to special issue)
 Discrete and Computational Geometry 2016
 abstract
arxiv:1411.2689
Consider a scalar field $f:\mathbb{M} \mapsto \mathbb{R}$, where $\mathbb{M}$
is a triangulated simplicial mesh in $\mathbb{R}^d$. A level set, or contour,
at value $v$ is a connected component of $f^{1}(v)$. As $v$ is changed, these
contours change topology, merge into each other, or split. \emph{Contour trees}
are concise representations of $f$ that track this contour behavior. The
vertices of these trees are the critical points of $f$, where the gradient is
zero. The edges represent changes in the topology of contours. It is a
fundamental data structure in data analysis and visualization, and there is
significant previous work (both theoretical and practical) on algorithms for
constructing contour trees.
Suppose $\mathbb{M}$ has $n$ vertices, $N$ facets, and $t$ critical points. A
classic result of Carr, Snoeyink, and Axen (2000) gives an algorithm that takes
$O(n\log n + N\alpha(N))$ time (where $\alpha(\cdot)$ is the inverse Ackermann
function). A further improvement to $O(t\log t + N)$ time was given by Chiang
et al. All these algorithms involve a global sort of the critical points, a
significant computational bottleneck. Unfortunately, lower bounds of
$\Omega(t\log t)$ also exist.
We present the first algorithm that can avoid the global sort and has a
refined time complexity that depends on the contour tree structure.
Intuitively, if the tree is short and fat, we get significant improvements in
running time. For a partition of the contour tree into a set of descending
paths, $P$, our algorithm runs in $O(\sum_{p\in P} p\logp + t\alpha(t)+N)$
time. This is at most $O(t\log D + N)$, where $D$ is the diameter of the
contour tree. Moreover, it is $O(t\alpha(t)+N)$ for balanced trees, a
significant improvement over the previous complexity. We also prove lower
bounds showing that the $\sum_{p\in P} p\logp$ complexity is inherent to
computing contour trees.
 • Characterizing shortterm stability for Boolean networks over any distribution of transfer functions
 C. Seshadhri, Andrew M. Smith, Yevgeniy Vorobeychik, Jackson Mayo, and Robert C. Armstrong
 Physical Review E 2016
 abstract
APS arxiv:1409.4360
We present a characterization of shortterm stability of Kauffman's NK
(random) Boolean networks under arbitrary distributions of transfer functions. Given such a Boolean network where each transfer function is drawn from the same distribution, we present a formula that determines whether shortterm chaos (damage spreading) will happen. Our main technical tool which enables the formal proof of this formula is the Fourier analysis of Boolean functions, which describes such functions as multilinear polynomials over the inputs. Numerical simulations on mixtures of threshold functions and nested canalyzing functions demonstrate the formula's correctness.
 • Approximately Counting Triangles in Sublinear Time
 Talya Eden, Amit Levi, Dana Ron, and C. Seshadhri
 Foundations of Computer Science (FOCS) 2015
 abstract
arxiv:1504.00954
We consider the problem of estimating the number of triangles in a graph. This problem has been extensively studied in both theory
and practice, but all existing algorithms read the entire graph. In this work we design a sublineartime algorithm for approximating the number of triangles in a graph, where the algorithm is given query access to the graph. The allowed queries are degree queries, vertexpair queries and neighbor queries.
We show that for any given approximation parameter 0\textless epsilon\textless1, the algorithm provides an estimate hat\{t\} such that with high constant probability, (1epsilon) t\textless hat\{t\}\textless(1+epsilon)t, where t is the number of triangles in the graph G. The expected query complexity of the algorithm is O(n/t\textasciicircum \{1/3\} + min \{m, m\textasciicircum\{3/2\}/t\}) poly(log n, 1/epsilon), where n is the number of vertices in the graph and m is the number of edges,
and the expected running time is
(n/t\textasciicircum \{1/3\} + m\textasciicircum\{3/2\}/t) poly(log n, 1/epsilon). We also prove that
Omega(n/t\textasciicircum \{1/3\} + min \{m, m\textasciicircum\{3/2\}/t\}) queries are necessary,
thus establishing that the query complexity of this algorithm is optimal up to polylogarithmic factors in n (and the dependence on 1/epsilon).
 • Diamond Sampling for Approximate Maximum Allpairs Dotproduct (MAD) Search
 Grey Ballard, Tamara G. Kolda, Ali Pinar, and C. Seshadhri
 International Conference on Data Mining (ICDM) 2015 (Best paper)
 abstract
arxiv:1506.03872
Given two sets of vectors, $A = \set{\vec{a_1}, \dots, \vec{a_m}}$ and
$B=\set{\vec{b_1},\dots,\vec{b_n}}$, our problem is to find the top$t$ dot
products, i.e., the largest $\vec{a_i}\cdot\vec{b_j}$ among all possible
pairs. This is a fundamental mathematical problem that appears in numerous data
applications involving similarity search, link prediction, and collaborative filtering.
We propose a samplingbased approach that avoids direct computation of all $mn$ dot products.
We select diamonds (i.e.,
fourcycles) from the weighted tripartite representation of $A$ and
$B$. The probability of selecting a diamond corresponding to
pair $(i,j)$ is proportional to $(\vec{a_i}\cdot\vec{b_j})^2$, amplifying the focus on the
largestmagnitude entries.
Experimental results indicate that diamond sampling is orders of magnitude faster than
direct computation and
requires far fewer samples than any competing approach.
We also apply diamond sampling to the special case of maximum inner product search, and get
significantly better results than the stateoftheart hashing methods.
 • Catching the head, the tail, and everything in between: a streaming algorithm for the degree distribution
 Olivia Simpson, C. Seshadhri, and Andrew McGregor
 International Conference on Data Mining (ICDM) 2015
 abstract github code
arxiv:1506.02574
The degree distribution is one of the most fundamental graph properties of interest for realworld graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scalefree degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph, with small storage. We design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams will millions of edges with storage less than 1% and get extremely accurate approximations for all scales in the degree distribution.
We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail.
 • When a Graph is not so Simple: Counting Triangles in Multigraph Streams
 Madhav Jha, C. Seshadhri, and Ali Pinar
 Asilomar Conference on Signals, Systems and Computers, ACSSC 2015
 abstract
arxiv:1310.7665
Estimating the number of triangles in a graph given as a stream of edges is a
fundamental problem in data mining. The goal is to design a single pass
spaceefficient streaming algorithm for estimating triangle counts. While there
are numerous algorithms for this problem, they all (implicitly or explicitly)
assume that the stream does not contain duplicate edges. However, data sets
obtained from realworld graph streams are rife with duplicate edges. The work
around is typically an extra unaccounted pass (storing all the edges!) just to
"clean up" the data. Can we estimate triangle counts accurately in a single
pass even when the stream contains repeated edges? In this work, we give the
first algorithm for estimating the triangle count of a multigraph stream of
edges.
@techreport{JhSePi13,
author = {Madhav Jha and C. Seshadhri and Ali Pinar},
title = {When a Graph is not so Simple: Counting Triangles in Multigraph Streams},
institution = {Arxiv},
year = {2013},
number = {1310.7665},
}
 • Path Sampling: A Fast and Provable Method for Estimating 4Vertex Subgraph Counts
 Madhav Jha, C. Seshadhri, and Ali Pinar
 World Wide Web (WWW) 2015
 abstract
arxiv:1411.4942
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs for truly massive graphs employ clusters and massive parallelization.
We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4vertex pattern subgraphs. Our algorithm is based on a novel technique of 3path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.
 • Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions
 Ahmet Erdem Sariyuce, C. Seshadhri, Ali Pinar, and Umit V. Catalyurek
 World Wide Web (WWW) 2015 (Best paper finalist)
 abstract
arxiv:1411.3312
Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, kdensest subgraph) are NPhard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which allows for discovery of overlapping dense subgraphs. With the right parameters, the nuclear decomposition generalizes the classic notions of kcores and ktrusses. We give provable efficient algorithms for nuclear decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other stateoftheart solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
 • Property Testing on Product Distributions: Optimal Testers for Bounded Derivative Properties
 Deeparnab Chakrabarty, Kashyap Dixit, Madhav Jha, and C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2015 (Invited to special issue)
 abstract
arxiv:1404.0718
ECCCTR14042
The primary problem in property testing is to decide whether a given function satisfies a certain property, or is far
from any function satisfying it. This crucially requires a notion of distance between functions. The most prevalent notion
is the Hamming distance over the {\em uniform} distribution on the domain.
This restriction to uniformity is rather limiting, and it is important to investigate distances induced by more general distributions.
In this paper, we give simple and optimal testers for {\em bounded derivative properties} over {\em arbitrary product distributions}.
Bounded derivative properties include fundamental properties such as monotonicity and Lipschitz continuity.
Our results subsume almost all known results (upper and lower bounds) on monotonicity and Lipschitz testing.
We prove an intimate connection between bounded derivative property testing and binary search trees (BSTs). We exhibit a tester whose query complexity is the sum of
expected depths of optimal BSTs for each marginal.
Furthermore, we show this sumofdepths is also a lower bound.
A technical contribution of our work is an {\em optimal dimension reduction theorem} for all bounded derivative properties, which relates the distance of a function from the property to the distance of restrictions of the function to random lines. Such a theorem has been elusive even for monotonicity, and our theorem is an exponential improvement to the previous best known result.
@techreport{ChDiJh+14,
author = {Deeparnab Chakrabarty and Kashyap Dixit and Madhav Jha and C. Seshadhri},
title = {Property Testing on Product Distributions: Optimal Testers for Bounded Derivative Properties},
institution = {ECCC},
year = {2014},
number = {1404.0718},
}
 • FASTPPR: Scaling Personalized PageRank Estimation for Large Graphs
 Peter Lofgren, Siddhartha Banerjee, Ashish Goel, and C. Seshadhri
 SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2014
 abstract
arxiv:1404.3181
We propose a new algorithm, FASTPPR, for estimating personalized PageRank: given input nodes $s$, $t$ in a directed graph and threshold $\delta$, it computes the Personalized PageRank $\PR_s(t)$ from $s$ to $t$, guaranteeing that the relative error is small as long $\PR_s(t) > \delta$.
Existing algorithms for this problem have a runningtime of $\Omega(1/\delta)$. We design a new algorithm, FASTPPR, that
has a provable average runningtime guarantee of $\tilde{O}(\sqrt{d/\delta})$ (where $d$ is the average indegree of the graph). This is a significant improvement, since $\delta$ is often $O(1/n)$ (where $n$
is the number of nodes) for applications. We also complement the algorithm with an $\Omega(1/\sqrt{\delta})$ lower bound for SignificantPageRank, showing that the dependence on $\delta$ cannot be improved.
We perform a detailed empirical study on numerous massive graphs showing that FASTPPR dramatically outperforms existing algorithms. For example, with target nodes sampled according to popularity, on the 2010 Twitter graph with 1.5 billion edges, FASTPPR has a 20 factor speedup over the state of the art. Furthermore, an enhanced version of FASTPPR has a 160 factor speedup on the Twitter graph, and is at least 20 times faster on all our candidate graphs.
@techreport{LoBaGo+14,
author = {Peter Lofgren and Siddhartha Banerjee and Ashish Goel and C. Seshadhri},
title = {FASTPPR: Scaling Personalized PageRank Estimation for Large Graphs},
institution = {Arxiv},
year = {2014},
number = {1404.3181},
}
 • A Scalable Generative Graph Model with Community Structure
 Tamara G. Kolda, Ali Pinar, Todd Plantenga, and C. Seshadhri
 SIAM Journal of Scientific Computing (SISC) 2014
 abstract
arxiv:1302.6636
Network data is ubiquitous and growing, yet we lack realistic generative models that can be calibrated to match realworld data. The recently proposed Block TwoLevel ErdosRenyi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop implementation for a modeling a realworld web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.
@techreport{KoPiPl+13,
author = {Tamara G. Kolda and Ali Pinar and Todd Plantenga and C. Seshadhri},
title = {A Scalable Generative Graph Model with Community Structure},
institution = {Arxiv},
year = {2013},
number = {1302.6636},
}
 • Counting Triangles in Massive Graphs with MapReduce
 Tamara G. Kolda, Ali Pinar, Todd Plantenga, C. Seshadhri, and Christine Task
 SIAM Journal of Scientific Computing (SISC) 2014
 abstract
arxiv:1301.5887
Graphs and networks are used to model interactions in a variety of contexts.
There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are trianglebased and give a measure of the connectedness of mutual friends.
This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for largescale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with extremely massive graphs. We show results on publiclyavailable networks, the largest of which is 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 sec. per million edges plus overhead (approximately 225 sec.
total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and nonsocial networks. To the best of our knowledge, these are the largest trianglebased graph computations published to date.
@techreport{KoPiPl+13,
author = {Tamara G. Kolda and Ali Pinar and Todd Plantenga and C. Seshadhri and Christine Task},
title = {Counting Triangles in Massive Graphs with MapReduce},
institution = {Arxiv},
year = {2013},
number = {1301.5887},
}
 • Why do Simple Algorithms for Triangle Enumeration work in the Real World?
 Jonathan Berry, Luke Fostvedt, Daniel Nordman, Cynthia Phillips, C. Seshadhri, and Alyson Wilson
 Innovations in Theoretical Computer Science (ITCS) 2014
 abstract
conf
Listing all triangles is a fundamental graph operation. Triangles can have important interpretations
in realworld graphs, especially social and other interaction networks.
Despite the lack of provably efficient (linear, or slightly superlinear) worstcase algorithms for this problem, practitioners
run simple, efficient heuristics to find all triangles in graphs with millions of vertices.
How are these heuristics exploiting
the structure of these special graphs to provide major speedups in running time?
We study one of the most prevalent algorithms used by practitioners. A trivial algorithm
enumerates all paths of length $2$, and checks if each such path is incident to a triangle.
A good heuristic is to enumerate only those paths of length $2$
where the middle vertex has the lowest degree.
It is easily implemented and is empirically known to give remarkable speedups over the trivial algorithm.
We study the behavior of this algorithm over graphs with heavytailed
degree distributions, a defining feature of realworld graphs. The
erased configuration model (ECM) efficiently generates a graph
with asymptotically (almost) any desired degree sequence.
We show that the expected running time of this algorithm
over the distribution of graphs created by the ECM is controlled by the $\ell_{4/3}$norm of the degree
sequence.
Norms of the degree sequence are a measure of the heaviness of the tail, and it is
precisely this feature that allows nontrivial speedups of simple
triangle enumeration algorithms. As a corollary of our main theorem,
we prove expected lineartime performance for degree sequences
following a power law with exponent $\alpha \geq 7/3$, and nontrivial
speedup whenever $\alpha \in (2,3)$.
@INPROCEEDINGS{BeFoNo+14,
author = {J. Berry and L. Fostvedt and D. Nordman and C. Phillips and C. Seshadhri and A. Wilson},
title = {Why do Simple Algorithms for Triangle Enumeration work in the Real World?},
booktitle = {Proceedings of the Innovations in Theoretical Computer Science (ITCS)},
year = {2014}
}
 • Decompositions of TriangleDense Graphs
 Rishi Gupta and Tim Roughgarden and C. Seshadhri
 Innovations in Theoretical Computer Science (ITCS) 2014
 abstract
conf
arxiv:1309.7440
High triangle density  the graph property stating that most twohop paths belong to a triangle  is a common signature of social networks. This paper studies triangledense graphs from a structural perspective. We prove constructively that most of the content of a triangledense graph is contained in a disjoint union of radius 2 dense subgraphs. This result quantifies the extent to which triangledense graphs resemble unions of cliques. We also show that our algorithm recovers planted clusterings in stable kmedian instances.
@INPROCEEDINGS{GuRoSe14,
author = {R. Gupta and T. Roughgarden and C. Seshadhri},
title = {Decompositions of TriangleDense Graphs},
booktitle = {Proceedings of the Innovations in Theoretical Computer Science (ITCS)},
year = {2014}
}
 • A ProvablyRobust Sampling Method for Generating Colormaps of Large Data
 David Thompson and Janine Bennett and C. Seshadhri and Ali Pinar
 IEEE Symposium on LargeScale Data and Visualization (LDAV) 2013
 abstract
conf
First impressions from initial renderings of data are crucial for directing
further exploration and analysis. In most visualization systems, default colormaps
are generated by simply linearly interpolating color in some space based on a
value's placement between the minimum and maximum taken on by the dataset. We
design a simple samplingbased method for generating colormaps that highlights important
features. We use random sampling to determine
the distribution of values observed in the data. The sample size required is
independent of the dataset size and only depends on certain accuracy parameters.
This leads to a computationally cheap and robust algorithm for colormap
generation. Our approach (1) uses perceptual color distance to produce palettes
from color curves, (2) allows the user to either emphasize or deemphasize prominent
values in the data, (3) uses quantiles to map distinct colors to values based
on their frequency in the dataset, and (4) supports the highlighting of either
inter or intramode variations in the data.
@INPROCEEDINGS{ThBe+13,
author = {D. Thompson and J. Bennett and C. Seshadhri and A. Pinar},
title = {A ProvablyRobust Sampling Method for Generating Colormaps of Large Data},
booktitle = {{Proceedings of the IEEE Symposium on LargeScale Data and Visualization (LDAV)}},
year = {2013}
}
 • An Optimal Lower Bound for Monotonicity Testing over Hypergrids
 Deeparnab Chakrabarty and C. Seshadhri
 International Workshop on Randomization and Computation (RANDOM) 2013
 abstract
conf
arxiv:1304.5264 ECCCTR13062
For positive integers $n, d$, consider the hypergrid $[n]^d$ with the coordinatewise product partial ordering denoted by $\prec$. A function $f: [n]^d \mapsto \mathbb{N}$ is monotone if $\forall x \prec y$, $f(x) \leq f(y)$. A function $f$ is $\eps$far from monotone if at least an $\eps$fraction of values must be changed to make $f$ monotone. Given a parameter $\eps$, a \emph{monotonicity tester} must distinguish with high probability a monotone function from one that is $\eps$far.
We prove that any (adaptive, twosided) monotonicity tester for functions $f:[n]^d \mapsto \mathbb{N}$ must make $\Omega(\eps^{1}d\log n  \eps^{1}\log \eps^{1})$ queries. Recent upper bounds show the existence of $O(\eps^{1}d \log n)$ query monotonicity testers for hypergrids. This closes the question of monotonicity testing for hypergrids over arbitrary ranges. The previous best lower bound for general hypergrids was a nonadaptive bound of $\Omega(d \log n)$.
@INPROCEEDINGS{ChSe132,
author = {D. Chakrabarty and C. Seshadhri},
title = {An optimal lower bound for monotonicity testing over hypergrids},
booktitle = {{Proceedings of the International Workshop on Randomization and Computation (RANDOM)}},
year = {2013},
owner = {scomand},
timestamp = {2013.03.14}
}
 • A Space Efficient Streaming Algorithm for Triangle Counting using the Birthday Paradox
 Madhav Jha, C. Seshadhri, and Ali Pinar
 SIGKDD Conference on Knowledge
Discovery and Data Mining (KDD) 2013 (Best student paper)
 abstract
conf
arxiv:1212.2264
Modern massive graphs such as the web and communication networks are truly \emph{dynamic graphs}. The graph is not a static object, but is most accurately characterized as a \emph{stream of edges}. How does one store a small sample of the past and still answer meaningful questions about the entire graph? This is the aim of a sublinear space streaming algorithm. We study the problem of counting the number of triangles (or estimating the clustering coefficient) in a massive streamed graph.
We provide a $O(\sqrt{n})$space algorithm that provably approximates the clustering coefficient with only a single pass through a graph of $n$ vertices. Our procedure is based on the classic probabilistic result, \emph{the birthday paradox}. While streaming algorithms for triangle counting have been designed before, none of these provide comparable theoretical (or empirical) results.
Using some probabilistic heuristics, we design a practical algorithm based on these ideas. Our algorithm can compute accurate estimates for the clustering coefficient and number of triangles for a graph using only a single pass over the edges. The storage of the algorithm is a tiny fraction of the graph. For example, even for a graph with 200 million edges, our algorithm stores just 50,000 edges to give accurate results. Being a single pass streaming algorithm, our procedure also maintains \emph{a realtime estimate} of the clustering coefficient/number of triangles of a graph, by storing a miniscule fraction of edges.
@INPROCEEDINGS{JhSePi13,
author = {Jha, Madhav and Seshadhri, C. and Pinar, Ali},
title = {A space efficient streaming algorithm for triangle counting using
the birthday paradox},
booktitle = {Proceedings of the 19th ACM SIGKDD international conference on Knowledge
discovery and data mining},
year = {2013},
series = {KDD '13},
pages = {589597}
}
 • A Scalable Null Model for Directed Graphs Matching All Degree Distributions: In, Out, and Reciprocal
 Nurcan Durak, Tamara G. Kolda, Ali Pinar, C. Seshadhri
 IEEE Workshop on Network Science (NSW) 2013
 abstract
conf
arxiv:1210.5288
Degree distributions are arguably the most important property of real world networks. The classic edge configuration model or ChungLu model can generate an undirected graph with any desired degree distribution. This serves as an excellent null model to compare algorithms or perform experimental studies that distinguish real networks from this baseline. Furthermore, there are scalable algorithms that implement these models and they are invaluable in the study of graphs. However, networks in the realworld are often directed, and have a significant proportion of reciprocal edges. A stronger relation exists between two nodes when they each point to one another (reciprocal edge) as compared to when only one points to the other (oneway edge). Despite their importance, reciprocal edges have been disregarded by most directed graph models. We propose a null model for directed graphs inspired by the ChungLu model that matches the in, out, and reciprocaldegree distributions of the real graphs.
Our algorithm is scalable and requires O(m) random numbers to generate a graph with m edges. We perform a series of experiments on real datasets and compare with existing graph models.
@inproceedings{DuKoPiSe12,
author = {Nurcan Durak and Tamara G. Kolda and Ali Pinar and C. Seshadhri},
title = {A scalable directed graph model with reciprocal edges},
booktitle = {IEEE Workshop on Network Science},
year = {2013}
}
 • An o(n) Monotonicity Tester for Boolean Functions over the Hypercube
 Deeparnab Chakrabarty and C. Seshadhri
 Symposium on Theory of Compting (STOC) 2013 (Invited to special issue)
 abstract
conf
ECCCTR13029
Given oracle access to a Boolean function $f:\{0,1\}^n \mapsto \{0,1\}$, we design a randomized tester that takes as input a parameter $\eps>0$, and outputs {\sf Yes} if the function is monotone, and outputs {\sf No} with probability $>2/3$, if the function is $\eps$far from monotone. That is, $f$ needs to be modified at $\eps$fraction of the points to make it monotone. Our nonadaptive, onesided tester makes $\widetilde{O}(n^{5/6}\eps^{5/3})$ queries to the oracle.
@inproceedings{CS12,
author = {D. Chakrabarty and C. Seshadhri},
title = {An o(n) monotonicity tester for Boolean functions over the hypercube},
booktitle = {ACM Symposium on Theory of Computing (STOC)},
year = {2013}
}
 • Optimal Bounds for Monotonicity and Lipschitz Testing over
Hypercubes and Hypergrids
 Deeparnab Chakrabarty and C. Seshadhri
 Symposium on Theory of Compting (STOC) 2013
 abstract
conf
ECCCTR12030
The problem of monotonicity testing over the hypergrid and its special case, the hypercube, is a classic, wellstudied, yet unsolved question in property testing. We are given query access to $f:[k]^n \mapsto \R$
(for some ordered range $\R$).
The hypergrid/cube has a natural partial order given by coordinatewise ordering, denoted by $\prec$.
A function is \emph{monotone} if for all pairs $x \prec y$,
$f(x) \leq f(y)$. The distance to monotonicity, $\eps_f$, is the minimum fraction of values of $f$
that need to be changed to make $f$ monotone.
For $k=2$ (the boolean hypercube), the usual tester is the \emph{edge tester}, which
checks monotonicity on adjacent pairs of domain points.
It is known that the edge tester using $O(\eps^{1}n\log\R)$
samples can distinguish a monotone function from one where $\eps_f > \eps$.
On the other hand, the best lower bound for monotonicity testing over the hypercube is $\min(\R^2,n)$.
This leaves
a quadratic gap in our knowledge, since $\R$ can be $2^n$.
We resolve this long standing open problem and prove that $O(n/\eps)$ samples suffice
for the edge tester. For hypergrids, known testers require $O(\eps^{1}n\log k\log \R)$ samples, while
the best known (nonadaptive) lower bound is $\Omega(\eps^{1} n\log k)$. We give
a (nonadaptive) monotonicity tester for hypergrids running in $O(\eps^{1} n\log k)$ time.
Our techniques lead to optimal property testers (with the same running time) for the natural \emph{Lipschitz property}
on hypercubes and hypergrids. (A $c$Lipschitz function
is one where $f(x)  f(y) \leq c\xy\_1$.)
In fact, we give a general unified proof for $O(\eps^{1}n\log k)$query testers for a class of ``boundedderivative" properties,
a class containing both monotonicity and Lipschitz.
@inproceedings{CS12,
author = {D. Chakrabarty and C. Seshadhri},
title = {Optimal Bounds for Monotonicity and Lipschitz Testing over Hypercubes and Hypergrids},
booktitle = {ACM Symposium on Theory of Computing (STOC)},
year = {2013}
}
 • Triadic Measures on Graphs: The Power of Wedge Sampling
 C. Seshadhri, Ali Pinar, and Tamara G. Kolda
 SIAM Conference on Data Mining (SDM) 2013 (Best research paper)
 abstract
conf
arxiv:1202.5230
there is a growing need to quickly assess the structure of a graph.
Some of the most useful graph metrics, especially those measuring social cohesion,
are based on triangles.
Despite the importance of these triadic measures, associated algorithms can be extremely expensive.
We discuss the method of wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and
enables rapid uniform sampling of the triangles of a graph.
Our methods come with provable and practical timeapproximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the stateoftheart, while providing nearly the accuracy of full enumeration.
Our results will enable more widescale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several realworld examples.
@inproceedings{SePiKo13,
author = {C. Seshadhri and Ali Pinar and Tamara G. Kolda},
title = {Triadic Measures on Graphs: The Power of Wedge Sampling},
booktitle = {SIAM Conference on Data Mining (SDM)},
year = {2013}
}
 • Space Efficient Streaming Algorithms for the Distance to Monotonicity and Asymmetric Edit Distance
 Michael Saks and C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2013
 abstract
conf
arxiv:1204.1098
Approximating the length of the longest increasing sequence (LIS) of a data stream is a wellstudied problem. There are many algorithms that estimate the size of the complement of the LIS, referred to as the \emph{distance to monotonicity}, both in the streaming and property testing setting. Let $n$ denote the size of an input array. Our aim is to develop a onepass streaming algorithm that accurately approximates the distance to monotonicity, and only uses polylogarithmic storage. For any $\delta > 0$, our algorithm provides a $(1+\delta)$multiplicative approximation for the distance, and uses only $O((\log^2 n)/\delta)$ space. The previous best known approximation using polylogarithmic space was a multiplicative 2factor. Our algorithm is simple and natural, being just 3 lines of pseudocode. It is essentially a polylogarithmic space implementation of a classic dynamic program that computes the LIS.
Our technique is more general and is applicable to other problems that are exactly solvable by dynamic programs. We are able to get a streaming algorithm for the longest common subsequence problem (in the asymmetric setting introduced in \cite{AKO10}) whose space is small on instances where no symbol appears very many times. Consider two strings (of length $n$) $x$ and $y$. The string $y$ is known to us, and we only have streaming access to $x$. The size of the complement of the LCS is the edit distance between $x$ and $y$ with only insertions and deletions. If no symbol occurs more than $k$ times in $y$, we get a $O(k(\log^2 n)/\delta)$space streaming algorithm that provides a $(1+\delta)$multiplicative approximation for the LCS complement. In general, we also provide a deterministic 1pass streaming algorithm that outputs a $(1+\delta)$multiplicative approximation for the LCS complement and uses $O(\sqrt{(n\log n)/\delta})$ space.
@inproceedings{SaSe13,
author = {M. Saks and C. Seshadhri},
title = {Space Efficient Streaming Algorithms for the Distance to Monotonicity and Asymmetric Edit Distance},
booktitle = {Symposium on Discrete Algorithms (SODA)},
year = {2013}
}
 • Finding Cycles and Trees in Sublinear Time
 Artur Czumaj, Oded Goldreich, Dana Ron, C. Seshadhri, Asaf Shapira, and Christian Sohler
 Random Structures and Algorithms 2012
 abstract
arxiv:1007.4230
We present sublineartime (randomized) algorithms for finding simple cycles of length at least $k\geq 3$ and treeminors in boundeddegree graphs. The complexity of these algorithms is related to the distance of the graph from being $C_k$minorfree (resp., free from having the corresponding treeminor). In particular, if the graph is far (i.e., $\Omega(1)$far) {from} being cyclefree, i.e. if one has to delete a constant fraction of edges to make it cyclefree, then the algorithm finds a cycle of polylogarithmic length in time $\tildeO(\sqrt{N})$, where $N$ denotes the number of vertices. This time complexity is optimal up to polylogarithmic factors.
The foregoing results are the outcome of our study of the complexity of {\em onesided error} property testing algorithms in the boundeddegree graphs model. For example, we show that cyclefreeness of $N$vertex graphs can be tested with onesided error within time complexity $\tildeO(\poly(1/\e)\cdot\sqrt{N})$. This matches the known $\Omega(\sqrt{N})$ query lower bound, and contrasts with the fact that any minorfree property admits a {\em twosided error} tester of query complexity that only depends on the proximity parameter $\e$. For any constant $k\geq3$, we extend this result to testing whether the input graph has a simple cycle of length at least~$k$. On the other hand, for any fixed tree $T$, we show that $T$minorfreeness has a onesided error tester of query complexity that only depends on the proximity parameter $\e$.
Our algorithm for finding cycles in boundeddegree graphs extends to general graphs, where distances are measured with respect to the actual number of edges. Such an extension is not possible with respect to finding treeminors in $o(\sqrt{N})$ complexity.
@journal{CGR+10,
author = {Artur Czumaj and Oded Goldreich and Dana Ron and C. Seshadhri and Asaf Shapira and Christian Sohler},
title = {Finding Cycles and Trees in Sublinear Time},
journal = {Random Structures and Algorithms},
year = {2012},
notes = {DOI: 10.1002/rsa.20462}
}
 • Degree Relations of Triangles in Realworld Networks and Graph Models
 Nurcan Durak, Ali Pinar, Tamara G. Kolda, and C. Seshadhri
 Conference on Information and Knowledge Management (CIKM) 2012
 abstract
conf
arxiv:1207.7125
Triangles are an important building block and distinguishing feature
of realworld networks, but their structure is still poorly
understood. Despite numerous reports on the abundance of triangles,
there is very little information on what these triangles look like.
We initiate the study of \emph{degreelabeled triangles} 
specifically, degree homogeneity versus heterogeneity in triangles.
This yields new insight into the structure of realworld graphs.
We observe that networks coming from social and collaborative situations are dominated
by homogeneous triangles, i.e., degrees of vertices in a triangle are quite similar to each other.
On the other hand, information networks (e.g., web graphs) are dominated by heterogeneous triangles,
i.e., the degrees in triangles are quite disparate. Surprisingly,
nodes within the top 1\% of degrees participate in the \emph{vast majority}
of triangles in heterogeneous graphs.
We investigate whether current graph models
reproduce the types of triangles that are observed in real data and
observe that most models fail to accurately capture these salient features.
@inproceedings{DPKS12,
author = {N. Durak and A. Pinar and T. G. Kolda and C. Seshadhri},
title = {Degree Relations of Triangles in Realworld Networks and Graph Models},
booktitle = {ACM International Conference on Information and Knowledge Management},
year = {2012}
}
 • Vertex Neighborhoods, Low Conductance Cuts, and Good Seeds for Local Community Methods
 David F. Gleich and C. Seshadhri
 SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2012
 abstract
conf
arxiv:1112.0031 (old version) Code and data
The communities of a social network are sets of vertices
with more connections inside the set than outside.
We theoretically demonstrate that two commonly
observed properties of social networks, heavytailed
degree distributions and large clustering coefficients,
imply the existence of vertex neighborhoods (also known
as egonets) that are themselves good communities. We
evaluate these neighborhood communities on a range of graphs.
What we find is that the neighborhood communities can
exhibit conductance scores that are as good as the Fiedler
cut. Also, the conductance of neighborhood communities
shows similar behavior as the network community profile
computed with a personalized PageRank community detection
method. Neighborhood communities give us a simple and powerful
heuristic for speeding up local partitioning methods.
Since finding good seeds for the PageRank clustering
method is difficult, most approaches involve an expensive sweep over a great many
starting vertices. We show how to use
neighborhood communities to quickly generate a small set of
seeds.
@inproceedings{GS12,
author = {David F. Gleich and C. Seshadhri},
title = {Vertex Neighborhoods, Low Conductance Cuts, and Good Seeds for Local Community Methods },
booktitle = {SIGKDD Conference on Knowledge Discovery and Data Mining},
year = {2012}
}
 • Selfimproving Algorithms for Coordinatewise Maxima
 Ken Clarkson, Wolfgang Mulzer, and C. Seshadhri
 Symposium on Computation Geometry (SoCG) 2012
 SIAM Journal on Computing 2014
 abstract
conf
journal(arxiv:1211.0952)
Computing the coordinatewise maxima of a planar point set
is a classic and wellstudied problem in computational geometry.
We give an algorithm for this problem in the \emph{selfimproving setting}.
We have $n$ (unknown) independent distributions
$\cD_1, \cD_2, \ldots, \cD_n$ of planar points. An input pointset
$(p_1, p_2, \ldots, p_n)$
is generated by taking an independent sample $p_i$ from each $\cD_i$, so
the input distribution $\cD$ is the product $\prod_i \cD_i$. A selfimproving
algorithm repeatedly gets input sets from the distribution $\cD$ (which is
\emph{a priori} unknown) and tries to
optimize its running time for $\cD$. Our algorithm uses the first few
inputs to learn salient features of the distribution, and then becomes an
optimal algorithm
for distribution $\cD$. Let $\OPT_\cD$ denote the expected depth of
an \emph{optimal} linear comparison tree computing the maxima for
distribution $\cD$. Our algorithm
eventually has an expected running time of $O(\text{OPT}_\cD + n)$,
even though it did not know $\cD$ to begin with.
Our result requires new tools to understand linear comparison trees
for computing maxima.
We show how to convert general linear comparison trees to very restricted
versions, which can then be
related to the running time of our algorithm. An interesting feature of our
algorithm
is an interleaved search, where the algorithm tries to determine the likeliest
point to be maximal with minimal computation. This allows the running time to be
truly optimal for the distribution $\cD$.
@inproceedings{CMS12,
author = {Ken Clarkson and Wolfgang Mulzer and C. Seshadhri},
title = {Selfimproving Algorithms for Coordinatewise Maxima},
booktitle = {28th Symposium on Computational Geometry},
year = {2012}
}
 • Community Structure and Scalefree Collections of ErdosRenyi Graphs
 C. Seshadhri, Tamara G. Kolda, and Ali Pinar
 Physical Review E 2012
 abstract
arxiv:1112.3644
journal
Code and data
Community structure plays a significant role in the analysis of social
networks and similar graphs, yet this structure is little understood
and not well captured by most models.
We formally define a community to be a subgraph that is
internally highly connected and has no deeper substructure.
We use tools of combinatorics to show that any such community must
contain a dense Erd\H{o}sR\'enyi~(ER) subgraph.
Based on mathematical arguments, we hypothesize that any graph
with a heavytailed degree distribution and community structure
must contain a scale free collection of dense ER subgraphs.
These theoretical observations corroborate well with
empirical evidence.
From this, we propose the Block TwoLevel Erd\H{o}sR\'enyi (BTER) model,
and demonstrate that it accurately captures the observable
properties of many realworld social networks.
@article{SKP11,
author = {C. Seshadhri and Tamara G. Kolda and Ali Pinar},
title = {Community structure and scalefree collections of Erd\H{o}sR\'enyi graphs},
journal = {Physical Review E},
year = {2012},
volume = {85},
issue = {5},
number = {056109},
url = {http://pre.aps.org/abstract/PRE/v85/i5/e056109}
}
 • Are we there yet? When to stop a Markov Chain while Generating Random Graphs
 Jaideep Ray, Ali Pinar, C. Seshadhri
 Workshop on Algorithms and Models for the Web Graph (WAW) 2012
 abstract
conf
arxiv:1202.3473
Markov chains are a convenient means of generating realizations of networks, since they require little more than a procedure for rewiring edges. If a rewiring procedure exists for
generating new graphs with specified statistical properties, then a Markov chain sampler can generate an ensemble of graphs with prescribed characteristics. However, successive
graphs in a Markov chain cannot be used when one desires independent draws from the distribution of graphs; the realizations are correlated. Consequently, one runs a Markov chain for
N iterations before accepting the realization as an independent sample. In this work, we devise two methods for calculating N. They are both based on the binary "timeseries"
denoting the occurrence/nonoccurrence of edge (u, v) between vertices u and v in the Markov chain of graphs generated by the sampler. They differ in their underlying assumptions. We
test them on the generation of graphs with a prescribed joint degree distribution. We find the N proportional E, where E is the number of edges in the graph. The two methods are
compared by sampling on real, sparse graphs with 10^3  10^4 vertices. B
@inproceedings{RPS12,
author = {Jaideep Ray and Ali Pinar and C. Seshadhri},
title = {Are we there yet? When to stop a Markov chain while generating random graphs},
booktitle = {9th Workshop on Algorithms and Models for the Web Graph},
pages = {153164},
year = {2012}
}
 • The Similarity between Stochastic Kronecker and ChungLu Graph Models
 Ali Pinar, C. Seshadhri, and Tamara G. Kolda
 SIAM Conference on Data Mining (SDM) 2012
 abstract
conf
arxiv:1110.4925
The analysis of massive graphs is now becoming a very important part of science and industrial research. This has led to the construction of a large variety of graph models, each with their own advantages. The Stochastic Kronecker Graph (SKG) model
has been chosen by the Graph500 steering committee to create supercomputer benchmarks for graph algorithms. The major reasons for this are its easy parallelization and ability to mirror real data. Although SKG is easy to implement, there is little
understanding of the properties and behavior of this model.
We show that the parallel variant of the edgeconfiguration model given by Chung and Lu (referred to as CL) is notably similar to the SKG model. The graph properties of an SKG are extremely close to those of a CL graph generated with the appropriate
parameters. Indeed, the final probability matrix used by SKG is almost identical to that of a CL model. This implies that the graph distribution represented by SKG is almost the same as that given by a CL model. We also show that when it comes to
fitting real data, CL performs as well as SKG based on empirical studies of graph properties. CL has the added benefit of a trivially simple fitting procedure and exactly matching the degree distribution. Our results suggest that users of the SKG
model should consider the CL model because of its similar properties, simpler structure, and ability to fit a wider range of degree distributions. At the very least, CL is a good control model to compare against.
@inproceedings{PiSeKo11,
author = {Ali Pinar and C. Seshadhri and Tamara G. Kolda},
title = {The Similarity between Stochastic Kronecker and ChungLu Graph Models},
booktitle = {Proceedings of 12th SIAM International Conference on Data Mining},
pages = {10711082},
year = {2012}
}
 • An InDepth Analysis of Stochastic Kronecker Graphs
 C. Seshadhri, Ali Pinar, and Tamara G. Kolda
 International Conference on Data Mining (ICDM) 2011
 Journal of the ACM 2013
 abstract
conf
journal (arxiv:1102.5046)
Graph analysis is playing an increasingly important role in science and industry. Due to numerous limitations in sharing realworld graphs, models for generating massive graphs are critical for developing better algorithms. In this paper, we analyze the stochastic Kronecker graph model (SKG), which is the foundation of the Graph500 supercomputer benchmark due to its favorable properties and easy parallelization. Our goal is to provide a deeper understanding of the parameters and properties of this model so that its functionality as a benchmark is increased. We develop a rigorous mathematical analysis that shows this model cannot generate a powerlaw distribution or even a lognormal distribution. However, we formalize an enhanced version of the SKG model that uses random noise for smoothing. We prove both in theory and in practice that this enhancement leads to a lognormal distribution. Additionally, we provide a precise analysis of isolated vertices, showing that the graphs that are produced by SKG might be quite different than intended. For example, between 50% and 75% of the vertices in the Graph500 benchmarks will be isolated. Finally, we show that this model tends to produce extremely small core numbers (compared to most social networks and other real graphs) for common parameter choices.
These results have been communicated to the Graph500 steering committee, and they are currently using our theorems to set the benchmark parameters. Our noisy version of SKG is likely to become the Graph500 standard for next year's benchmark.
@techreport{SPK11,
author = {C. Seshadhri and Ali Pinar and Tamara G. Kolda},
title = {A Hitchhiker's Guide to Choosing Parameters of Stochastic Kronecker Graphs},
institution = {Arxiv},
year = {2011},
number = {1102.5046},
}
 • Influence and Dynamic Behavior in Random Boolean Networks
 C. Seshadhri, Yevgeniy Vorobeychik, Jackson R. Mayo, Robert C. Armstrong, and Joseph R. Ruthruff
 Physical Review Letters 2011
 abstract
journal (arxiv:1107.3792)
We present a rigorous mathematical framework for analyzing dynamics of a broad class of Boolean network
models. We use this framework to provide the first formal proof of many of the standard critical transition
results in Boolean network analysis, and offer analogous characterizations for novel classes of random Boolean
networks. We show that some of the assumptions traditionally made in the more common meanfield analysis
of Boolean networks do not hold in general. For example, we offer evidence that imbalance (internal inhomogeneity)
of transfer functions is a crucial feature that tends to drive quiescent behavior far more strongly than
previously observed.
@article{SVM+11,
author = {C. Seshadhri and
Y. Vorobeychik and J. Mayo and R. Armstrong and J. Ruthruff},
title = {Influence and Dynamic Behavior in Random Boolean Networks},
journal = {Physical Review Letters},
volume = {107},
number = {10},
year = {2011},
pages = {108701}
}
 • Blackbox Identity Testing for Bounded Top Fanin Depth3 Circuits: the Field doesn't matter
 Nitin Saxena and C. Seshadhri
 Symposium on Theory of Computing (STOC) 2011 (Invited to special issue)
 SIAM Journal on Computing 2012
 abstract
conf
ECCC TR10167
journal
Let C be a depth3 circuit with n variables, degree d and top fanin k (called sps(k,d,n) circuits) over base field F.
It is a major open problem to design a deterministic polynomial time blackbox algorithm
that tests if C is identically zero.
Klivans & Spielman (STOC 2001) observed that the problem
is open even when k is a constant.
This case has been subjected to a serious study over the past few years, starting
from the work of Dvir & Shpilka (STOC 2005).
We give the first polynomial time blackbox algorithm for this problem. Our algorithm
runs in time poly(nd^k), regardless of the base field. The *only* field
for which polynomial time algorithms were previously known
is F=Q (Kayal & Saraf, FOCS 2009, and Saxena & Seshadhri, FOCS 2010).
This is the first blackbox algorithm for depth3 circuits that does not use
the rank based approaches of Karnin & Shpilka (CCC 2009).
We prove an important tool for the study of depth3 identities. We design
a blackbox polynomial time transformation that reduces the number of variables
in a sps(k,d,n) circuit to k variables, but preserves the identity structure.
@inproceedings{SaxS11,
author = {N. Saxena and C. Seshadhri},
title = {Blackbox identity testing for bounded top fanin depth3 circuits: the field doesn't matter},
booktitle = {Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC)},
year = {2011}
}
 • Combinatorial Approximation Algorithms for MaxCut using Random Walks
 Satyen Kale and C. Seshadhri
 Innovations in Computer Science (ICS) 2010
 abstract
conf
arxiv:1008.3938
We give the first combinatorial approximation algorithm for Maxcut that beats
the trivial 0.5 factor by a constant. The main partitioning procedure is very
intuitive, natural, and easily described. It essentially performs a number of
random walks and aggregates the information to provide the partition. We can
control the running time to get an approximation factorrunning time tradeoff.
We show that for any constant b > 1.5, there is an O(n^{b}) algorithm that
outputs a (0.5+delta)approximation for Maxcut, where delta = delta(b) is some
positive constant.
One of the components of our algorithm is a weak local graph partitioning
procedure that may be of independent interest. Given a starting vertex $i$ and
a conductance parameter phi, unless a random walk of length ell = O(log n)
starting from i mixes rapidly (in terms of phi and ell), we can find a cut of
conductance at most phi close to the vertex. The work done per vertex found in
the cut is sublinear in n.
@inproceedings{KalS10,
author = {Satyen Kale and C. Seshadhri},
title = {Combinatorial Approximation Algorithms for MaxCut using Random Walks},
booktitle = {Proceedings of the Innovations in Computer Science (ICS)},
year = {2011},
pages = {367388}
}
 • Is Submodularity Testable?
 C. Seshadhri and Jan Vondrak
 Innovations in Computer Science (ICS) 2010
 Algorithmica 2012
 abstract
conf
arxiv:1008.0831
journal
We initiate the study of property testing of submodularity on the boolean hypercube. Submodular functions come up in a variety of applications in combinatorial optimization. For a vast range of algorithms, the existence of an oracle to a submodular function is assumed. But how does one check if this oracle indeed represents a submodular function?
Consider a function f:{0,1}^n \rightarrow R. The distance to submodularity is the minimum fraction of values of $f$ that need to be modified to make f submodular. If this distance is more than epsilon > 0, then we say that f is epsilonfar from being submodular. The aim is to have an efficient procedure that, given input f that is epsilonfar from being submodular, certifies that f is not submodular. We analyze a very natural tester for this problem, and prove that it runs in subexponential time. This gives the first nontrivial tester for submodularity. On the other hand, we prove an interesting lower bound (that is, unfortunately, quite far from the upper bound) suggesting that this tester cannot be very efficient in terms of epsilon. This involves nontrivial examples of functions which are far from submodular and yet do not exhibit too many local violations.
We also provide some constructions indicating the difficulty in designing a tester for submodularity. We construct a partial function defined on exponentially many points that cannot be extended to a submodular function, but any strict subset of these values can be extended to a submodular function.
@inproceedings{SesV10,
author = {J. Vondrak and
C. Seshadhri},
title = {Is submodularity testable?},
booktitle = {Proceedings of the Innovations in Computer Science (ICS)},
year = {2011},
pages = {195210}
}
 • Estimating the Longest Increasing Sequence in Polylogarithmic Time
 Michael Saks and C. Seshadhri
 Foundations of Computer Science (FOCS) 2010
 abstract
conf
arXiv:1308.0626
Finding the length of the longest increasing subsequence (LIS) is
a classic algorithmic problem. Let $n$ denote the size
of the array. Simple $O(n\log n)$ algorithms
are known for this problem. What can a sublinear time algorithm
achieve? We develop a polylogarithmic time
randomized algorithm that for any constant $\delta > 0$,
estimates the length of the LIS of an array upto an additive error of $\delta n$.
More precisely, the running time of the algorithm
is $(\log n)^c (1/\delta)^{O(1/\delta)}$ where the exponent $c$
is independent of $\delta$.
Previously, the
best known polylogarithmic time algorithms could only
achieve an additive $n/2$ approximation.
The \emph{distance to monotonicity}, which is the fractional size of the complement
of the LIS, has been studied in depth by the streaming and property testing
communities.
Our polylogarithmic algorithm actually has a stronger guarantee, and gives (for any constant $\delta > 0$) a multiplicative
$(1+\delta)$approximation to the distance to monotonicity. This is a significant improvement
over the previously known $2$factor approximations.
@inproceedings{SakS10,
author = {M. E. Saks and
C. Seshadhri},
title = {Estimating the longest increasing sequence in polylogarithmic time},
booktitle = {Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)},
year = {2010},
pages = {458467},
}
 • From SylvesterGallai Configurations to Rank Bounds: Improved Blackbox Identity Test for Depth3 Circuits
 Nitin Saxena and C. Seshadhri
 Foundations of Computer Science (FOCS) 2010
 Journal of the ACM 2013
 abstract
conf
ECCC TR10013
journal
We study the problem of identity testing for depth3 circuits of top fanin k and degree d. We give a new structure theorem for such identities. A direct
application of our theorem improves the known deterministic d^{k^k}time blackbox identity test over rationals (KayalSaraf, FOCS 2009) to one that takes
d^{k^2}time. Our structure theorem essentially says that the number of independent variables in a real depth3 identity is very small. This theorem settles
affirmatively the stronger rank conjectures posed by DvirShpilka (STOC 2005) and KayalSaraf (FOCS 2009). Our techniques provide a unified framework that
actually beats all known rank bounds and hence gives the best running time (for *every* field) for blackbox identity tests. Our main theorem (almost
optimally) pins down the relation between higher dimensional SylvesterGallai theorems and the rank of depth3 identities in a very transparent manner. The
existence of this was hinted at by DvirShpilka (STOC 2005), but first proven, for reals, by KayalSaraf (FOCS 2009). We introduce the concept of
SylvesterGallai rank bounds for any field, and show the intimate connection between this and depth3 identity rank bounds. We also prove the first ever
theorem about high dimensional SylvesterGallai configurations over *any* field. Our proofs and techniques are very different from previous results and devise
a very interesting ensemble of combinatorics and algebra. The latter concepts are ideal theoretic and involve a new Chinese remainder theorem. Our proof
methods explain the structure of *any* depth3 identity C: there is a nucleus of C that forms a low rank identity, while the remainder is a high dimensional
SylvesterGallai configuration.
@inproceedings{SaxS10,
author = {N. Saxena and C. Seshadhri},
title = {From SylvesterGallai Configurations to Rank Bounds: Improved Blackbox Identity Test for Depth3 Circuits},
booktitle = {Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)},
year = {2010},
pages = {2129},
}
 • SelfImproving Algorithms for Convex Hulls
 Ken Clarkson, Wolfgang Mulzer, and C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2010
 abstract
conf
We describe an algorithm for computing
planar convex hulls in the selfimproving
model: given a sequence $I_1, I_2, \ldots$ of planar $n$point sets,
the upper convex hull $conv(I)$ of each set $I$ is desired.
We assume that there exists a probability distribution $D$ on $n$point
sets, such that the inputs $I_j$ are drawn independently
according to $D$. Furthermore, $D$ is such that the individual points
are distributed independently of each other. In other words, the $i$'th point
is distributed according to $D_i$. The $D_i$'s can be arbitrary
but are independent of each other.
The distribution
$D$ is not known to the algorithm in advance.
After a learning phase of $n^\eps$ rounds,
the expected time to compute $conv(I)$ is
$O(n + H(conv(I)))$. Here, $H(conv(I))$ is
the entropy of the output, which is
a lower bound for the expected running time
of \emph{any} algebraic computation tree that computes
the convex hull. (More precisely, $H(conv(I))$ is
the minimum entropy
of any random variable that maps $I$ to a description of
$conv(I)$ and to a labeling scheme that proves nonextremality
for every point in $I$ not on the hull.)
Our algorithm is thus asymptotically optimal
for $D$.
@inproceedings{ClaMS10,
author = {K. Clarkson and W. Mulzer and C. Seshadhri},
title = {Selfimproving Algorithms for Convex Hulls},
booktitle = {Proceedings of the 21st ACMSIAM Symposium on Discrete Algorithms (SODA)},
year = {2010},
pages = {}
}
 • Efficient Learning Algorithms for Changing Environments
 Elad Hazan and C. Seshadhri
 International Conference on Machine Learning (ICML) 2009
 abstract
ECCC TR07088
We study online learning in an oblivious changing environment. The standard measure of regret bounds the difference between the cost of the online learner and the best decision in hindsight. Hence, regret minimizing algorithms tend to converge to the static best optimum, clearly a suboptimal behavior in changing environments. On the other hand, various metrics proposed to strengthen regret and allow for more dynamic algorithms produce inefficient algorithms.
We propose a different performance metric which strengthens the standard metric of regret and measures performance with respect to a changing comparator. We then describe a series of datastreamingbased reductions which transform algorithms for minimizing (standard) regret into adaptive algorithms albeit incurring only polylogarithmic computational overhead.
Using this reduction, we obtain {\it efficient} low adaptiveregret algorithms for the problem of online convex optimization.
This can be applied to various learning scenarios, i.e. online portfolio selection, for which we describe experimental results showing the advantage of
adaptivity.
@inproceedings{HazS09,
author = {E. Hazan and
C. Seshadhri},
title = {Efficient learning algorithms for changing environments},
booktitle = {Proceedings of the 26th Annual International Conference on Machine Learning (ICML)},
year = {2009},
pages = {50},
}
 • An Almost Optimal Rank Bound for Depth3 Identities
 Nitin Saxena and C. Seshadhri
 Conference on Computational Complexity (CCC) 2009
 SIAM Journal on Computing 2011
 abstract
conf
ECCC TR08108
journal
We show that the rank of a depth$3$ circuit (over any field) that is simple,
minimal and zero is at most $O(k^3\log d)$. The previous best rank bound known was
$2^{O(k^2)}(\log d)^{k2}$ by Dvir and Shpilka (STOC 2005).
This almost resolves the rank question first posed by Dvir and Shpilka
(as we also provide a simple and minimal identity of rank $\Omega(k\log d)$).
Our rank bound significantly improves (dependence on $k$ exponentially reduced)
the best known deterministic blackbox identity tests for depth$3$ circuits
by Karnin and Shpilka (CCC 2008). Our techniques also shed light
on the factorization pattern of nonzero depth$3$ circuits, most
strikingly: the rank of linear factors of a simple, minimal and nonzero depth$3$ circuit
(over any field) is at most $O(k^3\log d)$.
The novel feature of this work is a new notion of maps between sets
of linear forms, called \emph{ideal
matchings}, used to study depth$3$ circuits. We prove interesting structural results about
depth$3$ identities using these techniques. We believe that these can lead
to the goal of a deterministic polynomial time identity test for these circuits.
@inproceedings{SaxS09,
author = {N. Saxena and C. Seshadhri},
title = {An Almost Optimal Rank Bound for Depth3 Identities},
booktitle = {Proceedings of the 24th Annual Conference on Computational Complexity (CCC)},
year = {2009},
pages = {137148}
}
@article{SaxS11,
author = {N. Saxena and C. Seshadhri},
title = {An Almost Optimal Rank Bound for Depth3 Identities},
journal = {SIAM Journal on Computing},
volume = {40},
number = {1},
year = {2011},
pages = {200224}
}
 • Noise Tolerance of Expanders and Sublinear Expander Reconstruction
 Satyen Kale, Yuval Peres, and C. Seshadhri
 Foundations of Computer Science (FOCS) 2008
 SIAM Journal on Computing 2013
 abstract
conf
journal
We consider the problem of \emph{online sublinear expander reconstruction} and
its relation to random walks in ``noisy" expanders. Given access to an
adjacency list representation of a boundeddegree graph $G$, we want to convert
this graph into a boundeddegree expander $G'$ changing $G$ as little as
possible. The graph $G'$ will be output by a \emph{distributed filter}: this is
a sublinear time procedure that given a query vertex, outputs all its neighbors
in $G'$, and can do so even in a distributed manner, ensuring consistency in
all the answers.
One of the main tools in our analysis is a result on the behavior of random
walks in graph that are almost expanders: graphs that are formed by arbitrarily
connecting a small unknown graph (the noise) to a large expander. We show that
a random walk from almost any vertex in the expander part will have fast mixing
properties, in the general setting of irreducible finite Markov chains. We also
design sublinear time procedures to distinguish vertices of the expander part
from those in the noise part, and use this procedure in the reconstruction
algorithm.
@inproceedings{KalPS08,
author = {S. Kale and
Y. Peres and
C. Seshadhri},
title = {Noise Tolerance of Expanders and Sublinear Expander Reconstruction},
booktitle = {Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS)},
year = {2008},
pages = {719728},
}
 • Testing Expansion in Bounded Degree Graphs
 Satyen Kale and C. Seshadhri
 International Colloquium on Automata, Languages and Programming (ICALP) 2008
 SIAM Journal on Computing 2011
 abstract
conf
ECCC TR07076
journal
We consider the problem of testing graph expansion (either vertex or edge) in
the bounded degree model of Goldreich and Ron. We give a property tester that given a
graph with degree bound $d$, an expansion bound $\alpha$, and a parameter
$\epsilon > 0$, accepts the graph with high probability if its expansion is
more than $\alpha$, and rejects it with high probability if it is
$\epsilon$far from any graph with expansion $\alpha'$ with degree bound $d$,
where $\alpha' < \alpha$ is a function of $\alpha$. For edge expansion, we
obtain $\alpha' = \Omega(\frac{\alpha^2}{d})$, and for vertex expansion, we
obtain $\alpha' = \Omega(\frac{\alpha^2}{d^2})$. In either case, the algorithm
runs in time $\tilde{O}(\frac{n^{(1+\mu)/2}d^2}{\epsilon\alpha^2})$ for any
given constant $\mu > 0$.
@inproceedings{KalS08,
author = {S. Kale and C. Seshadhri},
title = {Testing Expansion in Bounded Degree Graphs},
booktitle = {In proceedings of the 35th International Colloquium on Automata, Languages and Programming (ICALP(I))},
year = {2008}
pages = {527538}
}
@article{KalS11,
author = {S. Kale and C. Seshadhri},
title = {Testing Expansion in Bounded Degree Graphs},
journal = {SIAM Journal on Computing},
volume = {40},
number = {3},
year = {2011},
pages = {709720}
}
 • SelfImproving Algorithms for Delaunay Triangulations
 Ken Clarkson and C. Seshadhri
 Symposium on Computational Geometry (SoCG) 2008
 Full version in paper "Selfimproving algorithms"
 abstract
conf
full (arXiv:0907.0884)
We study the problem of twodimensional Delaunay triangulation
in the selfimproving algorithms model. We assume that
the $n$ points of the input each come from an independent, unknown, and arbitrary
distribution. The first phase of our algorithm builds data structures that
store relevant information about the input distribution. The second phase uses
these data structures to efficiently compute the Delaunay triangulation
of the input. The running time of our algorithm matches the informationtheoretic
lower bound for the given input distribution, implying that if the input
distribution has low entropy, then our algorithm beats the standard $\Omega(n\log n)$ bound
for computing Delaunay triangulations.
Our algorithm uses a host of techniques: $\epsilon$net construction for disks, optimal
expectedcase pointlocation structures, ideas from randomized incremental construction,
lineartime Delaunay splitting algorithms, and informationtheoretic arguments.
@inproceedings{ClaS08,
author = {K. L. Clarkson and
C. Seshadhri},
title = {Selfimproving algorithms for delaunay triangulations},
booktitle = {Proceedings of the 24th Annual Symposium on Computational Geometry (SOCG)},
year = {2008},
pages = {148155},
}
 • Local Monotonicity Reconstruction
 Michael Saks and C. Seshadhri
 Symposium on Discrete Algorithms (SODA) 2008 (Version titled "Parallel Monotonicity Reconstruction")
 SIAM Journal on Computing 2010
 abstract
conf
journal
We investigate the problem of monotonicity reconstruction, as defined
by Ailon et al (ISAAC 2004), in a distributed setting. We have oracle access
to a nonnegative realvalued
function $f$ defined on domain $[n]^d=\{1,\ldots,n\}^d$
($d$ is considered to be a constant).
We would like to closely approximate $f$
by a monotone function $g$.
This should be done by a procedure (a {\em filter}) that given as input
a point $x \in [n]^d$ outputs the value of $g(x)$, and runs in
time that is highly sublinear in $n$. The procedure
can (indeed must) be randomized, but we require that all of the
randomness be specified in advance by a single short random seed.
We construct such an implementation where the
time and space per query is $(\log n)^{O(1)}$ and the size of
the seed is polynomial in $\log n$ and $d$.
Furthermore the distance of the approximating function $g$ from $f$
is at most a constant (depending on $d$) multiple of the minimum distance of
any monotone function from $f$.
This
allows for a distributed implementation: one can initialize many copies
of the filter with the same short random seed, and they
can autonomously handle queries, while producing outputs that
are consistent with the same approximating function $g$.
@inproceedings{SakS08,
author = {M. E. Saks and
C. Seshadhri},
title = {Parallel monotonicity reconstruction},
booktitle = {Proceedings of the 19th Annual Symposium on Discrete Algorithms (SODA)},
year = {2008},
pages = {962971},
}
@article{SakS08,
author = {M. E. Saks and
C. Seshadhri},
title = {Local monotonicity reconstruction},
journal = {SIAM Journal on Computing},
volume = {39},
number = {7}
year = {2010},
pages = {28972926}
}
 • Online Geometric Reconstruction
 Bernard Chazelle and C. Seshadhri
 Symposium on Computational Geometry 2006
 Journal of the ACM 2011
 abstract
conf
journal
We investigate a new class of geometric problems based on the idea
of online error correction.
Suppose one is given access to a large geometric dataset
though a query mechanism; for example, the dataset could
be a terrain and a query might ask for the coordinates
of a particular vertex or for the edges incident to it.
Suppose, in addition, that the dataset satisfies some known structural property
${\cal P}$ (eg, monotonicity or convexity) but that, because of errors and noise,
the queries occasionally provide answers that violate ${\cal P}$.
Can one design a filter that modifies the query's answers so that
(i) the output satisfies ${\cal P}$; (ii) the amount of data
modification is minimized?
We provide upper and lower bounds on the complexity
of online reconstruction for convexity in 2D and 3D.
@inproceedings{ChaS06,
author = {B. Chazelle and
C. Seshadhri},
title = {Online geometric reconstruction},
booktitle = {Proceedings of the 22nd ACM Symposium on Computational Geometry (SOCG)},
year = {2006},
pages = {386394},
}
@article{ChaS11,
author = {B. Chazelle and
C. Seshadhri},
title = {Online geometric reconstruction},
journal = {Journal of the ACM},
volume = {39},
number = {7}
year = {2010},
pages = {28972926}
}
 • SelfImproving Algorithms
 Nir Ailon, Bernard Chazelle, Ding Liu, and C. Seshadhri
 Symposium on Discrete Algorithms 2006
 Nir Ailon, Bernard Chazelle, Ken Clarkson, Ding Liu, Wolfgang Mulzer, and C. Seshadhri
 SIAM Journal on Computing 2011
 abstract
journal

We investigate ways in which an algorithm can improve
its expected performance by finetuning itself
automatically with respect to an {\em arbitrary, unknown} input distribution.
We give such {\em selfimproving} algorithms
for sorting and clustering. The highlights of this work:
(i) a sorting algorithm with optimal expected limiting running time;
and (ii) a kmedian algorithm over the Hamming cube with
linear expected limiting running time.
In all cases, the algorithm begins with a learning phase
during which it adjusts itself to the input distribution (typically
in a logarithmic number of rounds), followed
by a stationary regime in which the algorithm settles to its
optimized incarnation.
@inproceedings{DBLP:conf/soda/AilonCCL06,
author = {N. Ailon and
B. Chazelle and
S. Comandur and
D. Liu},
title = {Selfimproving algorithms},
booktitle = {Proceedings of the 17th Annual Symposium on Discrete Algorithms (SODA)},
year = {2006},
pages = {261270},
}
@article{AilCL+11,
author = {N. Ailon and
B. Chazelle and
K. Clarkson and
D. Liu and
W. Mulzer and
C. Seshadhri},
title = {Selfimproving algorithms},
journal = {SIAM Journal on Computing},
volume = {40},
number = {2},
year = {2011},
pages = {350375}
}
 • RAM Simulation of BGS Model of Abstract State Machines
 C. Seshadhri, Anil Seth, and Somenath Biswas
 Abstract State Machines (ASM) 2005
 abstract
full
We show in this paper that the BGS model of abstract state
machines can be simulated by random access machines with at most a
polynomial time overhead. This result is already stated in [BGS99] with
a very brief proof sketch. The present paper gives a detailed proof of the
result. We represent hereditarily finite sets, which are the typical BGS
ASM objects, by membership graphs of the transitive closure of the sets.
Testing for equality between BGS objects can be done in linear time in
our representation.
@inproceedings{ComSB05,
author = {S. Comandur and
A. Seth and
S. Biswas},
title = {RAM Simulation of BGS Model of abstract State Machines},
booktitle = {Proceedings of the 12th International Workshop on abstract State Machines (ASM)},
year = {2005},
pages = {377386},
}
 • PropertyPreserving Data Reconstruction
 Nir Ailon, Bernard Chazelle, Ding Liu, and C. Seshadhri
 International Symposium on Algorithms and Computation (ISAAC) 2004
 Algorithmica 2008
 abstract
journal
We initiate a new line of investigation into
{\em online propertypreserving data reconstruction.}
Consider a dataset which is assumed to
satisfy various (known) structural properties; eg, it may consist
of sorted numbers, or points on a manifold, or vectors in a polyhedral cone,
or codewords from an errorcorrecting code. Because of noise and errors,
however, an (unknown) fraction of
the data is deemed {\em unsound}, ie, in violation with
the expected structural properties.
Can one still query into the dataset in an online fashion
and be provided data that is {\em always} sound?
In other words, can one design a filter which,
when given a query to any item $I$ in the dataset,
returns a {\em sound} item $J$ that, although not necessarily
in the dataset, differs from $I$ as infrequently as possible.
No preprocessing should be allowed and queries
should be answered online.
We consider the case of a monotone function.
Specifically, the dataset encodes
a function $f:\{1,\ldots, n\}\,\mapsto {\bf R}$
that is at (unknown) distance $\eps$ from monotone,
meaning that $f$ canand mustbe modified
at $\eps n$ places to become monotone.
Our main result is a randomized filter
that can answer any query in $O(\log^2 n\log\log n)$ time
while modifying the function $f$ at only $O(\eps n)$ places.
The amortized time over $n$ function evaluations is $O(\log n)$.
%With probability arbitrarily close to 1,
%the filter never errs at any query.
The filter works as stated with probability arbitrarily close
to $1$.
We provide an alternative filter with $O(\log n)$ worst case
query time and $O(\eps n \log n)$ function modifications. For reconstructing
$d$dimensional monotone functions of the form $f:\{1,\ldots, n\}^d\,\mapsto {\bf R}$,
we present a filter that takes $(2^{O(d)}(\log n)^{4d2}\log\log n)$
time per query and modifies at most $O(\eps n^d)$ function values (for constant $d$).
@inproceedings{AilCCL042,
author = {N. Ailon and
B. Chazelle and
S. Comandur and
D. Liu},
title = {PropertyPreserving Data Reconstruction},
booktitle = {Proceedings of the 15th International Symposium on Algorithms and Computation (ISAAC)},
year = {2004},
pages = {1627},
}
@article{AilCCL042,
author = {N. Ailon and
B. Chazelle and
S. Comandur and
D. Liu},
title = {PropertyPreserving Data Reconstruction},
journal = {Algorithmica},
volume = {51},
number = {2},
year = {2008},
pages = {160182},
}
 • Estimating the Distance to a Monotone Function
 Nir Ailon, Bernard Chazelle, Ding Liu, and C. Seshadhri
 International Workshop on Randomization and Computation (RANDOM) 2004
 Random Structures and Algorithms 2007
 abstract
journal
In standard property testing, the task is to
distinguish between objects that have a property $\mathcal{P}$
and those that are $\eps$far from $\mathcal{P}$, for some
$\eps > 0$. In this setting, it is perfectly acceptable
for the tester to provide a negative answer for every input object that
does not satisfy $\mathcal{P}$. This implies that
property testing in and of itself cannot be expected to
yield any information whatsoever about the distance
from the object to the property.
We address this problem in this paper, restricting our attention
to monotonicity testing.
A function $f:\{1,\ldots, n\}\,\mapsto {\bf R}$ is at distance
$\eps_f$ from being monotone if it can (and must) be modified
at $\eps_f n$ places to become monotone.
For any fixed $\delta>0$, we compute, with
probability at least $2/3$,
an interval $[(1/2 \delta)\eps, \eps]$ that
encloses $\eps_f$.
The running time of our algorithm is $O( \eps_f^{1}\log\log \eps_f^{1}
\log n )$, which
is optimal within a factor of $\log\log \eps_f^{1}$ and
represents a substantial improvement over previous work.
We give a second algorithm with an expected running time of
$O( \eps_f^{1}\log n \log\log\log n)$. Finally, we extend
our results to multivariate functions.
@inproceedings{AilCCL041,
author = {N. Ailon and
B. Chazelle and
S. Comandur and
D. Liu},
title = {Estimating the Distance to a Monotone Function},
booktitle = {APPROXRANDOM},
year = {2004},
pages = {229236},
}
@article{AilCCL041,
author = {N. Ailon and
B. Chazelle and
S. Comandur and
D. Liu},
title = {Estimating the distance to a monotone function},
journal = {Random Struct. Algorithms},
volume = {31},
number = {3},
year = {2007},
pages = {371383},
}

