Dr. Xiao Hu

Postdoctoral Research Fellow

The Hong Kong University of Science and Technology

Xiao Hu is a PhD graduate student from the Dept. of CSE at HKUST, supervised by Prof. Ke Yi. She defended her PhD thesis on July 18, 2019. During February 2019 to May 2019, she was a visiting scholar of the Dept. of Computer Science at the University of Wisconsin-Madison and worked with Prof. Paris Koutris. Before joining HKUST, she received her BE degree in Computer Software from the Tsinghua University in 2014. Her research interest lies in the intersection of the theory and practice of data management, including massively parallel computation, join algorithms, formal frameworks for query processing and compression of query results. She has published 4 top papers in PODS/SIGMOD and one of the papers has been invited to the journal TODS, collecting as the best of PODS 2017. Besides the theoretical problems, she also works on the practical aspects of implementing the algorithms in the distributed systems.

Massively Parallel Algorithms for Acyclic Joins

Due to the rapid development of massively parallel data processing systems such as MapReduce and Spark, there have been revived interests in the theoretical computer science community to study algorithms in a massively parallel computational model. Join evaluation, as one of the central algorithmic problems in database theory, has received much more attention. In my Ph.D. thesis, I study how to compute join queries efficiently (with theoretical guarantees) in today's massively parallel systems. The first main result is an almost complete characterization of acyclic join queries on which instance-optimality, output-optimality and worst-case optimality can be achieved respectively. I first give an instance-optimal algorithm for r-hierarchical joins and prove that instance-optimality cannot be achieved for non-r-hierarchical joins. Then I give a new outputsensitive algorithm for arbitrary acyclic joins, which is optimal if the output size is not too large. At last, I present a worst-case optimal algorithm for arbitrary acyclic joins, whose complexity is characterized by the fractional edge cover number of the join hypergraph. On the other hand, I give evidence that cyclic joins are inherently more difficult than acyclic ones. In particular, I give the first output-sensitive lower bound for the triangle join, as well as a worst-case lower bound for the so-called ⊟-join.

Beyond natural joins, I also investigate similarity joins under ℓ1/ℓ2/ℓ distance metrics by designing three algorithms: (1) a deterministic output-optimal algorithm for ℓ1/ℓ metric in constant dimensions; (2) a randomized output-sensitive algorithm for ℓ2 metric in constant dimensions; (3) an LSH-based approximation algorithm for arbitrary metric in high dimensions.

Although this thesis is theoretical in nature, some of the developed algorithms are also quite practical. I have implemented them in Spark and the experimental results have demonstrated their efficiency in practice.