Edif. Científico Técnico CITE-III, Universidad de Almería

Efficient Query Processing in Distributed Spatial Data Management Systems





Efficient Query Processing in Distributed Spatial Data Management Systems


Doctoral dissertation


Efficient Query Processing in Distributed Spatial Data Management Systems


Download thesis ]



Spatial Computing covers ideas, solutions, tools, technologies, and systems that transform our lives and society by creating a new understanding of spaces, locations, places, and properties. Since the term Big Data was coined for the first time in 2005, it has unleashed a worldwide revolution in scientific research and business. Big Spatial Data (BSD), the Big Data associated with spatial information, is now one of the most active research fields in spatial computing, mainly motivated by the rapid development of smart, sensor, and mobile technologies. Current usage of the term Big Spatial Data tends to refer to the process of capturing, storing, managing, analyzing, and visualizing huge amounts of spatial data, not using traditional tools and systems. Recent big spatial data developments have motivated the emergence of novel technologies for distributed processing of large-scale spatial data in shared-nothing clusters of computers, leading to Distributed Spatial Data Management Systems (DSDMSs). Distributed cluster-based computing systems can be classified as Hadoop-based or Spark-based systems. Based on this classification, two of the most leading DSDMSs are SpatialHadoop (disk-based DSDMS) and LocationSpark (in-memory-based DSDMS). These distributed systems support several characteristics like spatial data partitioning, indexing methods, and spatial query processing. An important aspect of these DSDMSs is to adopt a layered architecture for distributed computing and inject spatial data awareness into each layer. For example, the layers in SpatialHadoop are Language, Storage, MapReduce and Operations. Considering that SpatialHadoop is a comprehensive extension to the Hadoop ecosystem, it is a scalable and eficient cloud computing framework that allows distributed processing of large-scale spatial datasets using the MapReduce programming model.


In this thesis, we study and enrich SpatialHadoop by implementing new Distance- Based Query (DBQ) MapReduce algorithms in the Operations layer: "Distance Range Query ("DRQ), kNearest Neighbor Query (kNNQ), kClosest Pairs Query (kCPQ), kNearest Neighbor Join Query (kNNJQ), "Distance Join Query ("DJQ), "Distance Range Join Query ("DRJQ), Reverse kNearest Neighbor Query (RkNNQ), etc. Moreover, we improve the Storage layer with a new spatial partitioning technique (Voronoi-Diagram based partitioning), and a new local indexing structure (Quadtree) to optimize the distributed spatial query processing in shared-nothing clusters. This study and the knowledge of SpatialHadoop helps us identify new opportunities to enrich LocationSpark (a spatial data processing system built on top of Spark ecosystem) too, with the design and implementation of new distributed Distance-based Join Query (DJQ) algorithms (kCPQ, "DJQ and "DRJQ), extensions, and improvements over them. Additionally, we propose other enhancements and optimizations for distributed spatial query processing that leverage both data and algorithmic properties. Furthermore, we compare these DSDMSs by evaluating the performance of several distributed DJQ algorithms under diferent settings with large spatial real-world datasets from OpenStreetMap.


To develop this thesis, we start by reviewing the most relevant DSDMSs (research prototypes), the state-of-the-art spatial partitioning techniques in DSDMSs, and the most representative and common DBQs. Then, we focus our study on the structure and operations of spatial data partitioning methods and indexing structures in Spatial-Hadoop, by proposing a spatial partitioning technique based on Voronoi-Diagrams and including the Quadtree as a local index in such a DSDMS. Driven by an exhaustive analysis on the spatial query processing in SpatialHadoop, we identify and implement new spatial queries ("DRQ, kCPQ, "DJQ, "DRJQ, kNNJQ, RkNNQ, etc.) with different extensions (e.g., for non-points spatial data types) and improvements (e.g., repartitioning methods, less data technique, new pruning rules, etc.) in this DSDMS. Next, we analyze the general spatial query processing scheme of LocationSpark to extend it with new distributed DJQ algorithms and improvements. Afterward, we achieve an extensive performance evaluation of such enhancements (distributed spatial query algorithms, ex- tensions, and improvements) in SpatialHadoop and LocationSpark. Finally, we carry out a comparative study between SpatialHadoop and LocationSpark by executing an exhaustive set of experiments of several DJQs to identify which DSDMS is the most appropriate for the distributed query processing on large volumes of spatial data.