2023
Abstract
Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and social media platforms (e.g., location-tagged posts on Facebook, Twitter, and Instagram). This exponential growth in spatial data has led the research community to build systems and applications for efficient spatial data processing. In this study, we apply a recently developed machine-learned search technique for single-dimensional sorted data to spatial indexing. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. Adhering to the latest research trends, we tune the partitioning techniques to be instance-optimized. By tuning each partitioning technique for optimal performance, we demonstrate that: (i) grid-based index structures outperform tree-based index structures (from 1.23× to 2.47×), (ii) learning-enhanced variants of commonly used spatial index structures outperform their original counterparts (from 1.44× to 53.34×), (iii) machine-learned search within a partition is faster than binary search by 11.79% to 39.51% when filtering on one dimension, (iv) the benefit of machine-learned search diminishes in the presence of other compute-intensive operations (e.g., scan costs in higher-selectivity queries, Haversine distance computation, and point-in-polygon tests), and (v) index lookup is the bottleneck for tree-based structures, which could potentially be reduced by linearizing the indexed partitions.
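To make the machine-learned search idea concrete, the sketch below shows one way a learned model can stand in for binary search within a single sorted partition: a simple linear model approximates the rank of a key, and a bounded binary search around the prediction corrects the residual error. The class name, the single linear model per partition, and the error-bound handling are illustrative assumptions, not the implementation evaluated in the abstract.

    import bisect

    class LearnedPartitionSearch:
        """Minimal sketch: a learned model replacing binary search on one
        sorted partition. A single linear model approximates the rank of a
        key; a bounded binary search around the prediction fixes the error."""

        def __init__(self, sorted_keys):
            self.keys = sorted_keys
            n = len(sorted_keys)
            span = sorted_keys[-1] - sorted_keys[0] or 1
            self.slope = (n - 1) / span
            self.intercept = -self.slope * sorted_keys[0]
            # Worst-case prediction error over the indexed keys
            # (+1 slack for lookup keys falling between two indexed keys).
            self.err = max(abs(i - self._predict(k))
                           for i, k in enumerate(sorted_keys)) + 1

        def _predict(self, key):
            return int(self.slope * key + self.intercept)

        def lower_bound(self, key):
            """Index of the first key >= key, searched in a small window."""
            pos = self._predict(key)
            lo = max(0, pos - self.err)
            hi = min(len(self.keys), pos + self.err + 1)
            return bisect.bisect_left(self.keys, key, lo, hi)

    # Usage: filtering on one dimension inside a spatial partition.
    xs = sorted([1.5, 2.0, 3.7, 4.1, 8.9, 9.3, 12.0, 15.5])
    idx = LearnedPartitionSearch(xs)
    print(idx.lower_bound(4.0))   # 3, i.e. the position of 4.1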
Abstract
The spatial intersection join is an important spatial query operation, due to its popularity and high complexity. The spatial join pipeline takes as input two collections of spatial objects (e.g., polygons). In the filter step, pairs of object MBRs that intersect are identified and passed to the refinement step for verification of the join predicate on the exact object geometries. The bottleneck of spatial join evaluation is in the refinement step. We introduce APRIL, a powerful intermediate step in the pipeline, which is based on raster interval approximations of object geometries. Our technique applies a sequence of interval joins on 'intervalized' object approximations to determine whether the objects intersect or not. Compared to previous work, APRIL approximations are simpler, occupy much less space, and achieve similar pruning effectiveness at a much higher speed. Besides intersection joins between polygons, APRIL can be directly applied, with high effectiveness, to polygonal range queries, within joins, and polygon-linestring joins. By applying a lightweight compression technique, APRIL approximations may occupy even less space than object MBRs. Furthermore, APRIL can be customized to apply to partitioned data and to polygons of varying sizes, rasterized at different granularities. Our last contribution is a novel algorithm that computes the APRIL approximation of a polygon without having to rasterize it in full, which is orders of magnitude faster than the computation of other raster approximations. Experiments on real data demonstrate the effectiveness and efficiency of APRIL; compared to the state-of-the-art intermediate filter, APRIL occupies 2x-8x less space, is 3.5x-8.5x more time-efficient, and reduces the end-to-end join cost by up to 3 times.
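As an illustration of the interval-join idea behind such raster interval approximations (not APRIL's exact algorithm, which distinguishes full from partial raster cells and chains several interval joins), the sketch below tests whether two 'intervalized' objects share at least one raster cell by sweeping over their sorted interval lists. The half-open [start, end) encoding along a space-filling-curve cell order is an assumption of this sketch.

    def intervals_overlap(a, b):
        """Merge-style sweep over two sorted, non-overlapping interval lists.

        Each list holds [start, end) intervals of raster cell IDs. Returns
        True if any interval of `a` intersects any interval of `b`, i.e.
        the two rasterized objects share at least one cell."""
        i = j = 0
        while i < len(a) and j < len(b):
            a_start, a_end = a[i]
            b_start, b_end = b[j]
            if a_start < b_end and b_start < a_end:   # intervals intersect
                return True
            # Advance the list whose current interval ends first.
            if a_end <= b_end:
                i += 1
            else:
                j += 1
        return False

    # Example: two 'intervalized' polygon approximations.
    poly_a = [(3, 7), (10, 14), (20, 22)]
    poly_b = [(8, 10), (13, 16)]
    print(intervals_overlap(poly_a, poly_b))   # True: [10, 14) meets [13, 16)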
SheetReader: Efficient Specialized Spreadsheet Parsing
Abstract
Spreadsheets are widely used for data exploration. Since spreadsheet systems have limited capabilities, users often need to load spreadsheets into other data science environments to perform advanced analytics. However, current approaches for spreadsheet loading suffer from either high runtime or high memory usage, which hinders data exploration on commodity systems. To make spreadsheet loading practical on commodity systems, we introduce a novel parser that minimizes memory usage by tightly coupling decompression and parsing. Furthermore, to reduce the runtime, we introduce optimized spreadsheet-specific parsing routines and employ parallelism. To evaluate our approach, we implement a prototype for loading Excel spreadsheets into R and Python environments. Our evaluation shows that our novel approach is up to 3x faster while consuming up to 40x less memory than state-of-the-art approaches.
Our open source implementation of SheetReader for the R language is available at https://github.com/fhenz/SheetReader-r and has been downloaded more than 4K times.
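The following sketch illustrates the idea of tightly coupling decompression with parsing; it is not SheetReader's parser. The worksheet XML inside the .xlsx ZIP container is decompressed in small chunks and scanned for <v>...</v> cell values on the fly, so the full decompressed XML never has to be materialized. The sheet path, chunk size, and the restriction to inline <v> values (ignoring shared strings, cell types, and dates) are simplifying assumptions.

    import zipfile

    def stream_cell_values(xlsx_path, sheet="xl/worksheets/sheet1.xml",
                           chunk_size=64 * 1024):
        """Yield raw <v>...</v> cell values while decompressing the sheet
        XML chunk by chunk, instead of materializing the whole document."""
        with zipfile.ZipFile(xlsx_path) as zf, zf.open(sheet) as raw:
            buf = b""
            while True:
                chunk = raw.read(chunk_size)   # decompresses incrementally
                if not chunk:
                    break
                buf += chunk
                while True:
                    start = buf.find(b"<v>")
                    if start == -1:
                        buf = buf[-4:]         # keep a possibly split "<v" tag
                        break
                    end = buf.find(b"</v>", start)
                    if end == -1:
                        buf = buf[start:]      # value continues in next chunk
                        break
                    yield buf[start + 3:end].decode()
                    buf = buf[end + 4:]

    # Usage (hypothetical file name):
    # for value in stream_cell_values("report.xlsx"):
    #     print(value)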
Optimistic Data Parallelism for FPGA-Accelerated Sketching
Abstract
Sketches are a popular approximation technique for large datasets and high-velocity data streams. While custom FPGA-based hardware has shown admirable throughput at sketching, the state-of-the-art exploits data parallelism by fully replicating resources and constructing independent summaries for every parallel input value. We consider this approach pessimistic, as it guarantees constant processing rates by provisioning resources for the worst case. We propose a novel optimistic sketching architecture for FPGAs that partitions a single sketch into multiple independent banks shared among all input values, thus significantly reducing resource consumption. However, skewed input data distributions can result in conflicting accesses to banks and impair the processing rate. To mitigate the effect of skew, we add mergers that exploit temporal locality by combining recent updates. Our evaluation shows that an optimistic architecture is feasible and reduces the utilization of critical FPGA resources proportionally to the number of parallel input values. We further show that FPGA accelerators provide up to 2.6x higher throughput than a recent CPU and GPU, while larger sketch sizes enabled by optimistic architectures improve accuracy by up to an order of magnitude in a realistic sketching application.
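The toy simulation below mirrors the optimistic idea in software rather than on an FPGA: a single sketch row is split into banks shared by all parallel inputs, duplicate updates to the same counter within a cycle are merged, and every additional access to an already-busy bank costs a stall cycle. The hash function, the single-row layout, and the stall model are assumptions made purely for illustration.

    import random
    from collections import Counter, defaultdict

    def simulate_banked_sketch(stream, banks=8, slots_per_bank=1024,
                               parallelism=4, seed=1):
        """Toy model of an optimistic, banked sketch row (not the FPGA design).

        Each cycle, `parallelism` values arrive and hash to one (bank, slot)
        counter. Updates to the same counter in a cycle are merged; distinct
        counters in the same bank conflict and each extra access adds a stall.
        Returns the counters and the achieved updates-per-cycle rate."""
        rng = random.Random(seed)
        salt = rng.getrandbits(32)
        counters = [[0] * slots_per_bank for _ in range(banks)]
        cycles = 0

        for i in range(0, len(stream), parallelism):
            batch = stream[i:i + parallelism]
            # Merger: collapse duplicate updates hitting the same counter.
            merged = Counter(hash((v, salt)) % (banks * slots_per_bank)
                             for v in batch)
            per_bank = defaultdict(list)
            for idx, inc in merged.items():
                per_bank[idx // slots_per_bank].append((idx % slots_per_bank, inc))
            for bank, updates in per_bank.items():
                for slot, inc in updates:
                    counters[bank][slot] += inc
            # One cycle plus a stall per extra access to the same bank.
            cycles += 1 + max((len(u) - 1 for u in per_bank.values()), default=0)

        return counters, len(stream) / max(cycles, 1)

    # Skewed input impairs the rate less when duplicates are merged first.
    skewed = [0] * 5000 + list(range(5000))
    _, rate = simulate_banked_sketch(skewed)
    print(f"updates per cycle: {rate:.2f}")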
Workload Prediction for IoT Data Management Systems
Abstract
The Internet of Things (IoT) is an emerging technology that allows numerous devices, potentially spread over a large geographical area, to collect and collectively process data from high-speed data streams. To that end, specialized IoT data management systems (IoTDMSs) have emerged. One challenge in those systems is the collection of different metrics from devices in a central location for analysis. This analysis allows IoTDMSs to maintain an overview of the workload on different devices and to optimize their processing. However, as an IoT network comprises many heterogeneous devices with low computation resources and limited bandwidth, collecting and sending workload metrics can cause increased latency in data processing tasks across the network. In this work, we present an approach to avoid unnecessary transmission of workload metrics by predicting CPU, memory, and network usage using machine learning (ML). Specifically, we demonstrate the performance of two ML models, linear regression and a Long Short-Term Memory (LSTM) neural network, and show the features that we explored to train these models. This work is part of ongoing research to develop a monitoring tool for our new IoTDMS named NebulaStream.
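As a minimal illustration of the simpler of the two models, the sketch below fits a linear regression on lagged CPU-utilization readings and uses the one-step-ahead prediction to decide whether a metric actually needs to be transmitted. The synthetic trace, the five-reading lag window, and the 5-point transmission threshold are assumptions of this sketch; the LSTM variant and the NebulaStream integration are not shown.

    import numpy as np

    def make_lag_features(series, lags=5):
        """Build (X, y): each row holds the previous `lags` readings,
        y is the next reading."""
        X = np.array([series[i - lags:i] for i in range(lags, len(series))])
        y = np.array(series[lags:])
        return X, y

    # Synthetic CPU-utilization trace (%): periodic load plus noise.
    rng = np.random.default_rng(0)
    t = np.arange(500)
    cpu = 50 + 20 * np.sin(2 * np.pi * t / 60) + rng.normal(0, 3, t.size)

    X, y = make_lag_features(cpu, lags=5)
    split = int(0.8 * len(y))
    # Linear regression via least squares (bias term appended).
    A = np.hstack([X[:split], np.ones((split, 1))])
    coef, *_ = np.linalg.lstsq(A, y[:split], rcond=None)

    A_test = np.hstack([X[split:], np.ones((len(y) - split, 1))])
    pred = A_test @ coef
    mae = np.mean(np.abs(pred - y[split:]))
    print(f"one-step-ahead MAE: {mae:.2f} percentage points")

    # Transmit a reading only when the device's own prediction is off by
    # more than a threshold, avoiding unnecessary metric transmissions.
    threshold = 5.0
    sent = np.sum(np.abs(pred - y[split:]) > threshold)
    print(f"readings transmitted: {sent} of {len(pred)}")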