2024
Efficient Placement of Decomposable Aggregation Functions for Stream Processing over Large Geo-Distributed Topologies
Abstract
A recent trend in stream processing is offloading the computation of decomposable aggregation function (DAF) from cloud nodes to geo-distributed fog/edge devices to decrease latency and improve energy efficiency. However, deploying DAFs on low-end devices is challenging due to their volatility and limited resources. Additionally, in geo-distributed fog/edge environments, creating new operator instances on demand and replicating operators ubiquitously is restricted, posing challenges for achieving load balancing without overloading devices. Existing work predominantly focuses on cloud environments, overlooking DAF operator placement in resource-constrained and unreliable geo-distributed settings.
This paper presents NEMO, a resource-aware optimization approach that determines the replication factor and placement of DAF operators in resource-constrained geo-distributed topologies. Leveraging Euclidean embeddings of network topologies and a set of heuristics, NEMO scales to millions of nodes and handles topological changes through adaptive re-placement and re-replication decisions. Compared to existing solutions, NEMO achieves up to 6x lower latency and up to 15x reduction in communication cost, while preventing overloaded nodes. Moreover, NEMO re-optimizes placements in constant time, regardless of the topology size. As a result, it lays the foundation to efficiently process continuous data streams on large, heterogeneous, and geo-distributed topologies.
Multi-Backend Zonal Statistics Execution with Raven
Abstract
The recent explosion in the number and size of spatial remote sensing datasets from satellite missions creates new opportunities for data-driven approaches in domains such as climate change monitoring and disaster management. These approaches typically involve a feature engineering step that summarizes remote sensing pixel data located within zones of interest defined by another spatial dataset, an operation called zonal statistics. While there exist several spatial systems that support zonal statistics operations, they differ significantly in terms of interfaces, architectures, and algorithms, making it hard for users to select the best system for a specific workload. To address this limitation, we propose Raven, a zonal statistics framework that provides users with a unified interface across multiple execution backends, while facilitating easy benchmarking and comparisons across systems. In this demonstration, we showcase several aspects of Raven, including its multi-backend execution environment, domain-specific declarative language, optimization techniques, and benchmarking capabilities.
Analysis of Geospatial Data Loading
Abstract
The rate at which applications gather geospatial data today has turned data loading into a critical component of data analysis pipelines. However, users are confronted with multiple file formats for storing geospatial data and an array of systems for processing it. To shed light on how the choice of file format and system affects performance, this paper explores the performance of loading geospatial data stored in diverse file formats using different libraries. It aims to study the impact of different file formats, compare loading throughput across spatial libraries, and examine the microarchitectural behavior of geospatial data loading. Our findings show that GeoParquet files provide the highest loading throughput across all benchmarked libraries. Furthermore, we note that the more spatial features per byte a file format can store, the higher the data loading throughput. Our micro-architectural analysis reveals high instructions per cycle (IPC) during spatial data loading for most libraries and formats. Additionally, our experiments show that instruction misses dominate L1 cache misses, except for GeoParquet files, where data misses take over.
2023
A Summary of ICDE 2022 Research Session Panels
Abstract
In the 38th IEEE International Conference on Data Engineering (ICDE), 2022, panel discussions were introduced after paper presentations to facilitate in-depth exploration of research topics and encourage participation. These discussions, enriched by diverse perspectives from experts and active audience involvement, provided fresh insights and a broader understanding of each topic. The introduction of panel discussions exceeded expectations, attracting a larger number of participants to the virtual sessions. This article summarizes the virtual panels held during ICDE ’22, focusing on sessions such as Data Mining and Knowledge Discovery, Federated Learning, Graph Data Management, Graph Neural Networks, Spatial and Temporal Data Management, and Spatial and Temporal Data Mining. By showcasing the success of panel discussions in generating inspiring discussions and promoting participation, this article aims to benefit the data engineering community, providing a valuable resource for researchers and suggesting a compelling format for holding research sessions for future conferences.
Abstract
Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and social media platforms (e.g., location-tagged posts on Facebook, X, and Instagram). This exponential growth in spatial data has led the research community to build systems and applications for efficient spatial data processing. In this study, we apply a recently developed machine-learned search technique for single-dimensional sorted data to spatial indexing. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. Adhering to the latest research trends, we tune the partitioning techniques to be instance-optimized. By tuning each partitioning technique for optimal performance, we demonstrate that: (i) grid-based index structures outperform tree-based index structures (from 1.23× to 2.47×), (ii) learning-enhanced variants of commonly used spatial index structures outperform their original counterparts (from 1.44× to 53.34× faster), (iii) machine-learned search within a partition is faster than binary search by 11.79% - 39.51% when filtering on one dimension, (iv) the benefit of machine-learned search diminishes in the presence of other compute-intensive operations (e.g. scan costs in higher selectivity queries, Haversine distance computation, and point-in-polygon tests), and (v) index lookup is the bottleneck for tree-based structures, which could potentially be reduced by linearizing the indexed partitions
Abstract
The spatial intersection join is an important spatial query operation, due to its popularity and high complexity. The spatial join pipeline takes as input two collections of spatial objects (e.g., polygons). In the filter step, pairs of object MBRs that intersect are identified and passed to the refinement step for verification of the join predicate on the exact object geometries. The bottleneck of spatial join evaluation is in the refinement step. We introduce APRIL, a powerful intermediate step in the pipeline, which is based on raster interval approximations of object geometries. Our technique applies a sequence of interval joins on 'intervalized' object approximations to determine whether the objects intersect or not. Compared to previous work, APRIL approximations are simpler, occupy much less space, and achieve similar pruning effectiveness at a much higher speed. Besides intersection joins between polygons, APRIL can directly be applied and has high effectiveness for polygonal range queries, within joins, and polygon-linestring joins. By applying a lightweight compression technique, APRIL approximations may occupy even less space than object MBRs. Furthermore, APRIL can be customized to apply on partitioned data and on polygons of varying sizes, rasterized at different granularities. Our last contribution is a novel algorithm that computes the APRIL approximation of a polygon without having to rasterize it in full, which is orders of magnitude faster than the computation of other raster approximations. Experiments on real data demonstrate the effectiveness and efficiency of APRIL; compared to the state-of-the-art intermediate filter, APRIL occupies 2x-8x less space, is 3.5x-8.5x more time-efficient, and reduces the end-to-end join cost up to 3 times.
SheetReader: Efficient Specialized Spreadsheet Parsing
Abstract
Spreadsheets are widely used for data exploration. Since spreadsheet systems have limited capabilities, users often need to load spreadsheets to other data science environments to perform advanced analytics. However, current approaches for spreadsheet loading suffer from either high runtime or memory usage, which hinders data exploration on commodity systems. To make spreadsheet loading practical on commodity systems, we introduce a novel parser that minimize memory usage by tightly coupling decompression and parsing. Furthermore, to reduce the runtime, we introduce optimized spreadsheet-specific parsing routines and employ parallelism. To evaluate our approach, we implement a prototype for loading Excel spreadsheets into R and Python environments. Our evaluation shows that our novel approach is up to 3x faster while consuming up to 40x less memory than state-of-the-art approaches.
Our open source implementation of SheetReader for the R language is available at https://github.com/fhenz/SheetReader-r and has been downloaded more than 4K times.
Optimistic Data Parallelism for FPGA-Accelerated Sketching
Abstract
Sketches are a popular approximation technique for large datasets and high-velocity data streams. While custom FPGA-based hardware has shown admirable throughput at sketching, the state-ofthe-art exploits data parallelism by fully replicating resources and constructing independent summaries for every parallel input value. We consider this approach pessimistic, as it guarantees constant processing rates by provisioning resources for the worst case. We propose a novel optimistic sketching architecture for FPGAs that partitions a single sketch into multiple independent banks shared among all input values, thus significantly reducing resource consumption. However, skewed input data distributions can result in conflicting accesses to banks and impair the processing rate. To mitigate the effect of skew, we add mergers that exploit temporal locality by combining recent updates.Our evaluation shows that an optimistic architecture is feasible and reduces the utilization of critical FPGA resources proportionally to the number of parallel input values. We further show that FPGA accelerators provide up to 2.6𝑥 higher throughput than a recent CPU and GPU, while larger sketch sizes enabled by optimistic architectures improve accuracy by up to an order of magnitude in a realistic sketching application.
Workload Prediction for IoT Data Management Systems
Abstract
The Internet of Things (IoT) is an emerging technology that allows numerous devices, potentially spread over a large geographical area, to collect and collectively process data from high-speed data streams. To that end, specialized IoT data management systems (IoTDMSs) have emerged. One challenge in those systems is the collection of different metrics from devices in a central location for analysis. This analysis allows IoTDMSs to maintain an overview of the workload on different devices and to optimize their processing. However, as an IoT network comprises of many heterogeneous devices with low computation resources and limited bandwidth, collecting and sending workload metrics can cause increased latency in data processing tasks across the network. In this ongoing work, we present an approach to avoid unnecessary transmission of workload metrics by predicting CPU, memory, and network usage using machine learning (ML). Specifically, we demonstrate the performance of two ML models, linear regression and Long Short-Term Memory (LSTM) neural network, and show the features that we explored to train these models. This work is part of an ongoing research to develop a monitoring tool for our new IoTDMS named NebulaStream.