Efficient Placement of Decomposable Aggregation Functions for Stream Processing over Large Geo-Distributed Topologies

Xenofon Chatziliadis, Eleni Tzirita Zacharatou, Alphan Eracar, Steffen Zeuch, and Volker Markl
Journal Paper VLDB ’24: Proceedings of the VLDB Endowment (PVLDB), 17(6), 1501-1514, 2024


A recent trend in stream processing is offloading the computation of decomposable aggregation functions (DAFs) from cloud nodes to geo-distributed fog/edge devices to decrease latency and improve energy efficiency. However, deploying DAFs on low-end devices is challenging due to their volatility and limited resources. Additionally, in geo-distributed fog/edge environments, creating new operator instances on demand and replicating operators ubiquitously is restricted, posing challenges for achieving load balancing without overloading devices. Existing work predominantly focuses on cloud environments, overlooking DAF operator placement in resource-constrained and unreliable geo-distributed settings.

This paper presents NEMO, a resource-aware optimization approach that determines the replication factor and placement of DAF operators in resource-constrained geo-distributed topologies. Leveraging Euclidean embeddings of network topologies and a set of heuristics, NEMO scales to millions of nodes and handles topological changes through adaptive re-placement and re-replication decisions. Compared to existing solutions, NEMO achieves up to 6x lower latency and up to 15x reduction in communication cost, while preventing overloaded nodes. Moreover, NEMO re-optimizes placements in constant time, regardless of the topology size. As a result, it lays the foundation to efficiently process continuous data streams on large, heterogeneous, and geo-distributed topologies.
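The placement idea behind NEMO can be illustrated with a much-simplified sketch: once nodes are embedded in a Euclidean space, choosing an aggregation point becomes a geometric cost minimization. The candidate set, coordinates, and cost function below are hypothetical illustrations, not NEMO's actual heuristics.

```python
import math

def place_aggregator(candidates, sources, sink):
    """Pick the candidate node whose 2D embedding minimizes the total
    source->aggregator plus aggregator->sink distance (illustrative only)."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    def cost(c):
        return sum(dist(s, c) for s in sources) + dist(c, sink)
    return min(candidates, key=cost)

# Four sources clustered near the origin, one distant sink: the candidate
# near the source cluster wins, because four source links outweigh one
# sink link in the total cost.
sources = [(0, 0), (1, 0), (0, 1), (1, 1)]
sink = (10, 10)
candidates = [(0, 0), (2, 2), (10, 10)]
print(place_aggregator(candidates, sources, sink))  # -> (0, 0)
```

The appeal of such embeddings is that placement cost can be evaluated with cheap coordinate arithmetic instead of repeated network measurements, which is what allows scaling to very large topologies.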

Multi-Backend Zonal Statistics Execution with Raven

Gereon Dusella, Haralampos Gavriilidis, Laert Nuhu, Volker Markl, and Eleni Tzirita Zacharatou
Demo Paper SIGMOD ’24: Proc. Intl. Conf. on Management of Data, 2024, To Appear


The recent explosion in the number and size of spatial remote sensing datasets from satellite missions creates new opportunities for data-driven approaches in domains such as climate change monitoring and disaster management. These approaches typically involve a feature engineering step that summarizes remote sensing pixel data located within zones of interest defined by another spatial dataset, an operation called zonal statistics. While there exist several spatial systems that support zonal statistics operations, they differ significantly in terms of interfaces, architectures, and algorithms, making it hard for users to select the best system for a specific workload. To address this limitation, we propose Raven, a zonal statistics framework that provides users with a unified interface across multiple execution backends, while facilitating easy benchmarking and comparisons across systems. In this demonstration, we showcase several aspects of Raven, including its multi-backend execution environment, domain-specific declarative language, optimization techniques, and benchmarking capabilities.
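For readers unfamiliar with the operation, zonal statistics can be sketched in a few lines: given a raster of pixel values and a second raster assigning each pixel to a zone, compute a summary per zone. This toy example (hypothetical data, mean as the aggregate) only illustrates the operation itself, not Raven's execution backends.

```python
# Toy zonal statistics: average raster pixel values inside each zone.
# Real systems first rasterize the polygonal zones; here the zone raster
# is given directly.
raster = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
zones = [            # zone ID per pixel (0 = outside any zone)
    [1, 1, 2],
    [1, 2, 2],
    [0, 0, 2],
]

def zonal_mean(raster, zones):
    sums, counts = {}, {}
    for row_vals, row_zones in zip(raster, zones):
        for value, zone in zip(row_vals, row_zones):
            if zone == 0:
                continue
            sums[zone] = sums.get(zone, 0) + value
            counts[zone] = counts.get(zone, 0) + 1
    return {z: sums[z] / counts[z] for z in sums}

print(zonal_mean(raster, zones))  # zone 1 -> (1+2+4)/3, zone 2 -> (3+5+6+9)/4
```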

Analysis of Geospatial Data Loading

Aske Wachs and Eleni Tzirita Zacharatou
Workshop Paper DBTest@SIGMOD ’24: Proceedings of the International Workshop on Testing Database Systems, 2024, To Appear


The rate at which applications gather geospatial data today has turned data loading into a critical component of data analysis pipelines. However, users are confronted with multiple file formats for storing geospatial data and an array of systems for processing it. To shed light on how the choice of file format and system affects performance, this paper explores the performance of loading geospatial data stored in diverse file formats using different libraries. It aims to study the impact of different file formats, compare loading throughput across spatial libraries, and examine the microarchitectural behavior of geospatial data loading. Our findings show that GeoParquet files provide the highest loading throughput across all benchmarked libraries. Furthermore, we note that the more spatial features per byte a file format can store, the higher the data loading throughput. Our microarchitectural analysis reveals high instructions per cycle (IPC) during spatial data loading for most libraries and formats. Additionally, our experiments show that instruction misses dominate L1 cache misses, except for GeoParquet files, where data misses take over.


A Summary of ICDE 2022 Research Session Panels

Zhifeng Bao, Panagiotis Bouros, Reynold Cheng, Byron Choi, Anton Dignös, Wei Ding, Yixiang Fang, Boyang Han, Jilin Hu, Arijit Khan, Wenqing Lin, Xuemin Lin, Cheng Long, Nikos Mamoulis, Jian Pei, Matthias Renz, Shashi Shekhar, Jieming Shi, Eleni Tzirita Zacharatou, Sibo Wang, Xiao Wang, Xue Wang, Raymond Chi-Wing Wong, Da Yan, Xifeng Yan, Bin Yang, Dezhong Yao, Ce Zhang, Peilin Zhao, Rong Zhu
Journal Paper Data Engineering Bulletin, December 2023, Vol. 47, No. 4


At the 38th IEEE International Conference on Data Engineering (ICDE 2022), panel discussions were introduced after paper presentations to facilitate in-depth exploration of research topics and encourage participation. These discussions, enriched by diverse perspectives from experts and active audience involvement, provided fresh insights and a broader understanding of each topic. The introduction of panel discussions exceeded expectations, attracting a larger number of participants to the virtual sessions. This article summarizes the virtual panels held during ICDE ’22, focusing on sessions such as Data Mining and Knowledge Discovery, Federated Learning, Graph Data Management, Graph Neural Networks, Spatial and Temporal Data Management, and Spatial and Temporal Data Mining. By showcasing the success of panel discussions in generating inspiring exchanges and promoting participation, this article aims to benefit the data engineering community, providing a valuable resource for researchers and suggesting a compelling format for research sessions at future conferences.

Enhancing In-Memory Spatial Indexing with Learned Search

Varun Pandey, Alexander van Renen, Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Jialin Ding, Volker Markl, and Alfons Kemper
Preprint Arxiv, September 2023


Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and social media platforms (e.g., location-tagged posts on Facebook, Twitter, and Instagram). This exponential growth in spatial data has led the research community to build systems and applications for efficient spatial data processing. In this study, we apply a recently developed machine-learned search technique for single-dimensional sorted data to spatial indexing. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. Adhering to the latest research trends, we tune the partitioning techniques to be instance-optimized. By tuning each partitioning technique for optimal performance, we demonstrate that: (i) grid-based index structures outperform tree-based index structures (from 1.23× to 2.47×), (ii) learning-enhanced variants of commonly used spatial index structures outperform their original counterparts (from 1.44× to 53.34× faster), (iii) machine-learned search within a partition is faster than binary search by 11.79% to 39.51% when filtering on one dimension, (iv) the benefit of machine-learned search diminishes in the presence of other compute-intensive operations (e.g., scan costs in higher selectivity queries, Haversine distance computation, and point-in-polygon tests), and (v) index lookup is the bottleneck for tree-based structures, which could potentially be reduced by linearizing the indexed partitions.
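The machine-learned search idea for sorted one-dimensional data can be sketched as follows: fit a simple model from key to array position, predict a position, and correct within a small window bounded by the model's maximum training error. The linear model and window search below are an illustrative minimal variant, not the specific learned index used in the paper.

```python
import bisect

def fit_linear(keys):
    """Least-squares line mapping key -> array position, plus the maximum
    prediction error observed on the training keys."""
    n = len(keys)
    xbar = sum(keys) / n
    ybar = (n - 1) / 2
    num = sum((k - xbar) * (i - ybar) for i, k in enumerate(keys))
    den = sum((k - xbar) ** 2 for k in keys) or 1.0
    slope = num / den
    intercept = ybar - slope * xbar
    err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
    return slope, intercept, int(err) + 1

def learned_search(keys, model, target):
    slope, intercept, err = model
    guess = int(slope * target + intercept)
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    i = bisect.bisect_left(keys, target, lo, hi)  # tiny binary search window
    return i if i < len(keys) and keys[i] == target else -1

keys = list(range(0, 1000, 3))           # sorted, near-linear key distribution
model = fit_linear(keys)
print(learned_search(keys, model, 300))  # -> 100
```

The speedup over plain binary search comes from shrinking the search range from the whole partition to a window of width proportional to the model's error, which is small when keys are close to linearly distributed.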

APRIL: Approximating Polygons as Raster Interval Lists

Thanasis Georgiadis, Eleni Tzirita Zacharatou, and Nikos Mamoulis
Preprint Arxiv, July 2023


The spatial intersection join is an important spatial query operation, due to its popularity and high complexity. The spatial join pipeline takes as input two collections of spatial objects (e.g., polygons). In the filter step, pairs of object MBRs that intersect are identified and passed to the refinement step for verification of the join predicate on the exact object geometries. The bottleneck of spatial join evaluation is in the refinement step. We introduce APRIL, a powerful intermediate step in the pipeline, which is based on raster interval approximations of object geometries. Our technique applies a sequence of interval joins on 'intervalized' object approximations to determine whether the objects intersect or not. Compared to previous work, APRIL approximations are simpler, occupy much less space, and achieve similar pruning effectiveness at a much higher speed. Besides intersection joins between polygons, APRIL can be directly applied to, and is highly effective for, polygonal range queries, within joins, and polygon-linestring joins. By applying a lightweight compression technique, APRIL approximations may occupy even less space than object MBRs. Furthermore, APRIL can be customized to apply to partitioned data and to polygons of varying sizes, rasterized at different granularities. Our last contribution is a novel algorithm that computes the APRIL approximation of a polygon without having to rasterize it in full, which is orders of magnitude faster than the computation of other raster approximations. Experiments on real data demonstrate the effectiveness and efficiency of APRIL; compared to the state-of-the-art intermediate filter, APRIL occupies 2x-8x less space, is 3.5x-8.5x more time-efficient, and reduces the end-to-end join cost up to 3 times.
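The core of an interval join over rasterized geometries is a linear sweep of two sorted interval lists. The sketch below shows that test on hypothetical cell-ID intervals; the actual APRIL representation (interval encoding, compression, two-level FULL/PARTIAL intervals) is richer than this.

```python
def intervals_intersect(a, b):
    """Sweep two sorted lists of half-open [start, end) cell-ID intervals
    and report whether any pair overlaps -- the core test of an interval
    join over rasterized object approximations."""
    i = j = 0
    while i < len(a) and j < len(b):
        a_start, a_end = a[i]
        b_start, b_end = b[j]
        if a_start < b_end and b_start < a_end:
            return True
        if a_end <= b_end:   # advance the list whose interval ends first
            i += 1
        else:
            j += 1
    return False

poly_a = [(2, 5), (9, 12), (20, 25)]   # rasterized polygon A
poly_b = [(5, 7), (13, 16), (24, 30)]  # rasterized polygon B
print(intervals_intersect(poly_a, poly_b))  # True: [20,25) overlaps [24,30)
```

Because both lists are sorted, the sweep touches each interval at most once, so the test is linear in the size of the approximations rather than quadratic in the number of raster cells.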

SheetReader: Efficient Specialized Spreadsheet Parsing

Haralampos Gavriilidis, Felix Henze, Eleni Tzirita Zacharatou, and Volker Markl
Journal Paper Information Systems, Volume 115, May 2023


Spreadsheets are widely used for data exploration. Since spreadsheet systems have limited capabilities, users often need to load spreadsheets to other data science environments to perform advanced analytics. However, current approaches for spreadsheet loading suffer from either high runtime or memory usage, which hinders data exploration on commodity systems. To make spreadsheet loading practical on commodity systems, we introduce a novel parser that minimizes memory usage by tightly coupling decompression and parsing. Furthermore, to reduce the runtime, we introduce optimized spreadsheet-specific parsing routines and employ parallelism. To evaluate our approach, we implement a prototype for loading Excel spreadsheets into R and Python environments. Our evaluation shows that our novel approach is up to 3x faster while consuming up to 40x less memory than state-of-the-art approaches.

Our open source implementation of SheetReader for the R language is available at https://github.com/fhenz/SheetReader-r and has been downloaded more than 4K times.

Optimistic Data Parallelism for FPGA-Accelerated Sketching

Martin Kiefer, Ilias Poulakis, Eleni Tzirita Zacharatou, and Volker Markl
Journal Paper VLDB ’23: Proceedings of the VLDB Endowment (PVLDB), 16(5), 1113-1125, 2023


Sketches are a popular approximation technique for large datasets and high-velocity data streams. While custom FPGA-based hardware has shown admirable throughput at sketching, the state-of-the-art exploits data parallelism by fully replicating resources and constructing independent summaries for every parallel input value. We consider this approach pessimistic, as it guarantees constant processing rates by provisioning resources for the worst case. We propose a novel optimistic sketching architecture for FPGAs that partitions a single sketch into multiple independent banks shared among all input values, thus significantly reducing resource consumption. However, skewed input data distributions can result in conflicting accesses to banks and impair the processing rate. To mitigate the effect of skew, we add mergers that exploit temporal locality by combining recent updates. Our evaluation shows that an optimistic architecture is feasible and reduces the utilization of critical FPGA resources proportionally to the number of parallel input values. We further show that FPGA accelerators provide up to 2.6x higher throughput than a recent CPU and GPU, while larger sketch sizes enabled by optimistic architectures improve accuracy by up to an order of magnitude in a realistic sketching application.
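The bank-conflict problem and the effect of the mergers can be illustrated with a toy cycle-count model. The bank-mapping function, the batch, and the merge rule below are hypothetical simplifications of the hardware design, meant only to show why skew hurts and why merging duplicates helps.

```python
def cycles_for_batch(batch, num_banks, merge=False):
    """Count the cycles an optimistic design needs for one batch of parallel
    updates: updates hitting the same bank must be serialized, while a
    merger first combines identical keys into a single update (toy model)."""
    if merge:
        batch = set(batch)                      # combine duplicate updates
    banks = {}
    for key in batch:
        bank = key % num_banks                  # simplistic bank mapping
        banks[bank] = banks.get(bank, 0) + 1
    return max(banks.values(), default=0)       # worst bank bounds the stall

skewed = [7, 7, 7, 13]     # heavy hitter: three identical updates + one other
print(cycles_for_batch(skewed, num_banks=4))             # 3: bank conflict
print(cycles_for_batch(skewed, num_banks=4, merge=True)) # 1: after merging
```

In the skewed batch, three identical keys land in the same bank and serialize into three cycles; merging them into one combined update restores a single-cycle batch, which mirrors how the mergers exploit temporal locality.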

Workload Prediction for IoT Data Management Systems

David Burrell, Xenofon Chatziliadis, Eleni Tzirita Zacharatou, Steffen Zeuch, and Volker Markl
Workshop Paper BTW '23: Fachtagung für Datenbanksysteme für Business, Technologie und Web, 2023


The Internet of Things (IoT) is an emerging technology that allows numerous devices, potentially spread over a large geographical area, to collect and collectively process data from high-speed data streams. To that end, specialized IoT data management systems (IoTDMSs) have emerged. One challenge in those systems is the collection of different metrics from devices in a central location for analysis. This analysis allows IoTDMSs to maintain an overview of the workload on different devices and to optimize their processing. However, as an IoT network comprises many heterogeneous devices with low computation resources and limited bandwidth, collecting and sending workload metrics can cause increased latency in data processing tasks across the network. In this ongoing work, we present an approach to avoid unnecessary transmission of workload metrics by predicting CPU, memory, and network usage using machine learning (ML). Specifically, we demonstrate the performance of two ML models, linear regression and Long Short-Term Memory (LSTM) neural network, and show the features that we explored to train these models. This work is part of ongoing research to develop a monitoring tool for our new IoTDMS named NebulaStream.


Spatio-Temporal Graph Convolutional Network for Stochastic Traffic Speed Imputation

Carlos E. Muniz Cuza, Nguyen Ho, Eleni Tzirita Zacharatou, Torben Bach Pedersen, and Bin Yang
Conference Paper SIGSPATIAL ’22: Proceedings of the International Conference on Advances in Geographic Information Systems, 2022


The rapid increase of traffic data generated by different sensing systems opens many opportunities to improve transportation services. An important opportunity is to enable high-resolution stochastic routing that computes the arrival time probabilities for each suggested route instead of only the expected travel time. Stochastic routing relies on stochastic speed data that captures the speed distributions of vehicles in a road network. However, traffic datasets typically have many missing values, which prevents the construction of stochastic speeds. To address this limitation, we propose the Stochastic Spatio-Temporal Graph Convolutional Network (SST-GCN) architecture that accurately imputes missing speed distributions in a road network. SST-GCN combines Temporal Convolutional Networks and Graph Convolutional Networks into a single framework to capture both spatial and temporal correlations between road segments and time intervals, thereby providing a highly accurate estimation model for speed distributions. Moreover, to cope with datasets with many missing values, we propose a novel self-adaptive context-aware diffusion process that regulates the propagated information around the network, avoiding the spread of false information. We extensively evaluate the effectiveness of SST-GCN on real-world datasets, showing that it achieves from 4.6% to 50% higher accuracy than state-of-the-art baselines using three different evaluation metrics. Furthermore, multiple ablation studies confirm our design choices and scalability to large road networks.

Satellite Image Search in AgoraEO

Ahmet Kerem Aksoy, Pavel Dushev, Eleni Tzirita Zacharatou, Holmer Hemsen, Marcela Charfuelan, Jorge-Arnulfo Quiané-Ruiz, Begüm Demir, and Volker Markl
Demo Paper VLDB ’22: Proceedings of the VLDB Endowment (PVLDB), 15(12), 3646-3649, 2022


The growing operational capability of global Earth Observation (EO) creates new opportunities for data-driven approaches to understand and protect our planet. However, the current use of EO archives is very restricted due to the huge archive sizes and the limited exploration capabilities provided by EO platforms. To address this limitation, we have recently proposed MiLaN, a content-based image retrieval approach for fast similarity search in satellite image archives. MiLaN is a deep hashing network based on metric learning that encodes high-dimensional image features into compact binary hash codes. We use these codes as keys in a hash table to enable real-time nearest neighbor search and highly accurate retrieval. In this demonstration, we showcase the efficiency of MiLaN by integrating it with EarthQube, a browser and search engine within AgoraEO. EarthQube supports interactive visual exploration and Query-by-Example over satellite image repositories. Demo visitors will interact with EarthQube, playing the roles of different users who search images in a large-scale remote sensing archive by their semantic content and apply other filters.

In-Place Updates in Tree-Encoded Bitmaps

Marcellus Prama Saputra, Eleni Tzirita Zacharatou, Serafeim Papadias, and Volker Markl
Conference Paper SSDBM '22: Proc. Intl. Conf. on Scientific and Statistical Database Management, 2022


The Tree-Encoded Bitmap (TEB) is a novel bitmap compression scheme that provides a high compression ratio and logarithmic read time. It uses a tree-based compression algorithm that maps runs in the bitmap to leaf nodes of a binary tree. Currently, TEBs perform updates using an auxiliary differential data structure. However, consulting an additional data structure at every read introduces both memory and read overheads. To mitigate the shortcomings of differential updates, we propose algorithms to update TEBs in place. To that end, we identified two types of updates that can occur in a TEB: run-forming and run-breaking updates. Run-forming updates correspond to leaf nodes at the lowest level of the binary tree. All other updates are run-breaking. Each type of update requires different handling. Through experimentation with synthetic data, we determined that in-place run-forming updates are 2-3× faster than differential updates, while run-breaking updates cannot be efficiently performed in place. As a result, we propose a hybrid solution that performs run-forming updates in place while storing run-breaking updates in a differential data structure. Our experiments using synthetic data show that our hybrid solution is faster than differential updates as long as run-forming updates occur in a given workload. For instance, when 7% of all updates are run forming, our hybrid solution is 15% faster than differential updates.
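The distinction between the two update types can be illustrated on a plain run-length-encoded bitmap, which is a simpler analogue of the TEB's tree-mapped runs. The classification rule below (single-bit runs vs. longer runs) is an illustrative stand-in for the paper's leaf-level criterion, not the TEB algorithm itself.

```python
def classify_update(runs, pos):
    """On a run-length-encoded bitmap, decide whether flipping the bit at
    `pos` keeps the representation compact (run-forming) or splits a run
    into three (run-breaking). Illustrative analogue of the TEB cases."""
    offset = 0
    for value, length in runs:
        if offset <= pos < offset + length:
            # Flipping a length-1 run merges it into its neighbors;
            # flipping inside a longer run breaks that run apart.
            return "run-forming" if length == 1 else "run-breaking"
        offset += length
    raise IndexError(pos)

runs = [(0, 3), (1, 1), (0, 4)]   # bitmap 00010000
print(classify_update(runs, 3))   # lone 1 flips and merges: run-forming
print(classify_update(runs, 5))   # splits the 0-run of length 4: run-breaking
```

The hybrid design follows directly from this asymmetry: run-forming updates shrink or preserve the structure and are cheap to apply in place, while run-breaking updates grow it and are better deferred to a differential structure.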

Efficient Specialized Spreadsheet Parsing for Data Science

Felix Henze, Haralampos Gavriilidis, Eleni Tzirita Zacharatou, and Volker Markl
Workshop Paper DOLAP@EDBT '22: Proc. Intl. Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, 2022


Spreadsheets are widely used for data exploration. Since spreadsheet systems have limited capabilities, users often need to load spreadsheets to other data science environments to perform advanced analytics. However, current approaches for spreadsheet loading suffer from either high runtime or memory usage, which hinders data exploration on commodity systems. To make spreadsheet loading practical on commodity systems, we introduce a novel parser that minimizes memory usage by tightly coupling decompression and parsing. Furthermore, to reduce the runtime, we introduce optimized spreadsheet-specific parsing routines and employ parallelism. To evaluate our approach, we implement a prototype for loading Excel spreadsheets into R environments. Our evaluation shows that our novel approach is up to 3× faster while consuming up to 40× less memory than state-of-the-art approaches.


Monitoring of Stream Processing Engines Beyond the Cloud: an Overview

Xenofon Chatziliadis, Eleni Tzirita Zacharatou, Steffen Zeuch, and Volker Markl
Workshop Paper VLIoT@VLDB '21: Proc. Intl. Workshop on Very Large Internet of Things, 2021


The Internet of Things (IoT) is rapidly growing into a network of billions of interconnected physical devices that constantly stream data. To enable data-driven IoT applications, data management systems like NebulaStream have emerged that manage and process data streams, potentially in combination with data at rest, in a heterogeneous distributed environment of cloud and edge devices. To perform internal optimizations, an IoT data management system requires a monitoring component that collects system metrics of the underlying infrastructure and application metrics of the running processing tasks. In this paper, we explore the applicability of existing cloud-based monitoring solutions for stream processing engines in an IoT environment. To this end, we provide an overview of commonly used approaches, discuss their design, and outline their suitability for the IoT. Furthermore, we experimentally evaluate different monitoring scenarios in an IoT environment and highlight bottlenecks and inefficiencies of existing approaches. Based on our study, we show the need for novel monitoring solutions for the IoT and define a set of requirements.

Agora-EO: A Unified Ecosystem for Earth Observation – A Vision for Boosting EO Data Literacy

Arne de Wall, Björn Deiseroth, Eleni Tzirita Zacharatou, Jorge-Arnulfo Quiané-Ruiz, Begüm Demir, and Volker Markl
Conference Paper BiDS '21: Proc. Conf. on Big Data from Space, 2021


Today, interoperability among EO exploitation platforms is almost nonexistent, as most of them rely on a heterogeneous set of technologies with varying interfaces and data formats. Thus, it is crucial to enable cross-platform (federated) analytics to make EO technology easily accessible to everyone. We envision AgoraEO, an EO ecosystem for sharing, finding, composing, and executing EO assets, such as datasets, algorithms, and tools. Making AgoraEO a reality is challenging for several reasons, the main ones being that the ecosystem must provide interactive response times and operate seamlessly over multiple exploitation platforms. In this paper, we discuss the different challenges that AgoraEO poses as well as our ideas to tackle them. We believe that having an open, unified EO ecosystem would foster innovation and boost EO data literacy for the entire population.

Towards Resilient Data Management for the Internet of Moving Things

Elena Beatriz Ouro Paz, Eleni Tzirita Zacharatou, and Volker Markl
Conference Paper BTW '21: Fachtagung für Datenbanksysteme für Business, Technologie und Web, 2021


Mobile devices have become ubiquitous; smartphones, tablets and wearables are essential commodities for many people. The ubiquity of mobile devices, combined with their ever-increasing capabilities, opens new possibilities for Internet-of-Things (IoT) applications where mobile devices act both as data generators and as processing nodes. However, deploying a stream processing system (SPS) over mobile devices is particularly challenging, as mobile devices change their position within the network very frequently and are notoriously prone to transient disconnections. To deal with faults arising from disconnections and mobility, existing fault tolerance strategies in SPSs are either checkpointing-based or replication-based. Checkpointing-based strategies are too heavyweight for mobile devices, as they save and broadcast state periodically, even when there are no failures. On the other hand, replication-based strategies cannot provide fault tolerance at the level of the data source, as the data source itself cannot always be replicated. Finally, existing systems exclude mobile devices from data processing upon a disconnection, even when the disconnection is very short, thus failing to exploit the computing capabilities of the offline devices. This paper proposes a buffering-based reactive fault tolerance strategy to handle transient disconnections of mobile devices that both generate and process data, even in cases where the devices move through the network during the disconnection. The main components of our strategy are: (a) a circular buffer that stores the data that is generated and processed locally during a device disconnection, (b) a query-aware buffer replacement policy, and (c) a query restart process that ensures the correct forwarding of the buffered data upon re-connection, taking into account the new network topology.
We integrate our fault tolerance strategy with NebulaStream, a novel stream processing system specifically designed for the IoT. We evaluate our strategy using a custom benchmark based on real data, demonstrating reductions in data loss and query latency compared to the NebulaStream baseline.
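The circular-buffer component can be sketched as a fixed-capacity ring that absorbs tuples produced while the device is offline and is drained on re-connection. For simplicity this sketch evicts the oldest tuple when full; the paper's replacement policy is query-aware rather than FIFO, and the class name is hypothetical.

```python
from collections import deque

class DisconnectBuffer:
    """Fixed-capacity ring buffer for tuples produced while a device is
    offline. Evicts the oldest tuple when full (a plain FIFO policy,
    standing in for the query-aware policy of the paper)."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # deque drops the oldest on append
        self.dropped = 0
    def append(self, item):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1               # count evictions for monitoring
        self.buf.append(item)
    def drain(self):
        """Hand the buffered tuples to the restarted query on re-connection."""
        items, self.buf = list(self.buf), deque(maxlen=self.buf.maxlen)
        return items

buf = DisconnectBuffer(capacity=3)
for reading in range(5):         # 5 readings arrive during the disconnection
    buf.append(reading)
print(buf.drain(), buf.dropped)  # [2, 3, 4] 2 -> the oldest two were evicted
```

A query-aware policy would replace the FIFO eviction with one that keeps the tuples most valuable to the running queries, which is precisely where the paper departs from this baseline.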

GeoBlocks: A Query-Cache Accelerated Data Structure for Spatial Aggregation over Polygons

Christian Winter, Andreas Kipf, Christoph Anneser, Eleni Tzirita Zacharatou, Thomas Neumann, and Alfons Kemper
Conference Paper EDBT '21: Proc. Intl. Conf. on Extending Database Technology, 2021


As individual traffic and public transport in cities are changing, city authorities need to analyze urban geospatial data to improve transportation and infrastructure. To that end, they rely heavily on spatial aggregation queries that extract summarized information from point data (e.g., Uber rides) contained in a given polygonal region (e.g., a city neighborhood). To support such queries, current analysis tools either allow only predefined aggregates on predefined regions and are thus unsuitable for exploratory analyses, or access the raw data to compute aggregate results on-the-fly, which severely limits the interactivity. At the same time, existing pre-aggregation techniques are inadequate since they maintain aggregates over rectangular regions. As a result, when applied over arbitrary polygonal regions, they induce an approximation error that cannot be bounded.

In this paper, we introduce GeoBlocks, a novel pre-aggregating data structure that supports spatial aggregation over arbitrary polygons. GeoBlocks closely approximate polygons using a set of fine-grained grid cells and, in contrast to prior work, allow bounding the approximation error by adjusting the cell size. Furthermore, GeoBlocks employ a trie-like cache that caches aggregate results of frequently queried regions, thereby dynamically adapting to the skew inherently present in query workloads and improving performance over time. In summary, GeoBlocks outperform on-the-fly aggregation by up to three orders of magnitude, achieving the sub-second query latencies required for interactive exploratory analytics.
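The pre-aggregation principle can be sketched in a few lines: count points per grid cell once, then answer a polygonal query by summing the counts of the cells that approximate the polygon. The data, cell size, and the flat dictionary grid below are illustrative; GeoBlocks additionally organize cells hierarchically and cache results in a trie.

```python
# Toy pre-aggregation: count points per grid cell once, then answer region
# queries by summing cell counts instead of scanning the raw points.
CELL = 1.0  # cell size; shrinking it tightens the approximation error

def build_grid(points, cell=CELL):
    grid = {}
    for x, y in points:
        key = (int(x // cell), int(y // cell))
        grid[key] = grid.get(key, 0) + 1
    return grid

def query(grid, cells):
    """Aggregate over a polygon approximated by the grid cells it covers."""
    return sum(grid.get(c, 0) for c in cells)

points = [(0.2, 0.3), (0.7, 0.1), (1.5, 0.5), (2.4, 2.2)]
grid = build_grid(points)
# A polygon covering roughly [0,2) x [0,1) maps to these two cells:
print(query(grid, [(0, 0), (1, 0)]))  # -> 3
```

Only cells straddling the polygon boundary contribute approximation error, so the error shrinks with the cell size, which is what makes the bound adjustable.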

The Case for Distance-Bounded Spatial Approximations

Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Varun Pandey, Harish Doraiswamy, and Volker Markl
Conference Paper CIDR '21: Proc. Intl. Conf. on Innovative Data Systems Research, 2021


Spatial approximations have been traditionally used in spatial databases to accelerate the processing of complex geometric operations. However, approximations are typically only used in a first filtering step to determine a set of candidate spatial objects that may fulfill the query condition. To provide accurate results, the exact geometries of the candidate objects are tested against the query condition, which is typically an expensive operation. Nevertheless, many emerging applications (e.g., visualization tools) require interactive responses, while only needing approximate results. Besides, real-world geospatial data is inherently imprecise, which makes exact data processing unnecessary. Given the uncertainty associated with spatial data and the relaxed precision requirements of many applications, this vision paper advocates for approximate spatial data processing techniques that omit exact geometric tests and provide final answers solely on the basis of fine-grained approximations. Thanks to recent hardware advances, this vision can be realized today. Furthermore, our approximate techniques employ a distance-based error bound, i.e., a bound on the maximum spatial distance between false or missing and exact results, which is crucial for meaningful analyses. This bound allows controlling the precision of the approximation and trading accuracy for performance.


NebulaStream: Complex Analytics Beyond the Cloud

Steffen Zeuch, Eleni Tzirita Zacharatou, Shuhao Zhang, Xenofon Chatziliadis, Ankit Chaudhary, Bonaventura Del Monte, Dimitrios Giouroukis, Philipp M. Grulich, Ariane Ziehn, and Volker Markl
Workshop Paper VLIoT@VLDB '20: Proc. Intl. Workshop on Very Large Internet of Things, 2020


The emerging Internet of Things (IoT) will require significant changes to current stream processing engines (SPEs) to enable large-scale IoT applications. In this paper, we present challenges and opportunities for an IoT data management system to enable complex analytics beyond the cloud. As one of the most important upcoming IoT applications, we focus on the vision of a smart city. The goal of this paper is to bridge the gap between the requirements of upcoming IoT applications and the supported features of an IoT data management system. To this end, we outline how state-of-the-art SPEs have to change to exploit the new capabilities of the IoT and showcase how we tackle IoT challenges in our own system, NebulaStream. This paper lays the foundation for a new type of systems that leverages the IoT to enable large-scale applications over millions of IoT devices in highly dynamic and geo-distributed environments.

Adaptive Main-Memory Indexing for High-Performance Point-Polygon Joins

Andreas Kipf, Harald Lang, Varun Pandey, Raul Alexandru Persa, Christoph Anneser, Eleni Tzirita Zacharatou, Harish Doraiswamy, Peter Boncz, Thomas Neumann, and Alfons Kemper
Conference Paper EDBT '20: Proc. Intl. Conf. on Extending Database Technology, 2020


Connected mobility applications rely heavily on geospatial joins that associate point data, such as locations of Uber cars, to static polygonal regions, such as city neighborhoods. These joins typically involve expensive geometric computations, which makes it hard to provide an interactive user experience.

In this paper, we propose an adaptive polygon index that leverages true hit filtering to avoid expensive geometric computations in most cases. In particular, our approach closely approximates polygons by combining quadtrees with true hit filtering, and stores these approximations in a query-efficient radix tree. Based on this index, we introduce two geospatial join algorithms: an approximate one that guarantees a user-defined precision, and an exact one that adapts to the expected point distribution. In summary, our technique outperforms existing CPU-based joins by up to two orders of magnitude and is competitive with state-of-the-art GPU implementations.
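True hit filtering can be sketched as a cell classification: points falling in cells fully inside the polygon are accepted without any geometric test, and only points in boundary cells fall back to an exact point-in-polygon check. The grid, polygon, and cell classification below are hypothetical toy data; the paper's index uses quadtrees and a radix tree rather than a dictionary.

```python
def join_point(cell_classes, exact_test, point, cell_of):
    """True hit filtering: accept points in interior cells without any
    geometric computation; only boundary cells need the exact test."""
    cls = cell_classes.get(cell_of(point), "outside")
    if cls == "interior":
        return True                 # true hit: no point-in-polygon test
    if cls == "boundary":
        return exact_test(point)    # expensive exact check, rare case
    return False

# Unit-grid cells for a 2x2 square polygon with a notch in its top-right:
cell_classes = {(0, 0): "interior", (1, 0): "interior",
                (0, 1): "interior", (1, 1): "boundary"}
inside = lambda p: p[0] <= 2 and p[1] <= 2 and not (p[0] > 1.5 and p[1] > 1.5)
cell_of = lambda p: (int(p[0]), int(p[1]))

print(join_point(cell_classes, inside, (0.5, 0.5), cell_of))  # True hit
print(join_point(cell_classes, inside, (1.8, 1.8), cell_of))  # exact -> False
```

When the polygon approximation is fine-grained, most points land in interior or outside cells, so the expensive exact test runs only for the small fraction of points near the boundary.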


Efficient Bundled Spatial Range Queries

Eleni Tzirita Zacharatou, Darius Sidlauskas, Farhan Tauheed, Thomas Heinis, and Anastasia Ailamaki
Conference Paper SIGSPATIAL '19: Proc. Intl. Conf. on Advances in Geographic Information Systems, 2019, 139-148


Efficiently querying multiple spatial data sets is a growing challenge for scientists. Astronomers query data sets that contain different types of stars (e.g., dwarfs, giants, stragglers) while neuroscientists query different data sets that model different aspects of the brain in the same space (e.g., neurons, synapses, blood vessels). The results of each query determine the combination of data sets to be queried next. Not knowing a priori the queried data sets makes it hard to choose an efficient indexing strategy.

In this paper, we show that indexing and querying the data sets separately incurs considerable overhead but so does using one index for all data sets. We therefore develop STITCH, a novel index structure for the scalable execution of spatial range queries on multiple data sets. Instead of indexing all data sets separately or indexing all of them together, the key insight we use in STITCH is to partition all data sets individually and to connect them to the same reference space. By doing so, STITCH only needs to query the reference space and follow the links to the data set partitions to retrieve the relevant data. With experiments we show that STITCH scales with the number of data sets and outperforms the state-of-the-art by a factor of up to 12.3.

Efficient Query Processing for Spatial and Temporal Data Exploration

Eleni Tzirita Zacharatou
Thesis PhD Dissertation, EPFL, Lausanne, Switzerland, August 2019


Core to many scientific and analytics applications are spatial data capturing the position or shape of objects in space, and time series recording the values of a process over time. Effective analysis of such data requires a shift from confirmatory pipelines to exploratory ones. However, there is a mismatch between the requirements of spatial and temporal data exploration and the capabilities of the data management solutions available today. First, traditional spatial query operators evaluate spatial relations with time-consuming geometric tests that conflict with the interactivity expected from exploratory applications, creating undue overhead. Second, spatial access methods are built on top of rough geometric object approximations that do not capture the complex structure and distribution of today’s spatial data and are thus inefficient. Third, traditional indexes are typically built upfront, before queries can be processed, and over single data attributes, thus precluding interactive access to interesting data subsets that may be specified with constraints on multiple attributes. Finally, existing access methods scale poorly with increasingly granular spatial and temporal data originating from ever more precise data acquisition technologies and ever faster computing infrastructure.

This thesis introduces a novel family of spatial and temporal access methods and query operators that aim to bridge the gap between existing data management techniques and data exploration applications. We show that spatial query operators can be decomposed into primitive graphics operations that are efficiently executed by graphics hardware (GPU) and allow trading accuracy for interactivity. Furthermore, we design access methods that adapt to data characteristics, data growth trends, and workload access patterns, thereby providing scalable performance for ad-hoc queries over increasing data amounts. Specifically, we introduce a spatial approximation that adapts to the structural properties and distribution of the data, and propose spatial and time series access methods that leverage similarities between data items and support filtering over multiple attributes. Finally, we present an approach that indexes data incrementally, using queries as hints for optimizing data access.


Improving Spatial Data Processing by Clipping Minimum Bounding Boxes

Darius Sidlauskas, Sean Chester, Eleni Tzirita Zacharatou, and Anastasia Ailamaki
Conference Paper ICDE '18: Proc. Intl. Conf. on Data Engineering, 2018, 425-436


The majority of spatial processing techniques rely heavily on approximating each group of spatial objects by its minimum bounding box (MBB). As each MBB is compact to store (requiring only two multi-dimensional points) and intersection tests between MBBs are cheap to execute, these approximations are used predominantly to perform the (initial) filtering step of spatial data processing. However, fitting (groups of) spatial objects into a rough box often results in a very poor approximation of the underlying data. The resulting MBBs contain a lot of “dead space”---fragments of bounded area that contain no actual objects---that can significantly reduce the filtering efficacy.

This paper introduces the general concept of a clipped bounding box (CBB) that addresses the principal disadvantage of MBBs, their poor approximation of spatial objects. Essentially, a CBB “clips away” dead space from the corners of an MBB by storing only a few auxiliary points. On four popular R-tree implementations (a ubiquitous application of MBBs), we demonstrate how minor modifications to the query algorithm exploit auxiliary CBB points to avoid many unnecessary recursions into dead space. Extensive experiments show that clipped R-tree variants substantially reduce I/Os: e.g., by clipping the state-of-the-art revised R*-tree we can eliminate on average 19% of I/Os.
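A minimal sketch of the clipping idea, using rectangular corner clips for simplicity (the paper derives more general corner cuts from the data; names here are illustrative): next to the MBB we store a few dead-space regions anchored at its corners that provably contain no objects. A range query whose overlap with the MBB falls entirely inside one clipped region can skip the recursion that a plain MBB test would trigger.

```python
from dataclasses import dataclass, field

@dataclass
class Rect:
    minx: float
    miny: float
    maxx: float
    maxy: float

    def intersects(self, o):
        return (self.minx < o.maxx and o.minx < self.maxx and
                self.miny < o.maxy and o.miny < self.maxy)

    def contains(self, o):
        return (self.minx <= o.minx and o.maxx <= self.maxx and
                self.miny <= o.miny and o.maxy <= self.maxy)

@dataclass
class ClippedBox:
    mbb: Rect
    dead: list = field(default_factory=list)  # clipped corner regions

    def must_visit(self, query):
        """True if a range query may actually reach data under this box."""
        if not self.mbb.intersects(query):
            return False                       # plain MBB rejection
        # the part of the query that overlaps the MBB
        overlap = Rect(max(self.mbb.minx, query.minx),
                       max(self.mbb.miny, query.miny),
                       min(self.mbb.maxx, query.maxx),
                       min(self.mbb.maxy, query.maxy))
        # skip if the query touches the MBB only through dead space
        return not any(d.contains(overlap) for d in self.dead)
```

A query landing entirely in a clipped corner is pruned even though it intersects the MBB, which is exactly the unnecessary recursion the auxiliary points avoid.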

Interactive Visual Exploration of Spatio-Temporal Urban Data Sets using Urbane

Harish Doraiswamy, Eleni Tzirita Zacharatou, Fabio Miranda, Marcos Lage, Anastasia Ailamaki, Claudio Silva, and Juliana Freire
Demo Paper SIGMOD ’18: Proc. Intl. Conf. on Management of Data, 2018, 1693-1696
Best Demonstration Award


The recent explosion in the number and size of spatio-temporal data sets from urban environments and social sensors creates new opportunities for data-driven approaches to understand and improve cities. Visual analytics systems like Urbane aim to empower domain experts to explore multiple data sets, at different time and space resolutions. Since these systems rely on computationally-intensive spatial aggregation queries that slice and summarize the data over different regions, an important challenge is how to attain interactivity. While traditional pre-aggregation approaches support interactive exploration, they are unsuitable in this setting because they do not support ad-hoc query constraints or polygons of arbitrary shapes. To address this limitation, we have recently proposed Raster Join, an approach that converts a spatial aggregation query into a set of drawing operations on a canvas and leverages the rendering pipeline of the graphics hardware (GPU). By doing so, Raster Join evaluates queries on the fly at interactive speeds on commodity laptops and desktops. In this demonstration, we showcase the efficiency of Raster Join by integrating it with Urbane and enabling interactivity. Demo visitors will interact with Urbane to filter and visualize several urban data sets over multiple resolutions.


GPU Rasterization for Real-Time Spatial Aggregation over Arbitrary Polygons

Eleni Tzirita Zacharatou, Harish Doraiswamy, Anastasia Ailamaki, Claudio Silva, and Juliana Freire
Journal Paper Proceedings of the VLDB Endowment (PVLDB), 11(3), 352-365, 2017


Visual exploration of spatial data relies heavily on spatial aggregation queries that slice and summarize the data over different regions. These queries comprise computationally-intensive point-in-polygon tests that associate data points to polygonal regions, challenging the responsiveness of visualization tools. This challenge is compounded by the sheer amounts of data, requiring numerous such tests to be performed. Traditional pre-aggregation approaches are unsuitable in this setting since they fix the query constraints and support only rectangular regions. On the other hand, query constraints are defined interactively in visual analytics systems, and polygons can be of arbitrary shapes. In this paper, we convert a spatial aggregation query into a set of drawing operations on a canvas and leverage the rendering pipeline of the graphics hardware (GPU) to enable interactive response times. Our technique trades off accuracy for response time by adjusting the canvas resolution, and can even provide accurate results when combined with a polygon index. We evaluate our technique on two large real-world data sets, exhibiting superior performance compared to index-based approaches.
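A CPU-side sketch of the rasterization idea (the paper drives the GPU rendering pipeline; the functions below are illustrative emulations): "draw" the polygon onto a coarse canvas by marking the cells whose centers it covers, bin the points onto the same canvas, and aggregate per cell. No per-point point-in-polygon test is needed, and the canvas resolution controls the accuracy/latency trade-off.

```python
from collections import Counter

def point_in_polygon(pt, poly):
    """Ray-casting test, used here only to 'rasterize' the polygon."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def raster_count(points, poly, bounds, res):
    """Approximate count of points inside poly at canvas resolution res."""
    minx, miny, maxx, maxy = bounds
    cw, ch = (maxx - minx) / res, (maxy - miny) / res
    # rasterize the polygon: cells whose centers are covered
    covered = {(cx, cy)
               for cx in range(res) for cy in range(res)
               if point_in_polygon((minx + (cx + 0.5) * cw,
                                    miny + (cy + 0.5) * ch), poly)}
    # bin the points onto the same canvas and aggregate per cell
    cells = Counter((min(int((x - minx) / cw), res - 1),
                     min(int((y - miny) / ch), res - 1)) for x, y in points)
    return sum(n for cell, n in cells.items() if cell in covered)
```

Only cells straddling the polygon boundary can be miscounted, so refining the canvas resolution shrinks the error, which mirrors the accuracy-for-response-time knob described above.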


Space odyssey: efficient exploration of scientific data

Mirjana Pavlovic, Eleni Tzirita Zacharatou, Darius Sidlauskas, Thomas Heinis, and Anastasia Ailamaki
Workshop Paper ExploreDB@SIGMOD/PODS '16: Proc. Intl. Workshop on Exploratory Search in Databases and the Web, 2016, 12-18


Advances in data acquisition—through more powerful supercomputers for simulation or sensors with better resolution—help scientists tremendously to understand natural phenomena. At the same time, however, these advances leave them with a plethora of data and the challenge of analyzing it. Ingesting all the data into a database or indexing it for an efficient analysis is unlikely to pay off because scientists rarely need to analyze all of it. Not knowing a priori what parts of the data sets need to be analyzed makes the problem challenging.

Tools and methods to analyze only subsets of this data are rather rare. In this paper we therefore present Space Odyssey, a novel approach enabling scientists to efficiently explore multiple spatial data sets of massive size. Without any prior information, Space Odyssey incrementally indexes the data sets and optimizes the access to data sets frequently queried together. As our experiments show, through incrementally indexing and changing the data layout on disk, Space Odyssey accelerates exploratory analysis of spatial data by substantially reducing query-to-insight time compared to the state of the art.


RUBIK: efficient threshold queries on massive time series

Eleni Tzirita Zacharatou, Farhan Tauheed, Thomas Heinis, and Anastasia Ailamaki
Conference Paper SSDBM ’15: Proc. Intl. Conf. on Scientific and Statistical Database Management, 2015, 18:1-18:12


An increasing number of applications from finance, meteorology, science and others are producing time series as output. The analysis of this vast amount of time series data is key to understanding the phenomena studied, particularly in the simulation sciences, where the analysis of time series resulting from simulation allows scientists to refine the simulated model. Existing approaches to querying time series typically keep a compact representation in main memory, use it to answer queries approximately, and then access the exact time series data on disk to validate the result. The more precise the in-memory representation, the fewer disk accesses are needed to validate the result. With the massive sizes of today's datasets, however, current in-memory representations oftentimes no longer fit into main memory. To make them fit, their precision has to be reduced considerably, resulting in substantial disk access that impedes query execution today and limits scalability for even bigger datasets in the future.

In this paper we develop RUBIK, a novel approach to compressing and indexing time series. RUBIK exploits the fact that time series in many applications, and particularly in the simulation sciences, are similar to each other. It compresses similar time series, i.e., observation values as well as time information, achieving better space efficiency and improved precision. RUBIK translates threshold queries into two-dimensional spatial queries and efficiently executes them on the compressed time series by exploiting the pruning power of a tree structure to find the result, thereby outperforming the state-of-the-art by a factor of 6 to 23. As our experiments further indicate, exploiting similarity within and between time series is crucial to make query execution scale and to ultimately decouple query execution time from the growth of the data (size and number of time series).


Automatic Music Transcription

Eleni Tzirita Zacharatou
Thesis Master Thesis, NTUA, Athens, Greece, July 2013


Music Information Retrieval is the interdisciplinary research area concerned with retrieving information from music. Automatic Music Transcription, a subfield of Music Information Retrieval, aims to extract the pitches and timings of notes from an audio signal and to identify which instrument produced them. The extracted information must then be represented in a form that musicians can read and that could be used to recreate the original audio. Automatic Music Transcription, especially for polyphonic music, remains an open problem in audio digital signal processing: to date, no application matches the capabilities of a trained musician.

The goal of this diploma thesis is the study and development of techniques and algorithms for Automatic Piano Music Transcription. Besides an extensive review of the literature relevant to Automatic Music Transcription, we present an onset detection method for monophonic piano music and a multiple fundamental frequency (F0) estimation method for polyphonic piano music. Onset detection is based on the Teager-Kaiser energy operator and a Gabor filter bank whose central frequencies are those of the piano keys. Similarly, the multiple fundamental frequency estimation method evaluates the DTFT of the signal at the specific frequencies of the piano notes. Since the polyphony order K is unknown, the proposed algorithm aims at both inferring K and finding the fundamental frequencies. This is done sequentially, starting from a monophonic assumption and continuing with plausible combinations of higher polyphony order. Candidate note combinations are selected as those best matching the spectrum and properties learned from the data; the final combination is the one with minimum squared error. Our experimental results show good performance for both methods.
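The discrete Teager-Kaiser energy operator used for onset detection has a particularly simple form, Ψ[x](n) = x(n)² − x(n−1)·x(n+1): it tracks the instantaneous energy of an oscillation and spikes on sudden amplitude or frequency changes. A minimal sketch (only the bare operator, not the thesis's full detection pipeline):

```python
def teager_kaiser(x):
    """Discrete Teager-Kaiser energy operator,
    psi(n) = x(n)^2 - x(n-1) * x(n+1).
    Defined for interior samples only, so the output is 2 samples shorter."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]
```

For a pure tone A·sin(ωn) the operator yields the constant A²·sin²(ω), so an abrupt jump in its output signals a change in the signal, such as a new note onset.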

Full list of publications at DBLP.

Notice: The documents contained in this website are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.