SEA Data

Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, 2nd edition

Co-located with VLDB 2021 (Copenhagen, Denmark)

The SEA Data workshop will provide a forum for researchers and practitioners to exchange ideas, results, and visions on challenges in data management, information extraction, exploration, and analysis of heterogeneous data and multiple data models at once.

Companies, governments, and organizations are now producing and collecting data from multiple heterogeneous sources, such as transactional data, internet traffic, logs, IoT applications, knowledge bases, and much more. The unprecedented pace at which data is produced and consumed calls for methods that organize, retrieve, and analyze such data appropriately. While data were traditionally organized into homogeneous datastores and formats, collecting data from many different sources makes such datastores impractical. Even within the same organization, data dwells in independent silos, each with a distinct data model and serving a specific application, keeping relevant portions of the data separate from each other.

As a consequence, we have witnessed an increasing interest in systems and methods that try to handle and analyze multiple data sources and formats holistically. Data lakes and polystores are the most prominent examples of such heterogeneous datastores. Moreover, graphs and learned databases have recently attracted the attention of the community for their flexibility in modeling, managing, and organizing heterogeneous data. Due to the fast pace of data collection and evolution, consolidating all the sources into a single data format and loading them into a single store is usually impractical.

Hence, the first challenge that these systems face is to provide flexible storage and retrieval methods that can adapt to multiple models and domains. On the other hand, from the user perspective, when such diverse data is collected, the tasks of data discovery, exploration, and analysis become even more challenging. For heterogeneous datastores, solutions to these tasks remain largely uncharted, for lack of established methods that allow effective multi-model data retrieval and exploration. Data analytics should also accommodate issues due to the lack of shared dimensions, ambiguous semantics, and the need to ensure the quality and lineage of the analytical result.

Workshop Chairs

Important Dates

  • Submission: AoE
  • Notification: AoE
  • Camera Ready due: AoE

Submission types

  • Regular research, system papers (up to 6 pages)
  • Experiments and Analysis papers (up to 6 pages)
  • Vision & work-in-progress & experiences papers (up to 4 pages)

Submit your work!

Topics

SEA Data aims at gathering researchers and practitioners from various communities related to databases. We gladly accept submissions that present initial ideas and visions, just as much as reports on early results, or reflections on completed projects. The workshop will focus on discussion and interaction, rather than static presentations of what is in the paper. A list of relevant topics is presented below.

  • Search, Exploration, and Analysis for heterogeneous unstructured and semi-structured data (e.g., knowledge graphs, web documents, semantic web);
  • Multi-model data exploration and analysis;
  • Querying and analyzing data lakes and polystores;
  • Cross-platform query processing and analytics;
  • Theory of heterogeneous data management;
  • Machine-learning methods for multi-model data exploration and analysis;
  • Novel user interfaces and query paradigms for searching heterogeneous data;
  • Exploration of large datasets including multiple sources;
  • Data visualization of heterogeneous data;
  • Example-based search and discovery for multi-model and heterogeneous datastores;
  • User-driven approaches to data management for complex datasets;
  • Novel analyses involving multiple data sources;
  • Federated search, exploration, and analysis;
  • Information integration and entity resolution across heterogeneous knowledge bases and multi-model databases;
  • Approximate, anytime, and fast algorithms for extracting information from heterogeneous datastores;
  • Learnable structures for multi-model datasets;
  • Workload- and domain-agnostic self-assembling data management systems.

We also welcome submissions on thought-provoking applications and emerging uses of data management technology in heterogeneous datastores or multi-model databases. The workshop also welcomes papers on negative results.

Program Committee


Workshop Program

Keynotes:

Systems for Human Data Interaction
by Eugene Wu

Abstract:

The rapid democratization of data has placed its access and analysis in the hands of the entire population. While the advances in rapid and large-scale data processing continue to reduce runtimes and costs, the interfaces and tools for end-users to interact and work with data are still lacking.

It is still too difficult to translate a user's data needs into the appropriate interfaces, too difficult to develop data interfaces that are responsive end-to-end and scalable, and too difficult for users to understand and interpret the data they see. In this talk, I will provide an overview of our lab's recent work on systems for human data interaction that go towards addressing these challenges.

Speaker Bio:

Eugene Wu is an Associate Professor of Computer Science at Columbia University. He received a Ph.D. in EECS from MIT and a B.S. from UC Berkeley. He is broadly interested in technologies for human data interaction, and how users can effectively and quickly make sense of their data. Eugene is interested in solutions that ultimately improve the interface between users and data. He combines ideas from database management, visualization, and HCI. Eugene has received the VLDB 2018 test-of-time award, the coveted CIDR gong show award, NSF CAREER, and the Google and Amazon faculty awards.

LIquid: Scaling the system that builds and serves the LinkedIn Economic Graph
by Bogdan Arsintescu & Scott Meyer

Abstract:

LIquid is a distributed graph database service that scales to serve the LinkedIn Economic Graph: 200B edges, 1B vertices, 1.2M QPS with very low latency and 99.99% availability. (Check your LinkedIn profile now, make our day!) This presentation describes some of the fundamental building blocks that allow us, on one side, to build and nimbly evolve a graph of this caliber from multiple data sources and, on the other side, to enable the entire company to compose ever-more-complex queries while increasing launch velocity and eventually developing one-query applications.

The graph log structure dramatically shrinks the cost and complexity of graph construction and curation: the sequential nature of the log enables us to harness the curation effort of arbitrarily many people. Similarly, it allows data to come from different sources. Consider that the curation of code bases now works this way, whereas the curation of data has been unchanged for decades.

A declarative language is required to free the application developer from the optimization complexity of the graph structure and nature. While many new languages promise simplicity, we have chosen Datalog for expressivity and modularity. Datalog can model subgraphs and constraints in the graph expressions in a better way than 'query by example' languages; also, it is a complete implementation of the relational model. Moreover, Datalog rules allow scalable modularity, reuse, and evolution of the queries: an entire application can be built as a single query and hierarchically composed. We are using Datalog both in the ETL to construct the graph and to query it.

We will exemplify these traits by building a sample graph from open-source datasets using declarative ETLs, curation, and queries. We will discuss how such a graph can evolve by adding new datasets and how we scale the serving system in production.

Speaker Bio:

Bogdan Arsintescu is Director of Engineering at LinkedIn leading the Graph team, responsible for building a state-of-the-art distributed graph database from the ground up and operating the LinkedIn Economic Graph, a low-latency online service for a 200B-edge graph with peak traffic exceeding 1M QPS. Prior to that, Bogdan worked at Google with primary contributions on interval temporal logic for location data and Pregel-based query execution for the Google Knowledge Graph. Earlier still, he worked at Cadence Design Systems on semantic data for design automation.

Submission Guidelines

All submissions will be electronic via the Easychair submission system.

Regular research papers as well as system papers have a page limit of 6 pages (references included).

Experiments and Analysis papers also have a page limit of 6 pages (references included).

Vision papers, work-in-progress papers, and experiences papers have a page limit of 4 pages (references included).

The SEA Data 2021 workshop is single-blind, so authors must include their names and affiliations in submissions.

Formatting

Submitted papers must follow the VLDB Proceedings Format and be submitted as PDF files.

The font size, margins, inter-column spacing, and line spacing in the templates must be kept unchanged.

Any submitted paper violating the length, file type, or formatting requirements will be rejected without review.

Formatting guidelines for the camera-ready version will follow.