Towards the Specification and Generation of Time Series Datasets from Data Lakes

Bibliographic Details
Published in: IEEE International Requirements Engineering Conference Workshops (Online), pp. 302-306
Main Authors: Sal, Brian; de la Vega, Alfonso; Lopez-Martinez, Patricia; Garcia-Saiz, Diego; Grande, Alicia; Lopez, David; Sanchez, Pablo
Format: Conference Proceeding
Language: English
Published: IEEE, 01.09.2023
ISSN: 2770-6834
DOI: 10.1109/REW57809.2023.00057

Summary: More and more organizations are building data lakes to store the information they generate. This information is considered a valuable asset that, if properly analyzed, can support better-informed decisions. However, since the analyses to be performed are often not known in advance, these data are stored in a raw format. Consequently, any application built on top of a data lake must carefully elicit which data will be used for a particular analysis and how those data will be transformed so that they fit together into a dataset. This data selection and preparation task is typically performed by data scientists, who write large and complicated scripts in data management languages to extract and transform the required data. This reduces their productivity, since they must write large amounts of highly similar code. It also makes it difficult for domain experts to participate in the process, because they have little understanding of these scripts. To alleviate these problems, this work introduces a work-in-progress version of a high-level declarative language for specifying the requirements that a dataset derived from a data lake must satisfy. Specifications written in this language are then processed to automatically generate the specified dataset, so that data scientists and domain experts need not know the details of how the data are retrieved and transformed.