Sebastian Schmidl
Software Engineer and Ph.D. Student
at the Hasso Plattner Institute in Potsdam.
Detecting anomalous subsequences in time series.
Time series datasets are particularly challenging for data engineering and analytics due to their size, recording speed, and complex nature. They are also a dominant form of data in statistics, the sciences, and engineering. I focus on advancing subsequence anomaly detection and anomaly clustering through more efficient and effective techniques. For this purpose, I investigate and evaluate the state of the art in time series research, design novel subsequence anomaly detection algorithms, and build automated, scalable systems for managing and analyzing massive corpora of time series data.
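To give a flavor of what subsequence anomaly detection means, here is a minimal, illustrative sketch (a brute-force discord search; the function name and the toy series are my own, and this is not one of the algorithms from my research): each length-w subsequence is scored by the Euclidean distance to its nearest non-overlapping neighbor, so a subsequence that is unlike all others receives the highest score.

```python
def discord_scores(series, w):
    """Score each length-w subsequence by the Euclidean distance to its
    nearest non-overlapping neighbor; the highest-scoring subsequence
    (the "discord") is the most anomalous one. O(n^2 * w) brute force,
    only suitable for tiny inputs."""
    n = len(series) - w + 1
    windows = [series[i:i + w] for i in range(n)]
    scores = []
    for i in range(n):
        best = min(
            sum((a - b) ** 2 for a, b in zip(windows[i], windows[j])) ** 0.5
            for j in range(n)
            if abs(i - j) >= w  # skip overlapping (trivial) matches
        )
        scores.append(best)
    return scores

# A repeating pattern with one spike: the windows covering the spike score highest.
scores = discord_scores([0, 1, 0, 1, 0, 1, 5, 1, 0, 1, 0, 1], w=3)
```

Practical algorithms avoid the quadratic all-pairs comparison, but the scoring idea — rank subsequences by how far they are from their best match — is the common core.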
Efficient and scalable time series analytics is the focus of my PhD thesis at the Hasso Plattner Institute in Potsdam, resulting in many publications and projects in this area.
Example projects: Comprehensive evaluation of time series anomaly detection algorithms, AutoTSAD, HYPEX, AKITA (with Rolls-Royce), DendroTime (with DLR, in development).
Solving computationally complex problems in distributed computing environments.
I investigate computationally complex problems and how they can be solved in distributed environments. Such problems are prevalent in both data engineering and data analytics, yet most existing solutions for data-centric problems lack efficiency, scalability, robustness, and elasticity. I believe these deficiencies can be addressed through distributed computing. I am especially interested in actor programming, which enables building fault-tolerant, elastic, and shared-nothing systems.
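To make the actor idea concrete, here is a minimal sketch in plain Python (standard library only, not Akka; the class and message names are my own): an actor owns its state privately, and the only way to affect that state is to send a message to its mailbox, which a single thread drains one message at a time — hence no locks and no shared memory.

```python
import queue
import threading

class CounterActor:
    """Minimal actor: private state plus a mailbox processed by a single
    thread, so the state is never shared between threads (shared-nothing)."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0                  # private state, touched only by _run()
        self._stopped = threading.Event()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):
        self._mailbox.put(msg)           # the only way to interact with the actor

    def _run(self):
        while True:
            msg = self._mailbox.get()    # process one message at a time
            if msg == "stop":
                self._stopped.set()
                return
            self._count += 1             # no locks needed: single consumer

    def result(self):
        self._stopped.wait()
        return self._count

actor = CounterActor()
for msg in ("inc", "inc", "inc"):
    actor.send(msg)
actor.send("stop")
total = actor.result()  # 3
```

Frameworks such as Akka add the pieces this sketch omits — supervision hierarchies for fault tolerance, location transparency, and elastic placement of actors across machines.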
Distributed computing was the focus of my master's studies at the Hasso Plattner Institute in Potsdam, Germany. The courses I took at HPI included Distributed Data Management, Reliable Distributed Systems Engineering, Actor Database Systems, Methods of Cloud Computing and many others. For my master's thesis, I developed a distributed and reactive algorithm to discover bidirectional order dependencies in relational data, which was published in the VLDB Journal.
Developing efficient algorithms to extract metadata from relational datasets.
Profiling data to determine metadata about a given dataset is an important and frequent activity for IT professionals and researchers alike, and it is necessary for various use cases. Data profiling encompasses a vast array of methods to examine datasets and produce metadata; it is used, for example, to understand the structure of a dataset, identify and monitor data quality issues, prepare data for analysis, and optimize SQL queries in database management systems. Despite considerable research in this area, these application areas hardly use any modern data profiling techniques — the current state of data profiling research unfortunately fails to address practical application needs.
I am particularly interested in efficient and scalable methods to discover order dependencies, and in closing this disconnect between profiling research and its applications with a novel data profiling engine and query language (DPQL).
Example projects: DISTOD, Metaserve (DPQL)
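To illustrate the kind of metadata these projects discover, here is a brute-force check of an order dependency straight from its definition (the function and the toy table are my own illustration, not how DISTOD works internally): an OD X ↦ Y holds if ordering the rows by X also orders them by Y, i.e., for every pair of rows r, s with r[X] ≤ s[X], we also have r[Y] ≤ s[Y].

```python
def od_holds(rows, lhs, rhs):
    """Brute-force check of the order dependency lhs -> rhs over a list of
    dict rows. Discovery algorithms such as DISTOD avoid this O(n^2)
    pairwise check with pruning and efficient search strategies."""
    return all(
        r[rhs] <= s[rhs]
        for r in rows
        for s in rows
        if r[lhs] <= s[lhs]
    )

flights = [
    {"distance": 500,  "duration": 60},
    {"distance": 900,  "duration": 110},
    {"distance": 1500, "duration": 180},
]
holds = od_holds(flights, "distance", "duration")  # True: longer flights take longer

# One long but fast flight breaks the dependency:
delayed = flights + [{"distance": 2000, "duration": 150}]
violated = od_holds(delayed, "distance", "duration")  # False
```

The hard part, and the subject of discovery algorithms, is not checking a single candidate like this but finding all (bidirectional) ODs among the exponentially many attribute-list combinations of a table.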
I maintain TimeEval, an evaluation tool for time series anomaly detection algorithms. It bundles more than 70 anomaly detection algorithms and thousands of time series datasets. We used TimeEval in a project with Rolls-Royce to evaluate the state of the art in time series anomaly detection, which resulted in a comprehensive evaluation paper (pVLDB).
Python | Dask | Pytest | GitHub Actions | Docker | Bash | Linux
I developed the DISTOD data profiling algorithm, a distributed algorithm to discover bidirectional order dependencies from relational data. It combines efficient pruning techniques with a novel, reactive, and distributed search strategy, outperforming all existing baselines. The algorithm was published in the VLDB Journal.
Scala | Akka | Python | Bash | Linux
I am a core developer of aeon, a scikit-learn-compatible toolkit for all machine learning tasks on time series. I created and maintain its anomaly detection module, and I contribute to its large collection of elastic time series distance measures. We presented aeon in a tutorial at ECML PKDD.
Python | Sklearn | Numba | NumPy | Pytest | GitHub Actions
DendroTime is a progressive algorithm to compute a hierarchical agglomerative clustering for large collections of time series subsequence anomalies. Its anytime behavior allows the user to interrupt the clustering process early and still obtain a meaningful solution. I develop the algorithm in cooperation with the German Aerospace Center (DLR).
Scala | Akka | JavaScript | React | D3
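The anytime idea behind DendroTime can be sketched with a toy single-linkage clustering of 1-D points (my own simplified illustration, not the actual DendroTime algorithm): the merge loop can be stopped after any number of steps and still returns a valid, merely coarser, clustering.

```python
def anytime_single_linkage(points, max_merges=None):
    """Agglomerative clustering of 1-D points with single linkage.
    Stopping the merge loop early (max_merges) still yields a valid,
    just coarser, clustering -- the essence of anytime behavior."""
    clusters = [[p] for p in points]
    merges = 0
    while len(clusters) > 1 and (max_merges is None or merges < max_merges):
        # single linkage: cluster distance = minimum pairwise point distance
        _, i, j = min(
            (min(abs(a - b) for a in clusters[i] for b in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] += clusters.pop(j)  # merge the closest pair of clusters
        merges += 1
    return clusters

points = [1.0, 1.1, 5.0, 5.2, 9.0]
partial = anytime_single_linkage(points, max_merges=2)  # interrupted early
full = anytime_single_linkage(points)                   # merged down to one cluster
```

Because the closest pairs are merged first, an early interruption has already captured the tightest structure — here the pairs {1.0, 1.1} and {5.0, 5.2} — which is what makes a progressive algorithm useful on large collections.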
During my PhD studies at HPI, I supervise students and teach Bachelor's and Master's courses in the fields of data profiling, time series analytics, distributed computing, and reproducibility in science: