Multivariate time-series forecasts inside databases with MindsDB and PyTorch

Jul 30, 2021


Introduction
Let’s think of an example
Many series with one command
Adding more context into the mix...
Advanced features for improved resource usage
What’s next?
References

Author: Patricio Cerda-Mardini, Machine Learning Research Engineer @mindsdb

Introduction

At MindsDB, we are building an open source platform so that anyone can leverage automated machine learning (ML) capabilities from within their databases. MindsDB follows a generalized philosophy in order to tackle novel and diverse use cases in the community; since users may work with numerous data types in their databases, our machine learning team focuses on ways to extend that philosophy and build robust predictors across many different domains. PyTorch is a key ingredient in our ability to iterate quickly and deploy flexible ML code.

As relational databases store more and more temporal information, one of the usage trends we’ve noticed is the need for accurate forecasts ([1], [2]). We recently investigated how to build predictors that jointly account for temporal and non-temporal details in an effective way, which can be crucial to getting good results.

Time series forecasting is a difficult task that spans decades of research and development ([3], [5]). In this blog post, we’ll delve into some of the challenges that have arisen while extending our AutoML solution to handle a wide variety of forecasting scenarios in databases, and how we’re overcoming them with the help of powerful features and abstractions that PyTorch offers.

Let’s think of an example

In order to illustrate some common challenges, let’s consider a retailer with a handful of stores across the city. Their database has detailed sales records for all stores throughout the year. Among other things, each transaction has information about when each type of product was purchased, how many units were sold at what price, and some extra details that the customers voluntarily provided (e.g. address or age). The goal is to get a sense of how things will go next month by forecasting sales. This presents an interesting challenge with cardinality, where each product-store pair is an independent time series. With thousands of those pairs, three questions arise: what is the best way to forecast many time series in parallel, how can we leverage contextual information, and how do we make the most of the available computing resources? Let’s look at each of them in more detail.

Many series with one command

Working within the database enables us to group our sales by any set of columns that we wish to aggregate information on with a simple 'SELECT' statement. For example, we might wish to analyze each store as a separate entity, and inquire how each department performs within the store. Beyond this, we could analyze the items available in each case by grouping them into predetermined price ranges. Considering each [store - department - price range] combination creates several distinct time series data subsets.
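To make the grouping idea concrete, here is a minimal sketch in plain Python of how rows returned by such a query could be split into one series per [store - department - price range] combination. The rows, column order, and the helper name split_into_series are hypothetical and only serve as an illustration; this is not MindsDB code.

```python
from collections import defaultdict

# Hypothetical sales rows: (store, department, price_range, month, units_sold).
rows = [
    ("store_a", "clothing", "low", 2, 12),
    ("store_a", "clothing", "low", 1, 10),
    ("store_a", "clothing", "high", 1, 3),
    ("store_b", "toys", "low", 1, 7),
    ("store_b", "toys", "low", 2, 9),
]


def split_into_series(rows):
    """Group rows by (store, department, price_range), ordering each group by month."""
    groups = defaultdict(list)
    for store, department, price_range, month, units in rows:
        groups[(store, department, price_range)].append((month, units))
    # Sort each group chronologically and keep only the observed values.
    return {
        key: [units for _, units in sorted(points)]
        for key, points in groups.items()
    }


series = split_into_series(rows)
```

Each key of the resulting dictionary identifies one time series, so the number of series grows with the number of distinct combinations in the data.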

For some of these subsets, sales may fall in very different numerical ranges. For example, popular or expensive products should show higher sales amounts than less popular or discontinued items, not to mention that the actual dynamics of each series could be very different as well (not everyone is rushing to buy a winter coat in summer!).

How do we proceed? A common approach here would be to train a different forecaster for each of these series. A more ambitious take considers all the series at once and trains a single model. The latter approach is desirable because it scales better: when the number of combinations across subsets grows large, building an independent predictor for each case becomes intractable.

Given that these series can have wildly different ranges of values, how can we leverage all the data in our model without running into problems when training our neural networks? Let’s go through how MindsDB tackles this problem to build a unified predictor capable of generalizing to different time series problems.

As mentioned before, the MindsDB philosophy is to featurize data through an autoencoder, and then feed that transformed input into a predictive model, the “mixer.” This approach allows us to flexibly combine data types as diverse as text or time series with categorical or numerical data.

Figure 1: MindsDB offers a flexible design to jointly handle different data types

The first step in handling time series data is to normalize the series. MindsDB performs a min-max normalization step prior to feeding the data into the encoder, so that the temporal dynamics of all series in the training corpus are expressed within the same numerical range. (As an aside for the more technically curious, we’ll later see that this approach does not assume stationary series.)
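As a rough illustration of per-series min-max scaling, the sketch below fits one normalizer per series and keeps what is needed to map values back to the original scale. The class name MinMaxNormalizer and the handling of constant series are assumptions made for this example, not the MindsDB implementation.

```python
class MinMaxNormalizer:
    """Scale one series into [0, 1] and remember how to undo the scaling."""

    def fit(self, values):
        self.low = min(values)
        self.high = max(values)
        # Assumption: a constant series gets a span of 1.0 to avoid dividing by zero.
        self.span = (self.high - self.low) or 1.0
        return self

    def transform(self, values):
        return [(v - self.low) / self.span for v in values]

    def inverse_transform(self, scaled):
        return [s * self.span + self.low for s in scaled]


# Two series with very different ranges end up in the same [0, 1] interval.
coats = [10, 20, 30]
socks = [1000, 3000, 2000]
coat_norm = MinMaxNormalizer().fit(coats)
sock_norm = MinMaxNormalizer().fit(socks)
scaled_coats = coat_norm.transform(coats)
scaled_socks = sock_norm.transform(socks)
```

Because each series keeps its own normalizer, forecasts produced in the scaled space can be decoded with the matching inverse_transform.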

This bounded set of numerical sequences is the input for our autoencoder architecture, built using PyTorch’s GRU and Linear layers. The paradigm is a classical encoder-decoder pair, where the interesting bit is the intermediate representation that the model generates once it’s trained, describing each series’ state given its last N values (with N determined by the user). Each series normalizer has its corresponding inverse transform available so that we can decode back into the original scale of each series.
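The following is a minimal sketch of an encoder-decoder pair built from PyTorch’s GRU and Linear layers. The class name, layer sizes, and the choice of repeating the encoded state at every decoding step are assumptions for illustration; the actual MindsDB architecture may differ.

```python
import torch
from torch import nn


class SeriesAutoencoder(nn.Module):
    """Encode a window of N normalized values into a vector, then reconstruct it."""

    def __init__(self, hidden_size=16):
        super().__init__()
        self.encoder = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.decoder = nn.GRU(input_size=hidden_size, hidden_size=hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, 1)

    def encode(self, window):
        # window has shape (batch, N, 1); the last hidden state summarizes it.
        _, last_hidden = self.encoder(window)
        return last_hidden.squeeze(0)  # (batch, hidden_size)

    def forward(self, window):
        state = self.encode(window)
        steps = window.size(1)
        # Feed the encoded state to the decoder at every time step.
        repeated = state.unsqueeze(1).repeat(1, steps, 1)
        decoded, _ = self.decoder(repeated)
        return self.output(decoded)  # (batch, N, 1)


model = SeriesAutoencoder(hidden_size=8)
windows = torch.rand(4, 5, 1)  # 4 series, last N=5 values each
representation = model.encode(windows)
reconstruction = model(windows)
```

After training on a reconstruction loss, the output of encode is the intermediate representation that later stages can consume.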

The highly customizable and flexible PyTorch API enables us to automatically determine hyperparameters for these networks, such as the number of layers, hidden sizes, and activation functions. We set these based on relevant dataset properties, which can be time series but also any other type of information, like free form text.

Adding more context into the mix...

By this point in the process, we have useful intermediate representations (IRs) for our temporal data. Now what? Recall that sales are not the only information at our disposal. It’s possible that other bits of data, such as the stock of each product, can be considered by a machine learning model to further improve its predictions.

MindsDB automatically handles these other columns in parallel and, in a similar approach but with different architectures depending on the data type, creates IRs for each column passed in the initial SELECT query. Once we have encoded any relevant information, all column IRs pass to the mixer stage.

The mixer is the predictive model that incorporates every descriptor learned from the data thus far. Mixers can be gradient boosting algorithms or other classical ML approaches, but when dealing with a lot of data, we’ve found neural networks have the upper hand. To construct deep learning models, we employ PyTorch’s nn.Module class, as it lets us define custom pieces of architecture with ease. Residual connections, which shortcut information passing between layers, open up a flow that can be seen as an autoregressive component (which handles non-stationary dynamics). For a thorough formulation, check out the AR-Net paper by Facebook AI [4].

Figure 2: Mixers receive multiple IRs for series historical values through residual connections
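To illustrate the idea of a residual path that carries historical values straight to the output, here is a hedged sketch of a small mixer written with nn.Module. The class name ResidualMixer, the layer sizes, and the single linear shortcut over the last values are assumptions loosely inspired by the autoregressive formulation in [4], not the MindsDB mixer itself.

```python
import torch
from torch import nn


class ResidualMixer(nn.Module):
    """Combine concatenated column IRs with a linear shortcut over recent values."""

    def __init__(self, ir_size, window, hidden_size=32):
        super().__init__()
        # Nonlinear path over the concatenated intermediate representations.
        self.net = nn.Sequential(
            nn.Linear(ir_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )
        # Shortcut path: a linear map of the last `window` values, which acts
        # like an autoregressive term added to the network output.
        self.shortcut = nn.Linear(window, 1, bias=False)

    def forward(self, irs, history):
        return self.net(irs) + self.shortcut(history)


mixer = ResidualMixer(ir_size=24, window=5)
irs = torch.rand(4, 24)      # concatenated IRs for 4 rows
history = torch.rand(4, 5)   # last 5 normalized values for each row
forecast = mixer(irs, history)
```

Keeping the shortcut linear means the model can fall back on a simple autoregression when the other columns add little information.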

We’re researching new mixer “flavors” all the time, in part thanks to how the nn.Module design encourages quick experimentation. Motivated by our experience so far, one of our next steps is to enable differentiable neural architecture search to automate the mixer generation a step further.

Advanced features for improved resource usage

Training a model can be very resource and time intensive, which is why it needs to be done with maximal efficiency. In fact, some particularly demanding use cases have pushed us to explore and use advanced PyTorch features.

Automatic Mixed Precision (AMP) is used throughout MindsDB encoders’ and mixers’ internal training loops. This enables any owner of a CUDA GPU with Tensor Cores to train their machine learning model with half-precision floating point numbers, obtaining training speedups typically with little or no loss in accuracy.

To improve AMP-enabled model convergence, gradient scaling is recommended as a complement. In both cases, a couple of additional lines of torch.cuda.amp code were all we really needed to successfully deploy this technique.
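For reference, a generic training loop with autocast and gradient scaling can be sketched as follows. The toy model and data are placeholders, and both features are switched off automatically when no CUDA device is present so that the snippet also runs on CPU. Newer PyTorch releases expose the same tools under torch.amp, so torch.cuda.amp may emit a deprecation warning there.

```python
import torch
from torch import nn

use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

# Placeholder model and data, used only to exercise the loop.
model = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
inputs = torch.rand(16, 8, device=device)
targets = torch.rand(16, 1, device=device)

# The scaler multiplies the loss before backward() so that small
# half-precision gradients do not underflow; it is a no-op when disabled.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

losses = []
for _ in range(3):
    optimizer.zero_grad()
    # Inside autocast, eligible ops run in half precision on the GPU.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales gradients, then calls optimizer.step()
    scaler.update()          # adjusts the scale factor for the next iteration
    losses.append(loss.item())
```

Compared with a regular loop, the only additions are the autocast context and the three scaler calls.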

Some of you train models on your own on-premise servers, which are then deployed on a variety of hardware, including machines without GPU acceleration. PyTorch abstractions make it trivial to move tensors and model weights to and from the CPU with .to(). On the other hand, we use DataParallel to leverage the additional computing power when the user has multiple GPUs.
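A minimal sketch of this kind of device handling is shown below. The toy model is a placeholder, and the exact logic MindsDB uses to pick devices is not shown here; the snippet simply falls back to the CPU when no GPU is found and wraps the model in DataParallel only when more than one GPU is visible.

```python
import torch
from torch import nn

# Pick the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 1)  # placeholder model
if torch.cuda.device_count() > 1:
    # Replicates the model and splits each batch across the visible GPUs.
    model = nn.DataParallel(model)
model = model.to(device)

batch = torch.rand(4, 8).to(device)
with torch.no_grad():
    # Move the result back to the CPU, e.g. before returning it to the caller.
    prediction = model(batch).to("cpu")
```

The same code path therefore works on a laptop without a GPU and on a multi-GPU server.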

These techniques, in addition to ML tools like the RAdam + Lookahead optimizer (also known as “Ranger”) and early stopping, let us train forecasting models in a quick and scalable manner.
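Early stopping in particular is simple to sketch. The function below is a generic, patience-based version written in plain Python; the function name, the patience value, and the list of validation losses are hypothetical and stand in for whatever criterion a real training loop would use.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return (stop_epoch, best_loss) for a patience-based stopping rule."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` consecutive epochs: stop here.
                return epoch, best
    # The loss kept improving often enough, so training ran to the end.
    return len(validation_losses) - 1, best


stop_epoch, best_loss = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.9])
```

In this example the loss stops improving after the third epoch, so training halts two epochs later instead of running through the whole schedule.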

To finish our example thought experiment, we would be able to quickly get accurate forecasts for the sales volume of any store, department, and price range combination by training a single ML model. Some of you have similar situations, and have turned to MindsDB for those forecasting needs.

What’s next?

Our immediate roadmap includes enabling the MindsDB pipeline for stream processing solutions. Use cases like the one in our example are usually dynamic and have very high information throughput, so it’s a natural fit. In parallel, we are also starting to explore time series anomaly detection tasks, which will allow us to detect aberrant behavior and unlock further key insights for your data.

I hope that this article has conveyed one of the beauties of open source: thanks to PyTorch, we are in a much better position to iterate fast and easily access the state of the art in neural network research. As an open source solution ourselves, we are excited for the community to try out our work, so let us know what you think! Please join us on GitHub or Slack.

If this article was helpful, please give us a star on GitHub.


References

[1] Chen, J., Zeng, G.-Q., Zhou, W., Du, W., & Lu, K.-D. (2018). Wind speed forecasting using nonlinear-learning ensemble of deep learning time series prediction and extremal optimization. Energy Conversion and Management, 165, 681-695.

[2] Gasparin, A., Lukovic, S., & Alippi, C. (2019). Deep learning for time series forecasting: The electric load case. ArXiv, abs/1907.09207.

[3] Dickey, D., & Fuller, W. (1979). Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association, 74, 427-431.

[4] Triebe, O., Laptev, N., & Rajagopal, R. (2019). AR-Net: A simple auto-regressive neural network for time-series. ArXiv, abs/1911.12436.

[5] Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. Wiley.
