Forecasting: Long Short-Term Memory (LSTM) and Autoencoder

This project was initially intended as a solution for one of our customers. The goal is to forecast the daily number of orders placed on the website, so that production plans can be optimized and products neither pile up in excess nor run out of stock. We used a Long Short-Term Memory (LSTM) autoencoder for the time series forecasting.

The steps we followed to forecast the time series with an LSTM autoencoder are:

Check whether the target feature has enough data to make predictions. If not, check whether there is a dependent variable with better-quality records that can be used to make an indirect prediction.
Split the data into train and test sets and preprocess the features.
Prepare and train the sequential, bidirectional LSTM autoencoder model.
Use the LSTM autoencoder to compute the reconstruction error on the test data, over the whole year of records, to adjust the predictions and detect any anomalies.

The Data

The available data consists of one year of records and three features: the number of orders placed, the number of visits and the number of visitors. The dataset also contains an additional time feature expressed in calendar days.

The main challenge in this use case is that one feature can only be used to predict another if the two are tightly dependent, regardless of the quality of the records in the target variable. Here, we chose to use the number of visits to predict the number of orders placed indirectly, since the orders feature has many null values that push the time series to extremes and would not allow a reliable prediction.

Pearson correlation

To check the dependency between variables, we computed the Pearson correlation coefficient. We set a threshold of 0.80 (80%): a coefficient above it indicates that the variables are tightly dependent, which is the case for the two variables in question (see Fig. 2). The correlation between the “Visitors” and “Number of orders placed” features is 0.85 (85%), which means that we could use one of them to predict the other.
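
The check can be sketched in a few lines of NumPy. This is a minimal illustration, not the project code: the two series below are synthetic stand-ins for the real features, and the names pearson_r and THRESHOLD are hypothetical.

```python
import numpy as np

THRESHOLD = 0.80  # dependency threshold quoted in the text


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long 1-D series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Synthetic one-year series, for illustration only (not the customer data).
rng = np.random.default_rng(0)
visits = rng.integers(200, 1000, size=365).astype(float)
orders = 0.05 * visits + rng.normal(0.0, 5.0, size=365)

r = pearson_r(visits, orders)
is_dependent = r > THRESHOLD
print(f"r = {r:.2f}, tightly dependent: {is_dependent}")
```

The same value is returned by np.corrcoef(visits, orders)[0, 1]; the explicit formula is spelled out here only to show what is being computed.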

Preprocessing

Train and Test split

Train set: We give the model a set of observations from which to learn the patterns that it will later have to predict in the test phase. In this use case, we used 91% of the available data (roughly 332 data points).
Test set: This step is crucial, since it lets us measure how well the model predicts: we have it make new predictions on the rest of the data and assess the metrics. We held out the remaining 9% of the observations, roughly 33 data points.
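
For a time series the split is normally chronological: the first part of the year is used for training and the last part for testing, without shuffling. A minimal sketch, assuming 365 daily records and a hypothetical helper named chronological_split:

```python
import numpy as np


def chronological_split(values, train_ratio=0.91):
    """Split a series in time order: first part for training, rest for testing."""
    values = np.asarray(values)
    split_at = int(round(len(values) * train_ratio))
    return values[:split_at], values[split_at:]


# Placeholder for one year of daily records (assumption: 365 rows).
series = np.arange(365)

train, test = chronological_split(series)
print(len(train), len(test))  # 332 33
```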

Robust scaler
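
Robust scaling rescales each feature so that outliers weigh less than they would under min-max or standard scaling. The sketch below assumes the usual definition (subtract the median, divide by the interquartile range), which is also the default behaviour of scikit-learn's RobustScaler, and assumes that the statistics are fitted on the training set only. The helper names and the sample values are hypothetical.

```python
import numpy as np


def fit_robust(train):
    """Return the per-feature median and interquartile range of the training data."""
    train = np.asarray(train, dtype=float)
    median = np.median(train, axis=0)
    q25, q75 = np.percentile(train, [25, 75], axis=0)
    iqr = q75 - q25
    iqr = np.where(iqr == 0, 1.0, iqr)  # avoid dividing by zero on constant features
    return median, iqr


def apply_robust(values, median, iqr):
    """Scale values with statistics that were fitted on the training data."""
    return (np.asarray(values, dtype=float) - median) / iqr


# Made-up values with one outlier, shaped (samples, features).
train = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])
test = np.array([[14.0]])

median, iqr = fit_robust(train)
scaled_train = apply_robust(train, median, iqr)
scaled_test = apply_robust(test, median, iqr)  # reuses the training statistics
print(scaled_train.ravel(), scaled_test.ravel())
```

Fitting on the training set and reusing the same median and range on the test set keeps information about the test period from leaking into the preprocessing.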

Create sequences

We need to pass the data to the model in a format it can understand. We therefore convert the data into sequences, stored as 3-dimensional arrays, with time steps of 30 days. Sequences are central to LSTM modelling: each one is a window of consecutive observations taken from the data, which allows the LSTM cell to retain the necessary and representative information over that window.
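
A minimal sketch of the windowing, assuming each sequence holds 30 consecutive days of the input feature and is paired with the target value of the following day. The function name create_sequences and the placeholder arrays are hypothetical.

```python
import numpy as np

TIME_STEPS = 30  # window length in days, as stated in the text


def create_sequences(features, target, time_steps=TIME_STEPS):
    """Build overlapping windows shaped (samples, time_steps, n_features)."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    if features.ndim == 1:
        features = features[:, None]  # single feature -> one column
    X, y = [], []
    for start in range(len(features) - time_steps):
        X.append(features[start:start + time_steps])  # 30 consecutive days
        y.append(target[start + time_steps])          # value of the next day
    return np.stack(X), np.array(y)


# Placeholder training data: 332 days of one input feature and one target.
visits = np.arange(332, dtype=float)
orders = np.arange(332, dtype=float) * 0.05

X_train, y_train = create_sequences(visits, orders)
print(X_train.shape, y_train.shape)  # (302, 30, 1) (302,)
```

The three dimensions are the number of sequences, the number of time steps per sequence, and the number of features per time step, which is the layout recurrent layers in Keras expect.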

The model

We chose the LSTM autoencoder for its efficiency and its ability to learn to reconstruct its own input during training (further reading: https://towardsdatascience.com/using-lstm-autoencoders-on-multidimensional-time-series-data-f5a7a51b29a1).

Build the Sequential model

We declare the architecture of the LSTM autoencoder model and feed it the sequences, using 30 days as time steps.
Next, we add a Bidirectional layer, which receives the training sequences.
A dropout layer then randomly drops a share of the units during training, which controls how aggressively the model discards information and limits overfitting.
Finally, an output layer produces the prediction.
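
One way to read these four steps in Keras is sketched below. The original configuration is not given, so the layer width (64 units), the dropout rate (0.2), the single input feature, the Adam optimizer and the mean-absolute-error loss are all assumptions, and build_model is a hypothetical helper name.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIME_STEPS = 30  # days per sequence, as stated in the text
N_FEATURES = 1   # assumption: a single input feature


def build_model(time_steps=TIME_STEPS, n_features=N_FEATURES,
                units=64, dropout=0.2):
    """Sequential model: bidirectional LSTM, dropout, dense output."""
    model = keras.Sequential([
        keras.Input(shape=(time_steps, n_features)),
        layers.Bidirectional(layers.LSTM(units)),  # reads each window in both directions
        layers.Dropout(dropout),                   # randomly zeroes units while training
        layers.Dense(1),                           # one predicted value per sequence
    ])
    model.compile(optimizer="adam", loss="mae")    # assumed optimizer and loss
    return model


model = build_model()
model.summary()
```

The Bidirectional wrapper runs one LSTM forward and one backward over each 30-day window and concatenates their outputs, so the dense layer sees 128 values per sequence in this configuration.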

Train the model

The model was trained for 40 epochs without early stopping. The loss function reached a minimum of 0.54, which we took as a good indicator for accurate predictions (Fig. 3).

Forecast the time series

The red line is the predicted time series and the blue line represents the actual data; the LSTM autoencoder follows the actual series closely. As expected, the model could not capture extreme values, owing to the small size of the training data.
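
The gap between the two lines can be turned into numbers, and the days on which the model misses by an unusual margin can be flagged as anomalies. The sketch below is only one possible rule: the threshold (mean plus three standard deviations of the training errors) is an assumption, the values are made up, and flag_anomalies is a hypothetical helper name.

```python
import numpy as np


def flag_anomalies(actual, predicted, train_errors, n_sigmas=3.0):
    """Return absolute errors, the threshold, and a mask of anomalous points."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = np.abs(actual - predicted)
    # Assumed rule: anything above mean + n_sigmas * std of the training errors.
    threshold = np.mean(train_errors) + n_sigmas * np.std(train_errors)
    return errors, threshold, errors > threshold


# Made-up numbers, for illustration only.
train_errors = np.array([0.4, 0.5, 0.6, 0.5, 0.5])
actual = np.array([10.0, 11.0, 30.0, 12.0])
predicted = np.array([10.5, 10.6, 11.0, 12.3])

errors, threshold, is_anomaly = flag_anomalies(actual, predicted, train_errors)
mae = errors.mean()  # mean absolute error over the evaluated period
print(f"MAE = {mae:.2f}, threshold = {threshold:.2f}, anomalies at {np.flatnonzero(is_anomaly)}")
```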

Perspectives

This type of model is hungry for data. We therefore strive every day to gather as much input as we can, and we work with our clients so that they provide the data we need to feed the architecture. Getting more input progressively helps us build a common understanding of the data, set up appropriate preprocessing and data engineering frameworks, and grasp business needs for better decision making.