Large-Scale Time Series Forecasting in Apache Spark


Accurately forecasting power demand is important for securing energy supply. Time series forecasting methods and other machine learning algorithms can be used to create energy forecasts. We have developed a forecasting framework based on multi-model approach at customer account level. The framework uses a wide range of algorithms (e.g. GLM, ElasticNet, Seasonal ARIMA-X, Decision Tree, Random Forest and Gradient Boosting Machine). Models are pre-trained on AWS EMR cluster using Spark/SparklyR. The process is run at massively parallel scale (>3000 vCores). Once the model training algorithm has completed, the model objects are persisted on AWS S3 so that they can be reused at a later date. To trigger a forecast, the deploy pipeline will load the pre-trained model object from S3 and create a forecast based on the prevailing inputs. The output is stored as partitioned parquet files on S3, which can be converted into table view through AWS Athena.

Enterprise Application of the R Language

Slides can be downloaded here.