Parallel forecasting with Apache Beam and Prophet

Magnus Friberg
Oct 6, 2021

In an ever-changing business environment, companies need to produce accurate time series forecasts quickly and efficiently. The ability to make accurate and reliable forecasts is a competitive advantage for any company.

Running a single time series model can be difficult enough; running several of them at the same time is a different beast.

In this article, we will pretend that we run a grocery store and we would like to understand how much of each product we will sell the next day.

Because we want to do this regularly, we would like to automate the process, so we decide to create a daily workflow that calculates it for us.

One solution is to create a separate workflow for each product and run them independently. This works, but it is a code management headache: you end up maintaining a workflow for every single product.

Products
├── forecast_product_1.py
├── forecast_product_2.py
├── forecast_product_3.py
├── forecast_product_4.py
├── forecast_product_...py
├── forecast_product_...py
└── forecast_product_1000.py

Another solution is to run everything in one workflow, looping over all products and forecasting them sequentially. However, this is not efficient from a memory or time perspective. Imagine that a single product's forecast takes an average of one minute and that we have 1,000 products: a full run would take roughly 1,000 minutes, or almost 17 hours. As you can imagine, this is not a good solution.

products = [p1, p2, p3, p4, .., pn]
for product in products:
    run_forecast(product)

Finally, we can combine the best of both worlds: a single workflow that forecasts all products in parallel. This is efficient from both a code management and a runtime perspective. To make this happen, we will use Apache Beam and Prophet:

Apache Beam is an open source, unified programming model for defining large-scale, parallel data-processing pipelines.

Prophet is an open source forecasting library built by Facebook and is one of the most popular scientific Python libraries.

Let’s have a look at the code

In our example, we assume that our data lives in BigQuery, but it could of course come from any other storage, or even from memory. The forecast model is defined in the prophet_prediction function, and as you can see, the heavy lifting has already been done by the library itself. Our parallel pipeline logic is defined inside the with beam.Pipeline() statement, where each step is separated by a pipe (|) symbol.
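
A minimal sketch of such a pipeline could look like the following. It assumes a BigQuery table project.dataset.sales with columns product, ds (date), and y (units sold); the table names, query, and output schema here are illustrative, not a production setup.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def prophet_prediction(element):
    # Imports live inside the function so Beam workers can
    # deserialize it without relying on module-level state.
    import pandas as pd
    from prophet import Prophet

    product, rows = element  # the grouped sales history for one product
    history = pd.DataFrame(list(rows))  # columns: ds (date), y (units sold)

    model = Prophet()
    model.fit(history)

    # Forecast one day ahead and keep only that row.
    future = model.make_future_dataframe(periods=1)
    forecast = model.predict(future).tail(1).iloc[0]

    return {
        "product": product,
        "yhat": float(forecast["yhat"]),
        "yhat_lower": float(forecast["yhat_lower"]),
        "yhat_upper": float(forecast["yhat_upper"]),
    }


with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        # Read each product's sales history from BigQuery.
        | "Read sales" >> beam.io.ReadFromBigQuery(
            query="SELECT product, ds, y FROM `project.dataset.sales`",
            use_standard_sql=True)
        # Key every row by product so histories can be grouped.
        | "Key by product" >> beam.Map(
            lambda row: (row["product"], {"ds": row["ds"], "y": row["y"]}))
        | "Group history" >> beam.GroupByKey()
        # Each product's forecast runs as an independent, parallel task.
        | "Forecast in parallel" >> beam.Map(prophet_prediction)
        | "Write forecasts" >> beam.io.WriteToBigQuery(
            "project:dataset.forecasts",
            schema="product:STRING,yhat:FLOAT,yhat_lower:FLOAT,yhat_upper:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )

Because the products are independent of each other, Beam is free to distribute the prophet_prediction calls across however many workers the runner provides, which is exactly what makes the 1,000-products case tractable.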

After the code has run, we have a forecast value, yhat, for each product in our grocery store. We also receive the lower and upper bounds of the uncertainty interval we choose to use (the Prophet library defaults to 80%).

+-----------+------+------------+------------+
| PRODUCT   | YHAT | YHAT_LOWER | YHAT_UPPER |
+-----------+------+------------+------------+
| product 1 |  123 |        100 |        146 |
| product 2 |  345 |        298 |        392 |
| product 3 |   43 |         38 |         48 |
+-----------+------+------------+------------+
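
If you want wider bounds, the interval width is set when the model is constructed. For example, a 95% interval (a hypothetical tweak, not part of the pipeline above) looks like this:

from prophet import Prophet

model = Prophet(interval_width=0.95)  # the default is 0.80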

We can now feel confident that we will order the right amount of each product for the coming days.

Feel free to contact me for tips on how you can put this into production.


Magnus Friberg

I work as lead data engineer at SeenThis. I am passionate about data engineering and data science, and I am a big fan of open source.