quickstart

Date: Mar 07, 2023 Version:

Getting Started

At a minimum, a divina pipeline must know which column represents the time index, which column represents the target, and the frequency of the time index. Below is a minimal example of a pipeline that is trained on, and then used to make in-sample predictions for, the retail sales dataset from the M5 competition.

from divina import Divina

quickstart_pipeline_0 = Divina(
    target="Sales",
    time_index="Date",
    frequency="D",
    drop_features=[
        "Customers",
        "StoreType",
        "Assortment",
        "Promo2SinceWeek",
        "Promo2SinceYear",
        "PromoInterval",
    ],
)

Divina automatically uses every non-target column in the file that is not listed in drop_features as a feature, and the resulting in-sample predictions are reasonably accurate.

Hyperparameter Optimization

Divina supports hyperparameter optimization through grid search. To optimize the parameters of the selected causal model within the pipeline, pass a list of candidate parameter dictionaries through the causal_model_params option.

from divina import Divina

quickstart_pipeline_1 = Divina(
    target="Sales",
    causal_model_params=[{"link_function": "log"}, {"link_function": "None"}],
    time_index="Date",
    frequency="D",
    drop_features=[
        "Customers",
        "StoreType",
        "Assortment",
        "Promo2SinceWeek",
        "Promo2SinceYear",
        "PromoInterval",
    ],
)

It seems clear that the log link function improves the performance of the default, linear, causal model on this sales dataset.

However, the forecast produced is for all stores in aggregate while there are three distinct retail locations in the dataset.
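To make the idea of a grid search over a list of parameter dictionaries concrete, here is a minimal, standard-library sketch. It is not divina's implementation: the toy data, the helpers fit_line, fit_predict and mae, and the use of a least-squares fit on log(y) as a stand-in for a log link are all assumptions made for illustration.

```python
# Illustrative grid search over candidate parameter dictionaries.
# Everything here is hypothetical and independent of divina's internals.
import math

# Toy target that grows multiplicatively over time.
xs = list(range(10))
ys = [100.0 * (1.2 ** x) for x in xs]


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b


def fit_predict(params, x, y):
    """Fit in-sample, regressing on log(y) when the 'log' option is chosen."""
    if params["link_function"] == "log":
        a, b = fit_line(x, [math.log(v) for v in y])
        return [math.exp(a + b * xi) for xi in x]
    a, b = fit_line(x, y)
    return [a + b * xi for xi in x]


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)


# Score every candidate and keep the one with the lowest error.
param_grid = [{"link_function": "log"}, {"link_function": "None"}]
scores = [(mae(ys, fit_predict(p, xs, ys)), p) for p in param_grid]
best_score, best_params = min(scores, key=lambda s: s[0])
print(best_params)
```

On multiplicative data like this toy series, the log candidate wins because the relationship is linear only after the transformation.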

Target Dimensions

Below we use the target_dimensions option to tell divina to individually aggregate and forecast each retail store in the dataset.

from divina import Divina

quickstart_pipeline_2 = Divina(
    target="Sales",
    causal_model_params=[{"link_function": "log"}, {"link_function": "None"}],
    target_dimensions=["Store"],
    time_index="Date",
    frequency="D",
    drop_features=[
        "Customers",
        "StoreType",
        "Assortment",
        "Promo2SinceWeek",
        "Promo2SinceYear",
        "PromoInterval",
    ],
)

The forecast for an individual store is even more accurate than the aggregate forecast, and the interpretability interface shows which information is influencing the forecasts and how.
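The difference between an aggregate forecast and a per-store forecast comes down to the key used to group the target. The sketch below shows that grouping with plain Python; the aggregate helper, the toy rows, and the choice of summation as the aggregation are assumptions, not a description of how divina aggregates internally.

```python
# Illustrative grouping of a target by time index and optional dimensions.
from collections import defaultdict

# Hypothetical rows: two stores, two days, one day with two records.
rows = [
    {"Date": "2014-01-01", "Store": 1, "Sales": 10.0},
    {"Date": "2014-01-01", "Store": 1, "Sales": 5.0},
    {"Date": "2014-01-01", "Store": 2, "Sales": 7.0},
    {"Date": "2014-01-02", "Store": 1, "Sales": 12.0},
    {"Date": "2014-01-02", "Store": 2, "Sales": 9.0},
]


def aggregate(rows, time_index, target, target_dimensions=()):
    """Sum the target per (time index, *target dimensions) key."""
    totals = defaultdict(float)
    for row in rows:
        key = (row[time_index],) + tuple(row[d] for d in target_dimensions)
        totals[key] += row[target]
    return dict(totals)


# Without dimensions there is one series for all stores combined.
overall = aggregate(rows, "Date", "Sales")
# With a "Store" dimension there is one series per store.
per_store = aggregate(rows, "Date", "Sales", ["Store"])
print(overall)
print(per_store)
```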

Time Features

An important part of forecasting, and an important feature of divina, is the ability to derive time-related features from the time index of a dataset. This is handled automatically by setting the time_features attribute of a divina pipeline to True. If only a subset of the time features is needed, the rest can be dropped with the drop_features attribute.

from divina import Divina

quickstart_pipeline_3 = Divina(
    target="Sales",
    causal_model_params=[{"link_function": "log"}, {"link_function": "None"}],
    target_dimensions=["Store"],
    time_index="Date",
    drop_features=[
        "Customers",
        "StoreType",
        "Assortment",
        "Promo2SinceWeek",
        "Promo2SinceYear",
        "PromoInterval",
        "Weekday",
        "Month",
        "Holiday",
        "Year",
    ],
    frequency="D",
    time_features=True,
)

The interpretability interface shows that the new time information is now informing the forecasts.
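For readers unfamiliar with time-derived features, the following sketch shows the general idea using only the standard library. The derive_time_features helper is hypothetical; the names Weekday, Month and Year mirror the feature names dropped in the pipeline above, while WeekOfYear is an extra invented for illustration. The exact set of features divina derives may differ.

```python
# Illustrative derivation of calendar features from a date.
from datetime import date


def derive_time_features(d, drop_features=()):
    """Return calendar features for d, minus any that are dropped."""
    features = {
        "Weekday": d.weekday(),  # Monday is 0, Sunday is 6
        "Month": d.month,
        "Year": d.year,
        "WeekOfYear": d.isocalendar()[1],  # hypothetical extra feature
    }
    return {k: v for k, v in features.items() if k not in drop_features}


all_features = derive_time_features(date(2014, 6, 1))
kept_features = derive_time_features(
    date(2014, 6, 1), drop_features=["Weekday", "Month", "Year"]
)
print(all_features)
print(kept_features)
```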

Feature Engineering

Information encoding, binning and interaction terms are all powerful features of divina that bring its performance in line with that of tree-based models and neural networks. Here the interpretability interface shows us that our newly engineered features are informing the forecasts.

from divina import Divina

quickstart_pipeline_4 = Divina(
    target="Sales",
    causal_model_params=[{"link_function": "log"}, {"link_function": "None"}],
    target_dimensions=["Store"],
    time_index="Date",
    drop_features=[
        "Customers",
        "StoreType",
        "Assortment",
        "Promo2SinceWeek",
        "Promo2SinceYear",
        "PromoInterval",
        "Weekday",
        "Month",
        "Holiday",
        "Year",
    ],
    frequency="D",
    time_features=True,
    encode_features=[
        "Store",
        "Month",
        "StoreType",
        "Weekday",
        "HolidayType",
    ],
    bin_features={"Month": [3, 6, 9]},
    interaction_features={"Store": ["HolidayType"]},
)
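The three transformations configured above can be pictured with a small standalone sketch. The helpers bin_value, one_hot and interact, the category lists, and the convention that a value equal to an edge falls into the higher bin are assumptions for illustration only; divina's own encoding and binning rules may differ.

```python
# Illustrative binning, one-hot encoding and interaction terms.
from bisect import bisect_right


def bin_value(value, edges):
    """Index of the bin that value falls into, given sorted edges."""
    return bisect_right(edges, value)


def one_hot(value, categories, prefix):
    """One indicator column per category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


def interact(a, b):
    """Pairwise products of two groups of encoded features."""
    return {
        f"{ka}-x-{kb}": va * vb for ka, va in a.items() for kb, vb in b.items()
    }


# Hypothetical row and category values.
row = {"Store": 2, "Month": 7, "HolidayType": "public"}

month_bin = bin_value(row["Month"], [3, 6, 9])
store_enc = one_hot(row["Store"], [1, 2, 3], "Store")
holiday_enc = one_hot(row["HolidayType"], ["none", "public"], "HolidayType")
interactions = interact(store_enc, holiday_enc)

print(month_bin)
print(store_enc)
print({k: v for k, v in interactions.items() if v})
```

The interaction term is non-zero only when both indicators are active, which lets a linear model learn a store-specific holiday effect.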

Cross Validation

While visual inspection is a powerful tool for validating a model, programmatic and distributional validation is provided through the validation_splits option of divina.

from divina import Divina

quickstart_pipeline_5 = Divina(
    target="Sales",
    causal_model_params=[{"link_function": "log"}, {"link_function": "None"}],
    target_dimensions=["Store"],
    time_index="Date",
    frequency="D",
    drop_features=[
        "Customers",
        "StoreType",
        "Assortment",
        "Promo2SinceWeek",
        "Promo2SinceYear",
        "PromoInterval",
        "Weekday",
        "Month",
        "Holiday",
        "Year",
    ],
    time_features=True,
    encode_features=[
        "Store",
        "Month",
        "StoreType",
        "Weekday",
        "HolidayType",
    ],
    bin_features={"Month": [3, 6, 9]},
    interaction_features={"Store": ["HolidayType"]},
    validation_splits=["2014-06-01"],
)

The resulting validation split metrics can be accessed through the PipelineFitResult object that is returned from pipeline.fit().
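A time-based validation split trains on observations before a cutoff date and scores on the observations that follow. The sketch below illustrates that idea with a trivial mean baseline. The split_by_time helper, the toy rows, the baseline model, and the choice to place the cutoff date itself in the validation set are assumptions; this does not show the contents of PipelineFitResult.

```python
# Illustrative time-based validation split with a mean baseline.
from datetime import date

# Hypothetical daily sales around the split date.
rows = [
    {"Date": date(2014, 5, 30), "Sales": 10.0},
    {"Date": date(2014, 5, 31), "Sales": 12.0},
    {"Date": date(2014, 6, 1), "Sales": 11.0},
    {"Date": date(2014, 6, 2), "Sales": 15.0},
]


def split_by_time(rows, time_index, split):
    """Rows strictly before the cutoff train; the rest validate."""
    cutoff = date.fromisoformat(split)
    train = [r for r in rows if r[time_index] < cutoff]
    valid = [r for r in rows if r[time_index] >= cutoff]
    return train, valid


train, valid = split_by_time(rows, "Date", "2014-06-01")

# Fit on the training period only, then score on the held-out period.
baseline = sum(r["Sales"] for r in train) / len(train)
validation_mae = sum(abs(r["Sales"] - baseline) for r in valid) / len(valid)
print(validation_mae)
```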

Confidence Intervals

Confidence intervals provide important insight into how sure divina is of its predictions, further allowing high-quality decisions to be made on top of them. Below we add confidence intervals to the forecasts via the confidence_intervals option.

from divina import Divina

quickstart_pipeline_7 = Divina(
    target="Sales",
    causal_model_params=[{"link_function": "log"}, {"link_function": "None"}],
    target_dimensions=["Store"],
    time_index="Date",
    frequency="D",
    drop_features=[
        "Customers",
        "StoreType",
        "Assortment",
        "Promo2SinceWeek",
        "Promo2SinceYear",
        "PromoInterval",
        "Weekday",
        "Month",
        "Holiday",
        "Year",
    ],
    time_features=True,
    encode_features=["Store", "Month", "StoreType", "Weekday", "HolidayType"],
    bin_features={"Month": [3, 6, 9]},
    validation_splits=["2014-06-01"],
    confidence_intervals=[0, 100],
    bootstrap_sample=5,
)
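One common way to obtain such intervals is to refit a model on bootstrap resamples and read percentiles off the resulting predictions. The sketch below follows that approach with a mean model. It assumes, without confirmation from divina's documentation, that confidence_intervals lists percentiles and that bootstrap_sample is the number of resamples; the percentile and bootstrap_interval helpers and the data are hypothetical.

```python
# Illustrative bootstrap percentile interval for a mean prediction.
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile of sorted values, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def bootstrap_interval(values, lower_q, upper_q, bootstrap_sample, seed=0):
    """Resample with replacement, predict the mean, take percentiles."""
    rng = random.Random(seed)
    preds = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(bootstrap_sample)
    )
    return percentile(preds, lower_q), percentile(preds, upper_q)


# Hypothetical sales history.
sales = [100.0, 120.0, 90.0, 110.0, 105.0, 95.0, 130.0]

# Percentiles 0 and 100 are the min and max of the bootstrap predictions.
lower, upper = bootstrap_interval(sales, 0, 100, bootstrap_sample=5)
print(lower, upper)
```

With only five resamples the band is coarse; a larger number of resamples gives smoother percentile estimates at higher compute cost.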

Endogenous Boosting

Divina can boost the residuals of the causal piece of the ensemble, which improves forecast quality and reduces the amount of manual tuning required. The default boosting model, an exponentially weighted moving average, makes small changes to the forecasts using the information available at the specified time horizons.

from divina import Divina

quickstart_pipeline_8 = Divina(
    target="Sales",
    causal_model_params=[{"link_function": "log"}, {"link_function": "None"}],
    target_dimensions=["Store"],
    time_index="Date",
    frequency="D",
    drop_features=[
        "Customers",
        "StoreType",
        "Assortment",
        "Promo2SinceWeek",
        "Promo2SinceYear",
        "PromoInterval",
        "Weekday",
        "Month",
        "Holiday",
        "Year",
    ],
    time_features=True,
    encode_features=["Store", "Month", "StoreType", "Weekday", "HolidayType"],
    bin_features={"Month": [3, 6, 9]},
    validation_splits=["2014-06-01"],
    confidence_intervals=[0, 100],
    bootstrap_sample=5,
    time_horizons=[7, 14, 21],
    boost_window=7,
)
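Residual boosting with an exponentially weighted moving average can be sketched as follows: at each time step, the residuals that would already be known at the given horizon are smoothed and added to the causal forecast. The ewma and boost helpers, the span-style smoothing factor, and the toy series are assumptions; divina's own boosting model may be parameterized differently.

```python
# Illustrative residual boosting with an exponentially weighted average.


def ewma(values, window):
    """Exponentially weighted moving average with a span of window."""
    alpha = 2 / (window + 1)
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg


def boost(actual, causal, horizon, boost_window):
    """Add smoothed, already-observed residuals to the causal forecast."""
    residuals = [a - c for a, c in zip(actual, causal)]
    boosted = []
    for t, base in enumerate(causal):
        last_known = t - horizon  # latest residual observable at forecast time
        if last_known < 0:
            boosted.append(base)  # nothing observed yet, so no correction
            continue
        start = max(0, last_known - boost_window + 1)
        boosted.append(base + ewma(residuals[start : last_known + 1], boost_window))
    return boosted


# Hypothetical series where the causal model is consistently 10 units low.
actual = [110.0] * 10
causal = [100.0] * 10
boosted = boost(actual, causal, horizon=2, boost_window=3)
print(boosted)
```

Longer horizons leave fewer recent residuals available, so the correction reacts more slowly to changes in the error.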