Finally, a good start for foundational models for time series?

Plot twist: It is Chronos by AWS Supply Chain Optimization Technology (SCOT).


NeurIPS 2023 saw a proliferation of papers on the applicability of LLMs in time series forecasting. Some of the papers were so bad that I seriously started (and continue) to question the NeurIPS review process.

Seeing that trend, while also watching the rise of S4 and Mamba, which are inherently better suited to continuous sequence modeling (which, by the way, represents the absolute majority of time series tasks), I started to think that we probably could not harness the crazy good sequence modeling capacity of attention and the so-called “emergent” zero-shot capability of large language models for time series. After all, transformer-based models (Informer, FEDformer) were regularly beaten by linear models from the ’00s. Then AWS SCOT drops Chronos.

Is Chronos a groundbreaking paper? No. Is it the absolute best off-the-shelf forecasting algorithm out there? I strongly doubt it. It is actually a blissfully simple approach, and that simplicity is its inherent elegance. I suspect that many people around the world thought of this exact approach but could not proceed to experimentation due to infrastructure limitations. This is just a hypothesis, though, since we are talking about the most challenging part of foundational models: pre-training.

Another hypothesis: due to the fixed-vocabulary tokenization process, the model’s ability to capture trends can be noticeably compromised (in fact, the paper admits this; see the experiments focusing on exponential trends). This weakness is shared by Gradient Boosted models as well (LightGBM, XGBoost, or CatBoost). But in the case of Gradient Boosted models, it can be fixed with proper feature engineering or by augmenting the pipeline with pre/post-processing stages.
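To make this concrete, here is a minimal sketch of fixed-vocabulary tokenization in the spirit of Chronos: mean scaling followed by uniform binning. The bin count, clipping range, and scaling choice below are my own illustrative assumptions, not the paper's exact settings; the point is simply that once a trend pushes values past the fixed bin edges, everything collapses into the extreme tokens.

```python
import numpy as np

def tokenize(series: np.ndarray, n_bins: int = 16, limit: float = 3.0):
    """Toy fixed-vocabulary tokenizer: mean-scale, then uniformly bin.

    Values that fall outside [-limit, limit] after scaling are clipped to
    the extreme tokens -- this is where trend information starts to erode.
    """
    scale = np.abs(series).mean() + 1e-8        # mean scaling (illustrative choice)
    scaled = series / scale
    edges = np.linspace(-limit, limit, n_bins + 1)
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    return tokens, scale, edges

def detokenize(tokens, scale, edges):
    centers = (edges[:-1] + edges[1:]) / 2      # map each token back to its bin center
    return centers[tokens] * scale

# A strong exponential trend quickly escapes the fixed bin range:
t = np.arange(64)
series = np.exp(0.1 * t)
tokens, scale, edges = tokenize(series)
recon = detokenize(tokens, scale, edges)
print(tokens[-8:])                # the last steps all collapse into the top bin
print(series[-3:], recon[-3:])    # reconstruction saturates; the trend is lost
```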

In general, Chronos can be thought of as a natively probabilistic pattern discovery/generation machine for timeseries applications. Note how I suddenly pivoted from using the term “forecasting” to “pattern discovery”. It was intentional.
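To show what “natively probabilistic” buys you in practice, here is a hedged sketch using the chronos-forecasting package's ChronosPipeline; the checkpoint name and arguments follow the project README and may differ by version. The model emits sample paths, and prediction intervals are just quantiles over those paths.

```python
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

# Load a small checkpoint; model ID and arguments per the project README (may change).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

context = torch.sin(torch.arange(120) * 2 * torch.pi / 24)   # toy seasonal context
samples = pipeline.predict(context, prediction_length=24)    # [series, samples, horizon]

# No separate "probabilistic head" needed: the model generates sample paths,
# and prediction intervals fall out of simple quantiles over those paths.
low, median, high = torch.quantile(
    samples[0].float(), torch.tensor([0.1, 0.5, 0.9]), dim=0
)
```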

I am not going to explain how Chronos works; the paper is very well written and very accessible. Instead, I want to point out the “First Principle” thinking that backed the narrative of the work. I believe this kind of thinking is absolutely necessary for foundational model research in general.


Why use a model with tens to hundreds of billions of parameters that barely outperforms a three-parameter statistical model or slightly more complex Gradient Boosted models?

Let’s pause for a moment and ponder that.

Can we always justify the inference cost/latency for a couple of percentage points of accuracy gain, given the realities on the ground? Moreover, forecasts are often (if not always) used for downstream decision-making, where subsequent steps dilute the utility of a couple of percentage points of accuracy gain. So for foundational models to leave a solid footprint, we need a strong and consistent gain over much, much simpler models. Otherwise, why bother training or using models that are several orders of magnitude larger than ones already doing a reasonable job? For NLP applications, it is an entirely different game, because pre-transformer models (like Hidden Markov Models for speech processing) were not nearly as good. For time series, however, we have yet to conclusively beat ultra-fast, low-latency statistical/ML models with pre-trained foundation models.
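To ground the comparison, this is roughly what one of those “handful-of-parameters” baselines looks like: a sketch with statsmodels' damped Holt method, which fits roughly three parameters (level smoothing, trend smoothing, damping) and runs in milliseconds. The data here is synthetic and purely illustrative.

```python
import numpy as np
from statsmodels.tsa.holtwinters import Holt

rng = np.random.default_rng(0)
y = 50 + 0.3 * np.arange(200) + rng.normal(scale=2.0, size=200)  # toy trending series

fit = Holt(y, damped_trend=True, initialization_method="estimated").fit()
forecast = fit.forecast(24)   # 24-step-ahead point forecast, near-instant
print(fit.params)             # the fitted smoothing/damping parameters -- that's the whole model
```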

Having said that, this, to me, is the first sign of first-principles thinking in the Chronos paper:

…We did not explore larger models due to slow inference times which would render them impractical for real-world applications

Finally, someone f*king said it and acted on it. The largest Chronos model is a T5-based 710M-parameter model.

The second point is about time series data augmentation. In my humble opinion, this is the biggest value proposition of the Chronos paper. It’s entirely possible that a corpus of all the time series data out there could dwarf the Common Crawl corpora. However, what is actually publicly available pales in comparison to publicly available text data, for several reasons. Businesses don’t want to expose their day-to-day transaction data to the outside world, to protect their market capitalization. High-frequency stock market data is usually proprietary, and the publicly available stock movement time series have usually undergone such a level of arbitrage that they can be treated as a random walk.

Chronos proposes two augmentation techniques: TSMix and KernelSynth. TSMix builds on an old, intuitive idea. KernelSynth, however, is a really interesting addition. It is based on the work on structure discovery via kernel search, but flipped upside down. To elaborate: it maintains a bank of Gaussian process kernels, samples randomly from it, composes the sampled kernels with binary operators (addition and multiplication), and then draws synthetic time series from the resulting GP prior. Note that TSMix is limited by the availability of real, non-synthetic time series. But with KernelSynth, it’s the wild west; there are no limitations.
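Here is a minimal sketch of both ideas, under my own assumptions: the kernel bank, hyperparameters, and mixing scheme below are illustrative rather than the paper's exact recipe. TSMix takes a convex combination of scaled series, while KernelSynth composes random kernels and samples from the resulting Gaussian process prior (using scikit-learn's GP utilities here).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, DotProduct, WhiteKernel

def tsmix(series_list, rng=None):
    """TSMix-style augmentation (sketch): convex combination of scaled series."""
    rng = np.random.default_rng(rng)
    weights = rng.dirichlet(np.ones(len(series_list)))          # convex mixing weights
    scaled = [s / (np.abs(s).mean() + 1e-8) for s in series_list]
    return sum(w * s for w, s in zip(weights, scaled))

# An illustrative kernel bank: smooth variation, periodicity, linear trend, noise.
KERNEL_BANK = [
    RBF(length_scale=10.0),
    ExpSineSquared(length_scale=1.0, periodicity=24.0),          # e.g. "daily" seasonality
    DotProduct(sigma_0=0.1),                                     # linear-trend-like component
    WhiteKernel(noise_level=0.05),
]

def kernelsynth_sample(length=256, max_kernels=4, rng=None):
    """KernelSynth-style generation (sketch): compose random kernels with + / *
    and draw one series from the resulting Gaussian process prior."""
    rng = np.random.default_rng(rng)
    kernel = KERNEL_BANK[rng.integers(len(KERNEL_BANK))]
    for _ in range(rng.integers(0, max_kernels)):
        other = KERNEL_BANK[rng.integers(len(KERNEL_BANK))]
        kernel = kernel + other if rng.random() < 0.5 else kernel * other
    X = np.arange(length, dtype=float).reshape(-1, 1)
    gp = GaussianProcessRegressor(kernel=kernel)
    # sample_y on an unfitted regressor draws from the prior defined by the kernel
    return gp.sample_y(X, n_samples=1, random_state=int(rng.integers(2**31 - 1))).ravel()

synthetic = [kernelsynth_sample(rng=i) for i in range(3)]        # unlimited synthetic series
mixed = tsmix(synthetic)                                         # TSMix normally mixes *real* series
```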


Why I think foundational model research in time series forecasting is important

I believe foundational model research in time series forecasting is crucial. I often draw an interesting analogy between Neural Machine Translation (NMT) during the pre-transformer era and the current time series forecasting pipeline. Before transformers, NMT relied on statistical learning techniques that involved numerous subcomponents, most of which required tuning. Transformers, aside from their state-of-the-art results at the time, greatly simplified this pipeline with a single architecture—a massive value proposition from a production perspective.

Foundation models for time series need to embody a similar vision. The time series forecasting pipeline can be quite messy, often requiring the alignment of data with different granularities and modalities. Model selection remains a nuanced debate, dating back to 1974 with Kahneman and Tversky’s seminal work on forecast selection by humans.

Simplifying this pipeline alone offers a massive value proposition.
