So I Explored Forecasting Metrics... Now I Want Your Two Cents πŸ’­

Retiago Drago Β· Aug 27 '23 Β· Dev Community

Introduction 🌟

Diving into the world of regression and forecasting metrics can be a real head-scratcher, especially if you're a newcomer. Trust me, I've been thereβ€”haphazardly applying every popular metric scholars swear by, only to be left puzzled by the results.

Ever wondered why your MAE, MSE, and RMSE values look stellar, but your MAPE is through the roof? Yep, me too.

That's why I set out on this journey to create an experimental notebook, aiming to demystify how different metrics actually behave.

The Objective 🎯

The goal of this notebook isn't to find the "one metric to rule them all" for a specific dataset. Instead, I want to understand how various metrics respond to controlled conditions in both dataset and model. Think of this as a comparative study, a sort of "Metrics 101" through the lens of someone who's still got that new-car smell in the field. This way, when I'm plunged into real-world scenarios, I'll have a better grip on interpreting my metrics.

Metrics Investigated πŸ”

To get a comprehensive view, I've opted to explore a selection of metrics that are commonly leveraged in regression and forecasting problems. Here's the lineup (with a small code sketch of all of them right after the list):

  1. Mean Absolute Error (MAE):

    • Definition: It measures the average magnitude of the errors between predicted and observed values.
    • Formula:
      MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|
      where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of observations.
  2. Mean Squared Error (MSE):

    • Definition: It measures the average of the squares of the errors between predicted and observed values. It gives more weight to large errors.
    • Formula:
      MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  3. Root Mean Squared Error (RMSE):

    • Definition: It represents the sample standard deviation of the differences between predicted and observed values. It's the square root of MSE.
    • Formula:
      RMSE = \sqrt{MSE}
  4. Mean Absolute Percentage Error (MAPE):

    • Definition: It measures the average of the absolute percentage errors between predicted and observed values.
    • Formula:
      MAPE = \frac{100}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
      Note: MAPE can be problematic if the actual value y_i is zero for some observations.
  5. Mean Absolute Scaled Error (MASE):

    • Definition: It measures the accuracy of forecasts relative to a naive baseline method. If MASE is lower than 1, the forecast is better than the naive forecast.
    • Formula:
      MASE = \frac{\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|}{\frac{1}{n-1}\sum_{i=2}^{n} |y_i - y_{i-1}|}
  6. R-squared (Coefficient of Determination):

    • Definition: It indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
    • Formula:
      R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
      where \bar{y} is the mean of the observed data.
  7. Symmetric Mean Absolute Percentage Error (sMAPE):

    • Definition: It's a variation of MAPE that addresses some of its issues, especially when the actual value is zero.
    • Formula:
      sMAPE = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2}
  8. Mean Bias Deviation (MBD):

    • Definition: It calculates the average percentage bias in the predicted values.
    • Formula:
      MBD = \frac{100}{n}\sum_{i=1}^{n} \frac{y_i - \hat{y}_i}{y_i}
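
To make these definitions concrete, here's a minimal NumPy sketch of all eight metrics. The function names are mine, and the MASE denominator is computed on the same series as in the formula above (in practice it's often computed on the training data), so treat this as an illustration rather than a reference implementation:

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def mape(y, y_hat):
    # Breaks down (division by zero) whenever an actual value is 0
    return 100 * np.mean(np.abs((y - y_hat) / y))

def mase(y, y_hat):
    # Denominator: mean absolute error of the one-step naive forecast
    naive_mae = np.mean(np.abs(np.diff(y)))
    return np.mean(np.abs(y - y_hat)) / naive_mae

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def smape(y, y_hat):
    return 100 * np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2))

def mbd(y, y_hat):
    # Positive when the model underestimates on average, negative when it overestimates
    return 100 * np.mean((y - y_hat) / y)
```

Each function expects y and y_hat as same-length NumPy arrays.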

Severity and Directional Emojis πŸ”₯πŸ‘‰

Let's face it, numbers alone can be dry, and if you're like me, you might crave a more visceral sense of how well your model is doing. Enter severity and directional emojis. These little symbols provide a quick visual cue for interpreting metric results, ranging from "you're nailing it" to "back to the drawing board."

Disclaimer: Keep in mind that these categorizations are user-defined and could vary depending on the context in which you're working.

Standard Error Metrics (MAE, MSE, RMSE) Categorization πŸ“Š

To clarify, the concept of Normalized Error Range stems from dividing the error by the range (max - min) of the training data.

| Category | Normalized Error Range |
| --- | --- |
| Perfect | Exactly 0 |
| Very Acceptable | 0 < x ≀ 0.05 |
| Acceptable | 0.05 < x ≀ 0.1 |
| Moderate | 0.1 < x ≀ 0.2 |
| High | 0.2 < x ≀ 0.3 |
| Very High | 0.3 < x ≀ 1 |
| Exceedingly High | x > 1 |
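
As a quick illustration of that normalization step, here's roughly how I compute the normalized error and bucket it; the thresholds mirror the table above, and the helper names are my own:

```python
import numpy as np

def normalized_error(error, y_train):
    """Scale an absolute error (e.g. MAE or RMSE) by the training data range."""
    return error / (np.max(y_train) - np.min(y_train))

def categorize_normalized_error(x):
    """Bucket a normalized error into the severity categories above."""
    if x == 0:
        return "Perfect"
    if x <= 0.05:
        return "Very Acceptable"
    if x <= 0.1:
        return "Acceptable"
    if x <= 0.2:
        return "Moderate"
    if x <= 0.3:
        return "High"
    if x <= 1:
        return "Very High"
    return "Exceedingly High"
```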

Percentage Error (MAPE, sMAPE, MBDev) Categorization πŸ“‰

| Category | Error Magnitude (%) | Direction |
| --- | --- | --- |
| Perfect | Exactly 0% | - |
| Very Acceptable | 0 < x ≀ 5 | Over/Under |
| Acceptable | 5 < x ≀ 10 | Over/Under |
| Moderate | 10 < x ≀ 20 | Over/Under |
| High | 20 < x ≀ 30 | Over/Under |
| Very High | 30 < x ≀ 100 | Over/Under |
| Exceedingly High | x > 100 | Over/Under |

R2 Score Categorization πŸ“ˆ

| Category | R2 Value Range |
| --- | --- |
| Perfect | Exactly 1 |
| Very Acceptable | 0.95 ≀ x < 1 |
| Acceptable | 0.9 ≀ x < 0.95 |
| Moderate | 0.8 ≀ x < 0.9 |
| High | 0.7 ≀ x < 0.8 |
| Very High | 0.5 ≀ x < 0.7 |
| Exceedingly High | 0 < x < 0.5 |
| Doesn't Explain Variability | Exactly 0 |
| Worse Than Simple Mean Model | x < 0 |

MASE Categorization πŸ“‹

| Category | MASE Value Range |
| --- | --- |
| Perfect | Exactly 0 |
| Very Acceptable | 0 < x ≀ 0.1 |
| Acceptable | 0.1 < x ≀ 0.5 |
| Moderate | 0.5 < x ≀ 0.9 |
| High | 0.9 < x < 1 |
| Equivalent to Naive Model | Exactly 1 |
| Worse Than Naive Forecast Model | x > 1 |

Severity Emojis 🚨

| Category | Emoji |
| --- | --- |
| Perfect | πŸ’― |
| Very Acceptable | πŸ‘Œ |
| Acceptable | βœ”οΈ |
| Moderate | ❗ |
| High | ❌ |
| Very High | πŸ’€ |
| Exceedingly High | ☠ |
| Doesn't Explain Variability | 🚫 |
| Worse Than Simple Mean Model | πŸ›‘ |
| Equivalent to Naive Model | βš– |
| Worse Than Naive Forecast Model | 🀬 |

Directional Emojis ➑️

| Direction | Emoji |
| --- | --- |
| Overestimation | πŸ“ˆ |
| Underestimation | πŸ“‰ |
| NaN / None | πŸ™…β€β™‚οΈ |
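
Tying the severity and directional tables together, here's roughly how I'd annotate a signed percentage metric like MBD, where (per its formula) a negative value means overestimation and a positive value means underestimation. The function and the dictionaries are my own illustrative sketch, not the exact code from the notebook:

```python
import math

SEVERITY = {
    "Perfect": "πŸ’―", "Very Acceptable": "πŸ‘Œ", "Acceptable": "βœ”οΈ", "Moderate": "❗",
    "High": "❌", "Very High": "πŸ’€", "Exceedingly High": "☠",
}
DIRECTION = {"over": "πŸ“ˆ", "under": "πŸ“‰"}

def percentage_error_emoji(value):
    """Turn a signed percentage error (e.g. MBD) into severity + direction emojis."""
    if value is None or math.isnan(value):
        return "πŸ™…β€β™‚οΈ"  # NaN / None
    magnitude = abs(value)
    if magnitude == 0:
        return SEVERITY["Perfect"]
    elif magnitude <= 5:
        category = "Very Acceptable"
    elif magnitude <= 10:
        category = "Acceptable"
    elif magnitude <= 20:
        category = "Moderate"
    elif magnitude <= 30:
        category = "High"
    elif magnitude <= 100:
        category = "Very High"
    else:
        category = "Exceedingly High"
    # Negative MBD means predictions exceed actuals on average (overestimation)
    direction = DIRECTION["over"] if value < 0 else DIRECTION["under"]
    return SEVERITY[category] + direction
```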

Methodology πŸ“š

For this experiment, I synthesized datasets from mathematical functions like sine and cosine, which offer a controlled level of predictability. On the modeling end, I used statsmodels.tsa.ar_model.AutoReg alongside a custom OffsetModel. I chose AutoReg for its foundational role in time series forecasting, while OffsetModel mimics good performance by simply shifting the test data. This entire endeavor is laser-focused on forecasting problems; as far as I understand, every forecasting problem is essentially a regression problem, but not the other way around.
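
To give a flavor of the setup, here's a condensed sketch: a synthetic sine series, an AutoReg fit from statsmodels, and a stand-in for the OffsetModel idea (simply returning the test values shifted by a small constant). The lag order, offset size, and split point are placeholders rather than the exact values from my notebook:

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Synthetic, perfectly periodic series
x = np.linspace(0, 20 * np.pi, 500)
y = np.sin(x)

# Simple chronological train/test split
split = 400
train, test = y[:split], y[split:]

# AutoReg baseline; the lag order here is a placeholder
res = AutoReg(train, lags=10).fit()
pred_ar = res.predict(start=split, end=len(y) - 1)

# Stand-in for the OffsetModel: "predict" the test data shifted by a constant,
# which mimics a model that tracks the shape but carries a small bias
pred_offset = test + 0.05

print("AutoReg MAE:", np.mean(np.abs(test - pred_ar)))
print("Offset MAE: ", np.mean(np.abs(test - pred_offset)))
```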

Highlights of Findings ✨

To navigate the labyrinth of metrics, I've laid out my explorations in a tree graph, which you can check out below:

*[Tree graph: metrics exploration]*

The table here provides just a glimpse into the first phase of my deep dive into metrics. For those hungry for the full rundown, it's available right here. Click πŸ“Š for the plot.

| Plot | Based on | Variant | Dataset | Model | R2 | MAE | MSE | RMSE | MASE | MAPE | sMAPE | MBDev |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| πŸ“Š | Test Size | Small=1 | cos(x) | AutoReg | πŸ™…β€β™‚οΈ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | πŸ‘Œ | πŸ‘Œ | πŸ‘ŒπŸ“ˆ |
| πŸ“Š | Test Size | Small=1 | cos(x) | OffsetModel | πŸ™…β€β™‚οΈ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | πŸ‘Œ | πŸ‘Œ | πŸ‘ŒπŸ“ˆ |
| πŸ“Š | Test Size | Small=1 | sin(x) | AutoReg | πŸ™…β€β™‚οΈ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | ☠ | ☠ | β˜ πŸ“‰ |
| πŸ“Š | Test Size | Small=1 | sin(x) | OffsetModel | πŸ™…β€β™‚οΈ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | ☠ | ☠ | β˜ πŸ“ˆ |
| πŸ“Š | Test Size | Small=2 | cos(x) | AutoReg | πŸ›‘ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | πŸ‘Œ | πŸ‘Œ | πŸ‘ŒπŸ“ˆ |
| πŸ“Š | Test Size | Small=2 | cos(x) | OffsetModel | πŸ›‘ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | πŸ‘Œ | πŸ‘Œ | πŸ‘ŒπŸ“ˆ |
| πŸ“Š | Test Size | Small=2 | sin(x) | AutoReg | πŸ›‘ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | ☠ | ☠ | β˜ πŸ“‰ |
| πŸ“Š | Test Size | Small=2 | sin(x) | OffsetModel | πŸ›‘ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | ☠ | ☠ | β˜ πŸ“ˆ |
| πŸ“Š | Test Size | Mid | cos(x) | AutoReg | πŸ›‘ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | πŸ‘Œ | πŸ‘Œ | πŸ‘ŒπŸ“ˆ |
| πŸ“Š | Test Size | Mid | cos(x) | OffsetModel | πŸ›‘ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | πŸ‘Œ | πŸ‘Œ | πŸ‘ŒπŸ“ˆ |
| πŸ“Š | Test Size | Mid | sin(x) | AutoReg | πŸ›‘ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | ☠ | πŸ’€ | β˜ πŸ“‰ |
| πŸ“Š | Test Size | Mid | sin(x) | OffsetModel | πŸ›‘ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | ☠ | ☠ | β˜ πŸ“ˆ |
| πŸ“Š | Test Size | Large | cos(x) | AutoReg | πŸ›‘ | ❗ | βœ”οΈ | ❌ | 🀬 | πŸ‘Œ | ❌ | πŸ’€πŸ“ˆ |
| πŸ“Š | Test Size | Large | cos(x) | OffsetModel | ❗ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | πŸ‘Œ | πŸ‘Œ | πŸ‘ŒπŸ“ˆ |
| πŸ“Š | Test Size | Large | sin(x) | AutoReg | πŸ›‘ | ❌ | ❗ | ❌ | 🀬 | ☠ | πŸ’€ | β˜ πŸ“‰ |
| πŸ“Š | Test Size | Large | sin(x) | OffsetModel | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | πŸ‘Œ | 🀬 | ☠ | ❗ | β˜ πŸ“ˆ |

Key Insights πŸ”‘

  1. Inconsistent R2 Scores: Almost all of the AutoReg and OffsetModel experiments yielded R2 scores that were either nonexistent (πŸ™…β€β™‚οΈ) or worse than a simple mean model (πŸ›‘). Only one OffsetModel experiment on a large dataset achieved a "Very Acceptable" R2 score (πŸ‘Œ).

  2. Good Performance on Standard Errors: Across various test sizes and datasets, both AutoReg and OffsetModel generally performed "Very Acceptable" (πŸ‘Œ) in terms of MAE, MSE, and RMSE metrics.

  3. Problematic MASE Scores: Every model configuration led to "Worse Than Naive Forecast Model" (🀬) MASE scores. This suggests that these models might not be better than a simple naive forecast in certain aspects.

  4. Diverse MAPE and sMAPE Responses: The models varied significantly in their MAPE and sMAPE scores, ranging from "Very Acceptable" (πŸ‘Œ) to "Very High" (πŸ’€) and "Exceedingly High" (☠), especially on the sine (sin(x)) datasets.

  5. Bias Direction: The Directional Emojis indicate a tendency for the models to either overestimate (πŸ“ˆ) or underestimate (πŸ“‰) the values. The direction of bias appears consistent within the same dataset but varies between datasets.

  6. Complexity vs. Error: Larger test sizes didn't necessarily yield better error metrics. In fact, some larger test sizes led to "High" (❌) and even "Very High" (πŸ’€) errors, as seen in the last row of the table.

  7. Dataset Sensitivity: The models' performance was noticeably different between the sine (sin(x)) and cosine (cos(x)) datasets, showing that dataset characteristics heavily influence metric values.

  8. Best Scenario: If one had to pick, OffsetModel with a large dataset and the sine function (sin(x)) yielded the most balanced outcome, achieving "Very Acceptable" (πŸ‘Œ) ratings on R2 and the standard error metrics, though MASE (🀬) and the percentage-based metrics still lagged.

  9. Limitations & Risks: It's important to remember that these experiments used synthetic data and specific models; thus, the results may not be universally applicable. Caution should be exercised when generalizing these insights.

Please note that these insights are derived from synthetic data and controlled experiments. They are intended to offer a glimpse into the behavior of different metrics and should be used with caution in practical applications.

Points for Critique πŸ€”

I'm all ears for any constructive feedback on various fronts:

  • Did I get the interpretation of these metrics right?
  • Are there any hidden biases that I might have missed?
  • Is there a more suitable metric that should be on my radar?
  • Did you spot a typo? Yes, those bother me too.

Digging into metrics is a lot like treasure hunting; you don't really know what you've got until you put it under the microscope. That's why I'm so eager to get your feedback. I've listed a few questions above, but let's delve a bit deeper.

  • Interpretation of Metrics: I've given my best shot at understanding these metrics, but it's entirely possible that I've overlooked some nuances. If you think I've missed the mark or if you have a different angle, I'm keen to hear it.
  • Potential Biases: When you're neck-deep in numbers, it's easy to develop tunnel vision and miss out on the bigger picture. Have I fallen into this trap? Your external perspective could provide invaluable insights.
  • Alternative Metrics: While I've focused on some of the most commonly used metrics, the field is vast. If there's a gem I've missed, do let me know. I'm always up for adding another tool to my analytical toolbox.
  • Typos and Errors: Mistakes are the bane of any data scientist's existence, and not just in code. If you've spotted a typo, I'd appreciate the heads up. After all, clarity is key when it comes to complex topics like this.

So, am I on the right track, or is there room for improvement?

Your input could be the missing puzzle piece in my metrics exploration journey.

Conclusion πŸ€βœ…

So there it isβ€”my metric safari in a nutshell. It's been an enlightening experience for me, and I hope it shines some light for you too. I'm still on the learning curve, and I'd love to hear your thoughts. Whether it's a critique or a thumbs-up, all feedback is golden.

If this sparked your curiosity, let's keep the conversation going. Feel free to drop a comment below, write your own post in response, or reach out to me directly at my link. If you'd like to delve deeper, the full summary of my findings is available here. Better yet, why not conduct your own investigations? I'd be thrilled to see where you take it. You can follow my progress and check out my portfolio repository here:

GitHub: ranggakd / DAIly, a bunch of Data Analysis and Artificial Intelligence notebooks I'd worked on almost a daiLY basis.

Let's keep exploring and innovating together!

Check out all the links on my beacons.ai page
