Our Methodology, Fully Transparent
Every dataset on SmartFinanceData is built on a rigorous, reproducible statistical framework. This page explains exactly how we source raw price data, define market events, run probability calculations, and validate results before publishing.
Data Sources
All price data is sourced from institutional-grade providers with full tick history. We use multiple redundant providers and cross-validate OHLC figures before ingestion to ensure data integrity.
Daily OHLC from the interbank spot market. Candles ending at the New York close (5pm New York time) are used as the standard session boundary for all forex pairs.
Cash index OHLC sourced directly from exchange feeds. Adjusted for index rebalancing events. Futures-adjusted series used for extended history where needed.
Volume-weighted aggregated OHLC across major centralised exchanges. UTC midnight used as the daily candle boundary for all crypto assets.
Data provider redundancy: OHLC values are cross-checked across at least two independent sources per instrument. Any discrepancy exceeding 0.05% triggers a manual review before the candle is included in the analysis database.
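As a minimal sketch of the cross-check described above, the snippet below compares one candle from two providers and flags it when any OHLC field differs by more than 0.05%. The `Candle` type and function names are hypothetical, and measuring the difference against the mean of the two quotes is an assumption; the source does not say which denominator is used.

```python
from dataclasses import dataclass

DISCREPANCY_LIMIT = 0.0005  # 0.05%, the review threshold stated in the text


@dataclass(frozen=True)
class Candle:
    """One daily candle from a single provider (hypothetical schema)."""
    open: float
    high: float
    low: float
    close: float


def max_relative_discrepancy(a: Candle, b: Candle) -> float:
    """Largest relative difference across the four OHLC fields."""
    pairs = [(a.open, b.open), (a.high, b.high), (a.low, b.low), (a.close, b.close)]
    # Assumption: the difference is measured against the mean of the two quotes.
    return max(abs(x - y) / ((x + y) / 2) for x, y in pairs)


def needs_manual_review(a: Candle, b: Candle, limit: float = DISCREPANCY_LIMIT) -> bool:
    """True when the two providers disagree by more than the limit."""
    return max_relative_discrepancy(a, b) > limit
```

A candle whose two quotes differ by one pip on a 1.1050 high passes, while a ten-pip difference on the close is sent to review.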
Data Pipeline
Raw price data passes through a six-stage pipeline before it reaches the analysis layer. Each stage has automated quality gates. Any failure halts the pipeline and triggers an alert — no partial or corrupted data is ever published.
Raw OHLC data is pulled from providers via API, converted to a uniform schema (UTC timestamp, Open, High, Low, Close, Volume), and stored in the raw data lake.
Automated checks flag: missing candles, High < Low violations, zero-volume sessions, extreme outliers (>5σ from rolling mean), and timestamp gaps.
For indices and commodities, contract rolls, index rebalances, and split events are identified and back-adjusted using the ratio method to maintain return continuity.
Derived features are computed from clean OHLC: daily range, wick ratios, body-to-range ratio, gap size, session overlaps, ATR-normalised values, and rolling volatility.
Market events (e.g. "trend day", "Asian range break", "monthly high sweep") are classified using deterministic rule sets. Each classification has a documented definition that leaves no ambiguity and uses no look-ahead.
Statistical analysis runs on the classified dataset. Results are written to the publication layer only if they pass all significance thresholds. Outputs include probability tables, confidence intervals, and distribution data.
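Two of the stages above, the automated quality checks and the derived candle features, can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: the `Bar` type, the 20-bar rolling window, the one-calendar-day spacing (which suits a 7-day market such as crypto), and the definition of a wick ratio as wick length divided by range are all hypothetical choices.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, stdev


@dataclass(frozen=True)
class Bar:
    """One candle in the uniform schema (hypothetical type)."""
    ts: datetime
    open: float
    high: float
    low: float
    close: float
    volume: float


def quality_flags(bars, window=20, sigma=5.0, step=timedelta(days=1)):
    """Return {bar index: [flag names]} for every bar that fails a check."""
    flags = {}
    for i, bar in enumerate(bars):
        found = []
        if bar.high < bar.low:
            found.append("high_below_low")
        if bar.volume == 0:
            found.append("zero_volume")
        # Assumption: bars are expected exactly one `step` apart.
        if i > 0 and bar.ts - bars[i - 1].ts > step:
            found.append("timestamp_gap")
        # Outlier check against the rolling mean of the previous `window` closes.
        history = [b.close for b in bars[max(0, i - window):i]]
        if len(history) == window:
            mu, sd = mean(history), stdev(history)
            if sd > 0 and abs(bar.close - mu) > sigma * sd:
                found.append("outlier")
        if found:
            flags[i] = found
    return flags


def shape_features(bar):
    """Range, body-to-range ratio, and wick ratios for one clean bar."""
    rng = bar.high - bar.low
    if rng <= 0:
        return None  # ratios are undefined for a zero or negative range
    upper = bar.high - max(bar.open, bar.close)
    lower = min(bar.open, bar.close) - bar.low
    return {
        "range": rng,
        "body_to_range": abs(bar.close - bar.open) / rng,
        "upper_wick_ratio": upper / rng,
        "lower_wick_ratio": lower / rng,
    }
```

In a pipeline with hard quality gates, a non-empty result from `quality_flags` would halt the run before `shape_features` is ever called.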
Analysis Types
SmartFinanceData publishes six core analysis types. Each answers a specific class of trading question by categorising historical price behaviour into measurable, actionable outcomes.
Measures the probability of consecutive directional closes (up/down streaks). Answers: "After N bullish days, how likely is another bullish close?" Calculated per day-of-week and month.
// Bayesian posterior also computed with Dirichlet prior
Analyses price behaviour within and across trading sessions (Asian, London, New York). Identifies trend day probability given session compression, London breakout continuation rates, and more.
// P(trend day | Asian range ≤ ATR₁₄ × 0.25)
Distributional analysis of daily, weekly, and monthly ranges expressed as multiples of ATR. Used to build probabilistic price targets and assess where the current range sits relative to historical norms.
// Percentile rank and Z-score calculated per day-of-week
Calculates how often price sweeps a prior period high/low (monthly, weekly, previous day) before reversing. Useful for understanding liquidity grab behaviour at key structural levels.
// P(sweep | prior range) segmented by volatility regime
Monthly and quarterly bias analysis using long-run averages. Statistical separation from a random walk is tested per calendar segment. Day-of-week and week-of-month breakdowns are included.
// t-test vs H₀: bias = 0, with Bonferroni correction for 12 months
HTF/LTF break of structure probability after defined price patterns. Includes Outside Day follow-through, Inside Bar resolution direction, and post-consolidation breakout continuation rates.
// P(bullish/bearish resolution | wick dominance ratio)
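To make the streak question ("After N bullish days, how likely is another bullish close?") concrete, here is a small counting sketch. It assumes a streak means at least N consecutive bullish closes, so overlapping windows inside a longer run each count as a trial; the source does not state whether "exactly N" or "at least N" is used, and the function name is hypothetical.

```python
def streak_continuation(closes_up, n):
    """Estimate P(next close is up | previous n closes were all up).

    closes_up: sequence of booleans, True for a bullish close.
    Returns (hits, trials, probability); probability is None with no trials.
    """
    hits = trials = 0
    for i in range(n, len(closes_up)):
        # Assumption: any run of at least n bullish closes qualifies.
        if all(closes_up[i - n:i]):
            trials += 1
            hits += closes_up[i]
    return hits, trials, (hits / trials if trials else None)
```

The same counting pattern applies to the other conditional probabilities listed above: filter the history to the days where the condition held, then count how often the outcome followed.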
Statistical Tests
Raw probability counts are necessary but not sufficient. Every published figure must pass at least one formal statistical test. The test chosen depends on the data type and the null hypothesis being evaluated.
| Test | Used For | Null Hypothesis | Threshold |
|---|---|---|---|
| Chi-Square (χ²) | Streak, session, sweep datasets — comparing observed vs expected frequency of categorical outcomes | Outcome frequencies match a uniform or 50/50 distribution | p < 0.05 |
| Z-Test (two-tail) | Directional bias in range and streak tables with large samples (n > 30) | Population proportion = 0.5 (no directional edge) | \|Z\| > 1.96 |
| Wilson CI | Confidence intervals on all binary probability estimates — avoids Wald interval breakdown at extreme probabilities | N/A — interval estimation, not hypothesis testing | 95% CI |
| Bayesian Posterior | Streak tables with small sample sizes (n ≤ 30), using a Beta distribution with a uniform prior | N/A: a posterior distribution is computed, not tested | 90% HDI |
| One-sample t-test | Seasonal bias tables — testing whether mean monthly return differs significantly from zero | μ = 0 (no seasonal effect) | p < 0.05 |
| Bonferroni Correction | Applied to seasonal tests across 12 months, 5 weekdays, and multi-comparison streak tables | Controls family-wise error rate | α / n comparisons |
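Two rows of the table, the Wilson interval and the two-tailed Z-test against a proportion of 0.5, can be sketched with the standard library alone. The function names are hypothetical; the formulas are the textbook Wilson score interval and the normal-approximation test for a proportion.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% when z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def z_test_vs_half(successes, n):
    """Two-tailed Z-test of H0: population proportion = 0.5."""
    p = successes / n
    z = (p - 0.5) / math.sqrt(0.25 / n)
    # Two-tailed p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, 58 bullish outcomes in 100 observations gives Z = 1.6, which misses the |Z| > 1.96 threshold, and a Wilson interval that still contains 0.5. The same 58% rate over 1,000 observations clears the threshold comfortably.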
Sample size minimum: Any probability with fewer than 20 historical observations is marked with a low-confidence indicator (⚠) in the dataset tables. Figures based on fewer than 10 observations are suppressed entirely. This limits the risk of reading noise as signal when data is thin.
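The observation-count rule reduces to two cut-offs. A minimal sketch, with hypothetical label names:

```python
def confidence_label(observations):
    """Map an observation count to a publication status."""
    if observations < 10:
        return "suppressed"       # not shown at all
    if observations < 20:
        return "low_confidence"   # shown with the warning indicator
    return "ok"
```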
Bias Controls
The two most dangerous failure modes in backtested analytics are lookahead bias and data snooping. We apply strict procedural controls to eliminate both.
All event classifications use only data that would have been available at the candle close. The analysis database is strictly append-only — historical rows are never retroactively modified once published.
Example: An "Asian range" is computed at 08:00 GMT using only candles from 00:00–07:59 GMT.
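The Asian range example can be sketched as a filter that only admits bars stamped before the 08:00 cut-off. This assumes GMT is treated as UTC, that each bar is stamped with its open time (so an hourly bar stamped 07:00 covers 07:00 to 07:59), and that bars arrive as `(timestamp, high, low)` tuples; all three are illustrative assumptions.

```python
from datetime import datetime, time


def asian_range(bars, cutoff=time(8, 0)):
    """High-low range of the bars stamped before the cutoff.

    bars: iterable of (timestamp, high, low) tuples for a single day.
    Returns None when no bar falls inside the window.
    """
    # Bars stamped at or after the cutoff are never read, so the value is
    # identical whether it is computed at 08:00 or years later.
    window = [(high, low) for ts, high, low in bars if ts.time() < cutoff]
    if not window:
        return None
    return max(h for h, _ in window) - min(l for _, l in window)
```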
Each analysis is checked for stability using a walk-forward approach: we verify that probabilities computed on a 10-year training window hold within ±5% on the subsequent 5-year out-of-sample period.
Datasets that fail walk-forward stability checks are flagged as "unstable" and excluded from featured datasets.
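A walk-forward stability check can be sketched as a comparison of two proportions. The sketch assumes "±5%" means five percentage points of absolute difference; the source does not say whether the tolerance is absolute or relative, so treat the `tolerance` parameter as a placeholder.

```python
def walk_forward_stable(train, test, tolerance=0.05):
    """Compare an in-sample probability with its out-of-sample value.

    train, test: non-empty sequences of 0/1 outcomes.
    Returns (is_stable, p_train, p_test).
    """
    p_train = sum(train) / len(train)
    p_test = sum(test) / len(test)
    # Assumption: tolerance is an absolute difference in probability.
    return abs(p_test - p_train) <= tolerance, p_train, p_test
```

A 58% training-window probability that comes in at 55% out of sample passes under this reading, while one that falls to 50% would be flagged as unstable.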
When testing many sub-groups (e.g., 12 months × 5 weekdays = 60 cells), Bonferroni correction is applied to control the family-wise error rate. This guards against spurious "signals" emerging from random variation at scale.
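The Bonferroni rule is simple enough to show in a few lines: each p-value is compared against α divided by the number of comparisons. The function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return (per-test threshold, list of pass/fail flags)."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]
```

With 60 cells the per-test threshold drops from 0.05 to roughly 0.00083, so a cell with p = 0.01 that looks significant on its own no longer qualifies.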
Probabilities are stress-tested across three volatility regimes (low / medium / high, defined by 14-period ATR quartiles) to confirm the edge is not regime-specific. Regime breakdowns are available in Pro datasets.
An edge that only exists in low-volatility regimes is disclosed prominently, not hidden in aggregate statistics.
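One way to turn ATR quartiles into three regimes is sketched below. The mapping is an assumption, since the text does not spell it out: the bottom quartile is labelled low, the top quartile high, and the middle two quartiles medium. The ATR values are assumed to be precomputed, and the cut-offs here are taken over the whole series for simplicity.

```python
from statistics import quantiles


def label_regimes(atr_values):
    """Label each ATR value as 'low', 'medium', or 'high' volatility."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(atr_values, n=4)
    return [
        "low" if v <= q1 else "high" if v >= q3 else "medium"
        for v in atr_values
    ]
```

Probabilities can then be recomputed separately on the days in each regime to see whether an edge holds across all three.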
Update Cadence
Datasets are not static snapshots — they are living tables that update automatically as new price data is confirmed. Different analysis types update on different schedules based on the cadence of the underlying data.
| Analysis Type | Update Trigger | Lag | Recalc Depth |
|---|---|---|---|
| Streak Tables | Daily close confirmed (NY session close) | ~2 hrs post-close | Rolling 5yr + full history |
| Session Analytics | NY session close (17:00 New York time) | ~2 hrs post-close | Full history recalc |
| Range Distribution | Daily close confirmed | ~2 hrs post-close | Percentile tables rebuilt fully |
| Sweep Probability | Weekly close (Friday NY close) + daily | ~3 hrs post-close | Weekly + daily tables refreshed |
| Seasonal Tendencies | Month-end close | Within 24 hrs of month end | Full history, re-tested with t-test |
| Structure Analysis | Daily close confirmed | ~2 hrs post-close | Pattern tables rebuilt |
Known Limitations
All datasets are based on historical price data. Past statistical tendencies do not guarantee future outcomes. Markets are non-stationary — structural regime changes can and do invalidate historical edges.
Financial time series are not stationary. Central bank policy shifts, algorithmic market structure changes, and macro regime changes all affect the persistence of historical probabilities. We publish walk-forward stability scores to flag at-risk datasets.
Published probabilities are pre-cost and pre-slippage. A dataset showing 58% directional edge may not be viable after spread, commissions, and execution slippage are applied. Edge sizing should account for total round-trip cost.
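A short worked example shows why a pre-cost probability can mislead. The numbers are purely illustrative: a symmetric payoff of 10 units won or lost per trade and a round-trip cost of 2 units are assumptions, not figures from any dataset.

```python
def expected_value(p_win, reward, risk, cost):
    """Expected profit per trade after round-trip cost."""
    return p_win * reward - (1 - p_win) * risk - cost


def breakeven_win_rate(reward, risk, cost):
    """Win rate at which expected value is exactly zero.

    Solves p * reward - (1 - p) * risk - cost = 0 for p.
    """
    return (risk + cost) / (reward + risk)
```

Under these assumed numbers the breakeven win rate is 60%, so a 58% pre-cost edge has a negative expected value of about -0.4 units per trade.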
Analysis assumes continuous price data. Holiday periods, low-liquidity sessions, and data gaps during major events (flash crashes, circuit breakers) are excluded from the datasets, so live execution may differ from the published figures when those conditions recur.
Probabilities describe population averages across historical conditions. They should be used as one input in a systematic trading framework — not as standalone trade signals. No single dataset constitutes a complete trading strategy.