Over 16 months we built a functional automated trading product on Bitcoin. Along the way, the results confronted us with structural obstacles that reoriented the work. Bitcoin and the rest of the crypto market became progressively more institutional: flows stopped being explained by pure on-chain dynamics and began to depend on ETFs, derivatives, macro, and cross-asset narrative. At the same time, the sophisticated consumer shifted toward AI-first and AI-native products, redefining what they expect from a market tool. Iterating and experimenting on those two fronts, the new market and the new user, is what ultimately produced this pivot.
The process was continuous: rigorous backtesting over 808 days of Bitcoin at minute granularity and, in parallel, a serious stress test of what the industry is betting on today — LLM AI agents making trading decisions autonomously. Each hypothesis was designed as a closed question, with an explicit kill condition and statistical validation — no narrative, no lifesaving adjustments.
What the evidence produced was, in part, counterintuitive. Improving the tool — the structured context the agent receives — generated more consistent returns than improving the model, refining the prompt, or increasing decision frequency. And even on the best tool, a single LLM agent remains imperfect at autonomous capital management: its performance depends on the underlying model, token cost, context window size, and especially on the size and diversification of the portfolio it manages. An agent's profitability is relative — not absolute — and no agent, by itself, wins consistently across all market regimes.
The most valuable finding of the work, however, was not that limit but the asset that emerged from trying to overcome it. The engine we built to feed the agents (a system that classifies market regime, attributes cross-asset drivers, detects narrative saturation, and operates 24/7 with auditable calibration) turned out to be exactly the piece that any LLM-trading project is missing today. Tradit stops positioning itself as an agent that predicts price and redefines itself as the Agent Market Intelligence layer on which any agent, internal or third-party, can decide better. The rest of this report documents how we reached that conclusion, with what evidence, and under what methodology.
Evidence · downloadable datasets
The raw data on which all the backtesting ran is publicly available, compressed by category. Any claim in this report is reproducible by re-running the repository scripts against these files.
| Category | Raw size | File |
|---|---|---|
| Tradit Engine — PostgreSQL dump (snapshot 2026-05-04) | 811 MB | tradit_defaultdb_20260504T023120Z.sql.gz |
| Hyperliquid Reservoir (1s candles, fills, liquidations) · split into 25 parts | 7.2 GB | hydromancer.tar.gz.part-aa…aw |
| Lance Chunks (versioned datasets) | 500 MB | chunks.tar.gz |
| BTC/USDT candles (1m, 808 days) | 221 MB | candles.tar.gz |
| Coinglass (liquidations, OI, FGI, ETF flows) | 3.7 MB | coinglass.tar.gz |
| Stablecoins (supply, flows) | 1.8 MB | stablecoin.tar.gz |
| Open Interest (cross-exchange) | 1.6 MB | oi.tar.gz |
| DXY (dollar index) | 1.5 MB | dxy.tar.gz |
| Funding (perp funding rates) | 324 KB | funding.tar.gz |
| Options (deribit, IV/skew) | 256 KB | options.tar.gz |
| Oil (CME, USO) | 244 KB | oil.tar.gz |
| Derivatives analysis (consolidated CSVs) | 48 KB | analysis.tar.gz |
Hyperliquid Reservoir dataset — download and reassemble
The Hyperliquid Reservoir file (~6.3 GB compressed) was uploaded in 25 parts of 280 MB each to work around the single-PUT upload limit of the provider. To download and reassemble it:
```bash
# Download the parts (suffixes aa through aw; MANIFEST.json lists the exact set in split_into)
for p in a{a..w}; do
  curl -fLOs "https://data.tradit.co/hydromancer.tar.gz.part-$p"
done

# Reassemble (the glob expands in lexicographic order, which is the part order)
cat hydromancer.tar.gz.part-* > hydromancer.tar.gz

# Verify sha256 against the manifest: must match MANIFEST.json → items[].sha256
shasum -a 256 hydromancer.tar.gz

# Extract
tar -xzf hydromancer.tar.gz
```
The MANIFEST.json contains the sha256 of the full file and the exact list of parts (field split_into). The other categories (candles, chunks, etc.) are single tar.gz files that can be downloaded directly with curl -O.
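The integrity check can also be done without writing the reassembled archive to disk. The following is a minimal Python sketch, not the repository's own tooling: it hashes the parts in order and looks up the expected value in the manifest. The `items[].sha256` path comes from the text above; the `name` key used to select the item is an assumption about the manifest layout.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_parts(parts, chunk_size=1 << 20):
    """Hash the concatenation of `parts` in order, streaming chunk by chunk."""
    digest = hashlib.sha256()
    for part in parts:
        with Path(part).open("rb") as fh:
            while chunk := fh.read(chunk_size):
                digest.update(chunk)
    return digest.hexdigest()


def expected_sha256(manifest_path, file_name):
    """Return items[].sha256 for `file_name`.

    ASSUMPTION: each manifest item carries a `name` key; adjust to the real layout.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    for item in manifest["items"]:
        if item.get("name") == file_name:
            return item["sha256"]
    raise KeyError(file_name)
```

A typical call would pass `sorted(Path(".").glob("hydromancer.tar.gz.part-*"))` to `sha256_of_parts` and compare the result with `expected_sha256("MANIFEST.json", "hydromancer.tar.gz")`.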
Executive TL;DR
In 16 months (January 2025 → April 2026) the backtesting and experimentation work produced two results that clearly separate what fails from what works in the space of modern algorithmic trading.
What failed — invalidated hypotheses with data
- Predicting price with candlestick patterns. The largest experiment in the program (144 scripts in 24 phases) produced a pure trading engine PnL of approximately -$26 over 808 days. The only component that generated consistent return was funding carry (51% of total PnL, positive in 25 of 27 months), and that is not a discovered strategy but a structural anomaly of the derivatives market. Statistical verdict: p=0.073. The candle engine does not beat noise with confidence.
- Trusting popular technical strategies. We replicated 30 popular TradingView strategies (37 variants), including several with thousands of favorites, over 808 days with real costs: only 1 of 37 beats Buy & Hold. The most-favorited strategy in the sample (moving average crossover, 8.8K favorites) delivers Profit Factor 0.65 and -$98 on $1K. Public popularity does not correlate with real edge — it correlates with visible curve-fitting and systematic omission of costs.
- Assuming that an AI model trained on candles solves the problem. We fine-tuned an open-source foundation transformer (4.1M parameters, pre-trained on 12B candles from 45 exchanges) with 19,392 of our own BTC/USDT candles. Direction accuracy rose from 47% to 63% (+16pp), but the exercise confirmed the underlying thesis: a model trained on candles is not reliable as a directional predictor for capital decisions. Its only defensible use is as a negative signal: when its confidence drops, the market is likely moving due to causes external to price, and one should stop looking at candles.
- Operating LLM agents as autonomous portfolio managers. Public evidence (Alpha Arena, October 2025: $60K real, 6 frontier models, 17 days) and our own experiments converge: LLMs over-trade, hold theses against evidence, fail at abstention discipline, and lack calibrated memory of their own hit rate. The bottleneck is not intelligence — it is the market context they receive.
What worked — validated hypotheses with data
- Cross-asset context DOES predict better than random when measured rigorously. Three features with |ICIR| > 1.5 (liquidations, coin return, US 10Y bonds). Four binary signals with Hit Rate 75-100% and sufficient N: equity crash >2%, liquidation cascade P99→P90, ETF outflows >$400M, DXY up >0.5%. Signals that actually work live outside the price.
- Market regime decides the outcome, not the strategy. The same system wins or loses depending on the phase (trending/ranging/crisis/recovery). Classifying the regime and abstaining outside it is a greater edge than improving any indicator.
- External catalysts can be detected before price reacts. On April 7, 2026, a geopolitical resolution moved oil -17.3% and BTC from $69K to $72.7K in hours. The candles could not have anticipated it. But the engine's causal graph — oil + narrative + ETF flows + news velocity — captured it in real time.
- Building an engine that integrates all of the above in real time is feasible and has been done. In production 24/7: 36+ APIs, 12-layer pipeline, 33-55 bipolar signals with documented physical causation, 21,954 scans recorded, 3 persistent processes without significant downtime. This engine knows how to read the directions the market is taking and attribute the drivers that explain them.
Product implication: Tradit stops promising to "predict price" or "build the highest-earning agent". It moves to selling what backtesting proved actually works: an Agent Market Intelligence layer — the engine that classifies regime, attributes drivers, detects narrative saturation, and calibrates publicly — on which any agent (human or LLM) can decide better.
Abstract
This report documents what failed and what worked in 16 months of backtesting and quantitative experimentation on Bitcoin (January 2025 → April 2026). The purpose was to separate with data the viable hypotheses from the hypotheses that the algorithmic trading industry — and the new LLM-trading agent projects — continue to pursue despite the evidence against them.
The starting point was concrete: an artificial intelligence agent — an LLM with instructions — makes wrong trading decisions when operating with insufficient context. Large-scale public experiments with frontier models have demonstrated this with real money (see the context section). The initial hypothesis was that the problem is not solved by a better model, a better prompt, or more raw data — it is solved by giving the agent a structured causal graph of the market that indicates what direction it is taking, what drivers explain it, and with what level of confidence. Building it required first understanding, with statistical rigor, what market information actually predicts and what is noise.
Over the 16 months, twelve independent experimental lines (over 400 numbered scripts) were run over 808 continuous days of Bitcoin at minute granularity, in parallel with the construction of a market cognition engine that today runs 24/7 (36+ integrated APIs, 12-layer pipeline, 21,954 recorded scenarios).
What backtesting proved does NOT work
- Predicting price with candlestick patterns. The largest experiment (144 scripts in 24 phases) did not beat noise with statistical confidence (p=0.073). The "winning" portfolio is only positive due to a structural funding carry anomaly (51% of PnL), not real edge.
- Trusting popular technical strategies. Honest replication of 30 popular TradingView strategies: only 1 of 37 beats Buy & Hold with real costs. Public popularity does not correlate with real edge.
- Assuming that an AI model trained on candles solves the problem. We fine-tuned an open-source foundation transformer (4.1M parameters) on our own data. Direction accuracy rose from 47% to 63%, but this only confirmed the underlying thesis: a model trained on candles is not reliable as a directional predictor for making capital decisions.
What backtesting proved DOES work
- Identifying market direction from cross-asset context and regime. Signals that do predict live outside the price: liquidations, equity, bonds, ETF flows, DXY, narrative sentiment. Three features with |ICIR| > 1.5 (Tier S). Four binary signals with Hit Rate 75-100% and sufficient N. The same strategy wins or loses depending on the regime.
- Building a market cognition engine that integrates all of the above in real time. The system in production already classifies regime, attributes drivers, detects narrative saturation, and identifies external catalysts before they are reflected in price (documented case: April 7, 2026, BTC $69K → $72.7K).
Conclusion. The viable hypothesis that survives 16 months of evidence is not "let's build an agent that predicts price" — that hypothesis fails, both in our experiments and in public ones. It is "let's build the engine that tells the agent what direction the market is taking, why, and with what level of certainty". This opens the product to an Agent Market Intelligence layer: structured, calibrated, and verifiable context that any agent — internal or external — can consume to decide better.
Context: LLM agents in trading
An artificial intelligence agent is, at its core, a Large Language Model given a set of instructions (system prompt), tools, and a situational context. The agent reasons over that context and decides. But the underlying model was trained to optimize textual coherence across a heterogeneous corpus — not to optimize risk-adjusted PnL in an adversarial market. That training gap translates into predictable error patterns when the agent operates with poor context: over-trading, recency biases, activation by narratives instead of by structure, difficulty abstaining, difficulty managing leverage, and zero calibrated memory of its own hit rate.
This is not a speculative hypothesis. There is large-scale public evidence.
The Alpha Arena experiment (nof1.ai, October 2025)
In October 2025, nof1.ai organized Alpha Arena, a competition in which six leading frontier models (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, Grok 4, DeepSeek, and Qwen3) each received $10,000 USD of real money ($60,000 total) to trade perpetual futures with up to 20x leverage on Hyperliquid for 17 days, without human intervention.
- Qwen3 won the competition executing only 43 trades in 17 days — abstention discipline.
- Gemini 2.5 Pro lost executing 238 trades — over-trading that consumed >$600 in fees alone, before counting directional PnL.
- Models with more sophisticated reasoning in traditional benchmarks did not necessarily operate better than simpler models, because the relevant benchmark in trading is not abstract reasoning, but impulse containment and consistency under uncertainty.
The structural lesson: operating LLMs without a context system that anchors them to the real state of the market leads to predictable and reproducible losses, even for the most advanced models available. The problem is not the agent's intelligence — it is the poverty of the context it receives.
Other convergent public cases
Parallel initiatives (Hyper-Alpha-Arena, llm-tradebot, various LLM-driven bots published in 2025-2026) show equivalent patterns: the agent acts as a novice human trader would — enters on narrative, holds against evidence, sizes poorly, does not know when not to trade. The missing layer is not "more model": it is the structured market context that a professional trader has in their head and an LLM does not have in its prompt.
How this translates into Tradit
The conclusion is direct: if the agent's bottleneck is context, then the most valuable asset to build is not the agent — it is the system that produces that context, calibrates it, and delivers it in a structured form. That is exactly what Tradit's market cognition engine does today.
Building that system required answering with data questions that the industry typically assumes without verifying:
- Which market signals do predict better than random, with what N, with what hit rate, and under what regime?
- Which of the technical strategies that the industry sells as "winners" actually survive paper trading with real costs?
- Is there an edge in candles, or does the edge live in cross-asset context?
- Can a foundation model, fine-tuned with our own data, provide incremental signal — and of what type?
- How does one measure the calibration of a probabilistic system in a public and verifiable way?
Each of the twelve experimental lines documented in this report answers one of those questions. The results — pleasant or not, expected or not — are the foundation on which the rest of the product is built.
2. What was built (in numbers)
| Chapter | Metric |
|---|---|
| Completed experimental lines | 12 |
| Numbered scripts run | ~400+ (144 in the main strategy line, 30 in benchmarks, 20 in technical compass, 14 in calibration, 12 in derivatives, etc.) |
| Raw data captured | 9.2 GB local |
| BTC/USDT 1m candles | 1,163,520 candles (January 2024 → March 2026, 808 continuous days) |
| Engine v1 production snapshots | 10,228 (every ~12 min, 47+ fields) |
| Cognitive engine v2 scans recorded | 21,954 (March–April 2026, 15-min interval) |
| Integrated APIs | 36+ (Binance, Coinglass, TwelveData, Yahoo CME, Coinbase, CoinGecko, DefiLlama, Brave, Dune, Kalshi, Polymarket, Google News, Santiment, FRED, etc.) |
| Pine Scripts replicated from TradingView | 7 (with 7 CSVs ≈ 1,360 trades — runner validated with 100% match on KST) |
| AI models fine-tuned | 1 (4.1M parameters, 19,392 candles, 2.5 hours on Apple M1) |
| Architecture proposals written | 45 numbered documents |
| Base research documents | 40+ (real costs, Monte Carlo, anti-overfitting, walk-forward, Kelly, attention-based trading, etc.) |
| Persistent production processes | 3 (scan every 15 min, daily labels, 24/7 liquidations WebSocket daemon) |
3. Timeline, methodology, and work structure
The experiments were executed in a non-linear sequence: each answered a different question, invalidated a hypothesis with data, and left a learning that the next one picked up.
Prior context: 17 previous iterations existed and were archived as reference. The project was restarted from scratch in March 2026 with a new methodology based on 18 base research documents (trading costs, Monte Carlo, CPCV, anti-overfitting, walk-forward, attention-based trading, Kelly criterion, Smart Money Concepts, expert trader mental model, etc.).
3.1 Foundations lab — eight parallel lines
Eight experiments ran in parallel to measure before theorizing. Each was a distinct probe into the data, with independent methodology, datasets, and validation. All eight followed a common protocol: (a) pre-registered hypothesis with its explicit kill condition, (b) execution over the 808 continuous days of BTC/USDT with real costs (Binance Futures commission + 0.5 tick slippage + hourly funding rate), (c) validation via purged walk-forward and/or Monte Carlo + multiple testing correction (Holm-Bonferroni), and (d) binary verdict on the kill condition before any "adjustment".
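To make the "real costs" in step (b) concrete, here is a minimal sketch of how fees, slippage, and funding could be netted out of a single long trade. The taker fee (0.04%), the 0.5 tick slippage, and the extra tick on stop-loss are the figures listed in the strategy-line protocol of this report; the tick size of 0.10 USD, the function name, and the field names are illustrative assumptions, not the project's runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    taker_fee: float = 0.0004        # 0.04% per fill
    slippage_ticks: float = 0.5      # applied on every entry and exit
    stop_extra_ticks: float = 1.0    # additional slippage when the exit is a stop-loss
    tick_size: float = 0.10          # ASSUMPTION: USD price increment of the contract


def net_pnl_long(entry, exit_, qty, funding_paid, stopped_out=False, costs=CostModel()):
    """Net USD PnL of one long trade after slippage, taker fees, and funding."""
    slip_in = costs.slippage_ticks * costs.tick_size
    slip_out = slip_in + (costs.stop_extra_ticks * costs.tick_size if stopped_out else 0.0)
    fill_in = entry + slip_in        # buy slightly worse than the quoted price
    fill_out = exit_ - slip_out      # sell slightly worse than the quoted price
    gross = (fill_out - fill_in) * qty
    fees = costs.taker_fee * qty * (fill_in + fill_out)
    return gross - fees - funding_paid
```

With small position sizes the fee term dominates: on a 0.01 BTC trade near $70K, the two taker fills alone cost roughly $0.56, which is why strategies with many small trades are the first to fail once costs are applied.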
Line A — Theoretical ceiling ("perfect trader")
- Question: what is the absolute PnL ceiling if we had a perfect oracle that detected every market swing?
- Methodology: swing detector with adaptive ZigZag (dynamic ATR(14) threshold scaled to the volatility regime), entry/exit at ideal pivots, real costs applied.
- Data: 808 days, 1.16M 1m candles → resampled to 5m for significant swing detection.
- Result: 174 swings detected summing 668% of cumulative theoretical opportunity.
- Incidental methodological finding: running the runner on this experiment exposed 6 internal bugs (drawdown reported as 0%, funding not discounted, DCA with incorrect 10-day offset, equity with 1-bar mismatch, slippage ignored on stop-loss, asymmetric long/short fee). This turned the experiment into a calibration of the runner itself.
- Function: upper bound against which to measure any real strategy.
Line B — Wave mapping
- Question: how many movement opportunities actually exist if we lower the detection threshold to micro levels?
- Methodology: same ZigZag detector with threshold swept in a grid {0.5%, 1%, 2%, 3%, 5%}. Counting "waves" as complete swings with sign change.
- Result: 2,771 waves identified with 0.5% threshold. Amplitude distribution confirms a power law.
- Central finding: in the 86 net bear days (-20%), BTC produced hundreds of percent of movement in both directions. The qualitative conclusion — "it didn't fall, it moved" — refutes the narrative of "quiet market during a crash".
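The wave counting in Lines A and B can be illustrated with a fixed-threshold ZigZag. The sketch below is a simplified stand-in, assuming a constant fractional threshold over a list of closing prices; it does not implement the adaptive ATR(14) threshold of Line A, and the function names are hypothetical.

```python
def zigzag_pivots(prices, threshold):
    """Indices of confirmed swing pivots for a fractional reversal threshold (0.005 = 0.5%)."""
    pivots = []
    direction = 0            # 0 = no leg yet, +1 = tracking an up-leg, -1 = a down-leg
    lo_idx = hi_idx = 0      # running extremes before the first leg is confirmed
    ext_idx = 0              # running extreme of the current leg
    for i in range(1, len(prices)):
        p = prices[i]
        if direction == 0:
            if p > prices[hi_idx]:
                hi_idx = i
            if p < prices[lo_idx]:
                lo_idx = i
            if p >= prices[lo_idx] * (1 + threshold):
                pivots.append(lo_idx)           # first pivot is the starting low
                direction, ext_idx = 1, i
            elif p <= prices[hi_idx] * (1 - threshold):
                pivots.append(hi_idx)           # first pivot is the starting high
                direction, ext_idx = -1, i
        elif direction == 1:
            if p > prices[ext_idx]:
                ext_idx = i
            elif p <= prices[ext_idx] * (1 - threshold):
                pivots.append(ext_idx)          # the high is confirmed as a pivot
                direction, ext_idx = -1, i
        else:
            if p < prices[ext_idx]:
                ext_idx = i
            elif p >= prices[ext_idx] * (1 + threshold):
                pivots.append(ext_idx)          # the low is confirmed as a pivot
                direction, ext_idx = 1, i
    return pivots


def count_waves(prices, threshold):
    """A wave is a completed swing between two consecutive confirmed pivots."""
    return max(len(zigzag_pivots(prices, threshold)) - 1, 0)
```

Sweeping the grid is then a one-liner such as `{t: count_waves(closes, t) for t in (0.005, 0.01, 0.02, 0.03, 0.05)}`, where `closes` is whatever price series is loaded. Lower thresholds always yield at least as many pivots, which is the mechanism behind the wave count growing as the threshold drops to micro levels.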
Line C — Derivatives atlas
- Question: did derivatives (funding, OI, taker volume, L/S ratio, basis) warn before large moves of the underlying?
- Methodology: 28 binary signals defined a priori. Test against BTC movement at t+{1h, 4h, 24h}. Validation: Monte Carlo of 1,000 permutations + Holm-Bonferroni correction. Continuous version: rolling ICIR calculation.
- Result: 0 of 28 binary signals survive the multiple testing correction. As continuous features, ICIRs range from -0.83 to -1.48.
- Verdict: derivatives do not work as direct triggers — but they do provide information for sizing (adjusting position size) and as a veto (abstaining under extreme conditions).
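The Holm-Bonferroni step used in this line is easy to state in code. The sketch below is a generic implementation of the standard step-down procedure, not the project's script; the significance level of 0.05 is an assumption, since the report does not state the alpha it used.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # smallest p-value first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break   # step-down: once one test fails, all larger p-values fail too
    return reject
```

With 28 signals, the smallest p-value has to fall below 0.05 / 28 ≈ 0.0018 before anything is rejected, which shows how demanding the correction is for a family of this size.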
Line D — Cross-asset intelligence
- Question: which assets correlated with BTC predict better than random?
- Methodology: 25 cross-asset features: liquidations, S&P 500, USDX, gold, oil, VIX, US 10Y bonds, MSTR, COIN, mining stocks, on-chain (Puell, NUPL), Fear & Greed, etc. ICIR calculation over rolling windows. Three tiers (S/A/B) by absolute ICIR > {1.5, 1.0, 0.5}.
- Data: TwelveData, Yahoo Finance, Coinglass, Alternative.me (FGI since 2018, 2,946 days), Glassnode/CoinMetrics-equivalent (Puell Multiple since 2010, 15 years).
- Result: 3 Tier S features (|ICIR| > 1.5): liq_long_ratio (-2.61), coin_ret (+2.20), US 10Y bonds (+1.52).
- Implication: the information that does predict lives outside the BTC price. Candles are the effect; the causes live in correlated assets.
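The report does not spell out its ICIR formula. A common definition, assumed here, is the mean of the per-window information coefficient (rank correlation between a feature and the forward return) divided by its standard deviation. The sketch uses non-overlapping windows and ignores ties in the ranking, both of which are simplifications; the tier cutoffs are the ones stated in the methodology.

```python
from statistics import mean, stdev


def ranks(values):
    """Rank positions 0..n-1 (ties are broken by order of appearance)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for r, i in enumerate(order):
        out[i] = float(r)
    return out


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def icir(feature, fwd_ret, window):
    """Mean over std of the Spearman IC computed on consecutive windows."""
    ics = [
        pearson(ranks(feature[s:s + window]), ranks(fwd_ret[s:s + window]))
        for s in range(0, len(feature) - window + 1, window)
    ]
    return mean(ics) / stdev(ics)


def tier(icir_value):
    """Tier by absolute ICIR: S above 1.5, A above 1.0, B above 0.5, else None."""
    a = abs(icir_value)
    if a > 1.5:
        return "S"
    if a > 1.0:
        return "A"
    if a > 0.5:
        return "B"
    return None
```

Because the tiers use the absolute value, a strongly negative feature such as liq_long_ratio (-2.61) ranks as Tier S alongside positive ones: the sign only says in which direction the feature points.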
Line E — Calibration lab
- Question: what hit rate do individual signals have with statistically valid N?
- Methodology: 14 independent scripts. Only signals with HR > 70% and N ≥ 8 are promoted to "Tier 1".
- Result (4 Tier 1 signals):
  - Equity crash > 2% in US session → HR 100% (N=8) for BTC drop > 1.5% at t+24h.
  - Liquidation cascade P99 → P90 in <2h → HR 87.5% (N=8) for mean reversion at t+4h.
  - ETF outflow > $400M in one session → HR 75% (N=32) for BTC drop at t+72h.
  - DXY ↑ > 0.5% daily → HR 75% (N=16) for BTC drop at t+24h.
- Honest limitation: small N on the strongest signals — Bayesian update sustains the priors but the credible range is wide.
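How wide the credible range is at these sample sizes can be checked with a Beta posterior on the hit rate. The sketch below assumes a uniform Beta(1, 1) prior, which the report does not specify, and computes an equal-tailed interval by numerical integration on a grid so that it needs no external libraries.

```python
def beta_credible_interval(hits, n, mass=0.95, prior_a=1.0, prior_b=1.0, grid=20001):
    """Equal-tailed credible interval for a hit rate under a Beta prior."""
    a = prior_a + hits
    b = prior_b + (n - hits)
    xs = [(i + 0.5) / grid for i in range(grid)]            # midpoints avoid 0 and 1
    dens = [x ** (a - 1) * (1 - x) ** (b - 1) for x in xs]  # unnormalized Beta density
    total = sum(dens)
    lo_q = (1 - mass) / 2
    hi_q = 1 - lo_q
    cum, lo, hi = 0.0, None, None
    for x, d in zip(xs, dens):
        cum += d / total
        if lo is None and cum >= lo_q:
            lo = x
        if hi is None and cum >= hi_q:
            hi = x
            break
    return lo, hi
```

For the 100% signal (8 hits out of 8), the posterior under this prior is Beta(9, 1), whose 2.5% quantile is 0.025^(1/9) ≈ 0.66. A perfect record on N=8 is therefore still compatible with a true hit rate in the mid-60s, which is the limitation the line above acknowledges.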
Line F — Benchmark arena
- Question: how much is what the backtesting industry sells as "winners" actually worth on real data?
- Methodology: literal replication of 30 popular TradingView strategies (open Pine Script + official CSV scrape). 37 variants. Cross-validation: own runner vs official CSV must give 100% match on reference strategies (KST as gold standard).
- Result: only 1 of 37 beats Buy & Hold: SuperTrend AI with score threshold ≥ 65/100. PnL = +$715 on $1K in 808 days.
- The rest lose or break even. The most-favorited strategy in the sample (moving average crossover, 8.8K favorites on TradingView): Profit Factor 0.65, -$98 on $1K. RTB (Renko Trend Breaker): -$559.
- Publishable conclusion: the popularity of a strategy on a public platform does not correlate with its real edge.
Line G — Bear movement mapping (complement to B)
Specific analysis of the 86 bear days at -20%, decomposing the net movement into positive and negative sub-trajectories, to validate bidirectional trading opportunity even in a net declining market.
Line H — Cross-source validation
Methodological line that validates each feature against at least two independent sources (e.g. Binance funding rate vs Coinglass aggregate, Dune on-chain ETF flows vs Bloomberg aggregate) and discards features with discrepancy > 5% without an explainable cause.
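A cross-source check of this kind could look like the sketch below. The report gives the 5% cutoff but not the discrepancy metric; the one used here (mean absolute gap relative to the mean absolute level of two aligned series) is an assumption chosen for illustration, as are the function names.

```python
def relative_discrepancy(source_a, source_b):
    """Mean absolute gap between two aligned series, relative to their average level."""
    if len(source_a) != len(source_b) or not source_a:
        raise ValueError("series must be aligned and non-empty")
    gap = sum(abs(x - y) for x, y in zip(source_a, source_b))
    level = sum((abs(x) + abs(y)) / 2 for x, y in zip(source_a, source_b))
    return gap / level if level else 0.0


def passes_cross_source(source_a, source_b, max_discrepancy=0.05):
    """True when the two sources agree within the allowed discrepancy."""
    return relative_discrepancy(source_a, source_b) <= max_discrepancy
```

A feature that fails this test is not automatically dropped under the rule above: it is dropped only when no explainable cause (different aggregation window, exchange coverage, timestamp convention) accounts for the gap.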
Cross-cutting lesson from the lab: the majority of technical "signals" are noise. The few that predict live in derivatives, cross-asset, or regime, and only when the correct context is applied and kill conditions are respected. The industry publishes optimistic metrics because it omits costs, omits multiple testing, and cherry-picks the reporting period. When honest methodology is applied, the vast majority of "winning strategies" do not survive: in Line F, 36 of 37 variants failed to beat Buy & Hold.
3.2 First strategy with $1,000 — the case of the self-invalidated backtest
- Hypothesis: a simple strategy, fed by engine v1 signals, should beat Buy & Hold during a net bear period.
- Methodology: 6 strategies over 87 days of engine v1 snapshots, with $1,000 initial capital, one position at a time, fixed sizing, real costs.
- Nominal result: S1 trend-follower won +$24 (+2.44%) while BTC fell -19%. PF 1.42, Sharpe 3.73.
- Post-audit: P0 bugs detected:
  - Future price lookahead in 40% of snapshots.
  - Duplicate trades due to race condition in the state machine.
  - Equity equation not closed properly: fees not discounted on partial closes.
- Verdict: the original numbers were invalidated. The experiment was archived as a historical artifact and a postmortem was published with the exact detection chain.
- Lessons that changed everything after:
  - Runner validates runner: without a known benchmark (TradingView CSV with 100% match), no own number can be trusted.
  - Reproducible audit or it is not a number.
  - Three lines of defense: (i) known fixture, (ii) Monte Carlo of the logic, (iii) manual sample audit.
3.3 Place selection — temporal grid trading
- Hypothesis: instead of predicting where BTC is going, change the question to where it is oscillating.
- Methodology: 5 grid trading hypotheses over 87 days. H1 uniform grid 24/7. H2 grid restricted to 14:00-19:00 UTC window. H3 grid + PDE Governor. H4 grid + FGI filter. H5 adaptive grid.
- Result: only H2 was positive (+$1.10, Sharpe +3.78). The edge came from temporal concentration. H3 turned out identical to H1 without filter — an internal case study for parsimony.
- Lesson: concentrating activity in liquid windows is half the battle. Overnight low-volatility hours destroy grid cycles.
3.4 Multi-Asset (BTC + ETH + SOL) — direction yes, timing no
- Hypothesis: a strategy that wins on BTC should transfer with minimal adjustments to ETH and SOL.
- Result: BTC + engine: +$48. ETH candle-only track: -$322. SOL candle-only track: -$370. Cause: the regime derived from pure candles oscillates 7x more than the engine regime → 7x more trades → noise destroys edge.
- Central finding (direction vs timing separation):
  - Direction IS predictable: RSI has Cohen's d = 1.49, Donchian position d = 1.35.
  - Timing is NOT predictable: d = 0.23.
- Internal motto: "the compass works, the trigger is missing".
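Cohen's d is the difference between two group means expressed in units of their pooled standard deviation. The sketch below is the textbook formula; how the report formed its two groups is not stated, so the example assumes something like "indicator value before up-moves" versus "indicator value before down-moves".

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5
```

By the usual rule of thumb, values around 0.2 are small effects and values above 0.8 are large, which is why d = 1.49 for direction and d = 0.23 for timing read as "compass yes, trigger no".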
3.5 Technical compass for BTC — the fuse-bug moment
- Methodology: 20 scripts with combinations of 12 components (ATR, RSI, MACD, ADX, ADL, ATR ratio, OBV, vol-of-vol, 4-window momentum, MFI, CCI, Stochastic). Purged walk-forward, 7-day gap.
- Nominal result on truncated period: best variant produced +$1.11/week over 200 days.
- Audit: critical bug in the ADX component fuse rescaling when moving to the full period.
- Real result on 808 days with bug corrected: best variant -$0.30/week, PF 0.86.
- Verdict: kill condition activated.
- Crystallized constitutional rule — the "BRONZE rule": nothing is promoted to shadow mode (live demo) without having demonstrated ≥ $25/week in honest backtesting over the full 808 days.
3.6 The main adaptive strategy line — 144 scripts in 24 phases
The most complete experiment in the project. 144 scripts, 24 sequential phases, executed over 808 days of clean data under a strict protocol.
Methodological protocol applied:
- Each script is accompanied by a pre-registered hypothesis document (date, author, kill condition, expected result, acceptance metric).
- Mandatory validation via pooled walk-forward with 10 splits, 6-month train, 1-month test, 14-day purge.
- Real costs always applied: Binance Futures commission (taker 0.04%, maker 0.02%), real hourly funding, 0.5 tick slippage per entry/exit + 1 additional tick on stop-loss.
- Reports with parallel metrics: PnL/week, PF, Sharpe, Sortino, max DD, expectancy, WR, AVG trade, exposure ratio, tail ratio.
- If the key metric does not exceed the pre-registered threshold, the phase is closed without "lifesaving adjustments".
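The walk-forward layout in this protocol can be sketched as a split generator over day indices. The split count, train length, test length, and purge come from the list above; approximating a month as 30 days and advancing the window by one test period per split are assumptions, since the report does not state either.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    train: range   # day indices used for fitting
    test: range    # day indices used for out-of-sample evaluation


def purged_walk_forward(n_days, n_splits=10, train_days=180, test_days=30,
                        purge_days=14, step_days=None):
    """Rolling train/test windows separated by a purge gap that belongs to neither."""
    step = test_days if step_days is None else step_days   # ASSUMPTION: roll by one test period
    splits = []
    for k in range(n_splits):
        train_start = k * step
        train_end = train_start + train_days     # exclusive
        test_start = train_end + purge_days      # purged days are dropped entirely
        test_end = test_start + test_days        # exclusive
        if test_end > n_days:
            raise ValueError("not enough data for the requested splits")
        splits.append(Split(range(train_start, train_end), range(test_start, test_end)))
    return splits
```

"Pooled" evaluation then means concatenating the trades from all ten test windows and computing one profit factor over them, rather than averaging ten per-split figures. The purge gap keeps a position opened near the end of training from leaking information into the test window.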
Summary by phase:
| Phase | Focus | Key result |
|---|---|---|
| 1 | Single engine | 16 scripts. PF 0.47 → 1.73. |
| 2 | 3-component portfolio | 10 scripts. $37 → $133. |
| 2B | Portfolio optimization | 5 scripts. $133 → $145, PF 2.02. |
| 3 | Convergence with benchmarks | 5 scripts. 4 failed. |
| 4 | Audit + leverage | 4 scripts. 20x viable. |
| 5 | External audit + fixes | 8 scripts. D-1 and volatility correction. |
| 6 | Derivatives as grid filter | 5 scripts. CVD + liquidations as filter. |
| 7 | Vol targeting | 4 scripts. Volatility scaling validated. |
| 8 | Stress test | 5 scripts. Monte Carlo + walk-forward robust. |
| 9 | Crash tests | 8 scripts. Weak range detection. |
| 10 | Applied ML | 5 scripts. Early exit -0.5%@bar2 = PF 2.62. |
| 11 | External strategy landscape | 8 scripts. 0 improvements. |
| 12 | Final refinements | 6 scripts. 5 failed. |
| 13 | Derivatives exploration | 4 scripts. 0 improvements in walk-forward. |
| 14 | Academic paper replication | 5 scripts. 0 improvements in walk-forward. |
| 15 | Statistical significance | 3 scripts. p=0.073 — NOT significant. |
| 16 | "Oceanographic" analysis | 4 scripts. Independent trades confirmed. |
| 17 | Liquidation veto | 6 scripts. Walk-forward PASS 7/10, PF 1.95. |
| 18 | Carry forensics | 4 scripts. Carry = 51% of PnL, 25/27 months positive. |
| 19 | Offset independence | 4 scripts. 48/48 positive. |
| 20 | Stability verification | 3 scripts. Convergence confirmed. |
| 21 | State machine | 6 scripts. Walk-forward FAIL formal 5/10. |
| 22 | B1 integration | 4 scripts. PF 3.17. |
| 23 | Smart Money Concepts | 4 scripts. 3 failed. Insufficient N. |
| 24 | Range detection (B2) | 3 scripts. Volume ratio veto promoted. |
Final portfolio (10 rules, 8 strict + 2 promoted): PF 4.11 full-sample, PF 1.66 pooled walk-forward (7W/2T/1L), 35 trades, WR 40%, Sharpe ~3.5, 0 liquidations at all levels 1x→20x.
| Leverage | PnL/808d | PF | Max DD | BRONZE cap |
|---|---|---|---|---|
| 1x | +$130 | 1.70 | 2.2% | $23K |
| 3x | +$169 | 2.62 | 2.4% | $18K |
| 5x | +$200 | 3.10 | 2.8% | $15K |
| 10x | +$275 | 4.00 | 4.0% | $11K |
| 20x | +$435 | 5.80 | 6.0% | $7K |
The 8 constitutional laws that survived (each validated or invalidated with data across multiple scripts):
- ATR ×2.5 flat = definitive exit (9/9 alternatives failed).
- Long-only = definitive direction (6/6 failed in direct shorts).
- Fixed carry $500 = definitive allocation (4/4 modulations failed).
- Pullback > market entry (always).
- Simple > complex (~130 refinements failed).
- Vetoes > optimizations (the only late improvements are abstention rules).
- N<100 invalidates fine refinements.
- Derivatives as direct trigger = 0/28. As sizing/veto = works.
The number that changed everything: carry = 51% of total PnL. Over 808 days, the gross PnL of the unfiltered engine is ≈ -$26 (the engine alone, over the full series). The final portfolio (engine + carry + grid, with vetoes applied) generates $89.62: roughly $46 comes from funding carry, $26 from the trend-follower, and $18 from the grid. That is, more than half of the "system's" PnL is actually passive income from the derivatives market.
The final verdict (Phase 15): p=0.073 — the candle-based trading engine does not beat noise with statistical confidence. Candles alone do not contain sufficient predictive information.
3.7 The paradigm shift — from price to context
The trading engine asks price what it did. The cognition engine asks the world why it moved.
This line is a structural rethink of the product. It is not the previous line taken to production — it is something different: a system designed from scratch to produce the structured market context that an AI agent needs to decide well.
General architecture — 5 integrated systems:
- Capture — parallel raw data collection from 36+ APIs each cycle (15 min).
- Cognitive analysis — normalization + causalization of each signal.
- Synthesis — aggregation of signals into meta-factors with physical interpretation.
- Hypothesis tree — live system of active hypotheses with rolling scoring and explicit kill conditions.
- Reinforcement learning (feedback engine) — ghost P&L of decisions the agent took vs did not take.
Internal pipeline — 12 layers in order:
Fetch → Normalize → Causal Depth → Attention Heads → Bias → Softmax → State → Output → Hypotheses → Feedback Engine → RL → Final Output
- Fetch: parallel calls to the 36+ APIs with staggered timeout, geo fallbacks (Binance fast-scan), rate-limit aware.
- Normalize: each raw feature is mapped to a normalized scalar in [-1, 1] with a tempo-adaptive baseline.
- Causal Depth: each signal is labeled with its underlying causal mechanism.
- Attention Heads: multiple heads evaluate the situation from different perspectives (technical, regime, momentum, derivatives, sentiment, narrative).
- Bias: application of Bayesian prior according to the active regime.
- Softmax: final combination into a distribution over 9 possible actions, with temperature 0.45.
- State: persistence between scans (short-term memory).
- Hypotheses: live tree with continuous scoring. Each hypothesis has its kill condition.
- Feedback Engine: each decision is recorded with its full context. If the agent does not execute it, the ghost P&L is calculated.
- RL: cases feed weight adjustments on the heads and priors.
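As an illustration of the Normalize layer, the sketch below maps a raw feature into [-1, 1] against a rolling baseline. It assumes that "tempo-adaptive" means a shorter lookback when the market is moving fast, and it uses a tanh squash of a z-score; both choices, the window sizes, and the function name are assumptions, not the engine's documented method.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def normalize_signal(history: Sequence[float], value: float, fast_market: bool) -> float:
    """Map a raw feature to [-1, 1] relative to a rolling baseline.

    Assumption: "tempo-adaptive" is modeled as a shorter lookback when the
    market is moving fast. The real engine may do something different.
    """
    lookback = 8 if fast_market else 32  # illustrative window sizes
    baseline = list(history[-lookback:])
    if len(baseline) < 2:
        return 0.0  # not enough history to define a baseline
    spread = pstdev(baseline)
    if spread == 0:
        return 0.0  # flat baseline: no meaningful deviation
    z = (value - mean(baseline)) / spread
    return math.tanh(z / 2)  # smooth squash into [-1, 1]


history = [float(i % 5) for i in range(40)]
print(round(normalize_signal(history, 10.0, fast_market=False), 3))
```

A bounded, signed output is what lets signals with very different raw units be compared and combined in later layers.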
Operating figures:
- 33-55 bipolar signals active per scan. Each signal has a separate `danger` dimension and `opportunity` dimension.
- 5 meta-factors with documented physical causation: liquidation cascade, equity risk, funding carry, sentiment delta, institutional flow.
- 3 persistent processes (PM2) running 24/7: main scan (cron 15 min), label generator (daily 8:00 UTC), liquidations WebSocket daemon.
- 21,954 scans recorded to date (~18KB per scan).
- 45 architecture proposals documented and versioned as internal RFCs.
The central finding — the edge does not live inside crypto.
As the engine accumulated real operation, two events validated, with data, the hypothesis that motivated the entire line: what moves Bitcoin is systematically outside of Bitcoin. Candles are the consequence. The causes live in tariffs, geopolitical announcements, Fed decisions, institutional flows registered off-chain, and divergences in equities and commodities.
Case 1 — Whale trap during the tariff rollout (Q2 Day 1, April 1, 2026)
On the first day of Trump's tariffs, BTC touched $69,310 and was violently rejected $1,100 lower. To the crypto-only observer, it was "just another red candle". To the engine it was a documentable sequence of 7 converging manipulation signals:
- MSTR -2.10% while BTC +0.59% — institutional divergence. Signal came from equities, not crypto.
- Full funding cycle (5 → 7 → 9 → 14 → 8 / 21 coins negative in 72h): classic stop-hunt pattern.
- Volume pump in dead Asian session (volume 0.174x of baseline) → cheap distribution by whales.
- $93M in shorts liquidated in NY session, followed by longs liquidated hours later (liq 1h ratio 0.97).
- Kalshi 74% probability of BTC at $65K in April → the prediction market did not buy the rally.
- MACD swing from +257 to -80 in under 24h — extreme technical reversal confirmed.
- FGI returned to 8 (extreme fear) in 12h.
The engine did not just detect the rejection: it anticipated it, on the record, through the crossing of the 7 signals, and produced a complete analysis of the stop-hunt pattern in 5 phases.
Case 2 — Rally on Iran ceasefire (April 7, 2026)
Trump announces Iran ceasefire → oil -17.3% → BTC $69K → $72.7K in hours. The engine had pre-registered the correct hypothesis days before: "BTC's bottom requires an external catalyst, most probable candidate: resolution of the Iran front". When the catalyst arrived:
- The pre-registered hypothesis was confirmed exactly; it was not post-hoc rationalization.
- The conditional cross-asset polarity module correctly inverted its oil reading from RISK_OFF to RISK_ON in real time.
- The abstention score dropped from 1.00 → 0.40 — first time in 474 consecutive scans.
- Bearish pressure rose from 22% → 41% during the rally — generating a quality contrarian advisory.
- The feedback engine produced two cases with ghost P&L and 6 proposed rules.
Each pipeline layer did what it was designed to do.
Operational implication — what was integrated into the engine:
- Equities with proven directional relevance (MSTR, COIN, mining stocks, SPY, QQQ, NVDA, AMD, TSLA, META, GOOGL).
- Macro and currencies (DXY, US 10Y bonds, M2 money supply from FRED).
- Commodities with risk-on/risk-off correlation (oil via USO and Yahoo CME, gold via PAXG, natural gas).
- Institutional CME futures as leading indicator of institutional flow.
- Prediction markets (Kalshi for event ladders, Polymarket for consensus pricing).
- Multi-source narrative layer (Google News RSS with mandatory `when:1d`, Brave Search, CoinDesk, wire services with differentiated truth-weight).
- On-chain ETF flows by issuer (Dune Analytics).
- Macro event calendar (FOMC, CPI, NFP, mega-cap tech earnings, tariff decisions).
- Conditional cross-asset polarity — the same oil move can be RISK_ON or RISK_OFF depending on context.
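The last item, conditional polarity, can be sketched as a function whose reading of the same oil move flips with context. The rule below is an assumed simplification for illustration: the `supply_shock_regime` flag, the 2% threshold, and the `NEUTRAL` state are hypothetical and do not come from the engine.

```python
from enum import Enum


class Polarity(Enum):
    RISK_ON = "RISK_ON"
    RISK_OFF = "RISK_OFF"
    NEUTRAL = "NEUTRAL"  # hypothetical state for moves too small to read


def oil_polarity(oil_change_pct: float, supply_shock_regime: bool,
                 threshold: float = 2.0) -> Polarity:
    """Read an oil move conditionally on context.

    Assumed rule, for illustration only: when oil trades on geopolitical
    supply fear, a drop signals de-escalation (RISK_ON) and a spike signals
    stress (RISK_OFF). Otherwise oil is treated as a demand proxy, so the
    reading is inverted.
    """
    if abs(oil_change_pct) < threshold:
        return Polarity.NEUTRAL
    falling = oil_change_pct < 0
    if supply_shock_regime:
        return Polarity.RISK_ON if falling else Polarity.RISK_OFF
    return Polarity.RISK_OFF if falling else Polarity.RISK_ON


print(oil_polarity(-17.3, supply_shock_regime=True))
print(oil_polarity(-17.3, supply_shock_regime=False))
```

The same -17.3% input produces opposite readings in the two calls, which is the point of making polarity conditional rather than fixed per asset.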
Engine operating philosophy:
The system informs. It never restricts. The agent decides.
No component prohibits actions. All emit advisories (risk + reason). The principle: no signals without context.
4. The brain modules
Four modules integrated in the cognitive system. Each answers a different question and all four intersect in a softmax decision over 9 possible actions.
4.1 Regime module — 7-channel classifier
Classifies EVERY event that moves BTC into one of 7 causal channels:
| Channel | Captures |
|---|---|
| Macro-Monetary | Fed, CPI, DXY, M2, rates |
| Geopolitical | Wars, sanctions, trade policy, international events |
| Institutional Flow | ETFs, treasuries, funds, allocations |
| Regulatory | Laws, court decisions, frameworks |
| Crypto-Native | On-chain, hacks, halvings, protocols |
| Energy-Commodity | Oil, gas, mining, electricity |
| Narrative-Sentiment | Media, sentiment indices, social |
Justification for the 7: historical validation 2014-2026 — each channel was dominant in at least one distinct era (Crypto-Native in Mt. Gox/Terra; Regulatory in Japan/ICOs; Narrative in 2017; Macro in COVID; Flow in ETF era 2024; Geopolitical in 2025-2026; Energy-Commodity emerging in 2025-2026).
Stress test: 12/12 scenarios passed. The classifier is robust to known historical shocks.
4.2 Narrative attribution module
Three layers (7 channels × 6 algorithms × v3 state). Detects narrative saturation as a contrarian signal — when everyone says the same thing, the structure under-confirms and the peak is near.
Recent implementations:
- Google News RSS baseline + wire-service tier interleaving (Truth Weight rose 0.450 → 0.570).
- News velocity score (change in acceleration of headline flow).
- Narrative arcs (7-day story structure).
- Breaking-news fast channel (accelerated channel for shocks).
- Tempo-adaptive cache TTL (reduces API consumption by 50%-98%).
- Truth weight registry (WIRE 0.75, INVESTIGATIVE 0.70, traditional 0.65, crypto-media 0.45-0.55).
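A truth weight registry of this kind can be sketched as a lookup table plus an aggregate over a batch of headlines. The tier weights are those listed above; the midpoint used for crypto-media, the averaging rule, and all names are assumptions, and the example does not attempt to reproduce the 0.450 → 0.570 figure.

```python
from statistics import mean

# Tier weights as listed in the text. Crypto-media is given as a 0.45-0.55
# range, so the midpoint used here is an assumption.
TRUTH_WEIGHT = {
    "WIRE": 0.75,
    "INVESTIGATIVE": 0.70,
    "TRADITIONAL": 0.65,
    "CRYPTO_MEDIA": 0.50,
}


def batch_truth_weight(source_tiers: list[str]) -> float:
    """Mean truth weight of a batch of headlines, given each headline's tier."""
    if not source_tiers:
        return 0.0
    return mean(TRUTH_WEIGHT[tier] for tier in source_tiers)


crypto_only = ["CRYPTO_MEDIA"] * 6
interleaved = ["CRYPTO_MEDIA", "WIRE"] * 3  # wire-service items interleaved
print(batch_truth_weight(crypto_only), batch_truth_weight(interleaved))
```

Interleaving higher-tier sources raises the batch average, which is the direction of effect the wire-service tier is described as having.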
4.3 The trained model experiment — evidence that a candle transformer is not a reliable predictor
Why this experiment exists in this report: during the program a natural question arose. What if we train a foundation model with our own data? Doesn't that solve the "candles aren't enough" problem? To answer it with evidence, we ran the experiment. The result confirmed the underlying backtesting thesis.
Pre-registered hypothesis: a foundation model fine-tuned on our own data could produce a directional signal with sufficient accuracy to sustain trading decisions.
Kill condition: if accuracy on unseen data does not robustly and consistently exceed 65% on big moves (>2%), the model does not qualify as a predictor for capital. It only qualifies as a confidence/abstention sensor.
What was done (2026-04-09, 2.5h, $0 cloud):
We took an open-source foundation transformer (4.1M parameters, decoder-only, pre-trained on 12 billion candles from 45 exchanges) and fine-tuned it on 19,392 1H BTC/USDT candles (resampled from the 1.16M 1m candles).
Training pipeline:
- Phase 1 — Tokenizer (15 epochs, LR 0.0001): learns the "grammar" of BTC. Recon Loss converged at epoch 1 (0.0029).
- Phase 2 — Base Model (10 epochs, LR 5e-7 very conservative): learns price sequences without destroying prior knowledge. Validation loss 2.7184.
Hardware: Apple M1 (MPS — Metal). 300MB RAM peak. CPU 45-80%. Total 147 minutes.
Evaluation results (30 points distributed over 2 years, unseen data):
| Metric | Pre-trained | Fine-tuned | Δ |
|---|---|---|---|
| Direction accuracy (all) | 47% | 63% | +16pp |
| Small moves (<2%) | 50% | 71% | +21pp |
| Big moves (>2%) | 40% | 54% | +14pp |
| MAPE (price prediction) | 1.22% | 1.25% | ~equal |
Honest reading of the results:
- 63% direction accuracy sounds like success, and it is better than random. But on big moves accuracy drops to 54%, barely better than a coin flip, and the evaluation sample is small (N = 30 points).
- 71% on small moves is attractive but small moves are exactly where fees and slippage destroy the edge.
- MAPE does not improve — the model did not learn to predict value, it learned to classify sign on a small sample.
- Most relevantly: it does not meet the pre-registered kill condition.
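The low-N caveat can be quantified with an exact binomial test against a coin flip. The sketch below assumes that 63% of 30 evaluation points corresponds to 19 correct calls, an inference from the reported percentage rather than a count given in the report.

```python
from math import comb


def binomial_p_value(successes: int, n: int, p0: float = 0.5) -> float:
    """One-sided exact p-value: P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


n_eval = 30                   # evaluation points, as stated in the text
hits = round(0.63 * n_eval)   # 19 correct calls, inferred from the 63% figure
p = binomial_p_value(hits, n_eval)
print(f"{hits}/{n_eval} correct, one-sided p = {p:.3f}")
```

With 19 of 30 the one-sided p-value comes out near 0.10, so even the headline accuracy is not distinguishable from chance at the conventional 0.05 level on this sample.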
Verdict: the experiment confirmed the structural limit already intuited by the rest of the backtesting: a transformer trained on candles, even with a serious foundation model behind it, is not a reliable predictor.
The defensible use the model does have — confidence sensor, not predictor:
- When the model has high confidence → price is reasonably explained by its own history → it is safe to look at technical indicators.
- When the model has low confidence → something external (macro, narrative, geopolitics) is moving the market → one should stop looking at candles and pay attention to cross-asset context.
Under that reading, the model functions as the engine's attention router: it indicates when not to trust the price.
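One simple way to express this routing is a confidence-weighted split between the two tracks. The linear split and the function names below are assumptions made for illustration; the report does not specify how the engine maps confidence to weights.

```python
def route_attention(confidence: float) -> dict[str, float]:
    """Split attention between the technical and macro/narrative tracks.

    Assumption: a linear split on the model's confidence in [0, 1].
    """
    c = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return {"technical": c, "macro_narrative": 1.0 - c}


def blended_view(technical_score: float, macro_score: float, confidence: float) -> float:
    """Confidence-weighted blend of two track scores, each in [-1, 1]."""
    weights = route_attention(confidence)
    return (weights["technical"] * technical_score
            + weights["macro_narrative"] * macro_score)


# Low confidence: the macro track dominates even if technicals look bullish.
print(blended_view(technical_score=1.0, macro_score=-1.0, confidence=0.25))
```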
Current runtime: PM2 sidecar every 5 min. Fetch of 360 candles → forward pass on MPS (~4s) with 30 Monte Carlo paths → JSON output with `direction`, `confidence`, `volatility_forecast`. The `confidence` signal is what the engine consumes.
4.4 Confluence and decision engine
Combines the 33-55 bipolar signals + the transformer model + the active hypotheses + the regime classifier into a confluence score that decides the final recommendation with softmax over 9 actions (REDUCE_CARRY / MONITOR / HOLD / BUY_CORE / BUY_SATELLITE / SELL_SATELLITE / SELL_CORE / FULL_EXIT / FULL_ENTRY) with temperature 0.45.
Strategic memory: the transformer model acts as an attention router — high confidence weights the technical track, low confidence weights the macro/narrative track.
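The final step, a softmax over the 9 actions at temperature 0.45, can be sketched as follows. The action names and the temperature come from the text; the input scores are hypothetical, and how the engine actually builds them is not shown here.

```python
import math

ACTIONS = [
    "REDUCE_CARRY", "MONITOR", "HOLD", "BUY_CORE", "BUY_SATELLITE",
    "SELL_SATELLITE", "SELL_CORE", "FULL_EXIT", "FULL_ENTRY",
]


def softmax_with_temperature(scores: list[float], temperature: float = 0.45) -> list[float]:
    """Turn raw action scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical confluence scores, one per action.
scores = [0.1, 0.6, 0.4, 0.2, 0.0, -0.1, -0.3, -0.5, -0.2]
probs = softmax_with_temperature(scores)
best = ACTIONS[probs.index(max(probs))]
print(best, round(max(probs), 3))
```

A temperature below 1 concentrates probability on the top-scoring actions while still leaving a distribution rather than a hard argmax, so the output remains an advisory over alternatives.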
4.5 Learning engine (feedback engine)
Each engine decision is recorded with its full context. When the system recommends something and the external agent does NOT execute it, the engine calculates what would have happened — the ghost P&L — and uses it as a signal to validate or invalidate the current rule set.
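The ghost P&L calculation can be sketched as the return a skipped recommendation would have earned over an evaluation horizon. The record fields, the horizon handling, and the `FILTERS_COSTING` label are hypothetical; only `FILTERS_PROTECTING` appears in the engine's output.

```python
from dataclasses import dataclass


@dataclass
class SkippedRecommendation:
    """A recommendation the external agent did not execute (illustrative fields)."""
    side: str            # "BUY" or "SELL"
    entry_price: float   # price when the recommendation was issued
    exit_price: float    # price at the end of the evaluation horizon
    size_usd: float      # notional the recommendation would have used


def ghost_pnl(rec: SkippedRecommendation) -> float:
    """Hypothetical P&L in USD if the skipped recommendation had been executed."""
    direction = 1.0 if rec.side == "BUY" else -1.0
    return direction * rec.size_usd * (rec.exit_price - rec.entry_price) / rec.entry_price


def ghost_verdict(total: float) -> str:
    """Negative ghost P&L means that skipping the trades avoided a loss."""
    return "FILTERS_PROTECTING" if total < 0 else "FILTERS_COSTING"


rec = SkippedRecommendation(side="BUY", entry_price=70_000, exit_price=69_300, size_usd=100)
total = ghost_pnl(rec)
print(round(total, 2), ghost_verdict(total))
```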
Synthetic output of a case:
"ghost_pnl_weekly": -2.30,
"ghost_verdict": "FILTERS_PROTECTING (ghost trades would have lost $2.30)",
"patterns_discovered": [...],
"rules_proposed": [...]
4.6 Postmortem log — continuous engine calibration
Perhaps the most valuable methodological asset of the project, and the least visible from outside, is the engine postmortem log. Every time the system fails to correctly interpret a situation, the event is documented with its evidence, failure mechanism, ghost P&L, generalizable pattern, and candidate rules.
Current state: 20 postmortems published internally (April 2026), covering approximately 6,000 scans.
Typology of documented cases:
| Category | Count | Example finding |
|---|---|---|
| Data blind spots | 4 | The system could not see sustained institutional buying (ETF inflow treated as a moderate contra-narrative when it was an active signal). |
| Catalyst gaps | 3 | Macro-political events reached the pipeline after the price had already reacted. |
| Behavioral failures | 3 | Excessive caution: 378 consecutive abstentions during a +5.2% bullish move. |
| Design asymmetries | 2 | The system treated long and short symmetrically when the risk is not symmetric. |
| Pipeline or parsing bugs | 4 | Stale news parsing during the most active day of the month. |
| Fix validations | 3 | "Inverse" postmortems: an institutional metric went from 0.016 → 0.580 (a factor of roughly 36x) after correcting the inflow signal calculation. |
| Behavioral milestones | 1 | First time in 474 consecutive scans that the abstention score dropped from 1.00 to 0.40. |
Aggregate metrics:
- 20 postmortems structured with mandatory sections.
- 48 candidate rules generated (R-001 to R-048), a fraction of which have been promoted to production after validation with sufficient N.
- ~$9K of cumulative ghost P&L documented in a 7-day window.
- Learning density: one postmortem every ~300 scans on average during the peak calibration phase.
Why this matters. The log is what turns the engine into something different from "another signal system". It is public calibration in narrative form: every time the system is wrong, the error is documented with its evidence, its causal mechanism, and the rule proposed so it does not happen again.
5. The laws the research crystallized
These are the "constitutional laws" that survived all the experiments. Each is backed by reproducible evidence.
On strategies
- Regime filter IS the edge. With filter (best benchmark): +$715. Without filter (worst benchmark): -$559.
- Long-only improves quality. PF rises by 0.36 when shorts are removed. 6/6 FAIL on direct shorts.
- 4H is the sweet spot for BTC. Daily: 4-17 trades in two years (inoperable). 1H: noise. 3m: catastrophic (PF 0.07 with 3,241 trades).
- Threshold > factors. A variant with score≥65/100: +$715. A variant with confluence of 6 factors and score≥4/6: breakeven.
- Frequency is a multiplier, not edge. No edge + freq = disaster. With edge + freq = $$.
- Backtests <6 months lie. A multi-TF DCA variant showed PF 4.56 in 4 months → PF 0.07 in 808 days.
- Popularity ≠ edge. A strategy with 8.8K favorites on TradingView produces PF 0.65 on real data.
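Several of these laws are stated in terms of PF (profit factor), which is conventionally gross profit divided by gross loss. A minimal sketch with hypothetical trade data:

```python
def profit_factor(trade_pnls: list[float]) -> float:
    """Gross profit divided by gross loss; values below 1.0 lose money overall."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


trades = [12.0, -8.0, 5.0, -10.0, 3.0, -2.0]  # hypothetical per-trade PnL in USD
print(profit_factor(trades))
```

On this reading, a PF of 0.07 means that losses were roughly fourteen times larger than gains, and a PF of 0.65 means the strategy gave back more than it earned.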
On methodology
- Hypothesis BEFORE touching the data. Once. If it fails, reformulate — do not adjust parameters.
- N<100 invalidates fine refinements.
- Abstention rules > optimizations.
- Reproducible audit or it is not a number. The first backtest invalidated its own results once look-ahead bias (matching against future data) was detected.
- Soul + Status + Strategy + Observer + Log = the 5 context files that are now a template for any new experiment.
On the market
- The market has regimes. The same strategy works or destroys depending on the regime.
- Catalysts are external. Candles are the effect. Liquidations, ETF flows, VIX, DXY, and negative funding are the cause.
- Narrative saturates. When everyone says the same thing, the structure under-confirms. Detecting it is a measurable contrarian signal.
- Carry is an anomaly, not a strategy. 51% of the PnL in the largest experiment comes from carry — and that is structural to the derivatives market.
6. Data and tools inventory
6.1 Raw data available
Market layer (~9.2 GB):
- 1-minute BTC/USDT candles — 1,163,520 continuous candles between January 2024 and March 2026 (808 days), sourced from Binance spot.
- Engine v1 snapshots — 10,228 captures (every ~12 min, 47+ fields), 88 days of operation.
- Engine predictions — ~13 per day over the same period, with known outcome.
- Production backups — SQL and JSON dumps.
Cognition layer (~5.0 GB):
- 21,954 complete scans of the cognitive engine (~18 KB per scan), with five structured objects: bipolar signals, decision, engine state, active hypotheses, anomaly report.
- Derivatives analysis — rolling model evaluation, feedback engine cases with ghost P&L, RL validations.
- Narrative state — v3 state, 7-day narrative arcs, history of detected saturations.
- Model documentation — research, glossary, training log, and internal paper.
6.2 Integrated APIs (36+ active)
Tier 1 — always active:
- Binance (candles, funding, OI, taker volume, L/S ratio, orderbook 1m)
- Coinglass (liquidations, aggregate funding, OI, ETF flows, FGI — paid $35/mo)
- Yahoo CME (ES, NQ, CL, GC, DXY, Nikkei, KOSPI futures)
- CoinGecko (market cap, ATH, supply)
- Coinbase (Coinbase Premium proxy)
- DefiLlama (TVL 7d change)
Tier 2 — US hours (07-21 UTC):
- TwelveData (SPY, VIX, MSTR, COIN, USO, RSI — free 700/day with 15min cache)
Tier 3 — overnight hours:
- xStock / DexScreener (SPYx, QQQx, TSLAx, NVDAx — premium outside US hours)
Tier 4 — narrative and prediction markets:
- Brave Search (news with per-query cache, freshness, count)
- Google News RSS (5 queries with mandatory `when:1d`, tempo-adaptive cache TTL)
- CoinDesk RSS
- Kalshi (event ladder, monthly, yearly)
- Polymarket (consensus yearly, $27M vol)
- Dune (on-chain ETF flows by issuer, free 40 queries/day with 60min cache)
Tier 5 — alternative data (research):
- Santiment, FRED (M2, yields), Hyperliquid Reservoir (Hyperliquid 1s candles, fills, liquidations).
6.3 Tools built
- TypeScript runner validated 100% against TradingView CSV (KST as reference).
- Agent-browser — TradingView scraping with progressive scroll for virtualized tables.
- Semantic search engine — local LanceDB with 10 algorithms, cross-encoder reranking, query expansion, gap detection. Indexes code + runs + data. Available via MCP server.
- PM2 ecosystem — 3 persistent processes: cognitive scan (cron 15 min), labels (daily 8:00 UTC), WebSocket daemon (24/7).
- Replay engine — re-runs any run with new code. "Inverse archaeology" over 21,954 real scenarios.
- Stress-test CLI — asks the system what breaks if price drops to an arbitrary level.
- Diagnose — behavioral audit of the system over any time window.
6.4 Documented research
Over 40 research documents written during the foundations phase, groupable into four families:
Costs and market mechanics:
- Real operating costs (commission, slippage, funding).
- Limit vs market order simulation.
- Capital management, Kelly criterion, and volatility-based sizing.
Honest evaluation methods:
- Probabilistic backtesting (Monte Carlo, CPCV, Deflated Sharpe Ratio).
- Validation techniques (walk-forward, purged k-fold).
- Anti-overfitting (Holm-Bonferroni, multiple-testing correction).
- Causal inference and confounding variable control.
- Bayesian evaluation with posterior updating.
- Information theory (entropy, mutual information) for feature selection.
- Reproducible audit checklist.
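Of the methods listed, the Holm-Bonferroni correction is compact enough to show in full. The sketch below implements the standard step-down procedure; the p-values are hypothetical and the function name is illustrative, not taken from the project's tooling.

```python
def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return, for each hypothesis, whether it is rejected under Holm's step-down rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return rejected


p_vals = [0.001, 0.04, 0.03, 0.20]  # hypothetical p-values from four strategy tests
print(holm_bonferroni(p_vals))
```

Two of the four hypothetical tests have p < 0.05 on their own but do not survive the correction, which is how testing many strategies without adjustment manufactures false positives.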
How humans and institutions think:
- Expert trader mental model.
- Institutional edge 2026 — what Renaissance, Citadel, Two Sigma actually do.
- External landscape analysis of bots and LLM-trading platforms 2026.
- Agent architecture for trading.
Technical frameworks evaluated (several invalidated with data):
- Multi-head attention applied to signals.
- Smart Money Concepts (invalidated by insufficient N in the main line).
- Published BTC strategies 2026 — comparative analysis.
7. Tangible achievements — what was built and validated
The central achievement of the work is the market cognition engine running in production and the reproducible evidence that delimits which hypotheses survive and which do not.
Cognitive engine (central achievement)
- 12-layer pipeline working end-to-end from raw fetch to causal attribution with softmax over 9 actions.
- 21,954 scans recorded, available for replay and retrospective analysis.
- 5 meta-factors with documented physical causation (liquidation cascade, equity risk, funding carry, sentiment delta, institutional flow).
- 45 architecture proposals numbered and versioned as internal RFCs.
- Documented end-to-end operating case — the April 7, 2026 event, where the system classified regime, detected the external catalyst, and attributed drivers correctly before price reflected the movement.
Quantitative research — hypotheses separated by evidence
- 8 foundations lines completed, each with its question, methodology, and reproducible verdict.
- 3 Tier S cross-asset features with ICIR > 1.5.
- 4 Tier 1 signals with Hit Rate 75-100% and sufficient N.
- 2 technically validated strategies on walk-forward + significance.
- 0 pure technical strategies that survive the rigorous significance test with p<0.05 — this is also an achievement.
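For reference, ICIR is commonly defined as the mean of a feature's information coefficient (IC) divided by its standard deviation across periods. The sketch below uses that definition with hypothetical IC values; whether the project annualizes or otherwise scales the ratio is not stated, so none is applied.

```python
from statistics import mean, stdev


def icir(ic_series: list[float]) -> float:
    """Information-coefficient information ratio: mean(IC) / sample std(IC)."""
    if len(ic_series) < 2:
        raise ValueError("need at least two IC observations")
    spread = stdev(ic_series)
    if spread == 0:
        raise ValueError("IC series has zero variance")
    return mean(ic_series) / spread


period_ic = [0.08, 0.05, 0.07, 0.06, 0.09, 0.04]  # hypothetical per-period ICs
print(round(icir(period_ic), 2))
```

A high ICIR rewards consistency: a feature with a modest but stable IC scores better than one with a larger but erratic IC.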
Capture and persistence infrastructure
- 3 persistent processes running 24/7 without significant downtime since March 2026.
- Multi-tier adaptive cache that reduces API consumption between 50% and 98% depending on detected volatility.
- Geo fallback system with automatic regional block detection.
- Versioned persistence — every scan, every decision, every hypothesis recorded in structured JSON.
- WebSocket daemon — continuous capture of Hyperliquid and Binance liquidations.
AI model — experiment that delimited a limit, not a product
- 1 fine-tune experiment executed on a foundation transformer, on local hardware, with no cloud cost.
- Explicit documented verdict: the model does not qualify as a directional predictor for capital decisions. Its defensible use is as a confidence/abstention sensor (attention router).
- Sidecar in production feeding the `confidence` signal to the engine, not the `direction` signal.
- This "achievement" is valuable for what it excludes from the product, not for what it confirms.
8. Visual synthesis: initial hypothesis → evidence → final product
This block compresses, in schematic form, the complete journey of the work: the hypothesis we started with, the accumulated evidence that invalidated it, the conclusion that survived that evidence, and the final form of the product.
INITIAL PROJECT HYPOTHESIS:
"Let's build an agent that predicts BTC with engine data."
EVIDENCE ACCUMULATED OVER 16 MONTHS:
├─ 144 scripts in main line: pure trading PnL ≈ $-26. p=0.073 (not significant).
├─ 30 scripts in benchmarks line: only 1/37 strategies beats B&H.
├─ Derivatives line: 0/28 binary signals survive statistical test.
├─ Carry = 51% of PnL — structural anomaly, not a strategy.
├─ Cross-asset: liq, bonds, equities have ICIR>1.5 (better than any technical).
├─ Calibration: 4 signals with HR 75-100% — all are CONTEXT, not direct entry.
├─ Custom-trained AI foundation model: 47%→63% direction accuracy.
└─ Milestone Apr-7: the system got the regime right while price lied.
HONEST CONCLUSION:
What the research PROVED works is NOT "predicting".
It is: classify regime + attribute drivers + detect saturation + calibrate publicly.
FINAL PRODUCT:
From "automated trading platform" → "probabilistic market intelligence layer".
Three coordinated surfaces: Dashboard + MCP + Alerts.
One atomic unit: the Card.
One moat: publicly verifiable calibration.
THE PIVOT DOES NOT RESTART TRADIT.
THE PIVOT SELLS WHAT TRADIT ALREADY FOUND.
9. Quick reference figures
For use in pitch decks, stakeholder conversations, or public materials:
| Metric | Value | Context |
|---|---|---|
| Days of data | 808 continuous | January 2024 → March 2026 |
| 1m candles captured | 1,163,520 | BTC/USDT spot Binance |
| Scripts run | 400+ | Numbered, reproducible |
| Experimental lines | 12 | Each with a distinct question |
| Engine v1 snapshots | 10,228 | Every ~12 min |
| Cognitive engine scans | 21,954 | Every 15 min, structured JSON |
| Integrated APIs | 36+ | 5 latency/cost tiers |
| Fine-tuned model | 4.1M params | 2.5h on M1, $0 cloud |
| Direction accuracy | 47% → 63% | +16pp post fine-tune |
| Strategies validated stat. | 2 | Walk-forward + significance |
| Tier 1 signals | 4 | HR 75-100%, sufficient N |
| Tier S features | 3 | ICIR > 1.5 |
| Regime stress test | 12/12 | Passed |
| Architecture proposals | 45 | Numbered and versioned |
| PM2 production processes | 3 | No significant downtime since March 2026 |
| API consumption reduction | 50% – 98% | Adaptive cache by volatility |
| Trading engine p-value | 0.073 | NOT significant — the system declares it |
10. Conclusion — from research to product
The 16 months of research documented in this report validated the product pivot proposal with data. The accumulated evidence points in a single direction and allows a clear separation of what works from what does not.
What did NOT work — the honest challenge of the agent as portfolio manager
The paradigm of "an AI agent that actively manages a trading portfolio" — the promise most current LLM-trading projects start with — ran into, in our experiments and in public ones (Alpha Arena, Hyper-Alpha-Arena, etc.), structural barriers that cannot be solved by better model or better prompt alone. An agent can reason well about text and still (i) over-trade when context is ambiguous, (ii) hold theses against evidence, (iii) size risk poorly under leverage, (iv) fail at abstention discipline, and (v) lack calibrated memory of its own hit rate. This is not a model failure — it is a context failure.
Our own exploration of predictive models on candles, in which we fine-tuned an open-source foundation transformer on our own data, confirmed the same thesis from another angle: not even a model trained specifically on our data qualifies as a reliable directional predictor for moving capital. The edge was not there.
What DID work — the enormous potential of the engine as a market direction map
The most unexpected front, and ultimately the most valuable, was the market cognition engine. Built as infrastructure to feed the agent, it ended up being the central asset of the project. Its demonstrated capacity to classify market regime, attribute causal drivers, detect narrative saturation, identify external catalysts before they are reflected in price, and operate 24/7 with auditable calibration is exactly what is missing in the space today: not more agents that predict, but a structured layer of market intelligence on which any agent — internal or external — can decide better.
The natural pivot — Agent Market Intelligence
From the combination of both findings the product emerges: an Agent Market Intelligence layer that delivers structured, calibrated, and verifiable context to trading agents (human or LLM), instead of competing with them as a predictor. The engine becomes the product; the agent becomes the client. Public calibration becomes the moat.
Backtesting was not a detour. It was the filter that separated the viable thesis (mapping context) from the thesis the industry keeps chasing in vain (predicting price). That separation is the net value of the work documented in this report.
Tradit does not predict the market. It maps its probabilities, names its drivers, declares its invalidation conditions, and publicly tracks its calibration. That sentence is not marketing — it is the scientific conclusion of 16 months of reproducible and auditable research.