Technical Methodology
Every formula, gate, and pipeline step that turns a YouTube transcript into a published score. No hand-waving — this is what actually runs. For the high-level overview, see Methodology.
Video ingestion
Every 6 hours we poll the YouTube Data API for new uploads across our curated channel list. Each channel has a stable channel_id and an admin-tuned reliability weight (1–100) that reflects the creator's track record.
New videos land in the videos table with metadata only. Transcript and mention extraction run as separate stages, so a failure in one stage doesn't block ingestion for the next channel.
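As a rough illustration of the metadata-only landing step, the sketch below uses an in-memory SQLite table. Only the videos table name, channel_id, and the transcript source tags come from this page; every other column name and the ingest_uploads helper are assumptions made for the example, and the real storage layer may differ.

```python
import sqlite3

# Hypothetical schema: column names other than channel_id are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    video_id          TEXT PRIMARY KEY,
    channel_id        TEXT NOT NULL,
    title             TEXT NOT NULL,
    published_at      TEXT NOT NULL,
    transcript        TEXT,                          -- filled later by the transcript stage
    transcript_source TEXT NOT NULL DEFAULT 'none'   -- youtube / ai / none
)
"""

def ingest_uploads(conn, uploads):
    """Insert metadata rows for unseen videos and return how many were new."""
    before = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO videos (video_id, channel_id, title, published_at) "
        "VALUES (:video_id, :channel_id, :title, :published_at)",
        uploads,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
    return after - before
```

Because the insert ignores video IDs that already exist, re-polling the same channel is harmless, and the transcript columns stay empty until the later stage fills them.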
Transcript pipeline
Two-pass strategy, every 5 minutes, batched per video:
- Pass 1: native captions. YouTube auto-captions (or creator-provided) via youtube-transcript. Free, fast, ~80% coverage.
- Pass 2: AI speech-to-text. For the ~20% with no captions, Google Gemini transcribes the audio directly. Slower and metered, but no video is left behind.
Both paths produce a single transcript column with the source tagged (youtube / ai / none) so we can audit quality later.
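The fallback order and source tagging can be sketched as a small function. The two fetchers are passed in as plain callables, so nothing here depends on the real caption or Gemini clients; fetch_transcript and the Fetcher type are hypothetical names for illustration only.

```python
from typing import Callable, Optional, Tuple

# A fetcher takes a video id and returns transcript text, or None if unavailable.
Fetcher = Callable[[str], Optional[str]]

def fetch_transcript(video_id: str, native: Fetcher, ai: Fetcher) -> Tuple[Optional[str], str]:
    """Try native captions first, then AI speech-to-text, and tag the source."""
    for source, fetch in (("youtube", native), ("ai", ai)):
        try:
            text = fetch(video_id)
        except Exception:
            # A failed pass falls through to the next one instead of aborting.
            text = None
        if text and text.strip():
            return text, source
    return None, "none"
```

Returning the tag alongside the text keeps the audit trail in one place: a caller can write both values into the transcript row in a single update.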
Mention extraction
Each transcript is passed to Gemini with a structured-output schema. The model returns a JSON array of mentions, each carrying:
{
  "ticker": "NVDA",
  "stance": "bullish" | "neutral" | "bearish",
  "confidence": 0.0 - 1.0,
  "timestamp_seconds": 412,
  "excerpt": "...verbatim quote..."
}
Stance reflects what the creator said about the ticker, not whether we agree. Confidence is the model's certainty that the ticker was actually being discussed (vs. a passing reference or a misheard word).
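Even with a structured-output schema, it may be worth re-checking each item on the receiving side. The sketch below parses a JSON array with the five fields listed above and drops anything that breaks the stated ranges. The Mention class and parse_mentions function are assumptions for illustration, not the production parser.

```python
import json
from dataclasses import dataclass

STANCES = {"bullish", "neutral", "bearish"}

@dataclass(frozen=True)
class Mention:
    ticker: str
    stance: str
    confidence: float
    timestamp_seconds: int
    excerpt: str

def parse_mentions(raw_json: str) -> list[Mention]:
    """Parse the model's JSON array, dropping items that break the schema."""
    mentions = []
    for item in json.loads(raw_json):
        try:
            m = Mention(
                ticker=str(item["ticker"]).strip().upper(),
                stance=str(item["stance"]),
                confidence=float(item["confidence"]),
                timestamp_seconds=int(item["timestamp_seconds"]),
                excerpt=str(item["excerpt"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # missing field or wrong type
        in_range = 0.0 <= m.confidence <= 1.0 and m.timestamp_seconds >= 0
        if m.ticker and m.stance in STANCES and in_range:
            mentions.append(m)
    return mentions
```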
Ticker validation
Models hallucinate. Before any mention can move forward, the ticker symbol is checked against Yahoo Finance — if the symbol doesn't resolve to a real listed instrument it's quarantined and removed by a weekly cleanup-invalid-tickers cron job. Validated symbols are cached in the tickers table with exchange, name, and logo.
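The validate-then-cache flow might look like the following. The resolve callable stands in for the Yahoo Finance lookup and is purely hypothetical, as are the validate_tickers name and the dict used in place of the tickers table.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical resolver: returns instrument details for a real symbol, or None.
Resolver = Callable[[str], Optional[dict]]

def validate_tickers(
    symbols: Iterable[str], resolve: Resolver, cache: Dict[str, dict]
) -> Tuple[List[str], List[str]]:
    """Split symbols into (validated, quarantined), caching successful lookups."""
    validated, quarantined = [], []
    # dict.fromkeys removes duplicates while keeping first-seen order.
    for symbol in dict.fromkeys(s.strip().upper() for s in symbols):
        if symbol not in cache:
            info = resolve(symbol)
            if info is None:
                quarantined.append(symbol)
                continue
            cache[symbol] = info
        validated.append(symbol)
    return validated, quarantined
```

Only successful lookups are cached in this sketch, so a quarantined symbol is re-checked if it shows up again before the weekly cleanup runs.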
Scoring formula
This is the real math, run on every recompute (auto-chained after every new transcript batch). For each ticker T across a rolling 14-day window:
sign = +1   if stance == "bullish"
       -1   if stance == "bearish"
       +0.2 if stance == "neutral"
mention_value = sign × confidence
Neutral mentions count, but barely: they nudge the score without dominating it. A creator saying "I'm watching NVDA" shouldn't move the needle the same as "I'm buying NVDA."
for each creator C who mentioned T:
    creator_avg[C] = mean(mention_value over C's mentions of T)
    creator_avg[C] = clamp(creator_avg[C], -1, +1)
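Averaging per creator prevents a single creator with many high-confidence bullish mentions from running away with the score: however often they mention T, one creator can move the raw sum by at most ± their weight. Because every mention_value already lies in [-1, +1], the clamp acts as a safeguard on that bound.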
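A minimal sketch of the per-creator step, assuming mentions for one ticker arrive as (creator, stance, confidence) tuples. That input shape and the creator_averages name are illustrative choices, not the pipeline's actual data model.

```python
from collections import defaultdict
from statistics import mean

# Sign table from the formula above; neutral counts at +0.2.
SIGN = {"bullish": 1.0, "bearish": -1.0, "neutral": 0.2}

def creator_averages(mentions):
    """mentions: iterable of (creator, stance, confidence) for one ticker."""
    values = defaultdict(list)
    for creator, stance, confidence in mentions:
        values[creator].append(SIGN[stance] * confidence)
    # The mean of values in [-1, 1] already lies in [-1, 1]; the clamp is a guard.
    return {c: max(-1.0, min(1.0, mean(v))) for c, v in values.items()}
```

For example, a creator with bullish mentions at 0.9 and 0.7 plus one neutral mention at 0.5 averages (0.9 + 0.7 + 0.1) / 3 ≈ 0.57, no matter how many more mentions they add at similar confidence.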
raw = Σ ( weight[C] × creator_avg[C] ) for all creators C of T
theoretical_max = max_weight × number_of_creators
consensus_bonus = 1.15 if creator_count >= 3 else 1.00
deviation = (raw / theoretical_max) × 50 × consensus_bonus
score = round( clamp( 50 + deviation, 0, 100 ) )
50 is neutral. Pure bullish consensus across high-weight creators pushes toward 100; bearish pushes toward 0. The 15% consensus bonus rewards independent corroboration — three creators agreeing is meaningfully different from one creator shouting.
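The aggregation can be sketched in a few lines. Two things here are assumptions rather than facts from this page: max_weight is taken to be 100 (the top of the 1 to 100 weight range), and a ticker with no creators is treated as neutral. The rounding rule is also unspecified; Python's round uses round-half-to-even.

```python
def score(creator_avg, weight, max_weight=100):
    """creator_avg and weight are dicts keyed by creator; returns an int in 0..100."""
    if not creator_avg:
        return 50  # assumption: no mentions means neutral
    n = len(creator_avg)
    raw = sum(weight[c] * avg for c, avg in creator_avg.items())
    theoretical_max = max_weight * n  # assumption: max_weight is the scale ceiling
    consensus_bonus = 1.15 if n >= 3 else 1.00
    deviation = (raw / theoretical_max) * 50 * consensus_bonus
    return round(max(0, min(100, 50 + deviation)))
```

Worked through by hand: three creators with weights 80, 60, 40 and averages 0.9, 0.5, -0.2 give raw = 72 + 30 - 8 = 94 and theoretical_max = 300, so deviation = (94 / 300) × 50 × 1.15 ≈ 18.02 and the score rounds to 68.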
Coverage gates
A score alone doesn't get published. The pick must clear:
- ≥ 2 distinct creators, OR
- ≥ 3 total mentions from the same creator
This kills single-mention noise. A ticker namedropped once by one creator never becomes a published recommendation, regardless of how confidently the model extracted it.
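The gate reduces to a count per creator. In this sketch the input is one creator identifier per mention of the ticker within the window; passes_coverage_gate is a hypothetical name.

```python
from collections import Counter

def passes_coverage_gate(mention_creators):
    """mention_creators: one creator id per mention of the ticker in the window."""
    counts = Counter(mention_creators)
    two_creators = len(counts) >= 2
    three_from_one = any(n >= 3 for n in counts.values())
    return two_creators or three_from_one
```

So ["a"] and ["a", "a"] fail, while ["a", "b"] and ["a", "a", "a"] pass.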
The previously-published score remains live until the next recompute changes it by ≥ 5 points or the contributing mentions change — at which point the AI thesis is invalidated and regenerated.
AI reasoning regeneration
Every published pick carries a written thesis — the "why" you see on the stock page. It's generated by a separate Gemini pass that ingests:
- The ticker's verbatim mention excerpts (with timestamps)
- Each contributing creator's name and weight
- The final score and stance distribution
When the underlying mentions or score shift materially, the thesis is marked stale and re-queued. A worker drains the queue every 5 minutes, so the public site never shows a thesis that contradicts the current score.
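One way to express the staleness check, using the 5-point threshold stated under Coverage gates. The mention identifiers and the thesis_is_stale name are assumptions; the actual worker may track changes differently.

```python
def thesis_is_stale(published_score, new_score,
                    published_mention_ids, new_mention_ids, threshold=5):
    """True when the score moved by >= threshold points or the mention set changed."""
    score_moved = abs(new_score - published_score) >= threshold
    # Compare as sets so a reordering of the same mentions is not a change.
    mentions_changed = set(published_mention_ids) != set(new_mention_ids)
    return score_moved or mentions_changed
```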
Backtest engine
We measure ourselves. Every published pick with score ≥ 70 is auto-enrolled into a hypothetical $100 position, captured at the next trading day's open after publication.
entry_date = first trading day after published_at
entry_price = open[entry_date]
spy_entry = open[entry_date] for SPY

window prices captured at +1M, +3M, +6M, +1Y:
    pick_return = (price[window] / entry_price) - 1
    spy_return = (spy[window] / spy_entry) - 1
    alpha = pick_return - spy_return
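The return arithmetic for a single window can be sketched as below. Price lookup and trading-calendar handling are left out; the Snapshot class and snapshot function are illustrative names only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    pick_return: float
    spy_return: float
    alpha: float

def snapshot(entry_price, window_price, spy_entry, spy_window):
    """Compute pick return, SPY return, and their difference for one window."""
    pick_return = window_price / entry_price - 1
    spy_return = spy_window / spy_entry - 1
    return Snapshot(pick_return, spy_return, pick_return - spy_return)
```

For instance, a pick entered at 100 and priced at 112 when the window matures, against SPY moving from 400 to 420, gives a 12% pick return, a 5% SPY return, and an alpha of about 7 percentage points. The hypothetical $100 position would be worth about $112.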
Snapshots fill in automatically as windows mature (daily cron). We aggregate per creator and site-wide to a leaderboard. The dashboard is currently admin-only until the sample size is large enough to publish honestly — back-filling old picks with synthetic entry prices would distort the record, so we're letting it accumulate forward-only.
Assumes no fees, slippage, taxes, or dividends. Alpha (α) is purely illustrative; past performance doesn't guarantee future results.
Limits & honesty
What we explicitly do not do:
- No price targets. We don't predict where a stock will go — we surface what high-conviction creators are saying, weighted by their reliability.
- No recency decay yet. Within the 14-day window all mentions count equally. A future revision may weight more recent mentions higher.
- No sector or macro weighting. A 100 score on a microcap means the same thing as a 100 score on a megacap. Position-sizing is on you.
- No emphasis on short signals. Bearish stances are tracked symmetrically, but in practice most viewers act on long ideas.
- Not investment advice. This is a signal aggregator. Read the disclaimer.