---
title: Scoring System Research & Design
category: pipeline
last_verified: 2026-02-28
related_files:
  - src/score.js
  - src/stages/rescoring.js
  - prompts/CONVERSION-SCORING-VISION.md
  - prompts/CONVERSION-SCORING-NOVIS.md
  - docs/03-pipeline/scoring-system.md
tags: [scoring, CRO, research, rubric, design-decisions]
status: active
---

# Scoring System Research & Design

This document captures the original deep-research conversation that produced the 333 Method's website conversion scoring system. It covers the research foundations, rubric design rationale, factor weighting decisions, screenshot strategy, rescoring approach, contact extraction, and implementation considerations.

> **Source:** Open-WebUI chat with OpenRouter (January 2026), exported and consolidated.

---

## Table of Contents

1. [Research Question](#1-research-question)
2. [Foundations: Why These Factors](#2-foundations-why-these-factors)
3. [The Nine-Factor Rubric](#3-the-nine-factor-rubric)
4. [Factor Weights & Rationale](#4-factor-weights--rationale)
5. [Score Calculation & Grading Scale](#5-score-calculation--grading-scale)
6. [Screenshot Strategy](#6-screenshot-strategy)
7. [Two-Pass Architecture](#7-two-pass-architecture)
8. [Contact Extraction on Rescore](#8-contact-extraction-on-rescore)
9. [Popover Handling](#9-popover-handling)
10. [LLM Prompt Design](#10-llm-prompt-design)
11. [Validation & Calibration](#11-validation--calibration)
12. [Token Optimization](#12-token-optimization)
13. [Implementation Notes](#13-implementation-notes)
14. [Appendix: Original JSON Schemas](#14-appendix-original-json-schemas)

---

## 1. Research Question

The original question that kicked off this research:

> Propose a scoring system for website conversion (with standard school grading of A+ to F) based on factors such as clear offer, CTA, urgency, hook, strong headline, strong value proposition, clear reason to choose them (USP), no generic stock photos, trust elements (reviews, badges, guarantees), and anything else from best practices.
>
> This will be provided to an LLM to calculate the score, along with the HTML of the DOM after pageload and one or more screenshots. Please advise whether this scoring system will require a full-page screenshot, or will just an above-the-fold and maybe the first below-the-fold screenshots be sufficient to produce a reasonable score? This system will be scoring many hundreds of thousands of websites, so minimising LLM token usage is more important than score accuracy.

---

## 2. Foundations: Why These Factors

The scoring factors were selected by synthesizing established CRO (Conversion Rate Optimization) best practices, prioritization frameworks (RICE, PIE), and behavioral psychology research. The factors map to three broad categories of conversion influence:

### Messaging Clarity & Value Communication

- **Headline Quality** — The primary hook; must communicate what, who, and why within 3-5 seconds
- **Value Proposition** — Extends the headline; shifts from features to benefits ("what's in it for me?")
- **Unique Selling Proposition** — Why choose _this_ option over alternatives
- **Clear Offer** — What exactly is the visitor being asked to do, and what do they get

### User Confidence & Trust

- **Trust & Credibility Signals** — Testimonials, certifications, badges, partner logos, media mentions
- **Authentic Imagery** — Real product photos vs. generic stock; professional visual design

### Action & Engagement

- **Call-to-Action** — Copy clarity, visual prominence, and placement
- **Urgency/Scarcity** — Legitimate time/supply pressure for immediate action
- **Hook & Engagement** — Hero element that captures attention in the first seconds

### Additional Context

- **Contextual Appropriateness** (3% weight) — Whether the design serves its specific business-model context (B2B SaaS, e-commerce, and local services have different norms)

### Key Research Findings

- Users spend 57-80% of viewing time on above-the-fold content (Nielsen Norman Group)
- Google found above-fold ads achieve 73% viewability vs. 44% below-fold, a relative gap of about 66% (the "fold cliff")
- 90% of users begin scrolling within 14 seconds, but _only if above-fold content signals value_
- CTA copy changes alone can generate conversion improvements exceeding 200%
- GPT-4 Vision studies found cropped images actually _outperform_ full-page captures for identification tasks, because background noise reduces accuracy

---

## 3. The Nine-Factor Rubric

Each factor is scored 0-10 against specific rubric definitions. The rubric has nine core factors plus a tenth contextual factor. Below is the condensed version; the full version with examples lives in the LLM prompts.

### Factor 1: Headline Quality & Clarity (15%)

| Score | Description                                                                                           |
| ----- | ----------------------------------------------------------------------------------------------------- |
| 9-10  | Immediately communicates value; benefit-oriented; specific; creates curiosity or emotional connection |
| 7-8   | Clearly communicates basic benefit; mostly specific; adequate direction                               |
| 5-6   | Communicates a benefit but somewhat generic; requires modest interpretation                           |
| 3-4   | Vague, generic, or fails to communicate core benefit                                                  |
| 1-2   | Confusing, contradictory, or essentially absent above-fold                                            |
| 0     | No discernable headline or actively confusing                                                         |

### Factor 2: Value Proposition Clarity (14%)

| Score | Description                                                     |
| ----- | --------------------------------------------------------------- |
| 9-10  | Specific, benefit-oriented, compelling; clearly differentiates  |
| 7-8   | Clear and benefit-focused; adequately articulates core benefits |
| 5-6   | Present but generic or feature-heavy; requires interpretation   |
| 3-4   | Vague or feature-focused; unclear differentiation               |
| 1-2   | Barely present or confused with feature lists                   |
| 0     | No value proposition or contradictory messaging                 |

### Factor 3: Unique Selling Proposition (13%)

| Score | Description                                                       |
| ----- | ----------------------------------------------------------------- |
| 9-10  | Clear, compelling differentiation; specific competitive advantage |
| 7-8   | Reasonably clear; specific advantage identified                   |
| 5-6   | Some differentiation implied but not explicit                     |
| 3-4   | Vague; relies on generic claims ("best in class")                 |
| 1-2   | Barely present; no clear reasons to choose                        |
| 0     | No differentiation; appears identical to generic competitors      |

### Factor 4: Call-to-Action Design & Placement (13%)

| Score | Description                                                                                                 |
| ----- | ----------------------------------------------------------------------------------------------------------- |
| 9-10  | Visible above fold; specific action-oriented language; visually prominent; secondary CTAs at natural breaks |
| 7-8   | Visible above fold; action-oriented; reasonably prominent                                                   |
| 5-6   | Present; clear but generic ("Submit", "Learn More"); adequate placement                                     |
| 3-4   | Present but not prominent; vague language; requires scrolling                                               |
| 1-2   | Hard to find, confusing, or inadequately prominent                                                          |
| 0     | No CTA or buried below multiple scrolls                                                                     |

### Factor 5: Urgency & Scarcity (10%)

| Score | Description                                                           |
| ----- | --------------------------------------------------------------------- |
| 9-10  | Legitimate urgency with specifics (deadline, count); genuine pressure |
| 7-8   | Clear mechanism; specific rather than vague                           |
| 5-6   | Some urgency suggested but lacks specifics                            |
| 3-4   | Vague ("act soon", "don't miss out") without details                  |
| 1-2   | Minimal or ineffective urgency                                        |
| 0     | No urgency or false urgency undermining credibility                   |

### Factor 6: Hook & Initial Engagement (9%)

| Score | Description                                                                |
| ----- | -------------------------------------------------------------------------- |
| 9-10  | Visually compelling hero element; contextually relevant; strong engagement |
| 7-8   | Professional, relevant hero; adequate engagement                           |
| 5-6   | Present but generic; mild engagement                                       |
| 3-4   | Dated, poorly executed, or tangentially relevant                           |
| 1-2   | Missing, poor quality, or detracting                                       |
| 0     | No hook; purely text-based above-fold with no visual appeal                |

### Factor 7: Trust & Credibility Signals (11%)

| Score | Description                                                                                        |
| ----- | -------------------------------------------------------------------------------------------------- |
| 9-10  | Multiple relevant elements (named testimonials, certifications, badges, logos); prominently placed |
| 7-8   | Several elements; specific testimonials or credible certifications                                 |
| 5-6   | Some elements (generic testimonials or basic badges); adequate                                     |
| 3-4   | Minimal; generic or lacking credibility                                                            |
| 1-2   | Nearly absent                                                                                      |
| 0     | No trust signals at all                                                                            |

### Factor 8: Authentic Imagery & Visual Design (8%)

| Score | Description                                                             |
| ----- | ----------------------------------------------------------------------- |
| 9-10  | Authentic imagery (product photos, real customers); professional design |
| 7-8   | Mix of authentic and professional; solid design                         |
| 5-6   | Mostly professional with some stock; adequate; minor issues             |
| 3-4   | Significant stock photos; dated design; unprofessional impression       |
| 1-2   | Predominantly generic/low-quality; poor design                          |
| 0     | Broken images, extremely low-quality, or repelling                      |

### Factor 9: Clear Offer & Specificity (4%)

| Score | Description                                                |
| ----- | ---------------------------------------------------------- |
| 9-10  | Specific, unambiguous; visitor knows exactly what they get |
| 7-8   | Clear and specific; minor ambiguity                        |
| 5-6   | Generally clear but could be more specific                 |
| 3-4   | Somewhat vague; visitor must infer details                 |
| 1-2   | Unclear or hard to determine                               |
| 0     | No discernable offer                                       |

### Factor 10: Contextual Appropriateness (3%)

Evaluates whether design serves its industry/business model context. B2B SaaS, e-commerce, and local services have different CRO norms.

---

## 4. Factor Weights & Rationale

The weights were derived from empirical research on correlation with actual conversion outcomes:

| Factor              | Weight   | Rationale                                                                                     |
| ------------------- | -------- | --------------------------------------------------------------------------------------------- |
| Headline Quality    | 15%      | Primary determinant of whether users engage or bounce; captures 80% of initial attention      |
| Value Proposition   | 14%      | Extends headline; answers "what's in it for me?"; directly drives consideration               |
| USP/Differentiation | 13%      | Critical for competitive markets; answers "why you over alternatives?"                        |
| CTA Design          | 13%      | The conversion mechanism itself; changes to CTA alone can drive 200%+ improvement             |
| Trust Signals       | 11%      | Addresses fundamental "is this trustworthy?" concern; increasingly important post-privacy era |
| Urgency/Scarcity    | 10%      | Drives immediate action vs. postponement; effective when legitimate                           |
| Hook/Engagement     | 9%       | First-impression visual; supports but doesn't replace messaging                               |
| Imagery/Design      | 8%       | Credibility signal; generic stock undermines trust but doesn't make or break conversion       |
| Offer Clarity       | 4%       | Important but usually redundant with headline + CTA when those are strong                     |
| Context             | 3%       | Catch-all for industry-specific norms                                                         |
| **Total**           | **100%** |                                                                                               |

The top 4 factors (headline, value prop, USP, CTA) account for 55% of the score. This reflects the research consensus that messaging clarity and the conversion mechanism are the dominant drivers.

---

## 5. Score Calculation & Grading Scale

### Formula

```
Overall Score = (Headline × 0.15) + (Value Prop × 0.14) + (USP × 0.13) + (CTA × 0.13)
              + (Urgency × 0.10) + (Hook × 0.09) + (Trust × 0.11) + (Imagery × 0.08)
              + (Offer × 0.04) + (Context × 0.03)
```

Each factor is 0-10, producing a weighted sum of 0-10, multiplied by 10 to get 0-100.
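
The formula can be sketched in JavaScript as follows. This is a minimal illustration and not the contents of `src/score.js`: the helper name `weightedScore`, the factor key names, and the one-decimal rounding are all assumptions made for the example.

```javascript
// Weights from the formula above. The key names are an assumption about
// how the factors are labelled in code.
const FACTOR_WEIGHTS = {
  headline_quality: 0.15,
  value_proposition: 0.14,
  unique_selling_proposition: 0.13,
  call_to_action: 0.13,
  urgency_messaging: 0.10,
  hook_engagement: 0.09,
  trust_signals: 0.11,
  imagery_design: 0.08,
  offer_clarity: 0.04,
  contextual_appropriateness: 0.03,
};

// Hypothetical helper: turns ten 0-10 factor scores into a 0-100 total.
function weightedScore(factorScores) {
  let sum = 0;
  for (const [name, weight] of Object.entries(FACTOR_WEIGHTS)) {
    const score = factorScores[name];
    if (typeof score !== "number" || score < 0 || score > 10) {
      throw new RangeError(`Factor "${name}" must be a number from 0 to 10`);
    }
    sum += score * weight;
  }
  // The weighted sum is on a 0-10 scale; multiply by 10 to reach 0-100
  // and round to one decimal place.
  return Math.round(sum * 100) / 10;
}

// Usage: a site scoring 7 on every factor lands at 70.
const allSevens = Object.fromEntries(
  Object.keys(FACTOR_WEIGHTS).map((name) => [name, 7])
);
console.log(weightedScore(allSevens)); // 70
```

Because the weights sum to 1.0, a uniform factor score of `n` always yields `n × 10`, which makes a convenient sanity check.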

### Grading Scale

The production system uses a standard academic grade scale with +/- modifiers:

| Grade | Score Range | Interpretation                                        |
| ----- | ----------- | ----------------------------------------------------- |
| A+    | 97-100      | Exceptional — industry-leading conversion design      |
| A     | 93-96       | Excellent — strong conversion potential               |
| A-    | 90-92       | Very good — well-executed with small areas to improve |
| B+    | 87-89       | Good — solid foundation, a few clear opportunities    |
| B     | 83-86       | Above average — noticeable room for improvement       |
| B-    | 80-82       | Borderline — several conversion barriers present      |
| C+    | 77-79       | Below average — meaningful improvements needed        |
| C     | 73-76       | Fair — significant issues across multiple factors     |
| C-    | 70-72       | Weak — substantial work required                      |
| D+    | 67-69       | Poor — major barriers across most factors             |
| D     | 63-66       | Very poor — fundamental issues need addressing        |
| D-    | 60-62       | Critical — actively losing most potential customers   |
| F     | 0-59        | Failing — urgent, comprehensive overhaul required     |

> See `src/score.js:computeGrade()` for the production implementation.
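
As a rough illustration of how the table maps to code, here is a small lookup. It is a sketch and not the actual `computeGrade()`: the name `gradeForScore` is hypothetical, and rounding fractional scores to the nearest integer before lookup is an assumption, since the table only defines integer ranges.

```javascript
// Lower bound of each grade band, highest first (taken from the table above).
const GRADE_BANDS = [
  [97, "A+"], [93, "A"], [90, "A-"],
  [87, "B+"], [83, "B"], [80, "B-"],
  [77, "C+"], [73, "C"], [70, "C-"],
  [67, "D+"], [63, "D"], [60, "D-"],
];

// Hypothetical stand-in for computeGrade(); the real one lives in src/score.js.
function gradeForScore(score) {
  // Assumption: round first so a value such as 82.4 falls into the 80-82 band.
  const rounded = Math.round(score);
  for (const [minimum, grade] of GRADE_BANDS) {
    if (rounded >= minimum) return grade;
  }
  return "F";
}

console.log(gradeForScore(82)); // "B-"
console.log(gradeForScore(59)); // "F"
```

Keeping the bands in a single ordered array means the thresholds can be checked against the table at a glance.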

---

## 6. Screenshot Strategy

### Key Decision: Above-the-Fold Is Sufficient

The research concluded that **above-the-fold and first below-the-fold screenshots are substantially sufficient for reliable scoring**, reducing token consumption by 65-75% compared to full-page screenshots while maintaining evaluation accuracy above 90%.

### Recommended Approach

1. **Primary:** Desktop above-the-fold (1920x1080)
2. **Secondary:** Mobile above-the-fold (375x667) — _later dropped in production for cost reasons_
3. **Conditional:** Below-the-fold screenshot if initial score is low (rescoring pass)

### Evidence

- Above-fold content captures 57-80% of user viewing time
- The 9 scoring factors cluster heavily above the fold on well-designed pages
- GPT-4 Vision cropping research showed focused images _improve_ accuracy by removing noise
- Token savings: ~1,000 tokens per above-fold image vs. ~2,000 for full-page
- At 500,000 websites: ~1 billion fewer tokens consumed

### Production Implementation

In the production system (`src/capture.js`):

- Desktop screenshot captured at page load (cropped + uncropped variants)
- DOM-aware intelligent cropping preserves CTAs, trust signals, hero imagery
- Cropped version saves 20-35% additional LLM tokens
- Below-fold screenshot captured separately for rescoring pass
- Mobile screenshot was dropped for cost efficiency

---

## 7. Two-Pass Architecture

### Design Decision: Conditional Resubmission

The research compared three approaches for handling below-the-fold content:

| Approach                             | Token Cost (100K sites, 30% low-scoring) |
| ------------------------------------ | ---------------------------------------- |
| **Conditional resubmission**         | 174M tokens                              |
| Always include below-fold            | 180M tokens                              |
| Include with "ignore if unnecessary" | 185M tokens                              |

**Conditional resubmission wins** because:

- Vision models charge for image tokens at input time regardless of whether the model "uses" the image
- Including an image and saying "only look at it if needed" does NOT save tokens
- The breakeven point is ~50% of sites scoring low; in practice only ~30% do
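
The breakeven reasoning can be made concrete with a simple cost model. This is a sketch under stated assumptions: the function name, the parameter names, and every token figure in the usage example are hypothetical and are not the measurements behind the table above. The model assumes Pass 2 resends the HTML and prior score JSON (the "overhead") along with the below-fold image.

```javascript
// Hypothetical cost model for comparing the two main strategies.
function compareStrategies({
  sites,
  lowScoreRate,
  pass1Tokens,
  belowFoldImageTokens,
  rescoreOverheadTokens,
}) {
  // Conditional: every site pays for Pass 1; only low scorers pay for Pass 2,
  // which carries the resend overhead plus the new image.
  const pass2Tokens = rescoreOverheadTokens + belowFoldImageTokens;
  const conditional = sites * (pass1Tokens + lowScoreRate * pass2Tokens);

  // Always-include: every site pays for the extra image in a single call.
  const always = sites * (pass1Tokens + belowFoldImageTokens);

  // Low-score rate at which both strategies cost the same.
  const breakevenRate = belowFoldImageTokens / pass2Tokens;

  return { conditional, always, breakevenRate };
}

// Illustrative inputs only.
const result = compareStrategies({
  sites: 100_000,
  lowScoreRate: 0.3,
  pass1Tokens: 1500,
  belowFoldImageTokens: 1000,
  rescoreOverheadTokens: 1000,
});
console.log(result);
```

With these made-up inputs the conditional strategy costs roughly 210M tokens against 250M for always-include, and the breakeven rate is 0.5. In this model a ~50% breakeven means the Pass 2 resend overhead is about as large as the below-fold image itself.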

### Pass 1: Scoring (Above-the-Fold)

- Input: Desktop screenshot (cropped) + HTML DOM
- Output: Factor scores (0-10 each), weighted total, grade, strengths, weaknesses, improvement opportunities
- Sites scoring below threshold proceed to Pass 2

### Pass 2: Rescoring (Below-the-Fold)

- Input: Below-fold screenshot + HTML DOM + original score JSON
- Output: Adjusted factor scores (only where new content warrants change), recalculated total/grade, contact details
- Does NOT resend above-fold screenshots (LLM already has the context from Pass 1 JSON)
- Focused prompt references original scores and asks for adjustments, not full re-evaluation

### Threshold

The original research suggested C+ (77) as the resubmission threshold. The production system uses a configurable `LOW_SCORE_CUTOFF` (currently 82, i.e., B- and below).

> **Business logic:** We're selling web design services. High scorers don't need help; low scorers are prospects. Rescoring gives low-scoring sites a second chance with more data before proposal generation.
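
The routing rule can be sketched as a one-line predicate. Two assumptions are baked in: that `LOW_SCORE_CUTOFF` is read from an environment variable (the source only says it is configurable), and that the cutoff is inclusive, matching "B- and below". The helper name `needsRescore` is hypothetical.

```javascript
// Assumption: the cutoff comes from the environment, defaulting to 82.
const LOW_SCORE_CUTOFF = Number(process.env.LOW_SCORE_CUTOFF ?? 82);

// Hypothetical helper: sites at or below the cutoff proceed to Pass 2.
function needsRescore(score, cutoff = LOW_SCORE_CUTOFF) {
  return score <= cutoff;
}

console.log(needsRescore(82, 82)); // true  (B- is rescored)
console.log(needsRescore(83, 82)); // false (B is not)
```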

---

## 8. Contact Extraction on Rescore

Contact extraction was added to the rescoring pass (not the initial scoring) to save tokens — you only extract contacts for sites you actually plan to contact (low scorers).

### What Gets Extracted

From the HTML DOM (not guessed):

- **Contact form details:** action URL, method, field presence (first_name, last_name, full_name, email, phone, company_name, subject_line, message) with field types, name attributes, and labels
- **Email addresses:** All explicit `mailto:` links or plain-text emails
- **Phone numbers:** All explicit `tel:` links or recognizable patterns
- **Social profiles:** Links to major platforms with platform identification
- **Contact page URLs:** Explicit "/contact" or "/support" links

### Design Rationale

- Extracting contacts in Pass 1 would waste tokens on high-scoring sites we won't contact
- The HTML DOM is already being sent in Pass 2 anyway (for score adjustment)
- Adding contact extraction to the same API call adds minimal token overhead
- All fields are optional — the LLM reports what it finds, uses `null`/empty for missing data
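
Since contacts are meant to come from the HTML rather than be guessed, one possible safeguard is to check the model's output against the HTML that was sent. The sketch below is not described in the source; `keepGroundedContacts` and the `emails` / `phones` field names are hypothetical, and the phone check is deliberately loose.

```javascript
// Hypothetical post-processing step: keep only the emails and phone numbers
// that actually appear in the HTML that was sent to the model.
function keepGroundedContacts(html, contacts) {
  const haystack = html.toLowerCase();
  // Compare phone numbers on digits only, so formatting differences such as
  // "(555) 010-1234" vs "5550101234" do not cause false rejections. This
  // concatenates every digit in the page, so it can over-accept.
  const htmlDigits = html.replace(/\D/g, "");

  return {
    emails: (contacts.emails ?? []).filter((email) =>
      haystack.includes(email.toLowerCase())
    ),
    phones: (contacts.phones ?? []).filter((phone) => {
      const digits = phone.replace(/\D/g, "");
      return digits.length >= 7 && htmlDigits.includes(digits);
    }),
  };
}

const html =
  '<a href="mailto:Info@Example.com">Email us</a> <a href="tel:+15550101234">Call</a>';
console.log(
  keepGroundedContacts(html, {
    emails: ["info@example.com", "ceo@example.com"],
    phones: ["+1 555 010 1234", "555 999 0000"],
  })
);
```

In this example only `info@example.com` and `+1 555 010 1234` survive, because the other two values never appear in the page.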

---

## 9. Popover Handling

**Decision: Close popovers before taking screenshots.**

Reasoning:

1. Popovers obscure the headline, hero image, CTA, and trust signals being evaluated
2. They represent a secondary conversion path (newsletter signup, discount) — not the primary page conversion
3. They create inconsistent evaluation conditions (some sites show immediately, others on delay/exit)
4. The entire scoring methodology depends on evaluating above-fold content, which is completely blocked by modal overlays

### Implementation

The production system (`src/capture.js` / `src/utils/stealth-browser.js`):

- Waits 2-3 seconds after page load for delay-triggered popovers
- Attempts to close via common selectors (`[class*='close']`, `[aria-label='Close']`, etc.)
- Sends Escape key as fallback
- Takes screenshot immediately after closing to avoid new popovers
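
The four steps above could look roughly like this with Playwright. It is a sketch, not the code in `src/capture.js`: the function name, the 2.5-second default, the 1-second click timeout, and the two-item selector list are assumptions. `page` is expected to be a Playwright `Page` that has already navigated.

```javascript
// Selectors taken from the list above; a real list would likely be longer.
const CLOSE_SELECTORS = ["[class*='close']", "[aria-label='Close']"];

// Hypothetical helper: dismiss popovers, then capture a JPEG screenshot.
async function captureWithoutPopovers(page, delayMs = 2500) {
  // Give delay-triggered popovers time to appear before trying to close them.
  await page.waitForTimeout(delayMs);

  for (const selector of CLOSE_SELECTORS) {
    const button = page.locator(selector).first();
    try {
      if (await button.isVisible()) {
        await button.click({ timeout: 1000 });
      }
    } catch {
      // A close button that vanished or refused the click is not fatal.
    }
  }

  // Fallback for modals that only respond to the keyboard.
  await page.keyboard.press("Escape");

  // Capture right away, before another popover has a chance to open.
  return page.screenshot({ type: "jpeg", quality: 85 });
}
```

Swallowing click errors keeps one stubborn overlay from failing the whole capture; the Escape key and the immediate screenshot still run.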

---

## 10. LLM Prompt Design

### Prompt Structure (Both Passes)

1. **System context:** Expert CRO specialist role
2. **Input specification:** What data is provided (screenshots, HTML, prior scores for rescoring)
3. **Evaluation framework:** Factor definitions with rubric anchors
4. **Scoring methodology:** Weighted calculation formula
5. **Output format:** Strict JSON schema
6. **Best practices:** Analyze HTML first, cross-reference with screenshots, assess mobile/desktop separately, provide specific evidence

### Key Design Principles

- **Rubric in system prompt:** Full rubric definitions appear once in the system prompt, not repeated per-website
- **Evidence-based scoring:** Each factor score requires 1-2 sentence reasoning with specific page evidence
- **Independent factor scoring:** Score each factor independently, then calculate weighted total
- **Dual analysis:** HTML content analysis + visual assessment cross-referenced
- **Confidence assessment:** Overall confidence (High/Medium/Low) with limitation notes

### Production Evolution

The production prompts (`prompts/CONVERSION-SCORING-VISION.md` and `prompts/CONVERSION-SCORING-NOVIS.md`) have evolved from this original design:

- Simplified output JSON (removed verbose nested structures for token efficiency)
- Added `recommendation_sms` and `recommendation_email` fields for proposal generation
- Split into vision-enabled and HTML-only variants
- Grade calculation moved from LLM to code (`computeGrade()` in `src/score.js`)
- LLM now returns only `factor_scores`; total and grade computed programmatically for consistency

---

## 11. Validation & Calibration

The research recommended the following validation approach (partially implemented):

### Inter-Rater Reliability

- Evaluate 50-100 websites with experienced CRO professionals
- Run same websites through LLM scoring
- Target Spearman correlation > 0.85 between LLM and expert scores
- Analyze divergence cases; iterate on rubric wording
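
For reference, the Spearman target can be checked with a few lines of code. This is a self-contained sketch with hypothetical function names; it computes Spearman's rho as the Pearson correlation of average ranks, which handles tied scores.

```javascript
// Assign 1-based ranks, giving tied values the average of their positions.
function averageRanks(values) {
  const order = values
    .map((value, index) => ({ value, index }))
    .sort((a, b) => a.value - b.value);
  const ranks = new Array(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].value === order[start].value) {
      end++;
    }
    const rank = (start + end) / 2 + 1;
    for (let i = start; i <= end; i++) ranks[order[i].index] = rank;
    start = end + 1;
  }
  return ranks;
}

// Spearman's rho is the Pearson correlation of the two rank vectors.
function spearman(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new RangeError("Need two equal-length arrays with at least 2 items");
  }
  const rx = averageRanks(xs);
  const ry = averageRanks(ys);
  const mean = (rx.length + 1) / 2; // ranks always average to (n + 1) / 2
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < rx.length; i++) {
    const dx = rx[i] - mean;
    const dy = ry[i] - mean;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// Usage with made-up scores for five sites.
const llmScores = [62, 71, 80, 88, 95];
const expertScores = [60, 75, 78, 90, 93];
console.log(spearman(llmScores, expertScores) > 0.85); // true
```

Note that the result is `NaN` when either list contains a single repeated value, since ranks then have zero variance.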

### Expected Accuracy

- **Letter grade agreement:** 75-85% exact match with experts; remaining are adjacent grades
- **Factor-level accuracy:** 80-90% match (individual factors more objective than aggregated grades)
- **High-confidence cases:** >90% accuracy for clearly strong (A-) or weak (D/F) sites
- **Mid-range (B-C+):** Lower agreement due to inherent subjectivity

### Continuous Monitoring

- Distribution should approximate normal centered around C+/B-
- Factor correlations should match expectations (headline ↔ value prop should correlate > 0.6)
- 1% human spot-checks; recalibrate if divergences exceed 5-10%

### Production Status

In practice, the system has been validated through:

- Manual review of thousands of scored sites during outreach QA
- Grade/score mismatch detection and correction
- Programmatic grade computation (eliminating LLM grading inconsistencies)
- Iterative prompt refinement based on observed scoring patterns

---

## 12. Token Optimization

### Image Optimization (Implemented)

- JPEG conversion (quality 85): 40-50% file size reduction
- DOM-aware intelligent cropping: 20-35% token reduction
- Resolution targeting: minimum for text legibility
- Above-fold only: 65-75% reduction vs. full-page

### Prompt Optimization (Implemented)

- Rubric in system prompt (not repeated per website)
- Abbreviated factor references in rescoring prompt
- Strict JSON output (no prose explanation outside the JSON)
- LLM returns factor scores only; computation done in code

### HTML-Only Mode (Added Later)

When `ENABLE_VISION=false`, the system skips screenshots entirely:

- No Playwright screenshot capture
- Text-only analysis of HTML DOM
- Auto-promotes scored sites through rescoring (no below-fold vision needed)
- Cost: ~$0.0025/site vs. ~$0.030/site with vision (~92% savings)

---

## 13. Implementation Notes

### What Changed from Research to Production

| Research Proposal                         | Production Implementation                     | Reason                                     |
| ----------------------------------------- | --------------------------------------------- | ------------------------------------------ |
| Academic grading scale (A+ to F with +/-) | Business scale (A+ to F, fewer subdivisions)  | Simpler for prospect identification        |
| Mobile + desktop screenshots              | Desktop only                                  | Cost reduction; mobile added minimal value |
| LLM computes grade                        | Code computes grade from factor scores        | Eliminates grading inconsistencies         |
| Batch processing of 50-100 sites          | Individual API calls with concurrency control | Rate limits; error isolation               |
| Few-shot examples in prompt               | Prompt-only (no examples)                     | Token savings; rubric detail sufficient    |
| C+ (77) rescore threshold                 | B- (82) configurable threshold                | Business need: more prospects              |
| 9 factors + context                       | 10 factors (context kept as factor 10)        | Consistent weighting                       |
| Full JSON with nested evidence/reasoning  | Simplified JSON with factor scores            | Token efficiency                           |

### Key Files

- `src/score.js` — Scoring logic, `computeScoreFromFactors()`, `computeGrade()`
- `src/stages/rescoring.js` — Below-fold rescoring pass
- `src/capture.js` — Screenshot capture
- `src/contacts/prioritize.js` — Contact extraction and prioritization
- `prompts/CONVERSION-SCORING-VISION.md` — Vision-enabled scoring prompt
- `prompts/CONVERSION-SCORING-NOVIS.md` — HTML-only scoring prompt

---

## 14. Appendix: Original JSON Schemas

### Pass 1: Scoring Output

494  ```json
495  {
496    "website_url": "https://example.com",
497    "evaluation_date": "2026-01-14T12:00:00Z",
498    "device_analysis": {
499      "desktop_visible": true,
500      "mobile_visible": true,
501      "design_differences": "Mobile layout stacks hero and CTA; desktop shows them side by side."
502    },
503    "factor_scores": {
504      "headline_quality": {
505        "score": 8,
506        "reasoning": "Headline clearly states what the product does and who it's for.",
507        "evidence": "Headline text: 'Automate Your Invoices in Minutes for Small Businesses'."
508      },
509      "value_proposition": {
510        "score": 7,
511        "reasoning": "Benefits are clear but lack concrete quantified outcomes.",
      "evidence": "Copy mentions 'save time and reduce errors' but no specific numbers."
    },
    "unique_selling_proposition": {
      "score": 6,
      "reasoning": "USP is implied but not explicitly contrasted with competitors.",
      "evidence": "Mentions 'built specifically for freelancers' but no comparison."
    },
    "call_to_action": {
      "score": 9,
      "reasoning": "Primary CTA is above the fold, high contrast, and action-oriented.",
      "evidence": "CTA button 'Start Free 14-Day Trial' in hero, visually prominent."
    },
    "urgency_messaging": {
      "score": 3,
      "reasoning": "Weak urgency with vague wording and no specific deadline.",
      "evidence": "Text: 'Join now and don't miss out' without concrete time limit."
    },
    "hook_engagement": {
      "score": 7,
      "reasoning": "Hero image is relevant and supports the message.",
      "evidence": "Image shows a freelancer working with invoices on a laptop."
    },
    "trust_signals": {
      "score": 5,
      "reasoning": "Includes generic testimonials but no logos or certifications.",
      "evidence": "Three short text testimonials with first names only."
    },
    "imagery_design": {
      "score": 8,
      "reasoning": "Clean, modern design with custom-looking imagery.",
      "evidence": "Custom illustrations and product screenshots; no stock photos."
    },
    "offer_clarity": {
      "score": 8,
      "reasoning": "Offer is explicit: 14-day free trial, no credit card.",
      "evidence": "Text near CTA: 'Try it free for 14 days. No credit card needed.'"
    },
    "contextual_appropriateness": {
      "score": 7,
      "reasoning": "Design is appropriate for a B2B SaaS invoicing tool.",
      "industry_context": "B2B SaaS / invoicing software"
    }
  },
  "overall_calculation": {
    "weighted_total": 76.5,
    "letter_grade": "C",
    "grade_interpretation": "Acceptable fundamentals but room for improvement in USP, trust, and urgency."
  },
  "key_strengths": [
    "Strong, visible primary CTA with clear action and low-friction offer.",
    "Headline and offer are easy to understand for the target audience."
  ],
  "critical_weaknesses": [
    "Weak urgency messaging provides little reason to act now.",
    "Trust elements are generic and do not provide strong social proof."
  ],
  "quick_improvement_opportunities": [
    "Add specific trust signals (logos, detailed testimonials) above the fold.",
    "Introduce concrete urgency (time-bound offer or limited onboarding slots)."
  ],
  "confidence_assessment": {
    "overall_confidence": "High",
    "reasoning": "Above-fold content contains all major conversion elements.",
    "limitation_notes": "Does not consider deeper content, checkout flow, or post-click funnel."
  }
}
```
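
Because this JSON comes back from an LLM, it can help to run a structural check before storing it. The sketch below is one possible way to do that and is not taken from `src/score.js`. The function name `validatePassOneResult` is hypothetical, and the 0-10 range for factor scores is an assumption inferred from the sample values above.

```javascript
// Hypothetical helper: structural sanity check for a Pass 1 scoring result.
// Assumption: each factor score is a number on a 0-10 scale.
function validatePassOneResult(result) {
  const problems = [];

  const factors = result && result.factor_scores;
  if (!factors || typeof factors !== "object") {
    problems.push("factor_scores is missing or not an object");
  } else {
    for (const [name, factor] of Object.entries(factors)) {
      if (!factor || typeof factor !== "object") {
        problems.push(`${name} must be an object`);
        continue;
      }
      if (!Number.isFinite(factor.score) || factor.score < 0 || factor.score > 10) {
        problems.push(`${name}.score must be a number between 0 and 10`);
      }
      if (typeof factor.reasoning !== "string" || factor.reasoning.trim() === "") {
        problems.push(`${name}.reasoning must be a non-empty string`);
      }
    }
  }

  const overall = result && result.overall_calculation;
  if (!overall || !Number.isFinite(overall.weighted_total)) {
    problems.push("overall_calculation.weighted_total must be a number");
  }
  if (!overall || typeof overall.letter_grade !== "string") {
    problems.push("overall_calculation.letter_grade must be a string");
  }

  // An empty array means the result passed every check.
  return problems;
}

// Usage: the second factor is deliberately malformed.
const sample = {
  factor_scores: {
    call_to_action: { score: 9, reasoning: "Primary CTA is above the fold." },
    urgency_messaging: { score: 13, reasoning: "" },
  },
  overall_calculation: { weighted_total: 76.5, letter_grade: "C" },
};
console.log(validatePassOneResult(sample));
```

The helper returns a list of problems instead of throwing, so a caller can decide whether to retry the LLM call, log the issue, or discard the result.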

### Pass 2: Rescoring + Contact Extraction Output

The rescoring pass adds a `contact_details` section to the evaluation JSON:

```json
{
  "contact_details": {
    "primary_contact_form": {
      "form_action_url": "https://example.com/contact-submit",
      "form_method": "post",
      "fields": {
        "first_name": {
          "present": true,
          "field_type": "text",
          "name_attribute": "first_name",
          "label_or_placeholder": "First name"
        },
        "last_name": {
          "present": true,
          "field_type": "text",
          "name_attribute": "last_name",
          "label_or_placeholder": "Last name"
        },
        "full_name": {
          "present": false,
          "field_type": null,
          "name_attribute": null,
          "label_or_placeholder": null
        },
        "email": {
          "present": true,
          "field_type": "email",
          "name_attribute": "email",
          "label_or_placeholder": "Your email"
        },
        "phone": {
          "present": true,
          "field_type": "tel",
          "name_attribute": "phone",
          "label_or_placeholder": "Phone number"
        },
        "company_name": {
          "present": false,
          "field_type": null,
          "name_attribute": null,
          "label_or_placeholder": null
        },
        "subject_line": {
          "present": false,
          "field_type": null,
          "name_attribute": null,
          "label_or_placeholder": null
        },
        "message": {
          "present": true,
          "field_type": "textarea",
          "name_attribute": "message",
          "label_or_placeholder": "Your message"
        }
      }
    },
    "email_addresses": ["support@example.com", "sales@example.com"],
    "phone_numbers": ["+1-555-123-4567"],
    "social_profiles": [
      {
        "platform": "facebook",
        "url": "https://www.facebook.com/example"
      },
      {
        "platform": "linkedin",
        "url": "https://www.linkedin.com/company/example"
      }
    ],
    "contact_pages": ["https://example.com/contact", "https://example.com/support"]
  }
}
```
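
As a rough illustration of how the Pass 2 output might be consumed, the sketch below attaches `contact_details` to an existing evaluation and lists the form fields marked as present. Both function names (`attachContactDetails`, `presentFormFields`) are hypothetical and written for this example. The actual merge logic in `src/stages/rescoring.js` is not shown in this document and may differ.

```javascript
// Hypothetical helper: attach Pass 2 contact details to a Pass 1 evaluation
// without mutating either input. Falls back to null when the rescoring
// output has no contact_details section.
function attachContactDetails(evaluation, rescoreOutput) {
  const contactDetails =
    rescoreOutput && rescoreOutput.contact_details ? rescoreOutput.contact_details : null;
  return { ...evaluation, contact_details: contactDetails };
}

// Hypothetical helper: names of the form fields flagged with "present": true.
function presentFormFields(contactDetails) {
  const fields = contactDetails?.primary_contact_form?.fields ?? {};
  return Object.entries(fields)
    .filter(([, field]) => field && field.present === true)
    .map(([name]) => name);
}

// Usage with a trimmed-down version of the example output above.
const passTwoOutput = {
  contact_details: {
    primary_contact_form: {
      form_action_url: "https://example.com/contact-submit",
      form_method: "post",
      fields: {
        email: { present: true, field_type: "email", name_attribute: "email" },
        full_name: { present: false, field_type: null, name_attribute: null },
        message: { present: true, field_type: "textarea", name_attribute: "message" },
      },
    },
    email_addresses: ["support@example.com"],
  },
};

const evaluation = { overall_calculation: { weighted_total: 76.5, letter_grade: "C" } };
const merged = attachContactDetails(evaluation, passTwoOutput);
console.log(presentFormFields(merged.contact_details));
```

Each field carries an explicit `present` flag, so absent fields such as `full_name` are filtered out by that flag and the helper does not need to check for `null` attributes.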

---

## References

The original research cited ~60 sources. Key references that informed the design:

- Nielsen Norman Group: user attention and fold research (57-80% above-fold viewing time)
- Google ad viewability study: 73% above-fold vs. 44% below-fold viewability
- CRO best practices literature: factor selection and weighting
- GPT-4 Vision cropping research: focused images outperform full-page images for accuracy
- PURE method: inter-rater reliability calibration for expert-based evaluation (>0.8 reliability)
- LLMLingua (Microsoft): prompt optimization, 35% token reduction while maintaining quality
- DeepSeek vision-text compression: 7-20x token reduction at >90% accuracy
- OpenAI vision model documentation: token calculation and pricing for image inputs