pseo-scale

Install command:
npx skills add https://github.com/lisbeth718/pseo-skills --skill pseo-scale


Skill Documentation

pSEO Scale Architecture

Architect pSEO systems that work at 10K-100K+ pages. The patterns in the other pseo-* skills are correct at 1K-10K. Beyond 10K, in-memory data layers, full-corpus validation, and single-deploy rollouts break down. This skill provides the architecture changes needed at scale.

Scale Tiers

| Tier | Pages | Data Layer | Validation | Rollout | Sitemap |
| --- | --- | --- | --- | --- | --- |
| Small | < 1K | JSON/files, in-memory | Full pairwise | Single deploy | Single file |
| Medium | 1K-10K | Files or DB, two-tier memory | Fingerprint-based | 2-4 week batches | Index + children |
| Large | 10K-50K | Database required | Incremental + sampling | Category-by-category | Index + chunked |
| Very Large | 50K-100K+ | Database + cache layer | Delta-only + periodic full | ISR + sitemap waves | Index + streaming |

This skill focuses on the Large and Very Large tiers. If the project is under 10K pages, the standard pseo-* skills are sufficient.

1. Database-Backed Data Layer

At 10K+ pages, JSON files and in-memory arrays stop working. The data layer must move to a database with proper indexing.

Why In-Memory Breaks

Pages    PageIndex in memory    Full content (if loaded)
1K       ~1MB                   ~100-500MB
10K      ~10MB                  ~1-5GB (OOM)
50K      ~50MB (borderline)     ~5-25GB (impossible)
100K     ~100MB                 ~10-50GB (impossible)

At 50K+, even holding all PageIndex records in memory is borderline. At 100K, getAllSlugs() returning an array of 100K objects takes ~100MB and seconds to deserialize. You need cursor-based iteration.

Database Requirements

Minimum schema:

CREATE TABLE pages (
  id            SERIAL PRIMARY KEY,
  slug          TEXT UNIQUE NOT NULL,
  canonical_path TEXT UNIQUE NOT NULL,
  title         TEXT NOT NULL,
  h1            TEXT NOT NULL,
  meta_description TEXT NOT NULL,
  category      TEXT NOT NULL,
  subcategory   TEXT,
  status        TEXT DEFAULT 'published',
  last_modified TIMESTAMPTZ NOT NULL,
  published_at  TIMESTAMPTZ,
  -- Heavy fields (only loaded per-page)
  intro_text    TEXT,
  body_content  TEXT,
  faqs          JSONB,
  related_slugs TEXT[],
  featured_image JSONB,
  -- Scale fields
  data_sufficiency_score REAL,  -- see section 2
  content_hash  TEXT,           -- for incremental validation
  last_validated TIMESTAMPTZ
);

-- Required indexes for pSEO queries
CREATE INDEX idx_pages_category ON pages(category);
CREATE INDEX idx_pages_status ON pages(status) WHERE status = 'published';
CREATE INDEX idx_pages_slug ON pages(slug);
CREATE INDEX idx_pages_last_modified ON pages(last_modified DESC);
CREATE INDEX idx_pages_sufficiency ON pages(data_sufficiency_score);
CREATE INDEX idx_pages_category_status ON pages(category, status);

Categories table:

CREATE TABLE categories (
  slug          TEXT PRIMARY KEY,
  name          TEXT NOT NULL,
  description   TEXT,
  parent_slug   TEXT REFERENCES categories(slug),
  page_count    INT DEFAULT 0,
  last_modified TIMESTAMPTZ
);

Redirects table:

CREATE TABLE redirects (
  source      TEXT PRIMARY KEY,
  destination TEXT NOT NULL,
  created_at  TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_redirects_destination ON redirects(destination);

Data Layer API at Scale

The pseo-data API contract changes at scale:

// getAllSlugs() must support cursor-based iteration at 50K+
async function* getAllSlugsCursor(): AsyncGenerator<{ slug: string; category: string }> {
  let cursor: string | null = null;
  while (true) {
    const batch = await db.query(
      `SELECT slug, category FROM pages
       WHERE status = 'published'
       ${cursor ? `AND slug > $1` : ''}
       ORDER BY slug LIMIT 1000`,
      cursor ? [cursor] : []
    );
    if (batch.length === 0) break;
    for (const row of batch) yield row;
    cursor = batch[batch.length - 1].slug;
  }
}

// getPagesByCategory() must use DB pagination, not in-memory slicing
async function getPagesByCategory(
  category: string,
  opts?: { limit?: number; offset?: number }
): Promise<PageIndex[]> {
  return db.query(
    `SELECT slug, title, h1, meta_description, canonical_path, category, last_modified
     FROM pages
     WHERE category = $1 AND status = 'published'
     ORDER BY title
     LIMIT $2 OFFSET $3`,
    [category, opts?.limit ?? 50, opts?.offset ?? 0]
  );
}

// getRelatedPages() needs a precomputed index or efficient query
async function getRelatedPages(slug: string, limit = 5): Promise<PageIndex[]> {
  // Option A: Use related_slugs array from the page record
  // Option B: Same category + shared tags query
  // Option C: Precomputed relatedness table (best at 50K+)
  return db.query(
    `SELECT p.slug, p.title, p.h1, p.meta_description, p.canonical_path,
            p.category, p.last_modified
     FROM pages p
     JOIN pages source ON source.slug = $1
     WHERE p.category = source.category
       AND p.slug != $1
       AND p.status = 'published'
     ORDER BY p.last_modified DESC
     LIMIT $2`,
    [slug, limit]
  );
}

Connection Pooling

At build time with parallel page generation, you need connection pooling:

import { Pool } from "pg";

const pool = new Pool({
  max: 10,                    // limit concurrent connections during build
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});

ORM alternative: If using Prisma or Drizzle, configure connection pool limits in the ORM config. Default pool sizes are often too high for build processes.
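
For example, a minimal sketch of capping the pool when running Drizzle over node-postgres (the DATABASE_URL env var and the shared db export here are assumptions for illustration, not part of this skill's API):

import { Pool } from "pg";
import { drizzle } from "drizzle-orm/node-postgres";

// Reuse one capped pool across the whole build process
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,
});

export const db = drizzle(pool);

// With Prisma, the equivalent cap is a connection_limit parameter on the datasource URL,
// e.g. postgresql://...?connection_limit=10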

See references/database-patterns.md for full patterns by database type.

2. Data Sufficiency Gating

At 100K pages, many combinations will produce thin content. Gate page generation BEFORE build time — don’t create pages that will fail quality checks.

Sufficiency Score

Compute a score per potential page based on available data:

function computeSufficiencyScore(record: RawRecord): number {
  let score = 0;
  const weights = {
    hasTitle: 10,
    hasDescription: 10,           // > 50 chars
    hasBodyContent: 20,           // > 200 words
    hasFAQs: 15,                  // >= 3 Q&A pairs
    hasUniqueAttributes: 15,      // >= 3 non-null structured attributes
    hasImage: 5,
    hasCategory: 10,
    hasNumericData: 10,           // stats, ratings, prices — LLM citation signal
    hasSourceCitation: 5,         // data provenance for E-E-A-T
  };

  if (record.title?.length > 10) score += weights.hasTitle;
  if (record.description?.length > 50) score += weights.hasDescription;
  if (wordCount(record.bodyContent) > 200) score += weights.hasBodyContent;
  if (record.faqs?.length >= 3) score += weights.hasFAQs;
  // ... etc

  return score; // 0-100
}

Gating Thresholds

| Score | Action |
| --- | --- |
| 80-100 | Generate page — sufficient data |
| 60-79 | Generate page with enrichment flag — mark for content pipeline |
| 40-59 | Hold — do not generate until data is enriched |
| 0-39 | Reject — insufficient data, do not generate |

Store the score in the database (data_sufficiency_score column) so you can:

  • Query how many pages are gated vs. ready
  • Track enrichment progress over time
  • Re-score after data enrichment
  • Gate at build time: WHERE data_sufficiency_score >= 60 AND status = 'published' (a query sketch follows this list)
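
A minimal sketch of that build-time gate, reusing the db helper from the data layer (getPublishableSlugs is a hypothetical name):

async function getPublishableSlugs(minScore = 60): Promise<string[]> {
  const rows = await db.query(
    `SELECT slug FROM pages
     WHERE data_sufficiency_score >= $1
       AND status = 'published'
     ORDER BY slug`,
    [minScore]
  );
  return rows.map((r) => r.slug);
}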

Combination Gating

For combination pages (service × city, product × use-case), both dimensions must have sufficient data:

function gateCombination(dimA: RawRecord, dimB: RawRecord): boolean {
  // Both dimensions must independently clear a minimum bar
  const scoreA = computeSufficiencyScore(dimA);
  const scoreB = computeSufficiencyScore(dimB);

  // The combination itself must produce enough unique content
  // that it's not just "dimA text + dimB text" pasted together
  const combinationHasUniqueContent =
    dimA.attributes?.length >= 2 &&
    dimB.attributes?.length >= 2;

  return scoreA >= 40 && scoreB >= 40 && combinationHasUniqueContent;
}

This is the most important scale pattern. 500 services × 200 cities = 100K combinations, but maybe only 15K have enough data for a quality page. Generate only the 15K.
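
As a usage sketch, generation enumerates the combination space but only keeps the pairs that pass the gate (the services and cities arrays are hypothetical inputs):

const publishable: Array<{ service: RawRecord; city: RawRecord }> = [];

for (const service of services) {
  for (const city of cities) {
    if (gateCombination(service, city)) {
      publishable.push({ service, city });
    }
  }
}
// e.g. 100,000 candidate pairs in, ~15,000 publishable pages out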

3. Content Enrichment Pipeline

At 100K pages, you can’t manually write intros, FAQs, and descriptions. You need an automated enrichment pipeline with quality controls.

Pipeline Architecture

Raw data (DB/CMS/API)
    │
    ▼
Data sufficiency scoring ──→ Reject/hold insufficient records
    │
    ▼
Automated enrichment ──→ Generate intros, FAQs, summaries from structured data
    │
    ▼
Quality sampling ──→ Human reviews a random sample (5-10%) per batch
    │
    ▼
Publish gate ──→ Only records with score >= 60 and enrichment complete
    │
    ▼
Page generation

Enrichment Sources (non-LLM)

Before reaching for LLM generation, exhaust structured data enrichment:

  • Template composition from multiple fields: Combine 3+ data fields into prose. Not "Best {service} in {city}" but structured sentences using attributes, stats, and category context.
  • Aggregation: Roll up child data into parent summaries (category stats, comparison tables, top-N lists)
  • Cross-referencing: Enrich records by joining data sources (product + reviews, service + location demographics, listing + category averages)
  • FAQ generation from data patterns: Turn common attribute variations into Q&A pairs (“How much does X cost in Y?” → answer from price data); see the sketch after this list
  • Comparison data: Auto-generate comparison sections from sibling records in the same category
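
A minimal sketch of FAQ generation from pricing data (the priceMin/priceMax/currency field names are assumptions about the record shape):

interface FAQ {
  question: string;
  answer: string;
}

function faqsFromPricing(record: {
  name: string;
  city: string;
  priceMin?: number;
  priceMax?: number;
  currency?: string;
}): FAQ[] {
  const faqs: FAQ[] = [];
  if (record.priceMin != null && record.priceMax != null) {
    const cur = record.currency ?? "$";
    faqs.push({
      question: `How much does ${record.name} cost in ${record.city}?`,
      answer:
        `${record.name} in ${record.city} typically costs between ` +
        `${cur}${record.priceMin} and ${cur}${record.priceMax}, based on current listing data.`,
    });
  }
  return faqs;
}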

LLM-Assisted Enrichment

If structured enrichment is insufficient, LLM-assisted generation is acceptable under strict conditions:

  1. Never generate the entire page content — LLM fills gaps in an otherwise data-driven page
  2. Always ground in real data — the LLM prompt includes the record’s actual attributes, stats, and context
  3. Human review sampling — review 5-10% of LLM-generated content per batch before publishing
  4. Store the generation metadata — track which fields were LLM-generated vs. sourced from data (see the sketch after this list)
  5. Apply quality guard after enrichment — run pseo-quality-guard on enriched content before publishing
  6. Regenerate periodically — stale LLM content should be refreshed when underlying data changes
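
A sketch of points 2 and 4 combined; the llm.complete() client is a placeholder rather than a real API, and the generation_meta JSONB column is an assumption on top of the minimum schema:

// Ground the prompt in the record's real attributes; never ask for an entire page
function buildIntroPrompt(record: {
  name: string;
  category: string;
  attributes: Record<string, string>;
}): string {
  const facts = Object.entries(record.attributes)
    .map(([key, value]) => `- ${key}: ${value}`)
    .join("\n");
  return (
    `Write a 2-3 sentence introduction for ${record.name} (${record.category}).\n` +
    `Use only these facts and do not invent any others:\n${facts}`
  );
}

async function enrichIntro(record: {
  slug: string;
  name: string;
  category: string;
  attributes: Record<string, string>;
}) {
  const intro = await llm.complete(buildIntroPrompt(record)); // placeholder LLM client
  await db.query(
    `UPDATE pages
     SET intro_text = $1,
         generation_meta = jsonb_build_object('intro_text', 'llm', 'generated_at', NOW())
     WHERE slug = $2`,
    [intro, record.slug]
  );
}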

Google’s position (2025): AI-generated content is acceptable if it provides genuine value. The risk is not the generation method but the output quality. Scaled LLM generation that produces interchangeable pages will trigger the same penalties as template spam.

4. Incremental Validation

At 100K pages, full-corpus quality checks take hours. Switch to incremental validation.

Delta Validation

Only validate pages that changed since the last validation run:

async function getChangedPages(since: Date): Promise<string[]> {
  const result = await db.query(
    `SELECT slug FROM pages
     WHERE status = 'published'
       AND (last_modified > $1 OR last_validated IS NULL)`,
    [since]
  );
  return result.map(r => r.slug);
}

Content hashing: Store a hash of each page’s rendered content. On validation, only re-check pages whose hash changed:

import { createHash } from "crypto";

function contentHash(page: BaseSEOContent): string {
  const content = `${page.title}|${page.h1}|${page.metaDescription}|${page.bodyContent}`;
  return createHash("sha256").update(content).digest("hex").slice(0, 16);
}
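
Combining the two, a delta validation pass might look like this sketch (loadPage and validatePage stand in for the existing data layer and pseo-quality-guard checks; both are assumptions here):

async function validateDelta(since: Date) {
  const slugs = await getChangedPages(since);

  for (const slug of slugs) {
    const page = await loadPage(slug); // full content, loaded one page at a time
    const hash = contentHash(page);

    const [row] = await db.query(`SELECT content_hash FROM pages WHERE slug = $1`, [slug]);
    if (row?.content_hash === hash) continue; // rendered content unchanged, skip

    await validatePage(page); // run quality checks on this page only
    await db.query(
      `UPDATE pages SET content_hash = $1, last_validated = NOW() WHERE slug = $2`,
      [hash, slug]
    );
  }
}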

Periodic Full Scan

Run a complete validation weekly or before major releases. For the full scan at 100K:

  • Parallelize: Run 4-8 validation workers, each processing a category partition
  • Sample cross-category similarity: Don’t compare all 100K × 100K. Compare each page against 50 random pages from other categories + all pages in the same category. A sampling sketch follows the snippet below.
  • Stream results to disk: Write validation results to a JSONL file, then aggregate. Don’t accumulate all results in memory.
// Parallel validation by category
const categories = await getAllCategories();
const workers = categories.map(cat =>
  validateCategory(cat.slug) // each worker handles one category
);
const results = await Promise.all(workers);
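
Inside each worker, the sampled comparison might look like this sketch (similarityScore stands in for whatever fingerprint comparison pseo-quality-guard provides; it is an assumption here):

async function validateCategory(category: string) {
  const pages = await getPagesByCategory(category, { limit: 50000 });
  const issues: Array<{ a: string; b: string; score: number }> = [];

  for (const page of pages) {
    // All other pages in the same category, plus a small random cross-category sample
    const crossSample = await db.query(
      `SELECT slug, title, h1, meta_description FROM pages
       WHERE category != $1 AND status = 'published'
       ORDER BY random() LIMIT 50`,
      [category]
    );
    const candidates = [...pages.filter((p) => p.slug !== page.slug), ...crossSample];

    for (const other of candidates) {
      const score = similarityScore(page, other);
      if (score > 0.85) issues.push({ a: page.slug, b: other.slug, score });
    }
  }
  return issues;
}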

Validation Budget

| Scale | Delta validation | Full validation |
| --- | --- | --- |
| 10K | Minutes | 30-60 minutes |
| 50K | Minutes | 2-4 hours |
| 100K | Minutes | 4-8 hours |

Optimization: Pre-compute and store fingerprints (MinHash signatures) in the database. Validation then only compares fingerprints, not full content.
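
A minimal MinHash sketch along those lines; treat it as an illustration of the idea, not the pseo-quality-guard implementation:

import { createHash } from "crypto";

function shingles(text: string, k = 3): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + k <= words.length; i++) out.add(words.slice(i, i + k).join(" "));
  return out;
}

function minhashSignature(text: string, numHashes = 64): number[] {
  const grams = shingles(text);
  const sig = new Array<number>(numHashes).fill(Number.MAX_SAFE_INTEGER);
  for (const gram of grams) {
    for (let seed = 0; seed < numHashes; seed++) {
      // Cheap seeded hash: first 8 hex chars of sha1(seed + gram)
      const h = parseInt(createHash("sha1").update(`${seed}:${gram}`).digest("hex").slice(0, 8), 16);
      if (h < sig[seed]) sig[seed] = h;
    }
  }
  return sig; // store per page, e.g. as int[] or JSONB
}

// Estimated Jaccard similarity = fraction of matching signature positions
function estimatedSimilarity(a: number[], b: number[]): number {
  let matches = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) matches++;
  return matches / a.length;
}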

5. Crawl Budget Management

At 100K pages, Google won’t crawl everything immediately. Crawl budget — the number of pages Google crawls per day — becomes a constraint.

Sitemap Submission Strategy

sitemap-index.xml
├── sitemap-category-a.xml     (≤ 50,000 URLs)
├── sitemap-category-b.xml
├── sitemap-category-c.xml
└── ...

Submission cadence:

  • Submit the sitemap index to both Google Search Console and Bing Webmaster Tools
  • Update individual category sitemaps as pages are added
  • Don’t submit all 100K URLs at once — Google throttles crawling for new sites with sudden URL spikes

Programmatic sitemap submission:

// Use Google Indexing API for high-priority pages (job postings, livestreams)
// For standard pages, rely on sitemap discovery + Search Console

// Batch sitemap updates by category
async function updateCategorySitemap(category: string) {
  const pages = await getPagesByCategory(category, { limit: 50000 });
  const xml = generateSitemapXml(pages);
  await writeFile(`public/sitemap-${category}.xml`, xml);
}
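
generateSitemapXml is assumed above; a minimal version might look like this (SITE_URL is an assumed env var):

function generateSitemapXml(
  pages: Array<{ canonical_path: string; last_modified: string | Date }>
): string {
  const baseUrl = process.env.SITE_URL ?? "https://example.com";
  const urls = pages
    .map(
      (p) =>
        `  <url>\n    <loc>${baseUrl}${p.canonical_path}</loc>\n` +
        `    <lastmod>${new Date(p.last_modified).toISOString()}</lastmod>\n  </url>`
    )
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>\n`;
}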

Crawl Budget Optimization

  • Prioritize high-value pages: Set priority in sitemap to guide Googlebot (0.8 for hubs, 0.6 for pages with high sufficiency scores, 0.4 for the rest)
  • Remove low-quality pages from sitemap: Pages with sufficiency score < 60 should not be in the sitemap
  • Fix crawl waste: Ensure no soft 404s, no redirect chains, no parameter-based duplicates — all waste crawl budget
  • Server response time: Keep TTFB < 500ms. Slow servers get less crawl budget.
  • Monitor crawl stats: Check Google Search Console → Settings → Crawl stats weekly. If crawl rate drops, investigate.

Indexing Rate Expectations

Google does not guarantee indexing all pages. Realistic expectations:

| Site authority | Pages submitted | Likely indexed (6 months) |
| --- | --- | --- |
| New site | 100K | 10-30% |
| Established (DR 30-50) | 100K | 40-70% |
| High authority (DR 60+) | 100K | 70-90% |

If indexing rate is low:

  1. Improve content quality on indexed pages first
  2. Earn more backlinks to category hubs
  3. Reduce total page count (prune thin pages) — a smaller, higher-quality corpus often indexes better than a large, mediocre one
  4. Ensure internal linking reaches every page within 3 clicks

6. Monitoring at Scale

At 100K pages, you can’t manually check pages. Build automated monitoring.

Key Metrics to Track

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Pages indexed | Google Search Console API | Drops > 5% week-over-week |
| Crawl rate | Google Search Console API | Drops > 20% |
| Crawl errors (5xx, 404) | Server logs, GSC | > 1% of total pages |
| CWV regressions | CrUX API or RUM | LCP > 4s or CLS > 0.25 on any template |
| Build duration | CI/CD logs | > 2x baseline |
| Build memory peak | CI/CD logs | > 80% of available memory |
| Page count by status | Database | Published count deviates from expected |
| Sufficiency score distribution | Database | > 20% of published pages below threshold |

Automated Monitoring Script

// scripts/monitor-pseo.ts — run daily via cron or CI
// db.query returns rows (as elsewhere in this skill); COUNT(*) comes back as a string
const count = async (sql: string) => Number((await db.query(sql))[0].count);

async function monitor() {
  const metrics = {
    totalPublished: await count("SELECT COUNT(*) FROM pages WHERE status = 'published'"),
    avgSufficiency: Number(
      (await db.query("SELECT AVG(data_sufficiency_score) AS avg FROM pages WHERE status = 'published'"))[0].avg
    ),
    belowThreshold: await count("SELECT COUNT(*) FROM pages WHERE data_sufficiency_score < 60 AND status = 'published'"),
    recentlyModified: await count("SELECT COUNT(*) FROM pages WHERE last_modified > NOW() - INTERVAL '7 days'"),
    neverValidated: await count("SELECT COUNT(*) FROM pages WHERE last_validated IS NULL AND status = 'published'"),
    redirectCount: await count("SELECT COUNT(*) FROM redirects"),
    brokenRedirects: await count(
      "SELECT COUNT(*) FROM redirects r WHERE NOT EXISTS (SELECT 1 FROM pages p WHERE p.canonical_path = r.destination)"
    ),
  };

  // Output report or send to monitoring service
  console.log(JSON.stringify(metrics, null, 2));

  // Alert on critical conditions
  if (metrics.brokenRedirects > 0) console.error("ALERT: Broken redirects found");
  if (metrics.belowThreshold / metrics.totalPublished > 0.2) {
    console.error("ALERT: >20% of published pages below sufficiency threshold");
  }
}

Search Console API Integration

At 100K pages, manual Search Console checks are impractical. Use the API:

  • Indexing status: Query the URL Inspection API in batches to check indexing status of new pages
  • Performance data: Pull clicks, impressions, CTR by page template to identify underperforming page types (a query sketch follows this list)
  • Coverage issues: Monitor for “Crawled — currently not indexed” and “Discovered — currently not indexed” trends
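
A sketch of pulling performance data by page with the googleapis client (the service-account setup and site URL wiring are assumptions):

import { google } from "googleapis";

async function pagePerformance(siteUrl: string, startDate: string, endDate: string) {
  const auth = new google.auth.GoogleAuth({
    scopes: ["https://www.googleapis.com/auth/webmasters.readonly"],
  });
  const searchconsole = google.searchconsole({ version: "v1", auth });

  const res = await searchconsole.searchanalytics.query({
    siteUrl,
    requestBody: {
      startDate,
      endDate,
      dimensions: ["page"],
      rowLimit: 25000, // API maximum per request; paginate with startRow beyond this
    },
  });

  // Aggregate rows by template (e.g. /services/... vs /cities/...) before alerting
  return res.data.rows ?? [];
}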

7. Edge and CDN Architecture

At 100K pages, the origin server can’t handle all traffic directly.

Caching Strategy

Client → CDN Edge → Origin (Next.js/framework)
              ↓
         Cache Layer (Redis/edge KV)
              ↓
           Database

CDN configuration:

  • Cache all pSEO pages at the edge with s-maxage=86400, stale-while-revalidate=3600
  • Use ISR revalidation to refresh cached pages (not full rebuilds)
  • Set longer TTLs for stable pages (30 days), shorter for dynamic data pages (1 day)

Cache invalidation:

  • On data change → invalidate the specific page’s cache via on-demand revalidation API (sketched below)
  • On category change → invalidate all pages in the category
  • On template change → purge the CDN for all pages of that template type
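
With the Next.js App Router, the page-level case might be a small route handler like this sketch (the secret check and path scheme are assumptions):

// app/api/revalidate/route.ts
import { revalidatePath } from "next/cache";
import { NextRequest, NextResponse } from "next/server";

export async function POST(req: NextRequest) {
  const { path, secret } = await req.json();

  if (secret !== process.env.REVALIDATE_SECRET) {
    return NextResponse.json({ error: "Invalid secret" }, { status: 401 });
  }

  revalidatePath(path); // e.g. "/services/plumbing-austin"
  return NextResponse.json({ revalidated: true, path });
}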

Edge rendering (if supported):

  • Deploy to edge runtimes (Vercel Edge, Cloudflare Workers) for < 50ms TTFB globally
  • Not all frameworks support edge rendering — check compatibility
  • Edge functions have memory limits (~128MB) that constrain complex data operations

Database Connection from Edge

Edge functions can’t maintain persistent database connections. Options:

  • HTTP-based database (PlanetScale, Neon serverless driver, Supabase edge functions); a sketch follows this list
  • Edge KV store (Cloudflare KV, Vercel KV) for index-tier data with database as source of truth
  • Pre-generated JSON at build time for index-tier data, database only for full page content
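
For the first option, a sketch using Neon's HTTP driver from an edge function (the connection string env var is an assumption):

import { neon } from "@neondatabase/serverless";

export const runtime = "edge"; // Next.js route segment config, if this lives in a route file

const sql = neon(process.env.DATABASE_URL!);

export async function getPageAtEdge(slug: string) {
  // Each call is a single HTTP round-trip; no persistent connection is held
  const rows = await sql`
    SELECT slug, title, h1, meta_description, body_content, faqs
    FROM pages
    WHERE slug = ${slug} AND status = 'published'
    LIMIT 1
  `;
  return rows[0] ?? null;
}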

8. Scale-Specific Build Strategy

At 100K pages, the build process itself needs architecture.

Don’t Build All Pages

// At 100K, only pre-build the most important pages
export async function generateStaticParams() {
  // Pre-build: hub pages + top 1K pages by traffic/priority
  const hubs = await getAllCategories();
  const topPages = await db.query(
    `SELECT slug, category FROM pages
     WHERE status = 'published' AND data_sufficiency_score >= 80
     ORDER BY priority DESC LIMIT 1000`
  );
  return [
    ...hubs.map(h => ({ category: h.slug })),
    ...topPages.map(p => ({ category: p.category, slug: p.slug })),
  ];
}

// ISR handles the remaining 99K pages on first request
export const dynamicParams = true;
export const revalidate = 86400; // 24 hours

Build Time Budget

| Pages pre-built | Expected build time | Memory |
| --- | --- | --- |
| 1K (hubs + top pages) | 5-15 minutes | 2-4GB |
| 5K | 15-45 minutes | 4-6GB |
| 10K | 30-90 minutes | 6-8GB |
| 100K (DON’T DO THIS) | 5-15 hours | 16GB+ |

Rule: Never pre-build more than 10K pages. Use ISR for everything else.

Warm-Up After Deploy

After deploying, the ISR cache is cold. The first visitor to each page triggers generation. For critical pages:

// scripts/warm-cache.ts — run after deploy
const baseUrl = process.env.SITE_URL ?? "http://localhost:3000"; // assumed env var
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function warmCache() {
  const priorityPages = await db.query(
    `SELECT canonical_path FROM pages
     WHERE data_sufficiency_score >= 80 AND status = 'published'
     ORDER BY priority DESC LIMIT 5000`
  );

  // Hit each page to trigger ISR generation (rate-limited)
  for (const page of priorityPages) {
    await fetch(`${baseUrl}${page.canonical_path}`);
    await sleep(100); // 10 pages/second — don't DDoS yourself
  }
}

Checklist

  • Database is the primary data store (not JSON files or in-memory arrays)
  • Required indexes exist on slug, category, status, last_modified, sufficiency_score
  • Data sufficiency scoring is implemented and stored per page
  • Pages with score < 60 are gated from generation
  • Combination pages are gated on both dimensions
  • Content enrichment pipeline exists (structured data first, LLM-assisted only with review)
  • Incremental validation is implemented (delta + periodic full scan)
  • Content hashes are stored for change detection
  • Sitemap is split by category with index file
  • Sitemap excludes pages below sufficiency threshold
  • Crawl budget is monitored via Search Console
  • Monitoring script runs daily with alerts
  • CDN caching is configured with appropriate TTLs
  • ISR handles the long tail (only top pages pre-built)
  • Cache warm-up script exists for post-deploy
  • Connection pooling is configured for build-time queries
  • No function loads > 10K full page records into memory

Relationship to Other Skills

  • Extends: pseo-data (replaces in-memory patterns with database), pseo-performance (adds CDN/edge and scale-specific build strategy), pseo-quality-guard (adds incremental validation)
  • Depends on: All content and structure skills must be in place before scaling
  • Validated by: pseo-quality-guard (quality doesn’t change — scale does)
  • Works with: pseo-orchestrate (scale considerations at every phase)