# Saikan Brain Dump v1 — Implementation Plan

**Status:** Draft for internal review (Oliver / Rafael / Saikan team)
**Scope:** v1 only — bring Drive information into saikan.io, masticated and queryable
**Out of scope (v1):** Planos app, Discovery app, role-based permissions, cross-client patterns, Drive webhooks
**Author:** Krillin (Saikan's technical co-pilot)
**Date:** 2026-06-17

---

## 1. Context and Goal

Saikan's current architecture has a working pipeline that intercepts client ↔ bot conversations and uploads them to Google Drive under a per-client / per-agent folder structure:

```
clients/
  cordex/
    ceo-bot/
      conversation history/
      file history/
  norbidel/
    ceo-bot/
      ...
  central-mensageiros/
    ceo-bot/
      ...
```

This works for archival and for team members (e.g. Guillermo) who can only consume content through Drive. However, the team has no first-class place inside saikan.io to **work** with this information. Reading a file today means leaving the app and opening Drive, which breaks flow and makes cross-referencing across clients painful.

**v1 goal:** surface Drive information inside saikan.io as searchable, taggable, summarized "brain events" that the team can browse, filter, and use as raw material for future work — without ever having to leave the app to read a file.

**v1 non-goals (explicitly deferred):**
- Planos application (left as future work, not built in v1)
- Discovery application (idem)
- Per-plano role-based access control
- Cross-client pattern / playbook extraction
- Live Drive webhooks (polling is sufficient for 3 clients at current volume)
- Migration of bot pipeline away from Drive

---

## 2. Guiding Decisions (Locked)

| # | Decision | Rationale |
|---|----------|-----------|
| D1 | Two products (Planos, Discovery) but **only Brain Dump is built in v1** | Focus ships; future products get their own DBs when they start |
| D2 | One brain per (client × agent), N-ready from day 1 | Cordex may add CFO-bot, ops-bot later; schema supports it now to avoid migration |
| D3 | Drive remains the source of truth for **binaries** | Bot pipeline already writes there; Guillermo depends on it; no migration in v1 |
| D4 | Supabase is the source of truth for **cognitive metadata** | Summaries, tags, entities live in saikan.io DB |
| D5 | saikan.io embeds files inline (preview mechanism chosen in §6.4); no "open in Drive" as the primary path | Eliminates the friction Oliver called out; Drive URL is a secondary action only |
| D6 | Worker uses **MiniMax** (cheap model) for tags + summary + entities + intent in **a single call** | The team calls MiniMax directly with its own API key; usage is bounded by the worker's rate limit rather than by per-token cost tracking |
| D7 | All-in on Supabase: **Edge Functions + pg_cron + RLS**, no external queue/Redis/cron services | Zero extra cost, one platform, enough for v1 volume |
| D8 | Polling every 6h via `pg_cron`, not Drive webhooks | Simpler, sufficient for ~150 files/month across 3 clients; webhooks are a future swap |
| D9 | 5 Saikan team members have full read/write access; clients never see saikan.io | Matches current product stance; no per-user roles in v1 |
| D10 | No audio files exist; audio content is already inside conversation history transcripts | Removes an entire branch of the worker |
| D11 | No monthly per-client processing cap, no cost tracking in the log | MiniMax is the team's provider; rate limit in the worker is the only guardrail |
| D12 | PDF size hard limit: **100MB**. Files above are skipped and logged. | Avoids blowing up the worker on a single huge file |
| D13 | Storage budget alert: daily job measures total `brain_files.size_bytes` and alerts the team if Supabase free-tier usage exceeds **600MB / 1GB** | Free-tier guardrail; we ask the team to free space before hitting the 1GB cap |
| D14 | Worker rate-limits itself to **1 file every 2 seconds** (configurable) | Protects against MiniMax rate limits and keeps the worker predictable |
| D15 | **Backfill on demand**: a manual "Initialize brain" trigger per client processes the entire Drive folder once, then the cron takes over | Team gets the full historical context on day 1 for the clients that want it |

---

## 3. Architecture (v1)

```
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 4: saikan.io UI (Next.js)                                  │
│   - Client selector (Cordex / Norbidel / Central Mensageiros)    │
│   - Brain Dump view (list + search + filter)                     │
│   - Detail view (summary + embedded file + tags editable)        │
└──────────────────────────────────────────────────────────────────┘
                            │ reads
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 3: Supabase (Postgres)                                     │
│   - brain_clients, brain_agents, brain_files, brain_processing_log│
│   - RLS: only 5 Saikan emails                                    │
│   - pg_cron: every 6h triggers the worker                        │
└──────────────────────────────────────────────────────────────────┘
                            │ writes (via Edge Function)
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 2: Supabase Edge Function (worker)                         │
│   - Triggered by pg_cron                                         │
│   - Lists Drive folder contents per agent                        │
│   - Filters what's new vs brain_files                            │
│   - For each new file:                                           │
│       - Skip if size > cap (PDF 100MB, other 500MB), mime        │
│         excluded, or name matches junk                           │
│       - Download to ephemeral memory                             │
│       - Extract text (PDF native, image→vision, md/txt direct)   │
│       - Single MiniMax call: summary + tags + entities + intent  │
│       - Insert into brain_files                                  │
│       - Log into brain_processing_log                            │
└──────────────────────────────────────────────────────────────────┘
                            │ reads (service account)
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 1: Google Drive (unchanged)                                │
│   - Bot continues to upload conversation history + files here    │
│   - Guillermo continues to read here                             │
│   - Folder structure preserved                                   │
└──────────────────────────────────────────────────────────────────┘
```

**Data flow direction:** Drive → Edge Function → Supabase. Never the other way in v1. The bot pipeline is untouched.

**Why not use webhooks in v1:** Drive webhooks require a publicly reachable endpoint with `changes.watch` renewal every 7 days, plus a dead-letter queue. For 3 clients producing ~150 files/month, a 6-hour `pg_cron` poll is dramatically simpler with the same outcome. Webhook support is a clean swap of the trigger later (single Edge Function, same downstream logic).

---

## 4. Database Schema (Supabase)

### 4.1 `brain_clients`

```sql
create table brain_clients (
  id text primary key,                    -- 'cordex' | 'norbidel' | 'central-mensageiros'
  name text not null,                    -- human display name
  drive_root_folder_id text not null,    -- Drive folder ID for the client root
  status text not null default 'active', -- 'active' | 'frozen' | 'archived'
  config jsonb not null default '{}'::jsonb,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);
```

`config` is reserved for future per-client overrides (monthly caps, skip patterns, language preference).

### 4.2 `brain_agents`

```sql
create table brain_agents (
  id text primary key,                          -- 'cordex-ceo-bot'
  client_id text not null references brain_clients(id) on delete cascade,
  name text not null,                           -- 'CEO Discovery Bot'
  drive_folder_id text not null,                -- Drive folder ID for this agent
  status text not null default 'active',
  config jsonb not null default '{}'::jsonb,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now(),

  unique (client_id, drive_folder_id)
);

create index idx_brain_agents_client on brain_agents(client_id);
```

### 4.3 `brain_files`

The core table. One row per file ingested from Drive.

```sql
create table brain_files (
  id uuid primary key default gen_random_uuid(),
  client_id text not null references brain_clients(id) on delete cascade,
  agent_id text not null references brain_agents(id) on delete cascade,
  drive_id text not null,                       -- canonical Drive file ID
  drive_path text,                              -- human path: 'conversation history/2026-06/file.md'
  name text not null,
  mime_type text not null,
  size_bytes bigint not null,
  file_type text not null,                      -- 'pdf' | 'image' | 'text' | 'spreadsheet' | 'video' | 'other'
  summary text,                                 -- 1–3 line chewed-up summary
  tags text[] not null default '{}',
  entities jsonb not null default '{}'::jsonb,  -- {people, orgs, dates, money, topics}
  intent text,                                  -- 'decision' | 'question' | 'info' | 'task' | 'other'
  processed_at timestamptz,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now(),

  unique (client_id, agent_id, drive_id)
);

create index idx_brain_files_client_agent on brain_files(client_id, agent_id, created_at desc);
create index idx_brain_files_tags on brain_files using gin(tags);
create index idx_brain_files_intent on brain_files(intent);
create index idx_brain_files_created on brain_files(created_at desc);
```

**Why store `drive_id` separately from a UUID:** Drive IDs are stable across renames and moves. The `unique(client_id, agent_id, drive_id)` constraint gives idempotency for free — re-running the worker on the same file does nothing.
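As a minimal sketch of that idempotency check, the filter below keeps only Drive files whose ID is not yet recorded. `DriveFileRef` and `selectNewFiles` are hypothetical names used for illustration, and the list of known IDs is assumed to come from a `brain_files` query like the one in §5.2.

```typescript
// Hypothetical shape for illustration; only the Drive ID matters for idempotency.
interface DriveFileRef {
  id: string;
  name: string;
}

// Returns the Drive files whose IDs are not yet recorded in brain_files.
// Duplicate IDs within the same listing are also collapsed to one entry.
function selectNewFiles(driveFiles: DriveFileRef[], knownDriveIds: string[]): DriveFileRef[] {
  const known = new Set<string>(knownDriveIds);
  const seen = new Set<string>();
  return driveFiles.filter((f) => {
    if (known.has(f.id) || seen.has(f.id)) return false;
    seen.add(f.id);
    return true;
  });
}
```

A renamed or moved file keeps its Drive ID, so it is treated as already known and skipped rather than ingested twice.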

**Why `embedding vector(1536)` is not in v1:** the search UX in v1 is tag/keyword filtering. Semantic search ("find files that feel similar to X") is a v2 feature; we'll add the column and the index when we need it.

### 4.4 `brain_processing_log`

Append-only audit trail. Not a state machine; just a record of what happened.

```sql
create table brain_processing_log (
  id uuid primary key default gen_random_uuid(),
  client_id text references brain_clients(id) on delete set null,
  agent_id text references brain_agents(id) on delete set null,
  drive_id text,
  status text not null,           -- 'success' | 'skipped' | 'failed'
  reason text,                     -- populated for skipped/failed
  cost_usd numeric(10, 6),         -- optional; nullable. Filled when provider exposes per-call cost. Not used in v1 with MiniMax.
  duration_ms integer,
  created_at timestamptz not null default now()
);

create index idx_processing_log_created on brain_processing_log(created_at desc);
create index idx_processing_log_status on brain_processing_log(status, created_at desc);
```

`cost_usd` is kept as a nullable column so we can track cost later if we switch providers or add manual cost input. With MiniMax and rate-limited calls, v1 doesn't need it.

### 4.5 Row Level Security

```sql
alter table brain_clients         enable row level security;
alter table brain_agents          enable row level security;
alter table brain_files           enable row level security;
alter table brain_processing_log  enable row level security;

create or replace function is_saikan_member()
returns boolean
language sql
security definer
stable
as $$
  -- lower + trim to be defensive against auth providers that capitalize
  -- or pad email claims. v1 ships with the 5 Saikan team emails as
  -- placeholders; replace these literals in the function body before
  -- the first non-Oliver login.
  select coalesce(
    lower(trim(auth.jwt() ->> 'email')) in (
      'oliver@saikan.io',
      'rafael@saikan.io',
      'daniel@saikan.io',
      'ines@saikan.io',
      'guillermo@saikan.io'
    ),
    false
  );
$$;

create policy saikan_only on brain_clients
  for all using (is_saikan_member()) with check (is_saikan_member());
create policy saikan_only on brain_agents
  for all using (is_saikan_member()) with check (is_saikan_member());
create policy saikan_only on brain_files
  for all using (is_saikan_member()) with check (is_saikan_member());
create policy saikan_only on brain_processing_log
  for all using (is_saikan_member()) with check (is_saikan_member());
```

Edge Functions use the **service role** key, which bypasses RLS — so the worker can write freely even though it never impersonates a user.

**Operational note:** the 5 email literals in `is_saikan_member()` are placeholders. To replace them, run a one-line `create or replace function` migration with the real emails before the first non-Oliver login. No app-side change is needed — the function is referenced by name in the RLS policies.

### 4.6 Seed Data

```sql
insert into brain_clients (id, name, drive_root_folder_id) values
  ('cordex',                'Cordex',                '<DRIVE_FOLDER_ID_CORDEX>'),
  ('norbidel',              'Norbidel',              '<DRIVE_FOLDER_ID_NORBIDEL>'),
  ('central-mensageiros',   'Central Mensageiros',   '<DRIVE_FOLDER_ID_CENTRAL>');

insert into brain_agents (id, client_id, name, drive_folder_id) values
  ('cordex-ceo-bot',              'cordex',                'Cordex CEO Bot',              '<DRIVE_FOLDER_ID_CORDEX_CEO>'),
  ('norbidel-ceo-bot',            'norbidel',              'Norbidel CEO Bot',            '<DRIVE_FOLDER_ID_NORBIDEL_CEO>'),
  ('central-mensageiros-ceo-bot', 'central-mensageiros',  'Central Mensageiros CEO Bot', '<DRIVE_FOLDER_ID_CENTRAL_CEO>');
```

---

## 5. The Worker (Supabase Edge Function)

### 5.1 Trigger

```sql
-- Run every 6 hours
select cron.schedule(
  'brain-dump-poll',
  '0 */6 * * *',
  $$
  select net.http_post(
    url    := 'https://<PROJECT_REF>.supabase.co/functions/v1/brain-dump-worker',
    headers := jsonb_build_object(
      'Authorization', 'Bearer ' || current_setting('app.cron_secret', true),
      'Content-Type',  'application/json'
    ),
    body   := '{}'::jsonb
  );
  $$
);
```

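`CRON_SECRET` is stored as a Supabase Edge Function secret, and the Edge Function rejects requests that do not present it. The cron job reads the same value on the Postgres side through `current_setting('app.cron_secret', true)`, so that setting must be provisioned with an identical value; if it is unset, `current_setting` returns NULL and the request carries no usable header. `net.http_post` comes from the `pg_net` extension, which must be enabled alongside `pg_cron`. Because the bearer value is not a JWT, the function likely needs to be deployed with JWT verification disabled.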
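The secret check itself can be sketched as a small pure function. This is one possible shape, not the worker's confirmed implementation: `constantTimeEqual` and `isAuthorizedCronCall` are hypothetical helper names, and the constant-time comparison is an added precaution that the plan does not require.

```typescript
// Compares two strings without returning early on the first mismatch.
function constantTimeEqual(a: string, b: string): boolean {
  const len = Math.max(a.length, b.length);
  let diff = a.length ^ b.length;
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; `| 0` turns that into 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

// Returns true only when the header is exactly "Bearer <secret>".
// An empty secret is always rejected, so a missing env var cannot open the endpoint.
function isAuthorizedCronCall(authHeader: string | null, cronSecret: string): boolean {
  if (authHeader === null || cronSecret.length === 0) return false;
  return constantTimeEqual(authHeader, `Bearer ${cronSecret}`);
}
```

Rejecting an empty secret up front guards against a misconfigured deployment in which `Bearer ` followed by nothing would otherwise match.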

### 5.2 Pseudocode (TypeScript / Deno)

```typescript
// supabase/functions/brain-dump-worker/index.ts
// deno-lint-ignore-file no-explicit-any

import { createClient } from "https://esm.sh/@supabase/supabase-js@2";
import { google } from "https://esm.sh/googleapis@128";

const SUPABASE_URL         = Deno.env.get("SUPABASE_URL")!;
const SUPABASE_SERVICE_KEY = Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!;
const MINIMAX_API_KEY       = Deno.env.get("MINIMAX_API_KEY")!;
const MINIMAX_MODEL         = Deno.env.get("MINIMAX_MODEL") ?? "MiniMax-Text-01";
const CRON_SECRET           = Deno.env.get("CRON_SECRET")!;

const supabase = createClient(SUPABASE_URL, SUPABASE_SERVICE_KEY);

const SKIP_MIME_PREFIXES = ["video/", "audio/"];   // no audio/video in v1
const SKIP_NAME_PATTERNS = [/^\./, /^~\$/, /thumbs\.db$/i];
const MAX_PDF_BYTES      = 100 * 1024 * 1024;       // 100 MB PDF cap (D12)
const MAX_OTHER_BYTES    = 500 * 1024 * 1024;       // 500 MB cap for non-PDF
const RATE_LIMIT_MS      = 2000;                    // 1 file / 2s (D14)

export default async function handler(req: Request) {
  if (req.headers.get("Authorization") !== `Bearer ${CRON_SECRET}`) {
    return new Response("unauthorized", { status: 401 });
  }

  const startedAt = Date.now();

  // 1. Load active clients and agents
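  const { data: agents, error: agentsError } = await supabase
    .from("brain_agents")
    .select("id, drive_folder_id, client_id, brain_clients!inner(status)")
    .eq("status", "active")
    .eq("brain_clients.status", "active"); // skip agents of frozen/archived clients
  if (agentsError) {
    return new Response(agentsError.message, { status: 500 });
  }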

  const drive = google.drive({ version: "v3", auth: getDriveAuth() });

  for (const agent of agents ?? []) {
    try {
      await processAgent(agent, drive);
    } catch (err) {
      await logProcessing({
        client_id: agent.client_id,
        agent_id:   agent.id,
        status:     "failed",
        reason:     String((err as Error).message).slice(0, 500),
        duration_ms: Date.now() - startedAt,
      });
    }
  }

  return new Response(JSON.stringify({ ok: true, ms: Date.now() - startedAt }), {
    headers: { "content-type": "application/json" },
  });
}

// Supabase Edge Functions register their handler through Deno.serve.
Deno.serve(handler);

async function processAgent(agent: any, drive: any) {
  // 2. List files in agent's Drive folder (paginated)
  const driveFiles = await listAll(drive, agent.drive_folder_id);

  // 3. Filter to new files not yet in brain_files
  const driveIds = driveFiles.map((f: any) => f.id);
  const { data: existing } = await supabase
    .from("brain_files")
    .select("drive_id")
    .eq("agent_id", agent.id)
    .in("drive_id", driveIds);
  const known = new Set((existing ?? []).map((r: any) => r.drive_id));

  for (const f of driveFiles) {
    if (known.has(f.id)) continue;

    // 4. Cheap pre-filters
    const isPdf = (f.mimeType ?? "") === "application/pdf";
    const sizeCap = isPdf ? MAX_PDF_BYTES : MAX_OTHER_BYTES;
    if (Number(f.size ?? 0) > sizeCap) {
      await logProcessing({ client_id: agent.client_id, agent_id: agent.id, drive_id: f.id,
        status: "skipped", reason: `size > ${isPdf ? "100MB" : "500MB"}`, duration_ms: 0 });
      continue;
    }
    if (SKIP_MIME_PREFIXES.some(p => (f.mimeType ?? "").startsWith(p))) {
      await logProcessing({ client_id: agent.client_id, agent_id: agent.id, drive_id: f.id,
        status: "skipped", reason: "mime excluded (video/audio in v1)", duration_ms: 0 });
      continue;
    }
    if (SKIP_NAME_PATTERNS.some(r => r.test(f.name ?? ""))) {
      await logProcessing({ client_id: agent.client_id, agent_id: agent.id, drive_id: f.id,
        status: "skipped", reason: "name pattern excluded", duration_ms: 0 });
      continue;
    }

    // Rate limit (D14): sleep 2s between files
    await sleep(RATE_LIMIT_MS);

    // Start timing after the sleep so duration_ms measures work, not the pause
    const t0 = Date.now();

    try {
      // 5. Download content (text or binary)
      const content = await downloadContent(drive, f.id, f.mimeType);

      // 6. Single MiniMax call → {summary, tags, entities, intent}
      const meta = await chewWithAI(content, f.name, f.mimeType);

      // 7. Insert
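      // supabase-js returns errors instead of throwing, so check explicitly;
      // otherwise a failed insert would be logged as "success" below.
      const { error: insertError } = await supabase.from("brain_files").insert({
        client_id:    agent.client_id,
        agent_id:     agent.id,
        drive_id:     f.id,
        drive_path:   f.path ?? null,
        name:         f.name,
        mime_type:    f.mimeType ?? "application/octet-stream",
        size_bytes:   Number(f.size ?? 0),
        file_type:    mapFileType(f.mimeType),
        summary:      meta.summary,
        tags:         meta.tags,
        entities:     meta.entities,
        intent:       meta.intent,
        processed_at: new Date().toISOString(),
      });
      if (insertError) throw new Error(`insert failed: ${insertError.message}`);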

      await logProcessing({
        client_id: agent.client_id, agent_id: agent.id, drive_id: f.id,
        status: "success", duration_ms: Date.now() - t0,
      });
    } catch (err) {
      await logProcessing({
        client_id: agent.client_id, agent_id: agent.id, drive_id: f.id,
        status: "failed", reason: String((err as Error).message).slice(0, 500),
        duration_ms: Date.now() - t0,
      });
    }
  }
}

async function chewWithAI(content: string, name: string, mime: string) {
  // MiniMax chat completions — JSON mode, single call.
  const res = await fetch("https://api.minimax.chat/v1/text/chatcompletion_v2", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${MINIMAX_API_KEY}`,
      "Content-Type":  "application/json",
    },
    body: JSON.stringify({
      model: MINIMAX_MODEL,
      messages: [
        { role: "system", content: SYSTEM_PROMPT },
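        // Truncate to roughly 8k tokens (§5.3); ~4 characters per token is a rough assumption.
        { role: "user",   content: `File name: ${name}\nMime type: ${mime}\nContent:\n---\n${content.slice(0, 32_000)}\n---` },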
      ],
      response_format: { type: "json_object" },
      temperature: 0.2,
    }),
  });
  if (!res.ok) throw new Error(`MiniMax ${res.status}: ${await res.text()}`);
  const json = await res.json();
  return JSON.parse(json.choices[0].message.content);
}

const SYSTEM_PROMPT = `You are a precise information-extraction assistant.
Given a file's content (and its name + mime type), produce JSON:
{
  "summary":  "1–3 sentences, plain prose, no emojis, no filler, in the language of the content",
  "tags":     ["3–7 lowercase tags, hyphenated, no spaces, no '#'"],
  "entities": { "people": [], "orgs": [], "dates": [], "money": [], "topics": [] },
  "intent":   "decision" | "question" | "info" | "task" | "other"
}
Be terse. Be specific. No preamble. Output JSON only.`;

const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));

// downloadContent, listAll, mapFileType, logProcessing, getDriveAuth
// are standard helpers.
```
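`mapFileType` is one of the helpers left undefined above. A minimal sketch, assuming the `file_type` values from §4.3 and a short list of common spreadsheet MIME types (the exact list is an assumption, not taken from the plan):

```typescript
type FileType = "pdf" | "image" | "text" | "spreadsheet" | "video" | "other";

// MIME types treated as spreadsheets (assumed list; extend as needed).
const SPREADSHEET_MIMES: string[] = [
  "text/csv",
  "application/vnd.ms-excel",
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "application/vnd.google-apps.spreadsheet",
];

// Maps a Drive MIME type onto the coarse file_type column of brain_files.
function mapFileType(mimeType: string | undefined): FileType {
  const mime = (mimeType ?? "").toLowerCase();
  if (mime === "application/pdf") return "pdf";
  // Checked before the generic text/ prefix so CSV is not classified as plain text.
  if (SPREADSHEET_MIMES.indexOf(mime) !== -1) return "spreadsheet";
  if (mime.startsWith("image/")) return "image";
  if (mime.startsWith("text/")) return "text";
  if (mime.startsWith("video/")) return "video";
  return "other";
}
```

Unknown or missing MIME types fall through to `other`, which matches the default the worker uses when Drive reports no type.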

The MiniMax endpoint and request shape above are illustrative — the worker is built against whatever the team's actual MiniMax account exposes. The contract is "one chat completions call with JSON-mode response, ~2k output tokens". If MiniMax's API surface differs, only `chewWithAI` changes.
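One consequence of the 2-second rate limit is that a single invocation can only get through a bounded number of files before the Edge Function time limit (§9 assumes 60s). The sketch below makes that arithmetic explicit. `maxFilesPerRun` and `makeDeadlineCheck` are hypothetical helpers, and the 3-second average for download plus the MiniMax call is an assumed figure, not a measurement.

```typescript
// Upper bound on files per invocation: each file costs the rate-limit pause plus its own work.
function maxFilesPerRun(budgetMs: number, rateLimitMs: number, avgWorkMs: number): number {
  const perFileMs = rateLimitMs + avgWorkMs;
  if (budgetMs <= 0 || perFileMs <= 0) return 0;
  return Math.floor(budgetMs / perFileMs);
}

// Returns a predicate that tells the loop whether another file still fits in the budget.
function makeDeadlineCheck(
  startedAtMs: number,
  budgetMs: number,
): (nowMs: number, neededMs: number) => boolean {
  const deadlineMs = startedAtMs + budgetMs;
  return (nowMs, neededMs) => nowMs + neededMs <= deadlineMs;
}

// Assumed numbers: 60s budget (§9), 2s rate limit (D14), ~3s of work per file.
const filesPerRun = maxFilesPerRun(60_000, 2_000, 3_000);
```

With these assumed numbers a run handles about 12 files, so a large backfill would span several invocations. A deadline check of this kind lets the loop stop cleanly and leave the remaining files for the next run, which is safe because known files are skipped.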

### 5.3 The AI Prompt (single call)

```typescript
const SYSTEM = `You are a precise information-extraction assistant.
Given a file's content (and its name + mime type), produce JSON:
{
  "summary":  "1–3 sentences, plain prose, no emojis, no filler, in the language of the content",
  "tags":     ["3–7 lowercase tags, hyphenated, no spaces, no '#'"],
  "entities": { "people": [], "orgs": [], "dates": [], "money": [], "topics": [] },
  "intent":   "decision" | "question" | "info" | "task" | "other"
}
Be terse. Be specific. No preamble. Output JSON only.`;

const USER = `File name: ${name}
Mime type: ${mime}
Content (truncated to ~8k tokens):
---
${content}
---`;
```

JSON-mode is enabled so the response is parseable. The whole call is well under 2k output tokens.
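JSON mode makes the response parseable, but it does not guarantee the fields match the schema in the prompt. A defensive parser along these lines could sit between the API response and the insert; `parseBrainMeta` and its helpers are hypothetical names, and the normalization rules (strip `#`, lowercase, cap at 7 tags, fall back to `other`) are assumptions derived from the prompt text.

```typescript
type Intent = "decision" | "question" | "info" | "task" | "other";
type EntityKey = "people" | "orgs" | "dates" | "money" | "topics";

interface BrainMeta {
  summary: string;
  tags: string[];
  entities: Record<EntityKey, string[]>;
  intent: Intent;
}

const INTENTS: Intent[] = ["decision", "question", "info", "task", "other"];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Keeps only the string members of an array; anything else becomes [].
function stringsOnly(v: unknown): string[] {
  return Array.isArray(v) ? v.filter((x): x is string => typeof x === "string") : [];
}

function normalizeTag(tag: string): string {
  return tag.trim().toLowerCase().replace(/^#+/, "").replace(/\s+/g, "-");
}

// Throws on malformed JSON or a missing summary; repairs everything else.
function parseBrainMeta(raw: string): BrainMeta {
  const parsed: unknown = JSON.parse(raw);
  if (!isRecord(parsed)) throw new Error("model output is not a JSON object");
  const summary = parsed["summary"];
  if (typeof summary !== "string" || summary.trim() === "") {
    throw new Error("model output has no summary");
  }
  const tags = stringsOnly(parsed["tags"])
    .map(normalizeTag)
    .filter((t, i, all) => t !== "" && all.indexOf(t) === i) // drop empties and duplicates
    .slice(0, 7);
  const entitiesValue = parsed["entities"];
  const rawEntities: Record<string, unknown> = isRecord(entitiesValue) ? entitiesValue : {};
  const intentValue = parsed["intent"];
  const intent: Intent =
    typeof intentValue === "string" && INTENTS.indexOf(intentValue as Intent) !== -1
      ? (intentValue as Intent)
      : "other";
  return {
    summary: summary.trim(),
    tags,
    entities: {
      people: stringsOnly(rawEntities["people"]),
      orgs: stringsOnly(rawEntities["orgs"]),
      dates: stringsOnly(rawEntities["dates"]),
      money: stringsOnly(rawEntities["money"]),
      topics: stringsOnly(rawEntities["topics"]),
    },
    intent,
  };
}
```

Errors thrown here would land in the worker's existing `catch` block, so a bad response is logged as `failed` and the file is picked up again on the next cycle.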

### 5.4 Cost Model

**v1 does not track per-call cost.** MiniMax is the team's provider; we rate-limit the worker (D14) and that's the only cost guardrail. The `cost_usd` column in `brain_processing_log` is nullable and reserved for future use.

If at any point we want observability back, we add a small wrapper that:
- counts `prompt_tokens + completion_tokens` from the MiniMax response
- multiplies by the provider's per-1k-token rate (we hardcode it)
- writes the result into `brain_processing_log.cost_usd`

This is a 30-line change, not a re-architecture.
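A rough sketch of that wrapper's core arithmetic follows. The token field names come from the bullet list above; where they sit in the MiniMax response is an assumption, and `USD_PER_1K_TOKENS` is a placeholder value, not MiniMax's actual price.

```typescript
// Token counts as described above; their location in the provider response is assumed.
interface TokenUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Placeholder rate for illustration only; replace with the provider's real per-1k-token price.
const USD_PER_1K_TOKENS = 0.001;

// Estimates the cost of one call, rounded to 6 decimals to fit numeric(10, 6).
function estimateCostUsd(usage: TokenUsage, usdPer1kTokens: number = USD_PER_1K_TOKENS): number {
  const totalTokens = usage.prompt_tokens + usage.completion_tokens;
  const cost = (totalTokens / 1000) * usdPer1kTokens;
  return Math.round(cost * 1e6) / 1e6;
}
```

The result would be passed as `cost_usd` in the existing `logProcessing` call; rows written without it simply keep the column null.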

### 5.5 Filters (cheap, before AI)

| Filter | Rule | Why |
|--------|------|-----|
| Size (PDF) | `> 100MB` → skip (D12) | Almost certainly a scanned book or huge report |
| Size (other) | `> 500MB` → skip | Almost certainly a video dump or oversized archive |
| Mime | `video/*`, `audio/*` → skip | Out of scope for v1 |
| Name | matches `/^\./`, `/^~\$/`, `/thumbs\.db$/i` | Junk / OS metadata |
| Already known | `drive_id` exists in `brain_files` | Idempotency |
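
The table above can be expressed as one pure function that returns a skip reason or `null`. This is a sketch under the assumption that Drive reports `size` as a string, as its v3 REST API does; `DriveFileMeta` and `skipReason` are hypothetical names.

```typescript
// Subset of Drive file metadata used by the filters (shape assumed for illustration).
interface DriveFileMeta {
  id: string;
  name?: string;
  mimeType?: string;
  size?: string; // Drive v3 returns byte counts as strings
}

const MB = 1024 * 1024;
const MAX_PDF_BYTES = 100 * MB;   // D12
const MAX_OTHER_BYTES = 500 * MB;
const SKIP_MIME_PREFIXES = ["video/", "audio/"];
const SKIP_NAME_PATTERNS = [/^\./, /^~\$/, /thumbs\.db$/i];

// Returns why a file should be skipped, or null when it should go to the AI step.
function skipReason(f: DriveFileMeta, knownDriveIds: Set<string>): string | null {
  if (knownDriveIds.has(f.id)) return "already known";
  const mime = f.mimeType ?? "";
  const isPdf = mime === "application/pdf";
  if (Number(f.size ?? 0) > (isPdf ? MAX_PDF_BYTES : MAX_OTHER_BYTES)) {
    return isPdf ? "size > 100MB" : "size > 500MB";
  }
  if (SKIP_MIME_PREFIXES.some((p) => mime.startsWith(p))) return "mime excluded";
  if (SKIP_NAME_PATTERNS.some((r) => r.test(f.name ?? ""))) return "name pattern excluded";
  return null;
}
```

Checking "already known" first keeps repeat runs cheap and avoids re-evaluating size and name rules for files that were ingested earlier.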

---

## 6. saikan.io UI (v1 scope)

A single page: **Brain Dump**. Three components.

### 6.1 Client selector (header, always visible)

```
[≡] saikan.io    [▼ Cordex · CEO Bot]               👤 Oliver
```

Switching the selector refetches the brain list. State is local to the selector; no global context provider required.

### 6.2 List view

```
┌─────────────────────────────────────────────────────────────────┐
│ 🧠 Brain Dump                                                   │
│                                                                 │
│ [🔍 Buscar...]  [📅 Todos ▼]  [🏷️ Tags ▼]  [🎯 Intent ▼]       │
│                                                                 │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 📄 contrato_nordex_2026.pdf                    hace 2h     │ │
│ │ "Contrato firmado con Nordex para integración ERP. Pago    │ │
│ │  inicial 50k€ en Q3. CEO confirma plazo 12 semanas."       │ │
│ │ #contrato #erp #nordex                            [ver]    │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 💬 conversacion_2026-06-17.md                  hace 5h     │ │
│ │ "Cliente confirma que quiere automatizar facturas..."      │ │
│ │ #facturas #automatización                        [ver]     │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ [Cargar más]                                                    │
└─────────────────────────────────────────────────────────────────┘
```

Search and filters hit Supabase directly via RSC + URL search params (no client state). Cursor-based pagination on `created_at desc`.
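Cursor pagination on `created_at desc` needs a tiebreaker, because two rows can share a timestamp. The in-memory sketch below illustrates the idea with `id` as the assumed tiebreaker. All names are hypothetical, and the string comparison assumes every `created_at` value uses the same timestamp format and offset.

```typescript
interface Cursor {
  createdAt: string;
  id: string;
}

interface Row {
  id: string;
  created_at: string;
}

// Serializes a cursor into a URL-safe string for the search params.
function encodeCursor(c: Cursor): string {
  return encodeURIComponent(JSON.stringify([c.createdAt, c.id]));
}

// Returns null for anything that is not a well-formed cursor.
function decodeCursor(s: string): Cursor | null {
  try {
    const v: unknown = JSON.parse(decodeURIComponent(s));
    if (Array.isArray(v) && v.length === 2 && typeof v[0] === "string" && typeof v[1] === "string") {
      return { createdAt: v[0], id: v[1] };
    }
  } catch {
    // fall through to null
  }
  return null;
}

// In descending order, "after the cursor" means older, or same timestamp with a smaller id.
function isAfterCursor(row: Row, c: Cursor): boolean {
  return row.created_at < c.createdAt || (row.created_at === c.createdAt && row.id < c.id);
}

// Takes rows already sorted by (created_at desc, id desc) and returns one page plus the next cursor.
function nextPage<T extends Row>(
  sortedDesc: T[],
  cursor: Cursor | null,
  limit: number,
): { items: T[]; next: string | null } {
  const rest = cursor ? sortedDesc.filter((r) => isAfterCursor(r, cursor)) : sortedDesc;
  const items = rest.slice(0, limit);
  const last = items[items.length - 1];
  const next =
    rest.length > limit && last ? encodeCursor({ createdAt: last.created_at, id: last.id }) : null;
  return { items, next };
}
```

In the real page the same predicate would be pushed into the Supabase query rather than applied in memory; the sketch only shows why the cursor carries both fields.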

### 6.3 Detail view

```
┌─────────────────────────────────────────────────────────────────┐
│ ← Volver                                                        │
│                                                                 │
│ contrato_nordex_2026.pdf                                         │
│ #contrato #erp #nordex  ·  decision  ·  hace 2h  ·  2.3 MB     │
│                                                                 │
│ Summary (auto-generated, editable):                             │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Contrato firmado con Nordex para integración ERP. Pago     │ │
│ │ inicial 50k€ en Q3. CEO confirma plazo 12 semanas.         │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ Preview:                                                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ [ Embedded PDF viewer, source per §6.4 ]                    │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ Tags: [#contrato] [#erp] [#nordex] [+ add]                     │
│                                                                 │
│ Secondary actions:                                              │
│ [ Abrir en Drive ]   [ Regenerar summary ]   [ Marcar fallido ] │
└─────────────────────────────────────────────────────────────────┘
```

- **Embedded preview:** implementation detail depends on the Drive sharing model (see §6.4 below). For text/markdown, the body is read from Drive on demand and rendered as formatted text inside saikan.io — no embed iframe needed.
- **Tag editing:** optimistic update; the change persists to `brain_files.tags`, and a row in `brain_processing_log` records the correction.
- **"Regenerar summary"** re-runs the AI for that file (debounced button; rate-limited to 1 per file per hour).
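
The one-per-hour limit on regeneration can be reduced to a small cooldown check. In this sketch `regenerateWaitMs` is a hypothetical helper, and the timestamp of the last run is assumed to come from something like `brain_files.processed_at`; the plan does not say where it is stored.

```typescript
const REGENERATE_COOLDOWN_MS = 60 * 60 * 1000; // 1 regeneration per file per hour

// Returns how long the caller must still wait, in milliseconds; 0 means "allowed now".
function regenerateWaitMs(lastRunAtMs: number | null, nowMs: number): number {
  if (lastRunAtMs === null) return 0; // never regenerated before
  return Math.max(0, lastRunAtMs + REGENERATE_COOLDOWN_MS - nowMs);
}
```

Returning the remaining wait rather than a boolean lets the UI show when the button becomes available again.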

### 6.4 Drive preview strategy (technical)

This is the one place where Drive's sharing model matters. Three options, ranked by simplicity:

**Option A — Drive folders set to "anyone with link can view" (recommended for v1)**
- The bot keeps uploading to Drive, the worker reads via the service account, and the team views files from saikan.io with no auth dance.
- Embed PDFs via `https://drive.google.com/file/d/<drive_id>/preview` (iframe).
- Embed images via `<img src="https://drive.google.com/uc?id=<drive_id>">`.
- Link sharing on the Drive folders is the only setting to enable.
- Tradeoff: anyone with the link can read the file. For internal Saikan team folders, this is fine. For folders that contain client-confidential content, this is not OK and Option B is required.

**Option B — saikan.io proxy (for when Option A is unacceptable)**
- Add a Next.js route handler served at `/api/drive-preview/[fileId]` that:
  1. Receives the request with the user's auth cookie and looks up the file in `brain_files` with that user's session, so RLS gates the request.
  2. Calls `drive.files.get({ fileId, alt: 'media' })` with the service account.
  3. Streams the response back with appropriate `content-type` and a short-lived `cache-control` header.
- The detail view embeds via `<iframe src="/api/drive-preview/<drive_id>" />` and `<img src="/api/drive-preview/<drive_id>" />`.
- Tradeoff: the saikan.io server is now in the hot path for every file view. At v1 volume this is fine. At higher volume, a CDN in front of these routes is needed.
- Adds ~½ day of work vs Option A.
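
Two small pieces of the proxy are easy to get wrong: validating the incoming file ID and setting response headers. The sketch below covers only those, leaving out the Next.js and Drive calls. Both function names are hypothetical, the ID pattern is an assumption about what Drive IDs look like, and the 5-minute cache lifetime is an illustrative value.

```typescript
// Assumption: Drive file IDs are URL-safe strings of letters, digits, "-" and "_".
const DRIVE_ID_PATTERN = /^[A-Za-z0-9_-]{10,128}$/;

// Rejects obviously malformed IDs before any Drive call is made.
function isPlausibleDriveId(fileId: string): boolean {
  return DRIVE_ID_PATTERN.test(fileId);
}

// Builds response headers for an inline preview streamed through the proxy.
function previewHeaders(
  mimeType: string | undefined,
  maxAgeSeconds: number = 300,
): Record<string, string> {
  return {
    "content-type": mimeType && mimeType.length > 0 ? mimeType : "application/octet-stream",
    // "private" keeps shared caches from storing authenticated content.
    "cache-control": `private, max-age=${maxAgeSeconds}`,
    "content-disposition": "inline",
    "x-content-type-options": "nosniff",
  };
}
```

Marking responses `private` matters if a CDN is later placed in front of these routes, since the files are only meant for authenticated team members.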

**Option C — Signed URLs (current draft, NOT recommended for embeds)**
- Build a direct download URL for `drive.files.get({ fileId, alt: 'media' })` by appending `?alt=media&access_token=...`.
- Tradeoff: the access token in the URL expires (typically after 1h), so any embed that stays open longer than that breaks, and the token itself is exposed to the browser. For an iframe preview that the user keeps open all afternoon, this is wrong. For one-shot downloads, it's fine.

**Decision needed before M3:** confirm Option A is acceptable for the 3 client folders (Cordex, Norbidel, Central Mensageiros). If any folder contains content the team does not want link-shareable, we go with Option B. The default in the plan is **A**; the implementation falls back to **B** if the team says no.

---

## 7. Implementation Sequence (6 milestones)

Each milestone is independently deployable and demoable. Stop after any milestone if the team wants to validate.

### M0 — Backfill trigger (½ day)
- Edge Function `brain-dump-backfill` (admin-only, manual POST)
- Body: `{ client_id, agent_id }` or `{ client_id }` (processes all agents for the client)
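- Behavior: runs the same pipeline as the polling worker, so it depends on the M1 schema and the M2 ingestion logic despite being numbered first. It walks **every file in the Drive folder** in one manual run, with the same rate limit (1 file / 2s) and the same filters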
- Idempotent: skips files already in `brain_files`
- Triggered from the admin UI (M5) or via curl
- **Run this once per client on day 1** to ingest historical context

### M1 — Database & RLS (1 day)
- Run schema migration on Supabase
- Add RLS policies
- Insert seed data with the 3 clients + 3 agents (with real Drive folder IDs)
- Verify: `select * from brain_clients` works as a Saikan user, fails as a non-Saikan user

### M2 — Edge Function: poll + ingest (2–3 days)
- `supabase/functions/brain-dump-worker/` (full code from §5.2)
- Drive auth via service account (JSON key as Edge Function secret)
- Single MiniMax call with the prompt in §5.3
- `pg_cron` schedule from §5.1
- Rate limit: 1 file / 2s (D14)
- Verify: trigger manually, watch 1 file get inserted into `brain_files` with summary + tags populated

### M3 — UI: list view + client selector (2 days)
- Next.js route `/(app)/brain-dump`
- Server component fetches from `brain_files` filtered by `client_id`
- Filters via URL params: `?tag=erp&intent=decision&from=2026-06-01`
- Pagination cursor
- Verify: visit `/brain-dump?client=cordex`, see real ingested files

### M4 — UI: detail view with embedded preview (1–2 days)
- Next.js route `/(app)/brain-dump/[fileId]`
- Embedded preview components per file_type
- Tag editor (inline, optimistic)
- "Regenerar summary" button (1/hour rate limit, calls a new Edge Function `brain-dump-regenerate` that takes a file id)

### M5 — Admin UI + storage budget alert (1 day)
- `/admin` view (still Saikan-only) showing:
  - List of clients and agents
  - Last 24h processing stats from `brain_processing_log` (count, success rate, failures)
  - **Storage usage gauge**: `sum(size_bytes) / 1GB`, turns red above 60% (D13)
  - "Initialize brain" button per client (calls M0)
- Daily `pg_cron` job `brain-dump-storage-check` that posts a notification (Telegram via the existing bot, or just a row in a `notifications` table the team polls) when storage > 60%
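
The gauge and the daily check share one calculation, sketched here. `storageStatus` is a hypothetical helper, and the constants read D13's "600MB / 1GB" in decimal units, which is an assumption.

```typescript
const STORAGE_CAP_BYTES = 1_000_000_000;  // 1 GB, decimal units assumed
const STORAGE_ALERT_BYTES = 600_000_000;  // 600 MB alert threshold (D13)

interface StorageStatus {
  ratio: number;   // used / cap, 0..1 and beyond
  percent: number; // rounded for display in the gauge
  alert: boolean;  // true once usage exceeds the threshold
}

// totalBytes is expected to be sum(size_bytes) over brain_files.
function storageStatus(totalBytes: number): StorageStatus {
  const ratio = totalBytes / STORAGE_CAP_BYTES;
  return {
    ratio,
    percent: Math.round(ratio * 100),
    alert: totalBytes > STORAGE_ALERT_BYTES,
  };
}
```

Using the same function in the admin view and the cron job keeps the red gauge and the notification from disagreeing about the threshold.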

**Total estimate: 7.5–9.5 working days** (M0–M5: ½ + 1 + 2–3 + 2 + 1–2 + 1), assuming the Saikan team supplies Drive folder IDs and the MiniMax key before M2.

---

## 8. Required Inputs from the Team

| # | Input | Needed for | Owner |
|---|-------|-----------|-------|
| I1 | 5 Saikan email addresses (for RLS) — placeholders OK in v1, replaced before any non-Oliver login | M1 | Oliver |
| I2 | Drive folder ID for each client root | M1 | Oliver |
| I3 | Drive folder ID for each agent (3 sub-folders) | M1 | Oliver |
| I4 | Google service account JSON key with read access to those folders | M2 | Oliver |
| I5 | MiniMax API key | M2 | Oliver |
| I6 | `CRON_SECRET` value (any 32+ char random string) | M2 | Oliver |

**Security note:** I4, I5, and I6 are sensitive. They go directly into Supabase Edge Function secrets via the Supabase dashboard, never into source control or chat.

---

## 9. Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Drive folder IDs are wrong / mis-shared | Medium | Worker can't find files | M1 verification step: human visits each folder URL, confirms it matches the client |
| Service account loses Drive access | Low | Worker stops processing | Worker logs `failed` entries; alert (manual) if last 24h has zero successes |
| MiniMax rate-limits the worker | Medium | Files pile up, eventually `brain_files` lags Drive | D14 rate limit (1 file / 2s) is conservative; if the worker still gets throttled, slow it to 1 file / 5s via the `RATE_LIMIT_MS` env var |
| MiniMax returns malformed JSON | Low | File insert fails, retried next run | Worker catches JSON parse errors, logs `failed`, file gets re-picked next cycle (drive_id check is exact match, so safe to retry) |
| Edge Function 60s timeout on huge PDFs | Low | File skipped | D12 PDF cap (100MB) keeps us well under 60s for the MiniMax call; the 60s limit is the secondary guard |
| Storage hits 1GB Supabase free tier cap | Medium | Worker breaks (no inserts possible) | D13 daily storage check alerts at 60% (600MB); team frees space (delete `processed_at is null` rows from old failed runs, or archive old summaries) |
| Team tries to access a deleted/frozen client's brain | Low | None | `status` on `brain_clients` filters out non-active in worker and UI |
| Guillermo keeps reading from Drive and sees outdated info | Low | Confusion | Doc explicitly states: Drive is read-only source, saikan.io is the working surface |
| Drive folder sharing model blocks embeds | Medium | Detail view has no inline preview, falls back to "open in Drive" | §6.4 documents three options (A/B/C); default is A with proxy B as fallback; decided before M3 |
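
The malformed-JSON mitigation in the table above can be sketched as a parser that never throws, so the worker can log `failed` and move on. The response shape (`summary` string plus `tags` string array) is an assumption based on the fields the plan says get populated; the real schema is whatever the §5.3 prompt requests. `parseModelResponse` is a hypothetical name.

```typescript
// Assumed response shape; the actual fields are defined by the §5.3 prompt.
type ParsedSummary =
  | { ok: true; summary: string; tags: string[] }
  | { ok: false; error: string };

function parseModelResponse(raw: string): ParsedSummary {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, error: "invalid JSON: " + reason };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "response is not a JSON object" };
  }
  const record = data as Record<string, unknown>;
  const summary = record["summary"];
  const tags = record["tags"];
  if (typeof summary !== "string" || summary.trim() === "") {
    return { ok: false, error: "missing or empty summary" };
  }
  if (!Array.isArray(tags) || !tags.every((tag) => typeof tag === "string")) {
    return { ok: false, error: "tags must be an array of strings" };
  }
  return { ok: true, summary, tags: tags as string[] };
}
```

On `ok: false` the worker would write a `failed` log entry and skip the insert, leaving the file eligible for the next cycle.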

---

## 10. What v1 Does Not Decide (Deferred)

These are real questions that v1 deliberately punts on. They are documented here so they don't get lost.

1. **Planos schema and UI** — versioned JSON plan structure, role-based access, owner/editor/viewer, change approval flow. (To be designed when Planos work starts.)
2. **Discovery schema and UI** — 1-per-client always-on discovery, open questions, proposals. (To be designed when Discovery work starts.)
3. **Per-agent sub-brains in the UI** — model supports N agents per client but v1 UI assumes 1 agent per client. Adding a second agent dropdown is a small UI change later.
4. **Semantic search** (`embedding vector(1536)`) — keyword/tag filter only in v1.
5. **Audio/video support** — explicitly excluded in v1; the worker already rejects them.
6. **Drive webhooks** — v1 polls every 6h; webhook swap is a single-trigger replacement.
7. **Cross-client patterns / Saikan playbooks** — explicitly out of scope, requires its own design.
8. **Migration of bot pipeline to Supabase Storage** — Oliver mentioned this as a future possibility. v1 makes it a non-breaking swap: Drive → Supabase Storage is a Layer-1 change, Layers 2/3/4 stay the same.
9. **Client-facing surfaces** — not in scope; clients continue to interact only with their bot.
10. **Per-user audit log** — `brain_processing_log` records worker events; not yet a per-user action log (no per-user actions in v1 beyond tag edits).

---

## 11. Open Questions for Review

These are the items where I (Krillin) made a call but want explicit validation before implementation:

1. **Email list for RLS** — the 5 emails in §4.5 are placeholders. Oliver confirmed placeholders are fine for v1, replaced before any non-Oliver login.
2. **First batch import** — Oliver confirmed: do a one-time backfill of all historical Drive content per client on day 1 (M0). After that, the cron takes over. If the historical folder for any client is enormous (e.g. 5,000+ files), we may want a "backfill in chunks" button. We can discover this empirically when M0 first runs.
3. **Drive preview sharing model** — see §6.4. Default is Option A ("anyone with link can view" on the 3 client folders). Needs Oliver/Rafael sign-off before M3. Fallback is Option B (saikan.io proxy).

---

## 12. Appendix A — File Type Mapping

| Drive mime | `file_type` | Worker behavior |
|------------|-------------|-----------------|
| `application/pdf` | `pdf` | Extract text natively, then AI |
| `image/*` | `image` | Vision model, then AI |
| `text/markdown`, `text/plain` | `text` | Read directly, then AI |
| `application/vnd.google-apps.document` | `text` | Export to plain text, then AI |
| `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | `spreadsheet` | Convert to markdown table, then AI |
| `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | `text` | Extract text, then AI |
| `video/*`, `audio/*` | `video` / `audio` | **Skipped in v1** |
| anything else | `other` | Skip (logged) |
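
The mapping above translates directly into a lookup function. This is an illustrative sketch; `classifyMime` and the `skip` flag are hypothetical names, and the worker in §5.2 may structure this differently.

```typescript
type FileType =
  | "pdf" | "image" | "text" | "spreadsheet" | "video" | "audio" | "other";

interface Classification {
  fileType: FileType;
  skip: boolean; // true when v1 does not process this type
}

function classifyMime(mime: string): Classification {
  const normalized = mime.trim().toLowerCase();
  // Wildcard families first.
  if (normalized.indexOf("image/") === 0) return { fileType: "image", skip: false };
  if (normalized.indexOf("video/") === 0) return { fileType: "video", skip: true };
  if (normalized.indexOf("audio/") === 0) return { fileType: "audio", skip: true };
  // Exact matches from the table.
  switch (normalized) {
    case "application/pdf":
      return { fileType: "pdf", skip: false };
    case "text/markdown":
    case "text/plain":
    case "application/vnd.google-apps.document":
    case "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
      return { fileType: "text", skip: false };
    case "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
      return { fileType: "spreadsheet", skip: false };
    default:
      // Anything else is skipped; the caller is expected to log it.
      return { fileType: "other", skip: true };
  }
}
```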

## 13. Appendix B — Environment Variables (Edge Functions)

Set in Supabase dashboard → Edge Functions → Secrets:

| Secret | Description |
|--------|-------------|
| `SUPABASE_URL` | Auto-injected |
| `SUPABASE_SERVICE_ROLE_KEY` | Auto-injected |
| `MINIMAX_API_KEY` | API key for the MiniMax chat completions endpoint |
| `MINIMAX_MODEL` | Model identifier (default: `MiniMax-Text-01`) |
| `GOOGLE_SERVICE_ACCOUNT_JSON` | Full service account JSON (base64 or raw) |
| `CRON_SECRET` | Shared secret with `pg_cron` |
| `RATE_LIMIT_MS` | Optional override for the 1 file / 2s default (D14) |
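
Because `GOOGLE_SERVICE_ACCOUNT_JSON` may be stored raw or base64-encoded, the worker needs to accept both. A minimal sketch, assuming a raw value always starts with `{`; `parseServiceAccountSecret` is a hypothetical name, and only two of the key's fields are checked here.

```typescript
// Only the fields this sketch validates; real key files carry more.
interface ServiceAccountKey {
  client_email: string;
  private_key: string;
}

function parseServiceAccountSecret(secret: string): ServiceAccountKey {
  const trimmed = secret.trim();
  // Raw JSON starts with "{"; anything else is assumed to be base64.
  // atob yields a Latin-1 string, which is fine for ASCII key files.
  const json = trimmed.charAt(0) === "{" ? trimmed : atob(trimmed);
  const parsed = JSON.parse(json) as Partial<ServiceAccountKey>;
  if (
    typeof parsed.client_email !== "string" ||
    typeof parsed.private_key !== "string"
  ) {
    throw new Error("service account JSON is missing client_email or private_key");
  }
  return { client_email: parsed.client_email, private_key: parsed.private_key };
}
```

Failing fast on a malformed secret surfaces a configuration mistake at startup rather than as an opaque Drive auth error later.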

## 14. Appendix C — Reference: Recent Context

- Repo and working dir: `D:\saikan.io\saikan.io`
- Current branch: `main` (clean working tree at time of writing)
- Recent merge of note: `641f217` — "refactor: hub-and-spoke architecture per revised plan (Piccolo/Rafael)" — confirms the project is already structured around a central hub with multiple apps. The Brain Dump fits as a hub concern; Planos and Discovery fit as spoke apps.
