Database Infrastructure

This document outlines the database infrastructure for the AI Agent Orchestration Platform.

Overview

The platform uses PostgreSQL as its primary database for storing application data, with additional specialized databases for specific use cases. This document covers database setup, configuration, management, backup strategies, and scaling approaches.

Database Architecture

Primary Database: PostgreSQL

PostgreSQL is the main relational database used for:

  • User accounts and authentication
  • Workflow definitions and metadata
  • Agent registry and configuration
  • Execution history and results
  • Platform configuration

Database Architecture Diagram

Note: This is a placeholder for a database architecture diagram. The actual diagram should be created and added to the project.

Additional Databases

The platform also uses these specialized databases:

  • Vector Database (e.g., Milvus, Pinecone): For storing and querying embeddings
  • Time Series Database (e.g., TimescaleDB): For monitoring metrics
  • Graph Database (e.g., Neo4j): For relationship mapping (optional)
  • Document Database (e.g., MongoDB): For unstructured data (optional)

PostgreSQL Configuration

Basic Configuration

The PostgreSQL database uses the following baseline configuration, sized for a server with roughly 4 GB of RAM; adjust the memory settings for your hardware:

# postgresql.conf

# CONNECTIONS
max_connections = 100
superuser_reserved_connections = 3

# MEMORY
shared_buffers = 1GB
work_mem = 32MB
maintenance_work_mem = 256MB
effective_cache_size = 3GB

# WRITE AHEAD LOG
wal_level = replica
max_wal_size = 1GB
min_wal_size = 80MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB

# QUERY OPTIMIZATION
random_page_cost = 1.1
effective_io_concurrency = 200
default_statistics_target = 100

# AUTOVACUUM
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50

# LOGGING
log_destination = 'csvlog'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 100MB
log_min_duration_statement = 1000
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0

High Availability Configuration

For production environments, PostgreSQL is configured for high availability:

  • Primary-Replica Setup: Synchronous replication to standby servers
  • Automatic Failover: Using tools like Patroni or Stolon
  • Connection Pooling: Using PgBouncer for efficient connection management (an application-side sketch follows the Patroni example below)
  • Load Balancing: Using HAProxy or similar for read-write splitting

Example Patroni configuration:

scope: meta-agent-cluster
namespace: /service/meta-agent/
name: postgresql0

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgresql0:8008

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 100
        shared_buffers: 1GB
        work_mem: 32MB
        maintenance_work_mem: 256MB
        effective_cache_size: 3GB
        wal_level: replica
        max_wal_size: 1GB
        min_wal_size: 80MB
        checkpoint_completion_target: 0.9
        wal_buffers: 16MB
        random_page_cost: 1.1
        effective_io_concurrency: 200
        default_statistics_target: 100
        autovacuum: on
        autovacuum_max_workers: 3
        autovacuum_naptime: 1min
        autovacuum_vacuum_threshold: 50
        autovacuum_analyze_threshold: 50

  initdb:
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - host replication replicator 127.0.0.1/32 md5
  - host replication replicator 10.0.0.0/8 md5
  - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgresql0:5432
  data_dir: /data/postgresql0
  bin_dir: /usr/lib/postgresql/15/bin
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: replpass
    superuser:
      username: postgres
      password: secretpassword
  parameters:
    unix_socket_directories: '/var/run/postgresql'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
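
PgBouncer typically fronts the cluster for connection pooling, but pooling can also be handled inside the application. A minimal application-side sketch using psycopg2's built-in pool (host and credentials are illustrative; in production the client would point at the PgBouncer or HAProxy endpoint):

import psycopg2.pool

# Small application-side pool; PgBouncer serves the same purpose externally
# and is preferred when many short-lived clients connect.
db_pool = psycopg2.pool.SimpleConnectionPool(
    minconn=2,
    maxconn=10,
    host="localhost",          # or the PgBouncer / HAProxy endpoint
    dbname="meta_agent",
    user="meta_agent_app",
    password="secure_password",
)

conn = db_pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    db_pool.putconn(conn)      # return the connection instead of closing it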

Database Schema Management

Migration Strategy

The platform uses Alembic for database migrations:

  • Version Control: Track schema changes
  • Forward Migration: Apply schema changes
  • Backward Migration: Revert schema changes
  • CI/CD Integration: Automated migration in deployment pipeline

Example Alembic migration script:

"""Add agent execution metrics table

Revision ID: a1b2c3d4e5f6
Revises: 9z8y7x6w5v4
Create Date: 2025-04-18 14:30:45.123456

"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import JSONB, UUID

# revision identifiers, used by Alembic.
revision = 'a1b2c3d4e5f6'
down_revision = '9z8y7x6w5v4'
branch_labels = None
depends_on = None

def upgrade():
    op.create_table(
        'agent_execution_metrics',
        sa.Column('id', UUID, primary_key=True),
        sa.Column('agent_id', UUID, sa.ForeignKey('agents.id'), nullable=False),
        sa.Column('workflow_execution_id', UUID, sa.ForeignKey('workflow_executions.id'), nullable=False),
        sa.Column('start_time', sa.DateTime(timezone=True), nullable=False),
        sa.Column('end_time', sa.DateTime(timezone=True), nullable=True),
        sa.Column('status', sa.String(50), nullable=False),
        sa.Column('cpu_usage', sa.Float, nullable=True),
        sa.Column('memory_usage', sa.Float, nullable=True),
        sa.Column('input_size', sa.Integer, nullable=True),
        sa.Column('output_size', sa.Integer, nullable=True),
        sa.Column('error', sa.Text, nullable=True),
        sa.Column('metadata', JSONB, nullable=True),
        sa.Column('created_at', sa.DateTime(timezone=True), server_default=sa.func.now(), nullable=False),
        sa.Column('updated_at', sa.DateTime(timezone=True), server_default=sa.func.now(), onupdate=sa.func.now(), nullable=False)
    )
    op.create_index('idx_agent_execution_metrics_agent_id', 'agent_execution_metrics', ['agent_id'])
    op.create_index('idx_agent_execution_metrics_workflow_execution_id', 'agent_execution_metrics', ['workflow_execution_id'])
    op.create_index('idx_agent_execution_metrics_start_time', 'agent_execution_metrics', ['start_time'])
    op.create_index('idx_agent_execution_metrics_status', 'agent_execution_metrics', ['status'])

def downgrade():
    op.drop_table('agent_execution_metrics')
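
Migrations are normally applied with the Alembic CLI (for example, alembic upgrade head, wrapped by migrate_db.sh or the deployment pipeline). They can also be driven programmatically; a minimal sketch, assuming an alembic.ini at the project root:

from alembic import command
from alembic.config import Config

# Assumes alembic.ini points at the target database URL.
cfg = Config("alembic.ini")

command.upgrade(cfg, "head")     # apply all pending migrations
# command.downgrade(cfg, "-1")   # revert the most recent migration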

Schema Documentation

The database schema is documented using tools like SchemaSpy or pgModeler:

  • Entity-Relationship Diagrams: Visual representation of schema
  • Table Definitions: Detailed table and column descriptions
  • Relationship Documentation: Foreign key relationships
  • Constraint Documentation: Primary keys, unique constraints, etc.

Backup and Recovery

Backup Strategy

The platform implements a comprehensive backup strategy:

  • Full Backups: Daily full database backups
  • WAL Archiving: Continuous transaction log archiving
  • Point-in-Time Recovery: Ability to restore to any point in time
  • Offsite Storage: Backups stored in multiple locations
  • Backup Verification: Regular testing of backup restoration

Example backup script:

#!/bin/bash
# backup_db.sh - Create a full backup of the PostgreSQL database
#
# Expects DB_HOST, DB_USER, and DB_NAME in the environment; supply the
# password non-interactively via PGPASSWORD or ~/.pgpass. WAL archiving
# (archive_command) is configured separately and complements these full backups.

BACKUP_DIR="/backups/postgresql"
BACKUP_FILE="meta_agent_$(date +%Y%m%d_%H%M%S).backup"
RETENTION_DAYS=7

# Create backup directory if it doesn't exist
mkdir -p "$BACKUP_DIR"

# Create full backup in the custom format (-F c), which pg_restore can read
pg_dump -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -F c -f "$BACKUP_DIR/$BACKUP_FILE"

# Compress backup
gzip "$BACKUP_DIR/$BACKUP_FILE"

# Copy to offsite storage
aws s3 cp "$BACKUP_DIR/$BACKUP_FILE.gz" s3://meta-agent-backups/postgresql/

# Remove old backups
find "$BACKUP_DIR" -name "meta_agent_*.backup.gz" -type f -mtime +"$RETENTION_DAYS" -delete

echo "Backup completed: $BACKUP_FILE.gz"

Recovery Procedures

Recovery procedures are documented for the following scenarios:

  • Full Database Restore: Restore from full backup
  • Point-in-Time Recovery: Restore to specific point in time
  • Single Table Restore: Restore specific table
  • Disaster Recovery: Restore in new environment

Example recovery script:

#!/bin/bash
# restore_db.sh - Restore PostgreSQL database from backup
#
# Expects DB_HOST, DB_USER, and DB_NAME in the environment; supply the
# password non-interactively via PGPASSWORD or ~/.pgpass.

BACKUP_FILE=$1

if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: ./restore_db.sh [backup_file]"
  echo "Example: ./restore_db.sh meta_agent_20250418_143045.backup.gz"
  exit 1
fi

BACKUP_DIR="/backups/postgresql"

# Check if backup file exists locally; otherwise fetch it from offsite storage
if [ ! -f "$BACKUP_DIR/$BACKUP_FILE" ]; then
  aws s3 cp "s3://meta-agent-backups/postgresql/$BACKUP_FILE" "$BACKUP_DIR/"

  if [ ! -f "$BACKUP_DIR/$BACKUP_FILE" ]; then
    echo "Error: Backup file not found"
    exit 1
  fi
fi

# Uncompress if needed
if [[ $BACKUP_FILE == *.gz ]]; then
  gunzip -c "$BACKUP_DIR/$BACKUP_FILE" > "$BACKUP_DIR/${BACKUP_FILE%.gz}"
  BACKUP_FILE=${BACKUP_FILE%.gz}
fi

# Stop application services so no connections hold the database open
docker-compose stop backend frontend

# Drop and recreate the database (connect to the maintenance DB, not the target)
psql -h "$DB_HOST" -U "$DB_USER" -d postgres -c "DROP DATABASE IF EXISTS $DB_NAME;"
psql -h "$DB_HOST" -U "$DB_USER" -d postgres -c "CREATE DATABASE $DB_NAME;"

# Restore from backup
pg_restore -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -v "$BACKUP_DIR/$BACKUP_FILE"

# Start application services
docker-compose start backend frontend

echo "Database restored from $BACKUP_FILE"

Database Scaling

Vertical Scaling

Strategies for vertical scaling:

  • Resource Allocation: Increase CPU, memory, and storage
  • Storage Optimization: Use faster storage (SSD, NVMe)
  • Configuration Tuning: Optimize PostgreSQL parameters for larger hardware

Horizontal Scaling

Strategies for horizontal scaling:

  • Read Replicas: Distribute read queries across replicas (see the sketch after this list)
  • Sharding: Partition data across multiple database instances
  • Connection Pooling: Efficiently manage database connections
  • Query Optimization: Improve query performance
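
A minimal sketch of application-level read-write splitting with SQLAlchemy (host names and credentials are illustrative; HAProxy or a similar load balancer can provide the same routing at the network layer):

from itertools import cycle

from sqlalchemy import create_engine, text

# One primary for writes plus a rotating set of read replicas.
primary = create_engine(
    "postgresql://meta_agent_app:secure_password@db-primary:5432/meta_agent")
replicas = cycle([
    create_engine(
        "postgresql://meta_agent_app:secure_password@db-replica-1:5432/meta_agent"),
    create_engine(
        "postgresql://meta_agent_app:secure_password@db-replica-2:5432/meta_agent"),
])

def engine_for(readonly: bool):
    # Round-robin reads across replicas; all writes go to the primary.
    return next(replicas) if readonly else primary

with engine_for(readonly=True).connect() as conn:
    rows = conn.execute(text("SELECT id FROM agents LIMIT 10")).fetchall()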

Multi-Region Deployment

For global deployments, the platform uses a multi-region database strategy:

  • Regional Replicas: Read replicas in each region
  • Data Synchronization: Efficient data replication between regions
  • Conflict Resolution: Handle write conflicts in multi-master setup
  • Latency Optimization: Minimize cross-region queries

Database Security

Access Control

Database access control measures:

  • Role-Based Access: Granular permissions for database users
  • Connection Security: TLS encryption for all connections
  • Network Security: Firewall rules and VPC isolation
  • Audit Logging: Track all database access and changes

Example PostgreSQL role setup:

-- Create application roles
CREATE ROLE meta_agent_app WITH LOGIN PASSWORD 'secure_password';
CREATE ROLE meta_agent_readonly WITH LOGIN PASSWORD 'readonly_password';
CREATE ROLE meta_agent_admin WITH LOGIN PASSWORD 'admin_password' SUPERUSER;

-- Grant permissions to application role
GRANT CONNECT ON DATABASE meta_agent TO meta_agent_app;
GRANT USAGE ON SCHEMA public TO meta_agent_app;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO meta_agent_app;
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO meta_agent_app;

-- Grant read-only permissions
GRANT CONNECT ON DATABASE meta_agent TO meta_agent_readonly;
GRANT USAGE ON SCHEMA public TO meta_agent_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO meta_agent_readonly;

-- Set default privileges for future tables
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO meta_agent_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO meta_agent_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT USAGE ON SEQUENCES TO meta_agent_app;

Data Encryption

Database encryption measures:

  • Encryption at Rest: File system or tablespace encryption
  • Encryption in Transit: TLS for all connections
  • Column-Level Encryption: Encrypt sensitive columns (see the sketch after this list)
  • Key Management: Secure storage and rotation of encryption keys
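
Column-level encryption can be done server-side with pgcrypto or client-side before values reach the database. A minimal client-side sketch using the cryptography package (key handling is simplified here; in practice the key comes from a secrets manager or KMS):

from cryptography.fernet import Fernet

# Never hard-code or generate keys ad hoc in production; load them from
# a secrets manager and rotate them on a schedule.
key = Fernet.generate_key()
fernet = Fernet(key)

ciphertext = fernet.encrypt(b"user-api-token-12345")  # store this column value
plaintext = fernet.decrypt(ciphertext)                # decrypt after reading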

Edge Database Strategy

For edge deployments, the platform uses a specialized database strategy:

  • Local SQLite Database: Lightweight database for edge devices
  • Synchronization: Bi-directional sync with central PostgreSQL (a sketch follows below)
  • Conflict Resolution: Handle sync conflicts
  • Offline Operation: Function without connectivity

See Edge Infrastructure for more details.
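
One common implementation is a local change log: the edge device records writes while offline and replays them against the central PostgreSQL instance once connectivity returns, with last-writer-wins or application-specific rules resolving conflicts. A minimal sketch of such a change log (table and column names are illustrative):

import sqlite3

conn = sqlite3.connect("edge_agent.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS change_log (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        table_name TEXT NOT NULL,
        row_id     TEXT NOT NULL,
        operation  TEXT NOT NULL,       -- 'insert' | 'update' | 'delete'
        payload    TEXT,                -- JSON snapshot of the row
        changed_at REAL NOT NULL,
        synced     INTEGER NOT NULL DEFAULT 0
    )
""")

def pending_changes():
    # Rows not yet pushed to the central PostgreSQL instance.
    return conn.execute(
        "SELECT * FROM change_log WHERE synced = 0 ORDER BY changed_at"
    ).fetchall()

def mark_synced(ids):
    conn.executemany("UPDATE change_log SET synced = 1 WHERE id = ?",
                     [(i,) for i in ids])
    conn.commit()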

Federated Database Strategy

For federated deployments, the platform uses a specialized database strategy:

  • Data Sovereignty: Keep data within organizational boundaries
  • Secure Queries: Privacy-preserving distributed queries
  • Schema Compatibility: Ensure schema compatibility across organizations
  • Audit Trail: Track cross-organization data access

See Federated Infrastructure for more details.

Database Monitoring

The platform implements comprehensive database monitoring (a query sketch follows below):

  • Performance Metrics: Query performance, connection count, etc.
  • Resource Usage: CPU, memory, disk I/O, etc.
  • Slow Query Tracking: Identify and optimize slow queries
  • Replication Lag: Monitor replica synchronization
  • Connection Pooling: Monitor connection pool health

See Monitoring Infrastructure for more details.
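
Many of these metrics can be read directly from PostgreSQL's statistics views. A short sketch using psycopg2 and the read-only role defined earlier:

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="meta_agent",
                        user="meta_agent_readonly",
                        password="readonly_password")

with conn.cursor() as cur:
    # Connection count by state (active, idle, idle in transaction, ...)
    cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state")
    print(cur.fetchall())

    # Replication lag when run on a replica (NULL on the primary)
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS lag")
    print(cur.fetchone())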

Database Management Scripts

Scripts for database management are located in /infra/scripts/:

  • setup_db.sh - Initialize database and roles
  • backup_db.sh - Create database backup
  • restore_db.sh - Restore from backup
  • migrate_db.sh - Run database migrations
  • analyze_db.sh - Analyze database performance

Best Practices

  • Use database migrations for schema changes
  • Implement comprehensive backup strategy
  • Monitor database performance
  • Optimize queries and indexes
  • Implement connection pooling
  • Use appropriate scaling strategy
  • Secure database access
  • Regularly maintain the database (VACUUM, ANALYZE; see the sketch below)
  • Document database schema and procedures
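
For the maintenance item above, note that VACUUM cannot run inside a transaction block, so a scripted client must enable autocommit. A minimal sketch (credentials are the placeholders from the configuration examples above):

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="meta_agent",
                        user="postgres", password="secretpassword")
conn.autocommit = True  # VACUUM refuses to run inside a transaction

with conn.cursor() as cur:
    cur.execute("VACUUM (ANALYZE, VERBOSE) agent_execution_metrics")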

Last updated: 2025-04-18