Configuration File Reference¶
Complete reference for dbslice YAML configuration files.
Table of Contents¶
- Overview
- File Location
- Configuration Schema
- Sections
- version
- database
- extraction
- anonymization
- compliance
- output
- tables
- performance
- CLI Override Behavior
- Validation Rules
- Complete Examples
- Best Practices
Overview¶
dbslice supports YAML configuration files for managing complex extraction scenarios. Configuration files are useful for:
- Repeatable Extractions: Save extraction settings for consistent results
- Team Sharing: Share extraction configs with team members
- Complex Configurations: Manage multi-seed, multi-table extractions
- CI/CD Integration: Version-controlled extraction configurations
- Security: Keep sensitive settings (database URLs) out of command history
File Location¶
Default Locations¶
dbslice looks for configuration files in these locations (in order):
- File specified with `--config` flag
- `dbslice.yaml` in current directory
- `.dbslice.yaml` in current directory
- `~/.config/dbslice/config.yaml` in user home directory
Generating Configuration Files¶
# Generate default configuration
dbslice init postgresql://localhost/mydb
# Generate to specific location
dbslice init postgresql://localhost/mydb -f config/production.yaml
# Generate without sensitive field detection
dbslice init postgresql://localhost/mydb --no-detect-sensitive
Configuration Schema¶
The configuration file uses YAML format with the following top-level structure:
version: "1.0" # Optional config version tag (informational)
database: # Database connection settings
extraction: # Extraction behavior settings
anonymization: # Anonymization configuration
compliance: # Compliance profiles and audit manifest (optional)
output: # Output format settings
tables: # Per-table configuration (optional)
performance: # Performance tuning (optional)
Sections¶
version¶
Type: String. Required: No. Default: unset.
Optional schema/version tag for your own tracking. dbslice currently treats this as informational metadata.
database¶
Database connection configuration.
Schema¶
database:
url: string # Database connection URL (required)
schema: string # Schema name (optional, default: "public" for PostgreSQL)
options: object # Optional URL query options (key/value)
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | String | Yes | - | Database connection URL |
| `schema` | String | No | `"public"` | Schema name for PostgreSQL |
| `options` | Object | No | `{}` | Extra connection options merged into URL query params |
Examples¶
# Basic PostgreSQL connection
database:
url: postgresql://user:pass@localhost:5432/mydb
# With schema specification
database:
url: postgresql://user:pass@localhost:5432/mydb
schema: public
# Add query options via config
database:
url: postgresql://user:pass@localhost:5432/mydb?sslmode=disable
options:
sslmode: require
application_name: dbslice
# Environment variable (recommended for security)
database:
url: ${DATABASE_URL}
# Read from file
database:
url: ${DATABASE_URL_FILE}
database.url placeholder behavior:
- Exact-match placeholders only: the full value must be ${VAR} or ${VAR_FILE}.
- ${VAR}: uses the value of environment variable VAR.
- ${VAR_FILE}: reads file path from environment variable VAR_FILE, then uses trimmed file contents.
- Missing env var or unreadable _FILE target causes config-load validation failure.
- Partial-string interpolation is not supported.
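The `${VAR_FILE}` pattern can be tried out from a shell before running dbslice. The snippet below is an illustration only (the file path and URL are invented), and it approximates the loader's behavior by trimming whitespace from the file contents:

```shell
# Illustrative: the env var holds a *path*; the config loader reads that
# file and uses its trimmed contents as database.url.
printf '  postgresql://app:s3cret@db.internal:5432/myapp\n' > /tmp/db_url.txt
export DATABASE_URL_FILE=/tmp/db_url.txt

# Approximate what dbslice does with ${DATABASE_URL_FILE}: read and trim.
url="$(tr -d '[:space:]' < "$DATABASE_URL_FILE")"
echo "$url"   # prints postgresql://app:s3cret@db.internal:5432/myapp
```

Keeping the URL in a file rather than an inline env var also keeps credentials out of `ps` output and shell history.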
database.options precedence:
- Applied only when URL comes from config (database.url).
- If CLI provides database URL, config database.options are ignored.
extraction¶
Extraction behavior configuration.
Schema¶
extraction:
default_depth: integer # Default traversal depth
direction: string # Traversal direction (up/down/both)
exclude_tables: list[string] # Tables to exclude
validate: boolean # Enable validation
fail_on_validation_error: boolean # Stop on validation errors
max_rows_per_table: integer # Optional global row soft-cap
allow_unsafe_where: boolean # Allow subqueries in seed WHERE clauses (trusted input only)
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `default_depth` | Integer | No | `3` | Maximum FK traversal depth |
| `direction` | String | No | `"both"` | Traversal direction: `up`, `down`, or `both` |
| `exclude_tables` | List[String] | No | `[]` | Tables to exclude from extraction |
| `validate` | Boolean | No | `true` | Validate extraction for referential integrity |
| `fail_on_validation_error` | Boolean | No | `false` | Stop execution if validation finds issues |
| `max_rows_per_table` | Integer | No | unlimited | Global per-table soft-cap with integrity closure |
| `allow_unsafe_where` | Boolean | No | `false` | Allow seed subqueries like `IN (SELECT ...)` for trusted inputs |
max_rows_per_table is deterministic and integrity-first:
- dbslice first caps each table deterministically by primary key sort.
- It then adds required parent rows so FK integrity is preserved.
- Parent closure may exceed the configured cap.
- If any row limit is configured, streaming mode is disabled automatically.
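Concretely, a capped extraction needs only one extra key. A minimal sketch (the cap value is illustrative):

```yaml
extraction:
  default_depth: 3
  direction: both
  max_rows_per_table: 1000  # soft cap; required parent rows may push tables past this
```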
allow_unsafe_where notes:
- Default is false for security.
- When true, subqueries in seed WHERE clauses are allowed (for advanced filtering/join-style selection).
- Dangerous operations (DROP, DELETE, comments, stacked queries, etc.) are still blocked.
Examples¶
# Basic extraction config
extraction:
default_depth: 3
direction: both
# Exclude audit tables
extraction:
default_depth: 5
direction: both
exclude_tables:
- audit_logs
- sessions
- temp_data
- migration_history
# With validation
extraction:
default_depth: 3
direction: both
validate: true
fail_on_validation_error: false
# Parents only (dependencies)
extraction:
default_depth: 10
direction: up
validate: true
# Trusted advanced WHERE filters (subqueries)
extraction:
allow_unsafe_where: true
anonymization¶
Anonymization and data redaction configuration.
Schema¶
anonymization:
enabled: boolean # Enable anonymization
seed: string # Deterministic seed
fields: object # Exact table.column -> provider
patterns: object # Wildcard table.column glob -> provider
security_null_fields: list # Wildcard table.column globs to force NULL
deterministic: boolean # Use deterministic anonymization (default: true)
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `enabled` | Boolean | No | `false` | Enable automatic anonymization |
| `seed` | String | No | Generated | Deterministic seed for consistent anonymization |
| `fields` | Object | No | `{}` | Exact map of `table.column` to Faker method |
| `patterns` | Object | No | `{}` | Wildcard map of `table.column` glob to Faker method |
| `security_null_fields` | List[String] | No | `[]` | Wildcard `table.column` globs to force NULL |
| `deterministic` | Boolean | No | `true` | Deterministic mode (same input = same output). Set `false` for non-deterministic anonymization with stronger privacy guarantees |
Notes:
- fields keys must be exact table.column entries (no wildcards).
- patterns and security_null_fields use shell-style globs (*, ?) on table.column.
- Provider names are validated at config-load time; invalid Faker providers fail fast.
- Rule precedence: exact fields > wildcard patterns > built-in pattern matching.
- If multiple wildcard patterns match, the most specific wins (ties use first-defined order).
- Foreign-key columns are never anonymized or nulled.
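As a sketch of the precedence rules above (table and column names are invented): the exact `fields` entry always wins for `users.email`, and the more specific `users.phone*` glob beats the broader `*.phone*` glob for a column like `users.phone_mobile`:

```yaml
anonymization:
  enabled: true
  fields:
    users.email: email          # exact rule: wins over any matching glob
  patterns:
    users.phone*: phone_number  # more specific glob: wins for users.phone_mobile
    "*.phone*": phone_number    # broader glob: applies to other tables' phone columns
```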
Field Anonymization Methods¶
Common Faker methods for the fields mapping:
| Method | Description | Example Output |
|---|---|---|
| `email` | Email address | john@example.com |
| `phone_number` | Phone number | +1-555-0123 |
| `first_name` | First name | John |
| `last_name` | Last name | Doe |
| `name` | Full name | John Doe |
| `address` | Street address | 123 Main St |
| `city` | City name | New York |
| `zipcode` | ZIP/postal code | 12345 |
| `ssn` | Social Security Number | 123-45-6789 |
| `credit_card_number` | Credit card number | 4532-1234-5678-9010 |
| `ipv4` | IPv4 address | 192.168.1.1 |
| `company` | Company name | Acme Corp |
| `url` | URL | https://example.com |
See Faker documentation for complete list.
Examples¶
# Basic anonymization
anonymization:
enabled: true
# With custom seed (for deterministic output)
anonymization:
enabled: true
seed: "my-secret-seed-12345"
# Field-specific anonymization
anonymization:
enabled: true
fields:
users.email: email
users.phone: phone_number
users.first_name: first_name
users.last_name: last_name
users.ssn: ssn
customers.company: company
payments.card_number: credit_card_number
logs.ip_address: ipv4
# Wildcard anonymization + forced NULL rules
anonymization:
enabled: true
patterns:
users.*_name: name
"*.phone*": phone_number
security_null_fields:
- users.password*
- "*.api_key"
# Complete anonymization config
anonymization:
enabled: true
seed: "production-to-dev-2023"
fields:
# User PII
users.email: email
users.phone: phone_number
users.first_name: first_name
users.last_name: last_name
users.date_of_birth: date_of_birth
# Identity documents
users.ssn: ssn
users.passport: passport_number
users.driver_license: license_plate
# Financial data
payments.card_number: credit_card_number
payments.routing_number: aba
payments.account_number: bban
# Contact information
customers.company: company
customers.address: address
customers.city: city
customers.postal_code: postcode
# Network data
logs.ip_address: ipv4
sessions.user_agent: user_agent
# Disable anonymization for specific environments
anonymization:
enabled: false # For non-production to non-production transfers
compliance¶
Compliance profile and audit manifest configuration.
Schema¶
compliance:
profiles: list[string] # Compliance profiles to apply
strict: boolean # Fail if uncovered PII detected
generate_manifest: boolean # Generate audit manifest
policy_mode: string # Runtime policy gates: off|standard|strict
allow_url_patterns: list[string]# Regex allow-list for source DB URL
deny_url_patterns: list[string] # Regex deny-list for source DB URL
required_sslmode: string # Required sslmode query value in DB URL
require_ci: boolean # Require CI=true environment
sign_manifest: boolean # HMAC-sign manifest when key is available
manifest_key_env: string # Env var name containing signing key
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `profiles` | List[String] | No | `[]` | Compliance profiles: `gdpr`, `hipaa`, `pci-dss` |
| `strict` | Boolean | No | `false` | Fail extraction if value-based PII scanning detects unmasked PII |
| `generate_manifest` | Boolean | No | `false` | Generate a JSON audit manifest alongside output (auto-enabled when profiles are active) |
| `policy_mode` | String | No | `"off"` | Compliance policy gates: `off`, `standard`, `strict` |
| `allow_url_patterns` | List[String] | No | `[]` | Source DB URL must match one of these regex patterns (if set) |
| `deny_url_patterns` | List[String] | No | `[]` | Source DB URL must not match any of these regex patterns |
| `required_sslmode` | String | No | - | Required PostgreSQL `sslmode` query parameter value |
| `require_ci` | Boolean | No | `false` | Fail when running outside CI (`CI=true` expected) |
| `sign_manifest` | Boolean | No | `false` | Sign manifest with HMAC-SHA256 (tamper detection, not non-repudiation) |
| `manifest_key_env` | String | No | `"DBSLICE_MANIFEST_SIGNING_KEY"` | Env var containing HMAC signing key (shared secret) |
Compliance Profiles¶
| Profile | Description | Key Coverage |
|---|---|---|
| `gdpr` | EU General Data Protection Regulation | Names, email, phone, address, IP, DOB, SSN, financial IDs |
| `hipaa` | HIPAA Safe Harbor (18 identifiers) | All 18 Safe Harbor identifiers including medical record numbers, device IDs, dates |
| `pci-dss` | PCI-DSS v4.0 | PAN, cardholder name, expiration, CVV/PIN (NULLed) |
When a compliance profile is active:
- Anonymization is auto-enabled (no need for anonymization.enabled: true)
- Profile-defined column patterns are merged as fallback wildcard rules (user exact fields > user patterns > profile patterns > built-ins)
- Value-based scanning runs in two phases:
- coverage scan (pre-mask) to detect PII presence
- residual scan (post-mask) on unprotected columns only (strict mode fails only here)
- Free-text columns (notes, comments, descriptions) are flagged as warnings
- Audit manifest is generated by default
Policy Modes¶
policy_mode adds runtime guardrails when compliance profiles are active. These are CLI-level checks that prevent accidental misconfiguration — they are not a security boundary.
- `off`: No policy gates (default).
- `standard` / `strict`: Block risky defaults: stdout output, `--allow-unsafe-where`, and non-masked extraction are rejected unless overridden with `--allow-raw`. Both modes currently apply the same gates; `strict` is reserved for future tightening.
Breakglass override: --allow-raw --breakglass-reason "..." --ticket-id "...". The reason and ticket ID are recorded in the manifest for audit purposes.
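Putting the breakglass flags together, a gated run might look like the following (the config path, seed, and ticket ID are placeholders):

```shell
dbslice extract \
  --config config/production.yaml \
  --seed "orders.id=1" \
  --allow-raw \
  --breakglass-reason "One-off raw pull for incident response" \
  --ticket-id "INC-1234" \
  --out-file incident_subset.sql
```

The reason and ticket ID end up in the audit manifest, so make them meaningful to a future reviewer.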
Important: Pseudonymization vs Anonymization¶
dbslice's anonymization is technically pseudonymization under GDPR (deterministic mode: same input = same output, reversible with seed knowledge). For stronger privacy guarantees, use anonymization.deterministic: false (non-deterministic mode), which uses random seeds per value but loses cross-table consistency.
True GDPR anonymization (where re-identification is "not reasonably possible") may require additional measures beyond what dbslice provides (k-anonymity, data generalization, etc.).
Audit Manifest¶
When generate_manifest is enabled, dbslice writes a *.manifest.json file alongside the output containing:
- Extraction metadata (timestamp, version, seed hash)
- Per-table breakdown of masked, NULLed, FK-preserved, and unmasked fields
- Residual PII scan results from value-based scanning
- Compliance warnings (e.g., free-text columns that may contain embedded PII)
- Output file hash set (`sha256`) for produced artifacts
- Optional breakglass metadata (reason + ticket) when override is used
- Optional HMAC-SHA256 signature for tamper detection (symmetric key — integrity checking, not non-repudiation)
This manifest provides structured evidence for audit reviews. For non-repudiation (provable origin), sign the manifest externally with cosign or GPG in your CI pipeline.
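For a quick integrity check of a signed manifest, you can recompute the HMAC yourself with `openssl`. This is a sketch under the assumption that the signature covers the raw manifest bytes; dbslice's exact signing scheme (which fields are covered, how they are encoded) may differ, so treat it as illustrative:

```shell
# Illustrative: assumes the HMAC is computed over the raw file bytes.
printf '{"tables": []}' > /tmp/manifest.json
export DBSLICE_MANIFEST_SIGNING_KEY="shared-secret"

# Recompute HMAC-SHA256 with the shared key; the last field is the hex digest.
sig="$(openssl dgst -sha256 -hmac "$DBSLICE_MANIFEST_SIGNING_KEY" /tmp/manifest.json | awk '{print $NF}')"
echo "$sig"   # 64 hex chars; compare against the recorded signature
```

Anyone holding the shared key can forge a signature, which is why this gives tamper detection but not non-repudiation.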
Examples¶
# HIPAA-compliant extraction
compliance:
profiles: [hipaa]
strict: true
generate_manifest: true
anonymization:
enabled: true
seed: "hipaa-compliant-seed-2024"
# Multiple compliance profiles
compliance:
profiles: [gdpr, pci-dss]
strict: false
generate_manifest: true
# Non-deterministic mode for stronger privacy
compliance:
profiles: [gdpr]
strict: true
anonymization:
enabled: true
deterministic: false # Random output each run
output¶
Output format and generation configuration.
Schema¶
output:
format: string # Output format (sql/json/csv)
include_transaction: boolean # Wrap in BEGIN/COMMIT
include_truncate: boolean # Include TRUNCATE TABLE statements
disable_fk_checks: boolean # Disable FK checks during import
file_mode: string # Output file permissions (octal, e.g. "600")
json_mode: string # JSON mode (single/per-table)
json_pretty: boolean # Pretty-print JSON
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `format` | String | No | `"sql"` | Output format: `sql`, `json`, or `csv` |
| `include_transaction` | Boolean | No | `true` | Wrap SQL in BEGIN/COMMIT |
| `include_truncate` | Boolean | No | `false` | Include `TRUNCATE TABLE ... CASCADE` before inserts |
| `disable_fk_checks` | Boolean | No | `false` | For PostgreSQL SQL output, emits deferred-constraint statements and enables non-nullable cycle fallback when FKs are DEFERRABLE |
| `file_mode` | String/Octal | No | `"600"` | File permissions for generated outputs |
| `json_mode` | String | No | `"single"` | JSON mode: `single` or `per-table` |
| `json_pretty` | Boolean | No | `true` | Pretty-print JSON output |
Examples¶
# Basic SQL output
output:
format: sql
# SQL with transactions
output:
format: sql
include_transaction: true
include_truncate: false
# SQL for test fixtures (destructive)
output:
format: sql
include_transaction: true
include_truncate: true # Truncates tables before inserting
disable_fk_checks: true # Disables FK checks during import
`include_drop_tables` is still accepted as a backward-compatible alias for `include_truncate`, but it is deprecated.
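If an older config uses the deprecated key, the rename is mechanical:

```yaml
output:
  # include_drop_tables: true   # deprecated alias, still accepted
  include_truncate: true        # preferred key
```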
Cycle note for PostgreSQL SQL imports:
- When cycles have no nullable FK, dbslice can still generate SQL if disable_fk_checks: true and cycle FKs are DEFERRABLE.
- If cycle FKs are not deferrable, extraction fails with a clear error.
# JSON output (single file)
output:
format: json
json_mode: single
json_pretty: true
# JSON output (per-table files)
output:
format: json
json_mode: per-table
json_pretty: true
# Compact JSON for APIs
output:
format: json
json_mode: single
json_pretty: false
tables¶
Per-table configuration (optional advanced feature).
Schema¶
tables:
table_name:
skip: boolean # Skip table entirely
depth: integer # Per-table DOWN depth override
direction: string # Per-table direction override: up/down/both
max_rows: integer # Per-table row soft-cap (overrides global)
anonymize_fields: object # Deprecated alias: column -> faker provider
exclude: boolean # Deprecated alias for skip
Examples¶
# Per-table overrides
tables:
sessions:
skip: true
audit_logs:
skip: true
orders:
depth: 2
direction: up
users:
max_rows: 100
anonymize_fields:
phone: phone_number
Legacy aliases:
- `tables.<name>.exclude` is accepted as deprecated alias of `skip`.
- `tables.<name>.anonymize_fields` is accepted as deprecated alias; prefer `anonymization.fields`.
- If both `anonymization.fields` and `tables.<name>.anonymize_fields` set the same `table.column`, `anonymization.fields` wins.
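For example (table and column invented), if both locations map the same column, the modern key wins:

```yaml
anonymization:
  fields:
    users.phone: phone_number   # wins for users.phone
tables:
  users:
    anonymize_fields:
      phone: msisdn             # deprecated alias: ignored for users.phone
```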
performance¶
Performance tuning configuration (optional).
Schema¶
performance:
profile: boolean # Enable query profiling
batch_size: integer # Adapter query batch size
streaming:
enabled: boolean # Force streaming mode
threshold: integer # Auto-enable threshold (rows)
chunk_size: integer # Rows per chunk
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `profile` | Boolean | No | `false` | Enable query profiling |
| `batch_size` | Integer | No | adapter default | Query parameter batch size for PostgreSQL adapter |
| `streaming.enabled` | Boolean | No | `false` | Force streaming mode |
| `streaming.threshold` | Integer | No | `50000` | Auto-enable streaming above this row count |
| `streaming.chunk_size` | Integer | No | `1000` | Rows per chunk in streaming mode |
Examples¶
# Basic performance config
performance:
profile: true
# Streaming configuration
performance:
streaming:
enabled: false # Auto-enable based on threshold
threshold: 100000 # Enable streaming at 100K rows
chunk_size: 1000 # Process 1K rows at a time
# Aggressive performance tuning
performance:
profile: true
batch_size: 2000
streaming:
enabled: false
threshold: 50000
chunk_size: 2000
# Memory-constrained environment
performance:
streaming:
enabled: true # Always stream
threshold: 10000 # Low threshold
chunk_size: 500 # Small chunks
CLI Override Behavior¶
Command-line arguments take precedence over configuration file settings. This allows you to:
- Use a base configuration file
- Override specific settings via CLI for one-off extractions
Override Rules¶
- CLI always wins: CLI arguments override config file settings
- Merge behavior: Some options (like anonymization field mappings + CLI `--redact`) are merged
- Complete replacement: Others (like depth, direction, exclude tables) are replaced
Override Examples¶
Config file (dbslice.yaml):
version: "1.0"
database:
url: postgresql://localhost/mydb
extraction:
default_depth: 3
direction: both
exclude_tables:
- audit_logs
- sessions
anonymization:
enabled: true
CLI overrides:
# Override depth
dbslice extract --config dbslice.yaml --seed "orders.id=1" --depth 5
# Result: depth=5 (CLI wins)
# Override direction
dbslice extract --config dbslice.yaml --seed "orders.id=1" --direction up
# Result: direction=up (CLI wins)
# Override excluded tables
dbslice extract --config dbslice.yaml --seed "orders.id=1" --exclude temp_data
# Result: exclude_tables = [temp_data] (CLI replacement)
# Disable anonymization
dbslice extract --config dbslice.yaml --seed "orders.id=1" --no-anonymize
# Result: anonymization disabled (CLI wins)
# Override database URL
dbslice extract postgresql://other-host/db --config dbslice.yaml --seed "orders.id=1"
# Result: Uses postgresql://other-host/db (CLI wins)
Validation Rules¶
Configuration files are validated when loaded. Common validation errors:
Schema Validation¶
Database URL Validation¶
# ❌ Invalid: Unsupported protocol
database:
url: mysql://localhost/mydb # MySQL not yet supported
# ❌ Invalid: Malformed URL
database:
url: not-a-valid-url
# ✅ Valid: PostgreSQL URL
database:
url: postgresql://localhost/mydb
Direction Validation¶
# ❌ Invalid: Unknown direction
extraction:
direction: sideways
# ✅ Valid: Known directions
extraction:
direction: up # or "down", "both"
Depth Validation¶
# ❌ Invalid: Negative depth
extraction:
default_depth: -1
# ❌ Invalid: Zero depth
extraction:
default_depth: 0
# ✅ Valid: Positive depth
extraction:
default_depth: 3
Output Format Validation¶
# ❌ Invalid: Unknown format
output:
format: xml
# ✅ Valid: Supported formats
output:
format: sql # or "json", "csv"
Complete Examples¶
Development Environment¶
config/development.yaml:
version: "1.0"
database:
url: postgresql://localhost:5432/myapp_dev
extraction:
default_depth: 3
direction: both
exclude_tables:
- audit_logs
- sessions
- temp_data
validate: true
fail_on_validation_error: false
anonymization:
enabled: false # No need to anonymize dev-to-dev
output:
format: sql
include_transaction: true
include_truncate: false
performance:
profile: false
streaming:
enabled: false
threshold: 50000
Usage:
dbslice extract \
  --config config/development.yaml \
  --seed "orders.id=1" \
  --out-file dev_subset.sql
Production to Staging¶
config/prod_to_staging.yaml:
version: "1.0"
database:
url: ${PRODUCTION_DATABASE_URL} # From environment
extraction:
default_depth: 5
direction: both
exclude_tables:
- audit_logs
- sessions
- analytics_events
- email_logs
validate: true
fail_on_validation_error: true
anonymization:
enabled: true
seed: "prod-to-staging-2024"
fields:
# User PII
users.email: email
users.phone: phone_number
users.first_name: first_name
users.last_name: last_name
users.ssn: ssn
users.passport: passport_number
# Financial data
payments.card_number: credit_card_number
payments.routing_number: aba
payments.cvv: random_int
# Contact info
customers.company: company
customers.address: address
customers.city: city
output:
format: sql
include_transaction: true
include_truncate: false
performance:
profile: true
streaming:
enabled: false
threshold: 100000
chunk_size: 1000
Usage:
export PRODUCTION_DATABASE_URL="postgresql://prod.example.com/myapp"
dbslice extract \
--config config/prod_to_staging.yaml \
--seed "users:created_at >= '2024-01-01' AND status='active'" \
--out-file staging_subset.sql \
--verbose
HIPAA-Compliant Extraction¶
config/hipaa_compliant.yaml:
version: "1.0"
database:
url: ${MEDICAL_DATABASE_URL}
extraction:
default_depth: 3
direction: both
exclude_tables:
- audit_logs
- system_events
validate: true
fail_on_validation_error: true
compliance:
profiles: [hipaa]
strict: true # Fail if PII detected in output
generate_manifest: true # Generate audit trail
anonymization:
enabled: true
seed: "hipaa-compliant-extraction-2024"
deterministic: false # Non-deterministic for stronger privacy
Usage:
export MEDICAL_DATABASE_URL="postgresql://medical-db.example.com/ehr"
dbslice extract \
--config config/hipaa_compliant.yaml \
--seed "patients.id=12345" \
--out-file patient_subset.sql
# Output:
# patient_subset.sql (anonymized data)
# patient_subset.manifest.json (audit manifest for compliance team)
Test Fixture Generation¶
config/test_fixtures.yaml:
version: "1.0"
database:
url: postgresql://localhost/myapp_dev
extraction:
default_depth: 10 # Deep traversal for complete fixtures
direction: both
validate: true
fail_on_validation_error: true
anonymization:
enabled: true
seed: "test-fixtures-stable" # Stable seed for reproducible tests
fields:
users.email: email
users.phone: phone_number
output:
format: sql
include_transaction: true
include_truncate: true # Destructive - for test DB
disable_fk_checks: false # Keep FK validation
performance:
profile: false
streaming:
enabled: false
Usage:
dbslice extract \
--config config/test_fixtures.yaml \
--seed "users.email='test@example.com'" \
--seed "products:is_test_product=true" \
--out-file tests/fixtures/baseline.sql
CI/CD Integration¶
config/ci.yaml:
version: "1.0"
database:
url: ${CI_DATABASE_URL}
extraction:
default_depth: 3
direction: both
exclude_tables:
- audit_logs
- sessions
validate: true
fail_on_validation_error: true # Fail CI on validation errors
anonymization:
enabled: true
seed: ${CI_ANONYMIZATION_SEED} # From CI secrets
fields:
users.email: email
users.ssn: ssn
output:
format: sql
include_transaction: true
performance:
profile: false
streaming:
enabled: false
threshold: 10000 # Lower threshold for CI
CI Pipeline:
# .github/workflows/test.yml
steps:
- name: Generate test data
env:
CI_DATABASE_URL: ${{ secrets.TEST_DB_URL }}
CI_ANONYMIZATION_SEED: ${{ secrets.ANONYMIZATION_SEED }}
run: |
dbslice extract \
--config config/ci.yaml \
--seed "users:is_test_user=true" \
--out-file test_data.sql
- name: Load test data
run: |
psql $CI_DATABASE_URL < test_data.sql
Large Dataset Migration¶
config/migration.yaml:
version: "1.0"
database:
url: ${SOURCE_DATABASE_URL}
extraction:
default_depth: 3
direction: both
validate: true
fail_on_validation_error: false # Don't fail on orphaned records
anonymization:
enabled: false # Disable for migration
output:
format: sql
include_transaction: true
include_truncate: false
performance:
profile: true
streaming:
enabled: true # Always stream
threshold: 10000 # Low threshold
chunk_size: 1000
Usage:
export SOURCE_DATABASE_URL="postgresql://source.example.com/myapp"
dbslice extract \
--config config/migration.yaml \
--seed "orders:created_at >= '2024-01-01'" \
--out-file migration_2024.sql \
--verbose
Best Practices¶
1. Version Control Configuration Files¶
# Commit config files to version control
git add config/*.yaml
git commit -m "Add dbslice extraction configs"
# Use .gitignore for environment-specific files
echo "config/local.yaml" >> .gitignore
2. Use Environment Variables for Secrets¶
# ❌ Bad: Hardcoded credentials
database:
url: postgresql://user:password123@prod.example.com/myapp
# ✅ Good: Environment variable
database:
url: ${DATABASE_URL}
3. Document Configuration Files¶
version: "1.0"
# Production to Staging configuration
# Purpose: Extract anonymized subset for staging environment
# Updated: 2024-01-15
# Owner: DevOps Team
database:
url: ${PRODUCTION_DATABASE_URL}
extraction:
# Depth of 5 captures full order history
default_depth: 5
direction: both
# Exclude high-volume tables
exclude_tables:
- audit_logs # 500M+ rows
- analytics_events # 1B+ rows
4. Separate Configs by Environment¶
config/
├── development.yaml # Local development
├── staging.yaml # Staging environment
├── production.yaml # Production reads
├── ci.yaml # CI/CD pipeline
└── migration.yaml # Data migration
5. Test Configuration Files¶
# Validate config file
dbslice extract --config config/production.yaml --dry-run --seed "orders.id=1"
# Test with small dataset first
dbslice extract --config config/production.yaml --seed "orders.id=12345" --depth 1
6. Use Profiles for Different Scenarios¶
# Base configuration
version: "1.0"
database:
url: ${DATABASE_URL}
extraction:
default_depth: 3
direction: both
# Override for specific scenarios via CLI
# Bug reproduction: --depth 10 --profile
# Quick test: --depth 1 --no-validate
# Large dataset: --stream --stream-threshold 10000
See Also¶
- CLI Reference -- Command-line interface
- Advanced Usage -- Anonymization, streaming, virtual FKs