Configuration File Reference¶
Complete reference for dbslice YAML configuration files.
Table of Contents¶
- Overview
- File Location
- Configuration Schema
- Sections
- version
- database
- extraction
- anonymization
- compliance
- output
- tables
- performance
- CLI Override Behavior
- Validation Rules
- Complete Examples
- Best Practices
Overview¶
dbslice supports YAML configuration files for managing complex extraction scenarios. Configuration files are useful for:
- Repeatable Extractions: Save extraction settings for consistent results
- Team Sharing: Share extraction configs with team members
- Complex Configurations: Manage multi-seed, multi-table extractions
- CI/CD Integration: Version-controlled extraction configurations
- Security: Keep sensitive settings (database URLs) out of command history
File Location¶
Default Locations¶
dbslice looks for configuration files in these locations (in order):
- File specified with `--config` flag
- `dbslice.yaml` in current directory
- `.dbslice.yaml` in current directory
- `~/.config/dbslice/config.yaml` in user home directory
Generating Configuration Files¶
# Generate default configuration
dbslice init postgresql://localhost/mydb
# Generate to specific location
dbslice init postgresql://localhost/mydb -f config/production.yaml
# Generate without sensitive field detection
dbslice init postgresql://localhost/mydb --no-detect-sensitive
Configuration Schema¶
The configuration file uses YAML format with the following top-level structure:
version: "1.0" # Optional config version tag (informational)
database: # Database connection settings
extraction: # Extraction behavior settings
anonymization: # Anonymization configuration
compliance: # Compliance profiles and audit manifest (optional)
output: # Output format settings
tables: # Per-table configuration (optional)
performance: # Performance tuning (optional)
Sections¶
version¶
Type: String. Required: No. Default: unset.
Optional schema/version tag for your own tracking. dbslice currently treats this as informational metadata.
database¶
Database connection configuration.
Schema¶
database:
url: string # Database connection URL (required)
schema: string # Schema name (optional, default: "public" for PostgreSQL)
options: object # Optional URL query options (key/value)
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | String | Yes | - | Database connection URL |
| `schema` | String | No | `"public"` | Schema name for PostgreSQL |
| `options` | Object | No | `{}` | Extra connection options merged into URL query params |
Examples¶
# Basic PostgreSQL connection
database:
url: postgresql://user:pass@localhost:5432/mydb
# With schema specification
database:
url: postgresql://user:pass@localhost:5432/mydb
schema: public
# Add query options via config
database:
url: postgresql://user:pass@localhost:5432/mydb?sslmode=disable
options:
sslmode: require
application_name: dbslice
# Environment variable (recommended for security)
database:
url: ${DATABASE_URL}
# Read from file
database:
url: ${DATABASE_URL_FILE}
database.url placeholder behavior:
- Exact-match placeholders only: the full value must be ${VAR} or ${VAR_FILE}.
- ${VAR}: uses the value of environment variable VAR.
- ${VAR_FILE}: reads file path from environment variable VAR_FILE, then uses trimmed file contents.
- Missing env var or unreadable _FILE target causes config-load validation failure.
- Partial-string interpolation is not supported.
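The `${VAR_FILE}` pattern can be tried out from a shell before running dbslice. The snippet below is an illustration only (the file path and URL are invented), and it approximates the loader's behavior by trimming whitespace from the file contents:

```shell
# Illustrative: the env var holds a *path*; the config loader reads that
# file and uses its trimmed contents as database.url.
printf '  postgresql://app:s3cret@db.internal:5432/myapp\n' > /tmp/db_url.txt
export DATABASE_URL_FILE=/tmp/db_url.txt

# Approximate what dbslice does with ${DATABASE_URL_FILE}: read and trim.
url="$(tr -d '[:space:]' < "$DATABASE_URL_FILE")"
echo "$url"   # prints postgresql://app:s3cret@db.internal:5432/myapp
```

Keeping the URL in a file rather than an inline env var also keeps credentials out of `ps` output and shell history.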
database.options precedence:
- Applied only when URL comes from config (database.url).
- If CLI provides database URL, config database.options are ignored.
extraction¶
Extraction behavior configuration.
Schema¶
extraction:
default_depth: integer # Default traversal depth
direction: string # Traversal direction (up/down/both)
exclude_tables: list[string] # Tables to exclude
validate: boolean # Enable validation
fail_on_validation_error: boolean # Stop on validation errors
max_rows_per_table: integer # Optional global row soft-cap
allow_unsafe_where: boolean # Allow subqueries in seed WHERE clauses (trusted input only)
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `default_depth` | Integer | No | `3` | Maximum FK traversal depth |
| `direction` | String | No | `"both"` | Traversal direction: `up`, `down`, or `both` |
| `exclude_tables` | List[String] | No | `[]` | Tables to exclude from extraction |
| `validate` | Boolean | No | `true` | Validate extraction for referential integrity |
| `fail_on_validation_error` | Boolean | No | `false` | Stop execution if validation finds issues |
| `max_rows_per_table` | Integer | No | unlimited | Global per-table soft-cap with integrity closure |
| `allow_unsafe_where` | Boolean | No | `false` | Allow seed subqueries like `IN (SELECT ...)` for trusted inputs |
max_rows_per_table is deterministic and integrity-first:
- dbslice first caps each table deterministically by primary key sort.
- It then adds required parent rows so FK integrity is preserved.
- Parent closure may exceed the configured cap.
- If any row limit is configured, streaming mode is disabled automatically.
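Concretely, a capped extraction needs only one extra key. A minimal sketch (the cap value is illustrative):

```yaml
extraction:
  default_depth: 3
  direction: both
  max_rows_per_table: 1000  # soft cap; required parent rows may push tables past this
```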
allow_unsafe_where notes:
- Default is false for security.
- When true, subqueries in seed WHERE clauses are allowed (for advanced filtering/join-style selection).
- Dangerous operations (DROP, DELETE, comments, stacked queries, etc.) are still blocked.
Examples¶
# Basic extraction config
extraction:
default_depth: 3
direction: both
# Exclude audit tables
extraction:
default_depth: 5
direction: both
exclude_tables:
- audit_logs
- sessions
- temp_data
- migration_history
# With validation
extraction:
default_depth: 3
direction: both
validate: true
fail_on_validation_error: false
# Parents only (dependencies)
extraction:
default_depth: 10
direction: up
validate: true
# Trusted advanced WHERE filters (subqueries)
extraction:
allow_unsafe_where: true
anonymization¶
Anonymization and data redaction configuration.
Schema¶
anonymization:
enabled: boolean # Enable anonymization
seed: string # Deterministic seed
fields: object # Exact table.column -> provider
patterns: object # Wildcard table.column glob -> provider
security_null_fields: list # Wildcard table.column globs to force NULL
deterministic: boolean # Use deterministic anonymization (default: true)
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `enabled` | Boolean | No | `false` | Enable automatic anonymization |
| `seed` | String | No | Generated | Deterministic seed for consistent anonymization |
| `fields` | Object | No | `{}` | Exact map of `table.column` to Faker method |
| `patterns` | Object | No | `{}` | Wildcard map of `table.column` glob to Faker method |
| `security_null_fields` | List[String] | No | `[]` | Wildcard `table.column` globs to force NULL |
| `deterministic` | Boolean | No | `true` | Deterministic mode (same input = same output). Set `false` for non-deterministic anonymization with stronger privacy guarantees |
Notes:
- fields keys must be exact table.column entries (no wildcards).
- patterns and security_null_fields use shell-style globs (*, ?) on table.column.
- Provider names are validated at config-load time; invalid Faker providers fail fast.
- Rule precedence: exact fields > wildcard patterns > built-in pattern matching.
- If multiple wildcard patterns match, the most specific wins (ties use first-defined order).
- Foreign-key columns are never anonymized or nulled.
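As a sketch of the precedence rules above (table and column names are invented): the exact `fields` entry always wins for `users.email`, and the more specific `users.phone*` glob beats the broader `*.phone*` glob for a column like `users.phone_mobile`:

```yaml
anonymization:
  enabled: true
  fields:
    users.email: email          # exact rule: wins over any matching glob
  patterns:
    users.phone*: phone_number  # more specific glob: wins for users.phone_mobile
    "*.phone*": phone_number    # broader glob: applies to other tables' phone columns
```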
Field Anonymization Methods¶
Common Faker methods for the fields mapping:
| Method | Description | Example Output |
|---|---|---|
| `email` | Email address | john@example.com |
| `phone_number` | Phone number | +1-555-0123 |
| `first_name` | First name | John |
| `last_name` | Last name | Doe |
| `name` | Full name | John Doe |
| `address` | Street address | 123 Main St |
| `city` | City name | New York |
| `zipcode` | ZIP/postal code | 12345 |
| `ssn` | Social Security Number | 123-45-6789 |
| `credit_card_number` | Credit card number | 4532-1234-5678-9010 |
| `ipv4` | IPv4 address | 192.168.1.1 |
| `company` | Company name | Acme Corp |
| `url` | URL | https://example.com |
See Faker documentation for complete list.
Examples¶
# Basic anonymization
anonymization:
enabled: true
# With custom seed (for deterministic output)
anonymization:
enabled: true
seed: "my-secret-seed-12345"
# Field-specific anonymization
anonymization:
enabled: true
fields:
users.email: email
users.phone: phone_number
users.first_name: first_name
users.last_name: last_name
users.ssn: ssn
customers.company: company
payments.card_number: credit_card_number
logs.ip_address: ipv4
# Wildcard anonymization + forced NULL rules
anonymization:
enabled: true
patterns:
users.*_name: name
"*.phone*": phone_number
security_null_fields:
- users.password*
- "*.api_key"
# Complete anonymization config
anonymization:
enabled: true
seed: "production-to-dev-2023"
fields:
# User PII
users.email: email
users.phone: phone_number
users.first_name: first_name
users.last_name: last_name
users.date_of_birth: date_of_birth
# Identity documents
users.ssn: ssn
users.passport: passport_number
users.driver_license: license_plate
# Financial data
payments.card_number: credit_card_number
payments.routing_number: aba
payments.account_number: bban
# Contact information
customers.company: company
customers.address: address
customers.city: city
customers.postal_code: postcode
# Network data
logs.ip_address: ipv4
sessions.user_agent: user_agent
# Disable anonymization for specific environments
anonymization:
enabled: false # For non-production to non-production transfers
compliance¶
Compliance profile and audit manifest configuration.
Schema¶
compliance:
profiles: list[string] # Compliance profiles to apply
strict: boolean # Fail if uncovered PII detected
generate_manifest: boolean # Generate audit manifest
policy_mode: string # Runtime policy gates: off|standard|strict
allow_url_patterns: list[string]# Regex allow-list for source DB URL
deny_url_patterns: list[string] # Regex deny-list for source DB URL
required_sslmode: string # Required sslmode query value in DB URL
require_ci: boolean # Require CI=true environment
sign_manifest: boolean # HMAC-sign manifest when key is available
manifest_key_env: string # Env var name containing signing key
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `profiles` | List[String] | No | `[]` | Compliance profiles: `gdpr`, `hipaa`, `pci-dss` |
| `strict` | Boolean | No | `false` | Fail extraction if value-based PII scanning detects unmasked PII |
| `generate_manifest` | Boolean | No | `false` | Generate a JSON audit manifest alongside output (auto-enabled when profiles are active) |
| `policy_mode` | String | No | `"off"` | Compliance policy gates: `off`, `standard`, `strict` |
| `allow_url_patterns` | List[String] | No | `[]` | Source DB URL must match one of these regex patterns (if set) |
| `deny_url_patterns` | List[String] | No | `[]` | Source DB URL must not match any of these regex patterns |
| `required_sslmode` | String | No | - | Required PostgreSQL `sslmode` query parameter value |
| `require_ci` | Boolean | No | `false` | Fail when running outside CI (`CI=true` expected) |
| `sign_manifest` | Boolean | No | `false` | Sign manifest with HMAC-SHA256 (tamper detection, not non-repudiation) |
| `manifest_key_env` | String | No | `"DBSLICE_MANIFEST_SIGNING_KEY"` | Env var containing HMAC signing key (shared secret) |
Compliance Profiles¶
| Profile | Description | Key Coverage |
|---|---|---|
| `gdpr` | EU General Data Protection Regulation | Names, email, phone, address, IP, DOB, SSN, financial IDs |
| `hipaa` | HIPAA Safe Harbor (18 identifiers) | All 18 Safe Harbor identifiers including medical record numbers, device IDs, dates |
| `pci-dss` | PCI-DSS v4.0 | PAN, cardholder name, expiration, CVV/PIN (NULLed) |
When a compliance profile is active:
- Anonymization is auto-enabled (no need for anonymization.enabled: true)
- Profile-defined column patterns are merged as fallback wildcard rules (user exact fields > user patterns > profile patterns > built-ins)
- Value-based scanning runs in two phases:
- coverage scan (pre-mask) to detect PII presence
- residual scan (post-mask) on unprotected columns only (strict mode fails only here)
- Free-text columns (notes, comments, descriptions) are flagged as warnings
- Audit manifest is generated by default
Policy Modes¶
policy_mode adds runtime guardrails when compliance profiles are active. These are CLI-level checks that prevent accidental misconfiguration — they are not a security boundary.
- `off`: No policy gates (default).
- `standard` / `strict`: Block risky defaults: stdout output, `--allow-unsafe-where`, and non-masked extraction are rejected unless overridden with `--allow-raw`. Both modes currently apply the same gates; `strict` is reserved for future tightening.
Breakglass override: --allow-raw --breakglass-reason "..." --ticket-id "...". The reason and ticket ID are recorded in the manifest for audit purposes.
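Putting the breakglass flags together, a gated run might look like the following (the config path, seed, and ticket ID are placeholders):

```shell
dbslice extract \
  --config config/production.yaml \
  --seed "orders.id=1" \
  --allow-raw \
  --breakglass-reason "One-off raw pull for incident response" \
  --ticket-id "INC-1234" \
  --out-file incident_subset.sql
```

The reason and ticket ID end up in the audit manifest, so make them meaningful to a future reviewer.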
Important: Pseudonymization vs Anonymization¶
dbslice's anonymization is technically pseudonymization under GDPR (deterministic mode: same input = same output, reversible with seed knowledge). For stronger privacy guarantees, use anonymization.deterministic: false (non-deterministic mode), which uses random seeds per value but loses cross-table consistency.
True GDPR anonymization (where re-identification is "not reasonably possible") may require additional measures beyond what dbslice provides (k-anonymity, data generalization, etc.).
Audit Manifest¶
When generate_manifest is enabled, dbslice writes a *.manifest.json file alongside the output containing:
- Extraction metadata (timestamp, version, seed hash)
- Per-table breakdown of masked, NULLed, FK-preserved, and unmasked fields
- Residual PII scan results from value-based scanning
- Compliance warnings (e.g., free-text columns that may contain embedded PII)
- Output file hash set (`sha256`) for produced artifacts
- Optional breakglass metadata (reason + ticket) when override is used
- Optional HMAC-SHA256 signature for tamper detection (symmetric key — integrity checking, not non-repudiation)
This manifest provides structured evidence for audit reviews. For non-repudiation (provable origin), sign the manifest externally with cosign or GPG in your CI pipeline.
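For a quick integrity check of a signed manifest, you can recompute the HMAC yourself with `openssl`. This is a sketch under the assumption that the signature covers the raw manifest bytes; dbslice's exact signing scheme (which fields are covered, how they are encoded) may differ, so treat it as illustrative:

```shell
# Illustrative: assumes the HMAC is computed over the raw file bytes.
printf '{"tables": []}' > /tmp/manifest.json
export DBSLICE_MANIFEST_SIGNING_KEY="shared-secret"

# Recompute HMAC-SHA256 with the shared key; the last field is the hex digest.
sig="$(openssl dgst -sha256 -hmac "$DBSLICE_MANIFEST_SIGNING_KEY" /tmp/manifest.json | awk '{print $NF}')"
echo "$sig"   # 64 hex chars; compare against the recorded signature
```

Anyone holding the shared key can forge a signature, which is why this gives tamper detection but not non-repudiation.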
Examples¶
# HIPAA-compliant extraction
compliance:
profiles: [hipaa]
strict: true
generate_manifest: true
anonymization:
enabled: true
seed: "hipaa-compliant-seed-2024"
# Multiple compliance profiles
compliance:
profiles: [gdpr, pci-dss]
strict: false
generate_manifest: true
# Non-deterministic mode for stronger privacy
compliance:
profiles: [gdpr]
strict: true
anonymization:
enabled: true
deterministic: false # Random output each run
output¶
Output format and generation configuration.
Schema¶
output:
format: string # Output format (sql/json/csv)
include_transaction: boolean # Wrap in BEGIN/COMMIT
include_truncate: boolean # Include TRUNCATE TABLE statements
disable_fk_checks: boolean # Disable FK checks during import
file_mode: string # Output file permissions (octal, e.g. "600")
json_mode: string # JSON mode (single/per-table)
json_pretty: boolean # Pretty-print JSON
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `format` | String | No | `"sql"` | Output format: `sql`, `json`, or `csv` |
| `include_transaction` | Boolean | No | `true` | Wrap SQL in BEGIN/COMMIT |
| `include_truncate` | Boolean | No | `false` | Include `TRUNCATE TABLE ... CASCADE` before inserts |
| `disable_fk_checks` | Boolean | No | `false` | For PostgreSQL SQL output, emits deferred-constraint statements and enables non-nullable cycle fallback when FKs are DEFERRABLE |
| `file_mode` | String/Octal | No | `"600"` | File permissions for generated outputs |
| `json_mode` | String | No | `"single"` | JSON mode: `single` or `per-table` |
| `json_pretty` | Boolean | No | `true` | Pretty-print JSON output |
Examples¶
# Basic SQL output
output:
format: sql
# SQL with transactions
output:
format: sql
include_transaction: true
include_truncate: false
# SQL for test fixtures (destructive)
output:
format: sql
include_transaction: true
include_truncate: true # Truncates tables before inserting
disable_fk_checks: true # Disables FK checks during import
`include_drop_tables` is still accepted as a backward-compatible alias for `include_truncate`, but it is deprecated.
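If an older config uses the deprecated key, the rename is mechanical:

```yaml
output:
  # include_drop_tables: true   # deprecated alias, still accepted
  include_truncate: true        # preferred key
```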
Cycle note for PostgreSQL SQL imports:
- When cycles have no nullable FK, dbslice can still generate SQL if disable_fk_checks: true and cycle FKs are DEFERRABLE.
- If cycle FKs are not deferrable, extraction fails with a clear error.
# JSON output (single file)
output:
format: json
json_mode: single
json_pretty: true
# JSON output (per-table files)
output:
format: json
json_mode: per-table
json_pretty: true
# Compact JSON for APIs
output:
format: json
json_mode: single
json_pretty: false
tables¶
Per-table configuration (optional advanced feature).
Schema¶
tables:
table_name:
skip: boolean # Skip table entirely
depth: integer # Per-table DOWN depth override
direction: string # Per-table direction override: up/down/both
max_rows: integer # Per-table row soft-cap (overrides global)
anonymize_fields: object # Deprecated alias: column -> faker provider
exclude: boolean # Deprecated alias for skip
Examples¶
# Per-table overrides
tables:
sessions:
skip: true
audit_logs:
skip: true
orders:
depth: 2
direction: up
users:
max_rows: 100
anonymize_fields:
phone: phone_number
Legacy aliases:
- `tables.<name>.exclude` is accepted as deprecated alias of `skip`.
- `tables.<name>.anonymize_fields` is accepted as deprecated alias; prefer `anonymization.fields`.
- If both `anonymization.fields` and `tables.<name>.anonymize_fields` set the same `table.column`, `anonymization.fields` wins.
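For example (table and column invented), if both locations map the same column, the modern key wins:

```yaml
anonymization:
  fields:
    users.phone: phone_number   # wins for users.phone
tables:
  users:
    anonymize_fields:
      phone: msisdn             # deprecated alias: ignored for users.phone
```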
performance¶
Performance tuning configuration (optional).
Schema¶
performance:
profile: boolean # Enable query profiling
batch_size: integer # Adapter query batch size
streaming:
enabled: boolean # Force streaming mode
threshold: integer # Auto-enable threshold (rows)
chunk_size: integer # Rows per chunk
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `profile` | Boolean | No | `false` | Enable query profiling |
| `batch_size` | Integer | No | adapter default | Query parameter batch size for PostgreSQL adapter |
| `streaming.enabled` | Boolean | No | `false` | Force streaming mode |
| `streaming.threshold` | Integer | No | `50000` | Auto-enable streaming above this row count |
| `streaming.chunk_size` | Integer | No | `1000` | Rows per chunk in streaming mode |
Examples¶
# Basic performance config
performance:
profile: true
# Streaming configuration
performance:
streaming:
enabled: false # Auto-enable based on threshold
threshold: 100000 # Enable streaming at 100K rows
chunk_size: 1000 # Process 1K rows at a time
# Aggressive performance tuning
performance:
profile: true
batch_size: 2000
streaming:
enabled: false
threshold: 50000
chunk_size: 2000
# Memory-constrained environment
performance:
streaming:
enabled: true # Always stream
threshold: 10000 # Low threshold
chunk_size: 500 # Small chunks
CLI Override Behavior¶
Command-line arguments take precedence over configuration file settings. This allows you to:
- Use a base configuration file
- Override specific settings via CLI for one-off extractions
Override Rules¶
- CLI always wins: CLI arguments override config file settings
- Merge behavior: Some options (like anonymization field mappings + CLI `--redact`) are merged
- Complete replacement: Others (like depth, direction, exclude tables) are replaced
Override Examples¶
Config file (dbslice.yaml):
version: "1.0"
database:
url: postgresql://localhost/mydb
extraction:
default_depth: 3
direction: both
exclude_tables:
- audit_logs
- sessions
anonymization:
enabled: true
CLI overrides:
# Override depth
dbslice extract --config dbslice.yaml --seed "orders.id=1" --depth 5
# Result: depth=5 (CLI wins)
# Override direction
dbslice extract --config dbslice.yaml --seed "orders.id=1" --direction up
# Result: direction=up (CLI wins)
# Override excluded tables
dbslice extract --config dbslice.yaml --seed "orders.id=1" --exclude temp_data
# Result: exclude_tables = [temp_data] (CLI replacement)
# Disable anonymization
dbslice extract --config dbslice.yaml --seed "orders.id=1" --no-anonymize
# Result: anonymization disabled (CLI wins)
# Override database URL
dbslice extract postgresql://other-host/db --config dbslice.yaml --seed "orders.id=1"
# Result: Uses postgresql://other-host/db (CLI wins)
Validation Rules¶
Configuration files are validated when loaded. Common validation errors:
Schema Validation¶
Database URL Validation¶
# ❌ Invalid: Unsupported protocol
database:
url: mysql://localhost/mydb # MySQL not yet supported
# ❌ Invalid: Malformed URL
database:
url: not-a-valid-url
# ✅ Valid: PostgreSQL URL
database:
url: postgresql://localhost/mydb
Direction Validation¶
# ❌ Invalid: Unknown direction
extraction:
direction: sideways
# ✅ Valid: Known directions
extraction:
direction: up # or "down", "both"
Depth Validation¶
# ❌ Invalid: Negative depth
extraction:
default_depth: -1
# ❌ Invalid: Zero depth
extraction:
default_depth: 0
# ✅ Valid: Positive depth
extraction:
default_depth: 3
Output Format Validation¶
# ❌ Invalid: Unknown format
output:
format: xml
# ✅ Valid: Supported formats
output:
format: sql # or "json", "csv"
Complete Examples¶
Development Environment¶
config/development.yaml:
version: "1.0"
database:
url: postgresql://localhost:5432/myapp_dev
extraction:
default_depth: 3
direction: both
exclude_tables:
- audit_logs
- sessions
- temp_data
validate: true
fail_on_validation_error: false
anonymization:
enabled: false # No need to anonymize dev-to-dev
output:
format: sql
include_transaction: true
include_truncate: false
performance:
profile: false
streaming:
enabled: false
threshold: 50000
Usage:
dbslice extract \
  --config config/development.yaml \
  --seed "orders.id=1" \
  --out-file dev_subset.sql
Production to Staging¶
config/prod_to_staging.yaml:
version: "1.0"
database:
url: ${PRODUCTION_DATABASE_URL} # From environment
extraction:
default_depth: 5
direction: both
exclude_tables:
- audit_logs
- sessions
- analytics_events
- email_logs
validate: true
fail_on_validation_error: true
anonymization:
enabled: true
seed: "prod-to-staging-2024"
fields:
# User PII
users.email: email
users.phone: phone_number
users.first_name: first_name
users.last_name: last_name
users.ssn: ssn
users.passport: passport_number
# Financial data
payments.card_number: credit_card_number
payments.routing_number: aba
payments.cvv: random_int
# Contact info
customers.company: company
customers.address: address
customers.city: city
output:
format: sql
include_transaction: true
include_truncate: false
performance:
profile: true
streaming:
enabled: false
threshold: 100000
chunk_size: 1000
Usage:
export PRODUCTION_DATABASE_URL="postgresql://prod.example.com/myapp"
dbslice extract \
--config config/prod_to_staging.yaml \
--seed "users:created_at >= '2024-01-01' AND status='active'" \
--out-file staging_subset.sql \
--verbose
HIPAA-Compliant Extraction¶
config/hipaa_compliant.yaml:
version: "1.0"
database:
url: ${MEDICAL_DATABASE_URL}
extraction:
default_depth: 3
direction: both
exclude_tables:
- audit_logs
- system_events
validate: true
fail_on_validation_error: true
compliance:
profiles: [hipaa]
strict: true # Fail if PII detected in output
generate_manifest: true # Generate audit trail
anonymization:
enabled: true
seed: "hipaa-compliant-extraction-2024"
deterministic: false # Non-deterministic for stronger privacy
Usage:
export MEDICAL_DATABASE_URL="postgresql://medical-db.example.com/ehr"
dbslice extract \
--config config/hipaa_compliant.yaml \
--seed "patients.id=12345" \
--out-file patient_subset.sql
# Output:
# patient_subset.sql (anonymized data)
# patient_subset.manifest.json (audit manifest for compliance team)
Test Fixture Generation¶
config/test_fixtures.yaml:
version: "1.0"
database:
url: postgresql://localhost/myapp_dev
extraction:
default_depth: 10 # Deep traversal for complete fixtures
direction: both
validate: true
fail_on_validation_error: true
anonymization:
enabled: true
seed: "test-fixtures-stable" # Stable seed for reproducible tests
fields:
users.email: email
users.phone: phone_number
output:
format: sql
include_transaction: true
include_truncate: true # Destructive - for test DB
disable_fk_checks: false # Keep FK validation
performance:
profile: false
streaming:
enabled: false
Usage:
dbslice extract \
--config config/test_fixtures.yaml \
--seed "users.email='test@example.com'" \
--seed "products:is_test_product=true" \
--out-file tests/fixtures/baseline.sql
CI/CD Integration¶
config/ci.yaml:
version: "1.0"
database:
url: ${CI_DATABASE_URL}
extraction:
default_depth: 3
direction: both
exclude_tables:
- audit_logs
- sessions
validate: true
fail_on_validation_error: true # Fail CI on validation errors
anonymization:
enabled: true
seed: ${CI_ANONYMIZATION_SEED} # From CI secrets
fields:
users.email: email
users.ssn: ssn
output:
format: sql
include_transaction: true
performance:
profile: false
streaming:
enabled: false
threshold: 10000 # Lower threshold for CI
CI Pipeline:
# .github/workflows/test.yml
steps:
- name: Generate test data
env:
CI_DATABASE_URL: ${{ secrets.TEST_DB_URL }}
CI_ANONYMIZATION_SEED: ${{ secrets.ANONYMIZATION_SEED }}
run: |
dbslice extract \
--config config/ci.yaml \
--seed "users:is_test_user=true" \
--out-file test_data.sql
- name: Load test data
run: |
psql $CI_DATABASE_URL < test_data.sql
Large Dataset Migration¶
config/migration.yaml:
version: "1.0"
database:
url: ${SOURCE_DATABASE_URL}
extraction:
default_depth: 3
direction: both
validate: true
fail_on_validation_error: false # Don't fail on orphaned records
anonymization:
enabled: false # Disable for migration
output:
format: sql
include_transaction: true
include_truncate: false
performance:
profile: true
streaming:
enabled: true # Always stream
threshold: 10000 # Low threshold
chunk_size: 1000
Usage:
export SOURCE_DATABASE_URL="postgresql://source.example.com/myapp"
dbslice extract \
--config config/migration.yaml \
--seed "orders:created_at >= '2024-01-01'" \
--out-file migration_2024.sql \
--verbose
Best Practices¶
1. Version Control Configuration Files¶
# Commit config files to version control
git add config/*.yaml
git commit -m "Add dbslice extraction configs"
# Use .gitignore for environment-specific files
echo "config/local.yaml" >> .gitignore
2. Use Environment Variables for Secrets¶
# ❌ Bad: Hardcoded credentials
database:
url: postgresql://user:password123@prod.example.com/myapp
# ✅ Good: Environment variable
database:
url: ${DATABASE_URL}
3. Document Configuration Files¶
version: "1.0"
# Production to Staging configuration
# Purpose: Extract anonymized subset for staging environment
# Updated: 2024-01-15
# Owner: DevOps Team
database:
url: ${PRODUCTION_DATABASE_URL}
extraction:
# Depth of 5 captures full order history
default_depth: 5
direction: both
# Exclude high-volume tables
exclude_tables:
- audit_logs # 500M+ rows
- analytics_events # 1B+ rows
4. Separate Configs by Environment¶
config/
├── development.yaml # Local development
├── staging.yaml # Staging environment
├── production.yaml # Production reads
├── ci.yaml # CI/CD pipeline
└── migration.yaml # Data migration
5. Test Configuration Files¶
# Validate config file
dbslice extract --config config/production.yaml --dry-run --seed "orders.id=1"
# Test with small dataset first
dbslice extract --config config/production.yaml --seed "orders.id=12345" --depth 1
6. Use Profiles for Different Scenarios¶
# Base configuration
version: "1.0"
database:
url: ${DATABASE_URL}
extraction:
default_depth: 3
direction: both
# Override for specific scenarios via CLI
# Bug reproduction: --depth 10 --profile
# Quick test: --depth 1 --no-validate
# Large dataset: --stream --stream-threshold 10000
See Also¶
- CLI Reference -- Command-line interface
- Advanced Usage -- Anonymization, streaming, virtual FKs