Command Line Interface
Pointblank provides a powerful command-line interface (CLI) that allows you to perform data validation directly from the terminal without writing Python code. This is especially useful for:
- quick data quality checks in CI/CD pipelines
- exploratory data analysis
- integration with shell scripts and automation workflows
- non-Python users who want to leverage Pointblank’s validation capabilities
Installation
The CLI is automatically available when you install Pointblank:
pip install pointblank
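Once installed, the pb command should be available in the same environment as your Python interpreter. A quick sanity check (assuming the top-level command accepts --help, as the subcommands shown later in this guide do):
# Confirm the CLI is on your PATH (assumption: top-level --help prints usage and subcommands)
pb --help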
Terminal Demos
Check out our CLI terminal demos to see the CLI in action with comprehensive command examples and real output.
Basic Usage
The main CLI command is pb validate, which performs single-step validations:
pb validate [DATA_SOURCE] --check [CHECK_TYPE] [OPTIONS]
Data Sources
You can validate various types of data sources:
CSV files
pb validate data.csv --check rows-distinct
Parquet files (including glob patterns and directories)
pb validate data.parquet --check col-vals-not-null --column age
pb validate "data/*.parquet" --check rows-distinct
pb validate data/ --check rows-complete # directory of parquet files
GitHub URLs (direct links to CSV or Parquet files)
pb validate "https://github.com/user/repo/blob/main/data.csv" --check rows-distinct
pb validate "https://raw.githubusercontent.com/user/repo/main/data.parquet" --check col-exists --column id
Database tables (connection strings)
pb validate "duckdb:///path/to/db.ddb::table_name" --check rows-complete
Built-in datasets
pb validate small_table --check col-exists --column a
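If you want to try the CLI without any data of your own, the built-in datasets also work with the other subcommands used throughout this guide. A small sketch (assuming pb preview, pb scan, and pb missing accept built-in dataset names the same way pb validate does):
# Explore a built-in dataset before validating it
pb preview small_table   # quick look at the rows
pb scan small_table      # column-level summary
pb missing small_table   # report on missing values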
Enhanced Data Source Support
The CLI leverages Pointblank’s centralized data processing pipeline, providing comprehensive support for various data sources:
GitHub Integration
Validate data directly from GitHub repositories without downloading files:
# Standard GitHub URLs (automatically converted to raw URLs)
pb preview "https://github.com/user/repo/blob/main/data.csv"
pb validate "https://github.com/user/repo/blob/main/sales.csv" --check rows-distinct
# Raw GitHub URLs (used directly)
pb scan "https://raw.githubusercontent.com/user/repo/main/data.parquet"
Advanced File Patterns
Support for complex file patterns and directory structures:
# Glob patterns for multiple files
pb validate "data/*.parquet" --check col-vals-not-null --column id
pb preview "sales_data_*.csv"
# Entire directories of Parquet files
pb scan data/partitioned_dataset/
pb missing warehouse/daily_reports/
# Partitioned datasets (automatically detects partition columns)
pb validate partitioned_sales/ --check rows-distinct
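Because each invocation is an ordinary shell command, these patterns combine naturally with shell scripting. A small sketch (the directory names are hypothetical) that runs the same duplicate-row check over several partitioned datasets and stops at the first failure:
# Validate several hypothetical partitioned datasets; stop at the first failure
for dataset in partitioned_sales/ partitioned_returns/ partitioned_inventory/; do
  pb validate "$dataset" --check rows-distinct --exit-code || exit 1
done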
Database Connections
Enhanced support for database connection strings:
# DuckDB databases with table specification
pb validate "duckdb:///warehouse/analytics.ddb::customer_metrics" --check col-exists --column customer_id
# Preview database tables
pb preview "duckdb:///data/sales.ddb::transactions"
Automatic Data Type Detection
The CLI automatically detects and handles:
- CSV files: single files or glob patterns
- Parquet files: files, patterns, directories, and partitioned datasets
- GitHub URLs: both standard and raw URLs for CSV/Parquet files
- Database connections: connection strings with table specifications
- Built-in datasets: Pointblank’s included sample datasets
This unified approach means you can use the same CLI commands regardless of where your data is stored.
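For example, the same duplicate-row check can be pointed at a local file, a directory of Parquet files, a GitHub-hosted file, or a database table simply by changing the data source argument (the paths and URLs below are placeholders):
# One check, four storage backends (placeholder paths and URLs)
pb validate data/sales.csv --check rows-distinct
pb validate data/partitioned_sales/ --check rows-distinct
pb validate "https://github.com/user/repo/blob/main/sales.csv" --check rows-distinct
pb validate "duckdb:///warehouse/analytics.ddb::sales" --check rows-distinct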
Available Validation Checks
Data Completeness
- rows-distinct: Check if all rows are unique (no duplicates)
- rows-complete: Check if all rows are complete (no missing values)
- col-vals-not-null: Check if a column has no Null/None/NA values
Column Existence
col-exists
: Verify that a specific column exists in the dataset
Numeric Value Checks Within Columns
- col-vals-gt: column values greater than a fixed value
- col-vals-ge: column values greater than or equal to a fixed value
- col-vals-lt: column values less than a fixed value
- col-vals-le: column values less than or equal to a fixed value
Value Set Checks Within Columns
- col-vals-in-set: column values must be in a defined set
Examples
Basic Validation
# Check for duplicate rows
pb validate data.csv --check rows-distinct
# Check if all values in 'age' column are not null
pb validate data.csv --check col-vals-not-null --column age
# Check if all values in 'score' are greater than 50
pb validate data.csv --check col-vals-gt --column score --value 50
Range Validations
# Check if all ages are less than 100
pb validate data.csv --check col-vals-lt --column age --value 100
# Check if all prices are less than or equal to 1000
pb validate data.csv --check col-vals-le --column price --value 1000
# Check if all ratings are between 1 and 5 (using two commands)
pb validate data.csv --check col-vals-ge --column rating --value 1
pb validate data.csv --check col-vals-le --column rating --value 5
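If both bounds need to hold as a single pass/fail step, the two commands can be chained with && together with --exit-code, so the combined step fails as soon as either bound is violated:
# Both checks must pass for the combined step to succeed
pb validate data.csv --check col-vals-ge --column rating --value 1 --exit-code && \
  pb validate data.csv --check col-vals-le --column rating --value 5 --exit-code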
Set Membership
# Check if status values are in allowed set
pb validate data.csv --check col-vals-in-set --column status --set "active,inactive,pending"
Debugging Failed Validations
Use --show-extract to see which rows are causing validation failures:
# Show failing rows when validation fails
pb validate data.csv --check col-vals-lt --column age --value 65 --show-extract
# Limit the number of failing rows shown
pb validate data.csv --check col-vals-not-null --column email --show-extract --limit 5
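A typical debugging loop is to rerun the failing check with a small extract and then look at the data source more broadly with the other subcommands, for example (using the email check above):
# Show up to 5 failing rows for the null check on 'email'
pb validate data.csv --check col-vals-not-null --column email --show-extract --limit 5

# Then inspect the data source itself
pb preview data.csv    # sample of rows
pb missing data.csv    # summary of missing values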
Exit Codes for Automation
Use --exit-code to make the command exit with a non-zero code when validation fails:
# For use in CI/CD pipelines
pb validate data.csv --check rows-distinct --exit-code
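Outside a pipeline runner you can branch on the exit status directly; a minimal shell sketch:
# React to the validation result in a shell script
if pb validate data.csv --check rows-distinct --exit-code; then
  echo "data.csv passed: no duplicate rows"
else
  echo "data.csv failed: duplicate rows found" >&2
  exit 1
fi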
Output Examples
Successful Validation
✓ Loaded data source: data.csv
✓ Col Vals Lt validation completed
Validation Result: Column Values Less Than
  Property             Value
 ──────────────────────────────────────────────
  Data Source          data.csv
  Check Type           col-vals-lt
  Column               age
  Threshold            < 100.0
  Total Rows Tested    1,250
  Passing Rows         1,250
  Failing Rows         0
  Result               ✓ PASSED
╭─────────────────────────────────────────────────╮
│ ✓ Validation PASSED: All values in column │
│ 'age' are < 100.0 in data.csv │
╰─────────────────────────────────────────────────╯
Failed Validation
✓ Loaded data source: data.csv
✓ Col Vals Lt validation completed
Validation Result: Column Values Less Than
  Property             Value
 ──────────────────────────────────────────────
  Data Source          data.csv
  Check Type           col-vals-lt
  Column               age
  Threshold            < 65.0
  Total Rows Tested    1,250
  Passing Rows         1,180
  Failing Rows         70
  Result               ✗ FAILED
╭─────────────────────────────────────────────────╮
│ ✗ Validation FAILED: 70 values >= 65.0 found │
│ in column 'age' in data.csv │
│ 💡 Tip: Use --show-extract to see the failing │
│ rows │
╰─────────────────────────────────────────────────╯
Integration with CI/CD
The CLI is useful when integrating data validation into your CI/CD pipelines:
# Example GitHub Actions workflow
name: Data Validation

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install pointblank
      - name: Validate data quality
        run: |
          pb validate data/sales.csv --check rows-distinct --exit-code
          pb validate data/sales.csv --check col-vals-not-null --column customer_id --exit-code
          pb validate data/sales.csv --check col-vals-gt --column amount --value 0 --exit-code
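The same pattern works in any CI system that can run shell commands. A standalone script version (file paths and column names are placeholders) that stops at the first failing check:
#!/usr/bin/env bash
# validate_sales.sh: run the data quality checks and fail fast on the first error
set -euo pipefail

pb validate data/sales.csv --check rows-distinct --exit-code
pb validate data/sales.csv --check col-vals-not-null --column customer_id --exit-code
pb validate data/sales.csv --check col-vals-gt --column amount --value 0 --exit-code

echo "All data quality checks passed"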
Getting Help
Use the --help flag to see all available options:
pb validate --help
Terminal Demonstrations
Check out our step-by-step terminal demos that show real command execution with actual output.