Skip to main content

Data Profiling

Analyze your data to identify patterns, quality metrics, and statistical insights across tables and columns.

VIDEO TUTORIAL⏱️ 2 mins

📹 Data Profiling and Quality Metrics

Learn how the Data Profiler Agent analyzes your data to provide completeness, uniqueness, and statistical insights.

Overview

The Data Profiler Agent automatically analyzes your database tables and columns to provide comprehensive insights into data quality, patterns, and statistical properties. This helps identify data quality issues, understand data distributions, and make informed decisions about data management.

Viewing Profile Data

1. Dashboard Widget

The Data Profiler widget on the dashboard displays:

  • Data Completeness: Overall percentage of non-null values (100 - avg null %)
  • Data Uniqueness: Average distinct percentage across columns
  • Column Types: Count of columns by type (Numeric, String, Date, Other)
  • High-Null Columns: Number of columns with high null percentages

2. Table-Level Profile Tab

Access detailed profiling for individual tables:

  1. Navigate to a table in the catalog
  2. Click the Profile tab
  3. View column-level metrics:
    • Completeness: Percentage of non-null values per column
    • Uniqueness: Percentage of unique values
    • Top Values: Most frequent values in each column
    • Summary Statistics: Min, max, average for numeric columns

3. Data Profiler Agent View

The global agent view provides comprehensive profiling insights:

Data Quality Summary

Displays current profiling metrics with trend indicators:

  • Data Completeness (Avg Null %): Shows null percentage with progress bar and trend arrow
  • Data Uniqueness (Avg Distinct %): Shows distinct percentage with progress bar and trend arrow
  • High-Null Columns: Count of columns with significant null values and trend
  • Column Type Distribution: Grid showing counts for Numeric, String, Date, and Other columns

Track profiling metrics over time with interactive charts:

  • Completeness Trend: Data completeness changes over time
  • Uniqueness Trend: Data uniqueness changes over time
  • Column Type Evolution: Changes in column type distribution
  • High-Null Columns Trend: Tracking columns with quality issues

Column Profiling Metrics

For each column, the profiler analyzes:

Basic Metrics

MetricDescription
CompletenessPercentage of non-null values
UniquenessPercentage of distinct values
Null PercentagePercentage of null values
Data TypeDatabase data type

Statistical Analysis (Numeric Columns)

MetricDescription
MinMinimum value
MaxMaximum value
MeanAverage value
MedianMiddle value
Standard DeviationValue distribution spread

Pattern Analysis

MetricDescription
Top ValuesMost frequently occurring values
Value DistributionFrequency of different values
CardinalityNumber of unique values

Running Profile Analysis

Manual Execution

  1. Navigate to the Data Profiler Agent view
  2. Click the Run button
  3. The agent will analyze all tables based on configured settings
  4. View results in real-time as profiling completes

Automatic Profiling

  • Profiles are automatically generated after catalog scans
  • Scheduled profiling runs based on agent configuration
  • Incremental profiling for changed tables

Profile Settings

Default Configuration

The Data Profiler Agent uses these default settings:

SettingDescriptionDefault
Sample SizeMaximum rows to analyze per table10,000
Sampling ThresholdTables larger than this are sampled10,000 rows
Top Values LimitNumber of frequent values to capture10
Store Top ValuesWhether to store value distributionsYes

Performance Considerations

  • Small Datasets (<100K rows): Profile entire dataset
  • Medium Datasets (100K-1M rows): Use 10,000 sample size
  • Large Datasets (>1M rows): Consider smaller samples for efficiency

Quality Score Interpretation

Completeness Score

  • 90-100%: Excellent - Minimal missing data
  • 70-90%: Good - Some missing values
  • 50-70%: Fair - Significant gaps
  • Below 50%: Poor - Major data quality issues

Uniqueness Score

  • High Uniqueness (>90%): Potential primary key candidates
  • Medium Uniqueness (50-90%): Normal distribution
  • Low Uniqueness (<50%): Many duplicate values

Data Type Distribution

Helps understand:

  • Schema composition
  • Storage optimization opportunities
  • Data migration planning

Use Cases

Data Quality Assessment

  • Identify columns with high null percentages
  • Find potential data quality issues
  • Monitor data quality trends over time

Primary Key Discovery

  • Identify columns with 100% uniqueness and 0% nulls
  • Potential composite key combinations
  • Foreign key relationship candidates

Schema Optimization

  • Identify rarely used columns (high null percentage)
  • Find columns that could use different data types
  • Discover redundant or duplicate data

Data Migration Planning

  • Understand data volumes and distributions
  • Identify data cleansing requirements
  • Plan transformation rules

Best Practices

Regular Profiling

  • Schedule weekly or monthly profiling runs
  • Profile after major data loads
  • Monitor trends for early issue detection

Sampling Strategy

  • Use larger samples for critical tables
  • Adjust sample size based on data volatility
  • Consider full profiling for small tables

Acting on Results

  • Set quality thresholds and alerts
  • Document expected ranges for metrics
  • Create action plans for quality improvements

Integration with Other Features

Data Quality Rules

  • Use profiling insights to define quality rules
  • Set thresholds based on historical metrics
  • Monitor violations against baselines

Documentation

  • Auto-generate documentation from profile insights
  • Include statistics in data dictionaries
  • Share profile reports with stakeholders

Sensitivity Classification

  • Use cardinality to identify potential PII
  • Detect patterns in sensitive data
  • Validate classification accuracy

Troubleshooting

Common Issues

IssueCauseSolution
Slow profilingLarge sample sizeReduce sample size or use sampling
Incomplete resultsTimeout on large tablesIncrease timeout or reduce scope
Missing statisticsNon-numeric columnsExpected - statistics only for numeric types
Profile not updatingCache not refreshedManually trigger new profile run

API Access

Access profiling data programmatically:

  • Query profile results by table or column
  • Trigger profile runs via API
  • Export profile reports
  • Set up automated quality monitoring