Data Profiling
Analyze your data to identify patterns, quality metrics, and statistical insights across tables and columns.
📹 Data Profiling and Quality Metrics
Learn how the Data Profiler Agent analyzes your data to provide completeness, uniqueness, and statistical insights.
Overview
The Data Profiler Agent automatically analyzes your database tables and columns to provide comprehensive insights into data quality, patterns, and statistical properties. This helps identify data quality issues, understand data distributions, and make informed decisions about data management.
Viewing Profile Data
1. Dashboard Widget
The Data Profiler widget on the dashboard displays:
- Data Completeness: Overall percentage of non-null values (100 - avg null %)
- Data Uniqueness: Average distinct percentage across columns
- Column Types: Count of columns by type (Numeric, String, Date, Other)
- High-Null Columns: Number of columns with high null percentages
2. Table-Level Profile Tab
Access detailed profiling for individual tables:
- Navigate to a table in the catalog
- Click the Profile tab
- View column-level metrics:
- Completeness: Percentage of non-null values per column
- Uniqueness: Percentage of unique values
- Top Values: Most frequent values in each column
- Summary Statistics: Min, max, average for numeric columns
3. Data Profiler Agent View
The global agent view provides comprehensive profiling insights:
Data Quality Summary
Displays current profiling metrics with trend indicators:
- Data Completeness (Avg Null %): Shows null percentage with progress bar and trend arrow
- Data Uniqueness (Avg Distinct %): Shows distinct percentage with progress bar and trend arrow
- High-Null Columns: Count of columns with significant null values and trend
- Column Type Distribution: Grid showing counts for Numeric, String, Date, and Other columns
Historical Trends
Track profiling metrics over time with interactive charts:
- Completeness Trend: Data completeness changes over time
- Uniqueness Trend: Data uniqueness changes over time
- Column Type Evolution: Changes in column type distribution
- High-Null Columns Trend: Tracking columns with quality issues
Column Profiling Metrics
For each column, the profiler analyzes:
Basic Metrics
Metric | Description |
---|---|
Completeness | Percentage of non-null values |
Uniqueness | Percentage of distinct values |
Null Percentage | Percentage of null values |
Data Type | Database data type |
Statistical Analysis (Numeric Columns)
Metric | Description |
---|---|
Min | Minimum value |
Max | Maximum value |
Mean | Average value |
Median | Middle value |
Standard Deviation | Value distribution spread |
Pattern Analysis
Metric | Description |
---|---|
Top Values | Most frequently occurring values |
Value Distribution | Frequency of different values |
Cardinality | Number of unique values |
Running Profile Analysis
Manual Execution
- Navigate to the Data Profiler Agent view
- Click the Run button
- The agent will analyze all tables based on configured settings
- View results in real-time as profiling completes
Automatic Profiling
- Profiles are automatically generated after catalog scans
- Scheduled profiling runs based on agent configuration
- Incremental profiling for changed tables
Profile Settings
Default Configuration
The Data Profiler Agent uses these default settings:
Setting | Description | Default |
---|---|---|
Sample Size | Maximum rows to analyze per table | 10,000 |
Sampling Threshold | Tables larger than this are sampled | 10,000 rows |
Top Values Limit | Number of frequent values to capture | 10 |
Store Top Values | Whether to store value distributions | Yes |
Performance Considerations
- Small Datasets (<100K rows): Profile entire dataset
- Medium Datasets (100K-1M rows): Use 10,000 sample size
- Large Datasets (>1M rows): Consider smaller samples for efficiency
Quality Score Interpretation
Completeness Score
- 90-100%: Excellent - Minimal missing data
- 70-90%: Good - Some missing values
- 50-70%: Fair - Significant gaps
- Below 50%: Poor - Major data quality issues
Uniqueness Score
- High Uniqueness (>90%): Potential primary key candidates
- Medium Uniqueness (50-90%): Normal distribution
- Low Uniqueness (<50%): Many duplicate values
Data Type Distribution
Helps understand:
- Schema composition
- Storage optimization opportunities
- Data migration planning
Use Cases
Data Quality Assessment
- Identify columns with high null percentages
- Find potential data quality issues
- Monitor data quality trends over time
Primary Key Discovery
- Identify columns with 100% uniqueness and 0% nulls
- Potential composite key combinations
- Foreign key relationship candidates
Schema Optimization
- Identify rarely used columns (high null percentage)
- Find columns that could use different data types
- Discover redundant or duplicate data
Data Migration Planning
- Understand data volumes and distributions
- Identify data cleansing requirements
- Plan transformation rules
Best Practices
Regular Profiling
- Schedule weekly or monthly profiling runs
- Profile after major data loads
- Monitor trends for early issue detection
Sampling Strategy
- Use larger samples for critical tables
- Adjust sample size based on data volatility
- Consider full profiling for small tables
Acting on Results
- Set quality thresholds and alerts
- Document expected ranges for metrics
- Create action plans for quality improvements
Integration with Other Features
Data Quality Rules
- Use profiling insights to define quality rules
- Set thresholds based on historical metrics
- Monitor violations against baselines
Documentation
- Auto-generate documentation from profile insights
- Include statistics in data dictionaries
- Share profile reports with stakeholders
Sensitivity Classification
- Use cardinality to identify potential PII
- Detect patterns in sensitive data
- Validate classification accuracy
Troubleshooting
Common Issues
Issue | Cause | Solution |
---|---|---|
Slow profiling | Large sample size | Reduce sample size or use sampling |
Incomplete results | Timeout on large tables | Increase timeout or reduce scope |
Missing statistics | Non-numeric columns | Expected - statistics only for numeric types |
Profile not updating | Cache not refreshed | Manually trigger new profile run |
API Access
Access profiling data programmatically:
- Query profile results by table or column
- Trigger profile runs via API
- Export profile reports
- Set up automated quality monitoring