Data validation and quality

Before running any analytics on a loan tape, validate the data. This isn’t optional. Building analysis on dirty data produces convincing but wrong conclusions. A beautiful stratification table built from corrupted data is worse than no analysis at all because it looks authoritative.

This page covers the systematic approach to data validation: field-level checks, logical consistency, duplicate detection, exception reporting, and how to handle the issues you’ll inevitably find.

Why validation comes first

New analysts often jump straight to running stratifications or calculating performance metrics. Experienced analysts spend their first hour validating the data.

The difference matters:

Without validation: You calculate a 4.2% CDR and present it to investment committee. Later, someone notices 15% of loans have zero balance (should have been excluded) and the actual CDR is 5.8%. Your credibility takes a hit, and the deal may have been mispriced.

With validation: You identify the zero-balance loans immediately, document the exception, request clarification from the originator, and either exclude them with proper documentation or get a corrected tape. Your CDR calculation reflects reality.

Validation time is never wasted. It either confirms data quality (allowing confident analysis) or surfaces problems early (before they compound into decision errors).

Field-level validation

Run checks appropriate to each field type. Create a validation script you can reuse across tapes rather than doing this manually each time.

Numeric field checks

Balance fields:

No negative values (impossible for UPB)
Current balance <= Original balance (for amortizing, non-revolving loans)
Current balance > 0 for active loans (unless explicitly paid off)
Reasonable maximum (a $10M balance in a consumer portfolio warrants investigation)

Rate fields:

No negative values
Within reasonable bounds (0-50% for most products)
Decimal vs. percentage consistency (is 0.08 or 8 the correct format?)
Fixed/variable flag consistency (variable loans should show current rate, not origination rate)

Term fields:

Positive integers only
Within product guidelines (12-84 months typical for auto, 120-360 for mortgage)
Remaining term <= Original term
Remaining term > 0 for active loans

Payment amount:

Positive values for active loans
Reasonable relationship to balance/rate/term
Interest-only vs. P&I consistency with loan type

Example validation checks (Python):

# Balance checks
assert (df['current_balance'] >= 0).all(), "Negative balances found"
assert (df['current_balance'] <= df['original_balance']).all(), "Current > Original"

# Rate checks
assert (df['rate'] >= 0).all() & (df['rate'] <= 0.50).all(), "Rates out of range"

# Term checks
assert (df['remaining_term'] <= df['original_term']).all(), "Remaining > Original term"

Date field checks

Origination date:

Not in the future
Not before the originator started operations
Reasonable distribution (bunching on specific dates may indicate data issues)

Maturity date:

After origination date
Consistent with origination date + term
Not in the past for active loans (unless extended)

Last payment date:

Not in the future
Reasonable relationship to payment status
Sequential with prior tape’s last payment date

Date format consistency: The same date appearing as “01/02/2024,” “2024-01-02,” and “January 2, 2024” in different fields indicates sloppy data handling. Check all date fields use consistent formatting.

Categorical field checks

Status fields:

All values match expected code list
No unexpected variations (“Current” vs “CURRENT” vs “current”)
Distribution is reasonable (99% in one category might indicate default value overuse)

State/geography:

Valid state/country codes only
No impossible combinations (zip code in wrong state)
Distribution aligns with originator’s footprint

Product type:

All values documented
Characteristics align (a “prime” product with average 620 FICO is mislabeled)

Logical consistency checks

Individual field values may be valid but inconsistent with each other. These cross-field checks catch data corruption or mapping errors.

Standard consistency checks

Check	Expected relationship	Action if failed
Term vs. Maturity	Maturity date = Origination date + (Original term months)	Query date formats
Payment vs. Rate/Balance	Payment aligns with amortization calculation	Check loan type
Current balance	<= Original balance (amortizing loans)	Check for fees capitalized
Days past due	Consistent with payment status	Check status definitions
Remaining term	<= Original term	Check for extensions
Remaining term	= Maturity - Tape date (in months)	Check for date errors

Payment calculation verification

For a sample of loans, verify the payment amount matches the expected P&I calculation:

Standard amortization formula:

Payment = Principal * [r(1+r)^n] / [(1+r)^n - 1]

Where:
r = monthly rate (annual rate / 12)
n = original term in months

A 5% mismatch tolerance handles rounding differences. Larger mismatches indicate either:

Interest-only periods (expected for some products)
Balloon structures
Fees included in payment
Data errors

Status-performance alignment

Delinquency status and DPD should align:

DPD	Expected status
0	Current
1-29	Current or 30-day bucket
30-59	30 DPD or equivalent
60-89	60 DPD or equivalent
90+	90+ DPD, default, or equivalent

When a loan shows “Current” status but 45 DPD, investigate. Either:

The servicer uses a non-standard definition (some call loans current until 60 DPD)
Catch-up payments were made and status updated but DPD wasn’t
Data error

Geography validation

For U.S. consumer and mortgage pools:

State code should be valid 2-letter abbreviation
Zip code first 3 digits should match state
If metropolitan statistical area (MSA) is provided, should match state

Geographic data errors often indicate system migration issues where mapping tables weren’t properly updated.

Duplicate detection

Duplicates corrupt your analysis by double-counting loans. Check for multiple duplicate types.

Exact duplicate rows

The simplest case: identical rows appearing multiple times. These should never exist.

# Check for exact duplicates
duplicate_rows = df[df.duplicated(keep=False)]
print(f"Exact duplicate rows: {len(duplicate_rows)}")

Duplicate loan IDs

Loan IDs should be unique. Multiple rows with the same loan ID indicate either:

Data extraction error
System migration artifacts (loan in both old and new system)
Genuine duplicates from operational error

# Check for duplicate loan IDs
duplicate_ids = df[df.duplicated(subset=['loan_id'], keep=False)]
print(f"Duplicate loan IDs: {duplicate_ids['loan_id'].nunique()}")

If duplicates exist, determine which record is authoritative. Usually the more recent record (by origination date or tape extraction) is correct.

Suspicious near-duplicates

Harder to detect: loans that aren’t exact duplicates but appear to be the same loan with slight variations.

Look for:

Same borrower ID with very similar origination dates and amounts
Same address/property with different loan IDs
Loans where most fields match but one or two differ

These may indicate:

Data entry errors (same loan entered twice with typo)
Loan modifications creating new records
Legitimate repeat borrowing (normal for some products)

Investigation is required to determine if these are errors or valid separate loans.

Common data quality issues

Every loan tape has issues. Knowing the common patterns helps you identify and address them efficiently.

Missing field patterns

Randomly missing: Fields missing for seemingly random loans suggests data entry gaps or optional field handling. Less concerning if <5% of loans affected.

Systematically missing: A field missing for all loans of a certain vintage or channel indicates a systemic issue—the field wasn’t captured for that cohort. This is more problematic because imputation may not be valid.

Missing for newest loans: Data pipeline delays sometimes mean the most recent loans are missing fields that will be populated later. Check if this is a timing issue.

Stale data

Loan tape date matters. A tape dated December 31 analyzed in March is three months stale. Performance has changed, and your DQ snapshot is outdated.

Minimum refresh requirements:

Stage	Acceptable staleness
Initial screening	Tape within 30 days
Diligence	Tape within 15 days
Closing	Tape as of cut-off date (day before closing)
Surveillance	Monthly refresh typical

For warehouse facilities with monthly borrowing base calculations, you typically receive monthly tapes. For term facilities, quarterly surveillance tapes are common.

Inconsistent coding across vintages

Originators change systems, and field values change meaning:

“Charged off” is code “CO” in 2022 vintage and “6” in 2023 vintage
Rate field changes from decimal (0.08) to percentage (8.0)
Date format changes from MM/DD/YYYY to YYYY-MM-DD

Solution: Build a mapping table covering all code variations and apply it consistently:

status_mapping = {
    'CO': 'Charged Off',
    '6': 'Charged Off',
    'CU': 'Current',
    '1': 'Current',
    # ... additional mappings
}
df['status_normalized'] = df['status_raw'].map(status_mapping)

Validate the mapping by confirming aggregate statistics (total charge-offs, total DQ) match originator-reported figures.

System migration artifacts

When originators switch loan management systems, data history often breaks:

Missing history: Loans may have no performance history from origination to migration date. Monthly payment records simply don’t exist for that period.

Field definition changes: A field that meant one thing in the old system may have a different definition in the new system, even with the same name.

Duplicate records: The same loan may appear in both old and new system exports.

Solution: Request data from both systems if analyzing cross-migration vintages. Apply extra scrutiny to any analysis spanning the migration date. Consider excluding loans originated near the migration date from trend analysis.

Data entry errors

Impossible values are the clearest signal:

Negative balances
Interest rates > 50% (unless explicitly high-rate product)
Terms > 360 months for consumer products
Origination dates before the originator existed
Future dates for historical events

Solution: Flag, exclude, or query. Don’t silently include impossible values in your analysis.

Exception reporting

Document all validation findings in a standard format. This creates an audit trail and provides clear communication with the originator.

Exception report template

EXCEPTION REPORT
================
Pool: XYZ Consumer Tape
Tape Date: 2024-01-15
Analysis Date: 2024-01-18
Analyst: [Name]

SUMMARY
-------
Total Loans: 5,000
Total UPB: $50.0M
Overall Data Quality: MODERATE

CRITICAL ISSUES (Must resolve before analysis)
----------------------------------------------
1. Missing FICO Score
   - Count: 127 loans (2.5% of pool)
   - UPB: $1.55M (3.1% of pool)
   - Impact: Cannot stratify by credit; eligibility uncertain
   - Request: Provide FICO or confirm alternative credit basis

2. Negative Current Balance
   - Count: 43 loans
   - Range: -$52 to -$3,421
   - Impact: Corrupts pool balance calculation
   - Request: Explain cause; provide corrected values

3. Duplicate Loan IDs
   - Count: 12 loan IDs appearing twice
   - Impact: Double-counting in analysis
   - Request: Confirm which record is authoritative

WARNINGS (Review with originator)
---------------------------------
1. Same-Day Origination
   - 8% of loans show origination date = tape date
   - Possible explanation: Batch funding on tape date
   - Action: Confirm these are valid new originations

2. High Rate Outliers
   - 15 loans with rate > 25%
   - Product type shows "Prime Consumer"
   - Action: Verify product type coding for these loans

3. Geographic Coding
   - 23 loans show state "XX"
   - Action: Request valid state codes

DATA QUALITY STATISTICS
-----------------------
Field              | % Complete | Issues
-------------------|------------|--------
Loan ID            | 100.0%     | 12 duplicates
Current Balance    | 100.0%     | 43 negatives
FICO               | 97.5%      | 127 missing
State              | 99.5%      | 23 invalid
Rate               | 100.0%     | 15 > 25%
Original Term      | 99.9%      | None

RECOMMENDATION
--------------
Request tape refresh with corrections before proceeding to full analysis.
Specifically: FICO population for 127 loans, corrected balances for 43 loans,
duplicate resolution for 12 loan IDs.

Severity classification

Critical issues: Block analysis. Must be resolved before proceeding.

Missing or corrupted primary keys
Corrupted balance fields
Systematic missing data in critical fields

Warnings: Document and discuss but can proceed with caveats.

Small number of outliers
Minor missing data (<5%)
Coding inconsistencies with known resolution

Informational: Note for completeness, no action required.

Expected patterns (seasonal origination spikes)
Known characteristics (product-specific field handling)

Addressing data issues

When you find problems, you have several options depending on severity and resolution feasibility.

Option 1: Request a corrected tape

The cleanest solution when the originator can fix the issue at source.

Communication approach:

“We identified 127 loans (2.5% of pool by count, 3.1% by balance) with missing FICO scores. Please confirm:

Is this field not captured for these loans, or is this a data extraction error?

If not captured, what was the credit decision basis for these loans?

Can you provide a refreshed tape with this field populated?”

Set a clear deadline. Waiting indefinitely for tape corrections derails analysis timelines.

Option 2: Exclude affected loans

When issues affect a small portion of the pool and correction isn’t practical.

Document the exclusion:

Which loans excluded
Why they were excluded
Impact on pool size and characteristics
Whether exclusion creates bias

Example: Excluding 2.5% of loans with missing FICO may inadvertently exclude a specific channel or vintage. Check that excluded loans don’t systematically differ from the pool.

Option 3: Apply conservative assumptions

When analysis must proceed but data is uncertain.

Missing FICO: Assume the eligibility floor (e.g., 600) or exclude from prime buckets.

Missing rate: Assume the highest rate in the pool.

Unknown status: Treat as most severe status.

Important threshold: Don’t impute data for more than 5% of loans. At that point, you’re inventing a portfolio rather than analyzing one.

Option 4: Apply adjustment factors

When you can quantify the impact of data issues.

Example: You know from originator disclosure that 15% of historically missing-FICO loans defaulted at 1.5x the pool average rate. Apply this adjustment to loss projections.

This is sophisticated and requires either historical data or external benchmarks. Only use when the adjustment basis is defensible.

Building a validation framework

For repeated tape analysis (monthly surveillance, multiple originators), build a reusable validation framework.

Standard validation suite

Create a script that runs automatically on any new tape:

Load data with type handling
Check completeness by field
Run range checks on numeric fields
Run consistency checks across fields
Check for duplicates
Generate exception report
Calculate data quality score

Data quality scoring

Create a simple scoring system to track quality over time:

Category	Weight	Score calculation
Completeness	30%	% of critical fields populated
Accuracy	30%	% of loans passing range checks
Consistency	20%	% of loans passing cross-field checks
Uniqueness	20%	% of loans with unique IDs

Data Quality Score = Weighted average of category scores

Track this score monthly. A declining score signals originator data practices are degrading—a leading indicator of operational problems.

Key takeaways

Validation comes first. Never run analytics on unvalidated data. The hour you spend validating saves days of correcting errors.
Check fields individually and in relationship. A valid rate field and a valid balance field can still be inconsistent with each other.
Document everything. The exception report is both an internal record and a communication tool with originators.
Know when to stop. If more than 5% of loans need imputation, the tape isn’t ready for analysis.
Build reusable tools. Validation should be automated and consistent, not manual and ad-hoc.
Track quality over time. Data quality degradation is a leading indicator of operational problems.