Installation

Install with CLI Recommended
gh skills-hub install datanalysis-credit-risk

Don't have the extension? Run gh extension install samueltauil/skills-hub first.

Download and extract to your repository:

.github/skills/datanalysis-credit-risk/

Extract the ZIP to .github/skills/ in your repo. The folder name must match datanalysis-credit-risk for Copilot to auto-discover it.

Skill Files (4)

SKILL.md 6.0 KB
---
name: datanalysis-credit-risk
description: Credit risk data cleaning and variable screening pipeline for pre-loan modeling. Use when working with raw credit data that needs quality assessment,  missing value analysis, or variable selection before modeling. it covers data loading and formatting, abnormal period filtering, missing rate calculation, high-missing variable removal,low-IV variable filtering, high-PSI variable removal, Null Importance denoising, high-correlation variable removal, and cleaning report generation. Applicable scenarios arecredit risk data cleaning, variable screening, pre-loan modeling preprocessing.
---

# Data Cleaning and Variable Screening

## Quick Start

```bash
# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"
```

## Complete Process Description

The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:

1. **Get Data** - Load and format raw data
2. **Organization Sample Analysis** - Statistics of sample count and bad sample rate for each organization
3. **Separate OOS Data** - Separate out-of-sample (OOS) samples from modeling samples
4. **Filter Abnormal Months** - Remove months with insufficient bad sample count or total sample count
5. **Calculate Missing Rate** - Calculate overall and organization-level missing rates for each feature
6. **Drop High Missing Rate Features** - Remove features with overall missing rate exceeding threshold
7. **Drop Low IV Features** - Remove features with overall IV too low or IV too low in too many organizations
8. **Drop High PSI Features** - Remove features with unstable PSI
9. **Null Importance Denoising** - Remove noise features using label permutation method
10. **Drop High Correlation Features** - Remove high correlation features based on original gain
11. **Export Report** - Generate Excel report containing details and statistics of all steps

## Core Functions

| Function | Purpose | Module |
|------|------|----------|
| `get_dataset()` | Load and format data | references.func |
| `org_analysis()` | Organization sample analysis | references.func |
| `missing_check()` | Calculate missing rate | references.func |
| `drop_abnormal_ym()` | Filter abnormal months | references.analysis |
| `drop_highmiss_features()` | Drop high missing rate features | references.analysis |
| `drop_lowiv_features()` | Drop low IV features | references.analysis |
| `drop_highpsi_features()` | Drop high PSI features | references.analysis |
| `drop_highnoise_features()` | Null Importance denoising | references.analysis |
| `drop_highcorr_features()` | Drop high correlation features | references.analysis |
| `iv_distribution_by_org()` | IV distribution statistics | references.analysis |
| `psi_distribution_by_org()` | PSI distribution statistics | references.analysis |
| `value_ratio_distribution_by_org()` | Value ratio distribution statistics | references.analysis |
| `export_cleaning_report()` | Export cleaning report | references.analysis |

## Parameter Description

### Data Loading Parameters
- `DATA_PATH`: Data file path (best are parquet format)
- `DATE_COL`: Date column name
- `Y_COL`: Label column name
- `ORG_COL`: Organization column name
- `KEY_COLS`: Primary key column name list

### OOS Organization Configuration
- `OOS_ORGS`: Out-of-sample organization list

### Abnormal Month Filtering Parameters
- `min_ym_bad_sample`: Minimum bad sample count per month (default 10)
- `min_ym_sample`: Minimum total sample count per month (default 500)

### Missing Rate Parameters
- `missing_ratio`: Overall missing rate threshold (default 0.6)

### IV Parameters
- `overall_iv_threshold`: Overall IV threshold (default 0.1)
- `org_iv_threshold`: Single organization IV threshold (default 0.1)
- `max_org_threshold`: Maximum tolerated low IV organization count (default 2)

### PSI Parameters
- `psi_threshold`: PSI threshold (default 0.1)
- `max_months_ratio`: Maximum unstable month ratio (default 1/3)
- `max_orgs`: Maximum unstable organization count (default 6)

### Null Importance Parameters
- `n_estimators`: Number of trees (default 100)
- `max_depth`: Maximum tree depth (default 5)
- `gain_threshold`: Gain difference threshold (default 50)

### High Correlation Parameters
- `max_corr`: Correlation threshold (default 0.9)
- `top_n_keep`: Keep top N features by original gain ranking (default 20)

## Output Report

The generated Excel report contains the following sheets:

1. **汇总** - Summary information of all steps, including operation results and conditions
2. **机构样本统计** - Sample count and bad sample rate for each organization
3. **分离OOS数据** - OOS sample and modeling sample counts
4. **Step4-异常月份处理** - Abnormal months that were removed
5. **缺失率明细** - Overall and organization-level missing rates for each feature
6. **Step5-有值率分布统计** - Distribution of features in different value ratio ranges
7. **Step6-高缺失率处理** - High missing rate features that were removed
8. **Step7-IV明细** - IV values of each feature in each organization and overall
9. **Step7-IV处理** - Features that do not meet IV conditions and low IV organizations
10. **Step7-IV分布统计** - Distribution of features in different IV ranges
11. **Step8-PSI明细** - PSI values of each feature in each organization each month
12. **Step8-PSI处理** - Features that do not meet PSI conditions and unstable organizations
13. **Step8-PSI分布统计** - Distribution of features in different PSI ranges
14. **Step9-null importance处理** - Noise features that were removed
15. **Step10-高相关性剔除** - High correlation features that were removed

## Features

- **Interactive Input**: Parameters can be input before each step execution, with default values supported
- **Independent Execution**: Each step is executed independently without deleting original data, facilitating comparative analysis
- **Complete Report**: Generate complete Excel report containing details, statistics, and distributions
- **Multi-process Support**: IV and PSI calculations support multi-process acceleration
- **Organization-level Analysis**: Support organization-level statistics and modeling/OOS distinction
references/
analysis.py 50.7 KB
"""Variable selection and analysis module - simplified version
PSI calculation is reused in func.py, analysis.py only handles variable selection
"""
import pandas as pd
import numpy as np
import toad
from typing import List, Dict, Tuple
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from datetime import datetime
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from joblib import Parallel, delayed


def drop_abnormal_ym(data: pd.DataFrame, min_ym_bad_sample: int = 1, 
                     min_ym_sample: int = 500) -> tuple:
    """Filter abnormal months - overall statistics, not by organization"""
    stat = data.groupby('new_date_ym').agg(
        bad_cnt=('new_target', 'sum'),
        total=('new_target', 'count')
    ).reset_index()
    
    abnormal = stat[(stat['bad_cnt'] < min_ym_bad_sample) | (stat['total'] < min_ym_sample)]
    abnormal = abnormal.rename(columns={'new_date_ym': '年月'})
    abnormal['去除条件'] = abnormal.apply(
        lambda x: f'bad sample count {x["bad_cnt"]} less than {min_ym_bad_sample}' if x['bad_cnt'] < min_ym_bad_sample else f'total sample count {x["total"]} less than {min_ym_sample}', axis=1
    )
    
    if len(abnormal) > 0:
        data = data[~data['new_date_ym'].isin(abnormal['年月'])]
    
    # Remove empty rows
    abnormal = abnormal.dropna(how='all')
    abnormal = abnormal.reset_index(drop=True)
    
    return data, abnormal


def drop_highmiss_features(data: pd.DataFrame, miss_channel: pd.DataFrame, 
                           threshold: float = 0.6) -> tuple:
    """Drop high missing rate features"""
    high_miss = miss_channel[miss_channel['整体缺失率'] > threshold].copy()
    high_miss['缺失率'] = high_miss['整体缺失率']
    
    # Modify removal condition to show specific missing rate value
    high_miss['去除条件'] = high_miss.apply(
        lambda x: f'overall missing rate is {x["缺失率"]:.4f}, exceeds threshold {threshold}', axis=1
    )
    
    # Remove empty rows
    high_miss = high_miss.dropna(how='all')
    high_miss = high_miss.reset_index(drop=True)
    
    # Drop high missing rate features
    if len(high_miss) > 0 and '变量' in high_miss.columns:
        to_drop = high_miss['变量'].tolist()
        data = data.drop(columns=[c for c in to_drop if c in data.columns])
    
    return data, high_miss[['变量', '缺失率', '去除条件']]


def drop_lowiv_features(data: pd.DataFrame, features: List[str], 
                       overall_iv_threshold: float = 0.05, org_iv_threshold: float = 0.02,
                       max_org_threshold: int = 8, n_jobs: int = 4) -> tuple:
    """Drop low IV features - multi-process version, returns IV details and IV processing table
    
    Args:
        overall_iv_threshold: Overall IV threshold, values below this are recorded in IV processing table
        org_iv_threshold: Single organization IV threshold, values below this are considered not satisfied
        max_org_threshold: Maximum tolerated organization count, if more than this number of organizations have IV below threshold, record in IV processing table
    
    Returns:
        data: Data after dropping
        iv_detail: IV details (IV value of each feature in each organization and overall)
        iv_process: IV processing table (features that do not meet the conditions)
    """
    from references.func import calculate_iv
    from joblib import Parallel, delayed
    
    orgs = sorted(data['new_org'].unique())
    
    print(f"   IV calculation: feature count={len(features)}, organization count={len(orgs)}")
    
    # Calculate IV values for all organizations at once
    def _calc_org_iv(org):
        org_data = data[data['new_org'] == org]
        org_iv = calculate_iv(org_data, features, n_jobs=1)
        if len(org_iv) > 0:
            org_iv = org_iv.rename(columns={'IV': 'IV值'})
            org_iv['机构'] = org
            return org_iv
        return None
    
    # Calculate overall IV
    print(f"   Calculating overall IV...")
    iv_overall = calculate_iv(data, features, n_jobs=n_jobs)
    print(f"   Overall IV calculation result: {len(iv_overall)} features")
    if len(iv_overall) == 0:
        print(f"   Warning: Overall IV calculation result is empty, returning empty table")
        return data, pd.DataFrame(columns=['变量', 'IV值', '机构', '类型']), pd.DataFrame(columns=['变量', '整体IV', '低IV机构数', '处理原因'])
    iv_overall = iv_overall.rename(columns={'IV': 'IV值'})
    
    # Parallel calculation of IV values for all organizations
    print(f"   Parallel calculation of IV values for {len(orgs)} organizations...")
    iv_by_org_results = Parallel(n_jobs=n_jobs, verbose=0)(
        delayed(_calc_org_iv)(org) for org in orgs
    )
    iv_by_org = [r for r in iv_by_org_results if r is not None]
    iv_by_org = pd.concat(iv_by_org, ignore_index=True) if iv_by_org else pd.DataFrame(columns=['变量', 'IV值', '机构'])
    print(f"   Organization IV summary: {len(iv_by_org)} records")
    
    # Convert to wide format: feature, overall, org1, org2, ..., orgn
    iv_detail_dict = {'变量': []}
    iv_detail_dict['整体'] = []
    
    for org in orgs:
        iv_detail_dict[org] = []
    
    # Get all features
    all_vars = set(iv_overall['变量'].tolist())
    if len(iv_by_org) > 0:
        all_vars.update(iv_by_org['变量'].tolist())
    all_vars = sorted(all_vars)
    
    for var in all_vars:
        iv_detail_dict['变量'].append(var)
        
        # Overall IV
        var_overall = iv_overall[iv_overall['变量'] == var]
        if len(var_overall) > 0:
            iv_detail_dict['整体'].append(var_overall['IV值'].values[0])
        else:
            iv_detail_dict['整体'].append(None)
        
        # IV for each organization
        for org in orgs:
            var_org = iv_by_org[iv_by_org['机构'] == org]
            var_org = var_org[var_org['变量'] == var]
            if len(var_org) > 0:
                iv_detail_dict[org].append(var_org['IV值'].values[0])
            else:
                iv_detail_dict[org].append(None)
    
    iv_detail = pd.DataFrame(iv_detail_dict)
    # Sort by overall IV in descending order
    iv_detail = iv_detail.sort_values('整体', ascending=False)
    iv_detail = iv_detail.reset_index(drop=True)
    
    # Mark features that do not meet conditions
    # 1. Overall IV below threshold
    iv_overall_low = iv_overall[iv_overall['IV值'] < overall_iv_threshold]['变量'].tolist()
    
    # 2. Number of organizations with single organization IV below threshold
    if len(iv_by_org) > 0:
        iv_by_org_low = iv_by_org[iv_by_org['IV值'] < org_iv_threshold].groupby('变量').size().reset_index()
        iv_by_org_low.columns = ['变量', '低IV机构数']
    else:
        iv_by_org_low = pd.DataFrame(columns=['变量', '低IV机构数'])
    
    # Get list of low IV organizations for each feature
    low_iv_orgs_dict = {}
    if len(iv_by_org) > 0:
        for var in iv_by_org['变量'].unique():
            var_orgs = iv_by_org[(iv_by_org['变量'] == var) & (iv_by_org['IV值'] < org_iv_threshold)]['机构'].tolist()
            low_iv_orgs_dict[var] = var_orgs
    
    # 3. Mark features that need processing
    iv_process = []
    
    # Debug info: IV distribution statistics
    if len(iv_overall) > 0:
        print(f"   Overall IV statistics: min={iv_overall['IV值'].min():.4f}, max={iv_overall['IV值'].max():.4f}, median={iv_overall['IV值'].median():.4f}")
        print(f"   Number of features with overall IV less than {overall_iv_threshold}: {(iv_overall['IV值'] < overall_iv_threshold).sum()}/{len(iv_overall)}")
    
    if len(iv_by_org_low) > 0:
        print(f"   Statistics of features with organization IV less than {org_iv_threshold}:")
        print(f"     Maximum low IV organization count: {iv_by_org_low['低IV机构数'].max()}")
        print(f"     Number of features with low IV organization count greater than or equal to {max_org_threshold}: {(iv_by_org_low['低IV机构数'] >= max_org_threshold).sum()}/{len(iv_by_org_low)}")
    
    for var in features:
        reasons = []
        
        # Check overall IV
        var_overall_iv = iv_overall[iv_overall['变量'] == var]['IV值'].values
        if len(var_overall_iv) > 0 and var_overall_iv[0] < overall_iv_threshold:
            reasons.append(f'overall IV {var_overall_iv[0]:.4f} less than threshold {overall_iv_threshold}')
        
        # Check organization IV
        var_org_low = iv_by_org_low[iv_by_org_low['变量'] == var]
        if len(var_org_low) > 0 and var_org_low['低IV机构数'].values[0] >= max_org_threshold:
            reasons.append(f'IV less than threshold {org_iv_threshold} in {var_org_low["低IV机构数"].values[0]} organizations')
        
        if reasons:
            iv_process.append({
                '变量': var,
                '处理原因': '; '.join(reasons),
                '低IV机构': ','.join(low_iv_orgs_dict.get(var, []))
            })
    
    iv_process = pd.DataFrame(iv_process)
    iv_process = iv_process.reset_index(drop=True)
    
    # Drop features that do not meet conditions
    if len(iv_process) > 0 and '变量' in iv_process.columns:
        to_drop = iv_process['变量'].tolist()
        data = data.drop(columns=[c for c in to_drop if c in data.columns])
    
    return data, iv_detail, iv_process


def drop_highcorr_features(data: pd.DataFrame, features: List[str],
                           threshold: float = 0.8, gain_dict: dict = None, top_n_keep: int = 20) -> tuple:
    """Drop high correlation features - based on original gain, drop one feature at a time
    
    Args:
        data: Data
        features: Feature list
        threshold: Correlation threshold
        gain_dict: Mapping dictionary from feature to original gain
        top_n_keep: Keep top N features by original gain ranking
    
    Returns:
        data: Data after dropping
        dropped_info: Drop information
    """
    if gain_dict is None:
        gain_dict = {}
    
    # Get current feature list (only features that exist in data)
    current_features = [f for f in features if f in data.columns]
    
    if len(current_features) == 0:
        return data, pd.DataFrame(columns=['变量', '相关变量', '去除条件'])
    
    # Determine features to keep (top N by original gain)
    if gain_dict:
        # Only consider features that exist in current features
        current_gain_dict = {k: v for k, v in gain_dict.items() if k in current_features}
        if current_gain_dict:
            sorted_features = sorted(current_gain_dict.keys(), key=lambda x: current_gain_dict[x], reverse=True)
            top_features = set(sorted_features[:top_n_keep])
            # Create mapping from feature to ranking
            rank_dict = {v: i+1 for i, v in enumerate(sorted_features)}
        else:
            top_features = set()
            rank_dict = {}
    else:
        top_features = set()
        rank_dict = {}
    
    dropped_info = []
    
    # Loop to drop until no high correlation feature pairs
    while True:
        # Recalculate correlation matrix (only for current remaining features)
        current_features = [f for f in current_features if f in data.columns]
        if len(current_features) < 2:
            break
        
        corr = data[current_features].corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
        
        # Find all high correlation feature pairs
        high_corr_pairs = []
        for i, col1 in enumerate(upper.columns):
            for col2 in upper.columns[i+1:]:
                corr_val = upper.loc[col1, col2]
                if pd.notna(corr_val) and corr_val > threshold:
                    high_corr_pairs.append((col1, col2, corr_val))
        
        if not high_corr_pairs:
            break
        
        # For each high correlation feature pair, select the feature with smaller original gain as candidate for dropping
        candidates = set()
        for col1, col2, corr_val in high_corr_pairs:
            # Skip top N kept features
            if col1 in top_features and col2 in top_features:
                continue
            
            gain1 = gain_dict.get(col1, 0)
            gain2 = gain_dict.get(col2, 0)
            
            # Select feature with smaller original gain
            if gain1 <= gain2:
                candidates.add(col1)
            else:
                candidates.add(col2)
        
        if not candidates:
            break
        
        # Select feature with smallest original gain among candidates for dropping
        candidates_list = list(candidates)
        candidates_with_gain = [(c, gain_dict.get(c, 0)) for c in candidates_list]
        candidates_with_gain.sort(key=lambda x: x[1])
        to_drop = candidates_with_gain[0][0]
        
        # Find all features highly correlated with this feature
        related_vars = []
        for col1, col2, corr_val in high_corr_pairs:
            if col1 == to_drop:
                related_vars.append((col2, corr_val))
            elif col2 == to_drop:
                related_vars.append((col1, corr_val))
        
        # Record drop information
        # Related variables column: show feature name and similarity value (correlation value)
        related_str = ','.join([f"{v}(similarity={c:.4f})" for v, c in related_vars])
        # Removal condition column: show related features and their corresponding gain values
        gain_str = ','.join([f"{v}(gain={gain_dict.get(v, 0):.2f})" for v, c in related_vars])
        dropped_info.append({
            '变量': to_drop,
            '原始gain': gain_dict.get(to_drop, 0),
            '原始gain排名': rank_dict.get(to_drop, '-'),
            '相关变量': related_str,
            '去除条件': gain_str
        })
        
        # Delete this feature from data
        data = data.drop(columns=[to_drop], errors='ignore')
        current_features.remove(to_drop)
        
        print(f"   Dropped feature: {to_drop} (original gain={gain_dict.get(to_drop, 0):.2f})")
    
    # Convert to DataFrame and sort by original gain in descending order
    dropped_df = pd.DataFrame(dropped_info)
    if len(dropped_df) > 0:
        dropped_df = dropped_df.sort_values('原始gain', ascending=False)
        dropped_df = dropped_df.reset_index(drop=True)
    
    return data, dropped_df


def drop_highnoise_features(data: pd.DataFrame, features: List[str],
                           n_estimators: int = 100, max_depth: int = 5, gain_threshold: float = 50) -> tuple:
    """Null Importance to remove high noise features"""
    # Check if feature list is empty
    if len(features) == 0:
        print("   No features to process")
        return data, pd.DataFrame(columns=['变量', '原始gain', '反转后gain'])
    
    # Check if data is sufficient
    if len(data) < 1000:
        print(f"   Insufficient data ({len(data)} rows), skip Null Importance")
        return data, pd.DataFrame(columns=['变量', '原始gain', '反转后gain'])
    
    X = data[features].copy()
    Y = data['new_target'].copy()
    
    # Check if X is empty or contains NaN
    if X.shape[1] == 0:
        print("   Feature data is empty, skip Null Importance")
        return data, pd.DataFrame(columns=['变量', '原始gain', '反转后gain'])
    
    # Fill NaN
    X = X.fillna(0)
    
    # Shuffle labels
    Y_permuted = Y.copy()
    for _ in range(20):
        Y_permuted = np.random.permutation(Y_permuted)
    
    clf = lgb.LGBMClassifier(
        objective='binary', boosting_type='gbdt', learning_rate=0.05,
        max_depth=max_depth, min_child_samples=2000, min_child_weight=20,
        n_estimators=n_estimators, num_leaves=2**max_depth - 1, n_jobs=-1, verbose=-1
    )
    
    clf_permuted = lgb.LGBMClassifier(
        objective='binary', boosting_type='gbdt', learning_rate=0.05,
        max_depth=max_depth, min_child_samples=2000, min_child_weight=20,
        n_estimators=n_estimators, num_leaves=2**max_depth - 1, n_jobs=-1, verbose=-1
    )
    
    results, results_permuted = [], []
    
    print("Null Importance calculation in progress...")
    for i in range(2):
        random_n = np.random.randint(30)
        
        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=random_n)
        
        # Check if training data is valid
        if X_train.shape[0] == 0 or X_test.shape[0] == 0:
            print(f"  Round {i+1}: Data split failed, skip")
            continue
        
        clf.fit(X_train, y_train)
        
        X_train_, X_test_, y_train_, y_test_ = train_test_split(X, Y_permuted, test_size=0.3, random_state=random_n)
        
        if X_train_.shape[0] == 0 or X_test_.shape[0] == 0:
            print(f"  Round {i+1}: Shuffled data split failed, skip")
            continue
        
        clf_permuted.fit(X_train_, y_train_)
        
        imp_real = pd.DataFrame({
            'feature': clf.booster_.feature_name(),
            'gain': clf.booster_.feature_importance(importance_type='gain')
        })
        imp_perm = pd.DataFrame({
            'feature': clf_permuted.booster_.feature_name(),
            'gain': clf_permuted.booster_.feature_importance(importance_type='gain')
        })
        
        results.append(imp_real)
        results_permuted.append(imp_perm)
        
        train_auc = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
        test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        print(f"  Round {i+1}: train_auc={train_auc:.3f}, test_auc={test_auc:.3f}")
    
    # Check if there are valid results
    if len(results) == 0 or len(results_permuted) == 0:
        print("   No valid training results, skip Null Importance")
        return data, pd.DataFrame(columns=['变量', '原始gain', '反转后gain'])
    
    imp_real_avg = pd.concat(results).groupby('feature')['gain'].mean().reset_index()
    imp_perm_avg = pd.concat(results_permuted).groupby('feature')['gain'].mean().reset_index()
    
    comparison = imp_real_avg.merge(imp_perm_avg, on='feature', suffixes=('_real', '_perm'))
    comparison['gain_real'] = comparison['gain_real'].fillna(0)
    comparison['gain_perm'] = comparison['gain_perm'].fillna(0)
    
    # Use condition where absolute difference of gain values before and after permutation is less than 50
    comparison['gain_diff'] = (comparison['gain_real'] - comparison['gain_perm']).abs()
    noise_features = comparison[comparison['gain_diff'] < gain_threshold]['feature'].tolist()
    
    # List original gain and permuted gain for all features
    dropped_info = pd.DataFrame({
        '变量': comparison['feature'].values,
        '原始gain': comparison['gain_real'].values,
        '反转后gain': comparison['gain_perm'].values
    })
    # Add status column, mark dropped features as '去除', kept features as '保留'
    dropped_info['状态'] = dropped_info.apply(
        lambda x: '去除' if np.abs(x['原始gain'] - x['反转后gain']) < gain_threshold else '保留', axis=1
    )
    # Sort by original gain in descending order
    dropped_info = dropped_info.sort_values('原始gain', ascending=False)
    dropped_info = dropped_info.reset_index(drop=True)
    # Add original gain ranking column
    dropped_info['原始gain排名'] = range(1, len(dropped_info) + 1)
    
    data = data.drop(columns=[c for c in noise_features if c in data.columns])
    
    print(f"  Dropped {len(noise_features)} noise features")
    return data, dropped_info


def _calc_single_psi(args):
    """Calculate PSI for a single organization and single feature - NaN as separate bin"""
    org, train_month, test_month, train_n, test_n, f, data_ref, min_sample = args
    
    try:
        org_data = data_ref[data_ref['new_org'] == org]
        train_data = org_data[org_data['new_date_ym'] == train_month]
        test_data = org_data[org_data['new_date_ym'] == test_month]
        
        # Get data
        train_vals = train_data[f].values
        test_vals = test_data[f].values
        
        # Mark NaN
        train_nan_mask = pd.isna(train_vals)
        test_nan_mask = pd.isna(test_vals)
        
        # Non-NaN values for binning
        train_nonan = train_vals[~train_nan_mask]
        test_nonan = test_vals[~test_nan_mask]
        
        if len(train_nonan) < min_sample or len(test_nonan) < min_sample:
            return {
                '机构': org, '日期': f"{train_month}->{test_month}",
                '变量': f, 'PSI': None, '有效计算': 0,
                '样本数': train_n
            }
        
        # Bin based on non-NaN data (10 bins)
        try:
            bins = pd.qcut(train_nonan, q=10, duplicates='drop', retbins=True)[1]
        except:
            bins = pd.cut(train_nonan, bins=10, retbins=True)[1]
        
        # Calculate proportion of each bin (including NaN bin)
        train_counts = []
        test_counts = []
        
        for i in range(len(bins)):
            if i == 0:
                train_counts.append((~train_nan_mask & (train_vals <= bins[i])).sum())
                test_counts.append((~test_nan_mask & (test_vals <= bins[i])).sum())
            else:
                train_counts.append((~train_nan_mask & (train_vals > bins[i-1]) & (train_vals <= bins[i])).sum())
                test_counts.append((~test_nan_mask & (test_vals > bins[i-1]) & (test_vals <= bins[i])).sum())
        
        # NaN bin
        train_counts.append(train_nan_mask.sum())
        test_counts.append(test_nan_mask.sum())
        
        # Convert to proportions
        train_pct = np.array(train_counts) / len(train_vals)
        test_pct = np.array(test_counts) / len(test_vals)
        
        # Avoid 0 values
        train_pct = np.where(train_pct == 0, 1e-6, train_pct)
        test_pct = np.where(test_pct == 0, 1e-6, test_pct)
        
        # Calculate PSI
        psi = np.sum((test_pct - train_pct) * np.log(test_pct / train_pct))
        
        return {
            '机构': org, '日期': f"{train_month}->{test_month}",
            '变量': f, 'PSI': round(psi, 4), '有效计算': 1,
            '样本数': train_n
        }
    except Exception as e:
        return {
            '机构': org, '日期': f"{train_month}->{test_month}",
            '变量': f, 'PSI': None, '有效计算': 0,
            '样本数': train_n
        }


def drop_highpsi_features(data: pd.DataFrame, features: List[str],
                         psi_threshold: float = 0.1, max_months_ratio: float = 1/3,
                         max_orgs: int = 4, min_sample_per_month: int = 100, n_jobs: int = 4) -> tuple:
    """Drop high PSI features - by organization + month-by-month version
    
    Multi-processing at feature level, loop through organizations, parallel calculation of features within organizations
    
    Args:
        psi_threshold: PSI threshold, values above this are considered unstable
        max_months_ratio: Maximum tolerated month ratio, if more than this ratio of months have PSI above threshold, record in processing table
        max_orgs: Maximum tolerated organization count, if more than this number of organizations are unstable, record in processing table
        min_sample_per_month: Minimum sample count per month
    
    Returns:
        data: Data after dropping
        psi_detail: PSI details (PSI value of each feature in each organization each month)
        psi_process: PSI processing table (features that do not meet the conditions)
    """
    orgs = data['new_org'].unique()
    
    # Build task list: each organization, each pair of months, each feature
    tasks = []
    for org in orgs:
        org_data = data[data['new_org'] == org]
        months = sorted(org_data['new_date_ym'].unique())
        
        if len(months) < 2:
            continue
        
        for i in range(len(months) - 1):
            train_month = months[i]
            test_month = months[i + 1]
            
            train_data = org_data[org_data['new_date_ym'] == train_month]
            test_data = org_data[org_data['new_date_ym'] == test_month]
            
            train_n = len(train_data)
            test_n = len(test_data)
            
            for f in features:
                tasks.append((org, train_month, test_month, train_n, test_n, f, data, min_sample_per_month))
    
    # Multi-process PSI calculation (parallel at feature level)
    print(f"   PSI calculation: {len(tasks)} tasks, using {n_jobs} processes")
    results = Parallel(n_jobs=n_jobs, verbose=0)(delayed(_calc_single_psi)(task) for task in tasks)
    
    psi_df = pd.DataFrame(results)
    
    if len(psi_df) == 0:
        return data, pd.DataFrame(columns=['变量', '机构', '月份', 'PSI值']), pd.DataFrame(columns=['变量', '处理原因'])
    
    # Filter valid calculation records
    valid_psi = psi_df[psi_df['有效计算'] == 1].copy()
    
    if len(valid_psi) == 0:
        return data, pd.DataFrame(columns=['变量', '机构', '月份', 'PSI值']), pd.DataFrame(columns=['变量', '处理原因'])
    
    # PSI detail table: PSI value of each feature in each organization each month
    # Change date to single month, initial month PSI value is 0
    psi_detail = valid_psi[['机构', '日期', '变量', 'PSI']].copy()
    
    # Parse date, extract test month
    psi_detail['月份'] = psi_detail['日期'].apply(lambda x: x.split('->')[1] if '->' in x else x)
    psi_detail = psi_detail.rename(columns={'PSI': 'PSI值'})
    
    # Sort by feature, organization, month in ascending order
    psi_detail = psi_detail.sort_values(['变量', '机构', '月份'], ascending=[True, True, True])
    
    # Get all organizations and months
    all_orgs = sorted(psi_detail['机构'].unique())
    all_vars = sorted(psi_detail['变量'].unique())
    
    # Build complete PSI detail table (including initial month, PSI value is 0)
    psi_detail_list = []
    for org in all_orgs:
        org_data = psi_detail[psi_detail['机构'] == org]
        if len(org_data) == 0:
            continue
        
        # Get all months for this organization
        months = sorted(org_data['月份'].unique())
        
        for var in all_vars:
            var_data = org_data[org_data['变量'] == var]
            if len(var_data) == 0:
                continue
            
            # Initial month PSI value is 0
            psi_detail_list.append({
                '机构': org,
                '变量': var,
                '月份': months[0],
                'PSI值': 0.0
            })
            
            # Subsequent months PSI values are calculation results
            for i in range(1, len(months)):
                month = months[i]
                var_month_data = var_data[var_data['月份'] == month]
                if len(var_month_data) > 0:
                    psi_value = var_month_data['PSI值'].values[0]
                else:
                    psi_value = 0.0
                psi_detail_list.append({
                    '机构': org,
                    '变量': var,
                    '月份': month,
                    'PSI值': psi_value
                })
    
    psi_detail = pd.DataFrame(psi_detail_list)
    psi_detail = psi_detail[['机构', '变量', '月份', 'PSI值']]
    psi_detail = psi_detail.reset_index(drop=True)
    # Sort by feature, organization, month in ascending order
    psi_detail = psi_detail.sort_values(['变量', '机构', '月份'], ascending=[True, True, True])
    psi_detail = psi_detail.reset_index(drop=True)
    
    # Mark unstable
    valid_psi['不稳定'] = (valid_psi['PSI'] > psi_threshold).astype(int)
    
    # Summary: number of unstable months and total months for each organization each feature
    org_summary = valid_psi.groupby(['机构', '变量']).agg(
        不稳定月份数=('不稳定', 'sum'),
        总月份数=('变量', 'count')
    ).reset_index()
    
    # Mark whether each organization each feature is unstable
    # Ensure threshold is at least 1, avoid being too strict when organization has few months
    org_summary['不稳定阈值'] = org_summary['总月份数'].apply(
        lambda x: max(1, int(x * max_months_ratio))
    )
    org_summary['是否不稳定'] = org_summary['不稳定月份数'] >= org_summary['不稳定阈值']
    
    # Organization level summary: number of unstable organizations
    org_count = len(orgs)
    channel_summary = org_summary.groupby('变量').apply(
        lambda x: pd.Series({
            '机构数': org_count,
            '不稳定机构数': x['是否不稳定'].sum()
        })
    ).reset_index()
    
    # Mark features that need processing
    channel_summary['需处理'] = channel_summary['不稳定机构数'] >= max_orgs
    channel_summary['处理原因'] = channel_summary.apply(
        lambda x: f'PSI unstable in {x["不稳定机构数"]} organizations' if x['需处理'] else '', axis=1
    )
    
    # Get list of unstable organizations for each feature
    unstable_orgs_dict = {}
    for var in org_summary['变量'].unique():
        var_orgs = org_summary[(org_summary['变量'] == var) & (org_summary['是否不稳定'] == True)]['机构'].tolist()
        unstable_orgs_dict[var] = var_orgs
    
    # PSI processing table: features that do not meet the conditions
    psi_process = channel_summary[channel_summary['需处理']].copy()
    psi_process['不稳定机构'] = psi_process['变量'].apply(lambda x: ','.join(unstable_orgs_dict.get(x, [])))
    psi_process = psi_process[['变量', '处理原因', '不稳定机构']]
    psi_process = psi_process.reset_index(drop=True)
    
    # Filter features to drop
    if len(psi_process) > 0 and '变量' in psi_process.columns:
        to_drop_vars = psi_process['变量'].tolist()
        data = data.drop(columns=[c for c in to_drop_vars if c in data.columns])
    
    return data, psi_detail, psi_process


def iv_distribution_by_org(iv_detail: pd.DataFrame, oos_orgs: list = None, iv_bins: list = [0, 0.02, 0.05, 0.1, float('inf')]) -> pd.DataFrame:
    """Count number and proportion of features in different IV ranges for each organization
    
    Args:
        iv_detail: IV detail table (containing feature, overall, organization columns)
        oos_orgs: Out-of-sample organization list
        iv_bins: IV range boundaries [0, 0.02, 0.05, 0.1, inf]
    
    Returns:
        IV distribution statistics table
    """
    if oos_orgs is None:
        oos_orgs = []
    
    # Get organization columns (exclude '变量' and '整体' columns)
    org_cols = [c for c in iv_detail.columns if c not in ['变量', '整体']]
    
    # Define range labels
    bin_labels = ['[0, 0.02)', '[0.02, 0.05)', '[0.05, 0.1)', '[0.1, +∞)']
    
    result = []
    
    # Statistics for each organization (not including overall)
    for org in org_cols:
        org_iv = iv_detail[org].dropna()
        total_vars = len(org_iv)
        
        # Determine organization type
        org_type = '贷外' if org in oos_orgs else '建模'
        
        for i in range(len(iv_bins) - 1):
            lower = iv_bins[i]
            upper = iv_bins[i + 1]
            if upper == float('inf'):
                count = ((org_iv >= lower)).sum()
            else:
                count = ((org_iv >= lower) & (org_iv < upper)).sum()
            ratio = count / total_vars if total_vars > 0 else 0
            result.append({
                '机构': org,
                '类型': org_type,
                'IV区间': bin_labels[i],
                '变量个数': count,
                '占比': f'{ratio:.2%}'
            })
    
    return pd.DataFrame(result)


def psi_distribution_by_org(psi_detail: pd.DataFrame, oos_orgs: list = None, psi_bins: list = [0, 0.05, 0.1, float('inf')]) -> pd.DataFrame:
    """Count number and proportion of features in different PSI ranges for each organization
    
    Args:
        psi_detail: PSI detail table (containing organization, feature, month, PSI value columns)
        oos_orgs: Out-of-sample organization list
        psi_bins: PSI range boundaries [0, 0.05, 0.1, inf]
    
    Returns:
        PSI distribution statistics table
    """
    if oos_orgs is None:
        oos_orgs = []
    
    # Define range labels
    bin_labels = ['[0, 0.05)', '[0.05, 0.1)', '[0.1, +∞)']
    
    result = []
    
    # Get all organizations
    orgs = psi_detail['机构'].unique()
    
    for org in orgs:
        org_data = psi_detail[psi_detail['机构'] == org]
        
        # Determine organization type
        org_type = '贷外' if org in oos_orgs else '建模'
        
        # For each feature, take its maximum PSI value
        var_max_psi = org_data.groupby('变量')['PSI值'].max()
        total_vars = len(var_max_psi)
        
        for i in range(len(psi_bins) - 1):
            lower = psi_bins[i]
            upper = psi_bins[i + 1]
            if upper == float('inf'):
                count = ((var_max_psi >= lower)).sum()
            else:
                count = ((var_max_psi >= lower) & (var_max_psi < upper)).sum()
            ratio = count / total_vars if total_vars > 0 else 0
            result.append({
                '机构': org,
                '类型': org_type,
                'PSI区间': bin_labels[i],
                '变量个数': count,
                '占比': f'{ratio:.2%}'
            })
    
    return pd.DataFrame(result)


def value_ratio_distribution_by_org(data: pd.DataFrame, features: List[str], 
                                     oos_orgs: list = None, 
                                     value_bins: list = [0, 0.15, 0.35, 0.65, 0.95, 1.0]) -> pd.DataFrame:
    """Count number and proportion of features in different value ratio ranges for each organization
    
    Args:
        data: Data (containing new_org column)
        features: Feature list
        oos_orgs: Out-of-sample organization list
        value_bins: Value ratio range boundaries [0, 0.15, 0.35, 0.65, 0.95, 1.0]
    
    Returns:
        Value ratio distribution statistics table
    """
    if oos_orgs is None:
        oos_orgs = []
    
    # Define range labels
    bin_labels = ['[0, 15%)', '[15%, 35%)', '[35%, 65%)', '[65%, 95%)', '[95%, 100%]']
    
    result = []
    
    # Get all organizations
    orgs = data['new_org'].unique()
    
    for org in orgs:
        org_data = data[data['new_org'] == org]
        
        # Determine organization type
        org_type = '贷外' if org in oos_orgs else '建模'
        
        # Calculate value ratio for each feature (proportion of non-NaN)
        value_ratios = {}
        for f in features:
            if f in org_data.columns:
                non_null_count = org_data[f].notna().sum()
                total_count = len(org_data)
                value_ratios[f] = non_null_count / total_count if total_count > 0 else 0
        
        # Count number of features in each range
        total_vars = len(value_ratios)
        for i in range(len(value_bins) - 1):
            lower = value_bins[i]
            upper = value_bins[i + 1]
            if upper == 1.0:
                count = sum(1 for v in value_ratios.values() if lower <= v <= upper)
            else:
                count = sum(1 for v in value_ratios.values() if lower <= v < upper)
            ratio = count / total_vars if total_vars > 0 else 0
            result.append({
                '机构': org,
                '类型': org_type,
                '有值率区间': bin_labels[i],
                '变量个数': count,
                '占比': f'{ratio:.2%}'
            })
    
    return pd.DataFrame(result)


def calculate_iv_by_org(data: pd.DataFrame, features: List[str], 
                        n_jobs: int = 4) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Calculate IV by organization and overall
    
    Returns:
        iv_by_org: IV details by organization
        iv_overall: Overall IV
    """
    from references.func import calculate_iv
    
    orgs = data['new_org'].unique()
    
    # Overall IV
    iv_overall = calculate_iv(data, features, n_jobs=n_jobs)
    iv_overall['类型'] = '整体'
    
    # IV by organization
    iv_by_org = []
    for org in orgs:
        org_data = data[data['new_org'] == org]
        org_iv = calculate_iv(org_data, features, n_jobs=1)  # Single process for single organization
        if len(org_iv) > 0:  # Only add non-empty results
            org_iv['机构'] = org
            org_iv['类型'] = '分机构'
            iv_by_org.append(org_iv)
    
    iv_by_org = pd.concat(iv_by_org, ignore_index=True) if iv_by_org else pd.DataFrame(columns=['变量', 'IV', '机构', '类型'])
    
    return iv_by_org, iv_overall


def calculate_psi_detail(data: pd.DataFrame, features: List[str],
                         max_psi: float = 0.1, min_months_unstable: int = 3,
                         min_sample: int = 100, n_jobs: int = 4) -> tuple:
    """Calculate month-by-month PSI details for each feature in each organization, and mark whether to drop
    
    Returns:
        data: Data after dropping
        dropped: Summary of dropped features
        psi_summary: Complete PSI details (including drop flag)
    """
    orgs = data['new_org'].unique()
    
    # Build tasks
    tasks = []
    for org in orgs:
        org_data = data[data['new_org'] == org]
        months = sorted(org_data['new_date_ym'].unique())
        
        if len(months) < 2:
            continue
        
        for i in range(len(months) - 1):
            train_month = months[i]
            test_month = months[i + 1]
            
            train_data = org_data[org_data['new_date_ym'] == train_month]
            test_data = org_data[org_data['new_date_ym'] == test_month]
            
            train_n = len(train_data)
            test_n = len(test_data)
            
            for f in features:
                tasks.append((org, train_month, test_month, train_n, test_n, f, data, min_sample))
    
    # Multi-process calculation
    print(f"   PSI calculation: {len(tasks)} tasks, using {n_jobs} processes")
    results = Parallel(n_jobs=n_jobs, verbose=0)(delayed(_calc_single_psi)(task) for task in tasks)
    
    psi_df = pd.DataFrame(results)
    
    if len(psi_df) == 0:
        return data, pd.DataFrame(columns=['变量', '机构数', '不稳定机构数', '原因']), pd.DataFrame(columns=['变量', '机构数', '不稳定机构数', '是否剔除', '去除条件'])
    
    # Filter valid calculation records
    valid_psi = psi_df[psi_df['有效计算'] == 1].copy()
    
    if len(valid_psi) == 0:
        return data, pd.DataFrame(columns=['变量', '机构数', '不稳定机构数', '原因']), pd.DataFrame(columns=['变量', '机构数', '不稳定机构数', '是否剔除', '去除条件'])
    
    # Mark unstable
    valid_psi['不稳定'] = (valid_psi['PSI'] > max_psi).astype(int)
    
    # Summary: number of unstable months for each organization each feature
    org_summary = valid_psi.groupby(['机构', '变量'])['不稳定'].sum().reset_index()
    org_summary.columns = ['机构', '变量', '不稳定月份数']
    
    # Organization level summary: features with more than min_months_unstable unstable months
    org_count = len(orgs)
    channel_summary = org_summary.groupby('变量').apply(
        lambda x: pd.Series({
            '机构数': org_count,
            '不稳定机构数': (x['不稳定月份数'] >= min_months_unstable).sum()
        })
    ).reset_index()
    
    # Mark features that need to be dropped (more than 1/3 organizations unstable)
    channel_summary['需剔除'] = channel_summary['不稳定机构数'] > (channel_summary['机构数'] / 3)
    channel_summary['是否剔除'] = channel_summary['需剔除'].astype(int)
    channel_summary['去除条件'] = channel_summary.apply(
        lambda x: f'More than 1/3 of {org_count} organizations have PSI>{max_psi} for {min_months_unstable} consecutive months' if x['需剔除'] else '', axis=1
    )
    
    # Filter features to drop
    if len(channel_summary) > 0 and '变量' in channel_summary.columns:
        to_drop_vars = channel_summary[channel_summary['需剔除']]['变量'].tolist()
        data = data.drop(columns=[c for c in to_drop_vars if c in data.columns])
    
    # Organize drop information (only return dropped features)
    dropped = channel_summary[channel_summary['需剔除']].copy()
    dropped['原因'] = f'More than 1/3 of {org_count} organizations have PSI>{max_psi} for {min_months_unstable} consecutive months'
    
    return data, dropped[['变量', '机构数', '不稳定机构数', '原因']], channel_summary[['变量', '机构数', '不稳定机构数', '是否剔除', '去除条件']]


def export_cleaning_report(filepath: str, steps: list, 
                           iv_detail: pd.DataFrame = None,
                           iv_process: pd.DataFrame = None,
                           psi_detail: pd.DataFrame = None,
                           psi_process: pd.DataFrame = None,
                           params: dict = None,
                           iv_distribution: pd.DataFrame = None,
                           psi_distribution: pd.DataFrame = None,
                           value_ratio_distribution: pd.DataFrame = None):
    """Export cleaning report to xlsx - one sheet per step
    
    Args:
        filepath: Output path
        steps: Cleaning step list [(step name, DataFrame), ...]
        iv_detail: IV details (IV value of each feature in each organization and overall)
        iv_process: IV processing table (features that do not meet the conditions)
        psi_detail: PSI details (PSI value of each feature in each organization each month)
        psi_process: PSI processing table (features that do not meet the conditions)
        params: Hyperparameter dictionary, used to dynamically generate conditions
        iv_distribution: IV distribution statistics table
        psi_distribution: PSI distribution statistics table
        value_ratio_distribution: Value ratio distribution statistics table
    """
    from openpyxl import load_workbook
    
    try:
        wb = load_workbook(filepath)
    except:
        wb = Workbook()
        wb.remove(wb.active)
    
    # Summary sheet - only show real filtering steps
    if '汇总' in wb.sheetnames:
        del wb['汇总']
    ws = wb.create_sheet('汇总', 0)
    ws['A1'] = 'Data Cleaning Report'
    ws['A2'] = f'Generated at: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}'
    ws['A4'] = 'Step'
    ws['B4'] = 'Operation Details'
    ws['C4'] = 'Operation Result'
    ws['D4'] = 'Condition'
    
    # Only show real filtering steps (excluding details and distribution statistics)
    filter_steps = [
        'Step4-异常月份处理', 'Step6-高缺失率处理', 'Step7-IV处理', 
        'Step8-PSI处理', 'Step9-null importance处理', 'Step10-高相关性剔除'
    ]
    
    # Steps to exclude (details and distribution statistics)
    exclude_steps = [
        'Step7-IV明细', 'Step7-IV分布统计', 'Step8-PSI明细', 
        'Step8-PSI分布统计', 'Step5-有值率分布统计'
    ]
    
    # Steps that need to show drop count
    show_drop_count_steps = ['分离OOS数据']
    
    # Steps that only show parameter standards (no operation result)
    show_param_only_steps = ['机构样本统计', '缺失率明细']
    
    # Add note: each step is executed independently
    ws['A3'] = 'Note: Each filtering step is executed independently, data is not deleted, only statistics of features that do not meet conditions are recorded'
    
    # Get parameters, use default values if not provided
    if params is None:
        params = {}
    
    min_ym_bad_sample = params.get('min_ym_bad_sample', 10)
    min_ym_sample = params.get('min_ym_sample', 500)
    missing_ratio = params.get('missing_ratio', 0.6)
    overall_iv_threshold = params.get('overall_iv_threshold', 0.1)
    org_iv_threshold = params.get('org_iv_threshold', 0.1)
    max_org_threshold = params.get('max_org_threshold', 2)
    psi_threshold = params.get('psi_threshold', 0.1)
    max_months_ratio = params.get('max_months_ratio', 1/3)
    max_orgs = params.get('max_orgs', 4)
    gain_threshold = params.get('gain_threshold', 50)
    
    step_num = 1
    for name, df in steps:
        # Skip detail and distribution statistics steps
        if name in exclude_steps:
            continue
        
        # Remove StepX- prefix from operation details
        display_name = name.replace('Step4-', '').replace('Step6-', '').replace('Step7-', '').replace('Step8-', '').replace('Step9-', '').replace('Step10-', '')
        
        # Only show parameter standard steps (no operation result)
        if name in show_param_only_steps:
            ws.cell(4+step_num, 1, step_num)
            ws.cell(4+step_num, 2, display_name)
            result = ''
            # Condition: show parameter standards
            if name == '机构样本统计':
                condition = 'Statistics of sample count and bad sample rate for each organization'
            elif name == '缺失率明细':
                condition = 'Calculate missing rate for each feature'
            else:
                condition = ''
            ws.cell(4+step_num, 3, result)
            ws.cell(4+step_num, 4, condition)
            step_num += 1
        # Show steps that need to display drop count
        elif name in show_drop_count_steps:
            ws.cell(4+step_num, 1, step_num)
            ws.cell(4+step_num, 2, display_name)
            if df is not None and len(df) > 0:
                if name == '分离OOS数据':
                    # Special handling: show OOS and modeling sample counts
                    if '变量' in df.columns and '数量' in df.columns:
                    
                        oos_count = df[df['变量'] == 'OOS样本']['数量'].values[0] if len(df[df['变量'] == 'OOS样本']) > 0 else 0
                        model_count = df[df['变量'] == '建模样本']['数量'].values[0] if len(df[df['变量'] == '建模样本']) > 0 else 0
                        result = f'OOS samples {oos_count}, modeling samples {model_count}'
                    else:
                        result = f'{len(df)} rows'
                elif '变量' in df.columns:
                    result = f'Dropped {len(df)} features'
                else:
                    result = f'Dropped {len(df)}'
                condition = ''
            else:
                result = 'Empty'
                condition = ''
            ws.cell(4+step_num, 3, result)
            ws.cell(4+step_num, 4, condition)
            step_num += 1
        elif name in filter_steps:
            ws.cell(4+step_num, 1, step_num)
            ws.cell(4+step_num, 2, display_name)
            
            # Generate operation result and condition
            if df is not None and len(df) > 0:
                if name == 'Step4-异常月份处理':
                    # Operation result: dropped months
                    if '年月' in df.columns:
                        result = 'Dropped ' + ','.join(df['年月'].astype(str).tolist())
                    else:
                        result = 'Dropped ' + ','.join(df.iloc[:, 0].astype(str).tolist())
                    # Condition: parameter standards
                    condition = f'Months with bad sample count less than {min_ym_bad_sample} or total sample count less than {min_ym_sample} will be dropped (independent execution)'
                elif name == 'Step6-高缺失率处理':
                    # Operation result: number of dropped features
                    if '变量' in df.columns:
                        result = f'Dropped {len(df)} features'
                    else:
                        result = f'Dropped {len(df)}'
                    # Condition: parameter standards
                    condition = f'Features with overall missing rate greater than {missing_ratio} will be dropped (independent execution)'
                elif name == 'Step7-IV处理':
                    # Operation result: number of dropped features
                    if '变量' in df.columns:
                        result = f'Dropped {len(df)} features'
                    else:
                        result = f'Dropped {len(df)}'
                    # Condition: parameter standards
                    condition = f'Features with overall IV less than {overall_iv_threshold} or IV less than {org_iv_threshold} in {max_org_threshold} or more organizations will be dropped (independent execution)'
                elif name == 'Step8-PSI处理':
                    # Operation result: number of dropped features
                    if '变量' in df.columns:
                        result = f'Dropped {len(df)} features'
                    else:
                        result = f'Dropped {len(df)}'
                    # Condition: parameter standards
                    condition = f'PSI threshold {psi_threshold}, if an organization has more than {max_months_ratio:.0%} months with PSI greater than {psi_threshold}, the organization is considered unstable, if more than {max_orgs} organizations are unstable, the feature will be dropped (independent execution)'
                elif name == 'Step9-null importance处理':
                    # Operation result: number of dropped features
                    if '变量' in df.columns:
                        result = f'Dropped {len(df[df["状态"]=="去除"])} features'
                    else:
                        result = f'Dropped {len(df)}'
                    # Condition: parameter standards
                    condition = f'Features with absolute difference of gain values before and after permutation less than {gain_threshold} will be identified as noise and dropped (independent execution)'
                elif name == 'Step10-高相关性剔除':
                    # Operation result: number of dropped features
                    if '变量' in df.columns:
                        result = f'Dropped {len(df)} features'
                    else:
                        result = f'Dropped {len(df)}'
                    # Condition: parameter standards
                    max_corr = params.get('max_corr', 0.9)
                    top_n_keep = params.get('top_n_keep', 20)
                    condition = f'Features with correlation greater than {max_corr} will be dropped, keep top {top_n_keep} features by original gain ranking (independent execution)'
                else:
                    result = 'Dropped ' + str(len(df))
                    condition = ''
            else:
                result = 'Empty'
                condition = ''
            
            ws.cell(4+step_num, 3, result)
            ws.cell(4+step_num, 4, condition)
            step_num += 1
    
    # Calculate total number of dropped features (take union of dropped features from each step)
    all_dropped_vars = set()
    for name, df in steps:
        if name in filter_steps and df is not None and len(df) > 0 and '变量' in df.columns:
            if name == 'Step9-null importance处理':
                # null importance processing needs to filter features with status "去除"
                dropped_vars = df[df['状态'] == '去除']['变量'].tolist()
            else:
                dropped_vars = df['变量'].tolist()
            # Take union (deduplicated)
            all_dropped_vars = all_dropped_vars.union(set(dropped_vars))
    
    # Add final statistics row
    final_step_num = step_num
    ws.cell(4+final_step_num, 1, final_step_num)
    ws.cell(4+final_step_num, 2, 'Final Dropped Features Statistics')
    ws.cell(4+final_step_num, 3, f'Total dropped {len(all_dropped_vars)} features (union of dropped features from each step)')
    ws.cell(4+final_step_num, 4, 'Each step is executed independently, final dropped features are the union of dropped features from each step')
    
    # Details of each step (create sheets in step progression order)
    # Define sheet creation order
    sheet_order = [
        '机构样本统计', '分离OOS数据', 'Step4-异常月份处理', '缺失率明细',
        'Step5-有值率分布统计', 'Step6-高缺失率处理', 'Step7-IV明细', 'Step7-IV处理',
        'Step7-IV分布统计', 'Step8-PSI明细', 'Step8-PSI处理', 'Step8-PSI分布统计',
        'Step9-null importance处理', 'Step10-高相关性剔除'
    ]
    
    # Create sheets in order
    for sheet_name in sheet_order:
        # Find corresponding DataFrame in steps
        df = None
        for name, step_df in steps:
            if name == sheet_name:
                df = step_df
                break
        
        if df is not None:
            if sheet_name in wb.sheetnames:
                del wb[sheet_name]
            ws_detail = wb.create_sheet(sheet_name)
            
            for j, col in enumerate(df.columns):
                ws_detail.cell(1, j+1, col)
            
            for i, row in df.iterrows():
                for j, val in enumerate(row):
                    # Write value directly, avoid character escaping issues
                    ws_detail.cell(i+2, j+1, val if val is not None else '')
            
            header_fill = PatternFill(start_color="366092", end_color="366092", fill_type="solid")
            header_font = Font(color="FFFFFF", bold=True)
            for cell in ws_detail[1]:
                cell.fill = header_fill
                cell.font = header_font
    
    wb.save(filepath)
    print(f"Report saved: {filepath}")
func.py 8.3 KB
"""Data processing functions module"""
import pandas as pd
import numpy as np
import toad
from typing import List, Dict, Tuple
import tqdm
from datetime import datetime

try:
    from openpyxl import Workbook
    from openpyxl.styles import Font, PatternFill, Alignment
    HAS_OPENPYXL = True
except:
    HAS_OPENPYXL = False


def get_dataset(data_pth: str, date_colName: str, y_colName: str, 
                org_colName: str, data_encode: str, key_colNames: List[str],
                drop_colNames: List[str] = None,
                miss_vals: List[int] = None) -> pd.DataFrame:
    """Load and format data
    
    Args:
        data_pth: Data file path
        date_colName: Date column name
        y_colName: Label column name
        org_colName: Organization column name
        data_encode: Data encoding
        key_colNames: Primary key columns (for deduplication)
        drop_colNames: Columns to drop
        miss_vals: List of abnormal values to replace with NaN, default [-1, -999, -1111]
    """
    if drop_colNames is None:
        drop_colNames = []
    if miss_vals is None:
        miss_vals = [-1, -999, -1111]
    
    # Multi-format reading
    for fmt, reader in [('parquet', pd.read_parquet), ('csv', pd.read_csv), 
                         ('xlsx', pd.read_excel), ('pkl', pd.read_pickle)]:
        try:
            data = reader(data_pth)
            break
        except:
            continue
    
    # Replace abnormal values with NaN
    data.replace({v: np.nan for v in miss_vals}, inplace=True)
    
    # Deduplication and filtering
    data = data[data[y_colName].isin([0, 1])]
    data = data.drop_duplicates(subset=key_colNames)
    
    # Drop invalid columns
    data.drop(columns=[c for c in drop_colNames if c in data.columns], errors='ignore')
    data.drop(columns=[c for c in data.columns if data[c].nunique() <= 1], errors='ignore')
    
    # Rename columns
    data.rename(columns={date_colName: 'new_date', y_colName: 'new_target', 
                         org_colName: 'new_org'}, inplace=True)
    data['new_date'] = data['new_date'].astype(str).str.replace('-', '', regex=False).str[:8]
    data['new_date_ym'] = data['new_date'].str[:6]
    
    return data


def org_analysis(data: pd.DataFrame, oos_orgs: List[str] = None) -> pd.DataFrame:
    """Organization sample statistics analysis
    
    Args:
        data: Data
        oos_orgs: Out-of-sample organization list, used to identify OOS samples
    """
    stat = data.groupby(['new_org', 'new_date_ym']).agg(
        单月坏样本数=('new_target', 'sum'),
        单月总样本数=('new_target', 'count'),
        单月坏样率=('new_target', 'mean')
    ).reset_index()
    
    # Cumulative statistics
    stat['总坏样本数'] = stat.groupby('new_org')['单月坏样本数'].transform('sum')
    stat['总样本数'] = stat.groupby('new_org')['单月总样本数'].transform('sum')
    stat['总坏样率'] = stat['总坏样本数'] / stat['总样本数']
    
    # Mark whether it is an OOS organization
    if oos_orgs and len(oos_orgs) > 0:
        stat['样本类型'] = stat['new_org'].apply(lambda x: '贷外' if x in oos_orgs else '建模')
    else:
        stat['样本类型'] = '建模'
    
    stat = stat.rename(columns={'new_org': '机构', 'new_date_ym': '年月'})
    
    # Sort by sample type (modeling first, OOS last)
    stat = stat.sort_values(['样本类型', '机构', '年月'], ascending=[True, True, True])
    stat = stat.reset_index(drop=True)
    
    return stat[['机构', '年月', '单月坏样本数', '单月总样本数', '单月坏样率', '总坏样本数', '总样本数', '总坏样率', '样本类型']]


def missing_check(data: pd.DataFrame, channel: Dict[str, List[str]] = None) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Calculate missing rate - including overall and organization-level missing rates
    
    Returns:
        miss_detail: Missing rate details (format: variable, overall, org1, org2, ..., orgn)
        miss_ch: Overall missing rate (overall missing rate for each variable)
    """
    miss_vals = [-1, -999, -1111]
    miss_ch = []
    
    # Exclude non-variable columns: record_id, target, org_info, etc.
    exclude_cols = ['new_date', 'new_date_ym', 'new_target', 'new_org', 'record_id', 'target', 'org_info']
    cols = [c for c in data.columns if c not in exclude_cols]
    
    # Calculate overall missing rate
    for col in tqdm.tqdm(cols, desc="Missing rate"):
        rate = ((data[col].isin(miss_vals)) | (data[col].isna())).mean()
        miss_ch.append({'变量': col, '整体缺失率': round(rate, 4)})
    
    miss_ch = pd.DataFrame(miss_ch)
    
    # Calculate organization-level missing rates and convert to wide format
    orgs = sorted(data['new_org'].unique())
    miss_detail_dict = {'变量': []}
    miss_detail_dict['整体'] = []
    
    for org in orgs:
        miss_detail_dict[org] = []
    
    for col in cols:
        miss_detail_dict['变量'].append(col)
        # Overall missing rate
        overall_rate = ((data[col].isin(miss_vals)) | (data[col].isna())).mean()
        miss_detail_dict['整体'].append(round(overall_rate, 4))
        
        # Missing rate for each organization
        for org in orgs:
            org_data = data[data['new_org'] == org]
            rate = ((org_data[col].isin(miss_vals)) | (org_data[col].isna())).mean()
            miss_detail_dict[org].append(round(rate, 4))
    
    miss_detail = pd.DataFrame(miss_detail_dict)
    # Sort by overall missing rate in descending order
    miss_detail = miss_detail.sort_values('整体', ascending=False)
    miss_detail = miss_detail.reset_index(drop=True)
    
    return miss_detail, miss_ch


def calculate_iv(data: pd.DataFrame, features: List[str], n_jobs: int = 4) -> pd.DataFrame:
    """Calculate IV value - use toad.transform.Combiner for binning, set number of bins to 5, keep NaN values"""
    import tqdm
    from joblib import Parallel, delayed
    
    def _calc_iv(f):
        try:
            # Use toad.transform.Combiner for binning, set number of bins to 5
            c = toad.transform.Combiner()
            data_temp = data[[f, 'new_target']].copy()
            data_temp.columns = ['x', 'y']
            data_temp['x_bin'] = c.fit_transform(X=data_temp['x'], y=data_temp['y'], method='dt', n_bins=5, min_samples=0.05/5, empty_separate=True)
            
            # Calculate IV value using binned data
            iv_df = toad.quality(data_temp[['x_bin', 'y']], 'y', iv_only=True)
            if 'iv' in iv_df.columns and len(iv_df) > 0:
                iv_value = iv_df['iv'].iloc[0]
                if not np.isnan(iv_value):
                    return {'变量': f, 'IV': round(iv_value, 4)}
            return None
        except Exception as e:
            print(f"   IV calculation error: variable={f}, error={e}")
            return None
    
    # Use tqdm to show progress
    results = Parallel(n_jobs=n_jobs, verbose=0)(
        delayed(_calc_iv)(f) for f in features
    )
    iv_list = [r for r in results if r is not None]
    
    if len(iv_list) == 0:
        print(f"   IV calculation result is empty, number of features={len(features)}")
        return pd.DataFrame(columns=['变量', 'IV'])
    
    return pd.DataFrame(iv_list).sort_values('IV', ascending=False)


def calculate_corr(data: pd.DataFrame, features: List[str]) -> pd.DataFrame:
    """Calculate correlation matrix"""
    corr = data[features].corr().abs()
    return corr


def export_report_xlsx(filepath: str, data_name: str, data: pd.DataFrame, 
                       sheet_name: str, description: str = ""):
    """Export xlsx report - supports appending"""
    try:
        from openpyxl import load_workbook
        wb = load_workbook(filepath)
        ws = wb.create_sheet(sheet_name)
    except:
        wb = Workbook()
        ws = wb.active
        ws.title = sheet_name
    
    # Write description
    ws['A1'] = f"Data: {data_name}"
    ws['A2'] = f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
    if description:
        ws['A3'] = f"Description: {description}"
    
    # Write data
    start_row = 5
    for i, col in enumerate(data.columns):
        ws.cell(start_row, i+1, col)
    
    for i, row in enumerate(data.values):
        for j, val in enumerate(row):
            ws.cell(start_row+1+i, j+1, val)
    
    # Styles
    header_fill = PatternFill(start_color="366092", end_color="366092", fill_type="solid")
    header_font = Font(color="FFFFFF", bold=True)
    for cell in ws[start_row]:
        cell.fill = header_fill
        cell.font = header_font
        cell.alignment = Alignment(horizontal='center')
    
    wb.save(filepath)
    print(f"[{sheet_name}] Saved to {filepath}")
scripts/
example.py 17.0 KB
#!/usr/bin/env python3
"""
Execution script
Version: 1.0.0
Last modified: 02-03-2026
"""
import os, sys
import time
import pandas as pd
from typing import Dict, List, Optional, Any, Callable
import numpy as np
import multiprocessing

# =============================================================================
# System Configuration
# =============================================================================
CPU_COUNT = multiprocessing.cpu_count()
N_JOBS = max(1, CPU_COUNT - 1)  # Multi-process parallel count, keep 1 core for system

def _ensure_references_on_path():
    script_dir = os.path.dirname(__file__)
    cur = script_dir
    for _ in range(8):
        candidate = os.path.join(cur, 'references')
        if os.path.isdir(candidate):
            # add parent folder (which contains `references`) to sys.path
            sys.path.insert(0, cur)
            return
        parent = os.path.dirname(cur)
        if parent == cur:
            break
        cur = parent
    # fallback: add a reasonable repo-root guess
    sys.path.insert(0, os.path.abspath(os.path.join(script_dir, '..', '..', '..')))


_ensure_references_on_path()

from references.func import get_dataset, missing_check, org_analysis
from references.analysis import (drop_abnormal_ym, drop_highmiss_features,
                               drop_lowiv_features, drop_highcorr_features,
                               drop_highpsi_features,
                               drop_highnoise_features,
                               export_cleaning_report,
                               iv_distribution_by_org,
                               psi_distribution_by_org,
                               value_ratio_distribution_by_org)

# ==================== Path Configuration (Interactive Input) ====================
# Use 50-column test data as default, support interactive modification in command line
default_data_path = ''
default_output_dir = ''

def _get_path_input(prompt, default):
    try:
        user_val = input(f"{prompt} (default: {default}): ").strip()
    except Exception:
        user_val = ''
    return user_val if user_val else default

DATA_PATH = _get_path_input('Please enter data file path DATA_PATH', default_data_path)
OUTPUT_DIR = _get_path_input('Please enter output directory OUTPUT_DIR', default_output_dir)
REPORT_PATH = os.path.join(OUTPUT_DIR, '数据清洗报告.xlsx')

# Data column name configuration (adjust according to actual data)
DATE_COL = _get_path_input('Please enter date column name in data', 'apply_date')
Y_COL = _get_path_input('Please enter label column name in data', 'target')
ORG_COL = _get_path_input('Please enter organization column name in data', 'org_info')

# Support multiple primary key column names input (comma or space separated)
def _get_list_input(prompt, default):
    try:
        user_val = input(f"{prompt} (default: {default}): ").strip()
    except Exception:
        user_val = ''
    if not user_val:
        user_val = default
    # Support comma or space separation
    parts = [p.strip() for p in user_val.replace(',', ' ').split() if p.strip()]
    return parts

KEY_COLS = _get_list_input('Please enter primary key column names in data (multiple columns separated by comma or space)', 'record_id')

# ==================== Multi-process Configuration Information ====================
print("=" * 60)
print("Multi-process Configuration")
print("=" * 60)
print(f"   Local CPU cores: {CPU_COUNT}")
print(f"   Current process count: {N_JOBS}")
print("=" * 60)

# ==================== OOS Organization Configuration (Interactive Input) ====================
# Default out-of-sample organization list, users can input custom list in comma-separated format during interaction
default_oos = [
   'orgA', 'orgB', 'orgC', 'orgD', 'orgE',
]

try:
    oos_input = input('Please enter out-of-sample organization list, comma separated (press Enter to use default list):').strip()
except Exception:
    oos_input = ''
if oos_input:
    OOS_ORGS = [s.strip() for s in oos_input.split(',') if s.strip()]
else:
    OOS_ORGS = default_oos

os.makedirs(OUTPUT_DIR, exist_ok=True)

# ==================== Interactive Hyperparameter Input ====================
def get_user_input(prompt, default, dtype=float):
    """Get user input, support default value and type conversion"""
    while True:
        try:
            user_input = input(f"{prompt} (default: {default}): ").strip()
            if not user_input:
                return default
            return dtype(user_input)
        except ValueError:
            print(f"   Invalid input, please enter {dtype.__name__} type")

# Record cleaning steps
steps = []

# Store parameters for each step
params = {}

# Timer decorator
def timer(step_name):
    """Timer decorator"""
    def decorator(func):
        def wrapper(*args, **kwargs):
            print(f"\nStarting {step_name}...")
            start_time = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start_time
            print(f"   {step_name} elapsed: {elapsed:.2f} seconds")
            return result
        return wrapper
    return decorator

# ==================== Step 1: Get Data ====================
print("\n" + "=" * 60)
print("Step 1: Get Data")
print("=" * 60)
step_start = time.time()
# Use configuration from global_parameters
data = get_dataset(
    data_pth=DATA_PATH,
    date_colName=DATE_COL,
    y_colName=Y_COL,
    org_colName=ORG_COL,
    data_encode='utf-8',
    key_colNames=KEY_COLS,
    drop_colNames=[],
    miss_vals=[-1, -999, -1111]
)
print(f"   Original data: {data.shape}")
print(f"   Abnormal values replaced with NaN: [-1, -999, -1111]")
print(f"   Step 1 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Step 2: Organization Sample Analysis ====================
print("\n" + "=" * 60)
print("Step 2: Organization Sample Analysis")
print("=" * 60)
step_start = time.time()
org_stat = org_analysis(data, oos_orgs=OOS_ORGS)
steps.append(('机构样本统计', org_stat))
print(f"   Organization count: {data['new_org'].nunique()}, Month count: {data['new_date_ym'].nunique()}")
print(f"   Out-of-sample organizations: {len(OOS_ORGS)}")
print(f"   Step 2 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Step 3: Separate OOS Data ====================
print("\n" + "=" * 60)
print("Step 3: Separate OOS Data")
print("=" * 60)
step_start = time.time()
oos_data = data[data['new_org'].isin(OOS_ORGS)]
data = data[~data['new_org'].isin(OOS_ORGS)]
print(f"   OOS samples: {oos_data.shape[0]} rows")
print(f"   Modeling samples: {data.shape[0]} rows")
print(f"   OOS organizations: {OOS_ORGS}")
print(f"   Step 3 elapsed: {time.time() - step_start:.2f} seconds")
# Create separation information DataFrame
oos_info = pd.DataFrame({'变量': ['OOS样本', '建模样本'], '数量': [oos_data.shape[0], data.shape[0]]})
steps.append(('分离OOS数据', oos_info))

# ==================== Step 4: Filter Abnormal Months (Modeling Data Only) ====================
print("\n" + "=" * 60)
print("Step 4: Filter Abnormal Months (Modeling Data Only)")
print("=" * 60)
print("   Press Enter to use default values")
print("=" * 60)
params['min_ym_bad_sample'] = int(get_user_input("Bad sample count threshold", 10, int))
params['min_ym_sample'] = int(get_user_input("Total sample count threshold", 500, int))
step_start = time.time()
data_filtered, abnormal_ym = drop_abnormal_ym(data.copy(), min_ym_bad_sample=params['min_ym_bad_sample'], min_ym_sample=params['min_ym_sample'])
steps.append(('Step4-异常月份处理', abnormal_ym))
print(f"   After filtering: {data_filtered.shape}")
print(f"   Parameters: min_ym_bad_sample={params['min_ym_bad_sample']}, min_ym_sample={params['min_ym_sample']}")
if len(abnormal_ym) > 0:
    print(f"   Dropped months: {abnormal_ym['年月'].tolist()}")
    print(f"   Removal conditions: {abnormal_ym['去除条件'].tolist()}")
print(f"   Step 4 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Step 5: Calculate Missing Rate ====================
print("\n" + "=" * 60)
print("Step 5: Calculate Missing Rate")
print("=" * 60)
step_start = time.time()
orgs = data['new_org'].unique().tolist()
channel = {'整体': orgs}
miss_detail, miss_channel = missing_check(data, channel=channel)
# miss_detail: Missing rate details (format: feature, overall, org1, org2, ..., orgn)
# miss_channel: Overall missing rate
steps.append(('缺失率明细', miss_detail))
print(f"   Feature count: {len(miss_detail['变量'].unique())}")
print(f"   Organization count: {len(miss_detail.columns) - 2}")  # Subtract '变量' and '整体' columns
print(f"   Step 5 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Step 6: Drop High Missing Rate Features ====================
print("\n" + "=" * 60)
print("Step 6: Drop High Missing Rate Features")
print("=" * 60)
print("   Press Enter to use default values")
print("=" * 60)
params['missing_ratio'] = get_user_input("Missing rate threshold", 0.6)
step_start = time.time()
data_miss, dropped_miss = drop_highmiss_features(data.copy(), miss_channel, threshold=params['missing_ratio'])
steps.append(('Step6-高缺失率处理', dropped_miss))
print(f"   Dropped: {len(dropped_miss)}")
print(f"   Threshold: {params['missing_ratio']}")
if len(dropped_miss) > 0:
    print(f"   Dropped features: {dropped_miss['变量'].tolist()[:5]}...")
    print(f"   Removal conditions: {dropped_miss['去除条件'].tolist()[:5]}...")
print(f"   Step 6 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Step 7: Drop Low IV Features ====================
print("\n" + "=" * 60)
print("Step 7: Drop Low IV Features")
print("=" * 60)
print("   Press Enter to use default values")
print("=" * 60)
params['overall_iv_threshold'] = get_user_input("Overall IV threshold", 0.1)
params['org_iv_threshold'] = get_user_input("Single organization IV threshold", 0.1)
params['max_org_threshold'] = int(get_user_input("Maximum tolerated low IV organization count", 2, int))
step_start = time.time()
# Get feature list (use all features)
features = [c for c in data.columns if c.startswith('i_')]
data_iv, iv_detail, iv_process = drop_lowiv_features(
    data.copy(), features,
    overall_iv_threshold=params['overall_iv_threshold'],
    org_iv_threshold=params['org_iv_threshold'],
    max_org_threshold=params['max_org_threshold'],
    n_jobs=N_JOBS
)
# iv_detail: IV details (IV value of each feature in each organization and overall)
# iv_process: IV processing table (features that do not meet the conditions)
steps.append(('Step7-IV处理', iv_process))
print(f"   Dropped: {len(iv_process)}")
print(f"   Parameters: overall_iv_threshold={params['overall_iv_threshold']}, org_iv_threshold={params['org_iv_threshold']}, max_org_threshold={params['max_org_threshold']}")
if len(iv_process) > 0:
    print(f"   Dropped features: {iv_process['变量'].tolist()[:5]}...")
    print(f"   Processing reasons: {iv_process['处理原因'].tolist()[:5]}...")
print(f"   Step 7 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Step 8: Drop High PSI Features ====================
print("\n" + "=" * 60)
print("Step 8: Drop High PSI Features (By Organization + Month-by-Month)")
print("=" * 60)
print("   Press Enter to use default values")
print("=" * 60)
params['psi_threshold'] = get_user_input("PSI threshold", 0.1)
params['max_months_ratio'] = get_user_input("Maximum unstable month ratio", 1/3)
params['max_orgs'] = int(get_user_input("Maximum unstable organization count", 6, int))
step_start = time.time()
# Get features before PSI calculation (use all features)
features_for_psi = [c for c in data.columns if c.startswith('i_')]
data_psi, psi_detail, psi_process = drop_highpsi_features(
    data.copy(), features_for_psi,
    psi_threshold=params['psi_threshold'],
    max_months_ratio=params['max_months_ratio'],
    max_orgs=params['max_orgs'],
    min_sample_per_month=100,
    n_jobs=N_JOBS
)
# psi_detail: PSI details (PSI value of each feature in each organization each month)
# psi_process: PSI processing table (features that do not meet the conditions)
steps.append(('Step8-PSI处理', psi_process))
print(f"   Dropped: {len(psi_process)}")
print(f"   Parameters: psi_threshold={params['psi_threshold']}, max_months_ratio={params['max_months_ratio']:.2f}, max_orgs={params['max_orgs']}")
if len(psi_process) > 0:
    print(f"   Dropped features: {psi_process['变量'].tolist()[:5]}...")
    print(f"   Processing reasons: {psi_process['处理原因'].tolist()[:5]}...")
print(f"   PSI details: {len(psi_detail)} records")
print(f"   Step 8 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Step 9: Null Importance Denoising ====================
print("\n" + "=" * 60)
print("Step 9: Null Importance Remove High Noise Features")
print("=" * 60)
print("   Press Enter to use default values")
print("=" * 60)
params['n_estimators'] = int(get_user_input("Number of trees", 100, int))
params['max_depth'] = int(get_user_input("Maximum tree depth", 5, int))
params['gain_threshold'] = get_user_input("Gain difference threshold", 50)
step_start = time.time()
# Get feature list (use all features)
features = [c for c in data.columns if c.startswith('i_')]
data_noise, dropped_noise = drop_highnoise_features(data.copy(), features, n_estimators=params['n_estimators'], max_depth=params['max_depth'], gain_threshold=params['gain_threshold'])
steps.append(('Step9-null importance处理', dropped_noise))
print(f"   Dropped: {len(dropped_noise)}")
print(f"   Parameters: n_estimators={params['n_estimators']}, max_depth={params['max_depth']}, gain_threshold={params['gain_threshold']}")
if len(dropped_noise) > 0:
    print(f"   Dropped features: {dropped_noise['变量'].tolist()}")
print(f"   Step 9 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Step 10: Drop High Correlation Features (Based on Null Importance Original Gain) ====================
print("\n" + "=" * 60)
print("Step 10: Drop High Correlation Features (Based on Null Importance Original Gain)")
print("=" * 60)
print("   Press Enter to use default values")
print("=" * 60)
params['max_corr'] = get_user_input("Correlation threshold", 0.9)
params['top_n_keep'] = int(get_user_input("Keep top N features by original gain ranking", 20, int))
step_start = time.time()
# Get feature list (use all features)
features = [c for c in data.columns if c.startswith('i_')]
# Get original gain from null importance results
if len(dropped_noise) > 0 and '原始gain' in dropped_noise.columns:
    gain_dict = dict(zip(dropped_noise['变量'], dropped_noise['原始gain']))
else:
    gain_dict = {}
data_corr, dropped_corr = drop_highcorr_features(data.copy(), features, threshold=params['max_corr'], gain_dict=gain_dict, top_n_keep=params['top_n_keep'])
steps.append(('Step10-高相关性剔除', dropped_corr))
print(f"   Dropped: {len(dropped_corr)}")
print(f"   Threshold: {params['max_corr']}")
if len(dropped_corr) > 0:
    print(f"   Dropped features: {dropped_corr['变量'].tolist()}")
    print(f"   Removal conditions: {dropped_corr['去除条件'].tolist()[:5]}...")
print(f"   Step 10 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Step 11: Export Report ====================
print("\n" + "=" * 60)
print("Step 11: Export Report")
print("=" * 60)
step_start = time.time()

# Calculate IV distribution statistics
print("   Calculating IV distribution statistics...")
iv_distribution = iv_distribution_by_org(iv_detail, oos_orgs=OOS_ORGS)
print(f"   IV distribution statistics: {len(iv_distribution)} records")

# Calculate PSI distribution statistics
print("   Calculating PSI distribution statistics...")
psi_distribution = psi_distribution_by_org(psi_detail, oos_orgs=OOS_ORGS)
print(f"   PSI distribution statistics: {len(psi_distribution)} records")

# Calculate value ratio distribution statistics (use all features)
print("   Calculating value ratio distribution statistics...")
features_for_value_ratio = [c for c in data.columns if c.startswith('i_')]
value_ratio_distribution = value_ratio_distribution_by_org(data, features_for_value_ratio, oos_orgs=OOS_ORGS)
print(f"   Value ratio distribution statistics: {len(value_ratio_distribution)} records")

# Add details and distribution statistics to steps list
steps.append(('Step7-IV明细', iv_detail))
steps.append(('Step7-IV分布统计', iv_distribution))
steps.append(('Step8-PSI明细', psi_detail))
steps.append(('Step8-PSI分布统计', psi_distribution))
steps.append(('Step5-有值率分布统计', value_ratio_distribution))

export_cleaning_report(REPORT_PATH, steps,
                      iv_detail=iv_detail,
                      iv_process=iv_process,
                      psi_detail=psi_detail,
                      psi_process=psi_process,
                      params=params,
                      iv_distribution=iv_distribution,
                      psi_distribution=psi_distribution,
                      value_ratio_distribution=value_ratio_distribution)
print(f"   Report: {REPORT_PATH}")
print(f"   Step 11 elapsed: {time.time() - step_start:.2f} seconds")

# ==================== Summary ====================
print("\n" + "=" * 60)
print("Data Cleaning Completed!")
print("=" * 60)
print(f"   Original data: {data.shape[0]} rows")
print(f"   Original features: {len([c for c in data.columns if c.startswith('i_')])}")
print(f"   Cleaning steps (each step executed independently, data not deleted):")
for name, df in steps:
    print(f"     - {name}: Dropped {df.shape[0] if hasattr(df, 'shape') else len(df)}")

License (MIT)

View full license text
MIT License

Copyright GitHub, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.