
Reproducing the Lending Club Data Analysis with Python, Pandas, and Seaborn (Full Code and Dataset Included)

news 2026/5/26 1:45:46


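End-to-end analysis, from raw data to business insight, is a core skill in data science. Lending Club is a well-known peer-to-peer lending platform, and its public loan dataset is widely used for practicing financial data analysis. This guide uses the Python stack to reproduce the full workflow, from data cleaning to visualization, and includes fixes for problems that commonly come up in Chinese-language environments.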

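1. Environment Setup and Data Loading

Good tools come first. We recommend creating a dedicated Python environment with Anaconda to avoid package dependency conflicts: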

    conda create -n lending_analysis python=3.8
    conda activate lending_analysis
    # plotly is only needed for the map in section 4.3
    pip install pandas seaborn matplotlib jupyter plotly

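The dataset is available from Kaggle or the Lending Club website and takes about 1.2 GB once unzipped. For the first load, pass read_csv a few options that reduce memory use: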

    import pandas as pd
    import numpy as np

    # Memory optimization: declare compact dtypes for known numeric columns
    dtypes = {
        'id': 'int32',
        'loan_amnt': 'float32',
        'int_rate': 'float32',
        'annual_inc': 'float32',
    }

    loan_df = pd.read_csv(
        'loan.csv',
        dtype=dtypes,
        parse_dates=['issue_d'],
        low_memory=False,
    )
    print(f"Dataset shape: {loan_df.shape}")
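To see why declaring float32 helps, the sketch below compares the memory footprint of the same columns stored as float64 and as float32. The data is synthetic (random values, not the Lending Club file), and the row count is an arbitrary assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few numeric loan columns (values are made up).
rng = np.random.default_rng(0)
n_rows = 100_000
demo = pd.DataFrame({
    "loan_amnt": rng.uniform(1_000, 35_000, n_rows),
    "int_rate": rng.uniform(5, 30, n_rows),
    "annual_inc": rng.uniform(20_000, 200_000, n_rows),
})

# NumPy produces float64 here; downcast every column to float32.
compact = demo.astype("float32")

bytes_before = int(demo.memory_usage(deep=True).sum())
bytes_after = int(compact.memory_usage(deep=True).sum())
print(f"float64: {bytes_before / 1e6:.2f} MB, float32: {bytes_after / 1e6:.2f} MB")
```

float32 halves the storage for these columns at the cost of precision (about seven significant digits), which is usually acceptable for amounts and rates but worth checking before summing very large totals.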

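Common problems and fixes:

Chinese characters not displaying: add the following to the Matplotlib configuration:

    import matplotlib.pyplot as plt

    # Matplotlib uses the first font in the list that is installed:
    # SimHei for Windows, Arial Unicode MS for Mac
    plt.rcParams['font.sans-serif'] = ['SimHei', 'Arial Unicode MS']
    # Render the minus sign correctly when a CJK font is active
    plt.rcParams['axes.unicode_minus'] = False

Out of memory: read the data in chunks or use the Dask library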
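Reading in chunks can be sketched as follows. This is a minimal illustration that assumes a tiny in-memory CSV in place of loan.csv; the rows and the per-grade aggregation are made up for the example.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for loan.csv (rows are made up).
csv_text = """loan_amnt,grade
5000,A
12000,B
8000,A
20000,C
15000,B
"""

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
totals = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Aggregate each chunk, then fold the partial sums together.
    partial = chunk.groupby("grade")["loan_amnt"].sum()
    for grade, amount in partial.items():
        totals[grade] = totals.get(grade, 0) + int(amount)

loan_by_grade = pd.Series(totals).sort_index()
print(loan_by_grade)
```

For the real file, replace the StringIO object with the path to the CSV and pick a chunk size that fits in memory. Only aggregations that can be combined from partial results (sums, counts) work this simply.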

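2. Data Cleaning in Practice

Raw data usually contains missing values, outliers, and redundant fields. We clean it in layers:

2.1 Selecting fields

Start by dropping columns with too many missing values, then keep only the business-relevant fields. This avoids the "curse of dimensionality":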

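    # Missing-value ratio per column, highest first
    missing_ratio = loan_df.isnull().mean().sort_values(ascending=False)

    # Columns with less than 30% missing values
    keep_cols = set(missing_ratio[missing_ratio < 0.3].index)

    # Business-relevant fields, including those the later sections rely on
    essential_cols = ['loan_amnt', 'term', 'int_rate', 'grade', 'emp_length',
                      'annual_inc', 'installment', 'dti', 'issue_d',
                      'loan_status', 'addr_state']

    # Keep essential columns that pass the missing-value filter, in a stable order
    final_cols = [col for col in essential_cols if col in keep_cols]
    loan_df = loan_df[final_cols]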
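The missing-ratio filter is easier to follow on a small frame. The sketch below uses a made-up four-row table, and the column contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows: "desc" is 75% missing, "emp_length" is 25% missing.
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 20000],
    "desc": [np.nan, np.nan, np.nan, "debt consolidation"],
    "emp_length": ["1 year", np.nan, "5 years", "10+ years"],
})

# isnull() yields booleans; their column mean is the fraction missing.
missing_ratio = toy.isnull().mean()

# Apply the same 30% threshold as above.
keep_cols = missing_ratio[missing_ratio < 0.3].index.tolist()
print(missing_ratio)
print(keep_cols)
```

The 30% threshold is a judgment call. A column that is mostly empty can still matter if the missingness itself carries information, so it is worth reviewing the dropped columns by name.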

2.2 Handling special values

Some conversions are specific to financial data:

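    # Map employment length to a number; values not in the map become -1
    emp_length_map = {
        '< 1 year': 0,
        '1 year': 1,
        '2 years': 2,
        # ... other mappings
        '10+ years': 10,
    }
    loan_df['emp_length'] = loan_df['emp_length'].map(emp_length_map).fillna(-1)

    # Normalize the interest rate: works whether the column was read as
    # text such as '10.65%' or is already numeric
    loan_df['int_rate'] = (
        loan_df['int_rate'].astype(str).str.rstrip('%').astype('float32')
    )

    # Extract the number of months from the loan term text
    loan_df['term_months'] = (
        loan_df['term'].str.extract(r'(\d+)', expand=False).astype('int16')
    )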
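As an alternative to spelling out every entry of the mapping, the number of years can be pulled out with a regular expression. This is a sketch of a different technique, not the mapping used above, and the sample values are assumptions about the text format of emp_length.

```python
import pandas as pd

# Sample values in the format the mapping above expects (assumed).
emp_length = pd.Series(["< 1 year", "1 year", "5 years", "10+ years", None])

# "< 1 year" contains the digit 1, so flag it before the generic extract.
is_under_one = emp_length.eq("< 1 year")

# Pull the first run of digits; missing values stay NaN.
years = emp_length.str.extract(r"(\d+)", expand=False).astype("float32")

# Under one year counts as 0, missing values as -1 (same sentinel as above).
years = years.mask(is_under_one, 0).fillna(-1).astype("int8")
print(years.tolist())
```

Like the mapping, this caps "10+ years" at 10 and encodes missing values as -1. A sentinel of -1 is convenient for tree models but distorts means and correlations, so NaN may be the better choice for the correlation analysis later on.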

2.3 Data quality checklist

Build a data quality report to confirm that the cleaning worked:

Check	Method	Expected	Actual
Duplicate rows	`df.duplicated().sum()`	0	0
Interest rate range	`df['int_rate'].between(5,30).all()`	True	True
Date coverage	`df['issue_d'].dt.year.value_counts()`	2007-2015	As expected
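The checks in the table can be collected into one small function so they are rerun after every cleaning step. The function name quality_report and the sample rows are hypothetical, introduced here only for illustration; the date check is simplified to the first and last issue year.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Run the three checks from the table and collect the results."""
    years = df["issue_d"].dt.year
    checks = [
        ("duplicate rows", 0, int(df.duplicated().sum())),
        ("int_rate within 5-30", True, bool(df["int_rate"].between(5, 30).all())),
        ("issue years", "2007-2015", f"{years.min()}-{years.max()}"),
    ]
    report = pd.DataFrame(checks, columns=["check", "expected", "actual"])
    report["passed"] = report["expected"] == report["actual"]
    return report

# Made-up rows; the 31.0 rate is deliberately out of range.
sample = pd.DataFrame({
    "int_rate": [7.5, 12.0, 31.0],
    "issue_d": pd.to_datetime(["2007-06-01", "2011-12-01", "2015-03-01"]),
})
report = quality_report(sample)
print(report)
```

On this sample the interest rate check fails, which is the point: the report surfaces a bad value that a quick look at the head of the frame would likely miss.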

3. Exploratory Analysis Techniques

3.1 Time series trends

Use the Pandas resample method to aggregate the loans by month:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Total loan amount per month
    # (pandas 2.2+ prefers the alias 'ME' for month end)
    monthly_loan = loan_df.set_index('issue_d')['loan_amnt'].resample('M').sum()

    # Line chart with a shaded area underneath
    plt.figure(figsize=(12, 6))
    sns.lineplot(data=monthly_loan, color='steelblue')
    plt.fill_between(monthly_loan.index, monthly_loan.values, alpha=0.3)
    plt.title('Monthly total loan amount (2007-2015)', pad=20)
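The chart above shows monthly totals only. One simple way to add a trend line is a rolling mean over the resampled series. The sketch below assumes synthetic data (one loan of 1,000 per day in 2014) and a 3-month window, both chosen arbitrarily; it uses the 'MS' (month start) alias, which is accepted across pandas versions.

```python
import pandas as pd

# Made-up loans: one per day across 2014, each for 1,000.
issue_dates = pd.date_range("2014-01-01", "2014-12-31", freq="D")
loans = pd.DataFrame({"issue_d": issue_dates, "loan_amnt": 1_000.0})

# Sum per calendar month; "MS" labels each bucket by its first day.
monthly = loans.set_index("issue_d")["loan_amnt"].resample("MS").sum()

# 3-month rolling mean as a simple trend line; min_periods=1 keeps
# the first two months instead of leaving them as NaN.
trend = monthly.rolling(window=3, min_periods=1).mean()
print(pd.DataFrame({"monthly": monthly, "trend": trend}).head())
```

Plotting trend on the same axes as the monthly series (for example with a second sns.lineplot call) gives the smoothed line.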

3.2 Multi-dimensional breakdowns

Seaborn's FacetGrid splits one plot into a panel per category:

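    # One histogram panel per credit grade, colored by loan status
    g = sns.FacetGrid(loan_df, col='grade', hue='loan_status',
                      col_order=sorted(loan_df['grade'].dropna().unique()),
                      col_wrap=4, height=3, aspect=1.2)
    g.map(sns.histplot, 'loan_amnt', bins=15, alpha=0.7)
    g.add_legend()

    # Leave room above the panels for the overall title
    plt.subplots_adjust(top=0.9)
    g.figure.suptitle('Loan amount distribution by credit grade')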

3.3 Feature engineering for default risk

Build the key features for default prediction:

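    import numpy as np

    # Flag loans whose status counts as a default
    bad_status = ['Charged Off', 'Default', 'Late (31-120 days)']
    loan_df['is_bad'] = loan_df['loan_status'].isin(bad_status).astype('int8')

    # Risk ratios; a zero denominator becomes NaN instead of inf
    loan_df['income_to_loan'] = (
        loan_df['annual_inc'] / loan_df['loan_amnt'].replace(0, np.nan)
    )
    loan_df['installment_ratio'] = (
        loan_df['installment'] / loan_df['annual_inc'].replace(0, np.nan)
    )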
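Once is_bad exists, the default rate of any group is just the mean of the flag. A minimal sketch on made-up rows, using the same status list as above:

```python
import pandas as pd

# Made-up loans: two per grade.
loans = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "C", "C"],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off",
                    "Current", "Default", "Late (31-120 days)"],
})

# Same definition of a bad loan as above.
bad_status = ["Charged Off", "Default", "Late (31-120 days)"]
loans["is_bad"] = loans["loan_status"].isin(bad_status).astype("int8")

# The mean of a 0/1 flag is the share of bad loans in each group.
bad_rate = loans.groupby("grade")["is_bad"].mean()
print(bad_rate)
```

Note that loans with status "Current" count as good here even though their outcome is not yet known. For modeling, it may be cleaner to restrict the data to loans that have reached a final state.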

4. Advanced Visualization

4.1 Correlation heatmap

Use Seaborn to show how the features correlate:

    # Pairwise correlations between the numeric features
    corr_matrix = loan_df[['loan_amnt', 'int_rate', 'emp_length',
                           'annual_inc', 'dti', 'is_bad']].corr()

    # Hide the diagonal and the upper triangle, which mirrors the lower one
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, mask=mask, annot=True,
                cmap='coolwarm', center=0, linewidths=.5)
    plt.title('Feature correlation heatmap', pad=20)
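The mask passed to the heatmap is a boolean array in which True means "hide this cell". The sketch below builds it for a made-up three-column frame and applies it to the correlation matrix directly, so the hidden cells show up as NaN.

```python
import numpy as np
import pandas as pd

# Made-up data: b rises with a, c falls with a.
frame = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})
corr = frame.corr()

# np.triu keeps the diagonal and everything above it as True.
mask = np.triu(np.ones_like(corr, dtype=bool))

# DataFrame.mask blanks the cells where the mask is True,
# which mirrors what the heatmap leaves undrawn.
visible = corr.mask(mask)
print(mask)
print(visible)
```

Only the three cells below the diagonal remain, one per distinct pair of columns.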

4.2 Grouped box plots

Compare distributions across groups:

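    plt.figure(figsize=(12, 6))

    # Interest rate per grade, split by default flag; outliers hidden
    sns.boxplot(x='grade', y='int_rate', hue='is_bad', data=loan_df,
                order=sorted(loan_df['grade'].dropna().unique()),
                palette='Set2', showfliers=False)
    plt.title('Interest rate by credit grade, split by default status', pad=15)
    plt.legend(title='Defaulted', bbox_to_anchor=(1.05, 1))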

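4.3 Geographic distribution

The raw data includes zip codes, but plotting them would first require converting them to coordinates. Aggregating by state (addr_state) is simpler, and that is what the example below does: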

    import plotly.express as px  # requires: pip install plotly

    # Example: number of loans per state
    state_loan = loan_df['addr_state'].value_counts().reset_index()
    state_loan.columns = ['state', 'counts']

    # Draw a map of the United States with plotly
    fig = px.choropleth(state_loan,
                        locations='state',
                        locationmode='USA-states',
                        color='counts',
                        scope='usa',
                        color_continuous_scale='Viridis')
    fig.update_layout(title_text='Number of loans by US state')
    fig.show()
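The column names that reset_index produces after value_counts differ between pandas versions, which is why the code above overwrites them. A variant that names both columns explicitly is sketched below on made-up state codes.

```python
import pandas as pd

# Made-up state codes standing in for loan_df['addr_state'].
addr_state = pd.Series(["CA", "NY", "CA", "TX", "CA", "NY"], name="addr_state")

# rename_axis names the index (the state codes) and
# reset_index(name=...) names the count column.
state_loan = (
    addr_state.value_counts()
    .rename_axis("state")
    .reset_index(name="counts")
)
print(state_loan)
```

The resulting frame has the same two columns, state and counts, that the choropleth call expects.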