标签 seaborn 下的文章 - 蘇阿細

登录

标签搜索

蘇阿細

累计撰写 463 篇文章
累计收到 4 条评论

搜索到 2 篇与的结果

2025-12-08
二、数据分析 - seaborn - 案例 1. 房地产市场洞察与数据评估（1）导入依赖（2）导入数据（3）数据概览（4）数据清洗（5）新数据特征构造（6）问题分析及可视化import numpy as np import matplotlib.pyplot as plt import pandas as pd import seaborn as sns import datetime from matplotlib import rcParams rcParams['font.family'] = 'Microsoft YaHei'df = pd.read_csv('static/2_pandas/data/house_sales.csv', encoding='utf-8') print("总记录数：", len(df)) print("字段数数：", len(df.columns)) print(df.head(10)) print() df.info() 总记录数： 106118 字段数数： 12 city address area floor name price province \ 0 合肥龙岗-临泉东路和王岗大道交叉口东南角 90㎡中层（共18层）圣地亚哥 128万安徽 1 合肥龙岗-临泉东路和王岗大道交叉口东南角 90㎡中层（共18层）圣地亚哥 128万安徽 2 合肥生态公园-淮海大道与大众路交口 95㎡中层（共18层）正荣·悦都荟 132万安徽 3 合肥生态公园-淮海大道与大众路交口 95㎡中层（共18层）正荣·悦都荟 132万安徽 4 合肥撮镇-文一名门金隅裕溪路与东风大道交口 37㎡中层（共22层）文一名门金隅 32万安徽 5 合肥撮镇-文一名门金隅裕溪路与东风大道交口 37㎡中层（共22层）文一名门金隅 32万安徽 6 合肥龙岗-长江东路与和县里交口 50㎡高层（共30层）柏庄金座 46万安徽 7 合肥龙岗-长江东路与和县里交口 50㎡高层（共30层）柏庄金座 46万安徽 8 合肥新亚汽车站-张洼路与临泉路交汇处向北100米(原红星机械 120㎡中层（共27层）天目未来 158万安徽 9 合肥新亚汽车站-张洼路与临泉路交汇处向北100米(原红星机械 120㎡中层（共27层）天目未来 158万安徽 rooms toward unit year \ 0 3室2厅南北向 14222元/㎡ 2013年建 1 3室2厅南北向 14222元/㎡ 2013年建 2 3室2厅南向 13895元/㎡ 2019年建 3 3室2厅南向 13895元/㎡ 2019年建 4 2室1厅南北向 8649元/㎡ 2017年建 5 2室1厅南北向 8649元/㎡ 2017年建 6 2室1厅南向 9200元/㎡ 2019年建 7 2室1厅南向 9200元/㎡ 2019年建 8 4室2厅南向 13167元/㎡ 2012年建 9 4室2厅南向 13167元/㎡ 2012年建 origin_url 0 https://hf.esf.fang.com/chushou/3_404230646.htm 1 https://hf.esf.fang.com/chushou/3_404230646.htm 2 https://hf.esf.fang.com/chushou/3_404304901.htm 3 https://hf.esf.fang.com/chushou/3_404304901.htm 4 https://hf.esf.fang.com/chushou/3_404372096.htm 5 https://hf.esf.fang.com/chushou/3_404372096.htm 6 https://hf.esf.fang.com/chushou/3_398859799.htm 7 https://hf.esf.fang.com/chushou/3_398859799.htm 8 https://hf.esf.fang.com/chushou/3_381138154.htm 9 https://hf.esf.fang.com/chushou/3_381138154.htm <class 'pandas.core.frame.DataFrame'> RangeIndex: 106118 entries, 0 to 106117 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city 106118 non-null object 1 address 104452 non-null object 2 area 105324 non-null object 3 floor 104024 non-null object 4 name 105564 non-null object 5 price 105564 non-null object 6 province 106118 non-null object 7 rooms 104036 non-null object 8 toward 105240 non-null object 9 unit 105564 non-null object 10 year 57736 non-null object 11 origin_url 105564 non-null object dtypes: object(12) memory usage: 9.7+ MB# 数据清洗 df.drop(columns=['origin_url'], inplace=True) # df.head() # 缺失值处理 df.isna().sum() # 可选操作 df.dropna(inplace=True) # 删除重复数据 df.drop_duplicates(inplace=True) print(len(df)) 28104# 面积数据类型转换 df['area'] = df['area'].str.replace('㎡', '').astype(float) # 售价数据类型转换 df['price'] = df['price'].str.replace('万', '').astype(float) # 朝向数据类型转换 df['toward'] = df['toward'].astype('category') # 单价格数据类型转换 df['unit'] = df['unit'].str.replace('元/㎡', '').astype(float) # 年份格数据类型转换 df['year'] = df['year'].str.replace('年建', '').astype(int)# 异常值处理 # 房屋面积 df = df[(df['area'] > 20) & (df['area'] < 600)] # 价格 IQR Q1 = df['price'].quantile(0.25) Q3 = df['price'].quantile(0.75) IQR = Q3 - Q1 low_price = Q1 - 1.5 * IQR high_price = Q3 + 1.5 * IQR df = df[(df['price'] > low_price) & (df['price'] < high_price)]# 新数据特征构造 # 地区 district df['district'] = df['address'].str.split('-').str[0] # 楼层类型 floor_type df['floor_type'] = df['floor'].str.split('（').str[0].astype('category') # 是否是直辖市 zxs def is_zxs(city): if (city in ['北京', '上海', '天津', '重庆']): return True else: return False df['zxs'] = df['city'].apply(lambda x: 1 if is_zxs(x) else 0) # 卧室数 bedroom_num df['bedroom_num'] = df['rooms'].str.split('室').str[0].astype(int) # 客厅数量 living_room_num # df['living_room_num'] = df['rooms'].str.split('室').str[1].str.split('厅').str[0].fillna(0).astype(int) df['living_room_num'] = df['rooms'].str.extract(r'(\d+)厅').fillna(0).astype(int) # 房龄 house_age df['house_age'] = datetime.datetime.now().year - df['year'] # 价格区间 price_interval df['price_interval'] = pd.cut(df['price'], bins=3, labels=['低', '中', '高'])""" 分析一：哪些变量最影响房价？面积、楼层、房间数哪个因素影响最大？分析主题：特征性相关分析目标：各因素对房价的线性影响分组字段：无指标/方法：皮尔逊相关系数 """ # 选择数值型特征 # corr() 相关系数 corr = df[['price', 'area', 'unit', 'house_age']].corr() print("影响房价的因素：") print(corr['price'].sort_values(ascending=False)[1:]) # 相关性热力图 plt.figure(figsize=(10, 5)) sns.heatmap(corr, cmap='coolwarm') plt.title('房价因素热力图') plt.tight_layout()影响房价的因素： unit 0.742731 area 0.452523 house_age 0.091520 Name: price, dtype: float64""" 分析二：全国房价的总体分布，是否存在极端值分析主题：描述性统计分析字段：概览数值型字段的分布特征分组字段：无指标/方法：平均数/中位数/四分位数/标准差 """ print(df.describe()) # 房价分布直方图 plt.subplot() plt.hist(df['price'], bins=10) area price unit year zxs \ count 26135.000000 26135.000000 26135.000000 26135.000000 26135.000000 mean 103.755810 117.208370 11610.131012 2013.072240 0.008800 std 33.995994 60.967675 5824.245273 6.019342 0.093399 min 21.000000 9.000000 1000.000000 1976.000000 0.000000 25% 85.005000 72.000000 7587.000000 2011.000000 0.000000 50% 100.000000 103.000000 10312.000000 2015.000000 0.000000 75% 123.000000 150.000000 14184.000000 2017.000000 0.000000 max 470.000000 306.000000 85288.000000 2023.000000 1.000000 bedroom_num living_room_num house_age count 26135.000000 26135.000000 26135.000000 mean 2.714444 1.848556 11.927760 std 0.800768 0.407353 6.019342 min 0.000000 0.000000 2.000000 25% 2.000000 2.000000 8.000000 50% 3.000000 2.000000 10.000000 75% 3.000000 2.000000 14.000000 max 9.000000 12.000000 49.000000 (array([ 991., 4810., 6499., 4613., 3362., 2226., 1333., 1055., 691., 555.]), array([ 9. , 38.7, 68.4, 98.1, 127.8, 157.5, 187.2, 216.9, 246.6, 276.3, 306. ]), <BarContainer object of 10 artists>)sns.histplot(data=df, x='price', bins=10, kde=True)<Axes: xlabel='price', ylabel='Count'>""" 分析三：南北向是否比单一朝向贵？贵多少？分析主题：朝向溢价分析目标：评估不同朝向的价格差异分组字段：toward 指标/方法：方差分析/多重比较 """ print(df['toward'].value_counts()) toward 南北向 14884 南向 8796 东南向 974 东向 419 北向 258 西南向 254 西向 161 东西向 151 西北向 133 东北向 105 Name: count, dtype: int64print(df.groupby('toward', observed=False).agg({ 'price': ['mean', 'median'], 'unit': 'median', 'house_age': 'mean' })) price unit house_age mean median median mean toward 东北向 114.555333 100.0 12198.0 12.609524 东南向 115.542608 105.0 10864.0 10.951745 东向 110.158568 95.0 11421.0 12.761337 东西向 98.935099 82.0 9000.0 15.490066 北向 92.527907 75.5 11698.0 13.108527 南北向 119.472147 104.5 10000.0 12.073703 南向 114.555016 103.0 10759.0 11.551160 西北向 119.107594 105.0 12290.0 13.473684 西南向 139.711811 138.4 13333.0 13.452756 西向 102.662298 86.0 12528.0 13.385093plt.figure(figsize=(10, 5)) sns.boxplot(data=df, x='toward', y='price') plt.tight_layout()
- 2025年12月08日
- 16 阅读
- 0 评论
- 0 点赞
2025-12-07
一、数据分析 - seaborn # 安装依赖 pip install seaborn一、统计图import pandas as pd import seaborn as sns import matplotlib.pyplot as plt from matplotlib import rcParams rcParams['font.family'] = 'Microsoft YaHei' # 设置图表大小 plt.figure(figsize=(10, 5)) # data penguins = pd.read_csv('static/2_pandas/data/penguins.csv') penguins.dropna(inplace=True) # penguins.head()1. 直方图sns.histplot(data=penguins, x='species') <Axes: xlabel='species', ylabel='Count'>2. 核密度估计图核密度估计图（KDE，Kernel Density Estimate Plot）用于显示数据分布，其通过平滑直方图的方法来估计数据的概率密度函数，使得分布图看起来更加连续、平滑。核密度估计是一种非参数方法，用于估计随机变量的概率密度函数。其基本思想是将每个数据点视为一个”核（高斯分布）“，然后将这些核的贡献相加以构成平滑的密度曲线。# 喙长度 - 核密度估计图 sns.kdeplot(data=penguins, x='bill_length_mm') <Axes: xlabel='bill_length_mm', ylabel='Density'>sns.histplot(data=penguins, x='bill_length_mm', kde=True) <Axes: xlabel='bill_length_mm', ylabel='Count'>3. 计数图计数图用于绘制分类变量的计数分布图，显示每个类别在数据集中出现的次数，可以快速了解类别的分布情况。# 不同岛屿的企鹅分布数量 sns.countplot(data=penguins, x='island') <Axes: xlabel='island', ylabel='count'>4. 散点图# x：体重，y：脚蹼长度 # hue参数可设置通过不同”分类“进行比对 sns.scatterplot(data=penguins, x='body_mass_g', y='flipper_length_mm', hue='sex') <Axes: xlabel='body_mass_g', ylabel='flipper_length_mm'>5. 蜂窝图# kind='hex' sns.jointplot(data=penguins, x='body_mass_g', y='flipper_length_mm', kind='hex') <seaborn.axisgrid.JointGrid at 0x239c7cddc40>6. 二维核密度估计图# 同时设置参数x和y sns.kdeplot(data=penguins, x='body_mass_g', y='flipper_length_mm') <Axes: xlabel='body_mass_g', ylabel='flipper_length_mm'># fill 填充 # cbar 颜色示意条 sns.kdeplot(data=penguins, x='body_mass_g', y='flipper_length_mm', fill=True, cbar=True) <Axes: xlabel='body_mass_g', ylabel='flipper_length_mm'>7. 条形图sns.barplot(data=penguins, x='species', y='bill_length_mm', estimator='mean', errorbar=None) <Axes: xlabel='species', ylabel='bill_length_mm'>8. 箱线图sns.boxplot(data=penguins, x='species', y='bill_length_mm') <Axes: xlabel='species', ylabel='bill_length_mm'>9. 小提琴图小提琴图（Violin Plot）是一种结合了箱线图和核密度估计图的可视化图表，用于展示数据的分布情况、集中趋势、散布情况以及异常值，不仅可以显示数据的基本统计量（中位数、四分位数等），还可以展示数据的概率密度。sns.violinplot(data=penguins, x='species', y='bill_length_mm') <Axes: xlabel='species', ylabel='bill_length_mm'>10. 成对关系图成对关系图是一种用于显示多个变量之间关系的可视化图表，其对角线上的图通常显示每个变量的分布，即单变量特性，其他位置的图用于展示所有变量的两两关系。sns.pairplot(penguins, hue='species') <seaborn.axisgrid.PairGrid at 0x239c979de20>
- 2025年12月07日
- 13 阅读
- 0 评论
- 0 点赞

蘇阿細

463 文章数

4 评论量

标签

舔狗日记