33 当当网-G7-01

33.1 Data Cleaning

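This notebook cleans the raw data in data_raw:
- clean_price(): strips the ¥ symbol from prices and converts them to floats, which simplifies later analysis. All prices in this analysis are assumed to be in RMB.
- clean_comments(): converts review counts to integers
- clean_year(): extracts the publication year and converts it to an integer
The cleaned data is saved automatically to data_clean.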

# Cell 1
import pandas as pd
import numpy as np
import os

# Cell 2
# Load the raw data
df = pd.read_excel('data_raw/当当网Python书籍销量排行_原始数据.xlsx')
print("Raw data shape:", df.shape)
print("\nFirst 5 rows of raw data:")
df.head()

Raw data shape: (50, 8)

First 5 rows of raw data:

	title	author	year	publisher	review_count	original_price	discounted_price	页面排名
0	小学生Python创意编程（视频教学版）	刘凤飞	2024-01-01	清华大学出版社	7332条评论	¥89.00	¥84.60	1
1	Python编程从入门到实践第3版	埃里克·马瑟斯	2023-05-01	人民邮电出版社	20216条评论	¥109.80	¥69.80	2
2	Python股票量化交易从入门到实践	袁霄	2021-07-01	人民邮电出版社	4498条评论	¥99.80	¥94.80	3
3	Python从入门到精通（第3版）	明日科技	2023-06-01	清华大学出版社	2438条评论	¥89.80	¥85.30	4
4	深度学习入门基于Python的理论与实现	斋藤康毅	2021-05-01	人民邮电出版社	14356条评论	¥69.80	¥39.80	5

# Cell 3
# Check for missing values
print("Missing value counts:")
df.isnull().sum()

Missing value counts:

title               0
author              0
year                0
publisher           0
review_count        0
original_price      0
discounted_price    0
页面排名                0
dtype: int64

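# Cell 4
# Data cleaning

# 1. Prices
def clean_price(price):
    """Clean a price value: strip the ¥ symbol and convert to float."""
    if pd.isna(price) or price == '暂无':
        return np.nan
    try:
        # Remove the ¥ symbol and surrounding whitespace, then convert to float
        return float(str(price).replace('¥', '').strip())
    except (ValueError, TypeError):
        return np.nan

# 2. Review counts
def clean_comments(comments):
    """Clean a review count and convert it to an integer."""
    if pd.isna(comments) or comments == '暂无':
        return 0
    try:
        # Remove the '条评论' suffix and surrounding whitespace, then convert to int
        return int(str(comments).replace('条评论', '').strip())
    except (ValueError, TypeError):
        # Note: unparsable values also become 0, the same as missing ones
        return 0

# 3. Publication year
def clean_year(year):
    """Clean a publication date: extract the year and convert it to an integer."""
    if pd.isna(year) or year == '暂无':
        return np.nan
    try:
        # Assumes the format 'YYYY-MM-DD' or 'YYYY'; take the first 4 characters
        return int(str(year)[:4])
    except (ValueError, TypeError):
        return np.nan

# Apply the cleaning functions
print("Cleaning prices...")
df['original_price'] = df['original_price'].apply(clean_price)
df['discounted_price'] = df['discounted_price'].apply(clean_price)

print("Cleaning review counts...")
df['review_count'] = df['review_count'].apply(clean_comments)

print("Cleaning publication years...")
df['year'] = df['year'].apply(clean_year)

print("Dropping duplicates...")
df = df.drop_duplicates(subset=['title', 'author', 'publisher'])

# Reset the index
df = df.reset_index(drop=True)

print("\nData cleaning complete!")
print(f"Rows after cleaning: {len(df)}")

Cleaning prices...
Cleaning review counts...
Cleaning publication years...
Dropping duplicates...

Data cleaning complete!
Rows after cleaning: 50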

# Cell 5
# Inspect the cleaned data
print("\nCleaned data info:")
df.info()

print("\nSample of cleaned data:")
df.head()  # not rendered: a notebook displays only a cell's last expression

print("\nSummary statistics of cleaned data:")
df.describe()


Cleaned data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title             50 non-null     object 
 1   author            50 non-null     object 
 2   year              38 non-null     float64
 3   publisher         50 non-null     object 
 4   review_count      50 non-null     int64  
 5   original_price    47 non-null     float64
 6   discounted_price  49 non-null     float64
 7   页面排名              50 non-null     int64  
dtypes: float64(3), int64(2), object(3)
memory usage: 3.3+ KB

Sample of cleaned data:

Summary statistics of cleaned data:

	year	review_count	original_price	discounted_price	页面排名
count	38.000000	50.0	47.000000	49.000000	50.00000
mean	2020.842105	0.0	96.878723	80.891224	25.50000
std	1.763386	0.0	39.163514	36.547802	14.57738
min	2019.000000	0.0	39.800000	37.800000	1.00000
25%	2019.250000	0.0	69.800000	55.300000	13.25000
50%	2020.000000	0.0	89.800000	85.300000	25.50000
75%	2022.000000	0.0	108.000000	94.800000	37.75000
max	2024.000000	0.0	268.600000	255.200000	50.00000
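Every review_count statistic in the summary above is 0. That is worth checking, because clean_comments() returns 0 both for missing values and for any string it fails to parse, so parse failures are silent. The following is a minimal sketch of one way to count values that are present but not in the expected format before cleaning. It assumes raw values look like '7332条评论'; the helper name count_unparsed and the sample values are hypothetical.

```python
import re

import pandas as pd

# Hypothetical raw values; the real column may contain other formats.
raw = pd.Series(["7332条评论", "20216条评论", "暂无", None, "1.2万条评论"])

def count_unparsed(series: pd.Series) -> int:
    """Count non-missing values that are not exactly '<digits>条评论'."""
    pattern = re.compile(r"\s*\d+\s*条评论\s*")
    present = series.dropna().astype(str)
    matched = present.map(lambda s: bool(pattern.fullmatch(s))).astype(bool)
    return int((~matched).sum())

print(count_unparsed(raw))
```

Running a check like this on the raw column before applying clean_comments() would show whether the zeros come from placeholders such as '暂无' or from an unexpected format.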

# Cell 6
# Check for missing values after cleaning
print("Missing value counts after cleaning:")
df.isnull().sum()

Missing value counts after cleaning:

title                0
author               0
year                12
publisher            0
review_count         0
original_price       3
discounted_price     1
页面排名                 0
dtype: int64
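The year column has 12 missing values after cleaning, which is why df.info() reports it as float64 rather than an integer type: NaN cannot be stored in a plain int64 column. If integer years are preferred, pandas' nullable Int64 dtype is one option. A minimal sketch on hypothetical values, not the notebook's actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned years, with NaN where the year was unavailable.
years = pd.Series([2024.0, 2023.0, np.nan, 2021.0])

# Nullable integer dtype: missing entries stay as <NA> instead of forcing floats.
years_int = years.astype("Int64")

print(years_int.dtype)
print(years_int.isna().sum())
```

In the notebook this would correspond to something like df['year'].astype('Int64') after the cleaning step.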

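# Cell 7
# Save the cleaned data
output_path = 'data_clean/python_books_clean.xlsx'
# Create the output folder if it does not exist yet
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_excel(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")

Cleaned data saved to: data_clean/python_books_clean.xlsx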

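The code above performs the following data cleaning tasks:
- Prices: strip the ¥ symbol and convert to floats
- Review counts: convert to integers and handle special values such as "暂无"
- Publication year: extract the year and convert it to an integer
- Duplicates: drop rows that share the same title, author, and publisher
- Data quality checks: compare summary statistics and missing values before and after cleaning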
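The same steps can also be written with vectorized pandas string methods instead of row-by-row apply(). The sketch below is an illustration under stated assumptions, not an exact equivalent of the functions above: the function name clean_frame and the sample rows are hypothetical, and the review count is taken from the first run of digits in the string.

```python
import pandas as pd

# Hypothetical sample rows shaped like a subset of the raw columns.
sample = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B"],
    "author": ["X", "X", "Y"],
    "publisher": ["P", "P", "Q"],
    "year": ["2024-01-01", "2024-01-01", "暂无"],
    "review_count": ["7332条评论", "7332条评论", "暂无"],
    "original_price": ["¥89.00", "¥89.00", "暂无"],
})

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Price: strip the ¥ sign; anything non-numeric becomes NaN.
    price_text = out["original_price"].astype(str).str.replace("¥", "", regex=False)
    out["original_price"] = pd.to_numeric(price_text.str.strip(), errors="coerce")
    # Review count: first run of digits; missing or unparsable becomes 0.
    digits = out["review_count"].astype(str).str.extract(r"(\d+)")[0]
    out["review_count"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
    # Year: four digits at the start of the string; NaN when absent.
    year_text = out["year"].astype(str).str.extract(r"^(\d{4})")[0]
    out["year"] = pd.to_numeric(year_text, errors="coerce")
    # Deduplicate on title, author, and publisher, then reset the index.
    out = out.drop_duplicates(subset=["title", "author", "publisher"])
    return out.reset_index(drop=True)

cleaned = clean_frame(sample)
print(cleaned)
```

Using errors="coerce" keeps unparsable values visible as NaN until the point where they are deliberately filled, which makes silent failures easier to spot.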

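If something goes wrong, check the following:
- The file path is correct (data_raw/当当网Python书籍销量排行_原始数据.xlsx).
- The required libraries (pandas, numpy) are installed.
- Each cell is run separately and in order.
- If you get a path error, adjust the file path to match your setup.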
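The checks above can be run as a quick pre-flight step before the first cell. This is a minimal sketch: the function name preflight is hypothetical, and including openpyxl in the module list is an assumption based on pandas typically relying on it to read and write .xlsx files.

```python
import importlib.util
from pathlib import Path

def preflight(raw_path, out_dir, modules=("pandas", "numpy", "openpyxl")):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    if not Path(raw_path).is_file():
        problems.append(f"input file not found: {raw_path}")
    if not Path(out_dir).is_dir():
        problems.append(f"output folder not found: {out_dir}")
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing library: {name}")
    return problems

print(preflight("data_raw/当当网Python书籍销量排行_原始数据.xlsx", "data_clean"))
```

The function only reports problems and changes nothing on disk, so it is safe to run repeatedly.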