D6. 类别变量和分组统计
作者:谭睿鹏 (南京大学)
E-Mail: tanruipeng652@126.com
在正式进行实证分析之前,进行变量的分组统计有助于发现样本的基本特征,并初步验证预期是否合理。有的数据有明确的分组指标:如国有 vs 民营,外资 vs 内资,东部 vs 中西部;但有的数据没有明确的分组指标,这时可以根据变量的分位数、平均值或其他指标进行确定。
D6.1 列表统计
有两种方式,一是列表,使用tabulate
命令;二是画图,包括柱状图或饼图,使用graph bar
等命令。
D6.1.1 一维列表
调用妇女工资数据资料 nlsw88.dta,看各行业内妇女的分布情况,可以用 tabluate
命令来实现:
sysuse "nlsw88.dta", clear
tabulate industry
industry | Freq. Percent Cum.
------------------------+---------------------------
Ag/Forestry/Fisheries | 17 0.76 0.76
Mining | 4 0.18 0.94
Construction | 29 1.30 2.24
Manufacturing | 367 16.44 18.68
Transport/Comm/Utility | 90 4.03 22.72
Wholesale/Retail Trade | 333 14.92 37.63
Finance/Ins/Real Estate | 192 8.60 46.24
Business/Repair Svc | 86 3.85 50.09
Personal Services | 97 4.35 54.44
Entertainment/Rec Svc | 17 0.76 55.20
Professional Services | 824 36.92 92.11
Public Administration | 176 7.89 100.00
------------------------+--------------------------- Total | 2,232 100.00
或者画柱状图,更加直观地呈现数据分布特征
graph hbar (count) idcode,over(industry) ///
blabel(bar,position(outside) format (%3.0f) ///
color(blue) size(small))
或者画饼图
graph pie, over(industry) sort
加 sort
选项后,饼图按面积从小到大顺时针排列,与图下方的图例一一对应
D6.1.2 二维列表
适用情形为样本中有两个或两个以上的类别变量,基于两个或多个变量看交叉的数据分布。
tabulate industry collgrad
| college graduatenot colle college g | Total
industry |
----------------------+----------------------+-------
Ag/Forestry/Fisheries | 14 3 | 17
Mining | 4 0 | 4
Construction | 26 3 | 29
Manufacturing | 334 33 | 367
Transport/Comm/Utilit | 79 11 | 90
Wholesale/Retail Trad | 295 38 | 333
Finance/Ins/Real Esta | 164 28 | 192
Business/Repair Svc | 69 17 | 86
Personal Services | 94 3 | 97
Entertainment/Rec Svc | 12 5 | 17
Professional Services | 479 345 | 824
Public Administration | 133 43 | 176
----------------------+----------------------+------- Total | 1,703 529 | 2,232
如果要显示每个类别里样本数的占比,则用以下命令:
tabulate industry collgrad, column nofreq
| college graduatenot colle college g | Total
industry |
----------------------+----------------------+--------
Ag/Forestry/Fisheries | 0.82 0.57 | 0.76
Mining | 0.23 0.00 | 0.18
Construction | 1.53 0.57 | 1.30
Manufacturing | 19.61 6.24 | 16.44
Transport/Comm/Utilit | 4.64 2.08 | 4.03
Wholesale/Retail Trad | 17.32 7.18 | 14.92
Finance/Ins/Real Esta | 9.63 5.29 | 8.60
Business/Repair Svc | 4.05 3.21 | 3.85
Personal Services | 5.52 0.57 | 4.35
Entertainment/Rec Svc | 0.70 0.95 | 0.76
Professional Services | 28.13 65.22 | 36.92
Public Administration | 7.81 8.13 | 7.89
----------------------+----------------------+-------- Total | 100.00 100.00 | 100.00
或者用 tabplot
可以画图,更加直观
///
tabplot industry collgrad, none) horizontal barw(1) ///
bfcolor(percent(collgrad) subtitle(% at each industry) ///
offset(0.05)) showval(
也可以根据两个分类变量来统计第三个变量的一些基本统计特征
tabulate industry collgrad, summarize(wage) means
| college graduatenot colle college g | Total
industry |
-----------+----------------------+----------
Ag/Forest | 5.3122251 7.0626361 | 5.6211211
Mining | 15.349593 . | 15.349593
Construct | 6.3992254 17.66774 | 7.5649338
Manufactu | 6.8861197 13.730761 | 7.5015779
Transport | 11.195142 13.225958 | 11.443353
Wholesale | 5.8526317 8.2472956 | 6.1258966I | 9.4961749 11.875596 | 9.8431737
Finance/
Business/ | 6.4778803 11.728484 | 7.5157904
Personal | 4.3603053 5.679114 | 4.4010932
Entertain | 5.9703888 8.5340567 | 6.7244088
Professio | 6.2705725 10.093487 | 7.8711858
Public Ad | 8.2137595 12.039293 | 9.148407
-----------+----------------------+---------- Total | 6.9259835 10.543932 | 7.7834633
或者用图形成呈现:
graph hbar (mean) wage, over(industry) over(collgrad) ///
scheme(slmono) ///
intensity(*0.4) blabel(bar,position(outside) format (%3.1f) ///
color(blue) size(small))
D6.2 连续变量转换为类别变量
更为一般的情形是,数据里只有连续变量,那如何将它们转换成类别变量?
sysuse nlsw88.dta,clear
. sort wage
. gen g_wage = group (5) //将工资等分成三组
. tab g_wage
.
g_wage | Freq. Percent Cum.
--------+----------------------------
1 | 450 20.04 20.04
2 | 449 19.99 40.03
3 | 449 19.99 60.02
4 | 449 19.99 80.01
5 | 449 19.99 100.00
--------+---------------------------- Total | 2,246 100.00
新变量g_wage
取值是1-5,可以使用lable define
和label value
对 g_wage
设置一个数字文字对应表
label define g_wage 1 "Low" 2 "High"
label value g_wage g_wage
进行初步统计分析
tabstat wage age married union collgrad south, ///
mean) by(g_wage) format(%4.2f) stat(
union collgrad south
g_wage | wage age married
--------+-------------------------------------------------
1 | 3.12 39.26 0.61 0.11 0.11 0.54
2 | 4.68 39.14 0.67 0.18 0.08 0.48
3 | 6.32 39.14 0.67 0.23 0.17 0.41
4 | 8.73 39.14 0.66 0.35 0.32 0.36
5 | 16.00 39.09 0.60 0.32 0.50 0.30
--------+-------------------------------------------------
Total | 7.77 39.15 0.64 0.25 0.24 0.42 ----------------------------------------------------------
命令label
和tabstat
分别对应B5
和R5
两讲。
除了基本的列表分析外,还可以进行回归分析:
local x "age ttl_exp married union collgrad south c_city"
reg wage `x' if g_wage == 1,robust
est store Low
reg wage `x' if g_wage == 3,robust
est store Mid
reg wage `x' if g_wage == 5,robust
est store High
reg wage `x' ,robust
est store Full
local m "Low Mid High Full"
`m', mtitle(`m') nogap s(r2 r2_a N) b(%6.3f) ///
esttab star(* 0.1 ** 0.05 *** 0.01)
---------------------------------------------------------------
(1) (2) (3) (4)
Low Mid High Full
---------------------------------------------------------------
age 0.003 0.010 0.031 -0.048*
(0.33) (1.15) (0.48) (-1.82)
ttl_exp 0.026*** 0.021*** 0.041 0.299***
(3.68) (3.22) (0.76) (18.01)
married 0.159** -0.030 -0.386 -0.123
(2.37) (-0.55) (-0.90) (-0.69) union 0.213** 0.105* -0.345 0.705***
(2.45) (1.71) (-0.75) (3.35)
collgrad -0.133 0.166** 0.910** 3.102***
(-1.05) (2.46) (2.12) (13.22)
south -0.222*** -0.055 -0.177 -1.347***
(-3.37) (-1.03) (-0.34) (-7.95)
c_city 0.088 0.010 0.788* 0.495**
(1.13) (0.19) (1.67) (2.57) _cons 2.768*** 5.658*** 11.623*** 5.188***
(6.98) (16.23) (4.68) (4.91)
---------------------------------------------------------------
r2 0.105 0.051 0.030 0.281
r2_a 0.085 0.034 0.013 0.279 N 320.000 400.000 391.000 1878.000
---------------------------------------------------------------
分组分析有利于更加深入的讲故事,文献在也有根据分位数来分组,比如在33百分位和66百分位,将整个样本分为3组。到底要如何分组,最好根据理论分析或权威文献。
help quantiles
bysort industry: quantiles wage,gen(p_wage) nq(3)
D6.3 扩展阅读
- 吴小齐, 2024, Stata绘图:高级柱状图(二)-离散变量之间关系的可视化, 连享会 No.1469.
- 徐阳, 2021, Stata绘图:柱状图专题-T212, 连享会 No.625.
- 范思妤, 2022, Stata绘图:环形柱状图-实时全球新冠确诊人数, 连享会 No.1000.
- 谢作翰, 2019, 普林斯顿Stata教程(二) - Stata绘图, 连享会 No.78.
- 连享会, 2020, Stata绘图:多维柱状图绘制, 连享会 No.273.
- 马洪栋, 2024, Stata绘图:高级柱状图(一)-均值和置信区间-cibar-coefpl, 连享会 No.1379.
- 马洪栋, 2024, Stata绘图:高级柱状图(三)-堆叠柱状图, 连享会 No.1413.
- 云锋, 2020, Stata:一文搞定论文表1——基本统计量列表, 连享会 No.22.
- 朱磊, 2020, Stata:今天你table了吗?二维列表和三维列表范例, 连享会 No.376.
- 李青塬, 2022, Stata+Python:爬取创历史新高股票列表, 连享会 No.894.
- 袁明生, 2024, xcontract-freqtop:列表呈现多个类别变量组合的频数和占比, 连享会 No.1384.
- 连享会, 2020, 连享会-知乎推文列表, 连享会 No.130.
- 连享会, 2022, 连享会主页-推文列表-按时间, 连享会 No.449.
- 连享会, 2022, 连享会主页-推文列表-按类别, 连享会 No.448.
- 陈佳慧, 2023, Stata绘图:mrtab-mrgraph-多元响应变量列表呈现和可视化, 连享会 No.1285.
- 陈佳慧, 2024, dgraph-组间均值差异t检验:列表及可视化, 连享会 No.1367.