Pivot Tables and Cross-Tabulation 数据透视表和交叉表-灵析社区

10.4 Pivot Tables and Cross-Tabulation（数据透视表和交叉表）

Pivot Tables（数据透视表）是一种常见的数据汇总工具，常见与各种spreadsheet programs（电子表格程序，比如Excel）和一些数据分析软件。它能按一个或多个keys来把数据聚合为表格，能沿着行或列，根据组键来整理数据。

数据透视表可以用pandas的groupby来制作，这个本节会进行介绍，除此之外还会有介绍如何利用多层级索引来进行reshape（更改形状）操作。DataFrame有一个pivot_table方法，另外还有一个pandas.pivot_table函数。为了有一个更方便的groupby接口，pivot_table能添加partial totals（部分合计）,也被称作margins(边界)。

回到之前提到的tipping数据集，假设我们想要计算一个含有组平均值的表格(a table of group means，这个平均值也是pivot_table默认的聚合类型)，按day和smoker来分组：

import numpy as np
import pandas as pd

tips = pd.read_csv('../examples/tips.csv')
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.head()

tips.pivot_table(index=['day', 'smoker'])

这个结果也可以通过groupby直接得到。

现在假设我们想要按time分组，然后对tip_pct和size进行聚合。我们会把smoker放在列上，而day用于行：

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker')

我们也快成把这个表格加强一下，通过设置margins=True来添加部分合计（partial total）。这么做的话有一个效果，会给行和列各添加All标签，这个All表示的是当前组对于整个数据的统计值：

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker', margins=True)

这里，对于All列，这一列的值是不考虑吸烟周和非吸烟者的平均值（smoker versus nonsmoker）。对于All行，这一行的值是不考虑任何组中任意两个组的平均值（any of the two levels of grouping）。

想要使用不同的聚合函数，传递给aggfunc即可。例如，count或len可以给我们一个关于组大小（group size）的交叉表格：

tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day',
                 aggfunc=len, margins=True)

如果一些组合是空的（或NA），我们希望直接用fill_value来填充：

tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],
                 columns='day', aggfunc='mean', fill_value=0)

cross-tabulation（交叉表，简写为crosstab），是数据透视表的一个特殊形式，只计算组频率（group frequencies）。这里有个例子：

data = pd.DataFrame({'Sample': np.arange(1, 11),
        'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan', 'Japan', 'USA', 'USA', 'Japan', 'USA'],
        'Handedness': ['Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed']})
data

作为调查分析（survey analysis）的一部分，我们想要按国家和惯用手来进行汇总。我们可以使用pivot_table来做到这点，不过pandas.crosstab函数会更方便一些：

pd.crosstab(data.Nationality, data.Handedness, margins=True)

crosstab的前两个参数可以是数组或Series或由数组组成的列表（a list of array）。对于tips数据，可以这么写：

pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)