
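How do I configure a scrollbar in a combination chart?

A combination chart like this one usually has several axes or several data regions. How should I configure the scrollbar so that it scrolls only a specific region? "Image" (https://wmprod.oss-cn-shanghai.aliyuncs.com/images/20241221/c9cc6770cdc1f879be685987bb631828.png)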

Views: 343

全能人才

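When listening for events at the chart level, can the event parameter tell me which type of element was clicked, similar to the DOM target property?

When using "VChart" (https://link.segmentfault.com/?enc=rXq2W2gKb85KOAi1CCkqEw%3D%3D.SXi8yhjGht7szRUTSVVAH5mvpYVfHSM5kKIHVxGk%2Fg4%3D), can I listen on the whole chart or canvas and then use a returned parameter, such as type, to tell whether the click landed on an axis, a legend, or an item?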

Views: 344

清晨我上码

pandas tutorial: importing data, looking up rows and columns, conditional selection

Importing data

    import pandas as pd

    df = pd.read_excel('team.xlsx')
    df

This is an excerpt from a table of students' quarterly scores. The columns are:

name: the student's name. This column has no duplicates; each student has one row (one record), 100 records in total.
team: the student's team or class. Values repeat.
Q1 to Q4: the score for each quarter. Values may repeat.

Checking the data type

    print(type(df))  # type of df
    <class 'pandas.core.frame.DataFrame'>

Viewing the data

    df.head()     # first 5 rows
    df.tail()     # last 5 rows
    df.sample(5)  # 5 random rows

Viewing information about the data

    df.shape       # number of rows and columns
    (100, 6)
    df.describe()  # summary statistics for the numeric columns
    df.dtypes      # type of each column
    df.axes        # row and column labels
    df.columns     # column names

    df.info()  # index, dtypes and memory usage
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100 entries, 0 to 99
    Data columns (total 6 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   name    100 non-null    object
     1   team    100 non-null    object
     2   Q1      100 non-null    int64
     3   Q2      100 non-null    int64
     4   Q3      100 non-null    int64
     5   Q4      100 non-null    int64
    dtypes: int64(4), object(2)
    memory usage: 4.8+ KB

    df.describe()  # summary statistics for the numeric columns

    df.dtypes  # type of each column
    name    object
    team    object
    Q1       int64
    Q2       int64
    Q3       int64
    Q4       int64
    dtype: object

    df.columns  # column names
    Index(['name', 'team', 'Q1', 'Q2', 'Q3', 'Q4'], dtype='object')

Building an index

The first column, name, is really the natural row index. The code below makes it so:

    df.set_index('name', inplace=True)  # set the index in place
    df
          team  Q1  Q2  Q3  Q4
    name
    Liver    E  89  21  24  64
    Arry     C  36  37  37  57
    Ack      A  57  60  18  84
    Eorge    C  93  96  71  78
    Oah      D  65  49  61  86

Once the real row index is in place, the numeric index is gone.

Looking up data

Selecting a column

    df['team']
    name
    Liver       E
    Arry        C
    Ack         A
    Eorge       C
    Oah         D
               ..
    Gabriel     C
    Austin7     C
    Lincoln4    C
    Eli         E
    Ben         E

Selecting rows

By position:

    df[0:3]          # first three rows
    df[0:10:2]       # every second row among the first 10
    df.iloc[:10, :]  # first 10 rows

By the new index:

    df[df.index == 'Liver']  # a specific name
          team  Q1  Q2  Q3  Q4
    name
    Liver    E  89  21  24  64

Selecting rows and columns together

    df.loc['Arry', 'Q1':'Q3']           # Arry's scores for Q1 through Q3
    df.loc['Liver':'Eorge', 'Q1':'Q4']  # all four quarters, from Liver through Eorge

    Q1    36
    Q2    37
    Q3    37
    Name: Arry, dtype: object

           Q1  Q2  Q3  Q4
    name
    Liver  89  21  24  64
    Arry   36  37  37  57
    Ack    57  60  18  84
    Eorge  93  96  71  78

Conditional selection

Single condition

    df[df.Q1 > 94]  # rows where Q1 is greater than 94
             team  Q1  Q2  Q3  Q4
    name
    Max         E  97  75  41   3
    Elijah      B  97  89  15  46
    Aaron       A  96  75  55   8
    Lincoln4    C  98  93   1  20

    df[df.team == 'C']  # rows where team is 'C'
           team  Q1  Q2  Q3  Q4
    name
    Arry      C  36  37  37  57
    Eorge     C  93  96  71  78
    Harlie    C  24  13  87  43
    Archie    C  83  89  59  68

Compound queries

    df[(df['Q1'] > 90) & (df['team'] == 'C')]  # logical AND
    df[df['team'] == 'C'].loc[df.Q1 > 90]      # chained filtering
              team  Q1  Q2  Q3  Q4
    name
    Eorge        C  93  96  71  78
    Alexander    C  91  76  26  79
    Lincoln4     C  98  93   1  20
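The selection rules above are easy to try without the Excel file. The sketch below is only an illustration: it assumes a hypothetical five-row stand-in for team.xlsx, built from the rows shown in the excerpt, and contrasts label-based slicing (both endpoints included) with position-based slicing (stop excluded).

```python
import pandas as pd

# Hypothetical stand-in for team.xlsx: five rows copied from the excerpt.
df = pd.DataFrame(
    {
        "name": ["Liver", "Arry", "Ack", "Eorge", "Oah"],
        "team": ["E", "C", "A", "C", "D"],
        "Q1": [89, 36, 57, 93, 65],
        "Q2": [21, 37, 60, 96, 49],
        "Q3": [24, 37, 18, 71, 61],
        "Q4": [64, 57, 84, 78, 86],
    }
).set_index("name")

# Label-based slicing with .loc includes BOTH endpoints (Q1, Q2 and Q3).
arry_q1_q3 = df.loc["Arry", "Q1":"Q3"]

# Position-based slicing with .iloc excludes the stop position.
first_two = df.iloc[:2, :]

# Combine conditions with & and wrap each condition in parentheses.
strong_c = df[(df["Q1"] > 90) & (df["team"] == "C")]

print(arry_q1_q3.tolist())       # [36, 37, 37]
print(first_two.index.tolist())  # ['Liver', 'Arry']
print(strong_c.index.tolist())   # ['Eorge']
```

The parentheses around each comparison matter because & binds more tightly than > and == in Python.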

Views: 2071

清晨我上码

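pandas-profiling / ydata-profiling: introduction and usage tutorial

pandas-profiling

Around April 2023 or earlier, the pandas_profiling page (https://pypi.org/project/pandas-profiling/) posted this notice:

    Deprecated 'pandas-profiling' package, use 'ydata-profiling' instead

This means pandas-profiling should no longer be used; switch to ydata-profiling. There is no need to hunt for tutorials on pinning a pandas-profiling version: move straight to ydata-profiling, which does more than the original package.

ydata-profiling

The main goal of ydata-profiling is to provide a concise and fast exploratory data analysis (EDA) experience. Like the df.describe() function in pandas, ydata-profiling produces an extended analysis of a DataFrame, and it can export that analysis in different formats such as HTML and JSON. The package outputs a simple, easy-to-read analysis of the dataset, including time series and text data.

Installation

    pip install ydata-profiling

Usage

    import numpy as np
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.DataFrame(np.random.rand(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
    profile = ProfileReport(df, title="Profiling Report")

Output

Key features:

Type inference: automatically detects column data types (categorical, numeric, date, and so on).
Warnings: a summary of the problems or challenges in the data that may need attention (missing data, inaccuracies, skewness, and so on).
Univariate analysis: descriptive statistics (mean, median, mode, and so on) and visualizations such as distribution histograms.
Multivariate analysis: correlations, a detailed analysis of missing data and duplicate rows, and visual support for pairwise interactions between variables.
Time series: time-related statistics such as autocorrelation and seasonality, plus ACF and PACF plots.
Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic).
File and image analysis: file size, creation date, and indicators for truncated images and for the presence of EXIF metadata.
Compare datasets: a one-line command that quickly generates a complete comparison report for two datasets.
Flexible output formats: every analysis can be exported as an HTML report that is easy to share, as JSON for integration into automated systems, or as a widget in a Jupyter Notebook.

The report also contains three additional sections:

Overview: global details about the dataset (number of records, number of variables, overall missing values and duplicates, memory footprint).
Alerts: a comprehensive, automatically generated list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, and so on).
Reproduction: technical details of the analysis (time, version and configuration).

ydata-profiling in practice

Analyzing the iris dataset

    from sklearn.datasets import load_iris
    import pandas as pd
    import ydata_profiling as yp

    iris = load_iris()
    iris

    # Build the DataFrame, dropping the ' (cm)' unit suffix from each feature name
    df = pd.DataFrame(data=iris.data,
                      columns=[name.replace(' (cm)', '') for name in iris.feature_names])
    df['species'] = iris.target
    df

    profile = yp.ProfileReport(df.iloc[:, :4], title="Profiling Report")
    # Use the report as widgets
    profile.to_widgets()
    # Generate an embedded HTML report
    profile.to_notebook_iframe()

ydata_profiling can embed the HTML report in a Jupyter notebook, and to_file can be used to generate a file in HTML or JSON format.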
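One detail in the iris example deserves a second look: cleaning a unit suffix such as ' (cm)' off the column names. The sketch below uses only plain Python and shows why str.strip is a fragile tool for this: its argument is a set of characters, not a suffix. The name "count (cm)" is a made-up example added to expose the problem; the other two follow the iris feature-name pattern.

```python
raw_names = ["sepal length (cm)", "petal width (cm)", "count (cm)"]

# str.strip(" (cm)") removes ANY of the characters ' ', '(', 'c', 'm', ')'
# from both ends, so it can also eat leading letters.
stripped = [name.strip(" (cm)") for name in raw_names]

# Removing the literal suffix leaves the rest of the name untouched.
suffix = " (cm)"
cleaned = [
    name[: -len(suffix)] if name.endswith(suffix) else name
    for name in raw_names
]

print(stripped)  # ['sepal length', 'petal width', 'ount']
print(cleaned)   # ['sepal length', 'petal width', 'count']
```

For the four iris feature names the two approaches happen to agree, which is why the problem is easy to miss.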

Views: 2067

清晨我上码

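NumPy tutorial: Example Random Walks

This example shows how array operations can be put to work in a practical task. Start with the simplest random walk: it begins at 0, and each step is +1 or -1 with equal probability. Here is a pure-Python implementation with 1000 steps:

    import random

    position = 0
    walk = [position]
    steps = 1000
    for i in range(steps):
        step = 1 if random.randint(0, 1) else -1
        position += step
        walk.append(position)

    import matplotlib.pyplot as plt
    %matplotlib inline

    walk[:5]
    [0, -1, -2, -3, -2]

    plt.plot(walk[:100])
    [<matplotlib.lines.Line2D at 0x1062588d0>]

A random walk is simply a cumulative sum of the steps, and np.random does the job faster:

    import numpy as np

    nsteps = 1000
    draws = np.random.randint(0, 2, size=nsteps)
    steps = np.where(draws > 0, 1, -1)
    walk = steps.cumsum()

From this we can read off statistics directly, such as the minimum and maximum:

    walk.min()
    -57
    walk.max()
    7

A more complicated statistic is the first crossing time, the step at which the walk reaches a particular value. Suppose we want to know how long it took the walk to get at least 10 steps away from the origin, in either the positive or the negative direction. np.abs(walk) >= 10 gives a boolean array that tells us where the walk has reached or exceeded 10, but we want the index of the first 10 or -10. We can compute that with argmax, which returns the first index of the maximum value in the boolean array (True is the maximum):

    (np.abs(walk) >= 10).argmax()
    71

Note that argmax is not always efficient here, because it always scans the whole array. In this particular case, once a True is seen we already know it is the maximum.

Simulating Many Random Walks at Once

Suppose we want to simulate 5000 random walks at once. Given a 2-tuple as the size, np.random produces a two-dimensional array, and a cumulative sum along each row gives all 5000 walks in one shot:

    nwalks = 5000
    nsteps = 1000
    draws = np.random.randint(0, 2, size=(nwalks, nsteps))  # 0 or 1
    steps = np.where(draws > 0, 1, -1)
    walks = steps.cumsum(1)

    walks
    array([[ -1,  -2,  -3, ..., -24, -25, -26],
           [ -1,  -2,  -1, ..., -10,  -9,  -8],
           [  1,   0,   1, ...,  -4,  -3,  -4],
           ...,
           [  1,   0,   1, ...,  52,  51,  52],
           [ -1,   0,   1, ..., -26, -25, -26],
           [ -1,   0,  -1, ..., -30, -29, -30]])

The maximum and minimum over all walks:

    walks.max()
    115
    walks.min()
    -129

Among these walks, we want to find the ones that get at least 30 steps away from the origin. Use the any method:

    hits30 = (np.abs(walks) >= 30).any(1)
    hits30
    array([ True, False, False, ...,  True,  True,  True], dtype=bool)
    hits30.sum()
    3423

The steps above are a coin flip between two values. We could also draw them from a probability distribution:

    steps = np.random.normal(loc=0, scale=0.25, size=(nwalks, nsteps))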
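The argmax trick extends naturally to many walks at once. The following is a minimal sketch, assuming NumPy's Generator API (np.random.default_rng) with an arbitrarily chosen seed rather than the legacy np.random functions used in the excerpt; it computes the first crossing time for every walk that reaches a distance of 30.

```python
import numpy as np

rng = np.random.default_rng(12345)  # seed chosen arbitrarily for repeatability
nwalks, nsteps = 5000, 1000

draws = rng.integers(0, 2, size=(nwalks, nsteps))  # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(axis=1)

# Rows that ever reach a distance of 30 from the origin.
hits30 = (np.abs(walks) >= 30).any(axis=1)

# argmax over a boolean row returns the index of the first True.
# Only rows in hits30 are used: for a row with no True at all,
# argmax would return 0, which is not a crossing time.
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(axis=1)

print(hits30.sum(), crossing_times.mean())
```

Filtering with hits30 first is the important design choice: it keeps the all-False rows from silently contributing a bogus crossing time of 0.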

Views: 2018

清晨我上码

pandas tutorial: the pivot_table function

Importing data

This uses the sample data from the official pandas documentation, available at http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html

    import pandas as pd

    df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
                       "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
                       "C": ["small", "large", "large", "small", "small", "large", "small", "small", "large"],
                       "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                       "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
    df

Parameters

    pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
                       fill_value=None, margins=False, dropna=True, margins_name='All',
                       observed=False, sort=True)

Main parameters:

data: the DataFrame to operate on.
values: the column or columns to aggregate; optional.
index: keys to group by on the rows; they become the row index of the result.
columns: keys to group by on the columns; they become the column index of the result.
aggfunc: an aggregation function or a list of functions; the default is the mean. Note that when aggfunc is a list of functions, the function names appear in the resulting DataFrame.
fill_value: default None; the value used to replace missing values.
dropna: default True, which drops columns whose entries are all NaN; False keeps them.
margins: default False; set to True to add row and column totals.
margins_name: default 'All'; when margins=True, the name of the totals row and column.

Common operations

Aggregating by index

pivot_table needs at least one grouping key (index and/or columns), because the aggregation is computed per group. Here the key is index:

    import numpy as np

    # values is given explicitly so that only the numeric columns are summed
    pd.pivot_table(df, values=['D', 'E'], index='A', aggfunc=[np.sum])
        sum
          D   E
    A
    bar  22  32
    foo  11  22

Use values to choose which columns are aggregated:

    import numpy as np

    pd.pivot_table(df, values='D', index='A', aggfunc=[np.count_nonzero])
        count_nonzero
                    D
    A
    bar             4
    foo             5

When only index is given, groupby achieves the same result:

    df.groupby(['A'])['D'].count().reset_index()
         A  D
    0  bar  4
    1  foo  5

Adding the columns parameter to group along the columns:

    pd.pivot_table(df.head(10), values='D', index='A', columns='C', aggfunc=np.count_nonzero)
    C    large  small
    A
    bar      2      2
    foo      2      3

If a result like this contains empty cells, use fill_value to fill them all with 0:

    pd.pivot_table(df.head(10), values='D', index='A', columns='B', fill_value=0, aggfunc=np.count_nonzero)

Pay attention to the aggfunc argument: when it is a list, the function names are shown in the resulting DataFrame.

Adding totals

To add totals, pass margins=True, and give the totals a name if needed.

    pd.pivot_table(df.head(10), values='D', index='A', columns='B', fill_value=0,
                   margins=True, aggfunc=np.count_nonzero)
    B    one  two  All
    A
    bar    2    2    4
    foo    3    2    5
    All    5    4    9

    pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], fill_value=0,
                   margins=True, aggfunc=np.count_nonzero)
    C        large  small  All
    A   B
    bar one      1      1    2
        two      1      1    2
    foo one      2      1    3
        two      0      2    2
    All          4      5    9

Naming the totals

    pd.pivot_table(df.head(10), values='D', index=['A', 'B'], columns=['C'], fill_value=0,
                   margins=True, margins_name='合计', aggfunc=[np.count_nonzero])
            count_nonzero
    C               large  small  合计
    A   B
    bar one             1      1    2
        two             1      1    2
    foo one             2      1    3
        two             0      2    2
    合计                 4      5    9

As with groupby, several aggregation functions can be given at once:

    pd.pivot_table(df, values='D', index=['A'], columns=['C'], fill_value=0,
                   margins=True, aggfunc=[max, sum, np.count_nonzero])
          max              sum            count_nonzero
    C   large small All  large small All         large small All
    A
    bar     7     6   7     11    11  22             2     2   4
    foo     2     3   3      4     7  11             2     3   5
    All     7     6   7     15    18  33             4     5   9
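The relationship between pivot_table and groupby can be made concrete. The sketch below is one way to show it, using three columns of the same sample data: it builds a pivot table with both index and columns, then reproduces the numbers with groupby followed by unstack. The variable names pivoted and grouped are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

# pivot_table: A on the rows, C on the columns, sum of D in the cells.
pivoted = pd.pivot_table(df, values="D", index="A", columns="C",
                         aggfunc="sum", fill_value=0)

# The same numbers from groupby: aggregate per (A, C) pair,
# then move the C level from the row index into the columns.
grouped = df.groupby(["A", "C"])["D"].sum().unstack("C", fill_value=0)

print(pivoted)
print((pivoted.to_numpy() == grouped.to_numpy()).all())
```

Seen this way, index and columns are both just grouping keys; the only difference is where each key ends up in the result.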

Views: 2050

清晨我上码

pandas tutorial: handling time series data with resample

Time series (TimeSeries)

    import numpy as np
    import pandas as pd

    # Create a time series
    # (pandas 2.2 and later prefer the aliases 's', 'min' and 'h' over 'S', 'T' and 'H')
    rng = pd.date_range('1/1/2012', periods=300, freq='S')
    ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
    ts
    2012-01-01 00:00:00     44
    2012-01-01 00:00:01     54
    2012-01-01 00:00:02    132
    2012-01-01 00:00:03     70
    2012-01-01 00:00:04    476
                          ...
    2012-01-01 00:04:55    178
    2012-01-01 00:04:56     83
    2012-01-01 00:04:57    184
    2012-01-01 00:04:58    223
    2012-01-01 00:04:59    179
    Freq: S, Length: 300, dtype: int32

The target frequency is set by the frequency string passed to resample.

Resampling with resample

The sample outputs in this part show 2023 dates, so they evidently come from a run with a different start date than the 2012 series created above.

    ts.resample('1T').sum()     # resample to one minute, then sum
    ts.resample('1T').mean()    # resample to one minute, then take the mean
    ts.resample('1T').median()  # resample to one minute, then take the median
    2023-01-01 00:00:00    275.5
    2023-01-01 00:01:00    245.0
    2023-01-01 00:02:00    233.5
    2023-01-01 00:03:00    284.0
    2023-01-01 00:04:00    245.5
    Freq: T, dtype: float64

Multiple aggregations

Use agg to run several aggregations at once.

    ts.resample('1T').agg(['min', 'max', 'sum'])
                         min  max    sum
    2023-01-01 00:00:00    0  492  15536
    2023-01-01 00:01:00    3  489  15840
    2023-01-01 00:02:00    3  466  14282
    2023-01-01 00:03:00    2  498  15652
    2023-01-01 00:04:00    6  489  15119

Upsampling and filling values

Upsampling is the opposite of downsampling. It resamples the series to a finer time frame, for example from hours to minutes or from years to days. The result has more rows, and the added rows are NaN by default. The built-in methods ffill() and bfill() are commonly used to forward fill or backward fill those NaN values.

    rng = pd.date_range('1/1/2023', periods=200, freq='H')
    ts = pd.Series(np.random.randint(0, 200, len(rng)), index=rng)
    ts
    2023-01-01 00:00:00     16
    2023-01-01 01:00:00     19
    2023-01-01 02:00:00    170
    2023-01-01 03:00:00     66
    2023-01-01 04:00:00     33
                          ...
    2023-01-09 03:00:00     31
    2023-01-09 04:00:00     61
    2023-01-09 05:00:00     28
    2023-01-09 06:00:00     67
    2023-01-09 07:00:00    137
    Freq: H, Length: 200, dtype: int32

    # Upsample from hourly to 30-minute intervals
    ts.resample('30T').asfreq()
    2023-01-01 00:00:00     16.0
    2023-01-01 00:30:00      NaN
    2023-01-01 01:00:00     19.0
    2023-01-01 01:30:00      NaN
    2023-01-01 02:00:00    170.0
                           ...
    2023-01-09 05:00:00     28.0
    2023-01-09 05:30:00      NaN
    2023-01-09 06:00:00     67.0
    2023-01-09 06:30:00      NaN
    2023-01-09 07:00:00    137.0
    Freq: 30T, Length: 399, dtype: float64

Passing a custom function with apply

    import numpy as np

    def res(series):
        return np.prod(series)

    ts.resample('30T').apply(res)
    2023-01-01 00:00:00     16
    2023-01-01 00:30:00      1
    2023-01-01 01:00:00     19
    2023-01-01 01:30:00      1
    2023-01-01 02:00:00    170
                          ...
    2023-01-09 05:00:00     28
    2023-01-09 05:30:00      1
    2023-01-09 06:00:00     67
    2023-01-09 06:30:00      1
    2023-01-09 07:00:00    137
    Freq: 30T, Length: 399, dtype: int32

The empty half-hour bins come out as 1 because the product of an empty series is 1.

DataFrame objects

For a DataFrame, the on keyword names a column to resample on instead of the index.

    df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])
    df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')
    df.resample('3T', on='time').sum()
                         a  b  c  d
    time
    2000-01-01 00:00:00  0  3  6  9
    2000-01-01 00:03:00  0  3  6  9
    2000-01-01 00:06:00  0  3  6  9
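The upsampling section mentions ffill() and bfill() without showing them. Here is a minimal sketch on a deterministic three-point series (the values are taken from the first rows of the sample output, the variable names are illustrative) that shows what each fill does to the gaps created by upsampling.

```python
import pandas as pd

# Three hourly observations.
rng = pd.date_range("2023-01-01", periods=3, freq="h")
ts = pd.Series([16, 19, 170], index=rng)

# Upsample to 30-minute bins: the new half-hour slots have no data yet.
upsampled = ts.resample("30min").asfreq()

# Forward fill copies the last known value into each gap;
# backward fill takes the next known value instead.
filled_forward = ts.resample("30min").ffill()
filled_backward = ts.resample("30min").bfill()

print(int(upsampled.isna().sum()))  # 2
print(filled_forward.tolist())      # [16, 16, 19, 19, 170]
print(filled_backward.tolist())     # [16, 19, 19, 170, 170]
```

Which fill is appropriate depends on the data: forward fill never uses information from the future, which matters for anything that will feed a forecast.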

Views: 2043

清晨我上码

pandas tutorial: USDA Food Database

14.4 USDA Food Database

This dataset describes the nutritional content of foods. It is stored as JSON and looks like this:

    {
      "id": 21441,
      "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY, Wing, meat and skin with breading",
      "tags": ["KFC"],
      "manufacturer": "Kentucky Fried Chicken",
      "group": "Fast Foods",
      "portions": [
        {
          "amount": 1,
          "unit": "wing, with skin",
          "grams": 68.0
        }
        ...
      ],
      "nutrients": [
        {
          "value": 20.8,
          "units": "g",
          "description": "Protein",
          "group": "Composition"
        },
        ...
      ]
    }

Each food has a set of attributes, including two lists: portions and nutrients. The data has to be reshaped before it is convenient to analyze. Here we use Python's built-in json module:

    import pandas as pd
    import numpy as np
    import json

    pd.options.display.max_rows = 10

    db = json.load(open('../datasets/usda_food/database.json'))
    len(db)
    6636

    db[0].keys()
    dict_keys(['manufacturer', 'description', 'group', 'id', 'tags', 'nutrients', 'portions'])

    db[0]['nutrients'][0]
    {'description': 'Protein', 'group': 'Composition', 'units': 'g', 'value': 25.18}

    nutrients = pd.DataFrame(db[0]['nutrients'])
    nutrients
    162 rows × 4 columns

When converting a list of dicts to a DataFrame, we can pass a list of the fields to extract. Here we take the food name, group, ID and manufacturer:

    info_keys = ['description', 'group', 'id', 'manufacturer']
    info = pd.DataFrame(db, columns=info_keys)
    info[:5]

    info.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 6636 entries, 0 to 6635
    Data columns (total 4 columns):
    description     6636 non-null object
    group           6636 non-null object
    id              6636 non-null int64
    manufacturer    5195 non-null object
    dtypes: int64(1), object(3)
    memory usage: 207.5+ KB

value_counts shows the distribution of food groups:

    pd.value_counts(info.group)[:10]
    Vegetables and Vegetable Products    812
    Beef Products                        618
    Baked Products                       496
    Breakfast Cereals                    403
    Legumes and Legume Products          365
    Fast Foods                           365
    Lamb, Veal, and Game Products        345
    Sweets                               341
    Pork Products                        328
    Fruits and Fruit Juices              328
    Name: group, dtype: int64

Next we analyze all of the nutrient data by assembling the nutrients of every food into one large table. First convert each food's nutrient list to a DataFrame, then add a column holding the food id, and finally join the pieces together with concat:

    # Start with an empty DataFrame to hold the final result
    # This part takes a while to run, please be patient
    nutrients_all = pd.DataFrame()
    for food in db:
        nutrients = pd.DataFrame(food['nutrients'])
        nutrients['id'] = food['id']
        nutrients_all = nutrients_all.append(nutrients, ignore_index=True)

Translator's note: although the author says in the book to join the pieces with concat, in my own tests the concat approach was very slow, taking almost twice as long as append, so the code above uses append. (DataFrame.append was removed in pandas 2.0; on current versions, collect the per-food frames in a list and call pd.concat once.)

If everything works, the result looks like this:

    nutrients_all
    389355 rows × 5 columns

This DataFrame contains some duplicates. Count the duplicated rows:

    nutrients_all.duplicated().sum()  # number of duplicates
    14179

Drop the duplicates:

    nutrients_all = nutrients_all.drop_duplicates()
    nutrients_all
    375176 rows × 5 columns

Both DataFrames have columns called group and description, so we rename them to tell them apart:

    col_mapping = {'description': 'food', 'group': 'fgroup'}
    info = info.rename(columns=col_mapping, copy=False)
    info.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 6636 entries, 0 to 6635
    Data columns (total 4 columns):
    food            6636 non-null object
    fgroup          6636 non-null object
    id              6636 non-null int64
    manufacturer    5195 non-null object
    dtypes: int64(1), object(3)
    memory usage: 207.5+ KB

    col_mapping = {'description': 'nutrient', 'group': 'nutgroup'}
    nutrients_all = nutrients_all.rename(columns=col_mapping, copy=False)
    nutrients_all
    375176 rows × 5 columns

With all of that done, we can merge info and nutrients_all:

    ndata = pd.merge(nutrients_all, info, on='id', how='outer')
    ndata.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 375176 entries, 0 to 375175
    Data columns (total 8 columns):
    nutrient        375176 non-null object
    nutgroup        375176 non-null object
    units           375176 non-null object
    value           375176 non-null float64
    id              375176 non-null int64
    food            375176 non-null object
    fgroup          375176 non-null object
    manufacturer    293054 non-null object
    dtypes: float64(1), int64(1), object(6)
    memory usage: 25.8+ MB

    ndata.iloc[30000]
    nutrient                                       Glycine
    nutgroup                                   Amino Acids
    units                                                g
    value                                             0.04
    id                                                6158
    food            Soup, tomato bisque, canned, condensed
    fgroup                      Soups, Sauces, and Gravies
    manufacturer
    Name: 30000, dtype: object

We can group by nutrient type and food group, then plot the median values:

    result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)

    %matplotlib inline
    result['Zinc, Zn'].sort_values().plot(kind='barh', figsize=(10, 8))

We can also find which food is richest in each nutrient:

    by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])

    get_maximum = lambda x: x.loc[x.value.idxmax()]
    get_minimum = lambda x: x.loc[x.value.idxmin()]

    max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]

    # make the food a little smaller
    max_foods.food = max_foods.food.str[:50]

The resulting DataFrame is too large to show in full, so here is only the 'Amino Acids' nutrient group:

    max_foods.loc['Amino Acids']['food']
    nutrient
    Alanine                          Gelatins, dry powder, unsweetened
    Arginine                              Seeds, sesame flour, low-fat
    Aspartic acid                                  Soy protein isolate
    Cystine               Seeds, cottonseed flour, low fat (glandless)
    Glutamic acid                                  Soy protein isolate
                                           ...
    Serine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
    Threonine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
    Tryptophan        Sea lion, Steller, meat with fat (Alaska Native)
    Tyrosine         Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
    Valine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
    Name: food, Length: 19, dtype: object
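Because DataFrame.append was removed in pandas 2.0, the nutrient table has to be assembled differently on current versions. The sketch below shows the usual pattern, collecting the pieces in a list and concatenating once at the end. It assumes a tiny in-memory stand-in for the parsed JSON (two foods with made-up ids and a few nutrient rows), not the real database file.

```python
import pandas as pd

# Tiny stand-in for the parsed JSON database (ids and values are illustrative).
db = [
    {"id": 1, "nutrients": [
        {"value": 25.18, "units": "g", "description": "Protein",
         "group": "Composition"},
        {"value": 29.2, "units": "g", "description": "Total lipid (fat)",
         "group": "Composition"},
    ]},
    {"id": 2, "nutrients": [
        {"value": 20.8, "units": "g", "description": "Protein",
         "group": "Composition"},
    ]},
]

pieces = []
for food in db:
    frame = pd.DataFrame(food["nutrients"])
    frame["id"] = food["id"]  # tag every nutrient row with its food id
    pieces.append(frame)

# One concat at the end instead of growing a DataFrame inside the loop.
nutrients_all = pd.concat(pieces, ignore_index=True)

print(nutrients_all.shape)  # (3, 5)
```

Appending to a Python list is cheap, whereas growing a DataFrame row block by row block copies the accumulated data on every iteration, so the single concat tends to scale much better on thousands of foods.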

Views: 2052

清晨我上码

Pivot Tables and Cross-Tabulation

10.4 Pivot Tables and Cross-Tabulation

A pivot table is a data summarization tool commonly found in spreadsheet programs (such as Excel) and in other data analysis software. It aggregates a table of data by one or more keys, arranging the data with some group keys along the rows and some along the columns. In pandas, pivot tables are built on the groupby facility described in this chapter, combined with reshape operations that use hierarchical indexing. DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. Besides being a convenient interface to groupby, pivot_table can add partial totals, also known as margins.

Returning to the tipping dataset used earlier, suppose we want a table of group means (the mean is the default aggregation type for pivot_table), grouped by day and smoker:

    import numpy as np
    import pandas as pd

    tips = pd.read_csv('../examples/tips.csv')

    # Add tip percentage of total bill
    tips['tip_pct'] = tips['tip'] / tips['total_bill']

    tips.head()

    tips.pivot_table(index=['day', 'smoker'])

This result could also be produced with groupby directly. Now suppose we want to aggregate only tip_pct and size, and additionally group by time. We put smoker in the columns and time and day in the rows:

    tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')

We can extend this table with partial totals by passing margins=True. This adds All row and column labels, whose values are the group statistics for all the data within a single tier:

    tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True)

Here, the values in the All columns are means computed without regard to smoker versus nonsmoker. The values in the All row are means computed without regard to either of the two levels of grouping on the rows.

To use a different aggregation function, pass it to aggfunc. For example, count or len gives a cross-tabulation of group sizes:

    tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day', aggfunc=len, margins=True)

If some combinations are empty (or NA), we may want to fill them with fill_value:

    tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'], columns='day', aggfunc='mean', fill_value=0)

A cross-tabulation (crosstab for short) is a special case of a pivot table that computes group frequencies only. Here is an example:

    data = pd.DataFrame({'Sample': np.arange(1, 11),
                         'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan',
                                         'Japan', 'USA', 'USA', 'Japan', 'USA'],
                         'Handedness': ['Right-handed', 'Left-handed', 'Right-handed',
                                        'Right-handed', 'Left-handed', 'Right-handed',
                                        'Right-handed', 'Left-handed', 'Right-handed',
                                        'Right-handed']})
    data

As part of a survey analysis, we want to summarize this data by nationality and handedness. pivot_table could do this, but the pandas.crosstab function is more convenient:

    pd.crosstab(data.Nationality, data.Handedness, margins=True)

The first two arguments to crosstab can each be an array, a Series, or a list of arrays. For the tips data:

    pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
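To make "group frequencies" concrete, the sketch below runs crosstab on the handedness data from the example and reproduces the same counts with groupby and size. It is a small illustration of the equivalence, not a claim about how crosstab is implemented; the names counts and by_group are chosen here.

```python
import pandas as pd

data = pd.DataFrame({
    "Nationality": ["USA", "Japan", "USA", "Japan", "Japan",
                    "Japan", "USA", "USA", "Japan", "USA"],
    "Handedness": ["Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Right-handed",
                   "Right-handed"],
})

# crosstab counts how often each (row, column) pair occurs;
# margins=True adds the 'All' row and column of totals.
counts = pd.crosstab(data["Nationality"], data["Handedness"], margins=True)

# The same frequencies via groupby: size of each group, reshaped with unstack.
by_group = (
    data.groupby(["Nationality", "Handedness"]).size().unstack(fill_value=0)
)

print(counts)
```

The groupby version has no totals; crosstab with margins=True is the shorter route when the row and column sums are wanted as well.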

Views: 2044

清晨我上码

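pandas tutorial: US Baby Names 1880–2010

14.3 US Baby Names 1880–2010

This dataset records the frequency of baby names from 1880 through 2010. It can be used for many kinds of analysis, for example:

Compute the proportion of babies given a particular name each year.
Compute the relative rank of a name.
Find the most popular names in each year, or the names whose popularity rose or fell the most.
Analyze trends in names: vowels, consonants, length, overall diversity, changes in spelling, first and last letters, and so on.
Analyze external sources of trends: biblical names, celebrities, demographic changes, and so on.

Later tutorials cover some of these. The name data can also be downloaded directly from the official site, Popular Baby Names. Downloading the National data gives you names.zip; unzipping it produces a series of files with names like yob1880.txt, which shows that the files are organized by year. Here the Unix head command displays the first 10 lines of one file:

    !head -n 10 ../datasets/babynames/yob1880.txt

Since this is a very standard comma-separated format (a CSV file), pandas.read_csv can load it into a DataFrame:

    import pandas as pd

    # Make display smaller
    pd.options.display.max_rows = 10

    names1880 = pd.read_csv('../datasets/babynames/yob1880.txt', names=['name', 'sex', 'births'])
    names1880
    2000 rows × 3 columns

These files contain only the names that occurred at least five times in a given year. For simplicity, we can use the sum of the births column grouped by sex as the total number of births in that year:

    names1880.groupby('sex').births.sum()
    sex
    F     90993
    M    110493
    Name: births, dtype: int64

Because the dataset is split into one file per year, the first task is to assemble all of the data into a single DataFrame and add a year field. pandas.concat does this:

    # 2010 is the last available year
    years = range(1880, 2011)

    pieces = []
    columns = ['name', 'sex', 'births']

    for year in years:
        path = '../datasets/babynames/yob%d.txt' % year
        frame = pd.read_csv(path, names=columns)
        frame['year'] = year
        pieces.append(frame)

    # Concatenate everything into a single DataFrame
    names = pd.concat(pieces, ignore_index=True)

Two things to note. First, concat glues the DataFrame objects together row-wise by default. Second, ignore_index=True is required because we do not want to keep the original row numbers returned by read_csv.

We now have one very large DataFrame containing all of the names data:

    names
    1690784 rows × 4 columns

With this data in hand, we can aggregate it at the year and sex level using groupby or pivot_table:

    total_births = names.pivot_table('births', index='year', columns='sex', aggfunc=sum)
    total_births.tail()

    import seaborn as sns
    %matplotlib inline

    total_births.plot(title='Total births by sex and year', figsize=(15, 8))

Next we insert a prop column holding the fraction of babies given each name relative to the total number of births. A prop value of 0.02 means that 2 out of every 100 babies were given that name. So we group the data by year and sex, then add the new column to each group:

    def add_prop(group):
        group['prop'] = group.births / group.births.sum()
        return group

    names = names.groupby(['year', 'sex']).apply(add_prop)

    names
    1690784 rows × 5 columns

When performing a group operation like this, it is usually worth doing a sanity check, such as verifying that the prop column sums to 1 within every group. Since this is floating-point data, np.allclose is the right tool for checking that each group sum is sufficiently close to (but perhaps not exactly equal to) 1. Here the group sums are simply displayed:

    names.groupby(['year', 'sex']).prop.sum()
    year  sex
    1880  F      1.0
          M      1.0
    1881  F      1.0
          M      1.0
    1882  F      1.0
                ...
    2008  M      1.0
    2009  F      1.0
          M      1.0
    2010  F      1.0
          M      1.0
    Name: prop, Length: 262, dtype: float64

That completes this step. To make further analysis easier, we extract a subset of the data: the top 1000 names for each sex/year combination. This is yet another group operation:

    def get_top1000(group):
        return group.sort_values(by='births', ascending=False)[:1000]

    grouped = names.groupby(['year', 'sex'])
    top1000 = grouped.apply(get_top1000)

    # Drop the group index, not needed
    top1000.reset_index(inplace=True, drop=True)

If you prefer a do-it-yourself approach, this works too:

    pieces = []
    for year, group in names.groupby(['year', 'sex']):
        pieces.append(group.sort_values(by='births', ascending=False)[:1000])
    top1000 = pd.concat(pieces, ignore_index=True)

    top1000
    261877 rows × 5 columns

The analysis that follows works on this top1000 dataset.

1 Analyzing Naming Trends

With the full dataset and the top1000 dataset just created, we can start analyzing naming trends. First split the top 1000 names into boy and girl portions:

    boys = top1000[top1000.sex == 'M']
    girls = top1000[top1000.sex == 'F']

Simple time series, such as the number of Johns or Marys born each year, can be plotted after a little reshaping. We first build a pivot table of the total number of births by year and name:

    total_births = top1000.pivot_table('births', index='year', columns='name', aggfunc=sum)
    total_births
    131 rows × 6868 columns

Next, use the DataFrame plot method:

    total_births.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 131 entries, 1880 to 2010
    Columns: 6868 entries, Aaden to Zuri
    dtypes: float64(6868)
    memory usage: 6.9 MB

    subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
    subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")
    array([<matplotlib.axes._subplots.AxesSubplot object at 0x1132a4828>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x116933080>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x117d24710>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x117d70b70>], dtype=object)

Measuring the increase in naming diversity

The decline shown in the plots above may mean that fewer parents are choosing common names for their children. This hypothesis can be checked against the data. One measure is the proportion of births represented by the 1000 most popular names, which we aggregate by year and sex and then plot:

    import numpy as np

    table = top1000.pivot_table('prop', index='year', columns='sex', aggfunc=sum)
    table.plot(title='Sum of table1000.prop by year and sex',
               yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10), figsize=(15, 8))

The plot shows that name diversity has indeed increased (the share of births in the top 1000 has fallen). Another measure is the number of distinct names, taken in order of popularity, that make up the top 50% of births. This number is a bit trickier to compute. Consider only the boy names from 2010:

    df = boys[boys.year == 2010]
    df
    1000 rows × 5 columns

After sorting prop in descending order, we want to know how many of the most popular names it takes to reach 50%. A for loop would work, but NumPy offers a smarter vectorized way. Take the cumulative sum, cumsum, of prop, then call searchsorted to find the position at which 0.5 would have to be inserted to keep the cumulative sum in sorted order:

    prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
    prop_cumsum[:10]
    260877    0.011523
    260878    0.020934
    260879    0.029959
    260880    0.038930
    260881    0.047817
    260882    0.056579
    260883    0.065155
    260884    0.073414
    260885    0.081528
    260886    0.089621
    Name: prop, dtype: float64

    prop_cumsum.searchsorted(0.5)
    array([116])

Since arrays are zero-indexed, we add 1 to this result, giving a final answer of 117. By contrast, the number for 1900 is much smaller:

    df = boys[boys.year == 1900]
    in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
    in1900[-10:]
    41853    0.979223
    41852    0.979277
    41851    0.979330
    41850    0.979383
    41849    0.979436
    41848    0.979489
    41847    0.979542
    41846    0.979595
    41845    0.979648
    41876    0.979702
    Name: prop, dtype: float64

    in1900.searchsorted(0.5) + 1
    array([25])

The same calculation can now be applied to every year/sex combination. Group by those two fields, then apply a function that returns the count for each group:

    def get_quantile_count(group, q=0.5):
        group = group.sort_values(by='prop', ascending=False)
        return group.prop.cumsum().searchsorted(q) + 1

    diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
    diversity = diversity.unstack('sex')

The resulting diversity DataFrame holds two time series, one per sex, indexed by year. It can be inspected in IPython and plotted:

    diversity.head()

The values in the table above are lists. Without diversity = diversity.astype(float), plotting fails with a "no numeric data to plot" error. Adding that line to convert the data type makes the plot work:

    diversity = diversity.astype('float')
    diversity
    131 rows × 2 columns

    diversity.plot(title='Number of popular names in top 50%', figsize=(15, 8))

The plot shows that girl names have always been more diverse than boy names, and that they have only become more so over time. What exactly is driving the diversity (for example, the rise of alternative spellings) is left for you to analyze.

The "last letter" revolution

A researcher has pointed out that the distribution of boy names by final letter has changed significantly over the last 100 years. To see this, we first aggregate all of the births in the full dataset by year, sex and final letter:

    # Extract the last letter from the name column
    get_last_letter = lambda x: x[-1]
    last_letters = names.name.map(get_last_letter)
    last_letters.name = 'last_letter'

    table = names.pivot_table('births', index=last_letters, columns=['sex', 'year'], aggfunc=sum)

    print(type(last_letters))
    print(last_letters[:5])
    <class 'pandas.core.series.Series'>
    0    y
    1    a
    2    a
    3    h
    4    e
    Name: last_letter, dtype: object

Then we select three representative years and print the first few rows:

    subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
    subtable.head()

Next we normalize the table by total births, so that we get the proportion of total births for each sex ending in each letter:

    subtable.sum()
    sex  year
    F    1910     396416.0
         1960    2022062.0
         2010    1759010.0
    M    1910     194198.0
         1960    2132588.0
         2010    1898382.0
    dtype: float64

    letter_prop = subtable / subtable.sum()
    letter_prop
    26 rows × 6 columns

With the letter proportions in hand, we can make bar plots for each sex, broken down by year:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 1, figsize=(10, 8))
    letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
    letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)

The plot shows that boy names ending in 'n' have grown significantly since the 1960s. Going back to the full table created earlier, we normalize by year and sex again, select a few letters for the boy names, and transpose so that each column becomes a time series:

    letter_prop = table / table.sum()
    letter_prop.head()
    5 rows × 262 columns

    dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T
    dny_ts.head()

With this DataFrame of time series, the plot method produces a trend plot:

    dny_ts.plot(figsize=(10, 8))

Boy names that became girl names (and vice versa)

Another interesting trend is that names once popular for boys have "changed sex" in recent years, for example Lesley or Leslie. Going back to the top1000 dataset, we find the set of names beginning with "lesl":

    all_names = pd.Series(top1000.name.unique())
    lesley_like = all_names[all_names.str.lower().str.contains('lesl')]
    lesley_like
    632     Leslie
    2294    Lesley
    4262    Leslee
    4728     Lesli
    6103     Lesly
    dtype: object

We then use this result to filter down to just those names, and sum births grouped by name to see the relative frequencies:

    filtered = top1000[top1000.name.isin(lesley_like)]
    filtered.groupby('name').births.sum()
    name
    Leslee      1082
    Lesley     35022
    Lesli        929
    Leslie    370429
    Lesly      10067
    Name: births, dtype: int64

Next, we aggregate by sex and year and normalize within each year:

    table = filtered.pivot_table('births', index='year', columns='sex', aggfunc='sum')
    table = table.div(table.sum(1), axis=0)
    table
    131 rows × 2 columns

Finally, it is easy to plot the breakdown by sex over time:

    table.plot(style={'M': 'k-', 'F': 'k--'}, figsize=(10, 8))
    <matplotlib.axes._subplots.AxesSubplot at 0x11f0640b8>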
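Two steps in this walkthrough, computing prop per group and counting the names needed to reach 50%, can be tried on a handful of rows. The sketch below assumes a made-up miniature of the names table (the birth counts are invented). It swaps in groupby(...).transform('sum') for the apply-based add_prop, performs the np.allclose sanity check the text recommends, and calls searchsorted on a plain NumPy array so that the count comes back as an integer rather than an array.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the names table; the counts are illustrative only.
names = pd.DataFrame({
    "name":   ["Mary", "Anna", "Emma", "John", "James",
               "Mary", "Anna", "John", "James"],
    "sex":    ["F", "F", "F", "M", "M", "F", "F", "M", "M"],
    "births": [40, 35, 25, 70, 30, 50, 50, 80, 20],
    "year":   [1880, 1880, 1880, 1880, 1880, 1881, 1881, 1881, 1881],
})

# transform("sum") broadcasts each group's total back onto its own rows,
# so prop can be computed without a Python-level apply.
group_totals = names.groupby(["year", "sex"])["births"].transform("sum")
names["prop"] = names["births"] / group_totals

# Sanity check: every (year, sex) group should sum to 1 within float tolerance.
prop_sums = names.groupby(["year", "sex"])["prop"].sum()
assert np.allclose(prop_sums, 1)


def get_quantile_count(group, q=0.5):
    """Number of top names needed for the cumulative share to reach q."""
    cumulative = group["prop"].sort_values(ascending=False).cumsum()
    # Searching the underlying array yields a scalar position.
    return int(cumulative.to_numpy().searchsorted(q)) + 1


girls_1880 = names[(names["year"] == 1880) & (names["sex"] == "F")]
print(get_quantile_count(girls_1880))  # 2
```

For the 1880 girls the cumulative shares are 0.40, 0.75 and 1.00, so the 50% mark falls inside the second name, hence the result 2. Returning a plain int also avoids the list-valued cells that made the astype(float) conversion necessary before plotting.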

Views: 2043