很多网站都有公开的API,通过JSON等格式提供数据流。有很多方法可以访问这些API,这里推荐一个易用的requests包。
找到github里pandas最新的30个issues,制作一个GET HTTP request, 通过使用requests包:
import pandas as pd
import numpy as npimport requestsurl = 'https://api.github.com/repos/pandas-dev/pandas/issues'resp = requests.get(url)resp<Response [200]>response的json方法能返回一个dict,包含可以解析为python object的JSON:
data = resp.json()
data[0]['title']'Optimize data type'data[0]{'assignee': None,
'assignees': [],
'author_association': 'NONE',
'body': 'Hi guys, i\'m user of mysql\r\nwe have an "function" PROCEDURE ANALYSE\r\nhttps://dev.mysql.com/doc/refman/5.5/en/procedure-analyse.html\r\n\r\nit get all "dataframe" and show what\'s the best "dtype", could we do something like it in Pandas?\r\n\r\nthanks!',
'closed_at': None,
'comments': 1,
'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/comments',
'created_at': '2017-11-13T22:51:32Z',
'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/events',
'html_url': 'https://github.com/pandas-dev/pandas/issues/18272',
'id': 273606786,
'labels': [],
'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/labels{/name}',
'locked': False,
'milestone': None,
'number': 18272,
'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
'state': 'open',
'title': 'Optimize data type',
'updated_at': '2017-11-13T22:57:27Z',
'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272',
'user': {'avatar_url': 'https://avatars0.githubusercontent.com/u/2468782?v=4',
'events_url': 'https://api.github.com/users/rspadim/events{/privacy}',
'followers_url': 'https://api.github.com/users/rspadim/followers',
'following_url': 'https://api.github.com/users/rspadim/following{/other_user}',
'gists_url': 'https://api.github.com/users/rspadim/gists{/gist_id}',
'gravatar_id': '',
'html_url': 'https://github.com/rspadim',
'id': 2468782,
'login': 'rspadim',
'organizations_url': 'https://api.github.com/users/rspadim/orgs',
'received_events_url': 'https://api.github.com/users/rspadim/received_events',
'repos_url': 'https://api.github.com/users/rspadim/repos',
'site_admin': False,
'starred_url': 'https://api.github.com/users/rspadim/starred{/owner}{/repo}',
'subscriptions_url': 'https://api.github.com/users/rspadim/subscriptions',
'type': 'User',
'url': 'https://api.github.com/users/rspadim'}}data中的每一个元素都是一个dict,这个dict就是在github上找到的issue页面上的信息。我们可以把data传给DataFrame并提取感兴趣的部分:
issues = pd.DataFrame(data, columns=['number', 'title',
'labels', 'state'])
issues
如果在工作中,大部分数据并不会以text或excel的格式存储。最广泛使用的是SQL-based的关系型数据库(SQL Server,PostgreSQL,MySQL)。选择数据库通常取决于性能,数据整合性,实际应用的可扩展性。
读取SQL到DataFrame非常直观,pandas中有一些函数能简化这个过程。举个例子,这里创建一个SQLite数据库,通过使用python内建的sqlite3 driver:
import sqlite3
import pandas as pdquery = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""con = sqlite3.connect('../examples/mydata.sqlite')con.execute(query)<sqlite3.Cursor at 0x1049931f0>con.commit()然后我们插入几行数据:
data = [('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"con.executemany(stmt, data)<sqlite3.Cursor at 0x1049932d0>con.commit()大部分python的SQL驱动(PyODBC, psycopg2, MySQLdb, pymssql, 等)返回a list of tuple,当从一个表格选择数据的时候:
cursor = con.execute('select * from test')rows = cursor.fetchall()rows[('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]我们可以把list of tuples传递给DataFrame,但是我们也需要column names,包含cursor的description属性:
cursor.description(('a', None, None, None, None, None, None),
('b', None, None, None, None, None, None),
('c', None, None, None, None, None, None),
('d', None, None, None, None, None, None))pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
我们不希望每次询问数据库的时候都重复以上步骤,这样对计算机很不好(逐步对计算机系统或文件做小改动导致大的损害)。SQLAlchemy计划是一个六星的Python SQL工具箱,它能抽象出不同SQL数据库之间的不同。pandas有一个read_sql函数,能让我们从SQLAlchemy connection从读取数据。这里我们用SQLAlchemy连接到同一个SQLite数据库,并从之前创建的表格读取数据:
import sqlalchemy as sqladb = sqla.create_engine('sqlite:///../examples/mydata.sqlite')pd.read_sql('select * from test', db)
阅读量:858
点赞量:0
收藏量:0