Extract/Load - reading/writing tables from files, databases and other sources
Extract (read)
The “from…” functions extract a table from a file-like source or database. For everything except petl.io.db.fromdb() the source argument provides information about where to extract the underlying data from. If the source argument is None or a string it is interpreted as follows:

- None - read from stdin
- string starting with http://, https:// or ftp:// - read from URL
- string ending with .gz or .bgz - read from file via gzip decompression
- string ending with .bz2 - read from file via bz2 decompression
- any other string - read directly from file
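For example, a minimal sketch of the different source styles (the URL and file names here are purely illustrative):

>>> import petl as etl
>>> t1 = etl.fromcsv('http://example.com/data.csv')  # read from a URL
>>> t2 = etl.fromcsv('data.csv.gz')                  # read via gzip decompression
>>> t3 = etl.fromcsv('data.csv')                     # read directly from a local file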
Some helper classes are also available for reading from other types of file-like sources, e.g., reading data from a Zip file, a string or a subprocess; see the section on I/O helper classes below for more information.
Be aware that loading data from stdin breaks the table container convention, because data can usually only be read once. If you are sure that data will only be read once in your script or interactive session then this may not be a problem; however, note that some petl functions do access the underlying data source more than once and so will not work as expected with data from stdin.
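If data arriving on stdin needs to be iterated more than once, one workaround (a minimal sketch, not the only option) is to materialise the rows into memory first:

>>> import petl as etl
>>> rows = list(etl.fromcsv(None))  # read everything from stdin, once
>>> tbl = etl.wrap(rows)            # a list of rows is a valid table container
>>> tbl.look()                      # now safe to iterate repeatedly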
Load (write)
The “to…” functions load data from a table into a file-like source or database. For functions that accept a source argument, if the source argument is None or a string it is interpreted as follows:

- None - write to stdout
- string ending with .gz or .bgz - write to file via gzip compression
- string ending with .bz2 - write to file via bz2 compression
- any other string - write directly to file
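For example, a minimal sketch (file names are illustrative):

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.tocsv(table, 'example.csv.gz')   # written via gzip compression
>>> etl.tocsv(table, 'example.csv.bz2')  # written via bz2 compression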
Some helper classes are also available for writing to other types of file-like sources, e.g., writing to a Zip file or string buffer; see the section on I/O helper classes below for more information.
Python objects
- petl.io.base.fromcolumns(cols, header=None, missing=None)
  View a sequence of columns as a table, e.g.:
>>> import petl as etl
>>> cols = [[0, 1, 2], ['a', 'b', 'c']]
>>> tbl = etl.fromcolumns(cols)
>>> tbl
+----+-----+
| f0 | f1  |
+====+=====+
|  0 | 'a' |
+----+-----+
|  1 | 'b' |
+----+-----+
|  2 | 'c' |
+----+-----+
If columns are not the same length, values will be padded to the length of the longest column with missing, which is None by default, e.g.:
>>> cols = [[0, 1, 2], ['a', 'b']]
>>> tbl = etl.fromcolumns(cols, missing='NA')
>>> tbl
+----+------+
| f0 | f1   |
+====+======+
|  0 | 'a'  |
+----+------+
|  1 | 'b'  |
+----+------+
|  2 | 'NA' |
+----+------+
See also petl.io.json.fromdicts().

New in version 1.1.0.
Delimited files
- petl.io.csv.fromcsv(source=None, encoding=None, errors='strict', header=None, **csvargs)
  Extract a table from a delimited file. E.g.:
>>> import petl as etl
>>> import csv
>>> # set up a CSV file to demonstrate with
... table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> with open('example.csv', 'w') as f:
...     writer = csv.writer(f)
...     writer.writerows(table1)
...
>>> # now demonstrate the use of fromcsv()
... table2 = etl.fromcsv('example.csv')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
The source argument is the path of the delimited file; all other keyword arguments are passed to csv.reader(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that all data values are strings, and any intended numeric values will need to be converted; see also petl.transform.conversions.convert().
- petl.io.csv.tocsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)
  Write the table to a CSV file. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tocsv(table1, 'example.csv')
>>> # look what it did
... print(open('example.csv').read())
foo,bar
a,1
b,2
c,2
The source argument is the path of the delimited file, and the optional write_header argument specifies whether to include the field names in the delimited file. All other keyword arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that if a file already exists at the given location, it will be overwritten.
- petl.io.csv.appendcsv(table, source=None, encoding=None, errors='strict', write_header=False, **csvargs)
  Append data rows to an existing CSV file. As petl.io.csv.tocsv() but the file is opened in append mode and the table header is not written by default.

Note that no attempt is made to check that the fields or row lengths are consistent with the existing data; the data rows from the table are simply appended to the file.
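A minimal sketch, assuming 'example.csv' was created as in the tocsv() example above:

>>> import petl as etl
>>> extra = [['foo', 'bar'],
...          ['d', 7]]
>>> etl.appendcsv(extra, 'example.csv')  # header is skipped; the row 'd,7' is appended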
- petl.io.csv.teecsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)
  Returns a table that writes rows to a CSV file as they are iterated over.
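This makes it possible to keep an intermediate copy of the data partway through a pipeline without iterating over the source twice, e.g. (a minimal sketch; file names are illustrative):

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> # rows are written to backup.csv as they flow through to final.csv
... etl.tocsv(etl.teecsv(table, 'backup.csv'), 'final.csv')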
- petl.io.csv.fromtsv(source=None, encoding=None, errors='strict', header=None, **csvargs)
  Convenience function, as petl.io.csv.fromcsv() but with different default dialect (tab delimited).
- petl.io.csv.totsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)
  Convenience function, as petl.io.csv.tocsv() but with different default dialect (tab delimited).
- petl.io.csv.appendtsv(table, source=None, encoding=None, errors='strict', write_header=False, **csvargs)
  Convenience function, as petl.io.csv.appendcsv() but with different default dialect (tab delimited).
- petl.io.csv.teetsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)
  Convenience function, as petl.io.csv.teecsv() but with different default dialect (tab delimited).
Pickle files
- petl.io.pickle.frompickle(source=None)
  Extract a table from data pickled in the given file. The rows in the table should have been pickled to the file one at a time. E.g.:
>>> import petl as etl
>>> import pickle
>>> # set up a file to demonstrate with
... with open('example.p', 'wb') as f:
...     pickle.dump(['foo', 'bar'], f)
...     pickle.dump(['a', 1], f)
...     pickle.dump(['b', 2], f)
...     pickle.dump(['c', 2.5], f)
...
>>> # now demonstrate the use of frompickle()
... table1 = etl.frompickle('example.p')
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' | 2.5 |
+-----+-----+
- petl.io.pickle.topickle(table, source=None, protocol=-1, write_header=True)
  Write the table to a pickle file. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.topickle(table1, 'example.p')
>>> # look what it did
... table2 = etl.frompickle('example.p')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+
Note that if a file already exists at the given location, it will be overwritten.
The pickle file format preserves type information, i.e., reading and writing is round-trippable for tables with non-string data values.
- petl.io.pickle.appendpickle(table, source=None, protocol=-1, write_header=False)
  Append data to an existing pickle file. I.e., as petl.io.pickle.topickle() but the file is opened in append mode.

Note that no attempt is made to check that the fields or row lengths are consistent with the existing data; the data rows from the table are simply appended to the file.
Text files
- petl.io.text.fromtext(source=None, encoding=None, errors='strict', strip=None, header=('lines', ))
  Extract a table from lines in the given text file. E.g.:
>>> import petl as etl
>>> # setup example file
... text = 'a,1\nb,2\nc,2\n'
>>> with open('example.txt', 'w') as f:
...     f.write(text)
...
12
>>> table1 = etl.fromtext('example.txt')
>>> table1
+-------+
| lines |
+=======+
| 'a,1' |
+-------+
| 'b,2' |
+-------+
| 'c,2' |
+-------+
>>> # post-process, e.g., with capture()
... table2 = table1.capture('lines', '(.*),(.*)$', ['foo', 'bar'])
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
Note that the strip() function is called on each line, which by default will remove leading and trailing whitespace, including the end-of-line character - use the strip keyword argument to specify alternative characters to strip. Set the strip argument to False to disable this behaviour and leave line endings in place.
- petl.io.text.totext(table, source=None, encoding=None, errors='strict', template=None, prologue=None, epilogue=None)
  Write the table to a text file. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> prologue = '''{| class="wikitable"
... |-
... ! foo
... ! bar
... '''
>>> template = '''|-
... | {foo}
... | {bar}
... '''
>>> epilogue = '|}'
>>> etl.totext(table1, 'example.txt', template=template,
...            prologue=prologue, epilogue=epilogue)
>>> # see what we did
... print(open('example.txt').read())
{| class="wikitable"
|-
! foo
! bar
|-
| a
| 1
|-
| b
| 2
|-
| c
| 2
|}
The template will be used to format each row via str.format.
XML files
- petl.io.xml.fromxml(source, *args, **kwargs)
  Extract data from an XML file. E.g.:
>>> import petl as etl
>>> # setup a file to demonstrate with
... d = '''<table>
... <tr>
... <td>foo</td><td>bar</td>
... </tr>
... <tr>
... <td>a</td><td>1</td>
... </tr>
... <tr>
... <td>b</td><td>2</td>
... </tr>
... <tr>
... <td>c</td><td>2</td>
... </tr>
... </table>'''
>>> with open('example1.xml', 'w') as f:
...     f.write(d)
...
212
>>> table1 = etl.fromxml('example1.xml', 'tr', 'td')
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
If the data values are stored in an attribute, provide the attribute name as an extra positional argument:
>>> d = '''<table>
... <tr>
... <td v='foo'/><td v='bar'/>
... </tr>
... <tr>
... <td v='a'/><td v='1'/>
... </tr>
... <tr>
... <td v='b'/><td v='2'/>
... </tr>
... <tr>
... <td v='c'/><td v='2'/>
... </tr>
... </table>'''
>>> with open('example2.xml', 'w') as f:
...     f.write(d)
...
220
>>> table2 = etl.fromxml('example2.xml', 'tr', 'td', 'v')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
Data values can also be extracted by providing a mapping of field names to element paths:
>>> d = '''<table>
... <row>
... <foo>a</foo><baz><bar v='1'/><bar v='3'/></baz>
... </row>
... <row>
... <foo>b</foo><baz><bar v='2'/></baz>
... </row>
... <row>
... <foo>c</foo><baz><bar v='2'/></baz>
... </row>
... </table>'''
>>> with open('example3.xml', 'w') as f:
...     f.write(d)
...
223
>>> table3 = etl.fromxml('example3.xml', 'row',
...                      {'foo': 'foo', 'bar': ('baz/bar', 'v')})
>>> table3
+------------+-----+
| bar        | foo |
+============+=====+
| ('1', '3') | 'a' |
+------------+-----+
| '2'        | 'b' |
+------------+-----+
| '2'        | 'c' |
+------------+-----+
If lxml is installed, full XPath expressions can be used.
Note that the implementation is currently not streaming, i.e., the whole document is loaded into memory.
If multiple elements match a given field, all values are reported as a tuple.
If there is more than one element name used for row values, a tuple or list of paths can be provided, e.g., fromxml('example.html', './/tr', ('th', 'td')).
For writing to an XML file, see petl.io.text.totext().
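E.g., a minimal sketch of writing simple XML via totext() using prologue, template and epilogue (the element names here are illustrative):

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.totext(table, 'example.xml',
...            prologue='<table>\n',
...            template='<tr><td>{foo}</td><td>{bar}</td></tr>\n',
...            epilogue='</table>\n')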
HTML files
- petl.io.html.tohtml(table, source=None, encoding=None, errors='strict', caption=None, vrepr=str, lineterminator='\n', index_header=False, tr_style=None, td_styles=None, truncate=None)
  Write the table as HTML to a file. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tohtml(table1, 'example.html', caption='example table')
>>> print(open('example.html').read())
<table class='petl'>
<caption>example table</caption>
<thead>
<tr>
<th>foo</th>
<th>bar</th>
</tr>
</thead>
<tbody>
<tr>
<td>a</td>
<td style='text-align: right'>1</td>
</tr>
<tr>
<td>b</td>
<td style='text-align: right'>2</td>
</tr>
<tr>
<td>c</td>
<td style='text-align: right'>2</td>
</tr>
</tbody>
</table>
The caption keyword argument is used to provide a table caption in the output HTML.
JSON files
- petl.io.json.fromjson(source, *args, **kwargs)
  Extract data from a JSON file. The file must contain a JSON array as the top level object, and each member of the array will be treated as a row of data. E.g.:
>>> import petl as etl
>>> data = '''
... [{"foo": "a", "bar": 1},
... {"foo": "b", "bar": 2},
... {"foo": "c", "bar": 2}]
... '''
>>> with open('example.json', 'w') as f:
...     f.write(data)
...
74
>>> table1 = etl.fromjson('example.json', header=['foo', 'bar'])
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+
If your JSON file does not fit this structure, you will need to parse it via json.load() and select the array to treat as the data; see also petl.io.json.fromdicts().

Changed in version 1.1.0.
If no header is specified, fields will be discovered by sampling keys from the first sample objects in source. The header will be constructed from keys in the order discovered. Note that this ordering may not be stable, and therefore it may be advisable to specify an explicit header or to use another function like petl.transform.headers.sortheader() on the resulting table to guarantee stability.
- petl.io.json.fromdicts(dicts, header=None, sample=1000, missing=None)
  View a sequence of Python dict as a table. E.g.:

>>> import petl as etl
>>> dicts = [{"foo": "a", "bar": 1},
...          {"foo": "b", "bar": 2},
...          {"foo": "c", "bar": 2}]
>>> table1 = etl.fromdicts(dicts, header=['foo', 'bar'])
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+
If header is not specified, sample items from dicts will be inspected to discover dictionary keys. Note that the order in which dictionary keys are discovered may not be stable (see below).
See also petl.io.json.fromjson().

Changed in version 1.1.0.
If no header is specified, fields will be discovered by sampling keys from the first sample dictionaries in dicts. The header will be constructed from keys in the order discovered. Note that this ordering may not be stable, and therefore it may be advisable to specify an explicit header or to use another function like petl.transform.headers.sortheader() on the resulting table to guarantee stability.
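E.g., a minimal sketch of both options (sortheader() assumes petl >= 1.1.0):

>>> import petl as etl
>>> dicts = [{'foo': 'a', 'bar': 1}, {'bar': 2, 'foo': 'b'}]
>>> t1 = etl.fromdicts(dicts, header=['foo', 'bar'])   # explicit field order
>>> t2 = etl.sortheader(etl.fromdicts(dicts))          # ...or sort fields alphabetically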
- petl.io.json.tojson(table, source=None, prefix=None, suffix=None, *args, **kwargs)
  Write a table in JSON format, with rows output as JSON objects. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tojson(table1, 'example.json', sort_keys=True)
>>> # check what it did
... print(open('example.json').read())
[{"bar": 1, "foo": "a"}, {"bar": 2, "foo": "b"}, {"bar": 2, "foo": "c"}]
Note that this is currently not streaming, all data is loaded into memory before being written to the file.
- petl.io.json.tojsonarrays(table, source=None, prefix=None, suffix=None, output_header=False, *args, **kwargs)
  Write a table in JSON format, with rows output as JSON arrays. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tojsonarrays(table1, 'example.json')
>>> # check what it did
... print(open('example.json').read())
[["a", 1], ["b", 2], ["c", 2]]
Note that this is currently not streaming, all data is loaded into memory before being written to the file.
Databases
Note: The automatic table creation feature of petl.io.db.todb() requires SQLAlchemy to be installed, e.g.:
$ pip install sqlalchemy
- petl.io.db.fromdb(dbo, query, *args, **kwargs)
  Provides access to data from any DB-API 2.0 connection via a given query. E.g., using sqlite3:

>>> import petl as etl
>>> import sqlite3
>>> connection = sqlite3.connect('example.db')
>>> table = etl.fromdb(connection, 'SELECT * FROM example')
E.g., using psycopg2 (assuming you’ve installed it first):

>>> import petl as etl
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> table = etl.fromdb(connection, 'SELECT * FROM example')
E.g., using pymysql (assuming you’ve installed it first):

>>> import petl as etl
>>> import pymysql
>>> connection = pymysql.connect(password='moonpie', database='thangs')
>>> table = etl.fromdb(connection, 'SELECT * FROM example')
The dbo argument may also be a function that creates a cursor. N.B., each call to the function should return a new cursor. E.g.:
>>> import petl as etl
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> mkcursor = lambda: connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
>>> table = etl.fromdb(mkcursor, 'SELECT * FROM example')
The parameter dbo may also be an SQLAlchemy engine, session or connection object.
The parameter dbo may also be a string, in which case it is interpreted as the name of a file containing an sqlite3 database.

Note that the default behaviour of most database servers and clients is for the entire result set for each query to be sent from the server to the client. If your query returns a large result set this can result in significant memory usage at the client. Some databases support server-side cursors which provide a means for client libraries to fetch result sets incrementally, reducing memory usage at the client.
To use a server-side cursor with a PostgreSQL database, e.g.:
>>> import petl as etl
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> table = etl.fromdb(lambda: connection.cursor(name='arbitrary'),
...                    'SELECT * FROM example')
For more information on server-side cursors, see the documentation for your database server and client library.
- petl.io.db.todb(table, dbo, tablename, schema=None, commit=True, create=False, drop=False, constraints=True, metadata=None, dialect=None, sample=1000)
  Load data into an existing database table via a DB-API 2.0 connection or cursor. Note that the database table will be truncated, i.e., all existing rows will be deleted prior to inserting the new data. E.g.:
>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['a', 1],
...          ['b', 2],
...          ['c', 2]]
>>> # using sqlite3
... import sqlite3
>>> connection = sqlite3.connect('example.db')
>>> # assuming table "foobar" already exists in the database
... etl.todb(table, connection, 'foobar')
>>> # using psycopg2
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> # assuming table "foobar" already exists in the database
... etl.todb(table, connection, 'foobar')
>>> # using pymysql
>>> import pymysql
>>> connection = pymysql.connect(password='moonpie', database='thangs')
>>> # tell MySQL to use standard quote character
... connection.cursor().execute('SET SQL_MODE=ANSI_QUOTES')
>>> # load data, assuming table "foobar" already exists in the database
... etl.todb(table, connection, 'foobar')
N.B., for MySQL the statement SET SQL_MODE=ANSI_QUOTES is required to ensure MySQL uses SQL-92 standard quote characters.

A cursor can also be provided instead of a connection, e.g.:
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> cursor = connection.cursor()
>>> etl.todb(table, cursor, 'foobar')
The parameter dbo may also be an SQLAlchemy engine, session or connection object.
The parameter dbo may also be a string, in which case it is interpreted as the name of a file containing an sqlite3 database.

If create=True this function will attempt to automatically create a database table before loading the data. This functionality requires SQLAlchemy to be installed.

Keyword arguments:
- table : table container
  Table data to load
- dbo : database object
  DB-API 2.0 connection, callable returning a DB-API 2.0 cursor, or SQLAlchemy connection, engine or session
- tablename : string
  Name of the table in the database
- schema : string
  Name of the database schema to find the table in
- commit : bool
  If True commit the changes
- create : bool
  If True attempt to create the table before loading, inferring types from a sample of the data (requires SQLAlchemy)
- drop : bool
  If True attempt to drop the table before recreating (only relevant if create=True)
- constraints : bool
  If True use length and nullable constraints (only relevant if create=True)
- metadata : sqlalchemy.MetaData
  Custom table metadata (only relevant if create=True)
- dialect : string
  One of {'access', 'sybase', 'sqlite', 'informix', 'firebird', 'mysql', 'oracle', 'maxdb', 'postgresql', 'mssql'} (only relevant if create=True)
- sample : int
  Number of rows to sample when inferring types etc. Set to 0 to use the whole table (only relevant if create=True)
Note: This function is in principle compatible with any DB-API 2.0 compliant database driver. However, at the time of writing some DB-API 2.0 implementations, including cx_Oracle and MySQL’s Connector/Python, are not compatible with this function, because they only accept a list argument to the cursor.executemany() function called internally by petl. This can be worked around by proxying the cursor objects, e.g.:

>>> import cx_Oracle
>>> connection = cx_Oracle.Connection(...)
>>> class CursorProxy(object):
...     def __init__(self, cursor):
...         self._cursor = cursor
...     def executemany(self, statement, parameters, **kwargs):
...         # convert parameters to a list
...         parameters = list(parameters)
...         # pass through to proxied cursor
...         return self._cursor.executemany(statement, parameters, **kwargs)
...     def __getattr__(self, item):
...         return getattr(self._cursor, item)
...
>>> def get_cursor():
...     return CursorProxy(connection.cursor())
...
>>> import petl as etl
>>> etl.todb(tbl, get_cursor, ...)
Note however that this does imply loading the entire table into memory as a list prior to inserting into the database.
- petl.io.db.appenddb(table, dbo, tablename, schema=None, commit=True)
  Load data into an existing database table via a DB-API 2.0 connection or cursor. As petl.io.db.todb() except that the database table will be appended, i.e., the new data will be inserted into the table, and any existing rows will remain.
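A minimal sketch, assuming table "foobar" already exists in the database:

>>> import petl as etl
>>> import sqlite3
>>> connection = sqlite3.connect('example.db')
>>> table = [['foo', 'bar'],
...          ['d', 7]]
>>> etl.appenddb(table, connection, 'foobar')  # existing rows are retained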
Excel .xls files (xlrd/xlwt)
Note: The following functions require xlrd and xlwt to be installed, e.g.:
$ pip install xlrd xlwt-future
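For example, a minimal sketch of the .xls functions (file and sheet names are illustrative; assumes xlrd/xlwt are installed):

>>> import petl as etl
>>> tbl = etl.fromxls('example.xls', sheet='Sheet1')  # sheet defaults to the first sheet
>>> etl.toxls(tbl, 'copy.xls', 'Sheet1')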
Excel .xlsx files (openpyxl)
- petl.io.xlsx.fromxlsx(filename, sheet=None, range_string=None, min_row=None, min_col=None, max_row=None, max_col=None, read_only=False, **kwargs)
  Extract a table from a sheet in an Excel .xlsx file.
N.B., the sheet name is case sensitive.
The sheet argument can be omitted, in which case the first sheet in the workbook is used by default.
The range_string argument can be used to provide a range string specifying a range of cells to extract.
The min_row, min_col, max_row and max_col arguments can be used to limit the range of cells to extract. They will be ignored if range_string is provided.
The read_only argument determines how openpyxl returns the loaded workbook. The default is False, as this prevents some LibreOffice files from being truncated at 65536 rows. True should be faster when access is read-only and the files were created with Microsoft Excel.
Any other keyword arguments are passed through to openpyxl.load_workbook().
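E.g., a minimal sketch (file and sheet names are illustrative):

>>> import petl as etl
>>> tbl1 = etl.fromxlsx('example.xlsx', sheet='Sheet1', min_row=2)  # limit rows extracted
>>> tbl2 = etl.fromxlsx('example.xlsx', range_string='A1:C4')       # ...or give an explicit range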
Arrays (NumPy)
- petl.io.numpy.fromarray(a)
  Extract a table from a numpy structured array, e.g.:
>>> import petl as etl
>>> import numpy as np
>>> a = np.array([('apples', 1, 2.5),
...               ('oranges', 3, 4.4),
...               ('pears', 7, 0.1)],
...              dtype='U8, i4,f4')
>>> table = etl.fromarray(a)
>>> table
+-----------+----+-----+
| f0        | f1 | f2  |
+===========+====+=====+
| 'apples'  |  1 | 2.5 |
+-----------+----+-----+
| 'oranges' |  3 | 4.4 |
+-----------+----+-----+
| 'pears'   |  7 | 0.1 |
+-----------+----+-----+
- petl.io.numpy.toarray(table, dtype=None, count=-1, sample=1000)
  Load data from the given table into a numpy structured array. E.g.:
>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> a = etl.toarray(table)
>>> a
array([('apples', 1, 2.5), ('oranges', 3, 4.4), ('pears', 7, 0.1)],
      dtype=(numpy.record, [('foo', '<U7'), ('bar', '<i8'), ('baz', '<f8')]))
>>> # the dtype can be specified as a string
... a = etl.toarray(table, dtype='a4, i2, f4')
>>> a
array([(b'appl', 1, 2.5), (b'oran', 3, 4.4), (b'pear', 7, 0.1)],
      dtype=[('foo', 'S4'), ('bar', '<i2'), ('baz', '<f4')])
>>> # the dtype can also be partially specified
... a = etl.toarray(table, dtype={'foo': 'a4'})
>>> a
array([(b'appl', 1, 2.5), (b'oran', 3, 4.4), (b'pear', 7, 0.1)],
      dtype=[('foo', 'S4'), ('bar', '<i8'), ('baz', '<f8')])
If the dtype is not completely specified, sample rows will be examined to infer an appropriate dtype.
- petl.io.numpy.torecarray(*args, **kwargs)
  Convenient shorthand for toarray(*args, **kwargs).view(np.recarray).
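E.g., a minimal sketch; a recarray additionally allows attribute-style access to columns:

>>> import petl as etl
>>> table = [('foo', 'bar'), ('a', 1), ('b', 2)]
>>> ra = etl.torecarray(table)
>>> ra.bar  # fields are accessible as attributes
array([1, 2])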
- petl.io.numpy.valuestoarray(vals, dtype=None, count=-1, sample=1000)
  Load values from a table column into a numpy array, e.g.:
>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> table = etl.wrap(table)
>>> table.values('bar').array()
array([1, 3, 7])
>>> # specify dtype
... table.values('bar').array(dtype='i4')
array([1, 3, 7], dtype=int32)
DataFrames (pandas)
- petl.io.pandas.fromdataframe(df, include_index=False)
  Extract a table from a pandas DataFrame. E.g.:
>>> import petl as etl
>>> import pandas as pd
>>> records = [('apples', 1, 2.5), ('oranges', 3, 4.4), ('pears', 7, 0.1)]
>>> df = pd.DataFrame.from_records(records, columns=('foo', 'bar', 'baz'))
>>> table = etl.fromdataframe(df)
>>> table
+-----------+-----+-----+
| foo       | bar | baz |
+===========+=====+=====+
| 'apples'  |   1 | 2.5 |
+-----------+-----+-----+
| 'oranges' |   3 | 4.4 |
+-----------+-----+-----+
| 'pears'   |   7 | 0.1 |
+-----------+-----+-----+
- petl.io.pandas.todataframe(table, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)
  Load data from the given table into a pandas DataFrame. E.g.:
>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> df = etl.todataframe(table)
>>> df
       foo  bar  baz
0   apples    1  2.5
1  oranges    3  4.4
2    pears    7  0.1
HDF5 files (PyTables)
Note: The following functions require PyTables to be installed, e.g.:
$ # install HDF5
$ apt-get install libhdf5-7 libhdf5-dev
$ # install other prerequisites
$ pip install cython
$ pip install numpy
$ pip install numexpr
$ # install PyTables
$ pip install tables
- petl.io.pytables.fromhdf5(source, where=None, name=None, condition=None, condvars=None, start=None, stop=None, step=None)
  Provides access to an HDF5 table. E.g.:
>>> import petl as etl
>>> import tables
>>> # set up a new hdf5 table to demonstrate with
... h5file = tables.open_file('example.h5', mode='w',
...                           title='Example file')
>>> h5file.create_group('/', 'testgroup', 'Test Group')
/testgroup (Group) 'Test Group'
  children := []
>>> class FooBar(tables.IsDescription):
...     foo = tables.Int32Col(pos=0)
...     bar = tables.StringCol(6, pos=2)
...
>>> h5table = h5file.create_table('/testgroup', 'testtable', FooBar,
...                               'Test Table')
>>> # load some data into the table
... table1 = (('foo', 'bar'),
...           (1, b'asdfgh'),
...           (2, b'qwerty'),
...           (3, b'zxcvbn'))
>>> for row in table1[1:]:
...     for i, f in enumerate(table1[0]):
...         h5table.row[f] = row[i]
...     h5table.row.append()
...
>>> h5file.flush()
>>> h5file.close()
>>> #
... # now demonstrate use of fromhdf5
... table1 = etl.fromhdf5('example.h5', '/testgroup', 'testtable')
>>> table1
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'asdfgh' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
|   3 | b'zxcvbn' |
+-----+-----------+
>>> # alternatively just specify path to table node
... table1 = etl.fromhdf5('example.h5', '/testgroup/testtable')
>>> # ...or use an existing tables.File object
... h5file = tables.open_file('example.h5')
>>> table1 = etl.fromhdf5(h5file, '/testgroup/testtable')
>>> # ...or use an existing tables.Table object
... h5tbl = h5file.get_node('/testgroup/testtable')
>>> table1 = etl.fromhdf5(h5tbl)
>>> # use a condition to filter data
... table2 = etl.fromhdf5(h5tbl, condition='foo < 3')
>>> table2
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'asdfgh' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
>>> h5file.close()
- petl.io.pytables.fromhdf5sorted(source, where=None, name=None, sortby=None, checkCSI=False, start=None, stop=None, step=None)
  Provides access to an HDF5 table, sorted by an indexed column, e.g.:
>>> import petl as etl
>>> import tables
>>> # set up a new hdf5 table to demonstrate with
... h5file = tables.open_file('example.h5', mode='w', title='Test file')
>>> h5file.create_group('/', 'testgroup', 'Test Group')
/testgroup (Group) 'Test Group'
  children := []
>>> class FooBar(tables.IsDescription):
...     foo = tables.Int32Col(pos=0)
...     bar = tables.StringCol(6, pos=2)
...
>>> h5table = h5file.create_table('/testgroup', 'testtable', FooBar, 'Test Table')
>>> # load some data into the table
... table1 = (('foo', 'bar'),
...           (3, b'asdfgh'),
...           (2, b'qwerty'),
...           (1, b'zxcvbn'))
>>> for row in table1[1:]:
...     for i, f in enumerate(table1[0]):
...         h5table.row[f] = row[i]
...     h5table.row.append()
...
>>> h5table.cols.foo.create_csindex()  # CS index is required
0
>>> h5file.flush()
>>> h5file.close()
>>> #
... # access the data, sorted by the indexed column
... table2 = etl.fromhdf5sorted('example.h5', '/testgroup', 'testtable',
...                             sortby='foo')
>>> table2
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'zxcvbn' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
|   3 | b'asdfgh' |
+-----+-----------+
- petl.io.pytables.tohdf5(table, source, where=None, name=None, create=False, drop=False, description=None, title='', filters=None, expectedrows=10000, chunkshape=None, byteorder=None, createparents=False, sample=1000)
  Write to an HDF5 table. If create is False, assumes the table already exists, and attempts to truncate it before loading. If create is True, a new table will be created, and if drop is True, any existing table will be dropped first. If description is None, the description will be guessed. E.g.:
>>> import petl as etl
>>> table1 = (('foo', 'bar'),
...           (1, b'asdfgh'),
...           (2, b'qwerty'),
...           (3, b'zxcvbn'))
>>> etl.tohdf5(table1, 'example.h5', '/testgroup', 'testtable',
...            drop=True, create=True, createparents=True)
>>> etl.fromhdf5('example.h5', '/testgroup', 'testtable')
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'asdfgh' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
|   3 | b'zxcvbn' |
+-----+-----------+
Bcolz ctables
- petl.io.bcolz.frombcolz(source, expression=None, outcols=None, limit=None, skip=0)
  Extract a table from a bcolz ctable, e.g.:
>>> import petl as etl
>>> import bcolz
>>> cols = [
...     ['apples', 'oranges', 'pears'],
...     [1, 3, 7],
...     [2.5, 4.4, .1]
... ]
>>> names = ('foo', 'bar', 'baz')
>>> ctbl = bcolz.ctable(cols, names=names)
>>> tbl = etl.frombcolz(ctbl)
>>> tbl
+-----------+-----+-----+
| foo       | bar | baz |
+===========+=====+=====+
| 'apples'  |   1 | 2.5 |
+-----------+-----+-----+
| 'oranges' |   3 | 4.4 |
+-----------+-----+-----+
| 'pears'   |   7 | 0.1 |
+-----------+-----+-----+
If expression is provided it will be executed by bcolz and only matching rows returned, e.g.:
>>> tbl2 = etl.frombcolz(ctbl, expression='bar > 1')
>>> tbl2
+-----------+-----+-----+
| foo       | bar | baz |
+===========+=====+=====+
| 'oranges' |   3 | 4.4 |
+-----------+-----+-----+
| 'pears'   |   7 | 0.1 |
+-----------+-----+-----+
New in version 1.1.0.
- petl.io.bcolz.tobcolz(table, dtype=None, sample=1000, **kwargs)
  Load data into a bcolz ctable, e.g.:
>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> ctbl = etl.tobcolz(table)
>>> ctbl
ctable((3,), [('foo', '<U7'), ('bar', '<i8'), ('baz', '<f8')])
  nbytes: 132; cbytes: 1023.98 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[('apples', 1, 2.5) ('oranges', 3, 4.4) ('pears', 7, 0.1)]
>>> ctbl.names
['foo', 'bar', 'baz']
>>> ctbl['foo']
carray((3,), <U7)
  nbytes := 84; cbytes := 511.98 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 18724; chunksize: 524272; blocksize: 0
['apples' 'oranges' 'pears']
Other keyword arguments are passed through to the ctable constructor.
New in version 1.1.0.
Text indexes (Whoosh)
- petl.io.whoosh.fromtextindex(index_or_dirname, indexname=None, docnum_field=None)
  Extract all documents from a Whoosh index. E.g.:
>>> import petl as etl
>>> import os
>>> # set up an index and load some documents via the Whoosh API
... from whoosh.index import create_in
>>> from whoosh.fields import *
>>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
...                 content=TEXT)
>>> dirname = 'example.whoosh'
>>> if not os.path.exists(dirname):
...     os.mkdir(dirname)
...
>>> index = create_in(dirname, schema)
>>> writer = index.writer()
>>> writer.add_document(title=u"First document", path=u"/a",
...                     content=u"This is the first document we've added!")
>>> writer.add_document(title=u"Second document", path=u"/b",
...                     content=u"The second one is even more interesting!")
>>> writer.commit()
>>> # extract documents as a table
... table = etl.fromtextindex(dirname)
>>> table
+------+-------------------+
| path | title             |
+======+===================+
| '/a' | 'First document'  |
+------+-------------------+
| '/b' | 'Second document' |
+------+-------------------+
Keyword arguments:
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is stored.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- docnum_field
  If not None, an extra field will be added to the output table containing the internal document number stored in the index. The name of the field will be the value of this argument.
- petl.io.whoosh.searchtextindex(index_or_dirname, query, limit=10, indexname=None, docnum_field=None, score_field=None, fieldboosts=None, search_kwargs=None)
  Search a Whoosh index using a query. E.g.:
>>> import petl as etl
>>> import os
>>> # set up an index and load some documents via the Whoosh API
... from whoosh.index import create_in
>>> from whoosh.fields import *
>>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
...                 content=TEXT)
>>> dirname = 'example.whoosh'
>>> if not os.path.exists(dirname):
...     os.mkdir(dirname)
...
>>> index = create_in('example.whoosh', schema)
>>> writer = index.writer()
>>> writer.add_document(title=u"Oranges", path=u"/a",
...                     content=u"This is the first document we've added!")
>>> writer.add_document(title=u"Apples", path=u"/b",
...                     content=u"The second document is even more "
...                             u"interesting!")
>>> writer.commit()
>>> # demonstrate the use of searchtextindex()
... table1 = etl.searchtextindex('example.whoosh', 'oranges')
>>> table1
+------+-----------+
| path | title     |
+======+===========+
| '/a' | 'Oranges' |
+------+-----------+
>>> table2 = etl.searchtextindex('example.whoosh', 'doc*')
>>> table2
+------+-----------+
| path | title     |
+======+===========+
| '/a' | 'Oranges' |
+------+-----------+
| '/b' | 'Apples'  |
+------+-----------+
Keyword arguments:
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
- query
  Either a string or an instance of whoosh.query.Query. If a string, it will be parsed as a multi-field query, i.e., any terms not bound to a specific field will match any field.
- limit
  Return at most limit results.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- docnum_field
  If not None, an extra field will be added to the output table containing the internal document number stored in the index. The name of the field will be the value of this argument.
- score_field
  If not None, an extra field will be added to the output table containing the score of the result. The name of the field will be the value of this argument.
- fieldboosts
  An optional dictionary mapping field names to boosts.
- search_kwargs
  Any extra keyword arguments to be passed through to the Whoosh search() method.
- petl.io.whoosh.searchtextindexpage(index_or_dirname, query, pagenum, pagelen=10, indexname=None, docnum_field=None, score_field=None, fieldboosts=None, search_kwargs=None)
  Search an index using a query, returning a result page.
Keyword arguments:
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
- query
  Either a string or an instance of whoosh.query.Query. If a string, it will be parsed as a multi-field query, i.e., any terms not bound to a specific field will match any field.
- pagenum
  Number of the page to return (e.g., 1 = first page).
- pagelen
  Number of results per page.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- docnum_field
  If not None, an extra field will be added to the output table containing the internal document number stored in the index. The name of the field will be the value of this argument.
- score_field
  If not None, an extra field will be added to the output table containing the score of the result. The name of the field will be the value of this argument.
- fieldboosts
  An optional dictionary mapping field names to boosts.
- search_kwargs
  Any extra keyword arguments to be passed through to the Whoosh search() method.
- petl.io.whoosh.totextindex(table, index_or_dirname, schema=None, indexname=None, merge=False, optimize=False)
  Load all rows from table into a Whoosh index. N.B., this will clear any existing data in the index before loading. E.g.:
>>> import petl as etl
>>> import datetime
>>> import os
>>> # here is the table we want to load into an index
... table = (('f0', 'f1', 'f2', 'f3', 'f4'),
...          ('AAA', 12, 4.3, True, datetime.datetime.now()),
...          ('BBB', 6, 3.4, False, datetime.datetime(1900, 1, 31)),
...          ('CCC', 42, 7.8, True, datetime.datetime(2100, 12, 25)))
>>> # define a schema for the index
... from whoosh.fields import *
>>> schema = Schema(f0=TEXT(stored=True),
...                 f1=NUMERIC(int, stored=True),
...                 f2=NUMERIC(float, stored=True),
...                 f3=BOOLEAN(stored=True),
...                 f4=DATETIME(stored=True))
>>> # load index
... dirname = 'example.whoosh'
>>> if not os.path.exists(dirname):
...     os.mkdir(dirname)
...
>>> etl.totextindex(table, dirname, schema=schema)
Keyword arguments:
- table
  A table container with the data to be loaded.
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
- schema
  Index schema to use if creating the index.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- merge
  Merge small segments during commit?
- optimize
  Merge all segments together?
- petl.io.whoosh.appendtextindex(table, index_or_dirname, indexname=None, merge=True, optimize=False)
  Load all rows from table into a Whoosh index, adding them to any existing data in the index.
Keyword arguments:
- table
  A table container with the data to be loaded.
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- merge
  Merge small segments during commit?
- optimize
  Merge all segments together?
I/O helper classes
The following classes are helpers for extract (from...()) and load (to...()) functions that use a file-like data source.

An instance of any of the following classes can be used as the source argument to data extraction functions like petl.io.csv.fromcsv() etc., with the exception of petl.io.sources.StdoutSource which is write-only.

An instance of any of the following classes can also be used as the source argument to data loading functions like petl.io.csv.tocsv() etc., with the exception of petl.io.sources.StdinSource, petl.io.sources.URLSource and petl.io.sources.PopenSource which are read-only.

The behaviour of each source can usually be configured by passing arguments to the constructor; see the source code of the petl.io.sources module for full details.
- class petl.io.sources.MemorySource(s=None)
  Memory data source. E.g.:
>>> import petl as etl
>>> data = b'foo,bar\na,1\nb,2\nc,2\n'
>>> source = etl.MemorySource(data)
>>> tbl = etl.fromcsv(source)
>>> tbl
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
>>> sink = etl.MemorySource()
>>> tbl.tojson(sink)
>>> sink.getvalue()
b'[{"foo": "a", "bar": "1"}, {"foo": "b", "bar": "2"}, {"foo": "c", "bar": "2"}]'
Also supports appending.