Extract/Load - reading/writing tables from files, databases and other sources
Extract (read)
The “from…” functions extract a table from a file-like source or database. For everything except petl.io.db.fromdb() the source argument provides information about where to extract the underlying data from. If the source argument is None or a string it is interpreted as follows:

- None - read from stdin
- string starting with http://, https:// or ftp:// - read from URL
- string ending with .gz or .bgz - read from file via gzip decompression
- string ending with .bz2 - read from file via bz2 decompression
- any other string - read directly from file
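For example, a minimal sketch of the different source styles (the URL and file names here are purely illustrative):

>>> import petl as etl
>>> t1 = etl.fromcsv('http://example.com/data.csv')  # read from a URL
>>> t2 = etl.fromcsv('data.csv.gz')                  # read via gzip decompression
>>> t3 = etl.fromcsv('data.csv')                     # read directly from a local file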
Some helper classes are also available for reading from other types of file-like sources, e.g., reading data from a Zip file, a string or a subprocess; see the section on I/O helper classes below for more information.
Be aware that loading data from stdin breaks the table container convention, because data can usually only be read once. If you are sure that data will only be read once in your script or interactive session then this may not be a problem; however, note that some petl functions do access the underlying data source more than once and so will not work as expected with data from stdin.
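If data arriving on stdin needs to be iterated more than once, one workaround (a minimal sketch, not the only option) is to materialise the rows into memory first:

>>> import petl as etl
>>> rows = list(etl.fromcsv(None))  # read everything from stdin, once
>>> tbl = etl.wrap(rows)            # a list of rows is a valid table container
>>> tbl.look()                      # now safe to iterate repeatedly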
Load (write)
The “to…” functions load data from a table into a file-like source or database. For functions that accept a source argument, if the source argument is None or a string it is interpreted as follows:

- None - write to stdout
- string ending with .gz or .bgz - write to file via gzip compression
- string ending with .bz2 - write to file via bz2 compression
- any other string - write directly to file
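For example, a minimal sketch (file names are illustrative):

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.tocsv(table, 'example.csv.gz')   # written via gzip compression
>>> etl.tocsv(table, 'example.csv.bz2')  # written via bz2 compression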
Some helper classes are also available for writing to other types of file-like sources, e.g., writing to a Zip file or string buffer; see the section on I/O helper classes below for more information.
Python objects
- petl.io.base.fromcolumns(cols, header=None, missing=None)
  View a sequence of columns as a table, e.g.:
>>> import petl as etl
>>> cols = [[0, 1, 2], ['a', 'b', 'c']]
>>> tbl = etl.fromcolumns(cols)
>>> tbl
+----+-----+
| f0 | f1  |
+====+=====+
|  0 | 'a' |
+----+-----+
|  1 | 'b' |
+----+-----+
|  2 | 'c' |
+----+-----+
If columns are not the same length, values will be padded to the length of the longest column with missing, which is None by default, e.g.:
>>> cols = [[0, 1, 2], ['a', 'b']]
>>> tbl = etl.fromcolumns(cols, missing='NA')
>>> tbl
+----+------+
| f0 | f1   |
+====+======+
|  0 | 'a'  |
+----+------+
|  1 | 'b'  |
+----+------+
|  2 | 'NA' |
+----+------+
See also petl.io.json.fromdicts().

New in version 1.1.0.
Delimited files
- petl.io.csv.fromcsv(source=None, encoding=None, errors='strict', header=None, **csvargs)
  Extract a table from a delimited file. E.g.:
>>> import petl as etl
>>> import csv
>>> # set up a CSV file to demonstrate with
... table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> with open('example.csv', 'w') as f:
...     writer = csv.writer(f)
...     writer.writerows(table1)
...
>>> # now demonstrate the use of fromcsv()
... table2 = etl.fromcsv('example.csv')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
The source argument is the path of the delimited file; all other keyword arguments are passed to csv.reader(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that all data values are strings, and any intended numeric values will need to be converted; see also petl.transform.conversions.convert().
- petl.io.csv.tocsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)
  Write the table to a CSV file. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tocsv(table1, 'example.csv')
>>> # look what it did
... print(open('example.csv').read())
foo,bar
a,1
b,2
c,2
The source argument is the path of the delimited file, and the optional write_header argument specifies whether to include the field names in the delimited file. All other keyword arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that if a file already exists at the given location, it will be overwritten.
- petl.io.csv.appendcsv(table, source=None, encoding=None, errors='strict', write_header=False, **csvargs)
  Append data rows to an existing CSV file. As petl.io.csv.tocsv() but the file is opened in append mode and the table header is not written by default.

Note that no attempt is made to check that the fields or row lengths are consistent with the existing data; the data rows from the table are simply appended to the file.
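A minimal sketch, assuming 'example.csv' was created as in the tocsv() example above:

>>> import petl as etl
>>> extra = [['foo', 'bar'],
...          ['d', 7]]
>>> etl.appendcsv(extra, 'example.csv')  # header is skipped; the row 'd,7' is appended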
- petl.io.csv.teecsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)
  Returns a table that writes rows to a CSV file as they are iterated over.
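This makes it possible to keep an intermediate copy of the data partway through a pipeline without iterating over the source twice, e.g. (a minimal sketch; file names are illustrative):

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> # rows are written to backup.csv as they flow through to final.csv
... etl.tocsv(etl.teecsv(table, 'backup.csv'), 'final.csv')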
- petl.io.csv.fromtsv(source=None, encoding=None, errors='strict', header=None, **csvargs)
  Convenience function, as petl.io.csv.fromcsv() but with different default dialect (tab delimited).
- petl.io.csv.totsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)
  Convenience function, as petl.io.csv.tocsv() but with different default dialect (tab delimited).
- petl.io.csv.appendtsv(table, source=None, encoding=None, errors='strict', write_header=False, **csvargs)
  Convenience function, as petl.io.csv.appendcsv() but with different default dialect (tab delimited).
- petl.io.csv.teetsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)
  Convenience function, as petl.io.csv.teecsv() but with different default dialect (tab delimited).
Pickle files
- petl.io.pickle.frompickle(source=None)
  Extract a table from data pickled in the given file. The rows in the table should have been pickled to the file one at a time. E.g.:
>>> import petl as etl
>>> import pickle
>>> # set up a file to demonstrate with
... with open('example.p', 'wb') as f:
...     pickle.dump(['foo', 'bar'], f)
...     pickle.dump(['a', 1], f)
...     pickle.dump(['b', 2], f)
...     pickle.dump(['c', 2.5], f)
...
>>> # now demonstrate the use of frompickle()
... table1 = etl.frompickle('example.p')
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' | 2.5 |
+-----+-----+
- petl.io.pickle.topickle(table, source=None, protocol=-1, write_header=True)
  Write the table to a pickle file. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.topickle(table1, 'example.p')
>>> # look what it did
... table2 = etl.frompickle('example.p')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+
Note that if a file already exists at the given location, it will be overwritten.
The pickle file format preserves type information, i.e., reading and writing is round-trippable for tables with non-string data values.
- petl.io.pickle.appendpickle(table, source=None, protocol=-1, write_header=False)
  Append data to an existing pickle file. I.e., as petl.io.pickle.topickle() but the file is opened in append mode.

Note that no attempt is made to check that the fields or row lengths are consistent with the existing data; the data rows from the table are simply appended to the file.
Text files
- petl.io.text.fromtext(source=None, encoding=None, errors='strict', strip=None, header=('lines', ))
  Extract a table from lines in the given text file. E.g.:
>>> import petl as etl
>>> # setup example file
... text = 'a,1\nb,2\nc,2\n'
>>> with open('example.txt', 'w') as f:
...     f.write(text)
...
12
>>> table1 = etl.fromtext('example.txt')
>>> table1
+-------+
| lines |
+=======+
| 'a,1' |
+-------+
| 'b,2' |
+-------+
| 'c,2' |
+-------+
>>> # post-process, e.g., with capture()
... table2 = table1.capture('lines', '(.*),(.*)$', ['foo', 'bar'])
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
Note that the strip() function is called on each line, which by default will remove leading and trailing whitespace, including the end-of-line character - use the strip keyword argument to specify alternative characters to strip. Set the strip argument to False to disable this behaviour and leave line endings in place.
- petl.io.text.totext(table, source=None, encoding=None, errors='strict', template=None, prologue=None, epilogue=None)
  Write the table to a text file. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> prologue = '''{| class="wikitable"
... |-
... ! foo
... ! bar
... '''
>>> template = '''|-
... | {foo}
... | {bar}
... '''
>>> epilogue = '|}'
>>> etl.totext(table1, 'example.txt', template=template,
...            prologue=prologue, epilogue=epilogue)
>>> # see what we did
... print(open('example.txt').read())
{| class="wikitable"
|-
! foo
! bar
|-
| a
| 1
|-
| b
| 2
|-
| c
| 2
|}
The template will be used to format each row via str.format.
XML files
- petl.io.xml.fromxml(source, *args, **kwargs)
  Extract data from an XML file. E.g.:
>>> import petl as etl
>>> # setup a file to demonstrate with
... d = '''<table>
... <tr>
... <td>foo</td><td>bar</td>
... </tr>
... <tr>
... <td>a</td><td>1</td>
... </tr>
... <tr>
... <td>b</td><td>2</td>
... </tr>
... <tr>
... <td>c</td><td>2</td>
... </tr>
... </table>'''
>>> with open('example1.xml', 'w') as f:
...     f.write(d)
...
212
>>> table1 = etl.fromxml('example1.xml', 'tr', 'td')
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
If the data values are stored in an attribute, provide the attribute name as an extra positional argument:
>>> d = '''<table>
... <tr>
... <td v='foo'/><td v='bar'/>
... </tr>
... <tr>
... <td v='a'/><td v='1'/>
... </tr>
... <tr>
... <td v='b'/><td v='2'/>
... </tr>
... <tr>
... <td v='c'/><td v='2'/>
... </tr>
... </table>'''
>>> with open('example2.xml', 'w') as f:
...     f.write(d)
...
220
>>> table2 = etl.fromxml('example2.xml', 'tr', 'td', 'v')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
Data values can also be extracted by providing a mapping of field names to element paths:
>>> d = '''<table>
... <row>
... <foo>a</foo><baz><bar v='1'/><bar v='3'/></baz>
... </row>
... <row>
... <foo>b</foo><baz><bar v='2'/></baz>
... </row>
... <row>
... <foo>c</foo><baz><bar v='2'/></baz>
... </row>
... </table>'''
>>> with open('example3.xml', 'w') as f:
...     f.write(d)
...
223
>>> table3 = etl.fromxml('example3.xml', 'row',
...                      {'foo': 'foo', 'bar': ('baz/bar', 'v')})
>>> table3
+------------+-----+
| bar        | foo |
+============+=====+
| ('1', '3') | 'a' |
+------------+-----+
| '2'        | 'b' |
+------------+-----+
| '2'        | 'c' |
+------------+-----+
If lxml is installed, full XPath expressions can be used.
Note that the implementation is currently not streaming, i.e., the whole document is loaded into memory.
If multiple elements match a given field, all values are reported as a tuple.
If there is more than one element name used for row values, a tuple or list of paths can be provided, e.g., fromxml('example.html', './/tr', ('th', 'td')).
For writing to an XML file, see petl.io.text.totext().
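E.g., a minimal sketch of writing simple XML via totext() using prologue, template and epilogue (the element names here are illustrative):

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.totext(table, 'example.xml',
...            prologue='<table>\n',
...            template='<tr><td>{foo}</td><td>{bar}</td></tr>\n',
...            epilogue='</table>\n')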
HTML files
- petl.io.html.tohtml(table, source=None, encoding=None, errors='strict', caption=None, vrepr=str, lineterminator='\n', index_header=False, tr_style=None, td_styles=None, truncate=None)
  Write the table as HTML to a file. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tohtml(table1, 'example.html', caption='example table')
>>> print(open('example.html').read())
<table class='petl'>
<caption>example table</caption>
<thead>
<tr>
<th>foo</th>
<th>bar</th>
</tr>
</thead>
<tbody>
<tr>
<td>a</td>
<td style='text-align: right'>1</td>
</tr>
<tr>
<td>b</td>
<td style='text-align: right'>2</td>
</tr>
<tr>
<td>c</td>
<td style='text-align: right'>2</td>
</tr>
</tbody>
</table>
The caption keyword argument is used to provide a table caption in the output HTML.
JSON files
- petl.io.json.fromjson(source, *args, **kwargs)
  Extract data from a JSON file. The file must contain a JSON array as the top level object, and each member of the array will be treated as a row of data. E.g.:
>>> import petl as etl
>>> data = '''
... [{"foo": "a", "bar": 1},
... {"foo": "b", "bar": 2},
... {"foo": "c", "bar": 2}]
... '''
>>> with open('example.json', 'w') as f:
...     f.write(data)
...
74
>>> table1 = etl.fromjson('example.json', header=['foo', 'bar'])
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+
If your JSON file does not fit this structure, you will need to parse it via json.load() and select the array to treat as the data; see also petl.io.json.fromdicts().

Changed in version 1.1.0.
If no header is specified, fields will be discovered by sampling keys from the first sample objects in source. The header will be constructed from keys in the order discovered. Note that this ordering may not be stable, and therefore it may be advisable to specify an explicit header or to use another function like petl.transform.headers.sortheader() on the resulting table to guarantee stability.
- petl.io.json.fromdicts(dicts, header=None, sample=1000, missing=None)
  View a sequence of Python dict as a table. E.g.:

>>> import petl as etl
>>> dicts = [{"foo": "a", "bar": 1},
...          {"foo": "b", "bar": 2},
...          {"foo": "c", "bar": 2}]
>>> table1 = etl.fromdicts(dicts, header=['foo', 'bar'])
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+
If header is not specified, sample items from dicts will be inspected to discover dictionary keys. Note that the order in which dictionary keys are discovered may not be stable (see below).
See also petl.io.json.fromjson().

Changed in version 1.1.0.
If no header is specified, fields will be discovered by sampling keys from the first sample dictionaries in dicts. The header will be constructed from keys in the order discovered. Note that this ordering may not be stable, and therefore it may be advisable to specify an explicit header or to use another function like petl.transform.headers.sortheader() on the resulting table to guarantee stability.
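E.g., a minimal sketch of both options (sortheader() assumes petl >= 1.1.0):

>>> import petl as etl
>>> dicts = [{'foo': 'a', 'bar': 1}, {'bar': 2, 'foo': 'b'}]
>>> t1 = etl.fromdicts(dicts, header=['foo', 'bar'])   # explicit field order
>>> t2 = etl.sortheader(etl.fromdicts(dicts))          # ...or sort fields alphabetically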
- petl.io.json.tojson(table, source=None, prefix=None, suffix=None, *args, **kwargs)
  Write a table in JSON format, with rows output as JSON objects. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tojson(table1, 'example.json', sort_keys=True)
>>> # check what it did
... print(open('example.json').read())
[{"bar": 1, "foo": "a"}, {"bar": 2, "foo": "b"}, {"bar": 2, "foo": "c"}]
Note that this is currently not streaming, all data is loaded into memory before being written to the file.
- petl.io.json.tojsonarrays(table, source=None, prefix=None, suffix=None, output_header=False, *args, **kwargs)
  Write a table in JSON format, with rows output as JSON arrays. E.g.:
>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tojsonarrays(table1, 'example.json')
>>> # check what it did
... print(open('example.json').read())
[["a", 1], ["b", 2], ["c", 2]]
Note that this is currently not streaming, all data is loaded into memory before being written to the file.
Databases
Note: The automatic table creation feature of petl.io.db.todb() requires SQLAlchemy to be installed, e.g.:
$ pip install sqlalchemy
- petl.io.db.fromdb(dbo, query, *args, **kwargs)
  Provides access to data from any DB-API 2.0 connection via a given query. E.g., using sqlite3:

>>> import petl as etl
>>> import sqlite3
>>> connection = sqlite3.connect('example.db')
>>> table = etl.fromdb(connection, 'SELECT * FROM example')
E.g., using psycopg2 (assuming you’ve installed it first):

>>> import petl as etl
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> table = etl.fromdb(connection, 'SELECT * FROM example')
E.g., using pymysql (assuming you’ve installed it first):

>>> import petl as etl
>>> import pymysql
>>> connection = pymysql.connect(password='moonpie', database='thangs')
>>> table = etl.fromdb(connection, 'SELECT * FROM example')
The dbo argument may also be a function that creates a cursor. N.B., each call to the function should return a new cursor. E.g.:
>>> import petl as etl
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> mkcursor = lambda: connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
>>> table = etl.fromdb(mkcursor, 'SELECT * FROM example')
The parameter dbo may also be an SQLAlchemy engine, session or connection object.
The parameter dbo may also be a string, in which case it is interpreted as the name of a file containing an sqlite3 database.

Note that the default behaviour of most database servers and clients is for the entire result set for each query to be sent from the server to the client. If your query returns a large result set this can result in significant memory usage at the client. Some databases support server-side cursors which provide a means for client libraries to fetch result sets incrementally, reducing memory usage at the client.
To use a server-side cursor with a PostgreSQL database, e.g.:
>>> import petl as etl
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> table = etl.fromdb(lambda: connection.cursor(name='arbitrary'),
...                    'SELECT * FROM example')
For more information on server-side cursors, see the documentation for your database server and client library.
- petl.io.db.todb(table, dbo, tablename, schema=None, commit=True, create=False, drop=False, constraints=True, metadata=None, dialect=None, sample=1000)
  Load data into an existing database table via a DB-API 2.0 connection or cursor. Note that the database table will be truncated, i.e., all existing rows will be deleted prior to inserting the new data. E.g.:
>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['a', 1],
...          ['b', 2],
...          ['c', 2]]
>>> # using sqlite3
... import sqlite3
>>> connection = sqlite3.connect('example.db')
>>> # assuming table "foobar" already exists in the database
... etl.todb(table, connection, 'foobar')
>>> # using psycopg2
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> # assuming table "foobar" already exists in the database
... etl.todb(table, connection, 'foobar')
>>> # using pymysql
>>> import pymysql
>>> connection = pymysql.connect(password='moonpie', database='thangs')
>>> # tell MySQL to use standard quote character
... connection.cursor().execute('SET SQL_MODE=ANSI_QUOTES')
>>> # load data, assuming table "foobar" already exists in the database
... etl.todb(table, connection, 'foobar')
N.B., for MySQL the statement SET SQL_MODE=ANSI_QUOTES is required to ensure MySQL uses SQL-92 standard quote characters.

A cursor can also be provided instead of a connection, e.g.:
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> cursor = connection.cursor()
>>> etl.todb(table, cursor, 'foobar')
The parameter dbo may also be an SQLAlchemy engine, session or connection object.
The parameter dbo may also be a string, in which case it is interpreted as the name of a file containing an sqlite3 database.

If create=True this function will attempt to automatically create a database table before loading the data. This functionality requires SQLAlchemy to be installed.

Keyword arguments:
- table : table container
  Table data to load
- dbo : database object
  DB-API 2.0 connection, callable returning a DB-API 2.0 cursor, or SQLAlchemy connection, engine or session
- tablename : string
  Name of the table in the database
- schema : string
  Name of the database schema to find the table in
- commit : bool
  If True commit the changes
- create : bool
  If True attempt to create the table before loading, inferring types from a sample of the data (requires SQLAlchemy)
- drop : bool
  If True attempt to drop the table before recreating (only relevant if create=True)
- constraints : bool
  If True use length and nullable constraints (only relevant if create=True)
- metadata : sqlalchemy.MetaData
  Custom table metadata (only relevant if create=True)
- dialect : string
  One of {'access', 'sybase', 'sqlite', 'informix', 'firebird', 'mysql', 'oracle', 'maxdb', 'postgresql', 'mssql'} (only relevant if create=True)
- sample : int
  Number of rows to sample when inferring types etc. Set to 0 to use the whole table (only relevant if create=True)
Note: This function is in principle compatible with any DB-API 2.0 compliant database driver. However, at the time of writing some DB-API 2.0 implementations, including cx_Oracle and MySQL’s Connector/Python, are not compatible with this function, because they only accept a list argument to the cursor.executemany() function called internally by petl. This can be worked around by proxying the cursor objects, e.g.:

>>> import cx_Oracle
>>> connection = cx_Oracle.Connection(...)
>>> class CursorProxy(object):
...     def __init__(self, cursor):
...         self._cursor = cursor
...     def executemany(self, statement, parameters, **kwargs):
...         # convert parameters to a list
...         parameters = list(parameters)
...         # pass through to proxied cursor
...         return self._cursor.executemany(statement, parameters, **kwargs)
...     def __getattr__(self, item):
...         return getattr(self._cursor, item)
...
>>> def get_cursor():
...     return CursorProxy(connection.cursor())
...
>>> import petl as etl
>>> etl.todb(tbl, get_cursor, ...)
Note however that this does imply loading the entire table into memory as a list prior to inserting into the database.
- petl.io.db.appenddb(table, dbo, tablename, schema=None, commit=True)
  Load data into an existing database table via a DB-API 2.0 connection or cursor. As petl.io.db.todb() except that the database table will be appended, i.e., the new data will be inserted into the table, and any existing rows will remain.
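A minimal sketch, assuming table "foobar" already exists in the database:

>>> import petl as etl
>>> import sqlite3
>>> connection = sqlite3.connect('example.db')
>>> table = [['foo', 'bar'],
...          ['d', 7]]
>>> etl.appenddb(table, connection, 'foobar')  # existing rows are retained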
Excel .xls files (xlrd/xlwt)
Note: The following functions require xlrd and xlwt to be installed, e.g.:
$ pip install xlrd xlwt-future
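For example, a minimal sketch of the .xls functions (file and sheet names are illustrative; assumes xlrd/xlwt are installed):

>>> import petl as etl
>>> tbl = etl.fromxls('example.xls', sheet='Sheet1')  # sheet defaults to the first sheet
>>> etl.toxls(tbl, 'copy.xls', 'Sheet1')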
Excel .xlsx files (openpyxl)
- petl.io.xlsx.fromxlsx(filename, sheet=None, range_string=None, min_row=None, min_col=None, max_row=None, max_col=None, read_only=False, **kwargs)
  Extract a table from a sheet in an Excel .xlsx file.
N.B., the sheet name is case sensitive.
The sheet argument can be omitted, in which case the first sheet in the workbook is used by default.
The range_string argument can be used to provide a range string specifying a range of cells to extract.
The min_row, min_col, max_row and max_col arguments can be used to limit the range of cells to extract. They will be ignored if range_string is provided.
The read_only argument determines how openpyxl returns the loaded workbook. The default is False, as this prevents some LibreOffice files from being truncated at 65536 rows. True should be faster when access is read-only and the files were created with Microsoft Excel.
Any other keyword arguments are passed through to openpyxl.load_workbook().
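E.g., a minimal sketch (file and sheet names are illustrative):

>>> import petl as etl
>>> tbl1 = etl.fromxlsx('example.xlsx', sheet='Sheet1', min_row=2)  # limit rows extracted
>>> tbl2 = etl.fromxlsx('example.xlsx', range_string='A1:C4')       # ...or give an explicit range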
Arrays (NumPy)
- petl.io.numpy.fromarray(a)
  Extract a table from a numpy structured array, e.g.:
>>> import petl as etl
>>> import numpy as np
>>> a = np.array([('apples', 1, 2.5),
...               ('oranges', 3, 4.4),
...               ('pears', 7, 0.1)],
...              dtype='U8, i4,f4')
>>> table = etl.fromarray(a)
>>> table
+-----------+----+-----+
| f0        | f1 | f2  |
+===========+====+=====+
| 'apples'  |  1 | 2.5 |
+-----------+----+-----+
| 'oranges' |  3 | 4.4 |
+-----------+----+-----+
| 'pears'   |  7 | 0.1 |
+-----------+----+-----+
- petl.io.numpy.toarray(table, dtype=None, count=-1, sample=1000)
  Load data from the given table into a numpy structured array. E.g.:
>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> a = etl.toarray(table)
>>> a
array([('apples', 1, 2.5), ('oranges', 3, 4.4), ('pears', 7, 0.1)],
      dtype=(numpy.record, [('foo', '<U7'), ('bar', '<i8'), ('baz', '<f8')]))
>>> # the dtype can be specified as a string
... a = etl.toarray(table, dtype='a4, i2, f4')
>>> a
array([(b'appl', 1, 2.5), (b'oran', 3, 4.4), (b'pear', 7, 0.1)],
      dtype=[('foo', 'S4'), ('bar', '<i2'), ('baz', '<f4')])
>>> # the dtype can also be partially specified
... a = etl.toarray(table, dtype={'foo': 'a4'})
>>> a
array([(b'appl', 1, 2.5), (b'oran', 3, 4.4), (b'pear', 7, 0.1)],
      dtype=[('foo', 'S4'), ('bar', '<i8'), ('baz', '<f8')])
If the dtype is not completely specified, sample rows will be examined to infer an appropriate dtype.
- petl.io.numpy.torecarray(*args, **kwargs)
  Convenient shorthand for toarray(*args, **kwargs).view(np.recarray).
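E.g., a minimal sketch; a recarray additionally allows attribute-style access to columns:

>>> import petl as etl
>>> table = [('foo', 'bar'), ('a', 1), ('b', 2)]
>>> ra = etl.torecarray(table)
>>> ra.bar  # fields are accessible as attributes
array([1, 2])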
- petl.io.numpy.valuestoarray(vals, dtype=None, count=-1, sample=1000)
  Load values from a table column into a numpy array, e.g.:
>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> table = etl.wrap(table)
>>> table.values('bar').array()
array([1, 3, 7])
>>> # specify dtype
... table.values('bar').array(dtype='i4')
array([1, 3, 7], dtype=int32)
DataFrames (pandas)
- petl.io.pandas.fromdataframe(df, include_index=False)
  Extract a table from a pandas DataFrame. E.g.:
>>> import petl as etl
>>> import pandas as pd
>>> records = [('apples', 1, 2.5), ('oranges', 3, 4.4), ('pears', 7, 0.1)]
>>> df = pd.DataFrame.from_records(records, columns=('foo', 'bar', 'baz'))
>>> table = etl.fromdataframe(df)
>>> table
+-----------+-----+-----+
| foo       | bar | baz |
+===========+=====+=====+
| 'apples'  |   1 | 2.5 |
+-----------+-----+-----+
| 'oranges' |   3 | 4.4 |
+-----------+-----+-----+
| 'pears'   |   7 | 0.1 |
+-----------+-----+-----+
- petl.io.pandas.todataframe(table, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)
  Load data from the given table into a pandas DataFrame. E.g.:
>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> df = etl.todataframe(table)
>>> df
       foo  bar  baz
0   apples    1  2.5
1  oranges    3  4.4
2    pears    7  0.1
HDF5 files (PyTables)
Note: The following functions require PyTables to be installed, e.g.:
$ # install HDF5
$ apt-get install libhdf5-7 libhdf5-dev
$ # install other prerequisites
$ pip install cython
$ pip install numpy
$ pip install numexpr
$ # install PyTables
$ pip install tables
- petl.io.pytables.fromhdf5(source, where=None, name=None, condition=None, condvars=None, start=None, stop=None, step=None)
  Provides access to an HDF5 table. E.g.:
>>> import petl as etl
>>> import tables
>>> # set up a new hdf5 table to demonstrate with
... h5file = tables.open_file('example.h5', mode='w',
...                           title='Example file')
>>> h5file.create_group('/', 'testgroup', 'Test Group')
/testgroup (Group) 'Test Group'
  children := []
>>> class FooBar(tables.IsDescription):
...     foo = tables.Int32Col(pos=0)
...     bar = tables.StringCol(6, pos=2)
...
>>> h5table = h5file.create_table('/testgroup', 'testtable', FooBar,
...                               'Test Table')
>>> # load some data into the table
... table1 = (('foo', 'bar'),
...           (1, b'asdfgh'),
...           (2, b'qwerty'),
...           (3, b'zxcvbn'))
>>> for row in table1[1:]:
...     for i, f in enumerate(table1[0]):
...         h5table.row[f] = row[i]
...     h5table.row.append()
...
>>> h5file.flush()
>>> h5file.close()
>>> #
... # now demonstrate use of fromhdf5
... table1 = etl.fromhdf5('example.h5', '/testgroup', 'testtable')
>>> table1
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'asdfgh' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
|   3 | b'zxcvbn' |
+-----+-----------+
>>> # alternatively just specify path to table node
... table1 = etl.fromhdf5('example.h5', '/testgroup/testtable')
>>> # ...or use an existing tables.File object
... h5file = tables.open_file('example.h5')
>>> table1 = etl.fromhdf5(h5file, '/testgroup/testtable')
>>> # ...or use an existing tables.Table object
... h5tbl = h5file.get_node('/testgroup/testtable')
>>> table1 = etl.fromhdf5(h5tbl)
>>> # use a condition to filter data
... table2 = etl.fromhdf5(h5tbl, condition='foo < 3')
>>> table2
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'asdfgh' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
>>> h5file.close()
- petl.io.pytables.fromhdf5sorted(source, where=None, name=None, sortby=None, checkCSI=False, start=None, stop=None, step=None)
  Provides access to an HDF5 table, sorted by an indexed column, e.g.:
>>> import petl as etl
>>> import tables
>>> # set up a new hdf5 table to demonstrate with
... h5file = tables.open_file('example.h5', mode='w', title='Test file')
>>> h5file.create_group('/', 'testgroup', 'Test Group')
/testgroup (Group) 'Test Group'
  children := []
>>> class FooBar(tables.IsDescription):
...     foo = tables.Int32Col(pos=0)
...     bar = tables.StringCol(6, pos=2)
...
>>> h5table = h5file.create_table('/testgroup', 'testtable', FooBar, 'Test Table')
>>> # load some data into the table
... table1 = (('foo', 'bar'),
...           (3, b'asdfgh'),
...           (2, b'qwerty'),
...           (1, b'zxcvbn'))
>>> for row in table1[1:]:
...     for i, f in enumerate(table1[0]):
...         h5table.row[f] = row[i]
...     h5table.row.append()
...
>>> h5table.cols.foo.create_csindex()  # CS index is required
0
>>> h5file.flush()
>>> h5file.close()
>>> #
... # access the data, sorted by the indexed column
... table2 = etl.fromhdf5sorted('example.h5', '/testgroup', 'testtable',
...                             sortby='foo')
>>> table2
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'zxcvbn' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
|   3 | b'asdfgh' |
+-----+-----------+
- petl.io.pytables.tohdf5(table, source, where=None, name=None, create=False, drop=False, description=None, title='', filters=None, expectedrows=10000, chunkshape=None, byteorder=None, createparents=False, sample=1000)
  Write to an HDF5 table. If create is False, assumes the table already exists, and attempts to truncate it before loading. If create is True, a new table will be created, and if drop is True, any existing table will be dropped first. If description is None, the description will be guessed. E.g.:
>>> import petl as etl
>>> table1 = (('foo', 'bar'),
...           (1, b'asdfgh'),
...           (2, b'qwerty'),
...           (3, b'zxcvbn'))
>>> etl.tohdf5(table1, 'example.h5', '/testgroup', 'testtable',
...            drop=True, create=True, createparents=True)
>>> etl.fromhdf5('example.h5', '/testgroup', 'testtable')
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'asdfgh' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
|   3 | b'zxcvbn' |
+-----+-----------+
Bcolz ctables
- petl.io.bcolz.frombcolz(source, expression=None, outcols=None, limit=None, skip=0)
  Extract a table from a bcolz ctable, e.g.:
>>> import petl as etl
>>> import bcolz
>>> cols = [
...     ['apples', 'oranges', 'pears'],
...     [1, 3, 7],
...     [2.5, 4.4, .1]
... ]
>>> names = ('foo', 'bar', 'baz')
>>> ctbl = bcolz.ctable(cols, names=names)
>>> tbl = etl.frombcolz(ctbl)
>>> tbl
+-----------+-----+-----+
| foo       | bar | baz |
+===========+=====+=====+
| 'apples'  |   1 | 2.5 |
+-----------+-----+-----+
| 'oranges' |   3 | 4.4 |
+-----------+-----+-----+
| 'pears'   |   7 | 0.1 |
+-----------+-----+-----+
If expression is provided it will be executed by bcolz and only matching rows returned, e.g.:
>>> tbl2 = etl.frombcolz(ctbl, expression='bar > 1')
>>> tbl2
+-----------+-----+-----+
| foo       | bar | baz |
+===========+=====+=====+
| 'oranges' |   3 | 4.4 |
+-----------+-----+-----+
| 'pears'   |   7 | 0.1 |
+-----------+-----+-----+
New in version 1.1.0.
- petl.io.bcolz.tobcolz(table, dtype=None, sample=1000, **kwargs)
  Load data into a bcolz ctable, e.g.:
>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> ctbl = etl.tobcolz(table)
>>> ctbl
ctable((3,), [('foo', '<U7'), ('bar', '<i8'), ('baz', '<f8')])
  nbytes: 132; cbytes: 1023.98 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[('apples', 1, 2.5) ('oranges', 3, 4.4) ('pears', 7, 0.1)]
>>> ctbl.names
['foo', 'bar', 'baz']
>>> ctbl['foo']
carray((3,), <U7)
  nbytes := 84; cbytes := 511.98 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 18724; chunksize: 524272; blocksize: 0
['apples' 'oranges' 'pears']
Other keyword arguments are passed through to the ctable constructor.
New in version 1.1.0.
Text indexes (Whoosh)
- petl.io.whoosh.fromtextindex(index_or_dirname, indexname=None, docnum_field=None)
  Extract all documents from a Whoosh index. E.g.:
>>> import petl as etl
>>> import os
>>> # set up an index and load some documents via the Whoosh API
... from whoosh.index import create_in
>>> from whoosh.fields import *
>>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
...                 content=TEXT)
>>> dirname = 'example.whoosh'
>>> if not os.path.exists(dirname):
...     os.mkdir(dirname)
...
>>> index = create_in(dirname, schema)
>>> writer = index.writer()
>>> writer.add_document(title=u"First document", path=u"/a",
...                     content=u"This is the first document we've added!")
>>> writer.add_document(title=u"Second document", path=u"/b",
...                     content=u"The second one is even more interesting!")
>>> writer.commit()
>>> # extract documents as a table
... table = etl.fromtextindex(dirname)
>>> table
+------+-------------------+
| path | title             |
+======+===================+
| '/a' | 'First document'  |
+------+-------------------+
| '/b' | 'Second document' |
+------+-------------------+
Keyword arguments:
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is stored.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- docnum_field
  If not None, an extra field will be added to the output table containing the internal document number stored in the index. The name of the field will be the value of this argument.
- petl.io.whoosh.searchtextindex(index_or_dirname, query, limit=10, indexname=None, docnum_field=None, score_field=None, fieldboosts=None, search_kwargs=None)
  Search a Whoosh index using a query. E.g.:
>>> import petl as etl
>>> import os
>>> # set up an index and load some documents via the Whoosh API
... from whoosh.index import create_in
>>> from whoosh.fields import *
>>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
...                 content=TEXT)
>>> dirname = 'example.whoosh'
>>> if not os.path.exists(dirname):
...     os.mkdir(dirname)
...
>>> index = create_in('example.whoosh', schema)
>>> writer = index.writer()
>>> writer.add_document(title=u"Oranges", path=u"/a",
...                     content=u"This is the first document we've added!")
>>> writer.add_document(title=u"Apples", path=u"/b",
...                     content=u"The second document is even more "
...                             u"interesting!")
>>> writer.commit()
>>> # demonstrate the use of searchtextindex()
... table1 = etl.searchtextindex('example.whoosh', 'oranges')
>>> table1
+------+-----------+
| path | title     |
+======+===========+
| '/a' | 'Oranges' |
+------+-----------+
>>> table2 = etl.searchtextindex('example.whoosh', 'doc*')
>>> table2
+------+-----------+
| path | title     |
+======+===========+
| '/a' | 'Oranges' |
+------+-----------+
| '/b' | 'Apples'  |
+------+-----------+
Keyword arguments:
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
- query
  Either a string or an instance of whoosh.query.Query. If a string, it will be parsed as a multi-field query, i.e., any terms not bound to a specific field will match any field.
- limit
  Return at most limit results.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- docnum_field
  If not None, an extra field will be added to the output table containing the internal document number stored in the index. The name of the field will be the value of this argument.
- score_field
  If not None, an extra field will be added to the output table containing the score of the result. The name of the field will be the value of this argument.
- fieldboosts
  An optional dictionary mapping field names to boosts.
- search_kwargs
  Any extra keyword arguments to be passed through to the Whoosh search() method.
- petl.io.whoosh.searchtextindexpage(index_or_dirname, query, pagenum, pagelen=10, indexname=None, docnum_field=None, score_field=None, fieldboosts=None, search_kwargs=None)
  Search an index using a query, returning a result page.
Keyword arguments:
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
- query
  Either a string or an instance of whoosh.query.Query. If a string, it will be parsed as a multi-field query, i.e., any terms not bound to a specific field will match any field.
- pagenum
  Number of the page to return (e.g., 1 = first page).
- pagelen
  Number of results per page.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- docnum_field
  If not None, an extra field will be added to the output table containing the internal document number stored in the index. The name of the field will be the value of this argument.
- score_field
  If not None, an extra field will be added to the output table containing the score of the result. The name of the field will be the value of this argument.
- fieldboosts
  An optional dictionary mapping field names to boosts.
- search_kwargs
  Any extra keyword arguments to be passed through to the Whoosh search() method.
- petl.io.whoosh.totextindex(table, index_or_dirname, schema=None, indexname=None, merge=False, optimize=False)
  Load all rows from table into a Whoosh index. N.B., this will clear any existing data in the index before loading. E.g.:
>>> import petl as etl
>>> import datetime
>>> import os
>>> # here is the table we want to load into an index
... table = (('f0', 'f1', 'f2', 'f3', 'f4'),
...          ('AAA', 12, 4.3, True, datetime.datetime.now()),
...          ('BBB', 6, 3.4, False, datetime.datetime(1900, 1, 31)),
...          ('CCC', 42, 7.8, True, datetime.datetime(2100, 12, 25)))
>>> # define a schema for the index
... from whoosh.fields import *
>>> schema = Schema(f0=TEXT(stored=True),
...                 f1=NUMERIC(int, stored=True),
...                 f2=NUMERIC(float, stored=True),
...                 f3=BOOLEAN(stored=True),
...                 f4=DATETIME(stored=True))
>>> # load index
... dirname = 'example.whoosh'
>>> if not os.path.exists(dirname):
...     os.mkdir(dirname)
...
>>> etl.totextindex(table, dirname, schema=schema)
Keyword arguments:
- table
  A table container with the data to be loaded.
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
- schema
  Index schema to use if creating the index.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- merge
  Merge small segments during commit?
- optimize
  Merge all segments together?
- petl.io.whoosh.appendtextindex(table, index_or_dirname, indexname=None, merge=True, optimize=False)
  Load all rows from table into a Whoosh index, adding them to any existing data in the index.
Keyword arguments:
- table
  A table container with the data to be loaded.
- index_or_dirname
  Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
- indexname
  String containing the name of the index, if multiple indexes are stored in the same directory.
- merge
  Merge small segments during commit?
- optimize
  Merge all segments together?
I/O helper classes
The following classes are helpers for extract (from...()) and load (to...()) functions that use a file-like data source.

An instance of any of the following classes can be used as the source argument to data extraction functions like petl.io.csv.fromcsv() etc., with the exception of petl.io.sources.StdoutSource which is write-only.

An instance of any of the following classes can also be used as the source argument to data loading functions like petl.io.csv.tocsv() etc., with the exception of petl.io.sources.StdinSource, petl.io.sources.URLSource and petl.io.sources.PopenSource which are read-only.

The behaviour of each source can usually be configured by passing arguments to the constructor; see the source code of the petl.io.sources module for full details.
- class petl.io.sources.MemorySource(s=None)
  Memory data source. E.g.:
>>> import petl as etl
>>> data = b'foo,bar\na,1\nb,2\nc,2\n'
>>> source = etl.MemorySource(data)
>>> tbl = etl.fromcsv(source)
>>> tbl
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+
>>> sink = etl.MemorySource()
>>> tbl.tojson(sink)
>>> sink.getvalue()
b'[{"foo": "a", "bar": "1"}, {"foo": "b", "bar": "2"}, {"foo": "c", "bar": "2"}]'
Also supports appending.