Utility functions

Basic utilities

petl.util.base.header(table)
Return the header row for the given table. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.header(table)
('foo', 'bar')

Note that the header row will always be returned as a tuple, regardless of what the underlying data are.
petl.util.base.fieldnames(table)
Return the string values of the header row. If the header row contains only strings, then this function is equivalent to header(), i.e.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.fieldnames(table)
('foo', 'bar')
>>> etl.header(table)
('foo', 'bar')
petl.util.base.data(table, *sliceargs)
Return a container supporting iteration over data rows in a given table (i.e., without the header). E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> d = etl.data(table)
>>> list(d)
[['a', 1], ['b', 2]]

Positional arguments can be used to slice the data rows. The sliceargs are passed to itertools.islice().
petl.util.base.values(table, *field, **kwargs)
Return a container supporting iteration over values in a given field or fields. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', True],
...           ['b'],
...           ['b', True],
...           ['c', False]]
>>> foo = etl.values(table1, 'foo')
>>> foo
foo: 'a', 'b', 'b', 'c'
>>> list(foo)
['a', 'b', 'b', 'c']
>>> bar = etl.values(table1, 'bar')
>>> bar
bar: True, None, True, False
>>> list(bar)
[True, None, True, False]
>>> # values from multiple fields
... table2 = [['foo', 'bar', 'baz'],
...           [1, 'a', True],
...           [2, 'bb', True],
...           [3, 'd', False]]
>>> foobaz = etl.values(table2, 'foo', 'baz')
>>> foobaz
('foo', 'baz'): (1, True), (2, True), (3, False)
>>> list(foobaz)
[(1, True), (2, True), (3, False)]

The field argument can be a single field name or index (starting from zero) or a tuple of field names and/or indexes. Multiple fields can also be provided as positional arguments.

If rows are uneven, the value of the missing keyword argument is returned.
petl.util.base.dicts(table, *sliceargs, **kwargs)
Return a container supporting iteration over rows as dicts. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> d = etl.dicts(table)
>>> d
{'foo': 'a', 'bar': 1}
{'foo': 'b', 'bar': 2}
>>> list(d)
[{'foo': 'a', 'bar': 1}, {'foo': 'b', 'bar': 2}]

Short rows are padded with the value of the missing keyword argument.
petl.util.base.namedtuples(table, *sliceargs, **kwargs)
View the table as a container of named tuples. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> d = etl.namedtuples(table)
>>> d
row(foo='a', bar=1)
row(foo='b', bar=2)
>>> list(d)
[row(foo='a', bar=1), row(foo='b', bar=2)]

Short rows are padded with the value of the missing keyword argument.

The name keyword argument can be given to override the name of the named tuple class (defaults to 'row').
petl.util.base.records(table, *sliceargs, **kwargs)
Return a container supporting iteration over rows as records, where a record is a hybrid object supporting all possible ways of accessing values. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> d = etl.records(table)
>>> d
('a', 1)
('b', 2)
>>> list(d)
[('a', 1), ('b', 2)]
>>> [r[0] for r in d]
['a', 'b']
>>> [r['foo'] for r in d]
['a', 'b']
>>> [r.foo for r in d]
['a', 'b']

Short rows are padded with the value of the missing keyword argument.
petl.util.base.expr(s)
Construct a function operating on a table record.

The expression string is converted into a lambda function by prepending the string with 'lambda rec: ', then replacing anything enclosed in curly braces (e.g., "{foo}") with a lookup on the record (e.g., "rec['foo']"), then finally calling eval().

So, e.g., the expression string "{foo} * {bar}" is converted to the function lambda rec: rec['foo'] * rec['bar'].
petl.util.base.rowgroupby(table, key, value=None)
Convenient adapter for itertools.groupby(). E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 3, True],
...           ['b', 2]]
>>> # group entire rows
... for key, group in etl.rowgroupby(table1, 'foo'):
...     print(key, list(group))
...
a [('a', 1, True)]
b [('b', 3, True), ('b', 2)]
>>> # group specific values
... for key, group in etl.rowgroupby(table1, 'foo', 'bar'):
...     print(key, list(group))
...
a [1]
b [3, 2]

N.B., assumes the input table is already sorted by the given key.
petl.util.base.empty()
Return an empty table. Can be useful when building up a table from a set of columns, e.g.:

>>> import petl as etl
>>> table = (
...     etl
...     .empty()
...     .addcolumn('foo', ['A', 'B'])
...     .addcolumn('bar', [1, 2])
... )
>>> table
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   1 |
+-----+-----+
| 'B' |   2 |
+-----+-----+
Visualising tables
petl.util.vis.look(table, limit=0, vrepr=None, index_header=None, style=None, truncate=None, width=None)
Format a portion of the table as text for inspection in an interactive session. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2]]
>>> etl.look(table1)
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+

>>> # alternative formatting styles
... etl.look(table1, style='simple')
===  ===
foo  bar
===  ===
'a'    1
'b'    2
===  ===

>>> etl.look(table1, style='minimal')
foo  bar
'a'    1
'b'    2

>>> # any irregularities in the length of header and/or data
... # rows will appear as blank cells
... table2 = [['foo', 'bar'],
...           ['a'],
...           ['b', 2, True]]
>>> etl.look(table2)
+-----+-----+------+
| foo | bar |      |
+=====+=====+======+
| 'a' |     |      |
+-----+-----+------+
| 'b' |   2 | True |
+-----+-----+------+

Three alternative presentation styles are available: 'grid', 'simple' and 'minimal', where 'grid' is the default. A different style can be specified using the style keyword argument. The default style can also be changed by setting petl.config.look_style.
petl.util.vis.lookall(table, **kwargs)
Format the entire table as text for inspection in an interactive session.

N.B., this will load the entire table into memory.

See also petl.util.vis.look() and petl.util.vis.see().
petl.util.vis.see(table, limit=0, vrepr=None, index_header=None)
Format a portion of a table as text in a column-oriented layout for inspection in an interactive session. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.see(table)
foo: 'a', 'b'
bar: 1, 2

Useful for tables with a larger number of fields.
Lookup data structures
petl.util.lookups.lookup(table, key, value=None, dictionary=None)
Load a dictionary with data from the given table. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['b', 3]]
>>> lkp = etl.lookup(table1, 'foo', 'bar')
>>> lkp['a']
[1]
>>> lkp['b']
[2, 3]
>>> # if no value argument is given, defaults to the whole
... # row (as a tuple)
... lkp = etl.lookup(table1, 'foo')
>>> lkp['a']
[('a', 1)]
>>> lkp['b']
[('b', 2), ('b', 3)]
>>> # compound keys are supported
... table2 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 2, False],
...           ['b', 3, True],
...           ['b', 3, False]]
>>> lkp = etl.lookup(table2, ('foo', 'bar'), 'baz')
>>> lkp[('a', 1)]
[True]
>>> lkp[('b', 2)]
[False]
>>> lkp[('b', 3)]
[True, False]
>>> # data can be loaded into an existing dictionary-like
... # object, including persistent dictionaries created via the
... # shelve module
... import shelve
>>> lkp = shelve.open('example.dat', flag='n')
>>> lkp = etl.lookup(table1, 'foo', 'bar', lkp)
>>> lkp.close()
>>> lkp = shelve.open('example.dat', flag='r')
>>> lkp['a']
[1]
>>> lkp['b']
[2, 3]
petl.util.lookups.lookupone(table, key, value=None, dictionary=None, strict=False)
Load a dictionary with data from the given table, assuming there is at most one value for each key. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['b', 3]]
>>> # if the specified key is not unique and strict=False (default),
... # the first value wins
... lkp = etl.lookupone(table1, 'foo', 'bar')
>>> lkp['a']
1
>>> lkp['b']
2
>>> # if the specified key is not unique and strict=True, will raise
... # DuplicateKeyError
... try:
...     lkp = etl.lookupone(table1, 'foo', strict=True)
... except etl.errors.DuplicateKeyError as e:
...     print(e)
...
duplicate key: 'b'
>>> # compound keys are supported
... table2 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 2, False],
...           ['b', 3, True],
...           ['b', 3, False]]
>>> lkp = etl.lookupone(table2, ('foo', 'bar'), 'baz')
>>> lkp[('a', 1)]
True
>>> lkp[('b', 2)]
False
>>> lkp[('b', 3)]
True
>>> # data can be loaded into an existing dictionary-like
... # object, including persistent dictionaries created via the
... # shelve module
... import shelve
>>> lkp = shelve.open('example.dat', flag='n')
>>> lkp = etl.lookupone(table1, 'foo', 'bar', lkp)
>>> lkp.close()
>>> lkp = shelve.open('example.dat', flag='r')
>>> lkp['a']
1
>>> lkp['b']
2
petl.util.lookups.dictlookup(table, key, dictionary=None)
Load a dictionary with data from the given table, mapping to dicts. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['b', 3]]
>>> lkp = etl.dictlookup(table1, 'foo')
>>> lkp['a']
[{'foo': 'a', 'bar': 1}]
>>> lkp['b']
[{'foo': 'b', 'bar': 2}, {'foo': 'b', 'bar': 3}]
>>> # compound keys are supported
... table2 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 2, False],
...           ['b', 3, True],
...           ['b', 3, False]]
>>> lkp = etl.dictlookup(table2, ('foo', 'bar'))
>>> lkp[('a', 1)]
[{'foo': 'a', 'bar': 1, 'baz': True}]
>>> lkp[('b', 2)]
[{'foo': 'b', 'bar': 2, 'baz': False}]
>>> lkp[('b', 3)]
[{'foo': 'b', 'bar': 3, 'baz': True}, {'foo': 'b', 'bar': 3, 'baz': False}]
>>> # data can be loaded into an existing dictionary-like
... # object, including persistent dictionaries created via the
... # shelve module
... import shelve
>>> lkp = shelve.open('example.dat', flag='n')
>>> lkp = etl.dictlookup(table1, 'foo', lkp)
>>> lkp.close()
>>> lkp = shelve.open('example.dat', flag='r')
>>> lkp['a']
[{'foo': 'a', 'bar': 1}]
>>> lkp['b']
[{'foo': 'b', 'bar': 2}, {'foo': 'b', 'bar': 3}]
petl.util.lookups.dictlookupone(table, key, dictionary=None, strict=False)
Load a dictionary with data from the given table, mapping to dicts, assuming there is at most one row for each key. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['b', 3]]
>>> # if the specified key is not unique and strict=False (default),
... # the first value wins
... lkp = etl.dictlookupone(table1, 'foo')
>>> lkp['a']
{'foo': 'a', 'bar': 1}
>>> lkp['b']
{'foo': 'b', 'bar': 2}
>>> # if the specified key is not unique and strict=True, will raise
... # DuplicateKeyError
... try:
...     lkp = etl.dictlookupone(table1, 'foo', strict=True)
... except etl.errors.DuplicateKeyError as e:
...     print(e)
...
duplicate key: 'b'
>>> # compound keys are supported
... table2 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 2, False],
...           ['b', 3, True],
...           ['b', 3, False]]
>>> lkp = etl.dictlookupone(table2, ('foo', 'bar'))
>>> lkp[('a', 1)]
{'foo': 'a', 'bar': 1, 'baz': True}
>>> lkp[('b', 2)]
{'foo': 'b', 'bar': 2, 'baz': False}
>>> lkp[('b', 3)]
{'foo': 'b', 'bar': 3, 'baz': True}
>>> # data can be loaded into an existing dictionary-like
... # object, including persistent dictionaries created via the
... # shelve module
... import shelve
>>> lkp = shelve.open('example.dat', flag='n')
>>> lkp = etl.dictlookupone(table1, 'foo', lkp)
>>> lkp.close()
>>> lkp = shelve.open('example.dat', flag='r')
>>> lkp['a']
{'foo': 'a', 'bar': 1}
>>> lkp['b']
{'foo': 'b', 'bar': 2}
Parsing string/text values
petl.util.parsers.dateparser(fmt, strict=True)
Return a function to parse strings as datetime.date objects using a given format. E.g.:

>>> from petl import dateparser
>>> isodate = dateparser('%Y-%m-%d')
>>> isodate('2002-12-25')
datetime.date(2002, 12, 25)
>>> try:
...     isodate('2002-02-30')
... except ValueError as e:
...     print(e)
...
day is out of range for month

If strict=False, any value that cannot be parsed is returned as-is and no error is raised.
petl.util.parsers.timeparser(fmt, strict=True)
Return a function to parse strings as datetime.time objects using a given format. E.g.:

>>> from petl import timeparser
>>> isotime = timeparser('%H:%M:%S')
>>> isotime('00:00:00')
datetime.time(0, 0)
>>> isotime('13:00:00')
datetime.time(13, 0)
>>> try:
...     isotime('12:00:99')
... except ValueError as e:
...     print(e)
...
unconverted data remains: 9
>>> try:
...     isotime('25:00:00')
... except ValueError as e:
...     print(e)
...
time data '25:00:00' does not match format '%H:%M:%S'

If strict=False, any value that cannot be parsed is returned as-is and no error is raised.
petl.util.parsers.datetimeparser(fmt, strict=True)
Return a function to parse strings as datetime.datetime objects using a given format. E.g.:

>>> from petl import datetimeparser
>>> isodatetime = datetimeparser('%Y-%m-%dT%H:%M:%S')
>>> isodatetime('2002-12-25T00:00:00')
datetime.datetime(2002, 12, 25, 0, 0)
>>> try:
...     isodatetime('2002-12-25T00:00:99')
... except ValueError as e:
...     print(e)
...
unconverted data remains: 9

If strict=False, any value that cannot be parsed is returned as-is and no error is raised.
petl.util.parsers.boolparser(true_strings=('true', 't', 'yes', 'y', '1'), false_strings=('false', 'f', 'no', 'n', '0'), case_sensitive=False, strict=True)
Return a function to parse strings as bool objects using a given set of string representations for True and False. E.g.:

>>> from petl import boolparser
>>> mybool = boolparser(true_strings=['yes', 'y'], false_strings=['no', 'n'])
>>> mybool('y')
True
>>> mybool('yes')
True
>>> mybool('Y')
True
>>> mybool('No')
False
>>> try:
...     mybool('foo')
... except ValueError as e:
...     print(e)
...
value is not one of recognised boolean strings: 'foo'
>>> try:
...     mybool('True')
... except ValueError as e:
...     print(e)
...
value is not one of recognised boolean strings: 'true'

If strict=False, any value that cannot be parsed is returned as-is and no error is raised.
Counting
petl.util.counting.nrows(table)
Count the number of data rows in a table. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.nrows(table)
2
petl.util.counting.valuecount(table, field, value, missing=None)
Count the number of occurrences of value under the given field. Returns the absolute count and relative frequency as a pair. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['a', 1],
...          ['b', 2],
...          ['b', 7]]
>>> etl.valuecount(table, 'foo', 'b')
(2, 0.6666666666666666)

The field argument can be a single field name or index (starting from zero) or a tuple of field names and/or indexes.
petl.util.counting.valuecounter(table, *field, **kwargs)
Find distinct values for the given field and count the number of occurrences. Returns a dict mapping values to counts. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['a', True],
...          ['b'],
...          ['b', True],
...          ['c', False]]
>>> etl.valuecounter(table, 'foo')
Counter({'b': 2, 'a': 1, 'c': 1})

The field argument can be a single field name or index (starting from zero) or a tuple of field names and/or indexes.
petl.util.counting.valuecounts(table, *field, **kwargs)
Find distinct values for the given field and count the number and relative frequency of occurrences. Returns a table mapping values to counts, with most common values first. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['a', True, 0.12],
...          ['a', True, 0.17],
...          ['b', False, 0.34],
...          ['b', False, 0.44],
...          ['b']]
>>> etl.valuecounts(table, 'foo')
+-----+-------+-----------+
| foo | count | frequency |
+=====+=======+===========+
| 'b' |     3 |       0.6 |
+-----+-------+-----------+
| 'a' |     2 |       0.4 |
+-----+-------+-----------+

>>> etl.valuecounts(table, 'foo', 'bar')
+-----+-------+-------+-----------+
| foo | bar   | count | frequency |
+=====+=======+=======+===========+
| 'a' |  True |     2 |       0.4 |
+-----+-------+-------+-----------+
| 'b' | False |     2 |       0.4 |
+-----+-------+-------+-----------+
| 'b' |  None |     1 |       0.2 |
+-----+-------+-------+-----------+

If rows are short, the value of the keyword argument missing is counted.

Multiple fields can be given as positional arguments. If multiple fields are given, these are treated as a compound key.
petl.util.counting.stringpatterncounter(table, field)
Profile string patterns in the given field, returning a dict mapping patterns to counts.
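The idea behind the pattern profiling can be illustrated with a small sketch: each character is mapped to a code (uppercase letters to 'A', lowercase letters to 'a', digits to '9', everything else kept as-is) and the resulting patterns are counted. This is only an illustration of the concept, not petl's actual implementation:

```python
from collections import Counter

def pattern(value):
    # Map each character to a pattern code: uppercase -> 'A',
    # lowercase -> 'a', digit -> '9'; other characters are kept as-is.
    # This matches the pattern style shown in the stringpatterns()
    # example ('Aa. Aaa', '999-9999') but is a sketch of the idea only.
    chars = []
    for c in str(value):
        if c.isupper():
            chars.append('A')
        elif c.islower():
            chars.append('a')
        elif c.isdigit():
            chars.append('9')
        else:
            chars.append(c)
    return ''.join(chars)

values = ['Mr. Foo', 'Mrs. Bar', 'Mr. Spo']
counts = Counter(pattern(v) for v in values)
print(counts)  # Counter({'Aa. Aaa': 2, 'Aaa. Aaa': 1})
```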
petl.util.counting.stringpatterns(table, field)
Profile string patterns in the given field, returning a table of patterns, counts and frequencies. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['Mr. Foo', '123-1254'],
...          ['Mrs. Bar', '234-1123'],
...          ['Mr. Spo', '123-1254'],
...          [u'Mr. Baz', u'321 1434'],
...          [u'Mrs. Baz', u'321 1434'],
...          ['Mr. Quux', '123-1254-XX']]
>>> etl.stringpatterns(table, 'foo')
+------------+-------+---------------------+
| pattern    | count | frequency           |
+============+=======+=====================+
| 'Aa. Aaa'  |     3 |                 0.5 |
+------------+-------+---------------------+
| 'Aaa. Aaa' |     2 |  0.3333333333333333 |
+------------+-------+---------------------+
| 'Aa. Aaaa' |     1 | 0.16666666666666666 |
+------------+-------+---------------------+

>>> etl.stringpatterns(table, 'bar')
+---------------+-------+---------------------+
| pattern       | count | frequency           |
+===============+=======+=====================+
| '999-9999'    |     3 |                 0.5 |
+---------------+-------+---------------------+
| '999 9999'    |     2 |  0.3333333333333333 |
+---------------+-------+---------------------+
| '999-9999-AA' |     1 | 0.16666666666666666 |
+---------------+-------+---------------------+
petl.util.counting.rowlengths(table)
Report on row lengths found in the table. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 1, 2],
...          ['B', '2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', 'xyz', 9.0],
...          ['E', None],
...          ['F', 9]]
>>> etl.rowlengths(table)
+--------+-------+
| length | count |
+========+=======+
|      3 |     3 |
+--------+-------+
|      2 |     2 |
+--------+-------+
|      4 |     1 |
+--------+-------+

Useful for finding potential problems in data files.
petl.util.counting.typecounter(table, field)
Count the number of values found for each Python type. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 1, 2],
...          ['B', u'2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', u'xyz', 9.0],
...          ['E', 42]]
>>> etl.typecounter(table, 'foo')
Counter({'str': 5})
>>> etl.typecounter(table, 'bar')
Counter({'str': 3, 'int': 2})
>>> etl.typecounter(table, 'baz')
Counter({'str': 2, 'int': 1, 'float': 1, 'NoneType': 1})

The field argument can be a field name or index (starting from zero).
petl.util.counting.typecounts(table, field)
Count the number of values found for each Python type and return a table mapping class names to counts and frequencies. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          [b'A', 1, 2],
...          [b'B', '2', b'3.4'],
...          ['B', '3', '7.8', True],
...          ['D', u'xyz', 9.0],
...          ['E', 42]]
>>> etl.typecounts(table, 'foo')
+---------+-------+-----------+
| type    | count | frequency |
+=========+=======+===========+
| 'str'   |     3 |       0.6 |
+---------+-------+-----------+
| 'bytes' |     2 |       0.4 |
+---------+-------+-----------+

>>> etl.typecounts(table, 'bar')
+-------+-------+-----------+
| type  | count | frequency |
+=======+=======+===========+
| 'str' |     3 |       0.6 |
+-------+-------+-----------+
| 'int' |     2 |       0.4 |
+-------+-------+-----------+

>>> etl.typecounts(table, 'baz')
+------------+-------+-----------+
| type       | count | frequency |
+============+=======+===========+
| 'int'      |     1 |       0.2 |
+------------+-------+-----------+
| 'bytes'    |     1 |       0.2 |
+------------+-------+-----------+
| 'str'      |     1 |       0.2 |
+------------+-------+-----------+
| 'float'    |     1 |       0.2 |
+------------+-------+-----------+
| 'NoneType' |     1 |       0.2 |
+------------+-------+-----------+

The field argument can be a field name or index (starting from zero).
petl.util.counting.parsecounter(table, field, parsers=(('int', int), ('float', float)))
Count the number of str or unicode values under the given field that can be parsed as ints, floats or via custom parser functions. Return a pair of Counter objects, the first mapping parser names to the number of strings successfully parsed, the second mapping parser names to the number of errors. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 'aaa', 2],
...          ['B', u'2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', '3.7', 9.0],
...          ['E', 42]]
>>> counter, errors = etl.parsecounter(table, 'bar')
>>> counter
Counter({'float': 3, 'int': 2})
>>> errors
Counter({'int': 2, 'float': 1})

The field argument can be a field name or index (starting from zero).
petl.util.counting.parsecounts(table, field, parsers=(('int', int), ('float', float)))
Count the number of str or unicode values that can be parsed as ints, floats or via custom parser functions. Return a table mapping parser names to the number of values successfully parsed and the number of errors. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 'aaa', 2],
...          ['B', u'2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', '3.7', 9.0],
...          ['E', 42]]
>>> etl.parsecounts(table, 'bar')
+---------+-------+--------+
| type    | count | errors |
+=========+=======+========+
| 'float' |     3 |      1 |
+---------+-------+--------+
| 'int'   |     2 |      2 |
+---------+-------+--------+

The field argument can be a field name or index (starting from zero).
Timing
petl.util.timing.progress(table, batchsize=1000, prefix='', out=None)
Report progress on rows passing through to a file or file-like object (defaults to sys.stderr). E.g.:

>>> import petl as etl
>>> table = etl.dummytable(100000)
>>> table.progress(10000).tocsv('example.csv')
10000 rows in 0.13s (78363 row/s); batch in 0.13s (78363 row/s)
20000 rows in 0.22s (91679 row/s); batch in 0.09s (110448 row/s)
30000 rows in 0.31s (96573 row/s); batch in 0.09s (108114 row/s)
40000 rows in 0.40s (99535 row/s); batch in 0.09s (109625 row/s)
50000 rows in 0.49s (101396 row/s); batch in 0.09s (109591 row/s)
60000 rows in 0.59s (102245 row/s); batch in 0.09s (106709 row/s)
70000 rows in 0.68s (103221 row/s); batch in 0.09s (109498 row/s)
80000 rows in 0.77s (103810 row/s); batch in 0.09s (108126 row/s)
90000 rows in 0.90s (99465 row/s); batch in 0.13s (74516 row/s)
100000 rows in 1.02s (98409 row/s); batch in 0.11s (89821 row/s)
100000 rows in 1.02s (98402 row/s); batches in 0.10 +/- 0.02s [0.09-0.13] (100481 +/- 13340 rows/s [74516-110448])

See also petl.util.timing.clock().
petl.util.timing.log_progress(table, batchsize=1000, prefix='', logger=None, level=20)
Report progress on rows passing through to a Python logger. If logger is None, a new logger will be created that, by default, streams to stdout. E.g.:

>>> import petl as etl
>>> table = etl.dummytable(100000)
>>> table.log_progress(10000).tocsv('example.csv')
10000 rows in 0.13s (78363 row/s); batch in 0.13s (78363 row/s)
20000 rows in 0.22s (91679 row/s); batch in 0.09s (110448 row/s)
30000 rows in 0.31s (96573 row/s); batch in 0.09s (108114 row/s)
40000 rows in 0.40s (99535 row/s); batch in 0.09s (109625 row/s)
50000 rows in 0.49s (101396 row/s); batch in 0.09s (109591 row/s)
60000 rows in 0.59s (102245 row/s); batch in 0.09s (106709 row/s)
70000 rows in 0.68s (103221 row/s); batch in 0.09s (109498 row/s)
80000 rows in 0.77s (103810 row/s); batch in 0.09s (108126 row/s)
90000 rows in 0.90s (99465 row/s); batch in 0.13s (74516 row/s)
100000 rows in 1.02s (98409 row/s); batch in 0.11s (89821 row/s)
100000 rows in 1.02s (98402 row/s); batches in 0.10 +/- 0.02s [0.09-0.13] (100481 +/- 13340 rows/s [74516-110448])

See also petl.util.timing.clock().
petl.util.timing.clock(table)
Time how long is spent retrieving rows from the wrapped container. Enables diagnosis of which steps in a pipeline are taking the most time. E.g.:

>>> import petl as etl
>>> t1 = etl.dummytable(100000)
>>> c1 = etl.clock(t1)
>>> t2 = etl.convert(c1, 'foo', lambda v: v**2)
>>> c2 = etl.clock(t2)
>>> p = etl.progress(c2, 10000)
>>> etl.tocsv(p, 'example.csv')
10000 rows in 0.23s (44036 row/s); batch in 0.23s (44036 row/s)
20000 rows in 0.38s (52167 row/s); batch in 0.16s (63979 row/s)
30000 rows in 0.54s (55749 row/s); batch in 0.15s (64624 row/s)
40000 rows in 0.69s (57765 row/s); batch in 0.15s (64793 row/s)
50000 rows in 0.85s (59031 row/s); batch in 0.15s (64707 row/s)
60000 rows in 1.00s (59927 row/s); batch in 0.15s (64847 row/s)
70000 rows in 1.16s (60483 row/s); batch in 0.16s (64051 row/s)
80000 rows in 1.31s (61008 row/s); batch in 0.15s (64953 row/s)
90000 rows in 1.47s (61356 row/s); batch in 0.16s (64285 row/s)
100000 rows in 1.62s (61703 row/s); batch in 0.15s (65012 row/s)
100000 rows in 1.62s (61700 row/s); batches in 0.16 +/- 0.02s [0.15-0.23] (62528 +/- 6173 rows/s [44036-65012])
>>> # time consumed retrieving rows from t1
... c1.time
0.7243089999999492
>>> # time consumed retrieving rows from t2
... c2.time
1.1704209999999766
>>> # actual time consumed by the convert step
... c2.time - c1.time
0.4461120000000274

See also petl.util.timing.progress().
Statistics
petl.util.statistics.limits(table, field)
Find minimum and maximum values under the given field. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2], ['b', 3]]
>>> minv, maxv = etl.limits(table, 'bar')
>>> minv
1
>>> maxv
3

The field argument can be a field name or index (starting from zero).
petl.util.statistics.stats(table, field)
Calculate basic descriptive statistics on a given field. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 1, 2],
...          ['B', '2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', 'xyz', 9.0],
...          ['E', None]]
>>> etl.stats(table, 'bar')
stats(count=3, errors=2, sum=6.0, min=1.0, max=3.0, mean=2.0, pvariance=0.6666666666666666, pstdev=0.816496580927726)

The field argument can be a field name or index (starting from zero).
Materialising tables
petl.util.materialise.columns(table, missing=None)
Construct a dict mapping field names to lists of values. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2], ['b', 3]]
>>> cols = etl.columns(table)
>>> cols['foo']
['a', 'b', 'b']
>>> cols['bar']
[1, 2, 3]

See also petl.util.materialise.facetcolumns().
petl.util.materialise.facetcolumns(table, key, missing=None)
Like petl.util.materialise.columns() but stratified by values of the given key field. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['a', 1, True],
...          ['b', 2, True],
...          ['b', 3]]
>>> fc = etl.facetcolumns(table, 'foo')
>>> fc['a']
{'foo': ['a'], 'bar': [1], 'baz': [True]}
>>> fc['b']
{'foo': ['b', 'b'], 'bar': [2, 3], 'baz': [True, None]}
Randomly generated tables
petl.util.random.randomtable(numflds=5, numrows=100, wait=0, seed=None)
Construct a table with random numerical data. Use numflds and numrows to specify the number of fields and rows respectively. Set wait to a float greater than zero to simulate a delay on each row generation (number of seconds per row). E.g.:

>>> import petl as etl
>>> table = etl.randomtable(3, 100, seed=42)
>>> table
+----------------------+----------------------+---------------------+
| f0                   | f1                   | f2                  |
+======================+======================+=====================+
|   0.6394267984578837 | 0.025010755222666936 | 0.27502931836911926 |
+----------------------+----------------------+---------------------+
|  0.22321073814882275 |   0.7364712141640124 |  0.6766994874229113 |
+----------------------+----------------------+---------------------+
|   0.8921795677048454 |  0.08693883262941615 |  0.4219218196852704 |
+----------------------+----------------------+---------------------+
| 0.029797219438070344 |  0.21863797480360336 |  0.5053552881033624 |
+----------------------+----------------------+---------------------+
| 0.026535969683863625 |   0.1988376506866485 |  0.6498844377795232 |
+----------------------+----------------------+---------------------+
...

Note that the data are generated on the fly and are not stored in memory, so this function can be used to simulate very large tables.
petl.util.random.dummytable(numrows=100, fields=(('foo', partial(random.randint, 0, 100)), ('bar', partial(random.choice, ('apples', 'pears', 'bananas', 'oranges'))), ('baz', random.random)), wait=0, seed=None)
Construct a table with dummy data. Use numrows to specify the number of rows. Set wait to a float greater than zero to simulate a delay on each row generation (number of seconds per row). E.g.:

>>> import petl as etl
>>> table1 = etl.dummytable(100, seed=42)
>>> table1
+-----+----------+----------------------+
| foo | bar      | baz                  |
+=====+==========+======================+
|  81 | 'apples' | 0.025010755222666936 |
+-----+----------+----------------------+
|  35 | 'pears'  |  0.22321073814882275 |
+-----+----------+----------------------+
|  94 | 'apples' |   0.6766994874229113 |
+-----+----------+----------------------+
|  69 | 'apples' |   0.5904925124490397 |
+-----+----------+----------------------+
|   4 | 'apples' |  0.09369523986159245 |
+-----+----------+----------------------+
...

>>> # customise fields
... import random
>>> from functools import partial
>>> fields = [('foo', random.random),
...           ('bar', partial(random.randint, 0, 500)),
...           ('baz', partial(random.choice,
...                           ['chocolate', 'strawberry', 'vanilla']))]
>>> table2 = etl.dummytable(100, fields=fields, seed=42)
>>> table2
+---------------------+-----+-------------+
| foo                 | bar | baz         |
+=====================+=====+=============+
|  0.6394267984578837 |  12 | 'vanilla'   |
+---------------------+-----+-------------+
| 0.27502931836911926 | 114 | 'chocolate' |
+---------------------+-----+-------------+
|  0.7364712141640124 | 346 | 'vanilla'   |
+---------------------+-----+-------------+
|  0.8921795677048454 |  44 | 'vanilla'   |
+---------------------+-----+-------------+
|  0.4219218196852704 |  15 | 'chocolate' |
+---------------------+-----+-------------+
...

Data generation functions can be specified via the fields keyword argument.

Note that the data are generated on the fly and are not stored in memory, so this function can be used to simulate very large tables.
Miscellaneous
petl.util.misc.typeset(table, field)
Return a set containing all Python types found for values in the given field. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 1, '2'],
...          ['B', u'2', '3.4'],
...          [u'B', u'3', '7.8', True],
...          ['D', u'xyz', 9.0],
...          ['E', 42]]
>>> sorted(etl.typeset(table, 'foo'))
['str']
>>> sorted(etl.typeset(table, 'bar'))
['int', 'str']
>>> sorted(etl.typeset(table, 'baz'))
['NoneType', 'float', 'str']

The field argument can be a field name or index (starting from zero).
petl.util.misc.diffheaders(t1, t2)
Return the difference between the headers of the two tables as a pair of sets. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['a', 1, .3]]
>>> table2 = [['baz', 'bar', 'quux'],
...           ['a', 1, .3]]
>>> add, sub = etl.diffheaders(table1, table2)
>>> add
{'quux'}
>>> sub
{'foo'}
petl.util.misc.diffvalues(t1, t2, f)
Return the difference between the values under the given field in the two tables, e.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 3]]
>>> table2 = [['bar', 'foo'],
...           [1, 'a'],
...           [3, 'c']]
>>> add, sub = etl.diffvalues(table1, table2, 'foo')
>>> add
{'c'}
>>> sub
{'b'}
petl.util.misc.strjoin(s)
Return a function to join sequences using s as the separator. Intended for use with petl.transform.conversions.convert().
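For illustration, the returned function can be sketched as follows (a simplification; converting elements via str() before joining is an assumption of this sketch, not necessarily petl's exact behaviour):

```python
def strjoin(s):
    # Return a function that joins the elements of a sequence into one
    # string using s as the separator, converting each element to str
    # first (an assumption of this sketch).
    return lambda seq: s.join(str(x) for x in seq)

join_dash = strjoin('-')
print(join_dash(['2002', '12', '25']))  # 2002-12-25
```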
petl.util.misc.nthword(n, sep=None)
Construct a function to return the nth word in a string. E.g.:

>>> import petl as etl
>>> s = 'foo bar'
>>> f = etl.nthword(0)
>>> f(s)
'foo'
>>> g = etl.nthword(1)
>>> g(s)
'bar'

Intended for use with petl.transform.conversions.convert().
petl.util.misc.coalesce(*fields, **kwargs)
Return a function which accepts a row and returns the first non-missing value from the specified fields. Intended for use with petl.transform.basics.addfield().