The query class

class dwopt._qry._Qry(operator, from_=None, select=None, join=None, where=None, group_by=None, having=None, order_by=None, sql=None)[source]

The query class.

See examples for quick-start.

Automatically instantiated via the qry() method from the database operator objects.

It is possible to pass in all needed parameters when calling the method. But it is clearer to only pass in one positional argument as table name, and use the Query object - clause methods to build up a query instead.

Parameters
  • operator (dwopt.db._Db) – Database operator object to operate on generated queries.

  • from (str) – Query table name.

  • select (str) – Columns.

  • join (str) – Join clause.

  • where (str) – Conditions.

  • group_by (str) – Group by names.

  • having (str) – Conditions.

  • order_by (str) – Order by names.

  • sql (str) – Sql code.

Notes

The summary query building framework

Queries are flexibly built as a combination of a sub query and a summary query from templates. Point of sub query is to be a flexible complementary pre-processing step, productivity gain comes from the summary query templates.

Example:

-- Sub query: arbituary query within a with clause named x
with x as (
    select * from test
    where score>0.5
        and dte is not null
        and cat is not null
)
-- Summary query: generated from templates
select
    dte, cat
    ,count(1) n
    ,avg(score) avgscore, round(sum(amt)/1e3,2) total
from x
group by dte, cat
order by n desc

Corresponding code:

from dwopt import lt, make_test_tbl
_ = make_test_tbl(lt)
(
    lt.qry('test').where('score>0.5', 'dte is not null', 'cat is not null')
    .valc(
        'dte, cat',
        'avg(score) avgscore, round(sum(amt)/1e3,2) total',
        out=1
    )
)

Use the Query object - clause methods to iteratively piece together a query, or use the sql() method to provide an arbituary query. This created query will then be placed inside a with block and become the sub query on invocation of any summary methods.

The summary query is built from parameterized templates via the Query object - summary methods. Calling one of them completes the whole query and immediately runs it.

This way for large tables, heavy intermediate results from the sub query are never realized outside of the database engine, while light summary results are placed in python for analysis.

Examples

Create and use qry object using the qry() method from the database operator objects:

>>> from dwopt import lt
>>> lt.iris()
>>> lt.qry('iris').len()
150
>>> lt.qry('iris').valc('species', 'avg(petal_length)')
   species   n  avg(petal_length)
0  sicolor  50              4.260
1   setosa  50              1.462
2  rginica  50              5.552
>>> lt.qry('iris').where('petal_length > 2').valc('species', out=1)
with x as (
    select * from iris
    where petal_length > 2
)
select
    species
    ,count(1) n
from x
group by species
order by n desc

Iteratively piece together a query using the Query object - clause methods:

>>> from dwopt import lt
>>> lt.mtcars()
>>> sql = "select cyl from mtcars group by cyl having count(1) > 10"
>>> q = (
...     lt.qry('mtcars a')
...     .select('a.cyl, count(1) n, avg(a.mpg)')
...     .case('cat', "a.cyl = 8 then 1", els=0)
...     .join(f'({sql}) b', 'a.cyl = b.cyl', how='inner')
...     .group_by('a.cyl')
...     .having('count(1) > 10')
...     .order_by('n desc')
... )
>>>
>>> q.print()
select a.cyl, count(1) n, avg(a.mpg)
    ,case when a.cyl = 8 then 1 else 0 end as cat
from mtcars a
inner join (select cyl from mtcars group by cyl having count(1) > 10) b
    on a.cyl = b.cyl
group by a.cyl
having count(1) > 10
order by n desc
>>> q.run()
   cyl   n  avg(a.mpg)  cat
0    8  14   15.100000    1
1    4  11   26.663636    0

Use the Query object - summary methods for analysis:

>>> from dwopt import pg as d
>>> d.iris('iris')
>>> q = d.qry('iris').where('petal_length > 2')
>>> q.top()
sepal_length        7.0
sepal_width         3.2
petal_length        4.7
petal_width         1.4
species         sicolor
Name: 0, dtype: object
>>> q.head()
   sepal_length  sepal_width  petal_length  petal_width  species
0           7.0          3.2           4.7          1.4  sicolor
1           6.4          3.2           4.5          1.5  sicolor
2           6.9          3.1           4.9          1.5  sicolor
3           5.5          2.3           4.0          1.3  sicolor
4           6.5          2.8           4.6          1.5  sicolor
>>> q.len()
100
>>> agg = ', '.join(f'avg({col}) {col}' for col in q.cols() if col != 'species')
>>> q.valc('species', agg)
   species   n  sepal_length  sepal_width  petal_length  petal_width
0  sicolor  50         5.936        2.770         4.260        1.326
1  rginica  50         6.588        2.974         5.552        2.026
bin(out=None)[source]

WIP

case(col, *args, cond=None, els='NULL')[source]

Add a case when clause to the select clause.

Calling this method multiple times would add multiple statements.

Parameters
  • col (str) – Column name of the resulting column.

  • *args (str) – Positional argument in form ‘condition then treatement’.

  • cond (dict) – Dictionary of condition str to treatment str mappings.

  • els (str) – Value for else clause, default NULL.

Returns

New query object with clause added.

Return type

dwopt._qry._Qry

Examples

>>> from dwopt import lt
>>> lt.iris()
>>> q = lt.qry('iris').case('mt4',"petal_length > 4 then 1", els=0)
>>> q.print()
select *
    ,case when petal_length > 4 then 1 else 0 end as mt4
from iris
>>> q.valc('species','avg(mt4) pct_mt4')
   species   n  pct_mt4
0  sicolor  50     0.68
1   setosa  50     0.00
2  rginica  50     1.00

Combine case when, value counts, and pivot:

>>> from dwopt import lt as d
>>> d.iris(q=1).case('cat', 
...     "petal_length > 5             then '5+'",
...     "petal_length between 2 and 5 then '2-5'",
...     "petal_length < 2             then '-2'",
... ).valc('species, cat').pivot('species','cat','n')
cat        -2   2-5    5+
species
rginica   NaN   9.0  41.0
setosa   50.0   NaN   NaN
sicolor   NaN  49.0   1.0

Use the cond argument:

>>> from dwopt import pg
>>> pg.mtcars()
>>> pg.qry("mtcars").select('name, mpg').case('cat', cond = {
...         'mpg between 10 and 20':15,
...         'mpg between 20 and 30':25,
...         'mpg between 30 and 40':35,
...     }
... ).print()
select name, mpg
    ,case
        when mpg between 10 and 20 then 15
        when mpg between 20 and 30 then 25
        when mpg between 30 and 40 then 35
        else NULL
    end as cat
from mtcars
cols(out=None)[source]

Fetch column names of the sub query table.

Parameters

out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.

Return type

Column names as list of str. Or full query as str.

Examples

>>> from dwopt import lt as d
>>> d.iris()
>>> d.qry('iris').cols()
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
>>> d.qry('iris').cols(out=1)
with x as (
    select * from iris
)
select * from x where 1=2

Use with comprehension to obtain subsets:

>>> from dwopt import pg as d
>>> d.create_schema('test')
>>> q = d.iris('test.iris', q=1)
>>> [i for i in q.cols() if i != 'species']
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
dist(*args, out=None)[source]

Count number of distinct occurances of data.

Works on specified columns, or combination of columns, of the sub query table.

Parameters
  • *args (str or [str]) – Either column names as str, or iterator of column name str.

  • out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.

Return type

pandas.Series

Examples

>>> from dwopt import lt
>>> q = lt.mtcars(q=1)
>>> q.dist('mpg')
count(distinct mpg)    25
Name: 0, dtype: int64
>>> q.dist('mpg', 'cyl')
count(distinct mpg)    25
count(distinct cyl)     3
Name: 0, dtype: int64
>>> q.dist(['mpg', 'cyl'])
count(distinct mpg || '_' || cyl)    27
Name: 0, dtype: int64
>>> q.dist(*q.cols()[:4])
count(distinct name)    32
count(distinct mpg)     25
count(distinct cyl)      3
count(distinct disp)    27
Name: 0, dtype: int64
>>> from dwopt import pg
>>> pg.create_schema('test')
>>> q = pg.mtcars('test.mtcars', q=1)
>>> q.dist('mpg', ['mpg', 'cyl'], out=1) 
with x as (
    select * from test.mtcars
)
select
    count(distinct mpg)
    ,count(distinct mpg || '_' || cyl)
from x
>>> q.dist('mpg', ['mpg', 'cyl'])
count    25
count    27
Name: 0, dtype: int64
five(out=None)[source]

WIP

from_(from_)[source]

Add the from clause to query.

Alternative to simply specifying table name as the only argument of the qry method. Use the qry method.

Parameters

from (str) – Table name str.

Returns

New query object with clause added.

Return type

dwopt._qry._Qry

Examples

>>> from dwopt import lt
>>> lt.iris()
>>> lt.qry().from_("iris").print()
select * from iris
group_by(*args)[source]

Add the group by clause to query.

Parameters

*args (str or [str]) – Group by columns in str format, or iterator of them.

Returns

New query object with clause added.

Return type

dwopt._qry._Qry

Examples

>>> from dwopt import lt
>>> lt.mtcars()
>>> (
...     lt.qry('mtcars')
...     .select('hp, count(1) n, avg(mpg)')
...     .group_by('hp')
...     .having("avg(mpg) > 30")
...     .run()
... )
    hp  n  avg(mpg)
0   52  1      30.4
1   65  1      33.9
2  113  1      30.4
>>> (
...     lt.qry('mtcars')
...     .select('hp, cyl, count(1) n, avg(mpg)')
...     .group_by('hp,cyl')
...     #.group_by('hp', 'cyl') #same effect
...     #.group_by(['hp', 'cyl']) #same effect
...     .having("avg(mpg) > 30")
...     .print()
... )
select hp, cyl, count(1) n, avg(mpg)
from mtcars
group by hp,cyl
having avg(mpg) > 30
hash(*args, out=None)[source]

Calculate a simple oracle hash for table.

Arrive at a indicative hash value for a number of columns or all columns of a sub query table. Hash value is a number or symbol that is calculated from data , and is sensitive to any small changes in data. It serves as method to detect if any data element in data is changed.

Parameters
  • *args (str or [str]) – Column names in str. If no value is given, a cols method will be performed to fetch the list of all columns, from which a hash will be calculated. Also allow passing in a single list of str column names.

  • out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.

Return type

int

Examples

from dwopt import oc
q = oc.iris(q=1)
q.hash('petal_length')
q.hash('petal_length', 'petal_width')
q.hash(['petal_length', 'petal_width'])
q.hash(out=1)
having(*args)[source]

Add the having clause to query.

Parameters

*args (str or [str]) – Conditions in str format, or iterator of them.

Returns

New query object with clause added.

Return type

dwopt._qry._Qry

Examples

>>> from dwopt import lt
>>> lt.mtcars()
>>> q = (
...     lt.qry('mtcars')
...     .select('hp, count(1) n, avg(mpg)')
...     .group_by('hp')
...     .having("count(1) > 1", "avg(mpg) > 15")
...     .order_by('n desc')
... )
>>>
>>> q.print()
select hp, count(1) n, avg(mpg)
from mtcars
group by hp
having count(1) > 1
    and avg(mpg) > 15
order by n desc
>>> q.run()
    hp  n   avg(mpg)
0  180  3  16.300000
1  175  3  19.200000
2  110  3  21.133333
3  150  2  15.350000
4  123  2  18.500000
5   66  2  29.850000
head(out=None)[source]

Fetch top 5 rows of the sub query table.

Parameters

out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.

Return type

pandas.DataFrame

Examples

>>> from dwopt import lt
>>> lt.iris()
>>> lt.qry("iris").head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
join(sch_tbl_nme, *args, how='left')[source]

Add a join clause to query.

Calling this method multiple times would add multiple clauses.

Parameters
  • sch_tbl_nme (str) – Table name in form my_schema1.my_table1 or my_table1.

  • *args (str) – Joining conditions in str format.

  • how (str) – The join keyword in str format, for example: inner, cross. Default left.

Returns

New query object with clause added.

Return type

dwopt._qry._Qry

Examples

>>> from dwopt import lt
>>> lt.mtcars()
>>> q = (
...     lt.qry('mtcars a').select('a.cyl, b.mpg')
...     .join('mtcars b','a.name = b.name')
... )
>>> q.valc('cyl', 'avg(mpg)')
   cyl   n   avg(mpg)
0    8  14  15.100000
1    4  11  26.663636
2    6   7  19.742857
>>> q.valc('cyl', 'avg(mpg)', out=1) 
with x as (
    select a.cyl, b.mpg
    from mtcars a
    left join mtcars b
        on a.name = b.name
)
select
    cyl
    ,count(1) n
    ,avg(mpg)
from x
group by cyl
order by n desc
>>> (
...     lt.qry('mtcars a').select('a.cyl, b.mpg, c.hp')
...     .join('mtcars b','a.name = b.name')
...     .join('mtcars c','a.name = c.name')
...     .valc('cyl', 'avg(mpg), avg(hp)')
... )
   cyl   n   avg(mpg)     avg(hp)
0    8  14  15.100000  209.214286
1    4  11  26.663636   82.636364
2    6   7  19.742857  122.285714

Inject a sub-query into the sch_tbl_nme argument:

>>> sql = "select cyl from mtcars group by cyl having count(1) > 10"
>>> q = (
...     lt.qry('mtcars a').select('a.cyl, a.mpg')
...     .join(f'({sql}) b', 'a.cyl = b.cyl', how='inner')
... )
>>> q.valc('cyl', 'avg(mpg)')
   cyl   n   avg(mpg)
0    8  14  15.100000
1    4  11  26.663636
>>> q.valc('cyl', 'avg(mpg)', out=1) 
with x as (
    select a.cyl, a.mpg
    from mtcars a
    inner join (select cyl from mtcars group by cyl having count(1) > 10) b
        on a.cyl = b.cyl
)
select
    cyl
    ,count(1) n
    ,avg(mpg)
from x
group by cyl
order by n desc
len(out=None)[source]

Length of the sub query table.

Parameters

out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.

Return type

int

Examples

>>> from dwopt import lt
>>> lt.iris()
>>> lt.qry('iris').len()
150
>>> lt.mtcars(q=1).len()
32
mimx(col, out=None)[source]

Fetch maximum and minimum values of a column.

Parameters
  • col (str) – Column name as str.

  • out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.

Return type

pandas.Series

Examples

>>> from dwopt import lt
>>> lt.iris()
>>> lt.qry('iris').mimx('sepal_width')
max(sepal_width)    4.4
min(sepal_width)    2.0
Name: 0, dtype: float64
order_by(*args)[source]

Add the order by clause to query.

Parameters

*args (str or [str]) – Order by names in str format, or iterator of them.

Returns

New query object with clause added.

Return type

dwopt._qry._Qry

Examples

>>> from dwopt import lt
>>> lt.mtcars()
>>> q = (
...     lt.qry('mtcars')
...     .select('hp, count(1) n, avg(mpg)')
...     .group_by('hp')
...     .having("count(1) > 1", "avg(mpg) > 15")
...     .order_by('n desc')
... )
>>>
>>> q.run()
    hp  n   avg(mpg)
0  180  3  16.300000
1  175  3  19.200000
2  110  3  21.133333
3  150  2  15.350000
4  123  2  18.500000
5   66  2  29.850000
pct()[source]

WIP

piv()[source]

WIP

print(sum_qry=None)[source]

Print the built query.

  • If the sum_qry argument is not given value, print underlying query as it is.

  • If a value is given, enclose the underlying query into a with clause, attach the summary query after it, print the full query.

Examples

>>> from dwopt import lt
>>> lt.iris()
>>> lt.qry('iris').select('count(1)').print()
select count(1)
from iris
>>> lt.qry('iris').print('select count(1) from x')
with x as (
    select * from iris
)
select count(1) from x
run(sum_qry=None, **kwargs)[source]

Run the built query.

  • If the sum_qry argument is not given value, run underlying query as it is.

  • If a value is given, enclose the underlying query into a with clause, attach the summary query after it, run the full query.

Parameters
  • sum_qry (str) – Summary query string.

  • **kwargs – Keyword arguments to pass on to database operator’s run method.

Returns

Returned by the database operator’s dwopt.dbo._Db.run() method.

Return type

pandas.DataFrame or None

Examples

>>> from dwopt import lt
>>> lt.iris()
>>> lt.qry('iris').select('count(1)').run()
   count(1)
0       150
>>> lt.qry('iris').run('select count(1) from x')
   count(1)
0       150
>>> lt.qry('iris').where('petal_length>:x').run(
...     'select count(1) from x', mods={'x':2}
... )
   count(1)
0       100
select(*args, sep=',')[source]

Add the select clause to query.

Parameters
  • *args (str or [str]) – Column name str as positional arguments , or an iterator of str column names.

  • sep (str) – Symbol used for seperating column names, default ,.

Returns

New query object with clause added.

Return type

dwopt._qry._Qry

Examples

>>> from dwopt import lt
>>> q = lt.iris(q=1)
>>> q.select("species, sepal_length").print()
select species, sepal_length
from iris
>>> q.select("species", "sepal_length").print()
select species,sepal_length
from iris
>>> q.select(["species", "sepal_length"]).print()
select species,sepal_length
from iris
>>> q.select(["species", "sepal_length + 2 as sepal_length_new"]).print()
select species,sepal_length + 2 as sepal_length_new
from iris
sql(sql)[source]

Replace the underlying query by specified sql.

This allows arbituary advanced sql to be incorporated into framework.

Parameters

sql (str) – Sql code in str format.

Returns

New query object with clause added.

Return type

dwopt._qry._Qry

Examples

>>> from dwopt import lt
>>> lt.iris()
>>> q = lt.qry().sql("select * from iris where sepal_length > 7")
>>> q.len()
12
>>> q.print()
select * from iris where sepal_length > 7
str(sum_qry=None)[source]

Return the built query as str.

  • If the sum_qry argument is not given value, return underlying query as it is.

  • If a value is given, enclose the underlying query into a with clause, attach the summary query after it, return the full query.

Parameters

sum_qry (str) – Summary query string.

Return type

str

Examples

>>> from dwopt import lt
>>> lt.iris()
>>> lt.qry('iris').select('count(1)').str()
'select count(1)\nfrom iris'
>>> lt.qry('iris').str('select count(1) from x')
'with x as (\n    select * from iris\n)\nselect count(1) from x'
top(out=None)[source]

Fetch top row of the sub query table.

Parameters

out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.

Return type

pandas.Series

Examples

>>> from dwopt import lt
>>> lt.mtcars()
>>> lt.qry('mtcars').top()
name    Mazda RX4
mpg          21.0
cyl             6
disp        160.0
hp            110
drat          3.9
wt           2.62
qsec        16.46
vs              0
am              1
gear            4
carb            4
Name: 0, dtype: object

Use print and to_string to force display of all columns:

>>> print(lt.iris(q=1).top().to_string())
sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
valc(group_by, agg=None, order_by=None, n=True, out=None)[source]

Value count of a column or combination of columns.

A value count is a group by query, with total number of row of each group calculated. Also allow custom summary calculation, and custom order by clauses to be added.

Parameters
  • group_by (str) – Group by clause as str.

  • agg (str) – Custom aggeregation clause as str.

  • order_by (str) – Order by clause as str.

  • n (Bool) – Should the value count column be automatically created or not. Default to be True.

  • out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.

Return type

pandas.DataFrame

Examples

Various configuration of value counts:

>>> from dwopt import lt
>>> lt.iris()
>>> lt.qry('iris').valc('species', 'avg(petal_length)')
   species   n  avg(petal_length)
0  sicolor  50              4.260
1   setosa  50              1.462
2  rginica  50              5.552
>>> lt.qry('iris').valc('species', 'avg(petal_length) avg', 'avg desc')
   species   n    avg
0  rginica  50  5.552
1  sicolor  50  4.260
2   setosa  50  1.462
>>> lt.qry('iris').valc('species', 'avg(petal_length)', n=False)
   species  avg(petal_length)
0  rginica              5.552
1   setosa              1.462
2  sicolor              4.260
>>> lt.qry('iris').valc('species', 'avg(petal_length)', out=1)
with x as (
    select * from iris
)
select
    species
    ,count(1) n
    ,avg(petal_length)
from x
group by species
order by n desc

Excel-pivot-table-like API:

from dwopt import lt, make_test_tbl
import logging
logging.basicConfig(level = logging.INFO)
_ = make_test_tbl(lt)
(
    lt.qry('test').where('score>0.5', 'dte is not null', 'cat is not null')
    .valc('dte, cat', 'avg(score) avgscore, round(sum(amt)/1e3,2) total')
    .pivot('dte', 'cat')
)

Logs showing sql used:

INFO:dwopt.dbo:running:
with x as (
    select * from test
    where score>0.5
        and dte is not null
        and cat is not null
)
select
    dte, cat
    ,count(1) n
    ,avg(score) avgscore, round(sum(amt)/1e3,2) total
from x
group by dte, cat
order by n desc
INFO:dwopt.dbo:done

Results:

               n        avgscore             total
cat         test train      test     train    test   train
dte
2022-01-01  1140  1051  2.736275  2.800106  565.67  530.09
2022-02-02  1077  1100  2.759061  2.748898  536.68  544.10
2022-03-03  1037  1072  2.728527  2.743825  521.54  528.85

Combine case when, value counts, and pivot:

>>> from dwopt import lt as d
>>> d.iris(q=1).case('cat', 
...     "petal_length > 5             then '5+'",
...     "petal_length between 2 and 5 then '2-5'",
...     "petal_length < 2             then '-2'",
... ).valc('species, cat').pivot('species','cat','n')
cat        -2   2-5    5+
species
rginica   NaN   9.0  41.0
setosa   50.0   NaN   NaN
sicolor   NaN  49.0   1.0

Make summary for all relevant columns via generated statements:

>>> from dwopt import lt
>>> lt.iris()
>>> q = lt.qry('iris')
>>> agg = ', '.join(f'avg({col}) {col}' for col in q.cols() if col != 'species')
>>> q.valc('species', agg)
   species   n  sepal_length  sepal_width  petal_length  petal_width
0  sicolor  50         5.936        2.770         4.260        1.326
1   setosa  50         5.006        3.428         1.462        0.246
2  rginica  50         6.588        2.974         5.552        2.026
where(*args)[source]

Add the where clause to query.

Parameters

*args (str or [str]) – Conditions in str format, or iterator of condition str.

Returns

New query object with clause added.

Return type

dwopt._qry._Qry

Examples

>>> from dwopt import lt
>>> lt.mtcars()
>>> lt.qry('mtcars').where("cyl = 4").len()
11
>>> lt.qry('mtcars').where("cyl = 4", "vs = 0").run()
            name   mpg  cyl   disp  hp  drat    wt  qsec  vs  am  gear  carb
0  Porsche 914-2  26.0    4  120.3  91  4.43  2.14  16.7   0   1     5     2
>>> lt.qry('mtcars').where("cyl = 4 and vs = 0").print()
select * from mtcars
where cyl = 4 and vs = 0