The query class
- class dwopt._qry._Qry(operator, from_=None, select=None, join=None, where=None, group_by=None, having=None, order_by=None, sql=None)[source]
The query class.
See examples for quick-start.
Automatically instantiated via the
qry()
method from thedatabase operator
objects.It is possible to pass in all needed parameters when calling the method. But it is clearer to only pass in one positional argument as table name, and use the Query object - clause methods to build up a query instead.
- Parameters:
operator (dwopt.db._Db) – Database operator object to operate on generated queries.
from (str) – Query table name.
select (str) – Columns.
join (str) – Join clause.
where (str) – Conditions.
group_by (str) – Group by names.
having (str) – Conditions.
order_by (str) – Order by names.
sql (str) – Sql code.
Notes
The summary query building framework
Queries are flexibly built as a combination of a
sub query
and asummary query
from templates. Point ofsub query
is to be a flexible complementary pre-processing step, productivity gain comes from thesummary query
templates.Example:
-- Sub query: arbituary query within a with clause named x with x as ( select * from test where score>0.5 and dte is not null and cat is not null ) -- Summary query: generated from templates select dte, cat ,count(1) n ,avg(score) avgscore, round(sum(amt)/1e3,2) total from x group by dte, cat order by n desc
Corresponding code:
from dwopt import lt, make_test_tbl _ = make_test_tbl(lt) ( lt.qry('test').where('score>0.5', 'dte is not null', 'cat is not null') .valc( 'dte, cat', 'avg(score) avgscore, round(sum(amt)/1e3,2) total', out=1 ) )
Use the Query object - clause methods to iteratively piece together a query, or use the
sql()
method to provide an arbituary query. This created query will then be placed inside a with block and become thesub query
on invocation of any summary methods.The
summary query
is built from parameterized templates via the Query object - summary methods. Calling one of them completes the whole query and immediately runs it.This way for large tables, heavy intermediate results from the
sub query
are never realized outside of the database engine, while light summary results are placed in python for analysis.Examples
Create and use qry object using the
qry()
method from thedatabase operator
objects:>>> from dwopt import lt >>> lt.iris() >>> lt.qry('iris').len() 150 >>> lt.qry('iris').valc('species', 'avg(petal_length)') species n avg(petal_length) 0 sicolor 50 4.260 1 setosa 50 1.462 2 rginica 50 5.552 >>> lt.qry('iris').where('petal_length > 2').valc('species', out=1) with x as ( select * from iris where petal_length > 2 ) select species ,count(1) n from x group by species order by n desc
Iteratively piece together a query using the Query object - clause methods:
>>> from dwopt import lt >>> lt.mtcars() >>> sql = "select cyl from mtcars group by cyl having count(1) > 10" >>> q = ( ... lt.qry('mtcars a') ... .select('a.cyl, count(1) n, avg(a.mpg)') ... .case('cat', "a.cyl = 8 then 1", els=0) ... .join(f'({sql}) b', 'a.cyl = b.cyl', how='inner') ... .group_by('a.cyl') ... .having('count(1) > 10') ... .order_by('n desc') ... ) >>> >>> q.print() select a.cyl, count(1) n, avg(a.mpg) ,case when a.cyl = 8 then 1 else 0 end as cat from mtcars a inner join (select cyl from mtcars group by cyl having count(1) > 10) b on a.cyl = b.cyl group by a.cyl having count(1) > 10 order by n desc >>> q.run() cyl n avg(a.mpg) cat 0 8 14 15.100000 1 1 4 11 26.663636 0
Use the Query object - summary methods for analysis:
>>> from dwopt import pg as d >>> d.iris('iris') >>> q = d.qry('iris').where('petal_length > 2') >>> q.top() sepal_length 7.0 sepal_width 3.2 petal_length 4.7 petal_width 1.4 species sicolor Name: 0, dtype: object >>> q.head() sepal_length sepal_width petal_length petal_width species 0 7.0 3.2 4.7 1.4 sicolor 1 6.4 3.2 4.5 1.5 sicolor 2 6.9 3.1 4.9 1.5 sicolor 3 5.5 2.3 4.0 1.3 sicolor 4 6.5 2.8 4.6 1.5 sicolor >>> q.len() 100 >>> agg = ', '.join(f'avg({col}) {col}' for col in q.cols() if col != 'species') >>> q.valc('species', agg) species n sepal_length sepal_width petal_length petal_width 0 sicolor 50 5.936 2.770 4.260 1.326 1 rginica 50 6.588 2.974 5.552 2.026
- case(col, *args, cond=None, els='NULL')[source]
Add a case when clause to the select clause.
Calling this method multiple times would add multiple statements.
- Parameters:
col (str) – Column name of the resulting column.
*args (str) – Positional argument in form ‘condition then treatement’.
cond (dict) – Dictionary of condition str to treatment str mappings.
els (str) – Value for else clause, default
NULL
.
- Returns:
New query object with clause added.
- Return type:
Examples
>>> from dwopt import lt >>> lt.iris() >>> q = lt.qry('iris').case('mt4',"petal_length > 4 then 1", els=0) >>> q.print() select * ,case when petal_length > 4 then 1 else 0 end as mt4 from iris >>> q.valc('species','avg(mt4) pct_mt4') species n pct_mt4 0 sicolor 50 0.68 1 setosa 50 0.00 2 rginica 50 1.00
Combine case when, value counts, and pivot:
>>> from dwopt import lt as d >>> d.iris(q=1).case('cat', ... "petal_length > 5 then '5+'", ... "petal_length between 2 and 5 then '2-5'", ... "petal_length < 2 then '-2'", ... ).valc('species, cat').pivot('species','cat','n') cat -2 2-5 5+ species rginica NaN 9.0 41.0 setosa 50.0 NaN NaN sicolor NaN 49.0 1.0
Use the
cond
argument:>>> from dwopt import pg >>> pg.mtcars() >>> pg.qry("mtcars").select('name, mpg').case('cat', cond = { ... 'mpg between 10 and 20':15, ... 'mpg between 20 and 30':25, ... 'mpg between 30 and 40':35, ... } ... ).print() select name, mpg ,case when mpg between 10 and 20 then 15 when mpg between 20 and 30 then 25 when mpg between 30 and 40 then 35 else NULL end as cat from mtcars
- cols(out=None)[source]
Fetch column names of the sub query table.
- Parameters:
out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.
- Return type:
Column names as list of str. Or full query as str.
Examples
>>> from dwopt import lt as d >>> d.iris() >>> d.qry('iris').cols() ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'] >>> d.qry('iris').cols(out=1) with x as ( select * from iris ) select * from x where 1=2
Use with comprehension to obtain subsets:
>>> from dwopt import pg as d >>> d.create_schema('test') >>> q = d.iris('test.iris', q=1) >>> [i for i in q.cols() if i != 'species'] ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
- dist(*args, out=None)[source]
Count number of distinct occurances of data.
Works on specified columns, or combination of columns, of the sub query table.
- Parameters:
*args (str or [str]) – Either column names as str, or iterator of column name str.
out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.
- Return type:
pandas.Series
Examples
>>> from dwopt import lt >>> q = lt.mtcars(q=1) >>> q.dist('mpg') count(distinct mpg) 25 Name: 0, dtype: int64 >>> q.dist('mpg', 'cyl') count(distinct mpg) 25 count(distinct cyl) 3 Name: 0, dtype: int64 >>> q.dist(['mpg', 'cyl']) count(distinct mpg || '_' || cyl) 27 Name: 0, dtype: int64 >>> q.dist(*q.cols()[:4]) count(distinct name) 32 count(distinct mpg) 25 count(distinct cyl) 3 count(distinct disp) 27 Name: 0, dtype: int64
>>> from dwopt import pg >>> pg.create_schema('test') >>> q = pg.mtcars('test.mtcars', q=1) >>> q.dist('mpg', ['mpg', 'cyl'], out=1) with x as ( select * from test.mtcars ) select count(distinct mpg) ,count(distinct mpg || '_' || cyl) from x >>> q.dist('mpg', ['mpg', 'cyl']) count 25 count 27 Name: 0, dtype: int64
- from_(from_)[source]
Add the from clause to query.
Alternative to simply specifying table name as the only argument of the qry method. Use the qry method.
- Parameters:
from (str) – Table name str.
- Returns:
New query object with clause added.
- Return type:
Examples
>>> from dwopt import lt >>> lt.iris() >>> lt.qry().from_("iris").print() select * from iris
- group_by(*args)[source]
Add the group by clause to query.
- Parameters:
*args (str or [str]) – Group by columns in str format, or iterator of them.
- Returns:
New query object with clause added.
- Return type:
Examples
>>> from dwopt import lt >>> lt.mtcars() >>> ( ... lt.qry('mtcars') ... .select('hp, count(1) n, avg(mpg)') ... .group_by('hp') ... .having("avg(mpg) > 30") ... .run() ... ) hp n avg(mpg) 0 52 1 30.4 1 65 1 33.9 2 113 1 30.4 >>> ( ... lt.qry('mtcars') ... .select('hp, cyl, count(1) n, avg(mpg)') ... .group_by('hp,cyl') ... #.group_by('hp', 'cyl') #same effect ... #.group_by(['hp', 'cyl']) #same effect ... .having("avg(mpg) > 30") ... .print() ... ) select hp, cyl, count(1) n, avg(mpg) from mtcars group by hp,cyl having avg(mpg) > 30
- hash(*args, out=None)[source]
Calculate a simple oracle hash for table.
Arrive at a indicative hash value for a number of columns or all columns of a sub query table. Hash value is a number or symbol that is calculated from data , and is sensitive to any small changes in data. It serves as method to detect if any data element in data is changed.
- Parameters:
*args (str or [str]) – Column names in str. If no value is given, a cols method will be performed to fetch the list of all columns, from which a hash will be calculated. Also allow passing in a single list of str column names.
out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.
- Return type:
int
Examples
from dwopt import oc q = oc.iris(q=1) q.hash('petal_length') q.hash('petal_length', 'petal_width') q.hash(['petal_length', 'petal_width']) q.hash(out=1)
- having(*args)[source]
Add the having clause to query.
- Parameters:
*args (str or [str]) – Conditions in str format, or iterator of them.
- Returns:
New query object with clause added.
- Return type:
Examples
>>> from dwopt import lt >>> lt.mtcars() >>> q = ( ... lt.qry('mtcars') ... .select('hp, count(1) n, avg(mpg)') ... .group_by('hp') ... .having("count(1) > 1", "avg(mpg) > 15") ... .order_by('n desc') ... ) >>> >>> q.print() select hp, count(1) n, avg(mpg) from mtcars group by hp having count(1) > 1 and avg(mpg) > 15 order by n desc >>> q.run() hp n avg(mpg) 0 180 3 16.300000 1 175 3 19.200000 2 110 3 21.133333 3 150 2 15.350000 4 123 2 18.500000 5 66 2 29.850000
- head(out=None)[source]
Fetch top 5 rows of the sub query table.
- Parameters:
out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.
- Return type:
pandas.DataFrame
Examples
>>> from dwopt import lt >>> lt.iris() >>> lt.qry("iris").head() sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3.0 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5.0 3.6 1.4 0.2 setosa
- join(sch_tbl_nme, *args, how='left')[source]
Add a join clause to query.
Calling this method multiple times would add multiple clauses.
- Parameters:
sch_tbl_nme (str) – Table name in form
my_schema1.my_table1
ormy_table1
.*args (str) – Joining conditions in str format.
how (str) – The join keyword in str format, for example:
inner
,cross
. Defaultleft
.
- Returns:
New query object with clause added.
- Return type:
Examples
>>> from dwopt import lt >>> lt.mtcars() >>> q = ( ... lt.qry('mtcars a').select('a.cyl, b.mpg') ... .join('mtcars b','a.name = b.name') ... ) >>> q.valc('cyl', 'avg(mpg)') cyl n avg(mpg) 0 8 14 15.100000 1 4 11 26.663636 2 6 7 19.742857 >>> q.valc('cyl', 'avg(mpg)', out=1) with x as ( select a.cyl, b.mpg from mtcars a left join mtcars b on a.name = b.name ) select cyl ,count(1) n ,avg(mpg) from x group by cyl order by n desc
>>> ( ... lt.qry('mtcars a').select('a.cyl, b.mpg, c.hp') ... .join('mtcars b','a.name = b.name') ... .join('mtcars c','a.name = c.name') ... .valc('cyl', 'avg(mpg), avg(hp)') ... ) cyl n avg(mpg) avg(hp) 0 8 14 15.100000 209.214286 1 4 11 26.663636 82.636364 2 6 7 19.742857 122.285714
Inject a sub-query into the
sch_tbl_nme
argument:>>> sql = "select cyl from mtcars group by cyl having count(1) > 10" >>> q = ( ... lt.qry('mtcars a').select('a.cyl, a.mpg') ... .join(f'({sql}) b', 'a.cyl = b.cyl', how='inner') ... ) >>> q.valc('cyl', 'avg(mpg)') cyl n avg(mpg) 0 8 14 15.100000 1 4 11 26.663636 >>> q.valc('cyl', 'avg(mpg)', out=1) with x as ( select a.cyl, a.mpg from mtcars a inner join (select cyl from mtcars group by cyl having count(1) > 10) b on a.cyl = b.cyl ) select cyl ,count(1) n ,avg(mpg) from x group by cyl order by n desc
- len(out=None)[source]
Length of the sub query table.
- Parameters:
out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.
- Return type:
int
Examples
>>> from dwopt import lt >>> lt.iris() >>> lt.qry('iris').len() 150 >>> lt.mtcars(q=1).len() 32
- mimx(col, out=None)[source]
Fetch maximum and minimum values of a column.
- Parameters:
col (str) – Column name as str.
out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.
- Return type:
pandas.Series
Examples
>>> from dwopt import lt >>> lt.iris() >>> lt.qry('iris').mimx('sepal_width') max(sepal_width) 4.4 min(sepal_width) 2.0 Name: 0, dtype: float64
- order_by(*args)[source]
Add the order by clause to query.
- Parameters:
*args (str or [str]) – Order by names in str format, or iterator of them.
- Returns:
New query object with clause added.
- Return type:
Examples
>>> from dwopt import lt >>> lt.mtcars() >>> q = ( ... lt.qry('mtcars') ... .select('hp, count(1) n, avg(mpg)') ... .group_by('hp') ... .having("count(1) > 1", "avg(mpg) > 15") ... .order_by('n desc') ... ) >>> >>> q.run() hp n avg(mpg) 0 180 3 16.300000 1 175 3 19.200000 2 110 3 21.133333 3 150 2 15.350000 4 123 2 18.500000 5 66 2 29.850000
- print(sum_qry=None)[source]
Print the built query.
If the
sum_qry
argument is not given value, print underlying query as it is.If a value is given, enclose the underlying query into a with clause, attach the summary query after it, print the full query.
See also
Examples
>>> from dwopt import lt >>> lt.iris() >>> lt.qry('iris').select('count(1)').print() select count(1) from iris >>> lt.qry('iris').print('select count(1) from x') with x as ( select * from iris ) select count(1) from x
- run(sum_qry=None, **kwargs)[source]
Run the built query.
If the
sum_qry
argument is not given value, run underlying query as it is.If a value is given, enclose the underlying query into a with clause, attach the summary query after it, run the full query.
- Parameters:
sum_qry (str) – Summary query string.
**kwargs – Keyword arguments to pass on to database operator’s run method.
- Returns:
Returned by the database operator’s
dwopt.dbo._Db.run()
method.- Return type:
pandas.DataFrame or None
See also
Examples
>>> from dwopt import lt >>> lt.iris() >>> lt.qry('iris').select('count(1)').run() count(1) 0 150 >>> lt.qry('iris').run('select count(1) from x') count(1) 0 150 >>> lt.qry('iris').where('petal_length>:x').run( ... 'select count(1) from x', mods={'x':2} ... ) count(1) 0 100
- select(*args, sep=',')[source]
Add the select clause to query.
- Parameters:
*args (str or [str]) – Column name str as positional arguments , or an iterator of str column names.
sep (str) – Symbol used for seperating column names, default
,
.
- Returns:
New query object with clause added.
- Return type:
Examples
>>> from dwopt import lt >>> q = lt.iris(q=1) >>> q.select("species, sepal_length").print() select species, sepal_length from iris >>> q.select("species", "sepal_length").print() select species,sepal_length from iris >>> q.select(["species", "sepal_length"]).print() select species,sepal_length from iris >>> q.select(["species", "sepal_length + 2 as sepal_length_new"]).print() select species,sepal_length + 2 as sepal_length_new from iris
- sql(sql)[source]
Replace the underlying query by specified sql.
This allows arbituary advanced sql to be incorporated into framework.
- Parameters:
sql (str) – Sql code in str format.
- Returns:
New query object with clause added.
- Return type:
Examples
>>> from dwopt import lt >>> lt.iris() >>> q = lt.qry().sql("select * from iris where sepal_length > 7") >>> q.len() 12 >>> q.print() select * from iris where sepal_length > 7
- str(sum_qry=None)[source]
Return the built query as str.
If the
sum_qry
argument is not given value, return underlying query as it is.If a value is given, enclose the underlying query into a with clause, attach the summary query after it, return the full query.
- Parameters:
sum_qry (str) – Summary query string.
- Return type:
str
See also
Examples
>>> from dwopt import lt >>> lt.iris() >>> lt.qry('iris').select('count(1)').str() 'select count(1)\nfrom iris' >>> lt.qry('iris').str('select count(1) from x') 'with x as (\n select * from iris\n)\nselect count(1) from x'
- top(out=None)[source]
Fetch top row of the sub query table.
- Parameters:
out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.
- Return type:
pandas.Series
Examples
>>> from dwopt import lt >>> lt.mtcars() >>> lt.qry('mtcars').top() name Mazda RX4 mpg 21.0 cyl 6 disp 160.0 hp 110 drat 3.9 wt 2.62 qsec 16.46 vs 0 am 1 gear 4 carb 4 Name: 0, dtype: object
Use
print
andto_string
to force display of all columns:>>> print(lt.iris(q=1).top().to_string()) sepal_length 5.1 sepal_width 3.5 petal_length 1.4 petal_width 0.2 species setosa
- valc(group_by, agg=None, order_by=None, n=True, out=None)[source]
Value count of a column or combination of columns.
A value count is a group by query, with total number of row of each group calculated. Also allow custom summary calculation, and custom order by clauses to be added.
- Parameters:
group_by (str) – Group by clause as str.
agg (str) – Custom aggeregation clause as str.
order_by (str) – Order by clause as str.
n (Bool) – Should the value count column be automatically created or not. Default to be True.
out (int) – Output mode. None to run full query, 1 to print full query, 2 to return full query as str. Default None.
- Return type:
pandas.DataFrame
Examples
Various configuration of value counts:
>>> from dwopt import lt >>> lt.iris() >>> lt.qry('iris').valc('species', 'avg(petal_length)') species n avg(petal_length) 0 sicolor 50 4.260 1 setosa 50 1.462 2 rginica 50 5.552 >>> lt.qry('iris').valc('species', 'avg(petal_length) avg', 'avg desc') species n avg 0 rginica 50 5.552 1 sicolor 50 4.260 2 setosa 50 1.462 >>> lt.qry('iris').valc('species', 'avg(petal_length)', n=False) species avg(petal_length) 0 rginica 5.552 1 setosa 1.462 2 sicolor 4.260 >>> lt.qry('iris').valc('species', 'avg(petal_length)', out=1) with x as ( select * from iris ) select species ,count(1) n ,avg(petal_length) from x group by species order by n desc
Excel-pivot-table-like API:
from dwopt import lt, make_test_tbl import logging # logging.basicConfig(level = logging.INFO) _ = make_test_tbl(lt) ( lt.qry('test').where('score>0.5', 'dte is not null', 'cat is not null') .valc('dte, cat', 'avg(score) avgscore, round(sum(amt)/1e3,2) total') .pivot('dte', 'cat') )
Logs showing sql used:
INFO:dwopt.dbo:running: with x as ( select * from test where score>0.5 and dte is not null and cat is not null ) select dte, cat ,count(1) n ,avg(score) avgscore, round(sum(amt)/1e3,2) total from x group by dte, cat order by n desc INFO:dwopt.dbo:done
Results:
n avgscore total cat test train test train test train dte 2022-01-01 1140 1051 2.736275 2.800106 565.67 530.09 2022-02-02 1077 1100 2.759061 2.748898 536.68 544.10 2022-03-03 1037 1072 2.728527 2.743825 521.54 528.85
Combine case when, value counts, and pivot:
>>> from dwopt import lt as d >>> d.iris(q=1).case('cat', ... "petal_length > 5 then '5+'", ... "petal_length between 2 and 5 then '2-5'", ... "petal_length < 2 then '-2'", ... ).valc('species, cat').pivot('species','cat','n') cat -2 2-5 5+ species rginica NaN 9.0 41.0 setosa 50.0 NaN NaN sicolor NaN 49.0 1.0
Make summary for all relevant columns via generated statements:
>>> from dwopt import lt >>> lt.iris() >>> q = lt.qry('iris') >>> agg = ', '.join(f'avg({col}) {col}' for col in q.cols() if col != 'species') >>> q.valc('species', agg) species n sepal_length sepal_width petal_length petal_width 0 sicolor 50 5.936 2.770 4.260 1.326 1 setosa 50 5.006 3.428 1.462 0.246 2 rginica 50 6.588 2.974 5.552 2.026
- where(*args)[source]
Add the where clause to query.
- Parameters:
*args (str or [str]) – Conditions in str format, or iterator of condition str.
- Returns:
New query object with clause added.
- Return type:
Examples
>>> from dwopt import lt >>> lt.mtcars() >>> lt.qry('mtcars').where("cyl = 4").len() 11 >>> lt.qry('mtcars').where("cyl = 4", "vs = 0").run() name mpg cyl disp hp drat wt qsec vs am gear carb 0 Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16.7 0 1 5 2 >>> lt.qry('mtcars').where("cyl = 4 and vs = 0").print() select * from mtcars where cyl = 4 and vs = 0