Overview

Installation

pip install dwopt

Features

Walk Through

Run queries with less friction using default credentials

On import, the package provides three operator objects (pg, lt, oc, one for each supported database), pre-loaded with default credentials (use the save_url function to save credentials to the system keyring). These allow frictionless running of sql from any python window.

>>> from dwopt import pg
>>> pg.run('select count(1) from test')
    42
>>> pg.qry('test').len()
    42

Alternatively, use the make_eng function and the operator constructors (Pg, Lt, Oc) to access a database.

>>> from dwopt import make_eng, Pg
>>> url = "postgresql://scott:tiger@localhost/mydatabase"
>>> pg = Pg(make_eng(url))
>>> pg.run('select count(1) from test')
    42

Automate processes by running sql from file with text replacement

The operator object’s run method can also run sql stored in a file. Parameters can then be replaced via a mapping dictionary, or by supplying the mappings to the method directly as keyword arguments.

from dwopt import oc
oc.run(pth = "E:/projects/my_sql_script.sql"
    , my_run_date = '2022-01-31'
    , my_label = '20220131'
    , threshold = 10.5)

The above runs the sql stored in E:/projects/my_sql_script.sql, shown below:

create table monthly_extract_:my_label as
select * from test
where
    date = to_date(':my_run_date','YYYY-MM-DD')
    and measurement > :threshold
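
Internally this is plain text replacement of :name placeholders. A minimal sketch of the mechanism in standard-library python (the render_sql helper and the temporary file are hypothetical, for illustration only, not part of dwopt):

```python
import tempfile, os

def render_sql(sql, **mappings):
    # Replace each :name placeholder with its supplied value, as plain text.
    for name, value in mappings.items():
        sql = sql.replace(f":{name}", str(value))
    return sql

# Write the example script to a temporary file, standing in for
# E:/projects/my_sql_script.sql
with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
    f.write("create table monthly_extract_:my_label as\n"
            "select * from test\n"
            "where\n"
            "    date = to_date(':my_run_date','YYYY-MM-DD')\n"
            "    and measurement > :threshold")
    pth = f.name

with open(pth) as f:
    sql = render_sql(f.read(), my_run_date='2022-01-31',
                     my_label='20220131', threshold=10.5)
os.remove(pth)
print(sql.splitlines()[0])  # create table monthly_extract_20220131 as
```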

In future releases, the package will likely migrate to the jinja package’s directive syntax.

Programmatically make and run a simple query

The operator object’s qry method returns a query object. Use its clause methods to build a simple sql query, which becomes its underlying query. The underlying query can be run directly, but its main use is as the preprocessing step for the summary methods explained in the following sections.

from dwopt import lt
(
    lt.qry('test a').select('a.id', 'a.time')
    .case('amt', cond = {'amt < 1000':500,'amt < 2000':1500}, els = 'amt')
    .join('test2 b', 'a.id = b.id')
    .where("score > 0.5", "cat = 'test'")
    .print()#.run()
)

Above prints:

select a.id,a.time
    ,case
        when amt < 1000 then 500
        when amt < 2000 then 1500
        else amt
    end as amt
from test a
left join test2 b
    on a.id = b.id
where score > 0.5
    and cat = 'test'

In future releases, the package’s query construction internals will likely be improved from text manipulation to the sqlalchemy package’s toolkit.
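
To illustrate the text-manipulation approach, here is a toy builder sketch (not dwopt’s actual implementation) showing how clause methods can chain by returning self and assembling strings:

```python
# Toy sketch of clause-method chaining via string concatenation.
# Illustration only; dwopt's real query object supports many more clauses.
class ToyQry:
    def __init__(self, tbl):
        self._select, self._from, self._where = "*", tbl, []

    def select(self, *cols):
        self._select = ",".join(cols)
        return self  # returning self enables method chaining

    def where(self, *conds):
        self._where.extend(conds)
        return self

    def sql(self):
        q = f"select {self._select}\nfrom {self._from}"
        if self._where:
            q += "\nwhere " + "\n    and ".join(self._where)
        return q

print(ToyQry("test a").select("a.id", "a.time").where("score > 0.5").sql())
```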

Sql template: Excel-pivot table experience

A few lines of code specifying minimal information can produce a summary table similar to an Excel pivot table. The difference is that the efficient database engine does the data processing work, while the flexible python machinery does the presentation work.

The operator object’s qry method returns the query object. Use its summary methods to make and run summary queries. These methods operate on top of the underlying query.

For example:

from dwopt import lt #1
(
    lt.qry('test').where("score > 0.5")                                     #2
    .valc('time, cat', "avg(score) avgscore, round(sum(amt)/1e3,2) total")  #3
    .pivot('time', 'cat', ['n', 'avgscore', 'total'])                       #4
)

Results:

cat              n              avgscore              total
            test   train       test      train     test   train
time
2013-01-02  816.0  847.0   0.746747   0.750452   398.34  417.31
2013-02-02  837.0  858.0   0.748214   0.743094   419.11  447.04
2013-03-02  805.0  860.0   0.756775   0.739017   394.89  422.35

Explanation of lines:

  1. Get the default sqlite operator object.

  2. Make, but do not run, an underlying sub query.

  3. Make and run a value counts summary query (valc) with two group-by columns and custom calculations, with the previous step’s underlying query placed inside a with clause.

  4. The query result comes back to python as a standard pandas dataframe; call its pivot method.

Automatic logs showing the sql that was ran on line 3:

2022-01-23 11:08:13,407 [INFO] running:
with x as (
    select * from test
    where score > 0.5
)
select
    time, cat
    ,count(1) n
    ,avg(score) avgscore, round(sum(amt)/1e3,2) total
from x
group by time, cat
order by n desc
2022-01-23 11:08:13,413 [INFO] done
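
The logged sql is plain sql, so it can be reproduced outside the package, for example against an in-memory sqlite database using the standard library (the test table and rows below are made-up illustration data):

```python
# Run the logged valc sql against an in-memory sqlite database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table test (time text, cat text, score real, amt real)")
con.executemany(
    "insert into test values (?,?,?,?)",
    [("2013-01-02", "test", 0.75, 100.0),
     ("2013-01-02", "test", 0.75, 200.0),
     ("2013-01-02", "train", 0.3, 50.0)],  # filtered out by score > 0.5
)
rows = con.execute("""
with x as (
    select * from test
    where score > 0.5
)
select
    time, cat
    ,count(1) n
    ,avg(score) avgscore, round(sum(amt)/1e3,2) total
from x
group by time, cat
order by n desc
""").fetchall()
print(rows)  # [('2013-01-02', 'test', 2, 0.75, 0.3)]
```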

In future releases, the package’s templating internals will likely be driven by the jinjasql package.

Sql template: Dataframe summary function experience

It is possible to mimic what some of the dataframe summary functions return, implemented by running sql templates. The difference is that the efficient database engine does the data processing work, while the flexible python machinery does the presentation work.

The operator object’s qry method returns the query object. Use its summary methods to make and run summary queries. These methods operate on top of the underlying query.

For example:

from dwopt import lt #1
tbl = lt.qry('test').where("score > 0.5") #2
tbl.top()   #show the top row, to understand the shape of the data
tbl.head()  #first few rows
tbl.cols()  #column names
tbl.len()   #number of rows
tbl.mimx('time')  #min and max of the column
tbl.dist('time', 'time, cat') #count distinct on the column, or combination of columns

Explanation of lines:

  1. Get the default sqlite operator object.

  2. Make, but do not run, an underlying sub query.

  3. Call summary methods on the query object; see the summary methods section for the full list of methods, their descriptions, examples, and the underlying sql shown in the logs.
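
Conceptually, each summary method wraps a sql template around the underlying query. A rough standard-library sketch of the templates behind len, mimx, and dist (the helper and the data below are hypothetical, not dwopt’s API):

```python
# Rough sketch of the sql templates behind len, mimx, and dist.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table test (id integer, time text, score real)")
con.executemany("insert into test values (?,?,?)",
                [(1, "2013-01-02", 0.9), (2, "2013-02-02", 0.6)])

# The underlying sub query made in step 2
sub = "select * from test where score > 0.5"

def one(sql):
    # Wrap a summary template around the sub query and fetch the first row.
    return con.execute(sql.format(sub=sub)).fetchone()

n = one("select count(1) from ({sub})")[0]                # len: row count
lo, hi = one("select min(time), max(time) from ({sub})")  # mimx: min and max
d = one("select count(distinct time) from ({sub})")[0]    # dist: count distinct
print(n, lo, hi, d)  # 2 2013-01-02 2013-02-02 2
```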

Sql template: DDL/DML statement, metadata queries

The operator object’s operation methods allow running DDL/DML statements programmatically, with added conveniences where desirable.

Also, the operator object’s metadata methods make some useful metadata queries available.

Operation methods example:

from dwopt import lt
lt.drop('test')
lt.drop('test') #returns instead of raising an error if the table does not exist
lt.create(
        tbl_nme = 'test'
        ,dtypes = {
            'id':'integer'
            ,'score':'real'
            ,'amt':'integer'
            ,'cat':'text'
            ,'time':'text'
            ,'constraint df_pk':
                'primary key (id)'
        }
    )
lt.write(df,'test')
lt.write_nodup(df,'test',['id']) #remove duplicates before inserting
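
The dedup-before-insert behaviour of write_nodup can be sketched in plain python (a hand-rolled illustration, not the package’s implementation):

```python
# Sketch of write_nodup: drop rows whose key already exists before inserting.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table test (id integer primary key, score real)")
con.execute("insert into test values (1, 0.5)")

def write_nodup_sketch(rows):
    # Keep only incoming rows whose id is not already in the table,
    # then insert the remainder.
    existing = {r[0] for r in con.execute("select id from test")}
    con.executemany("insert into test values (?,?)",
                    [r for r in rows if r[0] not in existing])

write_nodup_sketch([(1, 0.9), (2, 0.7)])  # id 1 already exists, so it is skipped
print(con.execute("select * from test order by id").fetchall())
# [(1, 0.5), (2, 0.7)]
```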

Metadata methods example:

from dwopt import pg
pg.list_tables() #list all tables
pg.table_cols('test.test') #examine columns
pg.table_cons() #list constraints
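
On sqlite, the information behind such metadata methods can also be queried directly from the catalogue; a plain-sqlite sketch (not dwopt’s implementation):

```python
# Plain-sqlite sketch of the metadata behind list_tables / table_cols.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table test (id INTEGER, score REAL)")

# list_tables equivalent: table names live in the sqlite_master catalogue
tables = [r[0] for r in con.execute(
    "select name from sqlite_master where type = 'table'")]

# table_cols equivalent: column name and declared type via pragma table_info
cols = [(r[1], r[2]) for r in con.execute("pragma table_info(test)")]

print(tables, cols)
```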

Automatic logging with fully reproducible sql

Many of the package’s methods are wired through the standard logging package.

In particular, the run method emits the sql it executes as an INFO-level message. The relevant logger follows standard naming and is called dwopt.db. Configure the logging package, or this logger specifically, at the start of application code to see the logs. See the logging package documentation for details.

Example configuration to show logs in console:

import logging
logging.basicConfig(level = logging.INFO)

from dwopt import lt
lt.list_tables()

Alternatively, to avoid logging info messages from other packages:

import logging
logging.basicConfig()
logging.getLogger('dwopt.db').setLevel(logging.INFO)

Example configuration to show in console and store on file, with timestamps:

import logging
logging.basicConfig(
    format = "%(asctime)s [%(levelname)s] %(message)s"
    ,handlers=[
        logging.FileHandler("E:/projects/logs.log"),
        logging.StreamHandler()
    ]
)
logging.getLogger('dwopt.db').setLevel(logging.INFO)

Example logs:

2022-01-23 11:08:13,407 [INFO] running:
with x as (
    select * from test
    where score > 0.5
)
select
    time, cat
    ,count(1) n
    ,avg(score) avgscore, round(sum(amt)/1e3,2) total
from x
group by time, cat
order by n desc
2022-01-23 11:08:13,413 [INFO] done