Overview

Installation

pip install dwopt

Features

Walk Through

Run queries with less friction using default credentials

On import, the package provides three operator objects (pg, lt, oc, one for each supported database), pre-loaded with default credentials (use the save_url function to save credentials to the system keyring). These allow frictionless running of sql from any python window.

>>> from dwopt import pg
>>> pg.run('select count(1) from test')
    42
>>> pg.qry('test').len()
    42

Alternatively, use the make_eng function and the operator constructors (Pg, Lt, Oc) to access a database.

>>> from dwopt import make_eng, Pg
>>> url = "postgresql://scott:tiger@localhost/mydatabase"
>>> pg = Pg(make_eng(url))
>>> pg.run('select count(1) from test')
    42

Automate processes by running sql from file with text replacement

The operator object’s run method can also run sql stored in a file. Parameters can then be replaced via a mapping dictionary, or by supplying the mappings to the method directly as keyword arguments.

from dwopt import oc
oc.run(pth = "E:/projects/my_sql_script.sql"
    , my_run_date = '2022-01-31'
    , my_label = '20220131'
    , threshold = 10.5)

The above runs the sql stored in E:/projects/my_sql_script.sql, shown below:

create table monthly_extract_:my_label as
select * from test
where
    date = to_date(':my_run_date','YYYY-MM-DD')
    and measurement > :threshold
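
Internally this is plain text replacement of :name placeholders. A minimal sketch of the mechanism in standard-library python (the render_sql helper and the temporary file are hypothetical, for illustration only, not part of dwopt):

```python
import tempfile, os

def render_sql(sql, **mappings):
    # Replace each :name placeholder with its supplied value, as plain text.
    for name, value in mappings.items():
        sql = sql.replace(f":{name}", str(value))
    return sql

# Write the example script to a temporary file, standing in for
# E:/projects/my_sql_script.sql
with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
    f.write("create table monthly_extract_:my_label as\n"
            "select * from test\n"
            "where\n"
            "    date = to_date(':my_run_date','YYYY-MM-DD')\n"
            "    and measurement > :threshold")
    pth = f.name

with open(pth) as f:
    sql = render_sql(f.read(), my_run_date='2022-01-31',
                     my_label='20220131', threshold=10.5)
os.remove(pth)
print(sql.splitlines()[0])  # create table monthly_extract_20220131 as
```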

In future releases, the package will likely migrate to the jinja package’s directive syntax.

Programmatically make and run a simple query

The operator object’s qry method returns a query object. Use its clause methods to build a simple sql query, which becomes its underlying query. The underlying query can be run directly, but its main use is as the preprocessing step for the summary methods explained in the following sections.

from dwopt import lt
(
    lt.qry('test a').select('a.id', 'a.time')
    .case('amt', cond = {'amt < 1000':500,'amt < 2000':1500}, els = 'amt')
    .join('test2 b', 'a.id = b.id')
    .where("score > 0.5", "cat = 'test'")
    .print()#.run()
)

Above prints:

select a.id,a.time
    ,case
        when amt < 1000 then 500
        when amt < 2000 then 1500
        else amt
    end as amt
from test a
left join test2 b
    on a.id = b.id
where score > 0.5
    and cat = 'test'

In future releases, the package’s query construction internals will likely be improved from text manipulation to the sqlalchemy package’s toolkit.
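
To illustrate the text-manipulation approach, here is a toy builder sketch (not dwopt’s actual implementation) showing how clause methods can chain by returning self and assembling strings:

```python
# Toy sketch of clause-method chaining via string concatenation.
# Illustration only; dwopt's real query object supports many more clauses.
class ToyQry:
    def __init__(self, tbl):
        self._select, self._from, self._where = "*", tbl, []

    def select(self, *cols):
        self._select = ",".join(cols)
        return self  # returning self enables method chaining

    def where(self, *conds):
        self._where.extend(conds)
        return self

    def sql(self):
        q = f"select {self._select}\nfrom {self._from}"
        if self._where:
            q += "\nwhere " + "\n    and ".join(self._where)
        return q

print(ToyQry("test a").select("a.id", "a.time").where("score > 0.5").sql())
```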

Sql template: Excel-pivot table experience

A few lines of code specifying minimal information can produce a summary table similar to an Excel pivot table. The difference is that the efficient database engine does the data processing work, while the flexible python machinery does the presentation work.

The operator object’s qry method returns the query object. Use its summary methods to make and run summary queries. These methods operate on top of the underlying query.

For example:

from dwopt import lt #1
(
    lt.qry('test').where("score > 0.5")                                     #2
    .valc('time, cat', "avg(score) avgscore, round(sum(amt)/1e3,2) total")  #3
    .pivot('time', 'cat', ['n', 'avgscore', 'total'])                       #4
)

Results:

cat              n              avgscore              total
            test   train       test      train     test   train
time
2013-01-02  816.0  847.0   0.746747   0.750452   398.34  417.31
2013-02-02  837.0  858.0   0.748214   0.743094   419.11  447.04
2013-03-02  805.0  860.0   0.756775   0.739017   394.89  422.35

Explanation of lines:

  1. Get the default sqlite operator object.

  2. Make, but do not run, an underlying sub query.

  3. Make and run a value counts summary query (valc) with two group-by columns and custom calculations, with the previous step’s underlying query placed inside a with clause.

  4. The query result comes back to python as a standard pandas dataframe; call its pivot method.

Automatic logs showing the sql that was ran on line 3:

2022-01-23 11:08:13,407 [INFO] running:
with x as (
    select * from test
    where score > 0.5
)
select
    time, cat
    ,count(1) n
    ,avg(score) avgscore, round(sum(amt)/1e3,2) total
from x
group by time, cat
order by n desc
2022-01-23 11:08:13,413 [INFO] done
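
The logged sql is plain sql, so it can be reproduced outside the package, for example against an in-memory sqlite database using the standard library (the test table and rows below are made-up illustration data):

```python
# Run the logged valc sql against an in-memory sqlite database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table test (time text, cat text, score real, amt real)")
con.executemany(
    "insert into test values (?,?,?,?)",
    [("2013-01-02", "test", 0.75, 100.0),
     ("2013-01-02", "test", 0.75, 200.0),
     ("2013-01-02", "train", 0.3, 50.0)],  # filtered out by score > 0.5
)
rows = con.execute("""
with x as (
    select * from test
    where score > 0.5
)
select
    time, cat
    ,count(1) n
    ,avg(score) avgscore, round(sum(amt)/1e3,2) total
from x
group by time, cat
order by n desc
""").fetchall()
print(rows)  # [('2013-01-02', 'test', 2, 0.75, 0.3)]
```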

In future releases, the package’s templating internals will likely be driven by the jinjasql package.

Sql template: Dataframe summary function experience

It is possible to mimic what some of the dataframe summary functions return, implemented by running sql templates. The difference is that the efficient database engine does the data processing work, while the flexible python machinery does the presentation work.

The operator object’s qry method returns the query object. Use its summary methods to make and run summary queries. These methods operate on top of the underlying query.

For example:

from dwopt import lt #1
tbl = lt.qry('test').where("score > 0.5") #2
tbl.top()   #show the top row, to understand the shape of the data
tbl.head()  #first few rows
tbl.cols()  #column names
tbl.len()   #number of rows
tbl.mimx('time')  #min and max of the column
tbl.dist('time', 'time, cat') #count distinct on the column, or combination of columns

Explanation of lines:

  1. Get the default sqlite operator object.

  2. Make, but do not run, an underlying sub query.

  3. Call summary methods on the query object; see the summary methods section for the full list of methods, their descriptions, examples, and the underlying sql shown in the logs.
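
Conceptually, each summary method wraps a sql template around the underlying query. A rough standard-library sketch of the templates behind len, mimx, and dist (the helper and the data below are hypothetical, not dwopt’s API):

```python
# Rough sketch of the sql templates behind len, mimx, and dist.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table test (id integer, time text, score real)")
con.executemany("insert into test values (?,?,?)",
                [(1, "2013-01-02", 0.9), (2, "2013-02-02", 0.6)])

# The underlying sub query made in step 2
sub = "select * from test where score > 0.5"

def one(sql):
    # Wrap a summary template around the sub query and fetch the first row.
    return con.execute(sql.format(sub=sub)).fetchone()

n = one("select count(1) from ({sub})")[0]                # len: row count
lo, hi = one("select min(time), max(time) from ({sub})")  # mimx: min and max
d = one("select count(distinct time) from ({sub})")[0]    # dist: count distinct
print(n, lo, hi, d)  # 2 2013-01-02 2013-02-02 2
```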

Sql template: DDL/DML statement, metadata queries

The operator object’s operation methods allow running DDL/DML statements programmatically, with added conveniences where desirable.

Also, the operator object’s metadata methods make some useful metadata queries available.

Operation methods example:

from dwopt import lt
lt.drop('test')
lt.drop('test') #returns instead of raising an error if the table does not exist
lt.create(
        tbl_nme = 'test'
        ,dtypes = {
            'id':'integer'
            ,'score':'real'
            ,'amt':'integer'
            ,'cat':'text'
            ,'time':'text'
            ,'constraint df_pk':
                'primary key (id)'
        }
    )
lt.write(df,'test')
lt.write_nodup(df,'test',['id']) #remove duplicates before inserting
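
The dedup-before-insert behaviour of write_nodup can be sketched in plain python (a hand-rolled illustration, not the package’s implementation):

```python
# Sketch of write_nodup: drop rows whose key already exists before inserting.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table test (id integer primary key, score real)")
con.execute("insert into test values (1, 0.5)")

def write_nodup_sketch(rows):
    # Keep only incoming rows whose id is not already in the table,
    # then insert the remainder.
    existing = {r[0] for r in con.execute("select id from test")}
    con.executemany("insert into test values (?,?)",
                    [r for r in rows if r[0] not in existing])

write_nodup_sketch([(1, 0.9), (2, 0.7)])  # id 1 already exists, so it is skipped
print(con.execute("select * from test order by id").fetchall())
# [(1, 0.5), (2, 0.7)]
```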

Metadata methods example:

from dwopt import pg
pg.list_tables() #list all tables
pg.table_cols('test.test') #examine columns
pg.table_cons() #list constraints
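
On sqlite, the information behind such metadata methods can also be queried directly from the catalogue; a plain-sqlite sketch (not dwopt’s implementation):

```python
# Plain-sqlite sketch of the metadata behind list_tables / table_cols.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table test (id INTEGER, score REAL)")

# list_tables equivalent: table names live in the sqlite_master catalogue
tables = [r[0] for r in con.execute(
    "select name from sqlite_master where type = 'table'")]

# table_cols equivalent: column name and declared type via pragma table_info
cols = [(r[1], r[2]) for r in con.execute("pragma table_info(test)")]

print(tables, cols)
```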

Automatic logging with fully reproducible sql

Many of the package’s methods are wired through the standard logging package.

In particular, the run method emits the sql it executes as an INFO-level message. The relevant logger follows standard naming and is called dwopt.db. Configure the logging package, or this logger specifically, at the start of application code to see the logs. See the logging package documentation for details.

Example configuration to show logs in console:

import logging
logging.basicConfig(level = logging.INFO)

from dwopt import lt
lt.list_tables()

Alternatively, to avoid logging info messages from other packages:

import logging
logging.basicConfig()
logging.getLogger('dwopt.db').setLevel(logging.INFO)

Example configuration to show in console and store on file, with timestamps:

import logging
logging.basicConfig(
    format = "%(asctime)s [%(levelname)s] %(message)s"
    ,handlers=[
        logging.FileHandler("E:/projects/logs.log"),
        logging.StreamHandler()
    ]
)
logging.getLogger('dwopt.db').setLevel(logging.INFO)

Example logs:

2022-01-23 11:08:13,407 [INFO] running:
with x as (
    select * from test
    where score > 0.5
)
select
    time, cat
    ,count(1) n
    ,avg(score) avgscore, round(sum(amt)/1e3,2) total
from x
group by time, cat
order by n desc
2022-01-23 11:08:13,413 [INFO] done