coldstart package

Submodules

coldstart.build module

class coldstart.build.FeatureFactory

Bases: object

Coldstart’s main class

get_dataframe()

Returns the training data as a dataframe

Returns

Training data

Return type

DataFrame

get_table()

Returns the name of the final feature table

Returns

Table name containing training data

Return type

str

run(leftmost_table=None, feature_table=None, entity_id=None, domains=None, queries=None, date_range=None, query_dir=None, export_dir=None, drop_intermedieate_tables=True, return_df=True, compute_df=True, stop_on_error=False, downcast=False, batching=False, batch_size=None)

Runs FeatureFactory

Parameters
  • leftmost_table (str) – Left-most table used for constraining feature queries. Should include entity_id and y. Can also include min_date and max_date. Defaults to None.

  • feature_table (str) – Destination for final table. Defaults to None.

  • entity_id (str) – Entity of interest for feature queries. Defaults to None.

  • domains (list) – Domains of interest for feature queries. Defaults to None.

  • queries (list) – Queries of interest for feature queries. Defaults to None.

  • date_range (list) – min_date and max_date used for constraining feature queries. Defaults to None.

  • query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.

  • export_dir (str) – Destination directory for frozen queries. Defaults to None.

  • drop_intermedieate_tables (bool) – Whether to drop intermediate query result tables. Defaults to True.

  • return_df (bool) – Whether to return a dataframe. Defaults to True.

  • compute_df (bool) – Whether to compute the dataframe. If False, a Dask dataframe is returned instead of a pandas dataframe. Defaults to True.

  • stop_on_error (bool) – Whether to halt FeatureFactory if any query fails. Defaults to False.

  • downcast (bool) – Whether to attempt dataframe dtype downcasting. Defaults to False.

  • batching (bool) – Whether to divide feature queries into batches. Defaults to False.

  • batch_size (int) – Batch size used when batching is True. Defaults to None.

Raises
  • ValueError – Error for missing engine

  • ValueError – Error for errored queries
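
The batching and batch_size parameters describe splitting the feature queries into groups. A minimal sketch of that idea follows; the helper name iter_batches and the query names are hypothetical, and this is not necessarily how FeatureFactory batches internally.

```python
from typing import Iterator, List, Sequence


def iter_batches(queries: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size queries."""
    if batch_size is None or batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(queries), batch_size):
        yield list(queries[start:start + batch_size])


# Hypothetical query names, for illustration only.
example_queries = ["q1", "q2", "q3", "q4", "q5"]
batches = list(iter_batches(example_queries, batch_size=2))
```

The last batch may be smaller than batch_size when the number of queries is not an exact multiple.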

start_engine(db_spec)

Starts SQLAlchemy engine

Parameters

db_spec (dict) – Used as the config for create_engine

Raises
  • ValueError – Error for missing dialect

  • ValueError – Error for missing schema

  • ValueError – Error for missing project_id

  • ValueError – Error for unsupported database

  • ValueError – Error for failed engine creation

stop_engine()

Stops SQLAlchemy engine

coldstart.parse module

coldstart.parse.get_queries(dialect=None, entity_id=None, queries=None, query_dir=None)

Prepares dictionary of queries to template for given queries

Parameters
  • dialect (str) – Dialect of interest. Defaults to None.

  • entity_id (str) – Entity of interest. Defaults to None.

  • queries (list) – Queries of interest. Defaults to None.

  • query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.

Raises
  • ValueError – Error for missing dialect

  • ValueError – Error for invalid dialect

  • ValueError – Error for missing entity_id

  • ValueError – Error for invalid entity_id

  • ValueError – Error for invalid query

Returns

Dictionary of queries to template

Return type

dict

coldstart.parse.get_queries_from_domains(dialect=None, entity_id=None, domains=None, query_dir=None)

Prepares dictionary of queries to template for given domains

Parameters
  • dialect (str) – Dialect of interest. Defaults to None.

  • entity_id (str) – Entity of interest. Defaults to None.

  • domains (list) – Domains of interest. Defaults to None.

  • query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.

Raises
  • ValueError – Error for missing dialect

  • ValueError – Error for invalid dialect

  • ValueError – Error for missing entity_id

  • ValueError – Error for invalid entity_id

  • ValueError – Error for invalid domain

Returns

Dictionary of queries to template

Return type

dict

coldstart.parse.list_dialects(query_dir=None)

Return list of available dialects

Parameters

query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.

Returns

List of available dialects

Return type

list
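
The layout of the query directory is not described in this reference. Assuming, purely for illustration, that each dialect is a subdirectory of query_dir and each entity is a subdirectory of its dialect, listing them could be sketched as follows. The helper list_subdirs and the directory names are hypothetical.

```python
import os
import tempfile


def list_subdirs(path):
    """Return the sorted names of the directories directly under path."""
    return sorted(entry.name for entry in os.scandir(path) if entry.is_dir())


# Build a throwaway directory tree that mimics the assumed layout.
with tempfile.TemporaryDirectory() as query_dir:
    for dialect in ("postgresql", "bigquery"):
        os.makedirs(os.path.join(query_dir, dialect, "customer"))
    dialects = list_subdirs(query_dir)
    entities = list_subdirs(os.path.join(query_dir, "bigquery"))
```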

coldstart.parse.list_domains(dialect=None, entity_id=None, query_dir=None)

Return list of available domains

Parameters
  • dialect (str) – Dialect of interest. Defaults to None.

  • entity_id (str) – Entity of interest. Defaults to None.

  • query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.

Raises
  • ValueError – Error for missing dialect

  • ValueError – Error for invalid dialect

  • ValueError – Error for missing entity_id

  • ValueError – Error for invalid entity_id

Returns

List of available domains

Return type

list

coldstart.parse.list_entities(dialect=None, query_dir=None)

Return list of available entities

Parameters
  • dialect (str) – Dialect of interest. Defaults to None.

  • query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.

Raises
  • ValueError – Error for missing dialect

  • ValueError – Error for invalid dialect

Returns

List of available entities

Return type

list

coldstart.parse.list_queries(dialect=None, entity_id=None, domains=None, query_dir=None)

Return list of available queries

Parameters
  • dialect (str) – Dialect of interest. Defaults to None.

  • entity_id (str) – Entity of interest. Defaults to None.

  • domains (list) – Domains of interest. Defaults to None.

  • query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.

Raises
  • ValueError – Error for missing dialect

  • ValueError – Error for invalid dialect

  • ValueError – Error for missing entity_id

  • ValueError – Error for invalid entity_id

  • ValueError – Error for invalid domain

Returns

List of available queries

Return type

list

coldstart.query module

coldstart.query.attempt_downcast(df)

Attempts to downcast dataframe data types

Parameters

df (DataFrame) – Dataframe to attempt downcasting on

Returns

Downcast dataframe

Return type

DataFrame

coldstart.query.collect_metadata(engine, schema, table_list)

Collects metadata for successful feature queries

Parameters
  • engine (object) – Engine object

  • schema (str) – Schema of interest

  • table_list (list) – Table names of successful feature queries

Returns

table_name and column_name

Return type

DataFrame

coldstart.query.convert_categorical(s)

Attempts to convert a series to the categorical data type

Parameters

s (Series) – Series to attempt categorical conversion on

Returns

Series converted to the categorical data type

Return type

Series

coldstart.query.drop_tables(engine, table_list)

Drops intermediate tables

Parameters
  • engine (object) – Engine object

  • table_list (list) – Tables to drop

Returns

Results: query_name, status, run_time

Return type

list

coldstart.query.freeze_queries(query_dir, export_dir, query_dict)

Saves queries to the specified directory

Parameters
  • query_dir (str) – Directory to copy feature queries from

  • export_dir (str) – Directory to export feature queries to

  • query_dict (dict) – Untemplated feature queries to freeze

coldstart.query.multi_query(query_tuples)

Runs queries concurrently via threading

Parameters

query_tuples (list) – List of tuples: query_name, engine, sql, return_df

Returns

Results: query_name, status, run_time

Return type

list
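
The pattern described above, one worker per query with each worker reporting query_name, status and run_time, can be sketched with the standard library's thread pool. This is an illustration only: run_one and run_many are hypothetical names, a plain callable stands in for the engine, the return_df element is omitted, and the status strings are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_one(query_tuple):
    """Run one query and report (query_name, status, run_time)."""
    query_name, execute, sql = query_tuple
    start = time.perf_counter()
    try:
        execute(sql)
        status = "success"  # assumed status label
    except Exception:
        status = "error"  # assumed status label
    return query_name, status, time.perf_counter() - start


def run_many(query_tuples, max_workers=4):
    """Run all queries on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, query_tuples))


def fake_execute(sql):
    """Stand-in for a database call; fails when the SQL contains FAIL."""
    if "FAIL" in sql:
        raise RuntimeError("simulated query failure")


results = run_many([
    ("ok_query", fake_execute, "SELECT 1"),
    ("bad_query", fake_execute, "SELECT FAIL"),
])
```

Catching the exception inside the worker lets one failed query be reported without cancelling the others.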

coldstart.query.name_table(schema, query_name)

Names tables according to pattern

Parameters
  • schema (str) – Name of schema

  • query_name (str) – Name of query

Returns

Table name

Return type

str

coldstart.query.prep_join_query(schema, table_df, feature_table=None)

Prepares the final join query

Parameters
  • schema (str) – Schema of interest

  • table_df (DataFrame) – Dataframe containing table_name and column_name

  • feature_table (str) – Specified name of final table. Defaults to None.

Returns

Join SQL, Name of final table

Return type

str, str
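
To illustrate how a join statement might be assembled from table_name and column_name pairs, here is a rough sketch. It assumes the first table listed is the base table and that every table shares an idx key column. The function build_join_sql and the table names are hypothetical, and the SQL that coldstart actually generates may differ.

```python
def build_join_sql(schema, rows, key="idx"):
    """Build a LEFT JOIN statement from (table_name, column_name) pairs."""
    # Group columns by table, keeping the order in which tables first appear.
    columns_by_table = {}
    for table_name, column_name in rows:
        columns_by_table.setdefault(table_name, []).append(column_name)
    tables = list(columns_by_table)
    if not tables:
        raise ValueError("no tables to join")
    base = tables[0]
    # Select the key once from the base table, then every non-key column.
    select_cols = [f"{base}.{key}"]
    for table in tables:
        select_cols += [f"{table}.{col}" for col in columns_by_table[table] if col != key]
    lines = ["SELECT " + ", ".join(select_cols), f"FROM {schema}.{base} AS {base}"]
    for table in tables[1:]:
        lines.append(
            f"LEFT JOIN {schema}.{table} AS {table} ON {base}.{key} = {table}.{key}"
        )
    return "\n".join(lines)


# Hypothetical schema, table and column names.
join_sql = build_join_sql(
    "demo",
    [("staged", "idx"), ("staged", "y"), ("feat_a", "idx"), ("feat_a", "a1")],
)
```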

coldstart.query.run_query(engine, sql, return_df=True)

Runs a query

Parameters
  • engine (object) – Engine object

  • sql (str) – SQL query

  • return_df (bool) – Whether to return a dataframe. Defaults to True.

Returns

Query results

Return type

DataFrame

coldstart.query.run_threaded_query(query_tuple)

Wrapper function for running concurrent queries via threading

Parameters

query_tuple (tuple) – query_name, engine, sql, return_df

Returns

Results: query_name, status, run_time

Return type

tuple

coldstart.query.stage_leftmost_table(engine, schema, leftmost_table, entity_id, dt1, dt2)

Stages leftmost table to include idx while performing data validation

Parameters
  • engine (object) – Engine object

  • schema (str) – Schema of interest

  • leftmost_table (str) – User defined leftmost table

  • entity_id (str) – entity_id of interest

  • dt1 (str) – min_date

  • dt2 (str) – max_date

Raises
  • ValueError – Error for invalid entity_id column

  • ValueError – Error for missing y column

  • ValueError – Error for missing min_date column

  • ValueError – Error for missing max_date column

  • ValueError – Error for invalid date format

Returns

Staged table name, Results: query_name, status, run_time

Return type

str, tuple
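
One of the errors above concerns an invalid date format. The expected format is not stated in this reference, so the sketch below assumes ISO-style YYYY-MM-DD strings. The helper parse_date_range is hypothetical and only illustrates the kind of check involved.

```python
from datetime import datetime


def parse_date_range(dt1, dt2, fmt="%Y-%m-%d"):
    """Parse min_date and max_date, raising ValueError on a bad format."""
    try:
        return datetime.strptime(dt1, fmt), datetime.strptime(dt2, fmt)
    except (TypeError, ValueError) as exc:
        # Re-raise with a message that names the assumed format.
        raise ValueError(f"dates must match the format {fmt}") from exc


start, end = parse_date_range("2021-01-01", "2021-12-31")

try:
    parse_date_range("01/01/2021", "2021-12-31")
    bad_format_rejected = False
except ValueError:
    bad_format_rejected = True
```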

coldstart.query.template_queries(engine, schema, staged_table, query_dict)

Templates queries with LEFTMOST_TABLE

Parameters
  • engine (object) – Engine object

  • schema (str) – Schema of interest

  • staged_table (str) – Name of the staged leftmost table

  • query_dict (dict) – Queries to template

Returns

Dictionary of query_name: table_name, list of templated queries

Return type

dict, list
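
Templating with LEFTMOST_TABLE can be pictured as substituting the staged table name into each query. The sketch below assumes the placeholder appears as the bare token LEFTMOST_TABLE; the real placeholder syntax is not shown in this reference, and the function, query and table names are hypothetical.

```python
def template_query(sql, staged_table, placeholder="LEFTMOST_TABLE"):
    """Replace every occurrence of the placeholder with the staged table."""
    return sql.replace(placeholder, staged_table)


# Hypothetical untemplated queries keyed by query name.
query_dict = {
    "order_count": "SELECT l.idx, COUNT(*) AS n FROM LEFTMOST_TABLE l GROUP BY l.idx",
}
templated = {
    name: template_query(sql, "demo.staged_leftmost")
    for name, sql in query_dict.items()
}
```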

Module contents

class coldstart.FeatureFactory

Bases: object

Coldstart’s main class

get_dataframe()

Returns the training data as a dataframe

Returns

Training data

Return type

DataFrame

get_table()

Returns the name of the final feature table

Returns

Table name containing training data

Return type

str

run(leftmost_table=None, feature_table=None, entity_id=None, domains=None, queries=None, date_range=None, query_dir=None, export_dir=None, drop_intermedieate_tables=True, return_df=True, compute_df=True, stop_on_error=False, downcast=False, batching=False, batch_size=None)

Runs FeatureFactory

Parameters
  • leftmost_table (str) – Left-most table used for constraining feature queries. Should include entity_id and y. Can also include min_date and max_date. Defaults to None.

  • feature_table (str) – Destination for final table. Defaults to None.

  • entity_id (str) – Entity of interest for feature queries. Defaults to None.

  • domains (list) – Domains of interest for feature queries. Defaults to None.

  • queries (list) – Queries of interest for feature queries. Defaults to None.

  • date_range (list) – min_date and max_date used for constraining feature queries. Defaults to None.

  • query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.

  • export_dir (str) – Destination directory for frozen queries. Defaults to None.

  • drop_intermedieate_tables (bool) – Whether to drop intermediate query result tables. Defaults to True.

  • return_df (bool) – Whether to return a dataframe. Defaults to True.

  • compute_df (bool) – Whether to compute the dataframe. If False, a Dask dataframe is returned instead of a pandas dataframe. Defaults to True.

  • stop_on_error (bool) – Whether to halt FeatureFactory if any query fails. Defaults to False.

  • downcast (bool) – Whether to attempt dataframe dtype downcasting. Defaults to False.

  • batching (bool) – Whether to divide feature queries into batches. Defaults to False.

  • batch_size (int) – Batch size used when batching is True. Defaults to None.

Raises
  • ValueError – Error for missing engine

  • ValueError – Error for errored queries

start_engine(db_spec)

Starts SQLAlchemy engine

Parameters

db_spec (dict) – Used as the config for create_engine

Raises
  • ValueError – Error for missing dialect

  • ValueError – Error for missing schema

  • ValueError – Error for missing project_id

  • ValueError – Error for unsupported database

  • ValueError – Error for failed engine creation

stop_engine()

Stops SQLAlchemy engine