coldstart package
Submodules
coldstart.build module
- class coldstart.build.FeatureFactory
Bases: object
Coldstart’s main class
- get_dataframe()
For returning a dataframe object
- Returns
Training data
- Return type
DataFrame
- get_table()
For returning final feature table name
- Returns
Table name containing training data
- Return type
str
- run(leftmost_table=None, feature_table=None, entity_id=None, domains=None, queries=None, date_range=None, query_dir=None, export_dir=None, drop_intermedieate_tables=True, return_df=True, compute_df=True, stop_on_error=False, downcast=False, batching=False, batch_size=None)
Runs FeatureFactory: executes the selected feature queries and builds the final feature table.
- Parameters
leftmost_table (str) – Left-most table used for constraining feature queries. Should include entity_id and y. Can also include min_date and max_date. Defaults to None.
feature_table (str) – Destination for final table. Defaults to None.
entity_id (str) – Entity of interest for feature queries. Defaults to None.
domains (list) – Domains of interest for feature queries. Defaults to None.
queries (list) – Queries of interest for feature queries. Defaults to None.
date_range (list) – min_date and max_date used for constraining feature queries. Defaults to None.
query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.
export_dir (str) – Destination directory for frozen queries. Defaults to None.
drop_intermedieate_tables (bool) – Used for removing intermediate query results. Defaults to True.
return_df (bool) – Used for returning a dataframe. Defaults to True.
compute_df (bool) – Whether to compute the dataframe. If False, a Dask dataframe is returned instead of a pandas dataframe. Defaults to True.
stop_on_error (bool) – Will halt FeatureFactory if any one query fails. Defaults to False.
downcast (bool) – Will attempt dataframe dtype downcasting. Defaults to False.
batching (bool) – Used for dividing feature queries into batches. Defaults to False.
batch_size (int) – Corresponding batch size if batching is True. Defaults to None.
- Raises
ValueError – Error for missing engine
ValueError – Error for errored queries
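The batching and batch_size parameters suggest that feature queries are split into fixed-size groups before they are executed. A minimal sketch of that idea, assuming simple consecutive chunking; the helper name make_batches is hypothetical and is not part of coldstart:

```python
def make_batches(queries, batch_size=None):
    """Split a list of query names into consecutive batches.

    Hypothetical helper, not coldstart's implementation. With
    batch_size=None, all queries are returned as a single batch.
    """
    if batch_size is None:
        return [list(queries)]
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [
        list(queries[i:i + batch_size])
        for i in range(0, len(queries), batch_size)
    ]


batches = make_batches(["q1", "q2", "q3", "q4", "q5"], batch_size=2)
print(batches)  # [['q1', 'q2'], ['q3', 'q4'], ['q5']]
```

The last batch may be smaller than batch_size when the number of queries is not an exact multiple.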
- start_engine(db_spec)
Starts SQLAlchemy engine
- Parameters
db_spec (dict) – Used as the config for create_engine
- Raises
ValueError – Error for missing dialect
ValueError – Error for missing schema
ValueError – Error for missing project_id
ValueError – Error for unsupported database
ValueError – Error for failed engine creation
- stop_engine()
Stops SQLAlchemy engine
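The methods above imply a lifecycle of start_engine, run, get_dataframe or get_table, and stop_engine. The sketch below wraps that sequence so the engine is stopped even if run raises. The wrapper name build_training_data is hypothetical, and the db_spec keys shown are inferred only from the errors listed for start_engine (missing dialect, schema, project_id); the full set of accepted keys is an assumption.

```python
def build_training_data(factory, db_spec, **run_kwargs):
    """Start the engine, run the factory, and always stop the engine.

    `factory` is expected to be a FeatureFactory instance. Only the
    documented methods start_engine, run, get_dataframe, get_table and
    stop_engine are called.
    """
    factory.start_engine(db_spec)
    try:
        factory.run(**run_kwargs)
        return factory.get_dataframe(), factory.get_table()
    finally:
        # Runs on success and on failure, so the engine is not left open.
        factory.stop_engine()


# Placeholder values only; replace with settings for your own database.
db_spec = {
    "dialect": "your_dialect",
    "schema": "your_schema",
    "project_id": "your_project_id",
}

# Usage, assuming coldstart is installed and the database is reachable:
# from coldstart import FeatureFactory
# df, table = build_training_data(
#     FeatureFactory(), db_spec,
#     leftmost_table="your_leftmost_table", entity_id="your_entity_id",
# )
```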
coldstart.parse module
- coldstart.parse.get_queries(dialect=None, entity_id=None, queries=None, query_dir=None)
Prepares dictionary of queries to template for given queries
- Parameters
dialect (str) – Dialect of interest. Defaults to None.
entity_id (str) – Entity of interest. Defaults to None.
queries (list) – Queries of interest. Defaults to None.
query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.
- Raises
ValueError – Error for missing dialect
ValueError – Error for invalid dialect
ValueError – Error for missing entity_id
ValueError – Error for invalid entity_id
ValueError – Error for invalid query
- Returns
Dictionary of queries to template
- Return type
dict
- coldstart.parse.get_queries_from_domains(dialect=None, entity_id=None, domains=None, query_dir=None)
Prepares dictionary of queries to template for given domains
- Parameters
dialect (str) – Dialect of interest. Defaults to None.
entity_id (str) – Entity of interest. Defaults to None.
domains (list) – Domains of interest. Defaults to None.
query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.
- Raises
ValueError – Error for missing dialect
ValueError – Error for invalid dialect
ValueError – Error for missing entity_id
ValueError – Error for invalid entity_id
ValueError – Error for invalid domain
- Returns
Dictionary of queries to template
- Return type
dict
- coldstart.parse.list_dialects(query_dir=None)
Return list of available dialects
- Parameters
query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.
- Returns
List of available dialects
- Return type
list
- coldstart.parse.list_domains(dialect=None, entity_id=None, query_dir=None)
Return list of available domains
- Parameters
dialect (str) – Dialect of interest. Defaults to None.
entity_id (str) – Entity of interest. Defaults to None.
query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.
- Raises
ValueError – Error for missing dialect
ValueError – Error for invalid dialect
ValueError – Error for missing entity_id
ValueError – Error for invalid entity_id
- Returns
List of available domains
- Return type
list
- coldstart.parse.list_entities(dialect=None, query_dir=None)
Return list of available entities
- Parameters
dialect (str) – Dialect of interest. Defaults to None.
query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.
- Raises
ValueError – Error for missing dialect
ValueError – Error for invalid dialect
- Returns
List of available entities
- Return type
list
- coldstart.parse.list_queries(dialect=None, entity_id=None, domains=None, query_dir=None)
Return list of available queries
- Parameters
dialect (str) – Dialect of interest. Defaults to None.
entity_id (str) – Entity of interest. Defaults to None.
domains (list) – Domains of interest. Defaults to None.
query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.
- Raises
ValueError – Error for missing dialect
ValueError – Error for invalid dialect
ValueError – Error for missing entity_id
ValueError – Error for invalid entity_id
ValueError – Error for invalid domain
- Returns
List of available queries
- Return type
list
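The listing functions above nest naturally: dialects contain entities, entities contain domains, and domains contain queries. The sketch below walks that hierarchy using only the documented signatures. The function name describe_query_bank is hypothetical, and the parse module is passed in as an argument so the sketch can be tried without a query bank on disk.

```python
def describe_query_bank(parse, query_dir=None):
    """Build a nested catalog of the available feature queries.

    `parse` is expected to be the coldstart.parse module. Returns a dict
    keyed by (dialect, entity_id) whose values map each domain to its
    list of queries.
    """
    catalog = {}
    for dialect in parse.list_dialects(query_dir=query_dir):
        for entity_id in parse.list_entities(dialect=dialect, query_dir=query_dir):
            domains = parse.list_domains(
                dialect=dialect, entity_id=entity_id, query_dir=query_dir
            )
            catalog[(dialect, entity_id)] = {
                domain: parse.list_queries(
                    dialect=dialect,
                    entity_id=entity_id,
                    domains=[domain],
                    query_dir=query_dir,
                )
                for domain in domains
            }
    return catalog


# Usage, assuming coldstart is installed:
# from coldstart import parse
# catalog = describe_query_bank(parse)
```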
coldstart.query module
- coldstart.query.attempt_downcast(df)
For attempting data type downcasting
- Parameters
df (DataFrame) – To attempt downcasting on
- Returns
Downcasted dataframe
- Return type
DataFrame
- coldstart.query.collect_metadata(engine, schema, table_list)
For collecting successful feature query metadata
- Parameters
engine (object) – Engine object
schema (str) – Schema of interest
table_list (list) – Table names of successful feature queries
- Returns
table_name and column_name
- Return type
DataFrame
- coldstart.query.convert_categorical(s)
For attempting categorical conversion
- Parameters
s (Series) – To attempt categorical conversion on
- Returns
Series converted to categorical data type
- Return type
Series
- coldstart.query.drop_tables(engine, table_list)
For dropping intermediate tables
- Parameters
engine (object) – Engine object
table_list (list) – Tables to drop
- Returns
Results: query_name, status, run_time
- Return type
list
- coldstart.query.freeze_queries(query_dir, export_dir, query_dict)
Saves queries to the specified directory
- Parameters
query_dir (str) – Directory to copy feature queries from
export_dir (str) – Directory to export feature queries to
query_dict (dict) – Untemplated feature queries to freeze
- coldstart.query.multi_query(query_tuples)
For running concurrent queries via threading
- Parameters
query_tuples (list) – List of tuples: query_name, engine, sql, return_df
- Returns
Results: query_name, status, run_time
- Return type
list
- coldstart.query.name_table(schema, query_name)
Names tables according to pattern
- Parameters
schema (str) – Name of schema
query_name (str) – Name of query
- Returns
Table name
- Return type
str
- coldstart.query.prep_join_query(schema, table_df, feature_table=None)
For preparing final join query
- Parameters
schema (str) – Schema of interest
table_df (DataFrame) – Containing table_name and column_name
feature_table (str) – Specified name of final table. Defaults to None.
- Returns
Join SQL, Name of final table
- Return type
str, str
- coldstart.query.run_query(engine, sql, return_df=True)
For running queries
- Parameters
engine (object) – Engine object
sql (str) – SQL query
return_df (bool) – Will return dataframe. Defaults to True.
- Returns
Query results
- Return type
DataFrame
- coldstart.query.run_threaded_query(query_tuple)
Wrapper function for running concurrent queries via threading
- Parameters
query_tuple (tuple) – query_name, engine, sql, return_df
- Returns
Results: query_name, status, run_time
- Return type
tuple
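multi_query and run_threaded_query describe a pattern in which each query tuple is handed to a worker thread and yields a (query_name, status, run_time) result. A minimal sketch of that pattern using the standard library follows. It is not coldstart's implementation: the names run_one and run_many are hypothetical, a plain callable stands in for the engine, the return_df flag is ignored, and the status strings "success" and "error" are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_one(query_tuple):
    """Run one query and report (query_name, status, run_time in seconds)."""
    # The tuple mirrors the documented shape: query_name, engine, sql, return_df.
    query_name, execute, sql, _return_df = query_tuple
    start = time.perf_counter()
    try:
        execute(sql)
        status = "success"
    except Exception:
        # A failed query is recorded instead of stopping the other queries.
        status = "error"
    return query_name, status, time.perf_counter() - start


def run_many(query_tuples, max_workers=4):
    """Run all queries on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, query_tuples))


def fake_execute(sql):
    """Stand-in for an engine call; fails on SQL containing 'bad'."""
    if "bad" in sql:
        raise RuntimeError("query failed")


results = run_many([
    ("ok_query", fake_execute, "SELECT 1", False),
    ("bad_query", fake_execute, "SELECT bad", False),
])
for name, status, run_time in results:
    print(name, status)
```

Catching the exception inside the worker is what lets a caller decide afterwards whether one failed query should halt the whole run, which is the choice stop_on_error exposes.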
- coldstart.query.stage_leftmost_table(engine, schema, leftmost_table, entity_id, dt1, dt2)
Stages leftmost table to include idx while performing data validation
- Parameters
engine (object) – Engine object
schema (str) – Schema of interest
leftmost_table (str) – User defined leftmost table
entity_id (str) – entity_id of interest
dt1 (str) – min_date
dt2 (str) – max_date
- Raises
ValueError – Error for invalid entity_id column
ValueError – Error for missing y column
ValueError – Error for missing min_date column
ValueError – Error for missing max_date column
ValueError – Error for invalid date format
- Returns
Staged table name, Results: query_name, status, run_time
- Return type
str, tuple
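The errors listed above amount to a set of checks on the leftmost table's columns and on the supplied dates. The sketch below illustrates those checks under several assumptions that the documentation does not confirm: the entity column is named after entity_id, the min_date and max_date columns are required only when dt1 or dt2 is not given, and dates use the YYYY-MM-DD format. The function name validate_leftmost is hypothetical.

```python
from datetime import datetime


def validate_leftmost(columns, entity_id, dt1=None, dt2=None, date_format="%Y-%m-%d"):
    """Raise ValueError if the leftmost table cannot be staged.

    `columns` is the collection of column names in the leftmost table.
    """
    if entity_id not in columns:
        raise ValueError(f"invalid entity_id column: {entity_id!r}")
    if "y" not in columns:
        raise ValueError("missing y column")
    for value, column in ((dt1, "min_date"), (dt2, "max_date")):
        if value is None:
            # Assumption: without an explicit date, the table must carry it.
            if column not in columns:
                raise ValueError(f"missing {column} column")
        else:
            try:
                datetime.strptime(value, date_format)
            except ValueError:
                raise ValueError(f"invalid date format: {value!r}") from None


# Passes silently: required columns are present and both dates parse.
validate_leftmost(["customer_id", "y"], "customer_id", "2020-01-01", "2020-12-31")
```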
- coldstart.query.template_queries(engine, schema, staged_table, query_dict)
Templates queries with LEFTMOST_TABLE
- Parameters
engine (object) – Engine object
schema (str) – Schema of interest
staged_table (str) – Name of the staged leftmost table
query_dict (dict) – Queries to template
- Returns
Dictionary of query_name: table_name, list of templated queries
- Return type
dict, list
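Templating with LEFTMOST_TABLE can be pictured as substituting the staged table's name into each query. The sketch below assumes the placeholder appears literally as LEFTMOST_TABLE in the query text; the actual template syntax used by the query bank may differ, and the function name template_sql and the sample query are hypothetical.

```python
def template_sql(query_dict, staged_table, placeholder="LEFTMOST_TABLE"):
    """Return a copy of query_dict with the placeholder replaced in each query."""
    return {
        name: sql.replace(placeholder, staged_table)
        for name, sql in query_dict.items()
    }


queries = {
    "order_count": "SELECT l.idx, COUNT(*) AS n FROM LEFTMOST_TABLE l GROUP BY l.idx",
}
templated = template_sql(queries, "my_schema.staged")
print(templated["order_count"])
# SELECT l.idx, COUNT(*) AS n FROM my_schema.staged l GROUP BY l.idx
```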
Module contents
- class coldstart.FeatureFactory
Bases: object
Coldstart’s main class
- get_dataframe()
For returning a dataframe object
- Returns
Training data
- Return type
DataFrame
- get_table()
For returning final feature table name
- Returns
Table name containing training data
- Return type
str
- run(leftmost_table=None, feature_table=None, entity_id=None, domains=None, queries=None, date_range=None, query_dir=None, export_dir=None, drop_intermedieate_tables=True, return_df=True, compute_df=True, stop_on_error=False, downcast=False, batching=False, batch_size=None)
Runs FeatureFactory: executes the selected feature queries and builds the final feature table.
- Parameters
leftmost_table (str) – Left-most table used for constraining feature queries. Should include entity_id and y. Can also include min_date and max_date. Defaults to None.
feature_table (str) – Destination for final table. Defaults to None.
entity_id (str) – Entity of interest for feature queries. Defaults to None.
domains (list) – Domains of interest for feature queries. Defaults to None.
queries (list) – Queries of interest for feature queries. Defaults to None.
date_range (list) – min_date and max_date used for constraining feature queries. Defaults to None.
query_dir (str) – Target directory containing feature queries. If None, coldstart/query_bank is used. Defaults to None.
export_dir (str) – Destination directory for frozen queries. Defaults to None.
drop_intermedieate_tables (bool) – Used for removing intermediate query results. Defaults to True.
return_df (bool) – Used for returning a dataframe. Defaults to True.
compute_df (bool) – Whether to compute the dataframe. If False, a Dask dataframe is returned instead of a pandas dataframe. Defaults to True.
stop_on_error (bool) – Will halt FeatureFactory if any one query fails. Defaults to False.
downcast (bool) – Will attempt dataframe dtype downcasting. Defaults to False.
batching (bool) – Used for dividing feature queries into batches. Defaults to False.
batch_size (int) – Corresponding batch size if batching is True. Defaults to None.
- Raises
ValueError – Error for missing engine
ValueError – Error for errored queries
- start_engine(db_spec)
Starts SQLAlchemy engine
- Parameters
db_spec (dict) – Used as the config for create_engine
- Raises
ValueError – Error for missing dialect
ValueError – Error for missing schema
ValueError – Error for missing project_id
ValueError – Error for unsupported database
ValueError – Error for failed engine creation
- stop_engine()
Stops SQLAlchemy engine