Implement `DataLoadingPlan` classes and scaffolding
Implement a `DataLoadingPlan` class and a `DataPipeline` class, and ensure they can be saved to and loaded from the node DB.

- Implement the `DataLoadingPlan` class. Specs for the implementation are extensively documented in the %SP 17 Item 01 - Clinician customization of dataset at add time milestone description. An API with at least the following functionalities should be implemented:
  - constructor
  - save and load to node DB
  - add/append new pipeline
  - `str` or `repr`
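The API above could be sketched as follows. This is only an illustrative skeleton, not the specified implementation: the attribute and method names (`dlp_id`, `append`, `serialize`) are assumptions, and actual DB persistence is deliberately left out per the scope of this task.

```python
import uuid

class DataLoadingPlan:
    """Sketch of a DataLoadingPlan: a named, identifiable collection of
    pipelines. Names and signatures here are hypothetical."""

    def __init__(self, name: str = ''):
        self.dlp_id = str(uuid.uuid4())  # assumed unique key in the node DB
        self.name = name
        self.pipelines = []

    def append(self, pipeline) -> 'DataLoadingPlan':
        """Add a new pipeline; returns self to allow chained calls."""
        self.pipelines.append(pipeline)
        return self

    def serialize(self) -> dict:
        """Dict representation intended for saving to the node DB."""
        return {'dlp_id': self.dlp_id,
                'name': self.name,
                'pipelines': [type(p).__name__ for p in self.pipelines]}

    def __str__(self) -> str:
        return f"DataLoadingPlan {self.name} ({len(self.pipelines)} pipelines)"
```

A `load` counterpart would then populate a default-constructed instance from the dict returned by `serialize` (see the save/load discussion below).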
- Implement the `DataPipeline` class. Similarly, specs for the implementation are also well documented in the %SP 17 Item 01 - Clinician customization of dataset at add time milestone description.
  - design the class hierarchy and document the choices. One possibility is an abstract base class that defines an "interface".
  - define the class attributes and methods. (For some ideas, revisit the discussion in the %SP 17 Item 01 - Clinician customization of dataset at add time description of a `data` class member versus a more generic implementation, such as an `apply` method.)
  - decide which methods are mandatory to implement in any `DataPipeline`, and document these at least in docstrings (and possibly also in the interface abstract class)
  - define in detail the expected use and behaviour of the save/load functions. For example, is a `DataPipeline` first created with a default constructor, with the `load` function called immediately afterwards?
  - implement generic save/load to the node DB in a base class
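One possible shape for the hierarchy discussed above: an abstract base class that makes `apply` mandatory and implements generic serialization once, plus a toy concrete subclass. Everything here is a design sketch under assumed names (`type_id`, `serialize`, `load`); it follows the "default constructor, then `load`" pattern raised in the bullet list, not a decided design.

```python
from abc import ABC, abstractmethod

class DataPipeline(ABC):
    """Hypothetical abstract base class defining the pipeline interface."""

    type_id: str = 'abstract'  # assumed: identifies the concrete class on load

    @abstractmethod
    def apply(self, data):
        """Mandatory in every concrete pipeline: transform one data sample."""

    def serialize(self) -> dict:
        """Generic save-side representation, implemented once in the base."""
        return {'type_id': self.type_id, 'params': dict(vars(self))}

    def load(self, state: dict) -> 'DataPipeline':
        """Restore state onto a default-constructed instance."""
        vars(self).update(state.get('params', {}))
        return self

class MappingPipeline(DataPipeline):
    """Toy concrete pipeline that renames keys in a dict-like sample."""

    type_id = 'mapping'

    def __init__(self, mapping=None):
        self.mapping = mapping or {}

    def apply(self, data):
        return {self.mapping.get(k, k): v for k, v in data.items()}
```

The `type_id` attribute illustrates one answer to the save/load question: the DB stores which concrete class to instantiate before calling `load`.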
- define `data_loading_plans` and `data_pipelines` tables in the node DB
- modify `_split_train_and_test_data` to load a `DataLoadingPlan` if the dataset has one associated with it in the node DB
- write unit tests for instantiating a `DataLoadingPlan` and a `DataPipeline`, and for saving and loading them to the node DB
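The DB-facing items above might look roughly like this. The node DB is a document store, so the snippet uses a plain dict as a stand-in for the two tables (an actual implementation would go through the node's DB layer, e.g. a TinyDB `table.upsert`); `save_dlp`, `load_dlp`, and the simplified `split_train_and_test_data` are hypothetical names for illustration only.

```python
# Stand-in for the node DB: two document tables keyed by ID.
node_db = {'data_loading_plans': {}, 'data_pipelines': {}}

def save_dlp(dlp_doc: dict) -> None:
    """Upsert a serialized DataLoadingPlan by its dlp_id."""
    node_db['data_loading_plans'][dlp_doc['dlp_id']] = dlp_doc

def load_dlp(dlp_id: str) -> dict:
    """Fetch a serialized DataLoadingPlan by ID."""
    return node_db['data_loading_plans'][dlp_id]

def split_train_and_test_data(dataset_meta: dict, data):
    """Simplified stand-in for _split_train_and_test_data: if the dataset
    has an associated DLP in the node DB, load it before splitting."""
    dlp_id = dataset_meta.get('dlp_id')
    if dlp_id is not None:
        dlp = load_dlp(dlp_id)
        # a real implementation would deserialize and apply the pipelines here
        data = {'dlp_name': dlp['name'], 'data': data}
    return data
```

Consistent with the task scope, the unit tests would only exercise this round trip (instantiate, save, load), not a realistic data transformation.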
For the purposes of this task it is not important to focus on a real use case for a `DataLoadingPlan`. It is also not necessary to write the code that saves a DLP and a pipeline, which will be taken care of in other (more specific) tasks. Hence one may assume that a DLP has been correctly saved and already exists in the DB.
The purpose of this task is only to establish the interface and skeleton of the DLP and pipeline workflows.
The name `DataPipeline` is still open for discussion.