Declearn-text
MVP: Be able to run a first full example fine-tuning an existing BERT with HuggingFace on a simple task
- Store the data in proper format
- Load a tokenizer from Hugging Face
- Load a model from Hugging Face
- Run the training
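The four MVP steps above could be sketched as follows. This is a hedged sketch, not a decided design: the model name (`bert-base-uncased`), the binary label setup, and the records format are illustrative assumptions, and it presumes `transformers` and `torch` are installed.

```python
# Sketch of the MVP pipeline. Framework imports are deferred so the
# data helper stays dependency-free; all names here are placeholders.

def to_records(texts, labels):
    # Step 1: store the data in a simple, framework-agnostic records format.
    return [{"text": t, "label": l} for t, l in zip(texts, labels)]


def run_centralised_finetuning(records, model_name="bert-base-uncased"):
    # Steps 2-4: load a tokenizer and model from Hugging Face, then train.
    import torch
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2  # assumed binary task
    )
    encodings = tokenizer(
        [r["text"] for r in records], truncation=True, padding=True
    )

    class RecordsDataset(torch.utils.data.Dataset):
        def __len__(self):
            return len(records)

        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
            item["labels"] = torch.tensor(records[idx]["label"])
            return item

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1),
        train_dataset=RecordsDataset(),
    )
    trainer.train()
```

The records helper is the part the dataset API changes below would have to generalise; the rest delegates entirely to the HF `Trainer`.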
What needs to change:
- The dataset API, essentially:
- Accept data that is not tabular
- Allow for preprocessing steps, including at inference -> this will possibly also require modifying the model API
- The split util could also be revamped, but that is lower priority
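One minimal shape the revised dataset API could take, sketched with hypothetical names (the class and its methods are illustrative, not part of declearn today): a dataset that accepts arbitrary non-tabular samples and carries an ordered list of preprocessing steps that can be replayed identically at inference time.

```python
# Hypothetical sketch of the dataset API change; nothing here is decided.

class PreprocessedDataset:
    """Hold non-tabular samples plus ordered preprocessing steps."""

    def __init__(self, samples, steps=None):
        self.samples = list(samples)    # e.g. raw text, not tabular rows
        self.steps = list(steps or [])  # callables applied in order

    def preprocess(self, sample):
        # The same chain is reusable on new inputs at inference time,
        # which is why the model API may need a hook to carry it along.
        for step in self.steps:
            sample = step(sample)
        return sample

    def __iter__(self):
        return (self.preprocess(s) for s in self.samples)


# Usage: lowercase then strip, replayed on a fresh inference input.
dataset = PreprocessedDataset(
    ["  Hello ", "WORLD"], steps=[str.lower, str.strip]
)
assert list(dataset) == ["hello", "world"]
assert dataset.preprocess("  NEW Sample ") == "new sample"
```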
Approach: rely as much as possible on tools developed by the main frameworks
- Rationale: data processing is complex and not the priority of our library, so we want to delegate as much of the work as possible to other tools, in a way that is as robust as possible to future changes
- So we essentially use the same approach as the vector API, but try to make it more minimal
- TensorFlow and Torch provide very practical tools to interface with data -> integrate those into lightweight Dataset subclasses
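A lightweight subclass along the lines described above might look like this. The base interface (`generate_batches`) and class names are assumptions for the sake of the sketch; the point is that the Torch variant delegates batching and shuffling to `torch.utils.data.DataLoader` rather than reimplementing them.

```python
# Hedged sketch: wrap a framework data loader behind a small
# declearn-style Dataset interface. Names are illustrative.

class Dataset:
    """Minimal interface a declearn-style dataset could expose."""

    def generate_batches(self, batch_size):
        raise NotImplementedError


class TorchDataset(Dataset):
    """Delegate batching to torch.utils.data.DataLoader."""

    def __init__(self, torch_dataset):
        self.torch_dataset = torch_dataset

    def generate_batches(self, batch_size):
        # Deferred import: only pay the torch dependency when used.
        from torch.utils.data import DataLoader

        yield from DataLoader(self.torch_dataset, batch_size=batch_size)


class ListDataset(Dataset):
    """Pure-Python fallback with the same interface, for comparison."""

    def __init__(self, samples):
        self.samples = samples

    def generate_batches(self, batch_size):
        for i in range(0, len(self.samples), batch_size):
            yield self.samples[i : i + batch_size]


batches = list(ListDataset([1, 2, 3, 4, 5]).generate_batches(2))
assert batches == [[1, 2], [3, 4], [5]]
```

Keeping the wrapper this thin is what makes the approach robust to upstream changes: only the delegation line touches the framework.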
Allow for full HF integration?
- Using the pipeline
- Use the SaaS aspect and the ease of pushing to HF repos?
MVP todo:
- Select fine-tuning task and implement centralised version using HF + Torch
- Explore commonalities between frameworks' pre-processing tools
- Decide and implement dataset API changes
Edited by BIGAUD Nathan