See whether the 'strict' initialization parameter of the SampleFeaturesVectorizer can be removed, in the context of caching vectorized data
Is it possible to always give the 'strict' parameter a constant value (since it does not directly modify the vectorization definition)?
The 'strict' parameter specifies whether or not to raise an exception when a CategoricalFeature instance meets an unknown value. If a categorical feature meets an unknown value, either the value is encoded as zeros (strict = False), or an exception is raised (strict = True). If an exception is raised, that means:
- either one of the mention's attribute values is incorrect and needs to be changed, which implies redoing all the vectorization ("we changed one mention's attribute values, which can for instance impact an ml feature whose value is based on comparing an attribute value of two mentions")
- or the mention's attribute value is correct, and one of the ml features needs to be changed to accommodate this value, which implies redoing all the vectorization ("we changed the vectorization operation by adding at least one column to the vector")
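The behaviour of 'strict' described above can be illustrated with a minimal sketch. It is not the actual CategoricalFeature implementation: the class body, the `vectorize` method, the `UnknownCategoryError` exception and the "gender" attribute with its values are all assumptions made up for the illustration. It only shows that the two modes agree on known values and differ solely on unknown ones.

```python
from typing import List, Sequence


class UnknownCategoryError(ValueError):
    """Hypothetical exception raised in strict mode for an unknown value."""


class CategoricalFeature:
    """Hypothetical stand-in: one-hot encodes a value over a fixed set of categories."""

    def __init__(self, name: str, categories: Sequence[str], strict: bool = True):
        self.name = name
        self.categories = list(categories)
        self.strict = strict

    def vectorize(self, value: str) -> List[float]:
        vector = [0.0] * len(self.categories)
        if value in self.categories:
            # Known value: set the matching column, whatever the mode.
            vector[self.categories.index(value)] = 1.0
        elif self.strict:
            # strict = True: an unknown value is reported immediately.
            raise UnknownCategoryError(f"{self.name}: unknown value {value!r}")
        # strict = False: an unknown value silently stays an all-zero vector.
        return vector


# Hypothetical attribute and categories, for illustration only.
categories = ["female", "male", "neutral"]
gender_strict = CategoricalFeature("gender", categories, strict=True)
gender_lenient = CategoricalFeature("gender", categories, strict=False)

known = gender_strict.vectorize("male")                # identical in both modes
unknown_as_zeros = gender_lenient.vectorize("plural")  # all zeros, no error
```

In this sketch, 'strict' never changes the number or the meaning of the columns; it only decides whether an unknown value is reported or hidden.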
Questions:
- If we cache the vectorization done with 'strict = False', can we ensure compatibility with a vectorization done with 'strict = True'? Yes, provided no error is ever encountered when using 'strict = True'. If an error is encountered with 'strict = True', all the cached vectorization has to be redone, so the answer is no in that case.
- If we cache the vectorization done with 'strict = True', can we ensure compatibility with a vectorization done with 'strict = False'? If we successfully cache a vectorization done with 'strict = True', then we should not run into any exception afterwards (assuming the collection of documents, their characterization and their mentions' characterization stay the same, which we have to assume anyway; this also has to hold for detected values). That means we will not run into problems with 'strict = False' either, so the vectors produced with 'strict = False' will have the same values as those produced with 'strict = True'. So there is compatibility in this case. However, there is no use in forcing 'strict = True' here, since this will not affect the sample_features_vectorizer actually used to produce the vectors that get cached.
One way to solve this is to stop letting the user set the 'strict' parameter altogether. In any case, when there is an error, the cortex programmer will want to know about it in order to correct it, so it makes sense to always be in 'strict = True' mode.
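A rough sketch of what this conclusion could look like for the cache, under stated assumptions: the helper names `vectorization_cache_key` and `vectorize_sample`, the dictionary-based feature definitions and the attribute names are hypothetical and do not come from the existing code. The idea is that the cache key depends only on the vectorization definition (features and their columns), never on 'strict', and that vectorization always fails loudly on an unknown value.

```python
import hashlib
import json
from typing import Dict, List, Sequence


def vectorization_cache_key(feature_definitions: Dict[str, Sequence[str]]) -> str:
    """Hash the vectorization definition only; 'strict' is deliberately not part of it."""
    # Column order defines the vector layout, so the definition order is preserved.
    payload = json.dumps([[name, list(cats)] for name, cats in feature_definitions.items()])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def vectorize_sample(feature_definitions: Dict[str, Sequence[str]],
                     sample: Dict[str, str]) -> List[float]:
    """Always-strict vectorization: an unknown value raises instead of becoming zeros."""
    vector: List[float] = []
    for name, categories in feature_definitions.items():
        value = sample[name]
        if value not in categories:
            raise ValueError(f"{name}: unknown value {value!r}")
        vector.extend(1.0 if value == category else 0.0 for category in categories)
    return vector


# Hypothetical feature definitions and sample, for illustration only.
definitions = {"gender": ["female", "male"], "type": ["name", "pronoun"]}
cache: Dict[str, List[List[float]]] = {}
key = vectorization_cache_key(definitions)
cache[key] = [vectorize_sample(definitions, {"gender": "male", "type": "pronoun"})]

# Adding a column changes the definition, hence the key: the old cache entry is not reused.
extended = {"gender": ["female", "male", "neutral"], "type": ["name", "pronoun"]}
extended_key = vectorization_cache_key(extended)
```

With such a scheme, the second case listed above (an ml feature gains a column) invalidates the cache automatically through the key, while the first case (a corrected attribute value) would still need the cached vectors to be recomputed by other means.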