Zooscan Users Forum

How to correct a prediction (using vignettes), How to work with large datasets


Post by picheral on Sun Jun 21, 2009 9:53 pm

Question from Malinda Sutor:

I am at the stage of validating the results of the predicted identifications (I have extracted vignettes according to the prediction into the prediction and vis folders, and I have reviewed the sorted vignettes). At this stage, do you simply use any thumbnail browser to correct the sorted vignettes, moving any that are misclassified to the correct folder? After that, I know that you can use the validated results to update the learning set, etc. From a procedural standpoint, have you found a system that works well for relabeling the sorted vignettes after you have validated them? Do you just change the name of the "to validate" folder, or move the subfolders from "to validate" into the "sorted vignettes" folder? I am curious because I know that you have a lot of experience keeping track of data sets containing thousands of individual samples, processed at different stages by many different people, and I wondered whether you have a system that makes it clear that the predicted vignettes have been validated, while not creating any file structure that would cause errors in Zooprocess if additional analyses are conducted later.


Dear Malinda,
As I explained in Baton Rouge last May, we have defined a strategy to help you work with thousands of vignettes:
- we predict in batches of samples containing up to 10,000 vignettes in total. We usually enable the detailed files option in PKId and save the results of the analysis in the "Pid_results/Prediction" directory of the project. We also usually name the prediction "Analysis_date_time", which will be automatically added to the resulting files.
- once predicted, these PID files are manually moved into the "pid_predicted" directory (inside the "pid_results" folder). This prevents predicting them again (and validating them again!).
- we extract the vignettes according to the prediction into the "date_time_to_validate" directory. The dat1.txt files are automatically copied into the "date_time_to_validate" directory and into the "dat1_extracted" subfolder of the "pid_results" folder (6.12). You thus know which "*dat1.txt" files have been extracted for validation.
- the vignettes extracted into the "*_to_validate" directory are validated by an "expert" (sorted into the correct folder, or into additional folders for a more detailed sorting). We use xnview.exe for that, but you can also use the "Learning set" tool in Plankton Identifier, provided that you copy the PID files into the "date_time_to_validate" folder. Great care must be taken to avoid "losing" any vignette if you do not use PKId.
- when the sorting is finished, we usually rename the folder "date_time_to_validate" to "date_time_validated_by_<expert_name>". This is not mandatory, but it is very useful for the supervisor of the work!
- we then use either PKId to create what is called a TEST FILE (see the PKId manual)... or the Zooprocess "Load Id from sorted vignettes" tool to update the "*dat1.txt" files with the newly validated Ids. The latest version of the tool (6.12) also copies the validated "*dat1.txt" files into the "dat1_validated" sub-directory of the "pid_results" folder. All the *dat1.txt (or txt) files are thus (6.12) updated inside the "dat1_validated" folder of "pid_results", ensuring that these files have been validated and that the user knows exactly which files have been validated.
By checking the "pid_predicted", "dat1_extracted" and "dat1_validated" directories, you can tell exactly what has been processed; a rough sketch of such a check is given below.
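If you want to automate that check, a few script lines are enough. Below is a minimal sketch (in Python, not part of Zooprocess): the project path is hypothetical, and you may have to adapt the stem extraction to your own file naming.

import glob
import os

pid_results = os.path.join("my_project", "pid_results")  # hypothetical project path

def stems(subdir, pattern):
    # Reduce each file name to its sample stem so the three folders can be compared.
    files = glob.glob(os.path.join(pid_results, subdir, pattern))
    return {os.path.basename(f).replace("_dat1.txt", "").replace(".pid", "") for f in files}

predicted = stems("pid_predicted", "*.pid")       # PID files already predicted
extracted = stems("dat1_extracted", "*dat1.txt")  # dat1 files copied at extraction
validated = stems("dat1_validated", "*dat1.txt")  # dat1 files updated after validation

for sample in sorted(predicted):
    if sample in validated:
        status = "validated"
    elif sample in extracted:
        status = "extracted, waiting for validation"
    else:
        status = "predicted only"
    print(sample + ": " + status)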

The major difference between PKId and Zooprocess is that:
- PKId creates a new "learn_*.pid" file containing the variables and the validated Id for each object.
- Zooprocess can upload the validated Ids either into any "*.txt" file issued from the prediction (including the "analysis_*.txt" file) or into the detailed "*_dat1.txt" files (if the option to process detailed results has been enabled in PKId). A rough sketch of this folder-to-Id update follows.
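To picture this folder-to-Id update, here is a minimal sketch in Python. It is not the Zooprocess code, and the dat1 layout it assumes (a [Data] section with one semicolon-separated row per object, whose first field names the object) as well as the "valid_id" column name are illustrative assumptions only:

import os

def ids_from_sorted_vignettes(validated_dir):
    # The folder a vignette has been sorted into gives its validated Id.
    ids = {}
    for group in os.listdir(validated_dir):
        group_dir = os.path.join(validated_dir, group)
        if not os.path.isdir(group_dir):
            continue
        for vignette in os.listdir(group_dir):
            ids[os.path.splitext(vignette)[0]] = group
    return ids

def append_valid_ids(dat1_in, dat1_out, ids):
    # Append an assumed "valid_id" column to the assumed [Data] table.
    in_data, header_done, out = False, False, []
    with open(dat1_in) as f:
        for line in f.read().splitlines():
            if line.strip().lower() == "[data]":
                in_data = True
            elif in_data and not header_done and line.strip():
                line += ";valid_id"       # extend the header row
                header_done = True
            elif in_data and line.strip():
                object_name = line.split(";")[0]
                line += ";" + ids.get(object_name, "unknown")
            out.append(line)
    with open(dat1_out, "w") as f:
        f.write("\n".join(out) + "\n")

# e.g. append_valid_ids("sample_dat1.txt", "sample_dat1_valid.txt",
#                       ids_from_sorted_vignettes("20090621_2153_validated_by_malinda"))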
In Villefranche we use both strategies, but we now prefer to work with "dat1.txt" files as they contain:
- log information from the image processing
- metadata
- variables, prediction and validated Id for each object

In addition, the "Upload Ids from sorted vignettes" tool computes some basic statistics (error rates, recall, contamination) to help you analyse the efficiency of the prediction. The detailed stats give the results for the following groups, provided that you have named your groups in the learning set and in the validation following these rules (a sketch of these statistics is given after the list):
- all copepod predictions (and folder names!) start with "cop".
- all appendicularian predictions (and folder names!) start with "app".
- all crustacean (other than copepod) predictions (and folder names!) start with "crust".
- all misc (aggregates, bad focus, fibers...) predictions (and folder names!) start with "det".
- all cladoceran predictions (and folder names!) start with "clad".
- all gelatinous predictions (and folder names!) start with "gel".
- all mollusk (except pteropod) predictions (and folder names!) start with "moll".
- all ostracod predictions (and folder names!) start with "ostr".
- all pteropod predictions (and folder names!) start with "pte".
- all radiolarian predictions (and folder names!) start with "rad".
- all multiples predictions (and folder names!) start with "mult".
This is not mandatory; the tool will work without this naming convention, but you will not get the detailed stats.
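To make those statistics concrete, here is a minimal sketch (again in Python, not the Zooprocess code) of how recall and contamination can be computed per prefix group from paired predicted/validated labels:

from collections import defaultdict

PREFIXES = ["cop", "app", "crust", "det", "clad", "gel", "moll", "ostr", "pte", "rad", "mult"]

def group_of(label):
    # Collapse a detailed label to its prefix group, e.g. "cop_small" -> "cop".
    for prefix in PREFIXES:
        if label.startswith(prefix):
            return prefix
    return "other"

def prediction_stats(pairs):
    # pairs holds one (predicted_label, validated_label) tuple per object.
    true_count, pred_count, correct = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, valid in pairs:
        gp, gv = group_of(pred), group_of(valid)
        pred_count[gp] += 1   # objects predicted as group gp
        true_count[gv] += 1   # objects validated as group gv
        if gp == gv:
            correct[gp] += 1  # predicted and validated in the same group
    for g in sorted(set(true_count) | set(pred_count)):
        recall = correct[g] / true_count[g] if true_count[g] else float("nan")
        contamination = 1 - correct[g] / pred_count[g] if pred_count[g] else float("nan")
        print("%s: recall=%.2f contamination=%.2f" % (g, recall, contamination))

# e.g. prediction_stats([("cop_small", "cop_small"), ("cop", "gel_siphonophore")])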

Best regards
Marc P.

picheral

Posts : 60
Join date : 2008-05-02

http://www.zooscan.com
