Data Management of projects


Post  jensr on Thu Jan 28, 2010 1:35 pm

Hi all

I'm processing a stack of acquired projects while scanning is still going on. We always knew that scanning and processing would be a storage-hungry job, but I am wondering if there is a way to reduce the size of the processed projects.
I keep a backup of the "raw" project, containing all the scans, background scans and metadata as acquired. When I process a project, I generally find that it inflates to approximately 2.5 times the original size. Most of our projects are around 100GB in size (we've split a time series up into separate years as projects - scanning everything into a single project simply was not an option).
We acquire the data to a local hard drive, which is then mirrored onto two external hard drives, one purely for backup and the other intended for processing. Once the pair of hard drives is "spent" in the lab (i.e. getting full), they are moved through to a PC dedicated to processing. But because the projects increase so drastically in size during processing, we're having to move projects around quite a lot, which is obviously a long process in itself. We use network attached storage to back up the processed projects. I have processed the first 4 projects and have burned through the first terabyte of network storage. While I'm getting more, my original estimate of how much was required does not hold, as it was based on the acquisition size of a project.
When I run processing, I don't extract all the thumbnails immediately (mostly because copying so many files across the network slows things down disproportionately to their actual size). But I've noticed I have a lot of copies of each image, eating up space:

In the raw folder:
sample_log.txt - small and must be kept
sample_meta.txt - small and must be kept
sample_raw.tif - big original scan, already backed up. Should this be kept in the processed project? Although zipped, it is still very large (e.g. ~618MB compared to 713MB for the raw scan). Do I need both the zip and the tif?

In the scan folder:
sample_tif - a smaller, but still significant, file (~356MB). This is the 8-bit version presumably? Does all the work happen from this file?

In the work folder (+ folder for each sample): - small file of plankton identifier - must be kept
sample_log.txt - small file - updated copy of the log file from the raw folder - keep both in the processed project?
sample_meas.txt - the measurement file, small but obviously important
sample_meta.txt - a copy of the metadata file from the raw folder - anything changed?
sample_msk1.gif - the mask file - obviously needs to be kept.
sample_out.gif - inverted downscaled image? Not sure what it's used for, but it's not huge, so no problem in keeping this - a relatively large (~156MB in my case) file. Looking at the compression ratio, it looks like this is a zipped version of the 8-bit image from the scan folder? Are both necessary?

Altogether, each scan ends up reaching around 1.8+ GB, and I'm just wondering if there's a way of trimming the active projects I'm working with?
I will always keep a raw backup of the original scan project, so I'm not too worried about losing information, but if the projects could be made a little lighter to swap between computer, network and external HDDs, it would shave quite a bit of time off (and if the usage of network storage can be reduced slightly, that would also be a benefit, since it is much more expensive).
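To see where the space actually goes before deciding what to trim, something like the following could be run against a processed project. This is only a rough sketch using the Python standard library; it is not part of Zooprocess, and the function names `tally_project` and `format_report` are made up for illustration. It makes no assumption about the project layout beyond "a directory with subfolders".

```python
import os
from collections import defaultdict


def tally_project(project_root):
    """Sum file sizes (bytes) under project_root, grouped two ways:
    by top-level subfolder and by lower-cased file extension."""
    by_folder = defaultdict(int)
    by_ext = defaultdict(int)
    for dirpath, _dirs, files in os.walk(project_root):
        rel = os.path.relpath(dirpath, project_root)
        top = "(project root)" if rel == "." else rel.split(os.sep)[0]
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable, or removed while walking
            by_folder[top] += size
            by_ext[os.path.splitext(name)[1].lower() or "(none)"] += size
    return dict(by_folder), dict(by_ext)


def format_report(totals):
    """Return report lines sorted largest first, sizes in GB (1 GB = 1e9 bytes)."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{size / 1e9:8.2f} GB  {key}" for key, size in ordered]
```

Calling `tally_project` on the project directory and printing `format_report(...)` for each of the two returned dictionaries gives one listing per subfolder and one per file type, which should make it obvious whether the .tif, .zip or .gif copies dominate.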

I'm interested in hearing how others are handling the juggling of these large projects. I have about 9 years/projects of 100GB each to work on, with ~160-200 scans in each, so even if it's something simple that saves a couple of hours on each project, it all adds up to quite a bit of time saved.

Posts : 8
Join date : 2008-09-26
Age : 45
Location : Marine Scotland Science, Aberdeen, UK


Post  jensr on Thu Feb 04, 2010 3:12 pm

From the manual p32:
The RAW images created by Vuescan (if you select the “save 16 bit raw image” option) are
saved uncompressed in the _raw folder with their respective meta and log files. The
conversion tool described below can Zip the Tif images automatically (default option). Users
will have to remove the original Tif images manually after the conversion as Zooprocess will
never delete any file.
The Tif images can also be compressed using Winzip or preferably Power archiver
( ) which allows batch file compression. If you use this last
method, the zip file naming will differ from the single archive compression. Zooprocess can
handle both archives names.
You must NOT compress the meta and the log files.

From that I'm taking it that it is OK to remove the sample_raw.tif once the processing is done, if I used the raw images?
I guess the main confusion for me is the creation of the zip file in the raw folder in the first place. Does this zip file actually serve any purpose, or is it used at any stage in the processing? It's the swap between the raw.tif file and the zip file in terms of processing that I'm uncertain about at this stage...
On page 40 of the manual (6.15b) it says:

The “raw” image from the _scan folder is zipped as a compressed archive (if option selected)
and saved in the “_zip” folder (version 3.02). This operation is not requested if you have
saved and zipped the “real” RAW image which will remain the source of all the others.

Which seems to imply that the raw image is redundant once the zipping has been done. Is this correct? My zipped scans are not saved in a _zip folder; they sit directly in the Zooscan_scan/_raw folder.


Post  picheral on Wed Feb 17, 2010 5:09 pm

Hi Jensr,
The Zooscan/Zooprocess creates a huge volume of information.
As written in the manual, I suggest keeping at least:
- raw folder (TIF images can be removed once the ZIP has been created correctly; the ZIP is about 80% of the original raw image size)
- back folder
- config folder
I now also include the whole "PID_process" folder in my daily backup of the Zooscan, as we work routinely on prediction/validation.
This is the minimum that allows full reprocessing.

The next level is to also back up the "work" folder, so that the images do not have to be re-processed in case of data loss.
The images in the "scan" folder can be removed if the (zipped) images from the raw folder are saved.

Conclusion: about 1 GB of data per scan!
Marc P.
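
Since the advice above hinges on the ZIP being "well created" before the TIF is removed, here is a hedged sketch of how that check could be automated with Python's standard `zipfile` module. It is not a Zooprocess feature. It assumes (and this is only an assumption, since archive naming depends on the zipping method) that each archive sits in the same folder as its TIF, that its file name starts with the TIF's base name, and that it contains a member with the same file name as the TIF. The function names are hypothetical, and the script only lists candidates; it deletes nothing.

```python
import zipfile
from pathlib import Path


def archive_holds_copy(zip_path, tif_path):
    """True if zip_path is readable, passes its CRC checks, and contains a
    member with the same file name and uncompressed size as tif_path."""
    try:
        with zipfile.ZipFile(zip_path) as zf:
            if zf.testzip() is not None:  # name of first corrupt member, else None
                return False
            wanted_size = tif_path.stat().st_size
            return any(
                Path(info.filename).name == tif_path.name
                and info.file_size == wanted_size
                for info in zf.infolist()
            )
    except (zipfile.BadZipFile, OSError):
        return False


def removable_raw_tifs(raw_folder):
    """Return the TIFs in raw_folder that have a verified zipped copy beside them."""
    folder = Path(raw_folder)
    archives = sorted(folder.glob("*.zip"))
    removable = []
    for tif in sorted(folder.glob("*.tif")):
        # Assumed naming: the archive name begins with the TIF's base name.
        candidates = [a for a in archives if a.name.startswith(tif.stem)]
        if any(archive_holds_copy(a, tif) for a in candidates):
            removable.append(tif)
    return removable
```

A TIF with no archive, or with a truncated or corrupt one, is simply left off the list, so reviewing the output of `removable_raw_tifs` before deleting anything by hand keeps the step reversible up to that point.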


Posts : 60
Join date : 2008-05-02


Post  jensr on Wed Feb 17, 2010 5:14 pm

Thanks Marc - 1GB/scan is fine. We always knew it was going to be storage-hungry. But it helps a lot if I can shave it from ~1.8GB to 1GB/scan, simply from the speed perspective of backup routines etc.
As mentioned, I currently keep a "raw" backup of the scans where no processing has been done. Ultimately, once the processing is done for all scans, I hope to create the core backup from these.
So I think for now I'll delete the raw images, since they are also zipped in the raw folders, to help reduce network storage requirements (it's much slower and more expensive to get network storage here than external HDDs, but it has the added benefit of decentralised backup).
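
For a rough sense of scale, the figures quoted in this thread (about 9 projects, ~160-200 scans each, ~1.8GB per scan now versus ~1GB per scan after trimming) can be multiplied out as below. This is just back-of-the-envelope arithmetic, and the function name is made up for illustration.

```python
def total_storage_gb(scans_per_project, gb_per_scan, n_projects):
    """Total storage in GB for n_projects of equal size."""
    return scans_per_project * gb_per_scan * n_projects


N_PROJECTS = 9          # "about 9 years/projects"
SCANS_RANGE = (160, 200)  # "~160-200 scans in each"

for gb_per_scan in (1.8, 1.0):
    low = total_storage_gb(SCANS_RANGE[0], gb_per_scan, N_PROJECTS)
    high = total_storage_gb(SCANS_RANGE[1], gb_per_scan, N_PROJECTS)
    print(f"{gb_per_scan} GB/scan: {low:.0f} to {high:.0f} GB in total")
```

That works out to roughly 2.6-3.2 TB at 1.8GB/scan against roughly 1.4-1.8 TB at 1GB/scan, i.e. a little under half the network storage saved across the whole series.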
