File Management with Shell and Python Scripts

Description:

Last updated: August, 2010

download: filemanagement.zip, robustarchive.zip

These shell scripts were written to manage a continuous stream of files from different sources. The sources sent files to remote FTP accounts that the application monitored (so-called watch folders). The distributed application had several (distributed) watch folders. Furthermore, there were multiple application instances serving multiple environments (production, staging, test/training). The script structure is still relatively 'organic', as most of the work was done fairly ad hoc in response to new and changing requirements. As the volume of incoming files is relatively low, most scripts loop over simple 'ls' statements. However, some of the outgoing watch folders receive a high number of files per hour, and more robust scripts are needed to archive these files; robustarchive.zip contains some additional safeguards for this. Certain scripts should only run as one instance at a time, and Lockrun is used to prevent multiple instances. In summary, the following needs to be managed:

  • Per (incoming) watch folder, files needed to be distributed to 3 or more separate (ingest) folders (production, staging, test/training).
  • Before files were distributed they needed to be (pre)processed: file splitting, dealing with mistakes, creating unique file names, etc.
  • After processing, files needed to be kept for 4-6 days (for quick initial troubleshooting).
  • After 4-6 days files needed to be archived and kept for 30 days. The archives needed to distinguish between successes and failures (to make the archives easier to analyze).
  • Sometimes part of the application is stopped and files need to be reset to their initial state.
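
As an illustration of the distribution requirement above, here is a minimal Python sketch that copies one incoming file to every ingest folder under a unique name and then moves the original aside. It is not taken from the zip files: the function names, the timestamp-plus-token naming scheme, and the `.part` temporary suffix are all assumptions made for this example, and it assumes the ingesting application ignores dot-files.

```python
import shutil
import time
import uuid
from pathlib import Path
from typing import Iterable


def unique_name(original: str) -> str:
    """Prefix a timestamp and a random token so repeated source names cannot collide.

    The naming scheme is hypothetical, not the one used by the original scripts.
    """
    stamp = time.strftime("%Y%m%d%H%M%S")
    return f"{stamp}_{uuid.uuid4().hex[:8]}_{original}"


def distribute(source: Path, ingest_dirs: Iterable[Path], processed_dir: Path) -> str:
    """Copy one incoming file to every ingest folder, then move it out of the watch folder."""
    name = unique_name(source.name)
    for ingest in ingest_dirs:
        ingest.mkdir(parents=True, exist_ok=True)
        # Copy under a temporary dot-name first and rename afterwards, so a
        # consumer polling the folder never sees a half-written file.
        tmp = ingest / f".{name}.part"
        shutil.copy2(source, tmp)
        tmp.rename(ingest / name)
    processed_dir.mkdir(parents=True, exist_ok=True)
    # Keep the original for troubleshooting instead of deleting it.
    shutil.move(str(source), str(processed_dir / name))
    return name
```

A call such as `distribute(Path("watch/in.dat"), [Path("production"), Path("staging"), Path("test")], Path("processed"))` would leave one identically named copy in each ingest folder. The copy-then-rename step is a common safeguard for watch folders, because a rename within one file system is atomic.
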
The scripts are configurable per watch folder, in the number of ingest folders, and in some other parameters. Not all scripts can be used without change, but they should be easy to adapt. As most of this was written 'on demand', the documentation is still somewhat lacking.
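
To make the idea of per-watch-folder configuration concrete, the following Python sketch groups the settings for one watch folder and uses the retention values in a simple archiving pass. Everything here is an assumption for illustration: the `WatchFolderConfig` fields, the 5-day default (chosen from the 4-6 day range), the `.failed` suffix used to tell failures from successes, and measuring age by file modification time. The original scripts may handle each of these differently.

```python
import shutil
import time
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple

DAY = 86400  # seconds


@dataclass
class WatchFolderConfig:
    """Hypothetical settings for one watch folder."""
    watch_dir: Path
    processed_dir: Path
    archive_dir: Path
    ingest_dirs: Tuple[Path, ...] = ()  # one entry per environment
    keep_days: int = 5                  # processed files stay this long
    archive_days: int = 30              # archived files are purged after this
    failed_suffix: str = ".failed"      # assumed marker for failed files


def archive_pass(cfg: WatchFolderConfig,
                 now: Optional[float] = None) -> Tuple[List[str], List[str]]:
    """Purge expired archive files, then archive old processed files.

    Returns the names of (archived, purged) files.
    """
    now = time.time() if now is None else now
    purged = []
    if cfg.archive_dir.is_dir():
        for path in sorted(cfg.archive_dir.glob("*/*")):
            if path.is_file() and now - path.stat().st_mtime >= cfg.archive_days * DAY:
                path.unlink()
                purged.append(path.name)
    archived = []
    for path in sorted(cfg.processed_dir.iterdir()):
        if not path.is_file() or now - path.stat().st_mtime < cfg.keep_days * DAY:
            continue
        # Separate sub-folders make failures easy to find later.
        outcome = "failure" if path.name.endswith(cfg.failed_suffix) else "success"
        target = cfg.archive_dir / outcome
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target / path.name))
        archived.append(path.name)
    return archived, purged
```

Purging runs before archiving, so a file that is archived in one pass is never deleted in that same pass. The `now` parameter exists only so the pass can be exercised with a fixed clock.
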

The zip files contain some other scripts too, but these are mainly for application management (e.g. start_proxy.sh, stop_proxy.sh, proxy_status.sh, check_proxy.sh, etc.). The image below shows the different states the files can be in and the scripts that manage them.

File flow
File state diagram with the scripts associated with the different states.