emdbXMLTranslator
=================

Translators for converting EMDB XML files between schema versions 1.9 and 2.0, in either direction.
This package also contains utility scripts that make converting the whole archive easier,
class wrappers with read/write methods for both the 1.9 and 2.0 schemas, and
example scripts that show how to use the class wrappers.

Author: Ardan Patwardhan (ardan@ebi.ac.uk)

Date: 2014/11/01

Version: 0.19

License
-------

Copyright [2014-2016] EMBL - European Bioinformatics Institute
Licensed under the Apache License, Version 2.0 (the
"License"); you may not use this file except in
compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.


Scripts
-------

*	emdb_xml_translate.py - Convert an EMDB XML file from schema version 1.9 or 2.0 to schema version 1.9 or 2.0
*	process_all_19_19.py - Read EMDB XML files in schema version 1.9 from a directory and write out schema version 1.9 files to another directory. Used to convert files to a canonical form, which makes comparison with the output of a 1.9 -> 2.0 -> 1.9 round trip easier
*	process_all_19_20.py - Read EMDB XML files in schema version 1.9 from a directory with a structure resembling the archive structure, translate the files to schema version 2.0 and write them to a specified directory
*	process_all_20_19.py - Read EMDB XML files in schema version 2.0 from a directory with a flat structure containing all the XML files, translate the files to schema version 1.9 and write them to a specified directory
*	diff_all.py - Read EMDB XML files from two specified directories, do a straight text diff and write the resulting files to another directory. Useful for comparing 1.9 files from a round-trip conversion with the starting 1.9 files
*	emdb_19_to_json.py - Example script that reads EMDB XML 1.9 files and outputs summary information to JSON
*	emdb_20_to_json.py - Example script that reads EMDB XML 2.0 files and outputs summary information to JSON
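
To illustrate what a "straight text diff" between two directories involves, here is a minimal Python 3 sketch built on the standard library `difflib` module. It is not the code of diff_all.py. The function name `diff_directories`, its arguments and the `<name>.txt` output naming are assumptions made for this example.

```python
import difflib
from pathlib import Path


def diff_directories(dir_a, dir_b, out_dir, pattern="*.xml"):
    """Write a unified text diff for every file name present in both directories.

    Illustrative only: the name, arguments and output naming are assumptions,
    not the interface of diff_all.py. Returns the list of diff files written.
    """
    dir_a, dir_b, out_dir = Path(dir_a), Path(dir_b), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for file_a in sorted(dir_a.glob(pattern)):
        file_b = dir_b / file_a.name
        if not file_b.is_file():
            continue  # no counterpart with the same name, nothing to compare
        lines_a = file_a.read_text(encoding="utf-8").splitlines(keepends=True)
        lines_b = file_b.read_text(encoding="utf-8").splitlines(keepends=True)
        diff = difflib.unified_diff(
            lines_a, lines_b, fromfile=str(file_a), tofile=str(file_b)
        )
        out_file = out_dir / (file_a.stem + ".txt")
        out_file.write_text("".join(diff), encoding="utf-8")
        written.append(out_file)
    return written
```

An empty output file means the two inputs were textually identical.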

Modules
-------

All the above scripts may also be used as modules.

*	emdb_user_methods.py - Override methods (for formatting floats, etc.) required by some classes
*	emdb19.py - Python data structure representation for EMDB XML 1.9 autogenerated using generateDS (http://pythonhosted.org/generateDS/)

  `python ../generateDS-2.17a0/build/scripts-2.7/generateDS.py  --user-methods=emdb_user_methods --external-encoding='utf-8' -f -o emdb_19.py ../schema/emdb19.xsd`
		
*	emdb_da.py - Python data structure representation for EMDB XML 2.0, also autogenerated using generateDS

  `python ../generateDS-2.17a0/build/scripts-2.7/generateDS.py  --user-methods=emdb_user_methods --external-encoding='utf-8' -f -o emdb_da.py ../schema/emdb_da.xsd`
    
*	emdb_settings.py - global configuration variables used by several scripts/modules

"data" directory and how to convert individual files
------------------------------------------------------

The "data" directory has example files for testing the translator:

* input/v1.9: example files from the archive in EMDB 1.9 format
* input/v2.0: files in EMDB 2.0 format generated by the D&A system
* input/pathological_v19: pathological examples from the archive in EMDB 1.9 format which do not yet work because other changes are needed (e.g., schema changes)
* output/v1.9: output directory for unit test converting files in input/v2.0 (2.0 -> 1.9 conversion)
* output/v2.0: output directory for unit test converting files in input/v1.9 (1.9 -> 2.0 conversion)
* output/roundtrip_v1.9: output directory for unit test converting files in output/v2.0 back to v1.9 (1.9 -> 2.0 -> 1.9 conversion)
* output/json/v1.9: output directory for summary JSONs created from 1.9 XML files in input/v1.9
* output/json/v2.0: output directory for summary JSONs created from 2.0 XML files in input/v2.0

To create a canonical EMDB 1.9 file, do:
        
  `python emdb_xml_translate.py -i 1.9 -o 1.9 -f /tmp/emd-1001.xml data/input/v1.9/emd-1001.xml`

To convert emd-1001.xml to EMDB 2.0 do:

  `python emdb_xml_translate.py -i 1.9 -o 2.0 -f /tmp/emd-1001-v2.xml data/input/v1.9/emd-1001.xml`

To convert the 2.0 file back to EMDB 1.9 do:

  `python emdb_xml_translate.py -i 2.0 -o 1.9 -f /tmp/emd-1001-v19.xml /tmp/emd-1001-v2.xml`

To create summary JSONs do:
	
  `python emdb_19_to_json.py  -f /tmp/junk.json data/input/v1.9/emd-1001.xml`
       
  `python emdb_20_to_json.py  -f /tmp/junk.json data/input/v2.0/emd-10120_v2.xml`
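
The single-file conversions above can also be driven from Python. The sketch below only assembles the same command lines, using the `-i`, `-o` and `-f` options exactly as shown above, and runs them with `subprocess`. The helper names `translate_cmd`, `round_trip_cmds` and `run_all` are hypothetical. The sketch also assumes that the translator exits with a non-zero status on failure, which this README does not state.

```python
import subprocess


def translate_cmd(in_schema, out_schema, out_file, in_file,
                  python="python", script="emdb_xml_translate.py"):
    """Build the argument list for one emdb_xml_translate.py call."""
    return [python, script, "-i", in_schema, "-o", out_schema,
            "-f", out_file, in_file]


def round_trip_cmds(src, v20_file, v19_file):
    """Commands for a 1.9 -> 2.0 -> 1.9 round trip of a single file."""
    return [
        translate_cmd("1.9", "2.0", v20_file, src),
        translate_cmd("2.0", "1.9", v19_file, v20_file),
    ]


def run_all(cmds):
    """Run each command in order; check=True raises on a non-zero exit status."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)
```

For example, `run_all(round_trip_cmds("data/input/v1.9/emd-1001.xml", "/tmp/emd-1001-v2.xml", "/tmp/emd-1001-v19.xml"))` would issue the two conversion commands listed above, provided it is run from the directory containing emdb_xml_translate.py.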

Unit tests
----------
To run the unit test, do:

  `python translator_test.py`

This will:

1) convert all files in data/input/v1.9 to EMDB 2.0 and put them in data/output/v2.0
2) take the EMDB 2.0 files in data/output/v2.0, convert them back to EMDB 1.9 and put them in data/output/roundtrip_v1.9
3) convert all files in data/input/v2.0 to EMDB 1.9 and put them in data/output/v1.9


	
How to test round-trip conversion (1.9 -> 2.0 -> 1.9) on the whole archive
--------------------------------------------------------------------------

1.	Although command line parameters can be set for all the scripts below (use the -h option for help), it is probably easier to
	modify emdb_settings.py to set default parameters. Parameters to change:

	*	archiveHeaderTemplate: Template for getting header files from the EMDB archive (standard directory structure)
	*	emdb20Dir: Output directory for files from the 1.9 -> 2.0 conversion
	*	emdb19To19Dir: Output directory for files from the 1.9 -> canonical 1.9 conversion
	*	emdb20To19Dir: Output directory for files from the 2.0 -> 1.9 conversion
	*	diffDir: Output directory for the results of the diff between the canonical 1.9 files and the 2.0 -> 1.9 conversion files
	
2.	Starting with an archive containing 1.9 XML header files, use:

    `python process_all_19_19.py >& logs/process_all_19_19.log &`
    
    to create canonical EMDB XML 1.9 files that will make comparison easier.
    
3. Starting with an archive containing 1.9 XML header files, use:

    `python process_all_19_20.py >& logs/process_all_19_20.log &`
    
    to create EMDB XML 2.0 header files.
    
4. Starting with a directory containing 2.0 XML header files, use:

    `python process_all_20_19.py >& logs/process_all_20_19.log &`
    
    to translate back to EMDB XML 1.9 files.    
    
5. To diff the canonical 1.9 files and the back-translated 1.9 files, run:

	`python diff_all.py >& logs/diff_all.log &`
	
	You will need to check the diff files manually, as small differences will remain even in the best cases. This can be tedious to
	do with individual files. You can concatenate the output for all files as follows:
	
	`find /data/emdb_diff -type f -name "emd-*.txt" -print |xargs -I % sh -c 'echo % ; cat % ' > /tmp/junk`
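
Where `find` and `xargs` are not available, the same concatenation can be approximated in Python 3. This is a sketch only: the function name `concatenate_diffs` is hypothetical, and unlike `find` it visits the files in sorted order.

```python
from pathlib import Path


def concatenate_diffs(diff_dir, out_file, pattern="emd-*.txt"):
    """Write each matching file's path, followed by its contents, to out_file.

    Searches diff_dir recursively and returns the number of files concatenated.
    """
    paths = sorted(p for p in Path(diff_dir).rglob(pattern) if p.is_file())
    with open(out_file, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(f"{path}\n")  # header line, like `echo %`
            out.write(path.read_text(encoding="utf-8", errors="replace"))
    return len(paths)
```

A call such as `concatenate_diffs("/data/emdb_diff", "/tmp/junk")` mirrors the shell command above. The output file should live outside the diff directory, or at least not match the pattern, so that it is not picked up on a later run.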
	



  
		
