emdbXMLTranslator
Translators for converting EMDB XML files between schema versions 1.9 and 2.0 (1.9 <-> 2.0). This package also contains utility scripts that make conversion of the whole archive easier, class wrappers with read/write methods for both the 1.9 and 2.0 schemas, and example scripts that show how to use the class wrappers.
Author: Ardan Patwardhan (ardan@ebi.ac.uk)
Date: 2014/11/01
Version: 0.6
License
Copyright [2014-2016] EMBL - European Bioinformatics Institute Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Scripts
emdb_xml_translate.py - Convert an EMDB XML file from schema version 1.9 or 2.0 to 1.9 or 2.0
process_all_19_19.py - Read EMDB XML files in schema version 1.9 from a directory and write out schema version 1.9 files in another directory.
Used to convert files to a canonical form which makes comparison with 1.9 -> 2.0 -> 1.9 easier.
process_all_19_20.py - Read EMDB XML files in schema version 1.9 from a directory with a structure resembling the archive structure,
translate the files to schema version 2.0 and write out the files to a specified directory
process_all_20_19.py - Read EMDB XML files in schema version 2.0 from a directory with a flat structure containing all the XML files,
translate the files to schema version 1.9 and write out the files to a specified directory
diff_all.py - Read EMDB XML files from two specified directories, do a straight text diff and write the diff files to another directory. Useful for
comparing 1.9 files from a round-trip conversion with the starting 1.9 files
emdb_19_to_json.py - Example script that reads EMDB XML 1.9 files and outputs summary information to JSON
emdb_20_to_json.py - Example script that reads EMDB XML 2.0 files and outputs summary information to JSON
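The kind of straight text diff described for diff_all.py can be illustrated with the standard library. This is a minimal sketch, not the script's actual implementation: the function name diff_dirs, the use of difflib.unified_diff, and the choice to name each output file after the input file with a .txt extension are all assumptions made for illustration.

```python
import difflib
import os


def diff_dirs(dir_a, dir_b, out_dir):
    """Write a unified text diff for every file name present in both directories.

    Hypothetical helper: names and output layout are illustrative only.
    Returns the list of diff files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    common = sorted(set(os.listdir(dir_a)) & set(os.listdir(dir_b)))
    for name in common:
        path_a = os.path.join(dir_a, name)
        path_b = os.path.join(dir_b, name)
        # Skip subdirectories and anything else that is not a regular file.
        if not (os.path.isfile(path_a) and os.path.isfile(path_b)):
            continue
        with open(path_a, encoding="utf-8") as fa:
            a_lines = fa.readlines()
        with open(path_b, encoding="utf-8") as fb:
            b_lines = fb.readlines()
        diff = difflib.unified_diff(a_lines, b_lines, fromfile=path_a, tofile=path_b)
        # Assumed naming: emd-1001.xml -> emd-1001.txt
        out_path = os.path.join(out_dir, os.path.splitext(name)[0] + ".txt")
        with open(out_path, "w", encoding="utf-8") as out:
            out.writelines(diff)
        written.append(out_path)
    return written
```

Identical files produce an empty diff file in this sketch, which makes unchanged entries easy to spot by file size.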
Modules
All the above scripts may also be used as modules
- emdb_user_methods.py - Override methods for formatting floats etc required in some classes
- emdb_19.py - Python data structure representation for EMDB XML 1.9 autogenerated using generateDS (http://pythonhosted.org/generateDS/)
python ../generateDS-2.17a0/build/scripts-2.7/generateDS.py --user-methods=emdb_user_methods --external-encoding='utf-8' -f -o emdb_19.py ../schema/emdb19.xsd
- emdb_da.py - Python data structure representation for EMDB XML 2.0, also autogenerated using generateDS
python ../generateDS-2.17a0/build/scripts-2.7/generateDS.py --user-methods=emdb_user_methods --external-encoding='utf-8' -f -o emdb_da.py ../schema/emdb_da.xsd
- emdb_settings.py - global configuration variables used by several scripts/modules
“data” directory and how to convert individual files
The “data” directory has example files for testing the translator:
- input/v1.9: example files from the archive in EMDB 1.9 format
- input/v2.0: files in EMDB 2.0 format generated by the D&A system
- input/pathological_v19: pathological examples from the archive in EMDB 1.9 format which do not yet work because other changes are needed (e.g., schema changes)
- output/v1.9: output directory for unit test converting files in input/v2.0 (2.0 -> 1.9 conversion)
- output/v2.0: output directory for unit test converting files in input/v1.9 (1.9 -> 2.0 conversion)
- output/roundtrip_v1.9: output directory for unit test converting files in output/v2.0 back to v1.9 (1.9 -> 2.0 -> 1.9 conversion)
- output/json/v1.9: output directory for summary JSONs created from 1.9 XML files in input/v1.9
- output/json/v2.0: output directory for summary JSONs created from 2.0 XML files in input/v2.0
To create a canonical EMDB 1.9 file do:
python emdb_xml_translate.py -i 1.9 -o 1.9 -f /tmp/emd-1001.xml data/input/v1.9/emd-1001.xml
To convert emd-1001.xml to EMDB 2.0 do:
python emdb_xml_translate.py -i 1.9 -o 2.0 -f /tmp/emd-1001-v2.xml data/input/v1.9/emd-1001.xml
To convert the 2.0 file back to EMDB 1.9 do:
python emdb_xml_translate.py -i 2.0 -o 1.9 -f /tmp/emd-1001-v19.xml /tmp/emd-1001-v2.xml
To create summary JSONs do:
python emdb_19_to_json.py -f /tmp/junk.json data/input/v1.9/emd-1001.xml
python emdb_20_to_json.py -f /tmp/junk.json data/input/v2.0/emd-10120_v2.xml
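When the conversions above need to be scripted, the command lines can be assembled in Python. The sketch below only builds the argument lists, using the -i, -o and -f options exactly as they appear in the commands above; the helper names translate_command and round_trip_commands are hypothetical and are not part of the package.

```python
import subprocess

SCHEMAS = ("1.9", "2.0")


def translate_command(in_schema, out_schema, out_file, in_file,
                      script="emdb_xml_translate.py"):
    """Build the argument list for one emdb_xml_translate.py call.

    Hypothetical helper; it mirrors the commands shown above:
    python emdb_xml_translate.py -i IN -o OUT -f OUT_FILE IN_FILE
    """
    if in_schema not in SCHEMAS or out_schema not in SCHEMAS:
        raise ValueError("schema must be one of %s" % (SCHEMAS,))
    return ["python", script, "-i", in_schema, "-o", out_schema,
            "-f", out_file, in_file]


def round_trip_commands(in_file, v20_file, v19_file):
    """Commands for 1.9 -> 2.0 followed by 2.0 -> 1.9 on a single entry."""
    return [
        translate_command("1.9", "2.0", v20_file, in_file),
        translate_command("2.0", "1.9", v19_file, v20_file),
    ]


if __name__ == "__main__":
    commands = round_trip_commands("data/input/v1.9/emd-1001.xml",
                                   "/tmp/emd-1001-v2.xml",
                                   "/tmp/emd-1001-v19.xml")
    for cmd in commands:
        print(" ".join(cmd))
        # Uncomment to run the translator from the package directory:
        # subprocess.check_call(cmd)
```

Building the list first and running it separately keeps the sketch safe to execute anywhere; the actual translation only happens once the subprocess call is enabled.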
Unit tests
To run the unit test do:
python translator_test.py
This will:
- convert all files in data/input/v1.9 to EMDB 2.0 and put them in data/output/v2.0
- take the EMDB 2.0 files in data/output/v2.0, convert them back to EMDB 1.9 and put them in data/output/roundtrip_v1.9
- convert all files in data/input/v2.0 to EMDB 1.9 and put them in data/output/v1.9
How to test round-trip conversion (1.9 -> 2.0 -> 1.9) on the whole archive
Although one can set command line parameters for all the scripts below (use the -h option for help), it is probably easier to
modify emdb_settings.py to set default parameters. Parameters to change:
- archiveHeaderTemplate: Template for getting header files from the EMDB archive (standard directory structure)
- emdb20Dir: Output directory for the 1.9 -> 2.0 conversion files
- emdb19To19Dir: Output directory for the 1.9 -> canonical 1.9 conversion files
- emdb20To19Dir: Output directory for the 2.0 -> 1.9 conversion files
- diffDir: Output directory for the results of the diff between the canonical 1.9 and 2.0 -> 1.9 conversion files
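As an illustration of the parameters named above, the following sketch shows how the four output directories might be laid out. Only the parameter names come from this document; BASE_DIR, every path value, and the helpers output_dirs and check_distinct are hypothetical. archiveHeaderTemplate is deliberately left out because its expected format is not described here.

```python
import os

# Hypothetical root for generated files; choose any writable location.
BASE_DIR = "/data/emdb_roundtrip"

# Parameter names are the ones listed above; every value is a placeholder.
emdb19To19Dir = os.path.join(BASE_DIR, "v1.9_canonical")  # 1.9 -> canonical 1.9
emdb20Dir = os.path.join(BASE_DIR, "v2.0")                # 1.9 -> 2.0
emdb20To19Dir = os.path.join(BASE_DIR, "v2.0_to_v1.9")    # 2.0 -> 1.9
diffDir = "/data/emdb_diff"                               # diff results


def output_dirs():
    """Return the four output directories keyed by parameter name."""
    return {
        "emdb19To19Dir": emdb19To19Dir,
        "emdb20Dir": emdb20Dir,
        "emdb20To19Dir": emdb20To19Dir,
        "diffDir": diffDir,
    }


def check_distinct(dirs):
    """Raise ValueError if two parameters point at the same directory."""
    paths = [os.path.normpath(p) for p in dirs.values()]
    if len(set(paths)) != len(paths):
        raise ValueError("output directories must all be different")
    return True


if __name__ == "__main__":
    check_distinct(output_dirs())
    for name, path in sorted(output_dirs().items()):
        print("%s = %s" % (name, path))
```

Keeping the directories separate is a precaution in this sketch: the canonical 1.9 files and the back-translated 1.9 files presumably share file names, so writing both to one directory would overwrite one set with the other.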
Starting with an archive containing 1.9 XML header files, use:
python process_all_19_19.py >& logs/process_all_19_19.log &
to create canonical EMDB XML 1.9 files that will make comparison easier.
Starting with an archive containing 1.9 XML header files, use:
python process_all_19_20.py >& logs/process_all_19_20.log &
to create EMDB XML 2.0 header files.
Starting with a directory containing 2.0 XML header files, use:
python process_all_20_19.py >& logs/process_all_20_19.log &
to translate back to EMDB XML 1.9 files.
To diff the canonical 1.9 files and the back-translated 1.9 files, run:
python diff_all.py >& logs/diff_all.log &
You will need to check the diff files manually as small differences will remain even in the best cases. This can be tedious to do with individual files. You can concatenate the output for files as follows:
find /data/emdb_diff -type f -name "emd-*.txt" -print | xargs -I % sh -c 'echo % ; cat %' > /tmp/junk
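The same concatenation can be done in Python where find and xargs are not available. This is a minimal sketch that mirrors the command above: for each file matching emd-*.txt it writes the file path followed by the file contents. The function name concatenate_diffs and the skip_empty option are additions made for illustration, and unlike find this version visits files in sorted order.

```python
import fnmatch
import os


def concatenate_diffs(diff_dir, out_path, pattern="emd-*.txt", skip_empty=False):
    """Write each matching file's path and contents into one output file.

    Hypothetical helper. Returns the number of files written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(diff_dir):
            dirnames.sort()  # make the traversal order deterministic
            for name in sorted(fnmatch.filter(filenames, pattern)):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as handle:
                    text = handle.read()
                if skip_empty and not text:
                    continue  # optional: leave out files with no differences
                out.write(path + "\n")
                out.write(text)
                count += 1
    return count


if __name__ == "__main__":
    if os.path.isdir("/data/emdb_diff"):
        print(concatenate_diffs("/data/emdb_diff", "/tmp/junk"))
```

Setting skip_empty=True drops diff files with no content, which can shorten the manual review when most entries round-trip cleanly.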
Contents:
- emdbXMLTranslator package
- Submodules
- emdbXMLTranslator.diff_all module
- emdbXMLTranslator.emdb_19 module
- emdbXMLTranslator.emdb_19_to_json module
- emdbXMLTranslator.emdb_20_to_json module
- emdbXMLTranslator.emdb_da module
- emdbXMLTranslator.emdb_settings module
- emdbXMLTranslator.emdb_user_methods module
- emdbXMLTranslator.emdb_xml_translate module
- emdbXMLTranslator.process_all_19_19 module
- emdbXMLTranslator.process_all_19_20 module
- emdbXMLTranslator.process_all_20_19 module
- emdbXMLTranslator.translator_test module
- Module contents