emdbXMLTranslator

Translators for converting EMDB XML files between schema versions 1.9 and 2.0, in both directions. This package also contains utility scripts that simplify conversion of the whole archive, class wrappers with read/write methods for both the 1.9 and 2.0 schemas, and example scripts that show how to use the class wrappers.

Author: Ardan Patwardhan (ardan@ebi.ac.uk)

Date: 2014/11/01

Version: 0.6

License

Copyright [2014-2016] EMBL - European Bioinformatics Institute

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Scripts

  • emdb_xml_translate.py - Convert an EMDB XML file from schema version 1.9 or 2.0 to 1.9 or 2.0

  • process_all_19_19.py - Read EMDB XML files in schema version 1.9 from a directory and write out schema version 1.9 files in another directory.

    Used to convert files to a canonical form, which makes comparison with the output of a 1.9 -> 2.0 -> 1.9 round trip easier.

  • process_all_19_20.py - Read EMDB XML files in schema version 1.9 from a directory with a structure resembling the archive structure, translate the files to schema version 2.0 and write out the files to a specified directory.

  • process_all_20_19.py - Read EMDB XML files in schema version 2.0 from a directory with a flat structure containing all the XML files, translate the files to schema version 1.9 and write out the files to a specified directory.

  • diff_all.py - Read EMDB XML files from two specified directories, do a straight text diff and write the diff files to another directory. Useful for comparing 1.9 files from a round-trip conversion with the starting 1.9 files.

  • emdb_19_to_json.py - Example script that reads EMDB XML 1.9 files and outputs summary information to JSON

  • emdb_20_to_json.py - Example script that reads EMDB XML 2.0 files and outputs summary information to JSON
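
The "straight text diff" that diff_all.py is described as doing can be pictured with Python's standard difflib module. What follows is a minimal Python 3 sketch under stated assumptions, not the script's actual code: the function name diff_directories, the pairing of files by identical file name, and the "<name>.txt" output naming are all hypothetical choices made for illustration.

```python
import difflib
from pathlib import Path


def diff_directories(dir_a, dir_b, out_dir):
    """Write a unified text diff for every XML file name found in both directories.

    Hypothetical helper for illustration only; not the code used by diff_all.py.
    Returns the list of diff files written.
    """
    dir_a, dir_b, out_dir = Path(dir_a), Path(dir_b), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for file_a in sorted(dir_a.glob("*.xml")):
        file_b = dir_b / file_a.name
        if not file_b.exists():
            continue  # no counterpart to compare against
        lines_a = file_a.read_text(encoding="utf-8").splitlines(keepends=True)
        lines_b = file_b.read_text(encoding="utf-8").splitlines(keepends=True)
        diff = difflib.unified_diff(
            lines_a, lines_b, fromfile=str(file_a), tofile=str(file_b)
        )
        # An empty diff file means the two inputs were textually identical.
        out_file = out_dir / (file_a.stem + ".txt")
        out_file.write_text("".join(diff), encoding="utf-8")
        written.append(out_file)
    return written
```

Because the comparison is purely textual, it is only meaningful when both inputs are written in the same canonical form, which is the role of process_all_19_19.py.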

Modules

All the above scripts may also be used as modules.

  • emdb_user_methods.py - Override methods (e.g. for formatting floats) required in some classes
  • emdb_19.py - Python data structure representation for EMDB XML 1.9, autogenerated using generateDS (http://pythonhosted.org/generateDS/):
    python ../generateDS-2.17a0/build/scripts-2.7/generateDS.py --user-methods=emdb_user_methods --external-encoding='utf-8' -f -o emdb_19.py ../schema/emdb19.xsd
  • emdb_da.py - Python data structure representation for EMDB XML 2.0, also autogenerated using generateDS:
    python ../generateDS-2.17a0/build/scripts-2.7/generateDS.py --user-methods=emdb_user_methods --external-encoding='utf-8' -f -o emdb_da.py ../schema/emdb_da.xsd
  • emdb_settings.py - global configuration variables used by several scripts/modules

“data” directory and how to convert individual files

The “data” directory has example files for testing the translator:

  • input/v1.9: example files from the archive in EMDB 1.9 format
  • input/v2.0: files in EMDB 2.0 format generated by the D&A system
  • input/pathological_v19: pathological examples from the archive in EMDB 1.9 format which do not yet work because other changes (e.g. schema changes) are needed first
  • output/v1.9: output directory for unit test converting files in input/v2.0 (2.0 -> 1.9 conversion)
  • output/v2.0: output directory for unit test converting files in input/v1.9 (1.9 -> 2.0 conversion)
  • output/roundtrip_v1.9: output directory for unit test converting files in output/v2.0 back to v1.9 (1.9 -> 2.0 -> 1.9 conversion)
  • output/json/v1.9: output directory for summary JSONs created from 1.9 XML files in input/v1.9
  • output/json/v2.0: output directory for summary JSONs created from 2.0 XML files in input/v2.0

To create a canonical EMDB 1.9 file do:

python emdb_xml_translate.py -i 1.9 -o 1.9 -f /tmp/emd-1001.xml data/input/v1.9/emd-1001.xml

To convert emd-1001.xml to EMDB 2.0 do:

python emdb_xml_translate.py -i 1.9 -o 2.0 -f /tmp/emd-1001-v2.xml data/input/v1.9/emd-1001.xml

To convert the 2.0 file back to EMDB 1.9 do:

python emdb_xml_translate.py -i 2.0 -o 1.9 -f /tmp/emd-1001-v19.xml /tmp/emd-1001-v2.xml

To create summary JSONs do:

python emdb_19_to_json.py -f /tmp/junk.json data/input/v1.9/emd-1001.xml

python emdb_20_to_json.py -f /tmp/junk.json data/input/v2.0/emd-10120_v2.xml
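
The three translation commands above can also be chained from Python. The following is a minimal sketch, assuming it is run from the directory that contains emdb_xml_translate.py and that "python" on the PATH is the interpreter the translator expects. The function names translate_cmd, round_trip_cmds and run_round_trip, and the "-canonical.xml" suffix, are hypothetical; only the -i, -o and -f options shown above are used.

```python
import subprocess


def translate_cmd(in_schema, out_schema, out_file, in_file):
    """Build the argument list for one emdb_xml_translate.py call.

    Uses only the -i, -o and -f options that appear in the commands above.
    """
    return ["python", "emdb_xml_translate.py",
            "-i", in_schema, "-o", out_schema, "-f", out_file, in_file]


def round_trip_cmds(in_file, out_prefix):
    """Return the commands for canonical 1.9, 1.9 -> 2.0 and 2.0 -> 1.9 output."""
    canonical = out_prefix + "-canonical.xml"  # hypothetical naming
    v2 = out_prefix + "-v2.xml"
    back = out_prefix + "-v19.xml"
    return [
        translate_cmd("1.9", "1.9", canonical, in_file),
        translate_cmd("1.9", "2.0", v2, in_file),
        translate_cmd("2.0", "1.9", back, v2),  # input is the 2.0 file from the previous step
    ]


def run_round_trip(in_file, out_prefix):
    """Run the three commands in order, stopping at the first failure."""
    for cmd in round_trip_cmds(in_file, out_prefix):
        subprocess.run(cmd, check=True)
```

For example, run_round_trip("data/input/v1.9/emd-1001.xml", "/tmp/emd-1001") would attempt to write /tmp/emd-1001-canonical.xml, /tmp/emd-1001-v2.xml and /tmp/emd-1001-v19.xml.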

Unit tests

To run the unit test do:

python translator_test.py

This will:

  1. convert all files in data/input/v1.9 to EMDB 2.0 and put them in data/output/v2.0
  2. take the EMDB 2.0 files in data/output/v2.0, convert them back to EMDB 1.9 and put them in data/output/roundtrip_v1.9
  3. convert all files in data/input/v2.0 to EMDB 1.9 and put them in data/output/v1.9

How to test round-trip conversion (1.9 -> 2.0 -> 1.9) on the whole archive

  1. Although one can set command line parameters for all the scripts below (use the -h option for help), it is probably easier to modify emdb_settings.py to set default parameters. Parameters to change:

      • archiveHeaderTemplate: Template for getting header files from the EMDB archive (standard directory structure)
      • emdb20Dir: Output directory for 1.9 -> 2.0 conversion files
      • emdb19To19Dir: Output directory for 1.9 -> canonical 1.9 conversion files
      • emdb20To19Dir: Output directory for 2.0 -> 1.9 conversion files
      • diffDir: Output directory for the results of the diff between canonical 1.9 and 2.0 -> 1.9 conversion files

  2. Starting with an archive containing 1.9 XML header files, use:

    python process_all_19_19.py >& logs/process_all_19_19.log &

    to create canonical EMDB XML 1.9 files that will make comparison easier.

  3. Starting with an archive containing 1.9 XML header files, use:

    python process_all_19_20.py >& logs/process_all_19_20.log &

    to create EMDB XML 2.0 header files.

  4. Starting with a directory containing 2.0 XML header files, use:

    python process_all_20_19.py >& logs/process_all_20_19.log &

    to translate back to EMDB XML 1.9 files.

  5. To diff the canonical 1.9 files and the back-translated 1.9 files, run:

    python diff_all.py >& logs/diff_all.log &

    You will need to check the diff files manually as small differences will remain even in the best cases. This can be tedious to do with individual files. You can concatenate the output for files as follows:

    find /data/emdb_diff -type f -name "emd-*.txt" -print | xargs -I % sh -c 'echo % ; cat %' > /tmp/junk
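
Where find and xargs are not available, the same concatenation can be approximated in Python 3. This is a minimal sketch, not part of the package: the function name concatenate_diffs is hypothetical, and unlike find it visits the files in sorted order.

```python
from pathlib import Path


def concatenate_diffs(diff_dir, out_file, pattern="emd-*.txt"):
    """Write each matching diff file, preceded by its path, into out_file.

    Hypothetical helper mirroring the find/xargs command above.
    Returns the number of files concatenated.
    """
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        # rglob searches diff_dir and all of its subdirectories.
        for path in sorted(Path(diff_dir).rglob(pattern)):
            if not path.is_file():
                continue
            out.write(str(path) + "\n")  # same role as "echo %"
            out.write(path.read_text(encoding="utf-8"))  # same role as "cat %"
            count += 1
    return count
```

For example, concatenate_diffs("/data/emdb_diff", "/tmp/junk") would collect every emd-*.txt file under /data/emdb_diff into /tmp/junk.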
