Bring your own data

Duration: ~3 hours (self-paced, with instructor support)

Goal: Apply the MultiFormatReader pattern from Day 1 to your own instrument data and produce a validated NeXus file.

What to bring: one or more raw data files from your instrument.


The core idea — always the same three steps

No matter what format your data comes in, the workflow is identical to Day 1:

Step A  Read your file(s) and store data as a dict on self
Step B  Write a config file that maps dict keys → NeXus paths
Step C  Run the converter and fix validation errors

The only part that changes between techniques and formats is Step A — the reading logic. Steps B and C are identical to Day 1.


Step 0 — Setup

Repeat the setup steps from Day 1. The goal is a fresh pynxtools reader plugin instantiated from the pynxtools-plugin-template.


Step 1 — Know your format

Before writing any reader code, understand what you are working with.

Identify your format

Format Typical extensions How to recognize
HDF5 / NeXus .h5, .hdf5, .nxs Binary; starts with \x89HDF
HDF5 (instrument brand) .h5m, .hsp, .he5, … Same magic bytes; vendor-specific internal layout
VAMAS .vms, .vamas First line: VAMAS Surface Chemical Analysis
Igor Pro wave .ibw Binary with IGOR header
CSV / TSV .csv, .txt, .dat, .asc Human-readable columns
JSON .json { or [ as first non-whitespace character
YAML .yaml, .yml Key-value pairs with indentation
NetCDF .nc, .cdf, .netcdf Binary; readable with netCDF4 or xarray
TIFF (detector images) .tiff, .tif Binary image; use tifffile or PIL

Explore the data before coding

import h5py

# For HDF5 files
with h5py.File("your_file.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, "→", type(obj).__name__))

# For text/CSV files
with open("your_file.csv") as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i > 20:
            break

Take 10 minutes to understand the structure before writing any code.


Step 2 — Implement handle_*_file

Pick the section below that matches your format and implement the corresponding handler.

HDF5 (any vendor)

This is the same recursive reader from Day 1. It works for any HDF5 file — vendor-specific layouts, NeXus files, everything.

import h5py
from typing import Any

def handle_hdf5_file(self, file_path: str) -> None:
    result: dict[str, Any] = {}

    def collect(name: str, obj: Any) -> None:
        if isinstance(obj, h5py.Dataset):
            result[name] = obj[()]
        # also capture HDF5 attributes; drop this loop if you don't need them
        for k, v in obj.attrs.items():
            result[f"{name}/@{k}"] = v

    with h5py.File(file_path, "r") as f:
        f.visititems(collect)

    self.hdf5_data = result

After running, print the keys to understand what is available:

r.handle_hdf5_file("your_file.h5")
for k in sorted(r.hdf5_data):
    print(k)

CSV / TSV / columnar text

import numpy as np
from typing import Any

def handle_csv_file(self, file_path: str) -> None:
    # Adjust delimiter, skiprows, and encoding for your file
    data = np.genfromtxt(
        file_path,
        delimiter=",",    # "\t" for TSV, None for whitespace
        names=True,       # use first row as column names
        encoding="utf-8",
    )
    self.data = {name: data[name] for name in data.dtype.names}
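With names=True, genfromtxt returns a structured array keyed by column name, which is exactly what the dict comprehension above relies on. A quick self-contained check with toy data:

```python
import io
import numpy as np

# In-memory stand-in for a two-column CSV file
csv = io.StringIO("energy,counts\n1.0,10\n2.0,20\n")
data = np.genfromtxt(csv, delimiter=",", names=True)

print(data.dtype.names)   # column names from the header row
print(data["counts"])     # one 1-D array per column
```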

Or with pandas for messy headers:

import pandas as pd
from typing import Any

def handle_csv_file(self, file_path: str) -> None:
    meta: dict[str, Any] = {}
    data_start = 0
    with open(file_path) as f:
        for i, line in enumerate(f):
            if line.startswith("#"):
                key, _, value = line[1:].partition("=")
                meta[key.strip()] = value.strip()
            else:
                data_start = i
                break

    df = pd.read_csv(file_path, skiprows=data_start, comment="#")
    self.data = {col: df[col].to_numpy() for col in df.columns}
    self.data.update(meta)

VAMAS (.vms)

VAMAS is common for XPS and other surface science data.

from typing import Any

def handle_vamas_file(self, file_path: str) -> None:
    try:
        from vamas import Vamas
    except ImportError as exc:
        raise ImportError(
            "The 'vamas' package is required: pip install vamas"
        ) from exc

    vms = Vamas(file_path)
    block = vms.blocks[0]   # first spectrum; iterate for multiple

    self.data = {
        "kinetic_energy":  block.x,
        "intensity":       block.y,
        "source_energy":   block.source_energy,
        "pass_energy":     block.analyzer_pass_energy,
        "dwell_time":      block.signal_collection_time,
        "sample_id":       block.sample_id,
        "technique":       block.technique,
        "comment":         block.comment,
    }
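If your file holds several spectra, keep them all by prefixing each block's keys with its index instead of reading only blocks[0]. The pattern below runs without a real .vms file by using SimpleNamespace stand-ins for the blocks (the numbers are made up):

```python
from types import SimpleNamespace

# Stand-ins for vms.blocks so the pattern is runnable here
blocks = [
    SimpleNamespace(x=[283.0, 284.0], y=[10, 12]),
    SimpleNamespace(x=[528.0, 529.0], y=[40, 41]),
]

data = {}
for i, block in enumerate(blocks):
    # one key prefix per block: block0/..., block1/..., ...
    data[f"block{i}/kinetic_energy"] = block.x
    data[f"block{i}/intensity"] = block.y

print(sorted(data))
```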

Igor Pro IBW (.ibw)

import numpy as np
from typing import Any

def handle_ibw_file(self, file_path: str) -> None:
    from igor2 import binarywave

    ibw = binarywave.load(file_path)["wave"]
    data = ibw["wData"]
    header = ibw["wave_header"]
    self.data = {"data": data}

    # axis scaling: offset (sfB) + step (sfA) per dimension
    for dim in range(data.ndim):
        n = data.shape[dim]
        self.data[f"axis_{dim}"] = (
            header["sfB"][dim] + header["sfA"][dim] * np.arange(n)
        )

    # JSON-encoded wave note (common in Scienta files)
    import json
    try:
        meta = json.loads(ibw["note"].decode())
        for k, v in meta.items():
            self.data[f"meta/{k}"] = v
    except (json.JSONDecodeError, AttributeError, UnicodeDecodeError):
        pass

NetCDF (.nc)

from typing import Any

def handle_netcdf_file(self, file_path: str) -> None:
    import xarray as xr

    self.data = {}
    with xr.open_dataset(file_path) as ds:
        for var in ds.data_vars:
            self.data[var] = ds[var].values
        for coord in ds.coords:
            self.data[f"axis/{coord}"] = ds.coords[coord].values
        for k, v in ds.attrs.items():
            self.data[f"attrs/{k}"] = v

TIFF / detector images

from typing import Any

def handle_tiff_file(self, file_path: str) -> None:
    import tifffile

    with tifffile.TiffFile(file_path) as tif:
        data = tif.asarray()   # (frames, H, W) or (H, W)
        meta = tif.imagej_metadata or {}
        if not meta and tif.pages[0].tags:
            meta = {t.name: t.value for t in tif.pages[0].tags.values()}

    self.data = {"detector/image": data}
    self.data.update({f"meta/{k}": v for k, v in meta.items()})

Anything else — the fallback pattern

from typing import Any

def handle_my_format(self, file_path: str) -> None:
    self.data = {}

    with open(file_path, "rb") as f:   # or "r" for text
        raw = f.read()

    # --- parse raw bytes or text here ---
    # e.g. use struct, regex, or your vendor's SDK

    self.data["signal"] = ...
    self.data["energy_axis"] = ...
    self.data["sample_name"] = ...

Then register the extension in __init__:

self.extensions[".myext"] = self.handle_my_format

Step 3 — Update the callbacks

If you used self.data (not self.hdf5_data), update the three callbacks:

from typing import Any

def get_attr(self, key: str, path: str) -> Any:
    if self.data is None:
        return None
    value = self.data.get(path)
    if isinstance(value, bytes):
        return value.decode()
    return value

def get_eln_data(self, key: str, path: str) -> Any:
    if self.eln_data is None:
        return None
    return self.eln_data.get(key)

def get_data(self, key: str, path: str) -> Any:
    if self.data is None:
        return None
    return self.data.get(path)
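You can sanity-check the callbacks in isolation before wiring them into the converter. The class below is a hypothetical stand-in that holds only the fields the callbacks touch; note how the bytes value comes back decoded:

```python
from typing import Any

class DemoReader:
    """Minimal stand-in (hypothetical) mirroring get_attr above."""

    def __init__(self) -> None:
        self.data = {"sample_name": b"Si wafer", "signal": [1, 2, 3]}

    def get_attr(self, key: str, path: str) -> Any:
        if self.data is None:
            return None
        value = self.data.get(path)
        if isinstance(value, bytes):
            return value.decode()   # byte strings become str
        return value

r = DemoReader()
print(r.get_attr("", "sample_name"))   # decoded to a plain string
print(r.get_attr("", "missing"))       # absent keys return None
```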

Step 4 — Find your application definition

Does one already exist?

Check whether a community definition exists for your technique:

Technique Application definition Plugin
XPS NXxps pynxtools-xps
ARPES / multi-photon NXmpes, NXmpes_arpes, NXarpes pynxtools-mpes
Raman NXraman pynxtools-raman
Ellipsometry NXellipsometry pynxtools-ellips
Electron microscopy NXem pynxtools-em
X-ray diffraction NXxrd pynxtools-xrd

Test whether it is installed:

dataconverter generate-template --nxdl NXmpes

No definition? Write a minimal one.

Use the skills from Session 1. Start with the smallest possible skeleton:

# NXmytechnique.yaml
category: application
doc: Application definition for my technique.
type: group
NXmytechnique(NXobject):
  (NXentry):
    definition:
      enumeration: [NXmytechnique]
    title:
    (NXinstrument):
      name(NX_CHAR):
    (NXsample):
      name(NX_CHAR):
    (NXdata):

Convert it:

nyaml2nxdl NXmytechnique.yaml --output-file NXmytechnique.nxdl.xml

To use your application definition directly, you need to add it to the NeXus definitions bundled with pynxtools. For this, install pynxtools with the -e (editable) option in the same virtual environment you are already working in, and initialize the definitions git submodule. You can learn more in the pynxtools development guide.

Then you can place your application definition NXDL XML file in pynxtools:

cp NXmytechnique.nxdl.xml src/pynxtools/definitions/contributed_definitions/
dataconverter generate-template --nxdl NXmytechnique

Step 5 — Write the config file

Generate the template first:

dataconverter generate-template --nxdl <YOUR_NXDL> > config.json

For each path in the output, fill in the config:

Where is the value? Config value
self.data["some/key"] or self.hdf5_data["some/key"] "@attrs:some/key"
self.eln_data["/ENTRY[entry]/..."] "@eln"
self.data["signal_array"] "@data:signal_array"
Fixed constant "eV" or 532
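Put together, a fragment of config.json might look like this. All NeXus paths and dict keys here are illustrative, not taken from a real template:

```json
{
  "/ENTRY[entry]/title": "@attrs:meta/title",
  "/ENTRY[entry]/SAMPLE[sample]/name": "@attrs:sample_name",
  "/ENTRY[entry]/DATA[data]/intensity": "@data:intensity",
  "/ENTRY[entry]/DATA[data]/intensity/@units": "counts",
  "/ENTRY[entry]/INSTRUMENT[instrument]/energy_resolution": "@eln"
}
```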

Learn more about the config file in the pynxtools documentation for the MultiFormatReader.


Step 6 — Convert, validate, iterate

dataconverter \
    your_file.ext \
    eln_data.yaml \
    config_file.json \
    --reader <your-reader> \
    --nxdl <YOUR_NXDL> \
    --output output.nxs

Read the output messages:

Level Meaning Action
ERROR Required field missing Add to config or provide ELN data
WARNING Recommended field missing Add if possible
INFO Optional field missing Safe to skip

Inspect the result:

import h5py
with h5py.File("output.nxs", "r") as f:
    f.visititems(lambda n, o: print(n))

Repeat until no errors remain.


Common errors and fixes

Error / symptom Cause Fix
ModuleNotFoundError: <vendor lib> Library not installed pip install <library>
KeyError: 'some/path' in callback Path missing from self.data print(sorted(self.data.keys())) to find the right key
Required field missing in output Config doesn't map it Add the path to config file
bytes in output string field h5py byte string Add .decode() in the callback
All get_eln_data return None Wrong CONVERT_DICT keys Print self.eln_data.keys() vs the key argument
Validation passes but file looks incomplete Application definition has no required fields Add required fields to the NXDL

Checklist before you leave

  • [ ] dataconverter runs without errors on your own data
  • [ ] All required fields are present in output.nxs
  • [ ] Units are set for every numeric field
  • [ ] reader.py and config_file.json are committed to your repository
  • [ ] You know which application definition matches your technique (or have written a minimal one)

Further reading