Bring your own data

Duration: ~3 hours (self-paced, with instructor support)

Goal: Apply the MultiFormatReader pattern from Day 1 to your own instrument data and produce a validated NeXus file.

What to bring: one or more raw data files from your instrument.


The core idea — always the same three steps

No matter what format your data comes in, the workflow is identical to Day 1:

Step A  Read your file(s) and store data as a dict on self
Step B  Write a config file that maps dict keys → NeXus paths
Step C  Run the converter and fix validation errors

The only part that changes between techniques and formats is Step A — the reading logic. Steps B and C are identical to Day 1.


Step 0 — Setup

Repeat the setup steps from Day 1. The goal is a fresh pynxtools reader plugin instantiated from the pynxtools-plugin-template.


Step 1 — Know your format

Before writing any reader code, understand what you are working with.

Identify your format

Format Typical extensions How to recognize
HDF5 / NeXus .h5, .hdf5, .nxs Binary; starts with \x89HDF
HDF5 (instrument brand) .h5m, .hsp, .he5, … Same magic bytes; vendor-specific internal layout
VAMAS .vms, .vamas First line: VAMAS Surface Chemical Analysis
Igor Pro wave .ibw Binary with IGOR header
CSV / TSV .csv, .txt, .dat, .asc Human-readable columns
JSON .json { or [ as first non-whitespace character
YAML .yaml, .yml Key-value pairs with indentation
NetCDF .nc, .cdf, .netcdf Binary; readable with netCDF4 or xarray
TIFF (detector images) .tiff, .tif Binary image; use tifffile or PIL

Explore the data before coding

import h5py

# For HDF5 files
with h5py.File("your_file.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, "→", type(obj).__name__))

# For text/CSV files
with open("your_file.csv") as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i > 20:
            break

Take 10 minutes to understand the structure before writing any code.


Step 2 — Implement handle_*_file

Pick the section below that matches your format and implement the corresponding handler.

HDF5 (any vendor)

This is the same recursive reader from Day 1. It works for any HDF5 file — vendor-specific layouts, NeXus files, everything.

import h5py
from typing import Any

def handle_hdf5_file(self, file_path: str) -> None:
    result: dict[str, Any] = {}

    def collect(name: str, obj: Any) -> None:
        if isinstance(obj, h5py.Dataset):
            result[name] = obj[()]
        # also capture HDF5 attributes; drop this loop if you don't need them
        for k, v in obj.attrs.items():
            result[f"{name}/@{k}"] = v

    with h5py.File(file_path, "r") as f:
        f.visititems(collect)

    self.hdf5_data = result

After running, print the keys to understand what is available:

r.handle_hdf5_file("your_file.h5")
for k in sorted(r.hdf5_data):
    print(k)

CSV / TSV / columnar text

import numpy as np
from typing import Any

def handle_csv_file(self, file_path: str) -> None:
    # Adjust delimiter, skiprows, and encoding for your file
    data = np.genfromtxt(
        file_path,
        delimiter=",",    # "\t" for TSV, None for whitespace
        names=True,       # use first row as column names
        encoding="utf-8",
    )
    self.data = {name: data[name] for name in data.dtype.names}
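With names=True, genfromtxt returns a structured array keyed by column name, which is exactly what the dict comprehension above relies on. A quick self-contained check with toy data:

```python
import io
import numpy as np

# In-memory stand-in for a two-column CSV file
csv = io.StringIO("energy,counts\n1.0,10\n2.0,20\n")
data = np.genfromtxt(csv, delimiter=",", names=True)

print(data.dtype.names)   # column names from the header row
print(data["counts"])     # one 1-D array per column
```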

Or with pandas for messy headers:

import pandas as pd
from typing import Any

def handle_csv_file(self, file_path: str) -> None:
    meta: dict[str, Any] = {}
    data_start = 0
    with open(file_path) as f:
        for i, line in enumerate(f):
            if line.startswith("#"):
                key, _, value = line[1:].partition("=")
                meta[key.strip()] = value.strip()
            else:
                data_start = i
                break

    df = pd.read_csv(file_path, skiprows=data_start, comment="#")
    self.data = {col: df[col].to_numpy() for col in df.columns}
    self.data.update(meta)

VAMAS (.vms)

VAMAS is common for XPS and other surface science data.

from typing import Any

def handle_vamas_file(self, file_path: str) -> None:
    try:
        from vamas import Vamas
    except ImportError as exc:
        raise ImportError(
            "The 'vamas' package is required: pip install vamas"
        ) from exc

    vms = Vamas(file_path)
    block = vms.blocks[0]   # first spectrum; iterate for multiple

    self.data = {
        "kinetic_energy":  block.x,
        "intensity":       block.y,
        "source_energy":   block.source_energy,
        "pass_energy":     block.analyzer_pass_energy,
        "dwell_time":      block.signal_collection_time,
        "sample_id":       block.sample_id,
        "technique":       block.technique,
        "comment":         block.comment,
    }
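If your file holds several spectra, keep them all by prefixing each block's keys with its index instead of reading only blocks[0]. The pattern below runs without a real .vms file by using SimpleNamespace stand-ins for the blocks (the numbers are made up):

```python
from types import SimpleNamespace

# Stand-ins for vms.blocks so the pattern is runnable here
blocks = [
    SimpleNamespace(x=[283.0, 284.0], y=[10, 12]),
    SimpleNamespace(x=[528.0, 529.0], y=[40, 41]),
]

data = {}
for i, block in enumerate(blocks):
    # one key prefix per block: block0/..., block1/..., ...
    data[f"block{i}/kinetic_energy"] = block.x
    data[f"block{i}/intensity"] = block.y

print(sorted(data))
```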

Igor Pro IBW (.ibw)

import numpy as np
from typing import Any

def handle_ibw_file(self, file_path: str) -> None:
    from igor2 import binarywave

    ibw = binarywave.load(file_path)["wave"]
    data = ibw["wData"]
    header = ibw["wave_header"]
    self.data = {"data": data}

    # axis scaling: offset (sfB) + step (sfA) per dimension
    for dim in range(data.ndim):
        n = data.shape[dim]
        self.data[f"axis_{dim}"] = (
            header["sfB"][dim] + header["sfA"][dim] * np.arange(n)
        )

    # JSON-encoded wave note (common in Scienta files)
    import json
    try:
        meta = json.loads(ibw["note"].decode())
        for k, v in meta.items():
            self.data[f"meta/{k}"] = v
    except (json.JSONDecodeError, AttributeError, UnicodeDecodeError):
        pass

NetCDF (.nc)

from typing import Any

def handle_netcdf_file(self, file_path: str) -> None:
    import xarray as xr

    self.data = {}
    with xr.open_dataset(file_path) as ds:
        for var in ds.data_vars:
            self.data[var] = ds[var].values
        for coord in ds.coords:
            self.data[f"axis/{coord}"] = ds.coords[coord].values
        for k, v in ds.attrs.items():
            self.data[f"attrs/{k}"] = v

TIFF / detector images

from typing import Any

def handle_tiff_file(self, file_path: str) -> None:
    import tifffile

    with tifffile.TiffFile(file_path) as tif:
        data = tif.asarray()   # (frames, H, W) or (H, W)
        meta = tif.imagej_metadata or {}
        if not meta and tif.pages[0].tags:
            meta = {t.name: t.value for t in tif.pages[0].tags.values()}

    self.data = {"detector/image": data}
    self.data.update({f"meta/{k}": v for k, v in meta.items()})

Anything else — the fallback pattern

from typing import Any

def handle_my_format(self, file_path: str) -> None:
    self.data = {}

    with open(file_path, "rb") as f:   # or "r" for text
        raw = f.read()

    # --- parse raw bytes or text here ---
    # e.g. use struct, regex, or your vendor's SDK

    self.data["signal"] = ...
    self.data["energy_axis"] = ...
    self.data["sample_name"] = ...

Then register the extension in __init__:

self.extensions[".myext"] = self.handle_my_format

Step 3 — Update the callbacks

If you used self.data (not self.hdf5_data), update the three callbacks:

from typing import Any

def get_attr(self, key: str, path: str) -> Any:
    if self.data is None:
        return None
    value = self.data.get(path)
    if isinstance(value, bytes):
        return value.decode()
    return value

def get_eln_data(self, key: str, path: str) -> Any:
    if self.eln_data is None:
        return None
    return self.eln_data.get(key)

def get_data(self, key: str, path: str) -> Any:
    if self.data is None:
        return None
    return self.data.get(path)
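You can sanity-check the callbacks in isolation before wiring them into the converter. The class below is a hypothetical stand-in that holds only the fields the callbacks touch; note how the bytes value comes back decoded:

```python
from typing import Any

class DemoReader:
    """Minimal stand-in (hypothetical) mirroring get_attr above."""

    def __init__(self) -> None:
        self.data = {"sample_name": b"Si wafer", "signal": [1, 2, 3]}

    def get_attr(self, key: str, path: str) -> Any:
        if self.data is None:
            return None
        value = self.data.get(path)
        if isinstance(value, bytes):
            return value.decode()   # byte strings become str
        return value

r = DemoReader()
print(r.get_attr("", "sample_name"))   # decoded to a plain string
print(r.get_attr("", "missing"))       # absent keys return None
```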

Step 4 — Find your application definition

Does one already exist?

Check whether a community definition exists for your technique:

Technique Application definition Plugin
XPS NXxps pynxtools-xps
ARPES / multi-photon NXmpes, NXmpes_arpes, NXarpes pynxtools-mpes
Raman NXraman pynxtools-raman
Ellipsometry NXellipsometry pynxtools-ellips
Electron microscopy NXem pynxtools-em
X-ray diffraction NXxrd pynxtools-xrd

Test whether it is installed:

dataconverter generate-template --nxdl NXmpes

No definition? Write a minimal one.

Use the skills from Session 1. Start with the smallest possible skeleton:

# NXmytechnique.yaml
category: application
doc: Application definition for my technique.
type: group
NXmytechnique(NXobject):
  (NXentry):
    definition:
      enumeration: [NXmytechnique]
    title:
    (NXinstrument):
      name(NX_CHAR):
    (NXsample):
      name(NX_CHAR):
    (NXdata):

Convert it:

nyaml2nxdl NXmytechnique.yaml --output-file NXmytechnique.nxdl.xml

To use your application definition directly, you need to add it to the NeXus definitions bundled with pynxtools. For this, install pynxtools with the -e (editable) option in the same virtual environment you are already working in, and initialize the definitions git submodule. You can learn more in the pynxtools development guide.

Then you can place your application definition NXDL XML file in pynxtools:

cp NXmytechnique.nxdl.xml src/pynxtools/definitions/contributed_definitions/
dataconverter generate-template --nxdl NXmytechnique

Step 5 — Write the config file

Generate the template first:

dataconverter generate-template --nxdl <YOUR_NXDL> > config.json

For each path in the output, fill in the config:

Where is the value? Config value
self.data["some/key"] or self.hdf5_data["some/key"] "@attrs:some/key"
self.eln_data["/ENTRY[entry]/..."] "@eln"
self.data["signal_array"] "@data:signal_array"
Fixed constant "eV" or 532
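Put together, a fragment of config.json might look like this. All NeXus paths and dict keys here are illustrative, not taken from a real template:

```json
{
  "/ENTRY[entry]/title": "@attrs:meta/title",
  "/ENTRY[entry]/SAMPLE[sample]/name": "@attrs:sample_name",
  "/ENTRY[entry]/DATA[data]/intensity": "@data:intensity",
  "/ENTRY[entry]/DATA[data]/intensity/@units": "counts",
  "/ENTRY[entry]/INSTRUMENT[instrument]/energy_resolution": "@eln"
}
```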

Learn more about the config file in the pynxtools documentation for the MultiFormatReader.


Step 6 — Convert, validate, iterate

dataconverter \
    your_file.ext \
    eln_data.yaml \
    config_file.json \
    --reader <your-reader> \
    --nxdl <YOUR_NXDL> \
    --output output.nxs

Read the output messages:

Level Meaning Action
ERROR Required field missing Add to config or provide ELN data
WARNING Recommended field missing Add if possible
INFO Optional field missing Safe to skip

Inspect the result:

import h5py
with h5py.File("output.nxs", "r") as f:
    f.visititems(lambda n, o: print(n))

Repeat until no errors remain.


Common errors and fixes

Error / symptom Cause Fix
ModuleNotFoundError: <vendor lib> Library not installed pip install <library>
KeyError: 'some/path' in callback Path missing from self.data print(sorted(self.data.keys())) to find the right key
Required field missing in output Config doesn't map it Add the path to config file
bytes in output string field h5py byte string Add .decode() in the callback
All get_eln_data return None Wrong CONVERT_DICT keys Print self.eln_data.keys() vs the key argument
Validation passes but file looks incomplete Application definition has no required fields Add required fields to the NXDL

Checklist before you leave

  • [ ] dataconverter runs without errors on your own data
  • [ ] All required fields are present in output.nxs
  • [ ] Units are set for every numeric field
  • [ ] reader.py and config_file.json are committed to your repository
  • [ ] You know which application definition matches your technique (or have written a minimal one)

Further reading