Skill v1.0.2

Current · Automated scan: 100/100

nvidia-bionemo/nvmolkit/nvmolkit-usage

──Details

Published: June 23, 2026 at 09:11 PM

Content Hash: sha256:1e7aa4102c100a7d...

Git SHA: ad4124ea1377

Bump Type: patch

SKILL.md · 315 lines · 17.3 KB

version: "1.0.2"
name: nvmolkit-usage
description: >-
  Write code that calls the installed nvMolKit Python API for GPU-accelerated,
  batched RDKit-style operations - Morgan fingerprints, Tanimoto/cosine
  similarity, ETKDG conformer embedding, MMFF/UFF optimization, TFD, conformer
  RMSD, Butina clustering, and substructure search. Use when the user is
  importing nvmolkit.*, debugging an nvmolkit call, choosing between nvMolKit
  and RDKit for a batched cheminformatics workflow, or wiring nvMolKit results
  into a torch/numpy pipeline. Out of scope: building nvMolKit from source.
license: Apache-2.0
metadata:
  owner: Kevin Boyd (@scal444)
  risk_tier: skill

nvMolKit usage

What nvMolKit is

GPU-accelerated, batched implementations of common RDKit operations. APIs mirror RDKit where possible but are batch-oriented: they take lists of rdkit.Chem.Mol (or lists of fingerprints) and process them in parallel on one or more GPUs. nvMolKit links against RDKit at build time; inputs and outputs are real RDKit Mol objects.

Where nvMolKit does well

Reach for nvMolKit when:

The workload is a large batch of molecules processed together (typically thousands or more).
The metric is throughput / total wall time across the batch, not per-molecule latency.
The same operation is repeated identically across the batch (fingerprinting a library, embedding/minimizing many conformers, bulk pairwise similarity), so the GPU stays saturated.

Plain RDKit is usually the better choice for single-molecule one-offs or workflows that can't be expressed as a batch. nvMolKit is not meant to replace RDKit for those cases.

Runtime requirements

An NVIDIA GPU with compute capability 7.0 (V100) or higher.
A CUDA driver compatible with CUDA 12.6+.
A working torch install with CUDA support (nvMolKit returns GPU tensors via torch's CUDA array interface).

If CUDA is unavailable, nvMolKit calls raise an exception. There is no CPU fallback - if the user needs one, use RDKit directly for that path.

When helping with installation, have the user choose a PyTorch CUDA backend that the host driver supports before installing nvMolKit. nvMolKit's PyPI wheels are built with CUDA Toolkit 12.9 and depend on CUDA 12 runtime packages, but pip/uv can still select a CUDA 13 PyTorch wheel unless the install command says otherwise.

Conda: prefer conda-forge pytorch-gpu; pin cuda-version=12.6 or another CUDA version supported by the driver.
pip: send the user to the PyTorch install selector (https://pytorch.org/get-started/locally/) or previous-versions page (https://pytorch.org/get-started/previous-versions/) to install torch for a CUDA 12.x backend before installing nvMolKit.
uv: install nvMolKit with an explicit backend, e.g. uv pip install --torch-backend=cu128 nvmolkit.

Verify the install before writing real code

Run this once to confirm nvMolKit is importable and a GPU op works end to end:

python

import nvmolkit
import torch
from rdkit import Chem
from nvmolkit.fingerprints import MorganFingerprintGenerator
print("nvmolkit:", nvmolkit.__version__)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
mols = [Chem.MolFromSmiles(smi) for smi in ["CCO", "c1ccccc1", "CC(=O)O"]]
fpgen = MorganFingerprintGenerator(radius=2, fpSize=1024)
result = fpgen.GetFingerprints(mols)
torch.cuda.synchronize()
fps = result.torch()
print("fps shape:", tuple(fps.shape), "dtype:", fps.dtype)
# Expected: shape (3, 32), dtype torch.int32  (1024 bits packed into 32 int32s per row)

If this fails, point the user at the install guide on the docs site rather than guessing - see "Going deeper" below.

Entry points

Task	Module	Primary entry point
Morgan fingerprints	`nvmolkit.fingerprints`	`MorganFingerprintGenerator(radius, fpSize).GetFingerprints(mols)`
Bulk Tanimoto / cosine similarity	`nvmolkit.similarity`	`crossTanimotoSimilarity(...)`, `crossCosineSimilarity(...)`, plus `*MemoryConstrained` variants for results too large to fit in GPU memory
ETKDG conformer embedding	`nvmolkit.embedMolecules`	`EmbedMolecules(molecules, params, confsPerMolecule, ...)`
MMFF94 optimization (one-shot)	`nvmolkit.mmffOptimization`	`MMFFOptimizeMoleculesConfs(molecules, ...)`
UFF optimization (one-shot)	`nvmolkit.uffOptimization`	`UFFOptimizeMoleculesConfs(molecules, ...)`
Forcefield with custom options + constraints	`nvmolkit.batchedForcefield`	`MMFFBatchedForcefield(mols, properties=..., nonBondedThreshold=..., ignoreInterfragInteractions=..., hardwareOptions=...)`, `UFFBatchedForcefield(mols, vdwThreshold=..., ...)`. Per-molecule view `ff[i]` exposes `add_distance_constraint`, `add_position_constraint`, `add_angle_constraint`, `add_torsion_constraint`. Methods: `.compute_energy()`, `.compute_gradients()`, `.minimize(maxIters, forceTol)`
Pairwise conformer RMSD	`nvmolkit.conformerRmsd`	`GetConformerRMSMatrix(mol)`, `GetConformerRMSMatrixBatch(mols)`
Torsion Fingerprint Deviation (TFD)	`nvmolkit.tfd`	`GetTFDMatrix(mol)`, `GetTFDMatrices(mols)`
Butina clustering	`nvmolkit.clustering`	`butina(distance_matrix, cutoff)` (precomputed matrix), `fused_butina(fingerprints, cutoff)` (memory-efficient, on-the-fly)
Substructure search	`nvmolkit.substructure`	`hasSubstructMatch`, `countSubstructMatches`, `getSubstructMatches`
Hardware tuning (batch size, GPU IDs)	`nvmolkit.types`	`HardwareOptions(...)` passed to ETKDG / MMFF / UFF
Optional autotuning of `HardwareOptions`	`nvmolkit.autotune`	`tune_embed_molecules`, `tune_mmff_optimize`, `tune_uff_optimize`, `tune_batched_forcefield`. Requires the `optuna` package

Result types and execution model

Two result types carry GPU-resident output, depending on what the operation produces.

`AsyncGpuResult`

Used by operations that return a single flat tensor (fingerprints, similarity matrices, RMSD/TFD vectors, Butina inputs). Key behaviors:

Asynchronous. The kernel may not have completed when the call returns.
result.torch() returns a zero-copy torch.Tensor on the GPU. Caller is responsible for synchronizing before reading values on the host.
result.numpy() synchronizes and returns a CPU numpy array.
Exposes __cuda_array_interface__, so it can be passed directly into other nvMolKit functions (e.g. fingerprints → similarity) with no host round-trip.

CUDA stream control

A subset of the AsyncGpuResult-returning APIs accept an optional stream: torch.cuda.Stream | None = None argument so callers can submit nvMolKit work to a non-default stream and overlap it with their own kernels. When omitted, the call uses the current torch stream.

APIs that take a stream argument:

MorganFingerprintGenerator.GetFingerprints
crossTanimotoSimilarity, crossCosineSimilarity, and their *MemoryConstrained variants
butina, fused_butina
GetConformerRMSMatrix, GetConformerRMSMatrixBatch

Other APIs (ETKDG, MMFF/UFF optimization, TFD, substructure search) are synchronous to the caller, so no stream plumbing is needed.

Typical pattern:

python

import torch
from rdkit import Chem
from nvmolkit.fingerprints import MorganFingerprintGenerator
from nvmolkit.similarity import crossTanimotoSimilarity
stream = torch.cuda.Stream()
fpgen = MorganFingerprintGenerator(radius=2, fpSize=1024)
mols = [Chem.MolFromSmiles(smi) for smi in ["CCO", "c1ccccc1", "CC(=O)O"]]
with torch.cuda.stream(stream):
    fps = fpgen.GetFingerprints(mols, stream=stream)
    sim = crossTanimotoSimilarity(fps, stream=stream)
stream.synchronize()
print(sim.torch())

`Device3DResult`

Used by ETKDG embedding and MMFF/UFF optimization (one-shot and BatchedForcefield) when called with output=CoordinateOutput.DEVICE. The GPU-resident equivalent of writing conformers back to Mol objects. Fields:

values: AsyncGpuResult of shape (total_atoms, 3) float64. Concatenated conformer coordinates in CSR-style layout.
atom_starts, mol_indices, conf_indices: AsyncGpuResult int32 buffers describing the layout (values[atom_starts[i]:atom_starts[i+1]] is conformer i's atoms).
energies, converged: AsyncGpuResult buffers populated only for MMFF/UFF minimization (not for plain ETKDG).
gpu_id: device the buffers live on. The targetGpu argument on each API picks this; targetGpu=-1 uses the default consolidation device.
.per_molecule() returns nested list[list[torch.Tensor]] of per-conformer views; .dense(pad_value=nan) materializes a padded (n_mols, max_confs, max_atoms, 3) tensor.

The default mode (CoordinateOutput.RDKIT_CONFORMERS) still writes optimized coordinates back into each Mol and returns Python lists of energies/convergence flags. Reach for CoordinateOutput.DEVICE when chaining downstream GPU work (e.g. ETKDG → MMFF → similarity scoring) without host round-trips.
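To make the CSR-style layout concrete, here is a minimal pure-Python sketch in which plain lists stand in for the GPU buffers. The toy data and the helper names (conformer_coords, group_by_molecule) are illustrative and not part of nvMolKit; with a real Device3DResult you would first convert each field with .torch() or .numpy(). The sketch assumes atom_starts has one more entry than the number of conformers, as the slicing rule above implies.

```python
# Toy stand-ins for Device3DResult buffers (hypothetical data, plain lists).
# Molecule 0 has 2 conformers of 2 atoms; molecule 1 has 1 conformer of 3 atoms.
values = [
    [0.0, 0.0, 0.0], [1.0, 0.0, 0.0],                    # mol 0, conf 0
    [0.0, 0.0, 0.0], [0.0, 1.1, 0.0],                    # mol 0, conf 1
    [0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0],   # mol 1, conf 0
]
atom_starts = [0, 2, 4, 7]   # assumed: len == n_conformers + 1
mol_indices = [0, 0, 1]      # owning molecule of each flat conformer
conf_indices = [0, 1, 0]     # conformer index within that molecule


def conformer_coords(i):
    """Rows of `values` that belong to flat conformer i."""
    return values[atom_starts[i]:atom_starts[i + 1]]


def group_by_molecule():
    """Nested {mol_index: {conf_index: coords}} view of the flat buffers."""
    grouped = {}
    for i, (mol, conf) in enumerate(zip(mol_indices, conf_indices)):
        grouped.setdefault(mol, {})[conf] = conformer_coords(i)
    return grouped


grouped = group_by_molecule()
print(len(conformer_coords(2)))   # atom count of the third flat conformer
print(sorted(grouped[0]))         # conformer indices recorded for molecule 0
```

In real code .per_molecule() and .dense() already provide these views; the sketch only shows what the index buffers encode.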

Configuration

Two configuration objects expose the GPU/CPU knobs.

`HardwareOptions` (ETKDG, MMFF, UFF)

Import with `from nvmolkit.types import HardwareOptions`. Pass it via hardwareOptions= to EmbedMolecules, MMFFOptimizeMoleculesConfs, UFFOptimizeMoleculesConfs, and the BatchedForcefield constructors. Every field has an "auto" sentinel; the defaults are usually fine.

Field	Type	Default	Meaning
`preprocessingThreads`	int	`-1` (all visible CPUs)	CPU threads for preprocessing
`batchSize`	int	`-1` (auto-tuned)	Number of conformers per GPU batch
`batchesPerGpu`	int	`-1` (auto)	Concurrent batches per GPU; must be `>0` or `-1`
`gpuIds`	`list[int]`	`[]` (all visible GPUs)	Specific device ordinals to target

Passing a gpuIds entry for a device that isn't visible raises RuntimeError: invalid device ordinal. For finding good values automatically across a representative sample, see nvmolkit.autotune (requires the optuna extra); each tune_* function returns a TuneResult whose best_config is a fully-populated HardwareOptions ready to pass back into the real call.

HardwareOptions round-trips through to_dict() / from_dict() for persisting tuned configs to disk.
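One way to persist a tuned config is to write the to_dict() output as JSON. The sketch below is duck-typed and uses a stand-in class (FakeOptions, hypothetical) so it runs without nvMolKit. It assumes to_dict() returns a JSON-serializable dict; the key names shown are guesses based on the field table above, not confirmed output. The helper names save_options and load_options_dict are illustrative.

```python
import json
import os
import tempfile


def save_options(options, path):
    """Write options.to_dict() to `path` as JSON (assumes a JSON-serializable dict)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(options.to_dict(), fh, indent=2)


def load_options_dict(path):
    """Read the saved dict back; real code would pass it to HardwareOptions.from_dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


class FakeOptions:
    """Hypothetical stand-in for HardwareOptions so the sketch runs anywhere."""

    def to_dict(self):
        # Assumed keys, mirroring the documented field names.
        return {"preprocessingThreads": -1, "batchSize": 256,
                "batchesPerGpu": 2, "gpuIds": [0]}


path = os.path.join(tempfile.mkdtemp(), "hardware_options.json")
save_options(FakeOptions(), path)
restored = load_options_dict(path)
print(restored["batchSize"])
```

With nvMolKit installed, the same helpers should work on a TuneResult's best_config, followed by HardwareOptions.from_dict(restored) to rebuild the object.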

`SubstructSearchConfig` (substructure search)

Import with `from nvmolkit.substructure import SubstructSearchConfig`. Pass it via config= to hasSubstructMatch, countSubstructMatches, and getSubstructMatches.

Field	Type	Default	Meaning
`batchSize`	int	`1024`	(target, query) pairs per GPU batch
`workerThreads`	int	`-1` (auto)	GPU runner threads per GPU
`preprocessingThreads`	int	`-1` (auto)	CPU threads for preprocessing
`maxMatches`	int	`0` (unlimited)	Max matches returned per (target, query) pair
`uniquify`	bool	`False`	Drop duplicate matches that differ only in atom enumeration order
`gpuIds`	`list[int] | None`	`None` (current device only)	Specific device ordinals to target

Substructure search currently does not support chirality-aware matching, enhanced stereochemistry, or other advanced RDKit SubstructMatchParameters options.

Recipes

Morgan fingerprints + bulk Tanimoto similarity

python

import torch
from rdkit import Chem
from nvmolkit.fingerprints import MorganFingerprintGenerator
from nvmolkit.similarity import crossTanimotoSimilarity
smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCOCC"]
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
fpgen = MorganFingerprintGenerator(radius=2, fpSize=1024)
fps = fpgen.GetFingerprints(mols)
sim = crossTanimotoSimilarity(fps)
torch.cuda.synchronize()
print(sim.torch())

Inputs are list[Mol]. Output of GetFingerprints is an AsyncGpuResult wrapping an (n_mols, fpSize / 32) int32 tensor of packed bits. Pass it straight into crossTanimotoSimilarity for an (n, n) similarity matrix; pass two fingerprint sets for an (n, m) cross-matrix. For sets too large to materialize on the GPU, use crossTanimotoSimilarityMemoryConstrained (chunked compute, returns numpy on CPU).
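To see what "packed bits" means and how Tanimoto can be computed on them, here is a small CPU reference in pure Python. It is a sketch under stated assumptions, not nvMolKit's kernel: the bit order within a word (bit b stored in word b // 32 at position b % 32) is assumed, words are kept as unsigned Python ints rather than int32, and the names pack_bits and tanimoto are illustrative.

```python
WORD_BITS = 32


def pack_bits(on_bits, fp_size):
    """Pack on-bit positions into fp_size // 32 integer words.

    Assumption: bit b lives in word b // 32 at position b % 32. The real
    layout inside nvMolKit's int32 words may differ.
    """
    words = [0] * (fp_size // WORD_BITS)
    for b in on_bits:
        words[b // WORD_BITS] |= 1 << (b % WORD_BITS)
    return words


def tanimoto(words_a, words_b):
    """|A & B| / |A | B| computed word by word with popcounts."""
    both = sum(bin(a & b).count("1") for a, b in zip(words_a, words_b))
    either = sum(bin(a | b).count("1") for a, b in zip(words_a, words_b))
    # Returning 0.0 for two empty fingerprints is an arbitrary choice here.
    return both / either if either else 0.0


fp_a = pack_bits({1, 33, 1000}, 1024)
fp_b = pack_bits({1, 33, 64}, 1024)
print(len(fp_a), tanimoto(fp_a, fp_b))
```

The two toy fingerprints share two on-bits out of four distinct ones, so a 1024-bit fingerprint occupies 32 words and the similarity is 2 / 4.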

ETKDG conformer embedding

python

from rdkit.Chem import AddHs, MolFromSmiles
from rdkit.Chem.rdDistGeom import ETKDGv3
from nvmolkit.embedMolecules import EmbedMolecules
mols = [AddHs(MolFromSmiles(smi)) for smi in ["C1CCCCC1", "C1CCCCC2CCCCC12", "COO"]]
params = ETKDGv3()
params.useRandomCoords = True
EmbedMolecules(mols, params, confsPerMolecule=10, maxIterations=-1)
for mol in mols:
    print(mol.GetNumConformers())

Inputs are list[Mol], sanitized and with hydrogens added (AddHs). Conformers are added in-place. params.useRandomCoords must be True - nvMolKit's ETKDG only supports random-coord initialization. A handful of niche EmbedParameters options are not supported (bounds matrices, custom CPCI, coord maps, separate-fragment embedding); the Features section of the docs site lists the full restrictions.

MMFF94 minimization of a batch of conformers

python

from rdkit.Chem import AddHs, MolFromSmiles
from rdkit.Chem.rdDistGeom import ETKDGv3
from nvmolkit.embedMolecules import EmbedMolecules
from nvmolkit.mmffOptimization import MMFFOptimizeMoleculesConfs
mols = [AddHs(MolFromSmiles(smi)) for smi in ["CCO", "CCN", "c1ccccc1"]]
params = ETKDGv3()
params.useRandomCoords = True
EmbedMolecules(mols, params, confsPerMolecule=5)
energies = MMFFOptimizeMoleculesConfs(mols, maxIters=500)
for mol, mol_energies in zip(mols, energies):
    print(mol.GetNumConformers(), mol_energies)

Inputs are list[Mol] with conformers already populated (typically by ETKDG, RDKit's EmbedMultipleConfs, or a prior nvMolKit call). Coordinates are updated in place; the return is list[list[float]] of optimized energies aligned with the input molecule order and conformer index. UFF is identical in shape: swap in from nvmolkit.uffOptimization import UFFOptimizeMoleculesConfs.

If any input molecule is None or lacks MMFF/UFF atom types, the call raises ValueError. The exception's args[1] is a dict with keys "none" and "no_params" listing the offending indices - useful for filtering a noisy input set.
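A possible way to use that dict for filtering is sketched below. The helper name drop_failed is hypothetical, and the failure is simulated with a hand-built ValueError so the snippet runs without nvMolKit; it assumes the listed indices are positions in the input list.

```python
def drop_failed(mols, exc):
    """Return (kept_mols, bad_indices) using the index dict carried on the exception."""
    has_details = len(exc.args) > 1 and isinstance(exc.args[1], dict)
    details = exc.args[1] if has_details else {}
    bad = set(details.get("none", [])) | set(details.get("no_params", []))
    kept = [m for i, m in enumerate(mols) if i not in bad]
    return kept, sorted(bad)


# Simulated failure: strings stand in for Mol objects, and the ValueError
# mimics the documented args[1] layout.
mols = ["mol0", None, "mol2", "mol3"]
try:
    raise ValueError("invalid molecules", {"none": [1], "no_params": [3]})
except ValueError as exc:
    kept, bad = drop_failed(mols, exc)

print(kept, bad)
```

In a real workflow the try block would wrap MMFFOptimizeMoleculesConfs(mols, ...), and the call would be retried on kept.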

Conformer RMSD and Butina clustering

python

import torch
from rdkit import Chem
from rdkit.Chem.rdDistGeom import EmbedMultipleConfs
from nvmolkit.clustering import butina
from nvmolkit.conformerRmsd import GetConformerRMSMatrixBatch
mols = [Chem.AddHs(Chem.MolFromSmiles(smi)) for smi in ["CCCCCC", "c1ccccc1"]]
for mol in mols:
    EmbedMultipleConfs(mol, numConfs=10)
# Remove hydrogens after embedding for heavy-atom RMSD.
heavy_mols = [Chem.RemoveHs(mol) for mol in mols]
# Default RMSD output is RDKit-compatible condensed lower-triangle form.
condensed = GetConformerRMSMatrixBatch(heavy_mols)
# Butina expects a square distance matrix, so request square GPU tensors.
square = GetConformerRMSMatrixBatch(heavy_mols, output_format="square")
clusters = [butina(distance_matrix, cutoff=0.5).torch() for distance_matrix in square]
torch.cuda.synchronize()
for mol_clusters in clusters:
    print(mol_clusters.cpu().tolist())

GetConformerRMSMatrix(mol) and GetConformerRMSMatrixBatch(mols) default to output_format="condensed", returning AsyncGpuResult objects that wrap RDKit-style flat vectors of length N * (N - 1) // 2. Use output_format="square" when chaining into butina() or any other API that expects an N x N distance matrix. Both forms live on the GPU; call .numpy() on condensed results or synchronize before moving square tensors to the CPU.
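The relationship between the two formats can be shown with a small CPU reference. This sketch assumes the condensed vector follows RDKit's row-by-row lower-triangle order, that is d(1,0), d(2,0), d(2,1), d(3,0), and so on; the helper names condensed_index and condensed_to_square are illustrative and not part of nvMolKit.

```python
def condensed_index(i, j):
    """Position of pair (i, j), i != j, in the lower-triangle vector."""
    if i == j:
        raise ValueError("diagonal entries are not stored")
    if i < j:
        i, j = j, i
    return i * (i - 1) // 2 + j


def condensed_to_square(condensed, n):
    """Expand a length n*(n-1)//2 vector into a symmetric n x n list of lists."""
    if len(condensed) != n * (n - 1) // 2:
        raise ValueError("length does not match n")
    square = [[0.0] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(i):
            square[i][j] = square[j][i] = condensed[condensed_index(i, j)]
    return square


rms = [0.5, 0.9, 0.4, 1.2, 0.7, 0.3]   # toy values for 4 conformers
square = condensed_to_square(rms, 4)
print(square[2][1], square[0][3])
```

For GPU work, requesting output_format="square" directly avoids this expansion; the sketch is mainly useful for checking results on the host.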

Custom forcefield options + constraints (`BatchedForcefield`)

Reach for MMFFBatchedForcefield / UFFBatchedForcefield instead of the one-shot MMFFOptimizeMoleculesConfs / UFFOptimizeMoleculesConfs when you need any of:

Custom maxIters / forceTol per call
Per-molecule nonBondedThreshold (MMFF) or vdwThreshold (UFF), or per-molecule ignoreInterfragInteractions
Per-molecule MMFFMolProperties objects (e.g. MMFF94s vs MMFF94)
Distance, position, angle, or torsion constraints
Standalone compute_energy() / compute_gradients() without minimization

python

from rdkit.Chem import AddHs, MolFromSmiles
from rdkit.Chem.rdDistGeom import EmbedMultipleConfs
from nvmolkit.batchedForcefield import MMFFBatchedForcefield
mols = [AddHs(MolFromSmiles(smi)) for smi in ["CCO", "CCCCCC"]]
for mol in mols:
    EmbedMultipleConfs(mol, numConfs=5)
ff = MMFFBatchedForcefield(
    mols,
    nonBondedThreshold=[100.0, 20.0],
    ignoreInterfragInteractions=True,
)
ff[0].add_position_constraint(0, max_displ=0.1, force_constant=50.0)
ff[1].add_distance_constraint(0, 4, relative=False, min_len=1.8, max_len=2.2, force_constant=25.0)
energies, converged = ff.minimize(maxIters=500, forceTol=1e-4)
for mol, mol_energies, mol_converged in zip(mols, energies, converged):
    print(mol.GetNumConformers(), mol_energies, mol_converged)

All conformers of each input molecule are minimized in one batch. Constraints attached via ff[i].add_*_constraint(...) apply to every conformer of molecule i; constraint setters mark the wrapper dirty and the native forcefield rebuilds on the next call. Pass output=CoordinateOutput.DEVICE to .minimize(...) to keep optimized coordinates on the GPU (Device3DResult) instead of writing them back into RDKit conformers. UFF is the same shape: UFFBatchedForcefield(mols, vdwThreshold=..., ...).

Going deeper

Full feature list, API reference, and guides: <https://nvidia-bionemo.github.io/nvMolKit/>
What changed in each release: <https://nvidia-bionemo.github.io/nvMolKit/changelog.html>
Worked examples (Jupyter notebooks): the examples/ directory in the GitHub repo
