Skill v1.0.2
currentAutomated scan100/100~1 modified
version: "1.0.2" name: nvmolkit-usage description: >- Write code that calls the installed nvMolKit Python API for GPU-accelerated, batched RDKit-style operations - Morgan fingerprints, Tanimoto/cosine similarity, ETKDG conformer embedding, MMFF/UFF optimization, TFD, conformer RMSD, Butina clustering, and substructure search. Use when the user is importing nvmolkit.*, debugging an nvmolkit call, choosing between nvMolKit and RDKit for a batched cheminformatics workflow, or wiring nvMolKit results into a torch/numpy pipeline. Out of scope: building nvMolKit from source. license: Apache-2.0 metadata: owner: Kevin Boyd (@scal444) risk_tier: skill
nvMolKit usage
What nvMolKit is
GPU-accelerated, batched implementations of common RDKit operations. APIs mirror RDKit where possible but are batch-oriented: they take lists of rdkit.Chem.Mol (or lists of fingerprints) and process them in parallel on one or more GPUs. nvMolKit links against RDKit at build time; inputs and outputs are real RDKit Mol objects.
Where nvMolKit does well
Reach for nvMolKit when:
- The workload is a large batch of molecules processed together (typically thousands or more).
- The metric is throughput / total wall time across the batch, not per-molecule latency.
- The same operation is repeated identically across the batch (fingerprinting a library, embedding/minimizing many conformers, bulk pairwise similarity), so the GPU stays saturated.
Plain RDKit is usually the better choice for single-molecule one-offs or workflows that can't be expressed as a batch. nvMolKit is not meant to replace RDKit for those cases.
Runtime requirements
- An NVIDIA GPU with compute capability 7.0 (V100) or higher
- A CUDA driver compatible with CUDA 12.6+.
- A working
torchinstall with CUDA support (nvMolKit returns GPU tensors viatorch's CUDA array interface).
If CUDA is unavailable, nvMolKit calls raise. There is no CPU fallback - if the user needs one, use RDKit directly for that path.
When helping with installation, make the user choose a PyTorch CUDA backend that the host driver supports before installing nvMolKit. nvMolKit's PyPI wheels are built with CUDA Toolkit 12.9 and depend on CUDA 12 runtime packages, but pip/uv can still select a CUDA 13 PyTorch wheel unless the install command says otherwise.
- Conda: prefer conda-forge
pytorch-gpu; pincuda-version=12.6or another CUDA version supported by the driver. - pip: send the user to the PyTorch install selector (
https://pytorch.org/get-started/locally/) or previous-versions page (https://pytorch.org/get-started/previous-versions/) to installtorchfor a CUDA 12.x backend before installing nvMolKit. - uv: install nvMolKit with an explicit backend, e.g.
uv pip install --torch-backend=cu128 nvmolkit.
Verify the install before writing real code
Run this once to confirm nvMolKit is importable and a GPU op works end to end:
import nvmolkitimport torchfrom rdkit import Chemfrom nvmolkit.fingerprints import MorganFingerprintGeneratorprint("nvmolkit:", nvmolkit.__version__)print("cuda available:", torch.cuda.is_available())print("device count:", torch.cuda.device_count())mols = [Chem.MolFromSmiles(smi) for smi in ["CCO", "c1ccccc1", "CC(=O)O"]]fpgen = MorganFingerprintGenerator(radius=2, fpSize=1024)result = fpgen.GetFingerprints(mols)torch.cuda.synchronize()fps = result.torch()print("fps shape:", tuple(fps.shape), "dtype:", fps.dtype)# Expected: shape (3, 32), dtype torch.int32 (1024 bits packed into 32 int32s per row)
If this fails, point the user at the install guide on the docs site rather than guessing - see "Going deeper" below.
Entry points
| Task | Module | Primary entry point | |
|---|---|---|---|
| Morgan fingerprints | nvmolkit.fingerprints | MorganFingerprintGenerator(radius, fpSize).GetFingerprints(mols) | |
| Bulk Tanimoto / cosine similarity | nvmolkit.similarity | crossTanimotoSimilarity(...), crossCosineSimilarity(...), plus *MemoryConstrained variants for results too large to fit in GPU memory | |
| ETKDG conformer embedding | nvmolkit.embedMolecules | EmbedMolecules(molecules, params, confsPerMolecule, ...) | |
| MMFF94 optimization (one-shot) | nvmolkit.mmffOptimization | MMFFOptimizeMoleculesConfs(molecules, ...) | |
| UFF optimization (one-shot) | nvmolkit.uffOptimization | UFFOptimizeMoleculesConfs(molecules, ...) | |
| Forcefield with custom options + constraints | nvmolkit.batchedForcefield | MMFFBatchedForcefield(mols, properties=..., nonBondedThreshold=..., ignoreInterfragInteractions=..., hardwareOptions=...), UFFBatchedForcefield(mols, vdwThreshold=..., ...). Per-molecule view ff[i] exposes add_distance_constraint, add_position_constraint, add_angle_constraint, add_torsion_constraint. Methods: .compute_energy(), .compute_gradients(), .minimize(maxIters, forceTol) | |
| Pairwise conformer RMSD | nvmolkit.conformerRmsd | GetConformerRMSMatrix(mol), GetConformerRMSMatrixBatch(mols) | |
| Torsion Fingerprint Deviation (TFD) | nvmolkit.tfd | GetTFDMatrix(mol), GetTFDMatrices(mols) | |
| Butina clustering | nvmolkit.clustering | butina(distance_matrix, cutoff) (precomputed matrix), fused_butina(fingerprints, cutoff) (memory-efficient, on-the-fly) | |
| Substructure search | nvmolkit.substructure | hasSubstructMatch, countSubstructMatches, getSubstructMatches | |
| Hardware tuning (batch size, GPU IDs) | nvmolkit.types | HardwareOptions(...) passed to ETKDG / MMFF / UFF | |
Optional autotuning of HardwareOptions | nvmolkit.autotune | tune_embed_molecules, tune_mmff_optimize, tune_uff_optimize, tune_batched_forcefield. Requires the optuna package |
Result types and execution model
Two return shapes carry GPU-resident output, depending on what the operation produces.
AsyncGpuResult
Used by operations that return a single flat tensor (fingerprints, similarity matrices, RMSD/TFD vectors, Butina inputs). Key behaviors:
- Asynchronous. The kernel may not have completed when the call returns.
result.torch()returns a zero-copytorch.Tensoron the GPU. Caller is responsible for synchronizing before reading values on the host.result.numpy()synchronizes and returns a CPU numpy array.- Exposes
__cuda_array_interface__, so it can be passed directly into other nvMolKit functions (e.g. fingerprints → similarity) with no host round-trip.
CUDA stream control
A subset of the AsyncGpuResult-returning APIs accept an optional stream: torch.cuda.Stream | None = None argument so callers can submit nvMolKit work to a non-default stream and overlap it with their own kernels. When omitted, the call uses the current torch stream.
APIs that take a stream argument:
MorganFingerprintGenerator.GetFingerprintscrossTanimotoSimilarity,crossCosineSimilarity, and their*MemoryConstrainedvariantsbutina,fused_butinaGetConformerRMSMatrix,GetConformerRMSMatrixBatch
Other APIs (ETKDG, MMFF/UFF optimization, TFD, substructure search) are synchronous to the caller — no stream plumbing needed.
Typical pattern:
import torchfrom rdkit import Chemfrom nvmolkit.fingerprints import MorganFingerprintGeneratorfrom nvmolkit.similarity import crossTanimotoSimilaritystream = torch.cuda.Stream()fpgen = MorganFingerprintGenerator(radius=2, fpSize=1024)mols = [Chem.MolFromSmiles(smi) for smi in ["CCO", "c1ccccc1", "CC(=O)O"]]with torch.cuda.stream(stream):fps = fpgen.GetFingerprints(mols, stream=stream)sim = crossTanimotoSimilarity(fps, stream=stream)stream.synchronize()print(sim.torch())
Device3DResult
Used by ETKDG embedding and MMFF/UFF optimization (one-shot and BatchedForcefield) when called with output=CoordinateOutput.DEVICE. The GPU-resident equivalent of writing conformers back to Mol objects. Fields:
values:AsyncGpuResultof shape(total_atoms, 3)float64. Concatenated conformer coordinates in CSR-style layout.atom_starts,mol_indices,conf_indices:AsyncGpuResultint32 buffers describing the layout (values[atom_starts[i]:atom_starts[i+1]]is conformeri's atoms).energies,converged:AsyncGpuResultbuffers populated only for MMFF/UFF minimization (not for plain ETKDG).gpu_id: device the buffers live on. ThetargetGpuargument on each API picks this;targetGpu=-1uses the default consolidation device..per_molecule()returns nestedlist[list[torch.Tensor]]of per-conformer views;.dense(pad_value=nan)materializes a padded(n_mols, max_confs, max_atoms, 3)tensor.
The default mode (CoordinateOutput.RDKIT_CONFORMERS) still writes optimized coordinates back into each Mol and returns Python lists of energies/convergence flags. Reach for CoordinateOutput.DEVICE when chaining downstream GPU work (e.g. ETKDG → MMFF → similarity scoring) without host round-trips.
Configuration
Two configuration objects expose the GPU/CPU knobs.
HardwareOptions (ETKDG, MMFF, UFF)
from nvmolkit.types import HardwareOptions. Passed via hardwareOptions= to EmbedMolecules, MMFFOptimizeMoleculesConfs, UFFOptimizeMoleculesConfs, and the BatchedForcefield constructors. Every field has an "auto" sentinel; the defaults are usually fine.
| Field | Type | Default | Meaning | |
|---|---|---|---|---|
preprocessingThreads | int | -1 (all visible CPUs) | CPU threads for preprocessing | |
batchSize | int | -1 (auto-tuned) | Number of conformers per GPU batch | |
batchesPerGpu | int | -1 (auto) | Concurrent batches per GPU; must be >0 or -1 | |
gpuIds | list[int] | [] (all visible GPUs) | Specific device ordinals to target |
Passing a gpuIds entry for a device that isn't visible raises RuntimeError: invalid device ordinal. For finding good values automatically across a representative sample, see nvmolkit.autotune (requires the optuna extra); each tune_* function returns a TuneResult whose best_config is a fully-populated HardwareOptions ready to pass back into the real call.
HardwareOptions round-trips through to_dict() / from_dict() for persisting tuned configs to disk.
SubstructSearchConfig (substructure search)
from nvmolkit.substructure import SubstructSearchConfig. Passed via config= to hasSubstructMatch, countSubstructMatches, and getSubstructMatches.
| Field | Type | Default | Meaning | ||
|---|---|---|---|---|---|
batchSize | int | 1024 | (target, query) pairs per GPU batch | ||
workerThreads | int | -1 (auto) | GPU runner threads per GPU | ||
preprocessingThreads | int | -1 (auto) | CPU threads for preprocessing | ||
maxMatches | int | 0 (unlimited) | Max matches returned per (target, query) pair | ||
uniquify | bool | False | Drop duplicate matches that differ only in atom enumeration order | ||
gpuIds | `list[int] \ | None` | None (current device only) | Specific device ordinals to target |
Substructure search currently does not support chirality-aware matching, enhanced stereochemistry, or other advanced RDKit SubstructMatchParameters options.
Recipes
Morgan fingerprints + bulk Tanimoto similarity
import torchfrom rdkit import Chemfrom nvmolkit.fingerprints import MorganFingerprintGeneratorfrom nvmolkit.similarity import crossTanimotoSimilaritysmiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCOCC"]mols = [Chem.MolFromSmiles(smi) for smi in smiles]fpgen = MorganFingerprintGenerator(radius=2, fpSize=1024)fps = fpgen.GetFingerprints(mols)sim = crossTanimotoSimilarity(fps)torch.cuda.synchronize()print(sim.torch())
Inputs are list[Mol]. Output of GetFingerprints is an AsyncGpuResult wrapping an (n_mols, fpSize / 32) int32 tensor of packed bits. Pass it straight into crossTanimotoSimilarity for an (n, n) similarity matrix; pass two fingerprint sets for an (n, m) cross-matrix. For sets too large to materialize on the GPU, use crossTanimotoSimilarityMemoryConstrained (chunked compute, returns numpy on CPU).
ETKDG conformer embedding
from rdkit.Chem import AddHs, MolFromSmilesfrom rdkit.Chem.rdDistGeom import ETKDGv3from nvmolkit.embedMolecules import EmbedMoleculesmols = [AddHs(MolFromSmiles(smi)) for smi in ["C1CCCCC1", "C1CCCCC2CCCCC12", "COO"]]params = ETKDGv3()params.useRandomCoords = TrueEmbedMolecules(mols, params, confsPerMolecule=10, maxIterations=-1)for mol in mols:print(mol.GetNumConformers())
Inputs are list[Mol], sanitized and with hydrogens added (AddHs). Conformers are added in-place. params.useRandomCoords must be True - nvMolKit's ETKDG only supports random-coord initialization. A handful of niche EmbedParameters options are not supported (bounds matrices, custom CPCI, coord maps, separate-fragment embedding); the Features section of the docs site lists the full restrictions.
MMFF94 minimization of a batch of conformers
from rdkit.Chem import AddHs, MolFromSmilesfrom rdkit.Chem.rdDistGeom import ETKDGv3from nvmolkit.embedMolecules import EmbedMoleculesfrom nvmolkit.mmffOptimization import MMFFOptimizeMoleculesConfsmols = [AddHs(MolFromSmiles(smi)) for smi in ["CCO", "CCN", "c1ccccc1"]]params = ETKDGv3(); params.useRandomCoords = TrueEmbedMolecules(mols, params, confsPerMolecule=5)energies = MMFFOptimizeMoleculesConfs(mols, maxIters=500)for mol, mol_energies in zip(mols, energies):print(mol.GetNumConformers(), mol_energies)
Inputs are list[Mol] with conformers already populated (typically by ETKDG, RDKit's EmbedMultipleConfs, or a prior nvMolKit call). Coordinates are updated in place; the return is list[list[float]] of optimized energies aligned with the input molecule order and conformer index. UFF is identical in shape: swap in from nvmolkit.uffOptimization import UFFOptimizeMoleculesConfs.
If any input molecule is None or lacks MMFF/UFF atom types, the call raises ValueError. The exception's args[1] is a dict with keys "none" and "no_params" listing the offending indices - useful for filtering a noisy input set.
Conformer RMSD and Butina clustering
import torchfrom rdkit import Chemfrom rdkit.Chem.rdDistGeom import EmbedMultipleConfsfrom nvmolkit.clustering import butinafrom nvmolkit.conformerRmsd import GetConformerRMSMatrixBatchmols = [Chem.AddHs(Chem.MolFromSmiles(smi)) for smi in ["CCCCCC", "c1ccccc1"]]for mol in mols:EmbedMultipleConfs(mol, numConfs=10)# Remove hydrogens after embedding for heavy-atom RMSD.heavy_mols = [Chem.RemoveHs(mol) for mol in mols]# Default RMSD output is RDKit-compatible condensed lower-triangle form.condensed = GetConformerRMSMatrixBatch(heavy_mols)# Butina expects a square distance matrix, so request square GPU tensors.square = GetConformerRMSMatrixBatch(heavy_mols, output_format="square")clusters = [butina(distance_matrix, cutoff=0.5).torch() for distance_matrix in square]torch.cuda.synchronize()for mol_clusters in clusters:print(mol_clusters.cpu().tolist())
GetConformerRMSMatrix(mol) and GetConformerRMSMatrixBatch(mols) default to output_format="condensed", returning AsyncGpuResult objects that wrap RDKit-style flat vectors of length N * (N - 1) // 2. Use output_format="square" when chaining into butina() or any other API that expects an N x N distance matrix. Both forms live on the GPU; call .numpy() on condensed results or synchronize before moving square tensors to the CPU.
Custom forcefield options + constraints (BatchedForcefield)
Reach for MMFFBatchedForcefield / UFFBatchedForcefield instead of the one-shot MMFFOptimizeMoleculesConfs / UFFOptimizeMoleculesConfs when you need any of:
- Custom
maxIters/forceTolper call - Per-molecule
nonBondedThreshold(MMFF) orvdwThreshold(UFF), or per-moleculeignoreInterfragInteractions - Per-molecule
MMFFMolPropertiesobjects (e.g. MMFF94s vs MMFF94) - Distance, position, angle, or torsion constraints
- Standalone
compute_energy()/compute_gradients()without minimization
from rdkit.Chem import AddHs, MolFromSmilesfrom rdkit.Chem.rdDistGeom import EmbedMultipleConfsfrom nvmolkit.batchedForcefield import MMFFBatchedForcefieldmols = [AddHs(MolFromSmiles(smi)) for smi in ["CCO", "CCCCCC"]]for mol in mols:EmbedMultipleConfs(mol, numConfs=5)ff = MMFFBatchedForcefield(mols,nonBondedThreshold=[100.0, 20.0],ignoreInterfragInteractions=True,)ff[0].add_position_constraint(0, max_displ=0.1, force_constant=50.0)ff[1].add_distance_constraint(0, 4, relative=False, min_len=1.8, max_len=2.2, force_constant=25.0)energies, converged = ff.minimize(maxIters=500, forceTol=1e-4)for mol, mol_energies, mol_converged in zip(mols, energies, converged):print(mol.GetNumConformers(), mol_energies, mol_converged)
All conformers of each input molecule are minimized in one batch. Constraints attached via ff[i].add_*_constraint(...) apply to every conformer of molecule i; constraint setters mark the wrapper dirty and the native forcefield rebuilds on the next call. Pass output=CoordinateOutput.DEVICE to .minimize(...) to keep optimized coordinates on the GPU (Device3DResult) instead of writing them back into RDKit conformers. UFF is the same shape: UFFBatchedForcefield(mols, vdwThreshold=..., ...).
Going deeper
- Full feature list, API reference, and guides: <https://nvidia-bionemo.github.io/nvMolKit/>
- What changed in each release: <https://nvidia-bionemo.github.io/nvMolKit/changelog.html>
- Worked examples (Jupyter notebooks): the
examples/directory in the GitHub repo