Opticks : GPU ray trace accelerated optical photon simulation

Open source, https://bitbucket.org/simoncblyth/opticks

Simon C Blyth, IHEP, CAS — Kaiping — 15 January 2025

Outline

Optical Photon Simulation : Context and Problem
- (JUNO) Optical Photon Simulation Problem...
- Simulation of 214 GeV mu-
- Optical Photon Simulation ≈ Ray Traced Image Rendering
- NVIDIA RTX Generations : RT performance : ~2x every ~2 years
- NVIDIA OptiX : Ray Tracing Engine
Opticks : Solution to Optical Photon Simulation Problem
- Geant4 + Opticks + NVIDIA OptiX : Hybrid Workflow
- Geometry Model Translation : Geant4 => CSGFoundry => NVIDIA OptiX
- Full JUNO, Opticks, OptiX 7.5/8.0
- Integrated Analytic + Triangulated Geometry (NEW)
- Interactive ray traced visualization via OpenGL/OptiX interop (NEW)
- GuideTube : Torus Triangulated
- Optimized curand random number generation with Philox4_32_10 (NEW)
- Out-of-core optical photon simulation : multi-launch (NEW)
- Simulating One Billion Photons in under 100 sec
- Pure Optical TorchGenstep scan : 1M to 100M photons
- Optical simulation 4x faster 1st->3rd gen RTX
JUNOSW+Opticks : Release when ?
Summary + Links
Acknowledgements

(JUNO) Optical Photon Simulation Problem...

ALL0_Debug_Philox_GUN4_mu214gev.png

jok-tds-gdb : tut_detsim.py simulation time
Single threaded Geant4 (*)	41 hrs
Opticks (TITAN RTX, 1st G)	[AWAIT VALIDATION]

Need validation to establish equivalent simulation

38M photons, 147k gensteps, Debug build

(*) DELL Precision 7920T Workstation, Intel Xeon Gold 5118, 2.3GHz, 48 cores, 62G

Optical Photon Simulation ≈ Ray Traced Image Rendering

simulation: photon parameters at sensors (PMTs)
rendering: pixel values at image plane

Much in common : geometry, light sources, optical physics

both limited by ray geometry intersection, aka ray tracing

Many Applications of ray tracing :

advertising, design, architecture, films, games,...
-> huge efforts to improve hw+sw over 30 yrs

NVIDIA RTX Generations

RT Core : ray trace dedicated GPU hardware
Each gen : large ray tracing improvements:
- Blackwell (2025) Expect: ~2x ray trace over Ada
- Ada (2022) ~2x ray trace over Ampere
- Ampere (2020) ~2x ray trace over Turing (2018)
Blackwell 4th Gen RTX : announced Jan 6th 2025

ray trace performance : ~2x every ~2 years

NVIDIA® OptiX™ Ray Tracing Engine -- Accessible GPU Ray Tracing

OptiX makes GPU ray tracing accessible

Programmable GPU-accelerated Ray-Tracing Pipeline
Single-ray shader programming model using CUDA
ray tracing acceleration using RT Cores (RTX GPUs)
"...free to use within any application..."

OptiX features

acceleration structure creation + traversal (eg BVH)
instanced sharing of geometry + acceleration structures
compiler optimized for GPU ray tracing

User provides (Green):

ray generation
geometry bounding boxes
intersect functions
instance transforms
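
As an illustration of what a user-provided bounding box and intersect function compute, here is a minimal ray-sphere sketch. It is written in Python for readability only; in Opticks the equivalent code runs as CUDA inside OptiX programs. The names `sphere_bbox` and `intersect_sphere` are hypothetical and are not part of the Opticks or OptiX APIs.

```python
import math

def sphere_bbox(center, radius):
    """Axis-aligned bounding box (lo, hi) enclosing the sphere."""
    lo = tuple(c - radius for c in center)
    hi = tuple(c + radius for c in center)
    return lo, hi

def intersect_sphere(origin, direction, center, radius, t_min=0.0):
    """Return the nearest distance t > t_min along the ray, or None on a miss."""
    # Vector from the sphere center to the ray origin.
    oc = [o - c for o, c in zip(origin, center)]
    # Solve a*t^2 + 2*b*t + c = 0 for |origin + t*direction - center| = radius.
    a = sum(d * d for d in direction)
    b = sum(o * d for o, d in zip(oc, direction))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - a * c
    if disc < 0.0:
        return None  # the ray misses the sphere
    root = math.sqrt(disc)
    for t in ((-b - root) / a, (-b + root) / a):  # try the near root first
        if t > t_min:
            return t
    return None

# A ray starting outside the sphere and one starting at its center.
t_outside = intersect_sphere((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 1.0)
t_inside = intersect_sphere((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 1.0)
```

The bounding box lets the acceleration structure reject most rays cheaply, so the intersect function only runs for rays that pass close to the solid.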

Latest Release : NVIDIA® OptiX™ 8.0.0 (Aug 2023) NEW:

Shader Execution Reordering (SER) (Ada: up to 2x)
SER: reduced execution+data divergence (on-the-fly)

Geant4 + Opticks + NVIDIA OptiX : Hybrid Workflow

https://bitbucket.org/simoncblyth/opticks

Opticks API : split according to dependency -- Optical photons are GPU "resident", only hits need to be copied to CPU memory

Geometry Model Translation : Geant4 => CSGFoundry => NVIDIA OptiX 7/8

Geant4 Geometry Model (JUNO: 400k PV, deep hierarchy)

PV	G4VPhysicalVolume	placed, refs LV
LV	G4LogicalVolume	unplaced, refs SO
SO	G4VSolid,G4BooleanSolid	binary tree of SO "nodes"

Opticks CSGFoundry Geometry Model (index references)

struct	Notes	Geant4 Equivalent
CSGFoundry	vectors of the below, easily serialized + uploaded + used on GPU	None
qat4	4x4 transform refs CSGSolid using "spare" 4th column (becomes IAS)	Transforms ref from PV
CSGSolid	refs sequence of CSGPrim	Grouped Vols + Remainder
CSGPrim	bbox, refs sequence of CSGNode, root of CSG Tree of nodes	root G4VSolid
CSGNode	CSG node parameters (JUNO: ~23k CSGNode)	node G4VSolid

NVIDIA OptiX 7/8 Geometry Acceleration Structures (JUNO: 1 IAS + 10 GAS, 2-level hierarchy)

IAS	Instance Acceleration Structures	JUNO: 1 IAS created from vector of ~50k qat4
GAS	Geometry Acceleration Structures	JUNO: 10 GAS created from 10 CSGSolid (which reference CSGPrim, CSGNode)

JUNO : Geant4 ~400k volumes "factorized" into 1 OptiX IAS referencing ~10 GAS
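
The index-reference idea behind this factorization can be sketched as flat vectors where each level refers to a contiguous range of the level below. This is a simplified Python illustration, assuming offset plus count references; the class and field names (`Foundry`, `node_offset`, `solid_index`, ...) are hypothetical stand-ins and do not reproduce the actual CSGFoundry struct layouts.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:            # stands in for CSGNode
    typecode: str      # e.g. "sphere", "cylinder", "union"

@dataclass
class Prim:            # stands in for CSGPrim: a range of nodes forming one CSG tree
    node_offset: int
    num_node: int

@dataclass
class Solid:           # stands in for CSGSolid: a range of prims
    prim_offset: int
    num_prim: int

@dataclass
class Instance:        # stands in for qat4: a transform plus a solid reference
    transform: List[List[float]]
    solid_index: int

@dataclass
class Foundry:         # flat vectors only, so easy to serialize and upload
    nodes: List[Node] = field(default_factory=list)
    prims: List[Prim] = field(default_factory=list)
    solids: List[Solid] = field(default_factory=list)
    instances: List[Instance] = field(default_factory=list)

    def add_solid(self, prim_trees):
        """Append one solid; prim_trees holds one list of nodes per prim."""
        prim_offset = len(self.prims)
        for tree in prim_trees:
            self.prims.append(Prim(len(self.nodes), len(tree)))
            self.nodes.extend(tree)
        self.solids.append(Solid(prim_offset, len(prim_trees)))
        return len(self.solids) - 1

    def place(self, solid_index, transform):
        """Record one placement of an existing solid."""
        self.instances.append(Instance(transform, solid_index))

    def nodes_of_solid(self, solid_index):
        """Follow the index references from a solid down to its nodes."""
        s = self.solids[solid_index]
        out = []
        for p in self.prims[s.prim_offset:s.prim_offset + s.num_prim]:
            out.extend(self.nodes[p.node_offset:p.node_offset + p.num_node])
        return out

IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

fd = Foundry()
pmt = fd.add_solid([[Node("union"), Node("sphere"), Node("cylinder")], [Node("sphere")]])
for _ in range(3):     # the same solid is shared by three placements
    fd.place(pmt, IDENTITY)
```

In this picture the solids play the role of the GAS and the instance vector plays the role of the IAS: repeated volumes cost one transform each rather than a copy of the geometry.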

Ada_cxr_overview_emm_t0_elv_t_moi__ALL.jpg

mostly analytic CSG
few complex solids (eg tori) : triangulated

raytrace 2M pixels
TITAN RTX (1st)	0.0118s (85 fps)
Ada 5000 RTX (3rd)	0.0031s (323 fps)

1st -> 3rd gen RTX : ~4x

Analytic + triangulated geometry

default : analytic CSG solids
user can name solids for triangulation
- avoids issues with tori + complex solids
- BUT : approximate geometry
- triangulation from G4Polyhedron
- config per-solid NumberOfRotationSteps by envvars
- uses OptiX "built-in" triangle intersection

NEW FEATURE: Integration of analytic + triangulated geometry

cxr_min__eye_1,0,0__zoom_1__tmin_0.5__sSurftube_0V1_0:0:-1.jpg

Interactive ray traced visualization via OpenGL/OptiX interop

initial viewpoint, geometry exclusions via envvars

WASDQE+mouse 3D navigation

GuideTube : Torus Triangulated

GuideTube (39*2*2 = 156 G4Torus): split in phi segments, radius breaks

Torus intersection is expensive on GPU

requires double precision to solve quartic
even in double precision the analytic solution is imprecise
numerical approach favored => triangulation

Triangulation using G4Polyhedron

G4Poly..::SetNumberOfRotationSteps

	NumberOfRotationSteps
HepPolyhedron Default	24
Top Right	48
Bottom Right	480

Adjustable: precision of intersect, number of triangles

GPUs evolved for triangles => fast even with many
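
The trade-off between rotation steps and intersect precision can be estimated with a short calculation. The sketch below assumes that each circle is approximated by a regular polygon with its vertices on the circle, one side per rotation step; whether G4Polyhedron places vertices exactly this way is an assumption, and the function name is hypothetical.

```python
import math

def max_radial_deviation(radius, steps):
    """Largest gap between a circle and its inscribed regular polygon with `steps` sides."""
    return radius * (1.0 - math.cos(math.pi / steps))

# Deviation as a fraction of the radius for the three settings in the table above.
deviation = {steps: max_radial_deviation(1.0, steps) for steps in (24, 48, 480)}
```

Under this assumption the deviation falls roughly as the inverse square of the step count: about 0.9% of the radius at 24 steps and about 0.002% at 480 steps, while the triangle count grows with the number of steps.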

Optimized curand random number generation with Philox4_32_10

	sizeof state	notes
XORWOW	48	curand default, expensive init
Philox4_32_10	64	cheap init, counter based
Philox4_32_10_OpticksLite	32	slimmed state

XORWOW

split init from usage by persisting state files

max photons limited by persisted states

Philox-4x32-10 (10 round calc)

init within simulate kernel

avoid states limiting photons

Philox Advantages

no curandState pre-initialization
avoid large files + slow init + global memory usage
avoids limiting photons by pre-prepared states

Philox Disadvantages

curand impl : larger state than XORWOW
- state could potentially be slimmed if this causes issues
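
The property that makes a counter-based generator cheap to initialize can be shown with a toy example: each random number is a pure function of (seed, photon index, counter), so there is no per-photon state to prepare or persist. This sketch is not Philox; a BLAKE2 hash from the Python standard library stands in for the Philox rounds, and the function names are hypothetical.

```python
import hashlib
import struct

def counter_uniform(seed, photon_index, counter):
    """Uniform float in [0, 1) determined only by (seed, photon_index, counter)."""
    key = struct.pack("<QQQ", seed, photon_index, counter)
    digest = hashlib.blake2b(key, digest_size=8).digest()
    (bits,) = struct.unpack("<Q", digest)
    return (bits >> 11) / float(1 << 53)   # keep 53 bits, the precision of a double

def photon_stream(seed, photon_index, n):
    """First n random numbers of one photon; nothing is stored between calls."""
    return [counter_uniform(seed, photon_index, i) for i in range(n)]

# The stream of a photon can be regenerated at any time, in any order.
first = photon_stream(0, 12345, 4)
again = photon_stream(0, 12345, 4)
other = photon_stream(0, 12346, 4)
```

Because the stream depends only on the photon index, any photon can be simulated in any launch without loading pre-prepared states.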

Out-of-core optical photon simulation : multi-launch

Out-of-core: simulate more photons than fit VRAM

Approach centered on QSim::simulate

configure max slots, default based on VRAM
collect scintillation + cerenkov gensteps from Geant4
form vector of genstep slices
- each slice photon count less than max slots
loop over slices:
- upload genstep array slice
- kernel launch simulate
- gather results into NPFold
concatenate results (NPFold::concat)

curand "slot" offset by ph_offset: => perfect match with any slicing
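
The slicing loop above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the QSim::simulate implementation: `slice_gensteps`, `launch` and `simulate` are hypothetical names, a slice is allowed to fill max slots exactly, and `launch` is a stand-in that returns one deterministic value per global photon index instead of running a kernel.

```python
def slice_gensteps(photon_counts, max_slots):
    """Group consecutive gensteps into slices whose photon totals fit in max_slots.

    Returns a list of (gs_start, gs_stop, ph_offset, ph_count) tuples.
    """
    slices = []
    start, count, offset = 0, 0, 0
    for i, n in enumerate(photon_counts):
        if n > max_slots:
            raise ValueError("a single genstep exceeds max_slots")
        if count + n > max_slots:           # close the current slice
            slices.append((start, i, offset, count))
            offset += count
            start, count = i, 0
        count += n
    if count > 0:                           # final partial slice
        slices.append((start, len(photon_counts), offset, count))
    return slices

def launch(ph_offset, ph_count):
    """Stand-in for one kernel launch: one result per photon, keyed by global index."""
    return [((ph_offset + slot) * 2654435761) % 2**32 for slot in range(ph_count)]

def simulate(photon_counts, max_slots):
    """Loop over slices, launch each one, and concatenate the gathered results."""
    results = []
    for gs_start, gs_stop, ph_offset, ph_count in slice_gensteps(photon_counts, max_slots):
        results.extend(launch(ph_offset, ph_count))
    return results

counts = [250] * 40                          # 40 gensteps, 10,000 photons in total
one_launch = simulate(counts, 10_000)
four_launches = simulate(counts, 2_500)
many_launches = simulate(counts, 700)
```

Since each photon result depends on its global index (local slot plus ph_offset) rather than on its position within a launch, the concatenated output is identical however the gensteps are sliced.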

Philox counter based RNG + Out-of-core => Opticks un-limited

no curandState limit
no VRAM limit

Simulating One Billion Photons in under 100 sec

cxs_min.sh: pure optical simulation of 40 torch gensteps from CD center, totalling 1 billion photons, on a Dell Precision Workstation with NVIDIA RTX 5000 Ada (3rd Gen). The sreport output shows microsecond timestamp deltas.

 [NP::MakeMetaKVS_ranges2_table num_specs 8
      SEvt__Init_RUN_META ==>    CSGFoundry__Load_HEAD          655                    ## init
    CSGFoundry__Load_HEAD ==>    CSGFoundry__Load_TAIL    4,235,189                    ## load_geom
    CSGOptiX__Create_HEAD ==>    CSGOptiX__Create_TAIL      266,810                    ## upload_geom
 A000_QSim__simulate_HEAD ==> A000_QSim__simulate_LBEG          251                    ## slice_genstep
 A000_QSim__simulate_PREL ==> A000_QSim__simulate_POST   23,137,923                    ## simulate slice
 A000_QSim__simulate_POST ==> A000_QSim__simulate_DOWN    3,975,867                    ## download slice
 A000_QSim__simulate_PREL ==> A000_QSim__simulate_POST   23,449,227 REP  46,587,150    ## simulate slice
 A000_QSim__simulate_POST ==> A000_QSim__simulate_DOWN    3,924,104 REP   7,899,971    ## download slice
 A000_QSim__simulate_PREL ==> A000_QSim__simulate_POST   23,736,442 REP  70,323,592    ## simulate slice
 A000_QSim__simulate_POST ==> A000_QSim__simulate_DOWN    4,108,315 REP  12,008,286    ## download slice
 A000_QSim__simulate_PREL ==> A000_QSim__simulate_POST   23,850,920 REP  94,174,512    ## simulate slice
 A000_QSim__simulate_POST ==> A000_QSim__simulate_DOWN    4,119,275 REP  16,127,561    ## download slice
 A000_QSim__simulate_LEND ==> A000_QSim__simulate_PCAT   15,900,158                    ## concat slices
 A000_QSim__simulate_BRES ==> A000_QSim__simulate_TAIL  117,551,399                    ## save arrays
                                                TOTAL:  248,256,535
 ]NP::MakeMetaKVS_ranges2_table num_keys:69

Out-of-core optical simulation
four kernel executions, total time	94 s
four hit slice downloads, total time	16 s
saving 216M hits (13GB .npy file)	117 s
loading geometry from /cvmfs	4 s
total time	248 s
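
The summary figures follow directly from the microsecond deltas in the sreport table. The short check below copies those deltas by hand and sums them per stage; the variable names are illustrative and the script is not part of Opticks.

```python
# (stage label, microseconds) pairs copied from the sreport table above.
stamps = [
    ("init", 655),
    ("load_geom", 4_235_189),
    ("upload_geom", 266_810),
    ("slice_genstep", 251),
    ("simulate slice", 23_137_923),
    ("download slice", 3_975_867),
    ("simulate slice", 23_449_227),
    ("download slice", 3_924_104),
    ("simulate slice", 23_736_442),
    ("download slice", 4_108_315),
    ("simulate slice", 23_850_920),
    ("download slice", 4_119_275),
    ("concat slices", 15_900_158),
    ("save arrays", 117_551_399),
]

# Accumulate repeated stages, as the REP column of the table does.
totals = {}
for label, microseconds in stamps:
    totals[label] = totals.get(label, 0) + microseconds

grand_total = sum(totals.values())
seconds = {label: us / 1e6 for label, us in totals.items()}
photons_per_second = 1e9 / seconds["simulate slice"]   # kernel time only
```

On these numbers the four kernel launches account for roughly 38% of the wall time and saving the hit arrays for roughly 47%, so output handling rather than simulation dominates this run.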

Pure Optical TorchGenstep scan : 1M to 100M photons

TEST=medium_scan ~/opticks/cxs_min.sh

Generate optical only events with 1M->100M photons starting from CD center, gather and save only Hits.

OPTICKS_RUNNING_MODE=SRM_TORCH  ## "Torch" running enables num_photon scan
OPTICKS_NUM_PHOTON=M1,10,20,30,40,50,60,70,80,90,100
OPTICKS_NUM_EVENT=11
OPTICKS_EVENT_MODE=Hit

uses CSGOptiXSMTest executable (no Geant4 dependency, avoids ~150s of initialization time)
load and upload geometry in ~2s

Compare simulation scans on two Dell Precision Workstations:

GPU (VRAM)	Arch	GPU Release	CUDA(RT) Cores	RTX Gen	Driver	CUDA	OptiX
NVIDIA TITAN RTX(24G)	Turing	Dec 2018	4,608(72)	1st	515.43	11.7	7.5
NVIDIA RTX 5000(32G)	Ada	Aug 2023	12,800(100)	3rd	550.76	12.4	8.0

max launch size : 24/32/48G VRAM ~200/266/400M photons
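
The three VRAM and photon pairs quoted above are consistent with a simple linear scaling. The sketch below assumes that linear relationship, anchored on the 24G ~ 200M pairing; the function name is hypothetical and the actual Opticks default for max slots may be derived differently.

```python
def max_photons_for_vram(vram_gb):
    """Approximate max launch size, scaling linearly from 24G ~ 200M photons."""
    return 200_000_000 * vram_gb // 24   # integer arithmetic avoids rounding surprises

estimates = {gb: max_photons_for_vram(gb) for gb in (24, 32, 48)}
```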

ALL1_scatter_10M_photon_22pc_hit_alt.png

4.5M hits from a 20M photon TorchGenstep: 4.4 s on NVIDIA TITAN RTX (1st gen RTX), 1.1 s on NVIDIA RTX 5000 Ada (3rd gen RTX)

AB_Substamp_ALL_Etime_vs_Photon_rtx_gen1_gen3.png

Event Time(s) vs PH(M)
PH(M)	G1	G3	G1/G3
1	0.47	0.14	3.28
10	0.44	0.13	3.48
20	4.39	1.10	3.99
30	8.87	2.26	3.93
40	13.29	3.38	3.93
50	18.13	4.49	4.03
60	22.64	5.70	3.97
70	27.31	6.78	4.03
80	32.24	7.99	4.03
90	37.92	9.33	4.06
100	41.93	10.42	4.03

Optical simulation 4x faster 1st->3rd gen RTX, (3rd gen, Ada : 100M photons simulated in 10 seconds) [TMM PMT model]
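
The headline ratio can be rechecked from the event times in the table. The values below are copied from the table; because they are rounded to two decimals, the recomputed ratios can differ from the G1/G3 column in the last digit.

```python
# (photons in millions, G1 seconds, G3 seconds) copied from the scan table above.
scan = [
    (1, 0.47, 0.14), (10, 0.44, 0.13), (20, 4.39, 1.10), (30, 8.87, 2.26),
    (40, 13.29, 3.38), (50, 18.13, 4.49), (60, 22.64, 5.70), (70, 27.31, 6.78),
    (80, 32.24, 7.99), (90, 37.92, 9.33), (100, 41.93, 10.42),
]

# Speedup of 3rd gen (G3) over 1st gen (G1) for each photon count.
ratios = {ph: g1 / g3 for ph, g1, g3 in scan}

# Average over the larger events, where launch overheads matter least.
large = [ratios[ph] for ph in ratios if ph >= 20]
mean_ratio = sum(large) / len(large)

# Throughput of the 3rd gen GPU for the 100M photon event.
g3_photons_per_second = 100e6 / 10.42
```

For events of 20M photons and above the recomputed speedup averages very close to 4, and the 100M photon event corresponds to a little under 10M photons per second on the Ada GPU.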

JUNOSW+Opticks

Opticks v0.3.1 (11 Jan 2025) ready to be used in candidate release

BUT requires:

JUNOSW branch "blyth-hierarchical-sticks-fastener-geometry-with-thin-water" (26 Nov 2024) to be merged
- fixes incorrect hierarchy in fastener geometry (found by Geant4:Opticks chi2)
- adds thin 0.1mm layer of water around the steel screws of fastener (Ziyan requested)
envvars making small geometry changes, needed for Geant4:Opticks agreement
- adopting the small changes : avoid different GPU/CPU geometry + need for envvars

How to handle continuous geometry change ? (longstanding problem to keep up with changing geometry)

improving automation of validations is helpful
BUT: no way to automate debugging of the issues that Geant4:Opticks chi2 validation often reveals

Summary and Links

Opticks : state-of-the-art GPU ray traced optical simulation integrated with Geant4, with automated geometry translation into GPU optimized form.

NVIDIA Ray Trace Performance continues rapid progress (2x each gen., every ~2 yrs)
out-of-core + adoption of Philox RNG removes limits, makes Opticks easier to use

https://bitbucket.org/simoncblyth/opticks	day-to-day code repository
https://simoncblyth.bitbucket.io	presentations and videos
https://groups.io/g/opticks	forum/mailing list archive
email: `opticks+subscribe@groups.io`	subscribe to mailing list
`simon.c.blyth@gmail.com`	any questions

Acknowledgements

Opticks users
- ~38 members of forum : https://groups.io/g/opticks
- many thanks to active bug reporting users
  - (especially from JUNO, LZ, LHAASO, LHCb-RICH, DUNE, NEXT-CRAB0)
JUNO Collaboration
- Tao Lin, Yuxiang Hu, ... (+ many more : changing geometry and physics models)
- forced Opticks to continuously improve
Geant4 collaboration
- especially Hans Wenzel, Fermilab Geant4 group, early adopter of Opticks
- guest invites to Okinawa, Wollongong meetings
Dark Matter Search Community (XENON,LZ,DARWIN,..) : DANCE invite 2019
Many NVIDIA Engineers:
- NVIDIA GPU Technology Conferences (San Jose, Suzhou)
- seven dedicated meetings in 2021 : migrating to OptiX 7 API
- UK GPU Hackathon 2022