Direct pixel-to-3D mapping
Multi-scale image features are back-projected into a 3D feature volume, making the input view part of the generation coordinate frame — not just a conditioning signal.
Pixal3d lifts pixel features directly into 3D space through back-projection conditioning — delivering reconstruction-level fidelity, detailed geometry, and PBR textures. Try the official Hugging Face demo below, then explore the method, workflow, and production pipeline.
This can happen when the Hugging Face Space is sleeping, queued, or temporarily unavailable. Use the official Space link above and keep this page as your workflow guide.
Recent 3D generative models have rapidly improved synthesis quality, yet fidelity — pixel-level faithfulness to the input image — remains a central bottleneck. Pixal3d tackles this head-on.
Most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. Pixal3d instead generates 3D in a pixel-aligned way, consistent with the input view. It introduces a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume — establishing direct pixel-to-3D correspondence without ambiguity. The result: high-quality 3D assets that approach the fidelity level of reconstruction, with natural extension to multi-view generation and object-separated scene synthesis.
Pixal3d's core innovation is explicit pixel-to-3D correspondence: every generated 3D point stays directly tied to the input image, unlike attention-only methods that treat images as loose guidance.
The paper demonstrates that Pixal3d approaches the fidelity of true 3D reconstruction, with detailed geometry and PBR textures that closely match the source image.
The main branch uses an improved TRELLIS.2 backbone for better performance. The paper branch preserves the original Direct3D-S2 implementation for reproducing SIGGRAPH results.
Pixal3d naturally extends to multi-view generation by aggregating back-projected feature volumes across views — enabling even higher fidelity when multiple angles are available.
Beyond single objects, the paper shows a modular pipeline that produces high-fidelity, object-separated 3D scenes from images using the same pixel-aligned paradigm.
Code, model weights, and an interactive Gradio demo are all publicly available. You can try it right now in the embedded demo above or via the official Hugging Face Space.
Understanding the architecture helps you pick better input images. A clear silhouette and visible material regions give the conditioner stronger evidence to work with.
A VAE compresses pixel-aligned sparse SDF into efficient sparse latents. This enables high-resolution shape handling within a compact, learnable representation — the foundation that makes pixel-level conditioning tractable at scale.
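To make the compression step concrete, here is a toy dense stand-in, assuming a simple convolutional VAE over a regular SDF grid. The real model operates on sparse structures, and none of the layer choices below come from the paper.

```python
import torch
import torch.nn as nn

class ToySDFVAE(nn.Module):
    """Dense stand-in for the sparse SDF VAE: compress an SDF grid
    into a low-resolution latent, then reconstruct it."""
    def __init__(self, latent_ch=8):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv3d(1, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(32, 2 * latent_ch, 4, stride=2, padding=1),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(latent_ch, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, sdf):  # sdf: (B, 1, D, H, W)
        mu, logvar = self.enc(sdf).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar
```

Generation then happens in that compact latent space, which is what keeps high-resolution shapes tractable.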
This is the key differentiator. Instead of loosely attending to image features, Pixal3d explicitly lifts multi-scale 2D image features into 3D feature volumes through calibrated back-projection. Every 3D point knows exactly which pixel it came from.
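A minimal PyTorch sketch of the operation, assuming a pinhole camera; the function name, shapes, and sampling choices are illustrative, not the repository's API.

```python
import torch
import torch.nn.functional as F

def back_project(feat_2d, points_cam, K):
    """Lift a 2D feature map onto 3D points via pinhole projection.

    feat_2d:    (C, H, W) image features from one encoder scale
    points_cam: (N, 3) voxel centers in the camera frame (z > 0)
    K:          (3, 3) camera intrinsics
    returns:    (N, C) per-point features, bilinearly sampled
    """
    C, H, W = feat_2d.shape
    # Perspective projection: u = fx * x / z + cx, v = fy * y / z + cy
    uv = (K @ points_cam.T).T            # (N, 3)
    uv = uv[:, :2] / uv[:, 2:3]          # (N, 2) pixel coordinates
    # Normalize to [-1, 1] for grid_sample
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    grid = grid.view(1, 1, -1, 2)        # (1, 1, N, 2)
    sampled = F.grid_sample(feat_2d[None], grid, align_corners=True)
    return sampled.view(C, -1).T         # (N, C)

# Repeating this per feature scale and concatenating the results builds
# the multi-scale 3D feature volume that conditions generation.
```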
A coarse stage predicts overall structure, then a detail stage predicts refined latents. The final result is decoded into a mesh with PBR texture maps (base color, normal, roughness, metallic) ready for rendering or import into game engines and DCC tools.
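The decode step can be pictured with off-the-shelf tools. The sketch below runs marching cubes on a dense SDF grid, with a sphere standing in for the model's predicted shape; the real decoder works from sparse latents and also attaches the PBR maps, which this toy skips.

```python
import numpy as np
import trimesh
from skimage import measure

# Build a dense SDF grid; a sphere stands in for the model's output.
res = 64
grid = np.linspace(-1, 1, res)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.6        # signed distance to a sphere

# Extract the zero level set as a mesh and export it.
verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
mesh.export("decoded.glb")                      # PBR maps would attach here
```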
Practical insight: Pixal3d performs best when the input image shows a single subject with clean edges. Hidden back sides, transparent materials, and heavily occluded geometry are still challenges — use multi-view input when fidelity on all sides matters.
AI generation is the starting point — smart cleanup and validation turn a raw output into a production-ready asset.
Choose a single subject, centered crop, clean silhouette, visible texture zones. Avoid watermarks, heavy occlusion, and extreme lighting that confuses PBR estimation.
Use the embedded Hugging Face demo above, the model card on Hugging Face, or a local clone of the GitHub repo. For local inference: `python inference.py --image your_image.png --output ./output.glb`
Rotate the model, compare the front view to the source image, then check back side completion, holes, floaters, UV seams, and overall scale. Don't trust the first render — rotate it.
GLB for WebGL preview, OBJ for Blender cleanup, FBX for Unity or Unreal, STL or 3MF only after watertight repair. Each format has a job — choose accordingly.
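A conversion pass might look like the sketch below, using trimesh as one possible tool (Blender or assimp do the same job), including the watertight gate before any print format.

```python
import trimesh

# Collapse the GLB scene into a single mesh for processing.
mesh = trimesh.load("output.glb", force="mesh")

mesh.export("asset.obj")  # for the Blender cleanup pass
# FBX export needs Blender or assimp; trimesh doesn't write it.

# Only write print formats once the mesh is actually watertight.
if mesh.is_watertight:
    mesh.export("asset.stl")
else:
    print("not watertight yet; repair before exporting STL/3MF")
```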
Keep the source image license, which branch/checkpoint you used, generation settings, output format, and any cleanup steps alongside the asset. Future you (or your team) will thank you.
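A minimal sidecar file next to the asset is enough; the field names below are a suggestion, not any standard.

```python
import json

# Provenance record stored alongside the generated asset.
record = {
    "source_image": "product_photo.png",
    "source_license": "owned / CC-BY / client-provided",
    "branch": "main",                  # or "paper" for SIGGRAPH reproduction
    "checkpoint": "model card revision used",
    "settings": {"seed": 42},          # whatever generation knobs you set
    "output": "output.glb",
    "cleanup": ["decimated to 50k tris", "filled base hole"],
}
with open("output.glb.json", "w") as f:
    json.dump(record, f, indent=2)
```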
Not every image is a good candidate for 3D generation. This checklist gives you a repeatable way to decide whether an input is worth spending GPU time on.
Aim for a score of 75+ before committing serious cleanup time.
A short brief keeps teams aligned: what the image shows, where the asset goes, what format matters, and what quality must survive export.
All resources are publicly available. Use these links as your primary reference chain and verify terms before commercial work.
A pretty first render is not enough. Evaluate the asset the way a technical artist would evaluate a handoff.
| Dimension | What to inspect | Pass condition |
|---|---|---|
| Silhouette fidelity | Front outline, proportions, and recognizable identity | Matches the source image at a glance from the input view |
| Geometry completeness | Back side, sides, holes, floaters, and normal direction | Rotates smoothly without visible collapse or missing surfaces |
| Material behavior | Base color, roughness, normals, and UV seams | Reads consistently under different lighting conditions |
| Topology usability | Poly count, mesh islands, UV layout, decimation tolerance | Can be repaired, retopologized, or decimated without chaos |
| Export reliability | GLB/OBJ/FBX import, texture paths, origin point, and scale | Opens cleanly in the target tool without manual fixes |
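Part of this table can be scripted. The sketch below (again assuming trimesh) catches the mechanical failures; silhouette fidelity and material behavior still need human eyes.

```python
import trimesh

mesh = trimesh.load("output.glb", force="mesh")

report = {
    "watertight": mesh.is_watertight,                   # holes, open boundaries
    "faces": len(mesh.faces),                           # topology budget
    "bodies": mesh.body_count,                          # floaters = extra bodies
    "extent": mesh.bounding_box.extents.tolist(),       # sanity-check scale
    "winding_consistent": mesh.is_winding_consistent,   # normal direction
}
for check, value in report.items():
    print(f"{check}: {value}")
```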
Use the official repository for exact requirements. This summary keeps the decision tree visible for quick reference.
Follow the TRELLIS.2 installation guide first — Pixal3d builds on top of it. The main branch uses the improved TRELLIS.2 backbone.
Install additional Python dependencies with `pip install -r requirements.txt`, then install utils3d from the project's release page.
Latest implementation with improved TRELLIS.2 backbone for better performance. Recommended for new projects and production use.
Original Direct3D-S2-based implementation. Use this to exactly reproduce the results reported in the SIGGRAPH 2026 paper.
After dependencies are installed, run `python inference.py --image assets/test_image/0.png --output ./output.glb`. A Gradio web demo is also included via `python app.py`.
The Hugging Face Space runs on H-series GPUs. For local use, check the TRELLIS.2 requirements; `requirements-hfdemo.txt` may not be compatible with all GPU architectures.
Key events from the Pixal3d project, based on the paper, official project page, GitHub README, and Hugging Face model card.
Good engineering communication is honest about failure modes. Here's what Pixal3d cannot guarantee.
A single image cannot fully determine the back side. The model makes educated guesses; use multiple views when fidelity on all sides matters for production.
Don't upload copyrighted characters, trademarked brand assets, or private client images unless you have explicit permission. Check the model license for commercial use terms.
Game-ready, print-ready, and commerce-ready assets each need different validation and post-processing. The raw GLB output is a starting point, not the final deliverable.
If the Hugging Face Space sleeps or queues, the site degrades gracefully to official links and workflow guidance rather than hiding the issue behind a broken iframe.
Short, direct answers for users who want the essentials without scrolling through the full paper.
Pixal3d uses pixel back-projection to explicitly map 2D image features into 3D space, creating direct pixel-to-3D correspondence. Most other methods use attention-based conditioning, where the link between image pixels and 3D geometry is much looser. This gives Pixal3d significantly higher fidelity to the input image.
Check the model license on Hugging Face for the latest terms. The code is open source, but generated assets may have specific usage conditions. Always verify before commercial deployment.
Use main for the latest improved version with the TRELLIS.2 backbone — recommended for most users. Use paper only if you need to exactly reproduce the SIGGRAPH 2026 paper results.
Yes. The paper states that Pixal3d naturally extends to multi-view generation by aggregating back-projected feature volumes across multiple camera views, enabling even higher fidelity.
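As a toy illustration only, reusing the back_project sketch from the architecture section (the paper's actual fusion rule is not shown here):

```python
import torch

def aggregate_views(feats_per_view, points_world, cams):
    """Fuse per-view back-projected features; mean pooling is a
    placeholder for whatever fusion the model actually learns."""
    per_view = []
    for feat_2d, (K, R, t) in zip(feats_per_view, cams):
        points_cam = points_world @ R.T + t          # world -> camera frame
        per_view.append(back_project(feat_2d, points_cam, K))
    return torch.stack(per_view).mean(dim=0)         # (N, C) fused features
```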
GLB for web and quick preview; OBJ for mesh editing and cleanup in Blender; FBX for game engines like Unity and Unreal; STL or 3MF only after watertight repair for 3D printing.
Hugging Face Spaces use shared GPUs with a queue system. Spaces can also enter sleep mode when idle. The page includes official links so you can open the Space directly, and the workflow guidance works regardless of demo availability.
Quick definitions to help non-research visitors navigate the page.
Use the official BibTeX citation when Pixal3d informs your research or technical writing.
@article{li2026pixal3d,
title = {Pixal3D: Pixel-Aligned 3D Generation from Images},
author = {Li, Dong-Yang and Zhao, Wang and Chen, Yuxin and Hu, Wenbo and Guo, Meng-Hao and Zhang, Fang-Lue and Shan, Ying and Hu, Shi-Min},
journal = {arXiv preprint arXiv:2605.10922},
year = {2026}
}