SIGGRAPH 2026 · Tencent ARC Lab · Tsinghua University

Pixel-Aligned 3D Generation from a Single Image

Pixal3d lifts pixel features directly into 3D space through back-projection conditioning — delivering reconstruction-level fidelity, detailed geometry, and PBR textures. Try the official Hugging Face demo below, then explore the method, workflow, and production pipeline.

  • Pixel-aligned conditioning
  • Back-projection volumes
  • Single or multi-view
  • GLB + PBR output
  • SIGGRAPH 2026
  • TRELLIS.2 backbone
The embedded demo is taking too long.

This can happen when the Hugging Face Space is sleeping, queued, or temporarily unavailable. Use the official Space link above and keep this page as your workflow guide.

Paper Abstract

What Pixal3d solves

Recent 3D generative models have rapidly improved synthesis quality, yet fidelity — pixel-level faithfulness to the input image — remains a central bottleneck. Pixal3d tackles this head-on.

From the arXiv paper (2605.10922) — SIGGRAPH 2026

Most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. Pixal3d instead generates 3D in a pixel-aligned way, consistent with the input view. It introduces a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume — establishing direct pixel-to-3D correspondence without ambiguity. The result: high-quality 3D assets that approach the fidelity level of reconstruction, with natural extension to multi-view generation and object-separated scene synthesis.

Why It Matters

Pixel-aligned, not just image-conditioned

Pixal3d's core innovation is explicit pixel-to-3D correspondence: every generated 3D point stays directly tied to the input image, unlike attention-only methods that treat images as loose guidance.

Direct pixel-to-3D mapping

Multi-scale image features are back-projected into a 3D feature volume, making the input view part of the generation coordinate frame — not just a conditioning signal.

Reconstruction-level fidelity

The paper demonstrates that Pixal3d approaches the fidelity of true 3D reconstruction, with detailed geometry and PBR textures that closely match the source image.

Two branches available

The main branch uses an improved TRELLIS.2 backbone for better performance. The paper branch preserves the original Direct3D-S2 implementation for reproducing SIGGRAPH results.

Multi-view ready

Pixal3d naturally extends to multi-view generation by aggregating back-projected feature volumes across views — enabling even higher fidelity when multiple angles are available.

Scene synthesis extension

Beyond single objects, the paper shows a modular pipeline that produces high-fidelity, object-separated 3D scenes from images using the same pixel-aligned paradigm.

Open source & free to try

Code, model weights, and an interactive Gradio demo are all publicly available. You can try it right now in the embedded demo above or via the official Hugging Face Space.

Core Architecture

The three-part Pixal3d pipeline

Understanding the architecture helps you pick better input images. A clear silhouette and visible material regions give the conditioner stronger evidence to work with.

Pixel-Aligned Structured Latent Representation Learning

A VAE compresses pixel-aligned sparse SDF into efficient sparse latents. This enables high-resolution shape handling within a compact, learnable representation — the foundation that makes pixel-level conditioning tractable at scale.
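The "sparse" part of that representation can be illustrated in a few lines: keep only the voxels within a narrow band of the surface, which is where all the geometric detail lives. This is a generic toy sketch (the grid size, truncation band, and sphere SDF are illustrative assumptions, not Pixal3d's actual values):

```python
import numpy as np

# Toy dense SDF of a sphere of radius 0.5 on a 32^3 grid.
n = 32
axis = np.linspace(-1, 1, n)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5

# Sparsify: keep only cells within a narrow band around the surface.
band = 2.0 / n                            # roughly one voxel width (assumed)
mask = np.abs(sdf) < band
coords = np.argwhere(mask)                # (N, 3) integer voxel coordinates
values = sdf[mask]                        # matching near-surface SDF values

dense_cells = sdf.size
sparse_cells = coords.shape[0]
print(f"kept {sparse_cells}/{dense_cells} cells")  # a small fraction of the grid
```

Only the near-surface cells need to be encoded, which is what makes a compact latent representation of high-resolution shape tractable.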

Image Back-Projection-Based Conditioner

This is the key differentiator. Instead of loosely attending to image features, Pixal3d explicitly lifts multi-scale 2D image features into 3D feature volumes through calibrated back-projection. Every 3D point knows exactly which pixel it came from.
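The lifting step can be sketched with plain NumPy: project each voxel center into the image with a pinhole camera and sample the feature at that pixel. This is a minimal sketch of the general back-projection idea, not the paper's implementation; the intrinsics, feature shapes, and nearest-pixel sampling are illustrative assumptions.

```python
import numpy as np

def back_project_features(feat_2d, K, grid, H, W):
    """Lift a 2D feature map (H, W, C) onto 3D points (N, 3).

    feat_2d : per-pixel image features
    K       : 3x3 pinhole intrinsics (assumed camera model)
    grid    : (N, 3) voxel centers in camera space, z > 0
    Returns an (N, C) array; points projecting outside the image get zeros.
    """
    # Perspective projection: u = fx * x / z + cx, v = fy * y / z + cy
    uv = (K @ grid.T).T
    uv = uv[:, :2] / uv[:, 2:3]               # divide by depth
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((grid.shape[0], feat_2d.shape[-1]), dtype=feat_2d.dtype)
    out[inside] = feat_2d[v[inside], u[inside]]  # nearest-pixel sampling
    return out

# Toy example: a 4x4 feature map lifted onto three voxel centers.
H, W, C = 4, 4, 2
feat = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)
K = np.array([[2.0, 0, 2.0], [0, 2.0, 2.0], [0, 0, 1.0]])
grid = np.array([[0.0, 0.0, 1.0],    # projects to pixel (2, 2)
                 [0.5, 0.0, 1.0],    # projects to pixel (3, 2)
                 [10.0, 0.0, 1.0]])  # projects outside the image
lifted = back_project_features(feat, K, grid, H, W)
```

Each 3D point carries the feature of exactly the pixel it projects to, which is the "direct pixel-to-3D correspondence" the conditioner is built on.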

Two-Stage Generation and Decoding

A coarse stage predicts overall structure, then a detail stage predicts refined latents. The final result is decoded into a mesh with PBR texture maps (base color, normal, roughness, metallic) ready for rendering or import into game engines and DCC tools.

Practical insight: Pixal3d performs best when the input image shows a single subject with clean edges. Hidden back sides, transparent materials, and heavily occluded geometry are still challenges — use multi-view input when fidelity on all sides matters.

Production Workflow

From one image to a usable 3D asset

AI generation is the starting point — smart cleanup and validation turn a raw output into a production-ready asset.

Prepare the image

Choose a single subject, centered crop, clean silhouette, visible texture zones. Avoid watermarks, heavy occlusion, and extreme lighting that confuses PBR estimation.

Run the official path

Use the Hugging Face demo above, the model card on Hugging Face, or clone the GitHub repo for local inference. For local use: python inference.py --image your_image.png --output ./output.glb

Inspect the first result

Rotate the model, compare the front view to the source image, then check back side completion, holes, floaters, UV seams, and overall scale. Don't trust the first render — rotate it.
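A quick programmatic pass can also catch floaters before you spend cleanup time: count the connected components of the triangle mesh. This is a generic sketch over an indexed face list (not part of Pixal3d's tooling), using a small union-find:

```python
def mesh_islands(faces):
    """Count connected components of an indexed triangle mesh.

    faces: iterable of (i, j, k) vertex-index triples. More than one
    island usually means floaters or debris that needs cleanup.
    """
    parent = {}

    def find(a):
        while parent.setdefault(a, a) != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    for i, j, k in faces:
        union(i, j)
        union(j, k)
    return len({find(v) for v in parent})

# Two triangles sharing an edge (one island) plus a detached triangle (floater).
faces = [(0, 1, 2), (1, 2, 3), (4, 5, 6)]
print(mesh_islands(faces))  # → 2
```

Anything above one island on a single-object generation is worth inspecting in the viewport before export.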

Clean for the destination

GLB for WebGL preview, OBJ for Blender cleanup, FBX for Unity or Unreal, STL or 3MF only after watertight repair. Each format has a job — choose accordingly.

Document everything

Keep the source image license, which branch/checkpoint you used, generation settings, output format, and any cleanup steps alongside the asset. Future you (or your team) will thank you.

Before Generation

Image readiness checker

Not every image is a good candidate for 3D generation. This checklist gives you a repeatable way to decide whether an input is worth spending GPU time on.

Score your source image


Aim for 75+ before committing serious GPU or cleanup time.
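The checklist can be made repeatable as a simple weighted score. The criteria and weights below are illustrative assumptions drawn from the preparation tips on this page, not an official rubric:

```python
def readiness_score(checks):
    """Score an input image 0-100 from yes/no checks.

    `checks` maps criterion name -> bool. Weights are illustrative.
    """
    weights = {
        "single_subject": 25,       # one clear object, centered
        "clean_silhouette": 25,     # sharp edges against the background
        "visible_texture": 20,      # material regions the conditioner can read
        "no_watermark": 10,
        "no_heavy_occlusion": 10,
        "neutral_lighting": 10,     # extreme lighting confuses PBR estimation
    }
    return sum(w for name, w in weights.items() if checks.get(name, False))

image = {
    "single_subject": True,
    "clean_silhouette": True,
    "visible_texture": True,
    "no_watermark": True,
    "no_heavy_occlusion": False,
    "neutral_lighting": True,
}
print(readiness_score(image))  # → 90
```

Scoring the same way every time makes "is this image worth GPU time?" a consistent decision rather than a gut call.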

Asset Handoff

Build a Pixal3d-ready brief

A short brief keeps teams aligned: what the image shows, where the asset goes, what format matters, and what quality must survive export.

Asset brief builder
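The four points above can be captured as a tiny structured brief that travels with the asset. The field names here are illustrative, mirroring those points:

```python
from dataclasses import dataclass, asdict

@dataclass
class AssetBrief:
    """Minimal handoff brief for a generated asset (illustrative fields)."""
    subject: str          # what the image shows
    destination: str      # where the asset goes (engine, web, print...)
    export_format: str    # what format matters downstream
    must_survive: str     # quality that must survive export
    source_license: str = "unverified"

def render_brief(brief):
    """Format the brief as plain text for a ticket or handoff note."""
    return "\n".join(f"{k}: {v}" for k, v in asdict(brief).items())

brief = AssetBrief(
    subject="ceramic teapot, 3/4 view",
    destination="Unity scene prop",
    export_format="FBX",
    must_survive="silhouette + roughness map",
)
print(render_brief(brief))
```

Keeping the brief as data rather than free text makes it trivial to attach to the generation metadata you document later in the workflow.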


Official Source Map

Where to verify Pixal3d details

All resources are publicly available. Use these links as your primary reference chain and verify terms before commercial work.

QA Rubric

How to judge a generated model

A pretty first render is not enough. Evaluate the asset the way a technical artist would evaluate a handoff.

Dimension | What to inspect | Pass condition
Silhouette fidelity | Front outline, proportions, and recognizable identity | Matches the source image at a glance from the input view
Geometry completeness | Back side, sides, holes, floaters, and normal direction | Rotates smoothly without visible collapse or missing surfaces
Material behavior | Base color, roughness, normals, and UV seams | Reads consistently under different lighting conditions
Topology usability | Poly count, mesh islands, UV layout, decimation tolerance | Can be repaired, retopologized, or decimated without chaos
Export reliability | GLB/OBJ/FBX import, texture paths, origin point, and scale | Opens cleanly in the target tool without manual fixes

Developer Notes

Local installation and branch choices

Use the official repository for exact requirements. This summary keeps the decision tree visible for quick reference.

Step 1: TRELLIS.2 base

Follow the TRELLIS.2 installation guide first — Pixal3d builds on top of it. The main branch uses the improved TRELLIS.2 backbone.

Step 2: Pixal3d deps

Install additional Python dependencies with pip install -r requirements.txt, then install utils3d from the project's release page.

Main branch

Latest implementation with improved TRELLIS.2 backbone for better performance. Recommended for new projects and production use.

Paper branch

Original Direct3D-S2-based implementation. Use this to exactly reproduce the results reported in the SIGGRAPH 2026 paper.

Local inference

python inference.py --image assets/test_image/0.png --output ./output.glb after dependencies are installed. A Gradio web demo is also included via python app.py.

GPU requirements

The Hugging Face Space uses H-series GPU architecture. For local use, check the TRELLIS.2 requirements — requirements-hfdemo.txt may not be compatible with all GPU architectures.

2026 Timeline

Project milestones

Key events from the Pixal3d project, based on the paper, official project page, GitHub README, and Hugging Face model card.

  1. Improved version based on TRELLIS.2 backbone released with enhanced performance.
  2. Inference code and online Hugging Face Gradio demo made publicly available.
  3. arXiv submission 2605.10922 posted with full technical details.
  4. Paper accepted to SIGGRAPH 2026 — the premier conference for computer graphics.
Limitations

What not to overpromise

Good engineering communication is honest about failure modes. Here's what Pixal3d cannot guarantee.

Hidden surfaces are inferred

A single image cannot fully prove the back side. The model makes educated guesses — use multiple views when fidelity on all sides matters for production.

Rights and licensing matter

Don't upload copyrighted characters, trademarked brand assets, or private client images unless you have explicit permission. Check the model license for commercial use terms.

Production needs cleanup

Game-ready, print-ready, and commerce-ready assets each need different validation and post-processing. The raw GLB output is a starting point, not the final deliverable.

External demos can fail

If the Hugging Face Space sleeps or queues, the site degrades gracefully to official links and workflow guidance rather than hiding the issue behind a broken iframe.

FAQ

Common questions about Pixal3d

Short, direct answers for users who want the essentials without scrolling through the full paper.

What makes Pixal3d different from other image-to-3D methods?

Pixal3d uses pixel back-projection to explicitly map 2D image features into 3D space, creating direct pixel-to-3D correspondence. Most other methods use attention-based conditioning, where the link between image pixels and 3D geometry is much looser. This gives Pixal3d significantly higher fidelity to the input image.

Can I use Pixal3d for commercial projects?

Check the model license on Hugging Face for the latest terms. The code is open source, but generated assets may have specific usage conditions. Always verify before commercial deployment.

Which branch should I use — main or paper?

Use main for the latest improved version with the TRELLIS.2 backbone — recommended for most users. Use paper only if you need to exactly reproduce the SIGGRAPH 2026 paper results.

Does Pixal3d support multi-view input?

Yes. The paper states that Pixal3d naturally extends to multi-view generation by aggregating back-projected feature volumes across multiple camera views, enabling even higher fidelity.

Which output format should I choose?

GLB for web and quick preview; OBJ for mesh editing and cleanup in Blender; FBX for game engines like Unity and Unreal; STL or 3MF only after watertight repair for 3D printing.

Why does the embedded demo sometimes not load?

Hugging Face Spaces use shared GPUs with a queue system. Spaces can also enter sleep mode when idle. The page includes official links so you can open the Space directly, and the workflow guidance works regardless of demo availability.

Glossary

Key terms

Quick definitions to help non-research visitors navigate the page.

Pixel-aligned
A generation paradigm where 3D features stay tied to the input image view and individual pixel positions, rather than floating in canonical space.
Back-projection
The mathematical mapping from 2D image coordinates and features into 3D space — the core mechanism that enables Pixal3d's pixel-to-3D correspondence.
Sparse SDF
A Signed Distance Function representation of 3D shape that can be compressed into efficient structured latents for scalable generation.
PBR
Physically Based Rendering — a set of texture maps (base color, normal, roughness, metallic) that define how a surface interacts with light.
GLB
A compact binary glTF file format commonly used for web-based 3D viewers and quick asset previews across platforms.
Conditioner
In generative models, the component that processes conditioning signals (like images) and injects them into the generation process. Pixal3d's conditioner uses back-projection.
Academic Reference

Citation

Use the official BibTeX citation when Pixal3d informs your research or technical writing.

@article{li2026pixal3d,
  title   = {Pixal3D: Pixel-Aligned 3D Generation from Images},
  author  = {Li, Dong-Yang and Zhao, Wang and Chen, Yuxin and Hu, Wenbo and Guo, Meng-Hao and Zhang, Fang-Lue and Shan, Ying and Hu, Shi-Min},
  journal = {arXiv preprint arXiv:2605.10922},
  year    = {2026}
}