Semantic-Structural Alignment for Generative Pictorial Charts

¹Shenzhen University, ²Yonsei University, ³Tel-Aviv University.

Paper arXiv Code Media Coverage


Semantically-Rich Pictorial Chart Generation. Our method transforms abstract statistical graphics into high-fidelity pictorial charts by replacing geometric primitives with text-guided semantic objects. It synthesizes object instances that remain structurally faithful to the original data encodings (e.g., proportional heights) and supports multi-channel visual encodings (e.g., categorical hues), optionally complemented with contextual backgrounds. This yields artistically compelling charts with strong structural consistency.

Abstract

Traditional statistical graphics are precise but often lack the visual appeal, memorability, and engagement of pictorial charts. We present a generative framework for the automated synthesis of pictorial charts that bridges the gap between semantic expression and structural faithfulness. Rather than treating charts merely as images to be stylized, we frame the problem as a dual-conditioned generation task guided by two parallel external control signals: a text prompt capturing the semantic context of the editing intent, and a context image providing the abstract statistical chart's global structure. To reinforce these controls within a Multi-Modal Diffusion Transformer, we introduce two complementary feature-level mechanisms: structural alignment to anchor spatial layouts to the input chart, and semantic alignment to transfer expressive textures from reference images. Generalizing across major visual channels (i.e., length, area, angle, and position) and diverse semantic domains, our method produces pictorial charts that are both artistically compelling and structurally consistent. Extensive quantitative evaluations and perceptual user studies demonstrate that our framework outperforms traditional controllable generation and image editing baselines, providing a foundation for high-fidelity, data-driven generative modeling in expressive visual storytelling.


Research Questions

This research tackles the tension between the creative expressiveness of generative AI and the strict data constraints of information visualization. The core research questions addressed by this work are:

RQ1
How can we leverage generative AI to enable expressive semantic synthesis in pictorial charts without compromising the strict structural fidelity and informational integrity of the underlying data?
RQ2
How can structural invariants be effectively disentangled from semantic expression to allow for controllable, dual-conditioned (text-guided and reference-guided) chart generation?
RQ3
Can an autonomous structural alignment approach generalize across diverse and complex chart topologies (e.g., area charts, stacked bar charts, scatter plots) while maintaining precise, continuous spatial data encoding?
RQ4
How does an end-to-end autonomous framework compare against existing human-in-the-loop, domain-specific tools (e.g., ChartSpark) and general image-editing baselines in achieving the optimal balance between visual cohesiveness and spatial accuracy?

Method


Method overview. Given a source chart image and a textual prompt, we provide them as the contextual image condition and textual condition, respectively. To enforce strong structural consistency and enable expressive semantic synthesis, we introduce Structural DIFT and Semantic DIFT. These operations are performed within the self-attention layers of the single-stream blocks in the MM-DiT.


The DIFT Remapping Process. (a) Structural DIFT: Dense correspondence (C_{c→tgt}, blue arrows) is computed between the reference chart queries (Q_c, green points) and the target queries (Q_tgt, red points). Based on this correspondence, the target features are spatially remapped (cyan arrows) to new positions, aligning their geometric layout with the structural arrangement of the reference chart. (b) Semantic DIFT: Using the established correspondence (C_{tgt→ref}) between the target features (H_tgt, green points) and the reference features (H_ref, red points), the reference keys (K_ref) and values (V_ref) are spatially remapped. These are subsequently interpolated with the target keys (K_tgt) and values (V_tgt) to fuse the semantic attributes.
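The two remapping steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: nearest-neighbour cosine matching stands in for the actual DIFT correspondence, and `alpha` is a hypothetical blending weight for the K/V interpolation.

```python
import numpy as np

def cosine_correspondence(src, dst):
    """For each src feature, the index of its nearest dst feature (cosine similarity)."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    dst_n = dst / np.linalg.norm(dst, axis=1, keepdims=True)
    return (src_n @ dst_n.T).argmax(axis=1)            # shape: (len(src),)

def structural_remap(q_chart, q_target, feats_target):
    """Structural DIFT: pull each target feature to the position of its
    matching chart query, anchoring the layout to the source chart."""
    corr = cosine_correspondence(q_chart, q_target)    # C_{c->tgt}
    return feats_target[corr]

def semantic_fuse(h_target, h_ref, k_ref, v_ref, k_tgt, v_tgt, alpha=0.5):
    """Semantic DIFT: remap reference K/V via C_{tgt->ref}, then blend them
    with the target K/V to fuse semantic attributes."""
    corr = cosine_correspondence(h_target, h_ref)      # C_{tgt->ref}
    k = alpha * k_ref[corr] + (1 - alpha) * k_tgt
    v = alpha * v_ref[corr] + (1 - alpha) * v_tgt
    return k, v
```

In the full method these operations run inside the self-attention layers of the MM-DiT single-stream blocks; the sketch only shows the index-and-blend logic.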


Training pipeline. Overview of our progressive data curation and fine-tuning strategy: (a) Data Preprocessing: We pair manually collected pictorial charts with reverse-engineered geometric source charts, generating corresponding text prompts via a Vision-Language Model (VLM). (b) Fine-Tuning & Data Augmentation: A baseline LoRA is fine-tuned on this small seed dataset and subsequently utilized to autonomously synthesize a large-scale, structurally diverse augmented dataset. (c) Optimization & Inference: The augmented data spanning multiple visual encoding channels is aggregated for a final, comprehensive LoRA fine-tuning, equipping the model with a robust generative prior for expressive semantic synthesis.
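Schematically, the three stages reduce to a small orchestration skeleton. The function names below are illustrative stand-ins for the authors' actual tooling: `caption_fn` plays the role of the VLM captioner and `synthesize_fn` the role of the baseline LoRA generator.

```python
def build_seed_dataset(pictorial, sources, caption_fn):
    """(a) Pair each collected pictorial chart with its reverse-engineered
    geometric source chart, and caption the pair with a VLM stand-in."""
    return [{"source": s, "target": t, "prompt": caption_fn(t)}
            for s, t in zip(sources, pictorial)]

def augment(seed, synthesize_fn, per_sample=4):
    """(b) A baseline LoRA fine-tuned on the seed set synthesizes
    structurally diverse variants of each seed example."""
    return [synthesize_fn(ex) for ex in seed for _ in range(per_sample)]

def aggregate(seed, augmented):
    """(c) Merge seed and augmented data across visual-encoding channels
    for the final, comprehensive LoRA fine-tune."""
    return seed + augmented
```

The skeleton makes the bootstrapping explicit: a small manually curated seed set is leveraged into a much larger training corpus before the final fine-tune.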


Evaluation


Visualization of User Study Results. Rank distribution (left) and pairwise preference heat map (right) comparing our method against Ctrl-X, CIA, and FLUX.1 Kontext. The results show a significant user preference for our dual-conditioned method in balancing structural fidelity with high-quality semantic synthesis. Participant rankings reveal a strong consensus, with our method preferred over the baselines in 67.8% of trials.
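A pairwise preference matrix of the kind shown in the right-hand panel can be computed directly from per-trial rankings. The sketch below uses illustrative ranking data, not the study's actual responses; each trial lists method indices from best to worst.

```python
import numpy as np

METHODS = ["Ours", "Ctrl-X", "CIA", "FLUX.1 Kontext"]

def pairwise_preference(rankings, n_methods=len(METHODS)):
    """wins[i, j]: fraction of trials in which method i is ranked above j."""
    wins = np.zeros((n_methods, n_methods))
    for ranking in rankings:
        pos = {m: r for r, m in enumerate(ranking)}  # method -> rank position
        for i in range(n_methods):
            for j in range(n_methods):
                if i != j and pos[i] < pos[j]:
                    wins[i, j] += 1
    return wins / len(rankings)

# Illustrative trials: method 0 ("Ours") ranked first in 2 of 3 trials.
trials = [[0, 1, 2, 3], [0, 2, 1, 3], [1, 0, 2, 3]]
matrix = pairwise_preference(trials)
```

Each cell (i, j) of `matrix` then maps directly to one cell of the preference heat map.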


Ablation Study. Columns demonstrate the progressive integration of specific modules across different themes (rows). The full method (bottom row) ensures both strict structural alignment and robust semantic synthesis.


Results


Qualitative Results. Our method generates diverse pictorial charts, preserving data-encoding colors and spatial structure during semantic synthesis.


Qualitative Comparison. We compare our method against eight baselines for controllable generation and image editing: Ctrl-X, CIA, ControlNet+IP-Adapter, Stable Flow, ICEdit, FLUX.1 Kontext, ControlNet, and SDEdit. The first three columns feature reference-guided methods utilizing an additional semantic reference image, while the remaining columns display text-guided methods relying solely on the source chart and prompt. Across these diverse conditions, our method achieves the optimal balance, performing expressive semantic synthesis while strictly preserving the structural fidelity of the original chart.


Generalization Across Chart Topologies. High-fidelity semantic synthesis and structural alignment applied to diverse chart topologies, including area charts, donut charts, stacked bar charts, and scatter plots.


Future Explorations: Holistic Scene Generation. Contextually coherent backgrounds are currently achieved by expanding the base text prompt. Because explicit background control is not an objective of the current method, holistic scene generation relies heavily on the underlying model's prior, which can be disrupted by the chart-focused alignment mechanism. These preliminary results demonstrate the potential for full-scene data storytelling, underscoring the need for unified control mechanisms that govern background synthesis without compromising analytical legibility.


Reflection

Reflecting on the trajectory of this research, we recognize that the fundamental challenge of generative pictorial visualization lies in reconciling the unconstrained expressiveness of underlying AI models with the rigorous analytical demands of data representation. Our framework demonstrates that this tension is not insurmountable: by explicitly disentangling structural invariants from semantic features, we can achieve expressive semantic synthesis without sacrificing structural fidelity. For the visualization community, we provide a chart-generation method explicitly driven by data-storytelling purposes; for the computer graphics field, this work serves as a preliminary but instructive attempt at strictly controlling geometric structure during generative image synthesis. While our explorations into holistic scene generation highlight the ongoing difficulty of balancing automated background synthesis with strict foreground constraints, they also illuminate a clear path forward. Ultimately, this research establishes a foundational step toward a future in which human creativity and artificial intelligence collaboratively transform abstract data into resonant, visually compelling narratives.


Video Presentation

coming soon...

BibTeX