Complex scene layout generation faces challenges such as bias in cross-modal semantic alignment and inefficient modeling of dynamic spatiotemporal relationships. Existing methods have limitations on ...