The CE safety commit (8a3485454a) converted module-level dicts to lazy
functions but forgot to update __init__.py, which still imported the
now-deleted TRACE_TASK_TO_CASE constant causing an ImportError at startup.
Add get_trace_task_to_case() to gateway.py as a lazy public wrapper
(inverse of _get_case_to_trace_task) and update __init__.py to call it.
- Move enqueue_draft_node_execution_trace import inside call site in workflow_service.py
- Make gateway.py enterprise type imports lazy (routing dicts built on first call)
- Restore typed ModelConfig in llm_generator method signatures (revert dict regression)
- Fix generate_structured_output using wrong key model_parameters -> completion_params
- Replace unsafe cast(str, msg.content) with get_text_content() across llm_generator
- Remove duplicated payload classes from generator.py, import from core.llm_generator.entities
- Gate _lookup_app_and_workspace_names and credential lookups in ops_trace_manager behind is_enterprise_telemetry_enabled()
Move gen_ai.usage.*, gen_ai.request.model, gen_ai.provider.name, and
gen_ai.user.id from companion-log-only to span attributes on workflow
and node execution spans.
These are small scalars with no size risk. Having them on spans enables
filtering and grouping in trace UIs (Tempo, Jaeger, Datadog) without
requiring a cross-signal join to companion logs.
Data dictionary updated: span tables gain the new fields; companion log
'additional attributes' tables trimmed to only list fields not already
covered by 'All span attributes'.
Workflow RT: replace float(info.workflow_run_elapsed_time) with
(end_time - start_time).total_seconds() using workflow_run.created_at and
workflow_run.finished_at. The elapsed_time DB field defaults to 0 and can
be stale if the workflow_storage Celery task has not committed yet when the
trace fires. Wall-clock timestamps are more reliable; elapsed_time is kept
as fallback.
Message RT: change end_time from created_at + provider_response_latency to
message.updated_at when updated_at > created_at. The pipeline explicitly
sets message.updated_at = naive_utc_now() at the moment the LLM response
is complete, making it the canonical response-complete timestamp.
Falls back to the latency-based calculation for error/aborted messages.
Add should_include_content() helper to extensions/otel/parser/base.py that
returns True in CE (no behaviour change) and respects ENTERPRISE_INCLUDE_CONTENT
in EE. Gate all content-bearing span attributes in LLM, retrieval, tool, and
default node parsers so that gen_ai.completion, gen_ai.prompt, retrieval.document,
tool call arguments/results, and node input/output values are suppressed when
ENTERPRISE_ENABLED=True and ENTERPRISE_INCLUDE_CONTENT=False.
- Add _lookup_llm_credential_info() to query Provider/ProviderModel tables
- Lookup LLM credentials when tool credential_id is null
- Fall back to provider-level credential if no model-specific credential
- Extract model_provider/model_name from process_data (LLM nodes store
model info there, not in execution_metadata)
- Add invoke_from to node execution trace metadata dict
- Add credential_id to node execution trace metadata dict
- Add conversation_id to metadata after message_id lookup
- Add tool_name to tool_info dict in tool node
- Extract model_provider/model_name from process_data (LLM nodes store
model info there, not in execution_metadata)
- Add invoke_from to node execution trace metadata dict
- Add credential_id to node execution trace metadata dict
- Add conversation_id to metadata after message_id lookup
- Add tool_name to tool_info dict in tool node
Fixes Celery worker error where process_enterprise_telemetry task
was unregistered despite being dispatched from the app.
Added conditional import when ENTERPRISE_TELEMETRY_ENABLED=true
to ensure the task is available in the worker process.
Resolves: KeyError 'tasks.enterprise_telemetry_task.process_enterprise_telemetry'
- Add resolved_parent_context property to BaseTraceInfo for reusable parent context extraction
- Refactor enterprise_trace.py to use property instead of duplicated dict plucking (~19 lines eliminated)
- Fix UUID validation in exporter.py with specific error logging for invalid trace correlation IDs
- Add error isolation in event_handlers.py to prevent telemetry failures from breaking user operations
- Replace pickle-based payload_fallback with JSON storage rehydration for security
- Update TelemetryEnvelope to use Pydantic v2 ConfigDict with extra='forbid'
- Update tests to reflect contract changes and new error handling behavior
- Change insecure parameter from API key-based to URL scheme-based detection
- https:// endpoints now correctly use TLS (insecure=False)
- All other endpoints (http://, no scheme) use insecure=True
- Update tests to reflect URL scheme-based logic
- Remove incorrect documentation claiming API key controls TLS
Move routing table, emit(), and is_enterprise_telemetry_enabled() from
enterprise/telemetry/gateway.py into core/telemetry/gateway.py so both
CE and EE share one code path. The ce_eligible flag in CASE_ROUTING
controls which events flow in CE — flipping it is the only change needed
to enable an event in community edition.
- Delete enterprise/telemetry/gateway.py (class-based singleton)
- Create core/telemetry/gateway.py (stateless functions, no shared state)
- Simplify core/telemetry/__init__.py to thin facade over gateway
- Remove TelemetryGateway class and get_gateway() from ext_enterprise_telemetry
- Single-source is_enterprise_telemetry_enabled in core.telemetry.gateway
- Fix pre-existing test bugs (missing dify.event.id in metric handler tests)
- Update all imports and mock paths across 7 test files
Add comprehensive model tracking across all OTEL metrics and logs:
- Node execution metrics now include model_name for LLM operations
- Suggested question metrics include model_provider and model_name
- Dataset retrieval captures both embedding and rerank model info
- Updated DATA_DICTIONARY.md with complete metric label documentation
This enables granular cost tracking, performance analysis, and usage monitoring per model across all operation types.
Separate background/configuration instructions from the data dictionary:
- README.md: Overview, configuration, correlation model, content gating
- DATA_DICTIONARY.md: Pure reference format with signals and attributes
The data dictionary is now concise (465 lines vs 911) and focuses on
attribute types and relationships without verbose explanations.
- Add ALLOWED_ACCESS_MODES constant to centralize valid access modes
- Include 'sso_verified' in validation to fix app duplication errors
- Update error message to dynamically list all allowed modes
- Refactor for maintainability: single source of truth for access modes
This fixes the issue where apps with access_mode='sso_verified' could not
be duplicated because the validation in update_app_access_mode() was missing
this mode, even though it was documented in WebAppSettings model.
When copying an app, the copied app was not getting a web_app_settings
record created. This caused the enterprise service to query for settings
that don't exist, falling back to default behavior.
This fix ensures copied apps inherit the same access mode as the original:
- If original has explicit settings (public/private/private_all/sso_verified),
the copy gets the same setting
- If original has no settings (old apps), copy defaults to 'public' to match
the original's effective permission via fallback
This prevents permission mismatches between original and copied apps and
ensures the enterprise service has explicit settings to query.
Related: langgenius/dify-enterprise#423
- Add dify.credential.id to node execution events
- Add dify.event.id to all telemetry events (APP_CREATED, APP_UPDATED, APP_DELETED, FEEDBACK_CREATED)
This ensures all .name fields have corresponding .id fields for reliable aggregation and deduplication.
- Add APP_CREATED, APP_UPDATED, APP_DELETED counters to EnterpriseTelemetryCounter
- Create EnterpriseTelemetryEvent StrEnum for type-safe event names
- Update metric_handler to use new app-specific counters with labels (tenant_id, app_id, mode)
- Convert all event_name strings to EnterpriseTelemetryEvent enum values
- Update exporter to create OTEL meters for new app counters (dify.app.created.total, etc.)
- Update tests to verify new counter behavior and enum usage
Prevents CE users from enqueueing EE-only events (all METRIC_LOG cases)
to non-existent enterprise_telemetry Celery queue.
- Add _should_drop_ee_only_event() check in emit() before routing
- Remove redundant check from _emit_trace()
- Single guard at gateway level protects both trace and metric/log paths
**Problem:**
The telemetry system had unnecessary abstraction layers and bad practices
from the last 3 commits introducing the gateway implementation:
- TelemetryFacade class wrapper around emit() function
- String literals instead of SignalType enum
- Dictionary mapping enum → string instead of enum → enum
- Unnecessary ENTERPRISE_TELEMETRY_GATEWAY_ENABLED feature flag
- Duplicate guard checks scattered across files
- Non-thread-safe TelemetryGateway singleton pattern
- Missing guard in ops_trace_task.py causing RuntimeError spam
**Solution:**
1. Deleted TelemetryFacade - replaced with thin emit() function in core/telemetry/__init__.py
2. Added SignalType enum ('trace' | 'metric_log') to enterprise/telemetry/contracts.py
3. Replaced CASE_TO_TRACE_TASK_NAME dict with CASE_TO_TRACE_TASK: dict[TelemetryCase, TraceTaskName]
4. Deleted is_gateway_enabled() and _emit_legacy() - using existing ENTERPRISE_ENABLED + ENTERPRISE_TELEMETRY_ENABLED instead
5. Extracted _should_drop_ee_only_event() helper to eliminate duplicate checks
6. Moved TelemetryGateway singleton to ext_enterprise_telemetry.py:
- Init once in init_app() for thread-safety
- Access via get_gateway() function
7. Re-added guard to ops_trace_task.py to prevent RuntimeError when EE=OFF but CE tracing enabled
8. Updated 11 caller files to import 'emit as telemetry_emit' instead of 'TelemetryFacade'
**Result:**
- 322 net lines deleted (533 removed, 211 added)
- All 91 tests pass
- Thread-safe singleton pattern
- Cleaner API surface: from TelemetryFacade.emit() to telemetry_emit()
- Proper enum usage throughout
- No RuntimeError spam in EE=OFF + CE=ON scenario
Enables distributed tracing for nested workflows across all trace providers
(Langfuse, LangSmith, community providers). When a workflow invokes another
workflow via workflow-as-tool, the child workflow now includes parent context
attributes that allow trace systems to reconstruct the full execution tree.
Changes:
- Add parent_trace_context field to WorkflowTool
- Set parent context in tool node when invoking workflow-as-tool
- Extract and pass parent context through app generator
This is a community enhancement (ungated) that improves distributed tracing
for all users. Parent context includes: trace_id, node_execution_id,
workflow_run_id, and app_id.
The logic was inverted - we were blocking all CE traces and only allowing
enterprise traces. The correct logic should be:
- Allow all CE traces (workflow, message, tool, etc.)
- Only block enterprise-only traces when enterprise telemetry is disabled
Before: if event.name not in _ENTERPRISE_ONLY_TRACES: return
After: if event.name in _ENTERPRISE_ONLY_TRACES and not is_enterprise_telemetry_enabled(): return
Changes:
- Change TelemetryEvent.name from str to TraceTaskName enum for type safety
- Remove hardcoded trace_task_name_map from facade (no mapping needed)
- Add centralized enterprise-only filter in TelemetryFacade.emit()
- Rename is_telemetry_enabled() to is_enterprise_telemetry_enabled()
- Update all 11 call sites to pass TraceTaskName enum values
- Remove redundant enterprise guard from draft_trace.py
- Add unit tests for TelemetryFacade.emit() routing (6 tests)
- Add unit tests for TraceQueueManager telemetry guard (5 tests)
- Fix test fixture scoping issue for full test suite compatibility
- Fix tenant_id handling in agent tool callback handler
Benefits:
- 100% type-safe: basedpyright catches errors at compile time
- No string literals: eliminates entire class of typo bugs
- Single point of control: centralized filtering in facade
- All guards removed except facade
- Zero regressions: 4887 tests passing
Verification:
- make lint: PASS
- make type-check: PASS (0 errors, 0 warnings)
- pytest: 4887 passed, 8 skipped
Migrate from direct TraceQueueManager.add_trace_task calls to TelemetryFacade.emit
with TelemetryEvent abstraction. This reduces CE code invasion by consolidating
telemetry logic in core/telemetry/ with a single guard in ops_trace_manager.py.
The model_provider field in prompt generation traces was being incorrectly
extracted by parsing the model name (e.g., 'deepseek-chat'), which resulted
in an empty string when the model name didn't contain a '/' character.
Now extracts the provider directly from the model_config parameter, with
a fallback to the old parsing logic for backward compatibility.
Changes:
- Update _emit_prompt_generation_trace to accept model_config parameter
- Extract provider from model_config.get('provider') when available
- Update all 6 call sites to pass model_config
- Maintain backward compatibility with fallback logic
- get_ops_trace_instance was trying to query App table with storage_id format "tenant-{uuid}"
- This caused psycopg2.errors.InvalidTextRepresentation when app_id is None
- Added early return for tenant-prefixed storage identifiers to skip App lookup
- Enterprise telemetry still works correctly with these storage IDs
- The no_variable=false code path in generate_rule_config was missing trace emission
- Added timing wrapper and _emit_prompt_generation_trace call to ensure metrics/logs are captured
- Trace now emitted on both success and failure cases for consistency with no_variable=true path
- Forward kwargs to message_trace to preserve node_execution_id
- Add node_execution_id extraction to all trace methods
- Add app_id parameter to prompt generation API endpoints
- Enable app_id tracing for rule_generate, code_generate, and structured_output operations
Remove app_id parameter from three endpoints and update trace manager to use
tenant_id as storage identifier when app_id is unavailable. This allows
standalone prompt generation utilities to emit telemetry.
Changes:
- controllers/console/app/generator.py: Remove app_id=None from 3 endpoints
(RuleGenerateApi, RuleCodeGenerateApi, RuleStructuredOutputGenerateApi)
- core/ops/ops_trace_manager.py: Use tenant_id fallback in send_to_celery
- Extract tenant_id from task.kwargs when app_id is None
- Use 'tenant-{tenant_id}' format as storage identifier
- Skip traces only if neither app_id nor tenant_id available
The trace metadata still contains the actual tenant_id, so enterprise
telemetry correctly emits metrics and logs grouped by tenant.
Remove app_id=None from three prompt generation endpoints that lack proper
app context. These standalone utilities only have tenant_id available, so
we don't pass app_id at all rather than passing incomplete information.
Affected endpoints:
- /rule-generate (RuleGenerateApi)
- /code-generate (RuleCodeGenerateApi)
- /structured-output-generate (RuleStructuredOutputGenerateApi)
- Add PromptGenerationTraceInfo trace entity with operation_type field
- Implement telemetry for rule-generate, code-generate, structured-output, instruction-modify operations
- Emit metrics: tokens (total/input/output), duration histogram, requests counter, errors counter
- Emit structured logs with model info and operation context
- Content redaction controlled by ENTERPRISE_INCLUDE_CONTENT env var
- Fix user_id propagation in TraceTask kwargs
- Fix latency calculation when llm_result is None
No spans exported - metrics and logs only for lightweight observability.
- Create dedicated MeterProvider instance (independent from ext_otel.py)
- Add create_metric_exporter() to _ExporterFactory with HTTP/gRPC support
- Enterprise metrics now work without requiring standard OTEL to be enabled
- Add MeterProvider shutdown to cleanup lifecycle
- Update module docstring to reflect full independence (Tracer, Logger, Meter)
Only export structured telemetry logs, not all application logs. The attach_log_handler method now attaches to the 'dify.telemetry' logger instead of the root logger.
- Add ENTERPRISE_OTLP_PROTOCOL config (http/grpc, default: http)
- Introduce _ExporterFactory class for protocol-agnostic exporter creation
- Support both HTTP and gRPC OTLP endpoints for traces and logs
- Refactor endpoint path handling into factory methods
- Add ENTERPRISE_OTEL_LOGS_ENABLED and ENTERPRISE_OTLP_LOGS_ENDPOINT config options
- Implement EnterpriseLoggingHandler for log record translation with trace/span ID parsing
- Add LoggerProvider and BatchLogRecordProcessor for OTLP log export
- Correlate telemetry logs with spans via span_id_source parameter
- Attach log handler during enterprise telemetry initialization
# If license check fails, log but don't block the request.
# This prevents service disruption if enterprise API is temporarily
# unavailable.
logger.exception("Failed to check enterprise license status")
# add after request hook for injecting trace headers from OpenTelemetry span context
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.