Cindy Sridharan’s popular “Distributed Systems Observability” book published by O’Reilly claims that logs, metrics, and traces are the three pillars of observability.
According to Sridharan, an event log is a record of events that contains both a timestamp and a payload. Event logs come in three forms:
- Plaintext: A log record stored as free-form text. This is the most commonly used type of log.
- Structured: A log record that is typically stored in JSON format. This is the form most strongly advocated.
- Binary: Examples of binary event logs include Protobuf-formatted logs, MySQL binlogs, and systemd journal logs.
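To make the difference between the first two forms concrete, here is a minimal sketch that renders the same event as a plaintext line and as a structured JSON record. The function names, field names, and the sample event are illustrative assumptions and do not come from Sridharan's book.

```python
import json
from datetime import datetime, timezone


def plaintext_record(timestamp: datetime, message: str) -> str:
    """Render an event as a free-form plaintext line (hypothetical helper)."""
    return f"{timestamp.isoformat()} {message}"


def structured_record(timestamp: datetime, **payload) -> str:
    """Render an event as a JSON object with named fields (hypothetical helper)."""
    return json.dumps({"timestamp": timestamp.isoformat(), **payload}, sort_keys=True)


# Sample event; the values are made up for illustration.
ts = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
plain = plaintext_record(ts, "user 42 checkout failed: card declined")
structured = structured_record(
    ts, event="checkout_failed", user_id=42, reason="card declined"
)

# A structured record can be filtered by field without parsing free text.
parsed = json.loads(structured)
print(plain)
print(parsed["event"], parsed["user_id"])
```

Both records carry a timestamp and a payload; the structured one simply names each part of the payload so that it can be queried directly.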
Logs can be useful in identifying unpredictable behavior in a system. Sridharan explains that distributed systems often fail not because of one specific event, but because of a series of possible triggers. To pin down the cause of a failure, operations teams need to be able to start with a symptom pinpointed by a metric or log, infer the life cycle of a request across various system components, and iteratively ask questions about interactions between parts of that system.
Logs are the base of the three pillars, and both metrics and traces are built on top of them, Sridharan wrote.
Sridharan defines metrics as numeric representations of data measured across time intervals. They are useful in observability because they can be used by machine learning algorithms to gain insight into the behavior of a system over time. According to Sridharan, their numerical nature also allows for longer retention of data and easier querying, making them well suited for building dashboards that show historical trends.
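As a rough illustration of "numeric representations of data measured across time intervals," the sketch below collapses raw event timestamps into per-interval counts. The function name, the 60-second interval, and the sample timestamps are assumptions chosen for the example, not something the book prescribes.

```python
from collections import defaultdict


def bucket_counts(event_timestamps, interval_seconds=60):
    """Count events per fixed-size time interval (hypothetical helper).

    Each timestamp is a number of seconds; the key of each bucket is the
    start of the interval that the timestamp falls into.
    """
    counts = defaultdict(int)
    for ts in event_timestamps:
        bucket_start = int(ts // interval_seconds) * interval_seconds
        counts[bucket_start] += 1
    return dict(sorted(counts.items()))


# Six made-up events, expressed as seconds since some starting point.
timestamps = [3, 15, 59, 61, 62, 185]
per_minute = bucket_counts(timestamps)
print(per_minute)  # {0: 3, 60: 2, 180: 1}
```

However many events occur, each interval is stored as a single number, which is one way to see why metrics are cheaper to retain and query than the underlying events.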
Traces are the final pillar of observability. According to Sridharan, a trace is “a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.” They can provide visibility into the path that a request took and the structure of that request. Traces can help uncover the unintentional effects of a request, making them particularly well suited for complex environments, such as microservices.
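One simple way to picture a trace is as a set of spans, each pointing at the span that caused it. The sketch below follows those links to recover the path a request took. The `Span` type, its fields, and the service names are hypothetical and are not taken from the book or from any particular tracing library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """One unit of work in a trace (hypothetical, simplified type)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the request
    service: str


def request_path(spans: List[Span], leaf_id: str) -> List[str]:
    """Walk parent links from a span back to the root and return the
    services in the order the request passed through them."""
    by_id = {span.span_id: span for span in spans}
    path = []
    current = by_id.get(leaf_id)
    while current is not None:
        path.append(current.service)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


# A made-up request that passes through three services.
spans = [
    Span("a", None, "gateway"),
    Span("b", "a", "orders"),
    Span("c", "b", "payments"),
]
path = request_path(spans, "c")
print(path)  # ['gateway', 'orders', 'payments']
```

The parent links are what encode the causal relationship between events, and they let the end-to-end flow be reconstructed even when each service records its span independently.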