Logs, events, and now traces: storing and managing all this information can be time-consuming and expensive. Selecting one metric over another sounds like a sensible option, but in reality you don't know what you will need when it comes time to figure out what is going on in production during an outage.
More often than not, the crucial clue turns out to be in the signal you decided not to collect.
On top of that, there is the nature of this kind of information: you usually set everything up at the beginning of development, when there are not many users and not much traffic, so you collect everything you can find, for as long as you want. Then, when costs explode, you need to make a decision, quickly.
My suggestion is to play with retention policies. Many monitoring tools, time series databases, and observability solutions offer this capability: you can choose how long to store your data. Design different tiers and strategies to move data between them. Aggregation is a common strategy for moving data between tiers. With aggregation you lose granularity, because in practice you are reducing many points in a range to a single one, but there is no secret sauce here: hard drives and SSDs have limits, and more data means slower queries or higher costs.
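To make the aggregation idea concrete, here is a minimal Python sketch. The function name, the sample data, and the one-hour window are invented for illustration, not taken from any specific product; real systems (Prometheus recording rules, InfluxDB retention policies, and similar) do this work for you. The point is simply that many raw points collapse into one averaged point per window before moving to a cheaper, coarser tier.

```python
from collections import defaultdict

def downsample(samples, window_seconds):
    """Collapse raw (timestamp, value) samples into one averaged
    point per window. Granularity is lost: many points become one."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        buckets[ts - (ts % window_seconds)].append(value)
    return sorted(
        (bucket_ts, sum(values) / len(values))
        for bucket_ts, values in buckets.items()
    )

# Hypothetical tiers: keep raw data for a short time,
# keep only hourly rollups for the long term.
raw = [(0, 1.0), (30, 3.0), (3600, 5.0), (3650, 7.0)]
hourly_tier = downsample(raw, window_seconds=3600)
print(hourly_tier)  # [(0, 2.0), (3600, 6.0)]
```

Averaging is only one choice: keeping min and max alongside the mean preserves the spikes that an average would flatten, at the cost of a few more stored points.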
In Italy, we say: "la coperta è corta", "the blanket is short". Sometimes you need to decide whether you want your head or your feet covered.
Cleaning up what you collect and defining priorities is a valid solution.
Are you having trouble finding your way to building automation, releasing, and troubleshooting your software? Let's get actionable lessons learned straight to you via email.