SYSOPS HEALTH CHECK / METRICS
|
|
|
|
Decision: a log is a log, and all logs are relevant to sysops people, so all logging will be treated the same regardless of source, with an effort made to ensure each log entry is tagged with the relevant class name
|
|
|
|
CRITICAL ISSUES
|
|
- Check for critical issues in a periodic health check job, which also handles logging and metrics
|
|
- Critical issues should be logged first, then sent via notification to system operators who are subscribed
|
|
-
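A minimal sketch of the "log first, then notify" order described above, written in Python purely for illustration (the real job would presumably live in the application's own language). The names Status, CheckResult and run_health_checks, and the idea of subscribers as plain callables, are assumptions made for this sketch and not existing code. The logger is named after the owning class so each entry carries the class name tag.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class CheckResult:
    name: str
    status: Status
    message: str = ""


def run_health_checks(
    checks: List[Callable[[], CheckResult]],
    subscribers: List[Callable[[CheckResult], None]],
    logger: logging.Logger,
) -> List[CheckResult]:
    """Run every check; for each critical result, log it and only then notify."""
    critical = []
    for check in checks:
        result = check()
        if result.status is Status.UNHEALTHY:
            # Step 1: the log entry is written before any notification goes out.
            logger.critical("%s: %s", result.name, result.message)
            # Step 2: notify only the operators who subscribed.
            for notify in subscribers:
                notify(result)
            critical.append(result)
    return critical
```

Usage would be a periodic job calling run_health_checks(checks, subscribers, logging.getLogger("HealthCheckJob")), where "HealthCheckJob" is a hypothetical class name.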
|
|
|
|
METRICS
|
|
- Metrics should be gathered in the DB and reported via the UI for ops users, and potentially in other formats down the road
|
|
|
|
|
|
|
|
TODO LIST OF THINGS CODED THAT NEED TO BE LOGGED
|
|
- Items in code tagged with this:
|
|
- //TODO: core-log-sysop
|
|
- Generator failures
|
|
- Failures in IJobBiz-derived objects
|
|
|
|
- configuration changes ???
|
|
- Install and uninstall feature changes
|
|
- Warnings (low disk space, slowness monitoring, db issues) (during health check JOB??)
|
|
|
|
|
|
"HEALTH CHECK" JOB
|
|
- Things that need to be metricized are commented with //OPSMETRIC
|
|
- Maybe a "health check" job or "checkup" job that periodically assesses things and reports findings
|
|
- works in conjunction with metrics gathered maybe?
|
|
- Metrics would be a system that, for example, could sample free disk space, sample it again a few days later, and project ahead to warn before space runs low; or simply warn when free space is down to 10%, etc.
|
|
- Anything we'd like to see from a support point of view would be useful too
|
|
- Go over the research doc to see what was recommended
|
|
- Dig up that guy's example project on his blog that he was going to add metrics to.
|
|
- Brainstorm a list of recent support issues and what could be a benefit in dealing with them
|
|
- "Slowness" comes up a lot.
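The free disk space idea in the list above (sample, sample again later, project ahead, or just warn at 10%) can be sketched as a straight-line projection. This is only an illustration in Python under the assumption that usage grows roughly linearly; the function name days_until_threshold and its parameters are hypothetical.

```python
from typing import Optional


def days_until_threshold(
    free_then: float,
    free_now: float,
    days_between: float,
    total: float,
    warn_fraction: float = 0.10,
) -> Optional[float]:
    """Project how many days remain until free space hits warn_fraction of total.

    Returns 0.0 if free space is already at or below the threshold, and None
    if free space is not shrinking (no projection is possible).
    """
    threshold = total * warn_fraction
    if free_now <= threshold:
        return 0.0  # the simple "down to 10%" rule: warn right away
    consumed_per_day = (free_then - free_now) / days_between
    if consumed_per_day <= 0:
        return None
    return (free_now - threshold) / consumed_per_day
```

In Python the two samples could come from shutil.disk_usage(path).free taken a few days apart; a health check job could then warn when the projected number of days drops below some chosen limit.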
|
|
|
|
|
|
Ops Metrics
|
|
CONFIRMED REQUIRED
|
|
- Gather in memory and flush to db on a schedule is best
|
|
- CASE 3562 Count of mismatches, if any are found, between attached files in the database and the file system
|
|
- CASE 3523 Log major ops related configuration changes (before and after snapshot)
|
|
- CASE 3502 Log feature or route or endpoint usage count as a snapshot metric so can compare month to month.
|
|
- CASE 3502 Log record count in each table or at least major ones as a snapshot metric so can compare month to month.
|
|
- CASE 3497 ACTIVE user count - Log user login, last login and login per X period
|
|
- CASE 3499 "Slow": I want to know if anything is slow, based not on what the user says but on what the code measures
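One way the code itself could determine "slow" (CASE 3499) is to time an operation and log only when it exceeds a threshold. The sketch below is Python for illustration only; the name timed, the threshold parameter and the "SLOW" message prefix are assumptions, not an existing API.

```python
import logging
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed(
    operation: str,
    slow_after_seconds: float,
    logger: logging.Logger,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[None]:
    """Time the wrapped block and log a warning only if it ran too long."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed > slow_after_seconds:
            logger.warning(
                "SLOW %s took %.3fs (threshold %.3fs)",
                operation, elapsed, slow_after_seconds,
            )
```

A route handler could wrap its body in "with timed(route_name, 1.0, logger):", so slowness is reported from measurements rather than from user reports. The clock parameter exists only so the timing can be faked in tests.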
|
|
|
|
RESEARCH / IDEAS / EXAMPLES
|
|
- Metric types:
|
|
- https://www.app-metrics.io/getting-started/metric-types/
|
|
- Code example that deals with this issue:
|
|
- https://github.com/AppMetrics/AppMetrics/tree/dev/src/App.Metrics.Core
|
|
- Need more than one window into the data, for example we need a last few minutes (5?) view so people can see at a glance what is happening NOW
|
|
- But we also need to know what it was historically. So maybe we need a NOW algorithm but also a HISTORICAL algorithm.
|
|
- Maybe a sliding scale of recency, so a 5 minute view, a THIS WEEK view and then a month to month view beyond that??
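A rough sketch of the "more than one window" idea: keep timestamped samples in memory and ask for an average over whichever recent window is wanted (5 minutes, this week). This is illustrative Python only; WindowedMetric and its methods are hypothetical names, samples are assumed to arrive in time order, and the month to month view would more likely come from snapshots stored in the DB than from memory.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class WindowedMetric:
    """Keeps recent (timestamp, value) samples and averages them per window."""

    def __init__(self, max_age_seconds: float) -> None:
        self._max_age = max_age_seconds
        self._samples: Deque[Tuple[float, float]] = deque()

    def record(self, timestamp: float, value: float) -> None:
        self._samples.append((timestamp, value))
        # Drop samples older than the longest window we intend to serve.
        while self._samples and self._samples[0][0] < timestamp - self._max_age:
            self._samples.popleft()

    def average(self, now: float, window_seconds: float) -> Optional[float]:
        """Average of samples within the last window_seconds, or None if empty."""
        values = [v for t, v in self._samples if now - window_seconds <= t <= now]
        return sum(values) / len(values) if values else None
```

The same object can then answer both the NOW question, average(now, 300), and the THIS WEEK question, average(now, 7 * 24 * 3600).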
|
|
- LIBRARIES
|
|
- Health check: Health Checks give you the ability to monitor the health of your application by writing small tests which return either a healthy, degraded or unhealthy result.
|
|
- https://www.app-metrics.io/health-checks/
|
|
- APP METRICS
|
|
- https://github.com/AppMetrics/AppMetrics
|
|
- The different types of metrics are Gauges, Counters, Meters, Histograms, Timers and Application Performance Indexes
|
|
- METRICS of a system:
|
|
- Network. Network metrics are related to network bandwidth usage.
|
|
- System. System metrics are related to processor, memory, disk I/O, and network I/O.
|
|
- Platform. Platform metrics are related to ASP.NET, and the .NET common language runtime (CLR).
|
|
- Application. Application metrics include custom performance counters "Application Instrumentation".
|
|
- Service level. Service level metrics are related to your application, such as orders per second and searches per second.
|
|
- USEFUL INFO HERE FOR SYSTEM METRICS LIKE MEMORY ETC: This document from Microsoft gives generally accepted limits for things like CPU threshold, memory etc in actual percentages
|
|
- Section "System Resources" here https://msdn.microsoft.com/en-us/library/ff647791.aspx#scalenetchapt15_topic5
|
|
|
|
- USEFUL EXAMPLE dashboard for web applications:
|
|
- https://sandbox.stackify.com/Stacks/WebApps
|
|
|
|
|
|
- some kind of internal metrics to track changes over time in operations with thresholds to trigger logs maybe?
|
|
- Has to be super fast, maybe an internal counter / cache in memory and a periodic job that writes it out to the DB, i.e. don't write metrics to the db on every get operation etc
|
|
- Average response time?
|
|
- Busyness / unique logins or tokens in use? A way to see how many distinct users are connecting over a period of time so we know how utilized it is?
|
|
- Utilization?
|
|
- Areas / routes used in AyaNova and how often they are used (we could use this for feature utilization)
|
|
- CPU peak usage snapshot
|
|
- Disk space change over time snapshots
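The "internal counter in memory plus a periodic job that writes it out" idea from the list above might look roughly like this. It is a Python sketch for illustration only: MetricBuffer, increment, flush and the write_snapshot callback (standing in for whatever actually writes to the DB) are all hypothetical names.

```python
import threading
from collections import Counter
from typing import Callable, Dict


class MetricBuffer:
    """Counts events in memory; a scheduled job calls flush() to persist them."""

    def __init__(self, write_snapshot: Callable[[Dict[str, int]], None]) -> None:
        self._counts: Counter = Counter()
        self._lock = threading.Lock()
        self._write_snapshot = write_snapshot  # stand-in for the DB write

    def increment(self, name: str, amount: int = 1) -> None:
        # Hot path: a lock and an integer add, with no DB access.
        with self._lock:
            self._counts[name] += amount

    def flush(self) -> None:
        # Swap the counters out under the lock, then write outside it so
        # request threads are not blocked by the (slow) DB write.
        with self._lock:
            snapshot, self._counts = dict(self._counts), Counter()
        if snapshot:
            self._write_snapshot(snapshot)
```

Each request would call something like buffer.increment("GET /widgets") (a made-up route name), and the periodic job would call buffer.flush(), giving one DB write per interval instead of one per operation.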