SYSOPS HEALTH CHECK / METRICS

https://www.datadoghq.com/blog/monitoring-101-collecting-data/

OK, considered this and a log is a log and all logs are relevant to sysops people, so I'm going to treat all logging the same regardless and make an effort to ensure each log entry is tagged with the relevant class name.

CRITICAL ISSUES
- Check for critical issues in a periodic health check job, which also logs and gathers metrics
- Critical issues should be logged first, then sent via notification to system operators if subscribed

METRICS
- Metrics should be gathered in the DB and reported on via the UI for ops users, and potentially in other formats down the road

TODO LIST OF THINGS CODED THAT NEED TO BE LOGGED
- Items in code tagged with this:
    - //TODO: core-log-sysop
- Generator failures
- IJobBiz derived objects failures
- Configuration changes ???
- Install and uninstall feature changes
- Warnings (low disk space, slowness monitoring, db issues) (during health check JOB??)

"HEALTH CHECK" JOB
- Things that need to be tracked as metrics are commented with //OPSMETRIC
- Maybe a "health check" job or "checkup" job that periodically assesses things and reports findings
    - Works in conjunction with metrics gathered maybe?
- Metrics would be a system that, for example, could get free disk space, then get it again a few days later and project ahead to when it will run low and warn, or simply warn when down to 10% free, etc.
- Anything we'd like to see from a support point of view would be useful too
- Go over the research doc to see what was recommended
- Dig up that guy's example project on his blog that he was going to add metrics to
- Brainstorm a list of recent support issues and what could be a benefit in dealing with them
    - "Slowness" comes up a lot
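
Rough sketch of the "get free disk space, get it again a few days later and project ahead" idea from the health check notes. It is written in Python only to keep it short; nothing here is existing code, the function name and parameters are made up for illustration, and the straight-line projection between two samples is an assumption (the notes do not say how the projection should be done). The 10% default comes from the note above.

```python
from datetime import datetime
from typing import Optional


def project_days_until_low(
    earlier_time: datetime,
    earlier_free_bytes: int,
    later_time: datetime,
    later_free_bytes: int,
    total_bytes: int,
    low_fraction: float = 0.10,  # "when down to 10% warn"
) -> Optional[float]:
    """Estimate days from later_time until free space reaches the threshold.

    Returns 0.0 if free space is already at or below the threshold, and
    None if free space is not shrinking (nothing to project).
    """
    threshold = total_bytes * low_fraction
    if later_free_bytes <= threshold:
        return 0.0  # simple case: already low, warn now

    elapsed_days = (later_time - earlier_time).total_seconds() / 86400
    if elapsed_days <= 0:
        return None  # samples are out of order or taken at the same time

    # Assumption: usage keeps growing at the rate seen between the samples.
    used_per_day = (earlier_free_bytes - later_free_bytes) / elapsed_days
    if used_per_day <= 0:
        return None  # free space is steady or growing

    return (later_free_bytes - threshold) / used_per_day
```

The two samples would come from whatever the health check job stores as its disk space snapshots. The job could then log a warning when the result drops below some number of days, which would also need to be decided.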
Ops Metrics

CONFIRMED REQUIRED
- Gather in memory and flush to db on a schedule is best
- CASE 3562 If found, count of mismatch of attached files in database vs file system
- CASE 3523 Log major ops related configuration changes (before and after snapshot)
- CASE 3502 Log feature or route or endpoint usage count as a snapshot metric so we can compare month to month
- CASE 3502 Log record count in each table, or at least the major ones, as a snapshot metric so we can compare month to month
- CASE 3497 ACTIVE user count
    - Log user login, last login and logins per X period
- CASE 3499 "Slow": I want to know if anything is slow, not what the user says but what the code determines

RESEARCH / IDEAS / EXAMPLES
- Metric types:
    - https://www.app-metrics.io/getting-started/metric-types/
- Code example that deals with this issue:
    - https://github.com/AppMetrics/AppMetrics/tree/dev/src/App.Metrics.Core
- Need more than one window into the data, for example we need a last few minutes (5?) view so people can see at a glance what is happening NOW
    - But also need to know what it was historically. So maybe we need a NOW algorithm but also a HISTORICAL algorithm.
    - Maybe a sliding scale of recency, so a 5 minute view, a THIS WEEK view and then a month to month view beyond that??
- LIBRARIES
    - Health check
        - Health Checks give you the ability to monitor the health of your application by writing small tests which return either a healthy, degraded or unhealthy result.
        - https://www.app-metrics.io/health-checks/
    - APP METRICS
        - https://github.com/AppMetrics/AppMetrics
        - Different types of metrics are Gauges, Counters, Meters, Histograms, Timers and Application Performance Indexes
- METRICS of a system:
    - Network. Network metrics are related to network bandwidth usage.
    - System. System metrics are related to processor, memory, disk I/O, and network I/O.
    - Platform. Platform metrics are related to ASP.NET, and the .NET common language runtime (CLR).
    - Application.
      Application metrics include custom performance counters ("Application Instrumentation").
    - Service level. Service level metrics are related to your application, such as orders per second and searches per second.
- USEFUL INFO HERE FOR SYSTEM METRICS LIKE MEMORY ETC:
    - This document from Microsoft gives generally accepted limits for things like CPU threshold, memory etc. in actual percentages
    - Section "System Resources" here https://msdn.microsoft.com/en-us/library/ff647791.aspx#scalenetchapt15_topic5
- USEFUL EXAMPLE dashboard for web applications:
    - https://sandbox.stackify.com/Stacks/WebApps
- Some kind of internal metrics to track changes over time in operations, with thresholds to trigger logs maybe?
    - Has to be super fast, maybe an internal counter / cache in memory and a periodic job that writes it out to DB, i.e. don't write metrics to the db on every get operation etc.
    - Average response time?
    - Busyness / unique logins or tokens in use? A way to see how many distinct users are connecting over a period of time so we know how utilized it is?
    - Utilization?
    - Areas / routes used in AyaNova and how often they are used (we could use this for feature utilization)
    - CPU peak usage snapshot
    - Disk space change over time snapshots
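
Rough sketch of the "internal counter in memory and a periodic job that writes it out to DB" idea. Again Python just for brevity, and everything in it is hypothetical: the class name, the `ops_metric` table and its columns, and the use of SQLite as a stand-in for the real database. The point is only the shape: the hot path touches memory, and the database is written once per flush.

```python
import sqlite3
import threading
import time
from collections import Counter


class OpsMetricBuffer:
    """Counts events in memory; flush() writes one snapshot row per metric."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._lock = threading.Lock()
        self._counts: Counter = Counter()
        # Hypothetical table layout, for illustration only.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ops_metric "
            "(taken_at REAL, name TEXT, value INTEGER)"
        )

    def increment(self, name: str, amount: int = 1) -> None:
        # Hot path: a lock and a dictionary update, no database access.
        with self._lock:
            self._counts[name] += amount

    def flush(self) -> int:
        """Write the current counts as snapshot rows; return the row count.

        Meant to be called by the periodic job, not by request code.
        """
        # Swap the counters out under the lock so no increments are lost.
        with self._lock:
            snapshot, self._counts = self._counts, Counter()
        now = time.time()
        rows = [(now, name, value) for name, value in snapshot.items()]
        with self._conn:  # commits on success, rolls back on error
            self._conn.executemany(
                "INSERT INTO ops_metric VALUES (?, ?, ?)", rows
            )
        return len(rows)
```

Route usage counts (CASE 3502) would be a call like `increment("route:some-route")` from the request pipeline. Because each flush stores a timestamped row, the month to month comparison becomes a query over the stored snapshots rather than something tracked in memory.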