Monitoring and Alerting

APTrust uses multiple monitoring tools to ensure system stability and to provide insight into resource usage.

Icinga2

The Icinga2 monitoring suite is based on Nagios, the de facto industry standard, and adds several features of its own. It alerts APTrust staff via Slack notifications about service interruptions or performance issues so that the ops team can react. Each alert can be "acknowledged" in the Icinga2 web interface, and an explanatory comment can be added to the acknowledgement. Icinga2 also keeps a history of past alerts for auditing purposes.
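
Icinga2 runs Nagios-compatible check plugins: a check is a program that prints one status line and exits with a code from 0 to 3. The following is a minimal sketch of such a check, not one of APTrust's actual plugins. The disk-usage metric, the function names `evaluate`, `check_disk` and `main`, and the 80/90 percent thresholds are assumptions chosen for illustration.

```python
import shutil

# Exit codes defined by the Nagios plugin convention, which Icinga2 also uses.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a plugin state (thresholds are illustrative)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, output_line) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    used_percent = 100.0 * usage.used / usage.total
    state = evaluate(used_percent, warn, crit)
    # Text after the pipe is performance data: label=value[unit];warn;crit
    line = (f"DISK {STATE_NAMES[state]} - {path} is {used_percent:.1f}% full"
            f" | used_percent={used_percent:.1f}%;{warn};{crit}")
    return state, line


def main(argv):
    """Print the status line and return the exit code for the monitoring system."""
    code, message = check_disk(argv[1] if len(argv) > 1 else "/")
    print(message)
    return code

# As a standalone plugin script, the file would end with:
#   import sys; sys.exit(main(sys.argv))
```

The exit code drives the alert state, while the performance data after the pipe is what a time-series backend can store for later graphing.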

Grafana & InfluxDB

InfluxDB stores time-series data about resource usage and performance. Grafana is a web front end and dashboard that visualizes that data. The data is fed by an Icinga2 plugin that runs on every instance. It also polls AWS directly. With the time-series data, the operations team can identify trends in resource usage over time and act accordingly, for example when making scaling decisions.
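
InfluxDB accepts data points in its text-based line protocol (`measurement,tags fields timestamp`). As a rough sketch of what a single metric looks like on the wire, the helper below formats one point. The function name `to_line`, the `disk` measurement, and the tag and field names are hypothetical and do not reflect what the APTrust plugin actually sends.

```python
import time


def _escape(value, chars):
    """Backslash-escape each character in `chars` within `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value):
    """Render a field value using line-protocol type rules."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line(measurement, tags, fields, timestamp_ns=None):
    """Build one line-protocol record; the timestamp is in nanoseconds."""
    if not fields:
        raise ValueError("at least one field is required")
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are recommended for write performance
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(f"{_escape(k, ',= ')}={_format_field(v)}" for k, v in fields.items())
    return f"{head} {body} {timestamp_ns}"


# Illustrative values only.
print(to_line("disk", {"host": "apt-demo-repo2", "path": "/"},
              {"used_percent": 14.0}, 1509926402000000000))
```

Tags (here the host and mount point) are indexed and suit the dimensions a Grafana dashboard filters on, while fields carry the measured values.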

Fail2ban

Fail2Ban scans SSH and Nginx log files to identify malicious activity. Once an IP address has been determined to be malicious, the tool puts it in a `jail` (server-local iptables firewall jail) to prevent any further malicious activity from that IP.
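
The underlying idea can be sketched as counting failures per IP inside a sliding time window and banning once a limit is reached. This is an illustration of the concept, not Fail2Ban's own code. The class `FailureTracker` and the simplified regular expression are hypothetical; `maxretry` and `findtime` borrow the names of the corresponding Fail2Ban settings, and the default values shown are assumptions. A real jail would then add a firewall rule, whereas this sketch only records the IP.

```python
import re
from collections import defaultdict, deque

# Simplified pattern for sshd authentication failures; real Fail2Ban
# filters cover many more message variants.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\d{1,3}(?:\.\d{1,3}){3})"
)


class FailureTracker:
    def __init__(self, maxretry=5, findtime=600):
        self.maxretry = maxretry            # failures allowed before a ban
        self.findtime = findtime            # window length in seconds
        self.failures = defaultdict(deque)  # ip -> timestamps of recent failures
        self.banned = set()

    def feed(self, timestamp, line):
        """Process one log line; return the IP if this line triggers a ban."""
        match = FAILED_LOGIN.search(line)
        if not match:
            return None
        ip = match.group(1)
        if ip in self.banned:
            return None
        window = self.failures[ip]
        window.append(timestamp)
        # Drop failures that have fallen out of the time window.
        while window and timestamp - window[0] > self.findtime:
            window.popleft()
        if len(window) >= self.maxretry:
            self.banned.add(ip)
            del self.failures[ip]
            return ip
        return None
```

Feeding the tracker three failed logins from the same address within the window, with `maxretry=3`, returns that address on the third line; failures spread out beyond `findtime` never accumulate to a ban.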

Logwatch

Logwatch is a customizable log analysis system. It parses a system's logs and creates a report covering the areas you specify. A daily cron job runs the program on each node and sends out a summary email. The operations staff spot-checks the Logwatch report emails. An example follows:

---------- Forwarded message ----------
From: Cron Daemon <ops@aptrust.org>
Date: Sun, Nov 5, 2017 at 7:00 PM
Subject: Cron <root@apt-demo-repo2> /usr/sbin/logwatch
To: ops@aptrust.org



 ################### Logwatch 7.4.0 (05/29/13) ####################
        Processing Initiated: Mon Nov  6 00:00:02 2017
        Date Range Processed: yesterday
                              ( 2017-Nov-05 )
                              Period is day.
        Detail Level of Output: 5
        Type of Output/Format: stdout / text
        Logfiles for Host: apt-demo-repo2
 ##################################################################

 --------------------- Cron Begin ------------------------

 Commands Run:
    User root:
          cd / && run-parts --report /etc/cron.hourly: 24 Time(s)
       /usr/sbin/logwatch: 1 Time(s)
       test -x /usr/bin/certbot -a \! -d /run/systemd/system && perl -e 'sleep int(rand(3600))' && certbot -q renew: 2 Time(s)
       test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ): 1 Time(s)
       test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly ): 1 Time(s)
    User ubuntu:
       . $HOME/.profile; /var/www/demo.aptrust.org/pharos/current/bin/pharos_notify.py >> /var/www/demo.aptrust.org/pharos/current/log/cron_pharos_notify.log 2>&1: 1440 Time(s)
       /etc/psql_backup/pg_backup_rotated.sh >> /var/log/psql_backup.log 2>&1: 1 Time(s)

 ---------------------- Cron End -------------------------


 --------------------- httpd Begin ------------------------

 144.53 MB transferred in 22982 responses  (1xx 0, 2xx 22944, 3xx 36, 4xx 2, 5xx 0)
        4 Images (0.02 MB),
        4 Documents (0.04 MB),
     8463 Content pages (41.26 MB),
        1 Redirects (0.00 MB),
    14510 Other (103.21 MB)

 Requests with error response codes
    400 Bad Request
       /w00tw00t.at.ISC.SANS.DFind:): 1 Time(s)
    404 Not Found
       /a2billing/admin/Public/index.php: 1 Time(s)

 A total of 5 ROBOTS were logged

 ---------------------- httpd End -------------------------


 --------------------- pam_unix Begin ------------------------

 cron:
    Sessions Opened:
       ubuntu: 1441 Time(s)
       root: 29 Time(s)

 sshd:
    Sessions Opened:
       cd3ef: 1 Time(s)

 sudo:
    Sessions Opened:
       root -> root: 36 Time(s)
       root -> ubuntu: 6 Time(s)


 ---------------------- pam_unix End -------------------------


 --------------------- Postfix Begin ------------------------

 ****** Summary *************************************************************************************

    5.010K  Bytes accepted                               5,130
    5.010K  Bytes sent via SMTP                          5,130
 ========   ==================================================

        1   Accepted                                   100.00%
 --------   --------------------------------------------------
        1   Total                                      100.00%
 ========   ==================================================

        1   Removed from queue
        1   Sent via SMTP

 ****** Detail (1) **********************************************************************************

        1   Sent via SMTP ---------------------------------------------------------------------------
        1      aptrust.org

 === Delivery Delays Percentiles ============================================================
                     0%       25%       50%       75%       90%       95%       98%      100%
 --------------------------------------------------------------------------------------------
 Before qmgr       0.01      0.01      0.01      0.01      0.01      0.01      0.01      0.01
 In qmgr           0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
 Conn setup        0.17      0.17      0.17      0.17      0.17      0.17      0.17      0.17
 Transmission      0.29      0.29      0.29      0.29      0.29      0.29      0.29      0.29
 Total             0.48      0.48      0.48      0.48      0.48      0.48      0.48      0.48
 ============================================================================================

 ---------------------- Postfix End -------------------------


 --------------------- SSHD Begin ------------------------


 Users logging in through sshd:
    cd3ef:
       216.197.76.234 (c-va-3d8452d8df-23486-1.tingfiber.com): 1 time


 Received disconnect:
    11: disconnected by user
       216.197.76.234 : 1 Time(s)

 ---------------------- SSHD End -------------------------


 --------------------- Sudo (secure-log) Begin ------------------------


 cd3ef => root
 -------------
 /bin/sh                        -  36 Time(s).

 cd3ef => ubuntu
 ---------------
 /bin/sh                        -   6 Time(s).

 ---------------------- Sudo (secure-log) End -------------------------


 --------------------- Disk Space Begin ------------------------

 Filesystem                                 Size  Used Avail Use% Mounted on
 udev                                       2.0G   12K  2.0G   1% /dev
 /dev/xvda1                                  32G  4.0G   27G  14% /
 fs-97ff5bde.efs.us-east-1.amazonaws.com:/  8.0E  3.7G  8.0E   1% /mnt/efs/apt


 ---------------------- Disk Space End -------------------------


 ###################### Logwatch End #########################
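
Spot checks of reports like the one above could be assisted by a small script. The sketch below pulls the httpd totals out of a report and flags it for a closer look. It is not part of Logwatch or of APTrust's tooling; the function names and the review thresholds (`max_5xx`, `max_4xx_ratio`) are hypothetical.

```python
import re

# Matches the httpd summary line, e.g.
# "144.53 MB transferred in 22982 responses  (1xx 0, 2xx 22944, ...)"
SUMMARY = re.compile(r"([\d.]+) MB transferred in (\d+) responses\s+\((.*?)\)")


def parse_httpd_summary(report_text):
    """Return the httpd totals from a Logwatch report, or None if absent."""
    match = SUMMARY.search(report_text)
    if not match:
        return None
    by_class = {
        name: int(count)
        for name, count in re.findall(r"(\dxx) (\d+)", match.group(3))
    }
    return {
        "megabytes": float(match.group(1)),
        "responses": int(match.group(2)),
        "by_class": by_class,
    }


def needs_review(summary, max_5xx=0, max_4xx_ratio=0.05):
    """Flag a report with server errors or an unusual share of client errors."""
    if summary["by_class"].get("5xx", 0) > max_5xx:
        return True
    if summary["responses"] == 0:
        return False
    return summary["by_class"].get("4xx", 0) / summary["responses"] > max_4xx_ratio
```

For the sample report, `parse_httpd_summary` finds 22982 responses with no 5xx and two 4xx responses, so `needs_review` returns `False` under these assumed thresholds.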

AWS CloudWatch

By default, all instances in Amazon Web Services have basic CloudWatch monitoring enabled, which provides seven pre-selected metrics at five-minute frequency and three status check metrics at one-minute frequency[1]. All production instances (EC2 and RDS) have detailed CloudWatch monitoring enabled, which includes additional metrics.

Currently no CloudWatch alarms are enabled on these metrics; instead, the metrics are used for post-mortem analysis or as a second layer of metrics. APTrust uses a combination of Grafana and Icinga2 metrics for troubleshooting and alarms.