Suppressing Unnecessary Alerts Caused by Scheduled Redshift Pause and Resume

2021.07.19

Written by:

Tokugawa

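Introduction

[Update: August 23, 2021]
Resuming a Redshift cluster can take more than 15 minutes. When that happened, the Lambda function timed out and the alerts were not suppressed.
To solve this problem, I revised the IAM policy and the Lambda source code.

This article describes how to use AWS Lambda to control the scheduled pause and resume of Amazon Redshift.

As background, CloudWatch alarms were sending unnecessary alert emails whenever a Redshift cluster was paused or resumed.
The goal is to suppress those alerts by having Lambda control both the cluster pause/resume and the disabling/enabling of the CloudWatch alarm actions.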

Design

・Prepare one Lambda function.
・Prepare at least two CloudWatch rules per Redshift cluster, each with the Lambda function as its target (one rule for pausing the cluster and one for resuming it).

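Setup procedure

1. Create the IAM policy

First, create an IAM policy.
This policy will be attached to the IAM role used by the Lambda function.

Use the following JSON as a reference.
The lambda:InvokeFunction statement is there so that the function can invoke itself again when the cluster takes a long time to resume.
I named the policy "redshift-alert-reduction_policy".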

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowRedshiftClusterManagement",
            "Action": [
                "redshift:ResumeCluster",
                "redshift:PauseCluster"
            ],
            "Resource": [
                "arn:aws:redshift:ap-northeast-1:[account ID]:cluster:[cluster name]"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "AllowDescribeRedshiftClusters",
            "Action": [
                "redshift:DescribeClusters"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "AllowInvokeLambdaFunction",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:ap-northeast-1:[account ID]:function:redshift_alert_reduction"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "AllowCWAlarmManagement",
            "Action": [
                "cloudwatch:DisableAlarmActions",
                "cloudwatch:EnableAlarmActions"
            ],
            "Resource": [
                "arn:aws:cloudwatch:ap-northeast-1:[account ID]:alarm:*"
            ],
            "Effect": "Allow"
        }
    ]
}
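
If the same policy is needed for several clusters, filling in the placeholders by hand gets error-prone. The following is a minimal sketch that generates the policy document above from parameters. The function name build_alert_reduction_policy, its arguments, and the example account ID and cluster name are assumptions made for illustration; they are not part of the original setup.

```python
import json


def build_alert_reduction_policy(account_id, cluster_name,
                                 region="ap-northeast-1",
                                 function_name="redshift_alert_reduction"):
    """Return the IAM policy document as a dict with the placeholders filled in."""
    def statement(sid, actions, resource):
        return {"Sid": sid, "Effect": "Allow",
                "Action": actions, "Resource": [resource]}

    return {
        "Version": "2012-10-17",
        "Statement": [
            # Pause/resume is limited to the one target cluster.
            statement("AllowRedshiftClusterManagement",
                      ["redshift:ResumeCluster", "redshift:PauseCluster"],
                      f"arn:aws:redshift:{region}:{account_id}:cluster:{cluster_name}"),
            statement("AllowDescribeRedshiftClusters",
                      ["redshift:DescribeClusters"], "*"),
            # Lets the function re-invoke itself.
            statement("AllowInvokeLambdaFunction",
                      ["lambda:InvokeFunction"],
                      f"arn:aws:lambda:{region}:{account_id}:function:{function_name}"),
            statement("AllowCWAlarmManagement",
                      ["cloudwatch:DisableAlarmActions",
                       "cloudwatch:EnableAlarmActions"],
                      f"arn:aws:cloudwatch:{region}:{account_id}:alarm:*"),
        ],
    }


if __name__ == "__main__":
    # 123456789012 and my-cluster are made-up example values.
    policy = build_alert_reduction_policy("123456789012", "my-cluster")
    print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the JSON tab of the IAM console, or the dict can be serialized with json.dumps and passed as PolicyDocument to boto3's iam.create_policy.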

2. Create the IAM role

Next, create an IAM role.
Here I named the role "lambda-alert-reduction-role".

When creating the role, select Lambda under "Choose a use case".
Attach the following two policies.

・AWSLambdaBasicExecutionRole
・redshift-alert-reduction_policy
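
Selecting Lambda as the use case in the console amounts to giving the role a trust policy that lets the Lambda service assume it. For readers who create the role with a script instead, here is a rough sketch of the two pieces of data involved. The helper names are hypothetical, and the customer-managed policy ARN assumes the policy was created with the default path "/".

```python
import json


def build_lambda_trust_policy():
    """Trust policy that allows the Lambda service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def policy_arns_to_attach(account_id):
    """ARNs of the two policies attached to lambda-alert-reduction-role."""
    return [
        # AWS managed policy that grants CloudWatch Logs write access.
        "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
        # Customer managed policy from step 1 (assumes the default path "/").
        f"arn:aws:iam::{account_id}:policy/redshift-alert-reduction_policy",
    ]


if __name__ == "__main__":
    print(json.dumps(build_lambda_trust_policy(), indent=4))
    # 123456789012 is a made-up example account ID.
    for arn in policy_arns_to_attach("123456789012"):
        print(arn)
```

With boto3, the trust policy would be passed as AssumeRolePolicyDocument to iam.create_role, and each ARN as PolicyArn to iam.attach_role_policy.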

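3. Create the Lambda function

Create the Lambda function. The runtime is Python 3.8.
I named the function "redshift_alert_reduction".

Assign the role created above, "lambda-alert-reduction-role", to the function.
Set the timeout to 15 minutes (900 seconds, the maximum Lambda allows).

Source code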

import json
import boto3
import logging
import time

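def get_vein_logger():
    """Return the root logger with the VEIN log format applied to its handlers."""
    log_format = '[VEIN-%(levelname)s][%(aws_request_id)s][%(funcName)s:%(lineno)d]\t%(message)s'
    formatter = logging.Formatter(log_format)
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    for handler in logger.handlers:
        handler.setFormatter(formatter)
    # Return after the loop so that every handler is formatted and a logger
    # is returned even when no handlers are registered.
    return logger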

logger = get_vein_logger()


def control_cloudwatch_alarm(cluster_name, action):
    client = boto3.client('cloudwatch')

    alarm_disk = f'VEIN-ALERT Redshift {cluster_name} - Disk Space Used'
    alarm_health = f'VEIN-ALERT Redshift {cluster_name} - Health Status'
    alarm_cpu = f'VEIN-ALERT Redshift {cluster_name} - CPU Utilization'

    if action:
        try:
            client.enable_alarm_actions(
                AlarmNames=[
                    alarm_disk,
                    alarm_health,
                    alarm_cpu
                ]
            )
            logger.info("Enabling CloudWatch Alarm")
        except Exception as e:
            logger.error("Exception: {}".format(e))
    else:
        try:
            client.disable_alarm_actions(
                AlarmNames=[
                    alarm_disk,
                    alarm_health,
                    alarm_cpu
                ]
            )
            logger.info("Disabling CloudWatch Alarm")
        except Exception as e:
            logger.error("Exception: {}".format(e))


def control_redshift_cluster(cluster_name, action):
    client = boto3.client('redshift')

    if action:
        try:
            cluster_status = check_redshift_cluster_status(cluster_name)
            if cluster_status == 'Available':
                #sleep 10 minutes and enable cloudwatch alarm
                time.sleep(600)
                control_cloudwatch_alarm(cluster_name, action)
                return
            elif cluster_status == 'Modifying':
                pass
            elif cluster_status == 'Paused':
                client.resume_cluster(
                    ClusterIdentifier=cluster_name
                )
                logger.info("Resuming Redshift cluster")
            else:
                #sleep 10 minutes and enable cloudwatch alarm
                time.sleep(600)
                control_cloudwatch_alarm(cluster_name, action)
                return

            #sleep 8 minutes and check status of redshift cluster
            time.sleep(480)
            cluster_status = check_redshift_cluster_status(cluster_name)

            #judge if cloudwatch alarm can be enabled
            if cluster_status == 'Available':
                #sleep 5 minutes and enable cloudwatch alarm
                time.sleep(300)
                control_cloudwatch_alarm(cluster_name, action)
            else:
                logger.info("Invoking additional Lambda function to prevent 15-minute timeout")
                invoke_lambda_function(cluster_name)
        except Exception as e:
            logger.error("Exception: {}".format(e))
    else:
        try:
            client.pause_cluster(
                ClusterIdentifier=cluster_name
            )
            logger.info("Pausing Redshift cluster")
        except Exception as e:
            logger.error("Exception: {}".format(e))


def check_redshift_cluster_status(cluster_name):
    client = boto3.client('redshift')

    response = client.describe_clusters(
        ClusterIdentifier=cluster_name
    )
    cluster_status = response['Clusters'][0]['ClusterAvailabilityStatus']

    return cluster_status


def invoke_lambda_function(cluster_name):
    event_dict={
        "cluster_name": cluster_name,
        "action": True
    }
    event = json.dumps(event_dict)

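    # Asynchronously re-invoke this same function. The name must match the
    # function created in step 3 and the ARN allowed in the IAM policy.
    boto3.client('lambda').invoke(
        FunctionName='redshift_alert_reduction',
        InvocationType='Event',
        Payload=event
    )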


def lambda_handler(event, context):
    # Redshift cluster name
    cluster_name = event['cluster_name']
    # If action is true, resume the cluster and enable the CloudWatch alarms.
    # If action is false, disable the alarms and pause the cluster.
    action = event['action']

    if action is True:
        control_redshift_cluster(cluster_name, action)
    elif action is False:
        control_cloudwatch_alarm(cluster_name, action)
        # Wait until the CloudWatch alarm actions are disabled
        time.sleep(60)
        control_redshift_cluster(cluster_name, action)

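Notes on the source code

・The CloudWatch alarm names are set in the variables alarm_disk, alarm_health, and alarm_cpu at the top of control_cloudwatch_alarm. These are the alarms that get suppressed, so change the names to match your environment.
・The time.sleep(480) call in control_redshift_cluster waits 8 minutes after the cluster starts resuming. If the cluster is Available at that point, the function goes on to enable the CloudWatch alarms.
・Enabling the CloudWatch alarms immediately after the cluster becomes Available can still trigger alerts. For that reason, the time.sleep(300) call adds a 5-minute wait before the alarms are enabled.
・Lambda times out if the total wait in a single invocation exceeds 15 minutes, so the waits are set to 8 minutes and 5 minutes, 13 minutes in total, to leave some margin.
・If the cluster is not yet Available 8 minutes after the resume, the function invokes itself (redshift_alert_reduction) again and repeats the process until the cluster becomes Available. This avoids Lambda's 15-minute timeout while keeping the alert suppression working.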

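The fixed 8-minute and 5-minute waits described above could also be expressed as a polling loop that watches the time remaining in the invocation. The following is only a sketch of that alternative, not the code used in this article. The function wait_until_available, the 30-second polling interval, and the 1-minute safety margin are all assumptions. It covers only the waiting part, so resume_cluster would still have to be called beforehand.

```python
import time

SAFETY_MARGIN_MS = 60_000  # assumed margin: keep at least 1 minute in reserve


def wait_until_available(get_status, remaining_ms, settle_seconds=300,
                         poll_seconds=30, sleep=time.sleep):
    """Poll until the cluster is 'Available', then wait settle_seconds.

    get_status   -- callable returning the current ClusterAvailabilityStatus
    remaining_ms -- callable returning the milliseconds left in this invocation
    Returns True when the alarms can be enabled in this invocation, or False
    when time is running out and the caller should re-invoke the function.
    """
    # Time that must stay available for the settle wait plus the margin.
    reserve_ms = settle_seconds * 1000 + SAFETY_MARGIN_MS
    while remaining_ms() > reserve_ms + poll_seconds * 1000:
        if get_status() == 'Available':
            # Alerts can still fire right after the cluster becomes Available.
            sleep(settle_seconds)
            return True
        sleep(poll_seconds)
    return False
```

Inside the Lambda function this might be called as wait_until_available(lambda: check_redshift_cluster_status(cluster_name), context.get_remaining_time_in_millis), enabling the alarms on True and calling invoke_lambda_function on False. Passing the status and clock functions in as arguments also makes the loop testable without AWS.
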
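4. Configure the CloudWatch rules

As described in the design section, set up at least two rules per Redshift cluster.
If the cluster is paused and resumed at several different times, you need a rule for each of those times.

Specify the Lambda function created above as the target.
Under "Configure input", select "Constant (JSON text)" and enter the JSON below.
When action in the JSON is true: resume Redshift, then enable the CloudWatch alarms
When action in the JSON is false: disable the CloudWatch alarms, then pause Redshift

●JSON to enter as the constant (resuming the cluster)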

{ "cluster_name": "cluster name", "action": true}

●JSON to enter as the constant (pausing the cluster)

{ "cluster_name": "cluster name", "action": false}
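
Note that lambda_handler only acts when action is a real JSON boolean. If the constant is entered as "true" (a string), neither branch runs and nothing is logged. A small validation helper along the following lines could make such mistakes visible. parse_event is a hypothetical name and is not part of the source code in this article.

```python
def parse_event(event):
    """Validate the rule's constant JSON and return (cluster_name, action)."""
    cluster_name = event.get("cluster_name")
    action = event.get("action")
    if not isinstance(cluster_name, str) or not cluster_name:
        raise ValueError("cluster_name must be a non-empty string")
    # Reject "true", 1, None and so on: only JSON true/false are accepted.
    if not isinstance(action, bool):
        raise ValueError("action must be a JSON boolean (true or false)")
    return cluster_name, action


if __name__ == "__main__":
    # my-cluster is a made-up example value.
    print(parse_event({"cluster_name": "my-cluster", "action": True}))
```

Calling it at the top of lambda_handler would turn a mistyped constant into an error in the function's logs instead of a silent no-op.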

I named the rules as follows.
Rule 1: redshift_[cluster name]_resume
Rule 2: redshift_[cluster name]_pause
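
For reference, the two rules and their constant inputs could also be prepared in code. The sketch below only builds the request parameters and does not call AWS. The function name build_rule_requests, the target Id "1", and the cron times in the usage example are assumptions chosen for illustration.

```python
import json


def build_rule_requests(cluster_name, function_arn, resume_cron, pause_cron):
    """Return a list of (put_rule kwargs, put_targets kwargs) for both rules."""
    requests = []
    for suffix, action, cron in (("resume", True, resume_cron),
                                 ("pause", False, pause_cron)):
        rule_name = f"redshift_{cluster_name}_{suffix}"
        rule = {
            "Name": rule_name,
            "ScheduleExpression": f"cron({cron})",  # evaluated in UTC
            "State": "ENABLED",
        }
        targets = {
            "Rule": rule_name,
            "Targets": [{
                "Id": "1",
                "Arn": function_arn,
                # Constant JSON that the Lambda handler receives as `event`.
                "Input": json.dumps({"cluster_name": cluster_name,
                                     "action": action}),
            }],
        }
        requests.append((rule, targets))
    return requests


if __name__ == "__main__":
    # The ARN and the schedule times are made-up example values.
    arn = "arn:aws:lambda:ap-northeast-1:123456789012:function:redshift_alert_reduction"
    for rule, targets in build_rule_requests(
            "my-cluster", arn, "0 23 ? * SUN-THU *", "0 11 ? * MON-FRI *"):
        print(rule["Name"], rule["ScheduleExpression"], targets["Targets"][0]["Input"])
```

With boto3, each pair would be passed to events.put_rule(**rule) and events.put_targets(**targets). When rules are created through the API rather than the console, the Lambda function also needs a resource-based permission allowing events.amazonaws.com to invoke it, which the console normally adds for you.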

That completes the setup.

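Conclusion

In this article, I used Lambda to control Redshift and CloudWatch so that unnecessary alerts are suppressed and important alerts are not overlooked.
As of the time of writing, CloudWatch has no built-in feature for suppressing alerts in this way, so I hope the approach described here is of some help.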
