grpc/tools/run_tests/xds_k8s_test_driver/README.md

# xDS Kubernetes Interop Tests

Proxyless Security Mesh Interop Tests executed on Kubernetes.

### Experimental
Work in progress. Internal APIs may and will change. Please refrain from making
changes to this codebase at the moment.

### Stabilization roadmap 
- [ ] Replace retrying with tenacity
- [ ] Generate namespace for each test to prevent resource name conflicts and
      allow running tests in parallel
- [ ] Security: run server and client in separate namespaces
- [ ] Make framework.infrastructure.gcp resources [first-class
      citizen](https://en.wikipedia.org/wiki/First-class_citizen), support
      simpler CRUD
- [x] Security: manage `roles/iam.workloadIdentityUser` role grant lifecycle for
      dynamically-named namespaces 
- [ ] Restructure `framework.test_app` and `framework.xds_k8s*` into a module
      containing xDS-interop-specific logic
- [ ] Address inline TODOs in code
- [x] Improve README.md documentation, explain helpers in bin/ folder

## Installation

#### Requirements
1. Python v3.6+
2. [Google Cloud SDK](https://cloud.google.com/sdk/docs/install)
3. Configured GKE cluster

#### Configure GKE cluster
This is an example outlining minimal requirements to run `tests.baseline_test`.  
For more details, and for the setup for security tests, see
["Setting up Traffic Director service security with proxyless gRPC"](https://cloud.google.com/traffic-director/docs/security-proxyless-setup)
 user guide.
 
Update gloud sdk:
```shell
gcloud -q components update
```

Pre-populate environment variables for convenience. To find project id, refer to
[Identifying projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).
```shell
export PROJECT_ID="your-project-id"
export PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" --format="value(projectNumber)")
# Compute Engine default service account
export GCE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
# The prefix to name GCP resources used by the framework
export RESOURCE_PREFIX="xds-k8s-interop-tests"

# The zone name your cluster, f.e. xds-k8s-test-cluster
export CLUSTER_NAME="${RESOURCE_PREFIX}-cluster"
# The zone of your cluster, f.e. us-central1-a
export ZONE="us-central1-a" 
# Dedicated GCP Service Account to use with workload identity.
export WORKLOAD_SA_NAME="${RESOURCE_PREFIX}"
export WORKLOAD_SA_EMAIL="${WORKLOAD_SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
```

##### Create the cluster 
Minimal requirements: [VPC-native](https://cloud.google.com/traffic-director/docs/security-proxyless-setup)
cluster with [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) enabled. 
```shell
gcloud beta container clusters create "${CLUSTER_NAME}" \
 --zone="${ZONE}" \
 --enable-ip-alias \
 --workload-pool="${PROJECT_ID}.svc.id.goog" \
 --workload-metadata=GKE_METADATA \
 --tags=allow-health-checks
```

##### Create the firewall rule
Allow [health checking mechanisms](https://cloud.google.com/traffic-director/docs/set-up-proxyless-gke#creating_the_health_check_firewall_rule_and_backend_service)
to query the workloads health.  
This step can be skipped, if the driver is executed with `--ensure_firewall`.
```shell
gcloud compute firewall-rules create "${RESOURCE_PREFIX}-allow-health-checks" \
  --network=default --action=allow --direction=INGRESS \
  --source-ranges="35.191.0.0/16,130.211.0.0/22" \
  --target-tags=allow-health-checks \
  --rules=tcp:8080-8100
```

##### Setup GCP Service Account

Create dedicated GCP Service Account to use
with [workload identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity).

```shell
gcloud iam service-accounts create "${WORKLOAD_SA_NAME}" \
  --display-name="xDS K8S Interop Tests Workload Identity Service Account"
```

Enable the service account to [access the Traffic Director API](https://cloud.google.com/traffic-director/docs/prepare-for-envoy-setup#enable-service-account).
```shell
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
   --member="serviceAccount:${WORKLOAD_SERVICE_ACCOUNT}" \
   --role="roles/trafficdirector.client"
```

##### Allow test driver to configure workload identity automatically
Test driver will automatically grant `roles/iam.workloadIdentityUser` to 
allow the Kubernetes service account to impersonate the dedicated GCP workload
service account (corresponds to the step 5
of [Authenticating to Google Cloud](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to)).
This action requires the test framework to have `iam.serviceAccounts.create`
permission on the project.

If you're running test framework locally, and you have `roles/owner` to your
project, **you can skip this step**.  
If you're configuring the test framework to run on a CI: use `roles/owner`
account once to allow test framework to grant `roles/iam.workloadIdentityUser`.

```shell
# Assuming CI is using Compute Engine default service account.
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${GCE_SA}" \
  --role="roles/iam.serviceAccountAdmin" \
  --condition-from-file=<(cat <<-END
---
title: allow_workload_identity_only
description: Restrict serviceAccountAdmin to granting role iam.workloadIdentityUser
expression: |-
  api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', [])
        .hasOnly(['roles/iam.workloadIdentityUser'])
END
)
```

##### Configure GKE cluster access
```shell
# Configuring GKE cluster access for kubectl
gcloud container clusters get-credentials "your_gke_cluster_name" --zone "your_gke_cluster_zone"

# Save generated kube context name
export KUBE_CONTEXT="$(kubectl config current-context)"
``` 

#### Install python dependencies

```shell
# Create python virtual environment
python3.6 -m venv venv

# Activate virtual environment
. ./venv/bin/activate

# Install requirements
pip install -r requirements.txt

# Generate protos
python -m grpc_tools.protoc --proto_path=../../../ \
    --python_out=. --grpc_python_out=. \
    src/proto/grpc/testing/empty.proto \
    src/proto/grpc/testing/messages.proto \
    src/proto/grpc/testing/test.proto
```

# Basic usage

### xDS Baseline Tests

Test suite meant to confirm that basic xDS features work as expected. Executing
it before other test suites will help to identify whether test failure related
to specific features under test, or caused by unrelated infrastructure
disturbances.

The client and server images are created based on Git commit hashes, but not
every single one of them. It is triggered nightly and per-release. For example,
the commit we are using below (`d22f93e1ade22a1e026b57210f6fc21f7a3ca0cf`) comes
from branch `v1.37.x` in `grpc-java` repo.

```shell
# Help
python -m tests.baseline_test --help
python -m tests.baseline_test --helpfull

# Run on grpc-testing cluster
python -m tests.baseline_test \
  --flagfile="config/grpc-testing.cfg" \
  --kube_context="${KUBE_CONTEXT}" \
  --server_image="gcr.io/grpc-testing/xds-interop/java-server:d22f93e1ade22a1e026b57210f6fc21f7a3ca0cf" \
  --client_image="gcr.io/grpc-testing/xds-interop/java-client:d22f93e1ade22a1e026b57210f6fc21f7a3ca0cf"
```

### xDS Security Tests
```shell
# Help
python -m tests.security_test --help
python -m tests.security_test --helpfull

# Run on grpc-testing cluster
python -m tests.security_test \
  --flagfile="config/grpc-testing.cfg" \
  --kube_context="${KUBE_CONTEXT}" \
  --server_image="gcr.io/grpc-testing/xds-interop/java-server:d22f93e1ade22a1e026b57210f6fc21f7a3ca0cf" \
  --client_image="gcr.io/grpc-testing/xds-interop/java-client:d22f93e1ade22a1e026b57210f6fc21f7a3ca0cf"
```

### Test namespace
It's possible to run multiple xDS interop test workloads in the same project.
But we need to ensure the name of the global resources won't conflict. This can
be solved by supplying `--namespace` and `--server_xds_port`. The xDS port needs
to be unique across the entire project (default port range is [8080, 8280],
avoid if possible). Here is an example:

```shell
python3 -m tests.baseline_test \
  --flagfile="config/grpc-testing.cfg" \
  --kube_context="${KUBE_CONTEXT}" \
  --server_image="gcr.io/grpc-testing/xds-interop/java-server:d22f93e1ade22a1e026b57210f6fc21f7a3ca0cf" \
  --client_image="gcr.io/grpc-testing/xds-interop/java-client:d22f93e1ade22a1e026b57210f6fc21f7a3ca0cf" \
  --namespace="box-$(date +"%F-%R")" \
  --server_xds_port="$(($RANDOM%1000 + 34567))"
```

## Local development
This test driver allows running tests locally against remote GKE clusters, right
from your dev environment. You need:

1. Follow [installation](#installation) instructions
2. Authenticated `gcloud`
3. `kubectl` context (see [Configure GKE cluster access](#configure-gke-cluster-access))
4. Run tests with `--debug_use_port_forwarding` argument. The test driver 
   will automatically start and stop port forwarding using
   `kubectl` subprocesses. (experimental)

### Making changes to the driver
1. Install additional dev packages: `pip install -r requirements-dev.txt`
2. Use `./bin/yapf.sh` and `./bin/isort.sh` helpers to auto-format code.

### Setup test configuration

There are many arguments to be passed into the test run. You can save the
arguments to a config file ("flagfile") for your development environment.
Use [`config/local-dev.cfg.example`](https://github.com/grpc/grpc/blob/master/tools/run_tests/xds_k8s_test_driver/config/local-dev.cfg.example)
as a starting point:

```shell
cp config/local-dev.cfg.example config/local-dev.cfg
```

Learn more about flagfiles in [abseil documentation](https://abseil.io/docs/python/guides/flags#a-note-about---flagfile).

### Helper scripts
You can use interop xds-k8s [`bin/`](https://github.com/grpc/grpc/tree/master/tools/run_tests/xds_k8s_test_driver/bin)
scripts to configure TD, start k8s instances step-by-step, and keep them alive
for as long as you need. 

* To run helper scripts using local config:
  * `python -m bin.script_name --flagfile=config/local-dev.cfg`
  * `./run.sh bin/script_name.py` automatically appends the flagfile
* Use `--help` to see script-specific argument
* Use `--helpfull` to see all available argument

#### Overview
```shell
# Helper tool to configure Traffic Director with different security options
python -m bin.run_td_setup --help

# Helper tools to run the test server, client (with or without security)
python -m bin.run_test_server --help
python -m bin.run_test_client --help

# Helper tool to verify different security configurations via channelz
python -m bin.run_channelz --help
```

#### `./run.sh` helper
Use `./run.sh` to execute helper scripts and tests with `config/local-dev.cfg`.

```sh
USAGE: ./run.sh script_path [arguments]
   script_path: path to python script to execute, relative to driver root folder
   arguments ...: arguments passed to program in sys.argv

ENVIRONMENT:
   XDS_K8S_CONFIG: file path to the config flagfile, relative to
                   driver root folder. Default: config/local-dev.cfg
                   Will be appended as --flagfile="config_absolute_path" argument
   XDS_K8S_DRIVER_VENV_DIR: the path to python virtual environment directory
                            Default: $XDS_K8S_DRIVER_DIR/venv
DESCRIPTION:
This tool performs the following:
1) Ensures python virtual env installed and activated
2) Exports test driver root in PYTHONPATH
3) Automatically appends --flagfile="\$XDS_K8S_CONFIG" argument

EXAMPLES:
./run.sh bin/run_td_setup.py --help
./run.sh bin/run_td_setup.py --helpfull
XDS_K8S_CONFIG=./path-to-flagfile.cfg ./run.sh bin/run_td_setup.py --namespace=override-namespace
./run.sh tests/baseline_test.py
./run.sh tests/security_test.py --verbosity=1 --logger_levels=__main__:DEBUG,framework:DEBUG
./run.sh tests/security_test.py SecurityTest.test_mtls --nocheck_local_certs
```

### Regular workflow
```shell
# Setup Traffic Director
./run.sh bin/run_td_setup.py

# Start test server
./run.sh bin/run_test_server.py

# Add test server to the backend service
./run.sh bin/run_td_setup.py --cmd=backends-add

# Start test client
./run.sh bin/run_test_client.py
```

### Secure workflow
```shell
# Setup Traffic Director in mtls. See --help for all options
./run.sh bin/run_td_setup.py --security=mtls

# Start test server in a secure mode
./run.sh bin/run_test_server.py --secure

# Add test server to the backend service
./run.sh bin/run_td_setup.py --cmd=backends-add

# Start test client in a secure more --secure
./run.sh bin/run_test_client.py --secure
```

### Sending RPCs
#### Start port forwarding
```shell
# Client: all services always on port 8079
kubectl port-forward deployment.apps/psm-grpc-client 8079

# Server regular mode: all grpc services on port 8079
kubectl port-forward deployment.apps/psm-grpc-server 8080
# OR
# Server secure mode: TestServiceImpl is on 8080, 
kubectl port-forward deployment.apps/psm-grpc-server 8080
# everything else (channelz, healthcheck, CSDS) on 8081
kubectl port-forward deployment.apps/psm-grpc-server 8081
```

#### Send RPCs with grpccurl
```shell
# 8081 if security enabled
export SERVER_ADMIN_PORT=8080

# List server services using reflection
grpcurl --plaintext 127.0.0.1:$SERVER_ADMIN_PORT list
# List client services using reflection
grpcurl --plaintext 127.0.0.1:8079 list

# List channels via channelz
grpcurl --plaintext 127.0.0.1:$SERVER_ADMIN_PORT grpc.channelz.v1.Channelz.GetTopChannels
grpcurl --plaintext 127.0.0.1:8079 grpc.channelz.v1.Channelz.GetTopChannels

# Send GetClientStats to the client
grpcurl --plaintext -d '{"num_rpcs": 10, "timeout_sec": 30}' 127.0.0.1:8079 \
  grpc.testing.LoadBalancerStatsService.GetClientStats
```

### Cleanup
* First, make sure to stop port forwarding, if any
* Run `./bin/cleanup.sh`

##### Partial cleanup
You can run commands below to stop/start, create/delete resources however you want.  
Generally, it's better to remove resources in the opposite order of their creation.

Cleanup regular resources:
```shell
# Cleanup TD resources
./run.sh bin/run_td_setup.py --cmd=cleanup
# Stop test client
./run.sh bin/run_test_client.py --cmd=cleanup
# Stop test server, and remove the namespace
./run.sh bin/run_test_server.py --cmd=cleanup --cleanup_namespace
```

Cleanup regular and security-specific resources:
```shell
# Cleanup TD resources, with security
./run.sh bin/run_td_setup.py --cmd=cleanup --security=mtls
# Stop test client (secure)
./run.sh bin/run_test_client.py --cmd=cleanup --secure
# Stop test server (secure), and remove the namespace
./run.sh bin/run_test_server.py --cmd=cleanup --cleanup_namespace --secure
```

In addition, here's some other helpful partial cleanup commands:
```shell
# Remove all backends from the backend services
./run.sh bin/run_td_setup.py --cmd=backends-cleanup

# Stop the server, but keep the namespace
./run.sh bin/run_test_server.py --cmd=cleanup --nocleanup_namespace
```

### Known errors
#### Error forwarding port
If you stopped a test with `ctrl+c`, while using `--debug_use_port_forwarding`,
you might see an error like this:

> `framework.infrastructure.k8s.PortForwardingError: Error forwarding port, unexpected output Unable to listen on port 8081: Listeners failed to create with the following errors: [unable to create listener: Error listen tcp4 127.0.0.1:8081: bind: address already in use]`

Unless you're running `kubectl port-forward` manually, it's likely that `ctrl+c`
interrupted python before it could clean up subprocesses.

You can do `ps aux | grep port-forward` and then kill the processes by id,
or with `killall kubectl`