Skip to main content

Deploying a TDengine Cluster in Kubernetes

Overview

As a time series database for Cloud Native architecture design, TDengine supports Kubernetes deployment. Firstly we introduce how to use YAML files to create a highly available TDengine cluster from scratch step by step for production usage, and highlight the common operations of TDengine in Kubernetes environment.

To meet high availability requirements, clusters need to meet the following requirements:

  • 3 or more dnodes: multiple vnodes in the same vgroup of TDengine are not allowed to be distributed in one dnode at the same time, so if you create a database with 3 replicas, the number of dnodes is greater than or equal to 3
  • 3 mnodes: mnode is responsible for the management of the entire TDengine cluster. The default number of mnode in TDengine cluster is only one. If the dnode where the mnode located is dropped, the entire cluster is unavailable.
  • Database 3 replicas: The TDengine replica configuration is the database level, so 3 replicas for the database must need three dnodes in the cluster. If any one dnode is offline, does not affect the normal usage of the whole cluster. If the number of offline dnodes is 2, then the cluster is not available, because ** the cluster can not complete the election based on RAFT** . (Enterprise version: in the disaster recovery scenario, any node data file is damaged, can be restored by pulling up the dnode again)

Prerequisites

Before deploying TDengine on Kubernetes, perform the following:

  • This article applies Kubernetes 1.19 and above
  • This article uses the kubectl tool to install and deploy, please install the corresponding software in advance
  • Kubernetes have been installed and deployed and can access or update the necessary container repositories or other services

You can download the configuration files in this document from GitHub.

Configure the service

Create a service configuration file named taosd-service.yaml. Record the value of metadata.name (in this example, taos) for use in the next step. And then add the ports required by TDengine and record the value of the selector label "app" (in this example, tdengine) for use in the next step:

---
apiVersion: v1
kind: Service
metadata:
name: "taosd"
labels:
app: "tdengine"
spec:
ports:
- name: tcp6030
protocol: "TCP"
port: 6030
- name: tcp6041
protocol: "TCP"
port: 6041
selector:
app: "tdengine"

Configure the service as StatefulSet

According to Kubernetes instructions for various deployments, we will use StatefulSet as the deployment resource type of TDengine. Create the file tdengine.yaml , where replicas defines the number of cluster nodes as 3. The node time zone is China (Asia/Shanghai), and each node is allocated 5G standard storage (refer to the Storage Classes configuration storage class). You can also modify accordingly according to the actual situation.

Please pay special attention to the startupProbe configuration. If dnode's Pod drops for a period of time and then restart, the newly launched dnode Pod will be temporarily unavailable. The reason is the startupProbe configuration is too small, Kubernetes will know that the Pod is in an abnormal state and try to restart it, then the dnode's Pod will restart frequently and never return to the normal status. Refer to Configure Liveness, Readiness and Startup Probes

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: "tdengine"
labels:
app: "tdengine"
spec:
serviceName: "taosd"
replicas: 3
updateStrategy:
type: RollingUpdate
selector:
matchLabels:
app: "tdengine"
template:
metadata:
name: "tdengine"
labels:
app: "tdengine"
spec:
containers:
- name: "tdengine"
image: "tdengine/tdengine:3.0.7.1"
imagePullPolicy: "IfNotPresent"
ports:
- name: tcp6030
protocol: "TCP"
containerPort: 6030
- name: tcp6041
protocol: "TCP"
containerPort: 6041
env:
# POD_NAME for FQDN config
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
# SERVICE_NAME and NAMESPACE for fqdn resolve
- name: SERVICE_NAME
value: "taosd"
- name: STS_NAME
value: "tdengine"
- name: STS_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
# TZ for timezone settings, we recommend to always set it.
- name: TZ
value: "Asia/Shanghai"
# TAOS_ prefix will configured in taos.cfg, strip prefix and camelCase.
- name: TAOS_SERVER_PORT
value: "6030"
# Must set if you want a cluster.
- name: TAOS_FIRST_EP
value: "$(STS_NAME)-0.$(SERVICE_NAME).$(STS_NAMESPACE).svc.cluster.local:$(TAOS_SERVER_PORT)"
# TAOS_FQND should always be set in k8s env.
- name: TAOS_FQDN
value: "$(POD_NAME).$(SERVICE_NAME).$(STS_NAMESPACE).svc.cluster.local"
volumeMounts:
- name: taosdata
mountPath: /var/lib/taos
startupProbe:
exec:
command:
- taos-check
failureThreshold: 360
periodSeconds: 10
readinessProbe:
exec:
command:
- taos-check
initialDelaySeconds: 5
timeoutSeconds: 5000
livenessProbe:
exec:
command:
- taos-check
initialDelaySeconds: 15
periodSeconds: 20
volumeClaimTemplates:
- metadata:
name: taosdata
spec:
accessModes:
- "ReadWriteOnce"
storageClassName: "standard"
resources:
requests:
storage: "5Gi"

Use kubectl to deploy TDengine

First create the corresponding namespace, and then execute the following command in sequence :

kubectl apply -f taosd-service.yaml -n tdengine-test
kubectl apply -f tdengine.yaml -n tdengine-test

The above configuration will generate a three-node TDengine cluster, dnode is automatically configured, you can use the show dnodes command to view the nodes of the current cluster:

kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show dnodes"
kubectl exec -it tdengine-1 -n tdengine-test -- taos -s "show dnodes"
kubectl exec -it tdengine-2 -n tdengine-test -- taos -s "show dnodes"

The output is as follows:

taos> show dnodes
id | endpoint | vnodes | support_vnodes | status | create_time | reboot_time | note | active_code | c_active_code |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 0 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-19 17:54:18.469 | | | |
2 | tdengine-1.ta... | 0 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-19 17:54:38.698 | | | |
3 | tdengine-2.ta... | 0 | 16 | ready | 2023-07-19 17:55:01.141 | 2023-07-19 17:55:02.039 | | | |
Query OK, 3 row(s) in set (0.001853s)

View the current mnode

kubectl exec -it tdengine-1 -n tdengine-test -- taos -s "show mnodes\G"
taos> show mnodes\G
*************************** 1.row ***************************
id: 1
endpoint: tdengine-0.taosd.tdengine-test.svc.cluster.local:6030
role: leader
status: ready
create_time: 2023-07-19 17:54:18.559
reboot_time: 2023-07-19 17:54:19.520
Query OK, 1 row(s) in set (0.001282s)

Create mnode

kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "create mnode on dnode 2"
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "create mnode on dnode 3"

View mnode

kubectl exec -it tdengine-1 -n tdengine-test -- taos -s "show mnodes\G"

taos> show mnodes\G
*************************** 1.row ***************************
id: 1
endpoint: tdengine-0.taosd.tdengine-test.svc.cluster.local:6030
role: leader
status: ready
create_time: 2023-07-19 17:54:18.559
reboot_time: 2023-07-20 09:19:36.060
*************************** 2.row ***************************
id: 2
endpoint: tdengine-1.taosd.tdengine-test.svc.cluster.local:6030
role: follower
status: ready
create_time: 2023-07-20 09:22:05.600
reboot_time: 2023-07-20 09:22:12.838
*************************** 3.row ***************************
id: 3
endpoint: tdengine-2.taosd.tdengine-test.svc.cluster.local:6030
role: follower
status: ready
create_time: 2023-07-20 09:22:20.042
reboot_time: 2023-07-20 09:22:23.271
Query OK, 3 row(s) in set (0.003108s)

Enable port forwarding

Kubectl port forwarding enables applications to access TDengine clusters running in Kubernetes environments.

kubectl port-forward -n tdengine-test tdengine-0 6041:6041 &

Use curl to verify that the TDengine REST API is working on port 6041:

curl -u root:taosdata -d "show databases" 127.0.0.1:6041/rest/sql
{"code":0,"column_meta":[["name","VARCHAR",64]],"data":[["information_schema"],["performance_schema"],["test"],["test1"]],"rows":4}

Test cluster

Data preparation

taosBenchmark

Create a 3 replicas database with taosBenchmark, write 100 million data at the same time, and view the data at the same time

kubectl exec -it tdengine-0 -n tdengine-test -- taosBenchmark -I stmt -d test -n 10000 -t 10000 -a 3

# query data
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "select count(*) from test.meters;"

taos> select count(*) from test.meters;
count(*) |
========================
100000000 |
Query OK, 1 row(s) in set (0.103537s)

View vnode distribution by showing dnodes

kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show dnodes"

taos> show dnodes
id | endpoint | vnodes | support_vnodes | status | create_time | reboot_time | note | active_code | c_active_code |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 8 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-19 17:54:18.469 | | | |
2 | tdengine-1.ta... | 8 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-19 17:54:38.698 | | | |
3 | tdengine-2.ta... | 8 | 16 | ready | 2023-07-19 17:55:01.141 | 2023-07-19 17:55:02.039 | | | |
Query OK, 3 row(s) in set (0.001357s)

View xnode distribution by showing vgroup

kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show test.vgroups"

taos> show test.vgroups
vgroup_id | db_name | tables | v1_dnode | v1_status | v2_dnode | v2_status | v3_dnode | v3_status | v4_dnode | v4_status | cacheload | cacheelements | tsma |
==============================================================================================================================================================================================
2 | test | 1267 | 1 | follower | 2 | follower | 3 | leader | NULL | NULL | 0 | 0 | 0 |
3 | test | 1215 | 1 | follower | 2 | leader | 3 | follower | NULL | NULL | 0 | 0 | 0 |
4 | test | 1215 | 1 | leader | 2 | follower | 3 | follower | NULL | NULL | 0 | 0 | 0 |
5 | test | 1307 | 1 | follower | 2 | leader | 3 | follower | NULL | NULL | 0 | 0 | 0 |
6 | test | 1245 | 1 | follower | 2 | follower | 3 | leader | NULL | NULL | 0 | 0 | 0 |
7 | test | 1275 | 1 | follower | 2 | leader | 3 | follower | NULL | NULL | 0 | 0 | 0 |
8 | test | 1231 | 1 | leader | 2 | follower | 3 | follower | NULL | NULL | 0 | 0 | 0 |
9 | test | 1245 | 1 | follower | 2 | follower | 3 | leader | NULL | NULL | 0 | 0 | 0 |
Query OK, 8 row(s) in set (0.001488s)

Manually created

Common a three-copy test1, and create a table, write 2 pieces of data

kubectl exec -it tdengine-0 -n tdengine-test  -- \
taos -s \
"create database if not exists test1 replica 3;
use test1;
create table if not exists t1(ts timestamp, n int);
insert into t1 values(now, 1)(now+1s, 2);"

View xnode distribution by showing test1.vgroup

kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show test1.vgroups"

taos> show test1.vgroups
vgroup_id | db_name | tables | v1_dnode | v1_status | v2_dnode | v2_status | v3_dnode | v3_status | v4_dnode | v4_status | cacheload | cacheelements | tsma |
==============================================================================================================================================================================================
10 | test1 | 1 | 1 | follower | 2 | follower | 3 | leader | NULL | NULL | 0 | 0 | 0 |
11 | test1 | 0 | 1 | follower | 2 | leader | 3 | follower | NULL | NULL | 0 | 0 | 0 |
Query OK, 2 row(s) in set (0.001489s)

Test fault tolerance

The dnode where the mnode leader is located is disconnected, dnode1

kubectl get pod -l app=tdengine -n tdengine-test  -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tdengine-0 0/1 ErrImagePull 2 (2s ago) 20m 10.244.2.75 node86 <none> <none>
tdengine-1 1/1 Running 1 (6m48s ago) 20m 10.244.0.59 node84 <none> <none>
tdengine-2 1/1 Running 0 21m 10.244.1.223 node85 <none> <none>

At this time, the cluster mnode has a re-election, and the monde on dnode1 becomes the leader.

kubectl exec -it tdengine-1 -n tdengine-test -- taos -s "show mnodes\G"
Welcome to the TDengine Command Line Interface, Client Version:3.0.7.1.202307190706
Copyright (c) 2022 by TDengine, all rights reserved.

taos> show mnodes\G
*************************** 1.row ***************************
id: 1
endpoint: tdengine-0.taosd.tdengine-test.svc.cluster.local:6030
role: offline
status: offline
create_time: 2023-07-19 17:54:18.559
reboot_time: 1970-01-01 08:00:00.000
*************************** 2.row ***************************
id: 2
endpoint: tdengine-1.taosd.tdengine-test.svc.cluster.local:6030
role: leader
status: ready
create_time: 2023-07-20 09:22:05.600
reboot_time: 2023-07-20 09:32:00.227
*************************** 3.row ***************************
id: 3
endpoint: tdengine-2.taosd.tdengine-test.svc.cluster.local:6030
role: follower
status: ready
create_time: 2023-07-20 09:22:20.042
reboot_time: 2023-07-20 09:32:00.026
Query OK, 3 row(s) in set (0.001513s)

Cluster can read and write normally

# insert
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "insert into test1.t1 values(now, 1)(now+1s, 2);"

taos> insert into test1.t1 values(now, 1)(now+1s, 2);
Insert OK, 2 row(s) affected (0.002098s)

# select
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "select *from test1.t1"

taos> select *from test1.t1
ts | n |
========================================
2023-07-19 18:04:58.104 | 1 |
2023-07-19 18:04:59.104 | 2 |
2023-07-19 18:06:00.303 | 1 |
2023-07-19 18:06:01.303 | 2 |
Query OK, 4 row(s) in set (0.001994s)

Similarly, as for the non-leader mnode dropped, read and write can of course be normal, here will not do too much display .

Scaling Out Your Cluster

TDengine cluster supports automatic expansion:

kubectl scale statefulsets tdengine --replicas=4

The parameter --replica = 4 in the above command line indicates that you want to expand the TDengine cluster to 4 nodes. After execution, first check the status of the Pod:

kubectl get pod -l app=tdengine -n tdengine-test  -o wide

The output is as follows:

NAME                       READY   STATUS    RESTARTS        AGE     IP             NODE     NOMINATED NODE   READINESS GATES
tdengine-0 1/1 Running 4 (6h26m ago) 6h53m 10.244.2.75 node86 <none> <none>
tdengine-1 1/1 Running 1 (6h39m ago) 6h53m 10.244.0.59 node84 <none> <none>
tdengine-2 1/1 Running 0 5h16m 10.244.1.224 node85 <none> <none>
tdengine-3 1/1 Running 0 3m24s 10.244.2.76 node86 <none> <none>

At this time, the state of the POD is still Running, and the dnode state in the TDengine cluster can only be seen after the Pod status is ready :

kubectl exec -it tdengine-3 -n tdengine-test -- taos -s "show dnodes"

The dnode list of the expanded four-node TDengine cluster:

taos> show dnodes
id | endpoint | vnodes | support_vnodes | status | create_time | reboot_time | note | active_code | c_active_code |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 10 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-20 09:39:04.297 | | | |
2 | tdengine-1.ta... | 10 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-20 09:28:24.240 | | | |
3 | tdengine-2.ta... | 10 | 16 | ready | 2023-07-19 17:55:01.141 | 2023-07-20 10:48:43.445 | | | |
4 | tdengine-3.ta... | 0 | 16 | ready | 2023-07-20 16:01:44.007 | 2023-07-20 16:01:44.889 | | | |
Query OK, 4 row(s) in set (0.003628s)

Scaling In Your Cluster

Since the TDengine cluster will migrate data between nodes during volume expansion and contraction, using the kubectl command to reduce the volume requires first using the "drop dnodes" command ( If there are 3 replicas of db in the cluster, the number of dnodes after reduction must also be greater than or equal to 3, otherwise the drop dnode operation will be aborted ), the node deletion is completed before Kubernetes cluster reduction.

Note: Since Kubernetes Pods in the Statefulset can only be removed in reverse order of creation, the TDengine drop dnode also needs to be removed in reverse order of creation, otherwise the Pod will be in an error state.

kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "drop dnode 4"
kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show dnodes"

taos> show dnodes
id | endpoint | vnodes | support_vnodes | status | create_time | reboot_time | note | active_code | c_active_code |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 10 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-20 09:39:04.297 | | | |
2 | tdengine-1.ta... | 10 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-20 09:28:24.240 | | | |
3 | tdengine-2.ta... | 10 | 16 | ready | 2023-07-19 17:55:01.141 | 2023-07-20 10:48:43.445 | | | |
Query OK, 3 row(s) in set (0.003324s)

After confirming that the removal is successful (use kubectl exec -i -t tdengine-0 --taos -s "show dnodes" to view and confirm the dnode list), use the kubectl command to remove the Pod:

kubectl scale statefulsets tdengine --replicas=3 -n tdengine-test

The last Pod will be deleted. Use the command kubectl get pods -l app = tdengine to check the Pod status:

kubectl get pod -l app=tdengine -n tdengine-test  -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tdengine-0 1/1 Running 4 (6h55m ago) 7h22m 10.244.2.75 node86 <none> <none>
tdengine-1 1/1 Running 1 (7h9m ago) 7h23m 10.244.0.59 node84 <none> <none>
tdengine-2 1/1 Running 0 5h45m 10.244.1.224 node85 <none> <none>

After the Pod is deleted, the PVC needs to be deleted manually, otherwise the previous data will continue to be used for the next expansion, resulting in the inability to join the cluster normally.

kubectl delete pvc aosdata-tdengine-3  -n tdengine-test

The cluster state at this time is safe and can be scaled up again if needed.

kubectl scale statefulsets tdengine --replicas=4 -n tdengine-test
statefulset.apps/tdengine scaled

kubectl get pod -l app=tdengine -n tdengine-test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tdengine-0 1/1 Running 4 (6h59m ago) 7h27m 10.244.2.75 node86 <none> <none>
tdengine-1 1/1 Running 1 (7h13m ago) 7h27m 10.244.0.59 node84 <none> <none>
tdengine-2 1/1 Running 0 5h49m 10.244.1.224 node85 <none> <none>
tdengine-3 1/1 Running 0 20s 10.244.2.77 node86 <none> <none>

kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show dnodes"

taos> show dnodes
id | endpoint | vnodes | support_vnodes | status | create_time | reboot_time | note | active_code | c_active_code |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 10 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-20 09:39:04.297 | | | |
2 | tdengine-1.ta... | 10 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-20 09:28:24.240 | | | |
3 | tdengine-2.ta... | 10 | 16 | ready | 2023-07-19 17:55:01.141 | 2023-07-20 10:48:43.445 | | | |
5 | tdengine-3.ta... | 0 | 16 | ready | 2023-07-20 16:31:34.092 | 2023-07-20 16:38:17.419 | | | |
Query OK, 4 row(s) in set (0.003881s)

Remove a TDengine Cluster

When deleting the PVC, you need to pay attention to the pv persistentVolumeReclaimPolicy policy. It is recommended to change to Delete, so that the PV will be automatically cleaned up when the PVC is deleted, and the underlying CSI storage resources will be cleaned up at the same time. If the policy of deleting the PVC to automatically clean up the PV is not configured, and then after deleting the pvc, when manually cleaning up the PV, the CSI storage resources corresponding to the PV may not be released.

Complete removal of TDengine cluster, need to clean up statefulset, svc, configmap, pvc respectively.

kubectl delete statefulset -l app=tdengine -n tdengine-test
kubectl delete svc -l app=tdengine -n tdengine-test
kubectl delete pvc -l app=tdengine -n tdengine-test
kubectl delete configmap taoscfg -n tdengine-test

Troubleshooting

Error 1

No "drop dnode" is directly reduced. Since the TDengine has not deleted the node, the reduced pod causes some nodes in the TDengine cluster to be offline.

kubectl exec -it tdengine-0 -n tdengine-test -- taos -s "show dnodes"

taos> show dnodes
id | endpoint | vnodes | support_vnodes | status | create_time | reboot_time | note | active_code | c_active_code |
=============================================================================================================================================================================================================================================
1 | tdengine-0.ta... | 10 | 16 | ready | 2023-07-19 17:54:18.552 | 2023-07-20 09:39:04.297 | | | |
2 | tdengine-1.ta... | 10 | 16 | ready | 2023-07-19 17:54:37.828 | 2023-07-20 09:28:24.240 | | | |
3 | tdengine-2.ta... | 10 | 16 | ready | 2023-07-19 17:55:01.141 | 2023-07-20 10:48:43.445 | | | |
5 | tdengine-3.ta... | 0 | 16 | offline | 2023-07-20 16:31:34.092 | 2023-07-20 16:38:17.419 | status msg timeout | | |
Query OK, 4 row(s) in set (0.003862s)

Finally

For the high availability and high reliability of TDengine in a Kubernetes environment, hardware damage and disaster recovery are divided into two levels:

  1. The disaster recovery capability of the underlying distributed Block Storage, the multi-copy of Block Storage, the current popular distributed Block Storage such as Ceph, has the multi-copy capability, extending the storage copy to different racks, cabinets, computer rooms, Data center (or directly use the Block Storage service provided by Public Cloud vendors)
  2. TDengine disaster recovery, in TDengine Enterprise, itself has when a dnode permanently offline (TCE-metal disk damage, data sorting loss), re-pull a blank dnode to restore the original dnode work.

Finally, welcome to TDengine Cloud to experience the one-stop fully managed TDengine Cloud as a Service.

TDengine Cloud is a minimalist fully managed time series data processing Cloud as a Service platform developed based on the open source time series database TDengine. In addition to high-performance time series database, it also has system functions such as caching, subscription and stream computing, and provides convenient and secure data sharing, as well as numerous enterprise-level functions. It allows enterprises in the fields of Internet of Things, Industrial Internet, Finance, IT operation and maintenance monitoring to significantly reduce labor costs and operating costs in the management of time series data.