Commit 2df9b5d9 authored by SIMONIN Matthieu

Merge branch 'a8_telegraf_auto' into 'master'

Monitoring: install telegraf on A8 nodes automatically

See merge request !63
parents 3de21283 9dbad499
Pipeline #208284 passed with warnings
Showing 230 additions and 245 deletions
@@ -192,6 +192,7 @@ This script implements a similar behavior as the "Radio sniffer with M3 nodes"
 $ tar xfz <expid>-grenoble.iot-lab.info.tar.gz
 $ wireshark <expid>/sniffer/m3-7.pcap
+.. _IoT-LAB IPv6:
 IPv6 - Interacting with frontend
 --------------------------------
@@ -237,15 +238,8 @@ This example is based on the `Monitoring Service Class <../apidoc/service.html#m
 It installs Grafana/InfluxDB to visualize/store the monitoring metrics, which are collected by
 Telegraf agents running on each node of the infrastructure.
-Unfortunately, we cannot use the available API to install the monitoring stack in FIT/IoT-LAB, since
-it depends on a debian/ubuntu base environment and docker containers, which are not available on the nodes of
-this testbed.
-However, it is still possible to install the Telegraf agent on A8 nodes and, consequently,
-collect metrics about their resource utilization.
 In this scenario, a Grid'5000 node contains the collector (InfluxDB) and the UI (Grafana).
-The telegraf agent is installed on both Grid'5000 and FIT/IoT-LAB nodes, but the installation on
-A8 nodes is done without using the standard "Monitoring Service Class".
+The telegraf agent is installed on both Grid'5000 and FIT/IoT-LAB nodes.
 Finally, to handle the connectivity problem (Grid'5000 and FIT/IoT-LAB are
 two isolated networks), we need to run the openvpn client on A8 nodes. For that, it is necessary to
@@ -255,6 +249,12 @@ Note that the openvpn client is already available in A8 nodes and no installation
 **Requirement**: Grid'5000 VPN files on shared folder (~/A8/).
+.. warning::
+    This tutorial assumes that the files Grid5000_VPN.ovpn, .crt, .key are located on the FIT frontend.
+    Moreover, it also assumes that no passphrase is set on the private key.
+    Your private key is very sensitive, so you must protect it as best you can (chmod 600).
+    Prefer using IPv6 if you can: :ref:`IoT-LAB IPv6`
 .. literalinclude:: iotlab/tuto_iotlab_a8_monitoring.py
    :language: python
    :linenos:
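The chmod 600 advice in the warning above amounts to restricting the key file to owner read/write. A minimal, self-contained sketch (the temp directory stands in for the real location; on the FIT frontend the key sits under ~/A8/):

```python
import os
import stat
import tempfile

# Illustrative path: on the FIT frontend the VPN files live under ~/A8/.
key = os.path.join(tempfile.mkdtemp(), "Grid5000_VPN.key")
open(key, "w").close()

# Restrict the private key to owner read/write only (chmod 600).
os.chmod(key, 0o600)
print(oct(stat.S_IMODE(os.stat(key).st_mode)))
```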
@@ -6,35 +6,6 @@ import time
 logging.basicConfig(stream=sys.stdout, level=logging.INFO)
-def install_telegraf_a8_nodes(monit_obj, iotlab_nodes, pattern="a8", telegraf_version="telegraf-1.17.0"):
-    """Installs and runs the telegraf agent on A8 nodes.
-
-    A8 nodes are arm based. Tested with the 1.17.0 version available at:
-    https://portal.influxdata.com/downloads/.
-    Adjust accordingly for your setup. Note that download links may change over time.
-    """
-    # write the configuration file on A8 nodes using the Monitoring class
-    remote_telegraf_conf = monit_obj.write_agent_config(iotlab_nodes[pattern])
-    with play_on(
-        # a8 nodes share the same image, we only need to install telegraf once
-        pattern_hosts=iotlab_nodes[pattern][0].address, roles=iotlab_nodes
-    ) as p:
-        remote_dir = "/tmp/"
-        p.get_url(
-            url="https://dl.influxdata.com/telegraf/releases/" + telegraf_version + "_linux_armhf.tar.gz",
-            dest=remote_dir,
-            validate_certs=False,
-        )
-        p.shell("tar x -kzf " + remote_dir + telegraf_version + "_linux_armhf.tar.gz" + " -C /", creates="/" + telegraf_version)
-    # run telegraf on all hosts
-    with play_on(
-        pattern_hosts=pattern, roles=iotlab_nodes
-    ) as p:
-        # run the telegraf daemon in async mode
-        p.shell("/" + telegraf_version + "/usr/bin/" + "telegraf --config " + remote_telegraf_conf, asynch=3600, poll=0)
 def setup_vpn(iotlab_roles, pattern="a8"):
     """Initialize VPN on A8 nodes"""
     with play_on(
@@ -94,18 +65,15 @@ try:
     g5k_roles, g5k_networks = g5k_provider.init()
     g5k_roles = discover_networks(g5k_roles, g5k_networks)
+    print("Setting up VPN on A8 nodes to put them in the same network as Grid'5000")
+    setup_vpn(iotlab_roles)
     print("Deploy monitoring stack on Grid'5000")
     print("Install Grafana and InfluxDB at: %s" % str(g5k_roles["control"]))
     print("Install Telegraf at: %s" % str(g5k_roles["compute"]))
-    m = Monitoring(collector=g5k_roles["control"], agent=g5k_roles["compute"], ui=g5k_roles["control"])
+    m = Monitoring(collector=g5k_roles["control"], agent=g5k_roles["compute"]+iotlab_roles["a8"], ui=g5k_roles["control"])
     m.deploy()
-    print("Setting up VPN on A8 nodes to put them in the same network as Grid'5000")
-    setup_vpn(iotlab_roles)
-    print("A8 nodes don't support the regular instalation process used by Monitoring class")
-    print("Install Telegraf specific, bare-metal version for ARM processor")
-    install_telegraf_a8_nodes(m, iotlab_roles)
     ui_address = g5k_roles["control"][0].extra["my_network_ip"]
     print("The UI is available at http://%s:3000" % ui_address)
     print("user=admin, password=admin")
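The key change in the tutorial above is that A8 hosts are now appended to the Grid'5000 compute hosts and handed to `Monitoring` as a single agent list, instead of going through a separate `install_telegraf_a8_nodes` call. A sketch of that list concatenation, with plain lists and made-up host names standing in for enoslib `Host` objects:

```python
# Made-up host names; real code gets these dicts from the providers' init().
g5k_roles = {"control": ["grisou-1"], "compute": ["grisou-2"]}
iotlab_roles = {"a8": ["node-a8-1", "node-a8-2"]}

# Same expression as in the diff: one flat agent list for Monitoring.
agents = g5k_roles["compute"] + iotlab_roles["a8"]
print(agents)
```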
-import json
 from pathlib import Path
 import os
 from typing import Dict, List, Optional
-from enoslib.api import play_on, __python3__, __docker__
+from enoslib.api import play_on, run_ansible
 from enoslib.types import Host, Roles
 from ..service import Service
 from ..utils import _check_path, _to_abs
@@ -15,6 +14,8 @@ DEFAULT_COLLECTOR_ENV = {"INFLUXDB_HTTP_BIND_ADDRESS": ":8086"}
 DEFAULT_AGENT_IMAGE = "telegraf"
+SERVICE_PATH = os.path.abspath(os.path.dirname(os.path.realpath(__file__)))
 class Monitoring(Service):
     def __init__(
@@ -30,7 +31,7 @@ class Monitoring(Service):
         agent_env: Optional[Dict] = None,
         agent_image: str = DEFAULT_AGENT_IMAGE,
         ui_env: Optional[Dict] = None,
-        priors: List[play_on] = [__python3__, __docker__],
+        priors: List[play_on] = [],
         extra_vars: Dict = None,
     ):
         """Deploy a TIG stack: Telegraf, InfluxDB, Grafana.
@@ -96,9 +97,9 @@ class Monitoring(Service):
         # agent options
         self.agent_env = {} if not agent_env else agent_env
         if agent_conf is None:
-            self.agent_conf = Path("telegraf.conf.j2")
+            self.agent_conf = "telegraf.conf.j2"
         else:
-            self.agent_conf = _to_abs(Path(agent_conf))
+            self.agent_conf = str(_to_abs(Path(agent_conf)))
         self.agent_image = agent_image
         # ui options
@@ -113,63 +114,6 @@ class Monitoring(Service):
         self.extra_vars = {"ansible_python_interpreter": "/usr/bin/python3"}
         self.extra_vars.update(extra_vars)
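The two context lines above encode a precedence rule: the service defaults the Ansible interpreter to python3, then user-supplied `extra_vars` override it, because `dict.update` keeps the caller's values. A standalone sketch:

```python
# Service default, as in Monitoring.__init__.
extra_vars = {"ansible_python_interpreter": "/usr/bin/python3"}

# Hypothetical user-supplied extra_vars: overrides the interpreter
# and adds a custom variable.
user_extra_vars = {"ansible_python_interpreter": "/usr/bin/python", "custom": 1}
extra_vars.update(user_extra_vars)
print(extra_vars)
```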
-    def write_agent_config(self, agents: List[Host] = None) -> str:
-        """Sets the agents' telegraf config.
-
-        Args:
-            agents: list of hosts on which to write telegraf's config
-
-        Returns:
-            config filename that will be created on the agent's host
-        """
-        _path = os.path.abspath(os.path.dirname(os.path.realpath(__file__)))
-        extra_vars = {"collector_address": self._get_collector_address()}
-        extra_vars.update(self.extra_vars)
-        roles = self._roles
-        # user set a specific list of hosts
-        if agents is not None:
-            roles = {"agent": agents}
-        with play_on(
-            pattern_hosts="agent", roles=roles, extra_vars=extra_vars
-        ) as p:
-            p.file(path=self.remote_working_dir, state="directory")
-            p.template(
-                display_name="Generating the configuration file",
-                src=os.path.join(_path, self.agent_conf),
-                dest=self.remote_telegraf_conf,
-            )
-        return self.remote_telegraf_conf
-
-    def deploy_agent(self):
-        """Deploy the telegraf agent.
-
-        Generates the telegraf config and starts the docker container.
-        """
-        self.write_agent_config()
-        with play_on(
-            pattern_hosts="agent", roles=self._roles, extra_vars=self.extra_vars,
-        ) as p:
-            volumes = [
-                f"{self.remote_telegraf_conf}:/etc/telegraf/telegraf.conf",
-                "/sys:/rootfs/sys:ro",
-                "/proc:/rootfs/proc:ro",
-                "/var/run/docker.sock:/var/run/docker.sock:ro",
-            ]
-            env = {"HOST_PROC": "/rootfs/proc", "HOST_SYS": "/rootfs/sys"}
-            env.update(self.agent_env)
-            p.docker_container(
-                display_name="Installing Telegraf",
-                name="telegraf",
-                image=self.agent_image,
-                detach=True,
-                state="started",
-                recreate="yes",
-                network_mode="host",
-                volumes=volumes,
-                env=env,
-            )
     def _get_collector_address(self) -> str:
         """Auxiliary method to get the collector's IP address."""
@@ -188,47 +132,7 @@ class Monitoring(Service):
         if self.collector is None:
             return
-        # Some requirements
-        with play_on(
-            pattern_hosts="all",
-            roles=self._roles,
-            priors=self.priors,
-            extra_vars=self.extra_vars,
-        ) as p:
-            p.pip(display_name="Installing python-docker", name="docker")
-        # Deploy the collector
         # Handle port customisation
         _, collector_port = self.collector_env["INFLUXDB_HTTP_BIND_ADDRESS"].split(":")
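The context line above derives the collector port from the InfluxDB bind address, which defaults to `":8086"` (see `DEFAULT_COLLECTOR_ENV` earlier in the file): splitting on `":"` yields an empty host part and the port string. A standalone sketch:

```python
# Same shape as DEFAULT_COLLECTOR_ENV in the service.
collector_env = {"INFLUXDB_HTTP_BIND_ADDRESS": ":8086"}

# ":8086".split(":") -> ["", "8086"]: empty host part, port string.
host_part, collector_port = collector_env["INFLUXDB_HTTP_BIND_ADDRESS"].split(":")
print(collector_port)
```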
-        with play_on(
-            pattern_hosts="collector", roles=self._roles, extra_vars=self.extra_vars
-        ) as p:
-            p.docker_container(
-                display_name="Installing",
-                name="influxdb",
-                image="influxdb",
-                detach=True,
-                network_mode="host",
-                state="started",
-                recreate="yes",
-                env=self.collector_env,
-                volumes=[f"{self.remote_influxdata}:/var/lib/influxdb"],
-            )
-            p.wait_for(
-                display_name="Waiting for InfluxDB to be ready",
-                # I don't have a better solution yet
-                # The ci requirements are a bit annoying...
-                host="172.17.0.1",
-                port=collector_port,
-                state="started",
-                delay=2,
-                timeout=120,
-            )
-        # Deploy the agents
-        self.deploy_agent()
-        # Deploy the UI
         ui_address = None
         if self.network is not None:
             # This assumes that `discover_network` has been run before
@@ -237,85 +141,41 @@ class Monitoring(Service):
             # NOTE(msimonin): ping on docker bridge address for ci testing
             ui_address = "172.17.0.1"
-        # Handle port customisation
-        ui_port = self.ui_env["GF_SERVER_HTTP_PORT"]
-        with play_on(
-            pattern_hosts="ui", roles=self._roles, extra_vars=self.extra_vars
-        ) as p:
-            p.docker_container(
-                display_name="Installing Grafana",
-                name="grafana",
-                image="grafana/grafana",
-                detach=True,
-                network_mode="host",
-                env=self.ui_env,
-                recreate="yes",
-                state="started",
-            )
-            p.wait_for(
-                display_name="Waiting for grafana to be ready",
-                # NOTE(msimonin): ping on docker bridge address for ci testing
-                host=ui_address,
-                port=ui_port,
-                state="started",
-                delay=2,
-                timeout=120,
-            )
-            collector_url = f"http://{self._get_collector_address()}:{collector_port}"
-            p.uri(
-                display_name="Add InfluxDB in Grafana",
-                url=f"http://{ui_address}:{ui_port}/api/datasources",
-                user="admin",
-                password="admin",
-                force_basic_auth=True,
-                body_format="json",
-                method="POST",
-                # 409 means: already added
-                status_code=[200, 409],
-                body=json.dumps(
-                    {
-                        "name": "telegraf",
-                        "type": "influxdb",
-                        "url": collector_url,
-                        "access": "proxy",
-                        "database": "telegraf",
-                        "isDefault": True,
-                    }
-                ),
-            )
+        extra_vars = {
+            "enos_action": "deploy",
+            "collector_address": self._get_collector_address(),
+            "collector_port": collector_port,
+            "collector_env": self.collector_env,
+            "agent_conf": self.agent_conf,
+            "agent_image": self.agent_image,
+            "remote_working_dir": self.remote_working_dir,
+            "ui_address": ui_address,
+            "ui_port": self.ui_env["GF_SERVER_HTTP_PORT"],
+            "ui_env": self.ui_env
+        }
+        extra_vars.update(self.extra_vars)
+        _playbook = os.path.join(SERVICE_PATH, "monitoring.yml")
+        run_ansible(
+            self.priors + [_playbook], roles=self._roles, extra_vars=extra_vars
+        )
     def destroy(self):
         """Destroy the monitoring stack.

         This destroys all the containers and associated volumes.
         """
-        with play_on(
-            pattern_hosts="ui", roles=self._roles, extra_vars=self.extra_vars
-        ) as p:
-            p.docker_container(
-                display_name="Destroying Grafana",
-                name="grafana",
-                state="absent",
-                force_kill=True,
-            )
-        with play_on(
-            pattern_hosts="agent", roles=self._roles, extra_vars=self.extra_vars
-        ) as p:
-            p.docker_container(
-                display_name="Destroying telegraf", name="telegraf", state="absent"
-            )
-        with play_on(
-            pattern_hosts="collector", roles=self._roles, extra_vars=self.extra_vars
-        ) as p:
-            p.docker_container(
-                display_name="Destroying InfluxDB",
-                name="influxdb",
-                state="absent",
-                force_kill=True,
-            )
-            p.file(path=f"{self.remote_influxdata}", state="absent")
+        extra_vars = {
+            "enos_action": "destroy",
+            "remote_working_dir": self.remote_working_dir,
+        }
+        extra_vars.update(self.extra_vars)
+        _playbook = os.path.join(SERVICE_PATH, "monitoring.yml")
+        run_ansible(
+            [_playbook],
+            roles=self._roles,
+            extra_vars=extra_vars
+        )
     def backup(self, backup_dir: Optional[str] = None):
         """Backup the monitoring stack.
@@ -330,29 +190,16 @@ class Monitoring(Service):
         _backup_dir = _check_path(_backup_dir)
-        with play_on(
-            pattern_hosts="collector", roles=self._roles, extra_vars=self.extra_vars
-        ) as p:
-            backup_path = os.path.join(self.remote_working_dir, "influxdb-data.tar.gz")
-            p.docker_container(
-                display_name="Stopping InfluxDB", name="influxdb", state="stopped"
-            )
-            p.archive(
-                display_name="Archiving the data volume",
-                path=f"{self.remote_influxdata}",
-                dest=backup_path,
-            )
-            p.fetch(
-                display_name="Fetching the data volume",
-                src=backup_path,
-                dest=str(Path(_backup_dir, "influxdb-data.tar.gz")),
-                flat=True,
-            )
-            p.docker_container(
-                display_name="Restarting InfluxDB",
-                name="influxdb",
-                state="started",
-                force_kill=True,
-            )
+        extra_vars = {
+            "enos_action": "backup",
+            "remote_working_dir": self.remote_working_dir,
+            "backup_dir": str(_backup_dir)
+        }
+        extra_vars.update(self.extra_vars)
+        _playbook = os.path.join(SERVICE_PATH, "monitoring.yml")
+        run_ansible(
+            [_playbook],
+            roles={"collector": self._roles["collector"]},
+            extra_vars=extra_vars
+        )
---
- name: Gather facts
  hosts: all
  tasks:
    - name: Gather facts on all hosts
      setup: {}

- name: Monitoring - agents
  hosts: agent
  become: yes
  roles:
    - agent

- name: Monitoring - collector
  hosts: collector
  become: yes
  roles:
    - collector

- name: Monitoring - UI
  hosts: ui
  become: yes
  roles:
    - ui
telegraf_binary_version: telegraf-1.17.0
telegraf_timeout: 31536000 # 1 year timeout
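A quick sanity check on the default above: 31536000 seconds is exactly one non-leap year, so the async telegraf task is effectively left running for the whole lifetime of an experiment.

```python
# 365 days * 24 hours * 3600 seconds = the telegraf_timeout default.
seconds_per_year = 365 * 24 * 3600
print(seconds_per_year)
```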
---
dependencies:
  - role: python3
    when: ansible_architecture != "armv7l"
  - role: docker
    when: ansible_architecture != "armv7l"
---
- name: Check that the telegraf binary already exists
  stat:
    path: /{{ telegraf_binary_version }}/usr/bin/telegraf
  register: stat_result

- name: Installing telegraf binary
  shell: "which telegraf || (curl -sfL https://dl.influxdata.com/telegraf/releases/{{ telegraf_binary_version }}_linux_armhf.tar.gz | tar x -zf - -C /)"
  run_once: true
  when: not stat_result.stat.exists

- name: Running telegraf
  shell: "/{{ telegraf_binary_version }}/usr/bin/telegraf --config {{ remote_working_dir }}/telegraf.conf"
  async: "{{ telegraf_timeout }}"
  poll: 0
---
- name: Creating remote directory
  file:
    path: "{{ remote_working_dir }}"
    state: directory

- name: Generating the configuration file
  ansible.builtin.template:
    src: "{{ agent_conf }}"
    dest: "{{ remote_working_dir }}/telegraf.conf"
---
- name: Installing Telegraf
  docker_container:
    name: telegraf
    image: "{{ agent_image }}"
    detach: yes
    state: started
    recreate: yes
    network_mode: host
    volumes:
      - "{{ remote_working_dir }}/telegraf.conf:/etc/telegraf/telegraf.conf"
      - /sys:/rootfs/sys:ro
      - /proc:/rootfs/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    env:
      HOST_PROC: "/rootfs/proc"
      HOST_SYS: "/rootfs/sys"
---
- include_tasks: config.yml

- name: Host architecture
  ansible.builtin.debug:
    var: ansible_architecture

- include_tasks: container.yml
  when: ansible_architecture != "armv7l"

- include_tasks: binary.yml
  when: ansible_architecture == "armv7l"
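The `when:` conditions in the deploy tasks above dispatch on the reported architecture: A8 nodes report `armv7l` and get the bare-metal binary install, while every other architecture keeps the docker container path. That dispatch can be sketched as a plain function (the function name is made up for illustration):

```python
def telegraf_install_tasks(ansible_architecture: str) -> str:
    """Mirror the `when:` conditions on the include_tasks entries."""
    # armv7l == A8 node -> binary install; anything else -> docker container
    return "binary.yml" if ansible_architecture == "armv7l" else "container.yml"

print(telegraf_install_tasks("armv7l"))
print(telegraf_install_tasks("x86_64"))
```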
---
- name: Destroying Telegraf
  docker_container:
    name: telegraf
    state: absent
    force_kill: yes
  when: ansible_architecture != "armv7l"

- name: Destroying Telegraf (binary)
  shell: pgrep telegraf | xargs kill
  when: ansible_architecture == "armv7l"
---
- include: "{{ enos_action }}.yml"
---
dependencies:
  - role: python3
  - role: docker
---
- name: Stopping InfluxDB
  docker_container:
    name: influxdb
    state: stopped

- name: Archiving the data volume
  archive:
    path: "{{ remote_working_dir }}/influxdb-data"
    dest: "{{ remote_working_dir }}/influxdb-data.tar.gz"

- name: Fetching the data volume
  fetch:
    src: "{{ remote_working_dir }}/influxdb-data.tar.gz"
    dest: "{{ backup_dir }}/"
    flat: yes

- name: Restarting InfluxDB
  docker_container:
    name: influxdb
    state: started
    force_kill: yes
---
- name: Installing InfluxDB
  docker_container:
    name: influxdb
    image: influxdb
    detach: yes
    state: started
    recreate: yes
    network_mode: host
    volumes:
      - "{{ remote_working_dir }}/influxdb-data:/var/lib/influxdb"
    env: "{{ collector_env }}"

- name: Waiting for InfluxDB to be ready
  wait_for:
    host: "172.17.0.1"
    port: "{{ collector_port }}"
    state: started
    delay: 2
    timeout: 120
---
- name: Destroying InfluxDB
  docker_container:
    name: influxdb
    state: absent
    force_kill: yes

- name: Removing InfluxDB database
  file:
    path: "{{ remote_working_dir }}/influxdb-data"
    state: absent
---
- include: "{{ enos_action }}.yml"
- name: Install docker
  shell: "which docker || (curl -sSL https://get.docker.com/ | sh)"

- name: Installing python-docker
  pip:
    name: docker

- name: Install python3
  shell: "(python --version | grep --regexp ' 3.*') || (apt update && apt install -y python3 python3-pip)"