VictoriaMetrics fails to discover a scrape target

Problem Description

Recently a service was deployed in a new environment. It exposes metrics on port 10299 at the /metrics path, and its VMServiceScrape configuration is as follows (the name fields have been modified):

apiVersion: v1
items:
- apiVersion: operator.victoriametrics.com/v1beta1
  kind: VMServiceScrape
  metadata:
    labels:
      app_id: audit
    name: audit
    namespace: default
  spec:
    endpoints:
    - path: /metrics
      targetPort: 10299
    namespaceSelector:
      matchNames:
      - default
    selector:
      matchLabels:
        app_id: audit

However, when checking its status in the vmagent UI, vmagent could not discover this target.

General Troubleshooting Steps

  1. Make sure the service itself is working, i.e. the metrics can be reached at ${podIp}:10299/metrics
  2. Make sure the vmservicescrape -> service -> endpoints chain is intact, i.e. the configured selector fields correctly match the corresponding resources (see the Service sketch after this list)
  3. Make sure the vmservicescrape is well-formed. Note: a malformed vmservicescrape resource may prevent vmagent from loading the configuration at all, which can be detected via step 5
  4. Make sure vmagent is allowed to discover targets in that namespace
  5. Trigger a reload from the vmagent UI and check the vmagent logs for related error messages
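
For step 2, here is a minimal sketch of a Service that the VMServiceScrape above would match; the Service name and port values are illustrative assumptions, not taken from the actual environment:

apiVersion: v1
kind: Service
metadata:
  labels:
    app_id: audit        # matched by spec.selector.matchLabels in the VMServiceScrape
  name: audit
  namespace: default     # covered by spec.namespaceSelector.matchNames
spec:
  selector:
    app_id: audit        # must match the pod labels so that Endpoints are populated
  ports:
  - name: metrics
    port: 10299
    targetPort: 10299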

None of the above steps solved the problem. Even stranger, the target could not be found in vmagent's api/v1/targets at all, which means vmagent never discovered the service, i.e. the vmservicescrape configuration never took effect. The scrape configuration generated from the above vmservicescrape, viewed in vmagent (it is appended to the static configuration), is shown below; it uses kubernetes_sd_configs to discover targets:

- job_name: serviceScrape/default/audit/0
  metrics_path: /metrics
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app_id]
    regex: audit
    action: keep
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    regex: "10299"
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: node
    regex: Node;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: pod
    regex: Pod;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - source_labels: [__meta_kubernetes_pod_container_name]
    target_label: container
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: service
  - source_labels: [__meta_kubernetes_service_name]
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: "8080"
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      own_namespace: false
      names:
      - default

Code Analysis

Since the configuration itself is fine, the only way forward is to look at how VictoriaMetrics' kubernetes_sd_configs mechanism works and figure out where things go wrong. In the VictoriaMetrics source code, the target URL is assembled as follows:

scrapeURL := fmt.Sprintf("%s://%s%s%s%s", schemeRelabeled, addressRelabeled, metricsPathRelabeled, optionalQuestion, paramsStr) 

Where:

  • schemeRelabeled: defaults to http
  • metricsPathRelabeled: the metrics_path field from the generated configuration
  • optionalQuestion and paramsStr: not configured here, so they can be ignored

The key field is addressRelabeled, which comes from a label named "__address__":

func mergeLabels(swc *scrapeWorkConfig, target string, extraLabels, metaLabels map[string]string) []prompbmarshal.Label {
	...
	m["job"] = swc.jobName
	m["__address__"] = target
	m["__scheme__"] = swc.scheme
	m["__metrics_path__"] = swc.metricsPath
	m["__scrape_interval__"] = swc.scrapeInterval.String()
	m["__scrape_timeout__"] = swc.scrapeTimeout.String()
	...
}
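
To make the assembly concrete, here is a minimal standalone sketch (illustrative values only, not VictoriaMetrics code) of how these labels turn into the final scrape URL; the address value is an assumption, and where it really comes from is traced below:

package main

import "fmt"

func main() {
	// Illustrative label values for the audit target; the pod IP is made up.
	labels := map[string]string{
		"__scheme__":       "http",
		"__address__":      "10.0.0.12:10299",
		"__metrics_path__": "/metrics",
	}
	scrapeURL := fmt.Sprintf("%s://%s%s",
		labels["__scheme__"], labels["__address__"], labels["__metrics_path__"])
	fmt.Println(scrapeURL) // http://10.0.0.12:10299/metrics
}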

Following the code further, this label is obtained via sc.KubernetesSDConfigs[i].MustStart; as the name KubernetesSDConfigs suggests, it is responsible for the kubernetes_sd_configs mechanism:

func (sc *ScrapeConfig) mustStart(baseDir string) {
	swosFunc := func(metaLabels map[string]string) interface{} {
		target := metaLabels["__address__"]
		sw, err := sc.swc.getScrapeWork(target, nil, metaLabels)
		if err != nil {
			logger.Errorf("cannot create kubernetes_sd_config target %q for job_name %q: %s", target, sc.swc.jobName, err)
			return nil
		}
		return sw
	}
	for i := range sc.KubernetesSDConfigs {
		sc.KubernetesSDConfigs[i].MustStart(baseDir, swosFunc)
	}
}

Digging further to see what this "__address__" field actually is, the call chain is as follows:

MustStart -> cfg.aw.mustStart -> aw.gw.startWatchersForRole -> uw.reloadScrapeWorksForAPIWatchersLocked -> o.getTargetLabels

The last function, getTargetLabels, is an interface method:

type object interface {
	key() string

	// getTargetLabels must be called under gw.mu lock.
	getTargetLabels(gw *groupWatcher) []map[string]string
}

The implementations of getTargetLabels are the concrete per-role implementations of kubernetes_sd_configs. The service above uses kubernetes_sd_configs with role: endpoints, whose implementation is shown below:

func (eps *Endpoints) getTargetLabels(gw *groupWatcher) []map[string]string {
	var svc *Service
	if o := gw.getObjectByRoleLocked("service", eps.Metadata.Namespace, eps.Metadata.Name); o != nil {
		svc = o.(*Service)
	}
	podPortsSeen := make(map[*Pod][]int)
	var ms []map[string]string
	for _, ess := range eps.Subsets {
		for _, epp := range ess.Ports {
			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.Addresses, epp, svc, "true")
			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.NotReadyAddresses, epp, svc, "false")
		}
	}
	// See https://kubernetes.io/docs/reference/labels-annotations-taints/#endpoints-kubernetes-io-over-capacity
	// and https://github.com/kubernetes/kubernetes/pull/99975
	switch eps.Metadata.Annotations.GetByName("endpoints.kubernetes.io/over-capacity") {
	case "truncated":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and has been truncated; please use "role: endpointslice" instead`, eps.Metadata.key())
	case "warning":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and will be truncated in the next k8s releases; please use "role: endpointslice" instead`, eps.Metadata.key())
	}

	// Append labels for skipped ports on seen pods.
	portSeen := func(port int, ports []int) bool {
		for _, p := range ports {
			if p == port {
				return true
			}
		}
		return false
	}
	for p, ports := range podPortsSeen {
		for _, c := range p.Spec.Containers {
			for _, cp := range c.Ports {
				if portSeen(cp.ContainerPort, ports) {
					continue
				}
				addr := discoveryutils.JoinHostPort(p.Status.PodIP, cp.ContainerPort)
				m := map[string]string{
					"__address__": addr,
				}
				p.appendCommonLabels(m)
				p.appendContainerLabels(m, c, &cp)
				if svc != nil {
					svc.appendCommonLabels(m)
				}
				ms = append(ms, m)
			}
		}
	}
	return ms
}

As you can see, "__address__" is simply p.Status.PodIP joined with cp.ContainerPort, where p represents a Kubernetes Pod data structure. This imposes two requirements (illustrated by the sketch after this list):

  1. The pod is running and has been assigned a PodIP
  2. The port exposing the metrics target is declared in p.Spec.Containers[].Ports[].ContainerPort
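
To show why the second requirement matters, here is a minimal standalone sketch (simplified stand-in types, not the real Kubernetes or VictoriaMetrics structs; the pod IP is made up) of how "__address__" values are derived from a pod's declared container ports:

package main

import "fmt"

// Simplified stand-ins for the pod fields used by getTargetLabels above.
type ContainerPort struct{ ContainerPort int }
type Container struct{ Ports []ContainerPort }
type PodSpec struct{ Containers []Container }
type PodStatus struct{ PodIP string }
type Pod struct {
	Spec   PodSpec
	Status PodStatus
}

// podAddresses mirrors the "__address__" assembly: one address per declared container port.
func podAddresses(p Pod) []string {
	var addrs []string
	for _, c := range p.Spec.Containers {
		for _, cp := range c.Ports {
			addrs = append(addrs, fmt.Sprintf("%s:%d", p.Status.PodIP, cp.ContainerPort))
		}
	}
	return addrs
}

func main() {
	// The audit pod only declares port 8080, so 10299 never becomes a target address.
	p := Pod{
		Spec:   PodSpec{Containers: []Container{{Ports: []ContainerPort{{ContainerPort: 8080}}}}},
		Status: PodStatus{PodIP: "10.0.0.12"},
	}
	fmt.Println(podAddresses(p)) // [10.0.0.12:8080]
}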

Solution

Based on the above analysis, I checked the Deployment in the environment and found that it only declares port 8080; the metrics port 10299 is not declared at all. Declaring the metrics port solves the problem (a sketch of the corrected ports section follows the Deployment below).

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app_id: audit
  name: audit
  namespace: default
spec:
  ...
  template:
    metadata:
      ...
    spec:
      containers:
      - env:
        - name: APP_ID
          value: audit
        ports:
        - containerPort: 8080
          protocol: TCP
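
Assuming the metrics port only needs to be declared on the container, a minimal sketch of the corrected ports section (the rest of the Deployment stays the same):

        ports:
        - containerPort: 8080
          protocol: TCP
        - containerPort: 10299   # metrics port, so endpoints discovery can build "__address__" for it
          protocol: TCP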

Summary

The kubernetes_sd_configs mechanism essentially uses list/watch to obtain the objects for the configured role and then assembles each target's __address__ from them. It also attaches some additional meta labels, such as:

  • __meta_kubernetes_endpoint_hostname: Hostname of the endpoint.
  • __meta_kubernetes_endpoint_node_name: Name of the node hosting the endpoint.
  • __meta_kubernetes_endpoint_ready: Set to true or false for the endpoint’s ready state.
  • __meta_kubernetes_endpoint_port_name: Name of the endpoint port.
  • __meta_kubernetes_endpoint_port_protocol: Protocol of the endpoint port.
  • __meta_kubernetes_endpoint_address_target_kind: Kind of the endpoint address target.
  • __meta_kubernetes_endpoint_address_target_name: Name of the endpoint address target.