一瞥之见

0%

记录 - 排查go的runtime.SetFinalizer未“如期”运行问题

记录自己demo测试runtime.SetFinalizer的运行未如期的排查过程

问题缘由

go-cache源码时候发现其使用了runtime.SetFinalizer这个函数绑定关闭周期性清除过期key的goroutine 到cache本身的生命周期,也就是说cache被垃圾回收的时候会停止此goroutine。
所以我也就想自己写个demo试试是不是真正的有效,设计的demo是假设有一个school,school有一个看门狗(watch_dog),假设在school开着的时候狗每一秒叫(bark)一句【此操作单独在一个goroutine通过ticker实现】,当school关闭(垃圾回收)时候自然期望能够停止前面狗叫的goroutine。demo代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
package main

import (
"log"
"runtime"
"time"
)

type dog struct {
stop chan bool
}

func (d *dog) dog_run() {
ticker := time.NewTicker(1 * time.Second)
for {
select {
case <-ticker.C:
log.Println("bark") // 学校每秒钟都会
case <-d.stop:
ticker.Stop()
return
}
}
}

type School struct {
Students []string
dog *dog
}

func (s *School) doNothing() {
}

func dog_stop(s *School) {
log.Println("dog_stop get called")
s.dog.stop <- true
}

func (s *School) new() {
watch_dog := dog{make(chan bool)}
s.dog = &watch_dog
go watch_dog.dog_run()
}

func NewSchool(names ...string) *School {
school := School{Students: names}
school.new()
runtime.SetFinalizer(&school, dog_stop) // !绑定停止狗叫的函数到school的生命周期
return &school
}

func main() {
school := NewSchool([]string{"xx小学"}...)
school.doNothing()
time.Sleep(time.Second * 4)
school = nil // 关闭学校
time.Sleep(time.Second * 360) // 期望的是这其中的时候会看不到狗叫的输出
}

问题

上面的demo代码输出如下(可见狗一直在叫、即使是在学校置空后6分钟也没有停止,故也就是并没有触发dog_stop函数):

1
2
3
4
5
021/07/05 09:11:13 bark
2021/07/05 09:11:14 bark
...
2021/07/05 09:17:15 bark
2021/07/05 09:17:16 bark

排查过程

是因为没有GC吗?

既然是在school回收的时候触发、那是不是此期间就没有回收、没有触发GC呢?这时候先使用了通过设置GODEBUG=gctrace=1来看输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
2021/07/05 09:27:55 bark
...
2021/07/05 09:29:53 bark
GC forced
gc 9 @120.877s 0%: 0.033+0.43+0.003 ms clock, 0.13+0/0.42/1.0+0.015 ms cpu, 2->2->0 MB, 4 MB goal, 4 P
2021/07/05 09:29:54 bark
...
2021/07/05 09:31:54 bark
GC forced
gc 10 @241.096s 0%: 0.95+1.3+0.032 ms clock, 3.8+0/1.2/1.1+0.12 ms cpu, 0->0->0 MB, 4 MB goal, 4 P
2021/07/05 09:31:55 bark
...
2021/07/05 09:33:54 bark
GC forced
gc 11 @361.576s 0%: 0.083+0.53+0.004 ms clock, 0.33+0/0.48/0.82+0.016 ms cpu, 0->0->0 MB, 4 MB goal, 4 P
2021/07/05 09:33:55 bark
2021/07/05 09:33:56 bark
2021/07/05 09:33:57 bark

从上面的输出我们可以看出来、每两分钟,是执行了GC的!
那么问题是什么呢?
经过在知识星球上、欧长坤大佬的回复得知:这里的输出只是标记阶段、并不是清除阶段。而SetFinalizer的触发是要在实际清除阶段才会触发的
结合这篇文章我知道了,其实可以说的GC和sweep完全是两个阶段,笼统的在这里可以认为GC阶段纯粹是标记阶段【所以上面的gctrace=1参数的输出信息全部是标记阶段、是看不到sweep相关信息的】,而sweep阶段是要在标记为不用的块需要再次被用到的时候才会发生的阶段!

验证回答之手动触发sweep

可以通过分配大内存(从而要用那部分标记为已不用的堆内存)或 runtime.GC()手动触发

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
package main

import (
"log"
"runtime"
"time"
)

type dog struct {
stop chan bool
}

func (d *dog) dog_run() {
ticker := time.NewTicker(1 * time.Second)
for {
select {
case <-ticker.C:
log.Println("bark")
case <-d.stop:
ticker.Stop()
return
}
}
}

type School struct {
Students []string
dog *dog
}

func (s *School) doNothing() {
}

func dog_stop(s *School) {
log.Println("dog_stop get called")
s.dog.stop <- true
}

func (s *School) new() {
watch_dog := dog{make(chan bool)}
s.dog = &watch_dog
go watch_dog.dog_run()
}

func NewSchool(names ...string) *School {
school := School{Students: names}
school.new()
runtime.SetFinalizer(&school, dog_stop)
return &school
}

func main() {
school := NewSchool([]string{"xx小学"}...)
school.doNothing()
time.Sleep(time.Second * 4)
school = nil
log.Println("school closed")
time.Sleep(4 * time.Second)
log.Println("handly triggered GC start")
runtime.GC()
log.Println("handly triggered GC end")
time.Sleep(4 * time.Second)
}

输出:

1
2
3
4
5
6
7
8
9
10
11
12
2021/07/05 13:37:52 bark
2021/07/05 13:37:53 bark
2021/07/05 13:37:54 bark
2021/07/05 13:37:55 school closed
2021/07/05 13:37:55 bark
2021/07/05 13:37:56 bark
2021/07/05 13:37:57 bark
2021/07/05 13:37:58 bark
2021/07/05 13:37:59 bark
2021/07/05 13:37:59 handly triggered GC start
2021/07/05 13:37:59 handly triggered GC end
2021/07/05 13:37:59 dog_stop get called

可见在手动触发后、是有效触发了通过runtime.SetFinalizer绑定的dog_stop方法的!

怎么才能看到是真的触发了sweep呢?

虽然大佬说了,但是不看到真的是因为没有触发sweep以及runtime.GC()是真的触发了sweep还是不甘心,那么怎么能够看到是不是真的触发了sweep呢?

错解:通过gcpacertrace参数

胡乱看的时候、有看到可以设置gcpacertrace参数,其中看到了很具有迷惑性的sweep相关信息(以下是设置了GODEBUG=gctrace=1,gcpacertrace=1的输出):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
2021/07/05 10:09:47 bark
...
2021/07/05 10:11:45 bark
GC forced
pacer: assist ratio=+1.466185e-001 (scan 0 MB in 2->4 MB) workers=1++0.000000e+000
pacer: H_m_prev=748712 h_t=+9.500000e-001 H_T=4194304 h_a=+2.116616e+000 H_a=2333448 h_g=+4.602026e+000 H_g=4194304 u_a=+2.500000e-001 u_g=+3.000000e-001 W_a=512096 goalΔ=+3.652026e+000 actualΔ=+1.166616e+000 u_a/u_g=+8.333333e-001
gc 9 @121.098s 0%: 0.17+1.2+0.13 ms clock, 0.69+0/0.81/0+0.52 ms cpu, 2->2->0 MB, 4 MB goal, 4 P
pacer: sweep done at heap size 0MB; allocated 0MB during sweep; swept 417 pages at +1.748475e-004 pages/byte
2021/07/05 10:11:46 bark
...
2021/07/05 10:13:46 bark
GC forced
pacer: assist ratio=+7.457321e-002 (scan 0 MB in 0->4 MB) workers=1++0.000000e+000
pacer: H_m_prev=760792 h_t=+9.500000e-001 H_T=4194304 h_a=+0.000000e+000 H_a=760792 h_g=+4.513076e+000 H_g=4194304 u_a=+2.500000e-001 u_g=+3.000000e-001 W_a=510688 goalΔ=+3.563076e+000 actualΔ=-9.500000e-001 u_a/u_g=+8.333333e-001
gc 10 @241.368s 0%: 0.095+0.87+0.13 ms clock, 0.38+0/0.70/0.61+0.52 ms cpu, 0->0->0 MB, 4 MB goal, 4 P
pacer: sweep done at heap size 0MB; allocated 0MB during sweep; swept 267 pages at +1.100755e-004 pages/byte
2021/07/05 10:13:47 bark
...
2021/07/05 10:15:46 bark
GC forced
pacer: assist ratio=+7.349755e-002 (scan 0 MB in 0->4 MB) workers=1++0.000000e+000
pacer: H_m_prev=720120 h_t=+9.500000e-001 H_T=4194304 h_a=+0.000000e+000 H_a=720120 h_g=+4.824451e+000 H_g=4194304 u_a=+2.500000e-001 u_g=+3.000000e-001 W_a=510688 goalΔ=+3.874451e+000 actualΔ=-9.500000e-001 u_a/u_g=+8.333333e-001
gc 11 @361.639s 0%: 0.11+1.3+0.11 ms clock, 0.46+0/0.79/0.89+0.44 ms cpu, 0->0->0 MB, 4 MB goal, 4 P
pacer: sweep done at heap size 0MB; allocated 0MB during sweep; swept 260 pages at +1.071896e-004 pages/byte
2021/07/05 10:15:47 bark
2021/07/05 10:15:48 bark
2021/07/05 10:15:49 bark
2021/07/05 10:15:50 bark

虽然pacer的输出有sweep done at…, 但是其并不是真正说的sweep。因为pacer本身就不是处理sweep的,而是处理调整下次gc频率等作用的算法…

正解:go tool trace

go tool trace 可以看到所有的程序运行细节

  • 这里先看下调用了runtime.GC()的情况

    1. 在代码中加入trace相关代码

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
      41
      42
      43
      44
      45
      46
      47
      48
      49
      50
      51
      52
      53
      54
      type dog struct {
      stop chan bool
      }

      func (d *dog) dog_run() {
      ticker := time.NewTicker(1 * time.Second)
      for {
      select {
      case <-ticker.C:
      // log.Println("bark")
      case <-d.stop:
      ticker.Stop()
      return
      }
      }
      }

      type School struct {
      Students []string
      dog *dog
      }

      func (s *School) doNothing() {
      }

      func dog_stop(s *School) {
      // log.Println("dog_stop get called")
      s.dog.stop <- true
      }

      func (s *School) new() {
      watch_dog := dog{make(chan bool)}
      s.dog = &watch_dog
      go watch_dog.dog_run()
      }

      func NewSchool(names ...string) *School {
      school := School{Students: names}
      school.new()
      runtime.SetFinalizer(&school, dog_stop)
      return &school
      }

      func main() {
      trace.Start(os.Stderr)
      defer trace.Stop()
      school := NewSchool([]string{"xx小学"}...)
      school.doNothing()
      time.Sleep(time.Second * 4)
      school = nil
      time.Sleep(4 * time.Second)
      runtime.GC()
      time.Sleep(4 * time.Second)
      }
    2. 将trace输出放到文件

      1
      go run main.go 2> trace_with_handly_gc.out
    3. 运行go tool trace运行可在浏览器中查看情况

      1
      2
      3
      4
      $ go tool trace trace_with_handly_gc.out
      2021/07/05 13:52:41 Parsing trace...
      2021/07/05 13:52:41 Splitting trace...
      2021/07/05 13:52:41 Opening browser. Trace viewer is listening on http://127.0.0.1:64658

      去到浏览器后显示:

      1
      2
      3
      4
      5
      6
      7
      8
      9
      View trace
      Goroutine analysis
      Network blocking profile (⬇)
      Synchronization blocking profile (⬇)
      Syscall blocking profile (⬇)
      Scheduler latency profile (⬇)
      User-defined tasks
      User-defined regions
      Minimum mutator utilization

      点击其中的“Goroutine analysis”可看到程序中每个goroutine的信息:

      1
      2
      3
      4
      5
      6
      7
      8
      9
      Goroutines:
      runtime.main N=1
      runtime.gcBgMarkWorker N=4
      main.(*dog).dog_run N=1
      runtime/trace.Start.func1 N=1
      runtime.runfinq N=1
      runtime.bgsweep N=1
      runtime.bgscavenge N=1
      N=1

      其中的runtime.bgsweep goroutine就表明触发了sweep!

  • 再次原程序并且走上面的流程、到最后一步“Goroutine analysis”可见输出如下:

    1
    2
    3
    4
    5
    Goroutines:
    main.(*dog).dog_run N=1
    runtime.main N=1
    runtime/trace.Start.func1 N=1
    N=4

    可见、上面没有runtime.bgsweep goroutine、没有触发sweep

总结

  • runtime.SetFinalizer的绑定触发是在sweep时候才会触发
  • gctrace=1参数看到的只是gc的信息,是没有sweep相关信息的
  • go tool trace非常强大、可基本看到全部信息