DTrace Internals: Digging into DTrace

Arun Thomas
<arun.thomas@acm.org>
@arunthomas

BSDCan 2017

On Systems Programming

James Mickens, The Night Watch:

[A] systems programmer has seen the terrors of the world and understood the intrinsic horror of existence

On Kernel Debugging

James Mickens, The Night Watch:

When you debug ... an OS kernel, you do it Texas-style. You gather some mean, stoic people, people who have seen things die, and you get some primitive tools, like a compass and a rucksack and a stick that’s pointed on one end, and you walk into the wilderness and you look for trouble, possibly while using chewing tobacco.

DTrace: An Advanced Debugging Tool

It's like packing military-grade GPS and a lightsaber in your rucksack before heading into the wilderness

And looking for trouble

Talk Overview

What is DTrace?

What Can DTrace Trace?

Syscall Tracing (syscall Provider)

# dtrace -qn 'syscall:::entry { printf("%s %s\n", execname, probefunc); }'

sshd read
sshd getpid
sshd select
sshd sigprocmask
sshd sigprocmask
sshd write
sshd select
ntpd sigprocmask
ntpd sigreturn
ntpd select

read() Sizes (syscall Provider)

# dtrace -qn 'syscall::read:return /execname == "sshd"/ { @ = quantize(arg0); }'

           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@                                    23       
               1 |@@@@@@@@@@                               41       
               2 |                                         1        
               4 |@@@@                                     19       
               8 |@                                        4        
              16 |                                         2        
              32 |@                                        6        
              64 |@@                                       9        
             128 |@@@                                      12       
             256 |@@@@@                                    20       
             512 |@@                                       9        
            1024 |@@@                                      12       
            2048 |@                                        4        
            4096 |@                                        4        
            8192 |@                                        3        
           16384 |                                         0        

Kernel Function Tracing (FBT Provider)

# dtrace -n 'fbt::malloc:entry /pid != $pid/ { printf("%s %s\n", execname, arg0); }'
dtrace: description 'fbt::malloc:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  1  32051                     malloc:entry   sendmail 44
  0  32051                     malloc:entry   syncer 496
  2  32051                     malloc:entry   ntpd   16

Userland Tracing (pid Provider)

# dtrace -qn 'pid$target:libc.so.7::entry {  @[probefunc] = count(); }' -c /bin/ls 
  ...
  memset                                                           69
  mbrtowc                                                          76
  wcrtomb                                                          76
  __free                                                          104
  free                                                            104
  memcpy                                                          119
  strncmp                                                         252

proc Provider (SDT)

What happens when we "man man"?

# dtrace -n 'proc:::exec-success { printf("%s", curpsinfo->pr_psargs); }'
dtrace: description 'proc:::exec-success ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  1     14                none:exec-success /bin/sh /usr/bin/man man
  1     14                none:exec-success /sbin/sysctl -n hw.machine_arch
  2     14                none:exec-success /sbin/sysctl -n hw.machine
  2     14                none:exec-success /usr/bin/locale
  2     14                none:exec-success /usr/bin/zcat /usr/share/man/man1/man.1.gz
  0     14                none:exec-success head -1
  2     14                none:exec-success /usr/bin/zcat /usr/share/man/man1/man.1.gz
  1     14                none:exec-success mandoc -Tlint -Wunsupp
  2     14                none:exec-success /usr/bin/zcat /usr/share/man/man1/man.1.gz
  1     14                none:exec-success mandoc
  0     14                none:exec-success less

DTrace Solaris History

From bmc Wed Sep  3 10:27:51 2003
Subject: Houston, Tranquility Base here...
To: dtrace-interest@kiowa

DTrace has landed.  Twenty-three months after we set out, the first cut
of DTrace has integrated into Solaris 10.

DTrace Solaris History (2)

DTrace merge stats from Bryan's announcement:

  Number of source files in usr/src/uts                                  1,757
  Rank, in lines of code, of dtrace.c among these                           12
  Rank, in number of assertions, of dtrace.c among these                     7
  Lines in new .c and .h files created by DTrace project                54,644
  Lines of code (lines with trailing semicolon) in same                 17,088
  Lines of comments in same                                              6,713
  Age, in months, of DTrace project                                         23
  Number of engineers on DTrace project                                      3
  Number of above neither married nor engaged at beginning of project        3
  Number of above either married or engaged at end of project                2
  Subscribers to dtrace-interest@kiowa                                     181
  Number of people who asked to unsubscribe from dtrace-interest@kiowa       1

DTrace BSD history

D Language

DTrace Terminology

DTrace syntax

DTrace Providers

DTrace Components

DTrace Core Framework

D Intermediate Format (DIF)

# dtrace -S -n 'syscall:freebsd:open:entry {printf("%s %s\n", execname, copyinstr(arg0)); }'

DIFO 0x0x8049c7150 returns string (unknown) by ref (size 256)
OFF OPCODE      INSTRUCTION
00: 29011801    ldgs DT_VAR(280), %r1       ! DT_VAR(280) = "execname"
01: 23000001    ret  %r1
...

The Role of Providers

DTrace Architecture

Diagram shamelessly stolen from TeachBSD

Enabling Control Blocks (ECBs)

ECBs in Action

Digging into DTrace Source Code

Bryan Cantrill's Reddit AMA illuminates our path:

DTrace Frontends (aka Consumers)

dtrace(1)

NAME
       dtrace - generic front-end to the DTrace facility

lockstat(1)

NAME
       lockstat - report kernel lock and profiling statistics

plockstat(1)

NAME
     plockstat - Trace pthread lock statistics using DTrace

Writing a Consumer

sysctl debug.dtrace.verbose_ioctl=1
dtrace_ioctl(411): DTRACEIOC_ENABLE
dtrace_ioctl(584): DTRACEIOC_GO
dtrace_ioctl(388): DTRACEIOC_DOFGET
dtrace_ioctl(388): DTRACEIOC_DOFGET
dtrace_ioctl(778): DTRACEIOC_STATUS
dtrace_ioctl(237): DTRACEIOC_BUFSNAP curcpu 2 cpu 0
dtrace_ioctl(338): copyout the buffer snapshot
dtrace_ioctl(357): copyout buffer desc: size 0 drops 0 errors 0

dtrace(1)

plockstat(1)

cddl/contrib/opensolaris/cmd/plockstat/plockstat.c

    if ((g_dtp = dtrace_open(DTRACE_VERSION, 0, &err)) == NULL)
        fatal("failed to initialize dtrace: %s\n",
            dtrace_errmsg(NULL, err));
...
    if ((prog = dtrace_program_strcompile(g_dtp, g_prog,
        DTRACE_PROBESPEC_NAME, 0, 0, NULL)) == NULL)
        dfatal("failed to compile program");

    if (dtrace_program_exec(g_dtp, prog, &info) == -1)
        dfatal("failed to enable probes");
...
    if (dtrace_go(g_dtp) != 0)
        dfatal("dtrace_go()");
...
    do {
        ...
        switch (dtrace_work(g_dtp, stdout, NULL, chewrec, NULL)) {
        case DTRACE_WORKSTATUS_DONE:
            done = 1;
            break;
        ...
    } while (!done);

libdtrace Source

dt_open.c

cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c:

Which D language constructs are supported?

static const dt_ident_t _dtrace_globals[] = {
...
{ "execname", DT_IDENT_SCALAR, 0, DIF_VAR_EXECNAME,
    DT_ATTR_STABCMN, DT_VERS_1_0, &dt_idops_type, "string" },
{ "exit", DT_IDENT_ACTFUNC, 0, DT_ACT_EXIT, DT_ATTR_STABCMN, DT_VERS_1_0,
    &dt_idops_func, "void(int)" },
...
{ "strlen", DT_IDENT_FUNC, 0, DIF_SUBR_STRLEN, DT_ATTR_STABCMN, DT_VERS_1_0,
    &dt_idops_func, "size_t(const char *)" },
...
};

DIF compilation

Kernel Source Paths

Framework paths:

Provider paths:

Key Kernel Source Files

From sys/cddl/contrib/opensolaris/uts/common/dtrace/dtrace.c

/*
 * DTrace - Dynamic Tracing for Solaris
 *
 * This is the implementation of the Solaris Dynamic Tracing framework
 * (DTrace).  The user-visible interface to DTrace is described at length in
 * the "Solaris Dynamic Tracing Guide".  The interfaces between the libdtrace
 * library, the in-kernel DTrace framework, and the DTrace providers are
 * described in the block comments in the <sys/dtrace.h> header file.  The
 * internal architecture of DTrace is described in the block comments in the
 * <sys/dtrace_impl.h> header file.  The comments contained within the DTrace
 * implementation very much assume mastery of all of these sources; if one has
 * an unanswered question about the implementation, one should consult them
 * first.

dtrace_probe()

/*
 * If you're looking for the epicenter of DTrace, you just found it.  This
 * is the function called by the provider to fire a probe -- from which all
 * subsequent probe-context DTrace activity emanates.
 */
void
dtrace_probe(dtrace_id_t id, uintptr_t arg0, uintptr_t arg1,
    uintptr_t arg2, uintptr_t arg3, uintptr_t arg4)

dtrace_probe() - probe context

 *   dtrace_probe() may be called in virtually any context:  kernel, user,
 *   interrupt, high-level interrupt, with arbitrary adaptive locks held, with
 *   dispatcher locks held, with interrupts disabled, etc.  The only latitude
 *   that must be afforded to DTrace is the ability to make calls within
 *   itself (and to its in-kernel subroutines) and the ability to access
 *   arbitrary (but mapped) memory.  On some platforms, this constrains
 *   context.  For example, on UltraSPARC, dtrace_probe() cannot be called
 *   from any context in which TL is greater than zero.  dtrace_probe() may
 *   also not be called from any routine which may be called by dtrace_probe()
 *   -- which includes functions in the DTrace framework and some in-kernel
 *   DTrace subroutines.  All such functions "dtrace_"; providers that
 *   instrument the kernel arbitrarily should be sure to not instrument these
 *   routines.

DTrace-ing DTrace

static void
fbt_provide_module(void *arg, modctl_t *lf)
{
    ...
    /*
     * Employees of dtrace and their families are ineligible.  Void
     * where prohibited.
     */
    if (strcmp(modname, "dtrace") == 0)
        return;

Writing a Provider

static dtrace_pops_t dtmalloc_pops = {
        dtmalloc_provide,
        NULL,
        dtmalloc_enable,
        dtmalloc_disable,
        ...
};
    if (dtrace_register("dtmalloc", &dtmalloc_attr, DTRACE_PRIV_USER,
        NULL, &dtmalloc_pops, NULL, &dtmalloc_id) != 0)
        return;

Providers We'll Discuss

FBT and CTF (Compact C Type Format)

--- clock.o ---
ctfconvert -L VERSION -g clock.o
...
--- buildkernel ---
ctfmerge -L VERSION -g -o kernel.full ...
# readelf  -S /boot/kernel/kernel|grep -A 1 ctf
  ...
  [42] .SUNW_ctf         PROGBITS         0000000000000000  018aaa38
       00000000000d86bc  0000000000000000          43     0     4

FBT Probe Insertion Observed via kgdb

<sys_getpid>         push   %rbp
<sys_getpid+1>       mov    %rsp,%rbp
<sys_getpid+4>       mov    0x8(%rdi),%rax
<sys_getpid+8>       movslq 0xbc(%rax),%rax
<sys_getpid>         int3
<sys_getpid+1>       mov    %rsp,%rbp
<sys_getpid+4>       mov    0x8(%rdi),%rax
<sys_getpid+8>       movslq 0xbc(%rax),%rax

FBT Patching Value

#define  FBT_PATCHVAL            0xcc
#define      FBT_PATCHVAL            (AARCH64_BRK | AARCH64_BRK_IMM16_VAL)
#define        FBT_PATCHVAL            (RISCV_INSN_BREAK)

FBT Trap Handling on ARMv8

pid Provider

#define FASTTRAP_INSTR                  0xcc
int
fasttrap_tracepoint_install(proc_t *p, fasttrap_tracepoint_t *tp)
{
    fasttrap_instr_t instr = FASTTRAP_INSTR;

    if (uwrite(p, &instr, 1, tp->ftt_pc) != 0)
        return (-1);

Defining Static Tracepoints (aka SDT)

# dtrace -n 'proc:::exec-success { printf("%s", curpsinfo->pr_psargs); }'
SDT_PROVIDER_DECLARE(proc);
...
SDT_PROBE_DEFINE1(proc, , , exec__success, "char *");
...
/*
 * In-kernel implementation of execve().  All arguments are assumed to be
 * userspace pointers from the passed thread.
 */
static int
do_execve(td, args, mac_p)
        ...
        SDT_PROBE1(proc, , , exec__success, args->fname);

SDT: Transfer to DTrace Framework

static void
sdt_load()
{
...
    sdt_probe_func = dtrace_probe;
#define SDT_PROBE(prov, mod, func, name, arg0, arg1, arg2, arg3, arg4)  do {    \
        ...
        (*sdt_probe_func)(sdt_##prov##_##mod##_##func##_##name->id, \
            (uintptr_t) arg0, (uintptr_t) arg1, (uintptr_t) arg2,   \
            (uintptr_t) arg3, (uintptr_t) arg4);            \
        ...

Misc. DTrace Source

Resources

Thanks

Summary