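I am trying to build a system with deterministic real-time response.

I have created a number of cpusets, moved all non-critical tasks and unpinned kernel threads into one set, and then pinned each real-time thread to its own cpuset, each consisting of a single CPU.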

# non-critical tasks and unpinned kernel threads
cset proc --move --fromset=root --toset=system
cset proc --kthread --fromset=root --toset=system

# realtime threads
cset proc --move --toset=shield/RealtimeTest1/thread1 --pid=17651
cset proc --move --toset=shield/RealtimeTest1/thread2 --pid=17654

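My scenario is as follows:
  • Thread 1: SCHED_OTHER, pinned to set1, waiting on a std::future<void>
  • Thread 2: SCHED_FIFO, pinned to set2, calling std::promise<void>::set_value()

  • Thread 1 blocks forever.
    However, if I change thread 2 to SCHED_OTHER, thread 1 is able to continue.

    I ran strace -f to get more insight; thread 1 appears to be waiting on a futex (I assume this is part of the internals of std::future), but it never wakes up.

    I am completely stuck. Is there a way to have a thread pin itself to a core and set its scheduler to FIFO, and then use a std::promise to wake another thread that is waiting for it to finish this real-time setup?

    The code in thread 1, which creates thread 2, is as follows: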
    // Thread1:
    std::promise<void> p;
    std::future <void> f = p.get_future();
    
    _thread = std::thread(std::bind(&Dispatcher::Run, this, std::ref(p)));
    
    LOG_INFO << "waiting for thread2 to start" << std::endl;
    
    if (f.valid())
        f.wait();
    

    Thread 2's run function is as follows:
    // Thread2:
    LOG_INFO << "started: threadId=" << Thread::GetId() << std::endl;
    
    Realtime::Service* rs = Service::Registry::Lookup<Realtime::Service>();
    if (rs)
        rs->ConfigureThread(this->Name()); // this does the pinning and FIFO etc
    
    LOG_INFO << "thread2 has started" << std::endl;
    p.set_value(); // indicate fact that the thread has started
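
The handshake above can be reduced to a self-contained sketch, shown here only as an illustration. `ConfigureCallingThread` and `StartAndWaitForWorker` are hypothetical names that do not appear in the post; the sketch assumes Linux with glibc, and it stands in for whatever `Realtime::Service::ConfigureThread` really does, which the post does not show. The promise carries the error code so the waiting thread can tell whether pinning and the switch to SCHED_FIFO succeeded.

```cpp
#include <pthread.h>
#include <sched.h>

#include <future>
#include <thread>

// Hypothetical helper (not from the original post): pin the calling thread
// to one CPU and request SCHED_FIFO. Returns 0 on success, otherwise the
// first non-zero error code reported by pthreads.
int ConfigureCallingThread(int cpu, int fifo_priority) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);  // out-of-range CPU numbers leave the mask empty
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
    if (rc != 0) return rc;  // e.g. EINVAL if no permitted CPU is in the mask

    sched_param param{};
    param.sched_priority = fifo_priority;
    // Usually needs root or CAP_SYS_NICE; fails with EPERM otherwise.
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}

// Starts a worker that configures itself and then fulfils the promise,
// so the caller wakes up only after the real-time setup was attempted.
int StartAndWaitForWorker(int cpu, int fifo_priority) {
    std::promise<int> started;
    std::future<int> result = started.get_future();

    std::thread worker([&started, cpu, fifo_priority] {
        started.set_value(ConfigureCallingThread(cpu, fifo_priority));
        // Real-time work would go here.
    });

    int rc = result.get();  // blocks until the worker calls set_value()
    worker.join();
    return rc;
}
```

Calling `StartAndWaitForWorker(5, 1)` would try to pin the worker to CPU 5 at FIFO priority 1; both values are placeholders. The sketch only helps if the waiting thread really runs on a different CPU from the FIFO thread, since a runnable SCHED_FIFO thread that never blocks will not yield its CPU to a SCHED_OTHER thread.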
    

    The strace output is as follows:
  • Thread 1 is [pid 17651]
  • Thread 2 is [pid 17654]

  • I have removed some of the output for brevity.
    //////// Thread 1 creates thread 2 and waits on a future ////////
    
    [pid 17654] gettid()                    = 17654
    [pid 17651] write(2, "09:29:52 INFO waiting for thread"..., 4309:29:52 INFO waiting for thread2 to start
     <unfinished ...>
    [pid 17654] gettid( <unfinished ...>
    [pid 17651] <... write resumed> )       = 43
    [pid 17654] <... gettid resumed> )      = 17654
    [pid 17651] futex(0xd52294, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
    [pid 17654] gettid()                    = 17654
    [pid 17654] write(2, "09:29:52 INFO thread2 started: t"..., 6109:29:52 INFO thread2  started: threadId=17654
    ) = 61
    
    //////// <snip> thread2 performs pinning, FIFO, etc </snip> ////////
    
    [pid 17654] write(2, "09:29:52 INFO thread2 has starte"..., 3409:29:52 INFO thread2 has started
    ) = 34
    [pid 17654] futex(0xd52294, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xd52268, 2) = 1
    [pid 17651] <... futex resumed> )       = 0
    [pid 17654] futex(0xd522c4, FUTEX_WAKE_PRIVATE, 2147483647 <unfinished ...>
    [pid 17651] futex(0xd52268, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 17654] <... futex resumed> )       = 0
    [pid 17651] <... futex resumed> )       = 0
    
    //////// blocks here forever ////////
    

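    You can see that pid 17651 (thread 1) reports futex resumed, but could it be running on the wrong CPU and getting blocked behind thread 2, which is running as FIFO?

    Update: this appears to be a problem with threads not running on the CPUs they are pinned to.

    Running top -p 17649 -H and pressing f, j to display the last used CPU shows that thread 1 is indeed running on the same CPU as thread 2.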
    top - 10:00:59 up 18:17,  3 users,  load average: 7.16, 7.61, 4.18
    Tasks:   3 total,   2 running,   1 sleeping,   0 stopped,   0 zombie
    Cpu(s):  7.1%us,  0.1%sy,  0.0%ni, 89.5%id,  0.0%wa,  0.0%hi,  3.3%si,  0.0%st
    Mem:   8180892k total,   722800k used,  7458092k free,    43364k buffers
    Swap:  8393952k total,        0k used,  8393952k free,   193324k cached
    
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
    17654 root      -2   0 54080  35m 7064 R  100  0.4   5:00.77 3 RealtimeTest
    17649 root      20   0 54080  35m 7064 S    0  0.4   0:00.05 2 RealtimeTest
    17651 root      20   0 54080  35m 7064 R    0  0.4   0:00.00 3 RealtimeTest
    

    However, looking at the cpuset filesystem, I can see that my tasks are listed under the cpusets I requested:
    /cpusets/shield/RealtimeTest1 $ for i in `find -name tasks`; do echo $i; cat $i; echo "------------"; done
    
    ./thread1/tasks
    17651
    ------------
    ./main/tasks
    17649
    ------------
    ./thread2/tasks
    17654
    ------------
    

    Displaying the cpuset configuration:
    $ cset set --list -r
    cset:
             Name       CPUs-X    MEMs-X Tasks Subs Path
     ------------ ---------- - ------- - ----- ---- ----------
             root       0-23 y     0-1 y   279    2 /
           system 0,2,4,6,8,10 n       0 n   202    0 /system
           shield 1,3,5,7,9,11 n       1 n     0    2 /shield
    RealtimeTest1    1,3,5,7 n       1 n     0    4 /shield/RealtimeTest1
          thread1          3 n       1 n     1    0 /shield/RealtimeTest1/thread1
          thread2          5 n       1 n     1    0 /shield/RealtimeTest1/thread2
             main          1 n       1 n     1    0 /shield/RealtimeTest1/main
    

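    From this I would say that thread 2 should be on CPU 5, but top says it is running on CPU 3.

    Interestingly, sched_getaffinity agrees with the cpuset configuration: it reports thread 1 on CPU 3 and thread 2 on CPU 5.

    However, looking under /proc/17649/task for the CPU each of its tasks last ran on: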
    /proc/17649/task $  for i in `ls -1`; do cat $i/stat | awk '{print $1 " is on " $(NF - 5)}'; done
    17649 is on 2
    17651 is on 3
    17654 is on 3
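
The awk one-liner counts fields from the end of the line, which depends on how many fields this kernel's stat file has. As a hedged alternative, the sketch below reads the `processor` value by position, relying on proc(5) documenting it as field 39. `LastCpuFromStat` is a hypothetical helper name, not something from the post.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the "processor" field (field 39, the CPU the
// task last ran on) from one line of /proc/<pid>/task/<tid>/stat, or -1 if
// the line cannot be parsed.
int LastCpuFromStat(const std::string& stat_line) {
    // Field 2 (comm) is parenthesised and may contain spaces, so parsing
    // starts after the last ')'.
    std::size_t close = stat_line.rfind(')');
    if (close == std::string::npos) return -1;

    std::istringstream rest(stat_line.substr(close + 1));
    std::string field;
    // The first token after ')' is field 3 (state); keep reading up to field 39.
    for (int i = 3; i <= 39; ++i) {
        if (!(rest >> field)) return -1;
    }
    try {
        return std::stoi(field);
    } catch (const std::exception&) {
        return -1;  // field 39 was not a number
    }
}
```

Feeding it each line read from `/proc/17649/task/<tid>/stat` should give the same CPU numbers as the awk command on this kernel.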
    

    sched_getaffinity reports one thing, but reality is another.
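
One way to catch this mismatch from inside the process is to compare the affinity mask with the CPU the thread is actually executing on. The following is a minimal sketch assuming Linux with glibc; `RunningInsideAffinityMask` is a hypothetical name introduced here for illustration.

```cpp
#include <sched.h>

// Hypothetical check: returns true if the CPU the calling thread is running
// on right now is part of its affinity mask. Returns false if it is not, or
// if either query fails.
bool RunningInsideAffinityMask() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return false;

    int cpu = sched_getcpu();  // CPU this thread is executing on
    if (cpu < 0) return false;

    return CPU_ISSET(cpu, &mask) != 0;
}
```

On a correctly behaving kernel this should always return true. Logging a false result from each thread right after its real-time setup would flag the situation described here without needing top or /proc.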

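    It is also interesting that the main thread [pid 17649] should be on CPU 1 (according to the cset output), but it is actually running on CPU 2 (which is on another socket).

    So I would say that cpuset is not working properly?

    My machine configuration is: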
    $ cat /etc/SuSE-release
    SUSE Linux Enterprise Server 11 (x86_64)
    VERSION = 11
    PATCHLEVEL = 1
    $ uname -a
    Linux foobar 2.6.32.12-0.7-default #1 SMP 2010-05-20 11:14:20 +0200 x86_64 x86_64 x86_64 GNU/Linux
    

    Best answer

    I have since re-run the test on a SLES 11 SP2 box, and the pinning works.

    I am therefore marking this as the answer: it is a problem specific to SP1.
