Memory allocation: benchmarking jemalloc/tcmalloc/glibc

knull posted on 2016-06-21 10:15

Last edited by knull on 2016-06-21 10:23

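I have been reading up on memory allocation lately and looked at jemalloc and tcmalloc, two popular allocators that are said to be much faster than the glibc allocator.
However, in a simple benchmark I wrote, neither of them outperformed glibc.
Is something wrong with my test method? Any advice or discussion is welcome.

System: CentOS 7.1; 512 MB RAM; 1 CPU (single core); installed in a virtual machine
Build: gcc (g++) 4.8.5, compiled with g++ -g -O2
Test: single thread, 8K malloc/free calls, run against glibc, tcmalloc, and jemalloc in turn. I compare the average time per call, which is at the microsecond (us) level.
Details:
There are two ways to use an alternative allocator: call the jemalloc functions directly in the code, or set the LD_PRELOAD environment variable to replace glibc's malloc.
1. I first tested direct calls (jemalloc and glibc only). Calling jemalloc directly, its average time was at least 2x that of glibc, and at worst close to 3x.
2. Then I set LD_PRELOAD and tested jemalloc, tcmalloc, and glibc.
In this setup jemalloc took 50% less time than when called directly. (This surprised me and I cannot explain it.)
jemalloc was slightly faster than tcmalloc, but glibc was slightly faster than both (the three were close, within 10% of each other).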
My test code is below; please help me analyze it.

#include <list>
#include <cstdint>   // int32_t, int64_t
#include <cstdio>    // printf
#include <stdlib.h>
#include <ctime>
#include <jemalloc/jemalloc.h>   // only needed for the direct je_malloc/je_free calls
#include <string>

// Adds the time between construction and destruction, in nanoseconds,
// to an external counter.
class TimeKeeper
{
public:
    // Initializer order matches the member declaration order below.
    TimeKeeper(const char *name, int64_t &count) : counter_(count), name_(name)
    {
        clock_gettime(CLOCK_MONOTONIC, &tp_start_);
    }
    ~TimeKeeper()
    {
        struct timespec tp_end;
        clock_gettime(CLOCK_MONOTONIC, &tp_end);
        int64_t cost = tp_end.tv_sec - tp_start_.tv_sec;
        cost = (tp_end.tv_nsec - tp_start_.tv_nsec) + (cost * 1000 * 1000 * 1000);
        counter_ += cost;
    }
private:
    struct timespec tp_start_;
    int64_t &counter_;
    std::string name_;   // never read; constructing it may itself allocate
};

std::list<void*> buf;
int32_t loop = 8 * 1024;
int32_t MAX_MEM_SIZE = 128 * 1024;
int64_t alloc_time_ = 0;
int64_t free_time_ = 0;

// Returns a pseudo-random value in [1, n].
int32_t GetRand(int32_t n = 16)
{
    return rand() % n + 1;
}

void *myalloc(size_t n)
{
    TimeKeeper tmp(__FUNCTION__, alloc_time_);
    //return je_malloc(n);
    return malloc(n);
}

void myfree(void *ptr)
{
    TimeKeeper tmp(__FUNCTION__, free_time_);
    //je_free(ptr);
    free(ptr);
}

void test_muli_times2()
{
    int32_t num = GetRand();
    void *ptr = NULL;
    for (int i = 0; i < loop; ++i)
    {
        ptr = myalloc(GetRand(MAX_MEM_SIZE));
        --num;
        if (num == 0)
        {
            // Free this block right away, then pick the next interval.
            myfree(ptr);
            num = GetRand();
        }
        else
        {
            buf.push_back(ptr);
        }
    }
    // Release everything that is still held, newest first.
    while (!buf.empty())
    {
        ptr = buf.back();
        buf.pop_back();
        myfree(ptr);
    }
    // Format specifiers now match the argument types (int32_t and int64_t).
    printf("alloc %d obj! cost %lld ns! avg = %lld ns!\n",
           loop, (long long)alloc_time_, (long long)(alloc_time_ / loop));
    printf("freed %d obj! cost %lld ns! avg = %lld ns!\n",
           loop, (long long)free_time_, (long long)(free_time_ / loop));
}

int main ()
{
    //TYPEPRINT(HashNode);
    test_muli_times2();
    return 0;
}
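
A possible factor in the numbers above: each TimeKeeper brackets a single malloc or free with two clock_gettime calls, so the timer's own cost is part of every sample, and inside a VM that cost may not be small next to a fast allocation. The sketch below is one way to get a rough estimate of it. The name estimate_timer_overhead_ns is hypothetical and not part of the original test.

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical helper (not from the original post): returns the average
// number of nanoseconds that elapse between two back-to-back
// clock_gettime(CLOCK_MONOTONIC) calls, i.e. roughly the fixed cost a
// per-call timer adds to every measured malloc/free.
int64_t estimate_timer_overhead_ns(int iterations = 100000)
{
    if (iterations <= 0)
    {
        return 0;   // nothing to measure
    }
    struct timespec start, end;
    int64_t total = 0;
    for (int i = 0; i < iterations; ++i)
    {
        clock_gettime(CLOCK_MONOTONIC, &start);
        clock_gettime(CLOCK_MONOTONIC, &end);
        total += (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
               + (end.tv_nsec - start.tv_nsec);
    }
    return total / iterations;
}
```

If the value it returns is of the same order as the per-call averages printed by the test, the comparison between allocators is probably dominated by timer overhead rather than by the allocators themselves.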

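hellioncu posted on 2016-06-21 10:39

As far as I know, tcmalloc mainly improves multithreaded performance.
Also, it is better not to benchmark inside a virtual machine, and the list in your code should go as well.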
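
One way to act on the point about the list: std::list allocates a node on every push_back, and those node allocations go through the same malloc that is being measured. Below is a minimal sketch, assuming a simplified pattern (allocate everything, then free everything in reverse order) rather than the original mix of immediate and deferred frees. It pre-generates the sizes, keeps the pointers in a pre-sized std::vector, and times each phase as a whole. BatchResult and run_batch are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <random>
#include <vector>

// Hypothetical result type for this sketch.
struct BatchResult
{
    int64_t alloc_ns;     // total time spent in the malloc loop
    int64_t free_ns;      // total time spent in the free loop
    std::size_t failed;   // number of malloc calls that returned NULL
};

// Allocates `count` blocks of random size in [1, max_size], then frees them
// in reverse order. max_size must be at least 1.
BatchResult run_batch(std::size_t count, std::size_t max_size, unsigned seed)
{
    using Clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;

    // All bookkeeping storage is set up before the clock starts.
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> dist(1, max_size);
    std::vector<std::size_t> sizes(count);
    for (std::size_t &s : sizes) s = dist(rng);
    std::vector<void *> ptrs(count, nullptr);

    const Clock::time_point t0 = Clock::now();
    for (std::size_t i = 0; i < count; ++i) ptrs[i] = std::malloc(sizes[i]);
    const Clock::time_point t1 = Clock::now();

    // Count failures outside both timed regions.
    std::size_t failed = 0;
    for (void *p : ptrs) if (p == nullptr) ++failed;

    const Clock::time_point t2 = Clock::now();
    for (std::size_t i = count; i-- > 0;) std::free(ptrs[i]);
    const Clock::time_point t3 = Clock::now();

    BatchResult r;
    r.alloc_ns = duration_cast<nanoseconds>(t1 - t0).count();
    r.free_ns = duration_cast<nanoseconds>(t3 - t2).count();
    r.failed = failed;
    return r;
}
```

Calling run_batch(8 * 1024, 128 * 1024, 1) roughly mirrors the loop count and size range of the original test. Timing the whole loop also means the clock is read only twice per phase instead of twice per call.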

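knull posted on 2016-06-21 10:44

Reply to 2# hellioncu
Thanks for the reply.
I just checked some material and found that most write-ups assume multithreading and stress the "multi-core era". So I reran the test with multiple threads on 4 cores (still in a VM): tcmalloc is indeed faster, and jemalloc is slightly faster than glibc.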

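knull posted on 2016-06-21 10:47

Last edited by knull on 2016-06-21 10:49

I just checked some material and found that most write-ups assume multithreading and stress the "multi-core era".
So I reran the test with 4 threads on 4 cores (still in a VM).
jemalloc's malloc/free is about the same as glibc's (jemalloc is slightly faster).
tcmalloc's malloc is much faster than glibc's, by 4 to 5 times, but its free is much slower, taking close to 2x as long as glibc's.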
Here is the modified code (the main change is the added multithreading; build with -std=c++11 -pthread):

#include <list>
#include <cstdint>   // int32_t, int64_t
#include <cstdio>    // printf
#include <stdlib.h>
#include <ctime>
//#include <jemalloc/jemalloc.h>
#include <string>
#include <thread>
#include <atomic>

// Totals shared by all threads, in nanoseconds.
std::atomic<int64_t> alloc_time_(0);
std::atomic<int64_t> free_time_(0);

// Adds the time between construction and destruction, in nanoseconds,
// to alloc_time_ or free_time_.
class TimeKeeper
{
public:
    TimeKeeper(bool isalloc) : alloc_(isalloc)
    {
        clock_gettime(CLOCK_MONOTONIC, &tp_start_);
    }
    ~TimeKeeper()
    {
        struct timespec tp_end;
        clock_gettime(CLOCK_MONOTONIC, &tp_end);
        int64_t cost = tp_end.tv_sec - tp_start_.tv_sec;
        cost = (tp_end.tv_nsec - tp_start_.tv_nsec) + (cost * 1000 * 1000 * 1000);
        if (alloc_)
        {
            alloc_time_.fetch_add(cost);
        }
        else
        {
            free_time_.fetch_add(cost);
        }
    }
private:
    struct timespec tp_start_;
    bool alloc_;
};

int32_t loop = 8 * 1024;
int32_t MAX_MEM_SIZE = 128 * 1024;

// Returns a pseudo-random value in [1, n].
// Note: rand() uses one state shared by all threads.
int32_t GetRand(int32_t n = 16)
{
    return rand() % n + 1;
}

void *myalloc(size_t n)
{
    TimeKeeper tmp(true);
    //return je_malloc(n);
    return malloc(n);
}

void myfree(void *ptr)
{
    TimeKeeper tmp(false);
    //je_free(ptr);
    free(ptr);
}

void test_muli_times2()
{
    std::list<void*> buf;   // now local, so each thread has its own list
    int32_t num = GetRand();
    void *ptr = NULL;
    for (int i = 0; i < loop; ++i)
    {
        ptr = myalloc(GetRand(MAX_MEM_SIZE));
        --num;
        if (num == 0)
        {
            myfree(ptr);
            num = GetRand();
        }
        else
        {
            buf.push_back(ptr);
        }
    }
    // Release everything that is still held, newest first.
    while (!buf.empty())
    {
        ptr = buf.back();
        buf.pop_back();
        myfree(ptr);
    }
}

void test_threads()
{
    std::thread t1(test_muli_times2);
    std::thread t2(test_muli_times2);
    std::thread t3(test_muli_times2);
    std::thread t4(test_muli_times2);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    int64_t cost = alloc_time_.load();
    int32_t times = loop * 4;
    // Format specifiers now match the argument types (int32_t and int64_t).
    printf("alloc %d obj! cost %lld ns! avg = %lld ns!\n",
           times, (long long)cost, (long long)(cost / times));
    cost = free_time_.load();
    printf("freed %d obj! cost %lld ns! avg = %lld ns!\n",
           times, (long long)cost, (long long)(cost / times));
}

int main ()
{
    //TYPEPRINT(HashNode);
    test_threads();
    return 0;
}
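
In the multithreaded version above, all four threads draw sizes from the single shared rand() state and add every sample to the same two atomic counters, so the threads interact through more than just the allocator. The following is a sketch, under the assumption that a simplified allocate-all-then-free-all pattern is acceptable, in which each thread has its own generator and writes its totals to its own slot. ThreadTiming, timed_worker, and run_threads are hypothetical names, and this is not the original poster's method.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <random>
#include <thread>
#include <vector>

// Hypothetical per-thread result for this sketch.
struct ThreadTiming
{
    int64_t alloc_ns;   // time this thread spent in its malloc loop
    int64_t free_ns;    // time this thread spent in its free loop
};

// Runs one thread's workload: `count` allocations of random size in
// [1, max_size] (max_size >= 1), then frees them in reverse order.
void timed_worker(std::size_t count, std::size_t max_size, unsigned seed,
                  ThreadTiming &out)
{
    using Clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;

    std::mt19937 rng(seed);   // private generator, no shared state
    std::uniform_int_distribution<std::size_t> dist(1, max_size);
    std::vector<std::size_t> sizes(count);
    for (std::size_t &s : sizes) s = dist(rng);
    std::vector<void *> ptrs(count, nullptr);

    const Clock::time_point t0 = Clock::now();
    for (std::size_t i = 0; i < count; ++i) ptrs[i] = std::malloc(sizes[i]);
    const Clock::time_point t1 = Clock::now();
    for (std::size_t i = count; i-- > 0;) std::free(ptrs[i]);
    const Clock::time_point t2 = Clock::now();

    // Written once, after all timed work is finished.
    out.alloc_ns = duration_cast<nanoseconds>(t1 - t0).count();
    out.free_ns = duration_cast<nanoseconds>(t2 - t1).count();
}

// Starts `nthreads` workers and returns one timing entry per thread.
std::vector<ThreadTiming> run_threads(unsigned nthreads, std::size_t count,
                                      std::size_t max_size)
{
    std::vector<ThreadTiming> results(nthreads, ThreadTiming{0, 0});
    std::vector<std::thread> threads;
    threads.reserve(nthreads);
    for (unsigned t = 0; t < nthreads; ++t)
    {
        threads.emplace_back(timed_worker, count, max_size, t + 1,
                             std::ref(results[t]));
    }
    for (std::thread &th : threads) th.join();
    return results;
}
```

Calling run_threads(4, 8 * 1024, 128 * 1024) roughly matches the thread count and sizes used above. Like the original, it needs -pthread on older toolchains. Reporting the per-thread values separately also shows whether one thread is much slower than the others, which a single summed counter hides.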

lxyscls posted on 2016-06-21 11:10

Multiple cores in a virtual machine are pretty much useless, especially for compute-bound workloads.

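yulihua49 posted on 2016-06-21 11:16

lxyscls wrote on 2016-06-21 11:10:
"Multiple cores in a virtual machine are pretty much useless, especially for compute-bound workloads."
Virtual machines can be multi-core. Which virtual machine are you using?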

lxyscls posted on 2016-06-21 11:24

Reply to 6# yulihua49

VMware Workstation

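knull posted on 2016-06-21 11:50

Reply to 5# lxyscls
OK, thanks for the reply.
I will try testing on my local machine. That said, the results did improve with multiple threads.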

knull posted on 2016-06-21 12:24

Reply to 6# yulihua49

The host itself is a multi-core i3 machine; the VM runs on VMware, where the number of cores can be configured.

cjfeii posted on 2016-07-22 16:37

Try it on a physical machine.


Chinaunix's Archiver
