Improving Ext3 performance by placing the journal

Posted on 2010-02-23 10:34

Running a Linux server on a hardware RAID6 / LVM setup, we are plagued by the fact that heavy activity on one file system impacts performance on all of them. If there is an active writer on one file system (especially one doing metadata updates), all other file systems suffer extreme performance degradation. Read performance in particular falls right through the floor, and response times become long and highly fluctuating.
The problem seems to exist even on simple single-disk systems, as explained in Ubuntu bug 131094.
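You can watch the effect directly with iostat from the sysstat package: while a heavy writer is active, the await times for the RAID device shoot up. A rough illustration (not part of the original setup):
iostat -x 5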
We have tried all sorts of things, like the noatime and data=journal mount options, various I/O schedulers and /proc/sys/vm parameters, unfortunately only with limited success.
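For the record, the tuning we experimented with looked roughly like the following sketch; the mount point, device names and values are only examples, not a recommendation:
# mount with noatime and full data journaling
mount -o noatime,data=journal /dev/local/my_dev /srv
# try a different I/O scheduler on the RAID device
echo deadline > /sys/block/sda/queue/scheduler
# make the VM flush dirty pages earlier
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10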
With the arrival of solid-state disks in the consumer market, a new opportunity presented itself: keeping the ext3 journal on a fast external device. Having minimal seek times, SSDs looked like the ideal medium for a journal.
We went for the new OCZSSD2-1S32G (32 GB SATA2 from OCZ), since it got some good reviews for its write speed, especially when compared to the offerings from Samsung. Interestingly enough, the OCZ disk identified itself to the Linux kernel as a 'SAMSUNG MCBQE32G5MPP-0VA'. Oh well.
So tonight, after I had connected the new disk to a spare SATA port, I was ready to go.
How to move your ext3 journal to an external device
I booted the box into single-user mode and unmounted all file systems:
umount -a
Then I partitioned the SSD. Make sure that you actually pick the SSD and not one of your live disks, since the device numbering may have changed after adding the additional device; I used the /dev/disk/by-id names just to be sure:
cfdisk /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396
An ext3 journal has a maximum size of 400 MB (with 4k blocks), and an external journal always takes up a whole partition, so there is no point in making the partitions any bigger. If you can, use LVM to carve out these journal partitions, since otherwise you will hit the SCSI limit of 15 partitions per disk pretty quickly. With LVM you would do:
pvcreate /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396
vgcreate journal /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396
lvcreate -L 400M -n my-dev journal
Once the partitions are created, they have to be formatted for journal duty. I added a label to the journal so that I could find the partition more easily later.
mke2fs -O journal_dev -L j-my-dev /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396-part1
or, if you used LVM:
mke2fs -O journal_dev -L j-my-dev /dev/journal/my-dev
Now drop the current journal from the cleanly unmounted file system. This assumes that you use LVM to manage your partitions and that the volume group for them is called "local":
tune2fs -O ^has_journal  /dev/local/my_dev
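If tune2fs refuses because the file system is not clean, a forced check of the (still unmounted) device usually sorts it out first:
e2fsck -f /dev/local/my_dev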
Next, add the journal device. While adding it, we also switch to journal_data mode. This is important, as it makes all metadata and all data go to our fast journal first, without depending on the slow data disks. I also use the label assigned above:
tune2fs -o journal_data -j -J device=LABEL=j-my-dev /dev/local/my_dev
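To check that the file system really points at the external journal (and picked up journal_data as a default mount option), you can inspect the superblock, for example:
dumpe2fs -h /dev/local/my_dev | grep -i journal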
After the SSD journal was attached to all the file systems except the root file system, I ran a
mount -a
just to make sure they were all OK, and then went for a reboot. A few minutes later the system was back up and running fine.
If you have to do this for many partitions, I would strongly advise using a script for the transition.
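A minimal sketch of such a script, assuming every file system lives in the "local" volume group and all of them are unmounted; the volume names (home, srv, var) are placeholders, so adapt before use:
#!/bin/sh
for fs in home srv var; do
    # carve out a 400 MB journal volume and format it for journal duty
    lvcreate -L 400M -n $fs journal
    mke2fs -O journal_dev -L j-$fs /dev/journal/$fs
    # drop the internal journal and attach the external one
    tune2fs -O ^has_journal /dev/local/$fs
    tune2fs -o journal_data -j -J device=LABEL=j-$fs /dev/local/$fs
done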
Performance Impact
After running the setup for a few days, I have drawn the following conclusions:
The general slowness of all file access caused by a single heavy writer is reduced so much that it no longer interferes with daily work.
The hardlink backup (using rsync to keep a copy of the files, with hardlinks to those that have not changed; see the sketch below) is about twice as fast.
The tape based backup (bacula, running at the same time as the hardlink backup) is about twice as fast as well.
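For reference, the hardlink backup mentioned above is the usual rsync --link-dest pattern; roughly like this, with placeholder paths:
rsync -a --delete --link-dest=/backup/yesterday /srv/ /backup/today/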

In other words, having an external journal with a HW RAID setup is a MUST.
Reliability Impact
Using a single SSD to store the journal may raise reliability concerns, since we are introducing a single point of failure into the system. The chance of the single SSD going up in smoke is probably quite a bit higher than the chance of the RAID6 developing such a problem, because individual failed disks in the array can easily be replaced.
I have asked on the ext3-users mailing list what would happen if one lost the journal disk in such a setup. My interpretation of Theodore "ext3" Tso's reply is the following:
In most cases, when something goes wrong the journal will simply get disabled automatically.
The worst, highly unlikely, case is that a full inode table block's worth of inodes could get lost. In general the loss should be limited to the last few minutes' worth of data.
Use SMART to monitor the health status of the SSD, since it knows when the drive starts running out of replacement blocks, before it actually dies (a sample invocation follows after this list).
The discussion on the ext3-users list prompted Ted to re-check the code and find some issues he will create patches for, so watch the kernel changelog!
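A minimal way to do that with smartmontools, using the by-id path from above (for continuous monitoring you would normally configure smartd instead):
# overall health self-assessment
smartctl -H /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396
# vendor attributes, e.g. reallocated / spare block counts
smartctl -A /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396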

From earlier conversations I also draw the following:
It is a good idea to use "errors=panic" as a mount option. This makes sure that a broken system does not linger in limbo, having lost part of its file systems and making a mess of things as it limps on with half a brain.
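You can add it to the options in /etc/fstab, or set it persistently in the superblock; the latter would look roughly like this (device name as in the earlier examples):
tune2fs -e panic /dev/local/my_dev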
So for my part, I am confident that the added risk is worth the performance we gain, but decide for yourself!


This article comes from the ChinaUnix blog. The original post can be found at: http://blog.chinaunix.net/u/30686/showart_2184801.html