On Wed, Mar 17 2004, Peter Zaitsev wrote:
> Hello,
>
> I'm wondering whether there is any way in Linux to do a proper fsync(),
> one which makes sure data is written to the disk.
>
> Currently on IDE devices one can see that fsync() only flushes data to
> the drive cache, which is not enough for the ACID guarantees a database
> server must give.
>
> There is the solution of just disabling the drive write cache, but it
> seems to slow down performance way too much.
Chris and I have working real fsync() with the barrier patches. I'll clean it up and post a patch for vanilla 2.6.5-rc today.
On Thu, 18 Mar 2004, Jens Axboe wrote: > Chris and I have working real fsync() with the barrier patches. I'll > clean it up and post a patch for vanilla 2.6.5-rc today.
This is good news.
The barrier stuff is long overdue^UI'm looking forward to this.
I'm using the term "TCQ" liberally although it may be inexact for older (parallel) ATA generations:
All these ATA fsync() vs. write cache issues have been open for much too long - no reproaches, but it's a pity we haven't been able to have data consistency for data bases and fast bulk writes (that need the write cache without TCQ) in the same drive for so long. I have seen Linux introduce TCQ for PATA early in 2.5, then drop it again. Similarly, FreeBSD ventured into TCQ for ATA but appears to have dropped it again as well.
May I ask that the information whether a particular driver (file system, hardware) supports write barriers be exposed in a standard way, for instance in the Kconfig help lines?
If I recall correctly from earlier patches, the barrier stuff is 1. command model (ATA vs. SCSI) specific and 2. driver and hardware specific and 3. requires that the file system knows how to use this properly.
Given that file systems have certain write ordering requirements if they are to be recoverable after a crash, I suspect Linux has _not_ been able to guarantee on-disk consistency at any time for years, which means that a crash at the wrong moment can kill the file system itself if the drive has reordered writes - only ext3 without write cache seems to behave better in this respect (data=ordered).
I would like to have a document that shows which file systems, which chipset drivers for PATA, which chipset drivers for ATA, and which low-level SCSI host adaptor drivers support write barriers. We will probably also need to check whether intermediate layers such as md and dm-mod propagate such information.
Given the necessary information, I can hack together an HTML document to provide this information; this offer has, however, not seen any response in the past. I am not acquainted with the drivers and need information from the kernel hackers. Without such support, such a documentation effort is doomed.
BTW, I should very much like to be able to trace the low-level write information that goes out to the device, possibly including the payload - something like tcpdump for the ATA or SCSI commands that are sent to the driver. Is such a facility available?
-- Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
(btw - maybe you don't like to be cc'ed on kernel posts, but I do. it's lkml etiquette to do so, and it makes sure that I see your mail. otherwise I might not, especially true for bigger threads. so please, cc people. thanks)
On Thu, Mar 18 2004, Matthias Andree wrote: > On Thu, 18 Mar 2004, Jens Axboe wrote:
> > Chris and I have working real fsync() with the barrier patches. I'll > > clean it up and post a patch for vanilla 2.6.5-rc today.
> This is good news.
> The barrier stuff is long overdue^UI'm looking forward to this.
> I'm using the term "TCQ" liberally although it may be inexact for older > (parallel) ATA generations:
> All these ATA fsync() vs. write cache issues have been open for much too > long - no reproaches, but it's a pity we haven't been able to have data > consistency for data bases and fast bulk writes (that need the write > cache without TCQ) in the same drive for so long. I have seen Linux > introduce TCQ for PATA early in 2.5, then drop it again. Similarly, > FreeBSD ventured into TCQ for ATA but appears to have dropped it again > as well.
That's because PATA TCQ sucks :-)
> May I ask that the information whether a particular driver (file system, > hardware) supports write barriers be exposed in a standard way, for > instance in the Kconfig help lines?
Since reiser is the first implementation of it, it gets to choose how this works. Currently that's done by giving -o barrier=flush (=ordered used to exist as well, it will probably return - right now we just played with IDE).
> If I recall correctly from earlier patches, the barrier stuff is 1. > command model (ATA vs. SCSI) specific and 2. driver and hardware > specific and 3. requires that the file system knows how to use this > properly.
Yes.
> Given that file systems have certain write ordering requirements if they > are to be recoverable after a crash, I suspect Linux has _not_ been able > to guarantee on-disk consistency for any time for years, which means > that a crash in the wrong moment can kill the file system itself if the > drive has reordered writes - only ext3 without write cache seems to > behave better in this respect (data=ordered).
> I would like to have a document that shows which file system, which > chipset driver for PATA, which chipset driver for ATA, which low-level > SCSI host adaptor driver, which file system support write barrier. We > will probably also need to check if intermediate layers such as md and > dm-mod propagate such information.
Only the PATA core needs to support it, not the chipset drivers. md and dm aren't difficult to implement now that unplug/congestion already iterates the device list and I added a blkdev_issue_flush() command.
> Given the necessary information, I can hack together a HTML document to > provide this information; this offer has however not seen any response > in the past. I am however not acquainted with the drivers and need > information from the kernel hackers. Without such support, such a > documentation effort is doomed.
Usual approach - just start writing, it's a lot easier to get corrections (people seem to be several times more willing to point out your errors than to give you recommendations for something you haven't started yet).
> BTW, I should very much like to be able to trace the low-level write > information that goes out to the device, possibly including the payload > - something like tcpdump for the ATA or SCSI commands that are sent to > the driver. Is such a facility available?
> > All these ATA fsync() vs. write cache issues have been open for much too > > long - no reproaches, but it's a pity we haven't been able to have data > > consistency for data bases and fast bulk writes (that need the write > > cache without TCQ) in the same drive for so long. I have seen Linux > > introduce TCQ for PATA early in 2.5, then drop it again. Similarly, > > FreeBSD ventured into TCQ for ATA but appears to have dropped it again > > as well.
> That's because PATA TCQ sucks :-)
True. Few drives support it, and many of these you would not want to run in production...
> > May I ask that the information whether a particular driver (file system, > > hardware) supports write barriers be exposed in a standard way, for > > instance in the Kconfig help lines?
> Since reiser is the first implementation of it, it gets to chose how > this works. Currently that's done by giving -o barrier=flush (=ordered > used to exist as well, it will probably return - right now we just > played with IDE).
This looks as though this was not the default and required the user to know what he's doing. Would it be possible to choose a sane default (like flush for ATA or ordered for SCSI when the underlying driver supports ordered tags) and leave the user just the chance to override this?
> Only PATA core needs to support it, not the chipset drivers. md and dm
Hum, I know the older Promise chips were blacklisted for PATA TCQ in FreeBSD. Might "ordered" cause situations where similar things happen to Linux? How about SCSI/libata? Is the situation the same there?
> aren't a difficult to implement now that unplug/congestion already > iterates the device list and I added a blkdev_issue_flush() command.
So this would - for SCSI - be an sd issue rather than a driver issue as well?
-- Matthias Andree
On Thu, Mar 18 2004, Matthias Andree wrote: > > > All these ATA fsync() vs. write cache issues have been open for much too > > > long - no reproaches, but it's a pity we haven't been able to have data > > > consistency for data bases and fast bulk writes (that need the write > > > cache without TCQ) in the same drive for so long. I have seen Linux > > > introduce TCQ for PATA early in 2.5, then drop it again. Similarly, > > > FreeBSD ventured into TCQ for ATA but appears to have dropped it again > > > as well.
> > That's because PATA TCQ sucks :-)
> True. Few drives support it, and many of these you would not want to run > in production...
Plus, the spec is broken.
> > > May I ask that the information whether a particular driver (file system, > > > hardware) supports write barriers be exposed in a standard way, for > > > instance in the Kconfig help lines?
> > Since reiser is the first implementation of it, it gets to chose how > > this works. Currently that's done by giving -o barrier=flush (=ordered > > used to exist as well, it will probably return - right now we just > > played with IDE).
> This looks as though this was not the default and required the user to > know what he's doing. Would it be possible to choose a sane default > (like flush for ATA or ordered for SCSI when the underlying driver > supports ordered tags) and leave the user just the chance to override > this?
When things have matured, might not be a bad idea to default to using barriers.
> > Only PATA core needs to support it, not the chipset drivers. md and dm
> Hum, I know the older Promise chips were blacklisted for PATA TCQ in > FreeBSD. Might "ordered" cause situations where similar things happen to > Linux? How about SCSI/libata? Is the situation the same there?
Don't confuse TCQ and barriers, they have nothing to do with each other for IDE. I can't imagine any chipsets having problems with a synchronize cache command.
> > aren't a difficult to implement now that unplug/congestion already > > iterates the device list and I added a blkdev_issue_flush() command.
> So this would - for SCSI - be an sd issue rather than a driver issue as > well?
No, for scsi it's a low level driver issue. IDE chipset 'drivers' really aren't anything but setup stuff, and maybe a few hooks to deal with dma. All the action is in the ide core.
On Wed, 2004-03-17 at 22:47, Jens Axboe wrote: > > There is solution just to disable drive write cache, but it seems to > > slowdown performance way to much.
> Chris and I have working real fsync() with the barrier patches. I'll > clean it up and post a patch for vanilla 2.6.5-rc today.
Good to hear. How is it going to work from the user's point of view? Will fsync() simply work properly again, or will it need some special handling?
Also, what is the state of fsync() in 2.6 nowadays?
I've done some tests on a 3ware RAID array, and 2.6 looks different from the 2.4 kernels I tested previously.
I have a simple test which does single-page writes to a file, each followed by fsync(). The first run gives you the case where the file grows with each write, the second the case where you're writing to existing file space.
The results I get on 2.4 are something like 40 sec per 1000 fsyncs for the new file, and 0.6 sec for the existing file.
With 2.6.3, both the existing-file and the new-file case complete in less than 1 second.
-- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com
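The test Peter describes was not posted in the thread. The following is only a minimal sketch of what such a test might look like, written in Python for brevity. The 4096-byte page size, the function names and the file path are assumptions, not details from his mail. The first pass truncates the file so every write extends it; the second pass rewrites the same offsets in the already allocated file.

```python
import os
import sys
import tempfile
import time

PAGE = 4096  # assumed page size; the mail only says "single page writes"


def timed_fsyncs(path, count, truncate):
    """Write `count` pages at consecutive offsets, calling fsync() after each.

    truncate=True  starts from an empty file, so each write grows the file.
    truncate=False reuses the existing file, so writes land in allocated space.
    Returns the elapsed time in seconds.
    """
    flags = os.O_WRONLY | os.O_CREAT | (os.O_TRUNC if truncate else 0)
    fd = os.open(path, flags, 0o600)
    buf = b"x" * PAGE
    try:
        start = time.perf_counter()
        for i in range(count):
            os.pwrite(fd, buf, i * PAGE)
            os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def run(path, count=1000):
    """Return (seconds for growing file, seconds for existing file)."""
    growing = timed_fsyncs(path, count, truncate=True)
    existing = timed_fsyncs(path, count, truncate=False)
    return growing, existing


if __name__ == "__main__":
    # Pass a path on the file system under test; a temp dir may be on tmpfs.
    target = sys.argv[1] if len(sys.argv) > 1 else os.path.join(
        tempfile.mkdtemp(), "fsync-test.dat")
    growing, existing = run(target)
    print(f"growing file:  {growing:.2f} s per 1000 fsyncs")
    print(f"existing file: {existing:.2f} s per 1000 fsyncs")
```

The numbers are only meaningful when the target path sits on the disk and file system being examined; a RAM-backed temporary directory will report near-zero times whatever the kernel does.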
On Thu, Mar 18 2004, Peter Zaitsev wrote: > On Wed, 2004-03-17 at 22:47, Jens Axboe wrote:
> > > There is solution just to disable drive write cache, but it seems to > > > slowdown performance way to much.
> > Chris and I have working real fsync() with the barrier patches. I'll > > clean it up and post a patch for vanilla 2.6.5-rc today.
> Good to hear. How is it going to work from user point of view ? > Just fsync working back again or there would be some special handling.
It's just going to work :)
> Also. What is about fsync() in 2.6 nowadays ?
> I've done some tests on 3WARE RAID array and it looks like it is > different compared to 2.4 I've been testing previously.
> I have the simple test which has single page writes to the file followed > by fsync(). First run give you the case when file grows with each > write, second when you're writing to existing file space.
> The results I have on 2.4 is something like 40 sec per 1000 fsyncs for > new file, and 0.6 sec for existing file.
> With 2.6.3 I have both existing file and new file to complete in less > than 1 second.
I believe some missed set_page_writeback() calls caused fsync() to never really wait on anything, pretty broken... IIRC, it's fixed in latest -mm, or maybe it's just pending for next release.
On Thu, 2004-03-18 at 14:47, Jens Axboe wrote: > > With 2.6.3 I have both existing file and new file to complete in less > > than 1 second.
> I believe some missed set_page_writeback() calls caused fsync() to never > really wait on anything, pretty broken... IIRC, it's fixed in latest > -mm, or maybe it's just pending for next release.
This should have only been broken in -mm. Which kernels exactly are you comparing? Maybe the 3ware array defaults to different writecache settings under 2.6?
On Thu, 2004-03-18 at 12:11, Chris Mason wrote: > > I believe some missed set_page_writeback() calls caused fsync() to never > > really wait on anything, pretty broken... IIRC, it's fixed in latest > > -mm, or maybe it's just pending for next release.
> This should have only been broken in -mm. Which kernels exactly are you > comparing? Maybe the 3ware array defaults to different writecache > settings under 2.6?
I'm trying the RH AS 3.0 kernel; however, I see the same behavior on my SuSE 8.2 workstation.
I use a 2.6.3 kernel for the tests now (it is not the latest, I know), with the EXT3 file system.
On Thu, 2004-03-18 at 15:17, Peter Zaitsev wrote: > On Thu, 2004-03-18 at 12:11, Chris Mason wrote:
> > > I believe some missed set_page_writeback() calls caused fsync() to never > > > really wait on anything, pretty broken... IIRC, it's fixed in latest > > > -mm, or maybe it's just pending for next release.
> > This should have only been broken in -mm. Which kernels exactly are you > > comparing? Maybe the 3ware array defaults to different writecache > > settings under 2.6?
> I'm trying RH AS 3.0 kernel, however I have the same behavior on my > SuSE 8.2 workstation.
Some suse 8.2 kernels had write barriers for IDE, some did not. If you're running any kind of recent suse kernel, you're doing cache flushes on fsync with ext3.
Not sure if RH has ever carried the patches or not. Easy enough to test for on suse, just look for blk_queue_ordered in the System.map.
> I use 2.6.3 kernel for tests now (It is not the latest I know) > EXT3 file system.
> 3WARE has writeback cache setting in both cases.
Then it sounds like your 2.4 is doing flushes. I'd expect this test to run very quickly without them.
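The System.map check Chris suggests can be done with grep; as an illustration, here is a small Python sketch of the same lookup. It assumes the usual three-column System.map layout (address, type, name). The function name is made up for this example, and the location of System.map varies by distribution, so the path is taken from the command line.

```python
import sys


def has_symbol(system_map_path, symbol):
    """Return True if `symbol` is listed as a symbol name in a System.map file.

    Each line is expected to look like: <address> <type> <name>
    """
    with open(system_map_path, "r", errors="replace") as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 3 and fields[2] == symbol:
                return True
    return False


if __name__ == "__main__":
    # Usage: python check_map.py /path/to/System.map
    found = has_symbol(sys.argv[1], "blk_queue_ordered")
    print("blk_queue_ordered present" if found else "blk_queue_ordered not found")
```

Finding the symbol only indicates that the kernel was built with the barrier code; it does not show whether a given file system or mount actually uses it.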
On Thu, 2004-03-18 at 12:33, Chris Mason wrote: > Some suse 8.2 kernels had write barriers for IDE, some did not. If > you're running any kind of recent suse kernel, you're doing cache > flushes on fsync with ext3.
I have this kernel:
Linux abyss 2.4.20-4GB #1 Sat Feb 7 02:07:16 UTC 2004 i686 unknown unknown GNU/Linux
I believe it is reasonably recent one from Hubert's kernels.
The thing is the performance is different if file grows or it does not. If it does - we have some 25 fsync/sec. IF we're writing to existing one, we have some 1600 fsync/sec
In the former case cache is surely not flushed.
> > I use 2.6.3 kernel for tests now (It is not the latest I know) > > EXT3 file system.
> > 3WARE has writeback cache setting in both cases.
> Then it sounds like your 2.4 is doing flushes. I'd expect this test to > run very quickly without them.
2.4 does flush in one case but not in the other. 2.6 does not do it in either case.
I was also surprised to see that this simple test case performs so differently with the default and the "deadline" IO scheduler - 1.6 vs 0.5 sec per 1000 fsyncs.
-- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com
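To put the figures quoted so far side by side, the sketch below converts them to fsyncs per second and compares each with a rough ceiling of one synchronous rewrite per platter rotation. The 7200 RPM value is purely an assumption for illustration (the thread does not name the drive), and a RAID controller with its own writeback cache, such as the 3ware setup mentioned, can legitimately exceed that ceiling.

```python
def fsyncs_per_second(count, seconds):
    """Convert 'count fsyncs in so many seconds' into a rate."""
    return count / seconds


# Figures quoted in the thread, as (fsync count, elapsed seconds).
REPORTED = {
    "2.4 new file": (1000, 40.0),
    "2.4 existing file": (1000, 0.6),
    "2.6.3 (1.6 s figure)": (1000, 1.6),
    "2.6.3 (0.5 s figure)": (1000, 0.5),
}

ASSUMED_RPM = 7200                        # assumption, not from the thread
ROTATIONS_PER_SECOND = ASSUMED_RPM / 60   # 120 rotations per second


def summarize(reported=REPORTED, limit=ROTATIONS_PER_SECOND):
    """Return (label, rate, within_limit) for each reported measurement."""
    rows = []
    for label, (count, seconds) in reported.items():
        rate = fsyncs_per_second(count, seconds)
        rows.append((label, rate, rate <= limit))
    return rows


if __name__ == "__main__":
    for label, rate, within in summarize():
        note = "plausible for a bare disk" if within else "needs a cache somewhere"
        print(f"{label:24s} {rate:8.1f} fsync/s  ({note})")
```

Under that assumption only the 25 fsync/s case is slow enough to be hitting the platters each time, which fits the later conclusion that the growing-file case is the one being flushed.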
On Thu, 2004-03-18 at 15:46, Peter Zaitsev wrote: > On Thu, 2004-03-18 at 12:33, Chris Mason wrote:
> > Some suse 8.2 kernels had write barriers for IDE, some did not. If > > you're running any kind of recent suse kernel, you're doing cache > > flushes on fsync with ext3.
> I have this kernel:
> Linux abyss 2.4.20-4GB #1 Sat Feb 7 02:07:16 UTC 2004 i686 unknown > unknown GNU/Linux
> I believe it is reasonably recent one from Hubert's kernels.
> The thing is the performance is different if file grows or it does not. > If it does - we have some 25 fsync/sec. IF we're writing to existing > one, we have some 1600 fsync/sec
> In the former case cache is surely not flushed.
Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens when you commit. ext3 always commits on fsync and reiser only commits when you've changed metadata.
Thanks to Jens, the 2.6 barrier patch has a nice clean way to allow barriers on fsync, O_SYNC, O_DIRECT, etc, so we can make IDE drives much safer than the 2.4 code did.
I had a patch to make fsync always generate the barriers in 2.4, but it was tricky since it had to figure out the last buffer it was going to write before it wrote it. The 2.6 code is much better.
> 2.4 does flush in one case but not in other. 2.6 does not do it in ether > case.
> I was also surprised to see this simple test case has so different > performance with default and "deadline" IO scheduler - 1.6 vs 0.5 sec > per 1000 fsync's.
Not sure on that one, both cases are generating tons of unplugs, the drive is just responding insanely fast.
On Thu, 2004-03-18 at 13:02, Chris Mason wrote: > > In the former case cache is surely not flushed.
> Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens > when you commit. ext3 always commits on fsync and reiser only commits > when you've changed metadata.
Oh, yes. This is Reiser; I did not think it was an FS issue. I'll know to stay away from ReiserFS now.
> Thanks to Jens, the 2.6 barrier patch has a nice clean way to allow > barriers on fsync, O_SYNC, O_DIRECT, etc, so we can make IDE drives much > safer than the 2.4 code did.
Great.
> > I was also surprised to see this simple test case has so different > > performance with default and "deadline" IO scheduler - 1.6 vs 0.5 sec > > per 1000 fsync's.
> Not sure on that one, both cases are generating tons of unplugs, the > drive is just responding insanely fast.
Well why it would be slow if it has write cache off.
-- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com
On Thu, 2004-03-18 at 16:09, Peter Zaitsev wrote: > On Thu, 2004-03-18 at 13:02, Chris Mason wrote:
> > > In the former case cache is surely not flushed.
> > Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens > > when you commit. ext3 always commits on fsync and reiser only commits > > when you've changed metadata.
> Oh. Yes. This is Reiser, I did not think it is FS issue. > I'll know to stay away from ReiserFS now.
For reiserfs data=ordered should be enough to trigger the needed commits. If not, data=journal. Note that neither fs does barriers for O_SYNC, so we're just not perfect in 2.4.
Chris Mason wrote: >On Thu, 2004-03-18 at 16:09, Peter Zaitsev wrote:
>>On Thu, 2004-03-18 at 13:02, Chris Mason wrote:
>>>>In the former case cache is surely not flushed.
>>>Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens >>>when you commit. ext3 always commits on fsync and reiser only commits >>>when you've changed metadata.
>>Oh. Yes. This is Reiser, I did not think it is FS issue. >>I'll know to stay away from ReiserFS now.
>For reiserfs data=ordered should be enough to trigger the needed >commits. If not, data=journal. Note that neither fs does barriers for >O_SYNC, so we're just not perfect in 2.4.
You are not listening to Peter. As I understand it from what Peter says and your words, your implementation is wrong, and makes fsync meaningless. If so, then you need to fix it. fsync should not be meaningless even for metadata only journaling. This is a serious bug that needs immediate correction, if Peter and I understand it correctly from your words.
On Fri, 2004-03-19 at 03:05, Hans Reiser wrote: > Chris Mason wrote:
> >On Thu, 2004-03-18 at 16:09, Peter Zaitsev wrote:
> >>On Thu, 2004-03-18 at 13:02, Chris Mason wrote:
> >>>>In the former case cache is surely not flushed.
> >>>Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens > >>>when you commit. ext3 always commits on fsync and reiser only commits > >>>when you've changed metadata.
> >>Oh. Yes. This is Reiser, I did not think it is FS issue. > >>I'll know to stay away from ReiserFS now.
> >For reiserfs data=ordered should be enough to trigger the needed > >commits. If not, data=journal. Note that neither fs does barriers for > >O_SYNC, so we're just not perfect in 2.4.
> >-chris
> You are not listening to Peter. As I understand it from what Peter says > and your words, your implementation is wrong, and makes fsync > meaningless. If so, then you need to fix it. fsync should not be > meaningless even for metadata only journaling. This is a serious bug > that needs immediate correction, if Peter and I understand it correctly > from your words.
I am listening to Peter. Jens and I have spent a significant amount of time on this code. We can go back and spend many more hours testing and debugging the 2.4 changes, or we can go forward with a very nice solution in 2.6.
On Fri, 2004-03-19 at 05:52, Chris Mason wrote: > I am listening to Peter, Jens and I have spent a significant amount of > time on this code. We can go back and spend many more hours testing and > debugging the 2.4 changes, or we can go forward with a very nice > solution in 2.6.
> I'm planning on going forward with 2.6
Chris, Hans
It is great to hear this is going to be fixed in 2.6, however it is quite a pity we have a real mess with this in 2.4 series.
Summing up what I've heard so far, it looks like it depends on:
- Whether it is fsync/O_SYNC or O_DIRECT (which a user would expect to have the same effect in this respect).
- The kernel version. Some vendors have some fixes, while others do not have them.
- The hardware - whether it has the write cache on or off.
- The type of write (whether it changes metadata or not).
- Finally, the file system and even the journal mount options.
Just curious: does Asynchronous IO at least have the same behavior as standard IO?
All of this makes it extremely hard to explain what users need to do in order to get durability for their changes while preserving performance.
Furthermore, as it was broken for years, I expect we'll have people who developed things with a fast fsync() in mind and who will start screaming once we have a real fsync()
(see my mail about Apple actually disabling cache flush on fsync() due to this reason)
-- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com
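For readers less familiar with the interfaces Peter lists, here is a minimal sketch of the first two from the application side: an explicit fsync() after a buffered write, and a file opened with O_SYNC so that each write is synchronous. The function names are made up for this example. O_DIRECT is left out because it needs aligned buffers and is not supported on every file system. Whether either call reaches the platters is exactly what the thread is about, so this shows only the calling convention, not a durability guarantee.

```python
import os


def _write_all(fd, data):
    """Write all of `data`, looping in case of short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_then_fsync(path, data):
    """Buffered write followed by an explicit fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        _write_all(fd, data)
        os.fsync(fd)  # ask the kernel to push the file's data to the device
    finally:
        os.close(fd)


def write_with_o_sync(path, data):
    """Open with O_SYNC so write() itself returns only after the data is synced."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        _write_all(fd, data)
    finally:
        os.close(fd)
```

Both functions leave the same bytes in the file; they differ only in when the caller is told the data is stable.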
Chris Mason wrote: >On Fri, 2004-03-19 at 03:05, Hans Reiser wrote:
>>Chris Mason wrote:
>>>On Thu, 2004-03-18 at 16:09, Peter Zaitsev wrote:
>>>>On Thu, 2004-03-18 at 13:02, Chris Mason wrote:
>>>>>>In the former case cache is surely not flushed.
>>>>>Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens >>>>>when you commit. ext3 always commits on fsync and reiser only commits >>>>>when you've changed metadata.
>>>>Oh. Yes. This is Reiser, I did not think it is FS issue. >>>>I'll know to stay away from ReiserFS now.
>>>For reiserfs data=ordered should be enough to trigger the needed >>>commits. If not, data=journal. Note that neither fs does barriers for >>>O_SYNC, so we're just not perfect in 2.4.
>>>-chris
>>You are not listening to Peter. As I understand it from what Peter says >>and your words, your implementation is wrong, and makes fsync >>meaningless. If so, then you need to fix it. fsync should not be >>meaningless even for metadata only journaling. This is a serious bug >>that needs immediate correction, if Peter and I understand it correctly >>from your words.
>I am listening to Peter, Jens and I have spent a significant amount of >time on this code.
but you need to get it right.
>We can go back and spend many more hours testing and >debugging the 2.4 changes, or we can go forward with a very nice >solution in 2.6.
>I'm planning on going forward with 2.6
This is a very important patch that you have created, but you haven't articulated what happens in the following scenario (Peter I am making up something without knowing your internals, please feel encouraged to help me on this).
mysql fsync()'s a file, which it thinks guarantees that all of a mysql transaction has reached disk. The disk write caches it. You let fsync return. It is not on disk. mysql performs its mysql commit, and writes a mysql commit record which reaches disk, but not all of the transaction is on disk. The system crashes. mysql plays the log. mysql has internal corruption. User calls Peter. Peter asks, what do you expect when you use a piece of shit like reiserfs? User doesn't care about our internal squabbling and goes back to using windows which does proper commits.
Or, random application fsyncs, expects that it means that data has reached disk, and tells user to perform real world actions dependent on the data being on disk, but it is not.
I hope I am totally off-base and not understanding you.... Please help me here.
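The ordering Hans describes can be sketched as follows. Like his scenario, this is made up for illustration and is not how MySQL works internally; the file layout and function names are hypothetical. The point is that the commit record is written only after fsync() on the data has returned, so the scheme is only as good as fsync() itself.

```python
import os


def _write_all(fd, data):
    """Write all of `data`, looping in case of short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_durably(path, data):
    """Append `data` and fsync() before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        _write_all(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)


def commit_transaction(data_path, log_path, txid, payload):
    """Write the transaction data first, then the commit record."""
    # Step 1: the application believes this makes the payload durable.
    append_durably(data_path, payload)
    # Step 2: only now is the transaction declared committed.
    append_durably(log_path, f"COMMIT {txid}\n".encode("ascii"))


def committed_ids(log_path):
    """On recovery, trust every transaction that has a commit record."""
    ids = []
    with open(log_path, "r", encoding="ascii") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 2 and parts[0] == "COMMIT":
                ids.append(int(parts[1]))
    return ids
```

If the fsync() in step 1 returns while the payload is still in the drive's write cache, a power loss can leave a commit record on disk for data that never arrived, which is the corruption case in the scenario above.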
On Fri, 2004-03-19 at 14:36, Hans Reiser wrote: > I hope I am totally off-base and not understanding you.... Please help > me here.
Let's look at the actual scope of the problem:

filesystem metadata
filesystem data (fsync, O_SYNC, O_DIRECT)
block device data (fsync, O_SYNC, O_DIRECT)
Multiply the cases above times each filesystem and also times md and device mapper, since the barriers need to aggregate down to all the drives.
In other words, just fixing fsync in 2.4 is not enough, and there is still considerable development needed in 2.6. Maybe after all the 2.6 changes are done and accepted we can consider backporting parts of it to 2.4.
Chris Mason wrote: >On Fri, 2004-03-19 at 14:36, Hans Reiser wrote:
>>I hope I am totally off-base and not understanding you.... Please help >>me here.
>Lets look at actual scope of the problem:
>filesystem metadata >filesystem data (fsync, O_SYNC, O_DIRECT) >block device data (fsync, O_SYNC, O_DIRECT)
>Multiply the cases above times each filesystem and also times md and >device mapper, since the barriers need to aggregate down to all the >drives.
>In other words, just fixing fsync in 2.4 is not enough, and there is >still considerable development needed in 2.6. Maybe after all the 2.6 >changes are done and accepted we can consider backporting parts of it to >2.4.
>-chris
In 2.6 does fsync always insert a write barrier when the metadata journaling option is set for reiserfs?
On Fri, 2004-03-19 at 11:36, Hans Reiser wrote: > mysql fsync()'s a file, which it thinks guarantees that all of a mysql > transaction has reached disk. The disk write caches it. You let fsync > return. It is not on disk. mysql performs its mysql commit, and writes > a mysql commit record which reaches disk, but not all of the transaction > is on disk. The system crashes. mysql plays the log. mysql has > internal corruption. User calls Peter. Peter asks, what do you expect > when you use a piece of shit like reiserfs? User doesn't care about our > internal squabbling and goes back to using windows which does proper > commits.
This is right.
We had some unexplained data corruptions in InnoDB which can be explained by a broken fsync(), but in most cases the scenario is less gloomy. Users just do not see some of the last committed transactions if they test durability by shutting off the power, which is, however, already not good enough for critical applications.
However, this is due to an extra precaution InnoDB takes. It uses a "double write buffer", which basically means each page is first written to a small page-based log file, and only afterwards written to its proper place on the disk. We have to do it even with a proper fsync() implementation, as there is still the possibility of a crash in the middle of an fsync (or synchronous write), which will result in a partial page write. Think, for example, about the case when a page crosses a stripe boundary on RAID.
If the file system guaranteed atomicity of write() calls (synchronous ones would be enough), we could disable it and get a good extra bit of performance.
-- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com
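The double-write idea Peter outlines can be illustrated with a short sketch. This is not InnoDB's actual implementation or on-disk format; the single-slot layout, the CRC32 check, the 4096-byte page size and all names are assumptions chosen to keep the example small. Each page goes to a side file first and is synced, then to its final location; on recovery an intact side copy is replayed over the possibly torn page.

```python
import os
import zlib

PAGE = 4096        # assumed page size
HEADER = 8 + 4     # page number (8 bytes) + CRC32 of the page (4 bytes)


def _pwrite_sync(path, data, offset):
    """Write `data` at `offset` and fsync() before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, data, offset)
        os.fsync(fd)
    finally:
        os.close(fd)


def write_page(data_path, dw_path, page_no, page):
    """Write one page via the double-write slot, then to its final place."""
    if len(page) != PAGE:
        raise ValueError("page must be exactly PAGE bytes")
    crc = zlib.crc32(page)
    record = page_no.to_bytes(8, "big") + crc.to_bytes(4, "big") + page
    _pwrite_sync(dw_path, record, 0)               # step 1: side copy
    _pwrite_sync(data_path, page, page_no * PAGE)  # step 2: final location


def recover(data_path, dw_path):
    """Replay the side copy if it is intact. Returns True if a page was restored."""
    try:
        with open(dw_path, "rb") as fh:
            record = fh.read(HEADER + PAGE)
    except FileNotFoundError:
        return False
    if len(record) < HEADER + PAGE:
        return False
    page_no = int.from_bytes(record[:8], "big")
    crc = int.from_bytes(record[8:12], "big")
    page = record[12:]
    if zlib.crc32(page) != crc:
        # The side copy itself is torn, so step 2 never started and the
        # data file still holds the old, consistent page.
        return False
    _pwrite_sync(data_path, page, page_no * PAGE)
    return True
```

The cost is that every page is written and synced twice, which is the overhead Peter says could be dropped if synchronous write() calls were atomic.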
On Fri, 2004-03-19 at 15:04, Hans Reiser wrote: > Chris Mason wrote: > >Lets look at actual scope of the problem:
> >filesystem metadata > >filesystem data (fsync, O_SYNC, O_DIRECT) > >block device data (fsync, O_SYNC, O_DIRECT)
> >Multiply the cases above times each filesystem and also times md and > >device mapper, since the barriers need to aggregate down to all the > >drives.
> >In other words, just fixing fsync in 2.4 is not enough, and there is > >still considerable development needed in 2.6. Maybe after all the 2.6 > >changes are done and accepted we can consider backporting parts of it to > >2.4.
> In 2.6 does fsync always insert a write barrier when the metadata > journaling option is set for reiserfs?
Yes, fsync is done in the 2.6 patches. O_SYNC, O_DIRECT and others are not yet. The important part right now is to get the IDE core bits reviewed and all the FS guys to agree on how we want to use them.
It's much cleaner in 2.6, the filesystem can just request a flush after the last data buffer goes down the pipe.
On Fri, 2004-03-19 at 14:26, Peter Zaitsev wrote: > On Fri, 2004-03-19 at 05:52, Chris Mason wrote:
> > I am listening to Peter, Jens and I have spent a significant amount of > > time on this code. We can go back and spend many more hours testing and > > debugging the 2.4 changes, or we can go forward with a very nice > > solution in 2.6.
> > I'm planning on going forward with 2.6
> Chris, Hans
> It is great to hear this is going to be fixed in 2.6, however it is > quite a pity we have a real mess with this in 2.4 series.
It is indeed.
> Resuming what I've heard so far it looks like it depends on:
> - If it is fsync/O_SYNC or O_DIRECT (which user would expect to have > the same effect in this respect. > - It depends on kernel version. Some vendors have some fixes, while > others do not have them. > - It depends on hardware - if it has write cache on or off > - It depends on type of write (if it changes mata data or not) > - Finally it depends on file system and even journal mount options
All of the above is correct.
> Just curious does at least Asynchronous IO have the same behavior as > standard IO ?
For the suse patch, yes. If it triggers a commit, you get a cache flush.
> All of these makes it extremely hard to explain what do users need in > order to get durability for their changes, while preserving performance.
> Furthermore as it was broken for years I expect we'll have people which > developed things with fast fsync() in mind, who would start screaming > once we have real fsync()
> (see my mail about Apple actually disabling cache flush on fsync() due > to this reason)
These are all difficult issues. I wish I had easier answers for you, hopefully we can get it all nailed down in 2.6 for starters.