[PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

Vladimir Sementsov-Ogievskiy-2
Here is double bug:

First, return error but not set errp. This may lead to:
qmp block-dirty-bitmap-remove may report success when actually failed

block-dirty-bitmap-remove used in a transaction will crash, as
qmp_transaction will think that it returned success and will cal
block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
NULL

Second (like in anecdote), this case is not an error at all. As it is
documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
definition, absence of bitmap is not an error, and similar case handled
at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
there is no bitmaps at all..

But when there are some bitmaps, but not the requested one, it return
error with errp unset.

Fix that.

Fixes: b56a1e31759b750
Signed-off-by: Vladimir Sementsov-Ogievskiy <[hidden email]>
---

Hi all!

Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
sorry for introducing it, and it sad that it's found so late..

Personally, I think that this one worth rc5, as it makes new bitmap
interfaces unusable. But the decision is yours.

Last minute edit: hmm, actually, transaction action introduced in
4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
command is a regression... Maybe it's OK for stable.

 block/qcow2-bitmap.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 8abaf632fc..c6c8ebbe89 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1469,8 +1469,10 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
     Qcow2BitmapList *bm_list;
 
     if (s->nb_bitmaps == 0) {
-        /* Absence of the bitmap is not an error: see explanation above
-         * bdrv_remove_persistent_dirty_bitmap() definition. */
+        /*
+         * Absence of the bitmap is not an error: see explanation above
+         * bdrv_co_remove_persistent_dirty_bitmap() definition.
+         */
         return 0;
     }
 
@@ -1485,7 +1487,8 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
 
     bm = find_bitmap_by_name(bm_list, name);
     if (bm == NULL) {
-        ret = -EINVAL;
+        /* Absence of the bitmap is not an error, see above. */
+        ret = 0;
         goto out;
     }
 
--
2.21.0


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

John Snow


On 12/5/19 2:30 PM, Vladimir Sementsov-Ogievskiy wrote:

> Here is double bug:
>
> First, return error but not set errp. This may lead to:
> qmp block-dirty-bitmap-remove may report success when actually failed
>
> block-dirty-bitmap-remove used in a transaction will crash, as
> qmp_transaction will think that it returned success and will cal
> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
> NULL
>
> Second (like in anecdote), this case is not an error at all. As it is
> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
> definition, absence of bitmap is not an error, and similar case handled
> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
> there is no bitmaps at all..
>
> But when there are some bitmaps, but not the requested one, it return
> error with errp unset.
>
> Fix that.
>
> Fixes: b56a1e31759b750
> Signed-off-by: Vladimir Sementsov-Ogievskiy <[hidden email]>
> ---
>
> Hi all!
>
> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
> sorry for introducing it, and it sad that it's found so late..
>
> Personally, I think that this one worth rc5, as it makes new bitmap
> interfaces unusable. But the decision is yours.
>
> Last minute edit: hmm, actually, transaction action introduced in
> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
> command is a regression... Maybe it's OK for stable.
>

Might be hard to justify an RC5 for the sake of this patch.

If we need an RC5 for other reasons, I'd want to include this.

>  block/qcow2-bitmap.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
> index 8abaf632fc..c6c8ebbe89 100644
> --- a/block/qcow2-bitmap.c
> +++ b/block/qcow2-bitmap.c
> @@ -1469,8 +1469,10 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>      Qcow2BitmapList *bm_list;
>  
>      if (s->nb_bitmaps == 0) {
> -        /* Absence of the bitmap is not an error: see explanation above
> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
> +        /*
> +         * Absence of the bitmap is not an error: see explanation above
> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
> +         */
>          return 0;
>      }
>  
> @@ -1485,7 +1487,8 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>  
>      bm = find_bitmap_by_name(bm_list, name);
>      if (bm == NULL) {
> -        ret = -EINVAL;
> +        /* Absence of the bitmap is not an error, see above. */
> +        ret = 0;
>          goto out;
>      }
>  
>


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

Eric Blake
In reply to this post by Vladimir Sementsov-Ogievskiy-2
On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
> Here is double bug:
>
> First, return error but not set errp. This may lead to:
> qmp block-dirty-bitmap-remove may report success when actually failed
>
> block-dirty-bitmap-remove used in a transaction will crash, as
> qmp_transaction will think that it returned success and will cal

call

> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
> NULL
>
> Second (like in anecdote), this case is not an error at all. As it is
> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
> definition, absence of bitmap is not an error, and similar case handled
> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
> there is no bitmaps at all..

double .

>
> But when there are some bitmaps, but not the requested one, it return
> error with errp unset.
>
> Fix that.
>
> Fixes: b56a1e31759b750
> Signed-off-by: Vladimir Sementsov-Ogievskiy <[hidden email]>
> ---
>
> Hi all!
>
> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
> sorry for introducing it, and it sad that it's found so late..
>
> Personally, I think that this one worth rc5, as it makes new bitmap
> interfaces unusable. But the decision is yours.
>
> Last minute edit: hmm, actually, transaction action introduced in
> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
> command is a regression... Maybe it's OK for stable.

Libvirt REALLY wants to use transaction bitmap management (and require
qemu 4.2) for its incremental backup patches that Peter is almost ready
to merge in.  I'm trying to ascertain:

When exactly can this bug hit?  Can you give a sequence of QMP commands
that will trigger it for easy reproduction?  Are there workarounds (such
as checking that a bitmap exists prior to attempting to remove it)?  If
it does NOT get fixed in rc5, how will libvirt be able to probe whether
the fix is in place?

>
>   block/qcow2-bitmap.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
> index 8abaf632fc..c6c8ebbe89 100644
> --- a/block/qcow2-bitmap.c
> +++ b/block/qcow2-bitmap.c
> @@ -1469,8 +1469,10 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>       Qcow2BitmapList *bm_list;
>  
>       if (s->nb_bitmaps == 0) {
> -        /* Absence of the bitmap is not an error: see explanation above
> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
> +        /*
> +         * Absence of the bitmap is not an error: see explanation above
> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
> +         */
>           return 0;
>       }
>  
> @@ -1485,7 +1487,8 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>  
>       bm = find_bitmap_by_name(bm_list, name);
>       if (bm == NULL) {
> -        ret = -EINVAL;
> +        /* Absence of the bitmap is not an error, see above. */
> +        ret = 0;
>           goto out;
>       }
>  
>

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

John Snow


On 12/5/19 3:09 PM, Eric Blake wrote:

> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>> Here is double bug:
>>
>> First, return error but not set errp. This may lead to:
>> qmp block-dirty-bitmap-remove may report success when actually failed
>>
>> block-dirty-bitmap-remove used in a transaction will crash, as
>> qmp_transaction will think that it returned success and will cal
>
> call
>
>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>> NULL
>>
>> Second (like in anecdote), this case is not an error at all. As it is
>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>> definition, absence of bitmap is not an error, and similar case handled
>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>> there is no bitmaps at all..
>
> double .
>
>>
>> But when there are some bitmaps, but not the requested one, it return
>> error with errp unset.
>>
>> Fix that.
>>
>> Fixes: b56a1e31759b750
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <[hidden email]>
>> ---
>>
>> Hi all!
>>
>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>> sorry for introducing it, and it sad that it's found so late..
>>
>> Personally, I think that this one worth rc5, as it makes new bitmap
>> interfaces unusable. But the decision is yours.
>>
>> Last minute edit: hmm, actually, transaction action introduced in
>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>> command is a regression... Maybe it's OK for stable.
>
> Libvirt REALLY wants to use transaction bitmap management (and require
> qemu 4.2) for its incremental backup patches that Peter is almost ready
> to merge in.  I'm trying to ascertain:
>
> When exactly can this bug hit?  Can you give a sequence of QMP commands
> that will trigger it for easy reproduction?  Are there workarounds (such
> as checking that a bitmap exists prior to attempting to remove it)?  If
> it does NOT get fixed in rc5, how will libvirt be able to probe whether
> the fix is in place?
>

It looks like:

- You need to have at least one bitmap
- You need to use transactional remove
- you need to misspell the bitmap name
- The remove will fail (return -EINVAL) but doesn't set errp
- The caller chokes on incomplete information, state->bitmap is NULL

>>
>>   block/qcow2-bitmap.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
>> index 8abaf632fc..c6c8ebbe89 100644
>> --- a/block/qcow2-bitmap.c
>> +++ b/block/qcow2-bitmap.c
>> @@ -1469,8 +1469,10 @@ int coroutine_fn
>> qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>       Qcow2BitmapList *bm_list;
>>         if (s->nb_bitmaps == 0) {
>> -        /* Absence of the bitmap is not an error: see explanation above
>> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
>> +        /*
>> +         * Absence of the bitmap is not an error: see explanation above
>> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
>> +         */
>>           return 0;
>>       }
>>   @@ -1485,7 +1487,8 @@ int coroutine_fn
>> qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>         bm = find_bitmap_by_name(bm_list, name);
>>       if (bm == NULL) {
>> -        ret = -EINVAL;
>> +        /* Absence of the bitmap is not an error, see above. */
>> +        ret = 0;
>>           goto out;
>>       }
>>  
>


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

Eric Blake
On 12/5/19 2:16 PM, John Snow wrote:

>>> Last minute edit: hmm, actually, transaction action introduced in
>>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>>> command is a regression... Maybe it's OK for stable.
>>
>> Libvirt REALLY wants to use transaction bitmap management (and require
>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>> to merge in.  I'm trying to ascertain:
>>
>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>> that will trigger it for easy reproduction?  Are there workarounds (such
>> as checking that a bitmap exists prior to attempting to remove it)?  If
>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>> the fix is in place?
>>
>
> It looks like:
>
> - You need to have at least one bitmap
> - You need to use transactional remove
> - you need to misspell the bitmap name
> - The remove will fail (return -EINVAL) but doesn't set errp
> - The caller chokes on incomplete information, state->bitmap is NULL

So in libvirt's case, as long as libvirt manages bitmaps completely,
it's a bug in libvirt to request deletion of a bitmap that doesn't
exist.  Or, someone messes with a qcow2 image of an offline guest behind
libvirt's back without updating libvirt's metadata of what bitmaps
should exist, and then if libvirt fails to check that a bitmap actually
exists, a user may be able to coerce libvirt into requesting a bitmap
deletion that will cause a qemu crash, but that's the user's fault for
going behind libvirt's back.  Or, libvirt could add code that instead of
trying to blindly delete a bitmap, it first makes a QMP call to ensure
the bitmap still exists (annoying, but harmless even when the bug is
fixed), instead of blaming the bug on the user operating behind
libvirt's back.

The bug is nasty, but feels to be enough of a corner case that I think
deferring to 5.0 with CC: stable (and then downstreams can backport it
at will) is the right approach; no need to hold up 4.2 if this is the
only flaw.  But I'm also not opposed to it going in 4.2 if we have
anything else serious.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

John Snow


On 12/5/19 4:53 PM, Eric Blake wrote:

> On 12/5/19 2:16 PM, John Snow wrote:
>
>>>> Last minute edit: hmm, actually, transaction action introduced in
>>>> 4.2, so crash is not a regression, only broken
>>>> block-dirty-bitmap-remove
>>>> command is a regression... Maybe it's OK for stable.
>>>
>>> Libvirt REALLY wants to use transaction bitmap management (and require
>>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>>> to merge in.  I'm trying to ascertain:
>>>
>>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>>> that will trigger it for easy reproduction?  Are there workarounds (such
>>> as checking that a bitmap exists prior to attempting to remove it)?  If
>>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>>> the fix is in place?
>>>
>>
>> It looks like:
>>
>> - You need to have at least one bitmap
>> - You need to use transactional remove
>> - you need to misspell the bitmap name
>> - The remove will fail (return -EINVAL) but doesn't set errp
>> - The caller chokes on incomplete information, state->bitmap is NULL
>
> So in libvirt's case, as long as libvirt manages bitmaps completely,
> it's a bug in libvirt to request deletion of a bitmap that doesn't
> exist.  Or, someone messes with a qcow2 image of an offline guest behind
> libvirt's back without updating libvirt's metadata of what bitmaps
> should exist, and then if libvirt fails to check that a bitmap actually
> exists, a user may be able to coerce libvirt into requesting a bitmap
> deletion that will cause a qemu crash, but that's the user's fault for
> going behind libvirt's back.  Or, libvirt could add code that instead of
> trying to blindly delete a bitmap, it first makes a QMP call to ensure
> the bitmap still exists (annoying, but harmless even when the bug is
> fixed), instead of blaming the bug on the user operating behind
> libvirt's back.
>
> The bug is nasty, but feels to be enough of a corner case that I think
> deferring to 5.0 with CC: stable (and then downstreams can backport it
> at will) is the right approach; no need to hold up 4.2 if this is the
> only flaw.  But I'm also not opposed to it going in 4.2 if we have
> anything else serious.
>

Further, the NASTIEST problem is with transactional remove, which is new
to 4.2.

Normal remove is also broken, but won't choke because it doesn't hold
undo information.

Vladimir, do you agree with this assessment? Do we have it correct?

--js


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

Vladimir Sementsov-Ogievskiy-2
In reply to this post by John Snow
05.12.2019 23:16, John Snow wrote:

>
>
> On 12/5/19 3:09 PM, Eric Blake wrote:
>> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>>> Here is double bug:
>>>
>>> First, return error but not set errp. This may lead to:
>>> qmp block-dirty-bitmap-remove may report success when actually failed
>>>
>>> block-dirty-bitmap-remove used in a transaction will crash, as
>>> qmp_transaction will think that it returned success and will cal
>>
>> call
>>
>>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>>> NULL
>>>
>>> Second (like in anecdote), this case is not an error at all. As it is
>>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>>> definition, absence of bitmap is not an error, and similar case handled
>>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>>> there is no bitmaps at all..
>>
>> double .
>>
>>>
>>> But when there are some bitmaps, but not the requested one, it return
>>> error with errp unset.
>>>
>>> Fix that.
>>>
>>> Fixes: b56a1e31759b750
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <[hidden email]>
>>> ---
>>>
>>> Hi all!
>>>
>>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>>> sorry for introducing it, and it sad that it's found so late..
>>>
>>> Personally, I think that this one worth rc5, as it makes new bitmap
>>> interfaces unusable. But the decision is yours.
>>>
>>> Last minute edit: hmm, actually, transaction action introduced in
>>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>>> command is a regression... Maybe it's OK for stable.
>>
>> Libvirt REALLY wants to use transaction bitmap management (and require
>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>> to merge in.  I'm trying to ascertain:
>>
>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>> that will trigger it for easy reproduction?  Are there workarounds (such
>> as checking that a bitmap exists prior to attempting to remove it)?  If
>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>> the fix is in place?
>>
>
> It looks like:
>
> - You need to have at least one bitmap
> - You need to use transactional remove
> - you need to misspell the bitmap name
> - The remove will fail (return -EINVAL) but doesn't set errp
> - The caller chokes on incomplete information, state->bitmap is NULL


No, that would be too simple, the thing is worse. Absolutele correct removing is broken, without any misspelling

Bug triggers when we are removing persistent bitmap that is not stored yet in the image AND at least one (another) bitmap already stored in the image. So, something like:

1. create persistent bitmap A
2. shutdown vm  (bitmap A is synced)
3. start vm
4. create persistent bitmap B
5. remove bitmap B - it fails (and crashes if in transaction)

====

Hmm, workaround..

I'm afraid that the only thing is not remove persistent bitmaps, which were never synced to the image. So, instead the sequence above, we need


1. create persistent bitmap A
2. shutdown vm
3. start vm
4. create persistent bitmap B
5. remember, that we want to remove bitmap B after vm shutdown
...
  some other operations
...
6. vm shutdown
7. start vm in stopped mode, and remove all bitmaps marked for removing
8. stop vm

But, I think that in real circumstances, vm shutdown is rare thing...


>
>>>
>>>    block/qcow2-bitmap.c | 9 ++++++---
>>>    1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
>>> index 8abaf632fc..c6c8ebbe89 100644
>>> --- a/block/qcow2-bitmap.c
>>> +++ b/block/qcow2-bitmap.c
>>> @@ -1469,8 +1469,10 @@ int coroutine_fn
>>> qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>>        Qcow2BitmapList *bm_list;
>>>          if (s->nb_bitmaps == 0) {
>>> -        /* Absence of the bitmap is not an error: see explanation above
>>> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
>>> +        /*
>>> +         * Absence of the bitmap is not an error: see explanation above
>>> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
>>> +         */
>>>            return 0;
>>>        }
>>>    @@ -1485,7 +1487,8 @@ int coroutine_fn
>>> qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>>          bm = find_bitmap_by_name(bm_list, name);
>>>        if (bm == NULL) {
>>> -        ret = -EINVAL;
>>> +        /* Absence of the bitmap is not an error, see above. */
>>> +        ret = 0;
>>>            goto out;
>>>        }
>>>  
>>
>


--
Best regards,
Vladimir

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

Vladimir Sementsov-Ogievskiy-2
In reply to this post by Eric Blake
05.12.2019 23:09, Eric Blake wrote:

> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>> Here is double bug:
>>
>> First, return error but not set errp. This may lead to:
>> qmp block-dirty-bitmap-remove may report success when actually failed
>>
>> block-dirty-bitmap-remove used in a transaction will crash, as
>> qmp_transaction will think that it returned success and will cal
>
> call
>
>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>> NULL
>>
>> Second (like in anecdote), this case is not an error at all. As it is
>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>> definition, absence of bitmap is not an error, and similar case handled
>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>> there is no bitmaps at all..
>
> double .
>
>>
>> But when there are some bitmaps, but not the requested one, it return
>> error with errp unset.
>>
>> Fix that.
>>
>> Fixes: b56a1e31759b750
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <[hidden email]>
>> ---
>>
>> Hi all!
>>
>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>> sorry for introducing it, and it sad that it's found so late..
>>
>> Personally, I think that this one worth rc5, as it makes new bitmap
>> interfaces unusable. But the decision is yours.
>>
>> Last minute edit: hmm, actually, transaction action introduced in
>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>> command is a regression... Maybe it's OK for stable.
>
> Libvirt REALLY wants to use transaction bitmap management (and require qemu 4.2) for its incremental backup patches that Peter is almost ready to merge in.  I'm trying to ascertain:

Side question:

Is libvirt prepared to inconsistent bitmaps?

For example, migration with enabled dirty-bitmaps capability will fail if there are inconsistent bitmaps.
Also, incremental backup may be affected (you see, that you have your bitmap, try to do something with it,
aiming to start next incremental backup, but bitmap is inconsistent, and operation fails)..

In fact, we in Virtuozzo now faced with these broken migration and backup because of inconsistent bitmaps
(which were ignored in past), and we have to apply a temporary fix like

--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -986,8 +986,8 @@ bool qcow2_load_dirty_bitmaps(BlockDriverState *bs, Error **errp)
      QSIMPLEQ_FOREACH(bm, bm_list, entry) {
          BdrvDirtyBitmap *bitmap;

-        if ((bm->flags & BME_FLAG_IN_USE) &&
-            bdrv_find_dirty_bitmap(bs, bm->name))
+        if ((bm->flags & BME_FLAG_IN_USE) /* &&
+            bdrv_find_dirty_bitmap(bs, bm->name) */)
          {
              /*
               * We already have corresponding BdrvDirtyBitmap, and bitmap in the
@@ -1000,6 +1000,13 @@ bool qcow2_load_dirty_bitmaps(BlockDriverState *bs, Error **errp)
               * should not load it on migration target, as we already have this
               * bitmap, being migrated.
               */
+
+            /*
+             * VZ7.12:
+             *
+             * We don't load inconsistent bitmaps at all, as they block
+             * migration. So the second condition is commented out.
+             */
              continue;
          }

And correct way is to handle inconsistent bitmaps in libvirt somehow.


Note, that the source of inconsistent bitmaps are crashes, hard resets, etc, when Qemu is unable to store bitmaps correctly.


>
> When exactly can this bug hit?  Can you give a sequence of QMP commands that will trigger it for easy reproduction?  Are there workarounds (such as checking that a bitmap exists prior to attempting to remove it)?  If it does NOT get fixed in rc5, how will libvirt be able to probe whether the fix is in place?
>
>>
>>   block/qcow2-bitmap.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
>> index 8abaf632fc..c6c8ebbe89 100644
>> --- a/block/qcow2-bitmap.c
>> +++ b/block/qcow2-bitmap.c
>> @@ -1469,8 +1469,10 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>       Qcow2BitmapList *bm_list;
>>       if (s->nb_bitmaps == 0) {
>> -        /* Absence of the bitmap is not an error: see explanation above
>> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
>> +        /*
>> +         * Absence of the bitmap is not an error: see explanation above
>> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
>> +         */
>>           return 0;
>>       }
>> @@ -1485,7 +1487,8 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>       bm = find_bitmap_by_name(bm_list, name);
>>       if (bm == NULL) {
>> -        ret = -EINVAL;
>> +        /* Absence of the bitmap is not an error, see above. */
>> +        ret = 0;
>>           goto out;
>>       }
>>
>


--
Best regards,
Vladimir

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

Eric Blake
In reply to this post by Vladimir Sementsov-Ogievskiy-2
[adding in Peter Maydell, as there is now potential talk of other
4.2-worthy patches]

On 12/6/19 4:18 AM, Vladimir Sementsov-Ogievskiy wrote:

> 05.12.2019 23:16, John Snow wrote:
>>
>>
>> On 12/5/19 3:09 PM, Eric Blake wrote:
>>> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>> Here is double bug:
>>>>
>>>> First, return error but not set errp. This may lead to:
>>>> qmp block-dirty-bitmap-remove may report success when actually failed
>>>>
>>>> block-dirty-bitmap-remove used in a transaction will crash, as
>>>> qmp_transaction will think that it returned success and will cal
>>>
>>> call
>>>
>>>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>>>> NULL
>>>>
>>>> Second (like in anecdote), this case is not an error at all. As it is
>>>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>>>> definition, absence of bitmap is not an error, and similar case handled
>>>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>>>> there is no bitmaps at all..
>>>
>>> double .
>>>
>>>>
>>>> But when there are some bitmaps, but not the requested one, it return
>>>> error with errp unset.
>>>>
>>>> Fix that.
>>>>
>>>> Fixes: b56a1e31759b750
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <[hidden email]>
>>>> ---
>>>>
>>>> Hi all!
>>>>
>>>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>>>> sorry for introducing it, and it sad that it's found so late..
>>>>
>>>> Personally, I think that this one worth rc5, as it makes new bitmap
>>>> interfaces unusable. But the decision is yours.
>>>>
>>>> Last minute edit: hmm, actually, transaction action introduced in
>>>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>>>> command is a regression... Maybe it's OK for stable.
>>>
>>> Libvirt REALLY wants to use transaction bitmap management (and require
>>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>>> to merge in.  I'm trying to ascertain:
>>>
>>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>>> that will trigger it for easy reproduction?  Are there workarounds (such
>>> as checking that a bitmap exists prior to attempting to remove it)?  If
>>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>>> the fix is in place?
>>>
>>
>> It looks like:
>>
>> - You need to have at least one bitmap
>> - You need to use transactional remove
>> - you need to misspell the bitmap name
>> - The remove will fail (return -EINVAL) but doesn't set errp
>> - The caller chokes on incomplete information, state->bitmap is NULL
>
>
> No, that would be too simple, the thing is worse. Absolutele correct removing is broken, without any misspelling
>
> Bug triggers when we are removing persistent bitmap that is not stored yet in the image AND at least one (another) bitmap already stored in the image. So, something like:
>
> 1. create persistent bitmap A
> 2. shutdown vm  (bitmap A is synced)
> 3. start vm
> 4. create persistent bitmap B
> 5. remove bitmap B - it fails (and crashes if in transaction)
>
> ====
>
> Hmm, workaround..
>
> I'm afraid that the only thing is not remove persistent bitmaps, which were never synced to the image. So, instead the sequence above, we need
>
>
> 1. create persistent bitmap A
> 2. shutdown vm
> 3. start vm
> 4. create persistent bitmap B
> 5. remember, that we want to remove bitmap B after vm shutdown
> ...
>    some other operations
> ...
> 6. vm shutdown
> 7. start vm in stopped mode, and remove all bitmaps marked for removing
> 8. stop vm
>
> But, I think that in real circumstances, vm shutdown is rare thing...

This is sounding a bit more serious. As I said earlier, it shouldn't
delay 4.2 on its own, but if the fix is obvious (and other than
comments, it is a single change from 'ret = -EINVAL' to 'ret = 0' which
fixes a definite reproducible crash), I think it rises to the level of
acceptable.

I've been so worried about the question of which release, that I don't
know if I've previously offered:
Reviewed-by: Eric Blake <[hidden email]>

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

John Snow


On 12/6/19 9:36 AM, Eric Blake wrote:

> [adding in Peter Maydell, as there is now potential talk of other
> 4.2-worthy patches]
>
> On 12/6/19 4:18 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 05.12.2019 23:16, John Snow wrote:
>>>
>>>
>>> On 12/5/19 3:09 PM, Eric Blake wrote:
>>>> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Here is double bug:
>>>>>
>>>>> First, return error but not set errp. This may lead to:
>>>>> qmp block-dirty-bitmap-remove may report success when actually failed
>>>>>
>>>>> block-dirty-bitmap-remove used in a transaction will crash, as
>>>>> qmp_transaction will think that it returned success and will cal
>>>>
>>>> call
>>>>
>>>>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>>>>> NULL
>>>>>
>>>>> Second (like in anecdote), this case is not an error at all. As it is
>>>>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>>>>> definition, absence of bitmap is not an error, and similar case
>>>>> handled
>>>>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>>>>> there is no bitmaps at all..
>>>>
>>>> double .
>>>>
>>>>>
>>>>> But when there are some bitmaps, but not the requested one, it return
>>>>> error with errp unset.
>>>>>
>>>>> Fix that.
>>>>>
>>>>> Fixes: b56a1e31759b750
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <[hidden email]>
>>>>> ---
>>>>>
>>>>> Hi all!
>>>>>
>>>>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>>>>> sorry for introducing it, and it sad that it's found so late..
>>>>>
>>>>> Personally, I think that this one worth rc5, as it makes new bitmap
>>>>> interfaces unusable. But the decision is yours.
>>>>>
>>>>> Last minute edit: hmm, actually, transaction action introduced in
>>>>> 4.2, so crash is not a regression, only broken
>>>>> block-dirty-bitmap-remove
>>>>> command is a regression... Maybe it's OK for stable.
>>>>
>>>> Libvirt REALLY wants to use transaction bitmap management (and require
>>>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>>>> to merge in.  I'm trying to ascertain:
>>>>
>>>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>>>> that will trigger it for easy reproduction?  Are there workarounds
>>>> (such
>>>> as checking that a bitmap exists prior to attempting to remove it)?  If
>>>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>>>> the fix is in place?
>>>>
>>>
>>> It looks like:
>>>
>>> - You need to have at least one bitmap
>>> - You need to use transactional remove
>>> - you need to misspell the bitmap name
>>> - The remove will fail (return -EINVAL) but doesn't set errp
>>> - The caller chokes on incomplete information, state->bitmap is NULL
>>
>>
>> No, that would be too simple, the thing is worse. Absolutele correct
>> removing is broken, without any misspelling
>>
>> Bug triggers when we are removing persistent bitmap that is not stored
>> yet in the image AND at least one (another) bitmap already stored in
>> the image. So, something like:
>>
>> 1. create persistent bitmap A
>> 2. shutdown vm  (bitmap A is synced)
>> 3. start vm
>> 4. create persistent bitmap B
>> 5. remove bitmap B - it fails (and crashes if in transaction)
>>
>> ====
>>
>> Hmm, workaround..
>>
>> I'm afraid that the only thing is not remove persistent bitmaps, which
>> were never synced to the image. So, instead the sequence above, we need
>>
>>
>> 1. create persistent bitmap A
>> 2. shutdown vm
>> 3. start vm
>> 4. create persistent bitmap B
>> 5. remember, that we want to remove bitmap B after vm shutdown
>> ...
>>    some other operations
>> ...
>> 6. vm shutdown
>> 7. start vm in stopped mode, and remove all bitmaps marked for removing
>> 8. stop vm
>>
>> But, I think that in real circumstances, vm shutdown is rare thing...
>
> This is sounding a bit more serious. As I said earlier, it shouldn't
> delay 4.2 on its own, but if the fix is obvious (and other than
> comments, it is a single change from 'ret = -EINVAL' to 'ret = 0' which
> fixes a definite reproducible crash), I think it rises to the level of
> acceptable.
>
> I've been so worried about the question of which release, that I don't
> know if I've previously offered:
> Reviewed-by: Eric Blake <[hidden email]>
>

Oh, that is quite a bit more serious than I thought too.

Yeah, I want this in 4.2 if at all possible.

Reviewed-by: John Snow <[hidden email]>


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap

Eric Blake
On 12/6/19 1:02 PM, John Snow wrote:

>>> I'm afraid that the only thing is not remove persistent bitmaps, which
>>> were never synced to the image. So, instead the sequence above, we need
>>>
>>>
>>> 1. create persistent bitmap A
>>> 2. shutdown vm
>>> 3. start vm
>>> 4. create persistent bitmap B
>>> 5. remember, that we want to remove bitmap B after vm shutdown
>>> ...
>>>     some other operations
>>> ...
>>> 6. vm shutdown
>>> 7. start vm in stopped mode, and remove all bitmaps marked for removing
>>> 8. stop vm
>>>
>>> But, I think that in real circumstances, vm shutdown is rare thing...

Part of me wonders if we would have detected this MUCH sooner if I had
gotten my wish of having the qcow2 metadata updated on creation of any
persistent bitmap (not actually writing out the bitmap itself, just
updating the bitmap table to mark that there is a new persistent
inconsistent bitmap), so that a) qemu-img info -U can see the persistent
bitmap's existence, and b) an unexpected abrupt crash of qemu does not
lose the existence of the bitmap.  At the time I raised the question,
the push-back at the time was a desire to minimize writes to the qcow2
metadata at all costs, warranting our current extreme code contortions
to keep persistent bitmaps were kept in memory only until VM shutdown.
But had we been doing it, we would spot problems like this without
having to do VM shutdown, and our code might actually be simpler than
our current contortions.  Maybe we should still revisit that decision
(of course, that question is independent of this patch, and therefore
5.0 material at earliest).

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org