commit e35fca4791 upstream.
Some edac drivers register themselves as mce decoders via
notifier_chain. But in current notifier_chain implementation logic,
it doesn't accept same notifier registered twice. If so, it will be
wrong when adding/removing the element from the list. For example,
on one SandyBridge platform, remove module sb_edac and then trigger
one error, it will hit oops because it has no mce decoder registered
but related notifier_chain still points to an invalid callback
function. Here is an example:
Call Trace:
[<ffffffff8150ef6a>] atomic_notifier_call_chain+0x1a/0x20
[<ffffffff8102b936>] mce_log+0x46/0x180
[<ffffffff8102eaea>] apei_mce_report_mem_error+0x4a/0x60
[<ffffffff812e19d2>] ghes_do_proc+0x192/0x210
[<ffffffff812e2066>] ghes_proc+0x46/0x70
[<ffffffff812e20d8>] ghes_notify_sci+0x48/0x80
[<ffffffff8150ef05>] notifier_call_chain+0x55/0x80
[<ffffffff81076f1a>] __blocking_notifier_call_chain+0x5a/0x80
[<ffffffff812aea11>] ? acpi_os_wait_events_complete+0x23/0x23
[<ffffffff81076f56>] blocking_notifier_call_chain+0x16/0x20
[<ffffffff812ddc4d>] acpi_hed_notify+0x19/0x1b
[<ffffffff812b16bd>] acpi_device_notify+0x19/0x1b
[<ffffffff812beb38>] acpi_ev_notify_dispatch+0x67/0x7f
[<ffffffff812aea3a>] acpi_os_execute_deferred+0x29/0x36
[<ffffffff81069dc2>] process_one_work+0x132/0x450
[<ffffffff8106bbcb>] worker_thread+0x17b/0x3c0
[<ffffffff8106ba50>] ? manage_workers+0x120/0x120
[<ffffffff81070aee>] kthread+0x9e/0xb0
[<ffffffff81514724>] kernel_thread_helper+0x4/0x10
[<ffffffff81070a50>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff81514720>] ? gs_change+0x13/0x13
Code: f3 49 89 d4 45 85 ed 4d 89 c6 48 8b 0f 74 48 48 85 c9 75 17 eb 41
0f 1f 80 00 00 00 00 41 83 ed 01 4c 89 f9 74 22 4d 85 ff 74 1d <4c> 8b
79 08 4c 89 e2 48 89 de 48 89 cf ff 11 4d 85 f6 74 04 41
RIP [<ffffffff8150eef6>] notifier_call_chain+0x46/0x80
RSP <ffff88042868fb20>
CR2: ffffffffa01af838
---[ end trace 0100930068e73e6f ]---
BUG: unable to handle kernel paging request at fffffffffffffff8
IP: [<ffffffff810705b0>] kthread_data+0x10/0x20
PGD 1a0d067 PUD 1a0e067 PMD 0
Oops: 0000 [#2] SMP
Only i7core_edac and sb_edac have such issues because they have more
than one memory controller which means they have to register mce
decoder many times.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull EDAC fixes from Mauro Carvalho Chehab:
"A series of EDAC driver fixes. It also has one core fix at the
documentation, and a rename patch, fixing the name of the struct that
contains the rank information."
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac:
edac: rename channel_info to rank_info
i5400_edac: Avoid calling pci_put_device() twice
edac: i5100 ack error detection register after each read
edac: i5100 fix erroneous define for M1Err
edac: sb_edac: Fix a wrong value setting for the previous value
edac: sb_edac: Fix a INTERLEAVE_MODE() misuse
edac: sb_edac: Let the driver depend on PCI_MMCONFIG
edac: Improve the comments to better describe the memory concepts
edac/ppc4xx_edac: Fix compilation
Fix sb_edac compilation with 32 bits kernels
>From the driver design, the variable limit wants to compare with its
previous value, we should set the value of limit instead of the value
of tmp_mb to the variable prev.
Signed-off-by: Hui Wang <jason77.wang@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
We can identify dram interleave mode from the Dram Rule register
rather than Dram Interleave list register.
In this context, the reg of INTERLEAVE_MODE(reg) contains the Dram
Interleave list register, we can't get interleave mode from the reg,
while the variable interleave_mode saves the the mode got from the
Dram Rule register, so we use the variable to replace
INTERLEAVE_MDDE(reg) here.
Signed-off-by: Hui Wang <jason77.wang@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
These const tables are currently marked __devinitdata, but
Documentation/PCI/pci.txt says:
"o The ID table array should be marked __devinitconst; this is done
automatically if the table is declared with DEFINE_PCI_DEVICE_TABLE()."
So use DEFINE_PCI_DEVICE_TABLE(x).
Based on PaX and earlier work by Andi Kleen.
Signed-off-by: Lionel Debroux <lionel_debroux@yahoo.fr>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Several fields in struct cpuinfo_x86 were not defined for the
!SMP case, likely to save space. However, those fields still
have some meaning for UP, and keeping them allows some #ifdef
removal from other files. The additional size of the UP kernel
from this change is not significant enough to worry about
keeping up the distinction:
text data bss dec hex filename
4737168 506459 972040 6215667 5ed7f3 vmlinux.o.before
4737444 506459 972040 6215943 5ed907 vmlinux.o.after
for a difference of 276 bytes for an example UP config.
If someone wants those 276 bytes back badly then it should
be implemented in a cleaner way.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: Steffen Persvold <sp@numascale.com>
Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
No functionality change, this is done so that in a follow-on patch all
queued-up MCEs can be decoded after registering on the chain.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The edac driver for Sandy Bridge was found to be reporting "FPM"
for edac_mode, which clearly doesn't make sense. It was found that
sb_edac.c:get_dimm_config was reusing a variable for both mem_type
and edac_type, and thus was overwriting the value after setting
it correctly. This patch fixes that issue.
Before the patch:
/sys/devices/system/edac/mc/mc0/csrow0/edac_mode:FPM
/sys/devices/system/edac/mc/mc0/csrow1/edac_mode:FPM
/sys/devices/system/edac/mc/mc0/csrow2/edac_mode:FPM
/sys/devices/system/edac/mc/mc0/csrow3/edac_mode:FPM
After:
/sys/devices/system/edac/mc/mc0/csrow0/edac_mode:S4ECD4ED
/sys/devices/system/edac/mc/mc0/csrow1/edac_mode:S4ECD4ED
/sys/devices/system/edac/mc/mc0/csrow2/edac_mode:S4ECD4ED
/sys/devices/system/edac/mc/mc0/csrow3/edac_mode:S4ECD4ED
Signed-off-by: Mark A. Grondona <mgrondona@llnl.gov>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Some changes on it were required due to changeset cd90cc84c6bf0, that
changed the glue with the MCE logic.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This driver is known to work on mine and Tony's test environments,
using software error injection, and a partial hardware/software
error injection tool.
There's no broader range test yet to double check if the error decoding
logic will actually point to the right DIMM, so use it with care.
More tests are required to be sure that the driver will work on all
different types of memory configurations.
If you're willing to risk using it, I suggest you to enable EDAC debugs
for your test machines, as the debug logs helps to track what's going
inside the driver.
Please feed me with bug reports, if you notice that the driver
is miss-behaving.
Tested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>