On a double-bit error, the PMU counters may increment and
trigger the PPI edac pmu overflow interrupt before the SPI for the
double bit interrupt triggers.
Check the FATAL bit of the cache error syndrome registers and mark
the error as double bit if it is set.
Change-Id: I5f992dac877b3d8936f11483d29b4e82fc70b0bf
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
Cortex A53 & A57 processors store information about the type of parity
error in esr_el1 for certain types of errors. Print out this information
in the interrupt handler.
Print the cpumerrsr, l2merrsr, and esr register in all cases.
Change-Id: I508560740ee4f6937b5137cb33ad21216f86649f
Signed-off-by: Patrick Daly <pdaly@codeaurora.org>
By design, the CortexA53/A57 processors are incapable of
gernerating interrupts or PMU events once a single-bit
error is observed in the L2 caches.
Hence, we need to poll the L2MERRSR register to periodically check
for single bit errors. We need to do this for L2 on both clusters.
Change-Id: I76a440b820f23c9667a5596cf550ff7725ec1cf5
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
Use the poll_msec set by the driver that registreed an edac device
to define the poll time for how often the edac_check callback function
should be scheduled.
Change-Id: I973d7fb966cb9f6f9497510df5de000d4f8ffcba
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
EDAC provides a mechanism to poll for errors using a callback function
and uses a delayed timer to schedule it. Provide an option to create
a deferrable timer if the error checking is not worth waking up
the cpu from idle.
Change-Id: Ia25216323eabf7fa4b894897c950414006921f3f
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
The cti-pmu workaround cpu notifier enables the workaround on CPU online.
Check the DT property to ensure that this is required before enabling it.
Change-Id: Id92702fcdc98b99970cad0df46c6832faff78491
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
Check for ecc errors on panic on all processors
Change-Id: I2a68644afb2730a69aca35abb1f10899a11514dd
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
The CTI PMU workaround is enabled by default. Use a device tree property
to decide if the workaround needs to be applied or not.
Change-Id: Iaa847b7309d204d41c0ca53984964f3b238e0427
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
GENMASK is used to create a contiguous bitmask([hi:lo]). It is
implemented twice in current kernel. One is in EDAC driver, the other
is in SiS/XGI FB driver. Move it to a more generic place for other
usage.
Change-Id: Ia8d2d7c9f6e143fee7338f5bb991658fffef52ec
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Winischhofer <thomas@winischhofer.net>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
[dramasub@codeaurora.org: resolved trivial merge conflicts]
Signed-off-by: Deva Ramasubramanian <dramasub@codeaurora.org>
Git-commit: 10ef6b0dffe404bcc54e94cb2ca1a5b18445a66b
Git-repo: https://github.com/torvalds/linux.git
The CTI PMU workaround for the EDAC interrupt is enabled at probe
only on the CPUs that are online during boot-up. Some of the CPUs
can be onlined at a later point from userspace.
Ensure that the workaround is enabled on those CPUs as well.
Change-Id: I551b228d3df3f7ed7d935a55aec6474339d569a6
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
Cosmetic change to add 0x as a prefix while printing hexadecimal
values so that it makes the messages unambiguous.
Change-Id: I2d638708833da194f0066abd37ffc157c52c4182
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
Currently, the decision to control panic on a correctable error is
through a defconfig. Add a commandline parameter for easy toggling of
this particular option. Also, to make these correctable errors stand-out
in the logs, add a WARN when the panic_on_ce is disabled.
Change-Id: Ie3c61cad2489b680f51d466c8243ba23ccf5375c
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
In MSM8994-V1, for unknown reasons the pmu percpu interrupt was
incorrectly connected to the corresponding CPU in the other cluster.
To workaround this problem, it was decided that the cti could be used
to trigger the percpu pmu interrupt to the right cpu which actually
caused the event and can handle it.
This patch implements this workaround.
Change-Id: I732ad77ed2529a54b85c51b3fadcc53d93d70279
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
Single bit errors are detected using the PMU's memory event.
The counter overflow interrupt triggers a read of the
relevant CPU registers which help in reporting the single
bit errors.
Change-Id: I29cc3c952c1e0f1c05120b23cf30775583dcd67c
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
Allow the Cortex A53/A57 EDAC driver to be configured to
panic the kernel if a correctable error (CE) is detected.
Change-Id: Id7bf66ed36a348eb321d8fd457efe097c003ebcd
Signed-off-by: Stepan Moskovchenko <stepanm@codeaurora.org>
Add an EDAC device flag and associated sysfs entries to
allow an EDAC driver to be configured to panic the kernel
if a correctable error is detected. Though correctable
errors (by definition) have no adverse system effects,
a panic may still be useful, since a correctable error may
be indicative of a marginal system state.
Change-Id: I98921469254aa7b999979c1c7d9186286f982a0c
Signed-off-by: Stepan Moskovchenko <stepanm@codeaurora.org>
The 'instance number' argument passed to the EDAC framework
error reporting functions needs to correspond to the ID of
the CPU which reported the error, rather than the ID of the
cache set/way where the error occurred.
Change-Id: Ice5039abe83756ef16726cc34b62b0b3c88e1e16
Signed-off-by: Stepan Moskovchenko <stepanm@codeaurora.org>
CPUMERRSR_EL1 is a 64-bit register, so we must use a 64-bit
data type to hold its value.
Change-Id: I6bb7e41d84f417eb39e62cffef0807098205d268
Signed-off-by: Stepan Moskovchenko <stepanm@codeaurora.org>
Add a driver for handling the L1/L2 cache error reporting
features found on the ARM Cortex A53 / A57 processors.
Change-Id: I03e11ada791265aa998aab7031a8e274a193d8f9
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
Signed-off-by: Stepan Moskovchenko <stepanm@codeaurora.org>
commit 75135da0d68419ef8a925f4c1d5f63d8046e314d upstream.
pci_get_device() decrements the reference count of "from" (last
argument) so when we break off the loop successfully we have only one
device reference - and we don't know which device we have. If we want
a reference to each device, we must take them explicitly and let
the pci_get_device() walk complete to avoid duplicate references.
This is serious, as over-putting device references will cause
the device to eventually disappear. Without this fix, the kernel
crashes after a few insmod/rmmod cycles.
Tested on an Intel S7000FC4UR system with a 7300 chipset.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Link: http://lkml.kernel.org/r/20140224111656.09bbb7ed@endymion.delvare
Cc: Mauro Carvalho Chehab <m.chehab@samsung.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c0f5eeed0f4cef4f05b74883a7160e7edde58b6a upstream.
The reference count changes done by pci_get_device can be a little
misleading when the usage diverges from the most common scheme. The
reference count of the device passed as the last parameter is always
decreased, even if the function returns no new device. So if we are
going to try alternative device IDs, we must manually increment the
device reference count before each retry. If we don't, we end up
decreasing the reference count, and after a few modprobe/rmmod cycles
the PCI devices will vanish.
In other words and as Alan put it: without this fix the EDAC code
corrupts the PCI device list.
This fixes kernel bug #50491:
https://bugzilla.kernel.org/show_bug.cgi?id=50491
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Link: http://lkml.kernel.org/r/20140224093927.7659dd9d@endymion.delvare
Reviewed-by: Alan Cox <alan@linux.intel.com>
Cc: Mauro Carvalho Chehab <m.chehab@samsung.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cb6ef42e516cb8948f15e4b70dc03af8020050a2 upstream.
We're using edac_mc_workq_setup() both on the init path, when
we load an edac driver and when we change the polling period
(edac_mc_reset_delay_period) through /sys/.../edac_mc_poll_msec.
On that second path we don't need to init the workqueue which has been
initialized already.
Thanks to Tejun for workqueue insights.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1391457913-881-1-git-send-email-prarit@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9da21b1509d8aa7ab4846722817d16c72d656c91 upstream.
Sanitize code even more to accept unsigned longs only and to not allow
polling intervals below 1 second as this is unnecessary and doesn't make
much sense anyway for polling errors.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1391457913-881-1-git-send-email-prarit@redhat.com
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c542b53da9ffa4fe9de61149818a06aacae531f8 upstream.
The usage of strict_strtol() is not preferred, because strict_strtol()
is obsolete. Thus, kstrtol() should be used.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 90ed4988b8c030d65b41b7d13140e9376dc6ec5a upstream.
In case the device 0, function 1 is not found using pci_get_device(),
pci_scan_single_device() will be used but, differently than
pci_get_device(), it allocates a pci_dev but doesn't does bump the usage
count on the pci_dev and after few module removals and loads the pci_dev
will be freed.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Reviewed-by: mark gross <mark.gross@intel.com>
Link: http://lkml.kernel.org/r/20131205153755.GL4545@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a72b8859fd3941cc1d2940d5c43026d2c6fb959e upstream.
Register and enable interrupts after the edac registration. Otherwise
incomming ecc error interrupts lead to crashes during device setup.
Fixing this in drivers for mc and l2.
Signed-off-by: Robert Richter <robert.richter@linaro.org>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Robert Richter <rric@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f0a56c480196a98479760862468cc95879df3de0 upstream.
It can happen that configurations are running in a single-channel mode
even with a dual-channel memory controller, by, say, putting the DIMMs
only on the one channel and leaving the other empty. This causes a
problem in init_csrows which implicitly assumes that when the second
channel is enabled, i.e. channel 1, the struct dimm hierarchy will be
present. Which is not.
So always allocate two channels unconditionally.
This provides for the nice side effect that the data structures are
initialized so some day, when memory hotplug is supported, it should
just work out of the box when all of a sudden a second channel appears.
Reported-and-tested-by: Roger Leigh <rleigh@debian.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix yet another issue caught by 8f46baaa7e ("base: core: WARN() about
bogus permissions on device attributes").
Signed-off-by: Borislav Petkov <bp@suse.de>
I get the following warning on boot:
------------[ cut here ]------------
WARNING: at drivers/base/core.c:575 device_create_file+0x9a/0xa0()
Hardware name: -[8737R2A]-
Write permission without 'store'
...
</snip>
Drilling down, this is related to dynamic channel ce_count attribute
files sporting a S_IWUSR mode without a ->store() function. Looking
around, it appears that they aren't supposed to have a ->store()
function. So remove the bogus write permission to get rid of the
warning.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: <stable@vger.kernel.org> # 3.[89]
[ shorten commit message ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Pull edac fixes from Mauro Carvalho Chehab:
"Two edac fixes:
- i7300_edac currently reports a wrong number of DIMMs when the
memory controller is in single channel mode
- on some Sandy Bridge machines, the EDAC driver bails out as one of
the PCI IDs used by the driver is hidden by BIOS. As the driver
uses it only to detect the type of memory, make it optional at the
driver"
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac:
edac: sb_edac.c should not require prescence of IMC_DDRIO device
i7300_edac: Fix memory detection in single mode
The Sandy Bridge EDAC driver uses a register in the IMC_DDRIO CSR
space to determine the type of DIMMs (registered or unregistered).
But this device does not exist on some single socket Sandy Bridge
servers. While the type of DIMMs is nice to know, it is not essential
for this driver's other functions. So it seems harsh to have it
refuse to load at all when it cannot find this device.
Make the check for this device be optional. If it isn't present
just report the memory type as "MEM_UNKNOWN".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Add code to handle DRAM ECC errors decoding for Fam16h.
Tested on Fam16h with ECC turned on using the mce_amd_inj facility and
works fine.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
[ Boris: cleanups and clarifications ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Both mci.mem_is_per_rank and mci.csbased denote the same thing: the
memory controller is csrows based. Merge both fields into one.
There's no need for the driver to actually fill it, as the core detects
it by checking if one of the layers has the csrows type as part of the
memory hierarchy:
if (layers[i].type == EDAC_MC_LAYER_CHIP_SELECT)
per_rank = true;
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
We were filling the csrow size with a wrong value. 16a528ee39 ("EDAC:
Fix csrow size reported in sysfs") tried to address the issue. It fixed
the report with the old API but not with the new one. Correct it for the
new API too.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
[ make it a per-csrow accounting regardless of ->channel_count ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Pull EDAC fixes and ghes-edac from Mauro Carvalho Chehab:
"For:
- Some fixes at edac drivers (i7core_edac, sb_edac, i3200_edac);
- error injection support for i5100, when EDAC debug is enabled;
- fix edac when it is loaded builtin (early init for the subsystem);
- a "Firmware First" EDAC driver, allowing ghes to report errors via
EDAC (ghes-edac).
With regards to ghes-edac, this fixes a longstanding BZ at Red Hat
that happens with Nehalem and Sandy Bridge CPUs: when both GHES and
i7core_edac or sb_edac are running, the error reports are
unpredictable, as both BIOS and OS race to access the registers. With
ghes-edac, the EDAC core will refuse to register any other concurrent
memory error driver.
This patchset moves the ghes struct definitions to a separate header
file (include/acpi/ghes.h) and adds 3 hooks at apei/ghes.c to
register/unregister and to report errors via ghes-edac. Those changes
were acked by ghes driver maintainer (Huang)."
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac: (30 commits)
i5100_edac: convert to use simple_open()
ghes_edac: fix to use list_for_each_entry_safe() when delete list items
ghes_edac: Fix RAS tracing
ghes_edac: Make it compliant with UEFI spec 2.3.1
ghes_edac: Improve driver's printk messages
ghes_edac: Don't credit the same memory dimm twice
ghes_edac: do a better job of filling EDAC DIMM info
ghes_edac: add support for reporting errors via EDAC
ghes_edac: Register at EDAC core the BIOS report
ghes: add the needed hooks for EDAC error report
ghes: move structures/enum to a header file
edac: add support for error type "Info"
edac: add support for raw error reports
edac: reduce stack pressure by using a pre-allocated buffer
edac: lock module owner to avoid error report conflicts
edac: remove proc_name from mci structure
edac: add a new memory layer type
edac: initialize the core earlier
edac: better report error conditions in debug mode
i5100_edac: Remove two checkpatch warnings
...
This removes an open coded simple_open() function and
replaces file operations references to the function
with simple_open() instead.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Since we will remove items off the list using list_del() we need
to use a safe version of the list_for_each_entry() macro aptly named
list_for_each_entry_safe().
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
With the current version of CPER, there's no way to associate an
error with the memory error. So, the error location in EDAC
layers is unused.
As CPER has its own idea about memory architectural layers, just
output whatever is there inside the driver's detail at the RAS
tracepoint.
The EDAC location keeps untouched, in the case that, in some future,
we could actually map the error into the dimm labels.
Now, the error message:
[ 72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 72.396627] {1}[Hardware Error]: APEI generic hardware error status
[ 72.396628] {1}[Hardware Error]: severity: 2, corrected
[ 72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected
[ 72.396632] {1}[Hardware Error]: flags: 0x01
[ 72.396634] {1}[Hardware Error]: primary
[ 72.396635] {1}[Hardware Error]: section_type: memory error
[ 72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400
[ 72.396638] {1}[Hardware Error]: node: 3
[ 72.396639] {1}[Hardware Error]: card: 0
[ 72.396640] {1}[Hardware Error]: module: 0
[ 72.396641] {1}[Hardware Error]: device: 0
[ 72.396643] {1}[Hardware Error]: error_type: 18, unknown
[ 72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)
Is properly represented on the trace event:
kworker/0:2-584 [000] .... 72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location👎-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)
Tested on a 4 sockets E5-4650 Sandy Bridge machine.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Provide a better infrastructure for printk's inside the driver:
- use edac_dbg() for debug messages;
- standardize the usage of pr_info();
- provide warning about the risk of relying on this
driver.
While here, changes the size of a fake memory to 1 page. This is
as good or as bad as 1000 pages, but it is easier for userspace to
detect, as I don't expect that any machine implementing GHES would
provide just 1 page available ;)
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Conflicts:
drivers/edac/ghes_edac.c