1. 22 Feb, 2014 4 commits
    • EDAC: Correct workqueue setup path · db06ad39
      Borislav Petkov authored
      commit cb6ef42e516cb8948f15e4b70dc03af8020050a2 upstream.
      
      We're using edac_mc_workq_setup() both on the init path, when
      we load an edac driver and when we change the polling period
      (edac_mc_reset_delay_period) through /sys/.../edac_mc_poll_msec.
      
      On that second path we don't need to init the workqueue which has been
      initialized already.
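
One way to picture the fix is a setup routine that is told whether it is being called from the init path. The following is a userspace sketch only, not the kernel code: `struct poller`, `poller_setup()` and the `init_count` counter are hypothetical stand-ins for the delayed work item and the setup function named above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a delayed work item; not a kernel type. */
struct poller {
    int init_count;         /* how many times the work item was initialized */
    unsigned long delay_ms; /* currently armed polling delay */
};

/*
 * Arm the poller. Only the driver-load path passes init = true;
 * a later change of the polling period re-arms the existing item
 * without initializing it a second time.
 */
static void poller_setup(struct poller *p, unsigned long msec, bool init)
{
    if (init)
        p->init_count++;    /* stands in for the one-time work initialization */
    p->delay_ms = msec;     /* stands in for (re)queueing the delayed work */
}
```

A caller would pass `true` once when the driver loads and `false` whenever the polling period is changed afterwards.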
      
      Thanks to Tejun for workqueue insights.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: http://lkml.kernel.org/r/1391457913-881-1-git-send-email-prarit@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • EDAC: Poll timeout cannot be zero, p2 · ba20cf8a
      Borislav Petkov authored
      commit 9da21b1509d8aa7ab4846722817d16c72d656c91 upstream.
      
      Sanitize code even more to accept unsigned longs only and to not allow
      polling intervals below 1 second as this is unnecessary and doesn't make
      much sense anyway for polling errors.
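
The kind of check described here can be sketched in userspace C as follows. This is an illustration under assumptions, not the code from the patch: `parse_poll_msec()` and `MIN_POLL_MSEC` are hypothetical names, and the 1000 ms lower bound is taken from the "below 1 second" wording above.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Assumed lower bound: the commit text rules out intervals below 1 second. */
#define MIN_POLL_MSEC 1000UL

/*
 * Parse a decimal string as an unsigned long polling interval in
 * milliseconds. Returns 0 and stores the value on success, or a
 * negative errno value if the input is malformed, out of range for
 * unsigned long, or below the minimum.
 */
static int parse_poll_msec(const char *s, unsigned long *out)
{
    char *end;
    unsigned long v;

    /* Require a leading digit: rejects "", signs and leading whitespace. */
    if (!isdigit((unsigned char)*s))
        return -EINVAL;

    errno = 0;
    v = strtoul(s, &end, 10);
    if (errno == ERANGE)
        return -ERANGE;

    /* Tolerate one trailing newline, as produced by "echo". */
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;

    if (v < MIN_POLL_MSEC)
        return -EINVAL;

    *out = v;
    return 0;
}
```

With this sketch, "0", "999" and "-5" are rejected, while "1000" and "2000\n" are accepted.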
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: http://lkml.kernel.org/r/1391457913-881-1-git-send-email-prarit@redhat.com
      Cc: Doug Thompson <dougthompson@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drivers/edac/edac_mc_sysfs.c: poll timeout cannot be zero · 64f452fc
      Prarit Bhargava authored
      commit 79040cad3f8235937e229f1b9401ba36dd5ad69b upstream.
      
      If you do
      
        echo 0 > /sys/module/edac_core/parameters/edac_mc_poll_msec
      
      the following stack trace is output because the edac module is not
      designed to poll with a timeout of zero.
      
        WARNING: CPU: 12 PID: 0 at lib/list_debug.c:33 __list_add+0xac/0xc0()
        list_add corruption. prev->next should be next (ffff8808291dd1b8), but was           (null). (prev=ffff8808286fe3f8).
        Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache cfg80211 rfkill x86_pkg_temp_thermal coretemp kvm_intel kvm ixgbe e1000e crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt ptp sb_edac iTCO_vendor_support pps_core mdio ipmi_devintf edac_core ioatdma microcode shpchp lpc_ich pcspkr i2c_i801 dca mfd_core ipmi_si wmi ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod sr_mod cdrom crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt isci i2c_algo_bit drm_kms_helper ttm drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
        CPU: 12 PID: 0 Comm: swapper/12 Not tainted 3.13.0+ #1
        Hardware name: Intel Corporation LH Pass ........../SVRBD-ROW_T, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
        Call Trace:
         <IRQ>
          __list_add+0xac/0xc0
          __internal_add_timer+0xab/0x130
          internal_add_timer+0x17/0x40
          mod_timer_pinned+0xca/0x170
          intel_pstate_timer_func+0x28a/0x380
          call_timer_fn+0x36/0x100
          run_timer_softirq+0x1ff/0x2f0
          __do_softirq+0xf5/0x2e0
          irq_exit+0x10d/0x120
          smp_apic_timer_interrupt+0x45/0x60
          apic_timer_interrupt+0x6d/0x80
         <EOI>
          cpuidle_idle_call+0xb9/0x1f0
          arch_cpu_idle+0xe/0x30
          cpu_startup_entry+0x9e/0x240
          start_secondary+0x1e4/0x290
      
        kernel BUG at kernel/timer.c:1084!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache cfg80211 rfkill x86_pkg_temp_thermal coretemp kvm_intel kvm ixgbe e1000e crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt ptp sb_edac iTCO_vendor_support pps_core mdio ipmi_devintf edac_core ioatdma microcode shpchp lpc_ich pcspkr i2c_i801 dca mfd_core ipmi_si wmi ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod sr_mod cdrom crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt isci i2c_algo_bit drm_kms_helper ttm drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
        CPU: 12 PID: 0 Comm: swapper/12 Tainted: G        W    3.13.0+ #1
        Hardware name: Intel Corporation LH Pass ........../SVRBD-ROW_T, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
        Call Trace:
         <IRQ>
          run_timer_softirq+0x245/0x2f0
          __do_softirq+0xf5/0x2e0
          irq_exit+0x10d/0x120
          smp_apic_timer_interrupt+0x45/0x60
          apic_timer_interrupt+0x6d/0x80
         <EOI>
          cpuidle_idle_call+0xb9/0x1f0
          arch_cpu_idle+0xe/0x30
          cpu_startup_entry+0x9e/0x240
          start_secondary+0x1e4/0x290
        RIP   cascade+0x93/0xa0
      
        WARNING: CPU: 36 PID: 1154 at kernel/workqueue.c:1461 __queue_delayed_work+0xed/0x1a0()
        Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache cfg80211 rfkill x86_pkg_temp_thermal coretemp kvm_intel kvm ixgbe e1000e crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt ptp sb_edac iTCO_vendor_support pps_core mdio ipmi_devintf edac_core ioatdma microcode shpchp lpc_ich pcspkr i2c_i801 dca mfd_core ipmi_si wmi ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod sr_mod cdrom crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt isci i2c_algo_bit drm_kms_helper ttm drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
        CPU: 36 PID: 1154 Comm: kworker/u481:3 Tainted: G        W    3.13.0+ #1
        Hardware name: Intel Corporation LH Pass ........../SVRBD-ROW_T, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
        Workqueue: edac-poller edac_mc_workq_function [edac_core]
        Call Trace:
          dump_stack+0x45/0x56
          warn_slowpath_common+0x7d/0xa0
          warn_slowpath_null+0x1a/0x20
          __queue_delayed_work+0xed/0x1a0
          queue_delayed_work_on+0x27/0x50
          edac_mc_workq_function+0x72/0xa0 [edac_core]
          process_one_work+0x17b/0x460
          worker_thread+0x11b/0x400
          kthread+0xd2/0xf0
          ret_from_fork+0x7c/0xb0
      
      This patch adds a range check in the edac_mc_poll_msec code to check for 0.
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: Doug Thompson <dougthompson@xmission.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • EDAC: Replace strict_strtol() with kstrtol() · 2b88adf6
      Jingoo Han authored
      commit c542b53da9ffa4fe9de61149818a06aacae531f8 upstream.
      
      The usage of strict_strtol() is not preferred, because strict_strtol()
      is obsolete. Thus, kstrtol() should be used.
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 06 Feb, 2014 1 commit
  3. 04 Dec, 2013 1 commit
  4. 27 Sep, 2013 1 commit
    • amd64_edac: Fix single-channel setups · 34db3c07
      Borislav Petkov authored
      commit f0a56c480196a98479760862468cc95879df3de0 upstream.
      
      It can happen that configurations are running in a single-channel mode
      even with a dual-channel memory controller, by, say, putting the DIMMs
      only on the one channel and leaving the other empty. This causes a
      problem in init_csrows which implicitly assumes that when the second
      channel is enabled, i.e. channel 1, the struct dimm hierarchy will be
      present. Which is not.
      
      So always allocate two channels unconditionally.
      
      This provides for the nice side effect that the data structures are
      initialized so some day, when memory hotplug is supported, it should
      just work out of the box when all of a sudden a second channel appears.
      Reported-and-tested-by: Roger Leigh <rleigh@debian.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 28 Jul, 2013 1 commit
    • EDAC: Fix lockdep splat · f46ef77d
      Borislav Petkov authored
      commit 88d84ac97378c2f1d5fec9af1e8b7d9a662d6b00 upstream.
      
      Fix the following:
      
      BUG: key ffff88043bdd0330 not in .data!
      ------------[ cut here ]------------
      WARNING: at kernel/lockdep.c:2987 lockdep_init_map+0x565/0x5a0()
      DEBUG_LOCKS_WARN_ON(1)
      Modules linked in: glue_helper sb_edac(+) edac_core snd acpi_cpufreq lrw gf128mul ablk_helper iTCO_wdt evdev i2c_i801 dcdbas button cryptd pcspkr iTCO_vendor_support usb_common lpc_ich mfd_core soundcore mperf processor microcode
      CPU: 2 PID: 599 Comm: modprobe Not tainted 3.10.0 #1
      Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A08 01/24/2013
       0000000000000009 ffff880439a1d920 ffffffff8160a9a9 ffff880439a1d958
       ffffffff8103d9e0 ffff88043af4a510 ffffffff81a16e11 0000000000000000
       ffff88043bdd0330 0000000000000000 ffff880439a1d9b8 ffffffff8103dacc
      Call Trace:
        dump_stack
        warn_slowpath_common
        warn_slowpath_fmt
        lockdep_init_map
        ? trace_hardirqs_on_caller
        ? trace_hardirqs_on
        debug_mutex_init
        __mutex_init
        bus_register
        edac_create_sysfs_mci_device
        edac_mc_add_mc
        sbridge_probe
        pci_device_probe
        driver_probe_device
        __driver_attach
        ? driver_probe_device
        bus_for_each_dev
        driver_attach
        bus_add_driver
        driver_register
        __pci_register_driver
        ? 0xffffffffa0010fff
        sbridge_init
        ? 0xffffffffa0010fff
        do_one_initcall
        load_module
        ? unset_module_init_ro_nx
        SyS_init_module
        tracesys
      ---[ end trace d24a70b0d3ddf733 ]---
      EDAC MC0: Giving out device to 'sbridge_edac.c' 'Sandy Bridge Socket#0': DEV 0000:3f:0e.0
      EDAC sbridge: Driver loaded.
      
      What happens is that bus_register needs a statically allocated lock_key
      because the last is handed in to lockdep. However, struct mem_ctl_info
      embeds struct bus_type (the whole struct, not a pointer to it) and the
      whole thing gets dynamically allocated.
      
      Fix this by using a statically allocated struct bus_type for the MC bus.
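
The difference between embedding the bus structure in a dynamically allocated object and pointing at static storage can be sketched in userspace C. Everything below is illustrative: `fake_bus_type`, `fake_mci`, `mc_bus`, `mc_attach_bus()` and the `MAX_MCS` bound are hypothetical names and values, not the ones used by the patch.

```c
#include <stddef.h>

#define MAX_MCS 16  /* assumed upper bound on memory controllers, for illustration */

/* Hypothetical stand-in for struct bus_type. */
struct fake_bus_type {
    const char *name;
};

/* Hypothetical stand-in for struct mem_ctl_info: holds a pointer, not the struct. */
struct fake_mci {
    unsigned int mc_idx;
    struct fake_bus_type *bus;
};

/* Static storage: these objects live in the program's data section. */
static struct fake_bus_type mc_bus[MAX_MCS];

/*
 * Point the controller at its static slot.
 * Returns 0 on success, -1 if the index is out of range.
 */
static int mc_attach_bus(struct fake_mci *mci)
{
    if (mci->mc_idx >= MAX_MCS)
        return -1;
    mci->bus = &mc_bus[mci->mc_idx];
    return 0;
}
```

The controller object itself can still be allocated dynamically; only the bus structure, and with it anything that must live in static storage, moves out of it.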
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 21 May, 2013 1 commit
  7. 09 May, 2013 1 commit
    • EDAC: Don't give write permission to read-only files · c8c64d16
      Srivatsa S. Bhat authored
      I get the following warning on boot:
      
      ------------[ cut here ]------------
      WARNING: at drivers/base/core.c:575 device_create_file+0x9a/0xa0()
      Hardware name:  -[8737R2A]-
      Write permission without 'store'
      ...
      </snip>
      
      Drilling down, this is related to dynamic channel ce_count attribute
      files sporting a S_IWUSR mode without a ->store() function. Looking
      around, it appears that they aren't supposed to have a ->store()
      function. So remove the bogus write permission to get rid of the
      warning.
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
      Cc: <stable@vger.kernel.org> # 3.[89]
      [ shorten commit message ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
  8. 29 Apr, 2013 2 commits
    • edac: sb_edac.c should not require presence of IMC_DDRIO device · de4772c6
      Luck, Tony authored
      The Sandy Bridge EDAC driver uses a register in the IMC_DDRIO CSR
      space to determine the type of DIMMs (registered or unregistered).
      But this device does not exist on some single socket Sandy Bridge
      servers.  While the type of DIMMs is nice to know, it is not essential
      for this driver's other functions. So it seems harsh to have it
      refuse to load at all when it cannot find this device.
      
      Make the check for this device be optional. If it isn't present
      just report the memory type as "MEM_UNKNOWN".
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • i7300_edac: Fix memory detection in single mode · 33ad4126
      Mauro Carvalho Chehab authored
      When the machine is on single mode, only branch 0 channel 0
      is valid. However, the code is not honouring it:
      
      [ 1952.639341] EDAC DEBUG: i7300_get_mc_regs: Memory controller operating on single mode
      ...
      [ 1952.639351] EDAC DEBUG: i7300_init_csrows: 		AMB-present CH0 = 0x1:
      [ 1952.639353] EDAC DEBUG: i7300_init_csrows: 		AMB-present CH1 = 0x0:
      [ 1952.639355] EDAC DEBUG: i7300_init_csrows: 		AMB-present CH2 = 0x0:
      [ 1952.639358] EDAC DEBUG: i7300_init_csrows: 		AMB-present CH3 = 0x0:
      ...
      [ 1952.639360] EDAC DEBUG: decode_mtr: 	MTR0 CH0: DIMMs are Present (mtr)
      [ 1952.639362] EDAC DEBUG: decode_mtr: 		WIDTH: x8
      [ 1952.639363] EDAC DEBUG: decode_mtr: 		ELECTRICAL THROTTLING is enabled
      [ 1952.639364] EDAC DEBUG: decode_mtr: 		NUMBANK: 4 bank(s)
      [ 1952.639366] EDAC DEBUG: decode_mtr: 		NUMRANK: single
      [ 1952.639367] EDAC DEBUG: decode_mtr: 		NUMROW: 16,384 - 14 rows
      [ 1952.639368] EDAC DEBUG: decode_mtr: 		NUMCOL: 1,024 - 10 columns
      [ 1952.639370] EDAC DEBUG: decode_mtr: 		SIZE: 512 MB
      [ 1952.639371] EDAC DEBUG: decode_mtr: 		ECC code is 8-byte-over-32-byte SECDED+ code
      [ 1952.639373] EDAC DEBUG: decode_mtr: 		Scrub algorithm for x8 is on enhanced mode
      [ 1952.639374] EDAC DEBUG: decode_mtr: 	MTR0 CH1: DIMMs are Present (mtr)
      [ 1952.639376] EDAC DEBUG: decode_mtr: 		WIDTH: x8
      [ 1952.639377] EDAC DEBUG: decode_mtr: 		ELECTRICAL THROTTLING is enabled
      [ 1952.639379] EDAC DEBUG: decode_mtr: 		NUMBANK: 4 bank(s)
      [ 1952.639380] EDAC DEBUG: decode_mtr: 		NUMRANK: single
      [ 1952.639381] EDAC DEBUG: decode_mtr: 		NUMROW: 16,384 - 14 rows
      [ 1952.639383] EDAC DEBUG: decode_mtr: 		NUMCOL: 1,024 - 10 columns
      [ 1952.639384] EDAC DEBUG: decode_mtr: 		SIZE: 512 MB
      [ 1952.639385] EDAC DEBUG: decode_mtr: 		ECC code is 8-byte-over-32-byte SECDED+ code
      [ 1952.639387] EDAC DEBUG: decode_mtr: 		Scrub algorithm for x8 is on enhanced mode
      ...
      [ 1952.639449] EDAC DEBUG: print_dimm_size:               channel 0 | channel 1 | channel 2 | channel 3 |
      [ 1952.639451] EDAC DEBUG: print_dimm_size: -------------------------------------------------------------
      [ 1952.639453] EDAC DEBUG: print_dimm_size: csrow/SLOT 0   512 MB   |  512 MB   |    0 MB   |    0 MB   |
      [ 1952.639456] EDAC DEBUG: print_dimm_size: csrow/SLOT 1     0 MB   |    0 MB   |    0 MB   |    0 MB   |
      [ 1952.639458] EDAC DEBUG: print_dimm_size: csrow/SLOT 2     0 MB   |    0 MB   |    0 MB   |    0 MB   |
      [ 1952.639460] EDAC DEBUG: print_dimm_size: csrow/SLOT 3     0 MB   |    0 MB   |    0 MB   |    0 MB   |
      [ 1952.639462] EDAC DEBUG: print_dimm_size: csrow/SLOT 4     0 MB   |    0 MB   |    0 MB   |    0 MB   |
      [ 1952.639464] EDAC DEBUG: print_dimm_size: csrow/SLOT 5     0 MB   |    0 MB   |    0 MB   |    0 MB   |
      [ 1952.639466] EDAC DEBUG: print_dimm_size: csrow/SLOT 6     0 MB   |    0 MB   |    0 MB   |    0 MB   |
      [ 1952.639468] EDAC DEBUG: print_dimm_size: csrow/SLOT 7     0 MB   |    0 MB   |    0 MB   |    0 MB   |
      [ 1952.639470] EDAC DEBUG: print_dimm_size: -------------------------------------------------------------
      
      Instead of detecting a single memory at channel 0, it is showing
      twice the memory.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
  9. 19 Apr, 2013 1 commit
  10. 25 Mar, 2013 1 commit
  11. 16 Mar, 2013 2 commits
  12. 05 Mar, 2013 1 commit
  13. 26 Feb, 2013 2 commits
  14. 25 Feb, 2013 7 commits
    • ghes_edac: Fix RAS tracing · 8ae8f50a
      Mauro Carvalho Chehab authored
      With the current version of CPER, there's no way to associate an
      error with the memory error. So, the error location in EDAC
      layers is unused.
      
      As CPER has its own idea about memory architectural layers, just
      output whatever is there inside the driver's detail at the RAS
      tracepoint.
      
      The EDAC location is left untouched, in case some future version
      lets us actually map the error onto the DIMM labels.
      
      Now, the error message:
      
      [   72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
      [   72.396627] {1}[Hardware Error]: APEI generic hardware error status
      [   72.396628] {1}[Hardware Error]: severity: 2, corrected
      [   72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected
      [   72.396632] {1}[Hardware Error]: flags: 0x01
      [   72.396634] {1}[Hardware Error]: primary
      [   72.396635] {1}[Hardware Error]: section_type: memory error
      [   72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400
      [   72.396638] {1}[Hardware Error]: node: 3
      [   72.396639] {1}[Hardware Error]: card: 0
      [   72.396640] {1}[Hardware Error]: module: 0
      [   72.396641] {1}[Hardware Error]: device: 0
      [   72.396643] {1}[Hardware Error]: error_type: 18, unknown
      [   72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)
      
      Is properly represented on the trace event:
      
           kworker/0:2-584   [000] ....    72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)
      
      Tested on a 4 sockets E5-4650 Sandy Bridge machine.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • ghes_edac: Make it compliant with UEFI spec 2.3.1 · 689c9cd8
      Mauro Carvalho Chehab authored
      The UEFI spec defines the memory error types and the bits that
      validate each field of the memory error record, in
      Appendix N, items N.2.5 (Memory Error Section) and
      N.2.11 (Error Status). Make the error description compliant with
      it, only showing the valid fields.
      
      The EDAC error log is now properly reporting the error:
      
      [  281.556854] mce: [Hardware Error]: Machine check events logged
      [  281.557042] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
      [  281.557044] {2}[Hardware Error]: APEI generic hardware error status
      [  281.557046] {2}[Hardware Error]: severity: 2, corrected
      [  281.557048] {2}[Hardware Error]: section: 0, severity: 2, corrected
      [  281.557050] {2}[Hardware Error]: flags: 0x01
      [  281.557052] {2}[Hardware Error]: primary
      [  281.557053] {2}[Hardware Error]: section_type: memory error
      [  281.557055] {2}[Hardware Error]: error_status: 0x0000000000000400
      [  281.557056] {2}[Hardware Error]: node: 3
      [  281.557057] {2}[Hardware Error]: card: 0
      [  281.557058] {2}[Hardware Error]: module: 1
      [  281.557059] {2}[Hardware Error]: device: 0
      [  281.557061] {2}[Hardware Error]: error_type: 18, unknown
      [  281.557067] EDAC DEBUG: ghes_edac_report_mem_error: error validation_bits: 0x000040b9
      [  281.557084] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)
      
      Tested on a 4 CPUs E5-4650 Sandy Bridge machine.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • ghes_edac: Improve driver's printk messages · d2a68566
      Mauro Carvalho Chehab authored
      Provide a better infrastructure for printk's inside the driver:
      	- use edac_dbg() for debug messages;
      	- standardize the usage of pr_info();
      	- provide warning about the risk of relying on this
      	  driver.
      
      While here, change the size of the fake memory to 1 page. This is
      as good or as bad as 1000 pages, but it is easier for userspace to
      detect, as I don't expect that any machine implementing GHES would
      provide just 1 page available ;)
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      
      Conflicts:
      	drivers/edac/ghes_edac.c
    • ghes_edac: Don't credit the same memory dimm twice · 5ee726db
      Mauro Carvalho Chehab authored
      In my tests on a system with 4 E5-4650 CPUs, the GHES
      EDAC driver is called twice. As the SMBIOS DMI enumeration
      call walks all the DIMM sockets in the system, on
      this machine, equipped with 128 GB of RAM, the memory is
      displayed twice:
      
                +-----------------------+
                |    mc0    |    mc1    |
      ----------+-----------------------+
      memory45: |  8192 MB  |  8192 MB  |
      memory44: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory43: |     0 MB  |     0 MB  |
      memory42: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory41: |     0 MB  |     0 MB  |
      memory40: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory39: |  8192 MB  |  8192 MB  |
      memory38: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory37: |     0 MB  |     0 MB  |
      memory36: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory35: |     0 MB  |     0 MB  |
      memory34: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory33: |  8192 MB  |  8192 MB  |
      memory32: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory31: |     0 MB  |     0 MB  |
      memory30: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory29: |     0 MB  |     0 MB  |
      memory28: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory27: |  8192 MB  |  8192 MB  |
      memory26: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory25: |     0 MB  |     0 MB  |
      memory24: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory23: |     0 MB  |     0 MB  |
      memory22: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory21: |  8192 MB  |  8192 MB  |
      memory20: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory19: |     0 MB  |     0 MB  |
      memory18: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory17: |     0 MB  |     0 MB  |
      memory16: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory15: |  8192 MB  |  8192 MB  |
      memory14: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory13: |     0 MB  |     0 MB  |
      memory12: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory11: |     0 MB  |     0 MB  |
      memory10: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory9:  |  8192 MB  |  8192 MB  |
      memory8:  |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory7:  |     0 MB  |     0 MB  |
      memory6:  |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory5:  |     0 MB  |     0 MB  |
      memory4:  |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory3:  |  8192 MB  |  8192 MB  |
      memory2:  |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory1:  |     0 MB  |     0 MB  |
      memory0:  |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      
      Total sum of 256 GB.
      
      As there's no reliable way to credit DIMMs to the right memory
      controller, just put everything on memory controller 0 (which should
      always exist).
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • ghes_edac: do a better job of filling EDAC DIMM info · 32fa1f53
      Mauro Carvalho Chehab authored
      Instead of just faking a random value for the DIMM data, get
      the information that is available via the DMI table.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • ghes_edac: add support for reporting errors via EDAC · f04c62a7
      Mauro Carvalho Chehab authored
      Now that the EDAC core is capable of just forwarding the errors via
      the userspace API, add a report mechanism for the GHES errors.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • ghes_edac: Register at EDAC core the BIOS report · 77c5f5d2
      Mauro Carvalho Chehab authored
      Register GHES at the EDAC MC core, in order to prevent other
      drivers from also handling errors and mangling the error data.
      
      The EDAC core will guarantee that just one driver is used,
      so the first one to register (BIOS first) will be the one that
      reports the hardware errors.
      
      For now, the EDAC driver does nothing but register at the
      EDAC core, preventing the hardware-driven mechanism from
      interfering with GHES.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
  15. 21 Feb, 2013 13 commits
    • edac: add support for raw error reports · e7e24830
      Mauro Carvalho Chehab authored
      That allows APEI GHES driver to report errors directly, using
      the EDAC error report API.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: reduce stack pressure by using a pre-allocated buffer · c7ef7645
      Mauro Carvalho Chehab authored
      The number of variables on the stack is too big.
      Reduce the stack usage by using a pre-allocated error
      buffer.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: lock module owner to avoid error report conflicts · 80cc7d87
      Mauro Carvalho Chehab authored
      APEI GHES and i7core_edac/sb_edac currently can be loaded at
      the same time, but those are Highlander modules:
      	"There can be only one".
      
      There are two reasons for that:
      
      1) Each driver assumes that it is the only one registering at
         the EDAC core, as it is driver's responsibility to number
         the memory controllers, and all of them start from 0;
      
      2) If BIOS is handling the memory errors, the OS can't also be
         doing it, as one will interfere with the other.
      
      So, we need to add a module owner lock at the EDAC core,
      in order to avoid having two different modules handling memory
      errors at the same time. The best way to implement this lock seems
      to be to use the driver's name, as this is unique, and won't require
      changes to every driver.
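
A name-based owner lock of this kind can be sketched in userspace C. This is only an illustration of the idea: `current_owner`, `claim_owner()` and `release_owner()` are hypothetical names, the comparison by `strcmp()` is an assumption, and the sketch ignores locking and reference counting for drivers that register several memory controllers.

```c
#include <stddef.h>
#include <string.h>

/* Name of the module that currently owns error reporting; NULL when free. */
static const char *current_owner;

/*
 * Try to become the owner. The same module may claim repeatedly
 * (for example once per memory controller). The string must outlive
 * the ownership; string literals satisfy this.
 * Returns 0 on success, -1 if another module already owns the core.
 */
static int claim_owner(const char *mod_name)
{
    if (current_owner && strcmp(current_owner, mod_name) != 0)
        return -1;
    current_owner = mod_name;
    return 0;
}

/* Give up ownership; a no-op unless mod_name is the current owner. */
static void release_owner(const char *mod_name)
{
    if (current_owner && strcmp(current_owner, mod_name) == 0)
        current_owner = NULL;
}
```

In this sketch, once "ghes_edac" has claimed ownership, a claim by "sb_edac" fails until "ghes_edac" releases it.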
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: add a new memory layer type · c66b5a79
      Mauro Carvalho Chehab authored
      There are some cases where the memory controller layout is
      completely hidden. This is the case of firmware-driven error
      code, like the one provided by GHES. Add a new layer to be
      used on such memory error report mechanisms.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: initialize the core earlier · 4ab19b06
      Mauro Carvalho Chehab authored
      In order for ghes_edac to work when it is built in, the EDAC core should
      be initialized earlier, otherwise the ghes_edac driver initializes
      before edac_mc_sysfs_init() is called:
      
      ...
      [    4.998373] EDAC MC0: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
      ...
      [    4.998373] EDAC MC1: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
      [    6.519495] EDAC MC: Ver: 3.0.0
      [    6.523749] EDAC DEBUG: edac_mc_sysfs_init: device mc created
      
      The net result is that no EDAC sysfs nodes will appear.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: better report error conditions in debug mode · 3d958823
      Mauro Carvalho Chehab authored
      It is hard to find what's wrong without a proper error
      report. Improve it, in debug mode.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • i5100_edac: Remove two checkpatch warnings · 59b9796d
      Mauro Carvalho Chehab authored
      The last changeset introduced a few checkpatch warnings:
      
      WARNING: debugfs_remove_recursive(NULL) is safe this check is probably not required
      261: FILE: drivers/edac/i5100_edac.c:1207:
      +       if (priv->debugfs)
      +               debugfs_remove_recursive(priv->debugfs);
      
      WARNING: debugfs_remove(NULL) is safe this check is probably not required
      290: FILE: drivers/edac/i5100_edac.c:1250:
      +       if (i5100_debugfs)
      +               debugfs_remove(i5100_debugfs);
      
      Get rid of them.
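
The pattern behind these warnings is a cleanup helper that already tolerates NULL, which makes the caller's guard redundant. A userspace sketch, with `fake_dentry`, `fake_priv`, `fake_remove()` and `fake_teardown()` as hypothetical stand-ins for the debugfs objects and helpers:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a debugfs entry. */
struct fake_dentry {
    int unused;
};

/* Hypothetical stand-in for the driver's private data. */
struct fake_priv {
    struct fake_dentry *debugfs;
};

/* Cleanup helper that is safe to call with NULL, like free(). */
static void fake_remove(struct fake_dentry *d)
{
    if (d == NULL)
        return;
    free(d);
}

/* No "if (priv->debugfs)" guard is needed around the call. */
static void fake_teardown(struct fake_priv *priv)
{
    fake_remove(priv->debugfs);
    priv->debugfs = NULL;
}
```

Calling `fake_teardown()` on a structure whose `debugfs` member is NULL is harmless, so the teardown path needs no conditional.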
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • i5100_edac: connect fault injection to debugfs node · 9cbc6d38
      Niklas Söderlund authored
      Create a debugfs directory i5100_edac/mcX for each memory controller and
      add nodes to control how fault injection is performed.
      
      After configuring an injection using inject_channel, inject_deviceptr1,
      inject_deviceptr2, inject_eccmask1, inject_eccmask2 and inject_hlinesel
      trigger the injection by writing anything to inject_enable.
      
      Example of a CE injection:
      
      echo 0 > /sys/kernel/debug/i5100_edac/mc0/inject_channel
      echo 1 > /sys/kernel/debug/i5100_edac/mc0/inject_hlinesel
      echo 61440 > /sys/kernel/debug/i5100_edac/mc0/inject_eccmask1
      echo 1 > /sys/kernel/debug/i5100_edac/mc0/inject_enable
      
      Example of UE injection:
      
      echo 0 > /sys/kernel/debug/i5100_edac/mc0/inject_channel
      echo 2 > /sys/kernel/debug/i5100_edac/mc0/inject_hlinesel
      echo 65535 > /sys/kernel/debug/i5100_edac/mc0/inject_eccmask1
      echo 65535 > /sys/kernel/debug/i5100_edac/mc0/inject_eccmask2
      echo 17 > /sys/kernel/debug/i5100_edac/mc0/inject_deviceptr1
      echo 0 > /sys/kernel/debug/i5100_edac/mc0/inject_deviceptr2
      echo 1 > /sys/kernel/debug/i5100_edac/mc0/inject_enable
      
      Sometimes the injection must be enabled more than once (by echoing to
      the inject_enable node) before it happens; I am not sure why.
      Signed-off-by: Niklas Söderlund <niklas.soderlund@ericsson.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      9cbc6d38
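
The CE injection sequence and the "enable more than once" remark in the commit above can be wrapped in a small helper. This is only a sketch: the function names `set_node` and `inject_ce`, the retry count argument, and the `I5100_DEBUGFS` override variable are hypothetical and not part of the driver. The node names and values are taken from the commit message.

```shell
#!/bin/sh
# Hypothetical helper around the debugfs nodes described in the commit message.
# I5100_DEBUGFS is an assumed override so the sketch can be tried against a
# scratch directory instead of the real debugfs mount.
I5100_DEBUGFS="${I5100_DEBUGFS:-/sys/kernel/debug/i5100_edac}"

# set_node MC NODE VALUE: write VALUE to one injection node of controller MC.
set_node() {
    printf '%s\n' "$3" > "$I5100_DEBUGFS/$1/$2"
}

# inject_ce MC [TRIES]: configure the CE example from the commit message,
# then write to inject_enable TRIES times (default 1), since a single write
# is reported not to trigger the injection every time.
inject_ce() {
    mc="$1"
    tries="${2:-1}"
    set_node "$mc" inject_channel 0 || return 1
    set_node "$mc" inject_hlinesel 1 || return 1
    set_node "$mc" inject_eccmask1 61440 || return 1
    i=0
    while [ "$i" -lt "$tries" ]; do
        set_node "$mc" inject_enable 1 || return 1
        i=$((i + 1))
    done
}
```

With the script sourced as root on a machine that has the driver loaded and debugfs mounted, `inject_ce mc0 3` would configure the CE example and write to inject_enable three times. How many writes are actually needed is not stated in the commit, so the count is left to the caller.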
    • Niklas Söderlund's avatar
      i5100_edac: add fault injection code · 53ceafd6
      Niklas Söderlund authored
      Add fault injection based on information from the i5100 datasheet, see
      [1]. In addition to the i5100 datasheet, some missing information on the
      injection functions was found through experimentation and the i7300
      datasheet, see [2].
      
      [1] Intel 5100 Memory Controller Hub Chipset
          Doc.Nr: 318378
          http://www.intel.com/content/dam/doc/datasheet/5100-memory-controller-hub-chipset-datasheet.pdf
      
      [2] Intel 7300 Chipset Memory Controller Hub (MCH)
          Doc.Nr: 318082
          http://www.intel.com/assets/pdf/datasheet/318082.pdf
      Signed-off-by: Niklas Söderlund <niklas.soderlund@ericsson.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      53ceafd6
    • Niklas Söderlund's avatar
      i5100_edac: probe for device 19 function 0 · 52608ba2
      Niklas Söderlund authored
      Probe and store the device handle for the device 19 function 0 during
      driver initialization. The device is used during fault injection.
      Signed-off-by: Niklas Söderlund <niklas.soderlund@ericsson.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      52608ba2
    • Mauro Carvalho Chehab's avatar
      edac: only create sdram_scrub_rate where supported · e7100478
      Mauro Carvalho Chehab authored
      Currently, the sdram_scrub_rate sysfs node is created even if the device
      doesn't support getting/setting the scrub rate. Change the logic to only
      create this device node when the operation is supported.
      Reported-by: Felipe Balbi <balbi@ti.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Felipe Balbi <balbi@ti.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      e7100478
    • Mauro Carvalho Chehab's avatar
      i3200_edac: Fix the logic that detects filled memories · 61734e18
      Mauro Carvalho Chehab authored
      After running a series of tests on an HP DL320, filled with different
      memory sizes, it was noticed that, when filled with just one DIMM
      on such hardware, the driver wrongly detects twice the memory, and
      thinks that both channels 0 and 1 are filled.
      
      It seems to be partially caused by the BIOS and partially by the driver.
      
      The current i3200_edac logic would work fine if the BIOS disabled
      the unused second channel when just one DIMM is connected, to save
      power, as recommended in this chipset's datasheet.
      
      However, the BIOS on this particular machine doesn't do it:
      
      [   16.741421] EDAC DEBUG: how_many_channels: In dual channel mode
      [   16.741424] EDAC DEBUG: how_many_channels: 2 DIMMS per channel enabled
      
      So, the driver was assuming that 2 channels are enabled (well, they are,
      but the second is unused).
      
      Combined with that, I found two issues in the logic that creates the
      EDAC data, which fails when the two channels are not equally
      filled (AFAICT, that happens only when just 1 DIMM is plugged).
      
      The first one is that a 0 at DRB means that nothing is filled. The
      driver's logic, however, does some calculation with that.
      
      The second one is that the logic that fills the DIMM data currently
      assumes that both channels are equally filled.
      
      I tested the system already with the current configuration and my
      patch and it is now working fine. So, for a 2R single DIMM 2Gb memory
      at dimm slot 01 (channel 0), it is now displaying:
      
      [   16.741406] EDAC DEBUG: i3200_get_drbs: drb[0][0] = 16, drb[1][0] = 0
      [   16.741410] EDAC DEBUG: i3200_get_drbs: drb[0][1] = 32, drb[1][1] = 0
      [   16.741413] EDAC DEBUG: i3200_get_drbs: drb[0][2] = 32, drb[1][2] = 0
      [   16.741416] EDAC DEBUG: i3200_get_drbs: drb[0][3] = 32, drb[1][3] = 0
      ...
      [   16.741896] EDAC DEBUG: i3200_probe1: csrow 0, channel 0, size = 1024 Mb
      [   16.741899] EDAC DEBUG: i3200_probe1: csrow 1, channel 0, size = 1024 Mb
      
      and the corresponding sysfs nodes are now properly filled.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      61734e18
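
The DRB handling described in the commit above can be illustrated with a few lines of shell arithmetic. This is a sketch of the idea, not the driver's code. It assumes the DRB values are cumulative boundaries within a channel and that one DRB unit is 64 MB; both assumptions are inferred from the log, where drb values 16 and 32 on channel 0 yield two 1024 Mb csrows. The function names are hypothetical.

```shell
#!/bin/sh
# Assumed granularity: 64 MB per DRB unit, inferred from the log above
# (cumulative value 16 corresponds to a 1024 Mb csrow).
DRB_UNIT_MB=64

# rank_size_mb CUR PREV: size in MB of one rank, given its cumulative DRB
# value CUR and the previous boundary PREV. A DRB of 0 means the rank is
# not filled, so no subtraction is attempted.
rank_size_mb() {
    if [ "$1" -eq 0 ]; then
        echo 0
        return 0
    fi
    echo $(( ($1 - $2) * DRB_UNIT_MB ))
}

# channel_sizes DRB...: print the size of each rank of one channel, one per
# line, handling each channel on its own rather than assuming both channels
# are equally filled.
channel_sizes() {
    prev=0
    for cur in "$@"; do
        rank_size_mb "$cur" "$prev"
        if [ "$cur" -ne 0 ]; then
            prev="$cur"
        fi
    done
    return 0
}
```

Under these assumptions, `channel_sizes 16 32 32 32` prints 1024, 1024, 0, 0 for channel 0 and `channel_sizes 0 0 0 0` prints four zeros for the empty channel 1, which agrees with the two 1024 Mb csrows reported by i3200_probe1.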
    • Mauro Carvalho Chehab's avatar
      i3200_edac: Add more debug to the driver · 5f466cb0
      Mauro Carvalho Chehab authored
      Currently, when debug is enabled, it is not possible to know
      whether the driver is using 2 DIMMS per channel mode. Nor is it
      possible to know the values of the drbs registers, which are used
      to identify the memory rank sizes.
      
      Add debug for both, as it helps to track issues on the driver.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      5f466cb0
  16. 10 Feb, 2013 1 commit