
[Solved] System Hard Locks Inserting corefreqk.ko (Intel Atom 330)

Open svmlegacy opened this issue 4 years ago • 39 comments

Clean make of main branch. Inserting corefreqk.ko module results in hard lock of this system, even num lock frozen. Have also seen this issue on select Intel ES processors, on unreleased steppings.

svmlegacy avatar Dec 12 '21 02:12 svmlegacy

Clean make of main branch. Inserting corefreqk.ko module results in hard lock of this system, even num lock frozen.

Atom 330 of Diamondville has a CPUID of 06_1C

https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.h#L1304

Was it running with older versions of CoreFreq ?

If not, comment out or remove those lines:

https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L2316

https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L7487

... then rebuild and try.

Have also seen this issue on select Intel ES processors, on unreleased steppings.

ES, which CPUID and Brand strings are they ?

cyring avatar Dec 12 '21 07:12 cyring

Intel Atom 330, CPUID 106C2h (06_1C stepping 2) is correct.

Was it running with older versions of CoreFreq ? If not, comment out or remove those lines: ... then rebuild and try.

Unfortunately still hard-locking. This is the first chance I've had to run this system. Do you have a suggested older version to try?

ES, which CPUID and Brand strings are they ?

The two that I've tried are as follows:

  • Intel Nehalem-EP-B0: 106A2h (06_1A stepping 2), "Genuine Intel(R) CPU" Example CPU decode
  • Intel Auburndale-B0: 106F1h (06_1F stepping 1), "Genuine Intel(R) CPU" Example CPU decode

Unsure if it's related; I always chalked it up to them being early ES parts. They hard-lock in exactly the same manner, so I added it as a piece of info.
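For readers following the signatures quoted above, here is a minimal sketch of how a raw CPUID leaf 1 EAX value maps to the family_model notation used in this thread. The struct and function names are hypothetical and are not CoreFreq code; the bit layout follows the documented CPUID leaf 1 encoding.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type for illustration, not a CoreFreq structure. */
struct cpu_signature {
	unsigned int family;	/* display family */
	unsigned int model;	/* display model */
	unsigned int stepping;
};

/* Decode the EAX value returned by CPUID leaf 1. */
struct cpu_signature decode_signature(uint32_t eax)
{
	unsigned int base_family = (eax >> 8) & 0xF;
	unsigned int base_model  = (eax >> 4) & 0xF;
	unsigned int ext_family  = (eax >> 20) & 0xFF;
	unsigned int ext_model   = (eax >> 16) & 0xF;
	struct cpu_signature sig;

	sig.stepping = eax & 0xF;
	/* The extended family is only added when the base family is 0xF. */
	sig.family = (base_family == 0xF) ? base_family + ext_family
					  : base_family;
	/* The extended model only applies to families 0x6 and 0xF. */
	sig.model = (base_family == 0x6 || base_family == 0xF)
		  ? ((ext_model << 4) | base_model)
		  : base_model;
	return sig;
}

/* Print a signature in the "FF_MM stepping S" form used in this thread. */
void print_signature(uint32_t eax)
{
	struct cpu_signature s = decode_signature(eax);

	printf("%05Xh -> %02X_%02X stepping %u\n",
		(unsigned int) eax, s.family, s.model, s.stepping);
}
```

Calling print_signature(0x106C2) prints "106C2h -> 06_1C stepping 2", which matches the Atom 330 signature reported above.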

svmlegacy avatar Dec 12 '21 16:12 svmlegacy

Unfortunately still hard-locking. This is the first chance I've had to run this system. Do you have a suggested older version to try?

Do you have any kernel log or screenshot of the backtracked functions and registers dump ?

ES, which CPUID and Brand strings are they ?

The two that I've tried are as follows:

CPUID signature 06_1A and 06_1F are both implemented into CoreFreq , respectively _Nehalem_Bloomfield and _Nehalem_MB

Probably those zeros in the brand string Genuine Intel(R) CPU @ 0000 @ 1.87GHz lead the driver to a division error.

For testing, the line below can be commented out and replaced with a static value: https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L1017

/*
	iArg->Features->Factory.Freq = Intel_Brand( iArg->Features->Info.Brand,
							iArg->Brand );
*/
	iArg->Features->Factory.Freq = 1870;
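
To illustrate the suspected failure mode, here is a minimal user-space sketch of a brand-string frequency parser that never returns zero. It is an assumption-laden stand-in: brand_freq_mhz() is a hypothetical name, and the real Intel_Brand() in CoreFreq may parse the string quite differently.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: extract the frequency in MHz from the text after
 * the last '@' of a brand string, or return fallback_mhz when no usable
 * frequency is found. Not CoreFreq's Intel_Brand(). */
unsigned int brand_freq_mhz(const char *brand, unsigned int fallback_mhz)
{
	const char *at = strrchr(brand, '@');
	char *unit;
	double value, mhz;

	if (at == NULL)
		return fallback_mhz;	/* no "@ x.xxGHz" suffix at all */

	value = strtod(at + 1, &unit);	/* strtod skips leading blanks */
	if (strncmp(unit, "GHz", 3) == 0)
		mhz = value * 1000.0;
	else if (strncmp(unit, "MHz", 3) == 0)
		mhz = value;
	else
		return fallback_mhz;	/* e.g. a bare "@ 0000" field */

	/* Never hand a zero (or an absurd value) to a later division. */
	if (mhz < 1.0 || mhz > 100000.0)
		return fallback_mhz;
	return (unsigned int) (mhz + 0.5);
}
```

With the ES brand string quoted above, the last '@' field gives 1870 MHz; a string without any frequency suffix falls back to the caller's default instead of yielding zero.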

cyring avatar Dec 12 '21 17:12 cyring

@svmlegacy : Please let me know about results with suggested code above and Atom 330 crash screen.

cyring avatar Dec 14 '21 23:12 cyring

@svmlegacy : Please let me know about results with suggested code above and Atom 330 crash screen.

Still trying to get any kind of debugging info out. Hard lock occurs before any outputs. Trying to get debugging out to a secondary PC via the COM port, but so far only getting a garbled mess. Will let you know when I have something useful.

svmlegacy avatar Dec 14 '21 23:12 svmlegacy

@svmlegacy : Please let me know about results with suggested code above and Atom 330 crash screen.

Still trying to get any kind of debugging info out. Hard lock occurs before any outputs. Trying to get debugging out to a secondary PC via the COM port, but so far only getting a garbled mess. Will let you know when I have something useful.

About the Atom 330, I would suggest reading the MSRs that are accessed along the call flow.

Architecture entries are in these lines: https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.h#L6507

  • Load the kernel msr driver and read the registers with the rdmsr command-line tool. Any access violation should be trapped by the kernel, which prevents a crash.
modprobe msr
rdmsr -ax <reg_no>
  • The first entry point is Query_Core2(), which calls Intel_Core_Platform_Info(), where these registers are read:
  1. MSR_PLATFORM_INFO
  2. MSR_IA32_PERF_STATUS
  3. MSR_IA32_PLATFORM_ID
  • Thus do:
rdmsr -ax 0x000000ce
rdmsr -ax 0x00000198
rdmsr -ax 0x00000017
  • Next, Query_Core2() goes into HyperThreading_Technology() for the topology, where MSR_IA32_APICBASE is read.

  • Do:

rdmsr -ax 0x0000001b
  • At this point we're done with Query_Core2() . Let me know if registers can be safely read on your processor.
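
The same checks can be scripted. Below is a minimal sketch, assuming the msr driver is loaded and exposes the usual /dev/cpu/N/msr nodes (root is normally required). The function names are hypothetical; the register numbers are the ones listed above.

```c
#define _XOPEN_SOURCE 700	/* for pread() */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one 64-bit MSR through a device node such as /dev/cpu/0/msr.
 * Returns 0 on success, -1 if the node cannot be opened or the read is
 * rejected (an unsupported register shows up as a failed read). */
int read_msr(const char *node, uint32_t reg, uint64_t *value)
{
	int fd = open(node, O_RDONLY);
	ssize_t got;

	if (fd < 0)
		return -1;
	/* The file offset selects the MSR number. */
	got = pread(fd, value, sizeof(*value), (off_t) reg);
	close(fd);
	return (got == (ssize_t) sizeof(*value)) ? 0 : -1;
}

/* Print the four registers named above, as seen from CPU 0. */
void dump_query_core2_msrs(void)
{
	static const struct { uint32_t reg; const char *name; } regs[] = {
		{ 0x0ce, "MSR_PLATFORM_INFO" },
		{ 0x198, "MSR_IA32_PERF_STATUS" },
		{ 0x017, "MSR_IA32_PLATFORM_ID" },
		{ 0x01b, "MSR_IA32_APICBASE" },
	};
	size_t i;

	for (i = 0; i < sizeof(regs) / sizeof(regs[0]); i++) {
		uint64_t v;

		if (read_msr("/dev/cpu/0/msr", regs[i].reg, &v) == 0)
			printf("%-22s 0x%016llx\n", regs[i].name,
				(unsigned long long) v);
		else
			printf("%-22s unreadable\n", regs[i].name);
	}
}
```

A register reported as unreadable here is a good candidate for the access that upsets the driver, since user space gets an error where unguarded kernel code would fault.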

cyring avatar Dec 15 '21 00:12 cyring

Somewhat interesting update:

After a full, clean reinstall of Fedora 35 due to other unrelated troubles (nvidia 340 drivers breaking the system), when I built CoreFreq for the shipped kernel 5.14, I got a segmentation fault on inserting corefreqk.ko. Rebooting the system without updating any packages resulted in the hard lock on loading again.

All suggested registers output hex values without issue, matching on all cores. I'll submit the actual results tomorrow.

At this point, I'm debating on switching to another distro, even if Fedora 35 works on other platforms.

svmlegacy avatar Dec 17 '21 03:12 svmlegacy

At this point, I'm debating on switching to another distro, even if Fedora 35 works on other platforms.

My favorite is Arch Linux; in my Wiki I provide a CoreFreq live image based on Arch.

At the bottom of the page you'll also find the nightly build, with the CoreFreq development branch embedded.

Those images also contain the full Arch installation scripts, including Network Manager and its nmtui for easy Network devices setup.

cyring avatar Dec 17 '21 05:12 cyring

Just to be sure about Nehalem: here is the latest development using the bootable CoreFreq ISO CoreFreq_i7_920_20211219

cyring avatar Dec 19 '21 20:12 cyring

Update on the Atom 330: Corefreq Arch Linux build also has a kernel panic when loading the module.

Does this build push any information to ttyS0 by default? Still haven't gotten any meaningful information there from the machine at all, but curious if it's worth a try. Kernel panic didn't seem to have much valuable information, but I'll try to get a picture of it in the faulted state.

Will be trying the Nehalem chips after the Atom is sorted... They take up the same workbench :)

svmlegacy avatar Dec 19 '21 21:12 svmlegacy

Update on the Atom 330: Corefreq Arch Linux build also has a kernel panic when loading the module.

Does this build push any information to ttyS0 by default? Still haven't gotten any meaningful information there from the machine at all, but curious if it's worth a try. Kernel panic didn't seem to have much valuable information, but I'll try to get a picture of it in the faulted state.

Will be trying the Nehalem chips after the Atom is sorted... They take up the same workbench :)

Can you post here the output of command lspci -nn of your Atom 330 and the ES processors ?

I would like to check their device IDs (DID) and the resulting driver call flow. Perhaps some DIDs are present but the Base Address and CSR registers are not. For example, the Atom 330 has no VT-d support.

cyring avatar Dec 20 '21 06:12 cyring

Atom 330 lspci: here

svmlegacy avatar Dec 21 '21 03:12 svmlegacy

Atom 330 lspci: here

OH! NVidia MCP79 is not implemented yet.

Vendor ID 10de (NVIDIA) is not part of the driver yet. The driver may start with this argument:

insmod corefreqk.ko ArchID=<N>

where <N> is taken from the generic architectures, 0 or 11:

https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.h#L6096
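
For anyone checking their own chipset against the driver, here is a small sketch that pulls the vendor and device IDs out of one line of lspci -nn output. The helper name is hypothetical and this is not how CoreFreq probes PCI devices; it only assumes the usual "[vvvv:dddd]" suffix that lspci -nn prints.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find the first "[vvvv:dddd]" pair in a line of
 * `lspci -nn` output. Returns 0 and fills vendor/device on success,
 * -1 when the line carries no such pair. */
int parse_pci_ids(const char *line, unsigned int *vendor,
		  unsigned int *device)
{
	const char *p = line;

	while ((p = strchr(p, '[')) != NULL) {
		unsigned int v, d;
		char closing;

		/* A class code such as "[0600]" has no ':' and is skipped. */
		if (sscanf(p, "[%4x:%4x%c", &v, &d, &closing) == 3
		 && closing == ']') {
			*vendor = v;
			*device = d;
			return 0;
		}
		p++;
	}
	return -1;
}
```

A vendor value of 0x10de then identifies an NVIDIA device, the case that is not handled by the driver yet.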

We will have to program a new loop from scratch. This time, I recommend using the most transparent VM to test and enhance CoreFreq until we feel confident enough to run bare-metal.

As usual, the key to a good implementation is the NVidia MCP79 datasheet and its register specification. Googling turns up some documents; the kernel source code for that chip is also worth digging into.

cyring avatar Dec 21 '21 07:12 cyring

2021-12-21-101849_766x674_scrot

Apparently MSR_PLATFORM_ID is available. First change is to add _Atom_Bonnell in the Intel_MaxBusRatio() function: https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L2311

int Intel_MaxBusRatio(PLATFORM_ID *PfID)
{
	struct SIGNATURE whiteList[] = {
		_Core_Conroe,		/* 06_0F */
		_Core_Penryn,		/* 06_17 */
		_Atom_Bonnell,		/* 06_1C */
		_Atom_Silvermont,	/* 06_26 */
		_Atom_Lincroft, 	/* 06_27 */
		_Atom_Clover_Trail,	/* 06_35 */
		_Atom_Saltwell, 	/* 06_36 */
		_Silvermont_Bay_Trail,	/* 06_37 */
	};
	int id, ids = sizeof(whiteList) / sizeof(whiteList[0]);
	for (id = 0; id < ids; id++) {
		if ((whiteList[id].ExtFamily \
			== PUBLIC(RO(Proc))->Features.Std.EAX.ExtFamily)
		 && (whiteList[id].Family \
			== PUBLIC(RO(Proc))->Features.Std.EAX.Family)
		 && (whiteList[id].ExtModel \
			== PUBLIC(RO(Proc))->Features.Std.EAX.ExtModel)
		 && (whiteList[id].Model \
			== PUBLIC(RO(Proc))->Features.Std.EAX.Model))
		{
			RDMSR((*PfID), MSR_IA32_PLATFORM_ID);
			return 0;
		}
	}
	return -1;
}

Then rebuild, unload, restart all (bare-metal test)

cyring avatar Dec 21 '21 09:12 cyring

Another request is to check whether MSR_PLATFORM_INFO is really unsupported by Bonnell, because it is not listed in the architectural list: 2021-12-21-114312_765x417_scrot

whereas MSR_IA32_PERF_STATUS is listed: 2021-12-21-115213_773x170_scrot

  • When executing ...
rdmsr -ax 0x000000ce

... check the kernel log for a trapped access. A returned value of zero is also a sign of an unsupported register.

If unsupported please comment out its usage in function Intel_Core_Platform_Info(): https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L2341

Change the function as shown below:

void Intel_Core_Platform_Info(unsigned int cpu)
{
	PLATFORM_ID PfID = {.value = 0};
	PLATFORM_INFO PfInfo = {.value = 0};
	PERF_STATUS PerfStatus = {.value = 0};
	unsigned int ratio0 = 10, ratio1 = 10; /*Arbitrary values*/
/*
	RDMSR(PfInfo, MSR_PLATFORM_INFO);
	if (PfInfo.value != 0) {
		ratio0 = PfInfo.MaxNonTurboRatio;
	}
*/
	RDMSR(PerfStatus, MSR_IA32_PERF_STATUS);
	if (PerfStatus.value != 0) {				/* §18.18.3.4 */
		if (PerfStatus.CORE.XE_Enable) {
			ratio1 = PerfStatus.CORE.MaxBusRatio;
		} else {
			if (Intel_MaxBusRatio(&PfID) == 0) {
				if (PfID.value != 0)
				{
					ratio1 = PfID.MaxBusRatio;
				}
			}
		}
	} else {
		if (Intel_MaxBusRatio(&PfID) == 0) {
			if (PfID.value != 0)
			{
				ratio1 = PfID.MaxBusRatio;
			}
		}
	}

	PUBLIC(RO(Core, AT(cpu)))->Boost[BOOST(MIN)] =	KMIN(ratio0, ratio1);
	PUBLIC(RO(Core, AT(cpu)))->Boost[BOOST(MAX)] =	KMAX(ratio0, ratio1);
}

cyring avatar Dec 21 '21 10:12 cyring

@svmlegacy Hey! any progress with the debugging code requests above ?

cyring avatar Jan 09 '22 06:01 cyring

@svmlegacy : please let me know when you can contribute to this issue.

cyring avatar Jan 22 '22 14:01 cyring

@svmlegacy Since commit b2f75c89332a1e0ffa517c22895c57c1b91ac812 what about Atom 330 ?

cyring avatar Apr 06 '22 01:04 cyring

Sorry about the inactivity lately, I'll give it a shot tomorrow and see what happens! Thanks for the poke.

svmlegacy avatar Apr 06 '22 01:04 svmlegacy

All my previous attempts were fruitless, just tried again with the dev version of the archlinux ISO and the current master branch. No luck. Haven't been able to get a serial connection outbound either. Screenshot from 2022-04-06 18-20-26

svmlegacy avatar Apr 06 '22 22:04 svmlegacy

All my previous attempts were fruitless, just tried again with the dev version of the archlinux ISO and the current master branch. No luck. Haven't been able to get a serial connection outbound either. Screenshot from 2022-04-06 18-20-26

Thanks for trying the develop branch. Don't you have any kernel log (dmesg) to see where the Atom has crashed in the driver callflow ?

cyring avatar Apr 07 '22 04:04 cyring

Don't you have any kernel log (dmesg) to see where the Atom has crashed in the driver callflow ?

Great point! There is something that changed since last time I was working with this. Before, the system would hard lock, meaning I couldn't pull from dmesg. Now, it seems like it's not causing the system to lock (but still isn't working quite right.)

Here's the dmesg pulled from the system, with the attempted module insertion as the last entries: dmesg.txt .

svmlegacy avatar Apr 07 '22 20:04 svmlegacy

Don't you have any kernel log (dmesg) to see where the Atom has crashed in the driver callflow ?

Great point! There is something that changed since last time I was working with this. Before, the system would hard lock, meaning I couldn't pull from dmesg. Now, it seems like it's not causing the system to lock (but still isn't working quite right.)

Here's the dmesg pulled from the system, with the attempted module insertion as the last entries: dmesg.txt .

Yes, it started at:

CoreFreq(0:2:-1): Processor [ 06_1C] Architecture [Atom/Bonnell] SMT [4/4]

Can you read this register ?

## MSR_TEMPERATURE_TARGET
rdmsr -ax 0x1A2

If not, please comment out that line in the driver code, then rebuild and reload everything for testing: https://github.com/cyring/CoreFreq/blob/a1540153123db1b2614dcc2d8cddede1be3a42cb/corefreqk.c#L7737

cyring avatar Apr 07 '22 23:04 cyring

Screenshot from 2022-04-07 20-25-47

Can you read this register ?

## MSR_TEMPERATURE_TARGET
rdmsr -ax 0x1A2

Nope. Could not read that MSR.

Commenting out this line enables the system to insert the mod with no issues. https://github.com/cyring/CoreFreq/blob/a1540153123db1b2614dcc2d8cddede1be3a42cb/corefreqk.c#L7737

Dumped a bunch of info here: https://gist.github.com/svmlegacy/9bd33c5b273e4310f20a3c6c2b288bfe

Wonderful to see progress!

svmlegacy avatar Apr 08 '22 00:04 svmlegacy

Great to see that screenshot of Bonnell

The last register, MSR_TEMPERATURE_TARGET, really hurts the processor. Without it we have no TjMax from the hardware, so it is hard-coded to 100°C. We can fine-tune TjMax and the temperature formula if you are aware of better values for your processor.

I'm wrapping up all the code changes: other Atom architectures are also impacted by the same issue.
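
To make the role of TjMax concrete, here is a minimal sketch of the usual Intel arithmetic, under stated assumptions: bits 23:16 of MSR_TEMPERATURE_TARGET carry the target temperature where the register exists, and bits 22:16 of IA32_THERM_STATUS carry the digital readout, counted in degrees below TjMax. The function names and the TJMAX_FALLBACK macro are hypothetical, and CoreFreq's own formula may differ.

```c
#include <stdint.h>

#define TJMAX_FALLBACK 100	/* degrees C, the hard-coded value mentioned above */

/* msr_ok is nonzero only when reading MSR_TEMPERATURE_TARGET (0x1A2)
 * succeeded; on the Atom 330 it does not, so the fallback applies. */
unsigned int tjmax_celsius(int msr_ok, uint64_t temperature_target)
{
	unsigned int target =
		(unsigned int) ((temperature_target >> 16) & 0xFF);

	if (!msr_ok || target == 0)
		return TJMAX_FALLBACK;
	return target;
}

/* IA32_THERM_STATUS (0x19C): bits 22:16 hold the digital readout,
 * i.e. how many degrees the core sits below TjMax. */
int core_temp_celsius(unsigned int tjmax, uint64_t therm_status)
{
	unsigned int readout = (unsigned int) ((therm_status >> 16) & 0x7F);

	return (int) tjmax - (int) readout;
}
```

Because the reported temperature is TjMax minus the readout, any error in the assumed TjMax shifts every reading by the same amount.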

cyring avatar Apr 08 '22 04:04 cyring

@svmlegacy Code changes made so far are available in commit 0794238d5e9bdeae6252dff46f8dd001f5c12294

The monitoring loop for Bonnell is very basic and now needs to be refined using the architectural MSRs listed in chapter 2.3 of the SDM specifications.

2022-04-08-084946_811x144_scrot

cyring avatar Apr 08 '22 06:04 cyring

And this datasheet also -;)

  • For information, Low Power Features (P_LVLx I/O): so far we don't have the I/O Base Address register:
|- Core C-States                                                                
   |- C-States Base Address                                      BAR   [ 0x0   ]
  • For information, a TjMax of 85.2°C according to the table below.

EDIT: If temperature is not accurate, you can try the integer value of 85 at this code line:

https://github.com/cyring/CoreFreq/blob/0794238d5e9bdeae6252dff46f8dd001f5c12294/corefreqk.c#L8235

2022-04-08-100932_675x163_scrot

  • For your testing, Core voltage VCC:
  1. Replace Formula with one of these two statements (hoping one is compatible)

https://github.com/cyring/CoreFreq/blob/0794238d5e9bdeae6252dff46f8dd001f5c12294/corefreqk.h#L7115

with:

	.voltageFormula = VOLTAGE_FORMULA_INTEL_SOC,

or:

	.voltageFormula = VOLTAGE_FORMULA_INTEL_SNB,
  2. Rebuild and run
  3. Set Voltage scope to <SMT> in the Settings menu
  4. Change to the Voltage view

cyring avatar Apr 08 '22 08:04 cyring

Good News! The develop branch now works as-is for the Atom 330.

Reported temperature looks good. Offsetting by another 15°C would put it sub-ambient. Tjmax of 85°C matches what is reported by other utilities.

I tried changing the .voltageFormula with the suggested statements:

https://github.com/cyring/CoreFreq/blob/ed94b48f4adaad30f8c4df7f7f83734f60f1cf03/corefreqk.h#L7172

Neither produced a good result in the SMT scope. _SOC was locked at 0.38V, and _SNB was at 0.0033 V. Expected VID range per the datasheet is 0.7 - 1.2 V.

FYI, I have a couple of other Bonnell chips that we can use for testing: Intel Atom N270 (32-bit only, Diamondville) and Intel Atom N450 (64-bit capable, Pineview).

svmlegacy avatar Apr 18 '22 00:04 svmlegacy

Good News! The develop branch now works as-is for the Atom 330.

Reported temperature looks good. Offsetting by another 15°C would put it sub-ambient. Tjmax of 85°C matches what is reported by other utilities.

I tried changing the .voltageFormula with the suggested statements:

https://github.com/cyring/CoreFreq/blob/ed94b48f4adaad30f8c4df7f7f83734f60f1cf03/corefreqk.h#L7172

Neither produced a good result in the SMT scope. _SOC was locked at 0.38V, and _SNB was at 0.0033 V. Expected VID range per the datasheet is 0.7 - 1.2 V.

Let's keep this voltage algorithm VOLTAGE_FORMULA_INTEL_SOC but we will adjust the formula here: https://github.com/cyring/CoreFreq/blob/ed94b48f4adaad30f8c4df7f7f83734f60f1cf03/coretypes.h#L614

What we are interested in is this equation: https://github.com/cyring/CoreFreq/blob/ed94b48f4adaad30f8c4df7f7f83734f60f1cf03/coretypes.h#L629 which receives a voltage VID as an input, and outputs the Vcore

In the datasheets, most of the time in volume 1, we should find the table that maps each VID to its Vcore, along with any step sizes and offsets to apply in the Vcore formula.

Tbc.

FYI, I have a couple of other Bonnell chips that we can use for testing: Intel Atom N270 (32-bit only, Diamondville) and Intel Atom N450 (64-bit capable, Pineview).

32-bit is not supported, but I would be glad to test the N450.

cyring avatar Apr 18 '22 02:04 cyring

In datasheet, table 3-2

2022-04-18-044350_632x473_scrot

  • Can you replace the code with this formula:
 Vcore = 0.7 + ( 73.0 - (double) (VID) ) * 0.0125;
  • A few samples, from the bottom of the table to the top, with the VID converted to decimal:

VID (binary)    VID (decimal)   Formula                        Vcore (V)
1 0 0 1 0 0 1   73              0.7 + (73.0 - 73.0) * 0.0125   0.7000
1 0 0 1 0 0 0   72              0.7 + (73.0 - 72.0) * 0.0125   0.7125
0 1 1 0 1 1 0   54              0.7 + (73.0 - 54.0) * 0.0125   0.9375
0 1 0 0 0 0 1   33              0.7 + (73.0 - 33.0) * 0.0125   1.2000
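
For a quick check outside the driver, here is a standalone sketch of the same formula. The function names are hypothetical and the code is user-space only; the constants are exactly those of the formula above.

```c
#include <stdio.h>

/* Vcore in volts for a 7-bit VID: 0.7 V at VID 73, rising by 12.5 mV
 * for each step the VID goes down. */
double vid_to_vcore(unsigned int vid)
{
	return 0.7 + (73.0 - (double) vid) * 0.0125;
}

/* Print the sample VIDs discussed in this thread. */
void print_vid_samples(void)
{
	static const unsigned int samples[] = { 73, 72, 54, 33 };
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("VID %3u -> %.4f V\n",
			samples[i], vid_to_vcore(samples[i]));
}
```

VIDs 73 and 33 land on 0.7 V and 1.2 V, the ends of the expected range reported from the datasheet earlier in the thread.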

cyring avatar Apr 18 '22 03:04 cyring