Same CTranslate2 model, same inputs, but different outputs
Hi, I have converted a fine-tuned OpenNMT model to CTranslate2. The problem is that I am getting different translations for the same input on different machines.
sample input sentence: Sportsman Jhonathan Florez jumpe from helicopter above Bogota, the capital of Colombia, on Thursday.
Tokenized input on machine 1:
[['\u2581Sports', 'man', '\u2581J', 'hon', 'athan', '\u2581Fl', 'orez', '\u2581jum', 'pe', '\u2581from', '\u2581helicopter', '\u2581above', '\u2581Bog', 'ota', ',', '\u2581the', '\u2581capital', '\u2581of', '\u2581Colombia', ',', '\u2581on', '\u2581Thursday', '.']]
The output is:
['\u2581Le', '\u2581sportif', '\u2581J', 'hon', 'athan', '\u2581Fl', 'orez', '\u2581s', 'aut\u00e9', '\u2581d', "'un", '\u2581h', '\u00e9licopt\u00e8re', '\u2581au', '-', 'dessus', '\u2581de', '\u2581Bog', 'ota', ',', '\u2581la', '\u2581capitale', '\u2581de', '\u2581la', '\u2581Colombie', ',', '\u2581jeudi', '.']
Tokenized input on machine 2:
[['\u2581Sports', 'man', '\u2581J', 'hon', 'athan', '\u2581Fl', 'orez', '\u2581jum', 'pe', '\u2581from', '\u2581helicopter', '\u2581above', '\u2581Bog', 'ota', ',', '\u2581the', '\u2581capital', '\u2581of', '\u2581Colombia', ',', '\u2581on', '\u2581Thursday', '.']]
The output is:
['\u2581Jeudi', '\u2581dernier', '.' ]
Machine 1 specs:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Stepping: 4
CPU MHz: 1221.118
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts spec_ctrl intel_stibp flush_l1d
Machine 2 specs:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping: 4
CPU MHz: 2294.608
BogoMIPS: 4589.21
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 pku ospke md_clear spec_ctrl intel_stibp
Can I know why there is a difference in the results returned by the model?
Can you first check that the models are effectively the same? For example, you can compare the MD5 checksum of each file:
md5sum /path/to/model/*
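If `md5sum` is not available, the same comparison can be scripted in Python with the standard library (a minimal sketch; `/path/to/model` is a placeholder for the actual model directory):

```python
import hashlib
import pathlib

def md5_of(path):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hash every file in the converted model directory, then diff the
# two listings between machines.
model_dir = pathlib.Path("/path/to/model")
if model_dir.is_dir():
    for p in sorted(model_dir.iterdir()):
        print(md5_of(p), p.name)
```

If any digest differs between the two machines, the models are not the same and that alone would explain the different outputs.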
If the model is the same, can you try translating on machine 2 with the following environment variables and see if it makes a difference:
CT2_FORCE_CPU_ISA=AVX MKL_CBWR=AVX python3 ...
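Note that both variables must be in the environment before CTranslate2 (and the Intel MKL it links against) is loaded. Exporting them inline as above works; equivalently, they can be set at the very top of a Python script, before any related import (a minimal sketch):

```python
import os

# These must be set before ctranslate2 is imported, otherwise they
# are read too late to take effect.
os.environ["CT2_FORCE_CPU_ISA"] = "AVX"  # restrict CTranslate2's own kernels to AVX
os.environ["MKL_CBWR"] = "AVX"           # restrict Intel MKL's dispatching to AVX

# import ctranslate2  # import only after the environment is prepared
```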
Hi @guillaumekln, the suggested solution has fixed the issue. But can I know what the root cause was and how the suggested solution fixes it?
It seems the model is producing wrong results with AVX512 instructions, which are used by machine 2. The proposed environment variables force the execution to use AVX, as used by machine 1. This is not really a solution since AVX is slower than AVX512.
Can you check each flag separately and report which one is fixing the output:
MKL_CBWR=AVX python3 ...
CT2_FORCE_CPU_ISA=AVX python3 ...
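This check can be automated by running the translation script once per flag and comparing the outputs (a sketch; `translate.py` is a placeholder for the actual script):

```python
import os
import subprocess
import sys

SCRIPT = "translate.py"  # placeholder for the actual translation script

# Run once with each flag alone, plus a baseline with no flags.
for extra in ({"MKL_CBWR": "AVX"}, {"CT2_FORCE_CPU_ISA": "AVX"}, {}):
    env = {**os.environ, **extra}  # inherit the environment, add one flag
    print("Running with", extra or "no flags")
    subprocess.run([sys.executable, SCRIPT], env=env, check=False)
```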
Additional questions:
- What CTranslate2 version are you using?
- Did you enable a quantization mode? (int8, int16?)
Hi @guillaumekln, I get correct output only after setting both flags.
What CTranslate2 version are you using?
The CTranslate2 version I am using is 2.12.0.
Did you enable a quantization mode? (int8, int16?)
The model is int8 quantized model.
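For readers unfamiliar with int8 quantization: the weights are stored as 8-bit integers plus a floating-point scale, so small numeric differences between instruction sets can be amplified through this reduced-precision path. A toy illustration of symmetric per-row quantization (an assumption for illustration, not CTranslate2's exact scheme):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row int8 quantization: int8 values plus one scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int8(w)
print("max reconstruction error:", np.abs(dequantize(q, s) - w).max())
```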
Can I know whether it is necessary to set both flags? Otherwise, will all CTranslate2 models, quantized or not, produce incorrect output? (Is this the default behaviour of CTranslate2 models?)
The CTranslate2 version I am using is 2.12.0.
Please also check with the latest version. It should be the first thing to try before opening an issue.
only after setting both the flags am I getting proper output.
This seems unexpected. It means two different libraries (CTranslate2 and Intel MKL) are affected by the issue.
There may be a more general issue on your system. Maybe the KVM virtualization is causing issues with AVX2+? You should try other libraries such as PyTorch and check if there are similar incorrect results on machine 2.
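A quick, library-agnostic sanity check in that spirit (a NumPy sketch, not a PyTorch test): compare a float32 matrix product against a float64 reference. On a healthy machine the deviation is tiny; a broken SIMD code path typically produces large errors.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)

ref = a.astype(np.float64) @ b.astype(np.float64)  # high-precision reference
max_err = np.abs((a @ b).astype(np.float64) - ref).max()
print("max deviation from float64 reference:", max_err)
# On a healthy machine this stays far below 1e-2 for matrices of this size.
```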
Based on the provided information I don't think the issue is coming from CTranslate2. Feel free to reopen the issue if you can provide more information.