Okay, so I have a correction to make about the previous message, and some progress on the slowdown at the start of E1M2. First, the following claim is incorrect:
Quote:
In this setting, 40 ticks of profiling will have taken approximately 1.11 second and 0.77 second, respectively. However, the measurements only account for 499 ms (45%) and 319 ms (41%) of that duration. Therefore, the program spent most of its time not being profiled, which suggests that the actual bottleneck is somewhere else. At least, I'm confident that the lump loading (while possibly related) is not in itself the source of such an FPS drop.
This is because, being used to libprof in its original setting (i.e. with my custom kernel), I forgot that I only ported the timer-counting part and not the part that determines the clock speed. The profiler assumes Pϕ = 29.49 MHz, so results are scaled whenever the actual frequency is different. Note that the values are still proportional to real time, so comparisons between sections of the code, or between WADs, remain completely valid.
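To illustrate the bias, here is a minimal sketch of the kind of conversion involved. This is not libprof's actual code: the function name, the TMU prescaler, and the exact constant are assumptions for illustration.

Code:
/* Ticks are counted at the real Pphi/4 by the TMU, then converted to
   time using a hardcoded clock instead of the measured one. */
#include <stdint.h>

#define ASSUMED_PPHI_HZ 29491200u  /* the 29.49 MHz baked into my port */

static uint32_t prof_time_us(uint32_t ticks)
{
    /* Reported time = real time * (real Pphi / assumed Pphi), hence
       inflated under Ptune3 (higher Pphi) and deflated under Ptune2. */
    return (uint32_t)((uint64_t)ticks * 4 * 1000000 / ASSUMED_PPHI_HZ);
}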
I realized this after testing under overclock with Ptune3. Ptune3 increases Pϕ, so the delays were inflated and ended up totaling 1700 ms even though the profiled run completed in about 1100 ms (18 FPS). Ptune2, in addition to pushing performance further, does so with a lower Pϕ setting, which deflates the numbers instead. As a result I can get comically different numbers for the same performance in E1M1 (20 FPS with Ptune3, 21 FPS with Ptune2), which caused some confusion on my end. (So the black-magic-level numbers are not real time, even though I welcome the added boost.)
Now, in your original report you alternate between 18 FPS (1111 ms profiling time) and 26 FPS (769 ms profiling time), a 342 ms difference over 40 ticks. I've reproduced your setup by configuring Ptune2 as you indicated, and the Pϕ value in that situation is 13.36 MHz. We can therefore un-scale your measurements by a factor of 29.49/13.36 ≈ 2.21 to correct the libprof bias:
Quote:
18 FPS
DA:7ms GR:764ms DI:227ms LL:104ms ULL:0ms
Total: 1102 ms (99% of expected total time)
26 FPS
DA:0ms GR:476ms DI:227ms LL:0ms ULL:0ms
Total: 703 ms (91% of expected total time)
As you can see, there is really no room left for other activity. I don't profile all of the AI and physics, so a ~10% gap is nothing unexpected. The difference between the two GR values is 288 ms, which is close to the 342 ms difference between the 18 FPS and 26 FPS runs. The extra time in DA and LL is likely spent inside GR, so we can't simply add it on top, but at this point I think the remaining difference is small enough that all doubt is cleared: the extra loading is the source of the FPS drop.
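To spell out the un-scaling arithmetic (a throwaway sketch; the variable names are mine, only the frequencies and the reported totals come from our messages):

Code:
#include <stdio.h>

int main(void)
{
    const double assumed_pphi = 29.49; /* MHz, hardcoded in my libprof port */
    const double actual_pphi  = 13.36; /* MHz, measured under your Ptune2 setup */
    const double factor = assumed_pphi / actual_pphi; /* ~2.21 */

    /* Totals reported by the profiler in your original message (ms) */
    const double reported[2] = { 499, 319 };

    for(int i = 0; i < 2; i++)
        printf("reported %.0f ms -> real %.0f ms\n",
               reported[i], reported[i] * factor);

    /* Prints about 1101 ms and 704 ms, matching the corrected totals
       above within rounding. */
    return 0;
}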
Now what could be loaded during rendering? Geometry is loaded entirely before the level starts, which leaves only textures (I checked the code to make sure there was nothing else).
As far as I understand, the primary way textures are loaded during rendering is through the R_GetColumn() function, which simply returns one column of a texture. This function loads the texture into memory (if it isn't already loaded) with purge level PU_CACHE, meaning the loaded texture is treated as temporary data and Z_Malloc is allowed to reuse the space immediately whenever it needs it. So it's possible that, through random and/or complex interactions on the heap, textures get evicted to satisfy other allocation requests and then need to be reloaded. Gathering statistics about this would be hard, though...
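For reference, the vanilla function looks roughly like this (simplified from memory of r_data.c; the port's version may differ slightly):

Code:
/* Returns a pointer to one column of texture data. */
byte *R_GetColumn(int tex, int col)
{
    int lump, ofs;

    col &= texturewidthmask[tex];
    lump = texturecolumnlump[tex][col];
    ofs  = texturecolumnofs[tex][col];

    /* Single-patch case: the patch lump is cached with PU_CACHE, so
       Z_Malloc is free to evict it on any later allocation, forcing a
       reload from the WAD the next time the column is needed. */
    if(lump > 0)
        return (byte *)W_CacheLumpNum(lump, PU_CACHE) + ofs;

    /* Multi-patch case: generate the composite texture on demand;
       it is also cached, and thus also evictable. */
    if(!texturecomposite[tex])
        R_GenerateComposite(tex);

    return texturecomposite[tex] + ofs;
}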
Also, because you trusted unaligned lumps during the test, the textures being loaded are clearly fragmented, which makes the bug dependent on how the WAD file is laid out in storage memory (and therefore difficult to reproduce).
I'm tempted to leave the problem as it is for now, because I'm fairly sure there's no easy way to address it. I'll spend some time looking at split textures instead; if, by a stroke of luck and genius, they can be implemented, this bug would disappear at the same time, likely with performance improvements too. It's a high-reward, low-probability-of-success kind of thing, but who knows. ^^