I'm back for another ride; hopefully I can finish this (offering decent support for all the WADs) quickly!
This post and update are the result of a lot of performance investigation, especially into differences between WADs. There is news, and it's good. This time, I've decided to leave more details, for your enjoyment and mostly in case someone looks at this code after me. So it's a pretty long post!
Changes in this update
Let's start with this. First, I've fixed odd structure issues with the repository and build system, and changed the final file name to CGDoom.g3a. Please make sure to delete the old CG_Doom.g3a file before using the new one to avoid any confusion.
MPoupe's version of CGDoom had an "emulator", which is a Windows target for the game that I believe he used to debug the file mapping mechanism. I could never build it, so it deteriorated with my changes and would be unusable now. I've started to rewrite it using the SDL2 API, which would help get CGDoom running on PC again (and mine in particular x3). There is no longer a need to debug the file mapping mechanism, but it could be useful for statistics on the WAD since DOOM itself can act as a programmable WAD inspector. It doesn't work yet, but a large chunk of the port is done.
As a side note, I've tried using SLADE, but it keeps crashing for some really inconsequential reasons like not finding files about the UI layout. I've tried to troubleshoot it but I keep getting fresh errors, so I guess I'll leave it aside for now.
I've also added more developer stuff:
• A "Trust unaligned lumps" option, the purpose of which is explained below.
• Another developer screen when leaving, which reports unaligned lumps.
• The free memory key now reports memory by region and indicates total size.
• A profiler key on [)] to report detailed performance measurements (see below).
The full story about unaligned lumps
This post mentions unaligned lumps quite frequently since these were one of my specific targets. I'm not sure how familiar you are with these low-level concepts, so I'll explain what these are in detail; maybe I can clear up some uncertainty in the process.
When accessing memory, the CPU can request either 1, 2, or 4 consecutive bytes. This is commonly used to load variables that occupy 1 byte (chars and bytes), 2 bytes (shorts) or 4 bytes (ints, fixed, pointers). However, a 2-byte access at an address can only be performed if the address is a multiple of 2, and a 4-byte access can only be performed if it's a multiple of 4. For instance, at 0x8c000002 you can access 1 byte or 2 bytes but not 4 bytes.
This is called an alignment requirement. We say that 0x8c000002 is 2-aligned but not 4-aligned, which is often referred to as 2-aligned for short (implying this is the largest alignment for that address).
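To make the rule concrete, here is a minimal sketch in C. The function names are made up for illustration and are not part of CGDoom; the only thing it assumes is the rule stated above (a 2-byte access needs a multiple of 2, a 4-byte access a multiple of 4).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest access size (1, 2 or 4 bytes) that the
   alignment rule allows at this address. */
int max_access_size(uint32_t address)
{
    if ((address & 3) == 0)
        return 4; /* multiple of 4: 1-, 2- and 4-byte accesses all work */
    if ((address & 1) == 0)
        return 2; /* multiple of 2 only: 1- and 2-byte accesses work */
    return 1;     /* odd address: only 1-byte accesses work */
}

/* Hypothetical helper: whether an access of `size` bytes is allowed here. */
int access_is_aligned(uint32_t address, int size)
{
    assert(size == 1 || size == 2 || size == 4);
    return (address % (uint32_t)size) == 0;
}
```

With the example address from above, max_access_size(0x8c000002) gives 2, and access_is_aligned(0x8c000002, 4) gives 0.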
CGDoom, like any program, uses different access sizes to load variables from lumps. In some cases, like textures, CGDoom either only uses 1-byte accesses or it uses functions that don't care about alignment (like memcpy), so the lump can be at any address, even unaligned.
However, some lumps like line definitions or nodes contain shorts or ints and CGDoom uses 2-byte or 4-byte accesses when using them. This requires that every single short or int in the lump is properly aligned in memory or a System ERROR occurs. Fortunately, the C language is well-designed, and this requirement will automatically be met as long as the lump itself starts on a suitably aligned address (which means 4-aligned if any 4-byte accesses are used in the lump, or 2-aligned otherwise).
The original version of DOOM loads lumps to RAM with Z_Malloc. malloc-type functions are also well-designed and therefore always allocate at addresses with maximum alignment (4-alignment in our case), so loading any lump with this method guarantees that accesses will succeed.
However, CGDoom contains an optimization to not load lumps to RAM when they are stored in the filesystem in a single fragment of the WAD file. This is because on the fx-CG, the filesystem can be accessed with a pointer like any other memory, so unless the lump is fragmented by the filesystem it will use less RAM to address it directly in ROM. In this situation, the alignment of the lump in memory is determined by its position within the file. For the lump to be 2-aligned, its position within the file must be a multiple of 2. And for it to be 4-aligned, its position in the file has to be a multiple of 4. Otherwise, you get a System ERROR when performing the access.
You might remember that at some point E1M4 would crash upon loading; this was why. Currently I load unaligned lumps to RAM with Z_Malloc, which takes both a bit of memory and a bit of time, but avoids this issue.
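The decision described in this section can be sketched roughly as follows. This is not CGDoom's actual code: the enum, the function name and its parameters are all hypothetical, and it assumes (as stated above) that the lump's alignment in memory matches the alignment of its offset in the WAD file.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum lump_access { LUMP_USE_ROM_POINTER, LUMP_COPY_TO_RAM };

/* Decide how to access a lump, given its byte offset in the WAD file and
   the number of filesystem fragments it spans. */
enum lump_access choose_lump_access(uint32_t file_offset, int fragment_count)
{
    assert(fragment_count >= 1);

    /* A fragmented lump is not contiguous in ROM, so it must be copied. */
    if (fragment_count > 1)
        return LUMP_COPY_TO_RAM;

    /* We cannot tell which access sizes the game will use on this lump, so
       anything that is not 4-aligned is copied to a Z_Malloc-style buffer,
       which is always 4-aligned. */
    if (file_offset % 4 != 0)
        return LUMP_COPY_TO_RAM;

    /* Contiguous and 4-aligned: point straight into ROM and save the RAM. */
    return LUMP_USE_ROM_POINTER;
}
```

The conservative part is the second test: it copies every unaligned lump, including textures that would have worked fine in place.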
Porting libprof to CGDoom
When optimizing my programs for performance I usually use a small library of mine called libprof, which is incredibly handy. It measures execution time with precision below 1 µs and can also be used as a profiler. So far I'd used the RTC to measure file mapping time (which is in the seconds) and FPS, but that wouldn't cut it for true performance analysis.
So I added my libprof code to the repository. With libprof I can determine how much time is spent allocating memory, loading lumps, rendering frames, sending frames to the display... I added counters for these very things. Pressing the [)] key runs the profiler for 40 frames (frameskip included) then shows the results as a player message like this:
DA:53ms GR:1537ms DI:459ms LL:111ms ULL:0ms
DA is Dynamic Allocation (time spent in Z_Malloc and Z_Free)
GR is Graphics Rendering (rendering the 3D view)
DI is Display Interface (sending the rendered frame to the display)
LL is Lump Loading (copies from ROM, non-copied lumps take virtually no time)
ULL is Unaligned Lump Loading (subset of LL for unaligned lumps)
Barring other yet-unknown sources of performance drops (for which I can add counters later), whenever we observe suspicious performance we can now run the profiler and see if there is a culprit. As you will see in a moment, this has already helped me find discrepancies in Hangar.
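For readers who want a feel for how such counters work, here is a minimal sketch. It is not libprof's API: the struct and the three functions are hypothetical, and the timestamps are passed in by the caller (in microseconds) instead of being read from a hardware timer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical counter: accumulates the time spent between enter/leave
   pairs over many calls. */
struct perf_counter {
    const char *label;  /* short name shown in the report, like "DA" */
    uint32_t start_us;  /* timestamp of the last perf_enter() */
    uint32_t total_us;  /* accumulated time in microseconds */
};

void perf_enter(struct perf_counter *c, uint32_t now_us)
{
    c->start_us = now_us;
}

void perf_leave(struct perf_counter *c, uint32_t now_us)
{
    c->total_us += now_us - c->start_us;
}

/* Format the counters as "DA:53ms GR:1537ms ...". Returns the number of
   characters written, or -1 if the buffer is too small. */
int perf_report(char *out, size_t size,
                const struct perf_counter *counters, int count)
{
    size_t used = 0;
    for (int i = 0; i < count; i++) {
        int n = snprintf(out + used, size - used, "%s%s:%lums",
                         i ? " " : "", counters[i].label,
                         (unsigned long)(counters[i].total_us / 1000));
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

The idea is to wrap each function of interest (allocation, lump loading, rendering, display transfer) in an enter/leave pair, let the totals accumulate over the 40 profiled frames, then print one line.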
Consistent differences in heap consumption
Quote:
I have 314 fragments when starting the WAD. In E2M7 I have 145 KB free, in E4M9 133 KB (for some reason always ~44KB less than you :/). I can play E2M7 but I still get a Z_Malloc failure and a Z_ChangeTag error. E4M9 gave me shortly a Z_Malloc failure but I could continue playing after that.
Hmm, this is strange. One would think you have unlucky fragmentation cutting into large lumps and forcing them to be loaded to RAM. But sometimes I have more than 1000 fragments (!!), and I still have exactly 44 more kB of free RAM than you (give or take 1 kB). There must be something subtler. I've changed the "Free" message to show consumption per heap region, just in case the heap happens to be smaller on your side. In Ultimate Doom, pausing straight after loading into E2M7 gives me the following (with developer information enabled):
Quote:
Fragments: 1143
Free: 1/422 kB, 20/249 kB, 161/162 kB
Unaligned lumps: 0 (0 B)
In Ultimate Doom E4M9, with the same setup:
Quote:
Fragments: 1143
Free: 2/422 kB, 7/249 kB, 161/162 kB
Unaligned lumps: 2 (12708 B)
Could you please try this setup to compare?
Performance bottlenecks: saturated regions destroying the heap
Quote:
This is mostly true but sometimes the FPS drops into the single digits like when I open the first door in Hangar the FPS goes down to 5 but goes up again after a short while. This is what I meant with the stutters. Always here and then FPS are dropping into the single digits. Also the Ultimate Doom WAD always seems to be slower. In E1M3 I get 8 FPS with no enemies on screen and in the shareware WAD 15 FPS even though it doesn't need to render more. Something really seems to affect the FPS in the Ultimate Doom WAD.
Ok so the matter here is pretty complicated. First, Hangar. There is a lot of geometry lying after that door, that's why it drops. It does not "go up again after a short while", but stays consistent in the 4-6 FPS realm as long as I look at that geometry. There was a difference between the two versions at hand though, shareware had 6 FPS while Ultimate Doom dropped to 4 FPS.
E1M2 was even clearer: I had 9 FPS in the shareware, while only 4 in Ultimate Doom. I noticed that Ultimate Doom had exhausted the first heap region. This prompted suspicion because the design of the allocator (called "next fit") meant that the whole first region had to be traversed and fail for every single allocation, which is a huge cost. (This is what prompted me to port libprof.)
You can see the difference for yourself in the profiling results; look at DA in particular.
Shareware Hangar: 6 FPS
DA:27ms GR:2424ms DI:459ms LL:280ms ULL:0ms
Shareware Nuclear Plant: 8 FPS
DA:53ms GR:1537ms DI:459ms LL:111ms ULL:0ms
Ultimate Doom Hangar: 4 FPS
DA:1970ms GR:3917ms DI:459ms LL:2276ms ULL:2276ms
Ultimate Doom Nuclear Plant: 4 FPS
DA:1605ms GR:3536ms DI:459ms LL:1975ms ULL:1975ms
On top of that, as you can see, unaligned lumps being loaded left and right resulted in a lot of overhead (they probably caused most of the DA calls). Note that lumps are loaded during rendering, so it's normal that GR went up that much. The developer statistics when closing the game show that anywhere from 20 MB to 200 MB of unaligned lumps are loaded depending on how long you play, which we know is not needed in Hangar and Nuclear Plant because they didn't crash before I fixed E1M4.
So I started by improving the dynamic allocator to avoid spending so long finding free blocks (basically extending the next fit paradigm over zones). The new results were as follows:
Ultimate Doom Hangar: 7 FPS
DA:11ms GR:1865ms DI:459ms LL:486ms ULL:485ms
Ultimate Doom Nuclear Plant: 7 FPS
DA:10ms GR:1954ms DI:459ms LL:526ms ULL:526ms
So I think that solves a nice bit of the discrepancy while noticeably improving performance in levels with lots of data loaded.
Note that if I "trust unaligned lumps" (detailed below) I get up to 10 FPS due to the reduced cost on ULL. This will be an objective.
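To illustrate what "extending next fit over zones" can mean, here is a heavily simplified sketch. It is my own toy model, not the allocator that ships in CGDoom: each region is reduced to a single free-byte count (the real zones hold lists of blocks), and all names are hypothetical. The point is only that the search starts at the region that last satisfied a request, instead of re-scanning a saturated first region every time.

```c
#include <assert.h>
#include <stddef.h>

#define REGION_COUNT 3

/* Toy heap: one free-byte count per region plus a rover. */
struct heap {
    size_t free_bytes[REGION_COUNT];
    int rover;            /* region that satisfied the last allocation */
    unsigned long probes; /* number of regions examined so far */
};

/* Returns the index of the region used, or -1 if no region has room. */
int heap_alloc(struct heap *h, size_t size)
{
    assert(h->rover >= 0 && h->rover < REGION_COUNT);

    for (int i = 0; i < REGION_COUNT; i++) {
        /* Start at the rover and wrap around, like next fit does with
           blocks inside a single zone. */
        int r = (h->rover + i) % REGION_COUNT;
        h->probes++;
        if (h->free_bytes[r] >= size) {
            h->free_bytes[r] -= size;
            h->rover = r;
            return r;
        }
    }
    return -1;
}
```

In the real allocator, every probe of a full region means walking its whole block list, which is where the DA time went. With free sizes like the 1/422 kB, 20/249 kB, 161/162 kB reported earlier, the first region is skipped once and then left alone until the rover wraps around.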
Performance bottlenecks: uselessly loading unaligned lumps
Quote:
Seems to be that it's affecting the position of other lumps. In the shareware WAD audio is from index 109 to 231. If these are getting removed every lump that came after the audio before has a new index and therefore a new position. Not sure how this affects the performance but it's around 5 FPS slower.
Thank you. I meant the exact position in bytes, specifically its alignment. I hope the explanation at the top of this post clears that up. The concern is about moving lumps by 1 to 3 bytes within the file.
As you've seen before, WADs don't require lumps to be aligned, since the game mostly works even with unaligned lumps. I suspect the WAD editing software of mistakenly breaking alignment as a side effect of removing audio files or splitting episodes, but in any case for complete compatibility it's best if WADs with unaligned lumps are fully supported.
However, as you've seen, unaligned lumps have a certain cost that is hard to deal with because it's difficult to know when unaligned lumps will work in the program. In Hangar, they're textures, so we don't care that they're unaligned. But in Command Control some are part of the level definition and need to be aligned. We can't detect that, so we're doomed to either load all of them or fail on some of them.
There is one alternative though. MPoupe had taken steps to break down multi-byte accesses into single-byte accesses to lift the alignment requirement. If this approach can be completed, it would give maximum compatibility without needing to delve deep into the WAD format. However, it means that we need "complete coverage" of the code in order to ensure that all multi-byte accesses into lumps are broken down. I have a few tricks to help us get there with less work (unaligned C structures); we'll see if that's enough.
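Breaking a multi-byte access into single-byte accesses looks roughly like this. The function names are hypothetical and this is not the code from CGDoom; it relies on the fact that WAD files store their values in little-endian order, so the bytes are reassembled lowest byte first.

```c
#include <stdint.h>

/* Read a 16-bit little-endian value from any address, aligned or not,
   using only 1-byte accesses. */
int16_t read_le16(const void *ptr)
{
    const uint8_t *p = ptr;
    return (int16_t)(uint16_t)(p[0] | (p[1] << 8));
}

/* Same idea for a 32-bit little-endian value. */
int32_t read_le32(const void *ptr)
{
    const uint8_t *p = ptr;
    uint32_t v = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);
    return (int32_t)v;
}
```

This costs a few more instructions per read than a direct 2-byte or 4-byte load, but it never faults on an unaligned lump and needs no copy to RAM.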
I'd like to work towards that goal, so I've added an option called "Trust unaligned lumps" in the main menu which will skip the loading of non-fragmented unaligned lumps, so that we can play around and look for System ERRORs. If you have opportunities to keep testing, I'd like you to use this option and report the System ERRORs so that I can attempt to fix them one at a time.
I have fixed the bug that prevented E1M4 from loading its line definitions by splitting accesses as mentioned previously (which is much less costly than loading the lump). I also fixed one that caused issues in Ultimate Doom E1M9 and was related to nodes. As of now I've checked that every level of both shareware and Ultimate Doom could be entered (barring the chance that some unaligned lump that would cause problems was actually loaded to RAM while testing because it was fragmented).
Incorrect skull keys
Quote:
Also something still seems to be wrong with the keys. The blue skull key doesn't show up after picking it up.
I didn't know about these keys, thanks. I tracked it down to some commented-out code that was probably left here before me. I've fixed it; note that since the cheat key gives you all the keys, you now get all three skull keys when you cheat even if the level doesn't require them.