24.09 Release - Editor crashed randomly when leaving game mode [incorrect memory management] #18502

Open
lgleim opened this issue Nov 22, 2024 · 2 comments
Labels
feature/azcore: This item is related to the AZ core engine support libraries.
feature/editor: This item is related to the editor subsystem.
kind/bug-2409: Used for stabilization/24.09 issues.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
priority/critical: Critical priority. Must be actively worked on as someone's top priority right now.
sig/core: Categorizes an issue or PR as relevant to SIG Core.

Comments

@lgleim
Contributor

lgleim commented Nov 22, 2024

Bug Description

The editor segfaults randomly when exiting game mode.

Debugging in LLDB traces the crash to the line

    free_link* next = free->mNext;
which is reached via various code paths (memory allocation for a wide variety of components, not only the Lua memory hook shown in the screenshot below).

Assets-Required

Any o3de level

Steps to Reproduce

Repeatedly enter and leave game-mode from the editor in quick succession

Expected Behavior

No crash

Actual Behavior

The O3DE Editor segfaults (signal SIGSEGV: invalid address, fault address: 0x0) at random when exiting game mode. In our current project the Editor crashes on roughly 20% of game-mode exits, which makes this a significant obstacle to using O3DE productively.

Screenshots/Videos

Three exemplary code paths
[3 screenshots, not reproduced here]

Found In Branch

main / 24.09.1 release

Commit ID From

c602b49

Desktop/Device

Ubuntu 22.04 LTS
64GB RAM, i9, RTX 3080 Mobile

Additional Context

Debugging in LLDB traces the crash to the line

    free_link* next = free->mNext;
which is reached via various code paths.

The page struct p passed as the argument to the alloc function

    template<bool DebugAllocatorEnable>
    void* HphaSchemaBase<DebugAllocatorEnable>::HpAllocator::bucket::alloc(page* p)
    {
        // get an element from the free list
        HPPA_ASSERT(p && p->mFreeList);
        p->inc_ref();
        free_link* free = p->mFreeList;
        free_link* next = free->mNext;
        p->mFreeList = next;
        if (!next)
        {
            // if full, auto sort to back
            mPageList.erase(*p);
            mPageList.push_back(*p);
        }
        return (void*)free;
    }

does not have a valid mFreeList member.
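
To make the failure mode concrete, here is a minimal sketch of a free-list pop that checks the head before dereferencing it. Only the names free_link, mNext and mFreeList come from the snippet above; toy_page, checked_pop and link_slots are hypothetical helpers invented for this illustration, not O3DE code. In this toy, mNext sits at offset 0, so an unchecked read of head->mNext with a null head would touch address 0x0, which matches the reported fault address.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the allocator's types. Only the names
// free_link, mNext and mFreeList are taken from the report.
struct free_link
{
    free_link* mNext;
};

struct toy_page
{
    free_link* mFreeList = nullptr; // head of the page's free list
};

// Pops the head of the free list, or returns nullptr when the list is empty.
// An unchecked pop would read head->mNext here even when head is null.
inline void* checked_pop(toy_page& p)
{
    free_link* head = p.mFreeList;
    if (!head)
    {
        return nullptr; // caller has to grow the bucket or report the failure
    }
    p.mFreeList = head->mNext;
    return head;
}

// Threads `count` slots into a singly linked free list and installs it in the page.
inline void link_slots(toy_page& p, free_link* slots, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        slots[i].mNext = (i + 1 < count) ? &slots[i + 1] : nullptr;
    }
    p.mFreeList = count ? &slots[0] : nullptr;
}
```

A check like this would not fix the underlying problem, but it turns a null head into an observable nullptr return that can be logged or asserted on in release builds.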

The call to alloc comes from

    AllocateAddress HphaSchemaBase<DebugAllocatorEnable>::HpAllocator::bucket_alloc_direct(unsigned bi)

In that function,

    page* p = mBuckets[bi].get_free_page();

must have returned nullptr, because get_free_page() explicitly checks that p->mFreeList is set. So p must have been created by

    p = bucket_grow(bsize, mBuckets[bi].marker());

which ends with

    return new (mem) page((unsigned short)elemSize, m_poolPageSize, marker);
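
The call sequence described above (look for a page with free elements, grow the bucket if there is none, then take an element) can be sketched as a toy model. Everything here is an assumption made for illustration: counted_page and toy_bucket are hypothetical types that only track a count of free elements, and the single std::mutex is not how the engine's locking is organised. The sketch only shows the whole lookup, grow and take sequence running under one lock, so no other thread can drain the page between the lookup and the take.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Toy model only: far simpler than the HphaSchema types. It mirrors the
// order of the calls quoted above, not their implementation.
struct counted_page
{
    explicit counted_page(std::size_t capacity) : mFree(capacity) {}
    std::size_t mFree; // number of unused elements left in this page
};

class toy_bucket
{
public:
    explicit toy_bucket(std::size_t elemsPerPage) : mElemsPerPage(elemsPerPage) {}

    // Finds a page with free elements, grows the bucket if there is none,
    // and takes one element, all while holding the same lock.
    counted_page* alloc_direct()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        counted_page* p = get_free_page();
        if (!p)
        {
            p = grow();
        }
        if (p->mFree == 0)
        {
            return nullptr; // the freshly grown page holds no elements
        }
        --p->mFree;
        return p;
    }

    std::size_t page_count() const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mPages.size();
    }

private:
    counted_page* get_free_page()
    {
        for (auto& page : mPages)
        {
            if (page->mFree > 0)
            {
                return page.get();
            }
        }
        return nullptr;
    }

    counted_page* grow()
    {
        mPages.push_back(std::make_unique<counted_page>(mElemsPerPage));
        return mPages.back().get();
    }

    std::size_t mElemsPerPage;
    mutable std::mutex mMutex;
    std::vector<std::unique_ptr<counted_page>> mPages;
};
```

With two elements per page, the first two calls to alloc_direct() return the same page and the third call grows the bucket to a second page.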

In turn, this means that the page constructor

    struct page
        : public block_header_proxy /* must be first */
        , public AZStd::list_base_hook<page>::node_type
    {
        page(size_t elemSize, size_t pageSize, size_t marker)
            : mBucketIndex((unsigned short)bucket_spacing_function_aligned(elemSize))
            , mUseCount(0)
        {
            mMarker = marker ^ ((size_t)this);
            // build the free list inside the new page
            // the page info sits at the front of the page
            size_t numElements = (pageSize - sizeof(page)) / elemSize;
            char* endMem = (char*)this + pageSize;
            char* currentMem = endMem - numElements * elemSize;
            char* nextMem = currentMem + elemSize;
            mFreeList = (free_link*)currentMem;
            while (nextMem != endMem)
            {
                ((free_link*)currentMem)->mNext = (free_link*)(nextMem);
                currentMem = nextMem;
                nextMem += elemSize;
            }
            ((free_link*)currentMem)->mNext = nullptr;
        }

returned a page whose free-list head, set by

    mFreeList = (free_link*)currentMem;

is invalid, which means that the address computed by

    char* currentMem = endMem - numElements * elemSize;

is itself an invalid memory address.
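
The free-list construction in the constructor can be checked in isolation. The following is a minimal sketch under stated assumptions: build_free_list and count_links are hypothetical helpers, headerSize stands in for sizeof(page), and the early returns are guards added for this sketch that the quoted constructor does not have. In the quoted arithmetic, numElements of 0 would make currentMem equal to endMem, so the head would point one past the page. Whether that case can occur in O3DE is not established here.

```cpp
#include <cstddef>
#include <new>

struct free_link
{
    free_link* mNext;
};

// Builds a free list in the tail of a page that starts at `pageStart`, spans
// `pageSize` bytes and reserves `headerSize` bytes at the front.
// Precondition: every slot address is suitably aligned for free_link.
// Returns nullptr when no element fits (a guard added for this sketch).
inline free_link* build_free_list(char* pageStart, std::size_t headerSize,
                                  std::size_t pageSize, std::size_t elemSize)
{
    if (elemSize < sizeof(free_link) || pageSize <= headerSize)
    {
        return nullptr;
    }
    const std::size_t numElements = (pageSize - headerSize) / elemSize;
    if (numElements == 0)
    {
        return nullptr; // without this guard the head would equal endMem
    }
    char* endMem = pageStart + pageSize;
    char* currentMem = endMem - numElements * elemSize;
    free_link* head = nullptr;
    free_link* prev = nullptr;
    for (std::size_t i = 0; i < numElements; ++i, currentMem += elemSize)
    {
        // Start the lifetime of a link in each slot, then chain it to the previous one.
        free_link* link = new (currentMem) free_link{ nullptr };
        if (prev)
        {
            prev->mNext = link;
        }
        else
        {
            head = link;
        }
        prev = link;
    }
    return head;
}

// Walks the list and returns the number of links; useful as a sanity check.
inline std::size_t count_links(const free_link* head)
{
    std::size_t n = 0;
    for (; head; head = head->mNext)
    {
        ++n;
    }
    return n;
}
```

Comparing count_links(head) against (pageSize - headerSize) / elemSize right after construction is one inexpensive way to tell whether a page was built incorrectly or was corrupted later.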

Ultimately, it appears that there is some logic error or race condition somewhere in the memory allocation procedure.

@lgleim added the labels feature/azcore, feature/editor, feature/scripting, needs-triage, sig/core, priority/critical and kind/bug-2409 on Nov 22, 2024
@lgleim removed the feature/scripting label on Nov 22, 2024
@lgleim
Contributor Author

lgleim commented Nov 22, 2024

Digging deeper into this, I already tried undefining the USE_MUTEX_PER_BUCKET optimization, but that quickly leads to Editor deadlocks at runtime, so it does not appear to be an option at this time.

@mbalfour-amzn @lemonade-dm @spham-amzn Any chance we could get your expertise on this bug report? If not to investigate the issue, then for any insights on how to debug it further?

@spham-amzn
Contributor

That's some very good debugging. I'll give it a try on my Ubuntu machine. I wonder whether we should make hpAllocatorStructureSize a platform-specific value and bump it up?
