
[rqd] Fix non ASCII chars #1335

Conversation

ramonfigueiredo
Collaborator

Convert to ASCII while discarding characters that can not be encoded


ramonfigueiredo commented Jan 17, 2024

Problem

Previously, if the RQD logs contained non-ASCII text, RQD would crash, causing CueGUI to show the process as running indefinitely.

Error

UnicodeEncodeError: 'ascii' codec can't encode character u'\u221e' in position 64: ordinal not in range(128)

The problem occurs in the pipe_to_file function when trying to output non-ASCII characters to the file.

Error happens in the code below:

def pipe_to_file(stdout, stderr, outfile):
    ...

    def print_and_flush_ln(fd, last_timestamp):
        ...
        for line in lines[0:-1]:
            print("[%s] %s" % (curr_line_timestamp, line), file=outfile)

Solution

BEFORE

for line in lines[0:-1]:
    print("[%s] %s" % (curr_line_timestamp, line), file=outfile)

NEW SOLUTION

for line in lines[0:-1]:
    # Convert to ASCII while discarding characters that can not be encoded
    line = line.encode('ascii', 'ignore')
    print("[%s] %s" % (curr_line_timestamp, line), file=outfile)
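As a minimal, self-contained illustration (the file path and timestamp below are made up, not from the RQD code), the sketch reproduces the crash against an ASCII-encoded log file and shows how the 'ignore' handler avoids it:

```python
# Standalone sketch (hypothetical path and timestamp): writing a
# non-ASCII line to an ASCII-encoded log crashes; encoding with
# 'ignore' first drops the offending characters instead.
import io
import os
import tempfile

line = 'render time: \u221e seconds'  # contains U+221E, the infinity sign
path = os.path.join(tempfile.mkdtemp(), 'rqd.log')

with io.open(path, 'w', encoding='ascii') as outfile:
    try:
        print("[12:34:56] %s" % line, file=outfile)
    except UnicodeEncodeError as exc:
        print('without the fix:', exc)

    # The fix: discard characters that cannot be encoded.
    safe = line.encode('ascii', 'ignore').decode('ascii')
    print("[12:34:56] %s" % safe, file=outfile)

with io.open(path, encoding='ascii') as f:
    print(f.read())  # [12:34:56] render time:  seconds
```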

UPDATED LOGIC

The updated logic below ensures consistent handling of non-ASCII characters in logs, whether RQD_PREPEND_TIMESTAMP = True or RQD_PREPEND_TIMESTAMP = False.

The change always encodes lines to ASCII with the 'ignore' option to discard non-ASCII characters. I also removed the unused file_descriptor block and ensured consistent encoding logic.

This change prevents UnicodeEncodeError and ensures consistent log outputs.

if rqd.rqconstants.RQD_PREPEND_TIMESTAMP:
    pipe_to_file(frameInfo.forkedCommand.stdout, frameInfo.forkedCommand.stderr, self.rqlog)
else:
    with open(self.rqlog, 'a') as f:
        # Convert to ASCII while discarding characters that can not be encoded
        for line in frameInfo.forkedCommand.stdout:
            line = line.encode('ascii', 'ignore')
            f.write(line.decode('ascii') + '\n')
        for line in frameInfo.forkedCommand.stderr:
            line = line.encode('ascii', 'ignore')
            f.write(line.decode('ascii') + '\n')

About the bug fix

Now RQD will ignore non-ASCII characters in the logs and work correctly.

For example:

If the log is:
'text here 영화, café'

It will be:
'text here , caf'

It ignores the Korean characters '영화' and the accented 'é'.

Tests

Manual Test


>>> text = 'text here 영화, café'

>>> text.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)

>>> text.encode('ascii', 'ignore')
b'text here , caf'

@ramonfigueiredo
Collaborator Author

Other solutions

Replace non-ASCII character with '?'


>>> text = 'text here 영화, café'

Current solution

>>> text.encode('ascii', 'ignore')
b'text here , caf'

Solution using .encode('ascii', 'replace')

>>> text.encode('ascii', 'replace')
b'text here ??, caf?'
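For completeness (this comparison is illustrative and not part of the PR), Python's codec error handlers offer a few more options besides 'ignore' and 'replace':

```python
# Illustrative comparison of str.encode error handlers on the same text.
text = 'text here \uc601\ud654, caf\u00e9'  # 'text here 영화, café'

print(text.encode('ascii', 'ignore'))            # b'text here , caf'
print(text.encode('ascii', 'replace'))           # b'text here ??, caf?'
print(text.encode('ascii', 'backslashreplace'))  # b'text here \uc601\ud654, caf\xe9'
print(text.encode('ascii', 'xmlcharrefreplace')) # b'text here &#50689;&#54868;, caf&#233;'
```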

Encode with UTF-8, instead of ASCII

It is also possible to keep UTF-8, using encode() and decode():

UTF-8 example

>>> text = 'text here 영화, café'
>>> text_encoded = text.encode('utf-8', 'ignore')
>>> text_encoded.decode('utf-8')
'text here 영화, café'

ASCII example

>>> text = 'text here 영화, café'
>>> text_encoded = text.encode('ascii', 'ignore')
>>> text_encoded.decode()
'text here , caf'
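A related option (sketch only; the file path below is made up) is to open the log file itself with a UTF-8 encoding, which preserves non-ASCII characters end to end:

```python
# Sketch: an explicitly UTF-8 log file keeps non-ASCII characters intact.
import io
import os
import tempfile

line = 'text here \uc601\ud654, caf\u00e9'
path = os.path.join(tempfile.mkdtemp(), 'rqd_utf8.log')

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(line + '\n')

with io.open(path, encoding='utf-8') as f:
    print(f.read())  # text here 영화, café
```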

- Convert to ASCII while discarding characters that can not be encoded
- Update sphinx version to 5.0.0 on docs/requirements.txt
- Convert to ASCII while discarding characters that can not be encoded
- Update sphinx version to 5.0.0 on docs/requirements.txt
- Change docs/conf.py to use language = 'en'
@@ -1219,6 +1219,8 @@ def print_and_flush_ln(fd, last_timestamp):

remainder = lines[-1]
for line in lines[0:-1]:
    # Convert to ASCII while discarding characters that can not be encoded
    line = line.encode('ascii', 'ignore')
Collaborator


@ramonfigueiredo Thank you for the PR and the detailed write up. The detail makes it nice and easy to review with the proper context.

So, this error occurs when we intercept output in order to prepend a timestamp. What is logged to the file when RQD_PREPEND_TIMESTAMP is False?

My sense is that the output which is logged when RQD_PREPEND_TIMESTAMP is True vs False should be as similar as possible, aside from the timestamp of course.

Collaborator Author


Dear @bcipriano ,

Sorry for the delay in answering your question. Thank you for your thorough review and the positive feedback on the PR and write-up.

Regarding your comment, the UnicodeEncodeError occurs when we intercept output to prepend a timestamp. When RQD_PREPEND_TIMESTAMP is False, the output is logged directly without any modification or encoding to ASCII, preserving the original characters, including any non-ASCII ones.

Here is how the output differs based on the value of RQD_PREPEND_TIMESTAMP:

  1. When RQD_PREPEND_TIMESTAMP is True:
  • We intercept the output to prepend a timestamp.
  • Non-ASCII characters are encoded to ASCII with the 'ignore' option, discarding any non-ASCII characters.
  • Example log: "[12:34:56] text here , caf" (non-ASCII characters are ignored).
  2. When RQD_PREPEND_TIMESTAMP is False:
  • The output is logged directly without any modification.
  • Non-ASCII characters are preserved.
  • Example log: "text here 영화, café" (non-ASCII characters are retained).
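The two behaviours in the list above can be sketched as follows (the helper names are mine, not from the RQD code):

```python
# Hypothetical helpers contrasting the two modes described above.
def log_with_timestamp(line, timestamp='12:34:56'):
    # RQD_PREPEND_TIMESTAMP = True: ASCII-encode, dropping non-ASCII characters.
    safe = line.encode('ascii', 'ignore').decode('ascii')
    return '[%s] %s' % (timestamp, safe)

def log_raw(line):
    # RQD_PREPEND_TIMESTAMP = False: the line is written unmodified.
    return line

line = 'text here \uc601\ud654, caf\u00e9'
print(log_with_timestamp(line))  # [12:34:56] text here , caf
print(log_raw(line))             # text here 영화, café
```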

To ensure the outputs are as similar as possible, aside from the timestamp, I can adjust our approach to handle encoding more gracefully. Instead of discarding non-ASCII characters, I can consider other strategies, such as escaping them or converting them to a specific placeholder.

However, given the current solution, the primary goal was to prevent crashes due to UnicodeEncodeError by ignoring non-ASCII characters when RQD_PREPEND_TIMESTAMP is True (default option). This approach was chosen for simplicity and robustness.

If maintaining non-ASCII characters is critical, I can explore additional strategies to handle encoding more gracefully without discarding valuable information. I am open to suggestions and further discussion on the best approach to balance robustness and information retention.

Thank you again for your review and insights. I look forward to your feedback.

Collaborator Author


Hi @bcipriano

FYI ...

The logic was updated to ensure consistent handling of non-ASCII characters in logs with RQD_PREPEND_TIMESTAMP = True or RQD_PREPEND_TIMESTAMP = False.

Removing changes to update Sphinx version
@DiegoTavares
Collaborator

@ramonfigueiredo is this ready for review again? I see one test failure, but it looks like the logs have expired

@ramonfigueiredo
Collaborator Author

@ramonfigueiredo is this ready for review again? I see one test failure, but it looks like the logs have expired

Hi @DiegoTavares ,

Yes, it is ready to review. I am waiting for the feedback from @bcipriano .

I fixed the checks/pipeline; master needed to be merged into the fix_non_ascii_chars branch.

- Always encode lines to ASCII with 'ignore' option to discard non-ASCII characters.
- Removed unused file_descriptor block and ensured consistent encoding logic.

This change prevents UnicodeEncodeError and ensures consistent log outputs.
@ramonfigueiredo ramonfigueiredo merged commit c56d8cb into AcademySoftwareFoundation:master Jul 10, 2024
12 checks passed
@@ -343,6 +339,16 @@ def runLinux(self):

if rqd.rqconstants.RQD_PREPEND_TIMESTAMP:
    pipe_to_file(frameInfo.forkedCommand.stdout, frameInfo.forkedCommand.stderr, self.rqlog)
else:
    with open(self.rqlog, 'a') as f:
Contributor

@lithorus lithorus Jul 16, 2024


It looks like this broke file logging. self.rqlog is a file object and not a string.
(When not using RQD_PREPEND_TIMESTAMP=True)
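The breakage lithorus describes can be reproduced in isolation (sketch only; the stand-in for self.rqlog is hypothetical): Python's open() expects a path, not an already-open file object.

```python
# Passing a file object where open() expects a path raises TypeError.
import tempfile

rqlog = tempfile.TemporaryFile(mode='w+')  # stands in for self.rqlog
try:
    with open(rqlog, 'a') as f:
        f.write('log line\n')
except TypeError as exc:
    print('TypeError:', exc)
rqlog.close()
```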

Collaborator Author


I'll take a look and fix that soon!

Thanks!

Contributor

@lithorus lithorus Jul 16, 2024


I should note that #1401 changes a lot of this logic anyway (if it gets merged).

Collaborator Author

@ramonfigueiredo ramonfigueiredo Jul 16, 2024


Hi @lithorus

FYI ...

I changed the code to prevent crashes due to TypeError and fixed the broken file logging. Thanks!

This is the new PR: #1417 . The PR has been merged into master!

Collaborator Author

@ramonfigueiredo ramonfigueiredo Jul 16, 2024


Hi @lithorus ,

Let me know when your PR #1401 is ready for review. For now, it is in draft and some checks/pipelines are failing.

It is certainly a nice feature, since Loki offers efficient, scalable log aggregation, cost savings, seamless Prometheus/Grafana integration, and strong community support. Thanks!

Contributor


Yes, I will write some examples and documentation on how it's used after #1416 is merged (some of the Loki stuff is not compatible with Python 2.x).
Also, given the recent Sentry support in Cuebot, it would be interesting to add support for that as well.

Collaborator Author


Great. Thanks!

n-jay pushed a commit to n-jay/OpenCue that referenced this pull request Jul 26, 2024
- Ensure consistent handling of non-ASCII characters in logs
- Always encode lines to ASCII with an 'ignore' option to discard non-ASCII characters.
- Removed unused file_descriptor block and ensured consistent encoding logic.

This change prevents UnicodeEncodeError and ensures consistent log outputs.
@ramonfigueiredo ramonfigueiredo deleted the fix_non_ascii_chars branch October 15, 2024 17:06
4 participants