skip to main content
research-article
Open access

Interval Parsing Grammars for File Format Parsing

Published: 06 June 2023 Publication History

Abstract

File formats specify how data is encoded for persistent storage. They cannot be formalized as context-free grammars since their specifications include context-sensitive patterns such as the random access pattern and the type-length-value pattern. We propose a new grammar mechanism called Interval Parsing Grammars IPGs) for file format specifications. An IPG attaches to every nonterminal/terminal an interval, which specifies the range of input the nonterminal/terminal consumes. By connecting intervals and attributes, the context-sensitive patterns in file formats can be well handled. In this paper, we formalize IPGs' syntax as well as its semantics, and its semantics naturally leads to a parser generator that generates a recursive-descent parser from an IPG. In general, IPGs are declarative, modular, and enable termination checking. We have used IPGs to specify a number of file formats including ZIP, ELF, GIF, PE, and part of PDF; we have also evaluated the performance of the generated parsers.

Supplementary Material

Auxiliary Archive (pldi23main-p263-p-archive.zip)
The appendix.

References

[1]
2008. Document managementPortable Document 493 FormatPart 1: PDF 1.7.
[2]
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA.
[3]
Godmar Back. 2002. Datascript – A specification and scripting language for binary data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2487, Springer Verlag, 66–77. isbn:3540442847 issn:16113349 https://doi.org/10.1007/3-540-45821-2_4
[4]
Julian Bangert and Nickolai Zeldovich. 2014. Nail: A practical tool for parsing and generating data formats. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). 615–628.
[5]
Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems. 337–340.
[6]
1995. Executable and Linking Format (ELF) Specification. Version 1.2.
[7]
Kathleen Fisher and Robert Gruber. 2005. PADS: a domain-specific language for processing ad hoc data. In ACM Conference on Programming Language Design and Implementation (PLDI). ACM Press, New York, NY, USA. 295–304.
[8]
Kathleen Fisher, Yitzhak Mandelbaum, and David Walker. 2006. The next 700 data description languages. In ACM Symposium on Principles of Programming Languages (POPL). ACM Press, New York, NY, USA. 2–15.
[9]
Bryan Ford. 2004. Parsing expression grammars: A recognition-based syntactic foundation. Conference Record of the Annual ACM Symposium on Principles of Programming Languages, 31 (2004), 111–122. isbn:158113729X issn:07308566
[10]
Bryan Ford. 2004. Parsing Expression Grammars: A Recognition-based Syntactic Foundation. In ACM Symposium on Principles of Programming Languages (POPL). 111–122.
[11]
Alexei Hmelnov Hmelnov and Andrei Mikhailov. 2019. Generation of Code for Reading Data from the Declarative File Format Specifications Written in Language FlexT. Proceedings - 2018 Ivannikov Isp Ras Open Conference, ISPRAS 2018, 23–30. isbn:9781728112756 https://doi.org/10.1109/ISPRAS.2018.00011
[12]
Suman Jana and Vitaly Shmatikov. 2012. Abusing file processing in malware detectors for fun and profit. In 2012 IEEE Symposium on Security and Privacy. 80–94.
[13]
Trevor Jim, Yitzhak Mandelbaum, and David Walker. 2010. Semantics and Algorithms for Data-dependent Grammars. 417–430.
[14]
Donald B. Johnson. 1975. Finding All the Elementary Circuits of a Directed Graph. SIAM J. Comput., 4, 1 (1975), 77–84.
[15]
2015. Kaitai Struct User Guide. https://doc.kaitai.io/user_guide.html
[16]
Donald E. Knuth. 1968. Semantics of context-free languages. Mathematical Systems Theory, 2, 2 (1968), 127–145.
[17]
Ashish Kumar, Bill Harris, and Gang Tan. 2023. DISV: Domain Independent Semantic Validation of Data Files. In 9th Workshop on Language-Theoretic Security (LangSec).
[18]
Zephyr S Lucas, Joanna Y Liu, Prashant Anantharaman, and Sean W Smith. 2021. Parsing PEGs with Length Fields in Software and Hardware. In 2021 IEEE Security and Privacy Workshops (SPW). 128–133.
[19]
Prashanth Mundkur, Linda Briesemeister, Natarajan Shankar, Prashant Anantharaman, Sameed Ali, Zephyr Lucas, and Sean W. Smith. 2020. Research Report: The Parsley Data Format Definition Language. In 6th Workshop on Language-Theoretic Security (LangSec). 300–307.
[20]
Terence Parr, Sam Harwell, and Kathleen Fisher. 2014. Adaptive LL(*) parsing: the power of dynamic analysis. In ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). 579–598.
[21]
Tahina Ramananandro, Antoine Delignat-Lavaud, Cédric Fournet, Nikhil Swamy, Tej Chajed, Nadim Kobeissi, and Jonathan Protzenko. 2019. EverParse: Verified Secure Zero-Copy Parsers for Authenticated Message Formats. In Usenix Security Symposium. 1465–1482.
[22]
William Underwood. 2012. Grammar-Based Specification and Parsing of Binary File Formats. International Journal of Digital Curation, 7, 1 (2012), 95–106. issn:1746-8256 https://doi.org/10.2218/ijdc.v7i1.217
[23]
W3C. 2018. Scalable Vector Graphics (SVG) 2. https://www.w3.org/TR/SVG2/
[24]
Jialun Zhang, Greg Morrisett, and Gang Tan. 2023. Interval Parsing Grammars for File Format Parsing. arxiv:2304.04859.
[25]
Jialun Zhang, Greg Morrisett, and Gang Tan. 2023. Reproduction Package for article "Interval Parsing Grammars for File Format Parsing". https://doi.org/10.5281/zenodo.7811236

Cited By

View all
  • (2024)Research Report: An Optim (l) Approach to Parsing Random-Access Formats2024 IEEE Security and Privacy Workshops (SPW)10.1109/SPW63631.2024.00023(192-199)Online publication date: 23-May-2024

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Programming Languages
Proceedings of the ACM on Programming Languages  Volume 7, Issue PLDI
June 2023
2020 pages
EISSN:2475-1421
DOI:10.1145/3554310
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 June 2023
Published in PACMPL Volume 7, Issue PLDI

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. Context-sensitive Grammars
  2. File Formats

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)16,607
  • Downloads (Last 6 weeks)2,696
Reflects downloads up to 15 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Research Report: An Optim (l) Approach to Parsing Random-Access Formats2024 IEEE Security and Privacy Workshops (SPW)10.1109/SPW63631.2024.00023(192-199)Online publication date: 23-May-2024

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media