Sometimes you need to read online documentation or find something on the Internet from the command line or terminal. So, let's use Python to create a text-based browser! Of course, making a real, full-blown browser is a very difficult task. In this project, you'll create a very simple browser that will ignore JavaScript and CSS, won't have cookies, and will only process a limited set of tags.
Requirements:
- Python 3.7
- To run the tests: https://github.com/hyperskill/hs-test-python
- BeautifulSoup
- Colorama
python browser.py
Every browser accepts a string from the user and then shows a web page. The string is a URL (Uniform Resource Locator) and looks something like this: https://www.google.com. After that, the browser has a lot of work to do. In a nutshell, this work can be described as finding a web page: the page is located somewhere on the Internet, and the browser has to retrieve it. Since the https://www. part is always the same, it is often omitted, and the correct shortened link looks like this: google.com.
In our first stage, we'll try to imitate this behavior.
- You should write a program that takes a string from the user (a URL) and outputs a "hard-coded" website with news (just a header and some text below). The websites are presented as two variables in the source code; you can see them in the template. These are mock bloomberg.com and nytimes.com sites. You just need to output them in response to the corresponding input URL.
- Also, you should add the ability to quit the browser by typing exit, because real browsers don’t finish their work when they output a single web page: they are ready to accept a new URL at any moment. You should implement this behavior, too. An endless loop can help you with that part.
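The two requirements above could be sketched roughly as follows. This is only an illustration: the dictionary `PAGES`, the function `run`, and the shortened page texts are assumptions, not part of the template, which stores the pages as two separate variables.

```python
# Shortened stand-ins for the two hard-coded pages from the template.
PAGES = {
    "bloomberg.com": "The Space Race: From Apollo 11 to Elon Musk\n...",
    "nytimes.com": "This New Liquid Is Magnetic, and Mesmerizing\n...",
}


def run(read=input, write=print):
    """Show the page for each known URL until the user types "exit"."""
    while True:  # the endless loop keeps the browser ready for a new URL
        command = read().strip()
        if command == "exit":
            break
        page = PAGES.get(command)
        if page is not None:
            write(page)
```

Calling `run()` with no arguments reads from the keyboard and prints to the terminal; the `read` and `write` parameters exist only so the loop can be exercised without a real terminal.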
The greater-than symbol followed by a space (> ) represents the user input. Note that it is not part of the input.
> bloomberg.com
The Space Race: From Apollo 11 to Elon Musk
It's 50 years since the world was gripped by historic images
of Apollo 11, and Neil Armstrong -- the first man to walk
on the moon. It was the height of the Cold War, and the charts
were filled with David Bowie's Space Oddity, and Creedence's
Bad Moon Rising. The world is a very different place than
it was 5 decades ago. But how has the space race changed since
the summer of '69? (Source: Bloomberg)
Twitter CEO Jack Dorsey Gives Talk at Apple Headquarters
Twitter and Square Chief Executive Officer Jack Dorsey
addressed Apple Inc. employees at the iPhone maker’s headquarters
Tuesday, a signal of the strong ties between the Silicon Valley giants.
> exit
Let's make our browser store web pages in a file and show them if the user types a shortened request (for example, wikipedia instead of wikipedia.org). You can store each page as a separate file or find another way to do this. But your program should accept one command line argument which is a directory to store the files, and your web pages should be saved inside this directory.
At this stage, your program should:
- Check if the user has entered a valid URL. It must contain at least one dot, for example, bloomberg.com. If the URL is incorrect, the browser should output an error message (it should contain the word error) and wait for another URL.
- Accept a command-line argument which is a directory for saved tabs. For example, if the argument is dir, then you need to create a folder with the name dir and save all web pages that the user downloads in this folder.
- Save this web page in a file. After that, the user needs to have a simple way to see the saved web page by typing "bloomberg". The rule is simple: you just need to remove the last dot and everything that comes after it. bloomberg.com becomes bloomberg, and en.wikipedia.org becomes en.wikipedia.
Check out a tutorial to learn how to work with files and create folders in Python.
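One possible shape for the file handling described in this stage is sketched below. The helper names (`short_name`, `is_valid_url`, `save_page`, `load_page`) are hypothetical, and storing each page as a separate file named after its shortened URL is just one of the options the task allows.

```python
import os


def short_name(url):
    """Drop the last dot and everything after it: en.wikipedia.org -> en.wikipedia."""
    return url.rsplit(".", 1)[0]


def is_valid_url(url):
    """Rule from this stage: a URL must contain at least one dot."""
    return "." in url


def save_page(directory, url, text):
    """Write the page to <directory>/<short name> and return the file path."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(directory, short_name(url))
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)
    return path


def load_page(directory, name):
    """Return the saved page text, or None if nothing was saved under that name."""
    path = os.path.join(directory, name)
    if not os.path.isfile(path):
        return None
    with open(path, encoding="utf-8") as file:
        return file.read()
```

In the real program the directory would come from the command line (for instance `sys.argv[1]`), and an input that is neither a saved name nor a valid URL would produce the error message.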
The greater-than symbol followed by a space (> ) represents the user input. Note that it is not part of the input.
> python browser.py dir-for-files
> bloomberg.com
The Space Race: From Apollo 11 to Elon Musk
It's 50 years since the world was gripped by historic images
of Apollo 11, and Neil Armstrong -- the first man to walk
on the moon. It was the height of the Cold War, and the charts
were filled with David Bowie's Space Oddity, and Creedence's
Bad Moon Rising. The world is a very different place than
it was 5 decades ago. But how has the space race changed since
the summer of '69? (Source: Bloomberg)
Twitter CEO Jack Dorsey Gives Talk at Apple Headquarters
Twitter and Square Chief Executive Officer Jack Dorsey
addressed Apple Inc. employees at the iPhone maker’s headquarters
Tuesday, a signal of the strong ties between the Silicon Valley giants.
> bloomberg
The Space Race: From Apollo 11 to Elon Musk
It's 50 years since the world was gripped by historic images
of Apollo 11, and Neil Armstrong -- the first man to walk
on the moon. It was the height of the Cold War, and the charts
were filled with David Bowie's Space Oddity, and Creedence's
Bad Moon Rising. The world is a very different place than
it was 5 decades ago. But how has the space race changed since
the summer of '69? (Source: Bloomberg)
Twitter CEO Jack Dorsey Gives Talk at Apple Headquarters
Twitter and Square Chief Executive Officer Jack Dorsey
addressed Apple Inc. employees at the iPhone maker’s headquarters
Tuesday, a signal of the strong ties between the Silicon Valley giants.
> nytimes
Error: Incorrect URL
> exit
Every browser has a “back” button. If the user presses this button, the browser shows the previous web page. This feature can be implemented using a stack. You save the pages visited by the user: google, wikipedia, bloomberg, ..., but when the user types back, you will see the pages in the reverse order: ..., bloomberg, wikipedia, google.
The result of this task is the same as in the previous task, but now the program has a new feature:
- The program should show the previous page if the user types back. You can implement a stack to do this.
- If there are no more pages in the browser history, just don’t output anything.
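A stack-based history might look like the sketch below. The class name `History` and its methods are assumptions made for illustration; a plain Python list serves as the stack, and the page currently on screen is kept separately so that back returns the page before it.

```python
class History:
    """Keeps previously shown pages on a stack (a plain list)."""

    def __init__(self):
        self._stack = []      # pages shown before the current one
        self._current = None  # page on screen right now

    def visit(self, page):
        """Record a newly shown page, pushing the old one onto the stack."""
        if self._current is not None:
            self._stack.append(self._current)
        self._current = page

    def back(self):
        """Return the previous page, or None when the history is empty."""
        if not self._stack:
            return None
        self._current = self._stack.pop()
        return self._current
```

When `back()` returns `None`, the browser simply prints nothing, which matches the second requirement.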
The greater-than symbol followed by a space (> ) represents the user input. Note that it is not part of the input.
> python browser.py dir-for-files
> bloomberg.com
The Space Race: From Apollo 11 to Elon Musk
It's 50 years since the world was gripped by historic images
of Apollo 11, and Neil Armstrong -- the first man to walk
on the moon. It was the height of the Cold War, and the charts
were filled with David Bowie's Space Oddity, and Creedence's
Bad Moon Rising. The world is a very different place than
it was 5 decades ago. But how has the space race changed since
the summer of '69? (Source: Bloomberg)
Twitter CEO Jack Dorsey Gives Talk at Apple Headquarters
Twitter and Square Chief Executive Officer Jack Dorsey
addressed Apple Inc. employees at the iPhone maker’s headquarters
Tuesday, a signal of the strong ties between the Silicon Valley giants.
> nytimes.com
This New Liquid Is Magnetic, and Mesmerizing
Scientists have created “soft” magnets that can flow
and change shape, and that could be a boon to medicine
and robotics. (Source: New York Times)
Most Wikipedia Profiles Are of Men. This Scientist Is Changing That.
Jessica Wade has added nearly 700 Wikipedia biographies for
important female and minority scientists in less than two
years.
> back
The Space Race: From Apollo 11 to Elon Musk
It's 50 years since the world was gripped by historic images
of Apollo 11, and Neil Armstrong -- the first man to walk
on the moon. It was the height of the Cold War, and the charts
were filled with David Bowie's Space Oddity, and Creedence's
Bad Moon Rising. The world is a very different place than
it was 5 decades ago. But how has the space race changed since
the summer of '69? (Source: Bloomberg)
Twitter CEO Jack Dorsey Gives Talk at Apple Headquarters
Twitter and Square Chief Executive Officer Jack Dorsey
addressed Apple Inc. employees at the iPhone maker’s headquarters
Tuesday, a signal of the strong ties between the Silicon Valley giants.
> exit
Now we should get closer to the browser with the address bar. At this stage, you need to forget about your hard-coded variables with sites and show your user some real pages. Make the browser request the real input URL and show the result.
One of the simplest ways to do this is the Requests library. It is already installed in your project, so you can use it. This library lets you fetch any web page by its URL in a single line of code. You can find that line in the Requests documentation, though it’s better to read the whole quickstart guide to understand more.
Sometimes, it’s going to be a challenge. You might find that you suddenly don't have permission to visit certain websites. That’s because of the User-Agent. It’s just a string that browsers use to identify themselves in the request, and each browser has its own. In fact, browsers add a lot of additional information to their requests. All of this can be set using the Requests library. For this task, it's optional, but feel free to experiment.
Add new features to the browser:
- Your program should read the URL from input as before, but now show the real web page using the Requests library.
- Since the user can input the URL without https:// at the beginning, you need to prepend this string if it is not there.
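The two points above could be sketched as follows. The function names `normalize_url` and `fetch`, the `user_agent` parameter, and the 10-second timeout are assumptions for illustration; `requests.get` with `headers` and `timeout` is the library call being relied on.

```python
def normalize_url(url):
    """Prepend https:// unless the URL already starts with it."""
    if url.startswith("https://"):
        return url
    return "https://" + url


def fetch(url, user_agent=None):
    """Download a page and return its text (performs a real network request)."""
    # Imported here so normalize_url stays usable without the third-party package.
    import requests

    # Optional: some sites refuse requests that lack a browser-like User-Agent.
    headers = {"User-Agent": user_agent} if user_agent else {}
    response = requests.get(normalize_url(url), headers=headers, timeout=10)
    return response.text
```

For example, `fetch("docs.python.org")` would request https://docs.python.org and return the raw HTML shown in the example output.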
The greater-than symbol followed by a space (> ) represents the user input. Note that it is not part of the input.
> python browser.py dir-for-files
> docs.python.org
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><title>3.7.4 Documentation</title>
<link rel="stylesheet" target="_blank" href="_static/pydoctheme.css" type="text/css" />
<link rel="stylesheet" target="_blank" href="_static/pygments.css" type="text/css" />
<script type="text/javascript" id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<script type="text/javascript" src="_static/language_data.js"></script>
<script type="text/javascript" src="_static/sidebar.js"></script>
<link rel="search" type="application/opensearchdescription+xml"
title="Search within Python 3.7.4 documentation"
target="_blank" href="_static/opensearch.xml"/>
<link rel="author" title="About these documents" target="_blank" href="about.html" />
<link rel="index" title="Index" target="_blank" href="genindex.html" />
<link rel="search" title="Search" target="_blank" href="search.html" />
<link rel="copyright" title="Copyright" target="_blank" href="copyright.html" />
<link rel="shortcut icon" type="image/png" target="_blank" href="_static/py.png" />
<link rel="canonical" target="_blank" href="https://docs.python.org/3/index.html" />
<script type="text/javascript" src="_static/copybutton.js"></script>
<script type="text/javascript" src="_static/switchers.js"></script>
… (More than 200 such terrifying strings)
> exit
Now it is important for us to bring the resulting "text" to a form that is understandable to the user.
If you don’t know what HTML is, here's a short explanation. When working on the previous task you could see a lot of <div>, <script> or <p> “words” on the displayed web page. These are called tags. Browsers need tags to know how exactly to show the page. For example, there could be headers that look different from the rest of the text. Also, there could be links; they could be blue, and the cursor could look like a pointing finger when it's on the link. To let the browser know where the links are, where an image should be and so on, tags are used.
Tags are necessary for the browser but aren’t useful for users. Most tags are paired. For example: <p>Some text</p>, where <p> is an opening tag and </p> is a closing tag. You need to show only “Some text” without <p> and </p> on a web page.
Each tag has its own purpose: <p> for text, <h1> through <h6> for headers, <a> for links, and <ul>, <ol>, <li> for lists.
At this stage, you need to cut all content outside of these tags and output what remains. No more <div>, <script>, <p> and so on, just text! You need to show only the content of a limited list of tags (<p>, headers, <a>, <ul>, <ol> and <li>) without the tags themselves.
Use the beautifulsoup4 library to solve this; it is already installed in your project. Feel free to get curious and browse through some more information about parsing!
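A rough sketch of this filtering with beautifulsoup4 is shown below. The names `ALLOWED_TAGS` and `extract_text` are hypothetical, and using Python's built-in "html.parser" backend is an assumption. Tags nested inside another allowed tag are skipped so that, for example, a link inside a paragraph is not printed twice.

```python
from bs4 import BeautifulSoup

# The limited set of tags whose content should be shown.
ALLOWED_TAGS = ["p", "h1", "h2", "h3", "h4", "h5", "h6", "a", "ul", "ol", "li"]


def extract_text(html):
    """Return the text found inside the allowed tags, one outer tag per line."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tag in soup.find_all(ALLOWED_TAGS):
        # An allowed tag inside another allowed tag is already covered
        # by the outer tag's text, so skip it to avoid duplicates.
        if tag.find_parent(ALLOWED_TAGS) is not None:
            continue
        text = tag.get_text(" ", strip=True)
        if text:
            lines.append(text)
    return "\n".join(lines)
```

One consequence of this simple approach is that all items of a list end up on a single line; splitting them onto separate lines is a possible refinement.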
The greater-than symbol followed by a space (> ) represents the user input. Note that it is not part of the input.
> python browser.py dir-for-files
> docs.python.org
index
modules
Python
Documentation
Python 3.7.4 documentation
Welcome! This is the documentation for Python 3.7.4.
Parts of the documentation:
What's new in Python 3.7? or all "What's new" documents since 2.0
Tutorial start here
Library Reference keep this under your pillow
Language Reference describes syntax and language elements
Python Setup and Usage how to use Python on different platforms
Python HOWTOs in-depth documents on specific topics
Installing Python Modules installing from the Python Package Index & other sources
Distributing Python Modules publishing modules for installation by others
Extending and Embedding tutorial for C/C++ programmers
Python/C API reference for C/C++ programmers
FAQs frequently asked questions (with answers!)
Indices and tables:
Global Module Index quick access to all modules
General Index all functions, classes, terms
Glossary the most important terms explained
Search page search this documentation
Complete Table of Contents lists all sections and subsections
Meta information:
Reporting bugs
About the documentation
History and License of Python
Copyright
> exit
It’s not enough to just drop the tags. You should make your output “readable”. After all, we would like to have a user-friendly browser, right? At this stage, try to make your browser look more like a browser.
Almost every page contains links. Have you ever wondered why blue was chosen to highlight them?
One of the reasons lies in the physiology of the human eye. Red and green are detected by the same cells in the eye, and one of the most common forms of colorblindness is red-green colorblindness. It affects 7% of men and only 0.4% of women; that’s still about one person in 25 overall. But almost no one has a blue deficiency. Accordingly, nearly everyone can see blue, or, more accurately, almost everyone can distinguish blue as a color different from others.
Also, blue is the darkest color that does not reduce the readability of the text.
Let all links in your browser be blue! Pay attention to the Colorama library. This library is already installed in the project, so you can use it. With this library, you can easily solve this task just after reading the documentation!
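Coloring a piece of text with Colorama can be as small as the sketch below. The helper name `blue` is hypothetical; `Fore.BLUE`, `Style.RESET_ALL`, and `init()` come from the library.

```python
from colorama import Fore, Style, init

init()  # lets the color codes work on Windows terminals as well


def blue(text):
    """Wrap text in the blue color code and reset the style afterwards."""
    return Fore.BLUE + text + Style.RESET_ALL


print(blue("Tutorial"))
```

In the browser, such a helper would be applied to the text of each <a> tag before printing, while all other text is printed unchanged.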
The greater-than symbol followed by a space (> ) represents the user input. Note that it is not part of the input.
> https://docs.python.org
index
modules
Python
Documentation
Python 3.7.4 documentation
Welcome! This is the documentation for Python 3.7.4.
Parts of the documentation:
What's new in Python 3.7? or all "What's new" documents since 2.0
Tutorial start here
Library Reference keep this under your pillow
Language Reference describes syntax and language elements
Python Setup and Usage how to use Python on different platforms
Python HOWTOs in-depth documents on specific topics
Installing Python Modules installing from the Python Package Index & other sources
Distributing Python Modules publishing modules for installation by others
Extending and Embedding tutorial for C/C++ programmers
Python/C API reference for C/C++ programmers
FAQs frequently asked questions (with answers!)
Indices and tables:
Global Module Index quick access to all modules
General Index all functions, classes, terms
Glossary the most important terms explained
Search page search this documentation
Complete Table of Contents lists all sections and subsections
Meta information:
Reporting bugs
About the documentation
History and License of Python
Copyright
> exit