blackst0ne Vitaliy Klachkov. Backend developer at funbox.ru & GitLab Core Team (community) member

Banzai - text transformation component

This article describes GitLab CE v11.10.4
  1. Intro
  2. Why?
  3. Architecture
  4. Filters
  5. Pipelines
    1. CombinedPipeline
    2. GfmPipeline
    3. PreProcessPipeline
    4. PostProcessPipeline
  6. Caching
    1. On database level
    2. On Redis level
  7. Permissions
  8. Summary

Intro

Banzai is GitLab's text transformation component.
Its goal is to take a text, update it in place, and return the result as HTML.
Banzai was introduced in !2027 in December 2015.

This component is used in almost every place where a user can type text, and in a few more besides.

Do you have an example?

Yes. :)

Let’s consider a simple comment a contributor leaves in a merge request:

This is a bugfix for ~rails5
It's based on !123 and closes #456
@username, could you review this, please? :slight_smile:

If they click on the Preview tab or just publish this comment, it will be rendered into this HTML code:

<p data-sourcepos="1:1-3:56" dir="auto">
  This is a bugfix for <a href="/gitlab-org/gitlab-ce/issues?label_name=rails5" data-original="~rails5" data-link="false" data-link-reference="false" data-project="13083" data-label="3697147" data-reference-type="label" data-container="body" data-placement="bottom" title="" class="gfm gfm-label has-tooltip"><span class="badge color-label has-tooltip" data-html="true" style="background-color: #cc0000; color: #FFFFFF" title="Issues and Merge Requests related to upgrading to Rails 5" data-container="body">rails5</span></a><br>
  It's based on <a href="/gitlab-org/gitlab-ce/merge_requests/123" data-original="!123" data-link="false" data-link-reference="false" data-project="13083" data-merge-request="16891" data-project-path="gitlab-org/gitlab-ce" data-iid="123" data-mr-title="IPv6 Support for nginx " data-reference-type="merge_request" data-container="body" data-placement="bottom" title="" class="gfm gfm-merge_request" data-mr-listener-added="true">!123 (closed)</a> and closes <a href="/gitlab-org/gitlab-ce/issues/456" data-original="#456" data-link="false" data-link-reference="false" data-project="13083" data-issue="52774" data-reference-type="issue" data-container="body" data-placement="bottom" title="incorrect email notifications sent when reassinging an issue" class="gfm gfm-issue has-tooltip">#456 (closed)</a><br>
  <a href="/username" data-user="3913" data-reference-type="user" data-container="body" data-placement="bottom" class="gfm gfm-project_member" title="name">@username</a>, could you review this, please? <gl-emoji title="slightly smiling face" data-name="slight_smile" data-unicode-version="7.0"><img class="emoji" title=":slight_smile:" alt=":slight_smile:" src="https://assets.gitlab-static.net/assets/emoji/slight_smile-10f4b66a755f5c78762a330f20d1866e4a22f3f1d495161d758d3bab8d2f36fe.png" width="20" height="20" align="absmiddle"></gl-emoji>
</p>

Or the same example in screenshots:

[Screenshot: original text]

[Screenshot: transformed text]

Why?

Why transform all those special symbols into HTML tags?

Well, it's all about usability. It is much easier to click a link than to copy a URL, open a new browser tab, paste the URL, and press Enter.
Bold and italic text highlights important pieces, tables are handy when you need to structure your data, and so on.

Of course, if you're a Linux kernel developer, you may not need all that stuff at all. But for everything else, there's ~~MasterCard~~ HTML text. :)

Architecture

Banzai is built on top of the html-pipeline gem.
The idea is simple: a text comes into Banzai, passes through a set of transformation rules, and goes out.

These rules are called filters, and they live in lib/banzai/filter.
You can pass your text into a filter, let it do its job, and get the modified text back.

project = Project.first
text = "This is a ~bug"

Banzai::Filter::LabelReferenceFilter.call(text, project: project).to_html

=> "This is a <a href=\"http://gitlab.com/gitlab-org/gitlab-ce/issues?label_name=bug\" data-original=\"~bug\" data-link=\"false\" data-link-reference=\"false\" data-project=\"1\" data-label=\"70\" data-reference-type=\"label\" data-container=\"body\" data-placement=\"bottom\" title=\"\" class=\"gfm gfm-label has-tooltip\"><span class=\"badge color-label has-tooltip\" data-html=\"true\" style=\"background-color: #FF0000; color: #FFFFFF\" title=\"\" data-container=\"body\">bug</span></a>"

But what if you need to call many filters in a row?
You can call every filter you need manually, but that is going to bloat your code quite quickly:

project = Project.first
text = "Your text here"

result = Banzai::Filter::MarkdownFilter.call(text, project: project) # this filter doesn't have `to_html` method. 
result = Banzai::Filter::SanitizationFilter.call(result, project: project).to_html
result = Banzai::Filter::EmojiFilter.call(result, project: project).to_html
result = Banzai::Filter::ColorFilter.call(result, project: project).to_html
result = Banzai::Filter::AutolinkFilter.call(result, project: project).to_html
result = Banzai::Filter::ExternalLinkFilter.call(result, project: project).to_html

result

To make things easier, there is another kind of object: a pipeline.
Pipelines live in lib/banzai/pipeline.

A pipeline is just a set of filters.

For example, there’s BroadcastMessagePipeline collecting all the filters mentioned above:

# lib/banzai/pipeline/broadcast_message_pipeline.rb
  
module Banzai
  module Pipeline
    class BroadcastMessagePipeline < DescriptionPipeline
      def self.filters
        @filters ||= FilterArray[
          Filter::MarkdownFilter,
          Filter::SanitizationFilter,

          Filter::EmojiFilter,
          Filter::ColorFilter,
          Filter::AutolinkFilter,
          Filter::ExternalLinkFilter
        ]
      end
    end
  end
end

And now you can call a pipeline:

project = Project.first
text = "Your text here"

Banzai::Pipeline::BroadcastMessagePipeline.to_html(text, project: project)

The result will be the same.
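
To make the "a pipeline is just a set of filters" idea concrete, here is a minimal sketch in plain Ruby. ToyPipeline, ShoutFilter and ProjectTagFilter are made-up names for illustration only; they are not Banzai classes and they work on plain strings rather than on HTML documents.

```ruby
# Toy stand-ins for filters: each one takes a string and a context hash
# and returns a new string. These are NOT real Banzai filters.
ShoutFilter = lambda do |text, _context|
  text.gsub(/\bbug\b/, "BUG")
end

ProjectTagFilter = lambda do |text, context|
  "[#{context.fetch(:project)}] #{text}"
end

# A toy pipeline: an ordered list of filters applied one after another.
class ToyPipeline
  def initialize(*filters)
    @filters = filters
  end

  def call(text, context = {})
    @filters.reduce(text) { |result, filter| filter.call(result, context) }
  end
end

context = { project: "demo" }
pipeline = ToyPipeline.new(ShoutFilter, ProjectTagFilter)
piped = pipeline.call("This is a bug", context)

# Calling the filters by hand, in the same order, gives the same result.
manual = ShoutFilter.call("This is a bug", context)
manual = ProjectTagFilter.call(manual, context)

puts piped           # [demo] This is a BUG
puts piped == manual # true
```

The order of the filters matters: each filter sees the output of the previous one, which is exactly why the manual version has to repeat the calls in the same sequence.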

There are other entry points which call pipelines and additionally cache the results.
You can find them in lib/banzai.rb.
The main one is Banzai.render(text, context).

Let's take a deeper look at how Banzai works.


Filters

A filter is a PORO based on HTML::Pipeline::Filter with a call method.
It takes an HTML string or a Nokogiri::HTML::DocumentFragment, modifies it, and returns the result. The result must be of one of the same two types: either a string or a Nokogiri::HTML::DocumentFragment.

For example, this filter finds all images and updates their src attribute:

def call
  # Point every <img> at the same logo.
  doc.search("img").each do |img|
    img["src"] = "https://gitlabinternals.dev/assets/images/logo-big.png"
  end

  doc # a filter must return the (modified) document
end

doc is the Nokogiri::HTML::DocumentFragment that holds the text while it passes through the filters.
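
The general shape of a filter class can be sketched without any gems. The sketch below assumes a simplified contract: a class-level call builds an instance and delegates to the instance-level call. ToyFilter, ToyImageSrcFilter and the logo_url context key are hypothetical names, and the code works on plain strings instead of a Nokogiri fragment.

```ruby
# ToyFilter mimics the general shape of a filter class: a class-level `call`
# that builds an instance and delegates to the instance-level `call`.
# It is a hypothetical stand-in and works on plain strings only.
class ToyFilter
  attr_reader :doc, :context

  def self.call(doc, context = {})
    new(doc, context).call
  end

  def initialize(doc, context)
    @doc = doc
    @context = context
  end

  def call
    raise NotImplementedError, "#{self.class} must implement #call"
  end
end

# Rewrites every src="..." value to a URL taken from the context.
class ToyImageSrcFilter < ToyFilter
  def call
    doc.gsub(/src="[^"]*"/, %(src="#{context.fetch(:logo_url)}"))
  end
end

html = '<p><img src="/a.png"> and <img src="/b.png"></p>'
result = ToyImageSrcFilter.call(html, { logo_url: "https://example.com/logo.png" })
puts result
```

A string goes in and a string comes out, which mirrors the "same type in, same type out" rule described above.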


Pipelines

A pipeline is also a PORO containing a set of filters.

When calling a pipeline, there are two parameters you have to pass:

  1. text - a text itself to be transformed.
  2. context - a hash of options.

All pipelines inherit from BasePipeline, which implements the following 5 methods:

  1. filters: lists filters.
  2. transform_context: modifies the context.
  3. call: passes the context to transform_context and calls HTML::Pipeline.new(filters).call(text, context)
  4. to_html: passes the context to transform_context and calls HTML::Pipeline.new(filters).to_html(text, context)
  5. to_document: passes the context to transform_context and calls HTML::Pipeline.new(filters).to_document(text, context)

The methods filters and transform_context may be overridden by descendants if needed.

By default, if no pipeline option is passed in the context, pipeline: :full is used.
As you can guess, that means FullPipeline will be used. :)
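
How a pipeline: option could be mapped to a class is easy to sketch. The code below is an illustration under assumptions, not GitLab's code: the ToyPipelines module, its empty classes, and the camelize and resolve helpers are all made up for the example.

```ruby
module ToyPipelines
  class FullPipeline; end
  class SingleLinePipeline; end

  # Turns :single_line into "SingleLine" without relying on ActiveSupport.
  def self.camelize(name)
    name.to_s.split("_").map(&:capitalize).join
  end

  # Resolves the :pipeline option to a class; a missing option falls back to :full.
  def self.resolve(context)
    name = context[:pipeline] || :full
    const_get("#{camelize(name)}Pipeline")
  end
end

puts ToyPipelines.resolve({})                        # ToyPipelines::FullPipeline
puts ToyPipelines.resolve({ pipeline: :single_line }) # ToyPipelines::SingleLinePipeline
```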

But why do we want to have many pipelines? Can’t we just use one for everything?

It depends on where you want to use a Banzai pipeline.

For example, by default GitLab uses lazy loading for images.
It lets pages with images load much faster, because ImageLazyLoadFilter replaces the real src attribute value with a placeholder. Once a page is loaded, JavaScript starts loading the images in the background.

But what if we want to send an email with images?
Lazy loading won't work in emails, because the JavaScript code responsible for it is part of the GitLab application and is not shipped with emails.

Do we want to add all that JavaScript to emails too? Would it even work?
That depends heavily on the email client the end user has, so lazy loading may not work at all.

In this case we have to send the email with the original src attribute, which means disabling ImageLazyLoadFilter and passing some additional options to ExternalLinkFilter and MarkdownFilter.

So how can we solve this problem?

We can do this in 3 steps:

  1. Create a new pipeline, say, EmailPipeline, based on FullPipeline.
  2. Override the filters method: remove ImageLazyLoadFilter from the filters list.
  3. Override the transform_context method: add some options for other filters.

Now we just call Banzai.render with the pipeline: :email option, which invokes the pipeline we created, and that's it.

# lib/banzai/pipeline/email_pipeline.rb
   
module Banzai
  module Pipeline
    class EmailPipeline < FullPipeline
      def self.filters
        super.tap do |filter_array|
          filter_array.delete(Banzai::Filter::ImageLazyLoadFilter) # exclude filter
        end
      end

      def self.transform_context(context)
        super(context).merge(
          only_path: false,      # use links with full URLs 
          emailable_links: true, # use punycode for links
          no_sourcepos: true     # do not include source positions in rendered HTML  
        )
      end
    end
  end
end

GitLab CE v11.10.4, whose code is described here, has 20 pipelines.

So let’s take a look at some interesting ones:

  1. CombinedPipeline
  2. GfmPipeline
  3. PreProcessPipeline
  4. PostProcessPipeline

CombinedPipeline

If you want to create a pipeline by merging two or more pipelines, you can use CombinedPipeline.
It's just a wrapper around multiple pipelines:

# lib/banzai/pipeline/combined_pipeline.rb
  
module Banzai
  module Pipeline
    module CombinedPipeline
      def self.new(*pipelines)
        Class.new(BasePipeline) do
          const_set :PIPELINES, pipelines

          def self.pipelines
            self::PIPELINES
          end

          def self.filters
            FilterArray.new(pipelines.flat_map(&:filters))
          end

          def self.transform_context(context)
            pipelines.reduce(context) do |context, pipeline|
              pipeline.transform_context(context)
            end
          end
        end
      end
    end
  end
end

FullPipeline is then defined as a combination of two pipelines:

# lib/banzai/pipeline/full_pipeline.rb
  
module Banzai
  module Pipeline
    class FullPipeline < CombinedPipeline.new(PlainMarkdownPipeline, GfmPipeline)
    end
  end
end
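
The combining trick can be tried out in isolation. The following is a simplified sketch, assuming toy pipelines whose filters are plain symbols; ToyMarkdownPipeline, ToyReferencePipeline, ToyCombined and the context keys are hypothetical. It uses define_singleton_method with a closure instead of const_set, purely to keep the example short.

```ruby
# Hypothetical stand-ins: each "pipeline" exposes a filter list and a
# context transformation, nothing more.
class ToyMarkdownPipeline
  def self.filters
    [:markdown]
  end

  def self.transform_context(context)
    context.merge(markdown: true)
  end
end

class ToyReferencePipeline
  def self.filters
    [:emoji, :label_reference]
  end

  def self.transform_context(context)
    context.merge(only_path: true)
  end
end

module ToyCombined
  # Builds an anonymous class whose filters and context transformation
  # are the concatenation/composition of the given pipelines.
  def self.new(*pipelines)
    Class.new do
      define_singleton_method(:filters) { pipelines.flat_map(&:filters) }
      define_singleton_method(:transform_context) do |context|
        pipelines.reduce(context) { |ctx, pipeline| pipeline.transform_context(ctx) }
      end
    end
  end
end

ToyFull = ToyCombined.new(ToyMarkdownPipeline, ToyReferencePipeline)
p ToyFull.filters
p ToyFull.transform_context({ project: "demo" })
```

Filters keep the order of the pipelines they come from, and each pipeline gets the context already transformed by the previous one.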

GfmPipeline

GFM stands for GitLab Flavored Markdown. It was inspired by GitHub Flavored Markdown.

GFM is an extension of a markdown engine. It brings additional syntax to markdown text, e.g. special GitLab references, task lists, emoji, mermaid flowcharts and diagrams, videos, etc.

You can read more about this syntax here.

Technically it is just a pipeline with a set of filters.

PreProcessPipeline

This pipeline is explicitly run first when you call Banzai.render.

The pipeline runs two filters:

module Banzai
  module Pipeline
    class PreProcessPipeline < BasePipeline
      def self.filters
        FilterArray[
          Filter::FrontMatterFilter,
          Filter::BlockquoteFenceFilter,
        ]
      end

      def self.transform_context(context)
        context.merge(
          pre_process: true
        )
      end
    end
  end
end

These filters check whether a text has front matter blocks or block quote fences.
Text inside such blocks must be processed in different ways.
For example, everything inside front matter blocks must be rendered as is, while text inside fences may be either highlighted or rendered as is.

	```
	I will be rendered as is, even if I have this smile :slight_smile: 
	```
	
	```ruby
	puts "And I will be highlighted" 
	```	

Such text may have to be left untouched by the other filters.
That's why this pipeline runs first.

PostProcessPipeline

This pipeline is used when we need to redact an already prepared HTML string, e.g. update issue/merge request states (#123 vs #123 (closed)), remove references the current user is not permitted to follow, etc.

The filters from this pipeline are not part of the FullPipeline that gets called when you use Banzai.render, because they must run for a concrete user every time that user opens the text.

So you should either call Banzai.post_process explicitly (or call the pipeline itself), or call Banzai.render_and_post_process to prepare your text and redact it in one go.

If you open an issue with 10 notes (comments), PostProcessPipeline will be called at least 11 times: once for the description text and once for each of the 10 notes.


Caching

GitLab calls Banzai very often: when you open an issue, a merge request, a snippet, a commit, etc. It is very important for the tool to be as fast as possible.

Given that a text is read far more often than it is written, it makes sense to cache it once it has been transformed.
And there are two such caches: one on the database level and one on the Redis level.

On database level

Some models have special attributes, e.g.:

# app/models/snippet.rb

class Snippet < ApplicationRecord
  # code skipped
  
  cache_markdown_field :title, pipeline: :single_line
  cache_markdown_field :description, issuable_state_filter_enabled: true
  
  # code skipped
end

The implementation of cache_markdown_field lives in the CacheMarkdownField concern.
In short, the idea is to store both the original text and its transformed (rendered) version in database tables.
The transformed variant is stored in the *_html fields, e.g. title_html.

Every time an object has to be shown to a user, Banzai checks whether the object's *_html field is up to date. If it is, Banzai just returns the cached data. Otherwise Banzai re-renders the text and stores the result in the database again.

To check whether an attribute's cache is still up to date, there's the cached_html_up_to_date? method.
It relies on ActiveRecord's dirty attribute tracking and also checks whether cached_markdown_version has changed.

cached_markdown_version is a special field that changes whenever Banzai's markdown engine would render different output.
For example, if the engine's options are changed, or the engine itself is updated, a developer must bump the CACHE_COMMONMARK_VERSION constant manually.

Even then, the cached fields still have to be redacted via PostProcessPipeline before they are shown to an end user.
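
The freshness check can be illustrated with a small self-contained model. This is a deliberately simplified sketch, not the actual concern: ToyNote, TOY_MARKDOWN_VERSION and the rendered_from field are hypothetical, and instead of dirty attribute tracking the toy simply remembers which source text the cached HTML was rendered from.

```ruby
# Hypothetical constant and class names; only the idea mirrors the text above.
TOY_MARKDOWN_VERSION = 3

ToyNote = Struct.new(:text, :text_html, :rendered_from, :cached_version) do
  # The cache is fresh when the renderer version matches and the source
  # text has not changed since the HTML was produced.
  def cached_html_up_to_date?
    !text_html.nil? &&
      cached_version == TOY_MARKDOWN_VERSION &&
      rendered_from == text
  end

  def refresh_cache!
    self.text_html = "<p>#{text}</p>" # stand-in for a real render
    self.rendered_from = text
    self.cached_version = TOY_MARKDOWN_VERSION
  end

  # Returns cached HTML, re-rendering only when the cache is stale.
  def html
    refresh_cache! unless cached_html_up_to_date?
    text_html
  end
end

note = ToyNote.new("hello")
puts note.cached_html_up_to_date? # false, nothing rendered yet
puts note.html                    # <p>hello</p>
puts note.cached_html_up_to_date? # true
note.text = "changed"
puts note.cached_html_up_to_date? # false, the source text changed
```

Bumping TOY_MARKDOWN_VERSION in such a model would invalidate every cached field at once, which is the role the version constant plays in the description above.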

On Redis level

Banzai also supports caching rendered results in Redis.
This cache is used when the context has the cache_key option set:

# lib/banzai/renderer.rb
  
def self.render(text, context = {})
  cache_key = context.delete(:cache_key)
  cache_key = full_cache_key(cache_key, context[:pipeline])

  if cache_key
    Gitlab::Metrics.measure(:banzai_cached_render) do
      Rails.cache.fetch(cache_key) do
        cacheless_render(text, context)
      end
    end
  else
    cacheless_render(text, context)
  end
end 
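
The same control flow can be exercised without Rails or Redis. The sketch below assumes a Hash-backed ToyCache in place of Rails.cache and a ToyRenderer that counts real renders; both names are made up, and the cache key format is only an illustration.

```ruby
# A Hash-backed stand-in for a cache store whose `fetch` takes a block.
class ToyCache
  def initialize
    @store = {}
  end

  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end
end

class ToyRenderer
  attr_reader :render_count

  def initialize(cache)
    @cache = cache
    @render_count = 0
  end

  # Uses the cache only when the caller supplies a :cache_key.
  def render(text, context = {})
    context = context.dup
    cache_key = context.delete(:cache_key)
    return cacheless_render(text) unless cache_key

    @cache.fetch([cache_key, context[:pipeline]]) { cacheless_render(text) }
  end

  private

  def cacheless_render(text)
    @render_count += 1
    "<p>#{text}</p>" # stand-in for a real render
  end
end

renderer = ToyRenderer.new(ToyCache.new)
renderer.render("hi", { cache_key: "note-1", pipeline: :full }) # rendered
renderer.render("hi", { cache_key: "note-1", pipeline: :full }) # served from cache
renderer.render("hi", { pipeline: :full })                      # no key, rendered again
puts renderer.render_count # 2
```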

Permissions

GitLab has a permission system that works on different levels: groups, projects, issues, etc.

Let's imagine there are two users: user1 and user2.
user1 creates a private issue #12345, and user2 doesn't have permission to view it.
Now user1 writes a note (comment):

I've left my thoughts on this problem in #12345

The #12345 text must be transformed into a link for user1 and left as is for user2, who has no permission to see the issue.

How can we handle such a situation?

  1. Render all possible variations of the text for each existing user, store them in the database, and re-render all of them every time a user's set of permissions changes? It'd be a nightmare.
  2. Render the text for every user each time they visit a page? It'd be a waste of resources.

Banzai works a bit differently.
A text gets rendered just once, without any permission checks.
If there's a #123 reference, IssuableReferenceFilter will grab it and replace it with an <a href ...>#123</a> link.

But every time a note/issue/merge request/etc. is loaded for a user, its already rendered *_html field goes through PostProcessPipeline, where rendered links are removed if the user doesn't have enough permissions. The result of the pipeline is what the user sees. Every user gets their own version of a text passed through Banzai, depending on the permissions they have.
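
The "render once, redact per user" idea can be shown with a tiny string-based sketch. It is an assumption-laden toy: the real pipeline works on a parsed document and real permission checks, whereas redact_for, LINK_PATTERN and the list of readable issue ids below are invented for the example.

```ruby
# Stand-in for an already rendered (cached) HTML field.
RENDERED_HTML = 'See <a href="/issues/12345" data-issue="12345">#12345</a> and ' \
                '<a href="/issues/7" data-issue="7">#7</a>'

# Matches a rendered issue link and captures its issue id and link text.
LINK_PATTERN = %r{<a [^>]*data-issue="(\d+)"[^>]*>(.*?)</a>}

# Replaces links to issues the user cannot read with their plain text.
def redact_for(html, readable_issue_ids)
  html.gsub(LINK_PATTERN) do |link|
    issue_id = Regexp.last_match(1).to_i
    readable_issue_ids.include?(issue_id) ? link : Regexp.last_match(2)
  end
end

puts redact_for(RENDERED_HTML, [12345, 7]) # user1 keeps both links
puts redact_for(RENDERED_HTML, [7])        # user2 sees plain "#12345"
```

The cached HTML is never changed; every user just gets a differently redacted copy of it.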

Summary

Banzai is an easily extensible component.
It still has some performance issues, its codebase could be refactored for better readability, and people want more features added, but it works and does its job quite well.

I hope this post gave you an understanding of how Banzai works, and that it will now be much easier for you to implement a merge request for one of the existing Banzai issues.
