
Fedora Infrastructure Best Practices Documentation Release 1.0.0

The Fedora Infrastructure Team

Sep 09, 2021

Full Table of Contents:

1. Getting Started
   1.1 Create a Fedora Account
   1.2 Subscribe to the Mailing List
   1.3 Join IRC
   1.4 Next Steps

2. Full Table of Contents
   2.1 Developer Guide
   2.2 System Administrator Guide
   2.3 (Old) System Administrator Guides

3. Indices and tables


This contains a development and system administration guide for the Fedora Infrastructure team. The development guide covers how to get started with application development as well as application best practices. You will also find several sample projects that serve as demonstrations of these best practices and as an excellent starting point for new projects. The system administration guide covers how to get involved in the system administration side of Fedora Infrastructure as well as the standard operating procedures (SOPs) we use. The source repository for this documentation is maintained here: https://pagure.io/infra-docs


CHAPTER 1

Getting Started

Fedora Infrastructure is full of projects that need help. In fact, there is so much work to do, it can be a little overwhelming. This document is intended to help you get ready to contribute to Fedora Infrastructure.

1.1 Create a Fedora Account

The first thing you should do is create a Fedora account. Your Fedora account will be used for nearly everything you do as a member of the Fedora community. Once you’ve created your Fedora account, you need to read and sign the Fedora Project Contributor Agreement (FPCA).

1.2 Subscribe to the Mailing List

Next, you should join the Fedora Infrastructure mailing list. You will need to log into your new Fedora account to subscribe. The mailing list is the best way to have a discussion with the entire Fedora Infrastructure community.

1.3 Join IRC

Join us on Internet Relay Chat (IRC) to chat in real time. For a more thorough introduction to IRC, check out the Fedora Magazine’s beginner’s guide to IRC. There are many Fedora IRC channels on Libera.Chat. To start with, you should check out the #fedora-admin and #fedora-apps channels. These channels are for Fedora Infrastructure system administration and application development, respectively.


1.4 Next Steps

Congratulations, you are now ready to get involved in a project! If application development is what you’re interested in, check out our developer Getting Started guide. If system administration sounds more to your liking, see the system administrator Getting Started guide.

CHAPTER 2

Full Table of Contents

2.1 Developer Guide

This is a complete guide to contributing to Fedora Infrastructure applications. It targets both new and experienced contributors, and is maintained by those contributors. If the documentation is in need of improvement, please file an issue or submit a pull request.

2.1.1 Getting Started

This document is intended to guide you through your first contribution to a Fedora Infrastructure project. It assumes you are already familiar with the Git version control system and the Python programming language.

Development Environment

The Fedora Infrastructure team uses Ansible and Vagrant to set up development environments for the majority of our projects. It’s recommended that you develop on a Fedora host, but that is not strictly required. To install Ansible and Vagrant on Fedora, run:

$ sudo dnf install vagrant libvirt vagrant-libvirt vagrant-sshfs ansible

Projects will provide a Vagrantfile.example file in the root of their repository if they support using Vagrant. Copy this to Vagrantfile, adjust it as you see fit, and then run:

$ vagrant up
$ vagrant reload
$ vagrant ssh

This will create a new virtual machine, configure it with Ansible, restart it to ensure you’re running the latest updates, and then SSH into the virtual machine. Individual projects will provide detailed instructions for their particular setup.


Finding a Project

Fedora Infrastructure applications are either on GitHub in the fedora-infra organization, or on Pagure. Check out the issues tagged with easyfix for an issue to fix.

2.1.2 Development Environment

In order to make contributing easy, all projects should have an automated way to create a development environment. This might be as simple as a Python virtual environment, or it could be a virtual machine or container. This document provides guidelines for setting up development environments.

Ansible

Ansible is used throughout Fedora Infrastructure to automate tasks. If the project requires anything more than a Python virtual environment to be set up, you should use Ansible to automate the setup.

Vagrant

Vagrant is a tool to provision virtual machines. It allows you to define a base image (called a “box”), virtual machine resources, network configuration, directories to share between the host and guest machine, and much more. It can be configured to use libvirt to provision the virtual machines. You can install Vagrant on a Fedora host with:

$ sudo dnf install libvirt vagrant vagrant-libvirt vagrant-sshfs

You can combine your Ansible playbook with Vagrant very easily. Simply point Vagrant to your Ansible playbook and it will run it. Users who would prefer to provision their virtual machines in some other way are free to do so and only need to run the Ansible playbook on their host.

Note: How a project lays out its development-related content is up to the individual project, but a good approach is to create a devel directory. Within that directory you can create an ansible directory and use the layout suggested in the Ansible roles documentation.
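If you take this approach, the playbook Vagrant runs can stay very small. Here is a minimal sketch of what devel/ansible/vagrant-playbook.yml might contain; the package list and virtualenv path are hypothetical examples and should be adapted to your project:

---
# devel/ansible/vagrant-playbook.yml - a minimal sketch; the packages and
# paths below are examples only.
- hosts: all
  become: true

  tasks:
    - name: Install development dependencies (example package list)
      dnf:
        name:
          - git
          - python3-devel
          - python3-virtualenv
        state: present

    - name: Create a Python virtualenv for the project (example path)
      command: python3 -m venv /home/vagrant/.virtualenvs/myproject
      args:
        creates: /home/vagrant/.virtualenvs/myproject
      become: false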

Below is a Vagrantfile that provisions a Fedora 25 virtual machine, updates it, and runs an Ansible playbook on it. You can place it in the root of your repository as Vagrantfile.example and instruct users to copy it to Vagrantfile and customize as they wish.

# -*- mode: ruby -*-
# vi: set ft=ruby :
#
# Copy this file to ``Vagrantfile`` and customize it as you see fit.

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  # If you'd prefer to pull your boxes from Hashicorp's repository, you can
  # replace the config.vm.box and config.vm.box_url declarations with the line below.
  #
  # config.vm.box = "fedora/25-cloud-base"
  config.vm.box = "f25-cloud-libvirt"
  config.vm.box_url = "https://download.fedoraproject.org/pub/fedora/linux/releases"\
    "/25/CloudImages/x86_64/images/Fedora-Cloud-Base-Vagrant-25-1"\
    ".3.x86_64.vagrant-libvirt.box"

  # Forward traffic on the host to the development server on the guest.
  # You can change the host port that is forwarded to 5000 on the guest
  # if you have other services listening on your host's port 80.
  config.vm.network "forwarded_port", guest: 5000, host: 80

  # This is an optional plugin that, if installed, updates the host's /etc/hosts
  # file with the hostname of the guest VM. In Fedora it is packaged as
  # ``vagrant-hostmanager``
  if Vagrant.has_plugin?("vagrant-hostmanager")
    config.hostmanager.enabled = true
    config.hostmanager.manage_host = true
  end

  # Vagrant can share the source directory using rsync, NFS, or SSHFS (with the
  # vagrant-sshfs plugin). Consult the Vagrant documentation if you do not want
  # to use SSHFS.
  config.vm.synced_folder ".", "/vagrant", disabled: true
  config.vm.synced_folder ".", "/home/vagrant/devel", type: "sshfs", sshfs_opts_append: "-o nonempty"

  # To cache update packages (which is helpful if frequently doing `vagrant destroy && vagrant up`)
  # you can create a local directory and share it to the guest's DNF cache. Uncomment
  # the lines below to create and use a dnf cache directory
  #
  # Dir.mkdir('.dnf-cache') unless File.exists?('.dnf-cache')
  # config.vm.synced_folder ".dnf-cache", "/var/cache/dnf", type: "sshfs", sshfs_opts_append: "-o nonempty"

  # Comment this line if you would like to disable the automatic update during provisioning
  config.vm.provision "shell", inline: "sudo dnf upgrade -y"

  # bootstrap and run with ansible
  config.vm.provision "shell", inline: "sudo dnf -y install python2-dnf libselinux-python"
  config.vm.provision "ansible" do |ansible|
    ansible.playbook = "devel/ansible/vagrant-playbook.yml"
  end

  # Create the "myproject" box
  config.vm.define "myproject" do |myproject|
    myproject.vm.host_name = "myproject.example.com"

    myproject.vm.provider :libvirt do |domain|
      # Season to taste
      domain.cpus = 4
      domain.graphics_type = "spice"
      domain.memory = 2048
      domain.video_type = "qxl"

      # Uncomment the following line if you would like to enable libvirt's unsafe cache
      # mode. It is called unsafe for a reason, as it causes the virtual host to ignore all
      # fsync() calls from the guest. Only do this if you are comfortable with the
      # possibility of your development guest becoming corrupted (in which case you should
      # only need to do a vagrant destroy and vagrant up to get a new one).
      #
      # domain.volume_cache = "unsafe"
    end
  end
end

2.1.3 Documentation

Since Fedora contributors live around the world and don’t often have the opportunity to meet in person, it’s important to maintain up-to-date high quality documentation for all our projects. Our preferred documentation tool is Sphinx. In fact, this documentation is written using Sphinx! A project’s documentation should at a minimum contain:

• An introduction to the project
• A user guide
• A contributor guide
• API documentation

The easiest way to maintain up-to-date documentation is to include the majority of the documentation in the code itself. Sphinx includes several extensions to turn Python documentation into HTML pages.

Note: Improving documentation is a great way to get involved in a project. When adding new documentation or cleaning up existing documentation, please follow the guidelines below.

Style

Sphinx supports three different documentation styles. By default, Sphinx expects ReStructuredText. However, it has included an extension to support the Google style and the NumPy style since version 1.3. The style of the documentation blocks is left up to the individual project, but it should document the choice and be consistent.

Introduction

The project introduction should be easy to find - preferably it should be the documentation’s index page. It should provide an overview of the project and should be easy for a complete newcomer to understand.

User Guide

Have a clear user guide that covers most, if not all, features of the project as well as potential use cases. Keep in mind that your users may be non-technical as well as technical. Some users will want to use the project’s web interface, while others are interested in the API; the documentation should make it easy for both types of users to find the documentation relevant to them.


Contributor Guide

Documenting how to start contributing makes it much easier for new contributors to get involved. This is a good place to cover the expectations about code style, documentation, tests, etc.

API Documentation

All APIs should be documented. Users should never have to consult the source code to use the project’s API.

Python

Python API documentation is easily generated by using the autodoc extension. Following these steps will create rich HTML, PDF, EPUB, or man format documentation:

1. All modules should contain a documentation block at the top of the file that describes the module’s purpose and documents module-level attributes it provides as part of its public interface. In the case of a package’s __init__.py, this should document the package’s purpose.

2. All classes should have documentation blocks that describe their purpose, any attributes they have, and example usage if appropriate.

3. All methods and functions should have documentation blocks that describe their purpose, the arguments they accept, the types of those arguments, and example usage if appropriate.

4. Make use of Sphinx’s cross-referencing feature. This will generate links between objects in the documentation.
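As an illustration only (the module, function, and attribute names here are invented), a Google style documentation block that autodoc can render might look like this:

"""Publish package update notifications.

This module provides helpers used by a hypothetical notification service.

Attributes:
    DEFAULT_TIMEOUT (int): Seconds to wait for the remote service to respond.
"""

DEFAULT_TIMEOUT = 30


def notify(package, recipients, timeout=DEFAULT_TIMEOUT):
    """Send an update notification for a package.

    Args:
        package (str): The name of the package that was updated.
        recipients (list of str): Email addresses to notify.
        timeout (int): Seconds to wait for the mail server.

    Returns:
        int: The number of notifications successfully sent.

    Raises:
        ValueError: If ``recipients`` is empty.

    Example:
        >>> notify('kernel', ['[email protected]'])
        1
    """
    if not recipients:
        raise ValueError('recipients must not be empty')
    # A real implementation would hand the message off to a mail library here.
    return len(recipients)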

HTTP APIs

Many projects provide an HTTP-based API. Use sphinxcontrib-httpdomain to produce the HTTP interface documentation. This task is made significantly easier if the project uses a web framework that sphinxcontrib-httpdomain supports, like Flask. In that case, all you need to do is add the sphinxcontrib-httpdomain ReStructuredText directives to the functions or classes that provide the Flask endpoints. After that, all you need to do is use the autoflask ReStructuredText directive.
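As a sketch of what that can look like (the endpoint, fields, and application are invented for illustration), a Flask view whose docstring autoflask can collect might read:

from flask import Flask, jsonify

app = Flask(__name__)


@app.route('/api/v1/packages/<name>')
def get_package(name):
    """
    Retrieve a package by name.

    **Example request**:

    .. sourcecode:: http

       GET /api/v1/packages/kernel HTTP/1.1
       Accept: application/json

    :param name: the package name
    :statuscode 200: package found
    :statuscode 404: no such package
    """
    # A real application would look the package up in a database here.
    return jsonify({'name': name})

In the project’s documentation you would then point the autoflask directive at the application object so these docstrings are rendered as HTTP API documentation.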

Release Notes and ChangeLog

The release notes (or the changelog) can be managed using towncrier. It can build a release notes file by assembling items that would be written in separate files by each pull request (or commit). This way, the different commits will not conflict by writing in the same changelog file, and a link to the issue, the pull request, or the commit is automatically inserted. In your project root, add a pyproject.toml file with a tool.towncrier section similar to the one in fedora-messaging. Create a news directory where the news items will be written, and in there create a _template.rst file with content similar to the one in fedora-messaging. Of course, replace fedora_messaging and the project URL with yours, where applicable. Then create a docs/changelog.rst file (location configured in the pyproject.toml file) with the following content:


=============
Release Notes
=============

.. towncrier release notes start

Then each commit can add a file in the news folder to document the change. The file has the source.type name format, where type is one of:

• feature: for new features
• bug: for bug fixes
• api: for API changes
• dev: for development-related changes
• author: for contributor names
• other: for other changes

And where the source part of the filename is:

• 42 when the change is described in issue 42
• PR42 when the change has been implemented in pull request 42, and there is no associated issue
• Cabcdef when the change has been implemented in changeset abcdef, and there is no associated issue or pull request
• username for contributors (author extension). It should be the username part of their commits’ email address.

A preview of the release notes can be generated with towncrier --draft. When running towncrier, the tool will write the changelog file and remove the individual news fragments. These changes can then be committed as part of the release commit.
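As a rough sketch only (the package name, directories, and fragment types below are placeholders; copy the real settings from fedora-messaging and adapt them), the tool.towncrier section of pyproject.toml could look like:

# pyproject.toml - sketch of a towncrier configuration; adjust the package
# name, news directory, and changelog location for your project.
[tool.towncrier]
package = "my_project"
directory = "news/"
filename = "docs/changelog.rst"
template = "news/_template.rst"

[[tool.towncrier.type]]
directory = "feature"
name = "Features"
showcontent = true

[[tool.towncrier.type]]
directory = "bug"
name = "Bug fixes"
showcontent = true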

2.1.4 Code Style

We attempt to maintain a consistent coding style across projects so contributors do not have to keep dozens of different styles in their head as they move from project to project.

Python

We follow the PEP8 style guide for Python. Projects should make it easy to check new code for PEP8 violations, preferably with flake8. It is up to the individual project to choose an enforcement method, but it should be clearly documented, and continuous integration tests should ensure code is correctly styled before merging pull requests.

Note: There are a few PEP8 rules which will vary from project to project. For example, the maximum line length might vary. The test suite should enforce this.

Enforcement

Projects should automatically enforce code style. How a project does so is up to the maintainers, but several good options are documented here.


Tox

Tox is an excellent way to test style. flake8 looks in, among other places, tox.ini for its configuration so if you’re already using tox this is a good place to place your configuration. For example, adding the following snippet to your tox.ini file will run flake8 as part of your test suite:

[testenv:lint]
deps =
    flake8 > 3.0
commands =
    python -m flake8 {posargs}

[flake8]
show-source = True
max-line-length = 100
exclude = .git,.tox,dist,*egg

Unit Test

If you’re not using tox, you can add a unit test to your test suite:

"""This module runs flake8 on a subset of the code base""" import os import subprocess import unittest

# Adjust this as necessary to ensure REPO_PATH resolves to the root of your # repository. REPO_PATH= os.path.abspath(os.path.dirname(os.path.join(os.path.dirname(__file__),'.

˓→./'))) class TestStyle(unittest.TestCase): """Run flake8 on the repository directory as part of the unit tests."""

def test_code_with_flake8(self): """Assert the code is PEP8-compliant"""

# enforced_paths = [ # 'mypackage/pep8subpackage/', # 'mypackage/a_module.py', #] # enforced_paths = [os.path.join(REPO_PATH, p) for p in enforced_paths]

# If your entire codebase is not already PEP8 compliant, you can enforce # the style incrementally using ``enforced_paths``. # # flake8_command = ['flake8', '--max-line-length', '100'] + enforced_paths flake8_command=['flake8','--max-line-length','100', REPO_PATH]

self.assertEqual(subprocess.call(flake8_command),0) if __name__ =='__main__': unittest.main()


Auto formatting

The Black tool is an automatic code formatter for Python. From the Black documentation: By using Black, you agree to cede control over minutiae of hand-formatting. In return, Black gives you speed, determinism, and freedom from pycodestyle nagging about formatting. You will save time and mental energy for more important matters. Black makes code review faster by producing the smallest diffs possible. Blackened code looks the same regardless of the project you’re reading. Formatting becomes transparent after a while and you can focus on the content instead. Your text editor is very likely to have a plugin to run Black on file saving. The documentation has instructions to set it up in your editor, including VS Code (and in Emacs). You can check that your code is properly formatted according to Black’s settings by adding the following snippet to your tox.ini file:

[testenv:format]
deps =
    black
commands =
    python -m black --check {posargs:.}

Remember to add format to your Tox envlist.

Javascript

Javascript files should be formatted using the prettier code formatter. It has support for many editors and can integrate with ESLint to check the code automatically.

2.1.5 Frameworks and Tools

We attempt to use the same set of frameworks and tools across projects to minimize the number of frameworks developers must keep in their heads.

Python

Flask

Flask is a web microframework for Python based on Werkzeug and Jinja 2. It is our preferred framework and all new applications should use it unless there is a very good reason not to do so.

Note: For historical reasons, you may find applications that don’t use Flask. Other frameworks currently in use include TurboGears2 and Pyramid.

Flask is designed to be extensible, so it’s common to use extensions with the core flask library. A few common extensions are documented below.


Flask-SQLAlchemy

Flask-SQLAlchemy integrates Flask with SQLAlchemy. It will configure a scoped session for you, set up a declarative base class, and provide a convenient flask_sqlalchemy.BaseQuery sub-class for you.

SQLAlchemy

SQLAlchemy is an SQL toolkit and Object Relational Mapper. It provides a core set of tools (unsurprisingly called SQLAlchemy Core) for working with SQL databases, as well as an Object Relational Mapper (SQLAlchemy ORM) which is built using SQLAlchemy Core. SQLAlchemy is quite flexible and provides a myriad of options. We use SQLAlchemy with its Declarative extension to map SQL tables to Python classes. Once mapped, instances of those Python classes are created from rows using the Session interface.
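As a brief, self-contained sketch of the Declarative and Session pattern described above (the model, table, and in-memory SQLite URL are examples only):

from sqlalchemy import Column, Integer, Text, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Package(Base):
    """An example mapped class; note that every table needs a primary key."""

    __tablename__ = 'packages'

    id = Column(Integer, primary_key=True)
    name = Column(Text, nullable=False, unique=True)


# An in-memory SQLite database keeps the example self-contained.
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
session.add(Package(name='kernel'))
session.commit()
print(session.query(Package).filter_by(name='kernel').one().id)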

2.1.6 Databases

We use PostgreSQL throughout Fedora Infrastructure.

Bi-directional Replication

Bi-directional replication (BDR) is a project that adds asynchronous multi-master logical replication to PostgreSQL. Fedora has a PostgreSQL deployment with BDR enabled. In Fedora, only one master is written to at any time. Applications are not required to use the BDR-enabled database, but it is encouraged since it provides redundancy and more flexibility for the system administrators. Applications need to take several things into account when considering whether or not to use BDR.

Primary Keys

All tables need to have primary keys.

Conflicts

BDR does not use any consensus algorithm or locking between nodes so writing to multiple masters can result in conflicts. There are several types of conflicts that can occur, and applications should carefully consider each one and be prepared to handle them. Some conflicts are handled automatically, while others can result in a deadlock that requires manual intervention.

Global DDL Lock

BDR uses a global DDL lock (across all PostgreSQL nodes) for DDL changes, which applications must explicitly acquire prior to emitting DDL statements. This can be done in Alembic by modifying the run_migrations_offline and run_migrations_online functions in env.py to emit the SQL when connecting to the database. An example of the run_migrations_offline:


def run_migrations_offline():
    """Run migrations in 'offline' mode.

    This requires a configuration option since it's not known whether the
    target database is a BDR cluster or not. Alternatively, you can simply
    add the SQL to the script manually and not bother with a setting.
    """
    url = config.get_main_option("sqlalchemy.url")
    context.configure(url=url)

    with context.begin_transaction():
        # If the configuration indicates this script is for a Postgres-BDR database,
        # then we need to acquire the global DDL lock before migrating.
        postgres_bdr = config.get_main_option('offline_postgres_bdr')
        if postgres_bdr is not None and postgres_bdr.strip().lower() == 'true':
            _log.info('Emitting SQL to allow for global DDL locking with BDR')
            context.execute('SET LOCAL bdr.permit_ddl_locking = true')
        context.run_migrations()

An example of the run_migrations_online function:

def run_migrations_online():
    """Run migrations in 'online' mode.

    This auto-detects when it's run against a Postgres-BDR system.
    """
    engine = engine_from_config(
        config.get_section(config.config_ini_section),
        prefix='sqlalchemy.',
        poolclass=pool.NullPool)

    connection = engine.connect()
    context.configure(connection=connection, target_metadata=target_metadata)

    try:
        try:
            connection.execute('SHOW bdr.permit_ddl_locking')
            postgres_bdr = True
        except exc.ProgrammingError:
            # bdr.permit_ddl_locking is an unknown option, so this isn't a BDR database
            postgres_bdr = False
        with context.begin_transaction():
            if postgres_bdr:
                _log.info('Emitting SQL to allow for global DDL locking with BDR')
                connection.execute('SET LOCAL bdr.permit_ddl_locking = true')
            context.run_migrations()
    finally:
        connection.close()

Be aware that long-running migrations will hold the global lock for the entire migration and while the global lock is held by a node, no other nodes may perform any DDL or make any changes to rows.


DDL Restrictions

BDR has a set of DDL Restrictions. Some of the restrictions are easily worked around by performing the task in several steps, while other operations are simply not available.

2.1.7 Tests

Tests make development easier for both veteran project contributors and newcomers alike. Most projects use the unittest framework for tests so you should familiarize yourself with this framework.

Note: Writing tests can be a great way to get involved with a project. It’s an opportunity to get familiar with the codebase and the code submission and review process. Check the project’s code coverage and write a test for a piece of code missing coverage!

Patches should be accompanied by one or more tests to demonstrate the feature or bugfix works. This makes the review process much easier since it allows the reviewer to run your code with very little effort, and it lets developers know when they break your code.

Test Organization

Having a standard test layout makes it easy to find tests. When adding new tests, follow these guidelines:

1. Each module in the application should have a corresponding test module. These modules should be organized in the test package to mirror the package they test. That is, if the package contains the server/push.py module, the test module should be in a module called server/test_push.py.

2. Within each test module, follow the unittest code organization guidelines.

3. Include documentation blocks for each test case that explain the goal of the test.

4. Avoid using mock unless absolutely necessary. It’s easy to write tests using mock that only assert that mock works as expected. When testing code that makes HTTP requests, consider using vcrpy.
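For illustration only (the package, module, and function below are invented), a test module mirroring mypackage/server/push.py could look like:

# tests/server/test_push.py - mirrors mypackage/server/push.py
"""Unit tests for the hypothetical mypackage.server.push module."""

import unittest

from mypackage.server import push


class TestFormatTopic(unittest.TestCase):
    """Tests for the (hypothetical) push.format_topic function."""

    def test_simple_topic(self):
        """Assert the topic is assembled from its parts."""
        self.assertEqual(
            push.format_topic('bodhi', 'update', 'complete'),
            'org.fedoraproject.prod.bodhi.update.complete',
        )


if __name__ == '__main__':
    unittest.main()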

Note: You may find projects that do not follow this test layout. In those cases, consider re-organizing the tests to follow the layout described here; until that happens, follow the established conventions for that project.

Test Runners

Projects should include a way to run the tests with ease locally and the steps to run the tests should be documented. This should be the same way the continuous integration (Jenkins, TravisCI, etc.) tool runs the tests. There are many test runners available that can discover unittest based tests. These include:

• unittest itself via python -m unittest discover
• pytest
• nose2

Projects should choose whichever runner best suits them.


Note: You may find projects using the nose test runner. nose is in maintenance mode and, according to their documentation, will likely cease without a new maintainer. They recommend using unittest, pytest, or nose2.

Tox

Tox is an easy way to run your project’s tests (using a Python test runner) using multiple Python interpreters. It also allows you to define arbitrary test environments, so it’s an excellent place to run the code style tests and to ensure the project’s documentation builds without errors or warnings. Here’s an example tox.ini file that runs a project’s unit tests in Python 2.7, Python 3.4, Python 3.5, and Python 3.6. It also runs flake8 on the entire codebase and builds the documentation with the “warnings treated as errors” Sphinx flag enabled. Finally, it enforces 100% coverage on lines edited by new patches using diff-cover:

[tox]
envlist = py27,py34,py35,py36,lint,diff-cover,docs
# If the user is missing an interpreter, don't fail
skip_missing_interpreters = True

[testenv]
deps =
    -rtest-requirements.txt
# Substitute your test runner of choice
commands =
    py.test
# When running in OpenShift you don't have a username, so expanduser
# won't work. If you are running your tests in CentOS CI, this line is
# important so the tests can pass there, otherwise tox will fail to find
# a home directory when looking for configuration files.
passenv = HOME

[testenv:diff-cover]
deps =
    diff-cover
commands =
    diff-cover coverage.xml --compare-branch=origin/master --fail-under=100

[testenv:docs]
changedir = docs
deps =
    sphinx
    sphinxcontrib-httpdomain
    -rrequirements.txt
whitelist_externals =
    mkdir
    sphinx-build
commands =
    mkdir -p _static
    sphinx-build -W -b html -d {envtmpdir}/doctrees . _build/html

[testenv:lint]
deps =
    flake8 > 3.0
commands =
    python -m flake8 {posargs}

[flake8]
show-source = True
max-line-length = 100
exclude = .git,.tox,dist,*egg

Coverage

coverage is a good way to collect test coverage statistics. pytest has a pytest-cov plugin that integrates with coverage, and nose-cov provides integration for the nose test runner. diff-cover can be used to ensure that all lines edited in a patch have coverage. It’s possible (and recommended) to have the test suite fail if the coverage percentage goes down. This is an example .coveragerc:

[run]
# Track what conditional branches are covered.
branch = True
include =
    my_python_package/*

[report]
# Fail if the coverage is not 100%
fail_under = 100
# Display results with up to 1/100th of a percent accuracy.
precision = 2
exclude_lines =
    pragma: no cover
    # Don't complain if tests don't hit defensive assertion code
    raise AssertionError
    raise NotImplementedError
    if __name__ == .__main__.:
omit =
    my_python_package/tests/*

To configure pytest to collect coverage data on your project, edit setup.cfg and add this block, substituting yourpackage with the name of the Python package you are measuring coverage on:

[tool:pytest]
addopts = --cov-config .coveragerc --cov=yourpackage --cov-report term --cov-report xml --cov-report html

The fail_under setting causes coverage (and any test runner plugins using coverage) to fail if the coverage level is not 100%. New projects should enforce 100% test coverage. Existing projects should ensure test coverage does not drop when accepting a pull request, and should increase the minimum test coverage until it is 100%.

Note: coverage has great exclusion support, so you can exclude individual lines, conditional branches, functions, classes, and whole source files from your coverage report. If you have code that doesn’t make sense to have tests for, you can exclude it from your coverage report. Remember to leave a comment explaining why it’s excluded!


Licenses

The liccheck checker can verify that every dependency in your project has an acceptable license. The dependencies are checked recursively. The licenses are validated against a set of acceptable licenses that you define in a file called .license_strategy.ini in your project directory. Here is an example of such a file, that would accept Free licenses:

[Licenses]
authorized_licenses:
    bsd
    new bsd
    simplified bsd
    apache
    apache 2.0
    apache software
    gnu lgpl
    gpl v2
    gpl v3
    lgpl with exceptions or zpl
    isc
    isc license (iscl)
    mit
    python software foundation
    zpl 2.1

The verification is case-insensitive, and is done on both the license and the classifiers metadata fields. See liccheck’s documentation for more details. You can automate the license check with the following snippet in your tox.ini file:

[testenv:licenses]
deps =
    liccheck
commands =
    liccheck -s .license_strategy.ini

Remember to add licenses to your Tox envlist.

Security

The bandit checker is designed to find common security issues in Python code. You can add it to the tests run by Tox by adding the following snippet to your tox.ini file:

[testenv:bandit]
deps =
    bandit
commands =
    bandit -r your_project/ -x your_project/tests/ -ll

Remember to add bandit to your Tox envlist.

2.1.8 Authentication

Fedora applications that require authentication should support, at a minimum, authentication against Ipsilon. Ipsilon is an Identity Provider that uses a separate Identity Management system to perform authentication. In Fedora, Ipsilon is currently backed by the Fedora Account System. In the future, it will be backed by FreeIPA.


Ipsilon supports OpenID 2.0, OpenID Connect, OAuth 2.0, and more.

Authentication

All new applications should use OpenID Connect for user authentication.

Note: Many existing applications use OpenID 2.0 and should eventually migrate to OpenID Connect.

OpenID Connect is an authentication layer built on top of OAuth 2.0, so you should first be familiar with OAuth 2.0 and its various flows before learning about OpenID Connect. When requesting an access token in OAuth 2.0, clients are allowed to specify the scope of the access token. This scope indicates what the token is allowed to be used for. In most cases, your application should require a scope or scopes of its own so users can issue access tokens that can only be used with a particular application. To do so, consult the Authentication Wiki page.

Warning: OpenID Connect requires that the “openid” scope is requested. Failing to do so will result in undefined behavior. In the case of Ipsilon, you won’t have access to the UserInfo or receive an ID token.

Libraries

OAuthLib

OAuthLib is a low-level implementation of OAuth 2.0 with OpenID Connect support. It does not tie itself to an HTTP request framework. Typically, you will only use this library indirectly. If you are investigating this library, note that it is a library for both OAuth clients and OAuth providers. You will be most interested in the OAuth client sub-package.

Requests-OAuthlib

Requests-OAuthlib uses the Requests library with OAuthLib to provide an easy-to-use interface for OAuth 2.0 clients. If you need to add support to an application that doesn’t have an extension for OAuthLib, you should use this library.
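As a rough sketch of the OAuth 2.0 authorization code flow with Requests-OAuthlib (the client ID, secret, redirect URI, scopes, and provider endpoints below are all placeholders; consult the provider’s documentation for the real values):

from requests_oauthlib import OAuth2Session

# All of these values are placeholders for illustration.
client_id = 'my-app'
client_secret = 'not-a-real-secret'
redirect_uri = 'https://myapp.example.com/oauth/callback'
authorization_endpoint = 'https://id.example.com/openidc/Authorization'
token_endpoint = 'https://id.example.com/openidc/Token'

# The "openid" scope is required for OpenID Connect; the second scope is an
# example of an application-specific scope.
oauth = OAuth2Session(client_id, redirect_uri=redirect_uri,
                      scope=['openid', 'https://myapp.example.com/oauth/my-scope'])

# Step 1: send the user to the provider to authenticate.
auth_url, state = oauth.authorization_url(authorization_endpoint)
print('Visit this URL to authorize:', auth_url)

# Step 2: after the user is redirected back, exchange the code for tokens.
# ``authorization_response`` is the full callback URL the provider redirected to.
token = oauth.fetch_token(
    token_endpoint,
    authorization_response='https://myapp.example.com/oauth/callback?code=...&state=...',
    client_secret=client_secret,
)
print(token.get('access_token'))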

Flask-OAuthlib

Flask-OAuthlib is a Flask extension that builds on top of Requests-OAuthlib. It comes with plenty of examples in the examples directory of the repository. Flask applications within Fedora Infrastructure should use this extension unless there is a good reason not to (and that reason is documented here).

Pyramid-OAuthLib

Pyramid-OAuthLib is a Pyramid extension that uses OAuthlib. It does not appear to be actively maintained, but it is a reasonable starting point for our few Pyramid applications.

Flask-OIDC

Flask-OIDC is a Flask extension for OpenID Connect.


Mozilla-Django-OIDC

Mozilla-Django-OIDC is a Django extension for OpenID Connect.

2.1.9 fedmsg

fedmsg is a ZeroMQ-based messaging library used throughout Fedora Infrastructure applications. It uses a publish/subscribe design so applications can decide what messages they’re interested in receiving.

Warning: fedmsg does not guarantee message delivery. Messages will be lost and your application should never depend on the reliable delivery of fedmsgs to function.

Topics

Existing Topics

There are many existing topics in Fedora Infrastructure.

New Topics

When creating new message topics, please use the following format:

org.fedoraproject.ENV.CATEGORY.OBJECT[.SUBOBJECT].EVENT

Where:

• ENV is one of dev, stg, or production.
• CATEGORY is the name of the service emitting the message – something like koji, bodhi, or fedoratagger.
• OBJECT is something like package, user, or tag.
• SUBOBJECT is something like owner or build (in the case where OBJECT is package, for instance).
• EVENT is a verb like update, create, or complete.

All ‘fields’ in a topic should:

• Be singular (use package, not packages)
• Use existing fields as much as possible (since complete is already used by other topics, use that instead of finished).

Furthermore, the body of messages will contain the following envelope:

• A topic field indicating the topic of the message.
• A timestamp indicating the seconds since the epoch when the message was published.
• A msg_id bearing a unique value distinguishing the message. It is typically of the form <year>-<UUID>. These can be used to uniquely query for messages in the datagrepper web services.
• A crypto field indicating if the message is signed with the X509 method or the gpg method.
• An i field indicating the sequence of the message if it comes from a permanent service.


• A username field indicating the username of the process that published the message (sometimes apache or fedmsg or something else).
• Lastly, the application-specific body of the message will be contained in a nested msg dictionary.
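Putting the envelope together, a received fedmsg might look roughly like the following Python dictionary; every value is illustrative:

# An illustrative fedmsg envelope; all values here are made up.
example_message = {
    'topic': 'org.fedoraproject.prod.bodhi.update.complete',
    'timestamp': 1504879200,
    'msg_id': '2017-9c6ed6c0-5564-4e0c-9a3f-3a1c1f2d9a6b',
    'crypto': 'x509',
    'i': 1,
    'username': 'apache',
    'msg': {
        # The application-specific body goes here.
        'update': 'FEDORA-2017-abcdef1234',
        'status': 'stable',
    },
}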

2.1.10 Messaging

Fedora uses many event-driven services triggered by messages. In the past, this was done with ZeroMQ and fedmsg. This has been replaced by an AMQP message broker and the fedora-messaging library for Python applications. This documentation outlines the policies for sending and receiving messages. To learn how to send and receive messages, see the fedora-messaging documentation.

Broker URLs

The broker consists of multiple RabbitMQ nodes. They are available through the proxies at amqps://rabbitmq.fedoraproject.org for production and amqps://rabbitmq.stg.fedoraproject.org for staging. Clients can connect using these URLs both inside and outside the Fedora VPN, but users outside need to use a separate virtual host. Consult the fedora-messaging documentation for details on how to connect externally.

Identity

In order to help with debugging, clients must configure the client_properties option to include their application name under the app key. Clients should include the application version, if possible, in the app_version key.
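In the fedora-messaging TOML configuration, this might look roughly like the following (the application name and version are examples):

# fedora-messaging configuration snippet; "my-app" and its version are examples.
[client_properties]
app = "my-app"
app_version = "1.0.0"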

Authentication

When applications are deployed, clients must authenticate with the message broker over a TLS connection using x509 certificates. There are configuration options for this in fedora-messaging. Clients require certificates issued by Fedora Infrastructure. If you’re not using the external, read-only user, file a ticket requesting a certificate for the AMQP broker and be sure to provide the username you plan to use. This is placed in the Common Name of the client certificate and must match the name of the user you create in AMQP. Consult the Authorization section below for details on creating users, queues, and bindings.

Authorization

The message broker can use virtual hosts to allow multiple applications to use the broker. The general purpose publish-subscribe virtual host is called /pubsub; its authorization policy is outlined below. If your application is using a different virtual host for private messaging (for example, your application uses Celery), different authorization rules apply.

pubsub Virtual Host

AMQP clients do not have permission to create exchanges, queues, or bindings. However, they can and should declare the exchanges, queues, and bindings they expect to exist in their fedora-messaging configuration so that if they do not exist, the application will fail with a helpful error message about which resource is not available.

Warning: Because AMQP clients don’t have permission to create objects, you need to set passive_declares = true or you will get 403 Permission Denied errors.


Users, exchanges, queues, bindings, virtual hosts, and other objects are managed in the broker using the Fedora Infrastructure Ansible project and must be declared there. To do so, you can use the rabbit/queue role in the Ansible repository. An example usage in your deployment playbook would be:

roles:
  - role: rabbit/queue
    username: bodhi
    queue_name: bodhi_masher
    routing_keys:
      - "routing_key1"
      - "routing_key2"

Note that users only have permission to read from queues prefixed with their name, so if your username is “bodhi”, all queues must start with “bodhi”. The username must also match the common name in the x509 certificate used for authentication. If you want to create a user that will only need to publish messages, and not consume them, you can use the rabbit/user role in the Ansible repository. An example usage in your deployment playbook would be:

roles:
  - role: rabbit/user
    username: bodhi

Please note that the username must match the TLS certificate’s Common Name, and that usernames are created with the environment suffix. As a result, if you want the username to match in staging too, you should use:

username:"bodhi{{ env_suffix}}"

Bindings

Messages from AMQP publishers are sent to the amq.topic exchange. Messages from ZeroMQ publishers are sent to the zmq.topic exchange. In order to receive all messages during the transition period from ZeroMQ to AMQP, be sure to bind your consumers to both exchanges:

[[bindings]]
queue = "your queue"
exchange = "amq.topic"
routing_keys = ["key1", "key2"]

[[bindings]]
queue = "your queue"
exchange = "zmq.topic"
routing_keys = ["key1", "key2"]

2.1.11 Developing Standard Operating Procedures

When a new application is deployed in Fedora, it is critical that you add a standard operating procedure (SOP) for it. This documents how the application is deployed in Fedora. Consult the current Standard Operating Procedures and if one is missing, please add it. You can modify this documentation or any of the current Standard Operating Procedures by making a pull request to the Pagure project.


Adding a Standard Operating Procedure

To add a standard operating procedure, create a new reStructuredText file in the sop directory and then add it to the index file. SOP file names should use lowercase with dashes, describe the service, and end with “.rst”.

Stuff every SOP should have

Here’s the template for adding a new SOP:

=========
SOP Title
=========

Provide a brief description of the SOP here.

Contact Information
===================

Owner
Contact
Location
Servers
Purpose

Sections Describing Things
==========================

Put detailed information in these sections.

A Helpful Sub-section
---------------------

You can even have sub-sections.

A Sub-section of a sub-section
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If a current SOP does not follow this general template, it should be updated to do so.

SOP Formatting

SOPs are written in ReStructuredText. To learn about the spec, read:

• Quickstart
• Quick references
• Full Specification
• Sphinx reStructuredText

The format is somewhat simple if you remember a few key points:

• Sections are delineated by underlined text. The convention is:
  – The title has “=” above and below the title text, at least as many columns as the title itself.


  – Top level sections are underlined by “===” - at least as many columns as the section title in the line above.
  – Second level sections are underlined by “---”.
  – Any of ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ are valid section delineators. If you need more than two section levels, choose between them but be sure to be consistent.

• Indents are significant. Only indent for things like block quotes, nested lists, etc. Match the tabstop of the document you are editing. Make note of the indentation level as you nest lists.

• Use literal blocks for code and command sections. An indented section found after :: and a newline will be processed as a literal block. Like this:

Literal blocks can be nested into lists (think a numbered sequence of steps):

1. Log into the thing

2. run a command::

       this is indented relative to the first content column of the list
       so it is a block quote

   This line begins at the first content column of the list, so it is
   considered a continuation of the list content.

3. Log out of the thing.

• For inline literals (commands, filenames, anything that wouldn’t make sense if translated, use your judgement) use double backticks, like this:

You should specify your Fedora username and ssh key in ``~/.ssh/config`` to make connecting better.

• If nesting and mixing content types, use newlines liberally. A bullet list doesn’t need newlines, but if a list item’s content spans more than one line, a newline is required. If a list is nested, the top level list should have newlines between list members.

2.1.12 Source Control

Pagure

If your project is hosted on Pagure, you should go to the project settings and set “Project tags” to have fedora-infra in it. This way your pull requests will appear on http://ambre.pingoured.fr/fedora-infra/ automatically.

2.1.13 OpenShift

OpenShift is a Kubernetes-based platform for running containers. The upstream project, OpenShift Origin, is what the OpenShift Container Platform product is based on. Fedora runs OpenShift Container Platform rather than OpenShift Origin.


Getting Started

If you’ve never used OpenShift before a good place to start is with MiniShift, which deploys OpenShift Origin in a virtual machine.

OpenShift in Fedora Infrastructure

Fedora has two OpenShift deployments: Staging OpenShift and Production OpenShift. In addition to being the staging deployment of OpenShift itself, the staging deployment is intended to be a place for developers to deploy the staging version of their applications. Some features of OpenShift are not functional in Fedora’s deployment, mainly due to the lack of HTTP/2 support (at the time of this writing). Additionally, users are not allowed to alter configuration, roll out new deployments, run builds, etc. in the web UI or CLI.

Web User Interface

Some of the web user interface is currently non-functional since it requires HTTP/2. The rest is locked down to be read-only, making it of limited usefulness.

Command-line Interface

Although the CLI is also locked down to be read only, it is possible to view logs and request debugging containers, but only from batcave01. For example, to view the logs of a deployment in staging:

$ ssh batcave01.phx2.fedoraproject.org
$ oc login os-master01.stg.phx2.fedoraproject.org
You must obtain an API token by visiting https://os.stg.fedoraproject.org/oauth/token/request

$ oc login os-master01.stg.phx2.fedoraproject.org --token=
$ oc get pods
librariesio2fedmsg-28-bfj52   1/1   Running   522   28d
$ oc logs librariesio2fedmsg-28-bfj52

Deploying Your Application

Applications are deployed to OpenShift using Ansible playbooks. You will need to create an Ansible Role for your application. A role is made up of several YAML files that define OpenShift objects. To create these YAML objects you have two options:

1. Copy and paste an existing role and do your best to rewrite all the files to work for your application. You will likely make mistakes which you won’t find until you run the playbook, and when you do learn that your configuration is invalid, it won’t be clear where you messed up.

2. Set up your own deployment of OpenShift where you can click through the web UI to create your application (and occasionally use the built-in text editor when the UI doesn’t have buttons for a feature you need). Once you’ve done that, you can export all the configuration files and drop them into the infra ansible repository. They will be “messy” with lots of additional data OpenShift adds for you (including old revisions of the configuration).

Both approaches have their downsides. #1 has a very long feedback cycle as you edit the file, commit it to the infra repository, and then run the playbook. #2 generates most of the configuration, but will produce crufty files.


Additionally, you will likely not have your OpenShift deployment set up the same way Fedora does, so you still may produce configurations that won’t work. You will likely need (at a minimum) the following objects:

• A BuildConfig - This defines how your container is built.
• An ImageStream - This references a “stream” of container images and lets you trigger deployments or image builds based on changes in a stream.
• A DeploymentConfig - This defines how your container is deployed (how many replicas, what ports are available, etc.).
• A Service - An internal load balancer that routes traffic to your pods.
• A Route - This exposes a Service as a host name.
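As a taste of what two of these objects can look like (all names, labels, hosts, and ports below are placeholders, and the real roles in the infra repository may differ), a Service and a Route might be defined as:

# Illustrative only - object names, labels, host, and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: myproject
spec:
  selector:
    app: myproject
  ports:
    - name: web
      port: 8080
      targetPort: 8080
---
# Newer clusters may use apiVersion: route.openshift.io/v1 for Routes.
apiVersion: v1
kind: Route
metadata:
  name: myproject
spec:
  host: myproject.apps.example.com
  to:
    kind: Service
    name: myproject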

2.1.14 Fedora Infrastructure Application Security Policy

This document sets out the security requirements applications must meet at a minimum to pass the security audit, and as such run in Fedora Infrastructure. This is by no means a comprehensive list, but it is a minimum set.

General

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119. The key words “MUST (BUT WE KNOW YOU WON’T)”, “SHOULD CONSIDER”, “REALLY SHOULD NOT”, “OUGHT TO”, “WOULD PROBABLY”, “MAY WISH TO”, “COULD”, “POSSIBLE”, and “MIGHT” in this document are to be interpreted as described in RFC 6919.

Static security checking

If written in Python, the application MUST pass Bandit on level medium with default configuration. Any exclusion lines that appear in the codebase MUST be sufficiently explained. Code that is only executed during test suite runs MAY be exempted from this check.

Authentication

The application MUST use OpenID Connect to authenticate users. The application MUST use an approved authentication library. If the application supports an API, it SHOULD accept OpenID Connect Bearer tokens, which MUST be verified by the approved authentication library. The application MUST NOT accept any user credentials, but MUST forward the user to the OpenID Connect Provider in their browser. If the application supports API tokens that are not OpenID Connect Bearer tokens, they MUST be generated by the application via a Cryptographically Secure Pseudo-Random Number Generator. The application REALLY SHOULD NOT return error code 418 at any moment unless it is applicable.
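As a small sketch, in Python such a token can be generated with the standard library’s secrets module:

import secrets

# 32 bytes of entropy, URL-safe; ideally store only a hash of the token
# server-side rather than the token itself.
api_token = secrets.token_urlsafe(32)
print(api_token)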

Authorization

API tokens, whether OpenID Connect Bearer or custom, SHOULD allow the user to limit the scope of the token in a useful and clear way. The application SHOULD use authorization if provided by the authentication library, if it does not, this MUST be pointed out during the audit request so that specific review is performed.


Data exchange formats

The application MUST NOT use the Python pickle library for data exchange. If the application uses the PyYAML library, it MUST NOT use yaml.load, but MUST use yaml.safe_load. If the application uses XML data exchange, it MUST use the defusedxml library to process this data.

User input sanitization

Special care must be taken when processing user generated content. The application SHOULD use a common database abstraction layer (e.g. SQLAlchemy or Django ORM) that has protections against crafted input, and these protections MUST be used. Requests that are not part of an API call MUST be protected against cross-site request forgery.

Cookies

The application MUST set the Secure flag on any cookies it sets if it is not in a development mode. The application MUST set the httpOnly flag on any cookies it sets. The application SHOULD NOT set a Domain parameter in any cookies it sets; if it does set the Domain, its value MUST be identical to the exact Host requested.

Security headers

The application MUST set the X-Frame-Options header, and its value SHOULD be DENY, unless there are specific reasons it should be inserted into a frame. Setting anything other than DENY is a flag for review. The application MUST set the X-Xss-Protection header, and the value MUST be 1; mode=block. The application MUST set the X-Content-Type-Options header, and the value MUST be nosniff. The application MUST set the Referrer-Policy header, and the value MUST be no-referrer or same-origin. The application MUST set the Content-Security-Policy header and MUST set at least default-src. The content security policy MUST NOT allow any origins other than 'none', 'self', any of the explicitly approved origins (listed below), or nonce-$nonce. Any nonces used for the content security policy MUST be generated via a Cryptographically Secure PRNG. The allowed origin at this moment is: https://apps.fedoraproject.org.
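As a hedged sketch for a Flask application (the Content-Security-Policy value here is only an example and must be adapted to the policy above):

from flask import Flask

app = Flask(__name__)


@app.after_request
def apply_security_headers(response):
    """Attach the security headers required by this policy to every response."""
    response.headers['X-Frame-Options'] = 'DENY'
    response.headers['X-Xss-Protection'] = '1; mode=block'
    response.headers['X-Content-Type-Options'] = 'nosniff'
    response.headers['Referrer-Policy'] = 'same-origin'
    # Example policy only; adjust the sources to your application's needs.
    response.headers['Content-Security-Policy'] = "default-src 'self'"
    return response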

Dependencies

The application MUST use up-to-date, maintained dependencies. The application MAY set minimum versions on dependencies, but MUST NOT set maximum versions.

Resources

The application MUST only include resources in produced HTML that are served via TLS.

2.1.15 Audit

The list of requirements in this document are a set of minimum requirements. Any deviation from them MUST be mentioned when requesting a security audit and MAY be reason for rejecting the security audit. Even if all these requirements are met, the auditor MAY reject the application on well-explained grounds.


2.2 System Administrator Guide

Welcome to the Fedora Infrastructure system administration guide.

2.2.1 Getting Started

If you haven’t already, you should complete the general Getting Started guide. Once you’ve completed that, you’re ready to get involved in the Fedora Infrastructure Apprentice group.

Fedora Infrastructure Apprentice

The Fedora Infrastructure Apprentice group in the Fedora Account System grants read-only access to many Fedora infrastructure machines. This group is used for new folks to look around at the infrastructure setup, check machines and processes and see where they might like to contribute moving forward. This also allows apprentices to examine and gather info on problems, then propose solutions.

Note: This group will be pruned often of inactive folks who miss the monthly email check-in on the infrastructure mailing list. There’s nothing personal in this and you’re welcome to re-join later when you have more time, we just want to make sure the group only has active members.

Members of the Fedora Infrastructure Apprentice group have ssh/shell access to many machines, but no sudo rights or ability to commit to the Ansible repository (though they do have read-only access). Apprentices can, however, contribute to the infrastructure documentation by making a pull request to the infra-docs repository. Access is via the bastion.fedoraproject.org machine and from there to each machine. See the SSH Access Infrastructure SOP for instructions on how to set up SSH. You can see a list of hosts that allow apprentice access by using:

$ ./scripts/hosts_with_var_set -i inventory/ -o ipa_client_shell_groups=fi-apprentice

from a checkout of the Ansible repository. The Ansible repository is hosted on pagure.io at https://pagure.io/fedora-infra/ansible.git.

Selecting a Ticket

Start by checking out the easyfix tickets. Tickets marked with this tag are a good place for apprentices to learn how things are setup, and also contribute a fix. Since apprentices do not have commit access to the Ansible repository, you should make your change, produce a patch with git diff, and attach it to the infrastructure ticket you are working on. It will then be reviewed.

2.2.2 Standard Operating Procedures

Below is a table of contents containing all the standard operating procedures for Fedora Infrastructure applications. For information on how to write a new standard operating procedure, consult the guide on Developing Standard Operating Procedures.

Two factor auth

Fedora Infrastructure has implemented a form of two factor auth for people who have sudo access on Fedora machines. In the future we may expand this to include more than sudo but this was deemed to be a high value, low hanging fruit.


Using two factor

http://fedoraproject.org/wiki/Infrastructure_Two_Factor_Auth

To enroll a Yubikey, use the fedora-burn-yubikey script like normal. To enroll using FreeOTP or Google Authenticator, go to https://admin.fedoraproject.org/totpcgiprovision/

What’s enough authentication?

FAS Password+FreeOTP or FAS Password+Yubikey. Note: don’t actually enter a “+”; simply enter your FAS password and press your Yubikey or enter your FreeOTP code.

Administrating and troubleshooting two factor

Two factor auth is implemented by a modified copy of the https://github.com/mricon/totp-cgi project doing the authentication and pam_url submitting the authentication tokens. totp-cgi runs on the fas servers (currently fas01.stg and fas01/fas02/fas03 in production), listening on port 8443 for pam_url requests. FreeOTP, Google Authenticator and Yubikeys are supported as tokens to use with your password.

FreeOTP, Google authenticator:

The FreeOTP application is preferred; however, Google Authenticator works as well. (Note that Google Authenticator is not open source.) This is handled via totpcgi. There’s a command line tool to manage users, totpprov. See ‘man totpprov’ for more info. Admins can use this tool to revoke lost tokens (Google Authenticator only) with ‘totpprov delete-user username’. To enroll using FreeOTP or Google Authenticator for production machines, go to https://admin.fedoraproject.org/totpcgiprovision/ To enroll using FreeOTP or Google Authenticator for staging machines, go to https://admin.stg.fedoraproject.org/totpcgiprovision/ You’ll be prompted to login with your FAS username and password. Note that staging and production differ.

YubiKeys:

Yubikeys are enrolled and managed in FAS. Users can self-enroll using the fedora-burn-yubikey utility included in the fedora-packager package.

What do I do if I lose my token?

Send an email to [email protected] that is encrypted/signed with your gpg key from FAS, or otherwise identi- fies you are you.


How to remove a token (so the user can re-enroll)?

First we MUST verify that the user is who they say they are, using any of the following:
• Personal contact where the person can be verified by a member of sysadmin-main.
• Correct answers to security questions.
• Email request to [email protected] that is gpg encrypted by the key listed for the user in FAS.
Then:
1. For Google Authenticator:
   1. ssh into batcave01 as root
   2. ssh into os-master01.iad2.fedoraproject.org
   3. $ oc project fas
   4. $ oc get pods
   5. $ oc rsh (pick one of the totpcgi pods from the above list)
   6. $ totpprov delete-user
2. For Yubikey: login to one of the fas machines and run: /usr/local/bin/yubikey-remove.py username
The user can then go to https://admin.fedoraproject.org/totpcgiprovision/ and reprovision a new device.
If the user emails [email protected] with the signed request, make sure to reply to all indicating that a reset was performed. This is so that other admins don't step in and reset it again after it's been reset once.
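Put together, the Google Authenticator reset looks roughly like the session below; the pod name is hypothetical, so pick a real totpcgi pod from the oc get pods output and substitute the affected username:

$ ssh root@batcave01
$ ssh os-master01.iad2.fedoraproject.org
$ oc project fas
$ oc get pods
$ oc rsh totpcgi-1-abcde
$ totpprov delete-user someuser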

Account Deletion SOP

For the most part we do not delete accounts. In the case that a deletion is paramount, it will need to be coordinated with appropriate entities. Disabling accounts is another story but is limited to those with the appropriate privileges. Reasons for accounts to be disabled can be one of the following: • Person has placed SPAM on the wiki or other sites. • It is seen that the account has been compromised by a third party. • A person wishes to leave the Fedora Project and wants the account disabled.

Contents

• Disabling
  – Disable Accounts
  – Disable Groups
• User Requested disables
• Renames
  – Rename Accounts
  – Rename Groups
• Deletion
  – Delete Accounts
  – Delete Groups

Disable

Disabling accounts is the easiest to accomplish as it just blocks people from using their account. It does not remove the account name and associated UID so we don’t have to worry about future, unintentional collisions.

Disable Accounts

To begin with, accounts should not be disabled until there is a ticket in the Infrastructure ticketing system. After that the contents of the ticket need to be verified (to make sure people aren't playing pranks or someone is in a crappy mood). This needs to be logged in the ticket (who looked, what they saw, etc). Then the account can be disabled:

ssh db02
sudo -u postgres psql fas2

fas2=# begin;
fas2=# select * from people where username = 'FOOO';

Here you need to verify that the account looks right, that there is only one match, or other issues. If there are multiple matches you need to contact one of the main sysadmin-db's on how to proceed:

fas2=# update people set status = 'admin_disabled' where username = 'FOOO';
fas2=# commit;
fas2=# \q

Disable Groups

There is no explicit way to disable groups in FAS2. Instead, we close the group for adding new members and optionally remove existing members from it. This can be done from the web UI if you are an administrator of the group or you are in the accounts group. First, go to the group info page. Then click the (edit) link next to Group Details. Make sure that the Invite Only box is checked. This will prevent other users from requesting the group on their own.
If you want to remove the existing users, view the Group info, then click on the View Member List link. Click on All under the Results heading. Then go through and click on Remove for each member.
Doing this in the database instead can be quicker if you have a lot of people to remove. Once again, this requires someone in sysadmin-db to do the work:

ssh db02
sudo -u postgres psql fas2

fas2=# begin;
fas2=# update groups set invite_only = true where name = 'FOOO';
fas2=# commit;
fas2=# begin;
fas2=# select p.name, g.name, r.role_status from people as p, person_roles as r, groups as g where p.id=r.person_id and g.id=r.group_id and g.name='FOOO';
fas2=# -- Make sure that the list of users in the groups looks correct
fas2=# delete from person_roles where person_roles.group_id = (select id from groups where name = 'FOOO');
fas2=# -- number of rows in both of the above should match
fas2=# commit;
fas2=# \q

User Requested Disables

According to our Privacy Policy, a user may request that their personal information be removed from FAS if they want to disable their account. We can do this, but we need to do some extra work beyond simply setting the account status to disabled.

Record User’s CLA information

If the user has signed the CLA/FPCA, then they may have contributed something to Fedora that we'll need to contact them about at a later date. For that, we need to keep at least the following information:
• Fedora username
• human name
• email address
All of this information should be on the CLA email that is sent out when a user signs up. We need to verify with spot (Tom Callaway) that he has that record. If not, we need to get it to him. Something like:

select id, username, human_name, email, telephone, facsimile, postal_address from people where username='USERNAME';

and send it to spot to keep.

Remove the personal information

The following sequence of db commands should do it:

fas2=# begin;
fas2=# select * from people where username = 'USERNAME';

Here you need to verify that the account looks right, that there is only one match, or other issues. If there are multiple matches you need to contact one of the main sysadmin-db's on how to proceed:

fas2=# update people set human_name = '', gpg_keyid = null, ssh_key = null, unverified_email = null, comments = null, postal_address = null, telephone = null, facsimile = null, affiliation = null, ircnick = null, status = 'inactive', locale = '', timezone = null, latitude = null, longitude = null, country_code = null, email = '[email protected]' where username = 'USERNAME';

Make sure only one record was updated:

fas2=# select * from people where username = 'USERNAME';

Make sure the correct record was updated, then:

fas2=# commit;

Note: The email address is both not null and unique in the database. Due to this, you need to set it to a new string for every user who requests deletion like this.

Renames

In general, renames do not require as much work as deletions but they still require coordination. This is because renames do not change the UID/GID but some of our applications save information based on username/groupname rather than UID/GID.

Rename Accounts

Warning: Needs more eyes This list may not be complete.

• Check the databases for koji, pkgdb, and bodhi for occurrences of the old username and update them to the new username.
• Check fedorapeople.org for home directories and repositories under the old username that would need to be renamed.
• Check (or ask the user to check and update) mailing list subscriptions on fedorahosted.org and lists.fedoraproject.org under the old [email protected] email alias.
• Check whether the user has a [email protected] bugzilla account in python-fedora and update that. Also ask the user to update that in bugzilla.
• If the user is in a sysadmin-* group, check for home directories on bastion and other infrastructure boxes that are owned by them and need to be renamed (could also just tell the user to backup any files there themselves b/c they're getting a new home directory).
• grep through ansible for occurrences of the username (see the sketch after this list).
• Check for entries in trac on fedorahosted.org for the username as an "Assigned to" or "CC" entry.
• Add other places to check here
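As a rough sketch of the ansible grep (the username is a placeholder):

$ cd /srv/web/infra/ansible
$ grep -ri 'oldusername' .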

Rename Groups

Warning: Needs more eyes This list may not be complete.

• grep through ansible for occurrences of the group name.
• Check for group-members, group-admins, [email protected] email alias presence in any fedorahosted.org or lists.fedoraproject.org mailing list.
• Check for entries in trac on fedorahosted.org for the username as an "Assigned to" or "CC" entry.
• Add other places to check here


Deletion

Deletion is the toughest one to audit because it requires that we look through our systems for the UID and GID in addition to looking for the username and group name. The UID and GID are used on things like filesystem permissions, so we have to look there as well. Not catching these places may lead to security issues should the UID/GID ever be reused.

Note: Recommended to rename instead. When not strictly necessary to purge all traces of an account, it's highly recommended to rename the user or group to something like DELETED_oldusername instead of deleting. This avoids the problems and additional checking that we have to do below.

Delete Accounts

Warning: Needs more eyes This list may be incomplete. Needs more people to look at this and find places that may need to be updated

• Check everything for the #Rename Accounts case.
• Figure out what boxes a user may have had access to in the past. This means you need to look at all the groups a user may ever have been approved for (even if they are not approved for those groups now). For instance, any git*, svn*, bzr*, hg* groups would have granted access to hosted03 and hosted04. packager would have granted access to pkgs.fedoraproject.org. Pretty much any group grants access to fedorapeople.org.
• For those boxes, run a find over the files there to see if the UID owns any files on the system:

# find / -uid 100068 -print

Any files owned by that uid must be reassigned to another user or removed.
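A hedged sketch of that cleanup, reusing the example UID above with a hypothetical replacement owner; the first command reassigns the files, the second removes them instead:

# find / -uid 100068 -exec chown newowner {} \;
# find / -uid 100068 -delete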

Warning: What to do about backups? Backups pose a special problem as they may contain the uid that’s being removed. Need to decide how to handle this

• Add other places to check here

Delete Groups

Warning: Needs more eyes This list may be incomplete. Needs more people to look at this and find places that may need to be updated

• Check everything for the #Rename Groups case.
• Figure out what boxes may have had files owned by that group. This means that you'd need to look at the users in that group, what boxes they have shell accounts on, and then look at those boxes. Groups used for hosted would also need to add hosted03 and hosted04 to that list and the box that serves the hosted mailing lists.
• For those boxes, run a find over the files there to see if the GID owns any files on the system:


# find / -gid 100068 -print

Any files owned by that GID must be reassigned to another group or removed.
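Similarly for groups, a sketch with a hypothetical replacement group; reassign with the first command or remove with the second:

# find / -gid 100068 -exec chgrp newgroup {} \;
# find / -gid 100068 -delete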

Warning: What to do about backups? Backups pose a special problem as they may contain the gid that’s being removed. Need to decide how to handle this

• Add other places to check here

Anitya Infrastructure SOP

Anitya is used by Fedora to track upstream project releases and maps them to downstream distribution packages, including (but not limited to) Fedora.
Anitya staging instance: https://stg.release-monitoring.org
Anitya production instance: https://release-monitoring.org
Anitya project page: https://github.com/fedora-infra/anitya

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, #fedora-apps
Persons zlopez
Location iad2.fedoraproject.org
Servers
  Production • os-master01.iad2.fedoraproject.org
  Staging • os-master01.stg.iad2.fedoraproject.org
Purpose Map upstream releases to Fedora packages.

Hosts

The current deployment is made up of the release-monitoring OpenShift namespace.

release-monitoring

This OpenShift namespace runs the following pods:
• The apache/mod_wsgi application for release-monitoring.org
• A libraries.io SSE client
• A service checking for new releases
This OpenShift project relies on:
• A postgres db server running in OpenShift
• Lots of external third-party services. The anitya webapp can scrape pypi, rubygems.org, sourceforge and many others on command.
• Lots of external third-party services. The check service makes all kinds of requests out to the Internet that can fail in various ways.
• Fedora messaging RabbitMQ hub for publishing messages
Things that rely on this host:
• The New Hotness is a fedora messaging consumer running in Fedora Infra in OpenShift. It listens for Anitya messages from here and performs actions on koji and bugzilla.

Releasing

The release process is described in Anitya documentation.

Deploying

The staging deployment of Anitya runs in OpenShift on os-master01.stg.iad2.fedoraproject.org. To deploy the staging instance of Anitya you need to push changes to the staging branch on Anitya GitHub. A GitHub webhook will then automatically deploy a new version of Anitya on staging.
The production deployment of Anitya runs in OpenShift on os-master01.iad2.fedoraproject.org. To deploy the production instance of Anitya you need to push changes to the production branch on Anitya GitHub. A GitHub webhook will then automatically deploy a new version of Anitya on production.

Configuration

To deploy the new configuration, you need ssh access to batcave01.iad2.fedoraproject.org and permissions to run the Ansible playbook. All the following commands should be run from batcave01. First, ensure there are no configuration changes required for the new update. If there are, update the Ansible anitya role(s) and optionally run the playbook:

$ sudo rbac-playbook openshift-apps/release-monitoring.yml

The configuration changes could be limited to staging only using:

$ sudo rbac-playbook openshift-apps/release-monitoring.yml -l staging

This is recommended for testing new configuration changes.

Upgrading

Staging

To deploy a new version of Anitya, push changes to the staging branch on Anitya GitHub. A GitHub webhook will then automatically deploy the new version of Anitya on staging.


Production

To deploy a new version of Anitya, push changes to the production branch on Anitya GitHub. A GitHub webhook will then automatically deploy the new version of Anitya on production. Congratulations! The new version should now be deployed.

Administrating release-monitoring.org

The Anitya web application offers some functionality to administer itself. User admin status is tracked in the Anitya database. Admin users can grant or revoke admin privileges to users in the users tab. Admin users have additional functionality available in the web interface. In particular, admins can view flagged projects, remove projects and remove package mappings, etc. For more information see the Admin user guide in the Anitya documentation.

Flags

Anitya lets users flag projects for administrator attention. This is accessible to administrators in the flags tab.

Monitoring

To monitor the activity of Anitya you can connect to Fedora infra OpenShift and look at the state of pods. For staging look at the release-monitoring namespace in staging OpenShift instance. For production look at the release-monitoring namespace in production OpenShift instance.

Troubleshooting

This section contains various issues encountered during deployment or configuration changes and possible solutions.

Fedmsg messages aren’t sent

Issue: Fedmsg messages aren't sent.
Solution: Set the USER environment variable in the pod.
Explanation: Fedmsg uses the USER env variable as the username inside messages. Without USER set it just crashes and doesn't send anything.

Cronjob is crashing

Issue: The cronjob pod is crashing on start, even after a configuration change that should fix the behavior.
Solution: Restart the cronjob. This can be done by OPS.
Explanation: Every time the cronjob is executed after a crash, it tries to reuse the pod with the bad configuration instead of creating a new one with the new configuration.


Database migration is taking too long

Issue: Database migration is taking a few hours to complete.
Solution: Stop every pod and cronjob before the migration.
Explanation: When creating a new index or doing some other complex operation on the database, the migration script needs exclusive access to the database.

Old version is deployed instead of the new one

Issue: The pod is deployed with an old version of Anitya, but it says that it was triggered by the correct commit.
Solution: Set dockerStrategy in buildconfig.yml to noCache.
Explanation: OpenShift caches the layers of docker containers by default, so if there is no change in the Dockerfile it will just use the cached version and not run the commands again.

Ansible infrastructure SOP/Information.

Background

Fedora infrastructure used to use func and puppet for system change management. We are now using ansible for all system change management and ad-hoc tasks.

Overview

Ansible runs from batcave01 or backup01. These hosts run an ssh-agent that has unlocked the ansible root ssh private key. (This is unlocked manually by a human with the passphrase each reboot; the passphrase itself is not stored anywhere on the machines.) Using 'sudo -i', sysadmin-main members can use this agent to access any machines with the ansible root ssh public key setup, either with 'ansible' for one-off commands or 'ansible-playbook' to run playbooks.
Playbooks are idempotent (or should be), meaning you should be able to re-run the same playbook over and over and it should get to a state where 0 items are changing.
Additionally (see below) there is a rbac wrapper that allows members of some other groups to run playbooks against specific hosts.
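As a rough illustration (the host pattern is an example, not a prescription), a sysadmin-main member on batcave01 might run a one-off command like this:

$ sudo -i
# ansible proxies -m command -a 'uptime'

Playbook runs work the same way with ansible-playbook, or through the rbac wrapper described below.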

GIT repositories

There are 2 git repositories associated with Ansible: • The Fedora Infrastructure Ansible repository and replicas.

Caution: This is a public repository. Never commit private data to this repo.

This repository exists as several copies or replicas: – The “upstream” repository on Pagure. https://pagure.io/fedora-infra/ansible


This repository is the public facing place where people can contribute (e.g. pull requests) as well as the authoritative source. Members of the sysadmin FAS group or the fedora-infra Pagure group have commit access to this repository.
To contribute changes, fork the repository on Pagure and submit a Pull Request. Someone from the aforementioned groups can then review and merge them.
It is recommended that you configure git to use pull --rebase by default by running git config --bool pull.rebase true in your ansible clone directory. This configuration prevents unneeded merges which can occur if someone else pushes changes to the remote repository while you are working on your own local changes.
– Two bare mirrors on batcave01, /srv/git/ansible.git and /srv/git/mirrors/ansible.git

Caution: These are public repositories. Never commit private data to these repositories. Don’t commit or push to these repos directly, unless Pagure is unavailable.

The mirror_pagure_ansible service on batcave01 receives bus messages about changes in the repository on Pagure, fetches these into /srv/git/mirrors/ansible.git and pushes from there to /srv/git/ansible.git. When this happens, various actions are triggered via git hooks:

* The working copy at /srv/web/infra/ansible is updated.
* A mail about the changes is sent to sysadmin-members.
* The changes are announced on the message bus, which in turn triggers announcements on IRC.
You can check out the repo locally on batcave01 with:

git clone /srv/git/ansible.git

If the Ansible repository on Pagure is unavailable, members of the sysadmin group may commit directly, provided this procedure is followed: 1. The synchronization service is stopped and disabled:

sudo systemctl disable --now mirror_pagure_ansible.service

2. Changes are applied to the repository on batcave01. 3. After Pagure is available again, the changes are pushed to the repository there. 4. The synchronization service is enabled and started:

sudo systemctl enable --now mirror_pagure_ansible.service

– /srv/web/infra/ansible on batcave01, the working copy from which playbooks are run.

Caution: This is a public repository. Never commit private data to this repo. Don’t commit or push to this repo directly, unless Pagure is unavailable.

You can also access it via a cgit web interface at: https://pagure.io/fedora-infra/ansible/


• /srv/git/ansible-private on batcave01.

Caution: This is a private repository for passwords and other sensitive data. It is not available in cgit, nor should it be cloned or copied remotely.

This repository is only accessible to members of ‘sysadmin-main’.

Cron job/scheduled runs

With the use of run_ansible-playbook_cron.py, which is run daily via cron, we walk through the playbooks and run them with the --check --diff params to perform a dry-run. This way we make sure all the playbooks are idempotent and there are no unexpected changes on servers (or playbooks).
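The same kind of dry-run can be done by hand against a single playbook; a sketch only, with an example playbook path:

$ sudo -i
# ansible-playbook /srv/web/infra/ansible/playbooks/groups/proxies.yml --check --diff

Nothing is changed on the servers; ansible only reports what would change and shows the diffs.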

Logging

We have in place a callback plugin that stores history for any ansible-playbook runs and then sends a report each day to sysadmin-logs-members with any CHANGED or FAILED actions. Additionally, there's a fedmsg plugin that reports start and end of ansible playbook runs to the fedmsg bus. Ansible also logs to syslog verbose reporting of when and what commands and playbooks were run.

role based access control for playbooks

There's a wrapper script on batcave01 called 'rbac-playbook' that allows non sysadmin-main members to run specific playbooks against specific groups of hosts. This is part of the ansible_utils package. The upstream for ansible_utils is: https://bitbucket.org/tflink/ansible_utils
To add a new group:
1. add the playbook name and sysadmin group to the rbac-playbook (ansible-private repo)
2. add that sysadmin group to sudoers on batcave01 (also in ansible-private repo)
To use the wrapper:

sudo rbac-playbook playbook.yml

Directory setup

Inventory

The inventory directory tells ansible all the hosts that are managed by it and the groups they are in. All files in this dir are concatenated together, so you can split out groups/hosts into separate files for readability. They are in ini file format.
Additionally, under the inventory directory are host_vars and group_vars subdirectories. These are files named for the host or group and containing variables to set for that host or group. You should strive to set variables in the highest level possible, and precedence is in: global, group, host order.
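A minimal sketch of how this fits together; the group, host and variable names below are invented for illustration:

inventory/example                    (ini format)
    [webservers]
    web01.example.fedoraproject.org
    web02.example.fedoraproject.org

inventory/group_vars/webservers      (variables for every host in the group)
    ---
    max_mem_size: 4096

inventory/host_vars/web01.example.fedoraproject.org   (overrides for one host)
    ---
    max_mem_size: 8192

Host variables win over group variables, which win over global ones, so set a value at the highest level that applies.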


Vars

This directory contains global variables as well as OS specific variables. Note that in order to use the OS specific ones you must have ‘gather_facts’ as ‘True’ or ansible will not have the facts it needs to determine the OS.

Roles

Roles are a collection of tasks/files/templates that can be used on any host or group of hosts that all share that role. In other words, roles should be used except in cases where configuration only applies to a single host. Roles can be reused between hosts and groups and are more portable/flexible than tasks or specific plays.
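For illustration only (the role and group names are made up), a role lives under roles/<name>/ with the usual ansible layout and is applied from a group playbook:

roles/webserver/
    tasks/main.yml
    handlers/main.yml
    templates/httpd.conf.j2
    files/

playbooks/groups/webservers.yml:
    - hosts: webservers
      roles:
        - webserver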

Scripts

In the ansible git repo under scripts are a number of utility scripts for sysadmins.

Playbooks

In the ansible git repo there’s a directory for playbooks. The top level contains utility playbooks for sysadmins. These playbooks perform one-off functions or gather information. Under this directory are hosts and groups playbooks. These playbooks are for specific hosts and groups of hosts, from provision to fully configured. You should only use a host playbook in cases where there will never be more than one of that thing.

Tasks

This directory contains one-off tasks that are used in playbooks. Some of these should be migrated to roles (we had this setup before roles existed in ansible). Those that are truly only used on one host/group could stay as isolated tasks.

Syntax

Ansible now warns about deprecated syntax. Please fix any cases you see related to deprecation warnings. Templates use the jinja2 syntax.

Libvirt virtuals

• TODO: add steps to make new libvirt virtuals in staging and production • TODO: merge in new-hosts.txt

Cloud Instances

• TODO: add how to make new cloud instances • TODO: merge in from ansible README file.

rdiff-backups

see: https://fedora-infra-docs.readthedocs.io/en/latest/sysadmin-guide/sops/rdiff-backup.html

Additional Reading/Resources

Upstream docs: https://docs.ansible.com/
Example repos with all kinds of examples:
• https://github.com/ansible/ansible-examples
• https://gist.github.com/marktheunissen/2979474
Jinja2 docs: http://jinja.pocoo.org/docs/

apps-fp-o SOP

Updating and maintaining the landing page at https://apps.fedoraproject.org/

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-admin
Servers: proxy0*
Purpose: Have a nice landing page for all our webapps.

Description

We have a number of webapps, many of which our users don't know about. This page was created so there was a central place where users could stumble through them and learn.
The page is generated by an ansible role in ansible/roles/apps-fp-o/. It makes use of an RPM package, the source code for which is at https://github.com/fedora-infra/apps.fp.o
You can update the page by updating the apps.yaml file in that ansible role. When ansible is run next, the two ansible handlers should see your changes and regenerate the static html and json data for the page.
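As a purely illustrative sketch of what an entry in apps.yaml might look like (the field names and values here are assumptions, not the actual schema; check the existing file in the role before editing):

- name: Example App
  url: https://example.fedoraproject.org
  icon: example.png
  description: A sentence or two describing what the app does.

After the change is committed and ansible runs, the handlers regenerate the static html and json for the page.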

How to Archive Old Fedora Releases

The Fedora download servers contain terabytes of data, and to allow mirrors to not have to carry all of that data, infrastructure regularly moves data of end-of-life releases (from /pub/fedora/) to the archives section (/pub/archive/fedora/linux).


Steps Involved

1. Log into batcave01.phx2.fedoraproject.org and ssh to bodhi-backend01:

   $ sudo -i ssh [email protected]
   # su - ftpsync

2. Then change into the releases directory:

   $ cd /pub/fedora/linux/releases

3. Check to see that the target directory doesn't already exist:

   $ ls /pub/archive/fedora/linux/releases/

4. If the target directory does not already exist, do a recursive link copy of the tree you want to the target:

   $ cp -lvpnr 21 /pub/archive/fedora/linux/releases/21

5. If the target directory already exists, then we need to do a recursive rsync to update any changes in the trees since the previous copy:

   $ rsync -avAXSHP --delete ./21/ /pub/archive/fedora/linux/releases/21/

6. We now do the updates and updates/testing in similar ways:

   $ cd ../updates/
   $ cp -lpnr 21 /pub/archive/fedora/linux/updates/21
   $ cd testing
   $ cp -lpnr 21 /pub/archive/fedora/linux/updates/testing/21

   Alternative if this is a later refresh of an older copy:

   $ cd ../updates/
   $ rsync -avAXSHP 21/ /pub/archive/fedora/linux/updates/21/
   $ cd testing
   $ rsync -avAXSHP 21/ /pub/archive/fedora/linux/updates/testing/21/

7. Do the same with fedora-secondary.
8. Announce to the mirror list this has been done and that in 2 weeks you will move the old trees to archives.
9. In two weeks, log into mm-backend01 and run the archive script:

   sudo -u mirrormanager mm2_move-to-archive --originalCategory="Fedora Linux" --archiveCategory="Fedora Archive" --directoryRe='/21/Everything'

10. If there are problems, the postgres DB may have issues and so you need to get a DBA to update the backend to fix items.
11. Wait an hour or so, then you can remove the files from the main tree:

   ssh bodhi-backend01
   cd /pub/fedora/linux
   cd releases/21
   ls # make sure you have stuff here
   rm -rf *
   ln ../20/README .
   cd ../../updates/21
   ls # make sure you have stuff here
   rm -rf *
   ln ../20/README .
   cd ../testing/21
   ls # make sure you have stuff here
   rm -rf *
   ln ../20/README .

This should complete the archiving.

Fedora ARM Infrastructure

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main, sysadmin-releng
Location Phoenix
Servers arm01, arm02, arm03, arm04


Purpose Information on working with the arm SOCs

Description

We have 4 arm chassis in phx2, each containing 24 SOCs (System On Chip). Each chassis has 2 physical network connections going out from it. The first one is used for the management interface on each SOC. The second one is used for eth0 for each SOC.
Current allocations (2016-03-11):
arm01  primary builders attached to koji.fedoraproject.org
arm02  primary arch builders attached to koji.fedoraproject.org
arm03  In cloud network, public qa/packager and copr instances
arm04  primary arch builders attached to koji.fedoraproject.org

Hardware Configuration

Each SOC has:
• eth0 and eth1 (unused) and a management interface
• 4 cores
• 4GB ram
• a 300GB disk
SOCs are addressed by: arm{chassisnumber}-builder{number}.arm.fedoraproject.org

Where chassisnumber is 01 to 04 and number is 00-23

PXE installs

Kickstarts for the machines are in the kickstarts repo. PXE config is on noc01 (or cloud-noc01.cloud.fedoraproject.org for arm03). The kickstart installs the latest Fedora and sets the machines up with a base package set.

IPMI tool Management

The SOCs are managed via their mgmt interfaces using a custom ipmitool as well as a custom python script called 'cxmanage'. The ipmitool changes have been submitted upstream and cxmanage is under review in Fedora. The ipmitool is currently installed on noc01 and it has the ability to talk to them on their management interface. noc01 also serves dhcp and is a pxeboot server for the SOCs. However you will need to add it to your path:

export PATH=$PATH:/opt/calxeda/bin/


Some common commands:

To set the SOC to boot the next time only with pxe:

ipmitool -U admin -P thepassword -H arm03-builder11-mgmt.arm.fedoraproject.org chassis bootdev pxe

To set the SOC power off:

ipmitool -U admin -P thepassword -H arm03-builder11-mgmt.arm.fedoraproject.org power off

To set the SOC power on:

ipmitool -U admin -P thepassword -H arm03-builder11-mgmt.arm.fedoraproject.org power on

To get a serial over lan console from the SOC:

ipmitool -U admin -P thepassword -H arm03-builder11-mgmt.arm.fedoraproject.org -I lanplus sol activate

DISK mapping

Each SOC has a disk. They are however mapped to the internal 00-23 in a non direct manner:

HDD Bay   EnergyCard   SOC (Port)   SOC Num
0         0            3            03
1         0            0            00
2         0            1            01
3         0            2            02
4         1            3            07
5         1            0            04
6         1            1            05
7         1            2            06
8         2            3            11
9         2            0            08
10        2            1            09
11        2            2            10
12        3            3            15
13        3            0            12
14        3            1            13
15        3            2            14
16        4            3            19
17        4            0            16
18        4            1            17
19        4            2            18
20        5            3            23
21        5            0            20
22        5            1            21
23        5            2            22

Looking at the system from the front, the bay numbering starts from left to right.

cxmanage

The cxmanage tool can be used to update firmware or gather diag info. Until cxmanage is packaged, you can use it from a python virtualenv:

virtualenv --system-site-packages cxmanage
cd cxmanage
source bin/activate
pip install --extra-index-url=http://sources.calxeda.com/python/packages/ cxmanage
deactivate

Some cxmanage commands:

cxmanage sensor arm03-builder00-mgmt.arm.fedoraproject.org
Getting sensor readings...
1 successes | 0 errors | 0 nodes left | .
MP Temp 0
    arm03-builder00-mgmt.arm.fedoraproject.org: 34.00 degrees C
    Minimum : 34.00 degrees C
    Maximum : 34.00 degrees C
    Average : 34.00 degrees C
...(and about 20 more sensors)...

cxmanage info arm03-builder00-mgmt.arm.fedoraproject.org
Getting info...
1 successes | 0 errors | 0 nodes left | .
[ Info from arm03-builder00-mgmt.arm.fedoraproject.org ]
Hardware version   : EnergyCard X04
Firmware version   : ECX-1000-v2.1.5
ECME version       : v0.10.2
CDB version        : v0.10.2
Stage2boot version : v1.1.3
Bootlog version    : v0.10.2
A9boot version     : v2012.10.16-3-g66a3bf3
Uboot version      : v2013.01-rc1_cx_2013.01.17
Ubootenv version   : v2013.01-rc1_cx_2013.01.17
DTB version        : v3.7-4114-g34da2e2

firmware update:

cxmanage --internal-tftp 10.5.126.41:6969 --all-nodes fwupdate package ECX-1000_update-v2.1.5.tar.gz arm03-builder00-mgmt.arm.fedoraproject.org

(Note that this runs against the 00 management interface for the chassis and updates all the nodes, and that we must run a tftp server on port 6969 for firewall handling.)

Links

http://sources.calxeda.com/python/packages/cxmanage/


Contacts [email protected] is the contact to send repair requests to.

Ask Fedora SOP

This SOP covers setting up https://ask.fedoraproject.org, based on Askbot, as a question and answer support forum for the Fedora community. The production instance can be seen at https://ask.fedoraproject.org and the staging instance is at http://ask.stg.fedoraproject.org/. This page describes how to set up and customize it from scratch.

Contents

1. Contact Information
2. Creating database
3. Setting up the forum
4. Adding administrators
5. Change settings within the forum
6. Database tweaks
7. Debugging

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin
Persons anyone from the sysadmin team
Sponsor nirik
Location phx2
Servers ask01, ask01.stg
Purpose To host Ask Fedora

Creating database

We use the postgresql database backend. To add the database to a postgresql server:

# psql -U postgres
postgres# create user askfedora with password 'xxx';
postgres# create database askfedora;
postgres# ALTER DATABASE askfedora owner to askfedora;
postgres# \q;

Now set up the db tables if this is a new install:


python manage.py syncdb
python manage.py migrate askbot
python manage.py migrate django_authopenid  # embedded login application

Setting up the forum

Askbot is packaged and available in Rawhide, Fedora 16 and EPEL 6. On a RHEL 6 system, you need to install the EPEL 6 repo first:

# yum install askbot

The /etc/askbot/sites/ask/conf/settings.py file should look something like:

DATABASE_ENGINE='postgresql_psycopg2'
DATABASE_NAME='testaskbot'
DATABASE_USER='askbot'
DATABASE_PASSWORD='xxxxx'
DATABASE_HOST='127.0.0.1'
DATABASE_PORT='5432'

# Outgoing mail server settings #
DEFAULT_FROM_EMAIL='[email protected]'
EMAIL_SUBJECT_PREFIX='[Askfedora]'
EMAIL_HOST='127.0.0.1'
EMAIL_PORT='25'

# This variable points to the Askbot plugin which will be used for user
# authentication. Not enabled yet because we don't need FAS auth but use
# Fedora id as a openid provider.
#
# ASKBOT_CUSTOM_AUTH_MODULE = 'authfas'

Now the Ask Fedora website should be accessible from the browser.

Adding administrators

As of Askbot version 0.7.21, the first user who logs in automatically becomes the administrator. In previous versions, you have to do the following.:

# cd /etc/askbot/sites/ask/conf/
# python manage.py add_admin 1
Do you really wish to make user (id=1, name=pjp) a site administrator? yes/no: yes

Once a user is marked as an administrator, he or she can go into anyone's profile, go to the "moderation" tab at the end and mark them as administrator or moderator, as well as block or suspend a user.

Change settings within the forum

• Data entry and display:
  – Disable "Allow asking questions anonymously"
  – Enable "Force lowercase the tags"
  – Change "Format of tag list" to "cloud"
  – Change "Minimum length of search term for Ajax search" to "3"
  – Change "Number of questions to list by default" to "50"
  – Change "What should "unanswered question" mean?" to "Question has no answers"
• Email and email alert settings
  – Change "Default news notification frequency" to "Instantly"
• Flatpages - about, privacy policy, etc.
  – Change "Text of the Q&A forum About page (html format)" to the following:

Ask Fedora provides a community edited knowledge base and support forum for the Fedora community. Make sure you read the FAQ and search for existing questions before asking yours. If you want to provide feedback, just ask a question on this site! Tag your questions "meta" to highlight your questions to the administrators of Ask Fedora.

• Login provider settings
  – Disable "Activate local login"
• Q&A forum website parameters and urls
  – Change "Site title for the Q&A forum" to "Ask Fedora: Community Knowledge Base and Support Forum"
  – Change "Comma separated list of Q&A site keywords" to "Ask Fedora, forum, community, support, help"
  – Change "Copyright message to show in the footer" to "All content is under Creative Commons Attribution Share Alike License. Ask Fedora is community maintained and Red Hat or Fedora Project is not responsible for content"
  – Change "Site description for the search engines" to "Ask Fedora: Community Knowledge Base and Support Forum"
  – Change "Short name for your Q&A forum" to "Ask Fedora"
  – Change "Base URL for your Q&A forum, must start with http or https" to "http://ask.fedoraproject.org"
• Sidebar widget settings - main page
  – Disable "Show avatar block in sidebar"
  – Disable "Show tag selector in sidebar"
• Skin and User Interface settings
  – Upload "Q&A site logo"
  – Upload "Site favicon". Must be a ICO format file because that is the only one IE supports as a fav icon.
  – Enable "Apply custom style sheet (CSS)"
  – Upload the following custom CSS:

#ab-main-nav a {
    color: #333333;
    background-color: #d8dfeb;
    border: 1px solid #888888;
    border-bottom: none;
    padding: 0px 12px 3px 12px;
    height: 25px;
    line-height: 30px;
    margin-right: 10px;
    font-size: 18px;
    font-weight: 100;
    text-decoration: none;
    display: block;
    float: left;
}

#ab-main-nav a.on {
    height: 24px;
    line-height: 28px;
    border-bottom: 1px solid #0a57a4;
    border-right: 1px solid #0a57a4;
    border-top: 1px solid #0a57a4;
    border-left: 1px solid #0a57a4;
    /*background:#A31E39; */
    background: #0a57a4;
    color: #FFF;
    font-weight: 800;
    text-decoration: none
}

#ab-main-nav a.special {
    font-size: 18px;
    color: #072b61;
    font-weight: bold;
    text-decoration: none;
}

/* tabs stuff */
.tabsA { float: right; }
.tabsC { float: left; }

.tabsA a.on, .tabsC a.on, .tabsA a:hover, .tabsC a:hover {
    background: #fff;
    color: #072b61;
    border-top: 1px solid #babdb6;
    border-left: 1px solid #babdb6;
    border-right: 1px solid #888a85;
    border-bottom: 1px solid #888a85;
    height: 24px;
    line-height: 26px;
    margin-top: 3px;
}

.tabsA a.rev.on, tabsA a.rev.on:hover {
    padding: 0px 2px 0px 7px;
}

.tabsA a, .tabsC a {
    background: #f9f7eb;
    border-top: 1px solid #eeeeec;
    border-left: 1px solid #eeeeec;
    border-right: 1px solid #a9aca5;
    border-bottom: 1px solid #888a85;
    color: #888a85;
    display: block;
    float: left;
    height: 20px;
    line-height: 22px;
    margin: 5px 0 0 4px;
    padding: 0 7px;
    text-decoration: none;
}

.tabsA.label, .tabsC.label {
    float: left;
    font-weight: bold;
    color: #777;
    margin: 8px 0 0 0px;
}

.tabsB a {
    background: #eee;
    border: 1px solid #eee;
    color: #777;
    display: block;
    float: left;
    height: 22px;
    line-height: 28px;
    margin: 5px 0px 0 4px;
    padding: 0 11px 0 11px;
    text-decoration: none;
}

a {
    color: #072b61;
    text-decoration: none;
    cursor: pointer;
}

div.side-box {
    width: 200px;
    padding: 10px;
    border: 3px solid #CCCCCC;
    margin: 0px;
    background: -moz-linear-gradient(top, #DDDDDD, #FFFFFF);
}

Database tweaks

To automatically delete expired sessions, we run a trigger that makes PostgreSQL delete them upon inserting a new one. The code used to create this trigger was:

askfedora=# CREATE FUNCTION delete_old_sessions() RETURNS trigger
askfedora-#   LANGUAGE plpgsql
askfedora-#   AS $$
askfedora$# BEGIN
askfedora$#   DELETE FROM django_session WHERE expire_date

In case this trigger causes any problems, please remove it by running:

DROP TRIGGER old_sessions_gc;

To make this perform, we have a custom index that's not in upstream askbot; please remember to add that when recreating the trigger:

CREATE INDEX CONCURRENTLY django_session_expire_date ON django_session (expire_date);

If you deleted the trigger, or reinstalled without trigger, please make sure to run manage.py clean_sessions regularly, so you don’t end up with a database that’s too massive in size.
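One way to do that is a system cron entry; this is only a sketch, with a made-up schedule (the conf directory is the one used earlier in this SOP):

# /etc/cron.d/askbot-clean-sessions
0 4 * * * root cd /etc/askbot/sites/ask/conf/ && python manage.py clean_sessions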

Debugging

Set DEBUG to True in the settings.py file and restart Apache.

Auth issues

Users can login to ask with a variety of social media accounts. Once they login with one they can attach other ones as well. If a user forgets what social media account they used, you can look in the database.
Login to the database host (db01.phx2.fedoraproject.org):

# sudo -u postgres psql askfedora
psql> select * from django_authopenid_userassociation where user_id like '%username%';

If they can login again with the same auth, ask them to do so. If not, you can add the fedora account system openid auth to allow them to login with that:

psql> insert into django_authopenid_userassociation (user_id, openid_url, provider_name) VALUES (2595, 'http://name.id.fedoraproject.org', 'fedoraproject');

Use the ID from the previous query and replace name with the user's FAS name.

Amazon Web Services Access

AWS includes a highly granular set of access policies, which can be combined into roles and groups. Ipsilon is used to translate between IAM policy groupings and groups in the Fedora Account System (FAS). Tags and namespaces are used to keep role resources separate.

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin


Persons nirik, pfrields
Location ?
Servers N/A
Purpose Provide AWS resource access to contributors via FAS group membership.

Accessing the AWS Console

To access the AWS Console via Ipsilon authentication, use this SAML link. You must be in the aws-iam FAS group (or another group with access) to perform this action.

Adding a role to AWS IAM

Sign into AWS via the URL above, and visit Identity and Access Management (IAM) in the Security, Identity and Compliance tools. Choose Roles to view current roles. Confirm there is not already a role matching the one you need. If not, create a new role as follows:
1. Select Create role.
2. Select SAML 2.0 federation.
3. Choose the SAML provider id.fedoraproject.org, which should already be populated as a choice from previous use.
4. Select the attribute SAML:aud. For value, enter https://signin.aws.amazon.com/saml. Do not add a condition. Proceed to the next step.
5. Assign the appropriate policies from the pre-existing IAM policies. It's unlikely you'll have to create your own, which is outside the scope of this SOP. Then proceed to the next step.
6. Set the role name and description. It is recommended you use the same role name as the FAS group for clarity. Fill in a longer description to clarify the purpose of the role. Then choose Create role.
Note or copy the Role ARN (Amazon Resource Name) for the new role. You'll need this in the mapping below.

Adding a group to FAS

When finished, login to FAS and create a group to correspond to the new role. Use the prefix aws- to denote new AWS roles in FAS. This makes them easier to locate in a search. It may be appropriate to set group ownership for aws- groups to an Infrastructure team principal, and then add others as users or sponsors. This is especially worth considering for groups that have modify (full) access to an AWS resource.

Adding an IAM role mapping in Ipsilon

Add the new role mapping for FAS group to Role ARN in the ansible git repo, under roles/ipsilon/files/infofas.py. Current mappings look like this:


aws_groups = {
    'aws-master': 'arn:aws:iam::125523088429:role/aws-master',
    'aws-iam': 'arn:aws:iam::125523088429:role/aws-iam',
    'aws-billing': 'arn:aws:iam::125523088429:role/aws-billing',
    'aws-atomic': 'arn:aws:iam::125523088429:role/aws-atomic',
    'aws-s3-readonly': 'arn:aws:iam::125523088429:role/aws-s3-readonly'
}

Add your mapping to the dictionary as shown. Start a new build/rollout of the ipsilon project in openshift to make the changes live.
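The rollout can be triggered with oc from a machine that has access to the cluster; the project and buildconfig names below are assumptions, so check the real ones first:

$ oc project ipsilon
$ oc get buildconfigs
$ oc start-build <buildconfig-name-from-the-list>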

User accounts

If you only need to use the web interface to aws, a role (and associated policy) should be all you need, however, if you need cli access, you will need a user and a token. Users should be named the same as the role they are associated with.

Role and User policies

Each Role (and user, if there is a user needed for the role) should have the same policy attached to it. Policies are named 'fedora-$rolename-$service', e.g. 'fedora-infra-ec2'. A copy of the policies is available in the ansible repo under files/aws/iam/policies. These are in json form.
Policies are set up such that roles/users can do most things with a resource if it's untagged. If it's tagged, it MUST be tagged with their group: FedoraGroup / $groupname. If it's tagged with another group name, they cannot do anything with or to that resource (aside from seeing it exists). If there's a permission you need, please file a ticket and it will be evaluated.
Users MUST keep tokens private and secure. YOU are responsible for all use of tokens issued to you from Fedora Infrastructure. Report any compromised or possibly public tokens as soon as you are aware.
Users MUST tag resources with their FedoraGroup tag within one day, or the resource may be removed.
ec2: users/roles with ec2 permissions should always tag their instances with their FedoraGroup as soon as possible. Untagged resources can be terminated at any time.
s3: users/roles with s3 permissions will be given specific bucket(s) that they can manage/use. Care should be taken to make sure nothing in them is public that should not be.

cloudfront

Please file a ticket if you need cloudfront and infrastructure will do any needed setup if approved.


Regions

Users/groups are encouraged to use regions ‘near’ them or wherever makes the most sense. If you are trying to create ec2 instances you will need infrastructure to create a vpc in the region with network, etc. File a ticket for such requests.

Other Notes

AWS resource access that is not read-only should be treated with care. In some cases, Amazon or other entities may absorb AWS costs, so changes in usage can cause issues if not controlled or monitored. If you have doubts about access, consult the Fedora Project Leader or Fedora Engineering Manager.

Basset anti-spam service

Since the Fedora Project has come under targeted spam attacks, we have decided to create a service that all our applications can hook into to have a central repository for anti-spam procedures. Basset is this service, and it’s hosted on https://pagure.io/basset.

Contents

1. Contact Information
2. Overview
3. FAS
4. Trac
5. Wiki
6. Setup
7. Outage

Contact Information

Owner Patrick Uiterwijk (puiterwijk)
Contact #fedora-admin, #fedora-apps, #fedora-noc, sysadmin-main
Location basset01
Purpose Centralized anti-spam

Overview

Basset is a central anti-spam service: it receives messages from services when certain actions happen, and will then decide to accept or deny the request, or pass it on to an administrator. At the moment, we have the following modules live: FAS, trac, wiki.


FAS

This module receives notifications from FAS about new user registrations and new users signing the FPCA. With Basset enabled, FAS will not automatically accept a new user registration or a FPCA signing, but instead let Basset know a user tried to perform these actions and then depend on Basset to enact this.
In the case of registration this is done by setting the user to a spamcheck_awaiting status. As soon as Basset has made a decision, it will set the user to spamcheck_manual, spamcheck_denied or active. If it sets the user to active, it will also send the welcome email to the user. If it made a wrong decision, and the user is set as spamcheck_manual or spamcheck_denied, a member of the accounts team can go to that user's page and click the "Enable" button to override the decision. If this needed to be done, please notify puiterwijk so that the rules Basset uses can be updated.
For the case of the FPCA, FAS will request the cla_fpca group membership, but not sponsor the user. At the moment that Basset decides it accepts the request, it will sponsor the user into the group. If it declined the FPCA request, it will remove the user from the group. To override this decision, a member of the accounts group can go to FAS and manually add the user to the cla_fpca group and sponsor them into it.

Trac

For Trac, if a post gets denied, the content item gets deleted, the Trac account gets blocked cross-instance and the FAS account gets blocked. To unblock the user, log in to hosted03, and remove /srv/web/trac/blocks/$username. For info on how to unblock the FAS user, see the notes under FAS.

Wiki

For Wiki, if an edit gets denied, the page gets deleted, the wiki account blocked and the FAS account gets blocked. For the wiki parts of undoing this, follow the regular mediawiki unblock procedures using:
• https://fedoraproject.org/wiki/Special:BlockList to check if a user is blocked or not
• https://fedoraproject.org/wiki/Special:Unblock to unblock that user
Don't forget to unblock the account as in FAS.

Setup

At this moment, Basset runs the frontend, message broker and worker all on a single server (basset01(.stg)). For all of it to work, the following services are used:
• httpd (frontend)
• rabbitmq-server (broker)
• mongod (mongo database server for storage of internal info)
• basset-worker (worker)

Outage

The consequences of certain services not being up result in various conditions:
If the httpd or frontend aren't up, no new messages will come in. FAS will set the user to spamcheck_awaiting, but not submit it to Basset. Work is in progress on a script to submit such entries to the queue after the Basset frontend is back. However, since this part of the code is so small, this is not likely to be the part that's down. (You can know that it is because the FAS logs will log an error instead of "result: checking".)


If the worker or the mongo server are down, no messages will be processed, but all messages queued up will be processed the moment both of the services start again: as long as a message makes it into the queue, it will be processed until completion. If the worker encounters an error during processing of a message, it will dump a tracedump into the journal log file, and stop processing any messages. Resolve the condition reported in the error and restart the basset-worker service, and all work will be continued, starting with the message it was processing when it errored out. This means that as long as the message is queued, the worker will pick it up and handle it.

Fedora Bastion Hosts

Owner sysadmin-main
Contact [email protected]
Location phx2
Servers bastion01, bastion02, bastion-comm01
Purpose Background and description of bastion hosts and their unique issues.

Description

There are 2 primary bastion hosts in the phx2 datacenter. One will be active at any given time and the second will be a hot spare, ready to take over. Switching between bastion hosts is currently a manual process that requires changes in ansible.
There is also a bastion-comm01 bastion host for the qa.fedoraproject.org network. This is used in cases where users only need to access resources in that qa.fedoraproject.org network.
All of the bastion hosts have an external IP that is mapped into them. The reverse dns for these IPs is controlled by RHIT, so any changes must be carefully coordinated.
The active bastion host performs the following functions:
• Outgoing smtp from fedora servers. This includes email aliases, mailing list posts, build and commit notices, etc.
• Incoming smtp from servers in phx2 or on the fedora vpn. Incoming mail directly from the outside is NOT accepted or forwarded.
• ssh access to all phx2/vpn connected servers.
• openvpn hub. This is the hub that all vpn clients connect to and talk to each other via. Taking down or stopping this service will be a major outage of services as all proxy and app servers use the vpn to talk to each other.
When rebuilding these machines, care must be taken to match up the dns names externally, and to preserve the ssh host keys.

BladeCenter Access Infrastructure SOP

Many of the builders in PHX are blades in a blade center. A few other machines are also on blades.


Contents

1. Contact Information
2. Common Tasks
   1. Logging into the web interface
   2. Using the Serial Console of Blades

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main
Location PHX
Purpose Contains blades used for buildsystems, etc

Common Tasks

Logging into the web interface

The web interface to the bladecenters lets you reset power, etc. They are bc01-mgmt and bc02-mgmt.

Using the Serial Console of Blades

All of the blades are set up with a serial console over lan (SOL). To use this, ssh into the bladecenter. You can then pick your system and bring up a console with:

env -T system:blade[x] console -o

where x is the blade number (can be determined from the web interface, etc). To leave the console session, press Esc (
For more details on BladeCenter SOL, see http://www-304.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-54666

Blockerbugs Infrastructure SOP

Blockerbugs is an app developed by Fedora QA to aid in tracking items related to release blocking and freeze exception bugs in branched Fedora releases.

Contents

1. Contact Information
2. File Locations
3. Upgrade Process
   • Upgrade Preparation (for all upgrades)
   • Minor Upgrade (no db change)
   • Major Upgrade (with db changes)


Contact Information

Owner Fedora QA Devel
Contact #fedora-qa
Location Phoenix
Servers blockerbugs01.phx2, blockerbugs02.phx2, blockerbugs01.stg.phx2
Purpose Hosting the blocker bug tracking application for QA

File Locations

/etc/blockerbugs/settings.py - configuration for the app

Node Roles
blockerbugs01.stg.phx2  the staging instance, it is not load balanced
blockerbugs01.phx2      one of the load balanced production nodes, it is responsible for running bugzilla/bodhi/koji sync
blockerbugs02.phx2      the other load balanced production node. It does not do any sync operations

Building for Infra

Do not use mock

For whatever reason, the epel7-infra koji tag rejects SRPMs with the el7. dist tag. Make sure that you build SRPMs with:

rpmbuild -bs --define='dist .el7' blockerbugs.spec

Also note that this expects the release tarball to be in ~/rpmbuild/SOURCES/.

Building with Koji

You'll need to ask someone who has rights to build into the epel7-infra tag to make the build for you:

koji build epel7-infra blockerbugs-0.4.4.11-1.el7.src.rpm

Note: The fun bit of this is that python-flask is only available on x86_64 builders. If your build is routed to one of the non-x86_64, it will fail. The only solution available to us is to keep submitting the build until it’s routed to one of the x86_64 builders and doesn’t fail.

Once the build is complete, it should be automatically tagged into epel7-infra-stg (after a ~15 min delay) so that you can test it on the blockerbugs staging instance. Once you've verified it's working well, ask someone with infra rights to move it into the epel7-infra tag so that you can update it in production.
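The move itself is done with koji's move-build command; for example, using the example NVR from above, someone with the appropriate rights would run something like:

koji move-build epel7-infra-stg epel7-infra blockerbugs-0.4.4.11-1.el7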


Upgrading

Blockerbugs is currently configured through ansible and all configuration changes need to be done through ansible.

Upgrade Preparation (all upgrades)

Blockerbugs is not packaged in epel, so the new build needs to exist in the infrastructure stg repo for deployment to stg or the infrastructure repo for deployments to production. See the blockerbugs documentation for instructions on building a blockerbugs RPM.

Minor Upgrades (no database changes)

Run the following on both blockerbugs01.phx2 and blockerbugs02.phx2 if updating in production. 1. Update ansible with config changes, push changes to the ansible repo:

roles/blockerbugs/templates/blockerbugs-settings.py.j2

2. Clear yum cache and update the blockerbugs RPM:

yum clean expire-cache && yum update blockerbugs

3. Restart httpd to reload the application:

service httpd restart

Major Upgrades (with database changes)

Run the following on both blockerbugs01.phx2 and blockerbugs02.phx2 if updating in production. 1. Update ansible with config changes, push changes to the ansible repo:

roles/blockerbugs/templates/blockerbugs-settings.py.j2

2. Stop httpd on all relevant instances (if load balanced):

service httpd stop

3. Clear yum cache and update the blockerbugs RPM on all relevant instances:

yum clean expire-cache && yum update blockerbugs

4. Upgrade the database schema:

blockerbugs upgrade_db

5. Check the upgrade by running a manual sync to make sure that nothing unexpected went wrong:

blockerbugs sync

6. Start httpd back up:


service httpd start

Bodhi Infrastructure SOP

Bodhi is used by Fedora developers to submit potential package updates for releases and to manage buildroot overrides. From there, bodhi handles all of the dirty work, from sending out emails and dealing with Koji to composing the repositories.

Bodhi production instance: https://bodhi.fedoraproject.org
Bodhi project page: https://github.com/fedora-infra/bodhi

Contents

1. Contact Information
2. Adding a new pending release
3. 0-day Release Actions
4. Configuring all bodhi nodes
5. Pushing updates
6. Monitoring the bodhi output
7. Resuming a failed push
8. Performing a production bodhi upgrade
9. Syncing the production database to staging
10. Release EOL
11. Adding notices to the front page or new update form
12. Using the Bodhi Shell to modify updates by hand
13. Using the Bodhi shell to fix uniqueness problems with e-mail addresses
14. Troubleshooting and Resolution

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Persons: bowlofeggs
Location: Phoenix
Servers:
• bodhi-backend01.phx2.fedoraproject.org (composer)
• os.fedoraproject.org (web front end and backend task workers for non-compose tasks)
• bodhi-backend01.stg.phx2.fedoraproject.org (staging composer)
• os.stg.fedoraproject.org (staging web front end and backend task workers for non-compose tasks)
Purpose: Push package updates, and handle new submissions.


Adding a new pending release

Adding and modifying releases is done using the bodhi-manage-releases tool. You can add a new pending release by running this command:

bodhi-manage-releases create --name F23 --long-name "Fedora 23" --id-prefix FEDORA \
    --version 23 --branch f23 --dist-tag f23 --stable-tag f23-updates \
    --testing-tag f23-updates-testing --candidate-tag f23-updates-candidate \
    --pending-stable-tag f23-updates-pending --pending-testing-tag f23-updates-testing-pending \
    --override-tag f23-override --state pending

Pre-Beta Bodhi config

Enable the pre_beta policy in the bodhi config in ansible (ansible/roles/bodhi2/base/templates/production.ini.j2). Uncomment or add the following lines:

#f29.status = pre_beta
#f29.pre_beta.mandatory_days_in_testing = 3
#f29.pre_beta.critpath.min_karma = 1
#f29.pre_beta.critpath.stable_after_days_without_negative_karma = 14

Post-Beta Bodhi config

Enable the post_beta policy in the bodhi config in ansible (ansible/roles/bodhi2/base/templates/production.ini.j2). Comment out or remove the following lines corresponding to the pre_beta policy:

#f29.status = pre_beta
#f29.pre_beta.mandatory_days_in_testing = 3
#f29.pre_beta.critpath.min_karma = 1
#f29.pre_beta.critpath.stable_after_days_without_negative_karma = 14

Uncomment or add the following lines for the post_beta policy:

#f29.status = post_beta
#f29.post_beta.mandatory_days_in_testing = 7
#f29.post_beta.critpath.min_karma = 2
#f29.post_beta.critpath.stable_after_days_without_negative_karma = 14

0-day Release Actions

• update atomic config
• run the ansible playbook

Going from pending to a proper release in bodhi requires a few steps. Change the state from pending to current:

bodhi-manage-releases edit --name F23 --state current

You may also need to disable any pre-beta or post-beta policy defined in the bodhi config in ansible:


ansible/roles/bodhi2/base/templates/production.ini.j2

Comment out or remove the lines related to the pre and post beta policy:

#f29.status = post_beta
#f29.post_beta.mandatory_days_in_testing = 7
#f29.post_beta.critpath.min_karma = 2
#f29.post_beta.critpath.stable_after_days_without_negative_karma = 14
#f29.status = pre_beta
#f29.pre_beta.mandatory_days_in_testing = 3
#f29.pre_beta.critpath.min_karma = 1
#f29.pre_beta.critpath.stable_after_days_without_negative_karma = 14

Configuring all bodhi nodes

Run this command from the ansible checkout to configure all of bodhi in production:

# This will configure the backends
$ sudo rbac-playbook playbooks/groups/bodhi2.yml

# This will configure the frontend
$ sudo rbac-playbook openshift-apps/bodhi.yml

Pushing updates

SSH into the bodhi-backend01 machine and run:

$ sudo -u apache bodhi-push

You can restrict the updates by release and/or request:

$ sudo -u apache bodhi-push --releases f23,f22 --request stable

You can also push specific builds:

$ sudo -u apache bodhi-push --builds openssl-1.0.1k-14.fc22,openssl-1.0.1k-14.fc23

This will display a list of updates that are ready to be pushed.

Monitoring the bodhi composer output

You can monitor the bodhi composer via the bodhi CLI tool, or via the systemd journal on bodhi-backend01:

# From the comfort of your own laptop.
$ bodhi composes list

# From bodhi-backend01
$ journalctl -f -u fedmsg-hub

Resuming a failed push

If a push fails for some reason, you can easily resume it on bodhi-backend01 by running:


$ sudo -u apache bodhi-push --resume

Performing a bodhi upgrade

Build Bodhi

Bodhi is deployed from the infrastructure Koji repositories. At the time of this writing, it is deployed from the f29-infra and f29-infra-stg (for staging) repositories. Bodhi is built for these repositories from the master branch of the bodhi dist-git repository. As an example, to build a Bodhi beta for the f29-infra-stg repository, you can use this command:

$ rpmbuild --define "dist .fc29.infra" -bs bodhi.spec
Wrote: /home/bowlofeggs/rpmbuild/SRPMS/bodhi-3.13.0-0.0.beta.e0ca5bc.fc29.infra.src.rpm
$ koji build f29-infra /home/bowlofeggs/rpmbuild/SRPMS/bodhi-3.13.0-0.0.beta.e0ca5bc.fc29.infra.src.rpm

When building a Bodhi release that is intended for production, we should build from the production dist-git repo instead of uploading an SRPM:

$ koji build f29-infra git+https://src.fedoraproject.org/rpms/bodhi.git#d64f40408876ec85663ec52888c4e44d92614b37

All builds against the f29-infra build target will go into the f29-infra-stg repository. If you wish to promote a build from staging to production, you can do something like this command:

$ koji move-build f29-infra-stg f29-infra bodhi-3.13.0-1.fc29.infra

Staging

The upgrade playbook will apply configuration changes after running the alembic upgrade. Sometimes you may need changes applied to the Bodhi systems in order to get the upgrade playbook to succeed. If you are in this situation, you can apply those changes by running the bodhi-backend playbook:

sudo rbac-playbook -l staging groups/bodhi-backend.yml

In the os_masters inventory, edit the bodhi_version variable, setting it to the version you wish to deploy to staging. For example, to deploy bodhi-3.13.0-1.fc29.infra to staging, you would set that variable like this:

bodhi_version: "bodhi-3.13.0-1.fc29.infra"

Run these commands:

# Synchronize the database from production to staging
$ sudo rbac-playbook manual/staging-sync/bodhi.yml -l staging

# Upgrade the Bodhi backend on staging
$ sudo rbac-playbook manual/upgrade/bodhi.yml -l staging

# Upgrade the Bodhi frontend on staging
$ sudo rbac-playbook openshift-apps/bodhi.yml -l staging


Production

The upgrade playbook will apply configuration changes after running the alembic upgrade. Sometimes you may need changes applied to the Bodhi systems in order to get the upgrade playbook to succeed. If you are in this situation, you can apply those changes by running the bodhi-backend playbook:

sudo rbac-playbook groups/bodhi-backend.yml -l bodhi-backend

In the os_masters inventory, edit the bodhi_version variable, setting it to the version you wish to deploy to production. For example, to deploy bodhi-3.13.0-1.fc29.infra to production, you would set that variable like this:

bodhi_version: "bodhi-3.13.0-1.fc29.infra"

To update the bodhi RPMs in production:

# Update the backend VMs (this will also run the migrations, if any)
$ sudo rbac-playbook manual/upgrade/bodhi.yml -l bodhi-backend

# Update the frontend
$ sudo rbac-playbook openshift-apps/bodhi.yml

Syncing the production database to staging

This can be useful for testing issues with production data in staging:

$ sudo rbac-playbook manual/staging-sync/bodhi.yml -l staging

Release EOL

bodhi-manage-releases edit --name F21 --state archived

Adding notices to the front page or new update form

You can easily add notification messages to the front page of bodhi using the frontpage_notice option in ansible/roles/bodhi2/base/templates/production.ini.j2. If you want to flash a message on the New Update Form, you can use the newupdate_notice variable instead. This can be useful for announcing things like service outages, etc.
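For example, a hypothetical outage notice (the wording below is illustrative only) would be a single line in that template:

frontpage_notice = Bodhi will be unavailable for a short maintenance window on <date>.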

Using the Bodhi Shell to modify updates by hand

The "bodhi shell" is a Python shell with the SQLAlchemy session and transaction manager initialized. It can be run from any production/staging backend instance and allows you to modify any models by hand.

sudo pshell /etc/bodhi/production.ini

# Execute a script that sets up the `db` and provides a `delete_update` function. # This will eventually be shipped in the bodhi package, but can also be found here. # https://raw.githubusercontent.com/fedora-infra/bodhi/develop/tools/shelldb.py >>> execfile('shelldb.py')


At this point you have access to a db SQLAlchemy Session instance, a t transaction module, and m for the bodhi.models.

# Fetch an update, and tweak it as necessary.
>>> up = m.Update.get(u'FEDORA-2016-4d226a5f7e', db)

# Commit the transaction
>>> t.commit()

Here is an example of merging two updates together and deleting the original.

>>> up = m.Update.get(u'FEDORA-2016-4d226a5f7e', db)
>>> up.builds
[<... 'pki-core-10.3.5-1.fc24'}>]
>>> b = up.builds[0]
>>> up2 = m.Update.get(u'FEDORA-2016-5f63a874ca', db)
>>> up2.builds
[<...>]
>>> up.builds.remove(b)
>>> up.builds.append(up2.builds[0])
>>> delete_update(up2)
>>> t.commit()

Using the Bodhi shell to fix uniqueness problems with e-mail addresses

Bodhi currently enforces uniqueness on user e-mail addresses. There is an issue filed to drop this upstream, but for the time being the constraint is enforced. This can be a problem for users who have more than one FAS account, if they make one account use an e-mail address that was previously used by another account and that other account has not logged into Bodhi since it was changed to use a different address. One way the user can fix this themselves is to log in to Bodhi with the old account so that Bodhi learns about its new address. However, an admin can also fix this by hand using the Bodhi shell. For example, suppose a user has created user_1 and user_2. Suppose that user_1 used to use [email protected] but has been changed to use [email protected] in FAS, and user_2 is now configured to use [email protected] in FAS. If user_2 attempts to log in to Bodhi, it will cause a uniqueness violation since Bodhi does not know that user_1 has changed to [email protected]. The user can simply log in as user_1 to fix this, which will cause Bodhi to update its e-mail address to email_b@example.com. Or an admin can fix it with a shell on one of the Bodhi backend servers like this:

[bowlofeggs@bodhi-backend02 ~][PROD]$ sudo -u apache pshell /etc/bodhi/production.ini
2018-05-29 20:21:36,366 INFO  [bodhi][MainThread] Using python-bugzilla
2018-05-29 20:21:36,367 DEBUG [bodhi][MainThread] Using Koji Buildsystem
2018-05-29 20:21:42,559 INFO  [bodhi.server][MainThread] Bodhi ready and at your service!
Python 2.7.14 (default, Mar 14 2018, 13:36:31)
[GCC 7.3.1 20180303 (Red Hat 7.3.1-5)] on linux2
Type "help" for more information.

Environment:
  app           The WSGI application.
  registry      Active Pyramid registry.
  request       Active request object.
  root          Root of the default resource tree.
  root_factory  Default root factory used to create `root`.

Custom Variables:
  m             bodhi.server.models

>>> u = m.User.query.filter_by(name=u'user_1').one() >>> u.email = u'[email protected]' >>> m.Session().commit()

Troubleshooting and Resolution

Atomic OSTree compose failure

If the Atomic OSTree compose fails with some sort of "Device or Resource busy" error, then run mount to see if there are any stray tmpfs mounts still active:

tmpfs on /var/lib/mock/fedora-22-updates-testing-x86_64/root/var/tmp/rpm-ostree.bylgUq type tmpfs (rw,relatime,seclabel,mode=755)

You can then umount /var/lib/mock/fedora-22-updates-testing-x86_64/root/var/tmp/rpm-ostree.bylgUq and resume the push again.

nfs repodata cache IOError

Sometimes you may hit an IOError during the updateinfo.xml generation process from createrepo_c:

IOError: Cannot open /mnt/koji/mash/updates/epel7-160228.1356/../epel7.repocache/repodata/repomd.xml: File /mnt/koji/mash/updates/epel7-160228.1356/../epel7.repocache/repodata/repomd.xml doesn't exists or not a regular file

This issue will be resolved with NFSv4, but in the meantime it can be worked around by removing the .repocache directory and resuming the push:

rm -fr /mnt/koji/mash/updates/epel7.repocache

Bugzilla Sync Infrastructure SOP

We do not run bugzilla.redhat.com. If bugzilla itself is down we need to get in touch with Red Hat IT or one of the bugzilla hackers (for instance, Dave Lawrence (dkl)) in order to fix it. Infrastructure has some scripts that perform administrative functions on bugzilla.redhat.com. These scripts sync information from FAS and the Package Database into bugzilla.

Contents

1. Contact Information
2. Description
3. Troubleshooting and Resolution
   1. Errors while syncing bugzilla with the PackageDB


Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Persons: abadger1999
Location: Phoenix, Denver (Tummy), Red Hat Infrastructure
Servers: (fas1, app5) => Need to migrate these to bapp1, bugzilla.redhat.com
Purpose: Sync Fedora information to bugzilla.redhat.com

Description

At present there are two scripts that sync information from Fedora into bugzilla.

export-bugzilla.py

export-bugzilla.py is the first script. It is responsible for syncing Fedora accounts into bugzilla. It adds Fedora packagers and bug triagers to a bugzilla group that gives those users extra permissions within bugzilla. This script is run from a cron job on FAS1. The source code resides in the FAS git repo in fas/scripts/export-bugzilla.*, however the code we run on the servers presently lives in ansible:

roles/fas_server/files/export-bugzilla

pkgdb-sync-bugzilla

The other script is pkgdb-sync-bugzilla. It is responsible for syncing the package owners and cclists to bugzilla from the pkgdb. The script runs from a cron job on app5. The source code is in the packagedb bzr repo in packagedb/fedora-packagedb-stable/server-scripts/pkgdb-sync-bugzilla.*. Just like FAS, a separate copy is presently installed from ansible to /usr/local/bin/pkgdb-sync-bugzilla, but that should change ASAP as the present fedora-packagedb package installs /usr/bin/pkgdb-sync-bugzilla.

Troubleshooting and Resolution

Errors while syncing bugzilla with the PackageDB

One frequent problem is that people will sign up to watch a package in the packagedb but their email address in FAS isn’t a bugzilla email address. When this happens the scripts that try to sync the packagedb information to bugzilla encounter an error and send an email like this:

Subject: Errors while syncing bugzilla with the PackageDB

The following errors were encountered while updating bugzilla with information from the Package Database. Please have the problems taken care of:

({'product': u'Fedora', 'component': u'aircrack-ng', 'initialowner': u'[email protected]',
  'initialcclist': [u'[email protected]', u'[email protected]']}, 504,
 'The name [email protected] is not a valid username. \n Either you misspelled it, or the person has not\n registered for a Red Hat Bugzilla account.')

When this happens we attempt to contact the person with the problematic mail address and get them to change it. Here’s a boilerplate message:

To: [email protected] Subject: Fedora Account System Email vs Bugzilla Email

Hello,

You are signed up to receive bug reports against the aircrack-ng package in Fedora. Unfortunately, the email address we have for you in the Fedora Account System is not a valid bugzilla email address. That means that bugzilla won't send you mail and we're getting errors in the script that syncs the cclist into bugzilla.

There's a few ways to resolve this:

1) Create a new bugzilla account with the email [email protected] as an account at https://bugzilla.redhat.com.

2) Change an existing account on https://bugzilla.redhat.com to use the [email protected] email address.

3) Change your email address in https://admin.fedoraproject.org/accounts to use an email address that matches with an existing bugzilla email address.

Please let me know what you want to do!

Thank you,

If the user does not reply, someone in the cvsadmin group needs to go into the pkgdb and remove the user from the cclist for the package.

bugzilla2fedmsg SOP

Receive events from bugzilla over the RH “unified messagebus” and rebroadcast them over our own fedmsg bus.

Contact Information

Owner: Messaging SIG, Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-fedmsg, #fedora-admin, #fedora-noc
Servers: bugzilla2fedmsg01
Purpose: Rebroadcast bugzilla events on our bus.


Description

bugzilla2fedmsg is a small service running as the 'moksha-hub' process which receives events from bugzilla via the RH "unified messagebus" and rebroadcasts them to our fedmsg bus.

Note: Unlike all of our other fedmsg services, this one runs as the ‘moksha-hub’ process and not as the ‘fedmsg-hub’.

The bugzilla2fedmsg package provides a plugin to the moksha-hub that connects out over the STOMP protocol to a ‘fabric’ of JBOSS activemq FUSE brokers living in the Red Hat DMZ. We authenticate with a cert/key pair that is kept in /etc/pki/fedmsg/. Those brokers should push bugzilla events over STOMP to our moksha-hub daemon. When a message arrives, we query bugzilla about the change to get some ‘more interesting’ data to stuff in our payload, then we sign the message using a fedmsg cert and fire it off to the rest of our bus. This service has no database, no memcached usage. It depends on those STOMP brokers and being able to query bugzilla.rh.com.

Relevant Files

All managed by ansible, of course:

STOMP config: /etc/moksha/production.ini
fedmsg config: /etc/fedmsg.d/
certs: /etc/pki/fedmsg
code: /usr/lib/python2.7/site-packages/bugzilla2fedmsg.py

Useful Commands

To look at logs, run:

$ journalctl -u moksha-hub -f

To restart the service, run:

$ systemctl restart moksha-hub

Internal Contacts

If we need to contact someone from the RH internal “unified messagebus” team, search for “unified messagebus” in mojo. It is operated as a joint project between RHIT and PnT Devops. See also the #devops-message IRC channel, internally.

Fedora OpenStack

Quick Start

Controller: sudo rbac-playbook hosts/fed-cloud09.cloud.fedoraproject.org.yml

Compute nodes:


sudo rbac-playbook groups/openstack-compute-nodes.yml

Description

If you need to install OpenStack, either make sure the machine is clean, or use the ansible.git/files/fedora-cloud/uninstall.sh script to brute-force wipe it.

Note: by default, the script does not wipe the LVM group with the VMs; you have to clean those up manually. There is a commented-out line in that script for this.

On fed-cloud09, remove the file /etc/packstack_sucessfully_finished to force packstack and a few other commands to run again. After that wipe, you have to:

ifdown eth1
# configure eth1 to become a normal Ethernet interface with an IP
yum install openstack-neutron-openvswitch
/usr/bin/systemctl restart neutron-ovs-cleanup
ifup eth1

Additionally, when reprovisioning OpenStack, all volumes on the Dell EqualLogic are preserved and you have to manually remove them (or remove them from OpenStack before it is reprovisioned). SSH to the Dell EqualLogic (credentials are at the bottom of /etc/cinder/cinder.conf) and run:

show                     (to get the list of volumes)
volume select <volume>
offline
volume delete

Before installing, make sure:

• the rdo repo is enabled
• yum install openstack-packstack openstack-packstack-puppet openstack-puppet-modules
• vim /usr/lib/python2.7/site-packages/packstack/plugins/dashboard_500.py and add the missing parentheses:

  host_resources.append((ssl_key, 'ssl_ps_server.key'))

Now you can run playbook:

sudo rbac-playbook hosts/fed-cloud09.cloud.fedoraproject.org.yml

If you run it after a wipe (i.e. the db has been reset), you have to:

• import ssh keys of users (only possible via the webUI - RHBZ 1128233)
• reset user passwords

Compute nodes

The compute node setup is much easier and is written as a role. Use:


vars_files:
  - ... SNIP
  - /srv/web/infra/ansible/vars/fedora-cloud.yml
  - "{{ private }}/files/openstack/passwords.yml"
roles:
  ... SNIP
  - cloud_compute

Define a host variable in inventory/host_vars/FQDN.yml:

compute_private_ip: 172.23.0.10

You should also add the IP to vars/fedora-cloud.yml. And when adding a new compute node, please update files/fedora-cloud/hosts.

Important: When reinstalling, make sure you have removed all members on the Dell EqualLogic (credentials are in /etc/cinder/cinder.conf on the compute node), otherwise the space will remain blocked!

Updates

Our OpenStack cloud should have updates applied and be rebooted when the rest of our servers are updated and rebooted. This will cause an outage, so please make sure to schedule it.

1. Stop the copr-backend process on copr-be.cloud.fedoraproject.org
2. Kill all copr-builder instances.
3. Kill all transient/scratch instances.
4. Update all instances we control: copr, persistent, infrastructure, qa, etc.
5. Shut down all instances.
6. Update and reboot fed-cloud09.
7. Update and reboot all compute nodes.
8. Start up all instances that were shut down in step 5.

TODO: add commands for the above as we know them.

Troubleshooting

• Could not connect to a VM? Check your security group; the default SG does not allow any connections.
• packstack ends up with an error: it is likely a race condition in puppet - BZ 1135529. Just run it again.
• ERROR [append() takes exactly one argument (2 given)]: vi /usr/lib/python2.7/site-packages/packstack/plugins/dashboard_500.py and add one more surrounding ().
• "Local ip for ovs agent must be set when tunneling is enabled": restart fed-cloud09 or run: ssh to fed-cloud09; ifdown eth1; ifup eth1; ifup br-ex
• mongodb problem? follow https://ask.openstack.org/en/question/54015/mongodbpp-error-when-installing-rdo-on-centos-7/?answer=54076#post-id-54076


• WARNING:keystoneclient.httpclient:Failed to retrieve management_url from token:

keystone --os-token $ADMIN_TOKEN --os-endpoint https://fedorainfracloud.org:35357/v2.0/ \
    endpoint-create --region 'RegionOne' \
    --service 91358b81b1aa40d998b3a28d0cfc86e7 --region 'RegionOne' \
    --publicurl 'https://fedorainfracloud.org:5000/v2.0' \
    --adminurl 'http://172.24.0.9:35357/v2.0' \
    --internalurl 'http://172.24.0.9:5000/v2.0'

Fedora Classroom about our instance http://meetbot.fedoraproject.org/fedora-classroom/2015-05-11/fedora-classroom.2015-05-11-15.02.log.html

Collectd SOP

Collectd ( https://collectd.org/ ) is a client/server setup that gathers system information from clients and allows the server to display that information over various time periods. Our server instance runs on log01.phx2.fedoraproject.org and most other servers run clients that connect to the server and provide it with data.

1. Contact Information
2. Collectd info

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Location: https://admin.fedoraproject.org/collectd/
Servers: log01 and all/most other servers as clients
Purpose: Provide load and system information on servers.

Configuration

The collectd roles configure collectd on the various machines:

collectd/base - This is the base client role for most servers.
collectd/server - This is the server for use on log01.
collectd/other - There are various other subroles for different types of clients.

Web interface

The server web interface is available at: https://admin.fedoraproject.org/collectd/


Restarting

collectd runs as a normal systemd or sysvinit service, so you can run:

systemctl restart collectd

or:

service collectd restart

to restart it.

Removing old hosts

Collectd keeps information around until it's deleted, so you may sometimes need to remove data from a host or hosts that are no longer used. To do this:

1. Login to log01
2. cd /var/lib/collectd/rrd
3. sudo rm -rf oldhostname

Bug reporting

Collectd is in Fedora/EPEL and we use their packages, so report bugs to bugzilla.redhat.com.

Communishift SOP

Communishift is an OpenShift deployment hosted and maintained by Fedora Infrastructure that is available to the community to host applications. Fedora Infrastructure does not maintain the applications in Communishift and is only responsible for the OpenShift deployment itself. Production instance: https://console-openshift-console.apps.os.fedorainfracloud.org/

Contents

• Contact information
• Onboarding new users
• KVM access

Contact information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Persons: nirik
Location: Phoenix
Servers:
• os-node01.fedorainfracloud.org
• os-node02.fedorainfracloud.org
• os-node03.fedorainfracloud.org
• os-node04.fedorainfracloud.org
• os-node05.fedorainfracloud.org
• os-node06.fedorainfracloud.org
• os-node07.fedorainfracloud.org
• os-node08.fedorainfracloud.org
• os-node09.fedorainfracloud.org
• os-node10.fedorainfracloud.org
• os-node11.fedorainfracloud.org
• virthost-os01.fedorainfracloud.org
• virthost-os02.fedorainfracloud.org
• virthost-os03.fedorainfracloud.org
• virthost-aarch64-os01.fedorainfracloud.org
• virthost-aarch64-os02.fedorainfracloud.org
Purpose: Allow community members to host services for the Fedora Project.

Onboarding new users

To allow new users to create projects in Communishift, begin by adding them to the communishift FAS group. At the time of this writing, there is no automation to sync users from the communishift FAS group to OpenShift, so you will need to log in to the Communishift instance and grant that user permissions to create projects. For example, to grant bowlofeggs permissions, you would do this:

$ oc adm policy add-cluster-role-to-user self-provisioner bowlofeggs
$ oc create clusterquota for-bowlofeggs \
    --project-annotation-selector openshift.io/requester=bowlofeggs \
    --hard pods=10 --hard persistentvolumeclaims=5

This will grant bowlofeggs the ability to provision up to 10 pods and 5 volumes.
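To double-check the result, you can inspect the quota object afterwards (the object name matches the one created above):

$ oc describe clusterresourcequota for-bowlofeggs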

KVM access

We allow applications access to the kvm device so they can run emulation faster. Anytime the cluster is re-installed, run:

#!/bin/bash
set -eux
if ! oc get --namespace=default ds/device-plugin-kvm &>/dev/null; then
    oc create --namespace=default -f https://raw.githubusercontent.com/kubevirt/kubernetes-device-plugins/master/manifests/kvm-ds.yml
fi

See the upstream docs as well as the original request for this.

Compose Tracker SOP

Compose Tracker tracks the pungi composes and creates a ticket in a pagure repo for composes that do not FINISH, with a tail of the debug log and the associated koji tasks.

Compose Tracker: https://pagure.io/releng/compose-tracker
Failed Composes Repo: https://pagure.io/releng/failed-composes


Contents

1. Contact Information

Contact Information

Owner: Fedora Release Engineering Team
Contact: #fedora-releng
Persons: dustymabe, mohanboddu
Purpose: Track failed composes

More Information

For information about the tool and its deployment on Fedora Infra OpenShift, please look at the documentation in https://pagure.io/releng/compose-tracker/blob/master/f/README.md

Content Hosting Infrastructure SOP

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, fedora-infrastructure-list
Location: Phoenix
Servers: secondary1, netapp[1-3], torrent1
Purpose: Policy regarding hosting, removal and pruning of content.
Scope: download.fedora.redhat.com, alt.fedoraproject.org, archives.fedoraproject.org, secondary.fedoraproject.org, torrent.fedoraproject.org

Description

Fedora hosts both Fedora content and some non-Fedora content. Our resources are finite and as such we have to have some policy around when to remove old content. This SOP describes the test to remove content. The spirit of this SOP is to allow more people to host content and give it a try, prove that it’s useful. If it’s not popular or useful, it will get removed. Also out of date or expired content will be removed.

What hosting options are available

Aside from the hosting at https://pagure.io/ we have a series of mirrors we're allowing people to use. They are located at:

• http://archive.fedoraproject.org/pub/archive/ - For archives of historical Fedora releases
• http://secondary.fedoraproject.org/pub/fedora-secondary/ - For secondary architectures
• http://alt.fedoraproject.org/pub/alt/ - For misc content / catchall
• http://torrent.fedoraproject.org/ - For torrent hosting
• http://spins.fedoraproject.org/ - For official Fedora Spins hosting, mirrored somewhat
• http://download.fedoraproject.com/pub/ - For official Fedora Releases, mirrored widely

Who can host? What can be hosted?

Any official Fedora content can be hosted and made available for mirroring. Official content is determined by the Council by virtue of allowing people to use the Fedora trademark. People representing these teams will be allowed to host.

Non Official Hosting

People wanting to host unofficial bits may request approval for hosting. Create a ticket at https://pagure.io/fedora-infrastructure/ explaining what you want to host and why Fedora should host it. Such requests will be reviewed by the Fedora Infrastructure team. Requests for non-official hosting that may conflict with existing Fedora policies will be escalated to the Council for approval.

Licensing

Anything hosted with Fedora must come with a license that is approved by Fedora. See http://fedoraproject.org/wiki/Licensing for more.

Requesting Space

• Make sure you have a Fedora account - https://admin.fedoraproject.org/accounts/
• Ensure you have signed the Fedora Project Contributor Agreement (FPCA)
• Submit a hosting request - https://pagure.io/fedora-infrastructure/
  – Include who you are, and any group you are working with (e.g. a SIG)
  – Include space requirements
  – Include an estimate of the number of downloads expected (if you can)
  – Include the nature of the bits you want to host
• Apply for group hosted-content - https://admin.fedoraproject.org/accounts/group/view/hosted-content

Using Space

A dedicated namespace in the mirror will be assigned to you. It will be your responsibility to upload content, remove old content, stay within your quota, etc. If you have any questions or concerns about this please let us know. Generally you will use rsync. For example:

rsync -av --progress ./my.iso secondary01.fedoraproject.org:/srv/pub/alt/mySpace/

Important: None of our mirrored content is backed up. Ensure that you keep backups of your content.


Content Pruning / Purging / Removal

The following guidelines / tests will be used to determine whether or not to remove content from the mirror.

Expired / Old Content

If content meets any of the following criteria it may be removed:

• Content that has reached end of life (is no longer receiving updates).
• Pre-release content that has been superseded.
• EOL releases that have been moved to archives.
• N-2 or greater releases. If more than 3 versions of a piece of content are on the mirror, the oldest may be removed.

Limited Use Content

If content meets any of the following criteria it may be removed:

• Content with exceedingly limited seeders or downloaders, with little prospect of increasing those numbers, and which is older than 1 year.
• Content such as videos or audio which are several years old.

Catch All Removal

Fedora reserves the right to remove any content for any reason at any time. We’ll do our best to host things but sometimes we’ll need space or just need to remove stuff for legal or policy reasons.

Copr

Copr is a build system for third-party packages.

Frontend:
• http://copr.fedorainfracloud.org/
Backend:
• http://copr-be.cloud.fedoraproject.org/
Package signer:
• copr-keygen.cloud.fedoraproject.org
Dist-git:
• copr-dist-git.fedorainfracloud.org
Devel instances (NO NEED TO CARE ABOUT THEM, JUST THOSE ABOVE):
• http://copr-fe-dev.cloud.fedoraproject.org/
• http://copr-be-dev.cloud.fedoraproject.org/
• copr-keygen-dev.cloud.fedoraproject.org


• copr-dist-git-dev.fedorainfracloud.org

Contact Information

Owner: msuchy (mirek)
Contact: #fedora-admin, #fedora-buildsys
Location: Fedora Cloud
Purpose: Build system

This document

This document provides condensed information allowing you to keep Copr alive and working. For more sophisticated business processes, please see https://docs.pagure.org/copr.copr/maintenance_documentation.html

TROUBLESHOOTING

Almost every problem with Copr is caused by a problem with spawning builder VMs, or with processing the action queue on the backend.

VM spawning/termination problems

Try to restart copr-backend service:

$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl restart copr-backend

If this doesn’t solve the problem, try to follow logs for some clues:

$ tail -f /var/log/copr-backend/{vmm,spawner,terminator}.log

As a last resort, you can terminate all builders and let copr-backend throw away all information about them. This action will obviously interrupt all running builds and reschedule them:

$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl stop copr-backend
$ cleanup_vm_nova.py
$ redis-cli
> FLUSHALL
$ systemctl start copr-backend

Sometimes OpenStack cannot handle spawning too many VMs at the same time, so it is safer to edit /etc/copr/copr-be.conf on copr-be.cloud.fedoraproject.org and change group0_max_workers=12 to 6.
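A minimal sketch of making that change, assuming the option is present in the file exactly as shown above:

$ ssh root@copr-be.cloud.fedoraproject.org
$ sed -i 's/^group0_max_workers=12/group0_max_workers=6/' /etc/copr/copr-be.conf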

Start the copr-backend service and some time later increase the value back to the original; Copr automatically detects the config change and increases the number of workers.

The set of aarch64 VMs isn't maintained by OpenStack, but by Copr's backend itself. Steps to diagnose:

$ ssh root@copr-be.cloud.fedoraproject.org
[root@copr-be ~][PROD]# systemctl status resalloc
  resalloc.service - Resource allocator server
  ...

[root@copr-be ~][PROD]# less /var/log/resallocserver/main.log

[root@copr-be ~][PROD]# su - resalloc

[resalloc@copr-be ~][PROD]$ resalloc-maint resource-list
13569 - aarch64_01_prod_00013569_20190613_151319 pool=aarch64_01_prod tags=aarch64 status=UP
13597 - aarch64_01_prod_00013597_20190614_083418 pool=aarch64_01_prod tags=aarch64 status=UP
13594 - aarch64_02_prod_00013594_20190614_082303 pool=aarch64_02_prod tags=aarch64 status=STARTING
...

[resalloc@copr-be ~][PROD]$ resalloc-maint ticket-list
879 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013569_20190613_151319
918 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013608_20190614_135536
904 - state=OPEN tags=aarch64 resource=aarch64_02_prod_00013594_20190614_082303
919 - state=OPEN tags=aarch64
...

Be careful when there is some resource in the STARTING state. If that's so, check /usr/bin/tail -F -n +0 /var/log/resallocserver/hooks/013594_alloc. Copr takes tickets from the resalloc server; if the resources fail to spawn, the ticket numbers are not assigned an appropriately tagged resource for a long time. If that happens (it shouldn't) and there's some inconsistency between resalloc's database and the actual status on the aarch64 hypervisors (ssh copr@virthost-aarch64-os0{1,2}.fedorainfracloud.org and use virsh there to introspect their statuses), use resalloc-maint resource-delete, resalloc ticket-close or psql commands to fix up resalloc's DB.
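For instance, using the (purely illustrative) resource and ticket IDs from the listings above, cleaning up a broken resource and closing its ticket might look like:

[resalloc@copr-be ~][PROD]$ resalloc-maint resource-delete 13594
[resalloc@copr-be ~][PROD]$ resalloc ticket-close 904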

Backend Troubleshooting

Information about status of Copr backend services:

systemctl status copr-backend*.service

Utilization of workers:

ps axf

Worker processes change $0 (their process title) to show which task they are working on and on which builder. To list which VM builders are tracked by the copr-vmm service:

/usr/bin/copr_get_vm_info.py


Appstream builder troubleshooting

Appstream builder is painfully slow when running on a repository with a huge number of packages. See https://github.com/hughsie/appstream-glib/issues/301. You might need to disable it for some projects:

$ ssh root@copr-be.cloud.fedoraproject.org
$ cd /var/lib/copr/public_html/results/<owner>/<project>/
$ touch .disable-appstream
# You should probably also delete existing appstream data because
# they might be obsolete
$ rm -rf ./appdata

Backend action queue issues

First check the number of not-yet-processed actions. If that number isn't equal to zero and is not decrementing relatively fast (say a single action takes longer than 30s), there might be some problem. Logs for the action dispatcher can be found in:

/var/log/copr-backend/action_dispatcher.log

Check that there's no stuck process under the Action dispatch parent process in pstree -a copr output.

Deploy information

Using playbooks and rbac:

$ sudo rbac-playbook groups/copr-backend.yml
$ sudo rbac-playbook groups/copr-frontend-cloud.yml
$ sudo rbac-playbook groups/copr-keygen.yml
$ sudo rbac-playbook groups/copr-dist-git.yml

https://pagure.io/copr/copr/blob/master/f/copr-setup.txt

The copr-setup.txt manual is severely outdated, but there is no up-to-date alternative. We should extract the useful information from it and put it here in the SOP or into https://docs.pagure.org/copr.copr/maintenance_documentation.html, and then throw copr-setup.txt away.

On the backend, the copr-backend service should be running (it spawns several processes). The backend spawns VMs from the Fedora Cloud. You cannot log in to those machines directly. You have to:

$ ssh root@copr-be.cloud.fedoraproject.org
$ su - copr
$ copr_get_vm_info.py     # find the IP address of the VM that you want
$ ssh root@<that VM's IP>

Instances can be easily terminated in https://fedorainfracloud.org/dashboard

Order of start up

When reprovisioning, you should first start the copr-keygen and copr-dist-git machines (in any order). Then you can start copr-be. Well, you can start it sooner, but make sure that the copr-* services are stopped. The copr-fe machine is completely independent and can be started at any time. If the backend is stopped it will just queue jobs.


Logs

Backend

• /var/log/copr-backend/action_dispatcher.log
• /var/log/copr-backend/actions.log
• /var/log/copr-backend/backend.log
• /var/log/copr-backend/build_dispatcher.log
• /var/log/copr-backend/logger.log
• /var/log/copr-backend/spawner.log
• /var/log/copr-backend/terminator.log
• /var/log/copr-backend/vmm.log
• /var/log/copr-backend/worker.log

There are also several logs for non-essential features, such as copr_prune_results.log, hitcounter.log and cleanup_vms.log, that you shouldn't be worried about.

Frontend

• /var/log/copr-frontend/frontend.log
• /var/log/httpd/access_log
• /var/log/httpd/error_log

Keygen

• /var/log/copr-keygen/main.log

Dist-git

• /var/log/copr-dist-git/main.log
• /var/log/httpd/access_log
• /var/log/httpd/error_log

Services

Backend

• copr-backend
  – copr-backend-action
  – copr-backend-build
  – copr-backend-log
  – copr-backend-vmm
• redis

All the copr-backend-*.service units are configured to be a part of copr-backend.service, so e.g. in case of restarting all of them, just restart copr-backend.service.


Frontend

• httpd
• postgresql

Keygen

• signd

Dist-git

• httpd
• copr-dist-git

PPC64LE Builders

Builders for PPC64LE are located at rh-power2.fit.vutbr.cz, and anyone with access to the buildsys ssh key can get there using keys, as msuchy@rh-power2.fit.vutbr.cz. The following commands are available:

$ ls bin/
destroy-all.sh  reinit-vm26.sh  reinit-vm28.sh  virsh-destroy-vm26.sh  virsh-destroy-vm28.sh  virsh-start-vm26.sh  virsh-start-vm28.sh
get-one-vm.sh   reinit-vm27.sh  reinit-vm29.sh  virsh-destroy-vm27.sh  virsh-destroy-vm29.sh  virsh-start-vm27.sh  virsh-start-vm29.sh

bin/destroy-all.sh: destroys all VMs and reinits them
reinit-vmXX.sh: copies the VM image from a template
virsh-destroy-vmXX.sh: destroys the VM
virsh-start-vmXX.sh: starts the VM
get-one-vm.sh: starts one VM and returns its IP - this is used in the Copr playbooks

In case of a big queue of PPC64 tasks, simply call bin/destroy-all.sh; it will destroy stuck VMs and the copr backend will spawn new ones.

Ports opened for public

Frontend:

Port   Protocol   Service   Reason
22     TCP        ssh       Remote control
80     TCP        http      Serving Copr frontend website
443    TCP        https     ^^

Backend:

Port   Protocol   Service   Reason
22     TCP        ssh       Remote control
80     TCP        http      Serving build results and repos
443    TCP        https     ^^

Distgit:


Port   Protocol   Service   Reason
22     TCP        ssh       Remote control
80     TCP        http      Serving cgit interface
443    TCP        https     ^^

Keygen:

Port   Protocol   Service   Reason
22     TCP        ssh       Remote control

Resources justification

Copr currently uses the following resources.

Frontend

• RAM: 2G (out of 4G) and some swap
• CPU: 2 cores (3400 MHz) with load 0.92, 0.68, 0.65

Most of the memory is eaten by PostgreSQL, followed by Apache. The CPU is also mainly used by those two services, but in the reverse order. I don't think we can settle for any instance that provides less than 2G RAM (obviously), but ideally we need 3G+. A 2-core CPU is good enough.

• Disk space: 17G for the system and 8G for the pgsql db directory

If needed, we are able to clean up the database directory of old dumps and backups and get down to around 4G of disk space.

Backend

• RAM: 5G (out of 16G)
• CPU: 8 cores (3400 MHz) with load 4.09, 4.55, 4.24

The backend takes care of spinning up builders and running ansible playbooks on them, running createrepo_c (on big repositories) and so on. Copr utilizes two queues, one for builds, which are delegated to the OpenStack builders, and the action queue. Actions, however, are processed directly by the backend, so they can spike our load up. We would ideally like to have the same computing power that we have now. Maybe we can go lower than 16G RAM, possibly down to 12G RAM.

• Disk space: 30G for the system, 5.6T (out of 6.8T) for build results

Currently, we have 1.3T of backup data that is going to be deleted soon, but nevertheless, we cannot go any lower on storage. Disk space is a long-term issue for us and we need to make a lot of compromises just to survive our daily increase (which is around 10G of new data). Many features are blocked by not having enough storage. We cannot go any lower and also we cannot go much longer with the current storage.


Distgit

• RAM: ~270M (out of 4G), but climbs to ~1G when busy
• CPU: 2 cores (3400 MHz) with load 1.35, 1.00, 0.53

Personally, I wouldn't downgrade the machine too much. Possibly we can live with 3G RAM, but I wouldn't go any lower.

• Disk space: 7G for the system, 1.3T for dist-git data

We currently employ a lot of aggressive cleaning strategies on our dist-git data, so we can't go any lower than what we have.

Keygen

• RAM: ~150M (out of 2G)
• CPU: 1 core (3400 MHz) with load 0.10, 0.31, 0.25

We are basically running just signd and httpd here, both with minimal resource requirements. The memory usage is topped by systemd-journald.

• Disk space: 7G for the system and ~500M (out of ~700M) for GPG keys

We are slowly pushing the GPG key storage to its limit, so in the case of migrating copr-keygen somewhere, we would like to scale it up to at least 1G.

Fedora CoreOS Cincinnati SOP

Cincinnati is the update service/backend for Fedora CoreOS (FCOS) machines. This SOP describes how to access and how to troubleshoot it.

Contact Information

Owner: Fedora CoreOS Team
Contact: #fedora-coreos

Details

Source: https://github.com/coreos/fedora-coreos-cincinnati
Playbook: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/coreos-cincinnati.yml
Location: OpenShift cluster (production): https://os.fedoraproject.org
Project: coreos-cincinnati: https://os.fedoraproject.org/console/project/coreos-cincinnati/overview
Deployment: https://os.fedoraproject.org/console/project/coreos-cincinnati/browse/dc/coreos-cincinnati-stub
Containers:
• fcos-graph-builder (GB - raw updates graph)
• fcos-policy-engine (PE - frontend handling client requests)
Routes:
• coreos-updates-raw (GB web service)
• coreos-updates-raw-status (GB status and metrics)
• coreos-updates (PE web service)
• coreos-updates-status (PE status and metrics)

Troubleshooting

Each FCOS Cincinnati service exposes live metrics in Prometheus format:

• Graph-builder: https://status.raw-updates.coreos.fedoraproject.org/metrics
• Policy-engine: https://status.updates.coreos.fedoraproject.org/metrics
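These endpoints are plain HTTPS, so a quick way to check that a service is up and serving metrics is, for example:

$ curl -s https://status.updates.coreos.fedoraproject.org/metrics | head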

Upgrades

Building a new version

FCOS Cincinnati is built as a container image directly from source, referencing a pinned git commit. In order to build a new version, you will first have to find the relevant commit (i.e. the latest on the 'main' branch) at https://github.com/coreos/fedora-coreos-cincinnati. Once you have identified the target commit, these are the steps to build a new container image:

• update the 'fcos_cincinnati_build_git_sha' playbook variable in 'roles/openshift-apps/coreos-cincinnati/vars/staging.yml'
• update the 'fcos_cincinnati_build_git_sha' playbook variable in 'roles/openshift-apps/coreos-cincinnati/vars/production.yml'
• commit and push the update to the 'fedora-infra/ansible' repository
• SSH to batcave01.iad2.fedoraproject.org
• run 'sudo rbac-playbook openshift-apps/coreos-cincinnati.yml' using your FAS password and your second-factor OTP
• go to the project build overview at https://os.fedoraproject.org/console/project/coreos-cincinnati/browse/builds/coreos-cincinnati-stub
• schedule a new build through the "Start Build" button

Things that could go wrong

Application build is stuck

Issues in the underlying OpenShift cluster may result in builds being permanently stuck. If a build does not complete within a reasonable amount of time (i.e. 15 minutes):

• go to the build overview at https://os.fedoraproject.org/console/project/coreos-cincinnati/browse/builds/coreos-cincinnati-stub
• click on the build
• cancel it through the "Cancel Build" button
• go back to the build overview page
• schedule a new build through the "Start Build" button

Cyclades

cyclades notes

1. login as root - the default password is tslinux
2. change the password for root and admin to our password from the phx2-access.txt file in the private repo
3. port forward to the web browser for the cyclades: ssh -L 8080:rack47-serial.phx2.fedoraproject.org:80
4. connect to localhost:8080 in your web browser
5. login with root and the password you set above
6. click on 'security'
7. click on 'moderate'
8. logout, then port forward port 443 as above: ssh -L 8080:rack47-serial.phx2.fedoraproject.org:443
9. click on the 'wizard' button at lower left
10. proceed through the wizard. Info needed:
    - serial ports are set to 115200 8N1 by default
    - do not set up buffering
    - give it the ip of our syslog server
11. click 'apply changes'
12. hope
13. log back in
14. name/setup the port aliases

Darkserver SOP

To set up http://darkserver.fedoraproject.org, based on the Darkserver project, to provide GNU_BUILD_ID information for packages. A devel instance can be seen at http://darkserver01.dev.fedoraproject.org and the staging instance is http://darkserver01.stg.phx2.fedoraproject.org/. This page describes how to set up the server.

Contents

1. Contact Information
2. Installing the server
3. Setting up the database
4. SELinux Configuration
5. Koji plugin setup
6. Debugging


Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Persons: kushal, mether
Sponsor: nirik
Location: phx2
Servers: darkserver01, darkserver01.stg, darkserver01.dev
Purpose: To host Darkserver

Installing the Server

root@localhost# yum install darkserver

Setting up the database

We are using MySQL as database. We will need two users, one for koji-plugin and one for darkserver.:

root@localhost# mysql -u root
mysql> CREATE DATABASE darkserver;
mysql> GRANT INSERT ON darkserver.* TO kojiplugin@'koji-hub-ip' IDENTIFIED BY 'XXX';
mysql> GRANT SELECT ON darkserver.* TO dark@'darkserver-ip' IDENTIFIED BY 'XXX';

Setup this db configuration in the conf file under /etc/darkserver/darkserverweb.conf:

[darkserverweb]
host=db host name
user=dark
password=XXX
database=darkserver

Now setup the db tables if it is a new install. (For this you may need to 'GRANT * ON darkserver.*' to the web user, and then 'REVOKE * ON darkserver.*' after running.)

root@localhost# python /usr/lib/python2.6/site-packages/darkserverweb/manage.py syncdb

SELinux Configuration

Do the following to allow the webserver to connect to the database:

root@localhost# setsebool -P httpd_can_network_connect_db 1

Setting up the Koji plugin

Install the package.:


root@localhost# yum install darkserver-kojiplugin

Then fill up the configuration file under /etc/koji-hub/plugins/darkserver.conf:

[darkserver]
host=db host name
user=kojiplugin
password=XXX
database=darkserver
port=3306

Then enable the plugin in the koji hub configuration.

Debugging

Set DEBUG to True in /etc/darkserver/settings.py file and restart Apache.

Database Infrastructure SOP

Our database servers provide database storage for many of our apps.

Contents

1. Contact Information
2. Description
3. Creating a New Postgresql Database
4. Troubleshooting and Resolution
   1. Connection issues
   2. Some useful queries
      1. What queries are running
      2. Seeing how "dirty" a table is
      3. XID Wraparound
   3. Restart Procedure
      1. Koji
      2. Bodhi
5. Note about TurboGears and MySQL
6. Restoring from backups or specific dbs

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, sysadmin-dba group
Location: Phoenix
Servers: db01, db03, db-fas01, db-datanommer02, db-koji01, db-s390-koji01, db-arm-koji01, db-ppc-koji01, db-qa01, db-qastg01
Purpose: Provides database connection to many of our apps.

Description

db01, db03 and db-fas01 are our primary servers. db01 and db-fas01 run PostgreSQL. db03 contains mariadb. db-koji01, db-s390-koji01, db-arm-koji01 and db-ppc-koji01 contain the secondary kojis. db-qa01 and db-qastg01 contain resultsdb. db-datanommer02 contains all stored messages in a postgresql database.

Creating a New Postgresql Database

Creating a new database on our postgresql server isn't hard, but there are several steps that should be taken to make the database server as secure as possible.

We want to separate the database permissions so that we don't have one user/password combination that can do anything it likes to the database on every host (the webapp user can usually do a lot of things even without those extra permissions, but every little bit helps). Say we have an app called "raffle". We'd have three users:

• raffleadmin: able to make any changes they want to this particular database. It should not be used day to day, but only for things like updating the database schema when an update occurs. We could very likely disable this account in the db whenever we are not using it.
• raffleapp: the database user that the web application uses. This will likely need to be able to insert and select from all tables. It will probably need to update most tables as well. There may be some tables that it does not need delete on. It should almost certainly not need schema modifying permissions. (With postgres, it likely also needs permission to insert/select on sequences as well.)
• rafflereadonly: only able to read data from tables, not able to modify anything. Sadly, we aren't using this often but it can be useful for scripts that need to talk directly to the database without modifying it.

db2 $ sudo -u postgres createuser -P -E NEWDBadmin
Password:
db2 $ sudo -u postgres createuser -P -E NEWDBapp
Password:
db2 $ sudo -u postgres createuser -P -E NEWDBreadonly
Password:
db2 $ sudo -u postgres createdb -E utf8 NEWDB -O NEWDBadmin
db2 $ sudo -u postgres psql NEWDB
NEWDB=# revoke all on database NEWDB from public;
NEWDB=# revoke all on schema public from public;
NEWDB=# grant all on schema public to NEWDBadmin;
NEWDB=# [grant permissions to NEWDBapp as appropriate for your app]
NEWDB=# [grant permissions to NEWDBreadonly as appropriate for a user that is only trusted enough to read information]
NEWDB=# grant connect on database NEWDB to nagiosuser;
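What the app and readonly grants look like depends entirely on the application; as an illustration only (not a required policy), typical grants for those two roles might be:

NEWDB=# grant select, insert, update, delete on all tables in schema public to NEWDBapp;
NEWDB=# grant usage, select on all sequences in schema public to NEWDBapp;
NEWDB=# grant select on all tables in schema public to NEWDBreadonly;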

If your application needs to have the NEWDBapp and password to connect to the database, you probably want to add these to ansible as well. Put the password in the private repo in batcave01. Then use a templatefile to incorporate it into the config file. See fas.pp for an example.


Troubleshooting and Resolution

Connection issues

There are no known outstanding issues with the database itself. Remember that every time either database is restarted, services will have to be restarted (see below).

Some useful queries

What queries are running

This can help you find out what queries are currently running on the server:

select datname, pid, query_start, backend_start, query from pg_stat_activity where state<>'idle' order by query_start;

This can help you find how many connections to the db server are for each individual database:

select datname, count(datname) from pg_stat_activity group by datname order by count desc;

Seeing how “dirty” a table is

We've added a function from postgres's contrib directory to tell how dirty a table is. By dirty we mean: how many tuples are active, how many have been marked as having old data (and therefore "dead"), and how much free space is allocated to the table but not used:

\c fas2
\x
select * from pgstattuple('visit_identity');
table_len          | 425984
tuple_count        | 580
tuple_len          | 46977
tuple_percent      | 11.03
dead_tuple_count   | 68
dead_tuple_len     | 5508
dead_tuple_percent | 1.29
free_space         | 352420
free_percent       | 82.73
\x

Vacuum should clear out dead_tuples. Only a vacuum full, which will lock the table and therefore should be avoided, will clear out free space.
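For example, a plain VACUUM on the table inspected above should reclaim the dead tuples without taking the heavy lock that VACUUM FULL does (a sketch using the fas2/visit_identity names from the sample output):

\c fas2
VACUUM ANALYZE visit_identity;
select * from pgstattuple('visit_identity');   -- dead_tuple_count should now be at or near 0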

XID Wraparound

Find out how close we are to having to perform a vacuum of a database (as opposed to individual tables of the db). We should schedule a vacuum when about 50% of the transaction ids have been used (approximately 530,000,000 xids):

select datname, age(datfrozenxid), pow(2, 31)- age(datfrozenxid) as xids_remaining from pg_database order by xids_remaining;


More information on transaction ID wraparound is available in the PostgreSQL documentation.
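If a whole database does need attention, one way to freeze old transaction IDs is vacuumdb run as the postgres user; this is a hedged sketch rather than the team's recorded procedure, and NEWDB is a placeholder database name:

sudo -u postgres vacuumdb --freeze --verbose NEWDB   # freezes old XIDs database-wide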

Restart Procedure

If the database server needs to be restarted it should come back on its own. Otherwise each service on it can be restarted:

service mysqld restart
service postgresql restart

Koji

Any time postgresql is restarted, koji needs to be restarted. Please also see Restarting Koji.

Bodhi

Any time postgresql is restarted, Bodhi will need to be restarted as well. No SOP currently exists for this.

TurboGears and MySQL

Note: There's a known bug in TurboGears that causes MySQL clients not to automatically reconnect when the connection is lost. Typically a restart of the TurboGears application will correct this issue.

Restoring from backups or specific dbs.

Our backups store the latest copy in /backups/ on each db server. These backups are created automatically by the db-backup script run from cron; look in /usr/local/bin for the backup script. To restore partially or completely you need to:

1. Set up postgres on a system.
2. Start postgres / run initdb.
   • If this new system running postgres has already run ansible, then it will have the wrong config files in /var/lib/pgsql/data - clear them out before you start postgres so initdb can work.
3. Grab the backups you need from /backups - also grab global.sql. Edit global.sql to only create/alter the dbs you care about.
4. As postgres run: psql -U postgres -f global.sql
5. When this completes you can restore each db with (as the postgres user):
   createdb $dbname
   pg_restore -d $dbname dbname_backup_file.db
6. Restart postgres and check your data.
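Put together, a partial restore session might look like the following sketch; "somedb" and the dump file name are placeholders, and the paths follow the description above:

sudo -u postgres psql -f /backups/global.sql
sudo -u postgres createdb somedb
sudo -u postgres pg_restore -d somedb /backups/somedb_backup_file.db
service postgresql restart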

datanommer SOP

Consume fedmsg bus activity and stuff it in a postgresql db.

Contact Information

Owner: Messaging SIG, Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-fedmsg, #fedora-admin, #fedora-noc
Servers: busgateway01
Purpose: Save fedmsg bus activity

Description

datanommer is a set of three modules:

python-datanommer-models: Schema definition and API for storing new items and querying existing items.
python-datanommer-consumer: A plugin for the fedmsg-hub that actively listens to the bus and stores events.
datanommer-commands: A set of CLI tools for querying the DB.

datanommer will one day serve as a backend for future web services like datagrepper and dataviewer.

Source: https://github.com/fedora-infra/datanommer/
Plan: https://fedoraproject.org/wiki/User:Ianweller/statistics_plus_plus

CLI tools

Dump the db into a file as json:

$ datanommer-dump > datanommer-dump.json

When was the last bodhi message?:

$ # It was 678 seconds ago
$ datanommer-latest --category bodhi --timesince
[678]

When was the last bodhi message in more readable terms?:

$ # It was 12 minutes and 43 seconds ago
$ datanommer-latest --category bodhi --timesince --human
[0:12:43.087949]

What was that last bodhi message?:

$ datanommer-latest --category bodhi
[{"bodhi": {
  "topic": "org.fedoraproject.stg.bodhi.update.comment",
  "msg": {
    "comment": {
      "group": null,
      "author": "ralph",
      "text": "Testing for latest datanommer.",
      "karma": 0,
      "anonymous": false,
      "timestamp": 1360349639.0,
      "update_title": "xmonad-0.10-10.fc17"
    },
    "agent": "ralph"
  },
}}]

Show me stats on datanommer messages by topic:

$ datanommer-stats --topic
org.fedoraproject.stg.fas.group.member.remove has 10 entries
org.fedoraproject.stg.logger.log has 76 entries
org.fedoraproject.stg.bodhi.update.comment has 5 entries
org.fedoraproject.stg.busmon.colorized-messages has 10 entries
org.fedoraproject.stg.fas.user.update has 10 entries
org.fedoraproject.stg.wiki.article.edit has 106 entries
org.fedoraproject.stg.fas.user.create has 3 entries
org.fedoraproject.stg.bodhitest.testing has 4 entries
org.fedoraproject.stg.fedoratagger.tag.create has 9 entries
org.fedoraproject.stg.fedoratagger.user.rank.update has 5 entries
org.fedoraproject.stg.wiki.upload.complete has 1 entries
org.fedoraproject.stg.fas.group.member.sponsor has 6 entries
org.fedoraproject.stg.fedoratagger.tag.update has 1 entries
org.fedoraproject.stg.fas.group.member.apply has 17 entries
org.fedoraproject.stg.__main__.testing has 1 entries

Upgrading the DB Schema

datanommer uses "python-alembic" to manage its schema. When developers want to add new columns or features, these should/must be tracked in alembic and shipped with the RPM.

In order to run upgrades on our stg/prod dbs:

1. ssh to busgateway01{.stg}
2. cd /usr/share/datanommer.models/
3. Run:

$ alembic upgrade +1

Over and over again until the db is fully upgraded.
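If it is unclear how many steps remain, alembic's standard inspection commands can show where the database is relative to the shipped revisions (run from the same /usr/share/datanommer.models/ directory):

$ alembic current    # revision the database is at now
$ alembic history    # all known revisions; stop upgrading once current matches the newest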

Fedora Debuginfod Service - SOP

Debuginfod is the software that lies behind the service at https://debuginfod.fedoraproject.org/ and https://debuginfod.stg.fedoraproject.org/ . These services run on one VM each in the stg and prod infrastructure at IAD2.

Contact Information

Owner: RH perftools team + Fedora Infrastructure Team
Contact: @fche in #fedora-noc
Servers: VMs
Purpose: Serve elf/dwarf/source-code debuginfo for supported releases to debugger-like tools in Fedora.
Repository: https://sourceware.org/elfutils/Debuginfod.html https://fedoraproject.org/wiki/Debuginfod

How it works

One virtual machine in prod NFS-mounts the koji build system's RPM repository, read-only. The production VM has a virtual twin in the staging environment. They each run elfutils debuginfod to index designated RPMs into a large local sqlite database. They answer HTTP queries received from users on the Internet via reverse-proxies at the https://debuginfod.fedoraproject.org/ URL. The reverse proxies apply gzip compression on the data and redirect the root / location only into the fedora wiki. Normally, the service is autonomous and needs no maintenance; it should come back nicely after many kinds of outage. The software is based on elfutils in Fedora, but may occasionally track a custom COPR build with backported patches from future elfutils versions.

Configuration

The daemon uses systemd and /etc/sysconfig/debuginfod to set basic parameters. These have been tuned from the distro defaults via experimental hand-editing or ansible. Key parameters are:

1. The -I/-X include/exclude regexes. These tell debuginfod what fedora versions to include RPMs for. If index disk space starts to run low, one can eliminate some older fedoras from the index to free up space (after the next groom cycle).
2. The --fdcache related parameters. These tell debuginfod how much data to cache from RPMs. (Some debuginfo files - kernel, llvm, gtkweb, ... - are huge and worth retaining instead of repeatedly extracting.) This is a straight disk space vs. time tradeoff.
3. The -t (scan interval) parameter. Scanning lets an index get bigger, as new RPMs in koji are examined and their contents indexed. Each pass takes a bunch of hours to traverse the entire koji NFS directory structure to fstat() everything for newness or change. A smaller scan interval lets debuginfod react quicker to koji builds coming into existence, but increases load on the NFS server.
4. The -n (scan threads) parameter. More scan threads may help the indexing process go faster, if the networking fabric & NFS server are underloaded.
5. The -g (groom interval) parameter. Grooming lets an index get smaller, as files removed from koji will be forgotten about. It can be run very intermittently - weekly or less - since it takes many hours and cannot run concurrently with scanning.

A quick:

systemctl restart debuginfod

activates the new settings. In case of some drastic failure like database corruption or signs of penetration/abuse, one can shut down the server with systemd, and/or stop traffic at the incoming proxy configuration level. The index sqlite database under /var/cache/debuginfod may be deleted, if necessary, but keep in mind that it takes days to reindex the relevant parts of koji. Alternately, with the services stopped, the 150GB+ sqlite database files may be freely copied between the staging and production servers, if that helps during disaster recovery.


Monitoring

Prometheus

The debuginfod daemons answer the standard /metrics URL endpoint to serve a variety of operational metrics in prometheus. Important metrics include:

1. filesys_free_ratio - free space on the filesystems. (These are also monitored via fedora-infra nagios.) If the free space on the database or tmp partition falls low, further indexing or even service may be impacted. Add more disk space if possible, or start eliding older fedora versions from the database via the -I/-X daemon options.
2. thread_busy - number of busy threads. During indexing, 1-6 threads may be busy for minutes or even days, intermittently. User requests show up as "buildid" (real request) or "buildid-after-you" (deferred duplicate request) labels. If there are more than a handful of "buildid" ones, there may be an overload/abuse underway, in which case it's time to identify the excessive traffic via the logs and get a temporary iptables block going. Or perhaps there is an outage or slowdown of the koji NFS storage system, in which case there's not much to do.
3. error_count - these should be zero or near zero all the time.
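These can be spot-checked from a shell by querying the /metrics endpoint mentioned above; the exact metric label sets vary between elfutils versions, so this is only an illustrative sketch:

$ curl -s https://debuginfod.fedoraproject.org/metrics | grep -E 'filesys_free_ratio|thread_busy|error_count'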

Logs

The debuginfod daemons produce voluminous logs into the local systemd journal, whence the traffic moves to the usual fedora-infra log01 server, /var/log/hosts/debuginfod*/YYYY/MM/DD/messages.log. The lines related to HTTP GET identify the main webapi traffic, with originating IP addresses in the XFF: field, and response size and elapsed service time in the last columns. These can be useful in tracking down possible abuse.

Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): 10.3.163.75:43776 UA:elfutils/0.185,Linux/x86_64,fedora/35 XFF:*elided* GET /buildid/90910c1963bbcf700c0c0c06ee3bf4c5cc831d3a/debuginfo 200 3354400+0ms

The lines related to prometheus /metrics are usually no big deal. The log also includes info about errors and indexing progress. Lines like the following may be interesting:

Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): serving fdcache archive /mnt/fedora_koji_prod/koji/packages/valgrind/3.17.0/3.fc35/x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm file /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so

which identify the file names derived from requests (which RPMs the buildids map to). These can provide some indirect distro telemetry: what packages and binaries are being debugged, and for which architectures?

Denyhosts Infrastructure SOP

Denyhosts provides a protection against brute force attacks.

Contents

1. Contact Information
2. Description
3. Troubleshooting and Resolution
   1. Connection issues

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main group
Location: Anywhere
Servers: All
Purpose: Denyhosts provides protection against brute force attacks.

Description

All of our servers now implement denyhosts to protect against brute force attacks. Very few boxes should be in the 'allowed' list, especially internally.

Troubleshooting and Resolution

Connection issues

The most common issue will be legitimate logins failing. First, try to figure out why a host ended up on the deny list (tcptraceroute, failed login attempts, etc. are all good candidates). Then follow the directions below. The example below is for a host (10.0.0.1) being banned. Log in to the box from a different host and, as root, do the following:

cd /var/lib/denyhosts
sed -si '/10.0.0.1/d' * /etc/hosts.deny
/etc/init.d/denyhosts restart

That should correct the problem.

Departing admin SOP

From time to time admins depart the project; this SOP covers removing any access they may no longer need.

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main
Location: Everywhere
Servers: all

Description

From time to time people with admin access to various parts of the project may leave the project or no longer wish to contribute. This SOP attempts to list the process for removing access they no longer need.


0. First, make sure that this SOP is needed. Verify the person has left the project and what areas they might wish to still contribute to.
1. Gather info: fas username, email address, knowledge of passwords.
2. Check the following areas with the following commands:

email address in ansible
   • Check: git grep email@address
   • Remove: git commit
koji admin
   • Check: koji list-permissions --user=username
   • Remove: koji revoke-permission permissionname username
wiki pages
   • Check: look for https://fedoraproject.org/wiki/User:Username
   • Remove: delete page, or modify with info they are no longer contributing.
packages
   • Check: Download https://admin.fedoraproject.org/pkgdb/lists/bugzilla?tg_format=plain and grep
   • Remove: remove from cc, orphan packages or reassign.
fas account
   • Check: check username in fas
   • Remove: set user inactive

Note: If there are scripts or files needed, save homedir of user.

passwords
   • Check: if departing admin knew sensitive passwords.
   • Remove: Change passwords.

Note: root pw, management interfaces, etc

DNS repository for fedoraproject

We've set this up so we can easily (and quickly) edit and deploy dns changes with a record of who changed what and why. This system also lets us pull proxies out of rotation quickly and with a minimum of opportunity for error. Finally, it checks to make sure that all of the zone changes will actually work before they are allowed.

DNS Infrastructure SOP

We have 5 DNS servers:

ns02.fedoraproject.org hosted at ibiblio (ipv6 enabled)
ns05.fedoraproject.org hosted at internetx (ipv6 enabled)
ns13.rdu2.fedoraproject.org in rdu2, internal to rdu2.
ns01.iad2.fedoraproject.org in iad2, internal to iad2.
ns02.iad2.fedoraproject.org in iad2, internal to iad2.

Contents

1. Contact Information
2. Troubleshooting, Resolution and Maintenance
   1. DNS update
   2. Adding a new zone
3. GeoDNS
   1. Non geodns fedoraproject.org IPs
   2. Adding and removing countries
   3. IP Country Mapping
4. resolv.conf
   1. Phoenix
   2. Non-Phoenix

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, sysadmin-dns
Location: ServerBeach and ibiblio and internetx and phx2.
Servers: ns02, ns05, ns13.rdu2, ns01.iad2, ns02.iad2
Purpose: Provides DNS to our users

Troubleshooting, Resolution and Maintenance

Check out the DNS repository

You can get the dns repository from /srv/git/dns on batcave01:

$ git clone /srv/git/dns

Adding a new Host

Adding a new host requires adding it to DNS and to ansible; see new-hosts.rst for the details.


Editing the domain(s)

We have three domains which need to be able to change on demand for proxy rotation/removal:

• fedoraproject.org.
• getfedora.org.
• cloud.fedoraproject.org.

The other domains are edited only when we add/subtract a host or move it to a new ip. Not much else.

If you need to edit a domain that is NOT in the above list:

• change to the 'master' subdir, edit the domain as usual (remember to update the serial), save it.

If you need to edit one of the domains in the above list (replace fedoraproject.org with the domain from above):

• if you need to add/change a host in fedoraproject.org that is not '@' or 'wildcard' then:
  - edit fedoraproject.org.template
  - make your changes
  - do not edit the serial or anything surrounded by {{ }} unless you REALLY know what you are doing.
• if you need to only add/remove a proxy during an outage or due to a networking issue then run:
  - ./zone-template fedoraproject.org.cfg disable ip [ip] [ip] to disable the ip of the proxy you want removed.
  - ./zone-template fedoraproject.org.cfg enable ip [ip] [ip] reverses the disable.
  - ./zone-template fedoraproject.org.cfg reset will reset to all ips enabled.
• if you want to add an all new proxy as '@' or 'wildcard' for fedoraproject.org:
  - edit fedoraproject.org.cfg
  - add the ip to the correct section of the ipv4 or ipv6 in the config.
  - save the file
  - check the file for validity by running: python fedoraproject.org.cfg, looking for errors or tracebacks.

When complete run:

git add .
git commit -a -m 'description of your change here'

It is important to commit this before running the do-domains script as it makes it easier to track the changes.

In all cases then run:

• ./do-domains
• if that completes successfully then run:

git add .
git commit -a -m 'description of your change here'
git push

• nameservers update from dns via cron every 10 minutes.

The above git process can be achieved with the bash function below, where the commit message is passed as an argument when running:


dnscommit() {
    local args=$1
    cd ~/dns
    git commit -a -m "${args}"
    git pull --rebase && ./do-domains && git add built && git commit -a -m "Signed DNS" && git push
}
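For example, after editing a zone the function takes the commit message as its single argument (the message text here is only illustrative):

dnscommit "remove proxy04 from fedoraproject.org rotation for maintenance"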

If you need an update to be live more quickly, run this on all of the nameservers (as root):

/usr/local/bin/update-dns

To run this via ansible from batcave do:

$ sudo rbac-playbook update_dns.yml

This will pull from the git tree, update all of the zones and reload the name server.

DNS update

DNS config files are ansible managed on batcave01. From your local machine run:

git clone ssh://[email protected]/fedora-infra/ansible.git
cd ansible/roles/dns/files/
...make changes needed...
git commit -m "What you did"
git push

It should update within a half hour. You can test the new configs with dig:

dig @ns01.fedoraproject.org fedoraproject.org

Adding a new zone

First name the zone and generate a new set of keys for it. Run this on ns01. Note it could take SEVERAL minutes to run:

/usr/sbin/dnssec-keygen -a RSASHA1 -b 1024 -n ZONE c.fedoraproject.org
/usr/sbin/dnssec-keygen -a RSASHA1 -b 2048 -n ZONE -f KSK c.fedoraproject.org

Then copy the created .key and .private files to the private git repo (You need to be sysadmin-main to do this). The directory is private/private/dnssec.

• add the zone in zones.conf in ansible/roles/dns/files/zones.conf
• save and commit - but do not push
• Add the zone file to the master subdir in this repo
• git add and commit the file
• check the zone by running check-domains


• if you intend to have this be a dnssec signed zone then you must create a new key:

/usr/sbin/dnssec-keygen -a RSASHA1 -b 1024 -n ZONE $domain.org /usr/sbin/dnssec-keygen -a RSASHA1 -b 2048 -n ZONE -f KSK $domain.org

• put the files this generates into /srv/privatekeys/dnssec on batcave01
• edit the do-domains file in this dir and add your domain to the signed_domains entry at the top
• edit the zone you just created and add the contents of the .key files to the bottom of the zone

If this is a subdomain of fedoraproject.org:

• run dnssec-dsfromkey on each of the .key files generated (as shown below)
• paste that output into the bottom of fedoraproject.org.template
• commit everything to the dns tree
• push your changes
• push your changes to the ansible repo
• test

If you add a new child zone, such as c.fedoraproject.org or vpn.fedoraproject.org, you will also need to add the contents of dsset-childzone.fedoraproject.org (for example) to the main fedoraproject.org zonefile, so that DNSSEC has a valid trust path to that zone. You also must set the NS delegation entries near the top of the fedoraproject.org zone file; these are necessary to keep dnssec-signzone from whining with this error msg:

dnssec-signzone: fatal: 'xxxxx.example.com': found DS RRset without NS RRset

Look for the "vpn IN NS" records at the top of fedoraproject.org and copy them for the new child zone.
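For the dnssec-dsfromkey step above, the invocation is simply the tool pointed at each generated key file; the file name below is hypothetical and should be replaced with the actual K<zone>.+<alg>+<id>.key files that dnssec-keygen produced:

dnssec-dsfromkey Kvpn.fedoraproject.org.+005+12345.key   # prints DS records to paste into fedoraproject.org.template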

GeoDNS

As part of our Content Distribution Network we use geodns for certain zones. At the moment just the fedoraproject.org and *.fedoraproject.org zones. We've got proxy servers all over the US and in Europe. We are now sending users to proxy servers that are near them. The current list of available 'zone areas' is:

• DEFAULT
• EU
• NA

DEFAULT contains all the zones. So someone who does not seem to be in or near the EU or NA would get directed to any random set. (South Africa for example doesn't get directed to any particular server.)

Important: Don’t forget to increase the serial number in the fedoraproject.org zone file. Even if you’re making a change to one of the geodns IPs. There is only one serial number for all setups and that serial number is in the fedoraproject.org zone.


Note: Non geodns fedoraproject.org IPs. If you're adding a server that is just in one location and isn't going to be geodns balanced, just add that host to the fedoraproject.org zone.

Adding and removing countries

Our setup actually requires us to specify which countries go to which servers. To do this, simply edit the named.conf file in ansible. Below is an example of what counts as "NA" (North America):

view "NA" {
    match-clients { US; CA; MX; };
    recursion no;
    zone "fedoraproject.org" {
        type master;
        file "master/NA/fedoraproject.org.signed";
    };
    include "etc/zones.conf";
};

IP Country Mapping

The IP -> Location mapping is done via a config file that exists on the dns servers themselves (it's not ansible controlled). The file, located at /var/named/chroot/etc/GeoIP.acl, is generated by the GeoIP.sh script (that script is in ansible).

Warning: This is known to be a less efficient means of doing geodns than the patched version from kernel.org. We're using this version at the moment because it's in Fedora and works. The level of DNS traffic we see is generally low enough that the inefficiencies aren't really noticeable. For example, average load on the servers before this geodns was .2; now it's around .4.

resolv.conf

In order to make the network more transparent to the admins, we do a lot of search based relative names. Below is a list of what a resolv.conf should look like.

Important: Any machine that is not on our vpn or has not yet joined the vpn should _NOT_ have the vpn.fedoraproject.org search until after it has been added to the vpn (if it ever does)

Phoenix

search phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org

Phoenix in the QA network:

search qa.fedoraproject.org vpn.fedoraproject.org phx2.fedoraproject.org fedoraproject.org

Non-Phoenix


search vpn.fedoraproject.org fedoraproject.org

The idea here is that we can, when need be, set up local domains to contact instead of having to go over the VPN directly, but still have sane configs. For example, if we tell the proxy server to hit "app1" and that box is in PHX, it will go directly to app1; if it's not, it will go over the vpn to app1.

docs SOP

Fedora Documentation - Documentation for installing and using Fedora

Contact Information

Owner: docs, Fedora Infrastructure Team
Contact: #fedora-docs
Servers: proxy*
Purpose: Provide documentation for users and contributors.

Description:

The Fedora Documentation Project was created to provide documentation for Fedora users and contributors. It's like "The Bible" for using Fedora and other software used by the Fedora Project. It uses Publican, a free and open-source publishing tool. Publican generates html pages from content in DocBook XML format. The source files are in a git repo and publican builds html files from these source files whenever changes are made. As these are static pages, they are available on all the proxy servers, which serve our requests for docs.fedoraproject.org.

Updates process:

The fedora docs writers update and build their docs and then push the completed output into a git repo. This git repo is then pulled by each of the Fedora proxies and served as static content. Note that docs is talking about setting up a new process, this SOP needs updating when that happens.

Reporting bugs:

Bugs can be reported at the Fedora Documentation's Bugzilla. Here's the link: https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora%20Documentation Errors or problems in the wiki can be fixed by anyone with a FAS account.

Contributing to the Fedora Documentation Project:

If you find the existing documentation insufficient or outdated, or any particular page is not available in your language, feel free to improve the documentation by contributing to the Fedora Documentation Project. You can find more details here: https://fedoraproject.org/wiki/Join_the_Docs_Project Translation of documentation is taken care of by the Fedora Localization Project, aka L10N. More details can be found at: https://fedoraproject.org/wiki/L10N


Publican wiki:

More details about Publican can be found at the publican wiki here: https://sourceware.org/publican/en-US/index.html

Fedora Account System

Notes about FAS and how to do things in it:

• Where are the certs for fas accounts for koji, etc.? On fas01 in /var/lib/fedora-ca - makefile targets allow you to do things with them. Look in index.txt for certs. Ones marked with an 'R' in the left-most column are 'REVOKED'.

To revoke a cert:

cd /var/lib/fedora-ca

Find the cert number in index.txt - the number is the 3rd column in the file - you can match it to the user by searching for their username. You want the highest number cert for their account. Once you have the number you would run (as root or fas):

make revoke cert=newcerts/$that_number.pem

How to gather information about a user

You'll want to have direct access to query the database for this. The common way is to have someone in sysadmin-db ssh to the postgres db hosting FAS (currently db01). Then access it via ident auth on the box:

sudo -u postgres psql fas2

There are several tables that will have information about a user. Some of it is redundant but it's good to check all the sources; there shouldn't be inconsistencies:

select * from people where username='USERNAME';

Of interest here are:

id: for later queries
password_changed: tells when the password was last changed
last_seen: last login to fas (including through jsonfas from other TG1/2 apps. Maybe wiki and insight as well. Not fedorahosted trac, shell login, etc.)
status_change: last time that the user's status was updated via the website. Usually triggered when the user was marked inactive for a mass password change and then they reset their password.

The next table is the log table:

select * from log where author_id=ID_FROM_PREV_QUERY or description ~ '.*USERNAME.*';

The FAS writes certain events to the log table. This will get those events. We use both the author_id field (who made the change) and the username in a description regex search because a few changes are made to users by admins. Fields of interest are pretty self explanatory here:

changetime: when the log was made
description: description of the event that's being logged

Note: FAS does not log every event that happens to a user. Only “important” ones. FAS also cannot record direct changes to the database here (for instance, when we mark accounts inactive administratively via the db).

Lastly, there’s the groups and person_roles table. When a user joins a group, the person_roles table is updated to reflect the user’s status in the group, when they applied, and when they were approved:

select groups.name, person_roles.* from person_roles, groups where person_id=ID_FROM_INITIAL_QUERY and groups.id=person_roles.group_id;

This will give you the following fields to pay attention to:

name: Name of the group
role_status: If this is unapproved, it just means the user applied for it. If it is approved, it means they are actually in the group.
creation: When the user applied to the group
approval: When the user was approved to be in the group
role_type: What role the person has or wants to have in the group
sponsor_id: If you suspect something is suspicious with one of the roles, you may want to ask the sponsor if they remember sponsoring this person

Account Deletion and renaming

Note: see also accountdeletion.rst For information on how to disable, rename, and remove accounts.

Pseudo Users

Note: see also nonhumanaccounts.rst For information on creating pseudo user accounts for use in pkgdb/bugzilla

fas staging

We have a staging fas db set up on db-fas01.stg.phx2.fedoraproject.org; it is accessed by fas01.stg.phx2.fedoraproject.org. This system is not autopopulated by production fas - it must be done manually. To do this you must:

• dump the fas2 db on db-fas01.phx2.fedoraproject.org:

sudo -u postgres pg_dump -C fas2 > fas2.dump
scp fas2.dump db-fas01.stg.phx2.fedoraproject.org:/tmp

• then on fas01.stg.phx2.fedoraproject.org:

/etc/init.d/httpd stop

• then on db02.stg.phx2.fedoraproject.org:


echo"drop database fas2\;"| sudo-u postgres psql ; cat fas2.dump| sudo-u

˓→postgres psql

• then on fas01.stg.phx2.fedoraproject.org:

/etc/init.d/httpd start

that should do it.

FAS-OpenID

FAS-OpenID is the OpenID server of Fedora infrastructure. Live instance is at https://id.fedoraproject.org/ Staging instance is at https://id.dev.fedoraproject.org/

Contact Information

Owner: Patrick Uiterwijk (puiterwijk)
Contact: #fedora-admin, #fedora-apps, #fedora-noc
Location: openid0{1,2}.phx2.fedoraproject.org, openid01.stg.fedoraproject.org
Purpose: Authentication & Authorization

Trusted roots

FAS-OpenID has a set of "trusted roots", which contains websites which are always trusted, and thus FAS-OpenID will not show the Approve/Reject form to the user when they log in to any such site. As a policy, we will only add websites to this list which Fedora Infrastructure controls. If anyone ever asks to add a website to this list, just answer with this default message:

We only add websites we (Fedora Infrastructure) maintain to this list.

This feature was put in because it wouldn't make sense to ask for permission to send data to the same set of servers that it already came from.

Also, if we were to add external websites, we would need to judge their privacy policy etc.

Also, people might start complaining that we added site X but not their site, maybe causing us "political" issues later down the road.

As a result, we do NOT add external websites.

fedmsg (Fedora Messaging) Certs, Keys, and CA - SOP

X509 certs, private RSA keys, Certificate Authority, and Certificate Revocation List.


Contact Information

Owner: Messaging SIG, Fedora Infrastructure Team
Contact: #fedora-admin, #fedora-apps, #fedora-noc
Servers:
• app0[1-7]
• packages0[1-2]
• fas0[1-3]
• pkgs01
• busgateway01
• value0{1,3}
• releng0{1,4}
• relepel03
Purpose: Certify fedmsg messages come from authentic sources.

Description

fedmsg sends JSON-encoded messages from many services to a zeromq messaging bus. We're not concerned with encrypting the messages, only with signing them so an attacker cannot spoof.

Every instance of each service on each host has its own cert and private key, signed by the CA. By convention, we name the certs <service>-<fqdn>.{crt,key}. For instance, bodhi has the following certs:

• bodhi-app01.phx2.fedoraproject.org
• bodhi-app02.phx2.fedoraproject.org
• bodhi-app03.phx2.fedoraproject.org
• bodhi-app01.stg.phx2.fedoraproject.org
• bodhi-app02.stg.phx2.fedoraproject.org
• more

Scripts to generate new keys, sign them, and revoke them live in the ansible repo in ansible/roles/fedmsg/files/cert-tools/. The keys and certs themselves (including ca.crt and the CRL) live in the private repo in private/fedmsg-certs/keys/.

fedmsg is locally configured to find the key it needs by looking in /etc/fedmsg.d/ssl.py, which is kept in ansible in ansible/roles/fedmsg/templates/fedmsg.d/ssl.py.erb.

Each service-host has its own key. This means:

• A key is not shared across multiple instances of a service on different machines. i.e., bodhi on app01 and bodhi on app02 should have different key/cert pairs.
• A key is not shared across multiple services on a host. i.e., mediawiki on app01 and bodhi on app01 should have different key/cert pairs.

The attempt here is to minimize the number of potential attack vectors. Each private key should be readable only by the service that needs it. bodhi runs under mod_wsgi in apache and should run as its own unique bodhi user (not as apache). The permissions for its private key, when deployed by ansible, should be read-only for that local bodhi user.


For more information on how fedmsg uses these certs see http://fedmsg.readthedocs.org/en/latest/crypto.html

Configuring the Scripts

Usage of the main scripts is described in more detail below. They are located in ansible/roles/fedmsg/files/cert-tools.

Before you use them, you'll need to point them at the right directory to modify. By default, this is ~/private/fedmsg-certs/keys/. You can change that by editing ansible/roles/fedmsg/files/cert-tools/vars in the event that you have the private repo checked out to an alternate location.

There are other configuration values defined in that script. Most will not need to be changed.

Wiping and Rebuilding Everything

There is a script in ansible/roles/fedmsg/files/cert-tools/ named rebuild-all-fedmsg-certs. You can run it with no arguments to wipe out the old and generate a new CA root certificate, a signing cert and key, and all key/cert pairs for all service-hosts.

Note: Warning – Obviously, this will wipe everything. Do you want that?

Adding a new key for a new service-host

First, checkout the ansible private repo as that’s where the keys are going to be stored. The scripts will assume this is checked out to ~/private. In ansible/roles/fedmsg/files/cert-tools run:

$ source ./vars
$ ./build-and-sign-key <service>-<fqdn>

For instance, if we bring up a new app host, app10.phx2.fedoraproject.org, we’ll need to generate a new cert/key pair for each fedmsg-enabled service that will be running on it, so you’d run:

$ source ./vars
$ ./build-and-sign-key shell-app10.phx2.fedoraproject.org
$ ./build-and-sign-key bodhi-app10.phx2.fedoraproject.org
$ ./build-and-sign-key mediawiki-app10.phx2.fedoraproject.org

Just creating the keys isn't quite enough; there are four more things you'll need to do. The private keys are created in your checkout of the private repo under ~/private/private/fedmsg-certs/keys. There will be four files for each cert you created: <serial>.pem (ex: 5B.pem) and <service>-<fqdn>.{crt,csr,key}.

First, git add, commit, and push all of those.

Second, you need to edit ansible/roles/fedmsg/files/cert-tools/rebuild-all-fedmsg-certs and add the argument of the commands you just ran, so that the next time certs need to be blown away and recreated, the new service-hosts will be included. For the examples above, you would need to add to the list:

shell-app10.phx2.fedoraproject.org bodhi-app10.phx2.fedoraproject.org mediawiki-app10.phx2.fedoraproject.org


Third, you need to ensure that the keys are distributed to the host with the proper permissions. Only the bodhi user should be able to access bodhi's private key. This can be accomplished by using fedmsg::certificate in ansible. It should distribute your new keys to the correct hosts and correctly permission them.

Lastly, if you haven't already updated the global fedmsg config, you'll need to. You need to add your new service-node to fedmsg.d/endpoint.py and to fedmsg.d/ssl.py. Those can be found in ansible/roles/fedmsg/templates/fedmsg.d. See http://fedmsg.readthedocs.org/en/latest/config.html for more information on the layout and meaning of those files.

Revoking a key

In ansible/roles/fedmsg/files/cert-tools run:

$ source ./vars
$ ./revoke-full <service>-<fqdn>

This will alter private/fedmsg-certs/keys/crl.pem which should be picked up and served publicly, and then consumed by all fedmsg consumers globally. crl.pem is publicly available at http://fedoraproject.org/fedmsg/crl.pem

Note: Even though crl.pem lives in the private repo, we’re just keeping it there for convenience. It really should be served publicly, so don’t panic. :)

Note: At the time of this writing, the CRL is not actually used. I need one publicly available first so we can test it out.

fedmsg-gateway SOP

Outgoing raw ZeroMQ message stream.

Note: see also: fedmsg-websocket

Contact Information

Owner: Messaging SIG, Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-admin, #fedora-noc
Servers: busgateway01, proxy0*
Purpose: Expose raw ZeroMQ messages outside the FI environment.

Description

Users outside of Fedora Infrastructure can listen to the production message bus by connecting to specific addresses. This is required for local users to run their own hubs and message processors (“Consumers”). It is also required for user-facing tools like fedmsg-notify to work.


The specific public endpoints are:

production: tcp://hub.fedoraproject.org:9940
staging: tcp://stg.fedoraproject.org:9940

fedmsg-gateway, the daemon running on busgateway01, is listening to the FI production fedmsg bus and will relay every message that it receives out to a special ZMQ pub endpoint bound to port 9940. haproxy mediates connections to the fedmsg-gateway daemon.

Connection Flow

Clients connecting through haproxy on proxy0*:9940 are redirected to busgateway0*:9940. This can be found in the haproxy.cfg entry for listen fedmsg-raw-zmq 0.0.0.0:9940. This is different than the apache reverse proxy pass setup we have for the app0* and packages0* machines. That flow looks something like this:

Client -> apache(proxy01) -> haproxy(proxy01) -> apache(app01)

The flow for the raw zmq stream provided by fedmsg-gateway looks something like this:

Client -> haproxy(proxy01) -> fedmsg-gateway(busgateway01)

haproxy is listening on a public port. At the time of this writing, haproxy does not actually load balance zeromq session requests across multiple busgateway0* machines, but there is nothing stopping us from adding them. New hosts can be added in ansible and pressed from busgateway01's template. Add them to the fedmsg-raw-zmq listen in haproxy's config and it should Just Work.

Increasing the Maximum Number of Concurrent Connections

HTTP requests are typically very short (a few seconds at most). This means that the number of concurrent tcp connections we require for most of our services is quite low (1024 is overkill). ZeroMQ tcp connections, on the other hand, are expected to live for quite a long time. Consequently we needed to scale up the number of possible concurrent tcp connections.

All of this is in ansible and should be handled for us automatically if we bring up new nodes.

• The pam_limits user limit for the fedmsg user was increased from 1024 to 160000 on busgateway01.
• The pam_limits user limit for the haproxy user was increased from 1024 to 160000 on the proxy0* machines.
• The zeromq High Water Mark (HWM) was increased to 160000 on busgateway01.
• The maximum number of connections allowed was increased in haproxy.cfg.

Nagios

New nagios checks were added for this that check to see if the number of concurrent connections through haproxy is approaching the maximum number allowed. You can check these numbers by hand by inspecting the haproxy web interface: https://admin.fedoraproject.org/haproxy/proxy1#fedmsg-raw-zmq

Look at the "Sessions" section. "Cur" is the current number of sessions, versus "Max", the maximum number seen at the same time, and "Limit", the maximum number of concurrent connections allowed.


RHIT

We had RHIT open up port 9940 special to proxy01.phx2 for this.

fedmsg introduction and basics, SOP

General information about fedmsg

Contact Information

Owner: Messaging SIG, Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-admin, #fedora-noc
Servers: Almost all of them.
Purpose: Introduce sysadmins to fedmsg tools and config

Description

fedmsg is a system that links together most of our webapps and services into a message mesh or net (often called a "bus"). It is built on top of the zeromq messaging library.

fedmsg has its own developer documentation that is a good place to check if this or other SOPs don't provide enough information: http://fedmsg.rtfd.org

Tools

Generally, fedmsg-tail and fedmsg-logger are the two most commonly used tools for debugging and testing. To see if bus-connectivity exists between two machines, log onto each of them and run the following on the first:

$ echo testing from $(hostname) | fedmsg-logger

And run the following on the second:

$ fedmsg-tail --really-pretty

Configuration

fedmsg configuration lives in /etc/fedmsg.d/. /etc/fedmsg.d/endpoints.py keeps the list of every possible fedmsg endpoint. It acts as a global index that defines the bus.

See fedmsg.readthedocs.org/en/latest/config/ for a full glossary of configuration values.
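To see the merged configuration a host is actually using, the fedmsg-config utility that ships with fedmsg can print it (assuming the fedmsg package is installed; it reads everything under /etc/fedmsg.d/):

$ fedmsg-config | less    # dumps the merged config, including the endpoints dict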

Logs

fedmsg daemons keep their logs in /var/log/fedmsg. fedmsg message hooks in existing apps (like bodhi) will log any errors to the logs of the app they've been added to (like /var/log/httpd/error_log).


fedmsg-irc SOP

Echo fedmsg bus activity to IRC.

Contact Information

Owner: Messaging SIG, Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-fedmsg, #fedora-admin, #fedora-noc
Servers: value03
Purpose: Echo fedmsg bus activity to IRC

Description

fedmsg-irc is a daemon running on value03 and value01.stg. It is listening to the fedmsg bus and echoing that activity to the #fedora-fedmsg channel in IRC. It can be configured to ignore certain messages, join certain rooms, and take on a different nick by editing the values in /etc/fedmsg.d/irc.py and restarting it with sudo service fedmsg-irc restart See http://fedmsg.readthedocs.org/en/latest/config/#term-irc for more information on configuration.

Adding a new fedmsg message type

Instrumenting the program

First, figure out how you're going to publish the message. Is it from a shell script or from a long-running process?

If it's from a shell script, you just need to add a fedmsg-logger statement to the script. Remember to set the --modname and --topic for your new message's fully-qualified topic.

If it's from a python process, you just need to add a fedmsg.publish(..) call. The same concerns about modname and topic apply here. If this is a short-lived python process, you'll want to add active=True to the call to fedmsg.publish(..). This will make the fedmsg lib "actively" reach out to our fedmsg-relay running on busgateway01. If it is a long-running python process (like a WSGI thread), then you don't need to pass any extra arguments. You don't want it to reach out to the fedmsg-relay if possible. Your process will require that some "endpoints" are created for it in /etc/fedmsg.d/. More on that below.
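For the shell-script case, an invocation might look like the following sketch; the modname, topic, and message body are hypothetical:

$ echo '{"repo": "example", "branch": "f35"}' | \
    fedmsg-logger --modname=myapp --topic=repo.update --json-input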

Supporting infrastructure

You need to make sure that the machine this is running on has a cert and key that can be read by the program to sign its message. If you don’t have a cert already, then you need to create it in the private repo. Ask a sysadmin-main member. Then you need to declare those certs in the fedmsg_certs data structure stored typically in our ansible group_vars/ for this service. Declare both the name of the cert, what group and user it should be owned by, and in the can_send: section, declare the list of topics that this cert should be allowed to publish. If this is a long-running python process that is not passing active=True to the call to fedmsg.publish(..), then you have to also declare endpoints for it. You do that by specifying the fedmsg_wsgi_procs and fedmsg_wsgi_vars in the group_vars for your service. The iptables rules and fedmsg endpoints should be automatically created for you on the next playbook run.


Supporting code

At this point, you can push the change out to production and be publishing messages "okay". Everything should be fine. However, your message will show up blank in datagrepper, in IRC, in FMN, and everywhere else we try to render it. You must then follow up and write a new Processor for it in the fedmsg_meta library we maintain: https://github.com/fedora-infra/fedmsg_meta_fedora_infrastructure

You also must write a test case for it there. The docs listing all topics we publish at http://fedora-fedmsg.rtfd.org/ are automatically generated from the test suite. Please don't forget this.

Lastly, you should cut a release of fedmsg_meta and deploy it using the playbooks/manual/upgrade/fedmsg.yml playbook, which should update all the relevant hosts.

Corner cases

If the process publishing the new message lives outside our main network, you have to jump through more hoops. Look at abrt, koschei, and copr for examples of how to configure this (you need a special firewall rule, and they need to be configured to talk to our "inbound gateway" running on the proxies).

fedmsg-relay SOP

Bridge ephemeral scripts into the fedmsg bus.

Contact Information

Owner: Messaging SIG, Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-admin, #fedora-noc
Servers: app01
Purpose: Bridge ephemeral bash and python scripts into the fedmsg bus.

Description

fedmsg-relay is running on app01, which is a bad choice. We should look to move it to a more isolated place in the future. busgateway01 would be a better choice. “Ephemeral” scripts like pkgdb2branch.py, the post-receive git hook on pkgs01, and anywhere fedmsg-logger is used all depend on fedmsg-relay. Instead of emitting messages “directly” to the rest of the bus, they use fedmsg-relay as an intermediary. Check that fedmsg-relay is running by looking for it in the process list. You can restart it in the standard way with sudo service fedmsg-relay restart. Check for its logs in /var/log/fedmsg/fedmsg-relay.log Ephemeral scripts know where the fedmsg-relay is by looking for the relay_inbound and relay_outbound values in the global fedmsg config.


But What is it Doing? And Why?

The fedmsg bus is designed to be "passive" in its normal operation. A mod_wsgi process under httpd sets up its fedmsg publisher socket to passively emit messages on a certain port. When some other service wants to receive these messages, it is up to that service to know where mod_wsgi is emitting and to actively connect there. In this way, emitting is passive and listening is active.

We get a problem when we have a one-off or "ephemeral" script that is not a long-running process - a script like pkgdb2branch which is run when a user runs it and which ends shortly after. Listeners who want these scripts' messages will find that they are usually not available when they try to connect.

To solve this problem, we introduced the "fedmsg-relay" daemon which is a kind of "passive"-to-"passive" adaptor. It binds to an outbound port on one end where it will publish messages (like normal) but it also binds to another port where it listens passively for inbound messages. Ephemeral scripts then actively connect to the passive inbound port of the fedmsg-relay to have their payloads echoed on the bus-proper.

See http://fedmsg.readthedocs.org/en/latest/topology/ for a diagram.

websocket SOP

websocket communication with Fedora apps. See also: fedmsg-gateway.txt

Contact Information

Owner: Messaging SIG, Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-admin, #fedora-noc
Servers: busgateway01, proxy0*, app0*
Purpose: Expose a websocket server for FI apps to use

Description

WebSocket is a protocol (an extension of HTTP/1.1) by which client web browsers can establish full-duplex socket communications with a server - the "real-time web".

In our case, webapps served from app0* and packages0* will include code instructing client browsers to establish a second connection to our WebSocket server. They point browsers to the following addresses:

production: wss://hub.fedoraproject.org:9939
staging: wss://stg.fedoraproject.org:9939

The websocket server itself is a fedmsg-hub daemon running on busgateway01. It is configured to enable its websocket server component in the presence of certain configuration values.

haproxy mediates connections to the fedmsg-hub websocket server daemon. An stunnel daemon provides SSL support.

Connection Flow

The connection flow is much the same as in the fedmsg-gateway.txt SOP, but is somewhat more complicated. “Normal” HTTP requests to our app servers traverse the following chain:


Client-> apache(proxy01)-> haproxy(proxy01)-> apache(app01)

The flow for a websocket requests looks something like this:

Client -> stunnel(proxy01) -> haproxy(proxy01) -> fedmsg-hub(busgateway01)

stunnel is listening on a public port, negotiates the SSL connection, and redirects the connection to haproxy, who in turn hands it off to the fedmsg-hub websocket server listening on busgateway01.

At the time of this writing, haproxy does not actually load balance zeromq session requests across multiple busgateway0* machines, but there is nothing stopping us from adding them. New hosts can be added in ansible and pressed from busgateway01's template. Add them to the fedmsg-websockets listen in haproxy's config and it should Just Work.

RHIT

We had RHIT open up port 9939 special to proxy01.phx2 for this.

Fedocal SOP

Fedocal is a web-based group calendar application that is made available to the various groups within the Fedora project.

Contents

1. Contact Information 2. Documentation Links

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Location: https://apps.fedoraproject.org/calendar
Servers:
Purpose: To provide links to the documentation for fedocal as it exists elsewhere on the internet; it was decided that a link document would be a better use of resources than rewriting the book.

Documentation Links

For information on the latest and greatest in fedocal please review: http://fedocal.readthedocs.org/en/latest/ For documentation on the usage of fedocal please consult: http://fedocal.readthedocs.org/en/latest/usage.html

Fedora Release Infrastructure SOP

This SOP contains all of the steps required by the Fedora Infrastructure team in order to get a release out. Much of this work overlaps with the Release Engineering team (which at present shares many of the same members). Some work may get done by releng, some may get done by Infrastructure; as long as it gets done, it doesn't matter.


Contact Information

Owner: Fedora Infrastructure Team, Fedora Release Engineering Team
Contact: #fedora-admin, #fedora-releng, sysadmin-main, sysadmin-releng
Location: N/A
Servers: All
Purpose: Releasing a new version of Fedora

Preparations

Before a release ships, the following items need to be completed.

1. New website from the websites team (typically hosted at http://getfedora.org/_/)
2. Verify mirror space (for all test releases as well)
3. Verify with rel-eng that permissions on content are right on the mirrors. Don't leak.
4. Communication with Red Hat IS (Give at least 2 months notice, then reminders as the time comes near) (final release only)
5. Infrastructure change freeze
6. Modify Template:FedoraVersion to reference new version. (Final release only)
7. Move old releases to archive (post final release only)
8. Switch release from development/N to normal releases/N/ tree in mirror manager (post final release only)

Change Freeze

The rules are simple:

• Hosts with the ansible variable "freezes" "True" are frozen.
• You may make changes as normal on hosts that are not frozen. (For example, staging is never frozen.)
• Changes to frozen hosts require a freeze break request sent to the fedora infrastructure list, containing a description of the problem or issue, actions to be taken and (if possible) patches to ansible that will be applied. These freeze breaks must then get two approvals from sysadmin-main or sysadmin-releng group members before being applied.
• Changes to recover from outages are acceptable on frozen hosts if needed.

Change freezes will be sent to the fedora-infrastructure-list and begin 3 weeks before each release and the final release. The freeze will end one day after the release. Note, if the release slips during a change freeze, the freeze just extends until the day after a release ships.

You can get a list of frozen/non-frozen hosts by:

git clone https://pagure.io/fedora-infra/ansible.git
scripts/freezelist -i inventory


Notes about release day

Release day is always an interesting and unique event. After the final sprint from test to the final release a lot of the developers will be looking forward to a bit of time away, as well as some sleep. Once Release Engineering has built the final tree, and synced it to the mirrors it is our job to make sure everything else (except the bit flip) gets done as painlessly and easily as possible.

Note: All communication is typically done in #fedora-admin. Typically these channels are laid back and staying on topic isn't strictly enforced. On release day this is not true. We encourage people to come, stay in the room, and be quiet unless they have a specific task or question related to release day. It's nothing personal, but release day can get out of hand quickly.

During normal load, our websites function as normal. This is especially true since we've moved the wiki to mod_fcgi. On release day our load spikes a great deal. During the Fedora 6 launch many services were offline for hours. Some (like the docs) were off for days. A large part of this outage was due to the wiki not being able to handle the load, part was a lack of planning by the Infrastructure team, and part is still a mystery. There are questions as to whether or not all of the traffic was legit or a ddos.

The Fedora 7 release went much better. Some services were offline for minutes at a time but very little of it was out longer than that. The wiki crashed, as it always does. We had made sure to make the fedoraproject.org landing page static though. This helped a great deal, though we did see spiky load on the proxy boxes.

Recent releases have been quite smooth due to a number of changes: we have a good deal more bandwidth on master mirrors, more cpus and memory, and prerelease versions are much easier to come by for those interested before release day.

Day Prior to Release Day

Step 1 (Torrent)

Set up the torrent. All files can be synced to the torrent box, just not published to the world. Verify with sha1sum. Follow the instructions in the torrentrelease.txt SOP up to and including step 4.

Step 2 (Website)

Verify the website design / content has been finalized with the websites team. Update the Fedora version number wiki template if this is a final release. It will need to be changed at https://fedoraproject.org/wiki/Template:CurrentFedoraVersion
Additionally, there are redirects in the ansible playbooks/include/proxies-redirects.yml file for Cloud Images. These should be pushed as soon as the content is available. See https://pagure.io/fedora-infrastructure/issue/3866 for an example.

Step 3 (Mirrors)

Verify enough mirrors are set up and have Fedora ready for release. If for some reason something is broken, it needs to be fixed. Many of the mirrors are running a check-in script. This lets us know who has Fedora without having to scan everyone. Hide the Alpha, Beta, and Preview releases from the publiclist page. You can check this by looking at:


wget"http://mirrors.fedoraproject.org/mirrorlist?path=pub/fedora/linux/releases/test/

˓→28-Beta&country=global"

(replace 28 and Beta with the version and release.)

Release day

Step 1 (Prep and wait)

Verify the mirrors are ready and that the torrent has valid copies of its files (use sha1sum). Do not move on to step two until the Release Engineering team has given the OK for the release. It is the releng team's decision as to whether or not we release, and they may pull the plug at any moment.

Step 2 (Torrent)

Once given the OK to release, the Infrastructure team should publish the torrent and encourage people to seed. Complete the steps after step 4 in https://fedora-infra-docs.readthedocs.io/en/latest/sysadmin-guide/sops/torrentrelease.html

Step 3 (Bit flip)

The mirrors sit and wait for a single permissions bit to be altered so that the content shows up in their services. The bit flip (done by the releng team) will replicate out to the mirrors. Verify that the mirrors have received the change by checking that the content is actually available; a spot check is enough. Once that is complete, move on.

Step 4 (Website)

Once all of the distribution pieces are verified (mirrors and torrent), all that is left is to publish the website. At present this is done by making sure the master branch of fedora-web is pulled by the syncStatic.sh script in ansible. It will normally sync within an hour, but on release day people don't like to wait that long, so do the following on sundries01:

sudo -u apache /usr/local/bin/lock-wrapper syncStatic 'sh -x /usr/local/bin/syncStatic'

Once that completes, on batcave01:

sudo -i
ansible proxy\* "/usr/bin/rsync --delete -a --no-owner --no-group bapp02::getfedora.org/ /srv/web/getfedora.org/"

Verify http://getfedora.org/ is working.

Step 5 (Docs)

Just as with the website, the docs site needs to be published. To do so, run:

/root/bin/docs-sync


Step 6 (Monitor)

Once the website is live, keep an eye on various news sites for the release announcement. Closely watch the load on all of the boxes, proxy, application and otherwise. If something is getting overloaded, see suggestions on this page in the “Juggling Resources” section.

Step 7 (Badges) (final release only)

We have some badge rules that are dependent on which release of Fedora we're on. As you have time, please perform the following on your local box:

$ git clone ssh://[email protected]/fedora-badges.git
$ cd badges

Edit rules/tester-it-still-works.yml and update the release tag to match the now old but stable release. For instance, if we just released fc21, then the tag in that badge rule should be fc20. Edit rules/tester-you-can-pry-it-from-my-cold-dead-hands.yml and update the release tag to match the release that is about to reach EOL. For instance, if we just released f28, then the tag in that badge rule should be f26. Commit the changes:

$ git commit -a -m 'Updated tester badge rule for f28 release.'
$ git push origin master

Then, on batcave, perform the following:

$ sudo -i ansible-playbook $(pwd)/playbooks/manual/push-badges.yml

Step 8 (Done)

Just chill, keep an eye on everything and make changes as needed. If you can’t keep a service up, try to redirect randomly to some of the mirrors.

Priorities

Priorities during release day (in order):

1. Website. Anything related to a user landing at fedoraproject.org and clicking through to a mirror or torrent to download something must be kept up. This is distribution, and without it we can potentially lose many users.

2. Linked addresses. We do not have direct control over what Hacker News, Phoronix or anyone else links to. If they link to something on the wiki and it is going down, or link to any other site we control, a rewrite should be put in place to direct them to http://fedoraproject.org/get-fedora.

3. Torrent. The torrent server has never had problems during a release. Make sure it is up.

4. Release Notes. Typically grouped with the docs site, the release notes are often linked to (this is fine, no need to redirect), but keep an eye on the logs and ensure that where we've said the release notes are, they can be found there. In previous releases we sometimes had to make them available in more than one spot.

5. docs.fedoraproject.org. People will want to see what's new in Fedora and get further documentation about it. Much of this is in the release notes.


6. wiki. Because it is so resource heavy, and because it is so developer oriented, we have no choice but to give the wiki a lower priority.

7. Everything else.

Juggling Resources

In our environment we're running different things on many different servers. Using Xen we can easily give machines more or less RAM and processors. We can take down builders and bring up application servers. The trick is to be smart and make sure you understand what is causing the problem. These are some tips to keep in mind:

• IPTables-based bandwidth and connection limiting (successful in the past)
• Altering the weight on the proxy balancers
• Creating static pages out of otherwise dynamic content
• Redirecting pages to a mirror
• Adding a server / removing un-needed servers

CHECKLISTS:

Beta:

• Announce infrastructure freeze 3 weeks before Beta
• Change /topic in #fedora-admin
• Mail the infrastructure list a reminder
• File all tickets: new website, check mirror permissions, mirrormanager, check mirror sizes, release day ticket.

After release is a "go":

• Make sure torrents are set up and ready to go.
• fedora-web needs a branch for fN-beta. In it:
  – Beta used on get-prerelease
  – get-prerelease doesn't direct to release
  – verify is updated with Beta info
  – releases.txt gets a branched entry for preupgrade
  – bfo gets updated to have a Beta entry.

After release:

• Update /topic in #fedora-admin
• Post to the infrastructure list that the freeze is over.


Final:

• Announce infrastructure freeze 2 weeks before Final
• Change /topic in #fedora-admin
• Mail the infrastructure list a reminder
• File all tickets: new website, check mirror permissions, mirrormanager, check mirror sizes, release day ticket.

After release is a "go":

• Make sure torrents are set up and ready to go.
• fedora-web needs a branch for fN-alpha. In it:
  – get-prerelease does direct to release
  – verify is updated with Final info
  – bfo gets updated to have a Final entry.
  – update wiki version numbers and names.

After release:

• Update /topic in #fedora-admin
• Post to the infrastructure list that the freeze is over.
• Move MirrorManager repository tags from the development/$version/ Directory objects to the releases/$version/ Directory objects. This is done using the move-devel-to-release --version=$version command on bapp02. This is usually done a week or two after release.

Fedora Packages SOP

This SOP is for the Fedora Packages web application. https://apps.fedoraproject.org/packages

Contents

1. Contact Information
2. Deploying to the servers
3. Maintenance
4. Checking for AGPL violations

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, #fedora-apps
Persons: cverna
Location: PHX2


Servers: packages03.phx2.fedoraproject.org, packages04.phx2.fedoraproject.org, packages03.stg.phx2.fedoraproject.org
Purpose: Web interface for searching package information

Deploying to the servers

Deploying

Once the new version is built, it needs to be deployed. To deploy the new version, you need ssh access to batcave01.phx2.fedoraproject.org and permissions to run the Ansible playbook. All the following commands should be run from batcave01. You can check the upstream documentation on how to build a new release. This process results in a fedora-packages rpm available in the infra-tag rpm repo. You should make use of the staging instance in order to test the new version of the application.

Upgrading

To upgrade, run the upgrade playbook:

$ sudo rbac-playbook manual/upgrade/packages.yml

This will upgrade the fedora-packages package and restart the Apache web server and the fedmsg-hub service.

Rebuild the xapian Database

If you need to rebuild the xapian database then you can run the following playbook:

$ sudo rbac-playbook manual/rebuild/fedora-packages.yml

Maintenance

The web application is served by httpd and managed by the httpd service:

$ sudo systemctl restart httpd

can be used to restart the service if needed. The application log files are available under the /var/log/httpd/ directory. The xapian database is updated by a fedmsg consumer. You can restart the fedmsg-hub service if needed by using:

$ sudo systemctl restart fedmsg-hub

To check the consumer logs you can use:

$ sudo journalctl -u fedmsg-hub


Checking for AGPL violations

To remain AGPL compliant, we must ensure that all modifications to the code are made available in the SRPM that we link to in the footer of the application. You can easily query our app servers to determine if any AGPL-violating code modifications have been made to the package:

func-command --host="*app*" --host="community*" "rpm -V fedoracommunity"

You can safely ignore any changes to non-code files in the output. If any violations are found, the Infrastructure Team should be notified immediately.
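If you want to automate that triage, the following is a minimal sketch (not an existing infrastructure tool; the list of "code" suffixes is an assumption) that reads `rpm -V fedoracommunity` output on stdin and prints only entries that look like modified code files, skipping config files:

#!/usr/bin/env python3
"""Hypothetical helper: flag AGPL-relevant changes in `rpm -V` output (illustrative sketch)."""
import sys

CODE_SUFFIXES = (".py", ".js", ".html", ".mak", ".wsgi")  # assumption: what counts as "code"

def modified_code_lines(lines):
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split()
        # rpm -V lines look like "S.5....T.  c /etc/foo.conf" or "S.5....T.  /usr/lib/.../bar.py"
        path = parts[-1]
        file_type = parts[1] if len(parts) == 3 else ""
        if file_type == "c":  # config files are expected to change
            continue
        if path.endswith(CODE_SUFFIXES):
            yield line

if __name__ == "__main__":
    for line in modified_code_lines(sys.stdin):
        print(line)

It would be used by piping a single host's verification output through it, for example: ssh app01 'rpm -V fedoracommunity' | python3 check_agpl.py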

Fedora Pastebin SOP

Contents

1. Contact Information
2. Introduction
3. Installation
4. Dashboard
5. Add a word to censored list

1. Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Persons: athmane herlo
Sponsor: nirik
Location: phx2
Servers: paste01.stg, paste01.dev
Purpose: To host Fedora Pastebin

2. Introduction

Fedora pastebin is powered by sticky-notes, which is included in EPEL. The Fedora theming (skin) is included in an ansible role.

3. Installation

Sticky-notes needs a MySQL db and a user with 'select, update, delete, insert' privileges. It's recommended to dump and import the db from a working installation to save time (skipping the installation and tweaking). By default the installation is locked, i.e. you can't relaunch it.


However, you can unlock the installation by commenting out the line containing $gsod->trigger in /etc/sticky-notes/install.php and then pointing the web browser to '/install'. The configuration file containing general settings and DB credentials is located at /etc/sticky-notes/config.php

4. Dashboard

Sticky-notes has a dashboard (URL: /admin/) that can be used to:

• Manage pastes:
  – deleting a paste
  – getting information about the paste author (IP/date/time etc.)
• Manage users (aka admins) which can log into the dashboard
• Manage IP bans (add / delete banned IPs)
• Authentication (not needed)
• Site configuration:
  – General configuration (included in config.php)
  – Project Honey Pot configuration (not a FOSS service)
  – Word censor configuration: a list of words to be censored in pastes

5. Add a word to censored list

If a word is in the censored list, any paste containing that word will be rejected. To add one, edit the variable '$sg_censor' in the sticky-notes configuration file:

$sg_censor = "WORD1 WORD2 ...... WORDn";

Websites Release SOP

• 1. Preparing the website for a release
  – 1.1 Obsolete GPG key of the EOL Fedora release
  – 1.2 Update GPG key
    * 1.2.1 Steps
• 2. Update website
  – 2.1 For Alpha
  – 2.2 For Beta
  – 2.3 For GA
• 3. Fire in the hole
• 4. Tips


  – 4.1 Merging branches

1. Preparing the website for a new release cycle

1.1 Obsolete GPG key

One month after a Fedora release, the release numbered 'FXX-2' reaches EOL (End of Life); for example, one month after the F21 release, F19 goes EOL. At this point we should drop the GPG key from the list in verify/ and move the keys to the obsolete keys page in keys/obsolete.html.

1.2 Update GPG key

After another couple of weeks, as the next release approaches, watch the fedora-release package for a new key to be added. Use the update-gpg-keys script in the fedora-web git repository to add it to static/. Manually add it to /keys and /verify on all websites where we use these keys:

• arm.fpo
• getfedora.org
• labs.fpo
• spins.fpo

1.2.1 Steps

a) Get a copy of the new key(s) from the fedora-release repo (https://pagure.io/fedora-repos); you will find FXX-primary and FXX-secondary keys. Save them in ./tools to make the update easier.

b) Start by editing ./tools/update-gpg-keys and adding the key-ids of any obsolete keys to the obsolete_keys list.

c) Then run that script to add the new key(s) to the fedora.gpg block:

   fedora-web git:(master) cd tools/
   tools git:(master) ./update-gpg-keys RPM-GPG-KEY-fedora-23-primary
   tools git:(master) ./update-gpg-keys RPM-GPG-KEY-fedora-23-secondary

   This will add the key(s) to the keyblock in static/fedora.gpg and create a text file for each key in static/$KEYID.txt as well. Verify that these files have been created properly and contain all the keys that they should.

   • Handy checks: gpg static/fedora.gpg or gpg static/$KEYID.txt
   • Adding the "--with-fingerprint" option will add the fingerprint to the output

   The output of fedora.gpg should contain only the actual keys, not the obsolete keys. The single text files should contain the correct information for the uploaded key.

d) Next, add the new key(s) to the list in data/verify.html and move the new key information to the keys page in data/content/keys/index.html. A script to aid in generating the HTML code for new keys is in ./tools/make-gpg-key-html. It will print HTML to stdout for each RPM-GPG-KEY-* file given as an argument. This is suitable for copy/paste (or direct import if your editor supports it). Check the copied HTML code and select whether the key info is for a primary or secondary key (the output says 'Primary or Secondary').

   tools git:(master) ./make-gpg-key-html RPM-GPG-KEY-fedora-23-primary


Build the website with 'make en test' and carefully verify that the data is correct. Please double check all keys at http://localhost:5000/en/keys and http://localhost:5000/en/verify. NOTE: the tool will give you outdated output; adapt it to the new websites and bootstrap layout!

2. Update website

2.1 For Alpha

a) Create the fXX-alpha branch from master:

   fedora-web git:(master) git push origin master:refs/heads/f22-alpha

   and check out the new branch:

   fedora-web git:(master) git checkout -t -b f13-alpha origin/f13-alpha

b) Update the global variables. Change curr_state to Alpha for all arches.

c) Add the Alpha banner. Upload the FXX-Alpha banner to static/images/banners/f22alpha.png, which should appear in every ${PRODUCT}/download/index.html page. Make sure the banner is shown in all sidebars, also in labs, spins, and arm.

d) Check all download links and paths in ${PRODUCT}/prerelease/index.html. You can find all paths on bapp01 (sudo su - mirrormanager first) or you can look at the download page http://dl.fedoraproject.org/pub/alt/stage

e) Add CHECKSUM files to static/checksums and verify that the paths are correct. The files should be on sundries01 and you can query them with:

   $ find /pub/fedora/linux/releases/test/17-Alpha/ -type f -name CHECKSUM -exec cp '{}' . \;

   Remember to add the right checksums to the right websites (same path).

f) Add EC2 AMI IDs for Alpha. All IDs are now in the globalvar.py file. We get all data from there, even the redirect path to track the AMI IDs. We now also have a script which is useful to get all the AMI IDs uploaded with fedimg. Execute it to get the latest uploads, but don't run the script too early, as new builds are added constantly.

   fedora-web git:(fXX-alpha) python ~/fedora-web/tools/get_ami.py

g) Add CHECKSUM files also to http://spins.fedoraproject.org in static/checksums. Verify the paths are correct in data/content/verify.html (see point e) on how to query them on sundries01). Same for labs.fpo and arm.fpo.

h) Verify all paths and links on http://spins.fpo, labs.fpo and arm.fpo.

i) Update Alpha image sizes and pre_cloud_composedate in ./build.d/globalvar.py. Verify they are right for the Cloud images and the Docker image.

j) Update the new POT files and push them to Zanata (ask a maintainer to do so) every time you change text strings.

k) Add this build to stg.fedoraproject.org (ansible syncStatic.sh.stg) to test the pages online.

l) Release Date:

   • Merge the fXX-alpha branch to master and correct conflicts manually
   • Remove the redirect of prerelease pages in ansible; edit ansible/playbooks/include/proxies-redirects.yml and ask a sysadmin-main to run the playbook


   • When ready and about 90 minutes before Release Time, push to master
   • Tag the commit as a new release and push it too:
     $ git tag -a FXX-Alpha -m 'Releasing Fedora XX Alpha'
     $ git push --tags
   • If needed follow "Fire in the hole" below.

2.2 For Beta

a) Create the fXX-beta branch from master:

   fedora-web git:(master) git push origin master:refs/heads/f22-beta

   and check out the new branch:

   fedora-web git:(master) git checkout -t -b f22-beta origin/f22-beta

b) Update the global variables. Change curr_state to Beta for all arches.

c) Add the Beta banner. Upload the FXX-Beta banner to static/images/banners/f22beta.png, which should appear in every ${PRODUCT}/download/index.html page. Make sure the banner is shown in all sidebars, also in labs, spins, and arm.

d) Check all download links and paths in ${PRODUCT}/prerelease/index.html. You can find all paths on bapp01 (sudo su - mirrormanager first) or you can look at the download page http://dl.fedoraproject.org/pub/alt/stage

e) Add CHECKSUM files to static/checksums and verify that the paths are correct. The files should be on sundries and you can query them with:

   $ find /pub/fedora/linux/releases/test/17-Beta/ -type f -name CHECKSUM -exec cp '{}' . \;

   Remember to add the right checksums to the right websites (same path).

f) Add EC2 AMI IDs for Beta. All IDs are now in the globalvar.py file. We get all data from there, even the redirect path to track the AMI IDs. We now also have a script which is useful to get all the AMI IDs uploaded with fedimg. Execute it to get the latest uploads, but don't run the script too early, as new builds are added constantly.

   fedora-web git:(fXX-beta) python ~/fedora-web/tools/get_ami.py

g) Add CHECKSUM files also to http://spins.fedoraproject.org in static/checksums. Verify the paths are correct in data/content/verify.html (see point e) on how to query them on sundries01). Same for labs.fpo and arm.fpo.

h) Remove static/checksums/Fedora-XX-Alpha* in all websites.

i) Verify all paths and links on http://spins.fpo, labs.fpo and arm.fpo.

j) Update Beta image sizes and pre_cloud_composedate in ./build.d/globalvar.py. Verify they are right for the Cloud images and the Docker image.

k) Update the new POT files and push them to Zanata (ask a maintainer to do so) every time you change text strings.

l) Add this build to stg.fedoraproject.org (ansible syncStatic.sh.stg) to test the pages online.

m) Release Date:

   • Merge the fXX-beta branch to master and correct conflicts manually
   • When ready and about 90 minutes before Release Time, push to master
   • Tag the commit as a new release and push it too:
     $ git tag -a FXX-Beta -m 'Releasing Fedora XX Beta'
     $ git push --tags
   • If needed follow "Fire in the hole" below.


2.3 For GA

a) Create the fXX branch from master:

   fedora-web git:(master) git push origin master:refs/heads/f22

   and check out the new branch:

   fedora-web git:(master) git checkout -t -b f22 origin/f22

b) Update the global variables. Change curr_state for all arches.

c) Check all download links and paths in ${PRODUCT}/download/index.html. You can find all paths on bapp01 (sudo su - mirrormanager first) or you can look at the download page http://dl.fedoraproject.org/pub/alt/stage

d) Add CHECKSUM files to static/checksums and verify that the paths are correct. The files should be on sundries01 and you can query them with:

   $ find /pub/fedora/linux/releases/17/ -type f -name CHECKSUM -exec cp '{}' . \;

   Remember to add the right checksums to the right websites (same path).

e) At some point freeze translations. Add an empty PO_FREEZE file to every website's directory you want to freeze.

f) Add EC2 AMI IDs for GA. All IDs are now in the globalvar.py file. We get all data from there, even the redirect path to track the AMI IDs. We now also have a script which is useful to get all the AMI IDs uploaded with fedimg. Execute it to get the latest uploads, but don't run the script too early, as new builds are added constantly.

   fedora-web git:(fXX) python ~/fedora-web/tools/get_ami.py

g) Add CHECKSUM files also to http://spins.fedoraproject.org in static/checksums. Verify the paths are correct in data/content/verify.html (see point d) on how to query them on sundries01). Same for labs.fpo and arm.fpo.

h) Remove static/checksums/Fedora-XX-Beta* in all websites.

i) Verify all paths and links on http://spins.fpo, labs.fpo and arm.fpo.

j) Update GA image sizes and cloud_composedate in ./build.d/globalvar.py. Verify they are right for the Cloud images and the Docker image.

k) Update static/js/checksum.js and check that the paths and checksums still match.

l) Update the new POT files and push them to Zanata (ask a maintainer to do so) every time you change text strings.

m) Add this build to stg.fedoraproject.org (ansible syncStatic.sh.stg) to test the pages online.

n) Release Date:

   • Merge the fXX branch to master and correct conflicts manually
   • Add the redirect of prerelease pages in ansible; edit ansible/playbooks/include/proxies-redirects.yml and ask a sysadmin-main to run the playbook
   • Unfreeze translations by deleting the PO_FREEZE files
   • When ready and about 90 minutes before Release Time, push to master
   • Update the short links for the Cloud Images for 'Fedora XX', 'Fedora XX-1' and 'Latest'


   • Tag the commit as a new release and push it too:
     $ git tag -a FXX -m 'Releasing Fedora XX'
     $ git push --tags
   • If needed follow "Fire in the hole" below.

3. Fire in the hole

We now use ansible for everything, and normally use a regular build to make the websites live. If something is not happening as expected, you should get in contact with a sysadmin-main to run the ansible playbook again. All our stuff, such as the SyncStatic.sh and SyncTranslation.sh scripts, is now also in ansible! Staging server app02 and production server bapp01 do not exist anymore; our staging websites are now on sundries01.stg and production on sundries01. Change your scripts accordingly; as sysadmin-web you should have access to those servers as before.

4. Tips

4.1 Merging branches

Suggested by Ricky. This can be useful if you're sure all new changes on the devel branch should go into the master branch. Conflicts will be resolved by accepting only the changes in the devel branch. If you're not 100% sure, do a normal merge and fix conflicts manually!

$ git merge f22-beta
$ git checkout --theirs f22-beta [list of conflicting po files]
$ git commit

FedMsg Notifications (FMN) SOP

Route individualized notifications to Fedora contributors over email and IRC.

Contact Information

Owner

• Messaging SIG
• Fedora Infrastructure Team

Contact

• #fedora-apps for FMN development
• #fedora-fedmsg for an IRC feed of all fedmsgs
• #fedora-admin for problems with the deployment of FMN
• #fedora-noc for outage/crisis alerts

Servers

Production servers:

• notifs-backend01.phx2.fedoraproject.org (RHEL 7)
• notifs-web01.phx2.fedoraproject.org (RHEL 7)
• notifs-web02.phx2.fedoraproject.org (RHEL 7)


Staging servers:

• notifs-backend01.stg.phx2.fedoraproject.org (RHEL 7)
• notifs-web01.stg.phx2.fedoraproject.org (RHEL 7)
• notifs-web02.stg.phx2.fedoraproject.org (RHEL 7)

Purpose

Route notifications to users

Description

FMN is a pair of systems intended to route fedmsg notifications to Fedora contributors and users. There is a web interface running on notifs-web01 and notifs-web02 that allows users to log in and configure their preferences to select this or that type of message. There is a backend running on notifs-backend01 where most of the work is done. The backend process is a 'fedmsg-hub' daemon, controlled by systemd.

Hosts

notifs-backend

This host runs:

• fedmsg-hub.service
• One or more [email protected] worker units. Currently notifs-backend01 runs fmn-worker@{1-4}.service
• [email protected][email protected]
• rabbitmq-server.service, an AMQP broker used to communicate between the services
• redis.service, used for caching

This host relies on a PostgreSQL database running on db01.phx2.fedoraproject.org.

notifs-web

This host runs:

• A Python WSGI application via Apache httpd that serves the FMN web user interface.

This host relies on a PostgreSQL database running on db01.phx2.fedoraproject.org.

Deployment

Once upstream releases a new version of fmn, fmn-web, or fmn-sse by creating a Git tag, a new version can be built and deployed into Fedora infrastructure.


Building

FMN is packaged in Fedora and EPEL as python-fmn (the backend), python-fmn-web (the frontend), and the optional python-fmn-sse. Since all the hosts run RHEL 7, you need to build all these packages for EPEL 7.

Configuration

If there are any configuration updates required by the new version of FMN, update the notifs Ansible roles on batcave01.phx2.fedoraproject.org. Remember to use:

{% if env == 'staging' %} {% else %} {% endif %}

when deploying the update to staging. You can apply configuration updates to staging by running:

$ sudo rbac-playbook -l staging groups/notifs-backend.yml
$ sudo rbac-playbook -l staging groups/notifs-web.yml

Simply drop the -l staging to update the production configuration.

Upgrading

To upgrade the python-fmn, python-fmn-web, and python-fmn-sse packages, apply configuration changes, and restart the services, you should use the manual upgrade playbook:

$ sudo rbac-playbook -l staging manual/upgrade/fmn.yml

Again, drop the -l staging flag to upgrade production. Be aware that the FMN services take a significant amount of time to start up as they pre-heat their caches before starting work.

Service Administration

Disable an account (on notifs-backend01):

$ sudo -u fedmsg /usr/local/bin/fmn-disable-account USERNAME

Restart:

$ sudo systemctl restart fedmsg-hub

Watch logs:

$ sudo journalctl -u fedmsg-hub -f

Configuration:


$ ls /etc/fedmsg.d/
$ sudo fedmsg-config | less

Monitor performance: http://threebean.org/fedmsg-health-day.html#FMN

Upgrade (from batcave):

$ sudo -i ansible-playbook /srv/web/infra/ansible/playbooks/manual/upgrade/fmn.yml

Mailing Lists

We use FMN as a way to forward certain kinds of messages to mailing lists so people can read them the good old-fashioned way that they like. To accomplish this, we create 'bot' FAS accounts with their own FMN profiles and we set their email addresses to the lists in question. If you need to change the way some set of messages is forwarded, you can do it from the FMN web interface (if you are an FMN admin as defined in the config file in roles/notifs/frontend/). You can navigate to https://apps.fedoraproject.org/notifications/USERNAME.id.fedoraproject.org to do this. If the account exists as a FAS user already (for instance, the virtmaint user) but it does not yet exist in FMN, you can add it to the FMN database by logging in to notifs-backend01 and running fmn-create-user --email [email protected] --create-defaults FAS_USERNAME.

FPDC SOP

Fedora Product Definition Center is a service that aims to replace PDC in Fedora. It is meant to be a database with REST API access used to store data needed by other services.

Contact Information

Owner: Infrastructure Team
Contact: #fedora-apps, #fedora-admin
Persons: cverna, abompard
Location: Phoenix (OpenShift)
Public addresses:
• fpdc.fedoraproject.org
• fpdc.stg.fedoraproject.org
Servers:
• os.fedoraproject.org
• os.stg.fedoraproject.org
Purpose: Centralize metadata and facilitate access.


Systems

FPDC is built using the Django REST Framework and uses a PostgreSQL database to store the metadata. The application runs on OpenShift and uses the Source-to-Image technology to build the container directly from the git repository. In the staging and production environments, the application is automatically rebuilt for every new commit on the staging or production branch; this is achieved by configuring a GitHub webhook to trigger an OpenShift deployment. For example, a new deployment to staging would look like this:

git clone git@github.com:fedora-infra/fpdc.git
cd fpdc
git checkout staging
git rebase master
git push origin staging

The initial OpenShift project deployment is manual and is done using the following ansible playbook:

sudo rbac-playbook openshift-apps/fpdc.yml

This will create a new fpdc project in Openshift with all the needed configuration.

Logs

Logs can be retrieved using the OpenShift command line:

$ oc login os-master01.phx2.fedoraproject.org
You must obtain an API token by visiting https://os.fedoraproject.org/oauth/token/request

$ oc login os-master01.phx2.fedoraproject.org --token=
$ oc -n fpdc get pods
fpdc-28-bfj52   1/1   Running   522   28d
$ oc logs fpdc-28-bfj52

Database migrations

FPDC uses the Recreate deployment configuration of OpenShift, which means that OpenShift will bring down the pods currently running and create new ones with the new version of the application. In the window between the old pods going down and the new pods coming up, the database migrations are run in an independent pod.

Things that could go wrong

Hopefully not much. If something goes wrong, it is currently advised to kill the pods to trigger a fresh deployment.

$ oc login os-master01.phx2.fedoraproject.org
You must obtain an API token by visiting https://os.fedoraproject.org/oauth/token/request

$ oc login os-master01.phx2.fedoraproject.org --token=
$ oc -n fpdc get pods
fpdc-28-bfj52   1/1   Running   522   28d
$ oc delete pod fpdc-28-bfj52

It is also possible to rollback to a previous version


$ oc -n fpdc get dc
NAME   REVISION   DESIRED   CURRENT   TRIGGERED BY
fpdc   39         1         1         config,image(fpdc:latest)
$ oc -n fpdc rollback fpdc

FreeMedia Infrastructure SOP

This page is for defining the SOP for Fedora FreeMedia Program. This will cover the infrastructural things as well as procedural things.

Contents

1. Location of Resources
2. Location on Ansible
3. Opening of the form
4. Closing of the Form
5. Tentative timeline
6. How to
   1. Open
   2. Close
7. Handling of tickets
   1. Login
   2. Rejecting Invalid Tickets
   3. Accepting Valid Tickets
8. Handling of non fulfilled requests
9. How to handle membership applications

Location of Resources

• The web form is at https://fedoraproject.org/freemedia/FreeMedia-form.html
• The TRAC instance is at https://fedorahosted.org/freemedia/report

Location on ansible

$PWD = roles/freemedia/files

Freemedia form: FreeMedia-form.html
Backup form: FreeMedia-form.html.orig
Closed form: FreeMedia-close.html
Backend processing script: process.php
Error document: FreeMedia-error.html


Opening of the form

The form will be opened on the First day of each month.

Closing of the Form

Tentative timeline

The form will be closed after a couple of days. This may vary according to the capacity.

How to

• The form is available at roles/freemedia/files/FreeMedia-form.html and roles/freemedia/files/FreeMedia-form.html.orig
• The closed form is at roles/freemedia/files/FreeMedia-close.html

Open

• Go to roles/freemedia/tasks
• Open main.yml
• Go to line 32.
• To open: change the line to read: src="FreeMedia-form.html"
• After opening the form, go to trac and grant the "Ticket Create and Ticket View" privilege to "Anonymous".

Close

• Go to roles/freemedia/tasks
• Open main.yml
• Go to line 32.
• To close: change the line to read: src="FreeMedia-close.html"
• After closing the form, go to trac and remove the "Ticket Create and Ticket View" privilege from "Anonymous".

Note:
• Have to check about the monthly cron.
• Have to write about changing init.pp for closing and opening.


Handling of tickets

Login

• Contributors are requested to visit https://fedorahosted.org/freemedia/report
• Please log in with your FAS account.

Rejecting Invalid Tickets

• If a ticket is invalid, don't accept the request. Go to "resolve as:", select "invalid" and then press "Submit Changes".
• A ticket is invalid if:
  – No valid email-id is provided.
  – The region does not match the country.
  – No proper address is given.
• If a ticket is a duplicate, accept one copy and close the others as duplicates: go to "resolve as:", select "duplicate" and then press "Submit Changes".

Accepting Valid Tickets

• If you wish to fulfill a request, first check it against the section above to make sure it is not liable to be discarded.
• Now "Accept" the ticket from the "Action" field at the bottom, and press the "Submit Changes" button.
• These accepted tickets will be available from https://fedorahosted.org/freemedia/report under both "My Tickets" and "Accepted Tickets for XX" (XX = your region, e.g. APAC).
• When you ship the request, please go to the ticket again, go to "resolve as:" in the "Action" field, select "Fixed" and then press "Submit Changes".
• If an accepted ticket is not finalised by the end of the month, it should be closed with "shipping status unknown" in a comment.

Handling of non fulfilled requests

We shall close all the pending requests by the end of the month.

• Please check your region.

How to handle membership applications

Steps to become a member of the Free-media group:

1. Create an account in the Fedora Account System (FAS).
2. Create a user page in the Fedora wiki with contact data, like User:<username>. There are templates.
3. Apply to the Free-Media group in FAS.
4. Apply for the Free-Media mailing list subscription.


Rules for deciding over membership applications

Case | Applied to Free-Media Group | User Page Created | Applied to Free-Media List | Action
-----|-----------------------------|-------------------|----------------------------|-------------------------------------------
1    | Yes                         | Yes               | Yes                        | Approve group and mailing list applications
2    | Yes                         | Yes               | No                         | Subscribe to list within a week
3    | Yes                         | No                | whatever                   | Make user page within a week
4    | No                          | No                | Yes                        | Reject

Note:
1. As you need to have an FAS account for steps 2 and 3, this is not included in the decision rules above.
2. The time to be on hold is one week. If no action is taken after one week, the application has to be rejected.
3. When writing to ask someone to fulfil the remaining steps, send a CC to the other Free-media sponsors to let them know the application has been reviewed.

Freshmaker SOP

Note: Freshmaker is very new and changing rapidly. We’ll try to keep this up to date as best we can.

Freshmaker is a service that watches message bus activity and tries to rebuild compound artifacts when their constituent pieces change.

Contact Information

Owner: Factory2 Team, Release Engineering Team, Infrastructure Team
Contact: #fedora-modularity, #fedora-admin, #fedora-releng
Persons: jkaluza, cqi, qwan, sochotni, threebean
Location: Phoenix
Public addresses:
• freshmaker.fedoraproject.org
Servers:
• freshmaker-frontend0[1-2].phx2.fedoraproject.org
• freshmaker-backend01.phx2.fedoraproject.org
Purpose: Rebuild compound artifacts. See the description for more detail.

Description

See also http://fedoraproject.org/wiki/Infrastructure/Factory2/Focus/Freshmaker for some of the original (old) thinking on Freshmaker.


As per the summary above, Freshmaker is a bus-oriented system that watches for changes to smaller pieces of content, and triggers rebuilds of larger pieces of content. It doesn’t do the actual builds itself, but instead requests rebuilds in our existing build systems. It handles a number of different content types. In Fedora, we would like to roll out rebuilds in the following order:

Module Builds

When a spec file changes on a particular dist-git branch, trigger rebuilds of all modules that declare dependencies on that rpm branch.

Consider the traditional workflow today. You make a patch to the f27 branch of your package, and you know you need to build that patch for f27 and then later submit an update for this single build. Packagers know what to do.

Consider the modular workflow. You make a patch to the 2.2 branch of your package, but now, which modules do you rebuild? Maybe you had one in mind that you wanted to fix, but are there others that you forgot about – that you don't even know about? Kevin could maintain a module that pulls in my rpm branch and he never told me. Even if he did, I would now have to maintain a list of modules that depend on my rpm and request rebuilds of them every time I patch my .spec file. This is unmanageable.

Freshmaker deals with this by watching the bus for dist-git fedmsg messages. When it sees a change on a branch, it looks up the list of modules that depend on that branch and requests rebuilds of them in the MBS. A simplified sketch of that flow appears below.
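The following is an illustrative sketch only, not Freshmaker's actual code; the message fields and the two helper functions (find_dependent_modules, request_module_build) are assumptions standing in for the real lookups:

#!/usr/bin/env python3
"""Illustrative sketch of Freshmaker's module-rebuild flow (not the real implementation)."""

def find_dependent_modules(repo, branch):
    # Assumption: some lookup that returns (module, stream) pairs whose modulemd
    # declares a dependency on this rpm branch.
    return []

def request_module_build(module, stream):
    # Assumption: a call that asks the Module Build Service (MBS) to rebuild the module.
    print(f"Requesting MBS rebuild of {module}:{stream}")

def handle_distgit_message(msg):
    """React to a dist-git commit message seen on the fedmsg bus."""
    if not msg["topic"].endswith("git.receive"):
        return
    repo = msg["msg"]["commit"]["repo"]
    branch = msg["msg"]["commit"]["branch"]
    for module, stream in find_dependent_modules(repo, branch):
        request_module_build(module, stream)

# Example message shape; the exact field names are assumptions for illustration only.
handle_distgit_message({
    "topic": "org.fedoraproject.prod.git.receive",
    "msg": {"commit": {"repo": "python-requests", "branch": "2.2"}},
})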

Container Slow Flow

When a traditional rpm or modular rpm is shipped stable, this triggers rebuilds of all containers that ever included previous versions of that rpm. This applies to both modular and non-modular contexts.

Today, you build an rpm that fixes a CVE, but some other person maintains a container that includes your RPM. Maybe they never told you about this. Maybe they didn't notice your CVE fix. Their container will remain outdated and vulnerable.. forever?

Freshmaker deals with this by watching the bus for messages about rpms being shipped to the stable updates repo. When they're shipped, it looks up all containers that ever included previous versions of the rpm in question, and it triggers rebuilds of them.

Waiting until the rpm ships to stable is necessary because the container build process doesn't know about unshipped content. This is how containers are built manually today, and it is annoying. Which brings us to the more complicated...

Container Fast Flow

When a traditional rpm or modular rpm is signed, generate a repo containing it and rebuild all containers that ever included that rpm before. This is the better version of the slow flow, but is more complicated so we’re deferring it until after we’ve proved the first two cases out. Freshmaker will do this by requesting an interim build repo from ODCS (the On Demand Compose Service). ODCS can be given the appropriate koji tag and will produce a repo of (pre-signed) rpms. Freshmaker will request a rebuild of the container and will pass the ODCS repo url in. This gives us an auditable trail of disposable repos.

Systems

There is a frontend and a backend.


Everything in the previous section describes the backend behavior. The frontend exists to provide an HTTP API that can be queried to find out the status of the backend: What is it doing? What is it planning to do? What has it done already?

Observing Freshmaker Behavior

There is currently no command line tool to query Freshmaker, but Freshmaker provides a REST API which can be used to observe its behavior. This is available at the following URLs:

• https://freshmaker.fedoraproject.org/api/1/events
• https://freshmaker.fedoraproject.org/api/1/builds

The first /events URL should return a list of events that Freshmaker has noticed, recorded, and is handling. Handled events should produce associated builds. The second /builds URL should return a list of builds that Freshmaker has submitted and is monitoring. Each build should be traceable back to the event that triggered it.
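For a quick read-only check from any machine with network access, something like the following sketch can fetch the first page of each endpoint; the 'items' envelope key is an assumption and may differ between Freshmaker versions:

#!/usr/bin/env python3
"""Quick, read-only peek at the Freshmaker REST API (illustrative sketch)."""
import json
import urllib.request

BASE = "https://freshmaker.fedoraproject.org/api/1"

def fetch(endpoint):
    """Return the parsed JSON body for the /events or /builds endpoint."""
    with urllib.request.urlopen(f"{BASE}/{endpoint}") as resp:
        return json.load(resp)

if __name__ == "__main__":
    for endpoint in ("events", "builds"):
        data = fetch(endpoint)
        # 'items' is an assumption about the response envelope; fall back to the raw body.
        items = data["items"] if isinstance(data, dict) and "items" in data else data
        print(f"== {endpoint}: {len(items)} entries on the first page ==")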

Logs

The frontend logs are on freshmaker-frontend0[1-2] in /var/log/httpd/error_log. The backend logs are on freshmaker-backend01. Look in the journal for the fedmsg-hub service.

Upgrading

The package in question is freshmaker. Please use the playbooks/manual/upgrade/freshmaker.yml playbook.

Things that could go wrong

TODO. We don’t know yet. Probably lots of things.

Fedora gather easyfix SOP

Fedora-gather-easyfix, as the name says, gathers tickets marked as easyfix from multiple sources (currently pagure, github and fedorahosted), providing a single place for newcomers to find small tasks to work on.

Contents

1. Contact Information
2. Documentation Links

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Location: http://fedoraproject.org/easyfix/


Servers: sundries01, sundries02, sundries01.stg
Purpose: Gather easyfix tickets from multiple sources.

Upstream sources are hosted on github at https://github.com/fedora-infra/fedora-gather-easyfix/. The files are then mirrored to our ansible repo, under the easyfix/gather role. The project is a simple script, gather_easyfix.py, gathering information from the set of projects listed on the Fedora wiki and outputting a single html file. This html file is then improved via the css and javascript files present in the sources. The generated html file, together with the css and js files, is then synced to the proxies for public consumption :)

GDPR Delete SOP

This SOP covers how Fedora Infrastructure handles General Data Protection Regulation (GDPR) Delete Requests. It contains information about how system administrators will use tooling to respond to Delete requests, as well as how application developers can integrate their applications with that tooling.

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Persons: nirik
Location: Phoenix
Servers: batcave01.phx2.fedoraproject.org, plus various application servers, which will run scripts to delete data.
Purpose: Respond to Delete requests.

Responding to a Deletion Request

This section covers how a system administrator will use our gdpr-delete.yml playbook to respond to a Delete request. When processing a Delete request, perform the following steps:

0. Verify that the requester is who they say they are. If the request came in email, ask them to file an issue at https://pagure.io/fedora-pdr/new_issue Use the following in the email reply to them:

   In order to verify your identity, please file a new issue at https://pagure.io/fedora-pdr/new_issue using the appropriate issue type. Please note this form requires you to sign in to your account to verify your identity.

   If the request has come via Red Hat internal channels as an explicit request to delete, mark the ticket with the tag rh. This tag will help delineate requests for any future reporting needs.

   If they do not have a FAS account, indicate to them that there is no data to be deleted. Use this response:

   Your request for deletion has been reviewed. Since there is no related account in the Fedora Account System, the Fedora infrastructure does not store data relevant for this deletion request. Note that some public content related to Fedora you may have previously submitted without an account, such as to public mailing lists, is not deleted since accurate maintenance of this data serves Fedora's legitimate business interests, the public interest, and the interest of the open source community.

1. Identify the user's FAS account name. The Delete playbook will use this FAS account to delete the required data. Update the fedora-pdr issue saying the request has been received. There is a 'quick response' in the pagure issue tracker to note this.

2. Log in to FAS and clear the Telephone number entry, set Country to Other, clear Latitude and Longitude and IRC Nick and GPG Key ID, set Time Zone to UTC and Locale to en, and set the user status to disabled. If the user is not in cla_done plus one group, you are done; update the ticket and close it. This step will be folded into the following one once we implement it.

3. If the user is in cla_done plus one group, they may have additional data: run the gdpr delete playbook on batcave01. You will need to define one Ansible variable for the playbook: gdpr_delete_fas_user, the FAS username of the user.

   $ sudo ansible-playbook playbooks/manual/gdpr/delete.yml -e gdpr_delete_fas_user=bowlofeggs

   After the playbook completes, update the ticket that the request is completed and close it. There is a 'quick response' in the pagure issue tracker to note this.

Integrating an application with our delete playbook

This section covers how an infrastructure application can be configured to integrate with our delete.yml playbook. To integrate, you must create a script and Ansible variables so that your application is compatible with this playbook.

Script

You need to create a script and have your project's Ansible role install that script somewhere (most likely on a host from your project - for example, fedocal's is going on fedocal01). It's not a bad idea to put your script into your upstream project. This script should accept one environment variable as input: GDPR_DELETE_USERNAME. This will be a FAS username. Some scripts may need secrets embedded in them - if you must do this, be careful to install the script with 0700 permissions, ensuring that only gdpr_delete_script_user (defined below) can run them. Bodhi worked around this concern by having the script run as apache so it could read Bodhi's server config file to get the secrets, so it does not have secrets in its script.
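As an illustration only (this is not the fedocal or Bodhi script; the data-deletion function is a placeholder for whatever your application needs to remove), such a script can be as small as:

#!/usr/bin/env python3
"""Hypothetical GDPR delete script for an imaginary application (illustrative sketch only)."""
import os
import sys

def delete_user_data(username):
    # Placeholder: a real script would remove or anonymize the user's rows
    # in the application's own database here.
    print(f"Deleting stored data for {username}", file=sys.stderr)

def main():
    # The playbook passes the FAS username via this environment variable.
    username = os.environ.get("GDPR_DELETE_USERNAME")
    if not username:
        sys.exit("GDPR_DELETE_USERNAME must be set")
    delete_user_data(username)

if __name__ == "__main__":
    main()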

Variables

In addition to writing a script, you need to define some Ansible variables for the host that will run your script. You also need to add the host that the script should run on to the [gdpr_delete] group in inventory/inventory:

[gdpr_delete]
fedocal01.phx2.fedoraproject.org

GDPR SAR SOP

This SOP covers how Fedora Infrastructure handles General Data Protection Regulation (GDPR) Subject Access Requests (SAR). It contains information about how system administrators will use tooling to respond to SARs, as well as how application developers can integrate their applications with that tooling.


Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Persons: bowlofeggs
Location: Phoenix
Servers: batcave01.phx2.fedoraproject.org, plus various application servers, which will run scripts to collect SAR data.
Purpose: Respond to SARs.

Responding to a SAR

This section covers how a system administrator will use our sar.yml playbook to respond to a SAR. When processing a SAR, perform the following steps:

0. Verify that the requester is who they say they are. If the request came in email and the user has a FAS account, ask them to file an issue at https://pagure.io/fedora-pdr/new_issue Use the following in the email reply to them:

   In order to verify your identity, please file a new issue at https://pagure.io/fedora-pdr/new_issue using the appropriate issue type. Please note this form requires you to sign in to your account to verify your identity.

   If the request has come via Red Hat internal channels as an explicit request, mark the ticket with the tag rh. This tag will help delineate requests for any future reporting needs.

1. Identify an e-mail address for the requester and, if applicable, their FAS account name. The SAR playbook will use both of these since some applications have data associated with FAS accounts and others have data associated with e-mail addresses. Update the fedora-pdr issue saying the request has been received. There is a 'quick response' in the pagure issue tracker to note this.

2. Run the SAR playbook on batcave01. You will need to define three Ansible variables for the playbook: sar_fas_user will be the FAS username, if applicable (this may be omitted if the requester does not have a FAS account); sar_email will be the e-mail address associated with the user; and sar_tar_output_path will be the path you want the playbook to write the resulting tarball to, which should have a .tar.gz extension. For example, if bowlofeggs submitted a SAR and his e-mail address is [email protected], you might run the playbook like this:

$ sudo ansible-playbook playbooks/manual/gdpr/sar.yml -e sar_fas_user=bowlofeggs \
    -e sar_email=[email protected] -e sar_tar_output_path=/home/bowlofeggs/bowlofeggs.tar.gz

3. Generate a random sha512 with something like openssl rand 512 | sha512sum, and then move the output file to /srv/web/infra/pdr/the-sha512.tar.gz

4. Update the ticket to fixed / processed on pdr requests, include a link to https://infrastructure.fedoraproject.org/infra/pdr/the-sha512.tar.gz, and tell them it will be available for one week.

Integrating an application with our SAR playbook

This section covers how an infrastructure application can be configured to integrate with our sar.yml playbook. To integrate, you must create a script and Ansible variables so that your application is compatible with this playbook.


Script

You need to create a script and have your project's Ansible role install that script somewhere (most likely on a host from your project - for example, Bodhi's is going on bodhi-backend02). It's not a bad idea to put your script into your upstream project - there are plans for upstream Bodhi to ship bodhi-sar, for example. This script should accept two environment variables as input: SAR_USERNAME and SAR_EMAIL. Not all applications will use both, so do what makes sense for your application. The first will be a FAS username and the second will be an e-mail address. Your script should gather the required information related to those identifiers and print it in a machine readable format to stdout. Bodhi, for example, prints information to stdout in JSON. Some scripts may need secrets embedded in them - if you must do this, be careful to install the script with 0700 permissions, ensuring that only sar_script_user (defined below) can run them. Bodhi worked around this concern by having the script run as apache so it could read Bodhi's server config file to get the secrets, so it does not have secrets in its script.
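For illustration (this is not Bodhi's bodhi-sar script; the data-gathering function is a placeholder for your application's own queries), a minimal SAR script could look like:

#!/usr/bin/env python3
"""Hypothetical SAR script for an imaginary application (illustrative sketch only)."""
import json
import os

def gather_data(username, email):
    # Placeholder: a real script would query the application's database for
    # everything stored about this username and/or e-mail address.
    return {"username": username, "email": email, "records": []}

def main():
    username = os.environ.get("SAR_USERNAME", "")
    email = os.environ.get("SAR_EMAIL", "")
    # Print machine-readable JSON to stdout, which is what the SAR playbook collects.
    print(json.dumps(gather_data(username, email), indent=2))

if __name__ == "__main__":
    main()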

Variables

In addition to writing a script, you need to define some Ansible variables for the host that will run your script:

Variable        | Description                                             | Example
----------------|---------------------------------------------------------|-------------------
sar_script      | The full path to the script.                            | /usr/bin/bodhi-sar
sar_script_user | The user the script should be run as.                   | apache
sar_output_file | The name of the file to write into the output tarball.  | bodhi.json

You also need to add the host that the script should run on to the [sar] group in inventory/inventory:

[sar]
bodhi-backend02.phx2.fedoraproject.org

Variables for OpenShift apps

When you need to add an OpenShift app to the SAR playbook, you need to add the following variables to the existing sar_openshift dictionary:

Variable            | Description                                             | Example
--------------------|---------------------------------------------------------|------------------------
sar_script          | The full path to the script.                            | /usr/local/bin/sar.py
sar_output_file     | The name of the file to write into the output tarball.  | anitya.json
openshift_namespace | The namespace in which the application is running.      | release-monitoring
openshift_pod       | The pod name in which the script will be run.           | release-monitoring-web

The sar_openshift dictionary is located in inventory/group_vars/os_masters:

sar_openshift:
  # Name of the app
  release-monitoring:
    sar_script: /usr/local/bin/sar.py
    sar_output_file: anitya.json
    openshift_namespace: release-monitoring
    openshift_pod: release-monitoring-web

geoip-city-wsgi SOP

A simple web service that returns geoip information as a JSON-formatted dictionary in UTF-8. In particular, it's used by anaconda [1] to get the most probable territory code, based on the public IP of the caller.

Contents

1. Contact Information
2. Basic Function
3. Ansible Roles
4. Apps depending on geoip-city-wsgi
5. Documentation Links

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-admin, #fedora-noc
Location: https://geoip.fedoraproject.org
Servers: sundries*, sundries*-stg
Purpose: A simple web service that returns geoip information as a JSON-formatted dictionary in UTF-8. In particular, it's used by anaconda [1] to get the most probable territory code, based on the public IP of the caller.

Basic Function

• Users go to https://geoip.fedoraproject.org/city
• The website is exposed via /etc/httpd/conf.d/geoip-city-wsgi-proxy.conf.
• It returns a string with geoip information, formatted as a JSON dict in UTF-8.
• It also currently accepts one override: ?ip=xxx.xxx.xxx.xxx, e.g. https://geoip.fedoraproject.org/city?ip=18.0.0.1, which then uses the passed IP address instead of the determined IP address of the client.
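A quick way to exercise the service from a client is a sketch like the following; the exact keys in the returned dictionary depend on the GeoIP data in use, so inspect the response rather than relying on the key name shown:

#!/usr/bin/env python3
"""Query the geoip-city-wsgi service (illustrative sketch)."""
import json
import urllib.request

URL = "https://geoip.fedoraproject.org/city"

def lookup(ip=None):
    """Return the geoip dict for the caller, or for an explicit ?ip= override."""
    url = f"{URL}?ip={ip}" if ip else URL
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    info = lookup()                      # uses our own public IP
    print(info.get("country_code"))      # key name is an assumption; print(info) to see everything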

Ansible Roles

The geoip-city-wsgi role (https://pagure.io/fedora-infra/ansible/blob/main/f/roles/geoip-city-wsgi) is included in the sundries playbook (https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/groups/sundries.yml); the proxy tasks are present in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/include/proxies-reverseproxy.yml

Apps depending on geoip-city-wsgi

Unknown.


Documentation Links

app: https://geoip.fedoraproject.org
source: https://github.com/fedora-infra/geoip-city-wsgi
bugs: https://github.com/fedora-infra/geoip-city-wsgi/issues
role: https://pagure.io/fedora-infra/ansible/blob/main/f/tree/roles/geoip-city-wsgi

[1] https://fedoraproject.org/wiki/Anaconda

Using github for Infra Projects

We’re presently using github to host git repositories and issue tracking for some infrastructure projects. Anything we need to know should be recorded here.

Setting up a new repo

Create projects inside of the fedora-infra group: https://github.com/fedora-infra That will allow us to more easily track what projects we have.

[TODO] How do we create a new project and import it?

• After creating a new repo, click on the Settings tab to set up some fancy things.

If using git-flow for your project:

• Set the default branch from 'master' to 'develop'. Having the default branch be develop is nice: new contributors will automatically start committing there if they're not paying attention to what branch they're on. You almost never want to commit directly to the master branch. If a develop branch does not exist, you should create one by branching off of master:

$ git clone GIT_URL
$ git checkout -b develop
$ git push --all

– Set up an IRC hook for notifications. From the “settings” tab click on “Webhooks & Services.” Under the “Add Service” dropdown, find “IRC” and click it. You might need to enter your password. In the form, you probably want the following values:

* Server, irc.libera.chat
* Port, 6697
* Room, #fedora-apps
* Nick,
* Branch Regexes,
* Password,
* Ssl,
* Message Without Join,
* No Colors,
* Long Url,
* Notice,
* Active,


Add an EasyFix label

The EasyFix label is used to mark bugs that are potentially fixable by new contributors getting used to our source code or relatively new to python programming. GitHub doesn't provide this label automatically so we have to add it. You can add the label from the issues page of the repository or use this curl command to add it:

curl -k -u '$GITHUB_USERNAME:$GITHUB_PASSWORD' https://api.github.com/repos/fedora-infra/python-fedora/labels -H "Content-Type: application/json" -d '{"name":"EasyFix","color":"3b6eb4"}'

Please try to use the same color for consistency between Fedora Infrastructure Projects. You can then add the github repo to the list that easyfix.fedoraproject.org scans for easyfix tickets here: https://fedoraproject.org/wiki/Easyfix

github2fedmsg SOP

Bridge github events onto our fedmsg bus.

App: https://apps.fedoraproject.org/github2fedmsg/
Source: https://github.com/fedora-infra/github2fedmsg/

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-apps, #fedora-admin, #fedora-noc
Servers github2fedmsg01
Purpose Bridge github events onto our fedmsg bus.

Description

github2fedmsg is a small Python Pyramid app that bridges github events onto our fedmsg bus by way of github's "webhooks" feature. It is what allows us to have IRC notifications of github activity via fedmsg. It has two phases of operation:

• Infrequently, a user will log in to github2fedmsg via Fedora OpenID. They then push a button to also log in to github.com. They are then logged in to github2fedmsg with both their FAS account and their github account. They are then presented with a list of their github repositories. They can toggle each one: "on" or "off". When they turn a repo on, our webapp makes a request to github.com to install a "webhook" for that repo with a callback URL to our app.
• When events happen to that repo on github.com, github looks up our callback URL and makes an http POST request to us, informing us of the event. Our github2fedmsg app receives that, validates it, and then republishes the content to our fedmsg bus.

What could go wrong?

• Restarting the app or rebooting the host shouldn't cause a problem. It should come right back up.
• Our database could die. We have a db with a list of all the repos we have turned on and off. We would want to restore that from backup.


• If github gets compromised, they might have to revoke all of their application credentials. In that case, our app would fail to work. There are lots of private secrets set in our private repo that allow our app to talk to github.com. There are inline comments there with instructions about how to generate new keys and secrets.

Gitweb Infrastructure SOP

Gitweb-caching is the web interface we use to expose git to the web at http://git.fedorahosted.org/git/

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-hosted
Location Serverbeach
Servers hosted[1-2]
Purpose Http access to git sources.

Basic Function

• Users go to [46]http://git.fedorahosted.org/git/
• Pages are generated from cache stored in /var/cache/gitweb-caching/.
• The website is exposed via /etc/httpd/conf.d/git.fedoraproject.org.conf.
• Main config file is /var/www/gitweb-caching/gitweb_config.pl. This pulls git repos from /git/.

Greenwave SOP

Contact Information

Owner Factory2 Team, Fedora QA Team, Infrastructure Team
Contact #fedora-qa, #fedora-admin
Persons gnaponie (giulia), mprahl, lucarval, ralph (threebean)
Location Phoenix
Public addresses
• https://greenwave-web-greenwave.app.os.fedoraproject.org/api/v1.0/version
• https://greenwave-web-greenwave.app.os.fedoraproject.org/api/v1.0/policies
• https://greenwave-web-greenwave.app.os.fedoraproject.org/api/v1.0/decision
Servers In OpenShift.
Purpose Provide gating decisions.


Description

• See the focus document for background.
• See the upstream docs for more detailed info.

Greenwave's job is:
• answering yes/no questions (or making decisions)
• about artifacts (RPM packages, source tarballs, . . . )
• at certain gating points in our pipeline
• based on test results
• according to some policy

In particular, we'll be using Greenwave to provide yes/no gating decisions to Bodhi about rpms in each update. Greenwave will do this by consulting resultsdb and waiverdb for individual test results and then combining those results into an aggregate decision. The policies for how those results should be combined or ignored are defined in ansible in roles/openshift-apps/greenwave/templates/configmap.yml. We expect to grow these over time to new use cases (rawhide compose gating, etc.).

Observing Greenwave Behavior

Log in to os-master01.phx2.fedoraproject.org as root (or authenticate remotely with openshift using oc login https://os.fedoraproject.org), and run:

$ oc project greenwave
$ oc status -v
$ oc logs -f dc/greenwave-web

Database

Greenwave currently has no database (and we’d like to keep it that way). It relies on resultsdb and waiverdb for information.

Upgrading

You can roll out configuration changes by changing the files in roles/openshift-apps/greenwave/ and running the playbooks/openshift-apps/greenwave.yml playbook. To understand how the software is deployed, take a look at these two files:
• roles/openshift-apps/greenwave/templates/imagestream.yml
• roles/openshift-apps/greenwave/templates/buildconfig.yml

See that we build a fedora-infra specific image on top of an app image published by upstream. The latest tag is automatically deployed to staging. This should represent the latest commit to the master branch of the upstream git repo that passed its unit and functional tests. The prod-fedora tag is manually controlled. To upgrade prod to match what is in stage, move the prod-fedora tag to point to the same image as the latest tag. Our buildconfig is configured to poll that tag, so a new os.fp.o build and deployment should be automatically created.
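Moving the tag can be done with oc tag. This is only a sketch: the actual imagestream and tag names are whatever imagestream.yml defines, so check that file before copying this:

$ oc -n greenwave tag greenwave:latest greenwave:prod-fedora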


You can watch the build and deployment with oc commands. You can poll this URL to see what version is live at the moment: https://greenwave-web-greenwave.app.os.fedoraproject.org/api/v1.0/version

Troubleshooting

In case of problems with greenwave messaging, check the logs of the container dc/greenwave-fedmsg-consumers to see if there is something wrong:

$ oc logs -f dc/greenwave-fedmsg-consumers

It is also possible to check whether greenwave is actually publishing messages by looking at this link and checking the time of the last message. In case of problems with the greenwave webapp, check the logs of the container dc/greenwave-web:

$ oc logs -f dc/greenwave-web

Guest Disk Resize SOP

Resize disks in our kvm guests

Contents

1. Contact Information
2. How to do it
   1. KVM/libvirt Guests

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main
Location: PHX, Tummy, ibiblio, Telia, OSUOSL
Servers: All xen servers, kvm/libvirt servers.
Purpose: Resize guest disks

How to do it

KVM/libvirt Guests

1. SSH to the kvm server and resize the guest's logical volume. If you want to be extra careful, make a snapshot of the LV first:

lvcreate -n [guest name]-snap -L 10G -s /dev/VolGroup00/[guest name]

This is optional, but it is always good to be careful.

2. Shut down the guest:

sudo virsh shutdown [guest name]

3. Disable the guest's LV:

lvchange -an /dev/VolGroup00/[guest name]

4. Resize the LV:

lvresize -L [NEW TOTAL SIZE]G /dev/VolGroup00/[guest name]

or

lvresize -L +XG /dev/VolGroup00/[guest name] (to add X GB to the disk)

5. Enable the lv:

lvchange -ay /dev/VolGroup00/[guest name]

6. Bring the guest back up:

sudo virsh start [guest name]

7. Log in to the guest:

sudo virsh console [guest name]

You may wish to boot single user mode to avoid services coming up and going down again.

8. On the guest, run:

fdisk /dev/vda

9. Delete the LVM partition on the guest you want to add space to and recreate it with the maximum size. Make sure to set its type to LVM (8e):

p to list partitions
d to delete the selected partition
n to create a new partition (default values should be ok)
t to change the partition type (set to 8e)
w to write changes

10. Run partprobe:

partprobe

11. Check the size of the partition:

fdisk -l /dev/vdaN

If this still reflects the old size, then reboot the guest and verify that its size changed correctly when it comes up again.

12. Log in to the guest again, and run:


pvresize /dev/vdaN

13. A vgs should now show the new size. Use lvresize to resize the root lv:

lvresize -L [new root partition size]G /dev/GuestVolGroup00/root

(pvs will tell you how much space is available)

14. Finally, resize the root partition:

resize2fs /dev/GuestVolGroup00/root (if the root fs is ext4)

or

xfs_growfs /dev/GuestVolGroup00/root (if the root fs is xfs)

Verify that everything worked out, and delete the snapshot you made if you made one.
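Removing the snapshot is a normal lvremove; this sketch assumes the snapshot was created with the [guest name]-snap name used in step 1:

lvremove /dev/VolGroup00/[guest name]-snap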

Guest Editing SOP

Various virsh commands

Contents

1. Contact Information
2. How to do it
   1. add/remove cpus
   2. resize memory

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main
Location: PHX, Tummy, ibiblio, Telia, OSUOSL
Servers: All xen servers, kvm/libvirt servers.
Purpose: Resize guest CPUs and memory

How to do it

Add cpu

1. SSH to the virthost server
2. Calculate the number of CPUs the system needs
3. Run sudo virsh setvcpus [guest] [count] --config, ie:

sudo virsh setvcpus bapp01 16 --config

4. Shut down the virtual system
5. Start the virtual system

Note: Note that using virsh reboot is insufficient. You have to actually stop the domain and start it with virsh destroy and virsh start for the change to take effect.

6. Log in and check that the cpu count matches
7. Remember to update the group_vars in ansible to match the new value you set, if appropriate.

Resize memory

1. SSH to the virthost server
2. Calculate the amount of memory the system needs in kb
3. Run sudo virsh setmem [guest] [memory in kb] --config, ie:

sudo virsh setmem bapp01 16777216 --config

4. Shut down the virtual system
5. Start the virtual system

Note: Note that using virsh reboot is insufficient. You have to actually stop the domain and start it with virsh destroy and virsh start for the change to take effect.

6. Log in and check that the memory matches (see the quick check below)
7. Remember to update the group_vars in ansible to match the new value you set, if appropriate.
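A minimal way to verify the new values from inside the guest, assuming a typical Linux guest:

nproc                        # number of CPUs the guest sees
grep MemTotal /proc/meminfo  # total memory in kB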

Haproxy Infrastructure SOP

haproxy is an application that does load balancing at the tcp layer or at the http layer. It can do generic tcp balancing but it does specialize in http balancing. Our proxy servers are still running apache and that is what our users connect to. But instead of using mod_proxy_balancer and ProxyPass balancer://, we do a ProxyPass to [45]http://localhost:10001/ or [46]http://localhost:10002/. haproxy must be told to listen to an individual port for each farm. All haproxy farms are listed in /etc/haproxy/haproxy.cfg.

Contents

1. Contact Information
2. How it works
3. Configuration example
4. Stats
5. Advanced Usage


Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, sysadmin-web group
Location: Phoenix, Tummy, Telia
Servers: proxy1, proxy2, proxy3, proxy4, proxy5
Purpose: Provides load balancing from the proxy layer to our application layer.

How it works

haproxy is a load balancer. If you're already familiar with load balancers, this section won't be that interesting. In its normal usage haproxy acts just like a web server: it listens on a port for requests. Unlike most webservers, though, it then sends that request to one of our back end application servers and sends the response back. This is referred to as reverse proxying. We typically configure haproxy to send a check to a specific url and look for the response code. If no url is set, it just does basic checks against /. In most of our configurations we're using round robin balancing. IE, request 1 goes to app1, request 2 goes to app2, request 3 goes to app3, request 4 goes to app1, and the whole process repeats.

Warning: These checks do add load to the app servers. As well as additional connections. Be smart about which url you’re checking as it gets checked often. Also be sure to verify the application servers can handle your new settings, monitor them closely for the hour or two after you make changes.

Configuration example

The below example is how our fedoraproject wiki could be configured. Each application should have its own farm. Even though it may have an identical configuration to another farm, this allows easy addition and subtraction of specific nodes when we need them:

listen fpo-wiki 0.0.0.0:10001
    balance roundrobin
    server app1 app1.fedora.phx.redhat.com:80 check inter 2s rise 2 fall 5
    server app2 app2.fedora.phx.redhat.com:80 check inter 2s rise 2 fall 5
    server app4 app4.fedora.phx.redhat.com:80 backup check inter 2s rise 2 fall 5
    option httpchk GET /wiki/Infrastructure

• The first line "listen . . . ." says to create a farm called 'fpo-wiki', listening on all IPs on port 10001. fpo-wiki can be arbitrary but make it something obvious. Aside from that the important bit is :10001. Always make sure that when creating a new farm, it is listening on a unique port. In Fedora's case we're starting at 10001 and moving up by one. Just check the config file for the lowest open port above 10001.
• The next line "balance roundrobin" says to use round robin balancing.
• The server lines each add a new node to the balancer farm. In this case the wiki is being served from app1, app2 and app4. If the wiki is available at [53]http://app1.fedora.phx.redhat.com/wiki/ then this config would be used in conjunction with "RewriteRule ^/wiki/(.*) [54]http://localhost:10001/wiki/$1 [P,L]".
• 'server' means we're adding a new node to the farm
• 'app1' is the worker name; it is analogous to fpo-wiki but should match the short hostname of the node to make it easy to follow.
• 'app1.fedora.phx.redhat.com:80' is the hostname and port to be contacted.


• 'check' means to check via the bottom line "option httpchk GET /wiki/Infrastructure", which will use /wiki/Infrastructure to verify the wiki is working. If that URL fails, that entire node will be taken out of the farm mix.
• 'inter 2s' means to check every 2 seconds. 2s is the same as 2000 in this case.
• 'rise 2' means to not put this node back in the mix until it has had two successful connections in a row. haproxy will continue to check every 2 seconds whether a node is up or down.
• 'fall 5' means to take a node out of the farm after 5 failures.
• 'backup' You'll notice that app4 has a 'backup' option. We don't actually use this for the wiki but do for other farms. It basically means to continue checking and treat this node like any other node but don't send it any production traffic unless the other two nodes are down.

All of these options can be tweaked so keep that in mind when changing or building a new farm (see the sketch below). There are other configuration options in this file that are global. Please see the haproxy documentation for more info:

/usr/share/doc/haproxy-1.3.14.6/haproxy-en.txt
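As an illustration of those rules, a new farm for a hypothetical application on the next free port might look like the following; the farm name, port, check URL and app servers are all placeholders and must be adapted to the real service:

listen fpo-newapp 0.0.0.0:10002
    balance roundrobin
    server app1 app1.fedora.phx.redhat.com:80 check inter 2s rise 2 fall 5
    server app2 app2.fedora.phx.redhat.com:80 check inter 2s rise 2 fall 5
    option httpchk GET /newapp/health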

Stats

In order to view the stats for a farm please see the stats page. Each proxy server has its own stats page since each one is running its own haproxy server. To view the stats point your browser to https://admin.fedoraproject.org/haproxy/shorthostname/ so proxy1 is at https://admin.fedoraproject.org/haproxy/proxy1/ The trailing / is important.
• https://admin.fedoraproject.org/haproxy/proxy1/
• https://admin.fedoraproject.org/haproxy/proxy2/
• https://admin.fedoraproject.org/haproxy/proxy3/
• https://admin.fedoraproject.org/haproxy/proxy4/
• https://admin.fedoraproject.org/haproxy/proxy5/

Advanced Usage

haproxy has some more advanced usage that we've not needed to worry about yet but is worth mentioning. For example, one could send users to just one app server based on session id. If user A happened to hit app1 first and user B happened to hit app4 first, all subsequent requests for user A would go to app1 and user B would go to app4. This is handy for applications that cannot normally be balanced because of shared storage needs or other locking issues. This won't solve all problems though and can have negative effects; for example, when app1 goes down user A would either lose their session or be unable to work until app1 comes back up. Please do some great testing before looking in to this option.

Fedorahosted migrations

Migrating hosted repositories to another SCM type.

Contents

1. Contact Information
2. Description


3. SVN to GIT migration
   1. Questions left to be answered with this SOP

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-hosted
Location Serverbeach
Servers hosted1, hosted2
Purpose Migrate hosted SCM repositories to that of another SCM.

Description

fedorahosted.org can be used to host open source projects. Occasionally those projects want to change the SCM they utilize. This document provides documentation for doing so. A hosted project consists of:

1. An scm for maintaining the code. The currently supported scm's include , Git, Bazaar, or SVN. Note: There is no cvs
2. A trac instance, which provides a mini-wiki for hosting information and also provides a ticketing system.
3. A mailing list

Important: This page is for administrators only. People wishing to request a hosted project should use the [50]Ticketing System; see the new project request template. (Requires Fedora Account)

SVN to GIT migration

FAS User Prep

Currently you must manually generate $PROJECTNAME-users.txt by grabbing a list of people in the FAS group and recording them in the following format:

$fasusername = FirstName LastName <$emailaddress>

This is error prone, and will stop the git-svn fetch below if an author appears that doesn't exist in the list of users:

svn log --quiet | awk '/^r/ {print $3}' | sort -u

The above will generate a list of users in the svn repo. If all users are FAS users you can use the following script to create a users file (written by tmz (Todd Zullinger)):

#!/bin/bash
if [ -z "$1" ]; then
    echo "usage: $0 " >&2
    exit 1
fi

svnurl=file:///svn/$1
if ! svn info $svnurl &>/dev/null; then
    echo "$1 is not a valid svn repo." >&2
fi

svn log -q $svnurl | awk '/^r[0-9]+/ {print $3}' | sort -u | while read user; do
    name=$( (getent passwd $user 2>/dev/null | awk -F: '{print $5}') || '' )
    [ -z "$name" ] && name=$user
    email="[email protected]"
    echo "$user=$name <$email>"
done

Doing the conversion

1. Log into hosted1
2. Make a temporary directory to convert the repos in:

$ sudo mkdir /tmp/tmp-$PROJECTNAME.git

$ cd /tmp/tmp-$PROJECTNAME.git

3. Create a git repo ready to receive migrated SVN data:

$ sudo git-svn init http://svn.fedorahosted.org/svn/$PROJECTNAME --no-metadata

4. Tell git to fetch and convert the repository:

$ git svn fetch

Note: This creation of a temporary repository is necessary because SVN leaves a number of items floating around that git can ignore, and we want those essentially ignored.

5. From here, you'll want to follow [53]Creating a new git repo as if cloning an existing git repository to Fedorahosted.
6. After that process is done, kindly remove the temporary repo that was created:

$ sudo rm -rf /tmp/tmp-$PROJECTNAME.git

Doing the conversion (alternate)

Alternately, here's another way to do this (tmz):

Set up a working dir:

[tmz@hosted1 tmp (master)]$ mkdir im-chooser-conversion && cd im-chooser-conversion

Create an authors file mapping svn usernames to the name/email form git uses:


[tmz@hosted1 im-chooser-conversion (master)]$ ~tmz/svn-to-git-authors im-chooser > authors

Convert svn to git:

[tmz@hosted1 im-chooser-conversion (master)]$ git svn clone -s -A authors --no-metadata file:///svn/im-chooser

Move svn branches and tags into proper locations for the new git repo. (git-svn leaves them as ‘remote’ branches/tags.):

[tmz@hosted1 im-chooser-conversion (master)]$ cd im-chooser
[tmz@hosted1 im-chooser (master)]$ mv .git/refs/remotes/tags/* .git/refs/tags/ && rmdir .git/refs/remotes/tags
[tmz@hosted1 im-chooser (master)]$ mv .git/refs/remotes/* .git/refs/heads/

Now ‘git branch’ and ‘git tag’ should display the branches/tags. Create a bare repo from the converted git repo. Using file://$(pwd) here ensures that git copies all objects to the new bare repo.:

[tmz@hosted1 im-chooser-conversion (master)]$ git clone --bare --shared file://$(pwd)/im-chooser im-chooser.git

Follow the steps in https://fedoraproject.org/wiki/Hosted_repository_setup to finish setting proper modes and permissions for the repo. Don't forget to update the description file.

Note: This still leaves moving the converted bare repo (im-chooser.git) to /git and fixing up the user/group.

Questions left to be answered with this SOP

• Obviously we need to have the requestor review the migration and confirm it's ok.
• Do we then delete the old SCM contents?
• Do we need to change the FAS-group type to grant them access to pull/push from it?

HOTFIXES SOP

From time to time we have to quickly patch a problem or issue in applications in our infrastructure. This process allows us to do that and track what changed and be ready to remove it when the issue is fixed upstream.

Ansible based items:

For ansible, hotfixes should be placed after the task that installs the package to be changed or modified, either in roles or tasks. Hotfix tasks should be called "HOTFIX description". They should also link in comments to any upstream bug or ticket, and they should have tags of 'hotfix'. The process is:
• Create a diff of any files changed in the fix.
• Check in the _original_ files and change to role/task


• Now check in your diffs of those same files.
• ansible will replace the files on the affected machines completely with the fixed versions.
• If you need to back it out, you can revert the diff step, wait, and then remove the first checkin.

Example:

#
# install hash randomization hotfix
# See bug https://bugzilla.redhat.com/show_bug.cgi?id=812398
#
- name: hotfix - copy over new httpd init script
  copy: src="{{ files }}/hotfix/httpd/httpd.init" dest=/etc/init.d/httpd owner=root group=root mode=0755
  notify:
  - restart apache
  tags:
  - config
  - hotfix
  - apache

Upstream changes

Also, if at all possible a bug should be filed with the upstream application to get the fix in the next version. Hotfixes are something we should strive to only carry a short time.

The New Hotness

the-new-hotness is a fedora messaging consumer that subscribes to release-monitoring.org fedora messaging notifications to determine when a package in Fedora should be updated. For more details on the-new-hotness, consult the project documentation.

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin #fedora-apps
Persons zlopez
Location iad2.fedoraproject.org
Servers
  Production: hotness01.iad2.fedoraproject.org
  Staging: hotness01.stg.iad2.fedoraproject.org
Purpose File issues when upstream projects release new versions of a package


Hosts

The current deployment is made up of the-new-hotness OpenShift namespace.

the-new-hotness

This OpenShift namespace runs the following pods:
• A fedora messaging consumer

This OpenShift project relies on:
• Anitya Infrastructure SOP as message publisher
• Fedora messaging RabbitMQ hub for consuming messages
• Koji for scratch builds
• Bugzilla for issue reporting

Releasing

The release process is described in the-new-hotness documentation.

Deploying

Staging deployment of the-new-hotness is deployed in OpenShift on os-master01.stg.iad2.fedoraproject.org. To deploy staging instance of the-new-hotness you need to push changes to staging branch on the-new-hotness GitHub. GitHub webhook will then automatically deploy a new version of the-new-hotness on staging. Production deployment of the-new-hotness is deployed in OpenShift on os-master01.iad2.fedoraproject.org. To deploy production instance of the-new-hotness you need to push changes to production branch on the-new-hotness GitHub. GitHub webhook will then automatically deploy a new version of the-new-hotness on production.

Configuration

To deploy the new configuration, you need ssh access to batcave01.iad2.fedoraproject.org and permissions to run the Ansible playbook. All the following commands should be run from batcave01. First, ensure there are no configuration changes required for the new update. If there are, update the Ansible anitya role(s) and optionally run the playbook:

$ sudo rbac-playbook openshift-apps/the-new-hotness.yml

The configuration changes could be limited to staging only using:

$ sudo rbac-playbook openshift-apps/the-new-hotness.yml -l staging

This is recommended for testing new configuration changes.


Upgrading

Staging

To deploy new version of the-new-hotness you need to push changes to staging branch on the-new-hotness GitHub. GitHub webhook will then automatically deploy a new version of the-new-hotness on staging.

Production

To deploy new version of the-new-hotness you need to push changes to production branch on the-new-hotness GitHub. GitHub webhook will then automatically deploy a new version of the-new-hotness on production. Congratulations! The new version should now be deployed.

Monitoring Activity

It can be nice to check up on the-new-hotness to make sure it's behaving correctly. You can see all the Bugzilla activity using the user activity query (staging uses partner-bugzilla.redhat.com) and querying for the [email protected] user. You can also view all the Koji tasks dispatched by the-new-hotness. For example, you can see the failed tasks it has created. To monitor the pods of the-new-hotness you can connect to Fedora infra OpenShift and look at the state of pods. For staging look at the the-new-hotness namespace in the staging OpenShift instance. For production look at the the-new-hotness namespace in the production OpenShift instance.
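If you prefer the command line over the web console, a quick look at the pods is possible with oc once you are logged in to the right cluster; this assumes your account can see the the-new-hotness namespace:

$ oc -n the-new-hotness get pods
$ oc -n the-new-hotness logs -f [pod name from the previous command]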

Fedora Hubs SOP

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main, sysadmin-tools, sysadmin-hosted
Location ?
Servers , , hubs-dev.fedorainfracloud.org
Purpose Contributor and team portal.

Description

Fedora Hubs aggregates user and team activity throughout the Fedora infrastructure (and elsewhere) to show what a user or a team is doing. It helps new people find a place to contribute.

Components

Fedora Hubs has the following components:
• a SQL database like PostgreSQL (in the Fedora infra we're using the shared database).


• a Redis server that is used as a message bus (it is not critical if the content is lost). System service: redis.
• a MongoDB server used to store the contents of the activity feeds. It's JSON data, limited to 100 entries per user or group. Service: mongod.
• a Flask-based WSGI app served by Apache + mod_wsgi, that will also serve the JS front end as static files. System service: httpd.
• a Fedmsg listener that receives messages from the fedmsg bus and puts them in Redis. System service: fedmsg-hub.
• a set of "triage" workers that pull the raw messages from Redis, process them using SQL queries and put work items in another Redis queue. System service: fedora-hubs-triage@.
• a set of "worker" daemons that pull from this other Redis queue, work on the items by making SQL queries and external HTTP requests (to Github for example), and put reload notifications in the SSE Redis queue. They also access the caching system, which can be local files or memcached. System service: fedora-hubs-worker@.
• The SSE server (-based) that pulls from that Redis queue and sends reload notifications to the connected browsers. It handles long-lived HTTP connections but there is little activity: only the notifications and a "keepalive ping" message every 30 seconds to every connected browser. System service: fedora-hubs-sse. Apache is configured to proxy the /sse path to this server.

Managing the services

Restarting all the services:

systemctl restart fedmsg-hub fedora-hubs-\*

By default, 4 triage daemons and 4 worker daemons are enabled. To add another triage daemon and another worker daemon, you can run:

systemctl enable --now [email protected]
systemctl enable --now [email protected]

It is not necessary to have the same number of triage and worker daemons, in fact it is expected that more worker than triage daemons will be necessary, as they do more time-consuming work.
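To see which triage and worker instances are currently active, systemd's unit globbing is handy; this is just a convenience and assumes the unit names shown above:

systemctl list-units 'fedora-hubs-*'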

Hubs-specific operations

Other Hubs-specific operations are done using the fedora-hubs command:

$ fedora-hubs
Usage: fedora-hubs [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  cache  Cache-related operations.
  db     Database-related operations.
  fas    FAS-related operations.
  run    Run daemon processes.


Manipulating the cache

The cache subcommand is used to do cache-related operations:

$ fedora-hubs cache
Usage: fedora-hubs cache [OPTIONS] COMMAND [ARGS]...

  Cache-related operations.

Options:
  --help  Show this message and exit.

Commands:
  clean     Clean the specified WIDGETs (id or name).
  coverage  Check the cache coverage.
  list      List widgets for which there is cached data.

For example, to check the cache coverage:

$ fedora-hubs cache coverage
107 cached values found, 95 are missing.
52.97 percent cache coverage.

The cache coverage value is an interesting metric that could be used in a Nagios check. A value below 50% could be considered a sign of application slowdowns and could thus generate a warning.
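A Nagios-style check along those lines could be a thin wrapper around the command shown above. This is only a sketch of the idea, assuming the "NN.NN percent cache coverage." output format and a 50% warning threshold:

#!/bin/bash
# Hypothetical check script: warn when fedora-hubs cache coverage drops below 50%.
coverage=$(fedora-hubs cache coverage | awk '/percent cache coverage/ {print $1}')
if [ -z "$coverage" ]; then
    echo "UNKNOWN: could not read cache coverage"
    exit 3
fi
if awk -v c="$coverage" 'BEGIN {exit !(c < 50)}'; then
    echo "WARNING: cache coverage at ${coverage}%"
    exit 1
fi
echo "OK: cache coverage at ${coverage}%"
exit 0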

Interacting with FAS

The fas subcommand is used to get information from FAS:

$ fedora-hubs fas
Usage: fedora-hubs fas [OPTIONS] COMMAND [ARGS]...

  FAS-related operations.

Options:
  --help  Show this message and exit.

Commands:
  create-team  Create the team hub NAME from FAS.
  sync-teams   Sync all the team hubs NAMEs from FAS.

To add a new team hub for a FAS group, run:

$ fedora-hubs fas create-team

IBM RSA II Infrastructure SOP

Many of our physical machines use RSA II cards for remote management.

Contact Information

Owner Fedora Infrastructure Team


Contact #fedora-admin, sysadmin-main
Location PHX, ibiblio
Servers All physical IBM machines
Purpose Provide remote management for our physical IBM machines

Restarting the RSA II card

Normally, the RSA II can be restarted from the web/ssh interface. If you are locked out of any outside access to the RSA II, follow these instructions on the physical machine. If the machine can be rebooted without issue, cut off all power to the machine, wait a few seconds, and restart everything. Otherwise, to restart the card without rebooting the machine:

1. Download and install the IBM Remote Supervisor Adapter II Daemon
   1. yum install usbutils libusb-devel # (needed by the RSA II daemon)
   2. Download the correct tarball from http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-5071676&brandind=5000008 (TODO: check if this can be packaged in Fedora)
   3. Extract the tarball and run sudo ./install.sh --update
2. Download and extract the IBM Advanced Settings Utility http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=TOOL-ASU&brandind=5000016

Warning: this tarball dumps files in the current working directory

3. Issue a sudo ./asu64 rebootrsa to reboot the RSA II.
4. Clean up: yum remove ibmusbasm64

Other Resources

http://www.redbooks.ibm.com/abstracts/sg246495.html may be a useful resource to refer to when working with this.

Infrastructure Git Repos

Setting up an infrastructure git repo, and the push mechanisms for the magicks. We have a number of git repos (in /git on batcave) that manage files for ansible, our docs, our common host info database and our kickstarts. This is a doc on how to set up a new one of these, if it is needed.

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main
Location Phoenix


Servers batcave01.phx2.fedoraproject.org, batcave-comm01.qa.fedoraproject.org

Steps

Create the bare repo:

make $git_dir
setfacl -m d:g:$yourgroup:rwx -m d:g:$othergroup:rwx \
    -m g:$yourgroup:rwx -m g:$othergroup:rwx $git_dir
cd $git_dir
git init --bare

Edit up config - add these lines to the bottom:

[hooks]
# ([email protected])
mailinglist = emailaddress@yourdomain.org
emailprefix =
maildomain = fedoraproject.org
reposource = /path/to/this/dir
repodest = /path/to/where/you/want/the/files/dumped

Edit up description - make it something useful. Then:

cd hooks
rm -f *.sample

Copy hooks from /git/infra-docs/hooks/ on batcave01 to this path. Finally, modify sudoers so that users in whatever groups can commit to this repo can run /usr/local/bin/syncgittree.sh without inputting a password.
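A sudoers fragment for that last step might look like the following; the group name is purely illustrative and the real entry should match whichever FAS groups own the repo:

%sysadmin-web ALL = NOPASSWD: /usr/local/bin/syncgittree.sh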

Infrastructure Host Rename SOP

This page is intended to guide you through the process of renaming a virtual node.

Contents

1. Introduction
2. Finding out where the host is
3. Preparation
4. Renaming the Logical Volume
5. Doing the actual rename
6. Telling ansible about the new host
7. VPN Stuff


Introduction

Throughout this SOP, we will refer to the old hostname as $oldhostname and the new hostname as $newhostname. We will refer to the Dom0 host that the vm resides on as $vmhost. If this process is being followed so that a temporary-named host can replace a production host, please be sure to follow the [51]Infrastructure retire machine SOP to properly decommission the old host before continuing.

Finding out where the host is

In order to rename the host, you must have access to the Dom0 (host) on which the virtual server resides. To find out which host that is, log in to batcave01, and run:

grep $oldhostname /var/log/virthost-lists.out

The first column of the output will be the Dom0 of the virtual node.

Preparation

SSH to $oldhostname. If the new name is replacing a production box, change the IP Address that it binds to, in /etc/sysconfig/network-scripts/ifcfg-eth0. Also change the hostname in /etc/sysconfig/network. At this point, you can sudo poweroff $oldhostname. Open an ssh session to $vmhost, and make sure that the node is listed as shut off. If it is not, you can force it off with:

virsh destroy $oldhostname

Renaming the Logical Volume

Find out the name of the logical volume (on $vmhost):

virsh dumpxml $oldhostname | grep 'source dev'

This will give you a line that looks like <source dev='/dev/VolGroup00/$oldhostname'/>, which tells you that /dev/VolGroup00/$oldhostname is the path to the logical volume. Run /usr/sbin/lvrename with the path that you found above, followed by the same path with $newhostname at the end instead of $oldhostname. For example:

/usr/sbin/lvrename /dev/VolGroup00/noc03-tmp /dev/VolGroup00/noc01

Doing the actual rename

Now that the logical volume has been renamed, we can rename the host in libvirt. Dump the configuration of $oldhostname into an xml file, by running:

virsh dumpxml $oldhostname > $newhostname.xml


Open up $newhostname.xml, and change all instances of $oldhostname to $newhostname. Save the file and run:

virsh define $newhostname.xml

If there are no errors above, you can undefine $oldhostname:

virsh undefine $oldhostname

Power on $newhostname, with:

virsh start $newhostname

And remember to set it to autostart:

virsh autostart $newhostname

VPN Stuff

TODO

Infrastructure/SOP/Raid Mismatch Count

What to do when a raid device has a mismatch count

Contents

1. Contact Information
2. Description
3. Correction
   1. Step 1
   2. Step 2

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main
Location All
Servers Physical hosts
Purpose Correct RAID mismatch counts on our physical hosts.


Description

In some situations a raid device may indicate there is a count mismatch as listed in:

/sys/block/mdX/md/mismatch_cnt

Anything other than 0 is considered not good, though if the number is low it's probably nothing to worry about. To correct this situation try the directions below.

Correction

More than anything these steps are to A) verify there is no problem and B) make the error go away. If step 1 and step 2 don't correct the problems, PROCEED WITH CAUTION. The steps below, however, should be relatively safe.

Issue a repair (replace mdX with the questionable raid device):

echo repair > /sys/block/mdX/md/sync_action

Depending on the size of the array and disk speed this can take a while. Watch the progress with:

cat /proc/mdstat

Issue a check. It's this check that will reset the mismatch count if there are no problems. Again replace mdX with your actual raid device:

echo check > /sys/block/mdX/md/sync_action

Just as before, you can watch the progress with:

cat /proc/mdstat

Infrastructure Yum Repo SOP

In some cases RPM's in Fedora need to be rebuilt for the Infrastructure team to suit our needs. This repo is provided to the public (except for the RHEL RPMs). Rebuilds go into this repo, which is stored on the netapp and shared via the proxy servers after being built on koji. For basic instructions, read the standard documentation on the Fedora wiki: https://fedoraproject.org/wiki/Using_the_Koji_build_system This document will only outline the differences between the "normal" repos and the infra repos.

Contents

1. Contact Information
2. Building an RPM
3. Tagging an existing build
4. Promoting a staging build
5. Koji package list


Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin
Location PHX
[53] https://kojipkgs.fedoraproject.org/repos-dist/
Servers koji batcave01 / Proxy Servers
Purpose Provides infrastructure repo for custom Fedora Infrastructure rebuilds

Building an RPM

Building an RPM for Infrastructure is significantly easier than building an RPM for Fedora. Basically get your SRPM ready, then submit it to koji for building to the $repo-infra target (e.g. epel7-infra). Example:

rpmbuild --define "dist .el7.infra" -bs test.spec
koji build epel7-infra test-1.0-1.el7.infra.src.rpm

Note: Remember to build it for every dist / arch you need to deploy it on.

After it has been built, you will see it's tagged as $repo-infra-candidate, this means that it is a candidate for being signed. The automatic signing system will pick it up and sign the package for you without any further intervention. You can track when this is done by checking the build info: when it is moved from $repo-infra-candidate to $repo-infra-stg, it has been signed. You can check this on the web interface (look under "Tags"), or via:

koji buildinfo test-1.0-1.el7.infra

After the build has been tagged into the $repo-infra-stg tag, tag2distrepo will automatically create a distrepo task, which will update the repository so that the package is available on staging hosts. After this time, you can yum clean all and then install the packages via yum install or yum update.
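On a staging host, picking up the freshly published build then looks roughly like this (using the example package name from above; substitute the real package):

sudo yum clean all
sudo yum update test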

Tagging existing builds

If you already have a real build and want to use it in the infrastructure before it has landed in stable, you can tag it into the respective infra-candidate tag. For example, if you have an epel7 build of test2-1.0-1.el7.infra, run:

koji tag epel7-infra-candidate test2-1.0-1.el7.infra

And then the same autosigning and repogen from the previous section applies.

Promoting a staging build

After getting autosigned, builds will land in the respective infra-stg tag, for example epel7-infra-stg. These tags go into repos that are enabled on staging machines, but not on production. If you decide, after testing, that the build is good enough for production, you can promote it by running:

koji move epel7-infra-stg epel7-infra test2-1.0-1.el7.infra


Koji package list

If you try to build a package into the infra tags, and koji says something like:

BuildError: package test not in list for tag epel7-infra-candidate

That means that the package has not been added to the list for building in that particular tag. Either add the package to the respective Fedora/EPEL branches (this is the preferred method, since we should always aim to get everything packaged for Fedora/EPEL), or add the package to the listing for the respective tag. To add a package to an infra tag, run:

koji add-pkg $tag $package --owner=$user

Infrastructure retire machine SOP

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Location: anywhere
Servers: any
Purpose: Makes sure decommissioning machines is correctly done

Introduction

When a machine (be it a virtual instance or real physical hardware) is decommissioned, a set of steps must be followed to ensure that the machine is properly removed from the set of machines we manage and doesn't cause problems down the road.

Retire process

1. Ensure that the machine is no longer used for anything. Use git-grep, stop services, etc.
2. Remove the machine from ansible. Make sure you not only remove the main machine name, but also any aliases it might have (or move them to an active server if they are active services). Make sure to search for the IP address(es) of the machine as well. Ensure dns is updated to remove the machine.
3. Remove the machine from any labels in hardware devices like consoles or the like.
4. Revoke the ansible cert for the machine.
5. Move the machine xml definition to ensure it does NOT start on boot. You can move it to 'name-retired-YYYY-MM-DD'.
6. Ensure any backend storage the machine was using is freed or renamed to name-retired-YYYY-MM-DD

TODO

fill in commands

Infrastructure/SOP/Yubikey

This document describes how yubikey authentication works


Contents

1. Contact Information
2. User Information
3. Host Admins
   1. pam_yubico
4. Server Admins
   1. Basic architecture
   2. ykval
   3. ykksm
   4. Physical Yubikey info
   5. fas integration

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main
Location Phoenix
Servers fas*, db02
Purpose Provides yubikey authentication in Fedora

Config Files

• /etc/httpd/conf.d/yk-ksm.conf
• /etc/httpd/conf.d/yk-val.conf
• /etc/ykval/ykval-config.php
• /etc/ykksm/ykksm-config.php
• /etc/fas.cfg

User Information

See [57]Infrastructure/Yubikey

Host Admins

pam_yubico

Generated from fas, the /etc/yubikeyid works like an authorized_keys file and maps valid keys to users. It is downloaded from FAS: [58]https://admin.fedoraproject.org/accounts/yubikey/dump


Server Admins

Basic architecture

Yubikey authentication takes place in 3 basic phases:

1. User presses yubikey which generates a one time password
2. The one time password makes its way to the yk-val application which verifies it is not a replay
3. yk-val passes that otp on to the yk-ksm application which verifies the key itself is a valid key

If all of those steps succeed, the ykval application sends back an OK and authentication is considered successful. The two applications are defined below; if either of them is unavailable, yubikey authentication will fail.

ykval

Database: db02:ykval

The database contains 3 tables:
• clients: just a valid client. These are not users, these are systems able to authenticate against ykval. In our case Fedora is the only client so there's just one entry here.
• queue: used for distributed setups (we don't do this)
• yubikeys: maps which yubikey belongs to which user

ykval is installed on fas* and is located at: [59]http://localhost/yk-val/verify

Purpose: to map keys to users and protect against replay attacks

ykksm

Database: db02:ykksm

The database contains one table:
• yubikeys: maps who created keys, what key was created, when, and the public name and serial number, whether it's active, etc.

ykksm is installed on fas* at [60]http://localhost/yk-ksm

Purpose: verify if a key is a valid known key or not. Nothing contacts this service directly except for ykval. This should be considered the "high security" portion of the system as access to this table would allow users to make their own yubikeys.

Physical Yubikey info

The actual yubikey contains information to generate a one time password. The important bits to know are that the beginning of the otp contains the identifier of the key (used similar to how ssh uses authorized_keys) and the rest of it contains lots of bits of information, including an incrementing serial.

Sample key: ccccfcdaivjrvdhvzfljbbievftnvncljhibkulrftt

Breaking this up, the first 12 characters are the identifier. This can be considered 'public':

ccccfcdaivj rvdhvzfljbbievftnvncljhibkulrftt

The second half is the otp part.
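For illustration, splitting the sample OTP into its public identifier and one-time part can be done with plain shell string slicing; the sample value is the one quoted above:

otp=ccccfcdaivjrvdhvzfljbbievftnvncljhibkulrftt
echo "${otp:0:12}"   # public identifier portion
echo "${otp:12}"     # changing one-time portion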

fas integration

Fas integration has two main parts. First is key generation, the next is activation. The fas-plugin-yubikey contains the bits for both, and verification. Users call on this page to generate the key info: [61]https://admin.fedoraproject.org/accounts/yubikey/genkey The fas password field automatically detects whether someone is using a otp or a regular password. It then sends otp requests to yk-val for verification.

Ipsilon Infrastructure SOP

Contents

1. Contact Information
2. Description
3. Known Issues
4. Restarting
5. Configuration
6. Common actions
   6.1. Registering OpenID Connect Scopes
   6.2. Generate an OpenID Connect token
   6.3. Create OpenID Connect secrets for apps

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin
Primary upstream contact Patrick Uiterwijk - FAS: puiterwijk
Backup upstream contacts Simo Sorce - FAS: simo (irc: simo); Howard Johnson - FAS: merlinthp (irc: MerlinTHP); Rob Crittenden - FAS: rcritten (irc: rcrit)
Location Phoenix
Servers ipsilon01.phx2.fedoraproject.org, ipsilon02.phx2.fedoraproject.org, ipsilon01.stg.phx2.fedoraproject.org
Purpose Ipsilon is our central authentication service that is used to authenticate users against FAS. It is separate from FAS.

Description

Ipsilon is our central authentication agent that is used to authenticate users against FAS. It is separate from FAS. The only service that is not using this currently is the wiki. It is a web service that is presented via httpd and is load balanced by our standard haproxy setup.

Known issues

No known issues at this time. There is not currently a logout option for ipsilon, but it is not considered an issue. If group memberships are updated in ipsilon the user will need to wait a few minutes for them to replicate to all the systems.


Restarting

To restart the application you simply need to ssh to the servers for the problematic region and issue a 'service httpd restart'. This should rarely be required.

Configuration

Configuration is handled by the ipsilon.yaml playbook in Ansible. This can also be used to reconfigure the application, if that becomes necessary.

Common actions

This section describes some common configuration actions.

OpenID Connect Scope Registration

As documented on https://fedoraproject.org/wiki/Infrastructure/Authentication, application developers can request their own scopes. When a request for this comes in, look in ansible/roles/ipsilon/files/oidc_scopes/ and copy an example module to a new file, so we have a file per scope set. Fill in the information:
• name is an Ipsilon-internal name. This should not include any spaces.
• display_name is the name that is displayed for the category of scopes to the user.
• scopes is a dictionary with the full scope identifier (with namespace) as keys. The values are dicts with the following keys:
  – display_name: The complete display name for this scope. This is what the user gets shown to accept/reject.
  – claims: A list of additional "claims" (pieces of user information) an application will get when the user consents to this scope. For most scopes, this will be the empty list.

In ansible/roles/ipsilon/tasks/main.yml, add the name of the new file (without .py) to the with_items of "Copy OpenID Connect scope registrations". To enable, open ansible/roles/ipsilon/templates/configuration.conf, and look for the lines starting with "openidc enabled extensions". Add the name of the plugin (in the "name" field of the file) to the environment this scopeset has been requested for. Run the ansible ipsilon.yml playbook.

Generate an OpenID Connect token

There is a handy script in the Ansible project under scripts/generate-oidc-token that can help you generate an OIDC token. It has a self-explanatory --help argument, and it will print out some SQL that you can run against Ipsilon's database, as well as the token that you seek. The SERVICE_NAME (the required positional argument) is the name of the application that wants to use the token to perform actions against another service. To generate the scopes, you can visit our authentication docs and find the service you want the token to be used for. Each service has a base namespace (a URL) and one or more scopes for that namespace. To form a scope for this script, you concatenate the namespace of the service with the scope you want to grant the service. You can provide the script the -s flag multiple times if you want to grant more than one scope to the same token. As an example, to give Bodhi access to create waivers in WaiverDB, you can see that the base namespace is https://waiverdb.fedoraproject.org/oidc/ and that there is a create-waiver scope. You can run this to generate Ipsilon SQL and a token with that scope:


[bowlofeggs@batcave01 ansible][PROD]$ ./scripts/generate-oidc-token bodhi -e 365 -s https://waiverdb.fedoraproject.org/oidc/create-waiver

Run this SQL against Ipsilon's database:

------START CUTTING HERE------
BEGIN;
insert into token values ('2a5f2dff-4e93-4a8d-8482-e62f40dce046','username','bodhi@service');
insert into token values ('2a5f2dff-4e93-4a8d-8482-e62f40dce046','security_check','-ptBqVLId-kUJquqkVyhvR0DbDULIiKp1eqbXqG_dfVK9qACU6WwRBN3-7TRfoOn');
insert into token values ('2a5f2dff-4e93-4a8d-8482-e62f40dce046','client_id','bodhi');
insert into token values ('2a5f2dff-4e93-4a8d-8482-e62f40dce046','expires_at','1557259744');
insert into token values ('2a5f2dff-4e93-4a8d-8482-e62f40dce046','type','Bearer');
insert into token values ('2a5f2dff-4e93-4a8d-8482-e62f40dce046','issued_at','1525723744');
insert into token values ('2a5f2dff-4e93-4a8d-8482-e62f40dce046','scope','["openid", "https://someapp.fedoraproject.org/"]');
COMMIT;
------END CUTTING HERE------

Token: 2a5f2dff-4e93-4a8d-8482-e62f40dce046_-ptBqVLId-kUJquqkVyhvR0DbDULIiKp1eqbXqG_dfVK9qACU6WwRBN3-7TRfoOn

Once you have the SQL, you can run it against Ipsilon’s database, and you can provide the token to the application through some secure means (such as putting into Ansible’s secrets and telling the requestor the Ansible variable they can use to access it.)

Create OpenID Connect secrets for apps

Applications wanting to use OpenID Connect need to register against our OpenID Connect server (Ipsilon). Since we do not allow self-registration (except on iddev.fedorainfracloud.org) for obvious reasons, the secrets need to be created and configured per application and environment (production vs staging). To do so:
- Go to the private ansible repository.
- Edit the file: files/ipsilon/openidc.{{env}}.static
- At the bottom of this file, add the information concerning the application you are adding. This will look something like:

fedocal client_name="fedocal"
fedocal client_secret=""
fedocal redirect_uris=["https://calendar.stg.fedoraproject.org/oidc_callback"]
fedocal client_uri="https://calendar.stg.fedoraproject.org/"
fedocal ipsilon_internal={"type":"static","client_id":"fedocal","trusted":true}
fedocal contacts=["[email protected]"]
fedocal client_id=null
fedocal policy_uri="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
fedocal grant_types="authorization_code"
fedocal response_types="code"
fedocal application_type="web"
fedocal subject_type="pairwise"
fedocal logo_uri=null
fedocal tos_uri=null
fedocal jwks_uri=null
fedocal jwks=null
fedocal sector_identifier_uri=null
fedocal request_uris=[]
fedocal require_auth_time=null
fedocal token_endpoint_auth_method="client_secret_post"
fedocal id_token_signed_response_alg="RS256"
fedocal request_object_signing_alg="none"
fedocal initiate_login_uri=null
fedocal default_max_age=null
fedocal default_acr_values=null
fedocal client_secret_expires_at=0

In most situations, only the first 5 lines (up to ipsilon_internal) will change. If the application is not using flask-oidc or is not maintained by the Fedora Infrastructure, the first 11 lines (up to application_type) may change. The remaining lines require a deeper understanding of OpenID Connect and Ipsilon.

Note: client_id in ipsilon_internal must match the beginning of the line, and the client_id field must either match the beginning of the line or be null as in the example here.

Note: In our OpenID connect server, OIDC.user_getfield(‘nickname’) will return the FAS username, which we know from FAS is unique. However, not all OpenID Connect servers enforce this constraint, so the application code may rely on the sub which is the only key that is sure to be unique. If the application relies on sub and wants sub to return the FAS username, then the configuration should be adjusted with: subject_type="public".

After adjusting this file, you will need to make the client_secret available to the application via ansible. For this, simply add it to vars.yml as we do for the other private variables and provide the variable name to the person who requested it. Finally, commit and push the changes to both files and run the ipsilon.yml playbook.

iSCSI

iscsi allows one to share and mount block devices using the scsi protocol over a network. Fedora currently connects to a netapp that has an iscsi export.

Contents

1. Contact Information
2. Typical uses
3. iscsi basics
   1. Terms
   2. iscsi's basic login / logout procedure is
4. Logging in
5. Logging out
6. Important note about creating new logical volumes


Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main
Location Phoenix
Servers xen[1-15]
Purpose Provides iscsi connectivity to our netapp.

Typical uses

The best uses of iscsi for Fedora are servers that are not part of a farm or live-replicated. For example, we wouldn't put app1 on the iscsi share because we don't gain anything from it: shutting down app1 to move it isn't an issue because app1 is part of our application server farm. noc1, however, is not replicated. It's a standalone box that, at best, would have a non-live failover. By placing this host on an iscsi share, we can make it more highly available, as it allows us to move that box around our virtualization infrastructure without rebooting it or even taking it down.

iscsi basics

Terms

• initiator means client
• target means server
• swab means mop
• deck means floor

iscsi's basic login / logout procedure is

1. Notify your client that a new target is available (similar to editing /etc/fstab for a new nfs mount)
2. Login to the iscsi target (similar to running "mount /my/nfs")
3. Logout from the iscsi target (similar to running "umount /my/nfs")
4. Delete the target from the client (similar to removing the nfs mount from /etc/fstab)

Logging in

Most mounts are covered by ansible so this should be automatic. In the event that something goes wrong though, the best way to fix this is:
• Notify the client of the target:

iscsiadm --mode node --targetname iqn.1992-08.com.netapp:sn.118047036 --portal 10.5.88.21:3260 -o new

• Log in to the new target:


iscsiadm --mode node --targetname iqn.1992-08.com.netapp:sn.118047036 --portal 10.5.88.21:3260 --login

• Scan and activate lvm:

pvscan
vgscan
vgchange -ay xenGuests

Once this is done, one should be able to run "lvs" to see the logical volumes.

Logging out

Logging out isn’t normally needed, for example rebooting a machine automatically logs the initiator out. Should a problem arise though here are the steps: • Disable the logical volume:

vgchange -an xenGuests

• Log out:

iscsiadm --mode node --targetname iqn.1992-08.com.netapp:sn.118047036 --portal 10.5.88.21:3260 --logout

Note: Cannot deactivate volume group. If the vgchange command fails with an error about not being able to deactivate the volume group, this means that one of the logical volumes is still in use. By running "lvs" you can get a list of the logical volumes. Look in the Attr column. There are 6 attrs listed. The 5th column usually has a '-' or an 'a'; 'a' means it's active, '-' means it is not. To the right of that (the last column) you will see a '-' or an 'o'. If you see an 'o', that means that logical volume is still mounted and in use.
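For illustration, here is a hypothetical lvs listing (the volume names are made up) showing the six-character Attr string described above:

  LV      VG        Attr   LSize
  guest01 xenGuests -wi-ao 20.00g
  guest02 xenGuests -wi-a- 10.00g

In this sketch guest01 is active ('a' in the 5th position) and still open ('o' in the last position), so it is the one blocking deactivation; guest02 is active but not in use.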

Important: Note about creating new logical volumes. At present we do not have logical volume locking on the xen servers. This is dangerous and being worked on. Basically, when you create a new volume on a host, you need to run:

pvscan
vgscan
lvscan

on the other virtualization servers.

Jenkins Fedmsg SOP

Send information about Jenkins builds to fedmsg.

Contact Information

Owner Ricky Elrod, Fedora Infrastructure Team


Contact #fedora-apps

Reinstalling when it disappears

For an as-yet-unknown reason, the plugin sometimes seems to disappear, though it still shows as "installed" in Jenkins. To re-install it, grab fedmsg.hpi from /srv/web/infra/bigfiles/jenkins. Go to the Jenkins web interface and log in. Click Manage Jenkins -> Manage Plugins -> Advanced. Upload the plugin and, on the page that comes up, check the box to have Jenkins restart when running jobs are finished.

Configuration Values

These are written here in case the Jenkins configuration ever gets lost. This is how to configure the jenkins-fedmsg-emit plugin. Assume the plugin is already installed. Go to "Configure Jenkins" -> "System Configuration". Towards the bottom, look for "Fedmsg Emitter".

Values:
Signing: Checked
Fedmsg Endpoint: tcp://209.132.181.16:9941
Environment Shortname: prod
Certificate File: /etc/pki/fedmsg/jenkins-jenkins.fedorainfracloud.org.crt
Keystore File: /etc/pki/fedmsg/jenkins-jenkins.fedorainfracloud.org.key

Kerneltest-harness SOP

The kerneltest-harness is the web application used to gather and present statistics about kernel test results.

Contents

1. Contact Information 2. Documentation Links

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin
Location https://apps.fedoraproject.org/kerneltest/
Servers kerneltest01, kerneltest01.stg
Purpose Provide a system to gather and present kernel tests results


Add a new Fedora release

• Login
• On the front page, in the menu on the left side, if there is a Fedora Rawhide release, click on (edit).
• Bump the Release number on Fedora Rawhide to avoid conflicts with the new release you're creating
• Back on the index page, click on New release
• Complete the form:
  Release number - This would be the integer version of the Fedora release, for example 24 for Fedora 24.
  Support - The current status of the Fedora release:
    - Rawhide for Fedora Rawhide
    - Test for branched release
    - Release for released Fedora
    - Retired for retired release of Fedora

Upload new test results

The kernel tests are available in the kernel-test git repository. Once run with runtests.sh, you can upload the resulting file either using fedora_submit.py or the UI. If you choose the UI the steps are simply:
• Login
• Click on Upload in the main menu at the top
• Select the result file generated by running the tests
• Submit

Kickstart Infrastructure SOP

Kickstart scripts provide our install infrastructure. We have a plethora of different kickstarts to best match the system you are trying to install.

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main
Location Everywhere we have machines.
Servers batcave01 (stores kickstarts and install media)
Purpose Provides our install infrastructure

Introduction

Our kickstart infrastructure lives on batcave01. All install media and kickstart scripts are located on batcave01. Because the RHEL binaries are not public, we have these bits blocked. You can add needed IPs to (from batcave01): ansible/roles/batcave/files/allows


Physical Machine (kvm virthost)

Note: PXE Booting. If PXE booting, just follow the prompt after doing the pxe boot (most hosts will pxeboot via console by hitting f12).

Prep

This only works on an already booted box; many boxes at our colocations may have to be rebuilt by the people in those locations first. Also make sure the IP you are about to boot to install from is allowed to our IP-restricted infrastructure.fedoraproject.org as noted above (in Introduction). Download the vmlinuz and initrd images.

For a rhel6 install:

wget https://infrastructure.fedoraproject.org/repo/rhel/RHEL6-x86_64/images/pxeboot/vmlinuz \
    -O /boot/vmlinuz-install
wget https://infrastructure.fedoraproject.org/repo/rhel/RHEL6-x86_64/images/pxeboot/initrd.img \
    -O /boot/initrd-install.img

grubby --add-kernel=/boot/vmlinuz-install \
    --args="ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-6-nohd \
    repo=https://infrastructure.fedoraproject.org/repo/rhel/RHEL6-x86_64/ \
    ksdevice=link ip=$IP gateway=$GATEWAY netmask=$NETMASK dns=$DNS" \
    --title="install el6" --initrd=/boot/initrd-install.img

For a rhel7 install:

wget https://infrastructure.fedoraproject.org/repo/rhel/RHEL7-x86_64/images/pxeboot/vmlinuz -O /boot/vmlinuz-install
wget https://infrastructure.fedoraproject.org/repo/rhel/RHEL7-x86_64/images/pxeboot/initrd.img -O /boot/initrd-install.img

For phx2 hosts:

grubby --add-kernel=/boot/vmlinuz-install \
    --args="ks=http://10.5.126.23/repo/rhel/ks/hardware-rhel-7-nohd \
    repo=http://10.5.126.23/repo/rhel/RHEL7-x86_64/ \
    net.ifnames=0 biosdevname=0 bridge=br0:eth0 ksdevice=br0 \
    ip={{ br0_ip }}::{{ gw }}:{{ nm }}:{{ hostname }}:br0:none" \
    --title="install el7" --initrd=/boot/initrd-install.img

(You will need to setup the br1 device, if any, after install)

For non phx2 hosts:

grubby --add-kernel=/boot/vmlinuz-install \
    --args="ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-ext \
    repo=https://infrastructure.fedoraproject.org/repo/rhel/RHEL7-x86_64/ \
    net.ifnames=0 biosdevname=0 bridge=br0:eth0 ksdevice=br0 \
    ip={{ br0_ip }}::{{ gw }}:{{ nm }}:{{ hostname }}:br0:none" \
    --title="install el7" --initrd=/boot/initrd-install.img

Fill in the br0 ip, gateway, etc. The default here is to use the hardware-rhel-7-nohd config, which requires you to connect via VNC to the box and configure its drives. If this is a new machine or you are fine with blowing everything away, you can instead use https://infrastructure.fedoraproject.org/rhel/ks/hardware-rhel-6-minimal as your kickstart.

If you know the number of hard drives the system has, there are other kickstarts which can be used:

2 disk system: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-02disk
  or external: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-02disk-ext
4 disk system: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-04disk
  or external: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-04disk-ext
6 disk system: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-06disk
  or external: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-06disk-ext
8 disk system: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-08disk
  or external: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-08disk-ext
10 disk system: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-10disk
  or external: ks=https://infrastructure.fedoraproject.org/repo/rhel/ks/hardware-rhel-7-10disk-ext

Double and triple check your configuration settings (on RHEL-6 cat /boot/grub/menu.lst and on RHEL-7 cat /boot/grub2/grub.cfg), especially your IP information. In places like ServerBeach not all hosts have the same netmask or gateway. Once everything is ready, run the commands to get it set up for the next boot.

RHEL-6:

echo "savedefault --default=0 --once" | grub --batch
shutdown -r now

RHEL-7:

grub2-reboot 0
shutdown -r now

Installation

Once the box logs you out, start pinging the IP address. It will disappear and come back. Once you can ping it again, try to open up a VNC session. It can take a couple of minutes after the box is back up for it to actually allow vnc sessions. The VNC password is in the kickstart script on batcave01:

grep vnc /mnt/fedora/app/fi-repo/rhel/ks/hardware-rhel-7-nohd
vncviewer $IP:1

If using the standard kickstart script, one can watch as the install completes itself; there should be no need to do anything. If using the hardware-rhel-6-nohd script, one will need to configure the drives. The password is in the kickstart file in the kickstart repo.


Post Install

Run ansible on the box asap to set root passwords and other security features. Don’t leave a newly installed box sitting around.
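As a sketch of that step (the playbook name below is hypothetical; run whichever group playbook covers the newly installed host), from batcave01:

sudo rbac-playbook groups/noc.yml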

Koji Infrastructure SOP

Note: We are transitioning from two buildsystems, koji for Fedora and plague for EPEL, to just using koji. This page documents both.

Koji and plague are our buildsystems. They share some of the same machines to do their work.

Contents

1. Contact Information
2. Description
3. Add packages into Buildroot
4. Troubleshooting and Resolution
   1. Restarting Koji
   2. kojid won't start or some builders won't connect
   3. OOM (Out of Memory) Issues
      1. Increase Memory
      2. Decrease weight
   4. Disk Space Issues
   5. Should there be mention of being sure filesystems in chroots are unmounted before you delete the chroots?

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-build group
Persons mbonnet, dgilmore, f13, notting, mmcgrath, SmootherFrOgZ
Location Phoenix
Servers
• koji.fedoraproject.org
• buildsys.fedoraproject.org
• xenbuilder[1-4]
• hammer1, ppc[1-4]
Purpose Build packages for Fedora.


Description

Users submit builds to koji.fedoraproject.org or buildsys.fedoraproject.org. From there they get passed on to the builders.

Important: At present plague and koji are unaware of each other. A result of this may be an overloaded builder. An easy fix for this is not clear at this time.

Add packages into Buildroot

Some contributors may need to build packages against freshly built packages which are not in the buildroot yet. Koji has override tags as an inheritance to the build tag in order to include them into the buildroot, which can be set by:

koji tag-pkg dist-$release-override

Troubleshooting and Resolution

Restarting Koji

If for some reason koji needs to be restarted, make sure to restart the koji master first, then the builders. If the koji master has been down for a short enough time, the builders do not need to be restarted:

service httpd restart
service kojira restart
service kojid restart

Important: If postgres becomes interrupted in some way, koji will need to be restarted. As long as the koji master daemon gets restarted the builders should reconnect automatically. If the db server has been restarted and the builders don’t seem to be building, restart their daemons as well.

kojid won’t start or some builders won’t connect

In the event that some items are able to connect to koji while some are not, please make sure that the database is not filled up on connections. This is common if koji crashes and the db connections aren't properly cleared. Upon restart many of the connections are full so koji cannot reconnect. Clearing old connections is easy: guess about how long the new koji has been up, pick a number of minutes larger than that, and kill those queries. From db3 as postgres run:

echo "select procpid from pg_stat_activity where usename='koji' and now() - query_start \
    >= '00:40:00' order by query_start;" | psql koji | grep "^" | xargs kill

OOM (Out of Memory) Issues

Out of memory issues occur from time to time on the build machines. There are a couple of options for correction. The first fix is to just restart the machine and hope it was a one time thing. If the problem continues please choose from one of the following options.


Increase Memory

The xen machines can have memory increased on their corresponding xen hosts. At present this is the table:

xen3      xenbuilder1
xen4      xenbuilder2
disabled  xenbuilder3
xen8      xenbuilder4

Edit /etc/xen/xenbuilder[1-4] and add more memory.
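For example, in a typical Xen guest config the memory setting is a single line, something like the following (the value is illustrative and in MB):

# hypothetical excerpt from /etc/xen/xenbuilder1
memory = 4096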

Decrease weight

Each builder has a weight as to how much work can be given to it. Presently the only way to alter weight is actually changing the database on db3:

$ sudo su - postgres
-bash-2.05b$ psql koji
koji=# select * from host limit 1;
 id | user_id |          name          |  arches   | task_load | capacity | ready | enabled
----+---------+------------------------+-----------+-----------+----------+-------+---------
  6 |     130 | ppc3.fedora.redhat.com | ppc ppc64 |       1.5 |        4 | t     | t
(1 row)
koji=# update host set capacity=2 where name='ppc3.fedora.redhat.com';

Simply update capacity to a lower number.

Disk Space Issues

The builders use a lot of temporary storage. Failed builds also get left on the builders; most should get cleaned up, but plague does not clean up after itself. The easiest thing to do is remove some older cache dirs. Step one is to turn off both koji and plague:

/etc/init.d/plague-builder stop
/etc/init.d/kojid stop

Next check to see what file system is full:

df -h

Important: If any one of the following directories is full, send an outage notification as outlined in Infrastructure/OutageTemplate to the fedora-infrastructure-list and fedora-devel-list, then contact Mike McGrath:
• /mnt/koji
• /mnt/ntap-fedora1/scratch
• /pub/epel
• /pub/fedora


Typically just / will be full. The next thing to do is determine if we have any extremely large builds left on the builder. Typical locations include /var/lib/mock and /mnt/build (/mnt/build actually is on the local filesystem):

du -sh /var/lib/mock/* /mnt/build/*

/var/lib/mock/dist-f8-build-10443-1503 - classic koji build
/var/lib/mock/fedora-6-ppc-core-57cd31505683ef1afa533197e91608c5a2c52864 - classic plague build

If nothing jumps out immediately, just start deleting files older than one week. Once enough space has been freed, start koji and plague back up:

/etc/init.d/plague-builder start
/etc/init.d/kojid start

Unmounting

Warning: Should there be mention of being sure filesystems in chroots are unmounted before you delete the chroots? Res ipsa loquitur.
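A defensive sketch of that check (the chroot name is the example from above; the mount point inside it is hypothetical): look for anything still mounted under the chroots and unmount it before deleting:

grep /var/lib/mock /proc/mounts
# if something is still mounted inside a chroot, unmount it first, e.g.:
umount /var/lib/mock/dist-f8-build-10443-1503/root/proc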

This SOP documents how to archive Fedora EOL'd builds from the DEFAULT volume to the archived volume. Before archiving the builds, identify whether any of the EOL'd release builds are still being used in the current releases. For example, to test if f28 builds are still being used in f32, use:

$ koji list-tagged f32 | grep fc28

Tag all these builds with koji's do-not-archive-yet tag, so that they won't be archived. To do that, first add the packages to the do-not-archive-yet tag:

$ koji add-pkg do-not-archive-yet --owner pkg1 pkg2 ...

Then tag the builds with the do-not-archive-yet tag:

$ koji tag-build do-not-archive-yet build1 build2 ...

Then update the archive policy which is available in the releng repo (https://pagure.io/releng/blob/master/f/koji-archive-policy). Run the following from compose-x86-01.phx2.fedoraproject.org:

$ cd
$ wget https://pagure.io/releng/raw/master/f/koji-archive-policy
$ git clone https://pagure.io/koji-tools/
$ cd koji-tools
$ ./koji-change-volumes -p compose_koji -v ~/archive-policy

In any case, if you need to move a build back to the DEFAULT volume:

$ koji add-pkg do-not-archive-yet --owner pkg1
$ koji tag-build do-not-archive-yet build1
$ koji set-build-volume DEFAULT

Setup Koji Builder SOP


Contents

• Setting up a new koji builder
• Resetting/installing an old koji builder

Builder Setup

Setting up a new koji builder involves a goodly number of steps:

Network Overview

1. First get an instance spun up following the kickstart SOP.
2. Define a hostname for it on the .125 network and a $hostname-nfs name for it on the .127 network.
3. Make sure the instance has 2 network connections:
   • eth0 should be on the .125 network
   • eth1 should be on the .127 network
   For a VM, eth0 should be on br0 and eth1 on br1 on the vmhost.

Setup Overview

• install the system as normal:

virt-install -n $builder_fqdn -r $memsize \
    -f $path_to_lvm --vcpus=$numprocs \
    -l http://10.5.126.23/repo/rhel/RHEL6-x86_64/ \
    -x "ksdevice=eth0 ks=http://10.5.126.23/repo/rhel/ks/kvm-rhel-6 \
    ip=$ip netmask=$netmask gateway=$gw dns=$dns \
    console=tty0 console=ttyS0" \
    --network=bridge=br0 --network=bridge=br1 \
    --vnc --noautoconsole

• run python /root/tmp/setup-nfs-network.py; this should print out the -nfs hostname that you made above
• change root pw
• disable selinux on the machine in /etc/sysconfig/selinux
• reboot
• setup ssl cert into private/builders - use fqdn of host as DN
  – login to fas01 as root
  – cd /var/lib/fedora-ca
  – ./kojicerthelper.py normal --outdir=/tmp/ \
        --name=$fqdn_of_the_new_builder --cadir=. --caname=Fedora
  – info for the cert should be like this:

188 Chapter 2. Full Table of Contents Fedora Infrastructure Best Practices Documentation, Release 1.0.0

Country Name (2 letter code) [US]:
State or Province Name (full name) [North Carolina]:
Locality Name (eg, city) [Raleigh]:
Organization Name (eg, company) [Fedora Project]:
Organizational Unit Name (eg, section) []:Fedora Builders
Common Name (eg, your name or your servers hostname) []:$fqdn_of_new_builder
Email Address []:[email protected]

  – scp the file in /tmp/${fqdn}_key_and_cert.pem over to batcave01
  – put the file in the private repo under private/builders/${fqdn}.pem
  – git add + git commit
  – git push
• run ./sync-hosts in the infra-hosts repo; git commit; git push
• as a koji admin run:

koji add-host $fqdn i386 x86_64

(note: those are yum basearchs on the end - season to taste)

Resetting/installing an old koji builder

• disable the builder in koji (ask a koji admin)
• halt the old system (halt -p)
• undefine the vm instance on the buildvmhost:

virsh undefine $builder_fqdn

• reinstall it - from the buildvmhost run:

virt-install -n $builder_fqdn -r $memsize \
    -f $path_to_lvm --vcpus=$numprocs \
    -l http://10.5.126.23/repo/rhel/RHEL6-x86_64/ \
    -x "ksdevice=eth0 ks=http://10.5.126.23/repo/rhel/ks/kvm-rhel-6 \
    ip=$ip netmask=$netmask gateway=$gw dns=$dns \
    console=tty0 console=ttyS0" \
    --network=bridge=br0 --network=bridge=br1 \
    --vnc --noautoconsole

• watch install via vnc:

vncviewer -via bastion.fedoraproject.org $builder_fqdn:1

• when the install finishes:
  – start the instance on the buildvmhost:

virsh start $builder_fqdn

– set it to autostart on the buildvmhost:

virsh autostart $builder_fqdn


• when the guest comes up:
  – login via ssh using the temp root password
  – python /root/tmp/setup-nfs-network.py
  – change root password
  – disable selinux in /etc/sysconfig/selinux
  – reboot
  – ask a koji admin to re-enable the host

Koschei SOP

Koschei is a continuous integration system for RPM packages. Koschei runs package scratch builds after a dependency change or after a period of time elapses and reports package buildability status to interested parties.

Production instance: https://apps.fedoraproject.org/koschei
Staging instance: https://apps.stg.fedoraproject.org/koschei

Contact Information

Owner mizdebsk, msimacek
Contact #fedora-admin
Location Fedora Cloud
Purpose continuous integration system

Deployment

Koschei deployment is managed by two Ansible playbooks:

sudo rbac-playbook groups/koschei-backend.yml
sudo rbac-playbook groups/koschei-web.yml

Description

Koschei is deployed on two separate machines: koschei-backend and koschei-web.

Frontend (koschei-web) is a Flask WSGI application running with httpd. It displays information to users and allows editing package groups and changing priorities.

Backend (koschei-backend) consists of multiple services:
• koschei-watcher - listens to fedmsg events for complete builds and changes build states in the database
• koschei-repo-resolver - resolves package dependencies in a given repo using hawkey and compares them with the previous iteration to get a dependency diff. It resolves all packages in the newest repo available in Koji. The output is a base for scheduling new builds
• koschei-build-resolver - resolves complete builds in the repo in which they were done in Koji. Produces the dependency differences visible in the frontend
• koschei-scheduler - schedules new builds based on multiple criteria:


  – dependency priority - dependency changes since the last build, valued by their distance in the dependency graph
  – manual and static priorities - set manually in the frontend. Manual priority is reset after each build, static priority persists
  – time priority - time elapsed since the last build
• koschei-polling - polls the same types of events as koschei-watcher without reliance on fedmsg. Additionally takes care of package list synchronization and other regularly executed tasks

Configuration

Koschei configuration is in /etc/koschei/config-backend.cfg and /etc/koschei/config-frontend.cfg, and is merged with the default configuration in /usr/share/koschei/config.cfg (the ones in /etc override the defaults in /usr). Note the merge is recursive. The configuration contains all configurable items for all Koschei services and the frontend. Alterations to the configuration that aren't temporary should be done through the ansible playbook. Configuration changes have no effect on already running services; they need to be restarted, which happens automatically when using the playbook.

Disk usage

Koschei doesn’t keep on disk anything that couldn’t be recreated easily - all important data is stored in PostgreSQL database, configuration is managed by Ansible, code installed by RPM and so on. To speed up operation and reduce load on external servers, Koschei caches some data obtained from services it integrates with. Most notably, YUM repositories downloaded from Koji are kept in /var/cache/koschei/ repodata. Each repository takes about 100 MB of disk space. Maximal number of repositories kept at time is controlled by cache_l2_capacity parameter in config-backend.cfg (config-backend.cfg.j2 in Ansible). If repodata cache starts to consume too much disk space, that value can be decreased - after restart, koschei-*-resolver will remove least recently used cache entries to respect configured cache capacity.

Database

Koschei needs to connect to a PostgreSQL database; other database systems are not supported. The database connection is specified in the configuration under the database_config key, which can contain the following keys: username, password, host, port, database. After an update of koschei, the database needs to be migrated to the new schema. This happens automatically when using the upgrade playbook. Alternatively, it can be executed manually using:

koschei-admin alembic upgrade head

The backend services need to be stopped during the migration.

Managing koschei services

Koschei services are systemd units managed through systemctl. They can be started and stopped independently in any order. The frontend is run using httpd.


Suspending koschei operation

To stop builds from being scheduled, stopping the koschei-scheduler service is enough. For planned Koji outages, it's recommended to stop koschei-scheduler. It is not strictly necessary, as koschei can recover from Koji errors and network errors automatically, but when Koji builders are stopped, it may cause unexpected build failures that would be reported to users. Other services can be left running as they automatically restart themselves on Koji and network errors.
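For example, since the services are ordinary systemd units (see "Managing koschei services" above), suspending and later resuming scheduling is just:

systemctl stop koschei-scheduler
# ... Koji outage ...
systemctl start koschei-scheduler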

Limiting Koji usage

Koschei is by default limited to 30 concurrently running builds. This limit can be changed in the configuration under the koji_config.max_builds key. There's also Koji load monitoring, which prevents builds from being scheduled when Koji load is higher than a certain threshold. That should prevent scheduling builds during mass rebuilds, so it's not necessary to stop scheduling during those.

Fedmsg notifications

Koschei optionally supports sending fedmsg notifications for package state changes. The fedmsg dispatch can be turned on and off in the configuration (key fedmsg-publisher.enabled). Koschei doesn't supply configuration for fedmsg; it lets the library load its own (in /etc/fedmsg.d/).

Setting admin announcement

Koschei can display announcement in web UI. This is mostly useful to inform users about outages or other problems. To set announcement, run as koschei user:

koschei-admin set-notice"Koschei operation is currently suspended due to scheduled

˓→Koji outage"

or:

koschei-admin set-notice"Sumbitting scratch builds by Koschei is currently disabled

˓→due to Fedora 23 mass rebuild"

To clear announcement, run as koschei user:

koschei-admin clear-notice

Adding package groups

Packages can be added to one or more groups. To add a new group named "mynewgroup", run as koschei user:

koschei-admin add-group mynewgroup

To add a new group named "mynewgroup" and populate it with some packages, run as koschei user:

koschei-admin add-group mynewgroup pkg1 pkg2 pkg3


Set package static priority

Some packages are more or less important and can have higher or lower priority. Any user can change the manual priority, which is reset after the package is rebuilt. Admins can additionally set a static priority, which is not affected by package rebuilds. To set the static priority of package "foo" to value "100", run as koschei user:

koschei-admin --collection f27 set-priority --static foo 100

Branching a new Fedora release

After branching occurs and Koji build targets have been created, Koschei should be updated to reflect the new state. There is a special admin command for this purpose, which takes care of copying the configuration and also the last builds from the history. To branch the collection from Fedora 27 to Fedora 28, use the following:

koschei-admin branch-collection f27 f28 -d 'Fedora 27' -t f28 --bugzilla-version 27

Then you can optionally verify that the collection configuration is correct by visiting https://apps.fedoraproject.org/koschei/collections and examining the configuration of the newly branched collection.

Layered Image Build System

The Fedora Layered Image Build System, often referred to as OSBS (OpenShift Build Service) as that is the upstream project that this is based on, is used to build Layered Container Images in the Fedora Infrastructure via Koji.

Contents

1. Contact Information 2. Overview 3. Setup 4. Outage

Contact Information

Owner Clement Verna (cverna)
Contact #fedora-admin, #fedora-releng, #fedora-noc, sysadmin-main, sysadmin-releng
Location osbs-control01, osbs-master01, osbs-node01, osbs-node02, registry.fedoraproject.org, candidate-registry.fedoraproject.org, osbs-control01.stg, osbs-master01.stg, osbs-node01.stg, osbs-node02.stg, registry.stg.fedoraproject.org, candidate-registry.stg.fedoraproject.org, x86_64 koji buildvms
Purpose Layered Container Image Builds


Overview

The build system is set up such that Fedora Layered Image maintainers submit a build to Koji via the fedpkg container-build command from a container namespace within DistGit. This triggers the build to be scheduled in OpenShift via the osbs-client tooling, which creates a custom OpenShift Build that uses the pre-made buildroot container image we have created. The Atomic Reactor (atomic-reactor) utility runs within the buildroot and preps the build container where the actual build action will execute; it also handles uploading the Content Generator metadata back to Koji and uploading the built image to the candidate docker registry.

This runs on a host with iptables rules restricting access to the docker bridge; this is how we further limit the access of the buildroot to the outside world, verifying that all sources of information come from Fedora.

Completed layered image builds are hosted in a candidate docker registry which is then used to pull the image and perform tests.

Setup

The Layered Image Build System setup is currently as follows (more detailed view available in the RelEng Architecture Document):

=== Layered Image Build System Overview ===

(ASCII architecture diagram: koji hub and a koji builder on one side, batcave with ansible on the other, feeding osbs-control01, osbs-master01, and osbs-node01 / osbs-node02.)

Deployment

From batcave you can run the following:

$ sudo rbac-playbook groups/osbs/deploy-cluster.yml

This is going to deploy the OpenShift clusters used by OSBS. Currently the playbook deploys 2 clusters (x86_64 and aarch64). Ansible tags can be used to deploy only one of these if needed, for example osbs-x86-deploy-openshift, as shown below.
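For instance, to deploy only the x86_64 cluster:

sudo rbac-playbook groups/osbs/deploy-cluster.yml -t osbs-x86-deploy-openshift

If the openshift-ansible playbook fails, it can be easier to run it directly from osbs-control01 and use the verbose mode: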

$ ssh osbs-control01.iad2.fedoraproject.org
$ sudo -i
# cd /root/openshift-ansible
# ansible-playbook -i cluster-inventory playbooks/prerequisites.yml
# ansible-playbook -i cluster-inventory playbooks/deploy_cluster.yml

Once these playbooks have been successful, you can configure OSBS on the cluster. For that, use the following playbook:

$ sudo rbac-playbook groups/osbs/configure-osbs.yml

When this is done, we need to get the new koji service token and update its value in the private repository:

$ ssh osbs-master01.iad2.fedoraproject.org
$ sudo -i
# oc -n osbs-fedora sa get-token koji
dsjflksfkgjgkjfdl ....

The token needs to be saved in the private ansible repo in files/osbs/production/x86-64-osbs-koji. Once this is done you can run the builder playbook to update that token.

$ sudo rbac-playbook groups/buildvm.yml -t osbs

Operation

Koji Hub will schedule the containerBuild on a koji builder via the koji-containerbuild-hub plugin. The builder will then submit the build to OpenShift via the koji-containerbuild-builder plugin, which uses the osbs-client python API that wraps the OpenShift API along with a custom OpenShift Build JSON payload. The Build is then scheduled in OpenShift and its logs are captured by the koji plugins. Inside the buildroot, atomic-reactor will upload the built container image as well as provide the metadata to koji's content generator.

Outage

If Koji is down, then builds can't be scheduled, but repairing Koji is outside the scope of this document. The same applies if either the candidate-registry.fedoraproject.org or registry.fedoraproject.org Container Registry is unavailable; repairing those is also outside the scope of this document.


OSBS Failures

OpenShift Build System itself can have various types of failures that are known about and the recovery procedures are listed below.

Ran out of disk space

Docker uses a lot of disk space, and while the osbs-nodes have been allotted what is considered to be ample disk space for builds (since they are automatically cleaned up periodically), it is possible this will run out. To resolve this, run the following commands:

# These commands will clean up old/dead docker containers from old OpenShift
# Pods

$ for i in $(sudo docker ps -a | awk '/Exited/ { print $1 }'); do sudo docker rm $i; done

$ for i in $(sudo docker images -q -f 'dangling=true'); do sudo docker rmi $i; done

# This command should only be run on osbs-master01 (it won't work on the
# nodes)
#
# This command will clean up old builds and related artifacts in OpenShift
# that are older than 30 days (We can get more aggressive about this if
# necessary, the main reason these still exist is in the event we need to
# debug something. All build info we care about is stored in Koji.)

$ oadm prune builds --orphans --keep-younger-than=720h0m0s --confirm

A node is broken, how to remove it from the cluster?

If a node is having an issue, the following command will effectively remove it from the cluster temporarily. In this example, we are removing osbs-node01

$ oadm manage-node osbs-node01.phx2.fedoraproject.org --schedulable=false
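Once the node is healthy again, the same command with --schedulable=true puts it back into the scheduling rotation:

$ oadm manage-node osbs-node01.phx2.fedoraproject.org --schedulable=true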

Container Builds are unable to access resources on the network

Sometimes the Container Builds will fail and the logs will show that the buildroot is unable to access networked resources (docker registry, dnf repos, etc).

This is because of a bug in OpenShift v1.3.1 (the current upstream release at the time of this writing) where an OpenVSwitch flow is left behind when a Pod is destroyed instead of the flow being deleted along with the Pod.

The method to confirm the issue is unfortunately multi-step since it's not a cluster-wide issue but isolated to the node experiencing the problem. First, in the koji createContainer task there is a log file called openshift-incremental.log and in there you will find a key:value pair in some JSON output similar to the following:


'openshift_build_selflink': u'/oapi/v1/namespaces/default/builds/cockpit-f24-6'

The last field of the value, in this example cockpit-f24-6, is the OpenShift build identifier. We need to ssh into osbs-master01 and get information about which node that ran on.

# On osbs-master01
# Note: the output won't be pretty, but it gives you the info you need

$ sudo oc get build cockpit-f25-3 -o yaml | grep osbs-node

Once you know what machine you need, ssh into it and run the following:

$ sudo docker run --rm -ti buildroot /bin/bash

# now attempt to run a curl command

$ curl https://google.com
# This should get refused, but if this node is experiencing the networking
# issue then this command will hang and eventually time out

How to fix: Reboot the affected node that's experiencing the issue. When the node comes back up, OpenShift will rebuild the flow tables on OpenVSwitch and things will be back to normal.

systemctl reboot

Libera IRC Channel Infrastructure SOP

Fedora uses the libera IRC network for its IRC communications. If you want to make a new Fedora-related IRC channel, please follow the following guidelines.

Contents

1. Contact Information 2. Is a new channel needed? 3. Adding new channel 4. Recovering/fixing an existing channel

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Location: libera
Servers: none
Purpose: Provides a channel for Fedora contributors to use.


Is a new channel needed?

First you should see if one of the existing Fedora channels will meet your needs. Adding a new channel can give you a less noisy place to focus on something, but at the cost of fewer people being involved. If your topic/area is development related, perhaps the main #fedora-devel channel will meet your needs?

Adding new channel

• Make sure the channel is in the #fedora-* namespace. This allows the Fedora Group Coordinator to make changes to it if needed.
• Found the channel. You do this by /join #channelname, then /msg chanserv register #channelname
• Setup GUARD mode. This allows ChanServ to be in the channel for easier management: /msg chanserv set #channel GUARD on
• Add some other Operators/Managers to the access list. This would allow them to manage the channel if you are asleep or absent:

/msg chanserv access #channel add NICK +ARfiorstv

You can see what the various flags mean at https://libera.chat/guides/channelmodes. You may want to consider adding some or all of the folks in #fedora-ops who manage other channels to help you with yours. You can see this list with /msg chanserv access #fedora-ops list
• Set default modes: /msg chanserv set mlock #channel +Ccnt (the t for topic lock is optional, if your channel would like to have people change the topic often).
• If your channel is of general interest, add it to the main communicate page of IRC Channels, and possibly announce it to your target audience.
• You may want to request zodbot join your channel if you need its functions. You can request that in #fedora-admin.

Recovering/fixing an existing channel

If there is an existing channel in the #fedora-* namespace that has a missing founder/operator, please contact the Fedora Group Coordinator: User:Spot and request it be reassigned. Follow the above procedure on the channel once done so it's set up and has enough operators/managers to not need reassigning again.

librariesio2fedmsg SOP

librariesio2fedmsg is a small service that converts Server-Sent Events from libraries.io to fedmsgs. librariesio2fedmsg is an instance of sse2fedmsg using the libraries.io firehose, running on OpenShift, and publishes its fedmsgs through the busgateway01.phx2.fedoraproject.org relay using the org.fedoraproject.prod.sse2fedmsg.librariesio topic.

Updating

sse2fedmsg is installed directly from its git repository, so once a new release is tagged in sse2fedmsg, just update the tag in the git URL provided to pip in the build config.
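As a hypothetical sketch (the repository location and tag are placeholders, not the real values), the requirement handed to pip in the build config looks something like:

git+https://<sse2fedmsg-git-repo>@<new-tag>#egg=sse2fedmsg

Only the tag portion after the @ needs to change for a new release.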


Deploying

Run the playbook to apply the new OpenShift configuration:

$ sudo rbac-playbook openshift-apps/librariesio2fedmsg.yml

Link tracking

Using link tracking is an easy way for us to find out how people are getting to our download page. People might click over to our download page from any of a number of areas, and knowing the relative usage of those links can help us understand what materials we're producing are more effective than others.

Adding links

Each link should be constructed by adding ? to the URL, followed by a short code that includes:
• an indicator for the link source (such as the wiki release notes)
• an indicator for the specific Fedora release (such as F15 for the final, or F15a for the Alpha test release)
So a link to get.fp.o from the one-page release notes would become http://get.fedoraproject.org/?opF15.

FAQ

I want to copy a link to my status update for social networking, or my blog. If you're posting a status update to identi.ca, for example, use the link tracking code for status updates. Don't copy a link straight from an announcement that includes the announcement's link tracking. You can copy the link itself, but remember to change the portion after the ? to instead use the st code for status updates and blogs, followed by the Fedora release version (such as F16a, F16b, or F16), like this:

http://fedoraproject.org/get-prerelease?stF16a

I want to point people to the announcement from my blog. Should I use the announcement link tracking code? The actual URL link itself is the announcement URL. Add the link tracking code for blogs, which would start with ?st and end with the Fedora release version, like this:

http://fedoraproject.org/wiki/F16_release_announcement?stF16a

The codes

Note: Additions to this table are welcome.


Link source                           Code
Email announcements                   an
Wiki announcements                    wkan
Front page                            fp
Front page of wiki                    wkfp
The press release Red Hat makes       rhpr
http://redhat.com/fedora              rhf
Test phase release notes on wiki      wkrn
Official release notes                rn
Official installation guide           ig
One-page release notes                op
Status links (blogs, social media)    st

Loopabull

Loopabull is an event-driven Ansible-based automation engine. This is used for various tasks, originally slated for Release Engineering Automation.

Contents

1. Contact Information 2. Overview 3. Setup 4. Outage

Contact Information

Owner Adam Miller (maxamillion), Pierre-Yves Chibon (pingou)
Contact #fedora-admin, #fedora-releng, #fedora-noc, sysadmin-main, sysadmin-releng
Location loopabull01.phx2.fedoraproject.org, loopabull01.stg.phx2.fedoraproject.org
Purpose Event Driven Automation of tasks within the Fedora Infrastructure and Fedora Release Engineering

Overview

The loopabull system is set up such that when an event takes place within the infrastructure and a fedmsg is sent, loopabull will consume that message, trigger an Ansible playbook that shares a name with the fedmsg topic, and provide the payload of the fedmsg to the playbook as extra variables.
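Conceptually, for a hypothetical topic and payload (both made up here), loopabull ends up running something equivalent to:

ansible-playbook playbooks/org.fedoraproject.prod.buildsys.build.state.change.yml \
    --extra-vars '{"msg": {"build_id": 123456, "new": 1}}'

The playbook file name matches the fedmsg topic, and the message payload arrives as extra variables.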

Setup

The setup is relatively simple, the Overview above describes it and a more detailed version can be found in the releng docs.


(Diagram: fedmsg feeds the Looper (fedmsg handler plugin), which feeds Loopabull (the event loop), which invokes ansible-playbook.)

Deployment

Loopabull is deployed on two hosts, one for the production instance: loopabull01.prod.phx2.fedoraproject.org and one for the staging instance: loopabull01.stg.phx2.fedoraproject.org. Each host is running loopabull with 5 workers reacting to fedmsg notifications.

Expanding loopabull

The documentation to expand loopabull’s usage is documented at: https://pagure.io/Fedora-Infra/loopabull-tasks

Outage

In the event that loopabull isn’t responding or isn’t running playbooks as it should be, the following scenarios should be approached.

What is going on?

There are a few commands that may help figure out what is going on:
• Check the status of the different services:

systemctl | grep loopabull

• Follow the logs of the different services:


journalctl -lfu loopabull -u loopabull@1 -u loopabull@2 -u loopabull@3 \
    -u loopabull@4 -u loopabull@5

If a playbook returns a non-zero error code, the worker running it will be stopped. If that happens, you may want to carefully review the logs to assess what led to this situation so it can be prevented in the future.
• Monitoring the queue size
The loopabull service listens to the fedmsg bus and puts the messages as they come into a rabbitmq/amqp queue for the workers to process. If you want to see the number of messages pending to be processed by the workers, you can check the queue size using:

rabbitmqctl list_queues

The output will be something like:

Listing queues ...
workers 489989
...done.

Where workers is the name of the queue used by loopabull and 489989 is the number of messages in that queue (yes, that day we were recovering from a several-day long outage).

Network Interruption

Sometimes if the network is interrupted, the loopabull service will hang because the fedmsg listener will hold a dead socket open. The service and its workers simply need to be restarted at that point:

systemctl restart loopabull loopabull@1 loopabull@2 loopabull@3 \
    loopabull@4 loopabull@5

Mailman Infrastructure SOP

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main, sysadmin-tools, sysadmin-hosted
Location phx2
Servers mailman01, mailman02, mailman01.stg
Purpose Provides mailing list services.

Description

Mailing list services for Fedora projects are located on the mailman01.phx2.fedoraproject.org server.

Common Tasks


Creating a new mailing list

• Log into mailman01
• sudo -u mailman mailman3 create <listname>@lists.fedora(project|hosted).org --owner <owner>@fedoraproject.org --notify

Note: Note that list names should make sense, and not contain the words ‘fedora’ or ‘list’ - the fact that it has to do with Fedora and that it’s a list are both obvious from the domain of the email address.

Important: Please make sure to add a valid description to the newly created list. (to avoid [no description available] on listinfo index)
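A hypothetical example (the list name and owner are made up), creating a project list called infra-testing:

sudo -u mailman mailman3 create [email protected] --owner [email protected] --notify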

Removing content from archives

We don’t. It’s not easy to remove content from the archives and it’s generally useless as well because the archives are often mirrored by third parties as well as being in the INBOXs of all of the people on the mailing list at that time. Here’s an example message to send to someone who requests removal of archived content:

Greetings,

We're sorry to say that we don't remove content from the mailing list archives. Doing so is a non-trivial amount of work and usually doesn't achieve anything because the content has already been disseminated to a wide audience that we do not control. The emails have gone out to all of the subscribers of the mailing list at that time and also (for a great many of our lists) been copied by third parties (for instance: http://markmail.org and http://gmane.org).

Sorry we cannot help further,

Mailing lists and their owners

Checking Membership

Are you in need of checking who owns a certain mailing list without having to search around on lists' front pages? Mailman has a nice tool that will help us list members by type. Get a full list of all the mailing lists hosted on the server:

sudo -u mailman mailman3 lists

Get the list of regular members for [email protected]:

sudo -u mailman mailman3 members [email protected]

Get the list of owners for [email protected]:

sudo -u mailman mailman3 members -R owner [email protected]


Get the list of moderators for [email protected]:

sudo -u mailman mailman3 members -R moderator [email protected]

Troubleshooting and Resolution

List Administration

Specific users are marked as 'site admins' in the database. Please file an issue if you feel you need to have this access.

Restart Procedure

If the server needs to be restarted, mailman should come back on its own. Otherwise each service on it can be restarted:

sudo service mailman3 restart
sudo service restart

How to delete a mailing list

Delete a list, but keep the archives:

sudo -u mailman mailman3 remove <listname>

SSL Certificate Creation SOP

Every now and then you will need to create an SSL certificate for a Fedora Service.

Creating a CSR for a new server.

Know your hostname, i.e. lists.fedoraproject.org:

export ssl_name=<hostname>

Create the cert. 8192 does not work with various boxes so we use 4096 currently:

openssl genrsa -out ${ssl_name}.pem 4096
openssl req -new -key ${ssl_name}.pem -out ${ssl_name}.csr

Country Name (2 letter code) [XX]:US
State or Province Name (full name) []:NM
Locality Name (eg, city) [Default City]:Raleigh
Organization Name (eg, company) [Default Company Ltd]:Red Hat
Organizational Unit Name (eg, section) []:Fedora Project
Common Name (eg, your name or your server's hostname) []:lists.fedorahosted.org
Email Address []:[email protected]

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:

Send the CSR to the signing authority and wait for a cert. Place all three into the private directory so that you can make certs in the future.

Creating a temporary self-signed certificate.

Repeat the steps above but add in the following:

openssl x509 -req -days 30 -in ${ssl_name}.csr -signkey ${ssl_name}.pem -out ${ssl_name}.cert
Signature ok
subject=/C=US/ST=NM/L=Raleigh/O=Red Hat/OU=Fedora Project/CN=lists.fedorahosted.org/[email protected]
Getting Private key

We only want a self-signed certificate to be good for a short time, so 30 days sounds good.

Mass Upgrade Infrastructure SOP

Every once in a while, we need to apply mass upgrades to our servers for various security and other upgrades.

Contents

1. Contact Information
2. Preparation
3. Staging
4. Special Considerations
   • Disable builders
   • Post reboot action
   • Schedule autoqa01 reboot
   • Bastion01 and Bastion02 and openvpn server
   • Special yum directives
5. Update Leader
6. Group A reboots
7. Group B reboots
8. Group C reboots
9. Doing the upgrade
10. Doing the reboot
11. Aftermath


Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, [email protected], #fedora-noc
Location: All over the world.
Servers: all
Purpose: Apply kernel/other upgrades to all of our servers

Preparation

1. Determine which host group you are going to be doing updates/reboots on:
   Group "A" - servers that end users will see or note being down, and anything that depends on them.
   Group "B" - servers that contributors will see or note being down, and anything that depends on them.
   Group "C" - servers that infrastructure will notice are down, or that are redundant enough to reboot some with others taking the load.
2. Appoint an 'Update Leader' for the updates.
3. Follow the Outage Infrastructure SOP and send advance notification to the appropriate lists. Try to schedule the update at a time when many admins are around to help/watch for problems and when the impact for the group affected is less. Do NOT do multiple groups on the same day if possible.
4. Plan an order for rebooting the machines considering two factors:
   • Location of systems on the kvm or xen hosts. [You will normally reboot all systems on a host together]
   • Impact of systems going down on other services, operations and users. Thus, since the database servers and nfs servers are the backbone of many other systems, they and systems that are on the same xen boxes would be rebooted before other boxes.
5. To aid in organizing a mass upgrade/reboot with many people helping, it may help to create a checklist of machines in a gobby document.
6. Schedule downtime in nagios.
7. Make doubly sure that various app owners are aware of the reboots.

Staging

Any updates that can be tested in staging or a pre-production environment should be tested there first, including new kernels, updates to core database applications/libraries, web applications, libraries, etc.

Special Considerations

While this may not be a complete list, here are some special things that must be taken into account before rebooting certain systems:


Disable builders

Before the following machines are rebooted, all koji builders should be disabled and all running jobs allowed to complete:
• db04
• nfs01
• kojipkgs02
Builders can be removed from koji, updated and re-added. Use:

koji disable-host NAME

and:

koji enable-host NAME

Note: you must be a koji admin

Additionally, rel-eng and builder boxes may need a special version of rpm. Make sure to check with rel-eng on any rpm upgrades for them.

Post reboot action

The following machines require post-boot actions (mostly entering passphrases). Make sure admins that have the passphrases are on hand for the reboot:
• backup-2 (LUKS passphrase on boot)
• sign-vault01 (NSS passphrase for sigul service)
• sign-bridge01 (NSS passphrase for sigul bridge service)
• serverbeach* (requires fixing firewall rules): Each serverbeach host needs 3 or 4 iptables rules added anytime it's rebooted or libvirt is upgraded:

iptables -I FORWARD -o virbr0 -j ACCEPT
iptables -I FORWARD -i virbr0 -j ACCEPT
iptables -t nat -I POSTROUTING -s 192.168.122.3/32 -j SNAT --to-source 66.135.62.187

Note: The source is the internal guest ips, the to-source is the external ips that map to that guest ip. If there are multiple guests, each one needs the above SNAT rule inserted.

Schedule autoqa01 reboot

There is currently an autoqa01.c host on cnode01. Check with QA folks before rebooting this guest/host.


Bastion01 and Bastion02 and openvpn server

We need one of the bastion machines to be up to provide openvpn for all machines. Before rebooting bastion02, modify the manifests/nodes/bastion0*.phx2.fedoraproject.org.pp files to start the openvpn server on bastion01, wait for all clients to re-connect, reboot bastion02 and then revert back to it as the openvpn hub.

Special yum directives

Sometimes we will wish to exclude or otherwise modify the yum.conf on a machine. For this purpose, all machines have an include, making them read http://infrastructure.fedoraproject.org/infra/hosts/FQHN/yum.conf.include from the infrastructure repo. If you need to make such changes, add them to the infrastructure repo before doing updates.
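For example, a hypothetical yum.conf.include that holds a host's kernel back during the update window (the exclude line is standard yum configuration syntax):

exclude=kernel*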

Update Leader

Each update should have a Leader appointed. This person will be in charge of doing any read-write operations and delegating to others to do tasks. If you aren't specifically asked by the Leader to reboot or change something, please don't. The Leader will assign out machine groups to reboot, or ask specific people to look at machines that didn't come back up from reboot or aren't working right after reboot. It's important to avoid multiple people operating on a single machine in a read-write manner and interfering with changes.

Group A reboots

Group A machines are end user critical ones. Outages here should be planned at least a week in advance and announced to the announce list. List of machines currently in A group (note: this is going to be automated). These hosts are grouped based on the virt host they reside on:
• torrent02.fedoraproject.org
• ibiblio02.fedoraproject.org
• people03.fedoraproject.org
• ibiblio03.fedoraproject.org
• collab01.fedoraproject.org
• serverbeach09.fedoraproject.org
• db05.phx2.fedoraproject.org
• virthost03.phx2.fedoraproject.org
• db01.phx2.fedoraproject.org
• virthost04.phx2.fedoraproject.org
• db-fas01.phx2.fedoraproject.org
• proxy01.phx2.fedoraproject.org
• virthost05.phx2.fedoraproject.org
• ask01.phx2.fedoraproject.org
• virthost06.phx2.fedoraproject.org


These are the rest:
• bapp02.phx2.fedoraproject.org
• bastion02.phx2.fedoraproject.org
• app05.fedoraproject.org
• backup02.fedoraproject.org
• bastion01.phx2.fedoraproject.org
• fas01.phx2.fedoraproject.org
• fas02.phx2.fedoraproject.org
• log02.phx2.fedoraproject.org
• memcached03.phx2.fedoraproject.org
• noc01.phx2.fedoraproject.org
• ns02.fedoraproject.org
• ns04.phx2.fedoraproject.org
• proxy04.fedoraproject.org
• smtp-mm03.fedoraproject.org
• batcave02.phx2.fedoraproject.org
• mm3test.fedoraproject.org
• packages02.phx2.fedoraproject.org

Group B reboots

This group contains machines that contributors use. Announcements of outages here should be at least a week in advance and sent to the devel-announce list. These hosts are grouped based on the virt host they reside on:
• db04.phx2.fedoraproject.org
• bvirthost01.phx2.fedoraproject.org
• nfs01.phx2.fedoraproject.org
• bvirthost02.phx2.fedoraproject.org
• pkgs01.phx2.fedoraproject.org
• bvirthost03.phx2.fedoraproject.org
• kojipkgs02.phx2.fedoraproject.org
• bvirthost04.phx2.fedoraproject.org
These are the rest:
• koji04.phx2.fedoraproject.org
• releng03.phx2.fedoraproject.org
• releng04.phx2.fedoraproject.org


Group C reboots

Group C are machines that infrastructure uses, or can be rebooted in such a way as to continue to provide services to others via multiple machines. Outages here should be announced on the infrastructure list. Group C hosts that have proxy servers on them:
• proxy02.fedoraproject.org
• ns05.fedoraproject.org
• hosted-lists01.fedoraproject.org
• internetx01.fedoraproject.org
• app01.dev.fedoraproject.org
• darkserver01.dev.fedoraproject.org
• fakefas01.fedoraproject.org
• proxy06.fedoraproject.org
• osuosl01.fedoraproject.org
• proxy07.fedoraproject.org
• bodhost01.fedoraproject.org
• proxy03.fedoraproject.org
• smtp-mm02.fedoraproject.org
• tummy01.fedoraproject.org
• app06.fedoraproject.org
• noc02.fedoraproject.org
• proxy05.fedoraproject.org
• smtp-mm01.fedoraproject.org
• telia01.fedoraproject.org
• app08.fedoraproject.org
• proxy08.fedoraproject.org
• coloamer01.fedoraproject.org
Other Group C hosts:
• ask01.stg.phx2.fedoraproject.org
• app02.stg.phx2.fedoraproject.org
• proxy01.stg.phx2.fedoraproject.org
• releng01.stg.phx2.fedoraproject.org
• value01.stg.phx2.fedoraproject.org
• virthost13.phx2.fedoraproject.org
• db-fas01.stg.phx2.fedoraproject.org
• pkgs01.stg.phx2.fedoraproject.org
• packages01.stg.phx2.fedoraproject.org


• virthost11.phx2.fedoraproject.org
• app01.stg.phx2.fedoraproject.org
• koji01.stg.phx2.fedoraproject.org
• db02.stg.phx2.fedoraproject.org
• fas01.stg.phx2.fedoraproject.org
• virthost10.phx2.fedoraproject.org
• autoqa01.qa.fedoraproject.org
• autoqa-stg01.qa.fedoraproject.org
• bastion-comm01.qa.fedoraproject.org
• batcave-comm01.qa.fedoraproject.org
• virthost-comm01.qa.fedoraproject.org
• compose-x86-01.phx2.fedoraproject.org
• compose-x86-02.phx2.fedoraproject.org
• download01.phx2.fedoraproject.org
• download02.phx2.fedoraproject.org
• download03.phx2.fedoraproject.org
• download04.phx2.fedoraproject.org
• download05.phx2.fedoraproject.org
• download-rdu01.vpn.fedoraproject.org
• download-rdu02.vpn.fedoraproject.org
• download-rdu03.vpn.fedoraproject.org
• fas03.phx2.fedoraproject.org
• secondary01.phx2.fedoraproject.org
• memcached04.phx2.fedoraproject.org
• virthost01.phx2.fedoraproject.org
• app02.phx2.fedoraproject.org
• value03.phx2.fedoraproject.org
• virthost07.phx2.fedoraproject.org
• app03.phx2.fedoraproject.org
• value04.phx2.fedoraproject.org
• ns03.phx2.fedoraproject.org
• darkserver01.phx2.fedoraproject.org
• virthost08.phx2.fedoraproject.org
• app04.phx2.fedoraproject.org
• packages02.phx2.fedoraproject.org
• virthost09.phx2.fedoraproject.org


• hosted03.fedoraproject.org
• serverbeach06.fedoraproject.org
• hosted04.fedoraproject.org
• serverbeach07.fedoraproject.org
• collab02.fedoraproject.org
• serverbeach08.fedoraproject.org
• dhcp01.phx2.fedoraproject.org
• relepel01.phx2.fedoraproject.org
• sign-bridge02.phx2.fedoraproject.org
• koji03.phx2.fedoraproject.org
• bvirthost05.phx2.fedoraproject.org (disable each builder in turn, update and reenable)
• ppc11.phx2.fedoraproject.org
• ppc12.phx2.fedoraproject.org
• backup03

Doing the upgrade

If possible, system upgrades should be done in advance of the reboot (with relevant testing of new packages on staging). To do the upgrades, make sure that the Infrastructure RHEL repo is updated as necessary to pull in the new packages (Infrastructure Yum Repo SOP). On batcave01, as root run:

func-yum [--host=hostname] update

Note: --host can be specified multiple times and takes wildcards. Ping people as necessary if you are unsure about any packages.

Additionally, you can see which machines still need to be rebooted with:

sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py | grep yes

You can also see which machines would need a reboot if updates were all applied with:

sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py after-updates | grep yes

Doing the reboot

In the order determined above, reboots will usually be grouped by the virtualization hosts that the servers are on. You can see the guests per virt host on batcave01 in /var/log/virthost-lists.out. To reboot sets of boxes based on which virthost they are on, we've written a special script which facilitates it:

func-vhost-reboot virthost-fqdn

ex: sudo func-vhost-reboot virthost13.phx2.fedoraproject.org

Aftermath

1. Make sure that everything's running fine
2. Reenable nagios notification as needed
3. Make sure to perform any manual post-boot setup (such as entering passphrases for encrypted volumes)
4. Close outage ticket.

Non virthost reboots:

If you need to reboot specific hosts and make sure they recover, consider using:

sudo func-host-reboot hostname hostname1 hostname2 ...

If you want to reboot the hosts one at a time, waiting for each to come back before rebooting the next, pass -o to func-host-reboot.

Master Mirror Infrastructure SOP

Contents

1. Contact Information
2. PHX Master Mirror Setup
3. RDU I2 Master Mirror Setup
4. Raising Issues

Contact Information

Owner: Red Hat IS
Contact: #fedora-admin, Red Hat ticket
Location: PHX
Servers: server[1-5].download.phx.redhat.com
Purpose: Provides the master mirrors for Fedora distribution

PHX Master Mirror Setup

The master mirrors are accessible as:


download1.fedora.redhat.com -> CNAME to download3.fedora.redhat.com
download2.fedora.redhat.com -> currently no DNS entry
download3.fedora.redhat.com -> 209.132.176.20
download4.fedora.redhat.com -> 209.132.176.220
download5.fedora.redhat.com -> 209.132.176.221

from the outside. download.fedora.redhat.com is a round robin to the above IPs. The external IPs correspond to internal load balancer IPs that balance between server[1-5]:

209.132.176.20 -> 10.9.24.20
209.132.176.220 -> 10.9.24.220
209.132.176.221 -> 10.9.24.221

The load balancers then balance between the below Fedora IPs on the rsync servers:

10.8.24.21 (fedora1.download.phx.redhat.com) - server1.download.phx.redhat.com
10.8.24.22 (fedora2.download.phx.redhat.com) - server2.download.phx.redhat.com
10.8.24.23 (fedora3.download.phx.redhat.com) - server3.download.phx.redhat.com
10.8.24.24 (fedora4.download.phx.redhat.com) - server4.download.phx.redhat.com
10.8.24.25 (fedora5.download.phx.redhat.com) - server5.download.phx.redhat.com

RDU I2 Master Mirror Setup

Note: This section is awaiting confirmation from RH - information here may not be 100% accurate yet. download-i2.fedora.redhat.com (rhm-i2.redhat.com) is a round robin between:

204.85.14.3 - 10.11.45.3
204.85.14.5 - 10.11.45.5

Raising Issues

Issues with any of this setup should be raised in a helpdesk ticket.

Module Build Service Infra SOP

The MBS is a build orchestrator on top of Koji for “modules”. https://fedoraproject.org/wiki/Changes/ModuleBuildService

Contact Information

Owner: Release Engineering Team, Infrastructure Team
Contact: #fedora-modularity, #fedora-admin, #fedora-releng
Persons: jkaluza, fivaldi, breilly, mikem
Location: Phoenix
Public addresses:
• mbs.fedoraproject.org
Servers:
• mbs-frontend0[1-2].phx2.fedoraproject.org
• mbs-backend01.phx2.fedoraproject.org
Purpose: Build modules for Fedora.

Description

Users submit builds to mbs.fedoraproject.org referencing their modulemd file in dist-git. (In the future, users will not submit their own module builds. The freshmaker daemon (running in infrastructure) will watch for .spec file changes and modulemd.yaml file changes – it will submit the relevant module builds to the MBS on behalf of users.)

The request to build a module is received by the MBS flask app running on the mbs-frontend nodes. Cursory validation of the submitted modulemd is performed on the frontend: are the named packages valid? Are their branches valid? The MBS keeps a copy of the modulemd and appends additional data describing which branches pointed to which hashes at the time of submission.

A fedmsg from the frontend triggers the backend to start building the module. First, tags and build/srpm-build groups are created. Then, a module-build-macros package is synthesized and submitted as an srpm build. When it is complete and available in the buildroot, the rest of the rpm builds are submitted. These are grouped and limited in two ways:
• First, there is a global NUM_CONCURRENT_BUILDS config option that controls how many koji builds the MBS is allowed to have open at any time. It serves as a throttle.
• Second, a given module may specify that its components should have a certain "build order". If there are 50 components, it may say that the first 25 of them are in one buildorder batch, and the second 25 are in another buildorder batch. The first batch will be submitted and, when complete, tagged back into the buildroot. Only after they are available will the second batch of 25 begin.

When the last component is complete, the MBS backend marks the build as "done", and then marks it again as "ready". (There is currently no meaning to the "ready" state beyond "done". We reserved that state for future CI interactions.)
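For a quick look at an individual module build without the CLI, the frontend REST API can also be queried directly; a hedged example (the path follows upstream MBS conventions and should be verified against the deployed version):

curl -s https://mbs.fedoraproject.org/module-build-service/1/module-builds/569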

Observing MBS Behavior

The mbs-build command

The fm-orchestrator repo and the module-build-service package provide an mbs-build command with a few subcommands. For general help:

$ mbs-build --help

To generate a report of all currently active module builds:

$ mbs-build overview
ID    State  Submitted             Components  Owner    Module
----  -----  --------------------  ----------  -------  -----------------------------------
570   build  2017-06-01T17:18:11Z  35/134      psabata  shared-userspace-f26-20170601141014
569   build  2017-06-01T14:18:04Z  14/15       mkocka   mariadb-f26-20170601141728


To generate a report of an individual module build, given its ID:

$ mbs-build info 569
NVR                                              State     Koji Task
-----------------------------------------------  --------  -------------------------------------------------------------
libaio-0.3.110-7.module_414736cc                 COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19803741
                                                 BUILDING  https://koji.fedoraproject.org/koji/taskinfo?taskID=19804081
libedit-3.1-17.20160618cvs.module_414736cc       COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19803745
compat-openssl10-1.0.2j-6.module_414736cc        COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19803746
policycoreutils-2.6-5.module_414736cc            COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19803513
selinux-policy-3.13.1-255.module_414736cc        COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19803748
systemtap-3.1-5.module_414736cc                  COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19803742
libcgroup-0.41-11.module_ea91dfb0                COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19685834
net-tools-2.0-0.42.20160912git.module_414736cc   COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19804010
time-1.7-52.module_414736cc                      COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19803747
desktop-file-utils-0.23-3.module_ea91dfb0        COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19685835
libselinux-2.6-6.module_ea91dfb0                 COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19685833
module-build-macros-0.1-1.module_414736cc        COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19803333
checkpolicy-2.6-1.module_414736cc                COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19803514
dbus-glib-0.108-2.module_ea91dfb0                COMPLETE  https://koji.fedoraproject.org/koji/taskinfo?taskID=19685836

To actively watch a module build in flight, given its ID:

$ mbs-build watch 570
Still building:
  libXrender https://koji.fedoraproject.org/koji/taskinfo?taskID=19804885
  libXdamage https://koji.fedoraproject.org/koji/taskinfo?taskID=19805153
Failed:
  libXxf86vm https://koji.fedoraproject.org/koji/taskinfo?taskID=19804903

Summary:
  2 components in the BUILDING state
  34 components in the COMPLETE state
  1 components in the FAILED state
  97 components in the undefined state
psabata's build #570 of shared-userspace-f26 is in the "build" state

The releng repo

There are more tools located in the scripts/mbs/ directory of the releng repo: https://pagure.io/releng/blob/master/f/scripts/mbs


Cancelling a module build

Users can cancel their own module builds with:

$ mbs-build cancel $BUILD_ID

MBS admins can also cancel builds of any user.

Note: MBS admins are defined as members of the groups listed in the ADMIN_GROUPS configuration option in roles/mbs/common/templates/config.py.

Logs

The frontend logs are on mbs-frontend0[1-2] in /var/log/httpd/error_log. The backend logs are on mbs-backend01. Look in the journal for the fedmsg-hub service.

Upgrading

The package in question is module-build-service. Please use the playbooks/manual/upgrade/mbs.yml playbook.

Managing Bootstrap Modules

In general, modules use other modules to define their buildroots, but what defines the buildroot of the very first module? For this, we use "bootstrap" modules which are manually selected. For some history on this, see these tickets:
• https://pagure.io/releng/issue/6791
• https://pagure.io/fedora-infrastructure/issue/6097
The tag for a bootstrap module needs to be manually created and populated by Release Engineering. Builds for that tag are curated and selected from other Fedora tags, with care to ensure that only as many builds are added as needed. The existence of the tag is not enough for the bootstrap module to be usable by MBS. MBS discovers the bootstrap module as a possible dependency for other yet-to-be-built modules by querying PDC. During normal operation, these entries in PDC are automatically created by pdc-updater on pdc-backend02, but for the bootstrap tag they need to be manually created and linked to the new bootstrap tag. The fm-orchestrator repo has a bootstrap/ directory with tools that we used to create the first bootstrap entries. If you need to create a new bootstrap entry or modify an existing one, use these tools for inspiration. They are not general purpose and will likely have to be modified to do what is needed. In particular, see import-to-pdc.py as an example of creating a new entry and activate-in-pdc.py for an example of editing an existing entry. To be usable, you'll need a token with rights to speak to staging/prod PDC. See the PDC SOP for information on client configuration in /etc/pdc.d/ and on where to find those tokens.

Things that could go wrong

Overloading koji

If koji is overloaded, it should be acceptable to stop the fedmsg-hub daemon on mbs-backend01 at any time.


Note: As builds finish in koji, they will be missed by the backend, but when it restarts it should find them in datagrepper. If that fails as well, the mbs backend has a poller which should start up ~5 minutes after startup that checks koji for anything it may have missed, at which point it will resume functioning.

If koji continues to be overloaded after startup, try decreasing the NUM_CONCURRENT_BUILDS option in the config file in roles/mbs/common/templates/.
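As a minimal sketch (the value shown is illustrative, not the production default), the throttle lives in roles/mbs/common/templates/config.py as something like:

# illustrative only: lower this to reduce load on koji
NUM_CONCURRENT_BUILDS = 2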

Memcached Infrastructure SOP

Our memcached setup is currently only used for wiki sessions. With mediawiki, sessions stored in files over NFS or in the DB are very slow. Memcached is a non-blocking solution for our session storage.

Contents

1. Contact Information
2. Checking Status
3. Flushing Memcached
4. Restarting Memcached
5. Configuring Memcached

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, sysadmin-web groups
Location: PHX
Servers: memcached03, memcached04
Purpose: Provide caching for Fedora web applications.

Checking Status

Our memcached instances are currently firewalled to only allow access from wiki application servers. To check the status of an instance, use:

echo stats | nc memcached0{3,4} 11211

from an allowed host.

Flushing Memcached

Sometimes, wrong contents get cached, and the cache should be flushed. To do this, use:

echo flush_all | nc memcached0{3,4} 11211

from an allowed host.


Restarting Memcached

Note that restarting a memcached instance will drop all sessions stored on that instance. As mediawiki uses hashing to distribute sessions across multiple instances, restarting one out of two instances will result in about half of the total sessions being dropped. To restart memcached:

sudo /etc/init.d/memcached restart

Configuring Memcached

Memcached is currently set up as a role in the ansible git repo. The two main tunables are MAXCONN (the maximum number of concurrent connections) and CACHESIZE (the amount of memory to use for storage). These variables can be set through $memcached_maxconn and $memcached_cachesize in ansible. Additionally, other options (as described in the memcached manpage) can be set via $memcached_options.
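A minimal sketch of what those variables could look like in an ansible group_vars file (values are illustrative, not the production settings):

# illustrative group_vars entries for the memcached hosts
memcached_maxconn: 1024        # maximum number of concurrent connections
memcached_cachesize: 4096      # amount of memory to use for storage
memcached_options: ""          # any extra options from the memcached manpage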

Message Tagging Service SOP

Contact Information

Owner: Factory2 Team, Fedora QA Team, Infrastructure Team
Contact: #fedora-qa, #fedora-admin
Persons: cqi, lucarval, vmaljulin
Location: Phoenix
Servers:
• In OpenShift.
Purpose: Tag module build

Description

Message Tagging Service, aka MTS, is an event-driven microservice that tags a module build when triggered by a specific MBS event. MTS basically listens on the message bus for the MBS event mbs.build.state.change. Once a message is received, the module build represented by that message is checked against a set of predefined rules. Each rule definition has a destination tag defined. If a rule matches the build, the destination tag is applied to that build. Only module builds in the ready state are handled by MTS for now.

Observing Behavior

Log in to os-master01.phx2.fedoraproject.org as root (or authenticate remotely with openshift using oc login https://os.fedoraproject.org), and run:

oc project mts
oc status -v
oc logs -f dc/mts


Database

MTS does not use a database.

Configuration

Please do remember to increase MTS_CONFIG_VERSION so that OpenShift creates a new pod after running the playbook.

Deployment

You can roll out configuration changes by changing the files in roles/openshift-apps/message-tagging-service/ and running the playbooks/openshift-apps/message-tagging-service.yml playbook.

Stage

The MTS docker image is built automatically and pushed to upstream quay.io. By default, the tag latest is applied to a fresh image. Apply the tag stg to the image, then run the playbook playbooks/openshift-apps/message-tagging-service.yml with environment staging.

Prod

If everything works well, apply the tag prod to the docker image in quay.io, then run the playbook with environment prod.
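A hedged sketch of the two rollouts (the playbook path is the one named above; using -l to select the environment is an assumption about how the playbook is invoked here, so check current practice first):

# staging: after the image in quay.io has been tagged 'stg'
sudo rbac-playbook openshift-apps/message-tagging-service.yml -l staging

# prod: after the image has been tagged 'prod'
sudo rbac-playbook openshift-apps/message-tagging-service.yml -l production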

Update Rules

The rules file is managed alongside the playbook role in the same repository. For detailed information on the rules format, please refer to the documentation under Modularity.

Troubleshooting

In case of problems with MTS, check the logs:

oc logs -f dc/mts

Mirror hiding Infrastructure SOP

At times, such as release day, there may be a conflict between Red Hat trying to release content for RHEL, and Fedora trying to release Fedora. One way to limit the pain to Red Hat on release day is to hide download.fedora.redhat.com from the publiclist and mirrorlist redirector, which will keep most people from downloading the content from Red Hat directly.


Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, sysadmin-web group
Location: Phoenix
Servers: app3, app4
Purpose: Hide Public Mirrors from the publiclist / mirrorlist redirector

Description

To hide a public mirror, so it doesn't appear on the publiclist or the mirrorlist, simply go into the MirrorManager administrative web user interface, at https://admin.fedoraproject.org/mirrormanager. Fedora sysadmins can see all Sites and Hosts. For each Site and Host, there is a checkbox marked "private", which if set, will hide that Site (and all its Hosts), or just that single Host, such that it won't appear on the public lists. To make a private-marked mirror public, simply clear the "private" checkbox again. This change takes effect at the top of each hour.

MirrorManager Infrastructure SOP

MirrorManager manages mirrors for the Fedora distribution.

Contents

• MirrorManager Infrastructure SOP
  – Contact Information
  – Description
  – Release Preparation
  – One Week After a Release
  – Move to Archive
  – mirrorlist containers and mirrorlist servers
  – Troubleshooting and Resolution
    * Regenerating the Publiclist
    * Updating the mirrorlist containers
    * Debugging problems with mirrorlist container startup
    * General debugging for mirrorlist containers

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, sysadmin-web
Location: Phoenix
Servers: mm-frontend01, mm-frontend02, mm-frontend-checkin01, mm-backend01, mm-crawler01, mm-crawler02
Mirrorlist Servers: Docker container on the proxy servers
Purpose: Manage mirrors for Fedora distribution

Description

MirrorManager handles our mirroring system. It keeps track of lists of valid mirrors and handles handing out metalink URLs to end users to download packages from.

The backend server (mm-backend01) scans the master mirror (NFS mounted at /srv) using the mm2_update-master-directory-list script (umdl) for changes. Changed directories are detected by comparing the ctime to the value in the database.

The two crawlers (mm-crawler01 and mm-crawler02) compare the content on the mirrors with the results from umdl using RSYNC, HTTP, HTTPS. The crawler process starts at 0:00 and 12:00 on mm-crawler01 and at 2:00 and 14:00 on mm-crawler02. If the content on the mirrors is the same as on the master, those mirrors are included in the dynamic metalink/mirrorlist. Every hour the backend server generates a python pickle which contains the information about the state of each mirror. This pickle file is used by the mirrorlist containers on the proxy servers to dynamically generate the metalink/mirrorlist for each client individually.

The frontend servers (mm-frontend01 and mm-frontend02) offer an interface to manipulate the mirrors. Each mirror admin can only change the details of the associated mirror. Members of the FAS group sysadmin-web can see and change all existing mirrors. The mirrorlist provided by the frontend servers has no actively consumed content and is therefore heavily cached (12h). It is only used to give an overview of existing mirrors. Additionally the frontend servers provide:
• an overview of the mirror list usage https://admin.fedoraproject.org/mirrormanager/statistics
• a propagation overview https://admin.fedoraproject.org/mirrormanager/propgation
• a mirror map https://admin.fedoraproject.org/mirrormanager/maps

The mm-frontend-checkin01 server is only used for report_mirror check-ins. This is used by mirrors to report their status independent of the crawlers.

Release Preparation

MirrorManager should automatically detect the new release version, and will create a new Version() object in the database. This is visible on the Version page in the web UI, and on https://admin.fedoraproject.org/mirrormanager/. If the versioning scheme changes, it’s possible this will fail. If so, contact the Mirror Wrangler.

One Week After a Release

In the first week after the release, MirrorManager still uses the files at fedora/linux/development/ and not at fedora/linux/releases/. Once enough mirrors have picked up the files in the release directory, the following script (on mm-backend01) can be used to change the paths in MirrorManager:


sudo -u mirrormanager mm2_move-devel-to-release --version=26 --category="Fedora Linux"
sudo -u mirrormanager mm2_move-devel-to-release --version=26 --category="Fedora Secondary Arches"

Move to Archive

Once the files of an EOL release have been copied to the archive directory tree and enough mirrors have picked the files up at the archive location, there is also a script to adapt those paths in MirrorManager's database:

sudo -u mirrormanager mm2_move-to-archive --originalCategory='Fedora EPEL' --directoryRe='/4/'

mirrorlist containers and mirrorlist servers

At :55 after each hour, mm-backend01 generates a pkl file with all the current mirrormanager information in it and syncs it to proxies and mirrorlist-servers. Each proxy accepts requests to mirrors.fedoraproject.org on apache, then uses haproxy to determine what backend will reply. There are 2 containers defined on each proxy: mirrorlist1 and mirrorlist2. haproxy will look for those first, then fall back to any of the mirrorlist servers defined over the vpn. At :15 after the hour, a script runs on all proxies: /usr/local/bin/restart-mirrorlist-containers. This script starts up the mirrorlist2 container, makes sure it can process requests and then, if so, restarts the mirrorlist1 container with the new pkl data. If not, mirrorlist1 keeps running with the old data. During this process at least one server (with mirrorlist servers as backup) is processing requests, so users see no issues. mirrorlist-containers log to /var/log/mirrormanager/mirrorlist{1|2}/ on the host proxy server.

Troubleshooting and Resolution

Regenerating the Publiclist

On mm-backend01:

sudo -u mirrormanager /usr/bin/mm2_update-mirrorlist-server
sudo -u mirrormanager /usr/local/bin/sync_pkl_to_mirrorlists.sh

Those two commands generate a new mirrorlist pickle and transfer it to the proxies. The mirrorlist containers on the proxies are restarted 15 minutes after each full hour. The mirrorlist generation can take up to 20 minutes. If a faster solution is required, the mirrorlist pickle from the previous run is available at:

/var/lib/mirrormanager/old/mirrorlist_cache.pkl

Updating the mirrorlist containers

The container used for mirrorlists is the mirrormanager2-mirrorlist container in Fedora dist git: https://src.fedoraproject.org/cgit/container/mirrormanager2-mirrorlist.git/ The one being used is defined in an ansible variable in roles/mirrormanager/mirrorlist_proxy/defaults/main.yml and in turn used in the systemd unit files for mirrorlist1 and mirrorlist2. To update the container used, update this variable, run the playbook and then restart the mirrorlist1 and mirrorlist2 containers on each proxy. Note that this may take a while the first time as the image has to be downloaded from our registry.
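For illustration (the variable name and tag below are hypothetical; check the role's defaults/main.yml for the real ones), the setting looks something like:

# hypothetical excerpt from roles/mirrormanager/mirrorlist_proxy/defaults/main.yml
mirrorlist_container_tag: "mirrormanager2-mirrorlist:2.0-3"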


Debugging problems with mirrorlist container startup

Sometimes on boot some hosts won't be properly serving mirrorlists. This is due to a container startup issue. Run 'docker ps -a' as root to see the active containers. It will usually say something like 'exited(1)' or the like. Record the container id and then run 'docker rm --force <container-id>'. Then run 'docker ps -a' and confirm nothing shows. Then run 'systemctl start mirrorlist1' and it should correctly start mirrorlist1.

General debugging for mirrorlist containers

docker commands like 'docker ps -a' show a fair bit of information. Also, systemctl status mirrorlist1/2 or the journal should have information when a container is failing.

AWS Mirrors

Fedora Infrastructure mirrors EPEL content (/pub/epel) into Amazon Simple Storage Service (S3) in multiple regions, to make it fast for EC2 CentOS/RHEL users to get EPEL content from an effectively local mirror. For this to work, we have private mirror entries in MirrorManager, one for each region, which include the EC2 netblocks for that region. Amazon updates their list of network blocks roughly monthly, as they consume additional address space. Therefore, we need to make the corresponding changes to MirrorManager's entries for same. Amazon publishes their list of network blocks on their forum site, with the subject "Announcement: Amazon EC2 Public IP Ranges". As of November 2014, this was https://forums.aws.amazon.com/ann.jspa?annID=1701. As of November 19, 2014, Amazon publishes it as a JSON file we can download: http://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

mote SOP

mote is a MeetBot log wrangler, providing a user-friendly interface for viewing logs produced by Fedora's IRC meetings. Production instance: http://meetbot.fedoraproject.org/ Staging instance: http://meetbot.stg.fedoraproject.org

Contents

1. Contact information
2. Deployment
3. Description
4. Configuration
5. Database
6. Managing mote
7. Suspending mote operation
8. Changing mote's name and category definitions


Contact Information

Owner: cydrobolt
Contact: #fedora-admin
Location: Fedora Infrastructure
Purpose: IRC meeting coordination

Deployment

If you have access to rbac-playbook:

sudo rbac-playbook groups/value.yml

Forcing Reload

There is a playbook that can force mote to update its cache in case it gets stuck somehow:

sudo rbac-playbook manual/rebuild/mote.yml

Doing Upgrades

Put a new copy of the mote rpm in the infra repo and run:

sudo rbac-playbook manual/upgrade/mote.yml

Description

mote is a Python webapp running on Flask with mod_wsgi. It can be used to view past logs, browse meeting minutes, or glean other information relevant to Fedora's IRC meetings. It employs a JSON file store cache, in addition to a memcached store which is currently not in use with Fedora infrastructure.

Configuration

mote configuration is located in /etc/mote/config.py. The configuration contains all configurable items for all mote services. Alterations to configuration that aren't temporary should be done through ansible playbooks. Configuration changes have no effect on running services – they need to be restarted, which can be done using the playbook.

Database

mote does not currently utilise any databases, although it uses a file store in Fedora Infrastructure and has an optional memcached store which is currently unused.

Managing mote

mote is run using mod_wsgi and httpd; hence, you must manage the httpd service to change mote's status.


Suspending mote operation

mote can be stopped by stopping the httpd service:

service httpd stop

Changing mote's name and category definitions

mote uses a set of JSON name and category definitions to provide friendly names, aliases, and listings on its interface. These definitions can be located in mote's GitHub repository, and need to be pulled into ansible in order to be deployed. These files are name_mappings.json and category_mappings.json. To deploy an update to these definitions, place the updated name and category mapping files in ansible/roles/mote/templates. Run the playbook in order to deploy your changes.

Fedora Infrastructure Nagios

Contact Information

Owner: sysadmin-main, sysadmin-noc
Contact: #fedora-admin, #fedora-noc
Location: Anywhere
Servers: noc01, noc02, noc01.stg, batcave01
Purpose: This SOP is to describe nagios configurations

Configuration

Fedora Project runs two nagios instances: nagios (noc01) at https://admin.fedoraproject.org/nagios and nagios-external (noc02) at https://nagios-external.fedoraproject.org/nagios; you must be in the 'sysadmin' group to access them. Apart from the two production instances, we are currently running a staging instance for testing purposes, available through SSH at noc01.stg.

nagios (noc01)

The nagios configuration on noc01 should only monitor general host statistics: ansible status, uptime, apache status (up/down), SSH etc. The configurations are found in the nagios ansible module: ansible/roles/nagios

nagios-external (noc02)

The nagios configuration on noc02 is located outside of our main datacenter and should monitor our user websites/applications (fedoraproject.org, FAS, PackageDB, Bodhi/Updates). The configurations are found in the nagios ansible role: roles/nagios

Note: Production and staging instances through SSH: Please make sure you are in the 'sysadmin' and 'sysadmin-noc' FAS groups before trying to access these hosts. See SSH Access SOP


NRPE

We are currently using NRPE to execute remote Nagios plugins on any host of our network. A great guide about it and its usage, along with some nice images of its structure, can be found at: https://assets.nagios.com/downloads/nagioscore/docs/nrpe/NRPE.pdf
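As a generic illustration of how NRPE fits together (the command name and plugin paths below are standard NRPE examples, not taken from our nagios roles):

# on the monitored host, an nrpe.cfg command definition:
command[check_disk_root]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /

# on the nagios server, the corresponding remote check:
/usr/lib64/nagios/plugins/check_nrpe -H somehost.example.org -c check_disk_root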

Understanding the Messages

General:

Nagios notifications are generally easy to read, and follow this consistent format:

** PROBLEM/ACKNOWLEDGEMENT/RECOVERY alert - hostname/Check is WARNING/CRITICAL/OK **
** HOST DOWN/UP alert - hostname **

Reading the message will provide extra information on what is wrong.

Disk Space Warning/Critical:

Disk space warnings normally include the following information:

DISK WARNING/CRITICAL/OK - free space: mountpoint freespace(MB) (freespace(%) inode=freeinodes(%)):

A message stating "(1% inode=99%)" means that the disk space is critical, not the inode usage, and is a sign that more disk space is required.

Further Reading

• Ansible SOP
• Outages SOP

Netapp Infrastructure SOP

Provides primary mirrors and additional storage in PHX2

Contents

1. Contact Information
2. Description
3. Public Mirrors
   1. Snapshots
4. PHX NFS Storage
   1. Access
   2. Snapshots
5. iscsi
   1. Updating LVM
   2. Mounting ISCSI

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, releng
Location: Phoenix, Tampa Bay, Raleigh
Servers: batcave01, virt servers, application servers, builders, releng boxes
Purpose: Provides primary mirrors and additional storage in PHX2

Description

At present we have three netapps in our infrastructure: one each in TPA, RDU and PHX. For purposes of visualization it's easiest to think of us as having 4 netapps: 1 TPA, 1 RDU and 1 PHX for public mirrors, and an additional 1 in PHX used for additional storage not related to the public mirrors.

Public Mirrors

The netapps are our primary public mirrors. The canonical location for the mirrors is currently in PHX. From there it gets synced to RDU and TPA.

Snapshots

Snapshots on the PHX netapp are taken hourly. Unfortunately, the way it is set up, only Red Hat employees can access this mirror (this is scheduled to change when PHX becomes the canonical location, but that will take time to set up and deploy). The snapshots are available, for example, on wallace in:

/var/ftp/download.fedora.redhat.com/.snapshot/hourly.0

PHX NFS Storage

There is a great deal of storage in PHX over NFS from the netapp there. This storage includes the public mirror. The majority of this storage is koji, however there are a few gig worth of storage that goes to wiki attachments and other storage needs we have in PHX. You can access all of the nfs shares at:

batcave01:/mnt/fedora

or:

ntap-fedora-a.storage.phx2.redhat.com:/vol/fedora/


Access

The netapp is provided by RHIS and as a result they also control access. Access is controlled mostly by IP, and some machines have root squashed. Worst case scenario: if batcave01 is not accessible, just bring another box up under its IP address and use that for an emergency.

Snapshots

There are hourly and nightly snapshots on the netapp. They are available in:

batcave01:/mnt/fedora/.snapshot

iscsi

We have iscsi deployed in a number of locations in our infrastructure for xen machines. To get a list of what xen machines are deployed with iscsi, just run lvs:

lvs /dev/xenGuests

Live migration is possible though not fully supported at this time. Please shut a xen machine down and bring it up on another host. Memory is the main issue here.

Updating LVM

iscsi is mounted all over the place and if one xen machine creates a logical volume the other xen machines will have to pick up those changes. To do this run:

pvscan
vgscan
lvscan
vgchange -a y

Mounting ISCSI

On reboots sometimes the iscsi share is not remounted. This should be automated in the future but for now run:

iscsiadm -m discovery -t st -p ntap-fedora-b.storage.phx2.redhat.com:3260
sleep 1
iscsiadm -m node -T iqn.1992-08.com.netapp:sn.118047036 -p 10.5.88.21:3260 -l
sleep 1
pvscan
vgscan
lvscan
vgchange -a y

DNS Host Addition SOP

You should be able to follow these steps in order to create a new set of hosts in infrastructure.


Walkthrough

Get a DNS repo checkout on batcave01

git clone /srv/git/dns
cd dns

An example always helps, so you can use git grep for something that has been recently added to the data center/network that you want:

git grep badges-web01
built/126.5.10.in-addr.arpa:69 IN PTR badges-web01.stg.phx2.fedoraproject.org.
[...lots of other stuff in built/ ignore these as they'll be generated later...]
master/126.5.10.in-addr.arpa:69 IN PTR badges-web01.stg.phx2.fedoraproject.org.
master/126.5.10.in-addr.arpa:101 IN PTR badges-web01.phx2.fedoraproject.org.
master/126.5.10.in-addr.arpa:102 IN PTR badges-web02.phx2.fedoraproject.org.
master/168.192.in-addr.arpa:109.1 IN PTR badges-web01.vpn.fedoraproject.org
master/168.192.in-addr.arpa:110.1 IN PTR badges-web02.vpn.fedoraproject.org
master/phx2.fedoraproject.org:badges-web01.stg IN A 10.5.126.69
master/phx2.fedoraproject.org:badges-web01 IN A 10.5.126.101
master/phx2.fedoraproject.org:badges-web02 IN A 10.5.126.102
master/vpn.fedoraproject.org:badges-web01 IN A 192.168.1.109
master/vpn.fedoraproject.org:badges-web02 IN A 192.168.1.110

So those are the files we need to edit. In the above example, two of those files are for the host on the PHX network. The other two are for the host to be able to talk over the VPN. Although the VPN is not always needed, the common case is that the host will need it. (If any clients need to connect to it via the proxy servers, or it is not hosted in PHX2, it will need a VPN connection.) A common exception here is the staging environment: since we only have one proxy server in staging and it is in PHX2, a VPN connection is not typically needed for staging hosts. Edit the zone file for the reverse lookup first (the *in-addr.arpa file) and find ips to use. The ips will be listed with a domain name of "unused." If you're configuring a web application server, you probably want two hosts for stg and at least two for production. Two in production means that we don't need downtime for reboots and updates. Two in stg means that we'll be less likely to encounter problems related to having multiple web application servers when we take a change tested in stg into production:

-105 IN PTR unused.
-106 IN PTR unused.
-107 IN PTR unused.
-108 IN PTR unused.
+105 IN PTR elections01.stg.phx2.fedoraproject.org.
+106 IN PTR elections02.stg.phx2.fedoraproject.org.
+107 IN PTR elections01.phx2.fedoraproject.org.
+108 IN PTR elections02.phx2.fedoraproject.org.

Edit the forward domain (phx2.fedoraproject.org in our example) next:

elections01.stg IN A 10.5.126.105
elections02.stg IN A 10.5.126.106


elections01 IN A 10.5.126.107
elections02 IN A 10.5.126.108

Repeat these two steps if you need to make them available on the VPN. Note: if your stg hosts are in PHX2, you don't need to configure VPN for them as all our stg proxy servers are in PHX2. Also remember to update the Serial at the top of all zone files. Once the files are edited, you need to run a script to build the zones. But first, commit the changes you just made to the "source":

git add .
git commit -a -m 'Added staging and production elections hosts.'

Once that is committed, you need to run a script to build the zones and then push them to the dns servers:

./do-domains  # This builds the files
git add .
git commit -a -m 'done build'
git push

$ sudo -i ansible ns\* -a '/usr/local/bin/update-dns'  # This tells the dns servers to load the new files

Make certs

WARNING: If you already had a clone of private, make VERY sure to do a git pull first! It's quite likely somebody else added a new host without you noticing it, and you cannot merge the keys repos manually. (seriously, don't: the index and serial files just wouldn't match up with the certificate, and you would revoke the wrong certificate upon revocation).

When doing 2 factor auth for sudo, the hosts that we connect from need to have valid SSL Certs. These are currently stored in the private repo:

git clone /srv/git/ansible-private && chmod 0700 ansible-private
cd ansible-private/files/2fa-certs
. ./vars
./build-and-sign-key $FQDN   # ex: elections01.stg.phx2.fedoraproject.org

The $FQDN should be the phx2 domain name if it's in phx2, vpn if not in phx2, and if it has no vpn and is not in phx2 we should add it to the vpn:

git add .
git commit -a
git push

NOTE: Make sure to re-run vars from the vpn repo. If you forget to do that, you will just (try to) generate a second pair of 2fa certs, since the ./vars script creates an environment var pointing to the root key directory, which is different.

Servers that are on the VPN also need certs for that. These are also stored in the private repo:

cd ansible-private/files/vpn/openvpn
. ./vars
./build-and-sign-key $FQDN   # ex: elections01.phx2.fedoraproject.org
./build-and-sign-key $FQDN   # ex: elections02.phx2.fedoraproject.org


The $FQDN should be the phx2 domain name if it's in phx2, and just fedoraproject.org if it's not in PHX2 (note that there is never .vpn in the FQDN in the openvpn keys). Now commit and push:

git add .
git commit -a
git push

ansible

git clone https://pagure.io/fedora-infra/ansible.git
cd ansible

To see an example:

git grep badges-web01
find . -name badges-web01\*
find . -name badges-web\*

inventory

The ansible inventory file lists all the hosts that ansible knows about and also allows you to create sets of hosts that you can refer to via a group name. For a typical web application server set of hosts we’d create things like this:

[elections]
elections01.phx2.fedoraproject.org
elections02.phx2.fedoraproject.org

[elections-stg]
elections01.stg.phx2.fedoraproject.org
elections02.stg.phx2.fedoraproject.org

[... find the staging group and add there: ...]

[staging]
db-fas01.stg.phx2.fedoraproject.org
elections01.stg.phx2.fedoraproject.org
elections02.stg.phx2.fedoraproject.org

The hosts should use their fully qualified domain names here. The rules are slightly different than for 2fa certs. If the host is in PHX2, use the .phx2.fedoraproject.org domain name. If they aren’t in PHX2, then they usually just have .fedoraproject.org as their domain name. (If in doubt about a not-in-PHX2 host, just ask).

VPN config

If the machine is in VPN, create a file in ansible at roles/openvpn/server/files/ccd/$FQDN with contents like:

ifconfig-push 192.168.1.X 192.168.0.X

Where X is the last octet of the DNS IP address assigned to the host, so for example for elections01.phx2.fedoraproject.org that would be:

ifconfig-push 192.168.1.44 192.168.0.44


Work in progress

From here to the end of the file is still being worked on.

host_vars and group_vars

ansible consults files in inventory/group_vars and inventory/host_vars to set parameters that can be used in templates and playbooks. You may need to edit these. It's usually easy to copy the host_vars and group_vars from an existing host that's similar to the one you are working on and then modify a few names to make it work. For instance, for a web application server:

cd ~/ansible/inventory/group_vars
cp badges-web elections

Change the following:

- fas_client_groups: sysadmin-noc,sysadmin-badges
+ fas_client_groups: sysadmin-noc,sysadmin-web

(You can change disk size, mem_size, number of cpus, and ports too if you need them). Some things will definitely need to be defined differently for each host in a group – notably, ip_address. You should use the ip_address you claimed in the dns repo:

cd ~/ansible/inventory/host_vars
cp badges-web01.stg.phx2.fedoraproject.org elections01.stg.phx2.fedoraproject.org

The host will need a vmhost declaration. There is a script in ansible/scripts/vhost-info that will report how much free memory and how many free cpus each vmhost has. You can use that to inform your decision. By convention, staging hosts go on virthost12. Each vmhost has a different volume group. To figure out what volume group that is, execute the following command on the virthost:

vgdisplay

You may want to run "lsblk" to check that the volume group you expect is the one actually used for virtual guests.

Note:
19:16:01 3. add ./inventory/host_vars/FQDN host_vars for the new host.
19:16:56 that will have in it ip addresses, dns resolv.conf, ks url/repo, volume group to make the host lv in, etc etc.
19:17:10 4. add any needed vars to inventory/group_vars/ for the group
19:17:33 this has memory size, lvm size, cpus, etc
19:17:45 5. add tasks/virt_instance_create.yml task to top of group/host playbook
19:18:10 6. run the playbook and it will go to the virthost you set, create the lv, guest, install it, wait for it to come up, then continue configuring it.

mailman.yml

copy it from another file.


./ans-vhost-freemem --hosts=virthost\*

group vars
• vmhost (of the host that will host the VM)
• kickstart info (url of the kickstart itself and the repo)
• datacenter (although most likely won't change)

The host playbook is rather basic:
• Change the name
• Most things won't change much

ansible-playbook /srv/web/infra/ansible/playbooks/groups/mailman.yml

Adding a new proxy or webserver

When adding a new web server, other files must be edited by hand currently until templates replace them. These files cover getting httpd logs from the server onto log01 so that log analysis can be done:

roles/base/files/syncHttpLogs.sh
roles/epylog/files/merged/modules.d/rsyncd.conf
roles/hosts/files/staging-hosts
roles/mediawiki123/templates/LocalSettings.php.fp.j2

There are also nagios files which will need to be edited, but that should be done following the nagios document.

References

• The making a new instance section of: http://meetbot.fedoraproject.org/meetbot/fedora-meeting-1/2013-07-17/infrastructure-ansible-meetup.2013-07-17-19.00.html

Non-human Accounts Infrastructure SOP

We have many non-human accounts for various services, used by our web applications and certain automated scripts.

Contents

1. Contact Information
2. FAS Accounts
3. Bugzilla Accounts
4. PackageDB Owners
5. Koji Accounts

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Persons: sysadmin-main
Purpose: Provide Non-human accounts to our various services

Tokens

Wherever possible, OIDC (OpenID Connect) tokens or other tokens should be used for the script. Whatever the token, it should have the minimum privileges needed to do whatever the script or process needs to do and no more. Depending on what service(s) it needs to interact with, this could mean different tokens. Consult with the Fedora Security Officer for exact details.

Nuancier SOP

Nuancier is the web application used by the design team and the community to submit and vote on the supplemental wallpapers provided with each version of Fedora.

Contents

1. Contact Information
2. Documentation Links

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Location: https://apps.fedoraproject.org/nuancier
Servers: nuancier01, nuancier02, nuancier01.stg, nuancier02.stg
Purpose: Provide a system to submit and vote on supplemental wallpapers

Create a new election

• Login
• Go to the Admin panel via the menu at the top
• Click on Create a new election.
• Complete the form:
Election name A short name used in all the pages. Most often, since we have one election per release, it has been of the form Fedora XX.
Name of the folder containing the pictures This just links the election with the folder where the images will be uploaded on disk. Keep it simple, safe, something like fXX will do.
Year The year when the election will be happening; this will just give some quick sorting option.
Submission start date (in UTC) The date from which people will be able to submit wallpapers for the election. The submission starts on the exact day at midnight UTC.


Start date (in UTC) The date when the election starts (and thus the submissions end). There is no buffer between when the submissions end and when the votes start, which means admins have to keep up with the submissions as they are done.
End date (in UTC) The date when the election ends. There is no embargo on the results; they are available right after the election ends.
URL to claim a badge for voting The URL at which someone can claim a badge. This URL is displayed on the voting page as well as once people have voted. This means that having the badge does not ensure people voted; at most it ensures people visited nuancier during a voting phase.
Number of votes a user can make The number of wallpapers a user can choose/vote on. This was made configurable as there was a debate in the design team about whether having everyone vote on all 16 wallpapers was a good idea or not.
Number of candidates a user can upload Restricts the number of wallpapers a user can submit for an election, to prevent people from uploading tens of wallpapers in one election.

Review an election

Admins must do this regularly during a submission phase to keep candidates from piling up.
• Login
• Go to the Admin panel via the menu at the top
• Find the election of interest in the list and click on Review
If the images are not showing, you can generate the thumbnails using the button (Re-)generate cache. On the review page, you will be able to filter the candidates by Approved, Pending, Rejected or see them All (default). You can then check the images one by one, select their checkbox and then either Approve or Deny all the ones you selected.

Note: Rejections must be motivated in the Reason for rejection / Comments input field. This motivation is then sent by email to the user explaining why a wallpaper they submitted was not accepted into the election.

Vote on an election

Once an election is opened, a link announcing it will be available from the front page, and on the page listing the elections (Elections tab in the menu) a green check-mark will appear in the Votes column while a red forbidden sign will appear in the Submissions column. You can then click on the election name, which will take you to the voting page. There, enlarge the images by clicking on them and make your choice by clicking on the bottom right corner of the image. In the column on the right the total number of votes available will appear. If you need to remove a wallpaper from your selection, simply click on it in the right column. As long as you have not picked the maximum number of candidates allowed, you can cast your vote multiple times (but not on the same candidates of course).


View all the candidates of an election

All the candidates of an election are only accessible once the election is over. If you wish to see all the images uploaded, simply go to the Elections tab and click on the election name.

View the results of an election

The results of an election are accessible immediately after the end of it. To see them, simply click the Results tab in the menu. There you can click on the name of the election to see the wallpapers ordered by their number of votes, or on stats to view some stats about the election (such as the number of participants, the number of voters, votes or the evolution of the votes over time).

Miscellaneous

Nuancier uses a volume shared between the two hosts (in prod and in stg) where the images are stored, making sure they are available to both frontends. This may make things a little trickier sometimes; be aware of it.

On Demand Compose Service SOP

Note: The ODCS is very new and changing rapidly. We’ll try to keep this up to date as best we can.

The ODCS is a service generating temporary composes from Koji tag(s) using Pungi.

Contact Information

Owner: Factory2 Team, Release Engineering Team, Infrastructure Team
Contact: #fedora-modularity, #fedora-admin, #fedora-releng
Persons: jkaluza, cqi, qwan, threebean
Location: Phoenix
Public addresses:
• odcs.fedoraproject.org
Servers:
• odcs-frontend0[1-2].phx2.fedoraproject.org
• odcs-backend01.phx2.fedoraproject.org
Purpose: Generate temporary composes from Koji tag(s) using Pungi.

Description

ODCS clients submit requests for a compose to odcs.fedoraproject.org. The requests are submitted using the python2-odcs-client Python module or just using plain JSON. The request contains all the information needed to build a compose:


• source type: Type of compose source, for example "tag" or "module".
• source: Name of a Koji tag or a list of modules defined by name-stream-version.
• packages: List of packages to include in a compose.
• seconds to live: Number of seconds after which the compose is removed from the filesystem and is marked as "removed".
• flags: Various flags further defining the compose - for example the "no_deps" flag saying that the packages' dependencies should not be included in a compose.

The request is received by the ODCS flask app running on the odcs-frontend nodes. The frontend does input validation of the request, then adds the compose request to the database in the "wait" state and sends a fedmsg message about this event. The compose request gets its unique id which can be used by a client to query its status using the frontend REST API.

The odcs-backend node then handles the compose requests in the "wait" state and starts generating the compose using the Pungi tool. It does so by generating all the configuration files for Pungi and executing the "pungi" executable. The backend also changes the compose request status to "generating" and sends a fedmsg message about this event. The number of concurrent pungi processes can be set using the num_concurrent_pungi variable in the ODCS configuration file.

The output directory for a compose is shared between the frontend and backend nodes. Once the compose is generated, the backend changes the status of the compose request to "done" and again sends a fedmsg message about this event. The shared directory with a compose is available using httpd on the frontend node and the ODCS client can access the generated compose. By default this is at the https://odcs.fedoraproject.org/composes/ URL. If the compose generation goes wrong, the backend changes the state of the compose request to "failed" and again sends a fedmsg message about this event. The "failed" compose is still available for the seconds to live time in the shared directory for further examination of pungi logs if needed.

After the seconds to live time, the backend node removes the compose from the filesystem and changes the state of the compose request to "removed". If there are compose requests for the very same composes, the ODCS will reuse the older compose instead of generating a new one and point the new compose to the older one. A "removed" compose can be renewed by a client to generate the same compose as in the past. The seconds to live attribute of a compose can be extended by a client when needed.
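A hedged sketch of such a request submitted as plain JSON (the field layout follows the description above and upstream ODCS conventions; the tag name is made up and authentication is omitted):

curl -s -X POST https://odcs.fedoraproject.org/api/1/composes/ \
    -H "Content-Type: application/json" \
    -d '{"source": {"type": "tag", "source": "f26-build", "packages": ["httpd"]}}'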

Observing ODCS Behavior

There is currently no command line tool to query ODCS, but ODCS provides a REST API which can be used to observe ODCS behavior. This is available at https://odcs.fedoraproject.org/api/1/composes. The API can be filtered by the following keys, entered as HTTP GET variables:
• owner
• source_type
• source
• state
It is also possible to see all the current composes in the compose output directory, which is available on the frontend at https://odcs.fedoraproject.org/composes.
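A minimal sketch of querying that API from a script, using the filter keys listed above as GET parameters. The filter values and the shape of the JSON response (an "items" list) are assumptions to be checked against the live API.

```python
# Hedged example: list "done" composes owned by a hypothetical user.
import requests

resp = requests.get(
    "https://odcs.fedoraproject.org/api/1/composes/",
    params={"owner": "someuser", "state": "done"},  # placeholder filter values
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item.get("id"), item.get("source"), item.get("state_name"))
```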


Removing compose before its expiration time

Members of the FAS group defined in the admins section of the ODCS configuration can remove any compose by sending a DELETE request to the following URL: https://odcs.fedoraproject.org/api/1/composes/$compose_id
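For illustration, such a DELETE request could be sent like this; the compose id and authentication header are placeholders, and the credentials must map to a member of the configured admins FAS group.

```python
# Hedged sketch: remove a compose before its expiration time.
import requests

compose_id = 12345  # hypothetical compose id
resp = requests.delete(
    f"https://odcs.fedoraproject.org/api/1/composes/{compose_id}",
    headers={"Authorization": "Bearer <token>"},  # placeholder credentials
)
print(resp.status_code, resp.text)
```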

Logs

The frontend logs are on odcs-frontend0[1-2] in /var/log/httpd/error_log or /var/log/httpd/ssl_error_log. The backend logs are on odcs-backend01. Look in the journal for the odcs-backend service.

Upgrading

The package in question is odcs-server. Please use the playbooks/manual/upgrade/odcs.yml playbook.

Things that could go wrong

Not enough space on shared volume

In case there are too many composes, a member of the FAS group defined in the admins section of the ODCS configuration file should:
• Remove the oldest composes to get some free space immediately. A list of such composes can be found at https://odcs.fedoraproject.org/composes/ by sorting by the Last modified field.
• Decrease max_seconds_to_live in the ODCS configuration file.

openQA Infrastructure SOP

openQA is an automated test system used to run validation tests on nightly and candidate Fedora composes, and also to run a subset of these tests on critical path updates.

openQA production instance: https://openqa.fedoraproject.org
openQA staging instance: https://openqa.stg.fedoraproject.org
Wiki page on Fedora openQA deployment: https://fedoraproject.org/wiki/OpenQA
Upstream project page: http://open.qa/
Upstream repositories: https://github.com/os-autoinst

Contact Information

Owner Fedora QA devel
Contact #fedora-qa, #fedora-admin, qa-devel mailing list
People Adam Williamson (adamwill / adamw), Petr Schindler (pschindl)
Location PHX2
Machines See ansible inventory groups with ‘openqa’ in name
Purpose Run automated tests on VMs via screen recognition and VNC input


Architecture

Each openQA instance consists of a server (these are virtual machines) and one or more worker hosts (these are bare metal systems). The server schedules tests (“jobs”, in openQA parlance) and stores results and associated data. The worker hosts run “jobs” and send the results back to the server. The server also runs some fedmsg consumers to handle automatic scheduling of jobs and reporting of results to external systems (ResultsDB and Wikitcms).

Server

The server runs a web UI for viewing scheduled, running and completed tests and their data, with an admin interface where many aspects of the system can be configured (though we do not use the web UI for several aspects of configuration). There are several separate services that run on each server, and communicate with each other mainly via dbus. Each server requires its own PostgreSQL database. The web UI and websockets server are made externally available via reverse proxying through an Apache server.

It hosts an NFS share that contains the tests, the ‘needles’ (screenshots with metadata as JSON files that are used for screen matching), and test ‘assets’ like ISO files and disk images. The path is /var/lib/openqa/share/factory.

In our deployment, the PostgreSQL database for each instance is hosted by the QA database server. Also, some paths on the server are themselves mounted as NFS shares from the infra storage server. This is so that these are not lost if the server is re-deployed, and can easily be backed up. These locations contain the data from each executed job. As both the database and these key data files are not actually stored on the server, the server can be redeployed from scratch without loss of any data (at least, this is the intent).

Also in our deployment, an openQA plugin (which we wrote, but which is part of the upstream codebase) is enabled which emits fedmsgs on various events. This works by calling fedmsg-logger, so the appropriate fedmsg configuration must be in place for this to emit events correctly.

The server systems run a fedmsg consumer for the purpose of automatically scheduling jobs in response to the appearance of new composes and critical path updates, and one for the purpose of reporting the results of completed jobs to ResultsDB and Wikitcms. These use the fedmsg-hub system.

Worker hosts

The worker hosts run several individual worker ‘instances’ (via systemd’s ‘instantiated service’ mechanism), each of which registers with the server and accepts jobs from it, uploading the results of the job and some associated data to the server on completion. The worker instances and server communicate both via a conventional web API provided by the server and via websockets. When a worker runs a job, it starts a qemu virtual machine (directly - libvirt is not used) and interacts with it via VNC and the serial console, following a set of steps dictating what it should do and what response it should expect in terms of screen contents or serial console output. The server ‘pushes’ jobs to the worker instances over a websocket connection.

Each worker host must mount the /var/lib/openqa/share/factory NFS share provided by the server. If this share is not mounted, any jobs run will fail immediately due to expected asset and test files not being found.

Some worker hosts for each instance are designated ‘tap workers’, meaning they run some advanced jobs which use software-defined networking (openvswitch) to interact with each other. All the configuration for this should be handled by the ansible scripts, but it’s useful to be aware that there is complex software-defined networking going on on these hosts which could potentially be the source of problems.


Deployment and regular operation

Deployment and normal update of the openQA systems should run entirely through Ansible. Just running the appropriate ansible plays for the systems should complete the entire deployment / update process, though it is best to check after running them that there are no failed services on any of the systems (restart any that failed), and that the web UI is properly accessible.

Regular operation of the openQA deployments is entirely automated. Jobs should be scheduled and run automatically when new composes and critical path updates appear, and results should be reported to ResultsDB and Wikitcms (when appropriate). Dynamically generated assets should be regenerated regularly, including across release boundaries (see the section on createhdds below): no manual intervention should be required when a new Fedora release appears. If any of this does not happen, something is wrong, and manual inspection is needed.

Our usual practice is to upgrade the openQA systems to new Fedora releases promptly as they appear, using dnf system-upgrade. This is done manually. We usually upgrade the staging instance first and watch for problems for a week or two before upgrading production.

Rebooting / restarting

The optimal approach to rebooting an entire openQA deployment is as follows:

1. Wait until no jobs are running
2. Stop all openqa-* services on the server, so no more will be queued
3. Stop all openqa-worker@ services on the worker hosts
4. Reboot the server
5. Check for failed services (systemctl --failed) and restart any that failed
6. Once the server is fully functional, reboot the worker hosts
7. Check for failed services and restart any that failed, particularly the NFS mount service

Rebooting the workers after the server is important due to the NFS share. If only the server needs restarting, the entire procedure above should ideally be followed in any case, to ensure there are no issues with the NFS mount breaking due to the server reboot, or the server and worker getting confused about running jobs due to the websockets connections being restarted. If only a worker host needs restarting, there is no need to restart the server too, but it is best to wait until no jobs are running on that host, and stop all openqa-worker@ services on the host before rebooting it.

There are two ways to check if jobs are running and if so where (a scripted check is also sketched after this section). You can go to the web UI for the server and click ‘All Tests’. If any jobs are running, you can open each one individually (click the link in the ‘Test’ column) and look at the ‘Assigned worker’, which will tell you which host the job is running on. Or, if you have admin access, you can go to the admin menu (top right of the web UI, once you are logged in) and click on ‘Workers’, which will show the status of all known workers for that server, and select ‘Working’ in the state filter box. This will show all workers currently working on a job.

Note that if something which would usually be tested (new compose, new critpath update. . . ) appears during the reboot window, it likely will not be scheduled for testing, as this is done by a fedmsg consumer running on the server. You will need to schedule it for testing manually in this case (see below).
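As an optional helper for step 1, the check for running jobs can also be scripted against the openQA REST API instead of clicking through the web UI. The endpoint and field names below are assumptions based on the public openQA API and should be verified against your instance.

```python
# Hedged sketch: ask openQA whether any jobs are still running before rebooting.
import requests

resp = requests.get("https://openqa.fedoraproject.org/api/v1/jobs",
                    params={"state": "running"})
resp.raise_for_status()
jobs = resp.json().get("jobs", [])
if jobs:
    for job in jobs:
        print(f"job {job.get('id')} ({job.get('name')}) is still running")
else:
    print("no running jobs; safe to proceed with the reboot sequence")
```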

Scheduling jobs manually

While it is not normally necessary, you may sometimes need to run or re-run jobs manually.


The simplest cases can be handled by an admin from the web UI: for a logged-in admin, all scheduled and running tests can be cancelled (from various views), and all completed tests can be restarted. ‘Restarting’ a job actually effectively clones it and schedules the clone to be run: it creates a new job with a new job ID, and the previous job still exists. openQA attempts to handle complex cases of inter-dependent jobs correctly when restarting, but doesn’t always manage to do it right; when it goes wrong, the best thing to do is usually to re-run all jobs for that medium.

To run or re-run the full set of tests for a compose or update, you can use the fedora-openqa CLI. To run or re-run tests for a compose, use:

fedora-openqa compose -f (COMPOSE LOCATION)

where (COMPOSE LOCATION) is the full URL of the /compose subdirectory of the compose. This will only work for Pungi-produced composes with the expected productmd-format metadata, and a couple of other quite special cases. The -f argument means ‘force’, and is necessary to re-run tests: usually, the scheduler will refuse to re-schedule tests that have already run, and -f overrides this. To run or re-run tests for an update, use:

fedora-openqa update -f (UPDATEID) (RELEASE)

where (UPDATEID) is the update’s ID - something like FEDORA-2018-blahblah - and (RELEASE) is the release for which the update is intended (27, 28, etc). To run or re-run only the tests for a specific medium (usually a single image file), you must use the lower-level web API client, with a more complex syntax. The command looks something like this:

/usr/share/openqa/script/client isos post \
    ISO=Fedora-Server-dvd-x86_64-Rawhide-20180108.n.0.iso DISTRI=fedora VERSION=Rawhide \
    FLAVOR=Server-dvd-iso ARCH=x86_64 BUILD=Fedora-Rawhide-20180108.n.0 CURREL=27 PREVREL=26 \
    RAWREL=28 IMAGETYPE=dvd SUBVARIANT=Server \
    LOCATION=http://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180108.n.0/compose

The ISO value is the filename of the image to test (it may not actually be an ISO), the DISTRI value is always ‘fedora’, the VERSION value should be the release number or ‘Rawhide’, the FLAVOR value depends on the image being tested (you can check the value from an existing test for the same or a similar ISO), the ARCH value is the arch of the image being tested, the BUILD value is the compose ID, CURREL should be the release number of the current Fedora release at the time the test is run, PREVREL should be one lower than CURREL, RAWREL should be the release number associated with Rawhide at the time the test is run, IMAGETYPE depends on the image being tested (again, check a similar test for the correct value), LOCATION is the URL to the /compose subdirectory of the compose location, and SUBVARIANT again depends on the image being tested. Please ask for help if this seems too daunting. To re-run the ‘universal’ tests on a given image, set the FLAVOR value to ‘universal’, then set all other values as appropriate to the chosen image. The ‘universal’ tests are only likely to work at all correctly with DVD or netinst images. openQA provides a special script for cloning an existing job but optionally changing one or more variable values, which can be useful in some situations. Using it looks like this:

/usr/share/openqa/script/clone_job.pl --skip-download --from localhost 123 RAWREL=28

to clone job 123 with the RAWREL variable set to ‘28’, for instance. For interdependent jobs, you may or may not want to use the --skip-deps argument to avoid re-running the cloned job’s parent job(s), depending on circumstances.


Manual updates

In general updates to any of the components of the deployments should be handled via ansible: push the changes out in the appropriate way (git repo update, package update, etc.) and then run the ansible plays. However, sometimes we do want to update or test a change to something manually for some reason. Here are some notes on those cases. For updating openQA and/or os-autoinst packages: ideally, ensure no jobs are running. Then, update all installed subpackages on the server. The server services should be automatically restarted as part of the package update. Then, update all installed subpackages on the worker hosts, and restart all worker services. A ‘for’ loop can help with that, for instance:

for i in {1..10}; do systemctl restart openqa-worker@$i.service; done

on a host with ten worker instances. For updating the openQA tests:

cd /var/lib/openqa/share/tests/fedora
git pull (or git checkout (branch) or whatever)
./templates --clean
./templates-updates --update

The templates steps are only necessary if there are any changes to the templates files. For updating the scheduler code:

cd /root/fedora_openqa
git pull (or whatever changes)
python setup.py install
systemctl restart fedmsg-hub

Updating other components of the scheduling process follows the same pattern: update the code or package, then remember to restart fedmsg-hub, or the fedmsg consumers won’t use the new code. It’s relatively common for the openQA instances to need fedfind updates in advance of them being pushed to stable; for example, when a new compose type is invented and fedfind doesn’t understand it, openQA can end up trying to schedule tests for it, or the scheduler consumer can crash. When this happens we have to fix and update fedfind on the openQA instances ASAP.

Logging

Just about all useful logging information for all aspects of openQA and the scheduling and report tools is logged to the journal, except that the Apache server logs may be of interest in debugging issues related to accessing the web UI or websockets server. To get more detailed logging from openQA components, change the logging level in /etc/openqa/openqa.ini from ‘info’ to ‘debug’ and restart the relevant services. Any run of the Ansible plays will reset this back to ‘info’.

Occasionally the test execution logs may be useful in figuring out why all tests are failing very early, or some specific tests are failing due to an asset going missing, etc. Each job’s execution logs can be accessed through the web UI, on the Logs & Assets tab of the job page; the files are autoinst-log.txt and worker-log.txt.

Dynamic asset generation (createhdds)

Some of the hard disk image file ‘assets’ used by the openQA tests are created by a tool called createhdds, which is checked out of a git repo to /root/createhdds on the servers and also on some guests. This tool uses virt-install and the Python bindings for libguestfs to create various hard disk images the tests need to run.

It is usually run in two different ways. The ansible plays run it in a mode where it will only create expected images that are entirely missing: this is mainly meant to facilitate initial deployment. The plays also install a file to /etc/cron.daily causing it to be run daily in a mode where it will also recreate images that are ‘too old’ (the age-out conditions for images are part of the tool itself). This process isn’t 100% reliable; virt-install can sometimes fail, either just quasi-randomly or every time, in which case the cause of the failure needs to be figured out and fixed so the affected image can be (re-)built.

The i686 and x86_64 images for each instance are built on the server, as its native arch is x86_64. The images for other arches are built on one worker host for each arch (nominated by inclusion in an ansible inventory group that exists for this purpose); those hosts have write access to the NFS share for this purpose.

Compose check reports (check-compose)

An additional ansible role runs on each openQA server, called check-compose. This role installs a tool (also called check-compose) and an associated fedmsg consumer. The consumer kicks in when all openQA tests for any compose finish, and uses the check-compose tool to send out an email report summarizing the results of the tests (well, the production server sends out emails, the staging server just logs the contents of the report). This role isn’t really a part of openQA proper, but is run on the openQA servers as it seems like as good a place as any to do it. As with all other fedmsg consumers, if making manual changes or updates to the components, remember to restart fedmsg-hub service afterwards.

Autocloud ResultsDB forwarder (autocloudreporter)

An ansible role called autocloudreporter also runs on the openQA production server. This has nothing to do with openQA at all, but is run there for convenience. This role deploys a fedmsg consumer that listens for fedmsgs indicating that Autocloud (a separate automated test system which tests cloud images) has completed a test run, then forwards those results to ResultsDB.

OpenShift SOP

OpenShift is used in Fedora Infrastructure to host a number of applications. This SOP is applicable to the OpenShift cluster and not the application running on it. Production instance: https://os.fedoraproject.org/ Staging instance: https://os.stg.fedoraproject.org/

Contents

• Contact information
• Things that could go wrong
  – Application build is stuck

Contact information

Owner Fedora Infrastructure Team
Contact #fedora-admin
Persons .oncall
Location Phoenix


Servers
    • os-master01.phx2.fedoraproject.org
    • os-master02.phx2.fedoraproject.org
    • os-master03.phx2.fedoraproject.org
    • os-node01.phx2.fedoraproject.org
    • os-node02.phx2.fedoraproject.org
    • os-node03.phx2.fedoraproject.org
    • os-node04.phx2.fedoraproject.org
    • os-node05.phx2.fedoraproject.org
Purpose Run Fedora Infrastructure applications

Things that could go wrong

Application build is stuck

If an application build seems stuck, it generally helps to restart the docker service on the node used for the build. First check which builds are currently running on the cluster.

[os-master01] # oc get builds --all-namespaces | grep -i running

If a build seems stuck (i.e. running for more than 20 minutes), check on which node it is scheduled. Let’s take a bodhi build for example:

[os-master01] # oc -n bodhi get builds

[os-master01] # oc -n bodhi describe build bodhi-base-49 | grep os-node

Once you have identified which node the build is running on, you can restart the docker service on this node.

[os-node02] # systemctl restart docker

You can start a new build:

[os-master01] # oc -n bodhi start-build bodhi-base

Finally you can check if there are any more builds stuck. If that’s the case, just repeat these steps.

[os-master01] # oc get builds --all-namespaces

OpenVPN SOP

OpenVPN is our server-to-server VPN solution. It is deployed in a routeless manner and uses ansible-managed keys for authentication. All hosts should be given static IPs and a hostname.vpn.fedoraproject.org DNS address.


Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main
Location Phoenix
Servers bastion (vpn.fedoraproject.org)
Purpose Provides vpn solution for our infrastructure.

Add a new host

Create/sign the keys

From batcave01 check out the private repo:

# This is to ensure that the clone is not world-readable at any point.
RESTORE_UMASK=$(umask -p)
umask 0077
git clone /srv/git/ansible-private
$RESTORE_UMASK
cd ansible-private/vpn/openvpn

Next prepare your environment and run the build-key script. This example is for host “proxy4.fedora.phx.redhat.com”:

. ./vars
./build-key $FQDN
# ./revoke-full $FQDN to revoke keys that are no longer used.
git add .
git commit -a
git push

Create Static IP

Giving static IPs out in openvpn is mostly painless. Take a look at other examples, but each host gets a file and 2 IPs:

git clone https://pagure.io/fedora-infra/ansible.git
vi ansible/roles/openvpn/server/files/ccd/$FQDN

The file format should look like this:

ifconfig-push 192.168.1.314 192.168.0.314

Basically the first IP is the IP that is contactable over the vpn and should always take the form “192.168.1.x”, and the PtP IP is the same ip on a different network: “192.168.0.x”. Commit and install:

git add .
git commit -m "What have you done?"
git push

And then push that out to bastion:


sudo -i ansible-playbook $(pwd)/playbooks/groups/bastion.yml -t openvpn

Create DNS entry

After you have your static IP ready, just add the entry to DNS:

git clone /srv/git/dns && cd dns
vi master/168.192.in-addr.arpa   # pick out an ip that's unused
vi master/vpn.fedoraproject.org
git commit -m "What have you done?"
./do-domains
git commit -m "done build."
git push

And push that out to the name servers with:

sudo -i ansible ns\* -a "/usr/local/bin/update-dns"

Update resolv.conf on the client

To make sure traffic actually goes over the VPN, make sure the search line in /etc/resolv.conf looks like:

search vpn.fedoraproject.org fedoraproject.org

for external hosts, and:

search phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org

for PHX2 hosts.

Remove a host

# This is to ensure that the clone is not world-readable at any point.
RESTORE_UMASK=$(umask -p)
umask 0077
git clone /srv/git/ansible-private
$RESTORE_UMASK
cd ansible-private/vpn/openvpn

Next prepare your environment and run the revoke-full script. This example is for host “proxy4.fedora.phx.redhat.com”:

. ./vars
./revoke-full $FQDN
git add .
git commit -a
git push

TODO

Deploy an additional VPN server outside of PHX. OpenVPN does support failover automatically so if configured properly, when the primary VPN server goes down all hosts should connect to the next host in the list.


Orientation Infrastructure SOP

Basic orientation and introduction to the sysadmin group. Welcome aboard!

Contents

1. Contact Information
2. Description
3. Welcome to the team
   1. Time commitment
   2. Prove Yourself
4. Doing Work
   1. Ansible
5. Our Setup
6. Our Rules

Contact Information

Owner Fedora Infrastructure Team Contact #fedora-admin, sysadmin-main Purpose Provide basic orientation and introduction to the sysadmin group

Description

Fedora’s Infrastructure team is charged with keeping all the lights on, improving pain points, expanding services, designing new services and partnering with other teams to help with their needs. The team is highly dynamic and primarily based in the US. This is only significant in that most of us work during the day in US time. We do have team members all over the globe though and generally have decent coverage. If you happen to be one of those who is not in a traditional US time zone you are encouraged to be around, especially in #fedora-admin during those times when we have less coverage. Even if it is just to say “I can’t help with that but $ADMIN will be and he should be here in about 3 hours”.

The team itself is generally friendly and honest. Don’t be afraid to disagree with someone, even if you’re new and they’re an old timer. Just make sure you ask yourself what is important to you and make sure to provide data; we like that.

We generally communicate on irc.libera.chat in #fedora-admin. We have our weekly meetings on IRC and it’s the quickest way to get in touch with everyone. Secondary to that we use the mailing list. After that it’s our ticketing system and talk.fedoraproject.org. Welcome to the team!

Time commitment

Oftentimes this is the biggest reason for turnover in our group. Some groups like sysadmin-web and certainly sysadmin-main require a huge time commitment. Don’t be surprised if you see people working between 10-30 hours a week on various tasks, and that’s the volunteers. Your time commitment is something personal to each individual and it’s something that you should give some serious thought. In general it’s almost impossible to be a regular part of the team without at least 5-10 hours a week dedicated to the Infrastructure team.

Also note, if you are going to be away, let us know. As a volunteer we can’t possibly ask you to always be around all the time. Even if you’re in the middle of a project and have to stop, let us know. Nothing is worse than thinking someone is working on something or will be around and they’re just not. Really, we all understand. Got a test coming up? Busier at work than normal? Going on a vacation? It doesn’t matter, just let us know when you’re going to be gone and what you’re working on so it doesn’t get forgotten. Additionally don’t forget that it’s worth discussing with your employer about giving time during work. They may be all for it.

Prove Yourself

This is one of the most difficult aspects of getting involved with our team. We can’t just give access to everyone who asks for it, and often actually doing work without access is difficult. Some of the best things you can do are:
• Keep bugging people for work. It shows you’re committed.
• Go through bugs, look at stale bugs and close bugs that have been fixed.
• Try to duplicate bugs on your workstation and fix them there.
Above all, stick with it. Part of proving yourself is also to show the time commitment it actually does take.

Doing Work

Once you’ve been sponsored for a team it’s generally your job to find what work needs to be done in the ticketing system. Be proactive about this. The tickets can be found at: https://pagure.io/fedora-infrastructure/issues

When you find a ticket that interests you, contact your sponsor or the ticket owner and offer help. While you’re getting used to the way things work, don’t be put off by someone saying no or you can’t work on that. It happens; sometimes it’s a security thing, sometimes it’s a “I’m half way through it and I’m not happy with where it is” thing. Just move on to the next ticket and go from there.

Also don’t be surprised if some of the work involved includes testing on your own workstation. Just set up a virtual environment and get to work! There’s a lot of work that can be done to prove yourself that involves no access at all. Doing this kind of work is a sure-fire way to get in to more groups and get more involved.

Don’t be afraid to take on tasks you don’t already know how to do. But don’t take on something you know you won’t be able to do. Ask for help when you need it, and keep in contact with your sponsor.

Ansible

The work we do gets done in Ansible. It is important that you not make changes directly on servers; this is for many reasons, but just always make changes in Ansible. If you want to get more familiar with Ansible, set it up yourself and give it a try. The docs are available at https://docs.ansible.com/

Our Setup

Most of our work is done via bastion.fedoraproject.org. That host has access to our other hosts, many of which are all over the globe. We have a vpn solution setup so that knowing where the servers physically are is only important when troubleshooting things. When you first get granted access to one of the sysadmin-* groups, the first place you should turn is bastion.fedoraproject.org then from there ssh to batcave01.


We also have an architecture repo available in our git repo. To get a copy of this repo just:

dnf install git
git clone https://pagure.io/fedora-infrastructure.git

This will allow you to look through (and help fix) some of our scripts as well as have access to our architectural documentation. Become familiar with those docs if you’re curious. There’s always room to do better documentation so if you’re interested just ping your sponsor and ask about it.

Our Rules

The Fedora Infrastructure Team does have some rules. First is the security policy. Please ensure you are compliant with https://infrastructure.fedoraproject.org/csi/security-policy/ before logging in to any of our servers. Many of those items rely on the honor system. Additionally note that any of the software we deploy must be available in Fedora. There are some rare exceptions to this (particularly as it relates to applications specific to Fedora), but each exception is taken on a case by case basis.

Outage Infrastructure SOP

What to do when there’s an outage or when you’re planning to take an outage.

Contents

1. Contact Information
2. Users (No Access)
   1. Planned Outage
      1. Contacts
   2. Unplanned Outage
      1. Check first
      2. Reporting or participating in an outage
3. Infrastructure Members (Admin Access)
   1. Planned Outage
      1. Planning
      2. Preparations
      3. Outage
      4. Post outage cleanup
   2. Unplanned Outage
      1. Determine Severity
      2. First Steps
      3. Fix it
      4. Escalate
      5. The Resolution
      6. The Aftermath

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main group
Location Anywhere
Servers Any
Purpose This SOP is generic for any outage
Emergency: https://admin.fedoraproject.org/pager

Users (No Access)

Note: Don’t have shell access? Doesn’t matter. Stop by and stay in #fedora-admin; if you have any expertise in what is going on, please assist. Random users have helped the team out countless times. Any time the team doesn’t have to go to the docs to look up an answer is time they can spend fixing what’s busted.

Planned Outage

If a planned outage comes at a terrible time, just let someone know. The Infrastructure Team does its best to keep outages out of the way but if there’s a mass rebuild going on that we don’t know about and we schedule a koji outage, let someone know.

Contacts

Pretty much all coordination occurs in #fedora-admin on irc.libera.chat. Stop by there to watch more about what’s going on. Just stay on topic.

Unplanned Outage

Check first

Think something is busted? Please check with others to see if they are also having issues. This could even include checking on another computer. When reporting an outage remember that the admins will typically drop everything they are doing to check what the problem is. They won’t be happy to find out your cert has expired or you’re using the wrong username. Additionally, check the status dashboard (http://status.fedoraproject.org) to verify that there is no previously reported outage that may be causing and/or related to your issue.


Reporting or participating in an outage

If you think you’ve found an outage, get as much information as you can about it at a glance. Copy any errors you get to http://pastebin.ca/. Use the following guidelines:

Don’t be general.
• BAD: “The wiki is acting slow”
• Good: “Whenever I try to save https://fedoraproject.org/wiki/Infrastructure, I get a proxy error after 60 seconds”

Don’t report an outage that’s already been reported.
• BAD: “/join #fedora-admin; Is the build system broken?”
• Good: “/join #fedora-admin; wait a minute or two; I noticed I can’t submit builds, here’s the error I get:”

Don’t suggest drastic or needless changes during an outage (send it to the list).
• “Why don’t you just use lighttpd?”
• “You could try limiting MaxRequestsPerChild in Apache”

Don’t get off topic or be too chatty.
• “Transformers was awesome, but yeah, I think you guys know what to do next”

Do research the technologies we’re using and answer questions that may come up.
• BAD: “Can’t you just fix it?”
• Good: “Hey guys, I think this is what you’re looking for: http://httpd.apache.org/docs/2.2/mod/mod_mime.html#addencoding”

If no one can be contacted after 10 minutes or so, please see the section below called Determine Severity to determine whether or not someone should get paged.

Infrastructure Members (Admin Access)

The Infrastructure Members section is specifically written for members with access to the servers. This could be admin access to a box or even a specific web application. Basically anyone with access to fix the problem.

Planned Outage

Any outage that is intentionally caused by a team member is a planned outage. Even if it has to happen in the next 5 minutes.

Planning

All major planned outages should occur with at least 1 week notice. This is not always possible, use best judgment. Please use our standard outage template at: https://fedoraproject.org/wiki/Infrastructure/OutageTemplate. Make sure to have another person review your template/announcement to check times and services affected. Make sure to send the announcement to the lists that are affected by the outage: announce, devel-announce, etc. Always create a ticket in the ticketing system: https://fedoraproject.org/wiki/Infrastructure/Tickets Send an email to the fedora-infrastructure-list with more details if warranted. Remember to follow an existing SOP as much as possible. If anything is missing from the SOP please add it.


Preparations

Remember to schedule an outage in nagios. This is important not just so notifications don’t get sent but also important for trending and reporting. https://admin.fedoraproject.org/nagios/

Outage

Prior to beginning an outage to any monitored service on http://status.fedoraproject.org please push an update to reflect the outage (see status-fedora SOP). Report all information in #fedora-admin. Coordination is extremely important, it’s rare for our group to meet in person and IRC is our only real-time communication device. If a web site is out please put up some sort of outage page in its place.

Post outage cleanup

Once the outage is over ensure that all services are up and running. Ensure all nagios services are back to green. Notify everyone in #fedora-admin to scan our services for issues. Once all services are cleared update the status.fp.o dashboard. If the outage included a new feature or major change for a group, please notify that group that the change is ready. Make sure to close the ticket for the outage when it’s over. Once the services are restored, an update to the status dashboard should be pushed to show the services are restored.

Important: Additionally update any SOP’s that may have changed in the course of the outage

Unplanned Outage

Unplanned outages happen, stay cool. As a team member never be afraid to do something because you think you’ll get in trouble over it. Be smart, don’t be reckless, and never say “I shouldn’t do this”. If an unorthodox method or drastic change will fix the problem, do it, document it, and let the team know. Messes can always be cleaned up after the outage.

Determine Severity

Some outages require immediate fixing, some don’t. A page should never go out because someone can’t sign the CLA. Most of our admins are in US time, so use your best judgment. If it’s bad enough to warrant an emergency page, page one of the admins at: https://admin.fedoraproject.org/pager

Use the following as loose guidelines, just use your best judgment.
• BAD: “I can’t see the Recent Changes on the wiki.”
• Good: “The entire wiki is not viewable”
• BAD: I cannot sign the CLA
• Good: I can’t change my password in the account system, I have admin access and my laptop was just stolen
• BAD: I can’t access awstats for fedoraproject.org
• Good: The mirrors list is down.
• BAD: I think someone misspelled some words on the webpage
• Good: The web page has been hacked and I think someone notified .

First Steps

After an outage has been verified, acknowledge the outage in nagios: https://admin.fedoraproject.org/nagios/, update the related system on the status dashboard (see the status-fedora SOP) and verify changes at http://status.fedoraproject.org, then head in to #fedora-admin to figure out who is around and coordinate the next course of action. Consult any relevant SOPs for corrective actions.

Fix it

Fix it, Fix it, Fix it! Do whatever needs to be done to fix the problem, just don’t be stupid about it.

Escalate

Can’t fix it? Don’t wait, Escalate! All of the team members have expertise with some areas of our environment and weaknesses in other areas. Never be afraid to tap another team member. Sometimes it’s required, sometimes it’s not. The last layer of defense is to page someone. At present our team is small enough that a full escalation path wouldn’t do much good. Consult the contact information on each SOP for more information.

The Resolution

Once the services are restored, an update to the status dashboard should be pushed to show the services are restored.

The Aftermath

With any outage there will be questions. Please try as hard as possible to answer the following questions and send them to the fedora-infrastructure-list.
1. What happened?
2. What was affected?
3. How long was the outage?
4. What was the root cause?

Important: Number 4 is especially important. If a kernel build keeps failing because of issues with koji caused by a database failure caused by a full filesystem on db1, don’t say koji died because of a db failure. Any time a root cause is discovered and not being monitored by nagios, add it if possible. Most failures can be prevented or mitigated with proper monitoring.

Package Database Infrastructure SOP

The PackageDB is used by Fedora developers to manage package ownership and ACLs. It controls who is allowed to commit to a package and who gets notification of changes to packages. PackageDB project Trac: https://fedorahosted.org/packagedb/


Contents

1. Contact Information
2. Troubleshooting and Resolution
3. Common Actions
   1. Adding a new Pseudo User as a package owner
   2. Renaming a package
   3. Removing a package
   4. Add a new release
   5. Update App DB for a release going final
   6. Orphaning all the packages for a user

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin
Persons abadger1999
Location Phoenix
Servers admin.fedoraproject.org (click on one of the current haproxy servers to see the physical servers)
Purpose Manage package ownership

Troubleshooting and Resolution

Common Actions

Adding a new Pseudo User as a package owner

Sometimes you want to have a mailing list own a package so that bugzilla email is assigned to the mailing list. Doing this requires adding a new pseudo user to the account system and assigning that person as the package maintainer.

Warning: pseudo users often have a dash in their name. We create email aliases via ansible that have dashes in their name in order to not collide with fas usernames (users cannot create usernames with dashes via the webui). Make sure that any pseudo-users you create do not clash with existing email aliases.

In the following examples, replace (“xen”, “kernel-xen-2.6”) with the packages you are assigning to the new user and 9902 with the userid you select in step 2.
• Log into fas-db01.
• Log into the db as a user that can make changes:

$ psql -U postgres fas2
fas2>


– Find the current pseudo-users:

fas2> select id, username from people where id < 10000 order by id;
  id  |    username
------+----------------
 9900 | orphan
 9901 | anaconda-maint

– Create a new account with the next available id after 9900:

fas2> insert into people (id, username, human_name, password, email)
      values (9902, 'xen-maint', 'Xen Maintainers', '*', '[email protected]');

• Connect to the pkgdb as a user that can make changes:

$ psql -U pkgdbadmin -h db01 pkgdb
pkgdb>

• Add the current package owner as a comaintainer of the package. If this user is not currently on the acls for the package, you can use the following database queries:

insert into personpackagelisting (username, packagelistingid)
  select pl.owner, pl.id
  from packagelisting as pl, package as p
  where p.id = pl.packageid
    and p.name in ('xen', 'kernel-xen-2.6');

insert into personpackagelistingacl (personpackagelistingid, acl, statuscode)
  select ppl.id, 'build', 3
  from personpackagelisting as ppl, packagelisting as pl, package as p
  where p.id = pl.packageid
    and pl.id = ppl.packagelistingid
    and pl.owner = ppl.username
    and p.name in ('xen', 'kernel-xen-2.6');

insert into personpackagelistingacl (personpackagelistingid, acl, statuscode)
  select ppl.id, 'commit', 3
  from personpackagelisting as ppl, packagelisting as pl, package as p
  where p.id = pl.packageid
    and pl.id = ppl.packagelistingid
    and pl.owner = ppl.username
    and p.name in ('xen', 'kernel-xen-2.6');

insert into personpackagelistingacl (personpackagelistingid, acl, statuscode)
  select ppl.id, 'approveacls', 3
  from personpackagelisting as ppl, packagelisting as pl, package as p
  where p.id = pl.packageid
    and pl.id = ppl.packagelistingid
    and pl.owner = ppl.username
    and p.name in ('xen', 'kernel-xen-2.6');

If the owner is already in the acls, you will need to figure out which packages already have acls and only add the new acls for the one that does not.
• Reassign the pseudo-user to be the new owner:

update packagelisting set owner='xen-maint' from package as p where packagelisting.packageid=p.id and p.name in ('xen','kernel-xen-2.6');

Renaming a package

On db2:

256 Chapter 2. Full Table of Contents Fedora Infrastructure Best Practices Documentation, Release 1.0.0

sudo -u postgres psql pkgdb
select * from package where name='OLDNAME';
[Make sure only the package you want is selected]
update package set name='NEWNAME' where name='OLDNAME';

On cvs-int:

CVSROOT=/cvs/pkgs cvs co CVSROOT
sed -i 's/OLDNAME/NEWNAME/g' CVSROOT/modules
cvs commit -m 'Rename OLDNAME => NEWNAME'
cd /cvs/pkgs/rpms
mv OLDNAME NEWNAME
cd NEWNAME
find . -name 'Makefile,v' -exec sed -i 's/NAME := OLDNAME/NAME := NEWNAME/' \{\} \;
cd ../../devel
rm OLDNAME
ln -s ../rpms/NEWNAME/devel .

If the package has existed long enough to have been added to koji, run something like the following to “retire” the old name in koji:

koji block-pkg dist-f12 OLDNAME

Removing a package

Warning: Do not remove a package if it has been built for a fedora release or if you are not also willing to remove the cvs directory.

When a package has been added due to a typo, it can be removed in one of two ways: marking it as a mistake with the “removed” status or deleting it from the db entirely. Marking it as removed is easier and is explained below.

On db2:

sudo -u postgres psql pkgdb
pkgdb=# select id, name, summary, statuscode from package where name = 'b';
  id  | name |                      summary                      | statuscode
------+------+---------------------------------------------------+------------
 6618 | b    | A simple database interface to MS-SQL for Python  |          3
(1 row)

• Make sure there is only one package returned and it is the correct one.
• Statuscode 3 is “approved” and it’s what we’re changing from.
• You’ll also need the id for later:

pkgdb=# BEGIN;
pkgdb=# update package set statuscode = 17 where name = 'b';
UPDATE 1

• Make sure only a single package was changed:

pkgdb=# COMMIT;



pkgdb=# select id, packageid, collectionid, owner, statuscode from packagelisting where packageid = 6618;
  id   | packageid | collectionid | owner  | statuscode
-------+-----------+--------------+--------+------------
 42552 |      6618 |           19 | 101437 |          3
 38845 |      6618 |           15 | 101437 |          3
 38846 |      6618 |           14 | 101437 |          3
 38844 |      6618 |            8 | 101437 |          3
(4 rows)

• Make sure the output here looks correct (packageid is all the same, etc).
• You’ll also need the ids for later:

pkgdb=# BEGIN;
pkgdb=# update packagelisting set statuscode = 17 where packageid = 6618;
UPDATE 4
-- Make sure the same number of rows were committed as you saw before.
pkgdb=# COMMIT;

pkgdb=# select * from personpackagelisting where packagelistingid in (38844, 38846, 38845, 42552);
 id | userid | packagelistingid
----+--------+------------------
(0 rows)

• In this case there are no comaintainers so we don’t have to do any more. If there were, we’d have to treat them like the groups handled next:

pkgdb=# select * from grouppackagelisting where packagelistingid in (38844, 38846, 38845, 42552);
  id   | groupid | packagelistingid
-------+---------+------------------
 39229 |  100300 |            38844
 39230 |  107427 |            38844
 39231 |  100300 |            38845
 39232 |  107427 |            38845
 39233 |  100300 |            38846
 39234 |  107427 |            38846
 84481 |  107427 |            42552
 84482 |  100300 |            42552
(8 rows)

pkgdb=# select * from grouppackagelistingacl where grouppackagelistingid in (39229, 39230, 39231, 39232, 39233, 39234, 84481, 84482);

• The results of this are usually pretty long, so I’ve omitted everything but the row count (24 rows).
• For groups it’s typically 3 acls (one for each of commit, build, and checkout) times the number of grouppackagelistings. In this case, that’s 24, so this matches our expectations:

pkgdb=# BEGIN;
pkgdb=# update grouppackagelistingacl set statuscode = 13 where grouppackagelistingid in (39229, 39230, 39231, 39232, 39233, 39234, 84481, 84482);

• Make sure only the number of rows you saw before were updated:


pkgdb=# COMMIT;

If the package has existed long enough to have been added to koji, run something like the following to “retire” it in koji:

koji block-pkg dist-f12 PKGNAME

Add a new release

To add a new Fedora Release, ssh to db02 and do this:

sudo -u postgres psql pkgdb

• This adds the release for Package ACLs:

insert into collection (name, version, statuscode, owner, koji_name)
  values ('Fedora', '13', 1, 'jkeating', 'dist-f13');
insert into branch
  select id, 'f13', '.fc13', Null, 'f13'
  from collection where name = 'Fedora' and version = '13';

• If this is for mass branching we probably need to advance the branch information for devel as well:

update branch set disttag='.fc14' where collectionid=8;

• This adds the new release’s repos for the App DB:

insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-13-i386', 'Fedora 13 - i386', 'development/13/i386/os',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '13';

insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-13-i386-d', 'Fedora 13 - i386 - Debug', 'development/13/i386/debug',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '13';

insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-13-i386-tu', 'Fedora 13 - i386 - Test Updates', 'updates/testing/13/i386/',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '13';

insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-13-i386-tud', 'Fedora 13 - i386 - Test Updates Debug', 'updates/testing/13/i386/debug/',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '13';

insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-13-x86_64', 'Fedora 13 - x86_64', 'development/13/x86_64/os',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '13';

insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-13-x86_64-d', 'Fedora 13 - x86_64 - Debug', 'development/13/x86_64/debug',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '13';

insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-13-x86_64-tu', 'Fedora 13 - x86_64 - Test Updates', 'updates/testing/13/x86_64/',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '13';

insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-13-x86_64-tud', 'Fedora 13 - x86_64 - Test Updates Debug', 'updates/testing/13/x86_64/debug/',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '13';

Update App DB for a release going final

When a Fedora release goes final, the repositories for it change where they live. The repo definitions allow the App browser to sync information from the yum repositories. The PackageDB needs to be updated for the new areas:

BEGIN;
insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-14-i386-u', 'Fedora 14 - i386 - Updates', 'updates/14/i386/',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '14';
insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-14-i386-ud', 'Fedora 14 - i386 - Updates Debug', 'updates/14/i386/debug/',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '14';
update repos set url = 'releases/14/Everything/i386/os/' where shortname = 'F-14-i386';
update repos set url = 'releases/14/Everything/i386/debug/' where shortname = 'F-14-i386-d';
insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-14-x86_64-u', 'Fedora 14 - x86_64 - Updates', 'updates/14/x86_64/',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '14';
insert into repos (shortname, name, url, mirror, active, collectionid)
  select 'F-14-x86_64-ud', 'Fedora 14 - x86_64 - Updates Debug', 'updates/14/x86_64/debug/',
         'http://download.fedoraproject.org/pub/fedora/linux/', true, c.id
  from collection as c where c.name = 'Fedora' and c.version = '14';
update repos set url = 'releases/14/Everything/x86_64/os/' where shortname = 'F-14-x86_64';
update repos set url = 'releases/14/Everything/x86_64/debug/' where shortname = 'F-14-x86_64-d';
COMMIT;

Orphaning all the packages for a user

This can be done in the database if you don’t want to send email:

$ ssh db02
$ sudo -u postgres psql pkgdb
pkgdb> select * from packagelisting where owner = 'xulchris';
pkgdb> -- Check that the list doesn't look suspicious.... There should be a record for every fedora release * package
pkgdb> BEGIN;
pkgdb> update packagelisting set owner = 'orphan', statuscode = 14 where owner = 'xulchris';
pkgdb> -- If the right number of rows were changed
pkgdb> COMMIT;

Note: Note that if you do it via pkgdb-client or the python-fedora API instead, you’ll want to only orphan the packages on non-EOL branches that exist to cut down on the amount of email that’s sent. That entails figuring out what branches you need to do this on.

Package Review SOP

Contents

1. Contact Information 2. Introduction 3. Determine Category 4. Cron Job

Contact Information

Owner sysadmin-main
Contact #fedora-admin, #fedora-noc or [email protected]
Location Phoenix DC
Server(s) sundries01.phx2.fedoraproject.org
Purpose To explain the overall function of this page, where, and how it gets its information.

Introduction

The Cached Package Review Tracker is used to collect, organize and allow searching through tickets. Organization includes the following ‘categories’:
• Trivial
• New
• New EPEL
• Needsponsor
• Hidden
• Under Review
Each ticket references a source RH Bugzilla Bug entry and generates the categories of tickets as stated above, based off multiple field values, for easier viewing and report generation. The page also includes searchable fields allowing a search by package name, or by email address for reviews, for packages, or for commented reviews.


Pagure Infrastructure SOP

Pagure is a code hosting and management site.

Contents

1. Contact Information
2. Description
3. When unresponsive
4. Git repo locations
5. Services and what they do

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-apps
Location OSUOSL
Servers pagure01, pagure-stg01
Purpose Source code and issue tracking

Description

Pagure ( https://pagure.io/pagure ) is a source code management and issue tracking application. It’s written in Flask. It uses: celery, redis, postgresql, and pygit2.

When unresponsive

Sometimes pagure will stop responding, even though it’s still running. You can issue a ‘systemctl reload httpd’ and that will usually get it running again.

Git repo locations

• Main repos are in /srv/git/repositories/
• issue/ticket repos are under /srv/git/repositories/tickets/
• Docs are under /srv/git/repositories/docs/
• Releases (not a git repo) are under /var/www/releases/

Services and what they do

• pagure service is the main flask application, it runs from httpd wsgi.
• pagure_ci service talks to jenkins or other CI for testing PRs.
• pagure_ev service talks to websockets and updates issues and comments live for users.
• pagure_loadjson service takes issue loads from pagure-importer and processes them.
• pagure_logcom service handles logging.
• pagure_milter processes email actions.
• pagure_webhook service processes webhooks to notify about changes.
• pagure_worker service updates git repos with changes.

Useful commands

This section lists commands that can be useful to fix issues encountered every once in a while.

• Recompile the gitolite configuration file

# sudo -u git HOME=/srv/git/ gitolite compile && sudo -u git HOME=/srv/git/ gitolite trigger POST_COMPILE

• Duplicated projects

We have observed that every so often two different workers create a project in the database. This leads to pagure failing to give access to the project, as it finds multiple projects with the same namespace/name where it expects these to be unique. The following two SQL commands allow finding out which projects are in this situation:

select user_id, name, namespace, is_fork from projects where is_fork = FALSE group by namespace, name, is_fork, user_id having count(user_id) > 1;

select user_id, name, namespace, is_fork from projects where is_fork = TRUE group by namespace, name, is_fork, user_id having count(user_id) > 1;

This will return you the namespace/name as well as the user_id of the user who duplicated the projects in the database. You can then do:

select id, user_id, name, namespace, is_fork from projects where name = ‘’ order by user_id;

In that query you will see the project id, user_id, name and namespace of the project. You will see in this that one of the projects is listed twice with the same user_id (the one returned in the previous query). From there, you will have to delete the duplicates (potentially the one with the highest project id).

If the project remains inaccessible, check the apache logs; it could be that the git repositories have not been created. In that case, the simplest course of action is to delete all the duplicates and let the users re-create the projects as they wish.

PDC SOP

Store metadata about composes we produce and "component groups".
App: https://pdc.fedoraproject.org/
Source for frontend: https://github.com/product-definition-center/product-definition-center
Source for backend: https://github.com/fedora-infra/pdc-updater


Contact Information

Owner: Release Engineering, Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-releng, #fedora-admin, #fedora-noc
Servers: pdc-web0{1,2}, pdc-backend01
Purpose: Store metadata about composes and "component groups"

Description

The Product Definition Center (PDC) is a webapp and API designed for storing and querying product metadata. We automatically populate our instance with data from our existing releng tools/processes. It doesn't do much on its own, but the goal is to enable us to develop more sane tooling down the road for future releases.
The webapp is a Django app running on pdc-web0{1,2}. Unlike most of our other apps, it does not use OpenID for authentication; it instead uses SAML2, via mod_auth_mellon (in cooperation with ipsilon). The webapp allows new data to be POSTed to it by admin users.
The backend is a fedmsg-hub process running on pdc-backend01. It listens for new composes over fedmsg and then POSTs data about those composes to PDC. It also listens for changes to the fedora atomic host git repo in pagure and updates "component groups" in PDC to reflect which rpm components constitute fedora atomic host.
For a long-winded history and explanation, see the original Change document: https://fedoraproject.org/wiki/Changes/ProductDefinitionCenter
NOTE: PDC is being replaced by fpdc (Fedora Product Definition Center).

Upgrading the Software

There is an upgrade playbook in playbooks/manual/upgrade/pdc.yml which will upgrade both the frontend and the backend if new packages are available. Database schema upgrades should be handled automatically with a run of that playbook.

Logs

Logs for the frontend are in /var/log/httpd/error_log on pdc-web0{1,2}. Logs for the backend can be accessed with journalctl -u fedmsg-hub -f on pdc-backend01.

Restarting Services

The frontend runs under apache. So either apachectl graceful or systemctl restart httpd should do it. The backend runs as a fedmsg-hub, so systemctl restart fedmsg-hub should restart it.

Scripts

The pdc-updater package (installed on pdc-backend01) provides three scripts:
• pdc-updater-audit
• pdc-updater-retry


• pdc-updater-initialize
One possible failure scenario is that we lose a fedmsg message and the backend does not update the frontend with info about that compose. To detect this, we provide the pdc-updater-audit command (which is run once daily by cron, with emails sent to the releng-cron list). It compares all of the entries in PDC with all of the entries in kojipkgs and raises an alert if there is a discrepancy.
Another possible failure scenario is that the fedmsg message is published and received correctly, but there is some processing error while handling it: the event occurred, but the import to the PDC db failed. The pdc-updater-audit script should detect this discrepancy, and an admin will then need to manually repair the problem and retry the event with the pdc-updater-retry command.
If doomsday occurs and the whole thing is totally hosed, you can delete the db and re-ingest all information available from releng with the pdc-updater-initialize tool. (Creating the initial schema needs to happen on pdc-web01 with the standard Django settings.py commands.)

Manually Updating Information

In general, you shouldn’t have to do these things. pdc-updater will automatically create new releases and update information, but if you ever need to manipulate PDC data, you can do it with the pdc-client tool. A copy is installed on pdc-backend01 and there are some credentials there you’ll need, so ssh there first. Make sure that you are root so that you can read /etc/pdc.d/fedora.json. Try listing all of the releases:

$ pdc -s fedora release list

Deactivating an EOL release:

$ pdc -s fedora release update fedora-21-updates --deactivate

Note: There are lots more attributes you can manipulate on a release (you can change the type, rename them, etc.). See pdc --help and pdc release --help for more information.

Listing all composes:

$ pdc -s fedora compose list

We’re not sure yet how to flag a compose as the Gold compose, but when we do, the answer should appear here: https://github.com/product-definition-center/product-definition-center/issues/428

Adding superusers

A small group of release engineers need to be superusers in order to set EOL dates and add/remove components. You can grant them permission to do this via some direct database calls. First find the email address listed in their FAS account, then log in to db01.phx2.fedoraproject.org and run:

sudo -u postgresql psql pdc

pdc-# update kerb_auth_user set is_superuser = 'true' where email = 'usersemailfromfas';

The user will now have these privileges with their normal tokens.
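As a hedged follow-up check (assuming kerb_auth_user carries the standard Django auth columns), you can confirm the flag took effect with:

select username, email, is_superuser from kerb_auth_user where email = 'usersemailfromfas';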


Pesign upgrades/reboots

Fedora currently has 2 special builders. These builders are used to build a small set of packages that need to be signed for Secure Boot. These packages include: grub2, shim, kernel, and pesign-test-app. When rebooting or upgrading pesign on these machines, you have to follow a special process to unlock the signing keys.

Contact Information

Owner: Fedora Release Engineering, Kernel/grub2/shim/pesign maintainers
Contact: #fedora-admin, #fedora-kernel
Servers: bkernel01, bkernel02
Purpose: Upgrade or restart signing keys on kernel/grub2/shim builders

Procedure

0. Coordinate with the pesign maintainers or pesign-test-app committers, as well as the releng folks who have the pin to unlock the signing key.
1. Remove the builder from koji:

koji disable-host bkernel01.phx2.fedoraproject.org

2. Make sure all builds have completed. 3. Stop existing processes:

service pcscd stop
service pesign stop

4. Perform updates or reboots. 5. Restart services (if you didn’t reboot):

service pcscd start
service pesign start

6. Unlock signing key:

pesign-client -t "OpenSC Card (Fedora Signer)" -u
(enter the pin when prompted)

7. Make sure no builds are in progress, then Re-add builder to koji, remove other builder:

koji enable-host bkernel01.phx2.fedoraproject.org
koji disable-host bkernel02.phx2.fedoraproject.org

8. Have a committer send a build of pesign-test-app and make sure it's signed correctly (a verification sketch follows below).
9. If so, repeat the process with the second builder.
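A minimal verification sketch for step 8, assuming you have a signed binary from the build available locally; the path is hypothetical and pesign -S is used here only as one possible way to inspect the signature:

pesign -S -i /path/to/signed-binary-from-the-build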


Planet Subgroup Infrastructure SOP

Fedora’s planet infrastructure produces planet configs out of users’ ~/.planet files in their homedirs on fedo- rapeople.org. You can also create subgroups of users into other planets. This document explains how to setup new subgroups.

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Servers: batcave01 / planet.fedoraproject.org
Purpose: Provide easy setup of new planet groups on planet.fedoraproject.org

The Setup

1. On batcave01:

cp -a configs/system/planet/grouptmpl configs/system/planet/newgroupname

2. cd to the new directory.
3. Run:

perl -pi -e "s/%%groupname/newgroupname/g" fpbuilder.conf base_config planet-group.cron templates/*

replacing newgroupname with the group name you want.
4. git add the whole directory.
5. Edit manifests/services/planet.pp.
6. Copy and paste everything from the beginning to the end of the design team group, to use as a template.
7. Modify what you copied, replacing design with the new group name.
8. Save it.
9. Check everything in.
10. Run ansible on planet and check if it works.

Use

Tell the requester to then copy their current .planet file to .planet.newgroupname. For example, with the design team:

cp ~/.planet ~/.planet.design

This will then show up on the new feed - http://planet.fedoraproject.org/design/

Private fedorahosted tickets Infrastructure SOP

Provides for users only viewing tickets they are involved with.


Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-hosted
Servers: hosted1
Purpose: Provides for users only viewing tickets they are involved with.

Description

Fedora Hosted projects have the option of setting ticket permissions so that only users involved with a ticket can see it. This plugin requires someone in sysadmin-hosted to set it up, and requires justification to use. The only current implementation is a request tracking system at https://fedorahosted.org/famnarequests for tracking requests from North American ambassadors, since mailing addresses, etc. will be put in there.

Implementation

On hosted1:

sudo -u apache vim /srv/web/trac/projects//conf/trac.ini

Add the following to the appropriate sections of trac.ini:

[privatetickets]
group_blacklist = anonymous, authenticated

[components]
privatetickets.* = enabled

[trac]
permission_policies = PrivateTicketsPolicy, DefaultPermissionPolicy, LegacyAttachmentPolicy

Note: For projects not currently using plugins, you’ll have to add the [components] section, and you’ll need to add the permission_policies to the [trac] section.

Next, someone with TRAC_ADMIN needs to grant TICKET_VIEW_SELF (a new permission) to authenticated. This permission allows users to view tickets on which they are the owner, CC, or reporter. There are other options, more fully described at the upstream site. Make sure that TICKET_VIEW is removed from anonymous, or else this plugin will have no effect.
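A sketch of those permission changes using trac-admin on hosted1; the project path is an example and the exact environment path may differ:

sudo -u apache trac-admin /srv/web/trac/projects/yourproject permission add authenticated TICKET_VIEW_SELF
sudo -u apache trac-admin /srv/web/trac/projects/yourproject permission remove anonymous TICKET_VIEW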

Fedora Infrastructure Machine Classes

Contact Information

Owner: sysadmin-main, application developers
Contact: sysadmin-main
Location: Everywhere we have machines
Servers: publictest, dev, staging, production
Purpose: Explain our use of various types of machines

Introduction

This document explains what the various types of machines are used for in the life cycle of providing an application or resource.

Public Test machines

Publictest instances are used for early investigation into a resource or application. At this stage the application might not be packaged yet, and we want to see if it's worth packaging and starting it on the process to be available in production. These machines are accessible to anyone in the sysadmin-test group, and coordination of use of instances is done on an ad hoc basis. These machines are cleanly re-installed every cycle, so all work must be saved before this occurs. Authentication must not be against the production FAS server; we have fakefas.fedoraproject.org set up for these systems instead.

Note: We’re planning on merging publictest into the development servers. Environment-wise they’ll be mostly the same (one service per machine, a group to manage them, no proxy interaction, etc) Service by service we’ll assign timeframes to the machines before being rebuilt, decommissioned if no progress, etc.

Development

These instances are for applications that are packaged and being investigated for deployment. Typically packages and config files are modified locally to get the application or resource working. No caching or proxies are used. Access is granted to a specific sysadmin group for that application or resource. These instances can be re-installed on request to 'start over' getting the configuration ready.
Some services hosted on dev systems are for testing new programs. These will usually be associated with an RFR and have a limited lifetime before the new service has to prove itself worthy of continued testing, be moved on to stg, or have the machine decommissioned. Other services are for developing existing services. They are handy if the setup of the service is tricky or lengthy and the person in charge wants to maintain the .dev server so that newer contributors don't have to perform that setup in order to work on the service.
Authentication must not be against the production FAS server; we have fakefas.fedoraproject.org set up for these systems instead.

Note: fakefas will be renamed fas01.dev at some point in the future

Staging

These instances are used to integrate the application or resource into ansible as well as proxy and caching setups. These instances should use ansible to deploy all parts of the application or resource possible. Access to these instances is only to a sysadmin group for that application, who may or may not have sudo access. Permissions on stg mirror permissions on production (for instance, sysadmin-web would have access to the app servers in stg the same as production).


Production

These instances are used to serve the ready for deployment application to the public. All changes are done via ansible and access is restricted. Changes should be done here only after testing in staging.

RabbitMQ SOP

RabbitMQ is the message broker Fedora uses to allow applications to send each other (or themselves) messages.

Contact Information

Owner

Fedora Infrastructure Team

Contact

#fedora-admin

Servers

• rabbitmq0[1-3].phx2.fedoraproject.org • rabbitmq0[1-3].stg.phx2.fedoraproject.org

Purpose

General purpose publish-subscribe message broker as well as application-specific messaging.

Description

RabbitMQ is a message broker written in Erlang that offers a number of interfaces including AMQP 0.9.1, AMQP 1.0, STOMP, and MQTT. At this time only AMQP 0.9.1 is made available to clients. Fedora uses the RabbitMQ packages provided by the Red Hat Openstack repository as it has a more up-to-date version.

The Cluster

RabbitMQ supports clustering a set of hosts into a single logical message broker. The Fedora cluster is composed of 3 nodes, rabbitmq01-03, in both staging and production. groups/rabbitmq.yml is the playbook that deploys the cluster.

Virtual Hosts

The cluster contains a number of virtual hosts. Each virtual host has its own set of resources - exchanges, bindings, queues - and users are given permissions by virtual host.


/pubsub

The /pubsub virtual host is the generic publish-subscribe virtual host used by most applications. Messages published via AMQP are sent to the “amq.topic” exchange. Messages being bridged from fedmsg into AMQP are sent via “zmq.topic”.

/public_pubsub

This virtual host has the "amq.topic" and "zmq.topic" exchanges from /pubsub federated to it, and we allow anyone on the Internet to connect to this virtual host. For the moment it is on the same broker cluster, but if people abuse it, it can be moved to a separate cluster.

Authentication

Clients authenticate to the broker using x509 certificates. The common name of the certificate needs to match the username of a user in RabbitMQ.
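To check that the CN on a client certificate maps to an existing user with rights on the /pubsub vhost, something like the following (run on any cluster node) should be enough:

rabbitmqctl list_users
rabbitmqctl list_permissions -p /pubsub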

Troubleshooting

RabbitMQ offers a CLI, rabbitmqctl, which you can use on any node in the cluster. It also offers a web interface for management and monitoring, but that is not currently configured.

Network Partition

In case of a network partition, the RabbitMQ cluster should handle it and recover on its own. If it doesn't once the network situation is fixed, the partition can be diagnosed with rabbitmqctl cluster_status. The output should include the line {partitions,[]} (an empty array). If the array is not empty, the first nodes in the array can be restarted one by one, but make sure you give them plenty of time to sync messages after each restart (this can be watched in the /var/log/rabbitmq/rabbit.log file).
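A rough sketch of that recovery sequence; the rabbitmq-server unit name is the usual one on our hosts but is an assumption here:

rabbitmqctl cluster_status                    (look for a non-empty partitions entry)
systemctl restart rabbitmq-server             (on the first partitioned node, one node at a time)
tail -f /var/log/rabbitmq/rabbit.log          (wait for the node to finish syncing before restarting the next one)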

Federation Status

Federation is the process of copying messages from the internal /pubsub vhost to the external /public_pubsub vhost. During network partitions, it has been seen that the federation relaying process does not come back up. The federation status can be checked with the command rabbitmqctl eval 'rabbit_federation_status:status().' on rabbitmq01. It should not return the empty array ([]) but something like:

[[{exchange,<<"amq.topic">>},
  {upstream_exchange,<<"amq.topic">>},
  {type,exchange},
  {vhost,<<"/public_pubsub">>},
  {upstream,<<"pubsub-to-public_pubsub">>},
  {id,<<"b40208be0a999cc93a78eb9e41531618f96d4cb2">>},
  {status,running},
  {local_connection,<<"">>},
  {uri,<<"amqps://rabbitmq01.phx2.fedoraproject.org/%2Fpubsub">>},
  {timestamp,{{2020,3,11},{16,45,18}}}],
 [{exchange,<<"zmq.topic">>},
  {upstream_exchange,<<"zmq.topic">>},
  {type,exchange},
  {vhost,<<"/public_pubsub">>},
  {upstream,<<"pubsub-to-public_pubsub">>},
  {id,<<"c1e7747425938349520c60dda5671b2758e210b8">>},
  {status,running},
  {local_connection,<<"">>},
  {uri,<<"amqps://rabbitmq01.phx2.fedoraproject.org/%2Fpubsub">>},
  {timestamp,{{2020,3,11},{16,45,17}}}]]

If the empty array is returned, the following commands will restart the federation (again on rabbitmq01):

rabbitmqctl clear_policy -p /public_pubsub pubsub-to-public_pubsub
rabbitmqctl set_policy -p /public_pubsub --apply-to exchanges pubsub-to-public_pubsub "^(amq|zmq)\.topic$" '{"federation-upstream":"pubsub-to-public_pubsub"}'

After which, the federation link status can be checked with the same command as before.

rdiff-backup SOP

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Location: Phoenix
Servers: backup03 and others
Purpose: Backups of critical data

Description

We now run an rdiff-backup of all our critical data on a daily basis. This allows us to keep incremental changes over time as well as have a recent copy in case of disaster recovery.
The backups are run from backup03 every day at 22:10 UTC as root. All config is in ansible. The cron job checks out the ansible repo from git, then runs ansible-playbook with the rdiff-backup playbook. This playbook looks at variables to decide which machines and partitions to back up.
• First, machines in the backup_clients group in inventory are operated on. If a host is not in that group it is not backed up via rdiff-backup.
• Next, any machines in the backup_clients group will have their /etc and /home directories backed up by the server running rdiff-backup and using the rdiff-backup ssh key to access the client.
• Next, if any of the hosts in backup_clients have a variable set for host_backup_targets, those directories will also be backed up in the same manner as above with the rdiff-backup ssh key.
For each backup an email is sent to sysadmin-backup-members with a summary.
Backups are stored on a netapp volume, so in addition to the incrementals that rdiff-backup provides there are netapp snapshots. This netapp volume is mounted on /fedora_backups and is running dedup on the netapp side.


Rebooting backup03

When backup03 is rebooted, you must restart the ssh-agent and reload the rdiff-backup ssh key into that agent so backups can take place:

sudo -i
ssh-agent -s > sshagent
source sshagent
ssh-add .ssh/rdiff-backup-key

Adding a new host to backups

1. Add the host to the backup_clients inventory group in ansible.
2. If you wish to back up more than /etc and /home, add a variable to inventory/host_vars/fqdn like: host_backup_targets: ['/srv']
3. On the client to be backed up, install rdiff-backup.
4. On the client to be backed up, install the rdiff-backup ssh public key to /root/.ssh/authorized_keys. It should be restricted with a from= option:

from="10.5.126.161,192.168.1.64"

and command can be restricted to:

command="rdiff-backup --server --restrict-update-only"
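Put together, the resulting entry in /root/.ssh/authorized_keys looks roughly like this (key material shortened for the example):

from="10.5.126.161,192.168.1.64",command="rdiff-backup --server --restrict-update-only" ssh-dss AAAAB3Nza... root@backup03-rdiff-backup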

Restoring from backups

rdiff-backup keeps a copy of the most recent version of files on disk, so if you wish to restore the last backup copy, simply rsync from backup03. If you want an older incremental, see the rdiff-backup man page for how to specify the exact time.

Retention

Backups are currently kept forever, but likely down the road we will look at pruning them some to match available space.

Public_key: ssh-dss AAAAB3NzaC1kc3MAAACBAJr3xqn/hHIXeth+NuXPu9P91FG9jozF3Q1JaGmg6szo770rrmhiSsxso/Ibm2mObqQLCyfm/qSOQRynv6tL3tQVHA6EEx0PNacnBcOV7UowR5kd4AYv82K1vQhof3YTxOMmNIOrdy6deDqIf4sLz1TDHvEDwjrxtFf8ugyZWNbTAAAAFQCS5puRZF4gpNbaWxe6gLzm3rBeewAAAIBcEd6pRatE2Qc/dW0YwwudTEaOCUnHmtYs2PHKbOPds0+Woe1aWH38NiE+CmklcUpyRsGEf3O0l5vm3VrVlnfuHpgt/a/pbzxm0U6DGm2AebtqEmaCX3CIuYzKhG5wmXqJ/z+Hc5MDj2mn2TchHqsk1O8VZM+1Ml6zX3Hl4vvBsQAAAIALDt5NFv6GLuid8eik/nn8NORd9FJPDBJxgVqHNIm08RMC6aI++fqwkBhVPFKBra5utrMKQmnKs/sOWycLYTqqcSMPdWSkdWYjBCSJ/QNpyN4laCmPWLgb3I+2zORgR0EjeV2e/46geS0MWLmeEsFwztpSj4Tv4e18L8Dsp2uB2Q== root@backup03-rdiff-backup


Container registry SOP

Fedora uses the Docker Distribution container registry to host its container images.
Production instance: https://registry.fedoraproject.org
CDN instance: https://cdn.registry.fedoraproject.org

Contact information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Persons: bowlofeggs, cverna, puiterwijk
Location: Phoenix
Servers: oci-candidate-registry01.phx2.fedoraproject.org, oci-candidate-registry01.stg.phx2.fedoraproject.org, oci-registry01.phx2.fedoraproject.org, oci-registry01.stg.phx2.fedoraproject.org, oci-registry02.phx2.fedoraproject.org
Purpose: Serve Fedora's container images

Configuring all nodes

Run this command from the ansible checkout to configure all nodes in production:

$ sudo rbac-playbook groups/oci-registry.yml

Upgrades

Fedora infrastructure uses the registry packaged and distributed with Fedora. Thus, there is no special upgrade procedure - a simple dnf update will do.

System architecture

The container registry is hosted in a fairly simple design. There are two hosts that run Docker Distribution to serve the registry API, and these hosts are behind a load balancer. These hosts will respond to all requests except for requests for blobs. Requests for blobs will receive a 302 redirect to https://cdn.registry.fedoraproject.org, which is a caching proxy hosted by CDN 77. The primary goal of serving the registry API ourselves is so that we can serve the container manifests over TLS so that users can be assured they are receiving the correct image blobs when they retrieve them. We do not rely on signatures since we do not have a Notary instance. The two registry instances are configured not to cache their data, and use NFS to replicate their shared storage. This way, changes to one registry should appear in the other quickly.
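You can observe the blob redirect behaviour with a plain HTTP client; the image name and digest below are placeholders, not real values:

curl -sI https://registry.fedoraproject.org/v2/fedora/blobs/sha256:DIGEST | head -n 3

The response should be an HTTP 302 with a Location header pointing at https://cdn.registry.fedoraproject.org.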

Troubleshooting

Logs

You can monitor the registry via the systemd journal:


sudo journalctl -f -u docker-distribution

Running out of disk space

We have a nagios check that monitors the available disk space on /srv/registry. An ansible playbook is available to reclaim some disk space if needed:

sudo rbac-playbook manual/oci-registry-prune.yml

This will delete all the images that are older than 30 days on the candidate registries (prod and stg) and then run the garbage collection on the registries server.

Request for resources SOP

Contents

1. Contact Information
2. Introduction
3. Pre sponsorship
4. Planning
5. Development Instance
6. Staging Instance
7. Production deployment
8. Maintenance

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Location: fedoraproject.org/wiki
Servers: dev, stg, production
Purpose: Explains the technical part of Request for Resources

Introduction

Once an RFR has a sponsor and it has been generally agreed to move forward, this SOP describes the technical parts of moving an RFR through the various steps it needs from idea to implementation. For high-level and non-technical requirements, please see the main RFR page. An RFR will go through (at least) the following steps, but note that it can be dropped, removed, or reverted at any time in the process, and that MUST items MUST be provided before the next step is possible.


Pre sponsorship

Until an RFR has a sysadmin-main person who is sponsoring and helping with the request, no further technical action should take place with this SOP. Please see the main RFR SOP to acquire a sponsor and do the steps needed before implementation starts. If your resource requires packages to be complete, please finish your packaging work before moving forward with the RFR (accepted/approved packages in Fedora/EPEL). If your RFR only has a single person working on it, please gather at least one more person before moving forward. Single points of failure are to be avoided.

Requirements for continuing:

• MUST have a RFR ticket. • MUST have the ticket assigned and accepted by someone in infrastructure sysadmin-main group.

Planning

Once a sponsor is acquired and all needed packages have been packaged and are available in EPEL, we move on to the planning phase. In this phase, discussion should take place about the application/resource on the infrastructure list and IRC. Questions about how the resource could be deployed should be considered:
• Should the resource be load balanced?
• Does the resource need caching?
• Can the resource live on its own instance to separate it from more critical services?
• Who all is involved in maintaining and deploying the instance?

Requirements for continuing:

• MUST discuss/note the app on the infrastructure mailing list and answer feedback there. • MUST determine who is involved in the deployment/maintaining the resource.

Development Instance

In this phase a development instance is set up for the resource. This instance is a single virtual host running the needed OS. The RFR sponsor will create this instance and also create a group 'sysadmin-resource' for the resource, adding all responsible parties to the group. It's then up to sysadmin-resource members to set up the resource and test it. Questions asked in the planning phase should be investigated once the instance is up. Load testing and other testing should be performed. Issues like expiring old data, log files, acceptable content, packaging issues, configuration, general bugs, security profile, and others should be investigated.
At the end of this step an email should be sent to the infrastructure list explaining the testing done and inviting comment. Also, the security officer should be informed that a new service will need a review in the near future. In the request for the security audit, please add the results of a self-evaluation against the Application Security Policy. Any deviations from the policy must be noted in the request for audit.

Requirements for continuing:

• MUST have RFR sponsor sign off that the resource is ready to move to the next step. • MUST have answered any outstanding questions on the infrastructure list about the resource. Decisions about caching, load balancing and how the resource would be best deployed should be determined.


• MUST add any needed SOPs for the service. Should there be an update SOP? A troubleshooting SOP? Any other tasks that might need to be done to the instance when those who know it well are not available?
• MUST perform a self-evaluation against the Application Security Policy.
• MUST tag the security officer in the ticket so an audit can be scheduled, including the results of the Security Policy evaluation.

Staging Instance

The next step is to create a staging instance for the resource. In this step the resource is fully added to Ansible/configuration management. The resource is added to caching/load balancing/databases and tested in this new environment. Once initial deployment is done and tested, another email to the infrastructure list is sent to note that the resource is available in staging. The security officer should be informed as soon as the code is reasonably stable, so that they can start the audit or delegate it to someone.

Requirements for continuing:

• MUST have sign off of RFR sponsor that the resource is fully configured in Ansible and ready to be deployed. • MUST have a deployment schedule for going to production. This will need to account for things like freezes and availability of infrastructure folks. • MUST have an approved audit by the security officer or appointed delegate.

Production deployment

Finally the staging changes are merged over to production and the resource is deployed. Monitoring of the resource is added and confirmed to be effective.

Maintenance

The resource will then follow the normal rules for production. Honoring freezes, updating for issues or security bugs, adjusting for capacity, etc.

Ticket comment template

You can copy/paste this template into your RFR ticket. Keep the values empty until you know the answers - you can go back later and edit the ticket to fill in information as it develops.

Phase I
• Software:
• Advantage for Fedora:
• Sponsor:

Phase II
• Email list thread:
• Upstream source:
• Development contacts:
• Maintainership contacts:
• Load balanceable:
• Caching:

Phase III
• SOP link:
• Application Security Policy self-evaluation:
• Audit request: (can be same)
• Audit timeline: <04-11-2025 - 06-11-2025> (timeline to be provided by the security officer upon audit request)

Phase IV
• Ansible playbooks:
• Fully rebuilt from ansible:
• Production goal: <08-11-2025>
• Approved audit:

resultsdb SOP

Store results from Fedora CI, OpenQA and other test systems.

Contact Information

Owner: Fedora QA Devel, Fedora Infrastructure Team
Contact: #fedora-qa, #fedora-admin, #fedora-noc
Location: PHX2
Servers: resultsdb-dev01.qa, resultsdb-stg01.qa, resultsdb01.qa
Purpose: Store results from Fedora CI, OpenQA and other test systems

Architecture

ResultsDB as a system is made up of two parts - a results storage API and a simple html based frontend for humans to view the results accessible through that API (resultsdb and resultsdb_frontend).

Deployment

The only part of resultsdb deployment that isn't currently in the ansible playbooks is database initialization (disabled due to a bug). Once the resultsdb app has been installed, initialize the database by running:

resultsdb init_db


Updating

Database schema changes are not currently supported with resultsdb, and the app can be updated like any other web application (a minimal sketch follows below):
• update the app
• restart httpd
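A minimal sketch of that procedure on a resultsdb host, assuming the app is installed from the resultsdb package:

sudo dnf update resultsdb
sudo systemctl restart httpd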

Backup

All important information in ResultsDB is stored in its database - backing up that database is sufficient for backup, and restoring that database from a snapshot is sufficient for restore.

retrace SOP

Retrace server - provides complete tracebacks for unhandled crashes and shows aggregated information for developers.

Contact Information

Owner: Fedora QA Devel, Fedora Infrastructure Team, ABRT team
Contact: #abrt, #fedora-admin, #fedora-noc
Servers: retrace*, faf*
Purpose: Provides complete tracebacks for unhandled crashes and shows aggregated information for developers.

Description

The physical server (retrace.fedoraproject.org) runs two main services: retrace-server and FAF.

Retrace-server

The upstream for retrace-server lives at: https://github.com/abrt/retrace-server
When a user has the ABRT client installed and a process crashes with an unhandled exception (e.g., a traceback or core dump), the user can send a request to retrace-server. The server will install the same set of packages plus debuginfo, and will return a traceback to the user that includes function names instead of plain pointers. This information is useful for debugging.
The upstream retrace-server allows users to upload coredumps through a web interface, but the Fedora instance disables this feature.

FAF

When a user decides to report a crash, data is sent to FAF. ABRT can also be configured to send microreports automatically, if desired.


FAF can aggregate similar reports into one entity (called a Problem). FAF provides a nice web interface for developers, allowing them to see crashes of their packages. It lives at: https://retrace.fedoraproject.org/faf/

Playbook

The playbook is split into several roles. There are two main roles:
• abrt/faf
• abrt/retrace
These roles are copied from upstream; you should never update them directly. A new version can be fetched from upstream using:

# cd ansible/abrt
# rm -rf faf retrace
# ansible-galaxy install -f -r requirements.yml --ignore-errors -p ./

You should review the new differences, then commit and push. Then there are some roles which are local to our instance:
• abrt/faf-local - This is run before abrt/faf.
• abrt/retrace-local - This is run after abrt/retrace.
• abrt/retrace-local-pre - This is run before abrt/retrace.

Services

FAF and retrace-server are web applications; only httpd is required.

Cron

FAF and retrace-server each have cron tasks. They are not installed under /etc/cron* but as user cron jobs for the 'faf' and 'retrace' users. You can list those crons using:
• sudo -u faf crontab -l
• sudo -u retrace crontab -l
All cron jobs should be Ansible managed. If you delete a cron job from Ansible, make sure it does not remain on the server (not always possible with state=absent).

Directories

• /srv/ssd - fast disk, used for PostgreSQL storage • /srv - big fat disk, used for storing packages. Mainly: - /srv/faf/lob - /srv/retrace • /srv/faf/db-backup/ - Daily backups of DB. No rotating yet. Needs to be manually deleted occasionally. • /srv/faf/lob/InvalidUReport/ - Invalid reports, can be pretty big. No automatic removal too. Need to be purged manually occasionally.


Front-page

The main web page is handled by the abrt-server-info-page package, which can be controlled using: /usr/lib/python2.7/site-packages/abrt-server-info-page/config.py

DB

Only FAF uses a database. We use our own instance of PostgreSQL. You can connect to it using: sudo -u faf psql faf

ReviewBoard Infrastructure SOP

Review Board is a powerful web-based code review tool that offers developers an easy way to handle code reviews. It scales well from small projects to large companies and offers a variety of tools to take much of the stress and time out of the code review process.

Contents

1. Contact Information
2. File Locations
3. Troubleshooting and Resolution
   • Restarting
4. Create a new repository in ReviewBoard
   • Creating a new git repository
   • Creating a new bzr repository
   • Create a default reviewer for a repository

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, sysadmin-hosted
Location: ServerBeach
Servers: hosted[1-2]
Purpose: Provide our fedorahosted users a way to review code.

File Locations

Main Config File: hosted[1-2]:/srv/reviewboard/conf/settings_local.py
ReviewBoard: hosted[1-2]:/etc/httpd/conf.d/fedorahosted.org/reviewboard.conf
Upstream: https://fedorahosted.org/reviewboard/


Troubleshooting and Resolution

Restarting

After an update, to restart reviewboard just restart apache. Doing a service httpd stop and then a service httpd start should do it.

Create a new repository in ReviewBoard

Creating a new git repository

1. Enter the admin interface. If you have admin privilege, a link will be visible in the upper-right corner of the dashboard.
2. In the admin dashboard click "Add" next to "Repositories".
3. For the name, enter the Fedora Hosted project short name. (e.g. if the project is https://fedorahosted.org/sssd, then the repository name should be sssd)
4. "Show this repository" must be checked.
5. Hosting service is "Custom".
6. Repository type is Git.
7. Path should be /srv/git/project_short_name.git (e.g. /srv/git/sssd.git)
8. Mirror path should be git://git.fedorahosted.org/git/project_short_name.git

Note: Mirror path is used by client tools such as post-review to determine to which repository a submission belongs

9. Raw file URL mask should be left blank.
10. Username and Password should both be left blank.
11. The bug tracker URL may vary from project to project, but if they are using the Fedora Hosted Trac bug tracker, it should be:
   • Type: Trac
   • Bug Tracker URL: https://fedorahosted.org/project_short_name (e.g. https://fedorahosted.org/sssd)
12. Do not set a Bug Tracker URL

Creating a new bzr repository

1. Go to the admin dashboard to add a new repository.
2. For the name, enter the Fedora Hosted project short name. (e.g. if the project is https://fedorahosted.org/kitchen, then the repository name should be kitchen)
3. "Show this repository" must be checked.
4. Hosting service is "Custom".
5. Repository type is Bazaar.


6. Path should be /srv/bzr/project_short_name/branch_name (e.g. /srv/bzr/kitchen/devel) - ReviewBoard doesn't understand how to work with repository conventions; it just works on branches.
7. Mirror path should be bzr://bzr.fedorahosted.org/bzr/project_short_name/branch_name

Note: Mirror path is used by client tools such as post-review to determine to which repository a submission belongs

8. Username and Password should both be left blank.
9. The bug tracker URL may vary from project to project, but if they are using the Fedora Hosted Trac bug tracker, it should be:
   • Type: Trac
   • Bug Tracker URL: https://fedorahosted.org/project_short_name (e.g. https://fedorahosted.org/kitchen)
10. Do not set a Bug Tracker URL

Create a default reviewer for a repository

Reviews should be sent to the project development mailing list unless otherwise requested.
1. Enter the admin interface. If you have admin privilege, a link will be visible in the upper-right corner of the dashboard.
2. In the admin dashboard click "Add" next to "Review Groups".
3. Enter the following values:
   • Name: The project short name
   • Display Name: project_short_name Review Group
   • Mailing List: Development discussion list for the project
4. Do not select any users.
5. Return to the main admin dashboard and click on "Add" next to "Default Reviewers".
6. Enter the following values:
   • Name: Something unique and sensible
   • File Regular Expression: enter '.*' (without the quotes)

Note: This means that by default, the mailing list should receive email for reviews of all files in the repository

7. Under "Default groups", select the group you created above and click the arrow pointing right.
8. Do not select any default people.
9. Under "Repositories", select the repository added above and click the arrow pointing right.
10. Save your changes.


SCM Admin SOP

Warning: Most information here (probably 1.4 and later) is not updated for pkgdb2 and therefore not correct anymore.

Contents

1. Creating New Packages
   1. Obtaining process-git-requests
   2. Prerequisites
   3. Running the script
   4. Steps for manual processing
      1. Using pkgdb-client
      2. Using pkgdb2branch
      3. Update Koji
   5. Helper Scripts
      1. mkbranchwrapper
      2. setup_package
   6. Pseudo Users for SIGs
2. Deprecate Packages
3. Undeprecate Packages
4. Performing mass comaintainer requests

Creating New Packages

Package creation is mostly automatic and most details are handled by a script.

Obtaining process-git-requests

The script is not currently packaged; it lives in the rel-eng git repository. You can check it out with:

git clone https://git.fedorahosted.org/git/releng

and keep it up to date by running git pull somewhere in the checked-out tree occasionally, before processing new requests. The script lives in scripts/process-git-requests.


Prerequisites

You must have the python-bugzilla and python-fedora packages installed. Before running process-git-requests, you should run:

bugzilla login

The “Username” you will be prompted for is the email address attached to your bugzilla account. This will obtain a cookie so that the script can update bugzilla tickets. The cookie is good for quite some time (at least a month); if you wish to remove it, delete the ~/.bugzillacookies file. It is also advantageous to have your Fedora ssh key loaded so that you can ssh into pkgs.fedoraproject.org without being prompted for a password. It perhaps goes without saying that you will need unfirewalled and unproxied access to ports 22, 80 and 443 on various Fedora machines.

Running the script

Simply execute the process-git-requests script and follow the prompts. It can provide the text of all comments in the bugzilla ticket for inspection and will perform various useful checks on the ticket and the included SCM request. If there are warnings present, you will need to accept them before being allowed to process the request.
Note that the script only looks at the final request in a ticket; this permits users to tack on a new request at any time and re-raise the fedora-cvs flag. Packagers do not always understand this, though, so it is necessary to read through the ticket contents to make sure that the request matches reality.
After a request has been accepted, the script will create the package in pkgdb (which may require your password) and attempt to log into the SCM server to create the repository. If this does not succeed, the package name is saved, and when you finish processing, a command line will be output with instructions on creating the repositories manually. If you hit Ctrl-C or the script otherwise aborts, you may miss this information. If so, see below for information on running pkgdb2branch.py on the SCM server; you will need to run it for each package you created.

Steps for manual processing

It is still useful to document the process of handling these requests manually in case process-git-requests has issues.
1. Check the Bugzilla ticket to make sure it looks OK.
2. Add the package information to the packagedb with pkgdb-client.
3. Use pkgdb2branch to create the branches on the cvs server.

Warning: Do not run multiple instances of pkgdb2branch in parallel! This will cause them to fail due to mismatching ‘modules’ files. It’s not a good idea to run addpackage, mkbranchwrapper, or setup_package by themselves as it could lead to packages that don’t match their packagedb entry.

4. Update koji.

Using pkgdb-client

Use pkgdb-client to update the pkgdb with new information. For instance, to add a new package:


pkgdb-client edit -u toshio -o terjeros \
    -d 'Python module to extract EXIF information' \
    -b F-10 -b F-11 -b devel python-exif

To update that package later and add someone to the initialcclist do:

pkgdb-client edit -u toshio -c kevin python-exif

To add a new branch for a package:

pkgdb-client edit -u toshio -b F-10 -b EL-5 python-exif

To allow provenpackager to edit a branch:

pkgdb-client edit -u toshio -b devel -a provenpackager python-exif

To remove provenpackager commit rights on a branch:

pkgdb-client edit -u toshio -b EL-5 -b EL-4 -r provenpackager python-exif

More options can be found by running pkgdb-client --help You must be in the cvsadmin group to use pkgdb-client. It can be run on a non-Fedora Infrastructure box if you set the PACKAGEDBURL environment variable to the public URL:

export PACKAGEDBURL=https://admin.fedoraproject.org/pkgdb

Note: You may be asked to CC fedora-perl-devel-list on a perl package. This can be done with the username "perl-sig". This is presently a user, not a group, so it cannot be used as an owner or comaintainer, only for CC.

Using pkgdb2branch

Use pkgdb2branch.py to create branches for a package. pkgdb2branch.py takes a list of package names on the command line and creates the branches that are specified in the packagedb. The script lives in /usr/local/bin on the SCM server (pkgs.fedoraproject.org) and must be run there.
For instance, pkgdb2branch.py python-exif qa-assistant will create the branches specified in the packagedb for python-exif and qa-assistant.
pkgdb2branch can only be run from pkgs.fedoraproject.org.

Update Koji

Optionally you can synchronize pkgdb and koji by hand: it is done automatically hourly by a cronjob. There is a script for this in the admin/ directory of the CVSROOT module. Since dist-f13 and later inherit from dist-f12, and currently dist-f12 is the basis of our stack, it’s easiest to just call:

./owner-sync-pkgdb dist-f12

Just run ./owner-sync-pkgdb with no arguments for usage output. This script requires that you have a properly configured koji client installed.

owner-sync-pkgdb requires the koji client libraries, which are not available on the cvs server, so you need to run this from one of your own machines.

Helper Scripts

These scripts are invoked by the scripts above, doing some of the heavy lifting. They should not ordinarily be called on their own.

mkbranchwrapper

/usr/local/bin/mkbranchwrapper is a shell script which takes a list of packages and branches. For instance:

mkbranchwrapper foo bar EL-5 F-11

will create modules foo and bar for devel if they don't exist and branch them for the other branches passed to the script. If the devel branch exists then it just branches. If no branches are passed, the module is created in devel only.
mkbranchwrapper has to be run from cvs-int.

Important: mkbranchwrapper is not used by any current programs. Use pkgdb2branch instead.

setup_package

setup_package creates a new blank module in devel only. It can be run from any host. To create a new package run:

setup_package foo

setup_package needs to be called once for each package. It could be wrapped in a shell script similar to:

#!/bin/bash

PACKAGES=""
for arg in $@; do
    PACKAGES="$PACKAGES $arg"
done
echo "packages=$PACKAGES"
for package in $PACKAGES; do
    ~/bin/setup_package $package
done

then call the script with all branches after it.

Note: setup_package is currently called from pkgdb2branch.


Pseudo Users for SIGs

See Package_SCM_admin_requests#Pseudo-users_for_SIGs for the current list.

Deprecate Packages

Any packager can deprecate a package: click on the deprecate package button for the package in the web UI. There's currently no pkgdb-client command to deprecate a package.

Undeprecate Packages

Any cvsadmin can undeprecate a package. Simply use pkgdb-client to assign an owner and the package will be undeprecated:

pkgdb-client -o toshio -b devel qa-assistant

As a cvsadmin you can also log into the pkgdb webui and click on the unretire package button. Once clicked, the package will be orphaned rather than deprecated.

Performing mass comaintainer requests

• Confirm that the requestor has 'approveacls' on all packages they wish to operate on. If they do not, they MUST request the change via FESCo.
• Mail maintainers/co-maintainers affected by the change to inform them of who requested the change and why.
• Download a copy of this script: http://git.fedorahosted.org/git/?p=fedora-infrastructure.git;a=blob;f=scripts/pkgdb_bulk_comaint/comaint.py;hb=HEAD
• Edit the script to have the proper package owners and package name pattern.
• Edit the script to have the proper new comaintainers.
• Ask someone in sysadmin-web to disable email sending on bapp01 for the pkgdb (following the instructions in comments in the script).
• Copy the script to an infrastructure host (like cvs01) that can contact bapp01 and run it.

SELinux Infrastructure SOP

SELinux is a fundamental part of our operating system but still has a large learning curve and remains quite intimidating to both developers and system administrators. Fedora's Infrastructure has been growing at an unfathomable rate, and is full of custom software that needs to be locked down. The goal of this SOP is to make it simple to track down and fix SELinux policy related issues within Fedora's Infrastructure. Fully deploying SELinux is still an ongoing task, and can be tracked in fedora-infrastructure ticket #230.

Contents

1. Contact Information
2. Step One: Realizing you have a problem
3. Step Two: Tracking down the violation
4. Step Three: Fixing the violation
   1. Allowing ports
   2. Toggling an SELinux boolean
   3. Setting custom context
   4. Deploying custom policy modules

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main, sysadmin-web groups
Purpose: To ensure that we are able to fully wield the power of SELinux within our infrastructure.

Step One: Realizing you have a problem

If you are trying to find a specific problem on a host, look in the per-host audit.log on our central log server. See the syslog SOP for more information.

Step Two: Tracking down the violation

Generate SELinux policy allow rules from logs of denied operations. This is useful for getting a quick overview of what has been getting denied on the local machine:

audit2allow -la

You can obtain more detailed audit messages by using ausearch to get the most recent violations:

ausearch -m avc -ts recent

Again, see the syslog SOP for more information here.
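If a denial turns out to be legitimate application behaviour that policy should allow, a local policy module can be generated from the same logs. This is only a sketch; the module name is arbitrary:

ausearch -m avc -ts recent | audit2allow -M local_fedora_infra
semodule -i local_fedora_infra.pp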

Step Three: Fixing the violation

Below are examples of using our current ansible configuration to make SELinux deployment changes. These constructs are currently home-brewed, and do not exist in upstream Ansible. For these functions to work, you must ensure that the host or servergroup is configured with ‘include selinux’, which will enable SELinux in permissive mode. Once a host is properly configured, this can be changed to ‘include selinux-enforcing’ to enable SELinux Enforcing mode.

Note: Most services have $service_selinux manpages that are automatically generated from policy.

Toggling an SELinux boolean

SELinux booleans, which can be viewed by running semanage boolean -l, can easily be configured using the following syntax within your ansible configuration.:


seboolean: name=httpd_can_network_connect_db state=yes persistent=yes

Setting custom context

Our infrastructure contains many custom applications, which may utilize non-standard file locations. These issues can lead to trouble with SELinux, but they can easily be resolved by setting custom file context.:

"file: path=/var/tmp/l10n-data recurse=yes setype=httpd_sys_content_t"

Fixing odd errors from the logs

If you see messages like this in the log reports: restorecon:/etc/selinux/targeted/contexts/files/file_contexts: Multiple same/

˓→specifications for /home/fedora. matchpathcon://etc/selinux/targeted/contexts/files/file_contexts: Multiple same//

˓→specifications for /home/fedora.

Then it is likely you have an overlapping file context in your local SELinux context configuration - in this case likely one added by ansible accidentally. To find it, run:

semanage fcontext -l | grep /path/being/complained/about

Sometimes it is just an ordering problem and reversing the entries solves it; other times it is simply an overlap. Look at the contexts and delete the one you do not want, or reorder them. To delete an entry, run:

semanage fcontext -d '/entry/you/wish/to/delete'

This just removes that filecontext - no need to worry about files being deleted. Then rerun the triggering command and see if the problem is solved.

Sigul servers upgrades/reboots

Fedora currently has 1 sign-bridge and 2 sign-vault machines for the primary architectures; there is a similar setup for secondary architectures. When upgrading or rebooting these machines, some special steps must be taken to ensure everything keeps working as expected.

Contact Information

Owner: Fedora Release Engineering
Contact: #fedora-admin, #fedora-noc
Servers: sign-vault03, sign-vault04, sign-bridge02, secondary-bridge01.qa
Purpose: Upgrade or restart sign servers


Description

0. Coordinate with releng on timing. Make sure no signing is happening, and none is planned for a bit.

Sign-bridge02, secondary-bridge01.qa:
1. Apply updates or changes.
2. Reboot the virtual instance.
3. Once it comes back, start the sigul_bridge service and enter an empty password.

Sign-vault03/04:
1. Determine which server is currently primary. It's the one that has the floating IP address for sign-vault02 on it.
2. Log in to the non-primary server via serial or management console. (There is no ssh access to these servers.)
3. Take an lvm snapshot:

lvcreate --size 5G --snapshot --name YYYYMMDD /dev/mapper/vg_signvault04-lv_root

Replace YYYYMMDD with today's year, month, and day, and the vg with the correct name. Then apply updates.

4. Confirm the server comes back up ok, login to serial console or management console and start the sigul_server process. Enter password when prompted. 5. On the primary server, down the floating ip address:

ip addr del 10.5.125.75 dev eth0

6. On the secondary server, up the floating ip address:

ip addr add 10.5.125.75 dev eth0

7. Have rel-eng folks sign some packages to confirm all is working. 8. Update/reboot the old primary server and confirm it comes back up ok.

Note: Changes to database When making any changes to the database (new keys, etc), it’s important to sync the data from the primary to the secondary server. This process is currently manual.

simple-koji-ci

simple-koji-ci is a small service running in our infra cloud that listens for fedmsg messages coming from pagure on dist-git about new pull requests. It then creates an SRPM based on the content of each pull request, kicks off a scratch build in koji, and reports the outcome of that build on the pull request.

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, #fedora-apps
Persons: pingou
Location: the cloud
Servers: simple-koji-ci-dev.fedorainfracloud.org, simple-koji-ci-prod.fedorainfracloud.org
Purpose: Performs scratch builds for pull requests opened on dist-git

Hosts

The current deployment is a single host: simple-koji-ci-prod.fedorainfracloud.org for production and simple-koji-ci-dev.fedorainfracloud.org for staging.

Service

simple-koji-ci is a fedmsg-based service, so it can be turned on or off via the fedmsg-hub service. It interacts with koji via a keytab created by the keytab/service role in ansible.
The configuration of the service (including the weight of the builds kicked off in koji) is located at /etc/fedmsg.d/simple_koji_ci.py.
One can monitor the service using: journalctl -lfu fedmsg-hub

Impact

This service is purely informative; nothing does nor should rely on it. If anything goes wrong, there are no consequences for stopping it.

SSH Access Infrastructure SOP

Contents

1. Contact Information
2. Introduction
3. SSH configuration
4. SSH Agent forwarding
5. Troubleshooting

Contact Information

Owner: sysadmin-main
Contact: #fedora-admin or admin@fedoraproject.org
Location: IAD2
Servers: All IAD2 and VPN Fedora machines
Purpose: Access via ssh to Fedora project machines.


Introduction

This page contains some useful instructions about how you can safely log in to Fedora machines using public key authentication. As of 2011-05-27, all machines require an SSH key for access; password authentication no longer works. Note that this SOP has nothing to do with actually gaining access to specific machines. For that you MUST be in the correct group for shell access to that machine. This SOP simply describes the process once you do have valid and appropriate shell access to a machine.

SSH configuration

First of all, on your local machine:

vi ~/.ssh/config

Note: This file, and any keys, need to be chmod 600, or you will get a "Bad owner or permissions" error. The .ssh directory must be mode 700.

Then, add the following:

Host bastion.fedoraproject.org
    HostName bastion-iad01.fedoraproject.org
    User FAS_USERNAME (all lowercase)
    ProxyCommand none
    ForwardAgent no

Host *.iad2.fedoraproject.org *.qa.fedoraproject.org 10.3.160.* 10.3.161.* 10.3.163.* 10.3.165.* 10.3.167.* *.vpn.fedoraproject.org batcave01
    User FAS_USERNAME (all lowercase)
    ProxyCommand ssh -W %h:%p bastion.fedoraproject.org

How does ProxyCommand work? A connection is established to the bastion host:

+-----+            +--------------+
| you |---ssh--->  | bastion host |
+-----+            +--------------+

The bastion host establishes a connection to the target server:

+--------------+            +--------+
| bastion host |----------> | server |
+--------------+            +--------+

Your client then connects through the Bastion and reaches the target server:

+-----+      +--------------+      +--------+
| you |      | bastion host |      | server |
|     | ===ssh=over=bastion=====>  |        |
+-----+      +--------------+      +--------+
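With that configuration in place, reaching an internal host is a single ssh invocation from your workstation; the bastion hop is handled transparently by the ProxyCommand. For example (the hostname is only an illustration):

    # -vv shows which Host stanza matched and the ProxyCommand being used
    ssh -vv batcave01.iad2.fedoraproject.org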

PuTTY SSH configuration

You can configure PuTTY the same way by doing this:


0. In the Session section type batcave01.fedoraproject.org, port 22.
1. In Connection:Data enter your FAS_USERNAME.
2. In Connection:Proxy add the proxy settings:
   • Proxy Hostname is bastion-iad01.fedoraproject.org
   • Port 22
   • Username FAS_USERNAME
   • Proxy Command plink %user@%proxyhost %host:%port
3. In Connection:SSH:Auth remember to insert the same key file for authentication that you have used on your FAS profile.

SSH Agent forwarding

You should normally have:

ForwardAgent no

for Fedora hosts (this is the default in OpenSSH). You can override this on a per-session basis by using '-A' with ssh. SSH agents could be misused if you connect to a compromised host with forwarding on (the attacker can use your agent to authenticate to anything you have access to as long as you are logged in). Additionally, if you do need SSH agent forwarding (say for copying files between machines), remember to log out as soon as you are done so you do not leave your agent exposed.

Troubleshooting

• 'channel 0: open failed: administratively prohibited: open failed': If you receive this message for a machine proxied through bastion, then bastion was unable to connect to the host. This most likely means you tried to SSH to a nonexistent machine. You can debug this by trying to connect to that machine from bastion.
• If your local username is different from the one registered in FAS, remember to set up a User variable (like above) where you specify your FAS username. If that's missing, SSH will try to log in using your local username and will fail.
• ssh -vv is very handy for debugging which sections are matching and which are not.
• If you get access denied several times in a row, please consult with #fedora-admin. If you try too many times with an invalid config your IP could be added to denyhosts.
• If you are running an OpenSSH version older than 5.4, the -W option is not available. In that case, use the following ProxyCommand line instead:

ProxyCommand ssh -q bastion.fedoraproject.org exec nc %h %p

SSH known hosts Infrastructure SOP

Provides a known hosts file that is globally deployed and publicly available at https://admin.fedoraproject.org/ssh_known_hosts


Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin group
Location: all
Servers: all
Purpose: Provides the known hosts file that is globally deployed.

Adding a host alias to the ssh_known_hosts

If you need to add a host alias to a host in ssh_known_hosts, simply go to the directory for the host in infra-hosts and add a file named host_aliases to the git repo in that directory. Put one alias per line and save. The next time fetch-ssh-keys runs it will add those aliases to known hosts.
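For instance, a host_aliases file is just a plain list of extra names, one per line. The path and alias names below are hypothetical, shown only to illustrate the layout:

    # infra-hosts/<hostname>/host_aliases  (hypothetical example)
    proxy01.fedoraproject.org
    proxy01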

Staging SOP

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main
Location: Mostly in PHX2
Servers: stg
Purpose: Staging environment to test changes to apps and create initial Ansible configs.

Introduction

Fedora uses a set of staging servers for several purposes:
• When applications are initially being deployed, the staging version of those applications is set up on a staging server that is used to create the initial Ansible configuration for the application/service.
• Established applications/services use staging for testing. This testing includes:
  - Bugfix updates
  - Configuration changes managed by Ansible
  - Upstream updates to dependent packages (httpd changes for example)

Goals

The staging servers should be self contained and have all the needed databases and such to function. At no time should staging resources talk to production instances. We use firewall rules on our production servers to make sure no access is made from staging. Staging instances do often use dumps of production databases and data, and thus access to resources in staging should be controlled as it is in production.


DNS and naming

All staging servers should be in the stg.phx2.fedoraproject.org domain. /etc/hosts files are used on stg servers to override dns in cases where staging resources should talk to the staging version of a service instead of the production one. In some cases, one staging server may be aliased to several services or applications that are on different machines in production.
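For example, an /etc/hosts override on a staging host might look like the following; the address and hostnames here are invented for illustration, not taken from our inventory:

    # keep staging apps pointed at the staging database instead of production
    10.3.166.42   db02.stg.phx2.fedoraproject.org  db02.phx2.fedoraproject.org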

Syncing databases

Syncing FAS

Sometimes you want to resync the staging fas server with what’s on production. To do that, dump what’s in the production db and then import it into the staging db. Note that resyncing the information will remove any of the information that has been added to the staging fas servers. So it’s good to mention that you’re doing this on the infra list or to people who you know are working on the staging fas servers so they can either save their changes or ask you to hold off for a while. On db01:

$ ssh db01
$ sudo -u postgres pg_dump -C fas2 | xz -c > fas2.dump.xz
$ scp fas2.dump.xz db02.stg:

On fas01.stg (postgres won’t drop the database if something is accessing it) (ATM, fas in staging is not load balanced so we only have to do this on one server):

$ sudo /etc/init.d/httpd stop

On db02.stg:

$ echo 'drop database fas2' | sudo -u postgres psql
$ xzcat fas2.dump.xz | sudo -u postgres psql

On fas01.stg:

$ sudo /etc/init.d/httpd start

Other databases behave similarly.

External access

There is http/https access from the internet to staging instances to allow testing. Simply replace the production resource domain with stg.fedoraproject.org and it should go to the staging version (if any) of that resource.

Ansible and Staging

All staging machine configuration is now in the same branch as master/production. There is a 'staging' environment: the Ansible variable "env" is equal to "staging" in playbooks for staging things. This variable can be used to differentiate between production and staging systems.


Workflow for staging changes

1. If you don't need to make any Ansible-related config changes, don't do anything (i.e. a new version of an app that uses the same config files, etc.). Just update on the host and test.
2. If you need to make Ansible changes, either in the playbook of the application or outside of your module:
• Make use of files ending with .staging (see resolv.conf in global for an example, and the sketch after this list). So, if there are persistent differences between staging and production, like a different config file, use this.
• Conditionalize on environment:

- name: your task
  ...
  when: env == "staging"

- name: production-only task
  ...
  when: env != "staging"

• These changes can stay if they are helpful for further testing down the road. Ideally, in the normal case, staging and production are configured in the same host group from the same Ansible playbook.
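A minimal sketch of the .staging-file pattern, assuming a role that installs resolv.conf; the task and file names are illustrative, not the actual role contents:

    # pick the .staging variant of a file when running against staging hosts
    - name: install resolv.conf
      copy:
        src: "{{ 'resolv.conf.staging' if env == 'staging' else 'resolv.conf' }}"
        dest: /etc/resolv.conf
        owner: root
        group: root
        mode: "0644"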

Time limits on staging changes

There is no hard limit on time spent in staging, but where possible we should limit the time in staging so we are not carrying changes from production for a long time and possibly affecting other staging work.

Fedora Status Service - SOP

Fedora-Status is the software that generates the page at http://status.fedoraproject.org/. This page should be kept up to date with the current status of the services run by Fedora Infrastructure. The page is hosted at AWS.

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, #fedora-noc
Servers: AWS S3/CloudFront
Purpose: Give status information to users about the current status of our public services.
Repository: https://github.com/fedora-infra/statusfpo

How it works

To keep this website as stable as can be, the page is hosted external to our main infrastructure, in AWS. It is based on an S3 bucket with the files, fronted by a CloudFront distribution for TLS termination and CNAMEs. The website is statically generated using Pelican on your local machine, and then pushed to S3.


Adding and changing outages

Making Changes

Before pushing changes live to S3, use Pelican's devserver to stage and view changes. 1. Install the packages you need to run the devserver with:

sudo dnf install pelican python-packaging

2. Check out the repo at:

[email protected]:fedora-infra/statusfpo.git

3. Run the devserver with:

make devserver

4. View the generated site at http://0.0.0.0:8000. Note that any changes to the content and theme will automatically regenerate. 5. Commit changes (or open a Pull Request) to https://github.com/fedora-infra/statusfpo

Create a new outage

1. Add a markdown file to either content/planned/, content/ongoing/, or content/resolved/. The name of the file needs to be unique, so check the resolved outages for an idea of how to name your file. 2. Add your outage notice to the markdown file, for example:

Title: Buzzilla Slow
Date: 2021-04-28 10:22+0000
OutageFinish: 2021-04-28 13:30+0000
Ticket: 123456

A swarm of bees have taken up residence in one of the Buzzilla Server rooms. Consequently, some requests to Buzzilla may respond slower than usual. An apiarist has been called to capture and relocate the swarm.

• Note that OutageFinish is optional, but should really only be omitted if the projected or actual outage end time is unknown.
• When providing dates, keep the timezone offset at +0000 / UTC datetimes.

Moving an outage

To move an outage, say from Planned to Ongoing simply move the markdown file into a different status directory in content/, and regenerate.

Publishing

Only members of sysadmin-main and people given the AWS credentials can update the status website.


Initial Configuration for Publishing

1. First, install the AWS command line tool with:

sudo dnf install aws-cli

2. Grab ansible-private/files/aws-status-credentials and store it in ~/.aws/credentials. 3. Run:

aws configure set preview.cloudfront true

Publishing changes live

Once you are satisfied with your changes and how they look on the devserver, and they have been committed to Git, push the built changes live with:

make upload

Note that this command only updates content changes (i.e. adding / moving outages)

Publishing theme changes

If your changes involve changes to the theme, run the following command to upload everything, both content and theme changes, to the live server:

make upload-theme

Renewing SSL certificate

1. Run certbot to generate the certificate and have it signed by Let's Encrypt (you can run this command anywhere certbot is installed; you can use your laptop or certgetter01.phx2.fedoraproject.org):

rm -rf ~/certbot
certbot certonly --agree-tos -m [email protected] --no-eff-email --manual \
    --manual-public-ip-logging-ok -d status.fedoraproject.org -d www.fedorastatus.org \
    --preferred-challenges http-01 --config-dir ~/certbot/conf --work-dir ~/certbot/work \
    --logs-dir ~/certbot/log

2. You will be asked to make a specific file available under a specific URL. In a different terminal, upload the requested file to the AWS S3 bucket:

echo SOME_VALUE > myfile
aws --profile statusfpo s3 cp myfile s3://status.fedoraproject.org/.well-known/acme-challenge/SOME_FILE

3. Verify that the uploaded file is available under the right URL. If the previous certificate has already expired you may need to run curl with the -k option:

curl -kL http://www.fedorastatus.org/.well-known/acme-challenge/SOME_FILE


4. After making sure that curl outputs the expected value, go back to the certbot run and continue by pressing Enter. You will be asked to repeat steps 2 and 3 for the other domain. Note that the S3 bucket name should stay the same. 5. Deploy the generated certificate to AWS. This requires additional permissions on AWS.

Log Infrastructure SOP

Logs are centrally collected on our loghost and managed from there by rsyslog to create several log outputs. Epylog provides twice-daily log reports of activities on our systems. It runs on our central loghost and generates reports on all systems logging centrally.

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-main
Location: Phoenix
Servers: log01.phx2.fedoraproject.org
Purpose: Provides our central logs and reporting

Essential data/locations:

• Logs compiled using rsyslog on log01 into a single set of logs for all systems:

/var/log/merged/

These logs are rotated every day and kept for only 2 days. This set of logs is only used for immediate analysis and more trivial 'tailing' of the log file to watch for events.
• Logs for each system are kept separately in /var/log/hosts. These logs are maintained practically forever, or for as long as we possibly can. They are broken out into a $hostname/$YEAR/$MON/$DAY directory structure so we can locate a specific day's log immediately.
• Log reports generated by epylog are output to /srv/web/epylog/merged. The reports are accessible via a web browser at https://admin.fedoraproject.org/epylog/merged/. This path requires a username and a password to access. To add your username and password you must first join the sysadmin-logs group, then log in to log01.phx2.fedoraproject.org and run this command:

htpasswd -m /srv/web/epylog/.htpasswd $your_username

when prompted for a password please input a password which is NOT YOUR FEDORA ACCOUNT SYSTEM PASSWORD.

Important: Let’s say that again to be sure you got it: DO _NOT_ HAVE THIS BE THE SAME AS YOUR FAS PASSWORD


Configs:

Epylog configs are controlled by ansible - please see the ansible epylog module for more details. Specifically the files in roles/epylog/files/merged/

Generating a one-off epylog report:

If you wish to generate a specific log report you will need to run the following command on log01:

sudo /usr/sbin/epylog -c /etc/epylog/merged/epylog.conf --last 5h

You can replace ‘5h’ with other time measurements to control the amount of time you want to view from the merged logs. This will mail a report notification to all the people in the sysadmin-logs group.

Audit logs, centrally:

We've taken the audit logs and enabled rsyslogd on the hosts to relay the audit log contents to our central log server. Here's how we did that: 1. Modify the SELinux policy so that rsyslogd can read the file(s) in /var/log/audit/audit.log. BEGIN SELinux policy module:

module audit_via_syslog 1.0;

require {
    type syslogd_t;
    type auditd_log_t;
    class dir { search };
    class file { getattr read open };
}

#============= syslogd_t ==============
allow syslogd_t auditd_log_t:dir search;
allow syslogd_t auditd_log_t:file { getattr read open };

END SELinux policy module 2. Add config to rsyslog on the clients to repeatedly send all changes to their audit.log file to the central syslog server as local6:

# monitor auditd log and send out over local6 to central loghost
$ModLoad imfile.so

# auditd audit.log
$InputFileName /var/log/audit/audit.log
$InputFileTag tag_audit_log:
$InputFileStateFile audit_log
$InputFileSeverity info
$InputFileFacility local6
$InputRunFileMonitor

then modify your emitter to the syslog server to send local6.* there
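A minimal sketch of such a forwarding rule in rsyslog; the loghost name and port below are assumptions for illustration, not copied from our configuration:

    # send everything logged to local6 on to the central loghost
    local6.*    @@log01.phx2.fedoraproject.org:514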


3. On the syslog server, set up log destinations for:
• merged audit logs of all hosts (explicitly drop any non-AVC audit messages here); the magic exclude line is:

:msg, !contains, "type=AVC" ~

That line must be directly above the log entry you want to filter, and it has a cascade effect on everything below it unless you disable the filter.
• per-host audit logs: this is everything from audit.log
4. On the syslog server, we can run audit2allow/audit2why on the audit logs sent there by doing this:

grep 'hostname' /var/log/merged/audit.log | sed 's/^.*tag_audit_log: //' | audit2allow

The sed is there to remove the log prefix garbage added by syslog when transferring the message.

Future:

• additional log reports for errors from http processes or servers
• SEC (Simple Event Correlator) to report, immediately, on events from a log stream; available in Fedora/EPEL
• new report modules within epylog

Tag2DistRepo Infrastructure SOP

Contents

1. Contact Information 2. Description 3. Configuration

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Primary upstream contact: Patrick Uiterwijk - FAS: puiterwijk
Location: Phoenix
Servers: bodhi-backend02.phx2.fedoraproject.org
Purpose: Tag2DistRepo is a Fedmsg Consumer that waits for tag operations in specific tags, and then instructs Koji to create Distro Repos.

Description

Tag2DistRepo is a Fedmsg Consumer that waits for tag operations in specific tags, and then instructs Koji to create Distro Repos.


Configuration

Configuration is handled by the bodhi-backend.yaml playbook in Ansible. This can also be used to reconfigure the application, if that becomes necessary.
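A sketch of applying that configuration from batcave01; the exact playbook path under the ansible tree is an assumption here, so confirm it before running:

    sudo rbac-playbook groups/bodhi-backend.yml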

Torrent Releases Infrastructure SOP

http://torrent.fedoraproject.org/ is our master torrent server for Fedora distribution. It runs out of ibiblio.

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-torrent group
Location: ibiblio
Servers: torrent.fedoraproject.org
Purpose: Provides the torrent master server for Fedora distribution

Torrent Release

When you want to add a new torrent to the tracker at http://torrent.fedoraproject.org you need to take the following steps to have it listed correctly: 1. Log in to torrent02.fedoraproject.org. If you are unable to do so please contact the fedora infrastructure group about access. This procedure requires membership in the torrentadmin group. 2. Change the group ID to torrentadmin:

newgrp torrentadmin

3. Remove everything from the working directory /srv/torrent/new/fedora/

rm -r /srv/torrent/new/fedora/*

4. rsync all the ISOs from ibiblio:

rsync -avhHP rsync://download-ib01.fedoraproject.org/fedora-stage/_-

5. Then cd into /srv/torrent/new/fedora/ to change the directory structure

cd /srv/torrent/new/fedora/

6. The directories should be created by removing the Label from each ISO's name:

for iso in $(ls *iso); do dest=$(echo $iso|sed -e 's|-

7. Now copy the checksum’s into the associated directories

for checksum in $(ls *CHECKSUM); do for file in $(grep "SHA256 (" $checksum | sed -e 's|SHA256 (||g' -e 's|-


8. Verify if all the checksums are copied into the right locations

ls */

9. Remove the manifest files and checksums for netinst (since we don't mirror netinst images) and other files:

rm -rf *manifest *netinst* *CHECKSUM *i386 *x86_64

10. Run the maketorrent script from /srv/torrent/new/fedora/

../maketorrent *

Note: Next steps should be run 12 hours before the release time which is generally 14:00 UTC on Tuesday.

11. Grab fedora-torrent-ini.py from the releng scripts and make it executable:

cd ~
wget https://pagure.io/releng/raw/master/f/scripts/fedora-torrent-ini.py
chmod 755 ~/fedora-torrent-ini.py

12. Run the following command from /srv/torrent/new/fedora/

~/fedora-torrent-ini.py _ > _.ini

13. Copy all the torrents to /srv/web/torrents/

cp *torrent /srv/web/torrents/

14. Copy everything in /srv/torrent/new/fedora/ to /srv/torrent/btholding/

cp -rl * /srv/torrent/btholding/

15. Copy the .ini file created in step 12 to /srv/torrent/torrent-generator/

sudo cp _.ini /srv/torrent/torrent-generator/

16. Restart rtorrent and opentracker services

systemctl restart opentracker-ipv4 opentracker-ipv6

sudo -i

su -s /bin/bash torrent

tmux  (or tmux attach if the session is already running)

control-q if rtorrent is already running.

cd /srv/torrent/btholding

rtorrent *.torrent

control-b d  (disconnect from tmux)


Note: For final release, remove all the alpha and beta directories and torrent files corresponding to the release in /srv/torrent/btholding/ directory.

Note: At EOL of a release, remove all the directories and torrent files corresponding to the release in /srv/torrent/btholding/ directory.

Fedora Infra Unbound Notes

Sometimes, especially after updates/reboots you will see alerts like this:

18:46:55 < zodbot> PROBLEM - unbound-tummy01.fedoraproject.org/Unbound 443/tcp is WARNING: DNS WARNING - 0.037 seconds response time (dig returned an error status) (noc01)
18:51:06 < zodbot> PROBLEM - unbound-tummy01.fedoraproject.org/Unbound 80/tcp is WARNING: DNS WARNING - 0.035 seconds response time (dig returned an error status) (noc01)

To correct this, restart unbound on the relevant node (in the example above, unbound-tummy01) by running the restart_unbound Ansible playbook from batcave01:

sudo -i
ansible-playbook /srv/web/infra/ansible/playbooks/restart_unbound.yml --extra-vars="target=unbound-tummy01.fedoraproject.org"

Fedora Infrastructure Kpartx Notes

How to mount virtual partitions

There can be multiple reasons you need to work with the contents of a virtual machine without that machine running:
1. You have decommissioned the system and found you need to get something that was not backed up.
2. The system is for some reason unbootable and you need to change some file to make it work.
3. Forensics work of some sort.
In cases 1 and 2 the following commands and tools are invaluable. In case 3, you should work with the Fedora Security Team and follow their instructions completely.

Steps to Work With Virtual System

1. Find out what physical server the virtual machine image is on.
   A. Log into batcave01.phx2.fedoraproject.org
   B. Search for the hostname in the file /var/log/virthost-lists.out:

$ grep proxy01.phx2.fedoraproject.org /var/log/virthost-lists.out
virthost05.phx2.fedoraproject.org:proxy01.phx2.fedoraproject.org:running:1

C. If the image does not show up in the list then most likely it is an image which has been decommissioned. You will need to search the virtual hosts more directly:


# for i in `awk -F: '{print $1}' /var/log/virthost-lists.out | sort -u`; do
      ansible $i -m shell -a 'lvs | grep proxy01.phx2'
  done

2. Log into the virtual server and make sure the image is shut down. Even in cases where the system is not working correctly it may still have a running qemu on the physical server. It is best to confirm that the box is dead:

   # virsh destroy <guestname>

3. We will be using the kpartx command to make the guest image ready for mounting:

   # lvs | grep <guestname>
   # kpartx -l /dev/mapper/<volumegroup>-<guestname>
   # kpartx -a /dev/mapper/<volumegroup>-<guestname>
   # vgscan
   # vgchange -ay <guest_volumegroup>
   # mount /dev/mapper/<guest_volume> /mnt

4. Edit the files as needed.

5. Tear down the tree:

   # umount /mnt
   # vgchange -an <guest_volumegroup>
   # vgscan
   # kpartx -d /dev/mapper/<volumegroup>-<guestname>
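For illustration, with a hypothetical guest named proxy01.phx2.fedoraproject.org in the vg_guests volume group whose internal volume group is vg_proxy01, the sequence would look roughly like this (all names here are invented for the example):

    lvs | grep proxy01.phx2
    kpartx -l /dev/mapper/vg_guests-proxy01.phx2.fedoraproject.org
    kpartx -a /dev/mapper/vg_guests-proxy01.phx2.fedoraproject.org
    vgscan
    vgchange -ay vg_proxy01
    mount /dev/mapper/vg_proxy01-lv_root /mnt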

Fedora Infrastructure Libvirt Notes

Notes/FAQ on using libvirt/virsh/virt-manager in our environment.

How do I migrate a guest from one virthost to another? Multiple steps:

1. Set up an unpassworded root ssh key to allow communication between the two virthosts as root. This is only temporary, so, while scary, it is not a big deal. Right now, this also means modifying /etc/ssh/sshd_config to permit root login without-password.
2. Determine whatever changes need to be made to the guest. This can be the number of cpus, the amount of memory, or the disk location, as this may not be standard on the current server.
3. Make a dump of the current virtual guest using the virsh command. Use virsh dumpxml --migratable guestname and then edit in any changes in disk layout, memory and cpu needed.
4. Set up storage on the destination end to match the source storage. lvs will give the amount of disk space. Due to some vagaries of disk sizes, it is always better to round up, so if the original server says it is using 19.85 GB, make the next image 20 GB. On the new server, use lvcreate -L+${SIZE}GB -n ${FQDN} vg_guests
5. As root on the source location:

virsh -c qemu:///system migrate --xml ${XML_FILE_FROM_3} \
      --copy-storage-all ${GUESTNAME} \
      qemu+ssh://root@destinationvirthost/system

This should start the migration process, and it will output absolutely jack-squat on the cli for you to know this. On the destination system, go look in /var/log/libvirt/qemu/myguest.log (tail -f will show you the progress as a percentage completed).

6. Once the migration is complete you will probably need to run this on the new virthost:


scp ${XML_FILE_FROM_3} root@destinationvirthost:

ssh root@destinationvirthost
virsh define ${XML_FILE_FROM_3}
virsh autostart ${GUESTNAME}

7. Edit the ansible host_vars of the guest and make sure that the associated values are correct:

volgroup: /dev/vg_guests
vmhost: virthost??.phx2.fedoraproject.org

8. Run the noc.yml ansible playbook to update nagios.

This should work for most systems. However, in some cases the virtual servers on either side may have too much activity to 'settle' down enough for a migration to work. In other cases the guest may be on a disk like ISCSI which may not allow for direct migration. In that case you will need to use a more direct movement:

1. Schedule outage time, if any. This will need to be long enough to copy the data from one host to another, so it will depend on the guest disk size.
2. Turn off monitoring in nagios.
3. Set up an unpassworded root ssh key to allow communication between the two virthosts as root. This is only temporary, so, while scary, it is not a big deal. Right now, this also means modifying /etc/ssh/sshd_config to permit root login without-password.
4. Determine whatever changes need to be made to the guest. This can be the number of cpus, the amount of memory, or the disk location, as this may not be standard on the current server.
5. Make a dump of the current virtual guest using the virsh command. Use virsh dumpxml --migratable guestname and then edit in any changes in disk layout, memory and cpu needed.
6. Set up storage on the destination end to match the source storage. lvs will give the amount of disk space. Due to some vagaries of disk sizes, it is always better to round up, so if the original server says it is using 19.85 GB, make the next image 20 GB. On the new server, use lvcreate -L+${SIZE}GB -n ${FQDN} vg_guests
7. Shut down the guest.
8. Insert an iptables rule for the nc transfer:

iptables -I INPUT 14 -s <source_host> -m tcp -p tcp --dport 11111 -j ACCEPT

9. On the destination host: • RHEL-7:

nc -l 11111 | dd of=/dev/<volgroup>/<guest>

10. On the source host:

dd if=/dev/<volgroup>/<guest> | nc desthost 11111

Wait for the copy to finish. You can track how far it has gone by finding the dd pid and sending it 'kill -USR1' (for example: kill -USR1 $(pidof dd)), which makes dd print its progress. 11. Once the migration is complete you will probably need to run this on the new virthost:


scp ${XML_FILE_FROM_3} root@destinationvirthost:

ssh root@destinationvirthost
virsh define ${XML_FILE_FROM_3}
virsh autostart ${GUESTNAME}

12. Edit ansible host_vars of the guest and make sure that the associated values are correct:

volgroup: /dev/vg_guests
vmhost: virthost??.phx2.fedoraproject.org

13. Run the noc.yml ansible playbook to update nagios.

virtio notes

We have found that virtio is faster/more stable than emulating other cards on our VMs. To switch a VM to virtio:
• Remove it from DNS if it's a proxy
• Log into the VM and shut it down
• Log into the virthost that the VM is on, and sudo virsh edit <guest>
• Add the model line to the appropriate bridge interface(s); for virtio this is normally:

  <model type='virtio'/>

• Save/quit the editor
• sudo virsh start <guest>
• Re-add it to DNS if it's a proxy

Voting Infrastructure SOP

The live voting instance can be found at https://admin.fedoraproject.org/voting and the staging instance at https://admin.stg.fedoraproject.org/voting/. The code base can be found at http://git.fedorahosted.org/git/?p=elections.git

Contents

1. Contact Information
2. Creating a new election
   1. Creating the election
   2. Adding Candidates
   3. Who can vote
3. Modifying an Election
   1. Changing the details of an Election
   2. Removing a candidate


   3. Releasing the results of an embargoed election

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, elections
Location: PHX
Servers: elections0{1,2}, elections01.stg, db02
Purpose: Provides a system for voting on Fedora matters

Creating a new election

Creating the elections

• Log in.
• Go to "Admin" in the menu at the top, select "Create new election" and fill in the form.
• The "usefas" option results in candidate names being looked up as FAS usernames and displayed as their real names.
• An alias should be added when creating a new election as this is used in the link on the page of listed elections on the frontpage.
• Complete the election form:

  Alias: A short name for the election. It is the name that will be used in the templates. Example: FESCo2014
  Summary: A simple name that will be used in the URLs and in the links in the application. Example: FESCo elections 2014
  Description: A short description about the election that will be displayed above the choices on the voting page.
  Type: Allows setting the type of election (more on that below).
  Maximum Range/Votes: Allows setting options for some election types (more on that below).
  URL: A URL pointing to more information about the election. Example: the wiki page presenting the election.
  Start Date: The start of the election (UTC).
  End Date: The close of the election (UTC).
  Number Elected: The number of seats that will be selected among the candidates after the election.
  Candidates are FAS users?: Checkbox allowing integration between FAS accounts and their names retrieved from FAS.
  Embargo results: If this is set then it will require manual intervention to release the results of the election.
  Legal voters groups: Used to restrict the votes to one or more FAS groups.
  Admin groups: Gives admin rights on that election to one or more FAS groups.


Adding Candidates

The list of all the elections can be found at /voting/admin/. Click on the election of interest and select "Add a candidate". Each candidate is added with a name and a URL. The name can be their FAS username (useful if the checkbox that candidates are FAS users was checked when creating the election) or something else. The URL can be a reference to the wiki page where they nominated themselves. This will add extra candidates to the available list.

Who can vote

If no 'Legal voters groups' have been defined when creating the election, the election will be open to anyone who has signed the CLA and is in at least one other group (commonly referred to as CLA+1).

Modifying an Election

Changing the details of an Election

Note: this page can also be used to verify details of an election before it opens for voting.

The list of all the elections can be found at /voting/admin/ After finding the right election, click on it to have the overview and select “Edit election” under the description.

Edit a candidate

On the election overview page found via /voting/admin/ (and clicking on the election of interest), next to each candidate is an [edit] button allowing the admins to edit the information relative to the candidate.

Removing a candidate

On the election overview page found via /voting/admin/ (and clicking on the election of interest), next to each candidate is an [x] button allowing the admins to remove the candidate from the election.

Releasing the results of an embargoed election

Visit the elections admin interface and edit the election to uncheck the ‘Embargo results?’ checkbox.

Results

Admins have early access to the results of the elections (regardless of the embargo status). The list of the closed elections can be found at /voting/archives. Find the election of interest there and click on the "Results" link in the last column of the table. This will show you the Results page, including who was elected based on the number of seats entered when creating the election.


You may use this information to send out the results email.

Legacy

Note: The information below should now be included in the Results page (see above) but is kept here just in case.

Other things you might need to query

The current election software doesn’t retrieve all of the information that we like to include in our results emails. So we have to query the database for the extra information. You can use something like this to retrieve the total number of voters for the election:

SELECT e.id, e.shortdesc, COUNT(DISTINCT v.voter)
FROM elections AS e
LEFT JOIN votes AS v ON e.id = v.election_id
WHERE e.shortdesc IN ('FAmSCo - February 2014')
GROUP BY e.id, e.shortdesc;

You may also want to include the vote tally per candidate for convenience when the FPL emails the election results:

SELECT e.id, e.shortdesc, c.name, c.novotes
FROM elections AS e
LEFT JOIN fvotecount AS c ON e.id = c.election_id
WHERE e.shortdesc IN ('FAmSCo - February 2014', 'FESCo - February 2014');

WaiverDB SOP

WaiverDB is a service for recording waivers, from humans, that correspond with results in ResultsDB. On its own, this doesn’t do much. Importantly, the Greenwave service queries resultsdb and waiverdb and makes decisions (for Bodhi and other tools) based on the combination of data from the two sources. A result in resultsdb may matter, unless waived in waiverdb.

Contact Information

Owner: Factory2 Team, Fedora QA Team, Infrastructure Team
Contact: #fedora-qa, #fedora-admin
Persons: dcallagh, gnaponie (giulia), lholecek, ralph (threebean)
Location: Phoenix
Public addresses:
  • https://waiverdb-web-waiverdb.app.os.fedoraproject.org/api/v1.0/about
  • https://waiverdb-web-waiverdb.app.os.fedoraproject.org/api/v1.0/waivers
Servers: In OpenShift.
Purpose: Record waivers and respond to queries about them.


Description

See the upstream API docs for detailed information. The information here will be contextual to the Fedora environment. There will be two ways of inserting waivers into waiverdb: First, a cli tool, which performs an HTTP POST from the packager's machine. Second, a proxied request from bodhi. In this case, the packager will click a button in the Bodhi UI (next to a failing test result). Bodhi will receive the request from the user and in turn submit a POST to waiverdb on the user's behalf. Here, the Bodhi Server will authenticate as the bodhi user, but request that the waiver be recorded as having been submitted by the original packager. Bodhi's account will have to be given special proxy privileges in waiverdb. See https://pagure.io/waiverdb/issue/77
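As an illustration of the first path, a direct waiver submission is just an authenticated POST against the waivers endpoint listed above. The sketch below is not taken from our tooling; the field names follow the upstream API docs and the Kerberos/negotiate authentication is an assumption, so check the upstream documentation before relying on it:

    # hypothetical example: waive a single failing testcase for a build
    curl --negotiate -u : \
         -H "Content-Type: application/json" \
         -d '{"subject_type": "koji_build",
              "subject_identifier": "example-1.0-1.fc30",
              "testcase": "dist.rpmdeplint",
              "waived": true,
              "product_version": "fedora-30",
              "comment": "failure is not relevant to this build"}' \
         https://waiverdb-web-waiverdb.app.os.fedoraproject.org/api/v1.0/waivers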

Observing WaiverDB Behavior

Login to os-master01.phx2.fedoraproject.org as root (or, authenticate remotely with openshift using oc login https://os.fedoraproject.org), and run:

$ oc project waiverdb
$ oc status -v
$ oc logs -f dc/waiverdb-web

Removing erroneous waivers

In general, don’t do this. But if for some reason we really need to, the database for waiverdb lives outside of openshift in our standard environment. Connect to db01:

[root@db01 ~][PROD]# sudo -u postgres psql waiverdb

waiverdb=# \d
               List of relations
 Schema |     Name      |   Type   |  Owner
--------+---------------+----------+----------
 public | waiver        | table    | waiverdb
 public | waiver_id_seq | sequence | waiverdb
(2 rows)

waiverdb=# select * from waiver;

Be careful. You can delete individual waivers with SQL.

Upgrading

You can roll out configuration changes by changing the files in roles/openshift-apps/waiverdb/ and running the playbooks/openshift-apps/waiverdb.yml playbook. To understand how the software is deployed, take a look at these two files: • roles/openshift-apps/waiverdb/templates/imagestream.yml • roles/openshift-apps/waiverdb/templates/buildconfig.yml


See that we build a fedora-infra specific image on top of an app image published by upstream. The latest tag is automatically deployed to staging. This should represent the latest commit to the master branch of the upstream git repo that passed its unit and functional tests. The prod tag is manually controlled. To upgrade prod to match what is in stage, move the prod tag to point to the same image as the latest tag. Our buildconfig is configured to poll that tag, so a new os.fp.o build and deployment should be automatically created. You can watch the build and deployment with oc commands. You can poll this URL to see what version is live at the moment: https://waiverdb-web-waiverdb.app.os.fedoraproject.org/api/v1.0/about
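Moving the prod tag can be done with oc. The sketch below assumes the image stream is named waiverdb-web inside the waiverdb project, which may not match the actual stream name, so verify with oc get imagestreams first:

    # point the prod tag at whatever image latest currently resolves to
    oc -n waiverdb tag waiverdb-web:latest waiverdb-web:prod
    # then watch the resulting build and deployment
    oc -n waiverdb get builds
    oc -n waiverdb rollout status dc/waiverdb-web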

What Can I Do For Fedora SOP

Contents

1. Contact Information 2. Introduction 3. Determine Category 4. Cron Job

Contact Information

Owner: sysadmin-main
Contact: #fedora-admin, #fedora-noc or [email protected]
Location: Phoenix (Openshift)
Public addresses: whatcanidoforfedora.org, stg.whatcanidoforfedora.org
Server(s): os.fedoraproject.org, os.stg.fedoraproject.org
Purpose: To explain the overall function of the whatcanidoforfedora webpage, including some back story, how to build your own, and site navigation.

Introduction

The 'What Can I Do For Fedora' (whatcanidoforfedora.org) page was the brainchild of Ralph Bean after getting inspiration from 'whatcanidoformozilla.org', created by Josh Matthews, Henri Koivuneva and a few others. Ralph wanted to make whatcanidoforfedora (wcidff) as configurable as possible. The purpose of this site is to assist, in as user-friendly a way as possible, new and prospective community members and help them realize what skills they may possess that can be helpful for the Fedora Project.


Deployment

The application deployment is managed from the github repository using the 'staging' and 'production' branches to deploy a new version. For example, a new deployment to staging would look like this:

git clone [email protected]:fedora-infra/asknot-ng.git
cd asknot-ng
git checkout staging
git rebase develop
git push origin staging

The github repository has a webhook configured to send push information to our Openshift instance. Once Openshift receives the webhook request it will trigger a new build using the repository's Dockerfile. The 'asknot-ng' container runs the Apache HTTP web server and the configuration is stored in the git repository.

Initial Deployment

The following playbook is used to create the initial Openshift project with the correct configuration:

sudo rbac-playbook openshift-apps/asknot.yml

Logs

Logs can be retrieved by accessing the Openshift web console or by using the openshift command line:

$ oc login os-master01.phx2.fedoraproject.org
You must obtain an API token by visiting https://os.fedoraproject.org/oauth/token/request

$ oc login os-master01.phx2.fedoraproject.org --token=<token>
$ oc -n asknot get pods
asknot-28-bfj52   1/1   Running   522   28d
$ oc logs asknot-28-bfj52

Wiki Infrastructure SOP

Managing our wiki.

Contact Information

Owner: Fedora Infrastructure Team / Fedora Website Team
Contact: #fedora-admin or #fedora-websites on irc.libera.chat
Location: http://fedoraproject.org/wiki/
Servers: proxy[1-3] app[1-2,4]
Purpose: Provides our production wiki


Description

Our wiki currently runs mediawiki.

Important: Whenever you change anything on the wiki (bugfix, configuration, plugins, ...), please update the page at https://fedoraproject.org/wiki/WikiChanges.

Dealing with Spammers:

If you find a spammer is editing pages in the wiki, do the following:
1. Admin-disable their account in FAS, adding 'wiki spammer' as the comment.
2. Block their account in the wiki from editing any additional pages.
3. Go to the list of pages they've edited and roll back their changes, one by one. If there are many, get someone to help you.

Zodbot Infrastructure SOP

zodbot is a supybot-based IRC bot that we use in our #fedora channels.

Contents

1. Contact Information 2. Description 3. shutdown 4. startup 5. Processing interrupted meeting logs 6. Becoming an admin

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin
Location: Phoenix
Servers: value01
Purpose: Provides our IRC bot

Description

zodbot is a supybot-based IRC bot that we use in our #fedora channels. It runs on value01 as the daemon user. We do not config-manage zodbot.conf because supybot makes changes to it on its own. Therefore it gets backed up and is treated as data.


shutdown

killall supybot

startup

cd /srv/web/meetbot   # zodbot currently needs to be started in the meetbot directory.
                      # This requirement will go away in a later meetbot release.
sudo -u daemon supybot -d /var/lib/zodbot/conf/zodbot.conf

Startup issues

If the bot won’t connect, with an error like:

"Nick/channel is temporarily unavailable"

found in /var/lib/zodbot/logs/messages.log, hop on Libera (with your own IRC client) and do the following:

/msg nickserv release zodbot [the password]

The password can be found on the bot’s host in /var/lib/zodbot/conf/zodbot.conf This should allow the bot to connect again.

Processing interrupted meeting logs

zodbot forgets about meetings if they are in progress when the bot goes down; therefore, the meetings never get processed. Users may open a ticket in our Trac instance to have meeting logs processed. Trac tickets for meeting log processing should consist of a URL where zodbot had saved the log so far and an uploaded file containing the rest of the log. The logs are stored in /srv/web/meetbot. Append the remainder of the log uploaded to Trac (don't worry too much about formatting; meeting.py works well with - and XChat-like logs), then run:

sudo python /usr/lib/python2.7/site-packages/supybot/plugins/MeetBot/meeting.py replay /path/to/fixed.log.txt

Close the Trac ticket, letting the user know that the logs are processed in the same directory as the URL they gave you.

Becoming an admin

Register with zodbot on IRC.:

/msg zodbot misc help register

You have to identify to the bot to do any admin type commands, and you need to have done so before anyone can give you privs. After doing this, ask in #fedora-admin on IRC and someone will grant you privs if you need them. You’ll likely be added to the admin group, which has the following capabilities (the below snippet is from an IRC log illustrating how to get the list of capabilities).

21:57 < nirik> .list admin
21:57 < zodbot> nirik: capability add, capability remove, channels, ignore add, ignore list, ignore remove, join, nick, and part


2.3 (Old) System Administrator Guides

2.3.1 Old Standard Operating Procedures

Below is a table of contents containing all the outdated or old standard operating procedures for Fedora Infrastructure applications. They are kept around for documentation/history purposes but they shouldn't be used anywhere anymore.

badges SOP

Fedora Badges - a recognition system for contributors. See also: This document is now hosted in the Fedora Badges Documentation Website. Latest source available here: https://pagure.io/fedora-badges/blob/master/f/docs/modules/ROOT/pages/push-badges.adoc

Fedorahosted Infrastructure SOP

Provide hosting place for open source projects.

Important: This page is for administrators only. People wishing to request a hosted project should use the Ticketing System ; see the new project request template. (Requires Fedora Account)

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-admin, sysadmin-hosted
Location: Serverbeach
Servers: hosted03, hosted04
Purpose: Provide hosting place for open source projects

Description

fedorahosted.org can be used to host open source projects. It provides the following facilities:
1. An SCM for maintaining the code. The currently supported SCMs include Mercurial, Git, Bazaar, and SVN. There is no CVS.
2. A trac instance, which provides a mini-wiki for hosting information and also provides a ticketing system.
3. A mailing list


How to setup a new hosted project

1. Create a source group in the Fedora Account System, of the form gitepel, svnkernel, etc.
2. Create the source repo.
3. Log into hosted03.
4. Create the new project space:

sudo /usr/local/bin/hosted-setup.sh

The name must use the same case as the scm repo.
• You're likely to end up with:

'Command failed: columns username, action are not unique'

This can be safely ignored; it only tells you that you are giving admin access to a person who already has admin access. 5. If a mailing list is desired, follow the directions in the mailman SOP.

How to import data from a cvs repo into git repo

Often users request their git repos to be imported from an existing cvs repo. This is a two step process, as follows:

git cvsimport -v -d :pserver:[email protected]/cvs/docs -C <module> <module>
sudo git clone --bare --no-hardlinks <module>/ /git/<module>.git/

Example:

git cvsimport -v -d :pserver:[email protected]/cvs/docs -C translation-quick-start-guide translation-quick-start-guide
sudo git clone --bare --no-hardlinks translation-quick-start-guide/ /git/translation-quick-start-guide.git/

Note: Our git repos disallow non-fast-forward pushes by default. This default makes the most sense, but sometimes users understand the impact of doing so and still wish to make such a push. To enable this temporarily, edit the config file inside of the git repo and make sure that receive.denyNonFastforwards is set to false. Make sure to re-enable it once the user has finished their push.
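A quick sketch of toggling that setting from the repository directory, using git config rather than editing the file by hand (the repo path is just an example):

    cd /git/<project>.git
    sudo git config receive.denyNonFastforwards false
    # ... let the user make their non-fast-forward push ...
    sudo git config receive.denyNonFastforwards true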

How to allow a project to redirect parts of their release tree

A project may want to host parts of their release tree elsewhere (for instance, moving docs from hosting inside of the fedorahosted release tree to an external service). To do that, modify configs/web/fedorahosted.org/release.conf,

Adding a new Directory section like this:


# Allow python-fedora project to redirect documentation/release tree elsewhere
<Directory ...>
    AllowOverride FileInfo
</Directory>

Then tell the project that they can create a .htaccess file with the Redirect (Note that the release tree can be reached by two URLs so you need to redirect both of them):

Redirect permanent /releases/p/y/python-fedora/doc http://pythonhosted.org/python-fedora
Redirect permanent /released/python-fedora/doc/ http://pythonhosted.org/python-fedora

Fedorahosted FedMsg Infrastructure SOP

Publish fedmsg messages from Fedora Hosted trac instances.

Contact Information

Owner: Fedora Infrastructure Team
Contact: #fedora-apps, #fedora-admin, sysadmin-hosted
Location: Serverbeach
Servers: hosted03, hosted04
Purpose: Broadcast trac activity for select projects (opt-in)

Description

fedmsg activity is usually an all-or-nothing proposition. We emit messages for all koji jobs and all bodhi updates, or none. fedmsg activity for Fedora Hosted is another story. We provide the option for project owners to opt in to fedmsg and have their activity broadcast, but it is off by default. This document describes how to:
1. Enable the fedmsg plugin for a fedora hosted project.
2. Set up the fedmsg plugin on a new node.

Enable the fedmsg plugin for a fedora hosted project.

Enable the trac plugin

The trac-fedmsg-plugin package should be installed, but disabled. Edit /srv/web/trac/projects/$PROJECT/conf/trac.ini and, under the [components] section, add:

trac_fedmsg_plugin.* = enabled

And restart apache with “sudo apachectl graceful”


Enable the git hook

There is an ansible playbook that does this. There is no need to do it by hand anymore. Run:

$ sudo -i ansible-playbook \
    /srv/web/infra/ansible/playbooks/fedorahosted_fedmsg_git.yml \
    --extra-vars '{"repos":["yanex.git"]}'

Enabling by hand

If you were to do it by hand, without the playbook, you could follow the instructions below: Make a backup of the old post-receive hook. It should be empty when you encounter it, but just to be safe:

$ mv /srv/git/$PROJECT.git/hooks/post-receive \
     /srv/git/$PROJECT.git/hooks/post-receive.orig

Then, symlink in the new post-receive hook with:

$ ln -s /usr/local/share/git/hooks/post-receive-fedorahosted-fedmsg \
     /srv/git/$PROJECT.git/hooks/post-receive

That hook is managed by ansible; if you want to modify it, you can do so there.

Note: If there was an old post-receive hook in place, you should check to see if it did something important. The 'fedora-web' git repo (which was converted early on) had such a hook. See /srv/git/fedora-web.git/hooks for an example of how to handle multiple git hooks. Something like /usr/share/git-core/post-receive-chained can be used to chain the hook across multiple scripts.

How to setup the fedmsg plugin on a new fedorahosted node.

1) Create certs for the new node as per the fedmsg-certs doc.
2) Declare those certs globally in /etc/fedmsg.d/ssl.py.
3) Declare endpoints for the new node in /etc/fedmsg.d/endpoints.py.
4) Use our configuration management tool to distribute the new global fedmsg config to the new node and all other nodes.
5) Install the trac-fedmsg-plugin package on the new node and follow the steps above.
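For step 3, the endpoints declaration is just Python. A minimal sketch follows; the hostname and port numbers here are invented for illustration, not our real values:

    # /etc/fedmsg.d/endpoints.py
    config = dict(
        endpoints={
            "trac.hosted03": [
                "tcp://hosted03.fedoraproject.org:3005",
                "tcp://hosted03.fedoraproject.org:3006",
            ],
        },
    )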

FH-Projects-Cleanup Infrastructure SOP

Contents 1. Introduction 2. Our first move 3. Removing Project’s git repo 4. Removing Trac’s project 5. Removing Project’s ML


6. FAS Group Removal

Introduction

This wiki page will help any sysadmin completely remove a Fedora Hosted project, either because the owner requested its removal or because some other issue requires us to remove it. This page covers git, Trac, Mailing List and FAS group clean-up.

Our first move

If you are going to remove a Fedora Hosted project, please remember to create a folder in /srv/tmp following this syntax:

cd /srv/tmp && mkdir $project-hold-until-xx-xx-xx

where xx-xx-xx should be substituted with the date everything should be purged away from there. (That happens 14 days after the delete request.)

Removing Project’s git repo

Having a git repository removed can be achieved with the following steps:

ssh [email protected]
cd /git
mv $project.git/ /srv/tmp/$project-hold-until-xx-xx-xx/

We’re done with git!

Removing Trac’s project

Steps are:

ssh [email protected]
cd /srv/web/trac/projects
mv $project/ /srv/tmp/$project-hold-until-xx-xx-xx/

and... that's all!

Removing Project’s ML

We have two options here. Delete a list, but keep the archives:

sudo /usr/lib/mailman/bin/rmlist <listname>

Delete a list and its archives:

sudo /usr/lib/mailman/bin/rmlist -a <listname>

If you are going to completely remove the Mailing List and its archives, please make sure the list is empty and there are no subscribers in it.


FAS Group Removal

Not every Fedora sysadmin can do this. See ISOP:ACCOUNT_DELETION for information. You may want to remove the group or simply disable it.

Hosted repository setup

Fedora provides SCM repositories for open source projects.

Contents

1. Mercurial Repository
   1. Repo Setup
   2. Commit Mail
2. Git Repository
   1. Repo Setup
   2. Commit Mail
3. Bazaar Repository
4. SVN Repository
   1. Repo Setup
   2. Commit Mail

Mercurial Repository

You'll need to know three things in order to start the mercurial repository.
PROJECTNAME: what the project wants to be called.
OLDURL: how to access the project's current sourcecode in their mercurial repository.
PROJECTGROUP: the group set up in the account system for read-write access to the repository.

Repo Setup

The Mercurial repository lives on the hosted server. Access it by logging into hosted1, then follow these steps: 1. Fetch the latest content from the FAS database:

$ fasClient -i -f

2. Create the repo:

$ cd /hg
$ sudo hg clone -U $OLDURL $PROJECTNAME
  (or: sudo mkdir $PROJECTNAME; cd $PROJECTNAME; sudo hg init)
$ sudo find $PROJECTNAME -type d -exec chmod g+s \{\} \;
$ sudo chmod -R g+w $PROJECTNAME
$ sudo chown -R root:$PROJECTGROUP $PROJECTNAME

This should setup all the files needed for the repository.


Commit Mail

The Mercurial Notify extension can be used to send out email when commits are pushed to a Mercurial repository. To enable notifications, create the file /hg/$PROJECTNAME/.hg/hgrc:

[extensions]
hgext.notify =

[hooks]
changegroup.notify = python:hgext.notify.hook

[email]
from = [email protected]

[smtp]
host = localhost

[web]
baseurl = http://hg.fedorahosted.org/hg

[notify]
sources = serve push pull bundle
test = False
config = /hg/$PROJECTNAME/.hg/subscriptions
maxdiff = -1

And the file /hg/$PROJECTNAME/.hg/subscriptions:

[usersubs]
user@host = *

[reposubs]

Git Repository

You'll need to know several things in order to start the git repository.
PROJECTNAME: what the project wants to be called.
OLDURL: how to access the project's current source code in their git repository.
PROJECTGROUP: the group set up in the account system for write access to the repository.
COMMITLIST: comma-separated list of email addresses for commits (optional)
DESCRIPTION: description of the project (optional)
PROJECTOWNER: the FAS username of the project owner

Repo Setup

The git repository lives on the hosted server. Access it by logging into hosted1, then follow these steps. Fetch the latest content from the FAS database:


$ sudo fasClient -i -f

$ cd /git

Clone an existing repository:

$ sudo git clone --bare $OLDURL $PROJECTNAME.git
$ cd $PROJECTNAME.git
$ sudo git config core.sharedRepository true
$ #
$ ## or
$ #
$ # Create a new repository:
$ sudo mkdir $PROJECTNAME.git
$ cd $PROJECTNAME.git
$ sudo git init --bare --shared=true

Give the repository a nice description for gitweb:

$ echo $DESCRIPTION | sudo tee description > /dev/null

Set up and run the post-update hook. Note: we symlink this because /git is on a filesystem with noexec set.

$ sudo ln -svf /usr/share/git-core/templates/hooks/post-update.sample ./hooks/post-update
$ sudo git update-server-info

Ensure ownership and modes are correct:

$ sudo find -type d -exec chmod g+s \{\} \;
$ sudo find -perm /u+w -a ! -perm /g+w -exec chmod g+w \{\} \;
$ sudo chown -R $PROJECTOWNER:$PROJECTGROUP .

This should set up all the files needed for the repository. The repository owner can push changes into the repo by running:

$ git push ssh://git.fedorahosted.org/git/$PROJECTNAME.git/ master

from within their local git repository.

Commit Mail

If they want commit mail, there are a couple of additional steps:

$ cd /git/$PROJECTNAME.git
$ sudo git config hooks.mailinglist $COMMITLIST
$ sudo git config hooks.maildomain fedoraproject.org
$ sudo git config hooks.emailprefix "[$PROJECTNAME]"
$ sudo git config hooks.repouri "http://git.fedorahosted.org/cgit/$PROJECTNAME.git"
$ sudo ln -svf /usr/share/git-core/post-receive-chained ./hooks/post-receive
$ sudo mkdir ./hooks/post-receive-chained.d
$ sudo ln -svf /usr/local/bin/git-notifier ./hooks/post-receive-chained.d/post-receive-email
$ sudo ln -svf /usr/local/share/git/hooks/post-receive-fedorahosted-fedmsg ./hooks/post-receive-chained.d/post-receive-fedmsg
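To sanity-check the hook wiring afterwards (a quick read-back of the values set above, not part of the original SOP):

$ cd /git/$PROJECTNAME.git
$ sudo git config --get hooks.mailinglist
$ sudo git config --get hooks.repouri
$ ls -l ./hooks/post-receive ./hooks/post-receive-chained.d/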


Bazaar Repository

You’ll need to know three things in order to start a Bazaar repository:

PROJECTNAME what the project wants to be called.
OLDBRANCHURL how to access the project’s current source code in their previous Bazaar repository. Note that a project may have multiple branches that they want to import; each branch will have a separate URL. (The project can import the new branches after the repository is created if they want.)
PROJECTGROUP the group set up in the account system for read/write access to the repository.

Repo Setup

The bzr repository lives on the hosted server. Access it by logging into hosted1, then follow these steps. The first stage is to create the Bazaar repository. Fetch the latest content from the FAS database:

$ fasClient -i -f

$ cd /srv/bzr/
$ # This creates a Bazaar repository which has shared storage between branches
$ sudo bzr init-repo $PROJECTNAME --no-trees
$ cd $PROJECTNAME
$ sudo bzr branch $OLDURL
$ sudo bzr branch $OLDURL2
$ # [...]
$ sudo bzr branch $OLDURLN
$ cd ..
$ sudo find $PROJECTNAME -type d -exec chmod g+s \{\} \;
$ sudo chmod -R g+w $PROJECTNAME
$ sudo chown -R root:$PROJECTGROUP $PROJECTNAME

This should be all that is needed. To check out, run:

bzr init-repo $MYLOCALPROJECTREPO
cd $MYLOCALPROJECTREPO
bzr branch bzr+ssh://bzr.fedorahosted.org/bzr/$PROJECTNAME/$BRANCHNAME
bzr branch bzr://bzr.fedorahosted.org/bzr/$PROJECTNAME/$BRANCHNAME/

Note: If the end user checks out a branch without creating their own repository, they will need to create a local working tree by doing the following:

cd $BRANCHNAME
bzr checkout --lightweight

SVN Repository

You’ll need to know a few things in order to start an SVN repository:

PROJECTNAME what the project wants to be called.
PROJECTGROUP the Fedora account system group with read/write access.


COMMITLIST comma-separated list of email addresses for commits (optional).

Repo Setup

SVN lives on the hosted server. Access it by logging into hosted1, then run the following steps. Fetch the latest content from the FAS database:

$ fasClient -i -f

Create the repo:

$ cd /svn/
$ sudo svnadmin create $PROJECTNAME
$ cd $PROJECTNAME
$ sudo chgrp -R $PROJECTGROUP .
$ sudo chmod -R g+w .
$ sudo find -type d -exec chmod g+s \{\} \;

This should be all that is needed. To check out, run:

svn co svn+ssh://svn.fedorahosted.org/svn/$PROJECTNAME

Commit Mail

If they want commit mail, there are a couple of additional steps:

$ echo $COMMITLIST | sudo tee ./commit-list > /dev/null
$ sudo ln -sv /usr/bin/fedora-svn-commit-mail-hook ./hooks/post-commit

FedoraHosted Project Rename SOP

This describes the steps necessary to rename a project in Fedora Hosted.

Contents

1. Rename the Trac instance
2. Rename the git / svn / hg / ... directory
3. Rename any old releases directories
4. Rename the group in FAS

Rename the Trac instance

cd /srv/web/trac/projects
mv oldname newname
cd newname/conf
sed -i -e 's/oldname/newname/' trac.ini
cd ..
sudo -u apache trac-admin . resync

Rename the git / svn / hg / ... directory

cd /git
mv oldname.git newname.git

Rename any old releases directories

cd /srv/web/releases/o/l/oldname

Create a releases directory for the new name and, if there were old releases, move them to the new location.
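A minimal sketch, assuming the releases tree is hashed on the first two letters of the project name (as in the o/l/oldname path above; adjust n/e to match the actual new name):

cd /srv/web/releases
mkdir -p n/e/newname                # hashed path for the hypothetical new name
mv o/l/oldname/* n/e/newname/       # move any old release files across
rmdir o/l/oldname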

Rename the group in FAS

Note: Don’t blindly rename. Fedorahosted groups are usually safe to rename. If the old group could be present in other apps/configs, though (like provenpackagers, perl-sig, etc.), do not rename it; the other apps would need to have the group name updated there as well to make this safe.

ssh db2
sudo -u postgres psql fas2

BEGIN;
select * from groups where name='$OLDNAME';
update groups set name='$NEWNAME' where name='$OLDNAME';

• Check that only one row was modified:

select * from groups where name in ('$OLDNAME','$NEWNAME');

• Check that there’s only one row and the name == $NEWNAME
• If incorrect, do ROLLBACK; instead of commit:

COMMIT;

Warning: Don’t delete groups. If, for some reason, you end up with a group in FAS that was a typo but doesn’t conflict with anything else, don’t delete it without talking to other admins on fedora-infrastructure-list. The numeric group IDs could be present on a filesystem somewhere, and removing the group could eventually lead to the ID being allocated to some other group, which would give unintended people access to the files. As a group we can figure out what hosts and files need to be checked for this issue if a delete is needed.
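If a delete is ever agreed on, the audit described above would look roughly like this sketch (the group name and GID are placeholders; the real list of hosts to check would be decided on the list):

$ getent group $GROUPNAME              # note the numeric GID, e.g. 102345
$ sudo find / -xdev -gid 102345 -ls    # run on each host that may hold group-owned files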

Guest migration between hosts.

Move guests from one host to another.


Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main
Location PHX, Tummy, ibiblio, Telia, OSUOSL
Servers All xen servers, kvm/libvirt servers.
Purpose Migrate guests

How to do it

1. Schedule outage time, if any. This will need to be long enough to copy the data from one host to another, so it will depend on guest disk size.
2. Turn off monitoring in nagios.
3. On the new host, create disk space for the server:

lvcreate -n app03 -L 32G vg_guests00

4. Prepare the old guest for migration:
   a) if the system is xen, install a regular kernel
   b) look for entries for xenblk and hvc0 in /etc files
5. Shut down the guest.

6. virsh dumpxml guestname > guest.xml

7. Copy guest.xml to the new machine. You will need to make various edits depending on whether the system was originally xen or such. I normally need to compare an existing xml on the target system with the one we dumped out to make up the differences.
8. Define the guest on the new machine: ‘virsh define guest.xml’. Depending on the changes in the xml this may not work, and you may need to make many manual changes, copy the guest.xml to /etc/libvirt/qemu, and do a /sbin/service libvirtd restart.
9. Insert an iptables rule for the nc transfer:

iptables -I INPUT 14 -s <source host IP> -m tcp -p tcp --dport 11111 -j ACCEPT

10. On the destination host: • RHEL-5:

nc -l -p 11111 | dd of=/dev/mapper/

• RHEL-6:

nc -l 11111 | dd of=/dev/mapper/

11. On the source host:

dd if=/dev/mapper/guest-partition | nc desthost 11111

Wait for the copy to finish. You can track how far the copy has gone by finding the dd pid and then sending a ‘kill -USR1’ to it (a short example follows this list).

12. Start the guest on the new host:


``virsh start guest``

13. On the source host, rename the storage and undefine the guest so it’s not started.
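The progress check mentioned in the copy step is just a signal sent to the running dd; a minimal sketch, run on the source host while the transfer is in flight:

$ sudo kill -USR1 $(pgrep -x dd)    # dd prints records in/out and bytes copied so far to its stderr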

Drive Replacement Infrastructure SOP

At present this SOP only works for the X series IBM servers. We have multiple machines with lots of different drives in them; for the most part now, though, we are trying to standardise on IBM X series servers. At present I’ve not figured out how to disable onboard RAID, so many of our servers have two RAID 0 arrays on top of which we do software RAID. The system xen11 is currently an HP ProLiant DL180 G5 with its own interesting RAID system (using the Compaq Smart Array cciss driver). Like the IBM X series, each drive is considered a single RAID-0 instance which is then accessed through a logical drive.

Contents

1. Contact Information
2. Verify the drive is dead
   1. Re-adding a drive (poor man’s fix)
3. Actually replacing the drive (IBM)
   1. Collecting Data
   2. Call IBM
   3. Get the package, give access to the tech
   4. Prepwork before the tech arrives
   5. Tech on site
   6. Rebuild the array
4. Actually Replacing the Drive (HP)
   1. Collecting data
   2. Call HP
   3. Get the package, give access to the tech
   4. Prepwork before the tech arrives
   5. Tech on site
   6. Rebuild the array
5. Installing RaidMan (IBM Only)

Database - DriveReplacement

Contact Information

Owner Fedora Infrastructure Team
Contact #fedora-admin, sysadmin-main


Location All
Servers All
Purpose Steps for drive replacement.

Verify the drive is dead

$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      513984 blocks [2/2] [UU]

md1 : active raid1 sdb2[2](F) sda2[0]
      487717248 blocks [2/1] [U_]

This indicates that md1 is in a degraded state and that /dev/sdb2 is the failed drive. Notice that /dev/sdb1 (the same physical drive as /dev/sdb2) is not failed. /dev/md0 (not yet degraded) is showing a good state. This is because /dev/md0 is /boot. If you run:

touch /boot/t
sync
rm /boot/t

That should make /dev/md0 notice that its drive is also failed. If it does not fail, it’s possible the drive is fine and that some blip happened that caused it to get flagged as dead. It is also worthwhile to log in to xenX-mgmt to determine if the RSAII adapter has noticed the drive is dead. If you think the drive just had a blip and is fine, see “Re-adding” below.
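For example, a quick way to re-check both arrays after the touch/sync/rm test (standard mdadm tooling, not specific to this SOP):

$ cat /proc/mdstat
$ sudo mdadm --detail /dev/md0
$ sudo mdadm --detail /dev/md1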

Re-adding a drive (poor man’s fix)

Basically what we’re doing here is making sure the drive is, in fact, dead. Obviously you don’t want to do this more than once on a drive; if it continues to fail, replace it.

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      513984 blocks [2/2] [UU]

md1 : active raid1 sdb2[2](F) sda2[0]
      487717248 blocks [2/1] [U_]

# mdadm /dev/md1 --remove /dev/sdb2
# mdadm /dev/md1 --add /dev/sdb2
# cat /proc/mdstat
md0 : active raid1 sdb1[1] sda1[0]
      513984 blocks [2/1] [U_]
        resync=DELAYED

md1 : active raid1 sdb2[2] sda2[0]
      487717248 blocks [2/1] [U_]
      [=>..................]  recovery = 9.2% (45229120/487717248) finish=145.2min speed=50771K/sec

So we removed the bad drive and added it again, and you can now see the recovery status. Watch it carefully. If it fails again, it’s time for a drive replacement.
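One convenient way to keep an eye on the rebuild without re-running the command by hand (a generic convenience, not part of the original SOP):

$ watch -n 30 cat /proc/mdstat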


Actually replacing the drive (IBM)

Actually replacing the drive is a bit of a to-do. If the box is in a RH owned location, we’ll have to file a ticket and get someone access to the colo. If it is at another location, we may be able to just ship the drive there and have someone do it on site. Please follow the steps below for drive replacement.

Collecting Data

There’s a not insignificant amount of data you’ll need to place the call. Please have the following information handy: 1) The host’s machine type (this is not the model number):

# lshal | grep system.product
system.product = 'IBM System x3550 -[7978AC1]-' (string)

In the above case, the machine type is encoded in [7978AC1] and is just the first 4 digits, so this machine type is 7978. The M/T (machine type) is always 4 digits for IBM boxes. 2) The machine’s serial number:

# lshal | grep system.hardware.serial
system.hardware.serial = 'FAAKKEE' (string)

The above serial number is ‘FAAKKEE’.

3) Drive stats. There are two ways to get the drive stats. You can get some of this information via hal, but for the complete information you need to either have someone physically go look at the drive (some of which is in inventory) or use RaidMan. See “Installing RaidMan” below for more information on how to install RaidMan. Specifically you need:

• Drive Size (in GB)
• Drive Type (SAS or SATA?)
• Drive Model
• Drive Vendor

To get this information run:

# cd /usr/RaidMan/
# ./arcconf GETCONFIG 1

4) The phone number and address of the building where the drive is currently located. This will go to the RH cage. This information is located in the contacts.txt of the private git repo on batcave01 (only available to sysadmin-main people).

Call IBM

Call 1-800-426-7378 and follow the directions they give you. You’ll need to use the M/T above to get to the correct rep. They will ask you for the information above (you wrote it down, right?). When they agree to replace the drive, make sure to tell them you need the shipping number of the drive as well as the name of the tech who will do the drive replacement.


Sometimes the tech will just bring the drive. If not, though, you need to open a ticket with the colo to let them know a drive is coming.

Get the package, give access to the tech

As SOON as you get this information, open a ticket with RH at is-ops-tickets at redhat.com. Request a ticket ID from RH. If the tech has any issues getting into the colo, you can give the AT&T ticket request to the tech to get them in. NOTE: this can often take hours. We have 4 hour on site response time from IBM. This time goes very quickly; sometimes you may need to page out someone in IS to ensure it gets created quickly. To get this pager information see contacts.txt in batcave01’s private repo (if batcave01 is down for some reason, see the dr copy on backup2.fedoraproject.org:/srv/).

Prepwork before the tech arrives

Really the big thing here is to remove the broken drive from the array. In our earlier example we found /dev/sdb failed. We’ll want to remove it from both arrays:

# mdadm /dev/md0 --remove /dev/sdb1
# mdadm /dev/md1 --remove /dev/sdb2

Next get the current state of the drives and save it somewhere. See “Installing RaidMan” for more information if RaidMan is not installed.

# cd /usr/RaidMan
# ./arcconf GETCONFIG 1 > /tmp/raid1.txt

Copy /tmp/raid1.txt off to some other device and save it until the tech is on site. It should contain information about the failed drive.

Tech on site

When the tech is on site you may have to give him the rack location. All of our Mesa servers are in one location, “the same room that the desk is in”. You may have to give him the serial number of the server, or possibly make it blink. It’s either the first rack on the left labeled “01 2 55” or “01 2 58”. Once he’s replaced the drive, he’ll have you verify. Use the RaidMan tools to do the following:

# cd /usr/RaidMan
# ./arcconf RESCAN 1
# ./arcconf GETCONFIG 1 > /tmp/raid2.txt
# # arcconf CREATE LOGICALDRIVE [Options]
# ./arcconf create 1 LOGICALDRIVE 476790 Simple_volume 0 1

First we’re going to re-scan the array for the new drive. Then we’ll re-get the configs. Compare /tmp/raid2.txt to /tmp/raid1.txt and verify the bad drive is fixed and that it has a different serial number. Also make sure it’s the correct size. Thank the tech and send him on his way. The last line there creates a new logical drive from the physical drive. “Simple_volume” tells it to create a raid0 array of one drive. The size was pulled out of our initial /tmp/raid1.txt (should match the other drive). The last two numbers are the Channel and ID of the new drive.

Rebuild the array

Now that the disk has been replaced we need to put a partition table on the new drive and add it to the array:

• /dev/sdGOOD is the GOOD drive
• /dev/sdBAD is the BAD drive

# dd if=/dev/sdGOOD of=/tmp/sda-mbr.bin bs=512 count=1
# dd if=/tmp/sda-mbr.bin of=/dev/sdBAD
# partprobe
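Before re-adding the partitions it can be worth confirming the copied partition table actually matches; a small sketch using sfdisk (any standard partition-table dump tool would do):

# sfdisk -d /dev/sdGOOD > /tmp/good.parts
# sfdisk -d /dev/sdBAD > /tmp/bad.parts
# diff /tmp/good.parts /tmp/bad.parts    # expect only the device names (and disk ids) to differ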


Next re-add the drives to the array:

• /dev/sdBAD1 and /dev/sdBAD2 are the partitions on the new drive which is no longer bad.

# mdadm /dev/md0 --add /dev/sdBAD1
# mdadm /dev/md1 --add /dev/sdBAD2
# cat /proc/mdstat

This starts rebuilding the arrays; the last line checks the status.

Actually Replacing the Drive (HP)

Replacing the drive on the HPs is similar to the IBMs. First you will need to contact HP, then you will need to open a ticket with Red Hat’s Helpdesk to get into the PHX2 facility. Then you will need to coordinate with the technician on the colocation’s rules for entry and who to call/talk with.

Collecting data

Call HP

Get the package, give access to the tech

Prepwork before the tech arrives

Tech on site

Rebuild the array

Now that the disk has been replaced we need to put a partition table on the new drive and add it to the array:

• /dev/cciss/c0dGOOD is the GOOD drive. The HP utilities will have a code like 1I:1:1
• /dev/cciss/c0dBAD is the BAD drive. The HP utilities will have a code like 2I:1:1

First we need to create the logical drive on the system.

# hpacucli controller serialnumber=P61630H9SVU4JF create type=ld sectors=63 drives=2I:1:1 raid=0
# dd if=/dev/cciss/c0dGOOD of=/tmp/sda-mbr.bin bs=512 count=1
# dd if=/tmp/sda-mbr.bin of=/dev/cciss/c0dBAD
# partprobe

Next re-add the drives to the array:

• /dev/sdBAD1 and /dev/sdBAD2 are the partitions on the new drive which is no longer bad.

# mdadm /dev/md0 --add /dev/sdBAD1
# mdadm /dev/md1 --add /dev/sdBAD2
# cat /proc/mdstat

This starts rebuilding the arrays; the last line checks the status.

Installing RaidMan (IBM Only)

Unfortunately there is no feasible way to manage IBM RAID arrays without causing downtime other than RaidMan. The alternative is the pre-POST interface, which requires downtime and, if the first drive is the failed drive, may result in a non-booting system. So for now RaidMan it is, until we can figure out how to get rid of the raid controllers in these boxes completely.

yum -y install compat-libstdc++-33.i686
rpm -ihv https://infrastructure.fedoraproject.org/rhel/RaidMan/RaidMan-9.00.i386.rpm

To verify installation has completed successfully:

# cd /usr/RaidMan/
# ./arcconf GETCONFIG 1

This should print the current configuration of the raid controller and its logical drives.


CHAPTER 3

Indices and tables

• genindex
• search
