Universidade Federal do Rio de Janeiro

Compressive Sensing

Claudio Mayrink Verdun

Rio de Janeiro Dezembro de 2016

Universidade Federal do Rio de Janeiro

Compressive Sensing

Claudio Mayrink Verdun

Dissertação de Mestrado apresentada ao Programa de Pós-graduação em Matemática Aplicada, Instituto de Matemática da Universidade Federal do Rio de Janeiro (UFRJ), como parte dos requisitos necessários à obtenção do título de Mestre em Matemática Aplicada.

Orientador: Prof. Cesar Javier Niche Mazzeo. Co-orientador: Prof. Bernardo Freitas Paulo da Costa. Co-orientador: Prof. Eduardo Antonio Barros da Silva.

Rio de Janeiro Dezembro de 2016


Verdun, Claudio Mayrink V487c Compressive Sensing / Claudio Mayrink Verdun. – Rio de Janeiro: UFRJ/ IM, 2016. 202 f. 29,7 cm. Orientador: César Javier Niche Mazzeo. Coorientadores: Bernardo Freitas Paulo da Costa, Eduardo Ântonio Barros da Silva. Dissertação (Mestrado) — Universidade Federal do Rio de Janeiro, Instituto de Matemática, Programa de Pós-Graduação em Matemática, 2016.

1. Compressive Sensing. 2. Sparse Modelling. 3. Signal Processing. 4. Applied Harmonic Analysis. 5. Inverse Problems. I. Mazzeo, César Javier Niche, orient. II. Costa, Bernardo Freitas Paulo da, coorient. Silva, Eduardo Antonio Barros da, coorient. III. Título

Del rigor en la ciencia

En aquel Imperio, el Arte de la Cartografía logró tal Perfección que el Mapa de una sola Provincia ocupaba toda una Ciudad, y el Mapa del Imperio, toda una Provincia. Con el tiempo, estos Mapas Desmesurados no satisficieron y los Colegios de Cartógrafos levantaron un Mapa del Imperio, que tenía el Tamaño del Imperio y coincidía puntualmente con él. Menos Adictas al Estudio de la Cartografía, las Generaciones Siguientes entendieron que ese dilatado Mapa era Inútil y no sin Impiedad lo entregaron a las Inclemencias del Sol y los Inviernos. En los Desiertos del Oeste perduran despedazadas Ruinas del Mapa, habitadas por Animales y por Mendigos; en todo el País no hay otra reliquia de las Disciplinas Geográficas.

Suárez Miranda: Viajes de varones prudentes Libro Cuarto, cap. XLV, Lérida, 1658. In [Borges ’60].

Acknowledgments

This is the only part of my dissertation written in my native language. I hope that all who are cited here may read clearly what follows. Se você é daquelas pessoas que só liga para a matemática e acha que ela não é feita por seres humanos que sorriem, choram, pensam coisas profundas mas também fazem coisas triviais, pule esta parte e vá diretamente para o Capítulo 1. Ali, vetores esparsos e medições compressivas te esperam.

Eu tentei ser preciso e não esquecer ninguém. Se eu esqueci, possivelmente foi de propósito; se não foi, peço desculpas, embora nunca vá transparecer qual foi o motivo do esquecimento. Além disso, veremos ao longo dessa dissertação que redundância é algo fundamental, então começaremos por aqui, citando algumas pessoas em mais de um lugar e por motivos diferentes.

Esta dissertação é dedicada a duas santíssimas trindades: A primeira delas, Felipe Acker, o pai, Fábio Ramos, o filho, e Luiz Wagner Biscainho, o espírito santo. A segunda, Roberto Velho, o pai, Filipe Goulart, o filho, e Hugo Carvalho, o espírito santo. Essas seis pessoas contribuíram de maneira não trivial para minha formação e introduziram muitas não-linearidades na minha vida ao longo dos últimos anos. A quantidade de coisas que elas me ensinaram e todas as vezes que fui ajudado por elas, mesmo que às três horas da manhã, às vezes desesperado, e com pedidos nada convencionais, não pode ser descrita aqui em poucas palavras. Parte substancial da minha formação humana e científica deve-se aos laços estreitos e inúmeras conversas que mantenho com esses seis indivíduos bastante singulares.

Começo agradecendo três professores que tive na escola e que foram fonte de inspiração para a escolha da carreira científica: Andre Fontenele, José Carlos de Medeiros e José Alexandre Tostes Lopes (in memoriam). Aproveitei muito os três anos de contato com eles, as conversas que iam de geometria euclidiana à Suma Teológica de São Tomás de Aquino e todos os ensinamentos sobre por que o mundo era muito maior que o IME/ITA e por que havia opções melhores sobre o que fazer da vida.

Agradeço ao Vinícius Gripp por ter me mostrado a sociedade secreta que é a matemática aplicada da UFRJ. Se eu não tivesse passado mal um certo dia na escola e não tivesse precisado ir embora, não teria conhecido o Vinícius, todo o seu entusiasmo e tudo o que se seguiu a partir daí. Além disso, agradeço aos seis bandeirantes da matemática aplicada: Enio Hayashi, Pedro Maia, Rafael March, Roberto Velho, Rodrigo Targino e Yuri Saporito. Esses veteranos, que acabaram virando amigos, continuam por aí desbravando a mata fechada e eu sou muito feliz em aproveitar muitos dos conselhos por eles oferecidos.

Aos vários mestres que tive nos últimos sete anos. Cada lição, cada raciocínio, cada forma de fazer as coisas dentro e fora da sala de aula me mostrou um mundo de ideias interessantes que eu ainda vou utilizar muito nas minhas pesquisas e atividades didáticas. São eles Adan Corcho, Alexander Arbieto, Alexandra Schmidt, Amit Bhaya, Átila Pantaleão, Bernardo da Costa, Bruno Scárdua, César Niche, Didier Pilod, Eduardo Barros Silva, Fábio Ramos, Felipe Acker, Flávio Dickstein, Heudson Mirandola, Luca Moriconi, Marco Aurélio Cabral, Marcus Venícius Pinto, Luiz Wagner Biscainho, Nilson Bernardes, Paulo Diniz, Sergio Lima Netto, Severino Collier, Umberto Hryniewicz, Wagner Coelho, Wallace Martins e Wladimir Neves. Agradeço também a vários outros professores que conheci na universidade e que me ensinaram como não fazer as coisas, como não trabalhar, como não fazer pesquisa e dar aula, como não tratar os alunos, etc.

Aos muitos amigos e colegas que fiz na aplicada e na UFRJ. Alguns se tornaram muito próximos, outros nem tanto, mas todos eles compartilharam momentos felizes e ajudaram a superar momentos tristes na universidade. Também me proporcionaram conversas fantásticas e muitas vezes me mostraram que eu estava errado, por mais que fosse difícil dar o braço a torcer. São eles: Aloizio Macedo, Arthur Mitrano, Bruno Almeida, Bruno Oliveira, Bruno Santiago, Camilla Codeço, Carlos Lechner, Carlos Zanini, Cecília Mondaini, Claudio Heine, Daniel Soares, Danilo Barros, Davi Obata, Felipe Pagginelli, Felipe Senra, Flávio Ávila, Franklin Maia, Gabriel Castor, Gabriel Sanfins, Gabriel Vidal, Gabriela Lewenfus, Gabriella Radke, Ian Martins, Jhonatas Afradique, Leandro Machado, Leonardo Assumpção, Lucas Barata, Lucas Manoel, Lucas Stolerman, Marcelo Ribeiro, Marco Aurélio Soares, Milton Nogueira, Nicolau Sarquis, Paulo Cardozo, Pedro Fonini, Pedro Schmidt, Philipe de Fabritiis, Rafael Teixeira, Raphael Lourenço, Reinaldo de Melo e Souza, Rogério Lourenço, Thiago Degenring, Tiago Domingues, Tiago Vital, Washington Reis, Victor Coll e Victor Corcino.

Aos top’s, a galera realmente truuu, que se tornaram super amigos e que sei poder contar para qualquer coisa em qualquer hora e lugar. Alguns deles estão longe mas sei que caso eu precise, eles não medirão esforços para estar perto. Obrigado Adriano Côrtes, Bruno Braga, Bruno Santiago, Cláudio Gustavo Lima, Cristiane Krug, Douglas Picciani, Guilherme Sales, Luís Felipe Velloso e Rafael Ribeiro por serem quem vocês são e pela amizade tão forte.

Aos meus amigos de infância Hugo Portuita, Pedro Goñi e João Paulo Siqueira que sempre me procuram, insistem em me trazer para um mundo distante do acadêmico, e me lembram que o samba não pode morrer. Obrigado pelo carinho e amizade ao longo destes muitos anos. E desculpem pelo sumiço. Minhas ausências estão parcialmente justificadas nas páginas que seguem.

Aos meus amigos de conversas estranhas, em qualquer horário e sobre qualquer assunto, sempre regadas a muitas risadas, gritos e discussões intensas. Algumas delas, no Batista às cinco da tarde, outras no 485 lotado às dez da noite que faziam as pessoas olharem com vergonha alheia para nós, seja falando de política, religião, mal dos outros ou qualquer outra coisa. Ou mesmo nos infinitos áudios do Whatsapp. Espero ainda continuar brigando e conversando com Adriano Côrtes, André Saraiva, Bruno Santiago, Bruno Braga, Camila Codeço, Carlos Lechner, Cristiane Krug, Daniel Soares, Douglas Picciani, Filipe Goulart, Flávio Ávila, Gabriel Vidal, Guilherme Sales, Hugo Carvalho, Luís Felipe Velloso, Nicolau Sarquis, Reinaldo de Melo e Souza, Tiago Domingues e Tiago Vital.

Aos meus amigos do SMT, do lado aplicado da força. Isabela Apolinário e Renam Castro me ajudaram nas matérias com questões de computação e a entender a linguagem falada pelos engenheiros que soava tão estranha para mim. Ainda me assusto um pouco com um banco de filtros mas isso vai passar... Nossos trabalhos em equipe e seminários foram divertidos, por mais que muitas vezes eu não soubesse direito o que eu estava fazendo. Ao Flávio Ávila, grande amigo de seminários e de conversas político-econômicas, pela amizade aplicada. Espero finalmente ter tempo para nossos projetos começarem a andar. Meus colegas Hamed Yazdanpanah, Pedro Fonini e Rafael Zambrano, que trabalharam comigo no projeto do CENPES, me ajudaram a desbravar um mundo de engenharia de reservatórios completamente desconhecido para todos nós. Juro que tentarei não perturbar mais vocês às quatro da manhã para terminarmos um monte de relatórios e simulações. Aos professores Eduardo Barros Silva, Paulo Diniz e Sergio Lima Netto que compõem esta equipe e me ofereceram a oportunidade de trabalhar neste projeto, ajudando-me financeira e intelectualmente. A maneira como vocês encaram as situações de forma descontraída mas ao mesmo tempo profissional me encanta e fez com que todas as tardes de quarta fossem sempre engraçadas e leves mas ao mesmo tempo profundas. Aos professores Flávio Dickstein e Paulo Goldfeld por gentilmente cederem o espaço que utilizei para a escrita desta dissertação e para o trabalho no projeto nos últimos tempos. Foi muito interessante interagir com eles, mesmo que brevemente, e vê-los trabalhando com entusiasmo todos os dias. E ao Renan Vicente, pelas risadas, inúmeros favores com rodar simulações no IMEX para mim e pela companhia diária no LABMA.

Aos meus amigos Dingos da Montanha e companheiros de aventura Carolina Casals, Gabriel Vidal, Luis Felipe Velloso, Nicolau Sarquis, Petrônio Lima e Tiago Vital. Os dias ensolarados de trilha e as noites frias de travessia seriam menos engraçados e intensos se vocês não estivessem juntos para dividir barraca, comida, roupa e as experiências pelas quais passamos. Espero vocês para continuarmos as aventuras pela Europa!

Aos meus amigos de café, de Batista, de Anjinho, de almoço, etc., Bruno Scárdua, Fábio Ramos, Heudson Mirandola e Marcelo Tavares, que sempre estiveram presentes vibrando e compartilhando conquistas, que me proporcionaram ótimas conversas e me deram conselhos vitais sobre como não aumentar a entropia, resolver conflitos e como chegar ao sim comigo mesmo e com os outros. Em particular ao Bruno, mestre e amigo por quem nutro grande admiração, que me ensinou a importância dos slogans das propagandas de pneumáticos, dos alunos ilustres de Munique e que me ajudou em vários pontos fundamentais desde o curso de análise real.

Aos membros do seminário de Compressive Sensing: Amit Bhaya, Bernardo da Costa, César Niche, Hugo Carvalho, Iker Sobron, Luiz Wagner Biscainho, Marcello Campos e Wallace Martins. A paciência de vocês foi infinita ao me ouvir falar todas as terças, por mais de dois anos, sobre CS. Obrigado pelas discussões prolíficas e por me fazer entender o assunto melhor. Em especial ao Wallace Martins pela empolgação e por ter feito deste seminário uma realidade possível e única na história da UFRJ.

Agora vem a parte em que agradecemos aos membros da banca. Farei isso de forma alfabética:

Ao Amit Bhaya, pelo carinho, amizade, pela erudição e conversas que vão desde a União Soviética à ética científica, passando pelo controle dos sistemas fisiológicos e a culinária indiana. Ir até a sua sala com uma pergunta ou lhe enviar um email e sempre receber uma enxurrada de referências interessantes sempre fazia minhas semanas mais felizes. Ao Bernardo da Costa por toda a amizade, ajuda, suporte, orientação, empolgação, por toda a matemática e ciência que aprendi, pelas dúvidas difíceis desta dissertação (e fora dela!) e pelas conversas que nunca têm fim. Tenho a sensação de que toda vez que eu o encontro fico cada vez mais perdido sobre por onde começar a falar já que o número de assuntos entre nós está divergindo. Ao Carlos Tomei pelos almoços regados a histórias engraçadas e pelos conselhos de vida. Ele me ajudou de forma fundamental em momentos singulares onde eu achei que não dava para continuar em frente e me ensinou que existe uma relação não trivial entre inconscientes e molhos de tomate. Além disso, os muitos momentos esparsos de conversas que tivemos serviram para fazer surgir uma explosão de perguntas e ideias na minha cabeça. Ao Eduardo Silva por arrumar tempo para me receber em sua sala a qualquer hora, sempre com muito carinho, me ouvir e me ensinar muitas coisas sobre a vida e as relações com as pessoas. Ele foi um grande mentor, acreditando em mim, e me dando oportunidades de crescimento únicas, além de me ensinar várias coisas de processamento de sinais e teoria da informação. E fez tudo isso mesmo tendo 73472416 alunos, projetos, defesas, seminários e reuniões. Aprendi com ele, mais do que com qualquer outra pessoa, a importância do trabalho duro e da disciplina para fazer as coisas funcionarem. Ao Fábio Ramos, por tudo. Pelas matérias que não tinham fim e que nos faziam vir ao fundão no dia 30 de dezembro, pelos conselhos diários envolvendo todos os aspectos da minha vida, pela quantidade de livros e assuntos que descobri graças a ele, pelas piadas e trocadilhos que faziam os almoços um ponto alto do dia e muitas outras coisas difíceis de enumerar aqui. Além disso, seu apoio e extrema confiança valeram mais do que qualquer orientação. Ao Roberto Imbuzeiro por conselhos envolvendo matemática pura e aplicada, pelas críticas fundamentais a esta dissertação e por toda a ajuda em utilizar a sala de videoconferência do IMPA. Sem ela, meu futuro talvez tivesse tomado outro rumo. Ele também foi responsável por um imenso incentivo no que diz respeito a ir trabalhar na Alemanha. E a todos eles, por prontamente aceitarem fazer parte da minha banca e por lerem com todo o cuidado o trabalho que segue.

Não posso deixar de citar os funcionários da secretaria Alan, Aníbal e Cristiano. Os três sempre foram exemplares para resolver todos os problemas burocráticos que tive e para regularizar meu histórico cheio de coisas estranhas. Agradeço à tia Deise (in memoriam) que estava todo dia no fundão, com bom humor ou sem, e além de entender como todo aquele conto kafkiano funcionava, resolvia com primor tudo o que os alunos da aplicada precisavam. Espero, como ela, trabalhar com força e determinação até o último dia. E também ao Walter, que é a pessoa que faz o Instituto de Matemática funcionar e que mesmo com greve, cachorros, esgoto radioativo na linha vermelha, tiroteio na Vila do João, falta de água, ataque zumbi, etc., sempre esteve disposto a me ajudar com os problemas que surgiam na ABC-116 e a proporcionar o melhor ambiente possível para os alunos estudarem.

No que diz respeito à dissertação, agradeço ao Yuri Saporito que forneceu o template que serviu de molde para essa e inúmeras outras dissertações da aplicada, ao Luís Felipe Velloso pela ajuda e suporte com todas as figuras que precisei nos últimos tempos não somente nesta dissertação mas principalmente em apresentações diversas e pela rataria com o WinFIG e Xfig, ao Hugo Carvalho pela revisão ninja e por inúmeras melhorias no texto e ao Roberto Velho por um certo código mágico do Mathematica. Gostaria muito de ter descoberto isso antes...

À Natália Castro Telles pelo suporte e por me ensinar a ver a vida de vários outros ângulos, repensando minhas posições e estruturas. O crescimento que você me proporciona é muito singular.

Aos criadores e colaboradores da Wikipedia, pelas inestimáveis consultas. Aos que trabalham duro para manter os servidores do Cazaquistão, de Niue, do Equador, da Samoa Ocidental, de Belize, do Território Britânico do Oceano Índico e da Rússia funcionando a todo vapor, apesar dos ataques constantes, e que tornam o conhecimento livre de qualquer mordaça. Essa dissertação teria sido impossível sem os inúmeros esforços desses anônimos espalhados pelo mundo. Agradeço ao Donald Knuth, Leslie Lamport e a todos os que trabalham arduamente para ampliar este mundo incrível da formatação que é o LaTeX. Aos criadores do Detexify, esta ferramenta poderosa que me ajudou em vários momentos. Sem todos vocês, este documento teria sido escrito naquele editor que não pode ser nomeado e as coisas não ficariam tão bonitas.

A Andrei Kolmogorov e Claude Shannon, duas das mentes mais privilegiadas da história da humanidade, pela inspiração e pelas ideias que originaram os primórdios desse trabalho. Sem eles, não é exagero dizer que tudo isso (e muitas outras coisas) teria demorado mais uns 500 anos para surgir.

Às minhas companhias da madrugada por proporcionarem múltiplos ensinamentos. Muitos de vocês foram essenciais em momentos recentes onde concentrações de medida ou teoremas malucos de russos sobre espaços de Banach não faziam mais nenhum sentido e eu estava enjoado de olhar para o computador. Vivos ou mortos, carrego seus escritos como marcas indeléveis. Obrigado Antonio Damásio, Atul Gawande, Carl Sagan, Douglas Adams, Edward Wilson, Fernando Pessoa, Fiodor Dostoievsky, Franz Kafka, George Orwell, José Saramago, Liev Tolstoi, Luiz Felipe Pondé, Malcolm Gladwell, Vilayanur Ramachandran, Nelson Rodrigues, Oliver Sacks, Otto Maria Carpeaux, Robert Sapolsky, Valter Hugo Mãe e Wislawa Szymborska.

Aos professores Felix Krahmer e Holger Boche pela oportunidade que terei daqui pra frente e por tudo que conversamos por email até agora. Estou sendo tratado de maneira incrível e espero retribuir isso trabalhando bastante e não os decepcionando.

À minha família pelo suporte e pelo amor. Eu enganei vocês e na verdade eu nunca estudei psicologia dos anelídeos como muitas vezes eu disse em casa. Agradeço ao meu irmão Gabriel, minhas tias Marisa e Marilene e meu primo Flávio por tudo. Agradeço a minha avó Maria da Glória pelas conversas sobre a finitude da vida e por me ensinar, dentre muitas coisas que em várias situações a experiência é um pente que a gente ganha quando fica careca. E também ao Carlos e Irene, pessoas que acabaram por se tornarem meus segundos pais e que sei estarem presentes, de coração aberto, para o que eu precisar.

E finalmente, mas não menos importante... ao meu orientador César Niche, por tudo. Por uma orientação que virou amizade e que espero mantê-la sempre e pelo carinho e preocupação. Também pelos ótimos cursos, pela matemática aprendida, pelas conversas sobre a vida na academia e também fora dela, por topar me acompanhar nessa jornada nos últimos quatro anos, por acreditar em mim e nos meus projetos muitas vezes loucos e por sempre ter me dado liberdade de seguir em frente e estudar o que eu quisesse. Apesar da minha imensa teimosia e das muitas vezes em que discordei de suas ideias, sou grato a todas elas e a tudo o que aprendi com ele. Essa dissertação é fruto direto de toda essa ótima interação.

Ao meu pai Luiz Cláudio pelo entusiasmo com ciência, por me ensinar o poder da comunicação e por querer sempre mais dos meus esforços. À minha mãe Marília, por ter me ensinado o poder e a responsabilidade das próprias escolhas, da leitura e busca incessante, por tudo o que enfrentamos juntos, no dia a dia, sempre com muita conversa e argumentação e por sempre ter me dito que isso ia dar certo. A ambos, pelo amor e carinho, por nunca terem medido esforços para me dar a melhor educação que estava ao alcance deles e por me ensinarem que, na vida, devemos escolher fazer aquilo que nos deixe feliz e que faça com que nunca olhemos para o relógio.

À Ivani Ivanova, revisora máxima desse texto (que em outras situações seria colocada como coautora...), lendo cada detalhe, cada demonstração, cada argumento e respondendo com um tratado completo sobre como melhorá-los. Se o texto ainda apresenta inúmeros erros (e eu sei que apresenta pois somente hoje achei uns três), foi porque eu não terminei suas revisões com a paciência e esmero suficientes e a culpa é inteiramente minha. Ela não só é a revisora desse texto mas fez com que minha vida passasse por uma revisão completa desde que saí por aí falando de números cardinais. Toda a gratidão à companhia, amor, carinho e dedicação de sua parte, ao longo de toda a montanha-russa, não podem ser descritos aqui.

Claudio Mayrink Verdun Dezembro de 2016

“All that you touch All that you see All that you taste All you feel. All that you love All that you hate All you distrust All you save. All that you give All that you deal All that you buy, beg, borrow or steal. All you create All you destroy All that you do All that you say. All that you eat And everyone you meet All that you slight And everyone you fight. All that is now All that is gone All that’s to come and everything under the sun is in tune but the sun is eclipsed by the moon.

"There is no dark side of the moon really. Matter of fact it’s all dark." ."

Eclipse, Roger Waters - Pink Floyd

Resumo

Compressive Sensing

Claudio Mayrink Verdun

Resumo da dissertação de Mestrado submetida ao Programa de Pós-graduação em Matemática Aplicada, Instituto de Matemática da Universidade Federal do Rio de Janeiro (UFRJ), como parte dos requisitos necessários à obtenção do título de Mestre em Matemática Aplicada.

Resumo: Vivemos em um mundo digital. Os aparelhos ao nosso redor interpretam e processam bits a todo instante. Para isso ser possível, a conversão de sinais analógicos para o domínio discreto se faz necessária. Ela se dá por meio dos processos de amostragem, quantização e codificação. Através de tais etapas, é possível processar e armazenar os sinais em dispositivos digitais. O paradigma clássico do Processamento Digital de Sinais nos diz que devemos realizar uma etapa de amostragem e, em seguida, uma etapa de compressão, por meio da quantização e codificação, para a aquisição de sinais. Entretanto, realizado desta forma, este processo pode ser custoso ou desnecessário, devido a uma alta taxa de amostragem ou a uma grande perda de dados na etapa de compressão. A partir desta observação, coloca-se a seguinte questão:

É possível realizar o processo de aquisição e compressão simultaneamente de forma a obter o mínimo de informação necessária para a reconstrução de um dado sinal?

Os trabalhos seminais de David Donoho, Emmanuel Candès, Justin Romberg e Terence Tao mostram que a resposta para esta pergunta é afirmativa em muitas situações envolvendo sinais naturais ou gerados pelo ser humano. O presente trabalho tem como objetivo estudar a área denominada Compressive Sensing, termo sem tradução que se refere à resposta positiva a esta pergunta, isto é, ao procedimento de realizar uma aquisição compressiva de dados. Esta dissertação é um guia do mochileiro para esta área que está em rápido crescimento. Os teoremas que a fundamentam são apresentados com detalhe e rigor. Do ponto de vista matemático, ela é uma interseção das áreas de Otimização, Probabilidade, Geometria de Espaços de Banach, Análise Harmônica e Álgebra Linear Numérica. Aqui são discutidas técnicas de Teoria de Frames, Matrizes Aleatórias e aproximações ótimas em espaços de Banach para demonstrar a viabilidade deste novo paradigma em Processamento de Sinais.

Palavras–chave. Compressive Sensing, Compressive Sampling, Compressed Sensing, Esparsidade, Representação Esparsa, Redundância, Teoria de Frames, Análise Harmônica Aplicada, Matrizes Aleatórias, Princípio da Incerteza, Processamento de Sinais.

Rio de Janeiro Dezembro de 2016

Abstract

Compressive Sensing

Claudio Mayrink Verdun

Abstract of the Master's dissertation submitted to the Graduate Program in Applied Mathematics, Instituto de Matemática, Universidade Federal do Rio de Janeiro (UFRJ), as part of the requirements for the degree of Master in Applied Mathematics (Mestre em Matemática Aplicada).

Abstract: We live in a digital world. The devices around us interpret and process bits all the time. For this to be possible, a conversion of analog signals into the discrete domain is necessary. It happens through the processes of sampling, quantization and coding. Through such steps, it is viable to process and store such signals in digital devices. The classical paradigm of Digital Signal Processing for signal acquisition is a sampling step followed by a compression step, through quantization and coding. However, performed this way, the process can be costly or unnecessary, due to a high sampling rate or a large loss of data in the compression stage. From this observation, the following question is posed:

Is it possible to perform the acquisition and compression process simultaneously in order to obtain the minimum information for the reconstruction of a given signal?

The seminal works of David Donoho, Emmanuel Candès, Justin Romberg and Terence Tao show that the answer to this question is affirmative in many circumstances involving natural or man-made signals. The present work aims to study the area called Compressive Sensing, which is a procedure to perform data acquisition and compression at the same time. This dissertation is a hitchhiker’s guide to this rapidly growing field. The fundamental theorems will be explored in detail and with rigor. From the mathematical point of view, it is an intersection of Optimization, Probability, the Geometry of Banach Spaces, Harmonic Analysis and Numerical Linear Algebra. Here we discuss techniques from Frame Theory, Random Matrices and approximation in Banach Spaces to demonstrate the feasibility of this new paradigm in Signal Processing.

Keywords. Compressive Sensing, Compressive Sampling, Compressed Sensing, Sparsity, Sparse Representation, Redundancy, Frame Theory, Applied Harmonic Analysis, Random Matrices, Uncertainty Principle, Signal Processing.

Rio de Janeiro, December 2016

Contents

1 Sparse Solutions of Linear Systems
  1.1 The What, Why and How of Compressive Sensing
  1.2 A Little (Pre)History of Compressive Sensing
  1.3 Sampling Theory
  1.4 Sparse and Compressible Vectors
  1.5 How Many Measurements Are Necessary?
  1.6 Computational Complexity of Sparse Recovery
  1.7 Some Definitions

2 Some Algorithms and Ideas
  2.1 Introduction
  2.2 Basis Pursuit
  2.3 Greedy Algorithms
  2.4 Thresholding Algorithms
  2.5 The Search For the Perfect Algorithm

3 The Null Space Property
  3.1 Introduction
  3.2 The Null Space Property
  3.3 Recovery via Nonconvex Minimization
  3.4 Stable Measurements
  3.5 Robust Measurements
  3.6 Low-Rank Matrix Recovery
  3.7 Complexity Issues of the NSP

4 The Coherence Property
  4.1 Introduction
  4.2 Uncertainty Principle
  4.3 A Case Study: Two Orthogonal Basis
  4.4 Uniqueness Analysis for the General Case
  4.5 Coherence for General Matrices
  4.6 Properties and Generalizations of Coherence
  4.7 Constructing Matrices with Small Coherence
  4.8 A Glimpse of Frame Theory
    4.8.1 History and Motivation
    4.8.2 An Example
    4.8.3 Definition and Basic Facts
    4.8.4 Constraints for Equiangular Tight Frames
  4.9 Analysis of Algorithms
  4.10 The Quadratic Bottleneck

5 Restricted Isometry Property
  5.1 Introduction
  5.2 The RIP Constant and its Properties
  5.3 The Quadratic Bottleneck Reloaded
  5.4 Breaking the Bottleneck
  5.5 Analysis of Algorithms
  5.6 RIP Limitations

6 Interlude: Non-asymptotic Probability
  6.1 Introduction
  6.2 Subgaussian and Subexponential Random Variables
  6.3 Nonasymptotic Inequalities
  6.4 Comparison of Gaussian Processes
  6.5 Concentration of Measure
  6.6 Covering and Packing Numbers

7 Matrices Which Satisfy The RIP
  7.1 Introduction
  7.2 RIP for Subgaussian Ensembles
  7.3 Universal Recovery
  7.4 The Curious Case of Gaussian Matrices
  7.5 Johnson-Lindenstrauss Embeddings and the RIP

8 Optimality in the Number of Measurements
  8.1 Introduction
  8.2 n-Widths in Approximation Theory
  8.3 Compressive Widths
  8.4 Optimal Number of Measurements
  8.5 The Theorem of Kashin, Garnaev and Gluskin
  8.6 Connections Between Widths
  8.7 Instance Optimality

Bibliography

Preface

Measure what can be measured, and make measurable what cannot be measured. Attributed to Galileo Galilei

Measure what should be measured. Compressive Sensing version by Thomas Strohmer

In this dissertation we will explore one of the fascinating connections that emerged over the last 20 years among the communities of Applied Mathematics, Electrical Engineering and Statistics. It deals with several relevant ideas that have arisen about data acquisition and how we perform measurements. We live in an era where data and information are spread everywhere and it is a major challenge to acquire them in the most economical way. Two studies sponsored by EMC corporation, [Gantz & Reinsel ’10] and [Gantz & Reinsel ’12], measured all digital data created, replicated, and consumed in a single year, and also projected the size of the digital universe for the end of the decade. They argued that by the end of 2020 there would be 40000 exabytes, or 40 trillion gigabytes, of data in the world. They state: “From now (2012) until 2020, the digital universe will about double every two years”. Also, acquiring data can be expensive in terms of time, money or (in the case of, say, Computed Tomography) damage done to the object from which information is being acquired. Based on these facts, more than ever, we need efficient ways to represent, store and manipulate data. The question of how to sample and compress datasets is of major importance. In the last twenty years, many researchers asked if there are classes of signals that can be encoded using just a few pieces of information about them without much perceptual loss. Simultaneously, they realized that many signals seem to have few degrees of freedom, despite the fact that they “live”, in principle, in high-dimensional spaces. Knowing this, one can potentially design efficient sampling and storage protocols that take the useful information content embedded in a signal and condense it into a small amount of data. For example, [Candès, Romberg & Tao I ’06] performed an impressive numerical experiment and showed that a 512 × 512 pixel test image, known in the Image Processing community as the Logan-Shepp phantom, can be reconstructed from 512 Fourier coefficients sampled along 22 radial lines. In other words, more than 95% of the relevant data is missing, yet a very efficient compression and reconstruction scheme was found. In the same year, [Donoho ’06] realized that these compressed data acquisition protocols are, in his own words, “nonadaptive and parallelizable”. They do not require any prior knowledge of the data other than the fact that the data has a parsimonious representation with few coefficients. Furthermore, the measurements can be performed all at the same time. He called this new simultaneous acquisition and compression process Compressed Sensing. These two fundamental papers were followed by the works [Candès & Tao I ’06], [Candès & Tao II ’06] and [Candès, Romberg & Tao II ’06]. These three works, in turn, further developed the results and disseminated them in the communities of Statistics, Signal Processing and Mathematics. Since these five founding papers were published, ten years ago, there has been an explosion in the theory and applications of Compressive Sensing. Therefore it is important to highlight that the ideas which emerged from the notions of sparsity, incoherence and randomness, and that now permeate the world of data science, will not be transient. In Brazil, Compressive Sensing first appeared in 2009, in a course at the 27th Brazilian Mathematical Colloquium, through the book [Schulz, da Silva & Velho ’09].

I attended the classes and, despite the fact that I was in the first year of college, the impression recorded in my mind was that it would be interesting to put together Mathematics, Signal Processing, computation and statistical skills and try to solve a real-life problem. After I spent two years studying Fourier Analysis and Inverse Problems and also working on mathematical problems from medical imaging, I decided that I would dedicate myself to this subject and its new developments. Now, a little over ten years after the articles that founded the theory were published, there are some excellent books that are partially or entirely dedicated to it. See [Elad ’10], [Damelin & Miller ’12], [Eldar & Kutyniok ’12], [Rish & Grabarnik ’14] and [Eldar ’15]. However, up to now, the most comprehensive book on the subject is [Rauhut & Foucart ’13]. It is the main reference of this dissertation and my companion over the last three years, and most of the main ideas and proofs were extracted from it. Despite this, the influence of the other five books is also present in several passages of this text. There are some excellent reviews and surveys about Compressive Sensing. See [Candès ’06], [Holtz ’08], [Baraniuk ’07], [Romberg ’08], [Strohmer ’12], [Candès & Wakin ’08] and [Jacques & Vandergheynst ’11]. In addition, the success of this theory has been described in four issues of important journals entirely devoted to the subject. The articles in these four issues served as a source of inspiration and clarification on some obscure and hard points of the theory:

• IEEE Signal Processing Magazine, Vol. 25, No. 2, Mar 2008.
• IEEE Journal on Selected Topics in Signal Processing, Vol. 4, No. 2, Apr 2010.
• Proceedings of the IEEE, Vol. 98, No. 6, Jun 2010.
• IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Vol. 2, No. 3, Sep 2010.

Unfortunately, I did not solve a real problem in the last three years, as was my initial desire. Instead, I studied Compressive Sensing theory carefully. It is essential to say that in each chapter I have outlined important open problems connected to Compressive Sensing. The purpose of this dissertation is to present Compressive Sensing as a new paradigm in Signal Processing and to develop the main techniques necessary to completely understand it from the mathematical viewpoint. I hope the next three years will be enough to solve or, at least, to understand a real problem using the knowledge acquired from the study of these techniques.

Chapter 1

Sparse Solutions of Linear Systems

Is there anything of which one might say, “See this, it is new”? Already it has existed for ages Which were before us. Ecclesiastes 1:10

1.1 The What, Why and How of Compressive Sensing

In the last century, we have witnessed the deployment of a wide variety of sensors that acquire measurements to represent the physical world. Microphones, telescopes, anemometers, gyroscopes, transducers, radars, thermometers, GPS, seismometers, microscopes, hydrometers, etc. are everywhere. The purpose of all these instruments is to directly acquire measurements in order to capture the meaning of the world around us. Besides, a world without digital photos, videos and sounds being stored or shared is unimaginable nowadays. Advances in technology have allowed us to obtain more and more data in a short time. To deduce the state or structure of a system from this data is a fundamental task throughout the sciences and engineering. The first step to acquire the data is sampling. We need to sample the signals obtained from the real world in order to convert them from the analog domain to a sequence of bits in the digital domain. The second step is compression, which involves encoding information using fewer bits than the original representation. Both steps are necessary to store, manipulate, process, transmit and interpret data. However, in many situations, such as medical image acquisition, obtaining a massive amount of data is difficult, costly or slow. Moreover, nowadays, in many analog-digital (A/D) converters, after the signal is sampled, significant computations are expended on lossy compression algorithms. A representative example of this paradigm is the digital camera of any smartphone. It acquires millions of measurements each time a picture is taken and most of the data is discarded after the acquisition through the application of an image compression algorithm. Therefore, when the signal is decoded, typically only an approximation of the original one is produced. The main question that motivates this text is

Can both operations, sampling and compression, be combined?

That is, instead of high-rate sampling followed by computationally expensive compression, is it possible to reduce the number of samples and produce a compressed signal directly? Classical sampling theory does not exploit the structure or any prior knowledge about the signal being sampled. All we need to know is its frequency content, and we use this to determine the sampling rate. The objective of this dissertation is the study of Compressive Sensing, also known as Compressed Sensing or Compressive Sampling. It takes its name from the fact that data acquisition and compression can be performed simultaneously in such a way that the reconstruction algorithms will exploit the structure of the data. This technique is, nowadays, an essential underpinning of data analysis. We have two basic hypotheses behind the theory: sparsity and incoherence. The first one is related to the objects we are acquiring, i.e., we will use few measurements in order to capture signals by exploiting their parsimonious representation in some basis.

The second one is related to the measurement system. It extends the duality between time and frequency, and the idea behind it is that an object that has a sparse representation in a given basis must be spread out in any other basis, such as the one which represents the acquisition domain. We will also see that randomness plays a key role in the design of optimal incoherent systems. Thus, it can be considered as a third ingredient of the theory. From a mathematical viewpoint, Compressive Sensing can be seen as a chapter of contemporary Linear Algebra, since it deals with the pursuit of sparse solutions of underdetermined linear systems. Nevertheless, the mathematical techniques behind it are very sophisticated, going from Probability Theory, passing through Harmonic Analysis and Optimization and arriving at the Geometry of Banach Spaces. From the applied viewpoint, it is a fecund intersection of many areas from Electrical Engineering, Computer Science, Statistics and Physics. Through the use of Compressive Sensing, we can not only provide theoretical guarantees for the minimal number of measurements but also efficient algorithms for practical reconstruction. And all this is based on a simple principle of parsimony, which was enunciated many times, but popularized by the medieval philosopher William of Ockham: “Pluralitas non est ponenda sine neccesitate”, i.e., entities should not be multiplied unnecessarily. It is interesting to note that ideas of Compressive Sensing spontaneously occur in nature due to natural selection: the mammalian visual system has been shown to behave using sparse and redundant representations of sensory input. Some researchers, starting from the works of Barlow (great-grandson of Charles Darwin), Hubel and Wiesel, argued that a dramatic reduction occurs from the information that hits the retina to the information present in the visual cortex. Typically, a neuron in the retina responds simply to whatever contrast is present at that point in space, whereas a neuron in the cortex would respond only to a specific spatial configuration of light intensities. This leads to an optimal representation of the visual information in the brain and is known nowadays in the field of Neuroscience as Sparse Coding. See [Olshausen & Field ’04] and [Huang & Rao ’11]. The first application of Compressive Sensing was in pediatric magnetic resonance imaging (MRI). See [Vasanawala, Alley, Hargreaves, Barth, Pauly & Lustig ’10] and the blog post [Ellenberg - 02/22/10]. It enables a sevenfold speedup while preserving diagnostic quality and resolution. In this imaging modality, the traditional approaches to produce high-resolution images rely on the Shannon Sampling Theorem (Theorem 1.2, proved below) and may take several minutes. When imaging a heart patient, the patient cannot hold their breath for too long, and the artificial induction of a cardiac arrest can be dangerous and irreversible, depending on the health of the patient. Thus, obtaining images without blurring is a difficult task. See [Lustig, Donoho & Pauly ’07] for the problem statement and the use of Compressive Sensing. After the launch of these preprints, Compressive Sensing appeared as a noticeable improvement, gaining the attention of the media and of several research groups interested in improving the performance of signal reconstruction.
Nowadays, besides medical imaging, it leads to important contributions in Radar Technology, Error Correction Codes, Recommendation Systems, Wireless Communications, Learning Algorithms, DNA Microarrays, etc. See Section 1.2 of [Rauhut & Foucart ’13] and references therein for these and more examples. In mathematical terms, the problem of simultaneously acquiring and compressing a signal of interest $x \in \mathbb{C}^N$ is modeled by a (fat and short) measurement matrix, converting $x$ to $y = Ax$. Afterwards, we need to reconstruct $x$ from the underdetermined linear system

$$ Ax = y, $$
where the fat and short¹ measurement matrix $A \in \mathbb{C}^{m \times N}$ models the information acquisition process and $y \in \mathbb{C}^m$ is the observed data. Since we are in a compression regime, we expect (or impose) that $m \ll N$. At first sight, this system has infinitely many solutions. But the role played by sparsity will allow us to reduce it to a unique solution and, despite a seeming lack of sufficient information in the acquisition, we will see that recovery algorithms have a wondrous performance. Therefore we must start by understanding what sparsity is and how and when it leads to unique solutions of underdetermined linear systems.

It is important to note that linear measurements are, in principle, not necessary and we could develop a theory for non-linear measurements. Nevertheless, linear sensing is widely used and simpler than non-linear sensing, and we study this framework throughout the text. In this dissertation we begin with basic definitions of sparse and compressible vectors. Then, we will explore the algorithms that efficiently solve the problem of recovering sparse vectors. They are divided into three main families: optimization algorithms, greedy algorithms and thresholding algorithms. After this, we will discuss the three most important properties in the study of underdetermined linear systems: the nullspace property, the coherence property and the restricted isometry property. Later, we will proceed to the study of non-asymptotic probability techniques, especially the concentration of measure phenomenon, and how to use them to design efficient measurement matrices with an optimal number of measurements. Finally, the Geometry of Banach Spaces will play a special role in understanding the minimal amount of information needed to recover sparse signals, and it will appear as the main tool to provide sharper general results for even more general reconstruction methods than those presented in the previous chapters. The flowchart of ideas developed through this dissertation can be seen in Figure 1.1 below.

¹For this name and other ideas, one should consult the blog by Dustin Mixon: https://dustingmixon.wordpress.com/.

Figure 1.1: Flowchart of the dissertation

The core of this diagram is the “equivalence” between Basis Pursuit, an algorithm from convex optimization, and sparse recovery, a combinatorial problem. We will see that the Nullspace Property (NSP) plays a fundamental role in this equivalence. More specifically, Basis Pursuit allows sparse recovery if and only if the measurement matrix satisfies this property, as will be seen in Chapter 3. Since it is very hard to establish this property, we will replace it by two different sufficient conditions: the Coherence Property and the Restricted Isometry Property (RIP). This will be done in Chapters 4 and 5, respectively. The second will lead to better results in the number of measurements through the use of probability techniques, such as the concentration of measure. Therefore, in Chapters 6 and 7 we will explore probabilistic tools and how to use them in the Compressive Sensing framework. Finally, in Chapter 8 we introduce some concepts of the Geometry of Banach Spaces in order to establish the optimality of Basis Pursuit and other methods of sparse recovery discussed throughout this dissertation.
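To make the Basis Pursuit side of this equivalence concrete before Chapter 2, here is a minimal numerical sketch. It is our own illustration, not part of the cited works: it uses SciPy's generic linear-programming routine (scipy.optimize.linprog) rather than a dedicated solver, and the dimensions, sparsity level and random seed are arbitrary choices. The problem $\min \|x\|_1$ subject to $Ax = y$ is rewritten as a linear program by splitting $x$ into its positive and negative parts.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    N, m, s = 100, 40, 5                 # ambient dimension, measurements, sparsity

    x_true = np.zeros(N)                 # an s-sparse ground truth
    support = rng.choice(N, size=s, replace=False)
    x_true[support] = rng.standard_normal(s)

    A = rng.standard_normal((m, N)) / np.sqrt(m)   # random Gaussian measurement matrix
    y = A @ x_true                                 # m << N compressive measurements

    # Basis Pursuit as a linear program: write x = u - v with u, v >= 0 and
    # minimize sum(u) + sum(v) subject to A(u - v) = y.
    c = np.ones(2 * N)
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
    x_hat = res.x[:N] - res.x[N:]

    print("reconstruction error:", np.linalg.norm(x_hat - x_true))

In the regime above (s = 5, m = 40, N = 100), the computed error is typically negligible, i.e., the convex program returns exactly the sparse ground truth; the following chapters explain when and why this happens.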

1.2 A Little (Pre)History of Compressive Sensing

Compressive Sensing has a long history. The first ideas of sparse estimation can be traced back to Gaspard Riche de Prony, a French mathematician and engineer. In 1795, he proposed an algorithm to estimate the parameters associated with a small number of complex exponentials sampled in the presence of noise (a minimal numerical sketch of this idea, in the noiseless case, is given right after this paragraph). See Theorem 2.15 of [Rauhut & Foucart ’13] and Section 15.2 of [Eldar ’15]. Later, in the first half of the 20th century, the works of Constantin Caratheodory, Arne Beurling and Benjamin Logan in Harmonic Analysis showed that it was possible to recover certain sinusoids and pieces of the Fourier Transform by exploiting the notion of minimal extrapolation, which nowadays we recognize as $\ell_1$-minimization. See [Donoho ’10]. This will be further discussed in Chapter 2. During World War II, the technique of combinatorial group testing was introduced. It was a time when there were many people infected with syphilis and the problem was to discover who was infected and who was not by using efficient testing schemes, considering that resources were scarce. The US Public Health Service did not want ill men serving in the military, but it was not possible to test all individuals, hundreds of thousands of people, in order to identify the small proportion of infected people. Then, instead of testing single individuals, blood samples were combined in an efficient way to reveal whether someone in the combination had the disease. Therefore the task was to identify all individuals with syphilis while minimizing the number of tests, and again the parsimony principle applies. See [Dorfman ’43] and [Gilbert, Iwen & Strauss ’08]. In the late seventies and early eighties, researchers from the Geology and Geophysics communities showed that the layered structure of the earth’s interior could be exploited to increase the accuracy of seismic signal recovery. [Taylor, Banks & McCoy ’79], [Levy & Fullagar ’81] and [Walker & Ulrych ’83] showed that very incomplete measurements can be used to recover a full wideband seismic signal, despite the fact that no low-frequency content can be acquired due to the nature of seismic measurements. Also, the first paper to explicitly state the use of the $\ell_1$-norm for signal reconstruction was written by people working on Geophysical Inverse Problems: [Santosa & Symes ’86]. After these contributions, the Magnetic Resonance Spectroscopy and Radio Astronomy communities started to use sparsity concepts in signal recovery. In particular, methods exploiting this parsimonious representation of data were faster and more efficient than the classical method of Maximum Entropy Inversion [Donoho, Johnstone, Stern & Hoch ’92]. Essentially at the same time, the “Wavelets explosion” appeared through the works of Daubechies, Meyer, Mallat, Coifman, Wickerhauser and others. From their ideas, it became typical to describe effective media representations, such as images and videos, using Wavelets. Afterwards, Wavelets became part of the standard techniques in Signal Processing and of standard technologies such as fingerprint databases. See [Daubechies ’92] and [Burrus, Gopinath & Guo ’97]. The first works that showed the general algorithmic principles which are used in Compressive Sensing were [Mallat & Zhang ’93] and [Chen & Donoho ’94]. In these papers, a systematic study was performed in order to develop a general theory of sparse representations of signals and to understand how this can be done in an efficient way.
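This is the sketch of Prony's idea promised above. It is our own noiseless illustration: the function name, the model $x[k] = \sum_j c_j z_j^k$ and the toy example are assumptions made only for this example, and the noisy case treated in Theorem 2.15 of [Rauhut & Foucart ’13] requires more care. The $s$ exponentials are identified from $2s$ samples by solving a small Hankel system for an annihilating polynomial and reading off its roots.

    import numpy as np

    def prony(samples, s):
        """Recover s nodes z_j and weights c_j from 2s noiseless samples
        x[k] = sum_j c_j * z_j**k (illustrative helper, noiseless case only)."""
        x = np.asarray(samples, dtype=complex)
        # Annihilating filter: x[k+s] + a[s-1]*x[k+s-1] + ... + a[0]*x[k] = 0.
        H = np.array([[x[k + l] for l in range(s)] for k in range(s)])
        b = np.array([x[k + s] for k in range(s)])
        a = np.linalg.solve(H, -b)
        z = np.roots(np.concatenate(([1.0], a[::-1])))     # nodes = polynomial roots
        V = np.vander(z, N=len(x), increasing=True).T       # Vandermonde system for c
        c, *_ = np.linalg.lstsq(V, x, rcond=None)
        return z, c

    # Three exponentials recovered from only six samples.
    rng = np.random.default_rng(1)
    z_true = np.exp(2j * np.pi * rng.uniform(size=3))
    c_true = rng.standard_normal(3) + 1j * rng.standard_normal(3)
    x = np.array([np.sum(c_true * z_true**k) for k in range(6)])
    z_est, c_est = prony(x, 3)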
Staying with [Chen & Donoho ’94], it is important to point out that a large part of the history and evolution of ideas before Compressive Sensing, in the nineties, can be understood from the point of view of the works of David Donoho and his collaborators. Donoho, a very prolific mathematician, paved the way to the founding papers of Compressive Sensing. It starts with the seminal paper [Donoho & Stark ’89], where important ideas about the relation between the uncertainty principle and the acquisition of information from a signal were presented. After this, many of his works contained fundamental ideas in Statistics and Signal Processing that culminated in Compressive Sensing. One important example is [Donoho & Huo ’01]. The goal of this paper was to establish the connections between uncertainty principles and optimal parsimonious representations of signals using very few coefficients, what they called an Ideal Atomic Decomposition. Parallel to the advances in signal processing, similar ideas emerged in the Statistics community. The development of the LASSO by [Tibshirani ’96], a linear regression method that uses the $\ell_1$-norm as a regularizer, started a new era in Statistics and Machine Learning where many problems, previously intractable, could be solved.

Finally, the foundation of Compressive Sensing can be credited to [Donoho ’06], [Candès & Tao I ’06], [Candès & Tao II ’06], [Candès, Romberg & Tao I ’06] and [Candès, Romberg & Tao II ’06]. They combined deep techniques from Probability Theory, Approximation Theory and the Geometry of Banach Spaces in order to achieve a theoretical breakthrough: they showed how to exponentially decrease the number of measurements for an accurate and computationally efficient recovery of signals, provided that they are sparse. These five works revolutionized the way we think about Signal Processing. According to Google Scholar, they now have more than 52000 citations together. Subsequently, Compressive Sensing turned into a rapidly growing field. Since 2006, many sharper theorems, more efficient algorithms, dedicated hardware and generalizations of the theory have emerged. It will probably take a few more years for all contributions to be properly organized, so that we can tell the story of the new ideas, developments and breakthroughs of the field in the last twenty years. Before we start with it, we need to come back to the fundamentals of sampling, especially to the classical Sampling Theorem.

1.3 Sampling Theory

Sampling lies at the core of Signal Processing and is the first step of analog-to-digital conversion. It is the gate between the discrete and the continuous world. Two other important steps are quantization and coding. The former is the reduction of the sample values from their continuous range to a discrete set. The latter generates a digital bitstream as the final representation of the continuous-time signal. This dissertation is about sampling. We start by looking at the celebrated Shannon-Nyquist-Kotelnikov-Whittaker Theorem². First, we need a definition.

²In the literature of Approximation Theory, it is known as the cardinal theorem of interpolation. See [Higgins ’85].

Definition 1.1. The Paley-Wiener Space $PW_\Omega(\mathbb{R})$ is the subspace of $L^2(\mathbb{R})$ consisting of all functions with Fourier transforms supported on finite intervals symmetric around the origin. More precisely,

$$ PW_\Omega(\mathbb{R}) = \left\{ f \in L^2(\mathbb{R}) : \hat{f}(\xi) = 0 \text{ for } |\xi| \ge \Omega \right\}. $$
As [Mishali & Eldar ’11] points out, “The band-limited signal model is a natural choice to describe physical properties that are encountered in many applications. For example, a physical communication medium often dictates the maximal frequency that can be reliably transferred. Thus, material, length, dielectric properties, shielding and other electrical parameters define the maximal frequency Ω. Often, bandlimitedness is enforced by a lowpass filter with cutoff Ω, whose purpose is to reject thermal noise beyond frequencies of interest”.

Theorem 1.2 (Theorem 13 of [Shannon ’48]). Let $f(t) \in PW_\Omega(\mathbb{R})$. Then $f(t)$ is completely determined by its samples at the points $\{t_n = n\pi/\Omega\}_{n \in \mathbb{Z}}$. Also, the following reconstruction formula holds:
$$ f(t) = \sum_{n \in \mathbb{Z}} f\!\left(\frac{n\pi}{\Omega}\right) \frac{\sin(\Omega t - n\pi)}{\Omega t - n\pi}, $$
where the series above converges in the $L^2$ sense and uniformly.

Remark 1. Here we provide a proof of the Sampling Theorem because it is difficult to find a rigorous one in the literature. For a proof in an abstract framework, with more general interpolation functions than $\frac{\sin t}{t}$, see Theorem 2.7 of [Güntürk ’00]. Also, for an engineering interpretation of Theorem 1.2 as “sampling by modulation”, see Section 4.2.2 of [Eldar ’15].

Proof. Since $f(t) \in L^2(\mathbb{R})$, we have $\hat{f}(\xi) \in L^2(\mathbb{R})$. Also, as a consequence of the Paley-Wiener Theorem (see Section 4.3 of [Stein & Shakarchi ’05]), $f(t)$ is the restriction to the real line of an analytic function. We can write the Fourier Series of $\hat{f}(\xi)$ in the interval $[-\Omega, \Omega]$, since the complex exponentials $\{e^{in\pi\xi/\Omega}\}_{n \in \mathbb{Z}}$ form an orthogonal basis for $L^2(-\Omega, \Omega)$. In particular, the coefficients will be given by

$$ c_n = \frac{1}{2\Omega} \int_{-\Omega}^{\Omega} \hat{f}(\xi)\, e^{-in\pi\xi/\Omega}\, d\xi. $$

By hypothesis, $\hat{f}(\xi) = 0$ for all $|\xi| \ge \Omega$. This yields, for all $n \in \mathbb{Z}$,

$$ c_{-n} = \frac{1}{2\Omega} \int_{-\Omega}^{\Omega} \hat{f}(\xi)\, e^{in\pi\xi/\Omega}\, d\xi = \frac{1}{2\Omega} \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{in\pi\xi/\Omega}\, d\xi = \frac{\pi}{\Omega} \cdot \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{i\xi \frac{n\pi}{\Omega}}\, d\xi = \frac{\pi}{\Omega}\, f\!\left(\frac{n\pi}{\Omega}\right), $$
where we used the Inversion Formula for the Fourier Transform in the last equality. Thus, we obtain

$$ \hat{f}(\xi) = \sum_{n=-\infty}^{\infty} \frac{\pi}{\Omega}\, f\!\left(\frac{n\pi}{\Omega}\right) e^{-i\xi \frac{n\pi}{\Omega}} = \lim_{N\to\infty} \sum_{n=-N}^{N} \frac{\pi}{\Omega}\, f\!\left(\frac{n\pi}{\Omega}\right) e^{-i\xi \frac{n\pi}{\Omega}}. \qquad (1.1) $$
The convergence above is in the $L^2(-\Omega, \Omega)$ sense. We also have that
$$ \|\hat{f}\|_1 = \int_{-\Omega}^{\Omega} |\hat{f}(\xi)|\, d\xi = \int_{-\Omega}^{\Omega} 1 \cdot |\hat{f}(\xi)|\, d\xi \le \|1\|_2\, \|\hat{f}\|_2 = \sqrt{2\Omega}\, \|\hat{f}\|_2. $$

This shows that $\hat{f} \in L^1(\mathbb{R})$ and that, for functions supported in $[-\Omega, \Omega]$, convergence in $L^2$ implies convergence in $L^1$. Using that $\hat{f}(\xi) = 0$ for $|\xi| \ge \Omega$, we can multiply Equation (1.1) on both sides by $\chi_{[-\Omega,\Omega]}(\xi)$, the characteristic function of the interval $[-\Omega, \Omega]$. This yields
$$ \left\| \hat{f}(\xi) - \sum_{n=-N}^{N} \frac{\pi}{\Omega}\, f\!\left(\frac{n\pi}{\Omega}\right) e^{-i\xi \frac{n\pi}{\Omega}}\, \chi_{[-\Omega,\Omega]}(\xi) \right\|_p \xrightarrow[N\to\infty]{} 0 $$
for $p = 1$ and $p = 2$. We know, by Parseval's Theorem, that the Inverse Fourier Transform is a bounded continuous operator $\mathcal{F}^{-1} : L^2(\mathbb{R}) \to L^2(\mathbb{R})$. Also, by the Riemann-Lebesgue Theorem, it is a bounded continuous operator from $L^1(\mathbb{R})$ to $C_0(\mathbb{R})$, the space of continuous functions vanishing at infinity equipped with the supremum norm. Therefore we conclude

$$ f = \mathcal{F}^{-1}(\hat{f}) = \mathcal{F}^{-1}\!\left( \lim_{N\to\infty} \sum_{n=-N}^{N} \frac{\pi}{\Omega}\, f\!\left(\frac{n\pi}{\Omega}\right) e^{-i\xi \frac{n\pi}{\Omega}}\, \chi_{[-\Omega,\Omega]}(\xi) \right) = \lim_{N\to\infty} \sum_{n=-N}^{N} \frac{\pi}{\Omega}\, f\!\left(\frac{n\pi}{\Omega}\right) \mathcal{F}^{-1}\!\left( e^{-i\xi \frac{n\pi}{\Omega}}\, \chi_{[-\Omega,\Omega]}(\xi) \right), $$
where the convergence occurs both in the $L^2(\mathbb{R})$ norm and in $C_0(\mathbb{R})$. To finish, note that
$$ \mathcal{F}^{-1}\!\left( e^{-i\xi \frac{n\pi}{\Omega}}\, \chi_{[-\Omega,\Omega]}(\xi) \right) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\xi \frac{n\pi}{\Omega}}\, \chi_{[-\Omega,\Omega]}(\xi)\, e^{i\xi t}\, d\xi = \frac{1}{2\pi} \int_{-\Omega}^{\Omega} e^{i\xi t - i\xi \frac{n\pi}{\Omega}}\, d\xi = \frac{\Omega}{\pi}\, \frac{\sin(\Omega t - n\pi)}{\Omega t - n\pi}. $$

This is a modern translation of the theorem. We can compare it with the original words of Shannon: “If a function f(t) contains no frequencies higher than Ω cycles-per-second, it is completely determined by giving its ordinates at a series of points spaced 1/2Ω seconds apart”. He also wrote: “it is a fact which is common knowledge in the communication art... but in spite of its evident importance it seems not to have appeared explicitly in the literature of communication theory”. Many signals of interest are modeled as bandlimited functions, i.e., real-valued functions on the real line with compactly supported Fourier transforms. Theorem 1.2 then tells us that it is possible to reconstruct a band-limited function using uniform samples. It says that an enumerable amount of information is sufficient to reconstruct any (band-limited) function on the continuum. The heuristic behind it is that band-limited functions have limited time variation, and can therefore be perfectly reconstructed from equally spaced samples taken at a rate of at least $\Omega/\pi$ (twice the maximal frequency $\Omega/2\pi$), called the Nyquist rate.
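The reconstruction formula of Theorem 1.2 is easy to verify numerically. The sketch below is our own illustration, and every concrete choice in it (the squared-sinc test function, the band $\Omega = \pi$, the truncation level K of the series and the evaluation window) is an arbitrary assumption: a bandlimited function is rebuilt on a fine grid from its uniform samples at $t_n = n\pi/\Omega$.

    import numpy as np

    Omega = np.pi        # band limit; the samples then sit at t_n = n*pi/Omega = n
    K = 200              # truncation level of the interpolation series

    def f(t):
        # A concrete element of PW_Omega: a squared sinc, whose Fourier
        # transform is a triangle supported on [-Omega, Omega].
        return np.sinc(Omega * t / (2 * np.pi)) ** 2

    t = np.linspace(-5.0, 5.0, 1001)            # evaluation grid (central window)
    n = np.arange(-K, K + 1)
    samples = f(n * np.pi / Omega)              # uniform samples at the Nyquist rate

    # Truncated Shannon series: sum_n f(n*pi/Omega) * sin(Omega*t - n*pi)/(Omega*t - n*pi),
    # written with np.sinc, which computes sin(pi*x)/(pi*x).
    kernel = np.sinc(Omega * t[:, None] / np.pi - n[None, :])
    f_rec = kernel @ samples

    print("max reconstruction error:", np.max(np.abs(f_rec - f(t))))
    # The error is small and shrinks further as K grows, since only the
    # truncation of the series (not the sampling itself) introduces error.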

Remark 2. Theorem 1.2 has a long and prolific history. It is a striking case of multiple discoveries and many names beyond Shannon, Nyquist, Kotelnikov and Whittaker are associated with it. Also, it forms the basis of Signal Processing. See [Jerri ’77], [Unser ’00], [Meijering ’02], [Ferreira & Higgins ’11] and references therein. Also, many generalizations, including results for not necessarily band-limited functions or nonuniform sampling can be found in the literature. See [Butzer & Stens ’92] and [Marvasti ’01].

Although it forms the basis of modern Signal Processing, Theorem 1.2 has two major drawbacks. The first one is that not all signals of the real world are band-limited. The second is that if the bandwidth is too large, then we must have too many samples in order to perform the acquisition of the signal. Since, in many situations, it is very difficult to sample at high rate, some alternatives began to be considered. These alternatives are known as sub-Nyquist sampling strategies. Incredibly, the first overcoming of the Sampling Theorem occurred before its existence. It was proved in [Carathéodory 1911] that if a signal is a positive linear combination of any N sinusoids, it is uniquely determined by its value at t = 0 and any other 2N points. Depending on the highest frequency sinusoid in the signal, this can be much better than Nyquist rate. While the number of samples in the Nyquist rate increases with Ω, in Caratheodory’s argument it increases with N. Among all sub-Nyquist sampling alternatives, Compressive Sensing is one of the most important [Mishali & Eldar ’11]. It focuses on efficiently measuring a discrete signal through a measurement matrix A Cm×N with fewer measurements than the ambient dimension. Although it is a theory for discrete signals,∈ recently some researches generalized ideas from Compressive Sensing to analog signals. See [Tropp, Laska, Duarte, Romberg & Baraniuk ’10] and [Adcock, Hansen, Roman & Teschke ’13]. This sensing with fewer measurements is only possible (and useful) because many natural signals are sparse or compressible. Using this priori information will allow us to design efficient reconstruction schemes. These, in turn, will be able to recover the vectors as the unique solution of some optimization problems.

1.4 Sparse and Compressible Vectors

In this section we define the main objects of our study: sparse and compressible vectors. The fundamental hypothesis we will explore is that many real-world signals have most of their components being zero (or approximately zero).

Definition 1.3. The support of a vector x CN is the index set of its nonzero entries, i.e., ∈

supp(x) = j [N]: xj = 0 . { ∈ 6 } N A vector x C is called s-sparse if at most s of its entries are nonzero, i. e., if x 0 = #(supp(x)) s. ∈ || || ≤ The set of all s-sparse signals is denoted by Σs = x : x 0 s . { || || ≤ } Despite x 0 not being a norm, it is abusively called this way throughout the literature on Sparse Recovery, Signal|| || Processing or Computational and Applied Harmonic Analysis. It can be seen as the limit of `p-quasinorms as p decreases to zero through the following observation.

N N p X p p→0 X x = xj 1 = # j [N]: xj = 0 . || ||p | | −−−→ {xj 6=0} { ∈ 6 } j=1 j=1

Remark 3. Sparsity has no linear structure. Given any x, y Σs, we do not necessarily have x+y Σs, ∈ N ∈ although we do have x + y Σ2s. The set Σs consists in the union of all possible canonical subspaces ∈ s of CN . 12 CHAPTER 1. SPARSE SOLUTIONS OF LINEAR SYSTEMS

As discussed in Section 1.3.2 of [Eldar & Kutyniok ’12], in certain applications it is possible to have more concise models by restricting the feasible signal supports to a small subset of the possi- N ble s selection of nonzero coefficient. This leads to the concept of sparse union of subspaces. See [Duarte & Eldar ’11]. In the Signal Processing community, it is said that we can model the sparse signal x Cm as the linear combination of N elements, called atoms, that is ∈

N X x = Φα = αiϕi, i=1 where αi are the representation coefficients of x in the dictionary Φ = [ϕ1, . . . , ϕN ]. We will require N m that the elements of the dictionary ϕ i=1 span the entire signal space. Here it will be C but in other { } situations it can be some specific subset or proper subspace of Cm. This representation is often redundant since we allow N > m. Due to redundancy, this representation is not unique, but we typically consider (or look for) the one with the smallest number of terms. A signal may be not sparse in a first moment but it may be “sparsified” just by using a specific dictionary. What is behind the concept of sparsity is the philosophy of Occam’s razor: when we are faced with many possible ways to represent a signal, the simplest choice is best one. This is so because there is a cost to describe the basis we are using for this representation. However, sparsity is a theoretical abstraction. In the real world, one does not seek for sparse vectors but, instead, for approximately sparse vectors. Even more, we want to measure how close our signals of interest are to a true sparse one. This measure is given by the next concept.

N Definition 1.4. For p > 0, the `p-error of best s-term approximation to a vector x C is defined by ∈ N σs(x)p = inf x z p, z C is s-sparse . {|| − || ∈ } Remark 4. The infimum in Definition 1.4 will always be achieved by an s-sparse vector z CN whose nonzero entries are equal to the s largest absolute entries of x. There is no reason for this∈ vector to be unique. Nevertheless, when it attains the infimum, it occurs independently of p > 0. The concept of approximately sparse or compressible vector is a little more ingenious and realistic. Instead of asking for the number of nonzero component to be small, we ask for the error of its best s-term approximation to decay fast in s. This can be modeled in two different ways. The first one is by using `p balls for some small p > 0. The second one is by exploring the concept of significant components of a vector itself. This will lead to the concept of weak `p-spaces. Proposition 1.5 tells us that discrete signals which belong to the unit ball in some `p-norm, for some small p > 0 are good models for compressible vectors. Here we will denote the nonconvex ball in RN N N equipped with `p-norm by Bp = z C : z p 1 . { ∈ || || ≤ } Proposition 1.5. For any q > p > 0 and any x CN , ∈ 1 σs(x)q x p. ≤ s1/p−1/q || || Definition 1.6. The nonincreasing rearrangement of the vector x CN is the vector x∗ RN for which ∈ ∈ x∗ x∗ x∗ 0. 1 ≥ 2 ≥ · · · ≥ N ≥ and there is a permutation π :[N] [N] with x∗ = x for all j [N]. → j | π(j)| ∈ ∗ N N Proof. (of Proposition 1.5): x R+ , the nonincreasing rearrangement of x C , yields ∈ ∈ N N s q−p N   p   q X ∗ q ∗ q−p X ∗ p 1 X ∗ p X ∗ p σs(x) = (x ) (x ) (x ) (x ) (x ) q j ≤ s j ≤ s j j j=s+1 j=s+1 j=1 j=s+1 1.4. SPARSE AND COMPRESSIBLE VECTORS 13

q−p 1  p 1 x p x p = x q . ≤ s || ||p || ||p sq/p−1 || ||p

The next result is a stronger version, with optimal constant, of Proposition 1.5. Its importance relies on the fact that sharp results on error reconstruction analysis depends on it. Also, the method for proving it appears many times in the Compressive Sensing literature. We use optimization ideas to prove some inequalities with sharp constant. This technique will be invoked again in Lemma 4.25 and Lemma 5.21.

Theorem 1.7. For any q > p > 0 and any x CN , the inequality ∈ cp,q σs(x)q x p ≤ s1/p−1/q || || holds with

pp/q p1−p/q1/p cp,q = 1 1. q − q ≤

∗ N N ∗ p Proof. Let x R+ be the nonincreasing rearrangement of x C . If we set αj = (xj ) , we will transform the problem∈ into a equivalent one given by ∈  q α1 α2 αN 0 q/p q/p q/p cp,q ≥ ≥ · · · ≥ ≥ = as+1 + as+2 + + as+N . α1 + α2 + + αN 1 ⇒ ··· ≤ sq/p−1 ··· ≤ Therefore, using r = q/p > 1, we need to maximize the convex function

r r r f(α1, α2, . . . , αN ) = α + α + + α , s+1 s+s ··· N N over the convex polytope C = (α1, . . . , αN ) R : α1 αN 0 and α1 + + αN 1 . As any point of C is a convex{ combination of∈ its vertices≥ and · · · because ≥ ≥ the function f···is convex,≤ the} maximum is attained at a vertex of C (see Theorem B.16 from [Rauhut & Foucart ’13] or Theorem 2.65 from [Ruszczynski]). Besides, the vertices are intersections of N hyperplanes when we force N of the N + 1 inequalities to become equalities. From this we obtain the following possibilities:

1. If α1 = = αN = 0, then f(α1, α2, . . . , αN ) = 0. ··· 2. If α1 + + αN = 1 and α1 = = αk > αk+1 = = αN = 0 for some 1 k s, then one has ··· ··· ··· ≤ ≤ f(α1, α2, . . . , αN ) = 0.

3. If α1 + + αN = 1 and α1 = = αk > αk+1 = = αN = 0 for some s + 1 k N then one ··· ··· ··· r ≤ ≤ has α1 = = αk = 1/k, and consequently f(α1, α2, . . . , αN ) = (k s)/k . ··· − Thus we have obtained

k s max f(α1, α2, . . . , αs) = max −r . (α1,...,αs)∈C s+1≤k≤N k Now, consider f(k) as a function of a continuous variable k, we observe that the function g(k) = (k s)/kr is increasing until the critical point k∗ = (r/(r 1))s and decreasing thereafter. Using this information− yields −

 r−1 q ∗ 1 1 1 cp,q max f(α1, α2, . . . , αs) g(k ) = 1 r−1 = q/p−1 . (α1,...,αs)∈C ≤ r − r s s

The other way of reasoning about compressible vectors is to require that the number of its significant components be small. This can be done with the introduction of the weak `p-spaces. 14 CHAPTER 1. SPARSE SOLUTIONS OF LINEAR SYSTEMS

N N Definition 1.8. For p > 0, the weak `p space w`p denotes the space C equipped with the quasinorm n M p o x p,∞ = inf M 0 : # j [N]: xj t for all t > 0 . || || ≥ { ∈ | | ≥ } ≤ tp Remark 5. In the definition above, the quaisnorm comes from the fact that a constant appears in the max{1,1/p} triangular inequality: x + y p,∞ 2 ( x p,∞ + y p,∞). For definitions, see Section 1.7. || || ≤ || || || || N There is an alternative expression for the weak `p-quasinorm of a vector x C . ∈ N Proposition 1.9. For p > 0, the weak `p-quasinorm of a vector x C can be expressed as ∈ 1/p ∗ x p,∞ = max k xk, || || k∈[N]

∗ N N where x R+ denotes the noincreasing rearrangement of x C . ∈ ∈ N ∗ Proof. Given x C , we clearly have x p,∞ = x p,∞. Then we need to prove that x := 1/p ∗ ∈ ∗ || || || || || || max k x equals x p,∞. For t > 0 we have two possibilities: k∈[N] k || || j [N]: x∗ t = [k] for some k [N] or j [N]: x∗ t = . { ∈ j ≥ } ∈ { ∈ j ≥ } ∅ ∗ 1/p ∗ p p In the first case, t xj x /k . This implies # j [N]: xj t = k x /t . Note that ≤ ≤ || || { ∗∈ ≥ } ≤ || || this inequality holds trivially in the case that j [N]: xj t = . By the definition of the weak `p- ∗ { ∈ ≥ } ∅∗ ∗ quasinorm, this leads to x p,∞ x . Now, suppose that x > x p,∞, so that x (1+ε) x p,∞ || || ≤ || || 1/p ∗ || ∗|| || || || || ≥ || || for some ε > 0. This can be translate into k xk (1 + ε) x p,∞ for some k [N]. Therefore we have the inclusion ≥ || || ∈

∗ ∗ 1/p [k] j [N]: x (1 + ε) x p,∞/k . ⊂ { ∈ j ≥ || || } Again, by the definition of the weak `p-quasinorm, we obtain

x∗ p k k || ||p,∞ = , ∗ 1/pp p ≤ (1 + ε) x p,∞/k (1 + ε) || || ∗ which is a contradiction. We conclude that x = x p,∞. || || || ||

With the alternative expression of the weak `p-quasinorm, we can establish a proposition similar to Proposition 1.5.

Proposition 1.10. For any q > p > 0 and x CN , we have the inequality ∈  p  1 σs(x)q x p,∞. ≤ q p s1/p−1/q || || − ∗ 1/p Proof. Without loss of generality, we assume that x p,∞ 1, so that xk 1/k for all k [N]. This yields || || ≤ ≤ ∈

N N t=N X X 1 Z ∞ 1 1 1 p 1 σ (x)p = (x∗)q dt = . s p k q/p q/p q/p−1 q/p−1 ≤ k ≤ s t −q/p 1 t ≤ q p s k=s+1 k=s+! − t=s − Taking the power 1/q leads to the desired inequality.

N Proposition 1.5 and Proposition 1.10 show that if we have vectors x C for which x p 1 or ∈ || || ≤ x p,∞ 1 for small p > 0, than they will be compressible vectors in the sense that their errors of best s||-term|| approximation≤ decay quickly with s. 1.5. HOW MANY MEASUREMENTS ARE NECESSARY? 15

1.5 How Many Measurements Are Necessary?

The fundamental problem in Compressive Sensing is to reconstruct an s-sparse vectors from few measure- ments. In mathematical terms, this is represented by solving the linear system Ax = y, where x CN is ∈ the s-sparse signal being acquired, A Cm×N represents the measurement matrix and y Cm represents the acquired information. ∈ ∈ Since we want to acquire the signal in a compressible fashion, we have fewer measurements than the ambient dimension, i.e., m < N. This results in an underdetermined system, which has, without any other assumption, an infinite number of solutions. As described in Section 1.3, it is natural to assume that the signals are sparse in some basis. With this regularization assumption, we expect to identify the original vector x. Thus, the recovery of sparse signals has two meanings:

1. Uniform recovery: the reconstruction of all s-sparse vectors x CN simultaneously. ∈ 2. Nonuniform recovery: the reconstruction of an specific s-sparse vector x CN . ∈ In case we are dealing with measurements matrices described by stochastic models, we can translate conditions above by finding a lower bound of the form

1. Uniform recovery: P( s-sparse x, recovery of x is successful using A) 1 ε. ∀ ≥ − 2. Nonuniform recovery: s-sparse, P(recovery of x is successful using A) 1 ε. ∀ ≥ − In both cases, the probability is over the random draw of A, described by a certain model. In this dissertation we will deal with the first case. We will answer when it is possible to recover all signals from few measurements, provided that all of them have some sparsity and that we have access to the acquired vector y (this acquired vector will be different for every signal x). For the nonuniform case, see Sections 12.2 and 14.2 of [Rauhut & Foucart ’13] and the discussion in [Candès & Plan ’11].

m×N Remark 6. For a matrix A C and a subset S [N], we denote by AS the column submatrix of A ∈ ⊂ N S consisting of the columns indexed by S. For a vector x C , we denote by xS the vector in C consisting ∈ of the entries indexed by S or the vector in CN which coincides with x on the entries in S and is zero on the entries outside S. It will be clear from the context which one of the two options we are dealing with. The way we recover signals is by supposing that they are as sparse as possible. Therefore, an algorith- mic approach for recovering them is through `0-minimization. In other words, we search for the sparsest vector consistent with the measured data y = Ax.

min z 0 subject to Az = y. (P0) N z∈C || || In case that measurements are corrupted by noise or are slightly inaccurate, the natural generalization is given by

min z 0 subject to Az y 2 η. (P0,η) N z∈C || || || − || ≤ The main point about Compressive Sensing is that we can recover compressible signals corrupted by noise. Then, this problem has stability and robustness. See Sections 3.4 and 3.5. Therefore, in real situations, we will deal with the problem (P0,η). We can understand the recovery of sparse vectors through the properties of the measurement matrix A as shown in the next theorem.

Theorem 1.11. Given A Cm×N , the following properties are equivalent ∈ a) Every s-sparse vector x CN is the unique s-sparse solution of Az = Ax, that is, if Ax = Az and both x and z are s-sparse,∈ then x = z. 16 CHAPTER 1. SPARSE SOLUTIONS OF LINEAR SYSTEMS

b) The null space ker A does not contain any 2s-sparse vector other than the zero vector, that is, N ker A z C : z 0 2s = 0 . ∩ { ∈ || || ≤ } { } c) Every set of 2s columns of A is linearly independent.

S m d) For every S [N] with #S 2s, the submatrix AS is an injective map from C to C and the ⊂ ∗ ≤ Gram matrix ASAS is invertible.

N N Proof. a) = b): Assume that for every vector x C , we have z C : Az = Ax, z 0 s = x . Let v ker A⇒be 2s-sparse. We write v = x z for ∈s-sparse vectors{x,∈ z with supp x supp|| || z≤=} . Then{ } Ax = ∈Az and by assumption x = z. Since the− supports of x and z are disjoint, it follows∩ that z ∅= x = 0 and v = 0. N b) = c): Let S be a set of indexes 1 i1 < i2 < < i2s N with #S = 2s and let x C such ⇒ P2s ≤ ··· ≤ ∈ that supp(x) S. If Ax = `=1 ai` xi` = 0, then x = 0. Thus the 2s columns vectors indexed by S must be linearly independent.⊂ c) = d): If every set of 2s columns of A is linearly independent then every set of s columns is also ⇒ S m linearly independent. Thus, clearly, AS is injective as a map from C to C . Also, consider S and x as in the previous case. We have

2s 2 ∗ X x, A ASx = ASx, ASx = xis ais > 0, S h i h i s,t=1 2 ∗ since, by c), the 2s column vectors ais are linearly independent. Also, the matrix ASAS is self-adjoint, ∗ therefore all of its eigenvalues are real and they will be positive if and only if the quadratic form x, ASASx is positive. Furthermore, a matrix with all positive eigenvalues is invertible. h i d) = a): Suppose that x1 and x2 are two s-sparse vectors satisfying Ax1 = Ax2 = Az. Setting ⇒ x = x1 x2 we have that it is 2s-sparse and also that Ax = Ax1 Ax2 = Az Az = 0. Let S be the − − − index set is for the support of x. Since x is in the kernel of A, we have { } 2s 2 ∗ X x, A ASx = xis ais . S h i s,t=1 2 ∗ Since ASAS in invertible, the last sum vanishes only if x = 0. This implies x1 = x2. From Theorem 1.11, we conclude that if it is possible to reconstruct every s-sparse, so that the statement c) above holds, then we have rank(A) 2s. On the other hand, since A Cm×N , rank(A) m. We conclude that the number of measurements≥ necessary to reconstruct every s∈-sparse vector satisfies≤ m 2s. The next theorem shows that m = 2s measurements suffice. After, we will discuss why this bound≥ is a theoretical one and typically not stable for practical implementation. In particular, due to a lack of knowledge about the location of the support, in Chapter 8 we will see that, in fact, a few more than 2s measurements will be necessary.

Theorem 1.12. For any integer N 2s, there exists a measurements matrix A Cm×N with m = 2s ≥ ∈ rows such that every s-sparse vector x CN can be recovered from its measurement vector y = Ax Cm ∈ ∈ as a solution of (P0).

m×N Proof. For fixed tn > > t2 > t1 > 0, we can define the matrix A C with m = 2s as ··· ∈  1 1 ... 1   t1 t2 . . . tN    A =  . . .  .  ......  2s−1 2s−1 2s−1 t1 t2 . . . tN 2s×2s Let S = j1 < < j2s be an index set of cardinality 2s. Define the square matrix AS C . It will { ··· } Q ∈ be the transpose of a Vandermonde matrix. It is well known that det(AS) = (tj tj ) > 0. This k<` ` − k 1.6. COMPUTATIONAL COMPLEXITY OF SPARSE RECOVERY 17

shows that AS is invertible and, in particular, injective. The hypotheses of Theorem 1.11 are fulfilled, thus every s-sparse vector x CN is the unique s-sparse vector satisfying Az = Ax. Therefore it can be ∈ recovered as the solution of (P0).

Remark 7. As discussed in Section 2.2 of [Rauhut & Foucart ’13], it is interesting to note that many matrices fulfill the hypotheses of Theorem 1.11. We can take any matrix M RN×N that is totally ∈ positive, i.e., that satisfies det(MI,J ) > 0 for any sets I, J, [N] with same cardinality, where MI,J represents the submatrix of M with rows indexed by I and columns⊂ indexed by J. After, we select any m = 2s rows of M, indexed by a set I, to form the measurement matrix A. Since the original matrix M is totally positive, for any index S [N] with #S = 2s, the matrix AS is the ⊂ same as MI,S, hence it is invertible and we can apply Theorem 1.11. In particular, the partial Discrete Fourier Transform

 1 1 1 1 ... 1   1 e2πi/N e2πi2/N . . . e2πi(N−1)/N    A =  . . . . .  ,  . . . . .  1 e2πi(2s−1)/N e2πi(2s−1)2/N . . . e2πi(2s−1)(N−1)/N allows for the reconstruction of every s-sparse vector x CN from y = Ax C2s. ∈ ∈ The reasoning above can be generalized and we can prove that, from a measure-theoretical viewpoint, the set of matrices that can not recovery s-sparse vector is small.

Theorem 1.13. The set of 2s N matrices such that det(AS) = 0 for some S [N] with #S 2s has Lebesgue measure zero. Therefore,× most 2s N matrices allow the reconstruction⊂ of every s-sparse≤ × vector x CN from y = Ax C2s. ∈ ∈ Theorem 1.13 is a simples consequence of Sard’s Theorem3 and tells us that we can draw a matrix at random and it will be, in principle, a good measurement matrix. However, we do not address the problem of robustness to error measurements and stability for compressible vectors recovery. These points will be further discussed in Chapter 3 and Chapter 8. In the latter, we will prove that any stable reconstruction for s-sparse vectors requires at least m = Cs ln(eN/s) linear measurements, for a constant C > 0. From this perspective, matrices such as the Vandermonde matrix from Theorem 1.12 are not adequate as measurement matrices.

1.6 Computational Complexity of Sparse Recovery

In this section we focus on a brief digression about computational complexity theory. This is a branch of mathematics/computer science where one tries to quantify the amount of computational resources required to solve a given task. We do not intend to be rigorous here. For a rigorous treatment, see the classical [Garey & Johnson ’79] or the modern [Arora & Barak ’07]. For example, if one tries to solve a linear system, the classic Gaussian elimination algorithm uses essentially n3 basic arithmetic operations to solve n equations over n variables. However, in 1960s, a more efficient algorithm was created, see [Strassen ’69]. The latter is a highly nonintuitive algorithm. Therefore one might ask if, in some cases, there are more efficient (but nonintuitive) algorithms than the ones used for many years. In the field of computational complexity, researchers try to answer such questions and prove that, sometimes, there is no better algorithm than the existing ones. Even more, for some problems they try to prove that there is no efficient algorithm. Computational complexity starts with the work [Hartmanis & Stearns ’65]. For an historical account, see [Fortnow & Homer ’03]. It is philosophically similar to taxonomy, where typically one tries to establish patterns and group problems into classes according to its inherent difficulty, in the same way naturalists

3For a proof of this theorem, see pages 205-207 of [Guillemin & Pollack]. 18 CHAPTER 1. SPARSE SOLUTIONS OF LINEAR SYSTEMS try to classify the living beings according to physiological properties. After, they relate those classes to each other. In this classification, a problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used. A polynomial/exponential-time algorithm is an algorithm performing its tasks in a number of steps bounded by a polynomial/exponential expression in the size of the input. The first distinction between polynomial-time and exponential-time algorithm was given by [von Neumann ’53]. It is standard in com- plexity theory to identify polynomial-time with feasible. Clearly, this is not always true, as pointed by [Cook]. He says: “for example, a computer program requiring n100 steps could never be executed on an input even as small as n = 10.” Even so, let us assume that this is the case and let us consider the Feasibility Thesis:

“A natural problem has a feasible algorithm if and only if it has a polynomial-time algorithm.”

There is a classification of computational problems: they are divided in decision problems, search problems, counting problems, optimization problems and function problems. See [Goldreich ’08]. Here we are interested in decision and optimization problems. A decision problem is a question in some formal system with a yes-or-no answer, depending on the values of some input parameters. An optimization problem asks to find the “best possible” solution among the set of all feasible solutions. Every optimization problem can be transformed into a decision problem just by asking if the best possible solution exists or not. Now we introduce some terminology following [Rauhut & Foucart ’13]. For rigorous definitions, see Definitions 1.13, 2.1 and 2.7 from [Arora & Barak ’07].

Definition 1.14. We have four basic complexity classes:

The class of P-problems consists of all decision problems for which there exists a polynomial-time • algorithmP finding a solution.

The class of NP-problems consists of all decision problems for which there exists a polynomial- • time algorithmNP certifying a solution.

The class -hard of NP-hard problems consist of all problems (not necessarily decision problems) • for whichNP a solving algorithm could be transformed in polynomial time into a polynomial time solving algorithm for any NP-problem. Roughly speaking, this is the class of problems at least as hard as any NP-problem.

The class -complete of NP-complete problems consist of all problems that are both NP and NP- • hard; in otherNP words, it consists of all the NP-problems at least as hard as any other NP-problem.

Note that the class is contained in the class . One can ask if the converse is also true. This is the important conjectureP = , see [Cook]. IfNP this conjecture is shown to be false, which is widely believed, then there will existP problemsNP for which potential solutions can be certified, but no solution can be found in polynomial time.

Figure 1.2: Relation between the four main complexity classes. 1.6. COMPUTATIONAL COMPLEXITY OF SPARSE RECOVERY 19

As [Arora & Barak ’07] points out, Recognizing the correctness of an answer is often much easier than coming up with the answer. Appreciating a Beethoven sonata is far easier than composing the sonata; verifying the solidity of a design for a suspension bridge is easier (to a civil engineer anyway!) than coming up with a good design; verifying the proof of a theorem is easier than coming up with a proof itself. Since nowadays we know that there are many -complete problems, these problems are probably intrinsically intractable. Some of the first examplesNP were given by a list of 21 problems described in [Karp ’72]. He says “In this paper we give theorems that suggest, but do not imply, that these problems, as well as many others, will remain intractable perpetually”. In particular, we are interested in the fourteenth problem of the list.

Remark 8. (Exact cover by 3-sets problem or X3C problem): Given a collection Ci, i [N] of 3- element subsets of [m], does there exist an exact cover (or a partition) of [m], i.e., a{ set J∈ [N}] such ⊂ that j∈J Cj = [m] and Cj Ck = for all j, k J with j = k? Clearly∪ we see that m ∩must be∅ a multiple∈ of 3. Let6 us use one example in order to understand. Suppose we have [m], i.e., [m] = 1, 2, 3, 4, 5, 6 . If we have the collection of 3-sets elements given by { } 1, 2, 3 , 2, 3, 4 , 1, 2, 5 , 2, 5, 6 , 1, 5, 6 = C1,C2,C3,C4,C5 then we could choose C2,C5 = {{2, 3, 4}, {1, 5, 6} {as an} exact{ cover} { because}} each{ element in [m] appears} exactly once. { } {{ If instead,} { the}} collection of 3-sets was given by 1, 2, 3 , 2, 4, 5 , 2, 5, 6 , then any subcover from this that we choose will not be an exact cover (we need{{ all 3} subsets{ } to{ cover}} all elements in [m] at least once, but then the element 2 appears three times). In Section 1.5 we stated the main problem of Compressive Sensing: the optimization problem of sparse recovery. The straightforward approach for solving it is to solve every square linear system given ∗ ∗ S by ASASu = ASy for u C where S runs though all possible subsets of [N] with size s. In this brute ∈ N force approach, one needs to solve a number of linear systems of order s . Example 1.15. As described in Section 2.3 of [Rauhut & Foucart ’13], suppose that we have a small size 1000 1000 10 20 problem of sparse recovery with N = 1000 and s = 10. We would have to solve 10 ( 10 ) = 10 linear systems of size 10 10. If each linear system require 10−10 seconds to be solved,≥ the time required ×10 for solving (P0) will be 10 seconds, i.e., more than 300 years. The heuristic of Example 1.15 will be confirmed by Theorem 1.16, i.e., this problem is in fact in- tractable for any possible approach. We will prove that (P0) belongs to the NP-hard class, i.e., assuming that the exact cover by 3-sets problem is NP-complete, we can now prove that the main Compressive Sensing problem is as hard as this one.

Theorem 1.16. (Theorem 1 of [Natarajan ’96]): For any η 0, the `0-minimization problem (P0,η) for ≥ general A Cm×N and y Cm is NP-hard. ∈ ∈ Proof. Without loss of generality, we may assume that η < 1, by a rescaling argument. Using that the exact cover by 3-sets problem is NP-complete, we will reduce it, in polynomial time, to our `0-minimization problem. Let us consider Ci : i [N] , the collection of 3-element subsets of [m]. Using this collection, { m∈ } we define vectors a1,..., an C by ∈ ( 1 if j Ci, (ai)j = ∈ 0 if j / Ci. ∈ With these vectors, we can define a matrix A Cm×N and a vector y Cm by ∈ ∈  

A =  a1 a2 ... aN  , y = [1, 1,..., 1].

It is possible to make this construction in polynomial time since, by construction, we have N m. ≤ 3 Using the vector y defined above, for any z CN satisfying Az y η, we have that any of its ∈ || − || ≤ 20 CHAPTER 1. SPARSE SOLUTIONS OF LINEAR SYSTEMS

components must be distant to 1 at most by η. Hence, these components are nonzero and Az 0 = m. || P||N On the other hand, by definition, each ai has exactly 3 nonzero components and then Az = j=1 zjaj has at most 3 z 0 nonzero components, i.e., Az 0 3 z 0. || || || || ≤ || || Therefore, we conclude that if a vector satisfies Az y 2 η then it must satisfy z 0 m/3. Suppose || − N|| ≤ || || ≥ that we ran the `0-problem and that it returned x C as the output. We have two possibilities. ∈

1. If x 0 = m/3, then suppose the collection Cj, j supp(x) is not an exact cover of [m]. In this || || PN { ∈ } case, the m components of Ax = j=1 xjaj would not be all be nonzero. Since this is impossible, we conclude that in this case the exact cover exists. Therefore we can answer positively the question about the existence of an exact cover of [m] using the solution of (P0,η).

2. If x 0 > m/3, suppose that an exact cover of [m] exists, i.e., we can cover the set [m] with a || || N partition Cj, j J . In this case, we can define a vector z C by zj = 1 if j J and zj = 0 if zj / J. { ∈ } PN P ∈ ∈ ∈ This vector satisfies Az = zjaj = aj = y and z 0 = m/3. This is a contradiction, since j=1 j∈J || || in this case x would be no more the solution of (P0,η). Thus, in the case x 0 > m/3, we can answer || || negatively about the existence of an exact cover by using the solution of (P0,η). In both cases, we showed that through the solution of the `0-minimization problem we can solve the exact cover by the 3-sets problem.

It is important to highlight what this theorem is and what it is not. It concerns the intractability in the general case. This means that for a general matrix A and a general vector y, we cannot have a smart strategy to solve (P0), in other words, no algorithm is able to solve the problem for any choice of A and y. This does not mean that for special choices of A and y we do not have tractable algorithms for sparse recovery. This will be the content of Chapter 2. Also, as we shall see in Theorem 3.30, that `q-minimization, for q < 1, is also an NP-hard problem.

1.7 Some Definitions

In this section we give a pot-pourri of definitions used along this dissertation. Definition 1.17. A nonnegative function . : X [0, ) on a vector space X is called a norm if || || → ∞ (a) x = 0 if and only if x = 0. (b) ||λx|| = λ x for all scalars λ and all vectors x X. (c) ||x +||y | |||x|| + y . ∈ || || ≤ || || || || If only (b) and (c) hold,so that x = 0 does not necessarily imply x = 0, then . is called a seminorm. If (a) and (b) hold, but (c) is replacedby|| || the weaker quasitriangular inequality || ||

x + y C( x + y ) || || ≤ || || || || for some C 1, then . is called a quasinorm. The smallest constant C is called its quasinorm constant. ≥ || || n n Definition 1.18. Let . be a norm on R or C . Its dual norm . ∗ is defined by || || || ||

x ∗ = sup x, y . || || ||y||≤1 |h i| In the real case we have x ∗ = sup x, y . n || || y∈R ,||y||≤1h i In the complex case, we have x ∗ = sup Re x, y . n || || y∈R ,||y||≤1 h i 1.7. SOME DEFINITIONS 21

The dual of the dual norm . ∗ is the norm . itself. In particular, we have || || || || x = sup x, y = sup Re x, y || || ||y||∗≤1 |h i| ||y||∗≤1 h i Definition 1.19. Let A : X Y be a linear map between two normed vector spaces X, . and (Y, . ) with possible different norms.→ The operator norm of A is defined as || || ||| |||

A = sup Ax = sup Ax . || || ||x||≤1 ||| ||| ||x||=1 ||| |||

In particular, for a matrix A Cm×n and 1 p, q , we define the matrix norm (or operator n m ∈ ≤ ≤ ∞ norm) between `p and `q as

A p→q = sup Ax q = sup Ax q. || || ||x||p≤1 || || ||x||p=1 || ||

Definition 1.20. We say that two functions f, g R are comparable if there exists absolute constants ∈ c1, c2 > 0 such that c1f(t) g(t) c2Af(t) for all t. We denote it by f g. ≤ ≤  We have a notational convention for asymptotic analysis. It was introduced by Paul Bachmann in 1894 and popularized in subsequent years by Edmund Landau in the field of Number Theory.

Definition 1.21. i.) (Big O): We say that f(x) = O(g(x)) if there exists a constant C > 0 such that

f(x) C g(x) for all x. (1.2) | | ≤ | | and when O(g(x)) stands in the middle of a formula it represents a function f(x) that satisfies Equation (1.2). This notation is typically used in the asymptotic context, specially when x is not integer. See Chapter 9 of [Graham, Knuth & Patashnik ’94].

ii.) (Big Omega) : For lower bounds, we say that f(x) = Ω(g(x)) if there exists a constant C > 0 such that

f(x) C g(x) for all x. (1.3) | | ≥ | | We have f(x) = Ω(g(x)) if and only if g(x) = O(f(x)).

iii.) (Big Theta): To specify an exact order of growth, we have the Big Theta notation. We denote by f(x) = Θ(g(x)) if the following statements hold

f(x) = O(g(x)) and f(x) = Ω(g(x)). (1.4) Big Theta definition equivalent to Definition 1.20. We emphasize it here due to the fact that Computa- tional Complexity community uses it while Approximation Theory community uses f g. Both of them will be used along this text.  22 CHAPTER 1. SPARSE SOLUTIONS OF LINEAR SYSTEMS Chapter 2

Some Algorithms and Ideas

All models are wrong, some are useful. George E. P. Box in [Box ’79].

2.1 Introduction

For general measurement matrices A and vectors y, the problem (P0) of recovering sparse vectors is NP-hard, as Theorem 1.16 shows. Then there is no hope of solving it in a computationally efficient way. However, in the last twenty years a lot of efficient algorithms were developed to deal with this problem. Techniques from Convex Optimization and from Approximation Theory were combined in order to find heuristics that solve the problem, at least for particular instances. The early development of Compressive Sensing was based on the assumption that the solution to the `1-minimization problem provides the recovery of the correct sparse vector, serving as a proxy to (P0), and also that it is feasible to solve this problem with a computer. We will see why this is true but also that a lot of work has been done in order to find alternative algorithms that are faster or give superior reconstruction performance in some situations. The purpose of this chapter is to present the three most popular classes of algorithms used for sparse vectors recovery: optimization methods, greedy methods and thresholding methods. Here we will not analyze any of them. This will be postponed to later chapters, after we introduce the Coherence Property and the Restricted Isometry Property. We do not intend to be exhaustive in the description of these classes of algorithms and our goal now is just to give some intuitive explanation and to develop some ideas about the computational aspects of Compressive Sensing. After presenting the three major classes of algorithms in Sections 2.2 to 2.4, we will briefly discuss the problem of choosing an algorithm for a particular application in Section 2.5, along with a review of recent numerical work related to this issue.

2.2 Basis Pursuit

Basis Pursuit is the main strategy for solving P0 we will deal with along this dissertation. It was introduced by [Chen ’95] as part of his Ph.D. thesis and was fully explored by [Chen, Donoho & Saunders ’01]. Since then, it has been widely used by the Signal Processing, Statistics and Machine Learning communities. As this strategy belongs to the optimization methods family, we need to start by defining what is an optimization problem. It is not our intention to be encyclopedic because Mathematical Programming is a vast and deep subject, see [Ruszczynski], [Boyd & Vanderberghe ’04] or [Bertsekas ’16]1 for further

1There is a whole volume of Documenta Mathematica dedicated to the history of Optimization: https://www.math. uni-bielefeld.de/documenta/vol-ismp/vol-ismp.html for some references. Also, the book [Gass & Assad ’04] has some interesting historical notes.

23 24 CHAPTER 2. SOME ALGORITHMS AND IDEAS information. For the use in Machine Learning of Optimization ideas related to Compressive Sensing, see [Sra, Nowozin & Wright ’11]. Definition 2.1. An optimization problem is described by

minimize F0(x) subject to Fi(x) bi, i = 1, 2,... (2.1) ≤ N N where the function F0 : R ( , ] is the objective function and the functions F1,...,Fn : R → −∞ ∞ → ( , ] are the constraint functions. A point x RN is called feasible if it satisfies the constraints. A −∞ ∞ # ∈ # feasible point x for which the minimum is attained, that is, F0(x ) F0(x) for all feasible point x is # ≤ called an optimal point and F0(x ), an optimal value. If F0,F1,...,Fn are all convex (linear) functions, this is called a convex (linear) optimization problem. Remark 9. This framework, only with inequality constraints, is the most general one and encompasses equality constraints too. Every time we have some constraint of the form Fi(x) = ci, we can write the equivalent inequalities Fi(x) ci and Fi(x) ci. ≤ − ≤ − In our setting, the measurement constraint Ax = y must be satisfied, therefore some of the inequalities N will be given by Fj(x) := Aj, x yj and Fj(x) := Aj, x yj, where Aj R is the jth row of A. Here, we will distinguishh thei ≤ measurement− constraint−h as aispecial ≤ − constraint.∈ Also, when modeling some signals of interest, other constraints Fj(x) may appear, thus we will describe the set of feasible points by

N K = x R Ax = y, Fj(x) bj, j [M] . (2.2) { ∈ | ≤ ∈ } Therefore, we are interested in problems of the form

min F0(x). (2.3) x∈K Another important class of optimization problems is the class of conic problem. This kind of problem will appear in the context of complex vectors recovery. Here is the formal definition: Definition 2.2. A conic optimization problem is of the form

min F0(x) subject to x K and Fi(x) bi, i [n], N x∈R ∈ ≤ ∈ q  N+1 PN 2 where K is a convex cone and all Fi are convex functions. If K = x R : xj xN+1 , ∈ j=1 ≤ then we have a second-order cone problem and if K is the cone of positive semidefinite matrices, we have a semidefinite program. The study of many optimization problems is carried on through the notion of duality. As sometimes it is hard to find the minimum of a problem, we transform it into another (hopefully easier) problem and maximize the new one. This new problem, called the dual problem, will provide a bound for the optimal value of the original problem, called primal problem. If we are lucky enough, the solution of both problems will agree. In order to describe the dual problem, we need another definition. Definition 2.3. The Lagrange function of an optimization problem given by (2.2) is defined for x N m M ∈ R , ξ R , v R with v` 0 for all ` [M], by ∈ ∈ ≥ ∈ M X L(x, ξ, v) = F0(x) + ξ, Ax y + v`(F`(x) b`). h − i − `=1

The variables ξ and v are called Lagrange Multipliers. Here, we will denote that v` 0 for all ` [M] ≥ ∈ by v < 0. The Lagrange dual function is defined by

H(ξ, v) = inf L(x, ξ, v), ξ m, v M , v 0. N R R < x∈R ∈ ∈ 2.2. BASIS PURSUIT 25

If x L(x, ξ, v) is unbounded from below, we set H(ξ, v) = . Also, if there are no inequality constraints,7→ then −∞

m H(ξ) = inf L(x, ξ) = inf F0(x) + ξ, Ax y , ξ . (2.4) N N R x∈R x∈R { h − i} ∈

# The dual function provides a bound on the optimal value of F0(x ) for the minimization problem # m M describe by the feasible set (2.2), that is, H(ξ, v) F0(x ) for all ξ R , v R , v 0. To see why, ≤ ∈ ∈ < taking x as a feasible point for (2.3), then Ax y = 0 and F`(x) b` 0 for all ` [M], so we have, for − − ≤ ∈ all ξ Rm and v 0, that ∈ < M X ξ, Ax y + v`(F`(x) b`) 0. h − i − ≤ `=1 And then

M X L(x, ξ, v) = F0(x) + ξ, Ax y + v`(F`(x) b`) F0(x). h − i − ≤ `=1

Using that H(ξ, v) L(x, ξ, v) and taking the minimum over all feasible points x RN , we obtain a lower bound for the≤ primal problem. This leads to the new optimization problem ∈

max H(ξ, v) subject to v 0. (2.5) m M < ξ∈R , v∈R The main point of this transformation is that H(ξ, v) is always concave, even if the original function is not convex. Therefore the dual problem is equivalent to minimizing H, a convex function, subject to − # # # v < 0. A feasible maximizer of (2.5) is called a dual optimal. We always have H(ξ , v ) F (x ) and this is what we call weak duality. In case of equality, we have strong duality. We can interpret≤ Lagrange duality via a saddle-point argument.

Remark 10. (Saddle-point interpretation): We will consider here the problem

minimize F0(x) subject to Ax = y, (2.6) that is, for the sake of simplicity we will consider the original problem without inequality constraints. The general case with inequality constraints (or for conic programs) has a similar derivation. Using the definition of Lagrange function yields ( F0(x) if Ax = y, sup L(x, ξ) = sup F0(x) + ξ, Ax y = ξ∈ m ξ∈ m h − i otherwise. R R ∞ Therefore, is x is not feasible, than the above supremum is infinite. On one hand, a feasible minimizer # of the primal problem will satisfies N m On the other hand, a dual optimal F0(x ) = infx∈R supξ∈R L(x, ξ) # satisfies m N by definition of the Lagrange dual function. Hence, weak H(ξ ) = supξ∈R infx∈R L(x, ξ) duality reads as

sup inf L(x, ξ) inf sup L(x, ξ), m x∈ N x∈ N m ξ∈R R ≤ R ξ∈R whereas the same reasoning holds for strong duality with equality instead of inequality. This tells us that we can change maximization and minimization if strong duality holds. This is the saddle-point property. This is the same as says that for a primal-dual optimal pair (x#, ξ#) holds that

# # # # N m L(x , ξ) L(x , ξ ) L(x, ξ ) for all x R , ξ R . ≤ ≤ ∈ ∈ 26 CHAPTER 2. SOME ALGORITHMS AND IDEAS

Saddle-property says that it is equivalent to finding a saddle point of the Lagrange function or to jointly optimize primal and dual problems, provided that strong duality holds. We will use it in Theorem 2.5. The next Theorem, stated here in a simplified version and known as the Slater condition, provides a sufficient condition for strong duality to hold, for a proof see Section 5.3 of [Boyd & Vanderberghe ’04].

N Theorem 2.4. ([Slater ’50]): Assume that F0,F1,...,FM are convex functions with dom(F0) = R . N If there exists x R such that Ax = y and F`(x) < b` for all ` [M], strong duality holds for the optimization problem∈ (2.3). In the absence of inequality constraints,∈ strong duality holds if there exists x RN with Ax = y. ∈ Remark 11. Duality theory for conic problems is more involved, see Section 4.3 of [Ruszczynski].

After this digression about optimization, we need to state what is the main strategy for solving the problem of finding the sparsest vector satisfying the linear system of measurements. This is called Basis Pursuit or `1-minimization and is interpreted as the convex relaxation of (P0).

Basis Pursuit (BP)

Input: measurement matrix A, measurement vector y. Instructions: # x = argmin z 1 subject to Az = y. (P1) || ||

Output: the vector x#.

Note we called Basis Pursuit a strategy and not an algorithm because we did not say how to implement the strategy above. In fact, there are many ways of doing that. Basis Pursuit is a linear program in the real case and a second-order cone program in the complex case, as we will show below. Therefore general purpose optimization algorithms can be used such as the Simplex Method or Interior-Point Methods. In particular, [Chen & Donoho ’94] points out that “Basis Pursuit is only thinkable because of recent advances in via "interior point" methods”. We refer to [Nesterov & Nemirovskii ’94]2 for details about this technique and to [Kim, Koh, Lustig, Boyd & Gorinevsky ’08] for its applications to `1-minimization and sparse recovery. We also have specifically design algorithms developed for `1-minimization. There is a discussion in the community about how faster they are compared to general purpose algorithms. We can cite the Homotopy Method, developed by [Donoho & Tsaig, ’08]3, Iteratively Reweighted Least Squares, developed by [Daubechies, DeVore, Fornasier & Güntürk ’10] or Primal-Dual Algorithms, developed by [Chambolle & Pock ’11] among many other direct or iterative methods. These three main algorithms are described in Chapter 15 of [Rauhut & Foucart ’13]. Besides, it is important to note that now there are many solvers available for Basis Pursuit. One can 4 cite [`1-MAGIC], [YALL1], [Fast `1], [NESTA], [GPSR] and [L1-LS] . A user-friendly package for Convex Optimization in Python is [CVXPY], where many of the techniques mentioned above are implemented. The work [Lorenz, Pfetsch & Tillmann ’15] studies and compares some solvers and heuristics for Basis Pursuit. When we have measurement errors, we replace Ax = y by Ax y 2 η and then the problem || − || ≤

min z 1 subject to Az y 2 η, (P1,η) N z∈C || || || − || ≤

2The papers [Wright ’04] and [Gondzio ’12] are very interesting for a historical perspective about Interior-Point Methods. 3In the Statistics community, this algorithm was independently developed by [Efron, Hastie, Johnstone & Tibshirani ’04] and is known as Least Angle Regression (or LARS). 4 The most complete (albeit not up-to-date and disorganized) list of `1-minimization solvers can be found at Section 4 of https://sites.google.com/site/igorcarron2/cs. 2.2. BASIS PURSUIT 27 is known as Quadratically Constrained Basis Pursuit. There are also two very relevant, related problems. The first one is Basis Pursuit Denoising(BPDN), which consists in solving, for some parameter λ 0, ≥ 2 min λ z 1 + Az y . (2.7) N 2 z∈C || || || − || The second one is LASSO, the Least Absolute Shrinkage and Selection Operator, which consists in solving for some parameter τ 0, ≥

min Az y 2 subject to z 1 τ. (2.8) N z∈C || − || || || ≤ Basic Pursuit Denoising was introduced in [Chen & Donoho ’94] and LASSO in [Tibshirani ’96]. These three problems are related, as the next Theorem shows. For a unified treatment of them, one can see [Hastie, Tibshirani & Wainwright ’15].

Theorem 2.5. We have an equivalence between LASSO, Basis Pursuit Denoising and Quadratically Constrained Basis Pursuit as the following three statements show.

i. If x is a minimizer of the Basis Pursuit Denoising with λ > 0, then there exists η = η(x) such that x is a minimizer of the Quadratically Constrained Basis Pursuit.

ii. If x is a unique minimizer of the Quadratically Constrained Basis Pursuit with η 0, then there exists τ = τ(x) 0 such that x is a unique minimizer of the LASSO. ≥ ≥ iii. If x is a unique minimizer of the LASSO with τ > 0, then there exists λ = λ(x) 0 such that x is a minimizer of the Basis Pursuit Denoising. ≥

N Proof. i.) Let η = Ax y 2 and consider z C such that Az y 2 η. Using the fact that x is a minimizer of (2.7),|| we have− || ∈ || − || ≤

2 2 2 λ x 1 + Ax y λ z 1 + Az y λ z 1 + Ax y . || || || − ||2 ≤ || || || − ||2 ≤ || || || − ||2 After some simplifications, we obtain x 1 z 1 and then x is a minimizer of the Quadratically Con- strained Basis Pursuit. || || ≤ || ||

N ii.) Set τ = x 1 and consider z C , z = x such that z 1 τ. Since x is, by hypothesis, the unique minimizer|| || of Quadratically Constrained∈ 6 Basis Pursuit,|| this|| ≤ implies that z cannot satisfy the constraint of (P1,η). Hence, Az y 2 > η > Ax y 2. This shows that x is the unique minimizer of (2.8). || − || || − ||

iii.) This part is more elaborate and we will prove a more general result, where we replace the `2-norm by any other norm. The Theorem will be proved for the real case but it also holds for the complex setting just by interpreting CN as R2N . Let . be a norm on Rm and . a norm on RN . For A Rm×N , y Rm and τ > 0, the general LASSO|| is|| given by ||| ||| ∈ ∈

min Ax y subject to x τ, (2.9) N x∈R || − || ||| ||| ≤ while the general Basis Pursuit Denoising is given by

min λ x + Ax y 2. (2.10) N x∈R ||| ||| || − || Therefore we will prove that if x# is a minimizer of the general LASSO, then there exists λ = λ(x) 0 such that x is a minimizer of the general Basis Pursuit Denoising. First of all, the general LASSO≥ is obviously equivalent to 28 CHAPTER 2. SOME ALGORITHMS AND IDEAS

min Ax y 2 subject to x τ. (2.11) N x∈R || − || ||| ||| ≤ The Lagrange function for this problem is given by

L(x, ξ) = Ax y 2 + ξ x τ (2.12) || − || ||| ||| − For τ > 0, there exist vectors x with x < τ. Then, by Theorem 2.4, we have strong duality for (2.9), so we guarantee the existence of||| a dual||| optimal ξ# 0. The saddle-point property ensures that ≥ L(x#, ξ#) L(x, ξ#) for all x RN . Therefore x# is also a minimizer of x L(x, ξ#). Since the constant term≤ ξ#τ does not affect∈ the minimizer, x# is also a minimizer of Ax7→ y 2 + ξ# x and the conclusion− follows with λ = ξ#. || − || ||| |||

The parameters η, λ and τ can be seen as regularization parameters for the solution of the linear system. Theorem 2.5 shows that the transformation of the parameters depends on the minimizer. There- fore we need to solve the optimization problems before we come to know them. For this reason, this equivalence is useless for practical purposes. Finding appropriate parameters is a highly nontrivial task and the topic is widely discussed in the literature, see [Yu & Feng ’14] and references therein. Also, problems related to how this kind of optimization technique can lead to false discoveries are discussed in [Su, Bogdan & Candès ’15]. Basis Pursuit is a linear program in the real case and a second-order cone program in the complex case. For the real case, let us introduce the variables x+, x− RN . For x RN , let ∈ ∈ ( ( + xj if xj > 0 − 0 if xj > 0 xj = and xj = 0 if xj 0 xj if xj 0. ≤ − ≤ + − Hence, problem (P1) is equivalent to the following linear optimization problem for the variables x , x ∈ RN

N  +  + X + − x x 0 min (x + x ) subject to [A A] − = y, − 0. (P1) x+,x−∈ N | − x x ≥ R i=1 + # − # With a solution (x ) , (x ) of this problem in hand, we get the solution of the original problem (P1) just by taking x# = (x+)# (x−)#. This consideration allows us to conclude that we have, in fact, a linear problem. − In the complex setting, we will be looking, instead, to the more general Quadratically Constrained Basis Pursuit. Then, given a vector z CN , we need to introduce its real and imaginary parts u, v RN N ∈ q 2 2 ∈ and also a vector c R such that cj zj = uj + vj for all j [N]. The problem P1,η will be ∈ ≥ | | ∈ equivalent to the following optimization problem with variables c, u, v RN : ∈         Re(A) Im(A) u Re(y) N  − η X  Im(A) Re(A) v − Im(y) ≤ 0 min cj subject to 2 (P1,τ ) c,u,v,∈ N R j=1  p 2 2 p 2 2  u + v c1,..., u + v cN . 1 1 ≤ N N ≤ This is a second-order cone problem. Then given a solution (c#, u#, v#), the solution of the original # # # problem is x = u + iv . When we take η = 0, we recover the solution of (P1) in the complex case. Why is Basis Pursuit good alternative to the original combinatorial problem (P0)? Because in the real case `1-minimizers are sparse, as we will show now. In Chapter 3 this will by fully explored through the notion of the Null Space Property for measurement matrices.

m×N Theorem 2.6. Let A R be a measurement matrix with columns a1, . . . , aN . Assuming the unique- ness of a minimizer x#∈ of 2.2. BASIS PURSUIT 29

min x 1 subject to Ax = y, N x∈R || || # # # then the set aj, j supp(x ) is linearly independent, and in particular x 0 = #(supp(x )) m. { ∈ } || || ≤ # Proof. Suppose, by contradiction, that the set aj, j S is linearly dependent, where S = supp(x ). { ∈ } This means that there exists a nonzero vector v RN supported on S such that Av = 0. By uniqueness of x#, for any t = 0, ∈ 6 # # X # X # # x 1 < x + tv 1 = x + tvj = sgn(x + tvj)(x + tvj). || || || || | j | j j j∈S j∈S Now, we need to understand what happens with a sign of a number when we add another number to it. # If b < a the sgn(a + b) = sgn(a). Therefore, when t < minj∈S xj / v ∞, we have | | | | | | | | || || # # sgn(x + tvj) = sgn(x ) for all j S. j j ∈ # It follows that, for t = 0 with t < minj∈S xj / v ∞, 6 | | | | || ||

# X # # X # # X # # X # x 1 < sgn(x + tvj)(x + tvj) = sgn(x )(x + tvj) = sgn(x )x + t sgn(x )vj || || j j j j j j j j∈S j∈S j∈S j∈S

# X # = x 1 + t sgn(x )vj. || j || j j∈S P # This is a contradiction because we can always take a very small t = 0 such that t sgn(x )vj 0. 6 j∈S j ≤ Remark 12. In the complex case, the situation is more delicate. Consider the vector x = [1, ei2π/3, ei4π/3] and the measurement matrix described by

1 0 1 A = . 0 1 −1 −

This vector is the unique minimizer of the problem min 3 z 1 subject to Az = Ax. See Section 3.1 z∈C || || of [Rauhut & Foucart ’13]. Hence, in the complex setting, `1-minimization does not necessarily provides m-sparse solutions, where m is the number of measurements. In order to better understand when this occurs, we need the theory of the next chapters, such as the Null Space Property.

Quadratically Constrained Basis Pursuit relies on Ax y 2 η, that is, the noise is bounded in the || − || ≤ `2-norm regardless of structure or prior information about it. So, in the case of Gaussian noise, which is unbounded, the error in sparse vector recovery typically does not decay as the number of measurement increases, see pages 29 and 30 of [Boche et at. ’15]. In order to deal which this problem, another type 5 of `1-minimization algorithm was proposed by [Candès & Tao III, 06], called Dantzig Selector . It is described by

∗ min z 1 subject to A (Az y) ∞ τ. (DS) N z∈C || || || − || ≤ The heuristics of this method is based on the fact that the residual r = Az y should have −∗ small correlation with all columns aj of the matrix A. Indeed, the constraint is A (Az y) ∞ = || − || maxj∈[N] r, aj . There is a theory for the Dantzig Selector similar to the one developed through this dissertation|h i| for Basis Pursuit, which is sometimes more complicated, see Chapter 8 in [Elad ’10], [Hastie, Tibshirani & Wainwright ’15], [Meinshausen, Rocha & Yu ’07], [James, Radchenko & Lv ’09], [Lv & Fan ’09], [Zhang ’09] and [Bickel, Ritov & Tsybakov ’09].

5As Tao points out in [Tao’s Blog - 22/03/2008], “we called the Dantzig selector, due to its reliance on the linear programming methods to which , who had died as we were finishing our paper, had contributed so much to”. 30 CHAPTER 2. SOME ALGORITHMS AND IDEAS

2.3 Greedy Algorithms

The basic idea of a greedy algorithm in Compressive Sensing is to update the target vector in such a way that in each iteration, the algorithm adds one index to the target support and then finds the vector that best fits the measurements with this given support. This kind of strategy justifies the name of such class of algorithms. The most famous algorithm in this context is called the Orthogonal Matching Pursuit, sometimes called Orthogonal Greedy Algorithm. This algorithm has been rediscovered many times in different fields. We can trace its origins back to [Chen, Billings & Luo ’89], in the context of system identification in Control Theory. Independently, it was introduced and analyzed in the context of Time-Frequency Analysis by [Davies, Mallat & Zhang ’94] and [Pati, Rezaiifar & Krishnaprasad ’93]. Here is the formal description of the algorithm.

Orthogonal Matching Pursuit (OMP)

Input: measurement matrix A, measurement vector y. Initialization: S0 = , x0 = 0. Iteration: repeat until∅ a stopping criterion is met at n = n:

n+1 n ∗ n S = S jn+1 , jn+1 = argmax (A (y Ax ))j , (OMP1) ∪ { } j∈[N] {| − |}

n+1 n+1 x = argmin y Az 2, supp(z) S . (OMP2) N z∈C {|| − || ⊂ } Output: the n-sparse vector x# = xn.

To identify the signal x, we need to determine which columns of A participate in the measurement vector. Informally speaking, OMP is a greedy algorithm that selects at each step the column that is most correlated with the current residuals. This column is then included into the set of selected columns. The algorithm updates the residuals by projecting the observation onto the linear subspace spanned by the columns that have already been selected and the algorithm then iterates. The main goal of the introduction of this algorithm was to improve the already existing Matching Pursuit6 by forcing it to converge, for finite-dimensional signals, in a finite number of steps, which does not happen for the original (Nonorthogonal) Matching Pursuit, as Theorem 4.1 of [DeVore & Temlyakov ’96] and the discussion in Section 2.3.2 of [Chen, Donoho & Saunders ’01] show. In general, the residuals of the Orthogonal MP decrease faster than the Nonorthogonal MP. However, this improvement possess some disadvantages, as, for example, orthogonal projection procedure can yield unstable expansions by selecting ill-conditioned elements through the running of the algorithm. Moreover, Orthogonal MP also requires much more operations than Nonorthogonal Pursuits due to the Gram-Schmidt orthogonalization process.

Remark 13. [Davies, Mallat & Avellaneda ’97] showed exponential convergence and made a detailed comparison of both greedy algorithms in terms of theoretical complexity and practical performance. For a proof and discussion of the decay rate, see Chapter 3 of [Elad ’10].

It is important to note that the connection between sparse approximation and greedy algorithms was first made by [Tropp ’04]. His results showed that, under some conditions on the coherence of the matrix (see Theorem 4.41), OMP works for sparse recovery in the noiseless case. [Cai, Wang & Xu III ’10] proved that this condition is sharp, and the performance of OMP in the noisy case was analyzed by [Cai & Wang ’11].

$^6$This algorithm was introduced by [Mallat & Zhang ’93] in the Signal Processing community and by [Friedman & Stuetzle ’81] in the Statistics community under the name Projection Pursuit Regression.

As we can see in the description of OMP, the costliest part is the projection step (OMP2). Typically a QR-decomposition of $A_{S^n}$ is used in each step, together with efficient methods for updating the QR-decomposition when a column is added to the matrix, see Section 3.2 of [Björck ’96]. Moreover, fast vector-matrix multiplication via the FFT can be used in (OMP1) for the computation of $A^*(y - Ax^n)$. We will briefly describe how this is done for a general least-squares problem. For any matrix $A\in\mathbb{C}^{m\times n}$ with $m\geq n$, there exists a unitary matrix $Q\in\mathbb{C}^{m\times m}$ and an upper triangular matrix $R\in\mathbb{C}^{n\times n}$ such that
$$A = Q\begin{pmatrix} R \\ 0\end{pmatrix}.$$
So, for the general least-squares problem of minimizing $\|Ax - y\|_2$ over all $x\in\mathbb{C}^n$, using that $Q$ is unitary, we have
$$\|Ax - y\|_2 = \|Q^*Ax - Q^*y\|_2 = \left\|\begin{pmatrix} R \\ 0\end{pmatrix}x - Q^*y\right\|_2. \qquad (2.13)$$
Partitioning $b = (b_1, b_2) = Q^*y$ with $b_1\in\mathbb{C}^n$, the right-hand side of (2.13) will be minimized when we solve the triangular system $Rx = b_1$, and this can be done with a simple backward elimination. Therefore, at each iteration of OMP, we need to solve a least-squares problem with a new column added to the matrix $A$. We now discuss the choice of the index $j_{n+1}$, as stated in (OMP1). It is given by a greedy strategy whose objective is to reduce the $\ell_2$-norm of the residual $y - Ax^n$ as much as possible at each iteration. Lemma 2.7 explains why an index $j$ maximizing $|(A^*(y - Ax^n))_j|$ is a good candidate for this large decrease of the residual. We just need to apply it to $S = S^n$ and $v = x^n$.
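The QR route to the least-squares subproblem can be sketched as follows; this is only a minimal illustration with numpy/scipy (the factorization is recomputed rather than updated column by column, and $A$ is assumed to have full column rank).

```python
# Sketch of the QR approach to min ||Ax - y||_2 from (2.13): compute A = QR,
# form b1 = Q* y and solve the triangular system R x = b1 by back substitution.
import numpy as np
from scipy.linalg import solve_triangular

def lstsq_via_qr(A, y):
    Q, R = np.linalg.qr(A, mode="reduced")   # reduced QR: R is n x n upper triangular
    b1 = Q.conj().T @ y                      # first block of Q* y
    return solve_triangular(R, b1)           # backward elimination
```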

Lemma 2.7. Let $A\in\mathbb{C}^{m\times N}$ be a matrix with $\ell_2$-normalized columns. Given $S\subset[N]$, $v$ supported on $S$ and $j\in[N]$, if
$$w = \operatorname*{argmin}_{z\in\mathbb{C}^N}\left\{\|y - Az\|_2,\ \operatorname{supp}(z)\subset S\cup\{j\}\right\},$$
then
$$\|y - Aw\|_2^2 \leq \|y - Av\|_2^2 - \left|\left(A^*(y - Av)\right)_j\right|^2.$$
Proof. Note that any vector $v + te_j$, with $t\in\mathbb{C}$, is supported on $S\cup\{j\}$. By the definition of $w$, we have
$$\|y - Aw\|_2^2 \leq \min_{t\in\mathbb{C}}\|y - A(v + te_j)\|_2^2.$$
The idea is then to work in polar coordinates and choose proper radial and angular numbers to represent $t$. Indeed, writing $t = \rho e^{i\theta}$, with $\rho\geq 0$ and $\theta\in[0, 2\pi)$, we estimate
$$\|y - A(v + te_j)\|_2^2 = \|y - Av - tAe_j\|_2^2 = \|y - Av\|_2^2 + |t|^2\|Ae_j\|_2^2 - 2\operatorname{Re}\left(\bar{t}\,\langle y - Av, Ae_j\rangle\right)$$
$$= \|y - Av\|_2^2 + \rho^2 - 2\operatorname{Re}\left(\rho e^{-i\theta}\left(A^*(y - Av)\right)_j\right) \geq \|y - Av\|_2^2 + \rho^2 - 2\rho\left|\left(A^*(y - Av)\right)_j\right|,$$
with equality for a properly chosen $\theta$. The right-hand side of this inequality is a quadratic polynomial in $\rho$, and it is minimized when $\rho = |(A^*(y - Av))_j|$. Therefore
$$\min_{t\in\mathbb{C}}\|y - A(v + te_j)\|_2^2 \leq \|y - Av\|_2^2 - \left|\left(A^*(y - Av)\right)_j\right|^2,$$
which concludes the proof.

Another interesting point is that (OMP2) is equivalent to solving the normal equations of least squares. In fact, it also reads as $x^{n+1}_{S^{n+1}} = (A^*_{S^{n+1}}A_{S^{n+1}})^{-1}A^*_{S^{n+1}}y = A^\dagger_{S^{n+1}}y$, where $x^{n+1}_{S^{n+1}}$ denotes the restriction of $x^{n+1}$ to the support set $S^{n+1}$. This is justified by the following lemma.
Lemma 2.8. Given an index set $S\subset[N]$, if
$$v = \operatorname*{argmin}_{z\in\mathbb{C}^N}\left\{\|y - Az\|_2,\ \operatorname{supp}(z)\subset S\right\},$$
then
$$\left(A^*(y - Av)\right)_S = 0.$$
Proof. The definition of $v$ tells us that the vector $Av$ is the orthogonal projection of $y$ onto the space $\{Az,\ \operatorname{supp}(z)\subset S\}$. This means that
$$\langle y - Av, Az\rangle = 0 \quad\text{for all } z\in\mathbb{C}^N \text{ with } \operatorname{supp}(z)\subset S.$$
This orthogonality condition is equivalent to $\langle A^*(y - Av), z\rangle = 0$ for all $z\in\mathbb{C}^N$ with $\operatorname{supp}(z)\subset S$, which is the same as $(A^*(y - Av))_S = 0$.
Lemma 2.8 is important because it shows that one may solve the normal equations instead of using a QR decomposition. In this case, a direct method, such as the Cholesky decomposition, or an iterative method, such as the Conjugate Gradient method, may be used. For details on when the use of such methods is advantageous, see Chapters 6 and 7 of [Björck ’96].
The most natural stopping criterion for OMP is $Ax^{\bar{n}} = y$. In applications, where we have measurement and rounding errors, this needs to be changed to $\|y - Ax^{\bar{n}}\|_2\leq\varepsilon$ or $\|A^*(y - Ax^{\bar{n}})\|_\infty\leq\varepsilon$ for some tolerance parameter $\varepsilon > 0$.
If we have a priori estimates for the sparsity of the signals, we can use this information and take $\bar{n} = s$ as a stopping criterion, because the target vector $x^{\bar{n}}$ is $\bar{n}$-sparse. In fact, in the case where $A$ is an orthogonal matrix, this forces the algorithm to successfully recover the sparse $x\in\mathbb{C}^N$ from $y = Ax$. Indeed, from $x^n_{S^n} = A^\dagger_{S^n}y$, we see that the vector $x^n$ will be the $n$-sparse vector consisting of the $n$ largest entries of $x$. In the general case, the success of OMP for recovering $s$-sparse vectors using $s$ iterations is characterized by Theorem 2.9 below.
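Before stating it, here is a minimal sketch of the normal-equations route just described, assuming scipy is available and that $A_S$ is injective (so the Gram matrix is positive definite); names are illustrative only.

```python
# Sketch of the normal-equations projection from Lemma 2.8: on a support S,
# solve A_S^* A_S x_S = A_S^* y via a Cholesky factorization.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def project_on_support(A, y, S):
    A_S = A[:, list(S)]
    G = A_S.conj().T @ A_S            # Gram matrix, positive definite if A_S is injective
    c, low = cho_factor(G)
    return cho_solve((c, low), A_S.conj().T @ y)
```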

Theorem 2.9. Given a matrix $A\in\mathbb{C}^{m\times N}$, every nonzero vector $x\in\mathbb{C}^N$ supported on a set $S$ of size $s$ is recovered from $y = Ax$ after at most $s$ iterations of Orthogonal Matching Pursuit if and only if the matrix $A_S$ is injective and
$$\max_{j\in S}\left|(A^*r)_j\right| > \max_{\ell\in\bar{S}}\left|(A^*r)_\ell\right|, \qquad (2.14)$$
for all nonzero $r\in\{Az,\ \operatorname{supp}(z)\subset S\}$.
Proof. First, we will assume that OMP works and recovers every vector supported on the set $S$ in at most $s = \#S$ iterations. Then any two vectors $x_1, x_2$ having $S$ as support and the same measurement vector $y = Ax_1 = Ax_2$ must be the same. This implies that $A_S$ is injective. Moreover, we saw that an index chosen at the first iteration always remains in the target support, so, if $y = Ax$ for some $x\in\mathbb{C}^N$ exactly supported on $S$, then an index $\ell\in\bar{S}$ cannot be chosen at the first iteration and $\max_{j\in S}|(A^*y)_j| > |(A^*y)_\ell|$. We use this to conclude that
$$\max_{j\in S}\left|(A^*y)_j\right| > \max_{\ell\in\bar{S}}\left|(A^*y)_\ell\right| \quad\text{for all nonzero } y\in\{Az,\ \operatorname{supp}(z)\subset S\}.$$
We established the two necessary conditions. Now we proceed to prove that they are also sufficient. Assume that $Ax^1\neq y, \dots, Ax^{s-1}\neq y$, because otherwise there is nothing to do. We need to prove that $S^n$ is a subset of $S$ of size $n$ for any $0\leq n\leq s$, and from this we will deduce that $S^s = S$. Using this with $Ax^s = y = Ax$, given by (OMP2), allows us to conclude that $x^s = x$, because $A_S$ is injective by hypothesis. Therefore we need to establish our claim by using (2.14). We will prove by induction that $S^n\subset S$. We begin with $S^0 = \emptyset\subset S$. Given $0\leq n\leq s-1$, we see that $S^n\subset S$ yields $r^n = y - Ax^n\in\{Az,\ \operatorname{supp}(z)\subset S\}$. Then, by equation (2.14), the index $j_{n+1}$ must lie in $S$. Hence $S^{n+1} = S^n\cup\{j_{n+1}\}\subset S$. This proves that $S^n\subset S$ for all $0\leq n\leq s$.
Now, for $1\leq n\leq s-1$, Lemma 2.8 implies that $(A^*r^n)_{S^n} = 0$. If $j_{n+1}\in S^n$, then $(A^*r^n)_{j_{n+1}} = 0$, so the maximal correlation in (OMP1) vanishes and hence $A^*r^n = 0$; by equation (2.14), we could then conclude that $r^n = 0$, since the strict inequality must occur for any nonzero vector. This cannot happen, since we assumed $Ax^1\neq y, \dots, Ax^{s-1}\neq y$. Therefore $j_{n+1}\notin S^n$, that is, in each iteration a new index, different from all previous indexes, is added to the target support. This inductively proves that $S^n$ is a set of size $n$.

As discussed by [Chen, Donoho & Saunders ’01], “because the algorithm is myopic, one expects that, in certain cases, it might choose wrongly in the first few iterations and, in such cases, end up spending most of its time correcting for any mistakes made in the first few terms. In fact this does seem to happen.” This phenomenon will be illustrated in Section 5.5, after Theorem 5.22. Despite these pathological negative results, there is theoretical evidence, as well as empirical results, suggesting that OMP can recover an $s$-sparse signal when the number of measurements $m$ is nearly proportional to $s$, as the next theorem shows.

Theorem 2.10. (Theorem 2 of [Tropp & Gilbert ’07]): Fix $\delta\in(0, 0.36)$ and choose $m\geq Cs\log(N/\delta)$. Assume that $x\in\mathbb{R}^N$ is $s$-sparse and let $A\in\mathbb{R}^{m\times N}$ have i.i.d. entries from the Gaussian distribution $N(0, 1/m)$. Then, given the data $y = Ax$, Orthogonal Matching Pursuit can reconstruct the signal $x$ with probability exceeding $1 - 2\delta$. The constant $C$ satisfies $C\leq 20$. For large $s$, it can be shown that $C\approx 4$ is enough.
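The flavor of this theorem can be checked empirically. The following rough sketch reuses the omp() function given after (OMP2) above; the parameters and the success threshold are illustrative choices, not taken from the reference.

```python
# Rough empirical check of Theorem 2.10: success rate of OMP for Gaussian
# matrices as the number of measurements m grows (illustrative parameters).
import numpy as np

rng = np.random.default_rng(1)
N, s, trials = 256, 8, 100
for m in (40, 60, 80):
    successes = 0
    for _ in range(trials):
        A = rng.standard_normal((m, N)) / np.sqrt(m)
        x = np.zeros(N)
        x[rng.choice(N, s, replace=False)] = rng.standard_normal(s)
        x_hat = omp(A, A @ x, n_iter=s)
        successes += np.linalg.norm(x_hat - x) < 1e-6
    print(m, successes / trials)
```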

These are not the only greedy algorithms for sparse approximation. Many corrections and modifications have been proposed; we can cite Subspace Pursuit [Dai & Milenkovic ’09], Gradient Pursuit [Blumensath & Davies ’08], Stagewise Orthogonal Matching Pursuit [Donoho, Tsaig, Drori & Starck ’12] and Compressive Sampling Matching Pursuit, known as CoSaMP [Needell & Tropp ’08]. There is a whole monograph devoted to the topic of greedy approximations and algorithms, see [Temlyakov ’11], especially Chapter 5, where the problem of Compressive Sensing is treated.

2.4 Thresholding Algorithms

In order to recover sparse vectors, we need to understand the action of the measurement matrix on them. Thresholding algorithms rely on the fact that we approximate the inversion of the action of $A$ on sparse vectors by the action of its adjoint $A^*$. Thus, the basic thresholding strategy is to determine the support of the $s$-sparse vector $x\in\mathbb{C}^N$ to be recovered from $y = Ax\in\mathbb{C}^m$ as the indices of the $s$ largest absolute entries of $A^*y$. After finding the support, we perform some least-squares technique in order to find the vector having this support that best fits the measurements. To describe the strategy, we need to introduce two operators: the first one denotes the best $s$-term approximation of a vector and the second denotes the support of this approximation,

$$H_s(z) = z_{L_s(z)}, \qquad (2.15)$$
$$L_s(z) = \text{index set of the } s \text{ largest absolute entries of } z\in\mathbb{C}^N. \qquad (2.16)$$

Next, we have the strategy using this cut-off idea.

Basic Thresholding Strategy

Input: measurement matrix $A$, measurement vector $y$, sparsity level $s$.
Instruction:
$$S^\# = L_s(A^*y). \qquad\text{(BT1)}$$
$$x^\# = \operatorname*{argmin}_{z\in\mathbb{C}^N}\left\{\|y - Az\|_2,\ \operatorname{supp}(z)\subset S^\#\right\}. \qquad\text{(BT2)}$$
Output: the $s$-sparse vector $x^\#$.
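A minimal sketch of (BT1)-(BT2) in numpy, assuming the sparsity level $s$ is known:

```python
# Sketch of the Basic Thresholding Strategy (BT1)-(BT2).
import numpy as np

def basic_thresholding(A, y, s):
    S = np.argsort(np.abs(A.conj().T @ y))[-s:]          # (BT1): s largest correlations
    x = np.zeros(A.shape[1], dtype=np.result_type(A, y))
    x[S], *_ = np.linalg.lstsq(A[:, S], y, rcond=None)   # (BT2): least squares on S
    return x
```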

As with Theorem 2.9 for the greedy strategy, we have a necessary and sufficient condition for the recovery of $s$-sparse vectors through the Basic Thresholding Strategy.

Proposition 2.11. A vector $x\in\mathbb{C}^N$ supported on a set $S$ is recovered from $y = Ax$ via the Basic Thresholding Strategy if and only if $\min_{j\in S}|(A^*y)_j| > \max_{\ell\in\bar{S}}|(A^*y)_\ell|$.
Proof. The vector $x$ will be recovered if and only if the index set $S^\#$ defined in the strategy coincides with the set $S$, and this happens if and only if any entry of $A^*y$ on $S$ is greater (in absolute value) than any entry of $A^*y$ on $\bar{S}$.

This strategy seems too simple to have any chance of working. We can reformulate it and solve the system $Az = y$ using the prior information that the solution is $s$-sparse for some $s$. Actually, we will not solve the fat and short linear system $Az = y$ but instead the square system $A^*Az = A^*y$. This system can be reformulated as the fixed-point equation $z = (\mathrm{Id} - A^*A)z + A^*y$, which, in turn, can be solved by the iteration $x^{n+1} = (\mathrm{Id} - A^*A)x^n + A^*y$.
The thresholding strategy, applied to this fixed-point iteration, tells us to keep the $s$ largest absolute entries of $(\mathrm{Id} - A^*A)x^n + A^*y = x^n + A^*(y - Ax^n)$. If the $s$ largest entries are not uniquely defined, the algorithm selects the smallest possible indices. Formally, this is the content of the next algorithm. It was developed by [Blumensath & Davies II ’08] and analyzed in [Blumensath & Davies ’09]. Its analysis relies on the fact that the matrix $A^*A$ behaves like the identity when its domain and range are restricted to small support sets (as we shall see in Chapter 5), and then $x^{n+1} = x^n + A^*A(x - x^n) \approx x^n + x - x^n = x$. The authors of the algorithm argue that it has many good features, such as robustness to observation noise, near-optimal error guarantees, a required number of iterations depending only on the logarithm of a form of signal-to-noise ratio of the signal, and a memory requirement which is linear in the problem size, among others.

Iterative Hard Thresholding (IHT)

Input: measurement matrix $A$, measurement vector $y$, sparsity level $s$.
Initialization: $s$-sparse vector $x^0$, typically $x^0 = 0$.
Iteration: repeat until a stopping criterion is met at $n = \bar{n}$:

$$x^{n+1} = H_s\left(x^n + A^*(y - Ax^n)\right). \qquad\text{(IHT)}$$
Output: the $s$-sparse vector $x^\# = x^{\bar{n}}$.
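A minimal sketch of the (IHT) iteration; note that, in practice, convergence typically requires the operator norm of $A$ to be at most one (e.g. after rescaling), an assumption made here without proof.

```python
# Sketch of Iterative Hard Thresholding: keep the s largest entries of the
# update x + A*(y - Ax) at each step; assumes ||A||_2 <= 1 (rescale if needed).
import numpy as np

def hard_threshold(z, s):
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-s:]     # L_s(z): indices of the s largest entries
    out[idx] = z[idx]
    return out                           # H_s(z)

def iht(A, y, s, n_iter=100):
    x = np.zeros(A.shape[1], dtype=np.result_type(A, y))
    for _ in range(n_iter):
        x = hard_threshold(x + A.conj().T @ (y - A @ x), s)   # (IHT)
    return x
```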

In the class of greedy algorithms, the difference between Matching Pursuit and Orthogonal Matching Pursuit is that in the latter we decide to pay the price of an orthogonal projection. Here, we can do the same and look, at each iteration, for the vector with the same support as $x^{n+1}$ that best fits the measurements. This idea was explored by [Foucart ’11], where the Hard Thresholding Pursuit was developed as a fusion of IHT and the greedy algorithm CoSaMP. In fact, this work did more and created a family of thresholding algorithms indexed by an integer $k$. There, Iterative Hard Thresholding and Hard Thresholding Pursuit correspond to the cases $k = 0$ and $k = \infty$, respectively.

Hard Thresholding Pursuit (HTP)

Input: measurement matrix $A$, measurement vector $y$, sparsity level $s$.
Initialization: $s$-sparse vector $x^0$, typically $x^0 = 0$.
Iteration: repeat until a stopping criterion is met at $n = \bar{n}$:

$$S^{n+1} = L_s\left(x^n + A^*(y - Ax^n)\right), \qquad\text{(HTP1)}$$
$$x^{n+1} = \operatorname*{argmin}_{z\in\mathbb{C}^N}\left\{\|y - Az\|_2,\ \operatorname{supp}(z)\subset S^{n+1}\right\}. \qquad\text{(HTP2)}$$
Output: the $s$-sparse vector $x^\# = x^{\bar{n}}$.
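A minimal sketch of (HTP1)-(HTP2), again only illustrative and reusing a generic least-squares solver for the projection step:

```python
# Sketch of Hard Thresholding Pursuit: the support comes from a thresholded
# gradient-type step (HTP1), the values from a least-squares fit (HTP2).
import numpy as np

def htp(A, y, s, n_iter=50):
    N = A.shape[1]
    x = np.zeros(N, dtype=np.result_type(A, y))
    for _ in range(n_iter):
        u = x + A.conj().T @ (y - A @ x)
        S = np.argsort(np.abs(u))[-s:]                       # (HTP1)
        x = np.zeros(N, dtype=u.dtype)
        x[S], *_ = np.linalg.lstsq(A[:, S], y, rcond=None)   # (HTP2)
    return x
```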

Recently, some theoretical results concerning HTP have been proved. [Bouchot, Foucart & Hitczenko ’16], for example, proved that the number of iterations can be estimated independently of the shape of $x$ and is at most proportional to the sparsity $s$. For a precise statement, see Theorem 5 in that article. They also provided some benchmarks for HTP: in terms of success rate, its performance is roughly similar to OMP; in terms of number of iterations, however, HTP performs best, owing to the fact that we have prior knowledge of the sparsity $s$.
Clearly, in these thresholding algorithms we need an estimate of the sparsity of the vector. The class of Soft Thresholding Algorithms was developed to deal with situations in which we have no such estimates. There, we substitute the hard thresholding operator by the soft thresholding operator with threshold $\tau > 0$. This operator maps each entry $z_j$ of a vector $z\in\mathbb{C}^N$ to
$$S_\tau(z_j) = \begin{cases} \operatorname{sgn}(z_j)\left(|z_j| - \tau\right), & \text{if } |z_j|\geq\tau,\\ 0, & \text{otherwise.}\end{cases}$$
One can cite, as an example of such algorithms, the Iterative Shrinkage-Thresholding Algorithm developed by [Daubechies, Defrise & De Mol ’04]. This is a variation of the classical Landweber iteration, well known to the Inverse Problems community, see Section 2.3 of [Kirsch ’11]. It is typically used in the context of image restoration, where it allows us to deal with non-differentiable regularizers such as total-variation regularization and wavelet-based regularization [Bioucas-Dias & Figueiredo ’07].
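A minimal sketch of the soft-thresholding operator and of an ISTA-type iteration for the real case; the unit step size assumes $\|A\|_2\leq 1$, and the specific penalized formulation below is one common variant rather than the exact scheme of the cited references.

```python
# Soft thresholding S_tau and an ISTA-type iteration for
# min ||Az - y||_2^2 / 2 + tau ||z||_1 (real case, assumes ||A||_2 <= 1).
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, tau, n_iter=200):
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + A.T @ (y - A @ x), tau)
    return x
```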

2.5 The Search For the Perfect Algorithm

Unfortunately, there is little guidance available on choosing a good technique for a given parameter regime. [Tropp & Gilbert ’07]

After presenting these three major classes of algorithms, we should ask which of them is better suited for a certain application. In other words, we need to answer the following question: given the sparsity level $s$, the number of measurements $m$ and the ambient dimension $N$, which algorithm should we use? In Compressive Sensing we seek uniform recovery results, that is, we aim to reconstruct every sparse vector with the same measurement matrix. Thus, if we have some prior information about the signals we are looking for, that is, if we have more structure beyond sparsity, can we take advantage of it in order to carefully select the algorithm?
These questions have not been fully answered yet. The number of articles related to numerical issues of sparse recovery is small compared to the development of theoretical guarantees. Important references in this area are [Elad ’10], which is devoted to numerical investigations in Compressive Sensing, as well as [Pope ’09], [Maleki & Donoho ’10], [Lorenz, Pfetsch & Tillmann ’15] and

[Blanchard & Tanner ’15]. This last one was the first to perform large-scale empirical testing with more realistic, application-sized problems, typically with ambient dimension $N = 2^{18}$ or $N = 2^{20}$. This is particularly important in evaluating the behavior of algorithms in the extreme undersampling regime $m\ll N$. These analyses are outside the scope of this dissertation. Even so, we can provide some elementary rules of thumb. For small sparsity $s$, OMP is typically the fastest option, because its speed depends on the number of iterations, which will be equal to $s$ if the algorithm succeeds. For mild sparsity (compared to $N$), thresholding algorithms are preferred, because their runtime is typically not affected by the sparsity level. Assigning rules of thumb for Basis Pursuit is more difficult, because it depends on how we implement this strategy. Due to the lack of a comprehensive numerical study, the following project remains open.

Open Project: Provide a computational investigation of the worst and average case behavior of algorithms in the three major classes: optimization methods, greedy methods and thresholding methods. Identify the regions of the problem sizes where these algorithms are able to reliably recover the sparsest solution. Construct phase transition diagrams for the probability of recovery of a given algorithm. Perform all the tests with matrices drawn from some probability distribution, like Gaussian or Bernoulli, and also with deterministic matrices.
Chapter 3

The Null Space Property

It is the hallmark of any deep truth that its negation is also a deep truth.1. Niels Bohr

3.1 Introduction

As discussed in Chapter 2, in this dissertation we are interested in the solution of fat and short linear systems. This is so because they model the simultaneous acquisition and (high) compression of data. As the matrices of these systems have nontrivial kernel, we must define a proper notion of solution, i.e., we must specify which properties this solution should have. This is the same as saying that finding a unique solution of an underdetermined linear system is impossible unless some additional information about the solution is given. Therefore, after this regularization procedure, there is hope of finding a unique (regularized) solution. As we are looking for parsimonious representations of signals, the natural approach is to look for the sparsest solution. In previous chapters, we argued that this approach makes sense, since many natural signals contain little information, as they are a combination of only a handful of vectors from an appropriate basis. In mathematical terms, we need to recover a sparse vector $x\in\mathbb{C}^N$ from its measurements $y = Ax\in\mathbb{C}^m$, where $m < N$. In order to do so, we need to solve the following optimization problem:

$$\min_{z\in\mathbb{C}^N}\|z\|_0 \quad\text{subject to}\quad Az = y. \qquad\text{(P0)}$$
This combinatorial problem is NP-hard, as Theorem 1.16 shows, so we need to develop some smart strategies in order to solve it. The most popular strategy is Basis Pursuit, or $\ell_1$-minimization, presented in Chapter 2. Therefore, our approach will be to solve the problem

$$\min_{z\in\mathbb{C}^N}\|z\|_1 \quad\text{subject to}\quad Az = y. \qquad\text{(P1)}$$
This chapter is devoted to understanding one property the matrix $A$ must have in order to guarantee that solutions of problem (P1) are solutions of problem (P0). It is the null space property (NSP), a necessary and sufficient condition for exact reconstruction of a sparse vector $x$ from its measurements $y = Ax$, which we present in the next section. Even more, we will define two further criteria which we expect from a recovery algorithm, namely stability and robustness, and show that the NSP also yields stable reconstruction with respect to sparsity defect and robust reconstruction with respect to measurement error. Then, the problem of low-rank matrix recovery will be explored. This can be seen as a variation of the Compressive Sensing problem, and a generalization of the null space property will appear during the analysis of low-rank matrix recovery.
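In the real case, (P1) can be recast as a linear program via the standard splitting $z = u - v$ with $u, v\geq 0$. The following is a minimal sketch of this reformulation, assuming scipy is available; it is an illustration of the idea rather than a production solver.

```python
# Sketch of (P1) as a linear program: z = u - v, u, v >= 0,
# minimize sum(u) + sum(v) subject to A(u - v) = y (real case).
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    m, N = A.shape
    c = np.ones(2 * N)                      # objective: ||z||_1 = sum(u) + sum(v)
    A_eq = np.hstack([A, -A])               # equality constraint A u - A v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * N))
    return res.x[:N] - res.x[N:]            # recover z = u - v
```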

1Quoted by Max Delbruck, “Mind from Matter? An Essay on Evolutionary Epistemology", Blackwell Scientific Publi- cations, Palo Alto, CA, 1986; page 167.


3.2 The Null Space Property

As we are interested in recovering the sparsest solution, we need to know when, i.e., for which matrices, this can be done. Moreover, we also need to know when the recovered solution is unique. However, finding a property of the matrix $A$ implying that a solution of the (P1) problem is a solution of the (P0) problem was a nontrivial task. It took a little more than a decade, from the beginning of Basis Pursuit studies in the thesis [Chen ’95] until the formal definition of the Null Space Property in the paper [Cohen, Dahmen & DeVore ’09]. It is important to mention that this property had already appeared implicitly in [Donoho & Huo ’01], [Elad & Bruckstein ’02] and [Donoho & Elad ’03]. The first place where this property was isolated is Lemma 1 of [Gribonval & Nielsen ’03].

Definition 3.1. A matrix $A\in\mathbb{K}^{m\times N}$ is said to satisfy the null space property$^2$ relative to a set $S\subset[N]$ if
$$\|v_S\|_1 < \|v_{\bar{S}}\|_1 \qquad \forall\, v\in\ker A\setminus\{0\}. \qquad\text{(NSP)}$$
It is said to satisfy the null space property of order $s$ if it satisfies the null space property relative to any set $S\subset[N]$ with $\#S = s$.
Remark 14. This property tells us how the null space of the matrix $A$ should be oriented in order to touch the $\ell_1$-ball at just one point. This geometric interpretation will be explored in Theorem 3.12 and can be seen in Figure 3.1 below.

Figure 3.1: Proper null space of A for uniqueness in the reconstruction of sparse vector.

First of all, it is important to observe that if this condition holds for the index set $S$ of the $s$ largest (in absolute value) entries of $v$, then it holds for any other set $S$ with $\#S\leq s$. Also, we can reformulate the definition of the null space property in two ways. The first one is obtained by adding $\|v_S\|_1$ to both sides of the inequality in the definition. Then the null space property relative to a set $S$ becomes
$$2\|v_S\|_1 < \|v\|_1 \qquad \forall\, v\in\ker A\setminus\{0\}. \qquad (3.1)$$
For the second, one can add $\|v_{\bar{S}}\|_1$ to both sides of the inequality. Thus, the null space property of order $s$ becomes
$$\|v\|_1 < 2\,\sigma_s(v)_1 = 2\inf_{\|z\|_0\leq s}\|v - z\|_1 \qquad \forall\, v\in\ker A\setminus\{0\}.$$

$^2$[Cohen, Dahmen & DeVore ’09] defined the NSP as a slightly more general property, namely, $\|v\|_1\leq C\sigma_s(v)_1$ for all $v\in\ker A$, where $C\geq 1$ is an unspecified constant.

We can now state the main theorem concerning the NSP, which gives a necessary and sufficient condition for exact recovery of sparse vectors through $\ell_1$-norm minimization. The statement given here is a small modification of the one in the original reference.

Theorem 3.2. (Essentially Theorem 3.2 of [Cohen, Dahmen & DeVore ’09]): Given a matrix $A\in\mathbb{K}^{m\times N}$, every vector $x\in\mathbb{K}^N$ supported on a set $S$ is the unique solution of (P1) with $y = Ax$ if and only if $A$ satisfies the null space property relative to $S$.

Proof. Suppose that every vector $x\in\mathbb{K}^N$ supported on $S$ is the unique minimizer of $\|z\|_1$ subject to $Az = Ax$. Thus, for any $v\in\ker A\setminus\{0\}$, the vector $v_S$ is the unique minimizer of $\|z\|_1$ subject to $Az = Av_S$. But, as we have $Av = 0$, then $A(v_S + v_{\bar{S}}) = Av = 0$ and so $A(-v_{\bar{S}}) = Av_S$. Furthermore, as $v\neq 0$, we must have $-v_{\bar{S}}\neq v_S$, and by uniqueness of the minimizer we conclude that $\|v_S\|_1 < \|v_{\bar{S}}\|_1$.
Conversely, assume that the null space property relative to $S$ holds. For a given vector $x\in\mathbb{K}^N$ supported on $S$ and a vector $z\in\mathbb{K}^N$, with $z\neq x$ and satisfying $Az = Ax$, we consider the vector $v = x - z\in\ker A\setminus\{0\}$. Due to the null space property, we have
$$\|x\|_1 \leq \|x - z_S\|_1 + \|z_S\|_1 = \|v_S\|_1 + \|z_S\|_1 < \|v_{\bar{S}}\|_1 + \|z_S\|_1 = \|-z_{\bar{S}}\|_1 + \|z_S\|_1 = \|z\|_1.$$

Now, if we let the set $S$ vary, we obtain the following result as an obvious consequence.

Theorem 3.3. Given a matrix $A\in\mathbb{K}^{m\times N}$, every $s$-sparse vector $x\in\mathbb{K}^N$ is the unique solution of (P1) with $y = Ax$ if and only if $A$ satisfies the null space property of order $s$.
This impressive and simple theorem tells us that if we have a sparse vector $x$ and a measurement vector $y = Ax$, Basis Pursuit actually solves (P0), provided that the NSP holds for the matrix $A$. In fact, suppose that we are able to recover the vector $x$ via $\ell_1$-norm minimization from $y = Ax$. Then let $\tilde{x}$ be a minimizer of (P0) with $y = Ax$. Of course we have $\|\tilde{x}\|_0\leq\|x\|_0$, and so $\tilde{x}$ is also $s$-sparse, since by hypothesis $x$ is $s$-sparse. Since every $s$-sparse vector is the unique $\ell_1$-minimizer, it follows that $\tilde{x} = x$.
The Null Space Property has an interesting feature: it is preserved if we rescale and reshuffle the measurements or even if some new measurements are added. From a mathematical point of view, this is the same as replacing the measurement matrix $A$ by

$$A_1 = GA, \quad\text{where } G \text{ is some invertible } m\times m \text{ matrix},$$

$$A_2 = \begin{pmatrix} A \\ B\end{pmatrix}, \quad\text{where } B \text{ is some matrix with } N \text{ columns}.$$

In the second case, $A_2$ is represented as a block matrix and $B$ represents the additional measurements. This is reasonable because the measurement process should not get worse if new information is acquired, if the order in which the measurements are obtained changes, or if we change the scale of all measurements of the signal. The justification for these facts is simple: note that $\ker A_1 = \ker A$ and $\ker A_2\subset\ker A$, hence $A_1$ and $A_2$ have the NSP if $A$ does.
In the definition of the NSP, $\mathbb{K}$ could be either $\mathbb{R}$ or $\mathbb{C}$. However, when the system has real coefficients, sparse solutions can be considered either as real or as complex vectors, leading to two apparently distinct null space properties. This is so because there is a distinction between the real null space $\ker_{\mathbb{R}}A$ and the complex null space $\ker_{\mathbb{C}}A = \ker_{\mathbb{R}}A + i\ker_{\mathbb{R}}A$. Nevertheless, the real and complex NSP are equivalent, as proved by [Foucart & Gribonval ’10]. While the original proof links it to a problem about convex polygons in the real plane, in Theorem 3.4 we prove this equivalence following the elementary argument of [Lai & Liu ’11].

Theorem 3.4. (Theorem 1 of [Foucart & Gribonval ’10]): Given a matrix $A\in\mathbb{R}^{m\times N}$, the real null space property relative to a set $S$, that is,
$$\sum_{j\in S}|v_j| < \sum_{i\in\bar{S}}|v_i| \qquad \forall\, v\in\ker_{\mathbb{R}}A,\ v\neq 0, \qquad (3.2)$$
is equivalent to the complex null space property relative to this set $S$, that is,
$$\sum_{j\in S}\sqrt{v_j^2 + w_j^2} < \sum_{i\in\bar{S}}\sqrt{v_i^2 + w_i^2} \qquad \forall\, v, w\in\ker_{\mathbb{R}}A,\ (v, w)\neq(0, 0). \qquad (3.3)$$
In particular, the real null space property of order $s$ is equivalent to the complex null space property of order $s$.
As [Foucart & Gribonval ’10] note, “Before stating the theorem, we point out that, for a real measurement matrix, one may also recover separately the real and imaginary parts of a complex vector using two real $\ell_1$-minimizations - which are linear programs - rather than recovering the vector directly using one complex $\ell_1$-minimization - which is a second order cone program.”
Proof. Clearly equation (3.2) follows from equation (3.3) when we set $w = 0$. Then we need to prove the converse. Assume that (3.2) holds and consider $v, w\in\ker_{\mathbb{R}}A$ with $(v, w)\neq(0, 0)$. Suppose first that $v$ and $w$ are linearly dependent, e.g. $w = kv$ for some $k\in\mathbb{R}$. Then equation (3.3) is deduced from
$$\sum_{j\in S}|v_j| < \sum_{i\in\bar{S}}|v_i| \;\Rightarrow\; \sqrt{k^2+1}\sum_{j\in S}|v_j| < \sqrt{k^2+1}\sum_{i\in\bar{S}}|v_i| \;\Rightarrow\; \sum_{j\in S}\sqrt{k^2v_j^2 + v_j^2} < \sum_{i\in\bar{S}}\sqrt{k^2v_i^2 + v_i^2} \;\Rightarrow\; \sum_{j\in S}\sqrt{w_j^2 + v_j^2} < \sum_{i\in\bar{S}}\sqrt{w_i^2 + v_i^2}.$$
Now assume $v$ and $w$ are linearly independent. In this case, $u = v\cos\theta + w\sin\theta\in\ker_{\mathbb{R}}A$ is nonzero for every $\theta$. So the real null space property (3.2) leads, for any $\theta\in\mathbb{R}$, to
$$\sum_{j\in S}|v_j\cos\theta + w_j\sin\theta| < \sum_{i\in\bar{S}}|v_i\cos\theta + w_i\sin\theta|. \qquad (3.4)$$
If we define, for each $k\in[N]$, $\theta_k\in[-\pi, \pi]$ through the equations
$$\cos\theta_k = \frac{v_k}{\sqrt{v_k^2 + w_k^2}}, \qquad \sin\theta_k = \frac{w_k}{\sqrt{v_k^2 + w_k^2}},$$
equation (3.4) can be reformulated as
$$\sum_{j\in S}\sqrt{v_j^2 + w_j^2}\,|\cos(\theta - \theta_j)| < \sum_{i\in\bar{S}}\sqrt{v_i^2 + w_i^2}\,|\cos(\theta - \theta_i)|.$$
We can integrate over $\theta\in[-\pi, \pi]$ and obtain
$$\sum_{j\in S}\sqrt{v_j^2 + w_j^2}\int_{-\pi}^{\pi}|\cos(\theta - \theta_j)|\,d\theta < \sum_{i\in\bar{S}}\sqrt{v_i^2 + w_i^2}\int_{-\pi}^{\pi}|\cos(\theta - \theta_i)|\,d\theta.$$
Since $\int_{-\pi}^{\pi}|\cos(\theta - \theta_j)|\,d\theta = 4$ is independent of $\theta_j$, we obtain the desired inequality. Hence the real null space property implies the complex null space property.

Sometimes it makes sense to use $\ell_q$-norm minimization, for $0 < q < 1$, instead of $\ell_1$-norm minimization. In order to analyze this case, the Null Space Property should be generalized. This was done by [Gribonval & Nielsen ’07], see also [Gao, Peng, Yue & Zhao ’15]. The same problem of the equivalence between the real and the complex Null Space Property arises in this nonconvex case. The proof that they are indeed equivalent notions was given by [Lai & Liu ’11].

3.3 Recovery via Nonconvex Minimization

The $\ell_q$-quasinorm of a vector can be used to approximate the number of nonzero entries of a vector, as

$$\sum_{i=1}^{N}|x_i|^q \;\xrightarrow{\;q\to 0\;}\; \sum_{i=1}^{N}\mathbf{1}_{\{x_i\neq 0\}} = \|x\|_0.$$
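For a quick numerical illustration (the vector below is chosen here only as an example), take $x = (3, 0, 1/2)$. Then
$$\sum_{i=1}^{3}|x_i|^{1/2} = \sqrt{3} + \sqrt{1/2}\approx 2.44, \qquad \sum_{i=1}^{3}|x_i|^{0.1}\approx 2.05, \qquad \sum_{i=1}^{3}|x_i|^{0.01}\approx 2.004,$$
which already approaches $\|x\|_0 = 2$.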

Therefore, we would like to know whether we can obtain results for sparse recovery using $\ell_q$-norm minimization with $0 < q < 1$. Similarly to the other cases, let us call (Pq) the problem of minimizing $\|z\|_q$ subject to $Az = y$. Despite being a nonconvex problem, it can be shown that exact reconstruction is possible with substantially fewer measurements while maintaining robustness to noise and stability with respect to signal non-sparsity, see [Chartrand ’07] for numerical experiments and [Saab, Chartrand & Yilmaz ’08] for the robustness analysis in the nonconvex case. We will prove that the problem (Pq) does not provide a worse approximation to (P0) when we make $q$ smaller. In order to do this, we need to show an equivalence, similar to Theorem 3.3, between sparse recovery with the $\ell_q$-norm and some form of the Null Space Property.

Theorem 3.5. Given a matrix $A\in\mathbb{C}^{m\times N}$ and $0 < q\leq 1$, every $s$-sparse vector $x\in\mathbb{C}^N$ is the unique solution of (Pq) with $y = Ax$ if and only if, for any $S\subset[N]$ with $\#S\leq s$,
$$\|v_S\|_q < \|v_{\bar{S}}\|_q \qquad \forall\, v\in\ker A\setminus\{0\}.$$
The proof of this theorem is analogous to that of Theorem 3.2. We just need to use the fact that the $q$th power of the $\ell_q$-quasinorm satisfies the triangle inequality. Using Theorem 3.5, [Gribonval & Nielsen ’07] were able to establish that sparse recovery via $\ell_q$-minimization implies sparse recovery via $\ell_p$-minimization for $0 < p < q\leq 1$.
Theorem 3.6. ([Gribonval & Nielsen ’07]): Given a matrix $A\in\mathbb{C}^{m\times N}$ and $0 < p < q\leq 1$, if every $s$-sparse vector $x\in\mathbb{C}^N$ is the unique solution of (Pq) with $y = Ax$, then every $s$-sparse vector $x\in\mathbb{C}^N$ is also the unique solution of (Pp) with $y = Ax$.

Proof. Due to Theorem 3.5, we just need to prove that, if $v\in\ker A\setminus\{0\}$ and if $S$ is an index set of $s$ largest absolute entries of $v$, then
$$\sum_{i\in S}|v_i|^p < \sum_{i\in\bar{S}}|v_i|^p. \qquad (3.5)$$
We have, by hypothesis, that the same inequality is valid with $q$ in place of $p$; in particular $v_{\bar{S}}\neq 0$. So we can rewrite (3.5) dividing by $\sum_{i\in\bar{S}}|v_i|^p$. Hence the inequality we want to prove is equivalent to
$$f(p) := \sum_{i\in S}\frac{1}{\sum_{j\in\bar{S}}\left(|v_j|/|v_i|\right)^p} < 1. \qquad (3.6)$$

Clearly we have $|v_j|/|v_i|\leq 1$ for $j\in\bar{S}$ and $i\in S$, so each term $(|v_j|/|v_i|)^p$ is nonincreasing in $p$ and $f(p)$ is a nondecreasing function of $p$. Hence, for $p < q$, we have $f(p)\leq f(q)$. By hypothesis, $f(q) < 1$, and this implies the result.
In these results about sparse recovery via convex (and also nonconvex) optimization problems, nothing is said about the computational tractability of the Null Space Property. In Section 3.7 we will study some computational complexity issues related to $\ell_q$-norm minimization for $0 < q\leq 1$. We first need to analyze what happens if the measurements are corrupted or if we have vectors which are not, in fact, sparse, but only approximately sparse.

3.4 Stable Measurements

In idealized situations, all the vectors we deal with are sparse. This may not be the case in more realistic scenarios. Instead, we may have approximately sparse vectors, and we want to recover them with an error controlled by their distance to the closest sparse vector. This is called stability of the reconstruction with respect to the sparsity defect. In this section, we will prove that Basis Pursuit is stable. In order to do this, we need a stronger Null Space Property that allows us to encompass these defects into the kernel of the measurement matrix.

Definition 3.7. A matrix $A\in\mathbb{C}^{m\times N}$ is said to satisfy the stable null space property with constant $0 < \rho < 1$ relative to a set $S\subset[N]$ if
$$\|v_S\|_1 \leq \rho\|v_{\bar{S}}\|_1 \qquad \forall\, v\in\ker A. \qquad\text{(NSP}_\rho\text{)}$$
We say that $A$ satisfies the stable null space property of order $s$ with constant $0 < \rho < 1$ if it satisfies the stable null space property with constant $0 < \rho < 1$ relative to any set $S\subset[N]$ with $\#S\leq s$.
We can now generalize Theorem 3.3 to the stable case.

Theorem 3.8. Suppose that a matrix $A\in\mathbb{C}^{m\times N}$ satisfies the stable null space property of order $s$ with constant $0 < \rho < 1$. Then, for any $x\in\mathbb{C}^N$, a solution $x^\#$ of (P1) with $y = Ax$ approximates the vector $x$ with $\ell_1$-error
$$\|x - x^\#\|_1 \leq \frac{2(1+\rho)}{1-\rho}\,\sigma_s(x)_1. \qquad (3.7)$$
Remark 15. It is interesting to note that Theorem 3.8 does not guarantee uniqueness for the $\ell_1$-minimization. However, without explaining why, [Rauhut & Foucart ’13] argue that nonuniqueness is rather pathological.

Even though uniqueness may fail, the content of Theorem 3.8 is that every solution $x^\#$ of (P1) with $y = Ax$ satisfies (3.7). We will prove, in fact, a stronger result, namely Theorem 3.9. The key point is to consider any vector $z\in\mathbb{C}^N$ satisfying $Az = Ax$ and to establish that, under the stable null space property relative to a set $S$, the distance between a vector $x\in\mathbb{C}^N$ supported on $S$ and a vector $z\in\mathbb{C}^N$ satisfying $Az = Ax$ can be controlled by the difference of their norms. Theorem 3.8 will follow as a corollary.

Theorem 3.9. The matrix $A\in\mathbb{C}^{m\times N}$ satisfies the stable null space property with constant $0 < \rho < 1$ relative to $S$ if and only if
$$\|z - x\|_1 \leq \frac{1+\rho}{1-\rho}\left(\|z\|_1 - \|x\|_1 + 2\|x_{\bar{S}}\|_1\right), \qquad (3.8)$$
for all vectors $x, z\in\mathbb{C}^N$ with $Az = Ax$.
In order to prove this result, we need the following simple observation.

Lemma 3.10. Given a set $S\subset[N]$ and vectors $x, z\in\mathbb{C}^N$, the following inequality holds:
$$\|(x - z)_{\bar{S}}\|_1 \leq \|z\|_1 - \|x\|_1 + \|(x - z)_S\|_1 + 2\|x_{\bar{S}}\|_1.$$
Proof. We separate a vector into two parts, one relative to the set $S$ and the other relative to the complementary set $\bar{S}$. So, given a set $S\subset[N]$ and vectors $x, z\in\mathbb{C}^N$, we have

$$\|x\|_1 = \|x_{\bar{S}}\|_1 + \|x_S\|_1 \leq \|x_{\bar{S}}\|_1 + \|(x - z)_S\|_1 + \|z_S\|_1,$$
and also
$$\|(x - z)_{\bar{S}}\|_1 \leq \|x_{\bar{S}}\|_1 + \|z_{\bar{S}}\|_1.$$
We sum these two inequalities to conclude
$$\|x\|_1 + \|(x - z)_{\bar{S}}\|_1 \leq 2\|x_{\bar{S}}\|_1 + \|(x - z)_S\|_1 + \|z\|_1,$$
which, after subtracting $\|x\|_1$ from both sides, is the claimed inequality.

Proof. (of Theorem 3.9): We will first assume that the matrix $A$ satisfies (3.8) for all vectors $x, z\in\mathbb{C}^N$ with $Az = Ax$. For $v\in\ker A$, $A(v_S + v_{\bar{S}}) = Av = 0$ and so $A(-v_{\bar{S}}) = Av_S$. From (3.8) applied to $x = v_S$ and $z = -v_{\bar{S}}$ we obtain
$$\|v\|_1 \leq \frac{1+\rho}{1-\rho}\left(\|v_{\bar{S}}\|_1 - \|v_S\|_1\right),$$
which is equivalent to
$$(1-\rho)\left(\|v_S\|_1 + \|v_{\bar{S}}\|_1\right) \leq (1+\rho)\left(\|v_{\bar{S}}\|_1 - \|v_S\|_1\right).$$
After rearranging, we have $\|v_S\|_1\leq\rho\|v_{\bar{S}}\|_1$, and this is the stable null space property with constant $0 < \rho < 1$ relative to the set $S$. Conversely, assume that the matrix $A$ satisfies this property. For $x, z\in\mathbb{C}^N$ with $Az = Ax$, since $v = z - x\in\ker A$, the stable null space property yields $\|v_S\|_1\leq\rho\|v_{\bar{S}}\|_1$. Then Lemma 3.10 leads to
$$\|v_{\bar{S}}\|_1 \leq \|z\|_1 - \|x\|_1 + \|v_S\|_1 + 2\|x_{\bar{S}}\|_1. \qquad (3.9)$$
Putting the definition of the stable null space property into (3.9), we obtain
$$\|v_{\bar{S}}\|_1 \leq \|z\|_1 - \|x\|_1 + \rho\|v_{\bar{S}}\|_1 + 2\|x_{\bar{S}}\|_1,$$
and since $\rho < 1$, this is the same as
$$\|v_{\bar{S}}\|_1 \leq \frac{1}{1-\rho}\left(\|z\|_1 - \|x\|_1 + 2\|x_{\bar{S}}\|_1\right).$$
Using the definition of the stable null space property again, we conclude
$$\|v\|_1 = \|v_S\|_1 + \|v_{\bar{S}}\|_1 \leq (1+\rho)\|v_{\bar{S}}\|_1 \leq \frac{1+\rho}{1-\rho}\left(\|z\|_1 - \|x\|_1 + 2\|x_{\bar{S}}\|_1\right).$$

In [Rauhut & Foucart ’13] it was pointed out that if sparse vectors are exactly recovered, then stability is obtained at no cost. To see how, let us consider, for each index set $S\subset[N]$ with $\#S\leq s$, the operator $R_S$ defined on $\ker A$ by $R_S(v) = v_S$. We then have the following equivalent formulation of the null space property:
$$2\|v_S\|_1 < \|v\|_1 \quad \forall\, v\in\ker A\setminus\{0\} \qquad\Longleftrightarrow\qquad \mu := \max\left\{\|R_S\|_{1\to 1} : S\subset[N],\ \#S\leq s\right\} < 1/2.$$
It then follows that $A$ satisfies NSP$_\rho$ with constant $\rho = \mu/(1-\mu) < 1$, and hence we have stability. Note, however, that the constant $2(1+\rho)/(1-\rho)$ in (3.7) can become very large as $\rho\to 1$.
Now we can deduce Theorem 3.8 as a corollary.

Proof. (of Theorem 3.8): We just need to prove that $\frac{1+\rho}{1-\rho}\left(\|z\|_1 - \|x\|_1 + 2\|x_{\bar{S}}\|_1\right)$ can be dominated by $\frac{2(1+\rho)}{1-\rho}\sigma_s(x)_1$. In order to do this, we take $S$ as the set of the $s$ largest absolute coefficients of $x$, so that $\|x_{\bar{S}}\|_1 = \sigma_s(x)_1$. Also, if $x^\#$ is a minimizer of (P1), then $\|x^\#\|_1\leq\|x\|_1$ and $Ax^\# = Ax$. Therefore we just need to take $z = x^\#$, and this concludes the proof.

Sometimes we can deal with a fixed sparse vector rather than all vectors with a given sparsity. This will be the case in the next two theorems, where we give a geometric characterization for the uniqueness of the `1-minimization problem. [Chandrasekaran, Recht, Parrilo & Willsky ’12] introduced a general geometric theory for the recovery of “simple” objects, in their own words, from few measurements. They analyzed the connections between the recovery, from limited linear measurements, of objects with some structure and the convex hull of the set of these objects. This was inspired by the fact that the convex hull of (unit Euclidean-norm) one- sparse vectors is the unit ball of the `1-norm and, similarly, the convex hull of the (unit Euclidean-norm) rank-one matrices is the nuclear norm ball, as we will discuss in Section 3.6.

Definition 3.11. For a vector $x\in\mathbb{R}^N$, the descent convex cone is given by
$$T(x) = \operatorname{cone}\left\{z - x : z\in\mathbb{R}^N,\ \|z\|_1\leq\|x\|_1\right\},$$
where $\operatorname{cone}$ denotes the conic hull of a set.

Theorem 3.12. (Proposition 2.1 of [Chandrasekaran, Recht, Parrilo & Willsky ’12]): For $A\in\mathbb{R}^{m\times N}$, a vector $x\in\mathbb{R}^N$ is the unique minimizer of $\|z\|_1$ subject to $Az = Ax$ if and only if $\ker A\cap T(x) = \{0\}$.
Proof. Suppose that $\ker A\cap T(x) = \{0\}$. Let $x^\#$ be an $\ell_1$-minimizer. Then we have $\|x^\#\|_1\leq\|x\|_1$ and $Ax^\# = Ax$, and so $v = x^\# - x$ must belong to $T(x)\cap\ker A = \{0\}$. Therefore $x^\# = x$, and $x$ is the unique $\ell_1$-minimizer.
Now assume that $x$ is the unique $\ell_1$-minimizer. Every vector $v\in T(x)\setminus\{0\}$ can be written as $v = \sum\alpha_i(z_i - x)$ with $\alpha_i\geq 0$ (see Lemma 2.6 of [Ruszczynski]) and $\|z_i\|_1\leq\|x\|_1$. As $v\neq 0$, we have $\sum\alpha_i > 0$ and so we can take the normalized coefficients $\alpha_i' = \alpha_i/\sum\alpha_i$. If $v\in\ker A$, we have $A\left(\sum\alpha_i'z_i\right) = Ax$ and also $\left\|\sum\alpha_i'z_i\right\|_1\leq\sum\alpha_i'\|z_i\|_1\leq\|x\|_1$. By uniqueness of the $\ell_1$-minimizer, $\sum\alpha_i'z_i = x$ and so $v = 0$, which is a contradiction. Therefore $(T(x)\setminus\{0\})\cap\ker A = \emptyset$, that is, $T(x)\cap\ker A = \{0\}$.

This theorem shows us that we can rewrite exact recovery from `1-minimization purely in terms of convex geometry. A proper null space for x will be oriented in such a way that its shift by x will touch T (x) uniquely at x. Also, this geometric characterization can be extended to robust recovery as the following theorem shows.

Theorem 3.13. (Proposition 2.2 of [Chandrasekaran, Recht, Parrilo & Willsky ’12]): For $A\in\mathbb{R}^{m\times N}$, let $x\in\mathbb{R}^N$ and $y = Ax + e\in\mathbb{R}^m$ with $\|e\|_2\leq\eta$. If
$$\inf_{v\in T(x),\ \|v\|_2 = 1}\|Av\|_2 \geq \tau$$
for some $\tau > 0$, then a minimizer $x^\#$ of $\|z\|_1$ subject to $\|Az - y\|_2\leq\eta$ satisfies
$$\|x - x^\#\|_2 \leq \frac{2\eta}{\tau}.$$

Proof. We may assume that $x^\# - x\neq 0$, since otherwise the result is trivial. As $x^\#$ is a minimizer of $\|z\|_1$, we have $\|x^\#\|_1\leq\|x\|_1$, and this implies that $v = (x^\# - x)/\|x^\# - x\|_2$ belongs to $T(x)$. Since $\|v\|_2 = 1$ and $v\in T(x)$, our hypothesis says that $\|Av\|_2\geq\tau$. Using this and the triangle inequality, we have
$$\|x^\# - x\|_2 \leq \frac{1}{\tau}\|A(x^\# - x)\|_2 \leq \frac{1}{\tau}\left(\|Ax^\# - y\|_2 + \|Ax - y\|_2\right) \leq \frac{2\eta}{\tau}.$$

3.5 Robust Measurements

Due to physical limitations, there is no hope of measuring a signal $x\in\mathbb{C}^N$ with infinite precision. Hence the measurement vector $y\in\mathbb{C}^m$ will always be an approximation to the vector $Ax\in\mathbb{C}^m$. We want to model our problem in such a way that we can control the error of this approximation by
$$\|Ax - y\| \leq \eta,$$
for some $\eta\geq 0$ and some norm $\|\cdot\|$ on $\mathbb{C}^m$. Typically we will require finite energy of this difference, which, in mathematical terms, means that we will work with the $\ell_2$-norm. Then the output of our reconstruction technique cannot be $x$, but will be a vector $x^*\in\mathbb{C}^N$ whose distance to the original vector $x\in\mathbb{C}^N$ must be controlled by $\eta\geq 0$. This is called robustness of the reconstruction scheme with respect to the measurement error. Let us then substitute the original basis pursuit (P1) by the following convex problem:
$$\min_{z\in\mathbb{C}^N}\|z\|_1 \quad\text{subject to}\quad \|Az - y\|\leq\eta, \qquad\text{(P}_{1,\eta}\text{)}$$
where the norm in $\|Az - y\|$ is a properly chosen norm for the measurement error.
Definition 3.14. The matrix $A\in\mathbb{C}^{m\times N}$ is said to satisfy the robust null space property (with respect to the norm $\|\cdot\|$) with constants $0 < \rho < 1$ and $\tau > 0$ relative to a set $S\subset[N]$ if
$$\|v_S\|_1 \leq \rho\|v_{\bar{S}}\|_1 + \tau\|Av\| \qquad \forall\, v\in\mathbb{C}^N. \qquad\text{(NSP}_{\rho,\tau}\text{)}$$
As in the definition of the stable null space property, we say that a matrix $A$ satisfies the robust null space property of order $s$ with constants $0 < \rho < 1$ and $\tau > 0$ if it satisfies the robust null space property with constants $\rho, \tau$ relative to any set $S\subset[N]$ with $\#S\leq s$.
Remark 16. In this definition we do not ask for $v$ to be in $\ker A$. Note that if $v\in\ker A$, then the term $\|Av\|$ vanishes and we recover the stable null space property.
We can now generalize Theorem 3.8 to the case $\eta\neq 0$.
Theorem 3.15. Suppose that a matrix $A\in\mathbb{C}^{m\times N}$ satisfies the robust null space property of order $s$ with constants $0 < \rho < 1$ and $\tau > 0$. Then, for any $x\in\mathbb{C}^N$, a solution $x^\#$ of (P$_{1,\eta}$) with $y = Ax + e$ and $\|e\|\leq\eta$ satisfies

$$\|x - x^\#\|_1 \leq \frac{2(1+\rho)}{1-\rho}\,\sigma_s(x)_1 + \frac{4\tau}{1-\rho}\,\eta.$$
As in Section 3.4, we will prove a stronger statement, valid for any index set $S$, and obtain Theorem 3.15 as a particular case.
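Before proving this, note how (P$_{1,\eta}$) can be posed numerically. The following is a minimal sketch using cvxpy (assumed available); the data $A$, $y$ and $\eta$ are placeholder values used only for illustration.

```python
# Minimal sketch of quadratically constrained basis pursuit (P1,eta) in cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
m, N, eta = 40, 120, 0.05
A = rng.standard_normal((m, N)) / np.sqrt(m)
x_true = np.zeros(N); x_true[:4] = 1.0
e = rng.standard_normal(m)
e *= 0.8 * eta / np.linalg.norm(e)          # noise with ||e||_2 <= eta
y = A @ x_true + e

z = cp.Variable(N)
prob = cp.Problem(cp.Minimize(cp.norm1(z)), [cp.norm(A @ z - y, 2) <= eta])
prob.solve()
print("l1 error:", np.linalg.norm(z.value - x_true, 1))
```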

Theorem 3.16. The matrix $A\in\mathbb{C}^{m\times N}$ satisfies the robust null space property with constants $0 < \rho < 1$ and $\tau > 0$ relative to $S$ if and only if
$$\|z - x\|_1 \leq \frac{1+\rho}{1-\rho}\left(\|z\|_1 - \|x\|_1 + 2\|x_{\bar{S}}\|_1\right) + \frac{2\tau}{1-\rho}\|A(z - x)\|, \qquad (3.10)$$
for all vectors $x, z\in\mathbb{C}^N$.
Proof. The idea of the proof is, mutatis mutandis, the same as for Theorem 3.9. Let us first assume that the matrix $A$ satisfies (3.10) for all $x, z\in\mathbb{C}^N$. Then, for any $v\in\mathbb{C}^N$, we can take $x = v_S$ and $z = -v_{\bar{S}}$. This leads to
$$\|v\|_1 \leq \frac{1+\rho}{1-\rho}\left(\|v_{\bar{S}}\|_1 - \|v_S\|_1\right) + \frac{2\tau}{1-\rho}\|Av\|.$$

After rearranging, we obtain $(1-\rho)\left(\|v_S\|_1 + \|v_{\bar{S}}\|_1\right) \leq (1+\rho)\left(\|v_{\bar{S}}\|_1 - \|v_S\|_1\right) + 2\tau\|Av\|$, and this is the same as $\|v_S\|_1\leq\rho\|v_{\bar{S}}\|_1 + \tau\|Av\|$. This is exactly the robust null space property with constants $0 < \rho < 1$ and $\tau > 0$ relative to $S$.
Conversely, assume that the matrix $A$ satisfies the robust null space property with constants $0 < \rho < 1$ and $\tau > 0$ relative to $S$. For $x, z\in\mathbb{C}^N$, we set $v = z - x$; then, using NSP$_{\rho,\tau}$ and Lemma 3.10, we obtain
$$\|v_S\|_1 \leq \rho\|v_{\bar{S}}\|_1 + \tau\|Av\|, \qquad \|v_{\bar{S}}\|_1 \leq \|z\|_1 - \|x\|_1 + \|v_S\|_1 + 2\|x_{\bar{S}}\|_1.$$
Combining both inequalities, we have
$$\|v_{\bar{S}}\|_1 \leq \frac{1}{1-\rho}\left(\|z\|_1 - \|x\|_1 + 2\|x_{\bar{S}}\|_1 + \tau\|Av\|\right).$$
Using NSP$_{\rho,\tau}$ once more, we obtain
$$\|v\|_1 = \|v_S\|_1 + \|v_{\bar{S}}\|_1 \leq (1+\rho)\|v_{\bar{S}}\|_1 + \tau\|Av\| \leq \frac{1+\rho}{1-\rho}\left(\|z\|_1 - \|x\|_1 + 2\|x_{\bar{S}}\|_1\right) + \frac{2\tau}{1-\rho}\|Av\|.$$

We will now establish a result where the $\ell_1$-error estimate is replaced by a general $\ell_p$-error estimate for $p\geq 1$. This means that, from now on, other reconstruction error rates are available. In order to do this, we need to modify and strengthen the null space property to include norms other than the $\ell_1$-norm. We will now directly define the property of order $s$ instead of defining it for a fixed set $S\subset[N]$ first.
Definition 3.17. Given $q\geq 1$, the matrix $A\in\mathbb{C}^{m\times N}$ is said to satisfy the $\ell_q$-robust null space property of order $s$ (with respect to the norm $\|\cdot\|$) with constants $0 < \rho < 1$ and $\tau > 0$ if, for any set $S\subset[N]$ with $\#S\leq s$,
$$\|v_S\|_q \leq \frac{\rho}{s^{1-1/q}}\|v_{\bar{S}}\|_1 + \tau\|Av\| \qquad \forall\, v\in\mathbb{C}^N. \qquad\text{(NSP}_{q,\rho,\tau}\text{)}$$
Remark 17. Note that, with the aid of the inequality $\|v_S\|_p\leq s^{1/p - 1/q}\|v_S\|_q$, valid for $1\leq p\leq q$, we can prove that the $\ell_q$-robust null space property with constants $0 < \rho < 1$ and $\tau > 0$ implies that, for any set $S\subset[N]$ with $\#S\leq s$,
$$\|v_S\|_p \leq \frac{\rho}{s^{1-1/p}}\|v_{\bar{S}}\|_1 + \tau s^{1/p - 1/q}\|Av\| \qquad \forall\, v\in\mathbb{C}^N.$$
Hence, if we replace the norm $\|\cdot\|$ by the new norm $s^{1/p-1/q}\|\cdot\|$, the $\ell_q$-robust null space property implies the $\ell_p$-robust null space property with the same constants for any $1\leq p\leq q$. Therefore, Definition 3.17 is a natural strengthening of the robust null space property for the $\ell_1$-norm.

We will now show that quadratically constrained basis pursuit works for matrices satisfying NSP2,ρ,τ . The precise statement is as follows.

Theorem 3.18. Suppose that the matrix $A\in\mathbb{C}^{m\times N}$ satisfies the $\ell_2$-robust null space property of order $s$ with constants $0 < \rho < 1$ and $\tau > 0$. Then, for any $x\in\mathbb{C}^N$, a solution $x^\#$ of (P$_{1,\eta}$) with $\|\cdot\| = \|\cdot\|_2$, $y = Ax + e$, and $\|e\|_2\leq\eta$ approximates the vector $x$ with $\ell_p$-error
$$\|x - x^\#\|_p \leq \frac{C}{s^{1-1/p}}\,\sigma_s(x)_1 + D\,s^{1/p - 1/2}\,\eta, \qquad 1\leq p\leq 2, \qquad (3.11)$$
where the constants $C, D > 0$ depend only on $\rho$ and $\tau$.
Consequently, if there is a way to characterize which matrices satisfy this property or, at least, to find a useful set of matrices for which this property is valid, we can then ensure the recovery of approximately sparse vectors with controlled error rates. In Section 5.5, in particular Theorem 5.19, proved by [Andersson & Strömberg ’14], we will see that this is the case for matrices with sufficiently small restricted isometry constants. Nonetheless, it is important to note that the first result on the stability and robustness of sparse reconstruction via Basis Pursuit was obtained by [Candès, Romberg & Tao I ’06].
Next, we will prove a more general result from which Theorem 3.18 follows by choosing $q = 2$, taking $z = x^\#$ and observing that, since $x^\#$ is a solution of $\min_{z\in\mathbb{C}^N}\|z\|_1$ subject to the constraint, we necessarily have $\|x^\#\|_1 - \|x\|_1\leq 0$.
Theorem 3.19. Given $1\leq p\leq q$, suppose that the matrix $A\in\mathbb{C}^{m\times N}$ satisfies the $\ell_q$-robust null space property of order $s$ with constants $0 < \rho < 1$ and $\tau > 0$. Then, for any $x, z\in\mathbb{C}^N$,

$$\|z - x\|_p \leq \frac{C}{s^{1-1/p}}\left(\|z\|_1 - \|x\|_1 + 2\sigma_s(x)_1\right) + D\,s^{1/p-1/q}\|A(z - x)\|,$$
where $C = (1+\rho)^2/(1-\rho)$ and $D = (3+\rho)\tau/(1-\rho)$.
Proof. Recall that the $\ell_q$-robust null space property implies the $\ell_1$-robust and the $\ell_p$-robust null space properties for $p\leq q$. These two statements can be written as follows:
$$\|v_S\|_1 \leq \rho\|v_{\bar{S}}\|_1 + \tau s^{1-1/q}\|Av\|, \qquad (3.12)$$
$$\|v_S\|_p \leq \frac{\rho}{s^{1-1/p}}\|v_{\bar{S}}\|_1 + \tau s^{1/p-1/q}\|Av\|, \qquad (3.13)$$
for all $v\in\mathbb{C}^N$ and all $S\subset[N]$ with $\#S\leq s$. Thus, considering (3.12) and using Theorem 3.16 with $S$ chosen as an index set of $s$ largest (in absolute value) entries of $x$, we obtain
$$\|z - x\|_1 \leq \frac{1+\rho}{1-\rho}\left(\|z\|_1 - \|x\|_1 + 2\sigma_s(x)_1\right) + \frac{2\tau}{1-\rho}s^{1-1/q}\|A(z - x)\|. \qquad (3.14)$$
Now, considering (3.13) and choosing $S$ as an index set of $s$ largest (in absolute value) entries of $z - x$, we can estimate the $\ell_p$-norm by the $\ell_1$-norm and obtain
$$\|z - x\|_p \leq \|(z - x)_{\bar{S}}\|_p + \|(z - x)_S\|_p \leq \frac{1}{s^{1-1/p}}\|z - x\|_1 + \|(z - x)_S\|_p$$
$$\leq \frac{1}{s^{1-1/p}}\|z - x\|_1 + \frac{\rho}{s^{1-1/p}}\|(z - x)_{\bar{S}}\|_1 + \tau s^{1/p-1/q}\|A(z - x)\| \leq \frac{1+\rho}{s^{1-1/p}}\|z - x\|_1 + \tau s^{1/p-1/q}\|A(z - x)\|$$
$$\overset{(3.14)}{\leq} \frac{(1+\rho)^2}{1-\rho}\,\frac{1}{s^{1-1/p}}\left(\|z\|_1 - \|x\|_1 + 2\sigma_s(x)_1\right) + \frac{(3+\rho)\tau}{1-\rho}\,s^{1/p-1/q}\|A(z - x)\|.$$

Returning to Theorem 3.18, let us analyze the extremal cases $p = 1$ and $p = 2$. We then have the estimates
$$\|x - x^\#\|_1 \leq C\,\sigma_s(x)_1 + D\sqrt{s}\,\eta \qquad\text{and}\qquad \|x - x^\#\|_2 \leq \frac{C}{\sqrt{s}}\,\sigma_s(x)_1 + D\,\eta.$$

It is interesting to note that in the first case the coefficient of $\sigma_s(x)_1$ is constant, while in the second one it is proportional to $1/\sqrt{s}$. Also, the coefficient of $\eta$ scales like $\sqrt{s}$ for $p = 1$, while it is constant for $p = 2$. Another curious point is that, regardless of the norm in which we seek error estimates, the error $\sigma_s(x)_1$ always appears on the right-hand side. So, for example, in the case $p = 2$, one may ask why $\sigma_s(x)_2$ does not appear, since we have an $\ell_2$-error estimate, but instead $\sigma_s(x)_1/\sqrt{s}$ appears in its place. In Chapter 8 we will see that such an estimate, with $\sigma_s(x)_2$, is impossible in the parameter regime we are interested in. In order to acquire information in a compressive fashion, $m$ must be much smaller than $N$. After developing some techniques from the Geometry of Banach Spaces, we will prove in Theorem 8.21 a curious condition: namely, if we want estimates with $\sigma_s(x)_2$, then necessarily $m\geq cN$ for some positive constant $c$, so such $\ell_2$ estimates are useless for our purposes.
Lastly, we saw in Chapter 1 that unit $\ell_q$-balls with $q < 1$ can model compressible vectors in a satisfactory way. Indeed, by Proposition 1.5, if $\|x\|_q\leq 1$ for $q < 1$, then, for $p\geq 1$, we have $\sigma_s(x)_p\leq s^{1/p - 1/q}$. On the other hand, taking $\eta = 0$ in Equation (3.11), which is the same as considering measurements without error, we have

$$\|x - x^\#\|_p \leq \frac{C}{s^{1-1/p}}\,\sigma_s(x)_1 \leq \frac{C}{s^{1-1/p}}\,s^{1-1/q} = C\,s^{1/p - 1/q}, \qquad 1\leq p\leq 2.$$

Hence, we have the same decay rate for the reconstruction error in $\ell_p$ and for the best $s$-term approximation error, provided that $p\in[1, 2]$. Therefore the terms $\sigma_s(x)_1/s^{1-1/p}$ and $\sigma_s(x)_p$ are comparable, and the appearance of $\sigma_s(x)_1$ on the right-hand side makes sense, see the comments after Definition 8.19.

3.6 Low-Rank Matrix Recovery

In this dissertation we are mainly interested in the problem of sparse vector recovery. However, the recovery/reconstruction of higher-dimensional structures, like matrices and higher-order tensors, is also possible from few observations. The matrix case is especially important in many practical problems. Consider, for example, survey data containing the responses of various individuals to specific questions. We can make a table with individuals in the rows and questions in the columns. In any quiz, many questions are left unanswered, and the aim is to provide a good estimate for the missing answers. Even more, typically many questions will be left with no answer, so we want to recover a matrix from very few measurements.
Matrix-completion problems arise naturally in fields where questions about dimensionality or complexity are present. Typically we can model these as a problem about the rank of some appropriate matrix. The main point is that, in these cases, the matrix we wish to recover is known to be structured, in the sense that it is low-rank or approximately low-rank.
Some applications of the techniques of low-rank matrix recovery are: phase retrieval of diffracted waves in crystallographic and astronomical imaging [Candès, Eldar, Strohmer & Voroninski ’13]; low-dimensional embedding of data and the connectivity structure of graphs which arise in the study of social networks [Linial, London & Rabinovich ’15]; distance matrices which represent wireless sensor network localization [So & Ye ’07]; and multi-task learning and recommendation systems in machine learning [Rohde & Tsybakov ’11], just to name a few. This last one became very famous after the Netflix Problem$^3$, where some techniques related to low-rank matrix recovery were used. Solving all of these problems efficiently is particularly important in view of the massive size of actual datasets. Even more, it would be impossible to fully observe the matrices arising in those applications. As [Davenport & Romberg ’16] point out, “it can be prohibitively expensive to fully sample the entire output of a sensor array; we might only be able to measure the strength of a few connections in a graph; and any particular user of a recommendation system will provide only a few ratings".
In this section, we briefly explore the connection between the problem of Low-Rank Matrix Recovery and Compressive Sensing. More specifically, we will see that one problem can be realized as a particular instance of the other, when we replace the $\ell_1$-norm minimization by the nuclear norm minimization, defined below. This connection was first explored by [Candès & Recht ’09]; important contributions were made by [Recht, Fazel & Parrilo ’10], [Candès & Tao ’10] and [Keshavan, Montanari & Oh ’10].
Suppose that a matrix $X\in\mathbb{C}^{n_1\times n_2}$ of rank at most $r$ is observed via linear measurements described by the map
$$\mathcal{A}: \mathbb{R}^{n_1\times n_2}\to\mathbb{R}^m, \qquad X\mapsto \sum_{j=1}^{m}\langle X, A_j\rangle e_j = \sum_{j=1}^{m}\operatorname{tr}(XA_j^*)\,e_j,$$
where $A_1,\dots,A_m$ are suitable $n_1\times n_2$ matrices and $e_1,\dots,e_m$ is the standard basis of $\mathbb{R}^m$. If we have noise, then our measurements will be described by $y = \mathcal{A}(X) + e$, where $e\in\mathbb{R}^m$ denotes the additive noise. As in the vector case, our problem here is given by

$$\min_{X\in\mathbb{C}^{n_1\times n_2}}\operatorname{rank}(X) \quad\text{subject to}\quad \mathcal{A}(X) = y. \qquad\text{(NSPrank)}$$

$^3$See the websites http://www.netflixprize.com/ and https://www.cs.uic.edu/~liub/Netflix-KDD-Cup-2007.html for the problem description and more information about the solution.

Note that the rank of $X$ is the $\ell_0$-norm of the vector $[\sigma_1(X), \dots, \sigma_n(X)]$ of singular values of $X$. When the matrix is diagonal, this problem reduces to the (P0) problem, and hence it is also NP-hard. Therefore, we need a good relaxation strategy. To solve the problem we will, instead of pursuing rank minimization, try nuclear norm minimization. This norm, also known under the names Schatten 1-norm and Ky-Fan norm, is defined by $\|X\|_* = \sum_{j=1}^{n}\sigma_j(X)$ with $n = \min\{n_1, n_2\}$. Let us first recall the following definition.

Definition 3.20. Let $\mathcal{C}$ be a convex set. The convex envelope of a (possibly nonconvex) function $f:\mathcal{C}\to\mathbb{R}$ is defined as the largest convex function $g$ such that $g(x)\leq f(x)$ for all $x\in\mathcal{C}$. More precisely,
$$\operatorname{conv}f(x) = \sup\{g(x)\mid g \text{ convex},\ g\leq f\}.$$
Remark 18. From this definition we see that, among all convex functions that bound $f$ from below, $g$ is the best approximation, and so, instead of minimizing $f$, we can try to efficiently minimize $g$.

Now we state a result about the convex envelope of the matrix rank.

Theorem 3.21. (Theorem 1 of [Fazel, Hindi & Boyd ’01]): The convex envelope of the function $\phi(X) = \operatorname{rank}(X)$ on $\mathcal{C} = \{X\in\mathbb{R}^{m\times n} \mid \|X\|_{2\to 2}\leq 1\}$ is $\phi_{\text{env}}(X) = \|X\|_*$.
[Fazel, Hindi & Boyd ’01] proposed the heuristic of nuclear norm minimization (three years before the first preprint about Compressive Sensing was released and, therefore, without the analogy with sparse recovery and basis pursuit), and this was fully explored in the thesis [Fazel ’02].

$$\min_{X\in\mathbb{C}^{n_1\times n_2}}\|X\|_* \quad\text{subject to}\quad \mathcal{A}(X) = y. \qquad\text{(P}_{\text{nuclear}}\text{)}$$
This is a convex optimization problem and therefore, at least in principle, easily solvable. If the matrix variable $X$ is symmetric and positive semidefinite, then its singular values coincide with its eigenvalues and the problem (P$_{\text{nuclear}}$) reduces to trace minimization. Moreover, this norm is the $\ell_1$-norm of the vector of singular values of $X$, so we can try to import the definitions and techniques from the vector case. First of all, we have a matrix version of the null space property. Second, the success of nuclear norm minimization for the recovery of low-rank matrices is also equivalent to the matrix NSP, as stated in Theorem 3.22 below.
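Before stating Theorem 3.22, here is a minimal numerical sketch of nuclear norm minimization on a toy matrix-completion instance, assuming cvxpy with an SDP-capable solver (such as SCS) is available; the sampling pattern plays the role of the measurement map $\mathcal{A}$, and all parameters are illustrative.

```python
# Sketch of nuclear norm minimization (P_nuclear) for matrix completion.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n1, n2, r = 20, 20, 2
X_true = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))   # rank-r matrix
mask = (rng.random((n1, n2)) < 0.5).astype(float)                       # observed entries

X = cp.Variable((n1, n2))
prob = cp.Problem(cp.Minimize(cp.normNuc(X)),
                  [cp.multiply(mask, X) == mask * X_true])
prob.solve()
print("relative error:", np.linalg.norm(X.value - X_true) / np.linalg.norm(X_true))
```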

Theorem 3.22. ([Recht, Xu & Hassibi ’08]): Given a linear map $\mathcal{A}$ from $\mathbb{C}^{n_1\times n_2}$ to $\mathbb{C}^m$, every matrix $X\in\mathbb{C}^{n_1\times n_2}$ of rank $r$ is the unique solution of (P$_{\text{nuclear}}$) with $y = \mathcal{A}(X)$ if and only if, for all $M\in\ker\mathcal{A}\setminus\{0\}$ with singular values $\sigma_1(M)\geq\cdots\geq\sigma_n(M)\geq 0$, $n = \min\{n_1, n_2\}$,
$$\sum_{j=1}^{r}\sigma_j(M) < \sum_{j=r+1}^{n}\sigma_j(M).$$
In order to prove this result, we need the classical variational characterization of the singular values. Here, we present the proof given by [Rauhut & Foucart ’13].

Theorem 3.23. (Courant-Fischer Minimax Theorem): For a self-adjoint matrix $A\in\mathbb{C}^{n\times n}$, the eigenvalues of $A$, arranged in nonincreasing order, are obtained through
$$\lambda_k(A) = \max_{\substack{M\subset\mathbb{C}^n\\ \dim M = k}}\ \min_{\substack{x\in M\\ \|x\|_2 = 1}}\langle Ax, x\rangle. \qquad (3.15)$$
As a consequence, the singular values $\sigma_1(A)\geq\cdots\geq\sigma_{\min\{m,n\}}(A)\geq 0$ of a matrix $A\in\mathbb{C}^{m\times n}$ can be described by
$$\sigma_k(A) = \max_{\substack{M\subset\mathbb{C}^n\\ \dim M = k}}\ \min_{\substack{x\in M\\ \|x\|_2 = 1}}\|Ax\|_2.$$

Proof. Let us first prove that $\lambda_k(A)$ is not larger than the right-hand side of Equation (3.15). Let $\{u_1,\dots,u_n\}$ be an orthonormal basis of eigenvectors of $A$ with $Au_j=\lambda_j(A)u_j$ and take $M=\operatorname{span}(u_1,\dots,u_k)$. For $x=\sum_{j=1}^{k}\alpha_j u_j\in M$ with unit norm we have

\[ \langle Ax, x\rangle = \sum_{j=1}^{k}\lambda_j(A)|\alpha_j|^2 \ge \lambda_k(A)\sum_{j=1}^{k}|\alpha_j|^2 = \lambda_k(A)\|x\|_2^2 = \lambda_k(A). \]

Now we need to prove the converse inequality for $\lambda_k(A)$. Given a $k$-dimensional subspace $M$ of $\mathbb{C}^n$, we choose a vector $x\in M\cap\operatorname{span}\{u_k,\dots,u_n\}$ with $\|x\|_2=1$ (such a vector exists by a dimension count), so that $x=\sum_{j=k}^{n}\alpha_j u_j$. This leads to

\[ \langle Ax, x\rangle = \sum_{j=k}^{n}\lambda_j(A)|\alpha_j|^2 \le \lambda_k(A)\sum_{j=k}^{n}|\alpha_j|^2 = \lambda_k(A)\|x\|_2^2 = \lambda_k(A). \]

Let $\lambda(X)=(\lambda_1(X),\dots,\lambda_n(X))$ denote the spectrum of a matrix $X$, considered as a vector instead of a set. In the short note [Lidskii ’50] it was proved that, for Hermitian matrices $A$ and $B$, the vector $\lambda(A+B)-\lambda(A)$ lies in the convex hull of the points $P\lambda(B)$, where $P$ runs over the permutation matrices. The Birkhoff–von Neumann Theorem states that the set of doubly stochastic matrices, which is a convex polytope, is the convex hull of the set of permutation matrices. So Lidskii's result is equivalent to $\lambda(A+B)-\lambda(A)=O\lambda(B)$ for some doubly stochastic matrix $O$, and this, in turn, is equivalent to the following inequality.

Lemma 3.24. ([Lidskii ’50] and [Wielandt ’55]): Let $A, B\in\mathbb{C}^{n\times n}$ be two self-adjoint matrices, and let $(\lambda_j(A))_{j\in[n]}$, $(\lambda_j(B))_{j\in[n]}$, $(\lambda_j(A+B))_{j\in[n]}$ denote the eigenvalues of $A$, $B$ and $A+B$ arranged in nonincreasing order. For any $1\le i_1<\cdots<i_k\le n$,
\[ \sum_{j=1}^{k}\lambda_{i_j}(A+B) \le \sum_{j=1}^{k}\lambda_{i_j}(A) + \sum_{i=1}^{k}\lambda_i(B). \]
Proof. Note that the inequality is invariant under the change $B\to B-\alpha\,\mathrm{Id}$, for a given constant $\alpha$, so, without loss of generality, we may assume that the spectrum has been translated in such a way that $\lambda_{k+1}(B)=0$. Therefore the eigenvalues $\lambda_{k+2}(B),\lambda_{k+3}(B),\dots$ are all nonpositive. Let us use the spectral decomposition of $B$ and define the positive semidefinite matrix $B^+\in\mathbb{C}^{n\times n}$ as

\[ B = U\operatorname{diag}[\lambda_1(B),\dots,\lambda_k(B),\lambda_{k+1}(B),\dots,\lambda_n(B)]\,U^*, \qquad B^+ = U\operatorname{diag}[\lambda_1(B),\dots,\lambda_k(B),0,\dots,0]\,U^*. \tag{3.16} \]

Using the fact that the eigenvalues $\lambda_{k+2}(B),\lambda_{k+3}(B),\dots$ are nonpositive, we have that $B^+-B\succcurlyeq 0$, i.e., $B^+-B$ is positive semidefinite. Then $A+B^+\succcurlyeq A+B$ and also $A+B^+\succcurlyeq A$. Using the variational characterization (3.15), we have $\lambda_i(A+B)\le\lambda_i(A+B^+)$ and $\lambda_i(A)\le\lambda_i(A+B^+)$. This leads to

\[ \sum_{j=1}^{k}\big(\lambda_{i_j}(A+B)-\lambda_{i_j}(A)\big) \le \sum_{j=1}^{k}\big(\lambda_{i_j}(A+B^+)-\lambda_{i_j}(A)\big) \le \sum_{i=1}^{n}\big(\lambda_i(A+B^+)-\lambda_i(A)\big) \]

\[ = \operatorname{tr}(A+B^+) - \operatorname{tr}(A) = \operatorname{tr}(B^+) = \sum_{i=1}^{k}\lambda_i(B). \]

This result is more than an upper bound for the $k$ smallest eigenvalues of $A+B$: we can take any $k$ eigenvalues of $A+B$, arranged in nonincreasing order, and bound their sum from above by the sum of the eigenvalues of $A$ with the same indices plus the sum of the $k$ largest eigenvalues of $B$. For more information about inequalities for eigenvalues of Hermitian matrices, see [Tao’s Blog - 01/12/2010]. Lemma 3.24 can be used to derive an inequality about singular values, as stated in the next lemma.

Lemma 3.25. If the matrices $X\in\mathbb{C}^{m\times n}$ and $Y\in\mathbb{C}^{m\times n}$ have singular values $\sigma_1(X)\ge\cdots\ge\sigma_l(X)\ge 0$ and $\sigma_1(Y)\ge\cdots\ge\sigma_l(Y)\ge 0$, where $l=\min\{m,n\}$, then, for any $k\in[l]$,
\[ \sum_{j=1}^{k}\big|\sigma_j(X)-\sigma_j(Y)\big| \le \sum_{j=1}^{k}\sigma_j(X-Y). \]

Proof. Consider the self-adjoint dilation matrices $S(X), S(Y)\in\mathbb{C}^{(m+n)\times(m+n)}$ defined by
\[ S(X) = \begin{pmatrix} 0 & X \\ X^* & 0\end{pmatrix} \quad\text{and}\quad S(Y) = \begin{pmatrix} 0 & Y \\ Y^* & 0\end{pmatrix}. \]
Their eigenvalues, arranged in nonincreasing order, are
\[ \sigma_1(X)\ge\cdots\ge\sigma_l(X)\ge 0=\cdots=0\ge -\sigma_l(X)\ge\cdots\ge -\sigma_1(X), \qquad \sigma_1(Y)\ge\cdots\ge\sigma_l(Y)\ge 0=\cdots=0\ge -\sigma_l(Y)\ge\cdots\ge -\sigma_1(Y). \tag{3.17} \]
In order to see this, let us compute $S^2(X)$:
\[ S^2(X) = S(X)S(X) = \begin{pmatrix} XX^* & 0 \\ 0 & X^*X \end{pmatrix}. \]
Let $\sigma$ be any singular value of $X$. Then $\sigma^2$ is an eigenvalue of $XX^*$ (and of $X^*X$). Hence $\sigma^2$ is an eigenvalue of $S^2(X)$ for some eigenvector $v$, that is,
\[ S^2(X)v = \begin{pmatrix} XX^* & 0 \\ 0 & X^*X \end{pmatrix} v = \sigma^2 v. \]
Therefore we can conclude that $\pm\sigma$ is an eigenvalue of $S(X)$. Then, for $k\in[l]$, there exists a subset $I_k$ of $[m+n]$ of size $k$ such that
\[ \sum_{j=1}^{k}\big|\sigma_j(X)-\sigma_j(Y)\big| = \sum_{j\in I_k}\big(\lambda_j(S(X)) - \lambda_j(S(Y))\big). \]
So, using Lemma 3.24 with $A=S(Y)$, $B=S(X-Y)$ and $A+B=S(X)$ leads to
\[ \sum_{j=1}^{k}\big|\sigma_j(X)-\sigma_j(Y)\big| \le \sum_{j=1}^{k}\lambda_j(S(X-Y)) = \sum_{j=1}^{k}\sigma_j(X-Y). \]

We can now finally prove that if the measurement map satisfies the rank null space property, then low-rank matrices can be recovered through the minimization of the nuclear norm.

Proof. (of Theorem 3.22): Assume that every matrix $X\in\mathbb{C}^{n_1\times n_2}$ of rank $r$ is the unique solution of $(P_{\text{nuclear}})$ with $y=\mathcal{A}(X)$. Consider the singular value decomposition of a matrix $M\in\ker\mathcal{A}\setminus\{0\}$, which we will denote by $M=U\operatorname{diag}(\sigma_1,\dots,\sigma_n)V^*$ with $\sigma_1\ge\cdots\ge\sigma_n\ge 0$ and $U\in\mathbb{C}^{n_1\times n_1}$, $V\in\mathbb{C}^{n_2\times n_2}$ unitary matrices. So we can threshold $M$ and cut off some singular values in order to define two new matrices
\[ M_1 = U\operatorname{diag}(\sigma_1,\dots,\sigma_r,0,\dots,0)V^* \quad\text{and}\quad M_2 = U\operatorname{diag}(0,\dots,0,-\sigma_{r+1},\dots,-\sigma_n)V^*, \]
in such a way that $M=M_1-M_2$. Then we translate the condition $\mathcal{A}(M)=0$ into $\mathcal{A}(M_1)=\mathcal{A}(M_2)$. Since the rank of $M_1$ is at most $r$, its nuclear norm must be smaller than the nuclear norm of $M_2$. This is the same as saying that
\[ \|M_1\|_* = \sigma_1(M)+\cdots+\sigma_r(M) < \|M_2\|_* = \sigma_{r+1}(M)+\cdots+\sigma_n(M). \]
In order to prove the converse, let us suppose that $\sum_{j=1}^{r}\sigma_j(M) < \sum_{j=r+1}^{n}\sigma_j(M)$ for every $M\in\ker\mathcal{A}\setminus\{0\}$ with singular values $\sigma_1(M)\ge\cdots\ge\sigma_n(M)\ge 0$. Let us take a matrix $X\in\mathbb{C}^{n_1\times n_2}$ of rank $r$ and a matrix $Y\in\mathbb{C}^{n_1\times n_2}$ with $X\neq Y$ and satisfying $\mathcal{A}(Y)=\mathcal{A}(X)$. We will prove that $\|Y\|_* > \|X\|_*$. Let us define $M=X-Y\in\ker\mathcal{A}\setminus\{0\}$. Lemma 3.25, applied to $X$ and $M$ with $k=n$, implies that
\[ \|Y\|_* = \sum_{j=1}^{n}\sigma_j(X-M) \ge \sum_{j=1}^{n}\big|\sigma_j(X)-\sigma_j(M)\big|. \]

For $j\in[r]$ we have $|\sigma_j(X)-\sigma_j(M)| \ge \sigma_j(X)-\sigma_j(M)$, and for the smallest singular values, $r+1\le j\le n$, we clearly have $|\sigma_j(X)-\sigma_j(M)| = \sigma_j(M)$, since $\sigma_j(X)=0$ for $j>r$. So, using our hypothesis, we have
\[ \|Y\|_* \ge \sum_{j=1}^{r}\sigma_j(X) - \sum_{j=1}^{r}\sigma_j(M) + \sum_{j=r+1}^{n}\sigma_j(M) > \sum_{j=1}^{r}\sigma_j(X) = \|X\|_*. \]

Recently, [Kabanava, Kueng, Rauhut & Terstiege ’16] generalized Theorem 3.22 and established the stable and robust reconstruction of the low-rank matrices through the introduction of the robust null space property for the matrix case. They proved the following theorem.

Theorem 3.26. (Theorem 11 of [Kabanava, Kueng, Rauhut & Terstiege ’16]): Let $\mathcal{A}:\mathbb{C}^{n_1\times n_2}\to\mathbb{C}^m$ be any linear measurement map, let $n=\min\{n_1,n_2\}$ and let $\|\cdot\|$ be any norm on $\mathbb{C}^m$. Assume that for any matrix $X\in\mathbb{C}^{n_1\times n_2}$ our measurement vector is given by $y=\mathcal{A}(X)+e$ with $\|e\|_2\le\eta$, for some $\eta\ge 0$. Suppose that the measurement map $\mathcal{A}$ satisfies the robust rank null space property of order $r$ (with respect to the norm $\|\cdot\|$) with constants $0<\rho<1$ and $\tau>0$, that is,
\[ \sum_{i=1}^{r}\sigma_i(M) \le \rho\sum_{i=r+1}^{n}\sigma_i(M) + \tau\|\mathcal{A}(M)\| \qquad \forall\, M\in\mathbb{C}^{n_1\times n_2}. \]
Then, for any matrices $X, Z\in\mathbb{C}^{n_1\times n_2}$, we have
\[ \|X-Z\|_* \le \frac{1+\rho}{1-\rho}\left( \|Z\|_* - \|X\|_* + 2\sum_{i=r+1}^{n}\sigma_i(X) \right) + \frac{2\tau}{1-\rho}\,\|\mathcal{A}(X-Z)\|. \]
Also, if $\mathcal{A}$ satisfies the Frobenius robust rank null space property of order $r$, i.e., for all $M\in\ker\mathcal{A}\setminus\{0\}$,
\[ \left(\sum_{i=1}^{r}\sigma_i(M)^2\right)^{1/2} \le \frac{\rho}{\sqrt{r}}\sum_{i=r+1}^{n}\sigma_i(M) + \tau\|\mathcal{A}(M)\|, \]
then, for $X^\#$ a solution of the quadratically constrained nuclear norm minimization problem
\[ \min_{Z\in\mathbb{C}^{n_1\times n_2}} \|Z\|_* \quad\text{subject to}\quad \|\mathcal{A}(Z)-y\|_2\le\eta, \tag{3.18} \]
we have the following error estimate in the Frobenius norm:
\[ \|X-X^\#\|_F \le \frac{2(1+\rho)^2}{1-\rho}\,\frac{1}{\sqrt{r}}\sum_{i=r+1}^{n}\sigma_i(X) + \frac{2\tau(3+\rho)}{1-\rho}\,\eta. \]

As a corollary of Theorem 3.26, we obtain that if $X\in\mathbb{C}^{n_1\times n_2}$ is a matrix of rank at most $r$ and the measurements are noiseless, i.e., $\eta=0$, then the Frobenius robust null space property implies that $X$ is the unique solution of $(P_{\text{nuclear}})$. This section showed that the heuristic of replacing the (nonconvex) rank objective function by the sum of the singular values works, provided that the measurement map $\mathcal{A}$ satisfies the null space property. But until now nothing was said about how to minimize the nuclear norm efficiently. This was done by [Vandenberghe & Boyd ’96] and highlighted by [Fazel, Hindi & Boyd ’01], where the problem is expressed as a Semidefinite Programming (SDP) problem with constraints given by a linear matrix inequality. They argue that, for this kind of problem, many solvers are available to solve it efficiently. In particular, they proved the following theorem:

Theorem 3.27. (Lemma 1 of [Fazel, Hindi & Boyd ’01]): The problem $(P_{\text{nuclear}})$ is equivalent to the following semidefinite programming problem:

\[ \min_{\substack{X\in\mathbb{C}^{n_1\times n_2},\; Y\in\mathbb{C}^{n_1\times n_1},\; Z\in\mathbb{C}^{n_2\times n_2}}} \operatorname{tr} Y + \operatorname{tr} Z \quad\text{subject to}\quad \mathcal{A}(X)=y \ \text{ and }\ M = \begin{pmatrix} Y & X \\ X^* & Z\end{pmatrix} \succcurlyeq 0. \]
To finish this discussion, let us point out that low-rank matrix recovery has many applications. One can cite camera calibration, removal of shadows and specularities from face images, reconstruction of urban structures from low-rank textures, and so on. These and many other applications can be found at Low-Rank Matrix Recovery and Completion via Convex Optimization, a website maintained by Professor Yi Ma, from the University of Illinois. The URL is http://perception.csl.illinois.edu/matrix-rank/references.html.
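As a closing illustration for this section, here is a hedged cvxpy sketch of the SDP reformulation of Theorem 3.27 (real case for simplicity; cvxpy, its default conic solver and the problem sizes are assumptions of this illustration). The factor $\tfrac{1}{2}$ in the objective only rescales the optimal value so that it matches $\|X\|_*$; the minimizer is unchanged.

```python
# SDP reformulation sketch: minimize (tr Y + tr Z)/2 s.t. [[Y, X], [X^T, Z]] >= 0 and A(X) = y.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n1, n2, r, m = 6, 6, 1, 20
X_true = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
A_list = [rng.standard_normal((n1, n2)) for _ in range(m)]
y = np.array([np.trace(X_true @ A.T) for A in A_list])

# One big symmetric block variable W = [[Y, X], [X^T, Z]] constrained to be PSD.
W = cp.Variable((n1 + n2, n1 + n2), symmetric=True)
Y, X, Z = W[:n1, :n1], W[:n1, n1:], W[n1:, n1:]

constraints = [W >> 0] + [cp.trace(X @ A.T) == yj for A, yj in zip(A_list, y)]
prob = cp.Problem(cp.Minimize(0.5 * (cp.trace(Y) + cp.trace(Z))), constraints)
prob.solve()

# At the optimum, the SDP value coincides (up to solver accuracy) with the nuclear norm.
print(prob.value, np.linalg.norm(X.value, "nuc"))
```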

3.7 Complexity Issues of the NSP

In this chapter, we saw the null space property as a necessary and sufficient condition for the equivalence between (P1) and (P0). However, in this section we will state an odd result related to the computational complexity of the null space property. More precisely, we will state a recent theorem which says that the computation of the null space constant is intractable from a computational point of view. In order to do this, we must remember one of the three equivalent conditions for NSP given in the beginning of Section 3.2, more precisely, the one given by Equation (3.1). This equation tells us that the null space property relative to a set S is

\[ \|v_S\|_1 < \|v_{\bar S}\|_1 \;\Longleftrightarrow\; \|v_S\|_1 < \tfrac{1}{2}\|v\|_1 \qquad \forall\, v\in\ker A\setminus\{0\}. \]
So, taking $S$ to be the index set of the $s$ largest (in absolute value) entries of $v$, we can generalize the inequality above and define the null space constant.

Definition 3.28. For a given matrix $A\in\mathbb{C}^{m\times N}$, the null space constant $\alpha_s$ is defined as the smallest constant such that the NSP of order $s$ holds with this constant, that is,
\[ \alpha_s = \min\big\{\alpha \,:\, \|v_S\|_1 \le \alpha\|v\|_1 \ \ \forall\, v\in\ker A\setminus\{0\},\ \forall\, S\subseteq[N],\ \#S\le s\big\}, \]
or equivalently,
\[ \alpha_s = \max\big\{ \|v_S\|_1 \,:\, Av=0,\ \|v\|_1=1,\ S\subseteq[N],\ \#S\le s \big\}. \]

So, clearly, by Equation (3.1) and Theorem 3.2, sparse recovery is ensured by $(P_1)$ if and only if $\alpha_s < 1/2$. Therefore, one can ask how to compute $\alpha_s$ or how to decide, for a given matrix $A\in\mathbb{C}^{m\times N}$, whether $\alpha_s < 1$ holds. The next theorem states that there is no polynomial-time algorithm that computes the null space constant for all matrices $A$ and all sparsity levels $s$.

Theorem 3.29. (Theorem 5 of [Tillmann & Pfetsch ’14]): Given a matrix $A\in\mathbb{Q}^{m\times N}$ and a positive integer $s$, the problem of deciding whether $A$ satisfies the NSP of order $s$ with some constant $\alpha_s<1$ is coNP-complete.

It is sufficient to prove this theorem for rational matrices because floating-point arithmetic only represents rational numbers. Theorem 3.29 provides a justification to investigate sufficient conditions for Basis Pursuit, instead of relying only on necessary and sufficient conditions such as the NSP. In Chapter 4 and Chapter 5 we will analyze the two most important sufficient conditions in the Compressive Sensing literature: the Coherence Property and the Restricted Isometry Property. Based on Section 3.3, one could ask why the nonconvex approach is not used instead of Basis Pursuit. The main point is that the null space property for the $\ell_q$-norm, with $q<1$, also has theoretical drawbacks, as the next theorem shows.

Theorem 3.30. (Theorem 1 of [Ge, Jiang & Ye ’11]): For any 0 < q < 1, the problem (Pq) is NP-hard.

Therefore, the `1-minimization approach, a convex strategy, remains as the most widely used approach to find sparse (or compressible) solutions of linear systems. We finish this chapter with an open theoret- ical problem related to NSP which was stated in the last section of the work [Tillmann & Pfetsch ’14].

Open Problem: Prove that the computation of the null space constant is strongly NP-hard, that is, prove that there cannot exist a fully polynomial-time approximation scheme (FPTAS), i.e., an algorithm that solves the minimization problem within a factor of $(1+\varepsilon)$ of the optimal value in polynomial time with respect to the input size and to $1/\varepsilon$.

Chapter 4

The Coherence Property

There’s no unique criterion for coherence, but you have to be sensitive to incoherence. And as we’ve said, the test for incoherence is whether you’re getting the results you don’t want. David Bohm (1917-1992)1

4.1 Introduction

Until now, we have given a panorama of some strategies to solve the $\ell_0$-minimization problem, like thresholding and greedy algorithms. However, our main focus in this dissertation is Basis Pursuit working as a proxy for the initial combinatorial problem. With this in mind, our investigation relies on finding conditions on the matrix $A$ which ensure exact or approximate reconstruction of the original sparse or compressible vector. A fundamental point in the analysis of the algorithms is whether we are able to prove that some specific matrices arising in applications satisfy the conditions that we examine in this chapter. This question is not so well understood, and we will return to it later in this dissertation. In Theorem 3.3 we proved that the Null Space Property is a necessary and sufficient condition for the solution of $(P_0)$ via $\ell_1$-minimization. At the same time, the results about the computational complexity of NSP show us that we must look in other directions to find alternative sufficient ways to solve the problem - the question of necessity will prove much harder - even if these alternatives are not sharp. The seminal paper [Candès, Romberg & Tao I ’06] defined the Restricted Isometry Property2 in 2004, which will be explored in Chapter 5, whereas [Cohen, Dahmen & DeVore ’09] defined NSP in 2009. Historically, other sufficient conditions for sparse recovery appeared before RIP and NSP. Throughout this chapter, we follow the path taken by Donoho and collaborators to find sufficient conditions for Basis Pursuit. We will also begin to explore the suitability of matrices that are present in Compressive Sensing and applications in order to guarantee uniqueness of the (sparse) solution. This analysis leads to sufficient conditions under the names Spark and Coherence of the matrix. We begin this investigation with a new point of view on a very general and ubiquitous principle of physics, which is also a metatheorem in mathematics: the uncertainty principle.

4.2 Uncertainty Principle

In the twenties, Quantum Mechanics changed the physical understanding of the world. One of its deepest results is the Uncertainty Principle3, which dates back to [Heisenberg ’27]. The papers [Kennard ’27]

1In David Bohm, “Thought as a System", Routledge, 1994, p. 59. 2These authors originally called it the Uniform Uncertainty Principle. Also, despite the publication date, the original preprint appeared on arXiv in 2004. 3For the history of quantum mechanics, one can consult the 6 volumes (and more than 5000 pages) of [Mehra & Rechenberg ’01], the most comprehensive work undertaken by anyone on one of the vastest and most important developments in the history of science.

and [Weyl ’28] quantified and proved it4. The signal analysis community probably came to know and to reflect on it after the fundamental work of Gabor [Gabor ’46] in 1946. Then this principle evolved throughout the century and we can cite the extensions made by Landau, Pollack and Slepian [Slepian & Pollack ’61], [Landau & Pollack ’61] and later by Donoho and Stark [Donoho & Stark ’89]. Furthermore, there are deep connections with other areas such as Partial Differential Equations [Fefferman ’83] and Mathematical Analysis in general, see [Havin & Jöricke ’94]. A particularly clear description of the philosophy of the Uncertainty Principle in Mathematics is given by Terence Tao in [Tao’s Blog - 06/25/2010]: “A recurring theme in mathematics is that of duality: a mathematical object X can either be described internally (or in physical space, or locally), by describing what X physically consists of (or what kind of maps exist into X), or externally (or in frequency space, or globally), by describing what X globally interacts or resonates with (or what kind of maps exist out of X). These two fundamentally opposed perspectives on the object X are often dual to each other in various ways: performing an operation on X may transform it one way in physical space, but in a dual way in frequency space, with the frequency space description often being an “inversion" of the physical space description. In several important cases, one is fortunate enough to have some sort of fundamental theorem connecting the internal and external perspectives [...] the uncertainty principle, that describes the dual relationship between physical space and frequency space. There are various concrete formalisations of this principle, most famously the Heisenberg uncertainty principle and the Hardy uncertainty principle - but in many situations, it is the heuristic formulation of the principle that is more useful and insightful than any particular rigorous theorem that attempts to capture that principle. Unfortunately, it is a bit tricky to formulate this heuristic in a succinct way that covers all the various applications of that principle; the Heisenberg inequality $\Delta x\cdot\Delta\xi \gtrsim 1$ is a good start, but it only captures a portion of what the principle tells us.” As we are interested in sparsity, we follow [Donoho & Stark ’89] and [Donoho & Huo ’01] in order to see what the Uncertainty Principle represents to us, namely, that a signal cannot be sparsely represented both in time and in frequency. Suppose we have a non-zero vector $v\in\mathbb{R}^n$ and two orthonormal bases, $\Psi$ and $\Phi$. So we can represent $v$ in two ways: as a linear combination of columns of $\Psi$ or as a linear combination of columns of $\Phi$,

v = Ψx = Φy. Let us suppose, as an important case, that Ψ is the identity matrix and Φ is the matrix of Discrete Fourier Transform. We then have the time-domain representation and the frequency-domain representation. Now, any kind of uncertainty principle for the representation in both basis at the same time must take into account some distance between Ψ and Φ, since if we have Ψ = Φ, we can define v as one of the columns of Ψ and get the smallest possible cardinality (1 in both x and y). One of the most important notions of distance between basis was defined by [Donoho & Huo ’01], although it appeared in a heuristic way in [Mallat & Zhang ’93].

Definition 4.1. Let $\Psi$ and $\Phi$ be orthonormal (in the $\ell_2$-norm) bases for some finite dimensional vector space $V$ with $\dim V=n$. We define the mutual coherence between the two bases as
\[ \mu(\Psi,\Phi) = \sup\big\{ |\langle\psi,\phi\rangle| \,:\, \psi\in\Psi,\ \phi\in\Phi \big\}. \]
Proposition 4.2. Let $\Psi$ and $\Phi$ be orthonormal bases for some finite dimensional vector space $V$ with $\dim V=n$. Then the mutual coherence satisfies $\frac{1}{\sqrt{n}}\le\mu(\Psi,\Phi)\le 1$.

Proof. The upper bound is just the Cauchy–Schwarz inequality. For the lower bound, note that $(\Psi^T\Phi)^T(\Psi^T\Phi)=\Phi^T\Psi\Psi^T\Phi=\mathrm{Id}$, so $\Psi^T\Phi$ is also orthonormal and the sum of squares of its entries in each column is equal to $1$. If all of these entries were smaller than $\frac{1}{\sqrt{n}}$ in absolute value, the sum of their squares would be less than $1$, which is impossible. To finish, note that this lower bound is achieved by the identity matrix and the Discrete Fourier Transform matrix, for example.

4See also the contributions of von Neumann in [Székely & Rizzo ’07].

The first version of an uncertainty principle in this context was proved in [Donoho & Stark ’89] for the particular case where the representation is given by the identity matrix and the DFT matrix. The following more general version was proved in [Donoho & Huo ’01].

Theorem 4.3. ([Donoho & Huo ’01]): For an arbitrary pair of orthogonal bases $\Psi,\Phi$ with mutual coherence $\mu(\Psi,\Phi)$, and for a non-zero vector $v\in\mathbb{R}^n$ with representations $x$ and $y$ respectively,
\[ \|x\|_0 + \|y\|_0 \ge 1 + \frac{1}{\mu(\Psi,\Phi)}. \]
In fact we will prove an improved and stronger version, which resembles the original result in [Donoho & Stark ’89].

Theorem 4.4. ([Elad & Bruckstein ’02]): For an arbitrary pair of orthogonal bases $\Psi,\Phi$ with mutual coherence $\mu(\Psi,\Phi)$, and for a non-zero vector $v\in\mathbb{R}^n$ with representations $x$ and $y$ respectively,
\[ \|x\|_0 + \|y\|_0 \ge \frac{2}{\mu(\Psi,\Phi)}. \]
Remark 19. To see the difference between them, note that
\[ \frac{2}{\mu(\Psi,\Phi)} = \frac{1}{\mu(\Psi,\Phi)} + \frac{1}{\mu(\Psi,\Phi)} \ge 1 + \frac{1}{\mu(\Psi,\Phi)}, \]
since $\mu(\Psi,\Phi)\le 1$.

Remark 20. Here we present two proofs of Theorem 4.4.

Proof. Assume w.l.o.g. that $\|v\|_2=1$. Since $v=\Psi x=\Phi y$, we have
\[ 1 = \|v\|_2^2 = \langle v,v\rangle = \langle \Psi x,\Phi y\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i y_j \langle\psi_i,\phi_j\rangle \le \mu(\Psi,\Phi)\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i||y_j| = \mu(\Psi,\Phi)\,\|x\|_1\|y\|_1. \tag{4.1} \]
Through the arithmetic–geometric mean inequality we obtain
\[ \|x\|_1\|y\|_1 \ge \frac{1}{\mu(\Psi,\Phi)} \;\Longrightarrow\; \|x\|_1 + \|y\|_1 \ge \frac{2}{\sqrt{\mu(\Psi,\Phi)}}, \tag{4.2} \]
which is a kind of uncertainty principle for the $\ell_1$-norm. For the $(P_0)$ problem, finding the representation $x$ with the greatest $\ell_1$-norm satisfying $\|x\|_2=1$ and having $k$ non-zeros (i.e. $\|x\|_0=k$) leads to the optimization problem
\[ \max_{x} \|x\|_1 \quad\text{subject to}\quad \|x\|_2=1 \ \text{ and }\ \|x\|_0=k. \tag{4.3} \]
Assume we obtain a solution of this problem of the form $f(k)=f(\|x\|_0)$. Similarly, suppose we have a solution for the analogous problem for $y$ as $f(\|y\|_0)$. From Equation (4.2), we obtain
\[ \frac{1}{\mu(\Psi,\Phi)} \le \|x\|_1\|y\|_1 \le f(\|x\|_0)\,f(\|y\|_0). \tag{4.4} \]
If we find the solution to the optimization problem (4.3), we can replace it in Equation (4.4), and this will give us the inequality we are looking for. Assume w.l.o.g. that the $k$ non-zeros of $x$ are its first entries and that all of these entries are strictly positive (since we are considering only absolute values in this problem). Using Lagrange multipliers, the $\ell_0$ constraint vanishes and we obtain
\[ \mathcal{L}(x) = \sum_{i=1}^{k} x_i + \lambda\Big(1-\sum_{i=1}^{k} x_i^2\Big). \]

Setting the derivatives equal to zero,
\[ \frac{\partial\mathcal{L}(x)}{\partial x_i} = 1 - 2\lambda x_i = 0, \]
from which we obtain the optimal value $x_i=\frac{1}{2\lambda}$. Using the $\ell_2$ constraint, we have $x_i=\frac{1}{\sqrt{k}}$ and thus $f(k)=k\cdot\frac{1}{\sqrt{k}}=\sqrt{k}$, the maximal $\ell_1$-norm of the vector $x$. Using the analogous result for $y$, we have
\[ \frac{1}{\mu(\Psi,\Phi)} \le \|x\|_1\|y\|_1 \le f(\|x\|_0)\,f(\|y\|_0) = \sqrt{\|x\|_0\,\|y\|_0}. \]
With the help of the arithmetic–geometric mean inequality, we obtain
\[ \frac{1}{\mu(\Psi,\Phi)} \le \sqrt{\|x\|_0\,\|y\|_0} \le \frac{1}{2}\big(\|x\|_0 + \|y\|_0\big). \tag{4.5} \]

As [Elad ’10] points out, Allan Pinkus found a simpler proof of Theorem 4.4.

Proof. Since $\Psi$ and $\Phi$ are unitary matrices, we have $\|v\|_2=\|x\|_2=\|y\|_2$. Let us denote the support of $x$ by $I$. From $v=\Psi x=\sum_{i\in I}x_i\psi_i$ we have
\[ |y_j|^2 = |\langle v,\phi_j\rangle|^2 = \Big|\sum_{i\in I} x_i\langle\psi_i,\phi_j\rangle\Big|^2 \le \|x\|_2^2 \sum_{i\in I}|\langle\psi_i,\phi_j\rangle|^2 \le \|v\|_2^2\, |I|\, \mu(\Psi,\Phi)^2. \]

Summing the above over all $j\in J$, the support of $y$, we obtain
\[ \sum_{j\in J}|y_j|^2 = \|v\|_2^2 \le \|v\|_2^2\, |I|\,|J|\,\mu(\Psi,\Phi)^2 \;\Longrightarrow\; \frac{1}{\mu(\Psi,\Phi)} \le \sqrt{|I|\,|J|} = \sqrt{\|x\|_0\,\|y\|_0}, \]
and the arithmetic–geometric mean inequality concludes the proof as before.

Remark 21. This theorem gives us a lower bound on the total number of nonzeros, regardless of their locations, and tells us that, if the mutual coherence of two bases is small, then the representations of the same vector in both bases cannot be simultaneously sparse. This bears a resemblance to the classical uncertainty principle of quantum mechanics.

Example 4.5. Take $\Psi=\mathrm{Id}$ and $\Phi=\mathrm{DFT}$. Remember that the Discrete Fourier Transform is the unitary matrix $F\in\mathbb{C}^{n\times n}$ given by
\[ F_{ik} = \frac{1}{\sqrt{n}}\, e^{2\pi i (i-1)(k-1)/n}, \qquad i,k\in[n]. \]

We have $\mu(\Psi,\Phi)=\frac{1}{\sqrt{n}}$, by the definition of the Discrete Fourier Transform. It follows that a signal cannot have fewer than $2\sqrt{n}$ nonzero entries in the time and frequency domains combined. This relationship is tight. To see it, suppose that $n=N^2$ is a perfect square and consider the signal5
\[ \text{Ш}_k = \begin{cases} 1 & k \equiv 1 \pmod{\sqrt{n}}, \\ 0 & \text{otherwise}. \end{cases} \]

This signal has, of course, √n coordinates different from zero. The same happens with its Fourier transform, as the next computation shows. By the definition of Fourier transform, we have

5This function, sometimes called the picket fence and sometimes called sha (Ш), has this symbol and the second name as a homage to the 26th letter of the Russian alphabet.

\[ \widehat{\text{Ш}}_j = \frac{1}{N}\sum_{\ell=1}^{N^2} \text{Ш}_\ell\, e^{2\pi i(\ell-1)(j-1)/N^2} = \frac{1}{N}\sum_{k=1}^{N} e^{2\pi i(k-1)(j-1)/N} = \begin{cases} 1 & j\equiv 1 \pmod N, \\ 0 & \text{otherwise}. \end{cases} \]

This shows that $\widehat{\text{Ш}}=\text{Ш}$ and that $\|\text{Ш}\|_0 = \|\widehat{\text{Ш}}\|_0 = N = \sqrt{n}$. This implies the sharpness of Theorem 4.4, with the right-hand side given by $2\sqrt{n}=2N$ nonzeros, when the bases are the identity and the DFT.
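The computation in Example 4.5 can be reproduced numerically; the following numpy sketch (with 0-based indices, purely illustrative) checks that the comb has $\sqrt{n}$ nonzeros and is essentially its own unitary DFT.

```python
# For n = N^2, the Dirac comb with spacing N has sqrt(n) nonzeros and its
# (unitary) DFT is again the same comb.
import numpy as np

N = 8
n = N * N
comb = np.zeros(n)
comb[::N] = 1.0                             # ones at 0-based indices 0, N, 2N, ...

comb_hat = np.fft.fft(comb) / np.sqrt(n)    # unitary normalization

print(np.count_nonzero(comb), np.count_nonzero(np.round(comb_hat.real, 10)))  # 8 and 8
print(np.allclose(comb_hat, comb))          # True: the comb is its own transform
```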

4.3 A Case Study: Two Orthogonal Basis

Our main goal in this dissertation is to understand when linear systems Ax = b have sparse solutions and how to find them. The discussion in this section indicates that we can analyze a toy model first. This model will have the matrix A formed by the concatenation of two orthogonal basis placed side by side, i.e. A = [Ψ, Φ]. In this case, a connection with the uncertainty principle arises. Donoho and collaborators found that the study of the solutions of this kind of linear system can be performed by just considering representation in bases Ψ, Φ simultaneously. After showing this, we pass to the general case. In this particular representation problem, an overcomplete set of vectors (a dictionary) could lead to multiple representations of the same signal. This has some advantages. We can, for example, design better compression schemes for the signals we are working with or find natural sparse representations among multiple choices. Due to the idea that natural signals have parsimonious representation, in many situations it is desirable to work with multiple (but sparse) representations instead of a unique but dense representation of a signal. Therefore, we can state the first consequence from the overcomplete representation idea and from Theorem 4.4. It is the second uncertainty principle for `0-norm, which says that just one solution of our linear system, given by a concatenation of basis, can be sufficiently sparse.

Theorem 4.6. Let $x_1$ and $x_2$ be two different solutions of the linear system $Ax=[\Psi,\Phi]x=b$. Then
\[ \|x_1\|_0 + \|x_2\|_0 \ge \frac{2}{\mu(\Psi,\Phi)}, \]
i.e. both solutions cannot be very sparse simultaneously.

Proof. Let $v=x_1-x_2$ be the difference between the two solutions; then $v\in\ker(A)$. We now partition $v$ into its first $n$ entries and its last $n$ entries, respectively $v_\psi$ and $v_\phi$. Then $Av=0$ implies $\Psi v_\psi + \Phi v_\phi = 0$, so
\[ \Psi v_\psi = -\Phi v_\phi =: y \neq 0, \tag{4.6} \]
where $y\neq 0$ because $\Psi$ and $\Phi$ are nonsingular and $v\neq 0$. From (4.6) we see that $v_\psi$ is the representation of $y$ in the basis $\Psi$ and that $-v_\phi$ is the representation of $y$ in the basis $\Phi$. From Theorem 4.4, we have
\[ \|v\|_0 = \|v_\psi\|_0 + \|v_\phi\|_0 \ge \frac{2}{\mu(\Psi,\Phi)}. \]

Since $v=x_1-x_2$, this turns into
\[ \|x_1\|_0 + \|x_2\|_0 \ge \|x_1-x_2\|_0 = \|v\|_0 \ge \frac{2}{\mu(\Psi,\Phi)}, \]
where the first inequality is just the triangle inequality for the $\ell_0$-norm.

An immediate consequence of Theorem 4.6 is the following uniqueness result for linear systems. Henceforward we will denote the coherence of a concatenation of bases by $\mu(A)$ instead of $\mu(\Psi,\Phi)$.

Corollary 4.7. Let $x$ be a solution of $Ax=[\Psi,\Phi]x=b$. Suppose that it satisfies
\[ \|x\|_0 < \frac{1}{\mu(A)}. \]

Then it is necessarily the sparsest one, and any other solution $y$ must have $\|y\|_0 > 1/\mu(A)$; that is, we have uniqueness of the sparsest solution of the linear system.

The importance of Corollary 4.7 relies on the fact that, while in non-convex optimization problems we can usually only verify local optimality, for systems of the form $Ax=[\Psi,\Phi]x=b$ we have not only uniqueness but also global optimality, provided that the solution is sufficiently sparse.

4.4 Uniqueness Analysis for the General Case

The question now is how to pass from the concatenation of bases to the general case. Many important ideas related to it were already presented in the work [Rao & Gorodnistsky ’97], including, in a hidden form, the next definition. However, it was only after [Donoho & Elad ’ 03] that the connection between the $\ell_0$-norm and the kernel of the matrix $A$ appeared. This is given through the following definition.

Definition 4.8. The spark6 of a general matrix $A$ is the smallest number of columns of $A$ that are linearly dependent. In other words, $\operatorname{spark}(A)=\inf\{\|x\|_0 : Ax=0,\ x\neq 0\}$.

The resemblance and connection between spark and rank seems obvious. The latter measures the maximal number of columns that are linearly independent, while the former measures the minimal number of columns that are linearly dependent. However, it is much more difficult to compute the spark of a matrix than its rank, as it requires a combinatorial search over all possible subsets of columns of the matrix.

Example 4.9. [Tillmann & Pfetsch ’14] Consider the matrix
\[ A = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}. \]
Clearly $\operatorname{spark}(A)=2$. But it is very important to point out that, in general, $\operatorname{spark}(A)\le k$ does not guarantee the existence of a vector with $k$ nonzero entries in the null space of $A$: just take $k=3$ in this example. On the other hand, a null space vector with support size $k$ does not yield $\operatorname{spark}(A)=k$ but only $\operatorname{spark}(A)\le k$.

We list now some simple properties of the spark.

Proposition 4.10. For a matrix $A\in\mathbb{C}^{m\times n}$ we have:
I. $\operatorname{spark}(A)\in\{1,\dots,n\}\cup\{\infty\}$;
II. $\operatorname{spark}(A)=\infty$ if and only if $\operatorname{rank}(A)=n$;
III. $\operatorname{spark}(A)=1$ if and only if $A$ has a zero column;
IV. If $\operatorname{spark}(A)\neq\infty$, then $\operatorname{spark}(A)\le\operatorname{rank}(A)+1$.

With this new definition, we can prove a simple but general result.

Theorem 4.11. ([Rao & Gorodnistsky ’97]): If a system of linear equations $Ax=b$ has a solution which satisfies
\[ \|x\|_0 < \operatorname{spark}(A)/2, \]
then this solution is necessarily the sparsest one.

6Fusion of the words sparse and rank.

Proof. Let $y$ be another solution of this linear system, $Ay=b$. This implies $A(x-y)=0$. Now, any nonzero vector $\gamma\in\ker(A)$ must satisfy $\|\gamma\|_0\ge\operatorname{spark}(A)$, since $A\gamma=0$ expresses a vanishing linear combination of columns of $A$ and, by the definition of the spark, at least $\operatorname{spark}(A)$ columns are needed for that. By the triangle inequality for the $\ell_0$-norm,
\[ \|x\|_0 + \|y\|_0 \ge \|x-y\|_0 \ge \operatorname{spark}(A). \]
Since by hypothesis $\|x\|_0 < \operatorname{spark}(A)/2$, any other solution $y$ must have $\|y\|_0 > \operatorname{spark}(A)/2$.

Theorem 4.12. ([Donoho & Elad ’ 03]): Conversely, if $x$, the sparsest solution of the linear system $Ay=b$, is unique, then $\|x\|_0 < \operatorname{spark}(A)/2$.

Proof. Suppose, by contradiction, that $\operatorname{spark}(A)\le 2\|x\|_0=2k$. This means that there exists a set of at most $2k$ columns that are linearly dependent, which in turn implies that there exists an $h\in\ker A\setminus\{0\}$ such that $\|h\|_0\le 2k$. Since we have this limitation on the sparsity of $h$, we can write $h=x_1-x_2$ with $\|x_1\|_0\le k$, $\|x_2\|_0\le k$ and $x_1\neq x_2$. Then $Ah=A(x_1-x_2)=0$, so that $Ax_1=Ax_2$: two distinct vectors with at most $k=\|x\|_0$ nonzeros produce the same measurements. But this contradicts the assumption that there exists a unique sparsest solution of $Ax=b$.

Although we have a nice uniqueness result for the general case, the spark actually requires calculations with all possible subsets of columns of size up to $\operatorname{spark}(A)+1$. This heuristic indicates that its computational complexity can make the spark an inadequate practical choice for finding sparsest solutions of linear systems. Therefore we must try to find another, possibly non-sharp, guarantee of uniqueness. However, it is important to point out that in some situations, using random matrices, it can be easy to estimate the spark. If we consider a matrix $A\in\mathbb{R}^{m\times n}$ with $m\le n$ and entries given by i.i.d. random variables (random matrices will be explored in Chapter 7), then, with probability one, any submatrix of size $m\times m$ has maximal rank $m$ and hence $\operatorname{spark}(A)=m+1$ with probability 1. See [Feng & Zhang ’07]. The question about the computational tractability of the spark was addressed by [Tillmann & Pfetsch ’14]. They confirm the heuristics mentioned above. This work considers the spark as a particular case of a circuit and proves NP-completeness of the decision problem of the existence of a small circuit in a matrix via a reduction from the k-Clique Problem. See Section 3.1.3 of [Garey & Johnson ’79] for the k-Clique Problem.

Definition 4.13. For a given matrix $A\in\mathbb{R}^{m\times n}$, a circuit is a set $C\subseteq\{1,\dots,n\}$ of column indices such that $A_C x=0$ has a nonzero solution but no proper subset of $C$ has this property, i.e.,
\[ \operatorname{rank}(A_C) = |C|-1 = \operatorname{rank}(A_{C\setminus\{j\}}) \qquad \forall\, j\in C. \]
The spark of $A$ is the size of its smallest circuit.

Theorem 4.14. Given a matrix $A\in\mathbb{Q}^{m\times n}$ and a positive integer $k$, the problem of deciding whether there exists a circuit of $A$ of size at most $k$ is NP-complete.

To end this section, just note that there exists a circuit of size at most $k$ if and only if the spark is at most $k$. This proves that computing $\operatorname{spark}(A)$ is NP-hard.

4.5 Coherence for General Matrices

A clever way of overcoming the difficulty in computing the spark was proposed by [Donoho & Elad ’ 03]. They look at the mutual coherence, defined in Section 4.2 for the two-orthogonal case, in a different way. That definition can be seen as the maximal off-diagonal entry (in absolute value) of the Gram matrix
\[ A^T A = \begin{pmatrix} \mathrm{Id} & \Psi^T\Phi \\ \Phi^T\Psi & \mathrm{Id} \end{pmatrix}. \]

Inspired by it, we are able to define the concept of coherence for a general matrix. We stress, again, that, without loss of generality, the columns of the matrix are always implicitly understood to be $\ell_2$-normalized.

Definition 4.15. Let $A\in\mathbb{C}^{m\times N}$ be a matrix with $\ell_2$-normalized columns $a_1,\dots,a_N$. The coherence $\mu=\mu(A)$ of the matrix $A$ is defined as
\[ \mu(A) := \max_{1\le i\neq j\le N} |\langle a_i, a_j\rangle|. \tag{4.7} \]
Remark 22. Many authors call the quantity in the previous definition mutual coherence. Here we will emphasize that coherence refers to a matrix $A$ and mutual coherence to a pair of bases, as in the two-orthogonal case.

The coherence is easier to compute than the spark. Also, it provides a lower bound for the spark, which, as we pointed out above, is in general hard to compute. In order to prove this statement, we first state and prove a well-known theorem in Linear Algebra. This result will reappear later when discussing the relation between the number of measurements and the ambient space dimension of the signals of interest.

Theorem 4.16. ([Gershgorin ’31]): Let $\lambda$ be an eigenvalue of a matrix $A\in\mathbb{C}^{n\times n}$. There exists an index $j\in[n]$ such that
\[ |\lambda - A_{j,j}| \le \sum_{l\in[n]\setminus\{j\}} |A_{j,l}|. \]

Dividing by vj > 0 yields the desired statement. | |

As a result of Theorem 4.16, every eigenvalue of a matrix lies within at least one of the Gershgorin discs, which are the discs centered at one of the diagonal entries $A_{j,j}$ with radius $\sum_{l\in[n]\setminus\{j\}}|A_{j,l}|$. The theorem has many applications, such as matrix preconditioning. When we try to solve $Ax=b$ for $A$ with a large condition number, we instead precondition and solve $PAx=Pb$ where $P\approx A^{-1}$. Since $PA\approx\mathrm{Id}$, the eigenvalues of $PA$ should all be close to $1$, and we can use Theorem 4.16 to estimate how good our choice of $P$ is. For more details on Gershgorin's Theorem and its applications, see [Varga ’04]. Now we can state the relation between $\operatorname{spark}(A)$ and the coherence.

Lemma 4.17. (Theorem 5 of [Donoho & Elad ’ 03]): For any matrix $A\in\mathbb{R}^{n\times m}$,
\[ \operatorname{spark}(A) \ge 1 + \frac{1}{\mu(A)}. \]

Proof. The spark and the coherence of a matrix do not change when we $\ell_2$-normalize its columns; therefore, without loss of generality, we consider a column-normalized matrix. Let $\operatorname{spark}(A)=k$ be the smallest number of linearly dependent columns of $A$ and let $K$ be a set of such columns. Consider the matrix $A_K$ given by the restriction of $A$ to its columns in $K$. The Gram matrix $G=A_K^T A_K$ is singular, since the columns of $A_K$ are linearly dependent. Therefore, the spectrum of $G$ contains $0$. Applying Theorem 4.16 to the eigenvalue $\lambda=0$, we obtain, for some $i\in K$,
\[ |\lambda - g_{ii}| = |0-1| = 1 \le \sum_{j\neq i} |g_{ij}| = \sum_{j\neq i} |\langle a_i, a_j\rangle| \le (k-1)\mu(A) = (\operatorname{spark}(A)-1)\mu(A). \]

Now we have another uniqueness corollary, this time based on the coherence.

Theorem 4.18. ([Donoho & Huo ’01]): If a system of linear equations $Ax=b$ has a solution which satisfies
\[ \|x\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(A)}\right), \]
then this solution is necessarily the sparsest one.

The difference between Theorem 4.11 and Theorem 4.18 is that, while the first one is sharp, the second one relies only on a lower bound for the spark. The coherence can never be smaller than $1/\sqrt{n}$, and therefore the cardinality bound in the second case is never larger than $\sqrt{n}/2$. The spark, however, can be as large as $n$, so the first theorem can give a bound on the cardinality as large as $n/2$. Theorem 4.18 provides a sufficient condition for linear systems to have a unique sparsest solution. Although it is not sharp, it is easier to estimate. See Section 2.3 of [Elad ’10]. Now it is interesting to explore this property and find out what it can tell us about the solution of linear systems.

4.6 Properties and Generalizations of Coherence

The definition of coherence is a way to represent the dependence between columns of the matrix A. It was generalized in many different ways in order to characterize sharper degrees of dependence. As [Kutyniok ’12] pointed out, it is interesting to note that different variations of coherence have appeared in the literature, for example structured p-Babel function [Borup, Gribonval & Nielsen ’08], Babel func- tion [Donoho & Elad ’ 03], cluster coherence [Donoho & Kutyniok ’13], cumulative coherence function [Tropp ’04], and fusion coherence [Boufounos, Kutyniok & Rauhut ’11]. Following Tropp, we will introduce a generalization of coherence which he called cumulative coherence function. However, as in [Rauhut & Foucart ’13], we call it `1-coherence function.

Definition 4.19. Let $A\in\mathbb{C}^{m\times N}$ be a matrix with $\ell_2$-normalized columns $a_1,\dots,a_N$. The $\ell_1$-coherence function $\mu_1$ of the matrix $A$ is defined for $s\in[N-1]$ by
\[ \mu_1(s) := \max_{i\in[N]} \max\Big\{ \sum_{j\in S} |\langle a_i, a_j\rangle| \,:\, S\subset[N],\ \operatorname{card}(S)=s,\ i\notin S \Big\}. \tag{4.8} \]

Remark 23. Note that for s = 1, the `1-coherence function is the coherence of a matrix.

Remark 24. The $\ell_1$-coherence function can be generalized in a straightforward way, for $p>0$, to the $\ell_p$-coherence function, that is,
\[ \mu_p(s) := \max_{i\in[N]} \max\Big\{ \Big(\sum_{j\in S} |\langle a_i, a_j\rangle|^p\Big)^{1/p} \,:\, S\subset[N],\ \operatorname{card}(S)=s,\ i\notin S \Big\}. \]

Now we will state the first properties that follow from the definition of µ(A).

Proposition 4.20. For $1\le s\le N-1$ we have
\[ \mu \le \mu_1(s) \le s\mu. \tag{4.9} \]
More generally, for $1\le s,t\le N-1$ with $s+t\le N-1$, we have
\[ \max\{\mu_1(s),\mu_1(t)\} \le \mu_1(s+t) \le \mu_1(s) + \mu_1(t). \tag{4.10} \]

A very important property of matrices with small coherence is that column submatrices of moderate size are well conditioned. This is the content of Theorem 4.21, which is the reason behind the motto “matrices with small coherence are great for sensing”. Let us recall that $A_S$ denotes the matrix formed by the columns of $A\in\mathbb{C}^{m\times N}$ indexed by a subset $S$ of $[N]$.

Theorem 4.21. Let $A\in\mathbb{C}^{m\times N}$ be a matrix with $\ell_2$-normalized columns and let $s\in[N]$. For all $s$-sparse vectors $x\in\mathbb{C}^N$,
\[ (1-\mu_1(s-1))\|x\|_2^2 \le \|Ax\|_2^2 \le (1+\mu_1(s-1))\|x\|_2^2, \]
or equivalently, for each set $S\subset[N]$ with $\operatorname{card}(S)\le s$, the eigenvalues of the matrix $A_S^*A_S$ lie in the interval $[1-\mu_1(s-1),\, 1+\mu_1(s-1)]$. In particular, if $\mu_1(s-1)<1$, then $A_S^*A_S$ is invertible.

Proof. Note that, for a fixed set $S\subset[N]$ with $\operatorname{card}(S)\le s$, the matrix $A_S^*A_S$ is positive semidefinite, so it has an orthonormal basis of eigenvectors associated with real nonnegative eigenvalues. Note also that $Ax=A_S x_S$ for any $x\in\mathbb{C}^N$ supported on $S$. Due to the normalization, i.e. $\|a_j\|_2=1$ for all $j\in[N]$, the diagonal entries of $A_S^*A_S$ are all equal to one. By Gershgorin's theorem, the eigenvalues of $A_S^*A_S$ are contained in the union of the discs centered at $1$ with radii
\[ r_j = \sum_{l\in S,\, l\neq j} |(A_S^*A_S)_{j,l}| = \sum_{l\in S,\, l\neq j} |\langle a_l, a_j\rangle| \le \mu_1(s-1), \qquad j\in S. \]

Since the eigenvalues are real, they must lie in $[1-\mu_1(s-1),\, 1+\mu_1(s-1)]$. This is equivalent to the first inequality in the theorem's statement, which follows from the fact that the Rayleigh–Ritz quotient
\[ R_{A_S^*A_S}(x) = \frac{\langle x, A_S^*A_S x\rangle}{\langle x, x\rangle} \]
attains its maximum $\lambda_{\max}$ and minimum $\lambda_{\min}$ at the associated eigenvectors $v_{\max}$ and $v_{\min}$, respectively.

Corollary 4.22. Given a matrix $A\in\mathbb{C}^{m\times N}$ with $\ell_2$-normalized columns and an integer $s\ge 1$, if
\[ \mu_1(s) + \mu_1(s-1) < 1, \]
then, for each set $S\subset[N]$ with $\operatorname{card}(S)\le 2s$, the matrix $A_S^*A_S$ is invertible and the matrix $A_S$ is injective. In particular, the conclusion holds if
\[ \mu < \frac{1}{2s-1}. \tag{4.11} \]

Proof. Using Equation (4.10), the condition $\mu_1(s)+\mu_1(s-1)<1$ implies that $\mu_1(2s-1)<1$. For a set $S\subset[N]$ with $\operatorname{card}(S)\le 2s$, from the previous theorem we deduce that the smallest eigenvalue of the matrix $A_S^*A_S$ satisfies $\lambda_{\min}\ge 1-\mu_1(2s-1)>0$, which shows that $A_S^*A_S$ is invertible. To prove that $A_S$ is injective, observe that $A_S z=0$ leads to $A_S^*A_S z=0$ and so $z=0$. Now, the second statement follows from (4.9), because $\mu_1(s)+\mu_1(s-1)\le s\mu+(s-1)\mu=(2s-1)\mu<1$ if $\mu<1/(2s-1)$.

Corollary 4.22 is a reformulation of Theorem 4.18. It tells us what kind of vectors we can hope to retrieve from the solution of the linear system, that is, what degree of sparsity we can deal with. Clearly, with any reasonable method we expect to recover not only $1$-sparse or $2$-sparse vectors but perhaps $10000$-sparse vectors. From Equation (4.11), the coherence must then be as small as possible. Therefore it is important to know what properties a matrix with small coherence has or, at least, how to construct such matrices.

4.7 Constructing Matrices with Small Coherence

Matrices with small coherence are important because they are well behaved, in the sense that $A_S^*A_S$ is invertible. We now address the question of how to construct this type of matrices $A\in\mathbb{K}^{m\times N}$ with $m\le N$. We first note that if $m=N$, then $\mu(A)=0$ for an orthogonal matrix $A$. In the $m\ll N$ framework, we expect to have some lower bound. This is the content of the following theorem.

Theorem 4.23. ([Welch ’74]): The coherence of a matrix $A\in\mathbb{K}^{m\times N}$ with $\ell_2$-normalized columns satisfies
\[ \mu \ge \sqrt{\frac{N-m}{m(N-1)}}. \tag{4.12} \]
Equality holds if and only if the columns $a_1,\dots,a_N$ of the matrix $A$ form an equiangular tight frame.

Remark 25. Frames will be introduced in the next section. We postpone until then the understanding of what it means for equality in Theorem 4.23 to hold. Anyway, we exhibit the proof here.

Proof. Consider the Gram matrix $G=A^*A\in\mathbb{K}^{N\times N}$, with entries given by
\[ G_{i,j} = \langle a_j, a_i\rangle = \overline{\langle a_i, a_j\rangle}, \qquad i,j\in[N], \]
and the matrix $H=AA^*\in\mathbb{K}^{m\times m}$. As the columns $(a_1,\dots,a_N)$ are $\ell_2$-normalized, we have
\[ \operatorname{tr}(G) = \sum_{i=1}^{N}\|a_i\|_2^2 = N. \tag{4.13} \]
From the Cauchy–Schwarz inequality we obtain
\[ \operatorname{tr}(H) = \langle H, \mathrm{Id}_m\rangle_F \le \|H\|_F\,\|\mathrm{Id}_m\|_F = \sqrt{m}\,\sqrt{\operatorname{tr}(HH^*)}. \tag{4.14} \]
Now observe that
\[ \operatorname{tr}(HH^*) = \operatorname{tr}(A^*AA^*A) = \operatorname{tr}(GG^*) = \sum_{i,j=1}^{N}|\langle a_i,a_j\rangle|^2 = \sum_{i=1}^{N}\|a_i\|_2^4 + \sum_{\substack{i,j=1\\ i\neq j}}^{N}|\langle a_i,a_j\rangle|^2 = N + \sum_{\substack{i,j=1\\ i\neq j}}^{N}|\langle a_i,a_j\rangle|^2. \tag{4.15} \]
Combining the three equations above with the fact that $\operatorname{tr}(H)=\operatorname{tr}(G)$ yields
\[ N^2 \le m\Big( N + \sum_{\substack{i,j=1\\ i\neq j}}^{N}|\langle a_i,a_j\rangle|^2 \Big). \tag{4.16} \]
Since
\[ |\langle a_i,a_j\rangle| \le \mu \quad\text{for all } i,j\in[N],\ i\neq j, \tag{4.17} \]
we obtain $N^2 \le m\big(N + (N^2-N)\mu^2\big)$, which leads to $\mu \ge \sqrt{\frac{N-m}{m(N-1)}}$.

Equality holds in Equation (4.12) exactly when it holds in Equations (4.14) and (4.17). Equality in (4.14) implies that $H=\lambda\,\mathrm{Id}_m$ for some nonnegative constant $\lambda$. In the next section we will see that this is exactly the case when the system $(a_1,\dots,a_N)$ is a tight frame, and that equality in (4.17) means that this system is equiangular.

We can extend the Welch bound to the $\ell_1$-coherence function for small values of its argument.

Theorem 4.24. The $\ell_1$-coherence function of a matrix $A\in\mathbb{K}^{m\times N}$ with $\ell_2$-normalized columns satisfies
\[ \mu_1(s) \ge s\sqrt{\frac{N-m}{m(N-1)}} \qquad\text{whenever } s<\sqrt{N-1}. \tag{4.18} \]
Equality holds if and only if the columns $a_1,\dots,a_N$ of the matrix $A$ form an equiangular tight frame.

In order to prove this theorem, we will need a lemma. The reader may note the similarity between the proof of this lemma and the proof of Theorem 1.7 via optimization techniques.

Lemma 4.25. For $k<\sqrt{n}$, if the finite sequence $(\alpha_1,\alpha_2,\dots,\alpha_n)$ satisfies
\[ \alpha_1\ge\alpha_2\ge\cdots\ge\alpha_n\ge 0 \quad\text{and}\quad \alpha_1^2+\alpha_2^2+\cdots+\alpha_n^2 \ge \frac{n}{k^2}, \]
then
\[ \alpha_1+\alpha_2+\cdots+\alpha_k \ge 1, \]
with equality if and only if $\alpha_1=\alpha_2=\cdots=\alpha_n=1/k$.

Proof. The first thing to note is that the lemma is equivalent to the statement
\[ \begin{cases} \alpha_1\ge\alpha_2\ge\cdots\ge\alpha_n\ge 0 \\ \alpha_1+\alpha_2+\cdots+\alpha_k\le 1 \end{cases} \;\Longrightarrow\; \alpha_1^2+\alpha_2^2+\cdots+\alpha_n^2 \le \frac{n}{k^2}, \]
with equality if and only if $\alpha_1=\alpha_2=\cdots=\alpha_n=1/k$. To prove this, we can look at the problem of maximizing the convex function $f(\alpha_1,\alpha_2,\dots,\alpha_n):=\alpha_1^2+\alpha_2^2+\cdots+\alpha_n^2$ over the convex polytope
\[ \mathcal{C} := \{(\alpha_1,\dots,\alpha_n)\in\mathbb{R}^n : \alpha_1\ge\alpha_2\ge\cdots\ge\alpha_n\ge 0 \ \text{ and }\ \alpha_1+\alpha_2+\cdots+\alpha_k\le 1\}. \]
As the function $f$ is convex and every point of $\mathcal{C}$ is a convex combination of its vertices, the maximum is attained at a vertex of $\mathcal{C}$ (see Theorem B.16 of [Rauhut & Foucart ’13] or Theorem 2.65 of [Ruszczynski]). The vertices of $\mathcal{C}$ are obtained as intersections of $n$ hyperplanes arising when we turn $n$ of the $n+1$ inequality constraints into equalities. Thus, these are the possibilities we have:

1. If $\alpha_1=\alpha_2=\cdots=\alpha_n=0$, then $f(\alpha_1,\alpha_2,\dots,\alpha_n)=0$;

2. If $\alpha_1+\alpha_2+\cdots+\alpha_k=1$ and $\alpha_1=\cdots=\alpha_l>\alpha_{l+1}=\cdots=\alpha_n=0$ for $1\le l\le k$, then $\alpha_1=\alpha_2=\cdots=\alpha_l=1/l$, and consequently $f(\alpha_1,\alpha_2,\dots,\alpha_n)=1/l$;

3. If $\alpha_1+\alpha_2+\cdots+\alpha_k=1$ and $\alpha_1=\cdots=\alpha_l>\alpha_{l+1}=\cdots=\alpha_n=0$ for $k\le l\le n$, then $\alpha_1=\alpha_2=\cdots=\alpha_l=1/k$, and consequently $f(\alpha_1,\alpha_2,\dots,\alpha_n)=l/k^2$.

Given that $k<\sqrt{n}$, it follows that
\[ \max_{(\alpha_1,\dots,\alpha_n)\in\mathcal{C}} f(\alpha_1,\alpha_2,\dots,\alpha_n) = \max\Big\{ \max_{1\le l\le k}\frac{1}{l},\ \max_{k\le l\le n}\frac{l}{k^2} \Big\} = \max\Big\{1, \frac{n}{k^2}\Big\} = \frac{n}{k^2}, \]
with equality only in the case $l=n$, where $\alpha_1=\alpha_2=\cdots=\alpha_n=1/k$.

Proof. (of Theorem 4.24) From Equation (4.16) we have
\[ \sum_{\substack{i,j=1\\ i\neq j}}^{N} |\langle a_i,a_j\rangle|^2 \ge \frac{N^2}{m} - N = \frac{N(N-m)}{m}, \]
which yields
\[ \max_{i\in[N]} \sum_{\substack{j=1\\ j\neq i}}^{N} |\langle a_i,a_j\rangle|^2 \ge \frac{1}{N}\sum_{\substack{i,j=1\\ i\neq j}}^{N} |\langle a_i,a_j\rangle|^2 \ge \frac{N-m}{m}. \]

Let $i^*\in[N]$ be the index at which the maximum is achieved. If we reorder the sequence $\{|\langle a_{i^*},a_j\rangle|\}_{j\neq i^*}$ in such a way that $|\langle a_{i^*},a_1\rangle| \ge |\langle a_{i^*},a_2\rangle| \ge \cdots \ge |\langle a_{i^*},a_{N-1}\rangle| \ge 0$, we have
\[ |\langle a_{i^*},a_1\rangle|^2 + |\langle a_{i^*},a_2\rangle|^2 + \cdots + |\langle a_{i^*},a_{N-1}\rangle|^2 \ge \frac{N-m}{m}. \]
Lemma 4.25 with $n=N-1$, $k=s$ and $\alpha_l := \frac{1}{s}\sqrt{\frac{m(N-1)}{N-m}}\,|\langle a_{i^*},a_l\rangle|$ leads to $\alpha_1+\cdots+\alpha_s\ge 1$. It then follows that
\[ \mu_1(s) \ge |\langle a_{i^*},a_1\rangle| + |\langle a_{i^*},a_2\rangle| + \cdots + |\langle a_{i^*},a_s\rangle| \ge s\sqrt{\frac{N-m}{m(N-1)}}. \]
Let us assume now that equality holds in (4.18), which implies that all the previous inequalities are in fact equalities. As in the proof of the Welch bound for the coherence, equality in (4.16) implies that the system $(a_1,\dots,a_N)$ is a tight frame. Besides that, the case of equality in Lemma 4.25 implies that $|\langle a_{i^*},a_j\rangle| = \sqrt{\frac{N-m}{m(N-1)}}$ for all $j\in[N]$, $j\neq i^*$. Since the index $i^*$ can be chosen arbitrarily in $[N]$, the system $(a_1,\dots,a_N)$ is also equiangular. Conversely, the proof that equiangular tight frames yield equality in (4.18) follows easily from Theorem 4.23 and (4.9).

The content of Theorem 4.23 tells us that “equality holds if and only if the columns a1,..., aN of the matrix A form an equiangular tight frame.” Then, in order to achieve the lower bound for coherence and construct good sensing matrices, we need to understand what is an equiangular tight frame.

4.8 A Glimpse of Frame Theory

4.8.1 History and Motivation

Throughout this dissertation we have emphasized the important question of what the building blocks of our signals of interest are and how to represent them in a sparse way. In this context the concept of basis can be a bit restrictive because of the requirements of linear independence and uniqueness of the coefficients in the decomposition of a vector. If we relax some of these conditions we can introduce the idea of frames, which are intuitively “bases with extra elements”. The Fourier Transform is a major tool in Analysis, with connections to many other areas of Mathematics. However, it has some shortcomings when used in signal analysis. For example, from Fourier Analysis alone we cannot know the moment of emission and the duration of a signal, because this information is hidden in the phase. Consider for example an orchestra with a very high-pitched piccolo and a low-pitched tuba. The Fourier Transform tells us that the piccolo and the tuba are present in the music, but it cannot give us the information about when one of the instruments starts to play or the other one ceases. The situation regarding this kind of analysis started changing in 1946, when Dennis Gabor formulated a fundamental approach to signal decomposition in terms of some elementary signals, introducing the concept of localization in the frequency space that quickly became a paradigm for spectral analysis and is, nowadays, central for many areas such as image and audio processing. See [Cohen ’94]. Frames were introduced in 1952 by [Duffin & Schaeffer ’52], who used Gabor's ideas in the study of nonharmonic Fourier series. In their work, they used highly overcomplete families of exponential functions (in their terminology, a “Hilbert space frame”) and found a very efficient way to compute the coefficients in these families. Their work is considered seminal in the field of Time-Frequency Analysis. See [Gröchenig ’01] for a book with a mathematical orientation and [Cohen ’94] for a book with a more applied approach. Early on, their ideas did not achieve general interest outside of nonharmonic Fourier series studies (one might look at [Young ’80]). After the initial work of Gabor, Frame Theory became popular in the area of Signal Processing only in 1985, due to the work [Daubechies, Grossman & Meyer ’85]. The importance of frames relies on the fact that they provide a great approach to deal with redundant, yet stable, representations of data in the presence of noise, erasures and quantization effects. Frames also have the property that they can be constructed to fit a particular problem in a way not possible for a set of linearly independent vectors, so today frames provide an extensive framework for the analysis and decomposition of signals, accompanied by various reconstruction procedures. Besides the great scope of applications that go beyond signal processing, data compression and sampling theory, we can also cite optics, filterbanks and signal detection, just to name a few subjects. Nowadays, it is also a highly prolific area for mathematicians, with connections with Besov Spaces, the Geometry of Banach Spaces and Random Matrices. In this section we follow [Casazza ’00] closely, which is a great introduction to the history and the theory of frames. Other good references on this area are [Christensen ’08] and [Casazza & Kutyniok ’13].

4.8.2 An Example

The following example, which gives a clear picture of the differences between bases and frames, is taken from [Han et al. ’07]. Consider two finite sequences of vectors in the plane $\mathbb{R}^2$:
\[ A = \{(1,0),\, (0,1)\} \quad\text{and}\quad B = \left\{ \sqrt{\tfrac{2}{3}}\,(1,0),\ \sqrt{\tfrac{2}{3}}\Big(-\tfrac{1}{2},\tfrac{\sqrt{3}}{2}\Big),\ \sqrt{\tfrac{2}{3}}\Big(-\tfrac{1}{2},-\tfrac{\sqrt{3}}{2}\Big) \right\}. \]
Let us compare the two sequences:

1. Both A and B are spanning sets for $\mathbb{R}^2$;

2. The vectors in A are linearly independent, so the coefficients in the linear expansion are unique, while the vectors in B are not linearly independent, so a linear expansion is not unique;

3. Vectors in A are normalized. Vectors in B all have length $\sqrt{\tfrac{2}{3}}$;

4. Vectors in A are orthogonal. Vectors in B are not orthogonal;

5. Both A and B satisfy Parseval's identity for orthonormal bases:
\[ \|x\|^2 = \sum_{i=1}^{2} c_i^2 = \sum_{i=1}^{3} d_i^2, \]
where $c_i$ and $d_i$ are the coefficients of the expansion in the sequences A and B respectively;

6. The coefficients for a linear expansion of some vector $x$ in the sequence A can be computed easily using the dot product, as A is an orthonormal basis. One way to write $x$ as a linear combination of the vectors in B is to use the coefficients formed by taking dot products also, so even without being orthogonal or linearly independent, the set B maintains one of the extremely useful features of an orthonormal basis.
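Item 5 above is easy to verify numerically: viewing each system as a matrix $\Phi$ with the vectors as columns, both satisfy $\Phi\Phi^T=\mathrm{Id}$, which (as Proposition 4.27 below shows, with frame bound $1$) is exactly the Parseval property. A minimal numpy sketch (illustrative only):

```python
# Parseval check for the two systems A and B of the example above.
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0]]).T                       # columns (1,0), (0,1)

c = np.sqrt(2.0 / 3.0)
B = c * np.array([[1.0, 0.0],
                  [-0.5,  np.sqrt(3) / 2],
                  [-0.5, -np.sqrt(3) / 2]]).T      # three columns of length sqrt(2/3)

for Phi in (A, B):
    print(np.allclose(Phi @ Phi.T, np.eye(2)))     # True for both: Parseval frames

x = np.array([1.7, -0.3])
print(np.isclose(np.sum((B.T @ x) ** 2), np.sum(x ** 2)))   # sum d_i^2 = ||x||^2
```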

We will later see that both sequences A and B are examples of a particular type of frame, called a Parseval frame. Sequence B has many of the properties of an orthonormal basis. This is exactly the motivation behind the study of frames. In some situations, the properties of uniqueness of coefficients and orthogonality of vectors are not necessary. If you are sending your side of a phone conversation or a photo, what matters is quickly computing a working set of expansion coefficients, not whether those coefficients are unique. What if one of the coefficients representing a vector gets lost in transmission? That piece of information cannot be reconstructed. It is lost. Perhaps we would like our system to have some redundancy, so that if one piece gets lost, the information can be pieced together from what does get through.

4.8.3 Definition and Basic Facts

Definition 4.26. Let $V$ be a vector space. A countable family of elements $\{a_k\}_{k\in I}$ in $V$ is a frame for $V$ if there exist constants $A, B>0$ such that
\[ A\|x\|^2 \le \sum_{k\in I} |\langle x, a_k\rangle|^2 \le B\|x\|^2 \qquad \forall\, x\in V. \]
The numbers $A, B$ are called frame bounds. The optimal lower frame bound is the supremum over all lower frame bounds, and the optimal upper frame bound is the infimum over all upper frame bounds. When $A=B$ we say that the frame is a tight frame. When $A=B=1$, we call it a Parseval frame. We also say that we have a uniform frame when all vectors in it have equal norm.

A practical way to characterize frames in an equivalent manner is given by the following proposition.

Proposition 4.27. For a system of vectors $a_1,\dots,a_N$ in $\mathbb{K}^m$, the following three statements are equivalent characterizations of a tight frame with frame bound $A$:

i.) $\sum_{j=1}^{N} |\langle x,a_j\rangle|^2 = A\|x\|_2^2$ for all $x\in\mathbb{K}^m$;

ii.) $x = \frac{1}{A}\sum_{j=1}^{N} \langle x,a_j\rangle\, a_j$ for all $x\in\mathbb{K}^m$;

iii.) $\Phi\Phi^* = A\,\mathrm{Id}_m$, where $\Phi$ is the matrix with columns $a_1,\dots,a_N$.

Let $\{a_k\}_{k=1}^{N}$ be a sequence in $V$. By the Cauchy–Schwarz inequality we have
\[ \sum_{k=1}^{N} |\langle x,a_k\rangle|^2 \le \sum_{k=1}^{N} \|a_k\|^2\,\|x\|^2 \qquad \forall\, x\in V, \]

so $B=\sum_{k=1}^{N}\|a_k\|^2$ is always an upper frame bound; the quest is then to find a smaller upper bound than this one. Now, in order to find a lower bound, it is necessary that $\operatorname{span}\{a_k\}_{k=1}^{N}=V$. This condition is also sufficient.

Proposition 4.28. Let $F=\{a_k\}_{k=1}^{N}$ be a sequence in $V$. Then $F$ is a frame for $V$ if and only if $\operatorname{span} F = V$.

Proof. We can assume that not all $a_k$ are zero. We know that $B=\sum_{k=1}^{N}\|a_k\|^2$ is an upper bound. Now consider the continuous mapping
\[ \phi: V\to\mathbb{R}, \qquad \phi(x) = \sum_{k=1}^{N} |\langle x,a_k\rangle|^2. \]
The unit sphere in $V$ is compact, so we can find $y\in V$ with $\|y\|=1$ such that
\[ A := \sum_{k=1}^{N} |\langle y,a_k\rangle|^2 = \inf\Big\{ \sum_{k=1}^{N} |\langle x,a_k\rangle|^2 : x\in V,\ \|x\|=1 \Big\}. \]

Clearly $A>0$, because if $A=0$, $y$ would be orthogonal to all $a_k$, thus contradicting the fact that $V=\operatorname{span}\{a_k\}_{k=1}^{N}$. Now, given $x\in V$ with $x\neq 0$, we have
\[ \sum_{k=1}^{N} |\langle x,a_k\rangle|^2 = \|x\|^2 \sum_{k=1}^{N} \Big|\Big\langle \frac{x}{\|x\|},a_k\Big\rangle\Big|^2 \ge A\|x\|^2, \]
therefore $F$ is a frame. We now prove the converse. Assume that $\{a_k\}_{k=1}^{N}$ does not span $V$. Then there exists a nonzero vector $x$ in the orthogonal complement $W^\perp$ of the subspace $W=\operatorname{span}\{a_k\}_{k=1}^{N}$. Since $x$ is orthogonal to each $a_k$, the sum $\sum_{k=1}^{N}|\langle x,a_k\rangle|^2=0$, so the lower frame bound would not exist and this collection would not be a frame.

Remark 26. From Proposition 4.28 we see that a frame might contain more elements than needed to be a basis. In particular, if $\{a_k\}_{k=1}^{N}$ is a frame for $V$ and $\{g_k\}_{k=1}^{N}$ is an arbitrary finite collection of vectors in $V$, then $\{a_k\}_{k=1}^{N}\cup\{g_k\}_{k=1}^{N}$ is also a frame for $V$. A frame which is not a basis is said to be overcomplete or redundant.

We want to use frames to reconstruct vectors with redundancy. Proposition 4.27 tells us how to do this with tight frames. For general frames, we need the concept of dual frames, given by the following proposition.

Proposition 4.29. Let $\{a_k\}_{k=1}^{N}$ be a frame for $V$. Then there exists a frame $\{b_k\}_{k=1}^{N}$ such that every $x\in V$ can be reconstructed through the formula
\[ x = \sum_{i=1}^{N} \langle x, b_i\rangle\, a_i = \sum_{i=1}^{N} \langle x, a_i\rangle\, b_i. \]

N Remark 27. Such frame bk is called a dual frame. { }k=1 Proof. Let T : V Ck be defined by →

T x = ( x, a1 , x, a2 ,..., x, an ). h i h i h i

Note that T is linear and also injective, because T x = 0 implies x, ai = 0 i 1, . . . , k . Since N h i ∀ ∈ { ∗ } ak k=1 spans V , we conclude that x = 0. We also conclude that the operator given by S = T T : V V { } N k → is invertible. The operator S is called the frame operator for ak k=1. Now, for ei i=1 the canonical k Pk { } { } basis of C , we have T x = x, ai ei, and for each x V , i=1h i ∈ k ∗ X x, T ej = T x, ej = x, ai ei, ej = x, aj , h i h i h h i i h i i=1 ∗ thus T ej = aj. This implies, for every x V ∈ k ! k k ∗ ∗ X X ∗ X Sx = T T x = T x, ai ei = x, ai T ei = x, ai ai. h i h i h i i=1 i=1 i=1 −1 Define bi = S ai for i = 1, . . . k. Then, we have

k k k −1 −1 X X −1 X x = S Sx = S x, ai ai = x, ai S ai = x, ai bi. h i h i h i i=1 i=1 i=1 Now note that S (and S−1) are self-adjoint, as

k k k −1 X −1 X −1 X x = SS x = S x, ai ai = x, S ai ai = x, bi ai. h i h i h i i=1 i=1 i=1 N It remains to be proved that bk k=1 is a frame. Let A and B be the lower and upper frame bounds for N { } the frame ak . Then { }k=1

k k A 2 −1 2 X −1 2 X −1 2 −1 2 −1 2 2 x A S x S x, ai = x, S ai B S x B S x , S 2 || || ≤ || || ≤ |h i| |h i| ≤ || || ≤ || || || || || || i=1 i=1 which implies k 2 X 2 2 A˜ x x, bi B˜ x || || ≤ |h i| ≤ || || i=1 4.8. A GLIMPSE OF FRAME THEORY 71 with A A˜ = and B˜ = B S−1 2. S 2 || || || ||

Frames plays a fundamental role in the reconstruction of quantized vectors. In this context, the concept of Sobolev frames arises. See Chapter 8 of [Casazza & Kutyniok ’13] and references therein. Due to the Welch Bound, a set of vectors having the same angle between any two of them is very important. This motivates the next definition.

m×N Definition 4.30. A system of `2-normalized vectors a1,..., aN in K is called equiangular if there is a constant c 0 such that ≥

ai, aj = c i, j [N], i = j. |h i| ∀ ∈ 6 Example 4.31. The following examples are important in order to understand properties of equiangular tight frames.

1. An orthonormal basis is an equiangular tight frame for N = m;

2. A regular simplex is an equiangular tight frame for N = m + 1. For a simple construction of this example, take N 1 rows from an N N Discrete Fourier Transform matrix. The resulting columns, after being scaled− to have unit norm,× form an equiangular tight frame;

3. When m = 1, any unit norm frame amounts to a list of scalars of unit modulus, and such frames are necessarily equiangular and tight.

Remark 28. It is important to cite that in the literature equiangular tight frames also appear under the names Maximum Welch-Bound-Equality Sequences, optimal Grassmannian frames and two-uniform frames.

4.8.4 Constraints for Equiangular Tight Frames In the context of Compressive Sensing, we have the trade off between small coherence and the discrepancy between the number of rows and the number of columns, since we expect that, for m N matrices, N m. This is the same as saying that we expect to have much fewer measurements than× the ambient space. Therefore it is impossible to meet the Welch’s bound. Indeed, the next result shows that an equiangular set cannot be arbitrarily large.

m Theorem 4.32. The cardinality N of an equiangular set of `2-normalized vectors a1,..., aN in K satisfies

m(m + 1) N when K = R, ≤ 2

2 N m when K = C. ≤ If there is equality, then this set of vectors is also a tight frame.

Remark 29. Is important to note that it is a one way theorem, i.e. if there is equality, then we have a equiangular tight frame, but this is not a necessary condition. We can have equiangular tight frames without having N = m(m + 1)/2 or N = m2. This will be very significant in the subsequent discussion.

To prove this Theorem, we need the following Lemma. 72 CHAPTER 4. THE COHERENCE PROPERTY

Lemma 4.33. For any z C, the n n matrix ∈ × 1 z z . . . z z 1 z . . . z   ......  . . . . .   z . . . z 1 z z . . . z z 1 has 1 + (n 1)z as a single eigenvalue and 1 z as a multiple eigenvalue of multiplicity n 1. − − − Proof. Sum the elements in each line and note that the vector (1,..., 1) is an eigenvector for the eigenvalue 1 + (n 1)z. Now, subtracting from the first column each one of the others, we see that the (n 1) linearly− independent vectors (1, 1, 0,..., 0), (1, 0, 1, 0,..., 0),..., (1, 0,..., 0, 1) are eigenvectors− for the eigenvalue 1 z. − − − − Proof. (of Theorem 4.32): The main idea in this proof is to associate, to every ai in the equiangular set, a projection operator from Km to Km, which will be symmetric, in the real case, or hermitian, in the complex case. This association is a linear map, which preserves the inner product when using the Frobenius inner product in the space of linear maps. As a consequence of Lemma 4.33, the Gram matrix of these operators is invertible, hence they are linearly independent. In the real case, symmetric operators form a subspace of dimension m(m + 1)/2, and the result follows. In the complex case, hermitian operators do not form a subspace and the minimal subspace that contais them is the whole space, which has dimension m2. Therefore, in the complex case we need to consider the space of all operators on Cm. Let us define the orthogonal projectors P1,...,PN onto the lines spanned by a1, . . . , aN . These operators are:

m m Pi : K K ,Pi(v) = v, ai ai, → h i 2 ∗ m m m which verify Pi = Pi = Pi . Endowing (K , K ), the space of operators in K , with the Frobenius inner product B

∗ P,Q F = tr(PQ ), h i and using that the vectors ai are equiangular, we obtain that

m m m ∗ X X X 2 2 Pi,Pi F = tr(PiP ) = tr(Pi) = Pi(ek), ek = ek, ai ai, ek = ai, ek = ai = 1, h i i h i h ih i |h i| || || k=1 k=1 k=1

m m m ∗ X X X Pi,Pj F = tr(PiP ) = tr(PiPj) = PiPj(ek), ek = Pj(ek),Pi(ek) = ek, aj ek, ai aj, ai = h i j h i h i h ih ih i k=1 k=1 k=1

* m + X 2 2 ai, aj ai, ek ek, aj = ai, aj ai, aj = ai, aj = c , h i h i h ih i |h i| k=1 m where we used the canonical basis ei in K . The Gram matrix for these projectors is { }  1 c2 c2 . . . c2 c2 1 c2 . . . c2    ......   . . . . .  ,   c2 . . . c2 1 c2 c2 . . . c2 c2 1 4.8. A GLIMPSE OF FRAME THEORY 73 so as 0 c < 1, Lemma 4.33 implies that this Gram matrix is invertible (as all of its eigenvalues are ≤ positive) and this means that the system P1,...,PN is linearly independent. Thus the theorem follows. Now assume that equality holds. If we concatenate the identity operator to this system, this new one Idm,P1,...,PN , will be linearly dependent, then the determinant of its Gram matrix vanishes. This translates into

m 1 1 1 ... 1 2 2 2 1 1 c c . . . c 2 2 2 1 c 1 c . . . c

...... = 0......

1 c2 . . . c2 1 c2

1 c2 . . . c2 c2 1 After dividing the first row by m and substracting from all the other rows and then expanding the determinant with respect to the first column, we obtain an N 1 N 1 matrix such that − × − −1 2 −1 2 −1 2 −1 1 m c m c m . . . c m 2− −1 − −1 2 − −1 2 − −1 c m 1 m c m . . . c m − − − − ...... = 0

c2 m−1 . . . c2 m−1 1 m−1 c2 m−1 − − − − c2 m−1 . . . c2 m−1 c2 m−1 1 m−1 − − − − m Multiplying each of the entries by m−1 , which does not alter the determinant, since it is zero, we obtain

1 z z . . . z

z 1 z . . . z mc2 1 ...... = 0 where z := − m 1 z . . . z 1 z −

z . . . z z 1 As the determinant is zero, at least one the eigenvalues of this matrix must be zero. Since 1 z = m(1 c2)(m 1) = 0, Lemma 4.33 implies that 1 + (N 1)z = 0, which leads to − − − 6 − N m c2 = − . m(N 1) −

Then, the `2-normalized system a1, . . . , aN meets the Welch bound and therefore it is an equiangular tight frame.

Now we address the question of whether this upper bound is sharp. The answer is yes, as seen from the four examples given below (two in the real case and two in the complex case).

Example 4.34. (m = 3 and m(m + 1)/2 = 6) Let c = (√5 1)/2. The following vectors form a system of 6 equiangular vectors in R3: − (1, c, 0), (0, 1, c), (c, 0, 1), (1, c, 0), (0, 1, c), ( c, 0, 1). − − − Example 4.35. (m = 7 and m(m + 1)/2 = 28) Let the vectors obtained by unit cyclic shifts of the following four vectors:

(1, 1, 0, 1, 0, 0, 0), (1, 1, 0, 1, 0, 0, 0), (1, 1, 0, 1, 0, 0, 0), (1, 1, 0, 1, 0, 0, 0). − − − − They form an equiangular system of 28 vectors in R7. 74 CHAPTER 4. THE COHERENCE PROPERTY

Example 4.36. (m = 2 and m2 = 4) iπ p 2 Let c = e 4 2 √3. The following vectors form a system of 4 equiangular vectors in C : − (1, c), (c, 1), (1, c), ( c, 1). − − Example 4.37. (m = 3 and m2 = 9) i2π p 3 Let c = e 3 2 √3. The following vectors form a system of 9 equiangular vectors in C : −

( 2, 1, 1), (1, 2, 1), (1, 1, 2), ( 2, c, c2), (c2, 2, c), (c, c2, 2), ( 2, c2, c), (c, 2, c2), (c2, c, 2). − − − − − − − − − Remark 30. From now on we use the abbreviations RETF(m, N) for an equiangular tight frame with N vectors in Rm and CETF(m, N), for the analogous frame in Cm. The question of whether there exist equiangular frames in every dimension or if there is some kind of rigidity property about them, has been addressed in many articles. Zauner [Zauner ’99] made important numerical studies in this area. More information about these computational studies can be found at his website www.gerhardzauner.at/sicfiducialsd.html. [Holmes & Paulsen ’ 04] and [Sustik et al. ’07] are also good references on this topic. In particular, the following Theorem is important. Theorem 4.38. (Theorem A of [Sustik et al. ’07]): Suppose N = 2m and m > 3. The existence of a RETF(m, N) implies that 6 r r m(N 1) (N m)(N 1) − and − − , N m m − are both odd integers. In particular, N is an even number and if N = m(m+1)/2, then m+2 is necessarily the square of an odd integer. Furthermore, if there exists an RETF(m,2m), then m is an odd number and 2m 1 is the sum of two squares. − The proof of this theorem relies on field theory arguments as in Theorem A of [Sustik et al. ’07] and on strongly regular graphs arguments as in Corollary 5.8 of [Waldron ’09]. We will prove here, following [Rauhut & Foucart ’13], that “if N = m(m + 1)/2, then m+2 is necessarily the square of an odd integer" and the obvious consequence that “m is an odd number". The assertion “2m 1 is the sum of two squares" follows from a theorem of Euler which states that a natural number is the sum− of two squares if and only if each prime factor having the form 4k + 3 occurs in the prime factorization with an even power. More details can be found in [Sustik et al. ’07].

Proof. Let a1, . . . , aN be a system with N = m(m + 1)/2 equiangular `2-normalized vectors. From Theorem 4.23 we know that this system is a tight frame and by the equivalences in Proposition 4.27, the ∗ ∗ matrix A with columns a1, . . . , aN satisfies AA = λIdm for some λ > 0. As the matrix A A has the same nonzero eigenvalues as AA∗, then zero is an eigenvalue of multiplicity N m for A∗A. Moreover, ∗ − since A A is the Gram matrix of the vectors a1, . . . , aN , its diagonal entries are all equal to one, while 1 ∗ its off-diagonal entries have the same absolute value c. Then B = (A A IdN ) is c −   0 b1,2 . . . b1,N  .. .   b2,1 0 . .  B =   ,  . .. ..   . . . bN−1,N  bN,1 . . . bN,N−1 0 where bi,j = 1. This matrix has (λ 1)/c as an eigenvalue of multiplicity m and 1/c as an eigenvalue of multiplicity±N m. Since the characteristic− polynomial of a integer matrix has integer− coefficients, we can write it as − N X k PB(x) := βk( x) with βN = 1. − k=0 4.8. A GLIMPSE OF FRAME THEORY 75

Using Welch’s bound and N = m(m + 1)/2,, we have

s s r N m (m + 1)/2 1 m 1 1 c = − = − = − = , m(N 1) m(m + 1)/2 1 m2 + m 2 √m + 2 − − − 1 so = √m + 2. Then PB( 1/c) = PB( √m + 2) = 0, i.e., − c − − −     X k X k  β2k(m + 2)  + √m + 2  β2k+1(m + 2)  = 0. 0≤k≤N/2 0≤k≤(N−1/2)

We notice that these sums, now denoted by Σ1 and Σ2, are integers, due to the fact that βi are integer 2 2 coefficients of the characteristic polynomial. We then obtain the equality Σ1 = (m + 2)Σ2, which implies that m + 2 is a square, since any prime factor of m + 2 must necessarily appear an even number of times in its prime factorization. To finish the proof, it remains to be show that γ = √m + 2 is odd. Let JN×N be the matrix with all entries equal to one. Its kernel has dimension N 1 (since its rank is 1), then it intersects the (N m)- dimensional eigenspace of B corresponding to the− eigenvalue 1/c = n, since due to our hypothesis− m 3 implies that N = m(m + 1)/2 > m + 1, hence N 1− + N m− > N, resulting in a nontrivial intersection≥ because the sum of the dimensions of two subspaces− is greater− then the ambient space. So, with an argument analogous to the one in first part of the proof, the matrix C := (B Idn + JN×N )/2 has (n + 1)/2 as an eigenvalue. − − As the diagonal elements of C are zero, while its off-diagonal entries are all equal to one or zero, its characteristic polynomial can be written as

N X k PC(x) = ck( x) with cN = 1 and ci Z. − ∈ k=0

It vanishes at x = (n + 1)/2 and we can rewrite the equality PC( (n + 1)/2) = 0 as − − N−1 N X N−k k (n + 1) = 2 ck(n + 1) . − k=0

This shows that (n + 1)N is an even integer, hence so is n + 1 and we conclude that n = √m + 2 is an odd integer.

Now, we address some interesting questions concerning equiangular tight frames. [Fickus & Mixon ’15] gives an “account for recent and future developments in the construction or impossibility of ETFs in var- ious dimensions”. They remark that despite ETF having a functional nature (as an equality in the Welch’s bound, for example), every known way to construct infinite families of ETF is based on some kind of combinatorial techniques such as strongly regular graphs, difference sets and Steiner systems. The conference paper [Casazza, Redmond & Tremain ’08] also discusses the problem of frame classification. In the complex case, [Zauner ’99] made the following conjecture, that is still open today.

Open Problem: For every m 2, there exists CETF(m, m2). ≥ [Scott & Grassl ’10] proved that if m 17 or M 19, 24, 35, 48 then there exist CETF(m, m2) and provided numerical tests indicating that,≤ up to machine∈ { precision, there} are CETF(m, m2) for m 67, suggesting that Zauner’s conjecture should be true. If this conjecture were true, there would existm≤ m2 matrices with the lowest possible coherence, i.e. µ = 1/√m + 1. This would be of great importance× not only for sparse recovery but in many questions in Coding Theory. See [Bodmann & Kutyniok ’09] and references therein. 76 CHAPTER 4. THE COHERENCE PROPERTY

We now focus on the existence of matrices with small coherence. For example, [Temlyakov ’11] constructed p pk matrices (p > k being a prime number) with coherence bounded above by (k 1)/√p. We present here× a construction from [Strohmer & Heath ’03], based on ideas of [Alltop ’80], of an− explicit m m2 matrix with coherence equal to 1/√m. Note that this is the limit of the Welch bound when N × . → ∞ Proposition 4.39. For each prime number m 5, there is an explicit m m2 complex matrix with coherence µ = 1/√m ≥ ×

Proof. Let us identify the set [m] with Z/mZ = Zm and introduce, for k, ` Zm, the translation and m ∈ modulation operators Tk and Ml, defined, for z CZ and j Zm, by ∈ ∈ 2πi`j/m (Tkz)j = zj−k, (M`z)j = e zj.

m These operators are isometries of `2(Zm). Let x CZ , be the `2-normalized Alltop vector, with compo- nents ∈

1 2πij3/m xj = e , j Zm. √m ∈ 2 We claim that the m m matrix constructed with columns M`Tkx for k, ` Zm, i.e., the matrix × ∈

(M1T1x ... M1Tmx M2T1x ...... MmT1x ... MmTmx) | | | | | | | | has coherence µ = 1/√m. To prove this, we need to calculate the inner product between two columns. Taking indices (k, `) and the one indexed by (k0, `0), we have X M`Tkx, M`0 Tk0 x = (M`Tkx)j(M`0 Tk0 x) = h i j j∈Zm

X 2πi`j/m −2πi`0j/m 1 X 2πi(`−`0)j/m 2πi((j−k)3−(j−k0)3/m e xj−ke xj−k0 = e e . m j∈Zm j∈Zm Set a = ` ` and b = k k0, so that (a, b) = (0, 0). Change the summation index to h = j k0 and we obtain − − 6 −

1 2πiak0/m X 2πiah/m 2πi((h−b)3−h3)/m M`Tkx, M`0 Tk0 x = e e e |h i| m h∈Zm

1 X 2 2 3 1 X 2 2 = e2πiah/me2πi(−3bh +3b h−b )/m = e2πi(−3bh +(a+3b )h/m) . m m h∈Zm h∈Zm Now setting c = 3b and d = a + 3b2, we compute −

2 1 X 2πi(ch2+dh)/m X 2πi(ch02+dh0)/m 1 X 2πi(h−h0)(c(h+h0)+d)/m 0 0 M`Tkx, M` Tk x = 2 e e = 2 e |h i| m 0 m 0 h∈Zm h ∈Zm h,h ∈Zm ! 1 X 2πih00(c(h00+2h0)+d)/m 1 X 2πih00(ch00+d)/m X 4πich00h0/m = 2 e = 2 e e . m 00 m 00 0 h,h ∈Zm h ∈Zm h ∈Zm 00 The analysis of the last sum, for each h Zm, leads to ∈  00 X 00 0 m if 2ch = 0 mod m e4πich h /m = . 0 if 2ch0 = 0 mod m 0 h ∈Zm 6 We now study the two cases: 4.9. ANALYSIS OF ALGORITHMS 77

1. c = 0 mod m. From the definition of c, we have c = 3b and since 3 = 0 mod m, we must have b = 0 mod m. Hence, d = a + 3b2 = 0 mod m because− a = ` `0 = 06 mod m, so 6 − 6

1 X 2πidh0/m M`Tkx, M`0 Tk0 x = e = 0. |h i| m 00 h ∈Zm

2. c = 0 mod m. Since 2 = 0 mod m, the equality 2ch00 = 0 can only occur when h00 = 0 mod m, so6 that 6

1 M`Tkx, M`0 Tk0 x = . |h i| m Then we conclude that the coherence of this matrix is equal to µ = 1/√m.

Remark 31. A good reference concerning Frames is the homepage of the Frame Research Center at the University of Missouri: http: // www. framerc. org/ , where up-to-date references and information concerning frames can be found.

4.9 Analysis of Algorithms

In Chapter 3 we saw that to solve (P0) is equivalent to ensure the null space property for the matrix A. Also, it is difficult to determine if the matrix satisfies this property or not. Therefore, one of our quests is to find sufficient conditions that imply the Null Space Property. However, it is important to stress that the first guarantees for the working sparse recovery techniques, like Basis Pursuit, were discovered before NSP was invented. Using mutual coherence, for the two or- thogonal case, and using the spark, for the general case, many researchers were able to directly provide these guarantees. The coherence property, studied in this chapter, is one of the sufficient conditions we can use to guarantee the success of the algorithms described in Chapter 2. It was used before NSP, but we can restate the result in a modern language by using the Null Space Property. We will now prove that Basis Pursuit works under some conditions on the coherence, i.e., small coherence implies NSP of order s and, consequently, s-sparse recovery by Theorem 3.3. After, we comment similar results for greedy and threshold algorithms.

m×N Theorem 4.40. ([Tropp ’04]): Let A C be a matrix with `2-normalized columns. If ∈

µ1(s) + µ1(s 1) < 1, − then every s-sparse vector x CN is exactly recovered from the measurement vector y = Ax via Basis Pursuit. ∈

Proof. It is necessary and sufficient to prove that the matrix A satisfies the NSPs, which is

vS 1 < v 1 v kerA 0 and S [N] with S = s. || || || S|| ∀ ∈ \{ } ∀ ⊂ | | PN Let a1, . . . , an denote the columns of A. Then the condition v kerA is the same as vjaj = 0. ∈ j=1 What we have to do is take the inner product with ai and isolate the term vi to obtain

N X X X vi = vi ai, ai = vj aj, ai = v` a`, ai vj aj, ai . h i − h i − h i − h i j=1,j6=i `∈S j∈S,j6=i This implies 78 CHAPTER 4. THE COHERENCE PROPERTY

X X vi v` a`, ai + vj aj, ai . | | ≤ | ||h i| | ||h i| `∈S j∈S,j6=i Summing all over i and interchanging the two finite summations yields X X X X X vS 1 = vi v` a`, ai + vj aj, ai || || | | ≤ | | |h i| | | |h i| i∈S `∈S i∈S j∈S i∈S,i6=j X X v` µ1(s) + vj µ1(s 1) = µ1(s) v 1 + µ1(s 1) vS 1. ≤ | | | | − || S|| − || || `∈S j∈S Rearranging this last expression leads to

(1 µ1(s 1)) vS 1 µ1(s) v 1. − − || || ≤ || S||

And this implies the NSPs because µ1(s) < 1 µ1(s 1), that is − −

µ1(s) vS 1 < (1 µ1(s 1)) vS 1 µ1(s) v 1 = vS 1 < v 1 || || − − || || ≤ || S|| ⇒ || || || S||

We have similar results for the Orthogonal Matching Pursuit and for the Hard Thresholding Pursuit. Their proofs can be found at [Rauhut & Foucart ’13] on pages 123 and 127 respectively:

m×N Theorem 4.41. Let A C be a matrix with `2-normalized columns. If ∈

µ1(s) + µ1(s 1) < 1 − then every s-sparse vector x CN is exactly recovered from the measurement vector y = Ax after at most s iterations of Orthogonal∈ Matching Pursuit.

m×N Theorem 4.42. Let A C be a matrix with `2-normalized columns. If ∈

2µ1(s) + µ1(s 1) < 1 − then every s-sparse vector x CN is exactly recovered from the measurement vector y = Ax after at most s iterations of Hard Thresholding∈ Pursuit. We should mention that all the results derived in this chapter are the worst-case scenarios, implying that the kind of guarantees we obtain can be over-pessimistic, as they are supposed to hold for all signals, and for all possible supports of a given cardinality. Average-case analysis is also available and typically outstrips the worst-case behavior by a large margin. These analysis are available at [Tropp ’05] and [Schnass & Vandergheynst ’07].

4.10 The Quadratic Bottleneck

In Section 4.9 we proved that in linear systems formed by matrices having small coherence, unique sparse recovery can be performed. However, until now, we do not address the question about the relation between N, the ambient space dimension, and m, the number of measurements. Since we are trying to recover s-sparse vectors we ideally expected, in a first moment, to have a close relation between m and s, that is, it is presumed that the number of measurement can scale with the sparsity. Reformulating this reasoning, we expect that m = Cs, for some C > 0. Despite the fact that the coherence property is a sufficient condition the main algorithms of sparse recovery to work, we will see in this section that, when using it, we cannot guarantee a linear scale between m and s. On one hand, Theorem 4.41 and Theorem 4.40 state that the recovery of s-sparse 4.10. THE QUADRATIC BOTTLENECK 79 vectors is guaranteed via BP or OMP if (2s 1)µ < 1 is satisfied. On the other hand, we can choose a − matrix A Cm×N with small coherence µ c/√m (for instance, using Proposition 4.39). For matrices with coherence∈ that behaves this way, the condition≈ (2s 1)µ < 1 holds only if − m Cs2, ≥ for some C > 0. This estimate for the number of measurements is too pessimistic, especially if we are in a context of large s. It is not possible to overcome this estimate in the context of coherence and we will explain why. Instead of working with the condition (2s 1)µ < 1, let us use the more general condition stated in Theorem 4.41 and Theorem 4.40, −

µ1(s) + µ1(s 1) < 1. (4.19) − Suppose, by contradiction, that Equation (4.19) holds with m (2s 1)2 and say, s < √N 1. This choice of parameters was made only to facilitate calculations. Provided≤ that− N is large, say N −2m then ≥ we use Welch’s bound for the `1-coherence, Theorem 4.24, and obtain s s s N m N m N m 1 > µ1(s) + µ1(s 1) s − + (s 1) − = (2s 1) − − ≥ m(N 1) − m(N 1) − m(N 1) − − − s r 2(N m) N s − . ≥ m(N 1) ≥ N 1 − − This is clearly a contradiction. Therefore, it is not possible to combine coherence estimates with a linear scale between m and s. This is known as the quadratic bottleneck problem. The real reason behind it is related to Gershgorin’s Theorem. All known estimates of coherence require this theorem and therefore this bottleneck will always appear. In fact, the relation between Gershgorin’s Theorem and the estimation of the coherence is through ∗ the proof of Theorem 4.21. If one looks carefully, it states that the eigenvalues of the matrix ASAS lie in the interval [1 µ1(s 1), 1 + µ1(s 1)] and we know from Theorem 1.11 that sparse recovery is related − ∗ − − to invertibility of ASAS. Thus, despite the fact that coherence is a sufficient condition for sparse recovery and is easy to compute compared, e.g., to NSP, it gives us a poor scale between the sparsity and the number of measurements. Particularly, coherence is a local property, because it deals with only two vectors at the same time. In order to circumvent it, new ideas of sufficient conditions for sparse recovery are necessary. Notably, some approach that tries to handle all column vectors of the measurement matrix at the same time. From the development of a new path, we will see in Chapter 7 that it is possible to have an almost linear scale between m and s. 80 CHAPTER 4. THE COHERENCE PROPERTY Chapter 5

Restricted Isometry Property

Respect nature. There’s no guarantee she will respect you back. 66◦North1 advertising in Reykjavík, Iceland, 2011.

5.1 Introduction

From a practical point of view, coherence is a useful measure of how suitable the matrix A is for the problem of finding sparse solutions of linear systems. However we saw that the lower bound from The- orem 4.23 implies a quadractic bottleneck, which limits the performance of recovery for rather small sparsity levels, as the number of measurementsm is of the order of the square of the sparsity s. There can be no linear scale between sparsity and the number of measurements due to estimates relying in Gershgorin’s Theorem, as showed in Section 4.10. On a closer analysis, coherence only depends on pairs of columns. Maybe some properties which depend on larger sets of columns at the same time can be found to give us a finer criterion for sparse recovery. This was the inspiration of Candès and Tao when they defined the uniform uncertainty principle, nowadays known as restricted isometry property [Candès & Tao II ’06]. It became the widely used tool by the signal processing community to analyze the measurement performance of encoder/decoder pairs [Blanchard, Cartis & Tanner ’11]. This new property overcomes the limitation mentioned above and its advantage is twofold. First, up to a logarithmic factor, we can get an almost linear scaling between sparsity and the number of measurements. Besides, we will grasp the potential of probabilistic analysis in this framework through the use of powerful results as the concentration of measure inequality and Gordon’s Lemma. These results will help us to find families of good measurement matrices in the optimal regime. This notion of optimality will be made precise along the chapter. The purpose of this chapter is to study the Restricted Isometry Property. We will prove that is is a sufficient condition for the Null Space Property, and therefore for Basis Pursuit, by Theorem 3.3. Then, we will show how its introduction allows us to bypass the constrains which appear if we’re limited by coherence methods. Later, we will explore a deep result of [Bourgain, Dilworth, Ford, Konyagin & Kutzarova ’11] about breaking the quadratic bottleneck with deterministic matrices. We will also establish the success of sparse recovery under different reconstruction methods from some conditions on this property. More specifically, after the introduction of restricted isometry constants, we will show how we can use them to prove the suitability of thresholding and greedy methods for sparse recovery. We close the chapter with some limitations on this property and some attempts to fix it.

1One of the oldest Icelandic manufacturing companies: www.66north.is

81 82 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY

5.2 The RIP Constant and its Properties

As we described in the introduction, the idea is to create a finer property for the matrix recovery analysis, involving many columns at the same time. The restricted isometry constant of order s involves all s-tuples of columns simultaneously and measures how close to an isometry, for sparse vectors, the matrix is.

m×N Definition 5.1. The s-th restricted isometry constant δs = δs(A) of a matrix A C is the smallest δ 0 such that ∈ ≥ (1 δ) x 2 Ax 2 (1 + δ) x 2, (5.1) − || ||2 ≤ || ||2 ≤ || ||2 for all s-sparse vectors x CN . Equivalently, it is given by ∈ ∗ δs = max ASAS IdS 2→2. S⊂[N], #S≤s || − ||

∗ The idea behind this definition is that ASAS, a projection onto a given subspace of s-sparse vectors, will behave almost as an orthogonal transformation. In the literature, it is common to say “this matrix A satisfies RIP”. It means that δs(A) is small for reasonably large s. What is understood by large s and small δs will be made precise after some discussion. See discussion after Proposition 5.4. Proposition 5.2. The two definitions of the restricted isometry property are indeed equivalent.

Proof. One can start by noticing the equivalence between

2 2 2 N (1 δ) x 2 Ax 2 (1 + δ) x 2 x C s-sparse. − || || ≤ || || ≤ || || ∀ ∈ and

2 2 2 S ASx 2 x 2 δ x 2 S [N], #S s and x C . (5.2) ||| || − || || | ≤ || || ∀ ⊂ ≤ ∀ ∈ Then, x Cs, ∀ ∈ 2 2 ∗ ASx x = ASx, ASx x, x = (A AS Id)x, x . || ||2 − || ||2 h i − h i h S − i ∗ Now, the matrix A AS Id is hermitian. So we have S − ∗ (ASAS Id)x, x ∗ max h − 2 i = ASAS Id 2→2. x∈ s\{0} x || − || C || ||2 Due to (5.2), (5.1) is equivalent to

∗ max AsAs Id 2→2 δ. S⊂[N], #S≤s || − || ≤

Since δs is the smallest such δ and the 2-norm of a matrix is equal to its largest singular value, we have the equality.

This characterization will be useful when comparing the restricted isometry constants of a matrix and its coherence. First of all, note that the sequence of such constants is nondecreasing

δ1 δ2 δs δs+1 δN . ≤ ≤ · · · ≤ ≤ ≤ · · · ≤ This follows immediately from the definition of RIP, since every (s 1)-sparse vector is also an s-sparse vector. The definition of RIP, given above, can be exceedingly restrictive.− There is some asymmetry between the influence of the lower and the upper inequality in the context of sparse recovery. Therefore, we could define the asymmetrical RIP. 5.2. THE RIP CONSTANT AND ITS PROPERTIES 83

L L m×N Definition 5.3. The s-th lower restricted isometry constant δs = δs (A) of a matrix A C is the smallest δ 0 such that ∈ ≥ 2 2 (1 δ) x Ax x such that x 0 s. (5.3) − || ||2 ≤ || ||2 ∀ || || ≤ Also, the s-th upper restricted isometry constant δU = δU (A) is the smallest δ 0 such that s s ≥ 2 2 (1 + δ) x Ax x such that x 0 s (5.4) || ||2 ≥ || ||2 ∀ || || ≤ ∗ Both the smallest and largest eigenvalues of the Gram matrix ASAS affect the stability of the recon- struction algorithms. However, it is the smaller eigenvalue which allows to distinguish between sparse L vectors from their measurement by A. Clearly, δS < 1 implies that A injective and so it is a sufficient condition to ensure that no two s-sparse vectors have the same measurements. The asymmetric RIP also plays an important role in the context of high-dimensional statistics, for example in the study of LASSO and Dantzig Selector. For the recovery of sparse vectors from noisy measurements, the asymmetric RIP appears as a crucial generalization. It is connected to the so-called restricted eigenvalue conditions, see Section 5 of [Oliveira ’13], [Bickel, Ritov & Tsybakov ’09] for some discussion. Also, Figure 1 of [van de Geer & Buhlmann ’09] is especially instructive in order to under- stand the zoo of concepts related to RIP. The asymmetric definition is especially important in the context of Gaussian matrices. In Chapter 7 we will develop the connections between random matrices and Compressive Sensing. For now, we just need the following remark from [Blanchard, Cartis & Tanner ’11]: Remark 32. If A is a Gaussian matrix (see Definition 7.1) then A∗A follows a Wishart distribution. Thus, in order to understand RIP of Gaussian matrices, it is important to analyze the empirical distribu- tion of the eigenvalues from a Wishart matrix. If the expected value of the largest and the smallest eigen- U L values were asymmetric with respect to some value then the investigation of δs and δs separately could ∗ 2 give better results in RIP estimation. Figure 5.1 shows this asymmetry since Eλmax(ASAS) = (1 + √ρ) ∗ 2 and Eλmin(ASAS) = (1 √ρ) . See [Silverstein ’85] and [Geman ’80] for more details on these asymp- totic formulas. −

Figure 5.1: For a given m × N matrix A and S ⊂ [N] such that #S = s, this figure shows a plot of the expected values ∗ ∗ of the largest and smallest eigenvalues of a Wishart matrix AS AS . The blue curve represents Eλmax(AS AS ) whereas the ∗ red curve represents Eλmin(AS AS ). Note the asymmetry with respect to the constant line equal to 1.

Besides [Blanchard, Cartis & Tanner ’11], also the work [Bah & Tanner ’10] showed the advantages of the asymmetric definition for sharper results related to RIP for Gaussian matrices. 84 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY

In this work we are more interested in the signal processing problem than in the statistical problem. Paraphrasing [Oliveira ’13], in the former we think about the measurement vector as controlled by the experimenter whereas in the latter, the vectors are generated by a random process. Accordingly, our focus is on the most common definitions historically related to compressive sensing and used by the Signal Processing community: coherence and symmetrical RIP. Now, finally, the proposition which compares both of them.

Proposition 5.4. If a matrix A has `2-normalized columns a1, . . . , aN , then

δ1 = 0, δ2 = µ δs µ1(s 1) (s 1)µ for s 2, ≤ − ≤ − ≥ where µ and µ1(s 1) are the coherence and coherence function from Definition 4.15 and Definition 4.19 respectively. −

2 2 Proof. By hypothesis, the columns are `2-normalized, this means that Aej = ej for all j [N]. || ||2 || ||2 ∈ This is the same as δ1 = 0 as ej is 1-sparse. By the equivalence proved above, we have   ∗ ∗ 1 aj, ai δ2 = max Ai,jAi,j Id 2→2 with Ai,jAi,j = h i . 1≤i6=j≤N || − || ai, aj 1 h i ∗ The eigenvalues of the matrix A Ai,j Id are ai, aj and ai, aj , so its (2 2)-norm is ai, aj , i,j − |h i| −|h i| → |h i| and taking the maximum over 1 i = j N yields the equality δ2 = µ. For s > 2, we know from ≤ 26 ≤ 2 2 Theorem 4.21 that (1 µ1(s 1)) x 2 Ax 2 (1 + µ1(s 1)) x 2. Since δs is the smallest constant to attain this kind of− inequality,− the|| || theorem≤ || follows.|| ≤ − || ||

2 This theorem allows us to prove the existence of m m matrices with δs < 1 for s √m, since we know that matrices with coherence equal 1/√m exists.× This is just a simples calculation:≤ s δs µ1(s 1) (s 1)µ < sµ = 1 for all s √m. ≤ − ≤ − √m ≤ ≤ The main point about using random matrices is that we can establish a much better relation between s and m. In fact, we will show that, given δ < 1, there exist m N matrices with δs δ for s cm/ ln(eN/m) with c depending only on δ. Even more surprisingly, this× relation cannot be improved,≤ see≤ Chapter 8. Now we can finally say “How large s and how small δs should be”. Matrices that have the small restricted isometry constant of this optimal order are said to satisfy the restricted isometry property or just RIP. From now we will make no distinction between the words “property” and “constant”. The next simple proposition, taken from [Candès & Tao II ’06], is of great importance.

N Proposition 5.5. Let u, v C be vectors with u 0 s and v 0 t. If their supports do not intersect, i.e., supp(u) supp∈(v) = , then || || ≤ || || ≤ ∩ ∅

Au, Av δs+t u 2 v 2 |h i| ≤ || || || || Proof. Let S = supp(u) supp(v) and let uS and vS be the restrictions of u and v to S, respectively, so ∪ that Au = ASuS and Av = ASvS. Since supp(u) supp(v) = , uS, vS = 0, and this yields ∩ ∅ h i

∗ ∗ Au, Av = ASus,ASvs uS, vS = (A AS Id)uS, vS (A AS Id)uS 2 vS 2 |h i| |h i − h i| |h S − i| ≤ || S − || || || ∗ (A AS Id) 2 uS 2 vS 2. ≤ || S − || || || || || Now, observe that uS 2 = u 2 and vS 2 = v 2. Using the alternative definition for δs+t, the conclusion follows. || || || || || || || || 5.2. THE RIP CONSTANT AND ITS PROPERTIES 85

We define now a quantity that allows us to estimate RIP and prove some important inequalities.

Definition 5.6. The (s, t)-restricted orthogonality constant (or just ROC) θs,t = θs,t(A) of a matrix A Cm×N is the smallest θ 0 such that ∈ ≥

Au, Av θ u 2 v 2 |h i| ≤ || || || || for all disjointly supported s-sparse and t-sparse vectors u, v CN . ∈ Similarly to the case of the coherence and RIP constants, we can give an equivalent definition (which has the same proof of equivalence as above):

∗ θs,t = max A AS 2→2,S T = , card(S) s, card(T ) t {|| T || ∩ ∅ ≤ ≤ } Finally we can relate the two constants by the following Proposition.

Proposition 5.7. Restricted isometry contants and restricted orthogonality constants are related by the inequalities.

1 θs,t δs+t (sδs + tδt + 2√stθs,t). ≤ ≤ s + t The special case s = t gives

θs,s δ2s δs + θs,s. ≤ ≤ Proof. The first inequality follows from Proposition 5.5 and the definition of ROC. For the second one, N we need to show that given an (s + t)-sparse vector x C with x 2 = 1, we have ∈ || ||

2 2 1 Ax x (sδs + tδt + 2√stθs,t). || ||2 − || ||2 ≤ s + t In order to do this, let us separate the vector x into two disjointly supported vectors u and v such that u is s-sparse and v is t-sparse. Then

Ax 2 = A(u + v),A(u + v)) = Av 2 + Au 2 + 2Re Au, Av . || ||2 h i || ||2 || ||2 h i As u and v are disjointly supported, we have x 2 = u 2 + v 2 and so || ||2 || ||2 || ||2

Ax 2 x 2 Au 2 u 2 + Av 2 v 2 + 2 Au, Av || ||2 − || ||2 ≤ || ||2 − || ||2 || ||2 − || ||2 |h i| 2 2 2 δs u + δt v + 2θs,t u 2 v 2 = f( u ), ≤ || ||2 || ||2 || || || || || ||2 where we have, for k [0, 1] ∈ p f(k) = δsk + δt(1 k) + 2θs,t k(1 k). − − Taking the derivative and equating it to zero, we have

0 −1/2 f (k) = δs δt + θs,t(1 2k) k(1 k) = 0. − − − Therefore, the critical points are the solutions of the equation

1/2 (δs δt) k(1 k) + θs,t(1 2k) = 0. − − − Making the substitution 1 2k = A, yields −

(δs δt) 2 2 1/2 2 − (1 A )(1 + A ) + θs,tA = 0. 2 − 86 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY

After solving this equation for A and substituting for the original variable k, we discover that one of the roots is

∗ 1 1 2 2−1/2 k = (δs δt) 4θ + (δs δt) . 2 − 2 − s,t − It can be proved, after simple calculations, that it lies in [0, 1]. Besides, the function f(k) is nondecreasing on [0, k∗] and nonincreasing on [k∗, 1]. Depending on the location of k∗ with respect to s/(s + t), the function f is either nondecreasing on [0, s/(s + t)] or nonincreasing on [s/(s + t), 1]. There is freedom to 2 choose u. So, without loss of generality, we will assume that u 2 is in one of these intervals. Taking u having the s smallest absolute entries of x and v having the|| t||largest absolute entries of x, we have ui vj i, j, thus ≤ ∀ u 2 v 2 1 u 2 s || ||2 || ||2 = − || ||2 and then u 2 . s ≤ t t || ||2 ≤ s + t In the other case, where u belongs to the other interval and u had the s largest absolute entries of x, then we would likewise have u 2 s/(s + t). We conclude that || ||2 ≥   2 2 2 s s t √st Ax x = f( u ) f = δs + δt + 2θs,t . || ||2 − || ||2 || ||2 ≤ s + t s + t s + t s + t

Sometimes, when proving convergence of the algorithms, some sufficient conditions based on δks (for very large k) can be found. For example, in Table 5.5 below, we will see an example that needs δ8s < 1, provided by [Zhou, Kong & Xiu ’13]. So it is important to know how to control the restricted isometry constants and restricted orthogonality constant of high order by those of lower order.

Proposition 5.8. For integers r, s, t 1 with t s, ≥ ≥ r t t d d θt,r θs,r and δt − δ2s + δs, ≤ s ≤ s s where d = gcd(s, t), where gcd denotes the great common divisor. The special case t = cs leads to

δcs cδ2s ≤ Proof. From the definitions of these constants, we need to show that, given a t-sparse vector u and a r-sparse vector v with disjoin support, we have r t Au, Av θs,r u 2 v 2 (5.5) |h i| ≤ s || || || ||   2 2 t d d 2 Au u − δ2s + δs u (5.6) || ||2 − || ||2 ≤ s s || ||2 As d is the great common divisor of s and t, we can write s = kd and t = nd for some integers k, n. Denoting the support of u by T = j1, j2, . . . , jt , we will partition it in n subsets S1,S2,...,Sn T of size s defined by { } ⊂

Si = j , j , . . . , j { (i−1)d+1 (i−1)d+2 (i−1)d+s} with indexes meant modulo t. In this trick partition, each j T belongs to exactly s/d = k sets Si, which gives ∈ n n 1 X 2 1 X 2 u = uS and u = uS k i || ||2 k || i ||2 i=1 i=1 5.2. THE RIP CONSTANT AND ITS PROPERTIES 87

Let us see through an example why this partitioning makes sense. Suppose that t = 10 = 2 5 and × s = 8 = 2 4, with d = gcd(10, 8) = 2. So, we have T = j1, j2, . . . , j10 , which we will partition into 5 sets with 8×elements: { }

S1 = j1, . . . , j8 , { } S2 = j2+1, . . . , j2+8 = j3, . . . , j10 , { } { } S3 = j2×2+1, . . . , j2×2+8 = j5, . . . , j10, j1, j2 , { } { } S4 = j3×2+1, . . . , j3×2+8 = j7, . . . , j10, j1, . . . , j4 , { } { } S5 = j4×2+1, . . . , j4×2+8 = j9, j10, j1, . . . , j6 . { } { }

Now it is easy to verify that each ji is contained in only 4 of these 5 sets and that the decomposition of u made above, dividing by k to take repetition into account, holds.

Back to the proof, to prove (5.5) we do the following calculations

n n 1 X 1 X Au, Av AuS , Av θs,r uS 2 v 2 |h i| ≤ k |h i i| ≤ k || i || || || i=1 i=1

n !1/2 r s √n X 2 n t/d θs,r uS v 2 = θs,r u 2 v 2 = θs,r u 2 v 2. ≤ k || i ||2 || || k || || || || s/d|| || || || i=1 And inequality (5.6) follows from

n n 2 2 ∗ 1 X X ∗ Au u = (A A Id)u, u (A A Id)uS , uS ||| ||2 − || ||2| |h − i| ≤ k2 |h − i j i| i=1 j=1

 n  1 X ∗ X ∗ =  (AS ∪S ASi∪Sj Id)uSi , uSj + (AS ASi Id)uSi , uSi  k2 |h i j − i| |h i − i| 1≤i6=j≤n i=1

 n  n !2 n 1 X X 2 δ2s X δ2s δs X 2  δ2s uSi 2 uSj 2 + δs uSi 2 = uSi 2 − uSi 2 ≤ k2 || || || || || || k2 || || − k2 || || 1≤i6=j≤n i=1 i=1 i=1

  n   δ2sn δ2s δs X 2 n 1 2 − uS = δ2s (δ2s δs) u ≤ k2 − k2 || i ||2 k − k − || ||2 i=1     t 1 2 t d d 2 = δ2s (δ2s δs) u = − δ2s + δs u . s − k − || ||2 s s || ||2

As in the coherence case, it is important to know how the scaling between the number of measurements p m and the sparsity s affects RIP. We will prove now that δs c s/m. For s = 2, this can be interpreted ≥ as δ2 = µ c/˜ √m, which reminds us of the Welch bound, Theorem 4.23. ≥ m×N Theorem 5.9. For A C and 2 s N, there exists constants c, C and δ∗ such that for N C, ∈ ≤ ≤ ≥ δs δ∗ one has ≤ s m c 2 ≥ δs

For instance, we could choose c = 1/162, C = 30 and δ∗ = 2/3. 88 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY

Proof. We start by noticing that the theorem cannot be valid for s = 1, as δ1 = 0 if all the columns of A have `2-norm equal to 1. If we set t = s/2 1, then we will decompose the matrix A into blocks of size m t (the last block could have lessb columns)c ≥ ×

A = [ A1 A2 ... An ],N nt. | | | ≤ From Proposition 5.2, we can use the alternative definition of the restricted isometry constant and also the definition of the restricted orthogonality constant. So we have, for all i, j [n], i = j, that ∈ 6 ∗ ∗ A Ai Id 2→2 δt δs and A Aj 2→2 θt,t δ2t δs. || i − || ≤ ≤ || i || ≤ ≤ ≤ ∗ ∗ Then we conclude that the eingenvalues of Ai Ai and the singular values of Ai Aj satisfy

∗ 1 δs λk(A Ai) 1 + δs, − ≤ i ≤ ∗ m×m ∗ ∗ N×N Now, we define the matrices H = AA C and G = A A = [Ai Aj]1≤i,j≤n C . First we deduce the estimate ∈ ∈

n n t X ∗ X X ∗ tr(H) = tr(G) = tr(A Ai) = λk(A Ai) nt(1 δs). (5.7) i i ≥ − i=1 i=1 k=1 ∗ Then, using the Frobenius inner product M1,M2 F = tr(M M1), we deduce h i 2

2 2 2 2 ∗ ∗ ∗ ∗ ∗ tr (H) = Idm,H Id H = m tr(H H) = m tr(AA AA ) = m tr(A AA A) = h iF ≤ || ||F || ||F n  m  " t n t # ∗ X X ∗ ∗ X X ∗ 2 X X ∗ 2 m tr(GG ) = m tr Ai AjAj Ai = m σk(Ai Aj) + λk(Ai Ai) i=1 j=1 1≤i6=j≤n k=1 i=1 k=1  2 2  2 2 m n(n 1)tδ + nt(1 + δs) = mnt (n 1)δ + (1 + δs) . (5.8) ≤ − s − s From (5.7) and (5.8), we derive

2 nt(1 δs) m 2 − 2 . ≥ (n 1)δ + (1 + δs) − s 2 2 If (n 1)δ < (1 + δs) /5, we would obtain, using δs 2/3, − s ≤ 2 2 nt(1 δs) 5(1 δs) 1 m > − 2 − 2 N N. 6(1 + δs )/5 ≥ 6(1 + δs) ≥ 30 2 2 This is a contradiction since we assumed that N 30m. Therefore we have (n 1)δ (1 + δs) /5. ≥ − s ≥ Hence, using again δs 2/3 and s 3t, we conclude ≤ ≤ 2 nt(1 δs) 1 t 1 s s m − = c . ≥ 6(n 1)δ2 ≥ 54 δ2 ≥ 162 δ2 δ2 − s s s s

5.3 The Quadratic Bottleneck Reloaded

Theorem 5.9 helps us estimate how small the restricted isometry constant can be. Even more, we can compare the scaling between s, the sparsity of the signals we want to recover, and m, the number of measurements which guarantees the smallness of δs. We will see that there is a great difference between the two sufficient conditions we are working in this dissertation: coherence and RIP. The former provides a quadratic scale between m and s while the latter provides, using probabilistic techniques, an almost linear scale. Also, until recently, the typical technique 5.3. THE QUADRATIC BOTTLENECK RELOADED 89 to estimate RIP was through coherence and then the phenomenon of a bottleneck appears again. Let us see precisely what all of this means. Theorem 5.9 provides a linear scaling between s and m

m Cδ−2 s ≥ s On the other hand, if we choose a matrix A with an optimal coherence µ = C/√m then Theorem 5.4 implies that δs (s 1)µ cs/√m, which leads to ≤ − ≤ s r s C1 δs C2 , (5.9) √m ≥ ≥ m for some universal constants C1 and C2. Note that the right-hand side is valid for all matrices. Meanwhile, for the left-hand side we know how to prove that an explicit family of matrices indeed attain this optimal order of coherence and, consequently, this scale for RIP. 2 Then, the quadratic scale m Cs is a sufficient condition for δs to be small. Also, the right side of the inequality (5.9) shows that m≥ Cs˜ is a necessary condition. There is a significant gap between both of them and up to this point in this≥ dissertation we are unable to show if this second condition is also sufficient for small RIP constant or not. In Chapter 7 we will exhibit certain random matrices satisfying δs δ with high probability provided ≤ m Cδ−2s ln(eN/s). (5.10) ≥ This will prove that there exists some (actually, many!) matrices A with small RIP in an asymptotic regime much closer to linear than quadratic. Additionally, in Chapter 8 we will prove that δs δ requires ≤ that m Cδs ln(eN/s) for a certain constant Cδ depending on δ. In the literature it is usually said that in order≥ to obtain small RIP constant, linear scale is optimal up to logarithm factors. All the constructions of matrices in the optimal regime involve powerful probabilistic arguments like concentration of measure inequalities. Despite this, it is common in applications to have fixed and deterministic sensing matrices. So we have a problem of constructing (or of guaranteeing that) deterministic matrices satisfying RIP in the optimal regime. On the left hand side of (5.9), setting −1 1/2 δs = δ∗, we have s C1 δ∗√m which could be reformulated as s = Ω(m ). This is the regime in which we know how to guarantee≥ small RIP constant through coherence techniques. However, ideally we would like to improve this and increase the exponent from 1/2 to 1/2 + ε, i.e. break the bottleneck. Then, it will be great to make ε 1/2, and provide small RIP in a suitable scale for applications. This will shows the existence of deterministic→ matrices with RIP property in an almost linear regime. Summarizing this, and following [Mixon ’15], we provide the following definition.

Definition 5.10. For any z > 0, ExRIP[z] denote the following statement: There exists an explicit family of M N matrices with arbitrarily large aspect ratio N/M which have restricted isometry property of order s with× constant δ and s = Ω(M z−ε) for all ε > 0 and δ < 1/3.2 The main point is that with the aid of coherence, it is straightforward to prove ExRIP[1/2]. One can use the construction of Proposition 4.39, for example, and the argument above which estimate RIP through coherence. The latter, in its turn, is estimated through Gershgorin’s Theorem. This was the only technique to estimate RIP until 2011. The next step is to prove ExRIP[1/2+ε0] for some ε0 > 0. That is, we need to construct of an explicit 1/2−ε family of matrices Ai such that for C > 0, we have si > Cmi for ε > 0. Here si denotes the sparsity level and mi the number of rows of matrix Ai. This important problem was solved by [Bourgain, Dilworth, Ford, Konyagin & Kutzarova ’11]. The authors created a new technique bypassing Gershgorin’s Thereom through additive combinatorics argu- ments to construct a family of matrices in this regime. This will be the subject of the next section. Despite this major theoretical breakthrough, this is the only known explicit construction which breaks

2 In section 5.5 the number 1/3 will become clear. Essentially, we can guarantee that with δs < 1/3, basis pursuit works to recover sparse vectors. This was proved by [Cai & Zhang ’13]. 90 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY the bottleneck. Besides, for practical purpose, this constant is not useful and does not help im improving computational time, in finding a better measurement scheme or in a suitable scale for practical purposes. −28 In [Mixon’s Blog - 12/02/2013], this constant was estimated to be ε0 5.5169 10 . Recently, in four 3 ≈ × −24 blog posts, Dustin Mixon and Afonso Bandeira optimized the constant to ε0 4.4466 10 . It is a major problem to go further and prove this in the optimal (linear up to≈ logarithm× factor) regime:

Open Problem: Construct deterministic (or constructible in polynomial time) matrices with δ δ in s . the optimal regime m Cδ−2s ln(eN/s) ≤ ≥ Therefore, until now there is no way to guarantee that deterministic families of matrices coming from applications have small RIP constant in the optimal regime. This makes it impossible to use RIP as a sufficient condition for compressive sensing with deterministic matrices. In spite of the lack of guarantees, applied researchers continue to prefer deterministic matrices as sensing matrices due to well known constructions.

5.4 Breaking the Bottleneck

In this section we briefly describe the problem of estimating RIP via coherence and provide a big picture of the techniques used by Bourgain et al. [Bourgain, Dilworth, Ford, Konyagin & Kutzarova ’11], now known as BDFKK restricted isometry machine [Mixon ’15]. Let S be a set with #S = s. For a matrix A satisfying RIP with δs constant, the eigenvalues of ∗ ASAS lie in [1 δs, 1 + δs]. Therefore we can prove that a matrix has small RIP constant by estimate its eigenvalues. From− Theorem 4.21, we known that a matriz A always satisfies RIP with constant (s 1)µ, where µ is it coherence. But the Welch’s bound, Theorem 4.23, says that coherence cannot be too− small. For N cM, µ Ω(M −1/2) and then, for δ < 1/3 as in Definition 5.10, we have s = O(M 1/2). This was a variation≥ of≥ what we presented on the last section and it is the standard technique to prove RIP. One should consult [DeVore ’07] and [Applebaum, Howard, Searle & Calderbank ’09] for examples of this kind of construction. The problem with this technique is that it relies on Gershgorin Theorem. When estimating the ∗ eigenvalues of the Gram matrix ASAS, this theorem only takes into account their magnitude. Bourgain et al realized that their signs should also be considered and then some more sophisticated combinatorial techniques must be used in order to “cancelate” these signs. So, the initial idea was to convert the RIP statement, about all s-sparse vectors simultaneously, into a statement about finitely many vectors. Toward this, they introduced the following definition. m×N Definition 5.11. Let A = [ a1 ... aN ] C be a matrix with ai i∈[N] being its column vectors. | | ∈ { } We say that A satisfies θs-flat restricted isometry property (or flat-RIP of order s and constant θ) if for every disjoint I,J [N] with #I, #J s ⊆ ≤   X X p ai, aj θ #I#J ≤ i∈I j∈J

Note that A has flat-RIP by taking u and v to be the characteristic function χI and χJ into the Definition 5.6. So it is also called the flat-ROC property, after [Bandeira et al. ’13]. With this definition in hands we can change the estimation of RIP by the estimation of flat-RIP with the following Theorem:

Theorem 5.12. If A has the flat restricted isometry property with constant θs and has unit-norm columns then A has RIP (for s-sparse vectors) with constant 150θ log s. This theorem was proved in [Bourgain, Dilworth, Ford, Konyagin & Kutzarova ’11] for s 210 and it was stated there not as here, but in a similar way. The proof for all s as well as the statement≥ like the one the give here appeared in [Bandeira et al. ’13]. The former also defined the following:

3The progress of this constant optimization was described in the Math Research Wiki [Wiki - Deterministic RIP Matrices]. 5.4. BREAKING THE BOTTLENECK 91

m×N 0 Definition 5.13. We say that A = [ a1 ... aN ] C satisfies the θs-weak flat RIP if for every disjoint I,J [N] with #I, #J s, | | ∈ ⊆ ≤   X X 0 ai, aj θ s. ≤ i∈I j∈J And so the following connection between weak flat-RIP and flat-RIP can be proved:

Lemma 5.14. (Essentially Lemma 1 from [Bourgain, Dilworth, Ford, Konyagin & Kutzarova ’11]): If A 0 satisfies the θs-weak flat RIP and has coherence µ 1/s, then A has flat RIP of order s with constant √θ0. ≤ Proof. By triangle inequality, we have   X X X X #I#J ai, aj ai, aj µ#I#J . ≤ |h i| ≤ ≤ s i∈I j∈J i∈I j∈J Also, A satisfies the weak flat RIP, so   X X 0 p 0 ai, aj min θ s, #I#J/s θ #I#J. ≤ { } ≤ i∈I j∈J

The following lemma, presented here without proof, is of fundamental importance in order to break the bottleneck.

Lemma 5.15. (Essentially Lemma 3 from [Bourgain, Dilworth, Ford, Konyagin & Kutzarova ’11]): If A satisfies RIP of order s with constant δs then A satisfies RIP of order αs with constant 2αδs for all α 1. ≥ These three results together allow us to convert results with modest s and tiny δs into large s and modest δs. To see how, we will denote a matrix which satisfies RIP of order s with constant δs by (s, δs)-RIP, one which satisfies flat-RIP of order s with constant δs by (s, δs)-fRIP and one which satisfies weak flat-RIP by (s, δs)-wfRIP. Given this notation, with the aid of Theorem 5.12 and Lemmas 5.14 and 5.15, the following chain of implications can be proved:

2 (2, [δs/(2s150 log s)] )-wfRIP = (2, δs/(2s150 log s))-fRIP = (2, δs/s)-RIP = (s, δs)-RIP. ⇒ ⇒ ⇒ Due to the combinatorial nature of RIP, it is better to get results for 2-sparse vectors then for s-sparse vectors for large s. Instead of constructing matrices with a certain RIP, they construct matrices with appropriate weak flat-RIP. For these, the sign cancellation can be performed. The columns of these matrices where based on complex exponentials knows as chirps.

Definition 5.16. Let p be a prime number and Fp be the field of size p. We define the chirp as a complex exponential of the form

1 2πi(ax2+bx)/p ua,b = (e )x∈ . √p Fp These exponentials were used as entries of the Gram matrix A∗A because their inner product has some interesting properties. If a1 = a2, then ua1,b1 , u12,b2 = 1 if b1 = b2 and 0 if b1 = b2. Otherwise, the inner product is h i 6 2  1 X 2πi (a1−a2)x +(b1−b2)x /p ua ,b , u1 ,b = e . h 1 1 2 2 i p x∈Fp 92 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY

This expression can be manipulated with the aid of Legendre symbols4 and then it is possible to perform the cancellation in the signs and, consequently, improve the estimates of RIP. Since we want these ∗ expressions in the Gram matrix A A, the columns of A must be ua,b for some well-designed { }(a,b)∈A×B sets , Fp. Bourgain et at. constructed two set and with low additive energy, in the sense of additiveA B combinatorics ⊆ [Mixon ’15]. This constructionA is the highlyB technical part of the paper. After defining such set, it was proved that for sufficiently large p, the p #( )#( ) matrix with 1/2+ε0−ε × A B columns ua,b, for a and b satisfies (p , δ)-RIP for any ε > 0 and δ < √2 1, thereby ∈ A ∈ B − implying ExRIP[1/2 + ε0]. This leads to the natural question of how much these techniques could be improved in order to give 1/2 + ε0 1/2 + 1/2. →

5.5 Analysis of Algorithms

In Chapter 4 we analyzed Basis Pursuit and proved that when coherence is small, the algorithm works and recovers all s-sparse solutions of the linear system. In this section we provide two analogous results for the RIP. We provide two results. The first one is simple and natural while the second is more sophisticated and involved. The philosophy behind both of them is the same: when RIP is smaller than some constant, then the algorithm must work.

Theorem 5.17. Suppose that the 2s-th restricted isometry constant of the matrix A Cm×N satisfies ∈ 1 δ2s < . 3 Then every s-sparse vector x CN is the unique solution of ∈

min z 1 subject to Az = Ax. N z∈C || || The following Lemma will be used many times in our proof.

Lemma 5.18. Given q > p > 0, if u Cs and v Ct satisfy ∈ ∈

max ui min vj , i∈[s] | | ≤ j∈[t] | | then

s1/q u q v p. || || ≤ t1/p || || The special case p = 1, q = 2 and t = s leads to

1 u 2 v 1. || || ≤ √s|| || Proof. Notice that

" s #1/q " t #1/p u q 1 X q 1 X p v p || || = ui max ui min vj vj || || . s1/q s | | ≤ i∈[s] | | ≤ j∈[t] | | ≤ t | | ≤ t1/p i=1 j=1

4The Legendre symbol is a multiplicative function in Number Theory with values 1, −1, 0 that is a quadratic character modulo a prime number p: its value on a (nonzero) quadratic residue mod p is 1 and on a non-quadratic residue (non-residue) is −1. Its value on zero is 0. 5.5. ANALYSIS OF ALGORITHMS 93

Proof. (of Theorem 5.17). By Theorem 3.3, it is enough to prove that the matrix A satisfies the null space property of order s, that is 1 vS 1 < v 1 v ker A 0 and all S [N] with #(S) = s. || || 2|| || ∀ ∈ \{ } ⊂ In fact, we will prove a stronger statement ρ vS 2 v 1 v ker A and all S [N] with #(S) = s, || || ≤ 2√s|| || ∀ ∈ ⊂

2δ2s where ρ = satisfies ρ < 1 whenever δ2s < 1/3. It is clear that we need to consider just the index 1−δ2s set S := S0 of the s largest absolute entries of v. So let us partition the complement S0 of S0 in [N] as S0 = S1 S2 ... where ∪ ∪

S1 is the index set of the s largest absolute entries of v in S0 • S2 is the index set of the s largest absolute entries of v in S0 S1 • ∪ P  and so on. As v ker A, then Av = 0 and so A i=0 vSi = 0 which implies A(vS0 ) = A( vS1 vS2 ... ). This leads to ∈ − − −

2 1 2 1 vS0 2 A(vS0 ) 2 = A(vS0 ),A(vS0 ) || || ≤ 1 δ2s || || 1 δ2s h i − − 1 1 X = A(vS0 ),A( vS1 ) + A( vS2 ) + ... = A(vS0 ),A( vSk ) . (5.11) 1 δ2s h − − i 1 δ2s h − i − − k≥1 From Proposition 5.5, we obtain

A(vS ),A( vS ) δ2s vS 2 vS 2. h 0 − k i ≤ || 0 || || k || Putting this into (5.11) we obtain

δ2s X X ρ vS0 2 vSk 2 = vSk 2. || || ≤ 1 δ2s || || 2|| || − k≥1 k≥1

We defined vS in a decreasing way, Hence, for k 1, the s absolute entries of vS do not exceed any of k ≥ k the s absolute entries of vSk−1 and Lemma 5.18 yields 1 vS 2 vS 1, || k || ≤ √s|| k−1 || which gives ρ X ρ vS 2 vS 1 v 1. || 0 || ≤ 2√s || k−1 || ≤ 2√s|| || k≥1

This implies NSP for matrix A and concludes the proof that basis pursuit works under δ2s < 1/3.

In (5.11), we interpreted the vector vS0 as being 2s-sparse. In fact, it is an s-sparse vector. So a better 2 2 bound vS A(vS ) /(1 δs) can be used. This will turn into a sufficient condition based on δs || 0 ||2 ≤ || 0 ||2 − instead of δ2s. There are many such sufficient conditions involving the constants δs. One can cite, for example, the works of [Cai, Wang & Xu I ’10] and [Cai & Zhang ’13]. The first proved that basis pursuit performs sparse recovery if δs < 0.307. The latter, that basis pursuit works under δs < 1/3. This is the best known result involving δs. In any case, the condition based on δ2s is more natural since it is known, after [Candès & Tao II ’06], that an algorithm recovering all s-sparse vectors x from the measurements y = Ax exists if and only if 94 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY

Estimate Numerical Approx. Paper δ2s + δ3s < 1 ——– [Candès & Tao II ’06] δ3s + 3δ4s < 2 ——– [Candès, Romberg & Tao II ’06] δ2s < √2 1 0.4142 [Candès ’08] − δ2s < 2(3 √2)/7 0.4531 [Foucart & Lai ’10] − δ2s < 3/(4 + √6) 0.4651 [Foucart ’10] δ2s < 1/(1 + √1.25) 0.4721 [Cai, Wang & Xu I ’10] 5 δ2s < 4/(6 + √6) 0.4734 [Foucart II ’10] δ2s < 0.4931 0.4931 [Mo & Li ’11] δ2s < 1/2 0.5000 [Cai & Zhang ’13] 6 δ2s < 3/2 (1 + √41)/8 0.5746 [Zhou, Kong & Xiu ’13] − δ2s < 4/√41 0.6246 [Andersson & Strömberg ’14]

Table 5.1: Historical Improvements on RIP Bounds

δ2s < 1. In table 5.5 we present the historical improvements in bounding RIP. We will now prove the last result from this list. This is the best known bound on δ2s so far. Theorem 5.19. ([Andersson & Strömberg ’14])7: Suppose that the 2s-th restricted isometry constant of the matrix A Cm×N satisfies ∈ 4 δ2s < 0.6246. (5.12) √41 ≈ N m Then, for any x C and y C with Ax y 2 η, a solution x˜ of ∈ ∈ || − || ≤

min z 1 subject to Az y 2 η, N z∈C || || || − || ≤ approximates the vector x with errors

C x x˜ 1 Cσs(x)1 + D√sη and x x˜ 2 σs(x)1 + Dη, || − || ≤ || − || ≤ √s where the constants C, D > 0 depend only on δ2s. This theorem is not only the best known bound on RIP but it also incorporates stability and robustness for basis pursuit. From Theorem 3.18, we just need to prove that the matrix A satisfies NSP2,ρ,τ . Precisely, we are going to prove the following.

Theorem 5.20. If the 2s-th restricted isometry constant of A Cm×N obeys (5.12), then the matrix A ∈ satisfies the `2-robust null space property of order s with constants 0 < ρ < 1 and τ > 0 depending only on δ2s. For this, we need a lemma called square root lifting inequality. It is a kind of a counterpart of the s inequality v 1 √s v 2 for v C . It is important to note that this inequality, together with shifting inequality [Cai,|| || Wang≤ || &|| Xu I ’10],∈ are the main techniques used in all the papers in Table 5.5 to improve δ2s. The proof of both results can be found in [Cai, Wang & Xu I ’10] and [Cai, Wang & Xu II ’10].

Lemma 5.21. ([Cai, Wang & Xu I ’10]): For a1 a2 as 0, ≥ ≥ · · · ≥ ≥ q 2 2 a1 + + as √s a + + a ··· + (a1 as). 1 ··· s ≤ √s 4 − 5Valid for large s. It needs some limit arguments when s → ∞. 6 This result needs also a technical condition δ8s < 1. All the other results do not need supplementary conditions to hold. 7The theorem appeared in the literature for the first time on the book [Rauhut & Foucart ’13], published on 2013. It was based on a preprint version of the paper [Andersson & Strömberg ’14] from 2012. 5.5. ANALYSIS OF ALGORITHMS 95

Proof. Let us start by noting that the lemma is equivalent to the following statement

 q a1 a2 as 0√ 2 2 √s ≥ ≥ · · · ≥ ≥ s = a1 + + as + as 1. (a1 + + as)/√s + a1 1 ⇒ ··· 4 ≤ ··· 4 ≤ Therefore, we need to maximize the convex function

q 2 2 √s f(a1, a2, . . . , as) = a + + a + as, 1 ··· s 4 over the convex polytope   s a1 + + as √s C = (a1, . . . , as) R : a1 as 0 and ··· + as 1 . ∈ ≥ · · · ≥ ≥ √s 4 ≤ As any point of C is a convex combination of its vertices and because the function f is convex, the maximum is attained at a vertex of C. Besides, the vertices are intersections of s hyperplanes when we force s of the s + 1 inequalities to become equality. From this we obtain following possibilities:

1. If a1 = = as = 0, then f(a1, a2, . . . , as) = 0. ··· √ s 2. If (a1 + + as)/√s + a1 = 1 and a1 = = ak > ak+1 = = as = 0 for some 1 k s 1, ··· 4 ··· ··· ≤ ≤ − then one has a1 = = ak = 4√s/(4k +s), and consequently f(a1, a2, . . . , as) = 4√ks/(4k +s) 1 by the AM-GM inequality··· with the pair (4k, s). ≤ √ s 3. If (a1 + + as)/√s + a1 = 1 and a1 = = as > 0 then one has a1 = = as = 4/(5√s), and ··· 4 ··· ··· consequently f(a1, a2, . . . , as) = 4/5 + 1/5 = 1.

Thus we have obtained

max f(a1, a2, . . . , as) = 1, (a1,...,as)∈C and this concludes the proof.

Proof. (of Theorem 5.19): We need to prove NSP2,ρ,τ , that is, we need to find constants 0 < ρ < 1 and τ > 0 such that, for any v CN and any S [N] with #(S) = s, ∈ ⊂ ρ vS 2 v 1 + τ Av 2. || || ≤ √s|| S|| || ||

Again it is enough to consider an index set S =: S0 of s largest absolute entries of v. The same construction of Theorem 5.17 works here, that is we partition the complement S0 of S0 in [N] as S0 = S1 S2 ... where ∪ ∪

S1 is the index set of the s largest absolute entries of v in S0. •

S2 is the index set of the s largest absolute entries of v in S0 S1. • ∪ etc. Now, in contrast to Theorem 5.17 where we interpreted the vector vS0 as being 2s-sparse, we think of it it as being s-sparse and then we can write

2 2 AvS = (1 + t) vS with t δs. || 0 ||2 || 0 ||2 | | ≤ The first estimate we need to establish is that, for any k 1, ≥ q 2 2 AvS0 , AvS δ t vS0 2 vS 2. (5.13) h k i ≤ 2s − || || || k || 96 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY

In order to do it, we will first normalize the vectors vS0 and vSk by defining u = vS0 / vS0 2 and iθ || || w = e vSk / vSk 2 with θ being chosen to give Au, Aw = Re Au, Aw . For real numbers α, β 0 to be chosen later,|| we|| have |h i| h i ≥

1 h i 2 Au, Aw = A(αu + w) 2 A(βu w) 2 (α2 β2) Au 2 |h i| α + β || ||2 − || − ||2 − − || ||2 1 h 2 2 2 2 2i (1 + δ2s) αu + w (1 δ2s) βu w (α β )(1 + t) u ≤ α + β || ||2 − − || − ||2 − − || ||2 1 h 2 2 2 2 i = (1 + δ2s)(α + 1) (1 δ2s)(β + 1) (α β )(1 + t) α + β − − − − 1  2 2  = α (δ2s t) + β (δ2s + t) + 2δ2s . α + β −

p 2 2 p 2 2 Setting α = (δ2s + t)/ δ t and β = (δ2s t)/ δ t we can derive 2s − − 2s − p 2 2 q δ2s t   2 2 2 Au, Aw − δ2s + t + δ2s t + 2δ2s = 2 δ2s t , |h i| ≤ 2δ2s − − which is the same as the equation (5.13) we wanted to prove. After this, one observes that

*  + 2 X X AvS = AvS ,A v vS = AvS , Av AvS , AvS || 0 ||2 0 − k h 0 i − h 0 k i k≥1 k≥1 q X 2 2 AvS 2 Av 2 + δ t vS 2 vS 2 ≤ || 0 || || || 2s − || 0 || || k || k≥1  q  2 2 X = vS 2 √1 + t Av 2 + δ t vS 2 . (5.14) || 0 || || || 2s − || k || k≥1

− + For each k 1, let us denote by vk and vk the smallest and the largest absolute entries of v on Sk. Using Lemma≥ 5.21, we obtain

  X X 1 √s + − 1 √s + 1 1 vS 2 vS 1 + (v v ) v 1 + v v 1 + vS 2. || k || ≤ √s|| k || 4 k − k ≤ √s|| S0 || 4 1 ≤ √s|| S0 || 4|| 0 || k≥1 k≥1

2 2 Using this last inequality in the right-hand side of (5.14) while changing AvS by (1 + t) vS on its || 0 ||2 || 0 ||2 remote left-hand side, and canceling one vS factor we conclude that || 0 ||

p 2 2 p 2 2 δ2s t δ2s t (1 + t) vS 2 √1 + t Av 2 + − v 1 + − vS 2 || 0 || ≤ || || √s || S0 || 4 || 0 ||   1 δ2s δ2s (1 + t) Av 2 + v 1 + vS 2 , p 2 S0 p 2 0 ≤ √1 + t|| || √s 1 δ2s || || 4 1 δ2s || || − − p 2 2 where we used δ2s t /(1 + t) δ2s/√1 δ2s. Then, dividing by 1 + t, using 1/√1 + t √1 δ2s and rearranging, the last− expression≤ leads to− ≤ −

δ v 1 √1 + δ v 2s || S0 || + 2s Av . S0 2 p 2 p 2 2 || || ≤ 1 δ δ2s/4 √s 1 δ δ2s/4|| || − 2s − − 2s − This is exactly the NSPρ,τ if 5.5. ANALYSIS OF ALGORITHMS 97

δ ρ := 2s < 1, p 2 1 δ δ2s/4 − 2s − p 2 namely, if 5δ2s/4 < 1 δ2s or δ2s < 4/√41. Then, if this condition is satisfied, we have NSPρ,τ and therefore stable and robust− reconstruction of compressible vectors. We have related results for thresholding and greedy algorithms. All the proofs can be found in Sections 6.3 and 6.4 of [Rauhut & Foucart ’13]. There are two standard thresholding algorithms: iterative hard threshold and hard threshold pursuit. In the second one, for example, we have a sequence defined inductively by

n+1 n ∗ n S = Ls(x + A (y Ax )). (HTP1) −

n+1 n+1 x = argmin y Az 2, supp(z) S , (HTP2) {|| − || ⊂ } N where Ls(z) denotes the index set of s largest absolute entries of a vector z C . The success of hard thresholding pursuit is guaranteed by the following theorem8 ∈

Theorem 5.22. (Theorem 3.8 in [Foucart ’11]): Suppose that the 3s-th restricted isometry constant of the matrix A Cm×N satisfies ∈ 1 δ3s < 0.5773. √3 ≈ N m n Then, for x C , e C , and S [N] with #(S) = s, the sequence x defined by (HTP1) and (HTP2) with y = Ax∈+ e satisfies,∈ for any ⊂n 0, ≥ n n 0 x xS 2 ρ x xS 2 + τ Ax + e 2. (5.15) || − || ≤ || − || || S || p where ρ = δ2 /(1 δ2 ) < 1 and τ 5.15/(1 ρ). 3s − 2s ≤ − It is important to note what occurs with threshold algorithms in general. If we take the limit n , # # N n → ∞ iteration (5.15) yields x xS 2 τ AxS + e 2 if x C is the limit of the sequence x or at least one of its accumulation|| points.− The|| ≤ proof|| of this|| Theorem∈ does not guarantee the existence of such limit. However, the boundedness of xn is ensured by (5.15). Then we have, by the triangle inequality, that # ||# || x x 2 xS 2 + xS x 2, so choosing S as an index set of the s largest absolute entries of x gives|| − || ≤ || || || − ||

# x x 2 σs(x)2 + τ Ax + e 2. || − || ≤ || S || This is a very interesting estimate which differs from the one available for basis pursuit due to the fact that we have now σs(x)2 instead of σs(x)1. This is exclusive for threshold algorithms. But the classical error estimates as in the `1-minimization are also available. We just need to make the replacement s 2s, and → then, instead of a hypothesis based on δ3s, we have one based on δ6s. This is described in the following result

Theorem 5.23. Suppose that the 6s-th order restricted isometry constant of the matrix A Cm×N N m n ∈ satisfies δ6s < 1/√3. Then, for all x C and e C , the sequence x defined by (HTP1) and (HTP2) with y = Ax + e, x0 = 0 and s replaced∈ by 2s satisfies,∈ for any n 0, ≥ n n x x 1 Cσs(x)1 + D√s e 2 + 2ρ √s x 2, || − || ≤ || || || || n C n x x 2 σs(x)1 + D e 2 + 2ρ x 2, || − || ≤ √s || || || ||

8[Foucart ’11] established results about the analysis for a family of thresholding algorithms indexed by an integer k. There, iterative hard thresholding and hard thresholding pursuit correspond to the cases k = 0 and k = ∞, respectively, as described in Section 2.3. 98 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY

n where the constants C, D > 0 and 0 < ρ < 1 depend only on δ6s. In particular, if the sequence x clusters around some x# CN , then ∈

# # C x x 1 Cσs(x)1 + D√s e 2, and x x 2 σs(x)1 + D e 2. || − || ≤ || || || − || ≤ √s || || For greedy algorithms, we will state results for the orthogonal matching pursuit despite the existence of results for other algorithms, such as CoSaMP. See, for example, the original paper [Needell & Tropp ’08], where stability and robustness were stated under the condition δ4s 0.1, or the improved Theorem ≤ 6.27 of [Rauhut & Foucart ’13], where the condition δ4s 0.17157 does the job. For OMP, a curious phenomenon occurs. It was noticed in Theorem 3.2 of [Mo≤ & Shen ’12] that standard RIP arguments are not enough to establish the recovery of all s-sparse vectors in at most s iterations. Here we present a modified version which appears in [Rauhut & Foucart ’13] but follows the same ideas. We take a fixed 1 η < √s and consider the (s + 1) (s + 1) matrix with `2-normalized columns defined by ≤ ×  η  s  .   Id .  A =  η  .    s  q s−η2 0 ... 0 s In order to estimate RIP, we need to compute

 η  s  .  A∗A Id =  0 .  . −  η   s  η η s ... s 0 This matrix has eigenvalues η/√s, η/√s and 0 with multiplicity s 1. Then, − − ∗ δs+1 = A A Id 2→2 = η/√s. || − || However, consider the s-sparse vector x = [1,..., 1, 0]. It is not recovered from y = Ax after s iterations. To see why, just note that the s + 1 index is wrongly picked at the first iteration. Indeed

 η      s 1 1  .   .   .  A∗(y Ax0) = A∗Ax =  Id .   .  =  .  . −  η       s   1   1  η η s ... s 1 0 η Since OMP chooses one index in each iteration, we conclude that OMP fails to recover this specific vector in s iterations for this given matrix. This is what the naive greedy algorithm does. As [Rauhut & Foucart ’13] points out, there are two ways for circumvent this. Perform more than s iterations or find a way to reject the wrong indices by modifying the OMP. However in both cases the sparse recovery can be established if the restricted isometry constant is sufficiently small. Now, in order to state the result about OMP convergence, we consider a more general greedy algorithm which starts with an index set S0, x0 = argmin y Az , supp(z) S0 and iterating scheme {|| − || ⊂ } n+1 n ∗ n S = S L1(A (y Ax )) (OMP1) ∪ −

n+1 n+1 x = argmin y Az 2, supp(z) S . (OMP2) {|| − || ⊂ } Note that the usual OMP algorithm corresponds to the choice S0 = and x0 = 0. The following theorem was first established in [Zhang ’11] but restated and generalized in [Rauhut∅ & Foucart ’13]. 5.6. RIP LIMITATIONS 99

Theorem 5.24. (Theorem 6.25 in [Rauhut & Foucart ’13]): Suppose that A Cm×N has restricted isometry constant ∈ 1 δ13s < . 6 N m Then there is a constant C > 0 depending only on δ13s, such that, for all x C and e C , the n ∈ ∈ sequence x defined by (OMP1) and (OMP2) with y = Ax + e satisfies

12s y Ax 2 C Ax + e 2, || − || ≤ || S || for any S [N] with #S = s. Furthermore, if δ26s < 1/6, then there are constants C, D > 0 depending ⊂ N m n only on δ26s, such that, for all x C and e C , the sequence x defined by (OMP1) and (OMP2) with y = Ax + e satisfies ∈ ∈

12s y Ax 2 C Ax + e 2 || − || ≤ || S || for any S [N] with #S = s. Furthermore, if δ26s < 1/6, then there are constants C, D > 0 depending ⊂ N m n only on δ26s < 1/6, such that, for all x C and e C , the sequence x defined by (OMP1) and ∈ ∈ (OMP2) with y = Ax + e satisfies, for any 1 p 2 ≤ ≤ 24s c 1/p−1/2 x x p σs(x)1 + Ds e 2. || − || ≤ s1−1/p || || Nowadays there are many variations on greedy algorithm for sparse recovery. We could also cite stagewise OMP, regularized OMP and CoSaMP. A comparison between them can be found in the survey [Blanchard & Tanner ’15].

5.6 RIP Limitations

We saw that with RIP we could guarantee the convergence of all the main algorithms used for sparse recovery. We just need to ensure the existence of matrices with small RIP constant in the right regime of sparsity. Even more, RIP gives better scale measurements than coherence. However, the restricted isometry property has some limitations. For example, we could ask if it is possible to increase the bound on δs (or δks for some k), making δs 1 and continue to ensure the recovery of sparse vectors for matrices with RIP constant sufficiently close→ to 1. This would be the ideal situation. If that was true, taking any matrix A, the linear systems Ax = b it would be generically solvable, i.e., would be possible to find sparse solution for this system. Unfortunately this is not the case. We have two counterexamples, one for δs and another for δ2s, that show there is little hope to improve RIP constant indefinitely.

Theorem 5.25. (Theorem 4.1 in [Cai, Wang & Xu II ’10]): Let s be a positive integer. Then there exists a (2s 1) 2s matrix A with restricted isometry constant δs = (s 1)/(2s 1) and two nonzero s-sparse − × − − vectors β1 and β2 with disjoint supports such that Aβ1 = Aβ2. As a corollary, there exists a matrix with δs < 1/2 such that it is impossible to recover sparse vectors.

While this first counterexample is simple and direct, the second one, which is based on δ2s, is a little more sophisticated. We first need the following definition

m×N Definition 5.26. Given a matrix Φ R with unit spectral norm Φ 2→2 = 1, we define the 2 ∈ || || asymmetric RIC σk as Φx 2 σ2(Φ) = min Ω 2 k x || ||2 Ω xΩ #Ω≤k || ||2 Remark 33. As the maximum of a singular value of any squared submatrix is bounded by 1, a matrix 2 2 1/2 Φ with unit spectral norm with given σk implies the existence of a rescaled matrix Ak = (2/(1 + σk)) Φ with restricted isometry constant satisfying 100 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY

2 1 σk(Φ) δk(Ak) − 2 . (5.16) ≤ 1 + σk(Φ)

Theorem 5.27. (Theorem 3 of [Davies & Gribonval ’09]): Consider 0 < p 1 and let 0 < ηp < 1 be the 2/p 2 ≤ unique positive solution to ηp + 1 = (1 ηp). Then p − 1. If Φ Rm×N is a unit spectral matrix, 2s m < N and ∈ ≤

2 2 σ (Φ) > 1 ηp, (5.17) 2s − 2 p − then all s-sparse vectors can be uniquely recovered by basis pursuit

2. For every ε > 0 there exists integers s 1, N 2s + 1 and a matrix Φ R(n−1)×N with ≥ ≥ ∈

2 2 σ (Φ) > 1 ηp ε 2s − 2 p − − for which there exists a s-sparse vector which cannot be uniquely recovered via basis pursuit.

2 For p = 1 we have η + 2η1 1 = 0, hence η1 = √2 1 and the right-hand side in (5.17) is 3 2√2. 1 − − − In terms of the standard RIC for the rescaled matrix Ak, with k = 2s, this means that, using (5.16), that for any ε > 0, there exists a matrix A with δ2s < 1/√2 + ε where `1-recovery can fail. Therefore there is no hope in the improvement of RIP constant beyond 1/√2 0.707. What remains is to close the gap. ≈ In the case of δs, from 1/3 to 1/2, and in the case of δ2s, from 0.6246 to 0.707. Up until now we have said nothing about the difficulty in the combinatorial nature of the calculation of RIP. Evaluating RIP, namely, computing the constant δs for some matrix A and a level of sparsity s is a difficult problem. The intractability comes from the fact that any brute-force method would have to look at all submatrices with column subsets of size up to s. Despite a lot of computational evidence, this was an open problem in complexity theory until the end of 2013. It was solved by [Tillmann & Pfetsch ’14].

Theorem 5.28. Given a matrix A Qm×N and a positive integer s, the problem to decide whether ∈ there exists some rational constant δs < 1 such that A satisfies RIP of order s with constant δs is coNP-complete.

Corollary 5.29. For a given matrix A Qm×N and a positive integer s, it is NP-hard to compute the ∈ restricted isometry constant δs.

It is important to note that the result above is for the case δ = 1, i.e., it does not imply the result for every fixed constant δ < 1. Also, we can ask whether a matrix A satisfies RIP with given order s and given constant δs (0, 1). This is known as the RIP certification problem. It was also solved in [Tillmann & Pfetsch ’14], although∈ [Bandeira, Dobriban, Mixon & Sawin ’13] derived the result independently. The result is a little weaker than the previous one.

m×N Theorem 5.30. Given a matrix A Q , a positive integer s and some constant δs (0, 1), it is ∈ ∈ coNP-hard to decide whether A satisfies the RIP of order s with constant δs. It remains to be shown that the problem is also in the coNP class of complexity. Remark 5 in [Tillmann & Pfetsch ’14] explicitly asks:

Open Problem: Prove that the RIP certification problem is coNP-complete.

This kind of complexity results open a branch of investigation on approximation algorithms to compute bounds on δs instead of searching for exact polynomial time algorithms. 5.6. RIP LIMITATIONS 101

We saw, after Theorem 3.3, that sparse recovery is guaranteed if we take the measurements matrix and rescale, reshuffle or add some new measurement. This will not change the NSP property, but these operations will corrupt the RIP property. Reshuffling a measurement matrix is the same as multiplying it by a permutation matrix, but δs(UA) = δs(A) for any unitary matrix U, so in this case the RIP remains unchanged. However, if we rescale the measurements, replacing the measurement matrix A Cm×N by DA, with D Cm×m as a diagonal matrix, we can increase the restricted isometry constant.∈ Even worse, we may have∈ trouble with the scalar rescaling, where we replace the matrix A by λA for some λ C. For a very simple ∈ example of this deterioration, suppose that we have a matrix A with δS(A) < 3/5. Recalling that ∗ δs = max A AS Id 2→2, we can estimate the restricted isometry constant of 2A as S⊂[N], #S≤s || S − ||

∗ ∗ ∗ δs(2A) = max (2AS) (2AS) Id 2→2 (2AS) (2AS) Id 2→2 = 4ASAS 4Id + 3Id 2→2 S⊂[N], #S≤s || − || ≥ || − || || − ||

∗ D  ∗  E ∗ = 4(ASAS Id) + 3Id 2→2 = max x, 3Id + 4(ASAS Id) x = 3 + 4 ASAS Id 2→2 3 4δs(A) || − || ||x||2=1 − || − || ≥ −

> δs(A). In case we add some new measurements, and therefore more information, the RIP could also be perturbed. Consider a matrix with δs(A) < 1 and let δ > δs(A). Let us append a new row [0 ... 0 √1 + δ] and call the new matrix A˜. With the aid of the vector x = [0 ... 0 1] we conclude that Ax 2 1 + δ. || ||2 ≥ This implies δS(A˜) > δs(A). Therefore, RIP is a sufficient condition for sparse recovery but can be, sometimes, a poor suffi- cient condition. It does not capture some operations that can be done with the measurement ma- trix. Also, some works argue that it does not capture the essence of the structure of sparsity, see [Adcock, Hansen & Roman ’15] and references therein. 102 CHAPTER 5. RESTRICTED ISOMETRY PROPERTY Chapter 6

Interlude: Non-asymptotic Probability

“There is a tendency in teaching to reduce probability problems to pure analysis as soon as possible and to forget the specific characteristics of probability theory itself. Such treatments are based on a poorly defined notion of random variables usually introduced at the outset." William Feller in An Introduction to Probability Theory and Its Applications Vol. 1

6.1 Introduction

The purpose of this chapter is to take an interlude on Compressive Sensing theory and introduce some modern concepts related to Probability theory. Probability theory is concerned with obtaining informa- tion from the data being generated by some well-known data generating process. For example, one could ask when (and if) the surname “Montgomery” will disappear from families in USA or UK. For this, we can compute the expected number of Montgomery’s births if one has a model for birth and death process and a model for the choice of names. The original purpose of Probability was to calculate odds in games of chance. It was the Theory of Chance. For example, the gambler’s ruin was in several circles of intellectual discussions during the Enlightenment. Most of the modern ideas in Probability arose at that time. Some historical account of Probability can be found at [Stigler ’90], [Hacking ’06], [Fischer ’11] and [von Plato ’98]. As Feller points out in his legendary book [Feller ’68], in the forties “few mathematicians outside the Soviet Union recognized probability as a legitimate branch of mathematics. Applications were limited in scope, and the treatment of individual problems often led to incredible complications”. In fact, a tipping point happened when Kolmogorov published his seminal work axiomatizing Probability. He and his collaborators/students Khinchin, Gnedenko, Prokhorov, Dynkin, Shiryaev, etc have had a profound influence on making Probability a prestigious and intense field of research. Also, we should mention Levy, von Mises, Cramer, von Neumann, Jeffreys, Savage, de Finetti, Doob and Feller among many others as being part of this “revolution”. Nowadays probability is a very important branch of mathematics with lots of ramifications and ap- plications to the real world. It pervades all the sciences and thoughts about uncertainty are fundamental in the modeling of any phenomena. Close to the subject of this dissertation, the techniques developed in Machine Learning and Statistical Signal Processing confirm this fact. Through this chapter we will adopt the non-asymptotic point of view, generalize the Bernoulli and Gaussian distributions through the subgaussian distribution and state some important inequalities such as Gordon’s Lemma and the concentration of Gaussian measure for Lipschitz functions. All of this will be done in order to use some probabilistic tools in Compressive Sensing. Then we will prove the suitability of random matrices in the optimal regime, as discussed in the Chapter 5, for sparse vector recovery. The reader should have in mind that many of the major breakthroughs in Compressive Sensing, and also in many of modern Data Science subareas, relies on probabilistic arguments. Hence we spend some time collecting and demonstrating tools which will help us to understand the rest of this dissertation.

103 104 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY

We will not focus on basic concepts of Probability. Instead, the reader should consult [Resnick ’13] or [Durrett ’10] for essentials from Probability. The book [Feller ’68] should always be consulted.

6.2 Subgaussian and Subexponential Random Variables

“Many years ago (in 1893) I called the Laplace-Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another “abnormal”. Karl Pearson in [Pearson ’20] The most renowned probability distribution is the Gaussian distribution. The richness of its prop- erties, on one side, and his simplicity on the other side turn it ubiquitous. From the Bean Machine developed by Francis Galton to the Central Limit Theorem, passing by the maximization of entropy and least squares, science changed since its introduction. An account of its ubiquitous use can be seen at [Kim & Shevlyakov ’08] and references therein. Definition 6.1. A Gaussian random variable or normal random variable has probability density function

1 2 2 ψ(t) = e−(t−µ) /(2σ ). √2πσ2 It has mean EX = µ and variance E(X µ)2 = σ2.A standard Gaussian variable g is a Gaussian random − variable satisfying Eg = 0 and Eg2 = 1. Proposition 6.2. Let g be a standard Gaussian random variable. Then, for all u > 0,  r  2 1 2 P( g u) min 1, exp( u /2). (6.1) | | ≥ ≤ π u − r    r  2 1 1 2 2 P( g u) max 1 , 1 u exp( u /2). (6.2) | | ≥ ≥ π u − u2 − π − Proof. Using the definition of the probability density function of the Gaussian distribution, we have q 2 R ∞ −t2/2 P( g u) = e dt. Then, a change of variables yields | | ≥ 2π u ∞ ∞ ∞ Z 2 Z 2 2 Z 2 e−t /2dt = e−(t+u) /2dt = e−u /2 e−tue−t /2dt. (6.3) u 0 0 For the upper bound, on the one hand, we can use that e−tu 1 for t, u 0. Hence ≤ ≥ ∞ ∞ r Z 2 2 Z 2 π 2 e−t /2dt e−u /2 e−t /2dt = e−u /2. u ≤ 0 2 2 On the other hand, we can use that e−t /2 1 and this leads to ≤ ∞ ∞ Z 2 2 Z 1 2 e−t /2dt e−u /2 e−tudt = e−u /2. u ≤ 0 u 2 For the lower bound, we use the estimates e−t /2 1 t2/2 and e−tu 1 tu in Equation (6.3) and this yields, respectively, ≥ − ≥ −

Z ∞ Z ∞  2    −t2/2 −u2/2 t −tu −u2/2 1 1 e dt e 1 e dt = e 3 , u ≥ 0 − 2 u − u and

∞ ∞ r  Z 2 2 Z 2 2 π e−t /2dt e−u /2 e−t /2(1 ut)dt = e−u /2 u . u ≥ 0 − 2 − 6.2. SUBGAUSSIAN AND SUBEXPONENTIAL RANDOM VARIABLES 105

Besides tail estimates, another useful fact is the expression for the moment-generating function of the Gaussian random variable.

2 Proposition 6.3. Let X be a Gaussian random variable. Then, for θ R, we have E(eθX ) = eθ /2 and, ∈ more generally, for θ R and a < 1/2, ∈  2  aX2+θX 1 θ E(e ) = exp . √1 2a 2(1 2a) − − As a consequence, the moments of the Gaussian distribution are given by EX2n+1 = 0 and EX2n = (2n)!/(2nn!). Proof. See Section 5.6 of [DeGroot & Schervish ’11].

Inspired by the tails and all of the remarkable properties of Gaussian distribution, one might try to find other distributions with similar tails (and, consequently, some similar properties). This leads to the concept of subgaussian random variables.

Definition 6.4. A random variable X is called subgaussian1 if there exists constants β, κ > 0 such that

−κt2 P( X t) βe for all t > 0. | | ≥ ≤ By Proposition 6.2, a standard Gaussian random variable is subgaussian with β = 1 and κ = 1/2. Other examples of subgaussian random variables include Rademacher random variables, with distribution P X = 1 = P X = 1 = 1/2, and bounded random variables, which satisfy X < M almost surely for{ some−M}. { } | | Despite the fact that the class of subgaussian variables is quite wide, it does not encompass some important distributions which have heavier tails than the Gaussian distribution. An example is the exponential distribution, which satisfies P( X t) e−t for all t > 0. Because of this we need to define the class of subexponential random variables.| | ≥ ≤

Definition 6.5. A random variable X is called subexponential if there exists constants β, κ > 0 such that −κt P( X t) βe for all t > 0. | | ≥ ≤ It follows immediately from this definition that that any subgaussian variable is also subexponential. One can ask if the converse is also true, that is, if any subexponential variable is also subgaussian. After developing some equivalent conditions for a random variable being subgaussian, we will see a counterexample. This class of subgaussian random variables was introduced by [Kahane ’60] in order to establish a sufficient condition for the almost sure uniform convergence of certain random series of functions. Nowadays these variables have been proved to be fundamental not only for the study of series but also in the Geometry of Banach Spaces and for Random Matrices. They shown to be very important in the context of Compressive Sensing, as we will see in Chapter 7. More information and discussion about them can be found at [Buldygin & Kozachenko ’98]. Now we proceed to the first equivalence. It relies on a general relations between moments and tails of a random variable. We will see two other ways to characterize them.

Theorem 6.6. Suppose that X is a random variable satisfying

p 1/p 1/p 1/γ (E X ) αβ p for all p [p0, p1], | | ≤ ∈ for some constants α, β, γ, p1 > 0, p0 > 0. Then

1/γ −uγ /γ P( X e αu) βe , | | ≥ ≤ 1Subgaussian is the free translation of the French sous-gaussienne coined in the work [Kahane ’60]. 106 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY for all u [p1/γ , p1/γ ]. Conversely, suppose that a random variable X satisfies, for some γ > 0, ∈ 0 1 1/γ −uγ /γ P( X e αu) βe for all u > 0, | | ≥ ≤ then, for p > 0,   p p p/γ p E X βα (eγ) Γ + 1 . (6.4) | | ≤ γ As a consequence, for p 1, ≥ p 1/p 1/p 1/γ (E X ) C1α(C2,γ β) p for all p 1, (6.5) | | ≤ ≥ 1/(2e) p γ/12 where C1 = e 1.2019 and C2,γ = 2π/γe . In particular, C2,1 2.7245 and C2,2 2.0939. ≈ ≈ ≈

Proof. For the first part, for an arbitrary κ > 0, we use Markov’s inequalityfor X p (for some p to be fixed later) and obtain | |

p  1/γ p κ E X αp P( X e αu) | | β . | | ≥ ≤ (eκαu)p ≤ eκαu The choice p = uγ and κ = 1/γ yields the result. For the converse part, we have

Z Z Z |X|p Z Z ∞ Z ∞ Z p p E X = X dP = 1 dxdP = I{|X|p≥x}dxdP = I{|X|p≥x}dPdx | | Ω | | Ω 0 Ω 0 0 Ω

Z ∞ Z ∞ Z ∞ p p p p−1 p−1 = P( X x)dx = p P( X t )t dt = p P( X t)t dt 0 | | ≥ 0 | | ≥ 0 | | ≥ Z ∞ Z ∞ p p/γ 1/γ p−1 p p/γ −uγ /γ p−1 = pα e P( X e αu)u du pα e βe u du 0 | | ≥ ≤ 0 Z ∞ p  p   p  = pβαpep/γ e−v(γv)p/γ−1dv = βαp(eγ)p/γ Γ = βαp(eγ)p/γ Γ + 1 . 0 γ γ γ Now, in order to prove (6.5), we just use Stirling’s formula for the Gamma function. It states that, for x > 0, Γ(x) = √2πxx−1/2e−x exp (θ(x)/12x) with 0 θ(x) 1 (the proof of this fact can be found in many references, e.g., [Jameson ’15]2). Applying this formula≤ ≤ to the Gamma function in the equation above yields

 p/γ+1/2 p p p/γ p −p/γ γ/(12p) p γ/(12p) p/γ+1/2 −1/2 E X βα (eγ) √2π e e = √2πβα e p γ . | | ≤ γ Under the assumption p 1, we obtain ≥ 1/p γ/12 ! p 1/p √2πe 1/γ 1/(2p) (E X ) β αp p . | | ≤ √γ

Lastly, the maximum value of p1/(2p) is attained for p = e and so p1/(2p) e1/(2e). This concludes the proof of Theorem 6.6. ≤

2For historical references about Stirling’s formula, one can look at [Fowler ’00] and [Tweddle ’84]. Also, sixteen variations on the theme can be found at [Dominici ’08]. 6.2. SUBGAUSSIAN AND SUBEXPONENTIAL RANDOM VARIABLES 107

Particularizing Theorem 6.6 for Subgaussian variables with α = (2eκ)−1/2 and γ = 2 shows that its moments satisfy (E X p)1/p Cκ˜ −1/2β1/pp1/2 = O(√p) for all p 1. Also, for a Subexponential variable | | ≤ ≥ X, setting α = (eκ)−1 and γ = 1, Theorem 6.6 leads to (E X p)1/p Cκ−1β1/pp = O(p) for all p 1. We have another equivalence for Subgaussian variables.| | This is≤ sometimes called super-exponential≥ moment equivalence.

2 Proposition 6.7. A random variable X is Subgaussian, i.e., P( X t) Ce−ct , if and only if there | | ≥ ≤ exist constants c > 0 and C 1 such that E[exp(cX2)] C. ≥ ≤ Proof. First, let us proof the only if part. Using Theorem 6.6, (more precisely, the moment estimate (6.4)) with κ = 1/(2eα2), we have EX2n βκ−nn!. Using the Taylor series of the exponential function and Fubini’s theorem leads to ≤

∞ n 2n ∞ n −n −1 2 X c E[X ] X c κ n! βcκ E[exp(cX )] = 1 + 1 + β = 1 + , n! ≤ n! 1 cκ 1 n=1 n=1 − − provided c < κ. Now, in order to prove the if part, we just use Markov’s inequality. It leads to

2 2  2 −ct2 −ct2 P( X t) = P exp(cX ) exp(ct ) E[exp(cX )]e Ce . | | ≥ ≥ ≤ ≤

The study of deviation inequalities for the tail bounds of a random variable can be done through the moment-generating function. This is the content Cramér-Chernoff theorem below. Therefore, an equivalent definition for subgaussian variables will be given via moment-generating function. We begin with a definition.

Definition 6.8. The cumulant-generating function of a real-valued random variable is defined as the logarithm of the moment-generating function, that is,

CX (θ) = ln E exp(θX). We now present the general deviation bound.

Theorem 6.9. Let X1,...,XM be a sequence of independent (real-valued) random variables with cumulant- generating functions CX , i [M]. Then, for t > 0, i ∈ M M  X   n X o P Xi t exp inf θt + CXi (θ) . ≥ ≤ θ>0 − i=1 i=1 Proof. For θ > 0, Markov’s inequality and independence yield

M !  M  ! " M # " M # X X −θt X −θt Y P Xi t = P exp θ Xi exp(tθ) e E exp(θ Xi) = e E exp(θXi) ≥ ≥ ≤ i=1 i=1 i=1 i=1

M M M ! −θt Y −θt Y X = e E[exp(θXi)] = e exp(CXi (θ)) = exp θt + CXi (θ) . − i=1 i=1 i=1 Taking the infimum over θ > 0 concludes the proof.

Thus, another important equivalent definition of a (zero mean) Subgaussian variable is given in terms of the moment-generating function as the following theorem shows. 108 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY

Theorem 6.10. Let X be a random variable.

i.) If X is Subgaussian with EX = 0, then there exists a constant c (depending only on β and κ) such that 2 E[exp(θX)] exp(cθ ) for all θ R. (6.6) ≤ ∈ ii.) Conversely, if (6.6) holds, then EX = 0 and X is Subgaussian with parameters β = 2 and κ = 1/(4c). Proof. First, we will proceed with the proof of ii. Taking t, θ > 0 and applying Markov inequality yields

−θt cθ2−θt P(X t) = P(exp(θX) exp(θt)) E[exp(θX)]e e . ≥ ≥ ≤ ≤ 2 Choosing the optimal parameter θ = t/(2c) on the right-hand side leads to P(X t) e−t /(4c), and 2≥ ≤ the same computation above with X instead of X shows that P( X t) e−t /(4c). So, the union bound yields what we want, that is,− − ≥ ≤

−t2/(4c) P( X t) 2e . | | ≥ ≤ Now, it remains to deduce that X has zero mean. Using that 1+θX exp(θX) and taking the expectation on both sides leads, for θ < 1, to ≤ | | 2 2 4 1 + θE(X) E[exp(θX)] exp(cθ ) 1 + (c/2)θ + O(θ ). ≤ ≤ ≤ Taking θ 0 yields EX 0. Making the same calculations with X instead of X shows that EX 0. → ≤ − ≥ Therefore, EX = 0. Now, to prove the first part, it is enough to prove it for θ 0 because the theorem for θ < 0 follows just by substituting X by X. As in the proof of Proposition≥ 6.7, using the Taylor series for the exponential function and Fubini’s− theorem yields

∞ ∞ X θnEXn X θnE X n E[exp(θX)] = 1 + θE(x) + = 1 + | | , n! n! n=2 n=2 since E(X) = 0 by hypothesis. Let us consider, for a moment, that 0 θ θ0 for some θ0 to be chosen ≤ ≤ later. Using the moment estimate (E X p)1/p Cκ˜ −1/2β1/pp1/2 from Theorem 6.6, and Stirling’s formula in its version n! √2πnne−n leads to| | ≤ ≥ ∞ ∞ X θnC˜nκ−n/2nn/2 β X θnC˜nκ−n/2nn/2 E[exp(θX)] 1 + β 1 + n! √ nne−n ≤ n=2 ≤ 2π n=2 ˜ 2 ∞ ˜ 2 2 β(Ce) X −1/2 n 2 β(Ce) 1 2 2 1 + θ (Ceθ˜ 0κ ) = 1 + θ = 1 + c1θ exp(c1θ ), −1/2 ≤ √2πκ √2πκ 1 Ceθ˜ 0κ ≤ n=0 − −1/2 −1 provided that Ceθ˜ 0κ < 1. Therefore, if we choose θ0 = (2Ce˜ ) √κ, we have the result for c1 = √2βκ−1(Ce˜ )2/√π. It remains to prove the result when θ > θ0. Actually, we will reformulate it and our goal will be to 2 prove E[exp(θX)] exp[(c2θ )] for some constant c2 > 0. First, observe that: ≤  2 2 2 2 X X X θX c2θ = √c2θ + . − − − 2√c2 4c2 ≤ 4c2

Using the constant c > 0 and C 1 from Proposition 6.7 and choosing c2 = 1/(4c) yields the following estimate ≥

2 h 2 i h 2 i E[exp(θX c2θ )] E exp X /(4c2) = E exp cX ) C. − ≤ ≤ −2 Finally, we need again to use θ0 somehow. Defining K = ln(C)θ0 leads to 6.2. SUBGAUSSIAN AND SUBEXPONENTIAL RANDOM VARIABLES 109

2 2 2 2 2 E[exp(θX)] C exp(c2θ ) = C exp( Kθ ) exp (c2 + K)θ C exp( Kθ0) exp((c2 + K)θ ) ≤ − ≤ − 2 exp (c2 + K)θ . ≤ We divided the proof in two cases, setting c3 = max c1, c2 + K completes the proof. { } This result tells us that the moment-generating function of a subgaussian variable exists for allθ R. As [Rudelson ’14] points out, here we have a subtle difference from subexponential variables. For∈ the latter, the bound for the moment-generating function only holds in a neighborhood of zero. This comes from the fact the moment-generating function of the exponential random variable with parameter 1 does not exist for θ 1. Another interesting≥ result is that the sum of zero mean subgaussian variables is also subgaussian as the next theorem shows. It is important to note that there is also a version of this theorem for subexponential variables, i.e, the subexponential property is also preserved under summation in the case of independent subexponential random variables.

Remark 34. Equation 6.8 below is sometimes called Hoeffding inequality for subgaussian variables.

Theorem 6.11. Let X1,...,XM be a sequence of independent zero mean subgaussian random variables M PM with subgaussian parameter c in (6.6) . For a = (α1, . . . , αM ) R , the random variable X = i=1 αiXi is Subgaussian, i. e., ∈

2 2 E exp(θX) exp(c a 2θ ). (6.7) ≤ || || In particular, by Theorem 6.10,

M ! X  t2  P αiXi t 2 exp for all t > 0. (6.8) ≥ ≤ −4c a 2 i=1 || ||2

Proof. Using the independence of X1,...,XM , we have

M ! M M M X Y Y Y 2 2 2 2 E exp θ αiXi = E exp(θαiXi) = E exp(θαiXi) exp(cθ αi ) = exp(c a 2θ ). ≤ || || i=1 i=1 i=1 i=1

This proves the first inequality. For the second, we just need to use part ii.) of Theorem 6.10.

The discussion of Theorem 6.6, Proposition 6.7 and Theorem 6.10 tells us that we have four equivalent ways of definining a subgaussian random variable. Using the tail of the distribution, the moments’ bounds, super-exponential moment equivalence or, in the case of zero mean variables, the moment-generating function. Even more, the class of Subgaussian variables in a given probability space forms a normed space through the following definition.

Definition 6.12. The subgaussian norm of X, denoted by X ψ is defined by || || 2 −1/2 p 1/p X ψ2 = sup p (E X ) . || || p≥1 | |

In the discussion about a Subgaussian variable X, it is not necessary for X to have mean zero. This is done only to simplify the notation in proofs. Anyway, X always can be transformed into a zero-mean variables just by observing that if X is Subgaussian then so is X EX. Moreover, by the triangle − inequality X EX ψ2 X ψ2 + EX ψ2 and the simple estimate EX ψ2 = EX E X X ψ2 , || − || ≤ || || || || || || | | ≤ | | ≤ || || we have X EX ψ2 2 X ψ2 . For the subexponential case, we have: || − || ≤ || || 110 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY

Definition 6.13. The subexponential norm of X, denoted by X ψ is defined by || || 1 −1 p 1/p X ψ1 = sup p (E X ) . || || p≥1 | | The normed space equipped with any of these two norms can be viewed as a particular cases of a Birnbaum-Orlicz space. Some details of this generalization can be found in Section 4 of [Rivasplata ’12] or in the notes of Chapter 8 from [Rauhut & Foucart ’13]. Finally, we can exemplify why Rademacher and bounded variables are subgaussian variables and give an example of a subexponential which is not subgaussian. Example 6.14. (Rademacher variables): Let us compute the moment-generating function of Rademacher variables, that is, variables which satisfies P X = 1 = P X = 1 = 1/2. { − } { }    ∞ k ∞ k  ∞ 2k ∞ 2k θX 1 −θ θ 1 X ( θ) X θ X θ X θ θ2/2 E[e ] = e + e = − + = 1 + = e . 2 2 k! k! (2k)! ≤ 2kk! k=0 k=0 k=0 k=1 This tells us that Rademacher variables are subgaussian. Example 6.15. (Bounded variables): Let X be mean zero random variable contained in some interval [a, b]. Let X˜ be an independent copy of X. Here we will use a technique called symmetrization, where after the introduction of this new variable, we use a symmetry argument with the aid of a Rademacher variable. A different argument can be found in Theorem 2.5 of [Rivasplata ’12]. Now it will be necessary to denote by EX the expected value with respect to the variable X.

  b b   Z ˜ Z ˜ θX ˜  θX −θEX˜ [X] θX −θX EX [e ] = EX exp θ X EX˜ [X] = e e p(x)dx e EX˜ [e ]p(x)dx = − a ≤ a

θX −θX˜  θ(X−X˜ ) EX [e ]EX˜ [e ] = EX,X˜ e . where we used the convexity of the exponential and Jensen’s inequality. Now, denoting by Z a Rademacher variable, we have that the distribution of Z(X X˜) is the same as X X˜. This yields − − 2 ˜ 2 2 2  θ(X−X˜ )  θZ(X−X˜ )   θ (X−X)  θ (b−a) E ˜ e = E ˜ EZ [e ] E ˜ e 2 e 2 . X,X X,X ≤ X,X ≤ There the penultimate inequality follows from applying the calculations of Example 6.14 conditionally, using that (X, X˜) is fixed, and the last inequality follows from X X˜ b a. This proves that bounded variables are subgaussian. | − | ≤ − The main difference between subgaussian and subexponential variables can be described in terms of the moment-generating function. In particular, we can cite the following theorem that can be found in Lemma 5.15 from [Vershynin ’12]. Theorem 6.16. Let X be a mean zero subexponential random variable. Then, for t such that t 2 2 | | ≤ c/ X ψ1 , one has E exp(tX) exp(Ct X ψ ), where C, c > 0 are absolute constants. || || ≤ || || 1 While, by Theorem 6.10, the moment-generating function of subgaussian variables exists for all θ ∈ R, this is not the case for subexponential variables. Therefore, if we find a subexponential variable with a moment generating function that does not exists for all θ, we will have found an example of a subexponential variable which is not subgaussian. This is shown in the next example. Example 6.17. Consider the random variable Y = X2, where X has a standard normal distribution. This variables is called χ2 random variable. Since its mean is 1, we calculate the moment-generating function of Y 1. Thus, for θ < 1/2 we have − Z −θ θ(Y −1) 1 θ2(z2−1) −z2/2 e E[e ] = e e dz = . √2π √1 2θ R − 6.3. NONASYMPTOTIC INEQUALITIES 111

For θ > 1/2, the moment-generating function of Y 1 does not exist. Therefore, the one for Y also does not exist. This shows that Y cannot be subgaussian.− On the other side, we have

−θ e 2 2 e2θ = e4θ /2. √1 2θ ≤ − for θ < 1/4. Thus, this shows that Y = X2 is subexponential. | | 6.3 Nonasymptotic Inequalities

“Yet, moving beyond this terra firma, one quickly encounters examples where classical methods are brittle”. Joel Tropp in [Tropp ’15]

In the first part of the last century, researches of Probability Theory were mainly concerned with the asymptotic behavior of random variables. See [Feller ’45]. After many efforts, the two most important theorem of Probability, The Law of Large Numbers and the Central Limit Theorem, got a definitive shape. Here we state them not in the general but instead, in a pleasant form.

Theorem 6.18. (Law of Large Numbers): Let X1,X2,... be a sequence of independent, identically distributed random variables with mean µ. Consider the sum Sn = X1 + + Xn. Then, as n , ··· → ∞ S n µ almost surely. n →

Theorem 6.19. (Central Limit Theorem): Let X1,X2,... be a sequence of independent, identically 2 distributed random variables with mean µ and variance σ . Consider the sum Sn = X1 + + Xn and normalize it to obtain a random variable with zero mean and unit variance as follows ···

n Sn E[Sn] 1 X Zn = − = (Xi µ). p √ − Var(Sn) σ N i=1 We denote the normal distribution by g N(0, 1). Then, as n , ∼ → ∞ Z ∞ 1 −x2/2 P(Zn t) P(g t) = e dx, t R. ≥ → ≥ √2π t ∀ ∈ For a historical account, see [Seneta ’13] and [Fischer ’11]. As one can note there is no mention to the rate of convergence. Thus, the first question is how to give a bound on the maximal error of the approximation between the normal distribution and the true distribution of the scaled sample mean. In the case of Theorem 6.19, we have the following answer, called the Barry-Esseen Theorem. For a proof, see Section XVI.5 of [Feller ’72].

Theorem 6.20. (Barry-Esseen Theorem): In the setting of Theorem 6.19, for every n and every t R, we have ∈

3ρ P(Zn t) P(g t) . ≥ − ≥ ≤ √n 3 3 Here ρ = E X1 µ /σ and g N(0, 1). | − | ∼ Despite the fact that this theorem provides a quantitative version for the error, this decays to zero very slowly, even slower than linear in n. Since we are dealing with the normal distribution and we saw in Proposition 6.2 that its distribution has an exponential decay tail, we expect to find better estimates. This is the purpose of Nonasymptotic Probability: derive quantitative finite versions of limit of sums of random variables with a typically exponential decay. It is a very rich and active research field, specially 112 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY important for computational purposes, such as in Monte Carlo methods and modern Data Science in general. These concentration inequalities will provide bounds on how a random variable deviates from some value. Most often this value will be its expected value. In Section 6.2 we came across an example in Theorem 6.11. This example is a particular case of a very general statement. Also, in Theorem 6.9, we saw that if we know how to compute the cumulant-generating function of a family of random variables, then an exponential decay for the tail appears naturally. Before we move forward, let us look at an example from [Vershynin ’15]. Example 6.21. If we toss a coin N times, what is the probability that we get at least 3N/4 heads? Denoting the Bernoulli variables, i.e., variables which satisfy P(Xi = 0) = P(Xi = 1) = 1/2, by Xi, PN we model our problem through the binomial random variable SN = i=1 Xi. It is well known that ESN = N/2 and Var(SN ) = N/4. Then, Chebyshev’s inequality leads to   P SN 3N/4 P SN N/2 N/4 4/N. ≥ ≤ | − | ≥ ≤ This is a linear decay. The main point here is that, for sufficiently large N, we can use a naive estimate from the Central Limit Theorem and then we should expect the following !  SN N/2 p p 1 −N/8 P SN 3N/4 P p− N/4 P(g N/4) = e . ≥ ≤ N/4 ≥ ≈ ≥ √2π The comparison between the two decay rates is shocking. However it is not possible to make the reasoning above rigorous. New techniques, such as Theorem 6.9, were invented in order to deduce such kind of estimates. The literature of concentration inequalities is vast and has many results. It is important to cite the contributions of Hoeffding, Bernstein. Cramer, Chernoff, Azuma, McDiarmid [Ledoux ’01]. This kind of techniques, where there is solely a finite number of variables, is fundamental for Data-Science and Machine Learning. In this scenario, N typically corresponds to the sample size of the dataset. We saw in Theorem 6.11 that a concentration inequality is valid for subgaussian variables. For the subexponential case, we could ask if a similar inequality is available. In this case, we must proceed more carefully, since the tails of sub-exponential distributions may not decay fast enough to make the moment- generating function finite everywhere, as Theorem 6.16 shows. The inequality for this case must take into account moments of higher orders. The next theorem, known as Bernstein’s inequality, quantifies these informations. Also, if the variance is small, then it can be a large improvement on Theorem 6.11.

Theorem 6.22. Let X1,...,XM be independent zero mean random variables such that, for all integers n 2, ≥ n n−2 2 E Xi n!R σi /2 for all i [M] (6.9) | | ≤ ∈ for some constants R > 0 and σi > 0, i [M]. Then, for all t > 0, ∈ M ! M  2  X t /2 2 X 2 P Xi t 2 exp , where σ = σi . (6.10) ≥ ≤ −σ2 + Rt i=1 i=1

Proof. Let us estimate the moment-generating function of Xi. This will allows us to use Theorem 6.9. Again, using Taylor series of exponential and Fubini’s theorem, we have

∞ n n 2 2 ∞ n−2 n X θ E[Xi ] θ σi X θ E[Xi ] E[exp(θXi)] = 1 + θE[Xi] + = 1 + . n! 2 n!σ2/2 n=2 n=2 i

n−2 n P∞ θ E[Xi ] 2 2 2 2 If we define Fi(θ) = n=2 n!σ2/2 , then we have E[exp(θXi)] = 1 + θ σi Fi(θ)/2 exp(θ σi Fi(θ)/2). i ≤ 2 PM 2 Thus, introducing F (θ) = maxi∈[M] Fi(θ) and remembering that σ = i=1 σi , Theorem 6.9 yields 6.3. NONASYMPTOTIC INEQUALITIES 113

 M  X 2 2 2 2 P Xi t inf exp(θ σ F (θ)/2 θt) inf exp(θ σ F (θ)/2 θt). ≥ ≤ θ>0 − ≤ 0

M  X   t2σ2 1 t2   t2/2  P Xi t exp 2 2 Rt 2 = exp 2 . ≥ ≤ 2(σ + Rt) 1 2 − σ + Rt − σ + Rt i=1 − σ +Rt The same estimate could be derived for Xi in place of Xi. Applying the union bound concludes the proof. −

From Theorem 6.22 we easily derive a Bernstein inequality type for subexponential variables.

Theorem 6.23. Let X1,...,XM be independent zero mean subexponential random variables, i.e., P( Xi t) βe−κt for some constants β, κ > 0 for all t > 0, i [M]. Then | | ≥ ≤ ∈ M ! X  (κt)2/2  P Xi t 2 exp − . (6.12) ≥ ≤ −2βM + κt i=1 Proof. As in the proof of Theorem 6.6, we will estimate the high moments. So, for n 2 N, ≥ ∈ Z ∞ Z ∞ n n−1 −κt n−1 E Xi = n P( Xi t)t dt βn e t dt = | | 0 | | ≥ ≤ 0 Z ∞ 2βκ−2 βnκ−n e−xxn−1dx = βnΓ(n 1)κ−n = n!κ−(n−2) . 0 − 2 −1 2 −2 Therefore, condition 6.9 holds with R = κ and σi = 2βκ . Thus, according to Theorem 6.22, the theorem follows. Bernstein’s inequality could be generalized for matrices. One can show that there are exponential concentration inequalities for the spectral norm of a sum of independent random matrices. This is fundamental for, among other things, problems of blind deconvolution in Signal Processing and problems of covariance matrix estimation in Statistics. See the fantastic book [Tropp ’15]. The next theorem is a large deviation inequality for Rademacher chaos. It tells us that a homogeneous Rademacher chaos can be controlled by a mix of subexponential and subgaussian tails. Such estimates are known by the name of Hanson-Wright inequalities.

M×M Definition 6.24. Let  = (1, . . . , M ) be a Rademacher vector. For a self-adjoint matrix A C with zero diagonal we define the homogeneous Rademacher chaos by ∈

∗ X X =  A = jkAjk (6.13) j6=k Remark 35. Since A is self-adjoint, X will be real-valued even when A is a complex matrix. this allows us to reduce the study to real-valued symmetric matrices A RM×M . ∈ 114 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY

Theorem 6.25. Let A RM×M be a symmetric matrix with zero diagonal and let ε be a Rademacher vector. Then the homogeneous∈ Rademacher chaos X defined in Equation (6.13) satisfies, for t > 0

   2 4||A||2  3t F   2  2 exp 128||A||2 if 0 < t 3||A|| , 3t t  − F ≤ 2→2 P( X t) 2 exp min 2 , =   2 | | ≥ ≤ − 128 A 32 A 2→2 t 4||A|| || ||F || || 2 exp if t > F .  − 32||A||2→2 3||A||2→2 (6.14)

In order to prove it, we need a powerful technique, called decoupling, that reduces stochastic dependen- cies of variables such as Rademacher chaos. For a proof of this fact, see Theorem 6.1.1 of [Vershynin ’15] or Theorem 8.11 of [Rauhut & Foucart ’13].

Theorem 6.26. Let X = (X1,...,Xn) be a sequence of independet random variables with E[Xi] = 0 for all i [M]. Let αjk with j, k [n], be a double sequence of elements in a finite-dimensional vector space ∈ ∈ V . If F : V R is a convex functions, then → "  n # "  n # X X 0 E F αj,kXjXk E F 4 αj,kXjXk , ≤ j,k=1 j,k=1 j6=k j6=k where X0 denotes an independent copy of X.

Proof. (of Theorem 6.25): Inspired by the ideas of Theorem 6.9, we will estimate the moment-generating function of X. For θ > 0, using the convexity of f(x) = exp(θx) and Theorem 6.26 leads to     X X 0 E exp(θX) = E exp θ jkAjk E exp 4θ jkAjk ≤ j6=k j,k     X 0 X 2 X  X  EE0 exp 4θ k jAjk E exp 8θ jAjk , (6.15) ≤ k j k j where in the last step we used Theorem 6.11 conditionally on  and the fact that the subgaussian parameter for Rademacher is c = 1/2, as calculated in Example 6.14. By the symmetry of A, we have,

2 X  X  X X X X X ∗ 2 jAjk = jAjk `A`k = j` AjkAk` =  A . k j k j ` j,` k Set B = A2. We can estimate the moment-generatng function of the positive semidefinite chaos ∗B by

     ∗ 2   X X  λtr(B)  X 0  E exp(λ A ) = E exp λ Bjj + λ jkBjk e E exp 4λ jkBjk ≤ j j6=k j,k

 2  λtr(B)  2 X  X   e E exp 8λ jBjk . ≤ k j Again we have applied Theorem 6.11 conditionally on  and Theorem 6.26. Since B is positive semidefi- nite, its square-root exists and then

2 X  X  ∗ 1/2 ∗ 1/2 ∗ jBjk =  B = (B ) B(B ) B 2→2 B. ≤ || || k j

In the case 8λ B 2→2 < 1, Jensen’s inequality yields || || 6.4. COMPARISON OF GAUSSIAN PROCESSES 115

8λ||B|| ∗  2 ∗    ∗  2→2 E exp(λ B) exp(λtr(B))E exp(8λ B 2→2 B) exp(λtr(B)) E exp(λ B) . ≤ || || ≤ Rearranging this expression leads to    ∗  λtr(B) −1 E exp(λ B) exp , 0 < λ < (8 B 2→2) . (6.16) ≤ 1 8λ B 2→2 || || − || || 2 −1 After setting λ = 8θ we have, for 0 < θ < (16 A 2→2) , || ||  2 2   2 2  8θ tr(A ) 8θ A F E exp(θX) exp 2 2 = exp ||2 || 2 . ≤ 1 64θ A 2→2 1 64θ A − || || − || ||2→2 −1 Using Markov’s inequality, for 0 < θ (16 A 2→2) , we obtain ≤ || ||  2 2  θX θt −θt  θX  8θ A f P(X t) = P(e e ) e E e exp θt + || || ≥ ≥ ≤ ≤ − 1 64θ2 A 2 − || ||2→2  8θ2 A 2  exp θt + || ||F exp( θt + 32θ2 A 2 /3). ≤ − 1 1/4 ≤ − || ||F − 2 2 −1 If t 4 A F /(3 A 2→2), the optimal choice θ = 3t/(64 A F ) satisfies θ (16 A 2→2) . For t given by this≤ || inequality,|| || we|| therefore obtain || || ≤ || ||

 3t2  P(X t) exp . ≥ ≤ − 128 A 2 || ||F 2 Now, we must look to the complementary inequality, namely t > 4 A F /(3 A 2→2). In this case, setting −1 || || || || θ = (16 A 2→2) leads to || ||

2 2 P(X t) exp( θt + 32θ A F /3) exp( θt + θt/2) = exp( θt/2) = exp( t/(32 A 2→2)), ≥ ≤ − || || ≤ − − − || || since in this case we have θ < 3t/(64 A 2 ). To finish, see that X has the same distribution of X and || ||F − we can derive the same bounds for P(X t). Taking the union bound yields the theorem. ≤ − This kind os estimate can be generalized to quadratic forms involving more general subgaussian random vectors, see [Hanson & Wright ’71]. Specifically in the Gaussian case, see [Bechar ’09].

6.4 Comparison of Gaussian Processes

“Tout le monde croit que les erreurs suivent une loi normale, me disait un jour M. Lippmann, car les expérimentateurs car ils pensen qu’il s’agit d’un théorème, et les mathématiciens que pensent que c’est un fait expérimental.”3 Henri Poincaré in Calcul des Probabilités, p.171

n n In many situations one wishes to compare two families of random variables (Xt)t=1 and (Yt)t=1 in order to extract some information about one of them using the other. There are many reasons for that but typically we choose the latter as having more tractable properties than the former. Thus we wish to n n provide an upper bound on (Xt)t=1 just by knowing that (Yt)t=1 is larger in some sense. This is remarkable in the case of Gaussian variables. In the inconspicuous report [Slepian ’62], David Slepian realized that the covariance structure can be regarded as an indicator of how Gaussian variables behave jointly. Therefore, using its correlation we can quantify how much the variables in each of the

3A free translation of Poincarés’ quote is “Everybody believes that the errors follow a normal law, said one day to me M. Lippmann, because the experimenters think that it is a theorem, and the mathematicians think that it is an experimental fact.” 116 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY families maintain similar expected magnitudes. Then, if we want to compare functions of two families of Gaussian random variables, it is a good ideia to compare their covariances. This is confirmed by Theorem 6.29, due to Slepian and Fernique. Before we state it, let us make some definitions.

Definition 6.27. A stochastic process is a collection Xt, t T of random variables indexed by some set ∈ T . It is said to be centered if EXt = 0 for all t T . A process is called centered Gaussian if for each ∈ finite collection t1, . . . , tn T , the random vector (Xt ,...,Xt ) is a zero mean Gaussian random vector. ∈ 1 n PM A typical Gaussian Process is given by Xt = i=1 gixi(t), where g = (g1, . . . , gM ) is a standard Gaussian random vector and xj : T R, are arbitrary functions. It is interesting to note that Gaussian process are receiving increasing attention→ in the Machine Learning community and leading to new solu- tions for a wide class of regression and classification problems. One can see [Rasmussen & Williams ’05] for details of how to deal with them in a computational fashion. As in the finite case, Gaussian processes can be generalized to the concept of subgaussian processes. Definition 6.28. Given a stochastic process, we define the pseudometric (two distinct points can have zero distance) 2 1/2 d(s, t) = (E Xs Xt ) , s, t T. | − | ∈ A centered stochastic process Xt is called subgaussian if

2 2 E exp(θ(Xs Xt)) exp(θ d(s, t) /2), s, t T, θ > 0. (6.17) − ≤ ∈ Now, the comparison lemma.

Theorem 6.29. Let X,Y be zero mean Gaussian random vectors on Rm. If

2 2 E Xi Xj E Yi Yj i, j [m], | − | ≤ | − | ∀ ∈ then E max Xj E max Yj. j∈[m] ≤ j∈[m]

2 2 2 2 2 Since we have E Xi Xj = EXi 2EXiXj +EXj , if we have the additional assumption EXi = EYi , | − | − then the hypothesis of the Slepian-Fernique Comparison Lemma is, actually, EXiXj EYiYj. Therefore, as we said in the beginning of this section, the comparison of covariance implies in≥ the comparison of expected maxima of Gaussian vectors. This theorem has a deep generalization for min-max of Gaussian variables disposed in a retangular array. It was proved in [Gordon ’85] and restated in [Gordon ’88].

Theorem 6.30. (Theorem A of [Gordon ’88]): Let Xij,Yij, i [n], j [m] be two families of zero mean Gaussian random variables. If ∈ ∈

2 2 E Xij Xkl E Yij Ykl i = k and j, l, | − | ≤ | − | ∀ 6 2 2 E Xij Xil E Yij Yil i, j, l, | − | ≥ | − | ∀ then E min max Xij E min max Yij. i∈[n] j∈[m] ≥ i∈[n] j∈[m] This comparison lemma was generalized even more. One should look at [Kahane’86] and [Vitale ’00] for some generalizations. Also, [Tong ’80] is a whole monograph dedicated to inequalities for multivariate distributions. There, some interesting comments related to comparisons lemmas can be found. Theorem 6.29 also generalizes to Gaussian processes indexed by infinite sets. If we have that X = 2 2 (Xt)t∈T and Y = (Yt)t∈T are Gaussian processes and if E Xs Xt E Ys Yt for all s, t T , then it also holds that | − | ≤ | − | ∈

E sup Xt E sup Yt, t∈T ≤ t∈T 6.4. COMPARISON OF GAUSSIAN PROCESSES 117

where E supt∈T Xt, in this case, is defined by E supt∈T Xt = sup E supt∈F Xt,F T,F finite . This is called lattice supremum and is used in order to avoid problems with{ nonmensurable⊂ sets. Theorem} 6.30 generalizes in a similar way, using double index Gaussian processes. Despite the intrinsic interest on the comparison theorems, in the field of Compressive Sensing we are highly interested in some of their geometric corollaries. Before that, we need a notion about how to measure the width of a set. It is a measure of the size of a set T in the sense that how well, on average, the vectors in the set T can align with a randomly chosen direction.

Definition 6.31. For a set T RN , we define its Gaussian width by ⊂ `(T ) = E sup g, x , x∈T h i where g RN is a standard Gaussian variable. ∈ m With the aid of this definition and denoting, for a Gaussian vector g R , the mean of its `2-norm by ∈ E g 2 = Em, we have the celebrated Gordon’s escape through the mesh theorem, consequence of Gordon’s Comparison|| || Lemma.

Theorem 6.32. Let A Rm×N be a Gaussian random variable and T be a subset of the unit sphere N−1 N ∈ S = x R , x 2 = 1 . Then, for t > 0, { ∈ || || }   −t2/2 P inf Ax 2 Em `(T ) t e . (6.18) x∈T || || ≤ − − ≤ Its proof can be found in Theorem 9.21 of [Rauhut & Foucart ’13]. It relies on the comparison between two well chosen Gaussian processes and on the use ofTheorem 6.30. The first process is, for x T and m−1 N m ∈ y S , Xx,y = Ax, y while the second is Yx,y = g, x + h, x , for g R and h R independent standard∈ Gaussianh vectors.i A very similar argumenth willi be developedh i in∈ Theorem 7.12∈ . In this dissertation, we aim to understand when a measurement matrix is good in the sense that we can use it to recover the unique sparsest vector it “captures”. Next chapter will deal with random matrices as sensing matrices. In order to determine how good a random matrix is, thinking in terms of the Null Space Property, good matrices will be the ones that avoids certain subsets. This is what Gordon called “escapes a mesh”. For this to happen, of course, these subsets must be small in some sense. The precise sense is given exactly by the concept of Gaussian width. A consequence of Theorem 6.32 that really deserves to be called a escape through the mesh theorem is the next one. Its uses for the problem of sparse recovery were first done by [Rudelson & Vershynin ’08] and general- ized for other convex problems by [Chandrasekaran, Recht, Parrilo & Willsky ’12]. We start by defining what is a Grassmanian.

Definition 6.33. Let V be a finite-dimensional vector space over a field K. The Grassmannian Gk(V ) is the set of all k-dimensional linear subspaces of V . If V has dimension d, then the Grassmannian is also denoted Gk(d). The following useful description is used for the Grassmanian. First, we need some definitions.

Definition 6.34. For E Gk(d), we denote by PE the orthogonal projection onto E. We denote by ∈ M(d, R) the set of all d d real matrices endowed with the Frobenius norm. Also, we define the set × 2 ∗ G(d, k) = P M(d, R) P = P,P = P, rank(P ) = k . { ∈ | } Proposition 6.35. The function f : Gk(V ) G(d, k), E PE is a bijection. The set G(d, k) is a compact metric space. → 7→

With a little more work, this identification allows us to represent Gk(V ) as a submanifold of the space of symmetrical matrices. Therefore, we can study all the interesting properties of the Grassmanian through G(d, k) and then transfer the results by using the function f : Gk(V ) G(d, k). Finally, we can state the geometrical version of the Gordon’s Escape Through the Mesh Theorem.→ 118 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY

Theorem 6.36. (Theorem 3.3 and Corollary 3.4 in [Gordon ’88]): Take a closed subset T of the unit N−1 N N sphere S = x R , x 2 = 1 and let Bε = x R : x 2 ε . If `(T ) < (1 ε)Em εEN , { ∈ || || } { ∈ || || ≤ } − − then an (N m)-dimensional subspace Y drawn uniformly from the Grassmannian GN−m(N) satisfies −   2   7 1 (1 ε)Em εEN `(T ) P Y (S + Bε) = 1 exp − − − . ∩ ∅ ≥ − 2 − 2 3 + ε + εEN /Em

In particular, with ε = 0, we have that if `(S) < Em, then an (N m)-dimensional subspace Y drawn − uniformly from the Grassmannian GN−m(N) satisfies

  7  1  2 P Y S = 1 exp Em `(S) . ∩ ∅ ≥ − 2 − 18 − The main idea of its proof consists in making the identification of the random subspace Y with the null space of a certain m N random matrix A with Gaussian entries. After, we change the original problem of ×   estimating P null(A) (S +B) = into an estimative on P A 2 (1+δ)E A 2 . This last probability is, in turn, controlled∩ by the concentration∅ of measure inequalityk k that≥ we willk developk in the next section. The complete proof (with nice intuitive explanations) can be found at [Mixon’s Blog - 02/08/2014].

6.5 Concentration of Measure

“Now only after Fourier and then by Paul Levy Is rendered easy such a proof; Without their tools perhaps, dear listener, You’d demonstrate it as a tour de force; I’ve tried without success.” Excerpt of “de Moivre”, a poem by Richard Dudley in [Dudley ’75].

The roots of concentration of measure phenomena can be traced back to the works of Émile Borel and Paul Lévy. However, it was not before Vitali Milman’s work on Geometric Banach Space Theory that modern concepts and theorems related to the phenomena emerged. There, he revisited proofs of theorems and provides deep analysis related to spherical sections of convex sets into high dimension normed spaces. The understanding of this phenomena is a deep achievement, as Michel Talagrand points out: “The idea of concentration of measure, which was discovered by Vitali Milman, is unarguably one of the greatest ideas of analysis in our times” [Ledoux - Lecture II]. The quotes’ author is another important character in the history of concentration of measure. In the nineties he established connections between isoperimetric inequalities and concentration inequalities. See section 1.2 of [Boucheron, Lugosi & Massart ’13] for this connection. After these contributions, he was able to solve many open problems on summation of random variables. Loosely speaking, if we have a large number of random variables X1,...,Xn, one might ask what happens with their sum assuming that they are bounded in average. This phenomenon essentially tells us that if we have sufficient independence between them, this sum will be highly concentrated in a much narrower range than the one expected by the classical convergence theorems of Probability. We can quantify this by proving large deviation inequalities, that is, exponential upper bounds for the probability of the sum (or more general functions) of random variables deviates from its mean. This hypothesis of high degree of independence is fundamental. The idea is that since the random variables are independent, it is difficult for them to be “synchronized” in order to deviate its sum too far from its mean. There exist monographs entirely dedicated to this theme such as [Boucheron, Lugosi & Massart ’13] and [Ledoux ’01]. Once again, Terence Tao’s blog entries have elementary and pedagogical insights about the subject [Tao’s Blog - 01/03/2010]. Also, the book [Dubhashi & Panconesi ’12] shows applications of the theory and starts the discussion about the concentration phenomena from scratch. Despite the generality of concentration of measure, in this dissertation, it will have a very special meaning, namely, the guarantees that any L-Lipschitz function of a standard Gaussian random vector, 6.5. CONCENTRATION OF MEASURE 119 regardless of the dimension, exhibits concentration like a scalar Gaussian variable with variance L2. It was first proved by [Tsirelson, Ibragimov & Sudakov ’76] using arguments based on stochastic calculus.

Theorem 6.37. Let f : Rn R be a Lipschitz function with constant L > 0, i.e. → n f(x) f(y) L x y 2 x, y R , | − | ≤ || − || ∀ ∈ and g be a standard Gaussian random vector. Then, for all t > 0

 2      t P f(g) E f(g) t exp , − ≥ ≤ − 2L2 and consequently

 2      t P f(g) E f(g) t 2 exp . − ≥ ≤ − 2L2 Before we develop all the required tools to prove this theorem, we must say that the constant 2 inside the exponential is optimal. This follows from the case n = 1, f(x) = x and the lower tail bound for the standard Gaussian stated in Proposition 6.2. Remark 36. It is important to note that without any additional structure for the function f (such as convexity), dimension-free concentration for Lipschitz functions need not hold for an arbitrary subgaussian distribution. One can consult [Ledoux ’01] for further results. The original proof uses the idea of symmetrization, explored in Example 6.15 and Laplace transform method, that is, to bound φ(θ) = E exp(θ(f(X) E[f(X)])). It is less laborious than the approach adopted here, but is has the disadvantage that it does not− provide optimal constants for concentration. Instead, we will rely on information theory methods, known as Herbst argument after [Davies & Simon ’84]. First, we need the definition of entropy of a random variable. Definition 6.38. For a nonnegative random variable X, we define its entropy as

(X) = E[φ(X)] φ(E(X)) = E[X ln X] E(X) ln(E(X)), E − − where φ is defined, for x > 0, as φ(x) = x ln(x) and extends continuously to x = 0 by φ(0) = 0. If the first term in infinite, then we set (X) = . E ∞ It follows from the convexity of φ and by Jensen inequality that (X) 0. Also, entropy of a random variable is homogeneous of zeroth order, (tX) = t (X), by definition.E Now,≥ we are able to explain the entropy method for obtaining concentrationE inequalities.E

Remark 37. (Herbst Argument): We have a method, attributed to Ira Herbst4 to derive concentration inequalities using information theory ideas. It is also called entropy method and it became highly pop- ular after the work of Michel Ledoux on Talagrand’s inequalities. See [Tao’s Blog - 09/06/2009] and [Ledoux ’01]. The idea is to derive a bound on the entropy of eθX , for θ > 0 of the form (eθX ) E ≤ g(θ)E[eθX ], for some function g. If we define F (θ) = E[eθX ], this inequality will be equivalent to

(eθX ) θF 0(θ) F (θ) ln F (θ) g(θ)F (θ). (6.19) E ≤ − ≤ Also, if we set G(θ) = θ−1 ln F (θ), then Equation (6.19) reduces to G0(θ) θ−2g(θ). Moreover, we can −1 0 ≤ note that G(0) = limθ→0 θ ln F (θ) = F (0)/F (0) = E[X]. Therefore, integrating on both sides yields Z θ Z θ 0 −2 G(θ) G(0) = G(θ) E[X] = G (t)dt t g(t)dt, − − 0 ≤ 0 4This argument about how to derive concentration inequalities from Logarithm Sobolev Inequalities appeared for the first time in Appendix A of [Davies & Simon ’84]. 120 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY which implies that

 Z θ  θ(X− [X]) θ[G(θ)− (x)] −2 E[e E ] = e E exp θ t g(t)dt , θ > 0. ≤ 0 To conclude, we simply uses Markov’s inequality to derive tail bounds for X E[X]. −

Herbst argument together with logarithmic Sobolev inequality for Gaussian vectors are the main core of our proof of measure concentration. Before we turn to it, we need to develop some information theory results, such as the dual characterization of entropy. Let us begin with a basic convex inequality.

Lemma 6.39. For a given function f : RN ( , ] we have → −∞ ∞ ∗ N x, y f(x) + f (y) for all x, y R , h i ≤ ∈ ∗ N ∗ where f : R ( , ] is the convex conjugate of f defined by f (y) = supx∈ N x, y F (x) . For → −∞ ∞ R {h i − } the particular case f(x) = exp(x), it reads xy ex + y ln y y for all x R, y > 0. ≤ − ∈ Proof. The inequality is obvious from the definition. For the particular case of f(x) = ex, we have that the function x xy exp(x) takes its maximum at x = ln y if y > 0 and this gives 7→ −  y ln y y if y > 0,  − f ∗(y) = sup xy ex = 0 if y = 0, x∈ { − } R  if y < 0. ∞ From this convex conjugate, the inequality follows immediately.

Using this inequality, we are able to prove a dual characterization for entropy which will be very useful

Lemma 6.40. Let X be a nonnegative random variable satisfying E[X] < . Then ∞ (X) = sup E[XY ]: E[exp(Y )] 1 . E { ≤ } Proof. Let us prove, first, the case of a strictly positive random variable X. We may assume that EX = 1 due to the homogeneity of the entropy. By Lemma 6.39, for a random variable Y satisfying E[exp(Y )] 1, we have ≤

E[XY ] E[exp(Y )] + E[X ln X] E[X] E[X ln X] = (X). ≤ − ≤ E The next step is to prove the converse direction, that is, that (X) E[XY ] for some Y . In order E ≤ to do this, we just simple choose Y = ln X ln(EX). By the definition of , this choice satisfies − E (X) = E[XY ] and also E exp(Y ) = E[X] exp( ln(EX)) = 1. Therefore we have attained the supremum andE this concludes the theorem for positive variables.− For the general nonnegative case, we just use an approximation argument and the continuity of φ(x) = x ln x.

We have two useful consequences of the above dual characterization. Another characterization using positive random variables and the subadditivity of entropy.

Corollary 6.41. For a nonnegative variable X, we have

(X) = sup E[X ln(Z)] E[X] ln(E[Z]) : Z > 0 . E { − } Also, for two nonnegative random variables X and Y , we have (X + Y ) (X) + (Y ). E ≤ E E Proof. For the first part, we just make the substitution Y = ln(Z/E(Z)) for a positive random variable Z. The second part follows from the properties of supremum and expected value. 6.5. CONCENTRATION OF MEASURE 121

It is time to introduce some notation. For a sequence of random variables X = (X1,...,Xn) and any i index i [n], we define X = (X1,...,Xi−1, Xˆi,Xi+1,...,Xn) = (X1,...,Xi−1,Xi+1,...,Xn), that is, ∈ we have the exclusion of the variable Xi from our sequence. Also, for a function f of X, we will write

i EXi f(X) = EXi [f(X1,...,Xi,...,Xn)] = E[f(X X )]. | for the conditional expectation. As one can note, this is a function of X1,...Xi−1,Xi+1,...,Xn but it is constant with respect to the variable Xi. From this, we can define the conditional entropy.

Definition 6.42. For a vector X = (X1,...,Xn) and a function f(X) of this vector, the conditional entropy is defined by

i  Xi (f(X)) = f(X X ) = EXi (φ(f(X))) φ(EXi (f(X))) E E | −

= EXi [f(X) ln f(X)] EXi [f(X)] ln(EXi [f(X)]). − Now we have developed all the basic tools in order to start the laborious series of theorems that will finally lead to Theorem 6.37. The first theorem we will prove is called tensorization inequality. It enables us to bound the entropy of a function of several variables by the sum of the entropies of this function in terms of the individual variables.

Theorem 6.43. Let X = (X1,...,Xn) be a vector of independent random variables and let f be a nonnegative function satisfying E[f(X)] < . Then ∞ n  X  (f(X)) E Xi (f(X)) . E ≤ E i=1 Proof. We will prove the theorem for strictly positive f. As in Lemma 6.40, for the general case we just use that function φ is continuous at 0 and use an approximation argument. Let us define the i i conditional expectation operator E given by E [f(X)] = EX1,...,Xi−1 [f(X)] = E[X Xi,...,Xn]. This | operator integrates out the dependence on the first i 1 variables. Of course we have E1[f(X)] = f(X) − and En+1[f(X)] = E[f(X)]. Next, a smart telescopic decomposition leads to

n X i i+1 ln(f(X)) ln(E[f(X)]) = (ln(E [f(X)]) (ln(E [f(X)])). (6.20) − − i=1

i Now we use the characterization given by Corollary 6.41 with X = f(X), Z = E [f(X)] > 0 and E = EXi .

h  i i i EXi f(X) ln(E [f(X)]) ln EXi [E [f(X)]] Xi (f(X)). − ≤ E Using the independence and Fubini’s theorem to interchange the order of integration yields

i+1 EXi [E[f(X)]] = EXi EX1,...,Xi−1 [f(X)] = E [f(X)]. To finish, just take Equation (6.20), multiply it by f(X) and take expectation on both sides. This finally leads to

  (f(X)) = E f(X) ln(f(X)) ln(E[f(X)]) = E − n n X h h i i ii X E EXi f(X) ln(E [f(X)]) ln(EXi [E [f(X)]]) E[ Xi (f(X))]. − ≤ E i=1 i=1 122 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY

There is a deep connection between concentration of measures and functional inequalities such as the logarithmic Sobolev inequality. These inequalities are ubiquitous in many areas such as Partial Differential Equations, Information Theory and Optimal Transport among others beyond, of course, Probability Theory. In the latter, it has many connections, not only with concentration inequalities but also with mixing times of Markov Chains. It is interesting to mention that in Information Theory, this type of inequality shows a relation between entropy and Fisher information [Kitsos & Tavoularis ’09]. They were also used for the celebrated proof of Poincaré’s Conjecture by G. Perelman. See [Tao’s Blog - 24/04/2008] for further information. The next theorem is an example of a log-Sobolev inequality for the case of Rademacher vectors.

Theorem 6.44. Let f : 1, 1 n R be a real-valued function and  be an n-dimensional Rademacher vector. Then {− } →

 n  2 1 X (i) 2 (f ()) E (f() f( ) , (6.21) E ≤ 2 − i=1

(i) where  = (1, . . . , i−1, i, i+1, . . . , n) is obtained from  by flipping the ith entry. − 2  Pn 2  Proof. By Theorem 6.43, we have (f ()) E i=1 i (f ()) . Therefore, we just need to prove that, for each i [n], E ≤ E ∈    2  1 (i) 2 i f () Ei f() f( ) . (6.22) E ≤ 2 −

(i) For any realization of (1, . . . , i−1, i+1, . . . , n), f() and f( ) can only have two possible values. These values will be denoted by a and b R, respectively. Using the definition of entropy, inequality (6.22) is the same as the following scalar inequality∈

1 a2 + b2 a2 + b2  1 a2 ln(a2) + b2 ln(b2) ln (a b)2. 2 − 2 2 ≤ 2 − Thus, it remains to prove this elementary inequality. In order to do this, let us define, for a fixed b, the function 1 a2 + b2 a2 + b2  1 H(a) = a2 ln(a2) + b2 ln(b2) ln (a b)2. 2 − 2 2 − 2 − A tedious computation shows that the first and the second derivative of H(a) are given by

 2a2   2a2  2a2 H0(a) = a ln (a b) and H00(a) = ln + 1 . a2 + b2 − − a2 + b2 − a2 + b2 The first thing to note is that H(b) = 0 and H0(b) = 0. Also, since ln x x 1, we obtain H00(a) 0 ≤ − ≤ for all a R. This implies that H is concave and then H(a) 0 for all a R. This establishes the theorem. ∈ ≤ ∈

The right-hand side of equation (6.21) can be interpreted as a gradient in a discrete version. If we try to prove a similar inequality for continuous variables, such as Gaussian variables, then an authentic gradient appears. This is the content of the Gaussian logarithm Sobolev inequality, proved by [Gross ’75]. As pointed by [Ledoux - Lecture I], nowadays we have more than 15 different proofs of this inequality.

Theorem 6.45. (Essentially Corollary 4.2 of [Gross ’75]): Let f : Rn R be a continuously differentiable → function satisfying E[φ(f 2(g))] < for a standard Gaussian vector g on Rn. Then ∞ 2 h 2i (f (g)) 2E f(g) . E ≤ ||∇ || 6.5. CONCENTRATION OF MEASURE 123

Proof. The initial step is to prove the theorem for n = 1 and g being a standard Gaussian random variable. After, we will prove a general version. Even more, we will start by taking f with compact 0 0 0 0 support. Since f is uniformly continuous, its modulus of continuity ω(f , δ) = sup|t−s|≤δ f (t) f (s) 0 | − | satisfies ω(f , δ) 0 as δ 0. Let  = (1, . . . , m) be a Rademacher vector and set → → m 1 X Sm = j. √m j=1

The idea is to use the Theorem 6.44, for Rademacher vectors, and the fact that Sm converges in distri- bution to a standard normal random variable by the Central Limit Theorem. Equation (6.21) yields

 n   n  2 2 1 X (i) 2 1 X  2i  (f (Sm)) E (f() f( ) = E f(Sm) f Sm . E ≤ 2 − 2 − − √m i=1 i=1 Notices that for each i [m] we have ∈ Z S  2i  2i 0 0 0 f(Sm) f Sm = f (Sm) + √ (f (t) f (Sm))dt − − √m √m Sm−2i/ m −

2 0 2  0 2  f (Sm) + ω f , . ≤ √m| | √m √m It follows that

m  2  2 X  2i  0 2 0  0 2   0 2  f(Sm) f Sm 4 f (Sm) + 2 f (Sm) ω f , + ω f , . − − √m ≤ | | √m √m i=1 0 0 2 0 2 Using the boundedness of f and f , the central limit theorem implies that E[f (Sm) ] E[f (g) ] and 0 2 0 2 → [f (Sm) ] [f (g) ] as m . Therefore, we conclude the inequality E → E → ∞ 2 0 2 (f (Sm)) 2E[f (g) ]. E ≤ Now we turn to the general case where f does not have compact support. For a small ε > 0 given, the hypothesis E[φ(f 2(g))] < ensure the existence of T > 0 such that for any subset I of R [ T,T ], ∞ \ −

1 Z 2 1 Z 2 φ(f(t)2) e−t /2dt ε and e−t /2dt ε. √2π I | | ≤ √2π I ≤ Considering a continuously differentiable function h satisfying 0 h(t) 1 for all t R, h(t) = 1 for all t [ T,T ] and h(t) = 0 for t [ T 1,T + 1]. We then set f≤˜(t) = ≤f(t)h(t). This∈ is a continuously differentiable∈ − function with compact6∈ − support.− The result just proved above applies to this new function and then gives

2 0 2 (f˜ (g)) 2E[f˜ (g) ]. (6.23) E ≤ The subadditivity of the entropy, Corollary 6.41, provides   (f 2(g)) = f˜2(g) + f 2(g)(1 h2(g)) (f˜2(g)) + (f 2(g)(1 h2(g)). E E − ≤ E E − 2 2 Introducing the sets I1 = t R : t T, f(t) < e and I2 = t R : t T, f(t) e , we can estimate the second term of{ the∈ right| | hand ≥ side after some} calculations{ ∈ | | ≥ ≥ }

Z Z Z 2 2 1 2 2 −t2 1 2 −t2 1 2 −t2 E[f (g)(1 h (g))] = f (t)(1 h (t))e dt f (t)e dt + f (t)e dt √ √ √ − 2π R\[−T,T ] − ≤ 2π I1 2π I2 124 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY

e Z 2 1 Z 2 e−t dt + φ(f 2(t))e−t dt (e + 1)ε. (6.24) ≤ √2π I1 √2π I2 ≤ For sufficiently small x, e.g., x < 1/e, the function φ(x) = x ln x is increasing. Applying φ to both sides  2 2  | | | | | | of Equation 6.24 leads to φ E f (g) 1 h (g) φ((1+e)ε) for a sufficiently small ε. With the aid of −2 2 ≤ 2 2 the auxiliary sets I3 = t R : t T, f (t)(1 h (t)) < e and I4 = t R : t T, f (t)(1 h (t)) e and denoting κ = max{ ∈ φ(|t)| ≥= e−1, we have− } { ∈ | | ≥ − ≥ } t∈[0,e] | | Z Z 2 2 κ −t2/2 1 2 2 −t2/2 E[φ(f (g)(1 h (g)))] e dt + φ(f (g)(1 h (g)))e dt − ≤ √2π I3 √2π I4 − ≤

1 Z 2 κε + φ(f 2(g))e−t /2dt (κ + 1)ε. √2π I4 ≤ The triangular inequality in the definition of entropy leads to

2 2   2 2   2 2  f (g)(1 h (g)) φ E f (g)(1 h (g)) + E φ f (g)(1 h (g)) φ((e + 1)ε) + (κ + 1)ε. E − ≤ − − ≤ | | 2 0 2 And using Equation (6.23), we derive f (g)) E[f˜ (g) ] + (1 + κ)ε + φ((1 + e)ε) . Besides, applying again the triangular inequality, we obtainE ≤ | |

0 2 1/2 0 0 2 1/2 0 2 1/2 0 2 1/2 E[f˜ (g) ] = E[(f h + fh )(g) ] E[(f h)(g) ] + E[(fh )(g) ] . ≤ Now it is time to analyze both terms of the right-hand side above. For the first one, we simple note that E[(f 0h)(g)2] = E[f 0(g)2h(g)2] E[f 0(g)2]. For the second one, using estimate (6.24), it holds that ≤

Z ∞ 0 2 Z 0 2 1 2 0 2 −t2/2 h ∞ 2 −t2/2 0 2 E[(fh )(g) ] = f(t) h (t) e dt || || f(t) e dt (e + 1) h ∞ε. (6.25) √2π −∞ ≤ √2π I1∪I2 ≤ || ||

˜0 2 0 2 1/2 0 2 When we put all these steps together, we conclude that E[f (g) ] E[f (g) ] + h ∞((e + 1)ε) . Hence, ≤ || ||

2 2  0 2 1/2 0 1/2 (f (g)) 2 E[f (g) ] + h ∞((e + 1)ε) + φ((1 + e)ε) + (κ + 1)ε. E ≤ || || | |

This result is valid for any ε > 0 and since limt→0 φ(t) = φ(0) = 0, we can conclude

2 0 2 (f (g)) E[f (g) ]. E ≤ The next step is to go from the one dimensional case to the general case. This will be done via the tensorization inequality, Theorem 6.43. With its aid, we obtain

" n # " n  2# 2 X 2 X ∂f  2 (f (g)) E gi (f (g)) 2E (g) = 2E f(g) 2 . E ≤ E ≤ ∂x ||∇ || i=1 i=1 i

The main tools are in our hands. After all this effort, we can prove the (optimal) celebrated concen- tration of measure inequality for Lipschitz functions.

Proof. (of Theorem 6.37): First of all, we will assume that the function f is differentiable. Since f n is Lipschitz with constant L. by hypothesis, this implies f(x) 2 L for all x R . Applying Theorem 6.45 to eθf/2 leads to ||∇ || ≤ ∈

2 2 2 θf(g)/2 h θf(g)/2 2i θ h θf(g) 2i θ L h θf(g)i (e ) 2E e /2 = E e f(g) 2 E e . (6.26) E ≤ ∇ 2 2 ||∇ || ≤ 2 6.5. CONCENTRATION OF MEASURE 125

Here we stop for a moment in order to state an important comment. In the hypothesis of Theorem 6.37, we must have E[φ(f 2(g))] < . In our case, this translates to E[φ(e2θf(g))] < . This holds by the ∞ ∞ Lipschitz hypothesis, since e2θf(g) e2θ|f(0)|eLθ||x||2 for all x Rn. If one looks carefully, Inequality (6.26) is exactly what you would like≤ to have to apply Herbst argument.∈ This, in turn, implies

 Z θ   θ(f(g)− [f(g)]) −2 2 2 E e E exp θ t g(t)dt = exp(θ L /2). ≤ 0 Using Markov’s inequality, we conclude

θ(f(g)− [f(g)]) −θt+θ2L2/2 −t2/(2L2) P(f(g) E[f(g)] t) inf exp( θt)E[e E ] inf e = e . − ≥ ≤ θ>0 − ≤ θ>0 The last step follows because the infimum is attained at θ = t/L2. If we replace f by f, we have the same bound. Therefore, a union bound argument yields −

 2      t P f(g) E f(g) t 2 exp , − ≥ ≤ − 2L2 for f differentiable. In the general case of not necessarily differentiable f, for each ε > 0, we can find a differentiable Lipschitz function g with the same Lipschitz constant such that f(x) h(x) ε for all | − | ≤ x Rn. A proof of this classical analytical fact can be found at Appendix C of [Rauhut & Foucart ’13]. ∈ −(t−2ε)2/(4L2) P(f(g) E[f(g)] t) P(h(g) E[h(g)] t 2ε) e . − ≥ ≤ − ≥ − ≤ But since ε > 0 can be made arbitrarily small, we have the result for general f. Again, replacing f by f and using a union bound, the theorem follows. − This deep theorem is useful to estimate a lot of things. Also, since all norms in Rn are Lipschitz functions, we can use it to estimate geometrical notions like the Gaussian width. Here we provide a simple example. 2 Example 6.46. Let us understand the concentration of the χ -distribution. If Xk N(0, 1) are n Pn 2 2 ∼ independent standard normal random variables, then Y = i=1 Xi will be a χ random variable with n degrees of freedom (see section 8.2 of [DeGroot & Schervish ’11]). We define the variable Z = √Y/√n = (X1,...,Xn) 2/√n. Since the `2-norm is a 1-Lipschitz function, Theorem 6.37 implies that || ||  −nδ2/2 P Z E(Z) + δ e , δ 0. ≥ ≤ ∀ ≥ Since the square root function is concave, by Jensen’s inequality we have  n 1/2 p 2 1 X 2 E(Z) E(Z ) = E(Xi ) = 1. ≤ n i=1 Now we remember that Z = Y/√N to conclude

2 −nδ2/2 P Y/n (1 + δ) e , δ 0. ≥ ≤ ∀ ≥ Since (1 + δ)2 = 1 + 2δ + δ2 1 + 3δ for all δ [0, 1], making the substitution t = 3δ leads to ≤ ∈  −nt2/18 P Y n(1 + t) e , t [0, 3]. ≥ ≤ ∀ ∈ The same kind of concentration inequality holds for suprema of Gaussian processes. Its proofs can be found at Theorem 5.8 of [Boucheron, Lugosi & Massart ’13]. To close this section, we state it.

Theorem 6.47. Let (Xt)t∈T be an almost surely continuous centered Gaussian process, that is, with 2 2 probability 1, Xt is a continuous function of t, indexed by a totally bounded set T . If σ = supt∈T E[Xt ], 2 then Z = sup Xt satisfies Var(Z) σ , and for all u > 0, we have t∈T ≤   −u2/(2σ2) P Z E[Z] u e . − ≥ ≤ 126 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY

6.6 Covering and Packing Numbers

“Mit den Grenzen der Wissenschaften gegeneinander verhält es sich ähnlich, wie mit denen der Meere. Sie sind künstlich und aus praktischen Gründen angenommen. Alles Wissen steht miteinander im Zusammenhange, und eine Untersuchung abzubrechen, um nicht eine solche Grenze zu überschreiten, wäre nicht zu rechtfertigen.5” Postcard from Gottlob Frege to Heinrich Rickert in 1 July 1911.

This section has, in a first moment, no clear relation with probability. Here we will define two important notions for metric spaces that will be used in subsequent chapters: covering numbers and packing numbers. However, in a second moment, these notions - together with probabilistic techniques - proved to be fundamental to demonstrate important theorems [Pisier ’89]. Also, they play a fundamental role in understanding the behavior of stochastic processes. Usually one wants to infer some properties of a stochastic process Xt, with t Λ. Knowing the structure of the set Λ and one might ask the dependence ∈ of Xt on Λ. An example of such phenomenon is Dudley’s Theorem in Stochastic Process. Besides, in many situations, we want to manipulate and quickly compute some properties of a data set. This set is sometimes modeled as a (compact) metric space and this manipulation could be translated as the use of “sparse”, typically discrete, object. The role of this object is to approximately capture the geometry or complexity of a (subset of) a metric space. It is in this context that the concepts of covering number and packing number arise. They are highly used, for example, in Error-Corrector Coding [Candès & Randall ’08], Quantization [Boufonos ’12] and Geometric Functional Analysis [Pisier ’89].

Definition 6.48. Let X be a subset of a metric space (M, d). For t > 0, the covering number Nε = N(X, d, ε) is the smallest N such that X can be covered with balls of radius ε, that is, we cover X with balls Bε(x) = x M, d(x, xi) ε . The logarithm of the covering number is called the metric entropy of the metric space{ ∈ X. ≤ }

Definition 6.49. The packing number P (X, d, ε) is defined, for ε > 0, as the maximal integer P such that there are points xi X, i = 1,...,P which are ε-separated, i.e., d(xi, xk) > ε for all i, k 1,...,P , i = k. ∈ { } ∈ { } 6 Proposition 6.50. We have some basic properties for the covering number. The packing number will also satisfy them. a.) For arbitrary sets X,Y M, we have N(X Y, d, ε) N(X, d, ε) + N(Y, d, ε). ⊂ ∪ ≤ b.) For any α > 0, we have N(X, αd, ε) = N(X, d, ε/α).

c.) If M = Rn and d is a metric induced by a norm . , then N(αX, αd, ε) = N(X, d, ε/α). || || d.) If d˜ is another metric on M that satisfies d˜(x, y) d(x, y) for all x, y X, then ≤ ∈ N(X, d,˜ ε) N(X, d, ε). ≤ Proof. All of them follow from distance properties such as triangular inequality as well as the definition of the covering/packing number.

The two notions are related, as the next theorem shows.

Theorem 6.51. Let X be a subset of a metric space (M, d) and let t > 0. Then

P (X, d, 2ε) N(X, d, ε) P (X, d, ε) ≤ ≤ 5In a free translation: The boundaries between the disciplines of science are somewhat like those between the seas. They are artificial and adopted for practical purposes. All knowledge is interrelated, and it would be unjustifiable to abort an investigation in order not to cross such a boundary. The fac simile version of the postcard and more details about it can be found at [Schlotter & Wehmeier ’13]. 6.6. COVERING AND PACKING NUMBERS 127

Proof. In order to prove the right-hand side, let P = P (X, d, ε). Then there exists P points x1, . . . , xP { } that are ε-separated in X. By the maximality of P , it follows that any other x X satisfies d(x, xi) ε m ∈ ≤ for some index i. Thus X Bε(xi) and so N(X, d, ε) P (X, d, ε). ⊆ ∪i=1 ≤ For the left-hand side, we just need to use the pigeonhole principle. Let P = P (X, d, 2ε). Then there exists P points x1, . . . , xP X that are 2ε-separated, i.e., d(xi, xj) > 2ε if, i = j. Suppose, by contradiction, that P{ > N(X, d,} ε) ⊂, then X is covered by fewer than P balls of radius6 ε. So, at least two of the P distinct points xi and xj must lie in the same ball Bε(˜x) with a certain center x˜. By the triangle inequality, we have d(xi, xj) d(xi, x˜) + d(xj, x˜) But this is a contradiction since xi and xj are 2ε-separated. Therefore P N(X, d,≤ ε). ≤

The next question one might ask is how to calculate these quantities for certain sets on metric spaces. Usually this is a difficult problem and one can just provide some estimates instead of a closed formula. The set of notes [Wainwright ’15] provides many examples and a nice explanation about these quantities and how to use them. Here, we state just one example that will be useful in the following chapters.

Proposition 6.52. Let . be some norm on Rn and let U be a subset of the unit ball B = x || || { ∈ Rn, x 1 . Then the packing and the covering numbers satisfy, for ε > 0, || || ≤ }  2n N(U, . , ε) P (U, . , ε) 1 + . || || ≤ || || ≤ ε n In the case that U = B, we have the lower estimate 1  N(U, . , ε). 2ε ≤ || || Proof. The inequality N(U, . , ε) P (U, . , ε) is just Theorem 6.51. For the other inequality, let || || ≤ || || x1, . . . , xP U be a maximal ε-packing. This tells us that the balls B(xi, ε/2) have empty intersection {and, even more,} ⊂ they are contained in the ball (1 + ε/2)B. What remains is to compare the Lebesgue measure (aka volume) of these balls.

P ! P [ X  vol B(xi, ε/2) = vol(B(xi, ε/2)) = P vol((ε/2)B) vol (1 + ε/2)B . ≤ i=1 i=1

Also, we have the homogeneity of volumes in Rn which tells us that vol(εB) = εnvol(B). Therefore we conclude that P (ε/2)nvol(B) (1 + ε/2)nvol(B). Dividing both sides by (ε/2)nvol(B) leads to P (1 + 2/ε)n. ≤ ≤ Now, for the case U = B, in view of Proposition 6.52, we just need to prove that if P = P (B, . , ε), then P (1/ε)n since this implies in N(B, . , ε) P (B, . , 2ε) (1/2ε)n. If P = P (B, . , ε),|| there|| must ≥ || || ≥ || || ≥ || || exists P points x1, . . . , x that are ε-separated in B. By the maximality of P , it follows that any point { P } P PP x B satisfies d(x, xi) ε foor some i. Hence, B vol(Bε(xj)) and vol(Bε(xj)) vol(B). ∈ ≤ ⊂ ∪j=1 j=1 ≥ By the homogeneity of the volume, we have P εn 1. This leads to P (1/ε)n as we wanted to show. ≥ ≥ These definitions allow us to reduce the complexity of the calculations of norm operators, for example. The next theorem illustrates this fact. In order to calculate it, it is necessary to maximize over the whole sphere Sn−1. Using coverings, we replace by the maximum over a finite set.

n−1 Theorem 6.53. Let A be an N m matrix, and let Nε be an ε-covering of S for some ε [0, 1). Then × ∈ −1 sup Ax 2 A 2→2 (1 ε) sup Ax 2 x∈Nε || || ≤ || || ≤ − x∈Nε || || In the case of a symmetric n n matrix, we have × −1 A 2→2 = sup Ax, x (1 2ε) sup Ax, x || || x∈Sn−1 |h i| ≤ − x∈Nε |h i| 128 CHAPTER 6. INTERLUDE: NON-ASYMPTOTIC PROBABILITY

Proof. In the first case, the lower bound follows from the definition of an operator’s norm. For the n−1 upper bound, let us fix x S which attains the norm, i.e., A 2→2 = Ax 2. Now, choose y in the ∈ || || || || ε-covering which approximates x as x y 2 ε. Using the triangle inequality, we have Ax Ay 2 || − || ≤ || − || ≤ A 2→2 x y 2 ε A 2→2. It follows that || || || − || ≤ || ||

Ay 2 Ax 2 Ax Ay 2 Ax 2 A 2→2 x y 2 A 2→2 ε A 2→2 = (1 ε) A 2→2. || || ≥ || || − || − || ≥ || || − || || || − || ≥ || || − || || − || || Taking the maximum over all y in the ε-covering in this inequality completes the proof. For the second n−1 statement, the proof is quite similar. Take x S for which A 2→2 = Ax, x and choose y in the ∈ || || |h i| ε-covering which approximates x as x y 2 ε. Again, by the triangle inequality we have || − || ≤

Ax, x Ay, y = Ax, x y + A(x y), y h i − h i h − i h − i A 2→2 x 2 x y 2 + A 2→2 y 2 x y 2 2ε A 2→2. ≤ || ||| || || || − || || ||| || || || − || ≤ || ||| It follows that Ay, y Ax, x 2ε A 2→2 = (1 2ε) A 2→2. Taking the maximum over all y in the ε-covering in|h thisi| inequality ≥ |h yieldsi| − the|| ||| desired result.− || ||| For generalizations related to the results of this section, one can consult Chapter 4 of [Pisier ’89]. Chapter 7

Matrices Which Satisfy The RIP

The 8P Rule: Proper Prior Planning and Preparation Prevents Piss Poor Performance. One of the many variations of a US Marine Corps adage

7.1 Introduction

In Chapter 5 we discussed the importance of having matrices that satisfy the restricted isometry prop- erty. For linear systems associated to these matrices, the search for sparse vectors is successful through various algorithms, such as BP, IHT, OMP, COSaMP, etc. As a consequence of Lemma 5.21 from [Cai, Wang & Xu I ’10], we deduced, in the case of `1 minimization, that δ2s < 0.6246 suffices for a robust and stable reconstruction. However, there is little hope in improving this bound and make δ 1, due to Theorem 5.27, proved by [Davies & Gribonval ’09]. Nevertheless this quantity is very important→ since it can ensure the existence of matrices with the small (and optimal) number of measurements. This optimality will be explored through this chapter with the aid of probabilistic arguments. However, as we saw in Section 4.10, so far we can only conclude the existence of matrices satisfying RIP with a lower bound in the number of rows given by m Cs2. Considering that the vectors we seek when solving the linear system have information only in s directions,≥ it was expected that the number of measurements should vary linearly, and not quadratically, in s, the sparsity. This is an indication that there could be another way of constructing matrices with δs small enough but, at the same time, with m significantly smaller than Cs2. In this chapter will use the power of the probability theorems developed in Chapter 6, such as concentration of measure and Gordon’s Lemma, to prove the existence of such matrices when m Cδs s ln(N/s). This impressive result will be proven to be sharp in Chapter 8. In the literature it is≥ common to say thatt “m varying linearly with s up to a logarithmic factor.” Such kind of probabilistic argument has also played a central role in many fields. Random matrices and projections have been important tools in some fields like asymptotic theory of finite-dimensional normed spaces and convex geometry since the seventies. Milman, Schechtman, Szarek, Gluskin, Garnaev, Kashin, Rudelson and Vershynin can be mentioned in the development of these ideas. It is important to note that Rudelson and Vershynin introduced a lot of probabilistic techniques into the field of Compressive Sensing. For more on these, see [Schechtman ’03], [Artstein-Avidan, Giannopoulos & Milman ’15] and references therein. Some ideas from these fields appear in modern Data Science techniques. We can cite the Johnson- Lindenstrauss Lemma, which essentially states that with high probability, Lipschitz projections do not disturb the geometry of a point cloud when they are projected onto a space of dimension logarithmic in the number of points. We will see that ideas from compressive sensing appear, in some hidden form, in the development of this profound theorem. It will be shown that there exists a deep connection between the restricted isometry property and dimensionality reduction through the JL Lemma. Also, we can understand the idea behind the motto random projections act as almost norm preserving

129 130 CHAPTER 7. MATRICES WHICH SATISFY THE RIP for some subsets of the sphere, used nowadays in the Machine Learning community. In this chapter, we introduce an ensemble of matrices for which the restricted isometry property holds with high probability and we show the importance of Johnson-Lindenstrauss’ Lemma and its connections with the Restricted Isometry Property.

7.2 RIP for Subgaussian Ensembles

While developing the probabilistic tools on the last chapter, subgaussian random variables emerged as a generalization of Gaussian variables, preserving many good properties. We will now consider matrices A Rm×N having random variables in each of its entries. These are called random matrices or a random matrix∈ ensemble. Definition 7.1. Let A Rm×N be a random matrix [i.] If the entries of A∈are independent Rademacher variables, A is called a Bernoulli random matrix

[ii.] If the entries of A are independent standard gaussian random variables, A is a Gaussian random matrix.

[iii.] If the entries of A are independent zero mean subgaussian random variables with variance 1 and same subgaussian parameters β and κ, i.e.,

−κt2 P( Ai,j t) βe t > 0, i [m], j [N], | | ≥ ≤ ∀ ∈ ∈ A is called a subgaussian random matrix. We showed that Gaussian and Rademacher random variables are particular cases of subgaussian variables. The same holds for matrices: Gaussian and Bernoulli matrices are subgassian matrices. Besides, note that entries of a subgassian matrix do not have to be identically distributed. √1 We are interested in the RIP constant of A. Henceforth we will work with m A instead of A because " 2 # 1 2 E Ax = x 2 √m 2 || || for any vector x and a subgaussian matrix A (since all entries have variance 1). This is the same as saying that the expectation of the squared `2-norm of the columns of A is 1. In this case, the RIP constant δs is a measure of the deviation of √1 Ax 2 from its mean, uniformly over all s-sparse vectors x. || m ||2 Our main result in the first part of this chapter, due to [Baraniuk et al. ’08] and the seminal work of [Mendelson, Pajor & Tomczak-Jaegermann ’08], is the following. Theorem 7.2. Let A Rm×N be a subgaussian random matrix and ε > 0. Suppose ∈ m Cδ−2(s ln(eN/s) + ln(2ε−1)) ≥ for a constant C > 0 (depending only on the subgaussian parameters β, κ). Then, with probability at least 1 1 ε, δs, the restricted isometry constant of √ A satisfies δs δ. − m ≤ In fact we will prove a more general result. For this, we need some definitions. Definition 7.3. Let Y be a random vector in RN .  2 2 N [i.] If E Y, x = x x R , then Y is called isotropic. |h i| || || ∀ ∈ N [ii.] If, x R with x 2 = 1, the random variable Y, x is subgaussian with a subgaussian parameter c∀independent∈ of x||, that|| is, h i

  2 E exp(θ Y, x ) exp(cθ ), θ R and x with x 2 = 1, h i ≤ ∀ ∈ ∀ || || then Y is called a subgaussian random vector. 7.2. RIP FOR SUBGAUSSIAN ENSEMBLES 131

We quickly realize that the constant c should ideally be independent of N. But note that if Y is a randomly selected row or column of an orthogonal matrix, then Y is isotropic and subgaussian but c has a dependence in N. We now state the more general result we will prove. Theorem 7.4. Let A Rm×N be a random matrix with independent, isotropic and subgaussian rows with the same subgaussian∈ parameter c in the definition above. If

m Cδ−2(s ln(eN/s) + ln(2ε−1)), ≥ where C = 2(4β + 2κ)/3κ2 and β, κ are some parameters depending only on c, then with probability at 1 least 1 ε, the restricted isometry constant of √ A satisfies δs δ. − m ≤ The proof of this theorem relies on a concentration inequality which, in turn, is a consequence of Bernstein’s inequality for subexponential random variables, proved in the last chapter. Lemma 7.5. Let A Rm×N be a random matrix with independent, isotropic, and subgaussian rows with ∈ the same subgaussian parameter c as in Definition 7.3. Then, x RN and t (0, 1), ∀ ∈ ∀ ∈   −1 2 2 2 2 P m Ax 2 x 2 t x 2 2 exp( Kt m) || || − || || ≥ || || ≤ − where K depends only on c. N N Proof. Let x R . We may assume that x 2 = 1. Denoting the rows of A by Y1,...,Ym R we can define the following∈ random variables || || ∈

2 2 Zi = Yi, x x , i [m]. |h i| − || ||2 ∈ By hypothesis, Yi is isotropic, so E[Zi] = 0. Moreover, as Yi, x is subgaussian, Zi is subexponential, h i that is, P( Zi r) β exp( κr) for all r > 0 and some parameters β, κ depending only on c. Now, after a simple calculation| | ≥ ≤ we obtain−

m m 1 2 2 1 X 2 2 1 X Ax x = ( Yi, x x ) = Zi. m|| ||2 − || ||2 m |h i| − || ||2 m i=1 i=1

Since the Yi are independent, Zi are also independent, it follows from Bernstein’s inequality for subex- ponential random variables, Theorem 6.23, that

m ! m !  2 2 2   2  1 X X κ m t /2 κ 2 P Zi t = P Zi tm 2 exp 2 exp t m , m ≥ ≥ ≤ −2βm + κmt ≤ −4β + 2κ i=1 i=1

2 where we used that t (0, 1) in the last step. Defining K = κ , the concentration inequality follows. ∈ 4β+2κ

Now, in order to prove Theorem 7.4, we start by showing that a submatrix of a random matrix is well conditioned, with high probability, if we make some restriction on its size. Theorem 7.6. Suppose that a random matrix A Rm×N is drawn according to a probability distribution for which the concentration inequality of the lemma∈ above holds, that is, for t (0, 1), ∈   2 2 2 2 N P Ax 2 x 2 t x 2 2 exp( Kt m) x R . || || − || || ≥ || || ≤ − ∀ ∈ Suppose also that for δ, ε (0, 1) we have ∈ m Cδ−2(7s + 2 ln(2ε−1)) ≥ where C = 2/(3K). Then, for S [N] with #S = s, we have ⊂ ∗ P( ASAS Id 2→2 < δ) 1 ε || − || ≥ − 132 CHAPTER 7. MATRICES WHICH SATISFY THE RIP

Proof. From the combinatorial arguments developed in Theorem 6.51 we know that that for ρ (0, 1/2), N ∈ there exists a finite subset U from the unit ball BS = x R , supp x S, x 2 1 which satisfies { ∈ ⊂ || || ≤ }  2s #U 1 + and min z u 2 ρ z BS. ≤ ρ u∈U || − || ≤ ∀ ∈ Now, using the concentration inequality of the hypothesis, we have, for t (0, 1) and δ and ρ (they will be determined later) ∈

    2 2 2 X 2 2 2 P Au 2 u 2 t u 2 for some u U P Au 2 u 2 t u 2 || || − || || ≥ || || ∈ ≤ | || − || || ≥ || || u∈U  2s 2#U exp( Kt2m) 2 1 + exp( Kt2m). ≤ − ≤ ρ − Suppose, indeed, that the realization of the random matrix A yields the opposite inequality

2 2 2 Au u < t u u U. (7.1) || ||2 − || ||2 || ||2 ∀ ∈ The calculations above showed that  s 2 2 2  2 2 P Au 2 u 2 < t u 2 u U 1 2 1 + exp( Kt m). (7.2) || || − || || || || ∀ ∈ ≥ − ρ − 2 2 We are going to show that, for the right choice of ρ and t, (7.1) implies Ax 2 x 2 δ for all ∗ || || − || || ≤∗ x BS, i.e., the well conditioning of the submatrices, ASAS Id 2→2 δ. Defining B = ASAS Id, the∈ equation (7.1) is the same as Bu, u < t u U||. This happens− || because≤ − |h i| ∀ ∈ 2 2 2 2 Au u < t u = Au, Au u, u < t u || ||2 − || ||2 || ||2 ⇒ |h i − h i| || ||2 = A∗Au, u u, u < t u 2 = (A∗A Id)u, u < t u 2 = Bu, u < t u 2. ⇒ |h i − h i| || ||2 ⇒ |h − i | || ||2 ⇒ |h i| || ||2 Consider a vector x BS and choose u U such that x u 2 ρ < 1/2. We obtain ∈ ∈ || − || ≤ Bx, x = Bu, u + B(x + u), x u Bu, u + B(x + u), x u |h i| |h i h − i| ≤ |h i| |h − i| < t + B 2→2 x + u 2 x u 2 t + 2ρ B 2→2. || || || || || − || ≤ || || Taking the maximum over all x BS, we see that ∈ t B 2→2 < t + 2ρ B 2→2, i.e. B 2→2 . || || || || || || ≤ 1 2ρ − We would like that B 2→2 < δ, so with the choice t = (1 2ρ)δ the aid of (7.2), we conclude || || −  s ∗ 2 2 2 P ( ASAS Id 2→2 δ) 1 2 1 + exp( K(1 2ρ) δ m). (7.3) || − || ≤ ≥ − ρ − − Now, this probability is at least 1 ε if −  2s ε 2 1 + exp( K(1 2ρ)2δ2m). ≥ ρ − − And this is the same as

1 m δ−2(ln(1 + 2/ρ)s + ln(2ε−1)). ≥ K(1 2ρ)2 − To finish, we need to choose the value of ρ. Taking ρ = 2/(e7/2 1) 0.0623 so that 1/(1 2ρ)2 4/3 and ln(1 + 2/ρ)/(1 2ρ)2 14/3, the inequality above will be satisfied− ≈ if − ≤ − ≤ 7.2. RIP FOR SUBGAUSSIAN ENSEMBLES 133

2 m δ−2(7s + 2 ln(2ε−1)). ≥ 3K This estimative concludes the proof.

What remains to be done is to complete the gaps. We first need to show that subgaussian matrices have isotropic and subgaussian rows. After this, we have to prove that matrices drawn from a distribution that satisfies the concentration inequality have small RIP constant of optimal order. For the first part the following lemma suffices.

Lemma 7.7. Let Y RN be a random vector with independent, zero mean and subgaussian entries with variance 1 and same∈ subgaussian parameter c. Then Y is an isotropic and subgaussian random vector with the same subgaussian parameter c.

N Proof. Let x R with x 2 = 1. Since Yi are independent, zero mean and have unit variance, we have ∈ || || N N N  2 X X   X 2 2 E Y, x = xixi0 E YiYi0 = xi = x 2. |h i| || || i=1 i0=1 i=1 PN This allows us to conclude that Y is isotropic. We need to prove that Z = Y, x = i=1 xiYi is subgaussian. By the hypothesis of independence, we have h i

" N !# " N # N   X Y Y   E exp (Zθ) = E exp θ xiYi = E exp(θxiYi) = E exp(θxiYi) i=1 i=1 i=1 N Y exp(cθ2x2) = exp(c x 2θ2). ≤ i || ||2 i=1 Hence Y is a subgaussian random vector with parameter independent of N.

Finally, we prove that matrices which satisfy the concentration inequality have small RIP of optimal order in the number of parameters.

Theorem 7.8. Suppose that a random matrix A Rm×N is drawn according to a probability distribution for which the concentration inequality holds, that∈ is, for t (0, 1), ∈   2 2 2 2 N P Ax 2 x 2 t x 2 2 exp( Kt m), x R . || || − || || ≥ || || ≤ − ∀ ∈ Suppose further that, for δ, ε (0, 1), the number of rows (measurements) satisfies ∈ m Cδ−2s9 + 2 ln(N/s) + 2 ln(2ε−1), ≥ where C = 2/(3K). Then with probability 1 ε, the restricted isometry constant δs of A satisfies δs < δ. − ∗ Proof. We start by recalling the equivalent definition of the RIC, i.e. δs = sup A AS S⊂[N], #S=s || S − Id 2→2. So we must analyze all subsets of [N] with fixed cardinality s. By taking the union bound in all N|| subsets S [N] of cardinality s we obtain s ⊂

     s X ∗ N 2 2 2 P(δs δ) P ASAS Id 2→2 δ 2 1 + exp( Kδ (1 2ρ) m) ≥ ≤ || − || ≥ ≤ s ρ − − S⊂[N], #S=s

eN s  2s 2 1 + exp( Kδ2(1 2ρ)2m). ≤ s ρ − − 134 CHAPTER 7. MATRICES WHICH SATISFY THE RIP

7/2 Choosing ρ = 2/(e 1) 0.0623 as in Theorem 7.6 yields P(δs < δ) 1 ε if − ≈ ≥ − 1 4 14 4  2 m s ln(eN/s) + s + ln(2ε−1) = δ−2[s(9 + 2 ln(N/s)) + 2 ln(2ε−1)]. ≥ Cδ2 3 3 3 3K

The diagram below shows the chain of ideas developed in this section:

Lemma 7.7 Lemma 7.5 Theorem 7.8 Matrices with Matrices which Subgaussian Subgaussian and satisfy Matrices which Matrices Isotropic Rows Concentration satisfy RIP Inequality

In all of the previous Theorems above, we can choose the more convenient value of ε. In particular, if we set ε = 2 exp( m/2C2) in Theorem 7.8, with probability 1 2 exp( m/2C2) all s-sparse vectors will − − − be recovered via `1-minimization using an m N subgaussian random matrix for which ×

m 2C1s ln(eN/s) ≥ So, at least theoretically, instead of using matrices built with high precision to recover sparse infor- mation, some coins with a certain subgaussian distribution can be drawn to be used as the entries of the matrix. This will recover sparse vectors with high probability. Albeit this approach with subgaussian variables seems very promising and many papers written by mathematicians fully establish this as the status quo in sparse recovery, in practical terms the implementation is more subtle. This is currently a very active research area, see for example [Haboba et al. ’12], where the relative strengths and weaknesses of hardware implementation of Gaussian and Bernoulli circuits is discussed.

7.3 Universal Recovery

Usually we need to find the right basis to represent signals in a sparse way. But not all sparsity phenomena occur with respect to the canonical basis. In many situations we need some other (orthonormal) basis such as discrete Fourier, wavelets, shearlets, curvelets, etc to achieve a compact representation of the signals. In mathematical terms, this means that the phenomenon we are studying is written as a vector z = Ux with a orthogonal N N matrix and a s-sparse vector x CN . So the act of taking measurements is represented by × ∈

y = Az = AUx. In order to recover z, we need to recover first x and then make z = Ux. In practical terms, this means that the compressive sensing problem has a measurement matrix of the form A˜ = AU. The matrix A, in this context, is a random m N matrix while the one changing basis is a fixed and deterministic matrix × U RN×N . Then we need to adapt the results concerning small RIP constant to this more realistic situation.∈ This is easily made in the corollary below.

Corollary 7.9. Let U RN×N be a fixed orthogonal matrix. Suppose that an m N random matrix A is drawn according to a∈ probabilistic distribution for which the concentration inequality×

  2 2 2 2 N P Ax 2 x 2 t x 2 2 exp( Kt m) x R || || − || || ≥ || || ≤ − ∀ ∈ holds for all t (0, 1) and x RN . Suppose also that for a given δ, ε (0, 1) we have ∈ ∈ ∈ 7.4. THE CURIOUS CASE OF GAUSSIAN MATRICES 135

m Cδ−2[s(9 + 2 ln(N/s)) + 2 ln(2ε−1)], ≥ for C = 2/(3K). Then, with probability 1 ε, the restricted isometry constant δs of UA satisfies δs < δ. − Proof. As U is an orthogonal matrix, the concentration inequality holds with A replaced by UA. Let x RN and x˜ = Ux. We have ∈  2 2 2  2 2 2 2 P AUx 2 x 2 t x 2 = P Ax˜ 2 x˜ 2 t x˜ 2 2 exp( Kt m). || || − || || ≥ || || || || − || || ≥ || || ≤ −

One can see that the matrix U in this theorem is arbitrary. So we say that sparse recovery with sub- gaussian matrices is a universal phenomenon regarding the orthogonal basis where signals are represented in a sparse way. The implication is that since the measurements are taken in the form y = AUx, one does not need to know the matrix U in advance to perform the sensing (enconding) stage. The knowledge of U is necessary only when applying the recovery techniques at the decoding stage. The theorem proved in this section only states that for a fixed matrix U, a random choice of A will work with high probability. Universality does not mean that one single matrix measurement matrix A can recover all vectors represented in a sparse way on any basis since all vectors are 1-sparse in some basis. Of course that for every measurement scheme, we can construct a matrix of basis transform U such that the recover will fail. Universality is a very interesting feature which occurs not only here but permeates all of the probability theory. The excellent post [Tao’s Blog - 09/14/2010] gives a glimpse of this search for universality in a very didactic way.

7.4 The Curious Case of Gaussian Matrices

“Our lives are defined by opportunities, even the ones we miss." Eric Roth in The Curious Case of Benjamin Button by F. Scott Fitzgerald

From a mathematical point of view, the original approach of Candes and Tao in [Candès & Tao I ’06] was very different from the one shown, relying on the condition number estimate for Gaussian matrices. Since it uses tools as the Gaussian concentration of measure and Slepian-Gordon lemma, Theorem 6.37 and Theorem 6.30 respectively, it has the disadvantage of working only for this particular set of matrices. On the other hand, it has the advantage of providing better estimates due to a harder analysis. It is of major importance to find better constants and thereby find the suitable balance between the number of measurements and dimension of the signal. The knowledge of these scales is necessary to apply the techniques of compressive sensing. To prove the theorem which asserts that Gaussian matrices have small RIP constant with high prob- ability we need some tools from the area of non-asymptotic theory of random matrices, especially the estimates for the extremal singular values of Gaussian random matrices. For a great overview of this area, see [Rudelson & Vershynin ’10] and [Rudelson ’14]. In order to prove these estimates, we need the following two lemmas.

n×n Lemma 7.10. For all matrices A and B in R , their smallest and largest singular values σmin and σmax satisfy

σmax(A) σmax(B) A B 2→2 A B F and σmin(A) σmin(B) A B 2→2 A B F . | − | ≤ || − || ≤ || − || | − | ≤ || − || ≤ || − || Proof. For the largest singular value, we just need to identify it with the operator norm:

σmax(A) σmin(B) = A 2→2 B 2→2 A B 2→2. | − | || || − || || ≤ || − || For the smallest singular vales, we have: 136 CHAPTER 7. MATRICES WHICH SATISFY THE RIP

 σmin(A) = inf Ax 2 inf Bx 2 + (A B)x 2 ||x||2=1 || || ≤ ||x||2=1 || || || − ||  inf Bx 2 + (A B) 2→2 = σmin(B) + A B 2→2. ≤ ||x||2=1 || || || − || || − ||

Therefore, we have σmin(A) σmax(B) A B 2→2 and the lemma follows from the symmetry between A and B. The second inequality− is just≤ the || domination− || of the operator norm by the Frobenius norm. Lemma 7.11. For integers n s 1, ≥ ≥ Γ((n + 1)/2) Γ((s + 1)/2) √2 √2 √n √s. Γ(n/2) − Γ(s/2) ≥ − Proof. It suffices to show that, for any n 1, ≥ Γ((n + 1)/2) αn+1 αn √n + 1 √n where αn = √2 . − ≥ − Γ(n/2)

If follows from Γ((n + 2)/2) = (n/2)Γ(n/2) that αn+1αn = n. Then, multiplying the inequality we want to prove by αn and rearranging, we see that we need to prove that

2 α + (√m + 1 √m)αm m 0. m − − ≤ 2 So we need to prove that αm does not exceed the positive root of the polynomial z +(√m + 1 √m)z m, i.e. that − − q (√m + 1 √m) + (√m + 1 √m)2 + 4m αm βm = − − − . ≤ 2 First, we will bound αm from above and then bound βm from below.. For the first part, we will use the Gauss hypergeometric theorem (see Theorem 2.2.2 of [Andrews, Askey & Roy ’99]), from which we obtain that if Re(c a b) > 0, then − − Γ(c)Γ(c a b) − − = 2F1(a, b; c; 1), Γ(c a)Γ(c b) − − where 2F1(a, b; c; z) denotes the Gauss hypergeometric function

∞ n X (a)n(b)n z 2F1(a, b; c; z) = . (c) n! n=0 n and (x)n is the Pochhammer symbol defined by (x)0 = 1 and (x)n = x(x + 1) ... (x + n 1) for n 1. Hence, choosing a = 1/2, b = 1/2 and c = (m 1)/2, we derive − ≥ − − − Γ((m + 1)/2)2 Γ((m + 1)/2)Γ((m 1)/2) α2 = 2 = (m 1) − m Γ(m/2)2 − Γ(m/2)2 ∞ 2 X ( 1/2)(1/2) ... ( 1/2 + n 1) 1 = (m 1) − − − − ((m 1)/2)((m + 1)/2) ... ((m + 2n 3)/2) n! n=0 − − ∞ n−2 2 1 1 1 X 2 (1/2) ... (n 3/2) = m + + − . − 2 8 m + 1 n!(m + 1)(m + 3) ... (m + 2n 3) n=3 − Defining γm as

  ∞ n−2 2 2 1 1 1 X 2 (1/2) ... (n 3/2) γm := (m + 1) α m + = − , m − 2 − 8 m + 1 n!(m + 3) ... (m + 2n 3) n=3 − 7.4. THE CURIOUS CASE OF GAUSSIAN MATRICES 137

we see that γm decreases with m. Thus, there is an integer m0 (that will be chosen later) such that

2 1 1 1 γm0 α m + + m m0. (7.4) m ≤ − 2 8 m + 1 m + 1 ≥

Now, let us bound βm from below. Let

(√m + 1 √m) + √4m 3√m √m 1 βm − − = − − = δm. ≥ 2 2 p p Expanding m(m + 1) and using that δ2 = (10m + 1 6 m(m + 1))/4, we have m −

r ∞  n! p 1 X (1/2)( 1/2) ... (1/2 n + 1) 1 m(m + 1) = (m + 1) 1 = (m + 1) 1 + − − − m + 1 n! m + 1 − n=1

∞ n−1 1 1 1 1 X (1/2) ... (1/2 n + 1) 1  = m + − − . 2 8 m + 1 2 n! m + 1 − − n=3 Therefore

∞  n−1! 10m + 1 3p 10m + 1 3 1 1 1 1 X (1/2) ... (1/2 n + 1) 1 δ2 = m(m + 1) = m+ − − . m 4 2 4 2 2 8 m + 1 2 n! m + 1 − − − − n=3 Thus

1 3 1 β2 δ2 m + . (7.5) m ≥ m ≥ − 2 16 m + 1 Subtracting (7.4) from

2 2 1 1 γm0 β α m m0. m − m ≥ 16 m + 1 − m + 1 ≥ 2 2 To finish, we choose m0 to be the smallest integer such that γm0 1/16, that is, m0 = 3, so that βm αm for all m 3. The numerical cases m = 1 and m = 2 are straightforward≤ to verify. ≥ ≥

We are now able to prove the non-asymptotic estimates of the singular values of a Gaussian matrix. This result is a consequence of estimates due to Gordon and Slepian1 and was obtained by [Ledoux ’01].

m×s Theorem 7.12. Let A R be a Gaussian matrix with m > s and let σmin and σmax be the smallest ∈ √1 and largest singular values of the renormalized matrix m A. Then, for t > 0,

p −mt2/2 p −mt2/2 P(σmax 1 + s/m + t) e and P(σmin 1 s/m t) e . ≥ ≤ ≤ − − ≤ Proof. Let us start by using Lemma 7.10 and by noticing that the extremal singular values are 1-Lipschitz functions with respect to the Frobenius norm. So, by the concentration of measure inequality, Theo- rem 6.37, we have the following relations between σmax and its expected value,

 −r2/2 P σmax E[σmax] + r e . (7.6) ≥ ≤ It remains to estimate the expected value above. For this, we use Slepian-Gordon’s Lemma, Theorem 6.30. As

1 Asymptotic estimates are know since the eighties, see [Geman ’80] and [Yin, Bai & Krishnaiah ’88] for σmax and [Silverstein ’85] for σmin. Also, see [Marcenko & Pastur ’67]. 138 CHAPTER 7. MATRICES WHICH SATISFY THE RIP

σmax = sup sup Ax, y , x∈Ss−1 y∈Sm−1h i we define the following two special Gaussian processes

Xx,y = Ax, y and Yx,y = g, x + h, y , h i h i h i where g Rs and h Rm are two independent standard Gaussian vectors. Now we will compare these ∈ ∈ s−1 m−1 two processes. Taking x, x˜ S and y, y˜ S and using that Aij are independent and of variance 1, we obtain ∈ ∈

m s 2 m s m s 2 X X X X 2 X X 2 2 2 2 E Xx,y Xx,˜ y˜ = E Aij(xjyi x˜jy˜i) = (xjyi x˜jy˜i) = (xj yi 2xjx˜jyiy˜i +x ˜j y˜i ) − − − − i=1 j=1 i=1 j=1 i=1 j=1

= x 2 y 2 + x˜ 2 y˜ 2 2 x, x˜ y, y˜ = 2 2 x, x˜ y, y˜ . || ||2|| ||2 || ||2|| ||2 − h ih i − h ih i Using independence and isotropy of the standard multivariate Gaussian, we have

2 2 2 2 2 2 E Yx,y Yx,˜ y˜ = E g, x x˜ + h, y y˜ = E g, x x˜ + E h, y y˜ = x x˜ 2 + y y˜ 2 − |h − i h − i| |h − i| |h − i| || − || || − || = x 2 + x˜ 2 2 x, x˜ + y 2 + y˜ 2 2 y, y˜ = 4 2 x, x˜ 2 y, y˜ . || ||2 || ||2 − h i || ||2 || ||2 − h i − h i − h i Therefore we conclude that

2 2 E Yx,y Yx,˜ y˜ E Xx,y Xx,˜ y˜ = 2(1 x, x˜ y, y˜ + x, x˜ y, y˜ ) = 2(1 x, x˜ )(1 y, y˜ ) 0. − − − − h i − h i h ih i − h i − h i ≥ In this last inequality we used Cauchy-Schwarz inequality and the fact that the vectors belong to the unit sphere (so equality holds in and only if y, y˜ = 1 or x, x˜ = 1). Thus, we showed that h i h i 2 2 E Xx,y Xx,˜ y˜ E Yx,y Yx,˜ y˜ . − ≤ − Then, Slepian-Gordon’s Lemma implies that

Eσmax = E sup sup Xx,y E sup sup Yx,y = E sup g, x + E sup h, y x∈Ss−1 y∈Sm−1 ≤ x∈Ss−1 y∈Sm−1 x∈Ss−1h i y∈Sm−1h i q q 2 2 = E g 2 + E h 2 E g 2 + E h 2 √s + √m, || || || || ≤ || || || || ≤ where we used Jensen’s inequality. Using this in (7.6) we obtain that

 −r2/2 P σmax(A) √s + √m + r e . ≥ ≤ √1 √A To finish, we just need to rescale A by m . This gives us a estimate for the largest singular value of m . The estimate for σmin(A) = infx∈Ss−1 Ax 2 is more intricate. We need to measure the Gaussian width of Ss−1 and then use Gordon’s escape|| through|| the mesh, Theorem 6.32. Recall that for a given standard Gaussian vector g Rs we have ∈ s−1 h i `(S ) = E sup g, x = E g 2. x∈Ss−1h i || || If the entries of g are independent, then g 2 has χ2 distribution with n degrees of freedom. Then || ||2

 n 1/2 Z ∞ (n+1)/2 Z ∞ X 2 1 1/2 (n/2)−1 −u/2 2 (n/2)−1/2 −t E g 2 = E gi = u u e du = t e dt || || 2n/2Γ(n/2) 2n/2Γ(n/2) i=1 0 0 7.4. THE CURIOUS CASE OF GAUSSIAN MATRICES 139

Γ((n + 1)/2) = √2 . Γ(n/2) s Let us recall that for g R we introduced the notation E g 2 = Es. Then, with the aid of Lemma 7.11, we obtain ∈ || ||

s−1 Γ((m + 1)/2) Γ((s + 1)/2) Em `(S ) = Em Es = √2 √2 √m √s. − − Γ(m/2) − Γ(s/2) ≥ − To conclude, using Theorem 6.32 we obtain

   s−1  −t2/2 P σmin(A) √m √s t P inf Ax 2 Em `(S ) t e . ≤ − − ≤ x∈Ss−1 || || ≤ − − ≤

√1 Finally, one just needs to rescale A by m to finish the proof.

We can state the main Theorem of this section.

Theorem 7.13. Let A Rm×N be a Gaussian matrix with m < N. Let η, ε (0, 1), and assume that ∈ ∈ m 2η−2s ln(eN/s) + ln(2ε−1). ≥ Then we have

"      2 # 1 1 1 2 P δs A 2 1 + p η + 1 + p η 1 ε. √m ≤ 2 ln(eN/s) 2 ln(eN/s) ≥ −

Proof. The proof is very similar to the one for Theorem 7.8. Let S [N] such that #S = s. The 1 ⊂∗ submatrix AS is an m s Gaussian matrix and the eigenvalues of m ASAS Id are inside the interval 2 2 × −1 [σ 1, σ 1] where σmin and σmax are the extremal singular values of √ AS. From Theorem 7.12, min − max − m we obtain

 n o 1 ∗ p 2 p 2 P ASAS Id max 1 + s/m + η 1, 1 ( s/m + η) m − 2→2 ≤ − −

  1 ∗ p p 2 2 = P ASAS Id 2( s/m + η) + ( s/m + η) 1 2 exp( mη /2). m − 2→2 ≤ ≥ − − ∗ Again, in view of the definition δs = supS⊂[N], #S=s ASAS Id 2→2, we take the union bound over all N || − || s subsets of [N] with cardinality s:      s p p 2 N −mη2/2 eN −mη2/2 P δ2 > 2( s/m + η) + ( s/m + η) 2 e 2 e ε, ≤ s ≤ s ≤ by the hypothesis on the number of rows and the Stirling’s formula. This same hypothesis also tells us p η that s/m and so ≤ √2 ln(eN/s) "      2 # 1 1 1 2 P δs A > 2 1 + p η + 1 + p η √m 2 ln(eN/s) 2 ln(eN/s)   p p 2 P δ2 > 2( s/m + η) + ( s/m + η) < ε. ≤ This concludes the proof. 140 CHAPTER 7. MATRICES WHICH SATISFY THE RIP

The reader should ask why we expend so much effort in the Gaussian case, proving even technical lemma such as Lemma 7.11. The reason is related to the constant C in m Cs ln(N/s) and here a difference between Theorem 7.2 and Theorem 7.13 appears. In the Gaussian≥ case, the number of measurements are much less than in the general subgaussian case and in this case we have the best known constant so far. The reason behind it is related to Theorem 6.32. The reader can consult [Vershynin ’15] for more details on this comparison. Also, the techniques developed for the Gaussian case are so powerful that we can not only prove that they satisfy the RIP with small constant but we also can directly establish that they satisfy NSP, rather than relying on RIP and use it as a sufficient condition to NSP. The proof is even more laborious and appeared in the literature for the first time in the book [Rauhut & Foucart ’13].

Theorem 7.14. (Theorem 9.29 of [Rauhut & Foucart ’13]: Let A Rm×N be a random drawn of a Gaussian matrix. Assume that ∈

s !2 m2 ln(ε−1) 2s ln(eN/s) 1 + ρ−1 + D(s/N) + , m + 1 ≥ s ln(eN/s) where D is a function that satisfies D(α) 0.92 for all α (0, 1] and limα→0 D(α) = 0. The, with probability at least 1 ε the matrix A satisfies≤ the stable null∈ space property of order s with constant ρ. − Proof. See Theorem 9.29 and Corollary 9.34 from [Rauhut & Foucart ’13].

7.5 Johnson-Lindenstrauss Embeddings and the RIP

N Assume we have a cloud of data in a high dimensional space, i.e., M points x1, . . . , xM R . If N > M, these points lie in a subspace of dimension M. Then we can consider{ the projection} ∈ of these points into this subspace without distorting their mutual distance, that is, xi, xj we have a projection ∀ f satisfying f(xi) f(xj) 2 = xi xj 2. This is the same as saying that we have a natural isometry (only for the|| data points)− into|| this|| subspace.− || Usually it is computationally expensive to process data when it lies in a high dimensional space. Moreover the existing algorithms scale very poorly when increasing the dimension of the data. So this projection must be done to avoid the curse of dimensionality, extremely important in these “Big Data" days. At the same time, we need to project them while preserving some kind of geometric structure. Now, consider the situation where we allow a bit of distortion when projecting this cloud of data, that is, instead of looking for isometric projections, we look for ε-isometries

2 2 2 (1 ε) xi xj f(xi) f(xj) (1 + ε) xi xj . (7.7) − || − ||2 ≤ || − ||2 ≤ || − ||2 When we have measurements with errors, it is reasonable to allow this kind to projection. The question is then the following: provided we have distortion, can we do beter than N = M?, that is, can we project on a lower dimensional space? [Johnson & Lindenstrauss ’84] proved that this is possible.

N Theorem 7.15. (Lemma 1 of [Johnson & Lindenstrauss ’84]): Let x1, . . . , xM R be an arbitrary set of points and ε > 0. If ∈

m > Kε−2 ln(M), then there exists a matrix A Rm×N such that ∈ 2 2 2 (1 ε) xi xj A(xi xj) (1 + ε) xi xj , − || − ||2 ≤ || − ||2 ≤ || − ||2 for all i, j [M]. The constant K > 0 is universal. ∈

Proof. If we look at the set of mutual distances E = xj xi : 1 j i M with cardinality #E M(M 1/2), we just need to show the existence{ of a− matrix A such≤ that≤ ≤ } ≤ − 7.5. JOHNSON-LINDENSTRAUSS EMBEDDINGS AND THE RIP 141

(1 ε) x 2 Ax 2 (1 + ε) x 2, x E. (7.8) − || ||2 ≤ || ||2 ≤ || ||2 ∀ ∈ 1 Take A˜ = √ A Rm×N , where A is a random drawn of a subgaussian matrix. Then Lemma 7.5 implies m ∈ the existence of a constant C such, for any fixed x E, we have ∈   ˜ 2 2 2 2 P Ax 2 x 2 ε x 2 2 exp( Cmε ). || || − || || ≥ || || ≤ − Now, taking the union bound in such a way that Equation (7.8) holds for all x E ∈ ! !

[ ˜ 2 2 2 X ˜ 2 2 2 M(M 1) 2 P AX 2 x 2 ε x 2 P AX 2 x 2 ε x 2 2 − exp( Cmε ). || || − || || ≥ || || ≤ || || − || || ≥ || || ≤ 2 − x∈E x∈E And so   2 2 2 2 −Cmε2 P (1 ε) x 2 Ax 2 (1 + ε) x 2 x E 1 M e − || || ≤ || || ≤ || || ∀ ∈ ≥ −

2 As we want M 2e−Cmε smaller than a certain η, we just need to take m C−1ε−2 ln(M 2/η) and then the inequality above holds with probability 1 η. We conclude the existence≥ of the Johnson-Lindenstrauss map when η < 1. Considering the limit η − 1, this gives the claim with K = 2C−1. → This deep theorem tells us that if we relax the condition on the isometry and ask for a quasi-isometry, we can project our cloud of data in a much lower dimensional space, whose dimension is logarithmic in the number of vectors and can be controlled by ε, our error relative to geometric structure preservation. And it is remarkable because the target dimension is independent of the ambient dimension N. Moreover, the proof presented here is that subgaussian matrices work as this kind of projection. The Johnson-Lindenstrauss Lemma has a lot of applications and is widely used nowadays to transform a high-dimensional problem into a low-dimensional one in such a way that the optimal solution to the problem in low dimension can be lifted to a nearly optimal solution to the high dimensional one. Besides its connection with compressive sensing, we could cite aplications in combinatorial optimization, data streams, fast low-rank approximations, nearest neighbor search, etc. For variations on the theme and applications, see [Matousek ’08] and [Vempala ’04]. It would be interesting, for computational purposes, to find other ensemble of matrices which can be used as projections, other than the subgaussians ones. The next Theorem, proved in [Krahmer & Ward ’12] shows that given a matrix A satisfying the RIP, a randomization of the column signs of A provides a Johnson-Lindenstrauss embedding and this will allow fast computational embeddings.

Theorem 7.16. ([Krahmer & Ward ’12]): Let E RN be a finite point set such that #E = M. For m×N ⊂ ε, η (0, 1) , let A R with RIC satisfying δ2s ε/4 for some s 16 ln(4M/η) and let  = ∈ ∈ ≤ ≥ (1, . . . , N ) be a Rademacher sequence. Also define D as the diagonal matrix with  on the diagonal. Then with probability at least 1 η − 2 2 2 (1 ε) x ADx (1 + ε) x , x E. (7.9) − || ||2 ≤ || ||2 ≤ || ||2 ∀ ∈ Proof. We may assume, without loss of generality, that for all x E, we have x 2 = 1. For a fixed x E, we will partition [N] into blocks of size s as a nonincreasing∈ rearrangement|| of|| x. More precisely, ∈ let us take S1 [N] as the index set of the s largest absolute entries of x. Now, take S2 [N] S1 as the ⊂ ⊂ \ index set of the s largest absolute entries of x in [N] S1 and so on. With this construction, we can write \

2 D E D E 2 X X 2 X ADx = AD xSj = ADxSj + 2 ADxS1 , ADxS + ADxSj , ADxSl . 2 2 2 1 j≥1 j≥1 j,l≥2 j6=l 142 CHAPTER 7. MATRICES WHICH SATISFY THE RIP

By hypothesis, A satisfies RIP with δs δ2s ε/4 and since DxSj = xSj 2 we can estimate the first term by ≤ ≤ || || || ||

2 X 2 X 2 2 (1 ε/4) x = (1 ε/4) xSj ADxSj (1 + ε/4) x . − || ||2 − || ||2 ≤ 2 ≤ || ||2 j≥1 j≥1

To estimate the second term, we use the fact that Dx = Dxε. Considering the random variable

D E D E D ∗ E D E X X = ADxS1 , ADx = ADxS , ADx  = DxS A AS1 DxS S1 ,  = v,  = ivi, S1 1 S1 1 xS1 1 S1 S1 i6=S1 where v RS1 is given by ∈ ∗ v = Dx A AS Dx S . S1 xS1 1 S1 1 Now, note that v and  are independent because one of them just depends on S and the other just S1 1 on S1. Our goal is to estimate this second term by the Hoeffding’s inequality for Rademacher variables, Equation (6.8), that says

 M  X 2 P ivi v 2u 2 exp( u /2). ≥ || || ≤ − i=1 Therefore we need to estimate the 2-norm of the vector v, so

X D ∗ E v 2 = sup v, z = sup zS ,Dx A AS Dx S || || h i i Si Si 1 S1 1 ||z||2≤1 ||z||2≤1 i≥2

X D ∗ E X ∗ = sup zS ,Dx A AS D xS sup zS 2 Dx A AS D 2→2 xS 2 i Si Si 1 S1 1 ≤ || i || || Si Si 1 S1 || || 1 || ||z||2≤1 i≥2 ||z||2≤1 i≥2

X ∗ sup zS 2 Dx 2→2 A AS 2→2 D 2→2 xS 2 ≤ || i || || Si || || Si 1 || || S1 || || 1 || ||z||2≤1 i≥2

X ∗ sup zS 2 xS ∞ A AS 2→2 xS 2. ≤ || i || || i || || Si 1 || || 1 || ||z||2≤1 i≥2 Here we have used that D = x and that D =  = 1. Moreover, from the xSi 2→2 Si ∞ S1 2→2 S1 ∞ || || || || || || || || −1/2 way we construct S1,S2,... , etc, it follows from Lemma 5.18 that xSi ∞ s xSi−1 2. Besides that, and from Proposition we have, for ||, that|| ≤∗ || || . It then xS1 2 x 2 = 1 5.5 i 2 ASi AS1 2→2 δ2s follows|| that|| ≤ || || ≥ || || ≤

δ2s X δ2s X 1 2 2 δ2s v 2 sup zS 2 xS 2 sup zS + xS , (7.10) || || ≤ √s || i || || i−1 || ≤ √s 2 || i ||2 || i−1 ||2 ≤ √s ||z||2≤1 i≥2 ||z||2≤1 i≥2

P 2 2 where we have used the geometric-arithmetic mean inequality and that xS x = 1. Now, j≥2 || j ||2 ≤ || ||2 Hoeffding’s inequality, Equation (6.8), conditionally on S1 yields, for t > 0  M    X 2 P X v 2u = P ivi v 2u 2 exp( u /2), | | ≥ || || ≥ || || ≤ − i=1 which, using Equation (7.10), δ2s ε/4 and v 2u = t, is the same as ≤ || ||   2 2 2 2 2 P X t exp( t /2 v 2) exp( t s/2δ2s) exp( 8st /ε). (7.11) | | ≥ ≤ − || || ≤ − ≤ − To finish the proof, we need to estimate the third term. In order to do it, let us consider the random variable Y define by 7.5. JOHNSON-LINDENSTRAUSS EMBEDDINGS AND THE RIP 143

N X D E X ∗ Y = ADxSj , ADxSl = jlBj,l =  B, j,l≥2 j,l=s+1 j6=l where B RN×N is a symmetric matrix with zero diagonal given by ∈ ( ∗ xjaj alxl if j, l S1 and j, l belong to different blocks Sk Bj,l = ∈ 0 otherwise We will bound the tail of this Rademacher chaos using Theorem 6.25. Therefore, we need now to estimate the Frobenius and the spectral norms of B. As it is symmetric, its spectral norm can be estimated by

X D ∗ E B 2→2 = sup Bz, z = sup zS ,Dx A AS Dx zS || || h i j Sj Sj l S1 l ||z||2≤1 ||z||2≤1 j,l≥2 j6=l

X ∗ sup zS 2 zS 2 xS ∞ xS ∞ A AS 2→2 ≤ || j || || l || || j || || l || || Sj l || ||z||2≤1 j,l≥2 j6=l

X −1/2 −1/2 δ2s sup zS 2 zS 2s xS 2s xS 2 ≤ || j || || l || || j−1 || || l−1 || ||z||2≤1 j,l≥2 j6=l

δ2s X 2 2 2 2 δ2s sup xS + zS xS + zS . ≤ 4s || j−1 ||2 || j ||2 || l−1 ||2 || l ||2 ≤ s ||z||2≤1 j,l≥2 j6=l The Frobenius norm obeys the same bound,

X X X ∗ 2 X X 2 ∗ 2 B F = (xja alxl) = x DxS A ai || || j i k Sk 2 j,l≥2 i∈Sj l∈Sk j,l≥2 i∈Sj j6=l j6=l

2 X X 2 2 ∗ 2 2 X 2 −1 2 δ2s x xS A ai δ xS s xS , ≤ i || k ||∞|| Sk ||2 ≤ 2s || j ||2 || k ||2 ≤ s j,l≥2 i∈Sj j,l≥2 j6=l j6=l where we have used the fact that ∗ a 2 ∗ . So, by Inequality (6.14), the tail ASk i 2 = ASk 2→2 δs+1 δ2s of the third term can be estimate,|| for any||t > 0||, by || ≤ ≤

    3t2 t  P Y t 2 exp min 2 , | | ≥ ≤ − 128 B 32 B 2→2 || ||F || ||   3st2 st    3t2 t  2 exp min , 2 exp min 2 , . (7.12) ≤ − 128δ2s 32δ2s ≤ − 8ε 8ε Choosing t = ε/6 in (7.11) and t = ε/2 in (7.12) and taking into account the bounds for the three terms simultaneously, we conclude that for a fixed x E, with probability at least ∈ 1 2 exp( s/8) 2 exp( s min 3/32, 1/16 ) 1 exp( s/16), − − − − { } ≥ − − we have

2 2 2 (1 ε) x ADx (1 + ε) x . − || ||2 ≤ || ||2 ≤ || ||2 Taking the union bound over all x E and using the hypothesis s 16 ln(2M/η), we conclude that this holds for all x E with probability∈ at least ≥ ∈ 144 CHAPTER 7. MATRICES WHICH SATISFY THE RIP

1 4M exp( s/16) 1 η. − − ≥ − This concludes the proof. We note that this theorem allows us to conclude the existence of Johnson-Lindenstrauss embeddings for other types of matrices. In particular, matrices that satisfy RIP provide such embeddingsm as is the case of random partial Fourier matrices. It is an open problem to show directly that such type os matrices are JL embeddings because all the proofs involve, in some way, subgaussian matrices. Without the randomization of columns, this Theorem is false. To see this, just take the points of E belonging to the kernel of A then the lower bound cannot hold. Thus the randomization of columns signs ensures that the probability of the intersection of E and the kernel of AD not being empty is very small. Additionally, the question about sharpness of the results emerges, that is, if we could improve the fact that m = O(ε−2 ln(M)) or, in other words, if there exist some set E such that for any such map f, as defined in (7.7), we must have m = Ω(ε−2 ln(M)). In their original paper, [Johnson & Lindenstrauss ’84] proved the first lower bound of m = Ω(ln N) when ε is smaller than some constant. This was improved by [Alon ’03], who showed that any JL map must embed into dimension m = Ω(min N, ε−2 ln N/ ln(1/ε) ), where the first term in the min is achieved by the identity map. This was later{ improved to m =} Ω(ε−2 min ln N, (ln N/ ln(1/ε))2 ). Finally, [Larsen & Nelson ’14] closed the gap by proving the following result. { }

Theorem 7.17. There exists C > 0 such that for any N > 1 and 0 < ε < 1/2, there is a subset E RN ⊂ with #E = N + N 3 such that any embedding linear f : RN Rm satisfying → (1 ε) x 2 f(x) 2 (1 + ε) x 2, x E, − || ||2 ≤ || ||2 ≤ || ||2 ∀ ∈ must have m C min N, ε−2 ln(N) . In other words, the JL lemma is optimal in the case where the projection f is≥ linear. { } Proof. Theorem 3 from [Larsen & Nelson ’14]. It is very curious that all methods for dimensionality reduction through JL lemma are via linear maps. Moreover, the sharpness result is also for linear maps. Thus, circumventing the lower bound would require a fundamentally new approach with nonlinear projections. To conclude this chapter, we address the subject of the time needed to generate suitable JL projec- tions. This is known as the search for Fast Johnson-Lindenstrauss Transform(FJLT), a term coined in [Ailon & Chazelle ’06]. They define a probability distribution over a product of matrices through the following product

AF JLT = PHN D N where (remembering that we are dealing with vectors E = x1, . . . , xM R , that is, we have M vectors in a space whose dimension is N) { } ∈

D is a N N matrix where each Dii is drawn independently from 1, 1 with probability 1/2, • × {− } as in Theorem 7.16. Note that D is an isometry for the 2-norm.

HN is the N N Hadamard matrix (we are assuming that N is a power of 2) with entries given by • ×hi−1,j−1i (Hd)ij = ( 1) /√N where for i, j each in 0, ..., d 1 we write them in binary and treat − { − } them as log2 d-dimensional vectors. Note that HN is also an isometry for the 2-norm. P is a m t matrix whose elements are independently distributed as follows. With probability 1 η • × − set Pij as 0, and otherwise (with the remaining probability η) draw Pij from a normal distribution of expectation 0 and variance η−1. The sparsity constant q is given as

( ) ε−2 log M  q = min Θ , 1 . t 7.5. JOHNSON-LINDENSTRAUSS EMBEDDINGS AND THE RIP 145

We need to choose t in order to complete the construction. It will be t = ε−2 ln(1/η) ln(N/η), with η as defined above being the probability of failure in the JL lemma. The running time to apply A to some vector x RN is ∈

D takes O(N) time. •   1 HN/2 HN/2 Due to the recurrence relation H1 = (1) and HN = √ , it takes O(N ln N). • 2 HN/2 HN/2 − Applying P takes O(mt) time. •

 −4 2  So the total time to apply AF JLT is O N ln N + ε ln (1/η) ln(N/η) . To complete the argument, an analysis must be done showing that this is an JL isometry. This is given by the following lemmas:

N Lemma 7.18. For any x R with x 2 = 1, and for any 0 < η < 1/2, ∈ || || " r # ln(N/η) P HN Dx > < η. ∞ N

Proof. Lemma 1 from [Ailon & Chazelle ’09]. p Lemma 7.19. If x 2 = 1 has x ∞ ln(N/η)/N, then || || || || ≤   P 1 ε P x 2 1 + ε > η. − ≤ || || ≤ Proof. Lemma 7 from [Nelson ’??]2. This approach is known as FJLT via Fast Hadamard. For more details and a complete analysis, see [Ailon & Chazelle ’06]. Other methods to obtain similar results include the use of FFT and also Toeplitz and circulant matrices. Some of them do not need to assume that matrices have dimension N = 2k for some k, see [Vybiral ’11], [Hinrichs & Vybiral ’11], [Ailon & Liberty ’11], [Dasgupta, Kumar & Sarlós ’10], and references therein. Besides, Jelani Nelson points out the following open question, the most important in the computational tractability of using JL as a tool for dimension reduction:

Open Problem: Obtain a probability distribution for the JL lemma with embedding time O(N ln m) . (or even O(N ln N)) and the dimension of the target space being m = O(ε−2 ln(N))

In this chapter we proved that for matrices for which m Cδs ln(N/s), RIP holds. It turns out that subgaussian matrices provide a good ensemble for the algorithms≥ of sparse recovery to work. We also saw that, roughly speaking, this scale is optimal in order to do dimension reduction via Johnson-Lindenstrauss lemma. Now we would like to know whether this scale is optimal and if there are some special situations where it could be improved. For example, can we design, for some specific sparse signals, smart measurement schemes such that there will be improvement in the art of measurement? In the next Chapter we provide an answer through ideas based on the work of Kolmogorov and Gelfand in approximation theory.

2It is not indicated in the website of Jelani Nelson, http://people.seas.harvard.edu/~minilek/, when these notes were written, so we have the interrogation on the reference. 146 CHAPTER 7. MATRICES WHICH SATISFY THE RIP Chapter 8

Optimality in the Number of Measurements

É estranho que tu, sendo homem do mar, me digas isso, que ja não há ilhas desconhecidas. Homem da terra sou eu, e não ignoro que todas as ilhas, mesmo as conhecidas, são desconhecidas enquanto não embarcamos nelas.1 José Saramago in O Conto da Ilha Desconhecida

8.1 Introduction

Along this dissertation we developed the theory of Compressive Sensing and showed that it is possible to sense signals using much fewer measurements than stated by the usual sampling paradigm. In the introduction of [Donoho ’06], he asks “why go to so much effort to acquire all the data when most of what we get will be thrown away? Can’t we just directly measure the part that won’t end up being thrown away?” We saw that this new kind of sensing can be done in a nonadaptive way, that is, we do not require the knowledge of the signal in advance. We just need to know that data is sparse or compressible when expressed in some basis. Several techniques were required in order to show when and how this is possible. We passed through Mathematical Optimization, Nonasymptotic Probability and sophisticated ideas from Linear Algebra and Harmonic Analysis. This shows us that often circumventing dominant ideas or breaking paradigms can be a difficult task. After many steps, we arrived at a point where it was possible to show how to design (random) sensing matrices and perform optimal sampling with them. This was a breakthrough introduced by Donoho, Tao, Candès and Romberg. They discovered that m s ln(N/s) is a rule of thumb regarding the number of measurements. In other words, the information acquired≈ is linear in the information content (sparsity S) and logarithmic in the ambient dimension. In Section 1.4 of his seminal work2, Donoho says that the estimation of error measurement of com- pressible objects “concerns the geometry of high-dimensional convex and nonconvex balls...to develop this geometric viewpoint further, we consider two notions of n-width” and argues that there is an equivalence between optimal recovery of nonadaptive information and objects in approximation theory. The purpose of this chapter is to explore this equivalence. Despite the fact that this type of observation was well known in the Approximation Theory community (see [DeVore ’06] and [Micchelli & Rivlin ’76]), it was a new information for people working on Signal Processing. Another fundamental paper of Com- pressive Sensing concerning this error reconstruction questions is [Cohen, Dahmen & DeVore ’09]. In this final chapter, we will define the two notions of n-width that Donoho was referring to. In order to do this, we first need to discuss some ideas of Approximation Theory. After, we prove the Theorem

1A free translation of Saramago’s quote is: It is strange that you, being a man from the sea, tell me this, that there are no unknown islands. I am a man from the land, and I am not unaware that all the islands, even the known ones, are unknown until we land on them. 2According to Google Scholar, it has more than 16500 citations.

147 148 CHAPTER 8. OPTIMALITY IN THE NUMBER OF MEASUREMENTS of Kashin-Garnaev-Gluskin in Geometry of Banach Spaces and explain why this deep theorem is funda- mental for Compressive Sensing. We derive, as its consequence, the optimal number of measurements for sparse recovery, regardless of the reconstruction scheme. The first part of this chapter was highly inspired by [Gamkrelidze ’90] and [Pinkus ’85]. The technical results where extracted from [Rauhut & Foucart ’13] and original papers. Also, the historical part was taken from many papers written by Vladimir Tikhomirov3.

8.2 n-Widths in Approximation Theory

In any computational study where the available memory is finite, it is necessary to represent functions using a finite number of parameters, for example, coefficients of some truncated series expansion. More- over, these coefficients, real numbers, must be transformed into a finite alphabet and many times this alphabet will be represented by binary digits. This is the problem of quantization. With the bits obtained we could use coding techniques to achieve a final compression. Many ideas for this kind of representation (or approximation) came from the Soviet school [Steffens ’06]. In particular, the abstract notions of metric entropy and widths are linked to the name of Kolmogorov, who introduced them, and Tikhomirov, one of his students. The latter developed many techniques related to widths and solved many problems left open by Kolmogorov and collaborators. In this con- text, the names of Mityagin, Vitushkin, Kashin, Gluskin, Maiorov, Solomyak should also be mentioned [Pietsch ’07]. One of the first problems studied in approximation theory is the following: given a point and a subset in a metric space, find the point (or points) in the subset which best approximates the given point. As well as in the quantization problem, in any approximation problem it is necessary to characterize what best approximation (or quantization) means. This will measure the accuracy with which an element from the given set can be recovered. For example, given x X, where X is a Banach space, and given Xn, an ∈ n-dimensional subspace of X, we wish to find the best approximation of x in Xn. Also we would like to estimate the value of the error, i.e., the measure of the distance between x and its best approximant. If we denote the distance by

d(x, Xn) = inf x y X : y Xn , {|| − || ∈ } ∗ ∗ we ask if there exists a y Xn for which d(x, Xn) = x y and also ask for its characterization and ∈ || − || uniqueness, when possible. Note that as Xn is a finite dimensional subspace of X, a best approximation always exist. Now, instead of a point, we could ask the same for a given subset K X, that is, how well one can ⊂ approximate K by Xn, an n-dimensional subspace of X. Sometimes this is called the deviation of A from Xn. We also allow the possibility of varying the n-dimensional subspaces Xn. This was first done by [Kolmogorov ’36], despite Uryson in 1922 and Aleksandrov in 1933 having created similar definitions for the study of geometric problems. Geometry was also the motivation behind the theory of widths developed in the Soviet Union, see Chapter 8 of [Charpentier, Lesne & Nikolski ’07].

Definition 8.1. The Kolmogorov m-width of a subset K of a normed space X is defined as   dm(K,X) = inf sup inf x y ,Xm subspace of X with dim(Xm) m Xm x∈K y∈Xm || − || ≤ As Tikhomirov points out in [Gamkrelidze ’90], “undoubtedly, of most importance in approximation theory is the Kolmogorov width which is the most closely connected with the basic direction of classical approximation theory”. He also wrote, “the determination of the entropy4 and widths of function classes can have several goals, and these were actually all noted by Kolmogorov himself in 1956. Firstly, it can lead to invariants

3The interested reader can consult the page of Tikhomirov at Math-Net: http://www.mathnet.ru/eng/person8555. 4This is a reference to entropy numbers, which are related to the concept of covering numbers defined in Section 6.6. 8.2. N-WIDTHS IN APPROXIMATION THEORY 149 enabling us to distinguish function sets of different massiveness. The meaning of the most fundamental concept of “number of variables” often manifests itself in this way. Secondly, computations of widths and entropy make it possible to find new untraditional methods of approximation and to argue in favour of their expediency. Thirdly, the exact solution of the corresponding extremal problems makes it possible to enrich means of analysis by starting from the conviction that what is good does not merely turn out to be good for a single question, but usually turns out to be useful for other questions. Fourthly, and finally, this can be of interest for computational mathematics by giving guidelines for the creation of the most expedient algorithms for solving practical problems.” [Tikhomirov ’83]. Tikhomirov points out that in a discussion with Gel’fand, the former expressed the idea that there must be a dual to the Kolmogorov width. So this was introduced in 1965 by Tikhomirov through the following definition (it was introduced independently by S. Smolyak and I. Sharygin in the same year).

Definition 8.2. The Gel’fand m-widht of a subset K of a normed space X is defined as   m d (K,X) = inf sup x ,Xm subspace of X with codim(Xm) m . X m x∈K∩Xm || || ≤

A subspace Xm of X is of codimension at most m in and only if there exists linear functionals λ1, . . . , λm : X R in the dual space X∗ such that →

Xm = x X : λi(x) = 0 for all i [m] = ker A, { ∈ ∈ } m where A : X R , x [λ1(x), . . . , λm(x)]. Using A, we have the following alternative definition → 7→   m m d (K,X) = inf sup x ,A : X R linear . Xm x∈K∩ker A || || → This means that the width measures the extent to which one can determine x X, given the value of m functionals applied on x. ∈ Remark 38. In Compressive Sensing, the Gel’fand widths will be more important than the Kolmogorov ones, as we will see through this chapter.

The following Theorem establishes the duality of these widths, as suggested by Gel’fand.

Theorem 8.3. For 1 p, q , let p∗, q∗ be such that 1/p∗ + 1/p∗ = 1 and 1/q + 1/q∗ = 1. Then ≤ ≤ ∞ N N m N N dm(Bp , `q ) = d (Bq∗ , `p∗ ) In order to prove this result, we need the definition of a dual norm and a Lemma.

Definition 8.4. Let . be a norm on a Banach space. Its dual norm is defined by x ∗ = sup x, y . || || || || ||y||≤1 |h i| Lemma 8.5. Let Y be a finite-dimensional subspace of a normed space X. Given x X Y and y∗ Y , the following statements are equivalent: ∈ \ ∈

1. y∗ is a best approximation to x from Y , that is, y∗ x y x , y Y . || − || ≤ || − || ∀ ∈ 2. x y∗ = λ(x) for some linear functional λ vanishing on Y and satisfying λ 1. || − || || || ≤ Proof. Assume that 1) holds. Let us define the linear functional λ˜ on the space Y span(x) by ⊕ ∗ λ˜(y + tx) = t x y , y Y and t R, || − || ∀ ∈ ∈ When setting t = 0, we see that λ˜ vanishes on Y . Furthermore, for y Y and t = 0, we have ∈ 6 λ˜(y + tx) = t x y∗ t x ( y/t) = y + tx , | | | | || − || ≤ | | || − − || || || 150 CHAPTER 8. OPTIMALITY IN THE NUMBER OF MEASUREMENTS since y∗ is the best approximation to x. Dividing both sides of this inequality (which is obviously valid for t = 0) by y + tx , we have λ˜ 1. Using the Hahn-Banach Extension Theorem, the functional || || || || ≤ λ : X R we are looking for is just the extension to X provided by this theorem. → Now assume that 2) holds. First observe that λ(y) = 0, y Y and so ∀ ∈ x y∗ = λ(x) = λ(x y) λ x y x y , y Y. || − || − ≤ || || || − || ≤ || − || ∀ ∈

Now we are able to prove the duality relation between Gel’fand and Kolmogorov widths.

N N N Proof. (Theorem 8.3) Recall that `q denotes R with the . q norm. Given a subspace Xm of `q with N || || N dim(Xm) m and a vector x Bp , Lemma 8.5 shows that there exists a functional λ : `q R with ≤ ∈ → λ q∗ 1 and λ(Xm) = 0 such that || || ≤

∗ inf x z q = x z q = λ(x) λ q∗ x q x q = sup u, x = sup u, x = sup u, x , z∈Xm || − || || − || ≤ || || || || ≤ || || ||u|| ∗ =1h i N h i N ⊥ h i q u∈Bq∗ u∈Bq∗ ∩Xm where in the last equality we used the fact that the functional satisfies λ(Xm) = 0 and so u, z = 0 ⊥ h i z Xm implies u X . For z Xm, we have ∀ ∈ ∈ m ∈

sup u, x = sup u, x z sup u q∗ x z q = x z q. N h i N ⊥ h − i ≤ N ⊥ || || || − || || − || u∈Bq∗ u∈Bq∗ ∩Xm u∈Bq∗ ∩Xm Therefore we deduce the equality

inf x z q = sup u, x . z∈Xm || − || N ⊥ h i u∈Bq∗ ∩Xm It follows that

sup inf x z q = sup sup u, x = sup sup u, x = sup u p∗ x∈BN z∈Xm || − || x∈BN N ⊥ h i N ⊥ x∈BN h i N ⊥ || || p p u∈Bq∗ ∩Xm u∈Bq∗ ∩Xm p u∈Bq∗ ∩Xm

⊥ Now, note that there is a correspondence between the subspaces Xm and the subspaces Lm with codim(Lm) m. Taking the infimum over all Xm with dim(Xm) m yields ≤ ≤ N N m N N dm(Bp , `q ) = d (Bq∗ , `p∗ ).

It is difficult to prove general results about widths. There is essentially only one general result available. It was proved in [Tikhomirov ’66]. All other results are proved for very particular cases.

Theorem 8.6. (Tikhomirov ’66): For every sequence αi i∈N satisfying α1 > > αn and αn 0, there exists a Banach space X and a compact set K X{ such} that ··· → ⊂

dn(K,X) = αn n = 0, 1,... The same holds for the Kolmogorov width dm(K,X).

It is interesting to note that the reference which transformed the theory of widths, a theory previously studied only in Soviet Union, into a well known field in western mathematics was the book [Lorentz ’66]. The reason, as described in [Pietsch ’07], is that “G.G. Lorentz (1910-2006) emigrated from Leningrad to the USA. After, Lorentz wrote the book which opened up the concepts of widths and entropy to the “Free World” and has become a standard reference.” 8.3. COMPRESSIVE WIDTHS 151

8.3 Compressive Widths

The power of the theory of widths enters compressive sensing through the following definition, in which a way of “measuring” the worst-case reconstruction errors is given.

Definition 8.7. The (non-adaptive) compressive m-width of a subset K of a normed space X is defined as   m m m E (K,X) = inf sup x ∆(Ax) ,A : X R linear and ∆ : R X x∈K || − || → →

m In this definition, no assumptions are made on the reconstruction map ∆ : R X: it could be `0- minimization, a greedy algorithm or even a thresholding algorithm. It also can vary with→ the measurement matrix A or be fixed. Moreover, the measurement map A is nonadaptive. It is given by m fixed linear functionals λ1, . . . , λm, that is, the action of A is Ax = [λ1(x), . . . , λm(x)] . We could also consider an adaptive map where a choice of a measurement depends on the previous one through a specific rule. This adaptive map F : X Rm, is represented by →   F (x) = λ1(x), λ2;λ1(x), . . . , λm;λ1(x),...,λm−1(x)(x) , (8.1) that is, the linear functional λj;λ1(x),...,λj−1(x)(x) is allowed to depend on previous λj−1;λ1(x),...,λj−2(x)(x). With this definition in mind, we introduce a notion of width for this kind of measurements.

Definition 8.8. The adaptive compressive m-widht of a subset K of a normed space X is defined as   m m m Eada(K,X) = inf sup x ∆(Ax) ,A : X R adaptive and ∆ : R X . x∈K || − || → → Remark 39. As [Rauhut & Foucart ’13] points out, intuitively, we expect that the notion of adaptivity should improve the measurement/reconstruction scheme. However this is false and is a particular case of the general philosophy from information-based complexity that “adaptivity does not help”. For more information, see [Novak & Wozniakowski ’08].

In their seminal paper [Cohen, Dahmen & DeVore ’09], they introduced the notion of null space prop- erty described in Chapter 4 and proved a result relating adaptive and non-adaptive reconstruction (see also [Donoho ’06], where many similar results are proved). When considering the worst case recon- struction over K, this theorem shows that adaptive and non-adaptive compressive sensing widths are equivalent. Even more, under reasonable conditions on K, they are equivalent to the Gel’fand widths.

Theorem 8.9. Let X be a normed space. If K X, then ⊂ Em (K,X) Em(K,X). ada ≤ If K = K, then − m m d (K,X) E (K,X). ≤ ada Furthermore, if K + K aK for some positive constant a, then ⊂ Em(K,X) a dm(K,X). ≤ Proof. The first inequality is obvious, as any linear map A : X Rm can be considered adaptive. For → m the second inequality, assume that we have an adaptive map Aada : X R as described by (8.1) and → a reconstruction map ∆ : Rm X. We will consider subspaces of X generated by the kernel of a map m → A : X R defined by A(x) = [λ1(x), λ2;0(x), . . . , λm;0,...,0(x)], so we set Xm = ker A. As →

dim(Im(A)) + dim(ker(A)) = dim X and dim(ker(A)) + codim(ker(A)) = dim X, 152 CHAPTER 8. OPTIMALITY IN THE NUMBER OF MEASUREMENTS then codim(ker(A)) = dim(Im(A)) m. So, by the definition of the Gel’fand widths, we have ≤

  m d (K,X) = inf sup x ,Xm subspace of X with codim(Xm) m sup x . (8.2) X m x∈K∩Xm || || ≤ ≤ x∈K∩ker A || ||

For the image of v ker A, the image of the adaptative map Aada will be zero, because ∈

λ1(v) = 0 = λ (v) = λ2;0(v) = 0 = = λ (v) = λm;0,...,0(v) = 0 ⇒ 2;λ1(v) ⇒ · · · ⇒ m;λ1(v),...,λm−1(v)

= Aada(v) = 0. ⇒ Thus, taking v K ker A, we have ∈ ∩

v ∆(0) = v ∆(Aada(v)) sup x ∆(Aada(x)) . || − || || − || ≤ x∈K || − || By the hypothesis, K = K, which implies v K ker A and then − − ∈ ∩

v ∆(0) = v ∆(Aada(v)) sup x ∆(Aada(x)) . || − − || || − − || ≤ x∈K || − || Then, for any x K ker A, ∈ ∩

1 1 1 1 x = (x ∆(0)) ( x ∆(0)) x ∆(0) + x ∆(0) sup x ∆(Aada(x)) . || || 2 − − 2 − − ≤ 2|| − || 2|| − − || ≤ x∈K || − ||

Incorporating this in (8.2), we obtain

m d (K,X) sup x ∆(Aada(x)) . ≤ x∈K || − || m m Taking the infimum over all possible Aada and ∆, we conclude that d (K,X) E (K,X). ≤ ada Now let us use the same construction as before, namely, consider a map A : X Rm and a subspace → m of X given by Xm = ker A. We need an appropiate choice of a reconstruction map. Define ∆ : R X such that →

∆(y) K A−1(y) y A(K). ∈ ∩ ∀ ∈ This choice implies that   Em(K,X) sup x ∆(A(x)) sup sup x z ≤ x∈K || − || ≤ x∈K z∈K∩A−1(Ax) || − || Analyzing x z, for x K and z K A−1(Ax), we see that it belongs to K + ( K) and also to − ∈ ∈ ∩ − ker A = Xm. Our hypothesis says that K + K aK for some positive constant a, hence we have ⊂ Em(K,X) sup u sup v . ≤ u∈aK∩Xm || || ≤ v∈K∩Xm || || m m Taking the infimum, this time over Xm, we obtain E (K,X) ad (K,X). ≤ Now that we know that the comparison between widths makes sense, the question of which are the “good” sets K to consider in the context of Compressive Sensing arises. We saw in Proposition 1.5 that N N the unit balls Bp in `p with small p are good models for compressible vectors. Then maybe we should N N consider these sets and try to estimate their widths. If it is possible to estimate dm(Bp , `p ) for small p m N N then, by the Theorem 8.9, we will be able to estimate E (Bp , `p ). This will provide constrains to the 8.4. OPTIMAL NUMBER OF MEASUREMENTS 153 quantities involved in the coding/decoding process. Therefore the sharp relation between the number of measurements and the ambient space of the signals will be established. N N In the quest to estimate the Gel’fand widths dm(Bp , `p ) for small p, some difficulties arise when p < 1, since in this case we do not have a norm, but an `p-quasinorm. Here we will only treat the case p = 1 and refer to the corresponding literature for the other cases. Nevertheless, the case p = 1 allows us to establish a sharp relation between the number of measurements and the dimension of the signal. N N The estimates of dm(Bp , `1 ) are the content of an important theorem, which will be proved in Section 8.5. It was proved by [Gluskin ’84] and [Garnaev & Gluskin ’84] following work by [Kashin ’77]. In the latter, the author explored the duality relation and developed some results about Gel’fand widths in the course of determining the Kolmogorov widths of some Sobolev spaces. Before proving this very important result, let us analyze what happens if we try to use basis pursuit as a method of sparse recovery.

8.4 Optimal Number of Measurements

In this chapter we are interested in comparing m and N, this is, the relation between the amount of measured information and the dimension of the signals under the process of measurement and recover. We start this section by showing a necessary condition of combinatorial flavor for Basis Pursuit.

m×N N Theorem 8.10. Given a matrix A R , if every 2s-sparse vector x R is a minimizer of z 1 subject to Az = Ax, then ∈ ∈ || ||

 N  m C1s ln , ≥ C2s where C1 = 1/ ln 9 and C2 = 4. In order to prove this result, we need to estimate the amount of sets with a given intersection between them. This was done in the context of Compressive Sensing by [Foucart, Pajor, Rauhut & Ullrich ’10] and their proof follows [Mendelson, Pajor & Rudelson ’05]. However, it seems that this combinatorial lemma was already known by the community of researchers in combinatorics and complexity theory. See for example [Graham & Sloane ’80] and [Noam & Avi ’94]. Lemma 8.11. Given integers s < N, there exist

 N s/2 n ≥ 4s subsets S1,...,Sn of [N] such that each Sj has cardinality s and s #(Si Sj) < for i = j. ∩ 2 6 Proof. Let us assume that s N/4, because otherwise we have 1 > N and it suffices to take n = 1 ≤ 4s subset of [N]. Denote by s the family of subsets of [N] having cardinality s. Let us draw a fixed A s element S1 s and cluster all the sets S s such that #(S S1) 2 into a family 1s. The cardinality of∈ this A family is estimated by ∈ A ∩ ≥ A

s s s X sN s N s X s N s X s #( 1s) = − max − max − A k s k ≤ ds/2e≤k≤s s k k ≤ ds/2e≤k≤s s k k k=ds/2e − − k=ds/2e − k=0 N s N s 2s max − = 2s − , ≤ ds/2e≤k≤s s k s/2 − b c where the last equality holds because it is known that a general binomial number attains its maximum when k = N/2 so, in our case, where s N/4 the maximum of s k is s/2, which is still smaller than ≤ − 154 CHAPTER 8. OPTIMALITY IN THE NUMBER OF MEASUREMENTS

(N s)/2. Looking to the complement of 1s, it is clear, by definition, that any S s 1s satisfies − A ∈ A \A #(S S1) < s/2, provided the latter is nonempty. Then we take a new element S2 s 1s and make ∩ s ∈ A \A a new cluster by putting together all the sets S s 1s such that #(S S2) . This new family will ∈ A \A ∩ ≥ 2 be named 2s. By the same argument, we have A   s N s #( 2s) 2 − . A ≤ s/2 b c Next, observe that any set S s ( 1s 2s) satisfies #(S S1) < s/2 and #(S S2) < s/2 simultane- ∈ A \ A ∪ A ∩ ∩ ously. Inductively we repeat this construction and selection of sets S1,S2,...,Sn until s ( 1s ns) A \ A ∪· · ·∪A is empty. Hence all the constructed sets must satisfy #(Si Sj) < s/2 for i = j. Moreover, this con- P∩n 6 struction was done in such a way that i is = s and so # is # ( is) = #(As). With this ∪ A A i=1 A ≥ ∪ A observation in mind, we can estimate the cardinality of s through A n X   #( s) #( is) n max #( is) . A ≤ A ≤ 1≤i≤n A i=1 Then

N #( s) s n A N−s  ≥ max1≤i≤n #( is) ≥ 2s A bs/2c 1 N(N 1) ... (N s + 1) 1 = − − 2s (N s)(N s 1) ... (N s s/2 + 1) s(s 1) ... ( s/2 + 1) − − − − − b c − b c 1 N(N 1) ... (N s/2 + 1) 1 N bs/2c  N s/2 − − b c . ≥ 2s (s)(s 1) ... (s s/2 + 1) ≥ 2s s ≥ 4s − − b c

Now we are able the prove Theorem 8.10, which says that: “if Basis Pursuit works, then the minimal  number of measurements is given by m C1s ln N/(C2s) for all s-sparse vectors” or, in other words, what is the maximum discrepancy between≥ the number of rows and the number of columns when we have a unique sparse solution of a linear system Ax = b. Proof. (Theorem 8.10) Consider the quotient space

N N X = `1 / ker A = [x] = x + ker A, x R , { ∈ } N endowed with the norm [x] = infv∈ker A x v 1. If we have a 2s-sparse vector x R , then for || || || − || ∈ v ker A and z = x v, we have Az = A(x v) = Ax and thus [x] = x 1. Let S1,...Sn be sets ∈ − − || || || || as the ones introduced in the Lemma 8.11 and let us define s-sparse vectors x1, . . . , xn RN with unit ∈ `1-norms by ( i 1/s if k Si, xk = ∈ 0 if k / Si. ∈ i j i j i j i j As the vector x x is 2s-sparse, we have, for 1 i = j n, [x ] [x ] = [x x ] = x x 1 i −j ≤ 6 ≤ || − || || − || || − || and since xk xk = 1/s if k Si∆Sj = (Si Sj) (Si Sj), vanishes otherwise and #(Si∆Sj) > s, we i | j− | ∈ ∪ \ ∩ have x x 1 > 1. This implies || − || i j [x ] [x ] > 1 1 i = j n, − ∀ ≤ 6 ≤ and shows that [x1],..., [xn] is a 1-separated subset of the unit sphere of X, which has dimension r = rank(A) m{, since it is isomorphic} to the image of A. Theorem 6.51 implies that n 3r 3m. By Lemma 8.11,≤ the number of sets in this construction is at least ≤ ≤ 8.5. THE THEOREM OF KASHIN, GARNAEV AND GLUSKIN 155

 N s/2 n 4s ≤ and this implies

 N s/2 3m. 4s ≤ The conclusion follows by taking the logarithm on both sides.

8.5 The Theorem of Kashin, Garnaev and Gluskin

An important problem posed by Kolomogorov in 1950s was to compute widths in Sobolev spaces. In the sixties, Tikhomirov and Babenko solved some of these problems using ideas of duality and Gel’fand widths. Kashin made important progress when he reduced the width calculation for function spaces to N N the finite-dimensional case of Bp inside `q [Tikhomirov ’03]. It is important to note that [Kashin ’77] was the one of the first papers in which random matrices are explicitly used to study the geometry of Banach spaces. See [Davidson & Szarek ’01]. This motivated many soviet mathematicians like Garnaev, Gluskin, Temlyakov to work on this type of estimates. For the width of the `1-ball, Kashin provided only the upper bound in Theorem 8.12 and did not achieve the best exponent in the logarithm, whereas Garnaev and Gluskin proved the lower bounds and provided the correct exponent. Kashin’s construction of subspaces Xm involves m N Bernoulli random matrices with independent entries. On the other hand, the construction of Garnaev× and Gluskin for the lower bound used Gaussian matrices. More recently, [Milman & Pajor ’03] realized that Bernoulli matrices could be used to yield the upper bound and they also generalized the result to an arbitrary compact and convex body K RN . ⊂ Theorem 8.12. For 1 < p 2 and m < N, there exists constants c1, c2 > 0 depending only on p such that ≤

 1−1/p  1−1/p ln(eN/m) m N N ln(eN/m) c1 min 1, d (B , ` ) c2 min 1, . m ≤ 1 p ≤ m We split the proof of this result into two parts: the upper and the lower bound, and present a modern version based on Compressive Sensing techniques. In many results in sparse recovery, we find estimates of the form m cs ln(eN/m). We first need a lemma about changing the m inside the logarithm by s. ≥ Lemma 8.13. Let N m s be positive integers, and c, d two positive real numbers. If m cs ln(dN/m) then ≥ ≥ ≥

1. m c0s ln(dN/s) with c0 = ec/(e + c), ≥ 2. m cs˜ ln(dN/s) with c˜ = c/(1 + ln(c)) provided c, d e. ≥ ≥ Proof. Rewriting the hypothesis m cs ln(dN/m) we obtain: ≥ dN   s  dN  s  s  m cs ln + cs ln = cs ln + cm ln . ≥ s m m m m s Taking x = m , consider f(x) = x ln(x), which is decreasing on (0, 1/e) and increasing on (1/e, + ) and has a minimum value of 1/e. We conclude that ∞ − dN  s  s  dN  cm m cs ln + cm ln cs ln , ≥ m m m ≥ m − e 156 CHAPTER 8. OPTIMALITY IN THE NUMBER OF MEASUREMENTS and this implies the first inequality

 c −1 dN  m 1 + cs ln . ≥ e s

From m cs ln(dN/s), we have ≥ s 1 1 . m ≤ c ln(dN/m) ≤ c As f(x) = x ln x is a decreasing function on (0, 1/e), this leads to f s  f 1  = ln c and then m ≥ c − c dN  s  s  dN   s  dN  ln c m cs ln + cm ln = cs ln + cmf cs ln cm . ≥ m m m m m ≥ m − c The result of the second part follows.

N Proof. (Upper Bound of Theorem 8.12): For x R , when using the inequality x p x 1 in the definition of Gel’fand width, it yields ∈ || || ≤ || ||

  m N N N 0 N d (B1 , `p ) = inf sup x p,Xm subspace of `p with codim(Xm) m d (B1, `1 ) = 1. Xm N || || ≤ ≤ x∈B1 ∩Xm

So if m c ln(eN/m) then ≤  c ln(eN/m)1−1/p dm(BN , `N ) min 1, . (8.3) 1 p ≤ m On the other hand, if m > c ln(eN/m), let us define s 1 to be the largest integer smaller than m/(c ln(eN/m)), so that ≥ 1 m m s < . 2 c ln(eN/m) ≤ c ln(eN/m) Recall Lemma 8.13 we have that for c > e, m > cs ln(eN/m) implies m > cs˜ ln(eN/s), with c˜ = c/(1 + ln(c)). Choosing c = 1160,

1160 m > s ln(eN/s) > 144s ln(eN/s). 1 + ln(1160)

Using Theorem 7.13, which says that if m 2η−2s ln(eN/s) + ln(2ε−1), then we have ≥ "      2 # 1 1 1 2 P δs A 2 1 + p η + 1 + p η 1 ε, √m ≤ 2 ln(eN/s) 2 ln(eN/s) ≥ − for Gaussian random matrices. So we choose η = 1/6 and ε = 2 exp( m/144) and our hypothesis (“If m cs ln(dN/m) then”) turns into − ≥  m  m 2 36s ln(eN/s) + ln(exp( m/144)−1) = 72 s ln(eN/s) + . ≥ · − 144 p This implies m 144s ln(eN/s), by Lemma 8.13. Using the estimate 1 + 1/ 2 ln(eN/s) 2 in Theorem ≥ ≤ 7.13, we can guarantee the existence of a measurement matrix5 B Rm×N with restricted isometry constant given by ∈

5 √ In fact, what we guarantee√ is the existence of a `2-normalized matrix with δs(A/ m) ≤ δ. As only the existence is important, we just take B = A/ m. 8.5. THE THEOREM OF KASHIN, GARNAEV AND GLUSKIN 157

2 4 4 7 δs(B) δ = 4η + 4η = + = . ≤ 6 36 9

Now, we split the index set [N] as a disjoint union S0 S1 S2 ... of index sets of size s in such a way ∪ ∪ ∪ that xi xj whenever i Sk−1, j Sk and k 1. By Lemma 5.18 we have xS 2 xS 1/√s | | ≥ | | ∈ ∈ ≥ || k || ≤ || k−1 || for all k 1. Hence, for x Xm = ker A, we have ≥ ∈

1/p−1/2 X X 1/p−1/2 X s x p xS p s xS 2 A(xS ) 2 || || ≤ || k || ≤ || k || ≤ √1 δ || k || k≥0 k≥0 k≥0 − 1/p−1/2 "   # 1/p−1/2 " # s X X s X = A xS + A(xS ) 2 2 A(xS ) 2 √1 δ − k || k || ≤ √1 δ || k || − k≥1 2 k≥1 − k≥1 r r 1 + δ 1/p−1/2 X 1 + δ 1/p−1/2 X 2 s xS 2 2 s xS 1/√s ≤ 1 δ || k || ≤ 1 δ || k−1 || − k≥1 − k≥1 r r !1−1/p 1 + δ 1 X 1 + δ 2c ln(eN/m) = 2 xS 1 2 x 1, 1 δ s1−1/p || k−1 || ≤ 1 δ m || || − k≥1 − where in the second inequality we used that 1 < p 2. Choosing δ = 7/9 and using 21−1/p 2, it follows that, for all x BN Xm, ≤ ≤ ∈ 1 ∩ !1−1/p c ln(eN/m) x p 8√2 . || || ≤ m

This shows that, if m > c ln(eN/m), then

( )1−1/p c ln(eN/m) dm(BN , `N ) 8√2 min 1, . (8.4) 1 p ≤ m From 8.3 and 8.4, we conclude that

( )1−1/p ln(eN/m) dm(BN , `N ) C min 1, , 1 p ≤ m with C = 1160 8√2 = 9280√2. · N We now establish the lower bound for the Gel’fand widths of `1-balls in ` for 1 < p . p ≤ ∞ Proof. (Lower Bound of Theorem 8.12): Let c0 = 2/(1 + 4 ln 9). First, we need to understand why it is sufficient to show that

( )1−1/p 1 c0 ln(eN/m) dm(BN , `N ) min 1, . 1 p ≥ 22−1/p m

2−1/p ln(eN/m) Letting K = 1/2 and ϕ := m , we need to see why

dm(BN , `N ) K min 1, c0ϕ 1−1/p, 1 p ≥ { } implies , that there is a positive x such that

dm(BN , `N ) c min 1, ϕ 1−1/p. 1 p ≥ { } 158 CHAPTER 8. OPTIMALITY IN THE NUMBER OF MEASUREMENTS

This is just the following sequence of logical implications:

dm(BN , `N ) K min 1, c0ϕ 1−1/p 1 p ≥ { } implies dm(BN , `N ) K or dm(BN , `N ) K(c0ϕ)1−1/p 1 p ≥ 1 p ≥ that implies dm(BN , `N ) min K,K(c0)1−1/p or dm(BN , `N ) min K,K(c0)1−1/p ϕ1−1/p 1 p ≥ { } 1 p ≥ { } which, in turn, implies dm(BN , `N ) c min 1, ϕ 1−1/p with c = min K,K(c0)1−1/p . 1 p ≥ { } { } 0  c ln(eN/m) m N N 1−1/p 2−1/p Define µ = min 1, m . Now, assume by contradiction that d (B1 , `p ) µ /2 . Then N ≤ there exists a subspace Xm of R with codim(Xm) m such that, for all v Xm 0 , ≤ ∈ \{ } µ1−1/p v p < v 1. || || 22−1/p || || m×N Again, as in the proof of Theorem 8.9, let us consider a matrix A R such that ker A = Xm. Let s = 1/µ 1, so that 1/(2µ) < s 1/µ. So, for all v ker A 0 ,∈ b c ≥ ≤ ∈ \{ } 1  1 1−1/p v p < v 1. || || 2 2s || || 1−1/p 1−1/p By the Hölder inequality, we have v 1 N v p and then 1 < (N/2s) . Hence, we can deduce that 2s < N. Then, again by the Hölder|| || ≤ inequality,|| || for S [N] with #S 2s and for v ker A 0 , we have ⊂ ≤ ∈ \{ }

1−1/p 1−1/p 1 vS 1 (2s) vS p (2s) v p < v 1. || || ≤ || || ≤ || || 2|| || The conclusion is that A has the null space property of order 2s. Due to Theorem 3.3, every 2s-sparse N vector x R is uniquely recovered from y = Ax via `1-minimization. By Theorem 8.10, the following inequality∈ holds:

 N  1 m c1s ln , with c1 = and c2 = 4. ≥ c2s ln 9

Also we have that m 2(2s) = 4s = c2s as a consequence of Theorem 1.11. Thus ≥         N N eN c1 eN c1 m c1s ln c1s ln = c1s ln c1s > ln m. ≥ c2s ≥ m m − 2µ m − 4 Rearranging leads to

\[
m > \frac{2c_1\ln(eN/m)}{(4 + c_1)\min\{1, c_0\ln(eN/m)/m\}} \geq \frac{2c_1\ln(eN/m)}{(4 + c_1)\,c_0\ln(eN/m)/m} = \frac{2(\ln 9)^{-1}\ln(eN/m)}{\left(4 + (\ln 9)^{-1}\right)c_0\ln(eN/m)/m} = m.
\]
This is the contradiction we were looking for, and so we obtain the lower estimate for the Gel’fand widths.
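For completeness, the final equality can be checked directly from the choices $c_1 = 1/\ln 9$ and $c_0 = 2/(1 + 4\ln 9)$:
\[
\frac{2c_1}{(4 + c_1)\,c_0} = \frac{2/\ln 9}{\dfrac{4\ln 9 + 1}{\ln 9}\cdot\dfrac{2}{1 + 4\ln 9}} = \frac{2}{4\ln 9 + 1}\cdot\frac{1 + 4\ln 9}{2} = 1.
\]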

This theorem was extended in [Donoho ’06] to Gel’fand widths of $\ell_p$-balls with $p < 1$. Unfortunately, his proof of the lower bound contains a gap. In the same paper, he proved the upper bound with $\log(N)$ instead of $\log(N/m)$. Then, [Vybiral ’08] provided the correct upper bound when $p \leq 1$. After this, [Foucart, Pajor, Rauhut & Ullrich ’10] used methods based on compressive sensing techniques (the RIP, for example) to establish the lower bound for the case $p \leq 1$. This was shown in Theorem 8.12, where we exhibited their proof for the case $p = 1$. The final result is the following.

Theorem 8.14. ([Foucart, Pajor, Rauhut & Ullrich ’10]): For $0 < p \leq 1$, $p < q \leq 2$ and $m < N$, there exist constants $c_{p,q}, C_{p,q} > 0$ depending only on $p$ and $q$ such that

\[
c_{p,q}\min\left\{1, \frac{\ln(eN/m)}{m}\right\}^{1/p-1/q} \leq d^m(B_p^N, \ell_q^N) \leq C_{p,q}\min\left\{1, \frac{\ln(eN/m)}{m}\right\}^{1/p-1/q}.
\]
Remark 40. For $1 \leq p, q \leq \infty$, the study of widths is typically divided into four regions. We summarize here the results known for Gel’fand widths. See Chapter 14 in [Lorentz, von Golitschek & Makovoz ’96] for the proof of these results and more results concerning widths in general.

[Figure: the $(p,q)$ plane divided into the four regions below, with $p$ on the horizontal axis and $q$ on the vertical axis.]

I: $1 \leq q \leq p \leq \infty$; \quad II: $1 \leq p \leq q \leq 2$; \quad III: $2 \leq p \leq q \leq \infty$; \quad IV: $1 \leq p \leq 2 \leq q \leq \infty$.

Region II: If $1 < p < q \leq 2$,
\[
d^m(B_p^N, \ell_q^N) \asymp \min\left\{1, \frac{N^{1-1/p}}{m^{1/2}}\right\}^{\frac{1/p-1/q}{1/p-1/2}}.
\]
Region III: If $2 \leq p < q \leq \infty$,
\[
d^m(B_p^N, \ell_q^N) \asymp \max\left\{\frac{1}{N^{1/p-1/q}},\ \left(1 - \frac{m}{N}\right)^{\frac{1/p-1/q}{1-2/q}}\right\}.
\]

Region IV: If $1 < p \leq 2 < q \leq \infty$,
\[
d^m(B_p^N, \ell_q^N) \asymp \max\left\{\frac{1}{N^{1/p-1/q}},\ \left(1 - \frac{m}{N}\right)^{1/2}\min\left\{1, \frac{N^{1-1/p}}{m^{1/2}}\right\}\right\}.
\]

8.6 Connections Between Widths

With these estimates in hand, we can relate Gel’fand widths with compressive widths.

Corollary 8.15. For $1 < p \leq 2$ and $m < N$, the adaptive and nonadaptive compressive widths satisfy
\[
E^m_{\mathrm{ada}}(B_1^N, \ell_p^N) \asymp E^m(B_1^N, \ell_p^N) \asymp \min\left\{1, \frac{\ln(eN/m)}{m}\right\}^{1-1/p}.
\]

Proof. Since $-B_1^N = B_1^N$ and $B_1^N + B_1^N \subset 2B_1^N$, Theorem 8.9 implies
\[
d^m(B_1^N, \ell_p^N) \leq E^m_{\mathrm{ada}}(B_1^N, \ell_p^N) \leq E^m(B_1^N, \ell_p^N) \leq 2\,d^m(B_1^N, \ell_p^N).
\]
By the theorem of Kashin, Garnaev and Gluskin, the result follows.

Proposition 8.16. Let $1 < p \leq 2$. Suppose that there exist a matrix $A \in \mathbb{R}^{m\times N}$ and a map $\Delta : \mathbb{R}^m \to \mathbb{R}^N$ such that, for all $x \in \mathbb{R}^N$,
\[
\|x - \Delta(Ax)\|_p \leq \frac{C}{s^{1-1/p}}\,\sigma_s(x)_1. \tag{8.5}
\]

Then, for some constants c1, c2 > 0 depending only on C, we have

\[
m \geq c_1 s\ln\left(\frac{eN}{s}\right),
\]

provided that $s > c_2$. The same statement holds for an adaptive map $F : \mathbb{R}^N \to \mathbb{R}^m$ in place of a linear map $A$.

Proof. Again, it suffices to prove the proposition for an adaptive map $F : \mathbb{R}^N \to \mathbb{R}^m$, since every linear map can be considered adaptive. Hypothesis (8.5) implies

\[
E^m_{\mathrm{ada}}(B_1^N, \ell_p^N) \leq \frac{C}{s^{1-1/p}}\sup_{x\in B_1^N}\sigma_s(x)_1 \leq \frac{C}{s^{1-1/p}}.
\]
By Corollary 8.15, there exists a constant $c > 0$ such that

\[
c\min\left\{1, \frac{\ln(eN/m)}{m}\right\}^{1-1/p} \leq E^m_{\mathrm{ada}}(B_1^N, \ell_p^N) \leq \frac{C}{s^{1-1/p}}.
\]

Hence, there is some constant $\tilde{c} > 0$ such that

\[
\tilde{c}\,\min\left\{1, \frac{\ln(eN/m)}{m}\right\} \leq \frac{1}{s}.
\]

We conclude that either $s \leq 1/\tilde{c}$ or $m \geq \tilde{c}s\ln(eN/m)$. Setting $c_2 = 1/\tilde{c} < s$, we must be in the second case (we will deal with the case $s \leq c_2$ in Section 8.7 because it is more delicate). To finish the proof, we need to prove that $m \geq \tilde{c}s\ln(eN/m)$ implies $m \geq c_1 s\ln(eN/s)$ with $c_1 = \tilde{c}e/(\tilde{c} + e)$. But this is the content of Lemma 8.13.
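For orientation, here is our own numerical illustration of the constant in Lemma 8.13: taking $\tilde{c} = 1$ gives
\[
c_1 = \frac{\tilde{c}\,e}{\tilde{c} + e} = \frac{e}{1 + e} \approx 0.731,
\]
so the condition $m \geq s\ln(eN/m)$ already forces $m \geq 0.731\, s\ln(eN/s)$.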

We will remove the restrictions $s > c_2$ and $p > 1$ in Section 8.7. Now, accepting that this proposition is true for all values of $s$ and $p$, we can state the result on the minimal number of measurements for the best known values that ensure the RIP is satisfied.

Corollary 8.17. If the $2s$-th restricted isometry constant of $A \in \mathbb{R}^{m\times N}$ satisfies $\delta_{2s} < 4/\sqrt{41} \approx 0.6246$, then necessarily

\[
m \geq cs\ln\left(\frac{eN}{s}\right)
\]
for some constant $c > 0$ depending only on $\delta_{2s}$.

Proof. If $\delta_{2s} < 4/\sqrt{41}$ and if $\Delta$ is basis pursuit, i.e., the $\ell_1$-minimization reconstruction map, we proved in Theorem 5.19 that
\[
\|x - \Delta(Ax)\|_2 \leq \frac{C}{s^{1/2}}\,\sigma_s(x)_1
\]
holds for some constant $C$ depending only on $\delta_{2s}$. By Proposition 8.16, the conclusion follows.

8.7 Instance Optimality

Suppose that we have a measurement-reconstruction scheme chosen for $s$-sparse recovery. We will compare the reconstruction error in $\ell_p$ with the best $s$-term approximation error in $\ell_p$. In order to do this, we define the concept of $\ell_p$-instance optimality for a pair of measurement matrix and reconstruction map.

Definition 8.18. Given $p \geq 1$, a pair of a measurement matrix $A \in \mathbb{C}^{m\times N}$ and a reconstruction map $\Delta : \mathbb{C}^m \to \mathbb{C}^N$ is $\ell_p$-instance optimal of order $s$ with constant $C > 0$ if
\[
\|x - \Delta(Ax)\|_p \leq C\,\sigma_s(x)_p \quad\text{for all } x \in \mathbb{C}^N.
\]
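The definition can be probed numerically. The following is a minimal sketch of our own (assuming NumPy and SciPy are available, and using a real Gaussian matrix with basis pursuit as the reconstruction map, solved as a linear program); for a compressible vector it reports the ratio between the reconstruction error and $\sigma_s(x)_1$, which plays the role of the constant $C$ in the $\ell_1$ case.

\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
N, m, s = 60, 30, 4
A = rng.standard_normal((m, N)) / np.sqrt(m)

# a compressible (not exactly sparse) vector: s large entries plus a small tail
x = np.zeros(N)
x[:s] = rng.standard_normal(s)
x += 0.01 * rng.standard_normal(N)
y = A @ x

# basis pursuit  min ||z||_1  s.t.  Az = y,  written as an LP in (u, v) >= 0 with z = u - v
c = np.ones(2 * N)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None), method="highs")
z = res.x[:N] - res.x[N:]

sigma_s = np.sort(np.abs(x))[: N - s].sum()   # best s-term approximation error in l1
err = np.linalg.norm(x - z, 1)
print("||x - Delta(Ax)||_1 :", err)
print("sigma_s(x)_1        :", sigma_s)
print("empirical constant  :", err / sigma_s)
\end{verbatim}

Dedicated $\ell_1$-minimization solvers would scale better, but the linear programming formulation keeps the sketch self-contained.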

Some examples of $\ell_1$-instance optimal pairs were shown in Chapter 6, e.g., a matrix $A$ with restricted isometry constant $\delta_{2s}$, $\delta_{6s}$ or $\delta_{26s}$ less than a certain value and a reconstruction map $\Delta$ corresponding to Basis Pursuit, IHT or OMP, respectively. We can generalize this notion of instance optimality to two different norms, one for the reconstruction error and another one for the best $s$-term approximation. This has already appeared in the analysis of the algorithms in Chapter 6.

Definition 8.19. Given $q \geq p \geq 1$, a pair of a measurement matrix $A \in \mathbb{C}^{m\times N}$ and a reconstruction map $\Delta : \mathbb{C}^m \to \mathbb{C}^N$ is mixed $(\ell_q, \ell_p)$-instance optimal of order $s$ with constant $C > 0$ if
\[
\|x - \Delta(Ax)\|_q \leq \frac{C}{s^{1/p-1/q}}\,\sigma_s(x)_p \quad\text{for all } x \in \mathbb{C}^N.
\]
Remark 41. Why does the term $s^{1/p-1/q}$ appear in the definition of mixed instance optimality for two different norms? Clearly, the reconstruction error $\|x - \Delta(Ax)\|_q$ should be comparable to the best approximation error $\sigma_s(x)_q$ for vectors $x \in B_r^N$ with $r < 1$. Regarding Proposition 1.5 as a "change of variables", we have

\[
\sup_{x\in B_r^N}\sigma_s(x)_q \asymp \frac{1}{s^{1/r-1/q}} \asymp \frac{1}{s^{1/p-1/q}}\,\sup_{x\in B_r^N}\sigma_s(x)_p.
\]
This justifies the comparison between $\|x - \Delta(Ax)\|_q$ and $\sigma_s(x)_p/s^{1/p-1/q}$.

The next theorem shows that vectors belonging to the kernel of a measurement matrix are controlled by the best approximation error if and only if the measurement matrix is part of a mixed instance optimal scheme.

Theorem 8.20. ([Cohen, Dahmen & DeVore ’09]): Let $q \geq p \geq 1$ and let $A \in \mathbb{C}^{m\times N}$ be a measurement matrix. If there exists a reconstruction map $\Delta$ making the pair $(A, \Delta)$ mixed $(\ell_q, \ell_p)$-instance optimal of order $s$ with constant $C > 0$, then
\[
\|v\|_q \leq \frac{C}{s^{1/p-1/q}}\,\sigma_{2s}(v)_p \quad\text{for all } v \in \ker A. \tag{8.6}
\]
Conversely, if Equation (8.6) holds, then there exists a reconstruction map $\Delta$ which makes the pair $(A, \Delta)$ mixed $(\ell_q, \ell_p)$-instance optimal of order $s$ with constant $2C$.

Proof. Let us start by assuming that $(A, \Delta)$ is mixed $(\ell_q, \ell_p)$-instance optimal of order $s$ with constant $C$. For $v \in \ker A$, let $S$ be an index set of the $s$ largest absolute entries of $v$. From the instance optimality, applied to the $s$-sparse vector $-v_S$, we have
\[
\|-v_S - \Delta(A(-v_S))\|_q \leq \frac{C}{s^{1/p-1/q}}\,\sigma_s(-v_S)_p = 0,
\]
which is the same as $-v_S = \Delta(A(-v_S))$. Besides, $Av = 0$ leads to $A(v_S + v_{\bar{S}}) = 0$, hence $A(-v_S) = A(v_{\bar{S}})$. We then deduce that $-v_S = \Delta(A(v_{\bar{S}}))$. With this, Equation (8.6) follows from
\[
\|v\|_q = \|v_{\bar{S}} + v_S\|_q = \|v_{\bar{S}} - \Delta(A(v_{\bar{S}}))\|_q \leq \frac{C}{s^{1/p-1/q}}\,\sigma_s(v_{\bar{S}})_p = \frac{C}{s^{1/p-1/q}}\,\sigma_{2s}(v)_p.
\]
Now, suppose that (8.6) holds for some measurement matrix $A$. The reconstruction map we are looking for is given by

\[
\Delta(y) = \operatorname*{argmin}_{z}\,\{\sigma_s(z)_p \ \text{subject to}\ Az = y\}.
\]
For any $x \in \mathbb{C}^N$, applying (8.6) to $v = x - \Delta(Ax) \in \ker A$ and using the triangle inequality $\sigma_{2s}(u+v)_p \leq \sigma_s(u)_p + \sigma_s(v)_p$ yields

\[
\|x - \Delta(Ax)\|_q \leq \frac{C}{s^{1/p-1/q}}\,\sigma_{2s}(x - \Delta(Ax))_p \leq \frac{C}{s^{1/p-1/q}}\left(\sigma_s(x)_p + \sigma_s(\Delta(Ax))_p\right) \leq \frac{2C}{s^{1/p-1/q}}\,\sigma_s(x)_p.
\]

This proves that $(A, \Delta)$ is mixed $(\ell_q, \ell_p)$-instance optimal of order $s$ with constant $2C$.

The next theorem asserts that $\ell_2$-instance optimality is not a good concept to work with. More specifically, if we expect this property to hold, then the number $m$ of measurements is comparable to the dimension $N$ of the signal. This holds even if we ask for instance optimality of order $s = 1$.

Theorem 8.21. ([Cohen, Dahmen & DeVore ’09]): If a pair of a measurement matrix $A \in \mathbb{C}^{m\times N}$ and a reconstruction map $\Delta : \mathbb{C}^m \to \mathbb{C}^N$ is $\ell_2$-instance optimal of order $s \geq 1$ with constant $C$, then
\[
m \geq cN,
\]
for some constant $c$ depending only on $C$.

Proof. From Theorem 8.20 we have that the measurement matrix A in the instance optimal pair satisfies

\[
\|v\|_2 \leq C\,\sigma_s(v)_2 \quad\forall\, v \in \ker A.
\]
Taking $s = 1$ and using the triangle inequality, we obtain

\[
\|v\|_2^2 \leq C^2\left(\|v\|_2^2 - |v_j|^2\right) \quad\forall\, v \in \ker A \text{ and } j \in [N].
\]
If $\{e_1, \ldots, e_N\}$ denotes the canonical basis of $\mathbb{C}^N$, the last inequality can be rewritten as

\[
|\langle v, e_j\rangle| \leq \sqrt{(C^2 - 1)/C^2}\;\|v\|_2 \quad\forall\, v \in \ker A \text{ and } j \in [N].
\]
So, denoting the orthogonal projection onto $\ker A$ by $P$, we have

\[
N - m \leq \dim(\ker A) = \operatorname{tr}(P) = \sum_{i=1}^N \langle P e_i, e_i\rangle \leq \sqrt{(C^2-1)/C^2}\,\sum_{i=1}^N \|P e_i\|_2 \leq \sqrt{(C^2-1)/C^2}\; N.
\]
The conclusion follows by taking $c = 1 - \sqrt{(C^2-1)/C^2}$.
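As a concrete illustration of the statement (our own numerical example), an $\ell_2$-instance optimal pair with constant $C = \sqrt{2}$ would force
\[
m \geq \left(1 - \sqrt{\frac{C^2 - 1}{C^2}}\right)N = \left(1 - \frac{1}{\sqrt{2}}\right)N \approx 0.29\,N,
\]
that is, a number of measurements proportional to the ambient dimension.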

Theorem 8.21 shows that it is better to abandon ideas related to ordinary least squares and $\ell_2$-minimization in some situations involving compression, and instead look for some convex relaxation like Basis Pursuit. The next theorem confirms this expectation when looking at $\ell_1$-instance optimality. Moreover, we will remove the restrictions imposed in Proposition 8.16, i.e., $p > 1$ and $s$ larger than some constant.

Theorem 8.22. ([Foucart, Pajor, Rauhut & Ullrich ’10]): If a pair of a measurement matrix $A \in \mathbb{C}^{m\times N}$ and a reconstruction map $\Delta : \mathbb{C}^m \to \mathbb{C}^N$ is $\ell_1$-instance optimal of order $s \geq 1$ with constant $C$, then
\[
m \geq cs\ln(eN/s),
\]
for some constant $c$ depending only on $C$.

Proof. We know from Theorem 8.11 that there exist $n \geq (N/4s)^{s/2}$ index sets $S_1, \ldots, S_n$ of size $s$ satisfying $\#(S_i \cap S_j) < s/2$ for all $1 \leq i \neq j \leq n$. Let us consider the same construction of vectors $x^1, \ldots, x^n$ from Theorem 8.10, i.e.,
\[
x^i_k = \begin{cases} 1/s & \text{if } k \in S_i, \\ 0 & \text{if } k \notin S_i. \end{cases}
\]
First, notice that $\|x^i\|_1 = 1$ and $\|x^i - x^j\|_1 > 1$ for all $1 \leq i \neq j \leq n$. Next, we will prove that if $\rho = 1/(2(C+1))$, then $\{A(x^i + \rho B_1^N)\}_{i=1}^n$ is a disjoint collection of subsets of $A(\mathbb{C}^N)$, which has dimension $d \leq m$. Suppose, by contradiction, that two sets from this collection intersect, that is, there exist indices $i \neq j$ and vectors $z, \tilde{z} \in \rho B_1^N$ such that $A(x^i + z) = A(x^j + \tilde{z})$. Then, since $\Delta(A(x^i + z)) = \Delta(A(x^j + \tilde{z}))$,
\begin{align*}
\|x^i - x^j\|_1 &= \left\|\left(x^i + z - \Delta(A(x^i + z))\right) - \left(x^j + \tilde{z} - \Delta(A(x^j + \tilde{z}))\right) - z + \tilde{z}\right\|_1 \\
&\leq \left\|x^i + z - \Delta(A(x^i + z))\right\|_1 + \left\|x^j + \tilde{z} - \Delta(A(x^j + \tilde{z}))\right\|_1 + \|z\|_1 + \|\tilde{z}\|_1 \\
&\leq C\sigma_s(x^i + z)_1 + C\sigma_s(x^j + \tilde{z})_1 + \|z\|_1 + \|\tilde{z}\|_1 \leq C\|z\|_1 + C\|\tilde{z}\|_1 + \|z\|_1 + \|\tilde{z}\|_1 \leq 2(C + 1)\rho = 1,
\end{align*}
which contradicts $\|x^i - x^j\|_1 > 1$. Now, using the fact that the collection $\{A(x^i + \rho B_1^N)\}_{i=1}^n$ is contained in $(1+\rho)A(B_1^N)$, we deduce
\[
\sum_{i\in[n]}\operatorname{vol}\left(A(x^i + \rho B_1^N)\right) \leq \operatorname{vol}\left((1+\rho)A(B_1^N)\right).
\]
We are working in the $d$-dimensional complex case, which is equivalent to working in the $2d$-dimensional real case. Thus, using the homogeneity and the translation invariance of the volume, we have

\[
n\,\rho^{2d}\operatorname{vol}(A(B_1^N)) \leq (1+\rho)^{2d}\operatorname{vol}(A(B_1^N)).
\]
This leads to

\[
\left(\frac{N}{4s}\right)^{s/2} \leq n \leq \left(1 + \frac{1}{\rho}\right)^{2d} = (2C+3)^{2d} \leq (2C+3)^{2m}.
\]
Taking logarithms, we obtain

\[
\frac{m}{s} \geq \frac{\ln(N/4s)}{4\ln(2C+3)}.
\]

Besides, we know by hypothesis that the pair $(A, \Delta)$ is $\ell_1$-instance optimal of order $s$, so it ensures exact recovery of $s$-sparse vectors. Based on this, we have $m \geq 2s$. Combining both inequalities leads to
\[
\frac{m}{s}\left[4\ln(2C+3) + 2\right] \geq \ln(N/4s) + 4 = \ln(N/4s) + \ln(e^4) = \ln\left(\frac{e^4 N}{4s}\right) \geq \ln(eN/s).
\]
Defining $c = 1/(4(\ln(2C+3) + 2))$, we have the result.
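For orientation (again our own numerical illustration), if the instance-optimality constant is $C = 2$, then
\[
c = \frac{1}{4(\ln 7 + 2)} \approx 0.063,
\]
so $\ell_1$-instance optimality of order $s$ with constant $2$ already requires $m \geq 0.063\, s\ln(eN/s)$.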

Now we can prove that mixed $(\ell_q, \ell_p)$-instance optimality is preserved when we decrease $q$. We will use this to prove that the same number of measurements is imposed when we have $(\ell_q, \ell_1)$-instance optimality for $q > 1$. Equivalently, no matter in which norm we bound the error of our reconstruction, we can expect the same number of measurements for all error norms.

Lemma 8.23. Given $q \geq q' \geq p \geq 1$, if a pair $(A, \Delta)$ is mixed $(\ell_q, \ell_p)$-instance optimal of order $s$ with constant $C$, then there is a reconstruction map $\Delta'$ making the pair $(A, \Delta')$ mixed $(\ell_{q'}, \ell_p)$-instance optimal of order $s$ with constant $C'$ depending only on $C$.

Proof. By hypothesis, the pair $(A, \Delta)$ is mixed $(\ell_q, \ell_p)$-instance optimal of order $s$ with constant $C$. Then, for a vector $v \in \ker A$, Theorem 8.20 yields
\[
\|v\|_q \leq \frac{C}{s^{1/p-1/q}}\,\sigma_{2s}(v)_p.
\]
If $S$ denotes an index set of the $3s$ largest entries of $v$ in modulus, we have

\begin{align*}
\|v_S\|_{q'} &\leq (3s)^{1/q'-1/q}\|v_S\|_q \leq (3s)^{1/q'-1/q}\|v\|_q \leq (3s)^{1/q'-1/q}\,\frac{C}{s^{1/p-1/q}}\,\sigma_{2s}(v)_p \\
&= \frac{3^{1/q'-1/q}\,C}{s^{1/p-1/q'}}\,\sigma_{2s}(v)_p \leq \frac{3C}{s^{1/p-1/q'}}\,\sigma_{2s}(v)_p.
\end{align*}
Also, from Proposition 1.5, we have
\[
\|v_{\bar{S}}\|_{q'} \leq \frac{1}{s^{1/p-1/q'}}\,\sigma_{2s}(v)_p.
\]
Thus,

\[
\|v\|_{q'} \leq \|v_S\|_{q'} + \|v_{\bar{S}}\|_{q'} \leq \frac{3C+1}{s^{1/p-1/q'}}\,\sigma_{2s}(v)_p = \frac{\tilde{C}}{s^{1/p-1/q'}}\,\sigma_{2s}(v)_p.
\]
Therefore, by the converse part of Theorem 8.20, the result holds with $C' = 2\tilde{C} = 2(3C+1)$.

With the same techniques, we can prove an analogous result for $p$ instead of $q$, that is, mixed $(\ell_q, \ell_p)$-instance optimality is also preserved when we decrease $p$ instead of $q$. This is the content of the following lemma.

Lemma 8.24. For $q \geq p \geq p' \geq 1$, if a pair $(A, \Delta)$ is mixed $(\ell_q, \ell_p)$-instance optimal of order $s$ with constant $C$, then it is also mixed $(\ell_q, \ell_{p'})$-instance optimal of order $\lceil s/2\rceil$ with constant $C'$ depending only on $C$.

Corollary 8.25. Given $q > 1$, if a pair of measurement matrix and reconstruction map is mixed $(\ell_q, \ell_1)$-instance optimal of order $s$ with constant $C$, then

\[
m \geq cs\ln(eN/s)
\]
for some constant $c$ depending only on $C$.

Proof. This is a simple consequence of Theorem 8.22 and Lemma 8.23.

Remark 42. What happens in the case of a mixed $(\ell_q, \ell_p)$-instance optimal pair of order $s$ with constant $C$ for $p > 1$? Without loss of generality, suppose that we are in region II described at the end of Section 8.5. So we have

\[
\min\left\{1, \frac{N^{1-1/p}}{m^{1/2}}\right\}^{\frac{1/p-1/q}{1/p-1/2}} \lesssim d^m(B_p^N, \ell_q^N) \leq E^m_{\mathrm{ada}}(B_p^N, \ell_q^N) \leq \frac{C}{s^{1/p-1/q}}.
\]
Considering only reasonable values of the parameters (i.e., excluding the case where the minimum is 1), we have

\[
\frac{C}{s^{1/p-1/q}} \gtrsim \left(\frac{N^{1-1/p}}{m^{1/2}}\right)^{\frac{1/p-1/q}{1/p-1/2}}.
\]
Now, raising both sides to the power $2(1/p-1/2)/(1/p-1/q)$, we obtain
\[
m \gtrsim C^{-\frac{2(1/p-1/2)}{1/p-1/q}}\, s^{2(1/p-1/2)}\, N^{2-2/p} = \tilde{C}\, s^{2(1/p-1/2)}\, N^{2-2/p}.
\]
So we can conclude that, for $p > 1$, a measurement/decoder scheme which is $(\ell_q, \ell_p)$-instance optimal is not achievable in the regime $m \asymp s\ln(eN/m)$. The same analysis can be done for regions III and IV. Therefore, we need to look for $p \leq 1$ in order to use reasonable techniques for compressive sensing.

Bibliography

Books:

[Andrews, Askey & Roy ’99] George Andrews, Richard Askey & Ranjay Roy, Special Functions, Encyclopedia of Mathematics and Its Applications, Volume 71, Cambridge University Press, Cambridge, 1999.

[Arora & Barak ’07] Sanjeev Arora & Boaz Barak, Computational Complexity: A Modern Approach, Cambridge University Press, Cambridge, 2007.

[Artstein-Avidan, Giannopoulos & Milman ’15] Shiri Artstein-Avidan, Apostolos Giannopoulos & Vitali D. Milman, Asymptotic Geometric Analysis, Mathematical Surveys and Monographs, American Mathematical Society, 2015.

[Bertsekas ’16] Dimitri Bertsekas, Nonlinear Programming, Athena Scientific, 3rd edition, 2016.

[Björck ’96] Ake Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.

[Boche et al. ’15] Holger Boche (Ed), Robert Calderbank (Ed), Gitta Kutyniok (Ed), Jan Vybíral (Ed), Compressed Sensing and its Applications: MATHEON Workshop 2013, Birkhäuser, 2015.

[Boucheron, Lugosi & Massart ’13] Stephane Boucheron, Gábor Lugosi & Pascal Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford University Press, Oxford, 2013.

[Boyd & Vanderberghe ’04] Stephen Boyd & Lieven Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.

[Buldygin & Kozachenko ’98] Valerii V. Buldygin & Yuri V. Kozachenko, Metric Characterization of Random Variables and Random Processes, Translations of Mathematical Monographs, Volume 188, American Mathematical Society, Providence, Rhode Island, 1998.

[Burrus, Gopinath & Guo ’97] C. Sidney Burrus, Ramesh A. Gopinath & Haitao Guo, Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice-Hall, New Jersey, 1997.

[Casazza & Kutyniok ’13] Peter G. Casazza (Ed) & Gitta Kutyniok (Ed), Finite Frames: Theory and Applications, Birkhäuser, 2013.

[Charpentier, Lesne & Nikolski ’07] Eric Charpentier (Ed), Annick Lesne (Ed), Nikolaï K. Nikolski (Ed), Kolmogorov’s Heritage in Mathematics, Springer, 2007.

[Christensen ’08] Ole Christensen, Frames and Bases: An Introductory Course, Birkhäuser, 2008.

[Cohen ’94] Leon Cohen, Time Frequency Analysis: Theory and Applications, Prentice Hall, 1994.

[Damelin & Miller ’12] Steven B. Damelin & Willard Miller Jr, The Mathematics of Signal Processing, Cambridge Texts in Applied Mathematics, Cambridge University Press, Cambridge, 2012.

[Daubechies ’92] Ingrid Daubechies, Ten Lectures on Wavelets. SIAM, Philadelphia, 1992.


[DeGroot & Schervish ’11] Morris H. DeGroot & Mark J. Schervish, Probability and Statistics. Pearson, Upper Saddle River, USA, 2011.

[Dubhashi & Panconesi ’12] Devdatt P. Dubhashi & Alessandro Panconesi, Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, Cambridge, 2012.

[Durrett ’10] Rick Durrett, Probability: Theory and Examples, Cambridge University Press, Cambridge, 2010.

[Elad ’10] Michael Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, 2010.

[Eldar ’15] Yonina Eldar, Sampling Theory: Beyond Bandlimited Systems, Cambridge University Press, Cambridge, 2015.

[Eldar & Kutyniok ’12] Yonina C. Eldar(Ed) & Gitta Kutyniok(Ed), Compressed Sensing: Theory and Applications, Cambridge University Press, 2012.

[Feller ’68] William Feller, An Introduction to Probability Theory and Its Applications, Volume 1, 3rd Edition, John Wiley & Sons, New York, 1968.

[Feller ’72] William Feller, An Introduction to Probability Theory and Its Applications, Volume 2, 3rd Edition, John Wiley & Sons, New York, 1972.

[Fischer ’11] Hans Fischer, A History of the Central Limit Theorem: From Classical to Modern Probability Theory, Sources and Studies in the History of Mathematics and Physical Sciences, Springer, New York, 2011.

[Gamkrelidze ’90] Revaz Gamkrelidze, Analysis II: Convex Analysis and Approximation Theory, Ency- clopaedia of Mathematical Sciences (Book 14). Springer, 1990.

[Garey & Johnson ’79] M. R. Garey & D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-completeness. Freeman and Company, 1979.

[Gass & Assad ’04] Saul I. Gass & Arjang A. Assad, An Annotated Timeline of Operations Research: An Informal History. Springer Verlag, 2004.

[Goldreich ’08] Oded Goldreich, Computational Complexity: A conceptual Perspective. Cambridge Uni- versity Press, Cambridge, 2008.

[Graham, Knuth & Patashnik ’94] Ronald L. Graham, Donald E. Knuth & Oren Patashnik, Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley Professional, 1994.

[Gröchenig ’01] Karlheinz Gröchenig, Foundations of Time-Frequency Analysis, Birkhäuser, 2001.

[Guillemin & Pollack] Victor Guillemin & Allan Pollack, Differential Topology. Prentice Hall, 1974.

[Hacking ’06] Ian Hacking, The Emergence of Probability: A Philosophical Study of Early Ideas About Probability Induction and Statistical Inference. Cambridge University Press, Cambridge, 2006.

[Han et al. ’07] Deguang Han, Keri Kornelson, David Larson & Eric Weber Frames for Undergraduates, American Mathematical Society, 2007.

[Hardy, Littlewood & Polya] Godfrey H. Hardy, John E. Littlewood & George Polya, Inequalities", Cam- bridge University Press, Second Edition, 1952.

[Hastie, Tibshirani & Wainwright ’15] Trevor Hastie, Robert Tibshirani & Martin Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

[Havin & Jöricke ’94] Victor Havin & Burglind Jöricke, The Uncertainty Principle in Harmonic Analysis, Springer, 1994.

[Jammer ’96] Max Jammer, The Conceptual Development of Quantum Mechanics, McGraw-Hill, 1966.

[Jammer ’74] Max Jammer, The Philosophy of Quantum Mechanics: The Interpretations of Quantum Mechanics in Historical Perspective, Wiley, 1974.

[Kirsch ’11] Andreas Kirsch, An Introduction to the Mathematical Theory of Inverse Problems, Applied Mathematical Sciences, Vol. 120, Springer, 2011.

[Ledoux ’01] Michel Ledoux, The Concentration of Measure Phenomenon, Mathematical Surveys and Monographs 89, American Mathematical Society, Providence, RI, 2001.

[Lorentz ’66] G. G. Lorentz, Approximation of Functions, Holt, Rinehart and Winston, New York, 1966.

[Lorentz, von Golitschek & Makovoz ’96] G. G. Lorentz, M. von Golitschek & Y. Makovoz, Constructive Approximation: Advanced Problems, Springer, 1996.

[Marvasti ’01] Farokh Marvasti, Nonuniform Sampling: Theory and Practice. Information Technology: Transmission, Processing and Storage Series, Springer, New York, 2001.

[Mehra & Rechenberg ’01] Jagdish Mehra & Helmut Rechenberg, The Historical Development of Quan- tum Mechanics, Springer, 2001.

[Micchelli & Rivlin ’76] Charles A. Micchelli & Theodore J. Rivlin, Optimal Estimation in Approximation Theory, Springer ,1976.

[Nesterov & Nemirovskii ’94] Yurii E. Nesterov & Arkadii S. Nemirovskii, Interior Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.

[Novak & Wozniakowski ’08] Erich Novak & Henryk Wozniakowski, ‘Tractability of Multivariate Prob- lems: Linear Information, European Mathematical Society, 2008.

[Novak & Wozniakowski ’10] Erich Novak & Henryk Wozniakowski, Tractability of Multivariate Prob- lems: Standard Information for Functionals, European Mathematical Society, 2010.

[Novak & Wozniakowski ’12] Erich Novak & Henryk Wozniakowski, Tractability of Multivariate Prob- lems: Standard Information for Operators, European Mathematical Society, 2012.

[Pinkus ’85] Allan Pinkus, N-Width in Approximation Theory, Springer, Berlin, 1985.

[Pisier ’89] Gillies Pisier, The Volume of Convex Bodies and Banach Space Geometry, Cambridge Tracts in Mathematics 94, Cambridge University Press, Cambridge, 1989.

[Pietsch ’07] Albrecht Pietsch, History of Banach Spaces and Linear Operators, Birkhäuser, 2007.

[Rasmussen & Williams ’05] Carl Edward Rasmussen & Christopher K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2005.

[Rauhut & Foucart ’13] Holger Rauhut & Simon Foucart, A Mathematical Introduction to Compressive Sensing, Birkhäuser, 2013.

[Resnick ’13] Sidney Resnick, A Probability Path, Birkhäuser, 2013.

[Rish & Grabarnik ’14] Irina Rish & Genady Grabarnik, Sparse Modeling: Theory, Algorithms and Ap- plications, CRC Press, 2014.

[Ruszczynski] Andrzej Ruszczynski, Nonlinear Optimization, Princeton University Press, 2006.

[Schulz, da Silva & Velho ’09] Adriana Schulz, Eduardo A. B. da Silva & Luiz Velho, Compressive Sens- ing, 27◦ Colóquio Brasileiro de Matemática, IMPA, Rio de Janeiro, 2009. [Sra, Nowozin & Wright ’11] Suvrit Sra, Sebastian Nowozin & Stephen J. Wright, Optimization for Ma- chine Learning, The MIT Press, 2011. [Steffens ’06] Karl-Georg Steffens, The History of Approximation Theory: From Euler to Bernstein, Birkhäuser, 2006. [Stein & Shakarchi ’05] Elias Stein & Rami Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton Lectures in Analysis, Volume III, Princeton University Press, Nova Jersey, 2005. [Stein & Shakarchi ’11] Elias Stein & Rami Shakarchi, Functional Analysis: Introduction to Further Top- ics in Analysis. Princeton Lectures in Analysis, Volume IV, Princeton University Press, Nova Jersey, 2005. [Stigler ’90] S. M. Stigler, The History of Statistics: The Measurement of Uncertainty Before 1900. Belknap Press, 1990. [Temlyakov ’11] Greedy Approximation, Cambridge University Press, 2011. [Tong ’80] Yung Liang Tong, Probability Inequalities in Multivariate Distributions, Academic Press, New York, 1980. [Tropp ’15] Joel Tropp, An Introduction to Matrix Concentration Inequalities, Foundations and Trends in Machine Learning Volume 8, Now Publishers Inc, 2015. [von Plato ’98] Jan von Plato, Creating Modern Probability: Its Mathematics, Physics and Philosophy in Historical Perspective, Cambridge University Press , Cambridge, 1998. [Young ’80] Robert Young, An Introduction to Non-Harmonic Fourier Series. Academic Press, New York, 1980. [Varga ’04] Richard S. Varga, Gershgorin and His Circles. Springer-Verlag, Berlin, 2004. [Vempala ’04] Santosh Vempala, The Random Projection Method. AMS, 2004. [Vershynin ’16] Roman Vershynin, High Dimensional Probability: An Introduction with Applica- tions in Data Science, to be published by Cambridge University Press, 2016. Available at http://www-personal.umich.edu/~romanv/papers/HDP-book/HDP-book.pdf. Historical: [Borges ’60] Jorge Luis Borges, El Hacedor, Biblioteca Borges, Alianza Editorial, 1960. [Carathéodory 1911] Constantin Carathéodory, Über den Variabilitätsbereich der Fourierschen Funktio- nen. Rendiconto del Circolo Matematico di Palermo, Volume 32, 193-217, 1911. [Dorfman ’43] Robert Dorfman, The Detection of Defective Members of Large Populations. The Annals of Mathematical Statistics, Volume 14, Issue 4, 436-440, 1943. [Gabor ’46] Dennis Gabor Theory of communication, J. Inst. Elec. Engr. 93, 429-457, 1946. [Gershgorin ’31] S. Gerschgorin Über die Abgrenzung der Eigenwerte einer Matrix, Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk 6, 749-754, 1931. [Heisenberg ’27] Werner Heisenberg Über den anschaulichen Inhalt der quantentheoretischen Kinematic und Mechanik, Zeit. Phyisik, 43, 172-198, 1927. BIBLIOGRAPHY 169

[Kennard ’27] Earle H. Kennard Zur Quantenmechanik einfacher Bewegungstypen, Zeit. Physik, 44, 326- 352, 1927.

[Levy & Fullagar ’81] S. Levy & P. Fullagar, Reconstruction of a Sparse Spike Train From a Portion of Its Spectrum and Applications to High-Resolution Deconvolution. Geophysics, Volume 46, Number 9, 1235-1243, 1981.

[Santosa & Symes ’86] Fadil Santosa & William W. Symes, Linear Inversion of Band-Limited Reflection Seismograms. SIAM Journal on Scientific and Statistical Computing, Volume 7, Issue 4, 1307-1330, 1986.

[Shannon ’48] Claude E. Shannon, “A Mathematical Theory of Communication”. Bell System Technical Journal, Volume 27, 379-423, 1948.

[Slepian ’62] David Slepian, The One-Sided Barrier Problem for Gaussian Noise. Bell System Technical Journal, Volume 41, 463-501, 1962.

[Taylor, Banks & McCoy ’79] H. L. Taylor, S. C. Banks & J. F. McCoy, Deconvolution with the `1-norm. Geophysics, Volume 44, Number 1, 39-53, 1979.

[Walker & Ulrych ’83] C. Walker & T. J. Ulrych, Autoregressive Recovery of the Acoustic Impedance. Geophysics, Volume 48, Number 10, 1338-1350, 1983.

[Weyl ’28] Hermann Weyl, Gruppentheorie und Quantenmechanik, S. Hirzel, 1928. Revised English edi- tion: The Theory of Groups and Quantum Mechanics, Methuen, London, 1931; reprinted by Dover, New York, 1950.

[Wiener ’56] Norbert Wiener, I am a Mathematician, MIT Press, Cambridge, 1956.

[von Neumann ’53] John von Neumann, A Certain Zero-Sum Two-Person Game Equivalent to the Optimal Assignment Problem, in Contributions to the Theory of Games II, H.W. Kuhn and A.W. Tucker (editors), 5-12. Princeton University Press, Princeton, NJ, 1953.

Papers:

[Adcock, Hansen & Roman ’15] Ben Adcock, Anders Hansen & Bogdan Roman, The Quest for Optimal Sampling: Computationally Efficient, Structure-Exploiting Measurements for Compressive Sensing in Compressed Sensing and Its Applications: MATHEON Workshop 2013, Birkhäuser, 2015.

[Adcock, Hansen, Roman & Teschke ’13] Ben Adcock, Anders Hansen, Bogdan Roman & Gerd Teschke, Generalized Sampling: Stable Reconstructions, Inverse Problems and Compressed Sensing Over the Continuum. Advances in Imaging and Electron Physics, Volume 182, 187-279, 2013.

[Ailon & Chazelle ’06] Nir Ailon & Bernard Chazelle, Approximate Nearest Neighbors and the Fast Johnson-Lindenstrauss Transform. In Proceedings of the 38th ACM Symposium on Theory of Com- puting (STOC), 557-563, 2006.

[Ailon & Chazelle ’09] Nir Ailon & Bernard Chazelle, The Fast Johnson-Lindenstrauss Transform and Approximate Nearest Neighbors. SIAM Journal on Computing, Volume 39, Issue 1, 302-322, 2009.

[Ailon & Liberty ’11] Nir Ailon & E. Liberty, Almost Optimal Unrestricted Fast Johnson-Lindenstrauss transform. In Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), San Francisco, USA, 22-2, 2011.

[Alon ’03] Noga Alon, Problems and Results in Extremal Combinatorics I, Discrete Mathematics, Volume 273(1-3), 31-53, 2003.

[Alltop ’80] W. O. Alltop, Complex Sequences With Low Periodic Correletions, IEEE Transaction of Information Theory, Vol. 26, Issue 3, 350-354, 1980. [Andersson & Strömberg ’14] Joel Andersson & Jan-Olov Strömberg, On the Theorem of Uniform Recov- ery of Random Sampling Matrices, IEEE Transactions on Information Theory, Volume 60, Number 3, 1700-1710, 2014. [Applebaum, Howard, Searle & Calderbank ’09] Lorne Applebaum, Stephen D. Howard, Stephen Searle & Robert Calderbank, Chirp Sensing Codes: Deterministic Compressed Sensing Measurements for Fast Recovery,?Applied and Computational Harmonic Analysis, Volume 26, Issue 2, 283-290, 2009. [Bah ’12] Bubacarr Bah, Restricted Isometry Constants in Compressed Sensing. Ph.D. Thesis, University of Edinburgh, 2012. [Bah & Tanner ’10] Bubacarr Bah & Jared Tanner, Improved Bounds on Restricted Isometry Constants for Gaussian Matrices, SIAM Journal Matrix Analysis and Applications, Volume 31, Number 5, 2882-2898, 2010. [Bandeira et al. ’13] Afonso S. Bandeira, Matthew Fickus, Dustin G. Mixon & Percy Wong, The Road to Deterministic Matrices with the Restricted Isometry Property, Journal of Fourier Analysis and Applications, Volume 19, Issue 6, 1123-1149, 2013. [Bandeira, Dobriban, Mixon & Sawin ’13] Afonso S. Bandeira, Egdar Dobriban, Dustin G. Mixon & William F. Sawin, Certifying the Restricted Isometry Property is Hard, IEEE Transactions of Infor- mation Theorym Volume 59, Number 6, 3448-3450, 2013. [Baraniuk ’07] Richard Baraniuk, Compressive Sensing. IEEE Signal Processing Magazine, Volume 24, Issue 4, 118-121, 2007. [Baraniuk et al. ’08] R. G. Baraniuk, M. Davenport, R.A. DeVore & M. Wakin, A Simples Proof of the Restricted Isometry Property for Random Matrices. Constructive Approximations, Vol. 28, Issue 3, 253-263, 2008. [Baraniuk et al.] R. G. Baraniuk, M. Davenport, R.A. DeVore & M. Wakin, The Johnson-Lindenstrauss Lemma Meets Compressed Sensing. Preprint. [Bechar ’09] Ikhlef Bechar, A Bernstein-Type Inequality for Stochastic Processes of Quadratic Forms of Gaussian Variables, Preprint 2009. Available at https://arxiv.org/abs/0909.3595. [Bickel, Ritov & Tsybakov ’09] Peter Bickel, Ya’acov Ritov & Alexandre Tsybakov, Simultaneous Anal- ysis of Lasso and Dantzig Selector. Annals of Statistics, Volume 37, Issue 4, 1705-1732, 2009. [Bioucas-Dias & Figueiredo ’07] José M. Bioucas-Dias & Mário A. T. Figueiredo, A New TwIST: Two- Step Iterative Shrinkage/Thresholding Algorithms for Image Restoration, IEEE Transactions on Im- age Processing, Volume 16, Issue 12, 2992-3004, 2007. [Blanchard, Cartis & Tanner ’11] Jeffrey D. Blanchard, Coralia Cartis & Jared Tanner, Compressed sensing: How Sharp is the Restricted Isometry Property?, SIAM Review, Volume 53, Issue 1, 105- 125, 2011. [Blanchard & Tanner ’15] Jeffrey D. Blanchard & Jared Tanner, Performance Comparisons of Greedy Algorithms in Compressed Sensing, Volume 22, Issue 2,2 54-282, 2015. [Blumensath & Davies ’08] Thomas Blumensath & Mike Davies, Gradient Pursuits. IEEE Transactions on Signal Processing, Volume 56, Issue 6, 2370-2382, 2008. [Blumensath & Davies II ’08] Thomas Blumensath & Mike Davies, Iterative Thresholding for Sparse Approximations. Journal of Fourier Analysis and its Applications, Volume 14, Issue 5, 629-654, 2008. BIBLIOGRAPHY 171

[Blumensath & Davies ’09] Thomas Blumensath & Mike Davies, Iterative Hard Thresholding for Com- pressed Sensing. Applied and Computational Harmonic Analysis, Volume 27, Issue 3, 265-274, 2009.

[Bodmann & Kutyniok ’09] Bernhard G. Bodmann & Gitta Kutyniok, Erasure-Proof Transmissions: Fusion Frames meet Coding Theory, Proc. SPIE 7446, Wavelets XIII, Volume 74460, 2009.

[Borup, Gribonval & Nielsen ’08] L. Borup, R. Gribonval & M. Nielsen, Beyond Coherence: Recovering Structured Time-Frequency Representations. Appl. Comput. Harmon. Anal., Vol. 14, 120-128, 2008.

[Bouchot, Foucart & Hitczenko ’16] Jean-Luc Bouchot, Simon Foucart & Pawel Hitczenko, Hard Thresh- olding Pursuit Algorithms: Number of Iterations, Applied and Computational Harmonic Analysis Volume 41, Issue 2, 412-435. 2016.

[Boufonos ’12] Petros Boufonos, Universal Rate-Efficient Scalar Quantization, IEEE Transactions on Information Theory, Volume 58 Issue 3, 1861-1872, 2012.

[Boufounos, Kutyniok & Rauhut ’11] Petros Boufounos, Gitta Kutyniok & Holger Rauhut, Sparse Re- covery From Combined Fusion Frame Measurements. IEEE Transactions on Information Theory, Vol. 57, Issue 6, 3864-3876, 2011.

[Bourgain, Dilworth, Ford, Konyagin & Kutzarova ’11] Jean Bourgain, Stephen Dilworth, Kevin Ford, Sergei Konyagin & Denka Kutzarova, Explicit Constructions of RIP Matrices and Related Problems, Duke Mathematical Journal, Volume 159, Number 1, 145-185, 2011.

[Box ’79] George E. P. Box, Robustness in the Strategy of Scientific Model Building, in Launer, R. L.; Wilkinson, G. N., Robustness in Statistics, Academic Press, 201-236, 1979.

[Butzer & Stens ’92] Paul L. Butzer & Rudolf L. Stens, Sampling Theory for not Necessarily Band- Limited Functions - A Historical Overview. SIAM Review, Volume 34, Number 1, 40-53, 1992.

[Cai, Wang & Xu I ’10] T. Cai, L. Wang & G. Xu, Shifting Inequality and Recovery of Sparse Vectors, IEEE Transactions on Signal Processing, Vol. 58, Issue 9, 4388-4394, 2010.

[Cai, Wang & Xu II ’10] T. Cai, L. Wang & G. Xu, New Bounds for Restricted Isometry Constants. IEEE Transactions on Information Theory, Volume 56, Issue 9, 4388-4394, 2010.

[Cai, Wang & Xu III ’10] T. Cai, L. Wang & G. Xu, Stable Recovery of Sparse Signals and an Oracle Inequality, IEEE Transactions on Information Theory, Vol. 56, Issue 7, 3516-3522, 2010.

[Cai & Wang ’11] T. Cai & Lie Wang, Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise, IEEE Transactions on Information Theory, Vol. 57, Issue 7, 4680-4688, 2011.

[Cai & Zhang ’13] T. Cai & Anru Zhang, Sharp RIP Bound for Sparse Signal and Low-Rank Matrix Recovery, Applied and Computational Harmonic Analysis, Volume 35, Issue 1, 74-93, 2013.

[Casazza, Redmond & Tremain ’08] P.G. Casazza, D. Redmond & J.C.Tremain, Real Equiangular Frames, Conference on Information Sciences and Systems, Princeton, NJ, 2008.

[Candès ’06] Emanuel J. Candès„ Compressive Sampling. In Proceedings of the International Congress of Mathematicians

[Candès ’08] Emanuel J. Candès, The Restricted Isometry Property and Its Implications for Compressed Sensing, Comptes Rendus Mathematique, Volume 346, Issue 9-10, 589-592, 2008.

[Candès, Eldar, Strohmer & Voroninski ’13] Emmanuel Candès, Yonina Eldar, Thomas Strohmer & Vladislav Voroninski, Phase Retrieval via Matrix Completion, SIAM Journal on Imaging Sciences, Volume 6, Issue 1, 199-225, 2013.

[Candès & Plan ’11] Emmanuel Candès & Yaniv Plan, A Probabilistic and RIPless Theory of Compressed Sensing. IEEE Transactions on Information Theory, Volume 57, Issue 11, 7235-7254, 2011.

[Candès & Randall ’08] Emmanuel J. Candès & Paige A. Randall, Highly Robust Error Correction by Convex Programming, IEEE Transactions on Information Theory, Volume 54, Issue 7, 2829-2840, 2008.

[Candès & Recht ’09] Emmanuel J. Candès & Benjamin Recht, Exact Matrix Completion via Convex Optimization, Foundations of Computational Mathematics, Volume 9, 717-772, 2009.

[Candès & Romberg ’07] Emanuel J. Candès & Justin K. Romberg, Sparsity and Incoherence in Com- pressive Sampling, Inverse Problems, Volume 23, Number 3, 969-985, 2007.

[Candès, Romberg & Tao I ’06] Emanuel J. Candès, Justin K. Romberg & Terence Tao, Robust Un- certainty Principles: Exact Signal Reconstruction From Highly Incomplete Frequency Information, IEEE Transactions on Information Theory, Volume 47, Issue 2, 489-509, 2006.

[Candès, Romberg & Tao II ’06] Emanuel J. Candès, Justin K. Romberg & Terence Tao Stable Signal Recovery From Incomplete and Inaccurate Measurements, Communications on Pure and Applied Mathematics, Volume 59, Issue 8, 1207-1223, 2006.

[Candès & Tao I ’06] Emanuel J. Candès & Terence Tao Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?, IEEE Transactions on Information Theory, Vol.52, Issue 12, 5406-5425, 2006.

[Candès & Tao II ’06] Emanuel J. Candès & Terence Tao Decoding by Linear Programming, IEEE Trans- actions Information Theory, Volume 51, Issue 12, 4203-4215, 2006.

[Candès & Tao III, 06] Emanuel J. Candès & Terence Tao The Dantzig Selector: Statistical Estimation When p is Much Larger Than n, The Annals of Statistics, 2313-2351, 2006.

[Candès & Tao ’10] Emanuel J. Candès & Terence Tao The Power of Convex Relaxation: Near-Optimal Matrix Completion, IEEE Transactions Information Theory, Volume 56, Number 5, 2053-2080, 2010.

[Candès & Wakin ’08] Emanuel Candès & Michael B. Wakin, An Introduction to Compressive Sampling, IEEE Signal Processing Magazine, Vol. 25, Issue 2, 21-30, 2008.

[Casazza ’00] Peter G. Casazza, The Art of Frame Theory, Taiwanese Journal of Mathematics, Vol. 4, No. 2, 129-201, 2000.

[Chambolle & Pock ’11] Antonin Chambolle & Thomas Pock, A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging, Journal of Mathematical Imaging and Vision, Volume 40, 120-145, 2011.

[Chandrasekaran, Recht, Parrilo & Willsky ’12] V. Chandrasekaran, B. Recht, P. Parrilo, A. Willsky, The Convex Geometry of Linear Inverse Problems. Foundations of Computational Mathematics, Volume 12, Issue 6, 805-849, 2012.

[Chartrand ’07] Rick Chartrand, Exact Reconstruction of Sparse Signals via Nonconvex Minimization, IEEE Signal Processing Letters, Volume 14, Number 10, 707-710, 2007.

[Chen, Billings & Luo ’89] Shaobing Scott Chen, S. Billings & W. Luo, Orthogonal Least Squares Meth- ods and Their Application to Nonlinear System Identification. International Journal of Control, Volume 50, Issue 5, 1873-1896, 1989.

[Chen & Donoho ’94] Shaobing Scott Chen & David Donoho, Basis Pursuit. Technical Report, Department of Statistics, Stanford University.

[Chen, Donoho & Saunders ’01] Shaobing Scott Chen, David L. Donoho & Michael A. Saunders, Atomic Decomposition by Basis Pursuit, SIAM Journal on Scientific Computing, 20 (1), 33-61, 2001. [Chen ’95] Shaobing Scott Chen, Basis Pursuit, Ph.D. Thesis, Stanford, 1995. [Cohen, Dahmen & DeVore ’09] Albert Cohen, Wolfgang Dahmen & Ronald DeVore, Compressed Sens- ing and Best k-term Approximation, Journal of the AMS, Vol. 22, Number 1, 211-231, 2009. [Cook] Stephen Cook, The P Versus NP Problem. Available online: http://www.claymath.org/sites/ default/files/pvsnp.pdf. [Dai & Milenkovic ’09] W. Dai & O. Milenkovic, Subspace Pursuit for Compressive Sensing Signal Re- construction. IEEE Transactions on Information Theory, Volume 55, Issue 5, 2230-2249, 2009. [Dasgupta, Kumar & Sarlós ’10] Anirban Dasgupta, Ravi Kumar & Tamás Sarlós, A Sparse Johnson- Lindenstrauss Transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), 205-214, 2010. [Daubechies, Defrise & De Mol ’04] Ingrid Daubechies, Michel Defrise & Christine De Mol, An Iterative Thresholding Algorithm for Linear Inverse Problems With a Sparsity Constraint. Communications in Pure and Applied Mathematics, Volume 63, Issue 1, 1-38, 2010. [Daubechies, DeVore, Fornasier & Güntürk ’10] Ingrid Daubechies, Ronald DeVore, Maximo Fornasier & Sinan Güntürk, Iteratively Re-Weighted Least Squares Minimization for Sparse Recovery, Commu- nications in Pure and Applied Mathematics, Volume 63, Issue 1, 1-38, 2010. [Daubechies, Grossman & Meyer ’85] Ingrid Daubechies, A. Grossman, & Yves Meyer, Painless Nonorthogonal Expansions, Journal of Mathematical Physics, Volume 127, 1271-1283, 1985. [Davenport & Romberg ’16] Mark Davenport & Justin Romberg, An Overview of Low-Rank Matrix Re- covery From Incomplete Observations, IEEE Journal of Selected Topics in Signal Processing, Volume 10, Number 4, 608-622, 2016. [Davidson & Szarek ’01] K. Davidson & S. Szarek, Local Operator Theory, Random Matrices and Banach Spaces. Handbook of the Geometry of Banach Spaces, 1-317, 2001.

[Davies & Gribonval ’09] M. Davies & R. Gribonval, Restricted Isometry Constants where `p Sparse Re- covery can Fail for 0 < p 1, IEEE Transaction on Information Theory, Vol. 55, Issue 5, 2203-2214, 2009. ≤ [Davies, Mallat & Avellaneda ’97] Geoffrey Davies, Stephane Mallat & G. Davis, S. Mallat, and M. Avel- laneda. Greedy adaptive approximation. Journal Constructive Approximation, Volume 13, 57-98, 1997. [Davies, Mallat & Zhang ’94] Geoffrey Davies, Stephane Mallat & Zhifeng Zhang, Adaptive Time- Frequency Decompositions. Optical Engineering, Volume 33, Issue 7, 2183-2191, 1994. [Davies & Simon ’84] Edward B. Davies & Barry Simon, Ultracontractivity and the Heat Kernel for Schrödinger Operators and Dirichlet Laplacians, Journal of Functional Analysis, Volume 59, Issue 2, 335-395, 1984. [DeVore ’07] Ronald DeVore, Deterministic constructions of compressed sensing matrices, Journal of Complexity, Volume 23, Issues 4?6, 918-925, 2007. [DeVore ’06] Ronald DeVore, Optimal Computation, Proceedings of the International Congress of Math- ematicians, European Mathematical Society, Madrid, Spain, 2006. [DeVore & Temlyakov ’96] Ronald DeVore & Vladimir Temlyakov, Some Remarks on Greedy Algorithms, Advances in Computational Mathematics, Volume 5, 173-187, 1996. 174 BIBLIOGRAPHY

[Dominici ’08] Diego Dominici, Variations on a Theme by James Stirling, Note di Matematica Volume 28, Issue 1, 2008.

[Donoho ’06] David L. Donoho, Compressed Sensing, IEEE Transactions on Information Theory, Volume 52, Issue 4, 1289-1306, 2006.

[Donoho ’10] David L. Donoho, Scanning the Technology, Proceedings of the IEEE, Volume 98, Number 6, 910-912, 2010.

[Donoho & Elad ’ 03] David. L. Donoho & Michael Elad, Optimally Sparse Representation in General (Nonorthogonal) Dictionaries via l1 Minimization, Proc. Nat. Acad. Sci. USA, Vol. 100, 2197-2202, 2003.

[Donoho & Huo ’01] David L. Donoho & Xiaoming Huo, Uncertainty Principles and Ideal Atomic De- composition, IEEE Transactions on Information Theory, Volume 47, Issue 7, 2845-2862, 2001.

[Donoho, Johnstone, Stern & Hoch ’92] David L. Donoho, Iain M. Johnstone, Alan S. Stern & Jeffrey C. Hoch, Maximum entropy and the nearly black object (with discussion). Journal of the Royal Statistical Society B, Volume 54, Number 1, 41-81, 1992.

[Donoho & Kutyniok ’13] David L. Donoho & Gitta Kutyniok Microlocal Analysis of the Geometric Separation Problem, Communications on Pure and Applied Mathematics, Vol. 66, Issue 1, pages 1-47, 2013.

[Donoho & Stark ’89] David L. Donoho & Philip B. Stark, Uncertainty Principles and Signal Recovery, SIAM J. Appl. Math., Volume 49, Issue 3, 906-931, June 1989. DOI:10.1137/0149053

[Donoho & Tsaig, ’08] David Donoho & Yaakov Tsaig, Fast Solution of `1-norm Minimization Problems When the Solution May Be Sparse, IEEE Transactions on Information Theory, Volume 54, Issue 11, 4789-4812, 2008.

[Donoho, Tsaig, Drori & Starck ’12] David Donoho, Yaakov Tsaig, Iddo Drori & Jean-Luc Starck, Sparse Solution of Underdetermined Systems of Linear Equations by Stagewise Orthogonal Matching Pur- suit, IEEE Transactions on Information Theory, Volume 58, Issue 2, 1094-1121, 2012.

[Duarte & Eldar ’11] Marco F. Duarte & Yonina C. Eldar, Structured Compressive Sensing: Theory and Applications. IEEE Transactions on Signal Processing, Volume 59, Issue 9, 4053-4085, 2011.

[Dudley ’75] Richard Dudley, The Gaussian Process and How to Approach It, Proceedings of the Inter- national Congress of Mathematicians, Vancouver, 1974.

[Duffin & Schaeffer ’52] Duffin, R., Schaeffer, A. A class of Nonharmonic Fourier series, Trans. Am. Math. Soc. Vol. 72, 341-366, 1952.

[Efron, Hastie, Johnstone & Tibshirani ’04] Bradley Efron, Trevor Hastie, Iain Johnstone & Robert Tib- shirani, Least Angle Regression, Annals of Statistics, Volume 32, Issue 2, 407-499, 2004.

[Elad & Bruckstein ’02] Michael Elad & Alfred Bruckstein, A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases, IEEE Transaction on Information Theory, Vol. 48, Issue 9, 2558-2567, 2002.

[Fazel ’02] Maryam Fazel, Matrix rank Minimization with Applications, Ph.D. Thesis, Stanford Univer- sity, 2002.

[Fazel, Hindi & Boyd ’01] Maryam Fazel, Haitham Hindi & Stephen Boyd, A Rank Minimization Heuristic with Application to Minimum Order System Approximation, in Proceedings of the American Control Conference, IEEE, 4734-4739, 2001.

[Fefferman ’83] Charles L. Fefferman, The Uncertainty Principle, Bull. Amer. Math. Soc, Vol. 9, 129-206, 1983.

[Feller ’45] William Feller, The Fundamental Limit Theorems in Probability, Bulletin of the American Mathematical Society, Volume 51, Number 11, 800-832, 1945.

[Feng & Zhang ’07] Xinlong Feng & Zhinan Zhang, The Rank of a Random Matrix, Applied Mathematics and Computation, Volume 185, Issue 1, 689-694, 2007.

[Ferreira & Higgins ’11] Paulo J. S. G. Ferreira & Rowland Higgins, The Establishment of Sampling as a Scientific Principle: A Striking Case of Multiple Discovery. Notices of the AMS, Volume 58, Number 10, 1446-1450, 2011.

[Fickus & Mixon ’15] Matthew Fickus & Dustin G. Mixon, Tables of the Existence of Equiangular Tight Frames, Preprint 2015. Available at http://arxiv.org/abs/1504.00253

[Folland & Sitaram ’97] Gerald B. Folland & Alladi Sitaram, The Uncertainty Principle: A Mathemat- ical Survey, Journal of Fourier Analysis and Applications, Volume 3, Issue 3, 207-238, May 1997.

[Fortnow & Homer ’03] Lance Fortnow & Steve Homer, A Short History of Computational Complexity Bulletin of the European Association for Theoretical Computer Science, Volume 80, 1-26, 2003.

[Foucart ’11] Simon Foucart, Hard Thresholding Pursuit: An Algorithm for Compressive Sensing. SIAM Journal of Numerical Analysis, Volume 49, Issue 6, 2543-2563, 2011.

[Foucart II ’10] Simon Foucart, A Note on Guaranteed Sparse Recovery Via `1-Minimization, Applied and Computational Harmonic Analysis, Volume 29, Issue 1, 97-103, 2010.

[Foucart ’10] Simon Foucart, Sparse Recovery Algorithms: Sufficient Conditions in Terms of Restricted Isometry Constants, In Approximation THeory XIII: San Antonio 2010, 65-77, ed. by M. Neamtu, L. Schumaker. Springer Proceedings in Mathematics, Volume 13, Springer, New York, 2012.

[Foucart & Gribonval ’10] Simon Foucart & Remi Gribonval, Real vs. Complex Null Space Properties for Sparse Vector Recovery. Comptes Rendus de L’académie des Sciences Mathématiques, Volume 348, Issue 15-16, 863-865, 2010.

[Foucart & Lai ’10] Simon Foucart & Ming-Jun Lai, Sparsest Solutions of Underdetermined Linear Systems via $\ell_q$-minimization for $0 < q \leq 1$, Applied and Computational Harmonic Analysis, Volume 26, Issue 3, 395-407, 2009.

[Foucart, Pajor, Rauhut & Ullrich ’10] S. Foucart, A. Pajor, H. Rauhut & T. Ullrich, The Gelfand Widths of $\ell_p$-balls for $0 < p \leq 1$, Journal of Complexity, Vol. 26, Issue 6, 629-640, 2010.

[Fowler ’00] David Fowler, The Factorial Function: Stirling’s Formula, The Mathematical Gazette, Volume 84, Number 499, 42-50, 2000.

[Friedman & Stuetzle ’81] J. Friedman & W. Stuetzle, Projection Pursuit Regression. Journal of the American Statistical Association,Volume 76, 817-823, 1981.

[Gantz & Reinsel ’10] John Gantz & David Reinsel, The Digital Universe - Are You Ready?. IDC Re- port, Sponsored by EMC Corporation. Available at https://www.emc.com/collateral/analyst- reports/idc-digital-universe-are-you-ready.pdf.

[Gantz & Reinsel ’12] John Gantz & David Reinsel, The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and BIggest Growth in the Far East. IDC Report, Sponsored by EMC Corporation. Available at https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe- in-2020.pdf. 176 BIBLIOGRAPHY

[Gao, Peng, Yue & Zhao ’15] Yi Gao, Jigen Peng, Shigang Yue & Yuan Zhao, On the Null Space Property of $\ell_q$-Minimization for $0 < q \leq 1$ in Compressed Sensing, Journal of Function Spaces, Volume 10, 1-10, 2015.

[Garnaev & Gluskin ’84] A. Y. Garnaev & E. D. Gluskin, On Widths of the Euclidean Ball. Soviet Mathematics - Doklady, Vol. 30, 200-203, 1984.

[Ge, Jiang & Ye ’11] Dongdong Ge, Xiaoye Jiang & Yinyu Ye, A Note on the Complexity of Lp Mini- mization, Mathematical Programming, Volume 129, Issue 2, 285-299, 2011.

[van de Geer & Buhlmann ’09] Sara A. van de Geer & Peter Bühlmann, On the Conditions Used to Prove Oracle Results for the LASSO, Electronic Journal of Statistics, Volume 3, 1360-1392, 2009.

[Geman ’80] Stuart Geman, A Limit Theorem for the Norm of Random Matrices, Annals of Probability, Volume 8, 252-261, 1980.

[Gilbert, Iwen & Strauss ’08] Anna C. Gilbert, Mark A. Iwen, Martin J. Strauss, Group Testing and Sparse Signal Recovery. Proceedings of the 42nd Asilomar Conference on Signals, Systems and Com- puters, 1059-1063, 2008.

[Gluskin ’84] E. D. Gluskin, Norms of Random Matrices and Widths of Finite-Dimensional Sets. Math- ematics of the USSR-Sbornik, Volume 48, Number 1, 173-182, 1984.

[Gondzio ’12] Jacek Gondzio, Interior Point Methods 25 Years Later, European Journal of Operational Research, Volume 218, 587-601, 2012.

[Gordon ’85] Yehoram Gordon, Some Inequalities for Gaussian Processes and Applications. Israel Jour- nal of Mathematics, Volume 50, Number 4, 1985.

[Gordon ’88] Yehoram Gordon, On Milman’s Inequality and Random Subspaces Which Escape Through a Mesh in Rn. In Geometric Aspects of Functional Analysis (1986/87), Volume 1317 of Lecture Notes in Mathematics, 84-106. Springer, Berlin, 1988.

[Graham & Sloane ’80] Ronald Graham & Neil Sloane, Lower Bounds for Constant Weight Codes. IEEE Transactions on Information Theory, Vol. 26, Issue 1, 37-43, 1980.

[Gribonval & Nielsen ’03] Remi Gribonval & Morten Nielsen, Sparse Representation in the Union of Basis, IEEE Transactions on Information Theory, Volume 49, Issue 12, 3320-3325, 2003.

[Gribonval & Nielsen ’07] Remi Gribonval & Morten Nielsen, Highly Sparse Representations From Dictio- naries are Unique and Independent of the Sparseness Measure, Applied and Computational Harmonic Analysis, Volume 22, Issue 3, 335-355, 2007.

[Gross ’75] Leonard Gross, Logarithmic Sobolev Inequalities. American Journal of Mathematics, Volume 97, Number 4, 1061-1083, 1975.

[Güntürk ’00] C. Sinai Güntürk, Harmonic Analysis of Two Problems in Signal Quantization and Com- pression, Ph. D. Thesis, Princeton University, 2000.

[Haboba et al. ’12] Javier Haboba, Mauro Mangia, Fabio Pareschi, Riccardo Rovatti & Gianluca Setti, A Pragmatic Look at Some Compressive Sensing Architectures With Saturation and Quantization, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Vol.2, Number 3, 443-459, 2012.

[Hanson & Wright ’71] D. L. Hanson & F. T. Wright, A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables, The Annals of Mathematical Statistics, Volume 42, Number 3, 1079-1083, 1971.

[Hartmanis & Stearns ’65] Juris Hartmanis & Richard E. Stearns, On the Computational Complexity of Algorithms. Transactions of the American Mathematical Society, Volume 117, 285-306, 1965.

[Higgins ’85] John Rowland Higgins, Five Short Stories About the Cardinal Series. Bulletin of the Amer- ican Mathematical Society. Volume 12, Number 1, 45-89, 1985.

[Hinrichs & Vybiral ’11] Aicke Hinrichs & Jan Vybíral, Johnson-Lindenstrauss Lemma for Circulant Matrices. Random Structures & Algorithms, Volume 39, Issue 3, 391-398, 2011.

[Holmes & Paulsen ’04] Roderick B. Holmes & Vern I. Paulsen, Optimal Frames for Erasures, Linear Algebra and Its Applications, Vol. 377, 31-51, 2004.

[Holtz ’08] Olga Holtz, Compressive Sensing: A Paradigm Shift in Signal Processing, Preprint 2008. Available at https://arxiv.org/abs/0812.3137.

[Huang & Rao ’11] Yanping Huang & Rajesh P. N. Rao, Predictive Coding. Wiley Interdisciplinary Re- views. Cognitive Science. Volume 2, Issue 5, 580-593, 2011.

[Jacques & Vandergheynst ’11] Laurent Jacques & Pierre Vandergheynst, Compressed Sensing: When Sparsity Meets Sampling, in Optical and Digital Image Processing: Fundamentals and Applications (eds G. Cristobal, P. Schelkens and H. Thienpont), Wiley, Weinheim, Germany, 2011.

[James, Radchenko & Lv ’09] Gareth M. James, Peter Radchenko & Jinchi Lv, DASSO: Connections Between the Dantzig Selector and Lasso. Journal of the Royal Statistical Society: Series B, Volume 71, 127-142. 2009.

[Jameson ’15] Graham J. O. Jameson, A Simple Proof of Stirling’s Formula for the Gamma Function, The Mathematical Gazette, Volume 99, Issue 544, 68-74, 2015.

[Jerri ’77] Abdul J. Jerri, The Shannon Sampling Theorem: Its Various Extensions and Applications, A Tutorial Review, Proceedings of the IEEE, Volume 65, Number 11, 1565-1596, 1977.

[Johnson & Lindenstrauss ’84] W. B. Johnson & J. Linderstrauss, Extensions of Lipschitz Mappings Into a Hilbert Space. In Conference in Modern Analysis and Probability (New Haven, Conn. 1982). Con- temporary Mathematics, Vol.26, 189-206, AMS, 1984.

[Kabanava, Kueng, Rauhut & Terstiege ’16] Maryia Kabanava, Richard Kueng, Holger Rauhut & Ulrich Terstiege, Stable Low-Rank Matrix Recovery Via Null Space Properties, Information and Inference, Volume 5, Issue 4 [to be completed].

[Kahane ’60] Jean-Pierre Kahane, Proprietes Locales des Fonctions a Series de Fourier Aleatoires, Studia Mathematica, Volume 19, 1-25, 1960.

[Kahane’86] Jean-Pierre Kahane, Une Inegalité du Type de Slepian et Gordon Sur Les Processus Gaussiens, Israel Journal of Mathematics, Volume 55, 109-110, 1986.

[Karp ’72] Richard M. Karp, Reducibility Among Combinatorial Problems. In R.E. Miller and J.W. Thatcher (editors). Complexity of Computer Computations. Proc. of a Symp. on the Complexity of Computer Computations, 85-103, 1972.

[Kashin ’77] B. S. Kashin, Diameters of Some Finite-Dimensional Sets in Classes of Smooth Functions. Izvestiya Rossiiskoi Akademii Nauk, Seriya Matematicheskaya, Vol. 11, no.2, 317-333, 1977.

[Keshavan, Montanari & Oh ’10] Raghunandan Keshavan, Andrea Montanari & Sewoong Oh, Matrix Completion From a Few Entries. IEEE Transactions on Information Theory, Volume 56, Number 6, 2980-2998, 2010. 178 BIBLIOGRAPHY

[Kim, Koh, Lustig, Boyd & Gorinevsky ’08] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd & Dimitry Gorinevsky, An Interior-Point Method for Large-Scale-Regularized Least Squares, IEEE Journal of Selected Topics in Signal Processing, Volume 1, Issue 4, 606-617, 2008.

[Kim & Shevlyakov ’08] Kiseon Kim & Georgy Shevlyakov, Why Gaussianity?. IEEE Signal Processing Magazine, Volume 25, Number 2, 2008.

[Kitsos & Tavoularis ’09] Christos P. Kitsos & Nikolaos K. Tavoularis, Logarithmic Sobolev Inequalities for Information Measures . IEEE Transactions on Information Theory, Volume 55, Number 6, 2554- 2561, 2009.

[Krahmer ’09] Felix Krahmer, Novel Schemes for Sigma-Delta Modulation: From Improved Exponential Accuracy to Low-Complexity Design. Ph.D. Thesis, New York University, 2009.

[Krahmer & Ward ’12] Felix Krahmer & Rachel Ward, New and Improved Johnson-Linderstrauss Em- beddings via the Restricted Isometry Property. SIAM Journal of Mathematical Analysis, Vol.43, Issue 3, 1269-1281, 2011.

[Kolmogorov ’36] Andrei Kolmogorov, Uber die beste Annaherung von Funktionen einer gegebenen Funk- tionenklasse, Annals of Mathematics, Volume 2, Issue 37, 107-110, 1936.

[Lai & Liu ’11] Ming-Jun Lai & Yang Liu, The Null Space Property for Sparse Recovery From Multiple Measurement Vectors, Applied and Computational Harmonic Analysis, Volume 30, Issue 3, 402-406, 2011.

[Landau & Pollack ’61] H. J. Landau & H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty-II, Bell Syst. Tech. J, vol. 40, 65-84, 1961.

[Landau & Pollack ’62] H. J. Landau & H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty-III:: The Dimension of the Space of Essentially Time- and Band-Limited Signals, Bell Syst. Tech. J, vol. 41, 1295-1336, 1962.

[Larsen & Nelson ’14] Kasper Larsen & Jelani Nelson, The Johnson-Lindenstrauss Lemma is Optimal for Linear Dimensionality Reduction, Preprint 2014. Available at http://arxiv.org/abs/1411.2404

[Lidskii ’50] Victor B. Lidskii, On the Characteristic Numbers of the Sum and Product of Symmetric Matrices, Doklady Akad. Nauk SSSR (N.S.) Volume 75, 769-772, 1950.

[Linial, London & Rabinovich ’15] Nathan Linial, Eran London & Yuri Rabinovich, The Geometry of Graphs and Some of Its Algorithmic Applications, Combinatorica, Volume 15, Issue 2, 215-245, 1995.

[Lorenz, Pfetsch & Tillmann ’15] Dirk A. Lorenz, Marc E. Pfetsch & Andreas M. Tillmann, Solving Basis Pursuit: Heuristic Optimality Check and Solver Comparison, ACM Transactions on Mathematical Software, Volume 41, Issue 2, 1-28, 2015. (An overview of updated computational results (as of April 2016) can be found at http://www.mathematik.tu-darmstadt.de/~tillmann/docs/SolvingBPupdateApr2016.pdf.)

[Lustig, Donoho & Pauly ’07] Michael Lustig, David Donoho & John M. Pauly, Sparse MRI: The Application of Compressed Sensing for Rapid MR Imaging. Magnetic Resonance in Medicine, Volume 58, Issue 6, 1182-1195, 2007.

[Lv & Fan ’09] Jinchi Lv & Yingying Fan, A Unified Approach to Model Selection and Sparse Recovery Using Regularized Least Squares, The Annals of Statistics, Volume 37, Number 6A, 3498-3528, 2009.

[Kutyniok ’12] Gitta Kutyniok, Theory and Applications of Compressed Sensing, Unpublished notes. Available at http://arxiv.org/pdf/1203.3815v2.pdf

[Maleki & Donoho ’10] Arian Maleki & David Donoho, Optimally Tuned Iterative Reconstruction Algorithms for Compressed Sensing, IEEE Journal of Selected Topics in Signal Processing, Volume 4, Issue 2, 330-341, 2010.

[Mallat & Zhang ’93] Stephane Mallat & Zhifeng Zhang, Matching Pursuit With Time-Frequency Dictionaries, IEEE Transactions on Signal Processing, Vol. 41, Number 12, 3397-3415, 1993.

[Marcenko & Pastur ’67] Vladimir Marcenko & Leonid Pastur, Distributions of Eigenvalues for Some Sets of Random Matrices, Mathematics of the USSR-Sbornik, Volume 1, Number 4, 1967.

[Matousek ’08] Jiri Matousek, On Variants of the Johnson-Lindenstrauss Lemma, Random Structures & Algorithms, Vol. 33, Issue 2, 142-156, 2008.

[Meijering ’02] Erik Meijering, A Chronology of Interpolation: From Ancient Astronomy to Modern Signal and Image Processing. Proceedings of the IEEE, Volume 90, Issue 3, 319-342, 2002.

[Meinshausen, Rocha & Yu ’07] N. Meinshausen, G. Rocha & B. Yu, Discussion: A Tale of Three Cousins: Lasso, L2Boosting and Dantzig, The Annals of Statistics, Volume 35, Number 6, 2373-2384, 2007.

[Mendelson, Pajor & Rudelson ’05] S. Mendelson, A. Pajor & M. Rudelson, The Geometry of Random {−1, 1}-Polytopes. Discrete & Computational Geometry, Vol. 34, Issue 3, 365-379, 2005.

[Mendelson, Pajor & Tomczak-Jaegermann ’08] S. Mendelson, A. Pajor & N. Tomczak-Jaegermann, Uniform Uncertainty Principle for Bernoulli and Subgaussian Ensembles. Constructive Approximation, Vol. 28, Issue 3, 277-289, 2008.

[Milman & Pajor ’03] Vitali Milman & Alain Pajor, Regularization of Star Bodies by Random Hyperplane Cut Off. Studia Mathematica, Vol. 159, Issue 2, 247-261, 2003.

[Mishali & Eldar ’11] Moshe Mishali & Yonina Eldar, Sub-Nyquist Sampling: Bridging Theory and Practice. IEEE Signal Processing Magazine, Volume 28, Issue 6, 98-124, 2011.

[Mixon ’15] Dustin Mixon, Explicit Matrices with the Restricted Isometry Property: Breaking the Square-Root Bottleneck, in Compressive Sensing and Its Applications, 389-417, Birkhäuser, 2015.

[Mo & Li ’11] Qun Mo & Song Li, New Bounds on The Restricted Isometry Constant δ2k, Applied and Computational Harmonic Analysis, Volume 31, Issue 3, 460-468, 2011.

[Mo & Shen ’12] Qun Mo & Yi Shen, A Remark on the Restricted Isometry Property in Orthogonal Matching Pursuit, IEEE Transactions on Information Theory, Volume 58, Issue 6, 3654-3656, 2012.

[Natarajan ’96] Balas K. Natarajan, Sparse Approximate Solutions to Linear Systems. SIAM Journal on Computing, Volume 24, Issue 2, 227-234, 1996.

[Needell ’09] Deanna Needell, Topics in Compressed Sensing, Ph.D. Thesis, University of California, 2009.

[Needell & Tropp ’08] Deanna Needell & Joel Tropp, CoSaMP: Iterative Signal Recovery From Incomplete and Inaccurate Samples, Applied and Computational Harmonic Analysis, Volume 26, Issue 3, 301-321, 2008.

[Nelson ’??] Jelani Nelson, Johnson-Lindenstrauss Notes, Unpublished notes. Available at http://web.mit.edu/minilek/www/jl_notes.pdf

[Noam & Avi ’94] Noam Nisan & Avi Wigderson, Hardness vs Randomness. Journal of Computer and System Sciences, Vol. 49, Issue 2, 149-167, 1994.

[Oliveira ’13] Roberto I. Oliveira, The Lower Tail of Random Quadratic Forms, With Applications to Ordinary Least Squares and Restricted Eigenvalue Properties, Preprint 2013. Available at http://arxiv.org/abs/1312.2903

[Olshausen & Field ’04] Bruno A. Olshausen & David J. Field, Sparse Coding of Sensory Inputs. Current Opinion in Neurobiology, Volume 14, 481-487, 2004.

[Pati, Rezaiifar & Krishnaprasad ’93] Y. C. Pati, R. Rezaiifar & P. S. Krishnaprasad, Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition, Proceedings of the 27th Annual Asilomar Conference on Signals, Systems and Computers, Nov. 1-3, 1993.

[Pearson ’20] Karl Pearson, Notes on the History of Correlation. Biometrika, Volume 11, 145-158, 1920.

[Pope ’09] Graeme Pope, Compressive Sensing: A Summary of Reconstruction Algorithms, Master’s Thesis, Department of Computer Science, Eidgenössische Technische Hochschule, Zurich, 2009.

[Rao & Gorodnistsky ’97] Irina F. Gorodnitsky & Bhaskar D. Rao, Sparse Signal Reconstruction From Limited Data Using FOCUSS: A Re-Weighted Norm Minimization Algorithm, IEEE Transactions on Signal Processing, Vol. 45, Issue 3, 600-616, 1997.

[Recht, Fazel & Parrilo ’10] Benjamin Recht, Maryam Fazel & Pablo A. Parrilo, Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization, SIAM Review, Volume 52, Issue 3, 471-501, 2010.

[Recht, Xu & Hassibi ’08] Benjamin Recht, Weiyu Xu & Babak Hassibi, Necessary and Sufficient Conditions for Success of the Nuclear Norm Heuristic for Rank Minimization, in Proceedings of the 47th IEEE Conference on Decision and Control, 3065-3070, 2008.

[Rivasplata ’12] Omar Rivasplata, Subgaussian Random Variables: An Expository Note. Unpublished notes. Available at http://www.stat.cmu.edu/~arinaldo/36788/subgaussians.pdf.

[Rohde & Tsybakov ’11] Angelika Rohde & Alexandre Tsybakov, Estimation of High-Dimensional Low-Rank Matrices, The Annals of Statistics, Volume 39, Number 2, 887-930, 2011.

[Romberg ’08] Justin K. Romberg, Imaging via Compressive Sampling. IEEE Signal Processing Magazine, Volume 25, Issue 2, 14-20, 2008.

[Rudelson & Vershynin ’08] Mark Rudelson & Roman Vershynin, On Sparse Reconstruction from Fourier and Gaussian Measurements. Communications on Pure and Applied Mathematics, Volume 61, Issue 8, 1025-1045, 2008.

[Rudelson & Vershynin ’10] Mark Rudelson & Roman Vershynin, Non-asymptotic Theory of Random Matrices: Extreme Singular Values, Proceedings of the International Congress of Mathematicians, Hyderabad, India, 2010.

[Rudelson ’14] Mark Rudelson, Recent Developments in Non-asymptotic Theory of Random Matrices, in Modern Aspects of Random Matrix Theory, Van Vu (ed.), 83-120, AMS, 2014. Also available at http://arxiv.org/pdf/1301.2382v2.pdf.

[Saab, Chartrand & Yilmaz ’08] Rayan Saab, Rick Chartrand & Ozgür Yilmaz, Stable Sparse Approximations via Nonconvex Optimization, in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008.

[Sarwate ’98] Dilip V. Sarwate, Meeting the Welch Bound with Equality, in Sequences and Their Applications: Proceedings of SETA ’98, J.-P. Allouche et al. (eds.), 79-102, Springer, 1999.

[Schechtman ’03] Gideon Schechtman, Concentration, Results and Applications, in Handbook of the Geometry of Banach Spaces, Volume 2, by William B. Johnson & Joram Lindenstrauss (eds.), 1603-1635, North Holland, 2003.

[Schlotter & Wehmeier ’13] Sven Schlotter & Kai F. Wehmeier, Gingerbread Nuts and Pebbles: Frege and the Neo-Kantians - Two Recently Discovered Documents, British Journal for the History of Philosophy, Volume 21, Number 3, 591-609, 2013.

[Schnass & Vandergheynst ’07] Karin Schnass & Pierre Vandergheynst, Average Performance Analysis for Thresholding, IEEE Signal Processing Letters, Volume 14, Issue 11, 828-831, 2007.

[Scott & Grassl ’10] A. J. Scott & M. Grassl, Symmetric Informationally Complete Positive-Operator-Valued Measures: A New Computer Study, Journal of Mathematical Physics, Vol. 51, 2010.

[Seneta ’13] Eugene Seneta, A Tricentenary History of the Law of Large Numbers, Bernoulli, Volume 19, Number 4, 1088-1121, 2013.

[Shi, Han & Zheng ’15] Ziqiang Shi, Jiqing Han & Tieran Zheng, Soft Margin Based Low-Rank Audio Signal Classification, Neural Processing Letters, Volume 42, Issue 2, 291-299, 2015.

[Silverstein ’85] Jack Silverstein, The Smallest Eigenvalue of a Large-Dimensional Wishart Matrix, Annals of Probability, Volume 13, 1364-1368, 1985.

[Slepian & Pollack ’61] D. Slepian & H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty-I, Bell Syst. Tech. J, vol. 40, 43-64, 1961.

[So & Ye ’07] Anthony Man-Cho So & Yinyu Ye, Theory of Semidefinite Programming for Sensor Network Localization, Mathematical Programming, Volume 109, Issue 2, 367-384, 2007.

[Strassen ’69] Volker Strassen, Gaussian Elimination is not Optimal. Numerische Mathematik, Volume 13, 354-356, 1969.

[Strohmer ’12] Thomas Strohmer, Measure What Should be Measured: Progress and Challenges in Compressive Sensing, IEEE Signal Processing Letters, Volume 19, Issue 12, 887-893, 2012.

[Strohmer & Heath ’03] Thomas Strohmer & R. W. Heath, Grassmannian Frames With Applications to Coding and Communication, Appl. Comput. Harmon. Anal., Vol. 14, 257-275, 2003.

[Su, Bogdan & Candès ’15] Weijie Su, Malgorzata Bogdan & Emmanuel J. Candès, False Discoveries Occur Early on the Lasso Path, To appear in the Annals of Statistics. Available at https://arxiv.org/abs/1511.01957.

[Sustik et al. ’07] M. A. Sustik, Joel A. Tropp, I. S. Dhillon & R. W. Heath, On the Existence of Equiangular Tight Frames, Linear Algebra and its Applications, Vol. 426, Issue 2, 619-635, 2007.

[Székely & Rizzo ’07] Gábor J. Székely & Maria L. Rizzo, The Uncertainty Principle of Game Theory, The American Mathematical Monthly, Volume 114, Number 8, 688-702, 2007.

[Talagrand ’94] Michel Talagrand, The Supremum of Some Canonical Processes, American Journal of Mathematics, Volume 116, Number 2, 283-325, 1994.

[Tibshirani ’96] Robert Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B, Volume 58, Issue 1, 267-288, 1996.

[Tikhomirov ’66] Vladimir M. Tikhomirov, A remark on L.A. Tumarkin’s article "On diameters of infinite-dimensional compact sets". Vestnik Mosk. Univ., ser. mat. mekh. astron., Number 3, Issue 73, 1966.

[Tikhomirov ’83] Vladimir M. Tikhomirov, Widths and entropy, Uspekhi Mat. Nauk, Vol. 38, Issue 4, 91-99, 1983.

[Tikhomirov ’03] Vladimir M. Tikhomirov, Some thoughts on the mathematical work of Boris Sergeevich Kashin. Proceedings of the Steklov Institute of Mathematics, Volume 280, Issue 1, 1-4, 2013.

[Tillmann & Pfetsch ’14] Marc E. Pfetsch & Andreas Tillmann, The Computational Complexity of the Restricted Isometry Property, the Nullspace Property, and Related Concepts in Compressed Sensing, IEEE Transactions on Information Theory, Volume 60, Issue 2, 1248-1259, 2014. DOI:10.1109/TIT.2013.2290112.

[Tropp ’04] Joel A. Tropp, Greed is Good: Algorithmic Results for Sparse Approximation. IEEE Transactions on Information Theory, Vol. 50, Issue 10, 2231-2242, 2004.

[Tropp ’05] Joel A. Tropp, Average-Case Analysis of Greedy Pursuit, Proceedings SPIE, Volume 5914, Wavelets XI, 2005.

[Tropp & Gilbert ’07] Joel Tropp & Anna C. Gilbert, Signal Recovery From Random Measurements via Orthogonal Matching Pursuit. IEEE Transactions on Information Theory, Volume 53, Issue 12, 4655-4666, 2007.

[Tropp, Laska, Duarte, Romberg & Baraniuk ’10] Joel Tropp, Jason Laska, Marco Duarte, Justin Romberg & Richard Baraniuk, Beyond Nyquist: Efficient Sampling of Sparse Bandlimited Signals. IEEE Transactions on Information Theory, Volume 56, Issue 1, 520-544, 2010.

[Tsirelson, Ibragimov & Sudakov ’76] Boris Tsirelson, Ildar A. Ibragimov & Vladimir N. Sudakov, Norms of Gaussian Sample Functions. In Proceedings of the 3rd Japan-U.S.S.R. Symposium on Probability Theory, Volume 550 of Lecture Notes in Mathematics, pp. 20-41, Springer-Verlag, Berlin, 1976.

[Tweddle ’84] Ian Tweddle, Approximating n!: Historical Origins and Error Analysis, American Journal of Physics, Volume 52, 487-488, 1984.

[Unser ’00] Michael Unser, Sampling - 50 Years After Shannon. Proceedings of the IEEE, Volume 88, Number 4, 569-587, 2000.

[Vandenberghe & Boyd ’96] Lieven Vandenberghe & Stephen Boyd, Semidefinite Programming, SIAM Review, Volume 38, Number 1, 49-95, 1996.

[Vasanawala, Alley, Hargreaves, Barth, Pauly & Lustig ’10] Shreyas S. Vasanawala, Marcus T. Alley, Brian A. Hargreaves, Richard A. Barth, John M. Pauly, & Michael Lustig, Improved Pediatric MR Imaging with Compressed Sensing. Radiology, Volume 256, Issue 2, 607-616, 2010.

[Vershynin ’12] Roman Vershynin, Introduction to the Non-asymptotic Analysis of Random Matrices. In Compressed Sensing, 210-268, Cambridge University Press, Cambridge, 2012.

[Vershynin ’15] Roman Vershynin, Estimation in High Dimensions: A Geometric Perspective. Sampling Theory, a Renaissance, 3-66, Birkhäuser, 2015.

[Vitale ’00] Richard A. Vitale, Some Comparisons for Gaussian Processes, Proceedings of the American Mathematical Society, Volume 128, 3043-3046, 2000.

[Vybiral ’11] Jan Vybiral, A Variant of the Johnson-Lindenstrauss Lemma for Circulant Matrices, Journal of Functional Analysis, Volume 260, Issue 4, 1096-1105, 2011.

[Vybiral ’08] Jan Vybiral, Widths of Embeddings in Function Spaces. Journal of Complexity, Vol. 24, Issue 4, 545-570, 2008.

[Wainwright ’15] Martin Wainwright, Statistics Meets Optimization: Randomization and Approximation for High-Dimensional Problems, Nachdiplom Lecture, ETH Zurich, 2015. Available at http://www.stat.berkeley.edu/~wainwrig/nachdiplom/.

[Waldron ’09] Shayne Waldron, On the Construction of Equiangular Frames From Graphs, Linear Algebra and its Applications, Vol. 431, 2228-2242, 2009.

[Welch ’74] L. R. Welch, Lower Bounds on the Maximum Cross Correlation of Signals. IEEE Trans. on Info. Theory, Vol. 20, Issue 3, 397-399, 1974. doi:10.1109/TIT.1974.1055219

[Wielandt ’55] Helmut Wielandt, An Extremum Property of Sums of Eigenvalues, Proceedings of the American Mathematical Society, Volume 6, 106-110, 1955.

[Wright ’04] Margaret H. Wright, The Interior-Point Revolution in Optimization: History, Recent Developments, and Lasting Consequences, Bulletin of the American Mathematical Society, Volume 42, Number 1, 39-56, 2004.

[Yin, Bai & Krishnaiah ’88] Y. Q. Yin, Z. D. Bai & P. R. Krishnaiah, On the Limit of the Largest Eigenvalue of the Large-Dimensional Sample Covariance Matrix, Probability Theory and Related Fields, Volume 78, 509-521, 1988.

[Yu & Feng ’14] Yi Yu & Yang Feng, Modified Cross-Validation for Penalized High-Dimensional Linear Regression Models, Journal of Computational and Graphical Statistics, Volume 23, Issue 4, 1009-1027, 2014.

[Zauner ’99] Gerhard Zauner, Quantendesigns - Grundzüge einer nichtkommutativen Designtheorie, Ph.D. Thesis, Universität Wien, 1999.

[Zhang ’09] Tong Zhang, Some Sharp Performance Bounds for Least Squares Regression with L1 Regularization, The Annals of Statistics, Volume 37, Number 5A, 2109-2144, 2009.

[Zhang ’11] Tong Zhang, Sparse Recovery with Orthogonal Matching Pursuit under RIP. IEEE Transactions on Information Theory, Volume 57, Issue 9, 6215-6221, 2011.

[Zhou, Kong & Xiu ’13] Shenglong Zhou, Lingchen Kong & Naihua Xiu, New Bounds for RIC in Compressed Sensing, Journal of the Operations Research Society of China, Volume 1, Issue 2, 227-237, 2013.

Blog Posts and Lectures:

[Tao’s Blog - 22/03/2008] Terence Tao's Blog Entry on the Dantzig Selector, https://terrytao.wordpress.com/2008/03/22/the-dantzig-selector-statistical-estimation-when-p-is-much-larger-than-n/

[Tao’s Blog - 24/04/2008] Terence Tao's Blog Entry on Logarithmic Sobolev Inequalities and Perelman's Entropy, https://terrytao.wordpress.com/2008/04/24/285a-lecture-8-ricci-flow-as-a-gradient-flow-log-sobolev-inequalities-and-perelman-entropy/

[Tao’s Blog - 09/06/2009] Terence Tao's Blog Entry on Talagrand's Concentration Inequality, https://terrytao.wordpress.com/2009/06/09/talagrands-concentration-inequality/

[Tao’s Blog - 01/03/2010] Terence Tao's Blog Entry on Concentration of Measure, https://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/

[Tao’s Blog - 06/25/2010] Terence Tao's Blog Entry on the Uncertainty Principle, https://terrytao.wordpress.com/2010/06/25/the-uncertainty-principle/

[Tao’s Blog - 09/14/2010] Terence Tao's Blog Entry on Universality, https://terrytao.wordpress.com/2010/09/14/a-second-draft-of-a-non-technical-article-on-universality/

[Tao’s Blog - 07/02/2010] Terence Tao's Blog Entry on Deterministic RIP Matrices, https://terrytao.wordpress.com/2007/07/02/open-question-deterministic-uup-matrices/

[Tao’s Blog - 01/12/2010] Terence Tao's Blog Entry on Eigenvalues and Sums of Hermitian Matrices, https://terrytao.wordpress.com/tag/wielandt-hoffman-inequality/

[Mixon’s Blog - 12/02/2013] Dustin Mixon's Blog Entry on Breaking the Quadratic Bottleneck I, https://dustingmixon.wordpress.com/2013/12/02/deterministic-rip-matrices-breaking-the-square-root-bottleneck/

[Mixon’s Blog - 12/11/2013] Dustin Mixon's Blog Entry on Breaking the Quadratic Bottleneck II, https://dustingmixon.wordpress.com/2013/12/11/deterministic-rip-matrices-breaking-the-square-root-bottleneck-ii/

[Mixon’s Blog - 01/14/2014] Dustin Mixon's Blog Entry on Breaking the Quadratic Bottleneck III, https://dustingmixon.wordpress.com/2014/01/14/deterministic-rip-matrices-breaking-the-square-root-bottleneck-iii/

[Mixon’s Blog - 02/08/2014] Dustin Mixon's Blog Entry on Gordon's Escape Through a Mesh Theorem, https://dustingmixon.wordpress.com/2014/02/08/gordons-escape-through-a-mesh-theorem/

[Ellenberg - 02/22/10] Jordan Ellenberg's Blog Entry on Compressive Sensing, https://www.wired.com/2010/02/ff_algorithm/

[Bandeira’s Blog - 12/04/2013] Afonso Bandeira's Blog Entry on Deterministic Matrices, https://afonsobandeira.wordpress.com/2013/12/04/deterministic-rip-matrices-a-new-cooperative-online-project/

[Wiki - Deterministic RIP Matrices] Math Research Wiki - Deterministic RIP Matrices, http://math-research.wikia.com/wiki/Deterministic_RIP_Matrices

Lectures:

[Ledoux - Lecture I] Michel Ledoux - Lecture on Logarithmic Sobolev Inequalities, http://www.math.univ-toulouse.fr/~ledoux/Logsobwpause.pdf

[Ledoux - Lecture II] Michel Ledoux - Lecture on The Concentration of Measure Phenomenon, http://perso.math.univ-toulouse.fr/ledoux/files/2015/07/Villani2wp.pdf

Algorithms Websites:

[NESTA] by Jérôme Bobin, Stephen Becker & Emmanuel Candès, http://statweb.stanford.edu/~candes/nesta/

[ℓ1-MAGIC] by Emmanuel Candès & Justin Romberg, http://statweb.stanford.edu/~candes/l1magic/

[YALL1] by Yin Zhang, Wei Deng, Junfeng Yang & Wotao Yin, http://yall1.blogs.rice.edu/

[L1-LS] by Kwangmoo Koh, Seung-Jean Kim & Stephen Boyd, https://stanford.edu/~boyd/l1_ls/

[Fast ℓ1] by Allen Yang, Arvind Ganesh, Zihan Zhou, Andrew Wagner, Victor Shia, Shankar Sastry & Yi Ma, https://people.eecs.berkeley.edu/~yang/software/l1benchmark/

[CVXPY] by Martin Andersen, Joachim Dahl & Lieven Vandenberghe, http://cvxopt.org/examples/mlbook/l1.html

[GPSR] by Mário Figueiredo, Robert D. Nowak & Stephen J. Wright, http://www.lx.it.pt/~mtf/GPSR/

“After writing a story I was always empty and both sad and happy.”

Ernest Hemingway in A Moveable Feast, p. 6