Universidade Federal do Rio de Janeiro
Compressive Sensing
Claudio Mayrink Verdun
Rio de Janeiro, December 2016
Universidade Federal do Rio de Janeiro
Compressive Sensing
Claudio Mayrink Verdun
M.Sc. dissertation presented to the Graduate Program in Applied Mathematics, Instituto de Matemática of the Universidade Federal do Rio de Janeiro (UFRJ), as part of the requirements for the degree of Master in Applied Mathematics.
Advisor: Prof. Cesar Javier Niche Mazzeo. Co-advisor: Prof. Bernardo Freitas Paulo da Costa. Co-advisor: Prof. Eduardo Antonio Barros da Silva.
Rio de Janeiro, December 2016
Verdun, Claudio Mayrink
V487c Compressive Sensing / Claudio Mayrink Verdun. – Rio de Janeiro: UFRJ/IM, 2016. 202 pp., 29.7 cm. Advisor: César Javier Niche Mazzeo. Co-advisors: Bernardo Freitas Paulo da Costa, Eduardo Antonio Barros da Silva. Dissertation (M.Sc.) — Universidade Federal do Rio de Janeiro, Instituto de Matemática, Programa de Pós-Graduação em Matemática, 2016.
1. Compressive Sensing. 2. Sparse Modelling. 3. Signal Processing. 4. Applied Harmonic Analysis. 5. Inverse Problems. I. Mazzeo, César Javier Niche, advisor. II. Costa, Bernardo Freitas Paulo da, co-advisor. III. Silva, Eduardo Antonio Barros da, co-advisor. IV. Title.
On Exactitude in Science
In that Empire, the Art of Cartography achieved such Perfection that the Map of a single Province occupied an entire City, and the Map of the Empire an entire Province. In time, these Excessive Maps no longer satisfied, and the Colleges of Cartographers raised a Map of the Empire that was the size of the Empire and coincided with it point by point. Less Addicted to the Study of Cartography, the Following Generations understood that this dilated Map was Useless and, not without Impiety, delivered it to the Inclemencies of the Sun and of the Winters. In the Deserts of the West there remain tattered Ruins of the Map, inhabited by Animals and Beggars; in the whole Country there is no other relic of the Geographical Disciplines.
Suárez Miranda: Viajes de varones prudentes, Libro Cuarto, cap. XLV, Lérida, 1658. In [Borges ’60].
Acknowledgments
This is the only part of my dissertation written in my native language. I hope that all who are cited here may read clearly what follows. If you are one of those people who only care about mathematics and think it is not made by human beings who smile, cry, think deep things but also do trivial ones, skip this part and go straight to Chapter 1. There, sparse vectors and compressive measurements await you.
I tried to be thorough and not forget anyone. If I did forget someone, it was possibly on purpose; if it was not, I apologize, although the reason for the omission will never be apparent. Moreover, we will see throughout this dissertation that redundancy is something fundamental, so we start right here, citing some people in more than one place and for different reasons.
This dissertation is dedicated to two most holy trinities. The first: Felipe Acker, the father; Fábio Ramos, the son; and Luiz Wagner Biscainho, the holy spirit. The second: Roberto Velho, the father; Filipe Goulart, the son; and Hugo Carvalho, the holy spirit. These six people contributed in a nontrivial way to my formation and introduced many nonlinearities into my life over the last few years. The number of things they taught me, and all the times they helped me, even at three in the morning, sometimes in despair and with utterly unconventional requests, cannot be described here in a few words. A substantial part of my human and scientific formation is due to the close ties and countless conversations I maintain with these six rather singular individuals.
I begin by thanking three teachers I had at school who were a source of inspiration for choosing a scientific career: Andre Fontenele, José Carlos de Medeiros and José Alexandre Tostes Lopes (in memoriam). I made the most of the three years of contact with them, the conversations that ranged from Euclidean geometry to the Summa Theologica of Saint Thomas Aquinas, and all the lessons about why the world was much bigger than IME/ITA and why there were better options for what to do with one's life.
I thank Vinícius Gripp for showing me the secret society that is applied mathematics at UFRJ. If I had not felt sick at school one day and had not needed to go home, I would not have met Vinícius, all his enthusiasm, and everything that followed from there. I also thank the six trailblazers of applied mathematics: Enio Hayashi, Pedro Maia, Rafael March, Roberto Velho, Rodrigo Targino and Yuri Saporito. These senior students, who ended up becoming friends, are still out there clearing the dense forest, and I am very happy to benefit from much of the advice they offered.
To the many teachers I had over the last seven years. Every lesson, every line of reasoning, every way of doing things inside and outside the classroom showed me a world of interesting ideas that I will still use a great deal in my research and teaching activities. They are Adan Corcho, Alexander Arbieto, Alexandra Schmidt, Amit Bhaya, Átila Pantaleão, Bernardo da Costa, Bruno Scárdua, César Niche, Didier Pilod, Eduardo Barros Silva, Fábio Ramos, Felipe Acker, Flávio Dickstein, Heudson Mirandola, Luca Moriconi, Marco Aurélio Cabral, Marcus Venícius Pinto, Luiz Wagner Biscainho, Nilson Bernardes, Paulo Diniz, Sergio Lima Netto, Severino Collier, Umberto Hryniewicz, Wagner Coelho, Wallace Martins and Wladimir Neves. I also thank several other professors I met at the university, who taught me how not to do things, how not to work, how not to do research or teach classes, how not to treat students, and so on.
To the many friends and colleagues I made in applied mathematics and at UFRJ. Some became very close, others not so much, but all of them shared happy moments and helped me get through sad ones at the university. They also provided me with fantastic conversations and often showed me that I was wrong, however hard it was to concede the point. They are: Aloizio Macedo, Arthur Mitrano, Bruno Almeida, Bruno Oliveira, Bruno Santiago, Camilla Codeço, Carlos Lechner, Carlos Zanini, Cecília Mondaini, Claudio Heine, Daniel Soares, Danilo Barros, Davi Obata, Felipe Pagginelli, Felipe Senra, Flávio Ávila, Franklin Maia, Gabriel Castor, Gabriel Sanfins, Gabriel Vidal, Gabriela Lewenfus, Gabriella Radke, Ian Martins, Jhonatas Afradique, Leandro Machado, Leonardo Assumpção, Lucas Barata, Lucas Manoel, Lucas Stolerman, Marcelo Ribeiro, Marco Aurélio Soares, Milton Nogueira, Nicolau Sarquis, Paulo Cardozo, Pedro Fonini, Pedro Schmidt, Philipe de Fabritiis, Rafael Teixeira, Raphael Lourenço, Reinaldo de Melo e Souza, Rogério Lourenço, Thiago Degenring, Tiago Domingues, Tiago Vital, Washington Reis, Victor Coll and Victor Corcino.
To the top people, the truly real crew, who became great friends and whom I know I can count on for anything, anytime, anywhere. Some of them are far away, but I know that if I need them they will spare no effort to be near. Thank you Adriano Côrtes, Bruno Braga, Bruno Santiago, Cláudio Gustavo Lima, Cristiane Krug, Douglas Picciani, Guilherme Sales, Luís Felipe Velloso and Rafael Ribeiro for being who you are and for such a strong friendship.
To my childhood friends Hugo Portuita, Pedro Goñi and João Paulo Siqueira, who always seek me out, insist on bringing me into a world far from academia, and remind me that samba cannot die. Thank you for the affection and friendship throughout these many years. And forgive the disappearance; my absences are partially justified in the pages that follow.
To my friends of strange conversations, at any hour and on any subject, always seasoned with lots of laughter, shouting and intense arguments; some of them at the Batista at five in the afternoon, others on the crowded 485 bus at ten at night, making people look at us in secondhand embarrassment, whether we were talking about politics, religion, other people, or anything else; or even in the endless WhatsApp voice messages. I hope to keep arguing and talking with Adriano Côrtes, André Saraiva, Bruno Santiago, Bruno Braga, Camila Codeço, Carlos Lechner, Cristiane Krug, Daniel Soares, Douglas Picciani, Filipe Goulart, Flávio Ávila, Gabriel Vidal, Guilherme Sales, Hugo Carvalho, Luís Felipe Velloso, Nicolau Sarquis, Reinaldo de Melo e Souza, Tiago Domingues and Tiago Vital.
To my friends at SMT, the applied side of the force. Isabela Apolinário and Renam Castro helped me with computational questions in my courses and with understanding the language spoken by engineers, which sounded so strange to me. I am still a little scared of filter banks, but that will pass... Our team projects and seminars were fun, even though I often did not quite know what I was doing. To Flávio Ávila, great friend of seminars and of political-economic conversations, for the applied friendship; I hope to finally have time for our projects to get moving. My colleagues Hamed Yazdanpanah, Pedro Fonini and Rafael Zambrano, who worked with me on the CENPES project, helped me explore a world of reservoir engineering completely unknown to all of us. I swear I will try to stop bothering you at four in the morning so we can finish piles of reports and simulations. To professors Eduardo Barros Silva, Paulo Diniz and Sergio Lima Netto, who make up this team and offered me the opportunity to work on this project, helping me financially and intellectually. The way you face situations in a relaxed yet professional manner delights me and made every Wednesday afternoon funny and light, yet deep at the same time. To professors Flávio Dickstein and Paulo Goldfeld for kindly providing the space I have used recently for writing this dissertation and for the project work. It was very interesting to interact with them, even briefly, and to see them working with enthusiasm every day. And to Renan Vicente, for the laughter, the countless favors running IMEX simulations for me, and the daily company at LABMA.
To my friends, the Dingos da Montanha, and adventure companions Carolina Casals, Gabriel Vidal, Luis Felipe Velloso, Nicolau Sarquis, Petrônio Lima and Tiago Vital. The sunny days on the trail and the cold nights of crossings would be less funny and less intense if you were not there to share tent, food, clothes and the experiences we went through. I am waiting for you so we can continue the adventures across Europe!
To my friends of coffee, of the Batista, of the Anjinho, of lunch, etc., Bruno Scárdua, Fábio Ramos, Heudson Mirandola and Marcelo Tavares, who were always present cheering and sharing achievements, who provided me with great conversations and gave me vital advice on how not to increase entropy, how to resolve conflicts and how to get to yes with myself and with others. In particular to Bruno, master and friend for whom I have great admiration, who taught me the importance of tire-advertisement slogans and of the illustrious students of Munich, and who helped me at several fundamental points ever since the real analysis course.
To the members of the Compressive Sensing seminar: Amit Bhaya, Bernardo da Costa, César Niche, Hugo Carvalho, Iker Sobron, Luiz Wagner Biscainho, Marcello Campos and Wallace Martins. Your patience was infinite, listening to me talk about CS every Tuesday for more than two years. Thank you for the prolific discussions and for making me understand the subject better. Special thanks to Wallace Martins for the enthusiasm and for making this seminar a reality, unique in the history of UFRJ.
Now comes the part where we thank the members of the examining committee. I will do it alphabetically:
To Amit Bhaya, for the affection, the friendship, the erudition and the conversations that range from the Soviet Union to scientific ethics, passing through the control of physiological systems and Indian cuisine. Going to his office with a question, or sending him an email, and always receiving a flood of interesting references always made my weeks happier. To Bernardo da Costa, for all the friendship, help, support, guidance and enthusiasm, for all the mathematics and science I learned, for the hard questions about this dissertation (and beyond it!) and for the conversations that never end. I have the feeling that every time I meet him I get more and more lost about where to start talking, since the number of subjects between us is diverging. To Carlos Tomei, for the lunches full of funny stories and for the life advice. He helped me in a fundamental way at singular moments when I thought I could not keep going, and taught me that there is a nontrivial relation between the unconscious and tomato sauce. Moreover, the many sparse moments of conversation we had sparked an explosion of questions and ideas in my head. To Eduardo Silva, for making time to receive me in his office at any hour, always with great affection, to listen to me and teach me many things about life and about relationships with people. He was a great mentor, believing in me and giving me unique opportunities for growth, besides teaching me a great deal of signal processing and information theory. And he did all this while having 73472416 students, projects, defenses, seminars and meetings. I learned from him, more than from anyone else, the importance of hard work and discipline to make things work. To Fábio Ramos, for everything. For the courses that never ended and made us come to the Fundão campus on December 30th, for the daily advice involving every aspect of my life, for the number of books and subjects I discovered thanks to him, for the jokes and puns that made lunches a high point of the day, and for many other things hard to enumerate here. Moreover, his support and extreme confidence were worth more than any advising. To Roberto Imbuzeiro, for advice involving pure and applied mathematics, for the fundamental criticisms of this dissertation and for all the help in using the IMPA videoconference room; without it, my future might have taken another course. He was also responsible for an immense encouragement with respect to going to work in Germany. And to all of them, for promptly accepting to be part of my committee and for reading the work that follows with such care.
I cannot fail to mention the staff of the department office: Alan, Aníbal and Cristiano. The three were always exemplary in solving every bureaucratic problem I had and in regularizing my transcript, full of strange things. I thank "tia" Deise (in memoriam), who was at Fundão every day, in a good mood or not, and who, besides understanding how that whole Kafkaesque tale worked, solved with excellence everything that the applied mathematics students needed. I hope, like her, to work with strength and determination until my last day. And also Walter, the person who makes the Instituto de Matemática work and who, despite strikes, dogs, radioactive sewage on the Linha Vermelha, shootouts in Vila do João, water shortages, zombie attacks, etc., was always willing to help me with the problems that arose in room ABC-116 and to provide the best possible environment for the students to study.
Regarding the dissertation itself, I thank Yuri Saporito, who provided the template that served as a mold for this and countless other dissertations in applied mathematics; Luís Felipe Velloso, for the help and support with all the figures I have needed recently, not only in this dissertation but mainly in various presentations, and for the tricks with WinFIG and Xfig; Hugo Carvalho, for the ninja proofreading and countless improvements to the text; and Roberto Velho, for a certain magical Mathematica code. I really wish I had discovered it earlier...
To Natália Castro Telles, for the support and for teaching me to see life from several other angles, rethinking my positions and structures. The growth you provide me is very singular.
To the creators of and contributors to Wikipedia, for the invaluable consultations. To those who work hard to keep the servers in Kazakhstan, Niue, Ecuador, Western Samoa, Belize, the British Indian Ocean Territory and Russia running at full steam despite constant attacks, and who keep knowledge free of any gag. This dissertation would have been impossible without the countless efforts of these anonymous people spread around the world. I thank Donald Knuth, Leslie Lamport and all those who work hard to expand the incredible typesetting world that is LaTeX. To the creators of Detexify, a powerful tool that helped me on many occasions. Without all of you, this document would have been written in that editor-which-must-not-be-named, and things would not look so beautiful.
To Andrei Kolmogorov and Claude Shannon, two of the most gifted minds in the history of humanity, for the inspiration and for the ideas that gave rise to the beginnings of this work. Without them, it is no exaggeration to say that all of this (and many other things) would have taken another 500 years to appear.
To my late-night companions, for the many lessons they provided. Many of you were essential in recent moments when concentration of measure or crazy theorems of Russians about Banach spaces no longer made any sense and I was sick of looking at the computer. Alive or dead, I carry your writings as indelible marks. Thank you Antonio Damásio, Atul Gawande, Carl Sagan, Douglas Adams, Edward Wilson, Fernando Pessoa, Fiodor Dostoievsky, Franz Kafka, George Orwell, José Saramago, Liev Tolstoi, Luiz Felipe Pondé, Malcolm Gladwell, Vilayanur Ramachandran, Nelson Rodrigues, Oliver Sacks, Otto Maria Carpeaux, Robert Sapolsky, Valter Hugo Mãe and Wislawa Szymborska.
To professors Felix Krahmer and Holger Boche, for the opportunity I will have from now on and for everything we have discussed by email so far. I am being treated incredibly well, and I hope to repay that by working hard and not disappointing them.
To my family, for the support and the love. I deceived you: I never actually studied the psychology of annelids, as I often claimed at home. I thank my brother Gabriel, my aunts Marisa and Marilene and my cousin Flávio for everything. I thank my grandmother Maria da Glória for the conversations about the finiteness of life and for teaching me, among many other things, that in many situations experience is a comb we are given only once we have gone bald. And also Carlos and Irene, people who ended up becoming my second parents and who, I know, are present, with open hearts, for whatever I may need.
And finally, but not least... to my advisor César Niche, for everything. For an advising relationship that became a friendship I hope to keep forever, and for the care and concern. Also for the excellent courses, for the mathematics I learned, for the conversations about life inside and outside academia, for agreeing to accompany me on this journey over the last four years, for believing in me and in my often crazy projects, and for always giving me the freedom to move forward and study whatever I wanted. Despite my immense stubbornness and the many times I disagreed with his ideas, I am grateful for all of them and for everything I learned from him. This dissertation is a direct fruit of all this excellent interaction.
To my father Luiz Cláudio, for the enthusiasm for science, for teaching me the power of communication and for always wanting more from my efforts. To my mother Marília, for teaching me the power and the responsibility of one's own choices, of reading and of relentless searching, for everything we faced together, day after day, always with much conversation and argument, and for always telling me that this would work out. To both, for the love and affection, for never sparing any effort to give me the best education within their reach, and for teaching me that, in life, we should choose to do what makes us happy and what makes us never look at the clock.
To Ivani Ivanova, supreme reviewer of this text (who in other circumstances would be listed as a coauthor...), who read every detail, every proof, every argument, and responded with a complete treatise on how to improve them. If the text still contains numerous errors (and I know it does, because just today I found about three), it is because I did not go through her revisions with sufficient patience and care, and the fault is entirely mine. She is not only the reviewer of this text; she has made my whole life go through a complete revision ever since I started going around talking about cardinal numbers. All the gratitude for her company, love, affection and dedication, throughout the entire roller coaster, cannot be described here.
Claudio Mayrink Verdun, December 2016
“All that you touch
All that you see
All that you taste
All you feel
All that you love
All that you hate
All you distrust
All you save
All that you give
All that you deal
All that you buy, beg, borrow or steal
All you create
All you destroy
All that you do
All that you say
All that you eat
And everyone you meet
All that you slight
And everyone you fight
All that is now
All that is gone
All that's to come
and everything under the sun is in tune
but the sun is eclipsed by the moon.

"There is no dark side of the moon, really. Matter of fact, it's all dark."

Eclipse, Roger Waters - Pink Floyd
Resumo
Compressive Sensing
Claudio Mayrink Verdun
Abstract of the M.Sc. dissertation submitted to the Graduate Program in Applied Mathematics, Instituto de Matemática of the Universidade Federal do Rio de Janeiro (UFRJ), as part of the requirements for the degree of Master in Applied Mathematics.
Abstract: We live in a digital world. The devices around us interpret and process bits at every instant. For this to be possible, the conversion of analog signals to the discrete domain is necessary. It takes place through the processes of sampling, quantization and coding. Through these steps, it is possible to process and store signals in digital devices. The classical paradigm of Digital Signal Processing tells us that, to acquire a signal, we must perform a sampling step followed by a compression step, by means of quantization and coding. However, carried out this way, the process can be costly or wasteful, due to a high sampling rate or a large loss of data in the compression stage. From this observation, the following question arises:
Is it possible to perform the acquisition and compression processes simultaneously, so as to obtain the minimum information necessary for the reconstruction of a given signal?
The seminal works of David Donoho, Emmanuel Candès, Justin Romberg and Terence Tao show that the answer to this question is affirmative in many situations involving natural or man-made signals. The present work studies the area called Compressive Sensing, a term without a Portuguese translation that refers to the positive answer to this question, that is, to the procedure of performing compressive data acquisition. This dissertation is a hitchhiker's guide to this rapidly growing field. The theorems that underpin it are presented in detail and with rigor. From the mathematical point of view, the field is an intersection of Optimization, Probability, the Geometry of Banach Spaces, Harmonic Analysis and Numerical Linear Algebra. Techniques from Frame Theory, Random Matrices and optimal approximation in Banach spaces are discussed here in order to demonstrate the viability of this new paradigm in Signal Processing.
Keywords. Compressive Sensing, Compressive Sampling, Compressed Sensing, Sparsity, Sparse Representation, Redundancy, Frame Theory, Applied Harmonic Analysis, Random Matrices, Uncertainty Principle, Signal Processing.
Rio de Janeiro, December 2016
Abstract
Compressive Sensing
Claudio Mayrink Verdun
Abstract of the M.Sc. dissertation submitted to the Graduate Program in Applied Mathematics, Instituto de Matemática of the Universidade Federal do Rio de Janeiro (UFRJ), as part of the requirements for the degree of Master in Applied Mathematics.
Abstract: We live in a digital world. The devices around us interpret and process bits all the time. For this to be possible, a conversion of analog signals into the discrete domain is necessary. It happens through the processes of sampling, quantization and coding. Through such steps, it is viable to process and store such signals in digital devices. The classical paradigm of Digital Signal Processing for signal acquisition is a sampling step followed by a compression step, through quantization and coding. However, done this way, the process can be costly or wasteful, due to a high sampling rate or a large loss of data in the compression stage. From this observation, the following question is posed:
Is it possible to perform the acquisition and compression process simultaneously in order to obtain the minimum information for the reconstruction of a given signal?
The seminal works of David Donoho, Emmanuel Candès, Justin Romberg and Terence Tao show that the answer to this question is affirmative in many circumstances involving natural or man-made signals. The present work aims to study the area called Compressive Sensing, which is a procedure to perform data acquisition and compression at the same time. This dissertation is a hitchhiker's guide to this rapidly growing field. The fundamental theorems are explored in detail and with rigor. From the mathematical point of view, the field is an intersection of Optimization, Probability, the Geometry of Banach Spaces, Harmonic Analysis and Numerical Linear Algebra. Here we discuss techniques from Frame Theory, Random Matrices and approximation in Banach spaces to demonstrate the feasibility of this new paradigm in Signal Processing.
Keywords. Compressive Sensing, Compressive Sampling, Compressed Sensing, Sparsity, Sparse Representation, Redundancy, Frame Theory, Applied Harmonic Analysis, Random Matrices, Uncertainty Principle, Signal Processing.
Rio de Janeiro, December 2016

Contents

1 Sparse Solutions of Linear Systems
1.1 The What, Why and How of Compressive Sensing
1.2 A Little (Pre)History of Compressive Sensing
1.3 Sampling Theory
1.4 Sparse and Compressible Vectors
1.5 How Many Measurements Are Necessary?
1.6 Computational Complexity of Sparse Recovery
1.7 Some Definitions

2 Some Algorithms and Ideas
2.1 Introduction
2.2 Basis Pursuit
2.3 Greedy Algorithms
2.4 Thresholding Algorithms
2.5 The Search For the Perfect Algorithm

3 The Null Space Property
3.1 Introduction
3.2 The Null Space Property
3.3 Recovery via Nonconvex Minimization
3.4 Stable Measurements
3.5 Robust Measurements
3.6 Low-Rank Matrix Recovery
3.7 Complexity Issues of the NSP

4 The Coherence Property
4.1 Introduction
4.2 Uncertainty Principle
4.3 A Case Study: Two Orthogonal Bases
4.4 Uniqueness Analysis for the General Case
4.5 Coherence for General Matrices
4.6 Properties and Generalizations of Coherence
4.7 Constructing Matrices with Small Coherence
4.8 A Glimpse of Frame Theory
4.8.1 History and Motivation
4.8.2 An Example
4.8.3 Definition and Basic Facts
4.8.4 Constraints for Equiangular Tight Frames
4.9 Analysis of Algorithms
4.10 The Quadratic Bottleneck

5 Restricted Isometry Property
5.1 Introduction
5.2 The RIP Constant and its Properties
5.3 The Quadratic Bottleneck Reloaded
5.4 Breaking the Bottleneck
5.5 Analysis of Algorithms
5.6 RIP Limitations

6 Interlude: Non-asymptotic Probability
6.1 Introduction
6.2 Subgaussian and Subexponential Random Variables
6.3 Nonasymptotic Inequalities
6.4 Comparison of Gaussian Processes
6.5 Concentration of Measure
6.6 Covering and Packing Numbers

7 Matrices Which Satisfy The RIP
7.1 Introduction
7.2 RIP for Subgaussian Ensembles
7.3 Universal Recovery
7.4 The Curious Case of Gaussian Matrices
7.5 Johnson-Lindenstrauss Embeddings and the RIP

8 Optimality in the Number of Measurements
8.1 Introduction
8.2 n-Widths in Approximation Theory
8.3 Compressive Widths
8.4 Optimal Number of Measurements
8.5 The Theorem of Kashin, Garnaev and Gluskin
8.6 Connections Between Widths
8.7 Instance Optimality

Bibliography

Preface
Measure what can be measured, and make measurable what cannot be measured. Attributed to Galileo Galilei
Measure what should be measured. Compressive Sensing version by Thomas Strohmer
In this dissertation we will explore one of the fascinating connections that emerged over the last 20 years among the communities of Applied Mathematics, Electrical Engineering and Statistics. It deals with several relevant ideas that have arisen about data acquisition and how we perform measurements. We live in an era where data and information are spread everywhere, and it is a major challenge to obtain them in the most economical way. Two studies sponsored by the EMC Corporation, [Gantz & Reinsel ’10] and [Gantz & Reinsel ’12], measured all digital data created, replicated and consumed in a single year, and also projected the size of the digital universe for the end of the decade. They argued that by the end of 2020 there would be 40000 exabytes, or 40 trillion gigabytes, of data in the world. They say: “From now (2012) until 2020, the digital universe will about double every two years”. Data acquisition can also be expensive in terms of time, money or (in the case of, say, Computed Tomography) damage done to the object from which information is being acquired. Based on these facts, more than ever, we need efficient ways to represent, store and manipulate data. The question of how to sample and compress datasets is of major importance. In the last twenty years, many researchers asked whether there are classes of signals that can be encoded using just a little information about them, without much perceptual loss. Simultaneously, they realized that many signals seem to have few degrees of freedom, despite the fact that they “live”, in principle, in high-dimensional spaces. Knowing this, one can potentially design efficient sampling and storage protocols that take the useful information content embedded in a signal and condense it into a small amount of data. For example, [Candès, Romberg & Tao I ’06] performed an impressive numerical experiment and showed that a 512 × 512 pixel test image, known in the Image Processing community as the Logan-Shepp phantom, can be reconstructed from 512 Fourier coefficients sampled along 22 radial lines. In other words, more than 95% of the relevant data is missing, yet a very efficient compression and reconstruction scheme was found. In the same year, [Donoho ’06] realized that these compressed data acquisition protocols are, in his own words, “nonadaptive and parallelizable”: they do not require any prior knowledge of the data other than the fact that it has a parsimonious representation with few coefficients, and the measurements can all be performed at the same time. He called this new simultaneous acquisition and compression process Compressed Sensing. These two fundamental papers were followed by the works [Candès & Tao I ’06], [Candès & Tao II ’06] and [Candès, Romberg & Tao II ’06], which further developed the results and disseminated them in the communities of Statistics, Signal Processing and Mathematics. Since these five founding papers were published, ten years ago, there has been an explosion in the theory and applications of Compressive Sensing. It is therefore safe to say that the ideas which emerged from the notions of sparsity, incoherence and randomness, and which now permeate the world of data science, will not be transient.
In Brazil, Compressive Sensing first appeared in 2009, in a course at the 27th Brazilian Mathematical Colloquium based on the book [Schulz, da Silva & Velho ’09]. I attended the classes and, despite being in my first year of college, the impression recorded in my mind was that it would be interesting to put together Mathematics, Signal Processing, and computational and statistical skills to try to solve a real-life problem. After spending two years studying Fourier Analysis and Inverse Problems, and also working on mathematical problems from medical imaging, I decided that I would dedicate myself to this subject and its new developments. Now, a little over ten years after the articles that founded the theory were published, there are some excellent books partially or entirely dedicated to it. See [Elad ’10], [Damelin & Miller ’12], [Eldar & Kutyniok ’12], [Rish & Grabarnik ’14] and [Eldar ’15]. However, the most comprehensive book on the subject to date is [Rauhut & Foucart ’13]. It is the main reference of this dissertation, my companion in the last three years, and most of the main ideas and proofs were extracted from it. Nevertheless, the influence of the other five books is also present in several passages of this text. There are some excellent reviews and surveys about Compressive Sensing. See [Candès ’06], [Holtz ’08], [Baraniuk ’07], [Romberg ’08], [Strohmer ’12], [Candès & Wakin ’08] and [Jacques & Vandergheynst ’11]. In addition, the success of this theory has been described in four issues of important journals entirely devoted to the subject. The articles in these four issues served as a source of inspiration and clarification on some obscure and hard points of the theory:
• IEEE Signal Processing Magazine, Vol. 25, No. 2, Mar 2008.
• IEEE Journal on Selected Topics in Signal Processing, Vol. 4, No. 2, Apr 2010.
• Proceedings of the IEEE, Vol. 98, No. 6, Jun 2010.
• IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Vol. 2, No. 3, Sep 2010.

Unfortunately, I did not solve a real problem in the last three years, as was my initial desire. Instead, I studied Compressive Sensing theory carefully. It is essential to say that in each chapter I have outlined important open problems connected to Compressive Sensing. The purpose of this dissertation is to present Compressive Sensing as a new paradigm in Signal Processing and to develop the main techniques necessary to understand it completely from the mathematical viewpoint. I hope the next three years will be enough to solve or, at least, to understand a real problem using the knowledge acquired from the study of these techniques.

Chapter 1
Sparse Solutions of Linear Systems
Is there anything of which one might say, “See this, it is new”? Already it has existed for ages Which were before us. Ecclesiastes 1:10
1.1 The What, Why and How of Compressive Sensing
In the last century, we have witnessed the deployment of a wide variety of sensors that acquire measurements to represent the physical world. Microphones, telescopes, anemometers, gyroscopes, transducers, radars, thermometers, GPS receivers, seismometers, microscopes, hydrometers, etc., are everywhere. The purpose of all these instruments is to directly acquire measurements in order to capture the meaning of the world around us. Besides, a world without digital photos, videos and sounds being stored or shared is unimaginable nowadays. Advances in technology have allowed us to obtain more and more data in a short time. To deduce the state or structure of a system from this data is a fundamental task throughout the sciences and engineering. The first step in acquiring the data is sampling. We need to sample the signals obtained from the real world in order to convert them from the analog domain to a sequence of bits in the digital domain. The second step is compression, which involves encoding information using fewer bits than the original representation. Both steps are necessary to store, manipulate, process, transmit and interpret data. However, in many situations it is difficult, costly or even slow to obtain a massive amount of data, as in medical image acquisition. Moreover, nowadays, in many analog-to-digital (A/D) converters, after the signal is sampled, significant computations are expended on lossy compression algorithms. A representative example of this paradigm is the digital camera of any smartphone. It acquires millions of measurements each time a picture is taken, and most of the data is discarded after the acquisition through the application of an image compression algorithm. Therefore, when the signal is decoded, typically only an approximation of the original one is produced. The main question that motivates this text is
Can both operations, sampling and compression, be combined?
That is, instead of high-rate sampling followed by computationally expensive compression, is it possible to reduce the number of samples and produce a compressed signal directly? Classical sampling theory does not exploit the structure or any prior knowledge of the signal being sampled: all we need to know is its frequency content, which determines the sampling rate. The objective of this dissertation is the study of Compressive Sensing, also known as Compressed Sensing or Compressive Sampling. It takes its name from the fact that data acquisition and compression can be performed simultaneously, in such a way that the reconstruction algorithms exploit the structure of the data. This technique is, nowadays, an essential underpinning of data analysis. There are two basic hypotheses behind the theory: sparsity and incoherence. The first is related to the objects we are acquiring, i.e., we will use few measurements in order to capture signals by exploiting
their parsimonious representation in some basis. The second is related to the measurement system. It extends the duality between time and frequency, and the idea behind it is that an object with a sparse representation in a given basis must be spread out in any other basis, such as the one representing the acquisition domain. We will also see that randomness plays a key role in the design of optimal incoherent systems; thus, it can be considered a third ingredient of the theory. From a mathematical viewpoint, Compressive Sensing can be seen as a chapter of contemporary Linear Algebra, since it deals with the pursuit of sparse solutions of underdetermined linear systems. Nevertheless, the mathematical techniques behind it are very sophisticated, going from Probability Theory, passing through Harmonic Analysis and Optimization, and arriving at the Geometry of Banach Spaces. From the applied viewpoint, it is a fecund intersection of many areas of Electrical Engineering, Computer Science, Statistics and Physics. Through the use of Compressive Sensing, we can provide not only theoretical guarantees for the minimal number of measurements but also efficient algorithms for practical reconstruction. And all this is based on a simple principle of parsimony, enunciated many times but popularized by the medieval philosopher William of Ockham: “Pluralitas non est ponenda sine necessitate”, i.e., entities should not be multiplied unnecessarily. It is interesting to note that Compressive Sensing ideas occur spontaneously in nature, due to natural selection: the mammalian visual system has been shown to operate using sparse and redundant representations of sensory input. Some researchers, starting from the works of Barlow (great-grandson of Charles Darwin), Hubel and Wiesel, argued that a dramatic reduction occurs from the information that hits the retina to the information present in the visual cortex. Typically, a neuron in the retina responds simply to whatever contrast is present at that point in space, whereas a neuron in the cortex responds only to a specific spatial configuration of light intensities. This leads to an optimal representation of the visual information in the brain and is known nowadays in Neuroscience as Sparse Coding. See [Olshausen & Field ’04] and [Huang & Rao ’11]. The first application of Compressive Sensing was in pediatric magnetic resonance imaging (MRI). See [Vasanawala, Alley, Hargreaves, Barth, Pauly & Lustig ’10] and the blog post [Ellenberg - 02/22/10]. It enables a sevenfold speedup while preserving diagnostic quality and resolution. In this imaging modality, the traditional approach to producing high-resolution images relies on the Shannon Sampling Theorem (Theorem 1.2, proved below) and may take several minutes. A cardiac patient cannot hold their breath for too long, and the artificial induction of cardiac arrest can be dangerous and irreversible, depending on the health of the patient; thus, obtaining images without blurring is a difficult task. See [Lustig, Donoho & Pauly ’07] for the problem statement and the use of Compressive Sensing. After the release of these preprints, Compressive Sensing appeared as a noticeable improvement, gaining the attention of the media and of several research groups interested in improving the performance of signal reconstruction.
Nowadays, besides medical imaging, it has led to important contributions in Radar Technology, Error-Correcting Codes, Recommendation Systems, Wireless Communications, Learning Algorithms, DNA Microarrays, etc. See Section 1.2 of [Rauhut & Foucart ’13] and the references therein for these and more examples. In mathematical terms, the problem of simultaneously acquiring and compressing the signal of interest $x \in \mathbb{C}^N$ is modeled by a (fat and short) measurement matrix, converting $x$ to $y = Ax$. Afterwards, we need to reconstruct it from the underdetermined linear system
$$Ax = y,$$
where the fat and short measurement matrix $A \in \mathbb{C}^{m \times N}$ models the information acquisition process and $y \in \mathbb{C}^m$ is the observed data. (For the name "fat and short" and other ideas, one should consult the blog by Dustin Mixon: https://dustingmixon.wordpress.com/.) Since we are in a compression regime, we expect (or impose) that $m \ll N$. At first sight, this system has an infinite number of solutions. But the role played by sparsity will allow us to reduce it to a unique solution and, despite a seeming lack of sufficient information in the acquisition, we will see that recovery algorithms perform remarkably well. Therefore we must start by understanding what sparsity is and how and when it leads to unique solutions of underdetermined linear systems. It is important to note that linear measurements are, in principle, not necessary, and we could develop a theory for non-linear measurements. Nevertheless, linear sensing is widely used and simpler than non-linear sensing, and we study this framework throughout the text. In this dissertation we begin with basic definitions of sparse and compressible vectors. Then we explore the algorithms that efficiently solve the problem of recovering sparse vectors. They are divided into three main families: optimization algorithms, greedy algorithms and thresholding algorithms. After this, we discuss the three most important properties in the study of underdetermined linear systems: the null space property, the coherence property and the restricted isometry property. Later, we proceed to the study of non-asymptotic probability techniques, especially the concentration of measure phenomenon, and how to use them to design efficient measurement matrices with an optimal number of measurements. Finally, the Geometry of Banach Spaces will play a special role in understanding the minimal amount of information needed to recover sparse signals, and it will appear as the main tool to provide sharper and more general results for even more general reconstruction methods than those presented in the previous chapters. The flowchart of ideas developed through this dissertation can be seen in Figure 1.1 below.
Figure 1.1: Flowchart of the dissertation
The core of this diagram is the “equivalence” between Basis Pursuit, an algorithm from convex optimization, and sparse recovery, a combinatorial problem. We will see that the Null Space Property (NSP) plays a fundamental role in this equivalence. More specifically, Basis Pursuit allows sparse recovery if and only if the measurement matrix satisfies this property, as will be seen in Chapter 3. Since it is very hard to establish this property directly, we will replace it with two different sufficient conditions: the Coherence Property and the Restricted Isometry Property (RIP). This will be done in Chapters 4 and 5, respectively. The second leads to better results in the number of measurements through the use of probabilistic techniques, such as concentration of measure. Therefore, in Chapters 6 and 7 we will explore probabilistic tools and how to use them in the Compressive Sensing framework. Finally, in Chapter 8 we introduce some concepts from the Geometry of Banach Spaces in order to establish the optimality of Basis Pursuit and of the other sparse recovery methods discussed throughout this dissertation.
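To make the core of this flowchart concrete before the formal development, here is a minimal numerical sketch (an illustration added to this text, not part of the original; it assumes NumPy and SciPy are available, and all variable names are hypothetical): an $s$-sparse vector is measured by a Gaussian matrix with $m \ll N$, and Basis Pursuit, $\min \|z\|_1$ subject to $Az = y$, is solved as a linear program.

# Illustrative sketch only: sparse recovery via Basis Pursuit (l1-minimization).
# For real signals, write z = u - v with u, v >= 0, so that ||z||_1 = sum(u + v)
# and "min ||z||_1 subject to A z = y" becomes a standard linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, m, s = 128, 40, 5                 # ambient dimension, measurements, sparsity

x = np.zeros(N)                      # s-sparse ground truth
support = rng.choice(N, size=s, replace=False)
x[support] = rng.standard_normal(s)

A = rng.standard_normal((m, N)) / np.sqrt(m)   # Gaussian measurement matrix
y = A @ x                                      # compressive measurements

c = np.ones(2 * N)                   # objective: sum(u) + sum(v) = ||z||_1
A_eq = np.hstack([A, -A])            # equality constraint: A(u - v) = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * N))
x_hat = res.x[:N] - res.x[N:]

print("recovery error:", np.linalg.norm(x_hat - x))   # near machine precision in this regime

The variable split $z = u - v$ is the standard way of posing $\ell_1$-minimization as a linear program; Basis Pursuit and its alternatives are discussed in detail in Chapter 2.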
1.2 A Little (Pre)History of Compressive Sensing
Compressive Sensing has a long history. The first ideas of sparse estimation can be traced back to Gaspard Riche de Prony, a French mathematician and engineer. In 1795, he proposed an algorithm to estimate the parameters associated with a small number of complex exponentials sampled in the presence of noise. See Theorem 2.15 of [Rauhut & Foucart ’13] and Section 15.2 of [Eldar ’15]. Later, in the first half of the 20th century, the works of Constantin Caratheodory, Arne Beurling and Benjamin Logan in Harmonic Analysis showed that it was possible to recover certain sinusoids and pieces of the Fourier transform by exploiting the notion of minimal extrapolation, which nowadays we recognize as $\ell_1$-minimization. See [Donoho ’10]. This will be further discussed in Chapter 2. During World War II, the technique of combinatorial group testing was introduced. It was a time when many people were infected with syphilis, and the problem was to discover who was infected and who was not by using efficient testing schemes, considering that resources were scarce. The US Public Health Service did not want ill men serving in the military, but it was not possible to test all individuals, hundreds of thousands of people, in order to identify the small proportion of infected ones. So, instead of testing single individuals, blood samples were pooled in an efficient way to reveal whether some person in the pool had the disease. The task was to identify all individuals with syphilis while minimizing the number of tests, and again the parsimony principle applies. See [Dorfman ’43] and [Gilbert, Iwen & Strauss ’08]. In the late seventies and early eighties, researchers from the Geology and Geophysics communities showed that the layered structure of the earth's interior could be exploited to increase the accuracy of seismic signal recovery. [Taylor, Banks & McCoy ’79], [Levy & Fullagar ’81] and [Walker & Ulrych ’83] showed that very incomplete measurements can be used to recover a full wideband seismic signal, despite the fact that no low-frequency content can be acquired, due to the nature of the seismic measurements. Also, the first paper to explicitly state the use of the $\ell_1$-norm for signal reconstruction was written by people working on Geophysical Inverse Problems: [Santosa & Symes ’86]. After these contributions, the Magnetic Resonance Spectroscopy and Radio Astronomy communities started to use sparsity concepts in signal recovery. In particular, methods exploiting this parsimonious representation of data were faster and more efficient than the classical method of Maximum Entropy Inversion [Donoho, Johnstone, Stern & Hoch ’92]. At essentially the same time, the “Wavelets explosion” appeared through the works of Daubechies, Meyer, Mallat, Coifman, Wickerhauser and others. From their ideas, it became typical to describe effective media representations, such as images and video, using Wavelets. Afterwards, Wavelets became part of the standard techniques in Signal Processing and of standard technologies such as fingerprint databases. See [Daubechies ’92] and [Burrus, Gopinath & Guo ’97]. The first works to show the general algorithmic principles used in Compressive Sensing were [Mallat & Zhang ’93] and [Chen & Donoho ’94]. In these papers, a systematic study was performed in order to develop a general theory of sparse representations of signals and to understand how this can be done in an efficient way.
Staying with the paper [Chen & Donoho ’94], it is important to point out that a large part of the history and evolution of ideas before Compressive Sensing, in the nineties, can be understood from the point of view of the works of David Donoho and his collaborators. Donoho, a very prolific mathematician, paved the way to the founding papers of Compressive Sensing. It starts with the seminal paper [Donoho & Stark ’89], where important ideas about the relation between the uncertainty principle and the acquisition of information from a signal were presented. After this, many of his works contained fundamental ideas in Statistics and Signal Processing that culminated in Compressive Sensing. One important example is [Donoho & Huo ’01]. The goal of this paper was to establish connections between convex optimization problems and optimal parsimonious representations of signals using very few coefficients, what the authors called an Ideal Atomic Decomposition. Parallel to the advances in Signal Processing, similar ideas emerged in the Statistics community. The development of the LASSO by [Tibshirani ’96], a linear regression that uses the $\ell_1$-norm as a regularizer, started a new era in Statistics and Machine Learning in which many previously intractable problems could be solved.
Finally, the foundation of Compressive Sensing can be credited to [Donoho ’06], [Candès & Tao I ’06], [Candès & Tao II ’06], [Candès, Romberg & Tao I ’06] and [Candès, Romberg & Tao II ’06]. They combined deep techniques from Probability Theory, Approximation Theory and the Geometry of Banach Spaces in order to achieve a theoretical breakthrough: they showed how to exponentially decrease the number of measurements needed for an accurate and computationally efficient recovery of signals, provided that they are sparse. These five works revolutionized the way we think about Signal Processing. According to Google Scholar, together they now have more than 52000 citations. Subsequently, Compressive Sensing turned into a rapidly growing field. Since 2006, many sharper theorems, more efficient algorithms, dedicated hardware and generalizations of the theory have emerged. It will probably take a few more years for all contributions to be properly organized, so that we can tell the story of the new ideas, developments and breakthroughs of the field in the last twenty years. Before we start, we need to come back to the fundamentals of sampling, especially the classical Sampling Theorem.
1.3 Sampling Theory
Sampling lies at the core of Signal Processing and is the first step of analog-to-digital conversion. It is the gate between the continuous and the discrete world. Two other important steps are quantization and coding. The former is the reduction of the sample values from their continuous range to a discrete set. The latter generates a digital bitstream as the final representation of the continuous-time signal. This dissertation is about sampling. We start by looking at the celebrated Shannon-Nyquist-Kotelnikov-Whittaker Theorem (in the Approximation Theory literature, it is known as the cardinal theorem of interpolation; see [Higgins ’85]). First, we need a definition.
Definition 1.1. The Paley-Wiener space $PW_\Omega(\mathbb{R})$ is the subspace of $L^2(\mathbb{R})$ consisting of all functions with Fourier transforms supported on finite intervals symmetric around the origin. More precisely,
$$PW_\Omega(\mathbb{R}) = \left\{ f \in L^2(\mathbb{R}) : \hat{f}(\xi) = 0 \text{ for } |\xi| \geq \Omega \right\}.$$
As [Mishali & Eldar ’11] points out, “The band-limited signal model is a natural choice to describe physical properties that are encountered in many applications. For example, a physical communication medium often dictates the maximal frequency that can be reliably transferred. Thus, material, length, dielectric properties, shielding and other electrical parameters define the maximal frequency Ω. Often, bandlimitedness is enforced by a lowpass filter with cutoff Ω, whose purpose is to reject thermal noise beyond frequencies of interest”.
Theorem 1.2 (Theorem 13 of [Shannon ’48]). Let $f(t) \in PW_\Omega(\mathbb{R})$. Then $f(t)$ is completely determined by its samples at the points $\{t_n = n\pi/\Omega\}_{n \in \mathbb{Z}}$. Moreover, the following reconstruction formula holds:
$$f(t) = \sum_{n \in \mathbb{Z}} f\left(\frac{n\pi}{\Omega}\right) \frac{\sin(\Omega t - n\pi)}{\Omega t - n\pi},$$
where the series converges in the $L^2$ sense and uniformly.
Remark 1. Here we provide a proof of the Sampling Theorem because it is difficult to find a rigorous one in the literature. For a proof in an abstract framework, with more general interpolation functions than $\frac{\sin t}{t}$, see Theorem 2.7 of [Güntürk ’00]. Also, for an engineering interpretation of Theorem 1.2 as “sampling by modulation”, see Section 4.2.2 of [Eldar ’15].
Proof. Since $f(t) \in L^2(\mathbb{R})$, we have $\hat{f}(\xi) \in L^2(\mathbb{R})$. Also, as a consequence of the Paley-Wiener Theorem (see Section 4.3 of [Stein & Shakarchi ’05]), $f(t)$ is the restriction to the real line of an analytic function. We can write the Fourier series of $\hat{f}(\xi)$ on the interval $[-\Omega, \Omega]$, since the complex exponentials $\{e^{in\pi\xi/\Omega}\}_{n \in \mathbb{Z}}$ form an orthogonal basis of $L^2(-\Omega, \Omega)$. In particular, the coefficients are given by
$$c_n = \frac{1}{2\Omega} \int_{-\Omega}^{\Omega} \hat{f}(\xi) e^{-in\pi\xi/\Omega} \, d\xi.$$
By hypothesis, $\hat{f}(\xi) = 0$ for all $|\xi| \geq \Omega$. This yields, for all $n \in \mathbb{Z}$,
$$c_{-n} = \frac{1}{2\Omega} \int_{-\Omega}^{\Omega} \hat{f}(\xi) e^{in\pi\xi/\Omega} \, d\xi = \frac{1}{2\Omega} \int_{-\infty}^{\infty} \hat{f}(\xi) e^{in\pi\xi/\Omega} \, d\xi = \frac{\pi}{\Omega} \cdot \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\xi) e^{i\xi \frac{n\pi}{\Omega}} \, d\xi = \frac{\pi}{\Omega} f\left(\frac{n\pi}{\Omega}\right),$$
where we used the inversion formula for the Fourier transform in the last equality. Thus, we obtain
$$\hat{f}(\xi) = \frac{\pi}{\Omega} \sum_{n=-\infty}^{\infty} f\left(\frac{n\pi}{\Omega}\right) e^{-i\xi \frac{n\pi}{\Omega}} = \lim_{N \to \infty} \frac{\pi}{\Omega} \sum_{n=-N}^{N} f\left(\frac{n\pi}{\Omega}\right) e^{-i\xi \frac{n\pi}{\Omega}}. \tag{1.1}$$
The convergence above is in the $L^2(-\Omega, \Omega)$ sense. We also have
$$\|\hat{f}\|_1 = \int_{-\Omega}^{\Omega} |\hat{f}(\xi)| \, d\xi = \int_{-\Omega}^{\Omega} 1 \cdot |\hat{f}(\xi)| \, d\xi \leq \|1\|_2 \, \|\hat{f}\|_2 = \sqrt{2\Omega} \, \|\hat{f}\|_2.$$
This leads to the fact that $\hat{f}(\xi) \in L^1(\mathbb{R})$ and that convergence in $L^2(\mathbb{R})$ implies convergence in $L^1(\mathbb{R})$. Using that $\hat{f}(\xi) = 0$ for $|\xi| \geq \Omega$, we can multiply Equation (1.1) on both sides by $\chi_{[-\Omega,\Omega]}(\xi)$, the characteristic function of the interval $[-\Omega, \Omega]$. This yields
$$\left\| \hat{f}(\xi) - \frac{\pi}{\Omega} \sum_{n=-N}^{N} f\left(\frac{n\pi}{\Omega}\right) e^{-i\xi \frac{n\pi}{\Omega}} \chi_{[-\Omega,\Omega]}(\xi) \right\|_p \xrightarrow{N \to \infty} 0$$
for $p = 1$ and $p = 2$. We know, by Parseval's Theorem, that the inverse Fourier transform is a bounded continuous operator $\mathcal{F}^{-1} : L^2(\mathbb{R}) \to L^2(\mathbb{R})$. Also, from the Riemann-Lebesgue Theorem, it is a bounded continuous operator from $L^1(\mathbb{R})$ to $C_0(\mathbb{R})$, the space of continuous functions vanishing at infinity equipped with the supremum norm. Therefore we conclude
$$f = \mathcal{F}^{-1}(\hat{f}) = \lim_{N \to \infty} \mathcal{F}^{-1}\left( \frac{\pi}{\Omega} \sum_{n=-N}^{N} f\left(\frac{n\pi}{\Omega}\right) e^{-i\xi \frac{n\pi}{\Omega}} \chi_{[-\Omega,\Omega]}(\xi) \right) = \lim_{N \to \infty} \frac{\pi}{\Omega} \sum_{n=-N}^{N} f\left(\frac{n\pi}{\Omega}\right) \mathcal{F}^{-1}\left( e^{-i\xi \frac{n\pi}{\Omega}} \chi_{[-\Omega,\Omega]}(\xi) \right),$$
where the convergence occurs both in the $L^2(\mathbb{R})$ norm and in $C_0(\mathbb{R})$. To finish, note that
$$\mathcal{F}^{-1}\left( e^{-i\xi \frac{n\pi}{\Omega}} \chi_{[-\Omega,\Omega]}(\xi) \right)(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\xi \frac{n\pi}{\Omega}} \chi_{[-\Omega,\Omega]}(\xi) e^{i\xi t} \, d\xi = \frac{1}{2\pi} \int_{-\Omega}^{\Omega} e^{i\xi\left(t - \frac{n\pi}{\Omega}\right)} \, d\xi = \frac{\Omega}{\pi} \cdot \frac{\sin(\Omega t - n\pi)}{\Omega t - n\pi}.$$
This is a modern translation of the theorem. We can compare it with Shannon's original words: “If a function f(t) contains no frequencies higher than Ω cycles-per-second, it is completely determined by giving its ordinates at a series of points spaced 1/2Ω seconds apart”. He also wrote: “it is a fact which is common knowledge in the communication art... but in spite of its evident importance it seems not to have appeared explicitly in the literature of communication theory”. Many signals of interest are modeled as band-limited functions, i.e., real-valued functions on the real line with compactly supported Fourier transforms. Theorem 1.2 tells us that it is possible to reconstruct a band-limited function from uniform samples: a countable amount of information is sufficient to reconstruct a (band-limited) function on the continuum. The heuristic behind it is that band-limited functions have limited time variation and can therefore be perfectly reconstructed from equally spaced samples taken at a rate of at least Ω/π, i.e., twice the maximum frequency; this is called the Nyquist rate.
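As a numerical companion to Theorem 1.2 (an illustration added to this text, not part of the original; it assumes NumPy), the sketch below samples a band-limited $L^2$ function at the Nyquist spacing $\pi/\Omega$ and rebuilds it with a truncated cardinal series. NumPy's np.sinc is the normalized sinc $\sin(\pi u)/(\pi u)$, so $\sin(\Omega t - n\pi)/(\Omega t - n\pi)$ equals np.sinc(Omega*t/np.pi - n).

# Illustrative sketch only: reconstruction from uniform samples via the sinc series.
import numpy as np

Omega = 8.0                                  # band limit, in rad/s
# A band-limited L^2 test function: np.sinc(a*t) has Fourier support [-a*pi, a*pi].
f = lambda t: np.sinc(2.0 * t) + 0.5 * np.sinc(0.5 * (t - 1.0))

n = np.arange(-200, 201)                     # truncation of the infinite series
samples = f(n * np.pi / Omega)               # samples at t_n = n*pi/Omega

t = np.linspace(-2.0, 2.0, 1001)             # evaluation grid
kernel = np.sinc(Omega * t[:, None] / np.pi - n[None, :])
f_rec = kernel @ samples                     # truncated cardinal series

print("max error:", np.max(np.abs(f_rec - f(t))))   # small; shrinks as the window grows

The residual error comes purely from truncating the series; the theorem guarantees that the full series converges uniformly.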
Remark 2. Theorem 1.2 has a long and prolific history. It is a striking case of multiple discoveries and many names beyond Shannon, Nyquist, Kotelnikov and Whittaker are associated with it. Also, it forms the basis of Signal Processing. See [Jerri ’77], [Unser ’00], [Meijering ’02], [Ferreira & Higgins ’11] and references therein. Also, many generalizations, including results for not necessarily band-limited functions or nonuniform sampling can be found in the literature. See [Butzer & Stens ’92] and [Marvasti ’01].
Although it forms the basis of modern Signal Processing, Theorem 1.2 has two major drawbacks. The first one is that not all signals of the real world are band-limited. The second is that if the bandwidth is too large, then we must take too many samples in order to acquire the signal. Since, in many situations, it is very difficult to sample at a high rate, some alternatives began to be considered. These alternatives are known as sub-Nyquist sampling strategies. Incredibly, the first result overcoming the Sampling Theorem appeared before the theorem itself. It was proved in [Carathéodory 1911] that if a signal is a positive linear combination of any $N$ sinusoids, it is uniquely determined by its value at $t = 0$ and at any other $2N$ points. Depending on the highest-frequency sinusoid in the signal, this can be much better than the Nyquist rate: while the number of samples at the Nyquist rate increases with $\Omega$, in Carathéodory's argument it increases with $N$. Among all sub-Nyquist sampling alternatives, Compressive Sensing is one of the most important [Mishali & Eldar ’11]. It focuses on efficiently measuring a discrete signal through a measurement matrix $A \in \mathbb{C}^{m \times N}$ with fewer measurements than the ambient dimension. Although it is a theory for discrete signals, some researchers have recently generalized ideas from Compressive Sensing to analog signals. See [Tropp, Laska, Duarte, Romberg & Baraniuk ’10] and [Adcock, Hansen, Roman & Teschke ’13]. This sensing with fewer measurements is only possible (and useful) because many natural signals are sparse or compressible. Using this a priori information will allow us to design efficient reconstruction schemes. These, in turn, will be able to recover the vectors as the unique solutions of some optimization problems.
1.4 Sparse and Compressible Vectors
In this section we define the main objects of our study: sparse and compressible vectors. The fundamental hypothesis we will explore is that many real-world signals have most of their components being zero (or approximately zero).
Definition 1.3. The support of a vector $x \in \mathbb{C}^N$ is the index set of its nonzero entries, i.e.,
$$\operatorname{supp}(x) = \{ j \in [N] : x_j \neq 0 \}.$$

A vector $x \in \mathbb{C}^N$ is called $s$-sparse if at most $s$ of its entries are nonzero, i.e., if $\|x\|_0 = \#(\operatorname{supp}(x)) \leq s$. The set of all $s$-sparse signals is denoted by $\Sigma_s = \{ x : \|x\|_0 \leq s \}$.

Despite $\|\cdot\|_0$ not being a norm, it is abusively called this way throughout the literature on Sparse Recovery, Signal Processing and Computational and Applied Harmonic Analysis. It can be seen as the limit of the $\ell_p$-quasinorms as $p$ decreases to zero through the following observation:
$$\|x\|_p^p = \sum_{j=1}^{N} |x_j|^p \xrightarrow[\,p\to 0\,]{} \sum_{j=1}^{N} \mathbb{1}_{\{x_j \neq 0\}} = \#\{ j \in [N] : x_j \neq 0 \}.$$
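To see this limit numerically, here is a small sketch in Python (our illustration, not part of the original text): for a fixed vector, $\|x\|_p^p$ approaches the number of nonzero entries as $p$ decreases to zero.

import numpy as np

x = np.array([0.0, 3.0, 0.0, -0.5, 2.0])   # a 3-sparse vector in R^5

for p in [1.0, 0.5, 0.1, 0.01]:
    # ||x||_p^p is the sum of |x_j|^p over the nonzero entries
    lp_p = np.sum(np.abs(x[x != 0]) ** p)
    print(f"p = {p:5.2f}:  ||x||_p^p = {lp_p:.4f}")

print("number of nonzeros:", np.count_nonzero(x))   # the limit as p -> 0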
Remark 3. Sparsity has no linear structure. Given any $x, y \in \Sigma_s$, we do not necessarily have $x + y \in \Sigma_s$, although we do have $x + y \in \Sigma_{2s}$. The set $\Sigma_s$ consists in the union of all $\binom{N}{s}$ canonical subspaces of $\mathbb{C}^N$.
As discussed in Section 1.3.2 of [Eldar & Kutyniok ’12], in certain applications it is possible to have more concise models by restricting the feasible signal supports to a small subset of the $\binom{N}{s}$ possible selections of nonzero coefficients. This leads to the concept of sparse union of subspaces. See [Duarte & Eldar ’11]. In the Signal Processing community, it is said that we can model the sparse signal $x \in \mathbb{C}^m$ as a linear combination of $N$ elements, called atoms, that is
$$x = \Phi\alpha = \sum_{i=1}^{N} \alpha_i \varphi_i,$$

where the $\alpha_i$ are the representation coefficients of $x$ in the dictionary $\Phi = [\varphi_1, \ldots, \varphi_N]$. We will require that the elements of the dictionary $\{\varphi_i\}_{i=1}^{N}$ span the entire signal space. Here it will be $\mathbb{C}^m$, but in other situations it can be some specific subset or proper subspace of $\mathbb{C}^m$. This representation is often redundant, since we allow $N > m$. Due to redundancy, the representation is not unique, but we typically consider (or look for) the one with the smallest number of terms. A signal may not be sparse at first sight, but it may be “sparsified” just by using a specific dictionary. Behind the concept of sparsity is the philosophy of Occam's razor: when we are faced with many possible ways to represent a signal, the simplest choice is the best one. This is so because there is a cost to describe the basis we are using for this representation. However, sparsity is a theoretical abstraction. In the real world, one does not seek sparse vectors but, instead, approximately sparse vectors. Even more, we want to measure how close our signals of interest are to a truly sparse one. This measure is given by the next concept.
Definition 1.4. For $p > 0$, the $\ell_p$-error of best $s$-term approximation to a vector $x \in \mathbb{C}^N$ is defined by

$$\sigma_s(x)_p = \inf \{ \|x - z\|_p : z \in \mathbb{C}^N \text{ is } s\text{-sparse} \}.$$

Remark 4. The infimum in Definition 1.4 is always achieved by an $s$-sparse vector $z \in \mathbb{C}^N$ whose nonzero entries are equal to the $s$ largest absolute entries of $x$. There is no reason for this vector to be unique. Nevertheless, such a vector attains the infimum independently of $p > 0$.

The concept of an approximately sparse or compressible vector is a little more ingenious and realistic. Instead of asking for the number of nonzero components to be small, we ask for the error of the best $s$-term approximation to decay fast in $s$. This can be modeled in two different ways. The first one is by using $\ell_p$ balls for some small $p > 0$. The second one is by exploring the concept of significant components of a vector itself. This will lead to the concept of weak $\ell_p$-spaces.

Proposition 1.5 tells us that discrete signals belonging to the unit ball of some $\ell_p$-norm, for some small $p > 0$, are good models for compressible vectors. Here we denote the nonconvex ball in $\mathbb{C}^N$ equipped with the $\ell_p$-norm by $B_p^N = \{ z \in \mathbb{C}^N : \|z\|_p \leq 1 \}$.

Proposition 1.5. For any $q > p > 0$ and any $x \in \mathbb{C}^N$,

$$\sigma_s(x)_q \leq \frac{1}{s^{1/p-1/q}} \|x\|_p.$$

Definition 1.6. The nonincreasing rearrangement of the vector $x \in \mathbb{C}^N$ is the vector $x^* \in \mathbb{R}^N$ for which

$$x_1^* \geq x_2^* \geq \cdots \geq x_N^* \geq 0$$

and there is a permutation $\pi : [N] \to [N]$ with $x_j^* = |x_{\pi(j)}|$ for all $j \in [N]$.

Proof (of Proposition 1.5). Letting $x^* \in \mathbb{R}_+^N$ be the nonincreasing rearrangement of $x \in \mathbb{C}^N$, we obtain

$$\sigma_s(x)_q^q = \sum_{j=s+1}^{N} (x_j^*)^q \leq (x_s^*)^{q-p} \sum_{j=s+1}^{N} (x_j^*)^p \leq \left( \frac{1}{s} \sum_{j=1}^{s} (x_j^*)^p \right)^{\frac{q-p}{p}} \sum_{j=s+1}^{N} (x_j^*)^p$$
$$\leq \frac{1}{s^{\frac{q-p}{p}}} \|x\|_p^{q-p} \|x\|_p^{p} = \frac{1}{s^{q/p-1}} \|x\|_p^{q}.$$

Taking the $q$-th root yields the claimed inequality. □
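Both $\sigma_s(x)_q$ and the bound of Proposition 1.5 are straightforward to compute through the nonincreasing rearrangement. The sketch below (ours, with arbitrary data) checks the inequality on a random vector.

import numpy as np

def sigma_s(x, s, q):
    # l_q-error of best s-term approximation: the l_q norm of all but
    # the s largest-magnitude entries (Remark 4)
    x_star = np.sort(np.abs(x))[::-1]          # nonincreasing rearrangement
    return np.sum(x_star[s:] ** q) ** (1.0 / q)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
s, p, q = 10, 1.0, 2.0

lhs = sigma_s(x, s, q)
rhs = s ** -(1/p - 1/q) * np.sum(np.abs(x) ** p) ** (1/p)  # bound of Prop. 1.5
print(lhs <= rhs, lhs, rhs)                    # the inequality always holds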
The next result is a stronger version, with optimal constant, of Proposition 1.5. Its importance lies in the fact that sharp results in reconstruction error analysis depend on it. Also, the method used to prove it appears many times in the Compressive Sensing literature: we use optimization ideas to prove inequalities with sharp constants. This technique will be invoked again in Lemma 4.25 and Lemma 5.21.
Theorem 1.7. For any $q > p > 0$ and any $x \in \mathbb{C}^N$, the inequality

$$\sigma_s(x)_q \leq \frac{c_{p,q}}{s^{1/p-1/q}} \|x\|_p$$

holds with
$$c_{p,q} = \left[ \left(\frac{p}{q}\right)^{p/q} \left(1 - \frac{p}{q}\right)^{1-p/q} \right]^{1/p} \leq 1.$$
Proof. Let $x^* \in \mathbb{R}_+^N$ be the nonincreasing rearrangement of $x \in \mathbb{C}^N$. If we set $\alpha_j = (x_j^*)^p$, we transform the problem into an equivalent one, namely showing that

$$\alpha_1 \geq \alpha_2 \geq \cdots \geq \alpha_N \geq 0, \quad \alpha_1 + \alpha_2 + \cdots + \alpha_N \leq 1 \;\Longrightarrow\; \alpha_{s+1}^{q/p} + \alpha_{s+2}^{q/p} + \cdots + \alpha_N^{q/p} \leq \frac{c_{p,q}^q}{s^{q/p-1}}.$$

Therefore, setting $r = q/p > 1$, we need to maximize the convex function
$$f(\alpha_1, \alpha_2, \ldots, \alpha_N) = \alpha_{s+1}^r + \alpha_{s+2}^r + \cdots + \alpha_N^r$$

over the convex polytope $C = \{ (\alpha_1, \ldots, \alpha_N) \in \mathbb{R}^N : \alpha_1 \geq \cdots \geq \alpha_N \geq 0 \text{ and } \alpha_1 + \cdots + \alpha_N \leq 1 \}$. As any point of $C$ is a convex combination of its vertices and because the function $f$ is convex, the maximum is attained at a vertex of $C$ (see Theorem B.16 of [Rauhut & Foucart ’13] or Theorem 2.65 of [Ruszczynski]). Besides, the vertices are intersections of $N$ hyperplanes, obtained by forcing $N$ of the $N+1$ inequalities to become equalities. From this we obtain the following possibilities:
1. If $\alpha_1 = \cdots = \alpha_N = 0$, then $f(\alpha_1, \alpha_2, \ldots, \alpha_N) = 0$.

2. If $\alpha_1 + \cdots + \alpha_N = 1$ and $\alpha_1 = \cdots = \alpha_k > \alpha_{k+1} = \cdots = \alpha_N = 0$ for some $1 \leq k \leq s$, then one has $f(\alpha_1, \alpha_2, \ldots, \alpha_N) = 0$.
3. If $\alpha_1 + \cdots + \alpha_N = 1$ and $\alpha_1 = \cdots = \alpha_k > \alpha_{k+1} = \cdots = \alpha_N = 0$ for some $s+1 \leq k \leq N$, then one has $\alpha_1 = \cdots = \alpha_k = 1/k$, and consequently $f(\alpha_1, \alpha_2, \ldots, \alpha_N) = (k - s)/k^r$.

Thus we have obtained
$$\max_{(\alpha_1,\ldots,\alpha_N) \in C} f(\alpha_1, \alpha_2, \ldots, \alpha_N) = \max_{s+1 \leq k \leq N} \frac{k - s}{k^r}.$$

Now, considering $k$ as a continuous variable, we observe that the function $g(k) = (k - s)/k^r$ is increasing until the critical point $k^* = (r/(r-1))s$ and decreasing thereafter. Using this information yields
$$\max_{(\alpha_1,\ldots,\alpha_N) \in C} f(\alpha_1, \alpha_2, \ldots, \alpha_N) \leq g(k^*) = \left(1 - \frac{1}{r}\right)^{r-1} \frac{1}{r} \cdot \frac{1}{s^{r-1}} = \frac{c_{p,q}^q}{s^{q/p-1}}.$$
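Since the proof reduces everything to maximizing $g(k) = (k-s)/k^r$ over the integers $s+1 \leq k \leq N$, the sharpness of the constant can be inspected numerically. A small sketch (ours, with arbitrary parameters):

import numpy as np

p, q, s, N = 0.5, 2.0, 5, 1000
r = q / p                                   # r = q/p > 1

# discrete maximum of g(k) = (k - s)/k^r over s+1 <= k <= N
k = np.arange(s + 1, N + 1)
discrete_max = np.max((k - s) / k.astype(float) ** r)

# continuous bound g(k*) = (1 - 1/r)^(r-1) * (1/r) * s^(1-r) = c_{p,q}^q / s^{q/p-1}
bound = (1 - 1/r) ** (r - 1) / r / s ** (r - 1)

c_pq = ((p/q) ** (p/q) * (1 - p/q) ** (1 - p/q)) ** (1/p)
print(discrete_max <= bound, bound, c_pq ** q / s ** (q/p - 1))  # last two agree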
The other way of reasoning about compressible vectors is to require that the number of significant components be small. This can be done by introducing the weak $\ell_p$-spaces.
Definition 1.8. For $p > 0$, the weak $\ell_p$ space $w\ell_p^N$ denotes the space $\mathbb{C}^N$ equipped with the quasinorm

$$\|x\|_{p,\infty} = \inf \left\{ M \geq 0 : \#\{ j \in [N] : |x_j| \geq t \} \leq \frac{M^p}{t^p} \text{ for all } t > 0 \right\}.$$

Remark 5. In the definition above, the prefix quasi comes from the fact that a constant appears in the triangle inequality: $\|x + y\|_{p,\infty} \leq 2^{\max\{1, 1/p\}} (\|x\|_{p,\infty} + \|y\|_{p,\infty})$. For definitions, see Section 1.7.

There is an alternative expression for the weak $\ell_p$-quasinorm of a vector $x \in \mathbb{C}^N$.

Proposition 1.9. For $p > 0$, the weak $\ell_p$-quasinorm of a vector $x \in \mathbb{C}^N$ can be expressed as

$$\|x\|_{p,\infty} = \max_{k \in [N]} k^{1/p} x_k^*,$$
where $x^* \in \mathbb{R}_+^N$ denotes the nonincreasing rearrangement of $x \in \mathbb{C}^N$.

Proof. Given $x \in \mathbb{C}^N$, we clearly have $\|x\|_{p,\infty} = \|x^*\|_{p,\infty}$. Then we need to prove that $\|x\|^* := \max_{k \in [N]} k^{1/p} x_k^*$ equals $\|x^*\|_{p,\infty}$. For $t > 0$ we have two possibilities:

$$\{ j \in [N] : x_j^* \geq t \} = [k] \text{ for some } k \in [N] \qquad \text{or} \qquad \{ j \in [N] : x_j^* \geq t \} = \emptyset.$$

In the first case, $t \leq x_k^* \leq \|x\|^*/k^{1/p}$. This implies $\#\{ j \in [N] : x_j^* \geq t \} = k \leq (\|x\|^*)^p/t^p$. Note that this inequality holds trivially in the case $\{ j \in [N] : x_j^* \geq t \} = \emptyset$. By the definition of the weak $\ell_p$-quasinorm, this leads to $\|x^*\|_{p,\infty} \leq \|x\|^*$. Now, suppose that $\|x\|^* > \|x^*\|_{p,\infty}$, so that $\|x\|^* \geq (1+\varepsilon)\|x^*\|_{p,\infty}$ for some $\varepsilon > 0$. This can be translated into $k^{1/p} x_k^* \geq (1+\varepsilon)\|x^*\|_{p,\infty}$ for some $k \in [N]$. Therefore we have the inclusion
$$[k] \subset \{ j \in [N] : x_j^* \geq (1+\varepsilon)\|x^*\|_{p,\infty}/k^{1/p} \}.$$

Again, by the definition of the weak $\ell_p$-quasinorm, we obtain
$$k \leq \frac{\|x^*\|_{p,\infty}^p}{\big( (1+\varepsilon)\|x^*\|_{p,\infty}/k^{1/p} \big)^p} = \frac{k}{(1+\varepsilon)^p},$$

which is a contradiction. We conclude that $\|x\|^* = \|x^*\|_{p,\infty}$. □
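Proposition 1.9 turns the weak $\ell_p$-quasinorm into an easily computable quantity. The sketch below (ours) computes it through the nonincreasing rearrangement and verifies the counting condition of Definition 1.8 on a grid of thresholds.

import numpy as np

def weak_lp(x, p):
    # ||x||_{p,infty} = max_k k^{1/p} x*_k  (Proposition 1.9)
    x_star = np.sort(np.abs(x))[::-1]
    k = np.arange(1, len(x) + 1)
    return np.max(k ** (1.0 / p) * x_star)

rng = np.random.default_rng(1)
x = rng.standard_normal(50)
p = 0.5
M = weak_lp(x, p)

# check Definition 1.8: #{j : |x_j| >= t} <= M^p / t^p for every t > 0
for t in np.linspace(1e-3, np.abs(x).max(), 1000):
    assert np.sum(np.abs(x) >= t) <= M ** p / t ** p
print("weak l_p quasinorm:", M)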
With the alternative expression of the weak `p-quasinorm, we can establish a proposition similar to Proposition 1.5.
Proposition 1.10. For any $q > p > 0$ and $x \in \mathbb{C}^N$, we have the inequality

$$\sigma_s(x)_q \leq \left( \frac{p}{q-p} \right)^{1/q} \frac{1}{s^{1/p-1/q}} \|x\|_{p,\infty}.$$

Proof. Without loss of generality, we may assume that $\|x\|_{p,\infty} \leq 1$, so that $x_k^* \leq 1/k^{1/p}$ for all $k \in [N]$. This yields
$$\sigma_s(x)_q^q = \sum_{k=s+1}^{N} (x_k^*)^q \leq \sum_{k=s+1}^{N} \frac{1}{k^{q/p}} \leq \int_s^{\infty} \frac{1}{t^{q/p}}\, dt = \left[ \frac{t^{1-q/p}}{1 - q/p} \right]_{t=s}^{t=\infty} = \frac{p}{q-p} \cdot \frac{1}{s^{q/p-1}}.$$

Taking the power $1/q$ leads to the desired inequality. □
Proposition 1.5 and Proposition 1.10 show that if we have vectors $x \in \mathbb{C}^N$ for which $\|x\|_p \leq 1$ or $\|x\|_{p,\infty} \leq 1$ for small $p > 0$, then they will be compressible vectors in the sense that their errors of best $s$-term approximation decay quickly with $s$.
1.5 How Many Measurements Are Necessary?
The fundamental problem in Compressive Sensing is to reconstruct an $s$-sparse vector from few measurements. In mathematical terms, this is represented by solving the linear system $Ax = y$, where $x \in \mathbb{C}^N$ is the $s$-sparse signal being acquired, $A \in \mathbb{C}^{m \times N}$ represents the measurement matrix and $y \in \mathbb{C}^m$ represents the acquired information.

Since we want to acquire the signal in a compressive fashion, we have fewer measurements than the ambient dimension, i.e., $m < N$. This results in an underdetermined system, which has, without any further assumption, an infinite number of solutions. As described in Section 1.3, it is natural to assume that the signals are sparse in some basis. With this regularization assumption, we expect to identify the original vector $x$. Thus, the recovery of sparse signals has two meanings:
1. Uniform recovery: the reconstruction of all $s$-sparse vectors $x \in \mathbb{C}^N$ simultaneously.

2. Nonuniform recovery: the reconstruction of a specific $s$-sparse vector $x \in \mathbb{C}^N$.

In case we are dealing with measurement matrices described by stochastic models, we can translate the conditions above into finding a lower bound of the form
1. Uniform recovery: $\mathbb{P}(\forall\, s\text{-sparse } x, \text{ recovery of } x \text{ is successful using } A) \geq 1 - \varepsilon$.

2. Nonuniform recovery: $\forall\, s$-sparse $x$, $\mathbb{P}(\text{recovery of } x \text{ is successful using } A) \geq 1 - \varepsilon$.

In both cases, the probability is over the random draw of $A$, described by a certain model. In this dissertation we will deal with the first case. We will answer when it is possible to recover all signals from few measurements, provided that all of them have some sparsity and that we have access to the acquired vector $y$ (this acquired vector will be different for every signal $x$). For the nonuniform case, see Sections 12.2 and 14.2 of [Rauhut & Foucart ’13] and the discussion in [Candès & Plan ’11].
Remark 6. For a matrix $A \in \mathbb{C}^{m \times N}$ and a subset $S \subset [N]$, we denote by $A_S$ the column submatrix of $A$ consisting of the columns indexed by $S$. For a vector $x \in \mathbb{C}^N$, we denote by $x_S$ either the vector in $\mathbb{C}^S$ consisting of the entries indexed by $S$ or the vector in $\mathbb{C}^N$ which coincides with $x$ on the entries in $S$ and is zero on the entries outside $S$. It will be clear from the context which of the two options we are dealing with.

The way we recover signals is by supposing that they are as sparse as possible. Therefore, an algorithmic approach for recovering them is through $\ell_0$-minimization. In other words, we search for the sparsest vector consistent with the measured data $y = Ax$:
$$\min_{z \in \mathbb{C}^N} \|z\|_0 \quad \text{subject to} \quad Az = y. \qquad (P_0)$$

In case the measurements are corrupted by noise or are slightly inaccurate, the natural generalization is given by
$$\min_{z \in \mathbb{C}^N} \|z\|_0 \quad \text{subject to} \quad \|Az - y\|_2 \leq \eta. \qquad (P_{0,\eta})$$

The main point about Compressive Sensing is that we can recover compressible signals corrupted by noise; the problem enjoys stability and robustness. See Sections 3.4 and 3.5. Therefore, in real situations, we will deal with the problem $(P_{0,\eta})$. We can understand the recovery of sparse vectors through the properties of the measurement matrix $A$, as shown in the next theorem.
Theorem 1.11. Given $A \in \mathbb{C}^{m \times N}$, the following properties are equivalent:

a) Every $s$-sparse vector $x \in \mathbb{C}^N$ is the unique $s$-sparse solution of $Az = Ax$; that is, if $Ax = Az$ and both $x$ and $z$ are $s$-sparse, then $x = z$.
b) The null space $\ker A$ does not contain any $2s$-sparse vector other than the zero vector, that is, $\ker A \cap \{ z \in \mathbb{C}^N : \|z\|_0 \leq 2s \} = \{0\}$.

c) Every set of $2s$ columns of $A$ is linearly independent.
d) For every $S \subset [N]$ with $\#S \leq 2s$, the submatrix $A_S$ is an injective map from $\mathbb{C}^S$ to $\mathbb{C}^m$ and the Gram matrix $A_S^* A_S$ is invertible.
Proof. a) $\Rightarrow$ b): Assume that for every vector $x \in \mathbb{C}^N$ we have $\{ z \in \mathbb{C}^N : Az = Ax, \|z\|_0 \leq s \} = \{x\}$. Let $v \in \ker A$ be $2s$-sparse. We write $v = x - z$ for $s$-sparse vectors $x, z$ with $\operatorname{supp} x \cap \operatorname{supp} z = \emptyset$. Then $Ax = Az$ and, by assumption, $x = z$. Since the supports of $x$ and $z$ are disjoint, it follows that $z = x = 0$ and $v = 0$.

b) $\Rightarrow$ c): Let $S$ be a set of indices $1 \leq i_1 < i_2 < \cdots < i_{2s} \leq N$ with $\#S = 2s$ and let $x \in \mathbb{C}^N$ be such that $\operatorname{supp}(x) \subset S$. If $Ax = \sum_{\ell=1}^{2s} a_{i_\ell} x_{i_\ell} = 0$, then $x = 0$. Thus the $2s$ column vectors indexed by $S$ must be linearly independent.

c) $\Rightarrow$ d): If every set of $2s$ columns of $A$ is linearly independent, then every set of $s$ columns is also linearly independent. Thus, clearly, $A_S$ is injective as a map from $\mathbb{C}^S$ to $\mathbb{C}^m$. Also, consider $S$ and $x$ as in the previous case. We have
$$\langle x, A_S^* A_S x \rangle = \langle A_S x, A_S x \rangle = \Big\| \sum_{\ell=1}^{2s} x_{i_\ell} a_{i_\ell} \Big\|_2^2 > 0 \quad \text{for } x \neq 0,$$

since, by c), the $2s$ column vectors $a_{i_\ell}$ are linearly independent. Also, the matrix $A_S^* A_S$ is self-adjoint, therefore all of its eigenvalues are real, and they are all positive if and only if the quadratic form $\langle x, A_S^* A_S x \rangle$ is positive for $x \neq 0$. Furthermore, a matrix with all positive eigenvalues is invertible.

d) $\Rightarrow$ a): Suppose that $x_1$ and $x_2$ are two $s$-sparse vectors satisfying $Ax_1 = Ax_2$. Setting $x = x_1 - x_2$, we have that $x$ is $2s$-sparse and that $Ax = Ax_1 - Ax_2 = 0$. Let $S = \{i_\ell\}$ be the index set of the support of $x$. Since $x$ is in the kernel of $A$, we have

$$0 = \langle x_S, A_S^* A_S x_S \rangle = \Big\| \sum_{\ell} x_{i_\ell} a_{i_\ell} \Big\|_2^2.$$

Since $A_S^* A_S$ is invertible, this forces $x_S = 0$, hence $x = 0$ and $x_1 = x_2$. □

From Theorem 1.11, we conclude that if it is possible to reconstruct every $s$-sparse vector, so that statement c) above holds, then $\operatorname{rank}(A) \geq 2s$. On the other hand, since $A \in \mathbb{C}^{m \times N}$, $\operatorname{rank}(A) \leq m$. We conclude that the number of measurements necessary to reconstruct every $s$-sparse vector satisfies $m \geq 2s$. The next theorem shows that $m = 2s$ measurements suffice. Afterwards, we will discuss why this bound is a theoretical one and typically not stable for practical implementation. In particular, due to the lack of knowledge about the location of the support, in Chapter 8 we will see that, in fact, a few more than $2s$ measurements are necessary.
Theorem 1.12. For any integer $N \geq 2s$, there exists a measurement matrix $A \in \mathbb{C}^{m \times N}$ with $m = 2s$ rows such that every $s$-sparse vector $x \in \mathbb{C}^N$ can be recovered from its measurement vector $y = Ax \in \mathbb{C}^m$ as a solution of $(P_0)$.
Proof. For fixed $t_N > \cdots > t_2 > t_1 > 0$, we define the matrix $A \in \mathbb{C}^{m \times N}$ with $m = 2s$ as

$$A = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ t_1 & t_2 & \ldots & t_N \\ \vdots & \vdots & \ddots & \vdots \\ t_1^{2s-1} & t_2^{2s-1} & \ldots & t_N^{2s-1} \end{bmatrix}.$$

Let $S = \{ j_1 < \cdots < j_{2s} \}$ be an index set of cardinality $2s$ and consider the square matrix $A_S \in \mathbb{C}^{2s \times 2s}$. It is the transpose of a Vandermonde matrix, and it is well known that $\det(A_S) = \prod_{k < \ell} (t_{j_\ell} - t_{j_k}) > 0$. This
shows that $A_S$ is invertible and, in particular, injective. The hypotheses of Theorem 1.11 are fulfilled, thus every $s$-sparse vector $x \in \mathbb{C}^N$ is the unique $s$-sparse vector satisfying $Az = Ax$. Therefore it can be recovered as the solution of $(P_0)$. □
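For small dimensions, condition c) of Theorem 1.11 can be checked by brute force, testing the rank of every $2s$-column submatrix. The sketch below (ours) does this for a Vandermonde-type matrix as in Theorem 1.12; the cost is $\binom{N}{2s}$ rank computations, so this is feasible only for tiny instances.

import numpy as np
from itertools import combinations

def recovers_all_s_sparse(A, s):
    # condition c) of Theorem 1.11: every set of 2s columns of A
    # is linearly independent (brute force, small N only)
    m, N = A.shape
    if 2 * s > m:
        return False                     # rank(A) <= m < 2s
    for S in combinations(range(N), 2 * s):
        if np.linalg.matrix_rank(A[:, S]) < 2 * s:
            return False
    return True

# a 4 x 8 Vandermonde-type matrix as in Theorem 1.12, with s = 2
t = np.linspace(1.0, 2.0, 8)
A = np.vstack([t ** k for k in range(4)])
print(recovers_all_s_sparse(A, s=2))     # True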
Remark 7. As discussed in Section 2.2 of [Rauhut & Foucart ’13], it is interesting to note that many matrices fulfill the hypotheses of Theorem 1.11. We can take any matrix $M \in \mathbb{R}^{N \times N}$ that is totally positive, i.e., that satisfies $\det(M_{I,J}) > 0$ for any sets $I, J \subset [N]$ with the same cardinality, where $M_{I,J}$ represents the submatrix of $M$ with rows indexed by $I$ and columns indexed by $J$. Then, we select any $m = 2s$ rows of $M$, indexed by a set $I$, to form the measurement matrix $A$. Since the original matrix $M$ is totally positive, for any index set $S \subset [N]$ with $\#S = 2s$, the matrix $A_S$ is the same as $M_{I,S}$, hence it is invertible and we can apply Theorem 1.11. In particular, the partial Discrete Fourier Transform matrix
$$A = \begin{bmatrix} 1 & 1 & 1 & \ldots & 1 \\ 1 & e^{2\pi i/N} & e^{2\pi i \cdot 2/N} & \ldots & e^{2\pi i (N-1)/N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & e^{2\pi i (2s-1)/N} & e^{2\pi i (2s-1)2/N} & \ldots & e^{2\pi i (2s-1)(N-1)/N} \end{bmatrix}$$

allows for the reconstruction of every $s$-sparse vector $x \in \mathbb{C}^N$ from $y = Ax \in \mathbb{C}^{2s}$.

The reasoning above can be generalized, and we can prove that, from a measure-theoretical viewpoint, the set of matrices that cannot recover $s$-sparse vectors is small.
Theorem 1.13. The set of $2s \times N$ matrices $A$ such that $\det(A_S) = 0$ for some $S \subset [N]$ with $\#S \leq 2s$ has Lebesgue measure zero. Therefore, almost all $2s \times N$ matrices allow the reconstruction of every $s$-sparse vector $x \in \mathbb{C}^N$ from $y = Ax \in \mathbb{C}^{2s}$.

Theorem 1.13 is a simple consequence of Sard's Theorem³ and tells us that we can draw a matrix at random and it will, in principle, be a good measurement matrix. However, this does not address the problem of robustness to measurement errors and stability for the recovery of compressible vectors. These points will be further discussed in Chapter 3 and Chapter 8. In the latter, we will prove that any stable reconstruction scheme for $s$-sparse vectors requires at least $m = Cs\ln(eN/s)$ linear measurements, for a constant $C > 0$. From this perspective, matrices such as the Vandermonde matrix from Theorem 1.12 are not adequate as measurement matrices.

³For a proof of this theorem, see pages 205–207 of [Guillemin & Pollack].
1.6 Computational Complexity of Sparse Recovery
In this section we make a brief digression about computational complexity theory. This is a branch of mathematics/computer science in which one tries to quantify the amount of computational resources required to solve a given task. We do not intend to be rigorous here. For a rigorous treatment, see the classical [Garey & Johnson ’79] or the modern [Arora & Barak ’07].

For example, if one tries to solve a linear system, the classic Gaussian elimination algorithm uses essentially $n^3$ basic arithmetic operations to solve $n$ equations in $n$ variables. However, in the 1960s, a more efficient algorithm was created, see [Strassen ’69]. The latter is a highly nonintuitive algorithm. Therefore one might ask whether, in some cases, there are more efficient (but nonintuitive) algorithms than the ones used for many years. In the field of computational complexity, researchers try to answer such questions and prove that, sometimes, there is no better algorithm than the existing ones. Even more, for some problems they try to prove that there is no efficient algorithm at all.

Computational complexity starts with the work [Hartmanis & Stearns ’65]. For a historical account, see [Fortnow & Homer ’03]. It is philosophically similar to taxonomy, where typically one tries to establish patterns and group problems into classes according to their inherent difficulty, in the same way naturalists
try to classify living beings according to physiological properties. Afterwards, those classes are related to each other. In this classification, a problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used.

A polynomial-time (respectively, exponential-time) algorithm is an algorithm performing its task in a number of steps bounded by a polynomial (respectively, exponential) expression in the size of the input. The first distinction between polynomial-time and exponential-time algorithms was given by [von Neumann ’53]. It is standard in complexity theory to identify polynomial-time with feasible. Clearly, this is not always true, as pointed out by [Cook]. He says: “for example, a computer program requiring $n^{100}$ steps could never be executed on an input even as small as $n = 10$.” Even so, let us assume that this is the case and consider the Feasibility Thesis:
“A natural problem has a feasible algorithm if and only if it has a polynomial-time algorithm.”
There is a classification of computational problems: they are divided into decision problems, search problems, counting problems, optimization problems and function problems. See [Goldreich ’08]. Here we are interested in decision and optimization problems. A decision problem is a question in some formal system with a yes-or-no answer, depending on the values of some input parameters. An optimization problem asks to find the “best possible” solution among the set of all feasible solutions. Every optimization problem can be transformed into a decision problem, essentially by asking whether a feasible solution better than a given threshold exists or not. Now we introduce some terminology, following [Rauhut & Foucart ’13]. For rigorous definitions, see Definitions 1.13, 2.1 and 2.7 of [Arora & Barak ’07].
Definition 1.14. We have four basic complexity classes:
• The class $\mathcal{P}$ of P-problems consists of all decision problems for which there exists a polynomial-time algorithm finding a solution.
• The class $\mathcal{NP}$ of NP-problems consists of all decision problems for which there exists a polynomial-time algorithm certifying a solution.
• The class $\mathcal{NP}$-hard of NP-hard problems consists of all problems (not necessarily decision problems) for which a solving algorithm could be transformed in polynomial time into a polynomial-time solving algorithm for any NP-problem. Roughly speaking, this is the class of problems at least as hard as any NP-problem.
• The class $\mathcal{NP}$-complete of NP-complete problems consists of all problems that are both NP and NP-hard; in other words, it consists of all NP-problems that are at least as hard as any other NP-problem.
Note that the class $\mathcal{P}$ is contained in the class $\mathcal{NP}$. One can ask if the converse is also true. This is the important conjecture $\mathcal{P} = \mathcal{NP}$; see [Cook]. If this conjecture is shown to be false, which is widely believed, then there will exist problems for which potential solutions can be certified in polynomial time, but no solution can be found in polynomial time.
Figure 1.2: Relation between the four main complexity classes.
As [Arora & Barak ’07] points out: “Recognizing the correctness of an answer is often much easier than coming up with the answer. Appreciating a Beethoven sonata is far easier than composing the sonata; verifying the solidity of a design for a suspension bridge is easier (to a civil engineer anyway!) than coming up with a good design; verifying the proof of a theorem is easier than coming up with a proof itself.”

Since nowadays we know that there are many $\mathcal{NP}$-complete problems, these problems are probably intrinsically intractable. Some of the first examples were given by the list of 21 problems described in [Karp ’72]. He says: “In this paper we give theorems that suggest, but do not imply, that these problems, as well as many others, will remain intractable perpetually”. In particular, we are interested in the fourteenth problem of the list.
Remark 8. (Exact cover by 3-sets problem, or X3C problem): Given a collection $\{C_i, i \in [N]\}$ of 3-element subsets of $[m]$, does there exist an exact cover (or a partition) of $[m]$, i.e., a set $J \subset [N]$ such that $\cup_{j \in J} C_j = [m]$ and $C_j \cap C_k = \emptyset$ for all $j, k \in J$ with $j \neq k$?

Clearly, $m$ must be a multiple of 3. Let us use one example in order to understand. Suppose $[m] = \{1, 2, 3, 4, 5, 6\}$. If we have the collection of 3-element sets given by $\{\{1,2,3\}, \{2,3,4\}, \{1,2,5\}, \{2,5,6\}, \{1,5,6\}\} = \{C_1, C_2, C_3, C_4, C_5\}$, then we could choose $\{C_2, C_5\} = \{\{2,3,4\}, \{1,5,6\}\}$ as an exact cover, because each element of $[m]$ appears exactly once. If, instead, the collection of 3-element sets were $\{\{1,2,3\}, \{2,4,5\}, \{2,5,6\}\}$, then no subcollection would be an exact cover (we would need all 3 subsets to cover every element of $[m]$ at least once, but then the element 2 would appear three times).

In Section 1.5 we stated the main problem of Compressive Sensing: the optimization problem of sparse recovery. The straightforward approach for solving it is to solve every square linear system $A_S^* A_S u = A_S^* y$ for $u \in \mathbb{C}^S$, where $S$ runs through all possible subsets of $[N]$ of size $s$. In this brute-force approach, one needs to solve on the order of $\binom{N}{s}$ linear systems.

Example 1.15. As described in Section 2.3 of [Rauhut & Foucart ’13], suppose we have a small sparse recovery problem with $N = 1000$ and $s = 10$. We would have to solve $\binom{1000}{10} \geq (\frac{1000}{10})^{10} = 10^{20}$ linear systems of size $10 \times 10$. If each linear system required $10^{-10}$ seconds to be solved, the time required for solving $(P_0)$ would be $10^{10}$ seconds, i.e., more than 300 years.

The heuristic of Example 1.15 will be confirmed by Theorem 1.16: this problem is in fact intractable for any possible approach. We will prove that $(P_0)$ belongs to the NP-hard class; i.e., assuming that the exact cover by 3-sets problem is NP-complete, we can now prove that the main Compressive Sensing problem is at least as hard as it.
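The brute-force approach just described is simple to write down. The sketch below (ours) solves a least-squares problem for every candidate support of size $s$ and returns the first consistent one, which makes the $\binom{N}{s}$ blow-up of Example 1.15 explicit.

import numpy as np
from itertools import combinations

def l0_brute_force(A, y, s, tol=1e-10):
    # solve (P0) by exhausting all supports of size s: on the order of
    # binom(N, s) least-squares problems, tractable only for tiny instances
    m, N = A.shape
    for S in combinations(range(N), s):
        AS = A[:, S]
        u, *_ = np.linalg.lstsq(AS, y, rcond=None)
        if np.linalg.norm(AS @ u - y) < tol:   # consistent support found
            x = np.zeros(N)
            x[list(S)] = u
            return x
    return None

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 30))
x_true = np.zeros(30); x_true[[3, 17]] = [1.0, -2.0]   # a 2-sparse vector
print(np.allclose(l0_brute_force(A, A @ x_true, s=2), x_true))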
Theorem 1.16 (Theorem 1 of [Natarajan ’96]). For any $\eta \geq 0$, the $\ell_0$-minimization problem $(P_{0,\eta})$ for general $A \in \mathbb{C}^{m \times N}$ and $y \in \mathbb{C}^m$ is NP-hard.

Proof. Without loss of generality, we may assume that $\eta < 1$, by a rescaling argument. Using that the exact cover by 3-sets problem is NP-complete, we will reduce it, in polynomial time, to our $\ell_0$-minimization problem. Let us consider $\{C_i : i \in [N]\}$, a collection of 3-element subsets of $[m]$. Using this collection, we define vectors $a_1, \ldots, a_N \in \mathbb{C}^m$ by

$$(a_i)_j = \begin{cases} 1 & \text{if } j \in C_i, \\ 0 & \text{if } j \notin C_i. \end{cases}$$

With these vectors, we can define a matrix $A \in \mathbb{C}^{m \times N}$ and a vector $y \in \mathbb{C}^m$ by
$$A = \begin{bmatrix} a_1 & a_2 & \ldots & a_N \end{bmatrix}, \qquad y = [1, 1, \ldots, 1]^T.$$
It is possible to make this construction in polynomial time since, by construction, we have $N \leq \binom{m}{3}$. Using the vector $y$ defined above, for any $z \in \mathbb{C}^N$ satisfying $\|Az - y\|_2 \leq \eta$, every component of $Az$
must differ from 1 by at most $\eta < 1$. Hence all these components are nonzero and $\|Az\|_0 = m$. On the other hand, by definition, each $a_i$ has exactly 3 nonzero components, and then $Az = \sum_{j=1}^N z_j a_j$ has at most $3\|z\|_0$ nonzero components, i.e., $\|Az\|_0 \leq 3\|z\|_0$.

Therefore, we conclude that if a vector satisfies $\|Az - y\|_2 \leq \eta$, then it must satisfy $\|z\|_0 \geq m/3$. Suppose that we run the $\ell_0$-problem and that it returns $x \in \mathbb{C}^N$ as the output. We have two possibilities.
1. If $\|x\|_0 = m/3$, then suppose that the collection $\{C_j, j \in \operatorname{supp}(x)\}$ is not an exact cover of $[m]$. Since these $m/3$ sets cover at most $m$ elements counted with multiplicity, any overlap would leave some element of $[m]$ uncovered, and the corresponding component of $Ax = \sum_{j=1}^N x_j a_j$ would be zero; that is, the $m$ components of $Ax$ would not all be nonzero. Since this is impossible, we conclude that in this case the exact cover exists. Therefore we can answer positively the question about the existence of an exact cover of $[m]$ using the solution of $(P_{0,\eta})$.
2. If $\|x\|_0 > m/3$, suppose that an exact cover of $[m]$ exists, i.e., we can cover the set $[m]$ with a partition $\{C_j, j \in J\}$. In this case, we can define a vector $z \in \mathbb{C}^N$ by $z_j = 1$ if $j \in J$ and $z_j = 0$ if $j \notin J$. This vector satisfies $Az = \sum_{j=1}^N z_j a_j = \sum_{j \in J} a_j = y$ and $\|z\|_0 = m/3$. This is a contradiction, since in this case $x$ would no longer be a solution of $(P_{0,\eta})$. Thus, in the case $\|x\|_0 > m/3$, we can answer negatively about the existence of an exact cover by using the solution of $(P_{0,\eta})$.

In both cases, we showed that through the solution of the $\ell_0$-minimization problem we can solve the exact cover by 3-sets problem. □
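The reduction can be made concrete on the toy instance of Remark 8. In the sketch below (ours), the indicator vector of the exact cover $\{C_2, C_5\}$ is an $(m/3)$-sparse solution of $Az = y$, exactly as the proof predicts.

import numpy as np

m = 6
collection = [{1, 2, 3}, {2, 3, 4}, {1, 2, 5}, {2, 5, 6}, {1, 5, 6}]  # the C_i of Remark 8

# columns a_i with (a_i)_j = 1 iff j is in C_i
A = np.array([[1.0 if j + 1 in C else 0.0 for C in collection] for j in range(m)])
y = np.ones(m)

z = np.array([0, 1, 0, 0, 1.0])       # indicator of the exact cover {C2, C5}
print(np.allclose(A @ z, y))          # True: an exact cover gives Az = y
print(np.count_nonzero(z) == m // 3)  # True: and ||z||_0 = m/3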
It is important to highlight what this theorem is and what it is not. It concerns intractability in the general case. This means that for a general matrix $A$ and a general vector $y$, we cannot have a smart strategy to solve $(P_0)$; in other words, no efficient algorithm is able to solve the problem for every choice of $A$ and $y$. This does not mean that for special choices of $A$ and $y$ we do not have tractable algorithms for sparse recovery. This will be the content of Chapter 2. Also, as we shall see in Theorem 3.30, $\ell_q$-minimization, for $q < 1$, is also an NP-hard problem.
1.7 Some Definitions
In this section we give a pot-pourri of definitions used along this dissertation.

Definition 1.17. A nonnegative function $\|\cdot\| : X \to [0, \infty)$ on a vector space $X$ is called a norm if

(a) $\|x\| = 0$ if and only if $x = 0$.
(b) $\|\lambda x\| = |\lambda| \|x\|$ for all scalars $\lambda$ and all vectors $x \in X$.
(c) $\|x + y\| \leq \|x\| + \|y\|$.

If only (b) and (c) hold, so that $\|x\| = 0$ does not necessarily imply $x = 0$, then $\|\cdot\|$ is called a seminorm. If (a) and (b) hold, but (c) is replaced by the weaker quasitriangle inequality
$$\|x + y\| \leq C(\|x\| + \|y\|)$$

for some $C \geq 1$, then $\|\cdot\|$ is called a quasinorm. The smallest such constant $C$ is called its quasinorm constant.

Definition 1.18. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$ or $\mathbb{C}^n$. Its dual norm $\|\cdot\|_*$ is defined by
$$\|x\|_* = \sup_{\|y\| \leq 1} |\langle x, y \rangle|.$$

In the real case we have $\|x\|_* = \sup_{y \in \mathbb{R}^n, \|y\| \leq 1} \langle x, y \rangle$. In the complex case, we have $\|x\|_* = \sup_{y \in \mathbb{C}^n, \|y\| \leq 1} \operatorname{Re}\langle x, y \rangle$.
The dual of the dual norm $\|\cdot\|_*$ is the norm $\|\cdot\|$ itself. In particular, we have

$$\|x\| = \sup_{\|y\|_* \leq 1} |\langle x, y \rangle| = \sup_{\|y\|_* \leq 1} \operatorname{Re}\langle x, y \rangle.$$

Definition 1.19. Let $A : X \to Y$ be a linear map between two normed vector spaces $(X, \|\cdot\|)$ and $(Y, |\!|\!|\cdot|\!|\!|)$, with possibly different norms. The operator norm of $A$ is defined as
$$\|A\| = \sup_{\|x\| \leq 1} |\!|\!|Ax|\!|\!| = \sup_{\|x\| = 1} |\!|\!|Ax|\!|\!|.$$
In particular, for a matrix $A \in \mathbb{C}^{m \times n}$ and $1 \leq p, q \leq \infty$, we define the matrix norm (or operator norm) between $\ell_p^n$ and $\ell_q^m$ as
$$\|A\|_{p \to q} = \sup_{\|x\|_p \leq 1} \|Ax\|_q = \sup_{\|x\|_p = 1} \|Ax\|_q.$$
Definition 1.20. We say that two functions $f, g : \mathbb{R} \to \mathbb{R}$ are comparable if there exist absolute constants $c_1, c_2 > 0$ such that $c_1 f(t) \leq g(t) \leq c_2 f(t)$ for all $t$. We denote this by $f \asymp g$.

We have a notational convention for asymptotic analysis. It was introduced by Paul Bachmann in 1894 and popularized in subsequent years by Edmund Landau in the field of Number Theory.
Definition 1.21. i.) (Big O): We say that f(x) = O(g(x)) if there exists a constant C > 0 such that
$$|f(x)| \leq C|g(x)| \quad \text{for all } x, \qquad (1.2)$$

and when $O(g(x))$ stands in the middle of a formula, it represents a function $f(x)$ that satisfies Equation (1.2). This notation is typically used in the asymptotic context, especially when $x$ is not an integer. See Chapter 9 of [Graham, Knuth & Patashnik ’94].
ii.) (Big Omega): For lower bounds, we say that $f(x) = \Omega(g(x))$ if there exists a constant $C > 0$ such that
$$|f(x)| \geq C|g(x)| \quad \text{for all } x. \qquad (1.3)$$

We have $f(x) = \Omega(g(x))$ if and only if $g(x) = O(f(x))$.
iii.) (Big Theta): To specify an exact order of growth, we have the Big Theta notation. We write $f(x) = \Theta(g(x))$ if the following statements hold:
$$f(x) = O(g(x)) \quad \text{and} \quad f(x) = \Omega(g(x)). \qquad (1.4)$$

The Big Theta definition is equivalent to Definition 1.20. We emphasize it here because the Computational Complexity community uses $\Theta$ while the Approximation Theory community uses $f \asymp g$. Both of them will be used along this text.

Chapter 2
Some Algorithms and Ideas
All models are wrong, some are useful. George E. P. Box in [Box ’79].
2.1 Introduction
For general measurement matrices $A$ and vectors $y$, the problem $(P_0)$ of recovering sparse vectors is NP-hard, as Theorem 1.16 shows. Hence there is no hope of solving it in a computationally efficient way in full generality. However, in the last twenty years many efficient algorithms were developed to deal with this problem. Techniques from Convex Optimization and from Approximation Theory were combined in order to find heuristics that solve the problem, at least for particular instances. The early development of Compressive Sensing was based on the assumption that the solution of the $\ell_1$-minimization problem provides the recovery of the correct sparse vector, serving as a proxy for $(P_0)$, and also that it is feasible to solve this problem with a computer. We will see why this is true, but also that a lot of work has been done in order to find alternative algorithms that are faster or give superior reconstruction performance in some situations. The purpose of this chapter is to present the three most popular classes of algorithms used for sparse vector recovery: optimization methods, greedy methods and thresholding methods. Here we will not analyze any of them; this will be postponed to later chapters, after we introduce the Coherence Property and the Restricted Isometry Property. We do not intend to be exhaustive in the description of these classes of algorithms; our goal now is just to give some intuitive explanation and to develop some ideas about the computational aspects of Compressive Sensing. After presenting the three major classes of algorithms in Sections 2.2 to 2.4, we will briefly discuss the problem of choosing an algorithm for a particular application in Section 2.5, along with a review of recent numerical work related to this issue.
2.2 Basis Pursuit
Basis Pursuit is the main strategy for solving $(P_0)$ that we will deal with along this dissertation. It was introduced by [Chen ’95] as part of his Ph.D. thesis and was fully explored by [Chen, Donoho & Saunders ’01]. Since then, it has been widely used by the Signal Processing, Statistics and Machine Learning communities. As this strategy belongs to the family of optimization methods, we need to start by defining what an optimization problem is. It is not our intention to be encyclopedic, because Mathematical Programming is a vast and deep subject; see [Ruszczynski], [Boyd & Vanderberghe ’04] or [Bertsekas ’16]¹ for further information.
¹There is a whole volume of Documenta Mathematica dedicated to the history of Optimization: https://www.math.uni-bielefeld.de/documenta/vol-ismp/vol-ismp.html for some references. Also, the book [Gass & Assad ’04] has some interesting historical notes.
For the use in Machine Learning of Optimization ideas related to Compressive Sensing, see [Sra, Nowozin & Wright ’11].

Definition 2.1. An optimization problem is described by
$$\text{minimize } F_0(x) \quad \text{subject to} \quad F_i(x) \leq b_i, \; i = 1, 2, \ldots, n, \qquad (2.1)$$

where the function $F_0 : \mathbb{R}^N \to (-\infty, \infty]$ is the objective function and the functions $F_1, \ldots, F_n : \mathbb{R}^N \to (-\infty, \infty]$ are the constraint functions. A point $x \in \mathbb{R}^N$ is called feasible if it satisfies the constraints. A feasible point $x^\#$ for which the minimum is attained, that is, $F_0(x^\#) \leq F_0(x)$ for every feasible point $x$, is called an optimal point, and $F_0(x^\#)$ an optimal value. If $F_0, F_1, \ldots, F_n$ are all convex (linear) functions, this is called a convex (linear) optimization problem.

Remark 9. This framework, with only inequality constraints, is the most general one and encompasses equality constraints too. Every time we have a constraint of the form $F_i(x) = c_i$, we can write the equivalent inequalities $F_i(x) \leq c_i$ and $-F_i(x) \leq -c_i$.

In our setting, the measurement constraint $Ax = y$ must be satisfied, therefore some of the inequalities will be given by $F_j(x) := \langle A_j, x \rangle \leq y_j$ and $F_j(x) := -\langle A_j, x \rangle \leq -y_j$, where $A_j \in \mathbb{R}^N$ is the $j$th row of $A$. Here, we will distinguish the measurement constraint as a special constraint. Also, when modeling some signals of interest, other constraints $F_j(x)$ may appear, thus we will describe the set of feasible points by
$$K = \{ x \in \mathbb{R}^N \mid Ax = y, \; F_j(x) \leq b_j, \; j \in [M] \}. \qquad (2.2)$$

Therefore, we are interested in problems of the form
$$\min_{x \in K} F_0(x). \qquad (2.3)$$

Another important class of optimization problems is the class of conic problems. This kind of problem will appear in the context of the recovery of complex vectors. Here is the formal definition:

Definition 2.2. A conic optimization problem is of the form
$$\min_{x \in \mathbb{R}^N} F_0(x) \quad \text{subject to} \quad x \in K \text{ and } F_i(x) \leq b_i, \; i \in [n],$$

where $K$ is a convex cone and all $F_i$ are convex functions. If $K = \big\{ x \in \mathbb{R}^{N+1} : \sqrt{\textstyle\sum_{j=1}^N x_j^2} \leq x_{N+1} \big\}$, then we have a second-order cone problem, and if $K$ is the cone of positive semidefinite matrices, we have a semidefinite program.

The study of many optimization problems is carried out through the notion of duality. As sometimes it is hard to find the minimum of a problem, we transform it into another (hopefully easier) problem and maximize the new one. This new problem, called the dual problem, provides a bound for the optimal value of the original problem, called the primal problem. If we are lucky enough, the solutions of both problems will agree. In order to describe the dual problem, we need another definition.

Definition 2.3. The Lagrange function of the optimization problem with feasible set (2.2) is defined, for $x \in \mathbb{R}^N$, $\xi \in \mathbb{R}^m$ and $v \in \mathbb{R}^M$ with $v_\ell \geq 0$ for all $\ell \in [M]$, by

$$L(x, \xi, v) = F_0(x) + \langle \xi, Ax - y \rangle + \sum_{\ell=1}^{M} v_\ell (F_\ell(x) - b_\ell).$$
The variables $\xi$ and $v$ are called Lagrange multipliers. Here, we will denote the condition $v_\ell \geq 0$ for all $\ell \in [M]$ by $v \succcurlyeq 0$. The Lagrange dual function is defined by
$$H(\xi, v) = \inf_{x \in \mathbb{R}^N} L(x, \xi, v), \qquad \xi \in \mathbb{R}^m, \; v \in \mathbb{R}^M, \; v \succcurlyeq 0.$$
If $x \mapsto L(x, \xi, v)$ is unbounded from below, we set $H(\xi, v) = -\infty$. Also, if there are no inequality constraints, then
$$H(\xi) = \inf_{x \in \mathbb{R}^N} L(x, \xi) = \inf_{x \in \mathbb{R}^N} \{ F_0(x) + \langle \xi, Ax - y \rangle \}, \qquad \xi \in \mathbb{R}^m. \qquad (2.4)$$
The dual function provides a bound on the optimal value $F_0(x^\#)$ of the minimization problem described by the feasible set (2.2), that is, $H(\xi, v) \leq F_0(x^\#)$ for all $\xi \in \mathbb{R}^m$ and $v \in \mathbb{R}^M$ with $v \succcurlyeq 0$. To see why, take $x$ as a feasible point for (2.3); then $Ax - y = 0$ and $F_\ell(x) - b_\ell \leq 0$ for all $\ell \in [M]$, so we have, for all $\xi \in \mathbb{R}^m$ and $v \succcurlyeq 0$, that

$$\langle \xi, Ax - y \rangle + \sum_{\ell=1}^{M} v_\ell (F_\ell(x) - b_\ell) \leq 0,$$

and then
$$L(x, \xi, v) = F_0(x) + \langle \xi, Ax - y \rangle + \sum_{\ell=1}^{M} v_\ell (F_\ell(x) - b_\ell) \leq F_0(x).$$
Using that $H(\xi, v) \leq L(x, \xi, v)$ and taking the infimum over all feasible points $x \in \mathbb{R}^N$, we obtain a lower bound for the primal problem. This leads to the new optimization problem
$$\max_{\xi \in \mathbb{R}^m, \; v \in \mathbb{R}^M} H(\xi, v) \quad \text{subject to} \quad v \succcurlyeq 0. \qquad (2.5)$$

The main point of this transformation is that $H(\xi, v)$ is always concave, even if the original objective function is not convex. Therefore the dual problem is equivalent to minimizing $-H$, a convex function, subject to $v \succcurlyeq 0$. A feasible maximizer of (2.5) is called a dual optimal. We always have $H(\xi^\#, v^\#) \leq F_0(x^\#)$, and this is what we call weak duality. In case of equality, we have strong duality. We can interpret Lagrange duality via a saddle-point argument.
Remark 10. (Saddle-point interpretation): We will consider here the problem
$$\text{minimize } F_0(x) \quad \text{subject to} \quad Ax = y, \qquad (2.6)$$

that is, for the sake of simplicity we consider the original problem without inequality constraints. The general case with inequality constraints (or for conic programs) has a similar derivation. Using the definition of the Lagrange function yields

$$\sup_{\xi \in \mathbb{R}^m} L(x, \xi) = \sup_{\xi \in \mathbb{R}^m} \{ F_0(x) + \langle \xi, Ax - y \rangle \} = \begin{cases} F_0(x) & \text{if } Ax = y, \\ \infty & \text{otherwise.} \end{cases}$$

Therefore, if $x$ is not feasible, then the above supremum is infinite. On one hand, a feasible minimizer of the primal problem satisfies $F_0(x^\#) = \inf_{x \in \mathbb{R}^N} \sup_{\xi \in \mathbb{R}^m} L(x, \xi)$. On the other hand, a dual optimal satisfies $H(\xi^\#) = \sup_{\xi \in \mathbb{R}^m} \inf_{x \in \mathbb{R}^N} L(x, \xi)$, by definition of the Lagrange dual function. Hence, weak duality reads as
$$\sup_{\xi \in \mathbb{R}^m} \inf_{x \in \mathbb{R}^N} L(x, \xi) \leq \inf_{x \in \mathbb{R}^N} \sup_{\xi \in \mathbb{R}^m} L(x, \xi),$$

and the same expression with equality instead of inequality corresponds to strong duality. This tells us that we can exchange maximization and minimization if strong duality holds. This is the saddle-point property. It is the same as saying that a primal-dual optimal pair $(x^\#, \xi^\#)$ satisfies
$$L(x^\#, \xi) \leq L(x^\#, \xi^\#) \leq L(x, \xi^\#) \quad \text{for all } x \in \mathbb{R}^N, \; \xi \in \mathbb{R}^m.$$
The saddle-point property says that, provided strong duality holds, finding a saddle point of the Lagrange function is equivalent to jointly optimizing the primal and the dual problems. We will use it in Theorem 2.5. The next theorem, stated here in a simplified version and known as the Slater condition, provides a sufficient condition for strong duality to hold; for a proof, see Section 5.3 of [Boyd & Vanderberghe ’04].
Theorem 2.4 ([Slater ’50]). Assume that $F_0, F_1, \ldots, F_M$ are convex functions with $\operatorname{dom}(F_0) = \mathbb{R}^N$. If there exists $x \in \mathbb{R}^N$ such that $Ax = y$ and $F_\ell(x) < b_\ell$ for all $\ell \in [M]$, then strong duality holds for the optimization problem (2.3). In the absence of inequality constraints, strong duality holds if there exists $x \in \mathbb{R}^N$ with $Ax = y$.

Remark 11. Duality theory for conic problems is more involved; see Section 4.3 of [Ruszczynski].
After this digression about optimization, we can state the main strategy for solving the problem of finding the sparsest vector satisfying the linear system of measurements. It is called Basis Pursuit or $\ell_1$-minimization and is interpreted as the convex relaxation of $(P_0)$.
Basis Pursuit (BP)
Input: measurement matrix $A$, measurement vector $y$.
Instructions:

$$x^\# = \underset{z \in \mathbb{C}^N}{\operatorname{argmin}} \|z\|_1 \quad \text{subject to} \quad Az = y. \qquad (P_1)$$
Output: the vector $x^\#$.
Note that we called Basis Pursuit a strategy and not an algorithm, because we did not say how to implement it. In fact, there are many ways of doing so. Basis Pursuit is a linear program in the real case and a second-order cone program in the complex case, as we will show below. Therefore general-purpose optimization algorithms such as the Simplex Method or Interior-Point Methods can be used. In particular, [Chen & Donoho ’94] points out that “Basis Pursuit is only thinkable because of recent advances in linear programming via ‘interior point’ methods”. We refer to [Nesterov & Nemirovskii ’94]² for details about this technique and to [Kim, Koh, Lustig, Boyd & Gorinevsky ’08] for its applications to $\ell_1$-minimization and sparse recovery. There are also algorithms specifically designed for $\ell_1$-minimization; there is a discussion in the community about how much faster they are compared to general-purpose algorithms. We can cite the Homotopy Method, developed by [Donoho & Tsaig ’08]³, Iteratively Reweighted Least Squares, developed by [Daubechies, DeVore, Fornasier & Güntürk ’10], and Primal-Dual Algorithms, developed by [Chambolle & Pock ’11], among many other direct or iterative methods. These three main algorithms are described in Chapter 15 of [Rauhut & Foucart ’13]. Besides, it is important to note that there are now many solvers available for Basis Pursuit. One can cite [$\ell_1$-MAGIC], [YALL1], [Fast $\ell_1$], [NESTA], [GPSR] and [L1-LS]⁴. A user-friendly package for Convex Optimization in Python is [CVXPY], where many of the techniques mentioned above are implemented. The work [Lorenz, Pfetsch & Tillmann ’15] studies and compares some solvers and heuristics for Basis Pursuit. When we have measurement errors, we replace $Ax = y$ by $\|Ax - y\|_2 \leq \eta$, and then the problem
$$\min_{z \in \mathbb{C}^N} \|z\|_1 \quad \text{subject to} \quad \|Az - y\|_2 \leq \eta, \qquad (P_{1,\eta})$$
²The papers [Wright ’04] and [Gondzio ’12] are very interesting for a historical perspective on Interior-Point Methods.
³In the Statistics community, this algorithm was independently developed by [Efron, Hastie, Johnstone & Tibshirani ’04] and is known as Least Angle Regression (LARS).
⁴The most complete (albeit disorganized and not up-to-date) list of $\ell_1$-minimization solvers can be found in Section 4 of https://sites.google.com/site/igorcarron2/cs.

is known as Quadratically Constrained Basis Pursuit. There are also two very relevant, related problems. The first one is Basis Pursuit Denoising (BPDN), which consists in solving, for some parameter $\lambda \geq 0$,

$$\min_{z \in \mathbb{C}^N} \lambda \|z\|_1 + \|Az - y\|_2^2. \qquad (2.7)$$

The second one is the LASSO, the Least Absolute Shrinkage and Selection Operator, which consists in solving, for some parameter $\tau \geq 0$,
$$\min_{z \in \mathbb{C}^N} \|Az - y\|_2 \quad \text{subject to} \quad \|z\|_1 \leq \tau. \qquad (2.8)$$

Basis Pursuit Denoising was introduced in [Chen & Donoho ’94] and the LASSO in [Tibshirani ’96]. These three problems are related, as the next theorem shows. For a unified treatment of them, one can see [Hastie, Tibshirani & Wainwright ’15].
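Since the Python package [CVXPY] was mentioned above, here is a minimal sketch (ours; the data and the parameters $\eta$ and $\tau$ are arbitrary) of how Basis Pursuit, Quadratically Constrained Basis Pursuit and the LASSO can be posed with it.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
m, N = 40, 100
A = rng.standard_normal((m, N)) / np.sqrt(m)
x_true = np.zeros(N)
x_true[rng.choice(N, 5, replace=False)] = rng.standard_normal(5)
y = A @ x_true

z = cp.Variable(N)

# Basis Pursuit (P1)
bp = cp.Problem(cp.Minimize(cp.norm1(z)), [A @ z == y])
bp.solve()
print("BP error:", np.linalg.norm(z.value - x_true))

# Quadratically Constrained Basis Pursuit (P1,eta), eta chosen arbitrarily
qcbp = cp.Problem(cp.Minimize(cp.norm1(z)), [cp.norm2(A @ z - y) <= 0.01])
qcbp.solve()

# LASSO (2.8), tau chosen arbitrarily
lasso = cp.Problem(cp.Minimize(cp.norm2(A @ z - y)), [cp.norm1(z) <= 1.0])
lasso.solve()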
Theorem 2.5. We have an equivalence between LASSO, Basis Pursuit Denoising and Quadratically Constrained Basis Pursuit as the following three statements show.
i. If x is a minimizer of the Basis Pursuit Denoising with λ > 0, then there exists η = η(x) such that x is a minimizer of the Quadratically Constrained Basis Pursuit.
ii. If $x$ is the unique minimizer of Quadratically Constrained Basis Pursuit with $\eta \geq 0$, then there exists $\tau = \tau(x) \geq 0$ such that $x$ is the unique minimizer of the LASSO.

iii. If $x$ is the unique minimizer of the LASSO with $\tau > 0$, then there exists $\lambda = \lambda(x) \geq 0$ such that $x$ is a minimizer of Basis Pursuit Denoising.
Proof. i.) Let $\eta = \|Ax - y\|_2$ and consider $z \in \mathbb{C}^N$ such that $\|Az - y\|_2 \leq \eta$. Using the fact that $x$ is a minimizer of (2.7), we have
$$\lambda \|x\|_1 + \|Ax - y\|_2^2 \leq \lambda \|z\|_1 + \|Az - y\|_2^2 \leq \lambda \|z\|_1 + \|Ax - y\|_2^2.$$

After some simplifications, we obtain $\|x\|_1 \leq \|z\|_1$, and then $x$ is a minimizer of Quadratically Constrained Basis Pursuit.
ii.) Set $\tau = \|x\|_1$ and consider $z \in \mathbb{C}^N$, $z \neq x$, such that $\|z\|_1 \leq \tau$. Since $x$ is, by hypothesis, the unique minimizer of Quadratically Constrained Basis Pursuit, $z$ cannot satisfy the constraint of $(P_{1,\eta})$. Hence $\|Az - y\|_2 > \eta \geq \|Ax - y\|_2$. This shows that $x$ is the unique minimizer of (2.8).
iii.) This part is more elaborate, and we will prove a more general result, where we replace the $\ell_2$-norm by any other norm. The theorem will be proved in the real case, but it also holds in the complex setting, just by interpreting $\mathbb{C}^N$ as $\mathbb{R}^{2N}$. Let $\|\cdot\|$ be a norm on $\mathbb{R}^m$ and $|\!|\!|\cdot|\!|\!|$ a norm on $\mathbb{R}^N$. For $A \in \mathbb{R}^{m \times N}$, $y \in \mathbb{R}^m$ and $\tau > 0$, the general LASSO is given by
$$\min_{x \in \mathbb{R}^N} \|Ax - y\| \quad \text{subject to} \quad |\!|\!|x|\!|\!| \leq \tau, \qquad (2.9)$$

while the general Basis Pursuit Denoising is given by
$$\min_{x \in \mathbb{R}^N} \lambda |\!|\!|x|\!|\!| + \|Ax - y\|^2. \qquad (2.10)$$

Therefore we will prove that if $x^\#$ is a minimizer of the general LASSO, then there exists $\lambda = \lambda(x^\#) \geq 0$ such that $x^\#$ is a minimizer of the general Basis Pursuit Denoising. First of all, the general LASSO is obviously equivalent to
$$\min_{x \in \mathbb{R}^N} \|Ax - y\|^2 \quad \text{subject to} \quad |\!|\!|x|\!|\!| \leq \tau. \qquad (2.11)$$

The Lagrange function for this problem is given by
$$L(x, \xi) = \|Ax - y\|^2 + \xi (|\!|\!|x|\!|\!| - \tau). \qquad (2.12)$$

For $\tau > 0$, there exist vectors $x$ with $|\!|\!|x|\!|\!| < \tau$. Then, by Theorem 2.4, we have strong duality for (2.9), so the existence of a dual optimal $\xi^\# \geq 0$ is guaranteed. The saddle-point property ensures that $L(x^\#, \xi^\#) \leq L(x, \xi^\#)$ for all $x \in \mathbb{R}^N$. Therefore $x^\#$ is a minimizer of $x \mapsto L(x, \xi^\#)$. Since the constant term $-\xi^\# \tau$ does not affect the minimizer, $x^\#$ is also a minimizer of $\|Ax - y\|^2 + \xi^\# |\!|\!|x|\!|\!|$, and the conclusion follows with $\lambda = \xi^\#$. □
The parameters $\eta$, $\lambda$ and $\tau$ can be seen as regularization parameters for the solution of the linear system. Theorem 2.5 shows that the transformation between the parameters depends on the minimizer; therefore we would need to solve the optimization problems before knowing them. For this reason, this equivalence is useless for practical purposes. Finding appropriate parameters is a highly nontrivial task, and the topic is widely discussed in the literature; see [Yu & Feng ’14] and references therein. Also, problems related to how this kind of optimization technique can lead to false discoveries are discussed in [Su, Bogdan & Candès ’15].

Basis Pursuit is a linear program in the real case and a second-order cone program in the complex case. For the real case, let us introduce the variables $x^+, x^- \in \mathbb{R}^N$. For $x \in \mathbb{R}^N$, let

$$x_j^+ = \begin{cases} x_j & \text{if } x_j > 0, \\ 0 & \text{if } x_j \leq 0, \end{cases} \qquad x_j^- = \begin{cases} 0 & \text{if } x_j > 0, \\ -x_j & \text{if } x_j \leq 0. \end{cases}$$

Hence, problem $(P_1)$ is equivalent to the following linear optimization problem in the variables $x^+, x^- \in \mathbb{R}^N$:
$$\min_{x^+, x^- \in \mathbb{R}^N} \sum_{i=1}^{N} (x_i^+ + x_i^-) \quad \text{subject to} \quad [A \;\; -A] \begin{bmatrix} x^+ \\ x^- \end{bmatrix} = y, \quad \begin{bmatrix} x^+ \\ x^- \end{bmatrix} \geq \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \qquad (P_1')$$

With a solution $(x^+)^\#, (x^-)^\#$ of this problem in hand, we get the solution of the original problem $(P_1)$ just by taking $x^\# = (x^+)^\# - (x^-)^\#$. This consideration allows us to conclude that we have, in fact, a linear program.

In the complex setting, we will look, instead, at the more general Quadratically Constrained Basis Pursuit. Given a vector $z \in \mathbb{C}^N$, we introduce its real and imaginary parts $u, v \in \mathbb{R}^N$ and also a vector $c \in \mathbb{R}^N$ such that $c_j \geq |z_j| = \sqrt{u_j^2 + v_j^2}$ for all $j \in [N]$. The problem $(P_{1,\eta})$ will be equivalent to the following optimization problem with variables $c, u, v \in \mathbb{R}^N$:

$$\min_{c, u, v \in \mathbb{R}^N} \sum_{j=1}^{N} c_j \quad \text{subject to} \quad \left\| \begin{bmatrix} \operatorname{Re}(A) & -\operatorname{Im}(A) \\ \operatorname{Im}(A) & \operatorname{Re}(A) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} - \begin{bmatrix} \operatorname{Re}(y) \\ \operatorname{Im}(y) \end{bmatrix} \right\|_2 \leq \eta, \quad \sqrt{u_1^2 + v_1^2} \leq c_1, \; \ldots, \; \sqrt{u_N^2 + v_N^2} \leq c_N. \qquad (P_{1,\eta}')$$

This is a second-order cone problem. Given a solution $(c^\#, u^\#, v^\#)$, the solution of the original problem is $x^\# = u^\# + i v^\#$. When we take $\eta = 0$, we recover the solution of $(P_1)$ in the complex case.

Why is Basis Pursuit a good alternative to the original combinatorial problem $(P_0)$? Because, in the real case, $\ell_1$-minimizers are sparse, as we will show now. In Chapter 3 this will be fully explored through the notion of the Null Space Property for measurement matrices.
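The real-case linear program $(P_1')$ can also be handed to a generic LP solver. A minimal sketch using scipy.optimize.linprog (ours; this is not one of the specialized solvers cited above):

import numpy as np
from scipy.optimize import linprog

def basis_pursuit_lp(A, y):
    # solve (P1) for real data via (P1'): minimize sum(x+ + x-)
    # subject to [A  -A][x+; x-] = y and x+, x- >= 0
    m, N = A.shape
    c = np.ones(2 * N)                        # objective: sum of all variables
    A_eq = np.hstack([A, -A])                 # [A  -A]
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    return res.x[:N] - res.x[N:]              # x = x+ - x-

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 80))
x_true = np.zeros(80); x_true[[5, 40, 66]] = [2.0, -1.0, 0.5]
x_hat = basis_pursuit_lp(A, A @ x_true)
print(np.linalg.norm(x_hat - x_true))         # close to zero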
Theorem 2.6. Let $A \in \mathbb{R}^{m \times N}$ be a measurement matrix with columns $a_1, \ldots, a_N$. Assuming the uniqueness of a minimizer $x^\#$ of
$$\min_{x \in \mathbb{R}^N} \|x\|_1 \quad \text{subject to} \quad Ax = y,$$

then the set $\{a_j, j \in \operatorname{supp}(x^\#)\}$ is linearly independent, and in particular $\|x^\#\|_0 = \#(\operatorname{supp}(x^\#)) \leq m$.

Proof. Suppose, by contradiction, that the set $\{a_j, j \in S\}$ is linearly dependent, where $S = \operatorname{supp}(x^\#)$. This means that there exists a nonzero vector $v \in \mathbb{R}^N$ supported on $S$ such that $Av = 0$. By uniqueness of $x^\#$, for any $t \neq 0$,

$$\|x^\#\|_1 < \|x^\# + tv\|_1 = \sum_{j \in S} |x_j^\# + t v_j| = \sum_{j \in S} \operatorname{sgn}(x_j^\# + t v_j)(x_j^\# + t v_j).$$

Now, we need to understand what happens with the sign of a number when we add another number to it. If $|b| < |a|$, then $\operatorname{sgn}(a + b) = \operatorname{sgn}(a)$. Therefore, when $|t| < \min_{j \in S} |x_j^\#| / \|v\|_\infty$, we have

$$\operatorname{sgn}(x_j^\# + t v_j) = \operatorname{sgn}(x_j^\#) \quad \text{for all } j \in S.$$

It follows that, for $t \neq 0$ with $|t| < \min_{j \in S} |x_j^\#| / \|v\|_\infty$,
$$\|x^\#\|_1 < \sum_{j \in S} \operatorname{sgn}(x_j^\# + t v_j)(x_j^\# + t v_j) = \sum_{j \in S} \operatorname{sgn}(x_j^\#)(x_j^\# + t v_j) = \sum_{j \in S} \operatorname{sgn}(x_j^\#) x_j^\# + t \sum_{j \in S} \operatorname{sgn}(x_j^\#) v_j$$
$$= \|x^\#\|_1 + t \sum_{j \in S} \operatorname{sgn}(x_j^\#) v_j.$$

This is a contradiction, because we can always choose a small $t \neq 0$ such that $t \sum_{j \in S} \operatorname{sgn}(x_j^\#) v_j \leq 0$. □

Remark 12. In the complex case, the situation is more delicate. Consider the vector $x = [1, e^{i2\pi/3}, e^{i4\pi/3}]$ and the measurement matrix
$$A = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{bmatrix}.$$
This vector is the unique minimizer of the problem $\min_{z \in \mathbb{C}^3} \|z\|_1$ subject to $Az = Ax$. See Section 3.1 of [Rauhut & Foucart ’13]. Hence, in the complex setting, $\ell_1$-minimization does not necessarily provide $m$-sparse solutions, where $m$ is the number of measurements. In order to better understand when this occurs, we need the theory of the next chapters, such as the Null Space Property.
Quadratically Constrained Basis Pursuit relies on $\|Ax - y\|_2 \leq \eta$, that is, the noise is bounded in the $\ell_2$-norm regardless of structure or prior information about it. So, in the case of Gaussian noise, which is unbounded, the error in sparse vector recovery typically does not decay as the number of measurements increases; see pages 29 and 30 of [Boche et al. ’15]. In order to deal with this problem, another type of $\ell_1$-minimization algorithm was proposed by [Candès & Tao III ’06], called the Dantzig Selector⁵. It is described by
$$\min_{z \in \mathbb{C}^N} \|z\|_1 \quad \text{subject to} \quad \|A^*(Az - y)\|_\infty \leq \tau. \qquad (DS)$$

The heuristic of this method is based on the fact that the residual $r = Az - y$ should have small correlation with all columns $a_j$ of the matrix $A$. Indeed, the constraint is $\|A^*(Az - y)\|_\infty = \max_{j \in [N]} |\langle r, a_j \rangle| \leq \tau$. There is a theory for the Dantzig Selector similar to the one developed throughout this dissertation for Basis Pursuit, which is sometimes more complicated; see Chapter 8 of [Elad ’10], [Hastie, Tibshirani & Wainwright ’15], [Meinshausen, Rocha & Yu ’07], [James, Radchenko & Lv ’09], [Lv & Fan ’09], [Zhang ’09] and [Bickel, Ritov & Tsybakov ’09].
⁵As Tao points out in [Tao’s Blog - 22/03/2008], “we called it the Dantzig selector, due to its reliance on the linear programming methods to which George Dantzig, who had died as we were finishing our paper, had contributed so much”.
2.3 Greedy Algorithms
The basic idea of a greedy algorithm in Compressive Sensing is to update the target vector in such a way that, at each iteration, the algorithm adds one index to the target support and then finds the vector that best fits the measurements with the given support. This kind of strategy justifies the name of this class of algorithms. The most famous algorithm in this context is the Orthogonal Matching Pursuit, sometimes called the Orthogonal Greedy Algorithm. It has been rediscovered many times in different fields. We can trace its origins back to [Chen, Billings & Luo ’89], in the context of system identification in Control Theory. Independently, it was introduced and analyzed in the context of Time-Frequency Analysis by [Davies, Mallat & Zhang ’94] and [Pati, Rezaiifar & Krishnaprasad ’93]. Here is the formal description of the algorithm.
Orthogonal Matching Pursuit (OMP)
Input: measurement matrix $A$, measurement vector $y$.
Initialization: $S^0 = \emptyset$, $x^0 = 0$.
Iteration: repeat until a stopping criterion is met at $n = \bar{n}$:
$$S^{n+1} = S^n \cup \{j_{n+1}\}, \qquad j_{n+1} = \underset{j \in [N]}{\operatorname{argmax}} \, |(A^*(y - Ax^n))_j|, \qquad (\mathrm{OMP}_1)$$
$$x^{n+1} = \underset{z \in \mathbb{C}^N}{\operatorname{argmin}} \{ \|y - Az\|_2 : \operatorname{supp}(z) \subset S^{n+1} \}. \qquad (\mathrm{OMP}_2)$$

Output: the $\bar{n}$-sparse vector $x^\# = x^{\bar{n}}$.
To identify the signal $x$, we need to determine which columns of $A$ participate in the measurement vector. Informally speaking, OMP is a greedy algorithm that selects, at each step, the column that is most correlated with the current residual. This column is then included in the set of selected columns. The algorithm updates the residual by projecting the observation onto the linear subspace spanned by the columns that have already been selected, and then it iterates. The main goal of the introduction of this algorithm was to improve the already existing Matching Pursuit⁶ by forcing it to converge, for finite-dimensional signals, in a finite number of steps, which does not happen for the original (nonorthogonal) Matching Pursuit, as Theorem 4.1 of [DeVore & Temlyakov ’96] and the discussion in Section 2.3.2 of [Chen, Donoho & Saunders ’01] show. In general, the residuals of Orthogonal MP decrease faster than those of nonorthogonal MP. However, this improvement has some disadvantages: for example, the orthogonal projection procedure can yield unstable expansions by selecting ill-conditioned elements during the run of the algorithm. Moreover, Orthogonal MP also requires many more operations than nonorthogonal pursuits, due to the Gram–Schmidt orthogonalization process.
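A direct implementation of $(\mathrm{OMP}_1)$–$(\mathrm{OMP}_2)$ takes only a few lines. For clarity, the sketch below (ours) stops after a prescribed number of iterations and re-solves the full least-squares problem at each step, instead of updating a QR decomposition as discussed next.

import numpy as np

def omp(A, y, s):
    # Orthogonal Matching Pursuit: run s iterations of (OMP1)-(OMP2)
    m, N = A.shape
    S = []                                   # current support S^n
    x = np.zeros(N, dtype=A.dtype)
    for _ in range(s):
        residual = y - A @ x
        j = np.argmax(np.abs(A.conj().T @ residual))      # (OMP1)
        if j not in S:
            S.append(j)
        u, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)   # (OMP2)
        x = np.zeros(N, dtype=A.dtype)
        x[S] = u
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 200))
A /= np.linalg.norm(A, axis=0)               # l2-normalized columns
x_true = np.zeros(200); x_true[[7, 60, 150]] = [1.0, -0.7, 2.0]
print(np.linalg.norm(omp(A, A @ x_true, s=3) - x_true))   # ~ 0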
Remark 13. [Davies, Mallat & Avellaneda ’97] showed exponential convergence and made a detailed comparison of both greedy algorithms in terms of theoretical complexity and practical performance. For a proof and discussion of the decay rate, see Chapter 3 of [Elad ’10].
It is important to note that the connection between sparse approximation and greedy algorithms was first made by [Tropp '04]. His results showed that, under some conditions on the coherence of the matrix (see Theorem 4.41), OMP works for sparse recovery in the noiseless case. [Cai, Wang & Xu III '10] proved that this condition is sharp, and the performance of OMP in the noisy case was analyzed by [Cai & Wang '11].
6 This algorithm was introduced by [Mallat & Zhang '93] in the Signal Processing community and by [Friedman & Stuetzle '81] in the Statistics community under the name Projection Pursuit Regression.
As we can see in the description of OMP, the costlier part is the projection step (OMP2). Typically, a QR-decomposition of $A_{S^n}$ is used at each step, together with efficient methods for updating the QR-decomposition when a column is added to the matrix; see Section 3.2 of [Björck '96]. Moreover, fast vector-matrix multiplication via the FFT can be performed in (OMP1) for the computation of $A^*(y - Ax^n)$. We will briefly describe how this is done for a general least-squares problem. For any matrix $A \in \mathbb{C}^{m \times n}$ with $m \ge n$, there exist a unitary matrix $Q \in \mathbb{C}^{m \times m}$ and an upper triangular matrix $R \in \mathbb{C}^{n \times n}$ such that
\[ A = Q \begin{pmatrix} R \\ 0 \end{pmatrix}. \]
So, for a general least-squares problem of minimizing $\|Ax - y\|_2$ over all $x \in \mathbb{C}^n$, using that $Q$ is unitary, we have
\[ \|Ax - y\|_2 = \|Q^*Ax - Q^*y\|_2 = \left\| \begin{pmatrix} R \\ 0 \end{pmatrix} x - Q^*y \right\|_2. \tag{2.13} \]
Partitioning $b = (b_1, b_2) = Q^*y$ with $b_1 \in \mathbb{C}^n$, the right-hand side of (2.13) will be minimized when we solve the triangular system $Rx = b_1$, and this can be done with a simple backward elimination. Therefore, at each iteration of OMP, we need to solve a least-squares problem with a new column added to the matrix $A$; a sketch of this QR-based solution is given after Lemma 2.7 below.
We now discuss the choice of the index $j_{n+1}$, as stated in (OMP1). It is given by a greedy strategy whose objective is to reduce the $\ell_2$-norm of the residual $y - Ax^n$ as much as possible at each iteration. Lemma 2.7 explains why an index $j$ maximizing $|(A^*(y - Ax^n))_j|$ is a good candidate for this large decrease of the residual. We just need to apply it to $S = S^n$ and $v = x^n$.
Lemma 2.7. Let $A \in \mathbb{C}^{m \times N}$ be a matrix with $\ell_2$-normalized columns. Given $S \subset [N]$, a vector $v$ supported on $S$ and an index $j \in [N]$, if
\[ w = \operatorname*{argmin}_{z \in \mathbb{C}^N} \big\{ \|y - Az\|_2,\ \operatorname{supp}(z) \subset S \cup \{j\} \big\}, \]
then
\[ \|y - Aw\|_2^2 \le \|y - Av\|_2^2 - |(A^*(y - Av))_j|^2. \]
Proof. Note that any vector $v + te_j$, with $t \in \mathbb{C}$, is supported on $S \cup \{j\}$. By the definition of $w$, we have
\[ \|y - Aw\|_2^2 \le \min_{t \in \mathbb{C}} \|y - A(v + te_j)\|_2^2. \]
The idea is then to work in polar coordinates and choose proper radial and angular numbers to represent $t$. Indeed, writing $t = \rho e^{i\theta}$, with $\rho \ge 0$ and $\theta \in [0, 2\pi)$, we estimate
\begin{align*}
\|y - A(v + te_j)\|_2^2 &= \|y - Av - tAe_j\|_2^2 = \|y - Av\|_2^2 + |t|^2 \|Ae_j\|_2^2 - 2\,\mathrm{Re}\,\bar{t}\langle y - Av, Ae_j \rangle \\
&= \|y - Av\|_2^2 + \rho^2 - 2\,\mathrm{Re}\big( \rho e^{-i\theta} (A^*(y - Av))_j \big) \ge \|y - Av\|_2^2 + \rho^2 - 2\rho\, |(A^*(y - Av))_j|,
\end{align*}
with equality for a properly chosen $\theta$. The right-hand side of this inequality is a quadratic polynomial in $\rho$, and this expression is minimized when $\rho = |(A^*(y - Av))_j|$. Therefore this leads to
\[ \min_{t \in \mathbb{C}} \|y - A(v + te_j)\|_2^2 \le \|y - Av\|_2^2 - |(A^*(y - Av))_j|^2, \]
which concludes the proof.
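The QR-based solution of the least-squares subproblem described before Lemma 2.7 can be sketched as follows. This is a from-scratch factorization, not the column-update scheme of [Björck '96]; it assumes NumPy and SciPy.

```python
# Solving min_x ||Ax - y||_2 via A = Q[R; 0], as in (2.13):
# back-substitution on the triangular system R x = b1, where b = Q* y.
import numpy as np
from scipy.linalg import solve_triangular

def lstsq_qr(A, y):
    Q, R = np.linalg.qr(A, mode="reduced")   # R upper triangular, Q*Q = Id
    b1 = Q.conj().T @ y                      # first block of b = Q* y
    return solve_triangular(R, b1)           # backward elimination
```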
Another interesting point is that (OMP2) is equivalent to solving the normal equations of least squares. In fact, it also reads as $x^{n+1}_{S^{n+1}} = (A_{S^{n+1}}^* A_{S^{n+1}})^{-1} A_{S^{n+1}}^* y = A_{S^{n+1}}^\dagger y$, where $x^{n+1}_{S^{n+1}}$ denotes the restriction of $x^{n+1}$ to the support set $S^{n+1}$. This is justified by the following lemma.
Lemma 2.8. Given an index set $S \subset [N]$, if
\[ v = \operatorname*{argmin}_{z \in \mathbb{C}^N} \big\{ \|y - Az\|_2,\ \operatorname{supp}(z) \subset S \big\}, \]
then
\[ (A^*(y - Av))_S = 0. \]
Proof. The definition of $v$ tells us that the vector $Av$ is the orthogonal projection of $y$ onto the space $\{Az,\ \operatorname{supp}(z) \subset S\}$. This means that
\[ \langle y - Av, Az \rangle = 0 \quad \text{for all } z \in \mathbb{C}^N \text{ with } \operatorname{supp}(z) \subset S. \]
This orthogonality condition is equivalent to $\langle A^*(y - Av), z \rangle = 0$ for all $z \in \mathbb{C}^N$ with $\operatorname{supp}(z) \subset S$, which is the same as $(A^*(y - Av))_S = 0$.
Lemma 2.8 is important, as it suggests solving the normal equations instead of using a QR decomposition. In this case, a direct method, such as the Cholesky decomposition, or an iterative method, such as the Conjugate Gradient method, may be used. For details on when the use of such methods is advantageous, see Chapters 6 and 7 of [Björck '96].
The most natural stopping criterion for OMP is $Ax^{\bar{n}} = y$. In applications, where we have measurement and rounding errors, this needs to be changed to $\|y - Ax^{\bar{n}}\|_2 \le \varepsilon$ or $\|A^*(y - Ax^{\bar{n}})\|_\infty \le \varepsilon$ for some tolerance parameter $\varepsilon > 0$.
If we have a priori estimates for the sparsity of the signals, we can use this information to provide $\bar{n} = s$ as a stopping criterion, because the target vector $x^{\bar{n}}$ is $\bar{n}$-sparse. In fact, in the case where $A$ is an orthogonal matrix, this forces the algorithm to successfully recover the sparse $x \in \mathbb{C}^N$ from $y = Ax$. Indeed, from $x^n_{S^n} = A_{S^n}^\dagger y$, we see that the vector $x^n$ will be the $n$-sparse vector consisting of the $n$ largest entries of $x$. In the general case, the success of OMP for recovering $s$-sparse vectors using $s$ iterations is characterized by the following theorem.
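Following Lemma 2.8, the projection step can be computed from the normal equations. Here is a minimal sketch using SciPy's Cholesky routines; it is adequate when $A_S$ is injective and reasonably well conditioned, and the function name is our own.

```python
# Solving (A_S* A_S) x_S = A_S* y by a Cholesky factorization (a sketch).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lstsq_normal_equations(A_S, y):
    G = A_S.conj().T @ A_S        # Gram matrix; positive definite when A_S is injective
    b = A_S.conj().T @ y
    return cho_solve(cho_factor(G), b)
```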
Theorem 2.9. Given a matrix $A \in \mathbb{C}^{m \times N}$, every nonzero vector $x \in \mathbb{C}^N$ supported on a set $S$ of size $s$ is recovered from $y = Ax$ after at most $s$ iterations of Orthogonal Matching Pursuit if and only if the matrix $A_S$ is injective and
\[ \max_{j \in S} |(A^* r)_j| > \max_{\ell \in \bar{S}} |(A^* r)_\ell| \tag{2.14} \]
for all nonzero $r \in \{Az,\ \operatorname{supp}(z) \subset S\}$.
Proof. First, assume that OMP recovers every vector supported on $S$ in at most $s = \#S$ iterations. Therefore, any two vectors $x_1, x_2$ having support contained in $S$ and the same measurement vector $y = Ax_1 = Ax_2$ must coincide, which implies that $A_S$ is injective. Moreover, we saw that an index chosen at the first iteration always remains in the target support; hence, if $y = Ax$ for some $x \in \mathbb{C}^N$ exactly supported on $S$, then an index $\ell \in \bar{S}$ cannot be chosen at the first iteration, and $\max_{j \in S} |(A^*y)_j| > |(A^*y)_\ell|$. We use this to conclude
\[ \max_{j \in S} |(A^*y)_j| > \max_{\ell \in \bar{S}} |(A^*y)_\ell| \quad \text{for all nonzero } y \in \{Az,\ \operatorname{supp}(z) \subset S\}. \]
We have established the two necessary conditions. Now we prove that they are also sufficient. Assume that $Ax^1 \ne y, \ldots, Ax^{s-1} \ne y$, because otherwise there is nothing to do. We need to prove that $S^n$ is a subset of $S$ of size $n$ for any $0 \le n \le s$, and from this we will deduce that $S^s = S$. Using this with $Ax^s = y =: Ax$, given by (OMP2), allows us to conclude that $x^s = x$, because $A_S$ is injective by hypothesis. We thus need to establish our claim using (2.14). We prove by induction that $S^n \subset S$. We begin with $S^0 = \emptyset \subset S$. Given $0 \le n \le s - 1$, we see that $S^n \subset S$ yields $r^n = y - Ax^n \in \{Az,\ \operatorname{supp}(z) \subset S\}$. Then, by (2.14), the index $j_{n+1}$ must lie in $S$. Hence $S^{n+1} = S^n \cup \{j_{n+1}\} \subset S$. This proves that $S^n \subset S$ for all $0 \le n \le s$.
Now, for $1 \le n \le s - 1$, Lemma 2.8 implies that $(A^* r^n)_{S^n} = 0$. If $j_{n+1} \in S^n$, then $A^* r^n = 0$, and by (2.14) we could conclude that $r^n = 0$, since the strict inequality must hold for any nonzero vector. This cannot happen, since we assumed $Ax^1 \ne y, \ldots, Ax^{s-1} \ne y$. Therefore $j_{n+1} \notin S^n$; that is, at each iteration a new index, different from all previous indexes, is added to the target support. This inductively proves that $S^n$ is a set of size $n$.
As discussed by [Chen, Donoho & Saunders '01], "because the algorithm is myopic, one expects that, in certain cases, it might choose wrongly in the first few iterations and, in such cases, end up spending most of its time correcting for any mistakes made in the first few terms. In fact this does seem to happen." This phenomenon will be illustrated in Section 5.5, after Theorem 5.22. Despite these pathological negative results, there is theoretical evidence, as well as empirical results, suggesting that OMP can recover an $s$-sparse signal when the number of measurements $m$ is nearly proportional to $s$, as the next theorem shows.
Theorem 2.10. (Theorem 2 of [Tropp & Gilbert '07]): Fix $\delta \in (0, 0.36)$ and choose $m > Cs\log(N/\delta)$. Assume that $x \in \mathbb{R}^N$ is $s$-sparse and let $A \in \mathbb{R}^{m \times N}$ have i.i.d. entries from the Gaussian distribution $N(0, 1/m)$. Then, given the data $y = Ax$, Orthogonal Matching Pursuit can reconstruct the signal $x$ with probability exceeding $1 - 2\delta$. The constant $C$ satisfies $C \le 20$; for large $s$, it can be shown that $C \approx 4$ is enough.
These are not the only greedy algorithms for sparse approximation. Many corrections and modifications have been proposed; we can cite Subspace Pursuit [Dai & Milenkovic '09], Gradient Pursuit [Blumensath & Davies '08], Stagewise Orthogonal Matching Pursuit [Donoho, Tsaig, Drori & Starck '12] and Compressive Sampling Matching Pursuit, known as CoSaMP [Needell & Tropp '08]. There is a whole monograph devoted to the topic of greedy approximations and algorithms, see [Temlyakov '11], especially Chapter 5, where the problem of Compressive Sensing is treated.
2.4 Thresholding Algorithms
In order to recover sparse vectors, we need to understand the action of the measurement matrix on them. Thresholding algorithms rely on the fact that the inversion of the action of $A$ on sparse vectors can be approximated by the action of its adjoint $A^*$. Thus, the basic thresholding strategy is to determine the support of the $s$-sparse vector $x \in \mathbb{C}^N$ to be recovered from $y = Ax \in \mathbb{C}^m$ as the indices of the $s$ largest absolute entries of $A^*y$. After finding the support, we perform some least-squares technique in order to find the vector having this support that best fits the measurements. To describe the strategy, we need to introduce two operators. The first one denotes the best $s$-term approximation of a vector and the second denotes the support of this approximation:
\[ H_s(z) = z_{L_s(z)}, \tag{2.15} \]
\[ L_s(z) = \text{index set of the } s \text{ largest absolute entries of } z \in \mathbb{C}^N. \tag{2.16} \]
Next, we have the strategy using this cut-off idea.
Basic Thresholding Strategy
Input: measurement matrix $A$, measurement vector $y$, sparsity level $s$.
Instruction:
\[ S^\# = L_s(A^*y). \tag{BT1} \]
\[ x^\# = \operatorname*{argmin}_{z \in \mathbb{C}^N} \big\{ \|y - Az\|_2,\ \operatorname{supp}(z) \subset S^\# \big\}. \tag{BT2} \]
Output: the $s$-sparse vector $x^\#$.
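A direct transcription of (BT1)-(BT2) in NumPy looks as follows; this is only an illustrative sketch, with the function name and the dense least-squares call chosen by us.

```python
# A minimal sketch of the Basic Thresholding strategy (BT1)-(BT2).
import numpy as np

def basic_thresholding(A, y, s):
    S = np.argsort(np.abs(A.conj().T @ y))[-s:]          # (BT1): S# = L_s(A* y)
    x = np.zeros(A.shape[1], dtype=A.dtype)
    x[S] = np.linalg.lstsq(A[:, S], y, rcond=None)[0]    # (BT2): best fit on S#
    return x
```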
As with Theorem 2.9 for the greedy strategy, we have a necessary and sufficient condition for the recovery of $s$-sparse vectors through the Basic Thresholding strategy.
Proposition 2.11. A vector $x \in \mathbb{C}^N$ supported on a set $S$ is recovered from $y = Ax$ via the Basic Thresholding strategy if and only if $\min_{j \in S} |(A^*y)_j| > \max_{\ell \in \bar{S}} |(A^*y)_\ell|$.
Proof. The vector $x$ will be recovered if and only if the index set $S^\#$ defined in the strategy coincides with the set $S$, and this happens if and only if any entry of $A^*y$ on $S$ is greater in absolute value than any entry of $A^*y$ on $\bar{S}$.
This strategy seems too simple to have any chance of working. We can reformulate it and solve the system $Az = y$ using the prior information that the solution is $s$-sparse for some $s$. Actually, we will not solve the fat and short linear system $Az = y$; instead, we will solve the square system $A^*Az = A^*y$. This system can be reformulated as the fixed-point equation $z = (\mathrm{Id} - A^*A)z + A^*y$, which, in turn, can be solved by the iteration $x^{n+1} = (\mathrm{Id} - A^*A)x^n + A^*y$.
The thresholding strategy, applied to this fixed-point iteration, tells us to keep the $s$ largest absolute entries of $(\mathrm{Id} - A^*A)x^n + A^*y = x^n + A^*(y - Ax^n)$. If the $s$ largest entries are not uniquely defined, the algorithm selects the smallest possible indices. Formally, this is the content of the next algorithm. It was developed by [Blumensath & Davies II '08] and analyzed in [Blumensath & Davies '09]. Its analysis relies on the fact that the matrix $A^*A$ behaves like the identity when its domain and range are restricted to small support sets (as we shall see in Chapter 5), so that $x^{n+1} = x^n + A^*A(x - x^n) \approx x^n + x - x^n = x$. The authors of the algorithm argue that it has many good features, such as robustness to observation noise, near-optimal error guarantees, the requirement of a fixed number of iterations depending only on the logarithm of a form of signal-to-noise ratio of the signal, and a memory requirement which is linear in the problem size, among others.
Iterative Hard Thresholding (IHT)
Input: measurement matrix $A$, measurement vector $y$, sparsity level $s$.
Initialization: $s$-sparse vector $x^0$, typically $x^0 = 0$.
Iteration: repeat until a stopping criterion is met at $n = \bar{n}$:
\[ x^{n+1} = H_s\big(x^n + A^*(y - Ax^n)\big). \tag{IHT} \]
Output: the $s$-sparse vector $x^\# = x^{\bar{n}}$.
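The iteration (IHT) is short enough to sketch directly. The snippet below is illustrative only: the iteration count is a placeholder stopping criterion, and for convergence one typically assumes $\|A\|_2 \le 1$ (or introduces a step size), which we do not enforce here.

```python
# A minimal sketch of Iterative Hard Thresholding (IHT).
import numpy as np

def hard_threshold(z, s):
    # H_s: keep the s largest absolute entries of z, zero out the rest
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-s:]
    out[idx] = z[idx]
    return out

def iht(A, y, s, n_iter=200):
    x = np.zeros(A.shape[1], dtype=A.dtype)
    for _ in range(n_iter):
        x = hard_threshold(x + A.conj().T @ (y - A @ x), s)   # (IHT)
    return x
```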
In the class of greedy algorithms, the difference between Matching Pursuit and Orthogonal Matching Pursuit is that in the latter we decide to pay the price of an orthogonal projection. Here, we can do the same and look, at each iteration, for the vector with the same support as $x^{n+1}$ that best fits the measurements. This idea was explored by [Foucart '11], where Hard Thresholding Pursuit was developed as a fusion of IHT and the greedy algorithm CoSaMP. In fact, that work did more and created a family of thresholding algorithms indexed by an integer $k$; there, Iterative Hard Thresholding and Hard Thresholding Pursuit correspond to the cases $k = 0$ and $k = \infty$, respectively.
Hard Thresholding Pursuit (HTP)
Input: measurement matrix $A$, measurement vector $y$, sparsity level $s$.
Initialization: $s$-sparse vector $x^0$, typically $x^0 = 0$.
Iteration: repeat until a stopping criterion is met at $n = \bar{n}$:
\[ S^{n+1} = L_s\big(x^n + A^*(y - Ax^n)\big), \tag{HTP1} \]
\[ x^{n+1} = \operatorname*{argmin}_{z \in \mathbb{C}^N} \big\{ \|y - Az\|_2,\ \operatorname{supp}(z) \subset S^{n+1} \big\}. \tag{HTP2} \]
Output: the $s$-sparse vector $x^\# = x^{\bar{n}}$.
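HTP reuses the thresholded support selection of IHT and then pays for a projection, exactly as OMP does relative to MP. A minimal sketch, under the same illustrative conventions as the previous snippets:

```python
# A minimal sketch of Hard Thresholding Pursuit (HTP1)-(HTP2).
import numpy as np

def htp(A, y, s, n_iter=50):
    N = A.shape[1]
    x = np.zeros(N, dtype=A.dtype)
    for _ in range(n_iter):
        S = np.argsort(np.abs(x + A.conj().T @ (y - A @ x)))[-s:]   # (HTP1)
        x = np.zeros(N, dtype=A.dtype)
        x[S] = np.linalg.lstsq(A[:, S], y, rcond=None)[0]           # (HTP2)
    return x
```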
Recently, some theoretical results concerning HTP have been proved. [Bouchot, Foucart & Hitczenko '16], for example, proved that the number of iterations can be estimated independently of the shape of $x$ and is at most proportional to the sparsity $s$; for a precise statement, see Theorem 5 in that article. They also provided some benchmarks for HTP: in terms of success rate, its performance is roughly similar to OMP; however, in terms of number of iterations, HTP performs best, owing to the fact that we have prior knowledge of the sparsity $s$.
Clearly, in these thresholding algorithms we need an estimate of the sparsity of the vector. The class of Soft Thresholding Algorithms was developed to deal with situations in which we have no such estimate. There, we substitute the hard thresholding operator by the soft thresholding operator with threshold $\tau > 0$. This operator maps each entry $z_j$ of a vector $z \in \mathbb{C}^N$ to
\[ S_\tau(z_j) = \begin{cases} \operatorname{sgn}(z_j)(|z_j| - \tau), & \text{if } |z_j| \ge \tau, \\ 0, & \text{otherwise.} \end{cases} \]
One can cite as an example of such algorithms the Iterative Shrinkage-Thresholding Algorithm developed by [Daubechies, Defrise & De Mol '04]. This is a variation of the classical Landweber iteration, well known to the Inverse Problems community; see Section 2.3 of [Kirsch '11]. It is typically used in the context of image restoration, where it can deal with non-differentiable regularizers such as total-variation regularization and wavelet-based regularization [Bioucas-Dias & Figueiredo '07].
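For real vectors, the soft thresholding operator and an ISTA-type iteration can be sketched as follows. The step size bound $\mu \le 1/\|A\|_2^2$ is the usual choice for this kind of iteration; the parameter values and names are illustrative, not taken from the references above.

```python
# A minimal sketch of S_tau and of an ISTA-type iteration
# for the l1-regularized least-squares functional (real case).
import numpy as np

def soft_threshold(z, tau):
    # S_tau: shrink each entry towards zero by tau
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, tau, n_iter=500):
    mu = 1.0 / np.linalg.norm(A, 2) ** 2      # step size mu <= 1/||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + mu * A.T @ (y - A @ x), mu * tau)
    return x
```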
2.5 The Search For the Perfect Algorithm
Unfortunately, there is little guidance available on choosing a good technique for a given parameter regime. [Tropp & Gilbert ’07]
After presenting these three major classes of algorithms, we should ask which of them is better suited for a certain application. In other words, we need to answer the following question: given the sparsity level $s$, the number of measurements $m$ and the ambient dimension $N$, which algorithm should we use? In Compressive Sensing we seek uniform recovery results, that is, we aim to reconstruct every sparse vector with the same measurement matrix. Thus, if we have some prior information about the signals we are looking for, that is, if we have more structure beyond sparsity, can we take advantage of it in order to carefully select the algorithm?
These questions have not been fully answered yet. The number of articles related to numerical issues of sparse recovery is small compared to the development of theoretical guarantees. Important references in this area are [Elad '10], which is devoted to numerical investigations in Compressive Sensing, as well as [Pope '09], [Maleki & Donoho '10], [Lorenz, Pfetsch & Tillmann '15] and
[Blanchard & Tanner '15]. This last one was the first to perform large-scale empirical testing with more realistic, application-sized problems, typically with ambient dimension $N = 2^{18}$ or $N = 2^{20}$. This is particularly important in evaluating the behavior of algorithms in the extreme undersampling regime $m \ll N$. These analyses are beyond the scope of this dissertation. Even so, we can provide some elementary rules of thumb. For small sparsity $s$, OMP is typically the fastest option, because its speed depends on the number of iterations, which will be equal to $s$ if the algorithm succeeds. For mild sparsity (compared to $N$), thresholding algorithms are preferred, because their runtime is typically not affected by the sparsity level. Assigning rules of thumb for Basis Pursuit is more difficult, because it depends on how we implement this strategy. Due to the lack of a comprehensive numerical study, the following project remains open.
Open Project: Provide a computational investigation of the worst- and average-case behavior of algorithms in the three major classes: optimization methods, greedy methods and thresholding methods. Identify the regions of the problem size where these algorithms are able to reliably recover the sparsest solution. Construct phase transition diagrams for the probability of recovery for a given algorithm. Perform all the tests with matrices drawn from some probability distribution, like Gaussian or Bernoulli, and also with deterministic matrices.
Chapter 3
The Null Space Property
It is the hallmark of any deep truth that its negation is also a deep truth.1 Niels Bohr
3.1 Introduction
As discussed in Chapter 2, in this dissertation we are interested in the solution of fat and short linear systems. This is so because they model the simultaneous acquisition and (high) compression of data. As the matrices of these systems have nontrivial kernel, we must define a proper notion of solution, i.e., we must specify what properties this solution should have. This is the same as saying that finding a unique solution of an underdetermined linear system is impossible unless some additional information about the solution is given. Therefore, after this regularization procedure, there is hope of finding a unique (regularized) solution. As we are looking for parsimonious representations of signals, the natural approach is to look for the sparsest solution. In previous chapters, we argued that this approach makes sense, since many natural signals contain little information, as they are a combination of only a handful of vectors from an appropriate basis. In mathematical terms, we need to recover a sparse vector $x \in \mathbb{C}^N$ from its measurements $y = Ax \in \mathbb{C}^m$, where $m < N$. In order to do so, we need to solve the following optimization problem:
\[ \min_{z \in \mathbb{C}^N} \|z\|_0 \quad \text{subject to} \quad Az = y. \tag{P0} \]
This combinatorial problem is NP-hard, as Theorem 1.16 shows, so we need to develop some smart strategies in order to solve it. The most popular strategy is Basis Pursuit, or $\ell_1$-minimization, presented in Chapter 2. Therefore, our approach will be to solve the problem
\[ \min_{z \in \mathbb{C}^N} \|z\|_1 \quad \text{subject to} \quad Az = y. \tag{P1} \]
This chapter is devoted to understanding one property the matrix $A$ must have in order to guarantee that solutions of problem (P1) are solutions of problem (P0). It is the null space property (NSP), a necessary and sufficient condition for the exact reconstruction of a sparse vector $x$ from its measurements $y = Ax$, which we present in the next section. Moreover, we will define two further criteria which we expect from a recovery algorithm, namely stability and robustness, and show that the NSP also works for stable reconstruction with respect to sparsity defect and that it is robust to measurement error. Then, the problem of low-rank matrix recovery will be explored. This can be seen as a variation of the Compressive Sensing problem, and a generalization of the null space property will appear during the analysis of low-rank matrix recovery.
1 Quoted by Max Delbruck, "Mind from Matter? An Essay on Evolutionary Epistemology", Blackwell Scientific Publications, Palo Alto, CA, 1986; page 167.
3.2 The Null Space Property
As we are interested in recovering the sparsest solution, we need to know when, i.e., for which matrices, this can be done. Moreover, we also need to know when we have uniqueness of the recovered solution. However, finding a property of the matrix $A$ that implies that a solution of the (P1) problem is a solution of the (P0) problem was a nontrivial task. It took a little more than a decade, from the beginning of Basis Pursuit studies in the thesis [Chen '95], until the formal definition of the Null Space Property in the paper [Cohen, Dahmen & DeVore '09]. It is important to mention that this property had already implicitly appeared in [Donoho & Huo '01], [Elad & Bruckstein '02] and [Donoho & Elad '03]. The first place where this property was isolated is Lemma 1 of [Gribonval & Nielsen '03].
Definition 3.1. A matrix $A \in \mathbb{K}^{m \times N}$ is said to satisfy the null space property2 relative to a set $S \subset [N]$ if
\[ \|v_S\|_1 < \|v_{\bar{S}}\|_1 \quad \forall\, v \in \ker A \setminus \{0\}. \tag{NSP} \]
It is said to satisfy the null space property of order $s$ if it satisfies the null space property relative to any set $S \subset [N]$ with $\#S = s$.
Remark 14. This property tells us how the null space of the matrix $A$ should be oriented in order to touch the $\ell_1$-ball at just one point. This geometric interpretation will be explored in Theorem 3.12 and can be seen in Figure 3.1 below.
Figure 3.1: Proper null space of $A$ for uniqueness in the reconstruction of a sparse vector.
First of all, it is important to observe that if this condition holds for the index set $S$ of the $s$ largest (in absolute value) entries of $v$, then it holds for any other set $S$ with $\#S \le s$. Also, we can reformulate the definition of the null space property in two ways. The first one is obtained by adding $\|v_S\|_1$ to both sides of the inequality in the definition. Then, the null space property relative to a set $S$ becomes
\[ 2\|v_S\|_1 < \|v\|_1 \quad \forall\, v \in \ker A \setminus \{0\}. \tag{3.1} \]
For the second, one can add $\|v_{\bar{S}}\|_1$ to both sides of the inequality. Thus, the null space property of order $s$ becomes
\[ \|v\|_1 < 2\,\sigma_s(v)_1 = 2 \inf_{\|z\|_0 \le s} \|v - z\|_1 \quad \forall\, v \in \ker A \setminus \{0\}. \]
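Since the NSP quantifies over the entire kernel, checking it exactly is computationally demanding (we return to complexity issues in Section 3.7). For small instances, one can at least search for violations numerically: by the observation above, it suffices to test the index set of the $s$ largest absolute entries of each kernel vector. The sketch below, which assumes SciPy's null_space routine and is entirely our own illustration, can only refute the property by random sampling, never certify it.

```python
# Heuristic search for violations of the null space property of order s.
import numpy as np
from scipy.linalg import null_space

def nsp_violation_search(A, s, n_trials=10_000, seed=0):
    rng = np.random.default_rng(seed)
    V = null_space(A)                         # orthonormal basis of ker A
    for _ in range(n_trials):
        v = V @ rng.standard_normal(V.shape[1])
        idx = np.argsort(np.abs(v))
        head, tail = idx[-s:], idx[:-s]       # worst-case S and its complement
        if np.abs(v[head]).sum() >= np.abs(v[tail]).sum():
            return v                          # counterexample to (NSP)
    return None                               # no violation found (inconclusive)
```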
2 [Cohen, Dahmen & DeVore '09] defined the NSP as a slightly more general property, namely, $\|v\|_1 \le C\,\sigma_s(v)_1$ for all $v \in \ker A$, where $C \ge 1$ is an unspecified constant.
We can now state the main theorem concerning the NSP, which gives a necessary and sufficient condition for the exact recovery of sparse vectors through $\ell_1$-norm minimization. The statement given here is a small modification of the one in the original reference.
Theorem 3.2. (Essentially Theorem 3.2 of [Cohen, Dahmen & DeVore '09]): Given a matrix $A \in \mathbb{K}^{m \times N}$, every vector $x \in \mathbb{K}^N$ supported on a set $S$ is the unique solution of (P1) with $y = Ax$ if and only if $A$ satisfies the null space property relative to $S$.
Proof. Suppose that every vector $x \in \mathbb{K}^N$ supported on $S$ is the unique minimizer of $\|z\|_1$ subject to $Az = Ax$. Thus, for any $v \in \ker A \setminus \{0\}$, the vector $v_S$ is the unique minimizer of $\|z\|_1$ subject to $Az = Av_S$. But, as we have $Av = 0$, then $A(v_S + v_{\bar{S}}) = Av = 0$ and so $A(-v_{\bar{S}}) = Av_S$. Furthermore, as $v \ne 0$, we must have $-v_{\bar{S}} \ne v_S$. Then, we conclude that $\|v_S\|_1 < \|v_{\bar{S}}\|_1$.
Conversely, assume that the null space property relative to $S$ holds. For a given vector $x \in \mathbb{K}^N$ supported on $S$ and a vector $z \in \mathbb{K}^N$, with $z \ne x$ and satisfying $Az = Ax$, we consider the vector $v = x - z \in \ker A \setminus \{0\}$. Due to the null space property, we have
\[ \|x\|_1 \le \|x - z_S\|_1 + \|z_S\|_1 = \|v_S\|_1 + \|z_S\|_1 < \|v_{\bar{S}}\|_1 + \|z_S\|_1 = \|{-z_{\bar{S}}}\|_1 + \|z_S\|_1 = \|z\|_1. \]
Now, if we let the set $S$ vary, we obtain as an obvious consequence the following result.
Theorem 3.3. Given a matrix $A \in \mathbb{K}^{m \times N}$, every $s$-sparse vector $x \in \mathbb{K}^N$ is the unique solution of (P1) with $y = Ax$ if and only if $A$ satisfies the null space property of order $s$.
This impressive and simple theorem tells us that if we have a sparse vector $x$ and a measurement vector $y = Ax$, Basis Pursuit actually solves (P0), provided that the NSP holds for the matrix $A$. In fact, suppose that we are able to recover the vector $x$ via $\ell_1$-norm minimization from $y = Ax$. Then, let $\tilde{x}$ be a minimizer of (P0) with $y = Ax$. Therefore, of course, we have $\|\tilde{x}\|_0 \le \|x\|_0$ and so $\tilde{x}$ is also $s$-sparse, since by hypothesis $x$ is $s$-sparse. Since every $s$-sparse vector is the unique $\ell_1$-minimizer, it follows that $\tilde{x} = x$.
The Null Space Property has an interesting feature: it is preserved if we rescale and reshuffle the measurements, or even if some new measurements are added. From a mathematical point of view, this is the same as replacing the measurement matrix $A$ by
\[ A_1 = GA, \quad \text{where } G \text{ is some invertible } m \times m \text{ matrix, or} \]
\[ A_2 = \begin{pmatrix} A \\ B \end{pmatrix}, \quad \text{where } B \text{ is some } m \times N \text{ matrix.} \]
In the second case, $A_2$ is represented as a block matrix and $B$ represents the additional measurements. This is reasonable because the measurement process should not get worse if new information is acquired, if the order in which the measurements are obtained changes, or if we change the scale of all measurements of the signal. The justification for these facts is simple: note that $\ker A_1 = \ker A$ and $\ker A_2 \subset \ker A$, hence $A_1$ and $A_2$ have the NSP if $A$ does.
In the definition of the NSP, $\mathbb{K}$ could be either $\mathbb{R}$ or $\mathbb{C}$. However, when the system has real coefficients, sparse solutions can be considered either as real or as complex vectors, leading to two apparently distinct null space properties. This is so because there is a distinction between the real null space $\ker_{\mathbb{R}} A$ and the complex null space $\ker_{\mathbb{C}} A = \ker_{\mathbb{R}} A + i \ker_{\mathbb{R}} A$. Nevertheless, the real and complex NSP are equivalent, as proved by [Foucart & Gribonval '10]. While the original proof links it with a problem about convex polygons in the real plane, in Theorem 3.4 we prove this equivalence following the elementary argument of [Lai & Liu '11].
Theorem 3.4. (Theorem 1 of [Foucart & Gribonval '10]): Given a matrix $A \in \mathbb{R}^{m \times N}$, the real null space property relative to a set $S$, that is,
\[ \sum_{j \in S} |v_j| < \sum_{i \in \bar{S}} |v_i| \quad \forall\, v \in \ker_{\mathbb{R}} A,\ v \ne 0, \tag{3.2} \]
is equivalent to the complex null space property relative to this set $S$, that is,
\[ \sum_{j \in S} \sqrt{v_j^2 + w_j^2} < \sum_{i \in \bar{S}} \sqrt{v_i^2 + w_i^2} \quad \forall\, v, w \in \ker_{\mathbb{R}} A,\ (v, w) \ne (0, 0). \tag{3.3} \]
In particular, the real null space property of order $s$ is equivalent to the complex null space property of order $s$.
As [Foucart & Gribonval '10] note, "Before stating the theorem, we point out that, for a real measurement matrix, one may also recover separately the real and imaginary parts of a complex vector using two real $\ell_1$-minimizations - which are linear programs - rather than recovering the vector directly using one complex $\ell_1$-minimization - which is a second order cone program."
Proof. Clearly, equation (3.2) follows from equation (3.3) when we set $w = 0$. So we need to prove the converse. Assume that (3.2) holds and consider $v, w \in \ker_{\mathbb{R}} A$ with $(v, w) \ne (0, 0)$. Suppose first that $v$ and $w$ are linearly dependent, e.g. $w = kv$ for some $k \in \mathbb{R}$. Then equation (3.3) is deduced from
\[ \sum_{j \in S} |v_j| < \sum_{i \in \bar{S}} |v_i| \;\Rightarrow\; \sqrt{k^2 + 1} \sum_{j \in S} |v_j| < \sqrt{k^2 + 1} \sum_{i \in \bar{S}} |v_i| \;\Rightarrow\; \sum_{j \in S} \sqrt{k^2 v_j^2 + v_j^2} < \sum_{i \in \bar{S}} \sqrt{k^2 v_i^2 + v_i^2} \;\Rightarrow\; \sum_{j \in S} \sqrt{w_j^2 + v_j^2} < \sum_{i \in \bar{S}} \sqrt{w_i^2 + v_i^2}. \]
Now assume $v$ and $w$ are linearly independent. In this case, $u = v \cos\theta + w \sin\theta \in \ker_{\mathbb{R}} A$ is nonzero for every $\theta$. So the real null space property (3.2) leads, for any $\theta \in \mathbb{R}$, to
\[ \sum_{j \in S} |v_j \cos\theta + w_j \sin\theta| < \sum_{i \in \bar{S}} |v_i \cos\theta + w_i \sin\theta|. \tag{3.4} \]
If we define, for each $k \in [N]$, an angle $\theta_k \in [-\pi, \pi]$ through the equations
\[ \cos\theta_k = \frac{v_k}{\sqrt{v_k^2 + w_k^2}}, \qquad \sin\theta_k = \frac{w_k}{\sqrt{v_k^2 + w_k^2}}, \]
then equation (3.4) can be reformulated as
\[ \sum_{j \in S} \sqrt{v_j^2 + w_j^2}\, |\cos(\theta - \theta_j)| < \sum_{i \in \bar{S}} \sqrt{v_i^2 + w_i^2}\, |\cos(\theta - \theta_i)|. \]
We can integrate over $\theta \in [-\pi, \pi]$ and obtain
\[ \sum_{j \in S} \sqrt{v_j^2 + w_j^2} \int_{-\pi}^{\pi} |\cos(\theta - \theta_j)|\, d\theta < \sum_{i \in \bar{S}} \sqrt{v_i^2 + w_i^2} \int_{-\pi}^{\pi} |\cos(\theta - \theta_i)|\, d\theta. \]
Since $\int_{-\pi}^{\pi} |\cos(\theta - \theta_j)|\, d\theta = 4$ is independent of $\theta_j$, we have the desired inequality. Hence the real null space property implies the complex null space property.
Sometimes it makes sense to use $\ell_q$-norm minimization, for $0 < q < 1$, instead of $\ell_1$-norm minimization. In order to analyze this case, the Null Space Property must be generalized. This was done by [Gribonval & Nielsen '07]; see also [Gao, Peng, Yue & Zhao '15]. The same problem of the equivalence between the real and complex Null Space Property arises in this nonconvex case. The proof that they are indeed equivalent notions was given by [Lai & Liu '11].
3.3 Recovery via Nonconvex Minimization
The $\ell_q$-quasinorm of a vector can be used to approximate the number of nonzero entries of a vector, as
\[ \sum_{i=1}^{N} |x_i|^q \xrightarrow{q \to 0} \sum_{i=1}^{N} \mathbb{1}_{\{x_i \ne 0\}} = \|x\|_0. \]
Therefore, we would like to know whether we can obtain results for sparse recovery using $\ell_q$-norm minimization, with $0 < q < 1$. Similarly to the other cases, let us call the problem of minimizing $\|z\|_q$ subject to $Az = y$ problem (Pq). Despite being a nonconvex problem, it can be shown that exact reconstruction is possible with substantially fewer measurements, while maintaining robustness to noise and stability with respect to signal non-sparsity; see [Chartrand '07] for numerical experiments and [Saab, Chartrand & Yilmaz '08] for the robustness analysis in the nonconvex case. We will prove that the problem (Pq) does not provide a worse approximation to (P0) when we make $q$ smaller. In order to do this, we need to show an equivalence, similar to Theorem 3.3, between sparse recovery with the $\ell_q$-norm and some form of the Null Space Property.
Theorem 3.5. Given a matrix $A \in \mathbb{C}^{m \times N}$ and $0 < q \le 1$, every $s$-sparse vector $x \in \mathbb{C}^N$ is the unique solution of (Pq) with $y = Ax$ if and only if, for any $S \subset [N]$ with $\#S \le s$,
\[ \|v_S\|_q < \|v_{\bar{S}}\|_q \quad \forall\, v \in \ker A \setminus \{0\}. \]
The proof of this theorem is analogous to that of Theorem 3.2; we just need to use the fact that the $q$-th power of the $\ell_q$-quasinorm satisfies the triangle inequality. Using Theorem 3.5, [Gribonval & Nielsen '07] were able to establish that sparse recovery via $\ell_q$-minimization implies sparse recovery via $\ell_p$-minimization for $0 < p < q \le 1$.
Theorem 3.6. ([Gribonval & Nielsen '07]): Given a matrix $A \in \mathbb{C}^{m \times N}$ and $0 < p < q \le 1$, if every $s$-sparse vector $x \in \mathbb{C}^N$ is the unique solution of (Pq) with $y = Ax$, then $x$ is also the unique solution of (Pp) with $y = Ax$.
Proof. Due to Theorem 3.5, we just need to prove that, if $v \in \ker A \setminus \{0\}$ and if $S$ is an index set of the $s$ largest absolute entries of $v$, then
\[ \sum_{i \in S} |v_i|^p < \sum_{i \in \bar{S}} |v_i|^p. \tag{3.5} \]
By hypothesis, the same inequality is valid with $q$ in place of $p$. This forces $v_{\bar{S}} \ne 0$: otherwise the right-hand side of the $q$-inequality would vanish while the left-hand side is nonnegative. Since $S$ is an index set of the $s$ largest absolute entries of $v$, this also gives $v_j \ne 0$ for every $j \in S$. So we can rewrite (3.5), dividing by $\sum_{i \in \bar{S}} |v_i|^p$; the inequality we want to prove is equivalent to
\[ f(p) := \sum_{j \in S} \frac{1}{\sum_{i \in \bar{S}} \big( |v_i| / |v_j| \big)^p} < 1. \tag{3.6} \]
Clearly, we have $|v_i|/|v_j| \le 1$ for $i \in \bar{S}$ and $j \in S$. So each inner sum is nonincreasing in $p$, and hence $f(p)$ is a nondecreasing function of $p$. Therefore, for $p < q$, we have $f(p) \le f(q)$. By hypothesis, $f(q) < 1$, and this implies the result.
In these results about sparse recovery via convex (and also nonconvex) optimization problems, nothing is said about the computational tractability of the Null Space Property. In Section 3.7 we will study some computational complexity issues related to $\ell_q$-norm minimization for $0 < q \le 1$. We first need to analyze what happens if the measurements are corrupted, or if we have vectors which are not, in fact, sparse, but only approximately sparse.
3.4 Stable Measurements
In idealized situations, all the vectors we deal with are sparse. This may not be the case in more realistic scenarios. Instead, we may have approximately sparse vectors, and we want to recover them with an error controlled by their distance to the closest sparse vector. This is called stability of the reconstruction with respect to sparsity defect. In this section, we will prove that Basis Pursuit is stable. In order to do this, we need a stronger Null Space Property that allows us to encompass these defects into the kernel of the measurement matrix.
Definition 3.7. A matrix $A \in \mathbb{C}^{m \times N}$ is said to satisfy the stable null space property with constant $0 < \rho < 1$ relative to a set $S \subset [N]$ if
\[ \|v_S\|_1 \le \rho \|v_{\bar{S}}\|_1 \quad \forall\, v \in \ker A. \tag{NSP$_\rho$} \]
We say that $A$ satisfies the stable null space property of order $s$ with constant $0 < \rho < 1$ if it satisfies the stable null space property with constant $0 < \rho < 1$ relative to any set $S \subset [N]$ with $\#S \le s$. We can now generalize Theorem 3.3 to the stable case.
Theorem 3.8. Suppose that a matrix $A \in \mathbb{C}^{m \times N}$ satisfies the stable null space property of order $s$ with constant $0 < \rho < 1$. Then, for any $x \in \mathbb{C}^N$, a solution $x^\#$ of (P1) with $y = Ax$ approximates the vector $x$ with $\ell_1$-error
\[ \|x - x^\#\|_1 \le \frac{2(1 + \rho)}{1 - \rho}\, \sigma_s(x)_1. \tag{3.7} \]
Remark 15. It is interesting to note that Theorem 3.8 does not guarantee uniqueness for the $\ell_1$-minimization. However, without explaining why, [Rauhut & Foucart '13] argue that nonuniqueness is rather pathological.
Although uniqueness may fail, the content of Theorem 3.8 is that every solution $x^\#$ of (P1) with $y = Ax$ satisfies (3.7). We will prove, in fact, a stronger result, namely Theorem 3.9. The key point is to consider any vector $z \in \mathbb{C}^N$ satisfying $Az = Ax$ and establish that, under the stable null space property relative to a set $S$, the distance between a vector $x \in \mathbb{C}^N$ supported on $S$ and a vector $z \in \mathbb{C}^N$ satisfying $Az = Ax$ can be controlled by the difference of their norms. Theorem 3.8 will follow as a corollary.
Theorem 3.9. The matrix $A \in \mathbb{C}^{m \times N}$ satisfies the stable null space property with constant $0 < \rho < 1$ relative to $S$ if and only if
\[ \|z - x\|_1 \le \frac{1 + \rho}{1 - \rho}\, \big( \|z\|_1 - \|x\|_1 + 2\|x_{\bar{S}}\|_1 \big) \tag{3.8} \]
for all vectors $x, z \in \mathbb{C}^N$ with $Az = Ax$.
In order to prove this result, we need the following simple observation.
Lemma 3.10. Given a set $S \subset [N]$ and vectors $x, z \in \mathbb{C}^N$, the following inequality holds:
\[ \|(x - z)_{\bar{S}}\|_1 \le \|z\|_1 - \|x\|_1 + \|(x - z)_S\|_1 + 2\|x_{\bar{S}}\|_1. \]
Proof. We separate a vector into two parts, one relative to the set $S$ and the other relative to the complementary set $\bar{S}$. Given the set $S \subset [N]$ and the vectors $x, z \in \mathbb{C}^N$, we have
\[ \|x\|_1 = \|x_{\bar{S}}\|_1 + \|x_S\|_1 \le \|x_{\bar{S}}\|_1 + \|(x - z)_S\|_1 + \|z_S\|_1, \]
and also
\[ \|(x - z)_{\bar{S}}\|_1 \le \|x_{\bar{S}}\|_1 + \|z_{\bar{S}}\|_1. \]
We sum these two inequalities to conclude
\[ \|x\|_1 + \|(x - z)_{\bar{S}}\|_1 \le 2\|x_{\bar{S}}\|_1 + \|(x - z)_S\|_1 + \|z_S\|_1 + \|z_{\bar{S}}\|_1 = 2\|x_{\bar{S}}\|_1 + \|(x - z)_S\|_1 + \|z\|_1, \]
and rearranging gives the desired inequality.
Proof. (of Theorem 3.9): First assume that the matrix $A$ satisfies (3.8) for all vectors $x, z \in \mathbb{C}^N$ with $Az = Ax$. For $v \in \ker A$, we have $A(v_S + v_{\bar{S}}) = Av = 0$, and so $A(-v_{\bar{S}}) = Av_S$. From (3.8) applied to $x = v_S$ and $z = -v_{\bar{S}}$, we obtain
\[ \|v\|_1 \le \frac{1 + \rho}{1 - \rho}\, \big( \|v_{\bar{S}}\|_1 - \|v_S\|_1 \big), \]
which is equivalent to