Data Flow Algorithms for Processors with Vector Extensions: Handling Actors With Internal State
| Published in | Journal of signal processing systems, Vol. 87, no. 1, pp. 21-31 |
|---|---|
| Main Authors | Barford, Lee; Bhattacharyya, Shuvra S.; Liu, Yanzhou |
| Format | Journal Article |
| Language | English |
| Published | New York: Springer US, 01.04.2017 |
| Online Access | https://doi.org/10.1007/s11265-015-1045-x |
| ISSN | 1939-8018 (print), 1939-8115 (electronic) |
| DOI | 10.1007/s11265-015-1045-x |
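
The abstract of this record describes replacing a serial chain of stateful actor invocations with a parallel scan (prefix sum). A minimal Python sketch of that general idea follows. It is not the authors' code: it assumes a first-order IIR filter y[n] = a*y[n-1] + x[n] and a two-state parity machine chosen only for illustration, and the names `inclusive_scan`, `iir_first_order`, `fsm_states`, and `PARITY` are hypothetical. Each invocation is encoded as a function of the incoming state; since function composition is associative, a scan can combine the invocations in logarithmically many passes, and each pass is elementwise and therefore a candidate for vectorization.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def inclusive_scan(items: Sequence[T], combine: Callable[[T, T], T]) -> List[T]:
    """Hillis-Steele style inclusive scan.

    combine(earlier, later) must be associative. Every element of a pass is
    computed independently of the others; a vector unit could do that in
    lockstep, while this sketch uses an ordinary list comprehension.
    """
    out = list(items)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else combine(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out


Affine = Tuple[float, float]  # (a, b) stands for the map y -> a*y + b


def compose_affine(earlier: Affine, later: Affine) -> Affine:
    """Apply `earlier` first, then `later`: y -> a2*(a1*y + b1) + b2."""
    a1, b1 = earlier
    a2, b2 = later
    return (a1 * a2, a2 * b1 + b2)


def iir_first_order(a: float, x: Sequence[float], y0: float = 0.0) -> List[float]:
    """Outputs of y[n] = a*y[n-1] + x[n], obtained from a scan (assumed filter form)."""
    prefixes = inclusive_scan([(a, xn) for xn in x], compose_affine)
    return [A * y0 + B for A, B in prefixes]


StateMap = Tuple[int, ...]  # m[s] is the next state when starting in state s


def compose_maps(earlier: StateMap, later: StateMap) -> StateMap:
    """Apply `earlier` first, then `later`."""
    return tuple(later[s] for s in earlier)


def fsm_states(table: Sequence[StateMap], symbols: Sequence[int],
               start: int = 0) -> List[int]:
    """State reached after each input symbol, obtained from a scan."""
    prefixes = inclusive_scan([table[c] for c in symbols], compose_maps)
    return [m[start] for m in prefixes]


# Hypothetical parity machine: symbol 0 keeps the state, symbol 1 flips it.
PARITY: Tuple[StateMap, ...] = ((0, 1), (1, 0))
```

Calling `iir_first_order(0.5, [1.0, 2.0, 3.0, 4.0])` or `fsm_states(PARITY, [1, 0, 1, 1])` gives the same values as stepping the recurrence serially for these inputs. In general the scan reassociates floating-point operations, so filter outputs can differ slightly in rounding from the serial evaluation.
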
| Abstract | Full use of the parallel computation capabilities of present and expected CPUs and GPUs requires use of vector extensions. Yet many actors in data flow systems for digital signal processing have internal state (or, equivalently, an edge that loops from the actor back to itself) that imposes serial dependencies between actor invocations, making vectorization across actor invocations impossible. Ideally, issues of inter-thread coordination required by serial data dependencies should be handled by code written by parallel programming experts that is separate from code specifying signal processing operations. The purpose of this paper is to present one approach to doing so in the case of actors that maintain state. We propose a methodology for using the parallel scan (also known as prefix sum) pattern to create algorithms for multiple simultaneous invocations of such an actor that result in vectorizable code. Two examples of applying this methodology are given: (1) infinite impulse response (IIR) filters and (2) finite state machines (FSMs). The correctness and performance of the resulting IIR filters and one class of FSMs are studied. |
|---|---|
| Author | Barford, Lee Liu, Yanzhou Bhattacharyya, Shuvra S.  | 
    
| Author_xml | 1. Lee Barford (lee.barford@keysight.com), Keysight Laboratories, Keysight Technologies, Inc; 2. Shuvra S. Bhattacharyya, University of Maryland, Tampere University of Technology; 3. Yanzhou Liu, University of Maryland |
    
| ContentType | Journal Article | 
    
| Copyright | The Author(s) 2015 | 
    
| Discipline | Engineering | 
    
| EISSN | 1939-8115 | 
    
| EndPage | 31 | 
    
| ISSN | 1939-8018 | 
    
| IsDoiOpenAccess | true | 
    
| IsOpenAccess | true | 
    
| IsPeerReviewed | true | 
    
| IsScholarly | true | 
    
| Issue | 1 | 
    
| Keywords | Data flow computing; Vector processors; Parallel algorithms; Graphics processing units; Digital signal processing |
    
| Language | English | 
    
| OpenAccessLink | https://doi.org/10.1007/s11265-015-1045-x | 
    
| PageCount | 11 | 
    
| PublicationCentury | 2000 | 
    
| PublicationDate | 20170400 2017-4-00  | 
    
| PublicationDateYYYYMMDD | 2017-04-01 | 
    
| PublicationDecade | 2010 | 
    
| PublicationPlace | New York | 
    
| PublicationSubtitle | for Signal, Image, and Video Technology (formerly the Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology) | 
    
| PublicationTitle | Journal of signal processing systems | 
    
| PublicationTitleAbbrev | J Sign Process Syst | 
    
| PublicationYear | 2017 | 
    
| Publisher | Springer US | 
    
| StartPage | 21 | 
    
| SubjectTerms | Circuits and Systems; Computer Imaging; Electrical Engineering; Engineering; Image Processing and Computer Vision; Pattern Recognition; Pattern Recognition and Graphics; Signal, Image and Speech Processing; Vision |
    
| Subtitle | Handling Actors With Internal State | 
    
| Title | Data Flow Algorithms for Processors with Vector Extensions | 
    
| URI | https://link.springer.com/article/10.1007/s11265-015-1045-x | 
    
| Volume | 87 | 
    