<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<!--<?xml-stylesheet type="text/xsl" href="article.xsl"?>-->
<article article-type="research-article" dtd-version="1.2" xml:lang="de" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id journal-id-type="issn">2749-4411</journal-id>
<journal-title-group>
<journal-title>Zeitschrift Korpora Deutsch als Fremdsprache</journal-title>
</journal-title-group>
<issn pub-type="epub">2749-4411</issn>
<publisher>
<publisher-name>Universit&#228;ts- und Landesbibliothek Darmstadt</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.48694/tujournals-3841</article-id>
<article-categories>
<subj-group>
<subject>Section corpora</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>THE NOTTDEUYTSCH CORPUS:</article-title>
<subtitle>A corpus of German-language YouTube comments</subtitle>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Cotgrove</surname>
<given-names>Louis</given-names>
</name>
<email>cotgrove@ids-mannheim.de</email>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
</contrib-group>
<aff id="aff-1"><label>1</label>Leibniz-Institut f&#252;r Deutsche Sprache Mannheim</aff>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2023-12-23">
<day>23</day>
<month>12</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>3</volume>
<issue>2</issue>
<fpage>225</fpage>
<lpage>229</lpage>
<permissions>
<copyright-statement>Copyright: &#x00A9; 2023 The Author(s)</copyright-statement>
<copyright-year>2023</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Creative Commons Attribution 4.0 International (CC BY 4.0). See <uri xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</uri>.</license-p>
</license>
</permissions>
<self-uri xlink:href="https://kordaf.tujournals.ulb.tu-darmstadt.de/articles/10.48694/tujournals-3841"/>
<abstract>
<p>In diesem Beitrag wird das Nottinghamer Korpus deutscher YouTube-Sprache (das NottDeuYTSch-Korpus) vorgestellt. Das Korpus hat eine Gr&#246;&#223;e von &#252;ber 33 Millionen W&#246;rtern, die aus etwa 3 Millionen YouTube-Kommentaren gesammelt wurden. Die Kommentare wurden zwischen 2008 und 2018 ver&#246;ffentlicht und wurden von einer Gruppe von &#252;berwiegend jungen Deutschsprachigen geschrieben. Das NottDeuYTSch-Korpus bietet einen authentischen und repr&#228;sentativen sprachlichen Schnappschuss junger Deutschsprachiger und erm&#246;glicht umfangreiche Forschungsm&#246;glichkeiten in verschiedenen linguistischen Bereichen wie Lexik, Morphologie, Syntax, Orthografie, Multilingualismus, sowie Gespr&#228;chs- und Diskursanalyse.</p>
</abstract>
<trans-abstract xml:lang="en">
<p>This paper introduces the <italic>Nottinghamer Korpus deutscher YouTube-Sprache</italic> (&#8216;The Nottingham German YouTube Language Corpus&#8217;, or <italic>NottDeuYTSch</italic> corpus). The corpus comprises over 33 million words, taken from roughly 3 million YouTube comments that were published between 2008 and 2018 and written by a young, German-speaking demographic. The <italic>NottDeuYTSch</italic> corpus provides an authentic and representative linguistic snapshot of young German speakers and offers significant opportunities for in-depth research in several linguistic fields, such as lexis, morphology, syntax, orthography, multilingualism, and conversation and discourse analysis.</p>
</trans-abstract>
<kwd-group>
<kwd>Korpuslinguistik</kwd>
<kwd>digitale Kommunikation</kwd>
<kwd>Deutsch</kwd>
<kwd>Multilingualismus</kwd>
<kwd>Jugendsprache</kwd>
</kwd-group>
<kwd-group xml:lang="en">
<kwd>Corpus linguistics</kwd>
<kwd>YouTube</kwd>
<kwd>CMC</kwd>
<kwd>online language</kwd>
<kwd>German</kwd>
<kwd>multilingualism</kwd>
<kwd>youth language</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec>
<title>1. The importance of researching digital youth language</title>
<p>YouTube is a valuable source of authentic linguistic data written by young German speakers: over 85% of young Germans regularly access the platform (cf. <xref ref-type="bibr" rid="B1">Bahlo et al. 2019: 80</xref>; <xref ref-type="bibr" rid="B11">Statista 2020</xref>). Despite this widespread use, corpus linguistic scholarship in this field has been limited. While the number of German-language corpora of digitally mediated communication (DMC) has steadily increased, none has explicitly focussed on youth language written on YouTube.<xref ref-type="fn" rid="n1">1</xref></p>
<p>The NottDeuYTSch corpus comprises over 33 million words from YouTube comments under the videos of 112 German-language channels targeted at young people and provides a unique opportunity for exploratory study of colloquial DMC among young German speakers. The corpus covers the period 2008&#8211;2018, a crucial decade in the transition from PC-based to mobile-based communication for many young people. Investigating the linguistic features used by young German speakers in digital spaces with the NottDeuYTSch corpus can potentially reveal linguistic changes that accompanied the technological changes of this period. The corpus complements and extends the research potential of existing corpora of DMC, and the communicative differences between the corpora demonstrate &#8220;unparalleled and rapidly evolving diversity in terms of speakers and settings&#8221; in DMC (<xref ref-type="bibr" rid="B2">Barbaresi 2019: 29</xref>). To further advance our understanding of the diverse and changing nature of online language, more specialised corpora of online language, such as the <italic>NottDeuYTSch</italic> corpus, are required, as they can provide valuable information about specific written text types and genres.</p>
</sec>
<sec>
<title>2. Constructing the NottDeuYTSch corpus</title>
<p>The NottDeuYTSch corpus was built using five guiding principles to ensure balance and representativeness, as well as future application to a wide range of linguistic research:</p>
<list list-type="order">
<list-item><p>The corpus is representative of the language used by young German speakers in YouTube comments.</p></list-item>
<list-item><p>The corpus can be analysed longitudinally.</p></list-item>
<list-item><p>The corpus can be analysed comparatively with other corpora.</p></list-item>
<list-item><p>The comments under every video must amount to at least 1,000 words, the minimum sample size needed &#8220;to reliably represent the distributions of linguistic features&#8221; (<xref ref-type="bibr" rid="B4">Biber 1993: 252</xref>).</p></list-item>
<list-item><p>Every video must have been published between July 2008 and October 2018 to ensure that all comments were written after YouTube launched the localised German version of the website.<xref ref-type="fn" rid="n2">2</xref></p></list-item>
</list>
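<p>The last two criteria lend themselves to a simple eligibility check. The following Python sketch is illustrative only and is not the code used to build the corpus: the record layout, the function name <monospace>is_eligible</monospace>, the whitespace-based word count, and the exact first and last days of the publication window are assumptions made for this example.</p>
<code language="python"><![CDATA[
```python
from datetime import date

MIN_WORDS = 1000                  # principle 4: minimum sample size per video
WINDOW_START = date(2008, 7, 1)   # assumed first day of July 2008
WINDOW_END = date(2018, 10, 31)   # assumed last day of October 2018


def is_eligible(video):
    """Check a video record against the word-count and publication-date criteria."""
    # Splitting on whitespace is a simplifying assumption for counting words.
    word_count = sum(len(text.split()) for text in video["comments"])
    in_window = WINDOW_START <= video["published"] <= WINDOW_END
    return word_count >= MIN_WORDS and in_window


# Hypothetical records, not taken from the corpus.
videos = [
    {"published": date(2012, 5, 4), "comments": ["sehr gutes Video"] * 400},
    {"published": date(2012, 5, 4), "comments": ["nice"] * 50},
    {"published": date(2007, 1, 1), "comments": ["sehr gutes Video"] * 400},
]
eligibility = [is_eligible(v) for v in videos]
print(eligibility)
```
]]></code>
<p>In this toy input, only the first record passes both checks; the second has too few words and the third falls outside the publication window.</p>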
<sec>
<title>2.1 Data selection</title>
<p>The NottDeuYTSch corpus was created by selecting comments from YouTube channels in German-speaking countries. Channels were identified on the basis of my own previous exposure to German-language YouTube culture, their appearances in youth media such as <italic>BRAVO</italic> magazine, and their ownership by media companies targeting young people, e.g. <italic>1Live</italic> from <italic>WDR</italic>. To identify channels that were popular at the start of the corpus period in 2008, I combined the Internet Archive,<xref ref-type="fn" rid="n3">3</xref> which allows a user to view websites as they appeared at particular points in time, with <italic>SocialBlade</italic>,<xref ref-type="fn" rid="n4">4</xref> a website that lists the 250 channels with the most subscribers in each of Germany, Austria, and Switzerland.</p>
<p>It was important to ensure that the corpus was representative of the language used by young German speakers online. To this end, custom R code was used to extract data via the YouTube API and to build a database of comments from all the videos of the selected channels. Comments under live-streamed videos were excluded, as they would not have provided a consistent communicative environment. The full database would have resulted in a corpus of over 1.5 billion tokens, which was too large for the scope of the project. It was therefore sampled down to roughly 3 million comments using random sampling stratified by year and video category to maintain &#8220;a wide range of text categories&#8221; for optimal balance (<xref ref-type="bibr" rid="B10">McEnery / Xiao / Tono 2006: 16</xref>). For further information on the construction and sampling process, see Cotgrove (<xref ref-type="bibr" rid="B6">2022: 59</xref>). <xref ref-type="table" rid="T1">Table 1</xref> provides a statistical overview of the NottDeuYTSch corpus.</p>
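<p>The idea of sampling stratified by year and video category can be sketched as follows. This is a minimal Python illustration rather than the original R code: drawing the same fraction from every stratum (proportional allocation), the function name <monospace>stratified_sample</monospace>, the record fields, and the toy data are all assumptions made for this example.</p>
<code language="python"><![CDATA[
```python
import random
from collections import defaultdict


def stratified_sample(comments, fraction, seed=0):
    """Draw the same fraction of comments from every (year, category) stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for comment in comments:
        strata[(comment["year"], comment["category"])].append(comment)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Keep at least one comment per stratum so that no year/category
        # combination disappears from the sample.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical records, not taken from the corpus:
# 3 years x 2 categories x 100 comments each.
toy = [
    {"year": year, "category": category, "text": f"comment {i}"}
    for year in (2008, 2013, 2018)
    for category in ("Music", "Gaming")
    for i in range(100)
]
subset = stratified_sample(toy, fraction=0.1)
print(len(subset))
```
]]></code>
<p>Because each stratum is sampled separately, the proportions of years and video categories in the sample mirror those in the full database, which is the property the quotation from McEnery / Xiao / Tono refers to as balance.</p>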
<table-wrap id="T1">
<label>Table 1</label>
<caption>
<p>Statistical overview of the <italic>NottDeuYTSch</italic> corpus, adapted from Cotgrove (<xref ref-type="bibr" rid="B6">2022: 343</xref>)</p>
</caption>
<table>
<thead>
<tr>
<th align="left" valign="top">Statistic</th>
<th align="left" valign="top">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Tokens (inc. emoji and emoticons)</td>
<td align="right" valign="top">33,760,494</td>
</tr>
<tr>
<td align="left" valign="top">Tokens (only lexemes)</td>
<td align="right" valign="top">32,549,462</td>
</tr>
<tr>
<td align="left" valign="top">Number of Types</td>
<td align="right" valign="top">567,086</td>
</tr>
<tr>
<td align="left" valign="top">Type-Token Ratio (TTR)</td>
<td align="right" valign="top">0.017</td>
</tr>
<tr>
<td align="left" valign="top">Number of Comments</td>
<td align="right" valign="top">3,149,457</td>
</tr>
<tr>
<td align="left" valign="top">Number of Videos</td>
<td align="right" valign="top">296</td>
</tr>
<tr>
<td align="left" valign="top">YouTube Channels Represented</td>
<td align="right" valign="top">63</td>
</tr>
<tr>
<td align="left" valign="top">Mean Tokens per Comment</td>
<td align="right" valign="top">10.720</td>
</tr>
<tr>
<td align="left" valign="top">Median Tokens per Comment</td>
<td align="right" valign="top">5</td>
</tr>
<tr>
<td align="left" valign="top">Mean Comments per Video</td>
<td align="right" valign="top">1,914</td>
</tr>
</tbody>
</table>
</table-wrap>
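<p>Two of the derived values in <xref ref-type="table" rid="T1">Table 1</xref> can be recomputed from the raw counts. The short Python sketch below assumes that the type-token ratio is the number of types divided by the number of tokens and that the token count including emoji and emoticons is the denominator; the table does not state which token count was used, and both counts yield the same value when rounded to three decimal places.</p>
<code language="python"><![CDATA[
```python
# Raw counts as given in Table 1.
tokens_all = 33_760_494   # tokens including emoji and emoticons
tokens_lex = 32_549_462   # tokens, lexemes only
types = 567_086
comments = 3_149_457

# Type-token ratio: assumed here to be types / tokens.
ttr_all = types / tokens_all
ttr_lex = types / tokens_lex

# Mean tokens per comment, using the token count that includes emoji.
mean_tokens = tokens_all / comments

print(f"TTR (all tokens):  {ttr_all:.3f}")
print(f"TTR (lexemes):     {ttr_lex:.3f}")
print(f"Tokens per comment: {mean_tokens:.2f}")
```
]]></code>
<p>The low ratio is expected for a corpus of this size, since the number of types grows far more slowly than the number of tokens; TTR values are therefore only comparable between corpora of similar size.</p>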
</sec>
</sec>
<sec>
<title>3. The NottDeuYTSch corpus as a resource for linguistic research</title>
<p>The NottDeuYTSch corpus is a valuable resource for linguistic research, as it provides a large, representative sample of the language used by young German speakers online. It is suitable for many different kinds of quantitative and qualitative projects, including lexical, orthographical, morphosyntactic, and syntactic studies, interactional and discourse analyses, and investigations into multilingualism. Furthermore, the metadata enable longitudinal studies of language change, as well as text and genre studies. For example, <xref ref-type="fig" rid="F1">Figure 1</xref> compares how the frequencies of three intensifiers, <italic>geil</italic>, <italic>cool</italic>, and <italic>mega</italic>, change over time:</p>
<p><styled-content style="text-align: center; display: block; line-height: 0.2"><italic>RegEx queries:</italic> <monospace>/g+e+i+l/gi</monospace>, <monospace>/c+o{2,}l/gi</monospace>, <monospace>/m+e+g+a/gi</monospace></styled-content></p>
<fig id="F1">
<label>Figure 1</label>
<caption>
<p>Frequency of comments containing selected intensifiers over time</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="kordaf-3841_cotgrove-g1.png"/>
</fig>
<p>The graph shows marked changes in the choice of intensifier used by young people over the period covered by the corpus: <italic>geil</italic> falls out of fashion, <italic>cool</italic> steadily increases and overtakes <italic>geil</italic>, and <italic>mega</italic> increases dramatically. This demonstrates lexical change on a microdiachronic scale.</p>
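<p>A query of the kind behind <xref ref-type="fig" rid="F1">Figure 1</xref> can be sketched in Python as follows. This is a minimal illustration, not the code used to produce the figure: the patterns are the three RegEx queries quoted above, carried over to Python's <monospace>re</monospace> module (the <monospace>i</monospace> flag becomes <monospace>re.IGNORECASE</monospace>), while the function name <monospace>share_by_year</monospace>, the (year, text) input format, the normalisation by the yearly comment total, and the toy comments are assumptions made for this example.</p>
<code language="python"><![CDATA[
```python
import re
from collections import Counter

# The repeated-letter quantifiers also capture expressive lengthening,
# e.g. "geeeil" or "cooool".
PATTERNS = {
    "geil": re.compile(r"g+e+i+l", re.IGNORECASE),
    "cool": re.compile(r"c+o{2,}l", re.IGNORECASE),
    "mega": re.compile(r"m+e+g+a", re.IGNORECASE),
}


def share_by_year(comments):
    """Return {year: {word: share of that year's comments containing the word}}."""
    totals = Counter()
    hits = Counter()
    for year, text in comments:
        totals[year] += 1
        for word, pattern in PATTERNS.items():
            # A comment counts once per word, however often the word occurs in it.
            if pattern.search(text):
                hits[(year, word)] += 1
    return {
        year: {word: hits[(year, word)] / totals[year] for word in PATTERNS}
        for year in sorted(totals)
    }


# Hypothetical comments, not taken from the corpus.
toy_comments = [
    (2009, "Das ist so geeeil!"),
    (2009, "GEIL"),
    (2018, "mega gut"),
    (2018, "voll cooool"),
    (2018, "Meeega Video"),
    (2018, "danke"),
]
shares = share_by_year(toy_comments)
print(shares[2018])
```
]]></code>
<p>Note that the patterns are unanchored, so they also match inside longer words such as <italic>geile</italic>; a stricter study might add word boundaries or filter the hits manually.</p>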
<p>The <italic>NottDeuYTSch</italic> corpus is available for download in many different formats (see <xref ref-type="bibr" rid="B5">Cotgrove 2018</xref>), and has been integrated into the German Reference Corpus (DeReKo) (<xref ref-type="bibr" rid="B8">Leibniz-Institut f&#252;r Deutsche Sprache 2022</xref>).</p>
</sec>
</body>
<back>
<fn-group>
<fn id="n1"><p>Existing corpora include those using data from websites and forums (e.g. IBK- und Social Media-Korpora, cf. <xref ref-type="bibr" rid="B9">L&#252;ngen / Kupietz 2020</xref>), Facebook (DiDi-Korpus, cf. <xref ref-type="bibr" rid="B7">Glaznieks / Frey 2020</xref>), and WhatsApp messages (MoCoDa2 corpus, cf. <xref ref-type="bibr" rid="B3">Bei&#223;wenger et al. 2020</xref>).</p></fn>
<fn id="n2"><p>Please note that neither the videos nor their transcripts are included in the corpus.</p></fn>
<fn id="n3"><p><ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://web.archive.org/">https://web.archive.org/</ext-link> (14.04.2023).</p></fn>
<fn id="n4"><p><ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://socialblade.com/youtube/top/country/de/mostsubscribed">https://socialblade.com/youtube/top/country/de/mostsubscribed</ext-link> (14.04.2023).</p></fn>
</fn-group>
<ref-list>
<ref id="B1"><mixed-citation publication-type="book"><string-name><surname>Bahlo</surname>, <given-names>Nils</given-names></string-name> / <string-name><surname>Becker</surname>, <given-names>Tabea</given-names></string-name> / <string-name><surname>Kalkavan-Ayd&#305;n</surname>, <given-names>Zeynep</given-names></string-name> / <string-name><surname>Lotze</surname>, <given-names>Netaya</given-names></string-name> / <string-name><surname>Marx</surname>, <given-names>Konstanze</given-names></string-name> / <string-name><surname>Schwarz</surname>, <given-names>Christian</given-names></string-name> / <string-name><surname>&#536;im&#537;ek</surname>, <given-names>Yazg&#252;l</given-names></string-name> (<year>2019</year>): <source>Jugendsprache: Eine Einf&#252;hrung</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>J.B. Metzler</publisher-name>.</mixed-citation></ref>
<ref id="B2"><mixed-citation publication-type="book"><string-name><surname>Barbaresi</surname>, <given-names>Adrien</given-names></string-name> (<year>2019</year>): <chapter-title>The Vast and the Focused: On the Need for Thematic Web and Blog Corpora</chapter-title>. In: <string-name><surname>Ba&#324;ski</surname>, <given-names>Piotr</given-names></string-name> / <string-name><surname>Barbaresi</surname>, <given-names>Adrien</given-names></string-name> / <string-name><surname>Biber</surname>, <given-names>Hanno</given-names></string-name> / <string-name><surname>Breiteneder</surname>, <given-names>Evelyn</given-names></string-name> / <string-name><surname>Clematide</surname>, <given-names>Simon</given-names></string-name> / <string-name><surname>Kupietz</surname>, <given-names>Marc</given-names></string-name> / <string-name><surname>L&#252;ngen</surname>, <given-names>Harald</given-names></string-name> / <string-name><surname>Iliadi</surname>, <given-names>Caroline</given-names></string-name> (eds.): <source>Proceedings of the Workshop on Challenges in the Management of Large Corpora</source>. <publisher-loc>Mannheim</publisher-loc>: <publisher-name>Leibniz-Institut f&#252;r Deutsche Sprache</publisher-name>, <fpage>29</fpage>&#8211;<lpage>32</lpage>.</mixed-citation></ref>
<ref id="B3"><mixed-citation publication-type="book"><string-name><surname>Bei&#223;wenger</surname>, <given-names>Michael</given-names></string-name> et al. (<year>2020</year>): <chapter-title>Die Mobile Communication Database 2 (MoCoDa 2)</chapter-title>. In: <string-name><surname>Marx</surname>, <given-names>Konstanze</given-names></string-name> / <string-name><surname>Lobin</surname>, <given-names>Henning</given-names></string-name> / <string-name><surname>Schmidt</surname>, <given-names>Axel</given-names></string-name> (eds.): <source>Deutsch in Sozialen Medien: Interaktiv &#8211; Multimodal &#8211; Vielf&#228;ltig</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>de Gruyter</publisher-name>, <fpage>349</fpage>&#8211;<lpage>352</lpage>.</mixed-citation></ref>
<ref id="B4"><mixed-citation publication-type="journal"><string-name><surname>Biber</surname>, <given-names>Douglas</given-names></string-name> (<year>1993</year>): <article-title>Representativeness in Corpus Design</article-title>. In: <source>Literary and Linguistic Computing</source>, <volume>8</volume>: <fpage>243</fpage>&#8211;<lpage>257</lpage>.</mixed-citation></ref>
<ref id="B5"><mixed-citation publication-type="webpage"><string-name><surname>Cotgrove</surname>, <given-names>Louis Alexander</given-names></string-name> (<year>2018</year>): <article-title>Das Nottinghamer Korpus Deutscher YouTube-Sprache (the NottDeuYTSch corpus)</article-title>. <publisher-name>LINDAT/CLARIAH-CZ</publisher-name>. <uri>http://hdl.handle.net/11372/LRT-4806</uri>.</mixed-citation></ref>
<ref id="B6"><mixed-citation publication-type="webpage"><string-name><surname>Cotgrove</surname>, <given-names>Louis Alexander</given-names></string-name> (<year>2022</year>): <chapter-title>#GlockeAktiv: A corpus linguistic investigation of German online youth language</chapter-title>. <publisher-loc>Nottingham</publisher-loc>: <publisher-name>University of Nottingham</publisher-name>. <uri>https://eprints.nottingham.ac.uk/id/eprint/69043</uri> (21.07.2023).</mixed-citation></ref>
<ref id="B7"><mixed-citation publication-type="book"><string-name><surname>Glaznieks</surname>, <given-names>Aivars</given-names></string-name> / <string-name><surname>Frey</surname>, <given-names>Jennifer-Carmen</given-names></string-name> (<year>2020</year>): <chapter-title>Das DiDi-Korpus: Internetbasierte Kommunikation aus S&#252;dtirol</chapter-title>. In: <string-name><surname>Marx</surname>, <given-names>Konstanze</given-names></string-name> / <string-name><surname>Lobin</surname>, <given-names>Henning</given-names></string-name> / <string-name><surname>Schmidt</surname>, <given-names>Axel</given-names></string-name> (eds.): <source>Deutsch in Sozialen Medien: Interaktiv &#8211; Multimodal &#8211; Vielf&#228;ltig</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>de Gruyter</publisher-name>, <fpage>353</fpage>&#8211;<lpage>354</lpage>.</mixed-citation></ref>
<ref id="B8"><mixed-citation publication-type="webpage"><collab>Leibniz-Institut f&#252;r Deutsche Sprache</collab> (<year>2022</year>): <article-title>IDS: Korpuslinguistik: Korpusausbau</article-title>. <uri>http://www1.ids-mannheim.de/kl/projekte/korpora.html</uri> (14.04.2023).</mixed-citation></ref>
<ref id="B9"><mixed-citation publication-type="book"><string-name><surname>L&#252;ngen</surname>, <given-names>Harald</given-names></string-name> / <string-name><surname>Kupietz</surname>, <given-names>Marc</given-names></string-name> (<year>2020</year>): <chapter-title>IBK- und Social Media-Korpora am Leibniz-Institut f&#252;r Deutsche Sprache</chapter-title>. In: <string-name><surname>Marx</surname>, <given-names>Konstanze</given-names></string-name> / <string-name><surname>Lobin</surname>, <given-names>Henning</given-names></string-name> / <string-name><surname>Schmidt</surname>, <given-names>Axel</given-names></string-name> (eds.): <source>Deutsch in Sozialen Medien: Interaktiv &#8211; Multimodal &#8211; Vielf&#228;ltig</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>de Gruyter</publisher-name>, <fpage>319</fpage>&#8211;<lpage>342</lpage>. <pub-id pub-id-type="doi">10.1515/9783110679885-016</pub-id>.</mixed-citation></ref>
<ref id="B10"><mixed-citation publication-type="book"><string-name><surname>McEnery</surname>, <given-names>Tony</given-names></string-name> / <string-name><surname>Xiao</surname>, <given-names>Richard</given-names></string-name> / <string-name><surname>Tono</surname>, <given-names>Yukio</given-names></string-name> (<year>2006</year>): <source>Corpus-Based Language Studies: An Advanced Resource Book</source>. <publisher-loc>Abingdon</publisher-loc>: <publisher-name>Routledge</publisher-name>.</mixed-citation></ref>
<ref id="B11"><mixed-citation publication-type="webpage"><collab>Statista</collab> (<year>2020</year>): <article-title>Jugendliche - Beliebteste Internetangebote 2020</article-title>. <uri>https://de.statista.com/statistik/daten/studie/419810/umfrage/beliebteste-internetangebote-bei-jugendlichen/</uri> (14.04.2023).</mixed-citation></ref>
</ref-list>
<sec>
<title>Biographical note</title>
<p>Dr Louis Cotgrove is a researcher in the Department of Lexicology at the Leibniz Institute for the German Language (IDS) in Mannheim. His research specialities include the corpus linguistic investigation of youth and online language, text technology and analysis in digital lexicography and empirical lexicology, and the development of data infrastructure for online dictionaries.</p>
<p><styled-content style="text-align: right; display: block; line-height: 0.2"><bold>Contact address</bold>:</styled-content></p>
<p><styled-content style="text-align: right; display: block; line-height: 0.2">Dr. Louis Cotgrove</styled-content></p>
<p><styled-content style="text-align: right; display: block; line-height: 0.2">Leibniz-Institut f&#252;r Deutsche Sprache</styled-content></p>
<p><styled-content style="text-align: right; display: block; line-height: 0.2">68161 Mannheim</styled-content></p>
<p><styled-content style="text-align: right; display: block; line-height: 0.2">Germany</styled-content></p>
<p><styled-content style="text-align: right; display: block; line-height: 0.2"><ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="mailto:cotgrove@ids-mannheim.de">cotgrove@ids-mannheim.de</ext-link></styled-content></p>
</sec>
</back>
</article>
