How to Quickly Import a Network into the Neo4j Graph Database

Processing extremely large graphs has been and remains a challenge, but recent advances in Big Data technologies have made this task more practical. Tapad, a startup based in NYC focused on cross-device content delivery, has made graph processing the heart of their business model using Big Data to scale to terabytes of data.
Social networks like Facebook or Twitter contain data that naturally lends itself to a graph representation. But graphs can be used to represent less obvious data, as in the case of Tapad’s device graph. Dag Liodden, Tapad’s co-founder and CTO, describes why using a graph representation for devices makes sense:
Tapad takes a graph-oriented approach to modeling relationships between devices. Anonymous identifiers (such as cookie IDs) are represented as nodes in our Device Graph and we track marketing information to these nodes. Edges between the nodes are scored / weighted using a combination of deterministic data and probabilistic statistical models / machine learning techniques. The concept of a “device” is defined as a starting device / node (let’s say the cookie ID of a browser) and the collections of nodes (let’s say the cookie IDs of a Tablet and a Connected TV) that are reachable from that starting point given a customizable set of edge constraints. Having an actual graph structure, as opposed to just aggregated information into a single node, gives us the flexibility to balance accuracy and scale dynamically as well as more easily augment the graph with new edge inference models.
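Dag's notion of a "device" is essentially a constrained reachability query: start from one identifier and collect every node you can reach over edges that pass some test. The short Java sketch below is only an illustration of that idea; the Edge type, the score threshold and the sample identifiers are invented for the example and are not Tapad's actual implementation.

import java.util.*;

// Illustrative sketch of the "device" notion described above: starting from one
// node (say, a browser cookie ID), collect every node reachable through edges
// that satisfy a configurable constraint such as a minimum score.
public class DeviceReachability {

    record Edge(String target, double score) {}

    static Set<String> device(Map<String, List<Edge>> graph, String start, double minScore) {
        Set<String> reachable = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        reachable.add(start);
        frontier.push(start);
        while (!frontier.isEmpty()) {
            String node = frontier.pop();
            for (Edge e : graph.getOrDefault(node, List.of())) {
                // Edge constraint: only follow sufficiently confident edges.
                if (e.score() >= minScore && reachable.add(e.target())) {
                    frontier.push(e.target());
                }
            }
        }
        return reachable;
    }

    public static void main(String[] args) {
        Map<String, List<Edge>> graph = Map.of(
                "browser-cookie", List.of(new Edge("tablet-cookie", 0.9), new Edge("tv-cookie", 0.4)),
                "tablet-cookie", List.of(new Edge("tv-cookie", 0.8)));
        // With a threshold of 0.7 the TV is still part of the "device", but only via the tablet.
        System.out.println(device(graph, "browser-cookie", 0.7));
    }
}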
Using the right tool for the right job is important, and the same goes for graph processing: there is no need to use Big Data technologies for graphs that can be handled by more traditional workloads, like Dag says:
“Big Data” to me is the threshold where you no longer can use a small set of general purpose, off-the-shelf tools to store and analyze your data, but instead have to tailor different technologies to address specific use cases. These thresholds keep moving every year as software and hardware solutions evolve and mature, but so does the size of the data sets we deal with and the level of sophistication of the analysis we need to perform.
For Facebook, this threshold is in the single digit petabytes, as detailed during their submission to the 2013 ACM SIGMOD conference in NYC. For Tapad, the amount of data in the graph is smaller but would still be impossible to process using traditional methods:
The US graph currently has about 1.1 billion nodes, representing mobile phones, tablets, laptops, gaming consoles and TVs. Some of these nodes are transient, for instance because of a browser with non-persistent cookies, and thus have little data and no edges. The non-transient nodes have about five edges on average and around 500 discrete pieces of information, such as behavioral segments, associated with them. The live graph data weighs in at multiple TB and we read / write from / to it several hundred thousand times per second across multiple data centers. Updates to the graph are geographically cross-replicated and each data center is currently serving off of servers backed by 20 TB of Flash SSD storage and 2 TB of RAM.
The recent years have seen a surge in the number of technologies used to process graphs at scale, especially 2013 which saw several new additions to the ecosystem. There are two classes of systems to consider:
Graph databases for OLTP workloads for quick low-latency access to small portions of graph data.
Graph processing engines for OLAP workloads allowing batch processing of large portions of a graph.
The list of graph databases is already very long, but several projects have emerged and differentiated themselves recently. Neo4j is one of the oldest and most mature graph databases, but still suffers from scalability issues since it doesn’t support sharding yet. Another database that, albeit pretty young, has been gaining a lot of popularity in 2013 is Titan. As a backend-agnostic graph database, it can leverage both HBase and Cassandra’s scalable architecture and uses an optimized vertex and edge representation internally to allow it to scale to billions of edges as reported in a blog post in 2013.
But one does not need to use graph-specific databases: more generic scalable NoSQL databases can also be an effective solution to the problem. Apache Accumulo, a technology based on Google's BigTable and open-sourced in 2011, is one example of a generic database that is a good fit for storing graphs at scale: its flexible records can hold graphs with typed edges and weights, and according to a technical report published in 2013 it is actually being used by the NSA. Cassandra or Aerospike are other examples of databases that, with a proper data model, can effectively model a graph with edges, vertexes and weights. Facebook also built their own solution using MySQL and Memcache in a system called Tao, which is being used to serve the social graph to its users. And according to Dag, Tapad used the same philosophy in the design of their device graph:
The live graph lives in a key-value store to allow for fast traversals and updates. We regularly snapshot the graph into HDFS where it can be retrieved for more advanced graph processing and augmented with other data streams. The results are later fed back into the “live graph”. There are advantages to using a graph specific database, but our current setup with extremely fast, simple traversals of the graph in our key-value store and slow, but very flexible traversal and analysis on Hadoop is serving us well, at least for now.
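Whether the store is Accumulo, Cassandra, Aerospike or the key-value store Dag describes, the "proper data model" is usually an adjacency list: one record per vertex whose columns are the outgoing edges and whose values are the weights. The sketch below illustrates that layout with an in-memory map standing in for the store; the row and column naming scheme is invented for the example rather than taken from any of these systems.

import java.util.*;

// Sketch of the adjacency-list layout used to store a graph in a generic
// key-value / wide-column store: the row key is the vertex ID, each column is
// one outgoing edge (edge type plus target vertex) and the value is the weight.
// A TreeMap stands in for the store here.
public class KeyValueGraphModel {

    static final SortedMap<String, SortedMap<String, Double>> rows = new TreeMap<>();

    static void addEdge(String from, String type, String to, double weight) {
        rows.computeIfAbsent(from, k -> new TreeMap<>()).put(type + ":" + to, weight);
    }

    public static void main(String[] args) {
        addEdge("cookie:123", "same-user", "cookie:456", 0.82);
        addEdge("cookie:123", "same-household", "tv:789", 0.35);
        // One row lookup returns the whole neighbourhood of a vertex, which is
        // what makes small traversals cheap in this layout.
        System.out.println(rows.get("cookie:123"));
    }
}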
Even with a graph stored in a database, the operations that can be performed at scale will likely be limited to lookups and small traversals. For more complex analysis on a larger portion of a graph, distributed batch-processing frameworks are needed. For the best performance, the GraphLab framework uses the Message Passing Interface (MPI) model to scale and run complex algorithms using data in HDFS. More recent frameworks like Apache Giraph and Apache Hama are based on the Bulk Synchronous Parallel (BSP) paradigm popularized by Google's Pregel project. The latest additions to the ecosystem are GraphX, which runs on top of Spark and was unveiled in 2013, and Faunus, which uses Hadoop to run MapReduce jobs that process graphs stored in a Titan database. Tapad is using these new technologies to process their offline graph data. According to Dag:
Currently, our main graph processing framework is Apache Giraph, but we are experimenting with Spark GraphX and Graphlab as well. All of these frameworks are still pretty young, the learning curve is pretty steep and all come with their own pros, cons and caveats. For instance, Giraph and GraphX are convenient as they fit nicely into our Hadoop infrastructure, but Graphlab is very appealing due to the sheer performance of it.
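What Pregel, Giraph and Hama share is a vertex-centric BSP model: in every superstep each vertex reads the messages sent to it, updates its value and sends new messages to its neighbours, with a global barrier between supersteps. The toy program below simulates that loop for connected components on a five-node graph; it is a framework-free sketch of the model, not the Giraph or Hama API.

import java.util.*;

// Toy illustration of the Bulk Synchronous Parallel (Pregel) style: vertices
// exchange messages, and supersteps are globally synchronized. Each vertex keeps
// the smallest vertex ID it has seen as its component label.
public class BspConnectedComponents {

    public static void main(String[] args) {
        // Undirected toy graph as adjacency lists (two components: {1,2,3} and {4,5}).
        Map<Integer, List<Integer>> adj = Map.of(
                1, List.of(2), 2, List.of(1, 3), 3, List.of(2),
                4, List.of(5), 5, List.of(4));

        Map<Integer, Integer> value = new HashMap<>();
        adj.keySet().forEach(v -> value.put(v, v));

        Map<Integer, List<Integer>> inbox = new HashMap<>();
        boolean active = true;
        for (int superstep = 0; active; superstep++) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            active = false;
            for (int v : adj.keySet()) {
                int min = value.get(v);
                for (int msg : inbox.getOrDefault(v, List.of())) min = Math.min(min, msg);
                if (min < value.get(v) || superstep == 0) {
                    value.put(v, min);
                    // Send the current label to all neighbours; they will shrink theirs further.
                    for (int n : adj.get(v)) outbox.computeIfAbsent(n, k -> new ArrayList<>()).add(min);
                    active = true;
                }
            }
            inbox = outbox; // barrier: the next superstep sees this superstep's messages
        }
        System.out.println(value); // {1=1, 2=1, 3=1, 4=4, 5=4}
    }
}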
Some projects are attempting to provide a unified framework to answer both OLTP and OLAP queries. Dendrite from Lab41 is such a project that leverages GraphLab on top of Titan for storage and processing, and AngularJS for visualization. This is still a very young project unveiled in early 2014 so the community reaction is limited, but the fact that it attempts to cover every use case should help drive adoption.
Deployment and Use of the Graph-Oriented Database Neo4j
Requirements for a graph database
- Store graph data in a direct, intuitive way, rather than contorting it into some other representation
- Support simple graph operations such as pathfinding and traversal

Basic ideas behind graph databases
- A graph consists of nodes (vertices) and relationships (edges)
- A graph records data on its nodes (node properties)
- Nodes are organized into a graph by relationships, and relationships have properties too
- Traversal (pathfinding) locates paths made up of nodes in the graph
- An index maps properties to nodes or relationships
- A graph database management system manages graphs and the indexes built on them
Neo4j is currently the leading graph database product
- An open-source graph database developed by Neo Technology, a company that has been working on graph databases since 2000; Neo4j has become the leading graph database product, with multinationals such as Cisco, HP and Deutsche Telekom among its customers
- Intuitive storage of the graph model
- Full ACID transaction support
- Disk-based persistent storage
- Support for massive data sets, on the order of billions of nodes/relationships/properties
- Highly available distributed clusters
- Highly optimized, fast graph queries (the Cypher graph query language)
- Embeddable (just a few small jar files), with a REST API
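The concepts above map directly onto Neo4j's embedded Java API. The following sketch is written against the 1.9-era API used throughout this post and assumes the Neo4j jars are on the classpath; the database path and the property values are merely examples. It creates two nodes, connects them with a relationship, sets properties on both, and then performs a one-hop traversal.

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Minimal sketch of the basic graph ideas with the embedded API: nodes carry
// properties, relationships connect nodes and carry properties too, and a
// traversal follows relationships from a starting node.
public class GraphBasics {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase("/tmp/neo4j-demo-db");
        Transaction tx = db.beginTx();
        try {
            Node emil = db.createNode();
            emil.setProperty("name", "Emil");            // data lives on the node as properties
            Node sweden = db.createNode();
            sweden.setProperty("name", "Sweden");

            Relationship rel = emil.createRelationshipTo(
                    sweden, DynamicRelationshipType.withName("IS_FROM"));
            rel.setProperty("since", 1978);              // relationships have properties as well

            // A one-hop traversal: follow outgoing relationships from the start node.
            for (Relationship r : emil.getRelationships(Direction.OUTGOING)) {
                System.out.println(emil.getProperty("name") + " -[" + r.getType().name()
                        + "]-> " + r.getEndNode().getProperty("name"));
            }
            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}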
Installing Neo4j
Neo4j deployment modes
- Standalone mode: a separate server process, accessed via the REST API or language-specific drivers
- Embedded mode: a set of libraries embedded in and run inside a Java program
## Installing the Linux version
- Download http://dist.neo4j.org/neo4j-community-1.9.4-unix.tar.gz
- Unpack the archive after downloading (the Community edition is used here)
- Switch to root and run ./bin/neo4j install
## A JDK must be installed before installing Neo4j
# tar zxf neo4j-community-1.9.4-unix.tar.gz
# mv neo4j-community-1.9.4 neo4j
# cd neo4j/bin
[root@linux bin]# ./neo4j install
Graph-like power should be handled carefully. What user should run Neo4j? [oracle] root
## Edit the configuration file
./conf/neo4j-server.properties
org.neo4j.server.webserver.address=0.0.0.0
[root@linux bin]# ./neo4j status
Neo4j Server is not running
[root@linux bin]# ./neo4j start
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Using additional JVM arguments: -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
Starting Neo4j Server...process [1128]... waiting for server to be ready....... OK.
Go to http://localhost:7474/webadmin/ for administration interface.
[root@linux bin]# ./neo4j status
Neo4j Server is running at pid 1128
## Remote web UI
http://10.10.10.8:7474/webadmin/#
## Web UI: Console
http://10.10.10.8:7474/webadmin/#/console/
## Test with a few simple statements
neo4j-sh (0)$ CREATE (ee { name: "Emil", from: "Sweden" }) RETURN ee.name
==> +---------+
==> | ee.name |
==> +---------+
==> | "Emil"  |
==> +---------+
==> Nodes created: 1
==> Properties set: 2
==> 1100 ms
neo4j-sh (0)$ START ee=node(*) WHERE ee.name! = "Emil" RETURN ee
==> +------------------------------------+
==> | ee                                 |
==> +------------------------------------+
==> | Node[1]{name:"Emil",from:"Sweden"} |
==> +------------------------------------+
==> 502 ms
neo4j-sh (0)$
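The same statements can also be sent to the standalone server over HTTP. The sketch below assumes the 1.9-era REST Cypher endpoint at /db/data/cypher on the host used above; adjust the host, port and endpoint to your own installation.

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Send a parameterized Cypher query to a standalone Neo4j server over REST
// and print the raw JSON response ("columns" and "data").
public class CypherOverRest {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://10.10.10.8:7474/db/data/cypher");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setDoOutput(true);

        String body = "{\"query\": \"START ee=node(*) WHERE ee.name! = {name} RETURN ee\","
                    + " \"params\": {\"name\": \"Emil\"}}";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}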
## Shell command reference
Simulating and testing Neo4j high availability
- Download the Neo4j Enterprise edition
- Unpack it into several different directories
- Edit each configuration file so the instances listen on different ports
- Start the multiple Neo4j server processes
- See section 26.6 of the Neo4j manual
- Similar to a MySQL high-availability cluster, this allows read/write splitting: writes are directed to the master node, while reads are served from multiple slaves (a client-side sketch of this idea follows below)
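Neo4j HA does not route requests for you: the client (or a load balancer in front of the cluster) is expected to send writes to the master and spread reads across the slaves. The deliberately naive sketch below shows the idea using the three web ports configured in the sections that follow. Detecting writes by inspecting the Cypher text and hard-coding instance 1 as the master are simplifications for illustration only; a real client would discover the current master instead of assuming it.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of client-side read/write splitting for the HA setup described above:
// writes always go to the master's endpoint, reads rotate across the slaves.
public class ReadWriteSplit {
    static final String MASTER = "http://10.10.10.8:7474/db/data/cypher";
    static final List<String> SLAVES = List.of(
            "http://10.10.10.8:7484/db/data/cypher",
            "http://10.10.10.8:7494/db/data/cypher");
    static final AtomicInteger next = new AtomicInteger();

    static String endpointFor(String cypher) {
        String upper = cypher.trim().toUpperCase();
        // Crude write detection, good enough for the illustration.
        boolean isWrite = upper.startsWith("CREATE")
                || upper.contains(" SET ") || upper.contains(" DELETE ");
        if (isWrite) return MASTER;                                      // writes -> master
        return SLAVES.get(Math.floorMod(next.getAndIncrement(), SLAVES.size())); // reads -> slaves
    }

    public static void main(String[] args) {
        System.out.println(endpointFor("CREATE (n { name: \"Emil\" })"));
        System.out.println(endpointFor("START ee=node(*) RETURN ee"));
        System.out.println(endpointFor("START ee=node(*) RETURN ee"));
    }
}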
## Download the Enterprise edition and unpack it into three directories
# tar zxf neo4j-enterprise-1.9.4-unix.tar.gz
# mv neo4j-enterprise-1.9.4 neo4j01
[root@linux neo4j]# cp -R neo4j01 neo4j02
[root@linux neo4j]# cp -R neo4j01 neo4j03
## Configuration file for node 1: neo4j.properties
online_backup_server=127.0.0.1:6362
ha.server_id=1
ha.initial_hosts=127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003
ha.server=127.0.0.1:6001
ha.cluster_server=127.0.0.1:5001
## Configuration file for node 1: neo4j-server.properties
org.neo4j.server.webserver.address=0.0.0.0
org.neo4j.server.manage.console_engines=gremlin, shell
org.neo4j.server.database.mode=HA
org.neo4j.server.webserver.port=7474
org.neo4j.server.webserver.https.port=7473
## Configuration file for node 2: neo4j.properties
online_backup_server=127.0.0.1:6363
ha.server_id=2
ha.initial_hosts=127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003
ha.server=127.0.0.1:6002
ha.cluster_server=127.0.0.1:5002
## Configuration file for node 2: neo4j-server.properties
org.neo4j.server.webserver.address=0.0.0.0
org.neo4j.server.manage.console_engines=gremlin, shell
org.neo4j.server.database.mode=HA
org.neo4j.server.webserver.port=7484
org.neo4j.server.webserver.https.port=7483
## Configuration file for node 3: neo4j.properties
online_backup_server=127.0.0.1:6364
ha.server_id=3
ha.initial_hosts=127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003
ha.server=127.0.0.1:6003
ha.cluster_server=127.0.0.1:5003
## Configuration file for node 3: neo4j-server.properties
org.neo4j.server.webserver.address=0.0.0.0
org.neo4j.server.manage.console_engines=gremlin, shell
org.neo4j.server.database.mode=HA
org.neo4j.server.webserver.port=7494
org.neo4j.server.webserver.https.port=7493
## Start the three Neo4j instances
[root@linux bin]# ./neo4j start
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Using additional JVM arguments: -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
Starting Neo4j Server...HA instance started in process [2244]. Will be operational once connected to peers. See /nosql/neo4j/neo4j01/data/log/console.log for current status.
[root@linux bin]#
[root@linux bin]# cd /nosql/neo4j/neo4j02/bin/
[root@linux bin]# ./neo4j start
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Using additional JVM arguments: -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
Starting Neo4j Server...HA instance started in process [2417]. Will be operational once connected to peers. See /nosql/neo4j/neo4j02/data/log/console.log for current status.
[root@linux bin]#
[root@linux bin]# cd /nosql/neo4j/neo4j03/bin/
[root@linux bin]# ./neo4j start
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Using additional JVM arguments: -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
Starting Neo4j Server...HA instance started in process [2690]. Will be operational once connected to peers. See /nosql/neo4j/neo4j03/data/log/console.log for current status.
[root@linux bin]#
[root@linux bin]# jps
2417 Bootstrapper
2244 Bootstrapper
2690 Bootstrapper
[root@linux bin]#
## View the HA cluster status in the Web UI
http://10.10.10.8:7474/webadmin/#/info/org.neo4j/High%20Availability/
Importing Wikipedia into a Graph Database: Method and Steps
By Wang Yuyuan (IT168)
[IT168 Tech] Wouldn't it be cool to be able to import Wikipedia into a Neo4j database?

Neo4j is a representative NoSQL store: an embeddable, disk-based Java persistence engine with full transaction support that stores data in graphs rather than in tables. Neo4j scales to very large data sets; on a single machine it can handle graphs with billions of nodes/relationships/properties, and it can be scaled out to run on multiple machines. Compared with relational databases, graph databases are good at handling large volumes of complex, highly interconnected, loosely structured data that changes quickly and is queried frequently; in a relational database such queries lead to large numbers of table joins and therefore to performance problems.

Neo4j addresses the performance degradation that traditional RDBMSs with many joins suffer at query time. By modeling the data as a graph, Neo4j traverses nodes and edges at the same speed regardless of the amount of data that makes up the graph. It also offers very fast graph algorithms, recommendation systems and OLAP-style analytics, none of which can be achieved in today's RDBMS systems.

A software engineer in the UK realized this too, and used batch-import techniques to create graphipedia (a toolkit that imports Wikipedia pages and their hyperlinks into a graph database).

Graphipedia is written in Java. If you mostly work in Ruby and are not comfortable with Java, the walkthrough below takes you through the steps one by one.

First, clone the project and change into it:

git clone git://github.com/mirkonasato/graphipedia.git
cd graphipedia

The pom.xml file you will find there means the project is built with Maven, so install Maven and build it:

sudo apt-get install maven2
mvn install

You will see a long stream of output as the dependencies are downloaded; at the end the build should finish with something like:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Graphipedia Parent .................................... SUCCESS
[INFO] Graphipedia DataImport ................................ SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------

Next, download the Wikipedia dump with wget:

wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

That file is a full 7.6 GB, so let's first try a much smaller one, the Tongan Wikipedia (towiki) dump, and decompress it:

wget http://dumps.wikimedia.org/towiki/latest/towiki-latest-pages-articles.xml.bz2
bzip2 -d towiki-latest-pages-articles.xml.bz2

The import is a two-step process. First, create a smaller intermediate XML file that contains only page titles and links:

java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml

This prints something like:

Parsing pages and extracting links...
..
… pages parsed in … seconds.

Then run the batch import, which writes the content into a graph database directory:

java -Xmx3G -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph towiki-links.xml graph.db

which finishes with output along the lines of:

Importing pages...
..
… pages imported in … seconds.
Importing links...
.....
… links imported in … seconds; … broken links ignored

Open the resulting store directory and have a look at the neostore files:

cd graph.db
ls

Finally, copy this graph.db folder over /neo4j/data/graph.db, overwriting any existing Neo4j database there.
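Once the import has finished, the resulting store can also be opened from code with Neo4j's embedded Java API instead of the shell. The sketch below uses the 1.9-era classes with the Neo4j jars on the classpath; the "title" property, the "Link" relationship type and the example page name are assumptions about graphipedia's import model rather than something stated in this article, so verify them against the version you build.

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.tooling.GlobalGraphOperations;

// Minimal sketch: open the imported store and list the pages a given page
// links to. The "title" property and "Link" relationship type are assumed.
public class ListLinks {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase("graph.db");
        Transaction tx = db.beginTx();
        try {
            for (Node page : GlobalGraphOperations.at(db).getAllNodes()) {
                if ("Tonga".equals(page.getProperty("title", null))) {   // example page name only
                    // Follow outgoing "Link" relationships to the pages this page points to.
                    for (Relationship link : page.getRelationships(
                            Direction.OUTGOING, DynamicRelationshipType.withName("Link"))) {
                        System.out.println(link.getEndNode().getProperty("title", "?"));
                    }
                }
            }
            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}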